Chapter 5 Tidy Data
5.1 What is tidy data anyway?
So far, your manipulation of data has focussed on processes which make use of tidyverse functions, including filtering rows, selecting columns and creating new columns, as well as plotting data on charts. All of the functions you’ve seen have been easy to apply to the data you have.
Unfortunately, in real projects, this often won’t be the case because the data you’ll have often won’t be in tidy format, which is how most tidyverse functions expect to receive data.
Tidy data is a standardised structure of data designed make data cleaning and processing steps easy, and is straightforward even if it seems unusual at first.
You are likely to be most familiar with data laid out in a “messy” structure like the one below:
## # A tibble: 30 × 8
## country continent `1967` `1972` `1977` `1982` `1987` `1992`
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Albania Europe 66.2 67.7 68.9 70.4 72 71.6
## 2 Austria Europe 70.1 70.6 72.2 73.2 74.9 76.0
## 3 Belgium Europe 70.9 71.4 72.8 73.9 75.4 76.5
## 4 Bosnia and Herzegovina Europe 64.8 67.4 69.9 70.7 71.1 72.2
## 5 Bulgaria Europe 70.4 70.9 70.8 71.1 71.3 71.2
## 6 Croatia Europe 68.5 69.6 70.6 70.5 71.5 72.5
## 7 Czech Republic Europe 70.4 70.3 70.7 71.0 71.6 72.4
## 8 Denmark Europe 73.0 73.5 74.7 74.6 74.8 75.3
## 9 Finland Europe 69.8 70.9 72.5 74.6 74.8 75.7
## 10 France Europe 71.6 72.4 73.8 74.9 76.3 77.5
## # ℹ 20 more rows
Here, the data has one row per country and then multiple columns, with one for each year. This may seem relatively logical and quite tidy, but the problems start when you realise that there are multiple other quite logical and tidy ways to lay this data out if there are few restrictions. For example, it could quite easily be shown with one country per column, and a year per row:
## # A tibble: 12 × 143
## year Afghanistan Albania Algeria Angola Argentina Australia Austria Bahrain
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1952 28.8 55.2 43.1 30.0 62.5 69.1 66.8 50.9
## 2 1957 30.3 59.3 45.7 32.0 64.4 70.3 67.5 53.8
## 3 1962 32.0 64.8 48.3 34 65.1 70.9 69.5 56.9
## 4 1967 34.0 66.2 51.4 36.0 65.6 71.1 70.1 59.9
## 5 1972 36.1 67.7 54.5 37.9 67.1 71.9 70.6 63.3
## 6 1977 38.4 68.9 58.0 39.5 68.5 73.5 72.2 65.6
## 7 1982 39.9 70.4 61.4 39.9 69.9 74.7 73.2 69.1
## 8 1987 40.8 72 65.8 39.9 70.8 76.3 74.9 70.8
## 9 1992 41.7 71.6 67.7 40.6 71.9 77.6 76.0 72.6
## 10 1997 41.8 73.0 69.2 41.0 73.3 78.8 77.5 73.9
## 11 2002 42.1 75.7 71.0 41.0 74.3 80.4 79.0 74.8
## 12 2007 43.8 76.4 72.3 42.7 75.3 81.2 79.8 75.6
## # ℹ 134 more variables: Bangladesh <dbl>, Belgium <dbl>, Benin <dbl>,
## # Bolivia <dbl>, `Bosnia and Herzegovina` <dbl>, Botswana <dbl>,
## # Brazil <dbl>, Bulgaria <dbl>, `Burkina Faso` <dbl>, Burundi <dbl>,
## # Cambodia <dbl>, Cameroon <dbl>, Canada <dbl>,
## # `Central African Republic` <dbl>, Chad <dbl>, Chile <dbl>, China <dbl>,
## # Colombia <dbl>, Comoros <dbl>, `Congo, Dem. Rep.` <dbl>,
## # `Congo, Rep.` <dbl>, `Costa Rica` <dbl>, `Cote d'Ivoire` <dbl>, …
Having this kind of wide variety of potential structures means that when you receive data and try to use it, it’s almost impossible to have any kind of standardised approach to this non-standard data structuring. For example, if you want the most recent year of data from an annual dataset, would you need to select
a column or filter
some rows; it would depend entirely on the dataset!
In contrast, tidy data always has the same format:
Every column is a variable.
Every row is an observation.
Every cell is a single value.
This is best explained in a tidy example:
## # A tibble: 10 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
In this dataset, each column is a different variable; years, life expectancy, countries. These are best thought of as different categories that your data can fall into, not values of those categories. For example, year is a variable, but each possible year in that column is a value, and therefore shouldn’t be assigned its own column.
Every row is an observation; a unique combination of the variables included in the data. For example, there is only one observation (row) of data in Afghanistan for 1952, and then a separate one for the same country in 1957.
And importantly, there is only one value in each cell; this seems obvious but is suprisingly often not the case, especially when combined values such as age and sex (25M, 54F) or section and subsection (5b or 8c) are created.
5.2 Tidying messy data
Inevitably, you will often want to use data which is not in tidy format. Luckily the tidyverse
package tidyr
is designed to help with this, and contains a number of functions designed to tidy messy data. The functions you will want to use most frequently from this are the pivot
verbs.
5.2.1 Pivot longer
The pivot_longer
function does exactly what the name suggests, pivots data from a wide format (with many columns and few rows) to a long format (with many rows and few columns). This is very useful when you have data which has values as column names, and you want to gather these together into a single column of multiple values instead.
pay_gap <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-06-28/paygap.csv') %>%
select(employer_name, male_bonus_percent, female_bonus_percent, date_submitted)
This example uses the above dataset; a subsection UK company pay gap data, with male and female bonus percentages displayed in different columns. According to tidy data principles, this data would be easier to use if this data was stored in a single column with an additional variable indicating whether these were male or female bonuses.
Using pivot_longer
we can specify the columns we’re talking about using the cols
argument. This can be done in the same way as in the select
function, either by specifying columns you wish to include or those you want to exclude (using the -
).
There are several ways to use the pivot_longer
function.
##Pivoting data; note that cols listed are the ones we want to include in the pivoting
pay_gap_long <- pay_gap %>%
pivot_longer(cols = c(male_bonus_percent, female_bonus_percent))
##Pivoting data; this returns the same result as the previous code, but the columns listed are excluded from the pivoting, so any not listed are pivoted
pay_gap_long <- pay_gap %>%
pivot_longer(cols = -c(employer_name, date_submitted))
The other two arguments you may want to use are names_to
and values_to
, which allow you to specify the names of the columns you want to put the current column names (male_bonus_percent and female_bonus_percent) and values (the values in the two existing columns) into. If these are not specified, they will default to “name” and “value”, as in the examples above.
5.2.2 Exercise
10:00
- Using the inbuilt dataset
billboard
, pivot the week columns into a single long column. Call the datasetbillboard_long
5.2.2. Solution
#Option 1: Excluding columns not being pivoted
billboard_long <- billboard %>%
pivot_longer(cols = -c(artist, track, date.entered),
names_to = "week",
values_to = "rank")
#Option 2: Specifying columns range explicitly
billboard_long <- billboard %>%
pivot_longer(cols = wk1:w76,
names_to = "week",
values_to = "rank")
#Option 3: Using starts_with()
billboard_long <- billboard %>%
pivot_longer(cols = starts_with("wk"),
names_to = "week",
values_to = "rank")
#Using option 1
billboard_long <- billboard %>%
pivot_longer(cols = -c(artist, track, date.entered),
names_to = "week",
values_to = "rank")
billboard_long
## # A tibble: 24,092 × 5
## artist track date.entered week rank
## <chr> <chr> <date> <chr> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99
## 8 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk8 NA
## 9 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk9 NA
## 10 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk10 NA
## # ℹ 24,082 more rows
5.3 Combining data tidying with data manipulation
The real power in tidy data comes with using it in combination with other dplyr verbs such as filtering, selecting and mutating. For example, in the full pay gap dataset, we can calculate whether there’s a difference in the hourly pay and bonuses by gender for different sizes of company:
pay_gap_summary <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-06-28/paygap.csv') %>%
#Select the columns we're interested in
select(employer_name, diff_mean_hourly_percent, diff_mean_bonus_percent, employer_size) %>%
#Pivot to gather up all of the mean values into one column
pivot_longer(cols = -c(employer_name, employer_size), names_to = "gender", values_to = "percent") %>%
#Group by employer size and variable type
group_by(employer_size, gender) %>%
#Summarise by calculating means
dplyr::summarise(percent = mean(percent, na.rm = TRUE)) %>%
#Remove companies which didn't specify a size
filter(employer_size != "Not Provided")
5.4 Tidy data and plotting
The value of having a tidy dataset is also obvious when you come to plot your data in ggplot. Having a single variable in a column means that you’re able to assign that column to an aesthetic in ggplot, and therefore change the colour, shape, faceting, etc on the basis of that value! This opens the door to creating increasingly complex and informative (and pretty!) charts.
##Plot using the tidy data
ggplot(pay_gap_summary, aes(x = employer_size,
y = percent,
fill = gender))+ #Colours can be assigned to the different data types
geom_col(position = "dodge") #makes bars sit next to eachother
5.4.1 Exercise
15:00
- Using the inbuilt dataset
relig_income
, pivot the income columns into a long dataset. - Assign this to an object called
income_long
- Using the dataset income_long you created, plot a bar chart in ggplot. Use religion as the x axis and the count of people as the y axis.
- Assign the income groupings to the colour aesthetic.
5.4.1. Solution
#Tidy the data using pivot_longer()
income_long <- relig_income %>%
pivot_longer(cols = -religion,
names_to = "income_group",
values_to = "count")
#Plot your data
income_chart <-
ggplot(income_long, aes(x = religion, y = count, fill = income_group)) +
#stat = identity creates a stacked bar chart
geom_bar(stat = "identity") +
labs(x = "Religion", y = "Number of People", fill = "Income Group") +
theme_minimal()
#print() allows you to view the chart in the 'Plots' tab
print(income_chart)
5.5 Congratulations, you are a R coder!
Congratulations! You have reached the end of the tutor-led training session. You are welcome to continue your self-directed learning by completing the two 30-minute exercises in Chapter 6!
Remember to complete the feedback form (it’s all anonymous) to help us improve your learning journey to coding at DfT.
Here are a few useful links:
- Department for Transport R Cookbook (guidebook to useful R topics)
- CRAN Wiki (a Wikipedia page detailing the how, where and what of coding at DfT)
- CRANLanD (our internal website to register for coding courses offered at DfT)
- Crimes Against R (slidepack on how to eliminate big risks you probably have in your code)
Our intermediate R courses:
If you have any questions or want any support in R, do not hesitate to contact us at CRANLanD.