Chapter 5 Tidy Data

5.1 What is tidy data anyway?

So far, your manipulation of data has focussed on processes which make use of tidyverse functions, including filtering rows, selecting columns and creating new columns, as well as plotting data on charts. All of the functions you’ve seen have been easy to apply to the data you have.

Unfortunately, in real projects, this often won’t be the case because the data you’ll have often won’t be in tidy format, which is how most tidyverse functions expect to receive data.

Tidy data is a standardised structure of data designed make data cleaning and processing steps easy, and is straightforward even if it seems unusual at first.

You are likely to be most familiar with data laid out in a “messy” structure like the one below:

## # A tibble: 30 × 8
##    country                continent `1967` `1972` `1977` `1982` `1987` `1992`
##    <fct>                  <fct>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 Albania                Europe      66.2   67.7   68.9   70.4   72     71.6
##  2 Austria                Europe      70.1   70.6   72.2   73.2   74.9   76.0
##  3 Belgium                Europe      70.9   71.4   72.8   73.9   75.4   76.5
##  4 Bosnia and Herzegovina Europe      64.8   67.4   69.9   70.7   71.1   72.2
##  5 Bulgaria               Europe      70.4   70.9   70.8   71.1   71.3   71.2
##  6 Croatia                Europe      68.5   69.6   70.6   70.5   71.5   72.5
##  7 Czech Republic         Europe      70.4   70.3   70.7   71.0   71.6   72.4
##  8 Denmark                Europe      73.0   73.5   74.7   74.6   74.8   75.3
##  9 Finland                Europe      69.8   70.9   72.5   74.6   74.8   75.7
## 10 France                 Europe      71.6   72.4   73.8   74.9   76.3   77.5
## # ℹ 20 more rows

Here, the data has one row per country and then multiple columns, with one for each year. This may seem relatively logical and quite tidy, but the problems start when you realise that there are multiple other quite logical and tidy ways to lay this data out if there are few restrictions. For example, it could quite easily be shown with one country per column, and a year per row:

## # A tibble: 12 × 143
##     year Afghanistan Albania Algeria Angola Argentina Australia Austria Bahrain
##    <int>       <dbl>   <dbl>   <dbl>  <dbl>     <dbl>     <dbl>   <dbl>   <dbl>
##  1  1952        28.8    55.2    43.1   30.0      62.5      69.1    66.8    50.9
##  2  1957        30.3    59.3    45.7   32.0      64.4      70.3    67.5    53.8
##  3  1962        32.0    64.8    48.3   34        65.1      70.9    69.5    56.9
##  4  1967        34.0    66.2    51.4   36.0      65.6      71.1    70.1    59.9
##  5  1972        36.1    67.7    54.5   37.9      67.1      71.9    70.6    63.3
##  6  1977        38.4    68.9    58.0   39.5      68.5      73.5    72.2    65.6
##  7  1982        39.9    70.4    61.4   39.9      69.9      74.7    73.2    69.1
##  8  1987        40.8    72      65.8   39.9      70.8      76.3    74.9    70.8
##  9  1992        41.7    71.6    67.7   40.6      71.9      77.6    76.0    72.6
## 10  1997        41.8    73.0    69.2   41.0      73.3      78.8    77.5    73.9
## 11  2002        42.1    75.7    71.0   41.0      74.3      80.4    79.0    74.8
## 12  2007        43.8    76.4    72.3   42.7      75.3      81.2    79.8    75.6
## # ℹ 134 more variables: Bangladesh <dbl>, Belgium <dbl>, Benin <dbl>,
## #   Bolivia <dbl>, `Bosnia and Herzegovina` <dbl>, Botswana <dbl>,
## #   Brazil <dbl>, Bulgaria <dbl>, `Burkina Faso` <dbl>, Burundi <dbl>,
## #   Cambodia <dbl>, Cameroon <dbl>, Canada <dbl>,
## #   `Central African Republic` <dbl>, Chad <dbl>, Chile <dbl>, China <dbl>,
## #   Colombia <dbl>, Comoros <dbl>, `Congo, Dem. Rep.` <dbl>,
## #   `Congo, Rep.` <dbl>, `Costa Rica` <dbl>, `Cote d'Ivoire` <dbl>, …

Having this kind of wide variety of potential structures means that when you receive data and try to use it, it’s almost impossible to have any kind of standardised approach to this non-standard data structuring. For example, if you want the most recent year of data from an annual dataset, would you need to select a column or filter some rows; it would depend entirely on the dataset!

In contrast, tidy data always has the same format:

  • Every column is a variable.

  • Every row is an observation.

  • Every cell is a single value.

This is best explained in a tidy example:

## # A tibble: 10 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.

In this dataset, each column is a different variable; years, life expectancy, countries. These are best thought of as different categories that your data can fall into, not values of those categories. For example, year is a variable, but each possible year in that column is a value, and therefore shouldn’t be assigned its own column.

Every row is an observation; a unique combination of the variables included in the data. For example, there is only one observation (row) of data in Afghanistan for 1952, and then a separate one for the same country in 1957.

And importantly, there is only one value in each cell; this seems obvious but is suprisingly often not the case, especially when combined values such as age and sex (25M, 54F) or section and subsection (5b or 8c) are created.

5.2 Tidying messy data

Inevitably, you will often want to use data which is not in tidy format. Luckily the tidyverse package tidyr is designed to help with this, and contains a number of functions designed to tidy messy data. The functions you will want to use most frequently from this are the pivot verbs.

5.2.1 Pivot longer

The pivot_longer function does exactly what the name suggests, pivots data from a wide format (with many columns and few rows) to a long format (with many rows and few columns). This is very useful when you have data which has values as column names, and you want to gather these together into a single column of multiple values instead.

pay_gap <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-06-28/paygap.csv') %>%
  select(employer_name, male_bonus_percent, female_bonus_percent,  date_submitted) 

This example uses the above dataset; a subsection UK company pay gap data, with male and female bonus percentages displayed in different columns. According to tidy data principles, this data would be easier to use if this data was stored in a single column with an additional variable indicating whether these were male or female bonuses.

Using pivot_longer we can specify the columns we’re talking about using the cols argument. This can be done in the same way as in the select function, either by specifying columns you wish to include or those you want to exclude (using the -).

There are several ways to use the pivot_longer function.

##Pivoting data; note that cols listed are the ones we want to include in the pivoting
pay_gap_long <- pay_gap %>%
  pivot_longer(cols = c(male_bonus_percent, female_bonus_percent))
##Pivoting data; this returns the same result as the previous code, but the columns listed are excluded from the pivoting, so any not listed are pivoted
pay_gap_long <- pay_gap %>%
  pivot_longer(cols = -c(employer_name,  date_submitted))

The other two arguments you may want to use are names_to and values_to, which allow you to specify the names of the columns you want to put the current column names (male_bonus_percent and female_bonus_percent) and values (the values in the two existing columns) into. If these are not specified, they will default to “name” and “value”, as in the examples above.

pay_gap_long <- pay_gap %>%
  pivot_longer(cols = c(male_bonus_percent, female_bonus_percent),
                      names_to = "gender",
                      values_to = "percent")

5.2.2 Exercise

10:00

  1. Using the inbuilt dataset billboard, pivot the week columns into a single long column. Call the dataset billboard_long
5.2.2. Solution


#Option 1: Excluding columns not being pivoted
billboard_long <- billboard %>%
  pivot_longer(cols = -c(artist, track, date.entered),
               names_to = "week",
               values_to = "rank")

#Option 2: Specifying columns range explicitly
billboard_long <- billboard %>%
  pivot_longer(cols = wk1:w76,
               names_to = "week",
               values_to = "rank")

#Option 3: Using starts_with()
billboard_long <- billboard %>%
  pivot_longer(cols = starts_with("wk"), 
               names_to = "week",
               values_to = "rank")
#Using option 1
billboard_long <- billboard %>%
  pivot_longer(cols = -c(artist, track, date.entered),
               names_to = "week",
               values_to = "rank")

billboard_long
## # A tibble: 24,092 × 5
##    artist track                   date.entered week   rank
##    <chr>  <chr>                   <date>       <chr> <dbl>
##  1 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk1      87
##  2 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk2      82
##  3 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk3      72
##  4 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk4      77
##  5 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk5      87
##  6 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk6      94
##  7 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk7      99
##  8 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk8      NA
##  9 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk9      NA
## 10 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk10     NA
## # ℹ 24,082 more rows


5.3 Combining data tidying with data manipulation

The real power in tidy data comes with using it in combination with other dplyr verbs such as filtering, selecting and mutating. For example, in the full pay gap dataset, we can calculate whether there’s a difference in the hourly pay and bonuses by gender for different sizes of company:

pay_gap_summary <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-06-28/paygap.csv') %>%
  #Select the columns we're interested in
  select(employer_name, diff_mean_hourly_percent, diff_mean_bonus_percent, employer_size) %>%
  #Pivot to gather up all of the mean values into one column
  pivot_longer(cols = -c(employer_name, employer_size), names_to = "gender", values_to = "percent") %>%
  #Group by employer size and variable type
  group_by(employer_size, gender) %>%
  #Summarise by calculating means
  dplyr::summarise(percent = mean(percent, na.rm = TRUE)) %>%
  #Remove companies which didn't specify a size
  filter(employer_size != "Not Provided")

5.4 Tidy data and plotting

The value of having a tidy dataset is also obvious when you come to plot your data in ggplot. Having a single variable in a column means that you’re able to assign that column to an aesthetic in ggplot, and therefore change the colour, shape, faceting, etc on the basis of that value! This opens the door to creating increasingly complex and informative (and pretty!) charts.

##Plot using the tidy data
ggplot(pay_gap_summary, aes(x = employer_size, 
                            y = percent, 
                            fill = gender))+ #Colours can be assigned to the different data types
  geom_col(position = "dodge") #makes bars sit next to eachother

5.4.1 Exercise

15:00

  1. Using the inbuilt dataset relig_income, pivot the income columns into a long dataset.
  2. Assign this to an object called income_long
  3. Using the dataset income_long you created, plot a bar chart in ggplot. Use religion as the x axis and the count of people as the y axis.
  4. Assign the income groupings to the colour aesthetic.
5.4.1. Solution


#Tidy the data using pivot_longer()
income_long <- relig_income %>%
  pivot_longer(cols = -religion,
               names_to = "income_group",
               values_to = "count")

#Plot your data 
income_chart <- 
  ggplot(income_long, aes(x = religion, y = count, fill = income_group)) +
  #stat = identity creates a stacked bar chart
  geom_bar(stat = "identity") +
  labs(x = "Religion", y = "Number of People", fill = "Income Group") +
  theme_minimal()

#print() allows you to view the chart in the 'Plots' tab
print(income_chart)


5.5 Congratulations, you are a R coder!

Congratulations! You have reached the end of the tutor-led training session. You are welcome to continue your self-directed learning by completing the two 30-minute exercises in Chapter 6!

Remember to complete the feedback form (it’s all anonymous) to help us improve your learning journey to coding at DfT.

Here are a few useful links:

Our intermediate R courses:

If you have any questions or want any support in R, do not hesitate to contact us at CRANLanD.