Chapter 2 How to make tidy data?

When it comes to tidying and transforming your data, there is one package that stands out –tidyr::. Within this package, there are two workhorse functions that will do the bulk of the heavy lifting for you. These functions are designed to make reshaping and organising your data straightforward and efficient.

Alongside these, we will also explore a couple of other useful functions to help make your data cleaner and easier to work with. Together, these tools will help you prepare your data for analysis with minimal fuss.

2.1 Pivot longer

You have already seen a couple of examples of pivoting data to make it longer. Long data is what you will want most of the time; it is easier to filter, group and summarise, easier to apply the same functions to lots of columns at the same time, and easier to plot on charts.

The pivot_longer() function does exactly what the name suggests, pivots data from a wide format (with many columns and few rows) to a long format (with many rows and few columns). This is very useful when you have data which has values as column names, and you want to gather these together into a single column of multiple values instead. It takes a lot of arguments, but you will probably only want to use a maximum of 4.

Arguments:

  • data: this is the dataset or data frame you want to pivot
  • cols: the columns you want to combine into a longer format, done in the same way as in the select function; specifying columns you wish to include or those you want to exclude (using the -)
  • names_to: the names of all the columns you will be pivoting will be combined into a single labelling column. Unless you specify an alternative name, by default the column name will be called ‘name’.
  • values_to: all the values in the columns you will be pivoting will be combined into a single value column. Unless you specify an alternative name, by default the column name will be called “value”

Example:

construction_long_data <- 
  tidyr::pivot_longer(
    #data
    data = tidyr::construction,
    #choosing the columns to pivot
    cols = c(Northeast, Midwest, South, West),
    #put column names in to a single column called "region"
    names_to = "region",
    #put values in a single column called "value"
    values_to = "value"
  )
Year Month 1 unit 2 to 4 units 5 units or more region value
2018 January 859 NA 348 Northeast 114
2018 January 859 NA 348 Midwest 169
2018 January 859 NA 348 South 596
2018 January 859 NA 348 West 339
2018 February 882 NA 400 Northeast 138
2018 February 882 NA 400 Midwest 160
2018 February 882 NA 400 South 655
2018 February 882 NA 400 West 336
2018 March 862 NA 356 Northeast 150
2018 March 862 NA 356 Midwest 154

2.1.1 Exercise

10:00

Go to taxi_exercise.R in the training repository. Let us try some pivoting of our own with the taxi data. This is a nice table but it is not very R friendly if we want to plot a chart of taxi volumes over time.

You need to install and load the necessary package to read in the taxi0101.xlsx file.

  1. Use either readxl::read_excel() or openxlsx::read.xlsx() to read in the taxi data from the Data folder. Assign the result to taxis. (Hint: Make sure to check the arguments for each function, as there are some differences between the two packages.)
  2. Aside from the year column, the data shows counts of different types of vehicles. To tidy it, you should create one column for years, one for vehicle types, and one for their values. Use the pivot_longer() function to do this and assign the result to taxi_long.
Hint


Your final table should look like this:

## # A tibble: 6 × 3
##   Year  name                                       value
##   <chr> <chr>                                      <chr>
## 1 1965  London Licensed taxis (thousands) [note 1] 7.29 
## 2 1965  England Licensed taxis (thousands)         [w]  
## 3 1965  London Licensed PHVs (thousands) [note 1]  [w]  
## 4 1965  England Licensed PHVs (thousands)          [w]  
## 5 1965  London Total vehicles (thousands) [note 1] [w]  
## 6 1965  England Total vehicles (thousands)         [w]


  1. The taxi data uses [w] to represent NA values, so all the data columns have been read in as text. Now that these values are in a single column, convert them to numeric (using the base R as.numeric() function).
  2. The column names you have pivoted into one column are not the easiest for R to read. Convert them to lower case (using the base R tolower() function).
2.1.1 Solution


taxis <- readxl::read_excel("Data/taxi0101.xlsx", sheet = 3, skip = 7)

taxis_long <- taxis %>%
  tidyr::pivot_longer(cols =  -Year) %>%
  dplyr::mutate(value = as.numeric(value)) %>%
  dplyr::mutate(name = tolower(name))

taxis_long
## # A tibble: 324 × 3
##    Year  name                                       value
##    <chr> <chr>                                      <dbl>
##  1 1965  london licensed taxis (thousands) [note 1]  7.29
##  2 1965  england licensed taxis (thousands)         NA   
##  3 1965  london licensed phvs (thousands) [note 1]  NA   
##  4 1965  england licensed phvs (thousands)          NA   
##  5 1965  london total vehicles (thousands) [note 1] NA   
##  6 1965  england total vehicles (thousands)         NA   
##  7 1970  london licensed taxis (thousands) [note 1]  8.65
##  8 1970  england licensed taxis (thousands)         NA   
##  9 1970  london licensed phvs (thousands) [note 1]  NA   
## 10 1970  england licensed phvs (thousands)          NA   
## # ℹ 314 more rows

The solution code provided is just one way to approach this exercise, and your code might look different. As long as you have successfully used the pivot_longer() function to achieve the desired result, you have met the main goal of the exercise.


2.2 Separate

The separate() function is the other common function you will want to use from the tidyr:: package, and is designed to take a single column and split it into two or more other columns based on a pattern (e.g. a single character or symbol).

This is a commonly used function because, more often than you might expect, column titles can contain more than one piece of information (for example, a column named rail_2020 or road_england). When these column names are reorganised into a tidy format, you can end up with multiple pieces of information in the same cell.

An example of this can be seen in the world_bank_pop dataset:

world_bank_data <- 
  tidyr::world_bank_pop %>%
  tidyr::pivot_longer(cols = -c(country, indicator),
                      names_to = "year")
country indicator year value
ABW SP.URB.TOTL 2000 41625
ABW SP.URB.TOTL 2001 42025
ABW SP.URB.TOTL 2002 42194
ABW SP.URB.TOTL 2003 42277
ABW SP.URB.TOTL 2004 42317
ABW SP.URB.TOTL 2005 42399

Notice how the indicator column contains multiple pieces of information; SP is a prefix for all values, POP or URB indicates whether the value is for either total population or urban population, and TOTL or GROW indicates whether it is total population or population growth. This is fine, but makes it harder to select e.g. all urban population values only, or all growth values only. It would be much more simple if this information was in two different columns.

We can split this information into three separate columns using separate(). The arguments for this function are a little more complex than for pivot_longer():

  • col: the columns we are going to split into multiple columns
  • into: the names of the new columns we are going to create from this column (passed as strings)
  • sep: the symbol you want to use as the separation point between the data in the columns. This works using regular expressions so you need to be careful how you provide this value, usually prefixed with \

There are also some arguments you can leave as default for now but may want to change later, these include:

  • remove: defaults to TRUE, lets you decide whether or not to remove the original column
  • extra: tells the function what to do with any extra content you have not provided a column name for. By default it will tell you this exists, but you can specify to either “drop” this without warning, or “merge” it into the final column so you do not lose it

Going back to the world_bank_pop dataset:

world_bank_separate <- 
  tidyr::world_bank_pop %>%
  #get the data into the tidy format
  tidyr::pivot_longer(cols = -c(country, indicator),
                      names_to = "year") %>%
  #separate out the indicator column
  tidyr::separate(col = indicator,
  #separate it into three columns called sp, pop and measure                
                  into = c("sp", "pop", "measure"),
  #separate on the dots in the name (using "\\" before you referene the ".")
                  sep = "\\.")
country sp pop measure year value
ABW SP URB TOTL 2000 41625
ABW SP URB TOTL 2001 42025
ABW SP URB TOTL 2002 42194
ABW SP URB TOTL 2003 42277
ABW SP URB TOTL 2004 42317
ABW SP URB TOTL 2005 42399
ABW SP URB TOTL 2006 42555
ABW SP URB TOTL 2007 42729
ABW SP URB TOTL 2008 42906
ABW SP URB TOTL 2009 43079

It should now be obvious how you could filter or summarise this data more easily, referencing the pop or measure columns individually.

2.2.1 Exercise

10:00

Using your dataset: taxi_long created in Exercise 2.1.1, complete the following tasks in the taxi_exercise.R:

Similar to the World Bank data, the taxi data contains multiple pieces of information in the name column: the geographic area, the type of vehicle, and some notes.

  1. Use the separate() function to split this column into two new columns based on the . symbol. To ensure no information is lost, set the extra argument to “merge”. Name the new columns region and vehicle_type. Call this data frame taxi_separate
2.2.1 Solution Q1


taxi_separate <- taxis_long %>% 
  tidyr::separate(col = name, 
                  into = c("region", "vehicle_type"), 
                  extra = "merge", 
                  sep = " ")

taxi_separate
## # A tibble: 324 × 4
##    Year  region  vehicle_type                        value
##    <chr> <chr>   <chr>                               <dbl>
##  1 1965  london  licensed taxis (thousands) [note 1]  7.29
##  2 1965  england licensed taxis (thousands)          NA   
##  3 1965  london  licensed phvs (thousands) [note 1]  NA   
##  4 1965  england licensed phvs (thousands)           NA   
##  5 1965  london  total vehicles (thousands) [note 1] NA   
##  6 1965  england total vehicles (thousands)          NA   
##  7 1970  london  licensed taxis (thousands) [note 1]  8.65
##  8 1970  england licensed taxis (thousands)          NA   
##  9 1970  london  licensed phvs (thousands) [note 1]  NA   
## 10 1970  england licensed phvs (thousands)           NA   
## # ℹ 314 more rows


We can refine our use of the separate function to produce just two clean columns for vehicle and region data.

  1. Modify your code from Question 1 to create three new columns (one of which you can discard) and adjust the extra argument accordingly.
2.2.1 Solution Q2


taxi_separate <- taxis_long %>% 
  tidyr::separate(col = name, 
                  into = c("region", "licence", "vehicle_type"), 
                  extra = "drop", 
                  sep = " ")

taxi_separate
## # A tibble: 324 × 5
##    Year  region  licence  vehicle_type value
##    <chr> <chr>   <chr>    <chr>        <dbl>
##  1 1965  london  licensed taxis         7.29
##  2 1965  england licensed taxis        NA   
##  3 1965  london  licensed phvs         NA   
##  4 1965  england licensed phvs         NA   
##  5 1965  london  total    vehicles     NA   
##  6 1965  england total    vehicles     NA   
##  7 1970  london  licensed taxis         8.65
##  8 1970  england licensed taxis        NA   
##  9 1970  london  licensed phvs         NA   
## 10 1970  england licensed phvs         NA   
## # ℹ 314 more rows


  1. Can you filter the dataset to only include the PHV rows (using the dplyr::filter() function? Call this phv_data.
2.2.1 Solution Q3


phv_data <- taxi_separate %>%
  dplyr::filter(vehicle_type == "phvs")

phv_data
## # A tibble: 108 × 5
##    Year  region  licence  vehicle_type value
##    <chr> <chr>   <chr>    <chr>        <dbl>
##  1 1965  london  licensed phvs            NA
##  2 1965  england licensed phvs            NA
##  3 1970  london  licensed phvs            NA
##  4 1970  england licensed phvs            NA
##  5 1971  london  licensed phvs            NA
##  6 1971  england licensed phvs            NA
##  7 1972  london  licensed phvs            NA
##  8 1972  england licensed phvs            NA
##  9 1973  london  licensed phvs            NA
## 10 1973  england licensed phvs            NA
## # ℹ 98 more rows


2.3 Pivot wider

The opposite of pivoting longer is the function pivot_wider(). This takes long tidy data and spreads it back out, taking the column names from one column and the values from the other. This is useful as a final step if you want to present data in a human-readable format, but is also useful for some aspects of data processing such as mutating across different variables.

Like pivot_longer(), in pivot_wider() you can specify the arguments names_from and values_from to assign the columns that the column names for the new spread data will be taken from, and the column the associated values will be taken from. If you do not specify these, it will default to “name” and “value” once again. These arguments also specify the columns the function works over, so in this case there is no need to specify a cols argument.

Example:

fish_data <- 
  tidyr::fish_encounters %>%
  tidyr::pivot_wider(
    #take column names from this column
    names_from = station,
    #take values from this column
    values_from = seen)
fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE MAW
4842 1 1 1 1 1 1 1 1 1 1 1
4843 1 1 1 1 1 1 1 1 1 1 1
4844 1 1 1 1 1 1 1 1 1 1 1
4845 1 1 1 1 1 NA NA NA NA NA NA
4847 1 1 1 NA NA NA NA NA NA NA NA
4848 1 1 1 1 NA NA NA NA NA NA NA
4849 1 1 NA NA NA NA NA NA NA NA NA
4850 1 1 NA 1 1 1 1 NA NA NA NA
4851 1 1 NA NA NA NA NA NA NA NA NA
4854 1 1 NA NA NA NA NA NA NA NA NA
4855 1 1 1 1 1 NA NA NA NA NA NA
4857 1 1 1 1 1 1 1 1 1 NA NA
4858 1 1 1 1 1 1 1 1 1 1 1
4859 1 1 1 1 1 NA NA NA NA NA NA
4861 1 1 1 1 1 1 1 1 1 1 1
4862 1 1 1 1 1 1 1 1 1 NA NA
4863 1 1 NA NA NA NA NA NA NA NA NA
4864 1 1 NA NA NA NA NA NA NA NA NA
4865 1 1 1 NA NA NA NA NA NA NA NA

2.3.1 Exercise

10:00

Still using the taxi data (i.e. taxi_separate from Q3 of Exercise 2.2.1), now we have got our data separated out into nice clean columns, let us make use of the pivot wider function to create some new columns with additional data in them.

  1. Pivot the data back out into columns with one column for each region type
  2. Subtract the London column from the England column to create a column called “england_outside_london” (using dplyr::mutate()). Call this taxis_wide.
2.3.1 Solution Q1 & Q2


taxis_wide <- taxi_separate %>%
  tidyr::pivot_wider(names_from = region, values_from = value) %>%
  dplyr::mutate(england_outside_london = england - london)

taxis_wide
## # A tibble: 162 × 6
##    Year  licence  vehicle_type london england england_outside_london
##    <chr> <chr>    <chr>         <dbl>   <dbl>                  <dbl>
##  1 1965  licensed taxis          7.29      NA                     NA
##  2 1965  licensed phvs          NA         NA                     NA
##  3 1965  total    vehicles      NA         NA                     NA
##  4 1970  licensed taxis          8.65      NA                     NA
##  5 1970  licensed phvs          NA         NA                     NA
##  6 1970  total    vehicles      NA         NA                     NA
##  7 1971  licensed taxis          9.59      NA                     NA
##  8 1971  licensed phvs          NA         NA                     NA
##  9 1971  total    vehicles      NA         NA                     NA
## 10 1972  licensed taxis         10.1       NA                     NA
## # ℹ 152 more rows


  1. Pivot the data back into a long format ready for plotting a chart. Call this taxis_final
2.3.1 Solution Q3


taxis_final <- taxis_wide %>%
   tidyr::pivot_longer(cols = c(london, england, england_outside_london),
                       names_to = "region", 
                       values_to = "value")

taxis_final
## # A tibble: 486 × 5
##    Year  licence  vehicle_type region                 value
##    <chr> <chr>    <chr>        <chr>                  <dbl>
##  1 1965  licensed taxis        london                  7.29
##  2 1965  licensed taxis        england                NA   
##  3 1965  licensed taxis        england_outside_london NA   
##  4 1965  licensed phvs         london                 NA   
##  5 1965  licensed phvs         england                NA   
##  6 1965  licensed phvs         england_outside_london NA   
##  7 1965  total    vehicles     london                 NA   
##  8 1965  total    vehicles     england                NA   
##  9 1965  total    vehicles     england_outside_london NA   
## 10 1970  licensed taxis        london                  8.65
## # ℹ 476 more rows


  1. Create a nicer facet ggplot chart of PHV, taxi and all vehicle numbers for England outside of London. Similar to the one below:
taxis_final %>% 
  dplyr::mutate(Year = as.numeric(Year)) %>%
  na.omit() %>%
  ggplot(aes(x = Year, y = value, colour = vehicle_type, group = vehicle_type))+
  geom_line()+
  facet_wrap(~region)

2.3.1 Solution Q4


taxis_final %>% 
  dplyr::mutate(Year = as.numeric(Year)) %>%
  na.omit() %>%
  ggplot(aes(x = Year, y = value, colour = vehicle_type, group = vehicle_type))+
  geom_line()+
  facet_wrap(~region) +
  labs(title = "Taxi volumes by vehicle types", 
       x = "",
       y = "") +
  theme_minimal()