Chapter 2 How to make tidy data?
When it comes to tidying and transforming your data, there is one package that stands out –tidyr::
. Within this package, there are two workhorse functions that will do the bulk of the heavy lifting for you. These functions are designed to make reshaping and organising your data straightforward and efficient.
Alongside these, we will also explore a couple of other useful functions to help make your data cleaner and easier to work with. Together, these tools will help you prepare your data for analysis with minimal fuss.
2.1 Pivot longer
You have already seen a couple of examples of pivoting data to make it longer. Long data is what you will want most of the time; it is easier to filter, group and summarise, easier to apply the same functions to lots of columns at the same time, and easier to plot on charts.
The pivot_longer()
function does exactly what the name suggests, pivots data from a wide format (with many columns and few rows) to a long format (with many rows and few columns). This is very useful when you have data which has values as column names, and you want to gather these together into a single column of multiple values instead. It takes a lot of arguments, but you will probably only want to use a maximum of 4.
Arguments:
data
: this is the dataset or data frame you want to pivotcols
: the columns you want to combine into a longer format, done in the same way as in the select function; specifying columns you wish to include or those you want to exclude (using the-
)names_to
: the names of all the columns you will be pivoting will be combined into a single labelling column. Unless you specify an alternative name, by default the column name will be called ‘name’.values_to
: all the values in the columns you will be pivoting will be combined into a single value column. Unless you specify an alternative name, by default the column name will be called “value”
Example:
construction_long_data <-
tidyr::pivot_longer(
#data
data = tidyr::construction,
#choosing the columns to pivot
cols = c(Northeast, Midwest, South, West),
#put column names in to a single column called "region"
names_to = "region",
#put values in a single column called "value"
values_to = "value"
)
Year | Month | 1 unit | 2 to 4 units | 5 units or more | region | value |
---|---|---|---|---|---|---|
2018 | January | 859 | NA | 348 | Northeast | 114 |
2018 | January | 859 | NA | 348 | Midwest | 169 |
2018 | January | 859 | NA | 348 | South | 596 |
2018 | January | 859 | NA | 348 | West | 339 |
2018 | February | 882 | NA | 400 | Northeast | 138 |
2018 | February | 882 | NA | 400 | Midwest | 160 |
2018 | February | 882 | NA | 400 | South | 655 |
2018 | February | 882 | NA | 400 | West | 336 |
2018 | March | 862 | NA | 356 | Northeast | 150 |
2018 | March | 862 | NA | 356 | Midwest | 154 |
2.1.1 Exercise
10:00
Go to taxi_exercise.R
in the training repository. Let us try some pivoting of our own with the taxi data. This is a nice table but it is not very R friendly if we want to plot a chart of taxi volumes over time.
You need to install and load the necessary package to read in the taxi0101.xlsx
file.
- Use either
readxl::read_excel()
oropenxlsx::read.xlsx()
to read in the taxi data from theData
folder. Assign the result totaxis
. (Hint: Make sure to check the arguments for each function, as there are some differences between the two packages.) - Aside from the year column, the data shows counts of different types of vehicles. To tidy it, you should create one column for years, one for vehicle types, and one for their values. Use the
pivot_longer()
function to do this and assign the result totaxi_long
.
Hint
Your final table should look like this:
## # A tibble: 6 × 3
## Year name value
## <chr> <chr> <chr>
## 1 1965 London Licensed taxis (thousands) [note 1] 7.29
## 2 1965 England Licensed taxis (thousands) [w]
## 3 1965 London Licensed PHVs (thousands) [note 1] [w]
## 4 1965 England Licensed PHVs (thousands) [w]
## 5 1965 London Total vehicles (thousands) [note 1] [w]
## 6 1965 England Total vehicles (thousands) [w]
- The taxi data uses
[w]
to representNA
values, so all the data columns have been read in as text. Now that these values are in a single column, convert them to numeric (using the base Ras.numeric()
function). - The column names you have pivoted into one column are not the easiest for R to read. Convert them to lower case (using the base R
tolower()
function).
2.1.1 Solution
taxis <- readxl::read_excel("Data/taxi0101.xlsx", sheet = 3, skip = 7)
taxis_long <- taxis %>%
tidyr::pivot_longer(cols = -Year) %>%
dplyr::mutate(value = as.numeric(value)) %>%
dplyr::mutate(name = tolower(name))
taxis_long
## # A tibble: 324 × 3
## Year name value
## <chr> <chr> <dbl>
## 1 1965 london licensed taxis (thousands) [note 1] 7.29
## 2 1965 england licensed taxis (thousands) NA
## 3 1965 london licensed phvs (thousands) [note 1] NA
## 4 1965 england licensed phvs (thousands) NA
## 5 1965 london total vehicles (thousands) [note 1] NA
## 6 1965 england total vehicles (thousands) NA
## 7 1970 london licensed taxis (thousands) [note 1] 8.65
## 8 1970 england licensed taxis (thousands) NA
## 9 1970 london licensed phvs (thousands) [note 1] NA
## 10 1970 england licensed phvs (thousands) NA
## # ℹ 314 more rows
The solution code provided is just one way to approach this exercise, and your code might look different. As long as you have successfully used the pivot_longer()
function to achieve the desired result, you have met the main goal of the exercise.
2.2 Separate
The separate()
function is the other common function you will want to use from the tidyr::
package, and is designed to take a single column and split it into two or more other columns based on a pattern (e.g. a single character or symbol).
This is a commonly used function because, more often than you might expect, column titles can contain more than one piece of information (for example, a column named rail_2020
or road_england
). When these column names are reorganised into a tidy format, you can end up with multiple pieces of information in the same cell.
An example of this can be seen in the world_bank_pop dataset:
world_bank_data <-
tidyr::world_bank_pop %>%
tidyr::pivot_longer(cols = -c(country, indicator),
names_to = "year")
country | indicator | year | value |
---|---|---|---|
ABW | SP.URB.TOTL | 2000 | 41625 |
ABW | SP.URB.TOTL | 2001 | 42025 |
ABW | SP.URB.TOTL | 2002 | 42194 |
ABW | SP.URB.TOTL | 2003 | 42277 |
ABW | SP.URB.TOTL | 2004 | 42317 |
ABW | SP.URB.TOTL | 2005 | 42399 |
Notice how the indicator
column contains multiple pieces of information; SP is a prefix for all values, POP or URB indicates whether the value is for either total population or urban population, and TOTL or GROW indicates whether it is total population or population growth. This is fine, but makes it harder to select e.g. all urban population values only, or all growth values only. It would be much more simple if this information was in two different columns.
We can split this information into three separate columns using separate()
. The arguments for this function are a little more complex than for pivot_longer()
:
col
: the columns we are going to split into multiple columnsinto
: the names of the new columns we are going to create from this column (passed as strings)sep
: the symbol you want to use as the separation point between the data in the columns. This works using regular expressions so you need to be careful how you provide this value, usually prefixed with\
There are also some arguments you can leave as default for now but may want to change later, these include:
remove
: defaults toTRUE
, lets you decide whether or not to remove the original columnextra
: tells the function what to do with any extra content you have not provided a column name for. By default it will tell you this exists, but you can specify to either “drop
” this without warning, or “merge
” it into the final column so you do not lose it
Going back to the world_bank_pop
dataset:
world_bank_separate <-
tidyr::world_bank_pop %>%
#get the data into the tidy format
tidyr::pivot_longer(cols = -c(country, indicator),
names_to = "year") %>%
#separate out the indicator column
tidyr::separate(col = indicator,
#separate it into three columns called sp, pop and measure
into = c("sp", "pop", "measure"),
#separate on the dots in the name (using "\\" before you referene the ".")
sep = "\\.")
country | sp | pop | measure | year | value |
---|---|---|---|---|---|
ABW | SP | URB | TOTL | 2000 | 41625 |
ABW | SP | URB | TOTL | 2001 | 42025 |
ABW | SP | URB | TOTL | 2002 | 42194 |
ABW | SP | URB | TOTL | 2003 | 42277 |
ABW | SP | URB | TOTL | 2004 | 42317 |
ABW | SP | URB | TOTL | 2005 | 42399 |
ABW | SP | URB | TOTL | 2006 | 42555 |
ABW | SP | URB | TOTL | 2007 | 42729 |
ABW | SP | URB | TOTL | 2008 | 42906 |
ABW | SP | URB | TOTL | 2009 | 43079 |
It should now be obvious how you could filter or summarise this data more easily, referencing the pop or measure columns individually.
2.2.1 Exercise
10:00
Using your dataset: taxi_long
created in Exercise 2.1.1, complete the following tasks in the taxi_exercise.R
:
Similar to the World Bank data, the taxi data contains multiple pieces of information in the name
column: the geographic area, the type of vehicle, and some notes.
- Use the
separate()
function to split this column into two new columns based on the.
symbol. To ensure no information is lost, set the extra argument to “merge
”. Name the new columnsregion
andvehicle_type
. Call this data frametaxi_separate
2.2.1 Solution Q1
taxi_separate <- taxis_long %>%
tidyr::separate(col = name,
into = c("region", "vehicle_type"),
extra = "merge",
sep = " ")
taxi_separate
## # A tibble: 324 × 4
## Year region vehicle_type value
## <chr> <chr> <chr> <dbl>
## 1 1965 london licensed taxis (thousands) [note 1] 7.29
## 2 1965 england licensed taxis (thousands) NA
## 3 1965 london licensed phvs (thousands) [note 1] NA
## 4 1965 england licensed phvs (thousands) NA
## 5 1965 london total vehicles (thousands) [note 1] NA
## 6 1965 england total vehicles (thousands) NA
## 7 1970 london licensed taxis (thousands) [note 1] 8.65
## 8 1970 england licensed taxis (thousands) NA
## 9 1970 london licensed phvs (thousands) [note 1] NA
## 10 1970 england licensed phvs (thousands) NA
## # ℹ 314 more rows
We can refine our use of the separate function to produce just two clean columns for vehicle and region data.
- Modify your code from Question 1 to create three new columns (one of which you can discard) and adjust the
extra
argument accordingly.
2.2.1 Solution Q2
taxi_separate <- taxis_long %>%
tidyr::separate(col = name,
into = c("region", "licence", "vehicle_type"),
extra = "drop",
sep = " ")
taxi_separate
## # A tibble: 324 × 5
## Year region licence vehicle_type value
## <chr> <chr> <chr> <chr> <dbl>
## 1 1965 london licensed taxis 7.29
## 2 1965 england licensed taxis NA
## 3 1965 london licensed phvs NA
## 4 1965 england licensed phvs NA
## 5 1965 london total vehicles NA
## 6 1965 england total vehicles NA
## 7 1970 london licensed taxis 8.65
## 8 1970 england licensed taxis NA
## 9 1970 london licensed phvs NA
## 10 1970 england licensed phvs NA
## # ℹ 314 more rows
- Can you filter the dataset to only include the PHV rows (using the
dplyr::filter()
function? Call thisphv_data
.
2.2.1 Solution Q3
## # A tibble: 108 × 5
## Year region licence vehicle_type value
## <chr> <chr> <chr> <chr> <dbl>
## 1 1965 london licensed phvs NA
## 2 1965 england licensed phvs NA
## 3 1970 london licensed phvs NA
## 4 1970 england licensed phvs NA
## 5 1971 london licensed phvs NA
## 6 1971 england licensed phvs NA
## 7 1972 london licensed phvs NA
## 8 1972 england licensed phvs NA
## 9 1973 london licensed phvs NA
## 10 1973 england licensed phvs NA
## # ℹ 98 more rows
2.3 Pivot wider
The opposite of pivoting longer is the function pivot_wider()
. This takes long tidy data and spreads it back out, taking the column names from one column and the values from the other. This is useful as a final step if you want to present data in a human-readable format, but is also useful for some aspects of data processing such as mutating across different variables.
Like pivot_longer()
, in pivot_wider()
you can specify the arguments names_from
and values_from
to assign the columns that the column names for the new spread data will be taken from, and the column the associated values will be taken from. If you do not specify these, it will default to “name
” and “value
” once again. These arguments also specify the columns the function works over, so in this case there is no need to specify a cols
argument.
Example:
fish_data <-
tidyr::fish_encounters %>%
tidyr::pivot_wider(
#take column names from this column
names_from = station,
#take values from this column
values_from = seen)
fish | Release | I80_1 | Lisbon | Rstr | Base_TD | BCE | BCW | BCE2 | BCW2 | MAE | MAW |
---|---|---|---|---|---|---|---|---|---|---|---|
4842 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
4843 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
4844 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
4845 | 1 | 1 | 1 | 1 | 1 | NA | NA | NA | NA | NA | NA |
4847 | 1 | 1 | 1 | NA | NA | NA | NA | NA | NA | NA | NA |
4848 | 1 | 1 | 1 | 1 | NA | NA | NA | NA | NA | NA | NA |
4849 | 1 | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
4850 | 1 | 1 | NA | 1 | 1 | 1 | 1 | NA | NA | NA | NA |
4851 | 1 | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
4854 | 1 | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
4855 | 1 | 1 | 1 | 1 | 1 | NA | NA | NA | NA | NA | NA |
4857 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | NA | NA |
4858 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
4859 | 1 | 1 | 1 | 1 | 1 | NA | NA | NA | NA | NA | NA |
4861 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
4862 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | NA | NA |
4863 | 1 | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
4864 | 1 | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
4865 | 1 | 1 | 1 | NA | NA | NA | NA | NA | NA | NA | NA |
2.3.1 Exercise
10:00
Still using the taxi data (i.e. taxi_separate
from Q3 of Exercise 2.2.1), now we have got our data separated out into nice clean columns, let us make use of the pivot wider function to create some new columns with additional data in them.
- Pivot the data back out into columns with one column for each region type
- Subtract the London column from the England column to create a column called “
england_outside_london
” (usingdplyr::mutate()
). Call thistaxis_wide
.
2.3.1 Solution Q1 & Q2
taxis_wide <- taxi_separate %>%
tidyr::pivot_wider(names_from = region, values_from = value) %>%
dplyr::mutate(england_outside_london = england - london)
taxis_wide
## # A tibble: 162 × 6
## Year licence vehicle_type london england england_outside_london
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1965 licensed taxis 7.29 NA NA
## 2 1965 licensed phvs NA NA NA
## 3 1965 total vehicles NA NA NA
## 4 1970 licensed taxis 8.65 NA NA
## 5 1970 licensed phvs NA NA NA
## 6 1970 total vehicles NA NA NA
## 7 1971 licensed taxis 9.59 NA NA
## 8 1971 licensed phvs NA NA NA
## 9 1971 total vehicles NA NA NA
## 10 1972 licensed taxis 10.1 NA NA
## # ℹ 152 more rows
- Pivot the data back into a long format ready for plotting a chart. Call this
taxis_final
2.3.1 Solution Q3
taxis_final <- taxis_wide %>%
tidyr::pivot_longer(cols = c(london, england, england_outside_london),
names_to = "region",
values_to = "value")
taxis_final
## # A tibble: 486 × 5
## Year licence vehicle_type region value
## <chr> <chr> <chr> <chr> <dbl>
## 1 1965 licensed taxis london 7.29
## 2 1965 licensed taxis england NA
## 3 1965 licensed taxis england_outside_london NA
## 4 1965 licensed phvs london NA
## 5 1965 licensed phvs england NA
## 6 1965 licensed phvs england_outside_london NA
## 7 1965 total vehicles london NA
## 8 1965 total vehicles england NA
## 9 1965 total vehicles england_outside_london NA
## 10 1970 licensed taxis london 8.65
## # ℹ 476 more rows
- Create a nicer facet ggplot chart of PHV, taxi and all vehicle numbers for England outside of London. Similar to the one below:
taxis_final %>%
dplyr::mutate(Year = as.numeric(Year)) %>%
na.omit() %>%
ggplot(aes(x = Year, y = value, colour = vehicle_type, group = vehicle_type))+
geom_line()+
facet_wrap(~region)