What is tidy data?

You are likely to be most familiar with data laid out in a “messy” structure like the one below:

## # A tibble: 30 × 8
##    country                continent `1967` `1972` `1977` `1982` `1987` `1992`
##    <fct>                  <fct>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 Albania                Europe      66.2   67.7   68.9   70.4   72     71.6
##  2 Austria                Europe      70.1   70.6   72.2   73.2   74.9   76.0
##  3 Belgium                Europe      70.9   71.4   72.8   73.9   75.4   76.5
##  4 Bosnia and Herzegovina Europe      64.8   67.4   69.9   70.7   71.1   72.2
##  5 Bulgaria               Europe      70.4   70.9   70.8   71.1   71.3   71.2
##  6 Croatia                Europe      68.5   69.6   70.6   70.5   71.5   72.5
##  7 Czech Republic         Europe      70.4   70.3   70.7   71.0   71.6   72.4
##  8 Denmark                Europe      73.0   73.5   74.7   74.6   74.8   75.3
##  9 Finland                Europe      69.8   70.9   72.5   74.6   74.8   75.7
## 10 France                 Europe      71.6   72.4   73.8   74.9   76.3   77.5
## # … with 20 more rows

Here, the data has one row per country and then multiple columns, with one for each year. This may seem relatively logical and quite tidy, but the problems start when you realise that there are multiple other quite logical and tidy ways to lay this data out if there are few restrictions. For example, it could quite easily be shown with one country per column, and a year per row:

## # A tibble: 12 × 143
##     year Afghanistan Albania Algeria Angola Argentina Australia Austria Bahrain
##    <int>       <dbl>   <dbl>   <dbl>  <dbl>     <dbl>     <dbl>   <dbl>   <dbl>
##  1  1952        28.8    55.2    43.1   30.0      62.5      69.1    66.8    50.9
##  2  1957        30.3    59.3    45.7   32.0      64.4      70.3    67.5    53.8
##  3  1962        32.0    64.8    48.3   34        65.1      70.9    69.5    56.9
##  4  1967        34.0    66.2    51.4   36.0      65.6      71.1    70.1    59.9
##  5  1972        36.1    67.7    54.5   37.9      67.1      71.9    70.6    63.3
##  6  1977        38.4    68.9    58.0   39.5      68.5      73.5    72.2    65.6
##  7  1982        39.9    70.4    61.4   39.9      69.9      74.7    73.2    69.1
##  8  1987        40.8    72      65.8   39.9      70.8      76.3    74.9    70.8
##  9  1992        41.7    71.6    67.7   40.6      71.9      77.6    76.0    72.6
## 10  1997        41.8    73.0    69.2   41.0      73.3      78.8    77.5    73.9
## 11  2002        42.1    75.7    71.0   41.0      74.3      80.4    79.0    74.8
## 12  2007        43.8    76.4    72.3   42.7      75.3      81.2    79.8    75.6
## # … with 134 more variables: Bangladesh <dbl>, Belgium <dbl>, Benin <dbl>,
## #   Bolivia <dbl>, `Bosnia and Herzegovina` <dbl>, Botswana <dbl>,
## #   Brazil <dbl>, Bulgaria <dbl>, `Burkina Faso` <dbl>, Burundi <dbl>,
## #   Cambodia <dbl>, Cameroon <dbl>, Canada <dbl>,
## #   `Central African Republic` <dbl>, Chad <dbl>, Chile <dbl>, China <dbl>,
## #   Colombia <dbl>, Comoros <dbl>, `Congo, Dem. Rep.` <dbl>,
## #   `Congo, Rep.` <dbl>, `Costa Rica` <dbl>, `Cote d'Ivoire` <dbl>, …

Having this kind of wide variety of potential structures means that when you receive data and try to use it, it’s almost impossible to have any kind of standardised approach to this non-standard data structuring. For example, if you want the most recent year of data from an annual dataset, would you need to select a column or filter some rows; it would depend entirely on the dataset!

Simply put, “tidy” data is data designed to be easy for coding programs like R to read in and use. It’s structured in a way that makes processes efficient, both to write code for and for R to perform. Tidy data structures are always the same, regardless of the data inside the structure.

This is best explained in a tidy example:

## # A tibble: 10 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.

In this dataset, each column is a different variable; years, life expectancy, countries. These are best thought of as different categories that your data can fall into, not values of those categories. For example, year is a variable, but each possible year in that column is a value, and therefore shouldn’t be assigned its own column.

Every row is an observation; a unique combination of the variables included in the data. For example, there is only one observation (row) of data in Afghanistan for 1952, and then a separate one for the same country in 1957.

The opposite of tidy data is not “bad” data!

  • Saying something isn’t in tidy format isn’t a criticism of that data. Human-readable data is almost always “messy” data, and this format makes it clear and easy for people to read.

  • Computers are not people, and always want tidy data

The tidyverse demands tidy data

  • The tidyverse set of packages (dplyr, tidyr, ggplot2, etc) are set up to be intuitive and simple to use.

  • If you find yourself using the tidyverse and looping code, repeating the same tasks over and over, or writing a lot of code to achieve a simple task, it’s probably because the data is in the wrong format.

Code can be harmonised when data is standardised

  • Tidy data makes it easier to standardise code, functions and processes.

  • Multiple projects using tidy data can often recycle large chunks of each others code

Your R data is not an Excel pivot table

Loads of people get their start in data wrangling in Excel, but Excel isn’t really set up like a data processing tool. It’s set up as a data tool for humans to use, not computers, so it sacrifices some efficiency for readability.

You don’t need to be able to read your data while R is running code on it, so if you’re not using tidy data you’re sacrificing efficiency for nothing!

Working with billboard data

A great example of this is the “billboard” dataset from the tidyr package

tidyr::billboard
## # A tibble: 317 × 79
##    artist     track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
##    <chr>      <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 2 Pac      Baby… 2000-02-26      87    82    72    77    87    94    99    NA
##  2 2Ge+her    The … 2000-09-02      91    87    92    NA    NA    NA    NA    NA
##  3 3 Doors D… Kryp… 2000-04-08      81    70    68    67    66    57    54    53
##  4 3 Doors D… Loser 2000-10-21      76    76    72    69    67    65    55    59
##  5 504 Boyz   Wobb… 2000-04-15      57    34    25    17    17    31    36    49
##  6 98^0       Give… 2000-08-19      51    39    34    26    26    19     2     2
##  7 A*Teens    Danc… 2000-07-08      97    97    96    95   100    NA    NA    NA
##  8 Aaliyah    I Do… 2000-01-29      84    62    51    41    38    35    35    38
##  9 Aaliyah    Try … 2000-03-18      59    53    38    28    21    18    16    14
## 10 Adams, Yo… Open… 2000-08-26      76    76    74    69    68    67    61    58
## # … with 307 more rows, and 68 more variables: wk9 <dbl>, wk10 <dbl>,
## #   wk11 <dbl>, wk12 <dbl>, wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>,
## #   wk17 <dbl>, wk18 <dbl>, wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>,
## #   wk23 <dbl>, wk24 <dbl>, wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>,
## #   wk29 <dbl>, wk30 <dbl>, wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>,
## #   wk35 <dbl>, wk36 <dbl>, wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>,
## #   wk41 <dbl>, wk42 <dbl>, wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, …

It has details of the artist, track name, date.entered and then a separate column for each week rating. It’s easy for a human to read, and you can see how you’d easily be able to view, chart and analyse this data in Excel.

But if you want to chart the top 5 rows in ggplot as track rating over time, how would you do that? ggplot wants to be passed a single column for x and y, and you don’t have a column for either week or rating.

How you might try to fix this with an Excel mindset

billboard_small <- head(billboard, 5)

billboard_chart <- billboard_small %>%
  ##Let's just keep the columns we want, and drop artist name and date
  dplyr::select(-c(artist, date.entered)) %>%
## Let's flip the data so the weeks are row names, not column names
  t() %>%
  #And make it back into a data frame
  data.frame()

##The column names are wrong now, so let's move them up from the top row
colnames(billboard_chart) <- billboard_chart[1,]
##And then let's get rid of that useless first row
billboard_chart <- billboard_chart[-1,] 
##And let's use those row names as a column
billboard_chart$week <- row.names(billboard_chart)

##Now we can plot a chart of one column against week!
#It's an ugly bar chart because it doesn't even work as a line chart
ggplot(billboard_chart, aes(x = week, y = `Kryptonite`))+
  geom_bar(stat = "identity")

There’s a lot of things to notice about this code:

This is what the tidyverse needs you to do

billboard_tidy <- billboard_small %>%
  ##Let's shift this data from wide to tidy form
  tidyr::pivot_longer(cols = -c(artist, track, date.entered),
                      names_to = "week",
                      values_to = "position") %>%
  #We can even clean up the week values
  dplyr::mutate(week = as.numeric(gsub("wk", "", week)))

##Now we can plot a chart of week against position
p <- ggplot(billboard_tidy, aes(x = week, y = position, colour = track, group = track))+
  geom_line()

p

Things to note now:

  • It does feel strange to pull that data into that format BUT…
  • It’s so much easier to pass data to functions in the format it’s expecting
  • There are simple, intuitive tidyverse functions for the things I’m trying to do
  • I can make a chart that’s easy and attractive
  • It’s faster than Excel now!
  • I could easily change this, for example:

Here it is with a nice theme applied:

p +
  theme_minimal() +
  scale_colour_brewer(palette = "Dark2")

Here it is faceted out to individual charts by artist:

p +
  facet_wrap(~artist)+
  theme_minimal() +
  scale_colour_brewer(palette = "Dark2")

And it’s even easy to change the original code to show data for the top 10 tracks

billboard_tidy <- head(billboard, 10) %>%
  ##Let's shift this data from wide to tidy form
  tidyr::pivot_longer(cols = -c(artist, track, date.entered),
                      names_to = "week",
                      values_to = "position") %>%
  #We can even clean up the week values
  dplyr::mutate(week = as.numeric(gsub("wk", "", week)))

##Now we can plot a chart of week against position
p <- ggplot(billboard_tidy, aes(x = week, y = position, colour = track, group = track))+
  geom_line()+
  facet_wrap(~artist)+
  theme_minimal() 
   
p       

Doing data analysis of income by religion

This isn’t just the case with data vis, it’s the same if you’d like to do some data analysis too.

relig_income
## # A tibble: 18 × 11
##    religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
##    <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
##  1 Agnostic      27        34        60        81        76       137        122
##  2 Atheist       12        27        37        52        35        70         73
##  3 Buddhist      27        21        30        34        33        58         62
##  4 Catholic     418       617       732       670       638      1116        949
##  5 Don’t k…      15        14        15        11        10        35         21
##  6 Evangel…     575       869      1064       982       881      1486        949
##  7 Hindu          1         9         7         9        11        34         47
##  8 Histori…     228       244       236       238       197       223        131
##  9 Jehovah…      20        27        24        24        21        30         15
## 10 Jewish        19        19        25        25        30        95         69
## 11 Mainlin…     289       495       619       655       651      1107        939
## 12 Mormon        29        40        48        51        56       112         85
## 13 Muslim         6         7         9        10         9        23         16
## 14 Orthodox      13        17        23        32        32        47         38
## 15 Other C…       9         7        11        13        13        14         18
## 16 Other F…      20        33        40        46        49        63         46
## 17 Other W…       5         2         3         4         2         7          3
## 18 Unaffil…     217       299       374       365       341       528        407
## # … with 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
## #   `Don't know/refused` <dbl>

Again, the relig_income dataset isn’t what you’d think of as “messy”, it’s logical and easy to read for a human. But what if we wanted to multiply all the values in the columns by 1000 (to convert from k figures to actual figures)?

It’s easy to think about how you’d do it in Excel; one column at a time

relig_thousand <- relig_income %>%
  ##Use mutate to multiply each column by 1000
  dplyr::mutate(`<$10k` = `<$10k` * 1000,
                `$10-20k` = `$10-20k` * 1000,
                `$20-30k` = `$20-30k` * 1000,
                `$30-40k` = `$30-40k` * 1000)
##We're not done here, I'm just bored...

It’s a lot of coding; doing it for each column would take another 8 repetitions of this!

There are slightly easier ways to do it (using mutate and across in combination), but again it’s already feeling trickier than it needs to be. And crucially, that’s one of the most simple actions we might want to take with this dataset, imagine if you wanted to group together different incomes into wider groups, or convert $ values into £?

How does tidyverse want you to do it?

relig_thousand <- relig_income %>%
  ##Again, let's pull all the data into one tidy column
  tidyr::pivot_longer(cols = -religion,
                      names_to = "salary",
                      values_to = "count") %>%
  #Then just multiply all the counts by 1000
  dplyr::mutate(count = count * 1000)

This is easier to write, easier to QA, and crucially easier to keep up to date. If they changed the salary groupings (and therefore the column names), the first lot of code would break, but the second lot would keep working.

How to make tidy data?

There’s one package and two workhorse functions that are going to do the majority of the heavy lifting for you here.

There’s a couple of other functions we’re going to use too to make data cleaner and nicer.

Pivot longer

You’ve already seen a couple of examples of me pivoting data to make it longer. Long data is what you’ll want most of the time; it’s easier to filter, group, and summarise, easier to apply the same functions to lots of columns at the same time, and easier to plot on charts.

The pivot_longer function does exactly what the name suggests, pivots data from a wide format (with many columns and few rows) to a long format (with many rows and few columns). This is very useful when you have data which has values as column names, and you want to gather these together into a single column of multiple values instead. It takes a lot of arguments, but you’ll probably only want to use a maximum of 4.

  • data: this is the dataset you want to pivot

  • cols: the columns you want to combine into a longer format. This can be done in the same way as in the select function, either by specifying columns you wish to include or those you want to exclude (using the -).

  • names_to: the names of all the columns you’ll be pivoting will be combined into a single labelling column. This will be called “name” by default, but you can specify an alternative name as a string if you like.

  • values_to: all of the values in the columns you’ll be pivoting will be combined into a single value column. This will be called “value” by default, but you can specify an alternative name as a string if you like.

Gathering construction dataset

construction %>% ##the data
  tidyr::pivot_longer(cols = c(Northeast, Midwest, South, West), #Choosing the columns to pivot
                      names_to = "region", #Put column names in a column called region
                      values_to = "value") #Put values in a column called value
## # A tibble: 36 × 7
##     Year Month    `1 unit` `2 to 4 units` `5 units or more` region    value
##    <dbl> <chr>       <dbl> <lgl>                      <dbl> <chr>     <dbl>
##  1  2018 January       859 NA                           348 Northeast   114
##  2  2018 January       859 NA                           348 Midwest     169
##  3  2018 January       859 NA                           348 South       596
##  4  2018 January       859 NA                           348 West        339
##  5  2018 February      882 NA                           400 Northeast   138
##  6  2018 February      882 NA                           400 Midwest     160
##  7  2018 February      882 NA                           400 South       655
##  8  2018 February      882 NA                           400 West        336
##  9  2018 March         862 NA                           356 Northeast   150
## 10  2018 March         862 NA                           356 Midwest     154
## # … with 26 more rows

Exercises

Let’s try some pivoting of our own with taxi data. This is a nice table but it’s not very R friendly if we’d like to plot a chart of taxi volumes over time

  1. Read in the taxi data from the Data folder in this repo
taxis <- openxlsx::read.xlsx("Data/taxi0101.xlsx", sheet = 3, startRow = 8)
  1. Apart from the year column, all of the data is counts of different types of vehicles, and therefore the tidy format would be to have a column of years, a column of vehicle types, and a column of values. Use the pivot_longer function to achieve this.
taxis_long <- taxis %>%
  tidyr::pivot_longer(cols =  -Year)

Hint: your final table should start like this:

head(taxis_long)
## # A tibble: 6 × 3
##   Year  name                                       value  
##   <chr> <chr>                                      <chr>  
## 1 1965  London.Licensed.taxis.(thousands).[note.1] "7.29" 
## 2 1965  England.Licensed.taxis.(thousands)         "[w]  "
## 3 1965  London.Licensed.PHVs.(thousands).[note.1]  "[w]"  
## 4 1965  England.Licensed.PHVs.(thousands)          "[w]  "
## 5 1965  London.Total.vehicles.(thousands).[note.1] "[w]"  
## 6 1965  England.Total.vehicles.(thousands)         "[w]  "
  1. Because the taxi data uses [w] to show NA values, all of the data columns are read in as text. Now all of those values are in one column, convert them all to numeric values.
taxis_long <- taxis_long %>%
  dplyr::mutate(value = as.numeric(value))
  1. The column names you’ve pivoted into one column aren’t the easiest for R to read. Convert them to lower case
taxis_long <- taxis_long %>%
  dplyr::mutate(name = tolower(name))

Separate

The separate function is the other common function you’ll want to use from the tidyr package, and is designed to take a single column and split it into two or more other columns based on a pattern (e.g. a single character or symbol).

This is common function to use, because more often than you’d expect, there’s more than one piece of information contained in a column title (think about a column called rail_2020, or road_england). When these column names are moved into tidy format, you end up with multiple pieces of data in the same cell.

An example of this can be seen in the world_bank_pop dataset:

tidyr::world_bank_pop %>%
  tidyr::pivot_longer(cols = -c(country, indicator),
                      names_to = "year") %>%
  head()
## # A tibble: 6 × 4
##   country indicator   year  value
##   <chr>   <chr>       <chr> <dbl>
## 1 ABW     SP.URB.TOTL 2000  42444
## 2 ABW     SP.URB.TOTL 2001  43048
## 3 ABW     SP.URB.TOTL 2002  43670
## 4 ABW     SP.URB.TOTL 2003  44246
## 5 ABW     SP.URB.TOTL 2004  44669
## 6 ABW     SP.URB.TOTL 2005  44889

Notice how the indicator column contains multiple pieces of information; SP is a prefix for all values, POP or URB indicates whether the value is for either total population or urban population, and TOTL or GROW indicates whether it’s total population or population growth. This is fine, but makes it harder to select e.g. all urban population values only, or all growth values only. It would be much more simple if this information was in two different columns.

We can split this information into three separate columns using separate. The arguments for this function are a little more complex than for pivot_longer:

col: The column we’re going to split into multiple columns

into: The names of the new columns we’re going to create from this column (passed as strings)

sep: the symbol you want to use as the separation point between the data in the columns. This works using regular expressions so you need to be a bit careful how you provide this value, usually prefixed with “\”

There are also some arguments you can leave as default for now but may want to change later, these include:

remove: defaults to TRUE, lets you decide whether or not to remove the original column

extra: tells the function what to do with any extra content you haven’t provided a column name for. By default it’ll tell you this exists, but you can specify to either “drop” this without warning, or “merge” it into the final column so you don’t lose it.

And here it is in action:

#Get the data into tidy format
tidyr::world_bank_pop %>%
  tidyr::pivot_longer(cols = -c(country, indicator),
                      names_to = "year") %>%
  ##Separate out the indicator column
  tidyr::separate(col = indicator,
  #Separate it into three columns called sp, pop, and measure                
                  into = c("sp", "pop", "measure"),
  #Separate on the dots in the name (using \\ before you reference the .)
                  sep = "\\.") %>%
  head(10)
## # A tibble: 10 × 6
##    country sp    pop   measure year  value
##    <chr>   <chr> <chr> <chr>   <chr> <dbl>
##  1 ABW     SP    URB   TOTL    2000  42444
##  2 ABW     SP    URB   TOTL    2001  43048
##  3 ABW     SP    URB   TOTL    2002  43670
##  4 ABW     SP    URB   TOTL    2003  44246
##  5 ABW     SP    URB   TOTL    2004  44669
##  6 ABW     SP    URB   TOTL    2005  44889
##  7 ABW     SP    URB   TOTL    2006  44881
##  8 ABW     SP    URB   TOTL    2007  44686
##  9 ABW     SP    URB   TOTL    2008  44375
## 10 ABW     SP    URB   TOTL    2009  44052

It should now be obvious how you could filter or summarise this data more easily, referencing the pop or measure columns individually.

Exercises

Let’s go back to our taxi data.

taxis_long
## # A tibble: 324 × 3
##    Year  name                                       value
##    <chr> <chr>                                      <dbl>
##  1 1965  london.licensed.taxis.(thousands).[note.1]  7.29
##  2 1965  england.licensed.taxis.(thousands)         NA   
##  3 1965  london.licensed.phvs.(thousands).[note.1]  NA   
##  4 1965  england.licensed.phvs.(thousands)          NA   
##  5 1965  london.total.vehicles.(thousands).[note.1] NA   
##  6 1965  england.total.vehicles.(thousands)         NA   
##  7 1970  london.licensed.taxis.(thousands).[note.1]  8.65
##  8 1970  england.licensed.taxis.(thousands)         NA   
##  9 1970  london.licensed.phvs.(thousands).[note.1]  NA   
## 10 1970  england.licensed.phvs.(thousands)          NA   
## # … with 314 more rows
  1. Just like the world bank data, notice that our taxi data contains multiple bits of information in the name column; the geographic area and the type of vehicle, as well as some notes. Use the separate function to split this into two different columns on the . symbol. Set the extra argument to merge so you don’t lose any of the information! Call the two new columns “region” and “vehicle_type”
taxi_separate <- taxis_long %>% 
  tidyr::separate(name, into = c("region", "vehicle_type"), extra = "merge", sep = "\\.")
  1. We can actually be even more clever with our separate function to return only two clean columns of vehicle and region data. To do this, you would want to create three new columns of data (one of which you could discard), and change the extra argument. Can you do this?
taxis_separate <- taxis_long %>% 
  tidyr::separate(name, into = c("region", "licence", "vehicle_type"), extra = "drop", sep = "\\.")
  1. Can you see how you would keep just the PHV data from this now using a single filter call?

Pivot wider

The opposite of pivoting longer is the function pivot_wider. This takes long tidy data and spreads it back out, taking the column names from one column and the values from the other. This is useful as a final step if you want to present data in a human-readable format, but is also useful for some aspects of data processing such as mutating across different variables.

Like pivot_longer, in pivot_wider() you can specify the arguments names_from and values_from to assign the columns that the column names for the new spread data will be taken from, and the column the associated values will be taken from. If you don’t specify these, it will default to “name” and “value” once again. These arguments also specify the columns the function works over, so in this case there’s no need to specify a cols argument.

fish_encounters %>% ##the data
  tidyr::pivot_wider(names_from = station, #Take column names from this column
                      values_from = seen) #Take values from this column
## # A tibble: 19 × 12
##    fish  Release I80_1 Lisbon  Rstr Base_TD   BCE   BCW  BCE2  BCW2   MAE   MAW
##    <fct>   <int> <int>  <int> <int>   <int> <int> <int> <int> <int> <int> <int>
##  1 4842        1     1      1     1       1     1     1     1     1     1     1
##  2 4843        1     1      1     1       1     1     1     1     1     1     1
##  3 4844        1     1      1     1       1     1     1     1     1     1     1
##  4 4845        1     1      1     1       1    NA    NA    NA    NA    NA    NA
##  5 4847        1     1      1    NA      NA    NA    NA    NA    NA    NA    NA
##  6 4848        1     1      1     1      NA    NA    NA    NA    NA    NA    NA
##  7 4849        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
##  8 4850        1     1     NA     1       1     1     1    NA    NA    NA    NA
##  9 4851        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
## 10 4854        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
## 11 4855        1     1      1     1       1    NA    NA    NA    NA    NA    NA
## 12 4857        1     1      1     1       1     1     1     1     1    NA    NA
## 13 4858        1     1      1     1       1     1     1     1     1     1     1
## 14 4859        1     1      1     1       1    NA    NA    NA    NA    NA    NA
## 15 4861        1     1      1     1       1     1     1     1     1     1     1
## 16 4862        1     1      1     1       1     1     1     1     1    NA    NA
## 17 4863        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
## 18 4864        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
## 19 4865        1     1      1    NA      NA    NA    NA    NA    NA    NA    NA

Exercises

Still using the taxi data, now we’ve got our data separated out into nice clean columns, let’s make use of the pivot wider function to create some new columns with additional data in them.

  1. Pivot the data back out into columns with one column for each region type
taxis_wide <- taxis_separate %>%
  pivot_wider(names_from = region,
              values_from = value)
  1. Subtract the London column from the England column to create a column called “england_outside_london” (using dplyr::mutate())
 taxis_wide <- taxis_wide %>%
   dplyr::mutate(england_outside_london =  england - london)
  1. Pivot the data back into long form ready for plotting on a chart
 taxis_final <- taxis_wide %>%
   tidyr::pivot_longer(cols = c(london, england, england_outside_london), names_to = "region", values_to = "value")
  1. Can you create a ggplot chart of PHV, taxi and all vehicle numbers for England outside of London? (example below!)
taxis_final %>% 
  dplyr::mutate(Year = as.numeric(Year)) %>%
  na.omit() %>%
  dplyr::filter(region == "england_outside_london") %>%
  ggplot(aes(x = Year, y = value, colour = vehicle_type, group = vehicle_type))+
  geom_line()

Putting it all together

The below are a few long-form exercises for you to try using the above functions; these are designed to be challenging so don’t be put off if they take a bit of work!

Create a chart of concessionary bus journeys taken in various regions of the UK, as a percentage of the total.

  • Read in bus table 0105 from the data directory of this repo
  • Bring the data into tidy format so that there’s only one column of values and one of region names
  • Make sure the values are numeric format
  • Group and mutate the data to create a new column called percent, calculating the percentage for each region of the total
  • Use separate on the year column to split the financial years so that the year column only includes the first of the two years
  • Plot a chart of percentage against year, with the colour being the different regions

Create a chart of miles travelled by grouped transport type from some NTS data

  • Read in ntsq table 1003 from the data directory of this repo
  • Bring the data into tidy format
  • Make sure the values are in numeric format
  • Pivot the data wider so there is an individual column for each mode
  • Create sum columns for active transport (walk, bicycle), private transport (car/van driver, car/van passenger, motorcycle and other private), and public transport (bus, other bus, underground, rail and other public transport)
  • Select to keep only these new columns
  • Return data to tidy format
  • Plot a chart of year and percent, with colour being the different transport types. Use the facet argument in ggplot to produce 3 different charts by purpose.

Create a chart of the ratio of passengers to seats for rail services arriving to and departing from major stations in the UK

  • Read in the rai0201 table
  • Bring the data into a tidy format
  • Separate the column you’ve gathered the names into so that you’ve got two columns of arrival/departure time and variable type
  • Separate the arrival/departure time column again so you’ve got two columns of arrival or departure and time
  • Pivot your data back out so you’ve got individual columns for number of services, number of passengers and number of seats
  • Create a column containing the ratio of passengers per seat
  • Return the data to tidy format
  • Plot a chart of station name and ratio of passengers per seat. Colour the bars according to whether they’re for departures or arrivals, and facet the plots on whether they’re peak time or all day.