Chapter 1 Introduction

Data in the wild can be scary–when confronted with a complicated and messy dataset you may find yourself wondering, where do I even start? The tidyr:: package allows you the wrangle such beasts into nice and tidy datasets. This course is designed the concept and application of tidy data in R, and how it can be used to make code simple to read and write.

1.1 Course instructions

This book is designed to accompany the Tidy Data in R training that we run at DfT. To complete this course, you will need a Cloud R account, and a Github account to clone the training repository.

If you’re running through this book solo, it is recommended to clone the exercise repository, run through the book in order and try out all the of the exercises as you go through.

1.2 Pressumed knowledge

Users of this book should feel comfortable with using R and basic graphics with ggplot2. Our Introduction to R provides a brief introduction to the tidying data concept, as well as basic plotting and data manipulation, both of which are needed to complete this course.

1.3 What is tidy data?

You are likely to be most familiar with data laid out in a “messy” structure like the one below:

country continent 1967 1972 1977 1982 1987 1992
Albania Europe 66.22 67.69 68.93 70.42 72.00 71.581
Austria Europe 70.14 70.63 72.17 73.18 74.94 76.040
Belgium Europe 70.94 71.44 72.80 73.93 75.35 76.460
Bosnia and Herzegovina Europe 64.79 67.45 69.86 70.69 71.14 72.178
Bulgaria Europe 70.42 70.90 70.81 71.08 71.34 71.190
Croatia Europe 68.50 69.61 70.64 70.46 71.52 72.527
Czech Republic Europe 70.38 70.29 70.71 70.96 71.58 72.400
Denmark Europe 72.96 73.47 74.69 74.63 74.80 75.330
Finland Europe 69.83 70.87 72.52 74.55 74.83 75.700
France Europe 71.55 72.38 73.83 74.89 76.34 77.460

Here, the data has one row per country and then multiple columns, with one for each year. This may seem relatively logical and quite tidy, but the problems start when you realise that there are multiple other quite logical and tidy ways to lay this data out if there are few restrictions.

For example, it could quite easily be shown with one country per column, and a year per row:

year Afghanistan Albania Algeria Angola Argentina Australia Austria
1952 28.801 55.230 43.077 30.015 62.485 69.12 66.80
1957 30.332 59.280 45.685 31.999 64.399 70.33 67.48
1962 31.997 64.820 48.303 34.000 65.142 70.93 69.54
1967 34.020 66.220 51.407 35.985 65.634 71.10 70.14
1972 36.088 67.690 54.518 37.928 67.065 71.93 70.63
1977 38.438 68.930 58.014 39.483 68.481 73.49 72.17
1982 39.854 70.420 61.368 39.942 69.942 74.74 73.18
1987 40.822 72.000 65.799 39.906 70.774 76.32 74.94
1992 41.674 71.581 67.744 40.647 71.868 77.56 76.04
1997 41.763 72.950 69.152 40.963 73.275 78.83 77.51

With so many different ways data can be structured, it’s almost impossible to apply a standard approach when working with non-standardised data. For example, if you need the most recent year from an annual dataset, would you select a column or filter rows? It completely depends on how the dataset is organised!

Simply put, “tidy” data is data designed to be easy for coding programmes like R to read in and use. It is structured in a way that make processes efficient, both to write code for an for R to perform. Tidy data structures are always the same, regardless of the data inside the structure.

A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types.

In tidy data:

  • each variable is a column; each column is a variable
  • each observation is a row; each row is an observation
  • each value is a cell; each cell is a single value

Each column represents a type of information, each row is a single obversation, and each cell holds one specific value.

1.4 A tidy example

country continent year lifeExp pop gdpPercap
Afghanistan Asia 1952 28.801 8425333 779.4453
Afghanistan Asia 1957 30.332 9240934 820.8530
Afghanistan Asia 1962 31.997 10267083 853.1007
Afghanistan Asia 1967 34.020 11537966 836.1971
Afghanistan Asia 1972 36.088 13079460 739.9811
Afghanistan Asia 1977 38.438 14880372 786.1134
Afghanistan Asia 1982 39.854 12881816 978.0114
Afghanistan Asia 1987 40.822 13867957 852.3959
Afghanistan Asia 1992 41.674 16317921 649.3414
Afghanistan Asia 1997 41.763 22227415 635.3414

In this dataset, each column represents a variable, such as the year, life expectancy, country, population, and GDP per capita. These variables are categories that describe the data, rather than individual values.

For example, year is a variable, but each possible year in that column is a value, and therefore should not be assigned its own column.

Every row is an observation; a unique combination of the variables included in the data. For example, there is only one observation (row) of data in Afghanistan for 1952, and then a separate one for the same country in 1957.

1.5 Understanding tidy data

1.5.1 The Opposite of Tidy Data is Not “Bad” Data

It is important to recognise that untidy data is not necessarily “bad” data. When we say data is not in a tidy format, it does not mean the data is flawed or unusable. Often, human-readable data is “messy” because it is designed to be understood by people, not machines. This format tends to be more flexible and easier to interpret at a glance. However, while human-readable data might be easy for people to understand, it is not always the most efficient or effective for analysis, especially when using computers.

Computers, however, do not read data the same way humans do. They require data to be in a specific format – one that is well-structured and consistent. This is where tidy data comes in. Tidy data is essential for computers to process efficiently because it enables quicker operations, analysis, and easier integration with various software tools.

1.5.2 Tidyverse demands Tidy Data

The tidyverse (a collection of R packages such as dplyr::, tidyr::, ggplot2::, etc.) is designed with tidy data in mind. These packages are built to work seamlessly with tidy data, providing intuitive and simple ways to manipulate, clean, and visualise data.

If you are using the tidyverse and find yourself writing a lot of code to perform simple tasks, or if you are looping through data unnecessarily, it could be because your data is not in the tidy format the tidyverse expects. In such cases, you may need to transform your data into a tidy format to make the most of the tools available to you.

1.5.3 Harmonising code

When data is in a tidy format, it makes it much easier to standardise your code, functions, and processes. Tidy data eliminates the need for repetitive tasks and simplifies the logic behind data analysis. For example, by ensuring your data is tidy, you can more easily apply functions across multiple datasets, allowing for code reuse and reduced complexity.

One of the biggest benefits of using tidy data is that it allows you to reuse code across different projects. When datasets are structured consistently, many tasks can be handled with the same code. This means that if multiple projects use tidy data, you can recycle large chunks of code without having to write new code for every new dataset. This saves time, reduces errors, and ensures consistency across projects.

1.6 Think Excel, Think R

Many people begin their data wrangling journey in Excel, but Excel is not designed as a data processing tool. It is built for humans to interact with, not computers, which means it prioritises readability over efficiency.

In R, you do not need to be able to read your data while the code is running. If you are not using tidy data, you are sacrificing efficiency without any real benefit.

1.6.1 The Excel way of thinking

To showcase the difference between the tidy R and Excel perspectives, we will be working with the billboard dataset from the tidyr:: package.

artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
2 Pac Baby Don’t Cry (Keep… 2000-02-26 87 82 72 77 87 94 99 NA
2Ge+her The Hardest Part Of … 2000-09-02 91 87 92 NA NA NA NA NA
3 Doors Down Kryptonite 2000-04-08 81 70 68 67 66 57 54 53
3 Doors Down Loser 2000-10-21 76 76 72 69 67 65 55 59
504 Boyz Wobble Wobble 2000-04-15 57 34 25 17 17 31 36 49
98^0 Give Me Just One Nig… 2000-08-19 51 39 34 26 26 19 2 2
A*Teens Dancing Queen 2000-07-08 97 97 96 95 100 NA NA NA
Aaliyah I Don’t Wanna 2000-01-29 84 62 51 41 38 35 35 38
Aaliyah Try Again 2000-03-18 59 53 38 28 21 18 16 14
Adams, Yolanda Open My Heart 2000-08-26 76 76 74 69 68 67 61 58

It has details of the artist, track name, date.entered and then a separate column for each week rating.

While this format is easy to read and interpret by humans, it is not the most efficient format for analysis, especially when it comes to visualisations or statistical analyses using R, such as creating line charts or aggregating data by week.

Let’s say you want to chart the first 5 rows in ggplot to examine the rating of tracks over time. How would you do that? In ggplot line charts, you need to have a x and y variable. Do we have this in our dataset? No. So, how do we fix this?

If you were thinking in an “Excel-way”, you might:

  1. Extracted the first 5 rows
  2. Selected specific columns and drop unwanted ones
  3. Flipped the data (transpose the table)
  4. Convert the transpost data into a data frame
  5. Fixed the column names so the first row becomes the column headers
  6. Remove the first row (which now contains column names)
  7. Use row names as a new column (for week information)
  8. Plot
# Extract the first 5 rows
billboard_small <- head(billboard, 5)

billboard_chart <- billboard_small %>%
  # Let's just keep the columns we want, and drop artist name and date
  dplyr::select(-c(artist, date.entered)) %>%
  #  Let's flip the data so the weeks are row names, not column names
  t() %>%
  # And make it back into a data frame
  data.frame()

# The column names are wrong now, so let's move them up from the top row
colnames(billboard_chart) <- billboard_chart[1,]
# And then let's get rid of that useless first row
billboard_chart <- billboard_chart[-1,] 
# And let's use those row names as a column
billboard_chart$week <- row.names(billboard_chart)

# Now we can plot a chart of one column against week!
# It's an ugly bar chart because it doesn't even work as a line chart
ggplot(billboard_chart, aes(x = week, y = `Kryptonite`))+
  geom_bar(stat = "identity")

This approach works, but it is not efficient and feels more like working in Excel, where each step requires manual manipulation. You needed to switch from the tidyverse to base R halfway through because the tidyverse does not offer the necessary functions. If you wanted to analyse another artist, you would have to redo everything, and you have not even added the other tracks to the chart yet. It can become exhausting and frustrating, making you wish you could just do this in Excel. There are also many issues with the chart that make it difficult to interpret and visualise properly.

1.6.2 The Tid(ie)R way of thinking

billboard_tidy <- billboard_small %>%
  # Let's shift this data from wide to tidy form
  tidyr::pivot_longer(cols = -c(artist, track, date.entered),
                      names_to = "week",
                      values_to = "position") %>%
  # We can even clean up the week values
  dplyr::mutate(week = as.numeric(gsub("wk", "", week)))

# Now we can plot a chart of week against position
p <- 
  ggplot(billboard_tidy, aes(x = week, y = position, colour = track, group = track))+
  geom_line()

It feels strange to pull data into that format, but it is much easier to pass the data to functions in the format they expect. The tidyverse provides simple, intuitive functions for the tasks you are trying to accomplish, allowing you to create charts that are both easy and attractive. Overall, it is faster than doing the same work in Excel!

You could either make prettier charts:

p +
  theme_minimal() +
  scale_colour_brewer(palette = "Dark2")

Here it is faceted out to individual charts by artist:

p +
  facet_wrap(~artist)+
  theme_minimal() +
  scale_colour_brewer(palette = "Dark2")

And it is even easy to change the original code to show data for the top 10 tracks:

billboard_tidy <- head(billboard, 20) %>%
  # Let's shift this data from wide to tidy form
  tidyr::pivot_longer(cols = -c(artist, track, date.entered),
                      names_to = "week",
                      values_to = "position") %>%
  # We can even clean up the week values
  dplyr::mutate(week = as.numeric(gsub("wk", "", week)))

# Now we can plot a chart of week against position
p <- 
  ggplot(billboard_tidy, aes(x = week, y = position, colour = track, group = track))+
  geom_line()+
  facet_wrap(~artist)+
  theme_minimal() 
   
p       

1.6.3 Using tidyr for data analysis

This is not just the case with data visualisation, it is the same if you would like to do some data analysis too.

religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k $75-100k $100-150k >150k
Agnostic 27 34 60 81 76 137 122 109 84
Atheist 12 27 37 52 35 70 73 59 74
Buddhist 27 21 30 34 33 58 62 39 53
Catholic 418 617 732 670 638 1116 949 792 633
Don’t know/refused 15 14 15 11 10 35 21 17 18
Evangelical Prot 575 869 1064 982 881 1486 949 723 414
Hindu 1 9 7 9 11 34 47 48 54
Historically Black Prot 228 244 236 238 197 223 131 81 78
Jehovah’s Witness 20 27 24 24 21 30 15 11 6
Jewish 19 19 25 25 30 95 69 87 151

Again, the relig_income dataset is not what you would think of as “messy”, it is logical and easy to read for a human. But what if we wanted to multiply all the values in the columns by 1000 (to convert from k figures to actual figures)?

It is easy to think about how you would do it in Excel; one column at a time:

relig_thousand <- relig_income %>%
  # Use mutate to multiply each column by 1000
  dplyr::mutate(`<$10k` = `<$10k` * 1000,
                `$10-20k` = `$10-20k` * 1000,
                `$20-30k` = `$20-30k` * 1000,
                `$30-40k` = `$30-40k` * 1000)

# We're not done here, I'm just bored...

It is a lot of coding; doing it for each column would take another 8 repetitions of this!

There are slightly easier ways to do it (using mutate and across in combination), but again it is already feeling trickier than it needs to be. And crucially, that is one of the most simple actions we might want to take with this dataset, imagine if you wanted to group together different incomes into wider groups, or convert $ values into £?

How does tidyverse want you to do it? Look at this code snippet below:

relig_thousand <- relig_income %>%
  # Again, let's pull all the data into one tidy column
  tidyr::pivot_longer(cols = -religion,
                      names_to = "salary",
                      values_to = "count") %>%
  # Then just multiply all the counts by 1000
  dplyr::mutate(count = count * 1000)

This is easier to write, easier to QA, and crucially easier to keep up to date. If they changed the salary groupings (and therefore the column names), the first lot of code would break, but the second lot would keep working.