Tidy Data in R
2025-01-23
Chapter 1 Introduction
Data in the wild can be scary–when confronted with a complicated and messy dataset you may find yourself wondering, where do I even start? The tidyr::
package allows you the wrangle such beasts into nice and tidy datasets. This course is designed the concept and application of tidy data in R, and how it can be used to make code simple to read and write.
1.1 Course instructions
This book is designed to accompany the Tidy Data in R training that we run at DfT. To complete this course, you will need a Cloud R account, and a Github account to clone the training repository.
If you’re running through this book solo, it is recommended to clone the exercise repository, run through the book in order and try out all the of the exercises as you go through.
1.2 Pressumed knowledge
Users of this book should feel comfortable with using R and basic graphics with ggplot2. Our Introduction to R provides a brief introduction to the tidying data concept, as well as basic plotting and data manipulation, both of which are needed to complete this course.
1.3 What is tidy data?
You are likely to be most familiar with data laid out in a “messy” structure like the one below:
country | continent | 1967 | 1972 | 1977 | 1982 | 1987 | 1992 |
---|---|---|---|---|---|---|---|
Albania | Europe | 66.22 | 67.69 | 68.93 | 70.42 | 72.00 | 71.581 |
Austria | Europe | 70.14 | 70.63 | 72.17 | 73.18 | 74.94 | 76.040 |
Belgium | Europe | 70.94 | 71.44 | 72.80 | 73.93 | 75.35 | 76.460 |
Bosnia and Herzegovina | Europe | 64.79 | 67.45 | 69.86 | 70.69 | 71.14 | 72.178 |
Bulgaria | Europe | 70.42 | 70.90 | 70.81 | 71.08 | 71.34 | 71.190 |
Croatia | Europe | 68.50 | 69.61 | 70.64 | 70.46 | 71.52 | 72.527 |
Czech Republic | Europe | 70.38 | 70.29 | 70.71 | 70.96 | 71.58 | 72.400 |
Denmark | Europe | 72.96 | 73.47 | 74.69 | 74.63 | 74.80 | 75.330 |
Finland | Europe | 69.83 | 70.87 | 72.52 | 74.55 | 74.83 | 75.700 |
France | Europe | 71.55 | 72.38 | 73.83 | 74.89 | 76.34 | 77.460 |
Here, the data has one row per country and then multiple columns, with one for each year. This may seem relatively logical and quite tidy, but the problems start when you realise that there are multiple other quite logical and tidy ways to lay this data out if there are few restrictions.
For example, it could quite easily be shown with one country per column, and a year per row:
year | Afghanistan | Albania | Algeria | Angola | Argentina | Australia | Austria |
---|---|---|---|---|---|---|---|
1952 | 28.801 | 55.230 | 43.077 | 30.015 | 62.485 | 69.12 | 66.80 |
1957 | 30.332 | 59.280 | 45.685 | 31.999 | 64.399 | 70.33 | 67.48 |
1962 | 31.997 | 64.820 | 48.303 | 34.000 | 65.142 | 70.93 | 69.54 |
1967 | 34.020 | 66.220 | 51.407 | 35.985 | 65.634 | 71.10 | 70.14 |
1972 | 36.088 | 67.690 | 54.518 | 37.928 | 67.065 | 71.93 | 70.63 |
1977 | 38.438 | 68.930 | 58.014 | 39.483 | 68.481 | 73.49 | 72.17 |
1982 | 39.854 | 70.420 | 61.368 | 39.942 | 69.942 | 74.74 | 73.18 |
1987 | 40.822 | 72.000 | 65.799 | 39.906 | 70.774 | 76.32 | 74.94 |
1992 | 41.674 | 71.581 | 67.744 | 40.647 | 71.868 | 77.56 | 76.04 |
1997 | 41.763 | 72.950 | 69.152 | 40.963 | 73.275 | 78.83 | 77.51 |
With so many different ways data can be structured, it’s almost impossible to apply a standard approach when working with non-standardised data. For example, if you need the most recent year from an annual dataset, would you select
a column or filter
rows? It completely depends on how the dataset is organised!
Simply put, “tidy” data is data designed to be easy for coding programmes like R to read in and use. It is structured in a way that make processes efficient, both to write code for an for R to perform. Tidy data structures are always the same, regardless of the data inside the structure.
A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types.
In tidy data:
- each variable is a column; each column is a variable
- each observation is a row; each row is an observation
- each value is a cell; each cell is a single value
Each column represents a type of information, each row is a single obversation, and each cell holds one specific value.
1.4 A tidy example
country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|
Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 |
Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 |
Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 |
Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971 |
Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.9811 |
Afghanistan | Asia | 1977 | 38.438 | 14880372 | 786.1134 |
Afghanistan | Asia | 1982 | 39.854 | 12881816 | 978.0114 |
Afghanistan | Asia | 1987 | 40.822 | 13867957 | 852.3959 |
Afghanistan | Asia | 1992 | 41.674 | 16317921 | 649.3414 |
Afghanistan | Asia | 1997 | 41.763 | 22227415 | 635.3414 |
In this dataset, each column represents a variable, such as the year, life expectancy, country, population, and GDP per capita. These variables are categories that describe the data, rather than individual values.
For example, year
is a variable, but each possible year in that column is a value, and therefore should not be assigned its own column.
Every row is an observation; a unique combination of the variables included in the data. For example, there is only one observation (row) of data in Afghanistan for 1952, and then a separate one for the same country in 1957.
1.5 Understanding tidy data
1.5.1 The Opposite of Tidy Data is Not “Bad” Data
It is important to recognise that untidy data is not necessarily “bad” data. When we say data is not in a tidy format, it does not mean the data is flawed or unusable. Often, human-readable data is “messy” because it is designed to be understood by people, not machines. This format tends to be more flexible and easier to interpret at a glance. However, while human-readable data might be easy for people to understand, it is not always the most efficient or effective for analysis, especially when using computers.
Computers, however, do not read data the same way humans do. They require data to be in a specific format – one that is well-structured and consistent. This is where tidy data comes in. Tidy data is essential for computers to process efficiently because it enables quicker operations, analysis, and easier integration with various software tools.
1.5.2 Tidyverse demands Tidy Data
The tidyverse
(a collection of R packages such as dplyr::
, tidyr::
, ggplot2::
, etc.) is designed with tidy data in mind. These packages are built to work seamlessly with tidy data, providing intuitive and simple ways to manipulate, clean, and visualise data.
If you are using the tidyverse and find yourself writing a lot of code to perform simple tasks, or if you are looping through data unnecessarily, it could be because your data is not in the tidy format the tidyverse expects. In such cases, you may need to transform your data into a tidy format to make the most of the tools available to you.
1.5.3 Harmonising code
When data is in a tidy format, it makes it much easier to standardise your code, functions, and processes. Tidy data eliminates the need for repetitive tasks and simplifies the logic behind data analysis. For example, by ensuring your data is tidy, you can more easily apply functions across multiple datasets, allowing for code reuse and reduced complexity.
One of the biggest benefits of using tidy data is that it allows you to reuse code across different projects. When datasets are structured consistently, many tasks can be handled with the same code. This means that if multiple projects use tidy data, you can recycle large chunks of code without having to write new code for every new dataset. This saves time, reduces errors, and ensures consistency across projects.
1.6 Think Excel, Think R
Many people begin their data wrangling journey in Excel, but Excel is not designed as a data processing tool. It is built for humans to interact with, not computers, which means it prioritises readability over efficiency.
In R, you do not need to be able to read your data while the code is running. If you are not using tidy data, you are sacrificing efficiency without any real benefit.
1.6.1 The Excel way of thinking
To showcase the difference between the tidy R and Excel perspectives, we will be working with the billboard
dataset from the tidyr::
package.
artist | track | date.entered | wk1 | wk2 | wk3 | wk4 | wk5 | wk6 | wk7 | wk8 |
---|---|---|---|---|---|---|---|---|---|---|
2 Pac | Baby Don’t Cry (Keep… | 2000-02-26 | 87 | 82 | 72 | 77 | 87 | 94 | 99 | NA |
2Ge+her | The Hardest Part Of … | 2000-09-02 | 91 | 87 | 92 | NA | NA | NA | NA | NA |
3 Doors Down | Kryptonite | 2000-04-08 | 81 | 70 | 68 | 67 | 66 | 57 | 54 | 53 |
3 Doors Down | Loser | 2000-10-21 | 76 | 76 | 72 | 69 | 67 | 65 | 55 | 59 |
504 Boyz | Wobble Wobble | 2000-04-15 | 57 | 34 | 25 | 17 | 17 | 31 | 36 | 49 |
98^0 | Give Me Just One Nig… | 2000-08-19 | 51 | 39 | 34 | 26 | 26 | 19 | 2 | 2 |
A*Teens | Dancing Queen | 2000-07-08 | 97 | 97 | 96 | 95 | 100 | NA | NA | NA |
Aaliyah | I Don’t Wanna | 2000-01-29 | 84 | 62 | 51 | 41 | 38 | 35 | 35 | 38 |
Aaliyah | Try Again | 2000-03-18 | 59 | 53 | 38 | 28 | 21 | 18 | 16 | 14 |
Adams, Yolanda | Open My Heart | 2000-08-26 | 76 | 76 | 74 | 69 | 68 | 67 | 61 | 58 |
It has details of the artist, track name, date.entered and then a separate column for each week rating.
While this format is easy to read and interpret by humans, it is not the most efficient format for analysis, especially when it comes to visualisations or statistical analyses using R, such as creating line charts or aggregating data by week.
Let’s say you want to chart the first 5 rows in ggplot
to examine the rating of tracks over time. How would you do that? In ggplot
line charts, you need to have a x
and y
variable. Do we have this in our dataset? No. So, how do we fix this?
If you were thinking in an “Excel-way”, you might:
- Extracted the first 5 rows
- Selected specific columns and drop unwanted ones
- Flipped the data (transpose the table)
- Convert the transpost data into a data frame
- Fixed the column names so the first row becomes the column headers
- Remove the first row (which now contains column names)
- Use row names as a new column (for week information)
- Plot
# Extract the first 5 rows
billboard_small <- head(billboard, 5)
billboard_chart <- billboard_small %>%
# Let's just keep the columns we want, and drop artist name and date
dplyr::select(-c(artist, date.entered)) %>%
# Let's flip the data so the weeks are row names, not column names
t() %>%
# And make it back into a data frame
data.frame()
# The column names are wrong now, so let's move them up from the top row
colnames(billboard_chart) <- billboard_chart[1,]
# And then let's get rid of that useless first row
billboard_chart <- billboard_chart[-1,]
# And let's use those row names as a column
billboard_chart$week <- row.names(billboard_chart)
# Now we can plot a chart of one column against week!
# It's an ugly bar chart because it doesn't even work as a line chart
ggplot(billboard_chart, aes(x = week, y = `Kryptonite`))+
geom_bar(stat = "identity")
This approach works, but it is not efficient and feels more like working in Excel, where each step requires manual manipulation. You needed to switch from the tidyverse to base R halfway through because the tidyverse does not offer the necessary functions. If you wanted to analyse another artist, you would have to redo everything, and you have not even added the other tracks to the chart yet. It can become exhausting and frustrating, making you wish you could just do this in Excel. There are also many issues with the chart that make it difficult to interpret and visualise properly.
1.6.2 The Tid(ie)R way of thinking
billboard_tidy <- billboard_small %>%
# Let's shift this data from wide to tidy form
tidyr::pivot_longer(cols = -c(artist, track, date.entered),
names_to = "week",
values_to = "position") %>%
# We can even clean up the week values
dplyr::mutate(week = as.numeric(gsub("wk", "", week)))
# Now we can plot a chart of week against position
p <-
ggplot(billboard_tidy, aes(x = week, y = position, colour = track, group = track))+
geom_line()
It feels strange to pull data into that format, but it is much easier to pass the data to functions in the format they expect. The tidyverse provides simple, intuitive functions for the tasks you are trying to accomplish, allowing you to create charts that are both easy and attractive. Overall, it is faster than doing the same work in Excel!
You could either make prettier charts:
Here it is faceted out to individual charts by artist:
And it is even easy to change the original code to show data for the top 10 tracks:
billboard_tidy <- head(billboard, 20) %>%
# Let's shift this data from wide to tidy form
tidyr::pivot_longer(cols = -c(artist, track, date.entered),
names_to = "week",
values_to = "position") %>%
# We can even clean up the week values
dplyr::mutate(week = as.numeric(gsub("wk", "", week)))
# Now we can plot a chart of week against position
p <-
ggplot(billboard_tidy, aes(x = week, y = position, colour = track, group = track))+
geom_line()+
facet_wrap(~artist)+
theme_minimal()
p
1.6.3 Using tidyr for data analysis
This is not just the case with data visualisation, it is the same if you would like to do some data analysis too.
religion | <$10k | $10-20k | $20-30k | $30-40k | $40-50k | $50-75k | $75-100k | $100-150k | >150k |
---|---|---|---|---|---|---|---|---|---|
Agnostic | 27 | 34 | 60 | 81 | 76 | 137 | 122 | 109 | 84 |
Atheist | 12 | 27 | 37 | 52 | 35 | 70 | 73 | 59 | 74 |
Buddhist | 27 | 21 | 30 | 34 | 33 | 58 | 62 | 39 | 53 |
Catholic | 418 | 617 | 732 | 670 | 638 | 1116 | 949 | 792 | 633 |
Don’t know/refused | 15 | 14 | 15 | 11 | 10 | 35 | 21 | 17 | 18 |
Evangelical Prot | 575 | 869 | 1064 | 982 | 881 | 1486 | 949 | 723 | 414 |
Hindu | 1 | 9 | 7 | 9 | 11 | 34 | 47 | 48 | 54 |
Historically Black Prot | 228 | 244 | 236 | 238 | 197 | 223 | 131 | 81 | 78 |
Jehovah’s Witness | 20 | 27 | 24 | 24 | 21 | 30 | 15 | 11 | 6 |
Jewish | 19 | 19 | 25 | 25 | 30 | 95 | 69 | 87 | 151 |
Again, the relig_income
dataset is not what you would think of as “messy”, it is logical and easy to read for a human. But what if we wanted to multiply all the values in the columns by 1000 (to convert from k figures to actual figures)?
It is easy to think about how you would do it in Excel; one column at a time:
relig_thousand <- relig_income %>%
# Use mutate to multiply each column by 1000
dplyr::mutate(`<$10k` = `<$10k` * 1000,
`$10-20k` = `$10-20k` * 1000,
`$20-30k` = `$20-30k` * 1000,
`$30-40k` = `$30-40k` * 1000)
# We're not done here, I'm just bored...
It is a lot of coding; doing it for each column would take another 8 repetitions of this!
There are slightly easier ways to do it (using mutate and across in combination), but again it is already feeling trickier than it needs to be. And crucially, that is one of the most simple actions we might want to take with this dataset, imagine if you wanted to group together different incomes into wider groups, or convert $ values into £?
How does tidyverse want you to do it? Look at this code snippet below:
relig_thousand <- relig_income %>%
# Again, let's pull all the data into one tidy column
tidyr::pivot_longer(cols = -religion,
names_to = "salary",
values_to = "count") %>%
# Then just multiply all the counts by 1000
dplyr::mutate(count = count * 1000)
This is easier to write, easier to QA, and crucially easier to keep up to date. If they changed the salary groupings (and therefore the column names), the first lot of code would break, but the second lot would keep working.