Chapter 1 Introduction

1.1 Overview

Data in the wild can be scary–when confronted with a complicated and messy dataset you may find yourself wondering, where do I even start? The {tidyr} package allows you the wrangle such beasts into nice and tidy datasets. This course is designed the concept and application of tidy data in R, and how it can be used to make code simple to read and write.

1.1.1 Session aims

understand the concept of tidy data and why it is important for data analysis
learn how to use the {tidyr} package to wrangle messy, complicated datasets into tidy formats
use key {tidyr} functions such as pivot_longer(), pivot_wider() and separate()
apply tidy data principles to simplify your code, making it easier to read, write and maintain
use tidy datasets to create clear and insightful data visualisations

1.2 Course instructions

This course is available as a self-directed learning option. To complete it, you will need to get access to Cloud R account and download the exercise repository as a ZIP file and upload it into your RStudio environment:

Download the exercise repository ZIP file (tidy_data_workshop.zip) to your local computer
Create a new folder on your computer

Name the folder: tidyr_training

Open the downloaded ZIP file (do not extract it)

Double-click to open the ZIP file like a normal folder
Select all files and folders inside (e.g. .Rproj, Data/, etc.)
Copy them into your new tidyr_training folder

Make sure you’re copying the contents only, not the ZIP folder itself.

Compress the folder

Right-click the tidyr_training folder
Choose: Send to > Compressed (zipped) folder
This creates a new file: tidyr_training.zip

Upload to Cloud R

Log into your Cloud R account and open RStudio
In RStudio, go to the Files Pane (usually bottom right)
Click Upload, then browse to your new tidyr_training.zip file

You can now navigate into the extracted folder and start working on the exercises.

Each exercise includes a Solution dropdown, where you can find helpful prompts and answers.

You can also listen to a recording of a live session, available on SharePoint, to follow along with the course content. The recordings cover the material in Chapters 1 and 2 only, and do not include the practice exercises in Chapter 3.

After completing the course, please fill out the feedback form to help us improve the experience of DfT coding courses. You can also join the DfT R community to post questions, share knowledge and get support from other R users.

1.3 Pressumed knowledge

Users of this book should feel comfortable with using R and basic graphics with ggplot2. Our Introduction to R provides a brief introduction to the tidying data concept, as well as basic plotting and data manipulation, both of which are needed to complete this course.

1.4 What is tidy data?

You are likely to be most familiar with data laid out in a “messy” structure like the one below:

country	continent	1967	1972	1977	1982	1987	1992
Albania	Europe	66.22	67.69	68.93	70.42	72.00	71.581
Austria	Europe	70.14	70.63	72.17	73.18	74.94	76.040
Belgium	Europe	70.94	71.44	72.80	73.93	75.35	76.460
Bosnia and Herzegovina	Europe	64.79	67.45	69.86	70.69	71.14	72.178
Bulgaria	Europe	70.42	70.90	70.81	71.08	71.34	71.190
Croatia	Europe	68.50	69.61	70.64	70.46	71.52	72.527
Czech Republic	Europe	70.38	70.29	70.71	70.96	71.58	72.400
Denmark	Europe	72.96	73.47	74.69	74.63	74.80	75.330
Finland	Europe	69.83	70.87	72.52	74.55	74.83	75.700
France	Europe	71.55	72.38	73.83	74.89	76.34	77.460

Here, the data has one row per country and then multiple columns, with one for each year. This may seem relatively logical and quite tidy, but the problems start when you realise that there are multiple other quite logical and tidy ways to lay this data out if there are few restrictions.

For example, it could quite easily be shown with one country per column, and a year per row:

year	Afghanistan	Albania	Algeria	Angola	Argentina	Australia	Austria
1952	28.801	55.230	43.077	30.015	62.485	69.12	66.80
1957	30.332	59.280	45.685	31.999	64.399	70.33	67.48
1962	31.997	64.820	48.303	34.000	65.142	70.93	69.54
1967	34.020	66.220	51.407	35.985	65.634	71.10	70.14
1972	36.088	67.690	54.518	37.928	67.065	71.93	70.63
1977	38.438	68.930	58.014	39.483	68.481	73.49	72.17
1982	39.854	70.420	61.368	39.942	69.942	74.74	73.18
1987	40.822	72.000	65.799	39.906	70.774	76.32	74.94
1992	41.674	71.581	67.744	40.647	71.868	77.56	76.04
1997	41.763	72.950	69.152	40.963	73.275	78.83	77.51

With so many different ways data can be structured, it’s almost impossible to apply a standard approach when working with non-standardised data. For example, if you need the most recent year from an annual dataset, would you select a column or filter rows? It completely depends on how the dataset is organised!

Simply put, “tidy” data is data designed to be easy for coding programmes like R to read in and use. It is structured in a way that make processes efficient, both to write code for an for R to perform. Tidy data structures are always the same, regardless of the data inside the structure.

A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types.

In tidy data:

each variable is a column; each column is a variable
each observation is a row; each row is an observation
each value is a cell; each cell is a single value

Each column represents a type of information, each row is a single obversation, and each cell holds one specific value.

1.5 A Tidy example

country	continent	year	lifeExp	pop	gdpPercap
Afghanistan	Asia	1952	28.801	8425333	779.4453
Afghanistan	Asia	1957	30.332	9240934	820.8530
Afghanistan	Asia	1962	31.997	10267083	853.1007
Afghanistan	Asia	1967	34.020	11537966	836.1971
Afghanistan	Asia	1972	36.088	13079460	739.9811
Afghanistan	Asia	1977	38.438	14880372	786.1134
Afghanistan	Asia	1982	39.854	12881816	978.0114
Afghanistan	Asia	1987	40.822	13867957	852.3959
Afghanistan	Asia	1992	41.674	16317921	649.3414
Afghanistan	Asia	1997	41.763	22227415	635.3414

In this dataset, each column represents a variable, such as the year, life expectancy, country, population, and GDP per capita. These variables are categories that describe the data, rather than individual values.

For example, year is a variable, but each possible year in that column is a value, and therefore should not be assigned its own column.

Every row is an observation; a unique combination of the variables included in the data. For example, there is only one observation (row) of data in Afghanistan for 1952, and then a separate one for the same country in 1957.

1.6 Understanding Tidy Data

1.6.1 The Opposite of Tidy Data is Not “Bad” Data

It is important to recognise that untidy data is not necessarily “bad” data. When we say data is not in a tidy format, it does not mean the data is flawed or unusable. Often, human-readable data is “messy” because it is designed to be understood by people, not machines. This format tends to be more flexible and easier to interpret at a glance. However, while human-readable data might be easy for people to understand, it is not always the most efficient or effective for analysis, especially when using computers.

Computers, however, do not read data the same way humans do. They require data to be in a specific format – one that is well-structured and consistent. This is where tidy data comes in. Tidy data is essential for computers to process efficiently because it enables quicker operations, analysis, and easier integration with various software tools.

1.6.2 Tidyverse demands Tidy Data

The tidyverse (a collection of R packages such as dplyr::, tidyr::, ggplot2::, etc.) is designed with tidy data in mind. These packages are built to work seamlessly with tidy data, providing intuitive and simple ways to manipulate, clean, and visualise data.

If you are using the tidyverse and find yourself writing a lot of code to perform simple tasks, or if you are looping through data unnecessarily, it could be because your data is not in the tidy format the tidyverse expects. In such cases, you may need to transform your data into a tidy format to make the most of the tools available to you.

1.6.3 Harmonising code

When data is in a tidy format, it makes it much easier to standardise your code, functions, and processes. Tidy data eliminates the need for repetitive tasks and simplifies the logic behind data analysis. For example, by ensuring your data is tidy, you can more easily apply functions across multiple datasets, allowing for code reuse and reduced complexity.

One of the biggest benefits of using tidy data is that it allows you to reuse code across different projects. When datasets are structured consistently, many tasks can be handled with the same code. This means that if multiple projects use tidy data, you can recycle large chunks of code without having to write new code for every new dataset. This saves time, reduces errors, and ensures consistency across projects.

1.7 Think Excel, Think R

Many people begin their data wrangling journey in Excel, but Excel is not designed as a data processing tool. It is built for humans to interact with, not computers, which means it prioritises readability over efficiency.

In R, you do not need to be able to read your data while the code is running. If you are not using tidy data, you are sacrificing efficiency without any real benefit.

1.7.1 The Excel way of thinking

To showcase the difference between the tidy R and Excel perspectives, we will be working with the billboard dataset from the tidyr:: package.

artist	track	date.entered	wk1	wk2	wk3	wk4	wk5	wk6	wk7	wk8
2 Pac	Baby Don’t Cry (Keep…	2000-02-26	87	82	72	77	87	94	99	NA
2Ge+her	The Hardest Part Of …	2000-09-02	91	87	92	NA	NA	NA	NA	NA
3 Doors Down	Kryptonite	2000-04-08	81	70	68	67	66	57	54	53
3 Doors Down	Loser	2000-10-21	76	76	72	69	67	65	55	59
504 Boyz	Wobble Wobble	2000-04-15	57	34	25	17	17	31	36	49
98^0	Give Me Just One Nig…	2000-08-19	51	39	34	26	26	19	2	2
A*Teens	Dancing Queen	2000-07-08	97	97	96	95	100	NA	NA	NA
Aaliyah	I Don’t Wanna	2000-01-29	84	62	51	41	38	35	35	38
Aaliyah	Try Again	2000-03-18	59	53	38	28	21	18	16	14
Adams, Yolanda	Open My Heart	2000-08-26	76	76	74	69	68	67	61	58

It has details of the artist, track name, date.entered and then a separate column for each week rating.

While this format is easy to read and interpret by humans, it is not the most efficient format for analysis, especially when it comes to visualisations or statistical analyses using R, such as creating line charts or aggregating data by week.

Let’s say you want to chart the first 5 rows in ggplot to examine the rating of tracks over time. How would you do that? In ggplot line charts, you need to have a x and y variable. Do we have this in our dataset? No. So, how do we fix this?

If you were thinking in an “Excel-way”, you might:

Extracted the first 5 rows
Selected specific columns and drop unwanted ones
Flipped the data (transpose the table)
Convert the transpost data into a data frame
Fixed the column names so the first row becomes the column headers
Remove the first row (which now contains column names)
Use row names as a new column (for week information)
Plot

# Extract the first 5 rows
billboard_small <- head(billboard, 5)

billboard_chart <- billboard_small %>%
  # Let's just keep the columns we want, and drop artist name and date
  dplyr::select(-c(artist, date.entered)) %>%
  #  Let's flip the data so the weeks are row names, not column names
  t() %>%
  # And make it back into a data frame
  data.frame()

# The column names are wrong now, so let's move them up from the top row
colnames(billboard_chart) <- billboard_chart[1,]
# And then let's get rid of that useless first row
billboard_chart <- billboard_chart[-1,] 
# And let's use those row names as a column
billboard_chart$week <- row.names(billboard_chart)

# Now we can plot a chart of one column against week!
# It's an ugly bar chart because it doesn't even work as a line chart
ggplot(billboard_chart, aes(x = week, y = `Kryptonite`))+
  geom_bar(stat = "identity")

This approach works, but it is not efficient and feels more like working in Excel, where each step requires manual manipulation. You needed to switch from the tidyverse to base R halfway through because the tidyverse does not offer the necessary functions. If you wanted to analyse another artist, you would have to redo everything, and you have not even added the other tracks to the chart yet. It can become exhausting and frustrating, making you wish you could just do this in Excel. There are also many issues with the chart that make it difficult to interpret and visualise properly.

1.7.2 The Tid(ie)R way of thinking

billboard_tidy <- billboard_small %>%
  # Let's shift this data from wide to tidy form
  tidyr::pivot_longer(cols = -c(artist, track, date.entered),
                      names_to = "week",
                      values_to = "position") %>%
  # We can even clean up the week values
  dplyr::mutate(week = as.numeric(gsub("wk", "", week)))

# Now we can plot a chart of week against position
p <- 
  ggplot(billboard_tidy, aes(x = week, y = position, colour = track, group = track))+
  geom_line()

It feels strange to pull data into that format, but it is much easier to pass the data to functions in the format they expect. The tidyverse provides simple, intuitive functions for the tasks you are trying to accomplish, allowing you to create charts that are both easy and attractive. Overall, it is faster than doing the same work in Excel!

You could either make prettier charts:

p +
  theme_minimal() +
  scale_colour_brewer(palette = "Dark2")

Here it is faceted out to individual charts by artist:

p +
  facet_wrap(~artist)+
  theme_minimal() +
  scale_colour_brewer(palette = "Dark2")

And it is even easy to change the original code to show data for the top 10 tracks:

billboard_tidy <- head(billboard, 20) %>%
  # Let's shift this data from wide to tidy form
  tidyr::pivot_longer(cols = -c(artist, track, date.entered),
                      names_to = "week",
                      values_to = "position") %>%
  # We can even clean up the week values
  dplyr::mutate(week = as.numeric(gsub("wk", "", week)))

# Now we can plot a chart of week against position
p <- 
  ggplot(billboard_tidy, aes(x = week, y = position, colour = track, group = track))+
  geom_line()+
  facet_wrap(~artist)+
  theme_minimal() 
   
p

1.7.3 Using {tidyr} for data analysis

This is not just the case with data visualisation, it is the same if you would like to do some data analysis too.

religion	<$10k	$10-20k	$20-30k	$30-40k	$40-50k	$50-75k	$75-100k	$100-150k	>150k
Agnostic	27	34	60	81	76	137	122	109	84
Atheist	12	27	37	52	35	70	73	59	74
Buddhist	27	21	30	34	33	58	62	39	53
Catholic	418	617	732	670	638	1116	949	792	633
Don’t know/refused	15	14	15	11	10	35	21	17	18
Evangelical Prot	575	869	1064	982	881	1486	949	723	414
Hindu	1	9	7	9	11	34	47	48	54
Historically Black Prot	228	244	236	238	197	223	131	81	78
Jehovah’s Witness	20	27	24	24	21	30	15	11	6
Jewish	19	19	25	25	30	95	69	87	151

Again, the relig_income dataset is not what you would think of as “messy”, it is logical and easy to read for a human. But what if we wanted to multiply all the values in the columns by 1000 (to convert from k figures to actual figures)?

It is easy to think about how you would do it in Excel; one column at a time:

relig_thousand <- relig_income %>%
  # Use mutate to multiply each column by 1000
  dplyr::mutate(`<$10k` = `<$10k` * 1000,
                `$10-20k` = `$10-20k` * 1000,
                `$20-30k` = `$20-30k` * 1000,
                `$30-40k` = `$30-40k` * 1000)

# We're not done here, I'm just bored...

It is a lot of coding; doing it for each column would take another 8 repetitions of this!

There are slightly easier ways to do it (using mutate and across in combination), but again it is already feeling trickier than it needs to be. And crucially, that is one of the most simple actions we might want to take with this dataset, imagine if you wanted to group together different incomes into wider groups, or convert $ values into £?

How does tidyverse want you to do it? Look at this code snippet below:

relig_thousand <- relig_income %>%
  # Again, let's pull all the data into one tidy column
  tidyr::pivot_longer(cols = -religion,
                      names_to = "salary",
                      values_to = "count") %>%
  # Then just multiply all the counts by 1000
  dplyr::mutate(count = count * 1000)

This is easier to write, easier to QA, and crucially easier to keep up to date. If they changed the salary groupings (and therefore the column names), the first lot of code would break, but the second lot would keep working.

Tidy Data in R