Chapter 3 Practice
Congratulations! You have reached the end of the tutor-led session and learning material on Tidy Data in R. Hopefully, you now feel confident in transforming and tidying your data using the tools and techniques we have covered. You have learned how to manipulate data efficiently with functions like pivot_longer()
, pivot_wider()
and separate()
, making it ready for analysis.
You are welcome to continue your self-directed learning by completing the three 20-minute exercises below! You can complete the exercises in the tidyr_practice.R
found in the training repository.
Remember to complete the feedback form (it’s all anonymous) to help us improve your learning journey to coding at DfT.
Here are a few useful links:
- a series of recordings of the Tidy Data workshop
- Department for Transport R Cookbook (guidebook to useful R topics)
- CRAN Wiki (a Wikipedia page detailing the how, where and what of coding at DfT)
- CRANLanD (our internal website to register for coding courses offered at DfT)
- Crimes Against R (slidepack on how to eliminate big risks you probably have in your code)
Our intermediate to advanced R courses:
3.1 Exercise
20:00
You have been tasked to create a chart of concessionary bus journeys taken in various regions of the UK, as a percentage of the total.
- Read in the bus data
bus0105.xlsx
from theData
directory in the training repository - Reshape the data into a tidy format (using the
tidyr::pivot_longer()
function), ensuring there is just one column for region names and one column for values - Convert the values column into a numeric format
- Group the data by region and add a new column called
percent
. This column should calculate the percentage of the total for each region - Split the financial years in the year column so it only includes the first year (e.g. 2022-23 becomes 2022)
- Create a chart showing the percentage against the year, using different colours to represent each region
3.1 Solution
#Question 1
bus_data <- readxl::read_excel("Data/bus0105.xlsx", skip = 5) %>%
dplyr::select(`Year`,
`London`,
`English metropolitan areas`,
`English non-metropolitan areas`,
`England`,
`Scotland`,
`Wales`)
#Question 2
bus_long <- bus_data %>%
tidyr::pivot_longer(cols = -Year, names_to = "region", values_to = "value")
#Question 3
bus_mutate <- bus_long %>%
dplyr::mutate(value = as.numeric(value))
#Question 4
bus_group <- bus_mutate %>%
dplyr::group_by(region) %>%
dplyr::mutate(percent = (value / sum(value, na.rm = TRUE)) * 100) %>%
dplyr::ungroup()
#Question 5
bus_split <- bus_group %>%
tidyr::separate(col = Year, into = c("year_start", "year_end"), sep = "/") %>%
dplyr::mutate(year = as.numeric(year_start)) %>%
dplyr::select(year, region, value, percent)
#Question 6
ggplot(bus_split, aes(x = year, y = percent, colour = region)) +
geom_line() +
labs(
title = "% of Total by Region Over Time",
x = "",
y = ""
) +
# Create a facet for each region
facet_wrap(~ region, scales = "free_y") +
theme_minimal() +
theme(
# Remove legend since regions are in facets
legend.position = "none",
# Style facet labels
strip.text = element_text(size = 10, face = "bold")
)
3.2 Exercise
20:00
You have been tasked to create a chart of miles travelled by grouped transport type from a sample of NTS data.
- Read in the ntsq table,
ntsq01003.xlsx
, from theData
directory in the training repository - Reshape the data into a tidy format (using the
tidyr::pivot_longer()
function) so each value has its own column - Ensure all values are in numeric format
- Pivot the data into a wider format (using the
tidyr::pivot_wider()
function) so each mode of transport has its own column - Create new columns for:
- Active transport: Sum of walking and cycling
- Private transport: Sum of car/van drivers, car/van passengers, motorcycles, and other private modes
- Public transport: Sum of buses, other buses, underground, rail and other public modes
Additionally, select to keep only these new columns
- Convert the data back to a tidy (long) format with a single column for transport type
- Create a chart showing the percentage over the years, with transport type represented by different colours
- Use faceting in
ggplot2::
to create three separate charts based on the purpose of transport
3.2 Solution
#Question 1
nts_data <- readxl::read_excel("Data/ntsq01003.xlsx", skip = 7)
#Question 2
nts_long <- nts_data %>%
tidyr::pivot_longer(cols = -c(Mode, Purpose),
names_to = "Year",
values_to = "Value")
#Question 3
nts_numeric <- nts_long %>%
dplyr::mutate(Value = as.numeric(Value))
#Question 4
nts_wide <- nts_numeric %>%
tidyr::pivot_wider(names_from = "Mode",
values_from = "Value")
#Question 5
nts_mutate <- nts_wide %>%
dplyr::mutate(
`Active transport` = Walk1 + Bicycle,
`Private transport` = `Car / van driver` + `Car / van passenger` + `Motorcycle` + `Other private transport2`,
`Public transport` = `Bus in London` + `Other local bus` + `London Underground` + `Surface Rail` + `Other public transport3`
) %>%
dplyr::select(Year,
`Active transport`,
`Private transport`,
`Public transport`,
Purpose)
#Question 6
nts_convert_long <- nts_mutate %>%
tidyr::pivot_longer(cols = -c(Year, Purpose),
names_to = "Mode",
values_to = "Value")
#Question 7 & 8
#custom colour palette
custom_colours <- c(
"Active transport" = "#004D3B",
"Private transport" = "#969810",
"Public transport" = "#001A70"
)
ggplot(nts_convert_long, aes(x = Year, y = Value, colour = Mode, group = Mode)) +
# Use linewidth instead of size
geom_line(linewidth = 1) +
labs(
title = "% by Mode, 2002 to 2017",
x = "",
y = "",
colour = ""
) +
# Facet by purpose
facet_wrap(~ Purpose, scales = "fixed") +
scale_x_discrete(breaks = c(2002, 2007, 2012, 2017)) +
scale_colour_manual(values = custom_colours) +
theme_minimal() +
theme(
# Place legend at the top
legend.position = "top",
# Style facet labels
strip.text = element_text(size = 8, face = "bold")
)
3.3 Exercise
20:00
You have been tasked to create a chart of the ratio of passengers to seats for rail services arriving to and departing from major stations in the UK.
- Read in the rail table,
rai0201.xlsx
, from theData
directory in the training repository - Tidy the data so it is in a structured format (using the
tidyr::pivot_longer()
function) - Split the column with combined names into two separate columns: one for arrival/departure time and the other for variable type (using the
tidyr::separate()
function) - Further divide the arrival/departure time column into two new columns: one for “arrival or departure” and the other for “time”
- Reshape the data so there are individual columns for the number of services, passengers and seats
When reshaping the data using pivot_wider()
, you may encounter duplicate rows for the same combination of Station, Arrival or departure, and Time. To handle this, you need to include a new argument called values_fn
, which should be set to sum
. This ensures that if there are duplicate values for the same combination, they are aggregated by summing them together.
- Add a new column to calculate the ratio of passengers to seats
- Restructure the data back into a tidy (long) format, where the services, passengers, seats and ratio columns are now in one column
- Create a bar chart showing the station name on the x-axis and the ratio of passengers to seats on the y-axis
- Use different colours for bars to indicate whether they relate to departures or arrivals
- Use facets to split the chart by peak time and all-day data
3.3 Solution
#Question 1
rail_data <- readxl::read_excel("Data/rai0201.xlsx", skip = 6) %>%
dplyr::select(-c("...5", "...9", "...13"))
#Question 2
rail_long <- rail_data %>%
tidyr::pivot_longer(cols = -Station,
names_to = "Journeys",
values_to = "Value") %>%
dplyr::mutate(Value = as.numeric(Value))
#Question 3 & 4
rail_separate <- rail_long %>%
# Separate the 'Journeys' column into 'Time' and 'Variable'
tidyr::separate(col = Journeys,
into = c("Time", "Variable"),
sep = "_") %>%
# Further separate the 'Time' column into 'Arrival or departure' and 'Time'
tidyr::separate(col = Time,
into = c("Arrival or departure", "Time"),
sep = " ")
#Question 5
rail_wide <- rail_separate %>%
tidyr::pivot_wider(names_from = "Variable",
values_from = "Value",
values_fn = sum)
#Question 6
rail_ratio <- rail_wide %>%
dplyr::mutate(`Passengers per seat ratio` = Passengers / `Total seats`)
#Question 7
rail_final <- rail_ratio %>%
tidyr::pivot_longer(cols = -c(Station, `Arrival or departure`, Time),
names_to = "Variable",
values_to = "Value")
#Question 8
#additional data cleaning
rail_prepare <- rail_final %>%
dplyr::filter(Variable == "Passengers per seat ratio") %>%
dplyr::mutate(Station = stringr::str_remove(Station, "\\d+$"))
ggplot(rail_prepare, aes(x = Station, y = Value, fill = `Arrival or departure`)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~ Time, scales = "fixed") +
scale_fill_manual(values = c("#FE5500", "#8A003E", "#FF479A")) +
labs(
title = "Ratio of Passengers to Seats by Station and Time of Day",
x = "",
y = "",
fill = "Arrival or departure"
) +
theme_minimal() +
coord_flip() +
theme(strip.text = element_text(size = 12, face = "bold")
)