Chapter 3 Practice

Congratulations! You have reached the end of the tutor-led session and learning material on Tidy Data in R. Hopefully, you now feel confident in transforming and tidying your data using the tools and techniques we have covered. You have learned how to manipulate data efficiently with functions like pivot_longer(), pivot_wider() and separate(), making it ready for analysis.

You are welcome to continue your self-directed learning by completing the three 20-minute exercises below! You can complete the exercises in the tidyr_practice.R found in the training repository.

Remember to complete the feedback form (it’s all anonymous) to help us improve your learning journey to coding at DfT.

Here are a few useful DfT resources:

a series of recordings of the Tidy Data workshop
Department for Transport R Cookbook

A guidebook to useful R topics
CRAN Wiki

A Wikipedia page detailing the how, where and what of coding at DfT
CRANLanD

Our internal website to register for and access coding courses offered at DfT
Crimes Against R

A slidepack on how to eliminate big risks you probably have in your code
Coding and Reproducible Analytical Network Microsoft Teams (MS) Channel

Join our internal MS channel where you can find information about upcoming training courses; coffee and coding talks, and share tips, materials and advice on anything coding related. We also have R-specific sub-channels within the MS channel, including “Beginner coding questions”, “R” and “R Shiny”

Our intermediate to advanced R courses:

3.1 Exercise

20:00

You have been tasked to create a chart of concessionary bus journeys taken in various regions of the UK, as a percentage of the total.

Read in the bus data bus0105.xlsx from the Data directory in the training repository
Reshape the data into a tidy format (using the tidyr::pivot_longer() function), ensuring there is just one column for region names and one column for values
Convert the values column into a numeric format
Group the data by region and add a new column called percent. This column should calculate the percentage of the total for each region
Split the financial years in the year column so it only includes the first year (e.g. 2022-23 becomes 2022)
Create a chart showing the percentage against the year, using different colours to represent each region

3.1 Solution

#Question 1
bus_data <- readxl::read_excel("Data/bus0105.xlsx", skip = 5) %>%
  dplyr::select(`Year`,
                `London`,
                `English metropolitan areas`,
                `English non-metropolitan areas`,
                `England`, 
                `Scotland`, 
                `Wales`)

#Question 2
bus_long <- bus_data %>%
  tidyr::pivot_longer(cols = -Year, names_to = "region", values_to = "value") 

#Question 3
bus_mutate <- bus_long %>%
  dplyr::mutate(value = as.numeric(value))

#Question 4
bus_group <- bus_mutate %>%
  dplyr::group_by(region) %>%
  dplyr::mutate(percent = (value / sum(value, na.rm = TRUE)) * 100) %>%
  dplyr::ungroup()

#Question 5
bus_split <- bus_group %>%
  tidyr::separate(col = Year, into = c("year_start", "year_end"), sep = "/") %>%
  dplyr::mutate(year = as.numeric(year_start)) %>%
  dplyr::select(year, region, value, percent)

#Question 6
ggplot(bus_split, aes(x = year, y = percent, colour = region)) +
geom_line() +
  labs(
    title = "% of Total by Region Over Time",
    x = "",
    y = ""
    ) +
  # Create a facet for each region
  facet_wrap(~ region, scales = "free_y") + 
  theme_minimal() +
  theme(
    # Remove legend since regions are in facets
    legend.position = "none", 
    # Style facet labels
    strip.text = element_text(size = 8, face = "bold") 
  )

3.2 Exercise

20:00

You have been tasked to create a chart of miles travelled by grouped transport type from a sample of NTS data.

Read in the ntsq table, ntsq01003.xlsx, from the Data directory in the training repository
Reshape the data into a tidy format (using the tidyr::pivot_longer() function) so each value has its own column
Ensure all values are in numeric format
Pivot the data into a wider format (using the tidyr::pivot_wider() function) so each mode of transport has its own column
Create new columns for:

Active transport: Sum of walking and cycling
Private transport: Sum of car/van drivers, car/van passengers, motorcycles, and other private modes
Public transport: Sum of buses, other buses, underground, rail and other public modes

Additionally, select to keep only these new columns

Convert the data back to a tidy (long) format with a single column for transport type
Create a chart showing the percentage over the years, with transport type represented by different colours
Use faceting in ggplot2:: to create three separate charts based on the purpose of transport

3.2 Solution

#Question 1
nts_data <- readxl::read_excel("Data/ntsq01003.xlsx", skip = 7)

#Question 2
nts_long <- nts_data %>%
  tidyr::pivot_longer(cols = -c(Mode, Purpose), 
                      names_to = "Year", 
                      values_to = "Value")

#Question 3
nts_numeric <- nts_long %>%
  dplyr::mutate(Value = as.numeric(Value))

#Question 4
nts_wide <- nts_numeric %>%
  tidyr::pivot_wider(names_from = "Mode",
                     values_from = "Value")

#Question 5
nts_mutate <- nts_wide %>%
  dplyr::mutate(
    `Active transport` = Walk1 + Bicycle,
    `Private transport` = `Car / van driver` + `Car / van passenger` + `Motorcycle` + `Other private transport2`,
    `Public transport` = `Bus in London` + `Other local bus` + `London Underground` + `Surface Rail` + `Other public transport3`
  ) %>%
  dplyr::select(Year, 
                `Active transport`, 
                `Private transport`, 
                `Public transport`, 
                Purpose)

#Question 6
nts_convert_long <- nts_mutate %>%
  tidyr::pivot_longer(cols = -c(Year, Purpose),
                      names_to = "Mode",
                      values_to = "Value")

#Question 7 & 8
#custom colour palette
custom_colours <- c(
  "Active transport" = "#004D3B",   
  "Private transport" = "#969810",   
  "Public transport" = "#001A70"    
)

ggplot(nts_convert_long, aes(x = Year, y = Value, colour = Mode, group = Mode)) +
  # Use linewidth instead of size
  geom_line(linewidth = 1) + 
  labs(
    title = "% by Mode, 2002 to 2017",
    x = "",
    y = "",
    colour = ""
  ) +
  # Facet by purpose
  facet_wrap(~ Purpose, scales = "fixed") + 
  scale_x_discrete(breaks = c(2002, 2007, 2012, 2017)) +
  scale_colour_manual(values = custom_colours) +
  theme_minimal() +
  theme(
    # Place legend at the top
    legend.position = "top", 
    # Style facet labels
    strip.text = element_text(size = 8, face = "bold") 
  )

3.3 Exercise

20:00

You have been tasked to create a chart of the ratio of passengers to seats for rail services arriving to and departing from major stations in the UK.

Read in the rail table, rai0201.xlsx, from the Data directory in the training repository
Tidy the data so it is in a structured format (using the tidyr::pivot_longer() function)
Split the column with combined names into two separate columns: one for arrival/departure time and the other for variable type (using the tidyr::separate() function)
Further divide the arrival/departure time column into two new columns: one for “arrival or departure” and the other for “time”
Reshape the data so there are individual columns for the number of services, passengers and seats

When reshaping the data using pivot_wider(), you may encounter duplicate rows for the same combination of Station, Arrival or departure, and Time. To handle this, you need to include a new argument called values_fn, which should be set to sum. This ensures that if there are duplicate values for the same combination, they are aggregated by summing them together.

Add a new column to calculate the ratio of passengers to seats
Restructure the data back into a tidy (long) format, where the services, passengers, seats and ratio columns are now in one column
Create a bar chart showing the station name on the x-axis and the ratio of passengers to seats on the y-axis

Use different colours for bars to indicate whether they relate to departures or arrivals
Use facets to split the chart by peak time and all-day data

3.3 Solution

#Question 1
rail_data <- readxl::read_excel("Data/rai0201.xlsx", skip = 6) %>%
  dplyr::select(-c("...5", "...9", "...13"))

#Question 2
rail_long <- rail_data %>%
  tidyr::pivot_longer(cols = -Station,
                      names_to = "Journeys",
                      values_to = "Value") %>%
  dplyr::mutate(Value = as.numeric(Value))

#Question 3 & 4
rail_separate <- rail_long %>%
  # Separate the 'Journeys' column into 'Time' and 'Variable'
  tidyr::separate(col = Journeys, 
                  into = c("Time", "Variable"),
                  sep = "_") %>%
  # Further separate the 'Time' column into 'Arrival or departure' and 'Time'
  tidyr::separate(col = Time,
                  into = c("Arrival or departure", "Time"),
                  sep = " ")

#Question 5
rail_wide <- rail_separate %>%
  tidyr::pivot_wider(names_from = "Variable",
                     values_from = "Value",
                     values_fn = sum)

#Question 6
rail_ratio <- rail_wide %>%
  dplyr::mutate(`Passengers per seat ratio` = Passengers / `Total seats`)

#Question 7
rail_final <- rail_ratio %>%
  tidyr::pivot_longer(cols = -c(Station, `Arrival or departure`, Time),
                      names_to = "Variable",
                      values_to = "Value") 

#Question 8
#additional data cleaning
rail_prepare <- rail_final %>%
  dplyr::filter(Variable == "Passengers per seat ratio") %>%
  dplyr::mutate(Station = stringr::str_remove(Station, "\\d+$"))

ggplot(rail_prepare, aes(x = Station, y = Value, fill = `Arrival or departure`)) +
  geom_bar(stat = "identity", position = "dodge") + 
  facet_wrap(~ Time, scales = "fixed") +
  scale_fill_manual(values = c("#FE5500", "#8A003E", "#FF479A")) +
  labs(
    title = "Ratio of Passengers to Seats by Station and Time of Day",
    x = "",
    y = "",
    fill = "Arrival or departure"
  ) +
  theme_minimal() +
  coord_flip() +
  theme(strip.text = element_text(size = 12, face = "bold")
  )