Chapter 5 Table/Data Frame manipulation

This chapter provides an overview of code examples for table or data frame manipulation (a tidyverse data frame is referred to as a tibble).

One of the main things you will have to do in any R project or RAP project will be manipulating the data that you are using in order to get it into the format you require.

One of the main packages used to manipulate data is the {dplyr} package which we recommend and use throughout this book. The {dplyr} package (and others e.g. {tidyr}) are all part of the tidyverse. The tidyverse is a group of packages developed by Hadley Wickham and others and are all designed to work with each other. See https://www.tidyverse.org/ for more info.

Tidyverse packages and functions can be combined and layered using the pipe operator %>%.

{dplyr} is built to work with tidy data. To find out more about tidy data please look at the following link https://r4ds.had.co.nz/tidy-data.html but the general principles are:

  1. Each variables must have its own column
  2. Each observation must have its own row
  3. Each value must have its own cell

5.1 Pivoting tables

There are two packages still in common use for pivoting data into tidy format.

  • The {tidyr} package is a tidyverse package which contains functions used to create tidy data (among other data cleaning functions)
  • The {reshape2} package is the non-tidyverse equivalent used to pivot tables. No updates have been made to this package in the last few years, and where possible tidyr should always be used in preference to it.

We want to have the day of the week variable running along the top so each day of the week is its own column.

Table 5.1: Number of road accidents by accident severity and weekday
Accident_Severity Day_of_Week n
1 1 300
1 2 205
1 3 187
1 4 233
1 5 220
1 6 250

{tidyr} package

The {tidyr} package is a tidyverse package for manipulating data into a tidy format, and should be the first choice for gathering and pivoting functions. Use of the tidyr package is made slightly more complicated by the fact that there are both legacy (prior to v1.2) and more recent (v1.2 onwards) pivoting verbs available:

gather and spread

The functions gather and spread were the tidyr functions used prior to version 1.2 of the package. These verbs are still supported in more recent versions of the package so old code will still work, but you should aim not to use these in new code.

  • gather makes wide data longer: equivalent is now pivot_longer()

  • spread makes long data wider: equivalent is now pivot_wider()

pivot_longer and pivot_wider

The pivot_longer and pivot_wider functions are the updated version of gather and spread, available in newer versions of the tidyr package. These are designed to have more intuative names, and more logical arguments in comparison to the older versions. You should aim to use these in all new code you are writing.

  • pivot_longer makes wide data longer i.e. variables running along the top can be “gathered” into rows running down.

  • pivot_wider makes long data wider i.e. one variable can be spread and run along the top with each value being a variable.

# Pivot table using newer tidyr package
road_accidents_weekdays <- road_accidents_small %>%
  tidyr::pivot_wider(names_from = Day_of_Week,
                     values_from = n)

In the pivot_wider function, you need to specify the dataset you want to transform, as well as the column you want to take the names from and the one you want to take the values from, in this case Day_of_Week and n respectively.

Table 5.2: Number of road accidents by accident severity and weekday, tidyr::pivot_wider
Accident_Severity 1 2 3 4 5 6 7
1 300 205 187 233 220 250 281
2 3009 2948 3230 3227 3246 3649 3225
3 11668 14783 16065 15859 16331 17346 13720

The opposite can also be done using the pivot_longer function:

# Pivot table using newer tidyr package
road_accidents_gather <- road_accidents_weekdays %>%
  tidyr::pivot_longer(cols = c(`1`, `2`, `3`, `4`, `5`, `6`, `7`), names_to = "weekday", values_to = "n")

To use pivot_longer, specify which columns you want to be gathered into one column (in this case the individual weekday columns). Then specify the column name you’d like to move the names to (weekday) and the one you’d like to move the values to (n).

It is also possible to pivot all columns except the ones you choose to exclude by specifying only those you do not want to pivot, using - to indicate this. In the example below, this is done by excluding the Accident_Severity column and pivoting all other columns by specifying cols = -Accident_Severity. This is useful when you may have a large number of columns or columns with unknown names that you want to turn into tidy data.

# Pivot by excluding the accident severity column 
road_accidents_gather <- road_accidents_weekdays %>%
  tidyr::pivot_longer(cols = -Accident_Severity, names_to = "weekday", values_to = "n")
Table 5.3: Number of road accidents by accident severity and weekday, tidyr::pivot_longer
Accident_Severity weekday n
1 1 300
1 2 205
1 3 187
1 4 233
1 5 220
1 6 250

Extended data transpose example

This section shows how to transpose data using the pivot_wider and pivot_longer functions with the tidyr package.


The aim is to take a published table of reported road accidents in which each row contains the data for one year with the severity of accident recorded in the columns. The tranposition will enable the calculation of the percentage increases for each of the last two years.

Firstly, the current year is specified.

Yr <- 2018   ## select for the latest of the three years you require 

This example uses data from RAS30013 “Reported casualty rates by road user type and severty: Great Britain”. https://www.gov.uk/government/statistical-data-sets/ras30-reported-casualties-in-road-accidents#casualty-rates-and-risk

The contents of the published table is read in (the last 10 years only), and the adjusted data is rounded to the nearest whole number.

Accidents <- readxl::read_excel("data/ras10013.xlsx") %>%
  dplyr::filter(Year >= 2009) %>%
  dplyr::mutate(SeriousAdjusted = round(SeriousAdjusted),
                FatalSeriousAdjusted = round(FatalSeriousAdjusted),
                SlightAdjusted = round(SlightAdjusted))
Year Fatal SeriousUnadjusted SeriousAdjusted FatalSeriousUnadjusted FatalSeriousAdjusted SlightUnadjusted SlightAdjusted AllAccidents
2,009 2,057 21,997 34,198 24,054 36,255 139,500 127,299 163,554
2,010 1,731 20,440 31,831 22,171 33,562 132,243 120,852 154,414
2,011 1,797 20,986 32,476 22,783 34,273 128,691 117,201 151,474
2,012 1,637 20,901 32,005 22,538 33,642 123,033 111,929 145,571
2,013 1,608 19,624 30,243 21,232 31,851 117,428 106,809 138,660
2,014 1,658 20,676 31,889 22,334 33,547 123,988 112,775 146,322
2,015 1,616 20,038 30,474 21,654 32,090 118,402 107,966 140,056
2,016 1,695 21,725 29,072 23,420 30,767 113,201 105,854 136,621
2,017 1,676 22,534 27,938 24,210 29,614 105,772 100,368 129,982
2,018 1,671 23,165 28,266 24,836 29,937 97,799 92,698 122,635
2,019 1,658 23,422 27,222 25,080 28,880 92,456 88,656 117,536

The data is now filtered to keep only the last three years of data.

We could have hard coded the year in the filter below e.g. Year %in% c(2017,2018,2019). To save having to change the code every time we wish to run for different years, we can derive the required years from a previously defined variable. This picks up the most recent year from yr defined above and calculates the other years as 1 and 2 years before yr as yr-1 and yr-2.

A new variable YearLab with the respective text ThisYear, LastYear and YearBefore is created to automate calculations below.

Last3Years <- dplyr::filter(Accidents,
                Year %in% c(Yr, Yr-1, Yr-2)) %>%
  dplyr::mutate(YearLab = dplyr::case_when(
    Year == Yr ~ "ThisYear",
    Year == Yr-1 ~ "LastYear",
    TRUE ~ "YearBefore" # else option 
  )) %>%
  dplyr::select(-Year)
Fatal SeriousUnadjusted SeriousAdjusted FatalSeriousUnadjusted FatalSeriousAdjusted SlightUnadjusted SlightAdjusted AllAccidents YearLab
1,695 21,725 29,072 23,420 30,767 113,201 105,854 136,621 YearBefore
1,676 22,534 27,938 24,210 29,614 105,772 100,368 129,982 LastYear
1,671 23,165 28,266 24,836 29,937 97,799 92,698 122,635 ThisYear

The table above holds the number of accidents for each severity (in columns) with a row for each of the last three years. We wish to calculate the annual growth in accidents by severity. To achieve this we need to transpose the table. This can be achieved using the pivot_longer function within the tidyr package which allows us to put data into a tidy format which makes it easier to manipulate.

The template of the pivot_longer function is:

tidyr::pivot_longer(
      cols = variables_to_input_into_one_column, 
      names_to = column to move data names to, 
      values_to = column to move data values to)

The variables we want to input into one column are the accident severities (Fatal, SeriousUnadjusted, …, AllAccidents); every column except for the YearLab. The names_to parameter we will call SeverityType. The values_to parameter is the name of the column holding the number of accidents (n). The variable YearLabel is the only one that will be unchanged, so can be specifically excluded from the pivot_longer call.

A new field SeverityNum is created so that the severity types can be sorted in a logical order in the final output.

Severity <- Last3Years %>% 
  tidyr::pivot_longer(cols = -YearLab,
                names_to = "SeverityType",
                values_to = "n") %>%
  dplyr::mutate(SeverityNum = dplyr::case_when(
  SeverityType=='Fatal' ~ 1,
  SeverityType=='SeriousUnadjusted' ~ 2,
  SeverityType=='SeriousAdjusted' ~ 3,
  SeverityType=='FatalSeriousUnadjusted' ~ 4,
  SeverityType=='FatalSeriousAdjusted' ~ 5,
  SeverityType=='SlightUnadjusted' ~ 6,
  SeverityType=='SlightAdjusted' ~ 7,
  SeverityType=='AllAccidents' ~ 8,
  TRUE ~ 0 # else option 
))
YearLab SeverityType n SeverityNum
YearBefore Fatal 1,695 1
YearBefore SeriousUnadjusted 21,725 2
YearBefore SeriousAdjusted 29,072 3
YearBefore FatalSeriousUnadjusted 23,420 4
YearBefore FatalSeriousAdjusted 30,767 5
YearBefore SlightUnadjusted 113,201 6
YearBefore SlightAdjusted 105,854 7
YearBefore AllAccidents 136,621 8
LastYear Fatal 1,676 1
LastYear SeriousUnadjusted 22,534 2
LastYear SeriousAdjusted 27,938 3
LastYear FatalSeriousUnadjusted 24,210 4
LastYear FatalSeriousAdjusted 29,614 5
LastYear SlightUnadjusted 105,772 6
LastYear SlightAdjusted 100,368 7
LastYear AllAccidents 129,982 8
ThisYear Fatal 1,671 1
ThisYear SeriousUnadjusted 23,165 2
ThisYear SeriousAdjusted 28,266 3
ThisYear FatalSeriousUnadjusted 24,836 4
ThisYear FatalSeriousAdjusted 29,937 5
ThisYear SlightUnadjusted 97,799 6
ThisYear SlightAdjusted 92,698 7
ThisYear AllAccidents 122,635 8

We now wish to transpose this table so that the three years (ThisYear, LastYear and YearBefore) will be in separate columns rather than one.

To do this, we use the pivot_wider function within the tidyr package.

The template of the pivot_wider function is:

tidyr::pivot_wider(cols = variables_to_change_from rows_into_columns,
      names_from = column_containing_variable_names,
      values_form = column_containing_values)

Here the column containing the names is YearLab, and the column containing the values is the number of accidents n. The variables SeverityType and SeverityNum will each remain in one column so are not listed in the pivot_wider function.

By default, the data would be sorted in alphabetical order of SeverityType. Instead the data is sorted in a pre-defined logical order using the SeverityNum variable defined above. The SeverityNum, having done its job, is then dropped from the final output.

Now that the number of accidents for each year are in a separate column, the percentage growth for one year to the next can be calculated using the mutate function.

Transposed <- Severity %>% 
  tidyr::pivot_wider(names_from = YearLab, 
                     values_from = n) %>%
  dplyr::arrange(SeverityNum) %>%
  dplyr::select(-SeverityNum) %>%
  dplyr::mutate(PerGrowthThisYear = round(ThisYear/LastYear *100-100,1),
                PerGrowthLastYear=round(LastYear/YearBefore *100-100,1))
SeverityType YearBefore LastYear ThisYear PerGrowthThisYear PerGrowthLastYear
Fatal 1,695 1,676 1,671 -0.3 -1.1
SeriousUnadjusted 21,725 22,534 23,165 2.8 3.7
SeriousAdjusted 29,072 27,938 28,266 1.2 -3.9
FatalSeriousUnadjusted 23,420 24,210 24,836 2.6 3.4
FatalSeriousAdjusted 30,767 29,614 29,937 1.1 -3.7
SlightUnadjusted 113,201 105,772 97,799 -7.5 -6.6
SlightAdjusted 105,854 100,368 92,698 -7.6 -5.2
AllAccidents 136,621 129,982 122,635 -5.7 -4.9

5.2 Dropping and selecting columns

Use the {dplyr} select function to both select and drop columns.

Select columns

road_accidents_4_cols <- road_accidents %>%
  dplyr::select(acc_index, Accident_Severity, Date, Police_Force)
Table 5.4: Four columns from road accidents 2017
acc_index Accident_Severity Date Police_Force
2017010001708 1 2017-08-05 1
2017010009342 3 2017-01-01 1
2017010009344 3 2017-01-01 1
2017010009348 3 2017-01-01 1
2017010009350 2 2017-01-01 1
2017010009351 3 2017-01-01 1

Drop columns

As for pivoting, to drop columns the difference is putting a “-” in front of the variable name.

road_accidents_3_cols <- road_accidents_4_cols %>%
  dplyr::select(-Police_Force)
Table 5.5: Three columns from road accidents 2017
acc_index Accident_Severity Date
2017010001708 1 2017-08-05
2017010009342 3 2017-01-01
2017010009344 3 2017-01-01
2017010009348 3 2017-01-01
2017010009350 2 2017-01-01
2017010009351 3 2017-01-01

5.3 Rename variables

Use the rename function from {dplyr} to rename variables where the new variable name is on the left hand side of the = equals sign, and the old variable name is on the right hand.

road_accidents_rename <- road_accidents_4_cols %>%
  dplyr::rename(Date_of_Accident = Date)
Table 5.6: Rename date column to Date_of_Accident
acc_index Accident_Severity Date_of_Accident Police_Force
2017010001708 1 2017-08-05 1
2017010009342 3 2017-01-01 1
2017010009344 3 2017-01-01 1
2017010009348 3 2017-01-01 1
2017010009350 2 2017-01-01 1
2017010009351 3 2017-01-01 1

5.4 Filtering data

Use the {dplyr} filter function to filter data.

This example filters the data for slight severity accidents (accident severity 3).

road_accidents_slight <- road_accidents_4_cols %>%
  dplyr::filter(Accident_Severity == 3)
Table 5.7: Slight severity road accidents 2017
acc_index Accident_Severity Date Police_Force
2017010009342 3 2017-01-01 1
2017010009344 3 2017-01-01 1
2017010009348 3 2017-01-01 1
2017010009351 3 2017-01-01 1
2017010009353 3 2017-01-01 1
2017010009354 3 2017-01-01 1

To filter multiple conditions:

And operator

road_accidents_filter <- road_accidents_4_cols %>%
  dplyr::filter(Accident_Severity == 3 & Police_Force == 4)

Or operator

road_accidents_filter2 <- road_accidents_4_cols %>%
  dplyr::filter(Accident_Severity == 3 | Accident_Severity == 2)

Note: filtering with characters must be wrapped in “quotation marks” e.g:

road_accidents_filter3 <- road_accidents %>%
dplyr::filter(`Local_Authority_(Highway)` == "E09000010")

Also note that in the above example the variable is quoted in back ticks (`). This is necessary when a column or variable name contains non-standard characters (anything other than alphanumerics, and selected punctuation marks: _, - or .) or start with numbers; they need to be wrapped in back ticks so R knows that everything inside the back ticks is a variable name.

5.5 Group data

Use the {dplyr} group_by function to group data. This works in a similar manner to “GROUP BY” in SQL.

The below example groups the data by accident severity and weekday, and creates totals for each group using the “tally” function.

# Create grouped data set with counts
road_accidents_small <- road_accidents %>%
  dplyr::group_by(Accident_Severity, Day_of_Week) %>%
  dplyr::tally()
Table 5.8: Road accidents 2017 by accident severity and weekday
Accident_Severity Day_of_Week n
1 1 300
1 2 205
1 3 187
1 4 233
1 5 220
1 6 250

5.6 Order data

Use the {dplyr} arrange function to order data. This works in a similar manner to “ORDER BY” in SQL.

This example orders the data by date and number of casualties.

# Order data by date and number of casualties
road_accidents_ordered <- road_accidents %>%
  dplyr::sample_n(7) %>%
  dplyr::select(acc_index, Accident_Severity, Police_Force, Number_of_Casualties, Date) %>%
  dplyr::arrange(Date, Number_of_Casualties)
Table 5.9: Road accidents 2017 ordered by date and number of casualties
acc_index Accident_Severity Police_Force Number_of_Casualties Date
201797UA00802 3 97 1 2017-02-09
2017010036859 2 1 1 2017-05-11
2017134791665 3 13 1 2017-07-09
2017200259717 3 20 1 2017-12-01
2017010076218 3 1 2 2017-12-03
2017010077687 3 1 1 2017-12-13

5.7 Get counts of data

To get counts for groups of data, the {dplyr} tally function can be used in conjunction with the {dplyr} group by function. This groups the data into the required groups and then tallys how many records are in each group.

# Create grouped data set with counts
road_accidents_small <- road_accidents %>%
  dplyr::group_by(Accident_Severity, Day_of_Week) %>%
  dplyr::tally()

The above example creates groups by accident severity and weekday and counts how many accidents are in each group (one row equals one accident therefore the tally is counting accidents).

Table 5.10: Road accidents 2017 by accident severity and weekday
Accident_Severity Day_of_Week n
1 1 300
1 2 205
1 3 187
1 4 233
1 5 220
1 6 250

5.8 Combine tables

When combining data from two tables there are two ways to do this in R:

  • Bind the tables by basically either appending the tables on the rows or columns
  • Join the tables using the {dplyr} version of SQL joins

Binding tables

Binding tables is mainly done to append tables by creating more rows, however tables can also be bound by adding more columns. Although it is recommended to use the {dplyr} join functions to combine columns (see 5.6).

Here are three tables, one shows data for accident severity of 1, one for accident severity of 2, and one for accident severity of 3.

Table 5.11: Number of fatal road accidents in 2017, by weekday
Accident_Severity Day_of_Week n
1 1 300
1 2 205
1 3 187
1 4 233
1 5 220
1 6 250
1 7 281
Table 5.11: Number of serious injury road accidents in 2017, by weekday
Accident_Severity Day_of_Week n
2 1 3009
2 2 2948
2 3 3230
2 4 3227
2 5 3246
2 6 3649
2 7 3225
Table 5.11: Number of slight injury road accidents in 2017, by weekday
Accident_Severity Day_of_Week n
3 1 11668
3 2 14783
3 3 16065
3 4 15859
3 5 16331
3 6 17346
3 7 13720

To combine these tables we can use the bind_rows function from the {dplyr} package. Use bind_rows when you want to append the tables underneath one another to make one longer table, i.e. you want to add more rows.

Ensure that the column names for each table are exactly the same in each table.

# combine tables using bind_rows
library(dplyr)

all_accidents <- accidents_1 %>%
  dplyr::bind_rows(accidents_2, accidents_3)
Table 5.12: Road accident data 2017, bind_rows
Accident_Severity Day_of_Week n
1 1 300
1 2 205
1 3 187
1 4 233
1 5 220
1 6 250
1 7 281
2 1 3009
2 2 2948
2 3 3230
2 4 3227
2 5 3246
2 6 3649
2 7 3225
3 1 11668
3 2 14783
3 3 16065
3 4 15859
3 5 16331
3 6 17346
3 7 13720

5.9 Joining tables

Joins in R can be done using {dplyr}. This is generally to combine columns of data from two tables:

# combine tables using left join
library(dplyr)

all_accidents_cols_join <- road_acc_1 %>%
  dplyr::left_join(road_acc_2, by = "acc_index")

This uses the same principles as SQL, by specifying what the tables should be joined on using the by = argument.

{dplyr} has all the usual SQL joins for example, inner_join, full_join, right_join. All of these are used in the same way as the left join example above.

Another useful join for data manipulation is an anti_join. This provides all the data that is not in the joined table. For example, the below snapshot of a table displays road accident totals broken down by accident severity and weekday:

Accident_Severity Day_of_Week n
1 1 300
1 2 205
1 3 187
1 4 233
1 5 220
1 6 250

I am interested in creating two sub-groups of this data, a table for all accidents on a Monday (weekday 2), and all other accidents.

First, I get the Monday data using the {dplyr} filter function (see 5.3).

Accident_Severity Day_of_Week n
1 2 205
2 2 2948
3 2 14783

Then, I can use an anti-join to create a table which has all of the data that is not in the above table:

# create table of all rows not in the joined table
library(dplyr)

all_accidents_not_monday <- road_accidents_small %>%
  dplyr::anti_join(accidents_monday, by = c("Accident_Severity", "Day_of_Week"))

The above code takes the initial table we want to get our data from (road_accidents_small) and anti joins accidents_monday. This says, “get all the rows from road_accidents_small that are not in accidents_monday”. Again, note the need to specify what the join rows would be joined and compared by.

Table 5.13: Road accident data 2017 not on a Monday by accident severity
Accident_Severity Day_of_Week n
1 1 300
1 3 187
1 4 233
1 5 220
1 6 250
1 7 281
2 1 3009
2 3 3230
2 4 3227
2 5 3246
2 6 3649
2 7 3225
3 1 11668
3 3 16065
3 4 15859
3 5 16331
3 6 17346
3 7 13720

5.10 Select specific columns in a join

Doing a join with {dplyr} will join all columns from both tables, however sometimes not all columns from each table are needed.

Let’s look at some previous tables again:

Table 5.14: Police force and accident severity information for accidents
acc_index Police_Force Accident_Severity
2017010001708 1 1
2017010009342 1 3
2017010009344 1 3
2017010009348 1 3
2017010009350 1 2
2017010009351 1 3
Table 5.14: Date and weekday information for accidents
acc_index Date Day_of_Week
2017010001708 2017-08-05 7
2017010009342 2017-01-01 1
2017010009344 2017-01-01 1
2017010009348 2017-01-01 1
2017010009350 2017-01-01 1
2017010009351 2017-01-01 1

Let’s say we want acc_index and Police_Force from the first table, and Date from the second table.

# select specific columns from each table and left join
library(dplyr)

road_acc_3 <- road_acc_1 %>%
  dplyr::select(acc_index, Police_Force) %>%
  dplyr::left_join(select(road_acc_2, acc_index, Date), by = "acc_index")

The above code takes the first table and uses the select statement to select the required columns from the first table.

Then within the left_join command, to select the data from the second table, you again add the select statement.

Note: you will need to select the joining variable in both tables but this will only appear once

Table 5.15: Police force and Date information for specific accidents
acc_index Police_Force Date
2017010001708 1 2017-08-05
2017010009342 1 2017-01-01
2017010009344 1 2017-01-01
2017010009348 1 2017-01-01
2017010009350 1 2017-01-01
2017010009351 1 2017-01-01

5.11 Sum rows or columns

5.11.1 Sum rows

Summing across a row can be done using the dplyr function c_across; you just need to specify that this should be done row-wise before performing the calculation:

# sum across a row 
road_accidents_weekdays <- road_accidents_weekdays %>% 
  dplyr::rowwise() %>%
  dplyr::mutate(rowsum = sum(c_across())) 
Table 5.16: Road accidents 2017 by accident severity and weekday
Accident_Severity 1 2 3 4 5 6 7 rowsum
1 300 205 187 233 220 250 281 1676
2 3009 2948 3230 3227 3246 3649 3225 22534
3 11668 14783 16065 15859 16331 17346 13720 105772

To sum across specific rows, you can name these within the c_across function:

# sum across specific rows 
road_accidents_weekdays <- road_accidents_weekdays %>% 
  dplyr::rowwise() %>%
  dplyr::mutate(alldays = sum(c_across(`1`:`5`))) 
Table 5.17: Road accidents 2017 by accident severity and weekday
Accident_Severity 1 2 3 4 5 6 7 alldays
1 300 205 187 233 220 250 281 1145
2 3009 2948 3230 3227 3246 3649 3225 15660
3 11668 14783 16065 15859 16331 17346 13720 74706

5.11.2 Sum columns

Columns can also be summed, although it isn’t recommended that these sums are added to a data table:

# sum columns
road_accidents_weekdays %>%
  dplyr::summarise_if(is.numeric, sum)

5.12 Replace NAs or other values

To replace all NAs in one column (Junction Control column) with a specific value:

library (tidyr)
# replace all NAs with value -1
road_accidents_na <- road_accidents_na %>%
  dplyr::mutate(Junction_Control = tidyr::replace_na(Junction_Control, -1))

Note: To replace NA with a character the character replacement must be wrapped in “quotation marks”

To replace all NAs in a data frame or tibble:

# replace all NAs with value -1
road_accidents_na <- road_accidents_na %>%
  mutate_if(is.numeric, tidyr::replace_na, -1)

To replace values with NA, specify what value you want to be replaced with NA using the na_if function:

# create nas
road_accidents_na <- road_accidents_na %>%
  dplyr::na_if(-1)

Note: to only create NAs in a specific column specify the column name in a similar manner to the first example in this section.

To replace values:

# replace 1st_road_class 
road_accidents_na <- road_accidents_na %>%
  dplyr::mutate(`1st_Road_Class` = dplyr::case_when(`1st_Road_Class` == 3 ~ "A Road",
                                      TRUE ~ as.character(`1st_Road_Class`)))

The case_when function is similar to using CASE WHEN in SQL.

The TRUE argument indicates that if the values aren’t included in the case_when then they should be whatever is after the tilda (~) i.e. the equivalent of the ELSE statement in SQL.

The “as.character” function says that everything that in 1st_Road_Class isn’t 3 should be kept as it is, this could be replaced by an arbitrary character or value e.g. “Other”. This would make everything that is not a 3, coded as “Other”.

You can have multiple case_when arguments for multiple values, they just need to be seperated with a comma. Multiple case_when statements for different variables can be layered using the pipe operator %>%.

5.13 Reordering rows/columns

5.13.1 Reordering rows

Rows can be reordered by certain variables using the {dplyr} arrange function with examples in the 4.5 Order data sub-chapter of this book. This will order the data in ascending order by the variables quoted. To order rows in descending order the desc() command can be used within the arrange function.

# Order data by date and number of casualties
road_accidents_ordered_desc <- road_accidents %>%
  dplyr::select(acc_index, Accident_Severity, Police_Force, Number_of_Casualties, Date) %>%
  dplyr::arrange(desc(Date), Number_of_Casualties)
Table 5.18: Road accidents 2017 ordered by date (descending) and number of casualties
acc_index Accident_Severity Police_Force Number_of_Casualties Date
2017010065745 2 1 1 2017-12-31
2017010080880 3 1 1 2017-12-31
2017010080888 3 1 1 2017-12-31
2017010080889 2 1 1 2017-12-31
2017010080894 3 1 1 2017-12-31
2017010080901 3 1 1 2017-12-31

5.13.2 Reordering columns

Use the {dplyr} select statement to reorder columns, where the order of the variables quoted represents the order of the columns in the table.

Table 5.19: Four columns from road accidents 2017
acc_index Accident_Severity Date Police_Force
2017010001708 1 2017-08-05 1
2017010009342 3 2017-01-01 1
2017010009344 3 2017-01-01 1
2017010009348 3 2017-01-01 1
2017010009350 2 2017-01-01 1
2017010009351 3 2017-01-01 1

To reorder this table we do:

table_reordered <- road_accidents_4_cols %>%
  dplyr::select(Accident_Severity, Date, acc_index, Police_Force)

5.14 Creating new variables

The {dplyr} mutate function can be used to create new variables based on current variables or other additional information.

For example, to create a new variable which is speed limit in km:

road_acc_km <- road_accidents %>%
  dplyr::mutate(speed_km = Speed_limit * 1.6) %>%
  dplyr::select(acc_index, Police_Force, Speed_limit, speed_km)
Table 5.20: Road accidents by km/h
acc_index Police_Force Speed_limit speed_km
2017010001708 1 30 48
2017010009342 1 30 48
2017010009344 1 30 48
2017010009348 1 30 48
2017010009350 1 20 32
2017010009351 1 30 48

5.15 Summarising data

The {dplyr} summarise function can be used to summarise data (mean, median, sd, min, max, n_distinct). See https://dplyr.tidyverse.org/reference/summarise.html for more examples.

For example, to get the mean number of accidents for each weekday:

Table 5.21: Road accidents 2017, by severity and weekday
Accident_Severity Day_of_Week n
1 1 300
1 2 205
1 3 187
1 4 233
1 5 220
1 6 250

The group by function is used with the summarise function to specify what groups the mean will be applied to, in this case weekday.

road_acc_mean <- road_accidents_small %>%
  dplyr::group_by(Day_of_Week) %>%
  dplyr::summarise(mean = mean(n))
Table 5.22: Mean number of accidents in 2017, by weekday
Day_of_Week mean
1 4992.333
2 5978.667
3 6494.000
4 6439.667
5 6599.000
6 7081.667

5.16 Look up tables

Aside from importing a separate lookup data file into R, named vectors can be used as lookup tables.

For example, to assign accident severity values with labels, named vectors can be used (note: numbers must also be in quotation marks):

lookup_severity <- c("1" = "Fatal", "2" = "Serious", "3" = "Slight")

To convert the data and create a label variable (note: the Accident_Severity variable values can be replaced with the lookup values by changing the name of the variable on the left to Accident_Severity):

road_accidents_small$Accident_Severity_label <- lookup_severity[road_accidents_small$Accident_Severity]
Table 5.23: Road accidents 2017, by severity and weekday
Accident_Severity Day_of_Week n Accident_Severity_label
1 1 300 Fatal
1 2 205 Fatal
1 3 187 Fatal
1 4 233 Fatal
1 5 220 Fatal
1 6 250 Fatal