Chapter 2 Data Manipulation
dplyr
is one of the most important packages in an R user’s arsenal. A key part of the tidyverse
, it provides easy-to-use functions for manipulating data frames, which is a crucial component of the data analysis cycle.
There are six key dplyr
functions that allow you to solve the vast majority of data-manipulation challenges;
select()
to pick variables/columns by their names.group_by()
to change the scope of each function from operating on the entire data set to operating on it group-by-group.summarise()
to collapse many values down to a single summary.filter()
to pick observations by their value.mutate()
to create new variables/columns with functions of existing variables/columns.arrange()
to reorder the rows.
These functions look similar to SQL statements and are designed to replace the need for any data manipulation in SQL.
All dplyr
functions allow you to specify the column names without “quotations”. However, if there are spaces in the column name, you’ll need to use `back ticks`
.
2.1 Select
Select allows you to choose the columns that you’d like to keep from a dataset.
Looking at the gapminder dataset, if we want to create a new dataset which only included the year, country and life expectancy, we could do this by selecting those columns:
The first argument within the select command specifies use of the gapminder dataset. Following this we list the variables we want to keep.
It is also possible to select to exclude specific columns. This is ideal if you want to keep all columns except for one or two, and can be done by using a -
minus sign in front of column names.
#Removes the continent column but keeps all others
gapminder_no_continent <- select(gapminder, -continent)
You can also use the select
function to reorder columns as it will select columns in the order specified.
2.2 Pipes
Before we continue, let’s visit one of the most important (and cool) operators in R… the pipe %>%
.
It’s likely that you’ll want to use multiple functions consecutively, especially when using dplyr
. Currently, we may do something like this:
gapminder_new <- select(gapminder, -continent)
gapminder_new <- select(gapminder_new, year, country, lifeExp)
This code is a little frustrating to write because we have to name each intermediate data frame, even though we don’t care about it. Naming things is hard, and having separate names for each step makes it difficult to read. Let’s see how we can rewrite this code using the pipe:
The pipe means we can read this code as a series of statements separated by the pipe representing “and then”; e.g. take the gapminder data and then remove the continent column and then select the year, country and lifeExp columns.
You may notice that we don’t need to specify the data
argument in each function when using %>%. By piping, the subsequent function recognises we want to use the result of our previous statement as our data.
2.3 Grouping and summarising data
We can produce breakdowns of statistics using the group_by and summarise commands from the dplyr package:
- group_by() identifies which variables we want to produce breakdowns by.
- summarise() is used to indicate which values we want to calculate.
Using these functions together we can produce summary statistics in a similar way to pivot tables in Excel. We can use the pipe (%>%) operator to chain these functions together.
So if we want the mean life expectancy by continent and year:
Here R takes the dataset, then groups it first by year and then by continent and then outputs the mean life expectancy. The mean life expentancy variable is created as a new column called life_exp
. The results are saved into a new dataset called mean_life_exp
.
There are other functions that could be used here instead of mean e.g. n, n_distinct, min, max, mean, median, var and sd.
If we want to add a new variable that we decide to call country_count
that provides the counts by year and continent we can rerun as follows using the pipe operator:
mean_life_exp <- gapminder %>%
group_by(year, continent) %>%
summarise(life_exp = mean(lifeExp),
country_count = n())
2.4 Filter
If you would like to produce statistics for a subset of rows or observations, a good function to use is filter() from the dplyr package.
Let’s first take a look at the different possible values of the continent variable. We can do that quickly using the group_by/summarise combination.
To filter we just specify the data that we want to filter (gapminder) and the value that we want to filter on. In this case lets filter where continent is “Asia” and year is after 1992 then recalculate the mean life expectancy by country:
mean_life_exp <- gapminder %>%
filter(continent == "Asia" & year > 1992) %>%
group_by(country) %>%
summarise(life_exp = mean(lifeExp))
R provides the standard suite of comparison operators which can be used to filter:
Comparison | Operator |
---|---|
Greater than | > |
Greater than or equal to | >= |
Less than | < |
Less than or equal to | <= |
Equal to | == |
Not equal to | != |
And | & |
Or | | |
Not | ! |
Group membership | %in% |
The %in%
operator allows you to compare a column against a vector of values to see if it matches any one of them; this is much more convenient than comparing against each value individually.
2.5 Rename
We can rename variables using the dplyr function rename(). Let’s amend our previous code creating the mean_life_exp dataset to change the name of the “year” column to “selected_year”.
mean_life_exp <- gapminder %>%
filter(continent == "Asia" & year > 1992) %>%
group_by(year, country) %>%
summarise(life_exp = mean(lifeExp)) %>%
rename(selected_year = year)
Within the rename function, the new name “selected_year” is specified on the left and the old name on the right of the equal sign.
2.6 Mutate
You can create new columns and perform calculations on existing columns using the dplyr command mutate().
For example, imagine we wanted to calculate overall GDP as a new column in the gapminder dataset. We could do this by multiplying the gdpPercap
and pop
columns:
You can also use functions like mean()
and sum()
in mutate()
. For example, using x / sum(x)
for calculating proportions of a total and y - mean(y)
for difference from the mean.
Notice that by default, mutate calculates values on a rowwise basis; each value in the gdp_total column is made by multiplying the values in the corresponding row. This default behaviour can be changed by grouping data before mutate, e.g. this code produces a mean population column by country:
You can also combine mutate with the case_when
function to perform one or more if/else conditions.Maybe we want to have coded values for each year by decade. The case_when function allows you to provide multiple instances of a statement which evaluates to TRUE/FALSE, and then a result if that condition is true (after ~
). The function evaluates these statements in order, so if an earlier statement is TRUE, a later one will not be evaluated. Finally, for cases that don’t meet any of the conditions, the final TRUE value is used (this defaults to NA if not specified)
gapminder %>%
mutate(decade =
case_when(year >= 1950 & year < 1960 ~ "1950s",
year >= 1960 & year < 1970 ~ "1960s",
year >= 1970 & year < 1980 ~ "1970s",
year >= 1980 & year < 1990 ~ "1980s",
year >= 1990 & year < 2000 ~ "1990s",
TRUE ~ "Post-2000"))
You can download the Data Transformation Cheat Sheet (and other cheatsheets) at: https://www.rstudio.com/resources/cheatsheets/
2.7 Arrange
arrange()
is used to change the order of rows. It takes a data frame as it’s first argument and a column name to sort by as it’s second. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns. By default arrange()
will sort in ascending order (1-9 and A-Z). If you’d like to sort in descending order wrap the column name in desc()
. Using arrange with one column sorts how you’d might expect:
Sorting with multiple columns sorts within the hierarchy specified:
It’s worth noting that missing values (NA) are always sorted at the end:
## # A tibble: 4 × 1
## x
## <dbl>
## 1 1
## 2 2
## 3 3
## 4 NA
## # A tibble: 4 × 1
## x
## <dbl>
## 1 3
## 2 2
## 3 1
## 4 NA