You are likely to be most familiar with data laid out in a “messy” structure like the one below:
## # A tibble: 30 × 8
## country continent `1967` `1972` `1977` `1982` `1987` `1992`
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Albania Europe 66.2 67.7 68.9 70.4 72 71.6
## 2 Austria Europe 70.1 70.6 72.2 73.2 74.9 76.0
## 3 Belgium Europe 70.9 71.4 72.8 73.9 75.4 76.5
## 4 Bosnia and Herzegovina Europe 64.8 67.4 69.9 70.7 71.1 72.2
## 5 Bulgaria Europe 70.4 70.9 70.8 71.1 71.3 71.2
## 6 Croatia Europe 68.5 69.6 70.6 70.5 71.5 72.5
## 7 Czech Republic Europe 70.4 70.3 70.7 71.0 71.6 72.4
## 8 Denmark Europe 73.0 73.5 74.7 74.6 74.8 75.3
## 9 Finland Europe 69.8 70.9 72.5 74.6 74.8 75.7
## 10 France Europe 71.6 72.4 73.8 74.9 76.3 77.5
## # … with 20 more rows
Here, the data has one row per country and then multiple columns, with one for each year. This may seem relatively logical and quite tidy, but the problems start when you realise that there are multiple other quite logical and tidy ways to lay this data out if there are few restrictions. For example, it could quite easily be shown with one country per column, and a year per row:
## # A tibble: 12 × 143
## year Afghanistan Albania Algeria Angola Argentina Australia Austria Bahrain
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1952 28.8 55.2 43.1 30.0 62.5 69.1 66.8 50.9
## 2 1957 30.3 59.3 45.7 32.0 64.4 70.3 67.5 53.8
## 3 1962 32.0 64.8 48.3 34 65.1 70.9 69.5 56.9
## 4 1967 34.0 66.2 51.4 36.0 65.6 71.1 70.1 59.9
## 5 1972 36.1 67.7 54.5 37.9 67.1 71.9 70.6 63.3
## 6 1977 38.4 68.9 58.0 39.5 68.5 73.5 72.2 65.6
## 7 1982 39.9 70.4 61.4 39.9 69.9 74.7 73.2 69.1
## 8 1987 40.8 72 65.8 39.9 70.8 76.3 74.9 70.8
## 9 1992 41.7 71.6 67.7 40.6 71.9 77.6 76.0 72.6
## 10 1997 41.8 73.0 69.2 41.0 73.3 78.8 77.5 73.9
## 11 2002 42.1 75.7 71.0 41.0 74.3 80.4 79.0 74.8
## 12 2007 43.8 76.4 72.3 42.7 75.3 81.2 79.8 75.6
## # … with 134 more variables: Bangladesh <dbl>, Belgium <dbl>, Benin <dbl>,
## # Bolivia <dbl>, `Bosnia and Herzegovina` <dbl>, Botswana <dbl>,
## # Brazil <dbl>, Bulgaria <dbl>, `Burkina Faso` <dbl>, Burundi <dbl>,
## # Cambodia <dbl>, Cameroon <dbl>, Canada <dbl>,
## # `Central African Republic` <dbl>, Chad <dbl>, Chile <dbl>, China <dbl>,
## # Colombia <dbl>, Comoros <dbl>, `Congo, Dem. Rep.` <dbl>,
## # `Congo, Rep.` <dbl>, `Costa Rica` <dbl>, `Cote d'Ivoire` <dbl>, …
Having this kind of wide variety of potential structures means that
when you receive data and try to use it, it’s almost impossible to have
any kind of standardised approach to this non-standard data structuring.
For example, if you want the most recent year of data from an annual
dataset, would you need to select
a column or
filter
some rows; it would depend entirely on the
dataset!
Simply put, “tidy” data is data designed to be easy for coding programs like R to read in and use. It’s structured in a way that makes processes efficient, both to write code for and for R to perform. Tidy data structures are always the same, regardless of the data inside the structure.
This is best explained in a tidy example:
## # A tibble: 10 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
In this dataset, each column is a different variable; years, life expectancy, countries. These are best thought of as different categories that your data can fall into, not values of those categories. For example, year is a variable, but each possible year in that column is a value, and therefore shouldn’t be assigned its own column.
Every row is an observation; a unique combination of the variables included in the data. For example, there is only one observation (row) of data in Afghanistan for 1952, and then a separate one for the same country in 1957.
Saying something isn’t in tidy format isn’t a criticism of that data. Human-readable data is almost always “messy” data, and this format makes it clear and easy for people to read.
Computers are not people, and always want tidy data
The tidyverse set of packages (dplyr, tidyr, ggplot2, etc) are set up to be intuitive and simple to use.
If you find yourself using the tidyverse and looping code, repeating the same tasks over and over, or writing a lot of code to achieve a simple task, it’s probably because the data is in the wrong format.
Tidy data makes it easier to standardise code, functions and processes.
Multiple projects using tidy data can often recycle large chunks of each others code
Loads of people get their start in data wrangling in Excel, but Excel isn’t really set up like a data processing tool. It’s set up as a data tool for humans to use, not computers, so it sacrifices some efficiency for readability.
You don’t need to be able to read your data while R is running code on it, so if you’re not using tidy data you’re sacrificing efficiency for nothing!
A great example of this is the “billboard” dataset from the tidyr package
tidyr::billboard
## # A tibble: 317 × 79
## artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
## 2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
## 3 3 Doors D… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
## 4 3 Doors D… Loser 2000-10-21 76 76 72 69 67 65 55 59
## 5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
## 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
## 7 A*Teens Danc… 2000-07-08 97 97 96 95 100 NA NA NA
## 8 Aaliyah I Do… 2000-01-29 84 62 51 41 38 35 35 38
## 9 Aaliyah Try … 2000-03-18 59 53 38 28 21 18 16 14
## 10 Adams, Yo… Open… 2000-08-26 76 76 74 69 68 67 61 58
## # … with 307 more rows, and 68 more variables: wk9 <dbl>, wk10 <dbl>,
## # wk11 <dbl>, wk12 <dbl>, wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>,
## # wk17 <dbl>, wk18 <dbl>, wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>,
## # wk23 <dbl>, wk24 <dbl>, wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>,
## # wk29 <dbl>, wk30 <dbl>, wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>,
## # wk35 <dbl>, wk36 <dbl>, wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>,
## # wk41 <dbl>, wk42 <dbl>, wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, …
It has details of the artist, track name, date.entered and then a separate column for each week rating. It’s easy for a human to read, and you can see how you’d easily be able to view, chart and analyse this data in Excel.
But if you want to chart the top 5 rows in ggplot as track rating over time, how would you do that? ggplot wants to be passed a single column for x and y, and you don’t have a column for either week or rating.
billboard_small <- head(billboard, 5)
billboard_chart <- billboard_small %>%
##Let's just keep the columns we want, and drop artist name and date
dplyr::select(-c(artist, date.entered)) %>%
## Let's flip the data so the weeks are row names, not column names
t() %>%
#And make it back into a data frame
data.frame()
##The column names are wrong now, so let's move them up from the top row
colnames(billboard_chart) <- billboard_chart[1,]
##And then let's get rid of that useless first row
billboard_chart <- billboard_chart[-1,]
##And let's use those row names as a column
billboard_chart$week <- row.names(billboard_chart)
##Now we can plot a chart of one column against week!
#It's an ugly bar chart because it doesn't even work as a line chart
ggplot(billboard_chart, aes(x = week, y = `Kryptonite`))+
geom_bar(stat = "identity")
There’s a lot of things to notice about this code:
billboard_tidy <- billboard_small %>%
##Let's shift this data from wide to tidy form
tidyr::pivot_longer(cols = -c(artist, track, date.entered),
names_to = "week",
values_to = "position") %>%
#We can even clean up the week values
dplyr::mutate(week = as.numeric(gsub("wk", "", week)))
##Now we can plot a chart of week against position
p <- ggplot(billboard_tidy, aes(x = week, y = position, colour = track, group = track))+
geom_line()
p
Things to note now:
Here it is with a nice theme applied:
p +
theme_minimal() +
scale_colour_brewer(palette = "Dark2")
Here it is faceted out to individual charts by artist:
p +
facet_wrap(~artist)+
theme_minimal() +
scale_colour_brewer(palette = "Dark2")
And it’s even easy to change the original code to show data for the top 10 tracks
billboard_tidy <- head(billboard, 10) %>%
##Let's shift this data from wide to tidy form
tidyr::pivot_longer(cols = -c(artist, track, date.entered),
names_to = "week",
values_to = "position") %>%
#We can even clean up the week values
dplyr::mutate(week = as.numeric(gsub("wk", "", week)))
##Now we can plot a chart of week against position
p <- ggplot(billboard_tidy, aes(x = week, y = position, colour = track, group = track))+
geom_line()+
facet_wrap(~artist)+
theme_minimal()
p
This isn’t just the case with data vis, it’s the same if you’d like to do some data analysis too.
relig_income
## # A tibble: 18 × 11
## religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Agnostic 27 34 60 81 76 137 122
## 2 Atheist 12 27 37 52 35 70 73
## 3 Buddhist 27 21 30 34 33 58 62
## 4 Catholic 418 617 732 670 638 1116 949
## 5 Don’t k… 15 14 15 11 10 35 21
## 6 Evangel… 575 869 1064 982 881 1486 949
## 7 Hindu 1 9 7 9 11 34 47
## 8 Histori… 228 244 236 238 197 223 131
## 9 Jehovah… 20 27 24 24 21 30 15
## 10 Jewish 19 19 25 25 30 95 69
## 11 Mainlin… 289 495 619 655 651 1107 939
## 12 Mormon 29 40 48 51 56 112 85
## 13 Muslim 6 7 9 10 9 23 16
## 14 Orthodox 13 17 23 32 32 47 38
## 15 Other C… 9 7 11 13 13 14 18
## 16 Other F… 20 33 40 46 49 63 46
## 17 Other W… 5 2 3 4 2 7 3
## 18 Unaffil… 217 299 374 365 341 528 407
## # … with 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
## # `Don't know/refused` <dbl>
Again, the relig_income dataset isn’t what you’d think of as “messy”, it’s logical and easy to read for a human. But what if we wanted to multiply all the values in the columns by 1000 (to convert from k figures to actual figures)?
relig_thousand <- relig_income %>%
##Use mutate to multiply each column by 1000
dplyr::mutate(`<$10k` = `<$10k` * 1000,
`$10-20k` = `$10-20k` * 1000,
`$20-30k` = `$20-30k` * 1000,
`$30-40k` = `$30-40k` * 1000)
##We're not done here, I'm just bored...
It’s a lot of coding; doing it for each column would take another 8 repetitions of this!
There are slightly easier ways to do it (using mutate and across in combination), but again it’s already feeling trickier than it needs to be. And crucially, that’s one of the most simple actions we might want to take with this dataset, imagine if you wanted to group together different incomes into wider groups, or convert $ values into £?
relig_thousand <- relig_income %>%
##Again, let's pull all the data into one tidy column
tidyr::pivot_longer(cols = -religion,
names_to = "salary",
values_to = "count") %>%
#Then just multiply all the counts by 1000
dplyr::mutate(count = count * 1000)
This is easier to write, easier to QA, and crucially easier to keep up to date. If they changed the salary groupings (and therefore the column names), the first lot of code would break, but the second lot would keep working.
There’s one package and two workhorse functions that are going to do the majority of the heavy lifting for you here.
There’s a couple of other functions we’re going to use too to make data cleaner and nicer.
You’ve already seen a couple of examples of me pivoting data to make it longer. Long data is what you’ll want most of the time; it’s easier to filter, group, and summarise, easier to apply the same functions to lots of columns at the same time, and easier to plot on charts.
The pivot_longer
function does exactly what the name
suggests, pivots data from a wide format (with many columns and few
rows) to a long format (with many rows and few columns). This is very
useful when you have data which has values as column names, and you want
to gather these together into a single column of multiple values
instead. It takes a lot of arguments, but you’ll probably only want to
use a maximum of 4.
data
: this is the dataset you want to pivot
cols
: the columns you want to combine into a longer
format. This can be done in the same way as in the select
function, either by specifying columns you wish to include or those you
want to exclude (using the -
).
names_to
: the names of all the columns you’ll be
pivoting will be combined into a single labelling column. This will be
called “name” by default, but you can specify an alternative name as a
string if you like.
values_to
: all of the values in the columns you’ll
be pivoting will be combined into a single value column. This will be
called “value” by default, but you can specify an alternative name as a
string if you like.
construction %>% ##the data
tidyr::pivot_longer(cols = c(Northeast, Midwest, South, West), #Choosing the columns to pivot
names_to = "region", #Put column names in a column called region
values_to = "value") #Put values in a column called value
## # A tibble: 36 × 7
## Year Month `1 unit` `2 to 4 units` `5 units or more` region value
## <dbl> <chr> <dbl> <lgl> <dbl> <chr> <dbl>
## 1 2018 January 859 NA 348 Northeast 114
## 2 2018 January 859 NA 348 Midwest 169
## 3 2018 January 859 NA 348 South 596
## 4 2018 January 859 NA 348 West 339
## 5 2018 February 882 NA 400 Northeast 138
## 6 2018 February 882 NA 400 Midwest 160
## 7 2018 February 882 NA 400 South 655
## 8 2018 February 882 NA 400 West 336
## 9 2018 March 862 NA 356 Northeast 150
## 10 2018 March 862 NA 356 Midwest 154
## # … with 26 more rows
Let’s try some pivoting of our own with taxi data. This is a nice table but it’s not very R friendly if we’d like to plot a chart of taxi volumes over time
taxis <- openxlsx::read.xlsx("Data/taxi0101.xlsx", sheet = 3, startRow = 8)
pivot_longer
function to achieve this.taxis_long <- taxis %>%
tidyr::pivot_longer(cols = -Year)
Hint: your final table should start like this:
head(taxis_long)
## # A tibble: 6 × 3
## Year name value
## <chr> <chr> <chr>
## 1 1965 London.Licensed.taxis.(thousands).[note.1] "7.29"
## 2 1965 England.Licensed.taxis.(thousands) "[w] "
## 3 1965 London.Licensed.PHVs.(thousands).[note.1] "[w]"
## 4 1965 England.Licensed.PHVs.(thousands) "[w] "
## 5 1965 London.Total.vehicles.(thousands).[note.1] "[w]"
## 6 1965 England.Total.vehicles.(thousands) "[w] "
taxis_long <- taxis_long %>%
dplyr::mutate(value = as.numeric(value))
taxis_long <- taxis_long %>%
dplyr::mutate(name = tolower(name))
The separate
function is the other common function
you’ll want to use from the tidyr package, and is designed to take a
single column and split it into two or more other columns based on a
pattern (e.g. a single character or symbol).
This is common function to use, because more often than you’d expect, there’s more than one piece of information contained in a column title (think about a column called rail_2020, or road_england). When these column names are moved into tidy format, you end up with multiple pieces of data in the same cell.
An example of this can be seen in the world_bank_pop dataset:
tidyr::world_bank_pop %>%
tidyr::pivot_longer(cols = -c(country, indicator),
names_to = "year") %>%
head()
## # A tibble: 6 × 4
## country indicator year value
## <chr> <chr> <chr> <dbl>
## 1 ABW SP.URB.TOTL 2000 42444
## 2 ABW SP.URB.TOTL 2001 43048
## 3 ABW SP.URB.TOTL 2002 43670
## 4 ABW SP.URB.TOTL 2003 44246
## 5 ABW SP.URB.TOTL 2004 44669
## 6 ABW SP.URB.TOTL 2005 44889
Notice how the indicator
column contains multiple pieces
of information; SP is a prefix for all values, POP or URB indicates
whether the value is for either total population or urban population,
and TOTL or GROW indicates whether it’s total population or population
growth. This is fine, but makes it harder to select e.g. all urban
population values only, or all growth values only. It would be much more
simple if this information was in two different columns.
We can split this information into three separate columns using separate. The arguments for this function are a little more complex than for pivot_longer:
col
: The column we’re going to split into multiple
columns
into
: The names of the new columns we’re going to create
from this column (passed as strings)
sep
: the symbol you want to use as the separation point
between the data in the columns. This works using regular expressions so
you need to be a bit careful how you provide this value, usually
prefixed with “\”
There are also some arguments you can leave as default for now but may want to change later, these include:
remove
: defaults to TRUE, lets you decide whether or not
to remove the original column
extra
: tells the function what to do with any extra
content you haven’t provided a column name for. By default it’ll tell
you this exists, but you can specify to either “drop” this without
warning, or “merge” it into the final column so you don’t lose it.
And here it is in action:
#Get the data into tidy format
tidyr::world_bank_pop %>%
tidyr::pivot_longer(cols = -c(country, indicator),
names_to = "year") %>%
##Separate out the indicator column
tidyr::separate(col = indicator,
#Separate it into three columns called sp, pop, and measure
into = c("sp", "pop", "measure"),
#Separate on the dots in the name (using \\ before you reference the .)
sep = "\\.") %>%
head(10)
## # A tibble: 10 × 6
## country sp pop measure year value
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 ABW SP URB TOTL 2000 42444
## 2 ABW SP URB TOTL 2001 43048
## 3 ABW SP URB TOTL 2002 43670
## 4 ABW SP URB TOTL 2003 44246
## 5 ABW SP URB TOTL 2004 44669
## 6 ABW SP URB TOTL 2005 44889
## 7 ABW SP URB TOTL 2006 44881
## 8 ABW SP URB TOTL 2007 44686
## 9 ABW SP URB TOTL 2008 44375
## 10 ABW SP URB TOTL 2009 44052
It should now be obvious how you could filter or summarise this data more easily, referencing the pop or measure columns individually.
Let’s go back to our taxi data.
taxis_long
## # A tibble: 324 × 3
## Year name value
## <chr> <chr> <dbl>
## 1 1965 london.licensed.taxis.(thousands).[note.1] 7.29
## 2 1965 england.licensed.taxis.(thousands) NA
## 3 1965 london.licensed.phvs.(thousands).[note.1] NA
## 4 1965 england.licensed.phvs.(thousands) NA
## 5 1965 london.total.vehicles.(thousands).[note.1] NA
## 6 1965 england.total.vehicles.(thousands) NA
## 7 1970 london.licensed.taxis.(thousands).[note.1] 8.65
## 8 1970 england.licensed.taxis.(thousands) NA
## 9 1970 london.licensed.phvs.(thousands).[note.1] NA
## 10 1970 england.licensed.phvs.(thousands) NA
## # … with 314 more rows
separate
function to split this into two different columns
on the .
symbol. Set the extra
argument to
merge so you don’t lose any of the information! Call the two new columns
“region” and “vehicle_type”taxi_separate <- taxis_long %>%
tidyr::separate(name, into = c("region", "vehicle_type"), extra = "merge", sep = "\\.")
extra
argument. Can you do
this?taxis_separate <- taxis_long %>%
tidyr::separate(name, into = c("region", "licence", "vehicle_type"), extra = "drop", sep = "\\.")
The opposite of pivoting longer is the function
pivot_wider
. This takes long tidy data and spreads it back
out, taking the column names from one column and the values from the
other. This is useful as a final step if you want to present data in a
human-readable format, but is also useful for some aspects of data
processing such as mutating across different variables.
Like pivot_longer, in pivot_wider()
you can specify the
arguments names_from
and values_from
to assign
the columns that the column names for the new spread data will be taken
from, and the column the associated values will be taken from. If you
don’t specify these, it will default to “name” and “value” once again.
These arguments also specify the columns the function works over, so in
this case there’s no need to specify a cols
argument.
fish_encounters %>% ##the data
tidyr::pivot_wider(names_from = station, #Take column names from this column
values_from = seen) #Take values from this column
## # A tibble: 19 × 12
## fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE MAW
## <fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 4842 1 1 1 1 1 1 1 1 1 1 1
## 2 4843 1 1 1 1 1 1 1 1 1 1 1
## 3 4844 1 1 1 1 1 1 1 1 1 1 1
## 4 4845 1 1 1 1 1 NA NA NA NA NA NA
## 5 4847 1 1 1 NA NA NA NA NA NA NA NA
## 6 4848 1 1 1 1 NA NA NA NA NA NA NA
## 7 4849 1 1 NA NA NA NA NA NA NA NA NA
## 8 4850 1 1 NA 1 1 1 1 NA NA NA NA
## 9 4851 1 1 NA NA NA NA NA NA NA NA NA
## 10 4854 1 1 NA NA NA NA NA NA NA NA NA
## 11 4855 1 1 1 1 1 NA NA NA NA NA NA
## 12 4857 1 1 1 1 1 1 1 1 1 NA NA
## 13 4858 1 1 1 1 1 1 1 1 1 1 1
## 14 4859 1 1 1 1 1 NA NA NA NA NA NA
## 15 4861 1 1 1 1 1 1 1 1 1 1 1
## 16 4862 1 1 1 1 1 1 1 1 1 NA NA
## 17 4863 1 1 NA NA NA NA NA NA NA NA NA
## 18 4864 1 1 NA NA NA NA NA NA NA NA NA
## 19 4865 1 1 1 NA NA NA NA NA NA NA NA
Still using the taxi data, now we’ve got our data separated out into nice clean columns, let’s make use of the pivot wider function to create some new columns with additional data in them.
taxis_wide <- taxis_separate %>%
pivot_wider(names_from = region,
values_from = value)
taxis_wide <- taxis_wide %>%
dplyr::mutate(england_outside_london = england - london)
taxis_final <- taxis_wide %>%
tidyr::pivot_longer(cols = c(london, england, england_outside_london), names_to = "region", values_to = "value")
taxis_final %>%
dplyr::mutate(Year = as.numeric(Year)) %>%
na.omit() %>%
dplyr::filter(region == "england_outside_london") %>%
ggplot(aes(x = Year, y = value, colour = vehicle_type, group = vehicle_type))+
geom_line()
The below are a few long-form exercises for you to try using the above functions; these are designed to be challenging so don’t be put off if they take a bit of work!