Chapter 4 Importing Data
So far, we have only made use of data which is pre-loaded into R via packages, but it is also possible to load your own data in from a variety of sources. We will focus on two different file types;
- CSV
- Excel
4.1 Cloud R upload
As described in chapter 1, the bottom right pane of RStudio allows you to view files that are within your own personal Cloud R directory. You are free to create new folders in this area, using the New Folder
button. Create a new folder called data
in your root directory. In order to read in the files for the next 2 sections, you’ll need to upload the data into Cloud R. Use the Upload
button, set the target directory to your new data
folder and browse to the 2 datasets that we provided at the start of the course (nyt_titles.csv
and covid_stats.xlsx
).
4.2 CSV
Although there is a read.csv()
function in base R, like most things there is a better tidyverse
alternative! read_csv()
from the readr
package reads CSVs in as a tibble (which has additional features compared to a standard data frame), is much faster (~10X), and allows you to specify how you read data in more easily.
As always, lets read the function documentation using ?read_csv
. This tells us we need to provide a path to the file. This path can be either local or remote; so it will work equally well for data inside your project or from the internet.
To read in a local file, you have to specify the exact location of the file. You can do this as either an absolute filepath, which starts from the drive name right through to the final file (e.g. C:/Documents/My_work/file.csv), or as a relative file path. A relative file path just gives the location of the file starting from your current working environment. You can check what your current working environment is using the command getwd()
. The advantage of using relative file paths is if someone duplicates your project from Github, the code will still work on their own computer.
We will start by reading in some local data you previously uploaded to Cloud R, which contains details of New York Times bestsellers:
Notice that the file is inside the data folder inside the current working directory.
nyt_data
will now show in your environment. The environment viewer (top right) shows you basic information about the data that has been loaded in; 7431 obs (rows) of 8 variables (columns). You can also click on any object to view it in your script window.
You can also read in data directly from the web using the same function. For example, reading in this CSV of Eurovision Song Contest data:
eurovision_data <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-05-17/eurovision.csv")
This works exactly the same way as reading in local data, and the object you have created will appear in your environment (top right).
4.2.1 Exercise
05:00
- Read in the frogs dataset found here:‘https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-08-02/frogs.csv’ as an object called
frogs
4.2.1. Solution
4.2.2 Inspecting the dataset
As noted in the previous section, you can see by looking in the environment window that the eurovision dataset has 2005 observations and 18 variables. You can also return this information (and more) about datasets programatically, using the glimpse()
function, again from the dplyr package:
## Rows: 2,005
## Columns: 18
## $ event <chr> "Turin 2022", "Turin 2022", "Turin 2022", "Turin 2022",…
## $ host_city <chr> "Turin", "Turin", "Turin", "Turin", "Turin", "Turin", "…
## $ year <dbl> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2…
## $ host_country <chr> "Italy", "Italy", "Italy", "Italy", "Italy", "Italy", "…
## $ event_url <chr> "https://eurovision.tv/event/turin-2022", "https://euro…
## $ section <chr> "first-semi-final", "first-semi-final", "first-semi-fin…
## $ artist <chr> "Kalush Orchestra", "S10", "Amanda Georgiadi Tenfjord",…
## $ song <chr> "Stefania", "De Diepte", "Die Together", "Saudade, Saud…
## $ artist_url <chr> "https://eurovision.tv/participant/kalush-orchestra-22"…
## $ image_url <chr> "https://static.eurovision.tv/hb-cgi/images/963164d0-06…
## $ artist_country <chr> "Ukraine", "Netherlands", "Greece", "Portugal", "Bulgar…
## $ country_emoji <chr> ":flag_ua:", ":flag_nl:", ":flag_gr:", ":flag_pt:", ":f…
## $ running_order <dbl> 6, 8, 15, 10, 7, 5, 17, 16, 3, 9, 4, 14, 11, 1, 12, 2, …
## $ total_points <dbl> 337, 221, 211, 208, 29, 15, 187, 177, 159, 154, 118, 10…
## $ rank <dbl> 1, 2, 3, 4, 16, 17, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, …
## $ rank_ordinal <chr> "1st", "2nd", "3rd", "4th", "16th", "17th", "5th", "6th…
## $ qualified <lgl> TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE,…
## $ winner <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
As well as returning the number of rows and columns in the data, the glimpse function also shows you the names of the columns, the column classes (indicated in
4.2.3 Exercise
05:00
- Use the glimpse and View functions to examine the frogs dataset. How many rows and columns does it have?
4.2.3. Solution
## Rows: 311
## Columns: 16
## $ Site <chr> "Crane Prairie", "Crane Prairie", "Crane Prairie", "Crane P…
## $ Subsite <chr> "SE Pond", "SE Pond", "SE Pond", "SE Pond", "SE Pond", "SE …
## $ HabType <chr> "Pond", "Pond", "Pond", "Pond", "Pond", "Pond", "Pond", "Po…
## $ SurveyDate <chr> "9/25/2018", "10/2/2018", "10/9/2018", "10/15/2018", "10/22…
## $ Ordinal <int> 268, 275, 282, 288, 295, 305, 309, 317, 322, 330, 270, 277,…
## $ Frequency <dbl> 164.169, 164.169, 164.169, 164.169, 164.169, 164.169, 164.1…
## $ UTME_83 <int> 597369, 597352, 597345, 597340, 597344, 597410, 597369, 597…
## $ UTMN_83 <int> 4846486, 4846487, 4846458, 4846464, 4846460, 4846451, 48464…
## $ Interval <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,…
## $ Female <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Water <chr> "Deep water", "Deep water", "Shallow water", "Deep water", …
## $ Type <chr> "Marsh/Pond", "Marsh/Pond", "Marsh/Pond", "Marsh/Pond", "Ma…
## $ Structure <chr> "Herbaceous veg", "Open", "Herbaceous veg", "Woody debris",…
## $ Substrate <chr> "Unknown substrate", "Unknown substrate", "Unknown substrat…
## $ Beaver <chr> "No beaver", "No beaver", "No beaver", "No beaver", "No bea…
## $ Detection <chr> "Captured", "No visual", "No visual", "No visual", "No visu…
4.3 Excel
Reading excel files works in much the same way as CSV files. However, due to the difference in underlying structures we require the function read_excel()
from a different package called readxl
.
The main difference when reading excel files is three additional arguments that we can set;
sheet
which allows us to specify which sheet to read. It can take the form of a string (the name of the sheet) or an integer (the position of the sheet); andrange
which allows us to specify a cell range. It takes a typical cell reference like “B3:D10”.skip
an alternative to specifying a cell range, you can simply indicate how many rows to skip at the start of the sheet. This is ideal if you want to read in a sheet with an unknown number of columns and/or rows, but know there are several lines of metadata at the top of the sheet.
If we don’t set any of these arguments it will assume our data is in the first row of the first sheet (and it becomes almost identical to read_csv above).
4.3.1 Exercise
10:00
- Install and load the readxl package.
- Try reading in the
covid_stats.xlsx
file in the data folder. - Specify the sheet name you want to read in.
- Examine the data you have read in; it currently includes the titles at the top of the sheet. Work out how to skip these and only read in the data, with the correct column names.