Chapter 4 Importing Data

So far, we have only made use of data which is pre-loaded into R via packages, but it is also possible to load your own data in from a variety of sources. We will focus on two different file types;

CSV
Excel

4.1 Cloud R upload

As described in Chapter 1, the bottom right pane of RStudio allows you to view files that are within your own personal Cloud R directory. You are free to create new folders in this area, using the New Folder button.

4.1.1 Exercise

05:00

Navigate to the Files at the bottom right pane of your RStudio
Create a new folder called data
Inside the data folder, upload (using the Upload button) the two datasets either sent to you before the start of the session, or downloaded from Chapter 1: nyt_titles.csv and covid_stats.xlsx (note, you can only upload one file at a time)

Great, we will come back to using these two files, but first let’s discuss how to read in csv files.

4.2 CSV

Although there is a read.csv() function in base R, like most things there is a better tidyverse alternative! read_csv() from the readr package reads CSVs in as a tibble (which has additional features compared to a standard data frame), is much faster (~10X), and allows you to specify how you read data in more easily.

As always, let’s read the function documentation using ?read_csv. This tells us we need to provide a path to the file. This path can be either local or remote; so it will work equally well for data inside your project or from the internet.

To read in a local file, you have to specify the exact location of the file. You can do this as either an absolute filepath, which starts from the drive name right through to the final file (e.g. C:/Documents/My_work/file.csv), or as a relative file path. A relative file path just gives the location of the file starting from your current working environment. You can check what your current working environment is using the function getwd(). The advantage of using relative file paths is if someone duplicates your project from Github, the code will still work on their own computer.

We will start by reading in some local data you previously uploaded to Cloud R, which contains details of New York Times bestsellers:

nyt_data <- read_csv("data/nyt_titles.csv")

Notice that the file is inside the data folder inside the current working directory.

nyt_data will now show in your environment. The environment viewer (top right) shows you basic information about the data that has been loaded in; 7431 obs (rows) of 8 variables (columns). You can also click on any object to view it in your script window.

You can also read in data directly from the web using the same function. For example, reading in this CSV of Eurovision Song Contest data:

eurovision_data <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-05-17/eurovision.csv")

This works exactly the same way as reading in local data, and the object you have created will appear in your environment (top right).

4.2.1 Exercise

05:00

Read in the frogs dataset found here:‘https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-08-02/frogs.csv’ as an object called frogs using the Base R read.csv or read_csv from the readr package.

NOTE: The readr package is part of the core tidyverse, which you loaded in Chapter 1.7.1 — so there’s no need to install it separately.

4.2.1. Solution

frogs <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-08-02/frogs.csv")

4.2.2 Inspecting the dataset

As noted in the previous section, you can see by looking in the environment window that the eurovision dataset has 2005 observations and 18 variables. You can also return this information (and more) about datasets programatically, using the glimpse() function, again from the dplyr package:

glimpse(eurovision_data)

## Rows: 2,005
## Columns: 18
## $ event          <chr> "Turin 2022", "Turin 2022", "Turin 2022", "Turin 2022",…
## $ host_city      <chr> "Turin", "Turin", "Turin", "Turin", "Turin", "Turin", "…
## $ year           <dbl> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2…
## $ host_country   <chr> "Italy", "Italy", "Italy", "Italy", "Italy", "Italy", "…
## $ event_url      <chr> "https://eurovision.tv/event/turin-2022", "https://euro…
## $ section        <chr> "first-semi-final", "first-semi-final", "first-semi-fin…
## $ artist         <chr> "Kalush Orchestra", "S10", "Amanda Georgiadi Tenfjord",…
## $ song           <chr> "Stefania", "De Diepte", "Die Together", "Saudade, Saud…
## $ artist_url     <chr> "https://eurovision.tv/participant/kalush-orchestra-22"…
## $ image_url      <chr> "https://static.eurovision.tv/hb-cgi/images/963164d0-06…
## $ artist_country <chr> "Ukraine", "Netherlands", "Greece", "Portugal", "Bulgar…
## $ country_emoji  <chr> ":flag_ua:", ":flag_nl:", ":flag_gr:", ":flag_pt:", ":f…
## $ running_order  <dbl> 6, 8, 15, 10, 7, 5, 17, 16, 3, 9, 4, 14, 11, 1, 12, 2, …
## $ total_points   <dbl> 337, 221, 211, 208, 29, 15, 187, 177, 159, 154, 118, 10…
## $ rank           <dbl> 1, 2, 3, 4, 16, 17, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, …
## $ rank_ordinal   <chr> "1st", "2nd", "3rd", "4th", "16th", "17th", "5th", "6th…
## $ qualified      <lgl> TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE,…
## $ winner         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…

As well as returning the number of rows and columns in the data, the glimpse function also shows you the names of the columns, the column classes (indicated in ), and an example of the first few rows of data.

4.2.3 Exercise

05:00

Use the glimpse() and View() functions to examine the frogs dataset. How many rows and columns does it have?

4.2.3. Solution

View(frogs)

glimpse(frogs)

## Rows: 311
## Columns: 16
## $ Site       <chr> "Crane Prairie", "Crane Prairie", "Crane Prairie", "Crane P…
## $ Subsite    <chr> "SE Pond", "SE Pond", "SE Pond", "SE Pond", "SE Pond", "SE …
## $ HabType    <chr> "Pond", "Pond", "Pond", "Pond", "Pond", "Pond", "Pond", "Po…
## $ SurveyDate <chr> "9/25/2018", "10/2/2018", "10/9/2018", "10/15/2018", "10/22…
## $ Ordinal    <dbl> 268, 275, 282, 288, 295, 305, 309, 317, 322, 330, 270, 277,…
## $ Frequency  <dbl> 164.169, 164.169, 164.169, 164.169, 164.169, 164.169, 164.1…
## $ UTME_83    <dbl> 597369, 597352, 597345, 597340, 597344, 597410, 597369, 597…
## $ UTMN_83    <dbl> 4846486, 4846487, 4846458, 4846464, 4846460, 4846451, 48464…
## $ Interval   <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,…
## $ Female     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Water      <chr> "Deep water", "Deep water", "Shallow water", "Deep water", …
## $ Type       <chr> "Marsh/Pond", "Marsh/Pond", "Marsh/Pond", "Marsh/Pond", "Ma…
## $ Structure  <chr> "Herbaceous veg", "Open", "Herbaceous veg", "Woody debris",…
## $ Substrate  <chr> "Unknown substrate", "Unknown substrate", "Unknown substrat…
## $ Beaver     <chr> "No beaver", "No beaver", "No beaver", "No beaver", "No bea…
## $ Detection  <chr> "Captured", "No visual", "No visual", "No visual", "No visu…

4.3 Excel

Reading excel files works in much the same way as CSV files. However, due to the difference in underlying structures we require the function read_excel() from a different package called readxl.

The main difference when reading excel files is three additional arguments that we can set;

sheet which allows us to specify which sheet to read. It can take the form of a string (the name of the sheet) or an integer (the position of the sheet); and
range which allows us to specify a cell range. It takes a typical cell reference like “B3:D10”.
skip an alternative to specifying a cell range, you can simply indicate how many rows to skip at the start of the sheet. This is ideal if you want to read in a sheet with an unknown number of columns and/or rows, but know there are several lines of metadata at the top of the sheet.

If we don’t set any of these arguments it will assume our data is in the first row of the first sheet (and it becomes almost identical to read_csv above).

4.3.1 Exercise

10:00

Install and load the readxl package.
Try reading in the covid_stats.xlsx file in the data folder.
Specify the sheet name you want to read in.
Examine the data you have read in; it currently includes the titles at the top of the sheet. Work out how to skip these and only read in the data, with the correct column names.

4.3.1. Solution

install.packages("readxl")

library(readxl)

covid_stats <- read_excel("data/covid_stats.xlsx", sheet = "Transport_use_(GB)", skip = 6)