Chapter 2 Creating (basic) plots

Before exploring ggplot further, let’s revisit the basics of plotting with ggplot2::, as covered in the “Introduction to R - Chapter 3 Basic Plotting”.

2.1 Structure

The main function in ggplot2 is ggplot(), which is used to start a plot. ggplot2 plots are built in layers, with each layer adding to the appearance or functionality of the chart. Layers are added using the + symbol.

The basic structure of a ggplot2 chart is:

Call the ggplot() function:

Specify the dataset to use
Map your x and y values using the aes() function

Add a geom function:

The geom defines the type of plot, such as a bar chart, line graph or scatter plot
It also determines how the data and aesthetic (appearance) mappings are transformed into the visualisation

Code

ggplot(data = <DATA>, aes(x = <COLUMN1>, y = <COLUMN2>)) +
  <GEOM_FUNCTION>()

2.1.1 A worked example

Code

# State your data
# State your x and y values in the aes() function
ggplot(data = driving_bar_data, aes(x = Financial_Year, y = Pass_rate)) +
# This geom function here creates a bar chart    
  geom_col()

2.1.2 Adding aesthetics: Themes

To make this chart look a bit more presentable we can simply add on one of ggplot’s built-in themes as a layer. A theme is a quick and easy way to set many of the visual aspects of your charts, such as the appearance of grid lines, size of text, and position of the legends.

Some themes to choose from:

theme_classic()
theme_bw()
theme_minimal()
theme_light()

Code

ggplot(data = driving_bar_data, aes(x = Financial_Year, y = Pass_rate)) +
  geom_col() + 
  # Add the classic theme as a layer 
  theme_classic()

You can also customise your charts further by creating your own theme in ggplot2. Themes control non-data elements like fonts, gridlines, and background colours, allowing you to match charts to a specific style or branding.

Here’s an example of a basic chart with a custom theme applied:

Code

# Define a custom theme
custom_theme <- theme(
  panel.background = element_rect(fill = "white"),
  panel.grid.major = element_line(colour = "grey80"),
  axis.text = element_text(size = 10)
)

ggplot(data = driving_bar_data, aes(x = Financial_Year, y = Pass_rate)) +
  geom_col()+ 
  # Add the custom theme as a layer 
  custom_theme

2.1.3 Adding aesthetics: Colour

Colour is a powerful way to show an additional variable in a chart, helping to distinguish between different categories effectively.

In ggplot2, you can use scale_colour_brewer() or scale_fill_brewer() to apply ColorBrewer palettes. These palettes are visually appealing and many are colour-blind friendly, making them an excellent choice for accessible data visualisation.

When customising colours in a chart, it’s important to understand the difference:

colour is used for lines and points (e.g. scale_colour_brewer())
fill is used for the interior colour of objects like bars (e.g. scale_fill_brewer())

This distinction ensures your chart looks exactly as intended, with the right colours applied to the right elements.

You can see the full range of palettes available with their names here:

Simply add the function as another layer and state the palette you would like to use.

Code

ggplot(data = driving_bar_data, aes(x = Financial_Year, y = Pass_rate, fill = factor(Financial_Year))) +
  geom_col()+ 
  # Add a theme as a layer 
  theme_classic()+
  # Add a colour palette as a layer 
  scale_fill_brewer(palette = "Spectral")
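
The palette names can also be looked up from R itself. The sketch below assumes the RColorBrewer package is available (it is normally installed alongside ggplot2) and uses its brewer.pal.info table, which lists every palette with its maximum number of colours and whether it is flagged as colour-blind friendly. The variable names are illustrative only.

```r
library(RColorBrewer)

# One row per palette; the row names are the palette names
palette_info <- brewer.pal.info
head(palette_info)

# Names of the palettes flagged as colour-blind friendly
colourblind_palettes <- rownames(palette_info)[palette_info$colorblind]
colourblind_palettes

# Maximum number of colours a palette can supply, e.g. for "Spectral"
palette_info["Spectral", "maxcolors"]

# Draw swatches of the colour-blind friendly palettes
display.brewer.all(colorblindFriendly = TRUE)
```

Checking maxcolors is worthwhile before mapping a variable with many categories to fill or colour, because a palette cannot supply more colours than it contains.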

2.2 Types of charts: Geoms

In this section, we are going to cover the most common charts people use for their analysis: bar charts, box plots, line graphs and scatter plots.

2.2.1 Bar Chart

Bar charts are one of the most commonly used types of visualisation because they are excellent for comparing the sizes of different categories.

Stacked bar charts are particularly useful when you want to compare multiple part-to-whole relationships within a single chart. However, if the data doesn’t represent part-to-whole relationships, caution is needed, as stacked bar charts can quickly become cluttered and difficult to interpret, especially with many categories.

In ggplot2::, you can create bar charts using either geom_col() or geom_bar(), depending on your data:

use geom_col() when you already have a dataset with pre-summarised values (e.g. totals or percentages). This function directly maps the height of bars to values in your data
use geom_bar() when your dataset contains raw, unaggregated data. This function automatically counts the number of occurrences in each category to create the bar heights
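
To make the difference concrete, here is a minimal sketch using a small made-up dataset (raw_tests and summarised_tests are hypothetical and not part of the course data). It draws the same chart twice, once from raw rows with geom_bar() and once from pre-counted values with geom_col(), and then uses layer_data() to inspect the bar heights ggplot2 ends up with.

```r
library(ggplot2)

# Hypothetical raw data: one row per driving test taken
raw_tests <- data.frame(
  result = c("Pass", "Fail", "Pass", "Pass", "Fail", "Pass")
)

# geom_bar() counts the rows in each category for you
bar_plot <- ggplot(raw_tests, aes(x = result)) +
  geom_bar()

# The same data summarised beforehand: one row per category
summarised_tests <- as.data.frame(
  table(result = raw_tests$result),
  responseName = "n"
)

# geom_col() uses the pre-computed values as the bar heights
col_plot <- ggplot(summarised_tests, aes(x = result, y = n)) +
  geom_col()

# Bar heights used by each plot
layer_data(bar_plot)$count
layer_data(col_plot)$y
```

Both plots should show the same bars; the only difference is whether the counting happens before plotting or inside ggplot2.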

To create a stacked bar chart, include the argument position = "stack" in your geom function. This is the default behaviour for bar charts in ggplot2, and it stacks the bars for different groups on top of each other within each category.

Code

ggplot(driving_bar_data, aes(x = Financial_Year, y = Pass_rate, fill = Gender)) +
  geom_col(position = "stack") +
  # Let's make it presentable
  scale_fill_brewer(palette = "YlGnBu") +
  theme_classic()

To create a stacked bar chart in ggplot2::, your data needs to be structured appropriately for this type of chart. Typically, bar charts are used to visualise categorical data, such as “types of transport” or “age range”. Therefore, your dataset should include at least one categorical variable and one numerical variable.

For a stacked bar chart, one categorical variable is used to group the bars along the X-axis, while a second categorical variable defines the segments that are stacked within each bar.

When working with data that contains two categorical variables, an alternative to stacking is to use dodged bar charts. In a dodged bar chart, bars for each category are placed side by side rather than stacked, making it easier to compare individual values directly. Again, you can use geom_bar() or geom_col(), but this time you set the position argument to "dodge".

Code

ggplot(driving_bar_data, aes(x = Financial_Year, y = Pass_rate, fill = Gender)) +
  geom_col(position = "dodge") +
  # Let's make it presentable
  scale_fill_brewer(palette = "YlGnBu") +
  theme_classic()

Choosing between stacked and dodged bar charts depends on your goal:

use stacked bar charts if your focus is on the overall totals or the contribution of parts to a whole
use dodged bar charts if you want to compare individual categories clearly without the complexity of stacking

Selecting the right style ensures your chart is clear, easy to interpret, and suited to the data you’re presenting.

The table below shows the data prepped in the format for the stacked bar chart above. There are two categorical variables (Gender and Financial_Year) and one numerical variable (Pass_rate).

Financial_Year	Gender	Pass_rate
2012	Female	44
2012	Male	51
2013	Female	44
2013	Male	51
2014	Female	44
2014	Male	51

Once your data is ready, start by calling the ggplot() function, specify your data in the first argument, then open the aes() function and define your x and y variables. Add a geom_col() layer. So far, this will achieve a simple bar chart.

Code

ggplot(driving_bar_data, aes(x = Financial_Year, y = Pass_rate)) +
  geom_col()

Next, we need to add some additional parameters that allow for stacking. Use your second categorical variable in a ‘fill’ parameter within the aes() function. The ‘fill’ argument uses colour to create the stacks in the chart.

Code

ggplot(driving_bar_data, aes(x = Financial_Year, y = Pass_rate, fill = Gender)) +
  geom_col()

Finally, to transform your data into a stacked bar chart, use the geom_col() geom function, and set the argument ‘position’ to “stack”. The ‘position’ argument specifies how the columns should be positioned.

Code

ggplot(driving_bar_data, aes(x = Financial_Year, y = Pass_rate, fill = Gender)) +
  geom_col(position = "stack")

2.2.2 Best practice

Generally, we should limit a stack to four categories. With more than four, the chart can become cluttered and difficult to read. In that case, it may be more useful to label the stacks directly rather than expect readers to match colours against a legend.

We can do this by adding a geom_text() layer. Use the label argument to specify what data you would like to appear on the chart as labels, then use the position argument again to position the labels in a stacked format.
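
As a rough illustration, the sketch below labels each segment of a stacked bar with its value. The pass_rates data frame is a hypothetical stand-in that copies a few rows from the driving pass-rate table shown earlier. It also uses position_stack(vjust = 0.5), a ggplot2 helper that stacks the labels in the same way as the bars but places each one in the middle of its segment rather than at the top.

```r
library(ggplot2)

# Hypothetical data in the same shape as the pass-rate table
pass_rates <- data.frame(
  Financial_Year = factor(c(2012, 2012, 2013, 2013)),
  Gender = c("Female", "Male", "Female", "Male"),
  Pass_rate = c(44, 51, 44, 51)
)

labelled_plot <- ggplot(
  pass_rates,
  aes(x = Financial_Year, y = Pass_rate, fill = Gender)
) +
  geom_col(position = "stack") +
  # Stack the labels like the bars, centred within each segment
  geom_text(
    aes(label = Pass_rate),
    position = position_stack(vjust = 0.5)
  ) +
  theme_classic()

labelled_plot
```

Using position = "stack" on its own also works, but it anchors each label at the top edge of its segment, which usually needs a further vjust adjustment to keep the text inside the bar.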

2.2.3 Exercise

The table below shows the data you will be using in the exercise. We have two categorical variables (year and transport) and one numerical variable (usage).

year	transport	usage
2020	Busses	39
2020	Cars/Vans	76
2020	Cycling	133
2020	Rail	29
2020	TFL Bus	43
2020	TFL Tube	25

In a ggplot() function, use the transport_bar_data dataset to create a stacked bar chart. Plot year on the X axis, usage on the Y axis, and the remaining categorical variable as the fill parameter
Add the geom_col() layer and define the position argument
Add data labels on the stacks using a geom_text() layer. Use the label argument inside an aes() call to define which variable from the data to use for the labels, and use position again to set the position of the labels
Add a built-in theme to easily format the chart (e.g. theme_classic(), theme_bw(), theme_minimal())
Change the colour palette using the scale_fill_brewer() function. Use a colour palette from the RColorBrewer package

HINT:

Code

geom_text(aes(label = transport), position = "stack")

SOLUTION 2.2.3.

Code

ggplot(transport_bar_data, aes(x = year, y = usage, fill = transport)) +
  geom_col(position = "stack") +
  geom_text(aes(label = transport), position = "stack", vjust = 1.8, size = 3) +
  # Let's make it presentable
  scale_fill_brewer(palette = "PuBuGn")+
  theme_classic()

2.2.4 Box plots

Box plots are a powerful way to visualise the distribution of a dataset. They clearly show the range, median, quartiles, and outliers, making it easier to identify values that differ significantly from most of the data. When multiple distributions are plotted together, box plots allow for quick comparisons of spread, skewness, and central tendency.
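
To show what those summary values are, here is a minimal sketch that works them out by hand for a small made-up dataset (scores is hypothetical) and compares them with the statistics geom_boxplot() computes. It assumes ggplot2's default settings, where points further than 1.5 times the interquartile range from the box are drawn as outliers.

```r
library(ggplot2)

# Hypothetical values, with one unusually large observation
scores <- data.frame(
  group = "A",
  value = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)
)

# Quartiles and interquartile range, computed by hand
quartiles <- quantile(scores$value, probs = c(0.25, 0.5, 0.75))
iqr <- quartiles[["75%"]] - quartiles[["25%"]]

# Points beyond 1.5 * IQR from the box count as outliers
lower_fence <- quartiles[["25%"]] - 1.5 * iqr
upper_fence <- quartiles[["75%"]] + 1.5 * iqr
is_outlier <- scores$value < lower_fence | scores$value > upper_fence

# Whiskers stop at the most extreme values that are not outliers
whiskers <- range(scores$value[!is_outlier])

# The statistics ggplot2 computes for the same data
box_plot <- ggplot(scores, aes(x = group, y = value)) +
  geom_boxplot()
box_stats <- layer_data(box_plot)
box_stats[, c("ymin", "lower", "middle", "upper", "ymax")]
box_stats$outliers[[1]]
```

In the output, lower, middle and upper are the edges and centre line of the box, ymin and ymax are the ends of the whiskers, and the outliers column holds the points drawn individually.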

To create a box plot in ggplot2, your data must include at least one categorical variable (for grouping on the x-axis) and one numerical variable (for the y-axis). For example, let’s use the same dataset we used for the stacked bar chart to visualise pass rates by gender.

The geom_boxplot() function is the most commonly used function to create box plots. It handles all the elements of a box plot, including the box, whiskers, and outliers. However, when you want to customise or control the error bars (whiskers) separately, you can use the stat_boxplot() function.

The stat_boxplot() function calculates the statistics for the error bars (whiskers) and lets you render them separately by setting geom = "errorbar". This is useful when you want to control the visual layering, such as ensuring that the error bars appear behind the box for better clarity.

You can use stat_boxplot() without geom_boxplot() if you only want to visualise the error bars, but it is more common to use them together for a complete box plot. Conversely, you can use geom_boxplot() alone when you do not need separate control over the error bars.

Example:

Code

pass_data_2020 <- raw_driving_pass_data %>%  
  dplyr::filter(Financial_Year == "2020")

ggplot(pass_data_2020, aes(x = Gender, y = Pass_rate)) +
  stat_boxplot(geom = "errorbar",
               # width of the errorbar
               width = 0.1, 
               # line weight
               lwd = 0.3) +
  geom_boxplot(lwd = 0.3)

Start by calling ggplot() and mapping your x variable (categorical) and y variable (numerical) in the aes() function
Use the stat_boxplot() to create the error bars (whiskers)

set the geom argument to "errorbar" to specify the type of plot
adjust the width to control how wide the error bars extend horizontally
use the lwd argument to change the line thickness

Adding the error bars first ensures they are drawn below the box layer, avoiding visual overlap.

Use geom_boxplot() to create the box part of the plot

you can customise the box appearance with arguments like lwd to adjust the line thickness or other aesthetic options

2.2.5 Exercise

In a ggplot function, use the transport_bar_data dataset and plot a boxplot with transport as the X axis, and usage as the Y axis.
Add a built-in theme to improve the appearance of your boxplot

SOLUTION 2.2.5.

Code

ggplot(transport_bar_data, aes(x = transport, y = usage)) +
  stat_boxplot(geom = "errorbar",
               #width of the errorbar
               width = 0.1, 
               #line weight
               lwd = 0.3) +
  geom_boxplot(lwd = 0.3) +
  theme_bw()

2.2.6 Break

This is a good opportunity to take a 10-minute break away from the computer to refresh your mind, stretch, and reset before continuing onto the next section.

2.2.7 Line Graph

Line graphs are good for displaying trends over time or across continuous variables. They are useful for comparing multiple trends, identifying patterns, showing changes over time and, for quality assurance, highlighting outliers that show up as unusual trends.

To construct a line graph, ensure your data is in a tidy format. If you want to plot multiple lines, all the categories need to be in one column, which can be done by pivoting columns into a longer format using tidyr::pivot_longer(). More information about tidying data can be found in our Tidy Data in R course.

For example, if we wanted to compare the trend between mean driving tests conducted and mean tests passed using the dataset below, we would not be able to plot both lines because the categories we want to plot are in different columns.

For the line graph we want to plot, there are three pieces of information to include: the categories (tests conducted and tests passed), the values, and the years. When data is not in the right format, it is tempting to add a separate geom_line() layer for each line, as in the example below. This approach is best avoided: each layer is drawn independently on top of the previous one, so ggplot2 cannot treat the lines as categories of one variable, no legend is produced, and every additional line needs another layer.

Code

bad_driving_data <- raw_driving_pass_data %>% 
  dplyr::group_by(Financial_Year) %>% 
  dplyr::summarise(mean_conducted = mean(Conducted), mean_passed = mean(Passed))

Financial_Year	mean_conducted	mean_passed
2007	73405.88	32459.04
2008	72444.75	32811.67
2009	63901.04	29325.79
2010	66899.62	31002.33
2011	65377.46	30673.21
2012	59853.04	28218.83

Code

bad_driving_data_graph <- 
  ggplot(data = bad_driving_data) +
    # plotting line 1
    geom_line(aes(x = Financial_Year, y =  mean_conducted)) +
    # plotting line 2 - this layer is drawn over the one above, with no legend to tell the lines apart
    geom_line(aes(x = Financial_Year, y = mean_passed))

So let’s put our data in the right format to plot multiple lines on a chart by pivoting the categories into a column of their own.

Code

good_driving_data <- raw_driving_pass_data %>% 
  tidyr::pivot_longer(cols = c(Conducted, Passed), 
               names_to = "category", 
               values_to = "value") %>% 
  dplyr::group_by(Financial_Year, category) %>% 
  dplyr::summarise(value = mean(value/100))

Financial_Year	category	value
2007	Conducted	734.0588
2007	Passed	324.5904
2008	Conducted	724.4475
2008	Passed	328.1167
2009	Conducted	639.0104
2009	Passed	293.2579

Now we plot.

To plot data on driving tests conducted and driving tests passed is now pretty simple.

As usual, begin with the ggplot() function and assign the data you would like to use. This time, we’re using good_driving_data. Then add a new layer and use geom_line() to transform your data into a line graph.

Next, in the aes() function, since we want to observe the data over a span of time, we assign Financial_Year to the x axis and the value column to the y axis. To plot our third piece of information, the two categories we are interested in, we set the group aesthetic to category inside aes(). This tells ggplot2 to identify the different groupings in the category column and plot them separately. You can also map the colour aesthetic to category to give each line a different colour.

Code

ggplot(data = good_driving_data) +
  geom_line(aes(y = value, x = Financial_Year, group = category, color = category)) +
  # and add a theme layer for good measure
  theme_classic()

2.2.8 Exercise

In a ggplot function, plot the usage of TFL Bus and TFL Tube over the year 2022. Use the transport_line_data dataset. Using the geom_line() function, set the X axis, Y axis, group and colour arguments
Add a different theme this time to change the appearance of the plot

SOLUTION 2.2.8.

Code

ggplot(data = transport_line_data) +
  geom_line(aes(y = mean_usage, x = month, group = transport, color = transport)) +
  theme_classic() +
  scale_colour_brewer(palette = "Dark2")

2.2.9 Scatter Plot

Scatter plots are quite simple to create since they involve plotting only two numerical variables.

We begin by specifying our data in the ggplot function. We are using the raw_travel_data as our variables exist in separate columns in this dataset.

date	weekday	cars	light_commercial_vehicles	heavy_goods_vehicles	all_motor_vehicles	national_rail	tfl_tube	tfl_bus	busses	cycling
2020-03-01	Sun	1.03	1.11	1.08	1.04	0.95	1.04	1.02	1	0.89
2020-03-02	Mon	1.02	1.06	1.03	1.03	0.95	0.95	0.97	1	0.89
2020-03-03	Tue	1.01	1.05	1.02	1.02	0.95	0.95	0.96	1	0.89
2020-03-04	Wed	1.01	1.04	1.03	1.01	0.96	0.95	0.97	1	0.89
2020-03-05	Thu	1.00	1.03	1.02	1.00	0.96	0.92	0.92	1	0.89
2020-03-06	Fri	1.02	1.03	1.02	1.02	0.99	0.92	0.96	1	0.89

Since the x and y variables are all that is needed for a scatter plot, we could add the geom_point() layer and leave it there. However, there are some useful arguments worth exploring for scatter plots:

Shape allows you to choose a different marker to plot your data. There are 26 shapes to choose from, numbered 0 to 25
Size lets you choose the size of the points
Jitter, set through the position argument, spreads your points apart a little if too many overlap each other
Alpha sets the transparency of your points. This can be useful when there are lots of overlapping points clumped together
Colour lets you set the colour of the points
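
One thing to be aware of is that position = "jitter" moves the points by a random amount, so the chart looks slightly different each time it is drawn. As a minimal sketch, the example below uses ggplot2's position_jitter() helper with made-up data (overlap is hypothetical) to control how far the points may move and to fix the random seed so the result is reproducible.

```r
library(ggplot2)

# Hypothetical data where many points sit exactly on top of each other
overlap <- data.frame(
  x = rep(1:3, each = 20),
  y = rep(1:3, each = 20)
)

jittered <- ggplot(overlap, aes(x = x, y = y)) +
  geom_point(
    # Move points at most 0.1 left or right, never up or down,
    # and fix the seed so every render is identical
    position = position_jitter(width = 0.1, height = 0, seed = 42),
    alpha = 0.4
  )

jittered
```

Keeping the jitter small matters because it changes where the points are drawn, so a large amount can misrepresent the underlying values.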

Another useful layer for a scatter plot is a line of best fit, which we can add using the geom_smooth() function. On its own, geom_smooth() draws a smooth curve through the points with a confidence band around it. To draw a straight line instead, set the method argument to "lm" and the se argument to FALSE. "lm" stands for linear model, and se = FALSE suppresses the confidence interval.

Code

ggplot(data = raw_travel_data, aes(x = busses, y = national_rail)) +
   geom_point(shape = 3, size = 3,  position = "jitter", alpha = 0.4, colour = "black") +
   geom_smooth(method = "lm", se = FALSE) +
   theme_classic()
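
To see where the straight line comes from, here is a minimal sketch with made-up data (travel is hypothetical and unrelated to raw_travel_data). It fits the same linear model directly with lm() and then pulls the points that geom_smooth() draws out of the plot with layer_data(), so the two can be compared.

```r
library(ggplot2)

# Hypothetical data with a roughly linear relationship
travel <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)
)

# Fit the linear model directly
fit <- lm(y ~ x, data = travel)
coef(fit)  # intercept and slope of the line of best fit

scatter <- ggplot(travel, aes(x = x, y = y)) +
  geom_point() +
  # formula is given explicitly to avoid the informational message
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  theme_classic()

# The second layer holds the points along the fitted line
smooth_line <- layer_data(scatter, 2)
head(smooth_line[, c("x", "y")])
```

The y values in smooth_line should match the predictions from fit at the same x values, which means the coefficients from lm() describe the line shown on the chart.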