Chapter 2 Creating (basic) plots
Before exploring ggplot2 further, let’s revisit the basics of plotting with ggplot2, as covered in “Introduction to R - Chapter 3 Basic Plotting”.
2.1 Structure
The main function in ggplot2 is ggplot(), which is used to start a plot. ggplot2 plots are built in layers, with each layer adding to the appearance or functionality of the chart. Layers are added using the + symbol.
The basic structure of a ggplot2 chart is:
- Call the ggplot() function:
  - Specify the dataset to use
  - Map your x and y values using the aes() function
- Add a geom function:
  - The geom defines the type of plot, such as a bar chart, line graph or scatter plot
  - It also determines how the data and aesthetic (appearance) mappings are transformed into the visualisation
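The structure above can be sketched as follows. This is a minimal example, assuming a small stand-in for driving_bar_data built from the Female rows of the pass rate table shown later in this chapter; the course's full dataset is not reproduced here.

```r
library(ggplot2)

# Stand-in for the course's driving_bar_data (assumption: Female rows only,
# copied from the pass rate table later in this chapter)
driving_bar_data <- data.frame(
  Financial_Year = c("2012", "2013", "2014"),
  Pass_rate = c(44, 44, 44)
)

# 1. Call ggplot(), giving it the dataset and the x/y mappings inside aes()
# 2. Add a geom layer with + to choose the type of plot
basic_plot <- ggplot(data = driving_bar_data, aes(x = Financial_Year, y = Pass_rate)) +
  geom_col()

basic_plot
```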
2.1.2 Adding aesthetics: Themes
To make this chart look a bit more presentable we can simply add on one of ggplot’s built-in themes as a layer. A theme is a quick and easy way to set many of the visual aspects of your charts, such as the appearance of grid lines, size of text, and position of the legends.
Some themes to choose from:
theme_classic()
theme_bw()
theme_minimal()
theme_light()
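As a rough sketch of how a built-in theme is added, the example below attaches theme_minimal() as one more layer. The driving_bar_data frame here is a small stand-in (an assumption for illustration), not the course's real dataset.

```r
library(ggplot2)

# Stand-in data (assumption): three rows in the shape of driving_bar_data
driving_bar_data <- data.frame(
  Financial_Year = c("2012", "2013", "2014"),
  Pass_rate = c(44, 44, 44)
)

# A built-in theme is added with + like any other layer
themed_plot <- ggplot(data = driving_bar_data, aes(x = Financial_Year, y = Pass_rate)) +
  geom_col() +
  theme_minimal()

themed_plot
```

Swapping theme_minimal() for theme_classic(), theme_bw() or theme_light() changes the look without touching the rest of the code.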
You can also customise your charts further by creating your own theme in ggplot2. Themes control non-data elements like fonts, gridlines, and background colours, allowing you to match charts to a specific style or branding.
Here’s an example of a basic chart with a custom theme applied:
Code
library(ggplot2)

# Define a custom theme
custom_theme <- theme(
  panel.background = element_rect(fill = "white"),
  panel.grid.major = element_line(colour = "grey80"),
  axis.text = element_text(size = 10)
)

# Build the bar chart, then add the custom theme as a layer
ggplot(data = driving_bar_data, aes(x = Financial_Year, y = Pass_rate)) +
  geom_col() +
  custom_theme
2.1.3 Adding aesthetics: Colour
Colour is a powerful way to represent an additional dimension of your data, helping to distinguish between different categories effectively.
In ggplot2, you can use scale_colour_brewer() or scale_fill_brewer() to apply ColorBrewer palettes. These palettes are visually appealing and many are colour-blind friendly, making them an excellent choice for accessible data visualisation.
When customising colours in a chart, it’s important to understand the difference:
- colour is used for lines and points (e.g. scale_colour_brewer())
- fill is used for the interior colour of objects like bars (e.g. scale_fill_brewer())
This distinction ensures your chart looks exactly as intended, with the right colours applied to the right elements.
You can see the full range of palettes available with their names here:
Simply add the function as another layer and state the palette you would like to use.
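A minimal sketch of adding a palette layer is shown below. It assumes a stand-in for driving_bar_data copied from the pass rate table in section 2.2.1, and the choice of the "Set2" palette is only an example.

```r
library(ggplot2)

# Stand-in data (assumption): rows copied from the pass rate table in section 2.2.1
driving_bar_data <- data.frame(
  Financial_Year = rep(c("2012", "2013", "2014"), each = 2),
  Gender = rep(c("Female", "Male"), times = 3),
  Pass_rate = rep(c(44, 51), times = 3)
)

# Bars are filled objects, so map Gender to fill and use scale_fill_brewer();
# for lines or points you would map colour and use scale_colour_brewer() instead
palette_plot <- ggplot(driving_bar_data, aes(x = Financial_Year, y = Pass_rate, fill = Gender)) +
  geom_col() +
  scale_fill_brewer(palette = "Set2")

palette_plot
```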
2.2 Types of charts: Geoms
In this section, we are going to cover the most common charts people use for their analysis: Bar Charts, Box plots, Line Graphs and Scatter plots.
2.2.1 Bar Chart
Bar charts are one of the most commonly used types of visualisation because they are excellent for comparing the sizes of different categories.
Stacked bar charts are particularly useful when you want to compare multiple part-to-whole relationships within a single chart. However, if the data doesn’t represent part-to-whole relationships, caution is needed, as stacked bar charts can quickly become cluttered and difficult to interpret, especially with many categories.
In ggplot2, you can create bar charts using either geom_col() or geom_bar(), depending on your data:
- use geom_col() when you already have a dataset with pre-summarised values (e.g. totals or percentages). This function directly maps the height of bars to values in your data
- use geom_bar() when your dataset contains raw, unaggregated data. This function automatically counts the number of occurrences in each category to create the bar heights
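The difference between the two functions can be seen in a small sketch. Both data frames below are made-up examples (the names summary_data and raw_data are assumptions, not course datasets), chosen so that the two charts show the same bars.

```r
library(ggplot2)

# Pre-summarised data: one row per category, bar height already calculated
summary_data <- data.frame(
  transport = c("Bus", "Rail"),
  journeys = c(2, 1)
)
col_plot <- ggplot(summary_data, aes(x = transport, y = journeys)) +
  geom_col()

# Raw data: one row per observation, so geom_bar() counts rows per category
raw_data <- data.frame(transport = c("Bus", "Bus", "Rail"))
bar_plot <- ggplot(raw_data, aes(x = transport)) +
  geom_bar()

col_plot
bar_plot
```

Note that geom_bar() only needs an x mapping, because it calculates the y values itself.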
To create a stacked bar chart, include the argument position = "stack" in your geom function. This is the default behaviour for bar charts in ggplot2, and it stacks the bars for different groups on top of each other within each category.
To create a stacked bar chart in ggplot2, your data needs to be structured appropriately for this type of chart. Typically, bar charts are used to visualise categorical data, such as “types of transport” or “age range”. Therefore, your dataset should include at least one categorical variable and one numerical variable.
For a stacked bar chart, one categorical variable is used to group the bars along the X-axis, while a second categorical variable defines the segments that are stacked within each bar.
When working with data that contains two categorical variables, an alternative to stacking is to use dodged bar charts. In a dodged bar chart, bars for each category are placed side by side rather than stacked, making it easier to compare individual values directly. Again you can use geom_bar() or geom_col(), but this time you set the position argument to "dodge".
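A dodged bar chart might look like the sketch below, assuming a stand-in for driving_bar_data copied from the pass rate table in this section.

```r
library(ggplot2)

# Stand-in data (assumption): rows copied from the pass rate table in this section
driving_bar_data <- data.frame(
  Financial_Year = rep(c("2012", "2013", "2014"), each = 2),
  Gender = rep(c("Female", "Male"), times = 3),
  Pass_rate = rep(c(44, 51), times = 3)
)

# position = "dodge" places the Female and Male bars side by side within each year
dodged_plot <- ggplot(driving_bar_data, aes(x = Financial_Year, y = Pass_rate, fill = Gender)) +
  geom_col(position = "dodge")

dodged_plot
```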
Choosing between stacked and dodged bar charts depends on your goal:
- use stacked bar charts if your focus is on the overall totals or the contribution of parts to a whole
- use dodged bar charts if you want to compare individual categories clearly without the complexity of stacking
Selecting the right style ensures your chart is clear, easy to interpret, and suited to the data you’re presenting.
The table below shows the data prepped in the format for the stacked bar chart above. There are two categorical variables (Gender and Financial_Year) and one numerical variable (Pass_rate).
Financial_Year | Gender | Pass_rate |
---|---|---|
2012 | Female | 44 |
2012 | Male | 51 |
2013 | Female | 44 |
2013 | Male | 51 |
2014 | Female | 44 |
2014 | Male | 51 |
Once your data is ready, start by calling the ggplot() function, specify your data in the first argument, then open the aes() function and define your x and y variables. Add a geom_col() layer. So far, this will achieve a simple bar chart.
Next, we need to add some additional parameters that allow for stacking. Use your second categorical variable as the fill parameter within the aes() function. The fill argument uses colour to create the stacks in the chart.
Finally, to transform your data into a stacked bar chart, use the geom_col() geom function and set the position argument to "stack". The position argument specifies how the columns should be positioned.
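Put together, the steps above could look like this sketch. The data frame is a stand-in for driving_bar_data, copied from the table above.

```r
library(ggplot2)

# Stand-in data (assumption): rows copied from the pass rate table above
driving_bar_data <- data.frame(
  Financial_Year = rep(c("2012", "2013", "2014"), each = 2),
  Gender = rep(c("Female", "Male"), times = 3),
  Pass_rate = rep(c(44, 51), times = 3)
)

# x and y define the bars, fill splits each bar into segments by Gender,
# and position = "stack" places the segments on top of each other
stacked_plot <- ggplot(driving_bar_data, aes(x = Financial_Year, y = Pass_rate, fill = Gender)) +
  geom_col(position = "stack")

stacked_plot
```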
2.2.2 Best practice
Generally, we should limit a stack to four categories. With more than four categories in the stack, the chart can get quite cluttered and may be difficult to read. It is also often more useful to label the stacks directly rather than ask the reader to match colours against a legend.
We can do this by adding a geom_text() layer. Use the label argument to specify what data you would like to appear on the chart as labels, then use the position argument again to position the labels in a stacked format.
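One way to do this is sketched below, again with a stand-in for driving_bar_data. Using position_stack(vjust = 0.5) is a choice made for this example: it stacks the labels like the bars and centres each one inside its segment.

```r
library(ggplot2)

# Stand-in data (assumption): rows copied from the pass rate table in section 2.2.1
driving_bar_data <- data.frame(
  Financial_Year = rep(c("2012", "2013", "2014"), each = 2),
  Gender = rep(c("Female", "Male"), times = 3),
  Pass_rate = rep(c(44, 51), times = 3)
)

labelled_plot <- ggplot(driving_bar_data, aes(x = Financial_Year, y = Pass_rate, fill = Gender)) +
  geom_col(position = "stack") +
  # label chooses the values to print; the position argument stacks the labels
  geom_text(aes(label = Pass_rate), position = position_stack(vjust = 0.5))

labelled_plot
```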
2.2.3 Exercise
The table below shows the data you will be using in the exercise. We have two categorical variables (year and transport) and one numerical variable (usage).
year | transport | usage |
---|---|---|
2020 | Busses | 39 |
2020 | Cars/Vans | 76 |
2020 | Cycling | 133 |
2020 | Rail | 29 |
2020 | TFL Bus | 43 |
2020 | TFL Tube | 25 |
- In a ggplot() function, use the transport_bar_data dataset to create a stacked bar chart. Plot year as the X axis, usage as the Y axis, and the remaining categorical variable as the fill parameter
- Add the geom_col() layer and define the position argument
- Add data labels on the stacks using a geom_text() layer. Use the label argument inside an aes() call to define what variable to use from the data for the labels, and use position again to set the position of the labels
- Add a built-in theme to easily format the chart (e.g. theme_classic(), theme_bw(), theme_minimal())
- Change the colour palette using the scale_fill_brewer() function. Use a colour palette from the RColorBrewer package
2.2.4 Box plots
Box plots are a powerful way to visualise the distribution of a dataset. They clearly show the range, median, quartiles, and outliers, making it easier to identify values that differ significantly from most of the data. When multiple distributions are plotted together, box plots allow for quick comparisons of spread, skewness, and central tendency.
To create a box plot in ggplot2, your data must include at least one categorical variable (for grouping on the x-axis) and one numerical variable (for the y-axis). For example, let’s use the same dataset we used for the stacked bar chart to visualise pass rates by gender.
The geom_boxplot() function is the most commonly used function to create box plots. It handles all the elements of a box plot, including the box, whiskers, and outliers. However, when you want to customise or control the error bars (whiskers) separately, you can use the stat_boxplot() function.
The stat_boxplot() function calculates the statistics for the error bars (whiskers) and lets you render them separately by setting geom = "errorbar". This is useful when you want to control the visual layering, such as ensuring that the error bars appear behind the box for better clarity.
You can use stat_boxplot() without geom_boxplot() if you only want to visualise the error bars, but it is more common to use them together for a complete box plot. Conversely, you can use geom_boxplot() alone when you do not need separate control over the error bars.
Example:
- Start by calling ggplot() and mapping your x variable (categorical) and y variable (numerical) in the aes() function
- Use stat_boxplot() to create the error bars (whiskers)
  - set the geom argument to "errorbar" to specify the type of plot
  - adjust the width to control how wide the error bars extend horizontally
  - use the lwd argument to change the line thickness
Adding the error bars first ensures they are drawn below the box layer, avoiding visual overlap.
- Use geom_boxplot() to create the box part of the plot
  - you can customise the box appearance with arguments like lwd to adjust the line thickness or other aesthetic options
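The steps above can be sketched as follows. The pass rates in driving_box_data are made-up values for illustration only (a box plot needs several observations per group), and the width and lwd values are arbitrary choices.

```r
library(ggplot2)

# Made-up illustration data (assumption): five pass rates per gender
driving_box_data <- data.frame(
  Gender = rep(c("Female", "Male"), each = 5),
  Pass_rate = c(42, 43, 44, 45, 47, 49, 50, 51, 52, 53)
)

box_plot <- ggplot(driving_box_data, aes(x = Gender, y = Pass_rate)) +
  # Error bars first, so they are drawn underneath the boxes
  stat_boxplot(geom = "errorbar", width = 0.3, lwd = 0.8) +
  # Then the boxes themselves
  geom_boxplot(lwd = 0.8)

box_plot
```

In recent versions of ggplot2 the line thickness argument is named linewidth; lwd is accepted as an alias for it.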
2.2.5 Exercise
- In a ggplot function, use the transport_bar_data dataset and plot a boxplot with transport as the X axis and usage as the Y axis
- Add a built-in theme to improve the appearance of your boxplot
This is a good opportunity to take a 10-minute break away from the computer to refresh your mind, stretch, and reset before continuing onto the next section.
2.2.6 Line Graph
Line graphs are good for displaying trends over time or across continuous variables. They are useful for comparing multiple trends, identifying patterns, showing changes over time, and spotting unusual trends that may point to outliers during quality assurance.
To construct a line graph, ensure your data is in a tidy format. If you want to plot multiple lines, all the categorical variables need to be in one column, which can be done by pivoting columns into a longer format using tidyr::pivot_longer(). More information about tidying data can be found in our Tidy Data in R course.
For example, if we wanted to compare the trend between mean driving tests conducted and mean tests passed using the dataset below, we would not be able to plot both lines because the categories we want to plot are in different columns.
For the line graph we want to plot, there are three pieces of information to include: the categories (tests conducted and tests passed), the values, and the years. When data is not in the correct format, it can be tempting to use a separate function to plot each line, but in the end this does not work because each line overwrites the other.
Financial_Year | mean_conducted | mean_passed |
---|---|---|
2007 | 73405.88 | 32459.04 |
2008 | 72444.75 | 32811.67 |
2009 | 63901.04 | 29325.79 |
2010 | 66899.62 | 31002.33 |
2011 | 65377.46 | 30673.21 |
2012 | 59853.04 | 28218.83 |
So let’s put our data in the right format to plot multiple lines on a chart by pivoting the categories into a column of their own.
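A minimal sketch of the pivot is shown here. The wide_driving_data name is an assumption, and the values are the first three rows of the wide table above. Only the pivot itself is shown; any renaming or rescaling done in the course's own code is not reproduced.

```r
library(tidyr)

# Stand-in for the wide dataset (assumption): first three rows of the table above
wide_driving_data <- data.frame(
  Financial_Year = c(2007, 2008, 2009),
  mean_conducted = c(73405.88, 72444.75, 63901.04),
  mean_passed = c(32459.04, 32811.67, 29325.79)
)

# Move the two measure columns into a single category column and a value column
long_driving_data <- pivot_longer(
  wide_driving_data,
  cols = c(mean_conducted, mean_passed),
  names_to = "category",
  names_prefix = "mean_",  # drop the "mean_" prefix from the category names
  values_to = "value"
)

long_driving_data
```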
Financial_Year | category | value |
---|---|---|
2007 | Conducted | 734.0588 |
2007 | Passed | 324.5904 |
2008 | Conducted | 724.4475 |
2008 | Passed | 328.1167 |
2009 | Conducted | 639.0104 |
2009 | Passed | 293.2579 |
Plotting driving tests conducted and driving tests passed is now straightforward.
As usual, begin with the ggplot function and assign the data you would like to use. This time, we’re using good_driving_data. Then add a new layer and use geom_line() to transform your data into a line graph.
Next, in the aes() function, since we want to observe the data over a span of time, we assign Financial_Year to the x axis and the value column to the y axis. To plot our third piece of information, the two categories we are interested in, we set the group argument in the geom_line() function to category. This tells ggplot to identify the different groupings in the category column in the data and plot them separately. You can also use the colour argument to give each line a different colour if you find it useful.
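As a sketch, assuming a stand-in for good_driving_data copied from the long-format table above:

```r
library(ggplot2)

# Stand-in for good_driving_data (assumption): rows copied from the long table above
good_driving_data <- data.frame(
  Financial_Year = rep(c(2007, 2008, 2009), each = 2),
  category = rep(c("Conducted", "Passed"), times = 3),
  value = c(734.0588, 324.5904, 724.4475, 328.1167, 639.0104, 293.2579)
)

# group draws one line per category; colour gives each line its own colour
line_plot <- ggplot(good_driving_data) +
  geom_line(aes(x = Financial_Year, y = value, group = category, colour = category))

line_plot
```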
2.2.7 Exercise
- In a ggplot function, plot the usage of TFL Bus and TFL Tube over the year 2022. Use the transport_line_data dataset. Using the geom_line() function, set the X axis, Y axis, group and colour arguments
- Add a different theme this time to change the appearance of the plot
2.2.8 Scatter Plot
Scatter plots are quite simple to create since they involve plotting only two numerical variables.
We begin by specifying our data in the ggplot function. We are using the raw_travel_data as our variables exist in separate columns in this dataset.
date | weekday | cars | light_commercial_vehicles | heavy_goods_vehicles | all_motor_vehicles | national_rail | tfl_tube | tfl_bus | busses | cycling |
---|---|---|---|---|---|---|---|---|---|---|
2020-03-01 | Sun | 1.03 | 1.11 | 1.08 | 1.04 | 0.95 | 1.04 | 1.02 | 1 | 0.89 |
2020-03-02 | Mon | 1.02 | 1.06 | 1.03 | 1.03 | 0.95 | 0.95 | 0.97 | 1 | 0.89 |
2020-03-03 | Tue | 1.01 | 1.05 | 1.02 | 1.02 | 0.95 | 0.95 | 0.96 | 1 | 0.89 |
2020-03-04 | Wed | 1.01 | 1.04 | 1.03 | 1.01 | 0.96 | 0.95 | 0.97 | 1 | 0.89 |
2020-03-05 | Thu | 1.00 | 1.03 | 1.02 | 1.00 | 0.96 | 0.92 | 0.92 | 1 | 0.89 |
2020-03-06 | Fri | 1.02 | 1.03 | 1.02 | 1.02 | 0.99 | 0.92 | 0.96 | 1 | 0.89 |
Since the x and y variables are all that is needed for plotting a scatter plot, we could add the geom_point() layer and leave it there. However, there are some useful arguments worth exploring for scatter plots:
- shape allows you to choose a different marker to plot your data. There are 26 built-in shapes to choose from, numbered 0 to 25
- size lets you choose the size of the point
- jitter in the position argument spreads your points apart a little if there are too many overlapping each other
- alpha sets the transparency of your points. This can be useful when there are lots of overlapping points that are clumped together
- colour lets you set the colour of the points
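The arguments above can be combined in one geom_point() call, as in this sketch. The data frame is a stand-in for raw_travel_data using two columns from the six rows shown above; plotting tfl_tube against tfl_bus, and the specific argument values, are arbitrary choices for illustration.

```r
library(ggplot2)

# Stand-in for raw_travel_data (assumption): two columns from the table above
raw_travel_data <- data.frame(
  tfl_tube = c(1.04, 0.95, 0.95, 0.95, 0.92, 0.92),
  tfl_bus = c(1.02, 0.97, 0.96, 0.97, 0.92, 0.96)
)

scatter_plot <- ggplot(raw_travel_data, aes(x = tfl_tube, y = tfl_bus)) +
  geom_point(
    shape = 17,           # filled triangle
    size = 3,             # larger than the default point size
    alpha = 0.6,          # partly transparent, so overlaps remain visible
    colour = "darkblue",
    position = "jitter"   # nudge points apart slightly at random
  )

scatter_plot
```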
Another useful layer for a scatter plot is a line of best fit, which we can add using the geom_smooth() function. On its own, geom_smooth() will create a smooth line through the points. However, to create a straight line, we set the method argument to "lm" and the se argument to FALSE. "lm" stands for linear model, and setting se to FALSE suppresses the confidence interval.
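A short sketch of a straight line of best fit, using the same kind of stand-in for raw_travel_data (two columns from the table above, chosen arbitrarily for illustration):

```r
library(ggplot2)

# Stand-in for raw_travel_data (assumption): two columns from the table above
raw_travel_data <- data.frame(
  tfl_tube = c(1.04, 0.95, 0.95, 0.95, 0.92, 0.92),
  tfl_bus = c(1.02, 0.97, 0.96, 0.97, 0.92, 0.96)
)

fit_plot <- ggplot(raw_travel_data, aes(x = tfl_tube, y = tfl_bus)) +
  geom_point() +
  # method = "lm" fits a straight line; se = FALSE hides the confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

fit_plot
```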