Chapter 3 Scales
The x and y axes can be changed using the scale_x_ and scale_y_ groups of functions.
Scales in ggplot2 can be broadly categorized into two types:
- positional scales
- non-positional scales
Positional scales determine the position of graphical elements along the x and y axes. They can represent continuous or discrete variables. Examples of positional scales include scale_x_continuous() and scale_y_discrete().
Non-positional scales map variables to aesthetics other than position, such as color, size, shape and fill. They can also represent categorical or continuous variables. An example of a non-positional scale is scale_color_manual(), which we covered earlier.
Positional scales matter in statistical analysis because they ensure that your data is represented consistently. By using the same scale for a given variable, you can compare and interpret the data across different plots or layers.
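As a rough illustration of the two types, the sketch below puts one positional and one non-positional scale on the same plot. It uses the built-in mtcars data rather than the course data, and the colour codes and axis limits are arbitrary choices made for this example.

```r
library(ggplot2)

# mtcars ships with R; cyl is converted to a factor so it is treated as discrete
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  # Positional scale: controls where points sit along the x axis
  scale_x_continuous(name = "Weight (1000 lbs)", limits = c(1, 6)) +
  # Non-positional scale: controls which colour each cylinder group gets
  scale_color_manual(
    name = "Cylinders",
    values = c("4" = "#1b9e77", "6" = "#d95f02", "8" = "#7570b3")
  )
p
```

Reusing the same scale_x_continuous() call on another plot of wt would give both plots an identical x axis, which is what makes them directly comparable.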
3.1 Date scale
Let’s first explore scale_x_date().
The scale_x_date() function is used to customise the x-axis scale for date data in a plot (columns of class Date; date-time columns use scale_x_datetime() instead). It allows you to control the appearance and formatting of dates on the x-axis, including the labels, breaks, and limits.
Date breaks are a key argument of the scale_x_date() function. The date_breaks parameter lets you control the spacing of the ticks, or breaks, on the date axis. It takes a string naming a unit, such as "years", "months", "weeks" or "days", optionally preceded by a number, as in "2 weeks".
For example, if you have several years' worth of data, you can set date_breaks = "3 months" to create one tick mark per quarter on the x-axis.
Lastly, the date_labels parameter allows you to specify a custom format for the dates on the x-axis, for example changing the date format from day-month-year to month-year. It takes a format string using strftime() codes, such as "%b-%Y". This can be really useful for reducing clutter on the x-axis, or for presenting your chart with quarterly dates.
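The two parameters described above can be combined in a single scale_x_date() call. The sketch below is only illustrative: the monthly data frame and its values are invented for this example and are not part of the course data.

```r
library(ggplot2)

# Hypothetical monthly data covering two years (24 rows)
monthly <- data.frame(
  Date  = seq(as.Date("2020-01-01"), as.Date("2021-12-01"), by = "month"),
  value = seq(100, 330, by = 10)
)

p <- ggplot(monthly, aes(x = Date, y = value)) +
  geom_line() +
  scale_x_date(
    date_breaks = "3 months",  # one tick per quarter
    date_labels = "%b-%Y"      # abbreviated month and four-digit year
  )
p
```

Month abbreviations produced by "%b" follow the locale of the R session, so the labels may differ on machines that are not set to English.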
To use the scale_x_date() function, your data needs to be in the correct format. Sometimes this may involve several steps to convert a column into a date and then into the required display format. See below for how we converted some data into dates.
This is the original data:
Financial_Year | Month | Conducted | Passed | Pass_rate | Gender |
---|---|---|---|---|---|
2007 | April | 62897 | 30053 | 47.78129 | Male |
2007 | May | 67605 | 31865 | 47.13409 | Male |
2007 | June | 72471 | 34364 | 47.41759 | Male |
2007 | July | 73600 | 34716 | 47.16848 | Male |
2007 | August | 67920 | 32210 | 47.42344 | Male |
2007 | September | 70662 | 33228 | 47.02386 | Male |
We then do some data cleaning steps:
library(dplyr)  # provides the %>% pipe

driving_time_series <- raw_driving_pass_data %>%
  # Paste day, month and year together into one string
  # so that as.Date() can parse it
  dplyr::mutate(
    Date = paste("01", Month, Financial_Year, sep = "-"),
    # Convert the string to the Date data type
    Date = as.Date(Date, format = "%d-%B-%Y")
  ) %>%
  # Add a formatted date column, e.g. Apr-2008
  dplyr::mutate(Month_Year = format(Date, "%b-%Y")) %>%
  # Convert data into long format
  tidyr::pivot_longer(cols = c(Conducted, Passed),
                      names_to = "category",
                      values_to = "value") %>%
  # Average over the remaining rows in each group and rescale to hundreds
  dplyr::group_by(Month_Year, Date, category) %>%
  dplyr::summarise(value = mean(value / 100), .groups = "drop") %>%
  dplyr::filter(category == "Passed")
Month_Year | Date | category | value |
---|---|---|---|
Apr-2007 | 2007-04-01 | Passed | 286.190 |
Apr-2008 | 2008-04-01 | Passed | 368.720 |
Apr-2009 | 2009-04-01 | Passed | 303.850 |
Apr-2010 | 2010-04-01 | Passed | 303.210 |
Apr-2011 | 2011-04-01 | Passed | 265.875 |
Apr-2012 | 2012-04-01 | Passed | 266.855 |
To then produce our chart:
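The original chart code is not reproduced here, so the following is only a sketch of how such a chart might be built. To keep it runnable on its own, it re-types the six rows of driving_time_series printed above; the yearly breaks, the "%Y" label format and the axis title are assumptions made for this example.

```r
library(ggplot2)

# The six rows printed above, re-typed so the example runs on its own
driving_time_series <- data.frame(
  Month_Year = c("Apr-2007", "Apr-2008", "Apr-2009",
                 "Apr-2010", "Apr-2011", "Apr-2012"),
  Date = as.Date(c("2007-04-01", "2008-04-01", "2009-04-01",
                   "2010-04-01", "2011-04-01", "2012-04-01")),
  category = "Passed",
  value = c(286.190, 368.720, 303.850, 303.210, 265.875, 266.855)
)

p <- ggplot(driving_time_series, aes(x = Date, y = value)) +
  geom_line() +
  # One tick per year, labelled with the four-digit year only
  scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
  labs(x = NULL, y = "Tests passed (hundreds)")
p
```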
3.2 Continuous scales
A useful y axis scale function is scale_y_continuous().
The scale_y_continuous() function is used to make adjustments to the appearance and behavior of the y-axis. It offers a range of parameters that allow you to set the axis limits, breaks (tick marks) and axis labels.
One of the most interesting ways of using this function is to present the y axis as a percentage scale rather than the original numeric values. To do this, use the labels parameter and set it to scales::percent. The labels then show each value multiplied by 100 with a percentage symbol (%) added. Because the axis values are multiplied by 100, the values in your data need to be proportions. For example, if you have a value of 48 that you want to present as 48%, use the dplyr::mutate() function on your data to change the value to 0.48.
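A minimal sketch of that rescaling step, assuming a hypothetical pass_rates data frame whose Pass_rate column holds whole numbers such as 48:

```r
library(ggplot2)
library(dplyr)

# Hypothetical pass rates stored as whole numbers (48 means 48%)
pass_rates <- data.frame(
  Financial_Year = c(2012, 2013, 2014),
  Pass_rate = c(48, 47, 49)
)

# Rescale to proportions so that scales::percent labels 0.48 as "48%"
pass_rates <- pass_rates %>%
  mutate(Pass_rate = Pass_rate / 100)

p <- ggplot(pass_rates, aes(x = Financial_Year, y = Pass_rate)) +
  geom_line() +
  scale_y_continuous(labels = scales::percent)
p
```

Without the division by 100, the same axis would be labelled in the thousands of percent, which is why the rescaling has to happen before plotting.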
Limits are also very useful to set on your chart. Limits specify the minimum and maximum values of the y axis. This is useful when you have values in your data that you do not want to plot, such as negative values, or values above a certain number. It's also useful for ensuring your data is not presented in a misleading way: a narrow axis range can make a trend line look steeper than it really is. You can set the limits of the y-axis using the limits parameter. For the example above, limits = c(0, 1) would restrict the y axis to show values only between 0% and 100%.
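The sketch below shows limits in use on a percentage axis. The rates data frame is invented for this example, and its last value is a deliberate error placed outside the 0 to 1 range to show what limits does with such values.

```r
library(ggplot2)

# Hypothetical proportions; the last value is a deliberate data error
rates <- data.frame(
  Financial_Year = 2010:2014,
  Pass_rate = c(0.44, 0.46, 0.45, 0.47, 1.20)
)

p <- ggplot(rates, aes(x = Financial_Year, y = Pass_rate)) +
  geom_point() +
  # Fix the axis at 0% to 100%; points outside this range are not drawn
  scale_y_continuous(labels = scales::percent, limits = c(0, 1))
p
```

When the plot is drawn, ggplot2 warns that a row was removed, which is a useful prompt to check whether the excluded value is a genuine outlier or a mistake in the data.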
Another useful option is the expand parameter inside scale_y_continuous(). By default, ggplot2 adds padding around your data so points and bars do not sit directly on the axes. If you want to remove that padding and have the data start exactly at the axis line (for example, at y = 0), you can set expand = c(0, 0). This is particularly useful for bar charts or when aligning with horizontal or vertical reference lines.
For example:
Financial_Year | Gender | Pass_rate |
---|---|---|
2012 | Female | 0.44 |
2012 | Male | 0.51 |
2013 | Female | 0.44 |
2013 | Male | 0.51 |
2014 | Female | 0.44 |
2014 | Male | 0.51 |
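One way the pass-rate table above might be charted with the padding removed is sketched below. The table is re-typed so the example runs on its own; the choice of a dodged bar chart and the axis titles are assumptions, not necessarily the chart used in the original course material.

```r
library(ggplot2)

# The six rows of the table above, re-typed as a data frame
pass_rates <- data.frame(
  Financial_Year = rep(c(2012, 2013, 2014), each = 2),
  Gender = rep(c("Female", "Male"), times = 3),
  Pass_rate = rep(c(0.44, 0.51), times = 3)
)

p <- ggplot(pass_rates,
            aes(x = factor(Financial_Year), y = Pass_rate, fill = Gender)) +
  geom_col(position = "dodge") +
  scale_y_continuous(
    labels = scales::percent,  # show proportions as percentages
    limits = c(0, 1),          # axis runs from 0% to 100%
    expand = c(0, 0)           # no padding, so bars sit on the x axis
  ) +
  labs(x = "Financial year", y = "Pass rate")
p
```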
Alternatively, you can use expand = expansion(mult = c(0, 0)), which is a newer and more flexible approach, allowing different levels of expansion at the low and high ends of the axis.
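To illustrate the different low and high expansion, the sketch below removes the padding under the bars but keeps a small gap above them. The data frame and the 5% figure are arbitrary choices for this example.

```r
library(ggplot2)

# Hypothetical pass rates, stored as proportions
pass_rates <- data.frame(
  Financial_Year = c("2012", "2013", "2014"),
  Pass_rate = c(0.44, 0.46, 0.45)
)

p <- ggplot(pass_rates, aes(x = Financial_Year, y = Pass_rate)) +
  geom_col() +
  scale_y_continuous(
    labels = scales::percent,
    # mult = c(low, high): no padding below the bars, 5% of the range above
    expand = expansion(mult = c(0, 0.05))
  )
p
```

expansion() also has an add argument for padding in data units rather than as a fraction of the axis range.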
3.2.1 Exercise
- Use the transport_percent data to chart rail usage in the year 2020. Set x to month. Use the scale_y_continuous() function to present the usage as a percentage instead of integers.
SOLUTION 3.2.1.
Code
expand_limits() is a separate helper function and is not part of scale_y_continuous(), but it can be used alongside it. While limits sets the axis limits and removes data outside that range, expand_limits() extends the axis range to include certain values without removing any data. This is useful when you want to make sure a particular value (like 0) appears on the axis, even if it’s not in the dataset.
For example, expand_limits(y = 0) will ensure that 0 is included on the y-axis even if your data starts above it, which is useful for giving visual context or grounding a chart at zero.
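A short sketch of expand_limits() used alongside scale_y_continuous(), with a hypothetical rates data frame whose values all sit well above zero:

```r
library(ggplot2)

# Hypothetical proportions, all well above zero
rates <- data.frame(
  Financial_Year = 2012:2014,
  Pass_rate = c(0.44, 0.46, 0.45)
)

p <- ggplot(rates, aes(x = Financial_Year, y = Pass_rate)) +
  geom_line() +
  # Stretch the y axis down to 0 without dropping any rows
  expand_limits(y = 0) +
  scale_y_continuous(labels = scales::percent)
p
```

Without the expand_limits() call, the axis would cover only the narrow band from 44% to 46% and the small year-to-year changes would look much larger.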
In summary:
- use limits to constrain the axis range
- use expand to control padding around the data
- use expand_limits() to include specific values in the axis range without filtering the data
Together, these options give you full control over how your y-axis appears and behaves.