Chapter 3 Scales

The x and y axes can be changed using the scale_x_*() and scale_y_*() families of functions.

Scales in ggplot2 can be broadly categorized into two types:

positional scales
non-positional scales

Positional scales determine the position of graphical elements on the plot, such as the x and y axes. They are used to represent continuous or discrete variables. Examples of positional scales include scale_x_continuous() and scale_y_discrete().

Non-positional scales map variables to non-position-related aesthetics, such as color, size, shape, fill, and more. They are used to represent categorical or continuous variables. An example of a non-positional scale is scale_color_manual(), which we covered earlier.
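As a minimal sketch of the two types side by side, the chart below uses the built-in mtcars data (not part of this course's datasets). The breaks, limits and colours are arbitrary choices made for illustration.

```r
library(ggplot2)

# mtcars ships with R; cyl is converted to a factor so it can take manual colours
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  # Positional scale: controls where things sit along the x-axis
  scale_x_continuous(breaks = seq(2, 5, by = 1), limits = c(1, 6)) +
  # Non-positional scale: controls how cyl is mapped to colour
  scale_color_manual(values = c("4" = "navy", "6" = "grey50", "8" = "firebrick"),
                     name = "Cylinders") +
  theme_classic()
p
```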

Positional scales matter in statistical analysis because they control how data values are translated into positions on the plot. By using the same scale (the same limits and breaks) for a given variable, you can easily compare and interpret the visualized data across different plots or layers.

3.1 Date scale

Let’s first explore scale_x_date().

The scale_x_date() function is used to customise the x-axis scale for date data (columns of class Date) in a plot. For date-time data, ggplot2 provides scale_x_datetime() instead. scale_x_date() allows you to control the appearance and formatting of dates on the x-axis, including the labelling, breaks, and limits.

Date breaks are also a key argument in the scale_x_date() function. The date_breaks parameter lets you control the positioning of the date axis ticks or breaks. You specify it as a string made up of a unit, such as "years", "months", "weeks" or "days", optionally preceded by a number, for example "2 weeks".

For example, if you have several years' worth of data, you can set date_breaks = "3 month" to create one tick mark per quarter on the x-axis, rather than one for every month.
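A minimal sketch of quarterly breaks is shown here. The monthly data frame and its values are made up purely for illustration and are not one of the course datasets.

```r
library(ggplot2)

# Hypothetical monthly data covering two years (values are invented)
monthly <- data.frame(
  Date  = seq(as.Date("2020-01-01"), by = "month", length.out = 24),
  value = seq_len(24)
)

p <- ggplot(monthly, aes(x = Date, y = value)) +
  geom_line() +
  # One tick every three months rather than one per month
  scale_x_date(date_breaks = "3 month") +
  theme_classic()
p
```

Changing the string to "6 month" or "1 year" would thin the ticks further.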

Lastly, the date_labels parameter allows you to specify custom formats for your dates on the x-axis, for example changing your date format from day-month-year to month-year. This can be really useful for reducing clutter on the x-axis, or for presenting your chart with quarterly dates.

To use the scale_x_date() function, your data needs to be in the correct format. Sometimes this may involve several steps to convert your data into a date and into the correct date format. See below for how we converted some data into dates.
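The date_labels argument takes the same strftime-style codes as base R's format() function, so you can try a format string on a single date before using it in a chart. A small sketch, using an arbitrary example date:

```r
# An arbitrary example date
d <- as.Date("2007-04-01")

format(d, "%d-%m-%Y")  # "01-04-2007" (day-month-year)
format(d, "%m-%Y")     # "04-2007" (month-year, numeric month)
format(d, "%b-%Y")     # abbreviated month name, e.g. "Apr-2007" in an English locale
```

Month names produced by %b and %B depend on your system locale, so the last line may print differently on a non-English setup.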

This is the original data:

Financial_Year	Month	Conducted	Passed	Pass_rate	Gender
2007	April	62897	30053	47.78129	Male
2007	May	67605	31865	47.13409	Male
2007	June	72471	34364	47.41759	Male
2007	July	73600	34716	47.16848	Male
2007	August	67920	32210	47.42344	Male
2007	September	70662	33228	47.02386	Male

We then do some data cleaning steps:

Code

driving_time_series <- raw_driving_pass_data %>%  
  # Merge day ("01"), month and year strings together 
  # so the as.Date() function can recognise a date format 
  dplyr::mutate(Date = paste("01", Month, Financial_Year, sep = "-"),
                # Use as.Date function to transform datatype to Date
                Date = as.Date(Date, format ="%d-%B-%Y")) %>% 
  # Make another column for formatted date. e.g. Apr-2008
  dplyr::mutate(Month_Year = format(Date, "%b-%Y"))  %>% 
  # Convert data into long format
  tidyr::pivot_longer(cols = c(Conducted, Passed), 
                      names_to = "category", 
                      values_to = "value") %>%
  # Final data cleaning
  dplyr::group_by(Month_Year, Date, category) %>% 
  dplyr::summarise(value = mean(value / 100)) %>% 
  dplyr::filter(category == "Passed")

Month_Year	Date	category	value
Apr-2007	2007-04-01	Passed	286.190
Apr-2008	2008-04-01	Passed	368.720
Apr-2009	2009-04-01	Passed	303.850
Apr-2010	2010-04-01	Passed	303.210
Apr-2011	2011-04-01	Passed	265.875
Apr-2012	2012-04-01	Passed	266.855

To then produce our chart:

Code

ggplot(data = driving_time_series) +
  geom_line(aes(y = value, x = Date, group = "")) +
  scale_x_date(date_breaks = "2 year", date_labels = "%b-%Y") +
  theme_classic()

3.1.1 Exercise

15:00

Use the transport_date_data to chart TfL Tube usage over time.
Use the date_labels and date_breaks arguments from the scale_x_date() function. Set date_breaks to "3 month".

SOLUTION 3.1.1.

Code

ggplot(data = transport_date_data) +
  geom_line(aes(y = usage, x = date, group = "")) +
  scale_x_date(date_breaks = "3 month", date_labels = "%b-%Y") +
  theme_classic()

3.2 Continuous scales

A useful y axis scale function is scale_y_continuous().

The scale_y_continuous() function is used to make adjustments to the appearance and behavior of the y-axis. It offers a range of parameters that allow you to set the axis limits, breaks (tick marks) and axis labels.

One of the most interesting ways of using this function is to present the y axis as a percentage scale rather than the original numeric values. To do this, set the labels parameter to scales::percent. This labelling function multiplies each axis value by 100 and adds the percentage symbol (%) to the axis labels. Because of that multiplication, the values in your data need to be stored as proportions between 0 and 1. For example, if you have a value of 48 that you want to present as 48%, use the dplyr::mutate() function on your data to change the value to 0.48.
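The rescaling step can be sketched as follows. The rates data frame is hypothetical and only mimics the shape of the course data.

```r
library(dplyr)

# Hypothetical pass rates stored as whole-number percentages
rates <- data.frame(Gender = c("Female", "Male"), Pass_rate = c(44, 51))

# Divide by 100 so the values become proportions between 0 and 1
rates_scaled <- rates %>%
  dplyr::mutate(Pass_rate = Pass_rate / 100)

# scales::percent multiplies by 100 and appends the % sign
scales::percent(rates_scaled$Pass_rate)  # "44%" "51%"
```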

Limits are also very useful to set on your chart. Limits specify the minimum and maximum values of the y axis. This is useful when you have values in your data that you do not want to plot, such as negative values, or values above a certain number. It's also useful for ensuring your data is presented in a way that won't be misleading: an axis that covers only a narrow range can make a small change look like a steep trend, and widening the limits shows it in proportion. You can set the limits of the y-axis using the limits parameter. For the percentage example above, limits = c(0, 1) would restrict the y axis to show values only between 0% and 100%.
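To see what setting limits does to values outside the range, here is a small sketch using invented proportions, one of which is above 1. It inspects the data ggplot2 prepares for drawing with layer_data().

```r
library(ggplot2)

# Hypothetical proportions; the last one is above 1
df <- data.frame(x = 1:3, y = c(0.2, 0.5, 1.4))

p <- ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  scale_y_continuous(labels = scales::percent, limits = c(0, 1))

# Values outside the limits are turned into NA in the built plot data,
# so that point is not drawn (ggplot2 warns about it when the plot is printed)
built_y <- layer_data(p)$y
built_y
```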

Another useful option is the expand parameter inside scale_y_continuous(). By default, ggplot2 adds padding around your data so points and bars do not sit directly on the axes. If you want to remove that padding and have the data start exactly at the axis line — for example, at y = 0 — you can set expand = c(0, 0). This is particularly useful for bar charts or when aligning with horizontal or vertical reference lines.

For example:

Financial_Year	Gender	Pass_rate
2012	Female	0.44
2012	Male	0.51
2013	Female	0.44
2013	Male	0.51
2014	Female	0.44
2014	Male	0.51

Code

ggplot(driving_bar_scaled, aes(x = Financial_Year, y = Pass_rate)) +
  geom_col(fill = "navy") +
  scale_y_continuous(labels = scales::percent, limits = c(0, 1), expand = c(0, 0)) +
  theme_classic()

Alternatively, you can use expand = expansion(mult = c(0, 0)), which is a newer and more flexible approach, allowing different levels of expansion at the low and high ends of the axis.
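As a sketch of that flexibility, the chart below removes the padding at the bottom of the axis but keeps a 5% gap at the top. The data frame and the 5% figure are assumptions chosen for illustration.

```r
library(ggplot2)

# Hypothetical yearly pass rates
rates <- data.frame(
  Financial_Year = c("2012", "2013", "2014"),
  Pass_rate      = c(0.44, 0.46, 0.47)
)

# No padding below the bars, 5% of the axis range above them
pad <- expansion(mult = c(0, 0.05))

p <- ggplot(rates, aes(x = Financial_Year, y = Pass_rate)) +
  geom_col(fill = "navy") +
  scale_y_continuous(labels = scales::percent, expand = pad) +
  theme_classic()
p
```

expansion() also has an add argument if you would rather pad by a fixed amount in data units than by a proportion of the range.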

3.2.1 Exercise

15:00

Use the transport_percent data to chart rail usage in the year 2020. Set x to month. Use the scale_y_continuous() function to present the usage as a percentage instead of as integers.

SOLUTION 3.2.1.

Code

ggplot(transport_percent, aes(x = month, y = usage_percent)) +
  geom_col(fill = "grey") +
  scale_y_continuous(labels = scales::percent, limits = c(0, 1), expand = expansion(mult = c(0, 0))) +
  theme_classic()

expand_limits() is a separate helper function and is not part of scale_y_continuous(), but it can be used alongside it. While limits sets the axis limits and removes data outside that range, expand_limits() extends the axis range to include certain values without removing any data. This is useful when you want to make sure a particular value (like 0) appears on the axis, even if it’s not in the dataset.

For example, expand_limits(y = 0) ensures that 0 is included on the y-axis even if your data starts above it, which is useful for giving visual context or grounding a chart at zero.
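A minimal sketch with invented values that all sit well above zero:

```r
library(ggplot2)

# Hypothetical data whose smallest value is 40
df <- data.frame(x = 1:3, y = c(40, 50, 60))

p <- ggplot(df, aes(x = x, y = y)) +
  geom_line() +
  # Stretch the y-axis down to 0 without dropping any rows
  expand_limits(y = 0) +
  theme_classic()

# The trained y scale now starts at 0, and all three rows are still plotted
layer_scales(p)$y$get_limits()
nrow(layer_data(p))
```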

In summary:

use limits to constrain the axis range
use expand to control padding around the data
use expand_limits() to include specific values in the axis range without filtering the data

Together, these functions give you full control over how your y-axis appears and behaves.

3.3 Break

45:00

This is a good opportunity to take a 45-minute to one-hour lunch break away from the computer to refresh your mind, stretch, and reset before continuing on to Chapter 4.