Chapter 3 Scales

You can change the x and y axes using the scale_x_*() and scale_y_*() families of functions.

Scales in ggplot2 can be broadly categorized into two types: positional scales and non-positional scales.

Positional scales determine the position of graphical elements on the plot, such as the x and y axes. They are used to represent continuous or discrete variables. Examples of positional scales include scale_x_continuous() and scale_y_discrete().

Non-positional scales map variables to aesthetics that are not related to position, such as color, size, shape and fill. They are used to represent categorical or continuous variables. An example of a non-positional scale is scale_color_manual(), which we covered earlier.
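As a minimal sketch of the two scale types side by side, the following uses the mpg dataset that ships with ggplot2. That dataset and the colour choices are assumptions made for illustration; they are not part of this chapter's data.

```r
library(ggplot2)

# mpg ships with ggplot2 and stands in for the chapter's data here
p <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  # Positional scale: controls the range and tick marks of the x axis
  scale_x_continuous(breaks = seq(2, 7, by = 1), limits = c(1, 7)) +
  # Non-positional scale: controls which colour each value of drv receives
  scale_color_manual(values = c("4" = "navy", "f" = "darkorange", "r" = "grey40"))

p
```

The positional scale changes where things sit on the axis, while the non-positional scale changes how a variable is encoded as colour.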

Positional scales matter in statistical analysis because they ensure that your data is represented consistently across plots. By using the same scale for a given variable, you can easily compare and interpret the data across different plots or layers.

3.1 Date scale

Let's first explore scale_x_date().

The scale_x_date() function is used to customize the x axis scale for date data (columns of class Date) in a plot. It allows you to control the appearance and formatting of dates on the x axis, including the labels, breaks and limits. For date-time data (class POSIXct), use scale_x_datetime() instead.

The date_breaks argument is a key part of scale_x_date(). It controls where the tick marks (breaks) fall on the date axis. You specify it as a string that combines a number with a unit of time, such as "2 weeks", "3 months" or "1 year" (singular and plural units both work). For example, if you have several years' worth of data, setting date_breaks = "3 month" places a tick mark every three months, so the axis shows one label per quarter instead of one per month.
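A minimal sketch of quarterly breaks, assuming a small made-up data frame (the monthly object and its columns are hypothetical and only serve to illustrate the argument):

```r
library(ggplot2)

# Hypothetical data: one value per month over two years
monthly <- data.frame(
  date  = seq(as.Date("2020-01-01"), by = "month", length.out = 24),
  value = seq_len(24)
)

p_quarterly <- ggplot(monthly, aes(x = date, y = value)) +
  geom_line() +
  # One tick mark every three months instead of the default spacing
  scale_x_date(date_breaks = "3 months") +
  theme_classic()

p_quarterly
```

Swapping "3 months" for "6 months" or "1 year" thins the ticks out further, which helps when the series spans many years.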

The date_labels argument lets you specify a custom format for the dates on the x axis, using format codes such as "%b-%Y". For example, you can change the format from day-month-year to month-year. This is useful for reducing clutter on the x axis, or for labelling a chart with quarterly dates.
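The format codes accepted by date_labels are the same ones used by base R's format() for dates, so you can try a format string on a single date before putting it in a plot. A short sketch, using an arbitrary example date:

```r
# An arbitrary example date
d <- as.Date("2008-04-01")

day_month_year <- format(d, "%d-%m-%Y")  # numeric day-month-year: "01-04-2008"
month_year     <- format(d, "%m/%Y")     # numeric month and year: "04/2008"
year_only      <- format(d, "%Y")        # year only: "2008"

# Abbreviated month name and year, e.g. "Apr-2008" in an English locale
month_name_year <- format(d, "%b-%Y")
```

Month names produced by %b and %B depend on the locale of your R session, so the last result may differ on a machine that is not set to English.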

To use scale_x_date(), your data needs to be stored as dates (class Date). Sometimes this involves several steps: building a date string from separate columns and then converting it using the correct format. The example below shows how we converted some data into dates.

This is the original data:
Financial_Year  Month      Conducted  Passed  Pass_rate  Gender
2007            April      62897      30053   47.78129   Male
2007            May        67605      31865   47.13409   Male
2007            June       72471      34364   47.41759   Male
2007            July       73600      34716   47.16848   Male
2007            August     67920      32210   47.42344   Male
2007            September  70662      33228   47.02386   Male

We then do some data cleaning steps:

Code
library(dplyr)
library(tidyr)

driving_time_series <- raw_driving_pass_data %>%

  # Combine a day ("01"), the month name and the year into one string, e.g. "01-April-2007"
  mutate(Date = paste("01", Month, Financial_Year, sep = "-"),

  # Convert the string to a Date (%B matches the full month name in the current locale)
         Date = as.Date(Date, format = "%d-%B-%Y")) %>%

  # Add a column with a formatted date label, e.g. Apr-2008
  mutate(Month_Year = format(Date, "%b-%Y")) %>%

  # Reshape so that Conducted and Passed become rows of a single value column
  pivot_longer(cols = c(Conducted, Passed), names_to = "category", values_to = "value") %>%

  # Average the rows for each month, expressed in hundreds, then keep only Passed
  group_by(Month_Year, Date, category) %>%
  summarise(value = mean(value / 100), .groups = "drop") %>%
  filter(category == "Passed")

The first few rows of the result:
Month_Year  Date        category  value
Apr-2007    2007-04-01  Passed    286.190
Apr-2008    2008-04-01  Passed    368.720
Apr-2009    2009-04-01  Passed    303.850
Apr-2010    2010-04-01  Passed    303.210
Apr-2011    2011-04-01  Passed    265.875
Apr-2012    2012-04-01  Passed    266.855

Without any date scale transformations:

Code
ggplot(data = driving_time_series) +
  geom_line(aes(x = Date, y = value, group = 1)) +
  theme_classic()

With date scale transformation:

Code
ggplot(data = driving_time_series) +
  geom_line(aes(x = Date, y = value, group = 1)) +
  scale_x_date(date_breaks = "2 year", date_labels = "%b-%Y") +
  theme_classic()

3.1.1 Exercise

  1. Use transport_date_data to chart TfL Tube usage over time.
  2. Use the date_labels and date_breaks arguments. Set date_breaks to "3 month".
Code
ggplot(data = transport_date_data) +
  geom_line(aes(y = usage, x = date, group = ""))+
  scale_x_date(date_breaks = "3 month", date_labels = "%b-%Y") +
  theme_classic()

3.2 Continuous scales

A useful y axis scale function is scale_y_continuous().

The scale_y_continuous() function is used to make adjustments to the appearance and behavior of the y axis. It offers a range of parameters that allow you to set the axis limits, breaks (tick marks) and axis labels.

One of the most useful applications of this function is presenting the y axis as a percentage scale rather than as the original numeric values. To do this, set the labels argument to scales::percent. This multiplies each axis value by 100 and adds the percentage symbol (%) to the axis labels. Because of that multiplication, the values in your data need to be stored as proportions. For example, if you have a value of 48 that you want to present as 48%, use the mutate() function on your data to change the value to 0.48 first.
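The rescaling step can be sketched as follows. The pass_rates data frame is made up for illustration, and accuracy = 1 is set explicitly so that the labels are rounded to whole percentages:

```r
library(dplyr)

# Hypothetical pass rates stored as whole-number percentages
pass_rates <- data.frame(
  Gender    = c("Female", "Male"),
  Pass_rate = c(44, 51)
)

# Rescale to proportions so that 44 becomes 0.44
pass_rates_scaled <- pass_rates %>%
  mutate(Pass_rate = Pass_rate / 100)

# scales::percent() multiplies by 100 and appends "%"
percent_labels <- scales::percent(pass_rates_scaled$Pass_rate, accuracy = 1)
percent_labels  # "44%" "51%"
```

Passing labels = scales::percent to scale_y_continuous() applies the same conversion to the axis labels. Without the rescaling step, a value of 44 would be labelled 4 400%.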

Limits are also very useful to set on your chart. They specify the minimum and maximum values of the y axis. This is useful when your data contains values that you don't want to plot, such as negative values or values above a certain number, because anything outside the limits is dropped from the plot. It's also useful for ensuring your data isn't presented in a misleading way, because the range of the axis affects how steep a trend appears. A y axis that starts well above zero, for example, can make a small change look dramatic.

You can set the limits of the y axis using the limits argument. For the percentage scale described above, limits = c(0, 1) restricts the y axis to values between 0% and 100%.
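It is worth knowing what happens to values that fall outside the limits. The sketch below uses made-up values and assumes the default out-of-bounds behaviour of continuous scales, which replaces out-of-range values with NA (scales::censor()). It also shows coord_cartesian() as an alternative that zooms the view without discarding data.

```r
library(ggplot2)

# Hypothetical values, one of which lies above the upper limit of 1
values <- c(0.25, 0.60, 1.20)

# The default out-of-bounds handling turns out-of-range values into NA
censored <- scales::censor(values, range = c(0, 1))
censored  # 0.25 0.60 NA

df <- data.frame(x = c("a", "b", "c"), y = values)

# Scale limits: the 1.20 bar is removed (with a warning) when the plot is drawn
p_limits <- ggplot(df, aes(x = x, y = y)) +
  geom_col() +
  scale_y_continuous(limits = c(0, 1))

# coord_cartesian(): same visible range, but no data is discarded
p_zoom <- ggplot(df, aes(x = x, y = y)) +
  geom_col() +
  coord_cartesian(ylim = c(0, 1))
```

Which approach is appropriate depends on whether the out-of-range values should be excluded from the chart or merely cut off at the edge of the panel.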

Financial_Year  Gender  Pass_rate
2012            Female  0.44
2012            Male    0.51
2013            Female  0.44
2013            Male    0.51
2014            Female  0.44
2014            Male    0.51
Code
ggplot(driving_bar_scaled, aes(x = Financial_Year, y = Pass_rate))+
  geom_col(fill = "navy")+
  scale_y_continuous(labels = scales::percent, limits = c(0,1))+
  theme_classic()

3.2.1 Exercise

The transport_percent data is about rail usage in the year 2020.

  1. Use the transport_percent data to chart rail usage during the year. Set x to month and select the correct column for y. Use the scale_y_continuous() function to present the usage as a percentage rather than as a raw number.
Code
ggplot(transport_percent, aes(x = month, y = usage_percent))+
  geom_col(fill = "grey")+
  scale_y_continuous(labels = scales::percent, limits = c(0,1))+
  theme_classic()