Intermediate ggplot2
2024-03-05
Chapter 1 Introduction
1.1 Session aims
- Understanding how to create widely used visualizations
- Customizing charts using themes
- Using annotations to explain charts better
1.2 What is ggplot2?
ggplot2 is a data visualisation package for R that provides a flexible and elegant way to create charts. ggplot2 itself produces static charts, and extension packages can make them interactive; in this course we will focus only on static charts. A ggplot2 chart is built from visual elements such as geometric shapes, lines and colours, which can be combined to represent different types of data. Built-in themes and other components can be added in layers to enhance the explanatory power of complex charts.
Some of the advantages of using ggplot2:
- It is popular: ggplot2 has a large and active community of users, which means that there is a wealth of resources and support available.
- It is easy to learn: ggplot2 follows a consistent structure, known as the grammar of graphics, making it easier to learn and use.
- It is flexible: it provides a wide range of options to create visualisations and customise the look and feel of the visualisations.
- It is powerful: it supports statistical transformations and specialised displays, including logarithmic scales, distributions, and maps.
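As a rough illustration of the layered structure described above, the sketch below builds one chart from data, aesthetic mappings, a geometric layer, a logarithmic scale and labels. It assumes only that ggplot2 is installed and uses the `mtcars` dataset that ships with R; the choice of variables and the object name `p` are arbitrary and purely for illustration.

```r
library(ggplot2)

# mtcars is built into R, so no external data is needed.
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) + # data and aesthetic mappings
  geom_point(size = 2) +                                          # geometric layer: points
  scale_y_log10() +                                               # scale: logarithmic y axis
  labs(
    x = "Weight (1000 lbs)",
    y = "Miles per gallon (log scale)",
    colour = "Cylinders"
  )

# Printing the object draws the chart.
p
```

Each `+` adds one component, so a single line can be changed or removed without touching the rest of the chart.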
1.3 Reproducibility in the Civil Service
Reproducibility in visualisations refers to the ability of others to recreate the same visual representation of data from the information and code used to create it. Reproducibility is a key objective in the Civil Service, which publishes routine and recurring publications.
Reasons reproducibility is important in the Civil Service:
- Replicability: Reproducibility makes it possible to replicate results and build upon the work of others, which improves efficiency by avoiding duplicated effort.
- Transparency: It promotes transparency in the data analysis process, as others can see exactly how the data was processed, modelled and visualised.
- Verifiability: Code leaves an audit trail of the steps taken to process and visualise the data. This makes it possible for others to verify the results and to replicate the analysis for quality assurance, which builds confidence in our processes, validates results and reduces the chance of errors.
- Collaboration: It makes it easier for colleagues to collaborate and enhance their analysis and results since code is shareable.
- Retain knowledge: Reproducibility ensures that expert knowledge, methods and thought processes are better documented and retained long after colleagues move to different roles, which is very common in the Civil Service. Their analysis and results remain accessible and usable.
1.3.1 Why should we use ggplot2 to achieve this goal?
One of the advantages of using ggplot2 is that its syntax has a consistent style for creating plots. This helps with reproducibility because the code follows a similar pattern throughout, and plots created using ggplot2 have a similar look and feel regardless of who created them. ggplot2 also uses declarative syntax. This means that you use code to describe how you want your charts to look rather than the steps needed to draw them, which makes the code easier to read and understand. As a result, the consistency and clarity of ggplot2 code make it easier for colleagues to learn, use and share code, making their charts widely reproducible.
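A minimal sketch of this consistent pattern, assuming the `mpg` dataset bundled with ggplot2: two different chart types are described with the same template, and only the aesthetic mapping and the geom change. The object names `scatter` and `bars` are illustrative.

```r
library(ggplot2)

# Scatter plot: engine displacement against highway fuel economy.
scatter <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Bar chart: number of vehicles in each class.
# Same template as above; only the mapping and the geom differ.
bars <- ggplot(mpg, aes(x = class)) +
  geom_bar()

scatter
bars
```

Neither chart says how to draw axes, place points or count rows; the code only declares what should be shown.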
ggplot2 also has built-in themes that can be applied to a plot with a single line of code. This extends consistency to the styling of charts, since the same theme can be reused to reproduce the look of earlier visualisations.
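For example, the following sketch applies two of the themes shipped with ggplot2 to the same chart, and then shows one possible way to set a default theme for a whole session. The dataset (`mtcars`) and the object names are illustrative assumptions.

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# A built-in theme is applied with a single extra line.
minimal_plot <- base_plot + theme_minimal()
classic_plot <- base_plot + theme_classic()

# theme_set() changes the default theme for later plots in the session
# and returns the previous default, so it can be restored afterwards.
old_theme <- theme_set(theme_bw())
theme_set(old_theme)
```

Adding a theme to one plot affects only that plot, whereas `theme_set()` affects every plot drawn afterwards, which can be convenient when a whole publication should share one style.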
Additionally, since visualisations in ggplot2 are made up of layered components, complex charts are easier to update when the underlying data changes, and individual components can be tweaked without having to create a whole new chart. This means that charts made in the past can continue to be used.
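One way this might look in practice is sketched below. The helper `make_chart()` is hypothetical, not part of ggplot2, and a subset of `mtcars` stands in for refreshed data; the point is only that the chart definition is written once and reused.

```r
library(ggplot2)

# Hypothetical helper: a single chart definition reused whenever the data is refreshed.
make_chart <- function(data) {
  ggplot(data, aes(x = wt, y = mpg)) +
    geom_point()
}

original <- make_chart(mtcars)

# Stand-in for updated data: a subset of the same built-in dataset.
updated_data <- subset(mtcars, cyl == 4)
updated <- make_chart(updated_data)

# Tweak a component by adding a layer, without rebuilding the chart.
with_trend <- updated +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
```

Because the existing plot object is extended rather than rewritten, the original chart code stays valid when new layers or new data are introduced.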
Lastly, ggplot2 is part of the tidyverse collection of packages, which furthers these benefits since the tidyverse is the preferred approach to coding at DfT. This creates a consistent coding style across the department, resulting in code that is easier to read and reuse across teams. It also means that training, technical support, and help with debugging are readily available for ggplot2. Using ggplot2 at DfT also benefits the consumer, because it means we can achieve consistent formatting and appearance across the charts we publish. If users are familiar with a certain type of chart, it becomes easier and quicker for them to understand the key message behind it.