Chapter 13 The DfT R coding standard
Aim of the standard: to provide a framework for writing R code which is simple for analysts to develop and maintain, and future-proof. This standard places the highest priority on clear, readable and simple code, as well as standardisation across teams.
13.1 Packages
13.1.1 R version
Code should always be written to work on the DfT Cloud R server, with the current version of R as installed. New code should not rely on:
- A local version of R
- R in Citrix
- Any other non-standard Cloud version of R (including on VM and the open Posit Cloud)
13.1.2 Tidyverse as default
Tidyverse packages are the default for any analytical code, as they offer the most clear and consistent approach. You should use tidyverse packages in preference to any other package when available, including base R functions.
Specifically, this should include:
- dplyr and tidyr for data manipulation
- readxl and readr to read in data
- ggplot2 for data visualisation
Where possible, you should use the pipe (%>%) to apply multiple functions to a dataset. This improves readability and is preferable to both nested functions and repeated assignment.
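A minimal sketch of the piped style, using illustrative data and column names:

```r
library(dplyr)

# Illustrative data
journeys <- data.frame(
  mode = c("rail", "bus", "rail", "car"),
  distance_km = c(12, 5, 30, 8)
)

# Piped: each step reads top to bottom, with no intermediate objects
rail_total <- journeys %>%
  filter(mode == "rail") %>%
  summarise(total_km = sum(distance_km))

# The equivalent nested call is much harder to read:
# summarise(filter(journeys, mode == "rail"), total_km = sum(distance_km))
```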
13.1.3 Use of non-tidyverse packages
When a tidyverse package with the required functionality is not available, the order of preference for use of other packages should always be:
- A well-known and clearly documented CRAN package (e.g. rmarkdown, shiny, openxlsx)
- Base R
- Any other CRAN package
- A DfT written and centrally maintained Github package
- Other Github packages by known and reputable authors which are actively maintained
Code should never rely on:
- Github or other non-CRAN repository packages of unclear or experimental origin
- Packages which are no longer maintained or have been superseded
- Versions of packages which are obsolete at the time of writing
- A bespoke DfT package which does not have central support and maintenance
13.1.4 Package installation and package management
Package installation and management should be done through one of two approaches, with the first preferred over the second.
- Renv for package management: the renv package should be used to control package installation and versioning wherever possible. Project-level lockfiles should be created and maintained, and stored alongside the code.
- User-controlled package installation: individual users are responsible for package installation and maintenance, using the R console. This should only be considered for ad hoc coding work.
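A typical renv workflow, run interactively from the console rather than stored in scripts, might look like:

```r
# One-off setup inside the project:
renv::init()      # creates a project library and renv.lock

# After installing or updating packages during development:
renv::snapshot()  # records the current package versions in renv.lock

# When another user clones the repository:
renv::restore()   # installs the exact versions recorded in renv.lock
```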
Package installation should never be managed by scripts which include install.packages() calls, even if those calls are commented out by default.
CRAN packages should always be installed from the package manager associated with Cloud R, using the most recent version available at the time the code is written. The only exception is when central DfT guidance from the Coding and Reproducible Analysis Network recommends installing either (1) a different version, and/or (2) from a different repository. This will only ever be done when there are installation issues which cannot be resolved any other way.
13.2 Calling packages
Packages should be called at the start of the script with a library() call. Where additional scripts are sourced, library calls should always occur at the start of the main script.
It is also preferable to specify the package associated with any function using double colon notation (e.g. dplyr::filter()). This usage is essential when building packages or Shiny apps.
13.3 Project structure
Code should always be set up and used as an R project from creation. This will automatically set the working directory to inside the project.
The working directory of a project should never be changed using setwd(); this can easily introduce breaking changes to other people’s environments. The working directory should always remain inside the project, and references to data stored in G drive locations should be made using absolute file paths (e.g. starting from G:/).
The R project should be version controlled and linked to a Github repository. Users should interact with the code by cloning it to their local drive and making use of Git push and pull commands to share code changes. Code should not be cloned to a shared location, as it negates many of the benefits of version control.
Code should be written to be easy for a human to read. Code files should not be too wide to fit onto a standard laptop screen. RStudio provides a guide line to indicate where this point is, and line breaks should be used to avoid code going beyond this limit.
Code files should also not be excessively long; beyond 500 lines you should be looking to split the code across several files.
Running code in order should not rely on human intervention. Multiple code files should be run from a single central file using the source() command to run them in order.
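As a sketch, a central run file might look like the following (the file names are illustrative):

```r
# run_all.R: the single entry point for the pipeline.
# Each script is run in order; no manual intervention is needed.
source("R/01_read_data.R")
source("R/02_clean_data.R")
source("R/03_summarise.R")
source("R/04_write_outputs.R")
```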
13.3.1 Good practice in code structure
Code comments should not be used to store anything other than documentation in finished code. Any to-do lists for improvements or errors should be stored using Github issues. In addition, code should not remain commented out; if it is no longer required, it should be removed and version controlled using Github. Commenting out should never be used to make code run conditionally (e.g. “comment out this code and only run it once a year”).
Variables should be named in a logical and consistent way throughout all code files. The preferred naming style is lower case snake_case, using underscores instead of spaces (hyphens are not valid in R object names). All variables should be given sensible, meaningful and unique names. Names should not be reused where it could cause unintended consequences in the code, or confusion for other people reading it.
If variables need to be hard coded (e.g. specifying a year), this should be done once at the start of the script (or in a separate variables file).
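For example, hard-coded values can be set once at the top of the script and reused throughout (the year and path below are illustrative):

```r
# Hard-coded values: set once at the top, used throughout the script
publication_year <- 2023                 # illustrative value
data_dir <- "G:/Example/road-statistics" # illustrative G drive path

# Later references build on these variables rather than repeating literals
input_file <- file.path(
  data_dir,
  paste0("road_data_", publication_year, ".csv")
)
```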
Where not explicitly set out in this document, best practice should always follow the tidyverse style guide.
13.4 Use of functions
Functions can help to simplify and streamline code, making it easier to write, read, update and run. However, use of functions which are very complex, long or do not add significant value can offset these benefits by introducing unnecessary complexity.
Functions should only be written when:
- Similar code is repeated two or more times and repetition could be eliminated by using a function
- Code would otherwise produce multiple intermediate objects which can be streamlined by producing them inside a function
- They are of a reasonable length (< 1 page) and as short and simple as possible
- The function could be tested using example data
- They follow good practice principles; functions should not assign by default, and should not use objects (e.g. data) from the global environment without them being specified as arguments.
Functions written “for the sake of functions” are not encouraged; code that could otherwise run in exactly the same way without being converted to a function is often easier for people to read and use as is.
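A sketch of a function following these principles, assuming illustrative data and column names: it takes its data as an argument rather than reading from the global environment, does not assign by default, and is short enough to test with example data.

```r
library(dplyr)

# Summarise a distance column by mode; the data and column are passed
# in explicitly rather than taken from the global environment
summarise_by_mode <- function(data, distance_col) {
  data %>%
    group_by(mode) %>%
    summarise(total_km = sum({{ distance_col }}), .groups = "drop")
}

# Example data that could also be used in a unit test
example_data <- data.frame(
  mode = c("rail", "bus", "rail"),
  distance_km = c(10, 5, 20)
)
by_mode <- summarise_by_mode(example_data, distance_km)
```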
The tidyverse purrr map functions, in combination with user-written functions, are preferable to for loops, both for simplicity of code to read and write and for code performance.
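A minimal sketch comparing the two approaches with a toy user-written function:

```r
library(purrr)

# Illustrative user-written function
add_vat <- function(price) price * 1.2

prices <- c(10, 20, 50)

# purrr: one line, and map_dbl() guarantees a numeric vector is returned
with_vat <- map_dbl(prices, add_vat)

# Equivalent for loop: needs pre-allocation and manual indexing
# with_vat <- numeric(length(prices))
# for (i in seq_along(prices)) with_vat[i] <- add_vat(prices[i])
```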
Where you have utility functions which are being used in two or more projects, you should consider whether these would be a useful addition to the dftutils package where they can be maintained, tested and used centrally.
13.5 Creating packages
For general data processing and visualisation, the advantages of building projects into packages are considered to be offset by the increased complexity of maintenance.
A well structured project made up of a mixture of repeated functions, utility functions and plain code is preferable to a project built into a package.
Where the same code is called by multiple subsequent projects, creating a package may be the most sensible solution for a team. Before this is done, consideration should be given as to how these will be maintained long term; does the responsible team have the resources to train new starters in the building and maintenance of packages?
Where packages are made, they should be hosted on Github and made public to facilitate installation.
13.6 Storage of sensitive information
Regardless of the visibility (or otherwise) of the code, secret objects such as passwords and API keys should never be stored as plain text. These should always be either encrypted, entered by the user, or stored as environmental variables.
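A sketch of the environment variable approach; the variable name here is illustrative, and would typically be set in the user's .Renviron file, which is never committed to Github:

```r
# Read a secret from an environment variable rather than from the code.
# Sys.getenv() returns "" by default if the variable is not set.
api_key <- Sys.getenv("EXAMPLE_API_KEY")

if (!nzchar(api_key)) {
  message("EXAMPLE_API_KEY is not set; see the project README for setup.")
}
```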
Less sensitive information such as email addresses, local file paths and SQL connection strings can be stored in code which is internally-facing only (on Github or stored locally). These pieces of information should be removed from public-facing code.
13.7 Documentation
Completed code should always have accompanying documentation; at a minimum, this should be useful and complete code comments.
Code comments should be detailed enough that someone with a working knowledge of R and the tidyverse can understand what the code is doing and why.
Code comments should not be used to give day-to-day usage instructions (e.g. “now run this section”). These should be contained within a README file, which is version controlled alongside the code.
Code comments should not be used to alter the running of the code (e.g. by commenting out certain aspects at certain times).
13.8 When to deviate from this approach
Deviation from this approach should always be the exception, not the rule, and there should be consideration of both the benefits and drawbacks of doing so.
This is likely to be in situations where significant improvements in efficiency can be made with relatively small changes (e.g. use of data.table over dplyr to manage large data processing, creating packages for code which will be reused over multiple projects).
In these instances, you should always have a clear plan for how you will manage this dependency, especially if it requires skills/knowledge not covered by central training. At a minimum, you should document how and why the code deviates from this approach, and ensure the code itself is well documented.
You may also want to record this as a business continuity risk in a risk register.
Significant departures from this standard approach should never occur without careful consideration, and should not be based on convenience or preference of the original developer alone.