Chapter 3 Choosing a coding approach

Before selecting a coding tool, the first consideration should always be "is coding the right approach for this task?". While it's theoretically possible to do almost anything by coding it, it's usually sensible to weigh how much value a coding approach will add to a project against the potential costs, and sometimes coding doesn't offer an improvement over a manual approach in Excel.

3.1 Data security considerations

For many people, using coding tools can feel like an additional layer of complexity and uncertainty in terms of data protection, storage and security. Luckily, the DfT-specific coding tools take the security of our data into account for you. The following general principles apply whether you are using coding languages like R or Python, or supplementary tools such as GitHub.

  • You should always use DfT-specific coding tools; these are things like our Cloud R servers, Python products available through your IT Focal Point, an Enterprise GitHub licence, and our internal rsconnect server. If you use other coding products (e.g. a Python instance you have built yourself on a virtual machine, off-network laptops, etc.) none of the points below apply, and you will be responsible for engaging with Digital Services and ensuring you have met your own data security obligations.

  • Our coding tools are approved and secure for use with data up to OFFSEN.

  • All of our servers are EU-only.

  • Our coding tools are compatible with all our normal ways of storing data files, whether on GCP or shared drives. There is no need to move data from agreed locations to use them.

  • There are best practice procedures for all of our tools which you should follow to improve the handling of data within your code; you should not aim to rely solely on our fail-safes!

3.2 Benefits of coding approaches

The usual benefits of coding approaches to consider are:

  • Time: Running code is faster than producing the same outputs manually. It may or may not be faster to write the code in the first place!
  • Automation: Remove the need for boring and time-consuming manual tasks.
  • Reproducibility: Code will run the same way every time, eliminating manual steps that people might forget or make mistakes in.
  • Transparency: Open-source code allows you to clearly see how every step of the analysis takes place; there are no black boxes.
  • Version control: You can see exactly when changes were made, what they apply to, who made them, and why. No accidental overwriting of formulae!
  • Quality: Improvements to the above should improve the quality of the analysis overall.

However, how much each one of these applies to your project depends on a number of factors. When deciding whether to use code for a project, you probably want to consider:

  • How much data am I working with? The increased speed of coding really comes into its own as data gets larger. With large datasets, it can be a struggle to process data or even open files with tools such as Excel, while automation is much more valuable if you're automating hundreds or even thousands of checks.

  • How many times will the outputs/content be reused? The reproducibility and time-saving benefits of coding are obviously highly dependent on how likely it is that you will want to repeat this code! Repeatability takes many forms, though. The most obvious is over time: how often are you going to want to run this project again? But also consider how often you will be asked the same or similar questions about your data, since code easily allows you to re-run the analysis on just the latest data every time. These benefits will be significantly reduced for a one-off bespoke piece of work compared with code that meets a monthly data ask.

  • How important is the quality of the output? Quality is generally a key consideration for analysts, but realistically it is often balanced against other factors. Where quality is a primary consideration, the auditable nature of code and the ability for it to be published and assured rigorously is extremely valuable. For example, where the analytical methodology is likely to be subject to public scrutiny, will form the basis of high-profile decisions, or is expected to be made public, the additional benefits of transparency and quality will be more pronounced than for internal-only or exploratory analysis.

  • How complex is the process being carried out? There are multiple aspects of complexity to consider here! For complicated statistical processes, a coding approach allows an expert statistician to adjust parameters and develop methodology in a transparent way, rather than relying on a black box. Similarly, coding can make it easy to apply the same long and complicated set of processes to multiple data sources in a transparent and reproducible way. Where processes are simple or well understood, such as producing a standard chart, a coded approach may not offer any benefit over something like Excel.

  • How many times am I going to want to do a very similar task? Similar to reusability, opportunities to recycle code for similar outputs are very valuable. Do you want the same report for each of the past 10 years, or the same chart for every region? Coding allows you to write the logic once and use it on hundreds or thousands of iterations (see the sketch after this list). Similarly, you can reuse larger chunks of code for subsequent projects, giving a much better return on initial development time.

  • How novel is what you are doing? Once again, this relates to reproducibility. If you are carrying out standard procedures such as creating publication tables and charts, you can adopt standard approaches and make use of existing code and DfT-specific packages, which have quality, reproducibility, and good documentation baked in. If you're breaking new ground, you're going to need to develop those things yourself.

  • Who will be using this project long term? What kind of coding skills can you expect from the people who will be responsible for maintaining and developing this project moving forward? If it's going to be supported by an analyst team, they will be able to make best use of the advantages of the coding approach, whereas if it's designed to be maintained by non-analysts it may well add complexity for no benefit.

  • How much human intervention is required? Coding approaches work best where decisions can be easily defined by strict logical rules; these are simple to write, execute and QA. In contrast, writing code to take into account a wide range of exceptions, subtleties and nuances can be somewhere between difficult and impossible!
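
To illustrate the iteration point above, here is a minimal sketch in R (the dataset, column names and chart details are all hypothetical) of writing a chart once and applying it to every region:

```r
library(ggplot2)

# Hypothetical data: one row per region per year
journeys <- data.frame(
  region = rep(c("North", "South", "East"), each = 3),
  year   = rep(2021:2023, times = 3),
  trips  = c(10, 12, 11, 20, 21, 23, 8, 9, 9)
)

# Write the plotting logic once...
plot_region <- function(df, region_name) {
  ggplot(df[df$region == region_name, ], aes(x = year, y = trips)) +
    geom_line() +
    labs(title = paste("Trips over time:", region_name))
}

# ...then apply it to every region, however many there are
dir.create("charts", showWarnings = FALSE)
for (r in unique(journeys$region)) {
  ggsave(paste0("charts/", r, ".png"), plot_region(journeys, r))
}
```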

3.2.1 When coding is the best option

Taking all of these factors into account, coding has a clear and significant benefit for applications such as:

  • Analysis of large data (more than 1 million rows)
  • Repetitive processes such as creating the same chart for every region, or updating the same table every month
  • Projects where you want to prioritise quality, transparency and the ability to audit the work
  • Decisions that can be determined logically with no need for subjective judgement; e.g. check that every row sum matches the value in a related table, or highlight every value that has changed by more than 10% (see the sketch below)
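
As a minimal sketch of such rule-based checks in R (the toy data and the 10% threshold are illustrative only):

```r
# Hypothetical detailed data and the related summary table it should match
detail <- data.frame(
  group = c("A", "A", "B", "B"),
  value = c(10, 20, 5, 15)
)
summary_tab <- data.frame(group = c("A", "B"), total = c(30, 20))

# Check that every row sum matches the value in the related table
sums <- aggregate(value ~ group, data = detail, FUN = sum)
checked <- merge(sums, summary_tab, by = "group")
stopifnot(isTRUE(all.equal(checked$value, checked$total)))

# Highlight every value that has changed by more than 10%
this_year <- c(100, 250, 80)
last_year <- c(95, 180, 79)
flagged <- abs(this_year - last_year) / last_year > 0.10
which(flagged)  # positions of values needing a closer look
```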

Conversely, coding offers little or no benefit over other approaches when:

  • You’re doing quick exploratory analysis for your own benefit
  • Your data has no fixed structure, your outputs are highly variable, or your approach may change regularly
  • Your project has more exceptions than rules
  • Your analysis requires a lot of human judgement, or decisions not easily boiled down into simple logical choices
  • You are not repeating any part of the process

Realistically, many cases will fall somewhere between these two extremes, where coding will offer some benefit (particularly in the long term), which will be balanced against the associated costs.

Case study: taxi statistics

The production of annual taxi stats was previously done with a high proportion of manual data manipulation steps in Excel, across a process of data ingestion, processing, and visualisation.

When moving this to a reproducible analysis in R, the team selected aspects of the work which were very time consuming, highly repetitive and required little human input to convert into code. For example, producing individual validation reports and emails to check and confirm data returns for each local authority would take days to do manually, and would be a very repetitive task for an analyst. Using Rmarkdown, these reports and emails could instead be generated and sent automatically in less than an hour. In contrast, evaluating replies to these emails and adjusting data records based on responses from local authorities remains an entirely manual process, requiring a high level of human decision-making and a small amount of metadata best recorded in an Excel log.
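
As an illustrative sketch of this pattern (the template name, parameter and data file are hypothetical, not the team's actual code), a parameterised R Markdown report can be rendered per local authority with rmarkdown::render():

```r
library(rmarkdown)

# Hypothetical data returns, one or more rows per local authority
returns <- read.csv("data/taxi_returns.csv")

# validation_report.Rmd is assumed to declare a 'local_authority'
# parameter in its YAML header; one report is rendered per authority
for (la in unique(returns$local_authority)) {
  render(
    "validation_report.Rmd",
    params = list(local_authority = la),
    output_file = paste0("reports/validation_", la, ".html")
  )
}
```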

3.2.2 Prioritising coding tasks

The most common scenario for analyst teams is that they have an entire analytical process built in non-coding tools such as Excel, and they (eventually!) want to move the whole thing into a coded process using SQL, R, etc. Inevitably you have to start somewhere, but it's not always easy to decide where the most logical place is. A sensible approach to making this decision is:

3.2.2.1 Break your process into chunks

Rather than thinking of an analytical process as a continuous pipeline or a series of individual steps, for coding it's generally more helpful to think of your process as a flow of discrete chunks. Breaking it down into logical chunks makes it easier to convert your processes a piece at a time (see the sketch after the list below).

A logical chunk of analysis:

  • Has a defined data input. This could be a raw survey output, a SQL database, or a clean data table.

  • Does a single task that can be summed up in a few words. This could be something like “cleans the raw data”, “makes the publication table” or “visualises the data”.

  • Groups all similar outputs into one chunk. Producing all tables is one chunk, ditto producing all charts or suppressing all data points.

  • Doesn’t have significant overlap of processes with any other chunk. If you’re running the same data processing for tables as for charts, the data processing for both should be a single chunk, followed by two different chunks to actually produce the outputs.

  • Has a defined output. This could be a dataset, a chart, a dashboard, anything!
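
To make this concrete, a minimal sketch in R (the function names and toy data are hypothetical) of a process broken into chunks with defined inputs and outputs:

```r
# Hypothetical raw input, e.g. a survey output
raw <- data.frame(
  region = c("North", "South", "South", NA),
  value  = c(10, 20, 15, 5)
)

# Chunk 1: "cleans the raw data" - defined input, defined output
clean_data <- function(raw) {
  raw[!is.na(raw$region), ]
}

# Chunk 2: "makes the publication table" - one task, one output
make_publication_table <- function(clean) {
  aggregate(value ~ region, data = clean, FUN = sum)
}

# Chunk 3: "visualises the data" - shares the cleaned input with the
# table chunk rather than duplicating the processing
make_chart <- function(clean) {
  boxplot(value ~ region, data = clean)
}

clean <- clean_data(raw)  # single shared processing chunk
make_publication_table(clean)
make_chart(clean)
```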

Example: the analytical chunks of an example project

3.2.2.2 Prioritise

While it’s tempting to work through the chunks in chronological order, with well-structured code it’s not necessary to do this; you can start coding on whichever chunk is the highest priority. Prioritisation of the chunks can take into account:

  • Which chunks offer the biggest benefits (see above for evaluating the benefits of coding approaches)? Additionally:

  • Which chunks represent the biggest potential for time saving (creating a snowball effect by freeing up more development time)?

  • Can some chunks fulfil a dual development purpose (e.g. automating table creation while also making it accessible)?

  • What other processes are dependent on this chunk (e.g. if you don’t have a clear idea of the clean data structure prior to table creation, it would make sense to do these chunks in order)?

  • Is the existing method causing problems, or is it relatively robust?

Example: prioritisation of the chunks

3.2.2.3 Code

The fun bit! Make sure you have plenty of time to complete the work so it isn’t rushed (which includes time for testing, code review and documentation).

Ensure code is regularly reviewed to check that it’s following best practice, staying on objective, and isn’t getting too complicated and unwieldy.

3.2.2.4 Reflect

Once you’ve completed each chunk of coding, it makes sense to re-evaluate the prioritisation and consider how the coding landscape has changed since you started:

  • Is the code more reusable than you thought, meaning you can recycle some of it in a later chunk?
  • How have coding skills changed in your team?
  • What additional central resources are now available?

3.2.2.5 Know when to stop!

Code is rarely a static product. Even when it’s finished, you’re probably going to want to review and develop it on at least an annual basis. Generally this will be small changes to improve readability, improve quality, and fix bugs or annoying functionality.

This being said, code can be developed more or less infinitely, and it’s important to know when to move into a “maintenance” rather than a development phase. A good point to stop is when:

  • Your code meets all essential good practice guidance, and is clear and easy to read
  • There is good documentation that explains how to use the code and how it works
  • The code works and has no major issues or bugs
  • The structure of the project is tidy and clear
  • The code is version controlled
  • It is efficiently written; there are no excessively complicated bits or lots of repetition
  • It’s efficient to run; it doesn’t have to be the best possible code but no part is excessively slow or clunky
  • It’s appropriately complicated for the general level of coding knowledge expected in your team