Chapter 11 Identifying risks within code

One of the trickiest parts of managing a team that is converting existing analysis projects to code is the fear of unknown risks. While an old spreadsheet may be creaking and regularly causing problems, the fact that most analysts are familiar with the kinds of problems spreadsheets can cause makes it easier to document these risks, even if they’re not necessarily easy to spot or fix!

The purpose of this section is to demystify the risks that can come alongside coded approaches. Even if you’re not a regular coder yourself, the aim is to give you the appropriate knowledge and vocabulary to discuss, quantify and record risks in collaboration with your team. Code also has an added benefit over spreadsheets: all of these risks have clear mitigations which can be applied. Whereas there’s no way to prevent the dreaded “someone applied the wrong formula to a cell in Excel”, there are no black boxes in code, and you can always apply checks, failsafes and preventative measures to a coded project.

11.1 Quality risks

The most significant risks that can be caused by code relate to the quality of outputs: that your code contains unexpected errors which (in a worst-case scenario) look plausible enough not to be spotted until long after publication.

11.1.1 Incidents to record

  • Your code produces incorrect values without a warning at the end of the analysis; this could be anything from a mislabelled column to an actual incorrect figure.

  • This risk is far more significant if the incorrect figure seems plausible; for example, values which are coincidentally similar to the correct output.

  • Someone describes a piece of code as “temperamental”. Because coding languages like R aren’t black boxes like Excel, you can see every piece of the inner workings, and therefore they can’t be temperamental. There’s nothing to go wrong except the code you have written and the way you run it, so this description suggests there’s an actual problem with at least one of those things.

11.1.2 Near misses

  • You have a coded project that isn’t regularly maintained. Regular maintenance should look like a thorough check at least annually, making sure redundant code has been removed, new additions have been appropriately code reviewed, and documentation is relevant and up-to-date.

  • Your code takes hours to update every time you run it. Flexible, well-written code should be updated to accept new data by altering only a few parameters, which should be quick and easy to do. Anything that takes much longer than this is heavily reliant on human intervention to work properly, and massively increases your risk of something being missed or done incorrectly.

  • Closely related to the infamous “temperamental code”, code that has to be run one chunk or file at a time is a good indicator that you have problems brewing. Good code should be happy running unsupervised, and should be non-destructive; you should never be relying on human intervention to stop it overwriting data you need, to make it run without failing, or to prevent any other problems.

  • Most people at one time or another have a coding “grey box”; code that isn’t unknowable, but isn’t understood properly by anyone currently working on it. Every line of your code should be easy to understand by anyone in your team running it, because it should be written according to good practice guidelines and be accompanied by helpful code comments. If you have code that you cross your fingers and run every month, it’ll become a problem at some point.

  • Not automatically recognisable as a bad sign, but a project whose development has stalled with a mixture of hard-coded and automated values is an easy way to end up with the wrong values in a report. If you have a report where years update automatically but the values associated with them are hard-coded, it’s easy to report a 2019 value as belonging to 2020. Aim to eliminate hard-coded values as far as possible, making use of named variables for ones you’ll have to update manually so you only do it once at the start of your code; see the sketch after this list.
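
As a rough illustration of the named-variables approach above, the following sketch collects every manually-updated value in one place at the top of a script. The variable names and file path here are hypothetical, not taken from a real project.

```r
# Hypothetical example: all manually-updated values live here, at the
# top of the script, so each one only needs changing once per update
publication_year <- 2020
data_path <- "data/annual_data_2020.csv"

# Everything downstream refers to these named variables, never to raw
# values, so a 2019 figure can't silently sit under a 2020 label
raw_data <- read.csv(data_path)
output_title <- paste("Annual figures,", publication_year)
```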

11.1.3 Mitigating factors

  • Like any analysis, code shouldn’t be seen as a one-and-done project. Once development is complete, make sure you’re scheduling in code maintenance on at least an annual basis; the publication washup process is an ideal time for this. It’s also a great opportunity to upskill everyone in your team; consider asking a less experienced coder to do the maintenance, either as a solo development project or as a paired coding session with a more experienced person.

  • Best practice in coding is designed to avoid these kinds of risks by making code transparent, easy to follow, and as simple as possible. Check out the best practice chapter to get examples of platform-agnostic and language-specific best practice approaches you should be following. You may also want to make use of the R coding standard in this book, which sets out a very rigorous standard to follow.

  • Increase the propensity of your code to “fail safe”. Increasing the probability of receiving an error message rather than an output may seem like an unintuitive way to increase the quality of your data! However, adding clear error messages and warnings is easy with the stop() and warning() functions in R (and equivalents in other languages), and these allow you to receive informative messages when the code encounters something unexpected; see the first sketch after this list. For example, an error message saying “data for latest year not found” is generally much more useful than running an entire coded project only to discover it’s crunched the figures for 2008 instead of 2018.

  • Test your functions before you use them. Adding unit testing to your functions isn’t the most exciting part of coding, but it allows you to regularly check that your code still does what you expect it to on a standard dataset; see the second sketch after this list. You can even set up automations to check your code against the tests every time you make a change, so you’re far less likely to accidentally break your code during development without noticing.
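
As a rough illustration of “failing safe”, here is a minimal sketch using R’s stop() and warning() functions. The data frame structure and column names are hypothetical.

```r
# Hypothetical check: halt or warn before crunching the wrong figures
check_latest_year <- function(data, expected_year) {
  if (!expected_year %in% data$year) {
    # stop() halts the run immediately with an informative message
    stop("Data for latest year (", expected_year, ") not found")
  }
  if (any(is.na(data$value))) {
    # warning() lets the run continue but flags the issue loudly
    warning("Missing values found in the 'value' column")
  }
  invisible(data)
}
```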
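
And as a minimal sketch of unit testing, the following uses the testthat package (a common choice in R) to check a hypothetical add_totals() function against a small standard dataset.

```r
library(testthat)

test_that("add_totals() produces the expected total", {
  # A tiny standard dataset with a known correct answer
  dummy <- data.frame(region = c("North", "South"), value = c(10, 20))
  result <- add_totals(dummy)  # add_totals() is a hypothetical function
  expect_equal(result$value[result$region == "Total"], 30)
})
```

Automation services such as GitHub Actions can then run tests like these every time the code changes.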

Case study

A stats team had publication code written by a former member of the team who was a confident R coder; the code was written in a complex way, with lots of large functions and elaborate documentation. The existing members of the team didn’t have a complete understanding of how it worked, and the associated documentation didn’t make this any clearer.

Eventually, 18 months after the team member left, the code stopped working and wouldn’t accept the latest year of data to produce the annual publication. While the problem was (luckily) easy to spot, the code was convoluted and difficult to fix. The team resolved the problem in the short term, and then took a step back to review the existing code base. They recognised the risks of running complex, poorly-documented code, and began development work to completely redesign it, building something simpler that adhered to best practice principles, including a significant improvement in the quality and quantity of the associated documentation. They also planned to maintain the code regularly, checking and improving it as part of the post-publication washup every year.

11.2 Data risks

In a perfect world, there would be no data risks associated with your code; you would receive perfect, clean data from your supplier which never changes in terms of expected values, structure, etc. Given that this is rarely (never…) the case for stats teams, risks around changing and unexpected data inputs will be an ongoing concern for most coded projects.

11.2.1 Incidents to record

  • You receive data which, when fed into your existing project code, fails to produce the expected output.

  • As with the previous example, this risk is far more significant if the code still produces a plausible output. It is also more of an issue if the code produces no warnings or errors to flag that the change has occurred.

11.2.2 Near misses

  • You receive data which differs in an unexpected way from the previous delivery, but it doesn’t cause a problem. Your code being robust enough to cope with this is a good sign; however, it’s still something to raise with your data supplier, to ensure they notify you of future changes, no matter how small!

  • You know that your data structure or format will change from time to time. While changes you expect can be accommodated more easily, it’s worth making a note of this as a risk, and making sure your data checks and code accommodate and warn about these changes when appropriate.

  • Your data is ingested from a non-standard file format. Ideally, you would receive your data in a flat file format (e.g. a CSV) or from something like an API. If you receive data in a proprietary format, whether that’s an Excel file or something like a SAS or SPSS data file, or an otherwise unusual format, this likely presents additional data risk, as you are dependent on both the structure of these files (where changes are not always transparent) and on additional coding libraries to read them in properly.

  • Your data arrives in a non-tidy (human-readable) format. Human-readable data is unsurprisingly great for humans to read, but can cause problems if you’re trying to use it in code. Things to look out for are regions, years, or other variables being listed as individual columns. Your code will need to be flexible enough to cope with any new columns or changes of name, without needing manual adjustment every year; see the reshaping sketch after this list.

  • Code that you know is rigid or clunky to run can compound any of the above problems. If you know your code is reliant on manually updating years, months, etc, or relies on columns having specific names or being listed in a specific order, this increases the risk of running into problems when data changes.
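
As a rough illustration of the reshaping mentioned above, this sketch uses the tidyr package to convert a human-readable (wide) table into a tidy (long) one; the column names and values are hypothetical.

```r
library(tidyr)

# Hypothetical human-readable data: one column per year
wide <- data.frame(
  region = c("North", "South"),
  `2019` = c(100, 150),
  `2020` = c(110, 160),
  check.names = FALSE
)

# pivot_longer() gathers every non-region column, so a new "2021"
# column next year needs no manual change to this code
tidy <- pivot_longer(wide, cols = -region,
                     names_to = "year", values_to = "value")
```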

11.2.3 Mitigating factors

  • Your data has associated data dictionaries or similar agreed metadata. By having an agreed and documented data structure, you can easily understand what your code needs to be able to accommodate, how this should be written, and when you should expect it to produce errors, as well as being confident that your data supplier also understands how the data should be delivered to you. You can also easily produce dummy data for testing based on this template.

  • Your data is delivered to you in a tidy data format. A “long” or tidy data format is the ideal way to receive your data. It facilitates the use of more flexible code, and doesn’t require you to take into account things such as incorporating a new column of data each month/year.

  • The data you receive is checked before being analysed. These kinds of checks are usually high-level and simple, but allow you to spot any significant departures from what you were expecting. They can cover anything from the size and shape of the data, to when it was last updated, the unique values in each column, or the names of each column; see the sketch after this list.

  • Your code is specifically written with flexibility in mind. Things to consider when you’re judging this include no hard coded dates or names, and including all provided data rather than named or positioned columns. Be aware that this can be a double-edged sword though; you want code flexible enough to incorporate slight data changes, but not so flexible that it will mindlessly incorporate any old data to give you garbage results.
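
As a rough illustration of pre-analysis checks, here is a minimal sketch that validates a fresh delivery against an agreed structure before any analysis runs; the expected column names are hypothetical.

```r
check_delivery <- function(data,
                           expected_cols = c("region", "year", "value")) {
  # Compare the delivered column names against the agreed structure
  if (!identical(sort(names(data)), sort(expected_cols))) {
    stop("Column names differ from the agreed structure: ",
         paste(names(data), collapse = ", "))
  }
  # Catch an empty delivery before it produces empty outputs
  if (nrow(data) == 0) {
    stop("Delivered data contains no rows")
  }
  message("Delivery checks passed: ", nrow(data), " rows")
  invisible(data)
}
```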

Case study

HGV testing data was received and processed weekly by a stats team for publication. This data was admin data, and provided via a manually completed Excel spreadsheet which made use of a number of formulae to autocomplete some cells.

Due to the nature of the data provided, data issues were a frequent occurrence, including formulae being overwritten, data being input incorrectly, and the structure of the data being changed frequently to meet the needs of the users recording it. These problems would often prevent the code from running correctly, but far more seriously would sometimes result in an incorrect result being produced with no warning, which was more difficult to spot.

To prevent these issues from causing data quality problems in the final data, the team implemented a number of checks and improvements into the code. This included:

  • adding pre-processing validation checks such as number of columns and rows of data, column names, latest date included, etc, to spot issues at a very early stage
  • automating QA checks of things like matching totals, and comparison to the previous delivery to spot anomalous values
  • flexible code which would accommodate small changes to things like spacing and punctuation in column names

On top of that, the team also worked with the data providers to explain the issues that could arise from sudden and unexpected changes to the data structure, and established a process for the providers to give advance warning of planned changes, and to notify the team when unavoidable changes had been made.

11.3 Practical risks

Finally, the development and use of code is a practical rather than theoretical exercise, and you can also expect to encounter a wide range of problems as a result of the setup of the coding language you’re using, the coding tools available at DfT, or the people and resources available in your team. All practical risks essentially boil down to the same thing: code you used to be able to run no longer runs, despite the code itself and the data not changing.

11.3.1 Incidents to record

  • A change made to the coding tools you have access to means your code no longer runs.

  • A new package version, or an old one no longer being available, causes new problems in your code.

  • Loss of coding expertise in your team means that no one understands how to run your code any more.

11.3.2 Near misses

  • The dependencies (versions of coding software and packages) that your project relies on are outdated; usually more than 12 months out of date. The longer you go between updating dependencies, the more likely it is that a change will be forced on you, and that changes will break something.

  • Code that is undocumented or poorly documented. As with any other analytical process, if code requires specific knowledge that is not written down anywhere, it’s only as good as the person who knows how to run it.

  • A project that relies on only one person to run it. Whether this is due to permissions, knowledge or software availability, code that only a single person can run is only one step away from becoming code that no one can run.

  • You know that your code is very complicated, written by many people and not regularly maintained, or extremely long and winding. Complex code can become unintelligible code very easily.

  • You’re doing something a bit hacky. It’s inevitable that occasionally you’ll need to do “a bit of a fix” to code that only runs in a particular way, or work around restrictions or limitations of the software you’re using. Doing this increases the risk that your personal work-around will just stop working one day!

11.3.3 Mitigating factors

  • Good documentation is one of the most important factors in ensuring that you aren’t left with a mysterious folder of code when your star coder moves on. You’re aiming for a situation where anyone in your team (yes, even you) could pick up their documentation and run a project from start-to-end with no external support. If you’re not there yet, make documentation improvement a priority.

  • You have decentralised coding expertise in your team. Rather than a small number of expert coders surrounded by colleagues who don’t know how to run the code, you have a larger number of beginner-to-intermediate coders (and maybe even a couple of experts!) who share knowledge with each other.

  • Understanding and following best practice guidance allows you to simplify the process of understanding all of your code. If everyone in your team is using the same standard approach, this makes it easier for new people to pick up code, individuals to move between projects, and experts from outside your team to give you help if required.

  • Code review and version control help to ensure code is written well, peer-reviewed to pick up points of complexity and error, and that it doesn’t go massively off-piste (and can be brought back in line if it does!).

  • Package management tools are an underappreciated aspect of coding. They essentially take a snapshot of the packages your code uses at any one time, and allow you to return to that snapshot to run your code; see the sketch after this list. Without them, you’ll be surprised how often a new version of an old favourite package will introduce new errors to your code.

  • Keeping informed about upcoming changes to coding tools can help you plan for the future; whether this is changes to software versioning and features, or just knowing what new products are available to you. Luckily, within DfT the CRAN community and your local CRAN rep should know all about changes that will affect you, so check in with them regularly to see what’s on the horizon.
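
As a minimal sketch of package management in R, the renv package mentioned in the case study below records and restores the exact package versions a project uses.

```r
# Run these from the console in your project, not from analysis code
renv::init()      # set up renv for the project and take a first snapshot
renv::snapshot()  # record the current package versions in renv.lock
renv::restore()   # reinstall the exact recorded versions later or elsewhere
```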

Case study

A stats team had an annual publication which was fully automated in code and had been for several years. As a result, the code hadn’t changed since being written and running the code had become routine; it was left to a single person who was familiar with the code, and it was run in the same way every year.

For the most recent publication year, there had been an update to the version of R available to teams, and a key mapping package needed for the publication was no longer available in this new version of R. Because the code wasn’t run regularly, the problem was only discovered when it was time to publish. It was complex to solve because only one person actually understood how the code worked, and the package dependency was heavily integrated into it. The team was forced to request a special “rebuild” of the older version of R to run the code, taking a significant amount of Digital resource.

Once the publication issue was resolved, the code was updated to use a newer equivalent package, and the renv package was adopted to manage packages going forward. The team also made an effort to engage with Digital updates via Slack to understand when future developments like this would be implemented, and planned for them in routine code updates.