Chapter 5 Best practice for coding tools
5.1 What does best practice look like in coding?
Best practice in coding encompasses a wide range of techniques, methods, and guidelines that are widely accepted by the coding community as the most effective, efficient, and safe way to write and maintain high-quality software.
Some principles of coding best practice are specific to individual coding languages, and will specify aspects like code appearance, naming, and creating objects. It is important that any code you write adheres to the generally accepted best practice of the language you are using, and details of these specific practices can be found at the end of the chapter.
In addition to this, there are a wide range of general code best practices which you should aim to follow regardless of coding language or platform, to make your code clear, well-documented and future-proofed.
5.1.1 Setting up the code
Fundamentally, good coding practice begins at the point of project initiation! Having a clean project layout and structure makes creating good code easier.
Understand the scope: Before starting any coding, it’s important to have a clear understanding of what the project is supposed to accomplish and what features and functionality it should include. This helps you decide how to split the code across files and across branches in Github. For example, if the purpose of the project is to produce accessible tables, you may decide to have one branch and file per table. If the purpose is to produce an entire publication, you may only have a single Github branch to complete all tables within.
Separate code from data: Your code should be stored entirely separately from your data. Generally, your data will be stored in a shared drive or cloud repository, whereas your version of the code will be saved on a local drive. Keeping the two separate makes code more reproducible for future years, and removes the risk of accidentally sharing data.
Version control: Establish version control right from the start of the process and use it. Link your folder structure on your local drive to a shared Github repository. This will allow others to access the code and collaborate with you on it, as well as providing a backup for all changes you make.
Set up a branching structure on Github: Set up main and development branches on Github from the start. This will make the first early code reviews easy to request and complete, and will provide structure for early code which may frequently change dramatically.
Consider how to split and store your code: Outline roughly what the split point will be for the code (e.g. will you have the code for each chart in a single file?), and where individual code files will be stored (e.g. will you have a whole folder for all source code, or one for chart code, one for tables, etc?). You can always split files further if code gets too long to read manageably.
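As a minimal sketch of keeping code separate from data, the code can look up the data location at run time rather than storing data (or a fixed data path) inside the repository. The environment variable name PROJECT_DATA_DIR and both function names below are hypothetical, chosen only for illustration.

```python
import os
from pathlib import Path

# Hypothetical environment variable name; not defined anywhere in this chapter.
DATA_DIR_VARIABLE = "PROJECT_DATA_DIR"


def get_data_dir(default=None):
    """Return the folder holding the data, which lives outside the code repository."""
    location = os.environ.get(DATA_DIR_VARIABLE, default)
    if location is None:
        raise RuntimeError(
            f"Set {DATA_DIR_VARIABLE} to the folder that holds the project data."
        )
    return Path(location)


def data_file(file_name, default_dir=None):
    """Build the full path to a data file without hard-coding the data location."""
    return get_data_dir(default_dir) / file_name
```

Each user sets the variable once on their own machine, so the repository never needs to contain the data or a path that only works for one person.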
5.1.2 Writing the code
When initially writing the code base, the aim is to produce code that is fundamentally easy to understand and use. Practices such as making code structure neat and adding documentation should be done as you go along, rather than being left until last!
Simplicity: Simple is beautiful! Don’t introduce more complex features into code unless they have significant advantages in terms of reducing code repetition and complexity. For example, user-defined functions are more complex to understand, but if writing a short function can remove 400 lines of repeating code, this is a reduction in complexity overall.
Avoid code duplication: Repeating the same code in multiple places can make your code harder to maintain and update. Instead, make use of reusable functions where code is repeated more than a few times.
Use common approaches: Stick to common ways to achieve things as much as possible, including packages everyone has heard of and methods widely used across DfT. Choosing to do something in a new and exciting way is only worth it if the benefits of a new approach outweigh the cost of increased complexity.
Use meaningful, descriptive, and unique variable names: Using short, meaningful and clear variable names makes your code easier to read and understand. Calling a dataset “new_price_values” rather than “x” is much more meaningful. Ensure that names are not too similar or duplicated; repeating a name (e.g. a data frame called “buses” with a column called “buses”, or a function called “round” with an argument called “round”) can cause unexpected consequences when running your code, as well as confusion for people reading it.
Use standard data formats: Using flat and widely recognised data formats such as CSV files or BigQuery tables means that you can use common packages to read in and write out data, and simplify your approach.
Write modular code: Stick to your organised code structure as set up above, and keep code in short files which can be called individually in a central set up script, or reused in multiple places.
Documentation: Ensure that you are adding comments to your code and maintaining a README file as you go along. This ensures that your comments are clear to end-users as well as developers during the code writing process.
Use version control: Make good use of the version control process you have initially established. Make sure you are writing good commit messages that will make sense to others, and using Github branches for each new feature (e.g. for tables or charts), rather than for each person writing code (e.g. for Sarah and Dave).
Code reviewing: As part of your version control, make sure you are getting peer code review for the code as it develops, to make sure the code structure makes sense and is efficient. You should ideally get a code review for code in chunks of 100-500 lines, to stop it getting unwieldy.
Refer to language-specific good practice: Make sure your code structure and appearance meets the standards for the language you are using.
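The points above on avoiding duplication and using descriptive names can be sketched as follows. This is an illustrative example only: the function name, the variable names, and the fare figures are all made up, and the same idea applies in any language.

```python
def percentage_change(previous_value, current_value):
    """Return the percentage change from previous_value to current_value."""
    if previous_value == 0:
        raise ValueError("previous_value must be non-zero to calculate a change")
    return (current_value - previous_value) / previous_value * 100


# Illustrative figures only, not real data.
bus_fares = {"2022": 2.00, "2023": 2.50}
rail_fares = {"2022": 10.00, "2023": 11.00}

# One short function replaces the same calculation being written out
# separately for every dataset.
bus_fare_change = percentage_change(bus_fares["2022"], bus_fares["2023"])
rail_fare_change = percentage_change(rail_fares["2022"], rail_fares["2023"])
```

Names such as bus_fare_change say what the value is, and none of them repeats the name of the function or of another object.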
5.1.3 Documenting the code
Good documentation of your code is the most fundamental aspect of good practice, and should be actively considered throughout development and maintenance. Documentation should allow anyone who will need to run your code to do so without any input from anyone else, and should always be stored alongside the code, so there is never a situation where people can find one but not the other. Beyond that, you may also want documentation which explains how to update, develop and bugfix the code, aimed at future technical experts who will want to improve your code easily.
There are a range of different ways to store documentation alongside code. They all have slightly different uses, and it is rare you will be able to properly document code using only a single approach:
Code comments: Code comments are the most common type of documentation you will see, as they are present inside the code files themselves. They need to be marked up to ensure that your coding platform understands that they are comments and not code (usually with either # or /* depending on the language). Code comments should be used to make code human-readable as well as machine-readable; anyone should be able to read the code alongside the comments to understand exactly what the code is trying to achieve. These comments should explain the why of your code, rather than the how, and shouldn’t duplicate what is obvious from the code. E.g. for code stating filter(data, fy %in% c(2009, 2010)), a code comment saying “# filter the data for the two most recent financial years (2009-2010)” is more useful than “#filter for fy in 2009, 2010”, which doesn’t provide any more information than the code alone.
README file: The README file is a document which provides more detailed information than the code comments alone. It is designed to be read alongside the code, but also instead of it, and a user will assume that any information they need to know before running the code will be contained in it. It can also be a lot longer than code comments, so can contain a comprehensive guide to using the code in the project. If you need a user to provide certain credentials, have access to specific folders, or update parameters, this information should be included in the README rather than the code comments. You can create a README in any Github repository by setting up a file called README.md, and Github will automatically display it on the landing page for that repository.
Github wiki: Github also contains the functionality to create an entire wiki for a repository. This has the ability to store a lot more information than a README or code comments, and can be divided into pages. This allows you to build up a full set of desk notes or developer instructions, and is ideal in addition to a README and code comments for complex or large projects, or projects which you expect someone new to pick up maintenance of entirely. A wiki can contain details such as how to change the underlying code of the project, what each file contains, how the code was developed and any rationale behind this, how to change lookups, or how the code fits into a larger project structure.
Issues on Github: Github repositories also allow you to associate issues with a code project; these can either be bugs in the code that need to be fixed, or future development opportunities you have identified. These can be used alongside code comments to document things such as known problems, areas where improvements need to be made, or places where the code is actively being developed. Using issues for these rather than code comments or README documentation keeps your actual documentation clean, but also allows you to look back and see how these issues were developed or fixed in the past; Github keeps a log of past issues, unlike code comments which are gone forever once deleted.
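To illustrate the guidance on code comments explaining the why rather than the how, here is a minimal Python sketch of the same financial year filter. The records list and its field names are hypothetical and exist only to make the example runnable.

```python
# Illustrative records only; "fy" is the financial year, as in the filter example.
records = [
    {"fy": 2008, "value": 10},
    {"fy": 2009, "value": 12},
    {"fy": 2010, "value": 15},
]

# Less useful comment, because it only restates the code:
#   filter for fy in 2009, 2010

# More useful comment, because it explains the intent:
# Keep only the two most recent financial years (2009 and 2010)
recent_records = [record for record in records if record["fy"] in (2009, 2010)]
```

A reader who does not know why 2009 and 2010 were chosen learns it from the second comment but not from the first.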
5.1.4 Maintaining the code
Once your code base is written, it’s unlikely that you’ll be able to run it every month or year into perpetuity without maintaining it. Regular light-touch maintenance reduces the risk of catastrophic failure, and you can reduce the amount of tedious maintenance or updating required by considering how you write your code.
Minimise what actually needs to be updated every time: As far as possible, make use of processes which update automatically or make use of named variables rather than ones which need to be updated every time. For example, create charts which always use the last 10 years of data, not 2010-2020, then 2011-2021, and read in the most recent file in a folder, not a specific named file.
Document code changes: Document all changes made to the code, including bug fixes, new features, and changes in functionality, preferably using something like Github issues and pull requests. This will help you and others understand the evolution of the code over time, and keeps it all stored in one place.
Use a Github workflow to fix changes: Using a structured approach to Github use such as Gitflow allows you to only merge changes into code once you’re sure they have been tested and are working. This allows you to make maintenance changes while also using the live version of the code.
Limit maintenance overheads for your team: Avoid building complicated structures such as packages for your team to maintain unnecessarily, which take a lot more work than simple code. Aim to incorporate any cool functions into existing packages, and don’t create new packages without good reason.
Test your code: Create simple tests for your code using example data. This allows you to easily identify whether errors are caused by the code base or by new data. You can also consider using automated testing tools to help you test your code more efficiently.
Review the code regularly: You should be aiming to check code at least annually to ensure that it is not getting unwieldy, obsolete sections are removed, and content is continuing to meet best practice standards.
Monitor performance: Make a note of areas of the code which run slowly, cause problems, or regularly fail, and target improvement efforts to these areas specifically to minimise future problems.
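Two of the ideas above, using the last 10 years of data rather than named years and reading the most recent file in a folder, can be sketched in Python as below. Both function names are hypothetical, and "most recent" is assumed here to mean most recently modified; a project that puts dates in its file names might sort on the name instead.

```python
from pathlib import Path


def latest_years(years, number_of_years=10):
    """Return the most recent years present in the data, in ascending order."""
    return sorted(set(years))[-number_of_years:]


def most_recent_file(folder, pattern="*.csv"):
    """Return the most recently modified file in folder that matches pattern."""
    files = list(Path(folder).glob(pattern))
    if not files:
        raise FileNotFoundError(f"No files matching {pattern} in {folder}")
    return max(files, key=lambda path: path.stat().st_mtime)


# A simple test using example data: if this passes but a real run fails,
# the problem is more likely to be in the new data than in the code.
assert latest_years(range(2005, 2022)) == list(range(2012, 2022))
```

Because neither function mentions a specific year or file name, nothing in them needs editing when next year's data arrives.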
5.2 Planning for the future
Finally, aspects of good practice should be implemented to ensure your code is as future-proof as possible and less likely to fail as a result of unexpected changes.
Use package/environment management: One of the most common reasons code will fail over time, especially if it is not regularly used or updated, is incompatibility with more recent packages. Use of package or environment managers allows you to maintain the working collection of packages alongside your code, and update them in a controlled way when doing maintenance.
Keep dependencies up-to-date: Keeping dependencies up-to-date can improve security, performance, and compatibility. Regularly check for updates and review the changelogs to ensure that updates don’t introduce new problems.
Avoid premature or overly complex optimisation: Code can be constantly improved, but this is often at the expense of making it more and more complex. Focus on writing clear, maintainable code that meets best practice guidelines, and can therefore be easily modified or updated, rather than chasing speed and efficiency improvements.
Stay up-to-date with new technology: Follow the CRAN network teams channels to keep up to date with significant changes to recommended packages, and changes you may need to make to code as a result of updates to tool versions. These changes will be infrequent, so you don’t need to update constantly!
Plan for change: Plan ahead for changes by making code flexible and adaptable. Avoid hard-coding values that you know may change in the future.
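One possible way to avoid hard-coding values that may change is to gather them in a single settings object and pass it to the functions that need it. The sketch below is an assumption about how this could look in Python; RunSettings, format_headline, and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSettings:
    """Values expected to change between runs, kept together in one place."""

    reporting_year: int
    decimal_places: int = 1


def format_headline(value, settings):
    """Format a headline figure using the settings rather than fixed values."""
    return f"{settings.reporting_year}: {value:.{settings.decimal_places}f}"


# Illustrative value only, not real data.
settings = RunSettings(reporting_year=2023)
headline = format_headline(12.345, settings)
```

Updating the code for a new run then means changing one line where RunSettings is created, rather than searching every file for the old year.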