Chapter 7 Publishing your code

7.1 What is open source code?

Open source code refers to code which is made freely available to anyone, allowing them to view, modify, and distribute the code under the terms of its licence. The term is most often applied to software (e.g. both Python and R are fully open source), where it allows anyone to download and use the software for free, and to modify the code to suit their specific needs or to fix any bugs. In the analytical community, open sourcing analysis code encourages transparency in data and analysis: it allows users to understand how data has been processed and to reproduce analysis themselves to confirm the results, and it improves confidence in data-based decision making.

The Government Digital Service (GDS) promotes the use of open source code in all of its software and digital tools, via the Digital Service Standard. This requires government agencies to use open source code and open standards whenever possible in the development of digital services, which helps to ensure that systems are cost-effective, reliable, and secure.

In DfT, open source code is published in public-facing Github repositories, which makes it easy for any interested party to view, copy and build upon the code you have written. By default, all DfT repositories in Github are initially internally-facing only, and there are a number of theoretical and practical considerations to go through before making the decision to change the repository visibility.

7.2 What are the benefits of open sourcing?

Open sourcing analysis code can help providers of National Statistics to meet their obligations under the Code of Practice, in particular:

Trustworthiness: Open sourcing code promotes transparency and reproducibility in the analysis. By allowing others to see and verify the code, it can help to build public trust in the results and methodology around the analysis and ensure that the results are accurate.

Quality: Open source code helps to ensure that the analysis is of high quality by allowing others to review and test the code. This can lead to improvements in the analysis, and can help to ensure that the results are reliable and valid.

Value: Open sourcing code can help to increase the value of the analysis by increasing understanding of the limitations of the analysis and the data, and therefore the appropriate usage of the data. By allowing others to understand the processes that contribute to the analysis and share their own ideas and techniques, it can also lead to greater coherence between user needs and output.

It can also offer additional advantages including:

Community: Open source code fosters a community of data analysts who share a common interest in the data and analysis. This community can provide support, feedback, and opportunities for collaboration and networking.

Efficiency: Open source code can save time and effort in the data analysis process by allowing analysts to reuse code that has already been developed and tested by others.

By default, you should always consider whether making your code open source would be valuable to your team and your customers.

7.3 When open sourcing isn’t useful

As mentioned above, you should always consider whether open sourcing your code would be valuable. However, there are situations where it is inappropriate or of little value to do so, such as:

  • Sensitive code: While code should never contain hard-coded sensitive variables such as passwords and API keys, there are some situations in which the code itself may contain sensitive information about decision-making processes etc which may not be appropriate to share with the general public.

  • Poor quality code: If code has been written with no documentation, will not be maintained, or contains serious problems, making the code open source may actually reduce trust and quality perception from users who expect a higher standard from public code.

  • Limited resources: For analysis which is conducted at pace and/or with limited resources, it may not be feasible to make the code open source, and efforts should focus on completing the analysis and delivering the results, rather than on sharing the code.

  • Lack of interest: If the code is related to a very niche dataset or use case, it may not be useful to make it open source. In these circumstances, the effort required to make the code open source may not be worth the potential benefits.

  • Sensitive underlying data: If the data being analysed contains sensitive information which means it can’t be made available to the user, it makes the associated code less useful as an open source product. In these circumstances, external users will not be able to run the code themselves, and you should consider if it is still valuable for them to be able to view the code (e.g. to check or understand the underlying methodology associated with published data).

7.4 Risks when open sourcing code

Additionally, there can be a wide range of risks associated with making your code public which must be considered:

  • Security vulnerabilities: Making code open source can increase the risk of sensitive and security-related information being shared publicly. If good practice is consistently followed when coding, this risk should be mitigated.

  • Misuse of the code: Similar to publishing data, making code open source may allow others to use the code in unintended ways or for unintended purposes. It is important to ensure that code is properly documented and caveats and limitations explained, to avoid inappropriate manipulation of data or others producing misleading results.

  • Reputational risk: Making code open source may reduce public trust in analysts if the code contains errors or produces results which do not align with published figures. It is important that open-sourced code is of appropriate quality.

  • Data security risks: Similar to security vulnerabilities, open sourcing code increases the risk of private or pre-release data being shared more widely. As previously mentioned, if good practice is consistently followed when coding, this risk should be mitigated.

  • Maintenance and support: Once code is open sourced, it will require ongoing maintenance and support to ensure code remains current and appropriate to the analysis being conducted, which can be time-consuming. You should ensure that you have the resource to support this code being made public long-term; if you are already properly managing code using a methodology like Gitflow this might be a minimal ask.

  • Transparency beyond code: Making your code base public goes beyond the code itself being public, and will include documentation and your development within Github, including comments, issues, pull requests, etc. Are you in a position to ensure that all of this meets the standard to be public facing?

7.5 Practicalities of open sourcing code

This section assumes you are starting from a point of having your code written, and currently store it in an internal-facing DfT Github repository. In addition to the range of theoretical pros and cons of open sourcing code, there are a number of practical considerations you will want to work through as part of the process of making your code public.

Below is a flow chart which goes through the main steps of open sourcing code, with details of each of these steps covered in the following sections:

7.5.1 Deciding whether to open source your code

The decision on whether to make your code open source should always take into account all of the above pros, cons, and risks. While open code should be the default across the Civil Service, there are numerous considerations which may mean that publishing your code is of low value to the public, and may constitute an ongoing burden for your team.

Always ensure you have considered:

  • Use: is the code something that the general public could make use of, and is it something they would want to make use of?

  • Quality: will the code improve trust in our analysis and make it more transparent?

  • Maintenance: do your current coding practices allow you to easily maintain the code in the public view? What would you need to amend in order to make this so?

If you are unsure whether the benefits of open sourcing your code outweigh the risks, you can get in touch with the StatsAID team to discuss your use case further.

7.5.3 Security considerations

When preparing your code for publishing externally, your key consideration should be how this impacts security; you have the potential to create security vulnerabilities in DfT data and IT infrastructure if this is not carefully managed.

  • Your code should never contain API keys, passwords, or other authentication strings. There are known bots which scan Github repositories for these kinds of obvious vulnerabilities and you will be amazed how quickly these are picked up and used for nefarious purposes. These can even be exposed if pushed to Github and then subsequently deleted, so seek advice from the StatsAID team to remove these from your Git history if you know or suspect this has ever happened.

  • Data and code shouldn’t be stored together, even if the data will be made public at a later point. Storing them together increases the risk of accidental data sharing.

  • Of lesser security concern are links to the locations of code and data; while these don’t present a specific security risk in themselves, they do allow people to build up an idea of how we store data, and they create a maintenance dependency. For published code, you should store any SQL, GCP, G drive, etc locations as environment variables for the project, instead of in the code itself.

  • Code should always undergo independent code review before external publication. Generally, this review can be carried out by any other analyst who is confident in their ability to review code. In situations where a security breach could cause significant reputational or operational damage to DfT, you should contact the Digital Information and Assurance team for expert code review.

  • Consider that on publishing, all aspects of your repository will be visible. Ensure any review also checks commit messages and comments throughout pull requests. You may find that these inadvertently contain information about the processes behind the code which may cause security concerns (e.g. a comment saying “the password is buses2023” or “this section doesn’t work since the local security certificate changed to XYZ”).
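
The advice above on keeping storage locations out of the code can be illustrated with a minimal Python sketch, which reads a location from an environment variable rather than hard-coding it. The variable name `DFT_DATA_BUCKET` and the helper `get_required_setting` are hypothetical, chosen purely for illustration; the same pattern is available in R via `Sys.getenv()`.

```python
import os


def get_required_setting(name: str) -> str:
    """Return the value of an environment variable, failing loudly if unset."""
    value = os.environ.get(name)
    if not value:
        # Fail with a clear message instead of silently using a blank location.
        raise RuntimeError(
            f"Environment variable {name!r} is not set. "
            "Define it locally (outside version control) before running."
        )
    return value
```

An analysis script might then call `get_required_setting("DFT_DATA_BUCKET")` at the point where it connects to the data, so the real location is set on each analyst's machine and never appears in the published repository.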

7.5.4 Making your code fit for public viewing

As mentioned previously, it is unlikely that any code is ready for publication immediately after development. Once your code is written, you will want to work through a number of steps to ensure that it is of suitable quality to be valuable to external users.

  • Test your code: Ensure that your code is functional and has been tested thoroughly. You may wish to include formal unit testing, and to use automated testing tools (such as Github Actions) to catch any bugs or errors further down the line.

  • Meet code standards: Make sure that your code follows the relevant code standards for both DfT and your coding language. A code review can help to ensure this is the case.

  • Clean up your code: Remove any unnecessary comments and unused code snippets. Fix any non-functional code and ensure that critical bugs have been resolved and critical features are complete. Ensure that your code is well-organised and easy to read alongside the provided comments.

  • Clean up your repository: Consider whether the commit messages, comments, issues, and pull requests associated with your development are appropriate for open sourcing; these may contain references to sensitive information, or be unhelpful to an end user. Quite often, it is easiest to move the code to a completely blank repository as part of the publication process.

  • Conduct a security review: Get someone familiar with the coding language and the security considerations of your data and code to conduct a review of the code to check for anything which may cause a security or privacy issue. You can log the results of this as part of your repository, as in this example

  • Add documentation: Add documentation that explains how your code works, its purpose, and any limitations or known issues. It should be accessible to an external user and should particularly not assume any knowledge around limitations or caveats around the data and code. Documentation should include at minimum a README file and inline comments, but may also include a Github wiki.

  • Add a licence: Ensure that your code is appropriately licensed. The licence should state the terms and conditions under which others can use, modify, and distribute your code. You can make use of the licensing template in this repository
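
A security review can be supported, though not replaced, by a simple automated scan. The Python sketch below searches text (for example a code file or an exported list of commit messages) for lines that look like hard-coded credentials. The patterns and the function name `find_suspect_lines` are illustrative assumptions rather than a DfT standard, and a dedicated secret-scanning tool will be far more thorough.

```python
import re

# Illustrative patterns only: real secret scanners use far larger rule sets.
SUSPECT_PATTERNS = [
    re.compile(r"(password|passwd|pwd)\s*(=|:|\bis\b)\s*\S+", re.IGNORECASE),
    re.compile(r"(api[_-]?key|secret|token)\s*(=|:)\s*\S+", re.IGNORECASE),
]


def find_suspect_lines(text: str) -> list:
    """Return (line_number, line) pairs that match any suspect pattern."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in SUSPECT_PATTERNS):
            hits.append((number, line.strip()))
    return hits
```

Any line this flags still needs a human decision; equally, an empty result does not show that the repository is safe to publish.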

7.5.5 Oversight for your public repository

Before making your repository public, you will need to ensure you have someone within your team who has the appropriate technical knowledge to evaluate the code before it becomes public, carry out the transfer to public facing, and monitor the repository after publishing to ensure it continues to meet standards for open sourced code.

To ensure that people have the appropriate skills to do this, DfT runs a Github technical lead training course on request. This course is designed for someone with an existing familiarity with Github and takes them through the technical, security and practical considerations of making a repository public, and managing it after the fact. Once they have completed this course, they are eligible for repository maintainer rights on public repositories.

You will not be able to make any code open source/public facing without someone who has completed this training and is willing to act as the technical lead for your repository.

7.5.6 Making your code public on Github

Once you have completed all of the above steps, actually making your code open source is a very simple process!

  • Your technical lead can create a new repository on the DfT Public Github organisation, with public visibility.
  • They can then add your code to this repository in the way they have agreed is the most appropriate; this might be creating a copy of the entire original repository including issues, comments, etc, or simply copying the code files into the new repository.
  • You will want to consider how you manage the original and public repositories moving forward to avoid confusion. Is it appropriate for you to do all development work in the public repository, or should you do this privately and update the public repository periodically?
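
If the technical lead chooses to copy only the code files into the new repository, a sketch like the one below may help. It copies a directory tree while leaving behind the Git history and some common data and local-settings files. The function name `copy_code_only` and the list of excluded patterns are hypothetical and would need adapting to each project.

```python
import shutil
from pathlib import Path

# Hypothetical exclusions: Git history plus file types that often hold data
# or local settings. Adjust to suit the project.
EXCLUDED = (".git", "*.csv", "*.xlsx", "*.rds", "*.parquet", ".env", ".Renviron")


def copy_code_only(source: Path, destination: Path) -> None:
    """Copy `source` to a new `destination` directory, skipping EXCLUDED items."""
    shutil.copytree(source, destination, ignore=shutil.ignore_patterns(*EXCLUDED))
```

As written, `destination` must not already exist. To copy into an existing clone of the new repository, `shutil.copytree` accepts `dirs_exist_ok=True` in Python 3.8 and later. The copied files should still be reviewed by hand before they are committed.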

7.5.7 Maintaining your open source code

As mentioned above, you will want to consider whether it is appropriate to develop code in public repositories; for published statistics it is likely not possible to do this while maintaining appropriate pre-release controls on published information. For all other code, developing in the open is considered to be good practice and in line with GDS standards.

Maintenance of code which is publicly available is the same process as maintaining any other code. You will want to continue to ensure that it meets coding best practice, works without bugs, and that good documentation is maintained around both running and developing the code. Because both the code and your development work will be viewable by everyone, it is important that you adhere to good coding standards such as:

  • Making a note of issues and feature requests using Github tools, and maintaining this list with progress updates
  • Using an appropriate Git workflow such as Gitflow to keep your code development tidy and ensure breaking changes are not introduced into live code
  • Making sure your commit messages and comments on the code are clear, easy to read, and helpful for a third party to read
  • Regularly ensuring that your code is tidy, and refactoring to remove redundant code when appropriate

Case Study: choosing whether to make code associated with the weekly transport usage stats open source

The weekly transport usage statistics have been produced using a well-established code base for the past 18 months. As part of developing the publication process, consideration was given to whether this code should be open sourced.

Points considered:

  • The data is of significant public interest with a highly technical audience, so it is likely that the code would attract some interest too.
  • The development of the code is carried out using good Gitflow methodology, and the code itself adheres to good practice.
  • The weekly publication timescale meant that resourcing was tight, and it would be difficult to devote significant time to open sourcing.
  • Due to the nature of the data, its structure and the methods used to read it in change regularly; this kind of regular refactoring, often at short notice, may not enhance public trust in the resulting data.
  • All of the data used to produce these outputs is commercially sensitive, and cannot be made available to the public alongside the code.
  • The code itself is for a very specific purpose, and is not generally useful to the public.

Overall, it was judged that the benefits of open sourcing code that the public could not run were limited, and offset by disadvantages for the data and publishing teams.

Instead, aspects of the code which were more generally useful were split into R packages and open sourced where possible (e.g. slidepackr). This has allowed us to maintain our commitment to open sourcing code where possible, but in a practical and useful way for the end user and DfT.