Chapter 9 Application of reproducible analysis to adhoc analysis
Traditionally, the many facets of reproducible analysis have been presented relative to large, pipelined projects which will be repeated on a monthly, quarterly or annual basis. In these instances, it is easy to understand why you have a need for reproducibility (you are going to run this code again, after all!).
However, with one-off and adhoc analysis, the arguments for some reproducible analysis techniques are less clear cut. But this doesn’t mean you shouldn’t aim to use some of them sensibly and strategically where appropriate; as with QA and other quality measures, the aim is to determine which aspects of reproducible analysis practice are relevant and proportionate to your one-off analysis requests.
9.1 What are the advantages of using reproducible analysis practice for one-off analysis?
9.1.1 There’s no such thing as one-off analysis
As this tweet does an excellent job of summarising, there’s rarely such a thing as a true one-off analysis request. You will inevitably be asked to run adhocs repeatedly, make small changes, or expand the scope. While requests might not be repeated every month or year, being asked for small iterative changes to a request over the course of a week still relies heavily on the reproducibility of the method used.
9.1.2 Consistency between different asks
Figures produced for adhoc requests are often subject to particularly high-profile scrutiny. It is awkward and embarrassing to explain why two figures provided at different times differ; this can be avoided through techniques such as code review, and by being able to audit exactly when changes were made.
9.1.3 Facilitating collaboration
It is almost inevitable that there will be times when more than one person wants to work on an adhoc request at once. Use of GitHub facilitates collaboration between individuals as much as it allows for version control, and stops people accidentally overwriting or breaking someone else’s code. Use of things like library management also ensures that code written by one person can be run by another; useful when you’re working at pace!
9.1.4 Understanding what choices were made
Adhoc analysis is generally done at pace, with communication happening through a mixture of in-person conversations and informal channels. This is great at the time, but six months down the line it can be extremely difficult to determine what choices were made. Simple steps like being able to look at code comments, read back through code reviews, or even see who made changes and when on GitHub, allow you to feel confident you understand the decisions made at the time, even when original team members have moved on.
9.1.5 Finding mistakes
Spotting and solving problems in analysis is just as important in adhocs as it is in repeated analysis. Good code comments and code review give you the best possible chance to spot occasions where the intention and end result of code don’t necessarily match.
9.1.6 Not leaving a cold trail
Without careful management, the documentation, code and outputs from adhocs can easily end up scattered across multiple folders and drives, especially when existing systems are set up to more easily store Excel-based analysis. Good repository management can keep all adhoc code in a logical, easy to find location.
9.2 How does this differ from repeated projects?
Obviously, there are fundamental differences between coded projects for one-off analysis and those you will be running repeatedly, both in the relative value of implementing various aspects of reproducible analysis and in the time you have available to devote to improving and developing code. In addition, the way you think about “reproducibility” should differ between the two scenarios.
For repeated projects, reproducibility is often about having code which produces outputs over and over again, in an automated fashion, and with limited opportunities for human intervention to cause things to break; in this circumstance some complexity can often be well worth it to achieve a seamless process. In contrast, for adhoc projects reproducibility is more short term and focussed on being able to run this code again; you want anyone in your team to be able to run this specific piece of code for this specific purpose only, and it doesn’t matter too much if there are steps which need to be done manually, or if the code would need to be edited to perform a similar function in six months’ time. To this end, you should consider the following:
Focus on what provides immediate benefits: There are no bad aspects of reproducible analysis to include in one-off analysis, but equally there are some which are unlikely to be justified given how long they take and/or how complex they are to implement.
Time saving should not be the primary goal: Because the time requirements of reproducible analysis are “front loaded”, the time saving benefits are rarely realised on the first (and possibly only) run of code. In these instances, you should instead evaluate reproducible analysis techniques on the quality benefits they offer.
Include aspects which just formalise approaches you will already be taking: You will have processes in place to ensure the validity and quality of adhoc analysis. The best approaches in reproducible analysis will take these and ensure they are applied equally to code written, rather than creating additional work which you had not previously considered necessary.
9.3 Partial application of reproducible analysis for adhocs
Considering all of the above, which aspects of reproducible analysis does it make sense to apply to adhoc analysis, and which are likely beyond the scope of a one-off project? The graphic below aims to aid in making that decision, ordering aspects from most to least useful in adhoc analysis, and drawing a likely “line of practicality”, beyond which the benefits are unlikely to ever outweigh the costs for an adhoc project.
The aspects you may therefore want to consider for any adhoc request are:
Writing code: As opposed to doing adhoc work in Excel, using R or a similar coding language for adhocs is the most fundamental reproducible analysis technique. It allows you to add commentary, reduces the risk of copy and paste and Excel formula errors when working at pace, increases your ability to recycle analysis for future adhocs, and can leave you with more attractive outputs. As a bonus, it is an excellent opportunity for less confident coders to practise their skills on a small, self-contained project, while confident coders will inevitably be able to code faster than they could produce the same output in Excel.
Adding code comments: Including code comments in your adhoc code is an incredibly valuable use of time. Not only does it ensure that future users can understand the choices made in your code, and know at a glance what each section is supposed to do, it streamlines and simplifies QA by preempting questions about analytical decisions.
Putting it on GitHub: Linking your code to GitHub opens up a whole world of version control and collaboration, making it easier to work on slightly larger adhoc projects, and avoiding the inevitable situation when working at pace where one code change overwrites another. Note, though, that setting up a new repository for every adhoc can be time consuming and tricky to manage; you may want to consider a single adhocs repo with different folders for each set of code.
Getting code review: This just formalises a process that you will almost certainly want to do with adhoc code; getting someone else to cast a quick eye over it and say “yes, you’ve done what you set out to do”. Combined with GitHub, you can store code reviews alongside the code itself, and refer back to them if required. Inevitably, you will sometimes end up doing a general QA of the figures themselves rather than the code; this will depend on timescales and the resource you have available in the team.
Package management: At the very edge of the “line of practicality” is the use of package management such as renv for adhocs. This allows you to refer back to the libraries you used at the time of creation of the code, so you can ensure the code will run again in future, but it can be a little more complex to set up. It’s probably only worthwhile for projects you strongly suspect you will want to run again in more than six months’ time, or will definitely want others to be able to run right now.
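As a rough illustration of the first two aspects above (writing code and adding code comments), the sketch below shows what a small, commented adhoc script might look like. It is written in Python as an example of a "similar coding language" to R. The request, the data, the column names and the function names are all hypothetical, invented purely for illustration. Recording the interpreter version is only a very light-touch stand-in for package management and is not equivalent to what renv provides.

```python
"""Adhoc request (hypothetical): mean attendance by region."""
import csv
import io
import platform
import statistics
from collections import defaultdict

# Hypothetical inline data so the sketch is self-contained; a real adhoc
# would read an extract from wherever your team stores its data.
RAW = """region,attendance
North,92.5
North,94.1
South,90.2
South,
"""


def mean_by_region(raw_csv):
    """Return the mean attendance per region, rounded to 1 decimal place."""
    values = defaultdict(list)
    for row in csv.DictReader(io.StringIO(raw_csv)):
        # Analytical decision: rows with a missing attendance value are
        # dropped rather than treated as zero, so they do not drag the
        # average down. Recording this here pre-empts the QA question.
        if row["attendance"]:
            values[row["region"]].append(float(row["attendance"]))
    return {region: round(statistics.mean(v), 1) for region, v in values.items()}


def run_record():
    """Note the interpreter version used, so the run can be matched up later."""
    return {"python": platform.python_version()}


if __name__ == "__main__":
    print(mean_by_region(RAW))
    print(run_record())
```

The comment on the missing-value decision is the kind of note that lets a reviewer, or a colleague picking the request up six months later, see why the figures look the way they do without having to ask.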