Chapter 14 The Reproducible Analysis Project Checklist

Not sure if your project is a good candidate for a reproducible analysis approach? Complete the following checklist, ticking all of the statements that apply to your work:

Repeatability

The analysis will need to be rerun with similar data at least once a year (“We’d like an update of this slidepack every quarter”)

You are likely to be asked to run the analysis again to look at different groupings or breakdowns (“Can you show me what this looks like when broken down by age instead of gender?”)

The analysis includes similar manipulation or visualisation of slightly different datasets (“We need this same report for the matching data for all our arm’s-length bodies (ALBs)”)

Your report includes similar outputs for different cuts of the data (“I’d like to see the same charts for every local authority”)

Someone else will need to be able to easily run your analysis
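
As an illustration of why these statements favour a reproducible approach, the sketch below treats the breakdown as a parameter, so switching from gender to age means changing one argument rather than rebuilding the analysis. The function name summarise_by, the column names and the figures are all made up for this example; they are assumptions, not part of any real dataset or library.

```python
from collections import defaultdict


def summarise_by(rows, group_col, value_col):
    """Sum value_col for each distinct value of group_col."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[group_col]] += row[value_col]
    return dict(totals)


# Made-up records standing in for the real data.
rows = [
    {"gender": "F", "age_band": "16-24", "local_authority": "A", "count": 10},
    {"gender": "M", "age_band": "16-24", "local_authority": "A", "count": 7},
    {"gender": "F", "age_band": "25-34", "local_authority": "B", "count": 5},
    {"gender": "M", "age_band": "25-34", "local_authority": "B", "count": 8},
]

# The same code answers both requests; only the grouping column changes.
by_gender = summarise_by(rows, "gender", "count")
by_age = summarise_by(rows, "age_band", "count")
```

The same call with "local_authority" as the grouping column would give one total per local authority, which is the kind of repeated output described above.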

Scalability

Your data is large: more than 100,000 rows or more than 50 columns

Your data is growing rapidly: you gain more than 50 new rows per month

Your data refreshes daily or more frequently

Your data comes from a large number of sources (5 or more)

Your analysis comprises multiple stages (3 or more)

Quality

Your analysis is likely to be compared to other analysis in a similar area (“Can you explain why your figure here doesn’t match this one produced 3 months earlier?”)

There is likely to be a large amount of public or media interest in your work, and a methodological or copy/paste error, or a subsequent correction, would be embarrassing for the department

The output of your analysis feeds into high-impact decision making or policy, and needs to be error-free

Auditability of your work in future is a key consideration

Automation

The format of the content you are producing doesn’t change often (for example, commentary such as “category X increased by 10,000, up 12% on the previous quarter”)

Your data is provided in a format which is stable and doesn’t change often (the number and names of columns remain consistent every time the data refreshes)

The process relies on a number of manual steps carried out in order (processes such as saving files with specific names, copying formulae in Excel, or copy-pasting data into the right location)
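
To show what automating these points might look like, here is a minimal sketch under stated assumptions. describe_change builds a commentary sentence in the fixed format quoted above from two period totals, and check_columns confirms that a data refresh still has the expected column names before anything else runs. Both function names, the period wording and the figures are hypothetical and chosen only for illustration.

```python
def describe_change(category, previous, current, period="quarter"):
    """Return one sentence of commentary comparing two period totals."""
    if previous == 0:
        raise ValueError("previous must be non-zero to compute a percentage change")
    change = current - previous
    if change == 0:
        return f"{category} was unchanged on the previous {period}"
    # Percentage change relative to the earlier period.
    percent = abs(change) / abs(previous) * 100
    verb, direction = ("increased", "up") if change > 0 else ("decreased", "down")
    return (f"{category} {verb} by {abs(change):,}, "
            f"{direction} {percent:.0f}% on the previous {period}")


def check_columns(actual, expected):
    """Raise ValueError if the column names differ from those expected."""
    missing = [c for c in expected if c not in actual]
    unexpected = [c for c in actual if c not in expected]
    if missing or unexpected:
        raise ValueError(
            f"missing columns: {missing}; unexpected columns: {unexpected}"
        )


# Made-up totals: 50,000 last quarter and 56,000 this quarter.
sentence = describe_change("Category X", 50000, 56000)
```

With these figures the sentence reads "Category X increased by 6,000, up 12% on the previous quarter". Because the wording is generated from the numbers, it cannot drift out of step with them in the way hand-edited commentary can.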
