Chapter 6 Estimating length and complexity of projects

It is widely acknowledged that it can be extremely difficult to estimate how long a coding task will take, and whether what you want to do needs a few short lines of code or a seriously complex codebase.

This can make it difficult to schedule coding projects into team workplans, and makes it easy to worry that what you assume to be a small project could turn out to be a massive endeavour. Conversely, you could be missing out on quick wins in your work by overestimating how long a simple task with an impressive outcome will take.

6.1 How long will the underlying infrastructure take to set up?

When undertaking any coding project, don’t forget that you may need to request new access or permissions for tools you haven’t used before. Some of these things may take a matter of minutes, whereas others take longer and will need to be considered in any timelines. Some examples of different coding tool requests and associated timelines include:

  • Installing Cloud R, Python libraries, getting access to individual GitHub repos: instant (self-managed within your team)
  • Access to Cloud R: 10 minutes
  • Changing permissions in an existing GCP project: 2 hours
  • Access to GitHub: 1 day
  • Installation of local software (PyCharm, Git Bash, MS SQL): 2-3 days
  • Set up of a new standard GCP project or sandbox: 1 week
  • Design and set up of a custom GCP project: 1-2 months
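
As a rough illustration of how these lead times feed into a schedule, the sketch below takes three of the timings listed above and works out how long you would wait for all of them. It assumes that "1 week" means 5 working days, that the upper end of a range is used, and that requests raised at the same time are processed in parallel; none of these assumptions comes from the list itself, and the dictionary keys and function name are made up for the example.

```python
# Lead times from the list above, in working days.
# Assumptions: "1 week" = 5 working days; upper end of any range is used.
LEAD_TIME_DAYS = {
    "github_access": 1,
    "local_software_install": 3,  # upper end of 2-3 days
    "standard_gcp_project": 5,    # 1 week
}


def infrastructure_lead_time(requests, parallel=True):
    """Return the working days to wait until every request is fulfilled."""
    days = [LEAD_TIME_DAYS[name] for name in requests]
    if not days:
        return 0
    # Requests raised together overlap, so the slowest one sets the wait;
    # requests raised one after another add up.
    return max(days) if parallel else sum(days)


needed = ["github_access", "local_software_install", "standard_gcp_project"]
print(infrastructure_lead_time(needed))                  # 5
print(infrastructure_lead_time(needed, parallel=False))  # 9
```

Under these assumptions, raising all requests on day one almost halves the wait compared with raising them one at a time.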

6.2 Factors influencing timing and complexity in coding projects

On top of infrastructure timings, the complexity and length of time required for coding projects vary significantly. You will want to take the factors below into account when scheduling a coding project.

6.2.1 Uniqueness of the idea

Is this type of project something that is regularly done in your chosen coding language, and are there well-established packages to support you in doing so? If not, you may find yourself working with documentation and code from multiple coding languages and developing your own functions and packages.

Doing standard analytical tasks such as ingesting data, summarising it, and producing charts is easy in languages like R and Python, and there is a lot of associated support to make the code quick and simple to write. On the other hand, doing things like reading files from SharePoint or creating complex PowerPoint documents is rare in R, so you would have to create a lot of the underpinning code from scratch!
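
To give a sense of how little code a standard task needs, here is a minimal Python sketch that ingests a small CSV and summarises it by group using only the standard library. The data, column names and grouping are invented for illustration, and charting is left out to keep the example free of extra packages.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical data; in practice this would come from open("some_file.csv").
raw = io.StringIO(
    "region,value\n"
    "North,10\n"
    "North,14\n"
    "South,7\n"
)

# Ingest: collect the values for each region.
values_by_region = defaultdict(list)
for row in csv.DictReader(raw):
    values_by_region[row["region"]].append(float(row["value"]))

# Summarise: mean value per region.
summary = {region: mean(values) for region, values in values_by_region.items()}
print(summary)  # {'North': 12.0, 'South': 7.0}
```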

6.2.2 An existing code base to build from

Similar to the above point, what code is already available that does exactly (or close to) the processes you’re trying to achieve? Modifying small parts of existing code can save a lot of time compared to writing code from scratch, but it does depend on how similar your data is to the data the existing code was written for.

It can also introduce more complexity than you’d expect both into the development process and into the code itself, as you’ll need to understand what the existing code is doing to know what to remove or change. Trying to build from bad code or code that is very different is often harder than writing from scratch!

6.2.3 The actual thing you want to do

Some stuff is just hard! A good rule of thumb is that increasing the number of technologies involved always makes things more difficult, and the less well connected the technologies are, the more complexity it introduces.

Working between BigQuery and R is more challenging than using just one or the other, but not by much, because there are good, established links between the two platforms. Working between GitHub and BigQuery is extremely difficult as the underlying processes are not in place to make this possible. Similarly, working between R and CSV files is simple (well-established links), whereas working between R and ODS files is much more complex (limited overlap).

6.2.4 Experience your team already has

If your team are new to coding, it will take them time to both build up skills in the specific task, and upskill in the core coding skills such as version control, good practice, and understanding how to fix broken code.

Experienced coders, even if they are not familiar with the specific task you want to do, will be able to do this faster. Be aware that more experienced coders may produce code which is more complex for a beginner to read and edit by default, so you may want to set out your requirements at an early stage.

6.2.5 Using a standardised approach

Most coding languages have a standardised approach to best practice; there’s a relatively prescriptive version for R in this book! Sticking closely to a standard approach of agreed format and packaging won’t save much time, but will result in a simpler codebase that’s easier to maintain and use in future. If you want to do anything wild and exploratory, do so in the knowledge that it could add to both the time taken and the resulting complexity.

6.2.6 Data structure and the stability of this

Data tidying and cleaning is a major part of a lot of analytical projects, and you will usually have to clean up text, move around columns and rows, change column names, and recode values. If the data is in a relatively machine-readable format and the bits that need cleaning up are quite standardised, that’s a boring but routine task. If data comes in already clean and ready to go, that can massively speed up your project. On the other hand, if data comes in with merged cells, colour to denote meaning, non-standard formats of dates or lots of spelling errors, it could delay timelines by several weeks.
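
The routine kind of cleaning described above might look something like the following Python sketch, which tidies column names, parses dates written in a few different formats, and recodes misspelt values. The column names, the list of date formats and the recoding table are all hypothetical; real data will need its own versions of each.

```python
import re
from datetime import datetime


def clean_column_name(name):
    """Lower-case a header and replace runs of other characters with '_'."""
    return re.sub(r"[^a-z0-9]+", "_", name.strip().lower()).strip("_")


# Assumed date formats for illustration; real data may need others.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d %B %Y")


def parse_date(text):
    """Try each known format in turn; return None if none of them match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


# Hypothetical recoding table for common misspellings.
COUNTRY_FIXES = {"engalnd": "England", "england": "England", "wales": "Wales"}


def clean_row(row):
    """Return a copy of the row with tidy names, a fixed country and a date."""
    cleaned = {clean_column_name(key): value for key, value in row.items()}
    country = cleaned["country"]
    cleaned["country"] = COUNTRY_FIXES.get(country.strip().lower(), country)
    cleaned["report_date"] = parse_date(cleaned["report_date"])
    return cleaned


messy = {" Country ": "Engalnd", "Report Date": "3 March 2021"}
print(clean_row(messy))
```

Each new date format or misspelling is one more entry to add, which is why standardised mess is routine. Merged cells or meaning carried by colour cannot be handled by a lookup like this.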

Beyond this, even the smallest changes to a data structure you’ve already established can be very problematic. When these are expected they will still ramp up time taken and complexity, but if they’re unexpected, they can take a long time to resolve.
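
One way to make unexpected structural changes surface early, rather than as a confusing failure later in the pipeline, is to check incoming data against the structure you expect before doing anything else. The sketch below is one possible approach in Python; the expected column names are hypothetical.

```python
EXPECTED_COLUMNS = ["country", "report_date", "value"]  # hypothetical schema


def check_columns(actual_columns, expected=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names, each as a sorted list."""
    missing = sorted(set(expected) - set(actual_columns))
    unexpected = sorted(set(actual_columns) - set(expected))
    return missing, unexpected


# A supplier has renamed "report_date" to "date" without warning.
missing, unexpected = check_columns(["country", "date", "value"])
if missing or unexpected:
    print(f"Structure changed. Missing: {missing}; unexpected: {unexpected}")
```

A check like this does not remove the work of adapting to the change, but it shortens the time spent finding out what changed.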

6.2.7 Exceptions to rules

All code runs on the basis of logical rules. Every time there is an exception to a rule, you need to write more code, and that code becomes more complex. If data for England is always coded as “1”, this is quick and easy; if data for England is always coded “1” on a Tuesday but “2” if it’s raining but “3” if… that’s going to be tougher.
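
The difference can be seen in a small Python sketch. The first function handles the simple rule with a single lookup; the second handles two invented exceptions (a different code before 2020, and a supplier that uses a letter instead), and already needs extra inputs and extra branches. The codes, years and supplier names are hypothetical.

```python
# Simple rule: one code always means one country (codes are hypothetical).
COUNTRY_CODES = {"1": "England", "2": "Wales"}


def decode_simple(code):
    """Look the code up directly; no exceptions to handle."""
    return COUNTRY_CODES[code]


def decode_with_exceptions(code, year, supplier):
    """Same lookup, plus two invented exceptions to the rule."""
    # Exception 1: one supplier uses "E" for England.
    if supplier == "supplier_b" and code == "E":
        return "England"
    # Exception 2: before 2020, England was coded "9".
    if year < 2020 and code == "9":
        return "England"
    return COUNTRY_CODES[code]


print(decode_simple("1"))                                 # England
print(decode_with_exceptions("9", 2019, "supplier_a"))    # England
print(decode_with_exceptions("E", 2021, "supplier_b"))    # England
```

Every further exception adds another branch, another piece of information the function needs to be given, and another case to test.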

6.2.8 How exacting your idea is

If you have an extremely narrow set of specifications for the function or output of the code you are writing, it will take longer to achieve this and the code required to do so may be more complex. This is particularly noticeable with aesthetic specifications; if you know exactly where you want the legend to be placed on a chart, this can be surprisingly complex and time-consuming to achieve. Even a small amount of flexibility can make coding projects much simpler.

6.3 Examples in R

The matrix below gives a rough idea of the time required and the resulting complexity of the code for a range of common analytical coding tasks.

This is not an exhaustive list of tasks, and actual timings will vary massively between individual analysts. In addition, this assumes the analysis is carried out by a confident coder; for new coders you can assume it will take around three times as long (to carry out associated learning, learn as they code, and complete the task). If you are still unsure about the scale and complexity of the project you are interested in carrying out, you can contact the StatsAID team who can provide clearer estimates based on the details of your project.