Chapter 3 Data and coding terminology
Throughout this book (and generally when talking about data and coding!), there is likely to be a range of terminology that you’re unfamiliar with. This chapter aims to flag some of the most common words and explain what they mean for your work and team!
3.1 The people
Cloud engineering: the technical experts of all things infrastructure in the cloud. You will probably only rarely interact with them, as they control the very high-level aspects of your work in the cloud (e.g. creating and deleting projects). Their involvement with a project tends to start at the very beginning of project creation, and stops at the point at which the empty project is ready to receive data.
Data engineering: the people responsible for how data moves through the cloud. They will tend to pick up the project at the point it is created, and will work out how to move your data into the project, how to clean and structure it, and how to save it into its final form ready to use in your analysis. Their interest tends to stop at the point at which you’re creating analysis, insights and visualisations. They can also help advise you on how data should be structured, how you should ingest it from data suppliers, and how you can set up access to your data and metadata.
Data architecture: the ones who decide what the overall structure and setup of projects should be. Rather than being involved in technical development, the architecture team instead ensures that all of the data projects we carry out in DfT follow established protocols in terms of tools, security, standards, and access. They make sure that whether you’re collecting or analysing data, you’re doing it in a safe and reproducible way.
3.2 The cloud
Cloud computing: You’ll often hear people talk about “the cloud” or “cloud computing”. This is a term which basically describes moving your processes to a remote server. So, rather than having physical servers in the basement which contain your data storage or processing capabilities, all of this takes place on a server elsewhere, which we access via the internet. The advantage of doing this is that the underlying hardware and infrastructure becomes someone else’s problem! Rather than having to procure, store and maintain our own servers, we can pay to make use of remote servers belonging to a provider like Google or Amazon. This also gives us a lot of flexibility: we can pay for only the storage and compute power we need at that exact moment, which reduces both cost and environmental impact.
Cloud native: Any computing processes moved to remote servers (aka the cloud) are “on the cloud”. However, just lifting and shifting local processes to remote servers often isn’t the best way to do this: it can be inefficient and expensive, as well as less secure and potentially difficult to maintain. In contrast, “cloud native” processes make use of appropriate cloud tools to move processes to the cloud in an efficient, easy-to-maintain, and effective way. There are cloud native equivalents for almost any tool, especially for things like data storage, processing, and reporting, and the ambition should always be to make use of these where possible.
GCP: The cloud provider DfT primarily uses is Google, and their cloud computing platform is called Google Cloud Platform (GCP). GCP itself is just a platform; basically a place where lots of different cloud-based tools which perform different functions are hosted. You can think of it as similar to your mobile phone, which acts as a platform for tools like messaging, social media, and maps. The tools inside GCP are varied, ranging from data storage and AI tools right up to entire virtual computers which can load and run software exactly like your laptop can.
GCP Project: All of the work you do inside GCP happens inside a project. Similar to a new mobile phone, projects are blank when they are first created; it’s up to the user to choose which tools or apps they want to install in the project in order to complete their work. Only users you choose will be able to access your project, and you can even restrict what tools they can use inside that project (e.g. you could allow them to view one dataset but not another).
Google Cloud Storage, BigQuery, etc: As mentioned above, there is a broad range of tools on the GCP platform, similar to how there are multiple apps on your mobile phone. You might hear the names of some of these used; for example BigQuery (data warehousing), Google Cloud Storage (storage buckets) or Looker (dashboarding). You can pick and choose which of these tools you want to use inside an individual GCP project, based on your needs and experience of using these tools.
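As a rough sketch of what “picking tools” looks like in practice (this assumes the google-cloud-bigquery and google-cloud-storage Python client libraries are installed and you are authenticated; the project ID is made up), each tool comes with its own client, all pointed at the same project:

```python
# A minimal sketch of using two GCP tools inside one project.
# "my-team-project" is a made-up project ID.
from google.cloud import bigquery, storage

PROJECT_ID = "my-team-project"

# BigQuery client for the data warehouse tool
bq_client = bigquery.Client(project=PROJECT_ID)
for dataset in bq_client.list_datasets():
    print("BigQuery dataset:", dataset.dataset_id)

# Cloud Storage client for the storage bucket tool
gcs_client = storage.Client(project=PROJECT_ID)
for bucket in gcs_client.list_buckets():
    print("Storage bucket:", bucket.name)
```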
3.3 Artificial intelligence
AI: Artificial Intelligence is a catch-all term to describe computer programs which are designed to do tasks previously thought to require “human-level” intelligence. It’s a very broad term that can mean anything from a chatbot which can help you book a holiday to a complex machine learning algorithm capable of spotting defects in manufacturing processes. It’s usually not meaningful to ask whether someone can “use AI” to complete a task; it’s similar to asking whether someone could “use computers” to do something!
Machine learning: A subset of AI, where a computer takes an existing dataset and learns from it, rather than being given specific instructions or rules. This can be used for many applications, e.g. understanding the existing data in more detail, making predictions about future cases, or generating new content from it.
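As a minimal sketch of the idea (using the scikit-learn Python library and some made-up journey data), “learning” here just means fitting a model to existing observations and then using it to predict unseen cases:

```python
# A minimal sketch of "learning from existing data": fit a model on
# historical observations and use it to predict new cases.
# Assumes scikit-learn is installed; the data here is made up.
from sklearn.linear_model import LinearRegression

# Existing data: journey distance (km) and observed journey time (mins)
distances = [[5], [10], [20], [40]]
times = [12, 22, 41, 80]

model = LinearRegression()
model.fit(distances, times)   # the model "learns" the relationship

print(model.predict([[30]]))  # predict the time for an unseen distance
```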
GenAI: A specific type of machine learning which is capable of generating new content by learning from datasets of existing examples. This can be human-like conversation, code, pictures, etc. It is the most common type of AI people now experience day-to-day (think things like ChatGPT), but it’s not the only type!
3.4 Data storage and analysis
API: An Application Programming Interface (API) is a way for computers to talk to each other programmatically. This is a great modern way to share data, especially when you know that the end user will want to load it into a database or otherwise process it using code. Rather than storing the data in a human-readable table in an Excel file, an API allows you to store the same data in a computer-readable format. The end user can then make coded requests to return only the data they want (e.g. for a specific year or geographic region), which is particularly useful for sharing big datasets. APIs also allow you to set limits on who can access your data, or how much data they can download at a time, making them secure and efficient.
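As an illustration (the endpoint URL, parameters and response structure below are entirely hypothetical, not a real DfT API), a coded request might look something like this in Python:

```python
# A minimal sketch of requesting data from an API rather than downloading
# a whole file. The URL and parameters are hypothetical examples.
# Assumes the requests library is installed.
import requests

response = requests.get(
    "https://api.example.gov.uk/road-traffic",      # hypothetical endpoint
    params={"year": 2023, "region": "South West"},  # ask only for what you need
    timeout=30,
)
response.raise_for_status()   # fail loudly if the request was rejected

data = response.json()        # machine-readable (JSON) rather than a spreadsheet
print(len(data["results"]))   # "results" is part of the hypothetical response
```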
Database: A database is a structured way to store data that is optimised for fast writing and retrieval of single rows of data. Databases are ideally suited to transactional workloads, where you’re most interested in storing and retrieving one row at a time, e.g. recording the clicks of customers visiting your website. However, they are less well suited to the kinds of analytical workloads that analysts usually want to carry out, where we’re interested in performing more complex aggregations over a large number of rows and columns, e.g. processing daily data for an entire year to produce averages. In DfT, we have traditionally stored our data in large SQL Server databases, both locally and in the cloud.
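As a small illustration of a transactional workload (using Python’s built-in sqlite3 module as a stand-in for a database like SQL Server, with made-up data), the pattern is writing and reading one row at a time:

```python
# A minimal sketch of a transactional workload: writing and reading single
# rows. The table and values are made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (user_id TEXT, page TEXT, clicked_at TEXT)")

# Write one row at a time, as each event happens
conn.execute(
    "INSERT INTO clicks VALUES (?, ?, ?)",
    ("user_123", "/journey-planner", "2024-01-01 09:00:00"),
)
conn.commit()

# Retrieve a single row quickly by key
row = conn.execute(
    "SELECT * FROM clicks WHERE user_id = ?", ("user_123",)
).fetchone()
print(row)
```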
Data warehouse: A data warehouse is similar to a database in that it’s ideal for storing structured data; however, it is optimised for the kinds of analytical workloads analysts carry out. It is capable of storing very large datasets in an efficient and cost-effective way, and comes with associated processing power to allow you to run large and complex queries over the data to produce aggregate statistics, analyse trends over time, and even carry out complex modelling and machine learning approaches. In DfT, our data warehouse solution is BigQuery, and for most applications teams are transitioning their data into BigQuery to take advantage of the benefits it offers over our old SQL Server databases.
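As a sketch of an analytical workload (assuming the google-cloud-bigquery Python client library; the project, dataset and table names are made up), a warehouse query aggregates over many rows in one go:

```python
# A minimal sketch of an analytical query against a data warehouse:
# aggregating a whole year of daily data in a single query.
from google.cloud import bigquery

client = bigquery.Client(project="my-team-project")  # made-up project ID

sql = """
    SELECT region, AVG(daily_journeys) AS avg_daily_journeys
    FROM `my-team-project.transport_data.daily_journeys`
    WHERE EXTRACT(YEAR FROM journey_date) = 2023
    GROUP BY region
"""

for row in client.query(sql).result():
    print(row.region, row.avg_daily_journeys)
```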
Data lake: A data lake is another storage option, often described as “object storage”. This means that while it can hold structured data, unlike a database it’s capable of holding a wide range of unstructured data types too; anything from images to emails to video recordings. A data lake is a great place to dump your files before cleaning and processing; often you will drop raw data into a data lake in CSV format, which is then cleaned and loaded into a data warehouse for further use. In DfT, we make use of Google Cloud Storage buckets as our data lake solution, with these buckets offering a great, flexible option to store almost any type of file as part of your analysis.
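As a minimal sketch (assuming the google-cloud-storage Python client library; the bucket and file names are made up), dropping a raw file into the data lake looks like this:

```python
# A minimal sketch of dropping a raw file into a data lake before any
# cleaning has been done.
from google.cloud import storage

client = storage.Client(project="my-team-project")  # made-up project ID
bucket = client.bucket("my-team-raw-data")           # the data lake bucket

# Upload the raw CSV exactly as received from the supplier
blob = bucket.blob("road-traffic/2024-01-raw.csv")
blob.upload_from_filename("2024-01-raw.csv")
```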
Metadata: Metadata is best described as “data about your data”. This includes things like column names and what they contain, as well as last updated dates, expected values, and what null entries are encoded as. While understanding your data has always been important, digital transformation means that formally recorded metadata is more important than ever; it adds business value to datasets, ensures that data is consistent and not duplicated, allows us to spot problems in data pipelines, and allows us to share data with relevant stakeholders through APIs.
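As an illustration (the dataset and fields below are made up, and in practice this information might live in a data catalogue rather than in code), formally recorded metadata can be as simple as a structured record describing the data:

```python
# An illustrative metadata record for a made-up dataset. The content is
# the same kind of "data about your data" described above.
metadata = {
    "dataset": "road_traffic_daily",
    "description": "Daily traffic counts by count point",
    "last_updated": "2024-01-31",
    "columns": {
        "count_point_id": "Unique ID of the roadside count point",
        "count_date": "Date of the count (YYYY-MM-DD)",
        "vehicles": "Total vehicles observed; -1 encodes a missing count",
    },
}
```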
Pipeline: The word pipeline is used in a range of data contexts, referring to the automated movement of data within a system. In data engineering, this most often refers to a data ingestion pipeline, where data is brought in from an external source, whether this is an API, web scraping, or a file transfer. The data is then cleaned and loaded into a data warehouse, ready to use in further analytical work.
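As a very simplified sketch of a pipeline (using the pandas Python library, with made-up file and column names, and writing to a local Parquet file as a stand-in for loading into a data warehouse):

```python
# A minimal sketch of a data ingestion pipeline: extract raw data, clean it,
# and load it ready for analysis. Requires pandas and a Parquet engine
# such as pyarrow.
import pandas as pd

def extract(path: str) -> pd.DataFrame:
    """Bring raw data in from an external source (here, a CSV file)."""
    return pd.read_csv(path)

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """Clean and structure the data."""
    clean = raw.dropna(subset=["vehicles"]).copy()
    clean["count_date"] = pd.to_datetime(clean["count_date"])
    return clean

def load(clean: pd.DataFrame, destination: str) -> None:
    """Save the cleaned data ready for analytical use."""
    clean.to_parquet(destination)

load(transform(extract("2024-01-raw.csv")), "road_traffic_daily.parquet")
```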
Reproducible analytical pipeline: Related to, but distinct from, data pipelines, you may often hear the term “reproducible analytical pipeline”, or RAP. This describes the process of taking the data produced by a data pipeline and incorporating it into analysis to produce reports, charts and tables in a reproducible way which meets coding best practice.
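As a minimal, illustrative sketch of the analytical end of a RAP (continuing the made-up pipeline example above, with pandas), the same script reproduces the same output table every time it is run:

```python
# A minimal sketch of the analytical end of a RAP: read the cleaned data
# produced by the pipeline and reproduce the same summary table each run.
# File and column names are made up.
import pandas as pd

clean = pd.read_parquet("road_traffic_daily.parquet")

summary = (
    clean.groupby(clean["count_date"].dt.year)["vehicles"]
    .mean()
    .rename("average_daily_vehicles")
)

# Write the publication table to a file rather than copying and pasting
summary.to_csv("average_daily_vehicles_by_year.csv")
```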
3.5 Code quality
Code review: Similar to how data outputs need quality assurance before they can be used, code needs to be peer reviewed before it can be used to produce a live output, whether this is a dashboard, a model, or a chart. Code review is often used interchangeably with QA, but it’s actually quite different. A code review can be done by a non-expert in the data and analysis, and looks for issues with the code that may mean it is not reproducible, slow to run, or badly structured. Similar to data that hasn’t been QA’d, non-reviewed code presents significant risks to the quality and trustworthiness of the project it is associated with.
Open source: Code and programs are described as “open source” if the underlying code is freely available for others to view, make use of, and even copy and edit for themselves. You will likely hear “open source” used in two circumstances, one as the user and one as the creator. In the first example, people will often encourage you to use open source coding tools; these are things like R and Python which are free to use, and the underlying code explaining exactly how these languages work is freely available. Open source coding tools offer advantages over proprietary software like Excel or SPSS for analysis: they come without licensing costs, and they also offer much better transparency in how the analysis is processed.
In the second example, people will be discussing the advantages of open sourcing your own analysis as code; this means making the code publicly available through a code platform like GitHub, so other people can have confidence in the work you’ve done, as well as make suggestions to improve it.
Version control: When writing code, it is important that the code is appropriately versioned. This allows you to know which version is live, whether it has been reviewed, and what changes have been made between versions. Code version control is carried out using Git and GitHub; these are tools purpose-built for version controlling code, and they allow you to update, edit, review and use code easily. Alternatives such as storing code on Sharepoint or a shared drive mean that version control cannot be done properly, which limits the quality of coded analysis.