Chapter 3 Google Cloud Storage


Before working with data in BigQuery, you need a way to upload and store your files in the cloud. One common approach is using Google Cloud Storage (GCS) — Google’s secure cloud-based file system.

GCS acts like a shared drive in the cloud where you can upload large CSVs, Excel files, JSON files, or other formats. From there, you can easily bring your data into BigQuery for analysis.

Note: You can also skip GCS altogether and upload data directly into BigQuery — we’ll cover that method in Chapter 4. But using GCS can be useful if:

  • You’re working with large or sensitive files
  • You want to store source files separately from your analysis
  • You need a repeatable or automated upload process (e.g. scheduled jobs, or files such as images that are not destined for BigQuery)

In this chapter, we’ll show you how to:

  • Get access to a GCS bucket
  • Upload data to it (manually or using R)
  • Transfer data from GCS into BigQuery

But first, there is some key terminology we need to cover…

3.1 Definition

Google Cloud Storage (GCS) is a scalable and secure object storage service provided by Google Cloud Platform (GCP). It allows users to store vast amounts of unstructured data, such as images, videos, and datasets, in the cloud. GCS is designed to handle large-scale data with high availability and durability, making it an ideal choice for storing critical data that needs to be accessed frequently or infrequently.

Key features:

Scalability

GCS can scale to store as much data as needed, from a few gigabytes to petabytes, making it suitable for both small projects and large enterprise data storage needs.


Security

Data stored in GCS is encrypted both at rest and in transit. Users have control over who can access their data, with options to set permissions at the bucket or object level.


Data accessibility

GCS integrates seamlessly with other GCP services, such as BigQuery for data analysis and AI Platform for machine learning. It also supports access via a web interface, command-line tools, and APIs, providing flexibility for various use cases.


Cost efficiency and storage class

GCS offers multiple storage classes, each designed to optimise costs based on data access frequency and retention needs.


3.1.1 Why use GCS with R?

GCS is an essential tool for R users working with large datasets. Unlike storing data locally or using shared drives like G/V Drive or SharePoint, GCS offers a scalable, secure, and cost-effective cloud storage solution. With GCS, you can store datasets too large to fit into local memory and access them as needed, which is crucial when working with R on data analysis tasks.

In R, large datasets can overwhelm your computer’s RAM, causing performance issues or crashes. GCS helps solve this problem by allowing you to store your data remotely and load it into R only when needed. This way, you avoid overloading your system’s memory and can handle much larger datasets that wouldn’t be manageable with local storage. Using the {googleCloudStorageR} package, you can read data directly from GCS into R without manually downloading files, streamlining your workflow.

Additionally, GCS lets you centralise your data, making it easily accessible to team members and collaborators without transferring files. With fine-grained access controls, you can manage who has permission to view, edit, or delete files, ensuring that your data remains secure and organised. This flexibility makes GCS a reliable choice for R users looking to work with large datasets, maintain efficient workflows, and collaborate in a secure cloud environment.

3.2 How to: Access GCS

When you request access to a new GCP project or an existing one, GCS should already be enabled.

You can navigate to GCS using the search bar, or by using the Navigation menu (three lines in the top left corner of the console) > All products > Storage > Cloud Storage. You can also “pin” products to your navigation menu from the “All products” page.

3.3 Buckets

Cloud Storage is a service for storing files in GCP in containers called “Buckets”. You can store files of any format in buckets.

Buckets are the basic containers that hold your data. Everything that you store in Cloud Storage must be contained in a bucket. You can use buckets to organise your data and control access to your data, but unlike directories and folders, you cannot nest buckets. You can, however, nest folders within a bucket. Buckets are associated with a GCP project and there is no limit to how many buckets you can have in a project or location.

Objects are the individual pieces of data that you store in Cloud Storage. There is no limit on the number of objects that you can create in a bucket.

Read the documentation on Buckets for more detail.

3.3.1 Naming buckets

In GCP, every bucket must have a unique name across the entire platform, not just within a single project. Once a bucket is deleted, its name becomes available for use by any GCP user. There are several rules to follow when naming buckets:

Allowed characters

Bucket names can only contain lowercase letters, numbers, dashes (-), underscores (_), and dots (.). Spaces are not allowed, and names that contain dots require additional verification.


Naming structure

Bucket names must start and end with a letter or number.

Best practice: start the name with project-name-what-your-bucket-is-doing-or-storing.


Length requirements

Bucket names must be between 3 and 63 characters long. If the name contains dots, it can have up to 222 characters in total, but each component separated by a dot can be no longer than 63 characters.


Not an IP address

Bucket names cannot resemble an IP address in dotted-decimal format (e.g., 192.168.5.4).


Restricted prefixes and terms

Bucket names cannot start with the prefix “goog” or contain the word “google”, including common misspellings like “g00gle”.


3.3.1.1 Exercise

Which of the following bucket names are invalid? Why?

  1. My-Travel-Maps
  2. ellies_test_bucket_15
  3. my_google_bucket
  4. test bucket
  5. data_entry-bucket-12
  6. dft-stats-gcp-showcase-training-datasets
Solution


The 1st, 3rd and 4th names are invalid, while the rest are valid with some considerations.

  1. INVALID; as it contains uppercase letters
  2. VALID; however, it is bad practice to name buckets after an individual and the bucket purpose is not obvious
  3. INVALID; contains “google”
  4. INVALID; contains a space
  5. VALID; however, it is bad practice to mix hyphens and underscores in the same bucket name, so best to choose one and keep it consistent
  6. VALID
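
The naming rules above can be sketched as a quick R check. This is an approximation for illustration only, not the full GCS specification (dotted names and the 222-character limit are not handled):

```r
# Approximate check of GCS bucket naming rules:
# lowercase letters, numbers, dashes and underscores; 3-63 characters;
# must start and end with a letter or number; no restricted terms
is_valid_bucket_name <- function(name) {
  grepl("^[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]$", name) &&  # allowed characters and length
    !grepl("^goog", name) &&                            # restricted prefix
    !grepl("google|g00gle", name)                       # restricted terms
}

is_valid_bucket_name("My-Travel-Maps")        # FALSE: uppercase letters
is_valid_bucket_name("test bucket")           # FALSE: contains a space
is_valid_bucket_name("ellies_test_bucket_15") # TRUE
```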


3.3.2 Regions

All buckets are associated with a region. This is where the data is physically kept in a Google data centre. At DfT, the standard region is europe-west2 which is based in London.

If you have previously used europe-west1 for your projects, there is no need to change it — the cost difference is minimal.

We typically use europe-west1 or europe-west2 as our standard regions because they offer lower costs and reduced CO₂ emissions compared to other regions. Additionally, some GCP products may not work seamlessly across regions, and transferring data between regions can incur extra charges.

3.3.3 Storage types

All buckets have a storage class. This informs the object’s availability and pricing. When you create a bucket, you can specify a default storage class for the bucket.

The four storage classes are:

Standard

Ideal for data that requires frequent access, providing the highest level of availability and performance. Suitable for active datasets, frequently accessed media files, and applications requiring low-latency access.


Nearline

Cost-effective for data accessed less frequently but still needs to be available on demand, typically once a month or less. This is suitable for backups, data archiving, and disaster recovery files.


Coldline

Best for data that is rarely accessed, at most once per quarter, and can tolerate longer retrieval times. This is ideal for long-term backups, archival data, and compliance data.


Archive

The most cost-effective option for data accessed very infrequently, typically once a year or less. This is suitable for historical data, long-term compliance records, and other data that needs preservation but is not actively used. Archive is referred to as the “coldest” storage class.


The longer the access period for the storage class, the cheaper it is to store; Archive has the cheapest costs for data storage. However, it is more expensive to retrieve the data for the longer timeframe classes.

Storage class should be selected based on how often you want to use the data (download, change, view, access from R or BigQuery). If you need to access data on a weekly basis, then it will be cheaper in GCS to use Standard as the access costs are lower. If you are likely to never use the data again unless there is an audit, then it will be cheapest to store in Archive as the access costs will be null and the storage costs will be lowest out of the four classes.

Read the class description GCP documentation for more detail.

Most data being actively used in a GCP project will require Standard storage.

When you add objects to a bucket, they automatically inherit the bucket’s default storage class, unless you specify a different one for the object. If no default storage class is set when creating the bucket, it defaults to Standard storage.

Changing a bucket’s default storage class only affects newly added objects—any existing objects in the bucket will retain their original storage class.

You can enable the Autoclass feature on a bucket to let Cloud Storage automatically manage storage class transitions for you. When Autoclass is enabled, the bucket starts with the Standard storage class, and objects are automatically moved to lower-cost storage classes if they are not accessed over time.

Autoclass is ideal for objects that are frequently accessed at first but become less needed over time. However, if you expect the access frequency to remain the same throughout the bucket’s lifetime, it’s better to manually select the most suitable static storage class from the start.

3.4 How to: Use GCS

3.4.1 Create a bucket

Following the announcement of the GCP project transition, the click-based (“ClickOps”) option for creating a bucket will be unavailable.

We have retained the method in this playbook because, during the initial stage of the transition to the new infrastructure, some existing projects may still use ClickOps. This will no longer be an option once the transition is complete.

To create a bucket:

  1. Go to Google Cloud Console > Type in “Buckets” > Click on “Create”
  2. In each section:
Get started

Choose a globally unique, permanent name. Refer to the Google Naming guidelines, or Section: Naming Buckets for guidance. Tip: Do not include any sensitive information.

Leave the rest blank and click “Continue”


Choose where to store your data

Set the Location type to “Region” and choose “europe-west2 (London)”. Leave the rest blank and click “Continue”.


Choose a storage class for your data

Choose an appropriate storage class. Refer to the Section on Storage types for guidance. Leave the rest blank and click “Continue”.


Choose how to control access to objects

No need to change anything, leave as the default options.


Choose how to protect object data

No need to change anything, leave as the default options.


  3. Click “Create”

3.4.1.1 Folders

Folders act in a similar way as folders in your file explorer. They help organise data and can have multiple layers of nested subfolders. Folders inherit the region, storage class, and policies of the bucket they are created in.

While folders can improve data organisation, they do not function like buckets or objects when it comes to permissions and policies. You cannot set specific access controls on a folder—permissions must be managed at the bucket or object level.

To create a folder:

  1. Navigate to your bucket
  2. Select “Create folder”
  3. In the pop-up window, enter a name for your folder. Though a folder name does allow spaces, it is good practice to use underscores or hyphens (e.g. folder_name or folder-name) anyway.

When referencing items in a folder, the path follows the format: bucket-name/folder-name/filename.

You can easily copy a folder’s path from within your bucket by clicking the overlapping paper icon next to the folder name.

3.4.2 Connect to GCS from R

Step 1: Install the relevant packages
  1. Install the {gargle} package: install.packages("gargle")
  2. Load the package in your script: library(gargle)
  3. Install the {googleCloudStorageR} package: install.packages("googleCloudStorageR")
  4. Load the package in your script: library(googleCloudStorageR)


Step 2: Sign in to Google via R code
#RUN THE FOLLOWING CODE SNIPPET ONCE
#RECOMMEND DOING THIS IN YOUR R CONSOLE, NOT IN YOUR SCRIPT

#Keep on clicking 'Continue', ensure you select your work email to sign in to Google

#Follow the authorisation code instructions

scope <- c("https://www.googleapis.com/auth/cloud-platform")

token <- gargle::credentials_user_oauth2(
  scopes = scope
)


Step 3: Authenticate your R session
#RUN THIS FOR EACH R SESSION
#This can be added to the top of your script

googleCloudStorageR::gcs_auth(token = token)

If the scope and token no longer appear in your Environment pane (top right of your R Studio), then you may need to complete Step 2 again. This can happen if you’ve started a new R session or cleared your environment, as these objects are not saved automatically between sessions.
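
To avoid re-running the browser flow unnecessarily, you could guard the authentication; a sketch using the same calls as above:

```r
# Re-create the token only if it is not already in the current environment
if (!exists("token")) {
  scope <- c("https://www.googleapis.com/auth/cloud-platform")
  token <- gargle::credentials_user_oauth2(scopes = scope)
}

# Authenticate this R session with the token
googleCloudStorageR::gcs_auth(token = token)
```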


Step 4: Check the connection by listing GCS objects

To check that your connection is successful, run the following code to view your objects (i.e. data) in your bucket:

gcs_list_objects(bucket = "your_bucket_name")


For any help with using this method, please post on the GCP channel, as others may be experiencing the same issue and can provide suggestions and potential solutions.

We also have a dedicated post about using this method, which many people have commented on, so do check there if you need assistance—someone may have had a similar experience to you.

3.4.3 Add data

Now that you have created a bucket, you can upload or add data to it. There are several ways to do this; in this playbook, we will cover three methods:

3.4.3.1 Add data: From your device

How to manually upload or add files into a GCP Bucket


Written instructions:

  1. Navigate to your bucket folder
  2. Select “Upload”
  3. Select “Upload files” or “Upload folder” from the dropdown menu. Please note, when uploading a folder, the folder structure from your upload will be recreated in GCS
  4. Select your file or folder from the file explorer pop-up
  5. Click “Upload”

3.4.3.2 Add data: From a bucket transfer

How to add data into a bucket from a bucket transfer


Written instructions:

  1. Navigate to your destination bucket or folder
  2. Select Transfer data > Transfer data in
  3. In each section:
Get started

Keep the “Source type” and “Destination type” as Google Cloud Storage


Scheduling mode

This should remain as “Batch”


Choose a source

Select your bucket or folder. This can be done by either pasting in the path, or using the Browse option

  • If you are using data from a different project, you need to include the project ID in the path
  • This option will copy everything from the source folder unless you specify which files. You can do this by using Filter by prefix or Filter by last modified time


Choose a destination

You can change the destination. By default, it will be set to your current bucket or folder


Choose when to run job

You can change the schedule from a one-off job to a specific time, or a regular transfer


Choose settings

Add a description

  • This is the section where you can also change metadata options, how duplicate names are handled or overwritten, whether the source data is deleted, and notifications or logging for the job


  4. Click “Create” to transfer your data

3.4.3.3 Add data: From (Cloud) R

How to add data from R into a GCS bucket


Written instructions:

Before following any of the steps below, make sure your data is saved as a CSV file (or another suitable format, such as JSON).

In R, you can do this using functions like write.csv() or readr::write_csv().
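
For example, a minimal way to save an in-memory data frame as a CSV ready for upload (the data and file name are illustrative):

```r
# Save a small data frame as a CSV file in the project folder
df <- data.frame(id = 1:3, value = c("a", "b", "c"))
write.csv(df, "file_name.csv", row.names = FALSE)
```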

To upload your data from R to GCS, you can use the gcs_upload() function:

googleCloudStorageR::gcs_upload("file_name.csv", bucket = "your_bucket_name")

Replace:

  • "file_name.csv" with the actual name of your data file
  • "your_bucket_name" with the name of your GCS bucket

This will upload the file directly to the root of the bucket. Or, if you want to upload a file to a folder in a GCS bucket:

googleCloudStorageR::gcs_upload(
  file = "path/to/your-file.csv",
  bucket = "your-bucket-name",
  name = "folder-name/your-file.csv"
)

folder-name/your-file.csv means the file will be saved inside “folder-name” within the bucket.

Even though folders in GCS don’t work exactly like folders on your computer, this structure helps with organising files.

After running the command, you should see a message in the R console confirming the upload. It will look something like this:

ℹ 2025-08-05 10:26:24.730322 > File size detected as  84.2 Kb
ℹ 2025-08-05 10:26:25.160446 > Request Status Code:  400
! http_400 Cannot insert legacy ACL for an object when uniform bucket-level access is enabled. Retrying with predefinedAcl='bucketLevel'
ℹ 2025-08-05 10:26:25.230381 > File size detected as  84.2 Kb
==Google Cloud Storage Object==
Name:                file_name.csv 
Type:                text/csv 
Size:                84.2 Kb 
Media URL:           https://www.googleapis.com/download/storage/v1/b/your_bucket_name/o/file_name.csv?generation=...
Download URL:        https://storage.cloud.google.com/your_bucket_name/file_name.csv 
Public Download URL: https://storage.googleapis.com/your_bucket_name/file_name.csv 
Bucket:              your_bucket_name 
...

Your upload was successful if you see output like this (and no error).

Next, we will move onto sharing data to ensure that the right people or applications can access your uploaded files.

3.4.4 Share data

Buckets can be shared both internally and externally.

3.4.4.1 Setting permissions

There are two ways to set GCS bucket permissions:

Using IAM
  • To grant access at the project level, you need to request this through the ITFP.
  • ITFP can assign the Storage Admin role to an individual or a group.
  • Storage Admin provides full access, including creating, deleting and managing objects.
  • Refer to permissions reference for more detail.
  • This method grants access to all buckets within the project


Granting permissions to specific buckets

You must have the Storage Admin role to manage bucket permissions directly

  1. Open the Permissions tab in the Google Console
  2. Ensure the view is set to “Principals”, then click “Grant Access”
  3. In the New Principals box, enter the Google account email of the user or group (DfT users can be searched by name).
  4. In the Select a role dropdown, choose Cloud Storage, then select the appropriate role
  5. Click “Save”


3.4.4.2 Best practice for managing permissions

Permissions should always be assigned to a team or group rather than individual users.

This makes it easier to manage access when team members change roles. A single ITFP request can update the GCP Group instead of modifying multiple projects.

If you need to grant permissions for a specific bucket, submit an ITFP request and clearly state that the role should only apply to the named bucket (which must already exist).

Once access is granted, always double-check that the correct permissions have been applied.

3.4.4.3 Granting permissions to external teams

In some cases, you may need to grant bucket access to teams outside of the department.

For example:

  • Allowing data providers to drop data in on a regular, agreed basis
  • Providing external users with access to outputs

Before granting access to external users, please consult with the Data Engineering team.

3.4.5 Read data

Data stored in GCS can be used in several ways. Some popular options include:

  • As a table or external table in BigQuery
  • Downloading the file locally
  • From Cloud R
  • Using Cloud Run functions

In this playbook, we will look at using data in GCS as a table in BigQuery or in Cloud R.

3.4.5.1 Read data: In BigQuery

This option is useful when you have a data return and wish to keep the original data for audit purposes, and query it with SQL. It will only work for a specific set of file types, including CSVs, newline-delimited JSONs, and Parquet files.

Using data from a bucket into BigQuery


Written instructions:

Step 1: Go to Google Cloud Console

Navigate to BigQuery on the Google Console.


Step 2: Select the Dataset you want to create the table in
  • If you already have a dataset in BigQuery, select it from the left-hand panel

  • If you do not have a dataset created, go to the BigQuery console, click the three vertical dots next to your project name, and then click “Create data set”:

    1. Enter a unique Data set ID. Refer to Chapter 4 BigQuery | Naming BigQuery Objects for more detail.
    2. Select “Region” for Location type
    3. Select “europe-west2” from the Region dropdown menu
    4. Leave everything else as the default options and click “Create data set”


Step 3: Select where you are sourcing your table from

Source > Create table from > Google Cloud Storage


Step 4: Select the file

Select the file you want to create a table from, and select the File format if it is not autoselected


Step 5: Choose an appropriate name for your table

Write an appropriate, useful table name. Refer to Chapter 4 BigQuery | Tables for more detail


Step 6: Select Table type
  • Native table - this is a standard approach. Once data is loaded into BigQuery, it is stored within BigQuery itself. Any changes made to the original data in Cloud Storage will not affect the table after it has been created.

A native table is used when you want to load data directly into BigQuery for faster performance, full SQL support (including updates and deletes), and advanced features like partitioning and clustering. This is the standard approach when your data doesn’t need to stay in sync with the original source.

  • External table - allows you to query data directly from an external source, such as a Cloud Storage bucket, without storing the data in BigQuery. The data remains in Cloud Storage, and the table in BigQuery is read-only—meaning you cannot modify or delete data from BigQuery.

An external table is used when you want to query data stored outside BigQuery, such as in Cloud Storage, without duplicating or importing it. This option is ideal for read-only access to frequently updated files, allowing you to always work with the latest version of the source data.


Step 7: Adjust the partitioning and clustering settings (if needed)

If needed, adjust the partitioning or clustering settings. Otherwise, leave them as the default options.


Step 8: Create table

Now you are ready to click “Create table”.


Note: If your dataset in Cloud Storage is too large to load all at once, consider using a temporary external table in BigQuery. This lets you query and filter only the data you need before saving it as a permanent table. Check Chapter 4 BigQuery | Querying using SQL.
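
The console steps above can also be scripted from R. The sketch below assumes the {bigrquery} package, with placeholder project, dataset, and bucket names; check ?bq_table_load for the arguments available in your version:

```r
library(bigrquery)

# Reference the destination table (it is created if it does not exist)
destination <- bigrquery::bq_table("your-project-id", "your_dataset", "your_table")

# Load a CSV from GCS into a native BigQuery table
bigrquery::bq_table_load(
  destination,
  source_uris = "gs://your_bucket_name/file_name.csv",
  source_format = "CSV",
  nskip = 1  # skip the header row
)
```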

3.4.5.2 Read data: In (Cloud) R

This option is useful for smaller files that do not necessitate a BigQuery table, such as a lookup table that will change infrequently.

How to read data from a GCS bucket into R


Written instructions:

Use googleCloudStorageR::gcs_get_object() to download a file from your GCS bucket.

This function takes the following key arguments:

  • object_name: The exact path and filename of the file inside your GCS bucket.
    • If the file is in the root of the bucket, just use the filename (e.g., "example_file.csv").
    • If the file is nested in folders, include the full path (e.g., "folder/subfolder/example_file.csv").
  • saveToDisk (optional): The name (and path) you want the downloaded file to have on your local machine, relative to your R project folder.
    • Use "example_file.csv" to save it to your root project folder.
    • Use "data/example_file.csv" to save it inside a data/ folder in your project.
    • This avoids needing to change your working directory manually.
  • bucket: The name of your GCS bucket

For example:

googleCloudStorageR::gcs_get_object(
  object_name = "sample_file.csv",
  bucket = "your_bucket_name"
)

NOTE: Do not include the full URL or "https://storage.googleapis.com/" in the object_name. The function expects the path inside the bucket, not a web address.
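
If you would rather save the file to disk than stream it into memory, the same call can be made with saveToDisk (file names are illustrative):

```r
# Download the object and write it to a local file instead of parsing it
googleCloudStorageR::gcs_get_object(
  object_name = "sample_file.csv",
  bucket = "your_bucket_name",
  saveToDisk = "data/sample_file.csv",
  overwrite = TRUE  # replace any existing local copy
)
```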

If you want to read the file directly into R without saving locally (for example, a CSV or JSON), you can use the parseObject = TRUE argument, which will return the data as an R object.

To read the file directly into R as an object (instead of saving it to disk), you can assign it like this:

data <- googleCloudStorageR::gcs_get_object(
  object_name = "sample_file.csv",
  bucket = "your_bucket_name"
)

3.4.5.3 Reading multiple files from a bucket

Sometimes you may have several files stored in a Google Cloud Storage bucket and want to read them all into R in one go.

The example below shows how to:

  1. authenticate to Google Cloud Storage using a service account key
  2. list all files in a bucket
  3. loop over the files and read them in
  4. add a new column identifying which server each file came from

Note: This is a sample code snippet demonstrating one way to read multiple files from a bucket. It is intended as a template that can be adapted to your own work.


library(googleCloudStorageR)
library(jsonencryptor) # if you're using encrypted JSON
library(dplyr)
library(purrr)
library(stringr)

# Authenticate (replace with your own method if not using jsonencryptor)

googleCloudStorageR::gcs_auth(jsonencryptor::secret_read("encrypted_key.json"))

# list files in the bucket

files_df <- googleCloudStorageR::gcs_list_objects(bucket = "your_bucket_name")

# loop over files and read them in

all_users <- purrr::map_dfr(files_df$name, function(file_name) {
    
    # Extract the server name from the file name by removing the suffix
  
    server_name <- stringr::str_remove(file_name, "_users_email\\.csv$")
    
    # gcs_get_object() will parse the CSVs into a data.frame automatically
    
    df <- googleCloudStorageR::gcs_get_object(
      object_name = file_name,
      bucket = "your_bucket_name"
    )
    
    # data manipulation to get the data in the format I want
    
    df %>%
      dplyr::rename(
        SAMAccountName = 1,
        EmailAddress = 2
      ) %>%
      dplyr::mutate(server = server_name, .before = 1)
    
})

Why use purrr?

Using purrr::map_dfr() is usually better than alternatives like lapply() or writing a manual for loop because:

  • It automatically combines results into a single dataframe (_dfr = row-bind dataframes).
  • The syntax is concise and integrates naturally with the tidyverse.
  • It is easier to read and maintain, especially when working with pipelines (%>%).
  • It avoids manual bookkeeping (e.g. pre-allocating an empty dataframe, binding results at the end).

So while for loops or lapply() could achieve the same result, purrr::map_dfr() makes the code cleaner, safer, and less error-prone.

3.4.6 Delete a bucket

If you have the correct permissions, it is straightforward to delete data, folders, and buckets. We will discuss setting permissions and bucket protection to prevent accidental deletion later in Chapter 3.5: Making best use of GCS.

Data, folders, and buckets can be deleted in the console using the Delete button. Select the object/folder/bucket using the tick boxes on the left and click “Delete” in the toolbar that appears above.
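
Deletion can also be scripted from R; a sketch using {googleCloudStorageR} with placeholder names:

```r
# Remove a single object from a bucket
googleCloudStorageR::gcs_delete_object(
  object_name = "folder-name/old_file.csv",
  bucket = "your_bucket_name"
)
```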

3.5 Making best use of GCS

3.5.1 When to use GCS

Data ingestion

One of the main use cases for a bucket is to “drop” data returns in as the source for a data pipeline. Some typical options include:

  • Run a data transfer on a regular basis e.g. monthly from a CSV saved in a bucket to a BigQuery table
  • Trigger a Cloud Run Function (not covered in this course) pipeline to ingest, clean and populate a BigQuery table on a periodic schedule
  • Manually create BigQuery tables from a CSV saved using the Google Cloud user interface

For questions about how to best ingest your data, please contact the Data Engineering team (data.engineering@dft.gov.uk) who can advise on best practice.


Retaining raw data

Like how the G/V drive has been historically used, you can use GCS to archive raw data returns. Data in GCS can be reasonably and securely stored for years allowing for any audit or quality assurance activities to access source data or complex analysis. Source data can be saved in the same GCP project as complex pipelines or published data tables meaning all relevant data is in the same platform.


Small data

As mentioned in the previous chapter, data can be queried directly from Cloud R. For small data sets, e.g. lookup tables, Cloud R can comfortably handle downloading and processing tables from GCS for use in analysis.

Note: You MUST store large datasets in BigQuery to be able to connect them to Cloud R. This prevents unnecessary memory use in reading source files. BigQuery is better suited for large tables of structured data. We cover getting data from R into BigQuery in Chapter 4.


Unstructured data

GCS can store data in any format including pictures, videos, text files, and sensor data.


3.5.2 When to not use GCS

We suggest not using GCS for the following purposes:

  • Large, structured data which needs to be accessed regularly for analysis (use BigQuery instead as your analytics database, and store raw data in GCS)
  • Documentation (use GitHub to document data being transferred from R to GCS, and vice versa)
  • Storing “secret” data – storing data that is “Official-Sensitive” in GCS is fine

3.6 Data Lake

You may hear the term “Data Lake” mentioned when discussing GCS usage. A data lake is a centralised repository designed to store, process, and secure large amounts of structured, semi-structured, and unstructured data. It can store data in its native format and process any variety of it, ignoring size limits. Data lakes are often used by organisations to hold large quantities of data without having to first process everything into structured tables.

Data lakes are often used in conjunction with a data warehouse which is a structured data storage often used for repeatable reporting. In DfT analysis, typically GCS will be used as the data lake, and BigQuery will be built “on top” as the data warehouse.

3.7 Data Management

There are several features available with GCS buckets for retention, versioning, auditability, and other management needs.

3.7.1 Retaining outdated versions of objects

There are two options here Object versioning and soft-delete. Soft-delete is recommended over object versioning.

3.7.1.1 Object versioning

Object versioning allows you to keep older versions of objects in your bucket even after they’ve been deleted or replaced. These older versions, called non-current objects, remain accessible until you choose to delete them. This feature is useful for tracking changes and recovering previous versions of files.

If you have billing or data retention requirements, you can set up Object Lifecycle Management rules to automatically keep a certain number of non-current versions. Object versioning offers some protection against accidental deletions of objects, but it does not protect against deletions at the bucket level or manual deletion of non-current versions.

How to enable OV in a bucket:

  1. Go to Google Cloud Console
  2. Select your project (use the project dropdown at the top of the page to choose the correct project)
  3. Open Cloud Storage (in the left-hand menu, go to Storage > Buckets)
  4. Click on the name of the bucket
  5. Go to the “Configuration” tab and scroll down to the Object versioning section
  6. Click “Edit” and toggle the switch to Enable object versioning
  7. Click “Save” to apply the changes

Once enabled, every time you overwrite or delete an object, a version is kept in the background. You can view and manage these versions through the Cloud Console.
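If you manage buckets from the command line instead, versioning can also be switched on with the gcloud CLI. A minimal sketch, assuming the gcloud SDK is installed and with `my-dft-bucket` standing in for your own bucket name:

```shell
# Enable object versioning on an existing bucket
gcloud storage buckets update gs://my-dft-bucket --versioning

# List objects including their non-current (older) versions
gcloud storage ls --all-versions gs://my-dft-bucket
```

The flag can be negated as `--no-versioning` to switch the feature off again.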

3.7.1.2 Soft-delete

Soft-delete helps protect your data by keeping deleted or overwritten objects and buckets in a recoverable state for a set period. This gives you time to restore anything that was removed by mistake or due to a malicious action. During this retention period, the objects or buckets cannot be permanently deleted. Soft-delete is turned on by default for all buckets, and unless your organisation has set a different policy, the default retention period is seven days.

Unlike Object versioning, which keeps older versions of individual objects, soft-delete provides protection at the bucket level, safeguarding both live and non-current objects from permanent loss. It also offers protection against deletions at the bucket level, which Object versioning does not. To restore soft-deleted objects, you need the right permissions – specifically, the Storage Admin role in IAM.

If you want a higher level of protection, you can enable both Object versioning and soft-delete on the same bucket. You can then set up Object Lifecycle Management rules to automatically delete non-current versions after a specific number of days, while soft-delete continues to provide a safety net against permanent loss.
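As a sketch of how you might inspect and extend the soft-delete window from the command line (the `--soft-delete-duration` flag requires a reasonably recent version of the gcloud CLI; the bucket name is illustrative):

```shell
# Show the bucket's current configuration, including its soft-delete policy
gcloud storage buckets describe gs://my-dft-bucket

# Extend the soft-delete retention window from the 7-day default to 30 days
gcloud storage buckets update gs://my-dft-bucket --soft-delete-duration=30d
```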

3.7.2 Lifecycle

Object Lifecycle Management is set at the bucket level and lets you automatically carry out certain actions when specific rules or conditions are met. For example, you can:

  • Move objects older than 365 days to a cheaper storage class, like Coldline
  • Delete objects that were created before a specific date, such as 1 January 2019
  • Keep only the three most recent versions of each object in a versioned bucket

This feature is particularly useful for managing storage costs by automatically downgrading storage classes for older data. It can also be used alongside retention rules to clean up objects once their required storage period has passed.

Read the documentation on Lifecycle for further details.
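The three example rules above can be expressed in a lifecycle configuration file. A sketch (the bucket and file names below are illustrative):

```json
{
  "rule": [
    {
      "action": { "type": "SetStorageClass", "storageClass": "COLDLINE" },
      "condition": { "age": 365 }
    },
    {
      "action": { "type": "Delete" },
      "condition": { "createdBefore": "2019-01-01" }
    },
    {
      "action": { "type": "Delete" },
      "condition": { "numNewerVersions": 3 }
    }
  ]
}
```

Saved as, say, `lifecycle.json`, this can be applied with `gcloud storage buckets update gs://my-dft-bucket --lifecycle-file=lifecycle.json`. Note that the `numNewerVersions` condition only has an effect on a bucket with object versioning enabled.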

3.7.3 Retention

Retention policies help make sure important data, such as files needed for audits or long-term records, aren’t deleted too soon. For example, if you’re required to keep a data return for several years, a retention policy will make sure it stays in place for that full period.

These policies override any delete permissions given to users, meaning no one can delete an object while it’s still within its set retention period – even if they normally have permission to do so.

3.7.3.1 Bucket lock

Bucket Lock allows you to set and enforce a retention policy on a Cloud Storage bucket. This policy defines how long objects in the bucket must be kept before they can be deleted or replaced. It applies to the entire bucket and covers all current and future objects.

Once a retention policy is in place, objects can only be deleted or modified once they are older than the retention period. You can also permanently lock the policy. Once locked, the retention period cannot be shortened or removed – only increased if needed.

If a policy is permanently locked, you will only be able to delete the bucket once all objects inside have met the retention period. This helps ensure critical data isn’t accidentally or maliciously removed.

Read the documentation on Bucket lock for further detail.
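As a sketch, a retention policy can be set and then locked using the gsutil tool (the bucket name is illustrative, and remember that locking is irreversible):

```shell
# Set a two-year retention policy on the bucket
gsutil retention set 2y gs://my-dft-bucket

# Permanently lock the policy – this cannot be undone
gsutil retention lock gs://my-dft-bucket
```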

3.7.3.2 Object retention lock

Object Retention Lock lets you define data retention requirements on a per-object basis, rather than for the whole bucket as with Bucket Lock.

Read the documentation on Object lock for further detail.

3.7.4 Monitoring

The monitoring options for GCS largely cover errors, data ingress/egress, and read/write counts. There are currently no options to see which individual users have accessed or edited specific objects.

Monitoring can be accessed on a per-bucket basis via Bucket > Bucket details > Observability. For all buckets in a project, navigate to Cloud Storage > Monitoring.

3.8 Metadata

Metadata is information about the data itself – for example, creation time, table schemas, or the data source.

3.8.1 Bucket metadata

If you are setting up your bucket as you go through this playbook, you will probably have come across your bucket metadata already. Metadata identifies properties of the bucket and specifies how the bucket should be handled when it’s accessed.

The following metadata is set at bucket creation and cannot be edited or removed:

  • Bucket name
  • Bucket location (e.g. europe-west2)
  • The GCP project to which the bucket belongs
  • Generation number (identifies the bucket generation, even if more than one bucket over time has shared the same name; this value never changes)
  • Metadata generation number (identifies the metadata state; starts at 1 at creation and increases each time the metadata is modified)

Editable metadata include items such as:

  • Default storage class
  • Lifecycle configuration
  • Whether Autoclass is enabled
  • Bucket labels
  • Other protection and management settings

You can view bucket metadata by navigating to Bucket > Configuration.
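The same metadata can also be retrieved from the command line, which is handy for scripting. A sketch with an illustrative bucket name:

```shell
# Print a bucket's metadata (location, storage class, versioning, labels, ...)
gcloud storage buckets describe gs://my-dft-bucket
```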

3.8.2 Object metadata

Object metadata is information about an object, stored in the form of key-value pairs, just like metadata for buckets. These pairs describe properties of the object, such as its name, creation date, content type, and any custom information you choose to add.

Read the documentation on object metadata for further detail.
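As an illustration, object metadata can be viewed, and custom key-value pairs added, from the command line (the object path and metadata key below are hypothetical):

```shell
# View an object's metadata: content type, size, creation time, custom fields
gcloud storage objects describe gs://my-dft-bucket/returns/data_2024.csv

# Attach a custom key-value pair to the object's metadata
gcloud storage objects update gs://my-dft-bucket/returns/data_2024.csv \
  --custom-metadata=source=annual-data-return
```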