Chapter 2 Basic Queries

In this chapter, we will explore the basics of SQL queries and syntax in BigQuery. We will cover the SELECT statement, filtering data using the WHERE clause, sorting data using the ORDER BY clause, limiting the number of results using the LIMIT clause and aggregating data using the GROUP BY clause.

2.1 The SELECT statement

The SELECT statement is used to retrieve data from one or more tables in a database. It has the following basic syntax:

SELECT column1, column2, ...
FROM table_name;

In this syntax, column1, column2, … are the columns you want to retrieve data from, and table_name is the name of the table where the data is stored. For example, the following query retrieves the status, name columns from a table named tst_storms:

SELECT status, name
FROM `dft-stats-gcp-showcase.TS_Training.tst_storms`;

SELECT * allows you to select all columns from a table. For example, the following query retrieves the all columns from a table named tst_storms:

SELECT status, name
FROM `dft-stats-gcp-showcase.TS_Training.tst_storms`;

BigQuery stores table data in columnar format, meaning it stores each column separately. Column-oriented databases are particularly efficient at scanning individual columns over an entire dataset. Therefore, it is generally discouraged to use * as you are charged per column of data you retrieve from BigQuery, so it is best to explicitly state which columns you wish to select. Column-oriented databases are optimized for analytic workloads that aggregate data over a very large number of records. Often, an analytic query only needs to read a few columns from a table. For example, if you want to compute the sum of a column over millions of rows, BigQuery can read that column data without reading every field of every row.

SELECT DISTINCT is a variation of the SELECT statement that returns unique values of the specified columns. This can be useful when you want to get a list of all unique values in a column, or when you want to eliminate duplicate rows from your query results.

SELECT DISTINCT status, name
FROM `dft-stats-gcp-showcase.TS_Training.tst_storms`;

2.1.1 Exercise

05:00

  1. Using what we learned above, can you select all the columns from the starwars_characters table in the starwars_dataset dataset.
  2. Now can you only select name, species and homeworld from the same table?
  3. How many unique species are there in the table?

2.2 Filtering data using the WHERE clause

The WHERE clause is used to filter data based on specific conditions. It has the following basic syntax:

SELECT column1, column2, ...
FROM table_name
WHERE condition;

In this syntax, condition is the condition that must be met for a row to be included in the results. For example, the following query retrieves all storms where wind was greater than 70:

SELECT wind, status, name, hour, month, year
FROM `dft-stats-gcp-showcase.TS_Training.tst_storms`
WHERE wind > 70;

One specific case worth mentioning is retrieving null values. The following query retrieves all storms where hurricane force diameter is not null.

SELECT hurricane_force_diameter, wind, status, name, hour, month, year
FROM `dft-stats-gcp-showcase.TS_Training.tst_storms`
WHERE hurricane_force_diameter IS NOT NULL

And if you wanted to find storms that did have a null hurricane force diamater, you would remove the NOT.

2.3 Sorting data using the ORDER BY clause

The ORDER BY clause is used to sort data in ascending or descending order based on one or more columns. It has the following basic syntax:

SELECT column1, column2, ...
FROM table_name
ORDER BY column_name [ASC|DESC];

In this syntax, column_name is the name of the column used to sort the data, and ASC or DESC specifies the sorting order. For example, the following query retrieves all storms sorted by wind in descending order:

SELECT wind, status, name, hour, month, year
FROM `dft-stats-gcp-showcase.TS_Training.tst_storms`
ORDER BY wind DESC;

2.4 Limiting the number of results using the LIMIT clause

The LIMIT clause is used to limit the number of results returned by a query. It has the following basic syntax:

SELECT column1, column2, ...
FROM table_name
LIMIT number;

In this syntax, number is the maximum number of rows to be returned by the query. For example, the following query retrieves the top 10 recorded wind speeds:

SELECT wind, status, name, hour, month, year
FROM `dft-stats-gcp-showcase.TS_Training.tst_storms`
ORDER BY wind DESC
LIMIT 10;

2.4.1 Exercise

05:00

  1. Using what we learned above, can you filter the starwars_characters to only be droids?
  2. Now can you order these characters by height? Who is the tallest droid?
  3. How does this change if you limit the results to only 3? Who are the top 3 droids?

2.5 Aggregating Data

You can use function such as COUNT, SUM, AVG, MIN, and MAX functions to aggregate data. The syntax for the aggregate functions is as follows:

SELECT aggregate_function(column)
FROM table_name

In this syntax, column is the column used to aggregate data.

For example, the following query calculates the average wind for all storms in my dataset:

SELECT AVG(wind)
FROM `dft-stats-gcp-showcase.TS_Training.tst_storms`;

2.6 Grouping Data

You can use the GROUP BY clause to group data based on one or more columns. The syntax for the GROUP BY clause is as follows:

SELECT column1, column2, ...
FROM table_name
GROUP BY column1, column2, ...;

In this syntax, column1, column2, … are the columns used to group data. This works extremely well with aggregation functions.

For example, the following query gives the average wind speed by storm type.:

SELECT status, AVG(wind)
FROM `dft-stats-gcp-showcase.TS_Training.tst_storms`
GROUP BY status;

In this chapter, we covered the basics of SQL queries and syntax in BigQuery. In the next chapter, we will explore more advanced queries, including creating tables, joining them, and inserting data.

2.6.1 Exercise

05:00

  1. Using what we learned above, what is the average starwars character height?
  2. What is the average height by species?
  3. What is the max height of each species?