Image by Editor | Ideogram
Â
Having practiced with basic querying examples in BigQuery, including filters and data sorting, this post moves on to explore more advanced querying clauses and operations to make the most of your data stored in BigQuery tables.
Similar to the previous post in this series, the illustrative examples below consider a set of tables loaded in a previous tutorial describing asian cuisines. For this tutorial, we will also consider a new table with a different schema, consisting of two colums. This table, called dish_ingredients
contains the ingredients associated to some of the chinese dishes.
The new table can be created by importing a CSV file containing the following comma-separated information:
dish_name,ingredient
Mapo Tofu,Tofu
Mapo Tofu,Ground Pork
Mapo Tofu,Sichuan Peppercorns
Kung Pao Chicken,Chicken
Kung Pao Chicken,Peanuts
Kung Pao Chicken,Dried Chili Peppers
Sweet and Sour Pork,Pork
Sweet and Sour Pork,Pineapple
Sweet and Sour Pork,Bell Peppers
Hot and Sour Soup,Tofu
Hot and Sour Soup,Mushrooms
Hot and Sour Soup,Vinegar
Peking Duck,Duck
Peking Duck,Hoisin Sauce
Peking Duck,Scallions
Â
JOIN Operations
Â
JOIN operators, as its name suggests, allow querying multiple tables jointly and retrieving specific data columns across them. A requisite for a JOIN operation to work is that there exists a link between the tables involved, in other word, there must be a common or closely related column between them.
By looking at the chinese_cuisine
and dish_ingredients
tables, there is one linkind column between them: name
. A JOIN operation uses this bridge between tables to retrieve information from them jointly. Let’s see it in action!
SELECT chinese_cuisine.name, chinese_cuisine.description, dish_ingredients.ingredient
FROM `asian_cuisines.chinese_cuisine` AS chinese_cuisine
JOIN `asian_cuisines.dish_ingredients` AS dish_ingredients
ON chinese_cuisine.name = dish_ingredients.dish_name;
Â
The above query retrieves rows that contain the dish name, description, and ingredient of Chinese dishes that have their ingredients registered in the dish_ingredients
table. As usual, these three columnar results are specified in the SELECT clause, where the <table_name>.<column_name>
syntax is utilized to avoid potential confusion between tables.
The FROM clause becomes somewhat special here, as it involves two tables, hence we need the keyword JOIN to bridge them together. The AS keyword creates an alias for tables to simplify syntax wherever they are referred to across the query (note that outside of the FROM clause, the preffix with the project or dataset name is no longer needed).
Last, since the two related columns between tables do not have the same name (name
and dish_name
), the ON clause at the end explicitly clarifies the linkage between them.
This is what the query results look like in BigQuery GUI:
Â
Â
The “default” JOIN operation only retrieves dishes whose ingredientes are registered in the ingredients table. Some chinese dishes do not have their ingredients in that table yet, so how can we list the same information as above, even for dishes with no listed ingredients? The LEFT JOIN is the solution: a LEFT JOIN returns all rows from the table on the left-hand side (the first of the two tables listed in the FROM-JOIN clause) even if some of them do not have any matches with rows in the right-hand side.
Let’s see it through an example:
SELECT chinese_cuisine.name, chinese_cuisine.description, dish_ingredients.ingredient
FROM `asian_cuisines.chinese_cuisine` AS chinese_cuisine
LEFT JOIN `asian_cuisines.dish_ingredients` AS dish_ingredients
ON chinese_cuisine.name = dish_ingredients.dish_name;
Â
Now all the dishes in the chinese_cuisine
table are retrieved, even if they have no related ingredients in the other table (I know, everyone missed the tasty mooncakes in the previous query):
Â
Â
Similar to LEFT JOIN, the RIGHT JOIN option can be used when we want to ensure all the rows in the second of “right-hand side” table will be retrieved, even if some of them aren’t linked to none of the rows in the first table. Finally FULL OUTER JOIN combines the logic of LEFT JOIN and RIGHT JOIN, to list all rows from both tables whether or not there is cross-table relations for some of them.
Â
Window Functions, Subqueries, and Nested Queries
Â
Window functions undertake calculations across a group of table rows that have some sort of relation to a target row. Here, we calculate the rank of each dish based on its preparation_time_mins
.
One example of such calculation could be setting a ranking of rows based on the values of a specified column. The RANK() window function can do this, as shown below.
SELECT
name,
preparation_time_mins,
RANK() OVER (ORDER BY preparation_time_mins ASC) AS prep_time_rank
FROM
`asian_cuisines.chinese_cuisine`;
Â
The application of the window function yields a new column (which we call prep_time_rank
) returned as part of the results in the SELECT clause. Notice we also nest an ORDER BY clause inside the window function to specify that the ranking of dishes will be made in ascending order of the preparation time.
Output:
name preparation_time_mins prep_time_rank
Egg Fried Rice 20 1
Steamed Dumplings 50 9
Sichuan Hotpot 90 12
Dim Sum 60 11
Ma Po Tofu 25 2
Spring Rolls 45 8
Hot and Sour Soup 40 7
Red Bean Buns 90 12
Kung Pao Chicken 30 3
Wonton Soup 50 9
Sweet and Sour Pork 35 5
Peking Duck 240 15
Chow Mein 30 3
Sesame Chicken 35 5
Mooncake 120 14
Â
There are many more window functions, for example to perform arithmetic or aggregation operators like SUM() and AVG().
Subqueries are, simply put, a query inside another query. Normally, the inner query becomes one of the filtering conditions in the WHERE clause.
For example, suppose we want to list information about chinese dishes whose preparation time is under the average preparation time for all chinese dishes. Since we do not have the actual average value of preparation times stored in our database, but it can be calculated using the aggregation operator AVG(), we can:
- Query the average preparation time over all chinese dishes.
- Encapsulate the previous query as a subquery that becomes part of the filtering condition to list dishes with preparation times below the average
SELECT
name,
preparation_time_mins
FROM
`asian_cuisines.chinese_cuisine`
WHERE
preparation_time_mins
Â
Output:
name preparation_time_mins
Egg Fried Rice 20
Spring Rolls 45
Ma Po Tofu 25
Kung Pao Chicken 30
Chow Mein 30
Sweet and Sour Pork 35
Sesame Chicken 35
Hot and Sour Soup 40
Wonton Soup 50
Steamed Dumplings 50
Dim Sum 60
Â
Nested queries are very similar to subqueries, and in certain cases both are interchangeable, but nested queries support more complex scenarios e.g. querying involving multiple tables, or nesting more than two queries one inside another.
This final example list the ingredients of the recipe taking the longest time to prepare, along with the recipe name itself.
SELECT
dish_ingredients.dish_name,
dish_ingredients.ingredient
FROM
`asian_cuisines.dish_ingredients` AS dish_ingredients
WHERE
dish_ingredients.dish_name = (
SELECT name
FROM `asian_cuisines.chinese_cuisine`
ORDER BY preparation_time_mins DESC
LIMIT 1
);
Â
The inner query is, again, part of the filtering condition in the WHERE clause of the outer query. The condition is used to seek the name of the chinese dish with the longest preparation time (note the LIMIT 1 at the end of the inner query to return one dish only, after sorting them with ORDER BY). Once that dish is fetched, the outer query is used to list its name and ingredients one by one.
dish_name ingredient
Peking Duck Duck
Peking Duck Hoisin Sauce
Peking Duck Scallions
Â
This ends our tutorial where we started to explore more sophisticated type of queries in Google BigQuery.
Â
Â
Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.