SQL CTE vs. Subquery: This Debate Ain’t Over Yet!



Image by Author | Canva

 

Working with SQL will eventually get you to the point where you’ll ask: Are Common Table Expressions (CTEs) better than subqueries? What’s their point? It’s like debating tabs vs. spaces, light mode vs. dark mode, or espresso vs. expresso.

Every time you say “expresso”, one Italian falls off Vespa somewhere in Rome. Every time you use CTE instead of a subquery happens… what, exactly?

Maybe it’s not that obvious, but SQL CTE and a subquery are not the same, and I’m not talking only about syntax here.

Before we see what these differences are and pronounce a winner, let’s clearly define each.

 

CTE & Subquery Definitions

 
CTEs are introduced using WITH. They are temporary results that can be referenced multiple times in another query. Think of them as temporary tables that are available only when the query is run.

Subqueries are queries (SELECT statements, to be precise) within another query. They can be used in the following statements/clauses:

  • SELECT
  • INSERT
  • UPDATE
  • DELETE
  • WHERE
  • HAVING
  • FROM
  • SET

It’s true they both have the same fundamental purpose: helping you break down queries. But, they do it differently, as they do some other things.

 

The Great Showdown: CTEs vs Subqueries

 
We will compare these two on four aspects of SQL coding. They are given in the table below, together with the overall and each round’s winner.

 
The Great Showdown CTEs vs Subqueries
 

Now, let’s go into each round’s details, showing you examples and why CTE won in three out of four.

 

Round 1: Readability

 
If you rely on subqueries to construct a complex query, it can turn into a hardly readable mess. Having multiple subqueries gets out of hand quickly, making debugging an utter nightmare.

Take, for example, this interview question by Meta. It asks you to find countries that have risen in the rankings based on the number of comments between December 2019 and January 2020. When writing a query, you should avoid gaps between country ranks.

Solving the question using subqueries will result shown below. Even I don’t know where to start explaining such a solution. And I wrote the query.

SELECT country
FROM (
    SELECT d.country,
           d.dec_rank,
           j.jan_rank,
           d.total_comments AS dec_comments,
           j.total_comments AS jan_comments
    FROM (
        SELECT country,
               total_comments,
               DENSE_RANK() OVER (ORDER BY total_comments DESC) AS dec_rank
        FROM (
            SELECT u.country,
                   SUM(c.number_of_comments) AS total_comments
            FROM fb_comments_count AS c
            JOIN fb_active_users AS u ON c.user_id = u.user_id
            WHERE c.created_at >= '2019-12-01'
              AND c.created_at < '2020-01-01' GROUP BY u.country ) AS december ) AS d JOIN ( SELECT country, total_comments, DENSE_RANK() OVER (ORDER BY total_comments DESC) AS jan_rank FROM ( SELECT u.country, SUM(c.number_of_comments) AS total_comments FROM fb_comments_count AS c JOIN fb_active_users AS u ON c.user_id = u.user_id WHERE c.created_at >= '2020-01-01'
              AND c.created_at < '2020-02-01' GROUP BY u.country ) AS january ) AS j ON d.country = j.country ) AS rank_compare WHERE dec_rank > jan_rank
ORDER BY dec_rank;

 

Now, compare this to the solution that uses CTEs instead of subqueries. This approach doesn’t necessarily result in shorter code. However, using CTEs makes the code easily understandable, as it breaks it down into logical blocks in the order the calculations are performed. The naming of each CTE also helps.

It’s pretty clear that the first CTE calculates the total number of comments for Dec 2019 and Jan 2020 for each country.

The next two CTEs extract data for December and January, respectively. That same data is ranked using DENSE_RANK() in the two subsequent CTEs.

In the final CTE, the December and January ranks are compared to identify improvements.

Finally, a SELECT statement fetches only countries where the January 2020 rank is higher than the December 2019 rank by referencing the last CTE in FROM.

WITH monthly_comments AS (
    SELECT u.country,
           DATE_TRUNC('month', c.created_at)::DATE AS month_start,
           SUM(c.number_of_comments) AS total_comments
    FROM fb_comments_count AS c
    JOIN fb_active_users AS u ON c.user_id = u.user_id
    WHERE c.created_at >= '2019-12-01'
    AND c.created_at < '2020-02-01' GROUP BY u.country, DATE_TRUNC('month', c.created_at)::DATE ), december AS ( SELECT country, total_comments FROM monthly_comments WHERE month_start="2019-12-01" ), january AS ( SELECT country, total_comments FROM monthly_comments WHERE month_start="2020-01-01" ), december_rank AS ( SELECT country, total_comments, DENSE_RANK() OVER ( ORDER BY total_comments DESC) AS dec_rank FROM december ), january_rank AS ( SELECT country, total_comments, DENSE_RANK() OVER (ORDER BY total_comments DESC) AS jan_rank FROM january ), rank_compare AS ( SELECT d.country, d.dec_rank, j.jan_rank, d.total_comments AS dec_comments, j.total_comments AS jan_comments FROM december_rank d JOIN january_rank j USING (country) ) SELECT country FROM rank_compare WHERE dec_rank > jan_rank
ORDER BY dec_rank;

 

Now, go back to the subqueries-only solution and try to find all the steps I described above. Yup, a bit trickier.

Verdict: CTEs win

 

Round 2: Performance

 
CTEs and subqueries usually don’t differ in regard to performance. However, sometimes, there can be scenarios where subqueries perform worse than CTEs.

This stems from the subqueries logic: they are executed every time they are referenced in the main query. If a subquery is used multiple times (e.g., in multiple WHERE or JOIN clauses), it gets evaluated each time it’s referenced. The possible outcome is increased execution time due to redundant calculations.

CTE, on the other hand, is executed only once within a query. If the CTE is referenced multiple times within the same query, it will reuse the CTE output without rerunning it. This is possible because a database engine stores the CTE output as a temporary result, also known as CTE materialization.

However, it’s possible that you’ll experience this benefit only in some database engines. PostgreSQL and SQL Server handle CTE materialization well. MySQL doesn’t always optimize CTEs efficiently, and Oracle sometimes favors inline subqueries.

This might not matter with smaller datasets. But, for large queries and datasets, CTEs might perform significantly better than subqueries. Or not. All depending on the database engine you use.

Verdict: No winner

 

Round 3: Reusability & Maintainability

 
Ever caught yourself copy-pasting the same subquery multiple times in the same query? That’s a red flag, signaling you probably should’ve used CTEs instead.

CTEs require defining the logic once; you can reuse it as many times as you want. This makes code maintenance much easier, as you only update one CTE, no matter how many times its result is used. On the other hand, having several of the same subqueries in one main query means you have to make the same update in each subquery.

Here’s an example based on the interview question by Lyft. Let’s say I understood the task is to find the city with the most profitable month in 2019. I would write this code using CTEs.

The CTE calculates the profit for each city and month. The SELECT statement then references that calculation to find the highest profit and output, the city, month, and the profit level.

WITH cte AS (
    SELECT city,
           EXTRACT(MONTH FROM order_date) AS p_month,
           SUM(order_fare) AS profit
   FROM lyft_orders o
   JOIN lyft_payment_details p ON o.order_id = p.order_id
   WHERE EXTRACT(YEAR FROM order_date) = '2019'
   GROUP BY o.city, p_month
)

SELECT city,
       p_month,
       profit
FROM cte
WHERE profit =
    (SELECT MAX(profit)
     FROM cte);

 

Here’s the result.

 
Advantage of cte over subquery reusability and maintainability
 

Five minutes after I sent this report, my boss stormed in, insisting it couldn’t be true that Paris had the most profitable month in 2021. Wait, what!? I thought she said 2019, not 2021. OK, no problem; my bad. Here, I just replaced ‘2019’ with ‘2021’, and the query was updated.

WITH cte AS (
    SELECT city,
           EXTRACT(MONTH FROM order_date) AS p_month,
           SUM(order_fare) AS profit
   FROM lyft_orders o
   JOIN lyft_payment_details p ON o.order_id = p.order_id
   WHERE EXTRACT(YEAR FROM order_date) = '2021'
   GROUP BY o.city, p_month
)

SELECT city,
       p_month,
       profit
FROM cte
WHERE profit =
    (SELECT MAX(profit)
     FROM cte);

 

Here’s the correct output.

 
Advantage of cte over subquery is reusability and maintainability
 

Now, the solution with subqueries is unnecessarily long: there’s the same subquery written twice. This redundancy is the first drawback of the subqueries. Second, if I made the same mistake as above, I’d have to change ‘2019’ to ‘2021’ twice.

SELECT city, p_month, profit
FROM (
    SELECT city, 
           EXTRACT(MONTH FROM order_date) AS p_month, 
           SUM(order_fare) AS profit
    FROM lyft_orders o
    JOIN lyft_payment_details p ON o.order_id = p.order_id
    WHERE EXTRACT(YEAR FROM order_date) = '2021'
    GROUP BY o.city, EXTRACT(MONTH FROM order_date)
) AS aggregated_data
WHERE profit = (
    SELECT MAX(profit)
    FROM (
        SELECT city, 
               EXTRACT(MONTH FROM order_date) AS p_month, 
               SUM(order_fare) AS profit
        FROM lyft_orders o
        JOIN lyft_payment_details p ON o.order_id = p.order_id
        WHERE EXTRACT(YEAR FROM order_date) = '2021'
        GROUP BY o.city, EXTRACT(MONTH FROM order_date)
    ) AS max_profit_data
);

 

Changing the year twice instead of once doesn’t seem such a big deal. However, take into account that this is just an example. Imagine much more complex code, where you would first have to discover what each subquery does. Only then can you change the logic, and who knows how many times you’d have to do it.

Also, the change of logic can be much more complex than in our example, so there’s an increasing possibility that you will make mistakes in subsequent subqueries, even if you’re not retyping but copying and pasting parts of subqueries.

Verdict: CTEs win

 

Round 4: Recursion

 
One significant advantage of CTEs over subqueries is that they can handle recursion. With recursive queries, you can query hierarchical (e.g., organizational structure, family tree) and graph-based data (e.g., finding the shortest or the longest path from one city to another).

Actually, true recursion in SQL can’t be achieved any other way than with CTEs. (OK, there are some alternative ways that utilize self-joins or stored procedures and loops. Those approaches have their problems: they don’t work if the hierarchy depth is unknown, aren’t pure SQL solutions, are shockingly difficult to maintain, or are much slower than CTEs. So, in SQL, recursion equals CTEs.)

As an example, let’s use the table employees (shown below) to find all employees whose boss is an employee with ID = 4.

 
The advantage of CTE over subquery is recursion
 

To make a CTE recursive in PostgreSQL, add the keyword RECURSIVE after WITH. In the CTE’s first SELECT, the query fetches info about the employee whose ID is 4. This sets the base, i.e., the boss whose team members we seek.

This is UNIONed with another SELECT that joins the employees table with the CTE where the manager’s ID equals the team member’s ID. This means that CTE references itself (this is what recursion is) and will keep on doing so until no matches exist.

In the outside SELECT, we simply choose all data from the CTE where the ID is not 4; we don’t want to show the boss data, only the employees that report to her.

WITH RECURSIVE team_members AS (
        SELECT  id,
                first_name,
                last_name,
                manager_id
        FROM employees
        WHERE id = 4
 
    UNION
 
        SELECT  e.id,
                e.first_name,
                e.last_name,
                e.manager_id
        FROM employees e
JOIN team_members tm
ON e.manager_id = tm.id
)
         
SELECT *
FROM team_members
WHERE id != 4;

 

Here’s the output.

 
The advantage of CTE over subquery is handling recursion
 

Verdict: CTEs win

 

Conclusion

 
The CTEs win in three out of four showdowns. The only case where they   only maybe   are not better than subqueries is performance. But this is not always the case, so if this is your concern, test both and see what works best for your data and query.

CTE is a clear winner in all other cases — readability, reusability & maintenance, and recursion.

People often stick with subqueries, as they typically learn them first when writing code that requires more complex data subsetting and performing calculations within calculations.

My advice is to transition towards CTE slowly: syntactically, they are almost the same as subqueries (it’s the same SELECT you write as a subquery), but with benefits I’ve demonstrated in this article.
 
 

Nate Rosidi is a data scientist and in product strategy. He’s also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.



Recent Articles

Related Stories

Leave A Reply

Please enter your comment!
Please enter your name here