Image by Author | Canva
Â
SQL can sometimes seem like a run-of-the-mill language, but it’s much more capable than given credit.
In the article about essential SQL commands for data analysis, I can’t avoid talking about the commands that are, hm, absolutely essential. You can find more details about many commands I’ll mention in the SQL cheat sheet.
But add some pizzazz and mention some not-that-common commands that you’ll find useful in data analysis.
Â
Essential Commands
Â
Even the most basic data analysis in SQL cannot go without at least some of these commands, often even all of them.
They are so essential we can even call them 10 command(ment)s.
Â
Â
1. SELECT
Â
Without SELECT, you can’t do anything that resembles a data analysis, as you get data in the first place, so you have something to analyze. That’s the purpose of SELECT: retrieving data from one or more tables.
This example query retrieves specific columns (first_name
, last_name
, salary
) from the employees table. It demonstrates the basic data retrieval functionality of SELECT.
SELECT first_name, last_name, salary
FROM employees
WHERE department="Sales";
Â
2. WHERE
Â
The WHERE clause filters data, allowing you to narrow down the datasets.
This example filters products to show only those in category 1 with a unit price greater than 50. The WHERE clause allows multiple conditions using AND/OR operators.
SELECT product_name, unit_price
FROM products
WHERE category_id = 1 AND unit_price > 50;
Â
3. GROUP BY
Â
The GROUP BY clause organizes data into groups based on the same values from the columns. That way, you can more easily summarize large amounts of data.
This example groups employees by their department and calculates the count of employees and average salary per department. GROUP BY is essential for aggregating data at different levels.
SELECT department, COUNT(*) as employee_count, AVG(salary) as avg_salary
FROM employees
GROUP BY department;
Â
4. HAVING
Â
This is another clause for filtering data in SQL. HAVING filters data after the aggregation, thus allowing aggregate functions in the clause, unlike WHERE.
This example shows only categories that have more than 10 products. HAVING filters grouped results, working with aggregate functions unlike WHERE.
SELECT category, COUNT(*) as product_count
FROM products
GROUP BY category
HAVING COUNT(*) > 10;
Â
5. JOIN
Â
The JOIN clause allows you to combine data from two or more tables. The database logic requires databases to be normalized, which means you typically won’t find all the data you need in one table. JOIN is important if you don’t want to be imprisoned within the limits of one table, producing the so-called ‘Alcatraz data analyses’. (Just made that up.)
This example combines order information with customer details by matching customer IDs. JOIN connects related data from multiple tables.
SELECT o.order_id, c.customer_name, o.order_date
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;
Â
6. ORDER BY
Â
If you don’t want to depend on the whims (actually, logic, but whims sound more dramatic) of SQL of how it will output your data, use ORDER BY. It’s a clause for sorting the result set by one or more columns. Sorting can be ascending (alphabetically for text data; from the lowest to the highest number for numerical data; from the oldest to the newest date) or descending (reverse-alphabetically for text data; from the highest to the lowest number for numerical data; from the newest to the oldest date).
The following example sorts products by price in descending order and then alphabetically by name. ORDER BY can sort by multiple columns with different directions.
SELECT product_name, unit_price
FROM products
ORDER BY unit_price DESC, product_name ASC;
Â
7. LIMITÂ
Â
The LIMIT clause is useful when you want to restrict the number of rows returned by a query.
This example shows the top 5 products by units sold. LIMIT restricts the output to a specific number of rows, useful for top-N analysis.
SELECT product_name, units_sold
FROM sales
ORDER BY units_sold DESC
LIMIT 5;
Â
8. CASE
Â
The CASE statement is an SQL implementation of the if-else logic. With it, you can group data dynamically (based on the condition), label it, and derive new columns.
This example categorizes products into price ranges using conditional logic. CASE allows for dynamic classification of data based on conditions.
SELECT product_name,
CASE
WHEN unit_price
Â
9. DISTINCT
Â
In data analysis, removing duplicates is one of the common problems, and DISTINCT solves it elegantly: it retrieves only unique values.
The following example shows unique combinations of category and supplier. DISTINCT eliminates duplicate rows from the result set.
SELECT DISTINCT category, supplier_id
FROM products
ORDER BY category;
Â
10. UNION
Â
The UNION and UNION ALL operators combine the results of two or more SELECT statements into one result. The difference between them is that UNION removes duplicates (rows that appear in all the queries’ outputs), while UNION ALL includes all rows.
This example combines active and discontinued product lists into a single result set. UNION merges results from multiple SELECT statements while removing duplicates.
SELECT product_id, product_name, 'Active' as status
FROM current_products
UNION
SELECT product_id, product_name, 'Discontinued'
FROM discontinued_products;
Â
Not-So-Essential (But Nevertheless Very Cool) Commands
Â
Now, let’s mention some other commands that might be often overlooked but can add more versatility to data analysis in SQL.
Â
Â
- WITH (Common Table Expressions or CTEs): For defining a temporary result set (similar to a temporary table or named subquery) that can be reused in another query and recursion (traversing graphs and analyzing hierarchical data)
- JSON_EXTRACT: Extracts specific data from JSON strings, allowing you to work with semi-structured data.
- Window Functions: For performing calculations across a subset of rows related to the current row (used for rankings, running totals, moving averages, etc.)
- PIVOT(): For creating pivot tables.
- COALESCE(): Ensures that NULL values are replaced with a default value.
- STRING_AGG(): Combining strings into a comma-separated list.
- ARRAY_AGG(): Grouping data into an array.
- EXCEPT: Returns rows from the first query that are not present in the second query.
Â
Conclusion
Â
The commands we listed will be commonly used in most of your data analysis. They perform unavoidable analysis tasks, such as fetching, filtering, grouping, and aggregating data.
Combine them with those eight bonus commands, and you’ll easily get a flexible data analysis tool in SQL.
Â
Â
Nate Rosidi is a data scientist and in product strategy. He’s also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.