Image by Author | Created on Canva
Are you a developer familiar with SQL and Python? If so, you may want to start using DuckDB—an in-process OLAP database—for data analytics.
SQL is the language for querying databases and is the most important in your data toolbox. So when you switch to Python, you’re probably looking at pandas—reading in data from various sources into a dataframe and analyzing it.
But wouldn’t it be nice to query pandas dataframes as well as data sources such as CSV and Parquet files using SQL. DuckDB lets you do just that and much more. In this tutorial, we’ll learn how to use DuckDB in Python to analyze data. Let’s get started!
Setting Up the Environment
To get started, create and activate a virtual environment:
$ python3 -m venv v1
$ source v1/bin/activate
Next install duckdb:
Because we also generate sample data to work with, we’ll also need NumPy and Pandas:
$ pip3 install numpy pandas
Querying Data with DuckDB
With the quick installation out of the way we can move on to do some data analysis.
Note: It’s common to use connections when interacting with databases. You can use
duckdb.connect()
to work with both in-memory databases and persistent storage.
- Using
duckdb.connect()
to connect to an in-memory database that exists only during the session. This is suitable for quick analysis, especially when you don’t need to store the results long-term. - To persist data between sessions and queries, pass a file path to the
connect()
function like so:duckdb.connect('my_database.db')
.
But we’ll query CSV files and don’t quite need a connection object. So this was just a note to give you an idea when you’re querying databases.
Generating Sample CSV Files
▶️ You can find the code for this tutorial on GitHub.
We’ll create a mock sales dataset, a couple of csv files, that include product details, prices, quantities sold, and the regions in which the sales occurred. Running generate_csv.py in your project folder to generate two CSV files: sales_data.csv and product_details.csv.
When Working with CSV files in DuckDB, you can read the file into a relation: duckdb.read_csv(‘your_file.csv’)
and then query it. Or you can work directly with files and query them like so:
import duckdb
duckdb.sql("SELECT * FROM 'sales_data.csv' LIMIT 5").df()
You can save the results of the query using df()
as shown in the example.
Let’s now run some (actually helpful) SQL queries to analyze the data in the CSV files.
Example Query 1: Calculate Total Sales by Region
To understand which region generated the most revenue, we can calculate the total sales per region. You can calculate total sales by multiplying the price of each product by the quantity sold and summing it up for each region.
# Calculate total sales (Price * Quantity_Sold) per region
query = """
SELECT Region, SUM(Price * Quantity_Sold) as Total_Sales
FROM 'sales_data.csv'
GROUP BY Region
ORDER BY Total_Sales DESC
"""
total_sales = duckdb.sql(query).df()
print("Total sales per region:")
print(total_sales)
This query outputs:
Total sales per region:
Region Total_Sales
0 East 454590.49
1 South 426352.72
2 West 236804.52
3 North 161048.07
Example Query 2: Find the Top 5 Best-Selling Products
Next, we want to identify the top 5 best-selling products by quantity sold. This can give us insight into which products are performing the best across all regions.
# Find the top 5 best-selling products by quantity
query = """
SELECT Product_Name, SUM(Quantity_Sold) as Total_Quantity
FROM 'sales_data.csv'
GROUP BY Product_Name
ORDER BY Total_Quantity DESC
LIMIT 5
"""
top_products = duckdb.sql(query).df()
print("Top 5 best-selling products:")
print(top_products)
This gives the top 5 products with the highest sales:
Top 5 best-selling products:
Product_Name Total_Quantity
0 Product_42 99.0
1 Product_97 98.0
2 Product_90 96.0
3 Product_27 94.0
4 Product_54 94.0
Example Query 3: Calculate Average Price by Region
We can also calculate the average price of products sold in each region to identify any price differences between regions.
# Calculate the average price of products by region
query = """
SELECT Region, AVG(Price) as Average_Price
FROM 'sales_data.csv'
GROUP BY Region
"""
avg_price_region = duckdb.sql(query).df()
print("Average price per region:")
print(avg_price_region)
This query calculates the average price for products sold in each region and returns the results grouped by region:
Average price per region:
Region Average_Price
0 North 263.119167
1 East 288.035625
2 West 200.139000
3 South 254.894722
Example Query 4: Total Quantity Sold by Region
To further analyze the data, we can calculate the total quantity of products sold in each region. This helps us see which regions have the most sales activity in terms of volume.
# Calculate total quantity sold by region
query = """
SELECT Region, SUM(Quantity_Sold) as Total_Quantity
FROM 'sales_data.csv'
GROUP BY Region
ORDER BY Total_Quantity DESC
"""
total_quantity_region = duckdb.sql(query).df()
print("Total quantity sold per region:")
print(total_quantity_region)
This query calculates the total quantity sold per region and sorts the result in descending order, showing which region sold the most products:
Total quantity sold per region:
Region Total_Quantity
0 South 1714.0
1 East 1577.0
2 West 1023.0
3 North 588.0
Example Query 4: Joining CSVs
DuckDB offers several advanced features that make it versatile for data analysis. For example, you can easily join multiple CSV files for more complex queries, or query larger datasets stored on disk without loading them entirely into memory.
This SQL JOIN query combines two CSV files, sales_data.csv and product_details.csv, by matching rows based on a common column: Product_ID.
query = """
SELECT s.Product_Name, s.Region, s.Price, p.Manufacturer
FROM 'sales_data.csv' s
JOIN 'product_details.csv' p
ON s.Product_ID = p.Product_ID
"""
joined_data = duckdb.sql(query).df()
print(joined_data.head())
This should output:
Product_Name Region Price Manufacturer
0 Product_1 North 283.08 Manufacturer_4
1 Product_2 East 325.94 Manufacturer_3
2 Product_3 West 39.54 Manufacturer_2
3 Product_4 South 248.82 Manufacturer_4
4 Product_5 East 453.62 Manufacturer_5
Wrapping Up
In this tutorial, we looked at how to use DuckDB for data analysis with Python.
We worked with CSV files. But you can work with parquet and JSON files and relational databases the same way. So yeah, DuckDB is a useful tool for analyzing large datasets in Python and is quite a useful addition to your Python data analysis toolkit.
I suggest using DuckDB in your next data analysis project. Happy coding!
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.