Tips for Handling Large Datasets in Python




 

Handling large datasets in Python can be a real challenge, especially when you’re used to working with smaller datasets that your computer can easily handle. But fear not! Python is packed with tools and tricks to help you efficiently process and analyze big data.

In this tutorial, I’ll walk you through essential tips to process large datasets like a pro by focusing on:

  • Core Python features: generators and multiprocessing
  • Chunked processing of large datasets in pandas
  • Using Dask for parallel computing
  • Using PySpark for distributed computing

Let’s get started!

 

Note: This is a short guide covering useful tips; it doesn’t focus on a specific dataset. You can follow along with datasets like the Bank Marketing dataset or the NYC TLC trip record data. We’ll use generic filenames like large_dataset.csv in the code examples.

 

1. Use Generators Instead of Lists

 
When processing large datasets, it’s often tempting to load everything into a list. But this usually takes up a huge chunk of memory. Generators, on the other hand, generate data only when needed—on the fly—making the task more memory efficient.

Say you have a large log file—server logs or user activity data—and you need to analyze it line by line. Using a generator, you can read and process the contents—one line at a time—without loading the whole file into memory.

def read_large_file(file_name):
    with open(file_name, 'r') as file:
        for line in file:
            yield line  # Yield one line at a time instead of building a list in memory

 

Calling the generator function returns a generator object, not a list. You can loop over this generator and process each line as it’s read.
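
For example, here’s a minimal sketch that counts error entries in a hypothetical server_log.txt (the filename and the 'ERROR' marker are just placeholders):

# Count error lines without loading the whole log into memory
error_count = 0
for line in read_large_file('server_log.txt'):
    if 'ERROR' in line:  # Placeholder marker for error entries
        error_count += 1

print(f"Errors found: {error_count}")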

 

2. Go Parallel with Multiprocessing

 
Processing big datasets can be slow if you’re limited to a single processor core. Python’s multiprocessing module lets you split up tasks across multiple CPU cores. This speeds things up significantly.

Suppose you have a massive dataset with house prices and want to remove outliers and normalize the data. With multiprocessing, you can handle this task in parallel chunks to cut down on processing time.

import pandas as pd
import numpy as np
from multiprocessing import Pool

# Function to clean and normalize data for each chunk
def clean_and_normalize(df_chunk):
    # Remove top 5% as outliers in the 'price' column
    df_chunk = df_chunk[df_chunk['price'] < df_chunk['price'].quantile(0.95)].copy()
    # Min-max normalize the 'price' column
    df_chunk['price'] = (df_chunk['price'] - df_chunk['price'].min()) / (df_chunk['price'].max() - df_chunk['price'].min())
    return df_chunk

if __name__ == '__main__':
    chunks = np.array_split(pd.read_csv('large_dataset.csv'), 4)  # Read once, split into 4 chunks
    with Pool(4) as pool:  # Four worker processes clean chunks in parallel
        result = pd.concat(pool.map(clean_and_normalize, chunks))

 

In each chunk, we drop the top 5% of prices as outliers and then min-max normalize the price column. With Pool(4), four worker processes handle separate chunks at the same time, cutting down the total cleaning and normalization time.
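
If the full dataset doesn’t fit in memory in the first place, one variation worth sketching (not part of the original example) streams chunks from disk straight to the workers by combining pandas’ chunked reading with Pool.imap. It reuses the clean_and_normalize function defined above:

import pandas as pd
from multiprocessing import Pool

# clean_and_normalize is the function defined in the example above
if __name__ == '__main__':
    reader = pd.read_csv('large_dataset.csv', chunksize=100000)  # Yields one DataFrame chunk at a time
    with Pool(4) as pool:
        # imap sends chunks to workers as the reader produces them
        result = pd.concat(pool.imap(clean_and_normalize, reader))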

 

3. Use Pandas’ chunksize for Chunked Processing

 
Pandas is great for data analysis and manipulation, but loading a massive dataset into a DataFrame can put a strain on your memory. Luckily, the chunksize parameter in pd.read_csv() lets you process large datasets in manageable pieces, or “chunks.”

Let’s say you’re working with sales data and want to calculate the total sales. Using chunksize, you can read and sum up sales values chunk by chunk.

import pandas as pd

total_sales = 0
chunk_size = 100000  # Define chunk size to read data in batches

# Load data in chunks and process each chunk
for chunk in pd.read_csv('large_sales_data.csv', chunksize=chunk_size):
    total_sales += chunk['sales'].sum()  # Summing up sales column in each chunk

print(f"Total Sales: total_sales")

 

Each chunk is loaded and processed independently, so only part of the file sits in memory at any time. We sum the sales column in each chunk, gradually building up the total, which lets you work with files far larger than the available memory.
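
The same chunked pattern works for grouped aggregations too. Here’s a rough sketch that accumulates per-category sales totals across chunks; it assumes the file also has a category column, which isn’t mentioned in the original example:

import pandas as pd

category_totals = None

for chunk in pd.read_csv('large_sales_data.csv', chunksize=100000):
    # Partial per-category sums for this chunk
    partial = chunk.groupby('category')['sales'].sum()
    # Fold the partial sums into the running totals
    category_totals = partial if category_totals is None else category_totals.add(partial, fill_value=0)

print(category_totals)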

 

4. Use Dask for Parallel Computing

 
If you’re comfortable with pandas but need more power, Dask offers a convenient next step into parallel computing. Dask DataFrames use pandas under the hood, so the API will feel immediately familiar.

You can also use Dask in conjunction with libraries like Scikit-learn and XGBoost to train machine learning models on large datasets. You can install Dask with pip.

Let’s say you want to calculate the average sales per category in a huge dataset. Here’s an example:

import dask.dataframe as dd

# Load data into a Dask DataFrame
df = dd.read_csv('large_sales_data.csv')

# Group by 'category' and calculate mean sales (executed in parallel)
mean_sales = df.groupby('category')['sales'].mean().compute()

print(mean_sales)

 

The syntax looks just like pandas, but the work is split across partitions and only runs when you call compute(), so Dask can scale to datasets that don’t fit in memory.
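
Because evaluation is lazy, you can chain several pandas-style steps and trigger them all at once with compute(). A quick sketch (the blocksize value, threshold, and column names are just illustrative):

import dask.dataframe as dd

# Read the CSV into roughly 64 MB partitions
df = dd.read_csv('large_sales_data.csv', blocksize='64MB')

# Chain pandas-style operations; nothing executes yet
high_value = df[df['sales'] > 1000]
summary = high_value.groupby('category')['sales'].agg(['mean', 'count'])

# compute() triggers the whole chain in parallel
print(summary.compute())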

 

5. Consider PySpark for Big Data Processing

 
For truly large datasets (think hundreds of gigabytes or more), consider PySpark. It’s designed to process data distributed across a cluster of machines, which makes it well suited for massive data processing.

Say you’re working with a dataset containing millions of movie ratings and would like to find the average rating for each genre.

from pyspark.sql import SparkSession

# Start a Spark session
spark = SparkSession.builder.appName("MovieRatings").getOrCreate()

# Load the dataset into a Spark DataFrame
df = spark.read.csv('movie_ratings.csv', header=True, inferSchema=True)

# Perform transformations (e.g., group by genre and calculate average rating)
df_grouped = df.groupBy('genre').mean('rating')

# Show the result
df_grouped.show()

 

PySpark distributes both data and computation across multiple machines or cores.
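
One small note: mean('rating') names the result column avg(rating). If you prefer an explicit name and sorted output, a minor variation (not part of the original example) uses pyspark.sql.functions.avg with an alias:

from pyspark.sql import functions as F

# Same aggregation, with an explicit column name and sorted output
df_grouped = (
    df.groupBy('genre')
      .agg(F.avg('rating').alias('avg_rating'))
      .orderBy('avg_rating', ascending=False)
)
df_grouped.show()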

 

Wrapping Up

 
Handling large datasets in Python is much more manageable with the right tools. Here’s a quick recap:

  • Generators for memory-efficient, line-by-line processing of data
  • Multiprocessing to split work across CPU cores for faster processing
  • chunksize in pandas to load and aggregate data in manageable chunks
  • Dask, a pandas-like library for parallel computing on large datasets
  • PySpark for distributed processing of truly big data

Happy coding!
 
 

Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.


