When coding in Python, you don’t usually have to wrap your head around the details of memory allocation. But tracing memory allocation can be helpful, especially if you’re working with memory-intensive operations and large datasets.
Python’s built-in tracemalloc module comes with functions that’ll help you understand memory usage and debug applications. With tracemalloc, you can see where and how many blocks of memory have been allocated, take snapshots of memory usage, compare differences between snapshots, and more.
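As a quick preview (this snippet is separate from the tutorial’s scripts), a minimal sketch of the basic workflow looks like this: start tracing, allocate something, take a snapshot, and print per-line statistics:
# Minimal tracemalloc sketch (not part of this tutorial's scripts)
import tracemalloc

tracemalloc.start()                          # begin tracing memory allocations
data = [str(i) for i in range(100_000)]      # allocate some memory to trace
snapshot = tracemalloc.take_snapshot()       # capture current allocations

# Print the three lines responsible for the most allocated memory
for stat in snapshot.statistics('lineno')[:3]:
    print(stat)

tracemalloc.stop()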
We’ll look at some of these in this tutorial. Let’s get started.
Before You Begin
We’ll use a simple Python script for data processing. For this, we’ll create a sample dataset and process it. Besides a recent version of Python, you also need pandas and NumPy in your working environment.
Create a virtual environment and activate it:
$ python3 -m venv v1
$ source v1/bin/activate
And install the required libraries:
$ pip3 install numpy pandas
You can find the code for this tutorial on GitHub.
Create a Sample Dataset with Order Details
We’ll generate a sample CSV file with order details. You can run the following script to create a CSV file with 100K order records:
# create_data.py
import pandas as pd
import numpy as np

# Create a sample dataset with order details
num_orders = 100000
data = {
    'OrderID': np.arange(1, num_orders + 1),
    'CustomerID': np.random.randint(1000, 5000, num_orders),
    'OrderAmount': np.random.uniform(10.0, 1000.0, num_orders).round(2),
    'OrderDate': pd.date_range(start="2023-01-01", periods=num_orders, freq='min')
}

df = pd.DataFrame(data)
df.to_csv('order_data.csv', index=False)
This script populates a pandas dataframe with 100K records containing the following four features, and exports the dataframe to a CSV file:
- OrderID: Unique identifier for each order
- CustomerID: ID for the customer
- OrderAmount: The amount of each order
- OrderDate: The date and time of the order
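If you want to sanity-check the generated file before moving on, a quick look at its shape and first few rows (this check isn’t part of the original script) should confirm the 100K rows and four columns:
# Optional check: preview the generated CSV (assumes create_data.py has been run)
import pandas as pd

df = pd.read_csv('order_data.csv')
print(df.shape)    # expected: (100000, 4)
print(df.head())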
Trace Memory Allocation with tracemalloc
Now we’ll create a Python script to load and process the dataset. We’ll also trace memory allocations.
First, we define the functions load_data and process_data to load and process records from the CSV file:
# main.py
import pandas as pd

def load_data(file_path):
    print("Loading data...")
    df = pd.read_csv(file_path)
    return df

def process_data(df):
    print("Processing data...")
    df['DiscountedAmount'] = df['OrderAmount'] * 0.9  # Apply a 10% discount
    df['OrderYear'] = pd.to_datetime(df['OrderDate']).dt.year  # Extract the order year
    return df
We can then go ahead with tracing memory allocation by doing the following:
- Initialize the memory tracing with tracemalloc.start().
- The load_data() function reads the CSV file into a dataframe. We take a snapshot of memory usage after this step.
- The process_data() function adds two new columns to the dataframe: ‘DiscountedAmount’ and ‘OrderYear’. We take another snapshot after processing.
- We compare the two snapshots to find memory usage differences and print out the top memory-consuming lines.
- And then print the current and peak memory usage to understand the overall impact.
Here’s the corresponding code:
import tracemalloc

def main():
    # Start tracing memory allocations
    tracemalloc.start()

    # Load data
    df = load_data('order_data.csv')

    # Take a snapshot
    snapshot1 = tracemalloc.take_snapshot()

    # Process data
    df = process_data(df)

    # Take another snapshot
    snapshot2 = tracemalloc.take_snapshot()

    # Compare snapshots
    top_stats = snapshot2.compare_to(snapshot1, 'lineno')

    print("[ Top memory-consuming lines ]")
    for stat in top_stats[:10]:
        print(stat)

    # Current and peak memory usage
    current, peak = tracemalloc.get_traced_memory()
    print(f"Current memory usage: {current / 1024 / 1024:.1f} MB")
    print(f"Peak usage: {peak / 1024 / 1024:.1f} MB")

    tracemalloc.stop()

if __name__ == "__main__":
    main()
Now run the Python script:
$ python3 main.py
This outputs the top memory-consuming lines as well as the current and peak memory usage:
Loading data...
Processing data...
[ Top 3 memory-consuming lines ]
/home/balapriya/trace_malloc/v1/lib/python3.11/site-packages/pandas/core/frame.py:12683: size=1172 KiB (+1172 KiB), count=4 (+4), average=293 KiB
/home/balapriya/trace_malloc/v1/lib/python3.11/site-packages/pandas/core/arrays/datetimelike.py:2354: size=781 KiB (+781 KiB), count=3 (+3), average=260 KiB
:123: size=34.6 KiB (+15.3 KiB), count=399 (+180), average=89 B
Current memory usage: 10.8 MB
Peak usage: 13.6 MB
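The per-line statistics above point at lines deep inside pandas. If you want the full call path that led to an allocation, you can start tracemalloc with a deeper frame limit and group statistics by traceback; here’s a hedged sketch of that approach (it re-reads order_data.csv and is independent of main.py):
# Optional: capture deeper tracebacks to see the full call path to an allocation
import tracemalloc
import pandas as pd

tracemalloc.start(10)                  # store up to 10 frames per allocation
df = pd.read_csv('order_data.csv')     # assumes the CSV from create_data.py exists
snapshot = tracemalloc.take_snapshot()

# Largest allocation, grouped by full traceback instead of a single line
top_stat = snapshot.statistics('traceback')[0]
print(f"{top_stat.count} blocks, {top_stat.size / 1024:.1f} KiB")
for line in top_stat.traceback.format():
    print(line)

tracemalloc.stop()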
Wrapping Up
Using tracemalloc to trace memory allocation helps you identify memory-intensive operations, and the traces and statistics it returns give you a starting point for optimizing performance.
You can then check whether more efficient data structures or processing methods would reduce memory usage. For long-running applications, you can also use tracemalloc periodically to track memory usage over time (see the sketch below). That said, you can always use tracemalloc in conjunction with other profiling tools to get a more comprehensive view of memory usage.
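As an example of that periodic tracking, a long-running process could log current and peak usage on each pass through its main loop; the loop body and interval below are placeholders, not part of this tutorial’s scripts:
# Hypothetical sketch: periodically log memory usage in a long-running process
import time
import tracemalloc

tracemalloc.start()

for iteration in range(3):                   # stand-in for the application's main loop
    data = [n ** 2 for n in range(100_000)]  # stand-in for real work
    current, peak = tracemalloc.get_traced_memory()
    print(f"Iteration {iteration}: current={current / 1024 / 1024:.1f} MB, "
          f"peak={peak / 1024 / 1024:.1f} MB")
    time.sleep(1)

tracemalloc.stop()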
If you’re interested in learning memory profiling with memory-profiler, read Introduction to Memory Profiling in Python.
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.