How to Speed Up Pandas Code – Vectorization
If we want our deep learning models to train on a dataset, we have to optimize our code to parse through that data quickly. We want to read our data tables as fast as possible, which means writing our code in an optimized way. Even a small per-row performance gain compounds quickly across tens of thousands of data points. In this blog, we will define Pandas and provide an example of how you can vectorize your Python code to optimize dataset analysis using Pandas, speeding up your code by over 300x.
What is Pandas for Python?
Pandas is an essential and popular open-source data manipulation and data analysis library for the Python programming language. Pandas is widely used in various fields such as finance, economics, social sciences, and engineering. It is beneficial for data cleaning, preparation, and analysis in data science and machine learning tasks.
It provides powerful data structures (such as the DataFrame and Series) and data manipulation tools to work with structured data, including reading and writing data in various formats (e.g. CSV, Excel, JSON) and filtering, cleaning, and transforming data. Additionally, it supports time series data and provides powerful data aggregation and visualization capabilities through integration with other popular libraries such as NumPy and Matplotlib.
Our Dataset and Problem
The Data
In this example, we are going to create a random dataset in a Jupyter Notebook using NumPy to fill in our Pandas data frame with arbitrary values and strings. In this dataset, we are naming 10,000 people of varying ages, the amount of time they work, and the percentage of time they are productive at work. They will also be assigned a random favorite treat, as well as a random bad karma event.
We are first going to import our frameworks and generate some random code before we start:
import pandas as pd
import numpy as np
Next, we are going to create our dataset by generating some random data. Your code will most likely rely on actual data, but for our use case, we will create some arbitrary data.
def get_data(size=10_000):
    df = pd.DataFrame()
    df['age'] = np.random.randint(0, 100, size)
    df['time_at_work'] = np.random.randint(0, 8, size)
    df['percentage_productive'] = np.random.rand(size)
    df['favorite_treat'] = np.random.choice(['ice_cream', 'boba', 'cookie'], size)
    df['bad_karma'] = np.random.choice(['stub_toe', 'wifi_malfunction', 'extra_traffic'], size)
    return df
The Parameters and Rules
- If a person’s ‘time_at_work’ is at least 2 hours AND their ‘percentage_productive’ is at least 50%, we return their ‘favorite_treat’.
- Otherwise, we give them ‘bad_karma’.
- If they are 65 years old or older, we return their ‘favorite_treat’, since we want our elderly to be happy.
def reward_calc(row):
    if row['age'] >= 65:
        return row['favorite_treat']
    if row['time_at_work'] >= 2 and row['percentage_productive'] >= 0.5:
        return row['favorite_treat']
    return row['bad_karma']
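As a quick sanity check, the rules above can be exercised on a couple of hand-built rows; plain dicts stand in for DataFrame rows here, since reward_calc only does key lookups. The sample values are made up for illustration:

```python
# Same logic as the reward_calc above, applied to hand-built rows
def reward_calc(row):
    if row['age'] >= 65:
        return row['favorite_treat']
    if row['time_at_work'] >= 2 and row['percentage_productive'] >= 0.5:
        return row['favorite_treat']
    return row['bad_karma']

senior = {'age': 70, 'time_at_work': 1, 'percentage_productive': 0.2,
          'favorite_treat': 'boba', 'bad_karma': 'stub_toe'}
slacker = {'age': 30, 'time_at_work': 1, 'percentage_productive': 0.9,
           'favorite_treat': 'cookie', 'bad_karma': 'extra_traffic'}

print(reward_calc(senior))   # 65 or older always gets the treat -> 'boba'
print(reward_calc(slacker))  # under 2 hours worked -> 'extra_traffic'
```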
Now that we have our dataset and our parameters for what we want to return, we can go ahead and explore the fastest way to execute this type of analysis.
Which Pandas Code Is Fastest: Looping, Apply, or Vectorization?
To time our functions, we will be using a Jupyter Notebook to make it relatively simple with the magic function %%timeit. There are other ways to time a function in Python but for demonstration purposes, our Jupyter Notebook will suffice. We will do a demo run on the same dataset with 3 ways of calculating and evaluating our problem using Looping/Iterating, Apply, and Vectorization.
Looping/Iterating
Looping and iterating is the most basic way to deliver the same calculation row by row. We call the data frame, iterate over its rows, and fill a new column called reward by running the calculation from our previously defined reward_calc function. This is the most basic method and probably the first one learned when coding, similar to for loops.
%%timeit
df = get_data()
for index, row in df.iterrows():
    df.loc[index, 'reward'] = reward_calc(row)
This is what it returned:
3.66 s ± 119 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Inexperienced data scientists might see a couple of seconds as no big deal. But 3.66 seconds is quite long to run a simple function through a dataset. Let’s see what the apply function can do for us for speed.
Apply
The apply function effectively does the same thing as the loop. It creates a new column titled reward and applies the calculation function one row at a time, as defined by axis=1. The apply function is a faster way to run a loop over your dataset.
%%timeit
df = get_data()
df['reward'] = df.apply(reward_calc, axis=1)
The time it took to run is as follows:
404 ms ± 18.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Wow, so much faster! About 9x faster, a huge improvement over the loop. The apply function is perfectly fine to use and will be applicable in certain scenarios, but for our use case, let’s see if we can speed it up more.
Vectorization
Our last and final way to evaluate this dataset is to use vectorization. We call our dataset and assign the default reward, bad_karma, to the entire data frame. Then we check only for the rows that satisfy our parameters using boolean indexing. Think of it as setting a true/false value for each row: wherever the combined condition is false, the reward column keeps bad_karma; wherever it is true, it is overwritten with that row’s favorite_treat.
%%timeit
df = get_data()
df['reward'] = df['bad_karma']
df.loc[((df['percentage_productive'] >= 0.5) &
        (df['time_at_work'] >= 2)) |
       (df['age'] >= 65), 'reward'] = df['favorite_treat']
The time it took to run this function on our dataset is as follows:
10.4 ms ± 76.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
That is extremely fast: about 40x faster than apply and approximately 360x faster than looping.
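As an aside, the same mask logic can also be expressed with NumPy’s np.select, which keeps conditions and their outcomes in parallel lists and can be easier to extend as rules accumulate. This is an alternative sketch on a tiny hand-built frame, not the version timed above:

```python
import numpy as np
import pandas as pd

# Tiny made-up frame covering each rule: senior, productive worker, neither
df = pd.DataFrame({
    'age': [70, 30, 40],
    'time_at_work': [1, 5, 1],
    'percentage_productive': [0.1, 0.9, 0.9],
    'favorite_treat': ['boba', 'cookie', 'ice_cream'],
    'bad_karma': ['stub_toe', 'wifi_malfunction', 'extra_traffic'],
})

# Conditions are checked in order; the first match wins per row
conditions = [
    df['age'] >= 65,
    (df['time_at_work'] >= 2) & (df['percentage_productive'] >= 0.5),
]
df['reward'] = np.select(conditions,
                         [df['favorite_treat'], df['favorite_treat']],
                         default=df['bad_karma'])
print(df['reward'].tolist())  # ['boba', 'cookie', 'extra_traffic']
```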
Why Vectorization in Pandas is over 300x Faster
The reason vectorization is so much faster than looping/iterating and apply is that it doesn’t compute each row individually but instead applies the operation to the entire dataset at once. Vectorization is a process where operations are applied to whole arrays of data rather than to each element one at a time, which allows for much more efficient use of memory and CPU resources.
When using loops or apply to perform calculations on a Pandas data frame, the operation is applied sequentially, one row at a time. This causes repeated memory accesses, calculations, and value updates, which can be slow and resource intensive.
Vectorized operations, on the other hand, are implemented in compiled code (largely C, via Cython and NumPy) and can take advantage of the CPU’s vector processing capabilities, performing multiple operations at once. They also avoid the per-row Python interpreter and memory-access overhead that is the main bottleneck of looping and apply.
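A rough illustration of this whole-array idea, using plain NumPy; exact timings will vary by machine, but the gap between a Python-level loop and one compiled array operation should be obvious:

```python
import time
import numpy as np

ages = np.random.randint(0, 100, 1_000_000)

# Element by element: one Python interpreter step per value
start = time.perf_counter()
looped = np.array([a >= 65 for a in ages])
loop_time = time.perf_counter() - start

# Whole array: a single comparison dispatched to compiled C code
start = time.perf_counter()
vectorized = ages >= 65
vec_time = time.perf_counter() - start

assert (looped == vectorized).all()  # same answer either way
print(f"loop: {loop_time:.4f}s, vectorized: {vec_time:.4f}s")
```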
How to Vectorize your Pandas Code
- Use built-in Pandas and NumPy functions that are implemented in C, like sum(), mean(), or max().
- Use vectorized operations that apply to entire DataFrames and Series, including mathematical operations, comparisons, and logic, to create a boolean mask and select multiple rows from your dataset.
- Use the .values attribute or .to_numpy() to get the underlying NumPy array and perform vectorized calculations directly on it.
- Use vectorized string operations on your dataset, such as .str.contains(), .str.replace(), and .str.split().
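The bullet points above can be sketched in a few lines; the column names and values here are made up purely for illustration:

```python
import pandas as pd

df = pd.DataFrame({'score': [10, 20, 30],
                   'name': ['ice_cream', 'boba', 'cookie']})

# 1. Built-in reductions run in compiled code
total = df['score'].sum()   # 60
top = df['score'].max()     # 30

# 2. Comparisons build a boolean mask over the whole column
mask = df['score'] >= 20
high = df[mask]             # the two rows with scores 20 and 30

# 3. Drop to the raw NumPy array for direct vectorized math
doubled = df['score'].to_numpy() * 2   # array([20, 40, 60])

# 4. Vectorized string operations via the .str accessor
has_cream = df['name'].str.contains('cream')   # True, False, False
```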
Whenever you’re writing functions on Pandas DataFrames, try to vectorize your calculations as much as possible. As datasets get larger and calculations get more complex, the time savings add up dramatically when you utilize vectorization. It’s worth noting that not all operations can be vectorized, and sometimes it’s necessary to use loops or apply functions. However, wherever it’s possible, vectorized operations can greatly improve performance and make your code more efficient.
Kevin Vu manages Exxact Corp blog and works with many of its talented authors who write about different aspects of Deep Learning.