Processing a Directory of CSVs Too Big for Memory with Dask

Data has become a resource every business needs, but not all data is stored in a simple database. Many companies still rely on old-fashioned CSV files to store and exchange their tabular data, as it is the simplest format for data storage.

As a company grows, its data collection grows with it. These files can accumulate to significant sizes, making it impossible to load them with common libraries such as Pandas. Large CSV files slow down many data activities and strain system resources, which is why many professionals look for alternative solutions built for big data.

The above problem is why Dask was born. Dask is a powerful Python library designed for data manipulation with parallel computing capabilities. It allows you to work with data that exceeds your machine's memory by breaking it into manageable partitions and performing operations on them in parallel. Dask also manages memory through lazy evaluation: computations are optimized and only executed when explicitly requested.

As Dask has become an important tool for many data professionals, this article explores how to use it to process a directory of CSVs that is too big for memory.

 

Processing CSVs with Dask

 
Let’s start by preparing the sample CSV dataset. You can use your own dataset or a sample dataset from Kaggle, which I will use here. Put the files in the 'data' folder and rename them.
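
If you don't have a Kaggle dataset at hand, a minimal sketch like the one below creates two small synthetic CSV files so you can follow along. It assumes pandas and NumPy are installed; the file names and the columns rms_mean and spectral_centroid_mean simply mirror the ones used later in this article, and the values are random placeholders.

import numpy as np
import pandas as pd
from pathlib import Path

# Create the 'data' folder and fill it with two small CSV files
# that mimic the structure used in the rest of this article.
Path("data").mkdir(exist_ok=True)

rng = np.random.default_rng(42)
for name in ["features_3_sec.csv", "features_30_sec.csv"]:
    df = pd.DataFrame({
        "rms_mean": rng.uniform(0.0, 0.3, size=1_000),
        "spectral_centroid_mean": rng.uniform(1_000, 4_000, size=1_000),
    })
    df.to_csv(f"data/{name}", index=False)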

With the dataset ready, let’s install the Dask library for us to use.

pip install dask[complete]
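
To confirm the installation worked, you can check the version Dask reports; any reasonably recent release should run the examples below.

import dask

# Print the installed Dask version to verify the installation.
print(dask.__version__)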

 

If the installation is successful, we can use Dask to read and process our CSV directory.

First, let’s see all the CSV datasets inside the folder. We can do that using the following code.

import dask.dataframe as dd
import glob

# Collect every CSV file inside the data folder.
file_pattern = "data/*.csv"
files = glob.glob(file_pattern)
print(files)

 

The output will be similar to the list below. It might be longer if you have many CSV files in your data folder.

['data/features_3_sec.csv', 'data/features_30_sec.csv']

 
With the file list confirmed, we will read all the CSV files using the Dask CSV reader.

ddf = dd.read_csv(file_pattern, assume_missing=True)

 

In the code above, Dask doesn’t immediately load the CSV data into memory. Instead, it creates a lazy DataFrame in which each file (or chunk of a file) becomes a partition. The assume_missing=True parameter keeps the inferred data types flexible by reading integer-looking columns as floats, so missing values can be represented.

In the background, Dask automates the parallelization, so we don’t need to divide the data manually when we call the CSV reader; it already breaks the files into manageable blocks.
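
Because the DataFrame is lazy, it is still cheap to peek at the data. For example, head() only needs to read from the first partition, and dtypes shows the types Dask inferred.

# head() reads only from the first partition, so it stays cheap
# even when the directory holds far more data than fits in memory.
print(ddf.head())

# The inferred column types; with assume_missing=True, integer-looking
# columns are read as floats so missing values can be represented.
print(ddf.dtypes)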

We can check the number of partitions after reading the CSV directory.

print("Number of partitions:", ddf.npartitions)

 

The output will be similar to "Number of partitions: 2".
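
If you want to inspect a single partition without touching the rest of the data, the partitions accessor lets you materialize just one of them as a regular Pandas DataFrame.

# Compute only the first partition; the other partitions are never read.
first_partition = ddf.partitions[0].compute()
print(first_partition.shape)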

Let’s try to filter the data using the following code.

filtered_ddf = ddf[ddf["rms_mean"] > 0.1]

 

You might be familiar with the operation above, as it’s similar to Pandas filtering. However, Dask applies it lazily to each partition, so all the data is never loaded into memory at once.
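
Note that the filter itself is still lazy: the result is a Dask DataFrame, not a Pandas one, and you can keep chaining Pandas-style conditions without loading anything. The 0.3 upper bound below is just an arbitrary illustration value.

# Still a Dask DataFrame: no rows have been read or filtered yet.
print(type(filtered_ddf))

# Conditions can be combined lazily, just as in Pandas.
bounded_ddf = ddf[(ddf["rms_mean"] > 0.1) & (ddf["rms_mean"] < 0.3)]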

We can then perform a computational operation on our filtered dataset using the code below.

mean_spectral_centroid_mean = filtered_ddf["spectral_centroid_mean"].mean().compute()

print("Mean of feature2 for rows where rms_mean > 0.1:", mean_spectral_centroid_mean)

 

The output will be something like the below.

Mean of spectral_centroid_mean for rows where rms_mean > 0.1: 2406.2594844026335

 

In the code above, the mean is calculated across all the partitions, but only when we call .compute() does Dask trigger the actual computation. Only the final result is stored in memory.
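
If you need several aggregates from the same filtered data, it is worth computing them together so Dask can share the reading and filtering work within a single run. A minimal sketch:

import dask

# One compute() call for several results; the shared reading and
# filtering tasks are executed once within this computation.
mean_val, max_val, n_rows = dask.compute(
    filtered_ddf["spectral_centroid_mean"].mean(),
    filtered_ddf["spectral_centroid_mean"].max(),
    filtered_ddf.shape[0],
)
print(mean_val, max_val, n_rows)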

If you want to save each partition after it has gone through the whole computation, you can use the following code.

filtered_ddf.to_csv("output/filtered_*.csv", index=False)

 

This writes each previously filtered partition as its own CSV file in the local output directory.
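
If you would rather have one output file than one file per partition, to_csv accepts single_file=True. For repeated processing, writing to Parquet is often a better fit; this assumes a Parquet engine such as pyarrow is available in your environment.

# Funnel all partitions into a single CSV file.
filtered_ddf.to_csv("output/filtered_all.csv", single_file=True, index=False)

# Or store the result as Parquet, which is usually smaller and faster to re-read.
filtered_ddf.to_parquet("output/filtered_parquet", write_index=False)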

Now, we can use the code below to control the memory limit, the number of workers, and the threads per worker.

from dask.distributed import Client

client = Client(n_workers=4, threads_per_worker=1, memory_limit="2GB")

 

By worker, we mean a separate process that can execute tasks independently. We assign one thread per worker so each worker runs its tasks in parallel with the others on different cores. Finally, we set a memory limit so each worker process will not exceed 2GB.
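
The client also exposes a live dashboard where you can watch tasks, memory usage, and workers while a computation runs, and it can be shut down once you are finished.

# Open this link in a browser to monitor tasks and memory per worker.
print(client.dashboard_link)

# Shut down the workers when you are done with your processing.
client.close()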

Speaking of memory, we can control how much data should be in each partition using the blocksize parameter.

ddf_custom = dd.read_csv("data/*.csv", blocksize="5MB", assume_missing=True)

 

The blocksize parameter enforces an upper limit on the size of each partition. This flexibility is one of Dask’s strengths, allowing users to work efficiently regardless of file size.
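
You can verify the effect by checking the partition count again; with a 5MB block size, the same two CSV files are split into more partitions than before.

# With blocksize="5MB", the two CSVs are broken into smaller chunks,
# so the partition count is higher than the earlier value of 2.
print("Number of partitions:", ddf_custom.npartitions)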

Lastly, instead of aggregating across all the partitions, we can perform an operation on each partition individually using the following code.

partition_means = ddf_custom["spectral_centroid_mean"].map_partitions(lambda df: df.mean()).compute()
print(partition_means)

 

The result will look like the data series below.

0    2201.780898
1    2021.533468
2    2376.124512
dtype: float64

 

You can see that the custom blocksize divides our two CSV files into three partitions, and we can operate on each partition separately.

That’s all for this simple introduction to processing a directory of CSVs with Dask. You can try it with your own CSV dataset and more complex operations.

 

Conclusion

 
CSV files are a standard format that many companies use for data storage, and they can accumulate until their sizes become very large. Common libraries such as Pandas struggle to process these big files, which forces us to look for an alternative solution. The Dask library solves that problem.

In this article, we have learned that Dask can read multiple CSV files from a directory, partition the data into manageable chunks, and perform parallel computations with lazy evaluation, while offering flexible control over memory and processing resources. These examples show how powerful Dask is for data manipulation.

I hope this has helped!
 
 

Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.
