Python Tooling Beyond Pandas: Libraries to Broaden Your Data Science Toolkit



 

As a data scientist working daily with Python, I have long been familiar with the Pandas library. It's a flexible library that offers many easy-to-use APIs for data manipulation without much hassle. However, Pandas still has some disadvantages that push people toward alternatives.

Pandas is often unsuitable for processing large datasets, as its memory consumption is inefficient and certain calculations can be slow to execute. Moreover, we can't rely on parallelization to speed things up, as Pandas doesn't support it natively.

Given these limitations, it's worth looking beyond Pandas. This article examines several libraries that will broaden your data science toolkit.

 

Dask

 
The first library we will explore is Dask. As mentioned previously, Pandas struggles to accelerate its workflows because it relies on a single CPU core. That is precisely the problem Dask tries to solve.

Dask introduces itself as "Parallel Python and Easy." The tagline reflects Dask's ability to extend Pandas-style data manipulation with a flexible parallel computing framework, which means Dask can run faster by parallelizing the work.

The library is lightweight and can accelerate much of our work without additional virtualization or compilers. By leveraging parallelization, Dask can distribute work across multiple CPUs or machines to handle big data efficiently. Moreover, it exposes APIs familiar from Pandas, so newcomers can pick up Dask easily.

Let's try out the Dask library. For this example, I will use the Coronavirus Tweets NLP dataset from Kaggle.

First, let’s install the Dask library using the following code.
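
pip install "dask[complete]"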

 

Once you install the library, we can try out the Dask APIs.

As mentioned above, the Dask library uses APIs similar to those of the Pandas library. The following code demonstrates this.

import dask.dataframe as dd

df = dd.read_csv('Corona_NLP_test.csv')

sentiment_counts = df.groupby('Sentiment').size().compute()
sentiment_counts

 

The APIs are similar to the Pandas library, but with one key difference: the computation is only triggered when compute() is called.
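
To make this concrete, here is a small check (reusing the df from above) showing that the result stays a lazy Dask object until compute() is called, at which point it materializes as a regular pandas object.

lazy_counts = df.groupby('Sentiment').size()

# Still a lazy Dask collection; nothing has been executed yet
print(type(lazy_counts))

# compute() runs the task graph and returns a pandas Series
print(type(lazy_counts.compute()))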

You can even create a new feature, similar to how Pandas works.

df["tweet_length"] = df["OriginalTweet"].str.len()
df_positive = df[df["Sentiment"] == "Positive"]

avg_length_positive = df_positive["tweet_length"].mean().compute()
avg_length_positive

 

But Dask offers more than that. Because it is built around parallel execution, it can also parallelize custom Python functions. Here is an example of triggering parallel computation around an arbitrary Python function.

from dask import delayed
import time

def slow_process_tweet(tweet):
    time.sleep(0.5)  # simulate a slow, per-tweet operation
    return len(tweet) if tweet else 0

# head() eagerly returns the first 10 tweets as a pandas Series
tweets = df["OriginalTweet"].head(10)

# Wrap each call in dask.delayed so it becomes a lazy task
delayed_results = [delayed(slow_process_tweet)(tweet) for tweet in tweets]
total_length = delayed(sum)(delayed_results)

# Trigger parallel computation
result = total_length.compute()

 

The example code shows that we can turn ordinary Python functions into a set of parallel tasks.
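
By default, these delayed tasks run on Dask's local scheduler. To spread the work across all CPU cores, or across several machines, we can attach a distributed scheduler. Here is a minimal sketch, assuming the distributed extras are installed (they are included in "dask[complete]"):

from dask.distributed import Client

# Start a local cluster that uses the machine's available cores;
# Client can also connect to the address of a remote cluster.
client = Client()
print(client.dashboard_link)

# Subsequent compute() calls now run on this cluster
result = total_length.compute()
client.close()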

You can read the documentation for further implementation examples.

 

Polars

 
Polars is an open-source library that positions itself as a Pandas alternative. Where Pandas slows down on higher data volumes and complex workflows, Polars helps solve that problem.

Polars is written in Rust and exposes a Python API, so you can use it from either language. Its Rust core parallelizes work effectively and enables optimized algorithms for fast processing. With Polars, we get multi-threading under the hood for data wrangling work.

It's also easy to use, as the APIs are similar to those of Pandas. The feature set also evolves constantly, as the library is backed by an active community around the world.

Let’s try out the library to understand further. First, we will install the library.
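
pip install polars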

 

Then, we can use Polars with the example code below, which uses eager execution to read and examine the dataset.

import polars as pl

df = pl.read_csv("Corona_NLP_test.csv")

print(df.head())

print("Shape:", df.shape)      
print("Columns:", df.columns)

 

The basic APIs are similar to the Pandas implementation. However, things start to differ once we use lazy execution and more complex selections. For example, here is the code for a window-function aggregation.

import polars as pl

df_lazy = pl.scan_csv("Corona_NLP_test.csv")

query = (
    df_lazy
    .select([
        pl.col("Location"),
        pl.col("Sentiment"),
        pl.count().over("Location").alias("location_tweet_count"),
        (pl.col("Sentiment") == "Positive").cast(pl.Int32).sum().over("Location").alias("positive_count"),
    ]).unique()
)

result = query.collect()
print(result)

 

We can also chain multiple executions to make a concise pipeline.

result = (
    df.lazy()
    .filter(pl.col("Sentiment") == "Positive")
    .with_columns([
        pl.col("OriginalTweet").str.len_chars().alias("tweet_length")
    ])
    .select([
        pl.count().alias("num_positive_tweets"),        
        pl.col("tweet_length").mean().alias("avg_length"),
        pl.col("tweet_length").quantile(0.9).alias("90th_percentile_length")
    ])
    .collect()
)

print(result)
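
We can also inspect the optimized query plan before calling collect(), which is a handy way to see what Polars' lazy engine will actually execute. Here is a minimal sketch, assuming a recent Polars version where LazyFrame.explain() is available.

lazy_query = (
    df.lazy()
    .filter(pl.col("Sentiment") == "Positive")
    .select(pl.col("OriginalTweet").str.len_chars().alias("tweet_length").mean())
)

# Print the optimized plan without executing the query
print(lazy_query.explain())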

 

There are still many things you can do with the Polars library; please refer to the documentation for more information.

 

PyArrow

 
PyArrow is a Python library built on Apache Arrow, a standard for in-memory columnar data. It's designed to make analytics across the data ecosystem more efficient by simplifying the reading of multiple data formats and enabling zero-copy data sharing between different frameworks.

The library is optimized for reading and writing data, which for large datasets can be up to ten times faster than Pandas, and it enables data sharing between frameworks whose data types are usually not compatible with each other.

Let’s try out the PyArrow code implementation to understand further. First, let’s install the library using the following code.
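
pip install pyarrow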

 

PyArrow is all about data interchange between different formats. For example, here is how we can convert between a Pandas DataFrame and a PyArrow table.

import pandas as pd
import pyarrow as pa

pd_df = pd.DataFrame({
    "Location": ["USA", "Canada", "USA"],
    "Value": [10, 20, 30]
})

# Convert the Pandas DataFrame to an Arrow table and back again
arrow_table = pa.Table.from_pandas(pd_df)
back_to_pd = arrow_table.to_pandas()

 

We could also read the dataset and perform operations similar to the Pandas APIs.

import pyarrow.csv as pv
import pyarrow.compute as pc

# Read the CSV with PyArrow, then convert to Pandas for the aggregation
table = pv.read_csv('Corona_NLP_test.csv')
df = table.to_pandas()

result = df.groupby('Location').agg(
    {'Sentiment': ['count', lambda x: (x == 'Positive').sum()]}
)

result.columns = ['tweet_count', 'positive_count']
print(result)

 

Here is also an example of filtering data with the PyArrow library.

positive_mask = pc.equal(table["Sentiment"], pa.scalar("Positive"))
table_positive = table.filter(positive_mask)

count_positive = table_positive.num_rows
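
PyArrow also works hand in hand with columnar file formats such as Parquet, which is where its fast reads and writes shine. Here is a minimal sketch of persisting the filtered table and reading it back; the file name positive_tweets.parquet is just an example.

import pyarrow.parquet as pq

# Write the filtered Arrow table to Parquet and read it back
pq.write_table(table_positive, "positive_tweets.parquet")
table_loaded = pq.read_table("positive_tweets.parquet")

print(table_loaded.num_rows)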

 

That's a simple introduction to PyArrow. You can refer to the documentation for further exploration.

 

PySpark

 
PySpark is the Python API for Apache Spark, which puts distributed computational power in the user's hands. It allows data practitioners to process massive datasets using the distributed computation that Apache Spark excels at.

The library breaks datasets down into smaller partitions that can be processed in parallel. It's also suitable for handling diverse workloads, such as batch processing, SQL queries, real-time streaming, and more.

It is easy to use, scales well, and is an ideal framework for big data applications. It also enjoys strong community support and remains widely used to this day.

Let’s try to use the PySpark implementation with Python. We need to install the library using the following code:
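
pip install pyspark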

 

PySpark works similarly to SQL, but with a Pythonic API. For example, here is how we use PySpark to count the positive sentiment per location.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, sum

spark = SparkSession.builder.appName("example").getOrCreate()
df = spark.read.csv('Corona_NLP_test.csv', header=True, inferSchema=True)
result = df.groupBy('Location').agg(
    count('*').alias('tweet_count'),
    sum((col('Sentiment') == 'Positive').cast('int')).alias('positive_count')
)

result.show()
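
Since Spark also handles SQL queries directly, we can register the DataFrame as a temporary view and express the same aggregation in plain SQL. Here is a minimal sketch; the view name tweets is just an example.

# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("tweets")

sql_result = spark.sql("""
    SELECT Location,
           COUNT(*) AS tweet_count,
           SUM(CASE WHEN Sentiment = 'Positive' THEN 1 ELSE 0 END) AS positive_count
    FROM tweets
    GROUP BY Location
""")
sql_result.show()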

 

There is also a simple implementation for data pivoting with the code below.

pivoted_df = (
    df
    .groupBy("Location")
    .pivot("Sentiment")
    .agg(count("*").alias("count_by_sentiment"))
)

pivoted_df.show()

 

PySpark also allows caching, so the dataset doesn't have to be re-read and recomputed every time we use it.

# Example: caching a DataFrame
df.cache()

df.count()   # the first action materializes the cached data
df.show(5)   # later actions reuse the cache instead of re-reading the CSV

 

Conclusion

 
Pandas is the most popular Python data manipulation library, as it's easy to use and offers many powerful APIs for any data professional's needs. However, Pandas has a few weaknesses, including slower execution and a lack of native parallelization.

That's why this article has introduced a few Pandas alternatives: Dask, Polars, PyArrow, and PySpark.

I hope this has helped!
 
 

Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.
