Data Wrangling in Rust with Polars

Data wrangling is an important step in preparing data for analysis or machine learning. It involves turning raw data into a clean and organized format that’s ready to use. While Python has been a popular choice for this, Rust has been gaining attention for data tasks. Polars, a library built in Rust, is one of the best tools for handling data. In this article, we’ll explore the fundamentals of using Polars for data wrangling in Rust.

What is Polars?

Polars is a fast and efficient DataFrame library designed for Rust and Python. It focuses on high performance and low memory usage. The library takes advantage of Rust’s strengths, such as memory safety, zero-cost abstractions, and concurrency. Polars provides key data manipulation features that enhance its usability:

DataFrames and Series: Polars uses DataFrames and Series for structuring data. DataFrames hold rows and columns, while Series are the individual columns.
Lazy Execution: It lets you chain operations without computing anything immediately, so Polars can optimize the whole query before it runs.
Parallel Execution: Polars uses multiple CPU cores for processing large datasets. This improves speed, especially for big data.
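To make the relationship between DataFrames and Series concrete, here is a minimal plain-Rust sketch that uses only the standard library. It is not how Polars is implemented; the names Column, Frame, with_column, height, and width are hypothetical and only illustrate the idea that a frame is a set of named columns of equal length.

```rust
// Hypothetical, simplified model of a column-oriented table.
// This is NOT Polars' implementation; it only mirrors the idea that
// a DataFrame is a collection of named, equal-length Series.

#[derive(Debug, Clone, PartialEq)]
enum Column {
    Int(Vec<i64>),
    Text(Vec<String>),
}

impl Column {
    fn len(&self) -> usize {
        match self {
            Column::Int(v) => v.len(),
            Column::Text(v) => v.len(),
        }
    }
}

#[derive(Debug, Default)]
struct Frame {
    columns: Vec<(String, Column)>,
}

impl Frame {
    // Add a column, rejecting it if its length differs from the existing ones.
    fn with_column(mut self, name: &str, col: Column) -> Result<Self, String> {
        if let Some((_, first)) = self.columns.first() {
            if first.len() != col.len() {
                return Err(format!(
                    "column '{}' has length {}, expected {}",
                    name,
                    col.len(),
                    first.len()
                ));
            }
        }
        self.columns.push((name.to_string(), col));
        Ok(self)
    }

    // Number of rows (length of any column, or 0 for an empty frame).
    fn height(&self) -> usize {
        self.columns.first().map_or(0, |(_, c)| c.len())
    }

    // Number of columns.
    fn width(&self) -> usize {
        self.columns.len()
    }
}

fn build_people() -> Result<Frame, String> {
    Frame::default()
        .with_column(
            "name",
            Column::Text(vec![
                "James".into(),
                "William".into(),
                "Oliver".into(),
                "Sophia".into(),
            ]),
        )?
        .with_column("age", Column::Int(vec![25, 30, 35, 28]))
}

fn main() {
    let frame = build_people().expect("columns have equal length");
    println!("{} rows x {} columns", frame.height(), frame.width());
}
```

Storing each column contiguously is what makes per-column operations such as filtering on one field or averaging another cheap, which is the access pattern most data wrangling relies on.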

Setting Up the Environment

First, create a new Rust project using Cargo. You can do this by running the following command:

cargo new your_project_name

Replace your_project_name with the name of your project. Once the project is created, navigate into the project directory:

cd your_project_name

Next, open the Cargo.toml file, which is located in the root of your project directory. Add the Polars dependency under [dependencies]:

[dependencies]
polars = { version = "0.25", features = ["lazy"] }

This will add the Polars library to your project. It also enables the lazy execution feature. Lazy execution helps optimize data transformations.

Loading Data into Polars

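Polars supports various file formats, including CSV, Parquet, and JSON. To keep the examples in this article self-contained, we'll build a small DataFrame in memory with the df! macro instead of reading a file: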

use polars::prelude::*;

fn main() -> Result<(), PolarsError> {
    // Create a DataFrame with 4 names, ages, and cities
    let df = df![
        "name" => &["James", "William", "Oliver", "Sophia"],
        "age" => &[25, 30, 35, 28],
        "city" => &["New York", "Los Angeles", "Chicago", "San Francisco"]
    ]?;

    // Display the DataFrame
    println!("{:?}", df);

    Ok(())
}

Here, we create a DataFrame with columns for names, ages, and cities.

Filtering Data

Filtering helps you select specific rows based on conditions. Polars makes it easy to filter DataFrames using column values or expressions.

use polars::prelude::*;

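fn main() -> Result<(), PolarsError> {
    let df = df![
        "name" => &["James", "William", "Oliver", "Sophia"],
        "age" => &[25, 30, 35, 28],
        "city" => &["New York", "Los Angeles", "Chicago", "San Francisco"]
    ]?;

    // Build a boolean mask: true where age is greater than 30.
    // Going through the typed i32 column keeps the comparison's return
    // type a plain boolean column.
    let mask = df.column("age")?.i32()?.gt(30);

    // Keep only the rows where the mask is true
    let filtered_df = df.filter(&mask)?;
    println!("{:?}", filtered_df);

    Ok(())
}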

This code will return only the rows where the “age” column is greater than 30.
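Conceptually, this kind of filtering has two steps: compare a column against a condition to get a boolean mask, then keep the rows where the mask is true. The following standard-library sketch illustrates that idea on plain slices. It is an illustration rather than Polars' internals, and build_mask and apply_mask are hypothetical helper names.

```rust
// Plain-Rust illustration of mask-based filtering (standard library only).
// `build_mask` and `apply_mask` are hypothetical helpers, not Polars APIs.

// Step 1: compare every value in the column with the threshold.
fn build_mask(ages: &[i32], threshold: i32) -> Vec<bool> {
    ages.iter().map(|&age| age > threshold).collect()
}

// Step 2: keep only the entries whose mask value is true.
fn apply_mask<T: Clone>(column: &[T], mask: &[bool]) -> Vec<T> {
    column
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(value, _)| value.clone())
        .collect()
}

fn main() {
    let names = ["James", "William", "Oliver", "Sophia"];
    let ages = [25, 30, 35, 28];

    let mask = build_mask(&ages, 30);

    // The same mask is applied to every column so the rows stay aligned.
    println!("{:?}", apply_mask(&names, &mask)); // prints ["Oliver"]
    println!("{:?}", apply_mask(&ages, &mask)); // prints [35]
}
```

Note that the comparison is strict, so the row with age 30 is excluded.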

Aggregating Data

Aggregating combines data to find summaries like totals or averages. Polars provides methods to calculate sums, means, counts, and more.

use polars::prelude::*;

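fn main() -> Result<(), PolarsError> {
    let df = df![
        "name" => &["James", "William", "Oliver", "Sophia"],
        "age" => &[25, 30, 35, 28],
        "city" => &["New York", "Los Angeles", "Chicago", "San Francisco"]
    ]?;

    // Calculate the average age. mean() returns an Option<f64>
    // (None when the column has no valid values), so match on it
    // instead of using the ? operator.
    match df["age"].mean() {
        Some(avg_age) => println!("Average Age: {}", avg_age),
        None => println!("Average Age: not available"),
    }

    Ok(())
}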

This code calculates the average of the “age” column and prints the result.
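A mean over an empty column has no defined value, which is why aggregations like this are naturally modelled as optional results. As a rough standard-library sketch of that idea (column_mean is a hypothetical helper, not a Polars function):

```rust
// Standard-library sketch of a column mean.
// `column_mean` is a hypothetical helper, not part of Polars.
// Returning Option makes the "no values" case explicit instead of
// silently dividing by zero.
fn column_mean(values: &[i32]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let sum: f64 = values.iter().map(|&v| f64::from(v)).sum();
    Some(sum / values.len() as f64)
}

fn main() {
    let ages = [25, 30, 35, 28];
    match column_mean(&ages) {
        Some(avg) => println!("Average Age: {}", avg), // prints 29.5
        None => println!("Average Age: not available"),
    }
}
```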

Sorting Data

Sorting is a simple but important part of data wrangling. Polars lets you sort DataFrames in ascending or descending order based on one or more columns.

use polars::prelude::*;

fn main() -> Result<(), PolarsError> {
    let df = df![
        "name" => &["James", "William", "Oliver", "Sophia"],
        "age" => &[25, 30, 35, 28],
        "city" => &["New York", "Los Angeles", "Chicago", "San Francisco"]
    ]?;

    // Sort DataFrame by 'age' in descending order.
    // In the 0.25 API pinned in Cargo.toml, the second argument is a
    // "reverse" flag, so true means descending.
    let sorted_df = df.sort(["age"], true)?;
    println!("{:?}", sorted_df);

    Ok(())
}

In the above code, the sort() method sorts the DataFrame by the “age” column in descending order.
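In a column-oriented table, sorting by one column means computing a row order from that column and then reordering every column with it. Here is a minimal sketch of that idea using only the standard library; argsort_desc and take are hypothetical helper names, and this is an illustration rather than Polars' actual algorithm.

```rust
// Compute the row order that sorts `keys` in descending order.
// `argsort_desc` and `take` are hypothetical helpers, not Polars APIs.
fn argsort_desc(keys: &[i32]) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..keys.len()).collect();
    // sort_by is stable, so rows with equal keys keep their original order.
    idx.sort_by(|&a, &b| keys[b].cmp(&keys[a]));
    idx
}

// Reorder one column according to the computed row order.
fn take<T: Clone>(column: &[T], idx: &[usize]) -> Vec<T> {
    idx.iter().map(|&i| column[i].clone()).collect()
}

fn main() {
    let names = ["James", "William", "Oliver", "Sophia"];
    let ages = [25, 30, 35, 28];

    let order = argsort_desc(&ages);

    // Apply the same order to every column so rows stay aligned.
    println!("{:?}", take(&names, &order)); // prints ["Oliver", "William", "Sophia", "James"]
    println!("{:?}", take(&ages, &order)); // prints [35, 30, 28, 25]
}
```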

Joining DataFrames

Polars lets you join DataFrames, similar to SQL joins. You can merge data using one or more key columns. It supports inner, left, right, and outer joins.

use polars::prelude::*;

fn main() -> Result<(), PolarsError> {
    let df1 = df![
        "name" => &["James", "William", "Oliver", "Sophia"],
        "age" => &[25, 30, 35, 28]
    ]?;

    let df2 = df![
        "name" => &["James", "Sophia"],
        "city" => &["New York", "San Francisco"]
    ]?;

    // Inner join of the two DataFrames on 'name'.
    // inner_join is used here because the argument list of the general
    // join() method differs between Polars versions.
    let joined_df = df1.inner_join(&df2, ["name"], ["name"])?;
    println!("{:?}", joined_df);

    Ok(())
}

This example demonstrates an inner join on the “name” column.
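An inner join keeps only the rows whose key appears in both tables. One common way to compute it is a hash join: index one table by key, then probe it with each row of the other. The sketch below shows that technique with the standard library only. The function inner_join_by_name is hypothetical, it assumes keys on the right side are unique, and it is not a description of how Polars implements joins.

```rust
use std::collections::HashMap;

// Hypothetical inner hash join on a string key (standard library only).
// Assumes each name appears at most once in `right`.
fn inner_join_by_name<'a>(
    left: &[(&'a str, i32)],      // (name, age)
    right: &[(&'a str, &'a str)], // (name, city)
) -> Vec<(&'a str, i32, &'a str)> {
    // Build phase: index the right table by its key.
    let lookup: HashMap<&str, &str> = right.iter().copied().collect();

    // Probe phase: keep left rows whose key exists on the right.
    left.iter()
        .filter_map(|&(name, age)| lookup.get(name).map(|&city| (name, age, city)))
        .collect()
}

fn main() {
    let people = [("James", 25), ("William", 30), ("Oliver", 35), ("Sophia", 28)];
    let cities = [("James", "New York"), ("Sophia", "San Francisco")];

    // William and Oliver have no match in `cities`, so they are dropped.
    for (name, age, city) in inner_join_by_name(&people, &cities) {
        println!("{} | {} | {}", name, age, city);
    }
}
```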

Lazy Execution for Performance Optimization

One great feature of Polars is lazy execution. It lets you chain and optimize multiple operations before running them. This is helpful for large datasets as it reduces extra computations and saves memory. With the LazyFrame API, you set up transformations without executing them right away. Polars runs everything efficiently when you call the collect() function.

Let’s look at a practical example of lazy execution:

use polars::prelude::*;

fn main() -> Result<(), PolarsError> {
    let df = df![
        "name" => &["James", "William", "Oliver", "Sophia"],
        "age" => &[25, 30, 35, 28],
        "city" => &["New York", "Los Angeles", "Chicago", "San Francisco"]
    ]?;

    // Convert to a LazyFrame and chain operations.
    // lit(30) wraps the constant as an expression for the comparison.
    let result_df = df.lazy()
        .filter(col("age").gt(lit(30)))
        .select([col("name"), col("city")])
        .collect()?; // Trigger execution

    println!("{:?}", result_df);

    Ok(())
}

This code creates a DataFrame and converts it to a LazyFrame. It filters rows where “age” is greater than 30. Then, it selects the “name” and “city” columns. Finally, it runs the operations and prints the result.
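Rust's own iterators follow a similar "describe first, run on collect" model, which makes them a handy way to see deferred execution in isolation. The following standard-library sketch counts how many rows the filter has examined before and after collect(). It is only an analogy: iterators defer work, but they do not reorder or optimize a query plan the way a lazy query engine can. The function lazy_demo is hypothetical.

```rust
use std::cell::Cell;

// Returns (rows examined before collect, rows examined after collect, result).
// `lazy_demo` is a hypothetical function used only for illustration.
fn lazy_demo() -> (usize, usize, Vec<&'static str>) {
    let rows = [("James", 25), ("William", 30), ("Oliver", 35), ("Sophia", 28)];
    let examined = Cell::new(0usize);

    // Describe the pipeline. Nothing runs yet.
    let pipeline = rows
        .iter()
        .filter(|(_, age)| {
            examined.set(examined.get() + 1);
            *age > 30
        })
        .map(|(name, _)| *name);

    // Still 0: the filter closure has not been called.
    let before = examined.get();

    // collect() drives the whole chain in a single pass.
    let result: Vec<&'static str> = pipeline.collect();
    let after = examined.get();

    (before, after, result)
}

fn main() {
    let (before, after, result) = lazy_demo();
    println!("examined before collect: {}", before); // prints 0
    println!("examined after collect: {}", after); // prints 4
    println!("result: {:?}", result); // prints ["Oliver"]
}
```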

Comparing Polars and Pandas

Polars and Pandas are both used for data manipulation, but they differ in key areas:

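Performance: Polars runs operations in parallel across CPU cores and can optimize a whole query lazily before executing it, while Pandas evaluates eagerly and mostly on a single thread. This makes Polars faster in many workloads, especially with large datasets.
API Design: Polars operations return new DataFrames instead of modifying data in place, and its expression API composes into query plans that can be optimized. Pandas relies more on in-place mutation and index-based access.
Memory Usage: Polars stores data in the columnar Apache Arrow format, and Rust's low-level memory control, with no garbage collector, helps it keep overhead lower than Pandas.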
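The parallel execution mentioned under Performance can be pictured as splitting a column into chunks and processing each chunk on its own thread. Below is a small standard-library sketch of that pattern using scoped threads (available since Rust 1.63). The function parallel_sum and its n_chunks parameter are hypothetical, and a real engine schedules work far more carefully than one thread per chunk.

```rust
use std::thread;

// Hypothetical helper: sum a column by splitting it into roughly equal
// chunks and summing each chunk on its own scoped thread.
fn parallel_sum(values: &[i64], n_chunks: usize) -> i64 {
    if values.is_empty() {
        return 0;
    }
    // Round up so every value lands in some chunk; treat 0 chunks as 1.
    let n = n_chunks.max(1);
    let chunk_size = (values.len() + n - 1) / n;

    thread::scope(|s| {
        // Spawn all workers first so they run concurrently.
        let handles: Vec<_> = values
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || chunk.iter().sum::<i64>()))
            .collect();

        // Then combine the partial sums.
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .sum::<i64>()
    })
}

fn main() {
    let values: Vec<i64> = (1..=100).collect();
    println!("sum = {}", parallel_sum(&values, 4)); // prints sum = 5050
}
```

Scoped threads let each worker borrow its chunk directly, so no data is copied to share it between threads.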

Conclusion

Rust and Polars are a strong combination for data wrangling. Polars is fast and memory-efficient, and it handles tasks like filtering, aggregating, sorting, and joining well, even on large datasets. Rust's compile-time checks help keep your code safe and reliable. Polars is also under active development and keeps improving its performance. Try it for your next project!

Jayita Gulati is a machine learning enthusiast and technical writer driven by her passion for building machine learning models. She holds a Master’s degree in Computer Science from the University of Liverpool.
