Data wrangling is an important step in preparing data for analysis or machine learning. It involves turning raw data into a clean and organized format that’s ready to use. While Python has long been the popular choice for this, Rust has been gaining attention for data tasks. Polars, a DataFrame library written in Rust, is one of the strongest tools available for this work. In this article, we’ll explore the fundamentals of using Polars for data wrangling in Rust.
What is Polars?
Polars is a fast and efficient DataFrame library designed for Rust and Python. It focuses on high performance and low memory usage. The library takes advantage of Rust’s strengths, such as memory safety, zero-cost abstractions, and concurrency. Polars provides key data manipulation features that enhance its usability:
- DataFrames and Series: Polars uses DataFrames and Series for structuring data. DataFrames hold rows and columns, while Series are the individual columns.
- Lazy Execution: You can chain operations without computing them immediately, which lets Polars optimize the whole query before running it.
- Parallel Execution: Polars uses multiple CPU cores for processing large datasets. This improves speed, especially for big data.
Setting Up the Environment
First, create a new Rust project using Cargo. You can do this by running the following command:
cargo new your_project_name
Replace your_project_name with the name of your project. Once the project is created, navigate into the project directory:
cd your_project_name
Next, open the Cargo.toml file, which is located in the root of your project directory. Add the Polars dependency under [dependencies]:
[dependencies]
polars = { version = "0.25", features = ["lazy"] }
This will add the Polars library to your project. It also enables the lazy execution feature. Lazy execution helps optimize data transformations.
Loading Data into Polars
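Polars supports various file formats, including CSV, Parquet, and JSON. You can also build a DataFrame directly in code with the df! macro, which is what the following example does to stay self-contained: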
use polars::prelude::*;

fn main() -> Result<(), PolarsError> {
    // Create a DataFrame with 4 names, ages, and cities
    let df = df![
        "name" => &["James", "William", "Oliver", "Sophia"],
        "age" => &[25, 30, 35, 28],
        "city" => &["New York", "Los Angeles", "Chicago", "San Francisco"]
    ]?;

    // Display the DataFrame
    println!("{:?}", df);
    Ok(())
}
Here, we create a DataFrame with columns for names, ages, and cities.
Filtering Data
Filtering helps you select specific rows based on conditions. Polars makes it easy to filter DataFrames using column values or expressions.
use polars::prelude::*;

fn main() -> Result<(), PolarsError> {
    let df = df![
        "name" => &["James", "William", "Oliver", "Sophia"],
        "age" => &[25, 30, 35, 28],
        "city" => &["New York", "Los Angeles", "Chicago", "San Francisco"]
    ]?;

    // Filter rows where age is greater than 30:
    // compare the typed "age" column to build a boolean mask, then apply it
    let mask = df.column("age")?.i32()?.gt(30);
    let filtered_df = df.filter(&mask)?;
    println!("{:?}", filtered_df);
    Ok(())
}
This code will return only the rows where the “age” column is greater than 30.
Aggregating Data
Aggregating combines data to find summaries like totals or averages. Polars provides methods to calculate sums, means, counts, and more.
use polars::prelude::*;

fn main() -> Result<(), PolarsError> {
    let df = df![
        "name" => &["James", "William", "Oliver", "Sophia"],
        "age" => &[25, 30, 35, 28],
        "city" => &["New York", "Los Angeles", "Chicago", "San Francisco"]
    ]?;

    // Calculate the average age; mean() returns an Option,
    // which is None when no mean can be computed
    if let Some(avg_age) = df["age"].mean() {
        println!("Average Age: {}", avg_age);
    }
    Ok(())
}
This code calculates the average of the “age” column and prints the result.
Sorting Data
Sorting is a simple but important part of data wrangling. Polars lets you sort DataFrames in ascending or descending order based on one or more columns.
use polars::prelude::*;

fn main() -> Result<(), PolarsError> {
    let df = df![
        "name" => &["James", "William", "Oliver", "Sophia"],
        "age" => &[25, 30, 35, 28],
        "city" => &["New York", "Los Angeles", "Chicago", "San Francisco"]
    ]?;

    // Sort DataFrame by 'age' in descending order.
    // In Polars 0.25 the second argument is the reverse flag: true means descending
    let sorted_df = df.sort(["age"], true)?;
    println!("{:?}", sorted_df);
    Ok(())
}
In the above code, the sort() method sorts the DataFrame by the “age” column in descending order.
Joining DataFrames
Polars lets you join DataFrames, similar to SQL joins. You can merge data using one or more key columns. It supports inner, left, right, and outer joins.
use polars::prelude::*;

fn main() -> Result<(), PolarsError> {
    let df1 = df![
        "name" => &["James", "William", "Oliver", "Sophia"],
        "age" => &[25, 30, 35, 28]
    ]?;
    let df2 = df![
        "name" => &["James", "Sophia"],
        "city" => &["New York", "San Francisco"]
    ]?;

    // Inner join the two DataFrames on 'name':
    // only names present in both DataFrames are kept
    let joined_df = df1.inner_join(&df2, ["name"], ["name"])?;
    println!("{:?}", joined_df);
    Ok(())
}
This example demonstrates an inner join on the “name” column.
Lazy Execution for Performance Optimization
One great feature of Polars is lazy execution. It lets you chain and optimize multiple operations before running them. This is helpful for large datasets as it reduces extra computations and saves memory. With the LazyFrame API, you set up transformations without executing them right away. Polars runs everything efficiently when you call the collect() function.
Let’s look at a practical example of lazy execution:
use polars::prelude::*;

fn main() -> Result<(), PolarsError> {
    let df = df![
        "name" => &["James", "William", "Oliver", "Sophia"],
        "age" => &[25, 30, 35, 28],
        "city" => &["New York", "Los Angeles", "Chicago", "San Francisco"]
    ]?;

    // Convert to LazyFrame and chain operations
    let result = df.lazy()
        .filter(col("age").gt(lit(30)))
        .select([col("name"), col("city")])
        .collect()?; // Trigger execution
    println!("{:?}", result);
    Ok(())
}
This code creates a DataFrame and converts it to a LazyFrame. It filters rows where “age” is greater than 30. Then, it selects the “name” and “city” columns. Finally, it runs the operations and prints the result.
Comparing Polars and Pandas
Polars and Pandas are both used for data manipulation, but they differ in key areas:
- Performance: Polars runs operations in parallel across CPU cores and can optimize queries through lazy execution, so it is often faster than Pandas, especially with large datasets.
- API Design: Polars is built around expressions and operations that return new DataFrames rather than mutating data in place, which can be easier to optimize than Pandas’ object-based approach.
- Memory Usage: Rust’s low-level memory control helps Polars manage memory more efficiently and minimizes overhead compared to Pandas.
Conclusion
Rust with Polars is a strong combination for data wrangling. Polars is fast and memory-efficient, and it handles tasks like filtering, aggregating, sorting, and joining data well, including on large datasets. Rust helps keep your code safe and reliable, and Polars continues to improve with each release. Try it in your next data project!
Jayita Gulati is a machine learning enthusiast and technical writer driven by her passion for building machine learning models. She holds a Master’s degree in Computer Science from the University of Liverpool.