Image by Author | DALLE-3 & Canva
Have you ever dealt with messy datasets? They are one of the biggest hurdles in any data science project. These datasets can contain inconsistencies, missing values, or irregularities that hinder analysis. Data cleaning is the essential first step that lays the foundation for accurate and reliable insights, but it is often tedious and time-consuming.
Fear not! Let me introduce you to Pyjanitor, a fantastic Python library that can save the day. It is a convenient Python package, providing a simple remedy to these data-cleaning challenges. In this article, I am going to discuss the importance of Pyjanitor along with its features and practical usage.
By the end of this article, you will have a clear understanding of how Pyjanitor simplifies data cleaning and its application in everyday data-related tasks.
What is Pyjanitor?
Pyjanitor is a Python port of the R package janitor, built on top of pandas, that simplifies data cleaning and preprocessing tasks. It extends pandas by offering a variety of useful functions that streamline the process of cleaning, transforming, and preparing datasets. Think of it as an upgrade to your data-cleaning toolkit. Are you eager to learn about Pyjanitor? Me too. Let’s start.
Getting Started
First things first, you need to install Pyjanitor. Open your terminal or command prompt and run the following command:
pip install pyjanitor
The next step is to import Pyjanitor and Pandas into your Python script. This can be done by:
import janitor
import pandas as pd
Now, you are ready to use Pyjanitor for your data cleaning tasks. Moving forward, I will cover some of the most useful features of Pyjanitor which are:
1. Cleaning Column Names
Raise your hand if you have ever been frustrated by inconsistent column names. Yup, me too. With Pyjanitor’s clean_names() function, you can standardize your column names with a single call. It replaces spaces and dots with underscores, converts all characters to lowercase, and strips leading and trailing whitespace; passing remove_special=True also drops special characters such as the asterisk in ‘Course*’. Let’s understand it with a basic example.
# Create a data frame with inconsistent column names
student_df = pd.DataFrame({
    'Student.ID': [1, 2, 3],
    'Student Name': ['Sara', 'Hanna', 'Mathew'],
    'Student Gender': ['Female', 'Female', 'Male'],
    'Course*': ['Algebra', 'Data Science', 'Geometry'],
    'Grade': ['A', 'B', 'C']
})
# Clean the column names (remove_special=True drops characters such as '*')
clean_df = student_df.clean_names(remove_special=True)
print(clean_df)
Output:
   student_id  student_name  student_gender        course  grade
0           1          Sara          Female       Algebra      A
1           2         Hanna          Female  Data Science      B
2           3        Mathew            Male      Geometry      C
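To make the renaming rules more concrete, here is a rough pandas-only sketch of the transformations this section describes (lowercasing, trimming whitespace, replacing spaces and dots with underscores, dropping special characters). The helper name simple_clean_names is hypothetical and is not part of Pyjanitor; it only approximates the behavior and may differ from the library in edge cases.

```python
import re

import pandas as pd


def simple_clean_names(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper (not a Pyjanitor function): tidy up column names."""

    def _clean(name: str) -> str:
        name = str(name).strip().lower()         # trim whitespace, lowercase
        name = re.sub(r"[ .]", "_", name)        # spaces and dots -> underscores
        name = re.sub(r"[^0-9a-z_]", "", name)   # drop other special characters
        return re.sub(r"_+", "_", name)          # collapse repeated underscores

    # rename() returns a new data frame and leaves the original untouched
    return df.rename(columns=_clean)


df = pd.DataFrame(
    {"Student.ID": [1], " Student Name ": ["Sara"], "Course*": ["Algebra"]}
)
cleaned = simple_clean_names(df)
print(list(cleaned.columns))
```

In practice the library call is the better choice; the sketch is only meant to show what kind of string manipulation happens behind a single method call.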
2. Renaming Columns
At times, renaming columns not only enhances our understanding of the data but also improves its readability and consistency. Thanks to the rename_column() function, this task becomes effortless. A simple example showcasing the usability of this function is as follows:
student_df = pd.DataFrame({
    'stu_id': [1, 2],
    'stu_name': ['Ryan', 'James'],
})
# Renaming the columns
student_df = student_df.rename_column('stu_id', 'Student_ID')
student_df = student_df.rename_column('stu_name', 'Student_Name')
print(student_df.columns)
Output:
Index(['Student_ID', 'Student_Name'], dtype='object')
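When several columns need new names at once, a single mapping can be easier to read than repeated calls. The sketch below uses plain pandas rename() rather than a Pyjanitor function, and the column_map variable is just an illustrative name.

```python
import pandas as pd

student_df = pd.DataFrame({"stu_id": [1, 2], "stu_name": ["Ryan", "James"]})

# Map each old column name to its new name
column_map = {"stu_id": "Student_ID", "stu_name": "Student_Name"}

# rename() returns a new data frame; student_df itself is unchanged
renamed_df = student_df.rename(columns=column_map)
print(list(renamed_df.columns))
```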
3. Handling Missing Values
Missing values are a real headache when dealing with datasets. Fortunately, Pyjanitor’s fill_empty() function comes in handy for addressing these issues. Let’s explore how to handle missing values using Pyjanitor with a practical example. First, we will create a dummy data frame and populate it with some missing values.
# Create a data frame with missing values
employee_df = pd.DataFrame({
    'employee_id': [1, 2, 3],
    'name': ['Ryan', 'James', 'Alicia'],
    'department': ['HR', None, 'Engineering'],
    'salary': [60000, 55000, None]
})
Now, let’s see how Pyjanitor can assist in filling in these missing values:
# Replace the missing 'department' with 'Unknown'
# Replace the missing 'salary' with the mean of the known salaries
mean_salary = employee_df['salary'].mean()
employee_df = (
    employee_df
    .fill_empty(column_names='department', value='Unknown')
    .fill_empty(column_names='salary', value=mean_salary)
)
print(employee_df)
Output:
   employee_id    name   department   salary
0            1    Ryan           HR  60000.0
1            2   James      Unknown  55000.0
2            3  Alicia  Engineering  57500.0
In this example, the department of employee ‘James’ is replaced with ‘Unknown’, and the salary of ‘Alicia’ is replaced with the average of Ryan’s and James’s salaries. You can use various strategies for handling missing values, such as forward fill, backward fill, or filling with a specific value.
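The three strategies just mentioned can be compared side by side. This is a minimal sketch using plain pandas methods (ffill(), bfill(), and fillna()); the readings_df data is made up purely for illustration.

```python
import pandas as pd

# Hypothetical data with a two-row gap in the middle
readings_df = pd.DataFrame(
    {"day": [1, 2, 3, 4], "temperature": [20.0, None, None, 23.0]}
)

# Forward fill: carry the last known value forward into the gap
forward = readings_df["temperature"].ffill()

# Backward fill: pull the next known value back into the gap
backward = readings_df["temperature"].bfill()

# Constant fill: replace every missing value with one specific value
constant = readings_df["temperature"].fillna(0.0)

print(forward.tolist())
print(backward.tolist())
print(constant.tolist())
```

Forward and backward fill suit ordered data such as time series, while a constant or a summary statistic like the mean is usually the safer choice when row order carries no meaning.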
4. Filtering Rows & Selecting Columns
Filtering rows and selecting columns are crucial tasks in data analysis. Because Pyjanitor works directly on ordinary pandas data frames, pandas tools such as the query() method and the .loc indexer fit into the same workflow. Suppose you have a data frame containing student records, and you want to filter out the students (rows) whose marks are less than 60. Let’s see how to achieve this.
# Create a data frame with student data
students_df = pd.DataFrame({
    'student_id': [1, 2, 3, 4, 5],
    'name': ['John', 'Julia', 'Ali', 'Sara', 'Sam'],
    'subject': ['Maths', 'General Science', 'English', 'History', 'Maths'],
    'marks': [85, 58, 92, 45, 75],
    'grade': ['A', 'C', 'A+', 'D', 'B']
})
# Keep only the rows where marks are at least 60
filtered_students_df = students_df.query('marks >= 60')
print(filtered_students_df)
Output:
   student_id  name  subject  marks  grade
0           1  John    Maths     85      A
2           3   Ali  English     92     A+
4           5   Sam    Maths     75      B
Now suppose you also want to output only specific columns, such as the name and ID, rather than the entire record. This can be done as follows:
# Select specific columns
selected_columns_df = filtered_students_df.loc[:, ['student_id', 'name']]
print(selected_columns_df)
Output:
   student_id  name
0           1  John
2           3   Ali
4           5   Sam
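The filtering and column selection steps can also be combined into one expression. The sketch below is pandas-only and assumes a pass mark stored in a variable named min_marks (an illustrative name); query() can reference local variables with the @ prefix.

```python
import pandas as pd

students_df = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5],
    "name": ["John", "Julia", "Ali", "Sara", "Sam"],
    "marks": [85, 58, 92, 45, 75],
})

# Threshold kept in a variable so it is easy to change
min_marks = 60

# Filter rows first, then keep only the columns of interest
passed_df = (
    students_df
    .query("marks >= @min_marks")
    .loc[:, ["student_id", "name"]]
)
print(passed_df)
```

Keeping the threshold outside the query string avoids editing the string itself whenever the cut-off changes.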
5. Chaining Methods
With Pyjanitor’s method chaining feature, you can perform multiple operations in a single line. This capability stands out as one of its best features. To illustrate, let’s consider a data frame containing data about cars:
# Create a data frame with sample car data
cars_df = pd.DataFrame({
    'Car ID': [101, None, 103, 104, 105],
    'Car Model': ['Toyota', 'Honda', 'BMW', 'Mercedes', 'Tesla'],
    'Price ($)': [25000, 30000, None, 40000, 45000],
    'Year': [2018, 2019, 2017, 2020, None]
})
print("Cars Data Before Applying Method Chaining:")
print(cars_df)
Output:
Cars Data Before Applying Method Chaining:
   Car ID  Car Model  Price ($)    Year
0   101.0     Toyota    25000.0  2018.0
1     NaN      Honda    30000.0  2019.0
2   103.0        BMW        NaN  2017.0
3   104.0   Mercedes    40000.0  2020.0
4   105.0      Tesla    45000.0     NaN
We can see that the data frame contains missing values and inconsistent column names. We could fix this by performing operations such as clean_names(), rename_column(), and dropna() sequentially, in multiple lines. Alternatively, we can chain these methods together, performing multiple operations in a single expression, for a fluent workflow and cleaner code.
# Chain methods to clean column names, drop rows with missing values, select specific columns, and rename columns
cleaned_cars_df = (
    cars_df
    # Clean column names; the extra arguments turn 'Price ($)' into 'price'
    .clean_names(remove_special=True, strip_underscores='both')
    .dropna()  # Drop rows with missing values
    .select_columns(['car_id', 'car_model', 'price'])  # Select columns
    .rename_column('price', 'price_usd')  # Rename column
)
print("Cars Data After Applying Method Chaining:")
print(cleaned_cars_df)
Output:
Cars Data After Applying Method Chaining:
   car_id  car_model  price_usd
0   101.0     Toyota    25000.0
3   104.0   Mercedes    40000.0
In this pipeline, the following operations have been performed:
clean_names() cleans the column names.
dropna() drops the rows with missing values.
select_columns() selects specific columns, which are ‘car_id’, ‘car_model’, and ‘price’.
rename_column() renames the column ‘price’ to ‘price_usd’.
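To see why chaining reads more cleanly, it may help to compare it with a step-by-step version of the same work. The following sketch uses only pandas methods (dropna(), .loc, rename()) on column names that are assumed to be cleaned already, so it illustrates the chaining style rather than Pyjanitor’s own functions.

```python
import pandas as pd

cars_df = pd.DataFrame({
    "car_id": [101, None, 103, 104, 105],
    "car_model": ["Toyota", "Honda", "BMW", "Mercedes", "Tesla"],
    "price": [25000, 30000, None, 40000, 45000],
    "year": [2018, 2019, 2017, 2020, None],
})

# Step-by-step version: every operation reassigns an intermediate variable
step_df = cars_df.dropna()
step_df = step_df[["car_id", "car_model", "price"]]
step_df = step_df.rename(columns={"price": "price_usd"})

# Chained version: the same operations written as one expression
chained_df = (
    cars_df
    .dropna()                                    # drop rows with missing values
    .loc[:, ["car_id", "car_model", "price"]]    # select columns
    .rename(columns={"price": "price_usd"})      # rename column
)

# Both approaches should produce identical data frames
print(chained_df.equals(step_df))
```

The chained form avoids repeated reassignment, and each line of the chain documents one step of the cleaning process.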
Wrapping Up
So, to wrap up, Pyjanitor proves to be a magical library for anyone working with data. It offers many more features than discussed in this article, such as encoding categorical variables, obtaining features and labels, identifying duplicate rows, and much more. All of these advanced features and methods can be explored in its documentation. The deeper you delve into its features, the more you will be surprised by its powerful functionality. Lastly, enjoy manipulating your data with Pyjanitor.
Kanwal Mehreen
Kanwal is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook “Maximizing Productivity with ChatGPT”. As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She’s also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.