Pandas is the most widely used Python library for data analysis and manipulation. But the data you read from a source often requires a series of data cleaning steps before you can analyze it to gain insights, answer business questions, or build machine learning models.
This guide breaks down the process of data cleaning with pandas into 7 practical steps. We’ll spin up a sample dataset and work through each of them.
Let’s get started!
Spinning Up a Sample DataFrame
Link to Colab Notebook
Before we get started with the actual data cleaning steps, let’s create a pandas dataframe with employee records. We’ll use Faker for synthetic data generation, so install it first:
pip install Faker
If you’d like, you can follow along with the same example. You can also use a dataset of your choice. Here’s the code to generate 1000 records:
import pandas as pd
from faker import Faker
import random
# Initialize Faker to generate synthetic data
fake = Faker()
# Set seed for reproducibility
Faker.seed(42)
# Generate synthetic data
data = []
for _ in range(1000):
    data.append({
        'Name': fake.name(),
        'Age': random.randint(18, 70),
        'Email': fake.email(),
        'Phone': fake.phone_number(),
        'Address': fake.address(),
        'Salary': random.randint(20000, 150000),
        'Join_Date': fake.date_this_decade(),
        'Employment_Status': random.choice(['Full-Time', 'Part-Time', 'Contract']),
        'Department': random.choice(['IT', 'Engineering', 'Finance', 'HR', 'Marketing'])
    })
Let’s tweak these records a bit to introduce missing values, duplicate records, and outliers:
# Let's tweak the records a bit!
# Introduce missing values
for i in random.sample(range(len(data)), 50):
    data[i]['Email'] = None
# Introduce duplicate records
data.extend(random.sample(data, 100))
# Introduce outliers
for i in random.sample(range(len(data)), 20):
    data[i]['Salary'] = random.randint(200000, 500000)
Now let’s create a dataframe with these records:
# Create dataframe
df = pd.DataFrame(data)
Note that we set the seed for Faker but not for the random module, so there’ll be some randomness in the records you generate.
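If you’d like the randomly generated ages, salaries, and department assignments to be reproducible as well, you can also seed Python’s random module before running the generation loop:
# Seed the random module too, for fully reproducible records
random.seed(42)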
Step 1: Understanding the Data
Step 0 is always to understand the business question or problem that you are trying to solve. Once you know that, you can start working with the data you’ve read into your pandas dataframe.
But before you can do anything meaningful with the dataset, it’s important to first get a high-level overview of it. This includes getting some basic information on the different fields and the total number of records, inspecting the head of the dataframe, and the like.
Here we run the info() method on the dataframe:
df.info()
Output >>>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1100 entries, 0 to 1099
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Name               1100 non-null   object
 1   Age                1100 non-null   int64
 2   Email              1047 non-null   object
 3   Phone              1100 non-null   object
 4   Address            1100 non-null   object
 5   Salary             1100 non-null   int64
 6   Join_Date          1100 non-null   object
 7   Employment_Status  1100 non-null   object
 8   Department         1100 non-null   object
dtypes: int64(2), object(7)
memory usage: 77.5+ KB
And inspect the head of the dataframe:
df.head()
Output of df.head()
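For a quick numerical summary of the numeric columns (Age and Salary in this dataset), you can also run describe():
# Summary statistics for the numeric columns
print(df.describe())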
Step 2: Handling Duplicates
Duplicate records are a common problem that skews the results of analysis, so we should identify and remove them to make sure we’re working only with unique records.
Here’s how we count the duplicate rows in the dataframe and then drop them in place:
# Check for duplicate rows
duplicates = df.duplicated().sum()
print("Number of duplicate rows:", duplicates)
# Removing duplicate rows
df.drop_duplicates(inplace=True)
Output >>>
Number of duplicate rows: 100
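Sometimes rows count as duplicates only with respect to a few identifying columns rather than the entire record. In that case, you can pass a subset of columns to duplicated() (the choice of Email and Phone here is just an example):
# Count rows that share the same Email and Phone
subset_duplicates = df.duplicated(subset=['Email', 'Phone']).sum()
print("Duplicates based on Email and Phone:", subset_duplicates)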
Step 3: Handling Missing Data
Missing data is a common data quality issue in many data science projects. If you take a quick look at the result of the info() method from the previous step, you’ll see that the non-null count is not the same for all fields; there are missing values in the ‘Email’ column. Let’s get the exact counts anyway.
To get the number of missing values in each column, you can run:
# Check for missing values
missing_values = df.isna().sum()
print("Missing Values:")
print(missing_values)
Output >>>
Missing Values:
Name                 0
Age                  0
Email               50
Phone                0
Address              0
Salary               0
Join_Date            0
Employment_Status    0
Department           0
dtype: int64
If there are missing values in one or more numeric columns, we can apply suitable imputation techniques. But because it’s the ‘Email’ field that has missing values here, let’s just set the missing emails to a placeholder address like so:
# Handling missing values by filling with a placeholder
df['Email'] = df['Email'].fillna('unknown@example.com')
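For numeric columns you’d typically impute instead of using a placeholder. As a sketch, if the ‘Salary’ column had missing values (it doesn’t in this dataset), median imputation would look like this:
# Example only: median imputation for a numeric column
df['Salary'] = df['Salary'].fillna(df['Salary'].median())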
Step 4: Transforming Data
When you’re working on the dataset, there may be one or more fields that do not have the expected data type. In our sample dataframe, the ‘Join_Date’ field has to be cast into a valid datetime object:
# Convert 'Join_Date' to datetime
df['Join_Date'] = pd.to_datetime(df['Join_Date'])
print("Join_Date after conversion:")
print(df['Join_Date'].head())
Output >>>
Join_Date after conversion:
0 2023-07-12
1 2020-12-31
2 2024-05-09
3 2021-01-19
4 2023-10-04
Name: Join_Date, dtype: datetime64[ns]
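If the date strings in your dataset are messier and some of them can’t be parsed, pd.to_datetime will raise an error by default; passing errors='coerce' turns unparseable values into NaT instead:
# Turn unparseable dates into NaT instead of raising an error
df['Join_Date'] = pd.to_datetime(df['Join_Date'], errors='coerce')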
Because we have the joining date, it’s actually more helpful to have a ‘Years_Employed’ column, which we can compute as shown:
# Creating a new feature 'Years_Employed' based on 'Join_Date'
df['Years_Employed'] = pd.Timestamp.now().year - df['Join_Date'].dt.year
print("New feature 'Years_Employed':")
print(df[['Join_Date', 'Years_Employed']].head())
Output >>>
New feature 'Years_Employed':
   Join_Date  Years_Employed
0 2023-07-12               1
1 2020-12-31               4
2 2024-05-09               0
3 2021-01-19               3
4 2023-10-04               1
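Note that subtracting the year parts gives whole calendar years. If you need a more precise tenure, you could compute it from the number of days instead (a quick sketch; the variable name is arbitrary):
# Fractional years of employment computed from the number of days
years_exact = (pd.Timestamp.now() - df['Join_Date']).dt.days / 365.25
print(years_exact.head())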
Step 5: Cleaning Text Data
It’s quite common to run into string fields with inconsistent formatting or similar issues. Cleaning text can be as simple as applying a case conversion or as hard as writing a complex regular expression to get the string to the required format.
In our example dataframe, the ‘Address’ column contains ‘\n’ characters that hinder readability. Let’s replace them with spaces like so:
# Clean address strings
df['Address'] = df['Address'].str.replace('\n', ' ', regex=False)
print("Address after text cleaning:")
print(df['Address'].head())
Output >>>
Address after text cleaning:
0 79402 Peterson Drives Apt. 511 Davisstad, PA 35172
1 55341 Amanda Gardens Apt. 764 Lake Mark, WI 07832
2 710 Eric Estate Carlsonfurt, MS 78605
3 809 Burns Creek Natashaport, IA 08093
4 8713 Caleb Brooks Apt. 930 Lake Crystalbury, CA...
Name: Address, dtype: object
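As a small example of a regex-based cleanup, you could also collapse any runs of whitespace into a single space and trim leading and trailing spaces:
# Collapse repeated whitespace and strip surrounding spaces
df['Address'] = df['Address'].str.replace(r'\s+', ' ', regex=True).str.strip()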
Step 6: Handling Outliers
If you scroll back up, you’ll see that we set some of the values in the ‘Salary’ column to be extremely high. Such outliers should also be identified and handled appropriately so that they don’t skew the analysis.
You’ll often want to factor in what makes a data point an outlier: whether it’s an incorrect data entry or a genuinely valid, if extreme, value. You can then choose how to handle the outliers: drop the records that contain them, or pull out the subset of rows with outliers and analyze them separately.
Let’s use the z-score and find those salary values that are more than three standard deviations away from the mean:
# Detecting outliers using z-score
z_scores = (df['Salary'] - df['Salary'].mean()) / df['Salary'].std()
outliers = df[abs(z_scores) > 3]
print("Outliers based on Salary:")
print(outliers[['Name', 'Salary']].head())
Output >>>
Outliers based on Salary:
                 Name  Salary
16     Michael Powell  414854
131     Holly Jimenez  258727
240   Daniel Williams  371500
328     Walter Bishop  332554
352      Ashley Munoz  278539
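If you conclude that these extreme salaries are data entry errors and decide to drop them, you can keep only the rows within three standard deviations (a sketch; whether dropping is the right call depends on your analysis):
# Keep only the rows whose Salary is within 3 standard deviations of the mean
df_no_outliers = df[abs(z_scores) <= 3]
print("Rows remaining after dropping outliers:", len(df_no_outliers))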
Step 7: Merging Data
In most projects, the data that you have may not be all the data you want to use for analysis. You’ll have to identify the most relevant fields and often merge in data from other dataframes to build a more useful dataset.
As a quick exercise, create another related dataframe and merge it with the existing dataframe on a common column such that the merge makes sense. Merging in pandas works very similarly to joins in SQL, so I suggest you try it out; a minimal sketch follows.
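Here’s one way such a merge could look. The dept_info dataframe and its ‘Location’ column are made up purely for illustration; the merge itself behaves like a SQL left join on the ‘Department’ column:
# A hypothetical lookup table with one row per department (illustrative values)
dept_info = pd.DataFrame({
    'Department': ['IT', 'Engineering', 'Finance', 'HR', 'Marketing'],
    'Location': ['Building A', 'Building B', 'Building C', 'Building D', 'Building E']
})
# Left join: each employee row keeps its data and gains the department's location
df = df.merge(dept_info, on='Department', how='left')
print(df[['Department', 'Location']].head())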
Wrapping Up
That’s all for this tutorial! We created a sample dataframe with employee records and worked through the various data cleaning steps. Here is an overview of the steps: understanding the data, handling duplicates, handling missing values, transforming data, cleaning text data, handling outliers, and merging data.
If you want to learn all about data wrangling with pandas, check out 7 Steps to Mastering Data Wrangling with Pandas and Python.
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.