7 Steps to Mastering Data Cleaning with Python and Pandas


Image by Author

 

Pandas is the most widely used Python library for data analysis and manipulation. But the data that you read from the source often requires a series of data cleaning steps—before you can analyze it to gain insights, answer business questions, or build machine learning models.

This guide breaks down the process of data cleaning with pandas into 7 practical steps. We’ll spin up a sample dataset and work through the data cleaning steps.

Let’s get started!

 

Spinning Up a Sample DataFrame

 

Link to Colab Notebook

Before we get started with the actual data cleaning steps, let’s create a pandas dataframe with employee records. We’ll use Faker for synthetic data generation. So install it first:
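
pip install Faker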

 

If you’d like, you can follow along with the same example. You can also use a dataset of your choice. Here’s the code to generate 1000 records:

import pandas as pd
from faker import Faker
import random

# Initialize Faker to generate synthetic data
fake = Faker()

# Set seed for reproducibility
Faker.seed(42)

# Generate synthetic data
data = []
for _ in range(1000):
    data.append({
        'Name': fake.name(),
        'Age': random.randint(18, 70),
        'Email': fake.email(),
        'Phone': fake.phone_number(),
        'Address': fake.address(),
        'Salary': random.randint(20000, 150000),
        'Join_Date': fake.date_this_decade(),
        'Employment_Status': random.choice(['Full-Time', 'Part-Time', 'Contract']),
        'Department': random.choice(['IT', 'Engineering','Finance', 'HR', 'Marketing'])
    })

 

Let’s tweak these records a bit to introduce missing values, duplicate records, and outliers:

# Let's tweak the records a bit!
# Introduce missing values
for i in random.sample(range(len(data)), 50):
    data[i]['Email'] = None

# Introduce duplicate records
data.extend(random.sample(data, 100))

# Introduce outliers
for i in random.sample(range(len(data)), 20):
    data[i]['Salary'] = random.randint(200000, 500000)

 

Now let’s create a dataframe with these records:

# Create dataframe
df = pd.DataFrame(data)

 

Note that we set the seed for Faker but not for the random module, so there’ll be some randomness in the records you generate.

 

Step 1: Understanding the Data

 

Step 0 is always to understand the business question/problem that you are trying to solve. Once you know that, you can start working with the data you’ve read into your pandas dataframe.

But before you can do anything meaningful on the dataset, it’s important to first get a high-level overview of the dataset. This includes getting some basic information on the different fields and the total number of records, inspecting the head of the dataframe, and the like.

Here we run the info() method on the dataframe:
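
# Get a high-level overview of the dataframe
df.info()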

 

Output >>>

RangeIndex: 1100 entries, 0 to 1099
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Name               1100 non-null   object
 1   Age                1100 non-null   int64 
 2   Email              1047 non-null   object
 3   Phone              1100 non-null   object
 4   Address            1100 non-null   object
 5   Salary             1100 non-null   int64 
 6   Join_Date          1100 non-null   object
 7   Employment_Status  1100 non-null   object
 8   Department         1100 non-null   object
dtypes: int64(2), object(7)
memory usage: 77.5+ KB

 

And inspect the head of the dataframe:
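
# Inspect the first few rows
df.head()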

 

Output of df.head()

 

Step 2: Handling Duplicates

 

Duplicate records are a common problem that skews the results of analysis. So we should identify and remove them so that we’re working with only unique records.

Here’s how we find all the duplicates in the dataframe and then drop all the duplicates in place:

# Check for duplicate rows
duplicates = df.duplicated().sum()
print("Number of duplicate rows:", duplicates)

# Removing duplicate rows
df.drop_duplicates(inplace=True)

 

Output >>>
Number of duplicate rows: 100

 

Step 3: Handling Missing Data

 

Missing data is a common data quality issue in many data science projects. If you take a quick look at the output of the info() method from the previous step, you’ll see that the non-null counts are not identical across fields, and that there are missing values in the ‘Email’ column. Let’s get the exact count nonetheless.

To get the number of missing values in each column you can run:

# Check for missing values
missing_values = df.isna().sum()
print("Missing Values:")
print(missing_values)

 

Output >>>
Missing Values:
Name                  0
Age                   0
Email                50
Phone                 0
Address               0
Salary                0
Join_Date             0
Employment_Status     0
Department            0
dtype: int64

 

If there are missing values in one or more numeric columns, we can apply suitable imputation techniques. But because it’s the ‘Email’ field that has missing values, let’s just set the missing emails to a placeholder address like so:


# Handling missing values by filling with a placeholder
df['Email'] = df['Email'].fillna('unknown@example.com')
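
If a numeric column did have missing values, a simple option would be median imputation. Here’s an illustrative sketch (not needed for our dataset, since only the ‘Email’ column has missing values):

# Illustrative only: fill missing values in a numeric column with its median
# (our 'Salary' column has no missing values, so this is a no-op here)
df['Salary'] = df['Salary'].fillna(df['Salary'].median())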

 

Step 4: Transforming Data

 

When you’re working on the dataset, there may be one or more fields that do not have the expected data type. In our sample dataframe, the ‘Join_Date’ field has to be cast into a valid datetime object:

# Convert 'Join_Date' to datetime
df['Join_Date'] = pd.to_datetime(df['Join_Date'])
print("Join_Date after conversion:")
print(df['Join_Date'].head())

 

Output >>>
Join_Date after conversion:
0   2023-07-12
1   2020-12-31
2   2024-05-09
3   2021-01-19
4   2023-10-04
Name: Join_Date, dtype: datetime64[ns]

 

Because we have the joining date, it’s actually more helpful to have a ‘Years_Employed’ column as shown:

# Creating a new feature 'Years_Employed' based on 'Join_Date'
df['Years_Employed'] = pd.Timestamp.now().year - df['Join_Date'].dt.year
print("New feature 'Years_Employed':")
print(df[['Join_Date', 'Years_Employed']].head())

 

Output >>>
New feature 'Years_Employed':
   Join_Date  Years_Employed
0 2023-07-12               1
1 2020-12-31               4
2 2024-05-09               0
3 2021-01-19               3
4 2023-10-04               1

 

Step 5: Cleaning Text Data

 

It’s quite common to run into string fields with inconsistent formatting or similar issues. Cleaning text can be as simple as applying a case conversion or as hard as writing a complex regular expression to get the string to the required format.

In the example dataframe that we have, we see that the ‘Address’ column contains many ‘\n’ characters that hinder readability. So let’s replace them with spaces like so:

# Clean address strings
df['Address'] = df['Address'].str.replace('\n', ' ', regex=False)
print("Address after text cleaning:")
print(df['Address'].head())

 

Output >>>
Address after text cleaning:
0    79402 Peterson Drives Apt. 511 Davisstad, PA 35172
1     55341 Amanda Gardens Apt. 764 Lake Mark, WI 07832
2                 710 Eric Estate Carlsonfurt, MS 78605
3                 809 Burns Creek Natashaport, IA 08093
4    8713 Caleb Brooks Apt. 930 Lake Crystalbury, CA...
Name: Address, dtype: object

 

Step 6: Handling Outliers

 

If you scroll back up, you’ll see that we set some of the values in the ‘Salary’ column to be extremely high. Such outliers should also be identified and handled appropriately so that they don’t skew the analysis.

You’ll often want to factor in what makes a data point an outlier: whether it’s an incorrect data entry or a genuinely valid extreme value. You may then choose how to handle the outliers: drop those records, or pull out the subset of rows with outliers and analyze them separately.

Let’s use the z-score and find those salary values that are more than three standard deviations away from the mean:

# Detecting outliers using z-score
z_scores = (df['Salary'] - df['Salary'].mean()) / df['Salary'].std()
outliers = df[abs(z_scores) > 3]
print("Outliers based on Salary:")
print(outliers[['Name', 'Salary']].head())

 

Output >>>
Outliers based on Salary:
                Name  Salary
16    Michael Powell  414854
131    Holly Jimenez  258727
240  Daniel Williams  371500
328    Walter Bishop  332554
352     Ashley Munoz  278539
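
If we decide these records are data-entry errors and should be excluded, one simple option is to keep only the rows within three standard deviations of the mean. Here’s a minimal sketch, assuming we want to drop the outliers rather than analyze them separately:

# Keep only rows whose salary z-score is within 3 standard deviations
# (assumes we've decided the outlier records should be dropped)
df_no_outliers = df[abs(z_scores) <= 3]
print("Rows remaining after removing outliers:", len(df_no_outliers))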

 

Step 7: Merging Data

 

In most projects, the data that you have may not be all the data you want for analysis. You have to find the most relevant fields to use and may also need to merge in data from other dataframes to get more useful information.

As a quick exercise, create another related dataframe and merge it with the existing dataframe on a common column such that the merge makes sense. Merging in pandas works very similarly to joins in SQL, so I suggest you try it out!
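
Here’s a minimal sketch of what such a merge could look like, using a hypothetical department_info lookup table joined on the common ‘Department’ column (the locations below are made up purely for illustration):

# Hypothetical lookup table with one row per department (illustrative values)
department_info = pd.DataFrame({
    'Department': ['IT', 'Engineering', 'Finance', 'HR', 'Marketing'],
    'Location': ['New York', 'Austin', 'Chicago', 'Boston', 'Seattle']
})

# Left join on the common 'Department' column (similar to a SQL LEFT JOIN)
df_merged = df.merge(department_info, on='Department', how='left')
print(df_merged[['Name', 'Department', 'Location']].head())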

 

Wrapping Up

 

That’s all for this tutorial! We created a sample dataframe with employee records and worked through the various data cleaning steps. Here is an overview of the steps: understanding the data, handling duplicates, handling missing data, transforming data, cleaning text data, handling outliers, and merging data.

If you want to learn all about data wrangling with pandas, check out 7 Steps to Mastering Data Wrangling with Pandas and Python.

 

 

Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.


