10 Essential Data Cleaning Techniques Explained in 12 Minutes




 

Data cleaning is often not the most enjoyable task! Yet ask any seasoned data professional about their most valuable skill, and many will tell you it’s their ability to wrangle messy data into something useful.

This article covers essential cleaning techniques that are part of all successful data projects. I’ve included practical code examples and a sample dataset so you can follow along and apply these methods to your own work.

Let’s get started.

Note: You can use this sample messy dataset to follow along with this article.

 

1. Handling Missing Values

 
Missing data is perhaps the most common problem in real-world datasets. It can skew your analysis and make your results unreliable.

Identifying and dealing with missing values is therefore the first data cleaning step.

 

Identification

First, quantify and understand your missing data patterns:

import pandas as pd
import numpy as np

# Load our sample dataset
df = pd.read_csv('messy_data.csv')

# Get overall missing values count and percentage
missing_count = df.isnull().sum()
missing_percentage = (df.isnull().sum() / len(df)) * 100

# Display columns with missing values
missing_data = pd.concat([missing_count, missing_percentage], axis=1, 
                         keys=['Count', 'Percentage'])
print(missing_data[missing_data['Count'] > 0].sort_values('Count', ascending=False))

 

This snippet loads our dataset, counts missing values in each column, calculates what percentage they represent, and then displays only the columns containing missing values, sorted by the number of missing entries.
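
Beyond the counts, it often helps to eyeball a few of the affected rows before choosing a strategy. As a quick, optional check (assuming email is one of your critical fields):

# Inspect a few records that are missing a critical field
print(df[df['email'].isnull()].head())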

 

Deletion Strategies

Sometimes removing rows or columns with missing values is the most practical option. Here are three common deletion strategies.

The first line removes rows that are completely empty.

df_clean = df.dropna(how='all')

 

The second removes rows where any of the critical columns (name, email, age) are missing.

df_clean = df.dropna(subset=['name', 'email', 'age'])

 

The third removes entire columns if they have more than 50% missing values.

threshold = len(df) * 0.5
df_clean = df.dropna(axis=1, thresh=threshold)

 

Imputation Methods

Imputation involves replacing missing values with meaningful estimates.

We need to use different imputation strategies depending on the data we’re working with. For numeric data like age, we use the median (less affected by outliers); for income, we use the mean; and for discrete values like customer ratings, the mode.

df['age'] = df['age'].fillna(df['age'].median())
df['income'] = df['income'].fillna(df['income'].mean())
df['customer_rating'] = df['customer_rating'].fillna(df['customer_rating'].mode()[0])

 

For categorical data, we use the mode (most common value) or a meaningful default.

df['gender'] = df['gender'].fillna(df['gender'].mode()[0])
df['comments'] = df['comments'].fillna('No comment provided')

 

The KNN imputer is more sophisticated, looking at similar records to make a more educated guess for missing values.

# More advanced: use KNN imputation for multiple columns
from sklearn.impute import KNNImputer
numeric_cols = ['age', 'income', 'customer_rating', 'purchase_amount']
imputer = KNNImputer(n_neighbors=3)
df[numeric_cols] = pd.DataFrame(
    imputer.fit_transform(df[numeric_cols]), 
    columns=numeric_cols,
    index=df.index
)
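
As a quick, optional sanity check, confirm that the imputation left no gaps in the numeric columns:

# All numeric columns should now report zero missing values
print(df[numeric_cols].isnull().sum())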

 

2. Removing Duplicates

 
Duplicate records can skew statistics and often lead to incorrect conclusions.

First, we check for exact duplicates across all columns.

# Find exact duplicates across all columns
exact_duplicates = df.duplicated().sum()
print(f"Number of exact duplicate rows: {exact_duplicates}")

# Remove exact duplicates
df_unique = df.drop_duplicates()

 

Then we look for “functional” duplicates based on key identifiers like name and email.

The keep=False flag marks all duplicates, not just subsequent ones. Finally, we implement a smart deduplication strategy that keeps the most complete record (row with the fewest missing values) when duplicates are found.

# Find and examine potential functional duplicates based on key fields
potential_duplicates = df.duplicated(subset=['name', 'email'], keep=False)
print(f"Number of potential functional duplicates: {potential_duplicates.sum()}")
print("Potential duplicate records:")
print(df[potential_duplicates].sort_values('name'))

# Remove duplicates but keep the most complete record
# First sort by number of non-null values, then drop duplicates
df['completeness'] = df.notna().sum(axis=1)
df_sorted = df.sort_values('completeness', ascending=False)
df_clean = df_sorted.drop_duplicates(subset=['name', 'email'])
df_clean = df_clean.drop(columns=['completeness'])
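
A quick count (optional) shows how many rows were dropped and confirms that no name/email duplicates remain:

# Compare row counts before and after deduplication
print(f"Rows before: {len(df)}, rows after: {len(df_clean)}")
print(f"Remaining name/email duplicates: {df_clean.duplicated(subset=['name', 'email']).sum()}")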

 

3. Standardizing Text Data

 
Text inconsistencies create unnecessary variety that complicates analysis. Standardizing text fields helps ensure consistency.

We start by standardizing the case of text fields: title case for names, uppercase for countries, and lowercase for job titles.

# Convert case for consistency in categorical fields
df['name'] = df['name'].str.title()
df['country'] = df['country'].str.upper()
df['job_title'] = df['job_title'].str.lower()


Next, we use mapping dictionaries to standardize country names and gender values.

# Standardize country names using mapping
country_mapping = {
    'US': 'USA',
    'U.S.A.': 'USA',
    'United States': 'USA',
    'united states': 'USA',
    'United states': 'USA',
    'UNITED STATES': 'USA'  # covers values already upper-cased above
}

df['country'] = df['country'].replace(country_mapping)

# Standardize gender values
gender_mapping = {
    'M': 'Male',
    'm': 'Male',
    'Male': 'Male',
    'male': 'Male',
    'F': 'Female',
    'f': 'Female',
    'Female': 'Female',
    'female': 'Female'
}

df['gender'] = df['gender'].replace(gender_mapping)

 

Finally, we create a custom function to standardize education levels based on keywords, handling various formats and spellings.

# Standardize education level
def standardize_education(edu_str):
    if pd.isna(edu_str):
        return np.nan
    
    edu_str = str(edu_str).lower().strip()
    
    if 'bachelor' in edu_str:
        return "Bachelor's Degree"
    elif 'master' in edu_str or 'mba' in edu_str or 'msc' in edu_str:
        return "Master's Degree"
    elif 'phd' in edu_str or 'doctor' in edu_str:
        return "Doctorate"
    else:
        return "Other"

df['education'] = df['education'].apply(standardize_education)
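
A quick look at the result (optional) confirms the column now holds only the standardized categories:

# Inspect the standardized education categories
print(df['education'].value_counts(dropna=False))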

 

4. Managing Outliers

 
Outliers can dramatically skew statistics and model performance. Proper identification and handling of outliers is necessary.

First, we ensure our numeric columns are actually stored as numbers.

df['income'] = pd.to_numeric(df['income'], errors="coerce")
df['age'] = pd.to_numeric(df['age'], errors="coerce")

 

Then we apply two different outlier detection methods: Z-scores (which assumes normal distribution) and the IQR method (more robust to non-normal data).

# Detecting outliers using Z-score method
from scipy import stats
z_scores = stats.zscore(df['income'].dropna())
outliers_z = (abs(z_scores) > 3)
print(f"Z-score method identified {outliers_z.sum()} outliers in income")

# Detecting outliers using the IQR method
Q1 = df['income'].quantile(0.25)
Q3 = df['income'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers_iqr = (df['income'] < lower_bound) | (df['income'] > upper_bound)
print(f"IQR method identified {outliers_iqr.sum()} outliers in income")

# Visualize outliers
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.boxplot(df['income'].dropna())
plt.title('Income Distribution with Outliers')
plt.ylabel('Income')
plt.show()

 

Finally, we fix obviously incorrect values (negative income) and demonstrate winsorization—capping extreme values at percentile thresholds to reduce their impact without removing data.

# Fix negative values that shouldn't be negative, then winsorize: cap extreme
# values at percentile thresholds (5th/95th here, purely illustrative)
df.loc[df['income'] < 0, 'income'] = np.nan
df['income_winsorized'] = df['income'].clip(df['income'].quantile(0.05), df['income'].quantile(0.95))

 

5. Data Type Conversion

 
Ensuring correct data types improves performance and enables appropriate operations on each column.

We start by checking the current data types. Then we convert numeric columns from strings to actual numeric types, handling errors by converting them to NaN.

# Display current data types
print(df.dtypes)

# Convert string to numeric for appropriate columns
df['age'] = pd.to_numeric(df['age'], errors="coerce")
df['income'] = pd.to_numeric(df['income'], errors="coerce")
df['customer_rating'] = pd.to_numeric(df['customer_rating'], errors="coerce")
df['purchase_amount'] = pd.to_numeric(df['purchase_amount'], errors="coerce")

For dates, we create a function that tries multiple date formats, since our dataset has inconsistent date formatting. Once dates are properly formatted, we extract useful components like year and month, and calculate tenure.

# Date conversion with error handling
def parse_date(date_str):
    if pd.isna(date_str):
        return np.nan
    
    for fmt in ['%Y-%m-%d', '%Y/%m/%d', '%m/%d/%Y']:
        try:
            return pd.to_datetime(date_str, format=fmt)
        except:
            continue
    
    return pd.NaT 

df['start_date'] = df['start_date'].apply(parse_date)

# Extract useful components from dates
df['start_year'] = df['start_date'].dt.year
df['start_month'] = df['start_date'].dt.month
df['tenure_days'] = (pd.Timestamp('now') - df['start_date']).dt.days

 

Finally, we standardize phone numbers to a consistent format, first stripping non-digits and then reformatting.

# Standardize phone format
def standardize_phone(phone):
    if pd.isna(phone):
        return np.nan
    
    # Remove all non-digit characters
    digits_only = ''.join(c for c in str(phone) if c.isdigit())
    
    # Format as XXX-XXX-XXXX if we have 10 digits
    if len(digits_only) == 10:
        return f"{digits_only[:3]}-{digits_only[3:6]}-{digits_only[6:]}"
    else:
        return digits_only  # Return as-is if not 10 digits

df['phone_number'] = df['phone_number'].apply(standardize_phone)

 

6. Handling Inconsistent Categories

 
Categorical variables often suffer from inconsistency, misspellings, and too many unique values.

First, we examine the current distribution of job titles. Then we standardize job titles by mapping similar titles to common forms (e.g., “Sr. Developer” and “Senior Developer” both become “senior developer”).

# Examine current unique values in job titles
print(f"Original job title count: {df['job_title'].nunique()}")
print(df['job_title'].value_counts())

# Standardize job titles
job_mapping = {
    'sr. developer': 'senior developer',
    'senior developer': 'senior developer',
    'ux designer': 'designer',
    'regional manager': 'manager',
    'project manager': 'manager',
    'product manager': 'manager',
    'lead engineer': 'senior developer',
    'bi analyst': 'data analyst',
    'data scientist': 'data analyst',
    'hr specialist': 'specialist',
    'marketing specialist': 'specialist'
}

df['standardized_job'] = df['job_title'].str.lower().replace(job_mapping)

 
Next, we group departments into broader categories to simplify analysis.

# Group departments into broader categories
dept_categories = {
    'IT': 'Technology',
    'Engineering': 'Technology',
    'Analytics': 'Technology',
    'Design': 'Creative',
    'Marketing': 'Business',
    'Product': 'Business',
    'Executive': 'Management',
    'Human Resources': 'Operations',
    'Management': 'Management'
}

df['dept_category'] = df['department'].replace(dept_categories)

 

Finally, we demonstrate how to handle rare categories by grouping any job title that appears less than a certain number of times into an “Other” category.

# Handle rare categories by grouping them
value_counts = df['standardized_job'].value_counts()
threshold = 2  # Minimum count to keep as separate category
frequent_jobs = value_counts[value_counts >= threshold].index
df['job_grouped'] = df['standardized_job'].apply(
    lambda x: x if x in frequent_jobs else 'Other'
)
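
It’s worth reviewing the resulting distribution (optional check) to confirm the grouping behaves as expected:

# Review the consolidated job categories
print(df['job_grouped'].value_counts())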

 

7. Scaling and Normalization

 
Many machine learning algorithms perform poorly when features sit on very different scales, so bringing them onto a comparable scale is an important step.

We first select numeric columns and handle missing values.

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Select numeric columns for scaling
numeric_cols = ['age', 'income', 'customer_rating', 'purchase_amount']
numeric_df = df[numeric_cols].copy()

# Handle missing values before scaling
numeric_df = numeric_df.fillna(numeric_df.median())

 

Then we apply three different scaling techniques.

StandardScaler: Transforms data to have a mean of 0 and standard deviation of 1

# Standardization (Z-score normalization)
scaler = StandardScaler()
scaled_data = scaler.fit_transform(numeric_df)
df_scaled = pd.DataFrame(scaled_data, 
                         columns=[f"{col}_scaled" for col in numeric_cols],
                          index=df.index)

 

MinMaxScaler: Scales data to a specific range (by default [0, 1])

# Min-Max scaling to [0,1] range
min_max = MinMaxScaler()
minmax_data = min_max.fit_transform(numeric_df)
df_minmax = pd.DataFrame(minmax_data, 
                         columns=[f"{col}_minmax" for col in numeric_cols],
                         index=df.index)

 

RobustScaler: Uses median and quantiles, making it less sensitive to outliers

# Robust scaling (less affected by outliers)
robust = RobustScaler()
robust_data = robust.fit_transform(numeric_df)
df_robust = pd.DataFrame(robust_data, 
                         columns=[f"{col}_robust" for col in numeric_cols],
                         index=df.index)

# Combine original data with scaled versions
df_with_scaled = pd.concat([df, df_scaled, df_minmax, df_robust], axis=1)

# Compare the first few rows
print(df_with_scaled[['income', 'income_scaled', 'income_minmax', 'income_robust']].head())

 

We create new columns for each scaling method and then compare them side by side. This is particularly important when comparing features with vastly different scales, like income (thousands) vs. customer rating (1-5).
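
As a quick, optional sanity check, the standardized columns should show a mean near 0 and a standard deviation near 1, while the min-max columns should fall within [0, 1]:

# Verify that the scaling behaved as expected
scaled_cols = [f"{col}_scaled" for col in numeric_cols]
minmax_cols = [f"{col}_minmax" for col in numeric_cols]
print(df_with_scaled[scaled_cols].describe().loc[['mean', 'std']])
print(df_with_scaled[minmax_cols].describe().loc[['min', 'max']])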
 

8. String Cleaning and Regex

 
Regular expressions are super useful for cleaning text data and extracting information from it. For the sample dataset (and any dataset in general), we can implement several string cleaning techniques.

Standardizing case and splitting into first/last name components:

import re

# Clean names: remove extra spaces, standardize case
df['name_clean'] = df['name'].str.strip().str.title()

# Extract first and last name
def extract_names(full_name):
    if pd.isna(full_name):
        return pd.Series([np.nan, np.nan])
    
    parts = full_name.strip().split()
    if len(parts) >= 2:
        return pd.Series([parts[0], parts[-1]])
    else:
        return pd.Series([parts[0] if parts else np.nan, np.nan])

name_parts = df['name_clean'].apply(extract_names)
df['first_name'] = name_parts[0]
df['last_name'] = name_parts[1]

 

For email validation, checking against a regex pattern and attempting to fix common issues:

# Validate and clean email addresses
def clean_email(email):
    if pd.isna(email):
        return np.nan
    
    # Basic pattern for valid email
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    
    if re.match(pattern, email):
        return email.lower()
    else:
        # Try to fix common issues
        if '@' not in email:
            return np.nan  # Can't fix emails missing @
        
        # Add domain extension if missing
        if not re.search(r'\.[a-zA-Z]{2,}$', email):
            return email.lower() + '.com'
        
        return email.lower()

df['email_clean'] = df['email'].apply(clean_email)

 

Extracting email domain information:

# Extract domain from email
df['email_domain'] = df['email_clean'].str.extract(r'@([^@]+)$')

 

Converting various phone number formats to a consistent pattern:

# Standardize/extract phone numbers
def extract_phone(phone):
    if pd.isna(phone):
        return np.nan
    
    # Extract digits only
    digits = re.sub(r'\D', '', str(phone))
    
    # Ensure we have a 10-digit number
    if len(digits) == 10:
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    elif len(digits) > 10:  # Might include country code
        return f"({digits[-10:-7]}) {digits[-7:-4]}-{digits[-4:]}"
    else:
        return np.nan  # Not enough digits

df['phone_standardized'] = df['phone_number'].apply(extract_phone)

 

These techniques make text data more consistent and can extract structured information from unstructured or semi-structured fields.

 

9. Feature Engineering from Dirty Data

 
Sometimes the patterns in messy data are themselves informative, and you can turn them into new features instead of simply discarding them.

Missing value indicators: Create dummy variables showing which fields are missing

# Create indicators for missing values in key fields
missing_indicators = ['age', 'income', 'email', 'phone_number', 'comments']
for col in missing_indicators:
    df[f'{col}_is_missing'] = df[col].isnull().astype(int)

 

Data quality score: Calculate overall record completeness and categorize it

# Create data quality score
df['quality_score'] = df.notna().sum(axis=1) / len(df.columns) * 10
df['quality_category'] = pd.cut(
    df['quality_score'], 
    bins=[0, 6, 8, 10], 
    labels=['Poor', 'Average', 'Good']
)
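
A quick glance at the distribution (optional) shows how records spread across the quality tiers:

# See how many records fall into each quality tier
print(df['quality_category'].value_counts())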

 

Flag suspicious values, such as perfectly round incomes or ages and ratings outside reasonable ranges:

# Detect potential data entry errors
df['income_suspiciously_round'] = (df['income'] % 10000 == 0).astype(int)
df['age_out_of_range'] = ((df['age'] < 18) | (df['age'] > 80)).astype(int)  # bounds are illustrative
df['rating_out_of_range'] = ((df['customer_rating'] < 1) | (df['customer_rating'] > 5)).astype(int)
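
Summing the flag columns (optional check) gives a quick sense of how widespread each issue is:

# Count how many records trip each suspicious-value flag
flag_cols = ['income_suspiciously_round', 'age_out_of_range', 'rating_out_of_range']
print(df[flag_cols].sum())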

 

Instead of just cleaning dirty data, we extract useful information from it.

 

10. Dealing with Formatting Issues

 
Inconsistent formatting can cause problems when analyzing data or building models.

Here we address common formatting issues.

Converting various date formats to a consistent format:

# Standardize date formats
df['start_date_clean'] = pd.to_datetime(df['start_date'], errors="coerce")

# Format dates consistently for display
df['start_date_formatted'] = df['start_date_clean'].dt.strftime('%Y-%m-%d')

 

Removing currency symbols and separators before numeric conversion:

# Fix currency formatting issues
def clean_currency(amount):
    if pd.isna(amount):
        return np.nan
    
    if isinstance(amount, (int, float)):
        return amount
    
    # Remove currency symbols and commas
    amount_str = str(amount)
    amount_str = re.sub(r'[$,]', '', amount_str)
    
    # Convert to float
    try:
        return float(amount_str)
    except:
        return np.nan

# If purchase_amount might have formatting issues
df['purchase_amount_clean'] = df['purchase_amount'].apply(clean_currency)

 

Creating consistently formatted versions of values for reporting:

# Format numbers for consistency
df['income_formatted'] = df['income'].apply(
    lambda x: f"${x:,.2f}" if not pd.isna(x) else ""
)

 

Consistent formatting is particularly important for data that will be presented to end-users or used in reports.

After applying multiple cleaning techniques, we create a final clean dataset ready for analysis.
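
Here is a minimal sketch of what that final step might look like, assuming you keep the cleaned and standardized columns created above and write the result out for analysis (the column selection and output filename are just examples; adjust them to your own workflow):

# Illustrative final step: keep the cleaned columns and save the result
clean_cols = [
    'name_clean', 'email_clean', 'phone_standardized', 'age', 'income',
    'standardized_job', 'dept_category', 'education', 'start_date_clean',
    'purchase_amount_clean', 'quality_score'
]
df_final = df[clean_cols].copy()
df_final.to_csv('clean_data.csv', index=False)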

 

Wrapping Up

 
We’ve covered practical techniques that address the most common data quality issues you’ll encounter in real-world datasets. These methods help transform problematic data into reliable analytical assets.

Data cleaning is an iterative process—each step often gives new insights about your data’s structure and quality. The patterns shared here can be adapted to your specific needs and incorporated into automated workflows.

Happy data cleaning!
 
 

Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.


