Image by Author | Ideogram
Data cleaning is often not the most enjoyable task! Yet ask any seasoned data professional about their most valuable skill, and many will tell you it’s their ability to wrangle messy data into something useful.
This article covers essential cleaning techniques that are part of all successful data projects. I’ve included practical code examples and a sample dataset so you can follow along and apply these methods to your own work.
Let’s get started.
Note: You can use this sample messy dataset to follow along with this article.
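If you don't have the sample file handy, a tiny stand-in DataFrame with the same columns used throughout this article works just as well for experimenting. The snippet below is a minimal sketch: the values are made up, and it writes a messy_data.csv so the later snippets can load it.
import pandas as pd
import numpy as np
# A tiny illustrative stand-in for the sample messy dataset
# (columns mirror the ones used in this article; the values are made up)
df = pd.DataFrame({
    'name': ['  jane doe ', 'JOHN SMITH', 'Jane Doe', None],
    'email': ['jane@example.com', 'john@example', 'jane@example.com', np.nan],
    'age': [34, np.nan, 34, 29],
    'income': [65000, np.nan, 65000, 72000],
    'gender': ['F', 'male', 'Female', None],
    'customer_rating': [4, 7, 4, np.nan],
    'purchase_amount': ['$120.50', '89', '$120.50', None],
    'comments': [None, 'Repeat customer', None, None],
    'country': ['US', 'United States', 'U.S.A.', 'USA'],
    'job_title': ['Sr. Developer', 'project manager', 'Senior Developer', None],
    'education': ['bachelor of science', 'MBA', 'Bachelors', None],
    'start_date': ['2021-03-15', '03/20/2022', '2021/03/15', None],
    'phone_number': ['(555) 123-4567', '555.987.6543', '5551234567', None],
    'department': ['IT', 'Product', 'Engineering', 'Marketing']
})
df.to_csv('messy_data.csv', index=False)  # so the snippets below can load it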
1. Handling Missing Values
Missing data is perhaps the most common problem in real-world data. It can skew your analysis and make your results unreliable.
Identifying and dealing with missing values is, therefore, the first data cleaning step.
Identification
First, quantify and understand your missing data patterns:
import pandas as pd
import numpy as np
# Load our sample dataset
df = pd.read_csv('messy_data.csv')
# Get overall missing values count and percentage
missing_count = df.isnull().sum()
missing_percentage = (df.isnull().sum() / len(df)) * 100
# Display columns with missing values
missing_data = pd.concat([missing_count, missing_percentage], axis=1,
keys=['Count', 'Percentage'])
print(missing_data[missing_data['Count'] > 0].sort_values('Count', ascending=False))
This snippet loads our dataset, counts missing values in each column, calculates what percentage they represent, and then displays only the columns containing missing values, sorted by the number of missing entries.
Deletion Strategies
Sometimes removing rows or columns with missing values (when imputation isn't appropriate) is the most practical option.
The first line removes rows that are completely empty.
df_clean = df.dropna(how='all')
The second removes rows where any of the critical columns (name, email, age) are missing.
df_clean = df.dropna(subset=['name', 'email', 'age'])
The third removes entire columns if they have more than 50% missing values.
threshold = len(df) * 0.5
df_clean = df.dropna(axis=1, thresh=threshold)
Imputation Methods
Imputation involves replacing missing values with meaningful estimates.
Different imputation strategies suit different kinds of data. For numeric data like age, we use the median (less affected by outliers); for income, we use the mean; and for a discrete score like customer_rating, we use the mode.
df['age'] = df['age'].fillna(df['age'].median())
df['income'] = df['income'].fillna(df['income'].mean())
df['customer_rating'] = df['customer_rating'].fillna(df['customer_rating'].mode()[0])
For categorical data, we use the mode (most common value) or a meaningful default.
df['gender'] = df['gender'].fillna(df['gender'].mode()[0])
df['comments'] = df['comments'].fillna('No comment provided')
The KNN imputer is more sophisticated, looking at similar records to make a more educated guess for missing values.
# More advanced: use KNN imputation for multiple columns
from sklearn.impute import KNNImputer
numeric_cols = ['age', 'income', 'customer_rating', 'purchase_amount']
imputer = KNNImputer(n_neighbors=3)
df[numeric_cols] = pd.DataFrame(
imputer.fit_transform(df[numeric_cols]),
columns=numeric_cols,
index=df.index
)
2. Removing Duplicates
Duplicate records can skew statistics and often lead to incorrect conclusions.
First, we check for exact duplicates across all columns.
# Find exact duplicates across all columns
exact_duplicates = df.duplicated().sum()
print(f"Number of exact duplicate rows: exact_duplicates")
# Remove exact duplicates
df_unique = df.drop_duplicates()
Then we look for “functional” duplicates based on key identifiers like name and email.
The keep=False flag marks all duplicates, not just subsequent ones. Finally, we implement a smart deduplication strategy that keeps the most complete record (row with the fewest missing values) when duplicates are found.
# Find and examine potential functional duplicates based on key fields
potential_duplicates = df.duplicated(subset=['name', 'email'], keep=False)
print(f"Number of potential functional duplicates: potential_duplicates.sum()")
print("Potential duplicate records:")
print(df[potential_duplicates].sort_values('name'))
# Remove duplicates but keep the most complete record
# First sort by number of non-null values, then drop duplicates
df['completeness'] = df.notna().sum(axis=1)
df_sorted = df.sort_values('completeness', ascending=False)
df_clean = df_sorted.drop_duplicates(subset=['name', 'email'])
df_clean = df_clean.drop(columns=['completeness'])
3. Standardizing Text Data
Text inconsistencies create unnecessary variety that complicates analysis. Standardizing text fields helps ensure consistency.
We start by standardizing the case of text fields: title case for names, uppercase for countries, and lowercase for job titles.
# Convert case for consistency in categorical fields
df['name'] = df['name'].str.title()
df['country'] = df['country'].str.upper()
df['job_title'] = df['job_title'].str.lower()
 
Next, we use mapping dictionaries to standardize country names and gender values.
# Standardize country names using mapping
country_mapping = {
    'US': 'USA',
    'U.S.A.': 'USA',
    'United States': 'USA',
    'united states': 'USA',
    'United states': 'USA'
}
df['country'] = df['country'].replace(country_mapping)
# Standardize gender values
gender_mapping = {
    'M': 'Male',
    'm': 'Male',
    'Male': 'Male',
    'male': 'Male',
    'F': 'Female',
    'f': 'Female',
    'Female': 'Female',
    'female': 'Female'
}
df['gender'] = df['gender'].replace(gender_mapping)
Finally, we create a custom function to standardize education levels based on keywords, handling various formats and spellings.
# Standardize education level
def standardize_education(edu_str):
if pd.isna(edu_str):
return np.nan
edu_str = str(edu_str).lower().strip()
if 'bachelor' in edu_str:
return "Bachelor's Degree"
elif 'master' in edu_str or 'mba' in edu_str or 'msc' in edu_str:
return "Master's Degree"
elif 'phd' in edu_str or 'doctor' in edu_str:
return "Doctorate"
else:
return "Other"
df['education'] = df['education'].apply(standardize_education)
4. Managing Outliers
Outliers can dramatically skew statistics and model performance. Proper identification and handling of outliers is necessary.
First, we ensure our numeric columns are actually stored as numbers.
df['income'] = pd.to_numeric(df['income'], errors="coerce")
df['age'] = pd.to_numeric(df['age'], errors="coerce")
Then we apply two different outlier detection methods: Z-scores (which assumes normal distribution) and the IQR method (more robust to non-normal data).
# Detecting outliers using Z-score method
from scipy import stats
z_scores = stats.zscore(df['income'].dropna())
outliers_z = (abs(z_scores) > 3)
print(f"Z-score method identified {outliers_z.sum()} outliers in income")
# Detecting outliers using the IQR method
Q1 = df['income'].quantile(0.25)
Q3 = df['income'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers_iqr = (df['income'] < lower_bound) | (df['income'] > upper_bound)
print(f"IQR method identified {outliers_iqr.sum()} outliers in income")
# Visualize outliers
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.boxplot(df['income'].dropna())
plt.title('Income Distribution with Outliers')
plt.ylabel('Income')
plt.show()
Finally, we fix obviously incorrect values (negative income) and demonstrate winsorization—capping extreme values at percentile thresholds to reduce their impact without removing data.
# Fix negative values that shouldn't be negative (treat them as missing)
df.loc[df['income'] < 0, 'income'] = np.nan
# Winsorize: cap extreme incomes at percentile thresholds
# (the 5th and 95th percentiles here are illustrative cutoffs)
lower_cap = df['income'].quantile(0.05)
upper_cap = df['income'].quantile(0.95)
df['income_winsorized'] = df['income'].clip(lower=lower_cap, upper=upper_cap)
5. Data Type Conversion
Ensuring correct data types improves performance and enables appropriate operations on each column.
We start by checking the current data types. Then we convert numeric columns from strings to actual numeric types, handling errors by converting them to NaN.
# Display current data types
print(df.dtypes)
# Convert string to numeric for appropriate columns
df['age'] = pd.to_numeric(df['age'], errors="coerce")
df['income'] = pd.to_numeric(df['income'], errors="coerce")
df['customer_rating'] = pd.to_numeric(df['customer_rating'], errors="coerce")
df['purchase_amount'] = pd.to_numeric(df['purchase_amount'], errors="coerce")
For dates, we create a function that tries multiple date formats, since our dataset has inconsistent date formatting. Once dates are properly formatted, we extract useful components like year and month, and calculate tenure.
# Date conversion with error handling
def parse_date(date_str):
if pd.isna(date_str):
return np.nan
for fmt in ['%Y-%m-%d', '%Y/%m/%d', '%m/%d/%Y']:
try:
return pd.to_datetime(date_str, format=fmt)
except:
continue
return pd.NaT
df['start_date'] = df['start_date'].apply(parse_date)
# Extract useful components from dates
df['start_year'] = df['start_date'].dt.year
df['start_month'] = df['start_date'].dt.month
df['tenure_days'] = (pd.Timestamp('now') - df['start_date']).dt.days
Finally, we standardize phone numbers to a consistent format, first stripping non-digits and then reformatting.
# Standardize phone format
def standardize_phone(phone):
    if pd.isna(phone):
        return np.nan
    # Remove all non-digit characters
    digits_only = ''.join(c for c in str(phone) if c.isdigit())
    # Format as XXX-XXX-XXXX if we have 10 digits
    if len(digits_only) == 10:
        return f"{digits_only[:3]}-{digits_only[3:6]}-{digits_only[6:]}"
    else:
        return digits_only  # Return as-is if not 10 digits
df['phone_number'] = df['phone_number'].apply(standardize_phone)
6. Handling Inconsistent Categories
Categorical variables often suffer from inconsistency, misspellings, and too many unique values.
First, we examine the current distribution of job titles. Then we standardize job titles by mapping similar titles to common forms (e.g., “Sr. Developer” and “Senior Developer” both become “senior developer”).
# Examine current unique values in job titles
print(f"Original job title count: {df['job_title'].nunique()}")
print(df['job_title'].value_counts())
# Standardize job titles
job_mapping = {
    'sr. developer': 'senior developer',
    'senior developer': 'senior developer',
    'ux designer': 'designer',
    'regional manager': 'manager',
    'project manager': 'manager',
    'product manager': 'manager',
    'lead engineer': 'senior developer',
    'bi analyst': 'data analyst',
    'data scientist': 'data analyst',
    'hr specialist': 'specialist',
    'marketing specialist': 'specialist'
}
df['standardized_job'] = df['job_title'].str.lower().replace(job_mapping)
Next, we group departments into broader categories to simplify analysis.
# Group departments into broader categories
dept_categories = {
    'IT': 'Technology',
    'Engineering': 'Technology',
    'Analytics': 'Technology',
    'Design': 'Creative',
    'Marketing': 'Business',
    'Product': 'Business',
    'Executive': 'Management',
    'Human Resources': 'Operations',
    'Management': 'Management'
}
df['dept_category'] = df['department'].replace(dept_categories)
Finally, we demonstrate how to handle rare categories by grouping any job title that appears fewer than a threshold number of times into an "Other" category.
# Handle rare categories by grouping them
value_counts = df['standardized_job'].value_counts()
threshold = 2 # Minimum count to keep as separate category
frequent_jobs = value_counts[value_counts >= threshold].index
df['job_grouped'] = df['standardized_job'].apply(
lambda x: x if x in frequent_jobs else 'Other'
)
7. Scaling and Normalization
Features on very different scales can cause problems for many machine learning algorithms, so it helps to bring them onto comparable scales.
We first select numeric columns and handle missing values.
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
# Select numeric columns for scaling
numeric_cols = ['age', 'income', 'customer_rating', 'purchase_amount']
numeric_df = df[numeric_cols].copy()
# Handle missing values before scaling
numeric_df = numeric_df.fillna(numeric_df.median())
Then we apply three different scaling techniques.
StandardScaler: Transforms data to have a mean of 0 and standard deviation of 1
# Standardization (Z-score normalization)
scaler = StandardScaler()
scaled_data = scaler.fit_transform(numeric_df)
df_scaled = pd.DataFrame(scaled_data,
                         columns=[f"{col}_scaled" for col in numeric_cols],
                         index=df.index)
MinMaxScaler: To scale data to a specific range (default [0,1])
# Min-Max scaling to [0,1] range
min_max = MinMaxScaler()
minmax_data = min_max.fit_transform(numeric_df)
df_minmax = pd.DataFrame(minmax_data,
                         columns=[f"{col}_minmax" for col in numeric_cols],
                         index=df.index)
RobustScaler: Uses median and quantiles, making it less sensitive to outliers
# Robust scaling (less affected by outliers)
robust = RobustScaler()
robust_data = robust.fit_transform(numeric_df)
df_robust = pd.DataFrame(robust_data,
                         columns=[f"{col}_robust" for col in numeric_cols],
                         index=df.index)
# Combine original data with scaled versions
df_with_scaled = pd.concat([df, df_scaled, df_minmax, df_robust], axis=1)
# Compare the first few rows
print(df_with_scaled[['income', 'income_scaled', 'income_minmax', 'income_robust']].head())
We create new columns for each scaling method and then compare them side by side. This is particularly important when comparing features with vastly different scales, like income (thousands) vs. customer rating (1-5).
8. String Cleaning and Regex
Regular expressions are super useful for cleaning text data and extracting information from it. For the sample dataset (and any dataset in general), we can implement several string cleaning techniques.
Standardizing case and splitting into first/last name components:
import re
# Clean names: remove extra spaces, standardize case
df['name_clean'] = df['name'].str.strip().str.title()
# Extract first and last name
def extract_names(full_name):
if pd.isna(full_name):
return pd.Series([np.nan, np.nan])
parts = full_name.strip().split()
if len(parts) >= 2:
return pd.Series([parts[0], parts[-1]])
else:
return pd.Series([parts[0] if parts else np.nan, np.nan])
name_parts = df['name_clean'].apply(extract_names)
df['first_name'] = name_parts[0]
df['last_name'] = name_parts[1]
For email validation, checking against a regex pattern and attempting to fix common issues:
# Validate and clean email addresses
def clean_email(email):
    if pd.isna(email):
        return np.nan
    # Basic pattern for a valid email
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    if re.match(pattern, email):
        return email.lower()
    else:
        # Try to fix common issues
        if '@' not in email:
            return np.nan  # Can't fix emails missing @
        # Add domain extension if missing
        if not re.search(r'\.[a-zA-Z]{2,}$', email):
            return email.lower() + '.com'
        return email.lower()
df['email_clean'] = df['email'].apply(clean_email)
Extracting email domain information:
# Extract domain from email
df['email_domain'] = df['email_clean'].str.extract(r'@([^@]+)$')
Converting various phone number formats to a consistent pattern:
# Standardize/extract phone numbers
def extract_phone(phone):
    if pd.isna(phone):
        return np.nan
    # Extract digits only
    digits = re.sub(r'\D', '', str(phone))
    # Ensure we have a 10-digit number
    if len(digits) == 10:
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    elif len(digits) > 10:  # Might include country code
        return f"({digits[-10:-7]}) {digits[-7:-4]}-{digits[-4:]}"
    else:
        return np.nan  # Not enough digits
df['phone_standardized'] = df['phone_number'].apply(extract_phone)
These techniques make text data more consistent and can extract structured information from unstructured or semi-structured fields.
9. Feature Engineering from Dirty Data
Sometimes the patterns in messy data are themselves informative, and you can capture them as new features.
Missing value indicators: Create dummy variables showing which fields are missing
# Create indicators for missing values in key fields
missing_indicators = ['age', 'income', 'email', 'phone_number', 'comments']
for col in missing_indicators:
    df[f'{col}_is_missing'] = df[col].isnull().astype(int)
Data quality score: Calculate overall record completeness and categorize it
# Create data quality score
df['quality_score'] = df.notna().sum(axis=1) / len(df.columns) * 10
df['quality_category'] = pd.cut(
df['quality_score'],
bins=[0, 6, 8, 10],
labels=['Poor', 'Average', 'Good']
)
Flag suspicious values, like perfectly round incomes or ages and ratings outside reasonable ranges:
# Detect potential data entry errors
df['income_suspiciously_round'] = (df['income'] % 10000 == 0).astype(int)
df['age_out_of_range'] = ((df['age'] < 18) | (df['age'] > 80)).astype(int)  # assumed reasonable bounds; adjust to your data
df['rating_out_of_range'] = ((df['customer_rating'] < 1) | (df['customer_rating'] > 5)).astype(int)
Instead of just cleaning dirty data, we extract useful information from it.
10. Dealing with Formatting Issues
Inconsistent formatting can cause problems when analyzing data or building models.
Here we address common formatting issues.
Converting various date formats to a consistent format:
# Standardize date formats
df['start_date_clean'] = pd.to_datetime(df['start_date'], errors="coerce")
# Format dates consistently for display
df['start_date_formatted'] = df['start_date_clean'].dt.strftime('%Y-%m-%d')
Removing currency symbols and separators before numeric conversion:
# Fix currency formatting issues
def clean_currency(amount):
if pd.isna(amount):
return np.nan
if isinstance(amount, (int, float)):
return amount
# Remove currency symbols and commas
amount_str = str(amount)
amount_str = re.sub(r'[$,]', '', amount_str)
# Convert to float
try:
return float(amount_str)
except:
return np.nan
# If purchase_amount might have formatting issues
df['purchase_amount_clean'] = df['purchase_amount'].apply(clean_currency)
Creating consistently formatted versions of values for reporting:
# Format numbers for consistency
df['income_formatted'] = df['income'].apply(
    lambda x: f"${x:,.2f}" if not pd.isna(x) else ""
)
Consistent formatting is particularly important for data that will be presented to end-users or used in reports.
After applying multiple cleaning techniques, we create a final clean dataset ready for analysis.
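As a rough sketch of what that final step might look like, the snippet below gathers the cleaned columns created throughout this article (keeping only those that actually exist in your DataFrame) and writes them out. The column list and the clean_data.csv filename are just examples.
# Assemble a final cleaned dataset from the columns created above
# (column names follow this article's examples; adjust them to your own results)
clean_columns = [
    'name_clean', 'first_name', 'last_name', 'email_clean', 'email_domain',
    'phone_standardized', 'age', 'income', 'income_winsorized',
    'customer_rating', 'purchase_amount_clean', 'gender', 'country',
    'standardized_job', 'dept_category', 'education',
    'start_date_clean', 'tenure_days', 'quality_score'
]
df_final = df[[col for col in clean_columns if col in df.columns]].copy()
df_final.to_csv('clean_data.csv', index=False)
print(f"Final dataset: {df_final.shape[0]} rows, {df_final.shape[1]} columns")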
Wrapping Up
We’ve covered practical techniques that address the most common data quality issues you’ll encounter in real-world datasets. These methods help transform problematic data into reliable analytical assets.
Data cleaning is an iterative process—each step often gives new insights about your data’s structure and quality. The patterns shared here can be adapted to your specific needs and incorporated into automated workflows.
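As an example of that last point, here is a minimal sketch of how a few of the steps above could be folded into a single reusable function; the steps, columns, and defaults are illustrative, so adapt them to your own data.
import pandas as pd
# A minimal cleaning pipeline assembled from steps shown in this article
# (the steps, columns, and defaults are illustrative; adapt them to your data)
def clean_pipeline(df):
    df = df.copy()
    # Drop fully empty rows and exact duplicates
    df = df.dropna(how='all').drop_duplicates()
    # Coerce numeric columns and impute with the median
    for col in ['age', 'income', 'customer_rating', 'purchase_amount']:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors='coerce')
            df[col] = df[col].fillna(df[col].median())
    # Light text standardization
    if 'name' in df.columns:
        df['name'] = df['name'].str.strip().str.title()
    if 'country' in df.columns:
        df['country'] = df['country'].str.upper()
    return df

df_clean = clean_pipeline(pd.read_csv('messy_data.csv'))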
Happy data cleaning!
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.