10 Pandas One-Liners for Data Cleaning




In the real world, data rarely comes in a clean, analysis-ready format. Whether you’re working with CSV exports, API responses, or database dumps, you’ll encounter inconsistencies, missing values, and formatting issues that need to be addressed before you can extract meaningful insights.

The good news? Pandas makes data cleaning straightforward with its expressive syntax. This article goes over 10 useful pandas one-liners that can transform your messy data into clean, usable datasets with minimal code.

🔗 You can find the code on GitHub

First, you can run generate_df.py to spin up a sample dataframe of customer orders and code along. If you'd rather not clone the repo, the sketch below builds a comparable frame.
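
Here is a minimal, hypothetical stand-in for generate_df.py: it hard-codes five rows with the same columns and the same kinds of problems. The script in the repo generates a larger, randomized frame, so your exact values will differ.

import pandas as pd

# Hypothetical stand-in for generate_df.py: same columns, same kinds of mess
df = pd.DataFrame({
    'order_date': pd.to_datetime(['2024-04-12', '2024-12-14', '2024-09-27',
                                  '2024-04-16', '2024-03-12']),
    'customer_id': ['7376', 'Customer 3393', 'C8240', 'C7190', 'CUS-7493'],
    'email': ['user208@hotmail.com', 'user349company.com',
              'user958@company.com ', 'user951@yahoo.com', 'user519@yahoo.com'],
    'product_id': ['PROD-642', 'PROD-626', 'PROD-645', 'PROD-942', 'PROD-115'],
    'quantity': [5.0, 9.0, 874.0, 1.0, 7.0],
    'price': [27.347234, 99.343948, 77.172318, 147.403597, '159'],
    'shipping_status': ['DELIVERED', 'Shipped', 'In Transit', 'Shipped', 'delivered'],
})
print(df.head())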

 

Output:

  order_date    customer_id                  email product_id  quantity  \
0 2024-04-12           7376    user208@hotmail.com   PROD-642       5.0   
1 2024-12-14  Customer 3393     user349company.com   PROD-626       9.0   
2 2024-09-27          C8240   user958@company.com    PROD-645     874.0   
3 2024-04-16          C7190      user951@yahoo.com   PROD-942       1.0   
4 2024-03-12       CUS-7493      user519@yahoo.com   PROD-115       7.0   

        price shipping_status  
0   27.347234       DELIVERED  
1   99.343948         Shipped  
2   77.172318      In Transit  
3  147.403597         Shipped  
4         159       delivered  

 

Our dataset contains:

  • Inconsistent customer ID formats
  • Email addresses with typos and formatting issues
  • Missing product IDs
  • Outliers in the quantity column
  • Prices in different formats (some with $ signs, some negative)
  • Inconsistent shipping status values
  • Completely blank rows and duplicates

Now, let’s tackle these issues one by one with pandas one-liners.

 

1. Drop Rows with All Missing Values

 

This one-liner removes any row where all values are missing. It's often a good first step in data cleaning, as completely empty rows provide no useful information and can skew your analysis.

The how='all' parameter ensures we only drop rows where every single column is empty, preserving rows that have at least some data.


df_clean = df.dropna(how='all')
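
To sanity-check the step, compare row counts; note that the default dropna() is far more aggressive and would also discard partially filled rows:

print(f"Rows before: {len(df)}, after dropping blank rows: {len(df_clean)}")
# For contrast: the default dropna() drops any row with even one missing value
print(f"Default dropna() would keep only {len(df.dropna())} rows")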

 

2. Standardize Text Case and Remove Whitespace

 

Inconsistent text case and extra whitespace are common issues in text data.

This one-liner standardizes the shipping status by converting all values to lowercase and removing any leading or trailing whitespace.

df_clean.loc[:, 'shipping_status'] = df_clean['shipping_status'].str.lower().str.strip()
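
Comparing the unique values before and after shows the variants collapsing (your exact labels will depend on the generated data):

print(df['shipping_status'].unique())        # e.g. ['DELIVERED', 'Shipped', 'In Transit', ...]
print(df_clean['shipping_status'].unique())  # e.g. ['delivered', 'shipped', 'in transit', ...]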

 

 

3. Extract Patterns with Regular Expressions

 

When dealing with inconsistent ID formats, regular expressions can be useful. This one-liner extracts just the numeric portion from various customer ID formats (like “CUS-1234”, “C1234”, “Customer 1234”) and standardizes them all to “CUS-1234” format.


df_clean.loc[:,'customer_id'] = 'CUS-' + df_clean['customer_id'].str.extract(r'(\d+)', expand=False).fillna('0000')

 

The fillna('0000') ensures we have a valid ID even if the extraction fails.
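
Here is the extraction on a few representative formats, as a standalone sketch (expand=False makes str.extract return a Series rather than a one-column DataFrame, so it can be assigned back to the column):

ids = pd.Series(['CUS-7493', 'C8240', 'Customer 3393', '7376', 'unknown'])
print('CUS-' + ids.str.extract(r'(\d+)', expand=False).fillna('0000'))
# -> CUS-7493, CUS-8240, CUS-3393, CUS-7376, CUS-0000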

 

4. Convert Mixed Data Types to Numeric

 

Dealing with mixed data types is one of the most common challenges in data cleaning. This one-liner converts all price values to numeric by first converting everything to strings, removing dollar signs, and then converting to numbers.

df_clean.loc[:,'price'] = pd.to_numeric(df_clean['price'].astype(str).str.replace('$', '', regex=False), errors="coerce")

 

The errors="coerce" parameter ensures that any values that can’t be converted become NaN rather than causing an error.
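
The same call handles every shape you're likely to meet in the column, as this standalone sketch shows (regex=False treats the dollar sign as a literal character rather than a regex anchor):

prices = pd.Series(['$19.99', 159, 'invalid', None])
print(pd.to_numeric(prices.astype(str).str.replace('$', '', regex=False), errors="coerce"))
# -> 19.99, 159.0, NaN, NaN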

 

5. Fix Common Email Formatting Issues

 

Email validation is important for data quality. This one-liner fixes common email formatting issues by removing extra spaces and adding the @ symbol where it’s missing.

df_clean.loc[:,'email'] = df_clean['email'].str.strip().str.replace(r'^([^@]+?)([a-zA-Z]+\.com)$', r'\1@\2', regex=True)

 

The anchored regular expression matches only strings that lack an @ symbol entirely and inserts one between the user portion and the domain; addresses that already contain an @ can't match the pattern and pass through unchanged.
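
A quick check on one broken and one already-valid address (standalone sketch):

emails = pd.Series(['user349company.com', ' user951@yahoo.com '])
print(emails.str.strip().str.replace(r'^([^@]+?)([a-zA-Z]+\.com)$', r'\1@\2', regex=True))
# -> user349@company.com, user951@yahoo.com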

 

6. Handle Outliers Using the IQR Method

 

Outliers can significantly skew your analysis. This one-liner caps extreme values using the Interquartile Range (IQR) method, a robust statistical technique.

df_clean.loc[:,'quantity'] = df_clean['quantity'].clip(upper=df_clean['quantity'].quantile(0.75) + 1.5 * (df_clean['quantity'].quantile(0.75) - df_clean['quantity'].quantile(0.25)))

 

It calculates the upper bound as Q3 + 1.5 * IQR and clips any values above this threshold. This preserves the data distribution while preventing extreme values from distorting your analysis.
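
Spelled out step by step, the one-liner is equivalent to:

q1 = df_clean['quantity'].quantile(0.25)  # 25th percentile
q3 = df_clean['quantity'].quantile(0.75)  # 75th percentile
upper_bound = q3 + 1.5 * (q3 - q1)        # Q3 + 1.5 * IQR
df_clean.loc[:, 'quantity'] = df_clean['quantity'].clip(upper=upper_bound)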

 

7. Standardize Categories with Mapping

 

Categorical variables often come with inconsistent naming conventions. This one-liner uses a dictionary to map various representations of the same category to a standard format.

df_clean.loc[:,'shipping_status'] = df_clean['shipping_status'].replace({'in transit': 'in_transit', 'in-transit': 'in_transit', 'shipped': 'shipped', 'delivered': 'delivered', 'pending': 'pending'})

 

This standardization is crucial for accurate aggregation and analysis, ensuring that “in transit”, “In Transit”, and “in-transit” are all treated as the same category.
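
A quick value_counts() confirms the collapse; note that replace() leaves any unmapped value as-is, whereas map() would turn it into NaN:

print(df_clean['shipping_status'].value_counts())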

 

8. Remove Duplicates Based on Specific Columns

 

Duplicate records can inflate your counts and skew your analysis. This one-liner removes duplicates based on a combination of key columns.

Unlike a bare drop_duplicates(), which only removes rows that are identical in every column, this approach catches records that share the key identifying fields even when other columns differ slightly, while preserving rows that genuinely represent different orders.

df_clean = df_clean.drop_duplicates(subset=['customer_id', 'order_date', 'product_id'])

 

It’s particularly useful for transaction data where the same customer might order the same product on different dates.
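
By default the first occurrence is kept. If you'd rather keep the most recent record, sort by date first and pass keep='last' (df_latest here is just an illustrative name):

# Keep the latest record for each (customer, date, product) combination
df_latest = df_clean.sort_values('order_date').drop_duplicates(
    subset=['customer_id', 'order_date', 'product_id'], keep='last')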

 

9. Create Validation Flags for Data Quality

 

Data validation is essential for maintaining data quality. This one-liner creates a boolean flag indicating whether each email address matches a standard email format.

These flags can be used to filter out invalid records or to report on data quality.

df_clean['is_valid_email'] = df_clean['email'].str.contains(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$', na=False)

 

Similar validation can be applied to phone numbers, postal codes, or any other formatted data.
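
The flag makes both reporting and filtering one-liners themselves (na=False in the check above means rows with a missing email are simply flagged invalid rather than NaN):

print(f"Invalid emails: {(~df_clean['is_valid_email']).sum()}")
valid_only = df_clean[df_clean['is_valid_email']]  # keep only well-formed addresses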

 

10. Fill Missing Values with Forward Fill

 

When dealing with time series or sequential data, forward filling can be a smart way to handle missing values. This one-liner first sorts the data by date, then groups by customer ID, and fills missing values with the previous valid value for each customer.

# groupby().ffill() returns every column except the grouping key, so assign the
# filled columns back rather than overwriting the frame (which would lose customer_id)
df_clean[df_clean.columns.drop('customer_id')] = df_clean.sort_values('order_date').groupby('customer_id').ffill()

 

This approach preserves the temporal relationship in your data and is more meaningful than filling with means or constants.
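
A final check shows what ffill could not fix, since a customer's first order has no earlier value to copy from:

# Any remaining NaNs sit at the start of a customer's history
print(df_clean.isna().sum())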

 

Wrapping Up

 

With these pandas one-liners, you can tackle the most common data quality issues effectively. Remember, the key to successful data cleaning is understanding your data first, then applying targeted transformations to address specific issues.

Before applying these techniques to your own data, always take time to:

  1. Explore your data
  2. Understand the business context and requirements
  3. Document your cleaning steps for reproducibility

Happy data cleaning!
 
 

Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.


