Image by Author | Segmind SSD-1B Model
Almost all data projects start with messy real-world data. Before you get into analysis or building models, you need to make sure your dataset is in good shape. Luckily, pandas makes it super easy to spot and fix common issues like missing values, duplicates, or inconsistent formatting—all with just a line of code.
In this article, we’ll explore 10 essential one-liners to help you identify common issues such as missing values, incorrect data types, out-of-range values, inconsistent entries, and duplicate records. Let’s get started.
Sample DataFrame
Here’s a small sample dataset simulating e-commerce transactions with common data quality issues such as missing values, inconsistent formatting, and potential outliers:
import pandas as pd
import numpy as np
# Sample e-commerce transaction data
data = {
    "TransactionID": [101, 102, 103, 104, 105],
    "CustomerName": ["Jane Rust", "june young", "Jane Rust", None, "JUNE YOUNG"],
    "Product": ["Laptop", "Phone", "Laptop", "Tablet", "Phone"],
    "Price": [1200, 800, 1200, -300, 850],  # Negative value indicates an issue
    "Quantity": [1, 2, None, 1, 1],  # Missing value
    "TransactionDate": ["2024-12-01", "2024/12/01", "01-12-2024", None, "2024-12-01"],  # Inconsistent formats
}

df = pd.DataFrame(data)
# Display the DataFrame
print(df)
This is the dataframe we’ll be working with:
TransactionID CustomerName Product Price Quantity TransactionDate
0 101 Jane Rust Laptop 1200 1.0 2024-12-01
1 102 june young Phone 800 2.0 2024/12/01
2 103 Jane Rust Laptop 1200 NaN 01-12-2024
3 104 None Tablet -300 1.0 None
4 105 JUNE YOUNG Phone 850 1.0 2024-12-01
Before going ahead, let’s get some basic info on the dataframe:
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 TransactionID 5 non-null int64
1 CustomerName 4 non-null object
2 Product 5 non-null object
3 Price 5 non-null int64
4 Quantity 4 non-null float64
5 TransactionDate 4 non-null object
dtypes: float64(1), int64(2), object(3)
memory usage: 368.0+ bytes
1. Check for Missing Values
This one-liner checks each column for missing values and sums them up.
missing_values = df.isnull().sum()
print("Missing Values:\n", missing_values)
Output:
Missing Values:
TransactionID 0
CustomerName 1
Product 0
Price 0
Quantity 1
TransactionDate 1
dtype: int64
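If you’d rather see the result as a share of rows (often easier to scan on larger datasets), a small variation of the same check works:
# Percentage of missing values per column
missing_pct = df.isnull().mean().mul(100).round(1)
print("Missing (%):\n", missing_pct)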
2. Identify Incorrect Data Types
Reviewing data types is important. For example, TransactionDate should be a datetime type, but here it’s stored as a generic object column.
print("Data Types:\n", df.dtypes)
Output:
Data Types:
TransactionID int64
CustomerName object
Product object
Price int64
Quantity float64
TransactionDate object
dtype: object
Running this quick check should help identify columns that need transformation.
3. Convert Dates to a Consistent Format
This one-liner converts ‘TransactionDate’ to a consistent datetime format. Any unconvertible values—invalid formats—are replaced with NaT (Not a Time).
df["TransactionDate"] = pd.to_datetime(df["TransactionDate"], errors="coerce")
print(df["TransactionDate"])
Output:
0 2024-12-01
1 NaT
2 NaT
3 NaT
4 2024-12-01
Name: TransactionDate, dtype: datetime64[ns]
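Notice that “2024/12/01” was coerced to NaT as well: recent pandas versions infer the expected format from the first non-null value, so valid dates in other formats get dropped too. If you’d rather have pandas parse each entry independently, one option (assuming pandas 2.x, where format="mixed" is available) looks like this:
# dayfirst=True reads "01-12-2024" as December 1, not January 12
df["TransactionDate"] = pd.to_datetime(df["TransactionDate"], format="mixed", dayfirst=True, errors="coerce")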
4. Find Outliers in Numeric Columns
Finding outliers in numeric columns is another important check. However, you’ll need some domain knowledge to identify potential outliers. Here, we filter the rows where the ‘Price’ is less than 0, flagging negative values as potential outliers.
outliers = df[df["Price"] < 0]
print("Outliers:\n", outliers)
Output:
Outliers:
TransactionID CustomerName Product Price Quantity TransactionDate
3 104 None Tablet -300 1.0 NaT
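If you don’t have a hard rule like “prices can’t be negative”, a common statistical fallback is the 1.5 × IQR fence. Here’s a minimal sketch; it assumes the column is roughly unimodal, so treat it as a starting point rather than a verdict:
# Flag prices more than 1.5 * IQR outside the interquartile range
q1, q3 = df["Price"].quantile([0.25, 0.75])
iqr_outliers = df[~df["Price"].between(q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1))]
print("IQR Outliers:\n", iqr_outliers)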
5. Detect Duplicate Records
This checks for duplicate rows based on ‘CustomerName’ and ‘Product’, ignoring unique TransactionIDs. Duplicates might indicate repeated entries.
duplicates = df.duplicated(subset=["CustomerName", "Product"], keep=False)
print("Duplicate Records:\n", df[duplicates])
Output:
Duplicate Records:
TransactionID CustomerName Product Price Quantity TransactionDate
0 101 Jane Rust Laptop 1200 1.0 2024-12-01
2 103 Jane Rust Laptop 1200 NaN NaT
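If these turn out to be true duplicates rather than legitimate repeat purchases, dropping all but the first occurrence is another one-liner, something like:
df_deduped = df.drop_duplicates(subset=["CustomerName", "Product"], keep="first")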
6. Standardize Text Data
This standardizes CustomerName by removing extra spaces and ensuring proper capitalization (“jane rust” → “Jane Rust”).
df["CustomerName"] = df["CustomerName"].str.strip().str.title()
print(df["CustomerName"])
Output:
0 Jane Rust
1 June Young
2 Jane Rust
3 None
4 June Young
Name: CustomerName, dtype: object
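Note that str.strip() only trims leading and trailing whitespace. If names might also contain repeated spaces in the middle, you could chain in a regex replace before title-casing:
# Collapse internal runs of whitespace, then strip and title-case
df["CustomerName"] = df["CustomerName"].str.replace(r"\s+", " ", regex=True).str.strip().str.title()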
7. Validate Data Ranges
With numeric values, ensuring they lie within the expected range is necessary. Let’s check if all prices fall within a realistic range, say 0 to 5000. Rows with price values outside this range are flagged.
invalid_prices = df[~df["Price"].between(0, 5000)]
print("Invalid Prices:\n", invalid_prices)
Output:
Invalid Prices:
TransactionID CustomerName Product Price Quantity TransactionDate
3 104 None Tablet -300 1.0 NaT
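The same pattern works for any numeric column. For instance, a similar sanity check on ‘Quantity’ (the 1–100 range here is an arbitrary assumption; pick bounds that fit your domain):
invalid_qty = df[~df["Quantity"].between(1, 100)]
print("Invalid Quantities:\n", invalid_qty)
Missing quantities get flagged too, since between() returns False for NaN.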
8. Count Unique Values in a Column
Let’s get an overview of how many times each product appears using the `value_counts()` method. This is useful for spotting typos or anomalies in categorical data.
unique_products = df["Product"].value_counts()
print("Unique Products:\n", unique_products)
Output:
Unique Products:
Product
Laptop 2
Phone 2
Tablet 1
Name: count, dtype: int64
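Because value_counts() is case-sensitive, “Laptop” and “laptop” would show up as separate products. Lower-casing first is a quick way to surface such case-based typos:
print(df["Product"].str.lower().value_counts())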
9. Check for Inconsistent Formatting Across Columns
This detects inconsistently formatted entries in ‘CustomerName’. The regex flags names containing a run of two or more uppercase letters, which wouldn’t match the expected title case format.
inconsistent_names = df["CustomerName"].str.contains(r"[A-Z]{2,}", na=False)
print("Inconsistent Formatting in Names:\n", df[inconsistent_names])
Output:
Inconsistent Formatting in Names:
Empty DataFrame
Columns: [TransactionID, CustomerName, Product, Price, Quantity, TransactionDate]
Index: []
Here, there are no inconsistent entries in the ‘CustomerName’ column, as we’ve already formatted it in title case in step 6.
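To see the check fire, you can run the same regex against the raw names from the start of the article; “JUNE YOUNG” contains a run of uppercase letters and gets flagged:
raw_names = pd.Series(["Jane Rust", "june young", "Jane Rust", None, "JUNE YOUNG"])
print(raw_names[raw_names.str.contains(r"[A-Z]{2,}", na=False)])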
10. Identify Rows with Multiple Issues
This identifies rows with more than one issue, such as missing values, negative prices, or invalid dates, for focused attention during cleaning.
# Note: a missing date counts twice here, once as a null and once as an invalid date
issues = df.isnull().sum(axis=1) + (df["Price"] < 0) + df["TransactionDate"].isna()
problematic_rows = df[issues > 1]
print("Rows with Multiple Issues:\n", problematic_rows)
Output:
Rows with Multiple Issues:
TransactionID CustomerName Product Price Quantity TransactionDate
1 102 June Young Phone 800 2.0 NaT
2 103 Jane Rust Laptop 1200 NaN NaT
3 104 None Tablet -300 1.0 NaT
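If you run these checks often, it can help to roll them into a single summary. Here’s a minimal sketch built from the checks above:
# One-glance data quality report
report = {
    "missing_values": int(df.isnull().sum().sum()),
    "negative_prices": int((df["Price"] < 0).sum()),
    "duplicate_name_product_pairs": int(df.duplicated(subset=["CustomerName", "Product"]).sum()),
    "invalid_dates": int(df["TransactionDate"].isna().sum()),
}
print(report)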
Conclusion
Data cleaning doesn’t have to be overwhelming. With these pandas one-liners in your toolkit, you can run important data quality checks that will help you decide your next steps in data cleaning.
Whether it’s handling missing values, catching outliers, or checking for the right data types, these quick checks will save you time and headaches.
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.