Image by Author
Cleaning the data is one of the most important and time-consuming parts of any data science project. Even with the best algorithm, you still need clean data to get good results. In this article, I will give you seven tricks to up your data-cleaning game in Python.
1. Handling Invalid Data Entries
Real-life datasets often contain invalid data entries. To avoid corrupted values or unexpected results, these should be corrected or removed before any analysis.
Predicting Price
We will use this project in the examples that follow. Haensel AMS has used this data project in its recruitment process for a data scientist position. Here is the link to this project.
Application
In our dataset, the loc1 column contains unexpected string values like ‘S’ and ‘T’, which should not be present since loc1 is expected to be numeric.
# Check for invalid entries in 'loc1'
df["loc1"].value_counts()
Here is the output.
Now, let’s remove rows that include invalid values.
# Remove rows with invalid 'loc1' values
df = df[(df["loc1"].str.contains("S") == False) & (df["loc1"].str.contains("T") == False)]
df.shape
Here is the output.
Let’s evaluate the output.
- Before Cleaning: The value_counts() output shows that ‘S’ and ‘T’ appear once each in loc1.
- After Cleaning: Removing these entries reduces the dataset size from 10,000 to 9,998 rows.
- Impact: Eliminating invalid entries ensures erroneous data does not skew subsequent analyses and models.
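As an aside, the same filter can be written more concisely with isin. This is only a sketch, and it assumes the invalid entries in loc1 are exactly the strings ‘S’ and ‘T’:
# Equivalent, more concise filter: keep rows whose 'loc1' is not 'S' or 'T'
df = df[~df["loc1"].isin(["S", "T"])]
df.shape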
2. Converting Data Types Appropriately
Data types must be correct before you can perform certain operations on a column. Converting data to the proper types ensures correct computations and prevents errors.
Application
The loc1 and loc2 columns are initially of type object, possibly due to leading zeros or non-numeric characters. They need to be converted to numeric types for analysis.
Here is the code.
df["loc2"] = pd.to_numeric(df["loc2"], errors="coerce")
df["loc1"] = pd.to_numeric(df["loc1"], errors="coerce")
df.dropna(inplace = True)
df.shape
Here is the output.
Let’s evaluate what we did here.
- After Conversion: loc1 and loc2 are converted to numeric (float64 or int64) types.
- Data Loss: The dataset reduces slightly in size (from 9,998 to 9,993 rows) due to rows with non-convertible values being dropped.
- Impact: Converting data types allows for numerical operations and is essential for modeling.
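To make the coercion behavior concrete, here is a minimal, self-contained illustration on a toy Series (not the project data): non-numeric strings become NaN, which is why we drop rows afterwards.
import pandas as pd
# Non-numeric strings are coerced to NaN instead of raising an error
pd.to_numeric(pd.Series(["1", "2", "S"]), errors="coerce")
# 0    1.0
# 1    2.0
# 2    NaN
# dtype: float64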
3. Encoding Categorical Variables
Most machine learning models can consume only numerical input, so categorical values must be encoded into a numerical form that preserves their information.
Application
The dow (day of the week) column is categorical, with values like ‘Mon’, ‘Tue’, etc. We used two methods to encode this data:
- One-Hot Encoding: Creating binary columns for each category.
- Ordinal Encoding: Mapping categories to numerical values.
Let’s see examples.
One-Hot Encoding
# Create dummy variables
dow_dummies = pd.get_dummies(df['dow'])
df = df.join(dow_dummies).drop('dow', axis=1)
df.head()
Here is the output.
Ordinal Encoding
# Map days of the week to numerical values
# (an alternative to one-hot encoding; assumes df still contains the original 'dow' column)
days_of_week = {'Mon': 1, 'Tue': 2, 'Wed': 3, 'Thu': 4, 'Fri': 5, 'Sat': 6, 'Sun': 7}
df['dow'] = df['dow'].map(days_of_week)
df.head()
Here is the output.
Let’s evaluate the output.
- One-Hot Encoding: Adds new columns (Mon, Tue, etc.) with binary indicators.
- Ordinal Encoding: Replaces dow values with numerical representations.
- Impact: Both methods convert categorical data into a format suitable for modeling. One-hot encoding is preferable when there is no inherent order, while ordinal encoding assumes an order.
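One design note on one-hot encoding: for linear models it is common to drop one of the dummy columns to avoid perfect multicollinearity. Here is a brief sketch using pandas’ drop_first option, as an alternative to the one-hot cell above rather than a step the original project takes:
# One-hot encode 'dow' directly on the DataFrame, dropping the first dummy column
df = pd.get_dummies(df, columns=['dow'], drop_first=True)
df.head()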
4. Handling Outliers
Outliers can skew your statistical analyses and degrade your models. Identifying and handling outliers counteracts this and increases the robustness of your results.
Application
Let’s first check for outliers. Here is the code.
from pandas.plotting import scatter_matrix
# 'result' is assumed to be the cleaned and encoded DataFrame from the previous steps
# Assign to _ to suppress the return value of scatter_matrix
_ = scatter_matrix(result.iloc[:, 0:7], figsize=(12, 8))
Here is the output.
Let’s see para1’s values.
result["para1"].value_counts()
Here is the output.
We have identified that the para1 column has extreme values (e.g., 337), which are outliers compared to the rest of the data. Let’s filter this column.
# Analyze 'para1' value counts
print(result["para1"].value_counts())
# Remove outliers in 'para1'
result = result[result["para1"] < 10]
Here is the evaluation of the output.
- Before Removal: para1 has values up to 337, while most entries are between 0 and 7.
- After Removal: Entries with para1 >= 10 are removed, reducing the dataset size.
- Impact: Removing outliers prevents them from skewing the analysis and improves model performance.
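The cutoff of 10 used above comes from inspecting the value counts. If you prefer a threshold-free rule, a common alternative (not what the original project does) is the 1.5 × IQR fence:
# IQR-based upper fence for 'para1'
q1 = result["para1"].quantile(0.25)
q3 = result["para1"].quantile(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
result = result[result["para1"] <= upper_fence]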
5. Feature Selection Based on Correlation
Selecting only the features that correlate most strongly with the target variable can increase the model’s accuracy and reduce complexity.
Application
We calculated the correlation between each feature and the target variable price and selected the top five features.
# Correlation of each feature with the target, sorted in descending order
df_5 = pd.DataFrame(result.corr()["price"]).sort_values(by="price", ascending=False)
# Drop 'price' itself (the first row) and keep the five most correlated features
df_5 = df_5.drop(df_5.index[0]).head(5)
five_best = list(df_5.index)
five_best
Here is the output.
Here is the evaluation of what we did.
- Top Features Identified: [‘para2’, ‘para4’, ‘para3’, ‘para1’, ‘Fri’]
- Impact: Using features with a higher correlation to price can improve the predictive power of your models.
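Note that sorting raw correlations ignores features that are strongly negatively correlated with price. A small variation, shown here only as a sketch rather than the project’s original selection, ranks features by absolute correlation instead:
# Rank features by the absolute value of their correlation with 'price'
corr_with_price = result.corr()["price"].drop("price")
five_best_abs = corr_with_price.abs().sort_values(ascending=False).head(5).index.tolist()
five_best_abs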
6. Scaling Features
Scaling ensures that all features contribute equally to model training, which is especially important for algorithms that are sensitive to the scale of their input features.
Application
We applied different scaling techniques:
- Min-Max Scaling
- Standard Scaling
- Robust Scaling
So, let’s look at their application.
Standard Scaling Example
from sklearn.preprocessing import StandardScaler
# Separate features and target
X = result.drop('price', axis=1)
y = result['price']
# Apply Standard Scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
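For completeness, the other two scalers mentioned above are applied in exactly the same way. Here is a brief sketch using the same X as above:
from sklearn.preprocessing import MinMaxScaler, RobustScaler
# Min-Max Scaling: rescales each feature to the [0, 1] range
X_minmax = MinMaxScaler().fit_transform(X)
# Robust Scaling: centers on the median and scales by the IQR, so it is less sensitive to outliers
X_robust = RobustScaler().fit_transform(X)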
Let’s evaluate what we have done.
- Model Performance: Scaling reduced the model’s training and testing errors.
- Comparison: We compared the performance of the different scaling methods.
- Impact: Proper scaling can lead to faster convergence and better model accuracy.
7. Handling Missing Values
Machine learning models can behave unpredictably when given missing values. Filling them in ensures the training set is complete.
Model Building on Synthetic Dataset
This data project has been used during recruitment for Capital One’s data science positions. Here is the link.
Application
Our datasets contain missing values across multiple features. To maintain data integrity, we fill these missing values with the median of each feature.
First, let’s check missing values.
# Check missing values in train_data
missing_train = train_data.isna().sum()
print("Missing values in train_data:")
print(missing_train[missing_train > 0])
# Check missing values in test_data
missing_test = test_data.isna().sum()
print("\nMissing values in test_data:")
print(missing_test[missing_test > 0])
Here is the output.
Now, let’s see which code we’ll use to clean.
# Fill missing values in train_data with each column's median
for column in train_data.columns:
    median_value = train_data[column].median()
    train_data[column] = train_data[column].fillna(median_value)

# Fill missing values in test_data with each column's median
for column in test_data.columns:
    median_value = test_data[column].median()
    test_data[column] = test_data[column].fillna(median_value)
Now, let’s check one more time. Here is the code.
# Check missing values in train_data
missing_train = train_data.isna().sum()
print("Missing values in train_data:")
print(missing_train[missing_train > 0])
# Check missing values in test_data
missing_test = test_data.isna().sum()
print("\nMissing values in test_data:")
print(missing_test[missing_test > 0])
Here is the output.
Let’s evaluate what we did here.
- Before Imputation: Numerous features have missing values in both datasets.
- After Imputation: All missing values are filled; datasets are complete.
- Impact: Enhances model performance by providing a complete dataset for training and evaluation.
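One design note: filling the test set with its own medians is simple, but if you want the test set to stay strictly unseen, you can compute the medians on the training data only. Here is a sketch using scikit-learn’s SimpleImputer as an alternative to the loops above (not what the original project does):
from sklearn.impute import SimpleImputer
import pandas as pd
# Learn per-column medians from the training data only, then apply them to both sets
imputer = SimpleImputer(strategy="median")
train_data = pd.DataFrame(imputer.fit_transform(train_data),
                          columns=train_data.columns, index=train_data.index)
test_data = pd.DataFrame(imputer.transform(test_data),
                         columns=test_data.columns, index=test_data.index)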
Final Thoughts
In this article, we covered seven key data-cleaning methods that will sharpen your Python skills and help you build better models. Also, check out these Python Libraries for Data Cleaning.
Using these methods will greatly improve your data analysis, especially on real-life data projects, and it will also prepare you for the data scientist hiring process.
Nate Rosidi is a data scientist and works in product strategy. He’s also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.