7 Ways to Improve Your Data Cleaning Skills with Python



 

Image by Author

 

Cleaning the data is one of the most important and time-consuming parts of any data science project.

Even with the best algorithm, you still need clean data to get good results.

In this article, I will give you seven tricks to up your data-cleaning game in Python.

 

1. Handling Invalid Data Entries

 

Real-life datasets often contain invalid entries. These should be corrected or removed before any analysis so that corrupted or unexpected values do not distort the results.

 

Predicting Price

Handling Invalid Data Entries to Improve Data Cleaning Skills

 

We will use this project for the first six techniques below. Haensel AMS has used this data project in its recruitment process for data scientist positions. Here is the link to this project.

 

Application

In our dataset, the loc1 column contains unexpected string values like ‘S’ and ‘T’, which should not be present if loc1 is expected to be numeric.

# Check for invalid entries in 'loc1'
df["loc1"].value_counts()

 

Here is the output.

Application For Handling Invalid Data Entries

Now, let’s remove rows that include invalid values.

# Remove rows with invalid 'loc1' values
df = df[~df["loc1"].str.contains("S") & ~df["loc1"].str.contains("T")]
df.shape

 

Here is the output.

Application For Handling Invalid Data Entries

 

Let’s evaluate the output.

  • Before Cleaning: The value_counts() output shows that ‘S’ and ‘T’ appear once each in loc1.
  • After Cleaning: Removing these entries reduces the dataset size from 10,000 to 9,998 rows.
  • Impact: Eliminating invalid entries ensures erroneous data does not skew subsequent analyses and models.
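
If you do not know in advance which invalid strings may appear, a more general filter is to flag anything that cannot be parsed as a number. Here is a minimal sketch, assuming loc1 still holds its raw string values:

# Flag every value that cannot be parsed as a number, whatever it happens to be
parsed = pd.to_numeric(df["loc1"], errors="coerce")
print(df.loc[parsed.isna(), "loc1"].unique())   # inspect the invalid entries before dropping
df = df[parsed.notna()]                         # keep only the valid rows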

 

2. Converting Data Types Appropriately

 

Columns must have the correct data type before you can run certain operations on them. Converting data to the proper type ensures correct computations and prevents errors.

 

Application

The loc1 and loc2 columns are initially of type object, possibly due to leading zeros or non-numeric characters. They need to be converted to numeric types for analysis.

Here is the code.

df["loc2"] = pd.to_numeric(df["loc2"], errors="coerce")
df["loc1"] = pd.to_numeric(df["loc1"], errors="coerce")
df.dropna(inplace = True)
df.shape

 

Here is the output.

Converting Data Types To Improve Data Cleaning Skills

 

Let’s evaluate what we did here.

  • After Conversion: loc1 and loc2 are converted to float64 or int64 types.
  • Data Loss: The dataset reduces slightly in size (from 9,998 to 9,993 rows) due to rows with non-convertible values being dropped.
  • Impact: Converting data types allows for numerical operations and is essential for modeling.
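
Note that dropna() with no arguments drops rows that have a NaN in any column, not just loc1 and loc2. If the intent is to drop only the rows where the conversion itself failed, you can restrict it with subset. A minimal sketch:

# Drop only the rows where the numeric conversion failed,
# leaving NaNs in other columns (if any) untouched
df = df.dropna(subset=["loc1", "loc2"])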

 

3. Encoding Categorical Variables

 

Most machine learning models can consume only numerical input, so categorical values must be encoded into a numerical form that preserves their information.

 

Application

The dow (day of the week) column is categorical, with values like ‘Mon’, ‘Tue’, etc. There are two common ways to encode it:

  1. One-Hot Encoding: Creating binary columns for each category.
  2. Ordinal Encoding: Mapping categories to numerical values.

Let’s see examples.

 

One-Hot Encoding

# Create dummy variables
dow_dummies = pd.get_dummies(df['dow'])
df = df.join(dow_dummies).drop('dow', axis=1)
df.head()

 

Here is the output.

Encoding Categorical Variables To Improve Data Cleaning Skills

 

Ordinal Encoding

# Map days of the week to numerical values
days_of_week = {'Mon': 1, 'Tue': 2, 'Wed': 3, 'Thu': 4, 'Fri': 5, 'Sat': 6, 'Sun': 7}
df['dow'] = df['dow'].map(days_of_week)
df.head()

 

Here is the output.

Encoding Categorical Variables To Improve Data Cleaning Skills

 

Let’s evaluate the output.

  • One-Hot Encoding: Adds new columns (Mon, Tue, etc.) with binary indicators.
  • Ordinal Encoding: Replaces dow values with numerical representations.
  • Impact: Both methods convert categorical data into a format suitable for modeling. One-hot encoding is preferable when there is no inherent order, while ordinal encoding assumes an order.
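
If you prefer not to write the mapping by hand, pandas' ordered Categorical dtype gives the same kind of ordinal encoding. A minimal sketch, assuming dow still holds the original string labels:

# Define the category order explicitly, then take the integer codes (0-6)
order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
df['dow'] = pd.Categorical(df['dow'], categories=order, ordered=True).codes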

 

4. Handling Outliers

 

Outliers can skew your statistical analyses and ruin your models. Identifying and controlling outliers is one way to counterbalance this and increase the robustness of your results.

 

Application

Let’s first check for outliers. Here is the code.

from pandas.plotting import scatter_matrix

# Suppress the output of the scatter_matrix function
_ = scatter_matrix(result.iloc[:,0:7], figsize=(12, 8))

 

Here is the output.

Handling Outliers to Improve Data Cleaning Skills

Let’s look at para1’s values.

result["para1"].value_counts()

 

Here is the output.

Handling Outliers to Improve Data Cleaning Skills

We have identified that the para1 column has extreme values (e.g., 337), which are outliers compared to the rest of the data. Let’s filter this column.

# Analyze 'para1' value counts
print(result["para1"].value_counts())

# Remove outliers in 'para1'
result = result[result["para1"] < 10]

 

Here is the evaluation of the output.

  • Before Removal: para1 has values up to 337, while most entries are between 0 and 7.
  • After Removal: Entries with para1 >= 10 are removed, reducing the dataset size.
  • Impact: Removing outliers prevents them from skewing the analysis and improves model performance.
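
The cut-off of 10 was chosen by inspecting the value counts. A more systematic rule of thumb, sketched below rather than taken from the original project, is the interquartile-range fence:

# IQR fence: treat values more than 1.5 * IQR above the third quartile as outliers
q1, q3 = result["para1"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
result = result[result["para1"] <= upper_fence]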

 

5. Feature Selection Based on Correlation

 

Selecting only the features that correlate strongly with the target variable can increase the model’s accuracy and reduce its complexity.

 

Application

We calculate the correlation between each feature and the target variable price, then keep the top five features.

# Rank features by their correlation with 'price' and keep the top five
corr_with_price = result.corr()["price"].sort_values(ascending=False)
five_best = corr_with_price.drop("price").head(5).index.tolist()
five_best

 

Here is the output.

Feature Selection Based on Correlation to Improve Data Cleaning Skills

 

Here is the evaluation of what we did.

  • Top Features Identified: [‘para2’, ‘para4’, ‘para3’, ‘para1’, ‘Fri’]
  • Impact: Using features with a higher correlation to price can improve the predictive power of your models.
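
One caveat: sorting raw correlations surfaces only positive relationships. If strongly negative correlations should count as well, rank by absolute value instead. A minimal sketch:

# Rank features by the strength of their correlation with price, ignoring the sign
abs_corr = result.corr()["price"].drop("price").abs().sort_values(ascending=False)
print(abs_corr.head(5).index.tolist())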

 

6. Scaling Features

 

Scaling ensures that all features contribute equally during model training, which is especially important for algorithms that are sensitive to the varying scales of input features.

 

Application

We applied different scaling techniques:

  • Min-Max Scaling
  • Standard Scaling
  • Robust Scaling

 

So, let’s look at their application.

 

Standard Scaling Example

from sklearn.preprocessing import StandardScaler

# Separate features and target
X = result.drop('price', axis=1)
y = result['price']

# Apply Standard Scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
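
For comparison, the other two scalers follow the same fit_transform pattern. A minimal sketch using the same X as above:

from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Min-Max scaling maps every feature into the [0, 1] range
X_minmax = MinMaxScaler().fit_transform(X)

# Robust scaling centers on the median and scales by the IQR,
# so extreme values have less influence
X_robust = RobustScaler().fit_transform(X)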

 

Let’s evaluate what we have done.

  • Model Performance: Scaling reduced the model’s training and testing errors.
  • Comparison: We compared the performance of the different scaling methods.
  • Impact: Proper scaling can lead to faster convergence and better model accuracy.

 

7. Handling Missing Values

 

Most machine learning models cannot handle missing values directly. Filling them in ensures the training set is complete before modeling.

 

Model Building on Synthetic Dataset

Handling Missing Values to Improve Data Cleaning Skills

This data project has been used during recruitment for Capital One’s data science positions. Here is the link.

 

Application

The train and test datasets contain missing values across multiple features. To maintain data integrity, we fill these missing values with the median of each feature.

First, let’s check missing values.

# Check missing values in train_data
missing_train = train_data.isna().sum()
print("Missing values in train_data:")
print(missing_train[missing_train > 0])

# Check missing values in test_data
missing_test = test_data.isna().sum()
print("\nMissing values in test_data:")
print(missing_test[missing_test > 0])

 

Here is the output.

Handling Missing Values to Improve Data Cleaning Skills

 

Now, let’s see which code we’ll use to clean.

# Filling missing values in train_data with the column median
for column in train_data.columns:
    median_value = train_data[column].median()
    train_data[column] = train_data[column].fillna(median_value)

# Filling missing values in test_data with the column median
for column in test_data.columns:
    median_value = test_data[column].median()
    test_data[column] = test_data[column].fillna(median_value)

 

Now, let’s check one more time. Here is the code.

# Check missing values in train_data
missing_train = train_data.isna().sum()
print("Missing values in train_data:")
print(missing_train[missing_train > 0])

# Check missing values in test_data
missing_test = test_data.isna().sum()
print("\nMissing values in test_data:")
print(missing_test[missing_test > 0])

 

Here is the output.

Handling Missing Values to Improve Data Cleaning Skills

Let’s evaluate what we did here.

  • Before Imputation: Numerous features have missing values in both datasets.
  • After Imputation: All missing values are filled; datasets are complete.
  • Impact: Enhances model performance by providing a complete dataset for training and evaluation.
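
One detail worth noting: here the test set is filled with its own medians. A common alternative, sketched below and not necessarily what the original project did, uses the training medians for both sets so that no information from the test data leaks into preprocessing:

# Compute the medians on the training data only, then apply them to both sets
train_medians = train_data.median()
train_data = train_data.fillna(train_medians)
test_data = test_data.fillna(train_medians)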

 

Final Thoughts

 

In this article, we have covered seven key data-cleaning techniques that will sharpen your Python skills and help you build better models. Also, check out these Python Libraries for Data Cleaning.

Using these methods will greatly improve your data analysis, especially on real-life data projects. It will also prepare you for the data scientist hiring process.

 

 

Nate Rosidi is a data scientist and works in product strategy. He’s also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.


