Machine learning is a powerful technology, but before you can apply its models, your dataset needs to be in a specific format.
To get it there, a number of data cleaning techniques are routinely applied to real-life datasets.
In this article, we’ll explore some of those methods using a real-life data project given as an interview assignment by Haensel AMS. You will learn each concept and see its real-life application, so let’s start with the data project details first and then move on to the data cleaning techniques!
Data Project: Predicting Price
Here is the link to this data project, which is used in recruitment processes for data science positions at Haensel AMS. This take-home assignment challenges candidates to build machine-learning models that predict prices based on various features.
The dataset comprises seven attributes, with price as the target column. Key features extracted from these columns include loc1, para2, and dow.
The purpose is to clean the data efficiently and inspect the correlations between features so that the machine learning models receive usable information. Along the way, we’ll cover missing data treatment, data type conversion, outlier removal, and feature selection.
Throughout the article, this dataset will show how each data cleaning technique we mention affects the quality and accuracy of the resulting model.
Handling Missing Data
Real-world datasets often have missing values, which can bias the data or make your model less accurate.
There are many different techniques for dealing with missing data, such as deleting them completely or replacing them with the data points’ mean, median, or mode.
Example from the Dataset
In the Predicting Price dataset, we’ll use pandas’ dropna() method to remove any rows with missing values. This ensures that our machine-learning models are trained on complete data.
Before we start cleaning, let’s examine the dataset’s structure and determine how many missing values there are.
import pandas as pd
df = pd.read_csv("sample.csv")
print(df.info())
Here is the output.
There are no missing values in this case, but if there were, here is one method for removing those rows.
Note: there are three object-type columns, which we must take care of before applying machine learning models; we’ll do this in the next step.
# Remove rows with missing values
df.dropna(inplace=True)
Alternative Methods to Handle Missing Data
There are also several other methods for handling missing values:
- Filling Missing Values with Mean or Median: Instead of removing rows, we can fill missing numerical values with the column’s mean or median (or constant) value.
- Filling with Mode (for Categorical Data): When working with categorical features like dow, you can fill missing values by mode, i.e., the most frequent value of that column.
- Using Forward Fill or Backward Fill: Forward filling replaces a missing value with the value from the previous row, while backward filling uses the value from the next row.
- Advanced Imputation Techniques: For more complex patterns of missing data, you can use K-Nearest Neighbors (KNN) imputation or scikit-learn’s built-in imputers, which cover mean/median/mode as well as regression-based imputation. A sketch of these options follows this list.
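If you want to impute instead of dropping rows, here is a minimal sketch of these alternatives, assuming the df loaded above and that para1 and dow are the columns with gaps (purely illustrative here, since this dataset has no missing values):

from sklearn.impute import KNNImputer

# Fill a numeric column with its mean (median or a constant work the same way)
df["para1"] = df["para1"].fillna(df["para1"].mean())

# Fill a categorical column with its mode (most frequent value)
df["dow"] = df["dow"].fillna(df["dow"].mode()[0])

# Forward fill (use the previous row) or backward fill (use the next row)
df = df.ffill()  # or df.bfill()

# KNN imputation on the numeric columns only
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = KNNImputer(n_neighbors=5).fit_transform(df[numeric_cols])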
Data Type Conversion
In many datasets, features can be represented as numerical or categorical data. However, most machine learning models require numeric input.
For categorical variables, one typical approach is one-hot encoding, which creates a new binary column for every category value. Alternatively, numeric-looking strings can be coerced into numbers using pd.to_numeric().
One common preprocessing step is converting categorical variables like day of the week (dow) into numeric values so that our model can interpret them. Next, we want to ensure that other features, such as loc1 and loc2, are in the proper numeric format for analysis.
Example from the Dataset
In the Predicting Price dataset, we have columns like loc1 and loc2, which must be converted to numerical types. Additionally, the dow column (representing the day of the week) is categorical, so we’ll use one-hot encoding to convert it into multiple binary columns.
First, let’s see our values.
df["loc1"].value_counts()
Here is the output.
As you can see from the output above, we first need to remove the “S” and “T” values to continue, because machine learning models need numerical input. Here is the code.
df["loc2"] = pd.to_numeric(df["loc2"], errors="coerce")
df["loc1"] = pd.to_numeric(df["loc1"], errors="coerce")
df.dropna(inplace = True)
Using the code above, we have converted the two object columns to numeric values. As you might recall, one more column needs to be transformed: dow. I’ll leave this one to you.
Alternative Methods for Data Type Conversion
- Label Encoding: Label encoding is another method for converting categorical data. It assigns a unique integer to each category. However, this can introduce unintended ordinal relationships, so it’s usually less preferred than one-hot encoding.
- Ordinal Encoding: If the categories of a feature hold an order, such as levels in education, we can employ ordinal encoding that assigns it numerical values based on that meaningful sequence.
- Tip: the sketch below shows one way you can convert the dow column into numbers.
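As a hint, here is one possible sketch using pandas’ get_dummies; the exact encoding choice for dow is still left to you:

# One-hot encode dow: one binary column per day of the week
df = pd.get_dummies(df, columns=["dow"], prefix="dow")

# Alternative: label encoding (assigns an integer per category,
# which may imply an order that does not exist)
# df["dow"] = df["dow"].astype("category").cat.codes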
Outlier Removal
Outliers are extreme data points that can distort your machine-learning model. For instance, if you are predicting human heights from a group of people, values like 10 cm or 500 cm would be treated as outliers.
Of course, there are alternative concepts for evaluating outliers. A common approach to outlier removal is using statistical methods like the Interquartile Range (IQR) or setting thresholds based on domain knowledge.
Example from the Dataset
In the Predicting Price dataset, the para1 column contains some extreme values, so we’ll filter out rows where para1 exceeds 10 (here, result refers to the cleaned DataFrame from the previous steps).
result = result[result["para1"] < 10]
But before this step, drawing a scatter matrix is a good way to spot outliers. And if we are curious about one feature in particular, we can use the following code to inspect its values.
result["para1"].value_counts()
Alternative Methods for Handling Outliers
- Using the Interquartile Range (IQR): This method computes the 25th and 75th percentiles; the IQR is the difference between them, and an outlier is any value that falls more than 1.5 times the IQR below the 25th or above the 75th percentile (see the sketch after this list).
- Z-Score Method: It standardizes the data and flags outliers by their distance from the mean in standard deviations. Any value with an absolute z-score greater than 3 is typically treated as an outlier.
- Capping: Instead of removing outliers entirely, capping clips them to specific percentiles (for example, the 1st and 99th).
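Here is a minimal sketch of these three approaches, again using para1 as an illustrative column:

# 1. IQR filtering: keep rows between Q1 - 1.5*IQR and Q3 + 1.5*IQR
q1, q3 = result["para1"].quantile([0.25, 0.75])
iqr = q3 - q1
iqr_filtered = result[result["para1"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# 2. Z-score filtering: keep rows within 3 standard deviations of the mean
z_scores = (result["para1"] - result["para1"].mean()) / result["para1"].std()
z_filtered = result[z_scores.abs() <= 3]

# 3. Capping: clip extreme values to the 1st and 99th percentiles
capped = result.copy()
capped["para1"] = capped["para1"].clip(
    lower=result["para1"].quantile(0.01),
    upper=result["para1"].quantile(0.99),
)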
Standardization
Standardizing a feature means centering it around a mean of 0 and scaling it to a variance of 1.
This is important because many machine learning models work better when input variables are on similar scales. If the data is not standardized, features with the largest magnitudes will dominate the learning process.
Standardization Techniques:
- Min-Max Scaling: The most basic form of scaling, which rescales values to the range 0 to 1.
- Standard Scaling (Z-score normalization): This method rescales data to a mean of 0 and a standard deviation of 1.
- Robust Scaling: This approach scales based on the IQR and is less sensitive to outliers.
Example from the Dataset
In the Predicting Price dataset, we will apply three different scaling techniques to columns like para1, para2, para3, and para4 to ensure they are on the same scale before feeding the data into a machine learning model.
Here is the code for min-max scaling, which rescales each feature to the range 0 to 1.
from sklearn.preprocessing import MinMaxScaler
df = pd.read_csv("sample.csv")
columns_to_scale = ['para1', 'para2', 'para3', 'para4']
scaler = MinMaxScaler()
df[columns_to_scale] = scaler.fit_transform(df[columns_to_scale])
print(df[columns_to_scale].head())
We will also do standard and robust scaling in this data project (a quick sketch follows), but alternative scaling methods exist as well.
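Here is a minimal sketch of those two scalers; note that each one should be fit on the original, unscaled values rather than on the min-max-scaled frame above:

import pandas as pd
from sklearn.preprocessing import StandardScaler, RobustScaler

# Re-read the raw data so each scaler sees the original values
df = pd.read_csv("sample.csv")
columns_to_scale = ['para1', 'para2', 'para3', 'para4']

# Standard scaling: mean 0, standard deviation 1
df_standard = df.copy()
df_standard[columns_to_scale] = StandardScaler().fit_transform(df[columns_to_scale])

# Robust scaling: centers on the median and scales by the IQR,
# so it is less affected by outliers
df_robust = df.copy()
df_robust[columns_to_scale] = RobustScaler().fit_transform(df[columns_to_scale])

print(df_standard[columns_to_scale].describe())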
Alternative Methods for Scaling
- MaxAbsScaler: Scales data by dividing each feature by its maximum absolute value, which is helpful for data already centered at zero.
- Log Transformation: This non-linear transformation helps reduce the skewness of highly skewed data (see the sketch below).
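A brief sketch of both, assuming the same df and columns_to_scale as in the min-max example (and non-negative feature values for the log transform):

import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# MaxAbsScaler: divide each feature by its maximum absolute value
df_maxabs = df.copy()
df_maxabs[columns_to_scale] = MaxAbsScaler().fit_transform(df[columns_to_scale])

# Log transformation: log1p(x) = log(1 + x) handles zeros and reduces right skew
df_log = df.copy()
df_log[columns_to_scale] = np.log1p(df[columns_to_scale])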
Feature Selection
Feature selection is the process of picking the most relevant features from a dataset to build a model with. It decreases the dataset’s dimensionality and helps create an accurate model that is less likely to overfit.
Common feature selection methods include correlation-based filtering, recursive feature elimination (RFE), and tree-based feature importance, which ranks features by how much they contribute to predicting the target variable.
Example from the Dataset
In the Predicting Price dataset, we will calculate the correlation between the features and price and then select the top 3 and 5 most relevant features. Here is the code to select the 5 most relevant features.
five_best = []
df_5 = pd.DataFrame(result.corr()["price"]).sort_values(by="price", ascending=False)
df_5 = df_5.drop(df_5.index[0]).head(5)
for i in range(len(df_5)):
    five_best.append(df_5.index[i])
By selecting the top 3 or 5 most correlated features, we now have only a subset of our data that is much more relevant to predicting the target variable.
Alternative Methods for Feature Selection
- Recursive Feature Elimination (RFE): RFE repeatedly removes the least important feature and re-fits the model, tracking feature importance to arrive at an optimal subset.
- Tree-Based Feature Importance: Random Forest or Gradient Boosting models can rank features based on their importance, as shown in the sketch below.
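Here is a sketch of both approaches, assuming result is the fully numeric, cleaned DataFrame used earlier (with dow already encoded):

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

X = result.drop(columns=["price"])
y = result["price"]

# RFE: repeatedly drop the least important feature until 5 remain
rfe = RFE(estimator=RandomForestRegressor(random_state=42), n_features_to_select=5)
rfe.fit(X, y)
print("RFE selection:", list(X.columns[rfe.support_]))

# Tree-based importance: rank features by a fitted random forest
forest = RandomForestRegressor(random_state=42)
forest.fit(X, y)
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))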
When cleaning your dataset, avoid these 5 traps in your data.
Conclusion
In this article, we have explored some of the most critical data-cleaning techniques, starting with handling missing data and moving on to standardization and feature selection. Along the way, we also covered alternative approaches for each step.
Using the Predicting Price data project, we applied these techniques to a real-world dataset from an interview assignment. This prepares you for future interviews while keeping you well-informed about practical data-cleaning strategies.
Nate Rosidi is a data scientist and works in product strategy. He’s also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.