Outliers are a common issue in real-world data. From manufacturing quality control to financial market transactions to electrical energy readings, there are many situations in which an unexpected or statistically unlikely observation suddenly shows up. This can happen for a variety of reasons: measurement errors during data collection, natural variability in the underlying process, human error during data entry, or simply genuine but rare events like market crashes, natural disasters, or even the start of a lockdown.
This hands-on article walks through several useful strategies for dealing with outliers effectively, depending on the nature of your dataset and the requirements of your project or real-world problem.
Before presenting three common strategies to manage outliers, we go through the preparatory steps for the practical examples: creating and visualizing a synthetic dataset consisting of two attributes.
Importing necessary Python libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Creating a synthetic dataset. Notice that in this example we generate a set of normally distributed points in both dimensions (attributes) and then intentionally append three more observations manually, namely the points (90, 40), (150, 30), and (200, 50).
np.random.seed(0)
x = np.random.normal(50, 10, 100)
y = np.random.normal(30, 5, 100)
x = np.append(x, [90, 150, 200]) # Outliers in x
y = np.append(y, [40, 30, 50]) # Corresponding y values
data = {'Feature1': x, 'Feature2': y}
df = pd.DataFrame(data)
plt.scatter(df['Feature1'], df['Feature2'], color="blue", label="Original Data")
plt.title('Original Data')
plt.xlabel('Feature1')
plt.ylabel('Feature2')
plt.legend()
plt.show()
Output: [scatter plot of the original data, Feature1 vs. Feature2]
For the purpose of justifying decisions on the use of different strategies for dealing with outliers, we will assume that the features x and y are the petal length and petal width, in millimeters, of observed specimens of a tropical flower species known to have a remarkable variability of petal size across specimens.
Strategy 1: Removing Outliers
The simplest, and sometimes most effective, strategy for dealing with observations that are unusually distinct from the rest is to assume they are the result of an error and discard them.
While this can be done manually when the dataset is small and manageable, it is normally best to rely on statistical methods to identify the outliers and remove them accordingly. One approach is to first calculate the z-scores of the data features. The z-score, given by z = (x - μ)/σ for each attribute value x, helps identify outliers by standardizing the data and measuring how many standard deviations (σ) each data point lies from the mean (μ). Combined with a thresholding rule, e.g. flagging instances whose distance from the mean is greater than 3σ, this is an effective way to identify outliers and remove them. The larger the threshold, the fewer observations are flagged as outliers.
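As a quick illustrative example of the rule (taking μ = 50 and σ = 10, the parameters used to generate Feature1 earlier): an observation with value x = 85 has z = (85 - 50)/10 = 3.5, so it would be flagged under a 3σ threshold, whereas x = 75 (z = 2.5) would not.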
mean_x, mean_y = df['Feature1'].mean(), df['Feature2'].mean()
std_x, std_y = df['Feature1'].std(), df['Feature2'].std()
df['Z-Score1'] = (df['Feature1'] - mean_x) / std_x
df['Z-Score2'] = (df['Feature2'] - mean_y) / std_y
Notice that we created two new attributes containing the z-scores of the two original attributes.
We now apply the thresholding rule as a condition to keep in a new dataframe, df_cleaned, only the observations whose distance from the mean is less than or equal to three times the standard deviation:
df_cleaned = df[(abs(df['Z-Score1']) <= 3) & (abs(df['Z-Score2']) <= 3)]
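As a side note, the same filtering could be written more compactly with SciPy's zscore function. The snippet below is just an optional sketch equivalent to the steps above, assuming SciPy is available in your environment, and df_cleaned_alt is a name introduced here for illustration:
from scipy import stats

# Absolute z-scores for both features at once (ddof=1 matches pandas' std())
z_scores = np.abs(stats.zscore(df[['Feature1', 'Feature2']], ddof=1))
# Keep only the rows where both features lie within 3 standard deviations of the mean
df_cleaned_alt = df[(z_scores <= 3).all(axis=1)]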
By visualizing the new dataset, we can see that two of the original data points are now gone: namely, two of the three extra observations we manually added to the randomly generated data at the beginning. The third manually added point does not deviate far enough from the mean to be deemed an outlier.
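If you want to double-check exactly which observations were dropped, a quick comparison of the two dataframes' indexes does the trick (removed_rows is just an illustrative name):
# Rows present in the original dataframe but absent from the cleaned one
removed_rows = df[~df.index.isin(df_cleaned.index)]
print(removed_rows[['Feature1', 'Feature2']])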
plt.scatter(df_cleaned['Feature1'], df_cleaned['Feature2'], color="green", label="Cleaned Data")
plt.title('Cleaned Data')
plt.xlabel('Feature1')
plt.ylabel('Feature2')
plt.legend()
plt.show()
Output: [scatter plot of the cleaned data, Feature1 vs. Feature2, with the two most extreme observations removed]
Strategy 2: Transforming the Data to Reduce the Impact of Outliers
As an alternative to removing outliers, applying a mathematical transformation to the original data might be a more suitable solution when even the outliers are meant to contain valuable information that should not be discarded, or when the data is nonlinear or skewed, in which case a transformation can also help normalize it and ease further analysis. Let's try applying a logarithmic transformation to the original dataframe, aided by NumPy, and see what happens.
df['Log-Feature1'] = np.log1p(df['Feature1'])
df['Log-Feature2'] = np.log1p(df['Feature2'])
plt.scatter(df['Log-Feature1'], df['Log-Feature2'], color="green", label="Transformed Data")
plt.title('Transformed Data')
plt.xlabel('Log-Feature1')
plt.ylabel('Log-Feature2')
plt.legend()
plt.show()
Output: [scatter plot of the log-transformed data, Log-Feature1 vs. Log-Feature2]
A logarithmic transformation contributes to bringing extreme values closer to the majority of the data. In this example, depending on your problem requirements or further intended analysis (e.g. building a predictive machine learning model), you may use this transformed data or decide that the transformation was not effective enough in reducing the impact of outliers. If you later train a classifier to determine whether or not these flower observations belong to a certain tropical species, you may want the model to retain comparable accuracy before and after transforming the training data.
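One simple way to check whether the transformation actually helped, sketched here under the assumption that reducing skewness is the goal, is to compare the skewness of each feature before and after applying the logarithm:
# Skewness of the original features vs. their log-transformed counterparts
print(df[['Feature1', 'Feature2']].skew())
print(df[['Log-Feature1', 'Log-Feature2']].skew())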
Strategy 3: Capping or Winsorizing the Data
Finally, instead of transforming all observations, which might be computationally costly for large datasets, capping consists of transforming only the observations with the most extreme values. How? By limiting values to a specified range, typically bounded by a very high and a very low percentile: for example, attribute values above the 99th percentile or below the 1st percentile are replaced with those threshold percentile values themselves. NumPy's clip() function helps do this:
lower_cap1, upper_cap1 = df['Feature1'].quantile(0.01), df['Feature1'].quantile(0.99)
lower_cap2, upper_cap2 = df['Feature2'].quantile(0.01), df['Feature2'].quantile(0.99)
df['Capped-Feature1'] = np.clip(df['Feature1'], lower_cap1, upper_cap1)
df['Capped-Feature2'] = np.clip(df['Feature2'], lower_cap2, upper_cap2)
plt.scatter(df['Capped-Feature1'], df['Capped-Feature2'], color="green", label="Capped Data")
plt.title('Capped Data')
plt.xlabel('Feature1')
plt.ylabel('Feature2')
plt.legend()
plt.show()
Output: [scatter plot of the capped data, Capped-Feature1 vs. Capped-Feature2]
Notice above how the two most extreme observations became vertically aligned as a result of capping them.
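If you want to confirm how many values were actually modified by capping, a quick count of the entries that changed will do (purely illustrative):
# Number of values in each feature that were replaced by the percentile caps
print((df['Feature1'] != df['Capped-Feature1']).sum())
print((df['Feature2'] != df['Capped-Feature2']).sum())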
A very similar strategy is called winsorizing, where instead of replacing extreme values with the percentile thresholds themselves, they are substituted with the nearest observations' values that fall within the specified percentile range, e.g. the 1st to 99th percentiles.
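For reference, SciPy offers a winsorize function in scipy.stats.mstats that implements this directly. The sketch below assumes SciPy is installed, and the Winsorized-Feature1/2 column names are introduced here purely for illustration:
from scipy.stats.mstats import winsorize

# Winsorize at the 1st and 99th percentiles: the lowest and highest 1% of values
# are replaced with the nearest observations that fall within that range
df['Winsorized-Feature1'] = np.asarray(winsorize(df['Feature1'], limits=[0.01, 0.01]))
df['Winsorized-Feature2'] = np.asarray(winsorize(df['Feature2'], limits=[0.01, 0.01]))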
Capping and winsorizing are recommended strategies when preserving the integrity of the data is crucial, since only the most extreme cases are transformed rather than all of them. They are also preferred when it is important to avoid major changes in the distribution of the data attributes.
Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.