Data analysis often involves dealing with outliers. These are data points that deviate significantly from the rest of the dataset and have the potential to skew the results of most analyses. This article explores various strategies for managing outliers to ensure accurate and robust statistical analyses.
Identifying Outliers
The first step in handling outliers is identifying them. There are many methods for doing this, but one of the most common is visual inspection of the data. Box plots, scatterplots, and histograms can all highlight points that lie apart from the rest of the dataset. For example, a single bar separated from the rest of a histogram may indicate an outlier.
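As a minimal sketch of this kind of inspection, the following uses matplotlib on synthetic data; the sample values and the injected extremes are illustrative assumptions, not a real dataset:

```python
import matplotlib.pyplot as plt
import numpy as np

# Illustrative data: mostly normal values plus a few extreme points
rng = np.random.default_rng(42)
values = np.concatenate([rng.normal(50, 5, 200), [120, 135]])

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].boxplot(values)        # points beyond the whiskers are drawn individually
axes[0].set_title("Box plot")
axes[1].hist(values, bins=30)  # isolated bars far from the main mass hint at outliers
axes[1].set_title("Histogram")
plt.show()
```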
There are also statistical methods for flagging whether a data point is an outlier. Z-scores can be calculated for each data point using the mean and standard deviation of the variable; typically, z-scores greater than 3 or less than -3 mark potential outliers. For non-normally distributed data, the interquartile range (IQR) method can be used instead: values below the first quartile minus 1.5 times the IQR, or above the third quartile plus 1.5 times the IQR, are considered outliers.
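Both rules are straightforward to implement. Here is a short sketch with NumPy; the thresholds of 3 and 1.5 follow the conventions above, and the sample array is a made-up illustration:

```python
import numpy as np

def zscore_outliers(x, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

def iqr_outliers(x, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

# Illustrative sample: normal data with two injected extremes
values = np.concatenate([np.random.default_rng(0).normal(50, 5, 200), [120, 135]])
print(values[zscore_outliers(values)])  # points flagged by the z-score rule
print(values[iqr_outliers(values)])     # points flagged by the IQR rule
```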
1. Remove Outliers
In some cases, the presence of outliers indicates data errors, such as an incorrect date of birth leading to someone's age being calculated as 150. When this occurs, simply removing the outliers is the best approach, particularly if the correct data cannot be recovered. Whether to remove only the offending value or the entire row depends on the researcher's judgment and the specific statistical tests being conducted.
It is important to keep in mind that removing too many data points can reduce the robustness of your analysis, particularly in smaller sample sizes or in cases where maintaining a certain level of statistical power is critical.
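In pandas, both options look roughly like this; the DataFrame, column names, and the 0-120 validity range are hypothetical choices for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with an impossible age caused by a data-entry error
df = pd.DataFrame({"age": [34.0, 29.0, 41.0, 150.0, 38.0],
                   "income": [52, 48, 61, 58, 55]})

mask = (df["age"] < 0) | (df["age"] > 120)  # domain-based validity check

# Option 1: drop the entire row
df_rows_removed = df[~mask]

# Option 2: blank out only the offending value, keeping the rest of the row
df_value_removed = df.copy()
df_value_removed.loc[mask, "age"] = np.nan
```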
2. Transform Your Data
If outliers are genuine data points but still distort the analysis, applying a transformation to the dataset can be an effective solution. Transformation compresses the range of values, reducing the relative influence of extreme points on the analysis. As an added benefit, it often improves the normality of the data distribution, making it more suitable for parametric tests. However, transformations can obscure the original scale or relationships in the data, which may matter in certain analyses.
There is a wide range of common transformations, each suited to a specific scenario and each with its own trade-offs. For example, a log transformation is useful for right-skewed data, while a square root transformation reduces the impact of larger values while preserving most of the information about the data's scale. More advanced methods such as the Box-Cox transformation can be used as well.
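A brief sketch comparing these three transforms on synthetic right-skewed data (the lognormal sample is an assumption for illustration; note that both the log and Box-Cox transforms require strictly positive values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.lognormal(mean=3, sigma=0.8, size=500)  # right-skewed sample

x_log = np.log(x)                # log transform: strong compression of large values
x_sqrt = np.sqrt(x)              # square root: milder compression
x_boxcox, lam = stats.boxcox(x)  # Box-Cox: fits the best power transform to the data

print(f"Skewness  raw: {stats.skew(x):.2f}  log: {stats.skew(x_log):.2f}  "
      f"sqrt: {stats.skew(x_sqrt):.2f}  Box-Cox: {stats.skew(x_boxcox):.2f} "
      f"(lambda={lam:.2f})")
```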
3. Impute Outliers
Instead of removing outliers, you can also consider replacing them with an imputed value. This technique is common when dealing with missing data but can also address outliers. It is particularly useful when the outliers are data entry errors and an approximation of the true value can be determined from the rest of the data.
The easiest imputation method is to replace the outlier with the mean or median of the variable. More advanced methods use predictive modeling: a linear regression (or a logistic regression for categorical variables) fitted on the other variables can predict a replacement value, and machine learning models such as k-nearest neighbors (KNN) can also estimate a more reasonable value for the outlier.
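One common pattern is to mark the outlier as missing and then impute it. The sketch below shows median imputation and scikit-learn's KNNImputer on the same hypothetical data as before:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({"age": [34.0, 29.0, 41.0, 150.0, 38.0],
                   "income": [52, 48, 61, 58, 55]})

# Mark the implausible value as missing, then impute it
df.loc[df["age"] > 120, "age"] = np.nan

# Simple option: median imputation
median_imputed = df["age"].fillna(df["age"].median())

# Model-based option: KNN uses the remaining columns to estimate a plausible value
knn = KNNImputer(n_neighbors=2)
df_knn = pd.DataFrame(knn.fit_transform(df), columns=df.columns)
```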
4. Segment Your Data
Sometimes, an entire group of entries stands apart from the rest of the distribution, such as a cloud of points appearing separately from the rest of the data in a scatterplot. In these scenarios, it can be beneficial to segment your data and explore each subset independently. You can then compare results across the segments to see whether findings are similar, or whether some subgroups should be treated differently from the rest of the population.
This is a common method in customer analytics. High-spending customers may be outliers when looking at the overall dataset, but segmenting these customers and analyzing them independently may offer unique insights that aren't observed when analyzing the entire dataset.
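As a rough illustration, the sketch below splits a synthetic spend distribution at its 95th percentile; in practice the boundary should come from domain knowledge rather than an arbitrary quantile:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# Synthetic spend data: a large typical group plus a small high-spend cloud
spend = np.concatenate([rng.gamma(2, 50, 950), rng.gamma(2, 500, 50)])
df = pd.DataFrame({"spend": spend})

# Segment on a threshold (here, the 95th percentile as an illustration)
cutoff = df["spend"].quantile(0.95)
df["segment"] = np.where(df["spend"] > cutoff, "high_spend", "typical")

# Analyze each segment independently and compare summary statistics
print(df.groupby("segment")["spend"].agg(["count", "mean", "median"]))
```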
5. Analyze with Robust Methods
When outliers are legitimate data points and it is not appropriate to remove or transform them, robust statistical methods can be used to minimize their impact on analysis. Many statistical methods require that the input data be normally distributed, but there are also non-parametric alternatives that can be used when this assumption is not valid.
These more advanced statistical techniques, such as median-based measures, robust regression, and tree-based models, provide more reliable results in the presence of extreme values. They are commonly used in fields like healthcare and finance, where informative outliers are more likely to be present in datasets.
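To illustrate the difference, the sketch below contrasts ordinary least squares with scikit-learn's HuberRegressor, a robust regression that downweights extreme residuals; the data and the injected outliers are synthetic:

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, 100).reshape(-1, 1)
y = 2 * X.ravel() + rng.normal(0, 1, 100)  # true slope is 2
y[:5] += 40                                # inject a few extreme outliers

ols = LinearRegression().fit(X, y)    # ordinary least squares: pulled by outliers
huber = HuberRegressor().fit(X, y)    # robust regression: downweights extremes

print(f"OLS slope:   {ols.coef_[0]:.2f}")    # noticeably biased away from 2
print(f"Huber slope: {huber.coef_[0]:.2f}")  # closer to the true slope
```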
Summary
Outliers can complicate data analysis by skewing results. But by understanding their impact and following a structured approach to dealing with them, you can improve the quality and accuracy of your analysis. Whatever method you use to handle outliers, it is critical to document your decisions to ensure reproducibility and transparency in your analytical process.
Mehrnaz Siavoshi holds a Master's in Data Analytics and is a full-time biostatistician working on complex machine learning development and statistical analysis in healthcare. She has experience with AI and has taught university courses in biostatistics and machine learning at the University of the People.