DATA PREPROCESSING
Artificially generating and deleting data for the greater good
⛳️ More DATA PREPROCESSING, explained:
· Missing Value Imputation
· Categorical Encoding
· Data Scaling
· Discretization
▶ Oversampling & Undersampling
Collecting a dataset where each class has exactly the same number of examples can be a challenge. In reality, things are rarely perfectly balanced, and when you are building a classification model this can be an issue. When a model is trained on such a dataset, where one class has many more examples than another, it usually becomes better at predicting the bigger groups and worse at predicting the smaller ones. To help with this issue, we can use tactics like oversampling and undersampling: creating more examples of the smaller group or removing some examples from the bigger group.
There are many different oversampling and undersampling methods out there (with intimidating names like SMOTE, ADASYN, and Tomek Links), but there don't seem to be many resources that visually compare how they work. So here, we will use one simple 2D dataset to show the changes that occur in the data after applying these methods, so we can see how different each method's output is. You will see in the visuals that these various approaches give different solutions, and who knows, one might be suitable for your specific machine learning challenge!
Oversampling
Oversampling makes a dataset more balanced when one group has far fewer examples than the other. It works by making more copies of the examples from the smaller group. This helps the dataset represent both groups more equally.
Undersampling
On the other hand, undersampling works by deleting some of the examples from the bigger group until it's almost the same size as the smaller group. The final dataset is smaller, sure, but both groups will have a more similar number of examples.
Hybrid Sampling
Combining oversampling and undersampling is often called "hybrid sampling". It increases the size of the smaller group by making more copies of its examples, and at the same time it shrinks the bigger group by removing some of its examples. The result is a dataset that is more balanced, not too big and not too small.
Let's use a simple artificial golf dataset to show both oversampling and undersampling. This dataset shows what kind of golf activity a person does under particular weather conditions.
⚠️ Note that while this small dataset is good for understanding the concepts, in real applications you’d want much larger datasets before applying these techniques, as sampling with too little data can lead to unreliable results.
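To make the method examples below easier to follow, here is a minimal sketch of a hypothetical two-feature golf dataset in Python. The column names, labels, and class counts are made up for illustration and are not the exact dataset behind the visuals.

```python
import numpy as np
import pandas as pd

# Hypothetical imbalanced golf dataset: two weather features and an
# activity label, with far more "play" rows than "rest" rows.
rng = np.random.default_rng(42)
n_major, n_minor = 30, 8  # majority vs. minority class sizes (illustrative)

X = pd.DataFrame({
    "Temperature": np.concatenate([rng.normal(70, 8, n_major),
                                   rng.normal(62, 8, n_minor)]),
    "Humidity":    np.concatenate([rng.normal(55, 12, n_major),
                                   rng.normal(68, 12, n_minor)]),
})
y = pd.Series(["play"] * n_major + ["rest"] * n_minor, name="Activity")

print(y.value_counts())  # play: 30, rest: 8 -> clearly imbalanced
```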
Random Oversampling
Random Oversampling is a simple way to make the smaller group bigger. It works by making duplicates of the examples from the smaller group until all the classes are balanced.
👍 Best for very small datasets that need to be balanced quickly
👎 Not recommended for complicated datasets
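A minimal sketch of random oversampling on the toy X and y defined above, assuming the imbalanced-learn package (the article doesn't name a library; this is just one common implementation):

```python
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

# Randomly duplicate minority-class rows until both classes are the same size.
ros = RandomOverSampler(random_state=42)
X_ros, y_ros = ros.fit_resample(X, y)

print(Counter(y_ros))  # both classes now have 30 rows
```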
SMOTE
SMOTE (Synthetic Minority Over-sampling Technique) is an oversampling technique that makes new examples by interpolating between examples of the smaller group. Unlike random oversampling, it doesn't just copy what's there; it uses the existing examples of the smaller group to generate new points between them.
👍 Best when you have a decent amount of examples to work with and need variety in your data
👎 Not recommended if you have very few examples
👎 Not recommended if data points are too scattered or noisy
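Here is how the same idea might look with imbalanced-learn's SMOTE on the hypothetical dataset; k_neighbors is kept small because the minority class only has a handful of rows:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

# SMOTE needs k_neighbors to be smaller than the minority class, so keep it low here.
smote = SMOTE(k_neighbors=3, random_state=42)
X_sm, y_sm = smote.fit_resample(X, y)

print(Counter(y_sm))  # classes balanced using synthetic (interpolated) points
```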
ADASYN
ADASYN (Adaptive Synthetic) is like SMOTE but focuses on making new examples in the harder-to-learn parts of the smaller group. It finds the examples that are trickiest to classify and makes more new points around those. This helps the model better understand the challenging areas.
👍 Best when some parts of your data are harder to classify than others
👍 Best for complex datasets with challenging areas
👎 Not recommended if your data is fairly simple and straightforward
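A sketch of ADASYN on the same toy data, again assuming imbalanced-learn. Note that ADASYN needs some minority points to have majority-class neighbors, so it can refuse to run on perfectly separated data:

```python
from collections import Counter
from imblearn.over_sampling import ADASYN

# ADASYN puts more synthetic points near minority examples that are
# surrounded by majority neighbors (the harder-to-learn regions).
adasyn = ADASYN(n_neighbors=3, random_state=42)
X_ada, y_ada = adasyn.fit_resample(X, y)

print(Counter(y_ada))  # roughly balanced; counts may not be exactly equal
```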
Undersampling shrinks the bigger group to make it closer in size to the smaller group. There are several ways of doing this:
Random Undersampling
Random Undersampling removes examples from the bigger group at random until it's the same size as the smaller group. Just like random oversampling, the method is pretty simple, but it might get rid of important information that really shows how different the groups are.
👍 Best for very large datasets with lots of repetitive examples
👍 Best when you need a quick, simple fix
👎 Not recommended if every example in your bigger group is important
👎 Not recommended if you can’t afford losing any information
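A minimal sketch of random undersampling on the hypothetical X and y, again with imbalanced-learn:

```python
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

# Randomly drop majority rows until both classes are the same size.
rus = RandomUnderSampler(random_state=42)
X_rus, y_rus = rus.fit_resample(X, y)

print(Counter(y_rus))  # both classes now have 8 rows
```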
Tomek Links
Tomek Links is an undersampling method that makes the “lines” between groups clearer. It searches for pairs of examples from different groups that are really alike. When it finds a pair where the examples are each other’s closest neighbors but belong to different groups, it gets rid of the example from the bigger group.
👍 Best when your groups overlap too much
👍 Best for cleaning up messy or noisy data
👍 Best when you need clear boundaries between groups
👎 Not recommended if your groups are already well separated
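Tomek-link cleaning on the toy data might look like this (imbalanced-learn again). Unlike the samplers above, it only removes boundary points, so it usually does not fully balance the classes:

```python
from collections import Counter
from imblearn.under_sampling import TomekLinks

# Drop the majority member of every Tomek link (a cross-class pair of
# mutual nearest neighbors); only boundary points are removed.
tl = TomekLinks()
X_tl, y_tl = tl.fit_resample(X, y)

print(Counter(y_tl))  # only a few majority rows disappear, so classes stay imbalanced
```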
Near Miss
Near Miss is a set of undersampling techniques, each based on a different rule:
- Near Miss-1: Keeps the examples from the bigger group whose average distance to their closest neighbors in the smaller group is smallest.
- Near Miss-2: Keeps the examples from the bigger group whose average distance to their farthest neighbors in the smaller group is smallest.
- Near Miss-3: For each example in the smaller group, keeps a fixed number of its closest neighbors from the bigger group.
The main idea here is to keep the most informative examples from the bigger group and get rid of the ones that aren’t as important.
👍 Best when you want control over which examples to keep
👎 Not recommended if you need a simple, quick solution
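A sketch of Near Miss on the hypothetical data, assuming imbalanced-learn, where the version parameter picks which of the three rules is used:

```python
from collections import Counter
from imblearn.under_sampling import NearMiss

# version=1, 2, or 3 selects which Near Miss rule is applied.
nm = NearMiss(version=1)
X_nm, y_nm = nm.fit_resample(X, y)

print(Counter(y_nm))  # majority reduced to the size of the minority class
```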
ENN
The Edited Nearest Neighbors (ENN) method gets rid of examples that are probably noise or outliers. For each example in the bigger group, it checks whether most of its closest neighbors belong to the same group. If they don't, it removes that example. This helps create cleaner boundaries between the groups.
👍 Best for cleaning up messy data
👍 Best when you need to remove outliers
👍 Best for creating cleaner group boundaries
👎 Not recommended if your data is already clean and well-organized
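ENN on the same toy data could look like this with imbalanced-learn:

```python
from collections import Counter
from imblearn.under_sampling import EditedNearestNeighbours

# Remove majority examples whose nearest neighbors mostly carry a different label.
enn = EditedNearestNeighbours(n_neighbors=3)
X_enn, y_enn = enn.fit_resample(X, y)

print(Counter(y_enn))  # ENN cleans the data but does not force exact balance
```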
SMOTETomek
SMOTETomek works by first creating new examples for the smaller group using SMOTE, then cleaning up messy boundaries by removing "confusing" examples using Tomek Links. This helps create a more balanced dataset with clearer boundaries and less noise.
👍 Best for severely imbalanced data
👍 Best when you need both more examples and cleaner boundaries
👍 Best when dealing with noisy, overlapping groups
👎 Not recommended if your data is already clean and well-organized
👎 Not recommended for small datasets
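A sketch of SMOTETomek on the hypothetical data, assuming imbalanced-learn; the inner SMOTE gets a small k_neighbors because of the tiny minority class:

```python
from collections import Counter
from imblearn.combine import SMOTETomek
from imblearn.over_sampling import SMOTE

# SMOTE first (small k_neighbors for the tiny minority class),
# then Tomek-link cleaning of the boundary.
smt = SMOTETomek(smote=SMOTE(k_neighbors=3, random_state=42), random_state=42)
X_smt, y_smt = smt.fit_resample(X, y)

print(Counter(y_smt))
```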
SMOTEENN
SMOTEENN works by first creating new examples for the smaller group using SMOTE, then cleaning up both groups by removing examples that don’t fit well with their neighbors using ENN. Just like SMOTETomek, this helps create a cleaner dataset with clearer borders between the groups.
👍 Best for cleaning up both groups at once
👍 Best when you need more examples but cleaner data
👍 Best when dealing with lots of outliers
👎 Not recommended if your data is already clean and well-organized
👎 Not recommended for small datasets
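And a matching sketch for SMOTEENN, under the same assumptions:

```python
from collections import Counter
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE

# SMOTE first, then ENN removes examples from both groups that clash
# with their neighbors.
sme = SMOTEENN(smote=SMOTE(k_neighbors=3, random_state=42), random_state=42)
X_sme, y_sme = sme.fit_resample(X, y)

print(Counter(y_sme))
```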