DATA PREPROCESSING
Artificially generating and deleting data for the greater good
⛳️ More DATA PREPROCESSING, explained:
· Missing Value Imputation
· Categorical Encoding
· Data Scaling
· Discretization
▶ Oversampling & Undersampling
Collecting a dataset where each class has exactly the same number of examples can be a challenge. In reality, things are rarely perfectly balanced, and when you are building a classification model this can be an issue. When a model is trained on such a dataset, where one class has many more examples than another, it usually becomes better at predicting the bigger groups and worse at predicting the smaller ones. To help with this issue, we can use tactics like oversampling and undersampling: creating more examples of the smaller group or removing some examples from the bigger group.
There are many different oversampling and undersampling methods out there (with intimidating names like SMOTE, ADASYN, and Tomek Links), but there don't seem to be many resources that visually compare how they work. So here, we will use one simple 2D dataset to show the changes that occur in the data after applying these methods, so we can see how different each method's output is. You will see in the visuals that these various approaches give different solutions, and who knows, one might be suitable for your specific machine learning challenge!
Oversampling
Oversampling makes a dataset more balanced when one group has far fewer examples than the other. It works by making more copies of the examples from the smaller group. This helps the dataset represent both groups more equally.
Undersampling
On the other hand, undersampling works by deleting some of the examples from the bigger group until it's almost the same size as the smaller group. The final dataset is smaller, sure, but both groups will have a more similar number of examples.
Hybrid Sampling
Combining oversampling and undersampling is often called "hybrid sampling". It increases the size of the smaller group by making more copies of its examples, and at the same time it shrinks the bigger group by removing some of its examples. The result is a dataset that is more balanced, not too big and not too small.
Let's use a simple artificial golf dataset to show both oversampling and undersampling. This dataset shows what kind of golf activity a person does under particular weather conditions.
⚠️ Note that while this small dataset is good for understanding the concepts, in real applications you’d want much larger datasets before applying these techniques, as sampling with too little data can lead to unreliable results.
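To make the method examples below easier to follow, here is a minimal sketch of a hypothetical two-feature golf dataset in Python. The column names, labels, and class counts are made up for illustration and are not the exact dataset behind the visuals.

```python
import numpy as np
import pandas as pd

# Hypothetical imbalanced golf dataset: two weather features and an
# activity label, with far more "play" rows than "rest" rows.
rng = np.random.default_rng(42)
n_major, n_minor = 30, 8  # majority vs. minority class sizes (illustrative)

X = pd.DataFrame({
    "Temperature": np.concatenate([rng.normal(70, 8, n_major),
                                   rng.normal(62, 8, n_minor)]),
    "Humidity":    np.concatenate([rng.normal(55, 12, n_major),
                                   rng.normal(68, 12, n_minor)]),
})
y = pd.Series(["play"] * n_major + ["rest"] * n_minor, name="Activity")

print(y.value_counts())  # play: 30, rest: 8 -> clearly imbalanced
```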
Random Oversampling
Random Oversampling is a simple way to make the smaller group bigger. It works by making duplicates of the examples from the smaller group until all the classes are balanced.
👍 Best for very small datasets that need to be balanced quickly
👎 Not recommended for complicated datasets
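A minimal sketch of random oversampling on the toy X and y defined above, assuming the imbalanced-learn package (the article doesn't name a library; this is just one common implementation):

```python
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

# Randomly duplicate minority-class rows until both classes are the same size.
ros = RandomOverSampler(random_state=42)
X_ros, y_ros = ros.fit_resample(X, y)

print(Counter(y_ros))  # both classes now have 30 rows
```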
SMOTE
SMOTE (Synthetic Minority Over-sampling Technique) is an oversampling technique that makes new examples by interpolating between examples of the smaller group. Unlike random oversampling, it doesn't just copy what's there; it uses the existing examples of the smaller group to generate new points between them.
👍 Best when you have a decent amount of examples to work with and need variety in your data
👎 Not recommended if you have very few examples
👎 Not recommended if data points are too scattered or noisy
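Here is how the same idea might look with imbalanced-learn's SMOTE on the hypothetical dataset; k_neighbors is kept small because the minority class only has a handful of rows:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

# SMOTE needs k_neighbors to be smaller than the minority class, so keep it low here.
smote = SMOTE(k_neighbors=3, random_state=42)
X_sm, y_sm = smote.fit_resample(X, y)

print(Counter(y_sm))  # classes balanced using synthetic (interpolated) points
```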
ADASYN
ADASYN (Adaptive Synthetic) is like SMOTE but focuses on making new examples in the harder-to-learn parts of the smaller group. It finds the examples that are trickiest to classify and makes more new points around those. This helps the model better understand the challenging areas.
👍 Best when some parts of your data are harder to classify than others
👍 Best for complex datasets with challenging areas
👎 Not recommended if your data is fairly simple and straightforward
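A sketch of ADASYN on the same toy data, again assuming imbalanced-learn. Note that ADASYN needs some minority points to have majority-class neighbors, so it can refuse to run on perfectly separated data:

```python
from collections import Counter
from imblearn.over_sampling import ADASYN

# ADASYN puts more synthetic points near minority examples that are
# surrounded by majority neighbors (the harder-to-learn regions).
adasyn = ADASYN(n_neighbors=3, random_state=42)
X_ada, y_ada = adasyn.fit_resample(X, y)

print(Counter(y_ada))  # roughly balanced; counts may not be exactly equal
```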
Undersampling shrinks the bigger group to make it closer in size to the smaller group. There are several ways of doing this:
Random Undersampling
Random Undersampling removes examples from the bigger group at random until it's the same size as the smaller group. Just like random oversampling, the method is pretty simple, but it might get rid of important information that really shows how different the groups are.
👍 Best for very large datasets with lots of repetitive examples
👍 Best when you need a quick, simple fix
👎 Not recommended if every example in your bigger group is important
👎 Not recommended if you can’t afford losing any information
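A minimal sketch of random undersampling on the hypothetical X and y, again with imbalanced-learn:

```python
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

# Randomly drop majority rows until both classes are the same size.
rus = RandomUnderSampler(random_state=42)
X_rus, y_rus = rus.fit_resample(X, y)

print(Counter(y_rus))  # both classes now have 8 rows
```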
Tomek Links
Tomek Links is an undersampling method that makes the “lines” between groups clearer. It searches for pairs of examples from different groups that are really alike. When it finds a pair where the examples are each other’s closest neighbors but belong to different groups, it gets rid of the example from the bigger group.
👍 Best when your groups overlap too much
👍 Best for cleaning up messy or noisy data
👍 Best when you need clear boundaries between groups
👎 Not recommended if your groups are already well separated
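Tomek-link cleaning on the toy data might look like this (imbalanced-learn again). Unlike the samplers above, it only removes boundary points, so it usually does not fully balance the classes:

```python
from collections import Counter
from imblearn.under_sampling import TomekLinks

# Drop the majority member of every Tomek link (a cross-class pair of
# mutual nearest neighbors); only boundary points are removed.
tl = TomekLinks()
X_tl, y_tl = tl.fit_resample(X, y)

print(Counter(y_tl))  # only a few majority rows disappear, so classes stay imbalanced
```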
Near Miss
Near Miss is a set of undersampling techniques, each based on a different rule:
- Near Miss-1: Keeps the examples from the bigger group whose average distance to their closest neighbors in the smaller group is smallest.
- Near Miss-2: Keeps the examples from the bigger group whose average distance to their farthest neighbors in the smaller group is smallest.
- Near Miss-3: For each example in the smaller group, keeps a fixed number of its closest neighbors from the bigger group.
The main idea here is to keep the most informative examples from the bigger group and get rid of the ones that aren’t as important.
👍 Best when you want control over which examples to keep
👎 Not recommended if you need a simple, quick solution
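A sketch of Near Miss on the hypothetical data, assuming imbalanced-learn, where the version parameter picks which of the three rules is used:

```python
from collections import Counter
from imblearn.under_sampling import NearMiss

# version=1, 2, or 3 selects which Near Miss rule is applied.
nm = NearMiss(version=1)
X_nm, y_nm = nm.fit_resample(X, y)

print(Counter(y_nm))  # majority reduced to the size of the minority class
```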
ENN
The Edited Nearest Neighbors (ENN) method gets rid of examples that are probably noise or outliers. For each example in the bigger group, it checks whether most of its closest neighbors belong to the same group. If they don't, it removes that example. This helps create cleaner boundaries between the groups.
👍 Best for cleaning up messy data
👍 Best when you need to remove outliers
👍 Best for creating cleaner group boundaries
👎 Not recommended if your data is already clean and well-organized
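ENN on the same toy data could look like this with imbalanced-learn:

```python
from collections import Counter
from imblearn.under_sampling import EditedNearestNeighbours

# Remove majority examples whose nearest neighbors mostly carry a different label.
enn = EditedNearestNeighbours(n_neighbors=3)
X_enn, y_enn = enn.fit_resample(X, y)

print(Counter(y_enn))  # ENN cleans the data but does not force exact balance
```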
SMOTETomek
SMOTETomek works by first creating new examples for the smaller group using SMOTE, then cleaning up messy boundaries by removing "confusing" examples using Tomek Links. This helps create a more balanced dataset with clearer boundaries and less noise.
👍 Best for severely imbalanced data
👍 Best when you need both more examples and cleaner boundaries
👍 Best when dealing with noisy, overlapping groups
👎 Not recommended if your data is already clean and well-organized
👎 Not recommended for small datasets
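A sketch of SMOTETomek on the hypothetical data, assuming imbalanced-learn; the inner SMOTE gets a small k_neighbors because of the tiny minority class:

```python
from collections import Counter
from imblearn.combine import SMOTETomek
from imblearn.over_sampling import SMOTE

# SMOTE first (small k_neighbors for the tiny minority class),
# then Tomek-link cleaning of the boundary.
smt = SMOTETomek(smote=SMOTE(k_neighbors=3, random_state=42), random_state=42)
X_smt, y_smt = smt.fit_resample(X, y)

print(Counter(y_smt))
```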
SMOTEENN
SMOTEENN works by first creating new examples for the smaller group using SMOTE, then cleaning up both groups by removing examples that don’t fit well with their neighbors using ENN. Just like SMOTETomek, this helps create a cleaner dataset with clearer borders between the groups.
👍 Best for cleaning up both groups at once
👍 Best when you need more examples but cleaner data
👍 Best when dealing with lots of outliers
👎 Not recommended if your data is already clean and well-organized
👎 Not recommended for small datasets
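And a matching sketch for SMOTEENN, under the same assumptions:

```python
from collections import Counter
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE

# SMOTE first, then ENN removes examples from both groups that clash
# with their neighbors.
sme = SMOTEENN(smote=SMOTE(k_neighbors=3, random_state=42), random_state=42)
X_sme, y_sme = sme.fit_resample(X, y)

print(Counter(y_sme))
```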