ML Classification: Fundamentals for building an efficient classification model


Classification is one of the two most common types of problems in Machine Learning. Even though there are various algorithms for solving classification problems, the fundamentals of developing an efficient classification model are the same regardless of the problem statement.

In this article, I want to introduce readers to the fundamental concepts and performance measurement techniques used for building a highly efficient classification model.

Imagine you work for a company that builds an app to help people find pets they can adopt. Suppose the app currently only allows users to adopt a dog or a cat. Your team wants to build a service that helps users find either all dogs or all cats based on selection filters like location, breed, etc. Your job is to build a model that takes a picture of a pet as input and classifies it as a dog or a cat, so that the app can use this service to show users what they are looking for.

This is a classic classification problem, where your model takes a sample image as input and outputs a label from a set of allowed labels. In this case, the input is a photo of a pet and the output is a label: either dog or cat.

A picture of a dog on the left and a cat on the right looking at each other.
Source: https://www.kaggle.com/c/dogs-vs-cats
  1. Binary Classification: In these problems, the output label can take only two values, generally called True and False (or 1 and 0). For example, our dog-vs-cat problem can be turned into a binary classification problem by rewording the problem statement: build a model that takes a picture as input and outputs 1 if it is a picture of a dog, and 0 otherwise. In our case, since we only have photos of dogs and cats, a predicted value of 0 means the photo is not of a dog; hence, it must be a cat.
  2. Multi-class Classification: In these problems, the output label set has a size greater than two. For example, what if our collection of pet photos includes dogs, cats and birds? In this case, each photo can be of a dog, a cat or a bird. Note that a picture can still only belong to one class.
  3. Multi-label Classification: This is an extension of binary classification where, instead of a single label with two possible values, we have more than one such label. For example, what if our model predicts two things: whether a picture is of a dog or not, and whether it is a puppy or not? In that case, for an image of a grown-up dog, the model’s output would be [1, 0]: the 1 means it is a dog and the 0 means it is not a puppy.
  4. Multi-output Classification: This is an extension of multi-class classification where, instead of a single label set with more than two possible values, we have more than one such label set. (A small sketch of what labels look like in each of these settings follows below.)
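To make the difference between these settings concrete, here is a minimal illustrative sketch of how the output label for a single sample could be represented in each case (the values and the extra age-group label set are made up for illustration, not taken from a real dataset):

# Illustrative label representations for a single sample in each setting.
binary_label = 1                      # binary: 1 = dog, 0 = not a dog
multi_class_label = "bird"            # multi-class: exactly one of "dog", "cat", "bird"
multi_label = [1, 0]                  # multi-label: [is_dog, is_puppy], each 0 or 1
multi_output = ["dog", "adult"]       # multi-output: one value per label set, e.g. a species
                                      # label and a hypothetical age-group label
                                      # ("puppy" / "adult" / "senior")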

Imagine you have a model ready that can predict whether an input image is of a dog or not.

A picture of a dog.
Source: https://www.nylabone.com/dog101/10-intelligent-dog-breeds

For the above image:

  • if your model predicts it as a dog, then the output value will be called a True Positive.
  • if your model predicts it as not a dog, then the output value will be called a False Negative.

And for the image below:

  • if your model predicts it as not a dog, then the output value will be called a True Negative.
  • if your model predicts it as a dog, then the output value will be called a False Positive.
A picture of a cat.
Source: https://www.cats.org.uk/cats-blog/9-things-to-know-before-getting-your-first-cat

Let’s take the previous model that predicts whether an image is of a dog or not. Now, assume you have 1,000 images of dogs and cats. Out of these 1,000 images, 500 are of dogs and 500 are of cats.

We run these images through the model, and it predicts the following:

  1. Out of 500 dog images, 400 of those were correctly predicted as dogs. As we learned in the previous section, these are True Positives.
  2. Out of 500 dog images, 100 of those were incorrectly predicted as not dogs. These are False Negatives.
  3. Out of 500 cat images, 300 of those were correctly predicted as not dogs. These are True Negatives.
  4. Out of 500 cat images, 200 of those were incorrectly predicted as dogs. These are False Positives.
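Together, these four counts form what is called a confusion matrix. Below is a minimal sketch of how they could be computed with scikit-learn; the label arrays here are built artificially just to reproduce the counts above:

import numpy as np
from sklearn.metrics import confusion_matrix

# 1 = dog (positive class), 0 = not a dog (negative class).
y_true = np.array([1] * 500 + [0] * 500)
y_pred = np.array([1] * 400 + [0] * 100 + [0] * 300 + [1] * 200)

# For labels [0, 1], confusion_matrix returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fn, tn, fp)  # 400 100 300 200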

Accuracy

Accuracy is the metric that measures the proportion of correct predictions out of all predictions.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

The accuracy metric tells us how accurate our classifier is overall, considering correct identification of both positives and negatives.
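Using the example counts from the previous section (TP = 400, TN = 300, FP = 200, FN = 100):

Accuracy = (400 + 300) / (400 + 300 + 200 + 100) = 700 / 1000 = 0.70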

Using accuracy alone to measure a model’s performance can be misleading. Assume there is a skewed dataset in which 95% of the samples belong to the negative class. In that case, a very dumb classifier that always predicts negative will still have an accuracy of 95%.

Precision

Precision measures the proportion of true positive predictions out of all positive predictions made by the model.

Precision = TP / (TP + FP)

In simpler terms, precision answers the question: Of all the instances that the model predicted as positive, how many were actually positive?
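For the running dog-vs-cat example (TP = 400, FP = 200):

Precision = 400 / (400 + 200) ≈ 0.67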

Precision is useful when the cost of false positives is high, such as in the face recognition systems that secure our digital devices. If the system falsely identifies someone else’s face as yours and unlocks your phone, that can cause a lot of trouble.

Recall

Recall is a metric that measures the proportion of actual positive instances that were correctly identified by the model. It is also known as sensitivity or the true positive rate.

Recall = TP / (TP + FN)

Recall gives a sense of the model’s ability to correctly label positive samples as positive. If there are 500 positive samples in total and the model predicts 490 of them as positive and only 10 as negative (false negatives), its recall is 0.98. Whereas, if the model predicts only 250 of the positive samples as positive and the remaining 250 as negative, its recall is just 0.5.
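For the running dog-vs-cat example (TP = 400, FN = 100), Recall = 400 / (400 + 100) = 0.80.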

Recall is particularly important when missing positive cases is costly, such as in medical diagnosis. A model with low recall is not suitable for such use cases: failing to flag a sample that indicates an underlying medical condition has a significant impact on the patient’s life.

F1 Score

The F1 score is a metric that combines both precision and recall into a single number. The key idea behind the F1 score is that if either the precision or the recall of a model is low, its F1 score should also be low.

A high F1 score indicates both high precision and high recall. A low F1 score can mean low precision, low recall, or both.

The F1 score is the harmonic mean of precision and recall. The harmonic mean is used rather than the arithmetic mean because it is pulled down sharply whenever either precision or recall is low.

F1 = 2 * (Precision * Recall) / (Precision + Recall)
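For the running dog-vs-cat example, with Precision ≈ 0.67 and Recall = 0.80:

F1 = 2 * (0.67 * 0.80) / (0.67 + 0.80) ≈ 0.73

The same numbers can be obtained with scikit-learn’s metric functions, reusing the y_true and y_pred arrays from the earlier sketch:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print(accuracy_score(y_true, y_pred))   # 0.70
print(precision_score(y_true, y_pred))  # ~0.67
print(recall_score(y_true, y_pred))     # 0.80
print(f1_score(y_true, y_pred))         # ~0.73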

Precision-Recall Trade-off

Precision and recall exist in an inverse relationship: in most cases, an attempt to improve one results in a decrease in the other.

Let’s try to understand why these two metrics have an inverse relationship.

Precision is defined as the ratio of the count of true positives to all positive predictions.

Precision = TP / (TP + FP)

A model tuned for high precision will be very careful when predicting positives: it will try to keep its false positive count as low as possible and predict positive only when it is very confident. For samples about which it is not confident, it will predict negative rather than risk a false positive. In other words, it will label a large number of actual positives as negatives (false negatives) in order to avoid increasing the false positive count and to maintain high precision.

But when the model labels a large number of positive samples as negative, because it is tuned to predict positive only when it is very confident, the count of false negatives goes up. And if you remember:

Recall = TP / (TP + FN)

The higher the count of false negatives, the lower the recall.

Thus, tuning a model for high precision will automatically cause its recall to drop.

  • Accuracy can be misleading.
  • Depending solely on precision or recall can also be misleading.
  • F1 score is a good measure that combines both precision and recall. However, if the F1 score is low, it doesn’t tell us whether that is because of low precision, low recall, or both.
  • The best strategy for fine-tuning a model based on the problem statement is to use both precision and recall.

Classification models usually have an internal threshold. This threshold is used by the model to tag a sample as positive or negative.

The decision logic internally looks something like this:

def get_output_label(predicted_value, threshold=0.5):
    """
    Return whether the predicted value belongs to the positive or the negative class.

    Parameters:
    predicted_value (float): The predicted numeric score from the model.
    threshold (float): The decision threshold; scores above it are labelled positive.
                       (The 0.5 default here is only illustrative.)

    Returns:
    bool: True if the predicted value belongs to the positive class,
          otherwise False (negative class).
    """
    if predicted_value > threshold:
        return True
    else:
        return False
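For example, with the illustrative default threshold of 0.5, get_output_label(0.73) returns True (positive class) and get_output_label(0.31) returns False (negative class).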

The higher this threshold, the more precise the model will be. Conversely, a lower threshold will result in higher recall but lower precision.

A graph of Recall-Precision versus Threshold is generally used to determine the threshold that works best for a particular use case.

Below is a sample graph that contains the precision and recall values of a model for various threshold values.

Precision-Recall vs Threshold

A graph similar to the one above is used to determine the threshold that works best for a particular classification model. A model that requires high precision should be set with a higher threshold than one that requires higher recall.
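Below is a sketch of how such a graph can be produced with scikit-learn, assuming a fitted classifier clf that provides a decision_function, and held-out data X_test and y_test (these names are placeholders, not from the original example):

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

scores = clf.decision_function(X_test)  # one decision score per sample
precisions, recalls, thresholds = precision_recall_curve(y_test, scores)

# precisions and recalls have one more entry than thresholds, so drop the last value.
plt.plot(thresholds, precisions[:-1], label="Precision")
plt.plot(thresholds, recalls[:-1], label="Recall")
plt.xlabel("Threshold")
plt.legend()
plt.show()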

Scikit-Learn does not expose this threshold directly. Instead, it exposes a decision score for each input sample, and that score can be used to apply a threshold of your own choosing.
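For instance, instead of calling clf.predict, one can threshold the decision scores directly (a sketch reusing the placeholder names above; the threshold value is purely illustrative and would normally be read off a graph like the one shown earlier):

scores = clf.decision_function(X_test)

chosen_threshold = 2.0
y_pred_custom = scores > chosen_threshold  # boolean array of positive/negative predictions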

In the next article, we will go through code samples that will help us practically understand these metrics and their application for fine-tuning a model based on our problem statement and requirements.

If you made it to the end of this article and found it helpful, please show your appreciation and support for this post through a clap.

Thank you!
