Performance Metrics of Machine Learning Models — Accuracy, Precision, Recall and F1 Score | by Ali Naeem Chaudhry | May, 2024


The simplest way to assess the performance of a machine learning model is to calculate its Accuracy which is the ratio of the number of correct predictions to the total number of predictions made:
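    Accuracy = (Number of Correct Predictions) / (Total Number of Predictions)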

Below is an example of a cat classifier:

Cat Classifier

Accuracy doesn’t give a fair estimate of a model’s performance in every scenario, especially in the case of skewed datasets. A skewed dataset is one whose classes have a highly uneven number of instances.

Let’s take the example of a rare disease classifier whose test set is highly skewed:

Skewed Test Set of Rare Disease Classifier

The test set contains 900 “NEGATIVE” class examples which represent the absence of disease and 100 “POSITIVE” class instances which represent the presence of disease.

Suppose the classifier always predicts the “NEGATIVE” class irrespective of the input, which means that the classifier has no intelligence at all. Let’s see whether Accuracy points out that this model is good for nothing:

Since the classifier always predicts the “NEGATIVE” class, it ends up making 900 correct predictions out of the total 1000, which gives an Accuracy of 90%. This is an instance where Accuracy misrepresents a lousy algorithm as excellent.
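To see this concretely, here is a minimal sketch of the calculation. The use of scikit-learn is my own assumption for illustration; the 900/100 split is the one from the test set above.

```python
from sklearn.metrics import accuracy_score

# Skewed test set: 900 "NEGATIVE" (0) and 100 "POSITIVE" (1) instances
y_true = [0] * 900 + [1] * 100

# A "classifier" that always predicts NEGATIVE, regardless of the input
y_pred = [0] * 1000

# 900 correct predictions out of 1000 -> 0.9, i.e. 90% Accuracy
print(accuracy_score(y_true, y_pred))
```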

The performance metrics that hold up much better on skewed test sets are Precision and Recall. They focus on different aspects of performance and, taken together, offer a thorough estimate of a model’s behavior.

Precision focuses on the correctness of a model’s predictions for a particular class, say “XYZ”. In other words, of all the times the model predicts class “XYZ”, Precision tells us what proportion of those predictions are correct. It can be calculated as:
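    Precision = (Number of correct predictions of class “XYZ”) / (Total number of predictions of class “XYZ”)

Treating “XYZ” as the positive class, this is the familiar TP / (TP + FP).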

Precision values range from 0 to 1 with the value of 1 signifying that when the model predicted a particular class a certain number of times, all of those predictions were correct.

Recall, on the other hand, measures the ability of a model to correctly identify the instances of a particular class, say “ABC”, in the test set. In other words, of all the instances of class “ABC” in the test set, Recall tells us what proportion the model classifies correctly. It can be estimated as:
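    Recall = (Number of correctly classified instances of class “ABC”) / (Total number of instances of class “ABC” in the test set)

Treating “ABC” as the positive class, this is the familiar TP / (TP + FN).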

Recall values range from 0 to 1 with the value of 1 indicating that the model correctly identified all the instances of a particular class.

Car Classifier

The model shown above (Car Classifier) has a perfect Recall (100%) but a lower Precision (50%). This indicates that it correctly identifies every instance of class “CAR” in the test set; however, when it predicts class “CAR”, the prediction is wrong half of the time.
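As a concrete sketch of such a model (the counts below are hypothetical, chosen only so that the 100% Recall and 50% Precision figures come out; scikit-learn is assumed for the metric functions):

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical test set: 10 images of cars (1) and 10 images of non-cars (0)
y_true = [1] * 10 + [0] * 10

# A model that labels every image as "CAR": it catches all 10 cars,
# but half of its "CAR" predictions are wrong
y_pred = [1] * 20

print(recall_score(y_true, y_pred))     # 1.0 -> 100% Recall
print(precision_score(y_true, y_pred))  # 0.5 -> 50% Precision
```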

Skewed Test Set of Rare Disease Classifier

At this point, I’ll make a minor change to the supposition to avoid an undefined Precision value (a classifier that never predicts “POSITIVE” makes the denominator of Precision zero), and that is:

The classifier always predicts the “NEGATIVE” class, except that it correctly predicts the “POSITIVE” class exactly once.

Since the classifier predicted the “POSITIVE” class only once and that prediction was correct, the Precision comes out to be 100%. However, it identifies only 1 of the 100 “POSITIVE” instances, so the Recall is very low (1%), and this exposes the true performance of the Rare Disease Classifier.
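A quick sketch to verify these numbers, again assuming scikit-learn and the 900/100 test set from above:

```python
from sklearn.metrics import precision_score, recall_score

# 900 "NEGATIVE" (0) and 100 "POSITIVE" (1) instances
y_true = [0] * 900 + [1] * 100

# Always predicts NEGATIVE, except for a single correct POSITIVE prediction
y_pred = [0] * 900 + [1] + [0] * 99

print(precision_score(y_true, y_pred))  # 1.0  -> 100% Precision
print(recall_score(y_true, y_pred))     # 0.01 -> 1% Recall
```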

  • Models with high Precision and low Recall are correct most of the time when they predict a particular class; however, they miss or misclassify many instances of that class.
  • Conversely, models with high Recall and low Precision pick up most of the instances of a particular class, but they also assign to that class many instances that don’t actually belong to it.

Ideally, we want both Precision and Recall to be as high as possible, but the two metrics oppose each other. To understand their conflicting nature, we first have to understand how high Precision and high Recall are achieved:

We know that Precision is high when most of the predictions for a particular class are correct. To achieve this, you have to increase the threshold, or confidence level, at which you predict that particular class:

Logistic Regression Example:

Logistic Function
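For reference, the logistic (sigmoid) function is g(z) = 1 / (1 + e^(-z)), which squashes any real-valued input z into the range (0, 1) and can be read as the model’s confidence that the input belongs to Class 1.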

Suppose you predict:

  • Class 1 for g(z) > 0.5
  • Class 0 for g(z) ≤ 0.5

If you want high Precision for Class 1, you may want to increase the threshold from 0.5 to, say, 0.9, i.e.:

  • Class 1 for g(z) > 0.9 ⇒ High Precision

This makes the classifier predict Class 1 only when it is very confident, which increases the likelihood that its predictions for Class 1 are correct, thereby increasing the Precision.

However, this comes at the cost of a low Recall!

High Recall is attained when most of the instances of a particular class are correctly identified. With a high threshold, however, the model will miss or misclassify many instances of Class 1 for which the sigmoid output (the confidence) is relatively low. To catch such instances, you may want to decrease the threshold, i.e.:

  • Class 1 for g(z) > 0.3 ⇒ High Recall

This will increase the Recall but lower the Precision. That’s how these two parameters clash with each other.
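Here is a minimal sketch of this trade-off, assuming scikit-learn’s LogisticRegression on a synthetic dataset; the exact numbers will vary with the data and the random seed, but the direction of the change is what matters:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Synthetic, somewhat imbalanced binary classification data
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=0)
model = LogisticRegression().fit(X, y)

# g(z): the model's estimated probability of Class 1
proba = model.predict_proba(X)[:, 1]

for threshold in (0.3, 0.5, 0.9):
    y_pred = (proba > threshold).astype(int)
    p = precision_score(y, y_pred)
    r = recall_score(y, y_pred)
    print(f"threshold={threshold}: Precision={p:.2f}, Recall={r:.2f}")

# Raising the threshold pushes Precision up and Recall down; lowering it does the opposite.
```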

Trade-off Between Precision and Recall

As we’ve discussed, ideally we want both Precision and Recall to be high, yet the two metrics are opposing in nature. This raises the question: how should we decide between the two?

This totally depends on the application, for instance:

  • A classifier for a life-threatening disease may require a high Recall. This is because we don’t want to risk a patient’s life by missing them, so we will consider treatment even when the model’s confidence is low.
  • A classifier for a benign disease with an expensive or risky treatment may require a high Precision. This is because we only want to go for the treatment if we are very sure, and we can afford to miss a patient because the disease is benign.
  • Yet other applications may require a golden balance between Precision and Recall.

The F1 score combines Precision and Recall to give a single measure of the overall performance of a model.

It is the harmonic mean of Precision and Recall, which gives more weight to the smaller value. The benefit of this is that if either Precision or Recall is low, it pulls the F1 score down, telling us that something is off about the model.
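    F1 = 2 × (Precision × Recall) / (Precision + Recall)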

The F1 score also ranges from 0 to 1, with 1 being the best score; to approach it, we need both Precision and Recall to be high.

As an example, let’s take the Rare Disease Classifier. It has:

  • Precision = 1 & Recall = 0.01
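Plugging these values into the formula gives F1 = 2 × (1 × 0.01) / (1 + 0.01) ≈ 0.02.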

Notice how the small value of Recall pulled the F1 score down even though the Precision was 100%.

In this article, we discussed several performance metrics for machine learning models along with their shortcomings and caveats. These include Accuracy and how misleading it can be on skewed test sets. We then covered Precision and Recall, their opposing nature, and how to trade one off against the other depending on the application. Finally, we introduced the F1 Score, which combines Precision and Recall into a single figure representing the overall performance of the model.
