
The Effects of Varying Threshold on Evaluation Metrics


Let's assume you have a prediction model that returns a probability score indicating how likely it is that a tumor tissue sample is malignant.


Further assume that we have a test set of 10 tumor tissue samples labeled as benign (green, not malignant) or malignant (red), which we ran through our model to get an output score for each of them.


We can arrange the samples from left to right in ascending order of the model's predictions as in the figure below.


As mentioned earlier, the model returns a probability score. That is, it may return 0.999 for a particular tissue, predicting that it is very likely malignant. Conversely, 0.001 would mean the tissue is very likely benign.


It seems the model can classify samples well when the prediction scores are at the extremes.


How about a sample with a prediction score of 0.65?


This is where the classification threshold, also called the decision threshold, comes into play.


In our model, a value above the threshold indicates a malignant sample; a value below it indicates a benign (not malignant) one.
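Once a threshold is chosen, turning scores into class labels is a one-liner. Here is a minimal Python sketch; the scores and the 0.5 cut-off are made up for illustration:

# A minimal sketch: apply a decision threshold to hypothetical scores.
scores = [0.001, 0.65, 0.999]   # made-up model outputs
threshold = 0.5                 # assumed cut-off for illustration

labels = ["malignant" if s >= threshold else "benign" for s in scores]
print(labels)  # ['benign', 'malignant', 'malignant']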


Let's get back to our example test set. The figure below shows sensitivity and specificity values for different thresholds.




A threshold t sets everything to the right of the dashed line as positive (i.e., malignant) and everything to the left as negative (i.e., not malignant, or benign).


Sensitivity tells us: given that a sample is actually malignant, what is the probability that the model predicts it as malignant? It is also known as the true positive rate, or recall. Specificity, on the other hand, is the probability that the model predicts a sample as not malignant, given that it is actually not malignant; it is also known as the true negative rate. (Note that these are different from the positive and negative predictive values, PPV and NPV, which condition on the model's prediction rather than on the true label.)


Let's compute sensitivity and specificity for threshold t = 0.5.


On the right-hand side of the threshold, we have 3 red and 2 green dots representing true positives (TP) and false positives (FP), respectively.


On the left-hand side of the threshold, we have 1 red and 4 green dots representing false negatives (FN) and true negatives (TN), respectively.


Recall the formulas for sensitivity and specificity: sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP).


Do the math and you will end up with sensitivity = 3/(3+1) = 3/4 = 0.75 and specificity = 4/(4+2) = 4/6 ≈ 0.67 for threshold t = 0.5.
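If you would like to verify the arithmetic in code, here is a short Python sketch. The figure does not list the exact prediction scores, so the ten scores below are hypothetical, chosen only so that the counts on each side of the threshold match the figure:

# Hypothetical (score, label) pairs: 1 = malignant (red), 0 = benign (green).
# The scores are invented so the counts match the figure in the post.
samples = [
    (0.05, 0), (0.15, 0), (0.25, 0), (0.35, 0), (0.45, 1),
    (0.55, 0), (0.65, 1), (0.75, 1), (0.85, 0), (0.95, 1),
]

def sensitivity_specificity(samples, t):
    # Scores at or above the threshold t count as positive (malignant).
    tp = sum(1 for s, y in samples if s >= t and y == 1)
    fp = sum(1 for s, y in samples if s >= t and y == 0)
    fn = sum(1 for s, y in samples if s < t and y == 1)
    tn = sum(1 for s, y in samples if s < t and y == 0)
    return tp / (tp + fn), tn / (tn + fp)

sens, spec = sensitivity_specificity(samples, 0.5)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
# sensitivity = 0.75, specificity = 0.67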


Let's increase the threshold and see what happens.


For threshold t = 0.7, sensitivity is 2/(2+2) = 2/4 = 0.5 and specificity is 5/(5+1) = 5/6 ≈ 0.83.


That is, if we increase the threshold from 0.5 to 0.7, we should expect the model to predict fewer samples as positive and more samples as negative.


We can take the threshold to the upper extreme, which is 1.0.


For threshold t = 1.0, sensitivity is 0/(0+4) = 0/4 = 0.0 and specificity is 6/(6+0) = 6/6 = 1.0.


In this case, all samples are predicted as not malignant: the model raises no false alarms, but it also misses every malignant sample.
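Continuing the sketch above (same hypothetical samples and the sensitivity_specificity helper), we can sweep all three thresholds at once and reproduce the numbers discussed so far:

# Sweep the three thresholds discussed above and print both metrics.
for t in (0.5, 0.7, 1.0):
    sens, spec = sensitivity_specificity(samples, t)
    print(f"t = {t:.1f}: sensitivity = {sens:.2f}, specificity = {spec:.2f}")

# t = 0.5: sensitivity = 0.75, specificity = 0.67
# t = 0.7: sensitivity = 0.50, specificity = 0.83
# t = 1.0: sensitivity = 0.00, specificity = 1.00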


To sum up, increasing the threshold does not seem like a good idea for this model, as we risk missing malignant samples by predicting them as benign.


The threshold value is problem-dependent and needs to be tuned accordingly.




 

Thank you for reading this post. If you have anything to say/object/correct, please drop a comment down below.
