Don't Get Fooled by Numbers!

July 21, 2025

Fellas,

Just wanted to share a story that’s been bugging me for days.

Last week, I had the opportunity to attend a presentation by a vendor (unfortunately, for certain reasons, I can’t mention the name). Basically, they’ve been developing a web-based app specifically for tagging purposes (I can’t go into more detail). I was hooked by the concise yet appealing presentation until...they started showcasing a machine learning model they had built to perform the tagging task.

The tagging, simply put, is a text classification task with multi-label output (a single text can be assigned more than one label at the same time). They claimed the model had been performing consistently well and backed that up with the usual metrics: accuracy, precision, recall, and F1-score. All the scores were above 94%, which left some of my colleagues in awe.

This is where I started to smell something fishy. I asked a series of questions to confirm my suspicion, starting with the dataset they used and then moving on to how they calculated and validated those metrics. From their responses, it became clear that they were evaluating the model’s performance (keep in mind, this is a multi-label model) using precision, recall, F1-score, and even accuracy, each computed for one label in isolation.

I mean, who relies on accuracy alone as a quality metric for multi-class classification, let alone multi-label?

Their answers suggested that the metrics were calculated separately for each label and, worse, aggregated by selecting the maximum score to represent the model’s overall performance.
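To make the problem concrete, here is a minimal sketch in plain Python. The six texts, the three labels, and the predictions are all made up by me for illustration (they are not the vendor's data), and the helper name f1_for_label is my own. It computes F1 per label, then compares "pick the best label" with a plain macro average.

```python
# Hypothetical toy data: 6 texts, 3 labels. 1 = label applies, 0 = it does not.
LABELS = ["news", "politics", "sports"]

y_true = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 1],
]
y_pred = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
]


def f1_for_label(y_true, y_pred, idx):
    """F1-score for a single label column, treating it as a binary task."""
    tp = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred[idx] == 1 and truth[idx] == 1:
            tp += 1
        elif pred[idx] == 1 and truth[idx] == 0:
            fp += 1
        elif pred[idx] == 0 and truth[idx] == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


per_label_f1 = {
    name: f1_for_label(y_true, y_pred, i) for i, name in enumerate(LABELS)
}

best_label_f1 = max(per_label_f1.values())                  # the flattering number
macro_f1 = sum(per_label_f1.values()) / len(per_label_f1)   # average over labels

for name, score in per_label_f1.items():
    print(f"{name:<9} F1 = {score:.2f}")
print(f"max over labels = {best_label_f1:.2f}")
print(f"macro average   = {macro_f1:.2f}")
```

On this toy data the "news" label is predicted perfectly, so the maximum is 1.00, while the macro average is about 0.72 because "politics" and "sports" are often missed. Reporting only the maximum hides exactly the labels the model struggles with.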

Let’s look at an example:

Suppose an article (A) is associated with a primary class ("news") and a secondary class ("politics"). To properly evaluate the model’s performance in classifying such articles, you must consider its ability to assign both labels correctly: news and politics.

You can’t calculate performance by looking only at whether it gets the primary class right—it might still assign an incorrect secondary label, which distorts the real performance of the model.
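Here is a rough sketch of the news/politics example, again in plain Python. The five articles and their predicted label sets are hypothetical, and the three scores are just common ways to look at multi-label output (primary label only, exact match of the whole label set, and a per-article Jaccard overlap for partial credit). I'm not claiming this is how the vendor should have done it, only that the numbers can diverge a lot.

```python
# Hypothetical label sets for five articles (true vs. predicted).
y_true = [
    {"news", "politics"},
    {"news", "politics"},
    {"news", "sports"},
    {"news", "politics"},
    {"news", "economy"},
]
y_pred = [
    {"news", "politics"},
    {"news", "sports"},
    {"news", "sports"},
    {"news"},
    {"news", "politics"},
]

PRIMARY = "news"  # assumed primary class for this toy example
n = len(y_true)

# 1) Only check whether the primary class is handled correctly.
primary_only_acc = sum(
    (PRIMARY in truth) == (PRIMARY in pred) for truth, pred in zip(y_true, y_pred)
) / n

# 2) Exact match (subset accuracy): the whole label set must be right.
exact_match_acc = sum(truth == pred for truth, pred in zip(y_true, y_pred)) / n

# 3) Mean Jaccard overlap: partial credit for partially correct label sets.
mean_jaccard = sum(
    len(truth & pred) / len(truth | pred) for truth, pred in zip(y_true, y_pred)
) / n

print(f"primary-only accuracy = {primary_only_acc:.2f}")
print(f"exact-match accuracy  = {exact_match_acc:.2f}")
print(f"mean Jaccard overlap  = {mean_jaccard:.2f}")
```

With these made-up predictions the primary-only accuracy is 1.00, yet only 2 of 5 articles get the full label set right (0.40), and the mean Jaccard overlap is about 0.63. Same model, same predictions, three very different stories.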

Lesson learned:

When dealing with data, be skeptical!
You can’t blindly trust the numbers presented to you, especially when they’re used as indicators of model performance.
Metrics can be misleading, and as a user, you need to critically assess what those scores and numbers actually mean.
In any data analytics project, you need to establish clear success criteria to determine whether the product truly meets your business requirements.

data-and-data
