
Choosing the Right Metric for Classification Models

Categories: python, machine learning, technical

In this post, we explore various classification metrics: accuracy, precision, recall, F1-score, and AUC-ROC.

Author: Dominik Lindner
Published: August 29, 2025

When evaluating machine learning models, accuracy is often the first metric that comes to mind. However, accuracy alone can be misleading, especially in cases where the dataset is imbalanced or when different types of misclassifications have different consequences. Choosing the right evaluation metric is crucial for ensuring that the model performs well in real-world applications.

1 Accuracy is best for balanced datasets

Accuracy measures the percentage of correctly classified instances in a dataset. It is calculated as:

\[Accuracy = \frac{TP + TN}{TP + TN + FP + FN}\]

Where:

  • TP (True Positives) = Correctly predicted positives
  • TN (True Negatives) = Correctly predicted negatives
  • FP (False Positives) = Incorrectly predicted positives
  • FN (False Negatives) = Incorrectly predicted negatives

For a more detailed definition, see Precision, Recall, and the Confusion Matrix.

Accuracy works well when the dataset is balanced and the cost of false positives and false negatives is roughly the same.

Example:

In image classification, where we classify objects like “dog vs. cat” with roughly equal numbers of each class, accuracy is a reliable metric.

Counter-Example:

Imagine a fraud detection system where only 1% of transactions are fraudulent. A naive model that predicts “non-fraud” for every transaction would be 99% accurate but completely useless in identifying fraud.
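To make the counter-example concrete, here is a minimal sketch (with made-up transaction counts) of how such a do-nothing model scores, using scikit-learn's accuracy_score:

```python
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced dataset: 10 fraudulent transactions (1) out of 1,000.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000  # naive model: predicts "non-fraud" every time

print(accuracy_score(y_true, y_pred))  # 0.99 -- high accuracy, zero fraud caught
```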

2 Precision, when false positives are costly

Precision measures how many of the positive predictions are actually correct:

\[Precision = \frac{TP}{TP + FP}\]

A high precision means fewer false positives, which is important when a false positive carries significant consequences.

Example:

  • Spam email filtering → Marking an important email as spam (false positive) can cause users to miss critical messages.
  • Hiring decisions → Selecting the wrong candidate (false positive) could be costly for a company.

Counter-Example:

If false negatives (missed positive cases) are more harmful, recall is the better metric.
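As a minimal sketch of the calculation, here is precision on a small, made-up spam-filter output, using scikit-learn's precision_score:

```python
from sklearn.metrics import precision_score

y_true = [1, 1, 0, 1, 0, 0, 1, 0]  # 1 = spam, 0 = legitimate email
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]  # 1 = flagged as spam by the filter

# TP = 3, FP = 1  ->  precision = 3 / (3 + 1) = 0.75
print(precision_score(y_true, y_pred))
```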


3 Recall, when false negatives are costly

Recall (also called sensitivity or true positive rate) measures how many actual positives are correctly identified:

\[Recall = \frac{TP}{TP + FN}\]

A high recall means the model captures most actual positive cases, even if it produces some false positives.

Example:

  • Cancer detection → A false negative (failing to detect cancer) is much worse than a false positive (a healthy person being sent for more tests).
  • Fraud detection → Missing a fraudulent transaction is riskier than investigating a few false alarms.

Counter-Example:

If false positives are expensive or disruptive, precision is the better metric.
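A minimal sketch of the calculation on a small, made-up screening-test output, using scikit-learn's recall_score:

```python
from sklearn.metrics import recall_score

y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # 1 = patient actually has the condition
y_pred = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]  # one missed case (FN), one false alarm (FP)

# TP = 4, FN = 1  ->  recall = 4 / (4 + 1) = 0.8
print(recall_score(y_true, y_pred))
```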


4 F1-Score, when you need a balance

F1-score is the harmonic mean of precision and recall:

\[F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\]

F1-score is particularly useful in cases where the dataset is imbalanced, and both false positives and false negatives matter.

Example:

  • Fake news detection → You need to both catch fake news (recall) and avoid falsely labeling real news as fake (precision).
  • Medical diagnostics → It’s important to minimize both missed diagnoses (FN) and false alarms (FP).

Counter-Example:

If the dataset is balanced and errors are equally costly, accuracy is often sufficient.
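The sketch below checks the formula against scikit-learn's f1_score on made-up labels; both paths give the same value:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0, 1, 0]

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)

print(f1_score(y_true, y_pred))  # library result
print(2 * p * r / (p + r))       # harmonic mean from the formula above
```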

5 AUC-ROC, when you need to rank predictions

AUC-ROC (Area Under the Receiver Operating Characteristic Curve) measures a model’s ability to distinguish between classes at different thresholds.

The ROC curve plots the True Positive Rate (recall) against the False Positive Rate across classification thresholds. An AUC (Area Under the Curve) closer to 1 means better classification performance; an AUC of 0.5 is no better than random guessing.

Example:

  • Credit risk assessment → Banks rank loan applicants from “low risk” to “high risk” rather than making a strict yes/no decision.
  • Medical triage systems → Doctors prioritize high-risk patients based on a ranking rather than a strict diagnosis.

Counter-Example:

AUC-ROC is great for ranking, but for specific misclassification penalties, precision, recall, or F1-score might be better.
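A minimal sketch using scikit-learn's roc_auc_score; note that it is computed from predicted probabilities (a ranking), not from hard 0/1 labels:

```python
from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1]                # actual classes
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]   # hypothetical predicted probabilities

# 1.0 means every positive ranks above every negative; 0.5 is random ordering.
print(roc_auc_score(y_true, y_score))
```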

6 Decision Graph

I came up with a simple decision graph to summarize these choices:

flowchart TD
    A[Start] --> B{Dataset balanced?}
    B -- Yes --> C[Use Accuracy]
    B -- No --> D{What matters more?}
    D -- Balance FP & FN --> E[F1-score]
    D -- Avoid False Positives --> F[Precision]
    D -- Avoid False Negatives --> G[Recall]
    C --> H{Need ranking?}
    E --> H
    F --> H
    G --> H
    H -- Yes --> I[AUC-ROC]
    H -- No --> J[Done]

| Scenario | Best Metric | Why? |
|----------|-------------|------|
| Balanced dataset | Accuracy | Errors are equally important. |
| Imbalanced dataset | F1-score | Balances false positives & false negatives. |
| False positives are costly | Precision | Avoids unnecessary alarms. |
| False negatives are costly | Recall | Ensures we catch as many positives as possible. |
| Need ranking, not classification | AUC-ROC | Measures how well the model separates classes. |
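As a rough sketch, the same decision logic can be written as a small helper function (the flag names are my own shorthand, not a standard API):

```python
def choose_metric(balanced: bool, fp_costly: bool, fn_costly: bool,
                  need_ranking: bool) -> str:
    """Return a suggested metric following the decision graph above."""
    if need_ranking:
        return "AUC-ROC"
    if balanced:
        return "Accuracy"
    if fp_costly and fn_costly:
        return "F1-score"
    if fp_costly:
        return "Precision"
    if fn_costly:
        return "Recall"
    return "F1-score"  # sensible default for imbalanced data with unclear costs

print(choose_metric(balanced=False, fp_costly=True, fn_costly=True,
                    need_ranking=False))  # F1-score
```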

7 Final Thoughts

Choosing the right metric is critical to building a model that truly performs well in its intended application. Instead of blindly relying on accuracy, always consider:

  • Is the dataset balanced or imbalanced?
  • Is it worse to have a false positive or a false negative?
  • Are you making a hard classification or ranking predictions?

By aligning the evaluation metric with your real-world goals, you’ll ensure that your model delivers meaningful and impactful results.
