Receiver Operating Characteristic (ROC) and Precision-Recall Curves with Evaluation Metrics

Author: Sáez Silvestre, Carlos

Introduction

This interactive application is designed to help you explore and understand key evaluation metrics and visualizations for classification models, particularly in healthcare. By experimenting with parameters and decision thresholds, you can observe how confusion matrices, derived metrics (such as sensitivity, specificity, and F1-score), and ROC and Precision-Recall curves behave under varying conditions, gaining deeper insight into model behavior and decision-making trade-offs.
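As a reference for the metrics explored in the app, the quantities derived from a confusion matrix can be sketched as follows (pure Python; the counts used below are illustrative, not taken from the app):

```python
def confusion_metrics(tp, fp, fn, tn):
    """Derive threshold-dependent metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)        # recall, true positive rate
    specificity = tn / (tn + fp)        # true negative rate
    precision = tp / (tp + fp)          # positive predictive value (PPV)
    npv = tn / (tn + fn)                # negative predictive value
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "npv": npv, "f1": f1}

# Illustrative counts: 80 TP, 20 FP, 20 FN, 880 TN
m = confusion_metrics(tp=80, fp=20, fn=20, tn=880)
print(m["sensitivity"], m["specificity"], m["f1"])  # 0.8, ~0.978, 0.8
```

Changing the decision threshold in the app shifts cases between these four cells, which is why all of the derived metrics move together.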

Learning Goals

Instructions

  1. Select the 'Interactive App' tab and review the evaluation results under the starting conditions.
  2. Use the left panel to adjust the decision threshold and analyze how the metrics and the threshold point in the curves change. Compare the current threshold with Youden’s optimal point.
  3. Use the left panel to modify class balance, sample size, and baseline discriminative ability individually (moving these sliders will regenerate the data), and analyze the effect first on the confusion matrix and then on the metrics and curves.
  4. Try to simulate the behavior of a real screening test. Look for realistic performance values, consider class imbalance, and analyze the effects of varying parameters and decision thresholds.
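Youden's optimal point, mentioned in step 2, can be sketched by sweeping candidate thresholds over the scores and maximizing J = sensitivity + specificity - 1 (the scores and labels below are synthetic, for illustration only):

```python
def youden_optimal(scores, labels):
    """Return the threshold maximizing Youden's J = sensitivity + specificity - 1."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        sens = sum(s >= t for s in pos) / len(pos)   # TPR at threshold t
        spec = sum(s < t for s in neg) / len(neg)    # TNR at threshold t
        j = sens + spec - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

# Synthetic scores: diseased cases (label 1) tend to score higher
scores = [0.1, 0.2, 0.3, 0.4, 0.55, 0.6, 0.7, 0.8, 0.9]
labels = [0,   0,   0,   0,   1,    0,   1,   1,   1]
t, j = youden_optimal(scores, labels)  # t = 0.55, J = 0.8
```

This is the point the app highlights on the ROC curve: the threshold farthest, vertically, from the chance diagonal.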

Virtual Laboratory

[Click here to launch the lab in a new tab]

Conclusions

While ROC curves remain largely unaffected by class imbalance because they are based on normalized rates (sensitivity and specificity), Precision-Recall curves provide a more informative picture in settings with imbalanced classes, such as screening tests. In these contexts, both specificity and precision (PPV) are critical, since even a highly specific test may yield a low PPV when disease prevalence is low.
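The prevalence effect on PPV can be made concrete with Bayes' rule; the sketch below uses hypothetical performance values (95% sensitivity, 95% specificity), not results from the app:

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from test characteristics via Bayes' rule."""
    true_pos = sensitivity * prevalence              # P(test+, diseased)
    false_pos = (1 - specificity) * (1 - prevalence)  # P(test+, healthy)
    return true_pos / (true_pos + false_pos)

# Same hypothetical test, two prevalence settings
print(ppv(0.95, 0.95, 0.50))  # balanced classes: PPV = 0.95
print(ppv(0.95, 0.95, 0.01))  # rare disease: PPV drops to ~0.16
```

With 1% prevalence, most positive results come from the large healthy population, so fewer than one in five positives is a true case despite the test's high specificity.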
There is no single universally best threshold; the optimal cutoff depends on clinical context, such as the relative consequences of false negatives and false positives. For example, in cancer screening, sensitivity may be prioritized to avoid missing cases, whereas in confirmatory testing, specificity may be prioritized to reduce unnecessary follow-up procedures.
AUC and AUPRC values summarize overall model discrimination, but they do not indicate how a test performs at a specific threshold. In practice, threshold-dependent metrics such as sensitivity, specificity, PPV, and NPV are often more relevant for decision-making in clinical workflows.
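The AUC's interpretation as overall discrimination, independent of any single threshold, can be sketched via its rank-statistic form: the probability that a randomly chosen positive case scores above a randomly chosen negative case (the scores below are illustrative):

```python
def auc_rank(scores, labels):
    """AUC as P(score_pos > score_neg), with ties counted as 1/2 (Mann-Whitney)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative scores: one positive case (0.55) scores below one negative (0.6)
scores = [0.1, 0.2, 0.3, 0.4, 0.55, 0.6, 0.7, 0.8, 0.9]
labels = [0,   0,   0,   0,   1,    0,   1,   1,   1]
print(auc_rank(scores, labels))  # 0.95: 19 of 20 positive/negative pairs ranked correctly
```

Note that this single number says nothing about which threshold to deploy; that choice still requires the threshold-dependent metrics above.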