Author: Sáez Silvestre, Carlos
This interactive application is designed to help you explore and understand key evaluation metrics and visualizations for classification models, particularly in healthcare. By experimenting with parameters and thresholds, you can observe how confusion matrices and derived metrics (such as sensitivity, specificity, and F1-score), as well as ROC and Precision-Recall curves, behave under varying conditions, gaining deeper insight into model behavior and decision-making trade-offs.
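As a minimal sketch of how the derived metrics follow from a 2x2 confusion matrix, the snippet below computes them from illustrative counts (the numbers are hypothetical, not from any real dataset):

```python
# Deriving threshold-dependent metrics from a 2x2 confusion matrix.
# Counts are hypothetical, chosen to resemble a screening setting.
tp, fn, fp, tn = 80, 20, 30, 870

sensitivity = tp / (tp + fn)        # recall, true positive rate
specificity = tn / (tn + fp)        # true negative rate
precision   = tp / (tp + fp)        # positive predictive value (PPV)
npv         = tn / (tn + fn)        # negative predictive value
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(f"sensitivity={sensitivity:.3f}, specificity={specificity:.3f}")
print(f"PPV={precision:.3f}, NPV={npv:.3f}, F1={f1:.3f}")
```

Note that sensitivity and specificity are computed within each true class, while PPV and NPV are computed within each predicted class, which is why the latter depend on prevalence.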
While ROC curves remain largely unaffected by class imbalance, because they are based on normalized rates (sensitivity and specificity), Precision-Recall curves often provide a more informative picture in settings with imbalanced classes, such as screening tests. In these contexts, both specificity and precision (PPV) are critical, since even a highly specific test may yield a low PPV when disease prevalence is low.
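The prevalence effect can be made concrete with Bayes' rule. The sketch below fixes hypothetical test characteristics (90% sensitivity, 95% specificity) and shows how PPV collapses as prevalence drops:

```python
# Why a highly specific test can still have low PPV at low prevalence.
# Sensitivity and specificity here are hypothetical, fixed values.
sens, spec = 0.90, 0.95

def ppv(prevalence: float) -> float:
    """PPV via Bayes' rule: P(disease | positive test)."""
    true_pos = sens * prevalence            # diseased and test-positive
    false_pos = (1 - spec) * (1 - prevalence)  # healthy and test-positive
    return true_pos / (true_pos + false_pos)

for prev in (0.50, 0.10, 0.01):
    print(f"prevalence={prev:.2f} -> PPV={ppv(prev):.3f}")
```

At 50% prevalence the PPV is around 0.95, but at 1% prevalence the same test yields a PPV of roughly 0.15: most positives are false positives.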
There is no single universally best threshold; the optimal cutoff depends on clinical context, such as the relative consequences of false negatives and false positives. For example, in cancer screening, sensitivity may be prioritized to avoid missing cases, whereas in confirmatory testing, specificity may be prioritized to reduce unnecessary follow-up procedures.
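Two common threshold-selection rules illustrate this trade-off. The sketch below, on hypothetical model scores, compares Youden's J statistic (which weights false negatives and false positives equally) against a screening-style rule that requires a minimum sensitivity:

```python
# Two threshold-selection rules on hypothetical model scores.
scores_pos = [0.9, 0.8, 0.7, 0.4]                 # diseased cases
scores_neg = [0.6, 0.5, 0.3, 0.2, 0.1, 0.05]      # healthy cases

def sens_spec(threshold):
    sens = sum(s >= threshold for s in scores_pos) / len(scores_pos)
    spec = sum(s < threshold for s in scores_neg) / len(scores_neg)
    return sens, spec

candidates = sorted(set(scores_pos + scores_neg))
# Rule 1: Youden's J = sensitivity + specificity - 1, maximized.
best_youden = max(candidates, key=lambda t: sum(sens_spec(t)) - 1)
# Rule 2: highest threshold that still achieves sensitivity >= 0.9,
# as might be required in cancer screening to avoid missed cases.
screening = max(t for t in candidates if sens_spec(t)[0] >= 0.9)
print("Youden threshold:", best_youden)
print("Screening threshold:", screening)
```

On this toy data the two rules select different cutoffs, showing that the "optimal" threshold is a modeling decision driven by clinical costs, not a property of the model alone.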
AUC and AUPRC values summarize overall model discrimination, but they do not indicate how a test performs at a specific threshold. In practice, threshold-dependent metrics such as sensitivity, specificity, PPV, and NPV are often more relevant for decision-making in clinical workflows.
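To see the distinction, the sketch below (again on hypothetical scores) computes AUC as the Mann-Whitney rank statistic, the probability that a random diseased case scores above a random healthy case, alongside the operating characteristics at one specific cutoff:

```python
# AUC summarizes ranking across all thresholds; clinical use happens at one.
pos = [0.9, 0.8, 0.7, 0.4]                 # diseased cases (hypothetical)
neg = [0.6, 0.5, 0.3, 0.2, 0.1, 0.05]      # healthy cases (hypothetical)

# AUC as the Mann-Whitney statistic: P(score_pos > score_neg).
pairs = [(p, n) for p in pos for n in neg]
auc = sum(p > n for p, n in pairs) / len(pairs)

# The same model at one threshold gives concrete, actionable numbers.
t = 0.5
sens = sum(s >= t for s in pos) / len(pos)
spec = sum(s < t for s in neg) / len(neg)
print(f"AUC={auc:.3f}; at threshold {t}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

A model can have a high AUC yet, at the threshold actually deployed, miss a quarter of cases; reporting the threshold-specific sensitivity, specificity, PPV, and NPV makes that visible.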