You are building a model to detect fraudulent loan applications. The dataset is 98% legitimate and 2% fraudulent. Your manager says to optimize for accuracy.
(1) Why is accuracy the wrong metric here? Compute the accuracy of a trivial "predict not-fraud" baseline.
(2) Which metric would you actually optimize for, and why? Discuss precision, recall, F1, and AUC-ROC and which fits this problem best.
(3) How would you set the decision threshold, given that a missed fraud (false negative) costs $5,000 while a false positive costs approximately $200 in lost revenue? Show your math.
(4) Your offline AUC is 0.96, but after deployment the team finds that the model performs worse than expected. Name two reasons offline evaluation can disagree with online performance.
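
The arithmetic behind parts (1) and (3) can be sketched in a few lines of Python. This is one possible approach under stated assumptions, not the only valid answer: it assumes the model outputs a calibrated fraud probability p, that correct decisions cost nothing, and that the two costs in the prompt are the only costs that matter. The function names are hypothetical and chosen for illustration only.

```python
# Figures taken from the prompt: 2% fraud rate, $5,000 per missed fraud,
# roughly $200 per false positive.
FRAUD_RATE = 0.02
COST_FN = 5000.0  # cost of approving a fraudulent application
COST_FP = 200.0   # cost of rejecting a legitimate application


def baseline_accuracy(fraud_rate: float) -> float:
    """Accuracy of always predicting not-fraud: every legitimate case is correct."""
    return 1.0 - fraud_rate


def expected_cost(p: float, flag: bool, cost_fp: float, cost_fn: float) -> float:
    """Expected cost of one decision, given fraud probability p.

    Flagging is only wrong when the application is legitimate (probability 1 - p);
    not flagging is only wrong when it is fraudulent (probability p).
    """
    return (1.0 - p) * cost_fp if flag else p * cost_fn


def cost_optimal_threshold(cost_fp: float, cost_fn: float) -> float:
    """Break-even probability where flagging and not flagging cost the same.

    Flag when p * cost_fn > (1 - p) * cost_fp, which rearranges to
    p > cost_fp / (cost_fp + cost_fn). Assumes p is calibrated.
    """
    return cost_fp / (cost_fp + cost_fn)


if __name__ == "__main__":
    print(f"baseline accuracy: {baseline_accuracy(FRAUD_RATE):.2%}")
    print(f"break-even threshold: {cost_optimal_threshold(COST_FP, COST_FN):.4f}")
```

Under these assumptions the trivial baseline scores 98% accuracy while catching no fraud, and the break-even threshold is 200 / 5,200, or about 0.038, far below the default 0.5. If the model's scores are not calibrated, the threshold would instead need to be chosen empirically, for example by minimizing total cost on a validation set.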