A team is deploying a new recommendation model. A junior engineer proposes deploying directly to 100% of traffic since it outperformed the current model on offline evaluation.
(1) Explain two specific failure modes that offline evaluation cannot catch but production would reveal. Be concrete: give scenarios where offline AUC improves but production metrics regress.
(2) Compare shadow and canary deployment for this problem. What does each validate, what risk does each carry, and in what order would you use them?
(3) After a 48-hour canary on 5% of traffic, click-through rate is identical to control, but revenue per session is 8% lower. What would you conclude, and what would you do? Be specific about the next steps and what additional data you would gather.
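
For question (3), one reasonable first check is whether an 8% gap in revenue per session is distinguishable from noise, since per-session revenue is usually heavy-tailed and a 5% canary arm is small. The sketch below is only an illustration under stated assumptions: it uses a percentile bootstrap on per-session revenue, and the names `bootstrap_relative_diff` and `simulate_sessions`, the purchase rate, and the order values are all hypothetical and not taken from the scenario.

```python
import random


def mean(values):
    return sum(values) / len(values)


def bootstrap_relative_diff(control, treatment, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for (mean(treatment) - mean(control)) / mean(control).

    control, treatment: lists of per-session revenue (hypothetical inputs).
    Returns (point_estimate, lower_bound, upper_bound).
    """
    if mean(control) == 0:
        raise ValueError("control mean is zero; relative difference is undefined")
    rng = random.Random(seed)  # fixed seed so repeated runs give the same interval
    diffs = []
    for _ in range(n_resamples):
        # Resample each arm independently, with replacement, at its own size.
        c_mean = mean(rng.choices(control, k=len(control)))
        t_mean = mean(rng.choices(treatment, k=len(treatment)))
        if c_mean == 0:
            continue  # skip degenerate resamples where the ratio is undefined
        diffs.append((t_mean - c_mean) / c_mean)
    if not diffs:
        raise ValueError("every resample had zero control revenue")
    diffs.sort()
    lo_idx = int((alpha / 2) * len(diffs))
    hi_idx = max(lo_idx, int((1 - alpha / 2) * len(diffs)) - 1)
    point = (mean(treatment) - mean(control)) / mean(control)
    return point, diffs[lo_idx], diffs[hi_idx]


def simulate_sessions(n, purchase_rate, mean_order_value, rng):
    """Synthetic per-session revenue: most sessions earn 0, a few earn an order value.

    Purely illustrative; real data would come from the canary and control logs.
    """
    return [
        rng.expovariate(1 / mean_order_value) if rng.random() < purchase_rate else 0.0
        for _ in range(n)
    ]


if __name__ == "__main__":
    rng = random.Random(42)
    # Assumed numbers: same purchase rate in both arms, 8% lower order value in the canary.
    control = simulate_sessions(4000, 0.10, 50.0, rng)
    canary = simulate_sessions(1000, 0.10, 46.0, rng)
    point, lo, hi = bootstrap_relative_diff(control, canary)
    print(f"relative diff = {point:+.1%}, 95% interval = [{lo:+.1%}, {hi:+.1%}]")
```

If the interval comfortably includes zero, the 8% figure alone does not establish a regression and more traffic or more time is needed. If it excludes zero, the same function could be applied per segment (for example by device, item price band, or new versus returning users) to locate where revenue is being lost while click-through stays flat.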