Your RAG pipeline retrieves the top-k chunks to answer user questions, but you suspect retrieval quality is poor: the right documents are not always in the top-k. Design a retrieval evaluation framework. Cover: how to build a ground-truth evaluation set without human annotation at scale, which metrics to use (precision@k, recall@k, MRR, NDCG), how to evaluate re-ranking separately from initial retrieval, how to distinguish retrieval failure from cases where retrieval succeeds but generation still fails, and how to implement this evaluation in a CI/CD pipeline to catch regressions.
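
For reference, the four metrics named above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: it assumes binary relevance (a chunk is either relevant or not), chunk IDs as strings, and no duplicate IDs in a ranking. The function names are hypothetical. MRR is taken here as the mean of the per-query reciprocal rank.

```python
import math
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved IDs that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant IDs that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant ID, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[tuple]) -> float:
    """MRR over (ranked, relevant) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """NDCG@k with binary gains (assumption: gain is 1 for relevant, else 0)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    # Ideal ranking places every relevant ID first, capped at k positions.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Illustrative data only: one query, four retrieved chunk IDs, two relevant.
ranked_ids = ["d3", "d1", "d7", "d2"]
relevant_ids = {"d1", "d2"}
print(precision_at_k(ranked_ids, relevant_ids, 3))
print(recall_at_k(ranked_ids, relevant_ids, 3))
print(reciprocal_rank(ranked_ids, relevant_ids))
print(ndcg_at_k(ranked_ids, relevant_ids, 3))
```

Running the same functions once on the first-stage retriever's ranking and once on the re-ranker's output, against the same relevance labels, is one way to score the two stages separately.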