Hard · AI Engineering · System Design

How do you evaluate retrieval quality in a RAG system?

Your RAG pipeline retrieves the top-k chunks to answer user questions, but you suspect retrieval quality is poor: the right documents are not always in the top-k. Design a retrieval evaluation framework. Cover: how to build a ground-truth evaluation set at scale without human annotation, which metrics to use (precision@k, recall@k, MRR, NDCG), how to evaluate re-ranking separately from initial retrieval, how to distinguish cases where retrieval succeeds but generation still fails from genuine retrieval failures, and how to run this evaluation in a CI/CD pipeline to catch regressions.
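As a reference point for the metrics named above, here is a minimal sketch of precision@k, recall@k, MRR, and NDCG@k. It assumes binary relevance (a chunk is either relevant or not), unique chunk IDs in each ranked list, and a ground-truth set of relevant IDs per query. The function names and the chunk IDs in the usage note are hypothetical, chosen only for illustration.

```python
import math
from typing import Iterable, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved IDs that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant IDs that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant ID, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """MRR: the mean of reciprocal_rank over (ranked, relevant) pairs."""
    scores = [reciprocal_rank(ranked, relevant) for ranked, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """NDCG@k with binary gains: DCG of the ranking over the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    # The ideal ranking places every relevant ID first, capped at k slots.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

For example, with a ranked list `["c3", "c7", "c1", "c9"]` and relevant set `{"c1", "c4"}`, precision@3 is 1/3, recall@3 is 1/2, and the reciprocal rank is 1/3, because the only retrieved relevant chunk sits at rank 3. Computing the same functions once on the initial retriever's output and once on the re-ranker's output is one way to score the two stages separately.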
