RAG Interview Questions: 25 Real Problems from Top AI Companies (2026)
Real RAG interview questions sourced from Anthropic, OpenAI, Cohere, Perplexity, and 20+ other AI companies. Covers chunking, retrieval, evaluation, and production RAG systems. With answers.
Retrieval-augmented generation has become one of the most tested topics in AI engineering interviews. If you are preparing for an AI engineer, ML engineer, or software engineer role at a company building with LLMs, you will almost certainly face RAG questions.
This guide covers the real RAG interview questions that top AI companies are asking in 2026, organized by topic, with explanations of what strong answers look like and what most candidates get wrong.
Why RAG Is So Heavily Tested
RAG is tested heavily for a simple reason: it is one of the primary ways companies deploy LLMs in production, and it has a rich set of failure modes that separate engineers who have built real systems from engineers who have only read about them.
A candidate who has only built a RAG demo can explain what RAG is. A candidate who has debugged a RAG system in production can explain why RAG fails — and that is what companies are hiring for.
The failure modes are specific and learnable. Here is how companies test for them.
Chunking and Retrieval Questions
"You are building a RAG system over a large technical documentation corpus. Your initial chunking strategy splits documents into fixed 512-token chunks. Users report that answers frequently miss important context. What is happening and how would you fix it?"
What most candidates say: Increase the chunk size.
What strong candidates say: Fixed-size chunking without overlap loses context at chunk boundaries. A question whose answer spans two adjacent chunks will never be fully retrieved because neither chunk alone contains the complete answer. The fix involves adding chunk overlap (typically 10-20% of chunk size) and considering semantic chunking strategies that split on meaningful boundaries like paragraphs or sections rather than arbitrary token counts. Additionally, the retrieval strategy should consider returning adjacent chunks when a relevant chunk is found.
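A minimal sketch of fixed-size chunking with overlap, approximating tokens with whitespace-split words; in practice you would count tokens with the same tokenizer your embedding model uses:

```python
def chunk_with_overlap(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    # Overlap of ~10-20% of chunk_size keeps sentences that straddle a
    # boundary fully present in at least one chunk.
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + chunk_size >= len(words):
            break
    return chunks
```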
"Your RAG system retrieves the top 5 documents for every query. Faithfulness is 0.91 and context relevance is 0.88. Users still report that 1 in 5 answers feels wrong or irrelevant. What do you check?"
This is one of the most commonly reported interview questions across Anthropic, Cohere, and Perplexity.
What most candidates say: The faithfulness and context relevance scores look good, so the problem must be the model.
What strong candidates say: Faithfulness measures whether the answer is grounded in retrieved documents. Context relevance measures whether retrieved documents relate to the query. Neither measures whether the answer actually addresses what the user asked — that is answer relevance, and it is the missing metric here.
If answer relevance is low, the retrieved documents are topically related but do not contain the specific information needed to answer the question. This often indicates a chunking problem — the relevant sentence exists in the corpus but lands in a chunk that is not being retrieved — or an embedding model problem where the model is capturing topic similarity but not answer similarity.
"Explain the difference between BM25 and dense embedding retrieval. When would you use each?"
What most candidates say: BM25 is keyword search and dense retrieval is semantic search. Use dense retrieval for better results.
What strong candidates say: BM25 is a sparse retrieval method based on term frequency and inverse document frequency. It performs extremely well on exact-match queries, especially those involving proper nouns, technical terms, product names, and specific identifiers. It fails on semantic or paraphrased queries where the user's words do not match the document's words.
Dense retrieval embeds both queries and documents into a vector space and retrieves based on cosine similarity. It handles semantic variation well but can miss exact-match queries, especially for rare terms that appear infrequently in training data.
In production, neither approach dominates. Hybrid retrieval — running both in parallel and combining results before a reranking step — consistently outperforms either alone. BM25 catches what embeddings miss (exact technical terms, proper nouns) and embeddings catch what BM25 misses (semantic queries, paraphrased questions). The reranker resolves conflicts. Cohere in particular tests this question heavily because their reranker product sits in exactly this architecture.
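A sketch of that hybrid architecture, combining BM25 (via the rank_bm25 package) with dense retrieval and merging the two rankings using reciprocal rank fusion; the embed function, the precomputed document vectors, and the candidate counts are placeholders for whatever your stack uses. RRF is a convenient merge strategy here because it works on ranks and needs no score normalization between BM25 and cosine similarity.

```python
import numpy as np
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def rrf_merge(rankings: list[list[int]], k: int = 60) -> list[int]:
    # Reciprocal rank fusion: score(doc) = sum over rankings of 1 / (k + rank).
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_retrieve(query: str, docs: list[str], doc_vecs: np.ndarray, embed, top_k: int = 50):
    # Sparse ranking: BM25 over whitespace tokens.
    bm25 = BM25Okapi([d.split() for d in docs])
    sparse_rank = list(np.argsort(bm25.get_scores(query.split()))[::-1][:top_k])

    # Dense ranking: cosine similarity against precomputed, L2-normalized doc vectors.
    q = embed(query)
    q = q / np.linalg.norm(q)
    dense_rank = list(np.argsort(doc_vecs @ q)[::-1][:top_k])

    # Merge both rankings; the merged candidate set then goes to the reranker.
    return rrf_merge([sparse_rank, dense_rank])[:top_k]
```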
"You upgrade your embedding model from text-embedding-ada-002 to text-embedding-3-large for better quality. After deployment, your search crashes with a dimension mismatch error. What happened and how do you fix it?"
What most candidates say: Re-embed everything with the new model.
What strong candidates say: text-embedding-ada-002 outputs 1536-dimensional vectors. text-embedding-3-large outputs 3072-dimensional vectors. The index was built with the old model's dimensions — you cannot compare a 3072-dim query vector against a 1536-dim index. The immediate fix is to roll back the query model to match the index. The proper fix is to re-embed the entire corpus with the new model, rebuild the index, and deploy both changes together. Additionally — and this is what separates senior candidates — you should pin the embedding model version to the index metadata so this class of error is caught before deployment rather than in production.
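One way to enforce that check, assuming the index stores the model name and dimension it was built with (the field names here are illustrative):

```python
EXPECTED = {"model": "text-embedding-3-large", "dim": 3072}

def check_index_compatibility(index_metadata: dict) -> None:
    # Fail fast, ideally in CI or at deploy time, rather than at query time.
    if index_metadata.get("embedding_model") != EXPECTED["model"]:
        raise RuntimeError(
            f"Index built with {index_metadata.get('embedding_model')!r}, "
            f"but queries use {EXPECTED['model']!r}; re-embed and rebuild the index."
        )
    if index_metadata.get("dimension") != EXPECTED["dim"]:
        raise RuntimeError("Embedding dimension mismatch between query model and index.")
```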
"Your retrieval pipeline adds a reranker to improve result quality. After deploying, accuracy gets worse instead of better. The reranker works correctly in isolation. What is wrong?"
What most candidates say: The reranker model is not well-suited to this domain.
What strong candidates say: The most common cause of this failure is reranker order. A reranker is designed to take a candidate set of retrieved documents and reorder them by relevance — it is not designed to retrieve from scratch. If the reranker is being run against the full corpus (50,000 documents) instead of against the top-k results from initial retrieval (50-100 documents), it is doing the wrong job and will perform poorly. The correct pipeline is: initial retrieval (BM25 or dense, returns 50-100 candidates) → reranker (reorders those candidates, returns top 5-10) → LLM generation. Running the reranker first inverts this pipeline entirely.
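A sketch of the correct ordering, with vector_search, rerank, and generate standing in as placeholders for your vector store, cross-encoder reranker, and LLM client:

```python
def answer(query: str, vector_search, rerank, generate) -> str:
    # Stage 1: cheap, high-recall retrieval over the full corpus.
    candidates = vector_search(query, top_k=100)

    # Stage 2: the reranker only ever sees the candidate set, never the corpus.
    top_docs = rerank(query, candidates)[:5]

    # Stage 3: generation grounded in the reranked context.
    context = "\n\n".join(doc["text"] for doc in top_docs)
    return generate(f"Answer using only this context:\n{context}\n\nQuestion: {query}")
```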
Evaluation Questions
"How do you evaluate whether a RAG system is working? What metrics do you use?"
This is almost always the first RAG evaluation question in an interview. Candidates who can only name metrics get a mediocre score. Candidates who can explain what each metric misses get a strong score.
The four core metrics from RAGAS:
Faithfulness — what fraction of the generated answer is grounded in the retrieved context? A low faithfulness score means the model is hallucinating. A high faithfulness score does not mean the answer is correct — it means the answer did not invent information beyond what was retrieved.
Context relevance — what fraction of the retrieved context is relevant to the query? Low context relevance means retrieval is pulling in noisy, unrelated documents. This pollutes the context window and can distract the model from the relevant information.
Answer relevance — how well does the generated answer address the actual question asked? This is the metric users feel most directly. High faithfulness plus low answer relevance means the model produced a grounded answer to a slightly different question than the user asked.
Context recall — what fraction of the ground-truth answer's information is present in the retrieved context? This measures whether retrieval is finding everything it needs to. Low context recall means information exists in the corpus but is not being retrieved.
Strong candidates also note what these metrics miss: they do not measure factual accuracy against the real world (only internal consistency with retrieved documents), they do not measure whether the response tone is appropriate, and they require a ground truth dataset that can be expensive to produce at scale.
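As a concrete illustration, here is a minimal sketch of scoring a small test set with the ragas package. The imports and column names shown match the 0.1.x API and shift between versions, an LLM judge must be configured (for example via an OpenAI key), and what the prose above calls "context relevance" surfaces there as context_precision:

```python
# Requires an LLM judge configured, e.g. OPENAI_API_KEY in the environment.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

eval_data = Dataset.from_dict({
    "question":     ["How do I rotate an API key?"],
    "answer":       ["Go to Settings > API Keys and click Rotate."],
    "contexts":     [["API keys can be rotated from the Settings > API Keys page."]],
    "ground_truth": ["Rotate the key from Settings > API Keys."],
})

scores = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(scores)  # e.g. {'faithfulness': 0.95, 'answer_relevancy': 0.91, ...}
```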
"Your RAG evaluation pipeline shows faithfulness of 0.83 on your test set. You deploy to production and faithfulness drops to 0.71. The model did not change. What do you investigate?"
What most candidates say: Something changed in the infrastructure.
What strong candidates say: The gap between test and production performance almost always comes from distribution shift. The test set was sampled from one distribution of queries; production queries follow a different distribution. Specifically: the test set likely contains shorter, cleaner, more keyword-aligned queries — the kind that are easy to write for a test set. Production queries are longer, messier, more varied, and more likely to span chunk boundaries in ways that retrieval misses.
The investigation involves: analyzing the production queries where faithfulness is lowest, finding what they have in common (length, query type, entity type, phrasing patterns), and updating the test set to better represent this distribution. The chunking or retrieval strategy may also need to be updated based on what the production query analysis reveals.
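A small sketch of that slicing step, assuming each production request was logged to JSONL with the query text and a per-response faithfulness score (the column names are hypothetical):

```python
import pandas as pd

# Expected columns: query, faithfulness, plus whatever else you log per request.
logs = pd.read_json("prod_rag_logs.jsonl", lines=True)
logs["query_len"] = logs["query"].str.split().str.len()
logs["len_bucket"] = pd.cut(logs["query_len"], bins=[0, 8, 16, 32, 64, 1000])

# Compare mean faithfulness per bucket against the offline test set's distribution.
print(logs.groupby("len_bucket")["faithfulness"].agg(["mean", "count"]))
```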
"How do you build a ground truth dataset for evaluating your RAG system when you have no labeled data?"
This is a senior-level question that tests whether you have thought about the eval bootstrapping problem.
Strong answers involve: using an LLM to generate synthetic question-answer pairs from your corpus (LLM-as-judge for initial labeling), starting with a small human-labeled seed set and expanding via active learning (label the examples your model is most uncertain about), mining production queries for natural ground truth (if users ever give explicit feedback or if query-click patterns are available), and being explicit about the limitations of each approach — LLM-generated ground truth has its own biases, and human labeling is expensive to scale.
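A sketch of the synthetic bootstrapping approach; llm is a placeholder for any chat-completion call, the prompt is illustrative, and the generated pairs still need human spot-checking before they are trusted as ground truth:

```python
import json
import random

def generate_synthetic_evalset(chunks: list[str], llm, n: int = 200) -> list[dict]:
    pairs = []
    for chunk in random.sample(chunks, min(n, len(chunks))):
        raw = llm(
            "Write one question that can be answered ONLY from the passage below, "
            "then the answer. Respond as JSON with keys 'question' and 'answer'.\n\n"
            f"Passage:\n{chunk}"
        )
        qa = json.loads(raw)  # assumes the model returned valid JSON; validate in practice
        pairs.append({
            "question": qa["question"],
            "ground_truth": qa["answer"],
            "source_chunk": chunk,
        })
    return pairs
```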
Production RAG Questions
"You are deploying a RAG system that will handle 10,000 queries per day. Walk me through the cost and latency considerations."
This question separates engineers who have shipped production systems from those who have only built demos.
Key considerations:
Embedding cost: Every query must be embedded before retrieval. At scale, this adds up. Caching embeddings for repeated or similar queries can significantly reduce costs — many production queries are near-duplicates.
Retrieval latency: Vector search at scale requires either approximate nearest neighbor algorithms (FAISS, HNSW) or managed vector database services (Pinecone, Weaviate, Qdrant). The trade-off between index accuracy and query latency needs to be understood and configured for your latency SLA.
LLM cost: The majority of cost in a RAG system is typically the generation step. The number of retrieved documents directly determines how many tokens are sent to the LLM. Reducing top-k from 10 to 5 roughly halves the prompt tokens per request but may hurt recall — this trade-off needs to be measured, not assumed. Tiering by query complexity — using a smaller model for simple queries and a larger model for complex ones — can dramatically reduce average cost. A back-of-envelope cost sketch follows this list.
Observability: At production scale, you need logging of query, retrieved documents, generated response, latency, and token count for every request from day one. Without this, debugging production failures is nearly impossible and cost anomalies go undetected until the monthly bill arrives.
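As a back-of-envelope illustration of how the generation step dominates, here is a simple cost model; every number below, including the per-token prices, is an assumption to replace with your own traffic, chunk sizes, and vendor pricing:

```python
QUERIES_PER_DAY = 10_000
TOP_K = 5
TOKENS_PER_CHUNK = 512
PROMPT_OVERHEAD = 300              # system prompt + user question
OUTPUT_TOKENS = 250
INPUT_PRICE = 3.00 / 1_000_000     # $ per input token (illustrative)
OUTPUT_PRICE = 15.00 / 1_000_000   # $ per output token (illustrative)

input_tokens = TOP_K * TOKENS_PER_CHUNK + PROMPT_OVERHEAD
daily_cost = QUERIES_PER_DAY * (input_tokens * INPUT_PRICE + OUTPUT_TOKENS * OUTPUT_PRICE)
print(f"~${daily_cost:,.0f}/day, ~${daily_cost * 30:,.0f}/month before caching or tiering")
```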
"A senior engineer tells you to just increase the context window to avoid chunking entirely. How do you respond?"
This tests whether you understand why chunking exists, not just how to do it.
Increasing context window size does not eliminate the need for retrieval strategy — it changes the trade-off. Very large context windows introduce their own problems: LLM performance degrades on long contexts (the "lost in the middle" phenomenon, where information in the middle of a long context is less likely to be used than information at the beginning or end), cost scales linearly with context size, and latency increases significantly.
For most production use cases with large corpora, retrieval over chunked documents remains the correct approach. The senior engineer's suggestion makes sense only for small, bounded document sets where full-document retrieval is feasible and the total token count stays manageable. A good answer acknowledges the trade-off rather than dismissing the idea entirely.
"How would you handle a RAG system where the same information exists in multiple documents with slightly different versions — for example, a product that has been updated and the old documentation still exists in the corpus?"
This is a production realism question that many candidates have not thought through.
The naive RAG system retrieves both versions and the model either hallucinates a reconciliation or picks one arbitrarily — often the wrong one.
The production solutions involve: metadata filtering (tag documents with version or date metadata and filter by recency at retrieval time), document deduplication before indexing (identify near-duplicate documents and keep only the canonical version), or a post-retrieval step that identifies conflicting information and resolves it before passing context to the model. Anthropic and Perplexity both test this type of question because it reflects a real problem in enterprise document corpora.
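A sketch of the metadata-filtering option; the filter field names are illustrative, but most vector databases expose an equivalent filter parameter at query time:

```python
from datetime import datetime, timedelta, timezone

def retrieve_current_docs(query: str, vector_search, product: str, top_k: int = 10):
    cutoff = datetime.now(timezone.utc) - timedelta(days=365)
    return vector_search(
        query,
        top_k=top_k,
        # Only consider chunks tagged as the latest version of this product's docs,
        # or chunks updated within the last year.
        filters={
            "product": product,
            "is_latest_version": True,
            "updated_after": cutoff.isoformat(),
        },
    )
```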
"Your RAG system answers a user's question with a confident, detailed response. The answer is completely wrong — it contradicts the source documents. Faithfulness is 0.94. How is this possible?"
This is a trick question that tests deep understanding of what faithfulness actually measures.
Faithfulness of 0.94 means 94% of the generated answer is grounded in the retrieved context. The answer can still be factually wrong if the retrieved context itself contains incorrect information — faithfulness measures internal consistency with retrieved documents, not factual accuracy against the real world. If the corpus contains an outdated document with incorrect information, a faithful RAG system will confidently reproduce that incorrect information.
The fix involves: adding a knowledge cutoff metadata field to documents, filtering out documents past a certain age for time-sensitive queries, and where possible, cross-referencing answers against multiple retrieved documents rather than generating from a single source.
The One Thing Candidates Get Wrong About RAG Interviews
The most common mistake candidates make in RAG interviews is treating the questions as knowledge tests — as if the goal is to demonstrate that they know what RAG is.
The goal is not to demonstrate knowledge. The goal is to demonstrate production intuition.
Every RAG interview question is really asking: have you built one of these systems, watched it fail in a way you did not expect, and figured out why?
The candidates who get offers at Anthropic, Cohere, Perplexity, and OpenAI are not the ones who can recite the definition of faithfulness. They are the ones who can say "here is the scenario where faithfulness looks great but the system is still broken — and here is how I would catch it."
That kind of answer only comes from practice on real, broken systems — not from reading about RAG in the abstract.