A RAG evaluation checklist for production AI systems
A practical checklist for evaluating RAG systems: retrieval relevance, source coverage, grounded answers, citations, abstention, latency, and feedback loops.
RAG evaluation should answer two linked questions: did the system retrieve enough trustworthy evidence to answer, and did the model stay faithful to that evidence? If you only evaluate the final answer, you miss retrieval, the part of the pipeline that most often needs tuning.
Retrieval checks
- Are the top results relevant to the user question?
- Do they cover the answer completely, or only a fragment?
- Are metadata filters removing important documents?
- Does hybrid search beat vector-only search for your corpus?
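The retrieval questions above can be turned into numbers once you have a small labeled set of questions with known relevant documents. The following is a minimal sketch, not a definitive harness: the document IDs, the two ranked lists, and the function names (`precision_at_k`, `recall_at_k`, `lost_to_filter`) are all hypothetical, and the data is made up purely to show the mechanics.

```python
from typing import Sequence, Set


def precision_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the top-k results that are labeled relevant."""
    top_k = list(ranked_ids[:k])
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(top_k)


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the labeled relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def lost_to_filter(relevant_ids: Set[str], allowed_ids: Set[str]) -> Set[str]:
    """Relevant documents that a metadata filter would exclude before ranking."""
    return relevant_ids - allowed_ids


# Hypothetical labels and rankings for one question (illustrative data only).
relevant = {"doc_a", "doc_b", "doc_c"}
vector_only = ["doc_a", "doc_x", "doc_y", "doc_b"]
hybrid = ["doc_a", "doc_b", "doc_c", "doc_x"]
allowed_by_filter = {"doc_a", "doc_b", "doc_x", "doc_y"}

for name, ranking in [("vector_only", vector_only), ("hybrid", hybrid)]:
    p = precision_at_k(ranking, relevant, k=3)
    r = recall_at_k(ranking, relevant, k=3)
    print(f"{name}: precision@3={p:.2f} recall@3={r:.2f}")

print("relevant docs removed by filter:", sorted(lost_to_filter(relevant, allowed_by_filter)))
```

Precision at k speaks to whether the top results are relevant, recall at k to whether they cover the answer or only a fragment. Averaging both over the whole labeled set, once per retriever configuration, gives a like-for-like comparison of hybrid and vector-only search on your own corpus.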
Answer checks
Score whether the answer is correct, complete, grounded in the retrieved sources, and written at the right level of confidence. A helpful answer that invents one unsupported claim is still a trust problem.
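As a rough illustration of the groundedness part of this check, the sketch below flags answer sentences whose words mostly do not appear in any retrieved passage. Word overlap is a deliberately crude stand-in chosen for this example; it is an assumption, not a recommended judge, and it will miss paraphrases and negations. The function name `unsupported_claims` and the `min_overlap` threshold are hypothetical.

```python
import re
from typing import List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric word set; a crude proxy for sentence content."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def unsupported_claims(answer: str, passages: List[str], min_overlap: float = 0.6) -> List[str]:
    """Return answer sentences with low word overlap against every passage.

    ASSUMPTION: lexical overlap approximates "grounded in the sources".
    A real evaluation would likely use human labels or a stronger judge.
    """
    passage_tokens = [_tokens(p) for p in passages]
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = _tokens(sentence)
        if not words:
            continue
        # Best overlap ratio across passages; 0.0 when nothing was retrieved.
        best = max((len(words & pt) / len(words) for pt in passage_tokens), default=0.0)
        if best < min_overlap:
            flagged.append(sentence)
    return flagged


# Hypothetical passage and answer (illustrative data only).
passages = ["The API rate limit is 100 requests per minute per key."]
answer = "The rate limit is 100 requests per minute. Limits reset at midnight UTC."

for claim in unsupported_claims(answer, passages):
    print("unsupported:", claim)
```

Here the first sentence is fully covered by the passage, while the second has no support and gets flagged. That is exactly the case of a helpful answer that invents one unsupported claim.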
Abstention and citations
A production RAG system needs to say "I do not know" when the retrieved evidence is too weak to support an answer. It also needs citations that point to the exact source passages used, not decorative links at the end of a response.
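Both requirements can be checked mechanically. The sketch below assumes the retriever returns similarity scores where higher means more relevant, and that each citation carries a passage ID plus the quoted span it relies on. The `Citation` type, the function names, and the threshold values are hypothetical and would need tuning against your own retriever's score scale.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Citation:
    passage_id: str  # ID of the retrieved passage being cited
    quote: str       # exact span the answer claims to rely on


def should_abstain(scores: List[float], min_score: float = 0.5, min_hits: int = 1) -> bool:
    """Abstain when fewer than `min_hits` passages reach `min_score`.

    ASSUMPTION: higher score means more relevant; 0.5 is a placeholder.
    """
    strong_hits = sum(1 for s in scores if s >= min_score)
    return strong_hits < min_hits


def invalid_citations(citations: List[Citation], passages: Dict[str, str]) -> List[Citation]:
    """Citations whose passage is unknown or whose quote is not found verbatim."""
    bad = []
    for c in citations:
        text = passages.get(c.passage_id)
        if text is None or not c.quote.strip() or c.quote not in text:
            bad.append(c)
    return bad


# Hypothetical retrieved passages and citations (illustrative data only).
retrieved = {"p1": "Refunds are available within 30 days of purchase."}
cites = [
    Citation("p1", "within 30 days of purchase"),
    Citation("p2", "no questions asked"),  # passage was never retrieved
]

print("abstain on weak retrieval:", should_abstain([0.21, 0.18]))
print("bad citations:", [c.passage_id for c in invalid_citations(cites, retrieved)])
```

Requiring a verbatim quote is strict by design: a citation that cannot be matched to a span in a retrieved passage is treated as decorative.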
Once these checks run on every prompt or retrieval change, RAG tuning becomes engineering work rather than vibe-checking answers in a chat window.
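One way to run the checks on every change is a small regression gate that compares current metrics with a stored baseline and fails the build when any metric drops. This is a sketch under stated assumptions: the metric names, the tolerance, and the `regressions` function are hypothetical, and every metric is assumed to be "higher is better".

```python
import sys
from typing import Dict, Optional, Tuple


def regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,
) -> Dict[str, Tuple[float, Optional[float]]]:
    """Metrics that fell more than `tolerance` below baseline, or went missing.

    ASSUMPTION: every metric is "higher is better".
    """
    failed = {}
    for name, base_value in baseline.items():
        value = current.get(name)
        if value is None or value < base_value - tolerance:
            failed[name] = (base_value, value)
    return failed


if __name__ == "__main__":
    # Hypothetical numbers; in practice these would come from the eval run.
    baseline = {"recall_at_5": 0.80, "grounded_rate": 0.90}
    current = {"recall_at_5": 0.81, "grounded_rate": 0.85}

    failed = regressions(baseline, current)
    for name, (before, after) in failed.items():
        print(f"REGRESSION {name}: {before} -> {after}")
    sys.exit(1 if failed else 0)
```

Wired into CI, a non-zero exit blocks a prompt or retrieval change until the drop is explained or the baseline is deliberately updated.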