The Complete Guide to RAG Evaluation
Learn how to measure and improve your RAG system with practical metrics and evaluation frameworks
Building a RAG (Retrieval-Augmented Generation) system is one thing; knowing whether it actually works is another. After building 10+ production RAG systems, here's how I evaluate them.
Why Evaluation Matters
Without proper evaluation, you’re flying blind. You might think your RAG system is working great, but users are getting wrong answers 30% of the time. Or maybe your retrieval is perfect, but the LLM is hallucinating.
The Three Pillars
1. Retrieval Quality
Metrics:
- Recall@K: Are the right documents in the top K results?
- MRR (Mean Reciprocal Rank): How high is the first correct document ranked, on average?
- NDCG: Normalized Discounted Cumulative Gain for ranking quality
How to measure:
def calculate_recall_at_k(retrieved_docs, ground_truth_docs, k=5):
    """Fraction of ground-truth docs that show up in the top-k retrieved results."""
    retrieved_set = set(retrieved_docs[:k])
    ground_truth_set = set(ground_truth_docs)
    if not ground_truth_set:
        return 0.0  # no relevant docs defined for this query
    return len(retrieved_set & ground_truth_set) / len(ground_truth_set)
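Recall@K tells you whether the right documents showed up at all; MRR tells you how high the first correct one lands. A minimal sketch in the same spirit (the function and argument names here are mine, not from any library):
def mean_reciprocal_rank(ranked_results, ground_truth_sets):
    # ranked_results: one ranked list of doc IDs per query
    # ground_truth_sets: one set of relevant doc IDs per query
    total = 0.0
    for docs, relevant in zip(ranked_results, ground_truth_sets):
        for rank, doc_id in enumerate(docs, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_results) if ranked_results else 0.0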
2. Generation Quality
Metrics:
- Answer Relevance: Does the answer address the question?
- Faithfulness: Is the answer grounded in retrieved context?
- Completeness: Does it cover all aspects of the question?
Simple heuristics:
- Check for citations in the answer
- Verify key entities from the context appear in the answer (a rough version of this check is sketched below)
- Use LLM-as-judge for subjective evaluation
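For the entity check, a crude version is enough to catch obviously ungrounded answers. A sketch, assuming you're happy treating capitalized words and numbers as stand-in entities (swap in real NER if you need better precision):
import re
def entity_overlap(answer, context, min_overlap=0.5):
    # Crude entity extraction: capitalized tokens and numbers only.
    pattern = r"\b(?:[A-Z][a-zA-Z]+|\d[\d.,%]*)\b"
    context_entities = set(re.findall(pattern, context))
    answer_entities = set(re.findall(pattern, answer))
    if not context_entities:
        return True  # nothing to ground against
    covered = len(answer_entities & context_entities) / len(context_entities)
    return covered >= min_overlap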
3. End-to-End System
Metrics:
- User satisfaction: Thumbs up/down feedback
- Task completion rate: Can users find what they need?
- Time to answer: How fast is the system?
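Satisfaction and latency can come straight out of your request logs (task completion usually needs explicit task definitions or follow-up surveys). A minimal sketch, assuming each log record carries a feedback field and a latency_ms field; those names are illustrative, so match them to your own schema:
def summarize_feedback(log_records):
    # log_records: dicts like {"feedback": "up", "latency_ms": 840}
    rated = [r for r in log_records if r.get("feedback") in ("up", "down")]
    satisfaction = sum(r["feedback"] == "up" for r in rated) / len(rated) if rated else None
    latencies = sorted(r["latency_ms"] for r in log_records if "latency_ms" in r)
    p95_latency = latencies[int(0.95 * (len(latencies) - 1))] if latencies else None
    return {"satisfaction_rate": satisfaction, "p95_latency_ms": p95_latency}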
Building an Eval Set
Start with 50 golden questions:
- Real user queries from logs (anonymized)
- Edge cases you’ve encountered
- Known difficult questions that exposed past bugs
For each question, you need:
- The question
- Expected answer (or answer criteria)
- Relevant source documents
- Difficulty level (easy/medium/hard)
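Concretely, one entry in golden_questions.json might look like the dict below once loaded. The schema and the content are just an illustration of the four fields above, not a standard format:
example_entry = {
    "question": "How do I rotate my API key?",
    "expected_answer": "Must mention the key-rotation page and the grace period for old keys.",
    "relevant_docs": ["docs/security/api-keys.md"],
    "difficulty": "medium",
}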
Continuous Evaluation
Set up automated evaluation that runs on every change:
# Run eval suite
python eval.py --eval-set golden_questions.json
# Compare to baseline
python compare.py --current results.json --baseline baseline.json
Track metrics over time. Any regression should block deployment.
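The gate itself doesn't need to be clever. A sketch of what compare.py could boil down to, assuming results.json and baseline.json are flat dicts of metric name to score (that format is my assumption, not something defined above):
import json
import sys
THRESHOLD = 0.02  # tolerated absolute drop per metric; tune to your metrics' noise
def find_regressions(current_path, baseline_path):
    with open(current_path) as f:
        current = json.load(f)
    with open(baseline_path) as f:
        baseline = json.load(f)
    regressions = []
    for metric, base_value in baseline.items():
        current_value = current.get(metric, 0.0)
        if current_value < base_value - THRESHOLD:
            regressions.append(f"{metric}: {base_value:.3f} -> {current_value:.3f}")
    return regressions
if __name__ == "__main__":
    failed = find_regressions(sys.argv[1], sys.argv[2])
    if failed:
        print("Regressions found:")
        print("\n".join(failed))
        sys.exit(1)  # non-zero exit blocks deployment in CI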
Common Pitfalls
❌ Only testing the happy path - Include edge cases, ambiguous queries, and out-of-scope questions
❌ Eval set too small - 10 questions isn’t enough. Aim for 50-100.
❌ Not updating eval set - Add new edge cases as you discover them
❌ Ignoring latency - Fast wrong answers aren’t useful
My Evaluation Workflow
- Before changes: Run eval suite to establish baseline
- During development: Manually test on 5-10 key questions
- Before deployment: Full eval suite + regression testing
- In production: Monitor real user feedback and add failures to eval set
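That last step is easy to partially automate: when a user flags an answer, append the query to the eval set as an unreviewed case and fill in the expected answer later. A sketch, assuming golden_questions.json is a JSON list of entries like the example shown earlier:
import json
def add_failure_to_eval_set(question, notes, eval_set_path="golden_questions.json"):
    # Append a flagged production query as a new, not-yet-reviewed eval case.
    with open(eval_set_path) as f:
        eval_set = json.load(f)
    eval_set.append({
        "question": question,
        "expected_answer": None,  # fill in during manual review
        "relevant_docs": [],
        "difficulty": "hard",
        "notes": notes,
    })
    with open(eval_set_path, "w") as f:
        json.dump(eval_set, f, indent=2)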
Key Takeaway
You can’t improve what you don’t measure. Start with simple metrics, build a small eval set, and iterate from there.
Next week: “Defensive Prompt Engineering for Production Systems”