
The Complete Guide to RAG Evaluation

Learn how to measure and improve your RAG system with practical metrics and evaluation frameworks

3 min read


Building a RAG (Retrieval-Augmented Generation) system is one thing. Knowing if it actually works is another. After building 10+ production RAG systems, here’s how I evaluate them.

Why Evaluation Matters

Without proper evaluation, you’re flying blind. You might think your RAG system is working great while users are getting wrong answers 30% of the time. Or maybe your retrieval is perfect, but the LLM is hallucinating anyway.

The Three Pillars

1. Retrieval Quality

Metrics:

  • Recall@K: Are the right documents in the top K results?
  • MRR (Mean Reciprocal Rank): How high are the correct docs ranked?
  • NDCG: Normalized Discounted Cumulative Gain for ranking quality

How to measure:

def calculate_recall_at_k(retrieved_docs, ground_truth_docs, k=5):
    """Fraction of ground-truth documents that appear in the top-k retrieved results."""
    retrieved_set = set(retrieved_docs[:k])
    ground_truth_set = set(ground_truth_docs)

    if not ground_truth_set:
        return 0.0

    return len(retrieved_set & ground_truth_set) / len(ground_truth_set)
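
MRR can be sketched the same way. The helper below returns the reciprocal rank of the first relevant document for a single query (0.0 if none is retrieved); averaging that value across your eval questions gives MRR:

def calculate_mrr(retrieved_docs, ground_truth_docs):
    # Reciprocal rank of the first relevant document in the ranked list
    ground_truth_set = set(ground_truth_docs)
    for rank, doc in enumerate(retrieved_docs, start=1):
        if doc in ground_truth_set:
            return 1.0 / rank
    return 0.0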

2. Generation Quality

Metrics:

  • Answer Relevance: Does the answer address the question?
  • Faithfulness: Is the answer grounded in retrieved context?
  • Completeness: Does it cover all aspects of the question?

Simple heuristics:

  • Check for citations in the answer
  • Verify key entities from context appear in the answer (a rough check is sketched below)
  • Use LLM-as-judge for subjective evaluation
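
That entity check can be as simple as token overlap. This is a minimal sketch, not a real faithfulness metric; the 0.5 threshold is an arbitrary placeholder you should tune for your data:

def entity_overlap_check(answer, context_entities, threshold=0.5):
    # Fraction of expected entities (names, dates, figures) mentioned in the answer
    answer_lower = answer.lower()
    found = [e for e in context_entities if e.lower() in answer_lower]
    coverage = len(found) / len(context_entities) if context_entities else 1.0
    return coverage >= threshold, coverage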

3. End-to-End System

Metrics:

  • User satisfaction: Thumbs up/down feedback
  • Task completion rate: Can users find what they need?
  • Time to answer: How fast is the system?

Building an Eval Set

Start with 50 golden questions:

  1. Real user queries from logs (anonymized)
  2. Edge cases you’ve encountered
  3. Known difficult questions that exposed past bugs

For each question, you need (an example entry follows this list):

  • The question
  • Expected answer (or answer criteria)
  • Relevant source documents
  • Difficulty level (easy/medium/hard)
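
As a concrete illustration, one entry in golden_questions.json could look like the dict below. The field names and values are made up for illustration; adapt them to your pipeline:

golden_question = {
    "question": "How do I rotate an API key?",
    "expected_answer": "Keys can be regenerated from the account settings page.",
    "relevant_docs": ["docs/account-settings.md"],
    "difficulty": "easy",
}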

Continuous Evaluation

Set up automated evaluation that runs on every change:

# Run eval suite
python eval.py --eval-set golden_questions.json

# Compare to baseline
python compare.py --current results.json --baseline baseline.json

Track metrics over time. Any regression should block deployment.
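
The regression gate itself can be a few lines: load the current and baseline metrics and exit non-zero if anything dropped. This is a minimal sketch rather than the actual compare.py; the flat metric-to-score JSON layout and the 2% tolerance are assumptions:

import json
import sys

def check_regressions(current_path, baseline_path, tolerance=0.02):
    # Block deployment if any tracked metric dropped by more than `tolerance`
    with open(current_path) as f:
        current = json.load(f)
    with open(baseline_path) as f:
        baseline = json.load(f)

    regressions = [m for m, base in baseline.items()
                   if current.get(m, 0.0) < base - tolerance]
    if regressions:
        print("Regression in: " + ", ".join(regressions))
        sys.exit(1)
    print("No regressions detected.")

if __name__ == "__main__":
    check_regressions("results.json", "baseline.json")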

Common Pitfalls

Only testing the happy path - Include edge cases, ambiguous queries, and out-of-scope questions

Eval set too small - 10 questions isn’t enough. Aim for 50-100.

Not updating the eval set - Add new edge cases as you discover them

Ignoring latency - Fast wrong answers aren’t useful, but neither are correct answers that take too long

My Evaluation Workflow

  1. Before changes: Run eval suite to establish baseline
  2. During development: Manually test on 5-10 key questions
  3. Before deployment: Full eval suite + regression testing
  4. In production: Monitor real user feedback and add failures to eval set

Key Takeaway

You can’t improve what you don’t measure. Start with simple metrics, build a small eval set, and iterate from there.

Next week: “Defensive Prompt Engineering for Production Systems”

Isragel Andres

AI Specialist focused on RAG systems, workflow automation, and AI agents. I build production-ready AI systems with measurable outcomes.
