The Complete Guide to RAG Evaluation
Learn how to measure and improve your RAG system with practical metrics and evaluation frameworks
Building a RAG (Retrieval-Augmented Generation) system is one thing; knowing whether it actually works is another. After building 10+ production RAG systems, here's how I evaluate them.
Why Evaluation Matters
Without proper evaluation, you’re flying blind. You might think your RAG system is working great, but users are getting wrong answers 30% of the time. Or maybe your retrieval is perfect, but the LLM is hallucinating.
The Three Pillars
1. Retrieval Quality
Metrics:
- Recall@K: Are the right documents in the top K results?
- MRR (Mean Reciprocal Rank): How high is the first correct document ranked, on average?
- NDCG: Normalized Discounted Cumulative Gain for ranking quality
How to measure:
def calculate_recall_at_k(retrieved_docs, ground_truth_docs, k=5):
    """Fraction of ground-truth docs that show up in the top-k retrieved results."""
    retrieved_set = set(retrieved_docs[:k])
    ground_truth_set = set(ground_truth_docs)
    if not ground_truth_set:
        return 0.0  # no relevant docs defined for this query
    return len(retrieved_set & ground_truth_set) / len(ground_truth_set)
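Recall@K tells you whether the right documents showed up at all; MRR tells you how high the first correct one lands. A minimal sketch in the same spirit (the function and argument names here are mine, not from any library):
def mean_reciprocal_rank(ranked_results, ground_truth_sets):
    # ranked_results: one ranked list of doc IDs per query
    # ground_truth_sets: one set of relevant doc IDs per query
    total = 0.0
    for docs, relevant in zip(ranked_results, ground_truth_sets):
        for rank, doc_id in enumerate(docs, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_results) if ranked_results else 0.0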
2. Generation Quality
Metrics:
- Answer Relevance: Does the answer address the question?
- Faithfulness: Is the answer grounded in retrieved context?
- Completeness: Does it cover all aspects of the question?
Simple heuristics:
- Check for citations in the answer
- Verify key entities from the context appear in the answer (a rough version of this check is sketched below)
- Use LLM-as-judge for subjective evaluation
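For the entity check, a crude version is enough to catch obviously ungrounded answers. A sketch, assuming you're happy treating capitalized words and numbers as stand-in entities (swap in real NER if you need better precision):
import re
def entity_overlap(answer, context, min_overlap=0.5):
    # Crude entity extraction: capitalized tokens and numbers only.
    pattern = r"\b(?:[A-Z][a-zA-Z]+|\d[\d.,%]*)\b"
    context_entities = set(re.findall(pattern, context))
    answer_entities = set(re.findall(pattern, answer))
    if not context_entities:
        return True  # nothing to ground against
    covered = len(answer_entities & context_entities) / len(context_entities)
    return covered >= min_overlap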
3. End-to-End System
Metrics:
- User satisfaction: Thumbs up/down feedback
- Task completion rate: Can users find what they need?
- Time to answer: How fast is the system?
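Satisfaction and latency can come straight out of your request logs (task completion usually needs explicit task definitions or follow-up surveys). A minimal sketch, assuming each log record carries a feedback field and a latency_ms field; those names are illustrative, so match them to your own schema:
def summarize_feedback(log_records):
    # log_records: dicts like {"feedback": "up", "latency_ms": 840}
    rated = [r for r in log_records if r.get("feedback") in ("up", "down")]
    satisfaction = sum(r["feedback"] == "up" for r in rated) / len(rated) if rated else None
    latencies = sorted(r["latency_ms"] for r in log_records if "latency_ms" in r)
    p95_latency = latencies[int(0.95 * (len(latencies) - 1))] if latencies else None
    return {"satisfaction_rate": satisfaction, "p95_latency_ms": p95_latency}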
Building an Eval Set
Start with 50 golden questions:
- Real user queries from logs (anonymized)
- Edge cases you’ve encountered
- Known difficult questions that exposed past bugs
For each question, you need:
- The question
- Expected answer (or answer criteria)
- Relevant source documents
- Difficulty level (easy/medium/hard)
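Concretely, one entry in golden_questions.json might look like the dict below once loaded. The schema and the content are just an illustration of the four fields above, not a standard format:
example_entry = {
    "question": "How do I rotate my API key?",
    "expected_answer": "Must mention the key-rotation page and the grace period for old keys.",
    "relevant_docs": ["docs/security/api-keys.md"],
    "difficulty": "medium",
}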
Continuous Evaluation
Set up automated evaluation that runs on every change:
# Run eval suite
python eval.py --eval-set golden_questions.json
# Compare to baseline
python compare.py --current results.json --baseline baseline.json
Track metrics over time. Any regression should block deployment.
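The gate itself doesn't need to be clever. A sketch of what compare.py could boil down to, assuming results.json and baseline.json are flat dicts of metric name to score (that format is my assumption, not something defined above):
import json
import sys
THRESHOLD = 0.02  # tolerated absolute drop per metric; tune to your metrics' noise
def find_regressions(current_path, baseline_path):
    with open(current_path) as f:
        current = json.load(f)
    with open(baseline_path) as f:
        baseline = json.load(f)
    regressions = []
    for metric, base_value in baseline.items():
        current_value = current.get(metric, 0.0)
        if current_value < base_value - THRESHOLD:
            regressions.append(f"{metric}: {base_value:.3f} -> {current_value:.3f}")
    return regressions
if __name__ == "__main__":
    failed = find_regressions(sys.argv[1], sys.argv[2])
    if failed:
        print("Regressions found:")
        print("\n".join(failed))
        sys.exit(1)  # non-zero exit blocks deployment in CI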
Common Pitfalls
❌ Only testing the happy path - Include edge cases, ambiguous queries, and out-of-scope questions
❌ Eval set too small - 10 questions isn’t enough. Aim for 50-100.
❌ Not updating eval set - Add new edge cases as you discover them
❌ Ignoring latency - Fast wrong answers aren’t useful
My Evaluation Workflow
- Before changes: Run eval suite to establish baseline
- During development: Manually test on 5-10 key questions
- Before deployment: Full eval suite + regression testing
- In production: Monitor real user feedback and add failures to eval set
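That last step is easy to partially automate: when a user flags an answer, append the query to the eval set as an unreviewed case and fill in the expected answer later. A sketch, assuming golden_questions.json is a JSON list of entries like the example shown earlier:
import json
def add_failure_to_eval_set(question, notes, eval_set_path="golden_questions.json"):
    # Append a flagged production query as a new, not-yet-reviewed eval case.
    with open(eval_set_path) as f:
        eval_set = json.load(f)
    eval_set.append({
        "question": question,
        "expected_answer": None,  # fill in during manual review
        "relevant_docs": [],
        "difficulty": "hard",
        "notes": notes,
    })
    with open(eval_set_path, "w") as f:
        json.dump(eval_set, f, indent=2)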
Key Takeaway
You can’t improve what you don’t measure. Start with simple metrics, build a small eval set, and iterate from there.
Next week: “Defensive Prompt Engineering for Production Systems”