Beginner

Evaluating RAG Answers with Simple Heuristics

Quick techniques to validate RAG system outputs without expensive LLM-as-judge approaches

Tools Required

Python spaCy Sentence Transformers

Tags

RAG LLM Evaluation

You don’t need GPT-4 to evaluate your RAG answers. Start with simple heuristics.

Why Simple Heuristics?

  • Fast: No API calls, instant feedback
  • Cheap: No LLM costs
  • Transparent: You know exactly what’s being checked
  • Good enough: Catches the majority of obvious issues

Heuristic 1: Citation Check

Rule: Answer must reference source documents

import re

def has_citations(answer):
    # Check for common citation markers
    patterns = [
        r'\[(\d+)\]',           # [1], [2]
        r'\(Source: .+\)',      # (Source: doc.pdf)
        r'According to .+,',    # According to Policy Doc,
    ]

    for pattern in patterns:
        if re.search(pattern, answer):
            return True

    return False

# Usage
if not has_citations(answer):
    score -= 20  # Penalty for no citations

Heuristic 2: Entity Overlap

Rule: Important entities from context should appear in answer

import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def calculate_entity_overlap(context, answer):
    context_entities = set(ent.text for ent in nlp(context).ents)
    answer_entities = set(ent.text for ent in nlp(answer).ents)
    
    if not context_entities:
        return 1.0  # No entities to match
    
    overlap = len(context_entities & answer_entities) / len(context_entities)
    return overlap

# Usage
overlap = calculate_entity_overlap(context, answer)
if overlap < 0.3:
    print("Warning: Low entity overlap, possible hallucination")

Heuristic 3: Length Check

Rule: Answer should be appropriate length

def check_length(answer, question):
    words = len(answer.split())
    
    # Very short answers might be incomplete
    if words < 10:
        return "too_short"
    
    # Very long answers might be rambling
    if words > 500:
        return "too_long"
    
    # Check if question asks for specific format
    if "list" in question.lower() or "steps" in question.lower():
        if words < 50:
            return "too_short"
    
    return "ok"

Heuristic 4: Hallucination Patterns

Rule: Flag common hallucination indicators

def detect_hallucination_patterns(answer):
    # Phrases that often indicate uncertainty or hallucination
    uncertain_phrases = [
        "i think",
        "i believe",
        "probably",
        "it seems",
        "as far as i know",
        "i'm not sure"
    ]
    
    # Conversational patterns (RAG should be factual)
    conversational = [
        "well,",
        "you know,",
        "um,",
        "to be honest"
    ]
    
    answer_lower = answer.lower()
    
    for phrase in uncertain_phrases + conversational:
        if phrase in answer_lower:
            return True, phrase
    
    return False, None

# Usage
is_uncertain, phrase = detect_hallucination_patterns(answer)
if is_uncertain:
    print(f"Warning: Uncertain language detected: '{phrase}'")

Heuristic 5: Context Similarity

Rule: Answer should be semantically similar to context

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

def calculate_similarity(context, answer):
    # Embed both texts and compare with cosine similarity (higher = closer)
    context_embedding = model.encode(context, convert_to_tensor=True)
    answer_embedding = model.encode(answer, convert_to_tensor=True)

    similarity = util.cos_sim(context_embedding, answer_embedding)
    return similarity.item()

# Usage
similarity = calculate_similarity(context, answer)
if similarity < 0.5:
    print("Warning: Answer diverges from source context")

Combining Heuristics

Build a simple scoring system:

def evaluate_answer(question, context, answer):
    score = 100
    issues = []
    
    # Check 1: Citations
    if not has_citations(answer):
        score -= 20
        issues.append("No citations")
    
    # Check 2: Entity overlap
    overlap = calculate_entity_overlap(context, answer)
    if overlap < 0.3:
        score -= 15
        issues.append(f"Low entity overlap: {overlap:.2f}")
    
    # Check 3: Length
    length_check = check_length(answer, question)
    if length_check != "ok":
        score -= 10
        issues.append(f"Answer {length_check}")
    
    # Check 4: Hallucination patterns
    is_uncertain, phrase = detect_hallucination_patterns(answer)
    if is_uncertain:
        score -= 25
        issues.append(f"Uncertain language: {phrase}")
    
    # Check 5: Similarity
    similarity = calculate_similarity(context, answer)
    if similarity < 0.5:
        score -= 20
        issues.append(f"Low similarity: {similarity:.2f}")
    
    return {
        "score": max(0, score),
        "issues": issues,
        "passed": score >= 70
    }

Real-World Usage

# In your RAG pipeline (rag_system, logger, and retry_with_fallback are your own components)
def answer_query(question):
    result = rag_system.query(question)

    evaluation = evaluate_answer(
        question=question,
        context=result["retrieved_context"],
        answer=result["answer"]
    )

    if evaluation["passed"]:
        return result["answer"]
    else:
        # Fallback: Try again with different retrieval or prompt
        logger.warning(f"Answer failed evaluation: {evaluation['issues']}")
        return retry_with_fallback()

answer = answer_query("What is our return policy?")

When to Use Each Heuristic

Heuristic                 Use When                       Skip When
Citations                 Answers need attribution       Casual Q&A
Entity overlap            Technical/factual content      Opinion/analysis
Length check              Consistent answer format       Varied question types
Hallucination patterns    High-stakes decisions          Low risk
Similarity                Single-source answers          Multi-source synthesis
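
One way to act on this table is to make the checks configurable per use case. A minimal sketch, assuming you pass in a set of check names (the evaluate_answer_selective function and its checks argument are made up for this illustration):

def evaluate_answer_selective(question, context, answer, checks):
    # `checks` is a set of heuristic names to run, e.g. {"citations", "entity_overlap"}
    score = 100
    issues = []

    if "citations" in checks and not has_citations(answer):
        score -= 20
        issues.append("No citations")

    if "entity_overlap" in checks:
        overlap = calculate_entity_overlap(context, answer)
        if overlap < 0.3:
            score -= 15
            issues.append(f"Low entity overlap: {overlap:.2f}")

    if "hallucination_patterns" in checks:
        is_uncertain, phrase = detect_hallucination_patterns(answer)
        if is_uncertain:
            score -= 25
            issues.append(f"Uncertain language: {phrase}")

    # Remaining checks (length, similarity) follow the same pattern as evaluate_answer()

    return {"score": max(0, score), "issues": issues, "passed": score >= 70}

# Usage: a policy chatbot needs attribution and grounding; a casual FAQ bot might not
evaluation = evaluate_answer_selective(question, context, answer,
                                       checks={"citations", "entity_overlap"})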

Limitations

These heuristics aren’t perfect:

  • ✅ Catch obvious errors quickly
  • ✅ Provide instant feedback
  • ❌ Can’t judge semantic correctness
  • ❌ May have false positives

Use them as a first line of defense, not the only evaluation.

Next Steps

  1. Start with 2-3 heuristics that matter most for your use case
  2. Track which heuristics catch real issues
  3. Tune thresholds based on your data (see the sketch after this list)
  4. Graduate to LLM-based evaluation for edge cases
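
A minimal sketch for steps 2 and 3, assuming you keep a small hand-labeled sample of answers (the labeled_examples list below is made up for illustration):

from collections import Counter

# Hand-labeled sample: (question, context, answer, human_verdict)
labeled_examples = [
    ("What is our return policy?",
     "Items may be returned within 30 days of purchase with a receipt.",
     "Items can be returned within 30 days with a receipt [1].",
     "good"),
    ("What is our return policy?",
     "Items may be returned within 30 days of purchase with a receipt.",
     "I think you can probably return things whenever you like.",
     "bad"),
]

# Count which heuristics fire on answers a human marked as bad
issue_counts = Counter()
for question, context, answer, verdict in labeled_examples:
    evaluation = evaluate_answer(question, context, answer)
    if verdict == "bad":
        # Group issues by their label, ignoring the specific numbers
        issue_counts.update(issue.split(":")[0] for issue in evaluation["issues"])

print(issue_counts.most_common())
# Heuristics that never fire on bad answers are candidates for looser thresholds or removal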

Remember: Perfect is the enemy of good. Simple heuristics that run on every query are better than perfect evaluation you never implement.