Evaluating RAG Answers with Simple Heuristics

Quick techniques to validate RAG system outputs without expensive LLM-as-judge approaches.

Level: Beginner
Tools required: Python, spaCy, Sentence Transformers
Tags: RAG, LLM Evaluation
You don’t need GPT-4 to evaluate your RAG answers. Start with simple heuristics.
Why Simple Heuristics?
- Fast: No API calls, instant feedback
- Cheap: No LLM costs
- Transparent: You know exactly what’s being checked
- Good enough: Catches 80% of issues
Heuristic 1: Citation Check
Rule: Answer must reference source documents
```python
import re

def has_citations(answer):
    # Check for common citation markers
    patterns = [
        r'\[(\d+)\]',         # [1], [2]
        r'\(Source: .+\)',    # (Source: doc.pdf)
        r'According to .+,',  # According to Policy Doc,
    ]
    for pattern in patterns:
        if re.search(pattern, answer):
            return True
    return False

# Usage
if not has_citations(answer):
    score -= 20  # Penalty for no citations
```
Heuristic 2: Entity Overlap
Rule: Important entities from context should appear in answer
```python
import spacy

nlp = spacy.load("en_core_web_sm")

def calculate_entity_overlap(context, answer):
    context_entities = set(ent.text for ent in nlp(context).ents)
    answer_entities = set(ent.text for ent in nlp(answer).ents)
    if not context_entities:
        return 1.0  # No entities to match
    overlap = len(context_entities & answer_entities) / len(context_entities)
    return overlap

# Usage
overlap = calculate_entity_overlap(context, answer)
if overlap < 0.3:
    print("Warning: Low entity overlap, possible hallucination")
```
Heuristic 3: Length Check
Rule: Answer should be appropriate length
```python
def check_length(answer, question):
    words = len(answer.split())

    # Very short answers might be incomplete
    if words < 10:
        return "too_short"

    # Very long answers might be rambling
    if words > 500:
        return "too_long"

    # Check if the question asks for a specific format
    if "list" in question.lower() or "steps" in question.lower():
        if words < 50:
            return "too_short"

    return "ok"
```
Heuristic 4: Hallucination Patterns
Rule: Flag common hallucination indicators
```python
def detect_hallucination_patterns(answer):
    # Phrases that often indicate uncertainty or hallucination
    uncertain_phrases = [
        "i think",
        "i believe",
        "probably",
        "it seems",
        "as far as i know",
        "i'm not sure",
    ]
    # Conversational patterns (RAG should be factual)
    conversational = [
        "well,",
        "you know,",
        "um,",
        "to be honest",
    ]
    answer_lower = answer.lower()
    for phrase in uncertain_phrases + conversational:
        if phrase in answer_lower:
            return True, phrase
    return False, None

# Usage
is_uncertain, phrase = detect_hallucination_patterns(answer)
if is_uncertain:
    print(f"Warning: Uncertain language detected: '{phrase}'")
```
Heuristic 5: Context Similarity
Rule: Answer should be semantically similar to context
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

def calculate_similarity(context, answer):
    context_embedding = model.encode(context, convert_to_tensor=True)
    answer_embedding = model.encode(answer, convert_to_tensor=True)
    similarity = util.cos_sim(context_embedding, answer_embedding)
    return similarity.item()

# Usage
similarity = calculate_similarity(context, answer)
if similarity < 0.5:
    print("Warning: Answer diverges from source context")
```
Combining Heuristics
Build a simple scoring system:
```python
def evaluate_answer(question, context, answer):
    score = 100
    issues = []

    # Check 1: Citations
    if not has_citations(answer):
        score -= 20
        issues.append("No citations")

    # Check 2: Entity overlap
    overlap = calculate_entity_overlap(context, answer)
    if overlap < 0.3:
        score -= 15
        issues.append(f"Low entity overlap: {overlap:.2f}")

    # Check 3: Length
    length_check = check_length(answer, question)
    if length_check != "ok":
        score -= 10
        issues.append(f"Answer {length_check}")

    # Check 4: Hallucination patterns
    is_uncertain, phrase = detect_hallucination_patterns(answer)
    if is_uncertain:
        score -= 25
        issues.append(f"Uncertain language: {phrase}")

    # Check 5: Similarity
    similarity = calculate_similarity(context, answer)
    if similarity < 0.5:
        score -= 20
        issues.append(f"Low similarity: {similarity:.2f}")

    return {
        "score": max(0, score),
        "issues": issues,
        "passed": score >= 70
    }
```
Real-World Usage
```python
# In your RAG pipeline (rag_system, logger, and retry_with_fallback are
# placeholders for your own components)
def answer_query(user_question):
    result = rag_system.query(user_question)

    evaluation = evaluate_answer(
        question=result["question"],
        context=result["retrieved_context"],
        answer=result["answer"]
    )

    if evaluation["passed"]:
        return result["answer"]
    else:
        # Fallback: Try again with different retrieval or prompt
        logger.warning(f"Answer failed evaluation: {evaluation['issues']}")
        return retry_with_fallback()

answer_query("What is our return policy?")
```
When to Use Each Heuristic
| Heuristic | Use When | Skip When |
|---|---|---|
| Citations | Answers need attribution | Casual Q&A |
| Entity overlap | Technical/factual content | Opinion/analysis |
| Length check | Consistent answer format | Varied question types |
| Hallucination patterns | High-stakes decisions | Low risk |
| Similarity | Single source answers | Multi-source synthesis |
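If you want to act on this table programmatically, one option is to wrap the checks in a small registry and enable only the ones that fit your use case. This is a sketch, not part of the scoring code above; the registry name, penalties, and thresholds are illustrative and simply reuse the functions defined earlier:

```python
# Each entry maps a heuristic name to (failure_test, penalty). The tests reuse
# the functions defined above; the penalties mirror evaluate_answer().
HEURISTICS = {
    "citations":      (lambda q, c, a: not has_citations(a), 20),
    "entity_overlap": (lambda q, c, a: calculate_entity_overlap(c, a) < 0.3, 15),
    "length":         (lambda q, c, a: check_length(a, q) != "ok", 10),
    "hallucination":  (lambda q, c, a: detect_hallucination_patterns(a)[0], 25),
    "similarity":     (lambda q, c, a: calculate_similarity(c, a) < 0.5, 20),
}

def evaluate_with(enabled, question, context, answer):
    # Run only the enabled heuristics and collect the ones that fail
    score, issues = 100, []
    for name in enabled:
        failed, penalty = HEURISTICS[name]
        if failed(question, context, answer):
            score -= penalty
            issues.append(name)
    return {"score": max(0, score), "issues": issues, "passed": score >= 70}

# Example: factual documentation Q&A with varied question types
# (skip the length check, per the table above)
evaluate_with(["citations", "entity_overlap", "hallucination", "similarity"],
              question, context, answer)
```

The penalties carry over from `evaluate_answer`; adjust them per deployment as you tune.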
Limitations
These heuristics aren’t perfect:
- ✅ Catch obvious errors quickly
- ✅ Provide instant feedback
- ❌ Can’t judge semantic correctness
- ❌ May have false positives
Use them as a first line of defense, not the only evaluation.
Next Steps
- Start with 2-3 heuristics that matter most for your use case
- Track which heuristics catch real issues (see the logging sketch after this list)
- Tune thresholds based on your data
- Graduate to LLM-based evaluation for edge cases
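To make steps 2 and 3 concrete, one lightweight option is to log every evaluation result as it happens and review which issues actually predicted bad answers before moving any thresholds. A minimal sketch, assuming a JSONL log file (the path and record fields here are illustrative, not part of the pipeline above):

```python
import json
import time

def log_evaluation(question, evaluation, log_path="rag_eval_log.jsonl"):
    # Append one record per query so thresholds can be tuned offline later
    record = {
        "timestamp": time.time(),
        "question": question,
        "score": evaluation["score"],
        "issues": evaluation["issues"],
        "passed": evaluation["passed"],
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Later, count how often each issue appears in answers users actually accepted;
# issues that fire mostly on good answers are candidates for a looser threshold.
```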
Remember: Perfect is the enemy of good. Simple heuristics that run on every query are better than perfect evaluation you never implement.