Hallucination Detection in RAG Systems
Hallucinations are RAG's Achilles heel. Learn how to detect, measure, and prevent them with proven techniques.
TL;DR
- Hallucination = response not supported by the provided context
- 2 types: intrinsic (contradictions) and extrinsic (fabrications)
- Detection: NLI, LLM-as-judge, grounding metrics
- Prevention: better retrieval, strict prompts, guardrails
- Monitor hallucinations in real-time on Ailog
What is a RAG Hallucination?
In a RAG system, a hallucination is information generated by the LLM that is not supported by — or that contradicts — the retrieved documents.
Types of Hallucinations
1. Extrinsic Hallucinations (Fabrications)
Context: "Our company was founded in 2010 in Paris."
Question: "When and where was the company founded?"
Response: "The company was founded in 2010 in Paris by John Smith."
^^^^^^^^^^^^^
Made up - not in context
2. Intrinsic Hallucinations (Contradictions)
Context: "The product costs $99 and is available in blue."
Question: "What is the product price?"
Response: "The product costs $89."
^^^^
Contradicts context ($99)
3. Extrapolation Hallucinations
Context: "Sales increased by 20% in Q1."
Question: "How are sales doing?"
Response: "Sales are excellent and should reach a record this year."
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Unjustified extrapolation
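These three categories map directly onto the labels used by the LLM judge later in this guide (SUPPORTED, HALLUCINATION, CONTRADICTION, EXTRAPOLATION). As a minimal sketch — the enum and mapping below are illustrative, not part of any library — they can be represented like this:

```python
from enum import Enum

class HallucinationType(Enum):
    EXTRINSIC = "extrinsic"          # fabricated, not in the context
    INTRINSIC = "intrinsic"          # contradicts the context
    EXTRAPOLATION = "extrapolation"  # goes beyond the context

# Illustrative mapping from the LLM-judge labels used later in this guide
JUDGE_LABEL_TO_TYPE = {
    "HALLUCINATION": HallucinationType.EXTRINSIC,
    "CONTRADICTION": HallucinationType.INTRINSIC,
    "EXTRAPOLATION": HallucinationType.EXTRAPOLATION,
    # "SUPPORTED" claims are grounded and map to no hallucination type
}
```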
Detection via Natural Language Inference (NLI)
The NLI approach checks whether the context entails (logically supports) the response.
```python
from transformers import pipeline

nli_classifier = pipeline(
    "text-classification",
    model="facebook/bart-large-mnli"
)

def check_entailment(context: str, claim: str) -> dict:
    """
    Checks if the context entails the claim.
    Labels: entailment, contradiction, neutral
    """
    # Format for NLI: premise and hypothesis separated by </s></s>
    input_text = f"{context}</s></s>{claim}"
    result = nli_classifier(input_text)

    label = result[0]['label']
    score = result[0]['score']

    return {
        "label": label,
        "confidence": score,
        "is_grounded": label == "entailment",
        "is_contradiction": label == "contradiction"
    }

# Example
context = "Delivery takes 3 to 5 business days."
claim = "Delivery takes a week."
result = check_entailment(context, claim)
# {"label": "contradiction", "confidence": 0.92, ...}
```
Decomposition into Claims
For more precise detection, decompose the response into atomic claims and verify each one separately:
```python
def extract_claims(response: str, llm_client) -> list:
    """
    Extracts atomic claims from a response.
    """
    prompt = f"""Extract all factual claims from this text.
Each claim should be a single, verifiable statement.

Text: {response}

Output as a numbered list:
1. [First claim]
2. [Second claim]
..."""

    result = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )

    claims_text = result.choices[0].message.content
    claims = [
        line.split('. ', 1)[1]
        for line in claims_text.strip().split('\n')
        if '. ' in line
    ]
    return claims


def check_all_claims(context: str, response: str, llm_client) -> dict:
    """
    Verifies each claim in the response against the context.
    """
    claims = extract_claims(response, llm_client)

    results = []
    for claim in claims:
        check = check_entailment(context, claim)
        results.append({
            "claim": claim,
            **check
        })

    hallucinated = [r for r in results if not r["is_grounded"]]
    contradictions = [r for r in results if r["is_contradiction"]]

    return {
        "total_claims": len(claims),
        "grounded_claims": len(claims) - len(hallucinated),
        "hallucinations": hallucinated,
        "contradictions": contradictions,
        "hallucination_rate": len(hallucinated) / len(claims) if claims else 0
    }
```
Detection via LLM-as-Judge
Use an LLM to evaluate grounding:
```python
def llm_judge_hallucination(
    context: str,
    question: str,
    response: str,
    llm_client
) -> dict:
    """
    Uses an LLM as a judge to detect hallucinations.
    """
    prompt = f"""You are a fact-checking expert. Analyze if the response contains hallucinations.

Context (source of truth):
{context}

Question: {question}

Response to check:
{response}

For each piece of information in the response, classify as:
- SUPPORTED: Directly stated or clearly implied by context
- HALLUCINATION: Not in context (made up)
- CONTRADICTION: Conflicts with context
- EXTRAPOLATION: Goes beyond what context states

Output format:
VERDICT: [CLEAN / HAS_HALLUCINATIONS / HAS_CONTRADICTIONS]
ANALYSIS:
- [Quote from response]: [SUPPORTED/HALLUCINATION/etc] - [reason]
SUMMARY: Brief explanation"""

    result = llm_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )

    analysis = result.choices[0].message.content
    has_issues = "HAS_HALLUCINATIONS" in analysis or "HAS_CONTRADICTIONS" in analysis

    return {
        "has_hallucinations": has_issues,
        "analysis": analysis,
        "should_regenerate": has_issues
    }
```
Detection Metrics
ROUGE-L for Overlap
```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)

def check_overlap(context: str, response: str, threshold: float = 0.3) -> dict:
    """
    Checks textual overlap between context and response.
    A very low score may indicate hallucinations.
    """
    scores = scorer.score(context, response)
    rouge_l = scores['rougeL'].fmeasure

    return {
        "rouge_l": rouge_l,
        "potential_hallucination": rouge_l < threshold,
        "interpretation": (
            "High overlap - likely grounded" if rouge_l > 0.5
            else "Low overlap - potential hallucinations" if rouge_l < threshold
            else "Moderate overlap - review recommended"
        )
    }
```
BERTScore for Semantic Similarity
```python
from bert_score import score as bert_score

def semantic_similarity_check(
    context: str,
    response: str,
    threshold: float = 0.7
) -> dict:
    """
    Checks semantic similarity between context and response.
    """
    P, R, F1 = bert_score(
        [response],
        [context],
        lang="en",
        rescale_with_baseline=True
    )

    f1 = F1[0].item()

    return {
        "bert_score": f1,
        "potential_hallucination": f1 < threshold,
        "precision": P[0].item(),
        "recall": R[0].item()
    }
```
SelfCheckGPT
SelfCheckGPT samples multiple responses and compares them: claims that do not reappear consistently across samples are likely hallucinations.
```python
def selfcheck_hallucination(
    question: str,
    context: str,
    llm_client,
    num_samples: int = 5
) -> dict:
    """
    Generates multiple responses and checks their consistency.
    Hallucinations are inconsistent across samples.
    """
    # Generate multiple responses
    responses = []
    for _ in range(num_samples):
        result = llm_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": f"Answer based on: {context}"},
                {"role": "user", "content": question}
            ],
            temperature=0.7  # Variation to surface inconsistency
        )
        responses.append(result.choices[0].message.content)

    # Extract claims from the first response
    main_claims = extract_claims(responses[0], llm_client)

    # Check each claim against the other responses
    claim_consistency = []
    for claim in main_claims:
        present_count = 0
        for other_response in responses[1:]:
            if is_claim_present(claim, other_response, llm_client):
                present_count += 1

        consistency = present_count / (num_samples - 1)
        claim_consistency.append({
            "claim": claim,
            "consistency": consistency,
            "likely_hallucination": consistency < 0.5
        })

    # Inconsistent claims are likely hallucinations
    hallucinations = [c for c in claim_consistency if c["likely_hallucination"]]

    return {
        "claims_checked": len(main_claims),
        "consistent_claims": len(main_claims) - len(hallucinations),
        "potential_hallucinations": hallucinations,
        "overall_reliability": (
            1 - (len(hallucinations) / len(main_claims)) if main_claims else 1
        )
    }


def is_claim_present(claim: str, text: str, llm_client) -> bool:
    """
    Checks if a claim is present in or implied by a text.
    """
    prompt = f"""Does this text contain or imply this claim?

Claim: {claim}
Text: {text}

Answer only YES or NO."""

    result = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=3,
        temperature=0
    )

    return "YES" in result.choices[0].message.content.upper()
```
Hallucination Prevention
1. Improve Retrieval
```python
def enhanced_retrieval(query: str, retriever, threshold: float = 0.7) -> dict:
    """
    Retrieval with a confidence threshold.
    Better to return nothing than to return noise.
    """
    results = retriever.retrieve(query, k=10)

    # Filter by score
    confident_results = [
        r for r in results
        if r['score'] > threshold
    ]

    if not confident_results:
        return {
            "docs": [],
            "confidence": "low",
            "should_fallback": True
        }

    return {
        "docs": confident_results,
        "confidence": "high"
    }
```
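When `should_fallback` is set, the safest behavior is to refuse explicitly rather than let the LLM generate from thin context. A minimal sketch of the calling side, assuming each retrieved document exposes a `content` field as in the citation example below (the refusal wording is an assumption):

```python
retrieval = enhanced_retrieval("What is your refund policy?", retriever)

if retrieval.get("should_fallback"):
    # No document passed the confidence threshold: refuse rather than guess
    answer = "I don't have this information in my sources."  # assumed wording
else:
    # Build the context that will be passed to the strict prompt (next section)
    context = "\n\n".join(doc["content"] for doc in retrieval["docs"])
```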
2. Strict Prompts
```python
ANTI_HALLUCINATION_PROMPT = """You are a precise assistant that ONLY uses information from the provided context.

STRICT RULES:
1. ONLY state facts that are EXPLICITLY written in the context
2. If the context doesn't contain the answer, say "I don't have this information in my sources"
3. NEVER add information from your general knowledge
4. NEVER extrapolate or make assumptions
5. When uncertain, express uncertainty

Context:
{context}

Question: {question}

Answer based ONLY on the context above:"""
```
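A minimal usage sketch, assuming an OpenAI-style client as in the earlier examples (the `answer_strictly` helper is an assumption, not part of the original): fill the template, generate at temperature 0, and treat the refusal phrase as a useful signal rather than an error.

```python
def answer_strictly(context: str, question: str, llm_client) -> dict:
    """Generates an answer with the strict prompt and flags explicit refusals."""
    prompt = ANTI_HALLUCINATION_PROMPT.format(context=context, question=question)

    result = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0  # deterministic, no creative filling-in
    )
    answer = result.choices[0].message.content

    return {
        "answer": answer,
        # A refusal is the desired behavior when the context lacks the answer
        "refused": "don't have this information" in answer.lower()
    }
```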
3. Source Citations
```python
import re

def generate_with_citations(
    question: str,
    docs: list,
    llm_client
) -> dict:
    """
    Forces the LLM to cite sources, reducing hallucinations.
    """
    numbered_docs = "\n\n".join([
        f"[{i+1}] {doc['content']}"
        for i, doc in enumerate(docs)
    ])

    prompt = f"""Answer the question using ONLY the numbered sources below.
For each fact, add a citation like [1] or [2].
If a fact isn't in any source, don't mention it.

Sources:
{numbered_docs}

Question: {question}

Answer with citations:"""

    result = llm_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )

    response = result.choices[0].message.content

    # Verify that citations exist and point to real sources
    citations = re.findall(r'\[(\d+)\]', response)
    valid_citations = [c for c in citations if int(c) <= len(docs)]

    return {
        "response": response,
        "citations_found": len(set(citations)),
        "all_citations_valid": len(valid_citations) == len(citations)
    }
```
Complete Detection Pipeline
```python
class HallucinationDetector:
    def __init__(self, llm_client, nli_model=None):
        self.llm = llm_client
        self.nli = nli_model

    def analyze(
        self,
        context: str,
        question: str,
        response: str
    ) -> dict:
        """
        Complete hallucination analysis.
        """
        results = {
            "response": response,
            "checks": {}
        }

        # 1. NLI check (fast)
        if self.nli:
            claims = extract_claims(response, self.llm)
            nli_results = []
            for claim in claims:
                check = check_entailment(context, claim)
                nli_results.append(check)

            results["checks"]["nli"] = {
                "claims_count": len(claims),
                "grounded": sum(1 for r in nli_results if r["is_grounded"]),
                "hallucinations": sum(1 for r in nli_results if not r["is_grounded"])
            }

        # 2. LLM judge (precise but slow)
        judge_result = llm_judge_hallucination(
            context, question, response, self.llm
        )
        results["checks"]["llm_judge"] = judge_result

        # 3. Semantic overlap
        overlap = check_overlap(context, response)
        results["checks"]["overlap"] = overlap

        # 4. Final verdict: combine the available signals
        hallucination_signals = 0
        total_signals = 0

        if "nli" in results["checks"]:
            if results["checks"]["nli"]["hallucinations"] > 0:
                hallucination_signals += 1
            total_signals += 1

        if results["checks"]["llm_judge"]["has_hallucinations"]:
            hallucination_signals += 1
        total_signals += 1

        if results["checks"]["overlap"]["potential_hallucination"]:
            hallucination_signals += 1
        total_signals += 1

        results["verdict"] = {
            "has_hallucinations": hallucination_signals >= 2,
            "confidence": hallucination_signals / total_signals,
            "recommendation": (
                "REJECT" if hallucination_signals >= 2
                else "REVIEW" if hallucination_signals == 1
                else "ACCEPT"
            )
        }

        return results


# Usage
detector = HallucinationDetector(llm_client=openai_client)

analysis = detector.analyze(
    context="Our product costs $99 and ships in 3-5 days.",
    question="What is the price?",
    response="The premium product costs $99 with free express shipping."
)

if analysis["verdict"]["recommendation"] == "REJECT":
    # Regenerate the response
    pass
```
Detection Benchmarks
| Method | Precision | Recall | Latency | Cost |
|---|---|---|---|---|
| ROUGE-L | 60% | 75% | 5ms | Free |
| NLI | 78% | 82% | 50ms | Free |
| BERTScore | 72% | 70% | 100ms | Free |
| GPT-4o Judge | 92% | 88% | 500ms | $$$ |
| SelfCheckGPT | 85% | 80% | 2s | $$ |
| Ensemble | 94% | 90% | 600ms | $$ |
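One way to approach the ensemble row without paying for a GPT-4o judge on every request is a cascade: run the cheap checks first and escalate only ambiguous cases. A minimal sketch reusing the functions defined above; the thresholds are illustrative assumptions, not benchmarked values.

```python
def cascade_detection(context: str, question: str, response: str, llm_client) -> dict:
    """Cheap checks first; escalate to the expensive LLM judge only when needed."""
    # Tier 1: ROUGE-L overlap (~5ms, free)
    overlap = check_overlap(context, response)
    if overlap["rouge_l"] > 0.5:
        return {"verdict": "ACCEPT", "tier": "rouge"}

    # Tier 2: NLI over extracted claims (NLI is free; claim extraction uses a small LLM)
    nli = check_all_claims(context, response, llm_client)
    if nli["hallucination_rate"] == 0:
        return {"verdict": "ACCEPT", "tier": "nli"}
    if nli["hallucination_rate"] > 0.5:
        return {"verdict": "REJECT", "tier": "nli", "details": nli}

    # Tier 3: LLM judge for the ambiguous middle (~500ms, paid)
    judge = llm_judge_hallucination(context, question, response, llm_client)
    return {
        "verdict": "REJECT" if judge["has_hallucinations"] else "ACCEPT",
        "tier": "llm_judge",
        "details": judge
    }
```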
Related Guides
Evaluation and Quality:
- RAG Evaluation - Complete metrics
- RAG Guardrails - Production security
- RAG Monitoring - Continuous supervision
Retrieval:
- Retrieval Strategies - Improve retrieval
- Reranking - Better results
Are your users encountering hallucinations? Let's analyze your pipeline together →