Hallucination Detection in RAG Systems
Hallucinations are RAG's Achilles heel. Learn how to detect, measure, and prevent them with proven techniques.
TL;DR
- Hallucination = response not supported by the provided context
- 2 types: intrinsic (contradictions) and extrinsic (fabrications)
- Detection: NLI, LLM-as-judge, grounding metrics
- Prevention: better retrieval, strict prompts, guardrails
- Monitor hallucinations in real-time on Ailog
What is a RAG Hallucination?
In a RAG system, a hallucination is information generated by the LLM that is not supported by — or that contradicts — the retrieved documents.
Types of Hallucinations
1. Extrinsic Hallucinations (Fabrications)
Context: "Our company was founded in 2010 in Paris."
Question: "When and where was the company founded?"
Response: "The company was founded in 2010 in Paris by John Smith."
^^^^^^^^^^^^^
Made up - not in context
2. Intrinsic Hallucinations (Contradictions)
Context: "The product costs $99 and is available in blue."
Question: "What is the product price?"
Response: "The product costs $89."
^^^^
Contradicts context ($99)
3. Extrapolation Hallucinations
Context: "Sales increased by 20% in Q1."
Question: "How are sales doing?"
Response: "Sales are excellent and should reach a record this year."
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Unjustified extrapolation
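These three categories map directly onto the labels used by the LLM judge later in this guide (SUPPORTED, HALLUCINATION, CONTRADICTION, EXTRAPOLATION). As a minimal sketch — the enum and mapping below are illustrative, not part of any library — they can be represented like this:

```python
from enum import Enum

class HallucinationType(Enum):
    EXTRINSIC = "extrinsic"          # fabricated, not in the context
    INTRINSIC = "intrinsic"          # contradicts the context
    EXTRAPOLATION = "extrapolation"  # goes beyond the context

# Illustrative mapping from the LLM-judge labels used later in this guide
JUDGE_LABEL_TO_TYPE = {
    "HALLUCINATION": HallucinationType.EXTRINSIC,
    "CONTRADICTION": HallucinationType.INTRINSIC,
    "EXTRAPOLATION": HallucinationType.EXTRAPOLATION,
    # "SUPPORTED" claims are grounded and map to no hallucination type
}
```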
Detection via Natural Language Inference (NLI)
The NLI approach checks whether the context entails (logically supports) the response.
```python
from transformers import pipeline

nli_classifier = pipeline(
    "text-classification",
    model="facebook/bart-large-mnli"
)

def check_entailment(context: str, claim: str) -> dict:
    """
    Checks if the context entails the claim.
    Labels: entailment, contradiction, neutral
    """
    # Format for NLI: premise and hypothesis separated by </s></s>
    input_text = f"{context}</s></s>{claim}"
    result = nli_classifier(input_text)

    label = result[0]['label']
    score = result[0]['score']

    return {
        "label": label,
        "confidence": score,
        "is_grounded": label == "entailment",
        "is_contradiction": label == "contradiction"
    }

# Example
context = "Delivery takes 3 to 5 business days."
claim = "Delivery takes a week."
result = check_entailment(context, claim)
# {"label": "contradiction", "confidence": 0.92, ...}
```
Decomposition into Claims
For more precise detection, decompose the response into atomic claims and verify each one separately:
```python
def extract_claims(response: str, llm_client) -> list:
    """
    Extracts atomic claims from a response.
    """
    prompt = f"""Extract all factual claims from this text.
Each claim should be a single, verifiable statement.

Text: {response}

Output as a numbered list:
1. [First claim]
2. [Second claim]
..."""

    result = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )

    claims_text = result.choices[0].message.content
    claims = [
        line.split('. ', 1)[1]
        for line in claims_text.strip().split('\n')
        if '. ' in line
    ]
    return claims


def check_all_claims(context: str, response: str, llm_client) -> dict:
    """
    Verifies each claim in the response against the context.
    """
    claims = extract_claims(response, llm_client)

    results = []
    for claim in claims:
        check = check_entailment(context, claim)
        results.append({
            "claim": claim,
            **check
        })

    hallucinated = [r for r in results if not r["is_grounded"]]
    contradictions = [r for r in results if r["is_contradiction"]]

    return {
        "total_claims": len(claims),
        "grounded_claims": len(claims) - len(hallucinated),
        "hallucinations": hallucinated,
        "contradictions": contradictions,
        "hallucination_rate": len(hallucinated) / len(claims) if claims else 0
    }
```
Detection via LLM-as-Judge
Use an LLM to evaluate grounding:
```python
def llm_judge_hallucination(
    context: str,
    question: str,
    response: str,
    llm_client
) -> dict:
    """
    Uses an LLM as a judge to detect hallucinations.
    """
    prompt = f"""You are a fact-checking expert. Analyze if the response contains hallucinations.

Context (source of truth):
{context}

Question: {question}

Response to check:
{response}

For each piece of information in the response, classify as:
- SUPPORTED: Directly stated or clearly implied by context
- HALLUCINATION: Not in context (made up)
- CONTRADICTION: Conflicts with context
- EXTRAPOLATION: Goes beyond what context states

Output format:
VERDICT: [CLEAN / HAS_HALLUCINATIONS / HAS_CONTRADICTIONS]
ANALYSIS:
- [Quote from response]: [SUPPORTED/HALLUCINATION/etc] - [reason]
SUMMARY: Brief explanation"""

    result = llm_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )

    analysis = result.choices[0].message.content
    has_issues = "HAS_HALLUCINATIONS" in analysis or "HAS_CONTRADICTIONS" in analysis

    return {
        "has_hallucinations": has_issues,
        "analysis": analysis,
        "should_regenerate": has_issues
    }
```
Detection Metrics
ROUGE-L for Overlap
```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)

def check_overlap(context: str, response: str, threshold: float = 0.3) -> dict:
    """
    Checks textual overlap between context and response.
    A very low score may indicate hallucinations.
    """
    scores = scorer.score(context, response)
    rouge_l = scores['rougeL'].fmeasure

    return {
        "rouge_l": rouge_l,
        "potential_hallucination": rouge_l < threshold,
        "interpretation": (
            "High overlap - likely grounded" if rouge_l > 0.5
            else "Low overlap - potential hallucinations" if rouge_l < threshold
            else "Moderate overlap - review recommended"
        )
    }
```
BERTScore for Semantic Similarity
```python
from bert_score import score as bert_score

def semantic_similarity_check(
    context: str,
    response: str,
    threshold: float = 0.7
) -> dict:
    """
    Checks semantic similarity between context and response.
    """
    P, R, F1 = bert_score(
        [response],
        [context],
        lang="en",
        rescale_with_baseline=True
    )

    f1 = F1[0].item()

    return {
        "bert_score": f1,
        "potential_hallucination": f1 < threshold,
        "precision": P[0].item(),
        "recall": R[0].item()
    }
```
SelfCheckGPT
SelfCheckGPT samples multiple responses and compares them: claims that do not reappear consistently across samples are likely hallucinations.
```python
def selfcheck_hallucination(
    question: str,
    context: str,
    llm_client,
    num_samples: int = 5
) -> dict:
    """
    Generates multiple responses and checks their consistency.
    Hallucinations are inconsistent across samples.
    """
    # Generate multiple responses
    responses = []
    for _ in range(num_samples):
        result = llm_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": f"Answer based on: {context}"},
                {"role": "user", "content": question}
            ],
            temperature=0.7  # Variation to surface inconsistency
        )
        responses.append(result.choices[0].message.content)

    # Extract claims from the first response
    main_claims = extract_claims(responses[0], llm_client)

    # Check each claim against the other responses
    claim_consistency = []
    for claim in main_claims:
        present_count = 0
        for other_response in responses[1:]:
            if is_claim_present(claim, other_response, llm_client):
                present_count += 1

        consistency = present_count / (num_samples - 1)
        claim_consistency.append({
            "claim": claim,
            "consistency": consistency,
            "likely_hallucination": consistency < 0.5
        })

    # Inconsistent claims are likely hallucinations
    hallucinations = [c for c in claim_consistency if c["likely_hallucination"]]

    return {
        "claims_checked": len(main_claims),
        "consistent_claims": len(main_claims) - len(hallucinations),
        "potential_hallucinations": hallucinations,
        "overall_reliability": (
            1 - (len(hallucinations) / len(main_claims)) if main_claims else 1
        )
    }


def is_claim_present(claim: str, text: str, llm_client) -> bool:
    """
    Checks if a claim is present in or implied by a text.
    """
    prompt = f"""Does this text contain or imply this claim?

Claim: {claim}
Text: {text}

Answer only YES or NO."""

    result = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=3,
        temperature=0
    )

    return "YES" in result.choices[0].message.content.upper()
```
Hallucination Prevention
1. Improve Retrieval
```python
def enhanced_retrieval(query: str, retriever, threshold: float = 0.7) -> dict:
    """
    Retrieval with a confidence threshold.
    Better to return nothing than to return noise.
    """
    results = retriever.retrieve(query, k=10)

    # Filter by score
    confident_results = [
        r for r in results
        if r['score'] > threshold
    ]

    if not confident_results:
        return {
            "docs": [],
            "confidence": "low",
            "should_fallback": True
        }

    return {
        "docs": confident_results,
        "confidence": "high"
    }
```
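When `should_fallback` is set, the safest behavior is to refuse explicitly rather than let the LLM generate from thin context. A minimal sketch of the calling side, assuming each retrieved document exposes a `content` field as in the citation example below (the refusal wording is an assumption):

```python
retrieval = enhanced_retrieval("What is your refund policy?", retriever)

if retrieval.get("should_fallback"):
    # No document passed the confidence threshold: refuse rather than guess
    answer = "I don't have this information in my sources."  # assumed wording
else:
    # Build the context that will be passed to the strict prompt (next section)
    context = "\n\n".join(doc["content"] for doc in retrieval["docs"])
```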
2. Strict Prompts
```python
ANTI_HALLUCINATION_PROMPT = """You are a precise assistant that ONLY uses information from the provided context.

STRICT RULES:
1. ONLY state facts that are EXPLICITLY written in the context
2. If the context doesn't contain the answer, say "I don't have this information in my sources"
3. NEVER add information from your general knowledge
4. NEVER extrapolate or make assumptions
5. When uncertain, express uncertainty

Context:
{context}

Question: {question}

Answer based ONLY on the context above:"""
```
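A minimal usage sketch, assuming an OpenAI-style client as in the earlier examples (the `answer_strictly` helper is an assumption, not part of the original): fill the template, generate at temperature 0, and treat the refusal phrase as a useful signal rather than an error.

```python
def answer_strictly(context: str, question: str, llm_client) -> dict:
    """Generates an answer with the strict prompt and flags explicit refusals."""
    prompt = ANTI_HALLUCINATION_PROMPT.format(context=context, question=question)

    result = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0  # deterministic, no creative filling-in
    )
    answer = result.choices[0].message.content

    return {
        "answer": answer,
        # A refusal is the desired behavior when the context lacks the answer
        "refused": "don't have this information" in answer.lower()
    }
```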
3. Source Citations
```python
import re

def generate_with_citations(
    question: str,
    docs: list,
    llm_client
) -> dict:
    """
    Forces the LLM to cite sources, reducing hallucinations.
    """
    numbered_docs = "\n\n".join([
        f"[{i+1}] {doc['content']}"
        for i, doc in enumerate(docs)
    ])

    prompt = f"""Answer the question using ONLY the numbered sources below.
For each fact, add a citation like [1] or [2].
If a fact isn't in any source, don't mention it.

Sources:
{numbered_docs}

Question: {question}

Answer with citations:"""

    result = llm_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )

    response = result.choices[0].message.content

    # Verify that citations exist and point to real sources
    citations = re.findall(r'\[(\d+)\]', response)
    valid_citations = [c for c in citations if int(c) <= len(docs)]

    return {
        "response": response,
        "citations_found": len(set(citations)),
        "all_citations_valid": len(valid_citations) == len(citations)
    }
```
Complete Detection Pipeline
```python
class HallucinationDetector:
    def __init__(self, llm_client, nli_model=None):
        self.llm = llm_client
        self.nli = nli_model

    def analyze(
        self,
        context: str,
        question: str,
        response: str
    ) -> dict:
        """
        Complete hallucination analysis.
        """
        results = {
            "response": response,
            "checks": {}
        }

        # 1. NLI check (fast)
        if self.nli:
            claims = extract_claims(response, self.llm)
            nli_results = []
            for claim in claims:
                check = check_entailment(context, claim)
                nli_results.append(check)

            results["checks"]["nli"] = {
                "claims_count": len(claims),
                "grounded": sum(1 for r in nli_results if r["is_grounded"]),
                "hallucinations": sum(1 for r in nli_results if not r["is_grounded"])
            }

        # 2. LLM judge (precise but slow)
        judge_result = llm_judge_hallucination(
            context, question, response, self.llm
        )
        results["checks"]["llm_judge"] = judge_result

        # 3. Semantic overlap
        overlap = check_overlap(context, response)
        results["checks"]["overlap"] = overlap

        # 4. Final verdict: combine the available signals
        hallucination_signals = 0
        total_signals = 0

        if "nli" in results["checks"]:
            if results["checks"]["nli"]["hallucinations"] > 0:
                hallucination_signals += 1
            total_signals += 1

        if results["checks"]["llm_judge"]["has_hallucinations"]:
            hallucination_signals += 1
        total_signals += 1

        if results["checks"]["overlap"]["potential_hallucination"]:
            hallucination_signals += 1
        total_signals += 1

        results["verdict"] = {
            "has_hallucinations": hallucination_signals >= 2,
            "confidence": hallucination_signals / total_signals,
            "recommendation": (
                "REJECT" if hallucination_signals >= 2
                else "REVIEW" if hallucination_signals == 1
                else "ACCEPT"
            )
        }

        return results


# Usage
detector = HallucinationDetector(llm_client=openai_client)

analysis = detector.analyze(
    context="Our product costs $99 and ships in 3-5 days.",
    question="What is the price?",
    response="The premium product costs $99 with free express shipping."
)

if analysis["verdict"]["recommendation"] == "REJECT":
    # Regenerate the response
    pass
```
Detection Benchmarks
| Method | Precision | Recall | Latency | Cost |
|---|---|---|---|---|
| ROUGE-L | 60% | 75% | 5ms | Free |
| NLI | 78% | 82% | 50ms | Free |
| BERTScore | 72% | 70% | 100ms | Free |
| GPT-4o Judge | 92% | 88% | 500ms | $$$ |
| SelfCheckGPT | 85% | 80% | 2s | $$ |
| Ensemble | 94% | 90% | 600ms | $$ |
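One way to approach the ensemble row without paying for a GPT-4o judge on every request is a cascade: run the cheap checks first and escalate only ambiguous cases. A minimal sketch reusing the functions defined above; the thresholds are illustrative assumptions, not benchmarked values.

```python
def cascade_detection(context: str, question: str, response: str, llm_client) -> dict:
    """Cheap checks first; escalate to the expensive LLM judge only when needed."""
    # Tier 1: ROUGE-L overlap (~5ms, free)
    overlap = check_overlap(context, response)
    if overlap["rouge_l"] > 0.5:
        return {"verdict": "ACCEPT", "tier": "rouge"}

    # Tier 2: NLI over extracted claims (NLI is free; claim extraction uses a small LLM)
    nli = check_all_claims(context, response, llm_client)
    if nli["hallucination_rate"] == 0:
        return {"verdict": "ACCEPT", "tier": "nli"}
    if nli["hallucination_rate"] > 0.5:
        return {"verdict": "REJECT", "tier": "nli", "details": nli}

    # Tier 3: LLM judge for the ambiguous middle (~500ms, paid)
    judge = llm_judge_hallucination(context, question, response, llm_client)
    return {
        "verdict": "REJECT" if judge["has_hallucinations"] else "ACCEPT",
        "tier": "llm_judge",
        "details": judge
    }
```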
Related Guides
Evaluation and Quality:
- RAG Evaluation - Complete metrics
- RAG Guardrails - Production security
- RAG Monitoring - Continuous supervision
Retrieval:
- Retrieval Strategies - Improve retrieval
- Reranking - Better results
Are your users encountering hallucinations? Let's analyze your pipeline together →