Human Evaluation: Methodology and Tools
Complement automated evaluation with human expertise. Annotation protocols, inter-annotator agreement, and labeling tools for RAG systems.
Automated evaluation with RAGAS and LLM-as-judge metrics offers reproducibility and speed, but it lacks the nuance of human judgment. This guide presents human evaluation methodologies to complement your RAG quality pipeline.
Why Human Evaluation Remains Essential
The Limits of Automation
| Aspect | Auto Evaluation | Human Evaluation |
|---|---|---|
| Cultural nuances | Difficult | Excellent |
| Tone and style | Approximate | Precise |
| Ambiguous cases | Often fails | Contextual judgment |
| Actual satisfaction | Indirect proxy | Direct measure |
| Bug discovery | Known patterns | New problems |
When to Prioritize Human Evaluation?
- Product launch: Validation before production
- New domains: No automatic ground truth
- Sensitive cases: Medical, legal, financial
- Targeted debugging: Understanding why a response fails
- Metric calibration: Verify auto scores match real judgment
Evaluation Protocols
1. Criteria-Based Evaluation (Likert Scale)
The most common approach. Each response is rated on multiple dimensions.
```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Rating(Enum):
    TERRIBLE = 1
    POOR = 2
    ACCEPTABLE = 3
    GOOD = 4
    EXCELLENT = 5

@dataclass
class EvaluationCriteria:
    relevance: Rating       # Does the answer address the question?
    accuracy: Rating        # Is the information correct?
    completeness: Rating    # Is important information missing?
    clarity: Rating         # Is the answer clear and well-structured?
    helpfulness: Rating     # Does the answer help the user?
    comment: Optional[str]  # Free-form comment

@dataclass
class AnnotationTask:
    task_id: str
    question: str
    rag_answer: str
    contexts: list[str]
    ground_truth: Optional[str]
    annotator_id: str
    evaluation: Optional[EvaluationCriteria] = None
```
Standard Evaluation Grid:
| Score | Relevance | Accuracy | Completeness |
|---|---|---|---|
| 5 | Answers perfectly | 100% correct | Everything covered |
| 4 | Answers well | Nearly correct | Essentials covered |
| 3 | Partially answers | Minor errors | Minor gaps |
| 2 | Partially off-topic | Significant errors | Major gaps |
| 1 | Does not answer | False | Nothing useful |
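Once the grid is filled in, the individual Likert scores need to be rolled up into per-dimension averages for reporting. A minimal, self-contained sketch (the dictionary keys are illustrative, not a fixed schema):

```python
from statistics import mean

def aggregate_scores(annotations: list[dict[str, int]]) -> dict[str, float]:
    """Average each Likert dimension across a batch of annotations.

    Assumes every annotation uses the same keys, e.g.
    {"relevance": 4, "accuracy": 5, "completeness": 3}.
    """
    dims = annotations[0].keys()
    return {d: round(mean(a[d] for a in annotations), 2) for d in dims}
```

Reporting per-dimension means (rather than one blended score) keeps the grid's diagnostic value: a system can score 4.5 on relevance while sitting at 2.8 on completeness.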
2. Pairwise Comparison (A/B)
More reliable than absolute scoring for subtle differences. The annotator chooses the better response between two versions.
```python
@dataclass
class PairwiseTask:
    task_id: str
    question: str
    answer_a: str
    answer_b: str
    contexts: list[str]
    annotator_id: str
    preference: Optional[str] = None  # "A", "B", or "equal"
    confidence: Optional[int] = None  # 1-5
    reason: Optional[str] = None

def calculate_win_rate(annotations: list[PairwiseTask]) -> dict:
    """Calculate win rate between two models"""
    wins_a = sum(1 for a in annotations if a.preference == "A")
    wins_b = sum(1 for a in annotations if a.preference == "B")
    ties = sum(1 for a in annotations if a.preference == "equal")
    total = len(annotations)
    return {
        "model_a_wins": wins_a / total,
        "model_b_wins": wins_b / total,
        "ties": ties / total,
    }
```
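A raw win rate on a small sample can be noise. A quick sanity check is a two-sided sign test on the non-tied preferences; here is a sketch using the normal approximation to the binomial (the function name is illustrative):

```python
import math

def sign_test(wins_a: int, wins_b: int) -> float:
    """Two-sided sign test on non-tied preferences (normal approximation).

    Under the null hypothesis of no real preference, wins for A
    follow Binomial(n, 0.5), where n excludes ties.
    """
    n = wins_a + wins_b  # ties are excluded
    if n == 0:
        return 1.0
    z = (wins_a - n / 2) / math.sqrt(n / 4)
    # Two-sided p-value from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
```

With 70 wins for A out of 100 decided comparisons the p-value is far below 0.05; with 52 out of 100 it is not, so collect more annotations before declaring a winner.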
3. Error-Based Evaluation (Error Taxonomy)
Identifies specific error types for targeted debugging.
```python
class ErrorType(Enum):
    HALLUCINATION = "hallucination"
    FACTUAL_ERROR = "factual_error"
    INCOMPLETE = "incomplete"
    OFF_TOPIC = "off_topic"
    WRONG_CONTEXT = "wrong_context"
    OUTDATED = "outdated"
    FORMATTING = "formatting"
    TONE = "tone"

@dataclass
class ErrorAnnotation:
    task_id: str
    errors: list[ErrorType]
    error_details: dict[ErrorType, str]
    severity: int  # 1-5
    fixable: bool

def analyze_error_distribution(annotations: list[ErrorAnnotation]) -> dict:
    """Analyze error distribution"""
    error_counts = {}
    for annotation in annotations:
        for error in annotation.errors:
            error_counts[error.value] = error_counts.get(error.value, 0) + 1
    total_errors = sum(error_counts.values())
    return {
        error: count / total_errors
        for error, count in sorted(
            error_counts.items(), key=lambda x: x[1], reverse=True
        )
    }
```
Inter-Annotator Agreement
Why Measure It?
If two annotators disagree on 50% of samples, your annotations are unreliable. Inter-annotator agreement (IAA) measures consistency.
Agreement Metrics
```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def calculate_iaa(annotations_1: list[int], annotations_2: list[int]) -> dict:
    """Calculate multiple inter-annotator agreement metrics"""
    # Exact agreement percentage
    exact_agreement = sum(
        1 for a, b in zip(annotations_1, annotations_2) if a == b
    ) / len(annotations_1)

    # Cohen's Kappa (chance-corrected)
    kappa = cohen_kappa_score(annotations_1, annotations_2)

    # Agreement within 1 point
    close_agreement = sum(
        1 for a, b in zip(annotations_1, annotations_2) if abs(a - b) <= 1
    ) / len(annotations_1)

    # Pearson correlation
    correlation = np.corrcoef(annotations_1, annotations_2)[0, 1]

    return {
        "exact_agreement": exact_agreement,
        "close_agreement": close_agreement,
        "cohens_kappa": kappa,
        "pearson_correlation": correlation,
    }
```
Score Interpretation
| Cohen's Kappa | Interpretation | Action |
|---|---|---|
| < 0.20 | Slight agreement | Review guidelines |
| 0.20-0.40 | Fair agreement | Clarify criteria |
| 0.40-0.60 | Moderate agreement | Acceptable to start |
| 0.60-0.80 | Substantial agreement | Good level |
| > 0.80 | Almost perfect | Excellent |
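Cohen's Kappa only compares two annotators. With three or more, Fleiss' Kappa is the usual generalization; a minimal pure-Python sketch, assuming the common input convention of item-by-category rating counts:

```python
def fleiss_kappa(ratings: list[list[int]]) -> float:
    """Fleiss' Kappa for three or more annotators.

    ratings[i][j] = number of annotators who put item i in category j;
    every item must receive the same total number of ratings.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])

    # Overall proportion of assignments going to each category
    p_j = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed pairwise agreement on each item
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_i) / n_items     # mean observed agreement
    p_e = sum(p * p for p in p_j)  # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)
```

The interpretation bands in the table above apply to Fleiss' Kappa as well; a value near 0 or below means your annotators agree no better than chance.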
Improving Agreement
```python
class AnnotationGuidelines:
    """Structure for annotation guidelines"""

    def __init__(self):
        self.examples = {}
        self.edge_cases = []
        self.calibration_set = []

    def add_example(self, score: int, question: str, answer: str,
                    explanation: str):
        """Add reference example for each score"""
        if score not in self.examples:
            self.examples[score] = []
        self.examples[score].append({
            "question": question,
            "answer": answer,
            "explanation": explanation,
        })

    def run_calibration(self, annotators: list[str],
                        samples: list[dict]) -> dict:
        """Calibration session: all annotate the same samples"""
        results = {}
        for annotator in annotators:
            results[annotator] = self._get_annotations(annotator, samples)

        # Identify disagreements (score spread greater than 1 point)
        disagreements = []
        for i, sample in enumerate(samples):
            scores = [results[a][i] for a in annotators]
            if max(scores) - min(scores) > 1:
                disagreements.append({
                    "sample_idx": i,
                    "sample": sample,
                    "scores": dict(zip(annotators, scores)),
                })
        return {"disagreements": disagreements}

    def _get_annotations(self, annotator: str,
                         samples: list[dict]) -> list[int]:
        """Fetch this annotator's scores from your annotation platform."""
        raise NotImplementedError
```
Annotation Platform
Simple Web Interface
```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import uuid

app = FastAPI()

class AnnotationSubmission(BaseModel):
    task_id: str
    annotator_id: str
    relevance: int
    accuracy: int
    completeness: int
    clarity: int
    comment: str = ""

tasks_db = {}
annotations_db = {}

@app.get("/task/{annotator_id}")
async def get_next_task(annotator_id: str):
    """Returns the next task to annotate"""
    for task_id, task in tasks_db.items():
        if not any(
            a["annotator_id"] == annotator_id
            for a in annotations_db.get(task_id, [])
        ):
            return task
    raise HTTPException(404, "No tasks available")

@app.post("/annotate")
async def submit_annotation(submission: AnnotationSubmission):
    """Submit an annotation"""
    if submission.task_id not in tasks_db:
        raise HTTPException(404, "Task not found")

    annotation = {
        "id": str(uuid.uuid4()),
        "task_id": submission.task_id,
        "annotator_id": submission.annotator_id,
        "scores": {
            "relevance": submission.relevance,
            "accuracy": submission.accuracy,
            "completeness": submission.completeness,
            "clarity": submission.clarity,
        },
        "comment": submission.comment,
    }
    if submission.task_id not in annotations_db:
        annotations_db[submission.task_id] = []
    annotations_db[submission.task_id].append(annotation)
    return {"status": "success", "annotation_id": annotation["id"]}
```
Existing Tools
| Tool | Type | Price | Strengths |
|---|---|---|---|
| Label Studio | Open-source | Free | Flexible, self-hosted |
| Argilla | Open-source | Free | NLP/RAG specialized |
| Prodigy | Commercial | $390 | Excellent UX, fast |
| Scale AI | Service | Variable | Annotators included |
Smart Sampling
Sampling Strategy
```python
import random
from collections import defaultdict

class SmartSampler:
    def __init__(self, all_samples: list[dict]):
        self.samples = all_samples

    def stratified_sample(self, n: int, strata_key: str) -> list[dict]:
        """Stratified sampling by category"""
        strata = defaultdict(list)
        for sample in self.samples:
            strata[sample.get(strata_key, "unknown")].append(sample)

        samples_per_stratum = n // len(strata)
        selected = []
        for stratum_samples in strata.values():
            selected.extend(random.sample(
                stratum_samples,
                min(samples_per_stratum, len(stratum_samples))
            ))
        return selected[:n]

    def uncertainty_sample(self, n: int, ragas_scores: dict) -> list[dict]:
        """Sample cases with uncertain RAGAS scores"""
        uncertain = [
            (i, sample) for i, sample in enumerate(self.samples)
            if 0.4 < ragas_scores.get(i, {}).get("faithfulness", 0) < 0.7
        ]
        return [sample for _, sample in uncertain[:n]]
```
Recommended Sample Sizes
| Goal | Minimum Size | Annotators | Estimated Time |
|---|---|---|---|
| Quick validation | 50 | 1 | 2h |
| Model calibration | 100 | 2 | 6h |
| Serious benchmark | 300 | 3 | 24h |
| Production critical | 500+ | 3+ | 40h+ |
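The times in the table work out to roughly one to two minutes per annotation. A small helper makes that arithmetic explicit so you can recalibrate it after a timed pilot batch (the default rate is an assumption, not a benchmark):

```python
def estimate_annotation_effort(n_samples: int, n_annotators: int,
                               seconds_per_sample: int = 90) -> dict:
    """Rough effort estimate for planning an annotation campaign.

    The 90 s/sample default is a guess; replace it with the median
    time measured on a timed pilot batch.
    """
    total = n_samples * n_annotators
    return {
        "total_annotations": total,
        "person_hours": round(total * seconds_per_sample / 3600, 1),
    }
```

For example, `estimate_annotation_effort(300, 3)` gives 900 annotations and 22.5 person-hours, in the same ballpark as the benchmark row above.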
Integration with Automated Evaluation
Hybrid Pipeline
```python
class HybridEvaluationPipeline:
    def __init__(self, ragas_evaluator, human_platform):
        self.ragas = ragas_evaluator
        self.human = human_platform

    async def evaluate(self, samples: list[dict]) -> dict:
        # Step 1: Automated evaluation
        auto_results = await self.ragas.evaluate(samples)

        # Step 2: Identify samples for human validation
        uncertain_indices = [
            i for i, score in enumerate(auto_results["per_sample"])
            if 0.4 < score["faithfulness"] < 0.7
        ]

        # Step 3: Create annotation tasks
        human_tasks = [samples[i] for i in uncertain_indices]
        await self.human.create_tasks(human_tasks)

        # Step 4: Wait for annotations
        human_results = await self.human.wait_for_completion()

        # Step 5: Combine results
        return self._merge_results(auto_results, human_results,
                                   uncertain_indices)

    def _merge_results(self, auto, human, human_indices):
        """Combine auto and human scores"""
        final_scores = auto["per_sample"].copy()
        for i, idx in enumerate(human_indices):
            human_score = human[i]["average_score"]
            auto_score = final_scores[idx]["faithfulness"]
            final_scores[idx]["final_score"] = (
                0.6 * human_score + 0.4 * auto_score
            )
            final_scores[idx]["human_validated"] = True
        return final_scores
```
Human Evaluation Checklist
| Step | Action | Done |
|---|---|---|
| Guidelines | Write with examples for each score | [ ] |
| Calibration | Initial session with all annotators | [ ] |
| Pilot | 20 samples to verify agreement | [ ] |
| Production | Launch complete annotation | [ ] |
| Quality | Calculate IAA regularly | [ ] |
| Feedback | Incorporate feedback into guidelines | [ ] |
Going Further
- RAGAS Framework - Automated evaluation
- RAG Metrics - Metrics overview
- RAG Generation - Improve responses
Simplified Human Evaluation with Ailog
Setting up a human evaluation pipeline requires infrastructure and coordination. With Ailog, benefit from integrated tools:
- Intuitive annotation interface
- Real-time progress dashboard
- Automatic IAA calculation
- Smart sampling of critical cases
- Reports combining auto and human scores
Try for free and validate your RAG quality with human expertise.