Human Evaluation: Methodology and Tools
Complement automated evaluation with human expertise. Annotation protocols, inter-annotator agreement, and labeling tools for RAG systems.
Automated evaluation with RAGAS and LLM-as-judge metrics offers reproducibility and speed, but it lacks the nuance of human judgment. This guide presents human evaluation methodologies to complement your RAG quality pipeline.
Why Human Evaluation Remains Essential
The Limits of Automation
| Aspect | Auto Evaluation | Human Evaluation |
|---|---|---|
| Cultural nuances | Difficult | Excellent |
| Tone and style | Approximate | Precise |
| Ambiguous cases | Often fails | Contextual judgment |
| Actual satisfaction | Indirect proxy | Direct measure |
| Bug discovery | Known patterns | New problems |
When to Prioritize Human Evaluation?
- Product launch: Validation before production
- New domains: No automatic ground truth
- Sensitive cases: Medical, legal, financial
- Targeted debugging: Understanding why a response fails
- Metric calibration: Verify auto scores match real judgment
Evaluation Protocols
1. Criteria-Based Evaluation (Likert Scale)
The most common approach. Each response is rated on multiple dimensions.
```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Rating(Enum):
    TERRIBLE = 1
    POOR = 2
    ACCEPTABLE = 3
    GOOD = 4
    EXCELLENT = 5

@dataclass
class EvaluationCriteria:
    relevance: Rating       # Does the answer address the question?
    accuracy: Rating        # Is the information correct?
    completeness: Rating    # Is important information missing?
    clarity: Rating         # Is the answer clear and well-structured?
    helpfulness: Rating     # Does the answer help the user?
    comment: Optional[str]  # Free-form comment

@dataclass
class AnnotationTask:
    task_id: str
    question: str
    rag_answer: str
    contexts: list[str]
    ground_truth: Optional[str]
    annotator_id: str
    evaluation: Optional[EvaluationCriteria] = None
```
Standard Evaluation Grid:
| Score | Relevance | Accuracy | Completeness |
|---|---|---|---|
| 5 | Answers perfectly | 100% correct | Everything covered |
| 4 | Answers well | Nearly correct | Essentials covered |
| 3 | Partially answers | Minor errors | Minor gaps |
| 2 | Partially off-topic | Significant errors | Major gaps |
| 1 | Does not answer | False | Nothing useful |
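Once the grid is filled in, the individual Likert scores need to be rolled up into per-dimension averages for reporting. A minimal, self-contained sketch (the dictionary keys are illustrative, not a fixed schema):

```python
from statistics import mean

def aggregate_scores(annotations: list[dict[str, int]]) -> dict[str, float]:
    """Average each Likert dimension across a batch of annotations.

    Assumes every annotation uses the same keys, e.g.
    {"relevance": 4, "accuracy": 5, "completeness": 3}.
    """
    dims = annotations[0].keys()
    return {d: round(mean(a[d] for a in annotations), 2) for d in dims}
```

Reporting per-dimension means (rather than one blended score) keeps the grid's diagnostic value: a system can score 4.5 on relevance while sitting at 2.8 on completeness.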
2. Pairwise Comparison (A/B)
More reliable than absolute scoring for subtle differences. The annotator chooses the better response between two versions.
```python
@dataclass
class PairwiseTask:
    task_id: str
    question: str
    answer_a: str
    answer_b: str
    contexts: list[str]
    annotator_id: str
    preference: Optional[str] = None  # "A", "B", or "equal"
    confidence: Optional[int] = None  # 1-5
    reason: Optional[str] = None

def calculate_win_rate(annotations: list[PairwiseTask]) -> dict:
    """Calculate win rate between two models"""
    wins_a = sum(1 for a in annotations if a.preference == "A")
    wins_b = sum(1 for a in annotations if a.preference == "B")
    ties = sum(1 for a in annotations if a.preference == "equal")
    total = len(annotations)
    return {
        "model_a_wins": wins_a / total,
        "model_b_wins": wins_b / total,
        "ties": ties / total,
    }
```
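A raw win rate on a small sample can be noise. A quick sanity check is a two-sided sign test on the non-tied preferences; here is a sketch using the normal approximation to the binomial (the function name is illustrative):

```python
import math

def sign_test(wins_a: int, wins_b: int) -> float:
    """Two-sided sign test on non-tied preferences (normal approximation).

    Under the null hypothesis of no real preference, wins for A
    follow Binomial(n, 0.5), where n excludes ties.
    """
    n = wins_a + wins_b  # ties are excluded
    if n == 0:
        return 1.0
    z = (wins_a - n / 2) / math.sqrt(n / 4)
    # Two-sided p-value from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
```

With 70 wins for A out of 100 decided comparisons the p-value is far below 0.05; with 52 out of 100 it is not, so collect more annotations before declaring a winner.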
3. Error-Based Evaluation (Error Taxonomy)
Identifies specific error types for targeted debugging.
```python
class ErrorType(Enum):
    HALLUCINATION = "hallucination"
    FACTUAL_ERROR = "factual_error"
    INCOMPLETE = "incomplete"
    OFF_TOPIC = "off_topic"
    WRONG_CONTEXT = "wrong_context"
    OUTDATED = "outdated"
    FORMATTING = "formatting"
    TONE = "tone"

@dataclass
class ErrorAnnotation:
    task_id: str
    errors: list[ErrorType]
    error_details: dict[ErrorType, str]
    severity: int  # 1-5
    fixable: bool

def analyze_error_distribution(annotations: list[ErrorAnnotation]) -> dict:
    """Analyze error distribution"""
    error_counts = {}
    for annotation in annotations:
        for error in annotation.errors:
            error_counts[error.value] = error_counts.get(error.value, 0) + 1
    total_errors = sum(error_counts.values())
    return {
        error: count / total_errors
        for error, count in sorted(
            error_counts.items(), key=lambda x: x[1], reverse=True
        )
    }
```
Inter-Annotator Agreement
Why Measure It?
If two annotators disagree on 50% of samples, your annotations are unreliable. Inter-annotator agreement (IAA) measures consistency.
Agreement Metrics
```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def calculate_iaa(annotations_1: list[int], annotations_2: list[int]) -> dict:
    """Calculate multiple inter-annotator agreement metrics"""
    # Exact agreement percentage
    exact_agreement = sum(
        1 for a, b in zip(annotations_1, annotations_2) if a == b
    ) / len(annotations_1)

    # Cohen's Kappa (chance-corrected)
    kappa = cohen_kappa_score(annotations_1, annotations_2)

    # Agreement within 1 point
    close_agreement = sum(
        1 for a, b in zip(annotations_1, annotations_2) if abs(a - b) <= 1
    ) / len(annotations_1)

    # Pearson correlation
    correlation = np.corrcoef(annotations_1, annotations_2)[0, 1]

    return {
        "exact_agreement": exact_agreement,
        "close_agreement": close_agreement,
        "cohens_kappa": kappa,
        "pearson_correlation": correlation,
    }
```
Score Interpretation
| Cohen's Kappa | Interpretation | Action |
|---|---|---|
| < 0.20 | Slight agreement | Review guidelines |
| 0.20-0.40 | Fair agreement | Clarify criteria |
| 0.40-0.60 | Moderate agreement | Acceptable to start |
| 0.60-0.80 | Substantial agreement | Good level |
| > 0.80 | Almost perfect | Excellent |
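Cohen's Kappa only compares two annotators. With three or more, Fleiss' Kappa is the usual generalization; a minimal pure-Python sketch, assuming the common input convention of item-by-category rating counts:

```python
def fleiss_kappa(ratings: list[list[int]]) -> float:
    """Fleiss' Kappa for three or more annotators.

    ratings[i][j] = number of annotators who put item i in category j;
    every item must receive the same total number of ratings.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])

    # Overall proportion of assignments going to each category
    p_j = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed pairwise agreement on each item
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_i) / n_items     # mean observed agreement
    p_e = sum(p * p for p in p_j)  # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)
```

The interpretation bands in the table above apply to Fleiss' Kappa as well; a value near 0 or below means your annotators agree no better than chance.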
Improving Agreement
```python
class AnnotationGuidelines:
    """Structure for annotation guidelines"""

    def __init__(self):
        self.examples = {}
        self.edge_cases = []
        self.calibration_set = []

    def add_example(self, score: int, question: str, answer: str,
                    explanation: str):
        """Add reference example for each score"""
        if score not in self.examples:
            self.examples[score] = []
        self.examples[score].append({
            "question": question,
            "answer": answer,
            "explanation": explanation,
        })

    def run_calibration(self, annotators: list[str],
                        samples: list[dict]) -> dict:
        """Calibration session: all annotate the same samples"""
        results = {}
        for annotator in annotators:
            results[annotator] = self._get_annotations(annotator, samples)

        # Identify disagreements (score spread greater than 1 point)
        disagreements = []
        for i, sample in enumerate(samples):
            scores = [results[a][i] for a in annotators]
            if max(scores) - min(scores) > 1:
                disagreements.append({
                    "sample_idx": i,
                    "sample": sample,
                    "scores": dict(zip(annotators, scores)),
                })
        return {"disagreements": disagreements}

    def _get_annotations(self, annotator: str,
                         samples: list[dict]) -> list[int]:
        """Fetch this annotator's scores from your annotation platform."""
        raise NotImplementedError
```
Annotation Platform
Simple Web Interface
```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import uuid

app = FastAPI()

class AnnotationSubmission(BaseModel):
    task_id: str
    annotator_id: str
    relevance: int
    accuracy: int
    completeness: int
    clarity: int
    comment: str = ""

tasks_db = {}
annotations_db = {}

@app.get("/task/{annotator_id}")
async def get_next_task(annotator_id: str):
    """Returns the next task to annotate"""
    for task_id, task in tasks_db.items():
        if not any(
            a["annotator_id"] == annotator_id
            for a in annotations_db.get(task_id, [])
        ):
            return task
    raise HTTPException(404, "No tasks available")

@app.post("/annotate")
async def submit_annotation(submission: AnnotationSubmission):
    """Submit an annotation"""
    if submission.task_id not in tasks_db:
        raise HTTPException(404, "Task not found")

    annotation = {
        "id": str(uuid.uuid4()),
        "task_id": submission.task_id,
        "annotator_id": submission.annotator_id,
        "scores": {
            "relevance": submission.relevance,
            "accuracy": submission.accuracy,
            "completeness": submission.completeness,
            "clarity": submission.clarity,
        },
        "comment": submission.comment,
    }
    if submission.task_id not in annotations_db:
        annotations_db[submission.task_id] = []
    annotations_db[submission.task_id].append(annotation)
    return {"status": "success", "annotation_id": annotation["id"]}
```
Existing Tools
| Tool | Type | Price | Strengths |
|---|---|---|---|
| Label Studio | Open-source | Free | Flexible, self-hosted |
| Argilla | Open-source | Free | NLP/RAG specialized |
| Prodigy | Commercial | $390 | Excellent UX, fast |
| Scale AI | Service | Variable | Annotators included |
Smart Sampling
Sampling Strategy
```python
import random
from collections import defaultdict

class SmartSampler:
    def __init__(self, all_samples: list[dict]):
        self.samples = all_samples

    def stratified_sample(self, n: int, strata_key: str) -> list[dict]:
        """Stratified sampling by category"""
        strata = defaultdict(list)
        for sample in self.samples:
            strata[sample.get(strata_key, "unknown")].append(sample)

        samples_per_stratum = n // len(strata)
        selected = []
        for stratum_samples in strata.values():
            selected.extend(random.sample(
                stratum_samples,
                min(samples_per_stratum, len(stratum_samples))
            ))
        return selected[:n]

    def uncertainty_sample(self, n: int, ragas_scores: dict) -> list[dict]:
        """Sample cases with uncertain RAGAS scores"""
        uncertain = [
            (i, sample) for i, sample in enumerate(self.samples)
            if 0.4 < ragas_scores.get(i, {}).get("faithfulness", 0) < 0.7
        ]
        return [sample for _, sample in uncertain[:n]]
```
Recommended Sample Sizes
| Goal | Minimum Size | Annotators | Estimated Time |
|---|---|---|---|
| Quick validation | 50 | 1 | 2h |
| Model calibration | 100 | 2 | 6h |
| Serious benchmark | 300 | 3 | 24h |
| Production critical | 500+ | 3+ | 40h+ |
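The times in the table work out to roughly one to two minutes per annotation. A small helper makes that arithmetic explicit so you can recalibrate it after a timed pilot batch (the default rate is an assumption, not a benchmark):

```python
def estimate_annotation_effort(n_samples: int, n_annotators: int,
                               seconds_per_sample: int = 90) -> dict:
    """Rough effort estimate for planning an annotation campaign.

    The 90 s/sample default is a guess; replace it with the median
    time measured on a timed pilot batch.
    """
    total = n_samples * n_annotators
    return {
        "total_annotations": total,
        "person_hours": round(total * seconds_per_sample / 3600, 1),
    }
```

For example, `estimate_annotation_effort(300, 3)` gives 900 annotations and 22.5 person-hours, in the same ballpark as the benchmark row above.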
Integration with Automated Evaluation
Hybrid Pipeline
```python
class HybridEvaluationPipeline:
    def __init__(self, ragas_evaluator, human_platform):
        self.ragas = ragas_evaluator
        self.human = human_platform

    async def evaluate(self, samples: list[dict]) -> dict:
        # Step 1: Automated evaluation
        auto_results = await self.ragas.evaluate(samples)

        # Step 2: Identify samples for human validation
        uncertain_indices = [
            i for i, score in enumerate(auto_results["per_sample"])
            if 0.4 < score["faithfulness"] < 0.7
        ]

        # Step 3: Create annotation tasks
        human_tasks = [samples[i] for i in uncertain_indices]
        await self.human.create_tasks(human_tasks)

        # Step 4: Wait for annotations
        human_results = await self.human.wait_for_completion()

        # Step 5: Combine results
        return self._merge_results(auto_results, human_results,
                                   uncertain_indices)

    def _merge_results(self, auto, human, human_indices):
        """Combine auto and human scores"""
        final_scores = auto["per_sample"].copy()
        for i, idx in enumerate(human_indices):
            human_score = human[i]["average_score"]
            auto_score = final_scores[idx]["faithfulness"]
            final_scores[idx]["final_score"] = (
                0.6 * human_score + 0.4 * auto_score
            )
            final_scores[idx]["human_validated"] = True
        return final_scores
```
Human Evaluation Checklist
| Step | Action | Done |
|---|---|---|
| Guidelines | Write with examples for each score | [ ] |
| Calibration | Initial session with all annotators | [ ] |
| Pilot | 20 samples to verify agreement | [ ] |
| Production | Launch complete annotation | [ ] |
| Quality | Calculate IAA regularly | [ ] |
| Feedback | Incorporate feedback into guidelines | [ ] |
Going Further
- RAGAS Framework - Automated evaluation
- RAG Metrics - Metrics overview
- RAG Generation - Improve responses
Simplified Human Evaluation with Ailog
Setting up a human evaluation pipeline requires infrastructure and coordination. With Ailog, benefit from integrated tools:
- Intuitive annotation interface
- Real-time progress dashboard
- Automatic IAA calculation
- Smart sampling of critical cases
- Reports combining auto and human scores
Try for free and validate your RAG quality with human expertise.