
Human Evaluation: Methodology and Tools

March 31, 2026
Ailog Team

Complement automated evaluation with human expertise. Annotation protocols, inter-annotator agreement, and labeling tools for RAG systems.


Automated evaluation with RAGAS and LLM-as-judge metrics offers reproducibility and speed, but it lacks the nuance of human judgment. This guide presents human evaluation methodologies to complement your RAG quality pipeline.

Why Human Evaluation Remains Essential

The Limits of Automation

| Aspect | Auto Evaluation | Human Evaluation |
|---|---|---|
| Cultural nuances | Difficult | Excellent |
| Tone and style | Approximate | Precise |
| Ambiguous cases | Often fails | Contextual judgment |
| Actual satisfaction | Indirect proxy | Direct measure |
| Bug discovery | Known patterns | New problems |

When to Prioritize Human Evaluation?

  • Product launch: Validation before production
  • New domains: No automatic ground truth
  • Sensitive cases: Medical, legal, financial
  • Targeted debugging: Understanding why a response fails
  • Metric calibration: Verify auto scores match real judgment

Evaluation Protocols

1. Criteria-Based Evaluation (Likert Scale)

The most common approach. Each response is rated on multiple dimensions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Rating(Enum):
    TERRIBLE = 1
    POOR = 2
    ACCEPTABLE = 3
    GOOD = 4
    EXCELLENT = 5


@dataclass
class EvaluationCriteria:
    relevance: Rating       # Does the answer address the question?
    accuracy: Rating        # Is the information correct?
    completeness: Rating    # Is important information missing?
    clarity: Rating         # Is the answer clear and well-structured?
    helpfulness: Rating     # Does the answer help the user?
    comment: Optional[str]  # Free-form comment


@dataclass
class AnnotationTask:
    task_id: str
    question: str
    rag_answer: str
    contexts: list[str]
    ground_truth: Optional[str]
    annotator_id: str
    evaluation: Optional[EvaluationCriteria] = None
```

Standard Evaluation Grid:

| Score | Relevance | Accuracy | Completeness |
|---|---|---|---|
| 5 | Answers perfectly | 100% correct | Everything covered |
| 4 | Answers well | Nearly correct | Essentials covered |
| 3 | Partially answers | Minor errors | Minor gaps |
| 2 | Partially off-topic | Significant errors | Major gaps |
| 1 | Does not answer | False | Nothing useful |
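A grid like this only pays off if scores are aggregated consistently across annotators. As a minimal sketch (the `aggregate_scores` helper and its input shape are illustrative assumptions, not part of any library), per-criterion means can be computed like so:

```python
from statistics import mean


def aggregate_scores(annotations: list[dict[str, int]]) -> dict[str, float]:
    """Average each 1-5 criterion score across all annotators of one task."""
    criteria = annotations[0].keys()
    return {c: mean(a[c] for a in annotations) for c in criteria}


# Three annotators rating the same answer
scores = aggregate_scores([
    {"relevance": 5, "accuracy": 4, "completeness": 3},
    {"relevance": 4, "accuracy": 4, "completeness": 4},
    {"relevance": 5, "accuracy": 3, "completeness": 4},
])
# relevance ≈ 4.67, accuracy ≈ 3.67, completeness ≈ 3.67
```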

2. Pairwise Comparison (A/B)

More reliable than absolute scoring when differences are subtle. The annotator chooses the better of two responses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PairwiseTask:
    task_id: str
    question: str
    answer_a: str
    answer_b: str
    contexts: list[str]
    annotator_id: str
    preference: Optional[str] = None  # "A", "B", or "equal"
    confidence: Optional[int] = None  # 1-5
    reason: Optional[str] = None


def calculate_win_rate(annotations: list[PairwiseTask]) -> dict:
    """Calculate win rate between two models"""
    wins_a = sum(1 for a in annotations if a.preference == "A")
    wins_b = sum(1 for a in annotations if a.preference == "B")
    ties = sum(1 for a in annotations if a.preference == "equal")
    total = len(annotations)
    return {
        "model_a_wins": wins_a / total,
        "model_b_wins": wins_b / total,
        "ties": ties / total,
    }
```
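Win rates alone can mislead on small samples. As a hedged addition (not part of the original protocol), a two-sided exact sign test over the decisive (non-tied) preferences gives a quick significance check using only the standard library:

```python
from math import comb


def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Two-sided exact sign test: ties are dropped, H0 is a 50/50 split."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # P(X >= k) under Binomial(n, 0.5), doubled for a two-sided test
    p = 2 * sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    return min(p, 1.0)


# 14 wins for A vs 6 for B out of 20 decisive comparisons
p = sign_test_p_value(14, 6)  # ≈ 0.115: not significant at the 5% level
```

With only 20 comparisons, even a 70/30 split is not statistically conclusive, which is one reason the sample-size recommendations later in this guide matter.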

3. Error-Based Evaluation (Error Taxonomy)

Identifies specific error types for targeted debugging.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorType(Enum):
    HALLUCINATION = "hallucination"
    FACTUAL_ERROR = "factual_error"
    INCOMPLETE = "incomplete"
    OFF_TOPIC = "off_topic"
    WRONG_CONTEXT = "wrong_context"
    OUTDATED = "outdated"
    FORMATTING = "formatting"
    TONE = "tone"


@dataclass
class ErrorAnnotation:
    task_id: str
    errors: list[ErrorType]
    error_details: dict[ErrorType, str]
    severity: int  # 1-5
    fixable: bool


def analyze_error_distribution(annotations: list[ErrorAnnotation]) -> dict:
    """Analyze error distribution"""
    error_counts = {}
    for annotation in annotations:
        for error in annotation.errors:
            error_counts[error.value] = error_counts.get(error.value, 0) + 1
    total_errors = sum(error_counts.values())
    return {
        error: count / total_errors
        for error, count in sorted(
            error_counts.items(), key=lambda x: x[1], reverse=True
        )
    }
```
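As a quick end-to-end check of the distribution logic (a standalone sketch using plain strings instead of the enum, so it runs on its own), the core computation reduces to:

```python
from collections import Counter


def error_distribution(error_lists: list[list[str]]) -> dict[str, float]:
    """Share of each error type across all annotated errors, most common first."""
    counts = Counter(e for errors in error_lists for e in errors)
    total = sum(counts.values())
    return {e: c / total for e, c in counts.most_common()}


dist = error_distribution([
    ["hallucination", "incomplete"],
    ["hallucination"],
    ["off_topic"],
])
# hallucination: 0.5, incomplete: 0.25, off_topic: 0.25
```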

Inter-Annotator Agreement

Why Measure It?

If two annotators disagree on 50% of samples, your annotations are unreliable. Inter-annotator agreement (IAA) measures consistency.

Agreement Metrics

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score


def calculate_iaa(annotations_1: list[int], annotations_2: list[int]) -> dict:
    """Calculate multiple inter-annotator agreement metrics"""
    # Exact agreement percentage
    exact_agreement = sum(
        1 for a, b in zip(annotations_1, annotations_2) if a == b
    ) / len(annotations_1)

    # Cohen's Kappa (chance-corrected)
    kappa = cohen_kappa_score(annotations_1, annotations_2)

    # Agreement within 1 point
    close_agreement = sum(
        1 for a, b in zip(annotations_1, annotations_2) if abs(a - b) <= 1
    ) / len(annotations_1)

    # Pearson correlation
    correlation = np.corrcoef(annotations_1, annotations_2)[0, 1]

    return {
        "exact_agreement": exact_agreement,
        "close_agreement": close_agreement,
        "cohens_kappa": kappa,
        "pearson_correlation": correlation,
    }
```

Score Interpretation

| Cohen's Kappa | Interpretation | Action |
|---|---|---|
| < 0.20 | Slight agreement | Review guidelines |
| 0.20-0.40 | Fair agreement | Clarify criteria |
| 0.40-0.60 | Moderate agreement | Acceptable to start |
| 0.60-0.80 | Substantial agreement | Good level |
| > 0.80 | Almost perfect | Excellent |
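When IAA is computed automatically in a pipeline, the interpretation grid can be encoded as a small lookup (an illustrative helper mirroring the table, not a standard API):

```python
def interpret_kappa(kappa: float) -> tuple[str, str]:
    """Map a Cohen's Kappa value to the interpretation grid above."""
    bands = [
        (0.20, "Slight agreement", "Review guidelines"),
        (0.40, "Fair agreement", "Clarify criteria"),
        (0.60, "Moderate agreement", "Acceptable to start"),
        (0.80, "Substantial agreement", "Good level"),
        (float("inf"), "Almost perfect", "Excellent"),
    ]
    for upper, label, action in bands:
        if kappa < upper:
            return label, action


interpret_kappa(0.55)  # ("Moderate agreement", "Acceptable to start")
```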

Improving Agreement

```python
class AnnotationGuidelines:
    """Structure for annotation guidelines"""

    def __init__(self):
        self.examples = {}
        self.edge_cases = []
        self.calibration_set = []

    def add_example(self, score: int, question: str, answer: str, explanation: str):
        """Add reference example for each score"""
        if score not in self.examples:
            self.examples[score] = []
        self.examples[score].append({
            "question": question,
            "answer": answer,
            "explanation": explanation,
        })

    def run_calibration(self, annotators: list[str], samples: list[dict]) -> dict:
        """Calibration session: all annotate the same samples"""
        results = {}
        for annotator in annotators:
            results[annotator] = self._get_annotations(annotator, samples)

        # Identify disagreements (score spread greater than 1 point)
        disagreements = []
        for i, sample in enumerate(samples):
            scores = [results[a][i] for a in annotators]
            if max(scores) - min(scores) > 1:
                disagreements.append({
                    "sample_idx": i,
                    "sample": sample,
                    "scores": dict(zip(annotators, scores)),
                })
        return {"disagreements": disagreements}

    def _get_annotations(self, annotator: str, samples: list[dict]) -> list[int]:
        """Fetch one annotator's scores; wire this to your annotation platform"""
        raise NotImplementedError
```

Annotation Platform

Simple Web Interface

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import uuid

app = FastAPI()


class AnnotationSubmission(BaseModel):
    task_id: str
    annotator_id: str
    relevance: int
    accuracy: int
    completeness: int
    clarity: int
    comment: str = ""


tasks_db = {}
annotations_db = {}


@app.get("/task/{annotator_id}")
async def get_next_task(annotator_id: str):
    """Returns the next task to annotate"""
    for task_id, task in tasks_db.items():
        if not any(
            a["annotator_id"] == annotator_id
            for a in annotations_db.get(task_id, [])
        ):
            return task
    raise HTTPException(404, "No tasks available")


@app.post("/annotate")
async def submit_annotation(submission: AnnotationSubmission):
    """Submit an annotation"""
    if submission.task_id not in tasks_db:
        raise HTTPException(404, "Task not found")

    annotation = {
        "id": str(uuid.uuid4()),
        "task_id": submission.task_id,
        "annotator_id": submission.annotator_id,
        "scores": {
            "relevance": submission.relevance,
            "accuracy": submission.accuracy,
            "completeness": submission.completeness,
            "clarity": submission.clarity,
        },
        "comment": submission.comment,
    }

    if submission.task_id not in annotations_db:
        annotations_db[submission.task_id] = []
    annotations_db[submission.task_id].append(annotation)

    return {"status": "success", "annotation_id": annotation["id"]}
```

Existing Tools

| Tool | Type | Price | Strengths |
|---|---|---|---|
| Label Studio | Open-source | Free | Flexible, self-hosted |
| Argilla | Open-source | Free | NLP/RAG specialized |
| Prodigy | Commercial | $390 | Excellent UX, fast |
| Scale AI | Service | Variable | Annotators included |

Smart Sampling

Sampling Strategy

```python
import random
from collections import defaultdict


class SmartSampler:
    def __init__(self, all_samples: list[dict]):
        self.samples = all_samples

    def stratified_sample(self, n: int, strata_key: str) -> list[dict]:
        """Stratified sampling by category"""
        strata = defaultdict(list)
        for sample in self.samples:
            strata[sample.get(strata_key, "unknown")].append(sample)

        samples_per_stratum = n // len(strata)
        selected = []
        for stratum_samples in strata.values():
            selected.extend(random.sample(
                stratum_samples,
                min(samples_per_stratum, len(stratum_samples))
            ))
        return selected[:n]

    def uncertainty_sample(self, n: int, ragas_scores: dict) -> list[dict]:
        """Sample cases with uncertain RAGAS scores"""
        uncertain = [
            (i, sample) for i, sample in enumerate(self.samples)
            if 0.4 < ragas_scores.get(i, {}).get("faithfulness", 0) < 0.7
        ]
        return [sample for _, sample in uncertain[:n]]
```

Recommended Sample Sizes

| Goal | Minimum Size | Annotators | Estimated Time |
|---|---|---|---|
| Quick validation | 50 | 1 | 2h |
| Model calibration | 100 | 2 | 6h |
| Serious benchmark | 300 | 3 | 24h |
| Production critical | 500+ | 3+ | 40h+ |
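For budgeting, these time estimates follow from a simple workload model (assuming roughly 100 seconds per sample per annotator, which is our assumption here, not a figure from the table):

```python
def estimate_workload(n_samples: int, n_annotators: int,
                      seconds_per_annotation: int = 100) -> float:
    """Total annotation hours if every annotator labels every sample."""
    total_seconds = n_samples * n_annotators * seconds_per_annotation
    return total_seconds / 3600


# A "serious benchmark": 300 samples, 3 annotators
hours = estimate_workload(300, 3)  # 25.0 hours, close to the table's 24h
```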

Integration with Automated Evaluation

Hybrid Pipeline

```python
class HybridEvaluationPipeline:
    def __init__(self, ragas_evaluator, human_platform):
        self.ragas = ragas_evaluator
        self.human = human_platform

    async def evaluate(self, samples: list[dict]) -> dict:
        # Step 1: Automated evaluation
        auto_results = await self.ragas.evaluate(samples)

        # Step 2: Identify samples for human validation
        uncertain_indices = [
            i for i, score in enumerate(auto_results["per_sample"])
            if 0.4 < score["faithfulness"] < 0.7
        ]

        # Step 3: Create annotation tasks
        human_tasks = [samples[i] for i in uncertain_indices]
        await self.human.create_tasks(human_tasks)

        # Step 4: Wait for annotations
        human_results = await self.human.wait_for_completion()

        # Step 5: Combine results
        return self._merge_results(auto_results, human_results, uncertain_indices)

    def _merge_results(self, auto, human, human_indices):
        """Combine auto and human scores"""
        final_scores = auto["per_sample"].copy()
        for i, idx in enumerate(human_indices):
            human_score = human[i]["average_score"]
            auto_score = final_scores[idx]["faithfulness"]
            final_scores[idx]["final_score"] = 0.6 * human_score + 0.4 * auto_score
            final_scores[idx]["human_validated"] = True
        return final_scores
```

Human Evaluation Checklist

| Step | Action | Done |
|---|---|---|
| Guidelines | Write with examples for each score | [ ] |
| Calibration | Initial session with all annotators | [ ] |
| Pilot | 20 samples to verify agreement | [ ] |
| Production | Launch complete annotation | [ ] |
| Quality | Calculate IAA regularly | [ ] |
| Feedback | Incorporate feedback into guidelines | [ ] |

Simplified Human Evaluation with Ailog

Setting up a human evaluation pipeline requires infrastructure and coordination. Ailog provides integrated tools:

  • Intuitive annotation interface
  • Real-time progress dashboard
  • Automatic IAA calculation
  • Smart sampling of critical cases
  • Reports combining auto and human scores

Try for free and validate your RAG quality with human expertise.

Tags

rag, evaluation, annotation, quality, human-in-the-loop
