Human Evaluation: Methodology and Tools

Automated evaluation with RAGAS and LLM-as-judge metrics offers reproducibility and speed, but it lacks the nuance of human judgment. This guide presents human evaluation methods to complement your RAG quality pipeline.

Why human evaluation remains essential

The limits of automation

Aspect	Automated evaluation	Human evaluation
Cultural nuances	Difficult	Excellent
Tone and style	Approximate	Precise
Ambiguous cases	Often fails	Contextual judgment
Actual satisfaction	Indirect proxy	Direct measurement
Bug discovery	Known patterns	New problems

When should human evaluation be preferred?

Product launch: validation before rollout
New domains: no automatic ground truth available
Sensitive cases: medical, legal, financial
Targeted debugging: understanding why an answer fails
Metric calibration: checking whether the automatic scores match real human judgment
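
The metric-calibration case can be made concrete with a small check: collect human ratings for a handful of samples and measure how closely the automatic score tracks them. The following is a minimal sketch, not a prescribed method. The helper name `pearson` and all sample values are illustrative assumptions, not data from a real evaluation run.

```python
from math import sqrt

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)

# Hypothetical data: automatic faithfulness scores (0-1) and human ratings (1-5)
auto_scores = [0.91, 0.35, 0.62, 0.80, 0.48]
human_ratings = [5, 2, 3, 4, 3]

r = pearson(auto_scores, human_ratings)
print(f"correlation between automatic and human scores: {r:.2f}")
```

A low correlation on such a check suggests the automatic metric should not yet be trusted as a stand-in for human judgment in that domain.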

Evaluation protocols

1. Criteria-based rating (Likert)

The most widely used protocol. Each answer is rated on several dimensions.

DEVELOPERpython
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Rating(Enum):
    TERRIBLE = 1
    POOR = 2
    ACCEPTABLE = 3
    GOOD = 4
    EXCELLENT = 5

@dataclass
class EvaluationCriteria:
    relevance: Rating        # Does the answer address the question?
    accuracy: Rating         # Is the information correct?
    completeness: Rating     # Is important information missing?
    clarity: Rating          # Is the answer clear and well structured?
    helpfulness: Rating      # Does the answer help the user?
    comment: Optional[str] = None  # Free-form comment

@dataclass
class AnnotationTask:
    task_id: str
    question: str
    rag_answer: str
    contexts: list[str]
    ground_truth: Optional[str]
    annotator_id: str
    evaluation: Optional[EvaluationCriteria] = None

Standard scoring rubric:

Score	Relevance	Accuracy	Completeness
5	Answers perfectly	100% correct	Everything covered
4	Answers well	Nearly correct	Essentials covered
3	Answers partially	A few minor errors	Minor gaps
2	Partly off-topic	Significant errors	Major gaps
1	Does not answer	Wrong	Nothing useful
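
Once several annotators have applied the rubric to the same answer, their ratings need to be combined per dimension. The sketch below averages each dimension. The function name `aggregate_ratings` and the ratings themselves are hypothetical, and averaging is only one possible aggregation (a median is a common alternative for ordinal scales).

```python
from statistics import mean

# Hypothetical ratings: one dict per annotator for the same answer (scale 1-5)
ratings = [
    {"relevance": 5, "accuracy": 4, "completeness": 3},
    {"relevance": 4, "accuracy": 4, "completeness": 2},
    {"relevance": 5, "accuracy": 3, "completeness": 3},
]

def aggregate_ratings(ratings: list[dict[str, int]]) -> dict[str, float]:
    """Average each rubric dimension across annotators."""
    if not ratings:
        return {}
    # Assumes every annotator rated the same set of dimensions
    dimensions = ratings[0].keys()
    return {dim: mean(r[dim] for r in ratings) for dim in dimensions}

summary = aggregate_ratings(ratings)
for dimension, score in summary.items():
    print(f"{dimension}: {score:.2f}")
```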

2. Pairwise comparison (A/B)

More reliable for subtle differences. The annotator picks the better answer out of two versions.

DEVELOPERpython
@dataclass
class PairwiseTask:
    task_id: str
    question: str
    answer_a: str
    answer_b: str
    contexts: list[str]
    annotator_id: str
    preference: Optional[str] = None  # "A", "B", or "equal"
    confidence: Optional[int] = None  # 1-5
    reason: Optional[str] = None

def calculate_win_rate(annotations: list[PairwiseTask]) -> dict:
    """Compute the win rate between two models"""
    total = len(annotations)
    if total == 0:
        # No annotations yet: avoid dividing by zero
        return {"model_a_wins": 0.0, "model_b_wins": 0.0, "ties": 0.0}

    wins_a = sum(1 for a in annotations if a.preference == "A")
    wins_b = sum(1 for a in annotations if a.preference == "B")
    ties = sum(1 for a in annotations if a.preference == "equal")

    return {
        "model_a_wins": wins_a / total,
        "model_b_wins": wins_b / total,
        "ties": ties / total
    }
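
A win rate on its own does not say whether the difference between the two models could be due to chance. One common way to check this is an exact two-sided sign test on the non-tied comparisons. The sketch below is an illustration under that assumption; the function name `sign_test_p_value` and the counts are hypothetical.

```python
from math import comb

def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Two-sided exact binomial test of H0: A and B are equally likely to win.

    Ties must be excluded before calling this function.
    """
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # no decisive comparisons, so no evidence either way
    k = max(wins_a, wins_b)
    # Probability of k or more wins for one side when p = 0.5
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts from a pairwise annotation campaign
p_value = sign_test_p_value(wins_a=32, wins_b=18)
print(f"p-value: {p_value:.3f}")
```

A small p-value indicates that a preference this lopsided would be unlikely if annotators had no real preference between the two models.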

3. Error-based evaluation (error taxonomy)

Identifies specific error types for targeted debugging.

DEVELOPERpython
class ErrorType(Enum):
    HALLUCINATION = "hallucination"
    FACTUAL_ERROR = "factual_error"
    INCOMPLETE = "incomplete"
    OFF_TOPIC = "off_topic"
    WRONG_CONTEXT = "wrong_context"
    OUTDATED = "outdated"
    FORMATTING = "formatting"
    TONE = "tone"

@dataclass
class ErrorAnnotation:
    task_id: str
    errors: list[ErrorType]
    error_details: dict[ErrorType, str]
    severity: int  # 1-5
    fixable: bool

def analyze_error_distribution(annotations: list[ErrorAnnotation]) -> dict:
    """Analyze the distribution of errors"""
    error_counts = {}
    for annotation in annotations:
        for error in annotation.errors:
            error_counts[error.value] = error_counts.get(error.value, 0) + 1

    total_errors = sum(error_counts.values())
    return {
        error: count / total_errors
        for error, count in sorted(error_counts.items(), key=lambda x: x[1], reverse=True)
    }

Inter-Annotator Agreement

Why measure it?

If two annotators disagree on 50% of the samples, your annotations are not reliable. Inter-annotator agreement (IAA) measures this consistency.
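
Raw agreement overstates reliability because some agreement happens by chance. Cohen's kappa corrects for this: kappa = (observed agreement - expected agreement) / (1 - expected agreement). The hand-written sketch below illustrates the calculation on toy data; the function name `cohens_kappa` and the labels are assumptions for illustration.

```python
from collections import Counter

def cohens_kappa(labels_1: list[int], labels_2: list[int]) -> float:
    """Cohen's kappa for two annotators rating the same samples."""
    if len(labels_1) != len(labels_2) or not labels_1:
        raise ValueError("both annotators must rate the same non-empty set of samples")
    n = len(labels_1)
    # Share of samples on which both annotators gave the same label
    observed = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    # Agreement expected by chance, from each annotator's label frequencies
    counts_1 = Counter(labels_1)
    counts_2 = Counter(labels_2)
    expected = sum(counts_1[c] * counts_2[c] for c in counts_1) / (n * n)
    if expected == 1.0:
        return 1.0  # design choice: both annotators used one identical label
    return (observed - expected) / (1 - expected)

# Toy example: two annotators, binary labels, 50% raw agreement
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 1, 0]
print(cohens_kappa(annotator_1, annotator_2))
```

In this toy case the two annotators agree on half of the samples, which is exactly what chance alone would produce for two balanced binary labels, so kappa is 0.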

Agreement metrics

DEVELOPERpython
import numpy as np
from sklearn.metrics import cohen_kappa_score

def calculate_iaa(annotations_1: list[int], annotations_2: list[int]) -> dict:
    """Compute several inter-annotator agreement metrics"""
    if len(annotations_1) != len(annotations_2) or not annotations_1:
        raise ValueError("Both annotators must rate the same non-empty set of samples")

    # Percentage of exact agreement
    exact_agreement = sum(
        1 for a, b in zip(annotations_1, annotations_2) if a == b
    ) / len(annotations_1)

    # Cohen's kappa (corrected for chance agreement)
    kappa = cohen_kappa_score(annotations_1, annotations_2)

    # Agreement within 1 point
    close_agreement = sum(
        1 for a, b in zip(annotations_1, annotations_2) if abs(a - b) <= 1
    ) / len(annotations_1)

    # Pearson correlation
    correlation = np.corrcoef(annotations_1, annotations_2)[0, 1]

    return {
        "exact_agreement": exact_agreement,
        "close_agreement": close_agreement,
        "cohens_kappa": kappa,
        "pearson_correlation": correlation
    }

Interpreting the scores

Cohen's Kappa	Interpretation	Action
< 0.20	Slight agreement	Revise the guidelines
0.20-0.40	Fair agreement	Clarify the criteria
0.40-0.60	Moderate agreement	Acceptable to start
0.60-0.80	Substantial agreement	Good level
> 0.80	Almost perfect agreement	Excellent
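
The bands in this table can be turned into a small helper for reports or dashboards. This is a sketch only: the function name `interpret_kappa` is hypothetical, and because the table's ranges share their endpoints, the code assumes each boundary value belongs to the higher band, except 0.80 itself, which stays in the 0.60-0.80 band.

```python
def interpret_kappa(kappa: float) -> tuple[str, str]:
    """Map a Cohen's kappa value to (interpretation, recommended action)."""
    if kappa < 0.20:
        return ("Slight agreement", "Revise the guidelines")
    if kappa < 0.40:
        return ("Fair agreement", "Clarify the criteria")
    if kappa < 0.60:
        return ("Moderate agreement", "Acceptable to start")
    if kappa <= 0.80:
        return ("Substantial agreement", "Good level")
    return ("Almost perfect agreement", "Excellent")

label, action = interpret_kappa(0.52)
print(f"{label}: {action}")
```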

Improving agreement

DEVELOPERpython
class AnnotationGuidelines:
    """Structure for the annotation guidelines"""

    def __init__(self):
        self.examples = {}
        self.edge_cases = []
        self.calibration_set = []

    def add_example(self, score: int, question: str, answer: str, explanation: str):
        """Add a reference example for a given score"""
        if score not in self.examples:
            self.examples[score] = []
        self.examples[score].append({
            "question": question,
            "answer": answer,
            "explanation": explanation
        })

    def _get_annotations(self, annotator: str, samples: list[dict]) -> list[int]:
        """Placeholder (not defined in the original): one score per sample for this annotator"""
        raise NotImplementedError("Connect this to your annotation storage")

    def run_calibration(self, annotators: list[str], samples: list[dict]) -> dict:
        """Calibration session: everyone annotates the same samples"""
        results = {}
        for annotator in annotators:
            results[annotator] = self._get_annotations(annotator, samples)

        # Identify the disagreements (scores more than 1 point apart)
        disagreements = []
        for i, sample in enumerate(samples):
            scores = [results[a][i] for a in annotators]
            if max(scores) - min(scores) > 1:
                disagreements.append({
                    "sample_idx": i,
                    "sample": sample,
                    "scores": dict(zip(annotators, scores))
                })

        return {"disagreements": disagreements}

Annotation platform

Simple web interface

DEVELOPERpython
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import uuid

app = FastAPI()

class AnnotationSubmission(BaseModel):
    task_id: str
    annotator_id: str
    relevance: int
    accuracy: int
    completeness: int
    clarity: int
    comment: str = ""

tasks_db = {}
annotations_db = {}

@app.get("/task/{annotator_id}")
async def get_next_task(annotator_id: str):
    """Return the next task to annotate"""
    for task_id, task in tasks_db.items():
        if not any(
            a["annotator_id"] == annotator_id
            for a in annotations_db.get(task_id, [])
        ):
            return task
    raise HTTPException(404, "No tasks available")

@app.post("/annotate")
async def submit_annotation(submission: AnnotationSubmission):
    """Submit an annotation"""
    if submission.task_id not in tasks_db:
        raise HTTPException(404, "Task not found")

    annotation = {
        "id": str(uuid.uuid4()),
        "task_id": submission.task_id,
        "annotator_id": submission.annotator_id,
        "scores": {
            "relevance": submission.relevance,
            "accuracy": submission.accuracy,
            "completeness": submission.completeness,
            "clarity": submission.clarity
        },
        "comment": submission.comment
    }

    if submission.task_id not in annotations_db:
        annotations_db[submission.task_id] = []
    annotations_db[submission.task_id].append(annotation)

    return {"status": "success", "annotation_id": annotation["id"]}

Existing tools

Tool	Type	Price	Strengths
Label Studio	Open-source	Free	Flexible, self-hosted
Argilla	Open-source	Free	Specialized in NLP/RAG
Prodigy	Commercial	390 EUR	Excellent UX, fast
Scale AI	Service	Variable	Annotators included

Smart sampling

Sampling strategy

DEVELOPERpython
import random
from collections import defaultdict

class SmartSampler:
    def __init__(self, all_samples: list[dict]):
        self.samples = all_samples

    def stratified_sample(self, n: int, strata_key: str) -> list[dict]:
        """Stratified sampling by category"""
        strata = defaultdict(list)
        for sample in self.samples:
            strata[sample.get(strata_key, "unknown")].append(sample)

        if not strata:
            return []  # no samples to draw from

        samples_per_stratum = n // len(strata)
        selected = []

        for stratum_samples in strata.values():
            selected.extend(random.sample(
                stratum_samples,
                min(samples_per_stratum, len(stratum_samples))
            ))

        return selected[:n]

    def uncertainty_sample(self, n: int, ragas_scores: dict) -> list[dict]:
        """Select the samples whose RAGAS scores are uncertain"""
        uncertain = [
            (i, sample)
            for i, sample in enumerate(self.samples)
            if 0.4 < ragas_scores.get(i, {}).get("faithfulness", 0) < 0.7
        ]
        return [sample for _, sample in uncertain[:n]]

Recommended sample size

Goal	Minimum size	Annotators	Estimated time
Quick validation	50	1	2h
Model calibration	100	2	6h
Serious benchmark	300	3	24h
Critical production	500+	3+	40h+
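
One way to reason about these sizes is the margin of error they give when estimating a proportion, such as the share of answers rated acceptable. The sketch below uses the standard normal approximation with the worst case p = 0.5 at 95% confidence. It is an illustration of the trade-off, not the method used to derive the table, and the function name `margin_of_error` is hypothetical.

```python
from math import sqrt

def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Half-width of the normal-approximation confidence interval for a proportion."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return z * sqrt(p * (1 - p) / n)

for n in (50, 100, 300, 500):
    print(f"n={n}: +/- {margin_of_error(n):.1%}")
```

Under these assumptions, 50 samples give roughly +/- 14 percentage points, 100 samples about +/- 10, 300 samples about +/- 6, and 500 samples about +/- 4, which is why larger sets are needed once small quality differences matter.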

Integration with automated evaluation

Hybrid pipeline

DEVELOPERpython
class HybridEvaluationPipeline:
    def __init__(self, ragas_evaluator, human_platform):
        self.ragas = ragas_evaluator
        self.human = human_platform

    async def evaluate(self, samples: list[dict]) -> dict:
        # Step 1: automatic evaluation
        auto_results = await self.ragas.evaluate(samples)

        # Step 2: identify the samples that need human validation
        uncertain_indices = [
            i for i, score in enumerate(auto_results["per_sample"])
            if 0.4 < score["faithfulness"] < 0.7
        ]

        # Step 3: create the annotation tasks
        human_tasks = [samples[i] for i in uncertain_indices]
        await self.human.create_tasks(human_tasks)

        # Step 4: wait for the annotations
        human_results = await self.human.wait_for_completion()

        # Step 5: combine the results
        return self._merge_results(auto_results, human_results, uncertain_indices)

    def _merge_results(self, auto, human, human_indices):
        """Combine the automatic and human scores"""
        # Copy each per-sample dict so the automatic results are not mutated
        final_scores = [dict(score) for score in auto["per_sample"]]

        for i, idx in enumerate(human_indices):
            # Assumes average_score is on the same 0-1 scale as faithfulness
            human_score = human[i]["average_score"]
            auto_score = final_scores[idx]["faithfulness"]
            final_scores[idx]["final_score"] = 0.6 * human_score + 0.4 * auto_score
            final_scores[idx]["human_validated"] = True

        return final_scores
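
The weighted blend in `_merge_results` only makes sense if both scores share a scale. If annotators rate on the 1-5 Likert scale while the automatic score lies between 0 and 1, the human score should be normalized first. The following sketch shows one way to do that; the helper names `likert_to_unit` and `blend` are hypothetical, and the 0.6 weight mirrors the value used in the pipeline.

```python
def likert_to_unit(score: float, low: int = 1, high: int = 5) -> float:
    """Rescale a Likert score from [low, high] to [0, 1]."""
    if not low <= score <= high:
        raise ValueError(f"score must be between {low} and {high}")
    return (score - low) / (high - low)

def blend(human_likert: float, auto_score: float, human_weight: float = 0.6) -> float:
    """Weighted average of a normalized human rating and an automatic 0-1 score."""
    return human_weight * likert_to_unit(human_likert) + (1 - human_weight) * auto_score

# Hypothetical sample: annotators averaged 4 out of 5, faithfulness was 0.5
final_score = blend(human_likert=4, auto_score=0.5)
print(f"final score: {final_score:.2f}")
```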

Human evaluation checklist

Step	Action	Done
Guidelines	Write them with examples for each score	[ ]
Calibration	Initial session with all annotators	[ ]
Pilot	20 samples to check agreement	[ ]
Production	Launch the full annotation	[ ]
Quality	Compute IAA regularly	[ ]
Feedback	Incorporate feedback into the guidelines	[ ]

Further reading

RAGAS framework - Automated evaluation
RAG metrics - Overview
RAG generation - Improving answers

FAQ

At least 2 annotators per sample are recommended so that inter-annotator agreement can be calculated. For critical cases (medical, legal), aim for 3 annotators and take the majority vote. A single annotator is sufficient only for a quick audit during development.
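
The majority rule for three annotators can be sketched as follows. The function name `majority_label` and the label values are hypothetical; returning `None` when no label has a strict majority is a design choice that signals the sample needs further review.

```python
from collections import Counter
from typing import Optional

def majority_label(labels: list[str]) -> Optional[str]:
    """Return the label chosen by more than half of the annotators, else None."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None

print(majority_label(["correct", "correct", "incorrect"]))
print(majority_label(["correct", "incorrect", "unsure"]))
```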

Use a hybrid approach: evaluate automatically with RAGAS, then focus human evaluation on uncertain samples (scores between 0.4 and 0.7). This reduces the annotation volume by 60–80% while still covering the critical cases.

A kappa of 0.40–0.60 (moderate agreement) is acceptable for starting annotation. If you are below 0.40, run a calibration session with your annotators on 20 shared samples before continuing.

No for evaluating perceived quality (relevance, clarity). Yes if you are evaluating factual accuracy and your annotators are not domain experts. In that case, provide the ground truth as a reference, but require annotators to rate independently first before consulting it.

For small disagreements (a 1-point difference on a 5-point scale), take the average. For larger differences, hold a reconciliation discussion or add a third annotator. Document the ambiguous cases in your guidelines to prevent them from recurring.
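
This resolution rule is simple enough to encode directly. In the sketch below, the function name `resolve_scores` is hypothetical; it averages two ratings that are at most one point apart and returns `None` otherwise, which the calling code can treat as "escalate to discussion or a third annotator".

```python
from typing import Optional

def resolve_scores(score_1: int, score_2: int, max_gap: int = 1) -> Optional[float]:
    """Average two ratings if they are close enough, else signal escalation with None."""
    if abs(score_1 - score_2) <= max_gap:
        return (score_1 + score_2) / 2
    return None

print(resolve_scores(4, 5))  # small gap: averaged
print(resolve_scores(2, 5))  # large gap: needs reconciliation
```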

Simplified human evaluation with Ailog

Setting up a human evaluation pipeline requires infrastructure and coordination. With Ailog, you get built-in tools:

Intuitive annotation interface
Real-time progress dashboard
Automatic IAA calculation
Smart sampling of critical cases
Reports that combine automatic and human scores

Try it for free and validate the quality of your RAG with human expertise.
