Guide · Advanced

Evaluating RAG Systems: Metrics and Methodologies

February 15, 2025
12 min read
Ailog Research Team

Comprehensive guide to measuring RAG performance: retrieval metrics, generation quality, end-to-end evaluation, and automated testing frameworks.

Why Evaluation Matters

Without measurement, you cannot:

  • Know if changes improve performance
  • Identify failure modes
  • Optimize hyperparameters
  • Justify costs to stakeholders
  • Meet quality SLAs

Key insight: RAG has multiple components (retrieval, generation), each needing evaluation.

Evaluation Levels

Component-Level

Evaluate individual parts:

  • Retrieval quality
  • Generation quality
  • Chunking effectiveness

End-to-End

Evaluate full pipeline:

  • Answer correctness
  • User satisfaction
  • Task completion

Both Are Needed

Component metrics diagnose problems. End-to-end metrics measure business impact.

Retrieval Metrics

Precision@k

Proportion of retrieved documents that are relevant.

python
def precision_at_k(retrieved, relevant, k):
    """
    retrieved: List of retrieved document IDs
    relevant: Set of relevant document IDs
    k: Number of top results to consider
    """
    top_k = set(retrieved[:k])
    relevant_retrieved = top_k & relevant
    return len(relevant_retrieved) / k if k > 0 else 0

Example:

Retrieved top 5: [doc1, doc2, doc3, doc4, doc5]
Relevant: {doc1, doc3, doc8}

Precision@5 = 2/5 = 0.4

Interpretation:

  • Higher is better
  • Measures accuracy
  • Doesn't account for recall

Recall@k

Proportion of relevant documents that were retrieved.

python
def recall_at_k(retrieved, relevant, k):
    """
    What fraction of relevant docs did we find?
    """
    top_k = set(retrieved[:k])
    relevant_retrieved = top_k & relevant
    return len(relevant_retrieved) / len(relevant) if relevant else 0

Example:

Retrieved top 5: [doc1, doc2, doc3, doc4, doc5]
Relevant: {doc1, doc3, doc8}

Recall@5 = 2/3 ≈ 0.67

Interpretation:

  • Higher is better
  • Measures coverage
  • Harder to optimize than precision

F1@k

Harmonic mean of precision and recall.

python
def f1_at_k(retrieved, relevant, k):
    p = precision_at_k(retrieved, relevant, k)
    r = recall_at_k(retrieved, relevant, k)
    if p + r == 0:
        return 0
    return 2 * (p * r) / (p + r)

Use when:

  • Need to balance precision and recall
  • Single metric for optimization

Mean Reciprocal Rank (MRR)

The reciprocal rank of the first relevant result, averaged across queries.

python
def reciprocal_rank(retrieved, relevant):
    """
    Reciprocal rank of the first relevant document
    """
    for i, doc_id in enumerate(retrieved, 1):
        if doc_id in relevant:
            return 1 / i
    return 0

def mrr(queries_results, queries_relevant):
    """
    Average across multiple queries
    """
    rr_scores = [
        reciprocal_rank(retrieved, relevant)
        for retrieved, relevant in zip(queries_results, queries_relevant)
    ]
    return sum(rr_scores) / len(rr_scores)

Example:

Query 1: First relevant at position 2 → RR = 1/2 = 0.5
Query 2: First relevant at position 1 → RR = 1/1 = 1.0
Query 3: First relevant at position 5 → RR = 1/5 = 0.2

MRR = (0.5 + 1.0 + 0.2) / 3 = 0.57

Interpretation:

  • Emphasizes ranking quality
  • Cares only about first relevant result
  • Good for question answering

NDCG@k (Normalized Discounted Cumulative Gain)

Accounts for graded relevance and position.

python
import numpy as np

def calculate_ndcg(retrieved, relevance_scores, k):
    """
    relevance_scores: Dict mapping doc_id to graded relevance (0-3 typical)
    """
    # DCG of the system's ranking: position-discounted gains
    gains = [relevance_scores.get(doc_id, 0) for doc_id in retrieved[:k]]
    dcg = sum(g / np.log2(i + 2) for i, g in enumerate(gains))

    # Ideal DCG: the best possible ordering of the known relevant docs
    ideal_gains = sorted(relevance_scores.values(), reverse=True)[:k]
    idcg = sum(g / np.log2(i + 2) for i, g in enumerate(ideal_gains))

    return dcg / idcg if idcg > 0 else 0.0

Example:

Retrieved: [doc1, doc2, doc3]
Scores:    [2,    3,    1]     (your system)
Ideal:     [3,    2,    1]     (perfect ranking)

NDCG measures how close your ranking is to the ideal ordering.
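
Plugging the worked example into calculate_ndcg above (the doc IDs and scores are just the placeholders from the example):

python
relevance = {'doc1': 2, 'doc2': 3, 'doc3': 1}
retrieved = ['doc1', 'doc2', 'doc3']

# DCG  = 2/log2(2) + 3/log2(3) + 1/log2(4) ≈ 4.39
# IDCG = 3/log2(2) + 2/log2(3) + 1/log2(4) ≈ 4.76
print(calculate_ndcg(retrieved, relevance, k=3))  # ≈ 0.92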

Use when:

  • Multiple relevance levels (not just binary)
  • Position matters (first result more important)
  • Research/enterprise search

Hit Rate@k

Did we retrieve at least one relevant document?

python
def hit_rate_at_k(retrieved, relevant, k):
    top_k = set(retrieved[:k])
    return 1 if len(top_k & relevant) > 0 else 0

Use for:

  • Minimum viability (did we get anything useful?)
  • Aggregate across queries for overall hit rate
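
A minimal sketch of that aggregation, assuming a test set like the one built later in this guide and a hypothetical retriever.search API:

python
def overall_hit_rate(test_cases, retriever, k=5):
    """Fraction of queries with at least one relevant doc in the top k."""
    hits = sum(
        hit_rate_at_k(
            retriever.search(case['query'], top_k=k),  # hypothetical retriever API
            case['relevant_docs'],
            k
        )
        for case in test_cases
    )
    return hits / len(test_cases)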

Generation Metrics

Faithfulness / Groundedness

Is the answer supported by retrieved context?

LLM-as-Judge:

python
def evaluate_faithfulness(answer, context, llm):
    prompt = f"""Is this answer faithful to the context? Answer only yes or no.

Context: {context}

Answer: {answer}

Is the answer supported by the context?"""

    response = llm.generate(prompt, max_tokens=5)
    return 1 if 'yes' in response.lower() else 0

Why it matters:

  • Detects hallucinations
  • Ensures answers are grounded in facts
  • Critical for high-stakes applications

Answer Relevance

Does the answer address the question?

python
def evaluate_relevance(question, answer, llm):
    prompt = f"""Does this answer address the question? Rate 1-5.

Question: {question}

Answer: {answer}

Relevance (1-5):"""

    score = int(llm.generate(prompt, max_tokens=5))
    return score / 5  # Normalize to 0-1

Context Precision

How relevant is the retrieved context?

python
def context_precision(retrieved_chunks, question, llm):
    """
    Are the retrieved chunks relevant to the question?
    """
    relevant_count = 0

    for chunk in retrieved_chunks:
        prompt = f"""Is this context relevant to the question?

Question: {question}

Context: {chunk}

Relevant? (yes/no)"""

        response = llm.generate(prompt, max_tokens=5)
        if 'yes' in response.lower():
            relevant_count += 1

    return relevant_count / len(retrieved_chunks)

Context Recall

Is all necessary information in the retrieved context?

python
def context_recall(ground_truth_answer, retrieved_context, llm):
    """
    Does the context contain all info needed for the ground truth answer?
    """
    prompt = f"""Can this answer be derived from the context?

Context: {retrieved_context}

Answer: {ground_truth_answer}

Is all information present? (yes/no)"""

    response = llm.generate(prompt, max_tokens=5)
    return 1 if 'yes' in response.lower() else 0

Automated Evaluation Frameworks

RAGAS

python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)

# Prepare dataset
dataset = Dataset.from_dict({
    'question': [q1, q2, q3],
    'answer': [a1, a2, a3],
    'contexts': [c1, c2, c3],        # Each entry is a list of retrieved chunks
    'ground_truth': [gt1, gt2, gt3]
})

# Evaluate
result = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall
    ]
)

print(result)
# {
#     'faithfulness': 0.92,
#     'answer_relevancy': 0.87,
#     'context_precision': 0.81,
#     'context_recall': 0.89
# }

TruLens

python
from trulens_eval import TruChain, Feedback, Tru

# Initialize
tru = Tru()

# Define feedback functions
f_groundedness = Feedback(groundedness_llm).on_output()
f_answer_relevance = Feedback(answer_relevance_llm).on_input_output()
f_context_relevance = Feedback(context_relevance_llm).on_input()

# Wrap RAG chain
tru_rag = TruChain(
    rag_chain,
    app_id='my_rag_v1',
    feedbacks=[f_groundedness, f_answer_relevance, f_context_relevance]
)

# Use normally - metrics auto-collected
result = tru_rag.run(query)

# View dashboard
tru.run_dashboard()

DeepEval

python
from deepeval import evaluate
from deepeval.metrics import HallucinationMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Create test case
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    context=["France is a country in Europe.", "Paris is the capital of France."]
)

# Define metrics
metrics = [
    HallucinationMetric(threshold=0.9),
    AnswerRelevancyMetric(threshold=0.8)
]

# Evaluate
evaluate([test_case], metrics)

Creating a Test Set

Manual Curation

python
test_cases = [
    {
        'query': 'How do I reset my password?',
        'ground_truth_answer': 'Click "Forgot Password" on the login page...',
        'relevant_docs': {'doc_123', 'doc_456'},
        'difficulty': 'easy'
    },
    {
        'query': 'What are the differences between plans?',
        'ground_truth_answer': 'Premium includes...',
        'relevant_docs': {'doc_789'},
        'difficulty': 'medium'
    },
    # ... more test cases
]

Best practices:

  • Diverse query types (simple, complex, ambiguous)
  • Various difficulty levels
  • Real user queries
  • Edge cases
  • 50-100 test cases minimum

Synthetic Generation

python
import random

def generate_test_cases(documents, llm, num_cases=50):
    # Assumes each document is a dict with 'id' and 'text' fields
    test_cases = []

    for doc in random.sample(documents, num_cases):
        prompt = f"""Generate a question that can be answered using this document.

Document: {doc['text']}

Question:"""
        question = llm.generate(prompt)

        prompt_answer = f"""Answer this question using the document.

Document: {doc['text']}

Question: {question}

Answer:"""
        answer = llm.generate(prompt_answer)

        test_cases.append({
            'query': question,
            'ground_truth_answer': answer,
            'relevant_docs': {doc['id']},
            'source': 'synthetic'
        })

    return test_cases

User Query Mining

python
import random

# Extract from logs
def extract_queries_from_logs(log_file, sample_size=100):
    # Parse logs
    queries = parse_query_logs(log_file)

    # Filter for quality
    queries = [q for q in queries if len(q.split()) >= 3]  # Not too short

    # Sample diverse queries
    return random.sample(queries, sample_size)
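
parse_query_logs is left undefined above; a minimal sketch, assuming a JSON-lines log with a 'query' field per record, might look like this:

python
import json

def parse_query_logs(log_file):
    """Collect user query strings from a JSON-lines log file (assumed format)."""
    queries = []
    with open(log_file) as f:
        for line in f:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines
            if 'query' in record:
                queries.append(record['query'])
    return queries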

A/B Testing

Experiment Setup

python
import time

class ABTest:
    def __init__(self, control_system, treatment_system):
        self.control = control_system
        self.treatment = treatment_system
        self.results = {'control': [], 'treatment': []}

    def run_query(self, query, user_id):
        # Assign to variant (50/50 split)
        # Note: built-in hash() of strings is salted per process; use hashlib
        # if assignment must stay stable across restarts
        variant = 'treatment' if hash(user_id) % 2 else 'control'
        system = self.treatment if variant == 'treatment' else self.control

        # Get answer
        answer = system.query(query)

        # Log result
        self.results[variant].append({
            'query': query,
            'answer': answer,
            'timestamp': time.time()
        })

        return answer, variant

    def analyze(self):
        # Compare metrics between variants
        control_metrics = calculate_metrics(self.results['control'])
        treatment_metrics = calculate_metrics(self.results['treatment'])

        return {
            'control': control_metrics,
            'treatment': treatment_metrics,
            'lift': calculate_lift(control_metrics, treatment_metrics)
        }
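
The calculate_metrics and calculate_lift helpers above are left undefined; a minimal sketch, assuming each logged result is later annotated with a binary 'thumbs_up' field from user feedback:

python
def calculate_metrics(results):
    """Summarize one variant's logged results (assumes a 'thumbs_up' field)."""
    if not results:
        return {'positive_rating_rate': 0.0, 'sample_size': 0}
    positive = sum(1 for r in results if r.get('thumbs_up'))
    return {
        'positive_rating_rate': positive / len(results),
        'sample_size': len(results)
    }

def calculate_lift(control_metrics, treatment_metrics):
    """Relative improvement of treatment over control on the rating rate."""
    base = control_metrics['positive_rating_rate']
    if base == 0:
        return None
    return (treatment_metrics['positive_rating_rate'] - base) / base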

Metrics to Track

Quality:

  • Answer accuracy
  • User ratings (thumbs up/down)
  • Follow-up question rate

Engagement:

  • Session duration
  • Queries per session
  • Task completion rate

Business:

  • Conversion rate
  • Support ticket deflection
  • Customer satisfaction (CSAT)

Continuous Evaluation

Monitoring Pipeline

python
import time

class RAGMonitor:
    def __init__(self, rag_system, test_set, llm):
        self.system = rag_system
        self.test_set = test_set
        self.llm = llm          # judge model for the LLM-based metrics
        self.history = []

    def run_evaluation(self):
        results = []

        for test_case in self.test_set:
            # Run RAG
            answer, contexts = self.system.query(test_case['query'])

            # Calculate metrics
            metrics = {
                'precision@5': precision_at_k(contexts, test_case['relevant_docs'], 5),
                'faithfulness': evaluate_faithfulness(answer, contexts, self.llm),
                'relevance': evaluate_relevance(test_case['query'], answer, self.llm)
            }
            results.append(metrics)

        # Aggregate
        aggregated = aggregate_metrics(results)

        # Alert if degradation (compare against the previous run before saving)
        if self.detect_degradation(aggregated):
            self.send_alert(aggregated)

        # Save history
        self.history.append({
            'timestamp': time.time(),
            'metrics': aggregated
        })

        return aggregated

    def detect_degradation(self, current_metrics, threshold=0.05):
        if not self.history:
            return False

        previous = self.history[-1]['metrics']
        for metric, value in current_metrics.items():
            if value < previous[metric] - threshold:
                return True

        return False
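
The aggregate_metrics helper is not defined above; a minimal version simply averages each metric across the per-test-case results:

python
def aggregate_metrics(results):
    """Average each metric name across per-test-case metric dicts."""
    if not results:
        return {}
    return {
        name: sum(r[name] for r in results) / len(results)
        for name in results[0]
    }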

Scheduled Evaluation

python
import time
import schedule

def daily_evaluation():
    # rag_system, test_set, and llm (the judge model) are assumed to be available
    monitor = RAGMonitor(rag_system, test_set, llm)
    results = monitor.run_evaluation()

    # Log to monitoring system
    metrics_logger.log(results)

    # Update dashboard
    update_dashboard(results)

# Run daily at 2 AM
schedule.every().day.at("02:00").do(daily_evaluation)

while True:
    schedule.run_pending()
    time.sleep(60)

Human Evaluation

Rating Interface

python
def collect_human_ratings(test_cases, rag_system):
    ratings = []

    for test_case in test_cases:
        # Generate answer
        answer, contexts = rag_system.query(test_case['query'])

        # Show to human rater
        print(f"Query: {test_case['query']}")
        print(f"Answer: {answer}")
        print(f"Contexts: {contexts}")

        # Collect ratings
        correctness = int(input("Correctness (1-5): "))
        completeness = int(input("Completeness (1-5): "))
        conciseness = int(input("Conciseness (1-5): "))

        ratings.append({
            'query': test_case['query'],
            'correctness': correctness,
            'completeness': completeness,
            'conciseness': conciseness
        })

    return ratings

Inter-Rater Reliability

python
from sklearn.metrics import cohen_kappa_score

def calculate_agreement(rater1_scores, rater2_scores):
    """
    Cohen's Kappa for inter-rater agreement
    """
    kappa = cohen_kappa_score(rater1_scores, rater2_scores)

    if kappa > 0.8:
        return "Strong agreement"
    elif kappa > 0.6:
        return "Moderate agreement"
    else:
        return "Weak agreement - review rating criteria"

Cost of Evaluation

LLM-Based Metrics Cost

python
def estimate_evaluation_cost(num_test_cases, metrics_per_case=3):
    # GPT-4 pricing (example)
    cost_per_1k_tokens = 0.03    # Input
    tokens_per_evaluation = 500  # Typical

    total_evaluations = num_test_cases * metrics_per_case
    total_tokens = total_evaluations * tokens_per_evaluation

    cost = (total_tokens / 1000) * cost_per_1k_tokens
    return cost

# Example
cost = estimate_evaluation_cost(100)  # $4.50 for 100 test cases

Optimization

  • Cache evaluations for unchanged outputs (see the sketch after this list)
  • Use smaller models (GPT-3.5 vs GPT-4) for some metrics
  • Batch evaluations
  • Run less frequently (daily vs every PR)
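
For the caching point above, one minimal sketch: key each LLM-judged score by a hash of its text inputs so unchanged (answer, context) pairs are never re-scored. The cache here is in-memory only; a persistent store would work the same way.

python
import hashlib
import json

_eval_cache = {}

def cached_metric(metric_fn, answer, context, llm):
    """Memoize an LLM-judged metric on its text inputs (in-memory sketch)."""
    key = hashlib.sha256(
        json.dumps([metric_fn.__name__, answer, context]).encode()
    ).hexdigest()
    if key not in _eval_cache:
        _eval_cache[key] = metric_fn(answer, context, llm)
    return _eval_cache[key]

# Unchanged (answer, context) pairs are scored only once, e.g.:
# score = cached_metric(evaluate_faithfulness, answer, context, llm)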

Best Practices

  1. Diverse test set: Cover all query types and difficulty levels
  2. Track over time: Monitor metrics as system evolves
  3. Component + E2E: Evaluate both parts and whole
  4. Real queries: Include actual user queries in test set
  5. Automate: Run evaluation on every change (see the CI sketch after this list)
  6. Human validation: Periodic human review of automated metrics
  7. Business metrics: Connect quality to business outcomes
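
For the automation point (item 5), a minimal sketch of a CI regression test using pytest; the module names, file path, and threshold are placeholders, and the metric helpers from earlier in this guide are assumed to be importable:

python
import json
import pytest

from my_rag import rag_system          # hypothetical application module
from rag_metrics import precision_at_k  # hypothetical module collecting the helpers above

with open("tests/rag_test_set.json") as f:  # hypothetical committed test set
    TEST_SET = json.load(f)

@pytest.mark.parametrize("case", TEST_SET)
def test_retrieval_precision(case):
    answer, contexts = rag_system.query(case["query"])
    p5 = precision_at_k(contexts, set(case["relevant_docs"]), 5)
    assert p5 >= 0.6, f"Precision@5 regressed on query: {case['query']}"  # example threshold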

Next Steps

With evaluation in place, the focus shifts to deploying RAG systems to production. The next guide covers production deployment, scaling, monitoring, and operational considerations.

Tags

evaluation, metrics, testing, quality
