Evaluating RAG Systems: Metrics and Methodologies
Comprehensive guide to measuring RAG performance: retrieval metrics, generation quality, end-to-end evaluation, and automated testing frameworks.
Why Evaluation Matters
Without measurement, you cannot:
- Know if changes improve performance
- Identify failure modes
- Optimize hyperparameters
- Justify costs to stakeholders
- Meet quality SLAs
Key insight: RAG has multiple components (retrieval, generation), each needing evaluation.
Evaluation Levels
Component-Level
Evaluate individual parts:
- Retrieval quality
- Generation quality
- Chunking effectiveness
End-to-End
Evaluate full pipeline:
- Answer correctness
- User satisfaction
- Task completion
Both Are Needed
Component metrics diagnose problems. End-to-end metrics measure business impact.
Retrieval Metrics
Precision@k
Proportion of retrieved documents that are relevant.
```python
def precision_at_k(retrieved, relevant, k):
    """
    retrieved: list of retrieved document IDs
    relevant: set of relevant document IDs
    k: number of top results to consider
    """
    top_k = set(retrieved[:k])
    relevant_retrieved = top_k & relevant
    return len(relevant_retrieved) / k if k > 0 else 0
```
Example:
Retrieved top 5: [doc1, doc2, doc3, doc4, doc5]
Relevant: {doc1, doc3, doc8}
Precision@5 = 2/5 = 0.4
Interpretation:
- Higher is better
- Measures accuracy
- Doesn't account for recall
Recall@k
Proportion of relevant documents that were retrieved.
```python
def recall_at_k(retrieved, relevant, k):
    """
    What fraction of relevant docs did we find?
    """
    top_k = set(retrieved[:k])
    relevant_retrieved = top_k & relevant
    return len(relevant_retrieved) / len(relevant) if relevant else 0
```
Example:
Retrieved top 5: [doc1, doc2, doc3, doc4, doc5]
Relevant: {doc1, doc3, doc8}
Recall@5 = 2/3 ≈ 0.67
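Both worked examples use the same retrieved list and relevance judgments, so a quick sanity check with the functions above:
```python
retrieved = ["doc1", "doc2", "doc3", "doc4", "doc5"]
relevant = {"doc1", "doc3", "doc8"}

print(precision_at_k(retrieved, relevant, 5))  # 0.4
print(recall_at_k(retrieved, relevant, 5))     # 0.666...
```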
Interpretation:
- Higher is better
- Measures coverage
- Harder to improve than precision: it depends on finding every relevant document, not just avoiding irrelevant ones
F1@k
Harmonic mean of precision and recall.
```python
def f1_at_k(retrieved, relevant, k):
    p = precision_at_k(retrieved, relevant, k)
    r = recall_at_k(retrieved, relevant, k)
    if p + r == 0:
        return 0
    return 2 * (p * r) / (p + r)
```
Use when:
- Need to balance precision and recall
- Single metric for optimization
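Continuing the worked example from the precision and recall sections (P@5 = 0.4, R@5 ≈ 0.67):
```python
retrieved = ["doc1", "doc2", "doc3", "doc4", "doc5"]
relevant = {"doc1", "doc3", "doc8"}

print(f1_at_k(retrieved, relevant, 5))  # 2 * (0.4 * 0.667) / (0.4 + 0.667) = 0.5
```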
Mean Reciprocal Rank (MRR)
The average, across queries, of the reciprocal rank of the first relevant result.
```python
def reciprocal_rank(retrieved, relevant):
    """
    Reciprocal rank of the first relevant document (0 if none found)
    """
    for i, doc_id in enumerate(retrieved, 1):
        if doc_id in relevant:
            return 1 / i
    return 0


def mrr(queries_results, queries_relevant):
    """
    Average reciprocal rank across multiple queries
    """
    rr_scores = [
        reciprocal_rank(retrieved, relevant)
        for retrieved, relevant in zip(queries_results, queries_relevant)
    ]
    return sum(rr_scores) / len(rr_scores)
```
Example:
Query 1: First relevant at position 2 → RR = 1/2 = 0.5
Query 2: First relevant at position 1 → RR = 1/1 = 1.0
Query 3: First relevant at position 5 → RR = 1/5 = 0.2
MRR = (0.5 + 1.0 + 0.2) / 3 = 0.57
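A quick check with the functions above, using hypothetical document IDs chosen to reproduce these ranks:
```python
queries_results = [
    ["d9", "d1", "d7"],             # first relevant at position 2
    ["d2", "d8"],                   # first relevant at position 1
    ["d4", "d5", "d6", "d7", "d3"]  # first relevant at position 5
]
queries_relevant = [{"d1"}, {"d2"}, {"d3"}]

print(mrr(queries_results, queries_relevant))  # ≈ 0.57
```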
Interpretation:
- Emphasizes ranking quality
- Cares only about first relevant result
- Good for question answering
NDCG@k (Normalized Discounted Cumulative Gain)
Accounts for graded relevance and position.
```python
import numpy as np

def calculate_ndcg(retrieved, relevance_scores, k):
    """
    relevance_scores: dict mapping doc_id to graded relevance (0-3 typical)
    """
    # Gains of the retrieved docs, in retrieved order
    gains = [relevance_scores.get(doc_id, 0) for doc_id in retrieved[:k]]

    # Best possible gains (ideal ranking over all judged docs)
    ideal_gains = sorted(relevance_scores.values(), reverse=True)[:k]

    def dcg(scores):
        # Standard position discount: gain / log2(rank + 1)
        return sum(s / np.log2(i + 2) for i, s in enumerate(scores))

    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0
```
Example:
Retrieved: [doc1, doc2, doc3]
Scores: [2, 3, 1] (your system)
Ideal: [3, 2, 1] (perfect ranking)
NDCG measures how close your ranking is to that ideal ranking.
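Reproducing this example with the function above:
```python
retrieved = ["doc1", "doc2", "doc3"]
relevance_scores = {"doc1": 2, "doc2": 3, "doc3": 1}

print(calculate_ndcg(retrieved, relevance_scores, k=3))  # ≈ 0.92
```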
Use when:
- Multiple relevance levels (not just binary)
- Position matters (first result more important)
- Research/enterprise search
Hit Rate@k
Did we retrieve at least one relevant document?
```python
def hit_rate_at_k(retrieved, relevant, k):
    top_k = set(retrieved[:k])
    return 1 if len(top_k & relevant) > 0 else 0
```
Use for:
- Minimum viability (did we get anything useful?)
- Aggregate across queries for overall hit rate
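For example, a minimal sketch of that aggregation, using hypothetical per-query results:
```python
# (retrieved IDs, relevant IDs) per query
query_results = [
    (["doc1", "doc2", "doc5"], {"doc1", "doc3"}),  # hit
    (["doc7", "doc8", "doc9"], {"doc4"}),          # miss
]

hits = [hit_rate_at_k(retrieved, relevant, k=3) for retrieved, relevant in query_results]
print(sum(hits) / len(hits))  # 0.5
```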
Generation Metrics
Faithfulness / Groundedness
Is the answer supported by retrieved context?
LLM-as-Judge:
```python
def evaluate_faithfulness(answer, context, llm):
    prompt = f"""Is this answer faithful to the context? Answer only yes or no.

Context: {context}

Answer: {answer}

Is the answer supported by the context?"""

    response = llm.generate(prompt, max_tokens=5)
    return 1 if 'yes' in response.lower() else 0
```
Why it matters:
- Detects hallucinations
- Ensures answers are grounded in facts
- Critical for high-stakes applications
Answer Relevance
Does the answer address the question?
```python
def evaluate_relevance(question, answer, llm):
    prompt = f"""Does this answer address the question? Rate 1-5.

Question: {question}

Answer: {answer}

Relevance (1-5):"""

    score = int(llm.generate(prompt, max_tokens=5))
    return score / 5  # Normalize to 0-1
```
Context Precision
How relevant is the retrieved context?
```python
def context_precision(retrieved_chunks, question, llm):
    """
    Are the retrieved chunks relevant to the question?
    """
    if not retrieved_chunks:
        return 0.0

    relevant_count = 0
    for chunk in retrieved_chunks:
        prompt = f"""Is this context relevant to the question?

Question: {question}

Context: {chunk}

Relevant? (yes/no)"""

        response = llm.generate(prompt, max_tokens=5)
        if 'yes' in response.lower():
            relevant_count += 1

    return relevant_count / len(retrieved_chunks)
```
Context Recall
Is all necessary information in the retrieved context?
```python
def context_recall(ground_truth_answer, retrieved_context, llm):
    """
    Does the context contain all info needed for the ground truth answer?
    """
    prompt = f"""Can this answer be derived from the context?

Context: {retrieved_context}

Answer: {ground_truth_answer}

Is all information present? (yes/no)"""

    response = llm.generate(prompt, max_tokens=5)
    return 1 if 'yes' in response.lower() else 0
```
Automated Evaluation Frameworks
RAGAS
```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)

# Prepare dataset (q1, a1, c1, gt1, ... are your own examples)
dataset = Dataset.from_dict({
    'question': [q1, q2, q3],
    'answer': [a1, a2, a3],
    'contexts': [c1, c2, c3],       # Each entry is a list of retrieved chunks
    'ground_truth': [gt1, gt2, gt3]
})

# Evaluate
result = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall
    ]
)

print(result)
# {
#     'faithfulness': 0.92,
#     'answer_relevancy': 0.87,
#     'context_precision': 0.81,
#     'context_recall': 0.89
# }
```
TruLens
```python
from trulens_eval import TruChain, Feedback, Tru

# Initialize
tru = Tru()

# Define feedback functions (groundedness_llm and friends are your own
# feedback implementations / provider wrappers)
f_groundedness = Feedback(groundedness_llm).on_output()
f_answer_relevance = Feedback(answer_relevance_llm).on_input_output()
f_context_relevance = Feedback(context_relevance_llm).on_input()

# Wrap the RAG chain
tru_rag = TruChain(
    rag_chain,
    app_id='my_rag_v1',
    feedbacks=[f_groundedness, f_answer_relevance, f_context_relevance]
)

# Use normally - metrics are collected automatically
result = tru_rag.run(query)

# View dashboard
tru.run_dashboard()
```
DeepEval
```python
from deepeval import evaluate
from deepeval.metrics import HallucinationMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Create a test case
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    context=["France is a country in Europe.", "Paris is the capital of France."]
)

# Define metrics
metrics = [
    HallucinationMetric(threshold=0.9),
    AnswerRelevancyMetric(threshold=0.8)
]

# Evaluate
evaluate([test_case], metrics)
```
Creating a Test Set
Manual Curation
```python
test_cases = [
    {
        'query': 'How do I reset my password?',
        'ground_truth_answer': 'Click "Forgot Password" on the login page...',
        'relevant_docs': {'doc_123', 'doc_456'},
        'difficulty': 'easy'
    },
    {
        'query': 'What are the differences between plans?',
        'ground_truth_answer': 'Premium includes...',
        'relevant_docs': {'doc_789'},
        'difficulty': 'medium'
    },
    # ... more test cases
]
```
Best practices:
- Diverse query types (simple, complex, ambiguous)
- Various difficulty levels
- Real user queries
- Edge cases
- 50-100 test cases minimum
Synthetic Generation
```python
import random

def generate_test_cases(documents, llm, num_cases=50):
    test_cases = []

    for doc in random.sample(documents, num_cases):
        # Ask the LLM for a question answerable from this document
        prompt = f"""Generate a question that can be answered using this document.

Document: {doc}

Question:"""
        question = llm.generate(prompt)

        # Ask for the corresponding ground-truth answer
        prompt_answer = f"""Answer this question using the document.

Document: {doc}

Question: {question}

Answer:"""
        answer = llm.generate(prompt_answer)

        test_cases.append({
            'query': question,
            'ground_truth_answer': answer,
            'relevant_docs': {doc['id']},
            'source': 'synthetic'
        })

    return test_cases
```
User Query Mining
```python
import random

# Extract from logs
def extract_queries_from_logs(log_file, sample_size=100):
    # Parse logs (parse_query_logs is your own log-parsing helper)
    queries = parse_query_logs(log_file)

    # Filter for quality
    queries = [q for q in queries if len(q.split()) >= 3]  # Not too short

    # Sample diverse queries
    return random.sample(queries, sample_size)
```
A/B Testing
Experiment Setup
```python
import time

class ABTest:
    def __init__(self, control_system, treatment_system):
        self.control = control_system
        self.treatment = treatment_system
        self.results = {'control': [], 'treatment': []}

    def run_query(self, query, user_id):
        # Assign to variant (50/50 split)
        variant = 'treatment' if hash(user_id) % 2 else 'control'
        system = self.treatment if variant == 'treatment' else self.control

        # Get answer
        answer = system.query(query)

        # Log result
        self.results[variant].append({
            'query': query,
            'answer': answer,
            'timestamp': time.time()
        })

        return answer, variant

    def analyze(self):
        # Compare metrics between variants
        # (calculate_metrics and calculate_lift are your own helpers)
        control_metrics = calculate_metrics(self.results['control'])
        treatment_metrics = calculate_metrics(self.results['treatment'])

        return {
            'control': control_metrics,
            'treatment': treatment_metrics,
            'lift': calculate_lift(control_metrics, treatment_metrics)
        }
```
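The calculate_metrics and calculate_lift helpers aren't defined in this guide; a minimal, hypothetical sketch (assuming logged results later gain a 'rating' field from user feedback) could look like:
```python
def calculate_metrics(results):
    """Aggregate logged results for one variant."""
    rated = [r for r in results if 'rating' in r]
    return {
        'n_queries': len(results),
        'thumbs_up_rate': (
            sum(1 for r in rated if r['rating'] == 'up') / len(rated)
            if rated else 0.0
        ),
    }

def calculate_lift(control_metrics, treatment_metrics):
    """Relative change of each shared metric, treatment vs. control."""
    lift = {}
    for name, control_value in control_metrics.items():
        if name in treatment_metrics and control_value:
            lift[name] = (treatment_metrics[name] - control_value) / control_value
    return lift
```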
Metrics to Track
Quality:
- Answer accuracy
- User ratings (thumbs up/down)
- Follow-up question rate
Engagement:
- Session duration
- Queries per session
- Task completion rate
Business:
- Conversion rate
- Support ticket deflection
- Customer satisfaction (CSAT)
Continuous Evaluation
Monitoring Pipeline
```python
import time

class RAGMonitor:
    def __init__(self, rag_system, test_set):
        self.system = rag_system
        self.test_set = test_set
        self.history = []

    def run_evaluation(self):
        results = []

        for test_case in self.test_set:
            # Run RAG
            answer, contexts = self.system.query(test_case['query'])

            # Calculate metrics (the LLM-judge functions defined earlier)
            metrics = {
                'precision@5': precision_at_k(contexts, test_case['relevant_docs'], 5),
                'faithfulness': evaluate_faithfulness(answer, contexts),
                'relevance': evaluate_relevance(test_case['query'], answer)
            }
            results.append(metrics)

        # Aggregate (aggregate_metrics is your own averaging helper)
        aggregated = aggregate_metrics(results)

        # Alert if degraded vs. the previous run (check before appending,
        # so we compare against the last run rather than this one)
        if self.detect_degradation(aggregated):
            self.send_alert(aggregated)

        # Save history
        self.history.append({
            'timestamp': time.time(),
            'metrics': aggregated
        })

        return aggregated

    def detect_degradation(self, current_metrics, threshold=0.05):
        if not self.history:
            return False

        previous = self.history[-1]['metrics']
        for metric, value in current_metrics.items():
            if value < previous[metric] - threshold:
                return True
        return False
```
Scheduled Evaluation
```python
import time
import schedule

def daily_evaluation():
    # rag_system, test_set, metrics_logger, update_dashboard come from your app
    monitor = RAGMonitor(rag_system, test_set)
    results = monitor.run_evaluation()

    # Log to monitoring system
    metrics_logger.log(results)

    # Update dashboard
    update_dashboard(results)

# Run daily at 2 AM
schedule.every().day.at("02:00").do(daily_evaluation)

while True:
    schedule.run_pending()
    time.sleep(60)
```
Human Evaluation
Rating Interface
```python
def collect_human_ratings(test_cases, rag_system):
    ratings = []

    for test_case in test_cases:
        # Generate answer
        answer, contexts = rag_system.query(test_case['query'])

        # Show to human rater
        print(f"Query: {test_case['query']}")
        print(f"Answer: {answer}")
        print(f"Contexts: {contexts}")

        # Collect ratings
        correctness = int(input("Correctness (1-5): "))
        completeness = int(input("Completeness (1-5): "))
        conciseness = int(input("Conciseness (1-5): "))

        ratings.append({
            'query': test_case['query'],
            'correctness': correctness,
            'completeness': completeness,
            'conciseness': conciseness
        })

    return ratings
```
Inter-Rater Reliability
```python
from sklearn.metrics import cohen_kappa_score

def calculate_agreement(rater1_scores, rater2_scores):
    """
    Cohen's Kappa for inter-rater agreement
    """
    kappa = cohen_kappa_score(rater1_scores, rater2_scores)

    if kappa > 0.8:
        return "Strong agreement"
    elif kappa > 0.6:
        return "Moderate agreement"
    else:
        return "Weak agreement - review rating criteria"
```
Cost of Evaluation
LLM-Based Metrics Cost
```python
def estimate_evaluation_cost(num_test_cases, metrics_per_case=3):
    # GPT-4 pricing (example)
    cost_per_1k_tokens = 0.03  # Input
    tokens_per_evaluation = 500  # Typical

    total_evaluations = num_test_cases * metrics_per_case
    total_tokens = total_evaluations * tokens_per_evaluation
    cost = (total_tokens / 1000) * cost_per_1k_tokens

    return cost

# Example
cost = estimate_evaluation_cost(100)  # $4.50 for 100 test cases
```
Optimization
- Cache evaluations for unchanged outputs (see the sketch after this list)
- Use smaller models (GPT-3.5 vs GPT-4) for some metrics
- Batch evaluations
- Run less frequently (daily vs every PR)
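For the first optimization, a minimal sketch of caching LLM-based judgments so unchanged outputs are never re-scored (the cache key scheme and helper names here are illustrative, not from any library):
```python
import hashlib
import json

_eval_cache = {}

def cached_evaluation(metric_name, answer, context, evaluate_fn):
    # Key on the exact (metric, answer, context) triple
    key = hashlib.sha256(
        json.dumps([metric_name, answer, context], sort_keys=True).encode()
    ).hexdigest()
    if key not in _eval_cache:
        _eval_cache[key] = evaluate_fn(answer, context)
    return _eval_cache[key]
```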
Best Practices
- Diverse test set: Cover all query types and difficulty levels
- Track over time: Monitor metrics as system evolves
- Component + E2E: Evaluate both parts and whole
- Real queries: Include actual user queries in test set
- Automate: Run evaluation on every change
- Human validation: Periodic human review of automated metrics
- Business metrics: Connect quality to business outcomes
Next Steps
With evaluation in place, the focus shifts to deploying RAG systems to production. The next guide covers production deployment, scaling, monitoring, and operational considerations.