Evaluating RAG Systems: Metrics and Methodologies
Comprehensive guide to measuring RAG performance: retrieval metrics, generation quality, end-to-end evaluation, and automated testing frameworks.
- Author: Ailog Research Team
- Reading time: 12 min read
- Level: Advanced
Why Evaluation Matters
Without measurement, you cannot: • Know if changes improve performance • Identify failure modes • Optimize hyperparameters • Justify costs to stakeholders • Meet quality SLAs
Key insight: RAG has multiple components (retrieval, generation), each needing evaluation.
Evaluation Levels
Component-Level
Evaluate individual parts: • Retrieval quality • Generation quality • Chunking effectiveness
End-to-End
Evaluate full pipeline: • Answer correctness • User satisfaction • Task completion
Both are needed
Component metrics diagnose problems. End-to-end metrics measure business impact.
Retrieval Metrics
Precision@k
Proportion of retrieved documents that are relevant.
```python
def precision_at_k(retrieved, relevant, k):
    """
    retrieved: list of retrieved document IDs
    relevant: set of relevant document IDs
    k: number of top results to consider
    """
    top_k = set(retrieved[:k])
    relevant_retrieved = top_k & relevant

    return len(relevant_retrieved) / k if k > 0 else 0
```
Example:
```
Retrieved top 5: [doc1, doc2, doc3, doc4, doc5]
Relevant: {doc1, doc3, doc8}

Precision@5 = 2/5 = 0.4
```
Interpretation: • Higher is better • Measures accuracy • Doesn't account for recall
Recall@k
Proportion of relevant documents that were retrieved.
```python
def recall_at_k(retrieved, relevant, k):
    """
    What fraction of relevant docs did we find?
    """
    top_k = set(retrieved[:k])
    relevant_retrieved = top_k & relevant

    return len(relevant_retrieved) / len(relevant) if relevant else 0
```
Example:
```
Retrieved top 5: [doc1, doc2, doc3, doc4, doc5]
Relevant: {doc1, doc3, doc8}

Recall@5 = 2/3 ≈ 0.67
```
Interpretation: • Higher is better • Measures coverage • Harder to optimize than precision
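As a quick sanity check, the two helpers reproduce the worked examples above (the document IDs are the illustrative ones used in those examples):

```python
retrieved = ["doc1", "doc2", "doc3", "doc4", "doc5"]
relevant = {"doc1", "doc3", "doc8"}

print(precision_at_k(retrieved, relevant, k=5))  # 0.4
print(recall_at_k(retrieved, relevant, k=5))     # 0.666...
```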
F1@k
Harmonic mean of precision and recall.
```python
def f1_at_k(retrieved, relevant, k):
    p = precision_at_k(retrieved, relevant, k)
    r = recall_at_k(retrieved, relevant, k)

    if p + r == 0:
        return 0

    return 2 * (p * r) / (p + r)
```
Use when: • Need to balance precision and recall • Single metric for optimization
Mean Reciprocal Rank (MRR)
The average, across queries, of the reciprocal rank of the first relevant result.
```python
def reciprocal_rank(retrieved, relevant):
    """
    Reciprocal rank of the first relevant document
    """
    for i, doc_id in enumerate(retrieved, 1):
        if doc_id in relevant:
            return 1 / i
    return 0


def mrr(queries_results, queries_relevant):
    """
    Average across multiple queries
    """
    rr_scores = [
        reciprocal_rank(retrieved, relevant)
        for retrieved, relevant in zip(queries_results, queries_relevant)
    ]

    return sum(rr_scores) / len(rr_scores)
```
Example:
```
Query 1: First relevant at position 2 → RR = 1/2 = 0.5
Query 2: First relevant at position 1 → RR = 1/1 = 1.0
Query 3: First relevant at position 5 → RR = 1/5 = 0.2

MRR = (0.5 + 1.0 + 0.2) / 3 = 0.57
```
Interpretation: • Emphasizes ranking quality • Cares only about first relevant result • Good for question answering
NDCG@k (Normalized Discounted Cumulative Gain)
Accounts for graded relevance and position.
```python
import numpy as np

def dcg(scores):
    """Discounted cumulative gain for a list of graded relevance scores."""
    return sum(s / np.log2(i + 2) for i, s in enumerate(scores))

def calculate_ndcg(retrieved, relevance_scores, k):
    """
    retrieved: list of retrieved document IDs, in ranked order
    relevance_scores: dict mapping doc_id to graded relevance (0-3 typical)
    """
    # Relevance of the documents actually retrieved, in rank order
    scores = [relevance_scores.get(doc_id, 0) for doc_id in retrieved[:k]]

    # Ideal ranking (best possible ordering of known relevance judgments)
    ideal_scores = sorted(relevance_scores.values(), reverse=True)[:k]

    # NDCG = DCG of actual ranking / DCG of ideal ranking
    ideal_dcg = dcg(ideal_scores)
    return dcg(scores) / ideal_dcg if ideal_dcg > 0 else 0.0
```
Example:
```
Retrieved: [doc1, doc2, doc3]
Scores:    [2, 3, 1]  (your system)
Ideal:     [3, 2, 1]  (perfect ranking)

NDCG measures how close your ranking is to the ideal one
```
Use when: • Multiple relevance levels (not just binary) • Position matters (first result more important) • Research/enterprise search
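To make the example concrete, here is the same calculation run through the calculate_ndcg sketch above (the doc IDs and relevance grades are the illustrative ones from the example):

```python
relevance = {"doc1": 2, "doc2": 3, "doc3": 1}   # graded relevance judgments
retrieved = ["doc1", "doc2", "doc3"]            # your system's ranking

# DCG  = 2/log2(2) + 3/log2(3) + 1/log2(4) ≈ 4.39
# IDCG = 3/log2(2) + 2/log2(3) + 1/log2(4) ≈ 4.76
print(round(calculate_ndcg(retrieved, relevance, k=3), 2))  # ≈ 0.92
```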
Hit Rate@k
Did we retrieve at least one relevant document?
```python
def hit_rate_at_k(retrieved, relevant, k):
    top_k = set(retrieved[:k])
    return 1 if len(top_k & relevant) > 0 else 0
```
Use for: • Minimum viability (did we get anything useful?) • Aggregate across queries for an overall hit rate (see the sketch below)
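A minimal sketch of that aggregation, reusing the hit_rate_at_k helper above (the per-query result and relevance lists are assumed inputs):

```python
def overall_hit_rate(queries_results, queries_relevant, k=5):
    """Fraction of queries with at least one relevant document in the top k."""
    hits = [
        hit_rate_at_k(retrieved, relevant, k)
        for retrieved, relevant in zip(queries_results, queries_relevant)
    ]
    return sum(hits) / len(hits) if hits else 0.0
```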
Generation Metrics
Faithfulness / Groundedness
Is the answer supported by retrieved context?
LLM-as-Judge:
```python
def evaluate_faithfulness(answer, context, llm):
    prompt = f"""Is this answer faithful to the context? Answer only yes or no.

Context: {context}

Answer: {answer}

Is the answer supported by the context?"""

    response = llm.generate(prompt, max_tokens=5)
    return 1 if 'yes' in response.lower() else 0
```
Why it matters: • Detects hallucinations • Ensures answers are grounded in facts • Critical for high-stakes applications
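The yes/no judge above scores the whole answer at once. For longer answers, a common variation is to judge each claim separately and report the supported fraction; a minimal sketch, assuming the same generic llm.generate interface and a naive sentence split:

```python
def evaluate_faithfulness_per_claim(answer, context, llm):
    """Fraction of answer sentences the judge considers supported by the context."""
    claims = [s.strip() for s in answer.split('.') if s.strip()]  # naive sentence split
    if not claims:
        return 0.0

    supported = 0
    for claim in claims:
        prompt = f"""Is this statement supported by the context? Answer only yes or no.

Context: {context}

Statement: {claim}"""
        response = llm.generate(prompt, max_tokens=5)
        if 'yes' in response.lower():
            supported += 1

    return supported / len(claims)
```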
Answer Relevance
Does the answer address the question?
```python
def evaluate_relevance(question, answer, llm):
    prompt = f"""Does this answer address the question? Rate 1-5.

Question: {question}

Answer: {answer}

Relevance (1-5):"""

    score = int(llm.generate(prompt, max_tokens=5))
    return score / 5  # Normalize to 0-1
```
Context Precision
How relevant is the retrieved context?
```python
def context_precision(retrieved_chunks, question, llm):
    """
    Are the retrieved chunks relevant to the question?
    """
    relevant_count = 0

    for chunk in retrieved_chunks:
        prompt = f"""Is this context relevant to the question?

Question: {question}

Context: {chunk}

Relevant? (yes/no)"""

        response = llm.generate(prompt, max_tokens=5)
        if 'yes' in response.lower():
            relevant_count += 1

    return relevant_count / len(retrieved_chunks)
```
Context Recall
Is all necessary information in the retrieved context?
```python
def context_recall(ground_truth_answer, retrieved_context, llm):
    """
    Does the context contain all info needed for the ground truth answer?
    """
    prompt = f"""Can this answer be derived from the context?

Context: {retrieved_context}

Answer: {ground_truth_answer}

Is all information present? (yes/no)"""

    response = llm.generate(prompt, max_tokens=5)
    return 1 if 'yes' in response.lower() else 0
```
Automated Evaluation Frameworks
RAGAS
```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)
from datasets import Dataset

# Prepare dataset
dataset = Dataset.from_dict({
    'question': [q1, q2, q3],
    'answer': [a1, a2, a3],
    'contexts': [c1, c2, c3],        # Each entry is a list of retrieved chunks
    'ground_truth': [gt1, gt2, gt3]
})

# Evaluate
result = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall
    ]
)

print(result)
# {'faithfulness': 0.92, 'answer_relevancy': 0.87,
#  'context_precision': 0.81, 'context_recall': 0.89}
```
TruLens
```python
from trulens_eval import TruChain, Feedback, Tru

# Initialize
tru = Tru()

# Define feedback functions
f_groundedness = Feedback(groundedness_llm).on_output()
f_answer_relevance = Feedback(answer_relevance_llm).on_input_output()
f_context_relevance = Feedback(context_relevance_llm).on_input()

# Wrap the RAG chain
tru_rag = TruChain(
    rag_chain,
    app_id='my_rag_v1',
    feedbacks=[f_groundedness, f_answer_relevance, f_context_relevance]
)

# Use normally - metrics are collected automatically
result = tru_rag.run(query)

# View dashboard
tru.run_dashboard()
```
DeepEval
```python
from deepeval import evaluate
from deepeval.metrics import HallucinationMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Create test case
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    context=["France is a country in Europe.", "Paris is the capital of France."]
)

# Define metrics
metrics = [
    HallucinationMetric(threshold=0.9),
    AnswerRelevancyMetric(threshold=0.8)
]

# Evaluate
evaluate([test_case], metrics)
```
Creating a Test Set
Manual Curation
```python
test_cases = [
    {
        'query': 'How do I reset my password?',
        'ground_truth_answer': 'Click "Forgot Password" on the login page...',
        'relevant_docs': {'doc_123', 'doc_456'},
        'difficulty': 'easy'
    },
    {
        'query': 'What are the differences between plans?',
        'ground_truth_answer': 'Premium includes...',
        'relevant_docs': {'doc_789'},
        'difficulty': 'medium'
    },
    # ... more test cases
]
```
Best practices: • Diverse query types (simple, complex, ambiguous) • Various difficulty levels • Real user queries • Edge cases • 50-100 test cases minimum
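A small helper can keep those properties visible as the test set grows; a minimal sketch, assuming the test_cases structure shown above with a 'difficulty' field:

```python
from collections import Counter

def test_set_coverage(test_cases, min_cases=50):
    """Report test-set size and the mix of difficulty levels."""
    by_difficulty = Counter(case['difficulty'] for case in test_cases)

    return {
        'total': len(test_cases),
        'by_difficulty': dict(by_difficulty),
        'meets_minimum': len(test_cases) >= min_cases,
    }
```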
Synthetic Generation
```python
import random

def generate_test_cases(documents, llm, num_cases=50):
    test_cases = []

    for doc in random.sample(documents, num_cases):
        prompt = f"""Generate a question that can be answered using this document.

Document: {doc['text']}

Question:"""

        question = llm.generate(prompt)

        prompt_answer = f"""Answer this question using the document.

Document: {doc['text']}

Question: {question}

Answer:"""

        answer = llm.generate(prompt_answer)

        test_cases.append({
            'query': question,
            'ground_truth_answer': answer,
            'relevant_docs': {doc['id']},
            'source': 'synthetic'
        })

    return test_cases
```
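Synthetic questions are uneven in quality, so a light filtering pass before adding them to the test set is often worthwhile. A minimal sketch using simple heuristics (the thresholds are illustrative assumptions, not from this guide):

```python
def filter_synthetic_cases(test_cases, min_question_words=4):
    """Drop synthetic cases whose queries look too short or malformed."""
    kept = []
    for case in test_cases:
        question = case['query'].strip()
        if len(question.split()) < min_question_words:
            continue  # Too short to resemble a realistic user query
        if not question.endswith('?'):
            continue  # Likely an instruction or fragment rather than a question
        kept.append(case)
    return kept
```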
User Query Mining
```python
import random

# Extract from logs
def extract_queries_from_logs(log_file, sample_size=100):
    # Parse logs
    queries = parse_query_logs(log_file)

    # Filter for quality
    queries = [q for q in queries if len(q.split()) >= 3]  # Not too short

    # Sample diverse queries
    return random.sample(queries, sample_size)
```
A/B Testing
Experiment Setup
```python
import time

class ABTest:
    def __init__(self, control_system, treatment_system):
        self.control = control_system
        self.treatment = treatment_system
        self.results = {'control': [], 'treatment': []}

    def run_query(self, query, user_id):
        # Assign to variant (50/50 split)
        # Note: Python's hash() is salted per process; use a stable hash
        # (e.g. hashlib) if assignment must persist across restarts
        variant = 'treatment' if hash(user_id) % 2 else 'control'
        system = self.treatment if variant == 'treatment' else self.control

        # Get answer
        answer = system.query(query)

        # Log result
        self.results[variant].append({
            'query': query,
            'answer': answer,
            'timestamp': time.time()
        })

        return answer, variant

    def analyze(self):
        # Compare metrics between variants
        control_metrics = calculate_metrics(self.results['control'])
        treatment_metrics = calculate_metrics(self.results['treatment'])

        return {
            'control': control_metrics,
            'treatment': treatment_metrics,
            'lift': calculate_lift(control_metrics, treatment_metrics)
        }
```
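Before acting on a measured lift, check that the difference between variants is statistically meaningful. A minimal sketch using SciPy's two-sample t-test on per-query quality scores (the score lists are assumed inputs, e.g. thumbs-up values or LLM-judge scores per query):

```python
from scipy import stats

def significance_test(control_scores, treatment_scores, alpha=0.05):
    """Two-sample t-test on per-query scores from each variant."""
    t_stat, p_value = stats.ttest_ind(control_scores, treatment_scores)

    return {
        't_statistic': t_stat,
        'p_value': p_value,
        'significant': p_value < alpha,
    }
```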
Metrics to Track
Quality: • Answer accuracy • User ratings (thumbs up/down) • Follow-up question rate
Engagement: • Session duration • Queries per session • Task completion rate
Business: • Conversion rate • Support ticket deflection • Customer satisfaction (CSAT)
Continuous Evaluation
Monitoring Pipeline
```python
import time

class RAGMonitor:
    def __init__(self, rag_system, test_set, llm):
        self.system = rag_system
        self.test_set = test_set
        self.llm = llm
        self.history = []

    def run_evaluation(self):
        results = []

        for test_case in self.test_set:
            # Run RAG
            answer, contexts = self.system.query(test_case['query'])

            # Calculate metrics
            metrics = {
                'precision@5': precision_at_k(contexts, test_case['relevant_docs'], 5),
                'faithfulness': evaluate_faithfulness(answer, contexts, self.llm),
                'relevance': evaluate_relevance(test_case['query'], answer, self.llm)
            }

            results.append(metrics)

        # Aggregate
        aggregated = aggregate_metrics(results)

        # Alert if degradation vs. the previous run (before saving the new one)
        if self.detect_degradation(aggregated):
            self.send_alert(aggregated)

        # Save history
        self.history.append({
            'timestamp': time.time(),
            'metrics': aggregated
        })

        return aggregated

    def detect_degradation(self, current_metrics, threshold=0.05):
        if not self.history:
            return False

        previous = self.history[-1]['metrics']

        for metric, value in current_metrics.items():
            if value < previous[metric] - threshold:
                return True

        return False
```
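The monitor above calls aggregate_metrics, which this guide does not define; a minimal sketch that averages each metric across test cases:

```python
def aggregate_metrics(results):
    """Average each metric across all evaluated test cases."""
    if not results:
        return {}

    aggregated = {}
    for metric in results[0]:
        aggregated[metric] = sum(r[metric] for r in results) / len(results)
    return aggregated
```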
Scheduled Evaluation
```python
import time

import schedule

def daily_evaluation():
    monitor = RAGMonitor(rag_system, test_set, llm)
    results = monitor.run_evaluation()

    # Log to monitoring system
    metrics_logger.log(results)

    # Update dashboard
    update_dashboard(results)

# Run daily at 2 AM
schedule.every().day.at("02:00").do(daily_evaluation)

while True:
    schedule.run_pending()
    time.sleep(60)
```
Human Evaluation
Rating Interface
```python
def collect_human_ratings(test_cases, rag_system):
    ratings = []

    for test_case in test_cases:
        # Generate answer
        answer, contexts = rag_system.query(test_case['query'])

        # Show to human rater
        print(f"Query: {test_case['query']}")
        print(f"Answer: {answer}")
        print(f"Contexts: {contexts}")

        # Collect ratings
        correctness = int(input("Correctness (1-5): "))
        completeness = int(input("Completeness (1-5): "))
        conciseness = int(input("Conciseness (1-5): "))

        ratings.append({
            'query': test_case['query'],
            'correctness': correctness,
            'completeness': completeness,
            'conciseness': conciseness
        })

    return ratings
```
Inter-Rater Reliability
```python
from sklearn.metrics import cohen_kappa_score

def calculate_agreement(rater1_scores, rater2_scores):
    """
    Cohen's Kappa for inter-rater agreement
    """
    kappa = cohen_kappa_score(rater1_scores, rater2_scores)

    if kappa > 0.8:
        return "Strong agreement"
    elif kappa > 0.6:
        return "Moderate agreement"
    else:
        return "Weak agreement - review rating criteria"
```
Cost of Evaluation
LLM-Based Metrics Cost
```python
def estimate_evaluation_cost(num_test_cases, metrics_per_case=3):
    # GPT-4 pricing (example)
    cost_per_1k_tokens = 0.03      # Input
    tokens_per_evaluation = 500    # Typical

    total_evaluations = num_test_cases * metrics_per_case
    total_tokens = total_evaluations * tokens_per_evaluation

    cost = (total_tokens / 1000) * cost_per_1k_tokens

    return cost

# Example
cost = estimate_evaluation_cost(100)  # $4.50 for 100 test cases
```
Optimization: • Cache evaluations for unchanged outputs (see the caching sketch below) • Use smaller models (GPT-3.5 vs GPT-4) for some metrics • Batch evaluations • Run less frequently (daily vs every PR)
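A minimal sketch of the caching idea, keyed on a hash of the judge inputs so unchanged (answer, context) pairs are never re-scored (the in-memory dict and the evaluate_faithfulness call are carried over from earlier snippets; a persistent store would replace the dict in practice):

```python
import hashlib

_eval_cache = {}

def cached_faithfulness(answer, context, llm):
    """Reuse previous judge results for identical (answer, context) pairs."""
    key = hashlib.sha256(f"{answer}||{context}".encode()).hexdigest()

    if key not in _eval_cache:
        _eval_cache[key] = evaluate_faithfulness(answer, context, llm)

    return _eval_cache[key]
```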
Best Practices
• Diverse test set: Cover all query types and difficulty levels
• Track over time: Monitor metrics as the system evolves
• Component + E2E: Evaluate both the parts and the whole
• Real queries: Include actual user queries in the test set
• Automate: Run evaluation on every change
• Human validation: Periodic human review of automated metrics
• Business metrics: Connect quality to business outcomes
Next Steps
With evaluation in place, the focus shifts to deploying RAG systems to production. The next guide covers production deployment, scaling, monitoring, and operational considerations.