RAGAS: Open-Source RAG Evaluation Framework
Master RAGAS for automated RAG system evaluation. Installation, metrics, synthetic datasets, and CI/CD integration.
RAGAS (Retrieval Augmented Generation Assessment) has become the de facto standard for evaluating RAG systems. This open-source framework provides automated metrics that measure retrieval and generation quality without requiring exhaustive ground truth. This guide walks you through everything from installation to production integration.
Why RAGAS?
Manual RAG system evaluation is time-consuming and non-reproducible. RAGAS solves this with automatically computable metrics:
| Approach | Time/100 samples | Reproducibility | Cost |
|---|---|---|---|
| Human evaluation | 4-8 hours | Low | High |
| Manual testing | 1-2 hours | Medium | Medium |
| Automated RAGAS | 5-15 minutes | High | Low |
RAGAS Advantages
- Open-source: Auditable code, no vendor lock-in
- LLM-as-judge: Uses an LLM to evaluate responses
- No ground truth required: Some metrics work without reference answers
- CI/CD integrable: Complete evaluation automation
- Granular metrics: Precisely identifies weak points
Installation and Configuration
Basic Setup
```python
# Installation:
# pip install ragas langchain-openai datasets

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
    answer_similarity
)
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
import os

# Configure the evaluator LLM
os.environ["OPENAI_API_KEY"] = "sk-..."

# LLM for evaluation (gpt-4 recommended for accuracy)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
```
Advanced Configuration
```python
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

# Wrapper for using other LLMs
class CustomEvaluator:
    def __init__(self, llm_model: str = "gpt-4o-mini"):
        self.llm = LangchainLLMWrapper(
            ChatOpenAI(model=llm_model, temperature=0)
        )
        self.embeddings = LangchainEmbeddingsWrapper(
            OpenAIEmbeddings(model="text-embedding-3-small")
        )

    def configure_metrics(self):
        """Configure metrics with a custom LLM"""
        metrics = [faithfulness, answer_relevancy, context_recall]
        for metric in metrics:
            metric.llm = self.llm
            if hasattr(metric, 'embeddings'):
                metric.embeddings = self.embeddings
        return metrics
```
RAGAS Metrics in Detail
1. Faithfulness
Measures whether the generated answer is faithful to the provided context, without hallucination.
```python
from ragas.metrics import faithfulness
from datasets import Dataset

# Evaluation data
eval_data = {
    "question": ["What is the return policy?"],
    "answer": ["You have 30 days to return an unused product."],
    "contexts": [["Our return policy allows returns of any unopened product within 30 days."]]
}
dataset = Dataset.from_dict(eval_data)

# Evaluate faithfulness
result = evaluate(dataset, metrics=[faithfulness])
print(f"Faithfulness: {result['faithfulness']:.3f}")
```
Internal Mechanism:
- Extracts claims from the answer
- Verifies each claim against the context
- Score = supported claims / total claims
| Score | Interpretation | Action |
|---|---|---|
| > 0.9 | Excellent | Maintain |
| 0.7-0.9 | Acceptable | Improve prompts |
| < 0.7 | Problematic | Review pipeline |
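The scoring step described above reduces to simple arithmetic. A minimal sketch, where the boolean verdicts are hypothetical stand-ins for the judge LLM's per-claim decisions:

```python
def faithfulness_score(claim_verdicts: list[bool]) -> float:
    """Score = supported claims / total claims."""
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)

# Hypothetical verdicts: the judge found 3 of 4 claims supported by the context
verdicts = [True, True, True, False]
score = faithfulness_score(verdicts)  # 0.75, the "acceptable" band
```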
2. Answer Relevancy
Evaluates whether the answer actually addresses the question asked.
```python
from ragas.metrics import answer_relevancy

eval_data = {
    "question": ["How do I reset my password?"],
    "answer": ["To reset your password, click 'Forgot Password' on the login page, enter your email, and follow the link received."],
    "contexts": [["Login guide: The 'Forgot Password' button sends a reset email."]]
}
dataset = Dataset.from_dict(eval_data)

result = evaluate(dataset, metrics=[answer_relevancy])
print(f"Answer Relevancy: {result['answer_relevancy']:.3f}")
```
Internal Mechanism:
- Generates questions from the answer
- Compares these questions with the original (cosine similarity)
- Score = average similarity of generated questions
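The similarity step can be illustrated without RAGAS. A minimal sketch using plain cosine similarity, where the 2-dimensional vectors are toy stand-ins for real embedding vectors:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def answer_relevancy_score(original: list[float], generated: list[list[float]]) -> float:
    """Mean cosine similarity between the original question's embedding
    and the embeddings of questions generated from the answer."""
    return sum(cosine_similarity(original, g) for g in generated) / len(generated)

# Toy embeddings: generated questions close to the original -> score near 1
original_q = [1.0, 0.0]
generated_qs = [[0.9, 0.1], [1.0, 0.0], [0.8, 0.2]]
score = answer_relevancy_score(original_q, generated_qs)
```

If the answer drifts off-topic, the questions regenerated from it diverge from the original, and the average similarity drops.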
3. Context Recall
Measures whether the retrieved context contains the information needed to answer.
```python
from ragas.metrics import context_recall

eval_data = {
    "question": ["What payment methods are accepted?"],
    "contexts": [["We accept Visa, Mastercard, and PayPal. Interest-free 3x payment is available."]],
    "ground_truth": ["Accepted payment methods are Visa, Mastercard, PayPal, and interest-free 3x payment."]
}
dataset = Dataset.from_dict(eval_data)

result = evaluate(dataset, metrics=[context_recall])
print(f"Context Recall: {result['context_recall']:.3f}")
```
4. Context Precision
Evaluates whether relevant contexts are ranked at the top of results.
```python
from ragas.metrics import context_precision

eval_data = {
    "question": ["What are the delivery times?"],
    "contexts": [[
        "Standard delivery: 3-5 business days. Express: 24h.",
        "Our customer service is available 24/7.",
        "Free shipping from 50 EUR."
    ]],
    "ground_truth": ["Standard delivery in 3-5 days, express in 24h, free from 50 EUR."]
}
dataset = Dataset.from_dict(eval_data)

result = evaluate(dataset, metrics=[context_precision])
print(f"Context Precision: {result['context_precision']:.3f}")
```
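The rank sensitivity comes from averaging precision@k over the positions of relevant chunks. A simplified sketch of that formula, where the boolean relevance verdicts are hypothetical judge outputs rather than real RAGAS internals:

```python
def context_precision_score(relevant: list[bool]) -> float:
    """Mean of precision@k taken at each position holding a relevant chunk."""
    numerator, hits = 0.0, 0
    for k, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            numerator += hits / k  # precision@k at this position
    return numerator / sum(relevant) if any(relevant) else 0.0

# The same relevant chunk scores higher ranked first than ranked last
top_score = context_precision_score([True, False, False])
buried_score = context_precision_score([False, False, True])
```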
5. Answer Correctness
Combines semantic and factual similarity for comprehensive evaluation.
```python
from ragas.metrics import answer_correctness

eval_data = {
    "question": ["What is the Premium subscription price?"],
    "answer": ["The Premium subscription costs 29.99 EUR per month."],
    "ground_truth": ["The Premium subscription is 29.99 EUR/month with annual commitment."]
}
dataset = Dataset.from_dict(eval_data)

result = evaluate(dataset, metrics=[answer_correctness])
print(f"Answer Correctness: {result['answer_correctness']:.3f}")
```
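The combination itself is a weighted average of the two components. A sketch of that blend; the 0.75/0.25 split mirrors what RAGAS uses by default, but treat the exact weights as an assumption and check your installed version:

```python
def answer_correctness_score(factual_f1: float, semantic_similarity: float,
                             weights: tuple[float, float] = (0.75, 0.25)) -> float:
    """Weighted blend of factual overlap (F1 over claims) and embedding similarity."""
    w_factual, w_semantic = weights
    return w_factual * factual_f1 + w_semantic * semantic_similarity

# Strong factual overlap, high semantic similarity -> 0.825
score = answer_correctness_score(0.8, 0.9)
```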
Creating an Evaluation Dataset
Automatic Generation with RAGAS
```python
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load documents
loader = DirectoryLoader("./documents/", glob="**/*.md")
documents = loader.load()

# Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)

# Generate test dataset
generator = TestsetGenerator.from_langchain(
    generator_llm=ChatOpenAI(model="gpt-4o-mini"),
    critic_llm=ChatOpenAI(model="gpt-4o-mini"),
    embeddings=OpenAIEmbeddings()
)

testset = generator.generate_with_langchain_docs(
    documents=chunks,
    test_size=50,
    distributions={
        simple: 0.5,
        reasoning: 0.25,
        multi_context: 0.25
    }
)

testset_df = testset.to_pandas()
print(testset_df.head())
```
Generated Dataset Structure
| Column | Description | Example |
|---|---|---|
| question | Generated question | "How to configure the API?" |
| contexts | Source chunks | ["API Doc: To configure..."] |
| ground_truth | Expected answer | "Create an API key in..." |
| evolution_type | Question type | simple, reasoning, multi_context |
Complete Evaluation Pipeline
Production-Ready Evaluation Class
```python
from dataclasses import dataclass
from datetime import datetime
import json
import os

from datasets import Dataset
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

@dataclass
class EvalConfig:
    metrics: list
    llm_model: str = "gpt-4o-mini"
    batch_size: int = 10
    save_results: bool = True
    output_dir: str = "./eval_results"

class RAGASEvaluator:
    def __init__(self, config: EvalConfig):
        self.config = config
        self.llm = ChatOpenAI(model=config.llm_model, temperature=0)
        self.embeddings = OpenAIEmbeddings()
        self._configure_metrics()

    def _configure_metrics(self):
        for metric in self.config.metrics:
            metric.llm = LangchainLLMWrapper(self.llm)
            if hasattr(metric, 'embeddings'):
                metric.embeddings = LangchainEmbeddingsWrapper(self.embeddings)

    async def evaluate_rag_system(
        self,
        rag_system,
        eval_dataset: Dataset,
        version: str = None
    ) -> dict:
        questions = eval_dataset["question"]
        ground_truths = eval_dataset["ground_truth"]

        # Query the RAG system under test for each question
        answers = []
        contexts = []
        for question in questions:
            result = await rag_system.query(question)
            answers.append(result["answer"])
            contexts.append(result["contexts"])

        eval_data = {
            "question": questions,
            "answer": answers,
            "contexts": contexts,
            "ground_truth": ground_truths
        }
        dataset = Dataset.from_dict(eval_data)

        results = evaluate(
            dataset,
            metrics=self.config.metrics,
            llm=self.llm,
            embeddings=self.embeddings
        )

        output = {
            "version": version or datetime.now().isoformat(),
            "timestamp": datetime.now().isoformat(),
            "sample_count": len(questions),
            "metrics": {
                metric.name: float(results[metric.name])
                for metric in self.config.metrics
            },
            "per_sample": results.to_pandas().to_dict(orient="records")
        }

        if self.config.save_results:
            self._save_results(output)
        return output

    def _save_results(self, results: dict):
        os.makedirs(self.config.output_dir, exist_ok=True)
        filename = f"eval_{results['version']}.json"
        filepath = os.path.join(self.config.output_dir, filename)
        with open(filepath, 'w') as f:
            json.dump(results, f, indent=2, default=str)
```
CI/CD Integration
GitHub Actions
```yaml
name: RAG Evaluation

on:
  pull_request:
    paths:
      - 'rag/**'
      - 'prompts/**'
  schedule:
    - cron: '0 6 * * 1'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: pip install ragas langchain-openai datasets

      - name: Run RAGAS evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python scripts/run_ragas_eval.py

      - name: Check thresholds
        run: |
          python -c "
          import json
          with open('eval_results/latest.json') as f:
              results = json.load(f)
          thresholds = {'faithfulness': 0.8, 'answer_relevancy': 0.75}
          for metric, threshold in thresholds.items():
              if results['metrics'].get(metric, 0) < threshold:
                  exit(1)
          "
```
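The inline `python -c` threshold check gets unwieldy as gates accumulate. A standalone script keeps it maintainable; the script path and the `eval_results/latest.json` layout here are assumptions matching the workflow above:

```python
# Hypothetical scripts/check_thresholds.py
import json

# Quality gates; values mirror the workflow above, tune per domain
THRESHOLDS = {"faithfulness": 0.80, "answer_relevancy": 0.75}

def check_thresholds(results: dict, thresholds: dict) -> list[str]:
    """Return one human-readable failure message per metric below its threshold."""
    failures = []
    for metric, threshold in thresholds.items():
        score = results.get("metrics", {}).get(metric, 0.0)
        if score < threshold:
            failures.append(f"{metric}: {score:.3f} < {threshold:.2f}")
    return failures

def main() -> int:
    """Load the latest results and return a CI exit code (0 = pass)."""
    with open("eval_results/latest.json") as f:
        results = json.load(f)
    failures = check_thresholds(results, THRESHOLDS)
    for failure in failures:
        print(f"FAIL {failure}")
    return 1 if failures else 0
```

Wire it into CI by ending the script with `raise SystemExit(main())` and replacing the inline step with `run: python scripts/check_thresholds.py`.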
Analysis and Debugging
Identifying Problematic Samples
```python
import pandas as pd

def analyze_failures(results_df: pd.DataFrame, threshold: float = 0.7) -> dict:
    analysis = {"low_faithfulness": [], "low_relevancy": [], "patterns": {}}

    # Answers unsupported by their context (hallucination candidates)
    low_faith = results_df[results_df["faithfulness"] < threshold]
    for _, row in low_faith.iterrows():
        analysis["low_faithfulness"].append({
            "question": row["question"],
            "answer": row["answer"],
            "score": row["faithfulness"]
        })

    # Answers that drift away from the question
    low_rel = results_df[results_df["answer_relevancy"] < threshold]
    for _, row in low_rel.iterrows():
        analysis["low_relevancy"].append({
            "question": row["question"],
            "score": row["answer_relevancy"]
        })

    return analysis

# `results` is the output dict produced by RAGASEvaluator above
results_df = pd.DataFrame(results["per_sample"])
analysis = analyze_failures(results_df)
print(f"Samples with low faithfulness: {len(analysis['low_faithfulness'])}")
```
Tracking Dashboard
```python
import json
from pathlib import Path

import pandas as pd

class EvalDashboard:
    def __init__(self, results_dir: str = "./eval_results"):
        self.results_dir = Path(results_dir)

    def load_history(self) -> pd.DataFrame:
        records = []
        for file in self.results_dir.glob("eval_*.json"):
            with open(file) as f:
                data = json.load(f)
            records.append({
                "version": data["version"],
                "timestamp": data["timestamp"],
                **data["metrics"]
            })
        return pd.DataFrame(records).sort_values("timestamp")

    def generate_report(self) -> str:
        df = self.load_history()
        latest = df.iloc[-1]
        report = f"# RAG Evaluation Report\n\n## Version: {latest['version']}\n\n"
        for metric in ["faithfulness", "answer_relevancy", "context_recall"]:
            report += f"| {metric} | {latest[metric]:.3f} |\n"
        return report
```
Best Practices
Evaluation Checklist
| Step | Action | Frequency |
|---|---|---|
| Dataset | Maintain 100+ representative samples | Monthly |
| Validation | Review 10% of ground truth | Monthly |
| Thresholds | Adjust based on domain | Quarterly |
| CI/CD | Block PRs below thresholds | Each PR |
| Monitoring | Track trends | Weekly |
RAGAS Limitations
- LLM cost: every metric makes one or more judge-LLM calls per sample, so large evaluation runs add up
- Judge bias: The evaluator LLM may have its own biases
- No UX testing: Does not measure actual user satisfaction
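The cost point is easy to quantify with back-of-envelope arithmetic. A sketch with assumed per-call token counts and pricing; both figures are illustrative, not current rates:

```python
def estimate_eval_cost(samples: int, metrics: int, calls_per_metric: int = 2,
                       tokens_per_call: int = 1500,
                       price_per_1k_tokens: float = 0.0006) -> float:
    """Rough cost: each metric makes one or more judge calls per sample."""
    total_calls = samples * metrics * calls_per_metric
    return total_calls * tokens_per_call / 1000 * price_per_1k_tokens

# 100 samples x 4 metrics x 2 calls = 800 judge calls for one run
cost = estimate_eval_cost(samples=100, metrics=4)
```

Even with a cheap judge model this scales linearly with dataset size and metric count, which is why sampled or scheduled runs are common in CI.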
Going Further
- Human Evaluation - Complement RAGAS with human review
- RAG Metrics - Metrics overview
- RAG Generation - Improve responses
Automated Evaluation with Ailog
Implementing RAGAS requires configuration and maintenance. With Ailog, benefit from integrated evaluation:
- Real-time metrics dashboard
- Alerts on quality degradation
- Evaluation history tracking
- Automatic improvement suggestions
- Pre-configured CI/CD integration
Try for free and measure your RAG quality effortlessly.