
RAGAS: Open-Source RAG Evaluation Framework

March 30, 2026
Ailog Team

Master RAGAS for automated RAG system evaluation. Installation, metrics, synthetic datasets, and CI/CD integration.

RAGAS (Retrieval Augmented Generation Assessment) has become the de facto standard for evaluating RAG systems. This open-source framework provides automated metrics that measure retrieval and generation quality without requiring exhaustive ground truth. This guide walks you from installation through to production integration.

Why RAGAS?

Manual RAG system evaluation is time-consuming and non-reproducible. RAGAS solves this with automatically computable metrics:

| Approach | Time/100 samples | Reproducibility | Cost |
|---|---|---|---|
| Human evaluation | 4-8 hours | Low | High |
| Manual testing | 1-2 hours | Medium | Medium |
| Automated RAGAS | 5-15 minutes | Perfect | Low |

RAGAS Advantages

  • Open-source: Auditable code, no vendor lock-in
  • LLM-as-judge: Uses an LLM to evaluate responses
  • No ground truth required: Some metrics work without reference answers
  • CI/CD integrable: Complete evaluation automation
  • Granular metrics: Precisely identifies weak points

Installation and Configuration

Basic Setup

```python
# Installation: pip install ragas langchain-openai datasets
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
    answer_similarity,
)
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
import os

# Configure evaluator LLM
os.environ["OPENAI_API_KEY"] = "sk-..."

# LLM for evaluation (gpt-4 recommended for accuracy)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
```

Advanced Configuration

```python
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

# Wrapper for using other LLMs
class CustomEvaluator:
    def __init__(self, llm_model: str = "gpt-4o-mini"):
        self.llm = LangchainLLMWrapper(
            ChatOpenAI(model=llm_model, temperature=0)
        )
        self.embeddings = LangchainEmbeddingsWrapper(
            OpenAIEmbeddings(model="text-embedding-3-small")
        )

    def configure_metrics(self):
        """Configure metrics with the custom LLM"""
        metrics = [faithfulness, answer_relevancy, context_recall]
        for metric in metrics:
            metric.llm = self.llm
            if hasattr(metric, 'embeddings'):
                metric.embeddings = self.embeddings
        return metrics
```

RAGAS Metrics in Detail

1. Faithfulness

Measures whether the generated answer is faithful to the provided context, without hallucination.

```python
from ragas.metrics import faithfulness
from datasets import Dataset

# Evaluation data
eval_data = {
    "question": ["What is the return policy?"],
    "answer": ["You have 30 days to return an unused product."],
    "contexts": [["Our return policy allows returns of any unopened product within 30 days."]],
}
dataset = Dataset.from_dict(eval_data)

# Evaluate faithfulness
result = evaluate(dataset, metrics=[faithfulness])
print(f"Faithfulness: {result['faithfulness']:.3f}")
```

Internal Mechanism:

  1. Extracts claims from the answer
  2. Verifies each claim against the context
  3. Score = supported claims / total claims
| Score | Interpretation | Action |
|---|---|---|
| > 0.9 | Excellent | Maintain |
| 0.7-0.9 | Acceptable | Improve prompts |
| < 0.7 | Problematic | Review pipeline |
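The scoring arithmetic itself is trivial once the judge LLM has labeled each claim; a minimal sketch, where the boolean verdicts stand in for the judge's output:

```python
def faithfulness_score(claim_verdicts: list[bool]) -> float:
    """Faithfulness = supported claims / total claims.

    Each boolean is a (hypothetical) judge-LLM verdict on whether one
    claim extracted from the answer is supported by the context.
    """
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)

# 3 of 4 extracted claims supported by the retrieved context
print(faithfulness_score([True, True, True, False]))  # 0.75
```

The expensive part in practice is the claim extraction and verification, which is why each evaluated sample costs several LLM calls.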

2. Answer Relevancy

Evaluates whether the answer actually addresses the question asked.

```python
from ragas.metrics import answer_relevancy

eval_data = {
    "question": ["How do I reset my password?"],
    "answer": ["To reset your password, click 'Forgot Password' on the login page, enter your email, and follow the link received."],
    "contexts": [["Login guide: The 'Forgot Password' button sends a reset email."]],
}
dataset = Dataset.from_dict(eval_data)

result = evaluate(dataset, metrics=[answer_relevancy])
print(f"Answer Relevancy: {result['answer_relevancy']:.3f}")
```

Internal Mechanism:

  1. Generates questions from the answer
  2. Compares these questions with the original (cosine similarity)
  3. Score = average similarity of generated questions
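The steps above can be sketched end-to-end with toy vectors standing in for real embeddings (in RAGAS they come from the configured embedding model):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def answer_relevancy_score(original: list[float], generated: list[list[float]]) -> float:
    """Mean cosine similarity between the original question's embedding
    and the embeddings of questions regenerated from the answer."""
    return sum(cosine(original, g) for g in generated) / len(generated)

# Toy 3-d "embeddings": one regenerated question matches, one drifts off-topic
original = [1.0, 0.0, 0.0]
generated = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(answer_relevancy_score(original, generated))  # 0.5
```

An off-topic or evasive answer yields regenerated questions far from the original, which drags the mean similarity down.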

3. Context Recall

Measures whether the retrieved context contains the information needed to answer.

```python
from ragas.metrics import context_recall

eval_data = {
    "question": ["What payment methods are accepted?"],
    "contexts": [["We accept Visa, Mastercard, and PayPal. Interest-free 3x payment is available."]],
    "ground_truth": ["Accepted payment methods are Visa, Mastercard, PayPal, and interest-free 3x payment."],
}
dataset = Dataset.from_dict(eval_data)

result = evaluate(dataset, metrics=[context_recall])
print(f"Context Recall: {result['context_recall']:.3f}")
```

4. Context Precision

Evaluates whether relevant contexts are ranked at the top of results.

```python
from ragas.metrics import context_precision

eval_data = {
    "question": ["What are the delivery times?"],
    "contexts": [[
        "Standard delivery: 3-5 business days. Express: 24h.",
        "Our customer service is available 24/7.",
        "Free shipping from 50 EUR.",
    ]],
    "ground_truth": ["Standard delivery in 3-5 days, express in 24h, free from 50 EUR."],
}
dataset = Dataset.from_dict(eval_data)

result = evaluate(dataset, metrics=[context_precision])
print(f"Context Precision: {result['context_precision']:.3f}")
```

5. Answer Correctness

Combines semantic and factual similarity for comprehensive evaluation.

```python
from ragas.metrics import answer_correctness

eval_data = {
    "question": ["What is the Premium subscription price?"],
    "answer": ["The Premium subscription costs 29.99 EUR per month."],
    "ground_truth": ["The Premium subscription is 29.99 EUR/month with annual commitment."],
}
dataset = Dataset.from_dict(eval_data)

result = evaluate(dataset, metrics=[answer_correctness])
print(f"Answer Correctness: {result['answer_correctness']:.3f}")
```

Creating an Evaluation Dataset

Automatic Generation with RAGAS

```python
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load documents
loader = DirectoryLoader("./documents/", glob="**/*.md")
documents = loader.load()

# Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)

# Generate test dataset
generator = TestsetGenerator.from_langchain(
    generator_llm=ChatOpenAI(model="gpt-4o-mini"),
    critic_llm=ChatOpenAI(model="gpt-4o-mini"),
    embeddings=OpenAIEmbeddings(),
)

testset = generator.generate_with_langchain_docs(
    documents=chunks,
    test_size=50,
    distributions={
        simple: 0.5,
        reasoning: 0.25,
        multi_context: 0.25,
    },
)

testset_df = testset.to_pandas()
print(testset_df.head())
```

Generated Dataset Structure

| Column | Description | Example |
|---|---|---|
| question | Generated question | "How to configure the API?" |
| contexts | Source chunks | ["API Doc: To configure..."] |
| ground_truth | Expected answer | "Create an API key in..." |
| evolution_type | Question type | simple, reasoning, multi_context |

Complete Evaluation Pipeline

Production-Ready Evaluation Class

```python
from dataclasses import dataclass
from datetime import datetime
import json
import os

@dataclass
class EvalConfig:
    metrics: list
    llm_model: str = "gpt-4o-mini"
    batch_size: int = 10
    save_results: bool = True
    output_dir: str = "./eval_results"

class RAGASEvaluator:
    def __init__(self, config: EvalConfig):
        self.config = config
        self.llm = ChatOpenAI(model=config.llm_model, temperature=0)
        self.embeddings = OpenAIEmbeddings()
        self._configure_metrics()

    def _configure_metrics(self):
        for metric in self.config.metrics:
            metric.llm = LangchainLLMWrapper(self.llm)
            if hasattr(metric, 'embeddings'):
                metric.embeddings = LangchainEmbeddingsWrapper(self.embeddings)

    async def evaluate_rag_system(
        self,
        rag_system,
        eval_dataset: Dataset,
        version: str = None,
    ) -> dict:
        questions = eval_dataset["question"]
        ground_truths = eval_dataset["ground_truth"]

        # Run the RAG system under test on every question
        answers = []
        contexts = []
        for question in questions:
            result = await rag_system.query(question)
            answers.append(result["answer"])
            contexts.append(result["contexts"])

        eval_data = {
            "question": questions,
            "answer": answers,
            "contexts": contexts,
            "ground_truth": ground_truths,
        }
        dataset = Dataset.from_dict(eval_data)

        results = evaluate(
            dataset,
            metrics=self.config.metrics,
            llm=self.llm,
            embeddings=self.embeddings,
        )

        output = {
            "version": version or datetime.now().isoformat(),
            "timestamp": datetime.now().isoformat(),
            "sample_count": len(questions),
            "metrics": {
                metric.name: float(results[metric.name])
                for metric in self.config.metrics
            },
            "per_sample": results.to_pandas().to_dict(orient="records"),
        }

        if self.config.save_results:
            self._save_results(output)

        return output

    def _save_results(self, results: dict):
        os.makedirs(self.config.output_dir, exist_ok=True)
        filename = f"eval_{results['version']}.json"
        filepath = os.path.join(self.config.output_dir, filename)
        with open(filepath, 'w') as f:
            json.dump(results, f, indent=2, default=str)
```

CI/CD Integration

GitHub Actions

```yaml
name: RAG Evaluation

on:
  pull_request:
    paths:
      - 'rag/**'
      - 'prompts/**'
  schedule:
    - cron: '0 6 * * 1'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: pip install ragas langchain-openai datasets

      - name: Run RAGAS evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python scripts/run_ragas_eval.py

      - name: Check thresholds
        run: |
          python -c "
          import json
          with open('eval_results/latest.json') as f:
              results = json.load(f)
          thresholds = {'faithfulness': 0.8, 'answer_relevancy': 0.75}
          for metric, threshold in thresholds.items():
              if results['metrics'].get(metric, 0) < threshold:
                  exit(1)
          "
```

Analysis and Debugging

Identifying Problematic Samples

```python
import pandas as pd

def analyze_failures(results_df: pd.DataFrame, threshold: float = 0.7) -> dict:
    analysis = {"low_faithfulness": [], "low_relevancy": [], "patterns": {}}

    low_faith = results_df[results_df["faithfulness"] < threshold]
    for _, row in low_faith.iterrows():
        analysis["low_faithfulness"].append({
            "question": row["question"],
            "answer": row["answer"],
            "score": row["faithfulness"],
        })

    low_rel = results_df[results_df["answer_relevancy"] < threshold]
    for _, row in low_rel.iterrows():
        analysis["low_relevancy"].append({
            "question": row["question"],
            "score": row["answer_relevancy"],
        })

    return analysis

results_df = pd.DataFrame(results["per_sample"])
analysis = analyze_failures(results_df)
print(f"Samples with low faithfulness: {len(analysis['low_faithfulness'])}")
```

Tracking Dashboard

```python
import json
from pathlib import Path
import pandas as pd

class EvalDashboard:
    def __init__(self, results_dir: str = "./eval_results"):
        self.results_dir = Path(results_dir)

    def load_history(self) -> pd.DataFrame:
        """Load all saved evaluation runs into one DataFrame."""
        records = []
        for file in self.results_dir.glob("eval_*.json"):
            with open(file) as f:
                data = json.load(f)
            records.append({
                "version": data["version"],
                "timestamp": data["timestamp"],
                **data["metrics"],
            })
        return pd.DataFrame(records).sort_values("timestamp")

    def generate_report(self) -> str:
        df = self.load_history()
        latest = df.iloc[-1]
        report = f"# RAG Evaluation Report\n\n## Version: {latest['version']}\n\n"
        for metric in ["faithfulness", "answer_relevancy", "context_recall"]:
            report += f"| {metric} | {latest[metric]:.3f} |\n"
        return report
```

Best Practices

Evaluation Checklist

| Step | Action | Frequency |
|---|---|---|
| Dataset | Maintain 100+ representative samples | Monthly |
| Validation | Review 10% of ground truth | Monthly |
| Thresholds | Adjust based on domain | Quarterly |
| CI/CD | Block PRs below thresholds | Each PR |
| Monitoring | Track trends | Weekly |

RAGAS Limitations

  • LLM cost: Evaluation uses LLM calls
  • Judge bias: The evaluator LLM may have its own biases
  • No UX testing: Does not measure actual user satisfaction

Automated Evaluation with Ailog

Implementing RAGAS requires configuration and maintenance. With Ailog, benefit from integrated evaluation:

  • Real-time metrics dashboard
  • Alerts on quality degradation
  • Evaluation history tracking
  • Automatic improvement suggestions
  • Pre-configured CI/CD integration

Try for free and measure your RAG quality effortlessly.

Tags

rag, evaluation, ragas, metrics, quality, open-source
