RAGAS: Open-Source RAG Evaluation Framework
Master RAGAS for automated RAG system evaluation. Installation, metrics, synthetic datasets, and CI/CD integration.
RAGAS (Retrieval Augmented Generation Assessment) has become the de facto standard for evaluating RAG systems. This open-source framework provides automated metrics that measure retrieval and generation quality without requiring exhaustive ground truth. This guide walks you through everything from installation to production integration.
Why RAGAS?
Manual RAG system evaluation is time-consuming and non-reproducible. RAGAS solves this with automatically computable metrics:
| Approach | Time/100 samples | Reproducibility | Cost |
|---|---|---|---|
| Human evaluation | 4-8 hours | Low | High |
| Manual testing | 1-2 hours | Medium | Medium |
| Automated RAGAS | 5-15 minutes | High | Low |
RAGAS Advantages
- Open-source: Auditable code, no vendor lock-in
- LLM-as-judge: Uses an LLM to evaluate responses
- No ground truth required: Some metrics work without reference answers
- CI/CD integrable: Complete evaluation automation
- Granular metrics: Precisely identifies weak points
Installation and Configuration
Basic Setup
```python
# Installation:
# pip install ragas langchain-openai datasets

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
    answer_similarity
)
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
import os

# Configure the evaluator LLM
os.environ["OPENAI_API_KEY"] = "sk-..."

# LLM for evaluation (gpt-4 recommended for accuracy)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
```
Advanced Configuration
```python
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

# Wrapper for using other LLMs
class CustomEvaluator:
    def __init__(self, llm_model: str = "gpt-4o-mini"):
        self.llm = LangchainLLMWrapper(
            ChatOpenAI(model=llm_model, temperature=0)
        )
        self.embeddings = LangchainEmbeddingsWrapper(
            OpenAIEmbeddings(model="text-embedding-3-small")
        )

    def configure_metrics(self):
        """Configure metrics with a custom LLM"""
        metrics = [faithfulness, answer_relevancy, context_recall]
        for metric in metrics:
            metric.llm = self.llm
            if hasattr(metric, 'embeddings'):
                metric.embeddings = self.embeddings
        return metrics
```
RAGAS Metrics in Detail
1. Faithfulness
Measures whether the generated answer is faithful to the provided context, without hallucination.
```python
from ragas.metrics import faithfulness
from datasets import Dataset

# Evaluation data
eval_data = {
    "question": ["What is the return policy?"],
    "answer": ["You have 30 days to return an unused product."],
    "contexts": [["Our return policy allows returns of any unopened product within 30 days."]]
}
dataset = Dataset.from_dict(eval_data)

# Evaluate faithfulness
result = evaluate(dataset, metrics=[faithfulness])
print(f"Faithfulness: {result['faithfulness']:.3f}")
```
Internal Mechanism:
- Extracts claims from the answer
- Verifies each claim against the context
- Score = supported claims / total claims
| Score | Interpretation | Action |
|---|---|---|
| > 0.9 | Excellent | Maintain |
| 0.7-0.9 | Acceptable | Improve prompts |
| < 0.7 | Problematic | Review pipeline |
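The scoring step described above reduces to simple arithmetic. A minimal sketch, where the boolean verdicts are hypothetical stand-ins for the judge LLM's per-claim decisions:

```python
def faithfulness_score(claim_verdicts: list[bool]) -> float:
    """Score = supported claims / total claims."""
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)

# Hypothetical verdicts: the judge found 3 of 4 claims supported by the context
verdicts = [True, True, True, False]
score = faithfulness_score(verdicts)  # 0.75, the "acceptable" band
```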
2. Answer Relevancy
Evaluates whether the answer actually addresses the question asked.
```python
from ragas.metrics import answer_relevancy

eval_data = {
    "question": ["How do I reset my password?"],
    "answer": ["To reset your password, click 'Forgot Password' on the login page, enter your email, and follow the link received."],
    "contexts": [["Login guide: The 'Forgot Password' button sends a reset email."]]
}
dataset = Dataset.from_dict(eval_data)

result = evaluate(dataset, metrics=[answer_relevancy])
print(f"Answer Relevancy: {result['answer_relevancy']:.3f}")
```
Internal Mechanism:
- Generates questions from the answer
- Compares these questions with the original (cosine similarity)
- Score = average similarity of generated questions
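The similarity step can be illustrated without RAGAS. A minimal sketch using plain cosine similarity, where the 2-dimensional vectors are toy stand-ins for real embedding vectors:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def answer_relevancy_score(original: list[float], generated: list[list[float]]) -> float:
    """Mean cosine similarity between the original question's embedding
    and the embeddings of questions generated from the answer."""
    return sum(cosine_similarity(original, g) for g in generated) / len(generated)

# Toy embeddings: generated questions close to the original -> score near 1
original_q = [1.0, 0.0]
generated_qs = [[0.9, 0.1], [1.0, 0.0], [0.8, 0.2]]
score = answer_relevancy_score(original_q, generated_qs)
```

If the answer drifts off-topic, the questions regenerated from it diverge from the original, and the average similarity drops.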
3. Context Recall
Measures whether the retrieved context contains the information needed to answer.
```python
from ragas.metrics import context_recall

eval_data = {
    "question": ["What payment methods are accepted?"],
    "contexts": [["We accept Visa, Mastercard, and PayPal. Interest-free 3x payment is available."]],
    "ground_truth": ["Accepted payment methods are Visa, Mastercard, PayPal, and interest-free 3x payment."]
}
dataset = Dataset.from_dict(eval_data)

result = evaluate(dataset, metrics=[context_recall])
print(f"Context Recall: {result['context_recall']:.3f}")
```
4. Context Precision
Evaluates whether relevant contexts are ranked at the top of results.
```python
from ragas.metrics import context_precision

eval_data = {
    "question": ["What are the delivery times?"],
    "contexts": [[
        "Standard delivery: 3-5 business days. Express: 24h.",
        "Our customer service is available 24/7.",
        "Free shipping from 50 EUR."
    ]],
    "ground_truth": ["Standard delivery in 3-5 days, express in 24h, free from 50 EUR."]
}
dataset = Dataset.from_dict(eval_data)

result = evaluate(dataset, metrics=[context_precision])
print(f"Context Precision: {result['context_precision']:.3f}")
```
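The rank sensitivity comes from averaging precision@k over the positions of relevant chunks. A simplified sketch of that formula, where the boolean relevance verdicts are hypothetical judge outputs rather than real RAGAS internals:

```python
def context_precision_score(relevant: list[bool]) -> float:
    """Mean of precision@k taken at each position holding a relevant chunk."""
    numerator, hits = 0.0, 0
    for k, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            numerator += hits / k  # precision@k at this position
    return numerator / sum(relevant) if any(relevant) else 0.0

# The same relevant chunk scores higher ranked first than ranked last
top_score = context_precision_score([True, False, False])
buried_score = context_precision_score([False, False, True])
```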
5. Answer Correctness
Combines semantic and factual similarity for comprehensive evaluation.
```python
from ragas.metrics import answer_correctness

eval_data = {
    "question": ["What is the Premium subscription price?"],
    "answer": ["The Premium subscription costs 29.99 EUR per month."],
    "ground_truth": ["The Premium subscription is 29.99 EUR/month with annual commitment."]
}
dataset = Dataset.from_dict(eval_data)

result = evaluate(dataset, metrics=[answer_correctness])
print(f"Answer Correctness: {result['answer_correctness']:.3f}")
```
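The combination itself is a weighted average of the two components. A sketch of that blend; the 0.75/0.25 split mirrors what RAGAS uses by default, but treat the exact weights as an assumption and check your installed version:

```python
def answer_correctness_score(factual_f1: float, semantic_similarity: float,
                             weights: tuple[float, float] = (0.75, 0.25)) -> float:
    """Weighted blend of factual overlap (F1 over claims) and embedding similarity."""
    w_factual, w_semantic = weights
    return w_factual * factual_f1 + w_semantic * semantic_similarity

# Strong factual overlap, high semantic similarity -> 0.825
score = answer_correctness_score(0.8, 0.9)
```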
Creating an Evaluation Dataset
Automatic Generation with RAGAS
```python
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load documents
loader = DirectoryLoader("./documents/", glob="**/*.md")
documents = loader.load()

# Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)

# Generate test dataset
generator = TestsetGenerator.from_langchain(
    generator_llm=ChatOpenAI(model="gpt-4o-mini"),
    critic_llm=ChatOpenAI(model="gpt-4o-mini"),
    embeddings=OpenAIEmbeddings()
)

testset = generator.generate_with_langchain_docs(
    documents=chunks,
    test_size=50,
    distributions={
        simple: 0.5,
        reasoning: 0.25,
        multi_context: 0.25
    }
)

testset_df = testset.to_pandas()
print(testset_df.head())
```
Generated Dataset Structure
| Column | Description | Example |
|---|---|---|
| question | Generated question | "How to configure the API?" |
| contexts | Source chunks | ["API Doc: To configure..."] |
| ground_truth | Expected answer | "Create an API key in..." |
| evolution_type | Question type | simple, reasoning, multi_context |
Complete Evaluation Pipeline
Production-Ready Evaluation Class
```python
from dataclasses import dataclass
from datetime import datetime
import json
import os

from datasets import Dataset
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

@dataclass
class EvalConfig:
    metrics: list
    llm_model: str = "gpt-4o-mini"
    batch_size: int = 10
    save_results: bool = True
    output_dir: str = "./eval_results"

class RAGASEvaluator:
    def __init__(self, config: EvalConfig):
        self.config = config
        self.llm = ChatOpenAI(model=config.llm_model, temperature=0)
        self.embeddings = OpenAIEmbeddings()
        self._configure_metrics()

    def _configure_metrics(self):
        for metric in self.config.metrics:
            metric.llm = LangchainLLMWrapper(self.llm)
            if hasattr(metric, 'embeddings'):
                metric.embeddings = LangchainEmbeddingsWrapper(self.embeddings)

    async def evaluate_rag_system(
        self,
        rag_system,
        eval_dataset: Dataset,
        version: str = None
    ) -> dict:
        questions = eval_dataset["question"]
        ground_truths = eval_dataset["ground_truth"]

        # Query the RAG system under test for each question
        answers = []
        contexts = []
        for question in questions:
            result = await rag_system.query(question)
            answers.append(result["answer"])
            contexts.append(result["contexts"])

        eval_data = {
            "question": questions,
            "answer": answers,
            "contexts": contexts,
            "ground_truth": ground_truths
        }
        dataset = Dataset.from_dict(eval_data)

        results = evaluate(
            dataset,
            metrics=self.config.metrics,
            llm=self.llm,
            embeddings=self.embeddings
        )

        output = {
            "version": version or datetime.now().isoformat(),
            "timestamp": datetime.now().isoformat(),
            "sample_count": len(questions),
            "metrics": {
                metric.name: float(results[metric.name])
                for metric in self.config.metrics
            },
            "per_sample": results.to_pandas().to_dict(orient="records")
        }

        if self.config.save_results:
            self._save_results(output)
        return output

    def _save_results(self, results: dict):
        os.makedirs(self.config.output_dir, exist_ok=True)
        filename = f"eval_{results['version']}.json"
        filepath = os.path.join(self.config.output_dir, filename)
        with open(filepath, 'w') as f:
            json.dump(results, f, indent=2, default=str)
```
CI/CD Integration
GitHub Actions
```yaml
name: RAG Evaluation

on:
  pull_request:
    paths:
      - 'rag/**'
      - 'prompts/**'
  schedule:
    - cron: '0 6 * * 1'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: pip install ragas langchain-openai datasets

      - name: Run RAGAS evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python scripts/run_ragas_eval.py

      - name: Check thresholds
        run: |
          python -c "
          import json
          with open('eval_results/latest.json') as f:
              results = json.load(f)
          thresholds = {'faithfulness': 0.8, 'answer_relevancy': 0.75}
          for metric, threshold in thresholds.items():
              if results['metrics'].get(metric, 0) < threshold:
                  exit(1)
          "
```
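The inline `python -c` threshold check gets unwieldy as gates accumulate. A standalone script keeps it maintainable; the script path and the `eval_results/latest.json` layout here are assumptions matching the workflow above:

```python
# Hypothetical scripts/check_thresholds.py
import json

# Quality gates; values mirror the workflow above, tune per domain
THRESHOLDS = {"faithfulness": 0.80, "answer_relevancy": 0.75}

def check_thresholds(results: dict, thresholds: dict) -> list[str]:
    """Return one human-readable failure message per metric below its threshold."""
    failures = []
    for metric, threshold in thresholds.items():
        score = results.get("metrics", {}).get(metric, 0.0)
        if score < threshold:
            failures.append(f"{metric}: {score:.3f} < {threshold:.2f}")
    return failures

def main() -> int:
    """Load the latest results and return a CI exit code (0 = pass)."""
    with open("eval_results/latest.json") as f:
        results = json.load(f)
    failures = check_thresholds(results, THRESHOLDS)
    for failure in failures:
        print(f"FAIL {failure}")
    return 1 if failures else 0
```

Wire it into CI by ending the script with `raise SystemExit(main())` and replacing the inline step with `run: python scripts/check_thresholds.py`.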
Analysis and Debugging
Identifying Problematic Samples
```python
import pandas as pd

def analyze_failures(results_df: pd.DataFrame, threshold: float = 0.7) -> dict:
    analysis = {"low_faithfulness": [], "low_relevancy": [], "patterns": {}}

    # Answers unsupported by their context (hallucination candidates)
    low_faith = results_df[results_df["faithfulness"] < threshold]
    for _, row in low_faith.iterrows():
        analysis["low_faithfulness"].append({
            "question": row["question"],
            "answer": row["answer"],
            "score": row["faithfulness"]
        })

    # Answers that drift away from the question
    low_rel = results_df[results_df["answer_relevancy"] < threshold]
    for _, row in low_rel.iterrows():
        analysis["low_relevancy"].append({
            "question": row["question"],
            "score": row["answer_relevancy"]
        })

    return analysis

# `results` is the output dict produced by RAGASEvaluator above
results_df = pd.DataFrame(results["per_sample"])
analysis = analyze_failures(results_df)
print(f"Samples with low faithfulness: {len(analysis['low_faithfulness'])}")
```
Tracking Dashboard
```python
import json
from pathlib import Path

import pandas as pd

class EvalDashboard:
    def __init__(self, results_dir: str = "./eval_results"):
        self.results_dir = Path(results_dir)

    def load_history(self) -> pd.DataFrame:
        records = []
        for file in self.results_dir.glob("eval_*.json"):
            with open(file) as f:
                data = json.load(f)
            records.append({
                "version": data["version"],
                "timestamp": data["timestamp"],
                **data["metrics"]
            })
        return pd.DataFrame(records).sort_values("timestamp")

    def generate_report(self) -> str:
        df = self.load_history()
        latest = df.iloc[-1]
        report = f"# RAG Evaluation Report\n\n## Version: {latest['version']}\n\n"
        for metric in ["faithfulness", "answer_relevancy", "context_recall"]:
            report += f"| {metric} | {latest[metric]:.3f} |\n"
        return report
```
Best Practices
Evaluation Checklist
| Step | Action | Frequency |
|---|---|---|
| Dataset | Maintain 100+ representative samples | Monthly |
| Validation | Review 10% of ground truth | Monthly |
| Thresholds | Adjust based on domain | Quarterly |
| CI/CD | Block PRs below thresholds | Each PR |
| Monitoring | Track trends | Weekly |
RAGAS Limitations
- LLM cost: every metric makes one or more judge-LLM calls per sample, so large evaluation runs add up
- Judge bias: The evaluator LLM may have its own biases
- No UX testing: Does not measure actual user satisfaction
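The cost point is easy to quantify with back-of-envelope arithmetic. A sketch with assumed per-call token counts and pricing; both figures are illustrative, not current rates:

```python
def estimate_eval_cost(samples: int, metrics: int, calls_per_metric: int = 2,
                       tokens_per_call: int = 1500,
                       price_per_1k_tokens: float = 0.0006) -> float:
    """Rough cost: each metric makes one or more judge calls per sample."""
    total_calls = samples * metrics * calls_per_metric
    return total_calls * tokens_per_call / 1000 * price_per_1k_tokens

# 100 samples x 4 metrics x 2 calls = 800 judge calls for one run
cost = estimate_eval_cost(samples=100, metrics=4)
```

Even with a cheap judge model this scales linearly with dataset size and metric count, which is why sampled or scheduled runs are common in CI.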
Going Further
- Human Evaluation - Complement RAGAS with human review
- RAG Metrics - Metrics overview
- RAG Generation - Improve responses
Automated Evaluation with Ailog
Implementing RAGAS requires configuration and maintenance. With Ailog, benefit from integrated evaluation:
- Real-time metrics dashboard
- Alerts on quality degradation
- Evaluation history tracking
- Automatic improvement suggestions
- Pre-configured CI/CD integration
Try for free and measure your RAG quality effortlessly.