RAG Generation: Choosing and Optimizing Your LLM
Complete guide to selecting and configuring your LLM in a RAG system: prompting, temperature, tokens, and response optimization.
- Author: Ailog Team
- Reading time: 20 min read
- Level: intermediate
The generation phase is when your RAG system transforms retrieved documents into a coherent and useful response. The choice of LLM and its configuration determine the final quality of the user experience. This guide walks you through selecting, configuring, and optimizing your generation model.
The Role of the LLM in a RAG System
Unlike a standalone LLM, the LLM in a RAG system doesn't generate from its internal knowledge. It synthesizes, reformulates, and structures information from retrieval.
What the RAG LLM Does
• Synthesis: Condense multiple documents into a concise response
• Reformulation: Adapt technical language to the user's level
• Structuring: Organize information logically
• Contextualization: Relate the answer to the question asked
• Insufficiency Detection: Identify when documents can't answer the question
What the RAG LLM Should NOT Do
• Invent information absent from documents (hallucination)
• Contradict provided sources
• Answer out-of-scope questions without signaling it
LLM Comparison for RAG
Proprietary Models
| Model | Strengths | Weaknesses | Cost (per 1M tokens) | RAG Use |
|-------|-----------|------------|----------------------|---------|
| GPT-4o | Versatile, strong reasoning | High price | ~$5 input / $15 output | Premium production |
| GPT-4o-mini | Good value for money | Less capable on complex tasks | ~$0.15 / $0.60 | Standard production |
| Claude 3.5 Sonnet | Excellent instruction following | 200k context sometimes underused | ~$3 / $15 | Premium production |
| Claude 3 Haiku | Ultra fast, economical | Less nuanced | ~$0.25 / $1.25 | High volume |
| Gemini 1.5 Pro | 1M-token context | API sometimes unstable | ~$1.25 / $5 | Very long documents |
Open Source Models
| Model | Parameters | VRAM Required | RAG Performance | Self-hostable |
|-------|------------|---------------|-----------------|---------------|
| Llama 3.1 70B | 70B | 48GB+ | Excellent | Yes (dedicated server) |
| Llama 3.1 8B | 8B | 8GB | Good | Yes (consumer GPU) |
| Mistral 7B | 7B | 6GB | Good | Yes |
| Mixtral 8x7B | 46.7B (MoE) | 32GB | Very good | Yes |
| Qwen2 72B | 72B | 48GB+ | Excellent | Yes |
Selection Criteria
```python
def choose_llm(
    monthly_budget: float,
    request_volume: int,
    quality_requirement: str,   # "standard" or "premium"
    hosting_constraint: str,    # "cloud", "sovereign", "on-premise"
    languages: list[str],
) -> str:
    tokens_per_request = 2000  # average RAG estimate
    monthly_tokens = request_volume * tokens_per_request

    if hosting_constraint == "on-premise":
        if quality_requirement == "premium":
            return "Llama 3.1 70B or Qwen2 72B"
        return "Llama 3.1 8B or Mistral 7B"

    if hosting_constraint == "sovereign":
        return "Mistral Large via Mistral API (EU hosted)"

    # Cost estimates for the cheapest cloud tiers
    gpt4o_mini_cost = monthly_tokens * 0.15 / 1_000_000
    claude_haiku_cost = monthly_tokens * 0.25 / 1_000_000

    if monthly_budget < gpt4o_mini_cost:
        return "Self-hosted open source model"

    if quality_requirement == "premium":
        if "fr" in languages:
            return "Claude 3.5 Sonnet (excellent in French)"
        return "GPT-4o"

    return "GPT-4o-mini or Claude 3 Haiku"
```
Configuring the System Prompt
The system prompt defines the LLM's behavior in your RAG application. It's the most critical configuration element.
Structure of an Effective RAG Prompt
```python
SYSTEM_PROMPT = """You are an AI assistant for {company_name}, specialized in {domain}.

STRICT RULES:
1. Answer ONLY from the documents provided in the context
2. If the information is not in the documents, say "I don't have this information in my knowledge base"
3. Never invent information
4. Cite your sources when relevant

RESPONSE STYLE:
• Tone: {tone} (professional/friendly/technical)
• Length: {length} (concise/detailed)
• Format: Use bullet points for more than 3 items
• Language: Respond in the same language as the question

BUSINESS CONTEXT:
{specific_context}

AVAILABLE DOCUMENTS:
{retrieved_documents}

USER QUESTION:
{question}
"""
```
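To make the template concrete, here is a minimal sketch of how it might be filled and sent, assuming the OpenAI Python client; the company details, document text, and question are placeholder values:

```python
from openai import OpenAI

client = OpenAI()

# Placeholder values; in practice the documents come from your retriever
filled_prompt = SYSTEM_PROMPT.format(
    company_name="Acme",
    domain="billing support",
    tone="professional",
    length="concise",
    specific_context="Acme sells monthly SaaS subscriptions.",
    retrieved_documents="SOURCE 1: Refunds are processed within 5 business days.",
    question="How long do refunds take?",
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "system", "content": filled_prompt}],
    temperature=0.2,
)
print(response.choices[0].message.content)
```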
Prompt Examples by Use Case
E-commerce Customer Support
```python
ECOMMERCE_PROMPT = """You are the virtual assistant for {store_name}.

OBJECTIVES:
• Help customers with their orders, returns, and product questions
• Maximize customer satisfaction
• Direct to human support when necessary

RULES:
1. Base your answers ONLY on the provided documents
2. For questions about a specific order, ask for the order number
3. Never share personal information about other customers
4. If a request exceeds your capabilities, offer to contact customer service

TONE: Friendly and helpful, like a store salesperson

DOCUMENTS:
{context}

QUESTION:
{question}
"""
```
Technical Knowledge Base
```python
TECH_KB_PROMPT = """You are a technical expert for {product}.

OBJECTIVES:
• Provide precise technical answers
• Include code examples when relevant
• Guide to the official documentation

RULES:
1. Answer only from the provided documentation
2. Specify the product version if mentioned
3. Flag known limitations or documented bugs
4. Suggest alternatives if the requested solution doesn't exist

FORMAT:
• Start with a direct answer
• Then detail if necessary
• End with a code example if applicable

DOCUMENTATION:
{context}

QUESTION:
{question}
"""
```
Generation Parameters
Temperature
Temperature controls the trade-off between creativity and coherence in responses.
```python
# Low temperature (0.0 - 0.3): deterministic, factual responses
# Ideal for: customer support, FAQ, technical documentation
temperature = 0.1

# Medium temperature (0.4 - 0.7): balance between creativity and coherence
# Ideal for: general chat, reformulation
temperature = 0.5

# High temperature (0.8 - 1.0): varied, creative responses
# Ideal for: brainstorming, content generation. AVOID in RAG (hallucination risk)
temperature = 0.9
```
Top-p (Nucleus Sampling)
Limits sampling to the smallest set of tokens whose cumulative probability reaches the top_p threshold.
```python
# top_p = 0.9: consider tokens up to 90% cumulative probability
# Lower = more restrictive = more coherent; higher = more varied
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.2,
    top_p=0.9,  # combined with a low temperature for RAG
)
```
Max Tokens
Limits the generated response length.
```python
# Token budget calculation
def calculate_max_tokens(
    context_length: int,
    model_limit: int = 128000,  # GPT-4o context window
    reserve_output: int = 2000,
) -> int:
    """Ensure enough room is left for the response."""
    prompt_tokens = context_length + 500  # + instructions overhead
    available = model_limit - prompt_tokens
    return min(available, reserve_output)
```
Frequency and Presence Penalty
Control repetition in responses.
```python
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    frequency_penalty=0.3,  # penalize tokens already used
    presence_penalty=0.1,   # encourage new topics
)
```
Optimizing Context Provided to the LLM
Formatting Retrieved Documents
```python
def format_context(documents: list[dict], max_tokens: int = 4000) -> str:
    """Format documents for the LLM context."""
    formatted_docs = []
    total_tokens = 0

    for i, doc in enumerate(documents):
        # Token estimation (1 token ≈ 4 characters in English)
        doc_tokens = len(doc["text"]) // 4

        if total_tokens + doc_tokens > max_tokens:
            break

        formatted = f"""
---
SOURCE {i+1}: {doc.get("source", "Document")}
RELEVANCE SCORE: {doc.get("score", "N/A")}

{doc["text"]}
---
"""
        formatted_docs.append(formatted)
        total_tokens += doc_tokens

    return "\n".join(formatted_docs)
```
Document Order
Order impacts LLM attention. Two main strategies:
```python
def order_documents(documents: list[dict], strategy: str = "best_first") -> list:
    """Order documents according to the chosen strategy."""
    if strategy == "best_first":
        # Most relevant first
        return sorted(documents, key=lambda x: x["score"], reverse=True)

    elif strategy == "lost_in_middle":
        # Best documents at the beginning and end, weakest in the middle
        # (counters the "lost in the middle" effect)
        sorted_docs = sorted(documents, key=lambda x: x["score"])  # ascending: weakest first
        reordered = []
        for i, doc in enumerate(sorted_docs):
            if i % 2 == 0:
                reordered.insert(0, doc)
            else:
                reordered.append(doc)
        return reordered

    return documents
```
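The two helpers above can be chained; a short usage sketch with made-up documents and scores:

```python
# Made-up retrieval results with relevance scores
documents = [
    {"text": "Refunds are processed within 5 business days.", "source": "faq.md", "score": 0.91},
    {"text": "Returns are accepted within 30 days of delivery.", "source": "returns.md", "score": 0.84},
    {"text": "Shipping is free above 50 EUR.", "source": "shipping.md", "score": 0.42},
]

# Reorder to counter "lost in the middle", then format within the token budget
ordered = order_documents(documents, strategy="lost_in_middle")
context = format_context(ordered, max_tokens=4000)
```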
Context Compression
For long documents, summarize before sending to the final LLM:
```python
async def compress_context(
    documents: list[str],
    query: str,
    llm_compressor,
) -> str:
    """Compress documents while keeping relevant info."""
    compression_prompt = f"""
Extract the information relevant to answering this question:

Question: {query}

Documents:
{chr(10).join(documents)}

Summary of relevant information:
"""

    compressed = await llm_compressor.generate(compression_prompt)
    return compressed
```
Hallucination Management
Hallucinations are the main risk in RAG. Here's how to minimize them.
Hallucination Detection
```python
import json


def detect_hallucination(
    response: str,
    context: str,
    query: str,
    verifier_llm,
) -> dict:
    """Check whether the response contains hallucinations."""
    verification_prompt = f"""
Analyze this response and verify whether it is faithful to the provided context.

PROVIDED CONTEXT:
{context}

QUESTION:
{query}

RESPONSE TO VERIFY:
{response}

ANALYSIS:
1. Does the response contain claims not present in the context?
2. Does the response contradict the context?
3. Faithfulness score (0-100):

Respond in JSON: {{"hallucinations": [...], "contradictions": [...], "score": X}}
"""

    result = verifier_llm.generate(verification_prompt)
    return json.loads(result)
```
Anti-Hallucination Strategies
```python
class RAGGenerator:
    def __init__(self, llm, verifier_llm=None):
        self.llm = llm
        self.verifier = verifier_llm

    async def generate_with_verification(
        self,
        query: str,
        context: str,
        max_retries: int = 2,
    ) -> dict:
        """Generate a response with anti-hallucination verification."""
        for attempt in range(max_retries + 1):
            # Generation
            response = await self.llm.generate(
                self._build_prompt(query, context),
                temperature=0.1,  # low to reduce hallucinations
            )

            # Verification if a verifier is available
            if self.verifier:
                check = detect_hallucination(response, context, query, self.verifier)

                if check["score"] >= 80:
                    return {
                        "response": response,
                        "confidence": check["score"],
                        "verified": True,
                    }

                # Retry with a stricter instruction
                context = self._add_anti_hallucination_instruction(context, check)
            else:
                return {"response": response, "verified": False}

        # Fallback: cautious response
        return {
            "response": "I'm not certain I can precisely answer this question with the available information.",
            "confidence": 0,
            "verified": True,
        }
```
Streaming and Latency
For a smooth user experience, streaming is essential.
Streaming Implementation
```python
from openai import OpenAI

client = OpenAI()


async def stream_response(prompt: str):
    """Stream the response token by token."""
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )

    for chunk in stream:
        if chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content


# Usage with FastAPI
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()


class ChatRequest(BaseModel):
    # Request model used by the endpoint below
    query: str
    context: str


@app.post("/chat")
async def chat(request: ChatRequest):
    # build_rag_prompt: prompt assembly helper (see "Configuring the System Prompt")
    prompt = build_rag_prompt(request.query, request.context)

    return StreamingResponse(
        stream_response(prompt),
        media_type="text/event-stream",
    )
```
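On the client side, the stream can be consumed incrementally as chunks arrive. A minimal sketch using httpx; the endpoint URL and JSON payload mirror the FastAPI route above and are otherwise assumptions:

```python
import httpx


async def consume_stream(query: str, context: str) -> None:
    """Print tokens as the /chat endpoint streams them."""
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream(
            "POST",
            "http://localhost:8000/chat",  # assumed local endpoint
            json={"query": query, "context": context},
        ) as response:
            async for chunk in response.aiter_text():
                print(chunk, end="", flush=True)


# await consume_stream("How long do refunds take?", context)  # inside an async context
```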
Latency Optimization
```python
import asyncio
from functools import lru_cache


class OptimizedRAG:
    def __init__(self):
        self.retriever = Retriever()
        self.llm = LLM()

    async def query(self, question: str) -> str:
        # Parallelize retrieval and prompt preparation
        retrieval_task = asyncio.create_task(
            self.retriever.search(question)
        )

        # Pre-compute the static parts of the prompt
        base_prompt = self._get_cached_prompt()

        # Wait for retrieval
        documents = await retrieval_task

        # Build the full prompt and send it to the LLM
        full_prompt = base_prompt.format(
            context=format_context(documents),
            question=question,
        )

        return await self.llm.generate(full_prompt)

    @lru_cache(maxsize=1)
    def _get_cached_prompt(self) -> str:
        return self._load_system_prompt()
```
Generation Quality Metrics
Faithfulness
Measures if the response is faithful to the provided context.
```python
def calculate_faithfulness(
    response: str,
    context: str,
    evaluator_llm,
) -> float:
    """Faithfulness score between 0 and 1."""
    prompt = f"""
Evaluate the faithfulness of this response to the context.

Context: {context}
Response: {response}

For each claim in the response, verify whether it is:
• Supported by the context (1 point)
• Not mentioned in the context (0 points)
• Contradicted by the context (-1 point)

Final score (0-1):
"""

    result = evaluator_llm.generate(prompt)
    return float(result)
```
Answer Relevancy
Measures if the response properly answers the question.
```python
def calculate_relevancy(
    question: str,
    response: str,
    evaluator_llm,
) -> float:
    """Relevancy score between 0 and 1."""
    prompt = f"""
Evaluate if this response properly answers the question asked.

Question: {question}
Response: {response}

Criteria:
• Does the response directly address the question? (0-0.4)
• Is the response complete? (0-0.3)
• Is the response concise and clear? (0-0.3)

Final score (0-1):
"""

    result = evaluator_llm.generate(prompt)
    return float(result)
```
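Both metrics are usually averaged over an evaluation set rather than computed on a single answer. A minimal sketch reusing the two functions above; it assumes each example is a dict with "question", "context", and "response" keys:

```python
def evaluate_rag_outputs(examples: list[dict], evaluator_llm) -> dict:
    """Average faithfulness and relevancy over an evaluation set."""
    faithfulness_scores = []
    relevancy_scores = []

    for ex in examples:
        faithfulness_scores.append(
            calculate_faithfulness(ex["response"], ex["context"], evaluator_llm)
        )
        relevancy_scores.append(
            calculate_relevancy(ex["question"], ex["response"], evaluator_llm)
        )

    return {
        "faithfulness": sum(faithfulness_scores) / len(faithfulness_scores),
        "relevancy": sum(relevancy_scores) / len(relevancy_scores),
    }
```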
Integration with Different Providers
OpenAI
```python
from openai import OpenAI


class OpenAIGenerator:
    def __init__(self, model: str = "gpt-4o-mini"):
        self.client = OpenAI()
        self.model = model

    async def generate(self, prompt: str, **kwargs) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=kwargs.get("temperature", 0.2),
            max_tokens=kwargs.get("max_tokens", 2000),
        )
        return response.choices[0].message.content
```
Anthropic Claude
```python
from anthropic import Anthropic


class ClaudeGenerator:
    def __init__(self, model: str = "claude-3-5-sonnet-20241022"):
        self.client = Anthropic()
        self.model = model

    async def generate(self, prompt: str, system: str = None, **kwargs) -> str:
        messages = [{"role": "user", "content": prompt}]

        response = self.client.messages.create(
            model=self.model,
            max_tokens=kwargs.get("max_tokens", 2000),
            system=system or "You are a precise and factual RAG assistant.",
            messages=messages,
        )
        return response.content[0].text
```
Local Model with Ollama
```python
import httpx


class OllamaGenerator:
    def __init__(self, model: str = "llama3.1:8b", base_url: str = "http://localhost:11434"):
        self.model = model
        self.base_url = base_url

    async def generate(self, prompt: str, **kwargs) -> str:
        async with httpx.AsyncClient() as client:
            response = await client.post(
                f"{self.base_url}/api/generate",
                json={
                    "model": self.model,
                    "prompt": prompt,
                    "stream": False,
                    "options": {
                        "temperature": kwargs.get("temperature", 0.2)
                    },
                },
            )
            return response.json()["response"]
```
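Since all three classes expose the same `generate(prompt, **kwargs)` interface, switching providers can be reduced to a small factory. A minimal sketch; the provider names and default models are illustrative:

```python
def build_generator(provider: str):
    """Return a generator for the requested provider (illustrative mapping)."""
    if provider == "openai":
        return OpenAIGenerator(model="gpt-4o-mini")
    if provider == "anthropic":
        return ClaudeGenerator()
    if provider == "ollama":
        return OllamaGenerator(model="llama3.1:8b")
    raise ValueError(f"Unknown provider: {provider}")


# Usage (inside an async function):
# generator = build_generator("ollama")
# answer = await generator.generate(prompt, temperature=0.2)
```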
Next Steps
Now that you've mastered RAG generation, explore these advanced topics:
• Conversational RAG: Memory and Context - Managing multi-turn conversations
• RAG Agents: Multi-agent Orchestration - Going beyond simple RAG
• Evaluating a RAG System - Complete metrics and methodologies
For a complete overview, check our Introduction to RAG.
---
Simplify Your Life with Ailog
Configuring and optimizing an LLM for RAG requires many iterations. With Ailog, get an optimized turnkey configuration:
• Pre-optimized prompts for each use case (support, e-commerce, internal KB)
• Built-in anti-hallucination with automatic verification
• Native streaming for a smooth user experience
• Multi-LLM: Switch between GPT-4, Claude, or sovereign models without changing your code
Start free with Ailog and deploy an optimized RAG assistant in minutes.