RAG Generation: Choosing and Optimizing Your LLM
Complete guide to selecting and configuring your LLM in a RAG system: prompting, temperature, tokens, and response optimization.
- Author: Ailog Team
- Reading time: 20 min read
- Level: intermediate
The generation phase is when your RAG system transforms retrieved documents into a coherent and useful response. The choice of LLM and its configuration determine the final quality of the user experience. This guide walks you through selecting, configuring, and optimizing your generation model.
The Role of the LLM in a RAG System
Unlike a standalone LLM, the LLM in a RAG system doesn't generate from its internal knowledge. It synthesizes, reformulates, and structures information from retrieval.
What the RAG LLM Does
• Synthesis: Condense multiple documents into a concise response
• Reformulation: Adapt technical language to the user's level
• Structuring: Organize information logically
• Contextualization: Relate the answer to the question asked
• Insufficiency Detection: Identify when documents can't answer the question
What the RAG LLM Should NOT Do
• Invent information absent from documents (hallucination)
• Contradict provided sources
• Answer out-of-scope questions without signaling it
LLM Comparison for RAG
Proprietary Models
| Model | Strengths | Weaknesses | Cost (per 1M tokens) | RAG Use |
|-------|-----------|------------|----------------------|---------|
| GPT-4o | Versatile, strong reasoning | High price | ~$5 input / $15 output | Premium production |
| GPT-4o-mini | Good value for money | Less capable on complex tasks | ~$0.15 / $0.60 | Standard production |
| Claude 3.5 Sonnet | Excellent instruction following | 200k context sometimes underused | ~$3 / $15 | Premium production |
| Claude 3 Haiku | Ultra fast, economical | Less nuanced | ~$0.25 / $1.25 | High volume |
| Gemini 1.5 Pro | 1M-token context | API sometimes unstable | ~$1.25 / $5 | Very long documents |
Open Source Models
| Model | Parameters | VRAM Required | RAG Performance | Self-hostable |
|-------|------------|---------------|-----------------|---------------|
| Llama 3.1 70B | 70B | 48GB+ | Excellent | Yes (dedicated server) |
| Llama 3.1 8B | 8B | 8GB | Good | Yes (consumer GPU) |
| Mistral 7B | 7B | 6GB | Good | Yes |
| Mixtral 8x7B | 46.7B (MoE) | 32GB | Very good | Yes |
| Qwen2 72B | 72B | 48GB+ | Excellent | Yes |
Selection Criteria
```python
def choose_llm(
    monthly_budget: float,
    request_volume: int,
    quality_requirement: str,   # "standard" or "premium"
    hosting_constraint: str,    # "cloud", "sovereign", "on-premise"
    languages: list[str],
) -> str:
    tokens_per_request = 2000  # average RAG estimate
    monthly_tokens = request_volume * tokens_per_request

    if hosting_constraint == "on-premise":
        if quality_requirement == "premium":
            return "Llama 3.1 70B or Qwen2 72B"
        return "Llama 3.1 8B or Mistral 7B"

    if hosting_constraint == "sovereign":
        return "Mistral Large via Mistral API (EU hosted)"

    # Cost estimates for the cheapest cloud tiers
    gpt4o_mini_cost = monthly_tokens * 0.15 / 1_000_000
    claude_haiku_cost = monthly_tokens * 0.25 / 1_000_000

    if monthly_budget < gpt4o_mini_cost:
        return "Self-hosted open source model"

    if quality_requirement == "premium":
        if "fr" in languages:
            return "Claude 3.5 Sonnet (excellent in French)"
        return "GPT-4o"

    return "GPT-4o-mini or Claude 3 Haiku"
```
Configuring the System Prompt
The system prompt defines the LLM's behavior in your RAG application. It's the most critical configuration element.
Structure of an Effective RAG Prompt
```python
SYSTEM_PROMPT = """You are an AI assistant for {company_name}, specialized in {domain}.

STRICT RULES:
1. Answer ONLY from the documents provided in the context
2. If the information is not in the documents, say "I don't have this information in my knowledge base"
3. Never invent information
4. Cite your sources when relevant

RESPONSE STYLE:
• Tone: {tone} (professional/friendly/technical)
• Length: {length} (concise/detailed)
• Format: Use bullet points for more than 3 items
• Language: Respond in the same language as the question

BUSINESS CONTEXT:
{specific_context}

AVAILABLE DOCUMENTS:
{retrieved_documents}

USER QUESTION:
{question}
"""
```
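To make the template concrete, here is a minimal sketch of how it might be filled and sent, assuming the OpenAI Python client; the company details, document text, and question are placeholder values:

```python
from openai import OpenAI

client = OpenAI()

# Placeholder values; in practice the documents come from your retriever
filled_prompt = SYSTEM_PROMPT.format(
    company_name="Acme",
    domain="billing support",
    tone="professional",
    length="concise",
    specific_context="Acme sells monthly SaaS subscriptions.",
    retrieved_documents="SOURCE 1: Refunds are processed within 5 business days.",
    question="How long do refunds take?",
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "system", "content": filled_prompt}],
    temperature=0.2,
)
print(response.choices[0].message.content)
```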
Prompt Examples by Use Case
E-commerce Customer Support
```python
ECOMMERCE_PROMPT = """You are the virtual assistant for {store_name}.

OBJECTIVES:
• Help customers with their orders, returns, and product questions
• Maximize customer satisfaction
• Direct to human support when necessary

RULES:
1. Base your answers ONLY on the provided documents
2. For questions about a specific order, ask for the order number
3. Never share personal information about other customers
4. If a request exceeds your capabilities, offer to contact customer service

TONE: Friendly and helpful, like a store salesperson

DOCUMENTS:
{context}

QUESTION:
{question}
"""
```
Technical Knowledge Base
```python
TECH_KB_PROMPT = """You are a technical expert for {product}.

OBJECTIVES:
• Provide precise technical answers
• Include code examples when relevant
• Guide to the official documentation

RULES:
1. Answer only from the provided documentation
2. Specify the product version if mentioned
3. Flag known limitations or documented bugs
4. Suggest alternatives if the requested solution doesn't exist

FORMAT:
• Start with a direct answer
• Then detail if necessary
• End with a code example if applicable

DOCUMENTATION:
{context}

QUESTION:
{question}
"""
```
Generation Parameters
Temperature
Temperature controls the trade-off between creativity and coherence in responses.
```python
# Low temperature (0.0 - 0.3): deterministic, factual responses
# Ideal for: customer support, FAQ, technical documentation
temperature = 0.1

# Medium temperature (0.4 - 0.7): balance between creativity and coherence
# Ideal for: general chat, reformulation
temperature = 0.5

# High temperature (0.8 - 1.0): varied, creative responses
# Ideal for: brainstorming, content generation. AVOID in RAG (hallucination risk)
temperature = 0.9
```
Top-p (Nucleus Sampling)
Limits sampling to the smallest set of tokens whose cumulative probability reaches the top_p threshold.
```python
# top_p = 0.9: consider tokens up to 90% cumulative probability
# Lower = more restrictive = more coherent; higher = more varied
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.2,
    top_p=0.9,  # combined with a low temperature for RAG
)
```
Max Tokens
Limits the generated response length.
```python
# Token budget calculation
def calculate_max_tokens(
    context_length: int,
    model_limit: int = 128000,  # GPT-4o context window
    reserve_output: int = 2000,
) -> int:
    """Ensure enough room is left for the response."""
    prompt_tokens = context_length + 500  # + instructions overhead
    available = model_limit - prompt_tokens
    return min(available, reserve_output)
```
Frequency and Presence Penalty
Control repetition in responses.
```python
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    frequency_penalty=0.3,  # penalize tokens already used
    presence_penalty=0.1,   # encourage new topics
)
```
Optimizing Context Provided to the LLM
Formatting Retrieved Documents
```python
def format_context(documents: list[dict], max_tokens: int = 4000) -> str:
    """Format documents for the LLM context."""
    formatted_docs = []
    total_tokens = 0

    for i, doc in enumerate(documents):
        # Token estimation (1 token ≈ 4 characters in English)
        doc_tokens = len(doc["text"]) // 4

        if total_tokens + doc_tokens > max_tokens:
            break

        formatted = f"""
---
SOURCE {i+1}: {doc.get("source", "Document")}
RELEVANCE SCORE: {doc.get("score", "N/A")}

{doc["text"]}
---
"""
        formatted_docs.append(formatted)
        total_tokens += doc_tokens

    return "\n".join(formatted_docs)
```
Document Order
Order impacts LLM attention. Two main strategies:
```python
def order_documents(documents: list[dict], strategy: str = "best_first") -> list:
    """Order documents according to the chosen strategy."""
    if strategy == "best_first":
        # Most relevant first
        return sorted(documents, key=lambda x: x["score"], reverse=True)

    elif strategy == "lost_in_middle":
        # Best documents at the beginning and end, weakest in the middle
        # (counters the "lost in the middle" effect)
        sorted_docs = sorted(documents, key=lambda x: x["score"])  # ascending: weakest first
        reordered = []
        for i, doc in enumerate(sorted_docs):
            if i % 2 == 0:
                reordered.insert(0, doc)
            else:
                reordered.append(doc)
        return reordered

    return documents
```
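The two helpers above can be chained; a short usage sketch with made-up documents and scores:

```python
# Made-up retrieval results with relevance scores
documents = [
    {"text": "Refunds are processed within 5 business days.", "source": "faq.md", "score": 0.91},
    {"text": "Returns are accepted within 30 days of delivery.", "source": "returns.md", "score": 0.84},
    {"text": "Shipping is free above 50 EUR.", "source": "shipping.md", "score": 0.42},
]

# Reorder to counter "lost in the middle", then format within the token budget
ordered = order_documents(documents, strategy="lost_in_middle")
context = format_context(ordered, max_tokens=4000)
```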
Context Compression
For long documents, summarize before sending to the final LLM:
```python
async def compress_context(
    documents: list[str],
    query: str,
    llm_compressor,
) -> str:
    """Compress documents while keeping relevant info."""
    compression_prompt = f"""
Extract the information relevant to answering this question:

Question: {query}

Documents:
{chr(10).join(documents)}

Summary of relevant information:
"""

    compressed = await llm_compressor.generate(compression_prompt)
    return compressed
```
Hallucination Management
Hallucinations are the main risk in RAG. Here's how to minimize them.
Hallucination Detection
```python
import json


def detect_hallucination(
    response: str,
    context: str,
    query: str,
    verifier_llm,
) -> dict:
    """Check whether the response contains hallucinations."""
    verification_prompt = f"""
Analyze this response and verify whether it is faithful to the provided context.

PROVIDED CONTEXT:
{context}

QUESTION:
{query}

RESPONSE TO VERIFY:
{response}

ANALYSIS:
1. Does the response contain claims not present in the context?
2. Does the response contradict the context?
3. Faithfulness score (0-100):

Respond in JSON: {{"hallucinations": [...], "contradictions": [...], "score": X}}
"""

    result = verifier_llm.generate(verification_prompt)
    return json.loads(result)
```
Anti-Hallucination Strategies
```python
class RAGGenerator:
    def __init__(self, llm, verifier_llm=None):
        self.llm = llm
        self.verifier = verifier_llm

    async def generate_with_verification(
        self,
        query: str,
        context: str,
        max_retries: int = 2,
    ) -> dict:
        """Generate a response with anti-hallucination verification."""
        for attempt in range(max_retries + 1):
            # Generation
            response = await self.llm.generate(
                self._build_prompt(query, context),
                temperature=0.1,  # low to reduce hallucinations
            )

            # Verification if a verifier is available
            if self.verifier:
                check = detect_hallucination(response, context, query, self.verifier)

                if check["score"] >= 80:
                    return {
                        "response": response,
                        "confidence": check["score"],
                        "verified": True,
                    }

                # Retry with a stricter instruction
                context = self._add_anti_hallucination_instruction(context, check)
            else:
                return {"response": response, "verified": False}

        # Fallback: cautious response
        return {
            "response": "I'm not certain I can precisely answer this question with the available information.",
            "confidence": 0,
            "verified": True,
        }
```
Streaming and Latency
For a smooth user experience, streaming is essential.
Streaming Implementation
```python
from openai import OpenAI

client = OpenAI()


async def stream_response(prompt: str):
    """Stream the response token by token."""
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )

    for chunk in stream:
        if chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content


# Usage with FastAPI
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()


class ChatRequest(BaseModel):
    # Request model used by the endpoint below
    query: str
    context: str


@app.post("/chat")
async def chat(request: ChatRequest):
    # build_rag_prompt: prompt assembly helper (see "Configuring the System Prompt")
    prompt = build_rag_prompt(request.query, request.context)

    return StreamingResponse(
        stream_response(prompt),
        media_type="text/event-stream",
    )
```
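On the client side, the stream can be consumed incrementally as chunks arrive. A minimal sketch using httpx; the endpoint URL and JSON payload mirror the FastAPI route above and are otherwise assumptions:

```python
import httpx


async def consume_stream(query: str, context: str) -> None:
    """Print tokens as the /chat endpoint streams them."""
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream(
            "POST",
            "http://localhost:8000/chat",  # assumed local endpoint
            json={"query": query, "context": context},
        ) as response:
            async for chunk in response.aiter_text():
                print(chunk, end="", flush=True)


# await consume_stream("How long do refunds take?", context)  # inside an async context
```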
Latency Optimization
```python
import asyncio
from functools import lru_cache


class OptimizedRAG:
    def __init__(self):
        self.retriever = Retriever()
        self.llm = LLM()

    async def query(self, question: str) -> str:
        # Parallelize retrieval and prompt preparation
        retrieval_task = asyncio.create_task(
            self.retriever.search(question)
        )

        # Pre-compute the static parts of the prompt
        base_prompt = self._get_cached_prompt()

        # Wait for retrieval
        documents = await retrieval_task

        # Build the full prompt and send it to the LLM
        full_prompt = base_prompt.format(
            context=format_context(documents),
            question=question,
        )

        return await self.llm.generate(full_prompt)

    @lru_cache(maxsize=1)
    def _get_cached_prompt(self) -> str:
        return self._load_system_prompt()
```
Generation Quality Metrics
Faithfulness
Measures if the response is faithful to the provided context.
```python
def calculate_faithfulness(
    response: str,
    context: str,
    evaluator_llm,
) -> float:
    """Faithfulness score between 0 and 1."""
    prompt = f"""
Evaluate the faithfulness of this response to the context.

Context: {context}
Response: {response}

For each claim in the response, verify whether it is:
• Supported by the context (1 point)
• Not mentioned in the context (0 points)
• Contradicted by the context (-1 point)

Final score (0-1):
"""

    result = evaluator_llm.generate(prompt)
    return float(result)
```
Answer Relevancy
Measures if the response properly answers the question.
```python
def calculate_relevancy(
    question: str,
    response: str,
    evaluator_llm,
) -> float:
    """Relevancy score between 0 and 1."""
    prompt = f"""
Evaluate if this response properly answers the question asked.

Question: {question}
Response: {response}

Criteria:
• Does the response directly address the question? (0-0.4)
• Is the response complete? (0-0.3)
• Is the response concise and clear? (0-0.3)

Final score (0-1):
"""

    result = evaluator_llm.generate(prompt)
    return float(result)
```
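Both metrics are usually averaged over an evaluation set rather than computed on a single answer. A minimal sketch reusing the two functions above; it assumes each example is a dict with "question", "context", and "response" keys:

```python
def evaluate_rag_outputs(examples: list[dict], evaluator_llm) -> dict:
    """Average faithfulness and relevancy over an evaluation set."""
    faithfulness_scores = []
    relevancy_scores = []

    for ex in examples:
        faithfulness_scores.append(
            calculate_faithfulness(ex["response"], ex["context"], evaluator_llm)
        )
        relevancy_scores.append(
            calculate_relevancy(ex["question"], ex["response"], evaluator_llm)
        )

    return {
        "faithfulness": sum(faithfulness_scores) / len(faithfulness_scores),
        "relevancy": sum(relevancy_scores) / len(relevancy_scores),
    }
```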
Integration with Different Providers
OpenAI
```python
from openai import OpenAI


class OpenAIGenerator:
    def __init__(self, model: str = "gpt-4o-mini"):
        self.client = OpenAI()
        self.model = model

    async def generate(self, prompt: str, **kwargs) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=kwargs.get("temperature", 0.2),
            max_tokens=kwargs.get("max_tokens", 2000),
        )
        return response.choices[0].message.content
```
Anthropic Claude
```python
from anthropic import Anthropic


class ClaudeGenerator:
    def __init__(self, model: str = "claude-3-5-sonnet-20241022"):
        self.client = Anthropic()
        self.model = model

    async def generate(self, prompt: str, system: str = None, **kwargs) -> str:
        messages = [{"role": "user", "content": prompt}]

        response = self.client.messages.create(
            model=self.model,
            max_tokens=kwargs.get("max_tokens", 2000),
            system=system or "You are a precise and factual RAG assistant.",
            messages=messages,
        )
        return response.content[0].text
```
Local Model with Ollama
```python
import httpx


class OllamaGenerator:
    def __init__(self, model: str = "llama3.1:8b", base_url: str = "http://localhost:11434"):
        self.model = model
        self.base_url = base_url

    async def generate(self, prompt: str, **kwargs) -> str:
        async with httpx.AsyncClient() as client:
            response = await client.post(
                f"{self.base_url}/api/generate",
                json={
                    "model": self.model,
                    "prompt": prompt,
                    "stream": False,
                    "options": {
                        "temperature": kwargs.get("temperature", 0.2)
                    },
                },
            )
            return response.json()["response"]
```
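Since all three classes expose the same `generate(prompt, **kwargs)` interface, switching providers can be reduced to a small factory. A minimal sketch; the provider names and default models are illustrative:

```python
def build_generator(provider: str):
    """Return a generator for the requested provider (illustrative mapping)."""
    if provider == "openai":
        return OpenAIGenerator(model="gpt-4o-mini")
    if provider == "anthropic":
        return ClaudeGenerator()
    if provider == "ollama":
        return OllamaGenerator(model="llama3.1:8b")
    raise ValueError(f"Unknown provider: {provider}")


# Usage (inside an async function):
# generator = build_generator("ollama")
# answer = await generator.generate(prompt, temperature=0.2)
```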
Next Steps
Now that you've mastered RAG generation, explore these advanced topics:
• Conversational RAG: Memory and Context - Managing multi-turn conversations
• RAG Agents: Multi-agent Orchestration - Going beyond simple RAG
• Evaluating a RAG System - Complete metrics and methodologies
For a complete overview, check our Introduction to RAG.
---
Simplify Your Life with Ailog
Configuring and optimizing an LLM for RAG requires many iterations. With Ailog, get an optimized turnkey configuration:
• Pre-optimized prompts for each use case (support, e-commerce, internal KB)
• Built-in anti-hallucination with automatic verification
• Native streaming for a smooth user experience
• Multi-LLM: Switch between GPT-4, Claude, or sovereign models without changing your code
Start free with Ailog and deploy an optimized RAG assistant in minutes.