RAG Generation: Choosing and Optimizing Your LLM

Complete guide to selecting and configuring your LLM in a RAG system: prompting, temperature, tokens, and response optimization.

Author
Ailog Team
Published
January 13, 2026
Reading time
20 min read
Level
intermediate

RAG Generation: Choosing and Optimizing Your LLM

The generation phase is when your RAG system transforms retrieved documents into a coherent and useful response. The choice of LLM and its configuration determine the final quality of the user experience. This guide walks you through selecting, configuring, and optimizing your generation model.

The Role of the LLM in a RAG System

Unlike a standalone LLM, the LLM in a RAG system doesn't generate from its internal knowledge. It synthesizes, reformulates, and structures information from retrieval.

What the RAG LLM Does

1. Synthesis: Condense multiple documents into a concise response
2. Reformulation: Adapt technical language to the user's level
3. Structuring: Organize information logically
4. Contextualization: Relate the answer to the question asked
5. Insufficiency Detection: Identify when documents can't answer the question

What the RAG LLM Should NOT Do

• Invent information absent from documents (hallucination)
• Contradict provided sources
• Answer out-of-scope questions without signaling it

LLM Comparison for RAG

Proprietary Models

| Model | Strengths | Weaknesses | Cost (1M tokens) | RAG Use |
|-------|-----------|------------|------------------|---------|
| GPT-4o | Versatile, reasoning | High price | ~$5 input / $15 output | Premium production |
| GPT-4o-mini | Good value ratio | Less performant on complex tasks | ~$0.15 / $0.60 | Standard production |
| Claude 3.5 Sonnet | Excellent instruction following | 200k context sometimes underused | ~$3 / $15 | Premium production |
| Claude 3 Haiku | Ultra fast, economical | Less nuanced | ~$0.25 / $1.25 | High volume |
| Gemini 1.5 Pro | 1M token context | API sometimes unstable | ~$1.25 / $5 | Very long documents |
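
Before committing to a model, it helps to turn the per-token prices above into a monthly figure. The sketch below does that rough arithmetic, assuming about 1,500 input and 500 output tokens per RAG request; the prices are the approximate values from the table, not official pricing.

```python
# Rough monthly cost estimate from the table's approximate prices (illustrative only).
PRICES_PER_1M = {  # (input $, output $) per 1M tokens
    "gpt-4o": (5.00, 15.00),
    "gpt-4o-mini": (0.15, 0.60),
    "claude-3-5-sonnet": (3.00, 15.00),
    "claude-3-haiku": (0.25, 1.25),
    "gemini-1.5-pro": (1.25, 5.00),
}

def estimate_monthly_cost(model: str, requests_per_month: int,
                          input_tokens: int = 1500, output_tokens: int = 500) -> float:
    """Estimate monthly spend assuming ~1,500 input / 500 output tokens per RAG request."""
    price_in, price_out = PRICES_PER_1M[model]
    cost_per_request = (input_tokens * price_in + output_tokens * price_out) / 1_000_000
    return requests_per_month * cost_per_request

# Example: 50,000 requests/month on GPT-4o-mini comes to roughly $26/month
print(f"${estimate_monthly_cost('gpt-4o-mini', 50_000):.2f}")
```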

Open Source Models

| Model | Parameters | VRAM Required | RAG Performance | Self-hostable |
|-------|------------|---------------|-----------------|---------------|
| Llama 3.1 70B | 70B | 48GB+ | Excellent | Yes (dedicated server) |
| Llama 3.1 8B | 8B | 8GB | Good | Yes (consumer GPU) |
| Mistral 7B | 7B | 6GB | Good | Yes |
| Mixtral 8x7B | 46.7B (MoE) | 32GB | Very good | Yes |
| Qwen2 72B | 72B | 48GB+ | Excellent | Yes |

Selection Criteria

```python
def choose_llm(
    monthly_budget: float,
    request_volume: int,
    quality_requirement: str,  # "standard", "premium"
    hosting_constraint: str,   # "cloud", "sovereign", "on-premise"
    languages: list[str]
) -> str:
    tokens_per_request = 2000  # average RAG estimate
    monthly_tokens = request_volume * tokens_per_request

    if hosting_constraint == "on-premise":
        if quality_requirement == "premium":
            return "Llama 3.1 70B or Qwen2 72B"
        return "Llama 3.1 8B or Mistral 7B"

    if hosting_constraint == "sovereign":
        return "Mistral Large via Mistral API (EU hosted)"

    gpt4o_mini_cost = monthly_tokens * 0.15 / 1_000_000
    claude_haiku_cost = monthly_tokens * 0.25 / 1_000_000

    if monthly_budget < gpt4o_mini_cost:
        return "Self-hosted open source model"

    if quality_requirement == "premium":
        if "fr" in languages:
            return "Claude 3.5 Sonnet (excellent in French)"
        return "GPT-4o"

    return "GPT-4o-mini or Claude 3 Haiku"
```

Configuring the System Prompt

The system prompt defines the LLM's behavior in your RAG application. It's the most critical configuration element.

Structure of an Effective RAG Prompt

```python
SYSTEM_PROMPT = """You are an AI assistant for {company_name}, specialized in {domain}.

STRICT RULES:
1. Answer ONLY from the documents provided in the context
2. If the information is not in the documents, say "I don't have this information in my knowledge base"
3. Never invent information
4. Cite your sources when relevant

RESPONSE STYLE:
- Tone: {tone} (professional/friendly/technical)
- Length: {length} (concise/detailed)
- Format: Use bullet points for more than 3 items
- Language: Respond in the same language as the question

BUSINESS CONTEXT:
{specific_context}

AVAILABLE DOCUMENTS:
{retrieved_documents}

USER QUESTION:
{question}
"""
```
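
For reference, here is how the template might be filled at request time. The values below are hypothetical placeholders, and formatted_context stands in for the output of the context-formatting step shown later in this guide.

```python
# Hypothetical values for illustration; adapt them to your application.
formatted_context = "--- SOURCE 1: refund-policy.md\nRefunds are accepted within 30 days. ---"
user_question = "Can I return a product after two weeks?"

prompt = SYSTEM_PROMPT.format(
    company_name="Acme",
    domain="e-commerce customer support",
    tone="professional",
    length="concise",
    specific_context="Acme sells home electronics online.",
    retrieved_documents=formatted_context,
    question=user_question,
)
```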

Prompt Examples by Use Case

E-commerce Customer Support

```python
ECOMMERCE_PROMPT = """You are the virtual assistant for {store_name}.

OBJECTIVES:
- Help customers with their orders, returns, and product questions
- Maximize customer satisfaction
- Direct to human support when necessary

RULES:
1. Base your answers ONLY on the provided documents
2. For questions about a specific order, ask for the order number
3. Never share personal information about other customers
4. If a request exceeds your capabilities, offer to contact customer service

TONE: Friendly and helpful, like a store salesperson

DOCUMENTS:
{context}

QUESTION:
{question}
"""
```

Technical Knowledge Base

```python
TECH_KB_PROMPT = """You are a technical expert for {product}.

OBJECTIVES:
- Provide precise technical answers
- Include code examples when relevant
- Guide to official documentation

RULES:
1. Answer only from the provided documentation
2. Specify the product version if mentioned
3. Flag known limitations or documented bugs
4. Suggest alternatives if the requested solution doesn't exist

FORMAT:
- Start with a direct answer
- Then detail if necessary
- End with a code example if applicable

DOCUMENTATION:
{context}

QUESTION:
{question}
"""
```

Generation Parameters

Temperature

Temperature controls creativity vs. response coherence.

```python
# Low temperature (0.0 - 0.3): Deterministic, factual responses
# Ideal for: Customer support, FAQ, technical documentation
temperature = 0.1

# Medium temperature (0.4 - 0.7): Creativity/coherence balance
# Ideal for: General chat, reformulation
temperature = 0.5

# High temperature (0.8 - 1.0): Varied, creative responses
# Ideal for: Brainstorming, content generation
# AVOID in RAG (hallucination risk)
temperature = 0.9
```

Top-p (Nucleus Sampling)

Top-p restricts sampling to the smallest set of tokens whose cumulative probability reaches the chosen threshold.

```python
# top_p = 0.9: Consider tokens up to 90% cumulative probability
# More restrictive = more coherent
# Wider = more varied

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.2,
    top_p=0.9,  # Combined with low temperature for RAG
)
```

Max Tokens

Limits the generated response length.

```python
# Token budget calculation
def calculate_max_tokens(
    context_length: int,
    model_limit: int = 128000,  # GPT-4o
    reserve_output: int = 2000
) -> int:
    """
    Ensure enough room is left for the response
    """
    prompt_tokens = context_length + 500  # + instructions
    available = model_limit - prompt_tokens
    return min(available, reserve_output)
```
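
A quick illustration of the budget logic (the numbers are arbitrary): a modest context keeps the full output reserve, while a context close to the model limit squeezes it.

```python
print(calculate_max_tokens(context_length=6000))    # -> 2000 (full output budget available)
print(calculate_max_tokens(context_length=127000))  # -> 500 (context nearly fills the window)
```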

Frequency and Presence Penalty

Control repetition in responses.

```python
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    frequency_penalty=0.3,  # Penalize already used tokens
    presence_penalty=0.1,   # Encourage new topics
)
```

Optimizing Context Provided to the LLM

Formatting Retrieved Documents

```python
def format_context(documents: list[dict], max_tokens: int = 4000) -> str:
    """
    Format documents for LLM context
    """
    formatted_docs = []
    total_tokens = 0

    for i, doc in enumerate(documents):
        # Token estimation (1 token ≈ 4 characters in English)
        doc_tokens = len(doc["text"]) // 4

        if total_tokens + doc_tokens > max_tokens:
            break

        formatted = f"""
---
SOURCE {i+1}: {doc.get("source", "Document")}
RELEVANCE SCORE: {doc.get("score", "N/A")}

{doc["text"]}
---
"""
        formatted_docs.append(formatted)
        total_tokens += doc_tokens

    return "\n".join(formatted_docs)
```

Document Order

Order impacts LLM attention. Two main strategies:

```python
def order_documents(documents: list[dict], strategy: str = "best_first") -> list:
    """
    Order documents according to chosen strategy
    """
    if strategy == "best_first":
        # Most relevant first
        return sorted(documents, key=lambda x: x["score"], reverse=True)

    elif strategy == "lost_in_middle":
        # Best at beginning and end (avoids "lost in the middle"):
        # iterate from least to most relevant and alternate between the two ends,
        # so the strongest documents land at the edges of the context.
        sorted_docs = sorted(documents, key=lambda x: x["score"])  # ascending
        reordered = []
        for i, doc in enumerate(sorted_docs):
            if i % 2 == 0:
                reordered.insert(0, doc)
            else:
                reordered.append(doc)
        return reordered

    return documents
```
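
As a quick sanity check, here is what the two strategies produce on a handful of hypothetical documents scored from 0.9 down to 0.5 (the scores are made up):

```python
docs = [{"text": f"doc scored {s}", "score": s} for s in (0.9, 0.8, 0.7, 0.6, 0.5)]

print([d["score"] for d in order_documents(docs, "best_first")])
# [0.9, 0.8, 0.7, 0.6, 0.5]

print([d["score"] for d in order_documents(docs, "lost_in_middle")])
# [0.9, 0.7, 0.5, 0.6, 0.8]  -> strongest documents at both edges, weakest in the middle
```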

Context Compression

For long documents, summarize before sending to the final LLM:

```python
async def compress_context(
    documents: list[str],
    query: str,
    llm_compressor
) -> str:
    """
    Compress documents while keeping relevant info
    """
    compression_prompt = f"""
Extract relevant information to answer this question:

Question: {query}

Documents:
{chr(10).join(documents)}

Summary of relevant information:
"""

    compressed = await llm_compressor.generate(compression_prompt)
    return compressed
```

Hallucination Management

Hallucinations are the main risk in RAG. Here's how to minimize them.

Hallucination Detection

```python
import json

def detect_hallucination(
    response: str,
    context: str,
    query: str,
    verifier_llm
) -> dict:
    """
    Check if the response contains hallucinations
    """
    verification_prompt = f"""
Analyze this response and verify if it's faithful to the provided context.

PROVIDED CONTEXT:
{context}

QUESTION:
{query}

RESPONSE TO VERIFY:
{response}

ANALYSIS:
1. Does the response contain claims not present in the context?
2. Does the response contradict the context?
3. Faithfulness score (0-100):

Respond in JSON: {{"hallucinations": [...], "contradictions": [...], "score": X}}
"""

    result = verifier_llm.generate(verification_prompt)
    return json.loads(result)
```

Anti-Hallucination Strategies

```python
class RAGGenerator:
    def __init__(self, llm, verifier_llm=None):
        self.llm = llm
        self.verifier = verifier_llm

    async def generate_with_verification(
        self,
        query: str,
        context: str,
        max_retries: int = 2
    ) -> dict:
        """
        Generate a response with anti-hallucination verification
        """
        for attempt in range(max_retries + 1):
            # Generation
            response = await self.llm.generate(
                self._build_prompt(query, context),
                temperature=0.1  # Low to reduce hallucinations
            )

            # Verification if verifier available
            if self.verifier:
                check = detect_hallucination(response, context, query, self.verifier)

                if check["score"] >= 80:
                    return {
                        "response": response,
                        "confidence": check["score"],
                        "verified": True
                    }

                # Retry with stricter instruction
                context = self._add_anti_hallucination_instruction(context, check)
            else:
                return {"response": response, "verified": False}

        # Fallback: cautious response
        return {
            "response": "I'm not certain I can precisely answer this question with the available information.",
            "confidence": 0,
            "verified": True
        }
```

Streaming and Latency

For a smooth user experience, streaming is essential.

Streaming Implementation

```python
from openai import OpenAI

client = OpenAI()

async def stream_response(prompt: str):
    """
    Stream the response token by token
    """
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )

    for chunk in stream:
        if chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content

# Usage with FastAPI
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.post("/chat")
async def chat(request: ChatRequest):
    prompt = build_rag_prompt(request.query, request.context)

    return StreamingResponse(
        stream_response(prompt),
        media_type="text/event-stream"
    )
```
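
On the client side, the stream can be consumed incrementally. A minimal sketch with httpx, assuming the endpoint above runs locally on port 8000 and that ChatRequest exposes query and context fields (both assumptions, not defined in this guide):

```python
import asyncio
import httpx

async def consume_stream() -> None:
    # Assumes the FastAPI app above is served at localhost:8000
    # and that ChatRequest accepts "query" and "context" fields.
    payload = {"query": "How do I return a product?", "context": ""}
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("POST", "http://localhost:8000/chat", json=payload) as response:
            async for chunk in response.aiter_text():
                print(chunk, end="", flush=True)  # display tokens as they arrive

asyncio.run(consume_stream())
```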

Latency Optimization

```python
import asyncio
from functools import lru_cache

class OptimizedRAG:
    def __init__(self):
        self.retriever = Retriever()
        self.llm = LLM()

    async def query(self, question: str) -> str:
        # Parallelize retrieval and prompt preparation
        retrieval_task = asyncio.create_task(
            self.retriever.search(question)
        )

        # Pre-compute static prompt parts
        base_prompt = self._get_cached_prompt()

        # Wait for retrieval
        documents = await retrieval_task

        # Build and send to LLM
        full_prompt = base_prompt.format(
            context=format_context(documents),
            question=question
        )

        return await self.llm.generate(full_prompt)

    @lru_cache(maxsize=1)
    def _get_cached_prompt(self) -> str:
        return self._load_system_prompt()
```

Generation Quality Metrics

Faithfulness

Measures whether the response is faithful to the provided context.

```python
def calculate_faithfulness(
    response: str,
    context: str,
    evaluator_llm
) -> float:
    """
    Faithfulness score between 0 and 1
    """
    prompt = f"""
Evaluate the faithfulness of this response to the context.

Context: {context}
Response: {response}

For each claim in the response, verify if it is:
- Supported by the context (1 point)
- Not mentioned in the context (0 points)
- Contradicted by the context (-1 point)

Final score (0-1):
"""

    result = evaluator_llm.generate(prompt)
    return float(result)
```

Answer Relevancy

Measures whether the response properly answers the question.

```python
def calculate_relevancy(
    question: str,
    response: str,
    evaluator_llm
) -> float:
    """
    Relevancy score between 0 and 1
    """
    prompt = f"""
Evaluate if this response properly answers the question asked.

Question: {question}
Response: {response}

Criteria:
- Does the response directly address the question? (0-0.4)
- Is the response complete? (0-0.3)
- Is the response concise and clear? (0-0.3)

Final score (0-1):
"""

    result = evaluator_llm.generate(prompt)
    return float(result)
```
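
In practice the two metrics are run together over a small evaluation set. A minimal sketch, assuming an evaluator_llm object with the same generate() interface used above and a hand-built test set (both hypothetical):

```python
# Hypothetical test set; in practice this would come from logged RAG interactions.
test_cases = [
    {
        "question": "What is the refund window?",
        "context": "Refunds are accepted within 30 days of delivery.",
        "response": "You can request a refund within 30 days of delivery.",
    },
    # ... more cases
]

def evaluate_generation(test_cases: list[dict], evaluator_llm) -> dict:
    """Average faithfulness and relevancy over a small test set."""
    faithfulness_scores, relevancy_scores = [], []
    for case in test_cases:
        faithfulness_scores.append(
            calculate_faithfulness(case["response"], case["context"], evaluator_llm)
        )
        relevancy_scores.append(
            calculate_relevancy(case["question"], case["response"], evaluator_llm)
        )
    return {
        "faithfulness": sum(faithfulness_scores) / len(faithfulness_scores),
        "relevancy": sum(relevancy_scores) / len(relevancy_scores),
    }
```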

Integration with Different Providers

OpenAI

```python
from openai import OpenAI

class OpenAIGenerator:
    def __init__(self, model: str = "gpt-4o-mini"):
        self.client = OpenAI()
        self.model = model

    async def generate(self, prompt: str, **kwargs) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=kwargs.get("temperature", 0.2),
            max_tokens=kwargs.get("max_tokens", 2000)
        )
        return response.choices[0].message.content
```

Anthropic Claude

```python
from anthropic import Anthropic

class ClaudeGenerator:
    def __init__(self, model: str = "claude-3-5-sonnet-20241022"):
        self.client = Anthropic()
        self.model = model

    async def generate(self, prompt: str, system: str = None, **kwargs) -> str:
        messages = [{"role": "user", "content": prompt}]

        response = self.client.messages.create(
            model=self.model,
            max_tokens=kwargs.get("max_tokens", 2000),
            system=system or "You are a precise and factual RAG assistant.",
            messages=messages
        )
        return response.content[0].text
```

Local Model with Ollama

```python
import httpx

class OllamaGenerator:
    def __init__(self, model: str = "llama3.1:8b", base_url: str = "http://localhost:11434"):
        self.model = model
        self.base_url = base_url

    async def generate(self, prompt: str, **kwargs) -> str:
        async with httpx.AsyncClient() as client:
            response = await client.post(
                f"{self.base_url}/api/generate",
                json={
                    "model": self.model,
                    "prompt": prompt,
                    "stream": False,
                    "options": {
                        "temperature": kwargs.get("temperature", 0.2)
                    }
                }
            )
            return response.json()["response"]
```
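
Because the three wrappers expose the same async generate(prompt, **kwargs) interface, a thin factory can switch providers from configuration without touching the rest of the pipeline. A minimal sketch; the provider keys are arbitrary, not a fixed convention:

```python
def build_generator(provider: str, **options):
    """Return one of the generator classes defined above, chosen by configuration."""
    if provider == "openai":
        return OpenAIGenerator(model=options.get("model", "gpt-4o-mini"))
    if provider == "anthropic":
        return ClaudeGenerator(model=options.get("model", "claude-3-5-sonnet-20241022"))
    if provider == "ollama":
        return OllamaGenerator(
            model=options.get("model", "llama3.1:8b"),
            base_url=options.get("base_url", "http://localhost:11434"),
        )
    raise ValueError(f"Unknown provider: {provider}")

# All three generators share the same calling convention,
# so switching providers is a one-line configuration change.
generator = build_generator("openai")
```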

Next Steps

Now that you've mastered RAG generation, explore these advanced topics:

• Conversational RAG: Memory and Context - Managing multi-turn conversations
• RAG Agents: Multi-agent Orchestration - Going beyond simple RAG
• Evaluating a RAG System - Complete metrics and methodologies

For a complete overview, check our Introduction to RAG.

---

Simplify Your Life with Ailog

Configuring and optimizing an LLM for RAG requires many iterations. With Ailog, get an optimized turnkey configuration:

• Pre-optimized prompts for each use case (support, e-commerce, internal KB)
• Built-in anti-hallucination with automatic verification
• Native streaming for a smooth user experience
• Multi-LLM: Switch between GPT-4, Claude, or sovereign models without changing your code

Start free with Ailog and deploy an optimized RAG assistant in minutes.

Tags

  • RAG
  • LLM
  • generation
  • prompting
  • GPT
  • Claude