Guide · Intermediate

Temperature and Sampling in RAG: Controlling LLM Creativity

March 15, 2026
15 min read
Ailog Team

Complete guide to sampling parameters for RAG systems: temperature, top-p, top-k, frequency penalty. Optimize the balance between creativity and faithfulness.

TL;DR

Sampling parameters (temperature, top-p, top-k) control the level of creativity and determinism in LLM responses. In RAG, these parameters are critical: too much creativity = hallucinations, too much determinism = robotic responses. This guide teaches you how to calibrate these parameters for your use case.

Understanding Sampling Parameters

What is Sampling?

When an LLM generates text, it predicts a probability distribution over all possible tokens for the next word. Sampling determines how to choose among these candidates.

```python
# Simplified probability distribution example
next_token_probs = {
    "quickly": 0.35,
    "fast": 0.25,
    "rapidly": 0.15,
    "immediately": 0.12,
    "instantly": 0.08,
    "...": 0.05,
}

# Without sampling (greedy): always "quickly"
# With sampling: might choose "fast", "rapidly", etc.
```

Parameter Overview

| Parameter | Range | Effect | Typical RAG usage |
|---|---|---|---|
| Temperature | 0.0 - 2.0 | Controls distribution "heat" | 0.1 - 0.5 |
| Top-p | 0.0 - 1.0 | Nucleus sampling | 0.9 - 1.0 |
| Top-k | 1 - 100+ | Limits candidates | 40 - 80 |
| Frequency penalty | -2.0 - 2.0 | Penalizes repetitions | 0.0 - 0.5 |
| Presence penalty | -2.0 - 2.0 | Encourages diversity | 0.0 - 0.3 |

Temperature in Detail

Mathematical Function

Temperature modifies the softmax distribution of probabilities:

```python
import numpy as np

def apply_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    """
    Apply temperature to logits.
    - temperature < 1: sharper distribution (deterministic)
    - temperature = 1: original distribution
    - temperature > 1: flatter distribution (random)
    """
    if temperature == 0:
        # Greedy: return one-hot vector on max
        result = np.zeros_like(logits)
        result[np.argmax(logits)] = 1.0
        return result
    scaled_logits = logits / temperature
    exp_logits = np.exp(scaled_logits - np.max(scaled_logits))
    return exp_logits / np.sum(exp_logits)

# Example
logits = np.array([2.0, 1.5, 1.0, 0.5, 0.2])
print("Temp 0.1:", apply_temperature(logits, 0.1))
# [0.99, 0.01, 0.00, 0.00, 0.00] - nearly deterministic
print("Temp 1.0:", apply_temperature(logits, 1.0))
# [0.42, 0.26, 0.16, 0.09, 0.07] - original distribution
print("Temp 2.0:", apply_temperature(logits, 2.0))
# [0.31, 0.24, 0.19, 0.14, 0.12] - more uniform
```

Visual Impact

Low Temperature (0.1)          High Temperature (1.5)
│                              │
│ ████████████                 │ ██████
│ ██                           │ █████
│ █                            │ ████
│                              │ ███
│                              │ ██
└────────────────              └────────────────
  Token 1 dominates              Flat distribution

Recommendations by RAG Use Case

| Use case | Temperature | Justification |
|---|---|---|
| Factual customer support | 0.1 - 0.2 | Maximum precision, no creativity |
| Automated FAQ | 0.2 - 0.3 | Slight variations acceptable |
| E-commerce assistant | 0.3 - 0.5 | Some personality |
| Assisted writing | 0.5 - 0.7 | Controlled creativity |
| Brainstorming | 0.7 - 1.0 | Varied ideas welcome |
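
The table above can be operationalized as a simple lookup. This is a minimal sketch: the use-case keys and the midpoint values chosen from each range are our own naming, not part of any API.

```python
# Hypothetical mapping from use case to a starting temperature,
# using midpoints of the ranges recommended above.
USE_CASE_TEMPERATURE = {
    "factual_support": 0.15,
    "faq": 0.25,
    "ecommerce_assistant": 0.4,
    "assisted_writing": 0.6,
    "brainstorming": 0.85,
}

def temperature_for(use_case: str, default: float = 0.3) -> float:
    """Return a starting temperature for a RAG use case."""
    return USE_CASE_TEMPERATURE.get(use_case, default)
```

Treat these values as starting points to tune, not fixed constants.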

Top-p (Nucleus Sampling)

How it Works

Top-p keeps the smallest set of highest-probability tokens whose cumulative probability reaches p:

```python
def top_p_sampling(probs: dict, p: float) -> list:
    """
    Return tokens whose cumulative probability reaches p.
    """
    # Sort by descending probability
    sorted_tokens = sorted(probs.items(), key=lambda x: x[1], reverse=True)

    cumulative_prob = 0.0
    selected_tokens = []
    for token, prob in sorted_tokens:
        cumulative_prob += prob
        selected_tokens.append((token, prob))
        if cumulative_prob >= p:
            break
    return selected_tokens

# Example
probs = {
    "The": 0.40, "A": 0.25, "An": 0.15,
    "This": 0.10, "That": 0.05, "One": 0.03, "My": 0.02,
}
print(top_p_sampling(probs, 0.9))
# [("The", 0.40), ("A", 0.25), ("An", 0.15), ("This", 0.10)]
# Cumulative: 0.90 - others are excluded
```

Top-p vs Temperature

| Criteria | Temperature | Top-p |
|---|---|---|
| Control | Global across entire distribution | Cuts improbable tokens |
| Risk | Can select very improbable tokens | Guarantees reasonable tokens |
| Usage | Adjust "confidence" | Avoid aberrations |
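
The "Risk" row can be made concrete. In this minimal sketch (`softmax` and `nucleus_mask` are illustrative helpers, not a library API), a high temperature keeps an unlikely tail token in play, while nucleus sampling removes it entirely:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities at a given temperature."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_mask(probs, p):
    """Zero out tokens outside the smallest set with cumulative mass >= p."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = set(), 0.0
    for i in order:
        kept.add(i)
        cum += probs[i]
        if cum >= p:
            break
    masked = [probs[i] if i in kept else 0.0 for i in range(len(probs))]
    total = sum(masked)
    return [m / total for m in masked]

logits = [3.0, 1.0, -2.0]           # last token is very unlikely
hot = softmax(logits, temperature=2.0)
print(hot[-1] > 0)                  # high temperature keeps the tail alive
print(nucleus_mask(hot, 0.9)[-1])   # nucleus sampling removes it (0.0)
```

This is why the two parameters are complementary: temperature reshapes the whole distribution, top-p hard-cuts its tail.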

Recommended Combination for RAG

```python
# Recommended configuration for a support chatbot
rag_config = {
    "temperature": 0.3,  # Low creativity
    "top_p": 0.95,       # Keep 95% of probability mass
    "top_k": 50,         # Maximum 50 candidates
}

# Low temperature makes the distribution sharper
# Top-p drops the tail tokens outside the top 95% of cumulative mass
# Top-k puts a hard limit on the number of candidates
```

Top-k Sampling

Principle

Top-k keeps only the k most probable tokens:

```python
def top_k_sampling(probs: dict, k: int) -> dict:
    """
    Keep the k most probable tokens.
    """
    sorted_tokens = sorted(probs.items(), key=lambda x: x[1], reverse=True)
    top_k_tokens = dict(sorted_tokens[:k])

    # Renormalize
    total = sum(top_k_tokens.values())
    return {t: p / total for t, p in top_k_tokens.items()}

# Example
probs = {"A": 0.3, "B": 0.25, "C": 0.2, "D": 0.15, "E": 0.1}
print(top_k_sampling(probs, 3))
# {"A": 0.40, "B": 0.33, "C": 0.27}
# D and E excluded, others renormalized
```

When to Use Top-k

  • Small k (10-20): Very conservative responses
  • Medium k (40-60): Good balance (recommended for RAG)
  • Large k (100+): Almost no filtering
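
One caveat worth demonstrating: a fixed k behaves very differently depending on the shape of the distribution. This sketch (`mass_kept` is an illustrative helper) shows that the same k=2 barely touches a peaked distribution but discards most of a flat one:

```python
def mass_kept(probs: dict, k: int) -> float:
    """Fraction of the original probability mass that survives top-k."""
    top = sorted(probs.values(), reverse=True)[:k]
    return sum(top)

peaked = {"A": 0.7, "B": 0.15, "C": 0.08, "D": 0.04, "E": 0.03}
flat = {t: 0.2 for t in "ABCDE"}

print(mass_kept(peaked, 2))  # 0.85 - little is lost
print(mass_kept(flat, 2))    # 0.4  - most mass discarded
```

This is the main argument for preferring top-p, which adapts the cutoff to the distribution instead of using a fixed count.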

Frequency and Presence Penalties

Frequency Penalty

Penalizes tokens proportionally to their frequency in generated text:

```python
def apply_frequency_penalty(
    logits: dict,
    generated_tokens: list,
    penalty: float,
) -> dict:
    """
    Reduce probability of frequently used tokens.
    """
    token_counts = {}
    for token in generated_tokens:
        token_counts[token] = token_counts.get(token, 0) + 1

    adjusted_logits = {}
    for token, logit in logits.items():
        count = token_counts.get(token, 0)
        adjusted_logits[token] = logit - (penalty * count)
    return adjusted_logits
```

RAG usage: Avoids repetitive responses like "as mentioned previously..."

Presence Penalty

Penalizes any token that already appeared, regardless of frequency:

```python
def apply_presence_penalty(
    logits: dict,
    generated_tokens: set,
    penalty: float,
) -> dict:
    """
    Reduce probability of already-used tokens.
    """
    adjusted_logits = {}
    for token, logit in logits.items():
        if token in generated_tokens:
            adjusted_logits[token] = logit - penalty
        else:
            adjusted_logits[token] = logit
    return adjusted_logits
```

RAG usage: Encourages synonym use and lexical variety.
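
The difference between the two penalties is easiest to see on a single logit. This sketch combines both adjustments in one function (`penalized_logit` is an illustrative helper following the additive OpenAI-style formula): the frequency penalty scales with the repeat count, while the presence penalty is a flat one-time deduction.

```python
def penalized_logit(logit: float, count: int,
                    frequency_penalty: float, presence_penalty: float) -> float:
    """Apply additive penalties: frequency scales with the token's count,
    presence is deducted once if the token appeared at all."""
    presence = presence_penalty if count > 0 else 0.0
    return logit - frequency_penalty * count - presence

# A token that already appeared 3 times
print(penalized_logit(2.0, 3, frequency_penalty=0.5, presence_penalty=0.0))  # 0.5
print(penalized_logit(2.0, 3, frequency_penalty=0.0, presence_penalty=0.5))  # 1.5
```

With the same penalty value, repetition is punished much harder by the frequency penalty, which is why it is the one to reach for against verbatim loops.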

Optimal Configuration by Model

OpenAI GPT-4

```python
from openai import OpenAI

client = OpenAI()

# Recommended RAG configuration
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Context: {context}\n\nQuestion: {query}"},
    ],
    temperature=0.3,
    top_p=0.95,
    frequency_penalty=0.2,
    presence_penalty=0.1,
    max_tokens=500,
)
```

Anthropic Claude

```python
import anthropic

client = anthropic.Anthropic()

# Claude supports temperature, top_p, and top_k,
# but no frequency/presence penalties
response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=500,
    temperature=0.3,
    top_p=0.95,
    messages=[
        {"role": "user", "content": f"Context: {context}\n\nQuestion: {query}"}
    ],
)
```

Open-source Models (Llama, Mistral)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# RAG configuration
generation_config = {
    "temperature": 0.3,
    "top_p": 0.95,
    "top_k": 50,
    "repetition_penalty": 1.1,  # Multiplicative analogue of frequency_penalty
    "do_sample": True,
    "max_new_tokens": 500,
}

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids, **generation_config)
```

Advanced Strategies

1. Adaptive Sampling

Dynamically adjust parameters based on context:

```python
class AdaptiveSampler:
    def __init__(self):
        self.base_temperature = 0.3

    def get_config(self, query_type: str, context_quality: float) -> dict:
        """
        Adjust parameters based on context.
        - query_type: "factual", "creative", "mixed"
        - context_quality: 0-1, context relevance score
        """
        if query_type == "factual":
            # Factual questions: very deterministic
            temp = 0.1
        elif query_type == "creative":
            # Creative questions: more freedom
            temp = 0.7
        else:
            temp = self.base_temperature

        # If context quality is poor, be more conservative
        if context_quality < 0.5:
            temp *= 0.5  # Reduce temperature

        return {
            "temperature": temp,
            "top_p": 0.95 if context_quality > 0.7 else 0.85,
            "frequency_penalty": 0.2,
        }

# Usage
sampler = AdaptiveSampler()
config = sampler.get_config(query_type="factual", context_quality=0.85)
```

2. Variable Temperature by Section

Use different temperatures for different response parts:

```python
async def generate_structured_response(query: str, context: str):
    """
    Generate a response with different parameters per section.
    """
    # Section 1: Factual answer (low temp)
    factual_part = await llm.generate(
        prompt=f"Answer factually: {query}\nContext: {context}",
        temperature=0.1,
        max_tokens=200,
    )

    # Section 2: Explanation (medium temp)
    explanation = await llm.generate(
        prompt=f"Explain why: {factual_part}",
        temperature=0.4,
        max_tokens=150,
    )

    # Section 3: Suggestion (higher temp)
    suggestion = await llm.generate(
        prompt="Suggest alternatives or additions",
        temperature=0.6,
        max_tokens=100,
    )

    return {
        "answer": factual_part,
        "explanation": explanation,
        "suggestions": suggestion,
    }
```

3. A/B Testing Parameters

```python
import random
from dataclasses import dataclass

@dataclass
class SamplingVariant:
    name: str
    temperature: float
    top_p: float
    frequency_penalty: float

class SamplingABTester:
    def __init__(self):
        self.variants = [
            SamplingVariant("conservative", 0.1, 0.9, 0.0),
            SamplingVariant("balanced", 0.3, 0.95, 0.2),
            SamplingVariant("creative", 0.5, 1.0, 0.3),
        ]
        self.results = {
            v.name: {"count": 0, "satisfaction": []} for v in self.variants
        }

    def get_variant(self) -> SamplingVariant:
        return random.choice(self.variants)

    def record_feedback(self, variant_name: str, satisfaction: float):
        self.results[variant_name]["count"] += 1
        self.results[variant_name]["satisfaction"].append(satisfaction)

    def get_best_variant(self) -> str:
        avg_scores = {
            name: sum(data["satisfaction"]) / max(len(data["satisfaction"]), 1)
            for name, data in self.results.items()
        }
        return max(avg_scores, key=avg_scores.get)
```

Common Mistakes

1. Temperature Too High for Factual Content

```python
# ❌ Bad: high temperature for support
response = llm.generate(
    prompt="What is the return period?",
    temperature=1.0,  # Risk of hallucinations!
)

# ✅ Good: low temperature for factual questions
response = llm.generate(
    prompt="What is the return period?",
    temperature=0.2,
)
```

2. Ignoring Context Quality

```python
# ❌ Bad: same temperature regardless of context
config = {"temperature": 0.5}

# ✅ Good: adapt based on context quality
context_score = retriever.get_relevance_score(query, documents)
config = {
    "temperature": 0.2 if context_score < 0.6 else 0.4
}
```

3. Inconsistent Combinations

```python
# ❌ Inconsistent: low temperature + very low top_p
config = {
    "temperature": 0.1,
    "top_p": 0.5,  # Unnecessary double restriction
}

# ✅ Consistent: one main restriction
config = {
    "temperature": 0.2,
    "top_p": 0.95,  # Just to avoid aberrations
}
```

Integration with Ailog

Ailog lets you configure sampling parameters directly in the interface:

```python
from ailog import AilogClient

client = AilogClient(api_key="your-key")

# Configuration via interface or API
channel_config = {
    "generation": {
        "temperature": 0.3,
        "top_p": 0.95,
        "frequency_penalty": 0.2,
        "adaptive_sampling": True,  # Automatic adjustment
    }
}

client.update_channel_config("support-widget", channel_config)
```

Conclusion

Sampling parameters are powerful but subtle levers. In RAG:

  1. Low temperature (0.1-0.3) for factual responses
  2. Top-p around 0.95 to filter aberrations
  3. Light frequency penalty (0.1-0.3) to avoid repetitions
  4. Dynamic adaptation based on context quality
  5. A/B testing to find optimal configuration

Need optimal configuration without the hassle? Try Ailog - pre-optimized parameters by use case, adaptive adjustment included.

Tags

RAG · temperature · sampling · LLM · creativity · parameters · generation
