Temperature and Sampling in RAG: Controlling LLM Creativity
Complete guide to sampling parameters for RAG systems: temperature, top-p, top-k, frequency penalty. Optimize the balance between creativity and faithfulness.
TL;DR
Sampling parameters (temperature, top-p, top-k) control the level of creativity and determinism in LLM responses. In RAG, these parameters are critical: too much creativity = hallucinations, too much determinism = robotic responses. This guide teaches you how to calibrate these parameters for your use case.
Understanding Sampling Parameters
What is Sampling?
When an LLM generates text, it predicts a probability distribution over all possible next tokens. Sampling determines how the next token is chosen from these candidates.
```python
# Simplified probability distribution example
next_token_probs = {
    "quickly": 0.35,
    "fast": 0.25,
    "rapidly": 0.15,
    "immediately": 0.12,
    "instantly": 0.08,
    "...": 0.05,
}

# Without sampling (greedy): always "quickly"
# With sampling: might choose "fast", "rapidly", etc.
```
Parameter Overview
| Parameter | Range | Effect | Typical RAG usage |
|---|---|---|---|
| Temperature | 0.0 - 2.0 | Controls distribution "heat" | 0.1 - 0.5 |
| Top-p | 0.0 - 1.0 | Nucleus sampling | 0.9 - 1.0 |
| Top-k | 1 - 100+ | Limits candidates | 40 - 80 |
| Frequency penalty | -2.0 - 2.0 | Penalizes repetitions | 0.0 - 0.5 |
| Presence penalty | -2.0 - 2.0 | Encourages diversity | 0.0 - 0.3 |
Temperature in Detail
Mathematical Function
Temperature modifies the softmax distribution of probabilities:
```python
import numpy as np

def apply_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    """
    Apply temperature to logits.
    - temperature < 1: sharper distribution (more deterministic)
    - temperature = 1: original distribution
    - temperature > 1: flatter distribution (more random)
    """
    if temperature == 0:
        # Greedy: return a one-hot vector on the max logit
        result = np.zeros_like(logits)
        result[np.argmax(logits)] = 1.0
        return result

    scaled_logits = logits / temperature
    exp_logits = np.exp(scaled_logits - np.max(scaled_logits))
    return exp_logits / np.sum(exp_logits)

# Example
logits = np.array([2.0, 1.5, 1.0, 0.5, 0.2])
print("Temp 0.1:", apply_temperature(logits, 0.1))
# [0.99, 0.01, 0.00, 0.00, 0.00] - nearly deterministic
print("Temp 1.0:", apply_temperature(logits, 1.0))
# [0.42, 0.26, 0.16, 0.09, 0.07] - original distribution
print("Temp 2.0:", apply_temperature(logits, 2.0))
# [0.31, 0.24, 0.19, 0.14, 0.12] - more uniform
```
Visual Impact
```
Low Temperature (0.1)            High Temperature (1.5)
│                                │
│ ████████████                   │ ██████
│ ██                             │ █████
│ █                              │ ████
│                                │ ███
│                                │ ██
└────────────────                └────────────────
  Token 1 dominates                Flat distribution
```
Recommendations by RAG Use Case
| Use case | Temperature | Justification |
|---|---|---|
| Factual customer support | 0.1 - 0.2 | Maximum precision, no creativity |
| Automated FAQ | 0.2 - 0.3 | Slight variations acceptable |
| E-commerce assistant | 0.3 - 0.5 | Some personality |
| Assisted writing | 0.5 - 0.7 | Controlled creativity |
| Brainstorming | 0.7 - 1.0 | Varied ideas welcome |
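The ranges above can be wired into code. Here is a minimal sketch where the use-case keys and the midpoint values are illustrative assumptions, not a standard API:

```python
# Midpoints of the recommended ranges from the table above (illustrative)
USE_CASE_TEMPERATURE = {
    "factual_support": 0.15,
    "faq": 0.25,
    "ecommerce": 0.4,
    "writing": 0.6,
    "brainstorming": 0.85,
}

def temperature_for(use_case: str, default: float = 0.3) -> float:
    """Return a reasonable temperature for a use case, with a safe default."""
    return USE_CASE_TEMPERATURE.get(use_case, default)

print(temperature_for("factual_support"))  # 0.15
print(temperature_for("unknown"))          # 0.3 (falls back to the default)
```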
Top-p (Nucleus Sampling)
How it Works
Top-p selects the smallest set of tokens whose cumulative probability reaches p:
```python
def top_p_sampling(probs: dict, p: float) -> list:
    """
    Return the tokens whose cumulative probability reaches p.
    """
    # Sort by descending probability
    sorted_tokens = sorted(probs.items(), key=lambda x: x[1], reverse=True)

    cumulative_prob = 0.0
    selected_tokens = []
    for token, prob in sorted_tokens:
        cumulative_prob += prob
        selected_tokens.append((token, prob))
        if cumulative_prob >= p:
            break
    return selected_tokens

# Example
probs = {
    "The": 0.40, "A": 0.25, "An": 0.15,
    "This": 0.10, "That": 0.05, "One": 0.03, "My": 0.02,
}
print(top_p_sampling(probs, 0.9))
# [("The", 0.40), ("A", 0.25), ("An", 0.15), ("This", 0.10)]
# Cumulative: 0.90 - the rest are excluded
```
Top-p vs Temperature
| Criteria | Temperature | Top-p |
|---|---|---|
| Control | Global across entire distribution | Cuts improbable tokens |
| Risk | Can select very improbable tokens | Guarantees reasonable tokens |
| Usage | Adjust "confidence" | Avoid aberrations |
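To make the contrast concrete, here is a small sketch on toy logits: temperature reshapes every probability, while top-p keeps the original shape and only truncates the tail. `softmax_with_temperature` and `top_p_mask` are hypothetical helpers written for this illustration:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Softmax after dividing logits by the temperature."""
    scaled = np.asarray(logits) / temperature
    exp = np.exp(scaled - np.max(scaled))
    return exp / exp.sum()

def top_p_mask(probs, p):
    """Indices of the smallest set of tokens whose cumulative probability >= p."""
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1
    return set(order[:cutoff].tolist())

logits = np.array([2.0, 1.5, 1.0, 0.5, 0.2])
probs = softmax_with_temperature(logits, 1.0)

# Temperature rescales ALL probabilities toward the top token...
low_temp = softmax_with_temperature(logits, 0.3)
# ...while top-p keeps the distribution intact and drops the tail
kept = top_p_mask(probs, 0.9)

print(low_temp.round(2))  # top token dominates
print(kept)               # only the lowest-probability token is cut
```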
Recommended Combination for RAG
```python
# Recommended configuration for a support chatbot
rag_config = {
    "temperature": 0.3,  # Low creativity
    "top_p": 0.95,       # Keep 95% of probability mass
    "top_k": 50,         # At most 50 candidates
}

# Low temperature sharpens the distribution
# Top-p drops the low-probability tail (the last 5% of cumulative mass)
# Top-k puts a hard cap on the number of candidates
```
Top-k Sampling
Principle
Top-k keeps only the k most probable tokens:
```python
def top_k_sampling(probs: dict, k: int) -> dict:
    """
    Keep the k most probable tokens and renormalize.
    """
    sorted_tokens = sorted(probs.items(), key=lambda x: x[1], reverse=True)
    top_k_tokens = dict(sorted_tokens[:k])

    # Renormalize so the kept probabilities sum to 1
    total = sum(top_k_tokens.values())
    return {t: p / total for t, p in top_k_tokens.items()}

# Example
probs = {"A": 0.3, "B": 0.25, "C": 0.2, "D": 0.15, "E": 0.1}
print(top_k_sampling(probs, 3))
# {"A": 0.40, "B": 0.33, "C": 0.27}
# D and E excluded, the rest renormalized
```
When to Use Top-k
- Small k (10-20): Very conservative responses
- Medium k (40-60): Good balance (recommended for RAG)
- Large k (100+): Almost no filtering
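A quick sketch of this guidance on a toy 8-token distribution (the tokens and values are illustrative): a small k is aggressive, while a k larger than the candidate set is effectively a no-op.

```python
# Toy distribution over 8 candidate tokens (illustrative values)
probs = {"the": 0.30, "a": 0.20, "this": 0.15, "that": 0.12,
         "one": 0.10, "my": 0.06, "our": 0.04, "its": 0.03}

def top_k_candidates(probs: dict, k: int) -> list:
    """Return the k most probable tokens, best first."""
    ranked = sorted(probs.items(), key=lambda x: x[1], reverse=True)
    return [token for token, _ in ranked[:k]]

print(top_k_candidates(probs, 3))        # ['the', 'a', 'this']
print(len(top_k_candidates(probs, 100)))  # 8 - k beyond the vocab filters nothing
```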
Frequency and Presence Penalties
Frequency Penalty
Penalizes tokens proportionally to their frequency in generated text:
```python
def apply_frequency_penalty(
    logits: dict,
    generated_tokens: list,
    penalty: float,
) -> dict:
    """
    Reduce the probability of frequently used tokens.
    """
    token_counts = {}
    for token in generated_tokens:
        token_counts[token] = token_counts.get(token, 0) + 1

    adjusted_logits = {}
    for token, logit in logits.items():
        count = token_counts.get(token, 0)
        adjusted_logits[token] = logit - (penalty * count)
    return adjusted_logits
```
RAG usage: Avoids repetitive responses like "as mentioned previously..."
Presence Penalty
Penalizes any token that already appeared, regardless of frequency:
```python
def apply_presence_penalty(
    logits: dict,
    generated_tokens: set,
    penalty: float,
) -> dict:
    """
    Reduce the probability of already-used tokens.
    """
    adjusted_logits = {}
    for token, logit in logits.items():
        if token in generated_tokens:
            adjusted_logits[token] = logit - penalty
        else:
            adjusted_logits[token] = logit
    return adjusted_logits
```
RAG usage: Encourages synonym use and lexical variety.
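In practice the two penalties are applied together. A minimal self-contained sketch combining them, following the adjustment the OpenAI API documents (logit minus `frequency_penalty` times the count, minus `presence_penalty` if the token appeared at all); the token names and values are illustrative:

```python
def penalized_logits(logits: dict, history: list,
                     freq_penalty: float, pres_penalty: float) -> dict:
    """Apply frequency and presence penalties together:
    adjusted = logit - freq_penalty * count - pres_penalty * (1 if seen else 0)
    """
    counts = {}
    for tok in history:
        counts[tok] = counts.get(tok, 0) + 1
    return {
        tok: logit
             - freq_penalty * counts.get(tok, 0)
             - pres_penalty * (1.0 if tok in counts else 0.0)
        for tok, logit in logits.items()
    }

logits = {"previously": 2.0, "earlier": 1.8, "however": 1.5}
history = ["previously", "previously", "earlier"]
adjusted = penalized_logits(logits, history, freq_penalty=0.3, pres_penalty=0.2)
# "previously" drops the most (2 occurrences); "however" is untouched
```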
Optimal Configuration by Model
OpenAI GPT-4
```python
from openai import OpenAI

client = OpenAI()

# Recommended RAG configuration
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Context: {context}\n\nQuestion: {query}"},
    ],
    temperature=0.3,
    top_p=0.95,
    frequency_penalty=0.2,
    presence_penalty=0.1,
    max_tokens=500,
)
```
Anthropic Claude
```python
import anthropic

client = anthropic.Anthropic()

# Claude supports temperature, top_p and top_k
# (no frequency/presence penalties)
response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=500,
    temperature=0.3,
    top_p=0.95,
    messages=[
        {"role": "user", "content": f"Context: {context}\n\nQuestion: {query}"},
    ],
)
```
Open-source Models (Llama, Mistral)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# RAG configuration
generation_config = {
    "temperature": 0.3,
    "top_p": 0.95,
    "top_k": 50,
    "repetition_penalty": 1.1,  # Analogous to frequency_penalty
    "do_sample": True,
    "max_new_tokens": 500,
}

# prompt built elsewhere from the context and the query
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids, **generation_config)
```
Advanced Strategies
1. Adaptive Sampling
Dynamically adjust parameters based on context:
```python
class AdaptiveSampler:
    def __init__(self):
        self.base_temperature = 0.3

    def get_config(self, query_type: str, context_quality: float) -> dict:
        """
        Adjust parameters based on context.
        - query_type: "factual", "creative", "mixed"
        - context_quality: 0-1, context relevance score
        """
        if query_type == "factual":
            # Factual questions: very deterministic
            temp = 0.1
        elif query_type == "creative":
            # Creative questions: more freedom
            temp = 0.7
        else:
            temp = self.base_temperature

        # If context quality is poor, be more conservative
        if context_quality < 0.5:
            temp *= 0.5  # Reduce temperature

        return {
            "temperature": temp,
            "top_p": 0.95 if context_quality > 0.7 else 0.85,
            "frequency_penalty": 0.2,
        }

# Usage
sampler = AdaptiveSampler()
config = sampler.get_config(query_type="factual", context_quality=0.85)
```
2. Variable Temperature by Section
Use different temperatures for different response parts:
```python
async def generate_structured_response(query: str, context: str):
    """
    Generate a response with different parameters per section.
    """
    # Section 1: factual answer (low temperature)
    factual_part = await llm.generate(
        prompt=f"Answer factually: {query}\nContext: {context}",
        temperature=0.1,
        max_tokens=200,
    )

    # Section 2: explanation (medium temperature)
    explanation = await llm.generate(
        prompt=f"Explain why: {factual_part}",
        temperature=0.4,
        max_tokens=150,
    )

    # Section 3: suggestions (higher temperature)
    suggestion = await llm.generate(
        prompt="Suggest alternatives or additions",
        temperature=0.6,
        max_tokens=100,
    )

    return {
        "answer": factual_part,
        "explanation": explanation,
        "suggestions": suggestion,
    }
```
3. A/B Testing Parameters
```python
import random
from dataclasses import dataclass

@dataclass
class SamplingVariant:
    name: str
    temperature: float
    top_p: float
    frequency_penalty: float

class SamplingABTester:
    def __init__(self):
        self.variants = [
            SamplingVariant("conservative", 0.1, 0.9, 0.0),
            SamplingVariant("balanced", 0.3, 0.95, 0.2),
            SamplingVariant("creative", 0.5, 1.0, 0.3),
        ]
        self.results = {
            v.name: {"count": 0, "satisfaction": []} for v in self.variants
        }

    def get_variant(self) -> SamplingVariant:
        return random.choice(self.variants)

    def record_feedback(self, variant_name: str, satisfaction: float):
        self.results[variant_name]["count"] += 1
        self.results[variant_name]["satisfaction"].append(satisfaction)

    def get_best_variant(self) -> str:
        avg_scores = {
            name: sum(data["satisfaction"]) / max(len(data["satisfaction"]), 1)
            for name, data in self.results.items()
        }
        return max(avg_scores, key=avg_scores.get)
```
Common Mistakes
1. Temperature Too High for Factual Content
```python
# ❌ Bad: high temperature for support
response = llm.generate(
    prompt="What is the return period?",
    temperature=1.0,  # Risk of hallucinations!
)

# ✅ Good: low temperature for factual questions
response = llm.generate(
    prompt="What is the return period?",
    temperature=0.2,
)
```
2. Ignoring Context Quality
```python
# ❌ Bad: same temperature regardless of context
config = {"temperature": 0.5}

# ✅ Good: adapt based on context quality
context_score = retriever.get_relevance_score(query, documents)
config = {
    "temperature": 0.2 if context_score < 0.6 else 0.4,
}
```
3. Inconsistent Combinations
```python
# ❌ Inconsistent: low temperature + very low top_p
config = {
    "temperature": 0.1,
    "top_p": 0.5,  # Unnecessary double restriction
}

# ✅ Consistent: one main restriction
config = {
    "temperature": 0.2,
    "top_p": 0.95,  # Just to trim aberrations
}
```
Integration with Ailog
Ailog lets you configure sampling parameters directly in the interface:
```python
from ailog import AilogClient

client = AilogClient(api_key="your-key")

# Configuration via the interface or the API
channel_config = {
    "generation": {
        "temperature": 0.3,
        "top_p": 0.95,
        "frequency_penalty": 0.2,
        "adaptive_sampling": True,  # Automatic adjustment
    }
}

client.update_channel_config("support-widget", channel_config)
```
Conclusion
Sampling parameters are powerful but subtle levers. In RAG:
- Low temperature (0.1-0.3) for factual responses
- Top-p around 0.95 to filter aberrations
- Light frequency penalty (0.1-0.3) to avoid repetitions
- Dynamic adaptation based on context quality
- A/B testing to find optimal configuration
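As a starting point, these conclusions can be collected into a single baseline configuration. The values and the 0-1 `context_score` convention are assumptions to tune for your own stack:

```python
# Illustrative baseline collecting the conclusions above; tune per use case
RAG_SAMPLING_BASELINE = {
    "temperature": 0.2,        # factual, low creativity
    "top_p": 0.95,             # trim the improbable tail
    "frequency_penalty": 0.2,  # discourage repetition
    "presence_penalty": 0.1,   # mild push toward lexical variety
}

def with_context_quality(base: dict, context_score: float) -> dict:
    """Lower the temperature when retrieval quality is poor (assumed 0-1 score)."""
    config = dict(base)
    if context_score < 0.5:
        config["temperature"] = round(config["temperature"] * 0.5, 3)
    return config

# Usage: poor retrieval halves the temperature, good retrieval keeps the baseline
print(with_context_quality(RAG_SAMPLING_BASELINE, 0.3)["temperature"])  # 0.1
print(with_context_quality(RAG_SAMPLING_BASELINE, 0.8)["temperature"])  # 0.2
```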
Additional Resources
- Introduction to RAG - Fundamentals
- LLM Generation for RAG - Parent guide
- RAG Prompt Engineering - Optimize prompts
- Chain-of-Thought RAG - Step-by-step reasoning
Need optimal configuration without the hassle? Try Ailog - pre-optimized parameters by use case, adaptive adjustment included.
Related Posts
RAG Generation: Choosing and Optimizing Your LLM
Complete guide to selecting and configuring your LLM in a RAG system: prompting, temperature, tokens, and response optimization.
RAG Agents: Orchestrating Multi-Agent Systems
Architect multi-agent RAG systems: orchestration, specialization, collaboration and failure handling for complex assistants.
Conversational RAG: Memory and Multi-Session Context
Implement RAG with conversational memory: context management, multi-session history, and personalized responses.