Temperature and Sampling in RAG: Controlling LLM Creativity
Complete guide to sampling parameters for RAG systems: temperature, top-p, top-k, frequency penalty. Optimize the balance between creativity and faithfulness.
TL;DR
Sampling parameters (temperature, top-p, top-k) control the level of creativity and determinism in LLM responses. In RAG, these parameters are critical: too much creativity = hallucinations, too much determinism = robotic responses. This guide teaches you how to calibrate these parameters for your use case.
Understanding Sampling Parameters
What is Sampling?
When an LLM generates text, it predicts a probability distribution over all possible next tokens. Sampling determines how the next token is chosen from these candidates.
```python
# Simplified probability distribution example
next_token_probs = {
    "quickly": 0.35,
    "fast": 0.25,
    "rapidly": 0.15,
    "immediately": 0.12,
    "instantly": 0.08,
    "...": 0.05,
}

# Without sampling (greedy): always "quickly"
# With sampling: might choose "fast", "rapidly", etc.
```
Parameter Overview
| Parameter | Range | Effect | Typical RAG usage |
|---|---|---|---|
| Temperature | 0.0 - 2.0 | Controls distribution "heat" | 0.1 - 0.5 |
| Top-p | 0.0 - 1.0 | Nucleus sampling | 0.9 - 1.0 |
| Top-k | 1 - 100+ | Limits candidates | 40 - 80 |
| Frequency penalty | -2.0 - 2.0 | Penalizes repetitions | 0.0 - 0.5 |
| Presence penalty | -2.0 - 2.0 | Encourages diversity | 0.0 - 0.3 |
Temperature in Detail
Mathematical Function
Temperature modifies the softmax distribution of probabilities:
```python
import numpy as np

def apply_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    """
    Apply temperature to logits.
    - temperature < 1: sharper distribution (more deterministic)
    - temperature = 1: original distribution
    - temperature > 1: flatter distribution (more random)
    """
    if temperature == 0:
        # Greedy: return a one-hot vector on the max logit
        result = np.zeros_like(logits)
        result[np.argmax(logits)] = 1.0
        return result

    scaled_logits = logits / temperature
    exp_logits = np.exp(scaled_logits - np.max(scaled_logits))
    return exp_logits / np.sum(exp_logits)

# Example
logits = np.array([2.0, 1.5, 1.0, 0.5, 0.2])
print("Temp 0.1:", apply_temperature(logits, 0.1))
# [0.99, 0.01, 0.00, 0.00, 0.00] - nearly deterministic
print("Temp 1.0:", apply_temperature(logits, 1.0))
# [0.42, 0.26, 0.16, 0.09, 0.07] - original distribution
print("Temp 2.0:", apply_temperature(logits, 2.0))
# [0.31, 0.24, 0.19, 0.14, 0.12] - more uniform
```
Visual Impact
```
Low Temperature (0.1)            High Temperature (1.5)
│                                │
│ ████████████                   │ ██████
│ ██                             │ █████
│ █                              │ ████
│                                │ ███
│                                │ ██
└────────────────                └────────────────
  Token 1 dominates                Flat distribution
```
Recommendations by RAG Use Case
| Use case | Temperature | Justification |
|---|---|---|
| Factual customer support | 0.1 - 0.2 | Maximum precision, no creativity |
| Automated FAQ | 0.2 - 0.3 | Slight variations acceptable |
| E-commerce assistant | 0.3 - 0.5 | Some personality |
| Assisted writing | 0.5 - 0.7 | Controlled creativity |
| Brainstorming | 0.7 - 1.0 | Varied ideas welcome |
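The ranges above can be wired into code. Here is a minimal sketch where the use-case keys and the midpoint values are illustrative assumptions, not a standard API:

```python
# Midpoints of the recommended ranges from the table above (illustrative)
USE_CASE_TEMPERATURE = {
    "factual_support": 0.15,
    "faq": 0.25,
    "ecommerce": 0.4,
    "writing": 0.6,
    "brainstorming": 0.85,
}

def temperature_for(use_case: str, default: float = 0.3) -> float:
    """Return a reasonable temperature for a use case, with a safe default."""
    return USE_CASE_TEMPERATURE.get(use_case, default)

print(temperature_for("factual_support"))  # 0.15
print(temperature_for("unknown"))          # 0.3 (falls back to the default)
```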
Top-p (Nucleus Sampling)
How it Works
Top-p selects the smallest set of tokens whose cumulative probability reaches p:
```python
def top_p_sampling(probs: dict, p: float) -> list:
    """
    Return the tokens whose cumulative probability reaches p.
    """
    # Sort by descending probability
    sorted_tokens = sorted(probs.items(), key=lambda x: x[1], reverse=True)

    cumulative_prob = 0.0
    selected_tokens = []
    for token, prob in sorted_tokens:
        cumulative_prob += prob
        selected_tokens.append((token, prob))
        if cumulative_prob >= p:
            break
    return selected_tokens

# Example
probs = {
    "The": 0.40, "A": 0.25, "An": 0.15,
    "This": 0.10, "That": 0.05, "One": 0.03, "My": 0.02,
}
print(top_p_sampling(probs, 0.9))
# [("The", 0.40), ("A", 0.25), ("An", 0.15), ("This", 0.10)]
# Cumulative: 0.90 - the rest are excluded
```
Top-p vs Temperature
| Criteria | Temperature | Top-p |
|---|---|---|
| Control | Global across entire distribution | Cuts improbable tokens |
| Risk | Can select very improbable tokens | Guarantees reasonable tokens |
| Usage | Adjust "confidence" | Avoid aberrations |
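To make the contrast concrete, here is a small sketch on toy logits: temperature reshapes every probability, while top-p keeps the original shape and only truncates the tail. `softmax_with_temperature` and `top_p_mask` are hypothetical helpers written for this illustration:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Softmax after dividing logits by the temperature."""
    scaled = np.asarray(logits) / temperature
    exp = np.exp(scaled - np.max(scaled))
    return exp / exp.sum()

def top_p_mask(probs, p):
    """Indices of the smallest set of tokens whose cumulative probability >= p."""
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1
    return set(order[:cutoff].tolist())

logits = np.array([2.0, 1.5, 1.0, 0.5, 0.2])
probs = softmax_with_temperature(logits, 1.0)

# Temperature rescales ALL probabilities toward the top token...
low_temp = softmax_with_temperature(logits, 0.3)
# ...while top-p keeps the distribution intact and drops the tail
kept = top_p_mask(probs, 0.9)

print(low_temp.round(2))  # top token dominates
print(kept)               # only the lowest-probability token is cut
```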
Recommended Combination for RAG
```python
# Recommended configuration for a support chatbot
rag_config = {
    "temperature": 0.3,  # Low creativity
    "top_p": 0.95,       # Keep 95% of probability mass
    "top_k": 50,         # At most 50 candidates
}

# Low temperature sharpens the distribution
# Top-p drops the low-probability tail (the last 5% of cumulative mass)
# Top-k puts a hard cap on the number of candidates
```
Top-k Sampling
Principle
Top-k keeps only the k most probable tokens:
```python
def top_k_sampling(probs: dict, k: int) -> dict:
    """
    Keep the k most probable tokens and renormalize.
    """
    sorted_tokens = sorted(probs.items(), key=lambda x: x[1], reverse=True)
    top_k_tokens = dict(sorted_tokens[:k])

    # Renormalize so the kept probabilities sum to 1
    total = sum(top_k_tokens.values())
    return {t: p / total for t, p in top_k_tokens.items()}

# Example
probs = {"A": 0.3, "B": 0.25, "C": 0.2, "D": 0.15, "E": 0.1}
print(top_k_sampling(probs, 3))
# {"A": 0.40, "B": 0.33, "C": 0.27}
# D and E excluded, the rest renormalized
```
When to Use Top-k
- Small k (10-20): Very conservative responses
- Medium k (40-60): Good balance (recommended for RAG)
- Large k (100+): Almost no filtering
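A quick sketch of this guidance on a toy 8-token distribution (the tokens and values are illustrative): a small k is aggressive, while a k larger than the candidate set is effectively a no-op.

```python
# Toy distribution over 8 candidate tokens (illustrative values)
probs = {"the": 0.30, "a": 0.20, "this": 0.15, "that": 0.12,
         "one": 0.10, "my": 0.06, "our": 0.04, "its": 0.03}

def top_k_candidates(probs: dict, k: int) -> list:
    """Return the k most probable tokens, best first."""
    ranked = sorted(probs.items(), key=lambda x: x[1], reverse=True)
    return [token for token, _ in ranked[:k]]

print(top_k_candidates(probs, 3))        # ['the', 'a', 'this']
print(len(top_k_candidates(probs, 100)))  # 8 - k beyond the vocab filters nothing
```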
Frequency and Presence Penalties
Frequency Penalty
Penalizes tokens proportionally to their frequency in generated text:
```python
def apply_frequency_penalty(
    logits: dict,
    generated_tokens: list,
    penalty: float,
) -> dict:
    """
    Reduce the probability of frequently used tokens.
    """
    token_counts = {}
    for token in generated_tokens:
        token_counts[token] = token_counts.get(token, 0) + 1

    adjusted_logits = {}
    for token, logit in logits.items():
        count = token_counts.get(token, 0)
        adjusted_logits[token] = logit - (penalty * count)
    return adjusted_logits
```
RAG usage: Avoids repetitive responses like "as mentioned previously..."
Presence Penalty
Penalizes any token that already appeared, regardless of frequency:
```python
def apply_presence_penalty(
    logits: dict,
    generated_tokens: set,
    penalty: float,
) -> dict:
    """
    Reduce the probability of already-used tokens.
    """
    adjusted_logits = {}
    for token, logit in logits.items():
        if token in generated_tokens:
            adjusted_logits[token] = logit - penalty
        else:
            adjusted_logits[token] = logit
    return adjusted_logits
```
RAG usage: Encourages synonym use and lexical variety.
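In practice the two penalties are applied together. A minimal self-contained sketch combining them, following the adjustment the OpenAI API documents (logit minus `frequency_penalty` times the count, minus `presence_penalty` if the token appeared at all); the token names and values are illustrative:

```python
def penalized_logits(logits: dict, history: list,
                     freq_penalty: float, pres_penalty: float) -> dict:
    """Apply frequency and presence penalties together:
    adjusted = logit - freq_penalty * count - pres_penalty * (1 if seen else 0)
    """
    counts = {}
    for tok in history:
        counts[tok] = counts.get(tok, 0) + 1
    return {
        tok: logit
             - freq_penalty * counts.get(tok, 0)
             - pres_penalty * (1.0 if tok in counts else 0.0)
        for tok, logit in logits.items()
    }

logits = {"previously": 2.0, "earlier": 1.8, "however": 1.5}
history = ["previously", "previously", "earlier"]
adjusted = penalized_logits(logits, history, freq_penalty=0.3, pres_penalty=0.2)
# "previously" drops the most (2 occurrences); "however" is untouched
```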
Optimal Configuration by Model
OpenAI GPT-4
```python
from openai import OpenAI

client = OpenAI()

# Recommended RAG configuration
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Context: {context}\n\nQuestion: {query}"},
    ],
    temperature=0.3,
    top_p=0.95,
    frequency_penalty=0.2,
    presence_penalty=0.1,
    max_tokens=500,
)
```
Anthropic Claude
```python
import anthropic

client = anthropic.Anthropic()

# Claude supports temperature, top_p and top_k
# (no frequency/presence penalties)
response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=500,
    temperature=0.3,
    top_p=0.95,
    messages=[
        {"role": "user", "content": f"Context: {context}\n\nQuestion: {query}"},
    ],
)
```
Open-source Models (Llama, Mistral)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# RAG configuration
generation_config = {
    "temperature": 0.3,
    "top_p": 0.95,
    "top_k": 50,
    "repetition_penalty": 1.1,  # Analogous to frequency_penalty
    "do_sample": True,
    "max_new_tokens": 500,
}

# prompt built elsewhere from the context and the query
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids, **generation_config)
```
Advanced Strategies
1. Adaptive Sampling
Dynamically adjust parameters based on context:
```python
class AdaptiveSampler:
    def __init__(self):
        self.base_temperature = 0.3

    def get_config(self, query_type: str, context_quality: float) -> dict:
        """
        Adjust parameters based on context.
        - query_type: "factual", "creative", "mixed"
        - context_quality: 0-1, context relevance score
        """
        if query_type == "factual":
            # Factual questions: very deterministic
            temp = 0.1
        elif query_type == "creative":
            # Creative questions: more freedom
            temp = 0.7
        else:
            temp = self.base_temperature

        # If context quality is poor, be more conservative
        if context_quality < 0.5:
            temp *= 0.5  # Reduce temperature

        return {
            "temperature": temp,
            "top_p": 0.95 if context_quality > 0.7 else 0.85,
            "frequency_penalty": 0.2,
        }

# Usage
sampler = AdaptiveSampler()
config = sampler.get_config(query_type="factual", context_quality=0.85)
```
2. Variable Temperature by Section
Use different temperatures for different response parts:
```python
async def generate_structured_response(query: str, context: str):
    """
    Generate a response with different parameters per section.
    """
    # Section 1: factual answer (low temperature)
    factual_part = await llm.generate(
        prompt=f"Answer factually: {query}\nContext: {context}",
        temperature=0.1,
        max_tokens=200,
    )

    # Section 2: explanation (medium temperature)
    explanation = await llm.generate(
        prompt=f"Explain why: {factual_part}",
        temperature=0.4,
        max_tokens=150,
    )

    # Section 3: suggestions (higher temperature)
    suggestion = await llm.generate(
        prompt="Suggest alternatives or additions",
        temperature=0.6,
        max_tokens=100,
    )

    return {
        "answer": factual_part,
        "explanation": explanation,
        "suggestions": suggestion,
    }
```
3. A/B Testing Parameters
```python
import random
from dataclasses import dataclass

@dataclass
class SamplingVariant:
    name: str
    temperature: float
    top_p: float
    frequency_penalty: float

class SamplingABTester:
    def __init__(self):
        self.variants = [
            SamplingVariant("conservative", 0.1, 0.9, 0.0),
            SamplingVariant("balanced", 0.3, 0.95, 0.2),
            SamplingVariant("creative", 0.5, 1.0, 0.3),
        ]
        self.results = {
            v.name: {"count": 0, "satisfaction": []} for v in self.variants
        }

    def get_variant(self) -> SamplingVariant:
        return random.choice(self.variants)

    def record_feedback(self, variant_name: str, satisfaction: float):
        self.results[variant_name]["count"] += 1
        self.results[variant_name]["satisfaction"].append(satisfaction)

    def get_best_variant(self) -> str:
        avg_scores = {
            name: sum(data["satisfaction"]) / max(len(data["satisfaction"]), 1)
            for name, data in self.results.items()
        }
        return max(avg_scores, key=avg_scores.get)
```
Common Mistakes
1. Temperature Too High for Factual Content
```python
# ❌ Bad: high temperature for support
response = llm.generate(
    prompt="What is the return period?",
    temperature=1.0,  # Risk of hallucinations!
)

# ✅ Good: low temperature for factual questions
response = llm.generate(
    prompt="What is the return period?",
    temperature=0.2,
)
```
2. Ignoring Context Quality
```python
# ❌ Bad: same temperature regardless of context
config = {"temperature": 0.5}

# ✅ Good: adapt based on context quality
context_score = retriever.get_relevance_score(query, documents)
config = {
    "temperature": 0.2 if context_score < 0.6 else 0.4,
}
```
3. Inconsistent Combinations
```python
# ❌ Inconsistent: low temperature + very low top_p
config = {
    "temperature": 0.1,
    "top_p": 0.5,  # Unnecessary double restriction
}

# ✅ Consistent: one main restriction
config = {
    "temperature": 0.2,
    "top_p": 0.95,  # Just to trim aberrations
}
```
Integration with Ailog
Ailog lets you configure sampling parameters directly in the interface:
```python
from ailog import AilogClient

client = AilogClient(api_key="your-key")

# Configuration via the interface or the API
channel_config = {
    "generation": {
        "temperature": 0.3,
        "top_p": 0.95,
        "frequency_penalty": 0.2,
        "adaptive_sampling": True,  # Automatic adjustment
    }
}

client.update_channel_config("support-widget", channel_config)
```
Conclusion
Sampling parameters are powerful but subtle levers. In RAG:
- Low temperature (0.1-0.3) for factual responses
- Top-p around 0.95 to filter aberrations
- Light frequency penalty (0.1-0.3) to avoid repetitions
- Dynamic adaptation based on context quality
- A/B testing to find optimal configuration
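As a starting point, these conclusions can be collected into a single baseline configuration. The values and the 0-1 `context_score` convention are assumptions to tune for your own stack:

```python
# Illustrative baseline collecting the conclusions above; tune per use case
RAG_SAMPLING_BASELINE = {
    "temperature": 0.2,        # factual, low creativity
    "top_p": 0.95,             # trim the improbable tail
    "frequency_penalty": 0.2,  # discourage repetition
    "presence_penalty": 0.1,   # mild push toward lexical variety
}

def with_context_quality(base: dict, context_score: float) -> dict:
    """Lower the temperature when retrieval quality is poor (assumed 0-1 score)."""
    config = dict(base)
    if context_score < 0.5:
        config["temperature"] = round(config["temperature"] * 0.5, 3)
    return config

# Usage: poor retrieval halves the temperature, good retrieval keeps the baseline
print(with_context_quality(RAG_SAMPLING_BASELINE, 0.3)["temperature"])  # 0.1
print(with_context_quality(RAG_SAMPLING_BASELINE, 0.8)["temperature"])  # 0.2
```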
Additional Resources
- Introduction to RAG - Fundamentals
- LLM Generation for RAG - Parent guide
- RAG Prompt Engineering - Optimize prompts
- Chain-of-Thought RAG - Step-by-step reasoning
Need optimal configuration without the hassle? Try Ailog - pre-optimized parameters by use case, adaptive adjustment included.
Related Posts
RAG Generation: Choosing and Optimizing Your LLM
Complete guide to selecting and configuring your LLM in a RAG system: prompting, temperature, tokens, and response optimization.
RAG Agents: Orchestrating Multi-Agent Systems
Architect multi-agent RAG systems: orchestration, specialization, collaboration and failure handling for complex assistants.
Conversational RAG: Memory and Multi-Session Context
Implement RAG with conversational memory: context management, multi-session history, and personalized responses.