7. Advanced Optimization

Context Window Optimization: Managing Token Limits

March 1, 2025
11 min read
Ailog Research Team

Strategies for fitting more information in limited context windows: compression, summarization, smart selection, and window management techniques.

The Context Window Challenge

LLMs have fixed context windows:

Model             Context Window
GPT-3.5 Turbo     16K tokens
GPT-4             8K / 32K tokens
GPT-4 Turbo       128K tokens
Claude 2          100K tokens
Claude 3          200K tokens
Llama 2           4K tokens
Gemini 1.5 Pro    1M tokens

The problem:

System prompt: 500 tokens
User query: 100 tokens
Retrieved contexts: 5 chunks × 512 tokens = 2,560 tokens
Conversation history: 1,000 tokens
─────────────────────────────────────────
Total input: 4,160 tokens
Max output: 1,000 tokens
─────────────────────────────────────────
Total: 5,160 tokens (fits in 8K window)

As your RAG system scales:

  • More retrieved chunks
  • Longer conversations
  • More complex prompts
  • Larger documents

You'll hit context limits. Optimization is essential.

Token Counting

Accurate Token Counting

python
import tiktoken

def count_tokens(text: str, model="gpt-4") -> int:
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

# Example
text = "Hello, how are you?"
tokens = count_tokens(text)  # ~6 tokens

Context Budget Calculation

python
class ContextBudget:
    def __init__(self, model="gpt-4", max_output_tokens=1000):
        self.model = model
        self.max_output = max_output_tokens

        # Context windows (in tokens)
        self.windows = {
            "gpt-3.5-turbo": 16385,
            "gpt-4": 8192,
            "gpt-4-32k": 32768,
            "gpt-4-turbo": 128000,
            "claude-2": 100000,
            "claude-3": 200000,
        }

        self.total_window = self.windows.get(model, 8192)
        self.available_input = self.total_window - max_output_tokens

    def allocate(self, system=500, query=200, history=1000):
        """Allocate tokens for the different components."""
        fixed_tokens = system + query + history
        available_for_context = self.available_input - fixed_tokens

        return {
            'system_prompt': system,
            'query': query,
            'history': history,
            'context': available_for_context,
            'output': self.max_output,
            'total': fixed_tokens + available_for_context + self.max_output,
        }

# Usage
budget = ContextBudget(model="gpt-4")
allocation = budget.allocate()
print(f"Available for retrieved context: {allocation['context']} tokens")
# Available for retrieved context: 5492 tokens

Chunk Selection Strategies

Top-K with Token Limit

python
from typing import List

def select_chunks_within_budget(chunks: List[dict], budget: int, model="gpt-4") -> List[dict]:
    """Select as many top chunks as possible within the token budget."""
    selected = []
    total_tokens = 0

    for chunk in chunks:
        chunk_tokens = count_tokens(chunk['content'], model)
        if total_tokens + chunk_tokens <= budget:
            selected.append(chunk)
            total_tokens += chunk_tokens
        else:
            break

    return selected

# Usage
chunks = retriever.retrieve(query, k=20)  # Get more than needed
selected = select_chunks_within_budget(chunks, budget=5000)
# Might return 8-12 chunks depending on length

Priority-Based Selection

python
def priority_select_chunks(chunks: List[dict], budget: int, weights: dict) -> List[dict]:
    """Select chunks based on multiple criteria."""
    # Score chunks
    for chunk in chunks:
        score = (
            weights['relevance'] * chunk['similarity_score'] +
            weights['recency'] * chunk['recency_score'] +
            weights['authority'] * chunk['authority_score']
        )
        chunk['priority_score'] = score

    # Sort by priority
    sorted_chunks = sorted(chunks, key=lambda x: x['priority_score'], reverse=True)

    # Select within budget
    return select_chunks_within_budget(sorted_chunks, budget)
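A usage sketch, assuming each retrieved chunk carries `similarity_score`, `recency_score`, and `authority_score` fields normalized to [0, 1] (the field names and weight values are illustrative):

python
# Hypothetical usage: weights sum to 1, per-chunk scores assumed normalized to [0, 1]
weights = {'relevance': 0.6, 'recency': 0.2, 'authority': 0.2}

chunks = retriever.retrieve(query, k=20)
selected = priority_select_chunks(chunks, budget=5000, weights=weights)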

Compression Techniques

Extractive Summarization

Condense each chunk while keeping only its key content. Note that the BART model below produces abstractive summaries (it rewrites the text); a strictly extractive sentence-selection sketch follows the code.

python
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def compress_chunk(chunk: str, max_length: int = 130) -> str:
    """Compress a chunk while preserving key information."""
    summary = summarizer(
        chunk,
        max_length=max_length,
        min_length=30,
        do_sample=False
    )
    return summary[0]['summary_text']

# Usage
original = "Very long chunk of text..."  # 500 tokens
compressed = compress_chunk(original, max_length=130)  # ~100 tokens
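For a purely extractive variant, one option is to keep only the sentences most similar to the query. A minimal sketch, assuming a sentence-transformers encoder (the model name and the naive sentence split are assumptions, not part of the original example):

python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed embedding model; any sentence-level encoder works
model = SentenceTransformer("all-MiniLM-L6-v2")

def extractive_compress(chunk: str, query: str, top_n: int = 3) -> str:
    """Keep only the sentences most similar to the query (hypothetical helper)."""
    sentences = [s.strip() for s in chunk.split(". ") if s.strip()]
    if len(sentences) <= top_n:
        return chunk

    sent_vecs = model.encode(sentences, normalize_embeddings=True)
    query_vec = model.encode([query], normalize_embeddings=True)[0]

    # Cosine similarity reduces to a dot product on normalized vectors
    scores = sent_vecs @ query_vec
    keep = sorted(np.argsort(scores)[-top_n:])  # preserve original sentence order

    return ". ".join(sentences[i] for i in keep)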

LLM-Based Compression

python
async def llm_compress_context(query: str, chunk: str, llm) -> str:
    """Use the LLM to extract only the information relevant to the query."""
    prompt = f"""Extract only the information relevant to the query from this text.

Query: {query}

Text: {chunk}

Relevant excerpt:"""

    return await llm.generate(prompt, max_tokens=200)

# Example
query = "How do I reset my password?"
chunk = "Our system offers many features including user management, password reset, data export..."

compressed = await llm_compress_context(query, chunk, llm)
# "Password reset: Click 'Forgot Password' on the login page..."

Semantic Compression

Remove redundant information across chunks.

python
def remove_redundant_chunks(chunks: List[str], threshold=0.85) -> List[str]:
    """Remove chunks with high semantic overlap."""
    embeddings = embed_batch(chunks)
    selected = [0]  # Always keep the first chunk (highest relevance)

    for i in range(1, len(chunks)):
        # Check similarity against already selected chunks
        max_similarity = max(
            cosine_similarity(embeddings[i], embeddings[j])
            for j in selected
        )

        # Only add if sufficiently different
        if max_similarity < threshold:
            selected.append(i)

    return [chunks[i] for i in selected]
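`embed_batch` and `cosine_similarity` are assumed helpers. A minimal sketch using sentence-transformers and NumPy (the encoder choice is an assumption; reuse whatever embedding model your pipeline already has):

python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed embedding backend
_encoder = SentenceTransformer("all-MiniLM-L6-v2")

def embed_batch(texts: list[str]) -> np.ndarray:
    """Return one embedding vector per text."""
    return _encoder.encode(texts)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))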

Sliding Window Approaches

Chunked Processing

Process long documents in windows.

python
def sliding_window_qa(query: str, long_document: str, window_size=2000, stride=1000):
    """Process a long document with a sliding window."""
    answers = []

    # Create windows
    tokens = tokenize(long_document)

    for i in range(0, len(tokens), stride):
        window = tokens[i:i + window_size]
        window_text = detokenize(window)

        # Generate an answer for this window
        answer = llm.generate(
            query=query,
            context=window_text
        )

        if answer and answer != "Information not found":
            answers.append({
                'answer': answer,
                'position': i,
                'confidence': estimate_confidence(answer)
            })

    # Combine answers (take the highest confidence or synthesize)
    return best_answer(answers)
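`tokenize`, `detokenize`, and `best_answer` are assumed helpers. A minimal sketch, using tiktoken for the token round-trip and a simple highest-confidence pick:

python
import tiktoken
from typing import Optional

_encoding = tiktoken.encoding_for_model("gpt-4")

def tokenize(text: str) -> list[int]:
    return _encoding.encode(text)

def detokenize(tokens: list[int]) -> str:
    return _encoding.decode(tokens)

def best_answer(answers: list[dict]) -> Optional[str]:
    """Pick the highest-confidence window answer (None if nothing was found)."""
    if not answers:
        return None
    return max(answers, key=lambda a: a['confidence'])['answer']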

Hierarchical Processing

Process at multiple granularities.

python
async def hierarchical_processing(query: str, document: str, llm):
    """
    1. Summarize the entire document
    2. Find relevant sections in the summary
    3. Process the full sections for the answer
    """
    # Level 1: Document summary
    doc_summary = await llm.generate(
        f"Summarize this document:\n\n{document}",
        max_tokens=500
    )

    # Level 2: Identify relevant sections
    relevance_check = await llm.generate(
        f"Which parts of this summary are relevant to: {query}\n\nSummary: {doc_summary}",
        max_tokens=100
    )

    # Level 3: Process the full relevant sections
    relevant_sections = extract_sections(document, relevance_check)

    # Generate the final answer from the relevant sections
    answer = await llm.generate(
        query=query,
        context=relevant_sections
    )

    return answer

Conversation History Management

Fixed Window

Keep only recent history.

python
class FixedWindowHistory:
    def __init__(self, max_turns=5):
        self.max_turns = max_turns
        self.history = []

    def add_turn(self, query: str, answer: str):
        self.history.append({'query': query, 'answer': answer})

        # Keep only recent turns
        if len(self.history) > self.max_turns:
            self.history = self.history[-self.max_turns:]

    def get_context(self) -> str:
        return "\n".join([
            f"User: {turn['query']}\nAssistant: {turn['answer']}"
            for turn in self.history
        ])

Token-Based Window

Keep history within token budget.

python
class TokenBudgetHistory:
    def __init__(self, max_tokens=2000, model="gpt-4"):
        self.max_tokens = max_tokens
        self.model = model
        self.history = []

    def add_turn(self, query: str, answer: str):
        self.history.append({'query': query, 'answer': answer})
        self._trim_to_budget()

    def _trim_to_budget(self):
        while self.history:
            context = self.get_context()
            tokens = count_tokens(context, self.model)

            if tokens <= self.max_tokens:
                break

            # Remove the oldest turn
            self.history.pop(0)

    def get_context(self) -> str:
        return "\n".join([
            f"User: {turn['query']}\nAssistant: {turn['answer']}"
            for turn in self.history
        ])

Summarized History

Summarize old history to save tokens.

python
class SummarizedHistory:
    def __init__(self, llm, summary_threshold=10):
        self.llm = llm
        self.summary_threshold = summary_threshold
        self.summary = ""
        self.recent_history = []

    async def add_turn(self, query: str, answer: str):
        self.recent_history.append({'query': query, 'answer': answer})

        # When recent history gets long, summarize it
        if len(self.recent_history) >= self.summary_threshold:
            await self._summarize_old_turns()

    async def _summarize_old_turns(self):
        # Summarize all but the last 3 turns
        to_summarize = self.recent_history[:-3]

        if to_summarize:
            history_text = format_history(to_summarize)
            new_summary = await self.llm.generate(
                f"Summarize this conversation:\n\n{self.summary}\n\n{history_text}",
                max_tokens=200
            )
            self.summary = new_summary
            self.recent_history = self.recent_history[-3:]

    async def get_context(self) -> str:
        recent = format_history(self.recent_history)

        if self.summary:
            return f"Earlier: {self.summary}\n\nRecent:\n{recent}"
        else:
            return recent
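`format_history` is an assumed helper; a minimal sketch matching the turn format used by the other history classes:

python
def format_history(turns: list[dict]) -> str:
    """Render conversation turns as alternating User/Assistant lines."""
    return "\n".join(
        f"User: {turn['query']}\nAssistant: {turn['answer']}"
        for turn in turns
    )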

Prompt Optimization

Template Compression

python
# Verbose prompt (wasteful)
verbose_prompt = """
You are a helpful AI assistant. Your job is to answer questions based on the provided context.
Please read the context carefully and provide accurate answers. If you don't know the answer,
say so. Always be polite and professional.

Context: {context}

Question: {query}

Answer:
"""

# Compressed prompt (efficient)
compressed_prompt = """Answer based on context. Say "I don't know" if uncertain.

Context: {context}

Q: {query}
A:"""

# Token savings: ~50 tokens per query
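The ~50-token figure is approximate and tokenizer-dependent; a quick check with the `count_tokens` helper defined earlier:

python
saved = count_tokens(verbose_prompt) - count_tokens(compressed_prompt)
print(f"Tokens saved per query (template only): {saved}")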

Dynamic Prompts

Adjust prompt based on query complexity.

python
def get_optimal_prompt(query: str, context: str, complexity: str) -> str:
    if complexity == "simple":
        # Minimal prompt for simple queries
        return f"Context: {context}\n\nQ: {query}\nA:"

    elif complexity == "medium":
        # Standard prompt
        return f"Answer based on context:\n\n{context}\n\nQ: {query}\nA:"

    else:
        # Detailed prompt for complex queries
        return f"""Analyze the context carefully and provide a detailed answer.

Context: {context}

Question: {query}

Detailed answer:"""
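How `complexity` gets decided is left open here. A crude heuristic sketch (the thresholds and keyword list are arbitrary assumptions; an LLM-based classifier is another option):

python
def classify_complexity(query: str) -> str:
    """Rough heuristic: classify a query as simple, medium, or complex."""
    words = query.split()
    multi_part = any(m in query.lower() for m in ("compare", " and ", "versus", "why"))

    if len(words) <= 8 and not multi_part:
        return "simple"
    if len(words) <= 20 and not multi_part:
        return "medium"
    return "complex"

prompt = get_optimal_prompt(query, context, classify_complexity(query))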

Adaptive Context Loading

Lazy Loading

Load context incrementally.

python
async def adaptive_context_loading(query: str, vector_db, llm, max_chunks=10):
    """Start with a few chunks, add more if needed."""
    chunk_counts = [3, 5, 8, max_chunks]

    for num_chunks in chunk_counts:
        # Retrieve chunks
        chunks = await vector_db.search(query, k=num_chunks)
        context = format_chunks(chunks)

        # Generate an answer
        answer = await llm.generate(query=query, context=context)

        # Check confidence
        confidence = await estimate_confidence(answer, llm)

        if confidence > 0.8:
            return answer  # Good enough

    # Used all chunks, return best effort
    return answer

Confidence-Based Retrieval

python
async def confidence_based_retrieval(query: str, vector_db, llm):
    """Retrieve more context if the initial answer has low confidence."""
    # Start with the top 3
    chunks = await vector_db.search(query, k=3)
    context = format_chunks(chunks)

    answer = await llm.generate(query=query, context=context)
    confidence = await estimate_confidence(answer, llm)

    # If confidence is low, retrieve more
    if confidence < 0.6:
        additional_chunks = await vector_db.search(query, k=7)
        chunks.extend(additional_chunks[3:])  # Skip the first 3 (duplicates)

        context = format_chunks(chunks)
        answer = await llm.generate(query=query, context=context)

    return answer

Multi-Turn Optimization

Context Carryover

Avoid re-sending unchanged context.

python
class EfficientConversation:
    def __init__(self, llm):
        self.llm = llm
        self.static_context = None
        self.conversation_history = []

    async def query(self, user_query: str, retrieve_new_context=True):
        # Retrieve context only if the query changed topic
        if retrieve_new_context:
            self.static_context = await retrieve_context(user_query)

        # Build a minimal prompt. Note: referencing the context by hash instead of
        # resending it only helps if the provider caches prompts or the full context
        # was already sent earlier in the same session.
        prompt = f"""Context (same as before): [Ref: {hash(self.static_context)}]

Previous conversation:
{format_recent_history(self.conversation_history[-2:])}

New question: {user_query}

Answer:"""

        answer = await self.llm.generate(prompt)

        self.conversation_history.append({
            'query': user_query,
            'answer': answer
        })

        return answer

Monitoring Token Usage

python
import time
import numpy as np

class TokenUsageTracker:
    def __init__(self):
        self.usage = []

    def track(self, prompt_tokens: int, completion_tokens: int, model: str):
        self.usage.append({
            'timestamp': time.time(),
            'prompt_tokens': prompt_tokens,
            'completion_tokens': completion_tokens,
            'total_tokens': prompt_tokens + completion_tokens,
            'model': model
        })

    def get_stats(self):
        if not self.usage:
            return {}

        total_tokens = sum(u['total_tokens'] for u in self.usage)
        avg_prompt = np.mean([u['prompt_tokens'] for u in self.usage])
        avg_completion = np.mean([u['completion_tokens'] for u in self.usage])

        return {
            'total_tokens': total_tokens,
            'avg_prompt_tokens': avg_prompt,
            'avg_completion_tokens': avg_completion,
            'num_requests': len(self.usage)
        }

# Usage
tracker = TokenUsageTracker()

response = await llm.generate(prompt)
tracker.track(
    prompt_tokens=count_tokens(prompt),
    completion_tokens=count_tokens(response),
    model="gpt-4"
)

stats = tracker.get_stats()
print(f"Average prompt tokens: {stats['avg_prompt_tokens']}")

Best Practices

  1. Measure first: Count tokens before optimizing
  2. Allocate budgets: Reserve tokens for each component
  3. Compress intelligently: Only compress what won't hurt quality
  4. Trim history: Don't send full conversation every time
  5. Start small: Retrieve fewer chunks, expand if needed
  6. Monitor usage: Track token consumption over time
  7. Test impact: Ensure compression doesn't hurt quality

Trade-offs

Strategy                 Token Savings    Quality Impact    Complexity
Chunk selection          20-40%           Low               Low
Extractive summary       50-70%           Medium            Medium
LLM compression          60-80%           Low-Medium        Medium
Prompt optimization      10-30%           Low               Low
History summarization    40-60%           Low               Medium
Adaptive loading         Variable         Low               High

Next Steps

You now have a complete understanding of RAG fundamentals, from embeddings and chunking to production deployment and optimization. Apply these guides incrementally, measure results, and iterate based on your specific use case and constraints.

Tags

context window · tokens · optimization · compression
