Contextual Compression: Extract the Essential from Documents
Implement contextual compression to extract relevant passages from retrieved documents. LLM, extractors, and context optimization.
Contextual compression filters and condenses retrieved documents to keep only passages directly relevant to the query. Instead of passing entire chunks to the LLM, this technique extracts key sentences, reducing noise and optimizing costs. This guide explores compression methods and how to integrate them into a RAG pipeline.
Why Compress Context?
Retrieved chunks often contain superfluous content:
Original chunk (500 tokens):
"Our company was founded in 2010. We started with 3 employees. Today, we have
over 200 team members. Regarding our return policy, you have 30 days to return
an unused product in its original packaging. Return shipping costs are your
responsibility unless the product is defective. We are present in 15 European
countries..."
Query: "What is the return deadline?"
After compression (50 tokens):
"You have 30 days to return an unused product in its original packaging."
Compression Benefits
| Metric | Without Compression | With Compression | Improvement |
|---|---|---|---|
| Tokens/query | 4000 | 800 | -80% |
| LLM Cost | $0.08 | $0.016 | -80% |
| Latency | 3.2s | 1.1s | -65% |
| Response Quality | 0.78 | 0.85 | +9% |
Compression improves quality because it reduces noise that can distract the LLM.
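The cost figures in the table follow directly from the token counts. A minimal sketch of the arithmetic, assuming a hypothetical flat price of $0.02 per 1,000 tokens (actual pricing depends on your model and provider):

```python
# Hypothetical flat price; real pricing varies by model and provider
PRICE_PER_1K_TOKENS = 0.02

def query_cost(tokens: int) -> float:
    return tokens / 1000 * PRICE_PER_1K_TOKENS

before, after = 4000, 800
print(f"Without compression: ${query_cost(before):.3f}")  # $0.080
print(f"With compression:    ${query_cost(after):.3f}")   # $0.016
print(f"Token reduction: {1 - after / before:.0%}")       # 80%
```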
Compression Methods
1. LLM Compression
The LLM identifies and extracts relevant passages:
```python
from openai import OpenAI


class LLMContextCompressor:
    def __init__(self, model: str = "gpt-4o-mini"):
        self.client = OpenAI()
        self.model = model

    def compress(self, query: str, documents: list[str]) -> list[str]:
        compressed = []
        for doc in documents:
            prompt = f"""Extract only the sentences directly relevant to answering the question.
If no part is relevant, respond "NOT_RELEVANT".

Question: {query}

Document: {doc}

Relevant sentences:"""

            response = self.client.chat.completions.create(
                model=self.model,
                messages=[{"role": "user", "content": prompt}],
                temperature=0,
                max_tokens=500
            )

            content = response.choices[0].message.content.strip()
            if content != "NOT_RELEVANT":
                compressed.append(content)

        return compressed

# Example
compressor = LLMContextCompressor()
docs = [
    "Our company has existed for 20 years. Return policy: 30 days maximum. We have 500 employees.",
    "Shipping is free. Returns accepted within 14 business days. Satisfaction guaranteed."
]
compressed = compressor.compress("What is the return deadline?", docs)
# ["Return policy: 30 days maximum.", "Returns accepted within 14 business days."]
```
2. Sentence Extraction Compression
Faster and without API costs, this approach uses semantic similarity:
```python
from sentence_transformers import SentenceTransformer
import numpy as np
import nltk


class SentenceExtractor:
    def __init__(self, model_name: str = "BAAI/bge-m3"):
        self.model = SentenceTransformer(model_name)
        nltk.download('punkt', quiet=True)

    def compress(
        self,
        query: str,
        documents: list[str],
        top_k_sentences: int = 3,
        min_similarity: float = 0.5
    ) -> list[str]:
        # Encode query
        query_embedding = self.model.encode(query)

        all_relevant_sentences = []
        for doc in documents:
            # Split into sentences
            sentences = nltk.sent_tokenize(doc)
            if not sentences:
                continue

            # Encode sentences
            sentence_embeddings = self.model.encode(sentences)

            # Calculate similarities
            similarities = np.dot(sentence_embeddings, query_embedding) / (
                np.linalg.norm(sentence_embeddings, axis=1) *
                np.linalg.norm(query_embedding)
            )

            # Select relevant sentences
            for sentence, sim in zip(sentences, similarities):
                if sim >= min_similarity:
                    all_relevant_sentences.append((sentence, sim))

        # Sort by similarity and take top_k
        sorted_sentences = sorted(all_relevant_sentences, key=lambda x: x[1], reverse=True)
        return [s[0] for s in sorted_sentences[:top_k_sentences]]

# Example
extractor = SentenceExtractor()
result = extractor.compress(
    query="How to return a product?",
    documents=docs,
    top_k_sentences=3
)
```
3. Reranking + Filtering Compression
Uses a cross-encoder to score and filter:
```python
from sentence_transformers import CrossEncoder


class RerankerCompressor:
    def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.reranker = CrossEncoder(model_name)

    def compress(
        self,
        query: str,
        documents: list[str],
        threshold: float = 0.5,
        max_passages: int = 5
    ) -> list[dict]:
        # Split each document into passages
        passages = []
        for doc_idx, doc in enumerate(documents):
            for para in doc.split('\n\n'):
                if len(para.strip()) > 50:  # Ignore too-short paragraphs
                    passages.append({
                        "text": para.strip(),
                        "doc_idx": doc_idx
                    })

        if not passages:
            return []

        # Score all passages
        pairs = [[query, p["text"]] for p in passages]
        scores = self.reranker.predict(pairs)

        # Filter and sort
        for passage, score in zip(passages, scores):
            passage["score"] = float(score)

        filtered = [p for p in passages if p["score"] >= threshold]
        sorted_passages = sorted(filtered, key=lambda x: x["score"], reverse=True)

        return sorted_passages[:max_passages]

# Example
compressor = RerankerCompressor()
results = compressor.compress(
    query="Refund policy",
    documents=docs,
    threshold=0.3
)
for r in results:
    print(f"Score: {r['score']:.3f} - {r['text'][:100]}...")
```
4. Summary Compression
For very long documents, generate a targeted summary:
```python
from openai import OpenAI


class SummaryCompressor:
    def __init__(self):
        self.client = OpenAI()

    def compress(
        self,
        query: str,
        documents: list[str],
        max_summary_length: int = 200
    ) -> str:
        combined_docs = "\n\n---\n\n".join(documents)

        prompt = f"""Generate a concise summary that answers the following question.
Include only information relevant to the question.
Maximum {max_summary_length} words.

Question: {query}

Documents: {combined_docs}

Relevant summary:"""

        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            max_tokens=max_summary_length * 2
        )

        return response.choices[0].message.content
```
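Unlike the other compressors, this one returns a single string rather than a list of passages. A quick usage sketch, reusing the `docs` list from the first example:

```python
summarizer = SummaryCompressor()
summary = summarizer.compress(
    query="What is the return deadline?",
    documents=docs,
    max_summary_length=100
)
print(summary)
```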
Complete Pipeline Architecture
```python
class ContextualCompressionRetriever:
    def __init__(
        self,
        base_retriever,
        compression_method: str = "reranker",  # "llm", "sentence", "reranker", "summary"
        **compression_kwargs
    ):
        self.retriever = base_retriever
        self.compression_method = compression_method
        self.compression_kwargs = compression_kwargs

        # Initialize appropriate compressor
        if compression_method == "llm":
            self.compressor = LLMContextCompressor()
        elif compression_method == "sentence":
            self.compressor = SentenceExtractor()
        elif compression_method == "reranker":
            self.compressor = RerankerCompressor()
        elif compression_method == "summary":
            self.compressor = SummaryCompressor()

    def search(self, query: str, top_k: int = 5) -> dict:
        # 1. Initial retrieval (more documents than needed)
        initial_results = self.retriever.search(query, top_k=top_k * 2)
        documents = [r["content"] for r in initial_results]

        # 2. Compression
        compressed = self.compressor.compress(query, documents, **self.compression_kwargs)

        # 3. Calculate metrics
        original_tokens = sum(len(d.split()) for d in documents)
        compressed_tokens = (
            sum(len(c.split()) for c in compressed)
            if isinstance(compressed, list)
            else len(compressed.split())
        )

        return {
            "compressed_context": compressed,
            "original_docs": initial_results,
            "compression_ratio": 1 - (compressed_tokens / original_tokens) if original_tokens > 0 else 0,
            "original_tokens": original_tokens,
            "compressed_tokens": compressed_tokens
        }
```
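A possible usage, with a toy retriever standing in for a real vector store. The only assumption the pipeline makes about `base_retriever` is a `search(query, top_k)` method returning dicts with a `content` key; `InMemoryRetriever` below is a hypothetical stand-in for illustration:

```python
class InMemoryRetriever:
    """Hypothetical stand-in for a real vector-store retriever."""
    def __init__(self, documents: list[str]):
        self.documents = documents

    def search(self, query: str, top_k: int = 5) -> list[dict]:
        # A real implementation would rank by embedding similarity
        return [{"content": d} for d in self.documents[:top_k]]

retriever = ContextualCompressionRetriever(
    base_retriever=InMemoryRetriever(docs),
    compression_method="reranker",
    threshold=0.3
)
result = retriever.search("What is the return deadline?", top_k=2)
print(f"Compression ratio: {result['compression_ratio']:.0%}")
```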
Adaptive Compression
Adapt the method based on context:
```python
class AdaptiveCompressor:
    def __init__(self):
        self.llm_compressor = LLMContextCompressor()
        self.sentence_extractor = SentenceExtractor()
        self.reranker = RerankerCompressor()

    def compress(
        self,
        query: str,
        documents: list[str],
        budget_tokens: int = 1000,
        quality_priority: bool = False
    ) -> list[str]:
        total_tokens = sum(len(d.split()) for d in documents)

        # If context is small, no compression needed
        if total_tokens <= budget_tokens:
            return documents

        # Choose method based on constraints
        compression_needed = 1 - (budget_tokens / total_tokens)

        if quality_priority or compression_needed > 0.7:
            # Aggressive compression → LLM
            return self.llm_compressor.compress(query, documents)
        elif compression_needed > 0.4:
            # Medium compression → Reranker
            results = self.reranker.compress(query, documents)
            return [r["text"] for r in results]
        else:
            # Light compression → Sentence extraction
            return self.sentence_extractor.compress(
                query, documents,
                top_k_sentences=int(budget_tokens / 20)  # ~20 tokens/sentence
            )
```
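For instance, with the thresholds hard-coded above, the same query can take different paths depending on how far the context exceeds the budget (a sketch):

```python
adaptive = AdaptiveCompressor()

# Context under budget: returned unchanged
small = ["Returns accepted within 14 business days."]
print(adaptive.compress("Return deadline?", small))

# compression_needed = 1 - 500/2000 = 0.75 > 0.7 → LLM compression
large = [" ".join(["filler"] * 2000)]
compressed = adaptive.compress("Return deadline?", large, budget_tokens=500)
```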
Compression with Source Preservation
Keep traceability for citations:
```python
import nltk
import numpy as np


class SourcePreservingCompressor:
    def __init__(self):
        self.extractor = SentenceExtractor()

    def compress(
        self,
        query: str,
        documents: list[dict]  # {"content": str, "source": str, "metadata": dict}
    ) -> list[dict]:
        compressed_with_sources = []
        # Encode the query once, outside the document loop
        query_emb = self.extractor.model.encode(query)

        for doc in documents:
            # Extract relevant sentences
            sentences = nltk.sent_tokenize(doc["content"])
            if not sentences:
                continue

            sentence_embs = self.extractor.model.encode(sentences)

            similarities = np.dot(sentence_embs, query_emb) / (
                np.linalg.norm(sentence_embs, axis=1) *
                np.linalg.norm(query_emb)
            )

            # Keep relevant sentences with their source
            for sentence, sim in zip(sentences, similarities):
                if sim > 0.5:
                    compressed_with_sources.append({
                        "text": sentence,
                        "source": doc["source"],
                        "metadata": doc["metadata"],
                        "relevance_score": float(sim)
                    })

        # Sort by relevance
        return sorted(compressed_with_sources, key=lambda x: x["relevance_score"], reverse=True)

# Usage for citations
def format_context_with_citations(compressed_results: list[dict]) -> str:
    context_parts = []
    for i, result in enumerate(compressed_results, 1):
        context_parts.append(f"[{i}] {result['text']}")
    return "\n".join(context_parts)

def format_sources(compressed_results: list[dict]) -> str:
    sources = []
    for i, result in enumerate(compressed_results, 1):
        sources.append(f"[{i}] {result['source']}")
    return "\n".join(sources)
```
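Putting the pieces together: compress source-tagged documents, then build a prompt whose citations can be traced back. The document dicts below are illustrative:

```python
docs_with_sources = [
    {"content": "Returns accepted within 14 business days.",
     "source": "faq.md", "metadata": {}},
    {"content": "Shipping is free above 50 EUR.",
     "source": "shipping.md", "metadata": {}},
]

compressor = SourcePreservingCompressor()
results = compressor.compress("What is the return deadline?", docs_with_sources)

prompt = (
    "Answer using the numbered context below and cite passages as [n].\n\n"
    f"{format_context_with_citations(results)}\n\n"
    f"Sources:\n{format_sources(results)}"
)
```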
Compression Evaluation
```python
from sentence_transformers import SentenceTransformer
import numpy as np


class CompressionEvaluator:
    def __init__(self):
        self.embedder = SentenceTransformer("BAAI/bge-m3")

    def evaluate(
        self,
        query: str,
        original_docs: list[str],
        compressed: list[str],
        ground_truth_answer: str = None
    ) -> dict:
        # 1. Compression ratio
        original_tokens = sum(len(d.split()) for d in original_docs)
        compressed_tokens = sum(len(c.split()) for c in compressed)
        compression_ratio = 1 - (compressed_tokens / original_tokens)

        # 2. Information preservation (via similarity)
        original_combined = " ".join(original_docs)
        compressed_combined = " ".join(compressed)

        orig_emb = self.embedder.encode(original_combined)
        comp_emb = self.embedder.encode(compressed_combined)

        information_preservation = np.dot(orig_emb, comp_emb) / (
            np.linalg.norm(orig_emb) * np.linalg.norm(comp_emb)
        )

        # 3. Query relevance
        query_emb = self.embedder.encode(query)
        query_relevance = np.dot(comp_emb, query_emb) / (
            np.linalg.norm(comp_emb) * np.linalg.norm(query_emb)
        )

        # 4. Combined score (balance compression and quality)
        quality_score = 0.6 * information_preservation + 0.4 * query_relevance
        efficiency_score = compression_ratio
        combined_score = 0.5 * quality_score + 0.5 * efficiency_score

        return {
            "compression_ratio": compression_ratio,
            "information_preservation": float(information_preservation),
            "query_relevance": float(query_relevance),
            "quality_score": float(quality_score),
            "efficiency_score": efficiency_score,
            "combined_score": float(combined_score),
            "original_tokens": original_tokens,
            "compressed_tokens": compressed_tokens
        }
```
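One way to use the evaluator: run the same query through two compressors and compare their combined scores (a sketch reusing the classes and `docs` defined above):

```python
evaluator = CompressionEvaluator()
query = "What is the return deadline?"

candidates = {
    "sentence": SentenceExtractor().compress(query, docs),
    "llm": LLMContextCompressor().compress(query, docs),
}

for name, output in candidates.items():
    metrics = evaluator.evaluate(query, docs, output)
    print(f"{name}: combined={metrics['combined_score']:.3f}, "
          f"ratio={metrics['compression_ratio']:.0%}")
```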
Cost Optimization
```python
import time


class CostOptimizedCompressor:
    def __init__(self, max_llm_calls_per_minute: int = 60):
        self.local_compressor = SentenceExtractor()
        self.llm_compressor = LLMContextCompressor()
        self.llm_calls = []
        self.max_calls = max_llm_calls_per_minute

    def compress(self, query: str, documents: list[str]) -> list[str]:
        # First, fast local compression
        local_compressed = self.local_compressor.compress(
            query, documents, top_k_sentences=10
        )

        # If local compression is sufficient, no need for the LLM
        total_tokens = sum(len(s.split()) for s in local_compressed)
        if total_tokens <= 500:
            return local_compressed

        # Check rate limit
        self._clean_old_calls()
        if len(self.llm_calls) >= self.max_calls:
            # Rate limit reached, return local compression
            return local_compressed[:5]

        # Use LLM to refine
        self.llm_calls.append(time.time())
        return self.llm_compressor.compress(query, local_compressed)

    def _clean_old_calls(self):
        cutoff = time.time() - 60
        self.llm_calls = [t for t in self.llm_calls if t > cutoff]
```
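Usage is the same as for the other compressors; the class decides internally whether the LLM is worth calling:

```python
optimizer = CostOptimizedCompressor(max_llm_calls_per_minute=10)
compressed = optimizer.compress("What is the return deadline?", docs)
# Small inputs never leave the local extractor; large ones are refined
# by the LLM until the per-minute rate limit is reached
```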
Next Steps
Contextual compression optimizes both quality and cost in your RAG pipeline. To go further:
- LLM RAG Generation - Optimize generation
- Ensemble Retrieval - Combine multiple retrievers
- Retrieval Fundamentals - Overview
Intelligent Compression with Ailog
Ailog implements contextual compression automatically:
- Adaptive compression based on query complexity
- Source preservation for citations
- Cost optimization with local compression priority
- Integrated monitoring of compression/quality ratio
Try for free and reduce your LLM costs by 80%.
Related Posts
Self-Query Retrieval: Let the LLM Structure the Search
Implement self-query retrieval to transform natural language queries into structured filters. LLM, filter extraction, and optimization.
Query Routing: Direct Queries to the Right Source
Implement query routing to direct each query to the optimal data source. Classification, LLM routing, and advanced strategies explained.
Metadata Filtering: Refine RAG Search
Master metadata filtering for precise RAG searches. Filter types, indexing, combined queries, and optimization techniques.