
Contextual Compression: Extract the Essential from Documents

March 9, 2026
Ailog Team

Implement contextual compression to extract the relevant passages from retrieved documents: LLM-based extraction, semantic sentence extractors, and context optimization.


Contextual compression filters and condenses retrieved documents to keep only passages directly relevant to the query. Instead of passing entire chunks to the LLM, this technique extracts key sentences, reducing noise and optimizing costs. This guide explores compression methods and their integration in a RAG pipeline.

Why Compress Context?

Retrieved chunks often contain superfluous content:

Original chunk (500 tokens):
"Our company was founded in 2010. We started with 3 employees. Today, we have
over 200 team members. Regarding our return policy, you have 30 days to return
an unused product in its original packaging. Return shipping costs are your
responsibility unless the product is defective. We are present in 15 European
countries..."

Query: "What is the return deadline?"

After compression (50 tokens):
"You have 30 days to return an unused product in its original packaging."

Compression Benefits

| Metric | Without Compression | With Compression | Improvement |
|---|---|---|---|
| Tokens/query | 4,000 | 800 | -80% |
| LLM cost | $0.08 | $0.016 | -80% |
| Latency | 3.2 s | 1.1 s | -65% |
| Response quality | 0.78 | 0.85 | +9% |
Compression improves quality because it reduces noise that can distract the LLM.
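
To see what these percentages mean in absolute terms, here is a minimal back-of-the-envelope sketch. The per-query token counts follow the table above; the query volume and the $2.50-per-million-token price are illustrative assumptions, not measured values.

```python
# Back-of-the-envelope sketch: monthly savings from compression.
# Token counts follow the table above; volume and price are assumptions.

def monthly_savings(
    queries_per_day: int,
    tokens_before: int = 4000,       # avg context tokens without compression
    tokens_after: int = 800,         # avg context tokens with compression
    price_per_million: float = 2.50, # hypothetical input price (USD / 1M tokens)
) -> float:
    saved_tokens = (tokens_before - tokens_after) * queries_per_day * 30
    return saved_tokens / 1_000_000 * price_per_million

print(f"${monthly_savings(10_000):,.2f} / month")  # $2,400.00 under these assumptions
```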

Compression Methods

1. LLM Compression

The LLM identifies and extracts relevant passages:

```python
from openai import OpenAI

class LLMContextCompressor:
    def __init__(self, model: str = "gpt-4o-mini"):
        self.client = OpenAI()
        self.model = model

    def compress(self, query: str, documents: list[str]) -> list[str]:
        compressed = []
        for doc in documents:
            prompt = f"""Extract only the sentences directly relevant to answering the question.
If no part is relevant, respond "NOT_RELEVANT".

Question: {query}

Document: {doc}

Relevant sentences:"""
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[{"role": "user", "content": prompt}],
                temperature=0,
                max_tokens=500
            )
            content = response.choices[0].message.content.strip()
            if content != "NOT_RELEVANT":
                compressed.append(content)
        return compressed

# Example
compressor = LLMContextCompressor()
docs = [
    "Our company has existed for 20 years. Return policy: 30 days maximum. We have 500 employees.",
    "Shipping is free. Returns accepted within 14 business days. Satisfaction guaranteed."
]
compressed = compressor.compress("What is the return deadline?", docs)
# ["Return policy: 30 days maximum.", "Returns accepted within 14 business days."]
```

2. Sentence Extraction Compression

Faster and with no API cost, this method relies on semantic similarity:

```python
from sentence_transformers import SentenceTransformer
import numpy as np
import nltk

class SentenceExtractor:
    def __init__(self, model_name: str = "BAAI/bge-m3"):
        self.model = SentenceTransformer(model_name)
        nltk.download('punkt', quiet=True)

    def compress(
        self,
        query: str,
        documents: list[str],
        top_k_sentences: int = 3,
        min_similarity: float = 0.5
    ) -> list[str]:
        # Encode query
        query_embedding = self.model.encode(query)

        all_relevant_sentences = []
        for doc in documents:
            # Split into sentences
            sentences = nltk.sent_tokenize(doc)
            if not sentences:
                continue

            # Encode sentences
            sentence_embeddings = self.model.encode(sentences)

            # Calculate cosine similarities
            similarities = np.dot(sentence_embeddings, query_embedding) / (
                np.linalg.norm(sentence_embeddings, axis=1) *
                np.linalg.norm(query_embedding)
            )

            # Select relevant sentences
            for sentence, sim in zip(sentences, similarities):
                if sim >= min_similarity:
                    all_relevant_sentences.append((sentence, sim))

        # Sort by similarity and take top_k
        sorted_sentences = sorted(all_relevant_sentences, key=lambda x: x[1], reverse=True)
        return [s[0] for s in sorted_sentences[:top_k_sentences]]

# Example
extractor = SentenceExtractor()
result = extractor.compress(
    query="How to return a product?",
    documents=docs,
    top_k_sentences=3
)
```

3. Reranking + Filtering Compression

Uses a cross-encoder to score and filter passages:

```python
from sentence_transformers import CrossEncoder

class RerankerCompressor:
    def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.reranker = CrossEncoder(model_name)

    def compress(
        self,
        query: str,
        documents: list[str],
        threshold: float = 0.5,
        max_passages: int = 5
    ) -> list[dict]:
        # Split each document into passages
        passages = []
        for doc_idx, doc in enumerate(documents):
            for para in doc.split('\n\n'):
                if len(para.strip()) > 50:  # Ignore paragraphs that are too short
                    passages.append({
                        "text": para.strip(),
                        "doc_idx": doc_idx
                    })

        if not passages:
            return []

        # Score all passages
        pairs = [[query, p["text"]] for p in passages]
        scores = self.reranker.predict(pairs)

        # Filter and sort
        for passage, score in zip(passages, scores):
            passage["score"] = float(score)

        filtered = [p for p in passages if p["score"] >= threshold]
        sorted_passages = sorted(filtered, key=lambda x: x["score"], reverse=True)

        return sorted_passages[:max_passages]

# Example
compressor = RerankerCompressor()
results = compressor.compress(
    query="Refund policy",
    documents=docs,
    threshold=0.3
)
for r in results:
    print(f"Score: {r['score']:.3f} - {r['text'][:100]}...")
```

4. Summary Compression

For very long documents, generate a targeted summary:

```python
from openai import OpenAI

class SummaryCompressor:
    def __init__(self):
        self.client = OpenAI()

    def compress(
        self,
        query: str,
        documents: list[str],
        max_summary_length: int = 200
    ) -> str:
        combined_docs = "\n\n---\n\n".join(documents)

        prompt = f"""Generate a concise summary that answers the following question.
Include only information relevant to the question.
Maximum {max_summary_length} words.

Question: {query}

Documents: {combined_docs}

Relevant summary:"""

        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            max_tokens=max_summary_length * 2
        )
        return response.choices[0].message.content
```
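
A quick usage sketch, reusing the `docs` list from the first example; the exact wording of the summary will vary since it is LLM-generated.

```python
# Usage sketch, reusing the `docs` list defined in the first example
summarizer = SummaryCompressor()
summary = summarizer.compress(
    query="What is the return deadline?",
    documents=docs,
    max_summary_length=100
)
print(summary)
```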

Complete Pipeline Architecture

```python
class ContextualCompressionRetriever:
    def __init__(
        self,
        base_retriever,
        compression_method: str = "reranker",  # "llm", "sentence", "reranker", "summary"
        **compression_kwargs
    ):
        self.retriever = base_retriever
        self.compression_method = compression_method
        self.compression_kwargs = compression_kwargs

        # Initialize the appropriate compressor
        if compression_method == "llm":
            self.compressor = LLMContextCompressor()
        elif compression_method == "sentence":
            self.compressor = SentenceExtractor()
        elif compression_method == "reranker":
            self.compressor = RerankerCompressor()
        elif compression_method == "summary":
            self.compressor = SummaryCompressor()

    def search(self, query: str, top_k: int = 5) -> dict:
        # 1. Initial retrieval (more documents than needed)
        initial_results = self.retriever.search(query, top_k=top_k * 2)
        documents = [r["content"] for r in initial_results]

        # 2. Compression
        compressed = self.compressor.compress(query, documents, **self.compression_kwargs)

        # 3. Calculate metrics
        original_tokens = sum(len(d.split()) for d in documents)
        compressed_tokens = (
            sum(len(c.split()) for c in compressed)
            if isinstance(compressed, list)
            else len(compressed.split())
        )

        return {
            "compressed_context": compressed,
            "original_docs": initial_results,
            "compression_ratio": 1 - (compressed_tokens / original_tokens) if original_tokens > 0 else 0,
            "original_tokens": original_tokens,
            "compressed_tokens": compressed_tokens
        }
```
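
To make the wiring concrete, a hedged usage sketch: `MyVectorRetriever` is a hypothetical stand-in for your existing base retriever; the only contract the pipeline assumes is a `search(query, top_k)` method returning dicts with a `"content"` key.

```python
# Usage sketch. MyVectorRetriever is hypothetical: any object exposing
# search(query, top_k) -> [{"content": ...}, ...] will do.
retriever = ContextualCompressionRetriever(
    base_retriever=MyVectorRetriever(),
    compression_method="reranker",
    threshold=0.3,
    max_passages=5,
)

result = retriever.search("What is the return deadline?", top_k=5)
print(f"Compression ratio: {result['compression_ratio']:.0%}")
print(f"{result['original_tokens']} -> {result['compressed_tokens']} tokens")
```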

Adaptive Compression

Adapt the method based on context:

```python
class AdaptiveCompressor:
    def __init__(self):
        self.llm_compressor = LLMContextCompressor()
        self.sentence_extractor = SentenceExtractor()
        self.reranker = RerankerCompressor()

    def compress(
        self,
        query: str,
        documents: list[str],
        budget_tokens: int = 1000,
        quality_priority: bool = False
    ) -> list[str]:
        total_tokens = sum(len(d.split()) for d in documents)

        # If the context is small, no compression needed
        if total_tokens <= budget_tokens:
            return documents

        # Choose the method based on constraints
        compression_needed = 1 - (budget_tokens / total_tokens)

        if quality_priority or compression_needed > 0.7:
            # Aggressive compression -> LLM
            return self.llm_compressor.compress(query, documents)
        elif compression_needed > 0.4:
            # Medium compression -> Reranker
            results = self.reranker.compress(query, documents)
            return [r["text"] for r in results]
        else:
            # Light compression -> Sentence extraction
            return self.sentence_extractor.compress(
                query, documents,
                top_k_sentences=int(budget_tokens / 20)  # ~20 tokens/sentence
            )
```
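
A usage sketch showing how the budget drives the branch taken; `long_documents` is a hypothetical list of retrieved chunks assumed to total about 3,000 tokens.

```python
# Usage sketch. long_documents is hypothetical: assume ~3000 tokens total.
adaptive = AdaptiveCompressor()

# budget 1800 -> ~40% compression needed -> sentence-extraction branch
light = adaptive.compress("Refund policy", long_documents, budget_tokens=1800)

# budget 500 -> ~83% compression needed -> LLM branch
tight = adaptive.compress("Refund policy", long_documents, budget_tokens=500)
```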

Compression with Source Preservation

Keep traceability for citations:

```python
import nltk
import numpy as np

class SourcePreservingCompressor:
    def __init__(self):
        self.extractor = SentenceExtractor()

    def compress(
        self,
        query: str,
        documents: list[dict]  # {"content": str, "source": str, "metadata": dict}
    ) -> list[dict]:
        compressed_with_sources = []

        # Encode the query once (it does not change per document)
        query_emb = self.extractor.model.encode(query)

        for doc in documents:
            # Extract relevant sentences
            sentences = nltk.sent_tokenize(doc["content"])
            if not sentences:
                continue

            sentence_embs = self.extractor.model.encode(sentences)

            similarities = np.dot(sentence_embs, query_emb) / (
                np.linalg.norm(sentence_embs, axis=1) *
                np.linalg.norm(query_emb)
            )

            # Keep relevant sentences with their source
            for sentence, sim in zip(sentences, similarities):
                if sim > 0.5:
                    compressed_with_sources.append({
                        "text": sentence,
                        "source": doc["source"],
                        "metadata": doc["metadata"],
                        "relevance_score": float(sim)
                    })

        # Sort by relevance
        return sorted(compressed_with_sources,
                      key=lambda x: x["relevance_score"], reverse=True)

# Usage for citations
def format_context_with_citations(compressed_results: list[dict]) -> str:
    context_parts = []
    for i, result in enumerate(compressed_results, 1):
        context_parts.append(f"[{i}] {result['text']}")
    return "\n".join(context_parts)

def format_sources(compressed_results: list[dict]) -> str:
    sources = []
    for i, result in enumerate(compressed_results, 1):
        sources.append(f"[{i}] {result['source']}")
    return "\n".join(sources)
```
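
These helpers slot directly into prompt construction. A minimal sketch of a citation-aware prompt, using a single hypothetical document (the `faq.md` source path is made up for illustration):

```python
# Sketch: build a citation-aware prompt from the compressed passages.
compressor = SourcePreservingCompressor()
compressed_results = compressor.compress(
    query="What is the return deadline?",
    documents=[{
        "content": "Return policy: 30 days maximum. We have 500 employees.",
        "source": "faq.md",   # hypothetical source path
        "metadata": {},
    }],
)

prompt = f"""Answer using only the numbered context. Cite passages as [n].

Context:
{format_context_with_citations(compressed_results)}

Sources:
{format_sources(compressed_results)}

Question: What is the return deadline?"""
```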

Compression Evaluation

```python
from sentence_transformers import SentenceTransformer
import numpy as np

class CompressionEvaluator:
    def __init__(self):
        self.embedder = SentenceTransformer("BAAI/bge-m3")

    def evaluate(
        self,
        query: str,
        original_docs: list[str],
        compressed: list[str],
        ground_truth_answer: str = None
    ) -> dict:
        # 1. Compression ratio
        original_tokens = sum(len(d.split()) for d in original_docs)
        compressed_tokens = sum(len(c.split()) for c in compressed)
        compression_ratio = 1 - (compressed_tokens / original_tokens)

        # 2. Information preservation (via similarity)
        original_combined = " ".join(original_docs)
        compressed_combined = " ".join(compressed)

        orig_emb = self.embedder.encode(original_combined)
        comp_emb = self.embedder.encode(compressed_combined)

        information_preservation = np.dot(orig_emb, comp_emb) / (
            np.linalg.norm(orig_emb) * np.linalg.norm(comp_emb)
        )

        # 3. Query relevance
        query_emb = self.embedder.encode(query)
        query_relevance = np.dot(comp_emb, query_emb) / (
            np.linalg.norm(comp_emb) * np.linalg.norm(query_emb)
        )

        # 4. Combined score (balance compression and quality)
        quality_score = 0.6 * information_preservation + 0.4 * query_relevance
        efficiency_score = compression_ratio
        combined_score = 0.5 * quality_score + 0.5 * efficiency_score

        return {
            "compression_ratio": compression_ratio,
            "information_preservation": float(information_preservation),
            "query_relevance": float(query_relevance),
            "quality_score": float(quality_score),
            "efficiency_score": efficiency_score,
            "combined_score": float(combined_score),
            "original_tokens": original_tokens,
            "compressed_tokens": compressed_tokens
        }
```
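
A usage sketch that pits two of the methods above against each other on the same query; `docs` is the example list from the first section.

```python
# Sketch: compare two compression methods on the same query.
evaluator = CompressionEvaluator()
query = "What is the return deadline?"

for name, method in [("sentence", SentenceExtractor()),
                     ("llm", LLMContextCompressor())]:
    compressed = method.compress(query, docs)
    m = evaluator.evaluate(query, docs, compressed)
    print(f"{name}: ratio={m['compression_ratio']:.0%} "
          f"relevance={m['query_relevance']:.2f} "
          f"combined={m['combined_score']:.2f}")
```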

Cost Optimization

```python
import time

class CostOptimizedCompressor:
    def __init__(self, max_llm_calls_per_minute: int = 60):
        self.local_compressor = SentenceExtractor()
        self.llm_compressor = LLMContextCompressor()
        self.llm_calls = []
        self.max_calls = max_llm_calls_per_minute

    def compress(self, query: str, documents: list[str]) -> list[str]:
        # First, fast local compression
        local_compressed = self.local_compressor.compress(
            query, documents, top_k_sentences=10
        )

        # If local compression is sufficient, no need for the LLM
        total_tokens = sum(len(s.split()) for s in local_compressed)
        if total_tokens <= 500:
            return local_compressed

        # Check the rate limit
        self._clean_old_calls()
        if len(self.llm_calls) >= self.max_calls:
            # Rate limit reached, return the local compression
            return local_compressed[:5]

        # Use the LLM to refine
        self.llm_calls.append(time.time())
        return self.llm_compressor.compress(query, local_compressed)

    def _clean_old_calls(self):
        cutoff = time.time() - 60
        self.llm_calls = [t for t in self.llm_calls if t > cutoff]
```
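
A short usage sketch; the 500-token cutoff above is the point where the class decides local extraction alone is enough.

```python
# Usage sketch: local extraction first, LLM refinement only within the rate limit.
cost_compressor = CostOptimizedCompressor(max_llm_calls_per_minute=30)
compressed = cost_compressor.compress("What is the return deadline?", docs)
```

Falling back to the top local sentences when the rate limit is hit trades a little quality for predictable spend, which is usually the right default for high-traffic endpoints.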

Next Steps

Contextual compression optimizes both quality and cost for your RAG pipeline.


Intelligent Compression with Ailog

Ailog implements contextual compression automatically:

  • Adaptive compression based on query complexity
  • Source preservation for citations
  • Cost optimization with local compression priority
  • Integrated monitoring of compression/quality ratio

Try it for free and reduce your LLM costs by up to 80%.

Tags

rag, retrieval, compression, context, llm
