Advanced Retrieval Strategies for RAG
Beyond basic similarity search: hybrid search, query expansion, MMR, and multi-stage retrieval for better RAG performance.
TL;DR
- Hybrid search (semantic + keyword) beats pure semantic by 20-35%
- Query expansion helps when queries are vague or use different terminology
- MMR reduces redundancy in retrieved results
- Start simple: Pure semantic → Add hybrid → Optimize with reranking
- Test retrieval strategies side-by-side on Ailog
Beyond Simple Similarity Search
Basic RAG uses semantic similarity to retrieve documents. While effective, this approach has limitations:
- Keyword blindness: Misses exact term matches (product IDs, proper nouns)
- Query-document mismatch: Questions phrased differently than answers
- Redundancy: Retrieved chunks often contain similar information
- Context insufficiency: Top-k chunks may not provide complete context
Advanced retrieval strategies address these limitations.
Hybrid Search
Combines semantic (vector) and lexical (keyword) search.
BM25 + Vector Search
BM25 (Best Matching 25): Statistical keyword ranking
```python
import numpy as np
from rank_bm25 import BM25Okapi
from sklearn.metrics.pairwise import cosine_similarity

# Index documents for keyword search
tokenized_docs = [doc.split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)

# Keyword search
keyword_scores = bm25.get_scores(query.split())

# Vector search (query_embedding: shape (1, dim), doc_embeddings: shape (n, dim))
vector_scores = cosine_similarity(query_embedding, doc_embeddings)[0]

# Normalize both score sets to [0, 1] so the weighted average is meaningful
def min_max(scores):
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.min()) / (scores.max() - scores.min() + 1e-9)

# Combine scores (weighted average)
alpha = 0.7  # Weight for vector search
final_scores = alpha * min_max(vector_scores) + (1 - alpha) * min_max(keyword_scores)

# Retrieve top-k
top_k_indices = np.argsort(final_scores)[-k:][::-1]
```
Reciprocal Rank Fusion (RRF)
Combine rankings from multiple retrievers.
```python
def reciprocal_rank_fusion(rankings_list, k=60):
    """
    rankings_list: List of ranked document IDs from different retrievers
    k: Constant (typically 60)
    """
    scores = {}
    for ranking in rankings_list:
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id not in scores:
                scores[doc_id] = 0
            scores[doc_id] += 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Example usage
vector_results = ["doc1", "doc3", "doc5", "doc2"]
bm25_results = ["doc2", "doc1", "doc4", "doc3"]

final_ranking = reciprocal_rank_fusion([vector_results, bm25_results])
# Result: [("doc1", score), ("doc2", score), ...]
```
When to Use Hybrid Search
Use hybrid when:
- Queries contain specific terms (IDs, names, technical terms)
- Mix of semantic and exact matching needed
- Domain has specialized vocabulary
Use vector-only when:
- Natural language queries
- Synonym handling critical
- Multilingual search
Benchmarks show:
- Hybrid often outperforms either alone by 10-20%
- Especially effective for technical domains
- Critical for product search, code search
Query Expansion
Reformulate or expand queries for better retrieval.
Multi-Query Generation
Generate multiple query variations.
```python
def generate_query_variations(query, llm):
    prompt = f"""Given the user query, generate 3 variations that capture different aspects:

Original: {query}

Generate 3 variations:
1.
2.
3.
"""
    # Parse the numbered lines of the LLM response into a list of variation strings
    response = llm.generate(prompt)
    variations = [line.lstrip("0123456789. ").strip() for line in response.splitlines() if line.strip()]
    all_queries = [query] + variations

    # Retrieve for each query (retrieve, deduplicate, rerank are application-level helpers)
    all_results = []
    for q in all_queries:
        results = retrieve(q, k=5)
        all_results.extend(results)

    # Deduplicate and rerank against the original query
    unique_results = deduplicate(all_results)
    return rerank(unique_results, query)
```
Benefits:
- Captures multiple interpretations
- Increases recall
- Handles ambiguous queries
Cost:
- Multiple retrievals (slower, more expensive)
- LLM call for generation
HyDE (Hypothetical Document Embeddings)
Generate a hypothetical answer, then search for it.
```python
def hyde_retrieval(query, llm, k=5):
    # Generate a hypothetical answer to the question
    prompt = f"""Write a passage that would answer this question:

Question: {query}

Passage:"""
    hypothetical_answer = llm.generate(prompt)

    # Embed and search using the hypothetical answer instead of the raw query
    answer_embedding = embed(hypothetical_answer)
    results = vector_search(answer_embedding, k=k)

    return results
```
Why it works:
- Hypothetical answers are semantically similar to real answer passages (questions often are not)
- Bridges query-document gap
- Effective when questions and answers are phrased differently
When to use:
- Question-answering systems
- When queries are questions but documents are statements
- Academic/research search
Query Decomposition
Break complex queries into sub-queries.
```python
def decompose_query(complex_query, llm):
    prompt = f"""Break this complex question into simpler sub-questions:

Question: {complex_query}

Sub-questions:
1.
2.
3.
"""
    # Parse the numbered output into individual sub-questions
    response = llm.generate(prompt)
    sub_questions = [line.lstrip("0123456789. ").strip() for line in response.splitlines() if line.strip()]

    # Retrieve for each sub-question
    all_contexts = []
    for sub_q in sub_questions:
        contexts = retrieve(sub_q, k=3)
        all_contexts.extend(contexts)

    # Generate the final answer using all retrieved contexts
    final_answer = llm.generate(
        context=all_contexts,
        query=complex_query
    )
    return final_answer
```
Use cases:
- Multi-hop questions
- Complex analytical queries
- When single retrieval is insufficient
Maximal Marginal Relevance (MMR)
Reduce redundancy in retrieved results.
```python
def mmr(query_embedding, doc_embeddings, documents, k=5, lambda_param=0.7):
    """
    Maximize relevance while minimizing similarity to already-selected docs.

    lambda_param: Tradeoff between relevance (1.0) and diversity (0.0)
    """
    selected = []
    remaining = list(range(len(documents)))

    while len(selected) < k and remaining:
        mmr_scores = []
        for i in remaining:
            # Relevance to the query
            relevance = cosine_similarity(query_embedding, doc_embeddings[i])

            # Max similarity to already-selected docs
            if selected:
                similarities = [
                    cosine_similarity(doc_embeddings[i], doc_embeddings[j])
                    for j in selected
                ]
                max_sim = max(similarities)
            else:
                max_sim = 0

            # MMR score
            mmr_score = lambda_param * relevance - (1 - lambda_param) * max_sim
            mmr_scores.append((i, mmr_score))

        # Select the document with the best MMR score
        best = max(mmr_scores, key=lambda x: x[1])
        selected.append(best[0])
        remaining.remove(best[0])

    return [documents[i] for i in selected]
```
Parameters:
- lambda_param = 1.0: Pure relevance (no diversity)
- lambda_param = 0.5: Balance relevance and diversity
- lambda_param = 0.0: Maximum diversity
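A quick usage sketch of the mmr function defined above; the embed helper and candidate_documents list are assumed to come from your existing pipeline:

```python
# Embed the query and a relevance-ordered candidate pool (names here are hypothetical)
query_emb = embed("strategies to reduce customer churn")
candidate_embs = [embed(doc) for doc in candidate_documents]

# Mostly relevance, light diversity penalty
top_docs = mmr(query_emb, candidate_embs, candidate_documents, k=5, lambda_param=0.7)

# Lean toward diversity, e.g. for summarization over many near-duplicate chunks
broad_docs = mmr(query_emb, candidate_embs, candidate_documents, k=5, lambda_param=0.3)
```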
Use when:
- Retrieved chunks are very similar
- Need diverse perspectives
- Summarization tasks
Parent-Child Retrieval
Retrieve small chunks, return larger context.
```python
class ParentChildRetriever:
    def __init__(self, documents):
        self.parents = []          # Original documents
        self.children = []         # Small chunks
        self.child_to_parent = {}  # Mapping from chunk index to parent index

        for doc_id, doc in enumerate(documents):
            # Split into small chunks for precise retrieval
            chunks = split_document(doc, chunk_size=256)
            for chunk_id, chunk in enumerate(chunks):
                self.children.append(chunk)
                self.child_to_parent[len(self.children) - 1] = doc_id
            self.parents.append(doc)

        # Embed children for retrieval
        self.child_embeddings = embed_batch(self.children)

    def retrieve(self, query, k=3):
        # Search small chunks for precision
        query_emb = embed(query)
        child_indices = vector_search(query_emb, self.child_embeddings, k=k)

        # Return parent documents for context
        parent_indices = [self.child_to_parent[i] for i in child_indices]
        unique_parents = list(set(parent_indices))
        return [self.parents[i] for i in unique_parents]
```
Benefits:
- Precise retrieval (small chunks)
- Rich context (large documents)
- Best of both worlds
Use when:
- Need full context for generation
- Documents have natural hierarchy (sections, paragraphs)
- Context window allows larger chunks
Ensemble Retrieval
Combine multiple retrieval methods.
```python
class EnsembleRetriever:
    def __init__(self, retrievers, weights=None):
        self.retrievers = retrievers
        self.weights = weights or [1.0] * len(retrievers)

    def retrieve(self, query, k=5):
        all_results = []

        # Get results from each retriever
        for retriever, weight in zip(self.retrievers, self.weights):
            results = retriever.retrieve(query, k=k*2)  # Overfetch

            # Weight scores
            for doc, score in results:
                all_results.append((doc, score * weight))

        # Deduplicate and aggregate scores
        doc_scores = {}
        for doc, score in all_results:
            doc_id = doc['id']
            if doc_id not in doc_scores:
                doc_scores[doc_id] = {'doc': doc, 'score': 0}
            doc_scores[doc_id]['score'] += score

        # Sort and return top-k
        ranked = sorted(
            doc_scores.values(),
            key=lambda x: x['score'],
            reverse=True
        )
        return [item['doc'] for item in ranked[:k]]
```
Example ensemble:
```python
ensemble = EnsembleRetriever(
    retrievers=[
        VectorRetriever(embedding_model="openai"),
        BM25Retriever(),
        VectorRetriever(embedding_model="sentence-transformers")
    ],
    weights=[0.5, 0.3, 0.2]
)
```
Self-Query Retrieval
Extract filters from natural language queries.
```python
def self_query_retrieval(query, llm, vector_db):
    # Extract a structured query from the natural-language input
    prompt = f"""Extract search filters from this query:

Query: {query}

Extract:
- search_text: Semantic search text
- filters: Metadata filters (dict)

Output (JSON):"""
    structured = llm.generate(prompt, format="json")
    # Example output:
    # {
    #   "search_text": "customer support best practices",
    #   "filters": {"department": "support", "date_range": "2024"}
    # }

    # Execute filtered search
    results = vector_db.query(
        text=structured['search_text'],
        filter=structured['filters'],
        k=5
    )
    return results
```
Benefits:
- Leverages metadata effectively
- Natural language interface to filters
- Better precision
Use when:
- Rich metadata available
- Queries contain filterable attributes
- Time-based, category-based, or attribute-based filtering needed
Multi-Stage Retrieval
Coarse-to-fine retrieval pipeline.
```python
class MultiStageRetriever:
    def __init__(self, fast_retriever, accurate_reranker):
        self.retriever = fast_retriever
        self.reranker = accurate_reranker

    def retrieve(self, query, k=5):
        # Stage 1: Fast retrieval (overfetch)
        candidates = self.retriever.retrieve(query, k=k*10)

        # Stage 2: Accurate reranking
        reranked = self.reranker.rerank(query, candidates)

        # Return top-k
        return reranked[:k]
```
Stages:
- Retrieval (fast, high recall): 100 candidates
- Reranking (accurate, expensive): Top 10
- Optional: LLM-based refinement: Top 3
Benefits:
- Balance speed and accuracy
- Cost-effective (expensive models on small candidate set)
- Higher quality results
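The reranker in the sketch above is left abstract. One way to implement it is with a cross-encoder from the sentence-transformers library; the class below is a minimal sketch that assumes candidates are plain text strings. Reranking is covered in depth in the next guide.

```python
from sentence_transformers import CrossEncoder

class CrossEncoderReranker:
    def __init__(self, model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
        # A cross-encoder scores each (query, document) pair jointly:
        # slower than bi-encoder retrieval, but noticeably more accurate
        self.model = CrossEncoder(model_name)

    def rerank(self, query, candidates):
        # Score every (query, candidate) pair, then sort candidates by score
        pairs = [(query, doc) for doc in candidates]
        scores = self.model.predict(pairs)
        ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
        return [doc for doc, _ in ranked]

# Plugging it into the multi-stage retriever defined above:
# retriever = MultiStageRetriever(fast_retriever, CrossEncoderReranker())
```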
Contextual Compression
Remove irrelevant parts from retrieved chunks.
```python
def compress_context(query, chunks, llm):
    compressed = []
    for chunk in chunks:
        prompt = f"""Extract only the parts relevant to the query:

Query: {query}

Document: {chunk}

Relevant extract:"""
        relevant_part = llm.generate(prompt, max_tokens=200)
        compressed.append(relevant_part)
    return compressed
```
Benefits:
- Reduce token usage
- Fit more chunks in context window
- Focus on relevant information
Costs:
- LLM calls (expensive)
- Additional latency
Use when:
- Token budget is tight
- Retrieved chunks are long and partially relevant
- Need to fit many sources
Choosing a Retrieval Strategy
Decision Framework
Start with:
- Basic semantic search (vector similarity)
- k=3 to 5 chunks
Add hybrid search if:
- Queries contain specific terms
- Domain has specialized vocabulary
- Performance improves in evaluation
Add query expansion if:
- Queries are ambiguous
- Recall is more important than precision
- Willing to accept higher latency/cost
Add MMR if:
- Retrieved chunks are redundant
- Need diverse perspectives
- Summarization or analysis tasks
Add reranking if:
- Top-k results are not consistently relevant
- Willing to trade latency for quality
- Budget allows (next guide covers this)
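The decision framework can be read as an incremental pipeline. The sketch below is one possible wiring, assuming hypothetical semantic_search and hybrid_search helpers plus the mmr function and reranker pattern from the earlier sections:

```python
def retrieve_pipeline(query, k=5, use_hybrid=False, use_mmr=False, reranker=None):
    # Overfetch when later stages will prune the candidate pool
    pool_size = k * 4 if (use_mmr or reranker) else k

    # Step 1: start simple (semantic); switch to hybrid if evaluation favors it
    if use_hybrid:
        candidates = hybrid_search(query, k=pool_size)    # hypothetical helper
    else:
        candidates = semantic_search(query, k=pool_size)  # hypothetical helper

    # Step 2: optional diversity pass (MMR from the earlier section)
    if use_mmr:
        candidate_embs = [embed(c) for c in candidates]
        candidates = mmr(embed(query), candidate_embs, candidates, k=pool_size)

    # Step 3: optional reranking pass (see the multi-stage section)
    if reranker is not None:
        candidates = reranker.rerank(query, candidates)

    return candidates[:k]
```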
Performance Impact
| Strategy | Latency Impact | Cost Impact | Quality Gain |
|---|---|---|---|
| Hybrid search | +20-50ms | Low | +10-20% |
| Multi-query | +3x | High | +15-25% |
| HyDE | +LLM call | High | +10-30% |
| MMR | +10-50ms | Low | +5-15% |
| Parent-child | +0-20ms | Medium | +10-20% |
| Reranking | +50-200ms | Medium | +20-40% |
Practical Implementation
LangChain Example
```python
from langchain.retrievers import (
    BM25Retriever,
    EnsembleRetriever,
    ContextualCompressionRetriever
)
from langchain.retrievers.document_compressors import LLMChainExtractor

# Ensemble: Vector + BM25
vector_retriever = vector_db.as_retriever(search_kwargs={"k": 5})
bm25_retriever = BM25Retriever.from_documents(documents)

ensemble = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.7, 0.3]
)

# Add contextual compression on top of the ensemble
compressor = LLMChainExtractor.from_llm(llm)
compressed_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=ensemble
)
```
LlamaIndex Example
```python
from llama_index import VectorStoreIndex, SimpleKeywordTableIndex
from llama_index.retrievers import RouterRetriever

# Create retrievers
vector_retriever = VectorStoreIndex.from_documents(documents).as_retriever()
keyword_retriever = SimpleKeywordTableIndex.from_documents(documents).as_retriever()

# Router retriever (chooses a retriever based on the query)
router = RouterRetriever(
    selector=llm_selector,
    retriever_dict={
        "vector": vector_retriever,
        "keyword": keyword_retriever
    }
)

# Query-dependent routing
results = router.retrieve("What are the system requirements?")
```
Monitoring Retrieval Quality
Track these metrics:
Retrieval Metrics:
- Precision@k: Relevant docs in top-k
- Recall@k: Retrieved relevant docs / all relevant docs
- MRR: Mean Reciprocal Rank of the first relevant result (a sketch computing these metrics follows below)
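A minimal sketch for computing these three metrics offline, assuming you have a labeled set of relevant document IDs per query and the ranked IDs your retriever returned:

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    # Fraction of the top-k retrieved docs that are relevant
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k

def recall_at_k(retrieved_ids, relevant_ids, k):
    # Fraction of all relevant docs that appear in the top-k
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / max(len(relevant_ids), 1)

def mrr(retrieved_ids, relevant_ids):
    # Reciprocal rank of the first relevant result (0 if none retrieved)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1 / rank
    return 0.0

# Example with hypothetical labels
retrieved = ["doc3", "doc7", "doc1", "doc9", "doc2"]
relevant = {"doc1", "doc2"}
print(precision_at_k(retrieved, relevant, 5))  # 0.4
print(recall_at_k(retrieved, relevant, 5))     # 1.0
print(mrr(retrieved, relevant))                # 1/3 ≈ 0.333
```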
User Metrics:
- Answer quality ratings
- Follow-up question rate
- Task completion
Technical Metrics:
- Retrieval latency
- Cache hit rate
- Query throughput
💡 Expert Tip from Ailog: Hybrid search is the single highest-ROI retrieval improvement. Adding BM25 keyword search alongside semantic search consistently delivers 20-35% better results across domains. It's easier to implement than query expansion and more reliable than MMR tuning. If you only make one retrieval optimization, make it hybrid search. We implemented it in production and saw immediate quality gains with minimal complexity.
Test Retrieval Strategies on Ailog
Compare retrieval methods with your documents:
Ailog lets you benchmark:
- Pure semantic vs hybrid search
- Different query expansion techniques
- MMR vs standard retrieval
- Custom metadata filtering
See real metrics: Precision@k, MRR, latency comparisons
Start testing → Free access to all retrieval strategies.
Next Steps
Retrieved documents often need reranking to optimize for the specific query. The next guide covers reranking strategies and cross-encoder models to further improve RAG quality.
Related Guides
Hybrid Search: Combine Semantic and Keyword Search
Boost retrieval accuracy by 20-30%: combine vector search with BM25 keyword matching for superior RAG performance.
Query Expansion: Retrieve More Relevant Results
Improve recall by 40%: expand user queries with synonyms, sub-queries, and LLM-generated variations.
MMR: Diversify Search Results with Maximal Marginal Relevance
Reduce redundancy in RAG retrieval: use MMR to balance relevance and diversity for better context quality.