Sparse Retrieval and BM25: When Lexical Search Wins
Discover sparse retrieval and BM25 for precise lexical search. Use cases, implementation, and comparison with dense retrieval explained.
Sparse retrieval, embodied by the BM25 algorithm, remains a dominant force in information retrieval. Despite the hype around embeddings and dense retrieval, lexical search often outperforms semantic approaches in specific cases. This guide explores the mechanisms of sparse retrieval, its strengths, and when to prioritize it.
What is Sparse Retrieval?
Sparse retrieval represents documents as "sparse" vectors where each dimension corresponds to a vocabulary term. Most values are zero because a document contains only a fraction of the total vocabulary.
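As a toy sketch (the five-word vocabulary and the helper name are hypothetical, just for illustration), a document can be stored as only its non-zero dimensions:

```python
from collections import Counter

# Toy vocabulary shared by the whole corpus (hypothetical example)
vocabulary = ["return", "refund", "product", "delivery", "policy"]

def to_sparse_vector(text: str) -> dict[int, int]:
    """Map a document to {dimension_index: term_frequency}.
    Absent terms stay as implicit zeros, which is what makes the vector sparse."""
    counts = Counter(text.lower().split())
    return {i: counts[term] for i, term in enumerate(vocabulary) if counts[term] > 0}

vec = to_sparse_vector("refund policy for product return")
# Only 4 of the 5 dimensions are non-zero; a real vocabulary has tens of thousands
```

A real index stores these index/value pairs in an inverted index rather than as Python dicts, but the principle is the same.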
From TF-IDF to BM25
The evolution of sparse retrieval algorithms:
TF-IDF (1972) → BM25 (1994) → BM25+ (2011)
- TF-IDF: Score = TF × IDF
- BM25: adds TF saturation and document length normalization
- BM25+: corrects the length-normalization bias against short documents
How BM25 Works
BM25 (Best Matching 25) calculates a relevance score based on the frequency of query terms in documents, with intelligent adjustments.
The BM25 Formula
Score(D, Q) = Σ IDF(qi) × (f(qi, D) × (k1 + 1)) / (f(qi, D) + k1 × (1 - b + b × |D|/avgdl))
Where:
- f(qi, D): frequency of term qi in document D
- |D|: document length
- avgdl: average document length
- k1: saturation parameter (typically 1.2-2.0)
- b: length normalization parameter (typically 0.75)
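The IDF component itself is not spelled out in the formula above; a common variant (the one used by Lucene and Elasticsearch) can be sketched as:

```python
import math

def idf(num_docs: int, doc_freq: int) -> float:
    """BM25 IDF as used by Lucene: rare terms get a high weight,
    ubiquitous terms a weight near zero."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

# With a 1000-document corpus:
idf(1000, 5)    # a rare term -> high weight (~5.2)
idf(1000, 950)  # a near-ubiquitous term like "the" -> near zero (~0.05)
```

The `+1` inside the log keeps the value non-negative even for terms present in more than half the corpus, which the original Robertson-Spärck Jones formulation did not guarantee.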
Intuition Behind BM25
```python
def bm25_score_explained(query_terms, document, corpus_stats, k1=1.2, b=0.75):
    """Intuitive explanation of BM25 scoring"""
    score = 0
    doc_length = len(document)
    avg_length = corpus_stats['average_length']

    for term in query_terms:
        # 1. IDF: rare terms = more important
        # "algorithm" in 5% of docs > "the" in 95%
        idf = compute_idf(term, corpus_stats)

        # 2. TF with saturation: prevents over-weighting repetitions
        # "machine learning machine learning machine" ≠ 5× better
        tf = document.count(term)

        # 3. Length normalization: compensates for long documents
        # A 10-page document with 5 mentions ≠ a 2-line FAQ with 5 mentions
        saturated_tf = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_length / avg_length))

        score += idf * saturated_tf

    return score
```
Practical Implementation
With rank_bm25
```python
from rank_bm25 import BM25Okapi
import nltk

# Prepare documents
documents = [
    "How to return a defective product",
    "Refund policy within 30 days",
    "Delivery times to continental US",
    "Return fees and conditions",
    "Customer service available 24/7"
]

# Tokenization
tokenized_docs = [nltk.word_tokenize(doc.lower()) for doc in documents]

# Create BM25 index
bm25 = BM25Okapi(tokenized_docs)

# Search
query = "return product"
tokenized_query = nltk.word_tokenize(query.lower())
scores = bm25.get_scores(tokenized_query)

# Sorted results
results = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
for doc, score in results[:3]:
    print(f"Score: {score:.3f} - {doc}")
```
With Elasticsearch
```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Create index with a custom BM25 similarity
es.indices.create(
    index="knowledge_base",
    body={
        "settings": {
            "similarity": {
                "custom_bm25": {
                    "type": "BM25",
                    "k1": 1.2,
                    "b": 0.75
                }
            }
        },
        "mappings": {
            "properties": {
                "content": {
                    "type": "text",
                    "similarity": "custom_bm25",
                    "analyzer": "english"
                },
                "category": {"type": "keyword"}
            }
        }
    }
)

# Index documents
for i, doc in enumerate(documents):
    es.index(index="knowledge_base", id=i, body={"content": doc})

# Search
results = es.search(
    index="knowledge_base",
    body={
        "query": {
            "match": {
                "content": {
                    "query": "return product",
                    "operator": "or"
                }
            }
        }
    }
)
```
With Qdrant (sparse vectors)
```python
from collections import Counter

from qdrant_client import QdrantClient
from qdrant_client.models import (
    SparseVectorParams, PointStruct, SparseVector
)

client = QdrantClient("localhost", port=6333)

# Collection with sparse vectors
client.create_collection(
    collection_name="bm25_docs",
    sparse_vectors_config={
        "text": SparseVectorParams()
    }
)

def compute_sparse_vector(text: str, idf_dict: dict) -> SparseVector:
    """Convert text to a BM25-like sparse vector.
    idf_dict maps each vocabulary term to its IDF weight,
    precomputed over the corpus."""
    tokens = text.lower().split()
    tf = Counter(tokens)

    indices = []
    values = []
    for token, freq in tf.items():
        if token in idf_dict:
            token_id = hash(token) % 1000000  # Simple hash
            # Simplified TF-IDF score
            score = freq * idf_dict.get(token, 1.0)
            indices.append(token_id)
            values.append(score)

    return SparseVector(indices=indices, values=values)

# Index
for i, doc in enumerate(documents):
    sparse_vec = compute_sparse_vector(doc, idf_dict)
    client.upsert(
        collection_name="bm25_docs",
        points=[PointStruct(
            id=i,
            payload={"content": doc},
            vector={"text": sparse_vec}
        )]
    )
```
When Sparse Retrieval Beats Dense
1. Exact Match Required
Query: "Error 503"
Dense retrieval: finds "Server problem", "Site unavailable"
Sparse retrieval: finds exactly "Error 503 - Service unavailable"
For error codes, serial numbers, and product references, sparse retrieval is unbeatable.
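The lexical behavior behind this can be shown with a minimal sketch (not actual BM25 scoring, just the verbatim-token filter that a lexical index makes possible; the helper name is hypothetical):

```python
def exact_term_hits(query: str, docs: list[str]) -> list[str]:
    """Keep only documents containing every query token verbatim (case-insensitive)."""
    q_tokens = query.lower().split()
    return [d for d in docs if all(t in d.lower().split() for t in q_tokens)]

docs = [
    "Error 503 - Service unavailable",
    "Server problem detected",
    "Site unavailable during maintenance",
]
exact_term_hits("Error 503", docs)  # -> ['Error 503 - Service unavailable']
```

A dense retriever would rank all three documents as broadly similar; the lexical match pins the one document that actually contains the token "503".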
2. Technical or Rare Terms
Query: "Riemannian covariance tensor"
Dense retrieval: confused, finds general tensor articles
Sparse retrieval: exact match on documents containing these precise terms
Embedding models rarely saw these specialized terms during training.
3. Combinatorial Searches
Query: "Python asyncio websocket"
Dense retrieval: understands overall meaning but may miss exact combinations
Sparse retrieval: finds documents containing all 3 terms
4. Proper Names and Entities
Query: "John Smith invoice 2024"
Dense retrieval: loses proper name in embedding
Sparse retrieval: exact match on name
Comparison Table
| Use Case | Dense | Sparse | Winner |
|---|---|---|---|
| Semantic rephrasing | Excellent | Poor | Dense |
| Exact match | Poor | Excellent | Sparse |
| Technical terms | Medium | Excellent | Sparse |
| Synonyms | Excellent | Poor | Dense |
| Proper names | Medium | Excellent | Sparse |
| Long queries | Excellent | Medium | Dense |
| 1-2 word queries | Medium | Excellent | Sparse |
Optimizing BM25
Tuning k1 and b Parameters
```python
def grid_search_bm25_params(queries, relevant_docs, corpus):
    """Find optimal k1 and b parameters"""
    best_score = 0
    best_params = {}

    for k1 in [0.5, 1.0, 1.2, 1.5, 2.0]:
        for b in [0.25, 0.5, 0.75, 1.0]:
            bm25 = BM25Okapi(corpus, k1=k1, b=b)

            # Evaluate recall@10 over the labeled queries
            total_recall = 0
            for query, relevant in zip(queries, relevant_docs):
                scores = bm25.get_scores(query)
                top_k = sorted(range(len(scores)),
                               key=lambda i: scores[i],
                               reverse=True)[:10]
                hits = len(set(top_k) & set(relevant))
                total_recall += hits / len(relevant)

            avg_recall = total_recall / len(queries)
            if avg_recall > best_score:
                best_score = avg_recall
                best_params = {"k1": k1, "b": b}

    return best_params
```
General rules:
- Long documents → higher b (0.75-1.0)
- Short documents (FAQ) → lower b (0.3-0.5)
- Repetitive queries → lower k1 (0.5-1.0)
Linguistic Analysis
Sparse retrieval quality depends heavily on tokenization and preprocessing:
```python
import spacy

nlp = spacy.load("en_core_web_md")

def preprocess_english(text: str) -> list[str]:
    """Optimized preprocessing for English"""
    doc = nlp(text.lower())
    tokens = []
    for token in doc:
        # Skip punctuation and stopwords
        if token.is_punct or token.is_stop:
            continue
        # Lemmatization: "returned" → "return"
        lemma = token.lemma_
        # Filter very short tokens
        if len(lemma) > 2:
            tokens.append(lemma)
    return tokens

# Example
text = "Returned products will be refunded within 15 days"
tokens = preprocess_english(text)
# ['return', 'product', 'refund', 'day']
```
Query Expansion
```python
def expand_query_synonyms(query: str, synonyms_dict: dict) -> str:
    """Enrich query with synonyms"""
    expanded_terms = []
    for term in query.split():
        expanded_terms.append(term)
        if term in synonyms_dict:
            expanded_terms.extend(synonyms_dict[term])
    return " ".join(expanded_terms)

synonyms = {
    "return": ["refund", "exchange", "send back"],
    "product": ["item", "order", "purchase"],
    "problem": ["issue", "error", "bug", "incident"]
}

query = "return product"
expanded = expand_query_synonyms(query, synonyms)
# "return refund exchange send back product item order purchase"
```
BM25F: For Structured Documents
BM25F extends BM25 to documents with multiple fields:
```python
# Elasticsearch configuration with field boosting
es.search(
    index="products",
    body={
        "query": {
            "multi_match": {
                "query": "samsung smartphone",
                "fields": [
                    "title^3",        # Title: weight x3
                    "description^1",  # Description: weight x1
                    "category^2"      # Category: weight x2
                ],
                "type": "best_fields"
            }
        }
    }
)
```
```python
# Manual BM25F-style scoring (bm25_score is the single-field scorer from earlier).
# Note: this is a simplified weighted sum of per-field scores; true BM25F pools
# the weighted term frequencies across fields before applying saturation.
def bm25f_score(query, document_fields, field_weights):
    """
    BM25F: multi-field scoring

    document_fields = {
        "title": "Samsung Galaxy S24",
        "description": "High-end smartphone with AMOLED display...",
        "category": "Phones"
    }
    field_weights = {"title": 3, "description": 1, "category": 2}
    """
    total_score = 0
    for field_name, content in document_fields.items():
        weight = field_weights.get(field_name, 1)
        field_score = bm25_score(query, content)
        total_score += weight * field_score
    return total_score
```
Limitations of Sparse Retrieval
1. Vocabulary Mismatch
Document: "Electric vehicle with lithium battery"
Query: "eco-friendly car"
→ BM25 score = 0 (no common terms)
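The collapse to zero follows directly from the formula: the score is a sum over query terms, and every term's f(qi, D) is 0. A minimal sketch (IDF and length normalization dropped for brevity) makes this visible:

```python
def bm25_like_score(query: str, document: str, k1: float = 1.2) -> float:
    """Simplified BM25 term saturation only, to show vocabulary mismatch.
    Each term contributes (tf * (k1 + 1)) / (tf + k1), which is 0 when tf == 0."""
    doc_tokens = document.lower().split()
    score = 0.0
    for term in query.lower().split():
        tf = doc_tokens.count(term)
        score += (tf * (k1 + 1)) / (tf + k1)
    return score

bm25_like_score("eco-friendly car", "Electric vehicle with lithium battery")  # -> 0.0
bm25_like_score("electric vehicle", "Electric vehicle with lithium battery")  # -> 2.0
```

No amount of parameter tuning fixes this: without a shared token, every term contribution is exactly zero.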
2. Typo Sensitivity
Query: "refnd" (typo)
→ Does not find "refund"
Solution: Fuzzy matching
```python
es.search(
    index="knowledge_base",
    body={
        "query": {
            "match": {
                "content": {
                    "query": "refnd",
                    "fuzziness": "AUTO"  # Tolerates 1-2 edit errors
                }
            }
        }
    }
)
```
3. No Context Understanding
Query: "Apple"
→ Finds both the brand and apple pie recipes
Sparse cannot disambiguate without additional context.
Integration in a RAG Pipeline
```python
class SparseRetriever:
    def __init__(self, documents: list[str]):
        self.documents = documents
        self.tokenized_docs = [self._preprocess(d) for d in documents]
        self.bm25 = BM25Okapi(self.tokenized_docs)

    def _preprocess(self, text: str) -> list[str]:
        # Tokenization and normalization
        return preprocess_english(text)

    def search(self, query: str, top_k: int = 5) -> list[dict]:
        tokenized_query = self._preprocess(query)
        scores = self.bm25.get_scores(tokenized_query)

        # Top k document indices
        top_indices = sorted(
            range(len(scores)),
            key=lambda i: scores[i],
            reverse=True
        )[:top_k]

        return [
            {
                "content": self.documents[i],
                "score": scores[i],
                "method": "bm25"
            }
            for i in top_indices
        ]
```
Next Steps
Sparse retrieval excels at exact matches but lacks semantic understanding. The solution? Combine both approaches.
- Hybrid Fusion - Combining dense and sparse
- Dense Retrieval - Semantic search with embeddings
- Retrieval Fundamentals - Overview
Optimized Sparse Retrieval with Ailog
Ailog automatically combines sparse and dense retrieval for optimal results:
- Native language BM25 with lemmatization and linguistic analysis
- Automatic hybrid fusion adapted to your content
- Parameter tuning based on user feedback
- Zero configuration - everything works from import
Try for free and get the best of both worlds.