5. Retrieval

Sparse Retrieval and BM25: When Lexical Search Wins

March 8, 2026
Ailog Team

Discover sparse retrieval and BM25 for precise lexical search. Use cases, implementation, and comparison with dense retrieval explained.

Sparse retrieval, embodied by the BM25 algorithm, remains a dominant force in information retrieval. Despite the hype around embeddings and dense retrieval, lexical search often outperforms semantic approaches in specific cases. This guide explores the mechanisms of sparse retrieval, its strengths, and when to prioritize it.

What is Sparse Retrieval?

Sparse retrieval represents documents as "sparse" vectors where each dimension corresponds to a vocabulary term. Most values are zero because a document contains only a fraction of the total vocabulary.
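As a toy illustration (with a hypothetical five-term vocabulary), a sparse vector can be stored as a dictionary mapping vocabulary indices to term frequencies, leaving the zeros implicit:

```python
from collections import Counter

# Hypothetical shared vocabulary: each term owns one dimension
vocabulary = ["return", "product", "refund", "delivery", "policy"]

def to_sparse_vector(text: str) -> dict[int, int]:
    """Map text to {dimension_index: term_frequency}; absent terms stay implicit zeros."""
    counts = Counter(text.lower().split())
    return {i: counts[term] for i, term in enumerate(vocabulary) if counts[term] > 0}

print(to_sparse_vector("return the product for a refund"))
# {0: 1, 1: 1, 2: 1}
```

Only 3 of the 5 dimensions are stored; in a real corpus with a vocabulary of hundreds of thousands of terms, this sparsity is what makes inverted-index lookups fast.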

From TF-IDF to BM25

The evolution of sparse retrieval algorithms:

TF-IDF (1972)          →    BM25 (1994)                →    BM25+ (2011)
Score = TF × IDF            TF saturation +                 Short-document
                            length normalization            bias correction

How BM25 Works

BM25 (Best Matching 25) calculates a relevance score based on the frequency of query terms in documents, with intelligent adjustments.

The BM25 Formula

Score(D, Q) = Σ IDF(qi) × (f(qi, D) × (k1 + 1)) / (f(qi, D) + k1 × (1 - b + b × |D|/avgdl))

Where:

  • f(qi, D): frequency of term qi in document D
  • |D|: document length
  • avgdl: average document length
  • k1: saturation parameter (typically 1.2-2.0)
  • b: length normalization parameter (typically 0.75)

Intuition Behind BM25

```python
def bm25_score_explained(query_terms, document, corpus_stats, k1=1.5, b=0.75):
    """Intuitive explanation of BM25 scoring"""
    score = 0
    doc_length = len(document)
    avg_length = corpus_stats['average_length']

    for term in query_terms:
        # 1. IDF: rare terms = more important
        #    "algorithm" in 5% of docs > "the" in 95%
        idf = compute_idf(term, corpus_stats)

        # 2. TF with saturation: prevents over-weighting repetitions
        #    "machine learning machine learning machine" ≠ 5× better
        tf = document.count(term)
        saturated_tf = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_length / avg_length))

        # 3. Length normalization (the b term above) compensates for long documents:
        #    a 10-page document with 5 mentions ≠ a 2-line FAQ with 5 mentions
        score += idf * saturated_tf

    return score
```

Practical Implementation

With rank_bm25

```python
from rank_bm25 import BM25Okapi
import nltk

# Prepare documents
documents = [
    "How to return a defective product",
    "Refund policy within 30 days",
    "Delivery times to continental US",
    "Return fees and conditions",
    "Customer service available 24/7",
]

# Tokenization
tokenized_docs = [nltk.word_tokenize(doc.lower()) for doc in documents]

# Create BM25 index
bm25 = BM25Okapi(tokenized_docs)

# Search
query = "return product"
tokenized_query = nltk.word_tokenize(query.lower())
scores = bm25.get_scores(tokenized_query)

# Sorted results
results = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
for doc, score in results[:3]:
    print(f"Score: {score:.3f} - {doc}")
```

With Elasticsearch

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Create index with a custom BM25 similarity
es.indices.create(
    index="knowledge_base",
    body={
        "settings": {
            "similarity": {
                "custom_bm25": {
                    "type": "BM25",
                    "k1": 1.2,
                    "b": 0.75
                }
            }
        },
        "mappings": {
            "properties": {
                "content": {
                    "type": "text",
                    "similarity": "custom_bm25",
                    "analyzer": "english"
                },
                "category": {"type": "keyword"}
            }
        }
    }
)

# Index documents
for i, doc in enumerate(documents):
    es.index(index="knowledge_base", id=i, body={"content": doc})

# Search
results = es.search(
    index="knowledge_base",
    body={
        "query": {
            "match": {
                "content": {
                    "query": "return product",
                    "operator": "or"
                }
            }
        }
    }
)
```

With Qdrant (sparse vectors)

```python
import math
from collections import Counter

from qdrant_client import QdrantClient
from qdrant_client.models import (
    SparseVectorParams, PointStruct, SparseVector
)

client = QdrantClient("localhost", port=6333)

# Collection with sparse vectors
client.create_collection(
    collection_name="bm25_docs",
    sparse_vectors_config={
        "text": SparseVectorParams()
    }
)

# Build a simple IDF table from the corpus
doc_tokens = [doc.lower().split() for doc in documents]
idf_dict = {
    token: math.log(len(doc_tokens) / sum(token in dd for dd in doc_tokens))
    for d in doc_tokens for token in d
}

def compute_sparse_vector(text: str, idf_dict: dict) -> SparseVector:
    """Convert text to a BM25-like sparse vector"""
    tokens = text.lower().split()
    tf = Counter(tokens)

    indices = []
    values = []
    for token, freq in tf.items():
        if token in idf_dict:
            token_id = hash(token) % 1000000  # Simple hash
            # Simplified TF-IDF score
            score = freq * idf_dict.get(token, 1.0)
            indices.append(token_id)
            values.append(score)

    return SparseVector(indices=indices, values=values)

# Index
for i, doc in enumerate(documents):
    sparse_vec = compute_sparse_vector(doc, idf_dict)
    client.upsert(
        collection_name="bm25_docs",
        points=[PointStruct(
            id=i,
            payload={"content": doc},
            vector={"text": sparse_vec}
        )]
    )
```

When Sparse Retrieval Beats Dense

1. Exact Match Required

Query: "Error 503"
Dense retrieval: finds "Server problem", "Site unavailable"
Sparse retrieval: finds exactly "Error 503 - Service unavailable"

For error codes, serial numbers, and product references, sparse retrieval is unbeatable.
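To make this concrete, here is a minimal, self-contained BM25 scorer (a simplified sketch, not the internals of rank_bm25 or any production engine) run on the "Error 503" example:

```python
import math

def bm25(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Minimal BM25 over whitespace-tokenized documents (illustrative only)."""
    tokenized = [d.lower().split() for d in docs]
    n, avgdl = len(tokenized), sum(len(d) for d in tokenized) / len(tokenized)
    scores = []
    for doc in tokenized:
        s = 0.0
        for term in query.lower().split():
            df = sum(term in d for d in tokenized)           # document frequency
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
            tf = doc.count(term)
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores

docs = [
    "Error 503 - Service unavailable",
    "Server problem troubleshooting guide",
    "Site unavailable during maintenance",
]
scores = bm25("error 503", docs)
print(docs[max(range(len(docs)), key=scores.__getitem__)])
# Error 503 - Service unavailable
```

The token "503" appears only in the first document, so the semantically similar but lexically disjoint documents score zero for that term.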

2. Technical or Rare Terms

Query: "Riemannian covariance tensor"
Dense retrieval: confused, finds general tensor articles
Sparse retrieval: exact match on documents containing these precise terms

Embedding models rarely saw these specialized terms during training.

3. Combinatorial Searches

Query: "Python asyncio websocket"
Dense retrieval: understands overall meaning but may miss exact combinations
Sparse retrieval: finds documents containing all 3 terms

4. Proper Names and Entities

Query: "John Smith invoice 2024"
Dense retrieval: loses proper name in embedding
Sparse retrieval: exact match on name

Comparison Table

| Use Case            | Dense     | Sparse    | Winner |
|---------------------|-----------|-----------|--------|
| Semantic rephrasing | Excellent | Poor      | Dense  |
| Exact match         | Poor      | Excellent | Sparse |
| Technical terms     | Medium    | Excellent | Sparse |
| Synonyms            | Excellent | Poor      | Dense  |
| Proper names        | Medium    | Excellent | Sparse |
| Long queries        | Excellent | Medium    | Dense  |
| 1-2 word queries    | Medium    | Excellent | Sparse |

Optimizing BM25

Tuning k1 and b Parameters

```python
from rank_bm25 import BM25Okapi

def grid_search_bm25_params(queries, relevant_docs, corpus):
    """Find optimal k1 and b parameters (queries must be pre-tokenized)"""
    best_score = 0
    best_params = {}

    for k1 in [0.5, 1.0, 1.2, 1.5, 2.0]:
        for b in [0.25, 0.5, 0.75, 1.0]:
            bm25 = BM25Okapi(corpus, k1=k1, b=b)

            # Evaluate recall@10
            total_recall = 0
            for query, relevant in zip(queries, relevant_docs):
                scores = bm25.get_scores(query)
                top_k = sorted(range(len(scores)),
                               key=lambda i: scores[i],
                               reverse=True)[:10]
                hits = len(set(top_k) & set(relevant))
                total_recall += hits / len(relevant)

            avg_recall = total_recall / len(queries)
            if avg_recall > best_score:
                best_score = avg_recall
                best_params = {"k1": k1, "b": b}

    return best_params
```

General rules:

  • Long documents → higher b (0.75-1.0)
  • Short documents (FAQ) → lower b (0.3-0.5)
  • Repetitive queries → lower k1 (0.5-1.0)
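These rules follow directly from the formula: b controls how strongly a single term occurrence is discounted in a document longer than average. A quick numeric sketch (with hypothetical lengths) shows the effect:

```python
def length_penalty(doc_len: int, avgdl: float, k1: float = 1.5,
                   b: float = 0.75, tf: int = 1) -> float:
    """BM25 TF component for one term occurrence, isolating the length term."""
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avgdl))

# A document 4x longer than average is discounted more as b grows
for b in (0.25, 0.75, 1.0):
    print(f"b={b}: {length_penalty(doc_len=400, avgdl=100, b=b):.3f}")
```

With b near 0, length barely matters; with b = 1.0, a long document needs proportionally more term occurrences to match a short one, which is why short-document corpora like FAQs benefit from lowering b.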

Linguistic Analysis

Sparse retrieval quality depends heavily on tokenization and preprocessing:

```python
import spacy

nlp = spacy.load("en_core_web_md")

def preprocess_english(text: str) -> list[str]:
    """Optimized preprocessing for English"""
    doc = nlp(text.lower())
    tokens = []
    for token in doc:
        # Skip punctuation and stopwords
        if token.is_punct or token.is_stop:
            continue
        # Lemmatization: "returned" → "return"
        lemma = token.lemma_
        # Filter very short tokens
        if len(lemma) > 2:
            tokens.append(lemma)
    return tokens

# Example
text = "Returned products will be refunded within 15 days"
tokens = preprocess_english(text)
# ['return', 'product', 'refund', 'day']
```

Query Expansion

```python
def expand_query_synonyms(query: str, synonyms_dict: dict) -> str:
    """Enrich query with synonyms"""
    expanded_terms = []
    for term in query.split():
        expanded_terms.append(term)
        if term in synonyms_dict:
            expanded_terms.extend(synonyms_dict[term])
    return " ".join(expanded_terms)

synonyms = {
    "return": ["refund", "exchange", "send back"],
    "product": ["item", "order", "purchase"],
    "problem": ["issue", "error", "bug", "incident"]
}

query = "return product"
expanded = expand_query_synonyms(query, synonyms)
# "return refund exchange send back product item order purchase"
```

BM25F: For Structured Documents

BM25F extends BM25 to documents with multiple fields:

```python
# Elasticsearch configuration with field boosting
es.search(
    index="products",
    body={
        "query": {
            "multi_match": {
                "query": "samsung smartphone",
                "fields": [
                    "title^3",        # Title: weight ×3
                    "description^1",  # Description: weight ×1
                    "category^2"      # Category: weight ×2
                ],
                "type": "best_fields"
            }
        }
    }
)
```
```python
# Manual multi-field scoring. Note: true BM25F combines weighted term
# frequencies per field *before* saturation; this weighted sum of per-field
# BM25 scores (using the single-field bm25_score above) is a common approximation.
def bm25f_score(query, document_fields, field_weights):
    """
    BM25F-style multi-field scoring

    document_fields = {
        "title": "Samsung Galaxy S24",
        "description": "High-end smartphone with AMOLED display...",
        "category": "Phones"
    }
    field_weights = {"title": 3, "description": 1, "category": 2}
    """
    total_score = 0
    for field_name, content in document_fields.items():
        weight = field_weights.get(field_name, 1)
        field_score = bm25_score(query, content)
        total_score += weight * field_score
    return total_score
```

Limitations of Sparse Retrieval

1. Vocabulary Mismatch

Document: "Electric vehicle with lithium battery"
Query: "eco-friendly car"
→ BM25 score = 0 (no common terms)
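The failure is mechanical: every summand in the BM25 formula carries a factor f(qi, D), so zero term overlap yields a zero score no matter how related the meanings are. A two-line check makes this visible:

```python
doc = "Electric vehicle with lithium battery".lower().split()
query = "eco-friendly car".lower().split()

# Every BM25 summand is multiplied by f(qi, D); no shared terms → score is 0
shared = set(query) & set(doc)
score_is_zero = len(shared) == 0
print(shared)  # set()
```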

2. Typo Sensitivity

Query: "refnd" (typo)
→ Does not find "refund"

Solution: Fuzzy matching

```python
es.search(
    index="knowledge_base",
    body={
        "query": {
            "match": {
                "content": {
                    "query": "refnd",
                    "fuzziness": "AUTO"  # Tolerates 1-2 character edits
                }
            }
        }
    }
)
```

3. No Context Understanding

Query: "Apple"
→ Finds both the brand and apple pie recipes

Sparse cannot disambiguate without additional context.

Integration in a RAG Pipeline

```python
from rank_bm25 import BM25Okapi

class SparseRetriever:
    def __init__(self, documents: list[str]):
        self.documents = documents
        self.tokenized_docs = [self._preprocess(d) for d in documents]
        self.bm25 = BM25Okapi(self.tokenized_docs)

    def _preprocess(self, text: str) -> list[str]:
        # Tokenization and normalization (see preprocess_english above)
        return preprocess_english(text)

    def search(self, query: str, top_k: int = 5) -> list[dict]:
        tokenized_query = self._preprocess(query)
        scores = self.bm25.get_scores(tokenized_query)

        # Top-k document indices
        top_indices = sorted(
            range(len(scores)),
            key=lambda i: scores[i],
            reverse=True
        )[:top_k]

        return [
            {
                "content": self.documents[i],
                "score": scores[i],
                "method": "bm25"
            }
            for i in top_indices
        ]
```

Next Steps

Sparse retrieval excels at exact matches but lacks semantic understanding. The solution? Combine both approaches.
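One common way to combine the two (not necessarily what any particular product uses) is Reciprocal Rank Fusion, which merges ranked lists without requiring the BM25 and embedding scores to be on comparable scales. A minimal sketch with hypothetical document ids:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc ids; each list contributes 1/(k + rank) per doc."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse_top = ["d3", "d1", "d7"]   # hypothetical BM25 ranking
dense_top = ["d1", "d5", "d3"]    # hypothetical embedding ranking
fused = reciprocal_rank_fusion([sparse_top, dense_top])
print(fused)
# ['d1', 'd3', 'd5', 'd7']
```

Documents ranked well by both retrievers ("d1", "d3") rise to the top, while documents found by only one method are still retained further down the list.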


Optimized Sparse Retrieval with Ailog

Ailog automatically combines sparse and dense retrieval for optimal results:

  • Native language BM25 with lemmatization and linguistic analysis
  • Automatic hybrid fusion adapted to your content
  • Parameter tuning based on user feedback
  • Zero configuration - everything works from import

Try for free and get the best of both worlds.

Tags

rag, retrieval, bm25, sparse retrieval, lexical search
