Sparse Retrieval and BM25: When Lexical Search Wins
Discover sparse retrieval and BM25 for precise lexical search. Use cases, implementation, and comparison with dense retrieval explained.
Sparse retrieval, embodied by the BM25 algorithm, remains a dominant force in information retrieval. Despite the hype around embeddings and dense retrieval, lexical search often outperforms semantic approaches in specific cases. This guide explores the mechanisms of sparse retrieval, its strengths, and when to prioritize it.
What is Sparse Retrieval?
Sparse retrieval represents documents as "sparse" vectors where each dimension corresponds to a vocabulary term. Most values are zero because a document contains only a fraction of the total vocabulary.
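As a toy sketch (the five-word vocabulary and the helper name are hypothetical, just for illustration), a document can be stored as only its non-zero dimensions:

```python
from collections import Counter

# Toy vocabulary shared by the whole corpus (hypothetical example)
vocabulary = ["return", "refund", "product", "delivery", "policy"]

def to_sparse_vector(text: str) -> dict[int, int]:
    """Map a document to {dimension_index: term_frequency}.
    Absent terms stay as implicit zeros, which is what makes the vector sparse."""
    counts = Counter(text.lower().split())
    return {i: counts[term] for i, term in enumerate(vocabulary) if counts[term] > 0}

vec = to_sparse_vector("refund policy for product return")
# Only 4 of the 5 dimensions are non-zero; a real vocabulary has tens of thousands
```

A real index stores these index/value pairs in an inverted index rather than as Python dicts, but the principle is the same.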
From TF-IDF to BM25
The evolution of sparse retrieval algorithms:
TF-IDF (1972) → BM25 (1994) → BM25+ (2011)
- TF-IDF: Score = TF × IDF
- BM25: adds TF saturation and document length normalization
- BM25+: corrects the length-normalization bias against short documents
How BM25 Works
BM25 (Best Matching 25) calculates a relevance score based on the frequency of query terms in documents, with intelligent adjustments.
The BM25 Formula
Score(D, Q) = Σ IDF(qi) × (f(qi, D) × (k1 + 1)) / (f(qi, D) + k1 × (1 - b + b × |D|/avgdl))
Where:
- f(qi, D): frequency of term qi in document D
- |D|: document length
- avgdl: average document length
- k1: saturation parameter (typically 1.2-2.0)
- b: length normalization parameter (typically 0.75)
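The IDF component itself is not spelled out in the formula above; a common variant (the one used by Lucene and Elasticsearch) can be sketched as:

```python
import math

def idf(num_docs: int, doc_freq: int) -> float:
    """BM25 IDF as used by Lucene: rare terms get a high weight,
    ubiquitous terms a weight near zero."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

# With a 1000-document corpus:
idf(1000, 5)    # a rare term -> high weight (~5.2)
idf(1000, 950)  # a near-ubiquitous term like "the" -> near zero (~0.05)
```

The `+1` inside the log keeps the value non-negative even for terms present in more than half the corpus, which the original Robertson-Spärck Jones formulation did not guarantee.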
Intuition Behind BM25
```python
def bm25_score_explained(query_terms, document, corpus_stats, k1=1.2, b=0.75):
    """Intuitive explanation of BM25 scoring"""
    score = 0
    doc_length = len(document)
    avg_length = corpus_stats['average_length']

    for term in query_terms:
        # 1. IDF: rare terms = more important
        # "algorithm" in 5% of docs > "the" in 95%
        idf = compute_idf(term, corpus_stats)

        # 2. TF with saturation: prevents over-weighting repetitions
        # "machine learning machine learning machine" ≠ 5× better
        tf = document.count(term)

        # 3. Length normalization: compensates for long documents
        # A 10-page document with 5 mentions ≠ a 2-line FAQ with 5 mentions
        saturated_tf = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_length / avg_length))

        score += idf * saturated_tf

    return score
```
Practical Implementation
With rank_bm25
```python
from rank_bm25 import BM25Okapi
import nltk

# Prepare documents
documents = [
    "How to return a defective product",
    "Refund policy within 30 days",
    "Delivery times to continental US",
    "Return fees and conditions",
    "Customer service available 24/7"
]

# Tokenization
tokenized_docs = [nltk.word_tokenize(doc.lower()) for doc in documents]

# Create BM25 index
bm25 = BM25Okapi(tokenized_docs)

# Search
query = "return product"
tokenized_query = nltk.word_tokenize(query.lower())
scores = bm25.get_scores(tokenized_query)

# Sorted results
results = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
for doc, score in results[:3]:
    print(f"Score: {score:.3f} - {doc}")
```
With Elasticsearch
```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Create index with a custom BM25 similarity
es.indices.create(
    index="knowledge_base",
    body={
        "settings": {
            "similarity": {
                "custom_bm25": {
                    "type": "BM25",
                    "k1": 1.2,
                    "b": 0.75
                }
            }
        },
        "mappings": {
            "properties": {
                "content": {
                    "type": "text",
                    "similarity": "custom_bm25",
                    "analyzer": "english"
                },
                "category": {"type": "keyword"}
            }
        }
    }
)

# Index documents
for i, doc in enumerate(documents):
    es.index(index="knowledge_base", id=i, body={"content": doc})

# Search
results = es.search(
    index="knowledge_base",
    body={
        "query": {
            "match": {
                "content": {
                    "query": "return product",
                    "operator": "or"
                }
            }
        }
    }
)
```
With Qdrant (sparse vectors)
```python
from collections import Counter

from qdrant_client import QdrantClient
from qdrant_client.models import (
    SparseVectorParams, PointStruct, SparseVector
)

client = QdrantClient("localhost", port=6333)

# Collection with sparse vectors
client.create_collection(
    collection_name="bm25_docs",
    sparse_vectors_config={
        "text": SparseVectorParams()
    }
)

def compute_sparse_vector(text: str, idf_dict: dict) -> SparseVector:
    """Convert text to a BM25-like sparse vector.
    idf_dict maps each vocabulary term to its IDF weight,
    precomputed over the corpus."""
    tokens = text.lower().split()
    tf = Counter(tokens)

    indices = []
    values = []
    for token, freq in tf.items():
        if token in idf_dict:
            token_id = hash(token) % 1000000  # Simple hash
            # Simplified TF-IDF score
            score = freq * idf_dict.get(token, 1.0)
            indices.append(token_id)
            values.append(score)

    return SparseVector(indices=indices, values=values)

# Index
for i, doc in enumerate(documents):
    sparse_vec = compute_sparse_vector(doc, idf_dict)
    client.upsert(
        collection_name="bm25_docs",
        points=[PointStruct(
            id=i,
            payload={"content": doc},
            vector={"text": sparse_vec}
        )]
    )
```
When Sparse Retrieval Beats Dense
1. Exact Match Required
Query: "Error 503"
Dense retrieval: finds "Server problem", "Site unavailable"
Sparse retrieval: finds exactly "Error 503 - Service unavailable"
For error codes, serial numbers, and product references, sparse retrieval is unbeatable.
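The lexical behavior behind this can be shown with a minimal sketch (not actual BM25 scoring, just the verbatim-token filter that a lexical index makes possible; the helper name is hypothetical):

```python
def exact_term_hits(query: str, docs: list[str]) -> list[str]:
    """Keep only documents containing every query token verbatim (case-insensitive)."""
    q_tokens = query.lower().split()
    return [d for d in docs if all(t in d.lower().split() for t in q_tokens)]

docs = [
    "Error 503 - Service unavailable",
    "Server problem detected",
    "Site unavailable during maintenance",
]
exact_term_hits("Error 503", docs)  # -> ['Error 503 - Service unavailable']
```

A dense retriever would rank all three documents as broadly similar; the lexical match pins the one document that actually contains the token "503".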
2. Technical or Rare Terms
Query: "Riemannian covariance tensor"
Dense retrieval: confused, finds general tensor articles
Sparse retrieval: exact match on documents containing these precise terms
Embedding models rarely saw these specialized terms during training.
3. Combinatorial Searches
Query: "Python asyncio websocket"
Dense retrieval: understands overall meaning but may miss exact combinations
Sparse retrieval: finds documents containing all 3 terms
4. Proper Names and Entities
Query: "John Smith invoice 2024"
Dense retrieval: loses proper name in embedding
Sparse retrieval: exact match on name
Comparison Table
| Use Case | Dense | Sparse | Winner |
|---|---|---|---|
| Semantic rephrasing | Excellent | Poor | Dense |
| Exact match | Poor | Excellent | Sparse |
| Technical terms | Medium | Excellent | Sparse |
| Synonyms | Excellent | Poor | Dense |
| Proper names | Medium | Excellent | Sparse |
| Long queries | Excellent | Medium | Dense |
| 1-2 word queries | Medium | Excellent | Sparse |
Optimizing BM25
Tuning k1 and b Parameters
```python
def grid_search_bm25_params(queries, relevant_docs, corpus):
    """Find optimal k1 and b parameters"""
    best_score = 0
    best_params = {}

    for k1 in [0.5, 1.0, 1.2, 1.5, 2.0]:
        for b in [0.25, 0.5, 0.75, 1.0]:
            bm25 = BM25Okapi(corpus, k1=k1, b=b)

            # Evaluate recall@10 over the labeled queries
            total_recall = 0
            for query, relevant in zip(queries, relevant_docs):
                scores = bm25.get_scores(query)
                top_k = sorted(range(len(scores)),
                               key=lambda i: scores[i],
                               reverse=True)[:10]
                hits = len(set(top_k) & set(relevant))
                total_recall += hits / len(relevant)

            avg_recall = total_recall / len(queries)
            if avg_recall > best_score:
                best_score = avg_recall
                best_params = {"k1": k1, "b": b}

    return best_params
```
General rules:
- Long documents → higher b (0.75-1.0)
- Short documents (FAQ) → lower b (0.3-0.5)
- Repetitive queries → lower k1 (0.5-1.0)
Linguistic Analysis
Sparse retrieval quality depends heavily on tokenization and preprocessing:
```python
import spacy

nlp = spacy.load("en_core_web_md")

def preprocess_english(text: str) -> list[str]:
    """Optimized preprocessing for English"""
    doc = nlp(text.lower())
    tokens = []
    for token in doc:
        # Skip punctuation and stopwords
        if token.is_punct or token.is_stop:
            continue
        # Lemmatization: "returned" → "return"
        lemma = token.lemma_
        # Filter very short tokens
        if len(lemma) > 2:
            tokens.append(lemma)
    return tokens

# Example
text = "Returned products will be refunded within 15 days"
tokens = preprocess_english(text)
# ['return', 'product', 'refund', 'day']
```
Query Expansion
```python
def expand_query_synonyms(query: str, synonyms_dict: dict) -> str:
    """Enrich query with synonyms"""
    expanded_terms = []
    for term in query.split():
        expanded_terms.append(term)
        if term in synonyms_dict:
            expanded_terms.extend(synonyms_dict[term])
    return " ".join(expanded_terms)

synonyms = {
    "return": ["refund", "exchange", "send back"],
    "product": ["item", "order", "purchase"],
    "problem": ["issue", "error", "bug", "incident"]
}

query = "return product"
expanded = expand_query_synonyms(query, synonyms)
# "return refund exchange send back product item order purchase"
```
BM25F: For Structured Documents
BM25F extends BM25 to documents with multiple fields:
```python
# Elasticsearch configuration with field boosting
es.search(
    index="products",
    body={
        "query": {
            "multi_match": {
                "query": "samsung smartphone",
                "fields": [
                    "title^3",        # Title: weight x3
                    "description^1",  # Description: weight x1
                    "category^2"      # Category: weight x2
                ],
                "type": "best_fields"
            }
        }
    }
)
```
```python
# Manual BM25F-style scoring (bm25_score is the single-field scorer from earlier).
# Note: this is a simplified weighted sum of per-field scores; true BM25F pools
# the weighted term frequencies across fields before applying saturation.
def bm25f_score(query, document_fields, field_weights):
    """
    BM25F: multi-field scoring

    document_fields = {
        "title": "Samsung Galaxy S24",
        "description": "High-end smartphone with AMOLED display...",
        "category": "Phones"
    }
    field_weights = {"title": 3, "description": 1, "category": 2}
    """
    total_score = 0
    for field_name, content in document_fields.items():
        weight = field_weights.get(field_name, 1)
        field_score = bm25_score(query, content)
        total_score += weight * field_score
    return total_score
```
Limitations of Sparse Retrieval
1. Vocabulary Mismatch
Document: "Electric vehicle with lithium battery"
Query: "eco-friendly car"
→ BM25 score = 0 (no common terms)
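The collapse to zero follows directly from the formula: the score is a sum over query terms, and every term's f(qi, D) is 0. A minimal sketch (IDF and length normalization dropped for brevity) makes this visible:

```python
def bm25_like_score(query: str, document: str, k1: float = 1.2) -> float:
    """Simplified BM25 term saturation only, to show vocabulary mismatch.
    Each term contributes (tf * (k1 + 1)) / (tf + k1), which is 0 when tf == 0."""
    doc_tokens = document.lower().split()
    score = 0.0
    for term in query.lower().split():
        tf = doc_tokens.count(term)
        score += (tf * (k1 + 1)) / (tf + k1)
    return score

bm25_like_score("eco-friendly car", "Electric vehicle with lithium battery")  # -> 0.0
bm25_like_score("electric vehicle", "Electric vehicle with lithium battery")  # -> 2.0
```

No amount of parameter tuning fixes this: without a shared token, every term contribution is exactly zero.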
2. Typo Sensitivity
Query: "refnd" (typo)
→ Does not find "refund"
Solution: Fuzzy matching
```python
es.search(
    index="knowledge_base",
    body={
        "query": {
            "match": {
                "content": {
                    "query": "refnd",
                    "fuzziness": "AUTO"  # Tolerates 1-2 edit errors
                }
            }
        }
    }
)
```
3. No Context Understanding
Query: "Apple"
→ Finds both the brand and apple pie recipes
Sparse cannot disambiguate without additional context.
Integration in a RAG Pipeline
```python
class SparseRetriever:
    def __init__(self, documents: list[str]):
        self.documents = documents
        self.tokenized_docs = [self._preprocess(d) for d in documents]
        self.bm25 = BM25Okapi(self.tokenized_docs)

    def _preprocess(self, text: str) -> list[str]:
        # Tokenization and normalization
        return preprocess_english(text)

    def search(self, query: str, top_k: int = 5) -> list[dict]:
        tokenized_query = self._preprocess(query)
        scores = self.bm25.get_scores(tokenized_query)

        # Top k document indices
        top_indices = sorted(
            range(len(scores)),
            key=lambda i: scores[i],
            reverse=True
        )[:top_k]

        return [
            {
                "content": self.documents[i],
                "score": scores[i],
                "method": "bm25"
            }
            for i in top_indices
        ]
```
Next Steps
Sparse retrieval excels at exact matches but lacks semantic understanding. The solution? Combine both approaches.
- Hybrid Fusion - Combining dense and sparse
- Dense Retrieval - Semantic search with embeddings
- Retrieval Fundamentals - Overview
Optimized Sparse Retrieval with Ailog
Ailog automatically combines sparse and dense retrieval for optimal results:
- Native language BM25 with lemmatization and linguistic analysis
- Automatic hybrid fusion adapted to your content
- Parameter tuning based on user feedback
- Zero configuration - everything works from import
Try for free and get the best of both worlds.