6. Reranking (Advanced)

Reranking: Improving Retrieval Precision

February 10, 2025
11 min read
Ailog Research Team

Cross-encoders, LLM-based reranking, and reranking strategies to optimize retrieved context for better RAG responses.

TL;DR

  • Reranking = Second-pass scoring of retrieved docs for better precision
  • Cross-encoders deliver 10-25% accuracy improvement over pure retrieval
  • Cohere Rerank API: Easiest option ($1/1000 queries)
  • Self-hosted: ms-marco cross-encoders (free, good quality)
  • Compare rerankers on your data with Ailog

The Reranking Problem

Initial retrieval (vector search, BM25) casts a wide net to recall potentially relevant documents. However:

  • False positives: Some retrieved chunks aren't actually relevant
  • Ranking quality: Most relevant chunks may not be ranked first
  • Query-specific relevance: Initial ranking doesn't account for query nuances

Solution: Rerank retrieved candidates with a more sophisticated model.

Two-Stage Retrieval

Query → [Stage 1: Retrieval] → 100 candidates
       → [Stage 2: Reranking] → 10 best results
       → [Stage 3: Generation] → Answer

Why two stages?

  • Retrieval: Fast, scales to millions/billions of documents
  • Reranking: Expensive but accurate, only on small candidate set
  • Best of both: Speed + Quality

Reranking Approaches

Cross-Encoder Models

Unlike bi-encoders, which embed the query and document separately, cross-encoders process the query and document together in a single forward pass.

Bi-encoder (Retrieval)

query_emb = embed(query)            # [768]
doc_emb = embed(document)           # [768]
score = cosine(query_emb, doc_emb)  # similarity score

Cross-encoder (Reranking)

# Process query and document together
model_input = f"[CLS] {query} [SEP] {document} [SEP]"
score = model(model_input)  # direct relevance score

Why cross-encoders are better:

  • Attention between query and document tokens
  • Captures word-level interactions
  • More accurate relevance scoring

Why not use for retrieval:

  • Must score each query-document pair (O(n))
  • Too slow for large collections
  • No pre-computed embeddings
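To make the cost difference concrete, here is a minimal sketch contrasting the two stages (standard sentence-transformers models; the tiny corpus is only illustrative). The bi-encoder embeds documents once offline and answers a query with a single encode plus cheap dot products, while the cross-encoder needs one forward pass per query-document pair at query time.

import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

corpus = [
    "Reranking reorders retrieved chunks.",
    "Chunking splits documents into pieces.",
    "Embeddings map text to vectors.",
]
query = "How does reranking improve RAG precision?"

# Bi-encoder: document embeddings are pre-computed once, offline
bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
doc_embs = bi_encoder.encode(corpus, normalize_embeddings=True)   # offline, reusable
query_emb = bi_encoder.encode(query, normalize_embeddings=True)   # one pass per query
retrieval_scores = doc_embs @ query_emb                           # cheap dot products

# Cross-encoder: one forward pass per (query, document) pair, at query time
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L6-v2')
rerank_scores = cross_encoder.predict([(query, doc) for doc in corpus])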

Popular Cross-Encoder Models

ms-marco-MiniLM-L6-v2

import numpy as np
from sentence_transformers import CrossEncoder

model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L6-v2')

# Score query-document pairs
scores = model.predict([
    (query, doc1),
    (query, doc2),
    (query, doc3),
])

# Rerank by score (highest first)
ranked_indices = np.argsort(scores)[::-1]

Characteristics:

  • Size: 80MB
  • Speed: ~50ms per batch
  • Quality: Good for English
  • Training: Trained on MS MARCO

ms-marco-TinyBERT-L2-v2

  • Even smaller/faster
  • Slight quality tradeoff
  • Good for latency-critical apps

mmarco-mMiniLMv2-L12-H384-v1

  • Multilingual support
  • Similar performance to English models
  • Supports 100+ languages
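Both of these alternatives load through the same CrossEncoder interface as MiniLM above, so swapping models is a one-line change. A minimal sketch using the model names listed above (the French query/passage pair is only an illustration):

from sentence_transformers import CrossEncoder

# Same API, different model name depending on latency and language needs
fast_model = CrossEncoder('cross-encoder/ms-marco-TinyBERT-L2-v2')                # latency-critical
multilingual_model = CrossEncoder('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')   # 100+ languages

pairs = [("Qu'est-ce que le reranking ?", "Le reranking réordonne les documents récupérés.")]
scores = multilingual_model.predict(pairs)  # relevance score per pair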

Implementation

import numpy as np
from sentence_transformers import CrossEncoder

class RerankedRetriever:
    def __init__(self, base_retriever, reranker_model):
        self.retriever = base_retriever
        self.reranker = CrossEncoder(reranker_model)

    def retrieve(self, query, k=5, rerank_top_n=20):
        # Stage 1: retrieve more candidates than needed
        candidates = self.retriever.retrieve(query, k=rerank_top_n)

        # Stage 2: rerank with the cross-encoder
        pairs = [(query, doc['content']) for doc in candidates]
        scores = self.reranker.predict(pairs)

        # Sort by reranker scores (highest first)
        ranked_indices = np.argsort(scores)[::-1]
        reranked_docs = [candidates[i] for i in ranked_indices]

        # Return top-k
        return reranked_docs[:k]

LLM-Based Reranking

Use an LLM to judge relevance.

Binary Relevance

Ask LLM if document is relevant.

def llm_rerank_binary(query, documents, llm):
    relevant_docs = []
    for doc in documents:
        prompt = f"""Is this document relevant to the query?

Query: {query}
Document: {doc}

Answer only 'yes' or 'no'."""
        response = llm.generate(prompt, max_tokens=5)
        if 'yes' in response.lower():
            relevant_docs.append(doc)
    return relevant_docs

Scoring Relevance

Get numerical relevance scores.

def llm_rerank_score(query, documents, llm):
    scored_docs = []
    for doc in documents:
        prompt = f"""Rate the relevance of this document to the query on a scale of 1-10.

Query: {query}
Document: {doc}

Relevance score (1-10):"""
        score = int(llm.generate(prompt, max_tokens=5))
        scored_docs.append((doc, score))

    # Sort by score, highest first
    scored_docs.sort(key=lambda x: x[1], reverse=True)
    return [doc for doc, score in scored_docs]

Comparative Ranking

Compare documents pairwise or in batches.

def llm_rerank_comparative(query, documents, llm):
    prompt = f"""Rank these documents by relevance to the query.

Query: {query}

Documents:
{format_documents(documents)}

Provide ranking (most to least relevant):"""
    ranking = llm.generate(prompt)
    ranked_docs = parse_ranking(ranking, documents)
    return ranked_docs

Pros:

  • Very accurate
  • Can handle nuanced relevance
  • Explains reasoning

Cons:

  • Expensive (LLM call per document or batch)
  • Slow (hundreds of ms to seconds)
  • May exceed context window with many docs

Use when:

  • Highest quality required
  • Cost/latency acceptable
  • Small candidate set (< 10 docs)

Cohere Rerank API

Managed reranking service.

import cohere

co = cohere.Client(api_key="your-key")

def cohere_rerank(query, documents, top_n=5):
    response = co.rerank(
        query=query,
        documents=documents,
        top_n=top_n,
        model="rerank-english-v2.0"
    )
    return [doc.document for doc in response.results]

Models:

  • rerank-english-v2.0: English
  • rerank-multilingual-v2.0: 100+ languages

Pricing:

  • $1.00 per 1000 searches (as of 2025)

Pros:

  • Managed service
  • High quality
  • Multilingual

Cons:

  • API latency
  • Ongoing cost
  • Vendor dependency

FlashRank

Efficient local reranking.

from flashrank import Ranker, RerankRequest

ranker = Ranker(model_name="ms-marco-MultiBERT-L-12")

def flashrank_rerank(query, documents, top_n=5):
    rerank_request = RerankRequest(
        query=query,
        passages=[{"text": doc} for doc in documents]
    )
    results = ranker.rerank(rerank_request)
    return [r.text for r in results[:top_n]]

Benefits:

  • Very fast (optimized inference)
  • Self-hosted
  • No API costs

Hybrid Reranking

Combine multiple signals.

import numpy as np

def hybrid_rerank(query, documents, weights=None):
    if weights is None:
        weights = {
            'vector_score': 0.3,
            'bm25_score': 0.2,
            'cross_encoder': 0.5
        }

    # Get scores from different models
    vector_scores = get_vector_scores(query, documents)
    bm25_scores = get_bm25_scores(query, documents)
    ce_scores = get_cross_encoder_scores(query, documents)

    # Normalize scores to [0, 1]
    vector_scores = normalize(vector_scores)
    bm25_scores = normalize(bm25_scores)
    ce_scores = normalize(ce_scores)

    # Weighted combination
    final_scores = (
        weights['vector_score'] * vector_scores
        + weights['bm25_score'] * bm25_scores
        + weights['cross_encoder'] * ce_scores
    )

    # Rank documents by combined score
    ranked_indices = np.argsort(final_scores)[::-1]
    return [documents[i] for i in ranked_indices]

Reranking Strategies

Top-K Reranking

Rerank only top candidates from initial retrieval.

# Retrieve top 20, rerank to get top 5
candidates = retriever.retrieve(query, k=20)
reranked = reranker.rerank(query, candidates, top_n=5)

Settings:

  • Retrieve: 3-5x the final k
  • Rerank: Final k needed

Example:

  • Need 5 final results
  • Retrieve 20 candidates
  • Rerank to top 5

Cascading Reranking

Multiple reranking stages with increasing accuracy.

def cascading_rerank(query):
    # Stage 1: fast retrieval over the full collection
    candidates = fast_retriever.retrieve(query, k=100)

    # Stage 2: fast reranker narrows the field
    reranked_1 = tiny_reranker.rerank(query, candidates, top_n=20)

    # Stage 3: accurate reranker picks the final results
    reranked_2 = large_reranker.rerank(query, reranked_1, top_n=5)
    return reranked_2

Use when:

  • Very large candidate sets
  • Multiple quality tiers needed
  • Optimizing cost/latency

Query-Adaptive Reranking

Different reranking based on query type.

def adaptive_rerank(query, documents):
    query_type = classify_query(query)

    if query_type == "factual":
        # Use keyword signals
        return bm25_rerank(query, documents)
    elif query_type == "semantic":
        # Use cross-encoder
        return cross_encoder_rerank(query, documents)
    elif query_type == "complex":
        # Use LLM
        return llm_rerank(query, documents)

Performance Optimization

Batching

Rerank multiple queries efficiently.

# Bad: one query at a time
for query in queries:
    rerank(query, docs)

# Good: batched scoring of all pairs
pairs = [(q, doc) for q in queries for doc in docs]
scores = reranker.predict(pairs, batch_size=32)

Caching

Cache reranking results.

from functools import lru_cache
import hashlib

# Stable string key, useful if you store scores in an external cache (e.g. Redis)
def cache_key(query, doc):
    return hashlib.md5(f"{query}:{doc}".encode()).hexdigest()

# In-process cache for repeated (query, doc) pairs
@lru_cache(maxsize=10000)
def cached_rerank_score(query, doc):
    return reranker.predict([(query, doc)])[0]

Async Reranking

Parallelize reranking calls.

import asyncio

async def async_rerank_batch(query, documents):
    # Score each document concurrently
    tasks = [rerank_async(query, doc) for doc in documents]
    scores = await asyncio.gather(*tasks)
    return rank_by_scores(documents, scores)

Evaluation

Metrics

Precision@k: Fraction of the top-k reranked documents that are relevant

def precision_at_k(reranked_docs, relevant_docs, k):
    top_k = set(reranked_docs[:k])
    relevant = set(relevant_docs)
    return len(top_k & relevant) / k

NDCG@k: Normalized Discounted Cumulative Gain

from sklearn.metrics import ndcg_score

def evaluate_reranking(predictions, relevance_labels, k=5):
    return ndcg_score([relevance_labels], [predictions], k=k)

MRR: Mean Reciprocal Rank

def mrr(reranked_docs, relevant_docs):
    for i, doc in enumerate(reranked_docs, 1):
        if doc in relevant_docs:
            return 1 / i
    return 0

A/B Testing

Compare reranking strategies.

# Control: no reranking
control_results = retriever.retrieve(query, k=5)

# Treatment: retrieve more, then rerank
treatment_candidates = retriever.retrieve(query, k=20)
treatment_results = reranker.rerank(query, treatment_candidates, k=5)

# Measure: user satisfaction, answer quality

Cost-Benefit Analysis

Reranker        Latency     Cost / 1K queries    Quality      Best for
No reranking    0 ms        $0                   Baseline     Budget/speed critical
TinyBERT        +30 ms      $0 (self-hosted)     +10%         Balanced
MiniLM          +50 ms      $0 (self-hosted)     +20%         Quality-focused
Cohere          +100 ms     $1                   +25%         Managed simplicity
LLM             +500 ms     $5-20                +30%         Highest quality
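To read the table against your own traffic, here is a back-of-the-envelope sketch. The figures are the table's illustrative numbers, not benchmarks, and the query volume is a made-up example to adjust for your workload.

# Rough monthly cost and added latency per reranker, using the table's figures
MONTHLY_QUERIES = 500_000  # example workload

rerankers = {
    # name: (added latency in ms, cost in $ per 1K queries)
    'No reranking': (0, 0.00),
    'TinyBERT':     (30, 0.00),
    'MiniLM':       (50, 0.00),
    'Cohere':       (100, 1.00),
    'LLM':          (500, 12.50),  # midpoint of the $5-20 range
}

for name, (latency_ms, cost_per_1k) in rerankers.items():
    monthly_cost = MONTHLY_QUERIES / 1000 * cost_per_1k
    print(f'{name:<13} +{latency_ms:>3} ms/query   ~${monthly_cost:,.0f}/month')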

Best Practices

  1. Always overfetch for reranking: Retrieve 3-5x the final k
  2. Start with cross-encoder: MiniLM is a good default
  3. Measure impact: A/B test reranking vs. no reranking
  4. Tune retrieval count: Balance cost and recall (see the sweep sketch after this list)
  5. Consider query latency budget: Reranking adds 50-500ms
  6. Monitor costs: LLM reranking can be expensive at scale
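Best practices 1 and 4 can be tuned empirically rather than guessed: sweep the candidate count and measure precision on a small labeled set. A minimal sketch, reusing precision_at_k from the Evaluation section and assuming a hypothetical eval_set of (query, relevant_docs) pairs plus retriever/reranker objects with the interfaces used earlier:

def sweep_rerank_top_n(retriever, reranker, eval_set, k=5, candidate_counts=(10, 20, 50, 100)):
    # Returns {retrieval count: mean precision@k} so you can pick the smallest
    # candidate set that still reaches your target quality
    results = {}
    for top_n in candidate_counts:
        precisions = []
        for query, relevant_docs in eval_set:
            candidates = retriever.retrieve(query, k=top_n)
            reranked = reranker.rerank(query, candidates, top_n=k)
            precisions.append(precision_at_k(reranked, relevant_docs, k))
        results[top_n] = sum(precisions) / len(precisions)
    return results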

Choosing a Reranker

Prototyping:

  • cross-encoder/ms-marco-MiniLM-L6-v2
  • Easy to use, good quality

Production (Cost-Sensitive):

  • cross-encoder/ms-marco-TinyBERT-L2-v2
  • Self-hosted, fast

Production (Quality-Focused):

  • Cohere Rerank API
  • Highest quality, managed

Multilingual:

  • mmarco-mMiniLMv2-L12-H384-v1
  • cross-encoder/mmarco-mMiniLMv2-L12

Highest Quality (Budget Available):

  • LLM-based reranking
  • GPT-4, Claude for best results

💡 Expert Tip from Ailog: Reranking is high impact but not first priority. Get your chunking, embeddings, and retrieval right first – they're foundational. Once you have a working RAG system, reranking is the easiest way to gain another 10-25% accuracy. Start with Cohere Rerank API for zero-setup wins. We added reranking to production in one afternoon and immediately saw fewer hallucinations and better answer quality.

Test Reranking on Ailog

Compare reranking models with zero setup:

Ailog platform includes:

  • Cohere Rerank, cross-encoders, LLM reranking
  • Side-by-side quality comparison
  • Latency and cost analysis
  • A/B testing with real queries

Test reranking free →

Next Steps

With retrieval and reranking optimized, it's critical to measure performance. The next guide covers evaluation metrics and methodologies for assessing RAG system quality.

Tags

reranking, cross-encoder, retrieval, precision

Related Guides