Reranking for RAG: +40% Accuracy with Cross-Encoders (2025 Guide)
Boost RAG accuracy by 40% using reranking. Complete guide to cross-encoders, Cohere Rerank API, and ColBERT for production retrieval systems.
- Author: Ailog Research Team
- Published
- Reading time: 11 min read
- Level: advanced
- RAG Pipeline Step: Reranking
TL;DR
• Reranking = second-pass scoring of retrieved docs for better precision
• Cross-encoders deliver 10-25% accuracy improvement over pure retrieval
• Cohere Rerank API: easiest option ($1/1000 queries)
• Self-hosted: ms-marco cross-encoders (free, good quality)
• Compare rerankers on your data with Ailog
The Reranking Problem
Initial retrieval (vector search, BM25) casts a wide net to recall potentially relevant documents. However:
• False positives: some retrieved chunks aren't actually relevant
• Ranking quality: the most relevant chunks may not be ranked first
• Query-specific relevance: the initial ranking doesn't account for query nuances
Solution: Rerank retrieved candidates with a more sophisticated model.
Two-Stage Retrieval
```
Query → [Stage 1: Retrieval] → 100 candidates → [Stage 2: Reranking] → 10 best results → [Stage 3: Generation] → Answer
```
Why two stages?
• Retrieval: fast, scales to millions or billions of documents
• Reranking: expensive but accurate, applied only to a small candidate set
• Best of both: speed + quality
Reranking Approaches
Cross-Encoder Models
Unlike bi-encoders (embed query and document separately), cross-encoders process query and document together.
Bi-encoder (Retrieval)
```python
query_emb = embed(query)        # [768]
doc_emb = embed(document)       # [768]
score = cosine(query_emb, doc_emb)  # similarity
```
Cross-encoder (Reranking)
```python
# Process query and document together
input = f"[CLS] {query} [SEP] {document} [SEP]"
score = model(input)  # direct relevance score
```
Why cross-encoders are better:
• Attention between query and document tokens
• Captures word-level interactions
• More accurate relevance scoring

Why not use them for retrieval:
• Must score every query-document pair (O(n) per query)
• Too slow for large collections
• No pre-computed embeddings
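A rough back-of-envelope calculation makes the point; the throughput figure below is an illustrative assumption, not a benchmark:

```python
# Illustrative cost of scoring a whole corpus with a cross-encoder per query.
n_docs = 1_000_000
pairs_per_second = 1_000  # assumed cross-encoder throughput on a single GPU

seconds_per_query = n_docs / pairs_per_second
print(f"~{seconds_per_query / 60:.0f} minutes per query")

# Pre-computed embeddings plus an ANN index answer the same query in
# milliseconds, which is why retrieval and reranking are split into stages.
```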
Popular Cross-Encoder Models
ms-marco-MiniLM-L6-v2
```python
import numpy as np
from sentence_transformers import CrossEncoder

model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L6-v2')

# Score query-document pairs
scores = model.predict([
    (query, doc1),
    (query, doc2),
    (query, doc3),
])

# Rerank by score (highest first)
ranked_indices = np.argsort(scores)[::-1]
```
Characteristics:
• Size: 80MB
• Speed: ~50ms per batch
• Quality: good for English
• Training: trained on MS MARCO

ms-marco-TinyBERT-L2-v2
• Even smaller and faster
• Slight quality tradeoff
• Good for latency-critical apps

mmarco-mMiniLMv2-L12-H384-v1
• Multilingual support
• Similar performance to the English models
• Supports 100+ languages
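Swapping in the multilingual model is a one-line change; a minimal sketch, assuming the Hugging Face id `cross-encoder/mmarco-mMiniLMv2-L12-H384-v1` and purely illustrative query/document text:

```python
from sentence_transformers import CrossEncoder

# Multilingual cross-encoder: query and documents need not share a language.
model = CrossEncoder('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')

scores = model.predict([
    ("¿Cuál es la capital de Francia?", "Paris is the capital and largest city of France."),
    ("¿Cuál es la capital de Francia?", "The Eiffel Tower was completed in 1889."),
])
```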
Implementation
```python
import numpy as np
from sentence_transformers import CrossEncoder

class RerankedRetriever:
    def __init__(self, base_retriever, reranker_model):
        self.retriever = base_retriever
        self.reranker = CrossEncoder(reranker_model)

    def retrieve(self, query, k=5, rerank_top_n=20):
        # Stage 1: retrieve more candidates than needed
        candidates = self.retriever.retrieve(query, k=rerank_top_n)

        # Stage 2: rerank with the cross-encoder
        pairs = [(query, doc['content']) for doc in candidates]
        scores = self.reranker.predict(pairs)

        # Sort by reranker score (highest first)
        ranked_indices = np.argsort(scores)[::-1]
        reranked_docs = [candidates[i] for i in ranked_indices]

        # Return top-k
        return reranked_docs[:k]
```
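Usage might look like the following sketch; `base_retriever` stands in for whatever vector or hybrid retriever you already have and is assumed to return dicts with a `content` key.

```python
# Assumed: base_retriever.retrieve(query, k=...) returns [{'content': ...}, ...]
retriever = RerankedRetriever(base_retriever, reranker_model='cross-encoder/ms-marco-MiniLM-L6-v2')

top_docs = retriever.retrieve("How do I rotate an API key?", k=5, rerank_top_n=20)
```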
LLM-Based Reranking
Use an LLM to judge relevance.
Binary Relevance
Ask the LLM whether each document is relevant.
```python
def llm_rerank_binary(query, documents, llm):
    relevant_docs = []

    for doc in documents:
        prompt = f"""Is this document relevant to the query?

Query: {query}

Document: {doc}

Answer only 'yes' or 'no'."""

        response = llm.generate(prompt, max_tokens=5)

        if 'yes' in response.lower():
            relevant_docs.append(doc)

    return relevant_docs
```
Scoring Relevance
Get numerical relevance scores.
```python
def llm_rerank_score(query, documents, llm):
    scored_docs = []

    for doc in documents:
        prompt = f"""Rate the relevance of this document to the query on a scale of 1-10.

Query: {query}

Document: {doc}

Relevance score (1-10):"""

        score = int(llm.generate(prompt, max_tokens=5).strip())
        scored_docs.append((doc, score))

    # Sort by score (highest first)
    scored_docs.sort(key=lambda x: x[1], reverse=True)
    return [doc for doc, score in scored_docs]
```
Comparative Ranking
Compare documents pairwise or in batches.
```python
def llm_rerank_comparative(query, documents, llm):
    prompt = f"""Rank these documents by relevance to the query.

Query: {query}

Documents:
{format_documents(documents)}

Provide ranking (most to least relevant):"""

    ranking = llm.generate(prompt)
    ranked_docs = parse_ranking(ranking, documents)

    return ranked_docs
```
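`format_documents` and `parse_ranking` are left undefined above; one possible sketch, assuming the prompt numbers each document and the LLM answers with a list of those numbers:

```python
import re

def format_documents(documents):
    # Number each document so the LLM can refer to it by index.
    return "\n\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(documents))

def parse_ranking(ranking, documents):
    # Extract document numbers from the LLM response, in the order given,
    # and append anything the model missed in the original order.
    indices = [int(m) - 1 for m in re.findall(r"\d+", ranking) if 1 <= int(m) <= len(documents)]
    seen = list(dict.fromkeys(indices))
    remaining = [i for i in range(len(documents)) if i not in seen]
    return [documents[i] for i in seen + remaining]
```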
Pros:
• Very accurate
• Can handle nuanced relevance
• Can explain its reasoning

Cons:
• Expensive (one LLM call per document or batch)
• Slow (hundreds of milliseconds to seconds)
• May exceed the context window with many docs

Use when:
• Highest quality is required
• Cost/latency is acceptable
• Small candidate set (< 10 docs)
Cohere Rerank API
Managed reranking service.
```python
import cohere

co = cohere.Client(api_key="your-key")

def cohere_rerank(query, documents, top_n=5):
    response = co.rerank(
        query=query,
        documents=documents,
        top_n=top_n,
        model="rerank-english-v2.0",
    )

    return [doc.document for doc in response.results]
```
Models:
• rerank-english-v2.0: English
• rerank-multilingual-v2.0: 100+ languages

Pricing:
• $1.00 per 1,000 searches (as of 2025)

Pros:
• Managed service
• High quality
• Multilingual

Cons:
• API latency
• Ongoing cost
• Vendor dependency
FlashRank
Efficient local reranking.
```python
from flashrank import Ranker, RerankRequest

ranker = Ranker(model_name="ms-marco-MultiBERT-L-12")

def flashrank_rerank(query, documents, top_n=5):
    rerank_request = RerankRequest(
        query=query,
        passages=[{"text": doc} for doc in documents],
    )

    results = ranker.rerank(rerank_request)

    return [r.text for r in results[:top_n]]
```
Benefits:
• Very fast (optimized inference)
• Self-hosted
• No API costs
Hybrid Reranking
Combine multiple signals.
```python
import numpy as np

def hybrid_rerank(query, documents, weights=None):
    if weights is None:
        weights = {
            'vector_score': 0.3,
            'bm25_score': 0.2,
            'cross_encoder': 0.5,
        }

    # Get scores from the different models
    vector_scores = get_vector_scores(query, documents)
    bm25_scores = get_bm25_scores(query, documents)
    ce_scores = get_cross_encoder_scores(query, documents)

    # Normalize scores to [0, 1]
    vector_scores = normalize(vector_scores)
    bm25_scores = normalize(bm25_scores)
    ce_scores = normalize(ce_scores)

    # Weighted combination
    final_scores = (
        weights['vector_score'] * vector_scores
        + weights['bm25_score'] * bm25_scores
        + weights['cross_encoder'] * ce_scores
    )

    # Rank documents (highest score first)
    ranked_indices = np.argsort(final_scores)[::-1]
    return [documents[i] for i in ranked_indices]
```
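`normalize` and the `get_*_scores` helpers are placeholders for whatever scoring functions your stack provides; a minimal min-max `normalize` sketch:

```python
import numpy as np

def normalize(scores):
    # Min-max normalization to [0, 1]; a constant score list maps to all zeros.
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros_like(scores)
    return (scores - scores.min()) / span
```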
Reranking Strategies
Top-K Reranking
Rerank only top candidates from initial retrieval.
```python
# Retrieve top 20, rerank to get top 5
candidates = retriever.retrieve(query, k=20)
reranked = reranker.rerank(query, candidates, top_n=5)
```
Settings:
• Retrieve: 3-5x the final k
• Rerank: down to the final k needed

Example:
• Need 5 final results
• Retrieve 20 candidates
• Rerank to top 5
Cascading Reranking
Multiple reranking stages with increasing accuracy.
```python
def cascading_rerank(query):
    # Stage 1: fast retrieval over the full corpus
    candidates = fast_retriever.retrieve(query, k=100)

    # Stage 2: fast (small) reranker narrows to 20
    reranked_1 = tiny_reranker.rerank(query, candidates, top_n=20)

    # Stage 3: accurate (large) reranker picks the final 5
    reranked_2 = large_reranker.rerank(query, reranked_1, top_n=5)

    return reranked_2
```
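`tiny_reranker` and `large_reranker` are placeholders above; one way to back them with two cross-encoder sizes (a sketch — the wrapper class is ours, and the model ids follow this guide's naming, so adjust to the exact Hugging Face ids you use):

```python
import numpy as np
from sentence_transformers import CrossEncoder

class CrossEncoderReranker:
    def __init__(self, model_name):
        self.model = CrossEncoder(model_name)

    def rerank(self, query, documents, top_n=5):
        scores = self.model.predict([(query, doc) for doc in documents])
        order = np.argsort(scores)[::-1][:top_n]
        return [documents[i] for i in order]

tiny_reranker = CrossEncoderReranker('cross-encoder/ms-marco-TinyBERT-L2-v2')
large_reranker = CrossEncoderReranker('cross-encoder/ms-marco-MiniLM-L6-v2')
```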
Use when:
• Very large candidate sets
• Multiple quality tiers are needed
• Optimizing cost/latency
Query-Adaptive Reranking
Different reranking based on query type.
```python
def adaptive_rerank(query, documents):
    query_type = classify_query(query)

    if query_type == "factual":
        # Use keyword signals
        return bm25_rerank(query, documents)

    elif query_type == "semantic":
        # Use a cross-encoder
        return cross_encoder_rerank(query, documents)

    elif query_type == "complex":
        # Use an LLM
        return llm_rerank(query, documents)
```
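`classify_query` is left undefined above; a minimal heuristic sketch (the rules are illustrative — a small classifier or an LLM prompt would be more robust):

```python
def classify_query(query):
    # Crude heuristics: short keyword-style queries -> factual,
    # long or multi-clause questions -> complex, everything else -> semantic.
    words = query.split()
    if len(words) <= 4 and "?" not in query:
        return "factual"
    if len(words) > 20 or " and " in query.lower() or ";" in query:
        return "complex"
    return "semantic"
```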
Performance Optimization
Batching
Rerank multiple queries efficiently.
```python
# Bad: one query at a time
for query in queries:
    rerank(query, docs)

# Good: batched scoring of all pairs
pairs = [(q, doc) for q in queries for doc in docs]
scores = reranker.predict(pairs, batch_size=32)
```
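The batched call returns one flat score list; it can be reshaped back into per-query rankings along these lines (a sketch, assuming every query is scored against the same `docs` list):

```python
import numpy as np

# One row per query, one column per document.
score_matrix = np.array(scores).reshape(len(queries), len(docs))

ranked_per_query = {
    query: [docs[i] for i in np.argsort(score_matrix[qi])[::-1]]
    for qi, query in enumerate(queries)
}
```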
Caching
Cache reranking results.
```python
import hashlib
from functools import lru_cache

def cache_key(query, doc):
    # Stable key for an external cache (e.g., Redis); lru_cache below hashes its arguments itself.
    return hashlib.md5(f"{query}:{doc}".encode()).hexdigest()

@lru_cache(maxsize=10000)
def cached_rerank_score(query, doc):
    return reranker.predict([(query, doc)])[0]
```
Async Reranking
Parallelize reranking calls.
```python
import asyncio

async def async_rerank_batch(query, documents):
    tasks = [rerank_async(query, doc) for doc in documents]
    scores = await asyncio.gather(*tasks)
    return rank_by_scores(documents, scores)
```
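`rerank_async` and `rank_by_scores` are left undefined; one way to sketch them, pushing the blocking cross-encoder call onto a worker thread (the names and structure are assumptions, and per-pair calls trade batching efficiency for concurrency):

```python
import asyncio
import numpy as np

async def rerank_async(query, doc):
    # Run the blocking cross-encoder prediction in a worker thread.
    scores = await asyncio.to_thread(reranker.predict, [(query, doc)])
    return scores[0]

def rank_by_scores(documents, scores):
    order = np.argsort(scores)[::-1]
    return [documents[i] for i in order]
```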
Evaluation
Metrics
Precision@k: Relevant docs in top-k after reranking
```python
def precision_at_k(reranked_docs, relevant_docs, k):
    top_k = set(reranked_docs[:k])
    relevant = set(relevant_docs)
    return len(top_k & relevant) / k
```
NDCG@k: Normalized Discounted Cumulative Gain
```python
from sklearn.metrics import ndcg_score

def evaluate_reranking(predictions, relevance_labels, k=5):
    return ndcg_score([relevance_labels], [predictions], k=k)
```
MRR: Mean Reciprocal Rank
```python
def mrr(reranked_docs, relevant_docs):
    # Reciprocal rank for a single query; average over a query set to get MRR.
    for i, doc in enumerate(reranked_docs, 1):
        if doc in relevant_docs:
            return 1 / i
    return 0
```
A/B Testing
Compare reranking strategies.
```python
# Control: no reranking
control_results = retriever.retrieve(query, k=5)

# Treatment: with reranking
treatment_candidates = retriever.retrieve(query, k=20)
treatment_results = reranker.rerank(query, treatment_candidates, k=5)

# Measure: user satisfaction, answer quality
```
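For an offline comparison, the two arms can also be scored with the metrics above over a labeled query set; a sketch, assuming `labeled_queries` is a list of `(query, relevant_docs)` pairs and documents are comparable by identity:

```python
def compare_arms(labeled_queries, k=5):
    control_p, treatment_p = [], []

    for query, relevant_docs in labeled_queries:
        control = retriever.retrieve(query, k=k)
        candidates = retriever.retrieve(query, k=4 * k)
        treatment = reranker.rerank(query, candidates, k=k)

        control_p.append(precision_at_k(control, relevant_docs, k))
        treatment_p.append(precision_at_k(treatment, relevant_docs, k))

    # Mean precision@k for each arm
    return sum(control_p) / len(control_p), sum(treatment_p) / len(treatment_p)
```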
Cost-Benefit Analysis
| Reranker | Latency | Cost per 1K queries | Quality | Best For |
|----------|---------|---------------------|---------|----------|
| No reranking | 0ms | $0 | Baseline | Budget/speed critical |
| TinyBERT | +30ms | $0 (self-hosted) | +10% | Balanced |
| MiniLM | +50ms | $0 (self-hosted) | +20% | Quality-focused |
| Cohere | +100ms | $1 | +25% | Managed simplicity |
| LLM | +500ms | $5-20 | +30% | Highest quality |
Best Practices
• Always overfetch for reranking: retrieve 3-5x the final k
• Start with a cross-encoder: MiniLM is a good default
• Measure impact: A/B test reranking vs. no reranking
• Tune the retrieval count: balance cost and recall
• Consider your query latency budget: reranking adds 50-500ms
• Monitor costs: LLM reranking can be expensive at scale
Choosing a Reranker
Prototyping:
• cross-encoder/ms-marco-MiniLM-L6-v2
• Easy to use, good quality

Production (Cost-Sensitive):
• cross-encoder/ms-marco-TinyBERT-L2-v2
• Self-hosted, fast

Production (Quality-Focused):
• Cohere Rerank API
• Highest quality, managed

Multilingual:
• mmarco-mMiniLMv2-L12-H384-v1
• cross-encoder/mmarco-mMiniLMv2-L12

Highest Quality (Budget Available):
• LLM-based reranking
• GPT-4 or Claude for best results
> 💡 Expert Tip from Ailog: Reranking is high impact but not first priority. Get your chunking, embeddings, and retrieval right first – they're foundational. Once you have a working RAG system, reranking is the easiest way to gain another 10-25% accuracy. Start with Cohere Rerank API for zero-setup wins. We added reranking to production in one afternoon and immediately saw fewer hallucinations and better answer quality.
Test Reranking on Ailog
Compare reranking models with zero setup:
Ailog platform includes:
• Cohere Rerank, cross-encoders, LLM reranking
• Side-by-side quality comparison
• Latency and cost analysis
• A/B testing with real queries
Test reranking free →
Next Steps
With retrieval and reranking optimized, it's critical to measure performance. The next guide covers evaluation metrics and methodologies for assessing RAG system quality.