Cross-Encoder Reranking for RAG Precision
Use cross-encoders to rerank retrieved documents and filter out false positives, with 95%+ precision as the target.
Why Cross-Encoders?
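Bi-encoders (standard embedding models) encode the query and the document separately, so document vectors can be computed once and stored. Cross-encoders process the query and the document together in a single forward pass, which is much more accurate but slower, because every query-document pair has to be scored at query time.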
Bi-encoder: sim(encode(query), encode(doc))
Cross-encoder: score(query + doc together)
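To make the two call shapes concrete, here is a minimal sketch that uses toy stand-ins rather than real models. `encode` (a bag-of-words counter), `cosine`, and the `score_pair` callable are all hypothetical illustrations; a real system would use a trained embedding model and a trained cross-encoder. The point is where the work happens: document vectors are built once up front, while pair scoring runs once per candidate at query time.

```python
import math
from collections import Counter

def encode(text):
    """Toy stand-in for a bi-encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = ["cross encoders rerank documents", "bananas are yellow"]

# Bi-encoder pattern: encode every document once, ahead of time.
doc_vectors = [encode(d) for d in docs]

def bi_encoder_scores(query):
    q = encode(query)  # one encoding per query
    return [cosine(q, dv) for dv in doc_vectors]

def cross_encoder_scores(query, score_pair):
    # Cross-encoder pattern: the scorer reads query and document together,
    # so it runs once per (query, doc) pair and nothing can be precomputed.
    return [score_pair(query, d) for d in docs]
```

With 100 candidates, the bi-encoder path costs one query encoding plus 100 cheap vector comparisons, while the cross-encoder path costs 100 model calls. That asymmetry is why reranking is normally applied to a short candidate list.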
Implementation
```python
from sentence_transformers import CrossEncoder

model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query, documents, top_k=5):
    # Create query-document pairs
    pairs = [[query, doc] for doc in documents]

    # Score all pairs
    scores = model.predict(pairs)

    # Sort by score
    ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)

    return [doc for doc, score in ranked[:top_k]]

# Use it
initial_results = vector_search(query, k=100)
final_results = rerank(query, initial_results, top_k=10)
```
Best Models (November 2025)
1. ms-marco-MiniLM-L-12-v2
- Fast, accurate
- Best for general purpose
2. bge-reranker-v2-m3
- Multilingual
- SOTA accuracy
3. jina-reranker-v2-base-multilingual
- 89 languages
- Production-ready
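The sketch below shows one way these models might be loaded through sentence-transformers. The Hugging Face Hub IDs for the second and third entries (`BAAI/bge-reranker-v2-m3`, `jinaai/jina-reranker-v2-base-multilingual`) and the `trust_remote_code=True` setting for the Jina model are assumptions based on the public model cards, not something stated in this post, so check each card before relying on them. `RERANKERS` and `load_reranker` are hypothetical names.

```python
# Hub IDs are assumptions taken from the public model cards; verify before use.
RERANKERS = {
    "general": {"name": "cross-encoder/ms-marco-MiniLM-L-12-v2", "kwargs": {}},
    "multilingual": {"name": "BAAI/bge-reranker-v2-m3", "kwargs": {}},
    "jina": {
        "name": "jinaai/jina-reranker-v2-base-multilingual",
        # Assumed: this model ships custom code and needs explicit opt-in.
        "kwargs": {"trust_remote_code": True},
    },
}

def load_reranker(key):
    """Load one of the models above (downloads weights on first use)."""
    # Imported lazily so the table can be inspected without the library installed.
    from sentence_transformers import CrossEncoder

    spec = RERANKERS[key]
    return CrossEncoder(spec["name"], **spec["kwargs"])
```

Keeping the model choice in one table makes it easy to swap rerankers and compare them on your own queries, which matters more than any general ranking of models.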
Two-Stage Retrieval
```python
from sentence_transformers import CrossEncoder

# Load the model once at module level; reloading it on every call is expensive
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')

def two_stage_rag(query, vector_db):
    # Stage 1: Fast bi-encoder retrieval (100 candidates)
    # embed() and vector_db.search() stand in for your embedding model and vector store
    candidates = vector_db.search(
        query_embedding=embed(query),
        k=100
    )

    # Stage 2: Slow but accurate cross-encoder reranking
    pairs = [[query, doc['content']] for doc in candidates]
    scores = cross_encoder.predict(pairs)

    # Return top 10
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:10]]
```
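The same two-stage shape can be written with the retriever and the scorer passed in as plain callables, which makes the pipeline easy to exercise without a model or a vector database. This is a sketch under assumptions: `two_stage_search`, `toy_retrieve`, and `toy_score_pairs` are hypothetical names, and the word-overlap scorer is only a stand-in for a real cross-encoder.

```python
def two_stage_search(query, retrieve, score_pairs, first_k=100, final_k=10):
    """Retrieve first_k candidates cheaply, then rerank and keep final_k."""
    # Stage 1: cheap, recall-oriented retrieval
    candidates = retrieve(query, first_k)
    if not candidates:
        return []

    # Stage 2: precise scoring of every (query, candidate) pair
    scores = score_pairs([[query, doc] for doc in candidates])

    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:final_k]]

# Toy stand-ins so the pipeline runs without a model or database.
corpus = ["reranking improves precision", "vector search is fast", "cats sleep a lot"]

def toy_retrieve(query, k):
    return corpus[:k]

def toy_score_pairs(pairs):
    # Counts shared words; a real setup would pass a cross-encoder's predict here.
    return [len(set(q.split()) & set(d.split())) for q, d in pairs]

top = two_stage_search("fast vector search", toy_retrieve, toy_score_pairs, final_k=1)
# top == ["vector search is fast"]
```

In a real deployment, `score_pairs` could be the `predict` method of a loaded `CrossEncoder`, and `first_k` becomes the main knob for trading recall against reranking latency.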
Performance Optimization
Cross-encoders are slow, so score candidate pairs in batches:
```python
# Batch processing
def batch_rerank(query, documents, batch_size=32):
    pairs = [[query, doc] for doc in documents]

    all_scores = []
    for i in range(0, len(pairs), batch_size):
        batch = pairs[i:i+batch_size]
        scores = model.predict(batch)
        all_scores.extend(scores)

    return sorted(zip(documents, all_scores), key=lambda x: x[1], reverse=True)
```
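As an alternative to a manual loop, `CrossEncoder.predict` in sentence-transformers takes a `batch_size` argument and handles the chunking itself. The sketch below passes the model in explicitly so that it can be swapped or stubbed; `rerank_with_batch_size` is a hypothetical name.

```python
def rerank_with_batch_size(model, query, documents, batch_size=32):
    """Score all pairs in one predict() call; return (doc, score) pairs, best first."""
    pairs = [[query, doc] for doc in documents]
    # predict() splits the pairs into batches of batch_size internally.
    scores = model.predict(pairs, batch_size=batch_size)
    return sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
```

Larger batches usually help on a GPU until memory runs out, so the right value is something to measure on your own hardware rather than assume.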
When to Rerank
Always rerank when:
- Precision is critical
- Cost of false positives is high
- You have compute budget
Skip reranking when:
- End-to-end latency must stay under 100 ms
- Query volume is high (> 1000 QPS)
- Compute budget is constrained
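These rules of thumb can be encoded as a small gate in front of the reranker. This is a sketch, not a policy: `should_rerank` and its parameter names are hypothetical, the thresholds are copied from the lists above, and letting the skip conditions override the precision requirement when the two conflict is an assumption made here.

```python
def should_rerank(precision_critical, latency_budget_ms, qps, has_compute_budget=True):
    """Apply the rules of thumb above; thresholds are starting points, not hard limits."""
    # Skip conditions are checked first (assumed priority when the rules conflict).
    if latency_budget_ms < 100:
        return False
    if qps > 1000:
        return False
    if not has_compute_budget:
        return False
    # Otherwise rerank when precision (or the cost of a false positive) matters.
    return precision_critical
```

For example, `should_rerank(True, latency_budget_ms=500, qps=50)` returns `True`, while the same call with `latency_budget_ms=50` returns `False`.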