Cross-Encoder Reranking for RAG Precision
Achieve 95%+ precision: use cross-encoders to rerank retrieved documents and eliminate false positives.
Why Cross-Encoders?
Bi-encoders (standard embedding models) encode the query and document separately. Cross-encoders process them together, which makes them much more accurate but slower.
Bi-encoder: sim(encode(query), encode(doc))
Cross-encoder: score(query + doc together)
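The architectural difference can be sketched with a toy example. This is pure Python with trivial bag-of-words "encoders" standing in for real models; the point is the shape of each scoring path, not the quality of the scores:

```python
from collections import Counter
import math

def encode(text):
    # Toy bi-encoder: a bag-of-words vector built from one text alone.
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0) for t in u)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def bi_encoder_sim(query, doc):
    # Query and document are encoded independently, then compared.
    return cosine(encode(query), encode(doc))

def cross_encoder_score(query, doc):
    # Toy cross-encoder: sees both texts at once, so it can model their
    # interaction (here: fraction of query terms present in the document).
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0
```

A real cross-encoder feeds the concatenated pair through a full transformer, which is why it captures interactions a bi-encoder's fixed vectors cannot.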
Implementation
```python
from sentence_transformers import CrossEncoder

model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query, documents, top_k=5):
    # Create query-document pairs
    pairs = [[query, doc] for doc in documents]

    # Score all pairs
    scores = model.predict(pairs)

    # Sort by score, highest first
    ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:top_k]]

# Use it
initial_results = vector_search(query, k=100)
final_results = rerank(query, initial_results, top_k=10)
```
Best Models (November 2025)
1. ms-marco-MiniLM-L-12-v2
- Fast, accurate
- Best for general purpose
2. bge-reranker-v2-m3
- Multilingual
- SOTA accuracy
3. jina-reranker-v2-base-multilingual
- 89 languages
- Production-ready
Two-Stage Retrieval
```python
def two_stage_rag(query, vector_db):
    # Stage 1: fast bi-encoder retrieval (100 candidates)
    candidates = vector_db.search(
        query_embedding=embed(query),
        k=100
    )

    # Stage 2: slow but accurate cross-encoder reranking
    cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')
    pairs = [[query, doc['content']] for doc in candidates]
    scores = cross_encoder.predict(pairs)

    # Return top 10
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:10]]
```
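One caveat with the sketch above: it constructs the CrossEncoder on every call, which reloads model weights each time. A common fix is to cache the loaded model, e.g. with `functools.lru_cache`. Shown here with a stand-in loader so the pattern is self-contained; in real code the loader body would be `return CrossEncoder(model_name)`:

```python
from functools import lru_cache

@lru_cache(maxsize=4)
def get_reranker(model_name):
    # Stand-in for: return CrossEncoder(model_name)
    # The loader runs once per model name; later calls hit the cache.
    return {"name": model_name}

m1 = get_reranker("cross-encoder/ms-marco-MiniLM-L-12-v2")
m2 = get_reranker("cross-encoder/ms-marco-MiniLM-L-12-v2")
assert m1 is m2  # same cached object, no reload
```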
Performance Optimization
Cross-encoders are slow, so batch the scoring calls:
```python
# Batch processing
def batch_rerank(query, documents, batch_size=32):
    pairs = [[query, doc] for doc in documents]

    all_scores = []
    for i in range(0, len(pairs), batch_size):
        batch = pairs[i:i + batch_size]
        scores = model.predict(batch)
        all_scores.extend(scores)

    # Note: model.predict(pairs, batch_size=32) batches internally as well.
    return sorted(zip(documents, all_scores), key=lambda x: x[1], reverse=True)
```
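A related detail: many MS MARCO cross-encoders output raw logits rather than probabilities. If you want an absolute relevance cutoff rather than just a ranking, map scores through a sigmoid first. A minimal sketch; the threshold value is an assumption you would tune on your own data:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def filter_by_relevance(docs_with_scores, threshold=0.5):
    # Keep only (doc, logit) pairs whose normalized score clears the threshold.
    return [(doc, sigmoid(s)) for doc, s in docs_with_scores
            if sigmoid(s) >= threshold]

# Logits from a reranker: positive -> likely relevant, negative -> likely not.
scored = [("doc A", 4.2), ("doc B", -3.1), ("doc C", 0.3)]
kept = filter_by_relevance(scored)  # doc B is dropped
```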
When to Rerank
Always rerank when:
- Precision is critical
- Cost of false positives is high
- You have compute budget
Skip reranking when:
- Latency < 100ms required
- High QPS (> 1000/sec)
- Budget constrained
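The checklist above can be folded into a simple routing helper. The thresholds mirror the numbers in the text and would be tuned per deployment:

```python
def should_rerank(latency_budget_ms, qps, precision_critical=False):
    # Heuristic gate: always rerank when precision is critical;
    # otherwise skip under tight latency budgets or very high query rates.
    if precision_critical:
        return True
    if latency_budget_ms < 100:
        return False
    if qps > 1000:
        return False
    return True
```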