Cross-Encoder Reranking for RAG Precision
Achieve 95%+ precision: use cross-encoders to rerank retrieved documents and eliminate false positives.
- Author: Ailog Research Team
- Published
- Reading time: 11 min read
- Level: advanced
- RAG Pipeline Step: Reranking
Why Cross-Encoders?
Bi-encoders (standard embedding models) encode the query and the document separately; cross-encoders process them together as a single input, which is much more accurate but slower.
```
Bi-encoder:    sim(encode(query), encode(doc))
Cross-encoder: score(query + doc together)
```
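The difference between the two interfaces can be sketched with toy scorers. The `embed` and `cross_score` functions below are illustrative pure-Python stand-ins (character counts and word overlap), not real models; the point is that bi-encoder document vectors can be computed once offline, while the cross-encoder must see each query-document pair jointly:

```python
import math

def embed(text):
    # Toy "embedding": character-frequency vector (stand-in for a bi-encoder)
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord('a')] += 1.0
    return vec

def sim(a, b):
    # Cosine similarity between two vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Bi-encoder path: documents are embedded once, ahead of query time
doc_vecs = {doc: embed(doc) for doc in ["cheap flights to rome", "roman history"]}
query_vec = embed("flights to rome")
bi_scores = {doc: sim(query_vec, v) for doc, v in doc_vecs.items()}

def cross_score(query, doc):
    # Toy "cross-encoder": scores the pair jointly (word-level Jaccard overlap)
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d)

# Cross-encoder path: every pair must be scored at query time
cross_scores = {doc: cross_score("flights to rome", doc) for doc in doc_vecs}
```

Both toy scorers rank the relevant document first here, but only the bi-encoder path lets you precompute the expensive part, which is why real pipelines retrieve with one and rerank with the other.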
Implementation
```python
from sentence_transformers import CrossEncoder

model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query, documents, top_k=5):
    # Create query-document pairs
    pairs = [[query, doc] for doc in documents]
    # Score all pairs
    scores = model.predict(pairs)
    # Sort by score
    ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:top_k]]

# Use it
initial_results = vector_search(query, k=100)
final_results = rerank(query, initial_results, top_k=10)
```
Best Models (November 2025)
- ms-marco-MiniLM-L-12-v2 • Fast, accurate • Best for general purpose
- bge-reranker-v2-m3 • Multilingual • SOTA accuracy
- jina-reranker-v2-base-multilingual • 89 languages • Production-ready
Two-Stage Retrieval
```python
def two_stage_rag(query, vector_db):
    # Stage 1: Fast bi-encoder retrieval (100 candidates)
    candidates = vector_db.search(
        query_embedding=embed(query),
        k=100
    )
    # Stage 2: Slow but accurate cross-encoder reranking
    cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')
    pairs = [[query, doc['content']] for doc in candidates]
    scores = cross_encoder.predict(pairs)
    # Return top 10
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:10]]
```
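To see the two-stage flow end to end without downloading any models, the components can be mocked. `ToyVectorDB` and `toy_cross_score` below are hypothetical stand-ins for the vector database and cross-encoder, using word overlap in place of learned scores:

```python
class ToyVectorDB:
    """Hypothetical in-memory stand-in for a vector database."""
    def __init__(self, docs):
        self.docs = docs

    def search(self, query, k):
        # Stage 1 stand-in: cheap, recall-oriented scoring (shared-word count)
        q = set(query.split())
        scored = sorted(self.docs, key=lambda d: -len(q & set(d.split())))
        return scored[:k]

def toy_cross_score(query, doc):
    # Stage 2 stand-in: "joint" scoring of the full pair (Jaccard overlap)
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d) if q | d else 0.0

def two_stage(query, db, k_retrieve=3, k_final=2):
    candidates = db.search(query, k=k_retrieve)               # fast, broad
    scores = [toy_cross_score(query, d) for d in candidates]  # slow, precise
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [d for d, _ in ranked[:k_final]]

db = ToyVectorDB([
    "reranking improves rag precision",
    "rag pipelines retrieve documents",
    "bananas are yellow fruit",
])
results = two_stage("how does reranking improve rag", db)
```

The shape of the pipeline is identical to the real version: a wide, cheap first pass followed by a narrow, expensive second pass over only the survivors.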
Performance Optimization
Cross-encoders are slow, so batch the pair scoring to keep latency manageable:
```python
# Batch processing
def batch_rerank(query, documents, batch_size=32):
    pairs = [[query, doc] for doc in documents]
    all_scores = []
    for i in range(0, len(pairs), batch_size):
        batch = pairs[i:i+batch_size]
        scores = model.predict(batch)
        all_scores.extend(scores)
    return sorted(zip(documents, all_scores), key=lambda x: x[1], reverse=True)
```
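One way to sanity-check the batching logic is to run it against a stub scorer. `StubCrossEncoder` below is a hypothetical stand-in (not part of sentence-transformers) that scores pairs by word overlap, so the ranking it produces is easy to verify by hand:

```python
class StubCrossEncoder:
    """Hypothetical stand-in for a CrossEncoder: scores pairs by word overlap."""
    def predict(self, pairs):
        return [len(set(q.split()) & set(d.split())) for q, d in pairs]

model = StubCrossEncoder()

def batch_rerank(query, documents, batch_size=32):
    # Same logic as the snippet above, exercised against the stub
    pairs = [[query, doc] for doc in documents]
    all_scores = []
    for i in range(0, len(pairs), batch_size):
        all_scores.extend(model.predict(pairs[i:i + batch_size]))
    return sorted(zip(documents, all_scores), key=lambda x: x[1], reverse=True)

docs = ["alpha beta", "alpha beta gamma", "delta"]
ranked = batch_rerank("alpha beta gamma", docs, batch_size=2)
```

Note that sentence-transformers' own `CrossEncoder.predict` already accepts a `batch_size` argument, so manual chunking like this is mainly useful when you need to interleave other work between batches, such as progress reporting or an early-exit threshold.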
When to Rerank
Always rerank when:
- Precision is critical
- Cost of false positives is high
- You have compute budget

Skip reranking when:
- Latency < 100ms required
- High QPS (> 1000/sec)
- Budget constrained
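The latency side of this trade-off can be estimated on the back of an envelope: reranking cost grows with the number of batches, so you can check whether it fits your budget before committing. The helper and the ~50 ms-per-batch figure below are hypothetical, for illustration only; measure on your own hardware:

```python
def should_rerank(n_candidates, batch_size, ms_per_batch,
                  latency_budget_ms, precision_critical):
    # Estimated reranking cost = number of batches * time per batch
    batches = -(-n_candidates // batch_size)  # ceiling division
    rerank_ms = batches * ms_per_batch
    if rerank_ms > latency_budget_ms:
        return False           # can't afford it, regardless of precision needs
    return precision_critical  # affordable: rerank when precision matters

# e.g. 100 candidates, batch of 32, assumed ~50 ms per batch:
# 4 batches -> ~200 ms of reranking
fits_100ms = should_rerank(100, 32, 50, latency_budget_ms=100, precision_critical=True)
fits_500ms = should_rerank(100, 32, 50, latency_budget_ms=500, precision_critical=True)
```

Under these assumed numbers, reranking 100 candidates blows a 100 ms budget but fits comfortably inside 500 ms, which matches the rule of thumb above: skip reranking in sub-100ms paths, keep it where precision justifies the latency.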