BEIR Benchmark 2.0 Leaderboard 2025: Complete NDCG@10 Scores & Rankings
Complete BEIR 2.0 leaderboard with NDCG@10 scores for all top models. Compare Voyage, Cohere, BGE, OpenAI embeddings on the latest benchmark.
- Author
- Ailog Research Team
- Published
- Reading time
- 4 min read
BEIR 2.0 Leaderboard - NDCG@10 Scores (2025)
Quick reference table for all models on BEIR 2.0 benchmark:
| Rank | Model | NDCG@10 | Recall@1000 | Type |
|------|-------|---------|-------------|------|
| 1 | Voyage-Large-2 | 54.8% | 89.2% | Dense |
| 2 | Cohere Embed v4 | 53.7% | 87.8% | Dense |
| 3 | BGE-Large-EN | 52.3% | 86.1% | Dense |
| 4 | Gemini-embedding-001 | 52.1% | 86.9% | Dense |
| 5 | OpenAI text-3-large | 51.9% | 85.7% | Dense |
| 6 | Qwen3-Embedding-8B | 51.5% | 86.2% | Dense |
| 7 | E5-Mistral-7B | 51.2% | 84.9% | Dense |
| 8 | ColBERT-v2 | 49.1% | 88.3% | Late Interaction |
| 9 | BM25 | 41.2% | 76.8% | Sparse |
Note: BEIR focuses on zero-shot retrieval across 18 datasets. For overall embedding quality, see MTEB leaderboard.
Source: BEIR Official Leaderboard
---
Announcement
The BEIR (Benchmarking IR) team has released version 2.0 of their widely-used retrieval benchmark, addressing limitations of the original and adding more challenging test scenarios.
What's New
Six New Datasets
- CodeSearchNet-RAG: Code search with natural language queries
- MedQA-Retrieval: Medical question answering
- LegalBench-IR: Legal document retrieval
- MultiHop-V2: Complex multi-hop questions
- TimeQA: Time-sensitive queries
- TableQA: Structured data retrieval
Total datasets: 18 (up from 12)
Adversarial Test Sets
New adversarial examples designed to challenge retrieval systems:
Paraphrase Adversaries
- Same meaning, different wording
- Tests semantic understanding vs. keyword matching

Negation Adversaries
- Queries with negations ("not", "except", "without")
- Tests fine-grained understanding

Entity Swap Adversaries
- Similar entities swapped
- Tests entity disambiguation
Results on adversarial sets:
| System | Original BEIR | BEIR 2.0 (Adversarial) | Gap |
|--------|---------------|------------------------|-----|
| BM25 | 41.2% | 28.7% | -30.3% |
| Dense (SBERT) | 43.8% | 35.1% | -19.9% |
| ColBERT | 47.3% | 39.8% | -15.8% |
| Hybrid | 49.1% | 42.3% | -13.8% |
Insight: All systems struggle with adversarial examples; hybrid approaches degrade least.
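Note that the Gap column is a relative change: the percentage difference between the adversarial and the original score, not a raw point difference. A minimal sketch of that calculation, using the values from the table above:

```python
# Relative robustness gap: percentage change from the standard BEIR score
# to the adversarial BEIR 2.0 score (values taken from the table above).
def relative_gap(original: float, adversarial: float) -> float:
    """Return the percentage change, e.g. roughly -30.3 for BM25."""
    return (adversarial - original) / original * 100

scores = {
    "BM25": (41.2, 28.7),
    "Dense (SBERT)": (43.8, 35.1),
    "ColBERT": (47.3, 39.8),
    "Hybrid": (49.1, 42.3),
}

for system, (orig, adv) in scores.items():
    print(f"{system}: {relative_gap(orig, adv):+.1f}%")
```

The printed values match the Gap column up to rounding.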
Enhanced Metrics
Recall@1000
Added to measure coverage for two-stage systems:
Recall@1000: did we retrieve the relevant documents anywhere in the top 1,000 results?
Critical for reranking pipelines where initial retrieval must have high recall.
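For intuition, here is a minimal, illustrative reimplementation of Recall@k over BEIR-style nested dicts (`qrels` and `results` keyed by query id, then doc id). It is not the library's own code.

```python
def recall_at_k(qrels: dict, results: dict, k: int = 1000) -> float:
    """Fraction of relevant docs found in the top-k results, averaged over queries.

    qrels:   {query_id: {doc_id: relevance_label}}
    results: {query_id: {doc_id: retrieval_score}}
    """
    per_query = []
    for qid, rel_docs in qrels.items():
        relevant = {d for d, label in rel_docs.items() if label > 0}
        if not relevant:
            continue
        ranked = sorted(results.get(qid, {}).items(), key=lambda x: x[1], reverse=True)
        top_k = {doc_id for doc_id, _ in ranked[:k]}
        per_query.append(len(relevant & top_k) / len(relevant))
    return sum(per_query) / len(per_query) if per_query else 0.0
```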
MRR@100
Mean Reciprocal Rank at 100 results:
MRR@100 = mean over queries of 1 / rank of the first relevant result within the top 100 (0 if no relevant result appears)

This better reflects real-world usage, where users mostly care about the first good result, than NDCG@10 alone.
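A matching sketch for MRR@k under the same `qrels`/`results` convention, again illustrative rather than the official implementation:

```python
def mrr_at_k(qrels: dict, results: dict, k: int = 100) -> float:
    """Mean reciprocal rank of the first relevant doc within the top-k results."""
    reciprocal_ranks = []
    for qid, rel_docs in qrels.items():
        relevant = {d for d, label in rel_docs.items() if label > 0}
        ranked = sorted(results.get(qid, {}).items(), key=lambda x: x[1], reverse=True)[:k]
        rr = 0.0
        for rank, (doc_id, _) in enumerate(ranked, start=1):
            if doc_id in relevant:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0
```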
Latency Percentiles
Now tracks retrieval speed:
- p50, p95, p99 latencies
- Throughput (queries/second)
- Enables speed-quality tradeoff analysis
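The benchmark harness reports these numbers itself; as a rough sketch, per-query latency percentiles and throughput can be measured like this, assuming a retriever exposing the `retrieve(corpus, queries)` interface used later in this post:

```python
import time
import numpy as np

def latency_stats(retriever, corpus, queries: dict) -> dict:
    """Measure per-query latency percentiles and throughput for a retriever.

    Assumes retriever.retrieve(corpus, {query_id: text}) handles a single query.
    """
    latencies = []
    start = time.perf_counter()
    for qid, text in queries.items():
        t0 = time.perf_counter()
        retriever.retrieve(corpus, {qid: text})  # one query at a time
        latencies.append(time.perf_counter() - t0)
    total = time.perf_counter() - start
    p50, p95, p99 = np.percentile(latencies, [50, 95, 99])
    return {"p50": p50, "p95": p95, "p99": p99, "qps": len(queries) / total}
```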
Domain Shift Analysis
BEIR 2.0 includes cross-domain test splits:
- Training domains: Science, News
- Test domains: Legal, Medical, Code
Measures generalization across domains:
| System | In-Domain | Out-of-Domain | Generalization Gap |
|--------|-----------|---------------|--------------------|
| BM25 | 42.1% | 39.8% | -5.5% |
| DPR | 45.3% | 34.7% | -23.4% |
| BGE-Large | 48.7% | 42.1% | -13.5% |
| Cohere Embed v4 | 51.2% | 47.8% | -6.6% |
Insight: Newer models generalize better across domains.
Leaderboard (2025)
Top performers on BEIR 2.0 (average across all datasets):
| Rank | Model | Avg NDCG@10 | Avg Recall@1000 |
|------|-------|-------------|-----------------|
| 1 | Voyage-Large-2 | 54.8% | 89.2% |
| 2 | Cohere Embed v4 | 53.7% | 87.8% |
| 3 | BGE-Large-EN | 52.3% | 86.1% |
| 4 | Gemini-embedding-001 | 52.1% | 86.9% |
| 5 | OpenAI text-3-large | 51.9% | 85.7% |
| 6 | Qwen3-Embedding-8B | 51.5% | 86.2% |
| 7 | E5-Mistral-7B | 51.2% | 84.9% |
| 8 | ColBERT-v2 | 49.1% | 88.3% |
| 9 | BM25 | 41.2% | 76.8% |
Key Findings
Dense vs. Sparse
Dense retrieval now consistently outperforms BM25:
- 2021 (BEIR 1.0): BM25 was competitive with dense models
- 2025 (BEIR 2.0): Dense models lead by 10+ points of NDCG@10
Improvement driven by better training and larger models.
Hybrid Search Value
Hybrid (BM25 + Dense) provides modest gains:
- Dense alone: 53.7%
- Dense + BM25: 55.2% (+1.5 points, a 2.8% relative gain)
Diminishing returns as dense models improve.
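The benchmark does not specify how the hybrid scores are fused; one common option is reciprocal rank fusion (RRF), sketched here purely for illustration:

```python
def reciprocal_rank_fusion(ranked_lists, k: int = 60) -> dict:
    """Fuse several ranked doc-id lists (e.g. one from BM25, one from a dense
    retriever) with RRF. Higher fused score = better; k dampens low ranks."""
    fused = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return dict(sorted(fused.items(), key=lambda x: x[1], reverse=True))

# Example: fuse a BM25 ranking with a dense ranking for one query
bm25_ranking = ["d3", "d1", "d7"]
dense_ranking = ["d1", "d9", "d3"]
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
```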
Model Size vs. Performance
Scaling laws still apply:
| Model Size | Avg Performance | Cost/1M Tokens |
|------------|-----------------|----------------|
| Small (100M) | 46.2% | $0.01 |
| Base (350M) | 49.8% | $0.05 |
| Large (1B+) | 53.7% | $0.10 |
Roughly every 2-3x increase in model size buys another 3-4 points of NDCG@10.
Domain-Specific Models
Fine-tuned domain models outperform general models in-domain:
Medical retrieval:
- General model: 48.3%
- Med-tuned model: 61.7% (+27.7% relative)

Code search:
- General model: 44.1%
- Code-tuned model: 58.9% (+33.5% relative)
Recommendation: Fine-tune for specialized domains.
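As a starting point, here is a minimal fine-tuning sketch using sentence-transformers with in-batch negatives. The base model name and the training pairs are placeholders, not part of the benchmark; in practice you would use thousands of in-domain (query, relevant passage) pairs.

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Placeholder base model and in-domain (query, relevant passage) pairs
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
train_examples = [
    InputExample(texts=["what causes hypertension?",
                        "Hypertension is commonly caused by ..."]),
    InputExample(texts=["statute of limitations for fraud",
                        "The limitation period for fraud claims ..."]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=10)
model.save("domain-tuned-embedder")
```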
Using BEIR 2.0
Installation
```bash
pip install beir==2.0.0
```
Example
```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval

# Load a dataset ("msmarco-v2" here, or any BEIR 2.0 dataset)
dataset = "msmarco-v2"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"  # standard BEIR download pattern
data_path = util.download_and_unzip(url, "datasets")

corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")

# Evaluate your model (YourRetriever is a placeholder; see the sketch below)
retriever = YourRetriever()
results = retriever.retrieve(corpus, queries)

# Standard metrics: evaluate() returns per-cutoff NDCG, MAP, Recall and Precision dicts
evaluator = EvaluateRetrieval()
ndcg, _map, recall, precision = evaluator.evaluate(
    qrels, results, k_values=[1, 3, 5, 10, 100, 1000]
)

print(f"NDCG@10: {ndcg['NDCG@10']}")
print(f"Recall@1000: {recall['Recall@1000']}")
```
Adversarial Evaluation
```python
# Load the adversarial test split
corpus, queries, qrels = GenericDataLoader(data_path).load(
    split="test-adversarial"
)

# Retrieve and evaluate on the adversarial queries
adv_results = retriever.retrieve(corpus, queries)
adv_ndcg, _, _, _ = evaluator.evaluate(qrels, adv_results, k_values=[10])

# Compare standard vs. adversarial
print(f"Standard: {ndcg['NDCG@10']}")
print(f"Adversarial: {adv_ndcg['NDCG@10']}")
print(f"Robustness gap: {ndcg['NDCG@10'] - adv_ndcg['NDCG@10']}")
```
Implications for RAG
What Changed
- Higher bar: BEIR 2.0 is harder; expect lower absolute scores
- Adversarial robustness matters: real-world queries often contain paraphrases and negations
- Domain adaptation is critical: general models struggle on specialized domains
- Hybrid value is declining: as dense models improve, adding BM25 yields smaller gains
Recommendations
- Benchmark on BEIR 2.0: more realistic than v1
- Test the adversarial splits: measures robustness
- Consider domain fine-tuning: large gains in specialized fields
- Track Recall@1000: critical for two-stage retrieval
- Monitor latency: speed matters in production
Future Plans
The BEIR team has announced:
- Quarterly updates with new datasets
- Multilingual expansion (currently English-only)
- Multimodal retrieval (images, tables)
- Real-user query distributions
- Continuous leaderboard updates
Resources
- Website: beir.ai
- Paper: "BEIR 2.0: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models"
- GitHub: github.com/beir-cellar/beir
- Leaderboard: beir.ai/leaderboard
Conclusion
BEIR 2.0 raises the bar for retrieval evaluation with more realistic and challenging test scenarios. Systems optimized for BEIR 1.0 should be re-evaluated to ensure they handle adversarial queries and domain shifts effectively.
---
FAQ
What is the BEIR benchmark? BEIR (Benchmarking Information Retrieval) is a heterogeneous benchmark for zero-shot evaluation of retrieval models across 18 diverse datasets including MS MARCO, Natural Questions, and domain-specific corpora like medical and legal.
Which model scores highest on BEIR? Voyage-Large-2 leads the BEIR leaderboard with 54.8% NDCG@10, followed by Cohere Embed v4 at 53.7% and BGE-Large-EN at 52.3%.
Is BEIR still relevant? Yes, BEIR remains the gold standard for evaluating retrieval performance specifically. It tests zero-shot generalization across domains, which is critical for real-world RAG applications.
What's the difference between BEIR and MTEB? BEIR focuses specifically on information retrieval across 18 datasets. MTEB is broader, covering 58 datasets across 8 task types including retrieval, classification, clustering, and more. BEIR's datasets make up a subset of MTEB's retrieval tasks.