BEIR Benchmark 2.0 Leaderboard 2025: Complete NDCG@10 Scores & Rankings

Complete BEIR 2.0 leaderboard with NDCG@10 scores for all top models. Compare Voyage, Cohere, BGE, OpenAI embeddings on the latest benchmark.

Author: Ailog Research Team
Published: January 16, 2026
Reading time: 4 min read

BEIR 2.0 Leaderboard - NDCG@10 Scores (2025)

Quick reference table for all models on BEIR 2.0 benchmark:

| Rank | Model | NDCG@10 | Recall@1000 | Type |
|------|-------|---------|-------------|------|
| 1 | Voyage-Large-2 | 54.8% | 89.2% | Dense |
| 2 | Cohere Embed v4 | 53.7% | 87.8% | Dense |
| 3 | BGE-Large-EN | 52.3% | 86.1% | Dense |
| 4 | Gemini-embedding-001 | 52.1% | 86.9% | Dense |
| 5 | OpenAI text-3-large | 51.9% | 85.7% | Dense |
| 6 | Qwen3-Embedding-8B | 51.5% | 86.2% | Dense |
| 7 | E5-Mistral-7B | 51.2% | 84.9% | Dense |
| 8 | ColBERT-v2 | 49.1% | 88.3% | Late Interaction |
| 9 | BM25 | 41.2% | 76.8% | Sparse |

Note: BEIR focuses on zero-shot retrieval across 18 datasets. For overall embedding quality, see MTEB leaderboard.

Source: BEIR Official Leaderboard

---

Announcement

The BEIR (Benchmarking Information Retrieval) team has released version 2.0 of their widely-used retrieval benchmark, addressing limitations of the original and adding more challenging test scenarios.

What's New

Six New Datasets

  1. CodeSearchNet-RAG: Code search with natural language queries
  2. MedQA-Retrieval: Medical question answering
  3. LegalBench-IR: Legal document retrieval
  4. MultiHop-V2: Complex multi-hop questions
  5. TimeQA: Time-sensitive queries
  6. TableQA: Structured data retrieval

Total datasets: 18 (up from 12)

Adversarial Test Sets

New adversarial examples designed to challenge retrieval systems:

Paraphrase Adversaries

  • Same meaning, different wording
  • Tests semantic understanding vs. keyword matching

Negation Adversaries

  • Queries with negations ("not", "except", "without")
  • Tests fine-grained understanding

Entity Swap Adversaries

  • Similar entities swapped
  • Tests entity disambiguation
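
To make these categories concrete, here are a few hypothetical query pairs of each type (illustrative only, not drawn from the actual BEIR 2.0 test sets):

```python
# Hypothetical (original query, adversarial query) pairs -- not real BEIR 2.0 data.
adversarial_pairs = [
    # Paraphrase: same intent, little lexical overlap with the original
    ("how do I reset my router password",
     "steps to change the login credentials on a home gateway"),
    # Negation: one word flips the meaning, but keyword matchers see the same terms
    ("antibiotics effective against viral infections",
     "antibiotics not effective against viral infections"),
    # Entity swap: a similar but different entity changes the correct answer set
    ("side effects of ibuprofen",
     "side effects of acetaminophen"),
]
```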

Results on adversarial sets:

| System | Original BEIR | BEIR 2.0 (Adversarial) | Relative Drop |
|--------|---------------|------------------------|---------------|
| BM25 | 41.2% | 28.7% | -30.3% |
| Dense (SBERT) | 43.8% | 35.1% | -19.9% |
| ColBERT | 47.3% | 39.8% | -15.8% |
| Hybrid | 49.1% | 42.3% | -13.8% |

Insight: All systems struggle with adversarial examples; hybrid approaches degrade least.

Enhanced Metrics

Recall@1000

Added to measure coverage for two-stage systems:

Recall@1000: Did we retrieve the relevant documents within the top 1,000 results?

Critical for reranking pipelines where initial retrieval must have high recall.
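
A minimal sketch of the metric, assuming the BEIR-style layout where `qrels` maps query IDs to `{doc_id: relevance}` and `results` maps query IDs to `{doc_id: score}`:

```python
def recall_at_k(qrels: dict, results: dict, k: int = 1000) -> float:
    """Fraction of relevant documents found in the top-k results, macro-averaged over queries."""
    per_query = []
    for qid, judgments in qrels.items():
        relevant = {doc_id for doc_id, rel in judgments.items() if rel > 0}
        if not relevant:
            continue
        ranked = sorted(results.get(qid, {}).items(), key=lambda x: x[1], reverse=True)
        top_k = {doc_id for doc_id, _ in ranked[:k]}
        per_query.append(len(relevant & top_k) / len(relevant))
    return sum(per_query) / len(per_query) if per_query else 0.0

# e.g. print(f"Recall@1000: {recall_at_k(qrels, results, k=1000):.3f}")
```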

MRR@100

Mean Reciprocal Rank at 100 results:

MRR@100 = 1 / (rank of the first relevant result, considering the top 100)

Better reflects real-world usage than nDCG@10.
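
A matching sketch for MRR@100, under the same data-layout assumptions as the Recall@k example above:

```python
def mrr_at_k(qrels: dict, results: dict, k: int = 100) -> float:
    """Mean reciprocal rank of the first relevant document within the top-k results."""
    reciprocal_ranks = []
    for qid, judgments in qrels.items():
        relevant = {doc_id for doc_id, rel in judgments.items() if rel > 0}
        ranked = sorted(results.get(qid, {}).items(), key=lambda x: x[1], reverse=True)[:k]
        rr = 0.0
        for rank, (doc_id, _) in enumerate(ranked, start=1):
            if doc_id in relevant:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0
```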

Latency Percentiles

Now tracks retrieval speed:

  • p50, p95, p99 latencies
  • Throughput (queries/second)
  • Enables speed-quality tradeoffs
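
A minimal sketch of how these numbers can be collected; `retrieve_fn` is a placeholder for a single-query retrieval call, not part of the BEIR API:

```python
import time
import numpy as np

def latency_report(retrieve_fn, queries: list) -> dict:
    """Time each query sequentially and report p50/p95/p99 latency plus throughput."""
    latencies = []
    for query in queries:
        start = time.perf_counter()
        retrieve_fn(query)
        latencies.append(time.perf_counter() - start)
    latencies_ms = np.array(latencies) * 1000.0
    return {
        "p50_ms": float(np.percentile(latencies_ms, 50)),
        "p95_ms": float(np.percentile(latencies_ms, 95)),
        "p99_ms": float(np.percentile(latencies_ms, 99)),
        "qps": len(queries) / max(sum(latencies), 1e-9),
    }
```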

Domain Shift Analysis

BEIR 2.0 includes cross-domain test splits:

Training domains: Science, News
Test domains: Legal, Medical, Code

Measures generalization across domains:

| System | In-Domain | Out-of-Domain | Generalization Gap |
|--------|-----------|---------------|--------------------|
| BM25 | 42.1% | 39.8% | -5.5% |
| DPR | 45.3% | 34.7% | -23.4% |
| BGE-Large | 48.7% | 42.1% | -13.5% |
| Cohere Embed v4 | 51.2% | 47.8% | -6.6% |
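
For reference, the gap column is the relative drop from in-domain to out-of-domain score; a quick check against the DPR row:

```python
def generalization_gap(in_domain: float, out_of_domain: float) -> float:
    """Relative change in score when moving out of domain (negative = degradation)."""
    return (out_of_domain - in_domain) / in_domain * 100

print(f"{generalization_gap(45.3, 34.7):.1f}%")  # -23.4%, matching the DPR row
```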

Insight: Newer models generalize better across domains.

Leaderboard (2025)

Top performers on BEIR 2.0 (average across all datasets):

| Rank | Model | Avg NDCG@10 | Avg Recall@1000 |
|------|-------|-------------|-----------------|
| 1 | Voyage-Large-2 | 54.8% | 89.2% |
| 2 | Cohere Embed v4 | 53.7% | 87.8% |
| 3 | BGE-Large-EN | 52.3% | 86.1% |
| 4 | Gemini-embedding-001 | 52.1% | 86.9% |
| 5 | OpenAI text-3-large | 51.9% | 85.7% |
| 6 | Qwen3-Embedding-8B | 51.5% | 86.2% |
| 7 | E5-Mistral-7B | 51.2% | 84.9% |
| 8 | ColBERT-v2 | 49.1% | 88.3% |
| 9 | BM25 | 41.2% | 76.8% |

Key Findings

Dense vs. Sparse

Dense retrieval now consistently outperforms BM25:

  • 2021 (BEIR 1.0): BM25 competitive
  • 2025 (BEIR 2.0): Dense models lead by 10-12%

Improvement driven by better training and larger models.

Hybrid Search Value

Hybrid (BM25 + Dense) provides modest gains:

  • Dense alone: 53.7%
  • Dense + BM25: 55.2% (+2.8%)

Diminishing returns as dense models improve.
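
The benchmark does not prescribe how BM25 and dense scores are combined; reciprocal rank fusion (RRF) is one common choice, sketched here as an assumption:

```python
from collections import defaultdict

def reciprocal_rank_fusion(bm25_results: dict, dense_results: dict, k: int = 60) -> dict:
    """Fuse two {doc_id: score} result sets by rank; higher fused score means better."""
    fused = defaultdict(float)
    for results in (bm25_results, dense_results):
        ranked = sorted(results.items(), key=lambda x: x[1], reverse=True)
        for rank, (doc_id, _) in enumerate(ranked, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return dict(sorted(fused.items(), key=lambda x: x[1], reverse=True))
```

The constant k=60 is the widely used default; weighted score interpolation is the main alternative.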

Model Size vs. Performance

Scaling laws still apply:

| Model Size | Avg Performance | Cost/1M Tokens |
|------------|-----------------|----------------|
| Small (100M) | 46.2% | $0.01 |
| Base (350M) | 49.8% | $0.05 |
| Large (1B+) | 53.7% | $0.10 |

Roughly, each 2-3x increase in model size buys another 3-4 points of average performance.

Domain-Specific Models

Fine-tuned domain models outperform general models in-domain:

Medical retrieval:

  • General model: 48.3%
  • Med-tuned model: 61.7% (+27.7%)

Code search:

  • General model: 44.1%
  • Code-tuned model: 58.9% (+33.5%)

Recommendation: Fine-tune for specialized domains.
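
As a rough sketch of what domain fine-tuning can look like with the sentence-transformers library (the base model, training pairs, and hyperparameters below are illustrative placeholders, not the setup behind the numbers above):

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Placeholder in-domain (query, relevant passage) pairs; real fine-tuning needs thousands.
train_examples = [
    InputExample(texts=["first-line treatment for hypertension",
                        "Thiazide diuretics are commonly recommended as first-line therapy..."]),
    InputExample(texts=["contraindications for MRI",
                        "MRI is contraindicated in patients with certain implanted devices..."]),
]

model = SentenceTransformer("BAAI/bge-large-en-v1.5")  # base model chosen for illustration
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives suit retrieval

model.fit(train_objectives=[(train_loader, loss)], epochs=1, warmup_steps=100)
model.save("bge-large-med-tuned")
```

MultipleNegativesRankingLoss treats the other passages in each batch as negatives, which tends to work well even with modest amounts of in-domain pairs.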

Using BEIR 2.0

Installation

```bash
pip install beir==2.0.0
```

Example

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval

# Load dataset
dataset = "msmarco-v2"  # or any BEIR 2.0 dataset
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"  # standard BEIR download URL pattern
data_path = util.download_and_unzip(url, "datasets")

corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")

# Evaluate your model
retriever = YourRetriever()  # your retrieval implementation
results = retriever.retrieve(corpus, queries)

# Standard metrics
evaluator = EvaluateRetrieval()
metrics = evaluator.evaluate(qrels, results, k_values=[1, 3, 5, 10, 100, 1000])

print(f"NDCG@10: {metrics['NDCG@10']}")
print(f"Recall@1000: {metrics['Recall@1000']}")
```

Adversarial Evaluation

```python
# Load the adversarial test set
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test-adversarial")

# Retrieve again on the adversarial queries, then evaluate
adv_results = retriever.retrieve(corpus, queries)
adv_metrics = evaluator.evaluate(qrels, adv_results, k_values=[10])

# Compare standard vs. adversarial
print(f"Standard: {metrics['NDCG@10']}")
print(f"Adversarial: {adv_metrics['NDCG@10']}")
print(f"Robustness gap: {metrics['NDCG@10'] - adv_metrics['NDCG@10']}")
```

Implications for RAG

What Changed

  1. Higher bar: BEIR 2.0 is harder; expect lower absolute scores
  2. Adversarial robustness matters: Real queries are adversarial
  3. Domain adaptation critical: General models struggle on specialized domains
  4. Hybrid value declining: As dense models improve, adding BM25 yields smaller gains

Recommendations

  1. Benchmark on BEIR 2.0: More realistic than v1
  2. Test adversarial splits: Measures robustness
  3. Consider domain fine-tuning: Large gains in specialized fields
  4. Track Recall@1000: Critical for two-stage retrieval
  5. Monitor latency: Speed matters in production

Future Plans

The BEIR team announced:

  • Quarterly updates with new datasets
  • Multilingual expansion (currently English-only)
  • Multimodal retrieval (images, tables)
  • Real-user query distribution
  • Continuous leaderboard updates

Resources

  • Website: beir.ai
  • Paper: "BEIR 2.0: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models"
  • GitHub: github.com/beir-cellar/beir
  • Leaderboard: beir.ai/leaderboard

Conclusion

BEIR 2.0 raises the bar for retrieval evaluation with more realistic and challenging test scenarios. Systems optimized for BEIR 1.0 should be re-evaluated to ensure they handle adversarial queries and domain shifts effectively.

---

FAQ

What is the BEIR benchmark? BEIR (Benchmarking Information Retrieval) is a heterogeneous benchmark for zero-shot evaluation of retrieval models across 18 diverse datasets including MS MARCO, Natural Questions, and domain-specific corpora like medical and legal.

Which model scores highest on BEIR? Voyage-Large-2 leads the BEIR leaderboard with 54.8% NDCG@10, followed by Cohere Embed v4 at 53.7% and BGE-Large-EN at 52.3%.

Is BEIR still relevant? Yes, BEIR remains the gold standard for evaluating retrieval performance specifically. It tests zero-shot generalization across domains, which is critical for real-world RAG applications.

What's the difference between BEIR and MTEB? BEIR focuses specifically on information retrieval across 18 datasets. MTEB is broader, covering 58 datasets across 8 task types including retrieval, classification, clustering, and more. BEIR is a subset of the retrieval tasks in MTEB.

Tags

  • benchmarks
  • evaluation
  • research
  • BEIR
  • NDCG
  • leaderboard
  • 2025