
BEIR Benchmark 2.0 Released with Harder Test Sets and New Evaluation Metrics

October 12, 2025
4 min read
Ailog Research Team

Updated BEIR benchmark includes 6 new datasets, adversarial examples, and improved evaluation methodology for more robust retrieval testing.

Announcement

The BEIR (Benchmarking IR) team has released version 2.0 of its widely used retrieval benchmark, addressing limitations of the original and adding more challenging test scenarios.

What's New

Six New Datasets

  1. CodeSearchNet-RAG: Code search with natural language queries
  2. MedQA-Retrieval: Medical question answering
  3. LegalBench-IR: Legal document retrieval
  4. MultiHop-V2: Complex multi-hop questions
  5. TimeQA: Time-sensitive queries
  6. TableQA: Structured data retrieval

Total datasets: 18 (up from 12)

Adversarial Test Sets

New adversarial examples designed to challenge retrieval systems:

Paraphrase Adversaries

  • Same meaning, different wording
  • Tests semantic understanding vs. keyword matching

Negation Adversaries

  • Queries with negations ("not", "except", "without")
  • Tests fine-grained understanding

Entity Swap Adversaries

  • Similar entities swapped
  • Tests entity disambiguation
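
To make these categories concrete, here are a few illustrative query pairs of each type (hypothetical examples for explanation only, not items from the benchmark):

```python
# Hypothetical (query, adversarial variant) pairs illustrating each category.
# These are explanatory examples only, not drawn from the BEIR 2.0 test sets.
adversarial_examples = {
    # Paraphrase: same intent, different wording -- relevance should not change
    "paraphrase": ("how to reset a router password",
                   "steps for changing the admin password on a home router"),
    # Negation: a single negation flips which documents are relevant
    "negation": ("antibiotics that treat strep throat",
                 "antibiotics that do not treat strep throat"),
    # Entity swap: a similar-looking entity changes the correct answer set
    "entity_swap": ("when was the Eiffel Tower completed",
                    "when was the Leaning Tower of Pisa completed"),
}
```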

Results on adversarial sets:

System          Original BEIR   BEIR 2.0 (Adversarial)   Relative Gap
BM25            41.2%           28.7%                    -30.3%
Dense (SBERT)   43.8%           35.1%                    -19.9%
ColBERT         47.3%           39.8%                    -15.8%
Hybrid          49.1%           42.3%                    -13.8%

Insight: All systems struggle with adversarial examples; hybrid approaches degrade least.
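
The gap figures here (and the generalization gaps below) are relative drops, not percentage-point differences; a minimal sketch of the calculation, using the BM25 row above:

```python
# Relative gap: drop from the standard score to the adversarial score,
# expressed as a fraction of the standard score.
def relative_gap(standard: float, adversarial: float) -> float:
    return (adversarial - standard) / standard

# BM25 row: 41.2% -> 28.7% is a -30.3% relative drop
print(f"{relative_gap(0.412, 0.287):+.1%}")
```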

Enhanced Metrics

Recall@1000

Added to measure coverage for two-stage systems:

Recall@1000: the fraction of relevant documents retrieved within the top 1,000 results.

Critical for reranking pipelines where initial retrieval must have high recall.
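
As a rough illustration (not the benchmark's own implementation), Recall@k can be computed from BEIR-style qrels ({query_id: {doc_id: relevance}}) and results ({query_id: {doc_id: score}}) dictionaries like this:

```python
def recall_at_k(qrels, results, k=1000):
    """Average fraction of each query's relevant docs found in its top-k results."""
    per_query = []
    for qid, judged in qrels.items():
        relevant = {doc_id for doc_id, rel in judged.items() if rel > 0}
        if not relevant:
            continue
        hits = results.get(qid, {})
        top_k = sorted(hits, key=hits.get, reverse=True)[:k]
        per_query.append(len(relevant & set(top_k)) / len(relevant))
    return sum(per_query) / max(len(per_query), 1)
```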

MRR@100

Mean Reciprocal Rank at 100 results:

MRR@100 = mean over queries of 1 / rank of the first relevant result within the top 100 (0 if none appears)

Better reflects real-world usage than nDCG@10.
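
A matching sketch for MRR@100 under the same qrels/results layout (again illustrative, not BEIR's code):

```python
def mrr_at_k(qrels, results, k=100):
    """Mean reciprocal rank of the first relevant document within the top-k.

    A query with no relevant document in its top-k contributes 0.
    """
    total = 0.0
    for qid, judged in qrels.items():
        relevant = {doc_id for doc_id, rel in judged.items() if rel > 0}
        hits = results.get(qid, {})
        ranked = sorted(hits, key=hits.get, reverse=True)[:k]
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / max(len(qrels), 1)
```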

Latency Percentiles

Now tracks retrieval speed:

  • p50, p95, p99 latencies
  • Throughput (queries/second)
  • Enables speed-quality tradeoffs
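
A minimal way to collect these numbers around any retriever (the single-query `retrieve_fn` interface below is an assumption for illustration, not a BEIR 2.0 API):

```python
import time
import numpy as np

def measure_latency(retrieve_fn, queries):
    """Time each query and report p50/p95/p99 latency (ms) plus throughput (QPS)."""
    latencies = []
    for query in queries:
        start = time.perf_counter()
        retrieve_fn(query)  # assumed: runs retrieval for a single query
        latencies.append(time.perf_counter() - start)
    p50, p95, p99 = np.percentile(np.array(latencies) * 1000, [50, 95, 99])
    return {
        "p50_ms": p50,
        "p95_ms": p95,
        "p99_ms": p99,
        "qps": len(queries) / max(sum(latencies), 1e-9),
    }
```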

Domain Shift Analysis

BEIR 2.0 includes cross-domain test splits:

Training domains: Science, News
Test domains: Legal, Medical, Code

Measures generalization across domains:

System            In-Domain   Out-of-Domain   Generalization Gap (relative)
BM25              42.1%       39.8%           -5.5%
DPR               45.3%       34.7%           -23.4%
BGE-Large         48.7%       42.1%           -13.5%
Cohere Embed v4   51.2%       47.8%           -6.6%

Insight: Newer models generalize better across domains.

Leaderboard

Top performers on BEIR 2.0 (average across all datasets):

Rank   Model                 Avg nDCG@10   Avg Recall@1000
1      Voyage-Large-2        54.8%         89.2%
2      Cohere Embed v4       53.7%         87.8%
3      BGE-Large-EN          52.3%         86.1%
4      OpenAI text-3-large   51.9%         85.7%
5      E5-Mistral-7B         51.2%         84.9%
6      ColBERT-v2            49.1%         88.3%
7      SBERT (mpnet)         43.8%         81.2%
8      BM25                  41.2%         76.8%

Key Findings

Dense vs. Sparse

Dense retrieval now consistently outperforms BM25:

  • 2021 (BEIR 1.0): BM25 competitive
  • 2025 (BEIR 2.0): Dense models lead by 10-12 points of nDCG@10

Improvement driven by better training and larger models.

Hybrid Search Value

Hybrid (BM25 + Dense) provides modest gains:

  • Dense alone: 53.7%
  • Dense + BM25 (hybrid): 55.2% (+2.8%)

Diminishing returns as dense models improve.
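
For context, one common way to build such a hybrid is reciprocal rank fusion over the two ranked lists; a minimal sketch (the constant k=60 is a conventional default, not a BEIR setting):

```python
def reciprocal_rank_fusion(bm25_ranking, dense_ranking, k=60):
    """Fuse two ranked lists of doc IDs; each doc scores sum(1 / (k + rank))."""
    scores = {}
    for ranking in (bm25_ranking, dense_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. hybrid_top_100 = reciprocal_rank_fusion(bm25_doc_ids, dense_doc_ids)[:100]
```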

Model Size vs. Performance

Scaling laws still apply:

Model Size     Avg Performance   Cost / 1M Tokens
Small (100M)   46.2%             $0.01
Base (350M)    49.8%             $0.05
Large (1B+)    53.7%             $0.10

Increasing model size 2-3x yields roughly +3-4 points of average performance.

Domain-Specific Models

Fine-tuned domain models outperform general models in-domain:

Medical retrieval:

  • General model: 48.3%
  • Med-tuned model: 61.7% (+27.7%)

Code search:

  • General model: 44.1%
  • Code-tuned model: 58.9% (+33.5%)

Recommendation: Fine-tune for specialized domains.
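
One possible recipe (not part of the BEIR release) is contrastive fine-tuning with sentence-transformers and in-batch negatives; `domain_pairs` below is a placeholder for your own (query, relevant passage) data, and the base model name is just an example:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder: replace with (query, relevant_passage) pairs from your domain
domain_pairs = [("query text", "relevant passage text")]

model = SentenceTransformer("BAAI/bge-large-en-v1.5")  # example base model
train_examples = [InputExample(texts=[q, p]) for q, p in domain_pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
model.save("domain-tuned-retriever")
```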

Using BEIR 2.0

Installation

```bash
pip install beir==2.0.0
```

Example

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval

# Load dataset
dataset = "msmarco-v2"  # or any BEIR 2.0 dataset
# "url" should point at the chosen dataset's download archive (see beir.ai)
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")

# Evaluate your model
retriever = YourRetriever()  # plug in your own retriever implementation
results = retriever.retrieve(corpus, queries)

# Standard metrics
evaluator = EvaluateRetrieval()
metrics = evaluator.evaluate(qrels, results, k_values=[1, 3, 5, 10, 100, 1000])
print(f"NDCG@10: {metrics['NDCG@10']}")
print(f"Recall@1000: {metrics['Recall@1000']}")
```

Adversarial Evaluation

```python
# Load adversarial test set
adv_corpus, adv_queries, adv_qrels = GenericDataLoader(data_path).load(
    split="test-adversarial"
)

# Retrieve again using the adversarial queries
adv_results = retriever.retrieve(adv_corpus, adv_queries)

# Evaluate
adv_metrics = evaluator.evaluate(adv_qrels, adv_results, k_values=[10])

# Compare standard vs. adversarial
print(f"Standard: {metrics['NDCG@10']}")
print(f"Adversarial: {adv_metrics['NDCG@10']}")
print(f"Robustness gap: {metrics['NDCG@10'] - adv_metrics['NDCG@10']}")
```

Implications for RAG

What Changed

  1. Higher bar: BEIR 2.0 is harder; expect lower absolute scores
  2. Adversarial robustness matters: Real user queries often contain paraphrases, negations, and easily confused entities
  3. Domain adaptation critical: General models struggle on specialized domains
  4. Hybrid gains shrinking: As dense models improve, adding BM25 contributes less

Recommendations

  1. Benchmark on BEIR 2.0: More realistic than v1
  2. Test adversarial splits: Measures robustness
  3. Consider domain fine-tuning: Large gains in specialized fields
  4. Track Recall@1000: Critical for two-stage retrieval
  5. Monitor latency: Speed matters in production

Future Plans

The BEIR team has announced:

  • Quarterly updates with new datasets
  • Multilingual expansion (currently English-only)
  • Multimodal retrieval (images, tables)
  • Real-user query distribution
  • Continuous leaderboard updates

Resources

  • Website: beir.ai
  • Paper: "BEIR 2.0: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models"
  • GitHub: github.com/beir-cellar/beir
  • Leaderboard: beir.ai/leaderboard

Conclusion

BEIR 2.0 raises the bar for retrieval evaluation with more realistic and challenging test scenarios. Systems optimized for BEIR 1.0 should be re-evaluated to ensure they handle adversarial queries and domain shifts effectively.

Tags

benchmarks, evaluation, research, BEIR
