BEIR Benchmark 2.0 Released with Harder Test Sets and New Evaluation Metrics
Updated BEIR benchmark includes 6 new datasets, adversarial examples, and improved evaluation methodology for more robust retrieval testing.
Announcement
The BEIR (Benchmarking IR) team has released version 2.0 of their widely-used retrieval benchmark, addressing limitations of the original and adding more challenging test scenarios.
What's New
Six New Datasets
- CodeSearchNet-RAG: Code search with natural language queries
- MedQA-Retrieval: Medical question answering
- LegalBench-IR: Legal document retrieval
- MultiHop-V2: Complex multi-hop questions
- TimeQA: Time-sensitive queries
- TableQA: Structured data retrieval
Total datasets: 18 (up from 12)
Adversarial Test Sets
New adversarial examples designed to challenge retrieval systems:
Paraphrase Adversaries
- Same meaning, different wording
- Tests semantic understanding vs. keyword matching
Negation Adversaries
- Queries with negations ("not", "except", "without")
- Tests fine-grained understanding
Entity Swap Adversaries
- Similar entities swapped
- Tests entity disambiguation
Results on adversarial sets:
| System | Original BEIR | BEIR 2.0 (Adversarial) | Gap |
|---|---|---|---|
| BM25 | 41.2% | 28.7% | -30.3% |
| Dense (SBERT) | 43.8% | 35.1% | -19.9% |
| ColBERT | 47.3% | 39.8% | -15.8% |
| Hybrid | 49.1% | 42.3% | -13.8% |
Insight: All systems struggle with adversarial examples; hybrid approaches degrade least.
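The Gap column is the relative drop from the original score, not the absolute difference in points. A minimal sketch of that computation, using the values from the table above:

```python
# Relative degradation on the adversarial split: (adversarial - original) / original.
# Scores are the percentages reported in the table above.
scores = {
    "BM25": (41.2, 28.7),
    "Dense (SBERT)": (43.8, 35.1),
    "ColBERT": (47.3, 39.8),
    "Hybrid": (49.1, 42.3),
}

for system, (original, adversarial) in scores.items():
    gap = (adversarial - original) / original * 100
    print(f"{system}: {gap:+.1f}%")  # e.g. BM25: -30.3%
```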
Enhanced Metrics
Recall@1000
Added to measure coverage for two-stage systems:
Recall@1000: the fraction of relevant documents retrieved within the top 1,000 results.
Critical for reranking pipelines where initial retrieval must have high recall.
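A minimal sketch of how Recall@k can be computed per query from a ranked result list and a set of relevance judgments; the function and document IDs here are illustrative, not the BEIR API:

```python
def recall_at_k(ranked_doc_ids: list[str], relevant_doc_ids: set[str], k: int = 1000) -> float:
    """Fraction of relevant documents that appear in the top-k retrieved results."""
    if not relevant_doc_ids:
        return 0.0
    top_k = set(ranked_doc_ids[:k])
    return len(top_k & relevant_doc_ids) / len(relevant_doc_ids)

# Example: 2 of 3 relevant docs retrieved within the cutoff.
print(recall_at_k(["d1", "d7", "d9"], {"d1", "d9", "d42"}, k=1000))  # 0.666...
```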
MRR@100
Mean Reciprocal Rank at 100 results:
MRR@100 = mean over queries of 1 / (rank of the first relevant result), counting only the top 100 results; a query with no relevant result in the top 100 scores 0.
Better reflects real-world usage than nDCG@10.
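A minimal sketch of MRR@100 over a batch of queries, using plain Python lists rather than the benchmark's own evaluation code:

```python
def mrr_at_k(rankings: list[list[str]], relevant: list[set[str]], k: int = 100) -> float:
    """Mean reciprocal rank of the first relevant result within the top-k;
    a query with no relevant result in the top-k contributes 0."""
    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(rankings)

# First query: relevant doc at rank 2 -> 0.5; second query: rank 1 -> 1.0; mean = 0.75.
print(mrr_at_k([["a", "b"], ["c", "d"]], [{"b"}, {"c"}]))  # 0.75
```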
Latency Percentiles
Now tracks retrieval speed:
- p50, p95, p99 latencies
- Throughput (queries/second)
- Enables speed-quality tradeoffs
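A minimal sketch of how these latency percentiles and throughput could be collected for your own retriever; the `retrieve_fn` callable is a placeholder, not part of BEIR:

```python
import time
import numpy as np

def measure_latency(retrieve_fn, queries):
    """Time each query individually and report p50/p95/p99 latency plus throughput."""
    latencies = []
    start = time.perf_counter()
    for query in queries:
        t0 = time.perf_counter()
        retrieve_fn(query)  # placeholder: your retriever's single-query call
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    p50, p95, p99 = np.percentile(latencies, [50, 95, 99])
    return {
        "p50_ms": p50 * 1000,
        "p95_ms": p95 * 1000,
        "p99_ms": p99 * 1000,
        "qps": len(queries) / elapsed,
    }
```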
Domain Shift Analysis
BEIR 2.0 includes cross-domain test splits:
- Training domains: Science, News
- Test domains: Legal, Medical, Code
Measures generalization across domains:
| System | In-Domain | Out-of-Domain | Generalization Gap |
|---|---|---|---|
| BM25 | 42.1% | 39.8% | -5.5% |
| DPR | 45.3% | 34.7% | -23.4% |
| BGE-Large | 48.7% | 42.1% | -13.5% |
| Cohere Embed v4 | 51.2% | 47.8% | -6.6% |
Insight: Newer models generalize better across domains.
Leaderboard
Top performers on BEIR 2.0 (average across all datasets):
| Rank | Model | Avg nDCG@10 | Avg Recall@1000 |
|---|---|---|---|
| 1 | Voyage-Large-2 | 54.8% | 89.2% |
| 2 | Cohere Embed v4 | 53.7% | 87.8% |
| 3 | BGE-Large-EN | 52.3% | 86.1% |
| 4 | OpenAI text-embedding-3-large | 51.9% | 85.7% |
| 5 | E5-Mistral-7B | 51.2% | 84.9% |
| 6 | ColBERT-v2 | 49.1% | 88.3% |
| 7 | SBERT (mpnet) | 43.8% | 81.2% |
| 8 | BM25 | 41.2% | 76.8% |
Key Findings
Dense vs. Sparse
Dense retrieval now consistently outperforms BM25:
- 2021 (BEIR 1.0): BM25 competitive
- 2025 (BEIR 2.0): Dense models lead by 10-12 percentage points
Improvement driven by better training and larger models.
Hybrid Search Value
Hybrid (BM25 + Dense) provides modest gains:
- Dense alone: 53.7%
- Hybrid (Dense + BM25): 55.2% (+2.8%)
Diminishing returns as dense models improve.
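One common way to build a hybrid ranking like the one compared above is reciprocal rank fusion (RRF) over the BM25 and dense result lists. A minimal sketch; the `k=60` constant is the usual RRF default, not something specified by BEIR:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked document lists: each doc scores sum(1 / (k + rank))."""
    scores = defaultdict(float)
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d3", "d1", "d8"]
dense_ranking = ["d1", "d5", "d3"]
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))  # ['d1', 'd3', 'd5', 'd8']
```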
Model Size vs. Performance
Scaling laws still apply:
| Model Size | Avg Performance | Cost/1M Tokens |
|---|---|---|
| Small (100M) | 46.2% | $0.01 |
| Base (350M) | 49.8% | $0.05 |
| Large (1B+) | 53.7% | $0.10 |
Roughly tripling model size yields a gain of 3-4 percentage points.
Domain-Specific Models
Fine-tuned domain models outperform general models in-domain:
Medical retrieval:
- General model: 48.3%
- Med-tuned model: 61.7% (+27.7%)
Code search:
- General model: 44.1%
- Code-tuned model: 58.9% (+33.5%)
Recommendation: Fine-tune for specialized domains.
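A hedged sketch of the kind of domain fine-tuning this suggests, using sentence-transformers with in-batch negatives; the base model name and the (query, relevant passage) pairs are illustrative, not from the benchmark:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Illustrative domain-specific training pairs: (query, relevant passage).
train_pairs = [
    ("first-line treatment for type 2 diabetes", "Metformin is typically recommended as first-line therapy ..."),
    ("contraindications for beta blockers", "Beta blockers should be avoided in patients with ..."),
]

model = SentenceTransformer("all-mpnet-base-v2")  # assumed general-purpose base model
train_examples = [InputExample(texts=[query, passage]) for query, passage in train_pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save("mpnet-medical-tuned")
```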
Using BEIR 2.0
Installation
```bash
pip install beir==2.0.0
```
Example
```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval

# Load dataset
dataset = "msmarco-v2"  # or any BEIR 2.0 dataset
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"  # standard BEIR download URL pattern
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")

# Evaluate your model
retriever = YourRetriever()  # your retrieval model
results = retriever.retrieve(corpus, queries)

# Standard metrics
evaluator = EvaluateRetrieval()
metrics = evaluator.evaluate(qrels, results, k_values=[1, 3, 5, 10, 100, 1000])
print(f"NDCG@10: {metrics['NDCG@10']}")
print(f"Recall@1000: {metrics['Recall@1000']}")
```
Adversarial Evaluation
```python
# Load adversarial test set
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test-adversarial")

# Retrieve on the adversarial queries, then evaluate
adv_results = retriever.retrieve(corpus, queries)
adv_metrics = evaluator.evaluate(qrels, adv_results, k_values=[10])

# Compare standard vs. adversarial
print(f"Standard: {metrics['NDCG@10']}")
print(f"Adversarial: {adv_metrics['NDCG@10']}")
print(f"Robustness gap: {metrics['NDCG@10'] - adv_metrics['NDCG@10']}")
```
Implications for RAG
What Changed
- Higher bar: BEIR 2.0 is harder; expect lower absolute scores
- Adversarial robustness matters: Real queries are adversarial
- Domain adaptation critical: General models struggle on specialized domains
- Hybrid value declining: as dense models improve, adding BM25 yields smaller gains
Recommendations
- Benchmark on BEIR 2.0: More realistic than v1
- Test adversarial splits: Measures robustness
- Consider domain fine-tuning: Large gains in specialized fields
- Track Recall@1000: Critical for two-stage retrieval
- Monitor latency: Speed matters in production
Future Plans
BEIR team announced:
- Quarterly updates with new datasets
- Multilingual expansion (currently English-only)
- Multimodal retrieval (images, tables)
- Real-user query distribution
- Continuous leaderboard updates
Resources
- Website: beir.ai
- Paper: "BEIR 2.0: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models"
- GitHub: github.com/beir-cellar/beir
- Leaderboard: beir.ai/leaderboard
Conclusion
BEIR 2.0 raises the bar for retrieval evaluation with more realistic and challenging test scenarios. Systems optimized for BEIR 1.0 should be re-evaluated to ensure they handle adversarial queries and domain shifts effectively.