BEIR Benchmark 2.0 Released with Harder Test Sets and New Evaluation Metrics
Updated BEIR benchmark includes 6 new datasets, adversarial examples, and improved evaluation methodology for more robust retrieval testing.
Announcement
The BEIR (Benchmarking IR) team has released version 2.0 of their widely-used retrieval benchmark, addressing limitations of the original and adding more challenging test scenarios.
What's New
Six New Datasets
- CodeSearchNet-RAG: Code search with natural language queries
- MedQA-Retrieval: Medical question answering
- LegalBench-IR: Legal document retrieval
- MultiHop-V2: Complex multi-hop questions
- TimeQA: Time-sensitive queries
- TableQA: Structured data retrieval
Total datasets: 18 (up from 12)
Adversarial Test Sets
New adversarial examples designed to challenge retrieval systems:
Paraphrase Adversaries
- Same meaning, different wording
- Tests semantic understanding vs. keyword matching
Negation Adversaries
- Queries with negations ("not", "except", "without")
- Tests fine-grained understanding
Entity Swap Adversaries
- Similar entities swapped
- Tests entity disambiguation
Results on adversarial sets:
| System | Original BEIR | BEIR 2.0 (Adversarial) | Gap |
|---|---|---|---|
| BM25 | 41.2% | 28.7% | -30.3% |
| Dense (SBERT) | 43.8% | 35.1% | -19.9% |
| ColBERT | 47.3% | 39.8% | -15.8% |
| Hybrid | 49.1% | 42.3% | -13.8% |
Insight: All systems struggle with adversarial examples; hybrid approaches degrade least.
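The Gap column is the relative drop from the original score, not the absolute difference in points. A minimal sketch of that computation, using the values from the table above:

```python
# Relative degradation on the adversarial split: (adversarial - original) / original.
# Scores are the percentages reported in the table above.
scores = {
    "BM25": (41.2, 28.7),
    "Dense (SBERT)": (43.8, 35.1),
    "ColBERT": (47.3, 39.8),
    "Hybrid": (49.1, 42.3),
}

for system, (original, adversarial) in scores.items():
    gap = (adversarial - original) / original * 100
    print(f"{system}: {gap:+.1f}%")  # e.g. BM25: -30.3%
```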
Enhanced Metrics
Recall@1000
Added to measure coverage for two-stage systems:
Recall@1000: the fraction of relevant documents retrieved within the top 1,000 results.
Critical for reranking pipelines where initial retrieval must have high recall.
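A minimal sketch of how Recall@k can be computed per query from a ranked result list and a set of relevance judgments; the function and document IDs here are illustrative, not the BEIR API:

```python
def recall_at_k(ranked_doc_ids: list[str], relevant_doc_ids: set[str], k: int = 1000) -> float:
    """Fraction of relevant documents that appear in the top-k retrieved results."""
    if not relevant_doc_ids:
        return 0.0
    top_k = set(ranked_doc_ids[:k])
    return len(top_k & relevant_doc_ids) / len(relevant_doc_ids)

# Example: 2 of 3 relevant docs retrieved within the cutoff.
print(recall_at_k(["d1", "d7", "d9"], {"d1", "d9", "d42"}, k=1000))  # 0.666...
```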
MRR@100
Mean Reciprocal Rank at 100 results:
MRR@100 = mean over queries of 1 / (rank of the first relevant result), counting only the top 100 results; a query with no relevant result in the top 100 scores 0.
Better reflects real-world usage than nDCG@10.
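A minimal sketch of MRR@100 over a batch of queries, using plain Python lists rather than the benchmark's own evaluation code:

```python
def mrr_at_k(rankings: list[list[str]], relevant: list[set[str]], k: int = 100) -> float:
    """Mean reciprocal rank of the first relevant result within the top-k;
    a query with no relevant result in the top-k contributes 0."""
    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(rankings)

# First query: relevant doc at rank 2 -> 0.5; second query: rank 1 -> 1.0; mean = 0.75.
print(mrr_at_k([["a", "b"], ["c", "d"]], [{"b"}, {"c"}]))  # 0.75
```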
Latency Percentiles
Now tracks retrieval speed:
- p50, p95, p99 latencies
- Throughput (queries/second)
- Enables speed-quality tradeoffs
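A minimal sketch of how these latency percentiles and throughput could be collected for your own retriever; the `retrieve_fn` callable is a placeholder, not part of BEIR:

```python
import time
import numpy as np

def measure_latency(retrieve_fn, queries):
    """Time each query individually and report p50/p95/p99 latency plus throughput."""
    latencies = []
    start = time.perf_counter()
    for query in queries:
        t0 = time.perf_counter()
        retrieve_fn(query)  # placeholder: your retriever's single-query call
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    p50, p95, p99 = np.percentile(latencies, [50, 95, 99])
    return {
        "p50_ms": p50 * 1000,
        "p95_ms": p95 * 1000,
        "p99_ms": p99 * 1000,
        "qps": len(queries) / elapsed,
    }
```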
Domain Shift Analysis
BEIR 2.0 includes cross-domain test splits:
- Training domains: Science, News
- Test domains: Legal, Medical, Code
Measures generalization across domains:
| System | In-Domain | Out-of-Domain | Generalization Gap |
|---|---|---|---|
| BM25 | 42.1% | 39.8% | -5.5% |
| DPR | 45.3% | 34.7% | -23.4% |
| BGE-Large | 48.7% | 42.1% | -13.5% |
| Cohere Embed v4 | 51.2% | 47.8% | -6.6% |
Insight: Newer models generalize better across domains.
Leaderboard
Top performers on BEIR 2.0 (average across all datasets):
| Rank | Model | Avg nDCG@10 | Avg Recall@1000 |
|---|---|---|---|
| 1 | Voyage-Large-2 | 54.8% | 89.2% |
| 2 | Cohere Embed v4 | 53.7% | 87.8% |
| 3 | BGE-Large-EN | 52.3% | 86.1% |
| 4 | OpenAI text-embedding-3-large | 51.9% | 85.7% |
| 5 | E5-Mistral-7B | 51.2% | 84.9% |
| 6 | ColBERT-v2 | 49.1% | 88.3% |
| 7 | SBERT (mpnet) | 43.8% | 81.2% |
| 8 | BM25 | 41.2% | 76.8% |
Key Findings
Dense vs. Sparse
Dense retrieval now consistently outperforms BM25:
- 2021 (BEIR 1.0): BM25 competitive
- 2025 (BEIR 2.0): Dense models lead by 10-12 percentage points
Improvement driven by better training and larger models.
Hybrid Search Value
Hybrid (BM25 + Dense) provides modest gains:
- Dense alone: 53.7%
- Hybrid (Dense + BM25): 55.2% (+2.8%)
Diminishing returns as dense models improve.
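One common way to build a hybrid ranking like the one compared above is reciprocal rank fusion (RRF) over the BM25 and dense result lists. A minimal sketch; the `k=60` constant is the usual RRF default, not something specified by BEIR:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked document lists: each doc scores sum(1 / (k + rank))."""
    scores = defaultdict(float)
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d3", "d1", "d8"]
dense_ranking = ["d1", "d5", "d3"]
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))  # ['d1', 'd3', 'd5', 'd8']
```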
Model Size vs. Performance
Scaling laws still apply:
| Model Size | Avg Performance | Cost/1M Tokens |
|---|---|---|
| Small (100M) | 46.2% | $0.01 |
| Base (350M) | 49.8% | $0.05 |
| Large (1B+) | 53.7% | $0.10 |
Roughly tripling model size yields a gain of 3-4 percentage points.
Domain-Specific Models
Fine-tuned domain models outperform general models in-domain:
Medical retrieval:
- General model: 48.3%
- Med-tuned model: 61.7% (+27.7%)
Code search:
- General model: 44.1%
- Code-tuned model: 58.9% (+33.5%)
Recommendation: Fine-tune for specialized domains.
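A hedged sketch of the kind of domain fine-tuning this suggests, using sentence-transformers with in-batch negatives; the base model name and the (query, relevant passage) pairs are illustrative, not from the benchmark:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Illustrative domain-specific training pairs: (query, relevant passage).
train_pairs = [
    ("first-line treatment for type 2 diabetes", "Metformin is typically recommended as first-line therapy ..."),
    ("contraindications for beta blockers", "Beta blockers should be avoided in patients with ..."),
]

model = SentenceTransformer("all-mpnet-base-v2")  # assumed general-purpose base model
train_examples = [InputExample(texts=[query, passage]) for query, passage in train_pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save("mpnet-medical-tuned")
```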
Using BEIR 2.0
Installation
```bash
pip install beir==2.0.0
```
Example
```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval

# Load dataset
dataset = "msmarco-v2"  # or any BEIR 2.0 dataset
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"  # standard BEIR download URL pattern
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")

# Evaluate your model
retriever = YourRetriever()  # your retrieval model
results = retriever.retrieve(corpus, queries)

# Standard metrics
evaluator = EvaluateRetrieval()
metrics = evaluator.evaluate(qrels, results, k_values=[1, 3, 5, 10, 100, 1000])
print(f"NDCG@10: {metrics['NDCG@10']}")
print(f"Recall@1000: {metrics['Recall@1000']}")
```
Adversarial Evaluation
```python
# Load adversarial test set
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test-adversarial")

# Retrieve on the adversarial queries, then evaluate
adv_results = retriever.retrieve(corpus, queries)
adv_metrics = evaluator.evaluate(qrels, adv_results, k_values=[10])

# Compare standard vs. adversarial
print(f"Standard: {metrics['NDCG@10']}")
print(f"Adversarial: {adv_metrics['NDCG@10']}")
print(f"Robustness gap: {metrics['NDCG@10'] - adv_metrics['NDCG@10']}")
```
Implications for RAG
What Changed
- Higher bar: BEIR 2.0 is harder; expect lower absolute scores
- Adversarial robustness matters: Real queries are adversarial
- Domain adaptation critical: General models struggle on specialized domains
- Hybrid value declining: as dense models improve, adding BM25 yields smaller gains
Recommendations
- Benchmark on BEIR 2.0: More realistic than v1
- Test adversarial splits: Measures robustness
- Consider domain fine-tuning: Large gains in specialized fields
- Track Recall@1000: Critical for two-stage retrieval
- Monitor latency: Speed matters in production
Future Plans
BEIR team announced:
- Quarterly updates with new datasets
- Multilingual expansion (currently English-only)
- Multimodal retrieval (images, tables)
- Real-user query distribution
- Continuous leaderboard updates
Resources
- Website: beir.ai
- Paper: "BEIR 2.0: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models"
- GitHub: github.com/beir-cellar/beir
- Leaderboard: beir.ai/leaderboard
Conclusion
BEIR 2.0 raises the bar for retrieval evaluation with more realistic and challenging test scenarios. Systems optimized for BEIR 1.0 should be re-evaluated to ensure they handle adversarial queries and domain shifts effectively.