BEIR Benchmark Leaderboard 2025 & 2026: NDCG@10 Scores & Rankings
Complete BEIR leaderboard with NDCG@10 scores. Compare embedding models on retrieval benchmarks. Updated April 2026 with MTEB v2 rankings.
BEIR Leaderboard - Top Retrieval Models (2025 & 2026)
Quick reference table for the top models on the BEIR retrieval benchmark (nDCG@10, zero-shot):
| Rank | Model | nDCG@10 (MTEB Retrieval) | Type | Release |
|---|---|---|---|---|
| 1 | Gemini Embedding 2 | 67.71 | Dense | Mar 2026 |
| 2 | Voyage 4 Large | ~66.0 | Dense (MoE) | Jan 2026 |
| 3 | NV-Embed-v2 | 62.65 | Dense | 2025 |
| 4 | Qwen3-Embedding-8B | ~62.0 | Dense | 2025 |
| 5 | Cohere Embed v4 | ~61.0 | Dense | 2025 |
| 6 | OpenAI text-3-large | ~59.0 | Dense | Jan 2024 |
| 7 | BGE-M3 | ~58.0 | Dense + Sparse | 2024 |
| 8 | ColBERT-v2 | ~55.0 | Late Interaction | 2022 |
| 9 | BM25 | ~42.0 | Sparse | Baseline |
BEIR retrieval scores are part of the broader MTEB leaderboard. Source: MTEB Retrieval subset, April 2026.
What is BEIR?
BEIR (Benchmarking Information Retrieval) is a heterogeneous benchmark for zero-shot evaluation of retrieval models. Created in 2021, it tests models across 18 diverse datasets including MS MARCO, Natural Questions, TREC-COVID, and domain-specific corpora.
The benchmark measures how well models generalize to unseen domains without fine-tuning — a critical capability for real-world RAG applications.
- GitHub: github.com/beir-cellar/beir
- Paper: arXiv:2104.08663
- Datasets: 18 covering search, QA, fact-checking, citation prediction
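All of the scores above are nDCG@10: the discounted gain of the top 10 results, normalized by the best possible ordering. If the metric is unfamiliar, here is a minimal sketch (the graded relevance labels are illustrative):

```python
import math

def ndcg_at_k(relevances, k=10):
    # relevances: graded labels of the returned docs, in ranked order
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([2, 1, 0]))     # 1.0  (ideal ordering)
print(ndcg_at_k([0, 2, 1, 0]))  # ~0.67 (relevant docs ranked too low)
```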
BEIR in 2026: Current Landscape
MTEB Has Superseded BEIR as the Primary Leaderboard
BEIR's 18 retrieval datasets are now a subset of the larger MTEB (Massive Text Embedding Benchmark), which covers 56+ datasets across retrieval, classification, clustering, and more. The MTEB leaderboard on HuggingFace is now the authoritative source for comparing embedding models.
Key differences:
- BEIR: 18 retrieval-only datasets, nDCG@10 metric
- MTEB v1: 56 datasets, 8 task types, average score
- MTEB v2 (2026): Restructured tasks, not directly comparable to v1
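To reproduce scores on individual BEIR datasets yourself, the mteb package exposes them as retrieval tasks. A minimal sketch using the v1-style API (model and task names are illustrative; v2 restructures the task list):

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Any sentence-transformers checkpoint works as the model under test
model = SentenceTransformer("all-MiniLM-L6-v2")

# SciFact and NFCorpus are two of the BEIR datasets exposed as MTEB tasks
evaluation = MTEB(tasks=["SciFact", "NFCorpus"])
evaluation.run(model, output_folder="results")  # writes per-task JSON scores
```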
New Benchmarks Complementing BEIR
Several new benchmarks address BEIR's limitations:
BRIGHT (ICLR 2025)
- Reasoning-intensive retrieval tasks
- The top MTEB retrieval model at the time (nDCG@10 of 59.0) scores only 18.3 on BRIGHT
- Tests complex reasoning rather than lexical matching
Agentset Leaderboard (2026)
- Elo-based scoring from head-to-head comparisons (see the sketch after this list)
- Uses GPT-5 as judge across FiQA, SciFact, MSMARCO, DBPedia
- More robust than single-metric leaderboards
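Agentset's exact judging pipeline isn't spelled out here, but the core Elo mechanics are standard: after each judged head-to-head, the winner takes rating points from the loser in proportion to how surprising the result was. A minimal sketch, assuming the usual K-factor of 32:

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    # score_a: 1.0 if model A's results were judged better, 0.5 tie, 0.0 loss
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1 - score_a) - (1 - expected_a))
    return rating_a, rating_b

# An underdog (1500) beating a favorite (1590) gains about 20 points
print(elo_update(1500, 1590, 1.0))  # (~1520.1, ~1569.9)
```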
Academic Criticism (arXiv:2509.07253)
- BEIR tasks are not all strictly retrieval tasks (citation prediction, fact verification)
- Labeling issues in some datasets
- Limited query complexity
Top Retrieval Models (April 2026)
Gemini Embedding 2 (March 2026) — New #1
Google's first natively multimodal embedding model handles text, images, video, audio, and PDFs in a single 3,072-dim vector space.
- MTEB English: 68.32 | Retrieval: 67.71
- Cross-lingual retrieval: 0.997 (highest tested)
- Pricing: $0.20/M tokens (text), $0.10/M batch (worked cost example below)
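A quick back-of-the-envelope on that text pricing (corpus size and average document length are assumptions for illustration):

```python
# Hypothetical corpus: 1M documents averaging 300 tokens each
docs, avg_tokens = 1_000_000, 300
total_tokens = docs * avg_tokens          # 300M tokens
cost = total_tokens / 1e6 * 0.20          # $0.20 per 1M text tokens (from above)
print(f"one-time embedding cost ~ ${cost:,.0f}")  # ~$60
```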
Voyage 4 Family (January 2026)
Industry-first shared embedding space, built on an MoE architecture: you can mix and match models for queries and documents (see the sketch below).
- Claims +14% over OpenAI 3-large, +8.2% over Cohere v4 on RTEB
- Pricing: $0.12/M (large), $0.06/M (standard)
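Because the family shares one vector space, you can embed documents once with the large model and serve queries with the cheaper one. A sketch with the voyageai client (the voyage-4 model names follow the article; verify them against the current API docs before use):

```python
import voyageai

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

# Embed the corpus once with the large model...
docs = ["BEIR is a zero-shot retrieval benchmark.", "BM25 is a sparse baseline."]
doc_emb = vo.embed(docs, model="voyage-4-large", input_type="document").embeddings

# ...and queries with the smaller model, in the same shared space
query_emb = vo.embed(["what is BEIR?"], model="voyage-4", input_type="query").embeddings
```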
zembed-1 (March 2026)
ZeroEntropy's 4B open-weight model. Achieved 0.946 nDCG@10 on MSMARCO.
- ELO 1590 on Agentset leaderboard (#2)
- Open-weight (commercial license on request)
Established Leaders
- NV-Embed-v2: MTEB 72.31 overall, 62.65 retrieval
- Qwen3-Embedding-8B: MTEB Multilingual 70.58, Apache 2.0
- Cohere Embed v4: 128K context, multimodal (text + images)
- OpenAI text-3-large: MTEB 64.6, no update since January 2024
Key Findings
Dense vs. Sparse
Dense retrieval now consistently outperforms BM25 by 15-25% on BEIR datasets, and the gap has widened significantly since the original 2021 benchmark, where BM25 was competitive.
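For reference, the BM25 baseline in the table is easy to reproduce locally; a minimal sketch with the rank_bm25 package (toy corpus for illustration):

```python
from rank_bm25 import BM25Okapi

corpus = [
    "BEIR benchmarks zero-shot retrieval across 18 datasets",
    "dense embedding models map text to vectors",
    "BM25 scores documents by term frequency and term rarity",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])

# Higher score = better lexical match; no semantic understanding involved
print(bm25.get_scores("zero-shot retrieval benchmark".split()))
```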
Domain Generalization
Models trained on web data still struggle with specialized domains:
| Domain | General model (nDCG@10) | Domain-tuned (nDCG@10) | Relative gain |
|---|---|---|---|
| Medical | ~48% | ~62% | +29% |
| Code | ~44% | ~59% | +34% |
| Legal | ~46% | ~57% | +24% |
Fine-tuning on domain data remains critical for specialized RAG applications.
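What "fine-tuning on domain data" typically looks like in practice is contrastive training on (query, relevant passage) pairs with in-batch negatives. A minimal sketch using sentence-transformers' classic fit API (base model and training pairs are placeholders):

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("BAAI/bge-base-en-v1.5")  # placeholder base model

# (query, relevant passage) pairs mined from your domain corpus
train_examples = [
    InputExample(texts=["treatment for acute MI", "Guidelines recommend..."]),
    InputExample(texts=["statute of limitations NY", "Under New York law..."]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=32)
loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
model.save("domain-tuned-encoder")
```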
Hybrid Search Value
Hybrid retrieval (BM25 + dense) still provides 2-5% gains, especially on out-of-domain queries. While the marginal benefit has decreased as dense models improve, hybrid approaches remain the production standard.
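The article doesn't fix a fusion method; reciprocal rank fusion (RRF) is one common, score-scale-free way to combine the two rankings. A minimal sketch:

```python
def reciprocal_rank_fusion(rankings, k=60):
    # rankings: several ranked lists of doc ids (e.g., one BM25, one dense)
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["d3", "d1", "d7"]
dense_top = ["d1", "d9", "d3"]
print(reciprocal_rank_fusion([bm25_top, dense_top]))  # d1, d3 rise to the top
```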
Using BEIR
Installation
```bash
pip install beir
```
Example
```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval

# Download and load a BEIR dataset
dataset = "msmarco"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_path).load(split="dev")  # MS MARCO is evaluated on its dev split

# Retrieve with your model: results maps query_id -> {doc_id: score}
retriever = YourRetriever()  # placeholder; see the concrete example below
results = retriever.retrieve(corpus, queries)

# Standard metrics: evaluate() returns four dicts (nDCG, MAP, recall, precision)
ndcg, _map, recall, precision = EvaluateRetrieval().evaluate(
    qrels, results, k_values=[1, 3, 5, 10, 100, 1000]
)
print(f"NDCG@10: {ndcg['NDCG@10']}")
print(f"Recall@1000: {recall['Recall@1000']}")
```
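To make `YourRetriever` concrete, BEIR ships wrappers for dense encoders; this is its documented pattern for exact dense search (the checkpoint name is one common public model, not a recommendation):

```python
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Wrap a sentence-transformers checkpoint as a BEIR dense retriever
model = DRES(models.SentenceBERT("msmarco-distilbert-base-tas-b"), batch_size=64)
retriever = EvaluateRetrieval(model, score_function="dot")

# corpus, queries, qrels come from the loader above
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
```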
Implications for RAG
- Use MTEB for model selection: BEIR datasets are part of MTEB — use the HuggingFace leaderboard for up-to-date comparisons
- Test adversarial robustness: BRIGHT benchmark reveals weaknesses that BEIR misses
- Consider domain fine-tuning: 24-34% gains in specialized domains
- Track Recall@1000: Critical for two-stage retrieval with rerankers (see the sketch after this list)
- Monitor latency: Speed matters in production RAG
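On the Recall@1000 point: the first stage only needs to get relevant documents somewhere into the candidate pool; a cross-encoder then reorders that pool. A minimal sketch (the checkpoint name is a common public reranker, used here for illustration):

```python
from sentence_transformers import CrossEncoder

def rerank(query, candidates, top_k=10):
    # Score every (query, candidate) pair jointly, then sort by score
    ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = ce.predict([(query, doc) for doc in candidates])
    return sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)[:top_k]

# `candidates` would be the ~1000 texts from the first-stage retriever
print(rerank("what is BEIR?", ["BEIR is a retrieval benchmark.", "BM25 is sparse."]))
```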
Resources
- BEIR GitHub: github.com/beir-cellar/beir
- MTEB Leaderboard: huggingface.co/spaces/mteb/leaderboard
- BRIGHT Benchmark: brightbenchmark.github.io
- Agentset Leaderboard: agentset.ai/embeddings
- Original Paper: arXiv:2104.08663