News

BEIR Benchmark Leaderboard 2025 & 2026: NDCG@10 Scores & Rankings

April 8, 2026
5 min read
Ailog Research Team

Complete BEIR leaderboard with NDCG@10 scores. Compare embedding models on retrieval benchmarks. Updated April 2026 with MTEB v2 rankings.

BEIR Leaderboard - Top Retrieval Models (2025 & 2026)

Quick reference table for top models on BEIR retrieval benchmark (nDCG@10, zero-shot):

Rank | Model | MTEB Retrieval | Type | Release
1 | Gemini Embedding 2 | 67.71 | Dense | Mar 2026
2 | Voyage 4 Large | ~66.0 | Dense (MoE) | Jan 2026
3 | NV-Embed-v2 | 62.65 | Dense | 2025
4 | Qwen3-Embedding-8B | ~62.0 | Dense | 2025
5 | Cohere Embed v4 | ~61.0 | Dense | 2025
6 | OpenAI text-3-large | ~59.0 | Dense | Jan 2024
7 | BGE-M3 | ~58.0 | Dense + Sparse | 2024
8 | ColBERT-v2 | ~55.0 | Late Interaction | 2022
9 | BM25 | ~42.0 | Sparse | Baseline

BEIR retrieval scores are part of the broader MTEB leaderboard. Source: MTEB Retrieval subset, April 2026.
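
All scores above are nDCG@10 averaged over each dataset's queries: the metric rewards placing relevant documents near the top of the ranked list, with lower positions discounted logarithmically. For readers unfamiliar with it, here is a minimal sketch of the per-query computation (binary relevance assumed):

```python
import math

def ndcg_at_k(relevances, k=10):
    """nDCG@k for one query; `relevances` lists graded relevance in ranked order."""
    def dcg(rels):
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Two relevant documents in total, retrieved at ranks 1 and 4
print(round(ndcg_at_k([1, 0, 0, 1, 0, 0, 0, 0, 0, 0]), 2))  # 0.88
```

A benchmark score is simply this value averaged over every query in the dataset.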


What is BEIR?

BEIR (Benchmarking Information Retrieval) is a heterogeneous benchmark for zero-shot evaluation of retrieval models. Created in 2021, it tests models across 18 diverse datasets including MS MARCO, Natural Questions, TREC-COVID, and domain-specific corpora.

The benchmark measures how well models generalize to unseen domains without fine-tuning — a critical capability for real-world RAG applications.

BEIR in 2026: Current Landscape

MTEB Has Superseded BEIR as the Primary Leaderboard

BEIR's 18 retrieval datasets are now a subset of the larger MTEB (Massive Text Embedding Benchmark), which covers 56+ tasks across retrieval, classification, clustering, and more. The MTEB leaderboard on HuggingFace is now the authoritative source for comparing embedding models.

Key differences:

  • BEIR: 18 retrieval-only datasets, nDCG@10 metric
  • MTEB v1: 56 datasets, 8 task types, average score
  • MTEB v2 (2026): Restructured tasks, not directly comparable to v1
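
To score your own embedding model on the BEIR retrieval tasks that live inside MTEB, the sketch below uses the mteb and sentence-transformers packages. The model and task names are arbitrary examples, and the exact API varies between mteb versions:

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Any model exposing an encode() method works; this one is just an example
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# SciFact and NFCorpus are two of the BEIR retrieval datasets included in MTEB
evaluation = MTEB(tasks=["SciFact", "NFCorpus"])
results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
```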

New Benchmarks Complementing BEIR

Several new benchmarks address BEIR's limitations:

BRIGHT (ICLR 2025)

  • Reasoning-intensive retrieval tasks
  • The top MTEB retrieval model at the time (59.0 nDCG@10) scored only 18.3 on BRIGHT
  • Tests complex reasoning rather than lexical matching

Agentset Leaderboard (2026)

  • ELO-based scoring with head-to-head comparisons (see the sketch after this list)
  • Uses GPT-5 as judge across FiQA, SciFact, MSMARCO, DBPedia
  • More robust than single-metric leaderboards
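
The Elo mechanism behind such head-to-head leaderboards is simple: after each judged comparison, the winner takes rating points from the loser in proportion to how surprising the result was. A minimal sketch follows; the K-factor of 32 is an assumption, as Agentset's exact parameters are not covered here:

```python
def elo_update(r_a, r_b, a_wins, k=32):
    """Update Elo ratings for models A and B after one judged comparison."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    score_a = 1.0 if a_wins else 0.0
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1 - score_a) - (1 - expected_a))
    return r_a_new, r_b_new

# A 1590-rated model beats a 1500-rated model: the winner gains ~12 points,
# the loser drops by the same amount
print(elo_update(1590, 1500, a_wins=True))
```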

Academic Criticism (arXiv:2509.07253)

  • BEIR tasks are not all strictly retrieval tasks (citation prediction, fact verification)
  • Labeling issues in some datasets
  • Limited query complexity

Top Retrieval Models (April 2026)

Gemini Embedding 2 (March 2026) — New #1

Google's first natively multimodal embedding model handles text, images, video, audio, and PDFs in a single 3,072-dim vector space.

  • MTEB English: 68.32 | Retrieval: 67.71
  • Cross-lingual retrieval: 0.997 (highest tested)
  • Pricing: $0.20/M tokens (text), $0.10/M batch

Voyage 4 Family (January 2026)

Industry-first shared embedding space built on an MoE architecture: models in the family can be mixed and matched, for example a smaller model for queries and a larger one for documents.

  • Claims +14% over OpenAI 3-large, +8.2% over Cohere v4 on RTEB
  • Pricing: $0.12/M (large), $0.06/M (standard)

zembed-1 (March 2026)

ZeroEntropy's 4B open-weight model. Achieved 0.946 nDCG@10 on MSMARCO.

  • ELO 1590 on Agentset leaderboard (#2)
  • Open-weight (commercial license on request)

Established Leaders

  • NV-Embed-v2: MTEB 72.31 overall, strong retrieval
  • Qwen3-Embedding-8B: MTEB Multilingual 70.58, Apache 2.0
  • Cohere Embed v4: 128K context, multimodal (text + images)
  • OpenAI text-3-large: MTEB 64.6, no update since January 2024

Key Findings

Dense vs. Sparse

Dense retrieval now consistently outperforms BM25 by 15-25% on BEIR datasets. The gap has widened significantly since the original 2021 benchmark where BM25 was competitive.
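
As an illustration of the dense side of that comparison, here is a minimal retrieval sketch with a Sentence Transformers bi-encoder (the model choice is arbitrary): documents and queries are embedded into the same vector space and ranked by cosine similarity.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

docs = [
    "BM25 ranks documents by term frequency and inverse document frequency.",
    "Dense retrievers embed queries and documents into a shared vector space.",
]
query = "How does lexical ranking work?"

doc_emb = model.encode(docs, convert_to_tensor=True, normalize_embeddings=True)
query_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)

# Cosine similarity between the query and every document, highest first
scores = util.cos_sim(query_emb, doc_emb)[0]
for score, doc in sorted(zip(scores.tolist(), docs), reverse=True):
    print(f"{score:.3f}  {doc}")
```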

Domain Generalization

Models trained on web data still struggle with specialized domains:

Domain | General Model | Domain-Tuned | Improvement
Medical | ~48% | ~62% | +29%
Code | ~44% | ~59% | +34%
Legal | ~46% | ~57% | +24%

Fine-tuning on domain data remains critical for specialized RAG applications.
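
A common recipe is contrastive fine-tuning on in-domain (query, relevant passage) pairs. The sketch below uses Sentence Transformers with MultipleNegativesRankingLoss; the training pairs and hyperparameters are placeholders, and the older fit() API is assumed:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# In-domain (query, relevant passage) pairs; other passages in the batch act as negatives
train_examples = [
    InputExample(texts=["myocardial infarction symptoms",
                        "Chest pain and shortness of breath are common early signs."]),
    InputExample(texts=["statute of limitations for fraud",
                        "Most jurisdictions allow several years to bring a fraud claim."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```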

Hybrid Search Value

Hybrid retrieval (BM25 + dense) still provides 2-5% gains, especially on out-of-domain queries. While the marginal benefit has decreased as dense models improve, hybrid approaches remain the production standard.
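
One common way to combine the two signals is reciprocal rank fusion, which merges ranked lists without having to calibrate BM25 scores against cosine similarities. A minimal sketch (k=60 is the conventional default):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """rankings: ranked doc-id lists (e.g., one from BM25, one from a dense model)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc3", "doc1", "doc7"]
dense_ranking = ["doc1", "doc9", "doc3"]
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
# Documents ranked highly by both lists (doc1, doc3) rise to the top
```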

Using BEIR

Installation

```bash
pip install beir
```

Example

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval

# Download and load a BEIR dataset
dataset = "msmarco"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")

# Retrieve with your own model (placeholder: any retriever that returns
# results as {query_id: {doc_id: score}})
retriever = YourRetriever()
results = retriever.retrieve(corpus, queries)

# Standard metrics
evaluator = EvaluateRetrieval()
ndcg, _map, recall, precision = evaluator.evaluate(qrels, results,
                                                   k_values=[1, 3, 5, 10, 100, 1000])
print(f"NDCG@10: {ndcg['NDCG@10']}")
print(f"Recall@1000: {recall['Recall@1000']}")
```

Implications for RAG

  1. Use MTEB for model selection: BEIR datasets are part of MTEB — use the HuggingFace leaderboard for up-to-date comparisons
  2. Test adversarial robustness: BRIGHT benchmark reveals weaknesses that BEIR misses
  3. Consider domain fine-tuning: 24-34% gains in specialized domains
  4. Track Recall@1000: Critical for two-stage retrieval with rerankers (see the sketch after this list)
  5. Monitor latency: Speed matters in production RAG
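
On point 4 above: in a two-stage pipeline the first-stage retriever only needs to surface relevant documents somewhere in its top candidates, because a reranker reorders that set. A minimal sketch using a public cross-encoder reranker (the model name is one example, and the candidate list stands in for your first-stage results):

```python
from sentence_transformers import CrossEncoder

query = "What does nDCG@10 measure?"
candidates = [
    "nDCG@10 scores how well relevant documents are placed in the top 10 results.",
    "BM25 is a sparse lexical ranking function.",
    "Reciprocal rank fusion merges multiple ranked lists.",
]  # in practice: the top-k documents returned by the first-stage retriever

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc) for doc in candidates])

# Re-order candidates by reranker score, highest first
for score, doc in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.2f}  {doc}")
```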

FAQ

Q: What is BEIR?
A: BEIR (Benchmarking Information Retrieval) is a heterogeneous benchmark for zero-shot evaluation of retrieval models across 18 diverse datasets including MS MARCO, Natural Questions, and domain-specific corpora like TREC-COVID and SciFact.

Q: Which model currently leads retrieval benchmarks?
A: As of April 2026, Gemini Embedding 2 leads retrieval benchmarks with 67.71 on the MTEB retrieval subset, followed by Voyage 4 Large and NV-Embed-v2. The landscape has shifted significantly with multimodal and MoE models entering the top positions.

Q: Is BEIR still relevant in 2026?
A: BEIR remains valuable for measuring zero-shot retrieval generalization, but it is now part of the larger MTEB benchmark. New benchmarks like BRIGHT (reasoning-intensive retrieval) and Agentset (ELO-based) complement BEIR for more comprehensive evaluation.

Q: What is the difference between BEIR and MTEB?
A: BEIR focuses specifically on information retrieval across 18 datasets. MTEB is broader, covering 56+ datasets across 8 task types including retrieval, classification, clustering, and more. BEIR's retrieval datasets are a subset of MTEB's retrieval tasks.

Q: Should I use BEIR or MTEB to compare models?
A: Use MTEB — it includes all BEIR datasets plus additional retrieval benchmarks. The MTEB leaderboard on HuggingFace provides the most comprehensive and up-to-date comparison. Use BRIGHT additionally if your application requires reasoning-intensive retrieval.

Tags

benchmarks, evaluation, research, BEIR, NDCG, leaderboard, 2025, 2026
