Hugging Face: New Open-Source RAG Models
Hugging Face releases a new family of models optimized for RAG: embeddings, rerankers, and specialized LLMs. Complete overview.
Hugging Face Enriches Open-Source RAG Ecosystem
Hugging Face has announced a new family of models optimized for RAG applications. The release includes embedding models, rerankers, and LLMs adapted for retrieval-augmented generation.
"Our goal is to democratize enterprise-grade RAG," explains Clement Delangue, CEO of Hugging Face. "These models offer performance comparable to proprietary solutions, in open-source."
The New Models
Embeddings: HF-RAG-Embed
A new family of RAG-optimized embedding models:
| Model | Dimensions | Context | MTEB Score | License |
|---|---|---|---|---|
| hf-rag-embed-small | 384 | 512 | 62.1 | Apache 2.0 |
| hf-rag-embed-base | 768 | 2048 | 65.8 | Apache 2.0 |
| hf-rag-embed-large | 1024 | 8192 | 68.4 | Apache 2.0 |
| hf-rag-embed-xl | 2048 | 16384 | 70.2 | Apache 2.0 |
Features:
- Specifically trained for document retrieval
- Native support for asymmetric retrieval (distinct query and document encodings)
- Optimized for multilingual (100 languages)
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("huggingface/hf-rag-embed-large")

# Document embeddings
doc_embeddings = model.encode(
    documents,
    prompt_name="document"  # automatic prefix
)

# Query embeddings
query_embedding = model.encode(
    query,
    prompt_name="query"
)
```
Check our guide on choosing embedding models.
Rerankers: HF-RAG-Rerank
Lightweight, high-performance reranking models:
| Model | Parameters | Latency (P50) | nDCG@10 |
|---|---|---|---|
| hf-rag-rerank-tiny | 33M | 5ms | 58.2 |
| hf-rag-rerank-small | 110M | 12ms | 64.7 |
| hf-rag-rerank-base | 330M | 28ms | 68.9 |
| hf-rag-rerank-large | 560M | 45ms | 71.3 |
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model = AutoModelForSequenceClassification.from_pretrained(
    "huggingface/hf-rag-rerank-base"
)
tokenizer = AutoTokenizer.from_pretrained(
    "huggingface/hf-rag-rerank-base"
)

# Score each (query, document) pair with the cross-encoder
inputs = tokenizer(
    [query] * len(candidate_docs),
    candidate_docs,
    padding=True,
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    scores = model(**inputs).logits.squeeze()
```
These models perfectly complement our guide on reranking.
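To illustrate where a reranker fits, here is a minimal two-stage retrieval sketch: a cheap embedding pass shortlists candidates, then the cross-encoder rescores that shortlist. The document names and scores are illustrative stand-ins for the model outputs shown above.

```python
# Minimal two-stage retrieval sketch: dense retrieval followed by reranking.
# All scores below are illustrative stand-ins, not real model outputs.

def top_k(scored, k):
    """Return the k highest-scoring (doc, score) pairs."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Stage 1: embedding similarity over the whole corpus (cheap, approximate)
candidates = top_k(
    [("doc-a", 0.71), ("doc-b", 0.69), ("doc-c", 0.64), ("doc-d", 0.40)],
    k=3,
)

# Stage 2: cross-encoder reranking of the shortlist (expensive, precise)
rerank_scores = {"doc-a": 2.1, "doc-b": 4.7, "doc-c": 3.0}
final = top_k([(doc, rerank_scores[doc]) for doc, _ in candidates], k=2)
print([doc for doc, _ in final])  # → ['doc-b', 'doc-c']
```

The key design point: the reranker never sees `doc-d`, which the embedding stage already filtered out, which is what keeps reranking latency bounded.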
LLMs: HF-RAG-LLM
LLMs specially adapted for RAG generation:
| Model | Parameters | Context | RAGBench Score |
|---|---|---|---|
| hf-rag-llm-7b | 7B | 32K | 72.4 |
| hf-rag-llm-13b | 13B | 64K | 76.8 |
| hf-rag-llm-34b | 34B | 128K | 81.2 |
Unique features:
- Trained to systematically cite sources
- Reduced hallucination rate
- Instruction-following optimized for RAG
```python
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="huggingface/hf-rag-llm-13b"
)

response = generator(
    f"""<context>
{retrieved_documents}
</context>

<question>
{user_question}
</question>

Answer by citing your sources with [1], [2], etc."""
)
```
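Since the model is prompted to emit `[1]`, `[2]`-style markers, the application still has to map those markers back to the retrieved documents. A hypothetical helper for that step (the function name and regex are our own, not part of the HF-RAG release):

```python
import re

# Hypothetical helper: pull the [n] citation markers out of a generated
# answer so they can be mapped back to the retrieved documents.
def extract_citations(answer):
    return sorted({int(m) for m in re.findall(r"\[(\d+)\]", answer)})

print(extract_citations("RAG combines retrieval [1] with generation [2][1]."))
# → [1, 2]
```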
Benchmarks
Comparison with the Competition
Embeddings (MTEB Retrieval)
| Model | Score | Latency | Open-source |
|---|---|---|---|
| hf-rag-embed-large | 68.4 | 15ms | Yes |
| Cohere Embed v5 | 71.2 | 45ms | No |
| text-embedding-3-large | 67.4 | 40ms | No |
| BGE-M3 | 64.8 | 12ms | Yes |
Rerankers
| Model | nDCG@10 | Latency | Open-source |
|---|---|---|---|
| hf-rag-rerank-base | 68.9 | 28ms | Yes |
| Cohere Rerank 3 | 72.1 | 35ms | No |
| ms-marco-MiniLM | 64.2 | 8ms | Yes |
LLMs (RAGBench)
| Model | Score | Hallucinations | Open-source |
|---|---|---|---|
| hf-rag-llm-34b | 81.2 | 2.8% | Yes |
| GPT-4 Turbo | 84.5 | 2.4% | No |
| Claude 3 Opus | 86.1 | 1.8% | No |
| Mixtral 8x22B | 78.4 | 4.1% | Yes |
Deployment
Deployment Options
1. Hugging Face Inference Endpoints
```python
from huggingface_hub import InferenceClient

client = InferenceClient(model="huggingface/hf-rag-embed-large")
embeddings = client.feature_extraction(texts)
```
Pricing: from $0.06/hour (basic GPU) to $0.60/hour (high-performance GPU)
2. Self-hosted with vLLM
```bash
pip install vllm
python -m vllm.entrypoints.openai.api_server \
    --model huggingface/hf-rag-llm-13b \
    --port 8000
```
3. Optimization with ONNX
```python
from optimum.onnxruntime import ORTModelForSequenceClassification

model = ORTModelForSequenceClassification.from_pretrained(
    "huggingface/hf-rag-rerank-base",
    export=True
)
```
Performance gain: 2-3x faster inference on CPU
For production configurations, check our guide on production deployment.
Quantization
Models are available in quantized versions:
| Quantization | Size | Quality Loss |
|---|---|---|
| FP16 | 100% | 0% |
| INT8 | 50% | -0.5% |
| INT4 | 25% | -2% |
| GPTQ | 25% | -1.5% |
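The size ratios in the table translate directly into memory requirements. A back-of-the-envelope sketch, assuming the standard 2 bytes per parameter at FP16 (the byte count is our assumption, not a figure from the release):

```python
# Back-of-the-envelope weight memory for the quantization table above.
# Assumes FP16 = 2 bytes/parameter; INT8/INT4 follow the 50% / 25% ratios.

BYTES_PER_PARAM_FP16 = 2

def footprint_gb(n_params_billion, ratio):
    """Approximate weight memory in GB for a given size ratio vs FP16."""
    return n_params_billion * 1e9 * BYTES_PER_PARAM_FP16 * ratio / 1e9

for name, ratio in [("FP16", 1.0), ("INT8", 0.5), ("INT4", 0.25)]:
    print(f"hf-rag-llm-13b {name}: ~{footprint_gb(13, ratio):.1f} GB")
# FP16 ≈ 26 GB, INT8 ≈ 13 GB, INT4 ≈ 6.5 GB (weights only, excluding KV cache)
```

At INT4, the 13B model fits on a single consumer GPU, which is the practical point of the quantized releases.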
Integration with Frameworks
LangChain
```python
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="huggingface/hf-rag-embed-large"
)
```
LlamaIndex
```python
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(
    model_name="huggingface/hf-rag-embed-large"
)
```
Ailog
HF-RAG models are integrated as an option in the Ailog configuration.
Recommended Use Cases
When to Use HF-RAG
Ideal for:
- Data sovereignty constraints
- Limited budget (self-hosting)
- Need for customization/fine-tuning
- High request volume
Less suitable for:
- Teams without ML expertise
- Need for absolute best quality
- Rapid prototypes
Fine-tuning
Models are designed for fine-tuning:
```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./fine-tuned-rag-embed",
    per_device_train_batch_size=32,
    num_train_epochs=3
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=domain_dataset
)
trainer.train()
```
Check our guide on fine-tuning embeddings.
Our Take
This release represents a major advance for the open-source ecosystem:
Strengths:
- Performance close to proprietary models
- Permissive license (Apache 2.0)
- Models optimized for RAG
- Excellent documentation
Points of attention:
- Requires expertise to deploy
- Infrastructure costs if self-hosted
- No commercial support
For organizations with sovereignty constraints or high volumes, HF-RAG becomes a credible alternative to proprietary solutions.
Platforms like Ailog allow using these models without managing infrastructure, combining open-source with simplicity.
Check our RAG introduction guide to get started.