Hugging Face: New Open-Source RAG Models
Hugging Face releases a new family of models optimized for RAG: embeddings, rerankers, and specialized LLMs. Complete overview.
Hugging Face Enriches Open-Source RAG Ecosystem
Hugging Face has announced a new family of models optimized for RAG applications. The release includes embedding models, rerankers, and LLMs adapted for retrieval-augmented generation.
"Our goal is to democratize enterprise-grade RAG," explains Clement Delangue, CEO of Hugging Face. "These models offer performance comparable to proprietary solutions, in open-source."
The New Models
Embeddings: HF-RAG-Embed
A new family of RAG-optimized embedding models:
| Model | Dimensions | Context | MTEB Score | License |
|---|---|---|---|---|
| hf-rag-embed-small | 384 | 512 | 62.1 | Apache 2.0 |
| hf-rag-embed-base | 768 | 2048 | 65.8 | Apache 2.0 |
| hf-rag-embed-large | 1024 | 8192 | 68.4 | Apache 2.0 |
| hf-rag-embed-xl | 2048 | 16384 | 70.2 | Apache 2.0 |
Features:
- Specifically trained for document retrieval
- Native support for asymmetric retrieval (distinct query and document encodings)
- Optimized for multilingual (100 languages)
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("huggingface/hf-rag-embed-large")

# Document embeddings
doc_embeddings = model.encode(
    documents,
    prompt_name="document"  # automatic prefix
)

# Query embeddings
query_embedding = model.encode(
    query,
    prompt_name="query"
)
```
Check our guide on choosing embedding models.
Rerankers: HF-RAG-Rerank
Lightweight, high-performance reranking models:
| Model | Parameters | Latency (P50) | nDCG@10 |
|---|---|---|---|
| hf-rag-rerank-tiny | 33M | 5ms | 58.2 |
| hf-rag-rerank-small | 110M | 12ms | 64.7 |
| hf-rag-rerank-base | 330M | 28ms | 68.9 |
| hf-rag-rerank-large | 560M | 45ms | 71.3 |
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model = AutoModelForSequenceClassification.from_pretrained(
    "huggingface/hf-rag-rerank-base"
)
tokenizer = AutoTokenizer.from_pretrained(
    "huggingface/hf-rag-rerank-base"
)

# Score each (query, document) pair with the cross-encoder
inputs = tokenizer(
    [query] * len(candidate_docs),
    candidate_docs,
    padding=True,
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    scores = model(**inputs).logits.squeeze()
```
These models perfectly complement our guide on reranking.
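To illustrate where a reranker fits, here is a minimal two-stage retrieval sketch: a cheap embedding pass shortlists candidates, then the cross-encoder rescores that shortlist. The document names and scores are illustrative stand-ins for the model outputs shown above.

```python
# Minimal two-stage retrieval sketch: dense retrieval followed by reranking.
# All scores below are illustrative stand-ins, not real model outputs.

def top_k(scored, k):
    """Return the k highest-scoring (doc, score) pairs."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Stage 1: embedding similarity over the whole corpus (cheap, approximate)
candidates = top_k(
    [("doc-a", 0.71), ("doc-b", 0.69), ("doc-c", 0.64), ("doc-d", 0.40)],
    k=3,
)

# Stage 2: cross-encoder reranking of the shortlist (expensive, precise)
rerank_scores = {"doc-a": 2.1, "doc-b": 4.7, "doc-c": 3.0}
final = top_k([(doc, rerank_scores[doc]) for doc, _ in candidates], k=2)
print([doc for doc, _ in final])  # → ['doc-b', 'doc-c']
```

The key design point: the reranker never sees `doc-d`, which the embedding stage already filtered out, which is what keeps reranking latency bounded.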
LLMs: HF-RAG-LLM
LLMs specially adapted for RAG generation:
| Model | Parameters | Context | RAGBench Score |
|---|---|---|---|
| hf-rag-llm-7b | 7B | 32K | 72.4 |
| hf-rag-llm-13b | 13B | 64K | 76.8 |
| hf-rag-llm-34b | 34B | 128K | 81.2 |
Unique features:
- Trained to systematically cite sources
- Reduced hallucination rate
- Instruction-following optimized for RAG
```python
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="huggingface/hf-rag-llm-13b"
)

response = generator(
    f"""<context>
{retrieved_documents}
</context>

<question>
{user_question}
</question>

Answer by citing your sources with [1], [2], etc."""
)
```
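Since the model is prompted to emit `[1]`, `[2]`-style markers, the application still has to map those markers back to the retrieved documents. A hypothetical helper for that step (the function name and regex are our own, not part of the HF-RAG release):

```python
import re

# Hypothetical helper: pull the [n] citation markers out of a generated
# answer so they can be mapped back to the retrieved documents.
def extract_citations(answer):
    return sorted({int(m) for m in re.findall(r"\[(\d+)\]", answer)})

print(extract_citations("RAG combines retrieval [1] with generation [2][1]."))
# → [1, 2]
```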
Benchmarks
Comparison with the Competition
Embeddings (MTEB Retrieval)
| Model | Score | Latency | Open-source |
|---|---|---|---|
| hf-rag-embed-large | 68.4 | 15ms | Yes |
| Cohere Embed v5 | 71.2 | 45ms | No |
| text-embedding-3-large | 67.4 | 40ms | No |
| BGE-M3 | 64.8 | 12ms | Yes |
Rerankers
| Model | nDCG@10 | Latency | Open-source |
|---|---|---|---|
| hf-rag-rerank-base | 68.9 | 28ms | Yes |
| Cohere Rerank 3 | 72.1 | 35ms | No |
| ms-marco-MiniLM | 64.2 | 8ms | Yes |
LLMs (RAGBench)
| Model | Score | Hallucinations | Open-source |
|---|---|---|---|
| hf-rag-llm-34b | 81.2 | 2.8% | Yes |
| GPT-4 Turbo | 84.5 | 2.4% | No |
| Claude 3 Opus | 86.1 | 1.8% | No |
| Mixtral 8x22B | 78.4 | 4.1% | Yes |
Deployment
Deployment Options
1. Hugging Face Inference Endpoints
```python
from huggingface_hub import InferenceClient

client = InferenceClient(model="huggingface/hf-rag-embed-large")
embeddings = client.feature_extraction(texts)
```
Pricing: from $0.06/hour (basic GPU) to $0.60/hour (high-performance GPU)
2. Self-hosted with vLLM
```bash
pip install vllm
python -m vllm.entrypoints.openai.api_server \
    --model huggingface/hf-rag-llm-13b \
    --port 8000
```
3. Optimization with ONNX
```python
from optimum.onnxruntime import ORTModelForSequenceClassification

model = ORTModelForSequenceClassification.from_pretrained(
    "huggingface/hf-rag-rerank-base",
    export=True
)
```
Performance gain: 2-3x faster inference on CPU
For production configurations, check our guide on production deployment.
Quantization
Models are available in quantized versions:
| Quantization | Size | Quality Loss |
|---|---|---|
| FP16 | 100% | 0% |
| INT8 | 50% | -0.5% |
| INT4 | 25% | -2% |
| GPTQ | 25% | -1.5% |
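The size ratios in the table translate directly into memory requirements. A back-of-the-envelope sketch, assuming the standard 2 bytes per parameter at FP16 (the byte count is our assumption, not a figure from the release):

```python
# Back-of-the-envelope weight memory for the quantization table above.
# Assumes FP16 = 2 bytes/parameter; INT8/INT4 follow the 50% / 25% ratios.

BYTES_PER_PARAM_FP16 = 2

def footprint_gb(n_params_billion, ratio):
    """Approximate weight memory in GB for a given size ratio vs FP16."""
    return n_params_billion * 1e9 * BYTES_PER_PARAM_FP16 * ratio / 1e9

for name, ratio in [("FP16", 1.0), ("INT8", 0.5), ("INT4", 0.25)]:
    print(f"hf-rag-llm-13b {name}: ~{footprint_gb(13, ratio):.1f} GB")
# FP16 ≈ 26 GB, INT8 ≈ 13 GB, INT4 ≈ 6.5 GB (weights only, excluding KV cache)
```

At INT4, the 13B model fits on a single consumer GPU, which is the practical point of the quantized releases.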
Integration with Frameworks
LangChain
```python
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="huggingface/hf-rag-embed-large"
)
```
LlamaIndex
```python
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(
    model_name="huggingface/hf-rag-embed-large"
)
```
Ailog
HF-RAG models are integrated as an option in the Ailog configuration.
Recommended Use Cases
When to Use HF-RAG
Ideal for:
- Data sovereignty constraints
- Limited budget (self-hosting)
- Need for customization/fine-tuning
- High request volume
Less suitable for:
- Teams without ML expertise
- Need for absolute best quality
- Rapid prototypes
Fine-tuning
Models are designed for fine-tuning:
```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./fine-tuned-rag-embed",
    per_device_train_batch_size=32,
    num_train_epochs=3
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=domain_dataset
)
trainer.train()
```
Check our guide on fine-tuning embeddings.
Our Take
This release represents a major advance for the open-source ecosystem:
Strengths:
- Performance close to proprietary models
- Permissive license (Apache 2.0)
- Models optimized for RAG
- Excellent documentation
Points of attention:
- Requires expertise to deploy
- Infrastructure costs if self-hosted
- No commercial support
For organizations with sovereignty constraints or high volumes, HF-RAG becomes a credible alternative to proprietary solutions.
Platforms like Ailog allow using these models without managing infrastructure, combining open-source with simplicity.
Check our RAG introduction guide to get started.