Dense Retrieval: Semantic Search with Embeddings

March 5, 2026
Ailog Team

Master dense retrieval for high-performance semantic search. Embeddings, models, vector indexing, and advanced optimizations explained.

Dense retrieval transforms information search by capturing the deep meaning of queries and documents. Unlike lexical search that compares words, dense retrieval compares concepts. This guide dives deep into the mechanisms, models, and techniques for implementing high-performance semantic search in your RAG systems.

What is Dense Retrieval?

Dense retrieval represents each text as a dense vector of real numbers, typically between 384 and 4096 dimensions. These vectors, called embeddings, capture the semantics of text in a mathematical space where similar concepts are close together.

Difference from Sparse Retrieval

| Characteristic | Dense Retrieval | Sparse Retrieval |
|---|---|---|
| Representation | Dense vectors (384-4096 dim) | Sparse vectors (vocabulary) |
| Matching | Semantic | Lexical |
| "Car" = "Automobile" | Yes | No |
| Rare terms | Less performant | Excellent |
| Typos and variants | Robust | Sensitive |

Dense retrieval excels when users phrase their queries differently from the source content. "How to cancel my order" will find "Purchase cancellation procedure" even without common words.
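A minimal sketch of this behavior with sentence-transformers (model choice and example phrases are illustrative; actual scores will vary):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-m3')

query = "How to cancel my order"
doc = "Purchase cancellation procedure"
unrelated = "Shipping times for international delivery"

# Encode and L2-normalize so the dot product equals cosine similarity
q, d, u = model.encode([query, doc, unrelated], normalize_embeddings=True)

print(np.dot(q, d))  # high: same intent, no shared words
print(np.dot(q, u))  # noticeably lower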

Dense Retrieval System Architecture

┌──────────────────────────────────────────────────────────────┐
│                    Dense Retrieval Pipeline                   │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│   Documents          Embedding         Vector Database       │
│   ┌─────────┐        Model            ┌──────────────┐      │
│   │ Doc 1   │──┐    ┌───────┐    ┌───│   Qdrant     │      │
│   │ Doc 2   │──┼───▶│ E5    │────┼───│   Pinecone   │      │
│   │ Doc 3   │──┘    │ BGE   │    │   │   Weaviate   │      │
│   └─────────┘       └───────┘    │   └──────────────┘      │
│                                   │          │              │
│   Query                          │          │ ANN Search   │
│   ┌─────────┐                    │          ▼              │
│   │"How to  │────────────────────┘   ┌──────────────┐      │
│   │ cancel" │                        │ Top-K docs   │      │
│   └─────────┘                        └──────────────┘      │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Embedding Models: 2024 Comparison

Your choice of embedding model directly impacts retrieval quality. Here are current options ranked by MTEB benchmark performance.

Open Source Models

from sentence_transformers import SentenceTransformer

# BGE-M3: Best quality/performance ratio for multilingual
model_bge = SentenceTransformer('BAAI/bge-m3')

# E5-Large: Excellent general purpose
model_e5 = SentenceTransformer('intfloat/multilingual-e5-large')

# GTE-Large: Performant alternative
model_gte = SentenceTransformer('thenlper/gte-large')
| Model | Dimensions | MTEB Score | Speed | Use Case |
|---|---|---|---|---|
| BGE-M3 | 1024 | 64.2 | Medium | Production multilingual |
| E5-Large-v2 | 1024 | 63.5 | Medium | General purpose |
| GTE-Large | 1024 | 62.1 | Fast | High volume |
| All-MiniLM | 384 | 56.3 | Very fast | Prototyping |

Proprietary Models

import openai
import cohere

# OpenAI text-embedding-3
client = openai.OpenAI()
response = client.embeddings.create(
    model="text-embedding-3-large",
    input="How do I cancel my order?",
    dimensions=1536  # Configurable up to 3072
)
embedding = response.data[0].embedding

# Cohere Embed v3
co = cohere.Client()
response = co.embed(
    texts=["How do I cancel my order?"],
    model="embed-multilingual-v3.0",
    input_type="search_query"
)
| Model | Dimensions | Performance | Cost / 1M tokens |
|---|---|---|---|
| text-embedding-3-large | 3072 | Excellent | $0.13 |
| text-embedding-3-small | 1536 | Very good | $0.02 |
| Cohere embed-v3 | 1024 | Excellent | $0.10 |
| Voyage-2 | 1024 | State of the art | $0.12 |

Optimizing Embedding Quality

Query/Passage Prefixes

Some models require prefixes to distinguish queries from documents:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('intfloat/multilingual-e5-large')

# IMPORTANT: Prefix correctly
query = "query: How does the warranty work?"
passages = [
    "passage: The warranty covers defects for 2 years.",
    "passage: Contact support for any claims."
]

query_embedding = model.encode(query)
passage_embeddings = model.encode(passages)

Without these prefixes, performance can drop 5-15 points on benchmarks.

Embedding Normalization

To use dot product instead of cosine similarity (faster):

import numpy as np

def normalize(embeddings):
    # Scale each vector to unit length (L2 norm = 1)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / norms

# Normalized embeddings
query_norm = normalize(query_embeddings)
doc_norm = normalize(doc_embeddings)

# Now dot product = cosine similarity
similarity = np.dot(query_norm, doc_norm.T)

Matryoshka Embeddings

Recent models support "Matryoshka embeddings": embeddings where the first dimensions are the most informative.

# With text-embedding-3, you can reduce dimensions at encoding time
response = client.embeddings.create(
    model="text-embedding-3-large",
    input=texts,
    dimensions=512  # Instead of 3072, with minimal loss
)

# Fast search with reduced dimensions,
# then reranking with full dimensions

This enables configurable speed/precision tradeoffs.
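A minimal sketch of that two-stage pattern (dimension sizes and shortlist size are illustrative; it assumes Matryoshka-style embeddings that are already L2-normalized at full dimension):

import numpy as np

def truncate_and_normalize(emb: np.ndarray, dims: int) -> np.ndarray:
    # Matryoshka embeddings must be re-normalized after truncation
    cut = emb[..., :dims]
    return cut / np.linalg.norm(cut, axis=-1, keepdims=True)

def two_stage_search(query_emb, doc_embs, dims=512, shortlist=100, top_k=10):
    # Stage 1: cheap scoring on the first `dims` dimensions
    q_small = truncate_and_normalize(query_emb, dims)
    d_small = truncate_and_normalize(doc_embs, dims)
    candidates = np.argsort(-d_small @ q_small)[:shortlist]

    # Stage 2: exact scoring on the full vectors, shortlist only
    scores = doc_embs[candidates] @ query_emb
    return candidates[np.argsort(-scores)[:top_k]]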

Advanced Vector Indexing

ANN (Approximate Nearest Neighbor) Algorithms

Exact search (brute force) scales as O(n) with the number of vectors. ANN algorithms reduce this to roughly O(log n), at the cost of a slight loss in recall.
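For reference, here is what the exact brute-force scan looks like (a sketch assuming normalized embeddings stored in a NumPy matrix); this is the result that ANN indexes approximate:

import numpy as np

def exact_search(query_emb: np.ndarray, doc_embs: np.ndarray, top_k: int = 10):
    # O(n) scan: score the query against every document vector
    scores = doc_embs @ query_emb          # cosine similarity if vectors are normalized
    top = np.argsort(-scores)[:top_k]      # indices of the best matches
    return top, scores[top]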

HNSW (Hierarchical Navigable Small World)

The most widely used ANN index, with an excellent speed/precision tradeoff.

from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, HnswConfigDiff, SearchParams

client = QdrantClient("localhost", port=6333)

# Optimized HNSW configuration
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(
        size=1024,
        distance=Distance.COSINE
    ),
    hnsw_config=HnswConfigDiff(
        m=16,              # Connections per node (16-64)
        ef_construct=100   # Construction quality
    )
)

# At search time, adjust ef for the precision/speed tradeoff
results = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    limit=10,
    search_params=SearchParams(hnsw_ef=128)  # Higher = more precise, slower
)

IVF (Inverted File Index)

Divides the vector space into clusters and searches only the closest ones. Faster to build than HNSW, but generally less precise.

# With FAISS
import faiss

# Create IVF index
dimension = 1024
nlist = 100  # Number of clusters
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, nlist)

# Train on the data (float32 matrix of shape [n, dimension])
index.train(embeddings)
index.add(embeddings)

# Search with nprobe clusters
index.nprobe = 10  # Higher = more precise
distances, indices = index.search(query_embedding.reshape(1, -1), k=10)

Quantization for Memory Reduction

Reduce vector size by 75% with minimal loss:

# Scalar quantization (int8)
from qdrant_client.models import ScalarQuantization, ScalarQuantizationConfig, ScalarType

client.create_collection(
    collection_name="documents_quantized",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,
            quantile=0.99,
            always_ram=True
        )
    )
)

# 1M vectors at 1024 dims: 4 GB → 1 GB
# Recall loss: ~1-2%

Evaluating Dense Retrieval

Essential Metrics

import numpy as np

def evaluate_retrieval(queries, ground_truth, retriever, k_values=[1, 5, 10]):
    metrics = {}

    for k in k_values:
        recalls = []
        precisions = []

        for query, relevant_docs in zip(queries, ground_truth):
            retrieved = retriever.search(query, top_k=k)
            retrieved_ids = [doc.id for doc in retrieved]

            # Recall@k: fraction of relevant documents found in the top k
            hits = len(set(retrieved_ids) & set(relevant_docs))
            recalls.append(hits / len(relevant_docs))

            # Precision@k: fraction of the top k that is relevant
            precisions.append(hits / k)

        metrics[f"recall@{k}"] = np.mean(recalls)
        metrics[f"precision@{k}"] = np.mean(precisions)

    # MRR: reciprocal rank of the first relevant document
    mrr_scores = []
    for query, relevant_docs in zip(queries, ground_truth):
        retrieved = retriever.search(query, top_k=100)
        for i, doc in enumerate(retrieved):
            if doc.id in relevant_docs:
                mrr_scores.append(1 / (i + 1))
                break
        else:
            mrr_scores.append(0)
    metrics["mrr"] = np.mean(mrr_scores)

    return metrics

Recommended Benchmarks

  • BEIR: Multi-domain retrieval benchmark
  • MTEB: Massive Text Embedding Benchmark (runnable directly with the mteb package, as sketched below)
  • MS MARCO: Web question-answering
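A minimal sketch of an MTEB run with the mteb package (task selection and output path are illustrative; check the package documentation for current task names):

from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-m3')

# Run a couple of retrieval tasks and write results to disk
evaluation = MTEB(tasks=["NFCorpus", "SciFact"])
evaluation.run(model, output_folder="results/bge-m3")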

Common Pitfalls and Solutions

1. Domain Shift

Generic models perform poorly on specialized vocabulary.

Solution: Contrastive fine-tuning

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('BAAI/bge-m3')

# Training data: (query, positive_doc) pairs
train_examples = [
    InputExample(texts=["COVID symptoms", "Fever, dry cough, fatigue"]),
    InputExample(texts=["Flu treatment", "Rest, hydration, acetaminophen"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Contrastive loss with in-batch negatives
train_loss = losses.MultipleNegativesRankingLoss(model)

# Fine-tuning
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100
)

2. Short Queries

Queries of one or two words produce less discriminative embeddings.

Solution: Query expansion

def expand_query(query: str, llm) -> str:
    # Very short queries are expanded into a full sentence by an LLM
    if len(query.split()) <= 3:
        expansion = llm.complete(
            f"Rephrase this search as a complete sentence: {query}"
        )
        return f"{query} {expansion}"
    return query

3. Long Documents

A single embedding for an overly long document dilutes its content and loses precision.

Solution: Late interaction or chunking with aggregation

import numpy as np

def split_into_chunks(text: str, chunk_size: int) -> list[str]:
    # Naive word-based chunking; replace with a proper splitter in production
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def embed_long_document(doc: str, model, chunk_size=512):
    chunks = split_into_chunks(doc, chunk_size)
    chunk_embeddings = model.encode(chunks)

    # Aggregation: position-weighted average (earlier chunks weigh more)
    weights = np.exp(-np.arange(len(chunks)) * 0.1)
    weights /= weights.sum()
    return np.average(chunk_embeddings, axis=0, weights=weights)

Integration in a RAG Pipeline

from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct, Filter, FieldCondition, MatchValue
from sentence_transformers import SentenceTransformer

class DenseRetriever:
    def __init__(self, model_name: str, collection: str):
        self.model = SentenceTransformer(model_name)
        self.client = QdrantClient("localhost", port=6333)
        self.collection = collection

    def index(self, documents: list[dict]):
        # E5-style prefixes: "passage: " at indexing time, "query: " at search time
        embeddings = self.model.encode(
            [f"passage: {doc['content']}" for doc in documents],
            show_progress_bar=True
        )
        points = [
            PointStruct(
                id=doc["id"],
                vector=emb.tolist(),
                payload={"content": doc["content"], **doc.get("metadata", {})}
            )
            for doc, emb in zip(documents, embeddings)
        ]
        self.client.upsert(collection_name=self.collection, points=points)

    def search(self, query: str, top_k: int = 5, filters: dict = None):
        query_embedding = self.model.encode(f"query: {query}")
        filter_conditions = self._build_filters(filters) if filters else None

        results = self.client.search(
            collection_name=self.collection,
            query_vector=query_embedding.tolist(),
            query_filter=filter_conditions,
            limit=top_k
        )
        return [
            {"content": hit.payload["content"], "score": hit.score}
            for hit in results
        ]

    def _build_filters(self, filters: dict) -> Filter:
        # Simple exact-match filtering on payload fields
        return Filter(
            must=[
                FieldCondition(key=key, match=MatchValue(value=value))
                for key, value in filters.items()
            ]
        )
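A quick usage sketch of this class (model name, collection name, and documents are illustrative):

retriever = DenseRetriever(
    model_name="intfloat/multilingual-e5-large",
    collection="support_docs",
)

# Index a small corpus once
retriever.index([
    {"id": 1, "content": "The warranty covers defects for 2 years.", "metadata": {"lang": "en"}},
    {"id": 2, "content": "Contact support for any claims.", "metadata": {"lang": "en"}},
])

# Query it at request time
for hit in retriever.search("How does the warranty work?", top_k=2):
    print(round(hit["score"], 3), hit["content"])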

Next Steps

Dense retrieval is powerful but not universal. In some cases, lexical search remains superior. Discover how in our related guides.


Get Started with Dense Retrieval on Ailog

Implementing a performant dense retrieval system requires expertise and infrastructure. With Ailog, you get:

  • Optimized embedding models for multilingual content (BGE-M3, E5-Large)
  • Automatic HNSW indexing with Qdrant
  • Smart quantization to optimize costs and performance
  • Assisted fine-tuning on your domain vocabulary

Try for free and deploy your semantic search in minutes.

Tags

rag, retrieval, embeddings, dense retrieval, semantic search
