Dense Retrieval: Semantic Search with Embeddings
Master dense retrieval for high-performance semantic search. Embeddings, models, vector indexing, and advanced optimizations explained.
Dense retrieval transforms information search by capturing the deep meaning of queries and documents. Unlike lexical search that compares words, dense retrieval compares concepts. This guide dives deep into the mechanisms, models, and techniques for implementing high-performance semantic search in your RAG systems.
What is Dense Retrieval?
Dense retrieval represents each text as a dense vector of real numbers, typically between 384 and 4096 dimensions. These vectors, called embeddings, capture the semantics of text in a mathematical space where similar concepts are close together.
Difference from Sparse Retrieval
| Characteristic | Dense Retrieval | Sparse Retrieval |
|---|---|---|
| Representation | Dense vectors (384-4096 dim) | Sparse vectors (vocabulary) |
| Matching | Semantic | Lexical |
| "Car" = "Automobile" | Yes | No |
| Rare terms | Weaker | Excellent |
| Typos and variants | Robust | Sensitive |
Dense retrieval excels when users phrase their queries differently from the source content. "How to cancel my order" will find "Purchase cancellation procedure" even without common words.
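As a toy illustration of this idea (hand-made 4-dimensional vectors, not a real embedding model), semantic matching boils down to cosine similarity in the embedding space:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings (real models use 384-4096 dimensions)
query_vec = np.array([0.9, 0.1, 0.0, 0.4])  # "How to cancel my order"
doc_close = np.array([0.8, 0.2, 0.1, 0.5])  # "Purchase cancellation procedure"
doc_far   = np.array([0.0, 0.9, 0.8, 0.1])  # "Shipping rates overview"

print(cosine_similarity(query_vec, doc_close))  # high, despite no shared words
print(cosine_similarity(query_vec, doc_far))    # low
```

A real model would produce these vectors from the raw text; the ranking logic stays the same.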
Dense Retrieval System Architecture
┌──────────────────────────────────────────────────────────────┐
│ Dense Retrieval Pipeline │
├──────────────────────────────────────────────────────────────┤
│ │
│ Documents Embedding Vector Database │
│ ┌─────────┐ Model ┌──────────────┐ │
│ │ Doc 1 │──┐ ┌───────┐ ┌───│ Qdrant │ │
│ │ Doc 2 │──┼───▶│ E5 │────┼───│ Pinecone │ │
│ │ Doc 3 │──┘ │ BGE │ │ │ Weaviate │ │
│ └─────────┘ └───────┘ │ └──────────────┘ │
│ │ │ │
│ Query │ │ ANN Search │
│ ┌─────────┐ │ ▼ │
│ │"How to │────────────────────┘ ┌──────────────┐ │
│ │ cancel" │ │ Top-K docs │ │
│ └─────────┘ └──────────────┘ │
│ │
└──────────────────────────────────────────────────────────────┘
Embedding Models: 2024 Comparison
Your choice of embedding model directly impacts retrieval quality. Here are current options ranked by MTEB benchmark performance.
Open Source Models
```python
from sentence_transformers import SentenceTransformer

# BGE-M3: best quality/performance ratio for multilingual content
model_bge = SentenceTransformer('BAAI/bge-m3')

# E5-Large: excellent general purpose
model_e5 = SentenceTransformer('intfloat/multilingual-e5-large')

# GTE-Large: performant alternative
model_gte = SentenceTransformer('thenlper/gte-large')
```
| Model | Dimensions | MTEB Score | Speed | Use Case |
|---|---|---|---|---|
| BGE-M3 | 1024 | 64.2 | Medium | Production multilingual |
| E5-Large-v2 | 1024 | 63.5 | Medium | General purpose |
| GTE-Large | 1024 | 62.1 | Fast | High volume |
| All-MiniLM | 384 | 56.3 | Very fast | Prototyping |
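Dimension count also drives index memory: a float32 vector costs 4 bytes per dimension. As a back-of-the-envelope helper (not from the comparison above), you can budget an index like this:

```python
def index_memory_gb(num_vectors: int, dims: int, bytes_per_dim: int = 4) -> float:
    # float32 = 4 bytes per dimension; vector payload only, excluding index overhead
    return num_vectors * dims * bytes_per_dim / 1e9

print(index_memory_gb(1_000_000, 1024))  # 4.096 GB for a 1024-dim model
print(index_memory_gb(1_000_000, 384))   # 1.536 GB for All-MiniLM
```

This is why 384-dim models remain attractive for high-volume workloads, and why quantization (covered below) matters.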
Proprietary Models
```python
import openai
import cohere

# OpenAI text-embedding-3
client = openai.OpenAI()
response = client.embeddings.create(
    model="text-embedding-3-large",
    input="How do I cancel my order?",
    dimensions=1536  # Configurable up to 3072
)
embedding = response.data[0].embedding

# Cohere Embed v3
co = cohere.Client()
response = co.embed(
    texts=["How do I cancel my order?"],
    model="embed-multilingual-v3.0",
    input_type="search_query"
)
```
| Model | Dimensions | Performance | Cost/1M tokens |
|---|---|---|---|
| text-embedding-3-large | 3072 | Excellent | $0.13 |
| text-embedding-3-small | 1536 | Very good | $0.02 |
| Cohere embed-v3 | 1024 | Excellent | $0.10 |
| Voyage-2 | 1024 | State of the art | $0.12 |
Optimizing Embedding Quality
Query/Passage Prefixes
Some models require prefixes to distinguish queries from documents:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('intfloat/multilingual-e5-large')

# IMPORTANT: prefix queries and passages correctly
query = "query: How does the warranty work?"
passages = [
    "passage: The warranty covers defects for 2 years.",
    "passage: Contact support for any claims."
]

query_embedding = model.encode(query)
passage_embeddings = model.encode(passages)
```
Without these prefixes, performance can drop by 5-15 points on retrieval benchmarks.
Embedding Normalization
To use dot product instead of cosine similarity (faster):
```python
import numpy as np

def normalize(embeddings):
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / norms

# embeddings: (n, d) array produced by model.encode(...)
embeddings_norm = normalize(embeddings)

# On normalized vectors, dot product equals cosine similarity
similarity = np.dot(query_norm, doc_norm.T)
```
Matryoshka Embeddings
Recent models support "Matryoshka embeddings": embeddings where the first dimensions are the most informative.
```python
# With text-embedding-3, you can reduce dimensions at request time
response = client.embeddings.create(
    model="text-embedding-3-large",
    input=texts,
    dimensions=512  # Instead of 3072, with minimal loss
)

# Fast search with reduced dimensions,
# then reranking with full dimensions
```
This enables configurable speed/precision tradeoffs.
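When the provider returns full-size vectors, you can apply the same idea yourself. A minimal sketch (assuming the model was trained with a Matryoshka objective, so the leading dimensions carry the most information): truncate, then re-normalize so cosine similarity remains valid.

```python
import numpy as np

def truncate_matryoshka(embeddings: np.ndarray, dims: int) -> np.ndarray:
    # Keep the leading (most informative) dimensions, then re-normalize
    truncated = embeddings[:, :dims]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms

full = np.random.randn(10, 3072).astype(np.float32)
small = truncate_matryoshka(full, 512)
print(small.shape)  # (10, 512)
```

A common pattern is to index the 512-dim vectors for fast first-stage search and keep the full vectors on disk for reranking.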
Advanced Vector Indexing
ANN (Approximate Nearest Neighbor) Algorithms
Exact (brute-force) search is O(n) in the number of vectors. ANN algorithms reduce this to roughly O(log n) at the cost of a slight loss in recall.
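For reference, the O(n) baseline that ANN indexes approximate is just a scored scan over every document (a NumPy sketch, assuming normalized embeddings):

```python
import numpy as np

def exact_search(query: np.ndarray, doc_embeddings: np.ndarray, k: int = 10):
    # O(n): score every document, keep the k best
    scores = doc_embeddings @ query  # dot product = cosine on normalized vectors
    top_k = np.argsort(-scores)[:k]
    return top_k, scores[top_k]

docs = np.random.randn(1000, 64)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = docs[42]  # query identical to document 42
ids, scores = exact_search(query, docs, k=5)
print(ids[0])  # 42 — the exact match ranks first
```

This stays practical up to a few hundred thousand vectors; beyond that, HNSW or IVF becomes necessary.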
HNSW (Hierarchical Navigable Small Worlds)
The most widely used ANN algorithm, with an excellent speed/precision tradeoff.
```python
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, HnswConfigDiff, SearchParams

client = QdrantClient("localhost", port=6333)

# Optimized HNSW configuration
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
    hnsw_config=HnswConfigDiff(
        m=16,              # Connections per node (16-64)
        ef_construct=100,  # Construction quality
    )
)

# At search time, adjust hnsw_ef for the precision/speed tradeoff
results = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    limit=10,
    search_params=SearchParams(hnsw_ef=128)  # Higher = more precise, slower
)
```
IVF (Inverted File Index)
IVF partitions the vector space into clusters and searches only the closest ones. Cheaper to build than HNSW, but generally less precise.
```python
# With FAISS
import faiss

# Create IVF index
dimension = 1024
nlist = 100  # Number of clusters
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, nlist)

# Train on the data, then add it
index.train(embeddings)
index.add(embeddings)

# Search while probing nprobe clusters
index.nprobe = 10  # Higher = more precise
# FAISS expects a 2D batch of queries
distances, indices = index.search(query_embedding.reshape(1, -1), 10)
```
Quantization for Memory Reduction
Reduce vector size by 75% with minimal loss:
```python
# Scalar quantization (int8)
from qdrant_client.models import (
    VectorParams, Distance,
    ScalarQuantization, ScalarQuantizationConfig, ScalarType
)

client.create_collection(
    collection_name="documents_quantized",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,
            quantile=0.99,
            always_ram=True
        )
    )
)

# 1M vectors at 1024 dims: 4 GB → 1 GB
# Recall loss: ~1-2%
```
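The database handles this internally, but the mechanism is easy to see in NumPy: clip outliers at a quantile, then map the remaining range onto int8 (a sketch of the idea, not Qdrant's exact implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_int8(vectors: np.ndarray, quantile: float = 0.99):
    # Clip outliers at the given quantile, map the range onto int8
    bound = np.quantile(np.abs(vectors), quantile)
    scale = bound / 127.0
    q = np.clip(np.round(vectors / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

vecs = rng.standard_normal((100, 1024)).astype(np.float32)
q, scale = quantize_int8(vecs)
print(vecs.nbytes // q.nbytes)  # 4 — int8 is 4x smaller than float32
```

The `quantile=0.99` clip is what keeps a few extreme values from wasting the int8 range, which is why recall degrades so little.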
Evaluating Dense Retrieval
Essential Metrics
```python
import numpy as np

def evaluate_retrieval(queries, ground_truth, retriever, k_values=[1, 5, 10]):
    metrics = {}

    for k in k_values:
        recalls = []
        precisions = []
        for query, relevant_docs in zip(queries, ground_truth):
            retrieved = retriever.search(query, top_k=k)
            retrieved_ids = [doc.id for doc in retrieved]

            # Recall@k
            hits = len(set(retrieved_ids) & set(relevant_docs))
            recalls.append(hits / len(relevant_docs))

            # Precision@k
            precisions.append(hits / k)

        metrics[f"recall@{k}"] = np.mean(recalls)
        metrics[f"precision@{k}"] = np.mean(precisions)

    # MRR (Mean Reciprocal Rank)
    mrr_scores = []
    for query, relevant_docs in zip(queries, ground_truth):
        retrieved = retriever.search(query, top_k=100)
        for i, doc in enumerate(retrieved):
            if doc.id in relevant_docs:
                mrr_scores.append(1 / (i + 1))
                break
        else:
            mrr_scores.append(0)
    metrics["mrr"] = np.mean(mrr_scores)

    return metrics
```
Recommended Benchmarks
- BEIR: Multi-domain retrieval benchmark
- MTEB: Massive Text Embedding Benchmark
- MS MARCO: Web question-answering
Common Pitfalls and Solutions
1. Domain Shift
Generic models perform poorly on specialized vocabulary.
Solution: Contrastive fine-tuning
```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('BAAI/bge-m3')

# Training data: (query, positive_doc) pairs
train_examples = [
    InputExample(texts=["COVID symptoms", "Fever, dry cough, fatigue"]),
    InputExample(texts=["Flu treatment", "Rest, hydration, acetaminophen"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Contrastive loss using in-batch negatives
train_loss = losses.MultipleNegativesRankingLoss(model)

# Fine-tuning
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100
)
```
2. Short Queries
Queries of one or two words produce less discriminative embeddings.
Solution: Query expansion
```python
def expand_query(query: str, llm) -> str:
    if len(query.split()) <= 3:
        expansion = llm.complete(
            f"Rephrase this search as a complete sentence: {query}"
        )
        return f"{query} {expansion}"
    return query
```
3. Long Documents
A single embedding for a very long document dilutes its content and loses precision.
Solution: Late interaction or chunking with aggregation
```python
import numpy as np

def split_into_chunks(doc: str, chunk_size: int) -> list[str]:
    # Simple word-count chunker; swap in sentence-aware splitting in practice
    words = doc.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def embed_long_document(doc: str, model, chunk_size=512):
    chunks = split_into_chunks(doc, chunk_size)
    chunk_embeddings = model.encode(chunks)

    # Aggregation: position-weighted average (earlier chunks weigh more)
    weights = np.exp(-np.arange(len(chunks)) * 0.1)
    weights /= weights.sum()
    return np.average(chunk_embeddings, axis=0, weights=weights)
```
Integration in a RAG Pipeline
```python
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct, Filter, FieldCondition, MatchValue

class DenseRetriever:
    def __init__(self, model_name: str, collection: str):
        self.model = SentenceTransformer(model_name)
        self.client = QdrantClient("localhost", port=6333)
        self.collection = collection

    def index(self, documents: list[dict]):
        embeddings = self.model.encode(
            [doc["content"] for doc in documents],
            show_progress_bar=True
        )
        points = [
            PointStruct(
                id=doc["id"],
                vector=emb.tolist(),
                payload={"content": doc["content"], **doc.get("metadata", {})}
            )
            for doc, emb in zip(documents, embeddings)
        ]
        self.client.upsert(collection_name=self.collection, points=points)

    def search(self, query: str, top_k: int = 5, filters: dict = None):
        query_embedding = self.model.encode(f"query: {query}")
        filter_conditions = self._build_filters(filters) if filters else None
        results = self.client.search(
            collection_name=self.collection,
            query_vector=query_embedding.tolist(),
            query_filter=filter_conditions,
            limit=top_k
        )
        return [
            {"content": hit.payload["content"], "score": hit.score}
            for hit in results
        ]

    def _build_filters(self, filters: dict) -> Filter:
        # Exact-match conditions on payload fields
        return Filter(must=[
            FieldCondition(key=key, match=MatchValue(value=value))
            for key, value in filters.items()
        ])
```
Next Steps
Dense retrieval is powerful but not universal. For certain cases, lexical search remains superior. Discover how in our related guides:
- Sparse Retrieval and BM25 - When lexical search wins
- Hybrid Fusion - Combining dense and sparse
- Retrieval Fundamentals - Overview
Get Started with Dense Retrieval on Ailog
Implementing a performant dense retrieval system requires expertise and infrastructure. With Ailog, you get:
- Optimized embedding models for multilingual content (BGE-M3, E5-Large)
- Automatic HNSW indexing with Qdrant
- Smart quantization to optimize costs and performance
- Assisted fine-tuning on your domain vocabulary
Try for free and deploy your semantic search in minutes.