Embeddings: The Foundation of Semantic Search
Deep dive into embedding models, vector representations, and how to choose the right embedding strategy for your RAG system.
TL;DR
- Embeddings = Vector representations that capture semantic meaning (similar text → close vectors)
- Best for most: OpenAI text-embedding-3-small ($0.02/1M tokens, 1536 dimensions)
- Budget option: Sentence Transformers all-mpnet-base-v2 (free, self-hosted)
- Quality matters: better embeddings can improve retrieval accuracy by 20-40%
- Compare models live on Ailog's platform
Understanding Embeddings
Embeddings are dense vector representations of text that capture semantic meaning in a high-dimensional space. Words, sentences, or documents with similar meanings are positioned close together in this vector space.
From Text to Vectors
Consider these sentences:
- "The cat sits on the mat"
- "A feline rests on the rug"
- "Python is a programming language"
Good embeddings will place the first two sentences close together (similar meaning) and the third one far away (different topic).
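You can check this in a few lines of code. A minimal sketch using the sentence-transformers library (the model choice is illustrative):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small model that runs locally

sentences = [
    "The cat sits on the mat",
    "A feline rests on the rug",
    "Python is a programming language",
]
embeddings = model.encode(sentences)

# The two cat sentences score high; the programming sentence scores low
print(util.cos_sim(embeddings[0], embeddings[1]))  # similar meaning
print(util.cos_sim(embeddings[0], embeddings[2]))  # different topic
```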
Why Embeddings Matter for RAG
Embeddings enable semantic search rather than keyword matching:
Keyword Search (Traditional)
Query: "how to reset password"
Match: Exact word matching
Misses: "password recovery", "forgot credentials", "account access"
Semantic Search (Embeddings)
Query: "how to reset password"
Finds: "password recovery", "forgot credentials", "regain account access"
Reason: Similar meaning, different words
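The same idea as a retrieval sketch (hypothetical documents, again using sentence-transformers; the ranking works even with zero keyword overlap):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "how to reset password"
docs = [
    "Password recovery: regain access to your account",
    "Update your billing information in settings",
    "Install the app on Android devices",
]

# Rank documents by cosine similarity to the query embedding
scores = util.cos_sim(model.encode(query), model.encode(docs))[0]
best_doc = docs[int(scores.argmax())]  # expected: the password recovery doc
```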
Popular Embedding Models
OpenAI Embeddings
text-embedding-3-small
- Dimensions: 1536
- Cost: $0.02 / 1M tokens
- Performance: Good for most use cases
- Speed: Fast
text-embedding-3-large
- Dimensions: 3072
- Cost: $0.13 / 1M tokens
- Performance: Best accuracy
- Speed: Slower, higher memory
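Calling these models is a single API request. A sketch with the OpenAI Python SDK (v1 client; assumes OPENAI_API_KEY is set in your environment):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["The cat sits on the mat", "A feline rests on the rug"],
)
vectors = [item.embedding for item in response.data]  # 1536-dim float lists
```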
Open Source Alternatives
Sentence Transformers
- Models: all-MiniLM-L6-v2, all-mpnet-base-v2
- Dimensions: 384-768
- Cost: Free (self-hosted)
- Performance: Good for English
- Customizable: Can fine-tune on your domain
BGE (BAAI General Embedding)
- Models: bge-small, bge-base, bge-large
- Dimensions: 384-1024
- Performance: State-of-the-art open source
- Languages: Multilingual support
E5 (Microsoft)
- Models: e5-small, e5-base, e5-large
- Dimensions: 384-1024
- Performance: Excellent zero-shot
- Training: Weak supervision approach
Cohere Embed
- Dimensions: 1024 (v3), 768 (multilingual v2)
- Cost: API-based pricing
- Performance: Strong multilingual
- Features: Built-in compression
Embedding Dimensions
Embedding dimensionality trades expressiveness against speed, storage, and cost:
Higher Dimensions (1024-3072)
Pros:
- More expressive representations
- Better capture nuanced meanings
- Higher accuracy on complex queries
Cons:
- More storage space required
- Slower similarity computations
- Higher memory usage
- Increased costs
Lower Dimensions (256-512)
Pros:
- Faster search
- Less storage
- Lower memory footprint
- Cost-effective at scale
Cons:
- May lose semantic nuance
- Lower accuracy on subtle distinctions
Optimal Choice
For most RAG applications:
- 384-768 dimensions: Good balance for general use
- 1024-1536 dimensions: Better for complex domains
- 256-384 dimensions: High-volume, cost-sensitive applications
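With OpenAI's text-embedding-3 models you don't have to switch models to change dimensionality: the API accepts a dimensions parameter that returns shortened, Matryoshka-style vectors. A sketch:

```python
from openai import OpenAI

client = OpenAI()

# Request 512 dimensions instead of the default 1536:
# roughly 3x less storage and faster search, at a modest accuracy cost
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="how to reset password",
    dimensions=512,
)
vector = response.data[0].embedding  # len(vector) == 512
```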
Embedding Strategies
Document-Level Embeddings
Embed entire documents as single vectors.
Pros:
- Captures overall document theme
- Simple implementation
- Good for document classification
Cons:
- Loses granular information
- Context window limits long documents
- Poor for precise retrieval
Use when:
- Documents are short (< 512 tokens)
- Need document-level similarity
- Classification tasks
Chunk-Level Embeddings
Split documents into chunks, embed each separately.
Pros:
- Retrieves specific relevant sections
- Handles long documents
- More precise context
Cons:
- More embeddings to store
- Chunk boundaries may split context
- Requires chunking strategy
Use when:
- Documents are long
- Need precise retrieval
- Most RAG applications
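A minimal chunk-and-embed pipeline, using naive fixed-size chunking for illustration (the file path is hypothetical; chunking strategies are covered in the next guide):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")

def chunk_text(text, chunk_size=500, overlap=50):
    # Fixed-size character chunks; the overlap preserves boundary context
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

long_document = open("docs/handbook.txt").read()  # hypothetical source file
chunks = chunk_text(long_document)
chunk_embeddings = model.encode(chunks)  # one vector per chunk
```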
Sentence-Level Embeddings
Embed individual sentences.
Pros:
- Very precise retrieval
- Minimal irrelevant context
- Good for FAQ systems
Cons:
- May lack surrounding context
- Very large number of embeddings
- Context fragmentation
Use when:
- Questions have short, specific answers
- Minimizing context window usage
- FAQ or Q&A systems
Hybrid Approaches
Combine multiple granularities:
```python
# Pseudocode
document_embedding = embed(full_document)
chunk_embeddings = [embed(chunk) for chunk in chunks]
sentence_embeddings = [embed(sent) for sent in sentences]

# Retrieval: search at the document level, then drill down to chunks
```
Embedding Best Practices
1. Consistent Preprocessing
Ensure training and inference preprocessing match:
```python
# Bad: inconsistent preprocessing
#   training:  "The Quick Brown Fox"
#   inference: "the quick brown fox"

# Good: apply the same preprocessing at indexing and query time
def preprocess(text):
    return text.lower().strip()

# training:  preprocess(text)
# inference: preprocess(query)
```
2. Handle Long Text
Most models have token limits (512 tokens typical).
Options:
- Chunking: Split before embedding
- Truncation: Take first N tokens
- Summarization: Embed summary for long documents
- Long-context models: Use models with larger context windows
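For the chunking and truncation options above, tiktoken (OpenAI's tokenizer library) makes the token limit explicit. A sketch:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by recent OpenAI models

def truncate(text, max_tokens=512):
    # Keep only the first max_tokens tokens
    return enc.decode(enc.encode(text)[:max_tokens])

def chunk_by_tokens(text, max_tokens=512):
    # Split into chunks that each fit the model's token limit
    tokens = enc.encode(text)
    return [enc.decode(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]
```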
3. Normalize Embeddings
L2 normalization improves consistency:
```python
import numpy as np

def normalize(embedding):
    return embedding / np.linalg.norm(embedding)
```
Benefits:
- Cosine similarity becomes dot product (faster)
- Consistent similarity ranges
- Better clustering
4. Batch Processing
Embed multiple texts at once for efficiency:
```python
# Inefficient: one call per text
embeddings = [embed(text) for text in texts]

# Efficient: embed in batches
embeddings = embed_batch(texts, batch_size=32)
```
5. Caching
Cache embeddings to avoid recomputation:
```python
import hashlib

# Use a content hash as the cache key
def get_embedding(text, cache):
    text_hash = hashlib.md5(text.encode()).hexdigest()
    if text_hash not in cache:
        cache[text_hash] = embed(text)
    return cache[text_hash]
```
Similarity Metrics
Cosine Similarity
Measures the cosine of the angle between vectors. Range: [-1, 1]
```python
from numpy import dot
from numpy.linalg import norm

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))
```
Best for: Normalized embeddings, most common choice
Euclidean Distance
Measures straight-line distance. Range: [0, ∞)
```python
from numpy.linalg import norm

def euclidean_distance(a, b):
    return norm(a - b)
```
Best for: Unnormalized embeddings, clustering
Dot Product
Simple multiplication and sum. Range: (-∞, ∞)
```python
from numpy import dot

def dot_product(a, b):
    return dot(a, b)
```
Best for: Normalized embeddings (equivalent to cosine), fastest computation
Domain Adaptation
When to Fine-Tune
Consider fine-tuning when:
- Domain has specialized vocabulary
- Off-the-shelf models perform poorly
- You have quality training data
- High-value application justifies effort
Fine-Tuning Approaches
Contrastive Learning
Positive pairs: Similar items
Negative pairs: Dissimilar items
Example:
(query: "reset password", doc: "password recovery") → similar
(query: "reset password", doc: "billing info") → dissimilar
Triplet Loss
(anchor, positive, negative)
anchor: "database optimization"
positive: "improving SQL query performance"
negative: "frontend UI design"
Knowledge Distillation
- Use large teacher model (e.g., OpenAI)
- Train smaller student model to match
- Deploy smaller model for cost/speed
Evaluating Embedding Quality
Intrinsic Metrics
Similarity Tasks
- Semantic textual similarity (STS) benchmarks
- Correlation with human judgments
Clustering Quality
- Do similar documents cluster together?
- Silhouette score
Extrinsic Metrics
Retrieval Performance
- Precision@k
- Recall@k
- NDCG (Normalized Discounted Cumulative Gain)
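Precision@k and Recall@k are straightforward to compute once each query has a labeled set of relevant documents (minimal sketch; retrieved is a ranked list of doc IDs, relevant a set):

```python
def precision_at_k(retrieved, relevant, k):
    # Fraction of the top-k results that are relevant
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    # Fraction of all relevant documents found in the top k
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)
```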
End-to-End RAG Metrics
- Answer quality with these embeddings
- User satisfaction
- Task completion rate
Practical Considerations
Storage Requirements
Calculate storage needs:
Storage = num_documents × chunks_per_doc × dimensions × bytes_per_float
Example:
1M documents × 10 chunks × 768 dimensions × 4 bytes = 30.7 GB
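The same arithmetic in Python, handy for quick what-if checks:

```python
num_documents = 1_000_000
chunks_per_doc = 10
dimensions = 768
bytes_per_float = 4  # float32

storage_gb = num_documents * chunks_per_doc * dimensions * bytes_per_float / 1e9
print(f"{storage_gb:.1f} GB")  # 30.7 GB
```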
Latency
Typical embedding times:
- OpenAI API: 50-200ms per request
- Local Sentence Transformers: 10-50ms per batch
- GPU acceleration: 2-10ms per batch
Cost
Monthly cost estimation:
OpenAI text-embedding-3-small:
1M documents × 500 tokens/doc × $0.02/1M tokens = $10
Self-hosted:
GPU instance: $200-500/month
Amortized over volume
Choosing Your Embedding Model
Decision framework:
- Budget: API or self-hosted?
- Volume: How many embeddings needed?
- Latency: Real-time or batch?
- Language: English only or multilingual?
- Domain: General or specialized?
- Accuracy: How critical is precision?
Recommendations
General Purpose (English)
- OpenAI text-embedding-3-small
- Sentence Transformers all-mpnet-base-v2
Multilingual
- Cohere embed-multilingual-v3
- BGE-M3
Cost-Optimized
- Self-hosted Sentence Transformers
- E5-small-v2
Maximum Accuracy
- OpenAI text-embedding-3-large
- Voyage AI voyage-large-2
💡 Expert Tip from Ailog: Don't overthink your first embedding model choice. OpenAI text-embedding-3-small hits the sweet spot for 90% of applications – great quality, reasonable cost, and no infrastructure to manage. Optimize to specialized models only after you've proven RAG value and identified specific bottlenecks. We've seen teams waste months fine-tuning embeddings before validating their use case.
Test Embedding Models on Ailog
Compare embedding models without infrastructure setup:
On Ailog's platform:
- Test OpenAI, Cohere, and open-source models side-by-side
- Benchmark retrieval accuracy on your actual documents
- See real cost projections based on your data volume
- Switch models instantly with one click
Start testing → Free tier includes all major embedding models.
Next Steps
With embeddings in place, the next critical component is determining how to split your documents. The chunking strategy significantly impacts retrieval quality and will be covered in the next guide.
Related Guides
Choosing Embedding Models for RAG
Compare embedding models in 2025: OpenAI, Cohere, open-source alternatives. Find the best fit for your use case.
Multilingual Embeddings for Global RAG
Build RAG systems that work across languages using multilingual embedding models and cross-lingual retrieval.
Fine-Tune Embeddings for Your Domain
Boost retrieval accuracy by 30%: fine-tune embedding models on your specific documents and queries.