Embeddings: The Foundation of Semantic Search

January 20, 2025
12 min read
Ailog Research Team

Deep dive into embedding models, vector representations, and how to choose the right embedding strategy for your RAG system.

TL;DR

  • Embeddings = Vector representations that capture semantic meaning (similar text → close vectors)
  • Best for most: OpenAI text-embedding-3-small ($0.02/1M tokens, 1536 dimensions)
  • Budget option: Sentence Transformers all-mpnet-base-v2 (free, self-hosted)
  • Quality matters: Better embeddings = 20-40% improvement in retrieval accuracy
  • Compare models live on Ailog's platform

Understanding Embeddings

Embeddings are dense vector representations of text that capture semantic meaning in a high-dimensional space. Words, sentences, or documents with similar meanings are positioned close together in this vector space.

From Text to Vectors

Consider these sentences:

  • "The cat sits on the mat"
  • "A feline rests on the rug"
  • "Python is a programming language"

Good embeddings will place the first two sentences close together (similar meaning) and the third one far away (different topic).
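
To make this concrete, here is a minimal sketch using the sentence-transformers library with the all-MiniLM-L6-v2 model (both illustrative choices, not a recommendation specific to this guide):

```python
# Sketch: embed three sentences and compare cosine similarities.
# Assumes `pip install sentence-transformers`; the model choice is illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The cat sits on the mat",
    "A feline rests on the rug",
    "Python is a programming language",
]
embeddings = model.encode(sentences, normalize_embeddings=True)

# Similarity between the first sentence and the other two
print(util.cos_sim(embeddings[0], embeddings[1]))  # high score: similar meaning
print(util.cos_sim(embeddings[0], embeddings[2]))  # low score: different topic
```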

Why Embeddings Matter for RAG

Embeddings enable semantic search rather than keyword matching:

Keyword Search (Traditional)

Query: "how to reset password"
Match: Exact word matching
Misses: "password recovery", "forgot credentials", "account access"

Semantic Search (Embeddings)

Query: "how to reset password"
Finds: "password recovery", "forgot credentials", "regain account access"
Reason: Similar meaning, different words

Popular Embedding Models

OpenAI Embeddings

text-embedding-3-small

  • Dimensions: 1536
  • Cost: $0.02 / 1M tokens
  • Performance: Good for most use cases
  • Speed: Fast

text-embedding-3-large

  • Dimensions: 3072
  • Cost: $0.13 / 1M tokens
  • Performance: Best accuracy
  • Speed: Slower, higher memory
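
For reference, a minimal request to these models through the OpenAI Python SDK (v1+ interface; assumes OPENAI_API_KEY is set in your environment) looks roughly like this:

```python
# Sketch: request an embedding from the OpenAI API (SDK v1+ style).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="How do I reset my password?",
)
vector = response.data[0].embedding  # list of 1536 floats
print(len(vector))
```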

Open Source Alternatives

Sentence Transformers

  • Models: all-MiniLM-L6-v2, all-mpnet-base-v2
  • Dimensions: 384-768
  • Cost: Free (self-hosted)
  • Performance: Good for English
  • Customizable: Can fine-tune on your domain

BGE (BAAI General Embedding)

  • Models: bge-small, bge-base, bge-large
  • Dimensions: 512-1024
  • Performance: State-of-the-art open source
  • Languages: Multilingual support

E5 (Microsoft)

  • Models: e5-small, e5-base, e5-large
  • Dimensions: 384-1024
  • Performance: Excellent zero-shot
  • Training: Weak supervision approach

Cohere Embed

  • Dimensions: 1024 (v3), 768 (multilingual)
  • Cost: API-based pricing
  • Performance: Strong multilingual
  • Features: Built-in compression

Embedding Dimensions

The number of dimensions affects accuracy, speed, storage, and cost:

Higher Dimensions (1024-3072)

Pros:

  • More expressive representations
  • Better capture nuanced meanings
  • Higher accuracy on complex queries

Cons:

  • More storage space required
  • Slower similarity computations
  • Higher memory usage
  • Increased costs

Lower Dimensions (256-512)

Pros:

  • Faster search
  • Less storage
  • Lower memory footprint
  • Cost-effective at scale

Cons:

  • May lose semantic nuance
  • Lower accuracy on subtle distinctions

Optimal Choice

For most RAG applications:

  • 384-768 dimensions: Good balance for general use
  • 1024-1536 dimensions: Better for complex domains
  • 256-384 dimensions: High-volume, cost-sensitive applications
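
If you use OpenAI's text-embedding-3 models, their dimensions parameter lets you request shorter vectors without switching models; a hedged sketch:

```python
# Sketch: request a truncated 512-dimensional vector from text-embedding-3-small.
# The model natively produces 1536 dimensions; `dimensions` shortens the output.
from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="database optimization tips",
    dimensions=512,
)
print(len(response.data[0].embedding))  # 512
```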

Embedding Strategies

Document-Level Embeddings

Embed entire documents as single vectors.

Pros:

  • Captures overall document theme
  • Simple implementation
  • Good for document classification

Cons:

  • Loses granular information
  • Context window limits long documents
  • Poor for precise retrieval

Use when:

  • Documents are short (< 512 tokens)
  • Need document-level similarity
  • Classification tasks

Chunk-Level Embeddings

Split documents into chunks, embed each separately.

Pros:

  • Retrieves specific relevant sections
  • Handles long documents
  • More precise context

Cons:

  • More embeddings to store
  • Chunk boundaries may split context
  • Requires chunking strategy

Use when:

  • Documents are long
  • Need precise retrieval
  • Most RAG applications

Sentence-Level Embeddings

Embed individual sentences.

Pros:

  • Very precise retrieval
  • Minimal irrelevant context
  • Good for FAQ systems

Cons:

  • May lack surrounding context
  • Very large number of embeddings
  • Context fragmentation

Use when:

  • Questions have short, specific answers
  • Minimizing context window usage
  • FAQ or Q&A systems

Hybrid Approaches

Combine multiple granularities:

```python
# Pseudocode: embed at multiple granularities
document_embedding = embed(full_document)
chunk_embeddings = [embed(chunk) for chunk in chunks]
sentence_embeddings = [embed(sent) for sent in sentences]

# Retrieval: search at the document level, then drill down to chunks
```

Embedding Best Practices

1. Consistent Preprocessing

Ensure training and inference preprocessing match:

```python
# Bad: inconsistent preprocessing
#   training:  "The Quick Brown Fox"
#   inference: "the quick brown fox"

# Good: apply the same preprocessing at indexing time and at query time
def preprocess(text):
    return text.lower().strip()

# training:  preprocess(text)
# inference: preprocess(query)
```

2. Handle Long Text

Most models have token limits (512 tokens typical).

Options:

  • Chunking: Split before embedding
  • Truncation: Take first N tokens
  • Summarization: Embed summary for long documents
  • Long-context models: Use models with larger context windows
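
As one concrete option for the chunking route, here is a token-aware splitter sketched with the tiktoken tokenizer (an illustrative assumption; use whatever tokenizer matches your embedding model):

```python
# Sketch: split text into chunks that fit a 512-token embedding limit.
# tiktoken and the cl100k_base encoding are illustrative assumptions.
import tiktoken

def split_by_tokens(text, max_tokens=512):
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    return [
        enc.decode(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]
```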

3. Normalize Embeddings

L2 normalization improves consistency:

```python
import numpy as np

def normalize(embedding):
    return embedding / np.linalg.norm(embedding)
```

Benefits:

  • Cosine similarity becomes dot product (faster)
  • Consistent similarity ranges
  • Better clustering
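
A quick check of the first benefit: once vectors are L2-normalized, the plain dot product returns the same value as cosine similarity.

```python
# Sketch: after L2 normalization, dot product equals cosine similarity.
import numpy as np

a, b = np.random.rand(768), np.random.rand(768)
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.isclose(np.dot(a_n, b_n), cosine))  # True
```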

4. Batch Processing

Embed multiple texts at once for efficiency:

```python
# Inefficient: one request per text
embeddings = [embed(text) for text in texts]

# Efficient: embed in batches
embeddings = embed_batch(texts, batch_size=32)
```

5. Caching

Cache embeddings to avoid recomputation:

```python
# Use a content hash as the cache key
import hashlib

def get_embedding(text, cache):
    text_hash = hashlib.md5(text.encode()).hexdigest()
    if text_hash not in cache:
        cache[text_hash] = embed(text)
    return cache[text_hash]
```

Similarity Metrics

Cosine Similarity

Measures angle between vectors. Range: [-1, 1]

```python
from numpy import dot
from numpy.linalg import norm

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))
```

Best for: Normalized embeddings, most common choice

Euclidean Distance

Measures straight-line distance. Range: [0, ∞)

```python
from numpy.linalg import norm

def euclidean_distance(a, b):
    return norm(a - b)
```

Best for: Unnormalized embeddings, clustering

Dot Product

Simple multiplication and sum. Range: (-∞, ∞)

```python
from numpy import dot

def dot_product(a, b):
    return dot(a, b)
```

Best for: Normalized embeddings (equivalent to cosine), fastest computation

Domain Adaptation

When to Fine-Tune

Consider fine-tuning when:

  • Domain has specialized vocabulary
  • Off-the-shelf models perform poorly
  • You have quality training data
  • High-value application justifies effort

Fine-Tuning Approaches

Contrastive Learning

Positive pairs: Similar items
Negative pairs: Dissimilar items

Example:
(query: "reset password", doc: "password recovery") → similar
(query: "reset password", doc: "billing info") → dissimilar

Triplet Loss

(anchor, positive, negative)

anchor: "database optimization"
positive: "improving SQL query performance"
negative: "frontend UI design"

Knowledge Distillation

  • Use large teacher model (e.g., OpenAI)
  • Train smaller student model to match
  • Deploy smaller model for cost/speed

Evaluating Embedding Quality

Intrinsic Metrics

Similarity Tasks

  • Semantic textual similarity (STS) benchmarks
  • Correlation with human judgments

Clustering Quality

  • Do similar documents cluster together?
  • Silhouette score
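
A rough way to measure this, sketched with scikit-learn (KMeans with five clusters and random placeholder vectors are illustrative assumptions; substitute your real document embeddings):

```python
# Sketch: cluster document embeddings and score cluster separation.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

embeddings = np.random.rand(200, 768)  # placeholder for real document embeddings
labels = KMeans(n_clusters=5, n_init="auto").fit_predict(embeddings)

# Closer to 1.0 means tighter, better-separated clusters
print(silhouette_score(embeddings, labels))
```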

Extrinsic Metrics

Retrieval Performance

  • Precision@k
  • Recall@k
  • NDCG (Normalized Discounted Cumulative Gain)
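
These can be computed per query in a few lines of Python; a minimal sketch (document IDs are illustrative):

```python
# Sketch: simple precision@k and recall@k for a single query.
def precision_at_k(retrieved, relevant, k):
    top_k = retrieved[:k]
    return len(set(top_k) & set(relevant)) / k

def recall_at_k(retrieved, relevant, k):
    top_k = retrieved[:k]
    return len(set(top_k) & set(relevant)) / len(relevant)

# Retrieved doc IDs in ranked order vs. ground-truth relevant IDs
print(precision_at_k(["d1", "d4", "d7"], ["d1", "d2"], k=3))  # ~0.33
print(recall_at_k(["d1", "d4", "d7"], ["d1", "d2"], k=3))     # 0.5
```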

End-to-End RAG Metrics

  • Answer quality with these embeddings
  • User satisfaction
  • Task completion rate

Practical Considerations

Storage Requirements

Calculate storage needs:

Storage = num_documents × chunks_per_doc × dimensions × bytes_per_float

Example:
1M documents × 10 chunks × 768 dimensions × 4 bytes = 30.7 GB
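
The same estimate as a small helper function (assumes float32, i.e. 4 bytes per value):

```python
# Sketch: estimate embedding storage in GB (float32 = 4 bytes per dimension).
def storage_gb(num_docs, chunks_per_doc, dims, bytes_per_float=4):
    return num_docs * chunks_per_doc * dims * bytes_per_float / 1e9

print(storage_gb(1_000_000, 10, 768))  # ~30.7 GB
```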

Latency

Typical embedding times:

  • OpenAI API: 50-200ms per request
  • Local Sentence Transformers: 10-50ms per batch
  • GPU acceleration: 2-10ms per batch

Cost

Monthly cost estimation:

OpenAI text-embedding-3-small:
1M documents × 500 tokens/doc × $0.02/1M tokens = $10

Self-hosted:
GPU instance: $200-500/month
Amortized over volume

Choosing Your Embedding Model

Decision framework:

  1. Budget: API or self-hosted?
  2. Volume: How many embeddings needed?
  3. Latency: Real-time or batch?
  4. Language: English only or multilingual?
  5. Domain: General or specialized?
  6. Accuracy: How critical is precision?

Recommendations

General Purpose (English)

  • OpenAI text-embedding-3-small
  • Sentence Transformers all-mpnet-base-v2

Multilingual

  • Cohere embed-multilingual-v3
  • BGE-M3

Cost-Optimized

  • Self-hosted Sentence Transformers
  • E5-small-v2

Maximum Accuracy

  • OpenAI text-embedding-3-large
  • Voyage AI voyage-large-2

💡 Expert Tip from Ailog: Don't overthink your first embedding model choice. OpenAI text-embedding-3-small hits the sweet spot for 90% of applications – great quality, reasonable cost, and no infrastructure to manage. Optimize to specialized models only after you've proven RAG value and identified specific bottlenecks. We've seen teams waste months fine-tuning embeddings before validating their use case.

Test Embedding Models on Ailog

Compare embedding models without infrastructure setup:

On Ailog's platform:

  • Test OpenAI, Cohere, and open-source models side-by-side
  • Benchmark retrieval accuracy on your actual documents
  • See real cost projections based on your data volume
  • Switch models instantly with one click

Start testing → Free tier includes all major embedding models.

Next Steps

With embeddings in place, the next critical component is determining how to split your documents. The chunking strategy significantly impacts retrieval quality and will be covered in the next guide.

Tags

embeddings, vectors, semantic search, models

Related Guides