Embeddings: The Foundation of Semantic Search
Deep dive into embedding models, vector representations, and how to choose the right embedding strategy for your RAG system.
TL;DR
- Embeddings = Vector representations that capture semantic meaning (similar text → close vectors)
- Best for most: OpenAI text-embedding-3-small ($0.02/1M tokens, 1536 dimensions)
- Budget option: Sentence Transformers all-mpnet-base-v2 (free, self-hosted)
- Quality matters: better embeddings can improve retrieval accuracy by 20-40%
- Compare models live on Ailog's platform
Understanding Embeddings
Embeddings are dense vector representations of text that capture semantic meaning in a high-dimensional space. Words, sentences, or documents with similar meanings are positioned close together in this vector space.
From Text to Vectors
Consider these sentences:
- "The cat sits on the mat"
- "A feline rests on the rug"
- "Python is a programming language"
Good embeddings will place the first two sentences close together (similar meaning) and the third one far away (different topic).
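You can check this in a few lines of code. A minimal sketch using the sentence-transformers library (the model choice is illustrative):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small model that runs locally

sentences = [
    "The cat sits on the mat",
    "A feline rests on the rug",
    "Python is a programming language",
]
embeddings = model.encode(sentences)

# The two cat sentences score high; the programming sentence scores low
print(util.cos_sim(embeddings[0], embeddings[1]))  # similar meaning
print(util.cos_sim(embeddings[0], embeddings[2]))  # different topic
```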
Why Embeddings Matter for RAG
Embeddings enable semantic search rather than keyword matching:
Keyword Search (Traditional)
Query: "how to reset password"
Match: Exact word matching
Misses: "password recovery", "forgot credentials", "account access"
Semantic Search (Embeddings)
Query: "how to reset password"
Finds: "password recovery", "forgot credentials", "regain account access"
Reason: Similar meaning, different words
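The same idea as a retrieval sketch (hypothetical documents, again using sentence-transformers; the ranking works even with zero keyword overlap):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "how to reset password"
docs = [
    "Password recovery: regain access to your account",
    "Update your billing information in settings",
    "Install the app on Android devices",
]

# Rank documents by cosine similarity to the query embedding
scores = util.cos_sim(model.encode(query), model.encode(docs))[0]
best_doc = docs[int(scores.argmax())]  # expected: the password recovery doc
```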
Popular Embedding Models
OpenAI Embeddings
text-embedding-3-small
- Dimensions: 1536
- Cost: $0.02 / 1M tokens
- Performance: Good for most use cases
- Speed: Fast
text-embedding-3-large
- Dimensions: 3072
- Cost: $0.13 / 1M tokens
- Performance: Best accuracy
- Speed: Slower, higher memory
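Calling these models is a single API request. A sketch with the OpenAI Python SDK (v1 client; assumes OPENAI_API_KEY is set in your environment):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["The cat sits on the mat", "A feline rests on the rug"],
)
vectors = [item.embedding for item in response.data]  # 1536-dim float lists
```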
Open Source Alternatives
Sentence Transformers
- Models: all-MiniLM-L6-v2, all-mpnet-base-v2
- Dimensions: 384-768
- Cost: Free (self-hosted)
- Performance: Good for English
- Customizable: Can fine-tune on your domain
BGE (BAAI General Embedding)
- Models: bge-small, bge-base, bge-large
- Dimensions: 384-1024
- Performance: State-of-the-art open source
- Languages: Multilingual support
E5 (Microsoft)
- Models: e5-small, e5-base, e5-large
- Dimensions: 384-1024
- Performance: Excellent zero-shot
- Training: Weak supervision approach
Cohere Embed
- Dimensions: 1024 (v3), 768 (multilingual v2)
- Cost: API-based pricing
- Performance: Strong multilingual
- Features: Built-in compression
Embedding Dimensions
Embedding dimensionality trades expressiveness against speed, storage, and cost:
Higher Dimensions (1024-3072)
Pros:
- More expressive representations
- Better capture nuanced meanings
- Higher accuracy on complex queries
Cons:
- More storage space required
- Slower similarity computations
- Higher memory usage
- Increased costs
Lower Dimensions (256-512)
Pros:
- Faster search
- Less storage
- Lower memory footprint
- Cost-effective at scale
Cons:
- May lose semantic nuance
- Lower accuracy on subtle distinctions
Optimal Choice
For most RAG applications:
- 384-768 dimensions: Good balance for general use
- 1024-1536 dimensions: Better for complex domains
- 256-384 dimensions: High-volume, cost-sensitive applications
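With OpenAI's text-embedding-3 models you don't have to switch models to change dimensionality: the API accepts a dimensions parameter that returns shortened, Matryoshka-style vectors. A sketch:

```python
from openai import OpenAI

client = OpenAI()

# Request 512 dimensions instead of the default 1536:
# roughly 3x less storage and faster search, at a modest accuracy cost
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="how to reset password",
    dimensions=512,
)
vector = response.data[0].embedding  # len(vector) == 512
```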
Embedding Strategies
Document-Level Embeddings
Embed entire documents as single vectors.
Pros:
- Captures overall document theme
- Simple implementation
- Good for document classification
Cons:
- Loses granular information
- Context window limits long documents
- Poor for precise retrieval
Use when:
- Documents are short (< 512 tokens)
- Need document-level similarity
- Classification tasks
Chunk-Level Embeddings
Split documents into chunks, embed each separately.
Pros:
- Retrieves specific relevant sections
- Handles long documents
- More precise context
Cons:
- More embeddings to store
- Chunk boundaries may split context
- Requires chunking strategy
Use when:
- Documents are long
- Need precise retrieval
- Most RAG applications
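A minimal chunk-and-embed pipeline, using naive fixed-size chunking for illustration (the file path is hypothetical; chunking strategies are covered in the next guide):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")

def chunk_text(text, chunk_size=500, overlap=50):
    # Fixed-size character chunks; the overlap preserves boundary context
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

long_document = open("docs/handbook.txt").read()  # hypothetical source file
chunks = chunk_text(long_document)
chunk_embeddings = model.encode(chunks)  # one vector per chunk
```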
Sentence-Level Embeddings
Embed individual sentences.
Pros:
- Very precise retrieval
- Minimal irrelevant context
- Good for FAQ systems
Cons:
- May lack surrounding context
- Very large number of embeddings
- Context fragmentation
Use when:
- Questions have short, specific answers
- Minimizing context window usage
- FAQ or Q&A systems
Hybrid Approaches
Combine multiple granularities:
```python
# Pseudocode
document_embedding = embed(full_document)
chunk_embeddings = [embed(chunk) for chunk in chunks]
sentence_embeddings = [embed(sent) for sent in sentences]

# Retrieval: search at the document level, then drill down to chunks
```
Embedding Best Practices
1. Consistent Preprocessing
Ensure training and inference preprocessing match:
```python
# Bad: inconsistent preprocessing
#   training:  "The Quick Brown Fox"
#   inference: "the quick brown fox"

# Good: apply the same preprocessing at indexing and query time
def preprocess(text):
    return text.lower().strip()

# training:  preprocess(text)
# inference: preprocess(query)
```
2. Handle Long Text
Most models have token limits (512 tokens typical).
Options:
- Chunking: Split before embedding
- Truncation: Take first N tokens
- Summarization: Embed summary for long documents
- Long-context models: Use models with larger context windows
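For the chunking and truncation options above, tiktoken (OpenAI's tokenizer library) makes the token limit explicit. A sketch:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by recent OpenAI models

def truncate(text, max_tokens=512):
    # Keep only the first max_tokens tokens
    return enc.decode(enc.encode(text)[:max_tokens])

def chunk_by_tokens(text, max_tokens=512):
    # Split into chunks that each fit the model's token limit
    tokens = enc.encode(text)
    return [enc.decode(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]
```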
3. Normalize Embeddings
L2 normalization improves consistency:
```python
import numpy as np

def normalize(embedding):
    return embedding / np.linalg.norm(embedding)
```
Benefits:
- Cosine similarity becomes dot product (faster)
- Consistent similarity ranges
- Better clustering
4. Batch Processing
Embed multiple texts at once for efficiency:
```python
# Inefficient: one call per text
embeddings = [embed(text) for text in texts]

# Efficient: embed in batches
embeddings = embed_batch(texts, batch_size=32)
```
5. Caching
Cache embeddings to avoid recomputation:
```python
import hashlib

# Use a content hash as the cache key
def get_embedding(text, cache):
    text_hash = hashlib.md5(text.encode()).hexdigest()
    if text_hash not in cache:
        cache[text_hash] = embed(text)
    return cache[text_hash]
```
Similarity Metrics
Cosine Similarity
Measures the cosine of the angle between vectors. Range: [-1, 1]
```python
from numpy import dot
from numpy.linalg import norm

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))
```
Best for: Normalized embeddings, most common choice
Euclidean Distance
Measures straight-line distance. Range: [0, ∞)
```python
from numpy.linalg import norm

def euclidean_distance(a, b):
    return norm(a - b)
```
Best for: Unnormalized embeddings, clustering
Dot Product
Simple multiplication and sum. Range: (-∞, ∞)
```python
from numpy import dot

def dot_product(a, b):
    return dot(a, b)
```
Best for: Normalized embeddings (equivalent to cosine), fastest computation
Domain Adaptation
When to Fine-Tune
Consider fine-tuning when:
- Domain has specialized vocabulary
- Off-the-shelf models perform poorly
- You have quality training data
- High-value application justifies effort
Fine-Tuning Approaches
Contrastive Learning
Positive pairs: Similar items
Negative pairs: Dissimilar items
Example:
(query: "reset password", doc: "password recovery") → similar
(query: "reset password", doc: "billing info") → dissimilar
Triplet Loss
(anchor, positive, negative)
anchor: "database optimization"
positive: "improving SQL query performance"
negative: "frontend UI design"
Knowledge Distillation
- Use large teacher model (e.g., OpenAI)
- Train smaller student model to match
- Deploy smaller model for cost/speed
Evaluating Embedding Quality
Intrinsic Metrics
Similarity Tasks
- Semantic textual similarity (STS) benchmarks
- Correlation with human judgments
Clustering Quality
- Do similar documents cluster together?
- Silhouette score
Extrinsic Metrics
Retrieval Performance
- Precision@k
- Recall@k
- NDCG (Normalized Discounted Cumulative Gain)
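Precision@k and Recall@k are straightforward to compute once each query has a labeled set of relevant documents (minimal sketch; retrieved is a ranked list of doc IDs, relevant a set):

```python
def precision_at_k(retrieved, relevant, k):
    # Fraction of the top-k results that are relevant
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    # Fraction of all relevant documents found in the top k
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)
```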
End-to-End RAG Metrics
- Answer quality with these embeddings
- User satisfaction
- Task completion rate
Practical Considerations
Storage Requirements
Calculate storage needs:
Storage = num_documents × chunks_per_doc × dimensions × bytes_per_float
Example:
1M documents × 10 chunks × 768 dimensions × 4 bytes = 30.7 GB
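The same arithmetic in Python, handy for quick what-if checks:

```python
num_documents = 1_000_000
chunks_per_doc = 10
dimensions = 768
bytes_per_float = 4  # float32

storage_gb = num_documents * chunks_per_doc * dimensions * bytes_per_float / 1e9
print(f"{storage_gb:.1f} GB")  # 30.7 GB
```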
Latency
Typical embedding times:
- OpenAI API: 50-200ms per request
- Local Sentence Transformers: 10-50ms per batch
- GPU acceleration: 2-10ms per batch
Cost
Monthly cost estimation:
OpenAI text-embedding-3-small:
1M documents × 500 tokens/doc × $0.02/1M tokens = $10
Self-hosted:
GPU instance: $200-500/month
Amortized over volume
Choosing Your Embedding Model
Decision framework:
- Budget: API or self-hosted?
- Volume: How many embeddings needed?
- Latency: Real-time or batch?
- Language: English only or multilingual?
- Domain: General or specialized?
- Accuracy: How critical is precision?
Recommendations
General Purpose (English)
- OpenAI text-embedding-3-small
- Sentence Transformers all-mpnet-base-v2
Multilingual
- Cohere embed-multilingual-v3
- BGE-M3
Cost-Optimized
- Self-hosted Sentence Transformers
- E5-small-v2
Maximum Accuracy
- OpenAI text-embedding-3-large
- Voyage AI voyage-large-2
💡 Expert Tip from Ailog: Don't overthink your first embedding model choice. OpenAI text-embedding-3-small hits the sweet spot for 90% of applications – great quality, reasonable cost, and no infrastructure to manage. Optimize to specialized models only after you've proven RAG value and identified specific bottlenecks. We've seen teams waste months fine-tuning embeddings before validating their use case.
Test Embedding Models on Ailog
Compare embedding models without infrastructure setup:
On Ailog's platform:
- Test OpenAI, Cohere, and open-source models side-by-side
- Benchmark retrieval accuracy on your actual documents
- See real cost projections based on your data volume
- Switch models instantly with one click
Start testing → Free tier includes all major embedding models.
Next Steps
With embeddings in place, the next critical component is determining how to split your documents. The chunking strategy significantly impacts retrieval quality and will be covered in the next guide.
Related Guides
Choosing Embedding Models for RAG
Compare embedding models in 2025: OpenAI, Cohere, open-source alternatives. Find the best fit for your use case.
Multilingual Embeddings for Global RAG
Build RAG systems that work across languages using multilingual embedding models and cross-lingual retrieval.
Fine-Tune Embeddings for Your Domain
Boost retrieval accuracy by 30%: fine-tune embedding models on your specific documents and queries.