Vector Databases: Storing and Searching Embeddings
Comprehensive guide to vector databases for RAG: comparison of popular options, indexing strategies, and performance optimization.
TL;DR
- For prototyping: ChromaDB (embedded, zero setup)
- For production: Pinecone (managed) or Qdrant (self-hosted)
- Need hybrid search: Weaviate or Elasticsearch
- Key metric: Query latency <100ms for good UX
- Test vector DBs on Ailog without infrastructure
What is a Vector Database?
A vector database is a specialized database optimized for storing and searching high-dimensional vectors (embeddings). Unlike traditional databases that search by exact matches or ranges, vector databases find items by semantic similarity.
Core Capabilities
- Vector storage: Efficiently store millions of high-dimensional vectors
- Similarity search: Find nearest neighbors in vector space
- Metadata filtering: Combine semantic search with traditional filters
- Scalability: Handle billions of vectors with low latency
- CRUD operations: Create, read, update, delete vectors
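To make these capabilities concrete, here is a minimal sketch using ChromaDB's embedded Python client (the collection name, vectors, and metadata below are illustrative):

```python
import chromadb

client = chromadb.Client()  # embedded, in-memory mode
collection = client.create_collection("docs")

# Vector storage with metadata (CRUD: create)
collection.add(
    ids=["doc1", "doc2"],
    embeddings=[[0.1, 0.2, 0.3], [0.9, 0.8, 0.7]],
    metadatas=[{"category": "intro"}, {"category": "advanced"}],
)

# Similarity search combined with metadata filtering
results = collection.query(
    query_embeddings=[[0.1, 0.2, 0.25]],
    n_results=1,
    where={"category": "intro"},
)

# CRUD: update and delete
collection.update(ids=["doc1"], metadatas=[{"category": "basics"}])
collection.delete(ids=["doc2"])
```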
Why Not Use a Regular Database?
Traditional databases struggle with vector search:
Problem: Curse of dimensionality
- High-dimensional spaces behave counterintuitively
- Distance metrics become less meaningful
- Exhaustive search is O(n×d) - too slow at scale
Vector DB Solution: Approximate Nearest Neighbor (ANN)
- Specialized indexing (HNSW, IVF, etc.)
- Sub-linear search time: O(log n) typical
- Trade exactness for speed (99%+ recall)
Popular Vector Databases
Pinecone
Type: Managed cloud service
Pros:
- Fully managed, no infrastructure
- Easy to use, great DX
- Auto-scaling
- High performance
- Good documentation
Cons:
- Cost at scale
- Vendor lock-in
- Limited self-hosting
Pricing:
- Starter: Free (1 index, 100K vectors)
- Standard: ~$70/month (1M vectors, 1 pod)
- Enterprise: Custom
Best for:
- Quick prototypes
- Production without ops overhead
- When budget allows
Weaviate
Type: Open source, self-hostable
Pros:
- Open source (Apache 2.0)
- Hybrid search (vector + keyword)
- GraphQL API
- Multi-tenancy support
- Active community
Cons:
- More complex setup
- Self-hosting overhead
- Learning curve
Hosting:
- Self-hosted: Free (infrastructure costs)
- Weaviate Cloud: From $25/month
Best for:
- Self-hosting requirement
- Hybrid search needs
- Complex filtering
Qdrant
Type: Open source, Rust-based
Pros:
- Very fast (Rust performance)
- Rich filtering capabilities
- Good Python SDK
- Easy Docker deployment
- Snapshot support
Cons:
- Smaller ecosystem than others
- Less mature managed offering
Hosting:
- Self-hosted: Free
- Qdrant Cloud: From $25/month
Best for:
- Performance-critical applications
- Complex filtering requirements
- Self-hosting with ease
Chroma
Type: Open source, embedded
Pros:
- Embedded mode (no server needed)
- Simple API
- Good for development
- Free and open source
Cons:
- Limited scale
- No multi-user support in embedded mode
- Fewer features than others
Best for:
- Development and prototyping
- Small-scale applications
- Embedded use cases
Milvus
Type: Open source, cloud-native
Pros:
- Highly scalable (billions of vectors)
- Multiple index types
- Cloud-native architecture
- GPU support
Cons:
- Complex setup
- Resource-intensive
- Steeper learning curve
Hosting:
- Self-hosted: Free
- Zilliz Cloud (managed): Custom pricing
Best for:
- Large-scale production
- Multi-index requirements
- When scale is primary concern
PostgreSQL + pgvector
Type: Extension for PostgreSQL
Pros:
- Use existing PostgreSQL infrastructure
- ACID guarantees
- Rich SQL ecosystem
- Easy integration
Cons:
- Not optimized for massive scale
- Slower than specialized vector DBs
- Limited to millions, not billions
Cost:
- Free (extension)
- Postgres hosting costs
Best for:
- Already using PostgreSQL
- Need transactional guarantees
- Moderate scale (< 1M vectors)
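As a rough sketch of what this looks like in practice, assuming psycopg2 and the pgvector extension are available (the table name, dimensionality, and connection string are hypothetical):

```python
import psycopg2

conn = psycopg2.connect("dbname=rag")  # hypothetical DSN
cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute(
    "CREATE TABLE IF NOT EXISTS items (id bigserial PRIMARY KEY, embedding vector(3))"
)
cur.execute("INSERT INTO items (embedding) VALUES (%s::vector)", ("[0.1,0.2,0.3]",))

# Nearest neighbors by cosine distance (pgvector's <=> operator)
cur.execute(
    "SELECT id FROM items ORDER BY embedding <=> %s::vector LIMIT 5",
    ("[0.1,0.2,0.25]",),
)
print(cur.fetchall())
conn.commit()
```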
Comparison Matrix
| Database | Managed | Open Source | Scale | Best Feature |
|---|---|---|---|---|
| Pinecone | ✅ | ❌ | High | Ease of use |
| Weaviate | ✅ | ✅ | High | Hybrid search |
| Qdrant | ✅ | ✅ | High | Performance |
| Chroma | ❌ | ✅ | Low | Simplicity |
| Milvus | ✅ | ✅ | Very High | Scalability |
| pgvector | ❌ | ✅ | Medium | SQL integration |
Indexing Strategies
HNSW (Hierarchical Navigable Small World)
How it works:
- Multi-layer graph structure
- Navigable small-world properties
- Greedy search from top layer down
Characteristics:
- Fast search: O(log n)
- High recall (95-99%)
- Memory-intensive
- Slow index building
Parameters:
```python
index_config = {
    'M': 16,               # Connections per node (tradeoff: recall vs memory)
    'ef_construction': 64  # Search width during build (higher = better recall)
}

search_params = {
    'ef': 32  # Search width at query time (higher = better recall, slower)
}
```
Tuning:
- M: 8-64 (default 16). Higher = better recall, more memory
- ef_construction: 64-512. Higher = better index quality, slower build
- ef: 32-512. Higher = better recall, slower search
Best for:
- High recall requirements
- Read-heavy workloads
- When memory is available
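As a reference point, here is what these parameters look like in the open-source hnswlib library (random data for illustration):

```python
import hnswlib
import numpy as np

dim, n = 128, 10_000
data = np.random.rand(n, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, M=16, ef_construction=64)
index.add_items(data, np.arange(n))

index.set_ef(32)  # query-time search width
labels, distances = index.knn_query(data[:5], k=10)
```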
IVF (Inverted File Index)
How it works:
- Cluster vectors into partitions (Voronoi cells)
- Search only nearby partitions
- Coarse-to-fine approach
Parameters:
```python
index_config = {
    'nlist': 100  # Number of clusters (sqrt(n) to 4*sqrt(n) typical)
}

search_params = {
    'nprobe': 10  # Number of clusters to search
}
```
Tuning:
- nlist: sqrt(N) typical. More clusters = faster search, slower build
- nprobe: 1 to nlist. Higher = better recall, slower search
Best for:
- Very large datasets
- Acceptable recall tradeoff
- When memory is limited
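A minimal IVF sketch using FAISS, which implements this index type (random data for illustration; nlist and nprobe as above):

```python
import faiss
import numpy as np

dim, n = 128, 100_000
xb = np.random.rand(n, dim).astype(np.float32)

nlist = 100
quantizer = faiss.IndexFlatL2(dim)          # coarse quantizer for clustering
index = faiss.IndexIVFFlat(quantizer, dim, nlist)
index.train(xb)                             # learn the Voronoi partitions
index.add(xb)

index.nprobe = 10                           # clusters to visit at query time
distances, ids = index.search(xb[:5], 10)
```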
Flat (Brute Force)
How it works:
- Compare query to every vector
- Exact nearest neighbors
- No indexing required
Characteristics:
- 100% recall
- O(n) search time
- No index overhead
Best for:
- Small datasets (< 10K vectors)
- Exact results required
- Ground truth evaluation
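Flat search is simple enough to write directly in NumPy, which also makes the O(n×d) cost visible; a minimal sketch:

```python
import numpy as np

def flat_search(query, vectors, k=5):
    # Normalize so the dot product equals cosine similarity
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q                   # one pass over all n vectors: O(n*d)
    top_k = np.argsort(-scores)[:k]  # exact top-k, 100% recall
    return top_k, scores[top_k]
```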
HNSW vs IVF
| Aspect | HNSW | IVF |
|---|---|---|
| Speed | Very fast | Fast |
| Recall | Higher (98-99%) | Lower (90-95%) |
| Memory | High | Lower |
| Build time | Slow | Medium |
| Updates | Expensive | Cheaper |
| Best scale | Millions | Billions |
Metadata Filtering
Combine vector similarity with traditional filters.
Pre-filtering
Filter first, then search vectors.
```python
# Filter by metadata, then vector search within results
results = db.query(
    vector=query_embedding,
    filter={"category": "electronics", "price": {"$lt": 1000}},
    limit=10
)
```
Pros:
- Exact filter application
- No irrelevant results
Cons:
- May reduce the candidate set too much
- Can be slow when the filter matches many vectors (often falls back to a linear scan over the filtered subset)
Post-filtering
Search vectors first, then filter results.
```python
# Vector search first, then filter results
results = db.query(
    vector=query_embedding,
    limit=100  # Overfetch
)
filtered = [
    r for r in results
    if r.metadata.get('category') == 'electronics'
][:10]
```
Pros:
- Always get k results (if available)
- Faster vector search
Cons:
- May waste computation on filtered-out results
- Less efficient
Hybrid (HNSW-IF)
Modern approach: filter-aware indexing.
```python
# Efficient combined search
results = db.query(
    vector=query_embedding,
    filter={"category": "electronics"},
    limit=10,
    filter_strategy="hnsw_if"  # Filter-aware HNSW traversal
)
```
How it works:
- HNSW graph traversal respects filters
- Skip filtered-out nodes during search
- Best of both approaches
Best for:
- Production RAG systems
- When filtering is common
- Supported by Qdrant, Weaviate
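Qdrant, for example, evaluates filters during HNSW traversal rather than strictly pre- or post-filtering. A minimal sketch with its Python client (the collection name and payload field are hypothetical):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(url="http://localhost:6333")
query_embedding = [0.1] * 768  # illustrative; computed by your embedding model

# The filter is applied during graph traversal, not before or after it
results = client.search(
    collection_name="products",
    query_vector=query_embedding,
    query_filter=Filter(
        must=[FieldCondition(key="category", match=MatchValue(value="electronics"))]
    ),
    limit=10,
)
```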
Distance Metrics
Cosine Similarity
Measures angle between vectors.
```python
similarity = dot(a, b) / (norm(a) * norm(b))
```
Range: [-1, 1] (higher = more similar)
Best for:
- Normalized embeddings
- Most common choice
- Text embeddings
Euclidean (L2) Distance
Straight-line distance.
```python
distance = sqrt(sum((a - b) ** 2))
```
Range: [0, ∞) (lower = more similar)
Best for:
- Unnormalized embeddings
- Image embeddings
- When magnitude matters
Dot Product
Simple multiplication.
```python
score = dot(a, b)
```
Range: (-∞, ∞) (higher = more similar)
Best for:
- Normalized embeddings (equivalent to cosine)
- Fastest computation
- When vectors are normalized
Note: For normalized (unit-length) vectors:
- Cosine similarity and dot product are identical
- Dot product is faster to compute (no norms or division)
- Prefer dot product when vectors are normalized
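A quick NumPy check of the equivalence:

```python
import numpy as np

a, b = np.random.rand(768), np.random.rand(768)
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot_normalized = np.dot(a_n, b_n)

assert np.isclose(cosine, dot_normalized)  # identical for normalized vectors
```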
Performance Optimization
Batch Operations
Upload/query in batches for better throughput.
```python
# Bad: one at a time
for vector in vectors:
    db.upsert(vector)

# Good: batched
db.upsert_batch(vectors, batch_size=100)
```
Async Operations
Parallelize I/O-bound operations.
```python
import asyncio

async def batch_search(queries):
    tasks = [db.search_async(q) for q in queries]
    return await asyncio.gather(*tasks)

results = asyncio.run(batch_search(query_batch))
```
Index Maintenance Strategies
Incremental indexing:
- Add vectors as they arrive
- Good for dynamic data
- Index quality can drift over time (hence periodic rebuilds)
Batch reindexing:
- Rebuild index periodically
- Better index quality
- Downtime required
Dual indexing:
- Write to two indexes
- Switch atomically
- Zero downtime
- Double storage cost
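A sketch of the dual-indexing pattern (the `build_index` factory and the index objects' `upsert` method are hypothetical):

```python
import threading

class DualIndex:
    """Serve reads from a live index while a replacement is built, then swap."""

    def __init__(self, build_index, vectors):
        self._build = build_index       # hypothetical factory returning an index
        self._lock = threading.Lock()
        self.live = build_index(vectors)
        self.shadow = None

    def reindex(self, vectors):
        # Rebuild offline; reads keep hitting self.live the whole time
        self.shadow = self._build(vectors)
        with self._lock:
            self.live, self.shadow = self.shadow, None  # atomic switch

    def upsert(self, vector):
        with self._lock:
            self.live.upsert(vector)
            if self.shadow is not None:  # dual-write while a rebuild is in flight
                self.shadow.upsert(vector)
```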
Sharding
Split data across multiple instances.
```python
# Route by document ID
def get_shard(doc_id, num_shards=4):
    return hash(doc_id) % num_shards

# Parallel search across shards
async def search_all_shards(query):
    tasks = [
        search_shard(shard_id, query)
        for shard_id in range(num_shards)
    ]
    results = await asyncio.gather(*tasks)
    return merge_and_rank(results)
```
Caching
Cache frequent queries.
```python
from functools import lru_cache

@lru_cache(maxsize=1000)
def search_cached(query_text, k=5):
    embedding = embed(query_text)
    return db.search(embedding, limit=k)
```
Monitoring and Observability
Key Metrics
Performance Metrics:
- Query latency (p50, p95, p99)
- Indexing throughput
- CPU/memory utilization
Quality Metrics:
- Recall@k
- Precision@k
- User feedback (thumbs up/down)
Operational Metrics:
- Index size
- Number of vectors
- Query rate
- Error rate
Instrumentation
```python
import time

def search_with_metrics(query_vector):
    start = time.time()
    try:
        results = db.search(query_vector, limit=10)
        latency = time.time() - start
        metrics.record('vector_search_latency', latency)
        metrics.record('vector_search_success', 1)
        return results
    except Exception:
        metrics.record('vector_search_error', 1)
        raise
```
Backup and Recovery
Snapshot Strategy
```python
# Regular snapshots
def backup_database(db, backup_path):
    snapshot = db.create_snapshot()
    snapshot.save(backup_path)

# Restore from snapshot
def restore_database(db, backup_path):
    db.restore_snapshot(backup_path)
```
Incremental Backups
```python
# Track changes since the last backup
last_backup_time = get_last_backup_time()
changed_vectors = db.get_vectors_since(last_backup_time)
backup_incremental(changed_vectors)
```
Migration Strategies
Zero-Downtime Migration
```python
# 1. Set up new database
new_db = setup_new_database()

# 2. Backfill data
async def migrate():
    vectors = old_db.scan_all()
    await new_db.upsert_batch(vectors)

# 3. Dual-write during migration
def write_both(vector):
    old_db.upsert(vector)
    new_db.upsert(vector)

# 4. Validate new database
assert validate_migration(old_db, new_db)

# 5. Switch reads to new database
db = new_db

# 6. Decommission old database
old_db.shutdown()
```
Cost Optimization
Calculate Costs
```python
# Storage costs
num_vectors = 1_000_000
dimensions = 768
bytes_per_vector = dimensions * 4  # float32
storage_gb = (num_vectors * bytes_per_vector) / (1024 ** 3)
storage_cost_monthly = storage_gb * 0.10  # $0.10/GB typical

# Query costs (for managed services)
queries_per_month = 10_000_000
cost_per_1k_queries = 0.05
query_cost_monthly = (queries_per_month / 1000) * cost_per_1k_queries

total_monthly = storage_cost_monthly + query_cost_monthly
```
Optimization Tactics
- Reduce dimensions: Use smaller embedding models
- Quantization: Store vectors in lower precision (int8 instead of float32)
- Tiered storage: Hot/warm/cold data
- Caching: Reduce redundant queries
- Batch operations: Lower per-operation overhead
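As an example of the quantization tactic, a simplified per-vector int8 scalar quantizer (real systems typically tune scales per dimension or per segment):

```python
import numpy as np

def quantize_int8(vec):
    # Per-vector scale so the largest component maps to 127
    scale = np.abs(vec).max() / 127.0
    q = np.round(vec / scale).astype(np.int8)
    return q, scale  # int8 storage is 4x smaller than float32

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale  # approximate reconstruction

vec = np.random.randn(768).astype(np.float32)
q, scale = quantize_int8(vec)
error = np.abs(vec - dequantize_int8(q, scale)).max()  # small, bounded by scale/2
```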
Choosing a Vector Database
Decision Framework
Prototyping / POC:
- Chroma (embedded) or Pinecone (cloud)
- Ease of use > performance
Production (Small Scale < 1M vectors):
- pgvector (if using Postgres)
- Pinecone (managed simplicity)
- Qdrant (self-hosted performance)
Production (Medium Scale 1-100M vectors):
- Qdrant or Weaviate (self-hosted)
- Pinecone (managed)
Production (Large Scale > 100M vectors):
- Milvus
- Weaviate
- Distributed Pinecone
Hybrid Search Required:
- Weaviate (best hybrid support)
- Elasticsearch with vector plugin
Need SQL:
- pgvector
Migration Path
Start simple, scale up as needed:
- Development: Chroma (embedded)
- MVP: Pinecone or pgvector
- Scale: Qdrant or Weaviate (self-hosted)
- Massive scale: Milvus or distributed setup
💡 Expert Tip from Ailog: Don't prematurely optimize your vector database choice. We've run production RAG systems serving millions of queries on both Pinecone and self-hosted Qdrant. The database is rarely the bottleneck – poor chunking or embedding strategies are. Start with ChromaDB for prototyping, move to Pinecone for simplicity or Qdrant for control. Only consider Milvus/Weaviate when you're serving 10M+ queries/month.
Compare Vector Databases on Ailog
Test different vector databases with your actual data:
Ailog supports:
- ChromaDB, Pinecone, Qdrant, Weaviate
- Performance benchmarks with your documents
- Cost projections based on your scale
- One-click migration between databases
Next Steps
With embeddings stored and searchable, the next challenge is retrieving the most relevant context. Advanced retrieval strategies including hybrid search, query expansion, and reranking are covered in the next guide.
Related Articles
Qdrant: Advanced Vector Search Features
Leverage Qdrant's powerful features: payload indexing, quantization, distributed deployment for high-performance RAG.
Milvus: Billion-Scale Vector Search
Deploy Milvus for production-scale RAG handling billions of vectors with horizontal scaling and GPU acceleration.
Pinecone for Production RAG at Scale
Deploy production-ready vector search: Pinecone setup, indexing strategies, and scaling to billions of vectors.