ChromaDB Setup for RAG Applications
Get started with ChromaDB: a lightweight, fast vector database perfect for prototyping and production RAG systems.
- Author
- Ailog Research Team
- Published
- Reading time
- 9 min read
- Level
- beginner
- RAG Pipeline Step
- Storage
Why ChromaDB?
ChromaDB is the fastest way to get started with vector search:

- ✅ No infrastructure required
- ✅ Runs in-memory or persistent
- ✅ Built-in embedding functions
- ✅ Perfect for prototyping
- ✅ Scales to production
Installation (November 2025)
```bash
pip install chromadb

# Optional: extras for running ChromaDB in client/server mode
# (persistent storage works with the base install via PersistentClient)
pip install chromadb[server]
```
Quick Start
```python
import chromadb

# Create client (in-memory)
client = chromadb.Client()

# Or persistent
client = chromadb.PersistentClient(path="./chroma_db")

# Create collection
collection = client.create_collection(
    name="my_documents",
    metadata={"description": "RAG knowledge base"}
)
```
Adding Documents
```python
# Add documents with automatic embedding
collection.add(
    documents=[
        "This is about machine learning",
        "Python is a programming language",
        "Vector databases store embeddings"
    ],
    metadatas=[
        {"source": "ml_book", "page": 1},
        {"source": "python_guide", "page": 5},
        {"source": "db_manual", "page": 12}
    ],
    ids=["doc1", "doc2", "doc3"]
)
```
Using Custom Embeddings
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings
texts = ["doc1 text", "doc2 text"]
embeddings = model.encode(texts).tolist()

# Add with pre-computed embeddings
collection.add(
    embeddings=embeddings,
    documents=texts,
    ids=["id1", "id2"]
)
```
Querying
```python
# Simple similarity search
results = collection.query(
    query_texts=["machine learning algorithms"],
    n_results=5
)

print(results['documents'])
print(results['distances'])
print(results['metadatas'])
```
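Note that each field in the response is a list of lists: one inner list per entry in `query_texts`. A pure-Python sketch of unpacking a single-query response (the dict below mimics the shape of a real response; its values are illustrative):

```python
# Shape of a ChromaDB query response for one query text:
# each field holds one inner list per query, sorted nearest-first
results = {
    'documents': [["This is about machine learning",
                   "Vector databases store embeddings"]],
    'distances': [[0.21, 0.58]],
    'metadatas': [[{"source": "ml_book"}, {"source": "db_manual"}]],
}

# Flatten the first (and only) query's hits into (document, distance) pairs
hits = list(zip(results['documents'][0], results['distances'][0]))
```

Forgetting the `[0]` indexing is a common first-run stumble with the query API.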
Advanced Filtering
```python
# Filter by metadata
results = collection.query(
    query_texts=["python"],
    n_results=10,
    where={"source": "python_guide"}  # Only from python_guide
)

# Filter by document content
results = collection.query(
    query_texts=["databases"],
    where_document={"$contains": "vector"}  # Document must contain "vector"
)
```
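The `where` clause also supports compound operators such as `$and`, `$or`, `$eq`, and `$gt`. A small sketch building such a filter as a plain dict (the operator names are Chroma's; the field values are illustrative):

```python
# Compound metadata filter: source must equal "python_guide"
# AND page must be greater than 3
where_filter = {
    "$and": [
        {"source": {"$eq": "python_guide"}},
        {"page": {"$gt": 3}},
    ]
}

# This dict would be passed as collection.query(..., where=where_filter)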
Updating Documents
```python
# Update existing document
collection.update(
    ids=["doc1"],
    documents=["Updated content about ML"],
    metadatas=[{"source": "ml_book", "page": 2, "updated": True}]
)

# Delete documents
collection.delete(ids=["doc2"])
```
Production Setup (2025)
ChromaDB server mode for production:
```bash
# Start server
chroma run --host 0.0.0.0 --port 8000
```
```python
# Connect from client
import chromadb

client = chromadb.HttpClient(
    host="localhost",
    port=8000
)
```
Docker Deployment
```yaml
# docker-compose.yml
version: '3.8'
services:
  chromadb:
    image: chromadb/chroma:latest
    ports:
      - "8000:8000"
    volumes:
      - chroma_data:/chroma/chroma
    environment:
      - ALLOW_RESET=true

volumes:
  chroma_data:
```
Performance Tuning
```python
# Batch operations for speed
batch_size = 100
for i in range(0, len(docs), batch_size):
    batch = docs[i:i+batch_size]
    collection.add(
        documents=batch,
        ids=[f"doc_{j}" for j in range(i, i+len(batch))]
    )
```
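The loop above can be factored into a small reusable helper. A pure-Python sketch (`batched` is a name introduced here, not part of ChromaDB):

```python
def batched(items, size):
    """Yield (start_index, slice) pairs of at most `size` items each."""
    for i in range(0, len(items), size):
        yield i, items[i:i + size]

# Usage against a collection would look like:
# for start, batch in batched(docs, 100):
#     collection.add(
#         documents=batch,
#         ids=[f"doc_{j}" for j in range(start, start + len(batch))]
#     )
```

Keeping the start index alongside each slice makes it easy to generate stable, non-colliding ids per batch.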
Hybrid Search (ChromaDB + BM25)
```python
from rank_bm25 import BM25Okapi

# BM25 index over the same corpus for keyword search
tokenized_docs = [doc.split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)

# Query both; assumes ids of the form "doc_<index>" into `documents`
def hybrid_search(query, n_results=5, alpha=0.7):
    # Vector search (ChromaDB); over-fetch candidates for the merge
    vector_results = collection.query(
        query_texts=[query],
        n_results=n_results * 2
    )

    # Keyword search (BM25) over the whole corpus
    bm25_scores = bm25.get_scores(query.split())

    # Merge: alpha weights vector similarity against the BM25 score
    combined = {}
    for doc_id, distance in zip(vector_results['ids'][0],
                                vector_results['distances'][0]):
        idx = int(doc_id.split("_")[-1])
        vector_score = 1 / (1 + distance)  # convert distance to similarity
        combined[doc_id] = alpha * vector_score + (1 - alpha) * bm25_scores[idx]

    # Top ids by combined score
    top_results = sorted(combined, key=combined.get, reverse=True)[:n_results]
    return top_results
```
ChromaDB vs Alternatives (Nov 2025)
| Feature | ChromaDB | Pinecone | Qdrant |
|---------|----------|----------|--------|
| Setup | Instant | Cloud signup | Docker |
| Cost | Free | $70/month+ | Free/self-hosted |
| Scale | 1M+ vectors | Billions | Billions |
| Best for | Prototyping | Production | Advanced features |
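A back-of-envelope check on the "Scale" row: raw float32 embeddings cost dimensions × 4 bytes each, so 1M vectors at the 384 dimensions of all-MiniLM-L6-v2 (used earlier) is roughly 1.5 GB before any index overhead. A sketch of the arithmetic (`embedding_storage_gb` is a helper introduced here):

```python
# Rough memory footprint of raw float32 embeddings (index overhead excluded)
def embedding_storage_gb(n_vectors, dims, bytes_per_float=4):
    return n_vectors * dims * bytes_per_float / 1e9

one_million_minilm = embedding_storage_gb(1_000_000, 384)  # ~1.5 GB
```

Numbers like this are why the best-practices note below suggests watching collection size as it approaches the tens of millions.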
Migration to Production DB
When outgrowing ChromaDB:
```python
# Export from ChromaDB (embeddings must be requested explicitly;
# collection.get() does not return them by default)
chroma_docs = collection.get(include=["documents", "embeddings", "metadatas"])

# Import to Pinecone/Qdrant (production_db is a placeholder client)
for doc, emb in zip(chroma_docs['documents'], chroma_docs['embeddings']):
    production_db.upsert(doc, emb)
```
Best Practices

- Use persistent mode for important data
- Batch operations for performance
- Index metadata fields you filter on
- Monitor collection size (ChromaDB is best under 10M vectors)
- Back up regularly if using persistent mode
ChromaDB is perfect for getting started. It's what we use at Ailog for development and small deployments.