ChromaDB Setup for RAG Applications

Get started with ChromaDB: lightweight, fast vector database perfect for prototyping and production RAG systems.

Author
Ailog Research Team
Published
12 November 2025
Reading time
9 min read
Level
beginner
RAG Pipeline Step
Storage

Why ChromaDB?

ChromaDB is the fastest way to get started with vector search:

  • ✅ No infrastructure required
  • ✅ Runs in-memory or persistent
  • ✅ Built-in embedding functions
  • ✅ Perfect for prototyping
  • ✅ Scales to production

Installation (November 2025)

```bash
pip install chromadb

# Optional: for persistent storage
pip install chromadb[server]
```

Quick Start

```python
import chromadb

# Create client (in-memory)
client = chromadb.Client()

# Or persistent
client = chromadb.PersistentClient(path="./chroma_db")

# Create collection
collection = client.create_collection(
    name="my_documents",
    metadata={"description": "RAG knowledge base"}
)
```

Adding Documents

```python
# Add documents with automatic embedding
collection.add(
    documents=[
        "This is about machine learning",
        "Python is a programming language",
        "Vector databases store embeddings"
    ],
    metadatas=[
        {"source": "ml_book", "page": 1},
        {"source": "python_guide", "page": 5},
        {"source": "db_manual", "page": 12}
    ],
    ids=["doc1", "doc2", "doc3"]
)
```

Using Custom Embeddings

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings
texts = ["doc1 text", "doc2 text"]
embeddings = model.encode(texts).tolist()

# Add with pre-computed embeddings
collection.add(
    embeddings=embeddings,
    documents=texts,
    ids=["id1", "id2"]
)
```
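What the database does with those vectors can be demystified in a few lines of plain Python: similarity search ranks stored embeddings against the query embedding with a distance metric such as cosine similarity. A minimal sketch with toy 3-dimensional vectors (real all-MiniLM-L6-v2 embeddings have 384 dimensions):

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings", not real model output
query = [1.0, 0.0, 1.0]
doc_a = [1.0, 0.1, 0.9]   # points in nearly the same direction as the query
doc_b = [0.0, 1.0, 0.0]   # orthogonal to the query

print(cosine_similarity(query, doc_a))  # close to 1.0
print(cosine_similarity(query, doc_b))  # 0.0
```

ChromaDB computes this (or L2 / inner-product distance, depending on collection settings) over the whole collection for you and returns the closest matches.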

Querying

```python
# Simple similarity search
results = collection.query(
    query_texts=["machine learning algorithms"],
    n_results=5
)

print(results['documents'])
print(results['distances'])
print(results['metadatas'])
```

Advanced Filtering

```python
# Filter by metadata
results = collection.query(
    query_texts=["python"],
    n_results=10,
    where={"source": "python_guide"}  # Only from python_guide
)

# Filter by document content
results = collection.query(
    query_texts=["databases"],
    where_document={"$contains": "vector"}  # Document must contain "vector"
)
```

Updating Documents

```python
# Update existing document
collection.update(
    ids=["doc1"],
    documents=["Updated content about ML"],
    metadatas=[{"source": "ml_book", "page": 2, "updated": True}]
)

# Delete documents
collection.delete(ids=["doc2"])
```

Production Setup (2025)

ChromaDB server mode for production:

```bash
# Start server
chroma run --host 0.0.0.0 --port 8000
```

```python
# Connect from client
import chromadb

client = chromadb.HttpClient(
    host="localhost",
    port=8000
)
```

Docker Deployment

```yaml
# docker-compose.yml
version: '3.8'
services:
  chromadb:
    image: chromadb/chroma:latest
    ports:
      - "8000:8000"
    volumes:
      - chroma_data:/chroma/chroma
    environment:
      - ALLOW_RESET=true

volumes:
  chroma_data:
```

Performance Tuning

```python
# Batch operations for speed
batch_size = 100
for i in range(0, len(docs), batch_size):
    batch = docs[i:i+batch_size]
    collection.add(
        documents=batch,
        ids=[f"doc_{j}" for j in range(i, i+len(batch))]
    )
```
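The slicing pattern above is easy to factor into a reusable helper. This sketch is plain Python and independent of ChromaDB (`batched` is our own name, not a ChromaDB API):

```python
def batched(items, batch_size):
    """Yield (start_index, slice) pairs of successive fixed-size batches."""
    for i in range(0, len(items), batch_size):
        yield i, items[i:i + batch_size]

docs = [f"document {n}" for n in range(250)]
for start, batch in batched(docs, 100):
    # In a real pipeline this line would be collection.add(...)
    print(start, len(batch))  # 0 100, then 100 100, then 200 50
```

Keeping the start index around makes it easy to generate stable ids for each batch, exactly as the loop above does with `doc_{j}`.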

Hybrid Search (ChromaDB + BM25)

```python
from rank_bm25 import BM25Okapi

# BM25 index for keyword search
tokenized_docs = [doc.split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)

# Query both
def hybrid_search(query, n_results=5, alpha=0.7):
    # Vector search (ChromaDB): over-fetch candidates for re-ranking
    vector_results = collection.query(
        query_texts=[query],
        n_results=n_results * 2
    )

    # Keyword search (BM25) scores the whole corpus
    bm25_scores = bm25.get_scores(query.split())

    # Merge results: alpha * vector similarity + (1 - alpha) * normalized BM25
    max_bm25 = max(bm25_scores) or 1.0
    combined = {doc: (1 - alpha) * score / max_bm25
                for doc, score in zip(documents, bm25_scores)}
    for doc, dist in zip(vector_results['documents'][0],
                         vector_results['distances'][0]):
        # Rough similarity from distance; assumes distances normalized to [0, 1]
        combined[doc] = combined.get(doc, 0.0) + alpha * (1.0 - dist)

    top_results = sorted(combined, key=combined.get, reverse=True)[:n_results]
    return top_results
```
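The alpha weighting itself can be checked with toy numbers (hypothetical normalized scores, no ChromaDB or BM25 needed). With `alpha=0.7`, a document strong on vector similarity outranks one strong only on keywords:

```python
def merge_scores(vector_scores, keyword_scores, alpha=0.7):
    # Weighted sum of two score dicts keyed by document id;
    # missing scores count as 0.0
    docs = set(vector_scores) | set(keyword_scores)
    return {d: alpha * vector_scores.get(d, 0.0)
               + (1 - alpha) * keyword_scores.get(d, 0.0)
            for d in docs}

# Hypothetical scores already normalized to [0, 1]
vector_scores = {"doc1": 0.9, "doc2": 0.2}
keyword_scores = {"doc1": 0.1, "doc2": 1.0}

merged = merge_scores(vector_scores, keyword_scores, alpha=0.7)
# doc1: 0.7*0.9 + 0.3*0.1 = 0.66 ; doc2: 0.7*0.2 + 0.3*1.0 = 0.44
print(max(merged, key=merged.get))  # doc1
```

Tuning `alpha` toward 1.0 favors semantic matches; toward 0.0 it favors exact keyword hits.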

ChromaDB vs Alternatives (Nov 2025)

| Feature | ChromaDB | Pinecone | Qdrant |
|---------|----------|----------|--------|
| Setup | Instant | Cloud signup | Docker |
| Cost | Free | $70/month+ | Free/self-hosted |
| Scale | 1M+ vectors | Billions | Billions |
| Best for | Prototyping | Production | Advanced features |

Migration to Production DB

When outgrowing ChromaDB:

```python
# Export from ChromaDB (embeddings are not returned by default)
chroma_docs = collection.get(include=["documents", "embeddings", "metadatas"])

# Import to Pinecone/Qdrant (production_db is a placeholder for your client)
for doc_id, doc, emb in zip(chroma_docs['ids'],
                            chroma_docs['documents'],
                            chroma_docs['embeddings']):
    production_db.upsert(doc_id, doc, emb)
```

Best Practices

  1. Use persistent mode for important data
  2. Batch operations for performance
  3. Index metadata fields you filter on
  4. Monitor collection size (ChromaDB is best < 10M vectors)
  5. Backup regularly if using persistent mode

ChromaDB is perfect for getting started. It's what we use at Ailog for development and small deployments.

Tags

  • chromadb
  • vector database
  • setup
  • quickstart