Storage · Beginner

ChromaDB Setup for RAG Applications

November 12, 2025
9 min read
Ailog Research Team

Get started with ChromaDB: a lightweight, fast vector database that is perfect for prototyping and scales to production RAG systems.

Why ChromaDB?

ChromaDB is the fastest way to get started with vector search:

  • ✅ No infrastructure required
  • ✅ Runs in-memory or persistent
  • ✅ Built-in embedding functions
  • ✅ Perfect for prototyping
  • ✅ Scales to production

Installation (November 2025)

```bash
pip install chromadb

# Optional: server extras for running ChromaDB in client/server mode
pip install chromadb[server]
```
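A quick sanity check after installing is to import the package and create a throwaway in-memory client (a minimal sketch, not part of the official setup):

```python
import chromadb

# Print the installed version and confirm an in-memory client can be created
print(chromadb.__version__)
client = chromadb.Client()
```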

Quick Start

```python
import chromadb

# Create client (in-memory)
client = chromadb.Client()

# Or persistent
client = chromadb.PersistentClient(path="./chroma_db")

# Create collection
collection = client.create_collection(
    name="my_documents",
    metadata={"description": "RAG knowledge base"}
)
```

Adding Documents

```python
# Add documents with automatic embedding
collection.add(
    documents=[
        "This is about machine learning",
        "Python is a programming language",
        "Vector databases store embeddings"
    ],
    metadatas=[
        {"source": "ml_book", "page": 1},
        {"source": "python_guide", "page": 5},
        {"source": "db_manual", "page": 12}
    ],
    ids=["doc1", "doc2", "doc3"]
)
```

Using Custom Embeddings

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings
texts = ["doc1 text", "doc2 text"]
embeddings = model.encode(texts).tolist()

# Add with pre-computed embeddings
collection.add(
    embeddings=embeddings,
    documents=texts,
    ids=["id1", "id2"]
)
```
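The "built-in embedding functions" mentioned earlier can handle this step for you: Chroma ships a SentenceTransformer wrapper in `chromadb.utils.embedding_functions`, so the collection embeds documents automatically on `add()` and `query()`. A minimal sketch, assuming sentence-transformers is installed:

```python
from chromadb.utils import embedding_functions

# Built-in wrapper around sentence-transformers; Chroma calls it for you
sentence_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

collection = client.create_collection(
    name="auto_embedded_docs",
    embedding_function=sentence_ef
)

# No manual encode() step needed
collection.add(
    documents=["doc1 text", "doc2 text"],
    ids=["id1", "id2"]
)
```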

Querying

```python
# Simple similarity search
results = collection.query(
    query_texts=["machine learning algorithms"],
    n_results=5
)

print(results['documents'])
print(results['distances'])
print(results['metadatas'])
```
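Note that each field in the result is a list of lists, with one inner list per entry in `query_texts`. For a RAG pipeline, that usually means indexing into the first list and joining the retrieved chunks into a context string; a minimal sketch using the metadata added earlier:

```python
# query() returns one inner list per query text, so index with [0]
docs = results['documents'][0]
metas = results['metadatas'][0]

# Assemble retrieved chunks into a context string for the LLM prompt
context = "\n\n".join(
    f"[{meta['source']} p.{meta['page']}] {doc}"
    for doc, meta in zip(docs, metas)
)
print(context)
```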

Advanced Filtering

```python
# Filter by metadata
results = collection.query(
    query_texts=["python"],
    n_results=10,
    where={"source": "python_guide"}  # Only from python_guide
)

# Filter by document content
results = collection.query(
    query_texts=["databases"],
    where_document={"$contains": "vector"}  # Document must contain "vector"
)
```
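Metadata conditions can also be combined: Chroma's `where` syntax supports logical operators (`$and`, `$or`) and comparison operators (`$gt`, `$gte`, `$in`, ...). A short sketch combining two conditions:

```python
# Combine metadata conditions with logical and comparison operators
results = collection.query(
    query_texts=["python"],
    n_results=10,
    where={
        "$and": [
            {"source": {"$in": ["python_guide", "db_manual"]}},
            {"page": {"$gte": 5}}
        ]
    }
)
```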

Updating Documents

```python
# Update existing document
collection.update(
    ids=["doc1"],
    documents=["Updated content about ML"],
    metadatas=[{"source": "ml_book", "page": 2, "updated": True}]
)

# Delete documents
collection.delete(ids=["doc2"])
```
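When you don't know whether an ID already exists, `upsert()` adds new records and updates existing ones in a single call:

```python
# upsert() updates the record if the ID exists, otherwise inserts it
collection.upsert(
    ids=["doc1", "doc4"],
    documents=[
        "Updated content about ML",
        "A brand new document about retrieval"
    ],
    metadatas=[
        {"source": "ml_book", "page": 2},
        {"source": "rag_notes", "page": 1}
    ]
)
```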

Production Setup (2025)

For production, run ChromaDB in client/server mode:

```bash
# Start server
chroma run --host 0.0.0.0 --port 8000
```
```python
# Connect from client
import chromadb

client = chromadb.HttpClient(
    host="localhost",
    port=8000
)
```
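A heartbeat call is a quick way to verify the client can actually reach the server:

```python
# Raises an exception if the server is unreachable;
# otherwise returns a timestamp in nanoseconds
print(client.heartbeat())

# List collections to confirm the server state is visible
print(client.list_collections())
```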

Docker Deployment

```yaml
# docker-compose.yml
version: '3.8'

services:
  chromadb:
    image: chromadb/chroma:latest
    ports:
      - "8000:8000"
    volumes:
      - chroma_data:/chroma/chroma
    environment:
      - ALLOW_RESET=true

volumes:
  chroma_data:
```

Performance Tuning

```python
# Batch operations for speed
batch_size = 100

for i in range(0, len(docs), batch_size):
    batch = docs[i:i+batch_size]
    collection.add(
        documents=batch,
        ids=[f"doc_{j}" for j in range(i, i+len(batch))]
    )
```

Hybrid Search (ChromaDB + BM25)

```python
from rank_bm25 import BM25Okapi

# BM25 for keyword search
tokenized_docs = [doc.split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)

# Query both
def hybrid_search(query, n_results=5, alpha=0.7):
    # Vector search (ChromaDB)
    vector_results = collection.query(
        query_texts=[query],
        n_results=n_results * 2
    )

    # Keyword search (BM25)
    bm25_scores = bm25.get_scores(query.split())

    # Merge results: one simple approach is to turn distances into
    # similarities, scale BM25 scores to [0, 1], and weight with alpha
    scores = {}
    max_bm25 = max(bm25_scores) or 1.0
    for doc, dist in zip(vector_results['documents'][0],
                         vector_results['distances'][0]):
        scores[doc] = alpha * (1.0 / (1.0 + dist))
    for doc, score in zip(documents, bm25_scores):
        scores[doc] = scores.get(doc, 0.0) + (1 - alpha) * (score / max_bm25)

    top_results = sorted(scores, key=scores.get, reverse=True)[:n_results]
    return top_results
```

ChromaDB vs Alternatives (Nov 2025)

| Feature  | ChromaDB    | Pinecone     | Qdrant            |
|----------|-------------|--------------|-------------------|
| Setup    | Instant     | Cloud signup | Docker            |
| Cost     | Free        | $70/month+   | Free/self-hosted  |
| Scale    | 1M+ vectors | Billions     | Billions          |
| Best for | Prototyping | Production   | Advanced features |

Migration to Production DB

When outgrowing ChromaDB:

```python
# Export from ChromaDB (embeddings are not returned by default)
chroma_docs = collection.get(include=["documents", "embeddings", "metadatas"])

# Import to Pinecone/Qdrant (production_db stands in for the target client)
for doc_id, doc, emb in zip(chroma_docs['ids'],
                            chroma_docs['documents'],
                            chroma_docs['embeddings']):
    production_db.upsert(doc_id, doc, emb)
```

Best Practices

  1. Use persistent mode for important data
  2. Batch operations for performance
  3. Index metadata fields you filter on
  4. Monitor collection size (ChromaDB is best under ~10M vectors)
  5. Back up regularly if using persistent mode (see the sketch below for points 4 and 5)
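For points 4 and 5, a minimal sketch: `collection.count()` reports how many records a collection holds, and because PersistentClient keeps everything under a single directory, a backup can be as simple as copying that directory while no writes are in flight (paths here are illustrative):

```python
import shutil
from datetime import datetime

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("my_documents")

# 4. Monitor collection size
print(f"{collection.name}: {collection.count()} records")

# 5. Back up the persistent store by copying its directory
#    (do this while no writes are in progress)
backup_dir = f"./backups/chroma_db_{datetime.now():%Y%m%d_%H%M%S}"
shutil.copytree("./chroma_db", backup_dir)
print(f"Backed up to {backup_dir}")
```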

ChromaDB is perfect for getting started. It's what we use at Ailog for development and small deployments.

Tags

chromadb, vector database, setup, quickstart
