4. Storage · Intermediate
Pinecone for Production RAG at Scale
November 18, 2025
12 min read
Ailog Research Team
Deploy production-ready vector search: Pinecone setup, indexing strategies, and scaling to billions of vectors.
Why Pinecone?
- Fully managed (no ops)
- Scales to billions of vectors
- ~50 ms p95 query latency
- Built-in hybrid search
- SOC 2 compliant
Setup (November 2025)
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")

# Create index
pc.create_index(
    name="rag-production",
    dimension=1536,  # OpenAI text-embedding-3-small
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
)

index = pc.Index("rag-production")
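In production you usually want index creation to be idempotent so redeploys don't crash on an already-existing index. A minimal guard, assuming the current Python SDK's `list_indexes().names()` helper:

# Skip creation if the index already exists (safe to re-run)
if "rag-production" not in pc.list_indexes().names():
    pc.create_index(
        name="rag-production",
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )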
Upserting Documents
from openai import OpenAI

client = OpenAI()

def upsert_documents(documents):
    vectors = []
    for i, doc in enumerate(documents):
        # Generate embedding
        embedding = client.embeddings.create(
            model="text-embedding-3-small",
            input=doc['text']
        ).data[0].embedding

        vectors.append({
            "id": f"doc_{i}",
            "values": embedding,
            "metadata": {
                "text": doc['text'],
                "source": doc['source'],
                "date": doc['date']
            }
        })

    # Batch upsert (max 100 per batch)
    for i in range(0, len(vectors), 100):
        batch = vectors[i:i+100]
        index.upsert(vectors=batch)
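One embedding API call per document is the bottleneck at any real volume. The OpenAI embeddings endpoint accepts a list of inputs and returns results in order, so you can embed in chunks too. A sketch (the chunk size of 256 is an assumption, not an API limit):

def upsert_documents_batched(documents, embed_batch=256):
    vectors = []
    # Embed many texts per API call instead of one at a time
    for start in range(0, len(documents), embed_batch):
        chunk = documents[start:start + embed_batch]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=[doc["text"] for doc in chunk],
        )
        for offset, item in enumerate(response.data):
            doc = chunk[offset]
            vectors.append({
                "id": f"doc_{start + offset}",
                "values": item.embedding,
                "metadata": {
                    "text": doc["text"],
                    "source": doc["source"],
                    "date": doc["date"],
                },
            })

    # Same 100-vector upsert batching as above
    for i in range(0, len(vectors), 100):
        index.upsert(vectors=vectors[i:i + 100])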
Querying
def search(query, top_k=10):
    # Embed query with the same model used at indexing time
    query_embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding

    # Search
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True
    )

    return [match['metadata']['text'] for match in results['matches']]
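Usage is a one-liner; each result is a stored chunk of text, ready to join into a prompt (the question below is hypothetical):

# Retrieve context for a user question
contexts = search("How do I rotate API keys?", top_k=5)
prompt_context = "\n\n".join(contexts)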
Metadata Filtering
# Filter by source
results = index.query(
    vector=query_embedding,
    filter={"source": {"$eq": "wikipedia"}},
    top_k=10,
    include_metadata=True
)

# Date range -- note: $gte/$lte compare numbers, not strings,
# so store dates as numeric timestamps (e.g. Unix epoch seconds)
results = index.query(
    vector=query_embedding,
    filter={
        "date": {"$gte": 1735689600, "$lte": 1767225599}  # 2025-01-01 to 2025-12-31 UTC
    },
    top_k=10
)
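Filters compose with logical operators, so you can restrict to a source and a time window in one query:

# Combine conditions: recent Wikipedia-sourced chunks only
results = index.query(
    vector=query_embedding,
    filter={
        "$and": [
            {"source": {"$eq": "wikipedia"}},
            {"date": {"$gte": 1735689600}},
        ]
    },
    top_k=10,
    include_metadata=True,
)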
Namespaces (Multi-tenancy)
# Separate customer data
index.upsert(
    vectors=[...],
    namespace="customer_123"
)

# Query specific namespace
results = index.query(
    vector=query_embedding,
    namespace="customer_123",
    top_k=10
)
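Namespaces also make tenant offboarding cheap: you can drop one customer's vectors without touching anyone else's data.

# Remove all vectors for a departing tenant
index.delete(delete_all=True, namespace="customer_123")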
Hybrid Search (Sparse + Dense)
# Upsert with sparse vectors
index.upsert(
    vectors=[{
        "id": "doc1",
        "values": dense_vector,        # Dense embedding
        "sparse_values": {
            "indices": [10, 45, 123],  # Sparse (e.g. BM25) term indices
            "values": [0.5, 0.3, 0.2]
        },
        "metadata": {"text": "..."}
    }]
)

# Hybrid query: pass both a dense and a sparse vector
# (dense/sparse weighting is applied client-side; see the sketch below)
results = index.query(
    vector=dense_query,
    sparse_vector={
        "indices": [10, 45],
        "values": [0.6, 0.4]
    },
    top_k=10
)
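There is no `alpha` parameter on the query itself; the convention from Pinecone's own docs is to scale the two vectors client-side before querying (this assumes a dot-product index, and that `dense_query` and `sparse_query` come from your embedding pipeline):

def hybrid_scale(dense, sparse, alpha):
    """Weight dense vs. sparse: alpha=1.0 is pure dense, 0.0 pure sparse."""
    scaled_sparse = {
        "indices": sparse["indices"],
        "values": [v * (1 - alpha) for v in sparse["values"]],
    }
    scaled_dense = [v * alpha for v in dense]
    return scaled_dense, scaled_sparse

dense_q, sparse_q = hybrid_scale(dense_query, sparse_query, alpha=0.7)
results = index.query(vector=dense_q, sparse_vector=sparse_q, top_k=10)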
Cost Optimization
Serverless pricing (Nov 2025):
- $0.09 per million read units
- $2.00 per million write units
- $0.00025 per GB-hour storage
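To see what those rates mean for your workload, here is a rough back-of-envelope calculator using the list prices above (actual read/write unit consumption depends on index size and query fan-out):

# Rough monthly cost estimate from the list prices above
READ_PER_M, WRITE_PER_M, STORAGE_GB_HR = 0.09, 2.00, 0.00025

def monthly_cost(reads_m, writes_m, storage_gb, hours=730):
    return (reads_m * READ_PER_M
            + writes_m * WRITE_PER_M
            + storage_gb * hours * STORAGE_GB_HR)

# e.g. 50M reads, 5M writes, 100 GB stored
print(f"${monthly_cost(50, 5, 100):,.2f}/month")  # ~$32.75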
Tips:
- Use serverless for variable load
- Batch upserts (100 per request)
- Cache frequent queries (sketch below)
- Delete old data
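Since read units dominate cost for most RAG workloads, even a naive in-process cache keyed on the raw query string pays off. A minimal sketch on top of the `search` function above (LRU only, no TTL, so treat it as a starting point):

from functools import lru_cache

@lru_cache(maxsize=1024)
def cached_search(query: str, top_k: int = 10) -> tuple:
    # Return a tuple so cached results can't be mutated by callers
    return tuple(search(query, top_k))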
Monitoring
# Index stats
stats = index.describe_index_stats()
print(f"Total vectors: {stats['total_vector_count']}")
print(f"Dimension: {stats['dimension']}")
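The same call reports per-namespace counts, which is handy for spotting tenants growing faster than expected (assuming the dict-style access used above):

# Per-namespace vector counts (useful for multi-tenant monitoring)
for ns, ns_stats in stats["namespaces"].items():
    print(f"{ns}: {ns_stats['vector_count']} vectors")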
Pinecone is production-grade. Use it when you need scale, reliability, and zero ops.
Tags
pinecone · vector database · production · scale
Related Guides
guides · intermediate
Vector Databases: Storing and Searching Embeddings
Comprehensive guide to vector databases for RAG: comparison of popular options, indexing strategies, and performance optimization.
14 min read
guides · advanced
Milvus: Billion-Scale Vector Search
Deploy Milvus for production-scale RAG handling billions of vectors with horizontal scaling and GPU acceleration.
13 min read
guides · intermediate
Weaviate: GraphQL-Powered Vector Database
Set up Weaviate for production RAG with GraphQL queries, hybrid search, and generative modules.
12 min read