Optimization · Intermediate

RAG Cost Optimization: Cut Spending by 90%

November 12, 2025
11 min read
Ailog Research Team

Reduce RAG costs from $10k to $1k/month: smart chunking, caching, model selection, and batch processing.

Cost Breakdown (Typical RAG)

Per 1M queries:

  • Embeddings: $100
  • Vector DB: $200
  • LLM calls: $5,000
  • Total: $5,300
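To make the per-query economics concrete, here is a small back-of-the-envelope model using the numbers above; the split is illustrative, so plug in your own traffic and prices.

python
# Back-of-the-envelope cost model for 1M queries/month (numbers from above)
QUERIES = 1_000_000

costs = {
    "embeddings": 100,    # query embedding calls
    "vector_db": 200,     # hosted vector database
    "llm_calls": 5_000,   # generation dominates the bill
}

total = sum(costs.values())       # $5,300/month
per_query = total / QUERIES       # ~$0.0053 per query
print(f"${total:,}/month, ~${per_query:.4f} per query")

# The optimizations below attack the biggest line item (LLM calls) first.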

1. Reduce Embedding Costs

Use smaller models:

python
# Before: text-embedding-3-large
#   Cost: $0.13 / 1M tokens
#   Dimensions: 3072
# After: text-embedding-3-small
#   Cost: $0.02 / 1M tokens (6.5x cheaper)
#   Dimensions: 1536
#   Performance: ~5% lower accuracy for most use cases

from openai import OpenAI

client = OpenAI()
embeddings = client.embeddings.create(
    input=texts,
    model="text-embedding-3-small"  # 6.5x cheaper
)

Or use open-source models:

python
# Free embeddings (self-hosted)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-small-en-v1.5')
embeddings = model.encode(texts)  # $0 cost

2. Smart Chunking

Fewer chunks = lower costs:

python
# Before: 500-token chunks → 10,000 chunks
chunk_size = 500
# Embedding cost: $100
# Storage cost: $50

# After: 800-token chunks → 6,250 chunks (37.5% fewer)
chunk_size = 800
# Embedding cost: $65 (-35%)
# Storage cost: $32 (-36%)

# Trade-off: Slightly less precise, but huge savings
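To make the 800-token setting concrete, here is a minimal token-based chunker sketch. It assumes tiktoken's cl100k_base encoding and adds a small overlap between chunks; both choices are illustrative, not part of the setup above.

python
import tiktoken

def chunk_text(text, chunk_size=800, overlap=80):
    # Count and split on model tokens, not characters
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_size - overlap  # small overlap preserves context across boundaries
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + chunk_size]))
    return chunks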

3. Aggressive Caching

Cache everything:

python
import hashlib
import redis

redis_client = redis.Redis(host='localhost', port=6379)

def cached_rag(query):
    # Check cache (90% hit rate → 90% cost savings)
    cache_key = f"rag:{hashlib.md5(query.encode()).hexdigest()}"
    cached = redis_client.get(cache_key)
    if cached:
        return cached.decode()  # $0 cost

    # Cache miss - do full RAG
    response = expensive_rag_pipeline(query)

    # Store for 24 hours
    redis_client.setex(cache_key, 86400, response)
    return response

# With 90% cache hit rate:
# Before: $5,300/month
# After: $530/month (-90%)

4. Use Smaller LLMs

python
# Before: GPT-4 Turbo
#   Cost: $10/1M input tokens, $30/1M output tokens
# After: GPT-4o-mini
#   Cost: $0.15/1M input, $0.60/1M output (~60x cheaper)
#   Performance: 80-90% as good for most RAG tasks

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # ~60x cheaper
    messages=[...]
)

# GPT-3.5 Turbo ($0.50/1M input, $1.50/1M output) is also far cheaper than
# GPT-4 Turbo, though no longer cheaper than GPT-4o-mini

5. Reduce Context Size

Fewer tokens to LLM = lower cost:

python
# Before: send top 10 docs (~5,000 tokens)
context = "\n\n".join(retrieve(query, k=10))
# Cost: 5,000 tokens * $10/1M = $0.05 per query (GPT-4 Turbo input)

# After: send top 3 docs (~1,500 tokens)
context = "\n\n".join(retrieve(query, k=3))
# Cost: 1,500 tokens * $10/1M = $0.015 per query (-70%)

# Or summarize the context first
from openai import OpenAI

client = OpenAI()

def compress_context(docs):
    summaries = []
    for doc in docs:
        summary = client.chat.completions.create(
            model="gpt-4o-mini",  # cheap model for summarization
            messages=[{
                "role": "user",
                "content": f"Summarize in 50 words: {doc}"
            }]
        )
        summaries.append(summary.choices[0].message.content)
    return "\n\n".join(summaries)

6. Batch Processing

Process multiple queries together:

python
# Instead of 1,000 individual API calls
for query in queries:
    embed(query)  # 1,000 calls

# Batch embed in a single request
from openai import OpenAI

client = OpenAI()
batch_embeddings = client.embeddings.create(
    input=queries,  # single call
    model="text-embedding-3-small"
)
# Savings: reduced latency and per-request overhead
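For offline workloads that can tolerate a delay, OpenAI also offers an asynchronous Batch API, advertised at roughly half the synchronous price at the time of writing (check current pricing). A minimal sketch, assuming you have prepared a requests.jsonl file and use the current openai Python SDK:

python
from openai import OpenAI

client = OpenAI()

# requests.jsonl: one JSON object per line, e.g.
# {"custom_id": "q-1", "method": "POST", "url": "/v1/embeddings",
#  "body": {"model": "text-embedding-3-small", "input": "..."}}
batch_file = client.files.create(
    file=open("requests.jsonl", "rb"),
    purpose="batch"
)

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/embeddings",
    completion_window="24h"  # results returned within 24 hours
)
print(batch.id, batch.status)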

7. Self-Hosted Vector DB

bash
# Before: Pinecone
#   Cost: $70/month for 1M vectors
# After: Qdrant (self-hosted)
#   Cost: ~$20/month (DigitalOcean droplet)
#   Savings: $50/month (-71%)

docker run -p 6333:6333 qdrant/qdrant
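Once the container is running, querying it from Python is straightforward. A minimal sketch using the qdrant-client package; the collection name, vector size, and the embedding/query_emb/chunk variables are illustrative placeholders for your own data:

python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(host="localhost", port=6333)

# One-time setup: a collection sized for 1536-dim embeddings (assumption)
client.recreate_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Upsert a chunk, then search with a query embedding
client.upsert(
    collection_name="docs",
    points=[PointStruct(id=1, vector=embedding, payload={"text": chunk})],
)
hits = client.search(collection_name="docs", query_vector=query_emb, limit=3)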

8. Lazy Reranking

Only rerank when necessary:

python
def smart_rerank(query, candidates):
    # If the top result already has a high score, skip reranking
    if candidates[0].score > 0.9:
        return candidates[:5]  # skip expensive reranking

    # Otherwise, rerank
    return rerank(query, candidates)

# Savings: 50% fewer reranking calls
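The rerank() call above is left abstract. One cheap way to implement it is a small local cross-encoder from sentence-transformers instead of an API-based reranker; a sketch, assuming each candidate exposes a .text attribute (a placeholder for however you store chunk text):

python
from sentence_transformers import CrossEncoder

# Small, CPU-friendly reranker (runs locally, no per-call API cost)
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query, candidates, top_k=5):
    pairs = [(query, c.text) for c in candidates]  # assumes a .text attribute
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]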

9. User Quotas

Prevent abuse:

python
import time

user_quotas = {}  # {user_id: [timestamp, timestamp, ...]}

def rate_limit(user_id, max_queries=100, window=3600):
    now = time.time()

    # Remove old queries outside the window
    if user_id in user_quotas:
        user_quotas[user_id] = [
            ts for ts in user_quotas[user_id]
            if now - ts < window
        ]
    else:
        user_quotas[user_id] = []

    # Check limit
    if len(user_quotas[user_id]) >= max_queries:
        raise Exception("Rate limit exceeded")

    # Add current query
    user_quotas[user_id].append(now)

10. Monitoring & Alerts

Track costs in real-time:

python
import prometheus_client

# Track costs
embedding_cost = prometheus_client.Counter(
    'rag_embedding_cost_usd', 'Total embedding API costs'
)
llm_cost = prometheus_client.Counter(
    'rag_llm_cost_usd', 'Total LLM API costs'
)

def track_embedding_cost(tokens):
    cost = tokens / 1_000_000 * 0.02  # $0.02/1M tokens
    embedding_cost.inc(cost)

def track_llm_cost(input_tokens, output_tokens):
    cost = (input_tokens / 1_000_000 * 0.15) + (output_tokens / 1_000_000 * 0.60)
    llm_cost.inc(cost)

# Set alerts when cost > $1,000/day

Complete Cost Optimization

python
from openai import OpenAI

client = OpenAI()

@cached  # 90% cache hit (see section 3)
def optimized_rag(query):
    # 1. Cheap embeddings
    query_emb = open_source_embed(query)  # free (self-hosted)

    # 2. Efficient retrieval (fewer docs)
    docs = vector_db.search(query_emb, limit=3)  # not 10

    # 3. Smart reranking (only if needed)
    if docs[0].score < 0.9:
        docs = fast_rerank(query, docs)  # TinyBERT, not GPT-4

    # 4. Compress context
    context = compress_context(docs)  # ~500 tokens, not 5,000

    # 5. Cheap LLM
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # ~60x cheaper
        messages=[{
            "role": "user",
            "content": f"Context: {context}\n\nQ: {query}"
        }]
    )
    return response.choices[0].message.content

# Cost reduction:
# - Embeddings: -100% (self-hosted)
# - Vector DB: -71% (self-hosted)
# - LLM: -60% (smaller model)
# - Cache: -90% (fewer calls)
# Total: ~95% cost reduction

Smart optimizations can cut RAG costs by 90%+ with little to no loss in answer quality.

Tags

optimization, cost, budget, efficiency
