Caching Strategies to Reduce RAG Latency and Cost
Cut costs by 80%: implement semantic caching, embedding caching, and response caching for production RAG.
Why Cache?
Without caching:
- Every query → API call ($$$)
- 500ms+ latency
- Rate limits
With caching:
- 80% cost reduction
- 10ms cache hits
- Far fewer rate-limit errors
1. Semantic Query Caching
Don't limit the cache to exact string matches - serve cached responses for semantically similar queries too:
```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
cache = []  # list of (embedding, response) pairs

def semantic_cache_lookup(query_emb, threshold=0.95):
    # Scan cached queries for one that is semantically close enough
    for cached_emb, response in cache:
        similarity = np.dot(query_emb, cached_emb)  # cosine similarity (embeddings are normalized)
        if similarity > threshold:
            return response  # Cache hit
    return None  # Cache miss

def rag_with_cache(query):
    # Embed once and reuse for both lookup and storage
    query_emb = model.encode(query, normalize_embeddings=True)

    cached = semantic_cache_lookup(query_emb)
    if cached is not None:
        return cached

    # Cache miss - run the full RAG pipeline
    response = full_rag_pipeline(query)

    # Store for future similar queries
    cache.append((query_emb, response))
    return response
```
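The linear scan above is fine for a few thousand cached entries. Beyond that, store the cached query embeddings in a vector index (FAISS, or the same vector database you already use for retrieval) so lookups stay fast as the cache grows.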
2. Embedding Caching
Cache embeddings to avoid re-computing:
```python
import hashlib

import numpy as np
import redis

redis_client = redis.Redis(host='localhost', port=6379)

def get_embedding_cached(text):
    # Key on a hash of the text so keys stay short and deterministic
    cache_key = f"emb:{hashlib.md5(text.encode()).hexdigest()}"

    # Check cache
    cached = redis_client.get(cache_key)
    if cached:
        return np.frombuffer(cached, dtype=np.float32)

    # Compute the embedding (openai_embed is your embedding helper)
    embedding = np.asarray(openai_embed(text), dtype=np.float32)

    # Store in cache, expiring after 7 days
    redis_client.setex(
        cache_key,
        604800,  # 7 days in seconds
        embedding.tobytes()
    )
    return embedding
```
3. GPTCache Integration
GPTCache packages this pattern for you: it wraps the OpenAI client and answers from a vector-indexed cache of past prompts before calling the API. A minimal setup following GPTCache's documented OpenAI-adapter usage:
```python
from gptcache import cache
from gptcache.adapter import openai  # drop-in replacement for the openai module
from gptcache.embedding import OpenAI
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

# Embed prompts with OpenAI embeddings and index them in FAISS for similarity search
embedder = OpenAI()
data_manager = get_data_manager(
    CacheBase("sqlite"),
    VectorBase("faiss", dimension=embedder.dimension),
)
cache.init(
    embedding_func=embedder.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)
cache.set_openai_key()

def cached_llm_call(prompt):
    # The adapter checks the semantic cache first and only calls the API on a miss
    return openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
```
4. Two-Tier Caching
Combine a fast in-process L1 cache with a shared, persistent Redis L2 cache:
```python
from functools import lru_cache

import redis

redis_client = redis.Redis()

@lru_cache(maxsize=1000)  # L1: in-process memoization of the 1000 most recent queries
def two_tier_lookup(query):
    # L2: shared Redis cache
    cached = redis_client.get(f"rag:{query}")
    if cached:
        return cached.decode()

    # Miss in both tiers - compute the answer
    result = rag_pipeline(query)

    # Store in L2 with a 1-hour TTL (L1 is populated by lru_cache on return)
    redis_client.setex(f"rag:{query}", 3600, result)
    return result
```
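Keep in mind that the lru_cache tier lives inside a single Python process: in a multi-worker deployment each worker has its own L1, while Redis is shared across workers and survives restarts, so most cross-worker hits will come from L2.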
5. Cache Invalidation
```python
import time

cache_with_ttl = {}  # key -> (value, timestamp)

def get_with_ttl(key, ttl=3600):
    if key in cache_with_ttl:
        value, timestamp = cache_with_ttl[key]
        if time.time() - timestamp < ttl:
            return value
        del cache_with_ttl[key]  # Expired - drop the stale entry
    return None

def set_with_ttl(key, value):
    cache_with_ttl[key] = (value, time.time())
```
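TTLs handle gradual staleness, but when the underlying documents change you usually want cached answers gone immediately. One common approach is to include a corpus version in every cache key and bump it on re-ingestion. The sketch below is a minimal illustration, assuming the Redis client from the earlier examples and a hypothetical "corpus:version" key:
```python
import redis

redis_client = redis.Redis()

def corpus_version():
    # Current corpus version; defaults to 1 if it has never been set
    return int(redis_client.get("corpus:version") or 1)

def versioned_key(query):
    # Old keys become unreachable (and age out via TTL) once the version is bumped
    return f"rag:v{corpus_version()}:{query}"

def invalidate_all_cached_answers():
    # Call this after re-ingesting or updating documents
    redis_client.incr("corpus:version")
```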
Cost Analysis
Without caching (1M queries/month):
- Embeddings: $100
- LLM: $3000
- Total: $3100
With caching (80% hit rate):
- Embeddings: $20
- LLM: $600
- Redis: $50
- Total: $670 (78% savings)
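The arithmetic generalizes: with cache hit rate h, the variable API spend scales by (1 - h), while the cache infrastructure is a fixed cost. A quick sketch using the figures from this example (assumed prices, not quotes):
```python
def monthly_cost(hit_rate, embed_cost=100.0, llm_cost=3000.0, cache_infra=50.0):
    """Estimated monthly spend for a given cache hit rate.

    embed_cost and llm_cost are the uncached monthly API spend from the
    1M-queries/month example above; cache_infra is the fixed Redis cost.
    """
    miss_rate = 1.0 - hit_rate
    return (embed_cost + llm_cost) * miss_rate + cache_infra

print(monthly_cost(0.0))   # 3150.0 - caching infra in place but no hits yet
print(monthly_cost(0.8))   # 670.0  - matches the 78% savings above
print(monthly_cost(0.9))   # 360.0
```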
Caching is the lowest-hanging fruit for RAG optimization. Implement it early.
Related Guides
Context Window Optimization: Managing Token Limits
Strategies for fitting more information in limited context windows: compression, summarization, smart selection, and window management techniques.
RAG Cost Optimization: Cut Spending by 90%
Reduce RAG costs from $10k to $1k/month: smart chunking, caching, model selection, and batch processing.
Reduce RAG Latency: From 2000ms to 200ms
10x faster RAG: parallel retrieval, streaming responses, and architectural optimizations for sub-200ms latency.