Caching Strategies to Reduce RAG Latency and Cost
Cut costs by 80%: implement semantic caching, embedding caching, and response caching for production RAG.
- Author: Ailog Research Team
- Published
- Reading time: 10 min read
- Level: intermediate
- RAG Pipeline Step: Optimization
Why Cache?
Without caching:
- Every query → API call ($$$)
- 500ms+ latency
- Rate limits

With caching:
- 80% cost reduction
- 10ms cache hits
- No rate limits

Semantic Query Caching
Don't match only on exact query strings - serve cached responses for semantically similar queries:
```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
cache = []  # list of (embedding, response) pairs - numpy arrays can't be dict keys

def semantic_cache_lookup(query, threshold=0.95):
    query_emb = model.encode(query, normalize_embeddings=True)
    # Check if a similar query is already in the cache
    for cached_emb, response in cache:
        similarity = np.dot(query_emb, cached_emb)  # cosine similarity (vectors are normalized)
        if similarity > threshold:
            return response  # Cache hit!
    return None  # Cache miss

def rag_with_cache(query):
    # Check the semantic cache first
    cached = semantic_cache_lookup(query)
    if cached is not None:
        return cached
    # Cache miss - run the full RAG pipeline (retrieval + generation)
    response = full_rag_pipeline(query)
    # Store the query embedding and response for future lookups
    cache.append((model.encode(query, normalize_embeddings=True), response))
    return response
```
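A quick usage sketch (assuming `full_rag_pipeline` is your existing RAG entry point, as above): the first call populates the cache, and a near-duplicate query can then be answered without another pipeline run if its similarity clears the threshold.

```python
# First call: cache miss, runs the full pipeline and stores the result
rag_with_cache("How do I reset my password?")

# Near-duplicate phrasing: served from the cache if cosine similarity > 0.95
rag_with_cache("How can I reset my password?")
```

Note that the lookup is a linear scan over cached entries, which is fine for small caches but worth replacing with a vector index as the cache grows.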
Embedding Caching

Cache embeddings to avoid re-computing them:
```python
import hashlib

import numpy as np
import redis

redis_client = redis.Redis(host='localhost', port=6379)

def get_embedding_cached(text):
    # Build a deterministic cache key from the text
    cache_key = f"emb:{hashlib.md5(text.encode()).hexdigest()}"
    # Check the cache
    cached = redis_client.get(cache_key)
    if cached:
        return np.frombuffer(cached, dtype=np.float32)
    # Compute the embedding
    embedding = openai_embed(text)
    # Store in cache, expiring after 7 days (604800 seconds)
    redis_client.setex(cache_key, 604800, embedding.tobytes())
    return embedding
```
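The `openai_embed` helper isn't defined in this post; a minimal sketch using the legacy `openai` Python client (the model choice here is an assumption) could look like:

```python
import numpy as np
import openai

def openai_embed(text):
    # Hypothetical helper: fetch an embedding via the legacy OpenAI client
    response = openai.Embedding.create(
        model="text-embedding-ada-002",  # assumed model
        input=text,
    )
    # Return float32 so .tobytes() / np.frombuffer() round-trip losslessly
    return np.array(response["data"][0]["embedding"], dtype=np.float32)
```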
GPTCache Integration

```python
import openai
from gptcache import Cache
from gptcache.embedding import OpenAI
from gptcache.similarity_evaluation import SearchDistanceEvaluation

cache = Cache()
cache.init(
    embedding_func=OpenAI().to_embeddings,
    similarity_evaluation=SearchDistanceEvaluation(),
)

def cached_llm_call(prompt):
    # Check the semantic cache
    cached_response = cache.get(prompt)
    if cached_response:
        return cached_response
    # Cache miss: call the LLM
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    # Store the response for similar future prompts
    cache.set(prompt, response)
    return response
```

Two-Tier Caching
Combine a fast in-memory cache (L1) with persistent Redis (L2):
```python
from functools import lru_cache

import redis

redis_client = redis.Redis()

@lru_cache(maxsize=1000)  # L1: in-memory, per-process
def l1_cache(query):
    # L2 cache (Redis): shared and persistent
    cached = redis_client.get(f"rag:{query}")
    if cached:
        return cached.decode()
    # Miss in both tiers - compute the answer
    result = rag_pipeline(query)
    # Store in L2 with a 1-hour TTL
    redis_client.setex(f"rag:{query}", 3600, result)
    return result
```
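Repeated calls with an identical query string are then served from the in-memory tier without a Redis round trip, and `functools.lru_cache` exposes hit/miss counters for free (the query below is just an example):

```python
l1_cache("What is our refund policy?")   # miss: falls through to Redis, then the pipeline
l1_cache("What is our refund policy?")   # hit: served straight from process memory
print(l1_cache.cache_info())             # CacheInfo(hits=1, misses=1, maxsize=1000, currsize=1)
```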
Cache Invalidation

```python
import time

cache_with_ttl = {}

def get_with_ttl(key, ttl=3600):
    if key in cache_with_ttl:
        value, timestamp = cache_with_ttl[key]
        if time.time() - timestamp < ttl:
            return value
        else:
            # Expired - evict the stale entry
            del cache_with_ttl[key]
    return None

def set_with_ttl(key, value):
    cache_with_ttl[key] = (value, time.time())
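A sketch of wiring this into the pipeline (reusing the `rag_pipeline` name assumed in the two-tier example):

```python
def rag_with_ttl(query):
    # Serve from the TTL cache while the entry is still fresh
    cached = get_with_ttl(query)
    if cached is not None:
        return cached
    # Expired or never cached: recompute and re-stamp
    result = rag_pipeline(query)
    set_with_ttl(query, result)
    return result
```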
Cost Analysis
Without caching (1M queries/month):
- Embeddings: $100
- LLM: $3000
- Total: $3100

With caching (80% hit rate):
- Embeddings: $20
- LLM: $600
- Redis: $50
- Total: $670 (78% savings)
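As a back-of-the-envelope check of those totals (the dollar figures are the estimates above, not quoted prices):

```python
embeddings, llm = 100, 3000                 # monthly spend without caching
no_cache_total = embeddings + llm           # $3100
hit_rate, redis = 0.80, 50                  # 80% hit rate; assumed managed-Redis cost
with_cache_total = (1 - hit_rate) * no_cache_total + redis   # 0.2 * 3100 + 50 = $670
savings = 1 - with_cache_total / no_cache_total              # ~0.78, i.e. 78%
print(f"${with_cache_total:.0f}/month, {savings:.0%} savings")
```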
Caching is the lowest-hanging fruit for RAG optimization. Implement it early.