Optimization · Intermediate

Caching Strategies to Reduce RAG Latency and Cost

November 20, 2025
10 min read
Ailog Research Team

Cut costs by 80%: implement semantic caching, embedding caching, and response caching for production RAG.

Why Cache?

Without caching:

  • Every query → API call ($$$)
  • 500ms+ latency
  • Rate-limit errors under load

With caching:

  • ~80% cost reduction at realistic hit rates
  • ~10ms responses on cache hits
  • Cache hits never touch the API, so no rate-limit pressure

1. Semantic Query Caching

Don't match only exact query strings; serve cached responses for semantically similar queries too:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
cache = []  # list of (query_embedding, response) pairs

def semantic_cache_lookup(query, threshold=0.95):
    # Normalized embeddings make the dot product a cosine similarity
    query_emb = model.encode(query, normalize_embeddings=True)

    # Linear scan is fine for small caches; use a vector index at scale
    for cached_emb, response in cache:
        similarity = float(np.dot(query_emb, cached_emb))
        if similarity > threshold:
            return response  # Cache hit
    return None  # Cache miss

def rag_with_cache(query):
    # Check the semantic cache first
    cached = semantic_cache_lookup(query)
    if cached is not None:
        return cached

    # Cache miss - run the full RAG pipeline
    response = full_rag_pipeline(query)

    # Store the query embedding alongside the response
    cache.append((model.encode(query, normalize_embeddings=True), response))
    return response
```
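
To make the behavior concrete, here is an illustrative call sequence (the queries are made up; whether the second call hits depends on the two embeddings clearing the 0.95 threshold):

```python
# First call: cache miss, runs the full pipeline and stores the result
rag_with_cache("How do I reset my password?")

# Paraphrase: served from the cache if its cosine similarity to the
# stored embedding exceeds the 0.95 threshold, otherwise a normal miss
rag_with_cache("What's the way to reset my password?")
```

In practice the threshold is the main knob: too low and users get answers to slightly different questions; too high and the hit rate collapses.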

2. Embedding Caching

Cache embeddings so you never pay to embed the same text twice:

```python
import hashlib

import numpy as np
import redis

redis_client = redis.Redis(host='localhost', port=6379)

def get_embedding_cached(text):
    # Key on a hash of the text so long documents make short keys
    cache_key = f"emb:{hashlib.md5(text.encode()).hexdigest()}"

    # Check cache
    cached = redis_client.get(cache_key)
    if cached is not None:
        return np.frombuffer(cached, dtype=np.float32)

    # Cache miss - compute the embedding (openai_embed is your embedding call)
    embedding = np.asarray(openai_embed(text), dtype=np.float32)

    # Store in cache, expire after 7 days
    redis_client.setex(
        cache_key,
        604800,  # 7 days in seconds
        embedding.tobytes()
    )
    return embedding
```

3. GPTCache Integration

The snippet below follows GPTCache's standard adapter pattern: initialize the cache with an embedding function, a data manager, and a similarity evaluator, then call the OpenAI client through the adapter so lookups and writes happen transparently (swap in your own backends as needed):

```python
from gptcache import cache
from gptcache.adapter import openai  # drop-in replacement for the openai client
from gptcache.embedding import OpenAI as OpenAIEmbedding
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

embedding = OpenAIEmbedding()
data_manager = get_data_manager(
    CacheBase("sqlite"),
    VectorBase("faiss", dimension=embedding.dimension),
)
cache.init(
    embedding_func=embedding.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)
cache.set_openai_key()

def cached_llm_call(prompt):
    # The adapter checks the semantic cache before calling the API
    # and stores the response on a miss - no manual get/set needed
    return openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
```

4. Two-Tier Caching

Combine a fast in-process LRU cache (L1) with a shared, persistent Redis cache (L2):

```python
from functools import lru_cache

import redis

redis_client = redis.Redis()

@lru_cache(maxsize=1000)  # L1: in-process memory, fastest
def cached_rag(query):
    # L2: shared, persistent Redis cache
    cached = redis_client.get(f"rag:{query}")
    if cached is not None:
        return cached.decode()

    # Miss in both tiers - run the pipeline
    result = rag_pipeline(query)

    # Store in L2 for other workers and future restarts
    redis_client.setex(f"rag:{query}", 3600, result)
    return result
```

5. Cache Invalidation

```python
import time

cache_with_ttl = {}

def get_with_ttl(key, ttl=3600):
    if key in cache_with_ttl:
        value, timestamp = cache_with_ttl[key]
        if time.time() - timestamp < ttl:
            return value
        del cache_with_ttl[key]  # Expired - drop the stale entry
    return None

def set_with_ttl(key, value):
    cache_with_ttl[key] = (value, time.time())
```
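
A quick, illustrative use of these helpers (the key and value are invented):

```python
answer = "Items can be returned within 30 days."
set_with_ttl("rag:refund-policy", answer)

get_with_ttl("rag:refund-policy")          # -> answer, while the entry is fresh
get_with_ttl("rag:refund-policy", ttl=60)  # -> None once the entry is over 60s old
```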

Cost Analysis

Without caching (1M queries/month):

  • Embeddings: $100
  • LLM: $3000
  • Total: $3100

With caching (80% hit rate):

  • Embeddings: $20
  • LLM: $600
  • Redis: $50
  • Total: $670 (78% savings)
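
To sanity-check these numbers for your own hit rate, here is a minimal sketch (the per-item costs are the illustrative figures above, not current list prices):

```python
def monthly_rag_cost(hit_rate, embed_cost, llm_cost, cache_infra=0.0):
    """Scale API spend by the miss rate; cache infrastructure is a flat cost."""
    miss_rate = 1.0 - hit_rate
    return embed_cost * miss_rate + llm_cost * miss_rate + cache_infra

baseline = monthly_rag_cost(0.0, embed_cost=100, llm_cost=3000)       # 3100.0
cached = monthly_rag_cost(0.8, embed_cost=100, llm_cost=3000,
                          cache_infra=50)                             # 670.0
savings = 1 - cached / baseline                                       # ~0.78

print(f"${baseline:.0f} -> ${cached:.0f} ({savings:.0%} savings)")    # $3100 -> $670 (78% savings)
```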

Caching is the lowest-hanging fruit for RAG optimization. Implement it early.

Tags

caching, optimization, cost, latency
