Caching Strategies to Reduce RAG Latency and Cost

Cut costs by 80%: implement semantic caching, embedding caching, and response caching for production RAG.

Author
Ailog Research Team
Published
November 20, 2025
Reading time
10 min read
Level
intermediate
RAG Pipeline Step
Optimization

Why Cache?

Without caching:

  • Every query → API call ($$$)
  • 500ms+ latency
  • Rate limits

With caching:

  • 80% cost reduction
  • 10ms cache hits
  • No rate limits

1. Semantic Query Caching

Don't match cached queries on exact strings alone; reuse responses for semantically similar queries:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
cache = []  # list of (embedding, response) pairs

def semantic_cache_lookup(query, threshold=0.95):
    query_emb = model.encode(query, normalize_embeddings=True)
    # Check if a similar query is already in the cache
    for cached_emb, response in cache:
        similarity = np.dot(query_emb, cached_emb)  # cosine similarity (vectors are normalized)
        if similarity > threshold:
            return response  # Cache hit!
    return None  # Cache miss

def rag_with_cache(query):
    # Check cache first
    cached = semantic_cache_lookup(query)
    if cached is not None:
        return cached
    # Cache miss - run the full RAG pipeline
    response = full_rag_pipeline(query)
    # Store in cache
    cache.append((model.encode(query, normalize_embeddings=True), response))
    return response
```
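A minimal usage sketch, assuming `full_rag_pipeline` is the placeholder pipeline referenced above; the queries are hypothetical, and the paraphrased second one is served from the cache if its similarity clears the 0.95 threshold:

```python
# Hypothetical queries: the first call misses and runs the pipeline,
# the paraphrase is intended to hit the semantic cache
print(rag_with_cache("How do I reset my password?"))
print(rag_with_cache("How can I reset my password"))
```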

2. Embedding Caching

Cache embeddings to avoid recomputing them:

```python
import hashlib

import numpy as np
import redis

redis_client = redis.Redis(host='localhost', port=6379)

def get_embedding_cached(text):
    # Create cache key
    cache_key = f"emb:{hashlib.md5(text.encode()).hexdigest()}"
    # Check cache
    cached = redis_client.get(cache_key)
    if cached:
        return np.frombuffer(cached, dtype=np.float32)
    # Compute embedding (cast to float32 so the byte round-trip above stays consistent)
    embedding = np.asarray(openai_embed(text), dtype=np.float32)
    # Store in cache (expire after 7 days)
    redis_client.setex(
        cache_key,
        604800,  # 7 days
        embedding.tobytes()
    )
    return embedding
```
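A quick sketch of the read-through behaviour, assuming `openai_embed` is your embedding call as above; the repeated lookup is answered from Redis instead of recomputing:

```python
# First call computes and stores the embedding; the repeat reads the bytes back from Redis
text = "What is retrieval-augmented generation?"  # hypothetical document chunk
first = get_embedding_cached(text)
second = get_embedding_cached(text)
assert np.allclose(first, second)
```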

3. GPTCache Integration

GPTCache packages semantic caching for LLM calls:

```python
import openai
from gptcache import Cache
from gptcache.embedding import OpenAI
from gptcache.similarity_evaluation import SearchDistanceEvaluation

cache = Cache()
cache.init(
    embedding_func=OpenAI().to_embeddings,
    similarity_evaluation=SearchDistanceEvaluation(),
)

def cached_llm_call(prompt):
    # Check cache
    cached_response = cache.get(prompt)
    if cached_response:
        return cached_response
    # Call LLM
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    # Cache response
    cache.set(prompt, response)
    return response
```
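A short usage sketch of the wrapper above with a hypothetical prompt; the repeated call is meant to be answered from the cache rather than hitting the API again:

```python
# Hypothetical prompt: the second call should be served from the cache,
# and with the similarity evaluator configured above, close paraphrases may hit as well
print(cached_llm_call("Summarize our refund policy in two sentences."))
print(cached_llm_call("Summarize our refund policy in two sentences."))
```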

4. Two-Tier Caching

Fast in-memory + persistent Redis:

```python
from functools import lru_cache

import redis

redis_client = redis.Redis()

@lru_cache(maxsize=1000)  # L1 cache (in-memory, per process)
def l1_cache(query):
    # L2 cache (Redis)
    cached = redis_client.get(f"rag:{query}")
    if cached:
        return cached.decode()
    # Cache miss - compute
    result = rag_pipeline(query)
    # Store in L2
    redis_client.setex(f"rag:{query}", 3600, result)
    return result
```
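The trade-off: `lru_cache` answers repeat queries in the same process without a network hop, while Redis survives restarts and is shared across workers. A minimal sketch, assuming `rag_pipeline` is the placeholder used above and the query is hypothetical:

```python
# Same process, same query: the second call is served by lru_cache without touching Redis.
# After a restart the in-memory tier is empty, but Redis still answers within its 1-hour TTL.
answer = l1_cache("Which regions is the service available in?")
answer_again = l1_cache("Which regions is the service available in?")
```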

5. Cache Invalidation

Attach a TTL to cached entries so stale responses are evicted on read:

```python
import time

cache_with_ttl = {}

def get_with_ttl(key, ttl=3600):
    if key in cache_with_ttl:
        value, timestamp = cache_with_ttl[key]
        if time.time() - timestamp < ttl:
            return value
        else:
            del cache_with_ttl[key]  # Expired
    return None

def set_with_ttl(key, value):
    cache_with_ttl[key] = (value, time.time())
```
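How the two helpers fit together in a read-through pattern; the cache key is hypothetical and `rag_pipeline` is the placeholder used earlier:

```python
# Read through the TTL cache, recompute only when the entry is missing or expired
key = "rag:refund-policy"  # hypothetical cache key
answer = get_with_ttl(key)
if answer is None:
    answer = rag_pipeline("What is the refund policy?")
    set_with_ttl(key, answer)
```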

Cost Analysis

Without caching (1M queries/month):

  • Embeddings: $100
  • LLM: $3000
  • Total: $3100

With caching (80% hit rate):

  • Embeddings: $20
  • LLM: $600
  • Redis: $50
  • Total: $670 (78% savings)
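As a sanity check on those figures, a small sketch of the arithmetic; `monthly_cost` is a hypothetical helper, and the prices and 80% hit rate are the illustrative numbers above, not measurements:

```python
def monthly_cost(embed_usd, llm_usd, hit_rate=0.0, cache_infra_usd=0.0):
    # Only cache misses pay for embedding and LLM calls
    miss_rate = 1.0 - hit_rate
    return (embed_usd + llm_usd) * miss_rate + cache_infra_usd

baseline = monthly_cost(100, 3000)                                  # $3100/month
cached = monthly_cost(100, 3000, hit_rate=0.8, cache_infra_usd=50)  # $670/month
print(f"savings: {1 - cached / baseline:.0%}")                      # ~78%
```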

Caching is the lowest-hanging fruit for RAG optimization. Implement it early.

Tags

  • caching
  • optimization
  • cost
  • latency