Caching Strategies to Reduce RAG Latency and Cost
Cut costs by 80%: implement semantic caching, embedding caching, and response caching for production RAG.
Why Cache?
Without caching:
- Every query → API call ($$$)
- 500ms+ latency
- Rate limits
With caching:
- 80% cost reduction
- 10ms cache hits
- Far fewer rate-limit errors
1. Semantic Query Caching
Don't limit the cache to exact string matches - serve cached responses for semantically similar queries too:
```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
cache = []  # list of (embedding, response) pairs

def semantic_cache_lookup(query_emb, threshold=0.95):
    # Scan cached queries for one that is semantically close enough
    for cached_emb, response in cache:
        similarity = np.dot(query_emb, cached_emb)  # cosine similarity (embeddings are normalized)
        if similarity > threshold:
            return response  # Cache hit
    return None  # Cache miss

def rag_with_cache(query):
    # Embed once and reuse for both lookup and storage
    query_emb = model.encode(query, normalize_embeddings=True)

    cached = semantic_cache_lookup(query_emb)
    if cached is not None:
        return cached

    # Cache miss - run the full RAG pipeline
    response = full_rag_pipeline(query)

    # Store for future similar queries
    cache.append((query_emb, response))
    return response
```
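The linear scan above is fine for a few thousand cached entries. Beyond that, store the cached query embeddings in a vector index (FAISS, or the same vector database you already use for retrieval) so lookups stay fast as the cache grows.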
2. Embedding Caching
Cache embeddings to avoid re-computing:
```python
import hashlib

import numpy as np
import redis

redis_client = redis.Redis(host='localhost', port=6379)

def get_embedding_cached(text):
    # Key on a hash of the text so keys stay short and deterministic
    cache_key = f"emb:{hashlib.md5(text.encode()).hexdigest()}"

    # Check cache
    cached = redis_client.get(cache_key)
    if cached:
        return np.frombuffer(cached, dtype=np.float32)

    # Compute the embedding (openai_embed is your embedding helper)
    embedding = np.asarray(openai_embed(text), dtype=np.float32)

    # Store in cache, expiring after 7 days
    redis_client.setex(
        cache_key,
        604800,  # 7 days in seconds
        embedding.tobytes()
    )
    return embedding
```
3. GPTCache Integration
GPTCache packages this pattern for you: it wraps the OpenAI client and answers from a vector-indexed cache of past prompts before calling the API. A minimal setup following GPTCache's documented OpenAI-adapter usage:
```python
from gptcache import cache
from gptcache.adapter import openai  # drop-in replacement for the openai module
from gptcache.embedding import OpenAI
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

# Embed prompts with OpenAI embeddings and index them in FAISS for similarity search
embedder = OpenAI()
data_manager = get_data_manager(
    CacheBase("sqlite"),
    VectorBase("faiss", dimension=embedder.dimension),
)
cache.init(
    embedding_func=embedder.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)
cache.set_openai_key()

def cached_llm_call(prompt):
    # The adapter checks the semantic cache first and only calls the API on a miss
    return openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
```
4. Two-Tier Caching
Combine a fast in-process L1 cache with a shared, persistent Redis L2 cache:
```python
from functools import lru_cache

import redis

redis_client = redis.Redis()

@lru_cache(maxsize=1000)  # L1: in-process memoization of the 1000 most recent queries
def two_tier_lookup(query):
    # L2: shared Redis cache
    cached = redis_client.get(f"rag:{query}")
    if cached:
        return cached.decode()

    # Miss in both tiers - compute the answer
    result = rag_pipeline(query)

    # Store in L2 with a 1-hour TTL (L1 is populated by lru_cache on return)
    redis_client.setex(f"rag:{query}", 3600, result)
    return result
```
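Keep in mind that the lru_cache tier lives inside a single Python process: in a multi-worker deployment each worker has its own L1, while Redis is shared across workers and survives restarts, so most cross-worker hits will come from L2.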
5. Cache Invalidation
```python
import time

cache_with_ttl = {}  # key -> (value, timestamp)

def get_with_ttl(key, ttl=3600):
    if key in cache_with_ttl:
        value, timestamp = cache_with_ttl[key]
        if time.time() - timestamp < ttl:
            return value
        del cache_with_ttl[key]  # Expired - drop the stale entry
    return None

def set_with_ttl(key, value):
    cache_with_ttl[key] = (value, time.time())
```
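TTLs handle gradual staleness, but when the underlying documents change you usually want cached answers gone immediately. One common approach is to include a corpus version in every cache key and bump it on re-ingestion. The sketch below is a minimal illustration, assuming the Redis client from the earlier examples and a hypothetical "corpus:version" key:
```python
import redis

redis_client = redis.Redis()

def corpus_version():
    # Current corpus version; defaults to 1 if it has never been set
    return int(redis_client.get("corpus:version") or 1)

def versioned_key(query):
    # Old keys become unreachable (and age out via TTL) once the version is bumped
    return f"rag:v{corpus_version()}:{query}"

def invalidate_all_cached_answers():
    # Call this after re-ingesting or updating documents
    redis_client.incr("corpus:version")
```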
Cost Analysis
Without caching (1M queries/month):
- Embeddings: $100
- LLM: $3000
- Total: $3100
With caching (80% hit rate):
- Embeddings: $20
- LLM: $600
- Redis: $50
- Total: $670 (78% savings)
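The arithmetic generalizes: with cache hit rate h, the variable API spend scales by (1 - h), while the cache infrastructure is a fixed cost. A quick sketch using the figures from this example (assumed prices, not quotes):
```python
def monthly_cost(hit_rate, embed_cost=100.0, llm_cost=3000.0, cache_infra=50.0):
    """Estimated monthly spend for a given cache hit rate.

    embed_cost and llm_cost are the uncached monthly API spend from the
    1M-queries/month example above; cache_infra is the fixed Redis cost.
    """
    miss_rate = 1.0 - hit_rate
    return (embed_cost + llm_cost) * miss_rate + cache_infra

print(monthly_cost(0.0))   # 3150.0 - caching infra in place but no hits yet
print(monthly_cost(0.8))   # 670.0  - matches the 78% savings above
print(monthly_cost(0.9))   # 360.0
```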
Caching is the lowest-hanging fruit for RAG optimization. Implement it early.
Related Guides
Context Window Optimization: Managing Token Limits
Strategies for fitting more information in limited context windows: compression, summarization, smart selection, and window management techniques.
RAG Cost Optimization: Cut Spending by 90%
Reduce RAG costs from $10k to $1k/month: smart chunking, caching, model selection, and batch processing.
Reduce RAG Latency: From 2000ms to 200ms
10x faster RAG: parallel retrieval, streaming responses, and architectural optimizations for sub-200ms latency.