Conversational RAG: Memory and Multi-Session Context
Implement RAG with conversational memory: context management, multi-session history, and personalized responses.
A static RAG responds to each question in isolation. A conversational RAG maintains the thread of discussion, remembers user preferences, and personalizes responses over time. This guide explains how to implement effective memory.
Why Memory is Essential
Limitations of Memoryless RAG
Without memory, each request is processed independently:
```
User: What is your return policy?
Assistant: You have 30 days to return a product.

User: And for electronic products?
Assistant: [No context] What would you like to know about electronic products?
```
With memory:
```
User: What is your return policy?
Assistant: You have 30 days to return a product.

User: And for electronic products?
Assistant: For electronic products, the return period is also 30 days,
but they must be unopened unless defective.
```
Types of Memory
| Type | Scope | Duration | Usage |
|---|---|---|---|
| Immediate context | Current turn | Ephemeral | Co-references ("it", "that") |
| Session memory | Conversation | Session | Discussion thread tracking |
| Long-term memory | User | Permanent | Preferences, history |
| Semantic memory | Global base | Permanent | Facts learned from conversations |
Conversational Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ USER REQUEST │
└───────────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ CONTEXT MANAGER │
│ ┌─────────────┐ ┌─────────────┐ ┌───────────────────────┐ │
│ │ Session │ │ User │ │ Long-term │ │
│ │ History │ │ Profile │ │ Memory │ │
│ └──────┬──────┘ └──────┬──────┘ └───────────┬───────────┘ │
│ └───────────────┼───────────────────┘ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Context Builder │ │
│ └────────┬────────┘ │
└────────────────────────┼────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ RAG PIPELINE │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │
│ │ Query │ │ Retrieval │ │ Generation │ │
│ │ Rewriting │ │ + Context │ │ + Memory │ │
│ └──────────────┘ └──────────────┘ └──────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
```
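The Context Builder at the center of the diagram merges the three memory sources into a single context string for the pipeline. A minimal sketch, assuming the `ConversationManager`, `UserMemory`, and `SemanticMemory` interfaces defined later in this guide:

```python
class ContextBuilder:
    """Merges session history, user profile, and long-term memory
    into one context string for the RAG pipeline."""

    def __init__(self, conv_manager, user_memory, semantic_memory):
        self.conv_manager = conv_manager
        self.user_memory = user_memory
        self.semantic_memory = semantic_memory

    async def build(self, user_id: str, conversation_id: str, query: str) -> str:
        # Gather the three sources shown in the diagram
        history = await self.conv_manager.get_context(conversation_id)
        preferences = await self.user_memory.get_preferences(user_id)
        facts = await self.semantic_memory.recall(query, top_k=3)

        return "\n\n".join([
            f"User preferences: {preferences}",
            f"Known facts: {[f['fact'] for f in facts]}",
            f"Conversation:\n{history}",
        ])
```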
Session History Management
Conversation Structure
```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class Message:
    role: str  # "user" or "assistant"
    content: str
    timestamp: datetime = field(default_factory=datetime.now)
    metadata: dict = field(default_factory=dict)

@dataclass
class Conversation:
    id: str
    user_id: str
    messages: List[Message] = field(default_factory=list)
    created_at: datetime = field(default_factory=datetime.now)
    context: dict = field(default_factory=dict)

class ConversationManager:
    def __init__(self, storage, max_messages: int = 20):
        self.storage = storage
        self.max_messages = max_messages

    async def add_message(
        self,
        conversation_id: str,
        role: str,
        content: str,
        metadata: dict = None
    ):
        """Add a message to the conversation."""
        conversation = await self.storage.get(conversation_id)

        message = Message(
            role=role,
            content=content,
            metadata=metadata or {}
        )
        conversation.messages.append(message)

        # Keep only the last N messages
        if len(conversation.messages) > self.max_messages:
            conversation.messages = conversation.messages[-self.max_messages:]

        await self.storage.save(conversation)

    async def get_context(
        self,
        conversation_id: str,
        max_turns: int = 5
    ) -> str:
        """Retrieve formatted conversation context."""
        conversation = await self.storage.get(conversation_id)

        # Take the last turns (one turn = one user + one assistant message)
        recent_messages = conversation.messages[-(max_turns * 2):]

        context_parts = []
        for msg in recent_messages:
            role = "User" if msg.role == "user" else "Assistant"
            context_parts.append(f"{role}: {msg.content}")

        return "\n".join(context_parts)
```
History Compression
For long conversations, compress old history:
```python
class HistoryCompressor:
    def __init__(self, llm):
        self.llm = llm

    async def compress(
        self,
        messages: List[Message],
        keep_recent: int = 4
    ) -> dict:
        """Compress history while keeping recent messages intact."""
        if len(messages) <= keep_recent:
            return {"summary": None, "recent_messages": messages}

        # Split into messages to compress and messages to keep verbatim
        to_compress = messages[:-keep_recent]
        recent = messages[-keep_recent:]

        # Summarize the older messages
        history_text = "\n".join(
            f"{m.role}: {m.content}" for m in to_compress
        )

        prompt = f"""
Summarize this conversation while preserving:
- Key information exchanged
- Decisions or commitments made
- Important context for continuation

Conversation:
{history_text}

Concise summary (max 200 words):
"""
        summary = await self.llm.generate(prompt, temperature=0)

        return {"summary": summary, "recent_messages": recent}
```
Query Rewriting with Context
Co-reference Resolution
```python
class QueryRewriter:
    def __init__(self, llm):
        self.llm = llm

    async def rewrite(
        self,
        query: str,
        conversation_context: str
    ) -> str:
        """Rewrite the query by resolving co-references."""
        prompt = f"""
Rewrite the user's question to make it standalone,
by resolving pronouns and context references.

Conversation history:
{conversation_context}

Last question: {query}

Rewritten question (standalone, no ambiguous pronouns):
"""
        rewritten = await self.llm.generate(prompt, temperature=0)
        return rewritten.strip()

    async def extract_search_queries(
        self,
        query: str,
        context: str
    ) -> List[str]:
        """Generate multiple optimized search queries."""
        prompt = f"""
From this conversation and question, generate 2-3 optimized
search queries to find relevant information.

Context:
{context}

Question: {query}

Search queries (one per line):
"""
        result = await self.llm.generate(prompt, temperature=0.3)
        queries = [q.strip() for q in result.strip().split("\n") if q.strip()]
        return queries[:3]
```
Application Example
```python
# Before rewriting
context = """
User: I'm looking for a gaming laptop
Assistant: I recommend the ASUS ROG Strix G15, excellent value.
User: How much RAM does it have?
"""
query = "And the graphics card?"

rewritten = await rewriter.rewrite(query, context)
# "What is the graphics card of the ASUS ROG Strix G15?"
```
Long-term Memory
User Profile Storage
```python
from datetime import datetime, timedelta
from typing import Any

class UserMemory:
    def __init__(self, db):
        self.db = db

    async def update_preference(
        self,
        user_id: str,
        key: str,
        value: Any,
        confidence: float = 1.0
    ):
        """Update a user preference."""
        await self.db.upsert(
            "user_preferences",
            {
                "user_id": user_id,
                "key": key,
                "value": value,
                "confidence": confidence,
                "last_updated": datetime.now()
            },
            conflict_keys=["user_id", "key"]
        )

    async def get_preferences(self, user_id: str) -> dict:
        """Retrieve all user preferences."""
        prefs = await self.db.find(
            "user_preferences",
            {"user_id": user_id}
        )
        return {p["key"]: p["value"] for p in prefs}

    async def log_interaction(
        self,
        user_id: str,
        query: str,
        response: str,
        topic: str,
        satisfaction: float = None
    ):
        """Log an interaction for future analysis."""
        await self.db.insert("user_interactions", {
            "user_id": user_id,
            "query": query,
            "response": response,
            "topic": topic,
            "satisfaction": satisfaction,
            "timestamp": datetime.now()
        })

    async def extract_preferences_from_history(
        self,
        user_id: str,
        llm
    ) -> dict:
        """Extract preferences from past interactions."""
        # Retrieve interactions from the last 30 days
        interactions = await self.db.find(
            "user_interactions",
            {
                "user_id": user_id,
                "timestamp": {"$gte": datetime.now() - timedelta(days=30)}
            },
            limit=50
        )

        if not interactions:
            return {}

        history = "\n".join(
            f"Q: {i['query']}\nA: {i['response']}" for i in interactions
        )

        prompt = f"""
Analyze this conversation history and extract user preferences
and characteristics.

History:
{history}

Extract in JSON:
- preferred_language: preferred language
- technical_level: beginner/intermediate/expert
- topics_of_interest: list of frequent topics
- communication_style: formal/informal
- preferences: other detected preferences

JSON:
"""
        result = await llm.generate(prompt, temperature=0)
        # _parse_json: helper that extracts the JSON object from the LLM output
        return self._parse_json(result)
```
Response Personalization
```python
class PersonalizedRAG:
    def __init__(self, rag_pipeline, user_memory, conv_manager, query_rewriter, llm):
        self.rag = rag_pipeline
        self.memory = user_memory
        self.conv_manager = conv_manager
        self.query_rewriter = query_rewriter
        self.llm = llm

    async def query(
        self,
        user_id: str,
        conversation_id: str,
        query: str
    ) -> dict:
        """Run a personalized RAG query."""
        # 1. Retrieve user profile
        preferences = await self.memory.get_preferences(user_id)

        # 2. Retrieve conversation context
        conv_context = await self.conv_manager.get_context(conversation_id)

        # 3. Rewrite query with context
        rewritten_query = await self.query_rewriter.rewrite(query, conv_context)

        # 4. Execute standard RAG
        rag_result = await self.rag.query(rewritten_query)

        # 5. Personalize response
        personalized = await self._personalize_response(
            query=query,
            rag_response=rag_result["answer"],
            preferences=preferences,
            context=conv_context
        )

        return {
            "answer": personalized,
            "sources": rag_result["sources"],
            "original_query": query,
            "rewritten_query": rewritten_query
        }

    async def _personalize_response(
        self,
        query: str,
        rag_response: str,
        preferences: dict,
        context: str
    ) -> str:
        """Adapt the response to user preferences."""
        tech_level = preferences.get("technical_level", "intermediate")
        style = preferences.get("communication_style", "professional")

        prompt = f"""
Adapt this response according to the user profile.

Profile:
- Technical level: {tech_level}
- Preferred style: {style}

Conversation context:
{context}

Question: {query}

Original response:
{rag_response}

Adapted response:
"""
        return await self.llm.generate(prompt, temperature=0.3)
```
Shared Semantic Memory
Learning from Conversations
```python
class SemanticMemory:
    def __init__(self, vector_db, embedder, llm):
        self.vector_db = vector_db
        self.embedder = embedder
        self.llm = llm

    async def learn_from_conversation(
        self,
        conversation: Conversation
    ):
        """Extract and store facts learned from a conversation."""
        facts = await self._extract_facts(conversation)

        for fact in facts:
            # Check if a similar fact is already stored
            existing = await self._find_similar_fact(fact)
            if existing:
                # Reinforce the confidence of the known fact
                await self._update_fact_confidence(existing, fact)
            else:
                # Store the new fact
                await self._store_fact(fact)

    async def _extract_facts(self, conversation: Conversation) -> List[dict]:
        """Extract factual information from a conversation."""
        messages_text = "\n".join(
            f"{m.role}: {m.content}" for m in conversation.messages
        )

        prompt = f"""
Extract new facts learned from this conversation.

A fact must be:
- Factual and verifiable
- Useful for future conversations
- New (not already known information)

Conversation:
{messages_text}

Facts (JSON array format):
[
  {{"fact": "...", "category": "...", "confidence": 0.9}},
  ...
]
"""
        result = await self.llm.generate(prompt, temperature=0)
        return self._parse_json(result)

    async def recall(
        self,
        query: str,
        top_k: int = 5
    ) -> List[dict]:
        """Retrieve relevant facts for a query."""
        query_embedding = self.embedder.encode(query)

        results = await self.vector_db.search(
            collection="semantic_memory",
            query_vector=query_embedding,
            limit=top_k
        )
        return [r.payload for r in results]
```
Session Management
Multi-device and Continuity
```python
import json
from datetime import datetime

import redis.asyncio as redis  # async client, required by the awaits below

class SessionManager:
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client
        self.session_ttl = 3600 * 24  # 24 hours

    async def create_session(
        self,
        user_id: str,
        device_id: str = None
    ) -> str:
        """Create a new session."""
        session_id = f"session_{user_id}_{datetime.now().timestamp()}"

        session_data = {
            "user_id": user_id,
            "device_id": device_id,
            "created_at": datetime.now().isoformat(),
            "messages": [],
            "context": {}
        }

        # Store with a TTL so stale sessions expire automatically
        await self.redis.setex(
            session_id,
            self.session_ttl,
            json.dumps(session_data)
        )
        return session_id

    async def resume_or_create(
        self,
        user_id: str,
        device_id: str = None
    ) -> str:
        """Resume the last session or create a new one."""
        # Look for recent active sessions
        pattern = f"session_{user_id}_*"
        keys = await self.redis.keys(pattern)

        if keys:
            # Sort by creation date
            sessions = []
            for key in keys:
                data = json.loads(await self.redis.get(key))
                sessions.append((key, data))

            # Take the most recent session
            sessions.sort(key=lambda x: x[1]["created_at"], reverse=True)
            return sessions[0][0]

        # No active session: create a new one
        return await self.create_session(user_id, device_id)
```
Metrics and Evaluation
Conversational Quality
```python
class ConversationMetrics:
    def __init__(self, db):
        self.db = db

    async def calculate_metrics(
        self,
        conversation_id: str
    ) -> dict:
        """Calculate conversation metrics."""
        conversation = await self.db.get("conversations", conversation_id)

        return {
            # Length and engagement
            "total_turns": len(conversation["messages"]) // 2,
            "avg_user_message_length": self._avg_length(
                [m for m in conversation["messages"] if m["role"] == "user"]
            ),
            "avg_assistant_message_length": self._avg_length(
                [m for m in conversation["messages"] if m["role"] == "assistant"]
            ),
            # Resolution
            "ended_naturally": self._check_natural_end(conversation),
            "required_clarification": self._count_clarifications(conversation),
            # Coherence
            "topic_coherence": await self._calculate_topic_coherence(conversation),
            "context_usage": self._measure_context_usage(conversation)
        }
```
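The private helpers above are left abstract. As a rough sketch, the two simplest could be added to the class like this (the clarification marker list is an assumption for illustration; the remaining helpers need an LLM or embedding model):

```python
def _avg_length(self, messages: list) -> float:
    """Average message length in characters (0.0 for an empty list)."""
    if not messages:
        return 0.0
    return sum(len(m["content"]) for m in messages) / len(messages)

def _count_clarifications(self, conversation: dict) -> int:
    """Count assistant turns that ask the user for clarification."""
    # Heuristic marker list, assumed for illustration
    markers = ("could you clarify", "what do you mean", "which one")
    return sum(
        1
        for m in conversation["messages"]
        if m["role"] == "assistant"
        and any(marker in m["content"].lower() for marker in markers)
    )
```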
Best Practices
1. Limit Context Size
Don't overload the prompt with too much history; compress or summarize older turns, or cap history to a token budget, as in the sketch below.
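A minimal sketch of budget-based trimming, assuming a `count_tokens` callable (e.g. backed by a tokenizer library) and an arbitrary 2,000-token budget:

```python
def trim_history(messages: list[dict], count_tokens, budget: int = 2000) -> list[dict]:
    """Keep the most recent messages that fit within a token budget.
    `count_tokens` is an assumed callable that counts tokens in a string."""
    kept, total = [], 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(msg["content"])
        if total + cost > budget:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))  # restore chronological order
```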
2. Handle Topic Changes
Detect when the user changes topic so that stale context doesn't pollute retrieval and generation.
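One possible heuristic is to compare the new query's embedding against the last few queries; the 0.4 cosine-similarity threshold below is an assumption to tune per corpus:

```python
import numpy as np

def is_topic_change(embedder, query: str, recent_queries: list[str],
                    threshold: float = 0.4) -> bool:
    """Flag a topic change when the new query is dissimilar to recent ones."""
    if not recent_queries:
        return False
    q = embedder.encode(query)
    recent = [embedder.encode(r) for r in recent_queries[-3:]]
    sims = [
        float(np.dot(q, r) / (np.linalg.norm(q) * np.linalg.norm(r)))
        for r in recent
    ]
    return max(sims) < threshold  # no recent query is close enough
```

When a change is detected, the session context can be reset or the old thread summarized before continuing.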
3. Respect Privacy
Allow users to delete their history and preferences.
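A sketch of a deletion routine, assuming a `delete_many` method on the same generic `db` interface used in the examples above:

```python
class PrivacyManager:
    """Hypothetical helper that wipes a user's stored memory on request."""

    def __init__(self, db):
        self.db = db

    async def forget_user(self, user_id: str):
        # Remove everything keyed to the user across the memory stores
        # (collection names match the examples in this guide)
        await self.db.delete_many("user_preferences", {"user_id": user_id})
        await self.db.delete_many("user_interactions", {"user_id": user_id})
        await self.db.delete_many("conversations", {"user_id": user_id})
```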
4. Graceful Fallback
If the memory store is unavailable, the system should degrade to stateless mode instead of failing.
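For example, wrapping memory access so a failure degrades to an empty context rather than an error (`conv_manager` follows the `ConversationManager` interface above):

```python
async def get_context_safe(conv_manager, conversation_id: str) -> str:
    """Fall back to stateless mode if the memory store is unreachable."""
    try:
        return await conv_manager.get_context(conversation_id)
    except Exception:
        # Memory unavailable: answer from the current query alone
        return ""
```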
Learn More
- Retrieval Fundamentals - Optimize search
- LLM Generation - Improve responses
- Prompt Engineering RAG - Optimize prompts
Conversational RAG with Ailog
Implementing robust conversational memory is complex. With Ailog, benefit from native features:
- Automatic session history with compression
- Persistent user memory and personalization
- Query rewriting for co-reference resolution
- Multi-device with conversation resumption
- Conversational analytics to measure engagement
- GDPR compliance with data export and deletion
Try Ailog for free and deploy an intelligent conversational assistant.