Conversational RAG: Memory and Multi-Session Context
Implement RAG with conversational memory: context management, multi-session history, and personalized responses.
A static RAG responds to each question in isolation. A conversational RAG maintains the thread of discussion, remembers user preferences, and personalizes responses over time. This guide explains how to implement effective memory.
Why Memory is Essential
Limitations of Memoryless RAG
Without memory, each request is processed independently:
```
User: What is your return policy?
Assistant: You have 30 days to return a product.

User: And for electronic products?
Assistant: [No context] What would you like to know about electronic products?
```
With memory:
```
User: What is your return policy?
Assistant: You have 30 days to return a product.

User: And for electronic products?
Assistant: For electronic products, the return period is also 30 days,
but they must be unopened unless defective.
```
Types of Memory
| Type | Scope | Duration | Usage |
|---|---|---|---|
| Immediate context | Current turn | Ephemeral | Co-references ("it", "that") |
| Session memory | Conversation | Session | Discussion thread tracking |
| Long-term memory | User | Permanent | Preferences, history |
| Semantic memory | Global base | Permanent | Facts learned from conversations |
Conversational Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ USER REQUEST │
└───────────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ CONTEXT MANAGER │
│ ┌─────────────┐ ┌─────────────┐ ┌───────────────────────┐ │
│ │ Session │ │ User │ │ Long-term │ │
│ │ History │ │ Profile │ │ Memory │ │
│ └──────┬──────┘ └──────┬──────┘ └───────────┬───────────┘ │
│ └───────────────┼───────────────────┘ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Context Builder │ │
│ └────────┬────────┘ │
└────────────────────────┼────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ RAG PIPELINE │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │
│ │ Query │ │ Retrieval │ │ Generation │ │
│ │ Rewriting │ │ + Context │ │ + Memory │ │
│ └──────────────┘ └──────────────┘ └──────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
```
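The Context Builder at the center of the diagram merges the three memory sources into a single context string for the pipeline. A minimal sketch, assuming the `ConversationManager`, `UserMemory`, and `SemanticMemory` interfaces defined later in this guide:

```python
class ContextBuilder:
    """Merges session history, user profile, and long-term memory
    into one context string for the RAG pipeline."""

    def __init__(self, conv_manager, user_memory, semantic_memory):
        self.conv_manager = conv_manager
        self.user_memory = user_memory
        self.semantic_memory = semantic_memory

    async def build(self, user_id: str, conversation_id: str, query: str) -> str:
        # Gather the three sources shown in the diagram
        history = await self.conv_manager.get_context(conversation_id)
        preferences = await self.user_memory.get_preferences(user_id)
        facts = await self.semantic_memory.recall(query, top_k=3)

        return "\n\n".join([
            f"User preferences: {preferences}",
            f"Known facts: {[f['fact'] for f in facts]}",
            f"Conversation:\n{history}",
        ])
```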
Session History Management
Conversation Structure
```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class Message:
    role: str  # "user" or "assistant"
    content: str
    timestamp: datetime = field(default_factory=datetime.now)
    metadata: dict = field(default_factory=dict)

@dataclass
class Conversation:
    id: str
    user_id: str
    messages: List[Message] = field(default_factory=list)
    created_at: datetime = field(default_factory=datetime.now)
    context: dict = field(default_factory=dict)

class ConversationManager:
    def __init__(self, storage, max_messages: int = 20):
        self.storage = storage
        self.max_messages = max_messages

    async def add_message(
        self,
        conversation_id: str,
        role: str,
        content: str,
        metadata: dict = None
    ):
        """Add a message to the conversation."""
        conversation = await self.storage.get(conversation_id)

        message = Message(
            role=role,
            content=content,
            metadata=metadata or {}
        )
        conversation.messages.append(message)

        # Keep only the last N messages
        if len(conversation.messages) > self.max_messages:
            conversation.messages = conversation.messages[-self.max_messages:]

        await self.storage.save(conversation)

    async def get_context(
        self,
        conversation_id: str,
        max_turns: int = 5
    ) -> str:
        """Retrieve formatted conversation context."""
        conversation = await self.storage.get(conversation_id)

        # Take the last turns (one turn = one user + one assistant message)
        recent_messages = conversation.messages[-(max_turns * 2):]

        context_parts = []
        for msg in recent_messages:
            role = "User" if msg.role == "user" else "Assistant"
            context_parts.append(f"{role}: {msg.content}")

        return "\n".join(context_parts)
```
History Compression
For long conversations, compress old history:
```python
class HistoryCompressor:
    def __init__(self, llm):
        self.llm = llm

    async def compress(
        self,
        messages: List[Message],
        keep_recent: int = 4
    ) -> dict:
        """Compress history while keeping recent messages intact."""
        if len(messages) <= keep_recent:
            return {"summary": None, "recent_messages": messages}

        # Split into messages to compress and messages to keep verbatim
        to_compress = messages[:-keep_recent]
        recent = messages[-keep_recent:]

        # Summarize the older messages
        history_text = "\n".join(
            f"{m.role}: {m.content}" for m in to_compress
        )

        prompt = f"""
Summarize this conversation while preserving:
- Key information exchanged
- Decisions or commitments made
- Important context for continuation

Conversation:
{history_text}

Concise summary (max 200 words):
"""
        summary = await self.llm.generate(prompt, temperature=0)

        return {"summary": summary, "recent_messages": recent}
```
Query Rewriting with Context
Co-reference Resolution
```python
class QueryRewriter:
    def __init__(self, llm):
        self.llm = llm

    async def rewrite(
        self,
        query: str,
        conversation_context: str
    ) -> str:
        """Rewrite the query by resolving co-references."""
        prompt = f"""
Rewrite the user's question to make it standalone,
by resolving pronouns and context references.

Conversation history:
{conversation_context}

Last question: {query}

Rewritten question (standalone, no ambiguous pronouns):
"""
        rewritten = await self.llm.generate(prompt, temperature=0)
        return rewritten.strip()

    async def extract_search_queries(
        self,
        query: str,
        context: str
    ) -> List[str]:
        """Generate multiple optimized search queries."""
        prompt = f"""
From this conversation and question, generate 2-3 optimized
search queries to find relevant information.

Context:
{context}

Question: {query}

Search queries (one per line):
"""
        result = await self.llm.generate(prompt, temperature=0.3)
        queries = [q.strip() for q in result.strip().split("\n") if q.strip()]
        return queries[:3]
```
Application Example
```python
# Before rewriting
context = """
User: I'm looking for a gaming laptop
Assistant: I recommend the ASUS ROG Strix G15, excellent value.
User: How much RAM does it have?
"""
query = "And the graphics card?"

rewritten = await rewriter.rewrite(query, context)
# "What is the graphics card of the ASUS ROG Strix G15?"
```
Long-term Memory
User Profile Storage
```python
from datetime import datetime, timedelta
from typing import Any

class UserMemory:
    def __init__(self, db):
        self.db = db

    async def update_preference(
        self,
        user_id: str,
        key: str,
        value: Any,
        confidence: float = 1.0
    ):
        """Update a user preference."""
        await self.db.upsert(
            "user_preferences",
            {
                "user_id": user_id,
                "key": key,
                "value": value,
                "confidence": confidence,
                "last_updated": datetime.now()
            },
            conflict_keys=["user_id", "key"]
        )

    async def get_preferences(self, user_id: str) -> dict:
        """Retrieve all user preferences."""
        prefs = await self.db.find(
            "user_preferences",
            {"user_id": user_id}
        )
        return {p["key"]: p["value"] for p in prefs}

    async def log_interaction(
        self,
        user_id: str,
        query: str,
        response: str,
        topic: str,
        satisfaction: float = None
    ):
        """Log an interaction for future analysis."""
        await self.db.insert("user_interactions", {
            "user_id": user_id,
            "query": query,
            "response": response,
            "topic": topic,
            "satisfaction": satisfaction,
            "timestamp": datetime.now()
        })

    async def extract_preferences_from_history(
        self,
        user_id: str,
        llm
    ) -> dict:
        """Extract preferences from past interactions."""
        # Retrieve interactions from the last 30 days
        interactions = await self.db.find(
            "user_interactions",
            {
                "user_id": user_id,
                "timestamp": {"$gte": datetime.now() - timedelta(days=30)}
            },
            limit=50
        )

        if not interactions:
            return {}

        history = "\n".join(
            f"Q: {i['query']}\nA: {i['response']}" for i in interactions
        )

        prompt = f"""
Analyze this conversation history and extract user preferences
and characteristics.

History:
{history}

Extract in JSON:
- preferred_language: preferred language
- technical_level: beginner/intermediate/expert
- topics_of_interest: list of frequent topics
- communication_style: formal/informal
- preferences: other detected preferences

JSON:
"""
        result = await llm.generate(prompt, temperature=0)
        # _parse_json: helper that extracts the JSON object from the LLM output
        return self._parse_json(result)
```
Response Personalization
```python
class PersonalizedRAG:
    def __init__(self, rag_pipeline, user_memory, conv_manager, query_rewriter, llm):
        self.rag = rag_pipeline
        self.memory = user_memory
        self.conv_manager = conv_manager
        self.query_rewriter = query_rewriter
        self.llm = llm

    async def query(
        self,
        user_id: str,
        conversation_id: str,
        query: str
    ) -> dict:
        """Run a personalized RAG query."""
        # 1. Retrieve user profile
        preferences = await self.memory.get_preferences(user_id)

        # 2. Retrieve conversation context
        conv_context = await self.conv_manager.get_context(conversation_id)

        # 3. Rewrite query with context
        rewritten_query = await self.query_rewriter.rewrite(query, conv_context)

        # 4. Execute standard RAG
        rag_result = await self.rag.query(rewritten_query)

        # 5. Personalize response
        personalized = await self._personalize_response(
            query=query,
            rag_response=rag_result["answer"],
            preferences=preferences,
            context=conv_context
        )

        return {
            "answer": personalized,
            "sources": rag_result["sources"],
            "original_query": query,
            "rewritten_query": rewritten_query
        }

    async def _personalize_response(
        self,
        query: str,
        rag_response: str,
        preferences: dict,
        context: str
    ) -> str:
        """Adapt the response to user preferences."""
        tech_level = preferences.get("technical_level", "intermediate")
        style = preferences.get("communication_style", "professional")

        prompt = f"""
Adapt this response according to the user profile.

Profile:
- Technical level: {tech_level}
- Preferred style: {style}

Conversation context:
{context}

Question: {query}

Original response:
{rag_response}

Adapted response:
"""
        return await self.llm.generate(prompt, temperature=0.3)
```
Shared Semantic Memory
Learning from Conversations
```python
class SemanticMemory:
    def __init__(self, vector_db, embedder, llm):
        self.vector_db = vector_db
        self.embedder = embedder
        self.llm = llm

    async def learn_from_conversation(
        self,
        conversation: Conversation
    ):
        """Extract and store facts learned from a conversation."""
        facts = await self._extract_facts(conversation)

        for fact in facts:
            # Check if a similar fact is already stored
            existing = await self._find_similar_fact(fact)
            if existing:
                # Reinforce the confidence of the known fact
                await self._update_fact_confidence(existing, fact)
            else:
                # Store the new fact
                await self._store_fact(fact)

    async def _extract_facts(self, conversation: Conversation) -> List[dict]:
        """Extract factual information from a conversation."""
        messages_text = "\n".join(
            f"{m.role}: {m.content}" for m in conversation.messages
        )

        prompt = f"""
Extract new facts learned from this conversation.

A fact must be:
- Factual and verifiable
- Useful for future conversations
- New (not already known information)

Conversation:
{messages_text}

Facts (JSON array format):
[
  {{"fact": "...", "category": "...", "confidence": 0.9}},
  ...
]
"""
        result = await self.llm.generate(prompt, temperature=0)
        return self._parse_json(result)

    async def recall(
        self,
        query: str,
        top_k: int = 5
    ) -> List[dict]:
        """Retrieve relevant facts for a query."""
        query_embedding = self.embedder.encode(query)

        results = await self.vector_db.search(
            collection="semantic_memory",
            query_vector=query_embedding,
            limit=top_k
        )
        return [r.payload for r in results]
```
Session Management
Multi-device and Continuity
```python
import json
from datetime import datetime

import redis.asyncio as redis  # async client, required by the awaits below

class SessionManager:
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client
        self.session_ttl = 3600 * 24  # 24 hours

    async def create_session(
        self,
        user_id: str,
        device_id: str = None
    ) -> str:
        """Create a new session."""
        session_id = f"session_{user_id}_{datetime.now().timestamp()}"

        session_data = {
            "user_id": user_id,
            "device_id": device_id,
            "created_at": datetime.now().isoformat(),
            "messages": [],
            "context": {}
        }

        # Store with a TTL so stale sessions expire automatically
        await self.redis.setex(
            session_id,
            self.session_ttl,
            json.dumps(session_data)
        )
        return session_id

    async def resume_or_create(
        self,
        user_id: str,
        device_id: str = None
    ) -> str:
        """Resume the last session or create a new one."""
        # Look for recent active sessions
        pattern = f"session_{user_id}_*"
        keys = await self.redis.keys(pattern)

        if keys:
            # Sort by creation date
            sessions = []
            for key in keys:
                data = json.loads(await self.redis.get(key))
                sessions.append((key, data))

            # Take the most recent session
            sessions.sort(key=lambda x: x[1]["created_at"], reverse=True)
            return sessions[0][0]

        # No active session: create a new one
        return await self.create_session(user_id, device_id)
```
Metrics and Evaluation
Conversational Quality
```python
class ConversationMetrics:
    def __init__(self, db):
        self.db = db

    async def calculate_metrics(
        self,
        conversation_id: str
    ) -> dict:
        """Calculate conversation metrics."""
        conversation = await self.db.get("conversations", conversation_id)

        return {
            # Length and engagement
            "total_turns": len(conversation["messages"]) // 2,
            "avg_user_message_length": self._avg_length(
                [m for m in conversation["messages"] if m["role"] == "user"]
            ),
            "avg_assistant_message_length": self._avg_length(
                [m for m in conversation["messages"] if m["role"] == "assistant"]
            ),
            # Resolution
            "ended_naturally": self._check_natural_end(conversation),
            "required_clarification": self._count_clarifications(conversation),
            # Coherence
            "topic_coherence": await self._calculate_topic_coherence(conversation),
            "context_usage": self._measure_context_usage(conversation)
        }
```
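The private helpers above are left abstract. As a rough sketch, the two simplest could be added to the class like this (the clarification marker list is an assumption for illustration; the remaining helpers need an LLM or embedding model):

```python
def _avg_length(self, messages: list) -> float:
    """Average message length in characters (0.0 for an empty list)."""
    if not messages:
        return 0.0
    return sum(len(m["content"]) for m in messages) / len(messages)

def _count_clarifications(self, conversation: dict) -> int:
    """Count assistant turns that ask the user for clarification."""
    # Heuristic marker list, assumed for illustration
    markers = ("could you clarify", "what do you mean", "which one")
    return sum(
        1
        for m in conversation["messages"]
        if m["role"] == "assistant"
        and any(marker in m["content"].lower() for marker in markers)
    )
```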
Best Practices
1. Limit Context Size
Don't overload the prompt with too much history; compress or summarize older turns, or cap history to a token budget, as in the sketch below.
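A minimal sketch of budget-based trimming, assuming a `count_tokens` callable (e.g. backed by a tokenizer library) and an arbitrary 2,000-token budget:

```python
def trim_history(messages: list[dict], count_tokens, budget: int = 2000) -> list[dict]:
    """Keep the most recent messages that fit within a token budget.
    `count_tokens` is an assumed callable that counts tokens in a string."""
    kept, total = [], 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(msg["content"])
        if total + cost > budget:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))  # restore chronological order
```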
2. Handle Topic Changes
Detect when the user changes topic so that stale context doesn't pollute retrieval and generation.
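One possible heuristic is to compare the new query's embedding against the last few queries; the 0.4 cosine-similarity threshold below is an assumption to tune per corpus:

```python
import numpy as np

def is_topic_change(embedder, query: str, recent_queries: list[str],
                    threshold: float = 0.4) -> bool:
    """Flag a topic change when the new query is dissimilar to recent ones."""
    if not recent_queries:
        return False
    q = embedder.encode(query)
    recent = [embedder.encode(r) for r in recent_queries[-3:]]
    sims = [
        float(np.dot(q, r) / (np.linalg.norm(q) * np.linalg.norm(r)))
        for r in recent
    ]
    return max(sims) < threshold  # no recent query is close enough
```

When a change is detected, the session context can be reset or the old thread summarized before continuing.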
3. Respect Privacy
Allow users to delete their history and preferences.
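A sketch of a deletion routine, assuming a `delete_many` method on the same generic `db` interface used in the examples above:

```python
class PrivacyManager:
    """Hypothetical helper that wipes a user's stored memory on request."""

    def __init__(self, db):
        self.db = db

    async def forget_user(self, user_id: str):
        # Remove everything keyed to the user across the memory stores
        # (collection names match the examples in this guide)
        await self.db.delete_many("user_preferences", {"user_id": user_id})
        await self.db.delete_many("user_interactions", {"user_id": user_id})
        await self.db.delete_many("conversations", {"user_id": user_id})
```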
4. Graceful Fallback
If the memory store is unavailable, the system should degrade to stateless mode instead of failing.
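For example, wrapping memory access so a failure degrades to an empty context rather than an error (`conv_manager` follows the `ConversationManager` interface above):

```python
async def get_context_safe(conv_manager, conversation_id: str) -> str:
    """Fall back to stateless mode if the memory store is unreachable."""
    try:
        return await conv_manager.get_context(conversation_id)
    except Exception:
        # Memory unavailable: answer from the current query alone
        return ""
```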
Learn More
- Retrieval Fundamentals - Optimize search
- LLM Generation - Improve responses
- Prompt Engineering RAG - Optimize prompts
Conversational RAG with Ailog
Implementing robust conversational memory is complex. With Ailog, benefit from native features:
- Automatic session history with compression
- Persistent user memory and personalization
- Query rewriting for co-reference resolution
- Multi-device with conversation resumption
- Conversational analytics to measure engagement
- GDPR compliance with data export and deletion
Try Ailog for free and deploy an intelligent conversational assistant.