Summary Memory: Summarize to Remember Longer
Complete guide to implementing summary memory in a RAG system: condensing long conversations to maintain context without exploding tokens.
Summary Memory is a conversational memory technique that condenses the exchange history into successive summaries. It makes it possible to maintain context across dozens, even hundreds, of exchanges without blowing up the token budget. This guide explains how to implement it effectively.
Why Summary Memory?
The Problem with Long Conversations
With classic Buffer Memory, each message consumes tokens. After 15-20 exchanges, limits are quickly reached:
┌─────────────────────────────────────────────────────────────┐
│ BUFFER MEMORY PROBLEM │
├─────────────────────────────────────────────────────────────┤
│ │
│ Exchange 1-5: ~800 tokens ██░░░░░░░░ │
│ Exchange 1-10: ~1600 tokens ████░░░░░░ │
│ Exchange 1-15: ~2400 tokens ██████░░░░ │
│ Exchange 1-20: ~3200 tokens ████████░░ LIMIT! │
│ Exchange 1-30: ~4800 tokens ████████████ EXCEEDED! │
│ │
│ GPT-4 window: ~8000 usable tokens │
│ (rest of context = RAG documents + prompt) │
│ │
└─────────────────────────────────────────────────────────────┘
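The growth above is easy to estimate. Here is a back-of-the-envelope sketch using the illustrative figures from the diagram (roughly 160 tokens per exchange, and about 3,200 tokens of the window left for history once RAG documents and the prompt are accounted for):

```python
# Illustrative figures from the diagram above, not measured values:
# ~160 tokens per exchange, ~3,200 tokens of budget for conversation history.
TOKENS_PER_EXCHANGE = 160
CONVERSATION_BUDGET = 3200

def exchanges_until_full(tokens_per_exchange: int = TOKENS_PER_EXCHANGE,
                         budget: int = CONVERSATION_BUDGET) -> int:
    """How many exchanges a plain buffer holds before hitting the limit."""
    return budget // tokens_per_exchange

print(exchanges_until_full())  # 20
```

With these assumptions, a raw buffer fills up after about 20 exchanges, which is exactly where the diagram flags the limit.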
The Solution: Progressive Summarization
┌─────────────────────────────────────────────────────────────┐
│ SUMMARY MEMORY │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ SUMMARY (200-300 tokens) │ │
│ │ "The user is looking for a gaming laptop budget │ │
│ │ $1500. Rejected MSI because too heavy. │ │
│ │ Interested in ASUS ROG, questions about warranty."│ │
│ └────────────────────────────────────────────────────┘ │
│ + │
│ ┌────────────────────────────────────────────────────┐ │
│ │ RECENT MESSAGES (last 4-6 exchanges) │ │
│ │ User: "What about the graphics card?" │ │
│ │ AI: "The ASUS ROG has an RTX 4060..." │ │
│ │ User: "Is it compatible with my monitor?" │ │
│ │ AI: "Yes, the HDMI 2.1 port..." │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ TOTAL: ~600 tokens (vs 3000+ with buffer) │
│ │
└─────────────────────────────────────────────────────────────┘
Key figure: on long conversations, Summary Memory typically reduces token consumption by 70-80% while retaining most (90%+) of the relevant context.
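The headline saving follows directly from the two diagrams, which put a summarized context at ~600 tokens against 3,000+ for a raw buffer:

```python
# Token counts taken from the diagrams above (illustrative, not measured)
buffer_tokens = 3000
summary_tokens = 600

savings = 1 - summary_tokens / buffer_tokens
print(f"{savings:.0%}")  # 80%
```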
Approach Comparison
| Scenario | Buffer Memory | Summary Memory | Recommendation |
|---|---|---|---|
| < 5 exchanges | Optimal | Overkill | Buffer |
| 5-15 exchanges | OK with window | Good | Buffer Window |
| 15-50 exchanges | Limited | Optimal | Summary |
| 50+ exchanges | Impossible | Excellent | Summary |
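The table's decision rule fits in a few lines. The thresholds below mirror the table and should be tuned to your own traffic:

```python
def recommend_memory(num_exchanges: int) -> str:
    """Pick a memory strategy from the expected conversation length,
    mirroring the comparison table above (thresholds are indicative)."""
    if num_exchanges < 5:
        return "buffer"
    if num_exchanges <= 15:
        return "buffer_window"
    return "summary"

print(recommend_memory(3))   # buffer
print(recommend_memory(12))  # buffer_window
print(recommend_memory(40))  # summary
```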
Basic Implementation
Summary Memory from Scratch
```python
from typing import Dict, List
from datetime import datetime


class SummaryMemory:
    """Conversational memory with progressive summarization."""

    def __init__(
        self,
        llm,
        max_messages_before_summary: int = 6,
        summary_max_tokens: int = 300
    ):
        self.llm = llm
        self.max_messages = max_messages_before_summary
        self.summary_max_tokens = summary_max_tokens
        self.summary: str = ""
        self.recent_messages: List[Dict] = []

    def add_message(self, role: str, content: str) -> None:
        """Add a message and trigger summarization if needed."""
        self.recent_messages.append({
            "role": role,
            "content": content,
            "timestamp": datetime.now().isoformat()
        })
        # If there are too many messages, summarize the oldest ones
        if len(self.recent_messages) > self.max_messages:
            self._summarize_old_messages()

    def _summarize_old_messages(self) -> None:
        """Summarize the oldest messages."""
        # Split: old messages to summarize, recent ones to keep
        split_point = len(self.recent_messages) // 2
        to_summarize = self.recent_messages[:split_point]
        self.recent_messages = self.recent_messages[split_point:]

        # Format the conversation to summarize
        conversation = "\n".join([
            f"{m['role'].capitalize()}: {m['content']}"
            for m in to_summarize
        ])

        # Generate the new summary
        prompt = f"""You need to update a conversation summary.

Existing summary:
{self.summary if self.summary else "No previous summary."}

New exchanges to integrate:
{conversation}

Instructions:
- Integrate important information from the new exchanges into the summary
- Keep decisions made, preferences expressed, and problems resolved
- Remove redundant or obsolete details
- The summary should be concise (max {self.summary_max_tokens} tokens)
- Write in the third person ("The user...", "The assistant...")

Updated summary:"""

        self.summary = self.llm.invoke(prompt)

    def get_context(self) -> str:
        """Return the complete context for the prompt."""
        parts = []
        if self.summary:
            parts.append(f"Previous conversation summary:\n{self.summary}")
        if self.recent_messages:
            recent = "\n".join([
                f"{m['role'].capitalize()}: {m['content']}"
                for m in self.recent_messages
            ])
            parts.append(f"Recent exchanges:\n{recent}")
        return "\n\n".join(parts)

    def clear(self) -> None:
        """Reset the memory."""
        self.summary = ""
        self.recent_messages = []
```
Integration with RAG
```python
class RAGWithSummaryMemory:
    """Complete RAG pipeline with Summary Memory."""

    def __init__(self, vector_store, llm):
        self.vector_store = vector_store
        self.llm = llm
        self.memory = SummaryMemory(llm, max_messages_before_summary=6)

    def query(self, user_message: str) -> str:
        """Execute a RAG query with summarized context."""
        # 1. Add the user message
        self.memory.add_message("user", user_message)

        # 2. Retrieve relevant documents
        docs = self.vector_store.similarity_search(user_message, k=3)
        doc_context = "\n\n".join([
            f"Document {i+1}:\n{d.page_content}"
            for i, d in enumerate(docs)
        ])

        # 3. Build the prompt with memory
        memory_context = self.memory.get_context()
        prompt = f"""You are a helpful assistant. Use the conversation context and documents to answer.

{memory_context}

Relevant documents:
{doc_context}

Current question: {user_message}

Answer precisely and contextually."""

        # 4. Generate and save the response
        response = self.llm.invoke(prompt)
        self.memory.add_message("assistant", response)
        return response
```
Advanced Techniques
Incremental Summarization (More Efficient)
Instead of regenerating the complete summary, enrich it incrementally:
```python
class IncrementalSummaryMemory(SummaryMemory):
    """
    Summary Memory with incremental updates.
    Faster and cheaper in tokens.
    """

    def _summarize_old_messages(self) -> None:
        """Incremental summary update."""
        split_point = len(self.recent_messages) // 2
        to_summarize = self.recent_messages[:split_point]
        self.recent_messages = self.recent_messages[split_point:]

        conversation = "\n".join([
            f"{m['role'].capitalize()}: {m['content']}"
            for m in to_summarize
        ])

        # Prompt optimized for incremental updates
        if self.summary:
            prompt = f"""Current summary:
{self.summary}

New conversation elements:
{conversation}

Update the summary by:
1. Adding important new information
2. Removing obsolete information
3. Keeping the summary concise (max 200 words)

Updated summary:"""
        else:
            prompt = f"""Summarize this conversation, extracting:
- The main topic
- Decisions made
- User preferences
- Pending questions

Conversation:
{conversation}

Summary (max 200 words):"""

        self.summary = self.llm.invoke(prompt)
```
Structured Summary
For better organization, use a structured summary:
```python
import json


class StructuredSummaryMemory(SummaryMemory):
    """
    Summary Memory with a JSON-structured summary.
    Allows more precise access to information.
    """

    def __init__(self, llm, **kwargs):
        super().__init__(llm, **kwargs)
        self.structured_summary: Dict = {
            "main_topic": "",
            "user_preferences": [],
            "decisions_made": [],
            "pending_questions": [],
            "key_facts": []
        }

    def _summarize_old_messages(self) -> None:
        """Generate a structured summary."""
        split_point = len(self.recent_messages) // 2
        to_summarize = self.recent_messages[:split_point]
        self.recent_messages = self.recent_messages[split_point:]

        conversation = "\n".join([
            f"{m['role'].capitalize()}: {m['content']}"
            for m in to_summarize
        ])

        prompt = f"""Analyze this conversation and update the structured summary.

Current summary:
{json.dumps(self.structured_summary, ensure_ascii=False, indent=2)}

New exchanges:
{conversation}

Return a JSON object with the following structure:
{{
  "main_topic": "main conversation topic",
  "user_preferences": ["list of expressed preferences"],
  "decisions_made": ["list of decisions made"],
  "pending_questions": ["unresolved questions"],
  "key_facts": ["important facts to remember"]
}}

Updated JSON:"""

        response = self.llm.invoke(prompt)
        try:
            self.structured_summary = json.loads(response)
        except json.JSONDecodeError:
            # Fall back to a plain-text summary
            self.summary = response

    def get_context(self) -> str:
        """Format the structured summary for the prompt."""
        parts = []
        if self.structured_summary.get("main_topic"):
            parts.append(f"Topic: {self.structured_summary['main_topic']}")
        if self.structured_summary.get("user_preferences"):
            prefs = ", ".join(self.structured_summary["user_preferences"])
            parts.append(f"User preferences: {prefs}")
        if self.structured_summary.get("decisions_made"):
            decisions = ", ".join(self.structured_summary["decisions_made"])
            parts.append(f"Decisions made: {decisions}")
        if self.structured_summary.get("pending_questions"):
            questions = ", ".join(self.structured_summary["pending_questions"])
            parts.append(f"Pending questions: {questions}")

        summary_text = "\n".join(parts) if parts else ""

        if self.recent_messages:
            recent = "\n".join([
                f"{m['role'].capitalize()}: {m['content']}"
                for m in self.recent_messages
            ])
            return f"Context:\n{summary_text}\n\nRecent exchanges:\n{recent}"
        return f"Context:\n{summary_text}"
```
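The payoff of the structured form is targeted access: downstream code can read a single field instead of re-parsing prose. A small sketch with a hypothetical summary dict in the shape generated above:

```python
# Hypothetical structured summary, in the shape the prompt above requests
structured_summary = {
    "main_topic": "gaming laptop purchase",
    "user_preferences": ["budget $1500", "lightweight"],
    "decisions_made": ["rejected MSI (too heavy)"],
    "pending_questions": ["warranty terms for the ASUS ROG"],
    "key_facts": ["user's monitor uses HDMI 2.1"],
}

def pending_followups(summary: dict) -> list:
    """Questions the assistant should still address."""
    return summary.get("pending_questions", [])

print(pending_followups(structured_summary))
# ['warranty terms for the ASUS ROG']
```

The same pattern works for routing (e.g. escalate when `pending_questions` keeps growing) or for analytics on `decisions_made`.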
Implementation with LangChain
LangChain provides a native implementation (note that the `langchain.memory` classes are deprecated in recent releases in favor of LangGraph persistence, but they remain widely used):
```python
from langchain.memory import ConversationSummaryMemory
from langchain.chains import ConversationChain
from langchain_openai import ChatOpenAI

# LLM used for both generation and summarization
llm = ChatOpenAI(model="gpt-4", temperature=0.7)

# Native LangChain Summary Memory
memory = ConversationSummaryMemory(
    llm=llm,
    memory_key="history",
    return_messages=True
)

# Conversational chain
conversation = ConversationChain(
    llm=llm,
    memory=memory,
    verbose=True
)

# Usage: summarization happens automatically
response = conversation.predict(input="I'm looking for a gaming laptop")
response = conversation.predict(input="Budget max 1500 euros")
response = conversation.predict(input="I already looked at MSI but too heavy")
# ... after several exchanges, the memory contains a summary

# Inspect the summary
print(memory.buffer)  # Shows the current summary
```
Combining Buffer + Summary
The best approach combines both:
```python
from langchain.memory import ConversationSummaryBufferMemory

# Keep recent messages AND a summary of older ones
memory = ConversationSummaryBufferMemory(
    llm=llm,
    max_token_limit=1000,  # Summarize once history exceeds 1000 tokens
    memory_key="history",
    return_messages=True
)
# More effective than a pure summary for recent context
```
Implementation with LlamaIndex
```python
from llama_index.core.memory import ChatSummaryMemoryBuffer
from llama_index.core.chat_engine import CondensePlusContextChatEngine
from llama_index.llms.openai import OpenAI

# LLM
llm = OpenAI(model="gpt-4", temperature=0.7)

# Memory with automatic summarization
memory = ChatSummaryMemoryBuffer.from_defaults(
    llm=llm,
    token_limit=1500,  # Limit before summarization kicks in
    summarize_prompt=(
        "Summarize the previous conversation, keeping:\n"
        "- The main topic\n"
        "- Decisions made\n"
        "- Important context\n\n"
        "Conversation:\n{conversation}\n\n"
        "Summary:"
    )
)

# Chat engine with memory (index is an existing VectorStoreIndex)
chat_engine = CondensePlusContextChatEngine.from_defaults(
    retriever=index.as_retriever(similarity_top_k=4),
    memory=memory,
    llm=llm
)

# Long conversation: summarization is automatic
for question in user_questions:
    response = chat_engine.chat(question)
    print(response)
```
Best Practices
1. Choose the Right Summary Threshold
```python
# Recommended thresholds by use case
THRESHOLDS = {
    "customer_support": 6,    # Frequent summaries, recent context matters
    "consultation": 10,       # More context before summarizing
    "tutorial": 4,            # Very frequent summaries
    "free_conversation": 8    # Balanced
}

memory = SummaryMemory(
    llm=llm,
    max_messages_before_summary=THRESHOLDS["customer_support"]
)
```
2. Validate Summary Quality
```python
class ValidatedSummaryMemory(SummaryMemory):
    """Summary Memory with summary validation."""

    def _summarize_old_messages(self) -> None:
        super()._summarize_old_messages()
        # Check that the summary isn't empty or too short
        if len(self.summary.split()) < 20:
            # Regenerate with a more explicit prompt
            self._force_detailed_summary()

    def _force_detailed_summary(self) -> None:
        """Regenerate a more detailed summary if needed."""
        prompt = f"""The previous summary was too short. Expand it, including:
- Who the user is and what they're looking for
- Important criteria mentioned
- Options discussed
- The current state of the conversation

Current summary: {self.summary}

Expanded summary:"""
        self.summary = self.llm.invoke(prompt)
```
3. Handle Topic Changes
```python
class TopicAwareSummaryMemory(SummaryMemory):
    """Detects and handles topic changes."""

    def __init__(self, llm, **kwargs):
        super().__init__(llm, **kwargs)
        self.archived_topics: List[Dict] = []  # Summaries of completed topics

    def add_message(self, role: str, content: str) -> None:
        # Detect a topic change
        if role == "user" and self._is_topic_change(content):
            # Finalize the summary of the previous topic
            self._finalize_current_topic()
        super().add_message(role, content)

    def _is_topic_change(self, message: str) -> bool:
        """Detect whether the message changes topic."""
        indicators = [
            "something else", "new topic", "different question",
            "I'd also like", "by the way"
        ]
        return any(ind in message.lower() for ind in indicators)

    def _finalize_current_topic(self) -> None:
        """Archive the current topic before switching."""
        if self.summary:
            self.archived_topics.append({
                "topic": self.summary,
                "timestamp": datetime.now().isoformat()
            })
        self.summary = ""
```
When to Use Summary Memory?
Ideal Use Cases
- In-depth customer support: Multi-step troubleshooting
- Consultations: Personalized advice over time
- Onboarding: Guidance over multiple sessions
- Complex conversations: Negotiations, detailed configurations
When to Avoid
| Situation | Why | Alternative |
|---|---|---|
| Conversations < 5 exchanges | Unnecessary overhead | Buffer Memory |
| Need exact precision | Summary loses details | Buffer Window |
| Critical latency | Summary adds LLM call | Buffer with limit |
Learn More
- Buffer Memory - For short conversations
- Entity Memory - For retaining specific entities
- Conversational RAG - Overview
Summary Memory with Ailog
Implementing robust Summary Memory requires careful attention to summary quality. With Ailog, benefit from optimized memory management:
- Automatic summarization with adaptive thresholds based on conversation
- Topic change detection for relevant summaries
- Structured summary for precise information access
- Multi-session persistence with context resumption
- Analytics on summary quality and information retention
Try Ailog for free and deploy a chatbot with long-term memory.
Related Posts
Conversational RAG: Memory and Multi-Session Context
Implement RAG with conversational memory: context management, multi-session history, and personalized responses.
Buffer Memory: Simple Conversation History
Complete guide to implementing buffer memory in a conversational RAG system: keeping context of recent exchanges for coherent responses.
RAG Generation: Choosing and Optimizing Your LLM
Complete guide to selecting and configuring your LLM in a RAG system: prompting, temperature, tokens, and response optimization.