Summary Memory: Summarize to Remember Longer
Complete guide to implementing summary memory in a RAG system: condensing long conversations to maintain context without exploding tokens.
Summary Memory is a conversational memory technique that condenses the exchange history into successive summaries. It makes it possible to maintain context across dozens, even hundreds, of exchanges without blowing up the token budget. This guide explains how to implement it effectively.
Why Summary Memory?
The Problem with Long Conversations
With classic Buffer Memory, each message consumes tokens. After 15-20 exchanges, limits are quickly reached:
┌─────────────────────────────────────────────────────────────┐
│ BUFFER MEMORY PROBLEM │
├─────────────────────────────────────────────────────────────┤
│ │
│ Exchange 1-5: ~800 tokens ██░░░░░░░░ │
│ Exchange 1-10: ~1600 tokens ████░░░░░░ │
│ Exchange 1-15: ~2400 tokens ██████░░░░ │
│ Exchange 1-20: ~3200 tokens ████████░░ LIMIT! │
│ Exchange 1-30: ~4800 tokens ████████████ EXCEEDED! │
│ │
│ GPT-4 window: ~8000 usable tokens │
│ (rest of context = RAG documents + prompt) │
│ │
└─────────────────────────────────────────────────────────────┘
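The growth above is easy to estimate. Here is a back-of-the-envelope sketch using the illustrative figures from the diagram (roughly 160 tokens per exchange, and about 3,200 tokens of the window left for history once RAG documents and the prompt are accounted for):

```python
# Illustrative figures from the diagram above, not measured values:
# ~160 tokens per exchange, ~3,200 tokens of budget for conversation history.
TOKENS_PER_EXCHANGE = 160
CONVERSATION_BUDGET = 3200

def exchanges_until_full(tokens_per_exchange: int = TOKENS_PER_EXCHANGE,
                         budget: int = CONVERSATION_BUDGET) -> int:
    """How many exchanges a plain buffer holds before hitting the limit."""
    return budget // tokens_per_exchange

print(exchanges_until_full())  # 20
```

With these assumptions, a raw buffer fills up after about 20 exchanges, which is exactly where the diagram flags the limit.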
The Solution: Progressive Summarization
┌─────────────────────────────────────────────────────────────┐
│ SUMMARY MEMORY │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ SUMMARY (200-300 tokens) │ │
│ │ "The user is looking for a gaming laptop budget │ │
│ │ $1500. Rejected MSI because too heavy. │ │
│ │ Interested in ASUS ROG, questions about warranty."│ │
│ └────────────────────────────────────────────────────┘ │
│ + │
│ ┌────────────────────────────────────────────────────┐ │
│ │ RECENT MESSAGES (last 4-6 exchanges) │ │
│ │ User: "What about the graphics card?" │ │
│ │ AI: "The ASUS ROG has an RTX 4060..." │ │
│ │ User: "Is it compatible with my monitor?" │ │
│ │ AI: "Yes, the HDMI 2.1 port..." │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ TOTAL: ~600 tokens (vs 3000+ with buffer) │
│ │
└─────────────────────────────────────────────────────────────┘
Key figure: on long conversations, Summary Memory typically reduces token consumption by 70-80% while retaining most (90%+) of the relevant context.
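The headline saving follows directly from the two diagrams, which put a summarized context at ~600 tokens against 3,000+ for a raw buffer:

```python
# Token counts taken from the diagrams above (illustrative, not measured)
buffer_tokens = 3000
summary_tokens = 600

savings = 1 - summary_tokens / buffer_tokens
print(f"{savings:.0%}")  # 80%
```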
Approach Comparison
| Scenario | Buffer Memory | Summary Memory | Recommendation |
|---|---|---|---|
| < 5 exchanges | Optimal | Overkill | Buffer |
| 5-15 exchanges | OK with window | Good | Buffer Window |
| 15-50 exchanges | Limited | Optimal | Summary |
| 50+ exchanges | Impossible | Excellent | Summary |
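The table's decision rule fits in a few lines. The thresholds below mirror the table and should be tuned to your own traffic:

```python
def recommend_memory(num_exchanges: int) -> str:
    """Pick a memory strategy from the expected conversation length,
    mirroring the comparison table above (thresholds are indicative)."""
    if num_exchanges < 5:
        return "buffer"
    if num_exchanges <= 15:
        return "buffer_window"
    return "summary"

print(recommend_memory(3))   # buffer
print(recommend_memory(12))  # buffer_window
print(recommend_memory(40))  # summary
```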
Basic Implementation
Summary Memory from Scratch
```python
from typing import Dict, List
from datetime import datetime


class SummaryMemory:
    """Conversational memory with progressive summarization."""

    def __init__(
        self,
        llm,
        max_messages_before_summary: int = 6,
        summary_max_tokens: int = 300
    ):
        self.llm = llm
        self.max_messages = max_messages_before_summary
        self.summary_max_tokens = summary_max_tokens
        self.summary: str = ""
        self.recent_messages: List[Dict] = []

    def add_message(self, role: str, content: str) -> None:
        """Add a message and trigger summarization if needed."""
        self.recent_messages.append({
            "role": role,
            "content": content,
            "timestamp": datetime.now().isoformat()
        })
        # If there are too many messages, summarize the oldest ones
        if len(self.recent_messages) > self.max_messages:
            self._summarize_old_messages()

    def _summarize_old_messages(self) -> None:
        """Summarize the oldest messages."""
        # Split: old messages to summarize, recent ones to keep
        split_point = len(self.recent_messages) // 2
        to_summarize = self.recent_messages[:split_point]
        self.recent_messages = self.recent_messages[split_point:]

        # Format the conversation to summarize
        conversation = "\n".join([
            f"{m['role'].capitalize()}: {m['content']}"
            for m in to_summarize
        ])

        # Generate the new summary
        prompt = f"""You need to update a conversation summary.

Existing summary:
{self.summary if self.summary else "No previous summary."}

New exchanges to integrate:
{conversation}

Instructions:
- Integrate important information from the new exchanges into the summary
- Keep decisions made, preferences expressed, and problems resolved
- Remove redundant or obsolete details
- The summary should be concise (max {self.summary_max_tokens} tokens)
- Write in the third person ("The user...", "The assistant...")

Updated summary:"""

        self.summary = self.llm.invoke(prompt)

    def get_context(self) -> str:
        """Return the complete context for the prompt."""
        parts = []
        if self.summary:
            parts.append(f"Previous conversation summary:\n{self.summary}")
        if self.recent_messages:
            recent = "\n".join([
                f"{m['role'].capitalize()}: {m['content']}"
                for m in self.recent_messages
            ])
            parts.append(f"Recent exchanges:\n{recent}")
        return "\n\n".join(parts)

    def clear(self) -> None:
        """Reset the memory."""
        self.summary = ""
        self.recent_messages = []
```
Integration with RAG
```python
class RAGWithSummaryMemory:
    """Complete RAG pipeline with Summary Memory."""

    def __init__(self, vector_store, llm):
        self.vector_store = vector_store
        self.llm = llm
        self.memory = SummaryMemory(llm, max_messages_before_summary=6)

    def query(self, user_message: str) -> str:
        """Execute a RAG query with summarized context."""
        # 1. Add the user message
        self.memory.add_message("user", user_message)

        # 2. Retrieve relevant documents
        docs = self.vector_store.similarity_search(user_message, k=3)
        doc_context = "\n\n".join([
            f"Document {i+1}:\n{d.page_content}"
            for i, d in enumerate(docs)
        ])

        # 3. Build the prompt with memory
        memory_context = self.memory.get_context()
        prompt = f"""You are a helpful assistant. Use the conversation context and documents to answer.

{memory_context}

Relevant documents:
{doc_context}

Current question: {user_message}

Answer precisely and contextually."""

        # 4. Generate and save the response
        response = self.llm.invoke(prompt)
        self.memory.add_message("assistant", response)
        return response
```
Advanced Techniques
Incremental Summarization (More Efficient)
Instead of regenerating the complete summary, enrich it incrementally:
```python
class IncrementalSummaryMemory(SummaryMemory):
    """
    Summary Memory with incremental updates.
    Faster and cheaper in tokens.
    """

    def _summarize_old_messages(self) -> None:
        """Incremental summary update."""
        split_point = len(self.recent_messages) // 2
        to_summarize = self.recent_messages[:split_point]
        self.recent_messages = self.recent_messages[split_point:]

        conversation = "\n".join([
            f"{m['role'].capitalize()}: {m['content']}"
            for m in to_summarize
        ])

        # Prompt optimized for incremental updates
        if self.summary:
            prompt = f"""Current summary:
{self.summary}

New conversation elements:
{conversation}

Update the summary by:
1. Adding important new information
2. Removing obsolete information
3. Keeping the summary concise (max 200 words)

Updated summary:"""
        else:
            prompt = f"""Summarize this conversation, extracting:
- The main topic
- Decisions made
- User preferences
- Pending questions

Conversation:
{conversation}

Summary (max 200 words):"""

        self.summary = self.llm.invoke(prompt)
```
Structured Summary
For better organization, use a structured summary:
```python
import json


class StructuredSummaryMemory(SummaryMemory):
    """
    Summary Memory with a JSON-structured summary.
    Allows more precise access to information.
    """

    def __init__(self, llm, **kwargs):
        super().__init__(llm, **kwargs)
        self.structured_summary: Dict = {
            "main_topic": "",
            "user_preferences": [],
            "decisions_made": [],
            "pending_questions": [],
            "key_facts": []
        }

    def _summarize_old_messages(self) -> None:
        """Generate a structured summary."""
        split_point = len(self.recent_messages) // 2
        to_summarize = self.recent_messages[:split_point]
        self.recent_messages = self.recent_messages[split_point:]

        conversation = "\n".join([
            f"{m['role'].capitalize()}: {m['content']}"
            for m in to_summarize
        ])

        prompt = f"""Analyze this conversation and update the structured summary.

Current summary:
{json.dumps(self.structured_summary, ensure_ascii=False, indent=2)}

New exchanges:
{conversation}

Return a JSON object with the following structure:
{{
  "main_topic": "main conversation topic",
  "user_preferences": ["list of expressed preferences"],
  "decisions_made": ["list of decisions made"],
  "pending_questions": ["unresolved questions"],
  "key_facts": ["important facts to remember"]
}}

Updated JSON:"""

        response = self.llm.invoke(prompt)
        try:
            self.structured_summary = json.loads(response)
        except json.JSONDecodeError:
            # Fall back to a plain-text summary
            self.summary = response

    def get_context(self) -> str:
        """Format the structured summary for the prompt."""
        parts = []
        if self.structured_summary.get("main_topic"):
            parts.append(f"Topic: {self.structured_summary['main_topic']}")
        if self.structured_summary.get("user_preferences"):
            prefs = ", ".join(self.structured_summary["user_preferences"])
            parts.append(f"User preferences: {prefs}")
        if self.structured_summary.get("decisions_made"):
            decisions = ", ".join(self.structured_summary["decisions_made"])
            parts.append(f"Decisions made: {decisions}")
        if self.structured_summary.get("pending_questions"):
            questions = ", ".join(self.structured_summary["pending_questions"])
            parts.append(f"Pending questions: {questions}")

        summary_text = "\n".join(parts) if parts else ""

        if self.recent_messages:
            recent = "\n".join([
                f"{m['role'].capitalize()}: {m['content']}"
                for m in self.recent_messages
            ])
            return f"Context:\n{summary_text}\n\nRecent exchanges:\n{recent}"
        return f"Context:\n{summary_text}"
```
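The payoff of the structured form is targeted access: downstream code can read a single field instead of re-parsing prose. A small sketch with a hypothetical summary dict in the shape generated above:

```python
# Hypothetical structured summary, in the shape the prompt above requests
structured_summary = {
    "main_topic": "gaming laptop purchase",
    "user_preferences": ["budget $1500", "lightweight"],
    "decisions_made": ["rejected MSI (too heavy)"],
    "pending_questions": ["warranty terms for the ASUS ROG"],
    "key_facts": ["user's monitor uses HDMI 2.1"],
}

def pending_followups(summary: dict) -> list:
    """Questions the assistant should still address."""
    return summary.get("pending_questions", [])

print(pending_followups(structured_summary))
# ['warranty terms for the ASUS ROG']
```

The same pattern works for routing (e.g. escalate when `pending_questions` keeps growing) or for analytics on `decisions_made`.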
Implementation with LangChain
LangChain provides a native implementation (note that the `langchain.memory` classes are deprecated in recent releases in favor of LangGraph persistence, but they remain widely used):
```python
from langchain.memory import ConversationSummaryMemory
from langchain.chains import ConversationChain
from langchain_openai import ChatOpenAI

# LLM used for both generation and summarization
llm = ChatOpenAI(model="gpt-4", temperature=0.7)

# Native LangChain Summary Memory
memory = ConversationSummaryMemory(
    llm=llm,
    memory_key="history",
    return_messages=True
)

# Conversational chain
conversation = ConversationChain(
    llm=llm,
    memory=memory,
    verbose=True
)

# Usage: summarization happens automatically
response = conversation.predict(input="I'm looking for a gaming laptop")
response = conversation.predict(input="Budget max 1500 euros")
response = conversation.predict(input="I already looked at MSI but too heavy")
# ... after several exchanges, the memory contains a summary

# Inspect the summary
print(memory.buffer)  # Shows the current summary
```
Combining Buffer + Summary
The best approach combines both:
```python
from langchain.memory import ConversationSummaryBufferMemory

# Keep recent messages AND a summary of older ones
memory = ConversationSummaryBufferMemory(
    llm=llm,
    max_token_limit=1000,  # Summarize once history exceeds 1000 tokens
    memory_key="history",
    return_messages=True
)
# More effective than a pure summary for recent context
```
Implementation with LlamaIndex
```python
from llama_index.core.memory import ChatSummaryMemoryBuffer
from llama_index.core.chat_engine import CondensePlusContextChatEngine
from llama_index.llms.openai import OpenAI

# LLM
llm = OpenAI(model="gpt-4", temperature=0.7)

# Memory with automatic summarization
memory = ChatSummaryMemoryBuffer.from_defaults(
    llm=llm,
    token_limit=1500,  # Limit before summarization kicks in
    summarize_prompt=(
        "Summarize the previous conversation, keeping:\n"
        "- The main topic\n"
        "- Decisions made\n"
        "- Important context\n\n"
        "Conversation:\n{conversation}\n\n"
        "Summary:"
    )
)

# Chat engine with memory (index is an existing VectorStoreIndex)
chat_engine = CondensePlusContextChatEngine.from_defaults(
    retriever=index.as_retriever(similarity_top_k=4),
    memory=memory,
    llm=llm
)

# Long conversation: summarization is automatic
for question in user_questions:
    response = chat_engine.chat(question)
    print(response)
```
Best Practices
1. Choose the Right Summary Threshold
```python
# Recommended thresholds by use case
THRESHOLDS = {
    "customer_support": 6,    # Frequent summaries, recent context matters
    "consultation": 10,       # More context before summarizing
    "tutorial": 4,            # Very frequent summaries
    "free_conversation": 8    # Balanced
}

memory = SummaryMemory(
    llm=llm,
    max_messages_before_summary=THRESHOLDS["customer_support"]
)
```
2. Validate Summary Quality
```python
class ValidatedSummaryMemory(SummaryMemory):
    """Summary Memory with summary validation."""

    def _summarize_old_messages(self) -> None:
        super()._summarize_old_messages()
        # Check that the summary isn't empty or too short
        if len(self.summary.split()) < 20:
            # Regenerate with a more explicit prompt
            self._force_detailed_summary()

    def _force_detailed_summary(self) -> None:
        """Regenerate a more detailed summary if needed."""
        prompt = f"""The previous summary was too short. Expand it, including:
- Who the user is and what they're looking for
- Important criteria mentioned
- Options discussed
- The current state of the conversation

Current summary: {self.summary}

Expanded summary:"""
        self.summary = self.llm.invoke(prompt)
```
3. Handle Topic Changes
```python
class TopicAwareSummaryMemory(SummaryMemory):
    """Detects and handles topic changes."""

    def __init__(self, llm, **kwargs):
        super().__init__(llm, **kwargs)
        self.archived_topics: List[Dict] = []  # Summaries of completed topics

    def add_message(self, role: str, content: str) -> None:
        # Detect a topic change
        if role == "user" and self._is_topic_change(content):
            # Finalize the summary of the previous topic
            self._finalize_current_topic()
        super().add_message(role, content)

    def _is_topic_change(self, message: str) -> bool:
        """Detect whether the message changes topic."""
        indicators = [
            "something else", "new topic", "different question",
            "I'd also like", "by the way"
        ]
        return any(ind in message.lower() for ind in indicators)

    def _finalize_current_topic(self) -> None:
        """Archive the current topic before switching."""
        if self.summary:
            self.archived_topics.append({
                "topic": self.summary,
                "timestamp": datetime.now().isoformat()
            })
        self.summary = ""
```
When to Use Summary Memory?
Ideal Use Cases
- In-depth customer support: Multi-step troubleshooting
- Consultations: Personalized advice over time
- Onboarding: Guidance over multiple sessions
- Complex conversations: Negotiations, detailed configurations
When to Avoid
| Situation | Why | Alternative |
|---|---|---|
| Conversations < 5 exchanges | Unnecessary overhead | Buffer Memory |
| Need exact precision | Summary loses details | Buffer Window |
| Critical latency | Summary adds LLM call | Buffer with limit |
Learn More
- Buffer Memory - For short conversations
- Entity Memory - For retaining specific entities
- Conversational RAG - Overview
Summary Memory with Ailog
Implementing robust Summary Memory requires careful attention to summary quality. With Ailog, benefit from optimized memory management:
- Automatic summarization with adaptive thresholds based on conversation
- Topic change detection for relevant summaries
- Structured summary for precise information access
- Multi-session persistence with context resumption
- Analytics on summary quality and information retention
Try Ailog for free and deploy a chatbot with long-term memory.
Related Posts
Conversational RAG: Memory and Multi-Session Context
Implement RAG with conversational memory: context management, multi-session history, and personalized responses.
Buffer Memory: Simple Conversation History
Complete guide to implementing buffer memory in a conversational RAG system: keeping context of recent exchanges for coherent responses.
RAG Generation: Choosing and Optimizing Your LLM
Complete guide to selecting and configuring your LLM in a RAG system: prompting, temperature, tokens, and response optimization.