Guide · Intermediate

Summary Memory: Summarize to Remember Long

March 28, 2026
18 min read
Ailog Team

Complete guide to implementing summary memory in a RAG system: condensing long conversations to maintain context without exploding tokens.


Summary Memory is an advanced conversational memory technique that condenses the exchange history into successive summaries. It makes it possible to maintain context across dozens, even hundreds, of exchanges without blowing up the token budget. This guide explains how to implement it effectively.

Why Summary Memory?

The Problem with Long Conversations

With classic Buffer Memory, every message consumes tokens, and after 15-20 exchanges you quickly hit the context limit:

┌─────────────────────────────────────────────────────────────┐
│               BUFFER MEMORY PROBLEM                          │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Exchange 1-5:    ~800 tokens    ██░░░░░░░░                 │
│  Exchange 1-10:   ~1600 tokens   ████░░░░░░                 │
│  Exchange 1-15:   ~2400 tokens   ██████░░░░                 │
│  Exchange 1-20:   ~3200 tokens   ████████░░   LIMIT!        │
│  Exchange 1-30:   ~4800 tokens   ████████████ EXCEEDED!     │
│                                                              │
│  GPT-4 window: ~8000 usable tokens                          │
│  (rest of context = RAG documents + prompt)                 │
│                                                              │
└─────────────────────────────────────────────────────────────┘
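The growth in the diagram above is easy to reproduce as a back-of-the-envelope calculation. This is a minimal sketch assuming ~160 tokens per exchange (the figure the diagram implies) and an ~8000-token usable window; real counts depend on the tokenizer and message lengths.

```python
# Rough token-budget estimate for a raw message buffer.
# TOKENS_PER_EXCHANGE and USABLE_WINDOW are assumptions taken from the
# diagram above, not measured values.

TOKENS_PER_EXCHANGE = 160   # assumed average (user turn + assistant turn)
USABLE_WINDOW = 8000        # usable tokens after RAG documents + prompt

def buffer_tokens(num_exchanges: int) -> int:
    """Tokens consumed by a raw message buffer after N exchanges."""
    return num_exchanges * TOKENS_PER_EXCHANGE

def exchanges_until_limit(limit: int = USABLE_WINDOW // 2) -> int:
    """How many exchanges fit before the buffer eats `limit` tokens."""
    return limit // TOKENS_PER_EXCHANGE

print(buffer_tokens(20))          # 3200, matching the diagram
print(exchanges_until_limit())    # 25 exchanges before half the window is gone
```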

The Solution: Progressive Summarization

┌─────────────────────────────────────────────────────────────┐
│               SUMMARY MEMORY                                 │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌────────────────────────────────────────────────────┐     │
│  │ SUMMARY (200-300 tokens)                            │     │
│  │ "The user is looking for a gaming laptop budget    │     │
│  │  $1500. Rejected MSI because too heavy.            │     │
│  │  Interested in ASUS ROG, questions about warranty."│     │
│  └────────────────────────────────────────────────────┘     │
│                            +                                 │
│  ┌────────────────────────────────────────────────────┐     │
│  │ RECENT MESSAGES (last 4-6 exchanges)               │     │
│  │ User: "What about the graphics card?"              │     │
│  │ AI: "The ASUS ROG has an RTX 4060..."              │     │
│  │ User: "Is it compatible with my monitor?"          │     │
│  │ AI: "Yes, the HDMI 2.1 port..."                    │     │
│  └────────────────────────────────────────────────────┘     │
│                                                              │
│  TOTAL: ~600 tokens (vs 3000+ with buffer)                  │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Key statistic: Summary Memory reduces token consumption by 70-80% for long conversations while retaining 90%+ of relevant context.

Approach Comparison

| Scenario | Buffer Memory | Summary Memory | Recommendation |
|---|---|---|---|
| < 5 exchanges | Optimal | Overkill | Buffer |
| 5-15 exchanges | OK with window | Good | Buffer Window |
| 15-50 exchanges | Limited | Optimal | Summary |
| 50+ exchanges | Impossible | Excellent | Summary |
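The table above can be turned into a tiny selection helper. This is a sketch that mirrors the article's recommendations; the thresholds are guidelines, not hard rules.

```python
# Pick a memory strategy from the conversation length, following the
# recommendation column of the comparison table above.

def pick_memory_strategy(num_exchanges: int) -> str:
    """Return the recommended memory type for a conversation length."""
    if num_exchanges < 5:
        return "buffer"
    if num_exchanges <= 15:
        return "buffer_window"
    return "summary"

print(pick_memory_strategy(3))    # buffer
print(pick_memory_strategy(10))   # buffer_window
print(pick_memory_strategy(40))   # summary
```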

Basic Implementation

Summary Memory from Scratch

```python
from typing import List, Dict
from datetime import datetime


class SummaryMemory:
    """Conversational memory with progressive summarization"""

    def __init__(
        self,
        llm,
        max_messages_before_summary: int = 6,
        summary_max_tokens: int = 300
    ):
        self.llm = llm
        self.max_messages = max_messages_before_summary
        self.summary_max_tokens = summary_max_tokens
        self.summary: str = ""
        self.recent_messages: List[Dict] = []

    def add_message(self, role: str, content: str) -> None:
        """Add a message and trigger summarization if needed"""
        self.recent_messages.append({
            "role": role,
            "content": content,
            "timestamp": datetime.now().isoformat()
        })
        # If too many messages, summarize the oldest ones
        if len(self.recent_messages) > self.max_messages:
            self._summarize_old_messages()

    def _summarize_old_messages(self) -> None:
        """Summarize the oldest messages"""
        # Split: old messages to summarize, recent ones to keep
        split_point = len(self.recent_messages) // 2
        to_summarize = self.recent_messages[:split_point]
        self.recent_messages = self.recent_messages[split_point:]

        # Format the conversation to summarize
        conversation = "\n".join([
            f"{m['role'].capitalize()}: {m['content']}"
            for m in to_summarize
        ])

        # Generate the new summary
        prompt = f"""You need to update a conversation summary.

Existing summary:
{self.summary if self.summary else "No previous summary."}

New exchanges to integrate:
{conversation}

Instructions:
- Integrate important information from the new exchanges into the summary
- Keep decisions made, preferences expressed, problems resolved
- Remove redundant or obsolete details
- The summary must be concise (max {self.summary_max_tokens} tokens)
- Write in the third person ("The user...", "The assistant...")

Updated summary:"""
        self.summary = self.llm.invoke(prompt)

    def get_context(self) -> str:
        """Return the complete context for the prompt"""
        parts = []
        if self.summary:
            parts.append(f"Previous conversation summary:\n{self.summary}")
        if self.recent_messages:
            recent = "\n".join([
                f"{m['role'].capitalize()}: {m['content']}"
                for m in self.recent_messages
            ])
            parts.append(f"Recent exchanges:\n{recent}")
        return "\n\n".join(parts)

    def clear(self) -> None:
        """Reset the memory"""
        self.summary = ""
        self.recent_messages = []
```
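The summarize-then-keep-recent mechanics can be exercised without a real model by plugging in a stub whose `invoke` returns a fixed condensed string. This is a self-contained test sketch (the `StubLLM` and `MiniSummaryMemory` names are illustrative, not from the article), restating just the core loop so it runs on its own.

```python
# Stub LLM: stands in for a real model so the summarization flow can be
# tested deterministically. `invoke(prompt)` is the only method the
# memory class relies on.

class StubLLM:
    def invoke(self, prompt: str) -> str:
        # A real model would condense the conversation; the stub just
        # returns a short fixed summary.
        return "The user is comparing gaming laptops under $1500."


# Minimal restatement of the progressive-summarization loop.
class MiniSummaryMemory:
    def __init__(self, llm, max_messages=4):
        self.llm = llm
        self.max_messages = max_messages
        self.summary = ""
        self.recent_messages = []

    def add_message(self, role, content):
        self.recent_messages.append({"role": role, "content": content})
        if len(self.recent_messages) > self.max_messages:
            # Summarize the older half, keep the recent half verbatim
            split = len(self.recent_messages) // 2
            old = self.recent_messages[:split]
            self.recent_messages = self.recent_messages[split:]
            self.summary = self.llm.invoke(
                f"Summary so far: {self.summary}\nNew: {old}"
            )


memory = MiniSummaryMemory(StubLLM(), max_messages=4)
for i in range(6):
    memory.add_message("user" if i % 2 == 0 else "assistant", f"message {i}")

print(len(memory.recent_messages))  # only the recent messages are kept
print(memory.summary)               # the stub summary replaces the old half
```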

Integration with RAG

```python
class RAGWithSummaryMemory:
    """Complete RAG pipeline with Summary Memory"""

    def __init__(self, vector_store, llm):
        self.vector_store = vector_store
        self.llm = llm
        self.memory = SummaryMemory(llm, max_messages_before_summary=6)

    def query(self, user_message: str) -> str:
        """Execute a RAG query with summarized context"""
        # 1. Add the user message
        self.memory.add_message("user", user_message)

        # 2. Retrieve relevant documents
        docs = self.vector_store.similarity_search(user_message, k=3)
        doc_context = "\n\n".join([
            f"Document {i+1}:\n{d.page_content}"
            for i, d in enumerate(docs)
        ])

        # 3. Build the prompt with memory
        memory_context = self.memory.get_context()
        prompt = f"""You are a helpful assistant. Use the conversation context and documents to answer.

{memory_context}

Relevant documents:
{doc_context}

Current question: {user_message}

Answer precisely and contextually."""

        # 4. Generate and save the response
        response = self.llm.invoke(prompt)
        self.memory.add_message("assistant", response)
        return response
```

Advanced Techniques

Incremental Summarization (More Efficient)

Instead of regenerating the complete summary, enrich it incrementally:

```python
class IncrementalSummaryMemory(SummaryMemory):
    """
    Summary Memory with incremental updates.
    Faster and cheaper in tokens.
    """

    def _summarize_old_messages(self) -> None:
        """Incremental summary update"""
        split_point = len(self.recent_messages) // 2
        to_summarize = self.recent_messages[:split_point]
        self.recent_messages = self.recent_messages[split_point:]

        conversation = "\n".join([
            f"{m['role'].capitalize()}: {m['content']}"
            for m in to_summarize
        ])

        # Optimized prompt for an incremental update
        if self.summary:
            prompt = f"""Current summary:
{self.summary}

New conversation elements:
{conversation}

Update the summary by:
1. Adding important new information
2. Removing obsolete information
3. Keeping the summary concise (max 200 words)

Updated summary:"""
        else:
            prompt = f"""Summarize this conversation, extracting:
- The main topic
- Decisions made
- User preferences
- Pending questions

Conversation:
{conversation}

Summary (max 200 words):"""

        self.summary = self.llm.invoke(prompt)
```

Structured Summary

For better organization, use a structured summary:

```python
import json


class StructuredSummaryMemory(SummaryMemory):
    """
    Summary Memory with a JSON-structured summary.
    Allows more precise access to information.
    """

    def __init__(self, llm, **kwargs):
        super().__init__(llm, **kwargs)
        self.structured_summary: Dict = {
            "main_topic": "",
            "user_preferences": [],
            "decisions_made": [],
            "pending_questions": [],
            "key_facts": []
        }

    def _summarize_old_messages(self) -> None:
        """Generate a structured summary"""
        split_point = len(self.recent_messages) // 2
        to_summarize = self.recent_messages[:split_point]
        self.recent_messages = self.recent_messages[split_point:]

        conversation = "\n".join([
            f"{m['role'].capitalize()}: {m['content']}"
            for m in to_summarize
        ])

        prompt = f"""Analyze this conversation and update the structured summary.

Current summary:
{json.dumps(self.structured_summary, ensure_ascii=False, indent=2)}

New exchanges:
{conversation}

Return a JSON object with the following structure:
{{
  "main_topic": "main conversation topic",
  "user_preferences": ["list of expressed preferences"],
  "decisions_made": ["list of decisions made"],
  "pending_questions": ["unresolved questions"],
  "key_facts": ["important facts to remember"]
}}

Updated JSON:"""

        response = self.llm.invoke(prompt)
        try:
            self.structured_summary = json.loads(response)
        except json.JSONDecodeError:
            # Fall back to a plain-text summary
            self.summary = response

    def get_context(self) -> str:
        """Format the structured summary for the prompt"""
        parts = []
        if self.structured_summary.get("main_topic"):
            parts.append(f"Topic: {self.structured_summary['main_topic']}")
        if self.structured_summary.get("user_preferences"):
            prefs = ", ".join(self.structured_summary["user_preferences"])
            parts.append(f"User preferences: {prefs}")
        if self.structured_summary.get("decisions_made"):
            decisions = ", ".join(self.structured_summary["decisions_made"])
            parts.append(f"Decisions made: {decisions}")
        if self.structured_summary.get("pending_questions"):
            questions = ", ".join(self.structured_summary["pending_questions"])
            parts.append(f"Pending questions: {questions}")

        summary_text = "\n".join(parts) if parts else ""

        if self.recent_messages:
            recent = "\n".join([
                f"{m['role'].capitalize()}: {m['content']}"
                for m in self.recent_messages
            ])
            return f"Context:\n{summary_text}\n\nRecent exchanges:\n{recent}"
        return f"Context:\n{summary_text}"
```
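One practical caveat: LLMs frequently wrap JSON answers in markdown code fences, which makes a bare `json.loads` fail and triggers the text fallback above more often than necessary. Here is a small illustrative helper (the `parse_llm_json` name is an assumption, not part of the class) that strips fences before parsing:

```python
import json
import re

def parse_llm_json(response: str):
    """
    Parse JSON from an LLM response, tolerating markdown code fences.
    Returns None if no valid JSON can be extracted.
    """
    # Strip ```json ... ``` or ``` ... ``` fences if present
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", response, re.DOTALL)
    candidate = match.group(1) if match else response.strip()
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None

print(parse_llm_json('{"main_topic": "laptops"}'))
fenced = "`" * 3 + 'json\n{"main_topic": "laptops"}\n' + "`" * 3
print(parse_llm_json(fenced))
print(parse_llm_json("not json"))  # None
```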

Implementation with LangChain

LangChain provides a native implementation (note that in recent LangChain releases these legacy memory classes are deprecated in favor of LangGraph persistence, but they still work):

```python
from langchain.memory import ConversationSummaryMemory
from langchain.chains import ConversationChain
from langchain_openai import ChatOpenAI

# LLM used for both generation and summarization
llm = ChatOpenAI(model="gpt-4", temperature=0.7)

# Native LangChain Summary Memory
memory = ConversationSummaryMemory(
    llm=llm,
    memory_key="history",
    return_messages=True
)

# Conversational chain
conversation = ConversationChain(
    llm=llm,
    memory=memory,
    verbose=True
)

# Usage - summarization happens automatically
response = conversation.predict(input="I'm looking for a gaming laptop")
response = conversation.predict(input="Budget max 1500 euros")
response = conversation.predict(input="I already looked at MSI but too heavy")
# ... after several exchanges, the memory contains a summary

# Inspect the summary
print(memory.buffer)  # Shows the current summary
```

Combining Buffer + Summary

The best approach combines both:

```python
from langchain.memory import ConversationSummaryBufferMemory

# Keep recent messages AND a summary of older ones
memory = ConversationSummaryBufferMemory(
    llm=llm,
    max_token_limit=1000,  # Summarize when exceeding 1000 tokens
    memory_key="history",
    return_messages=True
)

# More effective than a pure summary for recent context
```

Implementation with LlamaIndex

```python
from llama_index.core.memory import ChatSummaryMemoryBuffer
from llama_index.core.chat_engine import CondensePlusContextChatEngine
from llama_index.llms.openai import OpenAI

# LLM
llm = OpenAI(model="gpt-4", temperature=0.7)

# Memory with automatic summarization
memory = ChatSummaryMemoryBuffer.from_defaults(
    llm=llm,
    token_limit=1500,  # Limit before summarization
    summarize_prompt=(
        "Summarize the previous conversation, keeping:\n"
        "- The main topic\n"
        "- Decisions made\n"
        "- Important context\n\n"
        "Conversation:\n{conversation}\n\n"
        "Summary:"
    )
)

# Chat engine with memory
chat_engine = CondensePlusContextChatEngine.from_defaults(
    retriever=index.as_retriever(similarity_top_k=4),
    memory=memory,
    llm=llm
)

# Long conversation - automatic summarization
for question in user_questions:
    response = chat_engine.chat(question)
    print(response)
```

Best Practices

1. Choose the Right Summary Threshold

```python
# Recommended thresholds by use case
THRESHOLDS = {
    "customer_support": 6,    # Frequent summaries, recent context matters
    "consultation": 10,       # More context before summarizing
    "tutorial": 4,            # Very frequent summaries
    "free_conversation": 8    # Balanced
}

memory = SummaryMemory(
    llm=llm,
    max_messages_before_summary=THRESHOLDS["customer_support"]
)
```

2. Validate Summary Quality

```python
class ValidatedSummaryMemory(SummaryMemory):
    """Summary Memory with summary validation"""

    def _summarize_old_messages(self) -> None:
        super()._summarize_old_messages()
        # Check that the summary isn't empty or too short
        if len(self.summary.split()) < 20:
            # Regenerate with a more explicit prompt
            self._force_detailed_summary()

    def _force_detailed_summary(self) -> None:
        """Regenerate a more detailed summary if needed"""
        prompt = f"""The previous summary was too short. Expand it, including:
- Who the user is and what they're looking for
- Important criteria mentioned
- Options discussed
- The current state of the conversation

Current summary:
{self.summary}

Expanded summary:"""
        self.summary = self.llm.invoke(prompt)
```

3. Handle Topic Changes

```python
class TopicAwareSummaryMemory(SummaryMemory):
    """Detects and handles topic changes"""

    def __init__(self, llm, **kwargs):
        super().__init__(llm, **kwargs)
        self.archived_topics: List[Dict] = []

    def add_message(self, role: str, content: str) -> None:
        # Detect a topic change
        if role == "user" and self._is_topic_change(content):
            # Finalize the summary of the previous topic
            self._finalize_current_topic()
        super().add_message(role, content)

    def _is_topic_change(self, message: str) -> bool:
        """Detect whether the message changes topic"""
        indicators = [
            "something else", "new topic", "different question",
            "I'd also like", "by the way"
        ]
        return any(ind in message.lower() for ind in indicators)

    def _finalize_current_topic(self) -> None:
        """Archive the current topic before switching"""
        if self.summary:
            self.archived_topics.append({
                "topic": self.summary,
                "timestamp": datetime.now().isoformat()
            })
            self.summary = ""
```
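The keyword heuristic above is cheap but brittle; an embedding-similarity check between consecutive user messages would be more robust at the cost of an extra model call. The heuristic itself can be exercised standalone (this sketch repeats the indicator list for self-containment; note the original `"I'd also like"` indicator is compared case-sensitively in the article's version, so it is lowercased here):

```python
# Standalone version of the keyword topic-change heuristic.
# The indicator list is illustrative, not exhaustive.

TOPIC_CHANGE_INDICATORS = [
    "something else", "new topic", "different question",
    "i'd also like", "by the way",
]

def is_topic_change(message: str) -> bool:
    """True if the message contains a topic-change indicator phrase."""
    lowered = message.lower()
    return any(ind in lowered for ind in TOPIC_CHANGE_INDICATORS)

print(is_topic_change("By the way, what about accessories?"))  # True
print(is_topic_change("What about the graphics card?"))        # False
```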

When to Use Summary Memory?

Ideal Use Cases

  • In-depth customer support: Multi-step troubleshooting
  • Consultations: Personalized advice over time
  • Onboarding: Guidance over multiple sessions
  • Complex conversations: Negotiations, detailed configurations

When to Avoid

| Situation | Why | Alternative |
|---|---|---|
| Conversations < 5 exchanges | Unnecessary overhead | Buffer Memory |
| Need exact precision | A summary loses details | Buffer Window |
| Critical latency | Summarization adds an LLM call | Buffer with limit |


Summary Memory with Ailog

Implementing robust Summary Memory requires careful attention to summary quality. With Ailog, benefit from optimized memory management:

  • Automatic summarization with adaptive thresholds based on conversation
  • Topic change detection for relevant summaries
  • Structured summary for precise information access
  • Multi-session persistence with context resumption
  • Analytics on summary quality and information retention

Try Ailog for free and deploy a chatbot with long-term memory.

Tags

RAG · memory · summary · long-term · compression · LangChain · LlamaIndex
