Buffer Memory: Simple Conversation History
Complete guide to implementing buffer memory in a conversational RAG system: keeping context of recent exchanges for coherent responses.
Buffer memory is the simplest and most intuitive form of conversational memory. It keeps the last N messages of the conversation in a buffer and injects them directly into the LLM context. Simple but effective for short to medium conversations, it's often the first choice for implementing a conversational RAG chatbot.
What is Buffer Memory?
How It Works
Buffer Memory works like a FIFO (First In, First Out) queue for messages:
```
┌─────────────────────────────────────────────────────────────┐
│                       BUFFER MEMORY                         │
├─────────────────────────────────────────────────────────────┤
│ Message 1: [User] Hello, I'm looking for a computer         │
│ Message 2: [AI] Hello! What will you use it for?            │
│ Message 3: [User] Mainly gaming                             │
│ Message 4: [AI] I recommend a config with RTX...            │
│ Message 5: [User] What about the budget?                    │
│ Message 6: [AI] Count between 1200 and 2000 euros...        │
│ ...                                                         │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
               ┌─────────────────────────┐
               │   Injection into the    │
               │       LLM prompt        │
               └─────────────────────────┘
```
When the buffer reaches its limit (in number of messages or tokens), the oldest messages are automatically removed to make room for new ones.
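This FIFO eviction comes for free in Python with `collections.deque` and its `maxlen` parameter, a minimal sketch of the behavior described above:

```python
from collections import deque

# A deque with maxlen automatically evicts the oldest entry
# when a new one is appended: exactly the FIFO behavior above.
buffer = deque(maxlen=4)
for i in range(1, 7):
    buffer.append(f"Message {i}")

print(list(buffer))
# → ['Message 3', 'Message 4', 'Message 5', 'Message 6']
```

Messages 1 and 2 are silently dropped once the fifth and sixth messages arrive, with no bookkeeping code on our side.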
Memory Approaches Comparison
| Approach | Advantages | Disadvantages | Use Case |
|---|---|---|---|
| No memory | Very simple, no cost | Loses all context | Static FAQ |
| Buffer memory | Exact context, easy to debug | Limited by token budget | Short conversations |
| Summary memory | Long memory, compact | Loss of details, complex | Long conversations |
| Entity memory | Retains key entities | More complex | Specific object tracking |
Key statistic: A Buffer Memory of 10 conversation turns consumes on average 2000-3000 tokens, about 15-20% of GPT-4's context window. For 80% of customer support use cases, 6-8 turns are sufficient.
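You can sanity-check this order of magnitude with the common rule of thumb of roughly 4 characters per token for English text (use `tiktoken` for exact counts; `estimate_tokens` below is an illustrative helper, not a library function):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # For exact counts, use tiktoken in production.
    return max(1, len(text) // 4)

# 10 conversation turns = 20 messages (user + assistant),
# each around 600 characters (~100 words).
turns = ["x" * 600] * 20
total = sum(estimate_tokens(m) for m in turns)
print(total)  # → 3000
```

The estimate lands right in the 2000-3000 token range quoted above, which is why trimming the buffer matters.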
Basic Python Implementation
Buffer Memory from Scratch
```python
from collections import deque
from typing import List, Dict
from datetime import datetime


class ConversationBufferMemory:
    """
    Simple buffer memory for conversations.
    Keeps the last N messages in a deque.
    """

    def __init__(self, max_messages: int = 10):
        self.messages: deque = deque(maxlen=max_messages)
        self.created_at = datetime.now()

    def add_message(self, role: str, content: str) -> None:
        """Add a message to the buffer"""
        self.messages.append({
            "role": role,
            "content": content,
            "timestamp": datetime.now().isoformat()
        })

    def get_history(self) -> List[Dict]:
        """Return all messages"""
        return list(self.messages)

    def get_context_string(self) -> str:
        """Format history for prompt injection"""
        return "\n".join([
            f"{m['role'].capitalize()}: {m['content']}"
            for m in self.messages
        ])

    def clear(self) -> None:
        """Clear the buffer"""
        self.messages.clear()

    def __len__(self) -> int:
        return len(self.messages)
```
Integration with a RAG Pipeline
```python
class ConversationalRAG:
    """
    Complete RAG pipeline with buffer memory
    """

    def __init__(self, vector_store, llm, max_history: int = 6):
        self.vector_store = vector_store
        self.llm = llm
        self.memory = ConversationBufferMemory(max_messages=max_history)

    def query(self, user_message: str) -> str:
        """Execute a RAG query with conversational context"""
        # 1. Add the user message
        self.memory.add_message("user", user_message)

        # 2. Retrieve relevant documents
        docs = self.vector_store.similarity_search(user_message, k=3)
        context = "\n\n".join([
            f"Document {i+1}:\n{d.page_content}"
            for i, d in enumerate(docs)
        ])

        # 3. Build the prompt with history
        history = self.memory.get_context_string()
        prompt = f"""You are a helpful assistant.
Use the provided documents and conversation history to answer.

Conversation history:
{history}

Relevant documents:
{context}

Current question: {user_message}

Answer precisely, taking into account the conversation context."""

        # 4. Generate the response
        response = self.llm.invoke(prompt)

        # 5. Save the response
        self.memory.add_message("assistant", response)
        return response

    def reset(self) -> None:
        """Reset the conversation"""
        self.memory.clear()
```
Intelligent Token Management
The main problem with Buffer Memory is token consumption. Here's an optimized version:
```python
from typing import Dict, List

import tiktoken


class TokenAwareBufferMemory:
    """
    Buffer memory with token management.
    Removes old messages when the limit is reached.
    """

    def __init__(self, max_tokens: int = 2000, model: str = "gpt-4o"):
        self.max_tokens = max_tokens
        self.encoding = tiktoken.encoding_for_model(model)
        self.messages: List[Dict] = []

    def add_message(self, role: str, content: str) -> None:
        """Add a message and trim if necessary"""
        self.messages.append({"role": role, "content": content})
        self._trim_to_token_limit()

    def _count_tokens(self) -> int:
        """Count the total number of tokens"""
        text = " ".join([m["content"] for m in self.messages])
        return len(self.encoding.encode(text))

    def _trim_to_token_limit(self) -> None:
        """Remove the oldest messages if the limit is exceeded"""
        while self._count_tokens() > self.max_tokens and len(self.messages) > 2:
            # Always keep at least the last exchange
            self.messages.pop(0)

    def get_messages_for_api(self) -> List[Dict]:
        """Return messages in OpenAI API format"""
        return [
            {"role": m["role"], "content": m["content"]}
            for m in self.messages
        ]

    @property
    def token_usage(self) -> dict:
        """Usage statistics"""
        current = self._count_tokens()
        return {
            "current_tokens": current,
            "max_tokens": self.max_tokens,
            "utilization": current / self.max_tokens,
            "messages_count": len(self.messages)
        }
```
Implementation with LangChain
LangChain provides ready-to-use implementations:
```python
from langchain.memory import (
    ConversationBufferMemory,
    ConversationBufferWindowMemory,
)
from langchain.chains import ConversationChain
from langchain_openai import ChatOpenAI

# Basic buffer memory
memory = ConversationBufferMemory(
    memory_key="history",
    return_messages=True
)

# With a sliding window (keeps the last k exchanges)
memory_window = ConversationBufferWindowMemory(
    k=5,  # Keep the last 5 exchanges
    memory_key="history",
    return_messages=True
)

# Integration with a conversational chain
llm = ChatOpenAI(model="gpt-4", temperature=0.7)
conversation = ConversationChain(
    llm=llm,
    memory=memory,
    verbose=True
)

# Usage
response = conversation.predict(input="I'm looking for a gaming laptop")
print(response)

response = conversation.predict(input="Budget 1500 euros max")
print(response)  # Understands the laptop context
```
Integration with ConversationalRetrievalChain
```python
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain_community.vectorstores import Qdrant
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Vector store configuration
# (qdrant_client is an already connected QdrantClient instance)
vectorstore = Qdrant(
    client=qdrant_client,
    collection_name="products",
    embeddings=OpenAIEmbeddings()
)

# Memory for conversational RAG
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True,
    output_key="answer"
)

# RAG chain with memory
rag_chain = ConversationalRetrievalChain.from_llm(
    llm=ChatOpenAI(model="gpt-4"),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    memory=memory,
    return_source_documents=True
)

# Natural conversation
result = rag_chain({"question": "What is your return policy?"})
result2 = rag_chain({"question": "And for electronic products?"})
# Understands that "And for" refers to the return policy
```
Implementation with LlamaIndex
```python
from llama_index.core import VectorStoreIndex
from llama_index.core.chat_engine import CondensePlusContextChatEngine
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.llms.openai import OpenAI

# Create the index (documents loaded beforehand)
index = VectorStoreIndex.from_documents(documents)

# Buffer memory with a token limit
memory = ChatMemoryBuffer.from_defaults(
    token_limit=3000
)

# Chat engine with memory
chat_engine = CondensePlusContextChatEngine.from_defaults(
    retriever=index.as_retriever(similarity_top_k=4),
    memory=memory,
    llm=OpenAI(model="gpt-4"),
    context_prompt=(
        "Relevant context:\n{context_str}\n\n"
        "Answer using this context and history."
    ),
    verbose=True
)

# Conversation
response1 = chat_engine.chat("What are your hours?")
response2 = chat_engine.chat("And on weekends?")  # Understands context
```
Best Practices and Optimizations
1. Compressing Long Messages
```python
def compress_message(content: str, max_chars: int = 500) -> str:
    """Truncate messages that are too long"""
    if len(content) <= max_chars:
        return content
    # Keep the beginning and the end
    half = max_chars // 2
    return f"{content[:half]}... [truncated] ...{content[-half:]}"
```
2. Persistence and Storage
```python
import json
from pathlib import Path


class PersistentBufferMemory(ConversationBufferMemory):
    """Buffer with automatic saving (extends the class defined earlier)"""

    def __init__(self, session_id: str, storage_dir: str = "./sessions"):
        super().__init__()
        self.session_id = session_id
        self.storage_path = Path(storage_dir) / f"{session_id}.json"
        self._load()

    def _load(self) -> None:
        if self.storage_path.exists():
            with open(self.storage_path, "r") as f:
                data = json.load(f)
            for msg in data.get("messages", []):
                self.messages.append(msg)

    def save(self) -> None:
        self.storage_path.parent.mkdir(parents=True, exist_ok=True)
        with open(self.storage_path, "w") as f:
            json.dump({"messages": list(self.messages)}, f)

    def add_message(self, role: str, content: str) -> None:
        super().add_message(role, content)
        self.save()
```
3. Monitoring and Metrics
```python
class MonitoredBufferMemory(TokenAwareBufferMemory):
    """Buffer with monitoring metrics"""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.stats = {
            "total_messages": 0,
            "messages_evicted": 0,
            "peak_tokens": 0
        }

    def add_message(self, role: str, content: str) -> None:
        count_before = len(self.messages)
        super().add_message(role, content)
        self.stats["total_messages"] += 1
        self.stats["messages_evicted"] += max(
            0, count_before + 1 - len(self.messages)
        )
        self.stats["peak_tokens"] = max(
            self.stats["peak_tokens"],
            self._count_tokens()
        )
```
When to Use Buffer Memory?
Ideal Use Cases
- Short conversations: Less than 10 exchanges
- Simple follow-up questions: "And for accessories?", "How much does it cost?"
- Level 1 customer support: FAQ with recent context
- FAQ chatbots: Independent questions with little follow-up
When to Switch
| Signal | Solution |
|---|---|
| Conversations > 15 turns | Summary Memory |
| Need to retain names, products | Entity Memory |
| Very tight token budget | Token Buffer with low limit |
| Persistent multi-sessions | Database + Summary |
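The signals in the table can be turned into a simple routing rule. The sketch below is a hypothetical helper (the function name and thresholds are illustrative, not from any library), mapping conversation characteristics to a memory strategy:

```python
def choose_memory_strategy(turns: int, needs_entities: bool,
                           token_budget: int) -> str:
    """Map the signals from the table above to a memory strategy.

    Thresholds are illustrative defaults; tune them per use case.
    """
    if needs_entities:
        return "entity_memory"      # must retain names, products, etc.
    if turns > 15:
        return "summary_memory"     # buffer would overflow the context
    if token_budget < 1000:
        return "token_buffer"       # tight budget: trim aggressively
    return "buffer_memory"          # short conversation: keep it simple

print(choose_memory_strategy(turns=20, needs_entities=False, token_budget=4000))
# → summary_memory
```

In practice you would re-evaluate this choice as the conversation grows, starting with buffer memory and switching to summarization once the turn count passes the threshold.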
Learn More
- Summary Memory - For long conversations
- Entity Memory - For retaining entities
- Conversational RAG - Overview
Buffer Memory with Ailog
Implementing and optimizing Buffer Memory takes time. With Ailog, get turnkey conversational memory management:
- Optimized Buffer Memory with automatic token management
- Adaptive window based on conversation complexity
- Automatic persistence with session resumption
- Analytics on memory usage and conversation quality
- Zero configuration: it works out of the box
Try Ailog for free and deploy a chatbot with memory in 3 minutes.
Related Posts
Conversational RAG: Memory and Multi-Session Context
Implement RAG with conversational memory: context management, multi-session history, and personalized responses.
RAG Generation: Choosing and Optimizing Your LLM
Complete guide to selecting and configuring your LLM in a RAG system: prompting, temperature, tokens, and response optimization.
RAG Agents: Orchestrating Multi-Agent Systems
Architect multi-agent RAG systems: orchestration, specialization, collaboration and failure handling for complex assistants.