Guide · Intermediate

Buffer Memory: Simple Conversation History

March 27, 2026
15 min read
Ailog Team

Complete guide to implementing buffer memory in a conversational RAG system: keeping the context of recent exchanges for coherent responses.


Buffer memory is the simplest and most intuitive form of conversational memory. It keeps the last N messages of the conversation in a buffer and injects them directly into the LLM context. Simple but effective for short to medium conversations, it's often the first choice for implementing a conversational RAG chatbot.

What is Buffer Memory?

How It Works

Buffer Memory works like a FIFO (First In, First Out) queue for messages:

┌─────────────────────────────────────────────────────────────┐
│                      BUFFER MEMORY                           │
├─────────────────────────────────────────────────────────────┤
│  Message 1: [User] Hello, I'm looking for a computer        │
│  Message 2: [AI] Hello! What will you use it for?           │
│  Message 3: [User] Mainly gaming                             │
│  Message 4: [AI] I recommend a config with RTX...           │
│  Message 5: [User] What about the budget?                    │
│  Message 6: [AI] Expect between 1200 and 2000 euros...      │
│  ...                                                         │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
              ┌─────────────────────────┐
              │   Injection into the    │
              │   LLM prompt            │
              └─────────────────────────┘

When the buffer reaches its limit (in number of messages or tokens), the oldest messages are automatically removed to make room for new ones.
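This FIFO eviction is exactly what Python's `collections.deque` with a `maxlen` gives you for free: once the deque is full, appending a new item silently drops the oldest one. A minimal sketch (the 3-message cap is just for illustration):

```python
from collections import deque

# A buffer capped at 3 messages: the deque evicts from the left automatically
buffer = deque(maxlen=3)

for i in range(1, 6):
    buffer.append(f"Message {i}")

# Messages 1 and 2 were evicted to make room for 4 and 5
print(list(buffer))  # ['Message 3', 'Message 4', 'Message 5']
```

The full implementations below build on this same primitive.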

Memory Approaches Comparison

| Approach | Advantages | Disadvantages | Use Case |
|---|---|---|---|
| No memory | Very simple, no cost | Loses all context | Static FAQ |
| Buffer memory | Exact context, easy to debug | Token limited | Short conversations |
| Summary memory | Long memory, compact | Loss of detail, more complex | Long conversations |
| Entity memory | Retains key entities | More complex | Tracking specific objects |

Key statistic: A Buffer Memory of 10 conversation turns consumes on average 2000-3000 tokens, about 15-20% of GPT-4's context window. For 80% of customer support use cases, 6-8 turns are sufficient.

Basic Python Implementation

Buffer Memory from Scratch

```python
from collections import deque
from typing import List, Dict
from datetime import datetime


class ConversationBufferMemory:
    """
    Simple buffer memory for conversations.
    Keeps the last N messages in a deque.
    """

    def __init__(self, max_messages: int = 10):
        self.messages: deque = deque(maxlen=max_messages)
        self.created_at = datetime.now()

    def add_message(self, role: str, content: str) -> None:
        """Add a message to the buffer"""
        self.messages.append({
            "role": role,
            "content": content,
            "timestamp": datetime.now().isoformat()
        })

    def get_history(self) -> List[Dict]:
        """Return all messages"""
        return list(self.messages)

    def get_context_string(self) -> str:
        """Format the history for prompt injection"""
        return "\n".join(
            f"{m['role'].capitalize()}: {m['content']}"
            for m in self.messages
        )

    def clear(self) -> None:
        """Clear the buffer"""
        self.messages.clear()

    def __len__(self) -> int:
        return len(self.messages)
```

Integration with a RAG Pipeline

```python
class ConversationalRAG:
    """
    Complete RAG pipeline with Buffer Memory
    """

    def __init__(self, vector_store, llm, max_history: int = 6):
        self.vector_store = vector_store
        self.llm = llm
        self.memory = ConversationBufferMemory(max_messages=max_history)

    def query(self, user_message: str) -> str:
        """Execute a RAG query with conversational context"""
        # 1. Add the user message
        self.memory.add_message("user", user_message)

        # 2. Retrieve relevant documents
        docs = self.vector_store.similarity_search(user_message, k=3)
        context = "\n\n".join(
            f"Document {i+1}:\n{d.page_content}"
            for i, d in enumerate(docs)
        )

        # 3. Build the prompt with the history
        history = self.memory.get_context_string()
        prompt = f"""You are a helpful assistant.
Use the provided documents and conversation history to answer.

Conversation history:
{history}

Relevant documents:
{context}

Current question: {user_message}

Answer precisely, taking the conversation context into account."""

        # 4. Generate the response
        response = self.llm.invoke(prompt)

        # 5. Save the response
        self.memory.add_message("assistant", response)
        return response

    def reset(self) -> None:
        """Reset the conversation"""
        self.memory.clear()
```

Intelligent Token Management

The main problem with Buffer Memory is token consumption. Here's an optimized version:

```python
import tiktoken
from typing import List, Dict


class TokenAwareBufferMemory:
    """
    Buffer memory with token management.
    Removes old messages when the limit is reached.
    """

    def __init__(self, max_tokens: int = 2000, model: str = "gpt-4o"):
        self.max_tokens = max_tokens
        self.encoding = tiktoken.encoding_for_model(model)
        self.messages: List[Dict] = []

    def add_message(self, role: str, content: str) -> None:
        """Add a message and trim if necessary"""
        self.messages.append({"role": role, "content": content})
        self._trim_to_token_limit()

    def _count_tokens(self) -> int:
        """Count the total number of tokens"""
        text = " ".join(m["content"] for m in self.messages)
        return len(self.encoding.encode(text))

    def _trim_to_token_limit(self) -> None:
        """Remove the oldest messages if the limit is exceeded"""
        while self._count_tokens() > self.max_tokens and len(self.messages) > 2:
            # Always keep at least the last exchange
            self.messages.pop(0)

    def get_messages_for_api(self) -> List[Dict]:
        """Return messages in OpenAI API format"""
        return [
            {"role": m["role"], "content": m["content"]}
            for m in self.messages
        ]

    @property
    def token_usage(self) -> dict:
        """Usage statistics"""
        current = self._count_tokens()
        return {
            "current_tokens": current,
            "max_tokens": self.max_tokens,
            "utilization": current / self.max_tokens,
            "messages_count": len(self.messages),
        }
```
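When `tiktoken` is unavailable or an exact count is not critical, the commonly cited rule of thumb of roughly 4 characters per token for English text is enough to drive the same trimming policy. A dependency-free sketch of that variant (the class name and heuristic are ours, not from any library):

```python
from typing import List, Dict


class ApproxTokenBufferMemory:
    """Buffer trimmed with a rough ~4-chars-per-token estimate
    instead of an exact tiktoken count."""

    CHARS_PER_TOKEN = 4  # rough average for English text

    def __init__(self, max_tokens: int = 2000):
        self.max_tokens = max_tokens
        self.messages: List[Dict] = []

    def _estimate_tokens(self) -> int:
        total_chars = sum(len(m["content"]) for m in self.messages)
        return total_chars // self.CHARS_PER_TOKEN

    def add_message(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})
        # Same policy as above: drop the oldest, always keep the last exchange
        while self._estimate_tokens() > self.max_tokens and len(self.messages) > 2:
            self.messages.pop(0)
```

The estimate errs on the generous side for code-heavy or non-English content, so leave headroom in `max_tokens` if you rely on it.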

Implementation with LangChain

LangChain provides ready-to-use implementations:

```python
from langchain.memory import ConversationBufferMemory, ConversationBufferWindowMemory
from langchain.chains import ConversationChain
from langchain_openai import ChatOpenAI

# Basic Buffer Memory
memory = ConversationBufferMemory(
    memory_key="history",
    return_messages=True
)

# With a sliding window (keeps the last k exchanges)
memory_window = ConversationBufferWindowMemory(
    k=5,  # Keep the last 5 exchanges
    memory_key="history",
    return_messages=True
)

# Integration with a conversational chain
llm = ChatOpenAI(model="gpt-4", temperature=0.7)
conversation = ConversationChain(
    llm=llm,
    memory=memory,
    verbose=True
)

# Usage
response = conversation.predict(input="I'm looking for a gaming laptop")
print(response)

response = conversation.predict(input="Budget 1500 euros max")
print(response)  # Understands the laptop context
```

Integration with ConversationalRetrievalChain

```python
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Qdrant

# Vector store configuration (assumes an existing qdrant_client)
vectorstore = Qdrant(
    client=qdrant_client,
    collection_name="products",
    embeddings=OpenAIEmbeddings()
)

# Memory for conversational RAG
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True,
    output_key="answer"
)

# RAG chain with memory
rag_chain = ConversationalRetrievalChain.from_llm(
    llm=ChatOpenAI(model="gpt-4"),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    memory=memory,
    return_source_documents=True
)

# Natural conversation
result = rag_chain({"question": "What is your return policy?"})
result2 = rag_chain({"question": "And for electronic products?"})
# Understands that "And for" refers to the return policy
```

Implementation with LlamaIndex

```python
from llama_index.core import VectorStoreIndex
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.core.chat_engine import CondensePlusContextChatEngine
from llama_index.llms.openai import OpenAI

# Create the index (assumes `documents` is already loaded)
index = VectorStoreIndex.from_documents(documents)

# Buffer memory with a token limit
memory = ChatMemoryBuffer.from_defaults(token_limit=3000)

# Chat engine with memory
chat_engine = CondensePlusContextChatEngine.from_defaults(
    retriever=index.as_retriever(similarity_top_k=4),
    memory=memory,
    llm=OpenAI(model="gpt-4"),
    context_prompt=(
        "Relevant context:\n{context_str}\n\n"
        "Answer using this context and the history."
    ),
    verbose=True
)

# Conversation
response1 = chat_engine.chat("What are your hours?")
response2 = chat_engine.chat("And on weekends?")  # Understands the context
```

Best Practices and Optimizations

1. Compressing Long Messages

```python
def compress_message(content: str, max_chars: int = 500) -> str:
    """Truncate messages that are too long"""
    if len(content) <= max_chars:
        return content

    # Keep the beginning and the end
    half = max_chars // 2
    return f"{content[:half]}... [truncated] ...{content[-half:]}"
```
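The behavior is easy to sanity-check: short messages pass through untouched, while long ones keep their head and tail around a `[truncated]` marker. A quick demonstration (the helper is reproduced so the snippet runs standalone):

```python
def compress_message(content: str, max_chars: int = 500) -> str:
    """Truncate messages that are too long (same helper as above)"""
    if len(content) <= max_chars:
        return content
    half = max_chars // 2
    return f"{content[:half]}... [truncated] ...{content[-half:]}"


short = compress_message("Hello", max_chars=20)
long = compress_message("A" * 1000, max_chars=20)

print(short)  # Hello (unchanged, under the limit)
print(long)   # 10 A's, the marker, then the last 10 A's
```

Keeping both ends matters for conversations: the opening of a message often states the topic, while the end often contains the actual question.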

2. Persistence and Storage

```python
import json
from pathlib import Path


class PersistentBufferMemory(ConversationBufferMemory):
    """Buffer with automatic saving"""

    def __init__(self, session_id: str, storage_dir: str = "./sessions"):
        super().__init__()
        self.session_id = session_id
        self.storage_path = Path(storage_dir) / f"{session_id}.json"
        self._load()

    def _load(self) -> None:
        if self.storage_path.exists():
            with open(self.storage_path, "r") as f:
                data = json.load(f)
            for msg in data.get("messages", []):
                self.messages.append(msg)

    def save(self) -> None:
        self.storage_path.parent.mkdir(parents=True, exist_ok=True)
        with open(self.storage_path, "w") as f:
            json.dump({"messages": list(self.messages)}, f)

    def add_message(self, role: str, content: str) -> None:
        super().add_message(role, content)
        self.save()
```

3. Monitoring and Metrics

```python
class MonitoredBufferMemory(TokenAwareBufferMemory):
    """Buffer with monitoring metrics"""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.stats = {
            "total_messages": 0,
            "messages_evicted": 0,
            "peak_tokens": 0
        }

    def add_message(self, role: str, content: str) -> None:
        count_before = len(self.messages)
        super().add_message(role, content)
        self.stats["total_messages"] += 1
        # Trimming inside the call above may have evicted old messages
        self.stats["messages_evicted"] += max(0, count_before + 1 - len(self.messages))
        self.stats["peak_tokens"] = max(
            self.stats["peak_tokens"], self._count_tokens()
        )
```
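The eviction accounting works with any trimming policy, not just token counts. A self-contained sketch using a plain message cap (no tiktoken needed; the class name is ours), showing how `messages_evicted` tracks drops:

```python
from typing import Dict, List


class MonitoredCapBuffer:
    """Message-capped buffer with the same eviction accounting
    as MonitoredBufferMemory, minus the tiktoken dependency."""

    def __init__(self, max_messages: int = 4):
        self.max_messages = max_messages
        self.messages: List[Dict] = []
        self.stats = {"total_messages": 0, "messages_evicted": 0}

    def add_message(self, role: str, content: str) -> None:
        count_before = len(self.messages)
        self.messages.append({"role": role, "content": content})
        while len(self.messages) > self.max_messages:
            self.messages.pop(0)
        self.stats["total_messages"] += 1
        self.stats["messages_evicted"] += max(0, count_before + 1 - len(self.messages))


buffer = MonitoredCapBuffer(max_messages=2)
for i in range(5):
    buffer.add_message("user", f"msg {i}")

print(buffer.stats)  # {'total_messages': 5, 'messages_evicted': 3}
```

A steadily climbing `messages_evicted` is the signal from the table below that it may be time to switch to summary memory.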

When to Use Buffer Memory?

Ideal Use Cases

  • Short conversations: Less than 10 exchanges
  • Simple follow-up questions: "And for accessories?", "How much does it cost?"
  • Level 1 customer support: FAQ with recent context
  • FAQ chatbots: Independent questions with little follow-up

When to Switch

| Signal | Solution |
|---|---|
| Conversations > 15 turns | Summary Memory |
| Need to retain names, products | Entity Memory |
| Very tight token budget | Token Buffer with a low limit |
| Persistent multi-session | Database + Summary |

Buffer Memory with Ailog

Implementing and optimizing Buffer Memory takes time. With Ailog, get turnkey conversational memory management:

  • Optimized Buffer Memory with automatic token management
  • Adaptive window based on conversation complexity
  • Automatic persistence with session resumption
  • Analytics on memory usage and conversation quality
  • Zero configuration: it works out of the box

Try Ailog for free and deploy a chatbot with memory in 3 minutes.

Tags

RAG · memory · conversation · buffer · history · context · LangChain · LlamaIndex
