Buffer Memory: Simple Conversation History
Complete guide to implementing buffer memory in a conversational RAG system: keeping context of recent exchanges for coherent responses.
Buffer memory is the simplest and most intuitive form of conversational memory. It keeps the last N messages of the conversation in a buffer and injects them directly into the LLM context. Simple but effective for short to medium conversations, it's often the first choice for implementing a conversational RAG chatbot.
What is Buffer Memory?
How It Works
Buffer Memory works like a FIFO (First In, First Out) queue for messages:
```
┌─────────────────────────────────────────────────────────────┐
│                       BUFFER MEMORY                         │
├─────────────────────────────────────────────────────────────┤
│ Message 1: [User] Hello, I'm looking for a computer         │
│ Message 2: [AI] Hello! What will you use it for?            │
│ Message 3: [User] Mainly gaming                             │
│ Message 4: [AI] I recommend a config with RTX...            │
│ Message 5: [User] What about the budget?                    │
│ Message 6: [AI] Count between 1200 and 2000 euros...        │
│ ...                                                         │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
               ┌─────────────────────────┐
               │   Injection into the    │
               │       LLM prompt        │
               └─────────────────────────┘
```
When the buffer reaches its limit (in number of messages or tokens), the oldest messages are automatically removed to make room for new ones.
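This FIFO eviction comes for free in Python with `collections.deque` and its `maxlen` parameter, a minimal sketch of the behavior described above:

```python
from collections import deque

# A deque with maxlen automatically evicts the oldest entry
# when a new one is appended: exactly the FIFO behavior above.
buffer = deque(maxlen=4)
for i in range(1, 7):
    buffer.append(f"Message {i}")

print(list(buffer))
# → ['Message 3', 'Message 4', 'Message 5', 'Message 6']
```

Messages 1 and 2 are silently dropped once the fifth and sixth messages arrive, with no bookkeeping code on our side.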
Memory Approaches Comparison
| Approach | Advantages | Disadvantages | Use Case |
|---|---|---|---|
| No memory | Very simple, no cost | Loses all context | Static FAQ |
| Buffer memory | Exact context, easy to debug | Limited by token budget | Short conversations |
| Summary memory | Long memory, compact | Loss of details, complex | Long conversations |
| Entity memory | Retains key entities | More complex | Specific object tracking |
Key statistic: A Buffer Memory of 10 conversation turns consumes on average 2000-3000 tokens, about 15-20% of GPT-4's context window. For 80% of customer support use cases, 6-8 turns are sufficient.
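You can sanity-check this order of magnitude with the common rule of thumb of roughly 4 characters per token for English text (use `tiktoken` for exact counts; `estimate_tokens` below is an illustrative helper, not a library function):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # For exact counts, use tiktoken in production.
    return max(1, len(text) // 4)

# 10 conversation turns = 20 messages (user + assistant),
# each around 600 characters (~100 words).
turns = ["x" * 600] * 20
total = sum(estimate_tokens(m) for m in turns)
print(total)  # → 3000
```

The estimate lands right in the 2000-3000 token range quoted above, which is why trimming the buffer matters.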
Basic Python Implementation
Buffer Memory from Scratch
```python
from collections import deque
from typing import List, Dict
from datetime import datetime


class ConversationBufferMemory:
    """
    Simple buffer memory for conversations.
    Keeps the last N messages in a deque.
    """

    def __init__(self, max_messages: int = 10):
        self.messages: deque = deque(maxlen=max_messages)
        self.created_at = datetime.now()

    def add_message(self, role: str, content: str) -> None:
        """Add a message to the buffer"""
        self.messages.append({
            "role": role,
            "content": content,
            "timestamp": datetime.now().isoformat()
        })

    def get_history(self) -> List[Dict]:
        """Return all messages"""
        return list(self.messages)

    def get_context_string(self) -> str:
        """Format history for prompt injection"""
        return "\n".join([
            f"{m['role'].capitalize()}: {m['content']}"
            for m in self.messages
        ])

    def clear(self) -> None:
        """Clear the buffer"""
        self.messages.clear()

    def __len__(self) -> int:
        return len(self.messages)
```
Integration with a RAG Pipeline
```python
class ConversationalRAG:
    """
    Complete RAG pipeline with buffer memory
    """

    def __init__(self, vector_store, llm, max_history: int = 6):
        self.vector_store = vector_store
        self.llm = llm
        self.memory = ConversationBufferMemory(max_messages=max_history)

    def query(self, user_message: str) -> str:
        """Execute a RAG query with conversational context"""
        # 1. Add the user message
        self.memory.add_message("user", user_message)

        # 2. Retrieve relevant documents
        docs = self.vector_store.similarity_search(user_message, k=3)
        context = "\n\n".join([
            f"Document {i+1}:\n{d.page_content}"
            for i, d in enumerate(docs)
        ])

        # 3. Build the prompt with history
        history = self.memory.get_context_string()
        prompt = f"""You are a helpful assistant.
Use the provided documents and conversation history to answer.

Conversation history:
{history}

Relevant documents:
{context}

Current question: {user_message}

Answer precisely, taking into account the conversation context."""

        # 4. Generate the response
        response = self.llm.invoke(prompt)

        # 5. Save the response
        self.memory.add_message("assistant", response)
        return response

    def reset(self) -> None:
        """Reset the conversation"""
        self.memory.clear()
```
Intelligent Token Management
The main problem with Buffer Memory is token consumption. Here's an optimized version:
```python
from typing import Dict, List

import tiktoken


class TokenAwareBufferMemory:
    """
    Buffer memory with token management.
    Removes old messages when the limit is reached.
    """

    def __init__(self, max_tokens: int = 2000, model: str = "gpt-4o"):
        self.max_tokens = max_tokens
        self.encoding = tiktoken.encoding_for_model(model)
        self.messages: List[Dict] = []

    def add_message(self, role: str, content: str) -> None:
        """Add a message and trim if necessary"""
        self.messages.append({"role": role, "content": content})
        self._trim_to_token_limit()

    def _count_tokens(self) -> int:
        """Count the total number of tokens"""
        text = " ".join([m["content"] for m in self.messages])
        return len(self.encoding.encode(text))

    def _trim_to_token_limit(self) -> None:
        """Remove the oldest messages if the limit is exceeded"""
        while self._count_tokens() > self.max_tokens and len(self.messages) > 2:
            # Always keep at least the last exchange
            self.messages.pop(0)

    def get_messages_for_api(self) -> List[Dict]:
        """Return messages in OpenAI API format"""
        return [
            {"role": m["role"], "content": m["content"]}
            for m in self.messages
        ]

    @property
    def token_usage(self) -> dict:
        """Usage statistics"""
        current = self._count_tokens()
        return {
            "current_tokens": current,
            "max_tokens": self.max_tokens,
            "utilization": current / self.max_tokens,
            "messages_count": len(self.messages)
        }
```
Implementation with LangChain
LangChain provides ready-to-use implementations:
```python
from langchain.memory import (
    ConversationBufferMemory,
    ConversationBufferWindowMemory,
)
from langchain.chains import ConversationChain
from langchain_openai import ChatOpenAI

# Basic buffer memory
memory = ConversationBufferMemory(
    memory_key="history",
    return_messages=True
)

# With a sliding window (keeps the last k exchanges)
memory_window = ConversationBufferWindowMemory(
    k=5,  # Keep the last 5 exchanges
    memory_key="history",
    return_messages=True
)

# Integration with a conversational chain
llm = ChatOpenAI(model="gpt-4", temperature=0.7)
conversation = ConversationChain(
    llm=llm,
    memory=memory,
    verbose=True
)

# Usage
response = conversation.predict(input="I'm looking for a gaming laptop")
print(response)

response = conversation.predict(input="Budget 1500 euros max")
print(response)  # Understands the laptop context
```
Integration with ConversationalRetrievalChain
```python
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain_community.vectorstores import Qdrant
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Vector store configuration
# (qdrant_client is an already connected QdrantClient instance)
vectorstore = Qdrant(
    client=qdrant_client,
    collection_name="products",
    embeddings=OpenAIEmbeddings()
)

# Memory for conversational RAG
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True,
    output_key="answer"
)

# RAG chain with memory
rag_chain = ConversationalRetrievalChain.from_llm(
    llm=ChatOpenAI(model="gpt-4"),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    memory=memory,
    return_source_documents=True
)

# Natural conversation
result = rag_chain({"question": "What is your return policy?"})
result2 = rag_chain({"question": "And for electronic products?"})
# Understands that "And for" refers to the return policy
```
Implementation with LlamaIndex
```python
from llama_index.core import VectorStoreIndex
from llama_index.core.chat_engine import CondensePlusContextChatEngine
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.llms.openai import OpenAI

# Create the index (documents loaded beforehand)
index = VectorStoreIndex.from_documents(documents)

# Buffer memory with a token limit
memory = ChatMemoryBuffer.from_defaults(
    token_limit=3000
)

# Chat engine with memory
chat_engine = CondensePlusContextChatEngine.from_defaults(
    retriever=index.as_retriever(similarity_top_k=4),
    memory=memory,
    llm=OpenAI(model="gpt-4"),
    context_prompt=(
        "Relevant context:\n{context_str}\n\n"
        "Answer using this context and history."
    ),
    verbose=True
)

# Conversation
response1 = chat_engine.chat("What are your hours?")
response2 = chat_engine.chat("And on weekends?")  # Understands context
```
Best Practices and Optimizations
1. Compressing Long Messages
```python
def compress_message(content: str, max_chars: int = 500) -> str:
    """Truncate messages that are too long"""
    if len(content) <= max_chars:
        return content
    # Keep the beginning and the end
    half = max_chars // 2
    return f"{content[:half]}... [truncated] ...{content[-half:]}"
```
2. Persistence and Storage
```python
import json
from pathlib import Path


class PersistentBufferMemory(ConversationBufferMemory):
    """Buffer with automatic saving (extends the class defined earlier)"""

    def __init__(self, session_id: str, storage_dir: str = "./sessions"):
        super().__init__()
        self.session_id = session_id
        self.storage_path = Path(storage_dir) / f"{session_id}.json"
        self._load()

    def _load(self) -> None:
        if self.storage_path.exists():
            with open(self.storage_path, "r") as f:
                data = json.load(f)
            for msg in data.get("messages", []):
                self.messages.append(msg)

    def save(self) -> None:
        self.storage_path.parent.mkdir(parents=True, exist_ok=True)
        with open(self.storage_path, "w") as f:
            json.dump({"messages": list(self.messages)}, f)

    def add_message(self, role: str, content: str) -> None:
        super().add_message(role, content)
        self.save()
```
3. Monitoring and Metrics
```python
class MonitoredBufferMemory(TokenAwareBufferMemory):
    """Buffer with monitoring metrics"""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.stats = {
            "total_messages": 0,
            "messages_evicted": 0,
            "peak_tokens": 0
        }

    def add_message(self, role: str, content: str) -> None:
        count_before = len(self.messages)
        super().add_message(role, content)
        self.stats["total_messages"] += 1
        self.stats["messages_evicted"] += max(
            0, count_before + 1 - len(self.messages)
        )
        self.stats["peak_tokens"] = max(
            self.stats["peak_tokens"],
            self._count_tokens()
        )
```
When to Use Buffer Memory?
Ideal Use Cases
- Short conversations: Less than 10 exchanges
- Simple follow-up questions: "And for accessories?", "How much does it cost?"
- Level 1 customer support: FAQ with recent context
- FAQ chatbots: Independent questions with little follow-up
When to Switch
| Signal | Solution |
|---|---|
| Conversations > 15 turns | Summary Memory |
| Need to retain names, products | Entity Memory |
| Very tight token budget | Token Buffer with low limit |
| Persistent multi-sessions | Database + Summary |
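The signals in the table can be turned into a simple routing rule. The sketch below is a hypothetical helper (the function name and thresholds are illustrative, not from any library), mapping conversation characteristics to a memory strategy:

```python
def choose_memory_strategy(turns: int, needs_entities: bool,
                           token_budget: int) -> str:
    """Map the signals from the table above to a memory strategy.

    Thresholds are illustrative defaults; tune them per use case.
    """
    if needs_entities:
        return "entity_memory"      # must retain names, products, etc.
    if turns > 15:
        return "summary_memory"     # buffer would overflow the context
    if token_budget < 1000:
        return "token_buffer"       # tight budget: trim aggressively
    return "buffer_memory"          # short conversation: keep it simple

print(choose_memory_strategy(turns=20, needs_entities=False, token_budget=4000))
# → summary_memory
```

In practice you would re-evaluate this choice as the conversation grows, starting with buffer memory and switching to summarization once the turn count passes the threshold.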
Learn More
- Summary Memory - For long conversations
- Entity Memory - For retaining entities
- Conversational RAG - Overview
Buffer Memory with Ailog
Implementing and optimizing Buffer Memory takes time. With Ailog, get turnkey conversational memory management:
- Optimized Buffer Memory with automatic token management
- Adaptive window based on conversation complexity
- Automatic persistence with session resumption
- Analytics on memory usage and conversation quality
- Zero configuration: it works out of the box
Try Ailog for free and deploy a chatbot with memory in 3 minutes.
Related Posts
Conversational RAG: Memory and Multi-Session Context
Implement RAG with conversational memory: context management, multi-session history, and personalized responses.
RAG Generation: Choosing and Optimizing Your LLM
Complete guide to selecting and configuring your LLM in a RAG system: prompting, temperature, tokens, and response optimization.
RAG Agents: Orchestrating Multi-Agent Systems
Architect multi-agent RAG systems: orchestration, specialization, collaboration and failure handling for complex assistants.