Fixed-Size Chunking: Fast and Reliable
Master the basics: implement fixed-size chunking with overlaps for consistent, predictable RAG performance.
- Author: Ailog Research Team
- Reading time: 7 min read
- Level: beginner
- RAG Pipeline Step: Chunking
Why Fixed-Size?
Pros:
- ✅ Simple to implement
- ✅ Predictable chunk count
- ✅ Fast (no AI needed)
- ✅ Works for any content

Cons:
- ❌ Breaks sentences
- ❌ Ignores semantics
Basic Implementation
```python
def fixed_chunk(text, chunk_size=500, overlap=50):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)
        start += chunk_size - overlap  # Move forward with overlap
    return chunks
```
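A quick sanity check, with sizes shrunk so the overlap is easy to see (the sample string is just for illustration):

```python
text = "RAG pipelines split documents into chunks before embedding them."

for chunk in fixed_chunk(text, chunk_size=20, overlap=5):
    print(repr(chunk))
# start advances by chunk_size - overlap = 15, so consecutive
# full-length chunks share 5 characters of overlap
```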
With Sentence Boundaries
Better: split on sentence boundaries so chunks never break mid-sentence:
```python
import re

def chunk_by_sentences(text, chunk_size=500):
    # Split into sentences
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks = []
    current_chunk = []
    current_size = 0
    for sentence in sentences:
        sentence_size = len(sentence)
        if current_size + sentence_size > chunk_size and current_chunk:
            # Save the current chunk
            chunks.append(' '.join(current_chunk))
            # Start the new chunk with overlap (carry over the last two sentences)
            overlap_sentences = current_chunk[-2:] if len(current_chunk) > 1 else current_chunk
            current_chunk = overlap_sentences + [sentence]
            current_size = sum(len(s) for s in current_chunk)
        else:
            current_chunk.append(sentence)
            current_size += sentence_size
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    return chunks
```
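A made-up example shows the difference: every chunk now ends at a sentence boundary, and the last two sentences of a chunk carry over into the next as overlap:

```python
doc = (
    "Fixed-size chunking is fast. It can cut sentences in half. "
    "Sentence-aware chunking avoids that. Each chunk ends cleanly."
)

for chunk in chunk_by_sentences(doc, chunk_size=100):
    print(repr(chunk))
# Two chunks, both ending at a sentence boundary; the second repeats
# the last two sentences of the first as overlap
```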
LangChain Implementation
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks = splitter.split_text(long_text)
```
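It's worth eyeballing what the splitter actually returns. A quick check (the file path is a stand-in for your own data; on newer LangChain versions the import is `from langchain_text_splitters import RecursiveCharacterTextSplitter`):

```python
long_text = open("docs/guide.txt").read()  # stand-in for your own document

chunks = splitter.split_text(long_text)

print(f"{len(chunks)} chunks")
print(f"longest chunk: {max(len(c) for c in chunks)} characters")
print(chunks[0][:200])  # eyeball the first chunk
```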
Choosing Chunk Size
Small chunks (200-300 characters):
- More precise retrieval
- But less context

Medium chunks (500-800 characters):
- Balanced (recommended)

Large chunks (1000+ characters):
- More context
- But noisy retrieval
Test on your data!
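One practical way to do that: sweep a few sizes over a sample of your corpus and compare chunk counts and lengths before wiring anything into retrieval. A rough sketch reusing `fixed_chunk` from above (the file path is a placeholder):

```python
corpus = open("my_corpus.txt").read()  # placeholder: a sample of your own data

for size in (250, 500, 1000):
    chunks = fixed_chunk(corpus, chunk_size=size, overlap=size // 10)
    avg = sum(len(c) for c in chunks) / len(chunks)
    print(f"chunk_size={size}: {len(chunks)} chunks, avg {avg:.0f} chars")
# Then spot-check retrieval quality at each size with a few real queries
```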
Fixed-size is battle-tested. Start here, optimize later if needed.