Fixed-Size Chunking: Fast and Reliable
Master the basics: implement fixed-size chunking with overlaps for consistent, predictable RAG performance.
Why Fixed-Size?
Pros:
- ✅ Simple to implement
- ✅ Predictable chunk count
- ✅ Fast (no model or embedding calls needed to split)
- ✅ Works on any text content
Cons:
- ❌ Can cut sentences and words mid-way
- ❌ Ignores semantic boundaries such as topics and paragraphs
Basic Implementation
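```python
def fixed_chunk(text, chunk_size=500, overlap=50):
    # Sizes are measured in characters, not tokens
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # Last window reached the end; avoid a redundant tail chunk
        start += chunk_size - overlap  # Move forward, keeping `overlap` characters
    return chunks
```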
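To make the "predictable chunk count" claim concrete, here is a minimal sketch of the window arithmetic. It assumes character-based windows that stop once a window reaches the end of the text; `window_starts` and `expected_chunk_count` are hypothetical helper names used only for illustration.

```python
import math

def window_starts(text_len, chunk_size=500, overlap=50):
    """Start offset of every window (hypothetical helper for illustration)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - overlap
    starts = []
    start = 0
    while start < text_len:
        starts.append(start)
        if start + chunk_size >= text_len:
            break  # this window already reaches the end of the text
        start += stride
    return starts

def expected_chunk_count(text_len, chunk_size=500, overlap=50):
    """Closed-form chunk count that matches window_starts."""
    if text_len <= 0:
        return 0
    stride = chunk_size - overlap
    return 1 + max(0, math.ceil((text_len - chunk_size) / stride))

# Non-repeating sample text so the overlap check below is meaningful
text = "".join(chr(97 + i % 26) for i in range(1400))
starts = window_starts(len(text))
chunks = [text[s:s + 500] for s in starts]

print(starts)                           # [0, 450, 900]
print(expected_chunk_count(len(text)))  # 3
print(chunks[0][-50:] == chunks[1][:50])  # True: neighbors share 50 characters
```

Each window advances by `chunk_size - overlap` characters, so the chunk count depends only on the text length and the two parameters, never on the content.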
With Sentence Boundaries
A better approach avoids breaking mid-sentence by packing whole sentences into each chunk:
```python
import re

def chunk_by_sentences(text, chunk_size=500, overlap=50):
    # Sizes are in characters. Split after ., ! or ? followed by whitespace
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

    chunks = []
    current_chunk = []
    current_size = 0

    for sentence in sentences:
        sentence_size = len(sentence)
        if current_size + sentence_size > chunk_size and current_chunk:
            # Save current chunk
            chunks.append(' '.join(current_chunk))
            # Start new chunk: carry over trailing sentences that fit
            # within `overlap` characters (possibly none)
            overlap_sentences = []
            overlap_size = 0
            for prev in reversed(current_chunk):
                if overlap_size + len(prev) > overlap:
                    break
                overlap_sentences.insert(0, prev)
                overlap_size += len(prev)
            current_chunk = overlap_sentences + [sentence]
            current_size = overlap_size + sentence_size
        else:
            current_chunk.append(sentence)
            current_size += sentence_size

    if current_chunk:
        chunks.append(' '.join(current_chunk))
    return chunks
```
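The sentence-boundary pattern used above is a heuristic, and it helps to see where it slips. The short sketch below runs the same regular expression on a made-up sample sentence; abbreviations such as "Dr." and "a.m." are treated as sentence ends.

```python
import re

# Same pattern as the splitter above: split on whitespace that follows ., ! or ?
SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+')

sample = "Dr. Smith arrived at 9 a.m. sharp. Was the demo ready? Yes!"
sentences = SENTENCE_BOUNDARY.split(sample)

for s in sentences:
    print(repr(s))
# 'Dr.'
# 'Smith arrived at 9 a.m.'
# 'sharp.'
# 'Was the demo ready?'
# 'Yes!'
```

For chunking this is usually harmless, because the fragments are re-joined into the same chunk most of the time. If your text is heavy on abbreviations or decimal-free numbering such as "Fig. 3", a dedicated sentence tokenizer may be worth the extra dependency.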
LangChain Implementation
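```python
# Requires the langchain-text-splitters package; older LangChain releases
# expose the same class from langchain.text_splitter
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # Maximum characters per chunk
    chunk_overlap=50,    # Characters shared between neighboring chunks
    separators=["\n\n", "\n", ". ", " ", ""],  # Tried in order, coarsest first
)

# long_text is the document string you want to split
chunks = splitter.split_text(long_text)
```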
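The `separators` list is tried in order: paragraphs first, then lines, then sentences, then words, and finally raw characters. The sketch below is a simplified, dependency-free illustration of that idea and is not LangChain's actual implementation. `recursive_split` is a hypothetical name, it applies no overlap, and it drops the separator at chunk boundaries.

```python
def recursive_split(text, chunk_size=500, separators=("\n\n", "\n", ". ", " ", "")):
    """Simplified illustration: split on the coarsest separator that works."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # No separators left: fall back to hard character cuts
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits in the chunk being built
            continue
        if current:
            chunks.append(current)
        if len(piece) > chunk_size:
            # Piece is too large on its own: retry with finer separators
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks

demo = "alpha beta gamma\n\ndelta epsilon"
print(recursive_split(demo, chunk_size=12))
# ['alpha beta', 'gamma', 'delta', 'epsilon']
```

Note that `chunk_size` acts as an upper bound here rather than an exact length, so chunks vary in size while never exceeding the limit.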
Choosing Chunk Size
Small chunks (200-300):
- More precise retrieval
- Less context per chunk
Medium chunks (500-800):
- A balance of precision and context (recommended starting point)
Large chunks (1000+):
- More context per chunk
- Noisier retrieval
Test on your data!
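A first step toward testing is simply to look at how each candidate size carves up your text. The sketch below is an assumption-laden starting point, not a retrieval benchmark: `fixed_windows` and `chunk_stats` are hypothetical helpers, the 10% overlap ratio mirrors the 50-of-500 default used earlier, and the repeated sample sentence stands in for your own documents.

```python
def fixed_windows(text, chunk_size, overlap):
    """Character windows of `chunk_size` that share `overlap` characters."""
    stride = chunk_size - overlap
    stop = max(len(text) - overlap, 1)  # no window starts inside the final overlap
    return [text[i:i + chunk_size] for i in range(0, stop, stride) if text]

def chunk_stats(text, chunk_size, overlap_ratio=0.1):
    """Count chunks and how many of them end in the middle of a sentence."""
    overlap = int(chunk_size * overlap_ratio)
    chunks = fixed_windows(text, chunk_size, overlap)
    cut_mid_sentence = sum(
        1 for c in chunks if not c.rstrip().endswith((".", "!", "?"))
    )
    return {
        "chunk_size": chunk_size,
        "chunks": len(chunks),
        "cut_mid_sentence": cut_mid_sentence,
    }

# Placeholder text; swap in a representative document from your corpus
sample = "Fixed-size chunking splits text into equal windows. " * 40
report = [chunk_stats(sample, size) for size in (250, 500, 1000)]
for row in report:
    print(row)
```

Numbers like these only describe the split itself. To pick a size, pair them with a small set of real queries and check which setting returns the passages you expect.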
Fixed-size is battle-tested. Start here, optimize later if needed.