Chunking Strategies: Optimizing Document Segmentation
Master document chunking techniques to improve retrieval quality. Learn about chunk sizes, overlaps, semantic splitting, and advanced strategies.
TL;DR
- Chunk size matters: 500-1000 tokens typically balance context and precision
- Semantic chunking (splitting by meaning) beats fixed-size chunking on quality (+15-30% retrieval accuracy)
- Overlap (10-20%) prevents losing context at chunk boundaries
- Best default for most use cases: recursive text splitter with 512-token chunks and 50-token overlap
- Try it now: Test different strategies on Ailog's platform
The Chunking Problem
Most documents are too long to:
- Embed as a single vector (context window limits)
- Use entirely as LLM context (token limits)
- Retrieve with precision (too much irrelevant information)
Chunking splits documents into smaller, manageable pieces while preserving semantic meaning.
Why Chunking Matters
Poor chunking leads to:
- Split context: Important information broken across chunks
- Irrelevant retrieval: Chunks contain a mix of relevant and irrelevant content
- Lost context: Chunk boundaries cut off critical information
- Poor generation: LLM lacks complete context to answer accurately
Good chunking enables:
- Precise retrieval: Find exactly the relevant information
- Complete context: Chunks contain full thoughts or concepts
- Efficient token usage: No wasted context on irrelevant text
- Better answers: LLM has what it needs, nothing more
Fixed-Size Chunking
Character-Based
Split text every N characters.
```python
def chunk_by_chars(text, chunk_size=1000, overlap=200):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap
    return chunks
```
Pros:
- Simple implementation
- Predictable chunk sizes
- Fast processing
Cons:
- Splits mid-word, mid-sentence
- Ignores semantic boundaries
- Breaks code, tables, lists
Use when:
- Quick prototype needed
- Text structure is homogeneous
- Precision not critical
Token-Based
Split by token count (matches model tokenization).
```python
import tiktoken

def chunk_by_tokens(text, chunk_size=512, overlap=50):
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunk_tokens = tokens[start:end]
        chunks.append(encoding.decode(chunk_tokens))
        start += chunk_size - overlap
    return chunks
```
Pros:
- Respects token limits precisely
- Works with any embedding model
- Predictable embedding costs
Cons:
- Still ignores semantic boundaries
- Tokenization overhead
- May split important context
Use when:
- Strict token budget
- Token count is critical (API costs)
- Embedding model has hard token limits
Recommended Fixed Sizes
| Use Case | Chunk Size | Overlap (tokens) | Rationale |
|---|---|---|---|
| Short FAQ | 128-256 tokens | 0-20 | Minimal context needed |
| General docs | 512-1024 tokens | 50-100 | Balance precision and context |
| Technical docs | 1024-2048 tokens | 100-200 | More context for complex topics |
| Code | 256-512 tokens | 50-100 | Preserve function/class context |
Semantic Chunking
Split at natural semantic boundaries.
Sentence-Based
Split at sentence boundaries.
```python
import nltk

nltk.download('punkt')

def chunk_by_sentences(text, sentences_per_chunk=5):
    sentences = nltk.sent_tokenize(text)
    chunks = []
    for i in range(0, len(sentences), sentences_per_chunk):
        chunk = ' '.join(sentences[i:i + sentences_per_chunk])
        chunks.append(chunk)
    return chunks
```
Pros:
- Respects sentence boundaries
- More readable chunks
- Preserves complete thoughts
Cons:
- Variable chunk sizes
- Sentence detection can fail
- May not group related sentences
Use when:
- Readability is important
- Sentences are self-contained
- General narrative text
Paragraph-Based
Split at paragraph breaks.
```python
def chunk_by_paragraphs(text, paragraphs_per_chunk=2):
    paragraphs = text.split('\n\n')
    chunks = []
    for i in range(0, len(paragraphs), paragraphs_per_chunk):
        chunk = '\n\n'.join(paragraphs[i:i + paragraphs_per_chunk])
        chunks.append(chunk)
    return chunks
```
Pros:
- Respects document structure
- Keeps related content together
- Natural reading units
Cons:
- Highly variable sizes
- Depends on formatting
- Long paragraphs still problematic
Use when:
- Well-formatted documents
- Paragraphs represent complete ideas
- Blog posts, articles
Recursive Character Splitting
LangChain's approach: try separators in order of preference.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_text(text)
```
Split hierarchy:
- Double newline (paragraphs)
- Single newline (lines)
- Period + space (sentences)
- Space (words)
- Character
Pros:
- Respects document structure when possible
- Falls back gracefully
- Balances semantics and size
Cons:
- Still somewhat arbitrary
- May not capture true semantic units
- Configuration required
Use when:
- General-purpose chunking
- Mixed document types
- Good default choice
Metadata-Aware Chunking
Use document structure to inform chunking.
Markdown Chunking
Split by headers, preserving hierarchy.
```python
def chunk_markdown(text):
    chunks = []
    current_h1 = ""
    current_h2 = ""
    current_chunk = []
    for line in text.split('\n'):
        if line.startswith('# '):
            # New top-level section: flush the chunk collected so far
            if current_chunk:
                chunks.append({
                    'content': '\n'.join(current_chunk),
                    'h1': current_h1,
                    'h2': current_h2
                })
                current_chunk = []
            current_h1 = line[2:]
            current_h2 = ""  # reset subsection when a new h1 starts
        elif line.startswith('## '):
            # New subsection: flush the chunk collected so far
            if current_chunk:
                chunks.append({
                    'content': '\n'.join(current_chunk),
                    'h1': current_h1,
                    'h2': current_h2
                })
                current_chunk = []
            current_h2 = line[3:]
        current_chunk.append(line)
    # Flush the final chunk
    if current_chunk:
        chunks.append({
            'content': '\n'.join(current_chunk),
            'h1': current_h1,
            'h2': current_h2
        })
    return chunks
```
Metadata benefits (a sketch follows this list):
- Headers provide context for search
- Can filter by section
- Better relevance scoring
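To make those benefits concrete, here is a minimal sketch (the helper names are hypothetical, and embed() stands in for whatever embedding function you use) that prepends the captured headers to each chunk before embedding and filters candidates by section at query time:

```python
def prepare_markdown_chunks(chunks, embed):
    # chunks come from chunk_markdown() above; embed() is a placeholder
    records = []
    for chunk in chunks:
        # Prepend the header path so the embedding carries the section context
        text = f"{chunk['h1']} > {chunk['h2']}\n{chunk['content']}"
        records.append({
            'embedding': embed(text),
            'h1': chunk['h1'],
            'h2': chunk['h2'],
            'content': chunk['content']
        })
    return records

def filter_by_section(records, h1):
    # Metadata filtering: restrict the search space to a single top-level section
    return [r for r in records if r['h1'] == h1]
```

Embedding the header path alongside the body is also what improves relevance scoring: a query that mentions a section name now matches that section's chunks more strongly.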
HTML/XML Chunking
Split by semantic HTML tags.
```python
from bs4 import BeautifulSoup

def chunk_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    chunks = []
    # Split by semantic containers
    for section in soup.find_all(['section', 'article', 'div']):
        # get('class') returns a list of class names (or None)
        classes = section.get('class') or []
        if 'content' in classes or 'main' in classes:
            chunks.append({
                'content': section.get_text(),
                'tag': section.name,
                'class': classes
            })
    return chunks
```
Code Chunking
Split by function/class boundaries.
```python
import ast

def chunk_python_code(code):
    tree = ast.parse(code)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based; slice out the source of this node
            chunk_lines = code.split('\n')[node.lineno - 1:node.end_lineno]
            chunks.append({
                'content': '\n'.join(chunk_lines),
                'type': type(node).__name__,
                'name': node.name
            })
    return chunks
```
Pros:
- Preserves logical units (functions, classes)
- Metadata aids discovery
- Natural code boundaries
Cons:
- Language-specific parsing
- Complex implementation
- May miss cross-function context
Advanced Chunking Techniques
Semantic Similarity-Based
Group sentences by semantic similarity.
```python
import nltk
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

def semantic_chunking(text, model, max_chunk_size=512):
    # max_chunk_size is not enforced in this simplified version
    sentences = nltk.sent_tokenize(text)
    embeddings = model.encode(sentences)
    # Cluster semantically similar sentences
    clustering = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=0.5
    )
    labels = clustering.fit_predict(embeddings)
    # Group sentences by cluster (original order is preserved within each cluster)
    chunks = {}
    for sent, label in zip(sentences, labels):
        chunks.setdefault(label, []).append(sent)
    return [' '.join(sents) for sents in chunks.values()]
```
Pros:
- Truly semantic grouping
- Handles topic shifts
- Optimal information density
Cons:
- Computationally expensive
- Requires embedding model
- Complex to tune
Sliding Window with Contextual Overlap
Add surrounding context to each chunk.
```python
import tiktoken

# Any tokenizer that matches your embedding model works; cl100k_base is illustrative
encoding = tiktoken.get_encoding("cl100k_base")

def sliding_window_chunk(text, window_size=512, context_size=128):
    tokens = encoding.encode(text)
    chunks = []
    for i in range(0, len(tokens), window_size):
        # Main window plus surrounding context on both sides
        start = max(0, i - context_size)
        end = min(len(tokens), i + window_size + context_size)
        chunk = {
            'content': encoding.decode(tokens[i:i + window_size]),
            'context': encoding.decode(tokens[start:end]),
            'position': i
        }
        chunks.append(chunk)
    return chunks
```
Pros:
- Each chunk has surrounding context
- Reduces information loss
- Better for cross-boundary queries
Cons:
- Larger storage requirements
- More embeddings needed
- Potential redundancy
Hybrid Hierarchical Chunking
Chunk at multiple granularities.
```python
def hierarchical_chunk(document):
    # embed() and split_by_headers() are placeholders for your own
    # embedding function and header-based splitter

    # Level 1: Document
    doc_embedding = embed(document['content'])

    # Level 2: Sections
    sections = split_by_headers(document['content'])
    section_embeddings = [embed(s) for s in sections]

    # Level 3: Paragraphs
    paragraph_chunks = []
    for section in sections:
        paragraphs = section.split('\n\n')
        paragraph_chunks.extend([
            {'content': p, 'section': section} for p in paragraphs
        ])
    para_embeddings = [embed(p['content']) for p in paragraph_chunks]

    return {
        'document': {'embedding': doc_embedding, 'content': document['content']},
        'sections': [
            {'embedding': e, 'content': s}
            for e, s in zip(section_embeddings, sections)
        ],
        'paragraphs': [
            {'embedding': e, **p}
            for e, p in zip(para_embeddings, paragraph_chunks)
        ]
    }
```
Retrieval strategy (sketched after this list):
- Search at document level
- If match found, search within sections
- Finally retrieve specific paragraphs
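One way to wire up that coarse-to-fine flow over the index built by hierarchical_chunk above is sketched below. The helper names and cutoffs (top_sections, top_paragraphs) are illustrative, not a fixed API, and the document-level step is omitted since the index covers a single document:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def hierarchical_retrieve(index, query_embedding, top_sections=2, top_paragraphs=3):
    # Step 1: rank sections by similarity to the query
    ranked_sections = sorted(
        index['sections'],
        key=lambda s: cosine(query_embedding, s['embedding']),
        reverse=True
    )[:top_sections]
    selected = {s['content'] for s in ranked_sections}

    # Step 2: rank paragraphs, restricted to the selected sections
    candidates = [p for p in index['paragraphs'] if p['section'] in selected]
    candidates.sort(key=lambda p: cosine(query_embedding, p['embedding']), reverse=True)
    return candidates[:top_paragraphs]
```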
Pros:
- Multiple levels of granularity
- Coarse-to-fine retrieval
- Better context preservation
Cons:
- Complex implementation
- More storage needed
- Slower indexing
Chunk Overlap
Why Overlap?
Without overlap:
Chunk 1: "...the database schema includes user tables"
Chunk 2: "with columns for email and password..."
Query: "database user email" might miss both chunks
With overlap:
Chunk 1: "...the database schema includes user tables with columns for..."
Chunk 2: "...user tables with columns for email and password..."
Now "user tables with columns" appears in both, improving recall.
Optimal Overlap
| Chunk Size | Recommended Overlap | Ratio |
|---|---|---|
| 128 tokens | 10-20 tokens | 8-15% |
| 512 tokens | 50-100 tokens | 10-20% |
| 1024 tokens | 100-200 tokens | 10-20% |
| 2048 tokens | 200-400 tokens | 10-20% |
Trade-offs:
- More overlap: Better recall, more storage, slower search
- Less overlap: Less storage, faster search, may miss context
Chunking for Different Content Types
Technical Documentation
```python
# Recommended: Markdown-aware, preserve code blocks
chunk_size = 1024
overlap = 150
preserve_code_blocks = True
preserve_tables = True
```
Customer Support Tickets
```python
# Recommended: Fixed-size with moderate overlap
chunk_size = 512
overlap = 100
split_by_turns = True  # Each Q&A turn
```
Research Papers
```python
# Recommended: Section-based with citations
split_by_sections = True
preserve_citations = True
chunk_size = 1024
```
Code Repositories
```python
# Recommended: Syntactic splitting
split_by_functions = True
include_docstrings = True
chunk_size = 512
```
Chat Logs
```python
# Recommended: Message-based
chunk_by_messages = True
messages_per_chunk = 10
preserve_threading = True
```
Evaluating Chunking Strategies
Retrieval Metrics
Test with a query set:
```python
import numpy as np

def evaluate_chunking(documents, queries, ground_truth, chunking_fn):
    # embed() and search() are placeholders for your embedding model and vector index
    chunks = chunking_fn(documents)
    embeddings = embed(chunks)
    precision_scores = []
    recall_scores = []
    for query, expected_docs in zip(queries, ground_truth):
        retrieved = search(embed(query), embeddings, k=5)
        precision = len(set(retrieved) & set(expected_docs)) / len(retrieved)
        recall = len(set(retrieved) & set(expected_docs)) / len(expected_docs)
        precision_scores.append(precision)
        recall_scores.append(recall)
    return {
        'precision': np.mean(precision_scores),
        'recall': np.mean(recall_scores)
    }
```
End-to-End Metrics
Test the full RAG pipeline end to end (a rough proxy sketch follows this list):
- Answer accuracy
- Context utilization (how much of the retrieved context is used)
- Answer groundedness (faithfulness to the retrieved chunks)
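These dimensions are usually scored with an LLM judge; as a cheap first pass, the sketch below uses simple token overlap as a rough proxy for groundedness and context utilization (the function name and the proxy itself are illustrative, not a standard metric definition):

```python
import re

def lexical_overlap_metrics(answer, retrieved_chunks):
    # Rough lexical proxies; an LLM judge gives more faithful scores in practice
    answer_tokens = set(re.findall(r'\w+', answer.lower()))
    context_tokens = set()
    for chunk in retrieved_chunks:
        context_tokens |= set(re.findall(r'\w+', chunk.lower()))
    if not answer_tokens or not context_tokens:
        return {'groundedness': 0.0, 'context_utilization': 0.0}
    overlap = len(answer_tokens & context_tokens)
    return {
        # Share of the answer's vocabulary that appears in the retrieved chunks
        'groundedness': overlap / len(answer_tokens),
        # Share of the retrieved context's vocabulary that the answer actually used
        'context_utilization': overlap / len(context_tokens),
    }
```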
Practical Recommendations
Decision Framework
- Start simple: Fixed-size chunking with overlap (512 tokens, 100-token overlap)
- Measure performance: Use the evaluation metrics above
- Identify failures: Where does retrieval fail?
- Iterate: Try semantic or metadata-aware chunking
- A/B test: Compare strategies on real queries (see the sketch after this list)
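Here is a hypothetical A/B harness that reuses evaluate_chunking() and the chunkers defined earlier; it assumes the embed()/search() placeholders are wired to your stack and that documents, queries, and ground_truth are your own evaluation set:

```python
# Compare two chunking strategies on the same evaluation set
strategies = {
    'fixed_512_overlap_50': lambda docs: [c for d in docs for c in chunk_by_tokens(d, 512, 50)],
    'sentence_based': lambda docs: [c for d in docs for c in chunk_by_sentences(d, 5)],
}

results = {}
for name, chunking_fn in strategies.items():
    results[name] = evaluate_chunking(documents, queries, ground_truth, chunking_fn)

for name, metrics in results.items():
    print(f"{name}: precision={metrics['precision']:.2f}, recall={metrics['recall']:.2f}")
```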
Common Patterns
90% of use cases:
- Recursive character splitting
- 512-1024 token chunks
- 10-20% overlap
Structured documents:
- Markdown/HTML-aware chunking
- Preserve metadata (headers, sections)
- Variable sizes OK
Code:
- Syntax-aware splitting
- Include docstrings with functions
- Smaller chunks (256-512)
Hybrid search:
- Multiple chunk sizes
- Hierarchical retrieval
- Worth the complexity for high-value apps
Common Pitfalls
- Chunks too small: Lost context, fragmented retrieval
- Chunks too large: Irrelevant information, wasted tokens
- No overlap: Miss boundary-spanning queries
- Ignoring structure: Arbitrary splits in tables, code, lists
- One-size-fits-all: Different content needs different strategies
- No evaluation: Guessing instead of measuring
💡 Expert Tip from Ailog: In production with 10M+ documents, we've found that starting with 512-token chunks and 10% overlap works for 80% of use cases. Only optimize further if you see retrieval failures in your evaluation metrics. The biggest mistake is over-engineering chunking before measuring actual performance. Start simple, measure, iterate.
Try Chunking Strategies on Ailog
Want to test different chunking approaches without writing code?
Ailog's platform lets you:
- Upload documents and compare chunking strategies side-by-side
- Test semantic vs fixed-size chunking instantly
- Visualize chunk boundaries and overlap
- Benchmark retrieval quality with real queries
- Deploy the best strategy to production in one click
Try it free → No credit card required.
Next Steps
With documents properly chunked, the next step is selecting and configuring a vector database to store and search embeddings efficiently. This is covered in the next guide on vector databases.
Related Guides
Fixed-Size Chunking: Fast and Reliable
Master the basics: implement fixed-size chunking with overlaps for consistent, predictable RAG performance.
Semantic Chunking for Better Retrieval
Split documents intelligently based on meaning, not just length. Learn semantic chunking techniques for RAG.
Parent Document Retrieval: Context Without Noise
Search small chunks, retrieve full documents: the best of both precision and context for RAG systems.