RAG Chunking Strategies 2025: Optimal Chunk Sizes & Techniques
Master document chunking for RAG: optimal chunk sizes (512-1024 tokens), overlap strategies, semantic vs fixed-size splitting. Improve retrieval by 25%+.
- Author
- Ailog Research Team
- Published
- Reading time
- 15 min read
- Level
- intermediate
- RAG Pipeline Step
- Chunking
TL;DR
• Chunk size matters: 500-1000 tokens balances context and precision
• Semantic chunking (splitting by meaning) beats fixed-size for quality (+15-30% retrieval accuracy)
• Overlap (10-20%) prevents losing context at boundaries
• Best for most use cases: Recursive text splitter with 512 tokens, 50 token overlap
• Try it now: Test different strategies on Ailog's platform
The Chunking Problem
Most documents are too long to: • Embed as a single vector (context window limits) • Use entirely as LLM context (token limits) • Retrieve with precision (too much irrelevant information)
Chunking splits documents into smaller, manageable pieces while preserving semantic meaning.
Why Chunking Matters
Poor chunking leads to: • Split context: Important information broken across chunks • Irrelevant retrieval: Chunks contain mix of relevant and irrelevant content • Lost context: Chunk boundaries cut off critical information • Poor generation: LLM lacks complete context to answer accurately
Good chunking enables: • Precise retrieval: Find exactly the relevant information • Complete context: Chunks contain full thoughts or concepts • Efficient token usage: No wasted context on irrelevant text • Better answers: LLM has what it needs, nothing more
Fixed-Size Chunking
Character-Based
Split text every N characters.
```python
def chunk_by_chars(text, chunk_size=1000, overlap=200):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap
    return chunks
```
Pros: • Simple implementation • Predictable chunk sizes • Fast processing
Cons: • Splits mid-word, mid-sentence • Ignores semantic boundaries • Breaks code, tables, lists
Use when: • Quick prototype needed • Text structure is homogeneous • Precision not critical
Token-Based
Split by token count (matches model tokenization).
```python
import tiktoken

def chunk_by_tokens(text, chunk_size=512, overlap=50):
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)

    chunks = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunk_tokens = tokens[start:end]
        chunks.append(encoding.decode(chunk_tokens))
        start += chunk_size - overlap

    return chunks
```
Pros: • Respects token limits precisely • Works with any embedding model • Predictable embedding costs
Cons: • Still ignores semantic boundaries • Tokenization overhead • May split important context
Use when: • Strict token budget • Token count is critical (API costs) • Embedding model has hard token limits
Recommended Fixed Sizes
| Use Case | Chunk Size | Overlap | Rationale |
|----------|-----------|---------|-----------|
| Short FAQ | 128-256 tokens | 0-20 | Minimal context needed |
| General docs | 512-1024 tokens | 50-100 | Balance precision and context |
| Technical docs | 1024-2048 tokens | 100-200 | More context for complex topics |
| Code | 256-512 tokens | 50-100 | Preserve function/class context |
Semantic Chunking
Split at natural semantic boundaries.
Sentence-Based
Split at sentence boundaries.
```python
import nltk
nltk.download('punkt')

def chunk_by_sentences(text, sentences_per_chunk=5):
    sentences = nltk.sent_tokenize(text)
    chunks = []

    for i in range(0, len(sentences), sentences_per_chunk):
        chunk = ' '.join(sentences[i:i + sentences_per_chunk])
        chunks.append(chunk)

    return chunks
```
Pros: • Respects sentence boundaries • More readable chunks • Preserves complete thoughts
Cons: • Variable chunk sizes • Sentence detection can fail • May not group related sentences
Use when: • Readability is important • Sentences are self-contained • General narrative text
Paragraph-Based
Split at paragraph breaks.
```python
def chunk_by_paragraphs(text, paragraphs_per_chunk=2):
    paragraphs = text.split('\n\n')
    chunks = []

    for i in range(0, len(paragraphs), paragraphs_per_chunk):
        chunk = '\n\n'.join(paragraphs[i:i + paragraphs_per_chunk])
        chunks.append(chunk)

    return chunks
```
Pros: • Respects document structure • Keeps related content together • Natural reading units
Cons: • Highly variable sizes • Depends on formatting • Long paragraphs still problematic
Use when: • Well-formatted documents • Paragraphs represent complete ideas • Blog posts, articles
Recursive Character Splitting
LangChain's approach: try splits in order of preference.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks = splitter.split_text(text)
```
Split hierarchy:
1. Double newline (paragraphs)
2. Single newline (lines)
3. Period + space (sentences)
4. Space (words)
5. Character
Pros: • Respects document structure when possible • Falls back gracefully • Balances semantics and size
Cons: • Still somewhat arbitrary • May not capture true semantic units • Configuration required
Use when: • General-purpose chunking • Mixed document types • Good default choice
Metadata-Aware Chunking
Use document structure to inform chunking.
Markdown Chunking
Split by headers, preserving hierarchy.
```python
def chunk_markdown(text):
    chunks = []
    current_h1 = ""
    current_h2 = ""
    current_chunk = []

    for line in text.split('\n'):
        if line.startswith('# '):
            if current_chunk:
                chunks.append({
                    'content': '\n'.join(current_chunk),
                    'h1': current_h1,
                    'h2': current_h2
                })
                current_chunk = []
            current_h1 = line[2:]
        elif line.startswith('## '):
            if current_chunk:
                chunks.append({
                    'content': '\n'.join(current_chunk),
                    'h1': current_h1,
                    'h2': current_h2
                })
                current_chunk = []
            current_h2 = line[3:]

        current_chunk.append(line)

    # Flush the final chunk
    if current_chunk:
        chunks.append({
            'content': '\n'.join(current_chunk),
            'h1': current_h1,
            'h2': current_h2
        })

    return chunks
```
Metadata benefits: • Headers provide context for search • Can filter by section • Better relevance scoring
HTML/XML Chunking
Split by semantic HTML tags.
```python
from bs4 import BeautifulSoup

def chunk_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    chunks = []

    # Split by semantic sections
    for section in soup.find_all(['section', 'article', 'div']):
        classes = section.get('class') or []
        if 'content' in classes or 'main' in classes:
            chunks.append({
                'content': section.get_text(),
                'tag': section.name,
                'class': classes
            })

    return chunks
```
Code Chunking
Split by function/class boundaries.
```python
import ast

def chunk_python_code(code):
    tree = ast.parse(code)
    chunks = []

    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            chunk_lines = code.split('\n')[node.lineno - 1:node.end_lineno]
            chunks.append({
                'content': '\n'.join(chunk_lines),
                'type': type(node).__name__,
                'name': node.name
            })

    return chunks
```
Pros: • Preserves logical units (functions, classes) • Metadata aids discovery • Natural code boundaries
Cons: • Language-specific parsing • Complex implementation • May miss cross-function context
Advanced Chunking Techniques
Semantic Similarity-Based
Group sentences by semantic similarity.
```python
import nltk
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

def semantic_chunking(text, model, max_chunk_size=512):
    # max_chunk_size is not enforced in this simplified version
    sentences = nltk.sent_tokenize(text)
    embeddings = model.encode(sentences)

    # Cluster similar sentences
    clustering = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=0.5
    )
    labels = clustering.fit_predict(embeddings)

    # Group sentences by cluster
    chunks = {}
    for sent, label in zip(sentences, labels):
        chunks.setdefault(label, []).append(sent)

    return [' '.join(sents) for sents in chunks.values()]
```
Pros: • Truly semantic grouping • Handles topic shifts • Optimal information density
Cons: • Computationally expensive • Requires embedding model • Complex to tune
Sliding Window with Contextual Overlap
Add surrounding context to each chunk.
```python
def sliding_window_chunk(text, window_size=512, context_size=128):
    # tokenize()/detokenize() are placeholders for any tokenizer,
    # e.g. tiktoken's encode/decode shown earlier
    tokens = tokenize(text)
    chunks = []

    for i in range(0, len(tokens), window_size):
        # Main window plus surrounding context
        start = max(0, i - context_size)
        end = min(len(tokens), i + window_size + context_size)

        chunk = {
            'content': detokenize(tokens[i:i + window_size]),
            'context': detokenize(tokens[start:end]),
            'position': i
        }
        chunks.append(chunk)

    return chunks
```
Pros: • Each chunk has surrounding context • Reduces information loss • Better for cross-boundary queries
Cons: • Larger storage requirements • More embeddings needed • Potential redundancy
Hybrid Hierarchical Chunking
Chunk at multiple granularities.
```python
def hierarchical_chunk(document):
    # embed() and split_by_headers() are placeholders for your
    # embedding model and a header-based splitter

    # Level 1: Document
    doc_embedding = embed(document['content'])

    # Level 2: Sections
    sections = split_by_headers(document['content'])
    section_embeddings = [embed(s) for s in sections]

    # Level 3: Paragraphs
    paragraph_chunks = []
    for section in sections:
        paragraphs = section.split('\n\n')
        paragraph_chunks.extend([
            {'content': p, 'section': section}
            for p in paragraphs
        ])
    para_embeddings = [embed(p['content']) for p in paragraph_chunks]

    return {
        'document': {'embedding': doc_embedding, 'content': document},
        'sections': [{'embedding': e, 'content': s}
                     for e, s in zip(section_embeddings, sections)],
        'paragraphs': [{'embedding': e, 'content': p}
                       for e, p in zip(para_embeddings, paragraph_chunks)]
    }
```
Retrieval strategy:
1. Search at the document level
2. If a match is found, search within its sections
3. Finally retrieve the specific paragraphs
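A minimal sketch of this coarse-to-fine lookup, assuming the index built by `hierarchical_chunk` above, a single document (so the document-level match is trivial), and the same placeholder `embed()` helper:

```python
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def coarse_to_fine_retrieve(query, index, top_sections=2, top_paragraphs=3):
    # index is the dict returned by hierarchical_chunk()
    q = embed(query)

    # Rank sections by similarity and keep the best ones
    best_sections = sorted(
        index['sections'],
        key=lambda s: cosine_sim(q, s['embedding']),
        reverse=True
    )[:top_sections]
    section_texts = {s['content'] for s in best_sections}

    # Then rank only the paragraphs that belong to those sections
    candidates = [p for p in index['paragraphs']
                  if p['content']['section'] in section_texts]
    candidates.sort(key=lambda p: cosine_sim(q, p['embedding']), reverse=True)
    return candidates[:top_paragraphs]
```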
Pros: • Multiple levels of granularity • Coarse-to-fine retrieval • Better context preservation
Cons: • Complex implementation • More storage needed • Slower indexing
Chunk Overlap
Why Overlap?
Without overlap:
```
Chunk 1: "...the database schema includes user tables"
Chunk 2: "with columns for email and password..."
```
The query "database user email" may match neither chunk well, because the relevant phrase is split across the boundary.
With overlap:
```
Chunk 1: "...the database schema includes user tables with columns for..."
Chunk 2: "...user tables with columns for email and password..."
```
Now "user tables with columns" appears in both, improving recall.
Optimal Overlap
| Chunk Size | Recommended Overlap | Ratio |
|-----------|-------------------|-------|
| 128 tokens | 10-20 tokens | 8-15% |
| 512 tokens | 50-100 tokens | 10-20% |
| 1024 tokens | 100-200 tokens | 10-20% |
| 2048 tokens | 200-400 tokens | 10-20% |
Trade-offs: • More overlap: Better recall, more storage, slower search • Less overlap: Less storage, faster search, may miss context
Chunking for Different Content Types
Technical Documentation
```python
# Recommended: Markdown-aware, preserve code blocks
chunk_size = 1024
overlap = 150
preserve_code_blocks = True
preserve_tables = True
```
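If you'd rather not hand-roll the header handling, LangChain's header-aware splitter can be combined with the recursive splitter shown earlier. A sketch, assuming that API is available in your LangChain version and with `markdown_text` standing in for your document:

```python
from langchain.text_splitter import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

# Split on headers first so each piece carries its section metadata...
header_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
sections = header_splitter.split_text(markdown_text)

# ...then cap the size of long sections with the recursive splitter
size_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=150)
docs = size_splitter.split_documents(sections)
```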
Customer Support Tickets
```python
# Recommended: Fixed-size with moderate overlap
chunk_size = 512
overlap = 100
split_by_turns = True  # Each Q&A turn
```
Research Papers
```python
# Recommended: Section-based with citations
split_by_sections = True
preserve_citations = True
chunk_size = 1024
```
Code Repositories
```python
# Recommended: Syntactic splitting
split_by_functions = True
include_docstrings = True
chunk_size = 512
```
Chat Logs
```python
# Recommended: Message-based
chunk_by_messages = True
messages_per_chunk = 10
preserve_threading = True
```
Evaluating Chunking Strategies
Retrieval Metrics
Test with a query set:
```python
import numpy as np

def evaluate_chunking(documents, queries, ground_truth, chunking_fn):
    # embed() and search() stand in for your embedding model and vector index
    chunks = chunking_fn(documents)
    embeddings = embed(chunks)

    precision_scores = []
    recall_scores = []

    for query, expected_docs in zip(queries, ground_truth):
        retrieved = search(embed(query), embeddings, k=5)
        precision = len(set(retrieved) & set(expected_docs)) / len(retrieved)
        recall = len(set(retrieved) & set(expected_docs)) / len(expected_docs)

        precision_scores.append(precision)
        recall_scores.append(recall)

    return {
        'precision': np.mean(precision_scores),
        'recall': np.mean(recall_scores)
    }
```
End-to-End Metrics
Test full RAG pipeline: • Answer accuracy • Context utilization (how much of retrieved context is used) • Answer groundedness (faithfulness to chunks)
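Groundedness is usually scored with an LLM judge, but a crude lexical proxy can flag obvious failures during development. A minimal sketch, intended only as a heuristic rather than a standard metric implementation:

```python
def rough_groundedness(answer, retrieved_chunks):
    """Fraction of answer tokens that also appear in the retrieved chunks.

    A crude lexical proxy: low values suggest the answer is not supported
    by the retrieved context and deserves a closer look.
    """
    answer_tokens = set(answer.lower().split())
    context_tokens = set(' '.join(retrieved_chunks).lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)
```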
Practical Recommendations
Decision Framework
1. Start simple: Fixed-size with overlap (512 tokens, 100 overlap)
2. Measure performance: Use evaluation metrics
3. Identify failures: Where does retrieval fail?
4. Iterate: Try semantic or metadata-aware chunking
5. A/B test: Compare strategies on real queries (see the sketch after this list)
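To make step 5 concrete, here is a sketch that compares two strategies using the `evaluate_chunking` helper and the chunkers defined earlier; `documents`, `queries`, and `ground_truth` are placeholders for your own evaluation set:

```python
def fixed_512(docs):
    return [c for d in docs for c in chunk_by_tokens(d, chunk_size=512, overlap=100)]

def paragraph_based(docs):
    return [c for d in docs for c in chunk_by_paragraphs(d)]

results = {
    'fixed_512_overlap_100': evaluate_chunking(documents, queries, ground_truth, fixed_512),
    'paragraph_based': evaluate_chunking(documents, queries, ground_truth, paragraph_based),
}

for name, metrics in results.items():
    print(f"{name}: precision={metrics['precision']:.2f}, recall={metrics['recall']:.2f}")
```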
Common Patterns
90% of use cases: • Recursive character splitting • 512-1024 token chunks • 10-20% overlap
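A reasonable default along these lines, assuming your LangChain version exposes the token-aware constructor:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Recursive splitting, with sizes measured in tokens rather than characters
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=512,    # within the 512-1024 token sweet spot
    chunk_overlap=75,  # roughly 15% overlap
)
chunks = splitter.split_text(text)
```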
Structured documents: • Markdown/HTML-aware chunking • Preserve metadata (headers, sections) • Variable sizes OK
Code: • Syntax-aware splitting • Include docstrings with functions • Smaller chunks (256-512)
Hybrid search: • Multiple chunk sizes • Hierarchical retrieval • Worth the complexity for high-value apps
Common Pitfalls
• Too small chunks: Lose context, fragmented retrieval
• Too large chunks: Irrelevant information, token waste
• No overlap: Miss boundary-spanning queries
• Ignoring structure: Arbitrary splits in tables, code, lists
• One-size-fits-all: Different content needs different strategies
• No evaluation: Guessing instead of measuring
> 💡 Expert Tip from Ailog: In production with 10M+ documents, we've found that starting with 512-token chunks and 10% overlap works for 80% of use cases. Only optimize further if you see retrieval failures in your evaluation metrics. The biggest mistake is over-engineering chunking before measuring actual performance. Start simple, measure, iterate.
Try Chunking Strategies on Ailog
Want to test different chunking approaches without writing code?
Ailog's platform lets you: • Upload documents and compare chunking strategies side-by-side • Test semantic vs fixed-size chunking instantly • Visualize chunk boundaries and overlap • Benchmark retrieval quality with real queries • Deploy the best strategy to production in one click
**Try it free →** No credit card required.
Next Steps
With documents properly chunked, the next step is selecting and configuring a vector database to store and search embeddings efficiently. This is covered in the next guide on vector databases.