2. Chunking (Intermediate)

Chunking Strategies: Optimizing Document Segmentation

January 25, 2025
15 min read
Ailog Research Team

Master document chunking techniques to improve retrieval quality. Learn about chunk sizes, overlaps, semantic splitting, and advanced strategies.

TL;DR

  • Chunk size matters: 500-1000 tokens balances context and precision
  • Semantic chunking (splitting by meaning) beats fixed-size for quality (+15-30% retrieval accuracy)
  • Overlap (10-20%) prevents losing context at boundaries
  • Best for most use cases: Recursive text splitter with 512 tokens, 50 token overlap
  • Try it now: Test different strategies on Ailog's platform

The Chunking Problem

Most documents are too long to:

  • Embed as a single vector (context window limits)
  • Use entirely as LLM context (token limits)
  • Retrieve with precision (too much irrelevant information)

Chunking splits documents into smaller, manageable pieces while preserving semantic meaning.

Why Chunking Matters

Poor chunking leads to:

  • Split context: Important information broken across chunks
  • Irrelevant retrieval: Chunks contain a mix of relevant and irrelevant content
  • Lost context: Chunk boundaries cut off critical information
  • Poor generation: LLM lacks complete context to answer accurately

Good chunking enables:

  • Precise retrieval: Find exactly the relevant information
  • Complete context: Chunks contain full thoughts or concepts
  • Efficient token usage: No wasted context on irrelevant text
  • Better answers: LLM has what it needs, nothing more

Fixed-Size Chunking

Character-Based

Split text every N characters.

python
def chunk_by_chars(text, chunk_size=1000, overlap=200):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap
    return chunks

Pros:

  • Simple implementation
  • Predictable chunk sizes
  • Fast processing

Cons:

  • Splits mid-word, mid-sentence
  • Ignores semantic boundaries
  • Breaks code, tables, lists

Use when:

  • Quick prototype needed
  • Text structure is homogeneous
  • Precision not critical

Token-Based

Split by token count (matches model tokenization).

python
import tiktoken

def chunk_by_tokens(text, chunk_size=512, overlap=50):
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunk_tokens = tokens[start:end]
        chunks.append(encoding.decode(chunk_tokens))
        start += chunk_size - overlap
    return chunks

Pros:

  • Respects token limits precisely
  • Works with any embedding model
  • Predictable embedding costs

Cons:

  • Still ignores semantic boundaries
  • Tokenization overhead
  • May split important context

Use when:

  • Strict token budget
  • Token count is critical (API costs)
  • Embedding model has hard token limits

Recommended Fixed Sizes

Use Case       | Chunk Size       | Overlap        | Rationale
Short FAQ      | 128-256 tokens   | 0-20 tokens    | Minimal context needed
General docs   | 512-1024 tokens  | 50-100 tokens  | Balance precision and context
Technical docs | 1024-2048 tokens | 100-200 tokens | More context for complex topics
Code           | 256-512 tokens   | 50-100 tokens  | Preserve function/class context
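
These defaults can be captured in a small lookup. The sketch below is illustrative: the use-case keys and the config_for() helper are hypothetical, with values taken from the table above.

python
# Illustrative lookup for the defaults above; the keys and the config_for()
# helper are hypothetical, not a library API.
CHUNKING_DEFAULTS = {
    'short_faq':      {'chunk_size': 256,  'overlap': 20},
    'general_docs':   {'chunk_size': 512,  'overlap': 50},
    'technical_docs': {'chunk_size': 1024, 'overlap': 100},
    'code':           {'chunk_size': 512,  'overlap': 50},
}

def config_for(use_case):
    # Fall back to the general-purpose defaults for unknown content types
    return CHUNKING_DEFAULTS.get(use_case, CHUNKING_DEFAULTS['general_docs'])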

Semantic Chunking

Split at natural semantic boundaries.

Sentence-Based

Split at sentence boundaries.

python
import nltk

nltk.download('punkt')

def chunk_by_sentences(text, sentences_per_chunk=5):
    sentences = nltk.sent_tokenize(text)
    chunks = []
    for i in range(0, len(sentences), sentences_per_chunk):
        chunk = ' '.join(sentences[i:i + sentences_per_chunk])
        chunks.append(chunk)
    return chunks

Pros:

  • Respects sentence boundaries
  • More readable chunks
  • Preserves complete thoughts

Cons:

  • Variable chunk sizes
  • Sentence detection can fail
  • May not group related sentences

Use when:

  • Readability is important
  • Sentences are self-contained
  • General narrative text

Paragraph-Based

Split at paragraph breaks.

python
def chunk_by_paragraphs(text, paragraphs_per_chunk=2):
    paragraphs = text.split('\n\n')
    chunks = []
    for i in range(0, len(paragraphs), paragraphs_per_chunk):
        chunk = '\n\n'.join(paragraphs[i:i + paragraphs_per_chunk])
        chunks.append(chunk)
    return chunks

Pros:

  • Respects document structure
  • Keeps related content together
  • Natural reading units

Cons:

  • Highly variable sizes
  • Depends on formatting
  • Long paragraphs still problematic

Use when:

  • Well-formatted documents
  • Paragraphs represent complete ideas
  • Blog posts, articles

Recursive Character Splitting

LangChain's approach: try splits in order of preference.

python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_text(text)

Split hierarchy:

  1. Double newline (paragraphs)
  2. Single newline (lines)
  3. Period + space (sentences)
  4. Space (words)
  5. Character

Pros:

  • Respects document structure when possible
  • Falls back gracefully
  • Balances semantics and size

Cons:

  • Still somewhat arbitrary
  • May not capture true semantic units
  • Configuration required

Use when:

  • General-purpose chunking
  • Mixed document types
  • Good default choice

Metadata-Aware Chunking

Use document structure to inform chunking.

Markdown Chunking

Split by headers, preserving hierarchy.

python
def chunk_markdown(text):
    chunks = []
    current_h1 = ""
    current_h2 = ""
    current_chunk = []
    for line in text.split('\n'):
        if line.startswith('# '):
            if current_chunk:
                chunks.append({
                    'content': '\n'.join(current_chunk),
                    'h1': current_h1,
                    'h2': current_h2
                })
                current_chunk = []
            current_h1 = line[2:]
            current_h2 = ""  # a new H1 starts a fresh subsection context
        elif line.startswith('## '):
            if current_chunk:
                chunks.append({
                    'content': '\n'.join(current_chunk),
                    'h1': current_h1,
                    'h2': current_h2
                })
                current_chunk = []
            current_h2 = line[3:]
        current_chunk.append(line)
    if current_chunk:
        chunks.append({
            'content': '\n'.join(current_chunk),
            'h1': current_h1,
            'h2': current_h2
        })
    return chunks

Metadata benefits:

  • Headers provide context for search
  • Can filter by section (sketched below)
  • Better relevance scoring
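
As a concrete example, header metadata can act as a pre-filter before similarity ranking. The helper below is hypothetical: it assumes chunks produced by chunk_markdown() above, and takes your embedding and similarity functions as explicit parameters.

python
# Hypothetical helper: restrict search to one top-level section before
# ranking. Assumes chunks from chunk_markdown() above; embed() and cosine()
# are your embedding model and similarity function, passed in explicitly.
def search_in_section(query, chunks, h1_filter, embed, cosine, k=5):
    candidates = [c for c in chunks if c['h1'] == h1_filter]
    query_vec = embed(query)
    scored = [
        (cosine(query_vec, embed(c['content'])), c)
        for c in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]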

HTML/XML Chunking

Split by semantic HTML tags.

python
from bs4 import BeautifulSoup

def chunk_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    chunks = []
    # Split by semantic container elements
    for section in soup.find_all(['section', 'article', 'div']):
        # get('class') returns a list of class names (or None)
        classes = section.get('class') or []
        if any(c in ('content', 'main') for c in classes):
            chunks.append({
                'content': section.get_text(),
                'tag': section.name,
                'class': classes
            })
    return chunks

Code Chunking

Split by function/class boundaries.

python
import ast

def chunk_python_code(code):
    tree = ast.parse(code)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-indexed (end_lineno needs Python 3.8+)
            chunk_lines = code.split('\n')[node.lineno - 1:node.end_lineno]
            chunks.append({
                'content': '\n'.join(chunk_lines),
                'type': type(node).__name__,
                'name': node.name
            })
    return chunks

Pros:

  • Preserves logical units (functions, classes)
  • Metadata aids discovery
  • Natural code boundaries

Cons:

  • Language-specific parsing
  • Complex implementation
  • May miss cross-function context

Advanced Chunking Techniques

Semantic Similarity-Based

Group sentences by semantic similarity.

python
import nltk
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

def semantic_chunking(text, model, max_chunk_size=512):
    # model: a sentence embedder, e.g. SentenceTransformer('all-MiniLM-L6-v2')
    # max_chunk_size: a cap you may enforce after clustering (not applied here)
    sentences = nltk.sent_tokenize(text)
    embeddings = model.encode(sentences)
    # Cluster similar sentences; the distance threshold controls granularity
    clustering = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=0.5
    )
    labels = clustering.fit_predict(embeddings)
    # Group sentences by cluster label
    chunks = {}
    for sent, label in zip(sentences, labels):
        chunks.setdefault(label, []).append(sent)
    return [' '.join(sents) for sents in chunks.values()]

Pros:

  • Truly semantic grouping
  • Handles topic shifts
  • Optimal information density

Cons:

  • Computationally expensive
  • Requires embedding model
  • Complex to tune

Sliding Window with Contextual Overlap

Add surrounding context to each chunk.

python
import tiktoken

def sliding_window_chunk(text, window_size=512, context_size=128):
    # Tokenize with the same encoding used in the earlier examples
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)
    chunks = []
    for i in range(0, len(tokens), window_size):
        # Extend the main window with context on both sides
        start = max(0, i - context_size)
        end = min(len(tokens), i + window_size + context_size)
        chunk = {
            'content': encoding.decode(tokens[i:i + window_size]),
            'context': encoding.decode(tokens[start:end]),
            'position': i
        }
        chunks.append(chunk)
    return chunks

Pros:

  • Each chunk has surrounding context
  • Reduces information loss
  • Better for cross-boundary queries

Cons:

  • Larger storage requirements
  • More embeddings needed
  • Potential redundancy

Hybrid Hierarchical Chunking

Chunk at multiple granularities.

python
# embed() and split_by_headers() are assumed helpers: your embedding model
# and a header-based splitter such as chunk_markdown() above.
def hierarchical_chunk(document):
    # Level 1: whole document
    doc_embedding = embed(document['content'])
    # Level 2: sections
    sections = split_by_headers(document['content'])
    section_embeddings = [embed(s) for s in sections]
    # Level 3: paragraphs within each section
    paragraph_chunks = []
    for section in sections:
        paragraphs = section.split('\n\n')
        paragraph_chunks.extend([
            {'content': p, 'section': section}
            for p in paragraphs
        ])
    para_embeddings = [embed(p['content']) for p in paragraph_chunks]
    return {
        'document': {'embedding': doc_embedding, 'content': document},
        'sections': [
            {'embedding': e, 'content': s}
            for e, s in zip(section_embeddings, sections)
        ],
        'paragraphs': [
            {'embedding': e, **p}
            for e, p in zip(para_embeddings, paragraph_chunks)
        ]
    }

Retrieval strategy (sketched below):

  1. Search at document level
  2. If match found, search within sections
  3. Finally retrieve specific paragraphs
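
A minimal sketch of this coarse-to-fine flow, assuming the index built by hierarchical_chunk() above and hypothetical embed() and cosine() helpers; the 0.3 document-level threshold is an arbitrary placeholder to tune.

python
# Coarse-to-fine search over the hierarchical index built above.
def coarse_to_fine_search(query, index, embed, cosine, k=3):
    query_vec = embed(query)
    # 1. Document level: bail out early if the whole document is irrelevant
    #    (0.3 is an arbitrary placeholder threshold)
    if cosine(query_vec, index['document']['embedding']) < 0.3:
        return []
    # 2. Section level: keep the best-matching section
    best_section = max(
        index['sections'],
        key=lambda s: cosine(query_vec, s['embedding'])
    )
    # 3. Paragraph level: rank paragraphs within that section
    paragraphs = [
        p for p in index['paragraphs']
        if p['section'] == best_section['content']
    ]
    paragraphs.sort(
        key=lambda p: cosine(query_vec, p['embedding']),
        reverse=True
    )
    return paragraphs[:k]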

Pros:

  • Multiple levels of granularity
  • Coarse-to-fine retrieval
  • Better context preservation

Cons:

  • Complex implementation
  • More storage needed
  • Slower indexing

Chunk Overlap

Why Overlap?

Without overlap:

Chunk 1: "...the database schema includes user tables"
Chunk 2: "with columns for email and password..."

Query: "database user email" might miss both chunks

With overlap:

Chunk 1: "...the database schema includes user tables with columns for..."
Chunk 2: "...user tables with columns for email and password..."

Now "user tables with columns" appears in both, improving recall.

Optimal Overlap

Chunk Size  | Recommended Overlap | Ratio
128 tokens  | 10-20 tokens        | 8-15%
512 tokens  | 50-100 tokens       | 10-20%
1024 tokens | 100-200 tokens      | 10-20%
2048 tokens | 200-400 tokens      | 10-20%

Trade-offs:

  • More overlap: Better recall, more storage, slower search
  • Less overlap: Less storage, faster search, may miss context

Chunking for Different Content Types

Technical Documentation

python
# Recommended: Markdown-aware, preserve code blocks
chunk_size = 1024
overlap = 150
preserve_code_blocks = True
preserve_tables = True

Customer Support Tickets

python
# Recommended: Fixed-size with moderate overlap
chunk_size = 512
overlap = 100
split_by_turns = True  # each Q&A turn becomes its own unit

Research Papers

python
# Recommended: Section-based with citations
split_by_sections = True
preserve_citations = True
chunk_size = 1024

Code Repositories

python
# Recommended: Syntactic splitting
split_by_functions = True
include_docstrings = True
chunk_size = 512

Chat Logs

python
# Recommended: Message-based
chunk_by_messages = True
messages_per_chunk = 10
preserve_threading = True

Evaluating Chunking Strategies

Retrieval Metrics

Test with a query set:

python
import numpy as np

# Assumes embed(), search(), and the `documents` corpus are defined elsewhere.
def evaluate_chunking(queries, ground_truth, chunking_fn):
    chunks = chunking_fn(documents)
    embeddings = embed(chunks)
    precision_scores = []
    recall_scores = []
    for query, expected_docs in zip(queries, ground_truth):
        retrieved = search(embed(query), embeddings, k=5)
        precision = len(set(retrieved) & set(expected_docs)) / len(retrieved)
        recall = len(set(retrieved) & set(expected_docs)) / len(expected_docs)
        precision_scores.append(precision)
        recall_scores.append(recall)
    return {
        'precision': np.mean(precision_scores),
        'recall': np.mean(recall_scores)
    }

End-to-End Metrics

Test the full RAG pipeline:

  • Answer accuracy
  • Context utilization (how much of the retrieved context is used)
  • Answer groundedness (faithfulness to the retrieved chunks); a rough sketch of these last two follows
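
The last two metrics can be roughly approximated with token overlap. The sketch below is a simple lexical heuristic, not a standard metric; in practice an LLM-based judge (or a library such as RAGAS) gives more reliable scores.

python
# Rough lexical approximations; these are token-overlap heuristics, not
# standard metrics. An LLM judge (or a library such as RAGAS) is more
# reliable in production.
def _tokens(text):
    return set(text.lower().split())

def context_utilization(answer, retrieved_chunks):
    # Share of the context vocabulary that shows up in the answer
    context = _tokens(' '.join(retrieved_chunks))
    return len(context & _tokens(answer)) / len(context) if context else 0.0

def groundedness(answer, retrieved_chunks):
    # Share of the answer vocabulary supported by the retrieved context
    answer_tokens = _tokens(answer)
    context = _tokens(' '.join(retrieved_chunks))
    return len(answer_tokens & context) / len(answer_tokens) if answer_tokens else 0.0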

Practical Recommendations

Decision Framework

  1. Start simple: Fixed-size with overlap (512 tokens, 100 overlap)
  2. Measure performance: Use evaluation metrics
  3. Identify failures: Where does retrieval fail?
  4. Iterate: Try semantic or metadata-aware chunking
  5. A/B test: Compare strategies on real queries (see the comparison sketch below)
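
Steps 2 and 5 can share one harness. A sketch using evaluate_chunking() from the previous section; the strategy list, queries, and ground_truth are placeholders for your own evaluation set.

python
# Comparing strategies with evaluate_chunking() from the previous section.
# `queries` and `ground_truth` are your evaluation set (placeholders here).
from functools import partial

strategies = {
    'fixed_512':  partial(chunk_by_tokens, chunk_size=512, overlap=100),
    'sentences':  chunk_by_sentences,
    'paragraphs': chunk_by_paragraphs,
}

for name, fn in strategies.items():
    scores = evaluate_chunking(queries, ground_truth, fn)
    print(f"{name}: precision={scores['precision']:.2f}, "
          f"recall={scores['recall']:.2f}")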

Common Patterns

90% of use cases:

  • Recursive character splitting
  • 512-1024 token chunks
  • 10-20% overlap

Structured documents:

  • Markdown/HTML-aware chunking
  • Preserve metadata (headers, sections)
  • Variable sizes OK

Code:

  • Syntax-aware splitting
  • Include docstrings with functions
  • Smaller chunks (256-512)

Hybrid search:

  • Multiple chunk sizes
  • Hierarchical retrieval
  • Worth the complexity for high-value apps

Common Pitfalls

  1. Too small chunks: Lose context, fragmented retrieval
  2. Too large chunks: Irrelevant information, token waste
  3. No overlap: Miss boundary-spanning queries
  4. Ignoring structure: Arbitrary splits in tables, code, lists
  5. One-size-fits-all: Different content needs different strategies
  6. No evaluation: Guessing instead of measuring

💡 Expert Tip from Ailog: In production with 10M+ documents, we've found that starting with 512-token chunks and 10% overlap works for 80% of use cases. Only optimize further if you see retrieval failures in your evaluation metrics. The biggest mistake is over-engineering chunking before measuring actual performance. Start simple, measure, iterate.

Try Chunking Strategies on Ailog

Want to test different chunking approaches without writing code?

Ailog's platform lets you:

  • Upload documents and compare chunking strategies side-by-side
  • Test semantic vs fixed-size chunking instantly
  • Visualize chunk boundaries and overlap
  • Benchmark retrieval quality with real queries
  • Deploy the best strategy to production in one click

Try it free → No credit card required.

Next Steps

With documents properly chunked, the next step is selecting and configuring a vector database to store and search embeddings efficiently. This is covered in the next guide on vector databases.

Tags

chunking · document processing · retrieval · optimization
