Semantic Chunking for Better Retrieval
Split documents intelligently based on meaning, not just length. Learn semantic chunking techniques for RAG.
- Author
- Ailog Research Team
- Published
- Reading time
- 12 min read
- Level
- advanced
- RAG Pipeline Step
- Chunking
The Problem with Fixed-Size Chunking
Traditional chunking splits text every N characters or tokens, as the sketch below shows:
- ❌ Breaks sentences mid-thought
- ❌ Separates related content
- ❌ No context awareness
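A minimal character-based splitter makes the failure mode concrete (the sample text is illustrative):

```python
def fixed_size_chunk(text, chunk_size=40):
    # Split every `chunk_size` characters, with no regard for meaning
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

text = "Photosynthesis converts sunlight into chemical energy. Plants store it as glucose."
for chunk in fixed_size_chunk(text):
    print(repr(chunk))
# Boundaries land mid-word and mid-sentence, splitting related content
```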
Semantic chunking splits based on meaning, not length.
How Semantic Chunking Works
1. Embed each sentence using a sentence encoder
2. Calculate similarity between consecutive sentences
3. Split where similarity drops (a topic change)
```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_chunk(text, similarity_threshold=0.5):
    # Naive sentence split; a real sentence tokenizer (e.g. nltk) is more robust
    sentences = text.split('. ')

    # Embed all sentences
    embeddings = model.encode(sentences)

    # Split wherever cosine similarity between consecutive sentences drops
    chunks = []
    current_chunk = [sentences[0]]

    for i in range(1, len(sentences)):
        similarity = np.dot(embeddings[i - 1], embeddings[i]) / (
            np.linalg.norm(embeddings[i - 1]) * np.linalg.norm(embeddings[i])
        )

        if similarity < similarity_threshold:
            # Topic changed - start a new chunk
            chunks.append('. '.join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])

    # Add the final chunk
    chunks.append('. '.join(current_chunk))

    return chunks
```
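A quick usage sketch (the sample text is illustrative; actual split points depend on the model and threshold):

```python
text = (
    "Photosynthesis converts sunlight into chemical energy. "
    "Chlorophyll absorbs light in the leaves. "
    "Regular exercise improves cardiovascular health. "
    "It also boosts mood and energy levels."
)

for i, chunk in enumerate(semantic_chunk(text, similarity_threshold=0.5)):
    print(f"Chunk {i}: {chunk}")
# With a well-chosen threshold, the split lands between the
# photosynthesis sentences and the exercise sentences
```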
LangChain Semantic Chunking (2025)
LangChain ships a built-in semantic chunker (in the langchain_experimental package):
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",  # or "standard_deviation"
    breakpoint_threshold_amount=95,
)

chunks = text_splitter.create_documents([long_text])
```
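`create_documents` returns LangChain `Document` objects, so a quick way to sanity-check chunk sizes is:

```python
for doc in chunks:
    print(len(doc.page_content), doc.page_content[:60])
```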
Advanced: Multi-Level Semantic Chunking
Combine semantic splits with size constraints:
```python
def smart_semantic_chunk(text, max_chunk_size=1000, min_chunk_size=200):
    # First pass: semantic split
    semantic_chunks = semantic_chunk(text)

    final_chunks = []

    for chunk in semantic_chunks:
        # If the chunk is too large, split it further
        if len(chunk) > max_chunk_size:
            # Split by paragraphs within this semantic section
            paragraphs = chunk.split('\n\n')
            sub_chunk = ""

            for para in paragraphs:
                if len(sub_chunk) + len(para) < max_chunk_size:
                    sub_chunk += para + "\n\n"
                else:
                    final_chunks.append(sub_chunk.strip())
                    sub_chunk = para + "\n\n"

            if sub_chunk:
                final_chunks.append(sub_chunk.strip())

        # If the chunk is too small, merge it into the previous chunk
        elif len(chunk) < min_chunk_size and final_chunks:
            final_chunks[-1] += "\n\n" + chunk
        else:
            final_chunks.append(chunk)

    return final_chunks
```
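A quick usage sketch (assumes `semantic_chunk` from the first snippet; `long_text` is a placeholder for your document):

```python
chunks = smart_semantic_chunk(long_text)
print([len(c) for c in chunks])  # lengths should mostly fall between 200 and 1000
```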
LlamaIndex Semantic Splitter
```python
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding()

splitter = SemanticSplitterNodeParser(
    buffer_size=1,  # number of sentences grouped when comparing similarity
    breakpoint_percentile_threshold=95,
    embed_model=embed_model,
)

nodes = splitter.get_nodes_from_documents(documents)
```
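The `documents` variable is assumed to exist already; with current LlamaIndex you can build it from raw text:

```python
from llama_index.core import Document

documents = [Document(text=long_text)]
```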
When to Use Semantic Chunking
Use semantic chunking when:
- Documents have clear topic transitions
- You need high-precision retrieval
- Content is narrative or explanatory
- You can afford the compute cost
Stick to fixed-size when:
- Speed is critical
- Documents are very uniform
- Budget is limited
- Content is tabular or structured
Performance Considerations
Embedding cost:
- Semantic chunking requires embedding every sentence
- For a 10,000-word document: ~300 sentences to embed
- Consider caching embeddings (see the sketch below)
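A minimal caching sketch, assuming the sentence-transformers `model` from the first example (the helper name is hypothetical):

```python
import hashlib

_embedding_cache = {}

def embed_cached(sentence):
    # Reuse a stored embedding when the same sentence appears again
    key = hashlib.sha256(sentence.encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = model.encode([sentence])[0]
    return _embedding_cache[key]
```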
Speed comparison (November 2025):
- Fixed-size: ~1ms per document
- Semantic: ~100-500ms per document (depending on the embedding model)
Hybrid Approach: Best of Both Worlds
```python
def hybrid_chunk(text, target_size=500):
    # Semantic split first
    semantic_chunks = semantic_chunk(text, similarity_threshold=0.6)

    # Merge small semantic chunks, flushing the buffer once it nears 1.5x the target
    final_chunks = []
    buffer = ""

    for chunk in semantic_chunks:
        if len(buffer) + len(chunk) < target_size * 1.5:
            buffer += ("\n\n" + chunk) if buffer else chunk
        else:
            if buffer:
                final_chunks.append(buffer)
            buffer = chunk

    if buffer:
        final_chunks.append(buffer)

    return final_chunks
```
Evaluation
Test retrieval quality with semantic vs fixed chunking:
```python
# Your test queries
queries = [
    "How does photosynthesis work?",
    "What are the benefits of exercise?",
]

# Compare retrieval accuracy
semantic_results = evaluate_chunking(semantic_chunks, queries)
fixed_results = evaluate_chunking(fixed_chunks, queries)

print(f"Semantic MRR: {semantic_results['mrr']}")
print(f"Fixed MRR: {fixed_results['mrr']}")
```
Semantic chunking typically improves retrieval quality by 15-30%, but at roughly 100x the compute of fixed-size splitting. Choose based on your accuracy/cost tradeoff.