Hierarchical Chunking: Preserving Document Structure
Hierarchical chunking maintains parent-child relationships in your documents. Learn how to implement this advanced technique to improve RAG retrieval quality.
TL;DR
- Hierarchical chunking = preserve sections, subsections, and paragraphs
- Benefit: rich context + fine granularity simultaneously
- Implementation: nested chunks with hierarchy metadata
- Typical gain: +20-35% relevance on structured documents
- Test hierarchical chunking on Ailog
Why Hierarchical Chunking?
Real documents have structure:
- Chapters > Sections > Subsections > Paragraphs
- This hierarchy carries semantic meaning
Classic chunking (fixed-size or semantic) ignores this structure:
```
Original document:
├── Chapter 1: Introduction
│   ├── 1.1 Background
│   └── 1.2 Objectives
└── Chapter 2: Methods
    ├── 2.1 Approach A
    └── 2.2 Approach B

Classic chunking:
[Chunk 1: "...end of background. 1.2 Objectives..."]  ❌ Mixed sections
[Chunk 2: "...beginning of methods..."]               ❌ Lost hierarchy
```
Hierarchical Chunking Principle
Create chunks at multiple levels with parent-child links:
```python
# Preserved hierarchical structure
{
    "id": "doc1_ch2_s1",
    "content": "2.1 Approach A - Detailed description...",
    "metadata": {
        "level": 3,
        "parent_id": "doc1_ch2",
        "path": ["Chapter 2: Methods", "2.1 Approach A"],
        "document_id": "doc1"
    }
}
```
Python Implementation
Hierarchy Extraction
```python
import re
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class HierarchicalChunk:
    id: str
    content: str
    level: int
    title: str
    parent_id: Optional[str]
    path: List[str]
    children_ids: List[str]

def extract_hierarchy(text: str, patterns: Optional[dict] = None) -> List[HierarchicalChunk]:
    """
    Extracts hierarchical structure from a document.
    patterns: regex patterns to detect heading levels
    """
    if patterns is None:
        patterns = {
            1: r'^# (.+)$',     # Main title
            2: r'^## (.+)$',    # Sections
            3: r'^### (.+)$',   # Subsections
            4: r'^#### (.+)$',  # Sub-subsections
        }

    chunks = []
    current_path = []
    parent_stack = []  # Stack of (level, chunk_id)

    current_content = []
    current_title = "Document"
    current_level = 0
    current_parent_id = None
    chunk_counter = 0

    for line in text.split('\n'):
        header_found = False
        for level, pattern in patterns.items():
            match = re.match(pattern, line)
            if match:
                # Close the section that just ended (headers with no
                # body still produce a chunk so parent ids stay valid)
                if current_content or current_level > 0:
                    chunks.append(HierarchicalChunk(
                        id=f"chunk_{chunk_counter}",
                        content='\n'.join(current_content).strip(),
                        level=current_level,
                        title=current_title,
                        parent_id=current_parent_id,
                        path=current_path.copy(),
                        children_ids=[]
                    ))
                    chunk_counter += 1

                # Unwind path and parent stack to the new header's level
                while parent_stack and parent_stack[-1][0] >= level:
                    parent_stack.pop()
                    if current_path:
                        current_path.pop()

                # The parent is whatever remains on top of the stack
                current_parent_id = parent_stack[-1][1] if parent_stack else None
                current_title = match.group(1)
                current_level = level
                current_content = []
                current_path.append(current_title)
                parent_stack.append((level, f"chunk_{chunk_counter}"))
                header_found = True
                break
        if not header_found:
            current_content.append(line)

    # Don't forget the last chunk
    if current_content or current_level > 0:
        chunks.append(HierarchicalChunk(
            id=f"chunk_{chunk_counter}",
            content='\n'.join(current_content).strip(),
            level=current_level,
            title=current_title,
            parent_id=current_parent_id,
            path=current_path.copy(),
            children_ids=[]
        ))

    # Backfill children links from the parent ids
    by_id = {c.id: c for c in chunks}
    for c in chunks:
        if c.parent_id and c.parent_id in by_id:
            by_id[c.parent_id].children_ids.append(c.id)

    return chunks
```
Multi-Level Indexing
```python
def index_hierarchical_chunks(chunks: List[HierarchicalChunk], vector_db):
    """
    Indexes chunks with their hierarchical context.
    """
    for chunk in chunks:
        # Create enriched context
        path_context = " > ".join(chunk.path)
        enriched_content = f"{path_context}\n\n{chunk.content}"

        # Generate embedding
        embedding = embed(enriched_content)

        # Store with metadata
        vector_db.upsert(
            id=chunk.id,
            embedding=embedding,
            metadata={
                "content": chunk.content,
                "title": chunk.title,
                "level": chunk.level,
                "parent_id": chunk.parent_id,
                "path": path_context,
                "path_list": chunk.path
            }
        )
```
Contextual Retrieval
Strategy: Small-to-Big
Search in fine chunks, return parent context:
```python
def hierarchical_retrieve(query: str, vector_db, k: int = 3) -> List[dict]:
    """
    Retrieves relevant chunks with their parent context.
    """
    # 1. Fine search (lowest levels)
    results = vector_db.query(
        query_embedding=embed(query),
        filter={"level": {"$gte": 3}},  # Subsections and below
        limit=k * 2
    )

    # 2. Enrich with parent context
    enriched_results = []
    seen_parents = set()

    for result in results:
        parent_id = result.metadata.get("parent_id")

        # Walk up the parent chain
        context_chain = [result.metadata["content"]]
        current_parent = parent_id
        while current_parent and current_parent not in seen_parents:
            parent = vector_db.get(current_parent)
            if parent:
                context_chain.insert(0, parent.metadata["content"])
                seen_parents.add(current_parent)
                current_parent = parent.metadata.get("parent_id")
            else:
                break

        enriched_results.append({
            "chunk": result,
            "full_context": "\n\n---\n\n".join(context_chain),
            "path": result.metadata["path"]
        })

    return enriched_results[:k]
```
Strategy: Big-to-Small
Search at section level, then drill-down:
```python
def drill_down_retrieve(query: str, vector_db, k: int = 3) -> List[dict]:
    """
    Start with sections, then refine to details.
    """
    # 1. Search at section level
    sections = vector_db.query(
        query_embedding=embed(query),
        filter={"level": 2},
        limit=k
    )

    # 2. For each relevant section, search for details
    detailed_results = []
    for section in sections:
        # Search the children of this section
        children = vector_db.query(
            query_embedding=embed(query),
            filter={"parent_id": section.id},
            limit=3
        )

        detailed_results.append({
            "section": section,
            "details": children,
            "combined_context": (
                section.metadata["content"] + "\n\n" +
                "\n".join([c.metadata["content"] for c in children])
            )
        })

    return detailed_results
```
LlamaIndex: Parent Document Retriever
LlamaIndex offers a native implementation. Note that all nodes go into the docstore, but only the leaf nodes are embedded and indexed; `AutoMergingRetriever` then merges leaf hits back into their parents:

```python
from llama_index import VectorStoreIndex, StorageContext
from llama_index.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.retrievers import AutoMergingRetriever
from llama_index.query_engine import RetrieverQueryEngine

# 1. Hierarchical parser
node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]  # Granularity levels
)

# 2. Create nodes at all levels, keep the leaves for search
nodes = node_parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)

# 3. Store every node, index only the leaves
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)
index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)

# 4. Retriever with auto-merging
retriever = AutoMergingRetriever(
    index.as_retriever(similarity_top_k=6),
    storage_context,
    verbose=True
)

# 5. Query engine
query_engine = RetrieverQueryEngine.from_args(retriever)
response = query_engine.query("What methods are used?")
```
LangChain: Parent Document Retriever
```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

# Splitters for different levels
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

# Store for parents
docstore = InMemoryStore()

# Vectorstore for children (fine search)
vectorstore = Chroma(embedding_function=embeddings)

# Parent Document Retriever
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Add documents
retriever.add_documents(documents)

# Search: children match, parents returned
results = retriever.get_relevant_documents("Question about methods")
```
Metadata Optimization
Enrich Semantic Path
```python
def create_semantic_path(chunk: HierarchicalChunk) -> str:
    """
    Creates a readable semantic path for the LLM.
    """
    path_parts = []
    for i, title in enumerate(chunk.path):
        level_prefix = {
            0: "Document:",
            1: "Chapter:",
            2: "Section:",
            3: "Subsection:",
            4: "Paragraph:"
        }.get(i, "")
        path_parts.append(f"{level_prefix} {title}")
    return " → ".join(path_parts)

# Example output:
# "Document: Technical Manual → Chapter: Installation → Section: Prerequisites"
```
Add Breadcrumbs to Context
```python
def format_context_with_breadcrumbs(chunks: List[dict]) -> str:
    """
    Formats context with breadcrumbs for the LLM.
    """
    formatted = []
    for chunk in chunks:
        breadcrumb = chunk['path']
        content = chunk['content']
        formatted.append(f"📍 {breadcrumb}\n\n{content}")
    return "\n\n---\n\n".join(formatted)
```
When to Use Hierarchical Chunking
Use it when:
- Long, structured documents (manuals, technical docs)
- Clear hierarchy (chapters, sections, subsections)
- Need both broad context AND fine precision
- Questions that span multiple levels
Avoid when:
- Flat documents (emails, chats, logs)
- Very homogeneous content
- Strict latency constraints (retrieval overhead)
- Very short documents (< 2000 tokens)
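The criteria above can be condensed into a quick heuristic for routing documents at ingestion time. This is a minimal sketch; the ~4-characters-per-token estimate, the 2000-token floor, and the heading regex are illustrative assumptions, not fixed rules:

```python
import re

def should_use_hierarchical_chunking(text: str, min_tokens: int = 2000) -> bool:
    """Rough heuristic: long AND visibly structured -> hierarchical."""
    # Approximate token count (~4 chars per token for English text)
    approx_tokens = len(text) / 4
    if approx_tokens < min_tokens:
        return False  # Very short documents: the overhead isn't worth it

    # Detect explicit structure: markdown headings or numbered sections
    headings = re.findall(r'^(#{1,4} |\d+(\.\d+)* )', text, flags=re.MULTILINE)
    return len(headings) >= 3  # Needs a real hierarchy, not a lone title
```

Flat content such as chat logs fails both checks and can fall back to fixed-size or semantic chunking.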
Benchmarks
| Document Type | Fixed Chunking | Semantic | Hierarchical |
|---|---|---|---|
| Technical docs | 65% | 72% | 88% |
| Structured reports | 58% | 68% | 85% |
| Scientific papers | 62% | 75% | 82% |
| Narrative text | 70% | 78% | 72% |
MRR@5 on internal test datasets
Hierarchical excels on structured documents but provides no gain on narrative content.
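To run the same comparison on your own corpus, MRR@k takes only a few lines. This sketch assumes each query is labeled with the id of its single relevant chunk; with multiple relevant chunks per query you would check membership in a set instead:

```python
from typing import List

def mrr_at_k(ranked_ids: List[List[str]], relevant_ids: List[str], k: int = 5) -> float:
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit per query."""
    total = 0.0
    for ranking, relevant in zip(ranked_ids, relevant_ids):
        for rank, chunk_id in enumerate(ranking[:k], start=1):
            if chunk_id == relevant:
                total += 1.0 / rank
                break  # Only the first relevant hit counts
    return total / len(ranked_ids)

# Two queries: relevant chunk at rank 1 and rank 2 -> (1 + 0.5) / 2 = 0.75
mrr_at_k([["a", "b"], ["x", "b"]], ["a", "b"])
```

Run it once per chunking strategy over the same query set and compare the scores.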
Related Guides
Chunking:
- Chunking Strategies - Overview of approaches
- Semantic Chunking - Meaning-based chunking
- Fixed-Size Chunking - Classic approach
Retrieval:
- Parent Document Retrieval - Retrieval with parent context
- Retrieval Strategies - Advanced techniques
Need help implementing hierarchical chunking on your complex documents? Let's discuss your project →
Related Articles
RAG Chunking Strategies 2025: Optimal Chunk Sizes & Techniques
Master document chunking for RAG: optimal chunk sizes (512-1024 tokens), overlap strategies, semantic vs fixed-size splitting. Improve retrieval by 25%+.
Semantic Chunking for Better Retrieval
Split documents intelligently based on meaning, not just length. Learn semantic chunking techniques for RAG.
Fixed-Size Chunking: Fast and Reliable
Master the basics: implement fixed-size chunking with overlaps for consistent, predictable RAG performance.