
Hierarchical Chunking: Preserving Document Structure

December 27, 2025
11 min read
Ailog Research Team

Hierarchical chunking maintains parent-child relationships in your documents. Learn how to implement this advanced technique to improve RAG retrieval quality.

TL;DR

  • Hierarchical chunking = preserve sections, subsections, and paragraphs
  • Benefit: rich context + fine granularity simultaneously
  • Implementation: nested chunks with hierarchy metadata
  • Typical gain: +20-35% relevance on structured documents
  • Test hierarchical chunking on Ailog

Why Hierarchical Chunking?

Real documents have structure:

  • Chapters > Sections > Subsections > Paragraphs
  • This hierarchy carries semantic meaning

Classic chunking (fixed-size or semantic) ignores this structure:

Original document:
├── Chapter 1: Introduction
│   ├── 1.1 Background
│   └── 1.2 Objectives
└── Chapter 2: Methods
    ├── 2.1 Approach A
    └── 2.2 Approach B

Classic chunking:
[Chunk 1: "...end of background. 1.2 Objectives..."]  ❌ Mixed sections
[Chunk 2: "...beginning of methods..."]               ❌ Lost hierarchy
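To make the problem concrete, here is a minimal sketch of a fixed-size chunker run over a toy version of such a document. The `fixed_size_chunks` helper, the sample text, and the 60-character window are illustrative assumptions, not part of any library. With this window, one chunk mixes the end of section 1.1 with the start of section 1.2, and the last chunk carries no heading at all.

```python
from typing import List


def fixed_size_chunks(text: str, size: int) -> List[str]:
    """Cut text into consecutive windows of `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical sample document with markdown-style headings
doc = (
    "# Chapter 1: Introduction\n"
    "## 1.1 Background\n"
    "Prior work on retrieval is summarized here.\n"
    "## 1.2 Objectives\n"
    "We aim to improve answer relevance.\n"
)

chunks = fixed_size_chunks(doc, size=60)

# Inspect where the cuts fall relative to the headings
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
```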

Hierarchical Chunking Principle

Create chunks at multiple levels with parent-child links:

```python
# Preserved hierarchical structure
{
    "id": "doc1_ch2_s1",
    "content": "2.1 Approach A - Detailed description...",
    "metadata": {
        "level": 3,
        "parent_id": "doc1_ch2",
        "path": ["Chapter 2: Methods", "2.1 Approach A"],
        "document_id": "doc1"
    }
}
```
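Because each chunk stores only its `parent_id`, any ancestor can be recovered by walking the links upward. A minimal sketch, assuming chunks are kept in a plain dict keyed by `id`; the `ancestors` helper and the sample records (including the `doc1` root and its title) are hypothetical.

```python
from typing import Dict, List

# Hypothetical store: chunk id -> minimal record
chunks: Dict[str, dict] = {
    "doc1": {"title": "Technical Report", "parent_id": None},
    "doc1_ch2": {"title": "Chapter 2: Methods", "parent_id": "doc1"},
    "doc1_ch2_s1": {"title": "2.1 Approach A", "parent_id": "doc1_ch2"},
}


def ancestors(chunk_id: str, store: Dict[str, dict]) -> List[str]:
    """Return the ids from the root down to the direct parent of chunk_id."""
    chain: List[str] = []
    seen = {chunk_id}  # guards against accidental cycles in the links
    parent_id = store[chunk_id]["parent_id"]
    while parent_id is not None and parent_id not in seen:
        chain.append(parent_id)
        seen.add(parent_id)
        parent_id = store[parent_id]["parent_id"]
    return chain[::-1]


print(ancestors("doc1_ch2_s1", chunks))  # ['doc1', 'doc1_ch2']
```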

Python Implementation

Hierarchy Extraction

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class HierarchicalChunk:
    id: str
    content: str
    level: int
    title: str
    parent_id: Optional[str]
    path: List[str]
    children_ids: List[str] = field(default_factory=list)


def extract_hierarchy(
    text: str, patterns: Optional[Dict[int, str]] = None
) -> List[HierarchicalChunk]:
    """
    Extracts hierarchical structure from a document.

    patterns: maps a heading level to the regex that detects it
    """
    if patterns is None:
        patterns = {
            1: r'^# (.+)$',     # Main title
            2: r'^## (.+)$',    # Sections
            3: r'^### (.+)$',   # Subsections
            4: r'^#### (.+)$',  # Sub-subsections
        }

    chunks: List[HierarchicalChunk] = []
    stack: List[HierarchicalChunk] = []  # Open headings, outermost first
    current: Optional[HierarchicalChunk] = None
    buffer: List[str] = []

    def flush() -> None:
        """Attach buffered lines to the chunk currently being filled."""
        nonlocal current
        content = '\n'.join(buffer).strip()
        buffer.clear()
        if current is None:
            if not content:
                return
            # Text before the first header becomes a level-0 "Document" chunk
            current = HierarchicalChunk(
                id=f"chunk_{len(chunks)}", content="", level=0,
                title="Document", parent_id=None, path=[],
            )
            chunks.append(current)
        current.content = content

    for line in text.split('\n'):
        header = None
        for level, pattern in patterns.items():
            match = re.match(pattern, line)
            if match:
                header = (level, match.group(1))
                break

        if header is None:
            buffer.append(line)
            continue

        # Save previous chunk
        flush()

        # Update hierarchy: close headings at the same or a deeper level
        level, title = header
        while stack and stack[-1].level >= level:
            stack.pop()
        parent = stack[-1] if stack else None

        # Create the chunk as soon as its header is seen, so that ids stay
        # stable and header-only sections still exist as parents
        current = HierarchicalChunk(
            id=f"chunk_{len(chunks)}",
            content="",
            level=level,
            title=title,
            parent_id=parent.id if parent else None,
            path=[c.title for c in stack] + [title],
        )
        if parent:
            parent.children_ids.append(current.id)
        chunks.append(current)
        stack.append(current)

    # Don't forget last chunk
    flush()
    return chunks
```
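Header-based extraction stops at the deepest heading, so the paragraph level of the hierarchy needs a second pass. One possible approach is sketched here under the assumption that paragraphs are separated by blank lines; `split_into_paragraphs` and its id scheme are hypothetical and do not depend on the extractor above.

```python
import re
from typing import List


def split_into_paragraphs(
    section_id: str, content: str, level: int, path: List[str]
) -> List[dict]:
    """Turn one section's text into paragraph chunks that point back to it."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", content) if p.strip()]
    return [
        {
            "id": f"{section_id}_p{i}",
            "content": paragraph,
            "level": level + 1,      # One level below the section
            "parent_id": section_id,
            "path": list(path),      # Paragraphs inherit the section's path
        }
        for i, paragraph in enumerate(paragraphs)
    ]


children = split_into_paragraphs(
    "chunk_3",
    "First paragraph.\n\nSecond paragraph,\nstill the second.\n\n",
    level=3,
    path=["Chapter 2: Methods", "2.1 Approach A"],
)
```

Each paragraph chunk can then be indexed like any other chunk, with the section acting as its parent.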

Multi-Level Indexing

```python
def index_hierarchical_chunks(chunks: List[HierarchicalChunk], vector_db):
    """
    Indexes chunks with their hierarchical context.

    `embed` and `vector_db` are placeholders for your embedding function
    and vector store client; adapt the calls to your stack.
    """
    for chunk in chunks:
        # Create enriched context
        path_context = " > ".join(chunk.path)
        enriched_content = f"{path_context}\n\n{chunk.content}"

        # Generate embedding
        embedding = embed(enriched_content)

        # Store with metadata
        vector_db.upsert(
            id=chunk.id,
            embedding=embedding,
            metadata={
                "content": chunk.content,
                "title": chunk.title,
                "level": chunk.level,
                "parent_id": chunk.parent_id,
                "path": path_context,
                "path_list": chunk.path
            }
        )
```
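The snippet above leaves `embed` and `vector_db` undefined. To experiment without an external service, toy stand-ins along these lines can be used. Both are assumptions for illustration: `embed` is a hashed bag-of-words vector, not a real embedding model, and `InMemoryVectorDB` only mimics the `upsert`/`get`/`query` calls and the filter syntax used in this article.

```python
import math
import zlib
from dataclasses import dataclass
from typing import Dict, List, Optional


def embed(text: str, dim: int = 64) -> List[float]:
    """Toy embedding: hashed bag of words, L2-normalized."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


@dataclass
class Record:
    id: str
    embedding: List[float]
    metadata: dict


class InMemoryVectorDB:
    def __init__(self) -> None:
        self._records: Dict[str, Record] = {}

    def upsert(self, id: str, embedding: List[float], metadata: dict) -> None:
        self._records[id] = Record(id, embedding, metadata)

    def get(self, id: str) -> Optional[Record]:
        return self._records.get(id)

    @staticmethod
    def _matches(metadata: dict, flt: Optional[dict]) -> bool:
        """Supports {"key": value} and {"key": {"$gte": value}} only."""
        for key, cond in (flt or {}).items():
            value = metadata.get(key)
            if isinstance(cond, dict):
                if value is None or value < cond["$gte"]:
                    return False
            elif value != cond:
                return False
        return True

    def query(self, query_embedding: List[float],
              filter: Optional[dict] = None, limit: int = 5) -> List[Record]:
        candidates = [r for r in self._records.values()
                      if self._matches(r.metadata, filter)]
        # Embeddings are normalized, so the dot product is the cosine similarity
        candidates.sort(
            key=lambda r: sum(a * b for a, b in zip(query_embedding, r.embedding)),
            reverse=True,
        )
        return candidates[:limit]


db = InMemoryVectorDB()
db.upsert("sec", embed("Installation overview"), {"level": 2, "parent_id": None})
db.upsert("sub", embed("install the package with pip"), {"level": 3, "parent_id": "sec"})
db.upsert("other", embed("configure logging output"), {"level": 3, "parent_id": "sec"})

hits = db.query(embed("install the package with pip"),
                filter={"level": {"$gte": 3}}, limit=2)
```

Swapping in a real embedding model and vector store only requires keeping the same three method names, or adapting the calls in the retrieval functions.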

Contextual Retrieval

Strategy: Small-to-Big

Search in fine chunks, return parent context:

```python
def hierarchical_retrieve(query: str, vector_db, k: int = 3) -> List[dict]:
    """
    Retrieves relevant chunks with their parent context.
    """
    # 1. Fine search (lowest level)
    results = vector_db.query(
        query_embedding=embed(query),
        filter={"level": {"$gte": 3}},  # Subsections and below
        limit=k * 2
    )

    # 2. Enrich with parent context
    enriched_results = []
    # Shared across results: a parent already attached to an earlier result
    # is not repeated, which also guards against cycles in the parent links
    seen_parents = set()

    for result in results:
        parent_id = result.metadata.get("parent_id")

        # Retrieve parent chain (root first, matched chunk last)
        context_chain = [result.metadata["content"]]
        current_parent = parent_id

        while current_parent and current_parent not in seen_parents:
            parent = vector_db.get(current_parent)
            if parent:
                context_chain.insert(0, parent.metadata["content"])
                seen_parents.add(current_parent)
                current_parent = parent.metadata.get("parent_id")
            else:
                break

        enriched_results.append({
            "chunk": result,
            "full_context": "\n\n---\n\n".join(context_chain),
            "path": result.metadata["path"]
        })

    return enriched_results[:k]
```
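On deep documents, returning the whole parent chain can exceed the prompt budget. One way to cap it is to always keep the matched chunk and add ancestors nearest-first while they fit. The `build_context` helper and its character budget are assumptions for illustration; a token-based budget would follow the same pattern, and separators are not counted here.

```python
from typing import List


def build_context(chain: List[str], max_chars: int) -> str:
    """
    chain is ordered root -> ... -> matched chunk and must not be empty.
    The matched chunk is always kept; ancestors are added nearest-first
    while they still fit in the budget.
    """
    *ancestors, matched = chain
    kept = [matched]
    used = len(matched)
    for text in reversed(ancestors):  # Nearest parent first
        if used + len(text) > max_chars:
            break
        kept.insert(0, text)
        used += len(text)
    return "\n\n---\n\n".join(kept)


context = build_context(["A" * 50, "B" * 20, "C" * 10], max_chars=35)
```

In this example the direct parent fits in the budget and is kept, while the 50-character root is dropped.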

Strategy: Big-to-Small

Search at section level, then drill down:

```python
def drill_down_retrieve(query: str, vector_db, k: int = 3) -> List[dict]:
    """
    Start with sections, then refine to details.
    """
    # Embed the query once and reuse it for every search
    query_embedding = embed(query)

    # 1. Search at section level
    sections = vector_db.query(
        query_embedding=query_embedding,
        filter={"level": 2},
        limit=k
    )

    # 2. For each relevant section, search for details
    detailed_results = []

    for section in sections:
        # Search children of this section
        children = vector_db.query(
            query_embedding=query_embedding,
            filter={
                "parent_id": section.id
            },
            limit=3
        )

        detailed_results.append({
            "section": section,
            "details": children,
            "combined_context": (
                section.metadata["content"] + "\n\n" +
                "\n".join([c.metadata["content"] for c in children])
            )
        })

    return detailed_results
```

LlamaIndex: Parent Document Retriever

LlamaIndex offers a native implementation through its hierarchical node parser and auto-merging retriever:

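```python
# Import paths for llama-index >= 0.10; older releases import from `llama_index` directly
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import AutoMergingRetriever
from llama_index.core.storage.docstore import SimpleDocumentStore

# 1. Hierarchical parser
node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]  # Granularity levels
)

# 2. Create nodes (`documents` is your list of loaded Document objects)
nodes = node_parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)

# 3. Keep every node in the docstore so parents can be looked up,
#    but index only the leaf nodes
docstore = SimpleDocumentStore()
docstore.add_documents(nodes)
storage_context = StorageContext.from_defaults(docstore=docstore)
index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)

# 4. Retriever with auto-merging
retriever = AutoMergingRetriever(
    index.as_retriever(similarity_top_k=6),
    storage_context,
    verbose=True
)

# 5. Query engine
query_engine = RetrieverQueryEngine.from_args(retriever)
response = query_engine.query("What methods are used?")
```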
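The idea behind auto-merging can be illustrated without the library: when enough of a parent's children are retrieved, they are replaced by the parent itself. The sketch below is a simplified, single-pass illustration of that idea, not LlamaIndex's actual implementation; `auto_merge`, its arguments, and the 0.5 threshold are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def auto_merge(
    retrieved_ids: List[str],
    parent_of: Dict[str, str],
    children_of: Dict[str, List[str]],
    threshold: float = 0.5,
) -> List[str]:
    """Replace groups of retrieved siblings with their parent when they
    cover more than `threshold` of that parent's children."""
    hits_per_parent: Dict[str, List[str]] = defaultdict(list)
    for chunk_id in retrieved_ids:
        parent = parent_of.get(chunk_id)
        if parent is not None:
            hits_per_parent[parent].append(chunk_id)

    merged_parents = {
        parent
        for parent, hits in hits_per_parent.items()
        if len(hits) / len(children_of[parent]) > threshold
    }

    result: List[str] = []
    for chunk_id in retrieved_ids:  # Keep the original ranking order
        parent = parent_of.get(chunk_id)
        target = parent if parent in merged_parents else chunk_id
        if target not in result:
            result.append(target)
    return result


# Hypothetical ids: section s1 has 3 paragraphs, section s2 has 4
parent_of = {"s1_p0": "s1", "s1_p1": "s1", "s1_p2": "s1", "s2_p0": "s2"}
children_of = {
    "s1": ["s1_p0", "s1_p1", "s1_p2"],
    "s2": ["s2_p0", "s2_p1", "s2_p2", "s2_p3"],
}
merged = auto_merge(["s1_p0", "s2_p0", "s1_p2"], parent_of, children_of)
print(merged)  # ['s1', 's2_p0']
```

Two of the three children of `s1` were retrieved, so they collapse into `s1`; the single hit under `s2` stays as a fine-grained chunk.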

LangChain: Parent Document Retriever

```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Splitters for different levels
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

# Store for parents
docstore = InMemoryStore()

# Vectorstore for children (fine search);
# `embeddings` is your embedding model instance
vectorstore = Chroma(embedding_function=embeddings)

# Parent Document Retriever
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Add documents (`documents` is your list of loaded Document objects)
retriever.add_documents(documents)

# Search: children match, parents returned
results = retriever.invoke("Question about methods")
```

Metadata Optimization

Enrich Semantic Path

```python
# Prefix is chosen by position in the path, not by chunk.level
LEVEL_PREFIXES = {
    0: "Document:",
    1: "Chapter:",
    2: "Section:",
    3: "Subsection:",
    4: "Paragraph:"
}


def create_semantic_path(chunk: HierarchicalChunk) -> str:
    """
    Creates a readable semantic path for the LLM.
    """
    path_parts = []

    for i, title in enumerate(chunk.path):
        level_prefix = LEVEL_PREFIXES.get(i, "")
        # strip() avoids a leading space when a depth has no prefix
        path_parts.append(f"{level_prefix} {title}".strip())

    return " → ".join(path_parts)

# Example output:
# "Document: Technical Manual → Chapter: Installation → Section: Prerequisites"
```

Add Breadcrumbs to Context

```python
def format_context_with_breadcrumbs(chunks: List[dict]) -> str:
    """
    Formats context with breadcrumbs for the LLM.

    Each dict must provide 'path' (the breadcrumb string) and 'content'.
    """
    formatted = []

    for chunk in chunks:
        breadcrumb = chunk['path']
        content = chunk['content']

        formatted.append(f"""
📍 {breadcrumb}

{content}
""")

    return "\n---\n".join(formatted)
```

When to Use Hierarchical Chunking

Use it when:

  • Long, structured documents (manuals, technical docs)
  • Clear hierarchy (chapters, sections, subsections)
  • Need both broad context AND fine precision
  • Questions that span multiple levels

Avoid when:

  • Flat documents (emails, chats, logs)
  • Very homogeneous content
  • Strict latency constraints (retrieval overhead)
  • Very short documents (< 2000 tokens)
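The criteria above can be turned into a rough pre-check before choosing a chunking strategy. This is only a heuristic sketch: the `should_use_hierarchical` name, the markdown-heading test, the four-characters-per-token estimate, and the minimum of two heading levels are all assumptions; only the 2000-token threshold comes from the list above.

```python
import re


def should_use_hierarchical(
    text: str, min_tokens: int = 2000, min_levels: int = 2
) -> bool:
    """Heuristic: the document is long enough and has multi-level headings."""
    approx_tokens = len(text) / 4  # Crude estimate, roughly 4 characters per token
    heading_levels = {
        len(m.group(1))
        for m in re.finditer(r"^(#{1,6}) \S", text, re.MULTILINE)
    }
    return approx_tokens >= min_tokens and len(heading_levels) >= min_levels


flat_log = "2025-01-01 INFO request served\n" * 500
manual = "# Manual\n" + "word " * 2000 + "\n## Installation\n" + "word " * 2000

print(should_use_hierarchical(flat_log))  # False: long but no headings
print(should_use_hierarchical(manual))    # True: long with two heading levels
```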

Benchmarks

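| Document Type      | Fixed Chunking | Semantic | Hierarchical |
|--------------------|----------------|----------|--------------|
| Technical docs     | 65%            | 72%      | 88%          |
| Structured reports | 58%            | 68%      | 85%          |
| Scientific papers  | 62%            | 75%      | 82%          |
| Narrative text     | 70%            | 78%      | 72%          |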

MRR@5 on internal test datasets

Hierarchical chunking excels on structured documents but brings little on narrative content, where it scores below semantic chunking (72% vs. 78%).
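For readers reproducing this kind of comparison, MRR@5 averages, over all queries, the reciprocal rank of the first relevant chunk within the top 5 results (0 if none appears). A minimal sketch follows; the `mrr_at_k` helper and the sample data are illustrative, not the evaluation code behind the table.

```python
from typing import List, Set


def mrr_at_k(
    ranked_ids: List[List[str]], relevant_ids: List[Set[str]], k: int = 5
) -> float:
    """Mean reciprocal rank of the first relevant result within the top k."""
    total = 0.0
    for ranking, relevant in zip(ranked_ids, relevant_ids):
        for rank, chunk_id in enumerate(ranking[:k], start=1):
            if chunk_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_ids) if ranked_ids else 0.0


# Three hypothetical queries: hit at rank 2, no hit, hit at rank 1
rankings = [["c3", "c1", "c9"], ["c4", "c7"], ["c2", "c8", "c5"]]
relevant = [{"c1"}, {"c0"}, {"c2"}]
print(mrr_at_k(rankings, relevant))  # (1/2 + 0 + 1) / 3 = 0.5
```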

