Advanced Parsing

Table Extraction and Processing for RAG

December 27, 2025
11 min read
Ailog Research Team

Tables contain critical structured data but are difficult to parse. Master table extraction and chunking techniques for RAG.

TL;DR

  • Tables often contain the most important info (prices, specs, comparisons)
  • Problem: standard parsers destroy the structure
  • Solutions: detection + specialized extraction + smart serialization
  • Tools: Unstructured, Camelot, Tabula, multimodal LLMs
  • Upload your PDFs with tables on Ailog

Why Tables Are Problematic

Typical example of table destruction:

Original PDF:
┌──────────┬─────────┬──────────┐
│ Product  │ Price   │ Stock    │
├──────────┼─────────┼──────────┤
│ Widget A │ $99     │ In stock │
│ Widget B │ $149    │ Sold out │
└──────────┴─────────┴──────────┘

After naive parsing:
"Product Price Stock Widget A $99 In stock Widget B $149 Sold out"

→ Structure lost, relationships broken
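
The failure is easy to reproduce. A minimal sketch with pypdf (the file name `catalog.pdf` is hypothetical, standing in for the PDF above):

```python
# Reproducing the failure mode with a plain-text extractor.
# "catalog.pdf" is a hypothetical file containing the table above.
from pypdf import PdfReader

reader = PdfReader("catalog.pdf")
print(reader.pages[0].extract_text())
# Cell values come back as one undifferentiated stream, e.g.:
# "Product Price Stock Widget A $99 In stock Widget B $149 Sold out"
```

Retrieval then has no way to know that $149 belongs to Widget B rather than Widget A.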

Table Detection

With Unstructured

```python
from unstructured.partition.pdf import partition_pdf

def extract_with_table_detection(pdf_path: str) -> dict:
    """Extracts PDF content with table detection."""
    elements = partition_pdf(
        filename=pdf_path,
        strategy="hi_res",  # Visual detection
        infer_table_structure=True,
        include_page_breaks=True
    )

    tables = []
    text_content = []

    for element in elements:
        if element.category == "Table":
            tables.append({
                "html": element.metadata.text_as_html,
                "text": element.text,
                "page": element.metadata.page_number
            })
        else:
            text_content.append(element.text)

    return {
        "tables": tables,
        "text": "\n".join(text_content)
    }
```
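
A quick usage sketch (file name hypothetical), showing where the structure survives — in the HTML representation, not the plain text:

```python
# Hypothetical usage: inspect what the hi_res strategy found.
result = extract_with_table_detection("report.pdf")
print(f"{len(result['tables'])} tables detected")
for t in result["tables"]:
    # The HTML preserves rows and columns; t["text"] is flat.
    print(f"Page {t['page']}: {t['html'][:120]}...")
```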

With Camelot (Native PDFs)

```python
import camelot

def extract_tables_camelot(pdf_path: str) -> list:
    """
    Table extraction with Camelot.
    Works well on native PDFs (not scanned).
    """
    # Lattice method for tables with borders
    tables = camelot.read_pdf(
        pdf_path,
        pages='all',
        flavor='lattice'  # or 'stream' for borderless
    )

    extracted = []
    for i, table in enumerate(tables):
        df = table.df
        extracted.append({
            "table_id": i,
            "page": table.page,
            "accuracy": table.accuracy,
            "dataframe": df,
            "html": df.to_html(),
            "markdown": df.to_markdown()
        })

    return extracted
```

Vision Detection (Multimodal LLM)

```python
import anthropic
import base64

def detect_tables_vision(image_path: str) -> dict:
    """Uses Claude Vision to detect and extract tables."""
    client = anthropic.Anthropic()

    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    response = client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data
                    }
                },
                {
                    "type": "text",
                    "text": """Extract all tables from this image. For each table:
1. Output as markdown table
2. Preserve headers
3. Keep all data exactly as shown

Format:
TABLE 1:
| Header1 | Header2 | ... |
|---------|---------|-----|
| data    | data    | ... |

TABLE 2:
..."""
                }
            ]
        }]
    )

    return {"extracted_tables": response.content[0].text}
```

Table Serialization

Markdown Format

```python
def table_to_markdown(df) -> str:
    """Converts DataFrame to clean markdown."""
    return df.to_markdown(index=False)

# Result:
# | Product  | Price | Stock    |
# |----------|-------|----------|
# | Widget A | $99   | In stock |
# | Widget B | $149  | Sold out |
```

Row-by-Row Format (Best for RAG)

```python
def table_to_row_format(df, table_context: str = "") -> list:
    """
    Converts each row to standalone text.
    Each row becomes an autonomous chunk.
    """
    headers = df.columns.tolist()
    rows_as_text = []

    for _, row in df.iterrows():
        row_text = "; ".join([
            f"{header}: {value}"
            for header, value in zip(headers, row.values)
        ])
        if table_context:
            row_text = f"{table_context} - {row_text}"
        rows_as_text.append(row_text)

    return rows_as_text

# Result:
# ["Product Catalog - Product: Widget A; Price: $99; Stock: In stock",
#  "Product Catalog - Product: Widget B; Price: $149; Stock: Sold out"]
```

Q&A Format (Optimal for Retrieval)

```python
import pandas as pd

def table_to_qa_pairs(df, table_title: str) -> list:
    """
    Generates Q&A pairs from the table.
    Significantly improves retrieval.
    """
    headers = df.columns.tolist()
    qa_pairs = []

    for _, row in df.iterrows():
        # Identify key column (often the first)
        key_col = headers[0]
        key_val = row[key_col]

        for header in headers[1:]:
            value = row[header]
            if pd.notna(value) and str(value).strip():
                qa_pairs.append({
                    "question": f"What is the {header.lower()} of {key_val}?",
                    "answer": f"The {header.lower()} of {key_val} is {value}.",
                    "source": table_title
                })

    return qa_pairs

# Result:
# [{"question": "What is the price of Widget A?",
#   "answer": "The price of Widget A is $99.",
#   "source": "Product Catalog"},
#  {"question": "What is the stock of Widget A?",
#   "answer": "The stock of Widget A is In stock.",
#   "source": "Product Catalog"}]
```

Table Chunking

Small Tables (< 20 rows)

Keep the entire table as a single chunk:

```python
def chunk_small_table(df, metadata: dict) -> dict:
    """Small table = single chunk with context."""
    markdown = df.to_markdown(index=False)
    chunk = {
        "content": f"**{metadata['title']}**\n\n{markdown}",
        "metadata": {
            "type": "table",
            "rows": len(df),
            "columns": list(df.columns),
            **metadata
        }
    }
    return chunk
```

Medium Tables (20-100 rows)

Chunking by row groups with overlap:

```python
def chunk_medium_table(
    df,
    metadata: dict,
    rows_per_chunk: int = 10,
    overlap: int = 2
) -> list:
    """Chunk by row groups with repeated headers."""
    chunks = []
    headers = df.columns.tolist()
    header_row = "| " + " | ".join(headers) + " |"
    separator = "| " + " | ".join(["---"] * len(headers)) + " |"

    for i in range(0, len(df), rows_per_chunk - overlap):
        subset = df.iloc[i:i + rows_per_chunk]
        if len(subset) == 0:
            continue

        rows_md = subset.to_markdown(index=False).split('\n')[2:]  # Skip header
        chunk_md = (
            f"**{metadata['title']}** (rows {i+1}-{i+len(subset)})\n\n"
            f"{header_row}\n{separator}\n" + "\n".join(rows_md)
        )
        chunks.append({
            "content": chunk_md,
            "metadata": {
                "type": "table_chunk",
                "start_row": i + 1,
                "end_row": i + len(subset),
                **metadata
            }
        })

    return chunks
```
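
To see the overlap at work, here is a small synthetic check (assumed setup, not from a real document). With `rows_per_chunk=10` and `overlap=2`, the window advances by 8 rows, so the last two rows of each chunk reappear at the top of the next:

```python
import pandas as pd

# Synthetic 25-row table to illustrate the stride of rows_per_chunk - overlap.
df = pd.DataFrame({
    "Product": [f"Item {i}" for i in range(1, 26)],
    "Price": [f"${i * 10}" for i in range(1, 26)],
})
for c in chunk_medium_table(df, {"title": "Demo Catalog"}):
    print(f"{c['metadata']['start_row']}-{c['metadata']['end_row']}")
# 1-10, 9-18, 17-25, 25-25
# Rows 9-10 and 17-18 are duplicated across chunk boundaries; the trailing
# single-row chunk is a stride artifact you may want to merge in production.
```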

Large Tables (> 100 rows)

Convert to row-by-row format:

```python
def chunk_large_table(df, metadata: dict) -> list:
    """Large tables: each row becomes a chunk."""
    return [
        {
            "content": table_to_row_format(df.iloc[[i]], metadata['title'])[0],
            "metadata": {
                "type": "table_row",
                "row_index": i + 1,
                "primary_key": str(df.iloc[i, 0]),  # First column as key
                **metadata
            }
        }
        for i in range(len(df))
    ]
```

Context Enrichment

Add Surrounding Context

```python
import json

def enrich_table_context(
    table_html: str,
    surrounding_text: str,
    llm_client
) -> dict:
    """Uses an LLM to enrich table context."""
    prompt = f"""Analyze this table and its surrounding context.

Surrounding text: {surrounding_text[:500]}

Table (HTML):
{table_html}

Generate:
1. A descriptive title for the table
2. A one-sentence summary of what the table shows
3. The key columns and what they represent

Output as JSON:
{{"title": "...", "summary": "...", "key_columns": [{{"name": "...", "description": "..."}}]}}"""

    result = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )

    return json.loads(result.choices[0].message.content)
```

Generate Summaries

```python
def summarize_table(df, llm_client) -> str:
    """Generates a textual summary of the table."""
    # Basic stats
    stats = {
        "rows": len(df),
        "columns": list(df.columns),
        "sample": df.head(3).to_markdown()
    }

    prompt = f"""Summarize this table in 2-3 sentences.

Columns: {stats['columns']}
Rows: {stats['rows']}
Sample:
{stats['sample']}

Summary:"""

    result = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,
        temperature=0
    )

    return result.choices[0].message.content.strip()
```
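
One common way to use such a summary — our assumption here, not something the pipeline below does — is the summary-as-proxy pattern: embed the short summary for retrieval and carry the full table in metadata, so the generator sees the real data when the summary matches:

```python
# Hedged sketch: index the summary as a retrieval proxy for the full table.
# vector_db and its upsert() signature are placeholders, as in the usage below.
summary = summarize_table(df, llm_client)
vector_db.upsert({
    "content": summary,  # what gets embedded and matched at query time
    "metadata": {
        "type": "table_summary",
        # Handed to the LLM at answer time instead of the summary.
        "table_markdown": df.to_markdown(index=False),
    },
})
```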

Complete Pipeline

```python
import pandas as pd

class TableProcessor:
    def __init__(self, llm_client=None):
        self.llm = llm_client

    def process_document(self, pdf_path: str) -> dict:
        """Complete table extraction and chunking pipeline."""
        # 1. Extraction
        raw = extract_with_table_detection(pdf_path)

        processed_tables = []
        for i, table in enumerate(raw["tables"]):
            # 2. Convert to DataFrame
            df = self._html_to_df(table["html"])
            if df is None or df.empty:
                continue

            # 3. Enrich context
            if self.llm:
                context = enrich_table_context(
                    table["html"], raw["text"][:500], self.llm
                )
            else:
                context = {"title": f"Table {i+1}", "summary": ""}

            # 4. Chunking based on size
            if len(df) <= 20:
                chunks = [chunk_small_table(df, context)]
            elif len(df) <= 100:
                chunks = chunk_medium_table(df, context)
            else:
                chunks = chunk_large_table(df, context)

            # 5. Also generate Q&A pairs
            qa_pairs = table_to_qa_pairs(df, context["title"])

            processed_tables.append({
                "table_id": i,
                "metadata": context,
                "chunks": chunks,
                "qa_pairs": qa_pairs,
                "row_count": len(df)
            })

        return {
            "text_chunks": self._chunk_text(raw["text"]),
            "table_chunks": processed_tables,
            "stats": {
                "tables_found": len(raw["tables"]),
                "tables_processed": len(processed_tables)
            }
        }

    def _html_to_df(self, html: str):
        """Converts HTML to a DataFrame."""
        try:
            dfs = pd.read_html(html)
            return dfs[0] if dfs else None
        except Exception:
            return None

    def _chunk_text(self, text: str) -> list:
        """Chunks standard text."""
        # Standard chunking implementation
        pass

# Usage
processor = TableProcessor(llm_client=openai_client)
result = processor.process_document("report.pdf")

# Index chunks
for table in result["table_chunks"]:
    for chunk in table["chunks"]:
        vector_db.upsert(chunk)

    # Bonus: index Q&A pairs for better retrieval
    for qa in table["qa_pairs"]:
        vector_db.upsert({
            "content": f"Q: {qa['question']}\nA: {qa['answer']}",
            "metadata": {"type": "table_qa", "source": qa["source"]}
        })
```

Benchmarks

| Method            | Accuracy | Complex Tables | Latency |
|-------------------|----------|----------------|---------|
| PyPDF2            | 20%      | Fails          | 50 ms   |
| Camelot (lattice) | 85%      | Good           | 200 ms  |
| Unstructured      | 80%      | Medium         | 500 ms  |
| Claude Vision     | 95%      | Excellent      | 2 s     |
| GPT-4o Vision     | 93%      | Excellent      | 1.5 s   |
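
These numbers suggest a routing strategy rather than a single winner: start with the cheap extractor and escalate only when its confidence is low. A sketch using Camelot's per-table accuracy score as the gate (the threshold and the pre-rendered page image are assumptions):

```python
# Hedged sketch: accuracy-gated fallback from Camelot to a vision model.
# image_path must point to a pre-rendered image of the page (e.g. via pdf2image).
def extract_with_fallback(pdf_path: str, image_path: str,
                          min_accuracy: float = 80.0) -> dict:
    tables = extract_tables_camelot(pdf_path)  # fast and cheap
    if tables and all(t["accuracy"] >= min_accuracy for t in tables):
        return {"method": "camelot", "tables": tables}
    # Low confidence or nothing found: pay the vision model's latency cost.
    return {"method": "vision", "tables": detect_tables_vision(image_path)}
```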


Do your documents contain complex tables? Let's analyze the best strategy together →

Tags

parsing · tables · extraction · pdf · structured data
