Table Extraction and Processing for RAG
Tables contain critical structured data but are difficult to parse. Master table extraction and chunking techniques for RAG.
TL;DR
- Tables often contain the most important info (prices, specs, comparisons)
- Problem: standard parsers destroy the structure
- Solutions: detection + specialized extraction + smart serialization
- Tools: Unstructured, Camelot, Tabula, multimodal LLMs
- Upload your PDFs with tables on Ailog
Why Tables Are Problematic
Typical example of table destruction:
Original PDF:
┌──────────┬─────────┬──────────┐
│ Product │ Price │ Stock │
├──────────┼─────────┼──────────┤
│ Widget A │ $99 │ In stock │
│ Widget B │ $149 │ Sold out │
└──────────┴─────────┴──────────┘
After naive parsing:
"Product Price Stock Widget A $99 In stock Widget B $149 Sold out"
→ Structure lost, relationships broken
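The loss above is easy to reproduce: any extractor that joins cells on whitespace flattens the table, while a structure-preserving serializer keeps headers and rows aligned. A minimal, self-contained sketch (the rows are hard-coded to mirror the example):

```python
# The same table, serialized naively vs. with structure preserved
rows = [
    ["Product", "Price", "Stock"],
    ["Widget A", "$99", "In stock"],
    ["Widget B", "$149", "Sold out"],
]

# Naive parsing: every cell is joined into one flat string,
# so "In stock" no longer belongs to any product
naive = " ".join(cell for row in rows for cell in row)

# Structure-preserving serialization: markdown keeps the
# header/value relationships explicit
header, *data = rows
markdown = "\n".join(
    ["| " + " | ".join(header) + " |",
     "| " + " | ".join(["---"] * len(header)) + " |"]
    + ["| " + " | ".join(row) + " |" for row in data]
)

print(naive)     # the flattened string shown above
print(markdown)  # rows stay attached to their headers
```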
Table Detection
With Unstructured
```python
from unstructured.partition.pdf import partition_pdf

def extract_with_table_detection(pdf_path: str) -> dict:
    """
    Extracts PDF content with table detection.
    """
    elements = partition_pdf(
        filename=pdf_path,
        strategy="hi_res",  # Visual detection
        infer_table_structure=True,
        include_page_breaks=True
    )

    tables = []
    text_content = []

    for element in elements:
        if element.category == "Table":
            tables.append({
                "html": element.metadata.text_as_html,
                "text": element.text,
                "page": element.metadata.page_number
            })
        else:
            text_content.append(element.text)

    return {
        "tables": tables,
        "text": "\n".join(text_content)
    }
```
With Camelot (Native PDFs)
```python
import camelot

def extract_tables_camelot(pdf_path: str) -> list:
    """
    Table extraction with Camelot.
    Works well on native PDFs (not scanned).
    """
    # Lattice method for tables with borders
    tables = camelot.read_pdf(
        pdf_path,
        pages='all',
        flavor='lattice'  # or 'stream' for borderless
    )

    extracted = []
    for i, table in enumerate(tables):
        df = table.df
        extracted.append({
            "table_id": i,
            "page": table.page,
            "accuracy": table.accuracy,
            "dataframe": df,
            "html": df.to_html(),
            "markdown": df.to_markdown()
        })

    return extracted
```
Vision Detection (Multimodal LLM)
```python
import anthropic
import base64

def detect_tables_vision(image_path: str) -> dict:
    """
    Uses Claude Vision to detect and extract tables.
    """
    client = anthropic.Anthropic()

    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    response = client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data
                    }
                },
                {
                    "type": "text",
                    "text": """Extract all tables from this image. For each table:
1. Output as markdown table
2. Preserve headers
3. Keep all data exactly as shown

Format:
TABLE 1:
| Header1 | Header2 | ... |
|---------|---------|-----|
| data    | data    | ... |

TABLE 2:
..."""
                }
            ]
        }]
    )

    return {
        "extracted_tables": response.content[0].text
    }
```
Table Serialization
Markdown Format
```python
def table_to_markdown(df) -> str:
    """
    Converts DataFrame to clean markdown.
    """
    return df.to_markdown(index=False)

# Result:
# | Product  | Price | Stock    |
# |----------|-------|----------|
# | Widget A | $99   | In stock |
# | Widget B | $149  | Sold out |
```
Row-by-Row Format (Best for RAG)
```python
def table_to_row_format(df, table_context: str = "") -> list:
    """
    Converts each row to standalone text.
    Each row becomes an autonomous chunk.
    """
    headers = df.columns.tolist()
    rows_as_text = []

    for _, row in df.iterrows():
        row_text = "; ".join([
            f"{header}: {value}"
            for header, value in zip(headers, row.values)
        ])
        if table_context:
            row_text = f"{table_context} - {row_text}"
        rows_as_text.append(row_text)

    return rows_as_text

# Result:
# ["Product Catalog - Product: Widget A; Price: $99; Stock: In stock",
#  "Product Catalog - Product: Widget B; Price: $149; Stock: Sold out"]
```
Q&A Format (Optimal for Retrieval)
```python
import pandas as pd

def table_to_qa_pairs(df, table_title: str) -> list:
    """
    Generates Q&A pairs from the table.
    Significantly improves retrieval.
    """
    headers = df.columns.tolist()
    qa_pairs = []

    for _, row in df.iterrows():
        # Identify key column (often the first)
        key_col = headers[0]
        key_val = row[key_col]

        for header in headers[1:]:
            value = row[header]
            if pd.notna(value) and str(value).strip():
                qa_pairs.append({
                    "question": f"What is the {header.lower()} of {key_val}?",
                    "answer": f"The {header.lower()} of {key_val} is {value}.",
                    "source": table_title
                })

    return qa_pairs

# Result:
# [{"question": "What is the price of Widget A?",
#   "answer": "The price of Widget A is $99.",
#   "source": "Product Catalog"},
#  {"question": "What is the stock of Widget A?",
#   "answer": "The stock of Widget A is In stock.",
#   "source": "Product Catalog"}]
```
Table Chunking
Small Tables (< 20 rows)
Keep the entire table as a single chunk:
```python
def chunk_small_table(df, metadata: dict) -> dict:
    """
    Small table = single chunk with context.
    """
    markdown = df.to_markdown(index=False)

    chunk = {
        "content": f"**{metadata['title']}**\n\n{markdown}",
        "metadata": {
            "type": "table",
            "rows": len(df),
            "columns": list(df.columns),
            **metadata
        }
    }
    return chunk
```
Medium Tables (20-100 rows)
Chunking by row groups with overlap:
```python
def chunk_medium_table(
    df,
    metadata: dict,
    rows_per_chunk: int = 10,
    overlap: int = 2
) -> list:
    """
    Chunk by row groups with repeated headers.
    """
    chunks = []
    headers = df.columns.tolist()
    header_row = "| " + " | ".join(headers) + " |"
    separator = "| " + " | ".join(["---"] * len(headers)) + " |"

    for i in range(0, len(df), rows_per_chunk - overlap):
        subset = df.iloc[i:i + rows_per_chunk]
        if len(subset) == 0:
            continue

        rows_md = subset.to_markdown(index=False).split('\n')[2:]  # Skip header lines
        chunk_md = (
            f"**{metadata['title']}** (rows {i+1}-{i+len(subset)})\n\n"
            f"{header_row}\n{separator}\n" + "\n".join(rows_md)
        )

        chunks.append({
            "content": chunk_md,
            "metadata": {
                "type": "table_chunk",
                "start_row": i + 1,
                "end_row": i + len(subset),
                **metadata
            }
        })

    return chunks
```
Large Tables (> 100 rows)
Convert to row-by-row format:
```python
def chunk_large_table(df, metadata: dict) -> list:
    """
    Large tables: each row becomes a chunk.
    """
    return [
        {
            "content": table_to_row_format(df.iloc[[i]], metadata['title'])[0],
            "metadata": {
                "type": "table_row",
                "row_index": i + 1,
                "primary_key": str(df.iloc[i, 0]),  # First column as key
                **metadata
            }
        }
        for i in range(len(df))
    ]
```
Context Enrichment
Add Surrounding Context
```python
import json

def enrich_table_context(
    table_html: str,
    surrounding_text: str,
    llm_client
) -> dict:
    """
    Uses LLM to enrich table context.
    """
    prompt = f"""Analyze this table and its surrounding context.

Surrounding text: {surrounding_text[:500]}

Table (HTML):
{table_html}

Generate:
1. A descriptive title for the table
2. A one-sentence summary of what the table shows
3. The key columns and what they represent

Output as JSON:
{{"title": "...", "summary": "...", "key_columns": [{{"name": "...", "description": "..."}}]}}"""

    result = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )

    return json.loads(result.choices[0].message.content)
```
Generate Summaries
```python
def summarize_table(df, llm_client) -> str:
    """
    Generates a textual summary of the table.
    """
    # Basic stats
    stats = {
        "rows": len(df),
        "columns": list(df.columns),
        "sample": df.head(3).to_markdown()
    }

    prompt = f"""Summarize this table in 2-3 sentences.

Columns: {stats['columns']}
Rows: {stats['rows']}
Sample:
{stats['sample']}

Summary:"""

    result = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,
        temperature=0
    )
    return result.choices[0].message.content.strip()
```
Complete Pipeline
```python
import pandas as pd

class TableProcessor:
    def __init__(self, llm_client=None):
        self.llm = llm_client

    def process_document(self, pdf_path: str) -> dict:
        """
        Complete table extraction and chunking pipeline.
        """
        # 1. Extraction
        raw = extract_with_table_detection(pdf_path)

        processed_tables = []
        for i, table in enumerate(raw["tables"]):
            # 2. Convert to DataFrame
            df = self._html_to_df(table["html"])
            if df is None or df.empty:
                continue

            # 3. Enrich context
            if self.llm:
                context = enrich_table_context(
                    table["html"], raw["text"][:500], self.llm
                )
            else:
                context = {"title": f"Table {i+1}", "summary": ""}

            # 4. Chunking based on size
            if len(df) <= 20:
                chunks = [chunk_small_table(df, context)]
            elif len(df) <= 100:
                chunks = chunk_medium_table(df, context)
            else:
                chunks = chunk_large_table(df, context)

            # 5. Also generate Q&A pairs
            qa_pairs = table_to_qa_pairs(df, context["title"])

            processed_tables.append({
                "table_id": i,
                "metadata": context,
                "chunks": chunks,
                "qa_pairs": qa_pairs,
                "row_count": len(df)
            })

        return {
            "text_chunks": self._chunk_text(raw["text"]),
            "table_chunks": processed_tables,
            "stats": {
                "tables_found": len(raw["tables"]),
                "tables_processed": len(processed_tables)
            }
        }

    def _html_to_df(self, html: str):
        """Converts HTML to DataFrame."""
        try:
            dfs = pd.read_html(html)
            return dfs[0] if dfs else None
        except Exception:
            return None

    def _chunk_text(self, text: str) -> list:
        """Chunks standard text."""
        # Standard chunking implementation
        pass

# Usage
processor = TableProcessor(llm_client=openai_client)
result = processor.process_document("report.pdf")

# Index chunks
for table in result["table_chunks"]:
    for chunk in table["chunks"]:
        vector_db.upsert(chunk)

# Bonus: index Q&A pairs for better retrieval
for table in result["table_chunks"]:
    for qa in table["qa_pairs"]:
        vector_db.upsert({
            "content": f"Q: {qa['question']}\nA: {qa['answer']}",
            "metadata": {"type": "table_qa", "source": qa["source"]}
        })
```
Benchmarks
| Method | Accuracy | Complex Tables | Latency |
|---|---|---|---|
| PyPDF2 | 20% | Fails | 50ms |
| Camelot (lattice) | 85% | Good | 200ms |
| Unstructured | 80% | Medium | 500ms |
| Claude Vision | 95% | Excellent | 2s |
| GPT-4o Vision | 93% | Excellent | 1.5s |
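The trade-offs in this table can be encoded as a simple routing heuristic: scanned pages need a vision model, ruled native PDFs suit Camelot's lattice mode, and borderless layouts need layout inference. A hedged sketch — the function name, labels, and thresholds are illustrative, not from any library:

```python
def pick_extraction_method(is_scanned: bool, has_ruled_borders: bool,
                           latency_budget_ms: int = 1000) -> str:
    """
    Heuristic routing based on the benchmark trade-offs above.
    Returns a label for the extraction method to try first.
    """
    if is_scanned:
        # No embedded text layer: only vision-capable models
        # read the table reliably
        return "vision_llm"
    if has_ruled_borders:
        # Ruled native PDFs: Camelot's lattice mode is fast and accurate
        return "camelot_lattice"
    if latency_budget_ms >= 500:
        # Borderless tables: Unstructured's hi_res layout detection
        return "unstructured_hi_res"
    # Tight latency budget: fall back to stream-mode heuristics
    return "camelot_stream"
```

In practice you would also keep a fallback chain: if the fast method returns a low confidence score (e.g. Camelot's `accuracy`), escalate to the slower vision path.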
Related Guides
Parsing:
- Document Parsing Fundamentals - Overview
- Parse PDF Documents - PDF techniques
- OCR for Scanned Documents - Image documents
Chunking:
- Chunking Strategies - General approaches
- Hierarchical Chunking - Preserve structure
Do your documents contain complex tables? Let's analyze the best strategy together →