Table Extraction and Processing for RAG
Tables contain critical structured data but are difficult to parse. Master table extraction and chunking techniques for RAG.
TL;DR
- Tables often contain the most important info (prices, specs, comparisons)
- Problem: standard parsers destroy the structure
- Solutions: detection + specialized extraction + smart serialization
- Tools: Unstructured, Camelot, Tabula, multimodal LLMs
- Upload your PDFs with tables on Ailog
Why Tables Are Problematic
Typical example of table destruction:
Original PDF:
┌──────────┬─────────┬──────────┐
│ Product │ Price │ Stock │
├──────────┼─────────┼──────────┤
│ Widget A │ $99 │ In stock │
│ Widget B │ $149 │ Sold out │
└──────────┴─────────┴──────────┘
After naive parsing:
"Product Price Stock Widget A $99 In stock Widget B $149 Sold out"
→ Structure lost, relationships broken
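The loss above is easy to reproduce: any extractor that joins cells on whitespace flattens the table, while a structure-preserving serializer keeps headers and rows aligned. A minimal, self-contained sketch (the rows are hard-coded to mirror the example):

```python
# The same table, serialized naively vs. with structure preserved
rows = [
    ["Product", "Price", "Stock"],
    ["Widget A", "$99", "In stock"],
    ["Widget B", "$149", "Sold out"],
]

# Naive parsing: every cell is joined into one flat string,
# so "In stock" no longer belongs to any product
naive = " ".join(cell for row in rows for cell in row)

# Structure-preserving serialization: markdown keeps the
# header/value relationships explicit
header, *data = rows
markdown = "\n".join(
    ["| " + " | ".join(header) + " |",
     "| " + " | ".join(["---"] * len(header)) + " |"]
    + ["| " + " | ".join(row) + " |" for row in data]
)

print(naive)     # the flattened string shown above
print(markdown)  # rows stay attached to their headers
```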
Table Detection
With Unstructured
```python
from unstructured.partition.pdf import partition_pdf

def extract_with_table_detection(pdf_path: str) -> dict:
    """
    Extracts PDF content with table detection.
    """
    elements = partition_pdf(
        filename=pdf_path,
        strategy="hi_res",  # Visual detection
        infer_table_structure=True,
        include_page_breaks=True
    )

    tables = []
    text_content = []

    for element in elements:
        if element.category == "Table":
            tables.append({
                "html": element.metadata.text_as_html,
                "text": element.text,
                "page": element.metadata.page_number
            })
        else:
            text_content.append(element.text)

    return {
        "tables": tables,
        "text": "\n".join(text_content)
    }
```
With Camelot (Native PDFs)
```python
import camelot

def extract_tables_camelot(pdf_path: str) -> list:
    """
    Table extraction with Camelot.
    Works well on native PDFs (not scanned).
    """
    # Lattice method for tables with borders
    tables = camelot.read_pdf(
        pdf_path,
        pages='all',
        flavor='lattice'  # or 'stream' for borderless
    )

    extracted = []
    for i, table in enumerate(tables):
        df = table.df
        extracted.append({
            "table_id": i,
            "page": table.page,
            "accuracy": table.accuracy,
            "dataframe": df,
            "html": df.to_html(),
            "markdown": df.to_markdown()
        })

    return extracted
```
Vision Detection (Multimodal LLM)
```python
import anthropic
import base64

def detect_tables_vision(image_path: str) -> dict:
    """
    Uses Claude Vision to detect and extract tables.
    """
    client = anthropic.Anthropic()

    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    response = client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data
                    }
                },
                {
                    "type": "text",
                    "text": """Extract all tables from this image. For each table:
1. Output as markdown table
2. Preserve headers
3. Keep all data exactly as shown

Format:
TABLE 1:
| Header1 | Header2 | ... |
|---------|---------|-----|
| data    | data    | ... |

TABLE 2:
..."""
                }
            ]
        }]
    )

    return {
        "extracted_tables": response.content[0].text
    }
```
Table Serialization
Markdown Format
```python
def table_to_markdown(df) -> str:
    """
    Converts DataFrame to clean markdown.
    """
    return df.to_markdown(index=False)

# Result:
# | Product  | Price | Stock    |
# |----------|-------|----------|
# | Widget A | $99   | In stock |
# | Widget B | $149  | Sold out |
```
Row-by-Row Format (Best for RAG)
```python
def table_to_row_format(df, table_context: str = "") -> list:
    """
    Converts each row to standalone text.
    Each row becomes an autonomous chunk.
    """
    headers = df.columns.tolist()
    rows_as_text = []

    for _, row in df.iterrows():
        row_text = "; ".join([
            f"{header}: {value}"
            for header, value in zip(headers, row.values)
        ])
        if table_context:
            row_text = f"{table_context} - {row_text}"
        rows_as_text.append(row_text)

    return rows_as_text

# Result:
# ["Product Catalog - Product: Widget A; Price: $99; Stock: In stock",
#  "Product Catalog - Product: Widget B; Price: $149; Stock: Sold out"]
```
Q&A Format (Optimal for Retrieval)
```python
import pandas as pd

def table_to_qa_pairs(df, table_title: str) -> list:
    """
    Generates Q&A pairs from the table.
    Significantly improves retrieval.
    """
    headers = df.columns.tolist()
    qa_pairs = []

    for _, row in df.iterrows():
        # Identify key column (often the first)
        key_col = headers[0]
        key_val = row[key_col]

        for header in headers[1:]:
            value = row[header]
            if pd.notna(value) and str(value).strip():
                qa_pairs.append({
                    "question": f"What is the {header.lower()} of {key_val}?",
                    "answer": f"The {header.lower()} of {key_val} is {value}.",
                    "source": table_title
                })

    return qa_pairs

# Result:
# [{"question": "What is the price of Widget A?",
#   "answer": "The price of Widget A is $99.",
#   "source": "Product Catalog"},
#  {"question": "What is the stock of Widget A?",
#   "answer": "The stock of Widget A is In stock.",
#   "source": "Product Catalog"}]
```
Table Chunking
Small Tables (< 20 rows)
Keep the entire table as a single chunk:
```python
def chunk_small_table(df, metadata: dict) -> dict:
    """
    Small table = single chunk with context.
    """
    markdown = df.to_markdown(index=False)

    chunk = {
        "content": f"**{metadata['title']}**\n\n{markdown}",
        "metadata": {
            "type": "table",
            "rows": len(df),
            "columns": list(df.columns),
            **metadata
        }
    }
    return chunk
```
Medium Tables (20-100 rows)
Chunking by row groups with overlap:
```python
def chunk_medium_table(
    df,
    metadata: dict,
    rows_per_chunk: int = 10,
    overlap: int = 2
) -> list:
    """
    Chunk by row groups with repeated headers.
    """
    chunks = []
    headers = df.columns.tolist()
    header_row = "| " + " | ".join(headers) + " |"
    separator = "| " + " | ".join(["---"] * len(headers)) + " |"

    for i in range(0, len(df), rows_per_chunk - overlap):
        subset = df.iloc[i:i + rows_per_chunk]
        if len(subset) == 0:
            continue

        rows_md = subset.to_markdown(index=False).split('\n')[2:]  # Skip header lines
        chunk_md = (
            f"**{metadata['title']}** (rows {i+1}-{i+len(subset)})\n\n"
            f"{header_row}\n{separator}\n" + "\n".join(rows_md)
        )

        chunks.append({
            "content": chunk_md,
            "metadata": {
                "type": "table_chunk",
                "start_row": i + 1,
                "end_row": i + len(subset),
                **metadata
            }
        })

    return chunks
```
Large Tables (> 100 rows)
Convert to row-by-row format:
```python
def chunk_large_table(df, metadata: dict) -> list:
    """
    Large tables: each row becomes a chunk.
    """
    return [
        {
            "content": table_to_row_format(df.iloc[[i]], metadata['title'])[0],
            "metadata": {
                "type": "table_row",
                "row_index": i + 1,
                "primary_key": str(df.iloc[i, 0]),  # First column as key
                **metadata
            }
        }
        for i in range(len(df))
    ]
```
Context Enrichment
Add Surrounding Context
```python
import json

def enrich_table_context(
    table_html: str,
    surrounding_text: str,
    llm_client
) -> dict:
    """
    Uses LLM to enrich table context.
    """
    prompt = f"""Analyze this table and its surrounding context.

Surrounding text: {surrounding_text[:500]}

Table (HTML):
{table_html}

Generate:
1. A descriptive title for the table
2. A one-sentence summary of what the table shows
3. The key columns and what they represent

Output as JSON:
{{"title": "...", "summary": "...", "key_columns": [{{"name": "...", "description": "..."}}]}}"""

    result = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )

    return json.loads(result.choices[0].message.content)
```
Generate Summaries
```python
def summarize_table(df, llm_client) -> str:
    """
    Generates a textual summary of the table.
    """
    # Basic stats
    stats = {
        "rows": len(df),
        "columns": list(df.columns),
        "sample": df.head(3).to_markdown()
    }

    prompt = f"""Summarize this table in 2-3 sentences.

Columns: {stats['columns']}
Rows: {stats['rows']}
Sample:
{stats['sample']}

Summary:"""

    result = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,
        temperature=0
    )
    return result.choices[0].message.content.strip()
```
Complete Pipeline
```python
import pandas as pd

class TableProcessor:
    def __init__(self, llm_client=None):
        self.llm = llm_client

    def process_document(self, pdf_path: str) -> dict:
        """
        Complete table extraction and chunking pipeline.
        """
        # 1. Extraction
        raw = extract_with_table_detection(pdf_path)

        processed_tables = []
        for i, table in enumerate(raw["tables"]):
            # 2. Convert to DataFrame
            df = self._html_to_df(table["html"])
            if df is None or df.empty:
                continue

            # 3. Enrich context
            if self.llm:
                context = enrich_table_context(
                    table["html"], raw["text"][:500], self.llm
                )
            else:
                context = {"title": f"Table {i+1}", "summary": ""}

            # 4. Chunking based on size
            if len(df) <= 20:
                chunks = [chunk_small_table(df, context)]
            elif len(df) <= 100:
                chunks = chunk_medium_table(df, context)
            else:
                chunks = chunk_large_table(df, context)

            # 5. Also generate Q&A pairs
            qa_pairs = table_to_qa_pairs(df, context["title"])

            processed_tables.append({
                "table_id": i,
                "metadata": context,
                "chunks": chunks,
                "qa_pairs": qa_pairs,
                "row_count": len(df)
            })

        return {
            "text_chunks": self._chunk_text(raw["text"]),
            "table_chunks": processed_tables,
            "stats": {
                "tables_found": len(raw["tables"]),
                "tables_processed": len(processed_tables)
            }
        }

    def _html_to_df(self, html: str):
        """Converts HTML to DataFrame."""
        try:
            dfs = pd.read_html(html)
            return dfs[0] if dfs else None
        except Exception:
            return None

    def _chunk_text(self, text: str) -> list:
        """Chunks standard text."""
        # Standard chunking implementation
        pass

# Usage
processor = TableProcessor(llm_client=openai_client)
result = processor.process_document("report.pdf")

# Index chunks
for table in result["table_chunks"]:
    for chunk in table["chunks"]:
        vector_db.upsert(chunk)

# Bonus: index Q&A pairs for better retrieval
for table in result["table_chunks"]:
    for qa in table["qa_pairs"]:
        vector_db.upsert({
            "content": f"Q: {qa['question']}\nA: {qa['answer']}",
            "metadata": {"type": "table_qa", "source": qa["source"]}
        })
```
Benchmarks
| Method | Accuracy | Complex Tables | Latency |
|---|---|---|---|
| PyPDF2 | 20% | Fails | 50ms |
| Camelot (lattice) | 85% | Good | 200ms |
| Unstructured | 80% | Medium | 500ms |
| Claude Vision | 95% | Excellent | 2s |
| GPT-4o Vision | 93% | Excellent | 1.5s |
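The trade-offs in this table can be encoded as a simple routing heuristic: scanned pages need a vision model, ruled native PDFs suit Camelot's lattice mode, and borderless layouts need layout inference. A hedged sketch — the function name, labels, and thresholds are illustrative, not from any library:

```python
def pick_extraction_method(is_scanned: bool, has_ruled_borders: bool,
                           latency_budget_ms: int = 1000) -> str:
    """
    Heuristic routing based on the benchmark trade-offs above.
    Returns a label for the extraction method to try first.
    """
    if is_scanned:
        # No embedded text layer: only vision-capable models
        # read the table reliably
        return "vision_llm"
    if has_ruled_borders:
        # Ruled native PDFs: Camelot's lattice mode is fast and accurate
        return "camelot_lattice"
    if latency_budget_ms >= 500:
        # Borderless tables: Unstructured's hi_res layout detection
        return "unstructured_hi_res"
    # Tight latency budget: fall back to stream-mode heuristics
    return "camelot_stream"
```

In practice you would also keep a fallback chain: if the fast method returns a low confidence score (e.g. Camelot's `accuracy`), escalate to the slower vision path.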
Related Guides
Parsing:
- Document Parsing Fundamentals - Overview
- Parse PDF Documents - PDF techniques
- OCR for Scanned Documents - Image documents
Chunking:
- Chunking Strategies - General approaches
- Hierarchical Chunking - Preserve structure
Do your documents contain complex tables? Let's analyze the best strategy together →