TL;DR

Tables often hold a document's most important information (prices, specs, comparisons)
Problem: classic parsers destroy the table structure
Solutions: detection + specialized extraction + smart serialization
Tools: Unstructured, Camelot, Tabula, multimodal LLMs
Upload your PDFs containing tables to Ailog

Why Tables Are a Problem

A typical example of a table being destroyed:

Original PDF:
┌──────────┬─────────┬──────────────┐
│ Product  │ Price   │ Stock        │
├──────────┼─────────┼──────────────┤
│ Widget A │ 99€     │ In stock     │
│ Widget B │ 149€    │ Out of stock │
└──────────┴─────────┴──────────────┘

After naive parsing:
"Product Price Stock Widget A 99€ In stock Widget B 149€ Out of stock"

→ Structure lost, relationships between cells broken

Table Detection

With Unstructured

from unstructured.partition.pdf import partition_pdf

def extract_with_table_detection(pdf_path: str) -> dict:
    """
    Extract PDF content, detecting tables along the way.
    """
    elements = partition_pdf(
        filename=pdf_path,
        strategy="hi_res",  # Visual (layout-model) detection
        infer_table_structure=True,
        include_page_breaks=True
    )

    tables = []
    text_content = []

    for element in elements:
        if element.category == "Table":
            tables.append({
                "html": element.metadata.text_as_html,
                "text": element.text,
                "page": element.metadata.page_number
            })
        else:
            text_content.append(element.text)

    return {
        "tables": tables,
        "text": "\n".join(text_content)
    }
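
The `html` field returned above carries the table structure as markup. As a minimal sketch, assuming the HTML is a plain `<table>` made of `<tr>`, `<th>` and `<td>` elements with no row or column spans, the standard library's `html.parser` is enough to turn it back into rows. The class and function names are illustrative and not part of Unstructured.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the cell text of every <tr> in a simple HTML table."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_rows(html: str) -> list:
    """Return the table as a list of rows (the first row is usually the header)."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Tables with merged cells need more than this; for those, a dedicated HTML table reader is the safer route.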

With Camelot (Native PDFs)

import camelot

def extract_tables_camelot(pdf_path: str) -> list:
    """
    Extract tables with Camelot.
    Works well on native (non-scanned) PDFs.
    """
    # 'lattice' flavor for tables with ruled borders
    tables = camelot.read_pdf(
        pdf_path,
        pages='all',
        flavor='lattice'  # or 'stream' for borderless tables
    )

    extracted = []
    for i, table in enumerate(tables):
        df = table.df

        extracted.append({
            "table_id": i,
            "page": table.page,
            "accuracy": table.accuracy,
            "dataframe": df,
            "html": df.to_html(),
            "markdown": df.to_markdown()
        })

    return extracted
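
Camelot reports an `accuracy` score out of 100 for each table, which the function above passes through. One way to use it, sketched here under the assumption that a fixed cut-off suits your documents, is to set low-scoring tables aside for a fallback such as the `stream` flavor or a vision model. The threshold of 80 and the helper name are illustrative.

```python
def split_by_accuracy(extracted: list, min_accuracy: float = 80.0) -> tuple:
    """Split extract_tables_camelot() results into (reliable, needs_review)."""
    reliable, needs_review = [], []
    for table in extracted:
        # Treat a missing score as unreliable rather than guessing
        score = table.get("accuracy")
        if score is not None and score >= min_accuracy:
            reliable.append(table)
        else:
            needs_review.append(table)
    return reliable, needs_review


# Hand-written sample using the same keys as extract_tables_camelot()
sample = [
    {"table_id": 0, "page": 1, "accuracy": 99.0},
    {"table_id": 1, "page": 2, "accuracy": 42.5},
]
good, bad = split_by_accuracy(sample)
print([t["table_id"] for t in good], [t["table_id"] for t in bad])
```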

Vision-Based Detection (Multimodal LLM)

import anthropic
import base64

def detect_tables_vision(image_path: str) -> dict:
    """
    Use Claude's vision capability to detect and extract tables.
    """
    client = anthropic.Anthropic()

    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    response = client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data
                    }
                },
                {
                    "type": "text",
                    "text": """Extract all tables from this image.
For each table:
1. Output as markdown table
2. Preserve headers
3. Keep all data exactly as shown

Format:
TABLE 1:
| Header1 | Header2 | ... |
|---------|---------|-----|
| data    | data    | ... |

TABLE 2:
..."""
                }
            ]
        }]
    )

    return {
        "extracted_tables": response.content[0].text
    }
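
The prompt above asks for a fixed `TABLE n:` layout, so the reply can be split back into structured tables with plain string handling. A minimal sketch, assuming the model follows that layout and that cells contain no `|` characters (the function name is illustrative):

```python
import re


def parse_markdown_tables(llm_output: str) -> list:
    """Parse 'TABLE n:' blocks of markdown into header/row lists."""
    tables = []
    current = None
    for line in llm_output.splitlines():
        line = line.strip()
        if re.fullmatch(r"TABLE \d+:", line):
            # A new table starts here
            current = {"headers": [], "rows": []}
            tables.append(current)
        elif current is not None and line.startswith("|"):
            cells = [c.strip() for c in line.strip("|").split("|")]
            if all(re.fullmatch(r":?-+:?", c) for c in cells):
                continue  # separator row such as |---|---|
            if not current["headers"]:
                current["headers"] = cells
            else:
                current["rows"].append(cells)
    return tables
```

Each parsed table can then be handed to the serialization and chunking steps in the same way as a table extracted by a PDF library.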

Table Serialization

Markdown Format

def table_to_markdown(df) -> str:
    """
    Convert a DataFrame to clean markdown.
    """
    return df.to_markdown(index=False)

# Result:
# | Product  | Price | Stock        |
# |----------|-------|--------------|
# | Widget A | 99€   | In stock     |
# | Widget B | 149€  | Out of stock |
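
`DataFrame.to_markdown` relies on the optional `tabulate` package. Where that dependency is unavailable, or where cells may contain `|` characters or line breaks, a hand-rolled serializer is one option. A minimal sketch, assuming the table is already a header list plus a list of rows (the function name is illustrative):

```python
def rows_to_markdown(headers: list, rows: list) -> str:
    """Serialize a table to a markdown pipe table, escaping unsafe characters."""

    def clean(value) -> str:
        # Pipes would split a cell; newlines would break the row
        return str(value).replace("|", "\\|").replace("\n", " ").strip()

    lines = [
        "| " + " | ".join(clean(h) for h in headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(clean(v) for v in row) + " |")
    return "\n".join(lines)


print(rows_to_markdown(["Product", "Price"], [["Widget A", "99€"], ["A|B", "149€"]]))
```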

Row-by-Row Format (Best for RAG)

def table_to_row_format(df, table_context: str = "") -> list:
    """
    Convert each row into standalone text.
    Each row becomes a self-contained chunk.
    """
    headers = df.columns.tolist()
    rows_as_text = []

    for _, row in df.iterrows():
        row_text = "; ".join([
            f"{header}: {value}"
            for header, value in zip(headers, row.values)
        ])

        if table_context:
            row_text = f"{table_context} - {row_text}"

        rows_as_text.append(row_text)

    return rows_as_text

# Result:
# ["Product Catalog - Product: Widget A; Price: 99€; Stock: In stock",
#  "Product Catalog - Product: Widget B; Price: 149€; Stock: Out of stock"]

Q&A Format (Optimal for Retrieval)

import pandas as pd

def table_to_qa_pairs(df, table_title: str) -> list:
    """
    Generate Q&A pairs from the table.
    Significantly improves retrieval.
    """
    headers = df.columns.tolist()
    qa_pairs = []

    for _, row in df.iterrows():
        # Identify the "key" column (often the first one)
        key_col = headers[0]
        key_val = row[key_col]

        for header in headers[1:]:
            value = row[header]
            if pd.notna(value) and str(value).strip():
                qa_pairs.append({
                    "question": f"What is the {str(header).lower()} of {key_val}?",
                    "answer": f"The {str(header).lower()} of {key_val} is {value}.",
                    "source": table_title
                })

    return qa_pairs

# Result (first two pairs):
# [{"question": "What is the price of Widget A?",
#   "answer": "The price of Widget A is 99€.",
#   "source": "Product Catalog"},
#  {"question": "What is the stock of Widget A?",
#   "answer": "The stock of Widget A is In stock.",
#   "source": "Product Catalog"},
#  ...]
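
A single generic template can read awkwardly for some columns, as the stock example shows. One possible refinement, sketched here with hypothetical names, is to let individual columns override the default wording:

```python
def build_qa_pair(key: str, column: str, value: str, templates: dict = None) -> dict:
    """Build one Q&A pair, using a column-specific template when one exists."""
    default = (
        "What is the {column} of {key}?",
        "The {column} of {key} is {value}.",
    )
    question_tpl, answer_tpl = (templates or {}).get(column, default)
    fields = {"key": key, "column": column.lower(), "value": value}
    return {
        "question": question_tpl.format(**fields),
        "answer": answer_tpl.format(**fields),
    }


# Hypothetical override for a "Stock" column; other columns keep the default
templates = {"Stock": ("Is {key} in stock?", "{key} availability: {value}.")}
print(build_qa_pair("Widget A", "Stock", "In stock", templates))
print(build_qa_pair("Widget A", "Price", "99€", templates))
```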

Table Chunking

Small Tables (≤ 20 rows)

Keep the whole table as a single chunk:

def chunk_small_table(df, metadata: dict) -> dict:
    """
    Small table = a single chunk with context.
    """
    markdown = df.to_markdown(index=False)

    chunk = {
        "content": f"**{metadata['title']}**\n\n{markdown}",
        "metadata": {
            "type": "table",
            "rows": len(df),
            "columns": list(df.columns),
            **metadata
        }
    }

    return chunk

Medium Tables (21-100 rows)

Chunk by groups of rows, with overlap:

def chunk_medium_table(
    df,
    metadata: dict,
    rows_per_chunk: int = 10,
    overlap: int = 2
) -> list:
    """
    Chunk by groups of rows, repeating the headers in every chunk.
    """
    step = rows_per_chunk - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than rows_per_chunk")

    chunks = []
    headers = [str(h) for h in df.columns]
    header_row = "| " + " | ".join(headers) + " |"
    separator = "| " + " | ".join(["---"] * len(headers)) + " |"

    for i in range(0, len(df), step):
        # Stop once the previous chunk has already reached the last row
        if i > 0 and i + overlap >= len(df):
            break

        subset = df.iloc[i:i + rows_per_chunk]
        rows_md = subset.to_markdown(index=False).split('\n')[2:]  # Skip header

        chunk_md = (
            f"**{metadata['title']}** (rows {i+1}-{i+len(subset)})\n\n"
            f"{header_row}\n{separator}\n" +
            "\n".join(rows_md)
        )

        chunks.append({
            "content": chunk_md,
            "metadata": {
                "type": "table_chunk",
                "start_row": i + 1,
                "end_row": i + len(subset),
                **metadata
            }
        })

    return chunks
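
To see which rows end up in which chunk, the window arithmetic can be isolated from the markdown rendering. The sketch below uses the same defaults (10 rows per chunk, 2 rows of overlap) and stops as soon as a window reaches the last row, so no trailing window made only of repeated rows is produced. The function name is illustrative.

```python
def chunk_windows(n_rows: int, rows_per_chunk: int = 10, overlap: int = 2) -> list:
    """Return 1-based inclusive (start_row, end_row) ranges for overlapping chunks."""
    step = rows_per_chunk - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than rows_per_chunk")

    windows = []
    start = 0
    while start < n_rows:
        end = min(start + rows_per_chunk, n_rows)
        windows.append((start + 1, end))
        if end == n_rows:
            break  # last row reached; a further window would only repeat rows
        start += step
    return windows


print(chunk_windows(25))
# [(1, 10), (9, 18), (17, 25)]
```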

Large Tables (> 100 rows)

Convert to row-by-row format:

def chunk_large_table(df, metadata: dict) -> list:
    """
    Large tables: each row becomes a chunk.
    """
    return [
        {
            "content": table_to_row_format(df.iloc[[i]], metadata['title'])[0],
            "metadata": {
                "type": "table_row",
                "row_index": i + 1,
                "primary_key": str(df.iloc[i, 0]),  # First column as the key
                **metadata
            }
        }
        for i in range(len(df))
    ]

Context Enrichment

Adding Surrounding Context

def enrich_table_context(
    table_html: str,
    surrounding_text: str,
    llm_client
) -> dict:
    """
    Use the LLM to enrich the table's context.
    """
    prompt = f"""Analyze this table and its surrounding context.

Surrounding text:
{surrounding_text[:500]}

Table (HTML):
{table_html}

Generate:
1. A descriptive title for the table
2. A one-sentence summary of what the table shows
3. The key columns and what they represent

Output as JSON:
{{"title": "...", "summary": "...", "key_columns": [{{"name": "...", "description": "..."}}]}}"""

    result = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )

    import json
    return json.loads(result.choices[0].message.content)
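
Calling `json.loads` directly on a model reply fails whenever the model wraps the JSON in a code fence or adds a sentence around it. A tolerant parser is one way to guard against that. The sketch below is an assumption about how such a guard could look, not part of any client library; the function name and the fallback argument are illustrative.

```python
import json
import re


def parse_llm_json(raw: str, fallback: dict) -> dict:
    """Parse a JSON object from an LLM reply, tolerating fences and extra prose."""
    # Drop a surrounding markdown code fence (three backticks, optional language tag)
    text = re.sub(r"^`{3}[a-zA-Z]*\s*|\s*`{3}$", "", raw.strip())

    # Try the whole text first, then the outermost {...} span
    candidates = [text]
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        candidates.append(text[start:end + 1])

    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return fallback
```

A caller could pass `{"title": "Table 1", "summary": ""}` as the fallback so that one malformed reply does not abort the processing of a whole document.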

Generating Summaries

def summarize_table(df, llm_client) -> str:
    """
    Generate a text summary of the table.
    """
    # Basic stats
    stats = {
        "rows": len(df),
        "columns": list(df.columns),
        "sample": df.head(3).to_markdown()
    }

    prompt = f"""Summarize this table in 2-3 sentences.

Columns: {stats['columns']}
Rows: {stats['rows']}
Sample:
{stats['sample']}

Summary:"""

    result = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,
        temperature=0
    )

    return result.choices[0].message.content.strip()

Complete Pipeline

class TableProcessor:
    def __init__(self, llm_client=None):
        self.llm = llm_client

    def process_document(self, pdf_path: str) -> dict:
        """
        Complete pipeline for table extraction and chunking.
        """
        # 1. Extraction
        raw = extract_with_table_detection(pdf_path)

        processed_tables = []

        for i, table in enumerate(raw["tables"]):
            # 2. Convert to a DataFrame
            df = self._html_to_df(table["html"])

            if df is None or df.empty:
                continue

            # 3. Enrich the context
            if self.llm:
                context = enrich_table_context(
                    table["html"],
                    raw["text"][:500],
                    self.llm
                )
            else:
                context = {"title": f"Table {i+1}", "summary": ""}

            # 4. Chunk according to table size
            if len(df) <= 20:
                chunks = [chunk_small_table(df, context)]
            elif len(df) <= 100:
                chunks = chunk_medium_table(df, context)
            else:
                chunks = chunk_large_table(df, context)

            # 5. Also generate the Q&A pairs
            qa_pairs = table_to_qa_pairs(df, context["title"])

            processed_tables.append({
                "table_id": i,
                "metadata": context,
                "chunks": chunks,
                "qa_pairs": qa_pairs,
                "row_count": len(df)
            })

        return {
            "text_chunks": self._chunk_text(raw["text"]),
            "table_chunks": processed_tables,
            "stats": {
                "tables_found": len(raw["tables"]),
                "tables_processed": len(processed_tables)
            }
        }

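    def _html_to_df(self, html: str):
        """Convert an HTML table to a DataFrame."""
        from io import StringIO
        import pandas as pd

        if not html:
            return None
        try:
            # Wrap in StringIO: recent pandas versions deprecate literal HTML strings
            dfs = pd.read_html(StringIO(html))
            return dfs[0] if dfs else None
        except Exception:
            # No table found, or the HTML could not be parsed
            return None

    def _chunk_text(self, text: str) -> list:
        """Chunk the regular text."""
        # Standard chunking implementation goes here
        pass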

# Usage (openai_client and vector_db are assumed to be created elsewhere)
processor = TableProcessor(llm_client=openai_client)
result = processor.process_document("rapport.pdf")

# Index the chunks
for table in result["table_chunks"]:
    for chunk in table["chunks"]:
        vector_db.upsert(chunk)

    # Bonus: index the Q&A pairs as well for better retrieval
    for qa in table["qa_pairs"]:
        vector_db.upsert({
            "content": f"Q: {qa['question']}\nA: {qa['answer']}",
            "metadata": {"type": "table_qa", "source": qa["source"]}
        })
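
`vector_db.upsert` stands in for whatever client your vector store provides. Many stores key upserts on an ID, so re-processing the same PDF should ideally produce the same IDs rather than duplicates. A sketch, assuming IDs may be derived from the source path, the table index and the chunk content (the helper name and ID format are illustrative):

```python
import hashlib


def make_chunk_id(source: str, table_id: int, content: str) -> str:
    """Derive a stable ID from the source file, table index and chunk content."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]
    return f"{source}::table-{table_id}::{digest}"


a = make_chunk_id("rapport.pdf", 0, "Product: Widget A; Price: 99€")
b = make_chunk_id("rapport.pdf", 0, "Product: Widget A; Price: 99€")
c = make_chunk_id("rapport.pdf", 0, "Product: Widget B; Price: 149€")
print(a == b, a == c)  # True False
```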

Benchmarks

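| Method            | Precision | Complex tables | Latency |
|-------------------|-----------|----------------|---------|
| PyPDF2            | 20%       | Fails          | 50 ms   |
| Camelot (lattice) | 85%       | Good           | 200 ms  |
| Unstructured      | 80%       | Fair           | 500 ms  |
| Claude Vision     | 95%       | Excellent      | 2 s     |
| GPT-4o Vision     | 93%       | Excellent      | 1.5 s   |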
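
The table does not state how precision was measured. One common way to score table extraction on your own documents is cell-level accuracy against a hand-checked reference. The sketch below assumes both tables are lists of rows and compares cells position by position; it is an illustration, not the method behind the figures above.

```python
def cell_accuracy(extracted: list, reference: list) -> float:
    """Share of reference cells reproduced at the same row/column position."""
    total = sum(len(row) for row in reference)
    if total == 0:
        return 0.0

    correct = 0
    for r, ref_row in enumerate(reference):
        got_row = extracted[r] if r < len(extracted) else []
        for c, ref_cell in enumerate(ref_row):
            got = got_row[c] if c < len(got_row) else None
            if got is not None and str(got).strip() == str(ref_cell).strip():
                correct += 1
    return correct / total


reference = [["Product", "Price"], ["Widget A", "99€"], ["Widget B", "149€"]]
extracted = [["Product", "Price"], ["Widget A", "99€"], ["Widget B", "14 9€"]]
print(cell_accuracy(extracted, reference))  # 5 of 6 cells match
```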

Related guides

Parsing:

Parsing Fundamentals - Overview
Parsing PDFs - PDF techniques
OCR for Scanned Documents - Image-based documents

Chunking:

Chunking Strategies - General approaches
Hierarchical Chunking - Preserving structure

Do your documents contain complex tables? Let's work out the best strategy together →

Table Extraction and Processing for RAG