Multimodal RAG: Images, PDFs, and Beyond Text


Traditional RAG is limited to text, but enterprise knowledge comes in many forms: PDFs with charts, product images, technical diagrams, PowerPoint presentations. Multimodal RAG extends your assistant so that it can understand and use all of these formats.

Why multimodal?

The limits of text-only RAG

In typical enterprise documentation:

Content type	% of information	Accessible to text-only RAG
Plain text	30-40%	Yes
PDFs (text)	20-30%	Partially
PDFs (tables, charts)	15-20%	No
Images, diagrams	10-15%	No
Presentations	5-10%	Partially

Result: a text-only RAG can miss 40-50% of the relevant information.

Multimodal use cases

E-commerce: visual product search, questions about images
Technical documentation: schematics, architecture diagrams
Support: screenshots of user errors
Training: presentation slides, infographics
Legal: scanned documents, signatures

Multimodal architecture

┌─────────────────────────────────────────────────────────────┐
│                     MULTIMODAL SOURCES                       │
├──────────┬──────────┬──────────┬──────────┬────────────────┤
│  Text    │   PDF    │  Images  │  Slides  │    Video       │
│  .md/.txt│  .pdf    │ .jpg/.png│  .pptx   │   .mp4         │
└────┬─────┴────┬─────┴────┬─────┴────┬─────┴───────┬────────┘
     │          │          │          │             │
     │    ┌─────▼─────┐    │    ┌─────▼─────┐      │
     │    │   PDF     │    │    │   Slide   │      │
     │    │ Extractor │    │    │ Extractor │      │
     │    └─────┬─────┘    │    └─────┬─────┘      │
     │          │          │          │             │
     └──────────┴──────────┼──────────┴─────────────┘
                           │
              ┌────────────▼────────────┐
              │    Vision Encoder       │
              │  (CLIP, GPT-4V, etc.)   │
              └────────────┬────────────┘
                           │
              ┌────────────▼────────────┐
              │   Multimodal Index      │
              │  (Text + Image embeds)  │
              └────────────┬────────────┘
                           │
              ┌────────────▼────────────┐
              │   Multimodal LLM        │
              │  (GPT-4V, Claude 3)     │
              └─────────────────────────┘
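
The first stage of the diagram, sending each source to the right branch, can be sketched as a dispatch on file extension. This is a minimal illustration: the extension table and the helper names `route_source` and `group_by_modality` are assumptions made for the example, not part of any library.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the pipeline branch in the diagram
MODALITY_BY_EXTENSION = {
    ".md": "text", ".txt": "text",
    ".pdf": "pdf",
    ".jpg": "image", ".jpeg": "image", ".png": "image",
    ".pptx": "slides",
    ".mp4": "video",
}


def route_source(path: str) -> str:
    """Return the branch that should process a file, or 'unsupported'."""
    suffix = Path(path).suffix.lower()
    return MODALITY_BY_EXTENSION.get(suffix, "unsupported")


def group_by_modality(paths: list[str]) -> dict[str, list[str]]:
    """Group input files by the branch that should process them."""
    groups: dict[str, list[str]] = {}
    for path in paths:
        groups.setdefault(route_source(path), []).append(path)
    return groups
```

Keeping an explicit "unsupported" bucket makes it easy to log files the pipeline silently skips.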

Advanced PDF extraction

Structure-preserving extraction

import fitz  # PyMuPDF
from dataclasses import dataclass
from typing import List, Optional
import base64

@dataclass
class PDFElement:
    type: str  # "text", "table", "image", "heading"
    content: str
    page: int
    bbox: tuple  # (x0, y0, x1, y1)
    metadata: Optional[dict] = None

class AdvancedPDFExtractor:
    def __init__(self, vision_model=None):
        self.vision = vision_model

    def extract(self, pdf_path: str) -> List[PDFElement]:
        """
        Extract all elements of a PDF along with their structure
        """
        doc = fitz.open(pdf_path)
        elements = []

        for page_num in range(len(doc)):
            page = doc[page_num]

            # Extract text with structure
            text_elements = self._extract_text_blocks(page, page_num)
            elements.extend(text_elements)

            # Extract tables
            tables = self._extract_tables(page, page_num)
            elements.extend(tables)

            # Extract images
            images = self._extract_images(page, page_num)
            elements.extend(images)

        doc.close()
        return elements

    def _extract_text_blocks(self, page, page_num: int) -> List[PDFElement]:
        """
        Extract text blocks with heading detection
        """
        elements = []
        blocks = page.get_text("dict")["blocks"]

        for block in blocks:
            if "lines" not in block:
                continue

            text_parts = []
            max_font_size = 0

            for line in block["lines"]:
                for span in line["spans"]:
                    text_parts.append(span["text"])
                    max_font_size = max(max_font_size, span["size"])

            text = " ".join(text_parts).strip()
            if not text:
                continue

            # Treat the block as a heading when its largest font size exceeds 14 pt
            is_heading = max_font_size > 14

            elements.append(PDFElement(
                type="heading" if is_heading else "text",
                content=text,
                page=page_num,
                bbox=block["bbox"],
                metadata={"font_size": max_font_size}
            ))

        return elements

    def _extract_tables(self, page, page_num: int) -> List[PDFElement]:
        """
        Extract tables with structure recognition
        """
        elements = []

        # Use PyMuPDF's built-in table detection
        tables = page.find_tables().tables

        for table in tables:
            # Convert to Markdown for the RAG pipeline
            markdown = self._table_to_markdown(table)

            elements.append(PDFElement(
                type="table",
                content=markdown,
                page=page_num,
                bbox=table.bbox,
                metadata={
                    # table.cells is a flat list of cell boxes, so use
                    # the dedicated row and column counters instead
                    "rows": table.row_count,
                    "cols": table.col_count
                }
            ))

        return elements

    def _table_to_markdown(self, table) -> str:
        """
        Convert a table to Markdown
        """
        rows = []
        for row_idx, row in enumerate(table.extract()):
            # Flatten line breaks so each table row stays on one Markdown line
            cells = [str(cell or "").replace("\n", " ").strip() for cell in row]
            rows.append("| " + " | ".join(cells) + " |")

            # Add the separator row after the header
            if row_idx == 0:
                separator = "|" + "|".join(["---"] * len(cells)) + "|"
                rows.append(separator)

        return "\n".join(rows)

    def _extract_images(self, page, page_num: int) -> List[PDFElement]:
        """
        Extract images and generate descriptions
        """
        elements = []
        images = page.get_images()

        for img in images:
            xref = img[0]
            base_image = page.parent.extract_image(xref)

            if base_image:
                image_data = base_image["image"]
                image_b64 = base64.b64encode(image_data).decode()

                # Generate a description with the vision model
                description = ""
                if self.vision:
                    description = self._describe_image(image_b64)

                # get_images() does not return positions, so look up
                # where the image is placed on the page
                rects = page.get_image_rects(xref)
                bbox = tuple(rects[0]) if rects else (0, 0, 0, 0)

                elements.append(PDFElement(
                    type="image",
                    content=description,
                    page=page_num,
                    bbox=bbox,
                    metadata={
                        "image_b64": image_b64,
                        "format": base_image.get("ext", "unknown"),
                        "width": base_image.get("width"),
                        "height": base_image.get("height")
                    }
                ))

        return elements

    def _describe_image(self, image_b64: str) -> str:
        """
        Generate a textual description of the image
        """
        prompt = """
        Describe this image in detail so that it can be found through text search.
        Include:
        - The type of image (photo, diagram, chart, screenshot)
        - The main visible elements
        - Any visible text
        - The likely context
        """

        # _extract_images calls this method without await, so analyze_image
        # is assumed here to be a blocking call on the injected vision model
        return self.vision.analyze_image(image_b64, prompt)
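
One way to turn the extracted elements into retrievable chunks is to attach each text block, table, or image description to the most recent heading. The sketch below assumes the elements arrive in reading order; `Element` is a simplified stand-in for `PDFElement`, and `chunk_by_heading` is a hypothetical helper, not part of PyMuPDF.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Simplified stand-in for PDFElement (type, content, page only)."""
    type: str
    content: str
    page: int


def chunk_by_heading(elements: list[Element]) -> list[dict]:
    """Group consecutive elements under the heading that precedes them."""
    chunks: list[dict] = []
    current = {"heading": "", "page": 0, "parts": []}
    for el in elements:
        if el.type == "heading":
            # A new heading closes the previous chunk if it has any body
            if current["parts"]:
                chunks.append(current)
            current = {"heading": el.content, "page": el.page, "parts": []}
        elif el.content:  # skip images that have no description
            if not current["parts"] and not current["heading"]:
                current["page"] = el.page
            current["parts"].append(el.content)
    if current["parts"]:
        chunks.append(current)
    # Prefix each chunk with its heading so the embedding carries that context
    return [
        {
            "page": c["page"],
            "text": (c["heading"] + "\n" if c["heading"] else "") + "\n".join(c["parts"]),
        }
        for c in chunks
    ]
```

Headings that are not followed by any content are dropped in this sketch; whether that is desirable depends on the documents being indexed.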

OCR for scanned documents

import pytesseract
from PIL import Image
import cv2
import numpy as np

class OCRExtractor:
    def __init__(self, languages: tuple = ("fra", "eng")):
        # A tuple avoids the shared mutable default argument
        self.languages = "+".join(languages)

    def extract_from_image(self, image_path: str) -> dict:
        """
        Extract text from an image with OCR
        """
        # Preprocess to improve OCR quality
        image = cv2.imread(image_path)
        if image is None:
            # cv2.imread returns None instead of raising on unreadable files
            raise ValueError(f"Could not read image: {image_path}")
        processed = self._preprocess(image)

        # OCR with Tesseract
        text = pytesseract.image_to_string(
            processed,
            lang=self.languages,
            config='--psm 1'  # Automatic page segmentation with OSD
        )

        # Also retrieve the word coordinates
        data = pytesseract.image_to_data(
            processed,
            lang=self.languages,
            output_type=pytesseract.Output.DICT
        )

        return {
            "text": text.strip(),
            "words": self._extract_words_with_positions(data),
            "confidence": self._average_confidence(data)
        }

    def _preprocess(self, image: np.ndarray) -> np.ndarray:
        """
        Preprocess the image to improve OCR
        """
        # Convert to grayscale
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

        # Denoise
        denoised = cv2.fastNlMeansDenoising(gray)

        # Adaptive binarization
        binary = cv2.adaptiveThreshold(
            denoised, 255,
            cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
            cv2.THRESH_BINARY, 11, 2
        )

        # Skew correction
        corrected = self._deskew(binary)

        return corrected

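    def _deskew(self, image: np.ndarray) -> np.ndarray:
        """
        Correct the skew of the document
        """
        # After THRESH_BINARY the text is black (0) on white (255),
        # so the text pixels are the ones equal to zero
        coords = np.column_stack(np.where(image == 0)).astype(np.float32)
        if len(coords) == 0:
            return image
        angle = cv2.minAreaRect(coords)[-1]

        # The angle range of minAreaRect differs between OpenCV versions,
        # so fold it into (-45, 45] before negating it
        if angle > 45:
            angle -= 90
        elif angle < -45:
            angle += 90
        angle = -angle

        (h, w) = image.shape[:2]
        center = (w // 2, h // 2)
        M = cv2.getRotationMatrix2D(center, angle, 1.0)
        rotated = cv2.warpAffine(
            image, M, (w, h),
            flags=cv2.INTER_CUBIC,
            borderMode=cv2.BORDER_REPLICATE
        )

        return rotated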

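    def _average_confidence(self, data: dict) -> float:
        """
        Compute the average OCR confidence
        """
        # conf may arrive as strings or numbers depending on the pytesseract version
        confidences = [float(c) for c in data["conf"] if float(c) > 0]
        return sum(confidences) / len(confidences) if confidences else 0.0

    def _extract_words_with_positions(self, data: dict) -> list:
        """
        Collect each recognized word with its bounding box
        """
        words = []
        for i, word in enumerate(data["text"]):
            if word.strip():
                words.append({
                    "text": word,
                    # (left, top, width, height) in pixels
                    "bbox": (data["left"][i], data["top"][i],
                             data["width"][i], data["height"][i]),
                    "confidence": float(data["conf"][i])
                })
        return words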
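
The word-level output of `image_to_data` can also be regrouped into lines and screened by confidence before indexing. The sketch below assumes the dictionary layout that pytesseract returns with `Output.DICT` (`text`, `conf`, `block_num`, `par_num`, `line_num`); the threshold of 60 and the helper name `lines_with_confidence` are illustrative assumptions.

```python
def lines_with_confidence(data: dict, min_conf: float = 60.0) -> list[dict]:
    """Rebuild text lines from OCR words and flag low-confidence lines."""
    grouped: dict[tuple, list[tuple[str, float]]] = {}
    for i, word in enumerate(data["text"]):
        conf = float(data["conf"][i])
        # Tesseract reports conf = -1 for boxes that are not words
        if not word.strip() or conf < 0:
            continue
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        grouped.setdefault(key, []).append((word, conf))

    lines = []
    for key in sorted(grouped):
        words = grouped[key]
        mean_conf = sum(c for _, c in words) / len(words)
        lines.append({
            "text": " ".join(w for w, _ in words),
            "confidence": mean_conf,
            # Lines below the threshold are candidates for human review
            "needs_review": mean_conf < min_conf,
        })
    return lines
```

Flagging at line level rather than page level keeps one smudged line from sending a whole clean page to manual review.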

Multimodal embeddings

CLIP for images and text

from transformers import CLIPProcessor, CLIPModel
import numpy as np
import torch
from PIL import Image

class CLIPEmbedder:
    def __init__(self, model_name: str = "openai/clip-vit-large-patch14"):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = CLIPModel.from_pretrained(model_name).to(self.device)
        self.processor = CLIPProcessor.from_pretrained(model_name)

    def embed_image(self, image_path: str) -> np.ndarray:
        """
        Generate an embedding for an image
        """
        # Convert to RGB so grayscale, palette, and RGBA files are handled uniformly
        image = Image.open(image_path).convert("RGB")
        inputs = self.processor(images=image, return_tensors="pt").to(self.device)

        with torch.no_grad():
            image_features = self.model.get_image_features(**inputs)

        return image_features.cpu().numpy().flatten()

    def embed_text(self, text: str) -> np.ndarray:
        """
        Generate an embedding for a text
        """
        # CLIP's text encoder accepts at most 77 tokens, so truncate longer inputs
        inputs = self.processor(
            text=text, return_tensors="pt", padding=True, truncation=True
        ).to(self.device)

        with torch.no_grad():
            text_features = self.model.get_text_features(**inputs)

        return text_features.cpu().numpy().flatten()

    def similarity(self, image_path: str, text: str) -> float:
        """
        Compute the similarity between an image and a text
        """
        image_emb = self.embed_image(image_path)
        text_emb = self.embed_text(text)

        # Normalize, then take the dot product (cosine similarity)
        image_emb = image_emb / np.linalg.norm(image_emb)
        text_emb = text_emb / np.linalg.norm(text_emb)

        return float(np.dot(image_emb, text_emb))
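
Because CLIP places both modalities in one vector space, ranking stored images against a text query reduces to cosine similarity over their vectors. The following dependency-free sketch uses plain lists, which in practice would be replaced by real CLIP vectors; `cosine` and `rank_by_cosine` are hypothetical helpers.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_cosine(
    query: list[float],
    candidates: dict[str, list[float]],
    top_k: int = 3,
) -> list[tuple[str, float]]:
    """Return the top_k candidate names with their similarity to the query."""
    scored = [(name, cosine(query, vec)) for name, vec in candidates.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
```

A vector database performs the same ranking with an approximate index, which matters once the collection grows beyond what a linear scan can handle.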

Multimodal index with Qdrant

import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

class MultimodalIndex:
    def __init__(self, text_embedder, image_embedder, qdrant_url: str = "localhost"):
        self.text_embedder = text_embedder
        self.image_embedder = image_embedder
        self.client = QdrantClient(host=qdrant_url, port=6333)

    def create_collection(self, name: str):
        """
        Create a collection with both text AND image vectors
        """
        # Both sizes must match the output dimension of the embedders in use
        self.client.create_collection(
            collection_name=name,
            vectors_config={
                "text": VectorParams(size=768, distance=Distance.COSINE),
                "image": VectorParams(size=768, distance=Distance.COSINE),
            }
        )

    def index_document(self, doc_id: str, content: dict, collection: str):
        """
        Index a multimodal document
        """
        vectors = {}

        # Text embedding, if present
        if content.get("text"):
            vectors["text"] = self.text_embedder.encode(content["text"]).tolist()

        # Image embedding, if present
        if content.get("image_path"):
            vectors["image"] = self.image_embedder.embed_image(content["image_path"]).tolist()

        # Python's hash() is salted per process for strings, so derive a
        # stable point ID from the document ID instead
        point = PointStruct(
            id=str(uuid.uuid5(uuid.NAMESPACE_URL, doc_id)),
            vector=vectors,
            payload={
                "doc_id": doc_id,
                "text": content.get("text", ""),
                "image_path": content.get("image_path"),
                "metadata": content.get("metadata", {})
            }
        )

        self.client.upsert(collection_name=collection, points=[point])

    def search(
        self,
        query: str,
        collection: str,
        query_image_path: str = None,
        top_k: int = 10
    ) -> list:
        """
        Multimodal search: text and/or image
        """
        # Prepare the query vectors
        query_vectors = []

        # Text search
        text_vector = self.text_embedder.encode(query).tolist()
        query_vectors.append(("text", text_vector))

        # Visual search if an image is provided
        if query_image_path:
            image_vector = self.image_embedder.embed_image(query_image_path).tolist()
            query_vectors.append(("image", image_vector))

        # Run the searches and merge
        all_results = []
        for vector_name, vector in query_vectors:
            # Note: recent qdrant-client releases deprecate search()
            # in favor of query_points()
            results = self.client.search(
                collection_name=collection,
                query_vector=(vector_name, vector),
                limit=top_k
            )
            all_results.extend(results)

        # Deduplicate and re-score
        return self._merge_results(all_results, top_k)

    def _merge_results(self, results: list, top_k: int) -> list:
        """
        Merge the results from the different modalities
        """
        scores = {}
        docs = {}

        for result in results:
            doc_id = result.payload["doc_id"]
            if doc_id not in scores:
                scores[doc_id] = 0
                docs[doc_id] = result.payload

            # Sum the raw similarity scores per document. This is a simple
            # fusion, not reciprocal rank fusion, which would use ranks.
            scores[doc_id] += result.score

        # Sort by fused score
        sorted_docs = sorted(scores.items(), key=lambda x: x[1], reverse=True)

        return [
            {"doc_id": doc_id, "score": score, **docs[doc_id]}
            for doc_id, score in sorted_docs[:top_k]
        ]
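
Similarity scores produced by different embedding models are not directly comparable, so rank-based fusion is a common alternative to adding them up. Below is a minimal sketch of reciprocal rank fusion (RRF), assuming each modality returns an ordered list of document IDs; `k = 60` is the conventional default, and the function name is illustrative.

```python
def reciprocal_rank_fusion(
    ranked_lists: list[list[str]],
    k: int = 60,
    top_k: int = 10,
) -> list[tuple[str, float]]:
    """Fuse several rankings: each hit contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return fused[:top_k]
```

A document that appears in both the text and the image ranking rises to the top even if it is first in neither, which is usually the behavior wanted from a multimodal merge.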

Multimodal generation with GPT-4V / Claude 3

Multimodal answer pipeline

from openai import OpenAI
import base64

class MultimodalRAG:
    def __init__(self, index, llm_client=None):
        self.index = index
        self.client = llm_client or OpenAI()

    async def query(
        self,
        text_query: str,
        image_query_path: str = None,
        collection: str = "multimodal_kb"
    ) -> dict:
        """
        Multimodal RAG query
        """
        # 1. Multimodal search
        results = self.index.search(
            query=text_query,
            collection=collection,
            query_image_path=image_query_path,
            top_k=5
        )

        # 2. Prepare the multimodal context
        context_parts = []
        images_for_llm = []

        for result in results:
            if result.get("text"):
                context_parts.append(f"Document: {result['text']}")

            if result.get("image_path"):
                # Load the image for the LLM (the data URL below assumes JPEG content)
                with open(result["image_path"], "rb") as f:
                    img_b64 = base64.b64encode(f.read()).decode()
                    images_for_llm.append({
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{img_b64}"
                        }
                    })

        # 3. Generate the answer with the multimodal LLM
        response = await self._generate_response(
            query=text_query,
            context="\n\n".join(context_parts),
            images=images_for_llm,
            query_image=image_query_path
        )

        return {
            "answer": response,
            "sources": results,
            "has_visual_context": len(images_for_llm) > 0
        }

    async def _generate_response(
        self,
        query: str,
        context: str,
        images: list,
        query_image: str = None
    ) -> str:
        """
        Generate an answer using a multimodal LLM
        """
        messages = [
            {
                "role": "system",
                "content": """You are a multimodal RAG assistant. You have access to text documents and images.

Rules:
1. Base your answers on the documents AND images provided
2. Describe what you see in the images when relevant
3. If the user sends an image, analyze it to answer
4. Cite your sources (documents or images)"""
            }
        ]

        # Build the user message
        user_content = []

        # Add the query image, if any
        if query_image:
            with open(query_image, "rb") as f:
                img_b64 = base64.b64encode(f.read()).decode()
            user_content.append({
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}
            })

        # Add the context images
        user_content.extend(images)

        # Add the text
        user_content.append({
            "type": "text",
            "text": f"""Document context:
{context}

Question: {query}

Answer based on the context and images provided."""
        })

        messages.append({"role": "user", "content": user_content})

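        # gpt-4-vision-preview has been deprecated by OpenAI; gpt-4o is one
        # vision-capable replacement (adjust to the model you have access to)
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            max_tokens=2000
        )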

        return response.choices[0].message.content
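
The pipeline above labels every image as JPEG. A small helper can derive the MIME type from the file name and cap how many images go into one request. This sketch uses only the standard library; `to_image_part`, `build_image_parts`, and the cap of 4 images are illustrative assumptions rather than a provider requirement.

```python
import base64
import mimetypes


def to_image_part(filename: str, data: bytes) -> dict:
    """Build an image_url message part with a MIME type guessed from the name."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        mime = "image/jpeg"  # assumed fallback when the type is unknown
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime};base64,{encoded}"},
    }


def build_image_parts(images: list[tuple[str, bytes]], max_images: int = 4) -> list[dict]:
    """Convert (filename, bytes) pairs to message parts, keeping the first max_images."""
    return [to_image_part(name, data) for name, data in images[:max_images]]
```

Capping the image count keeps token usage and latency predictable; since retrieval results are already ordered by score, keeping the first few is a reasonable default.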

Specialized use cases

Visual product search

class VisualProductSearch:
    def __init__(self, product_index, clip_embedder, llm):
        self.index = product_index
        self.clip = clip_embedder
        self.llm = llm  # used by answer_product_question

    async def find_similar_products(
        self,
        image_path: str,
        top_k: int = 10,
        filters: dict = None
    ) -> list:
        """
        Find products similar to an image
        """
        # Embed the query image
        query_embedding = self.clip.embed_image(image_path)

        # Search the product index
        results = self.index.search(
            query_vector=query_embedding,
            top_k=top_k,
            filters=filters
        )

        return results

    async def answer_product_question(
        self,
        product_image_path: str,
        question: str
    ) -> str:
        """
        Answer a question about a product based on its image
        """
        # Find the product
        similar = await self.find_similar_products(product_image_path, top_k=1)

        if not similar:
            return "I could not find this product in our catalog."

        product = similar[0]

        # Generate an answer with the product context
        prompt = f"""
        Identified product: {product['name']}
        Description: {product['description']}
        Price: {product['price']}€
        Specifications: {product['specs']}

        Customer question: {question}

        Answer in a helpful, sales-oriented way.
        """

        return await self.llm.generate(prompt)

Support with screenshots

class VisualSupportAssistant:
    def __init__(self, rag_pipeline, vision_model):
        self.rag = rag_pipeline
        self.vision = vision_model

    async def analyze_screenshot(
        self,
        screenshot_path: str,
        user_description: str = ""
    ) -> dict:
        """
        Analyze a screenshot of an error
        """
        # Analyze the image with the vision model
        analysis = await self.vision.analyze(
            screenshot_path,
            prompt="""Analyze this screenshot and identify:
            1. The type of application/interface shown
            2. Any error message or anomaly
            3. The context of the action in progress
            4. Any visible technical details (codes, versions)
            """
        )

        # Combine with the user's description
        combined_query = f"""
        User description: {user_description}

        Screenshot analysis:
        {analysis}
        """

        # Search the support knowledge base
        results = await self.rag.query(
            text_query=combined_query,
            image_query_path=screenshot_path
        )

        return {
            "visual_analysis": analysis,
            "suggested_solutions": results["answer"],
            "related_articles": results["sources"]
        }

Best practices

1. Image preprocessing

Normalize sizes before embedding
Apply consistent preprocessing (crop, resize)
Filter out low-quality images
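
The size normalization and quality filter can be sketched without any imaging library by working on dimensions alone. The limits used here (1024 px longest side, 64 px minimum side, 8:1 maximum aspect ratio) and the helper names are illustrative assumptions to tune for your own corpus.

```python
def fit_within(width: int, height: int, max_side: int = 1024) -> tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # never upscale small images
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def is_usable(width: int, height: int, min_side: int = 64, max_aspect: float = 8.0) -> bool:
    """Reject tiny images (icons, spacers) and extreme aspect ratios (rules, banners)."""
    if min(width, height) < min_side:
        return False
    return max(width, height) / min(width, height) <= max_aspect
```

The resulting target size can then be passed to whichever imaging library performs the actual resize.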

2. Metadata management

Always keep the image metadata:

Source and original context
Dimensions and format
Creation date
Generated description

3. Graceful fallback

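import logging
from typing import Optional

logger = logging.getLogger(__name__)

# multimodal_rag and text_rag are assumed to be pipeline instances
# created elsewhere in the application
async def safe_multimodal_query(query: str, image: Optional[str] = None):
    """
    Handle cases where the multimodal path fails
    """
    try:
        if image:
            return await multimodal_rag.query(query, image)
    except Exception as e:
        logger.warning(f"Multimodal failed: {e}, falling back to text-only")

    # Fall back to text-only RAG
    return await text_rag.query(query)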

Further reading

Retrieval Fundamentals - the basics of search
Introduction to RAG - overview
Dense Retrieval - advanced embeddings

FAQ

Classic RAG handles only text: it indexes text documents and answers text questions. Multimodal RAG extends this to images, PDFs with charts, tables, and presentations. It uses vision models (GPT-4V, Claude 3) to interpret visual content and multimodal embeddings (CLIP) for cross-modal search between text and images.

A multimodal RAG can process PDFs (text, tables, charts), images (photos, diagrams, screenshots), PowerPoint presentations, scanned documents via OCR, and potentially videos. Extraction preserves the document structure and generates textual descriptions of visual elements, which enables unified search.

Modern OCR (Tesseract, cloud services) reaches 95-99% accuracy on good-quality documents. Preprocessing (denoising, skew correction, binarization) improves results considerably. Poor-quality scans or handwriting call for specialized models or human review.

Image processing adds latency (description generation, CLIP encoding). Best practices are asynchronous processing at indexing time, caching of embeddings, use of optimized (quantized) models, and a fallback to text-only RAG if the multimodal path fails. Preprocess images once at import, not on every query.
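
Caching embeddings by content hash is one way to avoid recomputing them when the same image is imported twice. The in-memory sketch below is hypothetical: `EmbeddingCache` and the `embed_fn` callable are names chosen for the example, and a production setup would likely persist the cache.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """Memoize embeddings by the SHA-256 of the raw content."""

    def __init__(self, embed_fn: Callable[[bytes], list[float]]):
        self._embed_fn = embed_fn
        self._store: dict[str, list[float]] = {}
        self.misses = 0  # number of times embed_fn actually ran

    def get(self, content: bytes) -> list[float]:
        key = hashlib.sha256(content).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(content)
        return self._store[key]
```

Hashing the bytes rather than the file path means renamed or duplicated files still hit the cache.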

Yes, thanks to CLIP embeddings, which project images and text into the same vector space. A user can submit an image to find similar products (visual product search) or ask a question about a screenshot. The system finds relevant documents even without a direct text match.

Turnkey multimodal RAG with Ailog

Implementing a multimodal RAG means integrating many technologies. With Ailog, you get these capabilities without the complexity:

Advanced PDF extraction with table and chart detection
Built-in OCR for scanned documents
Image indexing with automatic descriptions
Hybrid text + visual search
Multimodal LLMs (GPT-4V, Claude 3) preconfigured

Try Ailog for free and index all your content, whatever the format.
