
Multimodal RAG: Images, PDFs, and Beyond Text

January 29, 2026
22 min read
Ailog Team

Extend your RAG beyond text: image indexing, PDF extraction, tables, and charts for a truly complete assistant.


Traditional RAG is limited to text, but enterprise knowledge exists in many forms: PDFs with charts, product images, technical diagrams, PowerPoint presentations. Multimodal RAG extends your assistant's capabilities to understand and leverage all these formats.

Why Multimodal?

Limitations of Text-Only RAG

In a typical enterprise document base:

| Content Type | % of Information | Accessible to Text RAG |
|---|---|---|
| Plain text | 30-40% | Yes |
| PDFs (text) | 20-30% | Partially |
| PDFs (tables, charts) | 15-20% | No |
| Images, diagrams | 10-15% | No |
| Presentations | 5-10% | Partially |

Result: a text-only RAG can miss 40-50% of relevant information.

Multimodal Use Cases

  • E-commerce: Visual product search, questions about images
  • Technical documentation: Diagrams, architecture schemas
  • Support: User error screenshots
  • Training: Presentation slides, infographics
  • Legal: Scanned documents, signatures

Multimodal Architecture

┌─────────────────────────────────────────────────────────────┐
│                    MULTIMODAL SOURCES                        │
├──────────┬──────────┬──────────┬──────────┬────────────────┤
│   Text   │   PDF    │  Images  │  Slides  │    Video       │
│  .md/.txt│  .pdf    │ .jpg/.png│  .pptx   │   .mp4         │
└────┬─────┴────┬─────┴────┬─────┴────┬─────┴───────┬────────┘
     │          │          │          │             │
     │    ┌─────▼─────┐    │    ┌─────▼─────┐      │
     │    │   PDF     │    │    │   Slide   │      │
     │    │ Extractor │    │    │ Extractor │      │
     │    └─────┬─────┘    │    └─────┬─────┘      │
     │          │          │          │             │
     └──────────┴──────────┼──────────┴─────────────┘
                           │
              ┌────────────▼────────────┐
              │    Vision Encoder       │
              │  (CLIP, GPT-4V, etc.)   │
              └────────────┬────────────┘
                           │
              ┌────────────▼────────────┐
              │   Multimodal Index      │
              │  (Text + Image embeds)  │
              └────────────┬────────────┘
                           │
              ┌────────────▼────────────┐
              │   Multimodal LLM        │
              │  (GPT-4V, Claude 3)     │
              └─────────────────────────┘
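
The fan-in at the top of the diagram amounts to a small dispatcher that routes each source file to the right pipeline branch. A minimal sketch (the branch names are illustrative labels for the stages above, not part of any Ailog API):

```python
from pathlib import Path

# Illustrative branch labels matching the diagram's pipeline stages
ROUTES = {
    ".md": "text", ".txt": "text",
    ".pdf": "pdf_extractor",
    ".jpg": "vision_encoder", ".png": "vision_encoder",
    ".pptx": "slide_extractor",
    ".mp4": "video",
}

def route_source(path: str) -> str:
    """Return the pipeline branch that should process this file."""
    ext = Path(path).suffix.lower()
    if ext not in ROUTES:
        raise ValueError(f"Unsupported source type: {ext}")
    return ROUTES[ext]
```

Failing loudly on unknown extensions is deliberate: silently dropping a file type is exactly how the 40-50% coverage gap above creeps in.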

Advanced PDF Extraction

Extraction with Structure Preservation

```python
import base64
from dataclasses import dataclass
from typing import List

import fitz  # PyMuPDF


@dataclass
class PDFElement:
    type: str        # "text", "table", "image", "heading"
    content: str
    page: int
    bbox: tuple      # (x0, y0, x1, y1)
    metadata: dict = None


class AdvancedPDFExtractor:
    def __init__(self, vision_model=None):
        self.vision = vision_model

    def extract(self, pdf_path: str) -> List[PDFElement]:
        """Extract all elements from a PDF with their structure."""
        doc = fitz.open(pdf_path)
        elements = []

        for page_num in range(len(doc)):
            page = doc[page_num]

            # Text blocks (with heading detection)
            elements.extend(self._extract_text_blocks(page, page_num))
            # Tables
            elements.extend(self._extract_tables(page, page_num))
            # Images
            elements.extend(self._extract_images(page, page_num))

        doc.close()
        return elements

    def _extract_text_blocks(self, page, page_num: int) -> List[PDFElement]:
        """Extract text blocks with heading detection."""
        elements = []
        blocks = page.get_text("dict")["blocks"]

        for block in blocks:
            if "lines" not in block:
                continue

            text_parts = []
            max_font_size = 0
            for line in block["lines"]:
                for span in line["spans"]:
                    text_parts.append(span["text"])
                    max_font_size = max(max_font_size, span["size"])

            text = " ".join(text_parts).strip()
            if not text:
                continue

            # Detect headings from font size
            is_heading = max_font_size > 14

            elements.append(PDFElement(
                type="heading" if is_heading else "text",
                content=text,
                page=page_num,
                bbox=block["bbox"],
                metadata={"font_size": max_font_size},
            ))

        return elements

    def _extract_tables(self, page, page_num: int) -> List[PDFElement]:
        """Extract tables with structure recognition."""
        elements = []

        # Use PyMuPDF's built-in table detection
        for table in page.find_tables():
            # Convert to Markdown so the table stays searchable in the RAG index
            markdown = self._table_to_markdown(table)

            elements.append(PDFElement(
                type="table",
                content=markdown,
                page=page_num,
                bbox=table.bbox,
                metadata={
                    "rows": table.row_count,
                    "cols": table.col_count,
                },
            ))

        return elements

    def _table_to_markdown(self, table) -> str:
        """Convert a table to Markdown format."""
        rows = []
        for row_idx, row in enumerate(table.extract()):
            cells = [str(cell or "").strip() for cell in row]
            rows.append("| " + " | ".join(cells) + " |")
            # Add the separator line after the header row
            if row_idx == 0:
                rows.append("|" + "|".join(["---"] * len(cells)) + "|")
        return "\n".join(rows)

    def _extract_images(self, page, page_num: int) -> List[PDFElement]:
        """Extract images and generate descriptions."""
        elements = []

        for img in page.get_images(full=True):
            xref = img[0]
            base_image = page.parent.extract_image(xref)
            if not base_image:
                continue

            image_b64 = base64.b64encode(base_image["image"]).decode()

            # Generate a description with the vision model, if available
            description = self._describe_image(image_b64) if self.vision else ""

            elements.append(PDFElement(
                type="image",
                content=description,
                page=page_num,
                bbox=tuple(page.get_image_bbox(img)),
                metadata={
                    "image_b64": image_b64,
                    "format": base_image.get("ext", "unknown"),
                    "width": base_image.get("width"),
                    "height": base_image.get("height"),
                },
            ))

        return elements

    def _describe_image(self, image_b64: str) -> str:
        """Generate a textual description of the image."""
        prompt = """Describe this image in detail so it can be found through text search.
Include:
- The type of image (photo, diagram, chart, screenshot)
- Main visible elements
- Visible text if any
- Likely context"""
        return self.vision.analyze_image(image_b64, prompt)
```

OCR for Scanned Documents

```python
import cv2
import numpy as np
import pytesseract


class OCRExtractor:
    def __init__(self, languages: list = ["eng"]):
        self.languages = "+".join(languages)

    def extract_from_image(self, image_path: str) -> dict:
        """Extract text from an image with OCR."""
        # Preprocess to improve OCR accuracy
        image = cv2.imread(image_path)
        processed = self._preprocess(image)

        # OCR with Tesseract
        text = pytesseract.image_to_string(
            processed,
            lang=self.languages,
            config="--psm 1",  # automatic page segmentation with orientation detection
        )

        # Also get word-level coordinates
        data = pytesseract.image_to_data(
            processed,
            lang=self.languages,
            output_type=pytesseract.Output.DICT,
        )

        return {
            "text": text.strip(),
            "words": self._extract_words_with_positions(data),
            "confidence": self._average_confidence(data),
        }

    def _preprocess(self, image: np.ndarray) -> np.ndarray:
        """Preprocess the image to improve OCR."""
        # Convert to grayscale
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        # Denoising
        denoised = cv2.fastNlMeansDenoising(gray)
        # Adaptive binarization
        binary = cv2.adaptiveThreshold(
            denoised, 255,
            cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
            cv2.THRESH_BINARY, 11, 2,
        )
        # Deskew correction
        return self._deskew(binary)

    def _deskew(self, image: np.ndarray) -> np.ndarray:
        """Correct document skew."""
        coords = np.column_stack(np.where(image > 0))
        angle = cv2.minAreaRect(coords)[-1]
        if angle < -45:
            angle = -(90 + angle)
        else:
            angle = -angle

        (h, w) = image.shape[:2]
        center = (w // 2, h // 2)
        M = cv2.getRotationMatrix2D(center, angle, 1.0)
        return cv2.warpAffine(
            image, M, (w, h),
            flags=cv2.INTER_CUBIC,
            borderMode=cv2.BORDER_REPLICATE,
        )

    def _extract_words_with_positions(self, data: dict) -> list:
        """Pair each recognized word with its bounding box and confidence."""
        words = []
        for i, word in enumerate(data["text"]):
            if word.strip():
                words.append({
                    "text": word,
                    "bbox": (
                        data["left"][i],
                        data["top"][i],
                        data["left"][i] + data["width"][i],
                        data["top"][i] + data["height"][i],
                    ),
                    "confidence": float(data["conf"][i]),
                })
        return words

    def _average_confidence(self, data: dict) -> float:
        """Calculate average OCR confidence over recognized words."""
        confidences = [float(c) for c in data["conf"] if float(c) > 0]
        return sum(confidences) / len(confidences) if confidences else 0.0
```

Multimodal Embeddings

CLIP for Images and Text

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel


class CLIPEmbedder:
    def __init__(self, model_name: str = "openai/clip-vit-large-patch14"):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = CLIPModel.from_pretrained(model_name).to(self.device)
        self.processor = CLIPProcessor.from_pretrained(model_name)

    def embed_image(self, image_path: str) -> np.ndarray:
        """Generate an embedding for an image."""
        image = Image.open(image_path)
        inputs = self.processor(images=image, return_tensors="pt").to(self.device)
        with torch.no_grad():
            image_features = self.model.get_image_features(**inputs)
        return image_features.cpu().numpy().flatten()

    def embed_text(self, text: str) -> np.ndarray:
        """Generate an embedding for text."""
        inputs = self.processor(text=text, return_tensors="pt", padding=True).to(self.device)
        with torch.no_grad():
            text_features = self.model.get_text_features(**inputs)
        return text_features.cpu().numpy().flatten()

    def similarity(self, image_path: str, text: str) -> float:
        """Cosine similarity between an image and a text query."""
        image_emb = self.embed_image(image_path)
        text_emb = self.embed_text(text)

        # Normalize, then the dot product is the cosine similarity
        image_emb = image_emb / np.linalg.norm(image_emb)
        text_emb = text_emb / np.linalg.norm(text_emb)
        return float(np.dot(image_emb, text_emb))
```

Multimodal Index with Qdrant

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct


class MultimodalIndex:
    def __init__(self, text_embedder, image_embedder, qdrant_url: str = "localhost"):
        self.text_embedder = text_embedder
        self.image_embedder = image_embedder
        self.client = QdrantClient(host=qdrant_url, port=6333)

    def create_collection(self, name: str):
        """Create a collection with both text AND image vectors."""
        self.client.create_collection(
            collection_name=name,
            vectors_config={
                "text": VectorParams(size=768, distance=Distance.COSINE),
                "image": VectorParams(size=768, distance=Distance.COSINE),
            },
        )

    def index_document(self, doc_id: str, content: dict, collection: str):
        """Index a multimodal document."""
        vectors = {}

        # Text embedding if present
        if content.get("text"):
            vectors["text"] = self.text_embedder.encode(content["text"]).tolist()

        # Image embedding if present
        if content.get("image_path"):
            vectors["image"] = self.image_embedder.embed_image(content["image_path"]).tolist()

        point = PointStruct(
            id=hash(doc_id) % (2**63),
            vector=vectors,
            payload={
                "doc_id": doc_id,
                "text": content.get("text", ""),
                "image_path": content.get("image_path"),
                "metadata": content.get("metadata", {}),
            },
        )
        self.client.upsert(collection_name=collection, points=[point])

    def search(
        self,
        query: str,
        collection: str,
        query_image_path: str = None,
        top_k: int = 10,
    ) -> list:
        """Multimodal search: text and/or image."""
        query_vectors = []

        # Text search
        text_vector = self.text_embedder.encode(query).tolist()
        query_vectors.append(("text", text_vector))

        # Visual search if an image is provided
        if query_image_path:
            image_vector = self.image_embedder.embed_image(query_image_path).tolist()
            query_vectors.append(("image", image_vector))

        # Run one search per modality, then merge
        all_results = []
        for vector_name, vector in query_vectors:
            results = self.client.search(
                collection_name=collection,
                query_vector=(vector_name, vector),
                limit=top_k,
            )
            all_results.extend(results)

        # Deduplicate and re-score
        return self._merge_results(all_results, top_k)

    def _merge_results(self, results: list, top_k: int) -> list:
        """Merge results from different modalities by summing their scores."""
        scores = {}
        docs = {}

        for result in results:
            doc_id = result.payload["doc_id"]
            if doc_id not in scores:
                scores[doc_id] = 0
                docs[doc_id] = result.payload
            # Simple score-sum fusion across modalities
            scores[doc_id] += result.score

        # Sort by fused score
        sorted_docs = sorted(scores.items(), key=lambda x: x[1], reverse=True)
        return [
            {"doc_id": doc_id, "score": score, **docs[doc_id]}
            for doc_id, score in sorted_docs[:top_k]
        ]
```

Multimodal Generation with GPT-4V / Claude 3

Multimodal Response Pipeline

```python
import base64

from openai import OpenAI


class MultimodalRAG:
    def __init__(self, index, llm_client=None):
        self.index = index
        self.client = llm_client or OpenAI()

    async def query(
        self,
        text_query: str,
        image_query_path: str = None,
        collection: str = "multimodal_kb",
    ) -> dict:
        """Run a multimodal RAG query."""
        # 1. Multimodal search
        results = self.index.search(
            query=text_query,
            collection=collection,
            query_image_path=image_query_path,
            top_k=5,
        )

        # 2. Prepare the multimodal context
        context_parts = []
        images_for_llm = []

        for result in results:
            if result.get("text"):
                context_parts.append(f"Document: {result['text']}")
            if result.get("image_path"):
                # Load the image for the LLM
                with open(result["image_path"], "rb") as f:
                    img_b64 = base64.b64encode(f.read()).decode()
                images_for_llm.append({
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"},
                })

        # 3. Generate the response with a multimodal LLM
        response = await self._generate_response(
            query=text_query,
            context="\n\n".join(context_parts),
            images=images_for_llm,
            query_image=image_query_path,
        )

        return {
            "answer": response,
            "sources": results,
            "has_visual_context": len(images_for_llm) > 0,
        }

    async def _generate_response(
        self,
        query: str,
        context: str,
        images: list,
        query_image: str = None,
    ) -> str:
        """Generate a response using a multimodal LLM."""
        messages = [
            {
                "role": "system",
                "content": """You are a multimodal RAG assistant.
You have access to text documents and images.

Rules:
1. Base your answers on both the provided documents AND images
2. Describe what you see in images if relevant
3. If the user sends an image, analyze it to answer
4. Cite your sources (documents or images)""",
            }
        ]

        # Build the user message
        user_content = []

        # Add the query image if present
        if query_image:
            with open(query_image, "rb") as f:
                img_b64 = base64.b64encode(f.read()).decode()
            user_content.append({
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"},
            })

        # Add the context images
        user_content.extend(images)

        # Add the text
        user_content.append({
            "type": "text",
            "text": f"""Document context:
{context}

Question: {query}

Answer based on the context and images provided.""",
        })

        messages.append({"role": "user", "content": user_content})

        response = self.client.chat.completions.create(
            model="gpt-4-vision-preview",
            messages=messages,
            max_tokens=2000,
        )
        return response.choices[0].message.content
```

Specialized Use Cases

Visual Product Search

```python
class VisualProductSearch:
    def __init__(self, product_index, clip_embedder, llm=None):
        self.index = product_index
        self.clip = clip_embedder
        self.llm = llm  # text LLM used to draft product answers

    async def find_similar_products(
        self,
        image_path: str,
        top_k: int = 10,
        filters: dict = None,
    ) -> list:
        """Find products similar to an image."""
        # Embed the query image
        query_embedding = self.clip.embed_image(image_path)

        # Search the product index
        return self.index.search(
            query_vector=query_embedding,
            top_k=top_k,
            filters=filters,
        )

    async def answer_product_question(
        self,
        product_image_path: str,
        question: str,
    ) -> str:
        """Answer a question about a product from its image."""
        # Identify the product
        similar = await self.find_similar_products(product_image_path, top_k=1)
        if not similar:
            return "I couldn't find this product in our catalog."

        product = similar[0]

        # Generate an answer with the product context
        prompt = f"""
Identified product: {product['name']}
Description: {product['description']}
Price: ${product['price']}
Features: {product['specs']}

Customer question: {question}

Respond helpfully and commercially.
"""
        return await self.llm.generate(prompt)
```

Support with Screenshots

```python
class VisualSupportAssistant:
    def __init__(self, rag_pipeline, vision_model):
        self.rag = rag_pipeline
        self.vision = vision_model

    async def analyze_screenshot(
        self,
        screenshot_path: str,
        user_description: str = "",
    ) -> dict:
        """Analyze an error screenshot."""
        # Analyze the image with the vision model
        analysis = await self.vision.analyze(
            screenshot_path,
            prompt="""Analyze this screenshot and identify:
1. The type of application/interface visible
2. Any error message or anomaly
3. The context of the ongoing action
4. Visible technical details (codes, versions)""",
        )

        # Combine with the user's description
        combined_query = f"""
User description: {user_description}
Screenshot analysis: {analysis}
"""

        # Search the support knowledge base
        results = await self.rag.query(
            text_query=combined_query,
            image_query_path=screenshot_path,
        )

        return {
            "visual_analysis": analysis,
            "suggested_solutions": results["answer"],
            "related_articles": results["sources"],
        }
```

Best Practices

1. Image Preprocessing

  • Normalize sizes before embedding
  • Apply consistent preprocessing (crop, resize)
  • Filter low-quality images
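
The three points above can be sketched in a few lines with Pillow. The 224-pixel target and 64-pixel quality floor are illustrative defaults, not requirements of any particular embedding model:

```python
from typing import Optional

from PIL import Image


def normalize_for_embedding(
    img: Image.Image,
    target: int = 224,   # assumed embedding input size
    min_side: int = 64,  # quality floor: skip tiny images
) -> Optional[Image.Image]:
    """Center-crop to a square, resize to a fixed size, drop tiny images."""
    if min(img.size) < min_side:
        return None  # too small to embed usefully

    # Center-crop to a square before resizing, so the aspect ratio
    # does not distort the content
    w, h = img.size
    side = min(w, h)
    left, top = (w - side) // 2, (h - side) // 2
    img = img.crop((left, top, left + side, top + side))
    return img.convert("RGB").resize((target, target))
```

Whatever values you pick, apply the same ones at indexing time and at query time; mismatched preprocessing quietly degrades similarity scores.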

2. Metadata Management

Always preserve image metadata:

  • Original source and context
  • Dimensions and format
  • Creation date
  • Generated description
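
One way to keep these fields together is a small record that flattens into the vector-store payload. This is a sketch; the field names are illustrative:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ImageMetadata:
    """Illustrative metadata kept alongside each image embedding."""
    source: str                        # original file or URL
    context: str                       # surrounding document or section
    width: int
    height: int
    format: str                        # "jpeg", "png", ...
    created_at: Optional[datetime] = None
    description: str = ""              # text generated by the vision model

    def to_payload(self) -> dict:
        """Flatten into a vector-store payload."""
        return {
            "source": self.source,
            "context": self.context,
            "dimensions": f"{self.width}x{self.height}",
            "format": self.format,
            "created_at": self.created_at.isoformat() if self.created_at else None,
            "description": self.description,
        }
```

Storing the generated description in the payload means a purely textual retriever can still surface the image even when no image embedding is queried.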

3. Graceful Fallback

```python
async def safe_multimodal_query(query: str, image: str = None):
    """Handle cases where the multimodal pipeline fails."""
    try:
        if image:
            return await multimodal_rag.query(query, image)
    except Exception as e:
        logger.warning(f"Multimodal failed: {e}, falling back to text-only")

    # Fallback to text-only RAG
    return await text_rag.query(query)
```

Turnkey Multimodal RAG with Ailog

Implementing multimodal RAG requires integrating many technologies. With Ailog, access these capabilities without complexity:

  • Advanced PDF extraction with table and chart recognition
  • Integrated OCR for scanned documents
  • Image indexing with automatic descriptions
  • Hybrid text + visual search
  • Multimodal LLMs (GPT-4V, Claude 3) preconfigured

Try Ailog for free and index all your content, regardless of format.

Tags

RAG · multimodal · vision · PDF · images · OCR
