
Multimodal RAG: Images, PDFs, and Beyond Text

January 29, 2026
22 min read
Ailog Team

Extend your RAG beyond text: image indexing, PDF extraction, tables, and charts for a truly complete assistant.


Traditional RAG is limited to text, but enterprise knowledge exists in many forms: PDFs with charts, product images, technical diagrams, PowerPoint presentations. Multimodal RAG extends your assistant's capabilities to understand and leverage all these formats.

Why Multimodal?

Limitations of Text-Only RAG

In a typical enterprise document base:

| Content Type | % of Information | Accessible to Text RAG |
|---|---|---|
| Plain text | 30-40% | Yes |
| PDFs (text) | 20-30% | Partially |
| PDFs (tables, charts) | 15-20% | No |
| Images, diagrams | 10-15% | No |
| Presentations | 5-10% | Partially |

Result: a text-only RAG can miss 40-50% of relevant information.

Multimodal Use Cases

  • E-commerce: Visual product search, questions about images
  • Technical documentation: Diagrams, architecture schemas
  • Support: User error screenshots
  • Training: Presentation slides, infographics
  • Legal: Scanned documents, signatures

Multimodal Architecture

┌─────────────────────────────────────────────────────────────┐
│                    MULTIMODAL SOURCES                        │
├──────────┬──────────┬──────────┬──────────┬────────────────┤
│   Text   │   PDF    │  Images  │  Slides  │    Video       │
│  .md/.txt│  .pdf    │ .jpg/.png│  .pptx   │   .mp4         │
└────┬─────┴────┬─────┴────┬─────┴────┬─────┴───────┬────────┘
     │          │          │          │             │
     │    ┌─────▼─────┐    │    ┌─────▼─────┐      │
     │    │   PDF     │    │    │   Slide   │      │
     │    │ Extractor │    │    │ Extractor │      │
     │    └─────┬─────┘    │    └─────┬─────┘      │
     │          │          │          │             │
     └──────────┴──────────┼──────────┴─────────────┘
                           │
              ┌────────────▼────────────┐
              │    Vision Encoder       │
              │  (CLIP, GPT-4V, etc.)   │
              └────────────┬────────────┘
                           │
              ┌────────────▼────────────┐
              │   Multimodal Index      │
              │  (Text + Image embeds)  │
              └────────────┬────────────┘
                           │
              ┌────────────▼────────────┐
              │   Multimodal LLM        │
              │  (GPT-4V, Claude 3)     │
              └─────────────────────────┘
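
The fan-in at the top of the diagram amounts to a small dispatcher that routes each source file to the right pipeline branch. A minimal sketch (the branch names are illustrative labels for the stages above, not part of any Ailog API):

```python
from pathlib import Path

# Illustrative branch labels matching the diagram's pipeline stages
ROUTES = {
    ".md": "text", ".txt": "text",
    ".pdf": "pdf_extractor",
    ".jpg": "vision_encoder", ".png": "vision_encoder",
    ".pptx": "slide_extractor",
    ".mp4": "video",
}

def route_source(path: str) -> str:
    """Return the pipeline branch that should process this file."""
    ext = Path(path).suffix.lower()
    if ext not in ROUTES:
        raise ValueError(f"Unsupported source type: {ext}")
    return ROUTES[ext]
```

Failing loudly on unknown extensions is deliberate: silently dropping a file type is exactly how the 40-50% coverage gap above creeps in.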

Advanced PDF Extraction

Extraction with Structure Preservation

```python
import base64
from dataclasses import dataclass
from typing import List

import fitz  # PyMuPDF


@dataclass
class PDFElement:
    type: str        # "text", "table", "image", "heading"
    content: str
    page: int
    bbox: tuple      # (x0, y0, x1, y1)
    metadata: dict = None


class AdvancedPDFExtractor:
    def __init__(self, vision_model=None):
        self.vision = vision_model

    def extract(self, pdf_path: str) -> List[PDFElement]:
        """Extract all elements from a PDF with their structure."""
        doc = fitz.open(pdf_path)
        elements = []

        for page_num in range(len(doc)):
            page = doc[page_num]

            # Text blocks (with heading detection)
            elements.extend(self._extract_text_blocks(page, page_num))
            # Tables
            elements.extend(self._extract_tables(page, page_num))
            # Images
            elements.extend(self._extract_images(page, page_num))

        doc.close()
        return elements

    def _extract_text_blocks(self, page, page_num: int) -> List[PDFElement]:
        """Extract text blocks with heading detection."""
        elements = []
        blocks = page.get_text("dict")["blocks"]

        for block in blocks:
            if "lines" not in block:
                continue

            text_parts = []
            max_font_size = 0
            for line in block["lines"]:
                for span in line["spans"]:
                    text_parts.append(span["text"])
                    max_font_size = max(max_font_size, span["size"])

            text = " ".join(text_parts).strip()
            if not text:
                continue

            # Detect headings from font size
            is_heading = max_font_size > 14

            elements.append(PDFElement(
                type="heading" if is_heading else "text",
                content=text,
                page=page_num,
                bbox=block["bbox"],
                metadata={"font_size": max_font_size},
            ))

        return elements

    def _extract_tables(self, page, page_num: int) -> List[PDFElement]:
        """Extract tables with structure recognition."""
        elements = []

        # Use PyMuPDF's built-in table detection
        for table in page.find_tables():
            # Convert to Markdown so the table stays searchable in the RAG index
            markdown = self._table_to_markdown(table)

            elements.append(PDFElement(
                type="table",
                content=markdown,
                page=page_num,
                bbox=table.bbox,
                metadata={
                    "rows": table.row_count,
                    "cols": table.col_count,
                },
            ))

        return elements

    def _table_to_markdown(self, table) -> str:
        """Convert a table to Markdown format."""
        rows = []
        for row_idx, row in enumerate(table.extract()):
            cells = [str(cell or "").strip() for cell in row]
            rows.append("| " + " | ".join(cells) + " |")
            # Add the separator line after the header row
            if row_idx == 0:
                rows.append("|" + "|".join(["---"] * len(cells)) + "|")
        return "\n".join(rows)

    def _extract_images(self, page, page_num: int) -> List[PDFElement]:
        """Extract images and generate descriptions."""
        elements = []

        for img in page.get_images(full=True):
            xref = img[0]
            base_image = page.parent.extract_image(xref)
            if not base_image:
                continue

            image_b64 = base64.b64encode(base_image["image"]).decode()

            # Generate a description with the vision model, if available
            description = self._describe_image(image_b64) if self.vision else ""

            elements.append(PDFElement(
                type="image",
                content=description,
                page=page_num,
                bbox=tuple(page.get_image_bbox(img)),
                metadata={
                    "image_b64": image_b64,
                    "format": base_image.get("ext", "unknown"),
                    "width": base_image.get("width"),
                    "height": base_image.get("height"),
                },
            ))

        return elements

    def _describe_image(self, image_b64: str) -> str:
        """Generate a textual description of the image."""
        prompt = """Describe this image in detail so it can be found through text search.
Include:
- The type of image (photo, diagram, chart, screenshot)
- Main visible elements
- Visible text if any
- Likely context"""
        return self.vision.analyze_image(image_b64, prompt)
```

OCR for Scanned Documents

```python
import cv2
import numpy as np
import pytesseract


class OCRExtractor:
    def __init__(self, languages: list = ["eng"]):
        self.languages = "+".join(languages)

    def extract_from_image(self, image_path: str) -> dict:
        """Extract text from an image with OCR."""
        # Preprocess to improve OCR accuracy
        image = cv2.imread(image_path)
        processed = self._preprocess(image)

        # OCR with Tesseract
        text = pytesseract.image_to_string(
            processed,
            lang=self.languages,
            config="--psm 1",  # automatic page segmentation with orientation detection
        )

        # Also get word-level coordinates
        data = pytesseract.image_to_data(
            processed,
            lang=self.languages,
            output_type=pytesseract.Output.DICT,
        )

        return {
            "text": text.strip(),
            "words": self._extract_words_with_positions(data),
            "confidence": self._average_confidence(data),
        }

    def _preprocess(self, image: np.ndarray) -> np.ndarray:
        """Preprocess the image to improve OCR."""
        # Convert to grayscale
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        # Denoising
        denoised = cv2.fastNlMeansDenoising(gray)
        # Adaptive binarization
        binary = cv2.adaptiveThreshold(
            denoised, 255,
            cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
            cv2.THRESH_BINARY, 11, 2,
        )
        # Deskew correction
        return self._deskew(binary)

    def _deskew(self, image: np.ndarray) -> np.ndarray:
        """Correct document skew."""
        coords = np.column_stack(np.where(image > 0))
        angle = cv2.minAreaRect(coords)[-1]
        if angle < -45:
            angle = -(90 + angle)
        else:
            angle = -angle

        (h, w) = image.shape[:2]
        center = (w // 2, h // 2)
        M = cv2.getRotationMatrix2D(center, angle, 1.0)
        return cv2.warpAffine(
            image, M, (w, h),
            flags=cv2.INTER_CUBIC,
            borderMode=cv2.BORDER_REPLICATE,
        )

    def _extract_words_with_positions(self, data: dict) -> list:
        """Pair each recognized word with its bounding box and confidence."""
        words = []
        for i, word in enumerate(data["text"]):
            if word.strip():
                words.append({
                    "text": word,
                    "bbox": (
                        data["left"][i],
                        data["top"][i],
                        data["left"][i] + data["width"][i],
                        data["top"][i] + data["height"][i],
                    ),
                    "confidence": float(data["conf"][i]),
                })
        return words

    def _average_confidence(self, data: dict) -> float:
        """Calculate average OCR confidence over recognized words."""
        confidences = [float(c) for c in data["conf"] if float(c) > 0]
        return sum(confidences) / len(confidences) if confidences else 0.0
```

Multimodal Embeddings

CLIP for Images and Text

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel


class CLIPEmbedder:
    def __init__(self, model_name: str = "openai/clip-vit-large-patch14"):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = CLIPModel.from_pretrained(model_name).to(self.device)
        self.processor = CLIPProcessor.from_pretrained(model_name)

    def embed_image(self, image_path: str) -> np.ndarray:
        """Generate an embedding for an image."""
        image = Image.open(image_path)
        inputs = self.processor(images=image, return_tensors="pt").to(self.device)
        with torch.no_grad():
            image_features = self.model.get_image_features(**inputs)
        return image_features.cpu().numpy().flatten()

    def embed_text(self, text: str) -> np.ndarray:
        """Generate an embedding for text."""
        inputs = self.processor(text=text, return_tensors="pt", padding=True).to(self.device)
        with torch.no_grad():
            text_features = self.model.get_text_features(**inputs)
        return text_features.cpu().numpy().flatten()

    def similarity(self, image_path: str, text: str) -> float:
        """Cosine similarity between an image and a text query."""
        image_emb = self.embed_image(image_path)
        text_emb = self.embed_text(text)

        # Normalize, then the dot product is the cosine similarity
        image_emb = image_emb / np.linalg.norm(image_emb)
        text_emb = text_emb / np.linalg.norm(text_emb)
        return float(np.dot(image_emb, text_emb))
```

Multimodal Index with Qdrant

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct


class MultimodalIndex:
    def __init__(self, text_embedder, image_embedder, qdrant_url: str = "localhost"):
        self.text_embedder = text_embedder
        self.image_embedder = image_embedder
        self.client = QdrantClient(host=qdrant_url, port=6333)

    def create_collection(self, name: str):
        """Create a collection with both text AND image vectors."""
        self.client.create_collection(
            collection_name=name,
            vectors_config={
                "text": VectorParams(size=768, distance=Distance.COSINE),
                "image": VectorParams(size=768, distance=Distance.COSINE),
            },
        )

    def index_document(self, doc_id: str, content: dict, collection: str):
        """Index a multimodal document."""
        vectors = {}

        # Text embedding if present
        if content.get("text"):
            vectors["text"] = self.text_embedder.encode(content["text"]).tolist()

        # Image embedding if present
        if content.get("image_path"):
            vectors["image"] = self.image_embedder.embed_image(content["image_path"]).tolist()

        point = PointStruct(
            id=hash(doc_id) % (2**63),
            vector=vectors,
            payload={
                "doc_id": doc_id,
                "text": content.get("text", ""),
                "image_path": content.get("image_path"),
                "metadata": content.get("metadata", {}),
            },
        )
        self.client.upsert(collection_name=collection, points=[point])

    def search(
        self,
        query: str,
        collection: str,
        query_image_path: str = None,
        top_k: int = 10,
    ) -> list:
        """Multimodal search: text and/or image."""
        query_vectors = []

        # Text search
        text_vector = self.text_embedder.encode(query).tolist()
        query_vectors.append(("text", text_vector))

        # Visual search if an image is provided
        if query_image_path:
            image_vector = self.image_embedder.embed_image(query_image_path).tolist()
            query_vectors.append(("image", image_vector))

        # Run one search per modality, then merge
        all_results = []
        for vector_name, vector in query_vectors:
            results = self.client.search(
                collection_name=collection,
                query_vector=(vector_name, vector),
                limit=top_k,
            )
            all_results.extend(results)

        # Deduplicate and re-score
        return self._merge_results(all_results, top_k)

    def _merge_results(self, results: list, top_k: int) -> list:
        """Merge results from different modalities by summing their scores."""
        scores = {}
        docs = {}

        for result in results:
            doc_id = result.payload["doc_id"]
            if doc_id not in scores:
                scores[doc_id] = 0
                docs[doc_id] = result.payload
            # Simple score-sum fusion across modalities
            scores[doc_id] += result.score

        # Sort by fused score
        sorted_docs = sorted(scores.items(), key=lambda x: x[1], reverse=True)
        return [
            {"doc_id": doc_id, "score": score, **docs[doc_id]}
            for doc_id, score in sorted_docs[:top_k]
        ]
```

Multimodal Generation with GPT-4V / Claude 3

Multimodal Response Pipeline

```python
import base64

from openai import OpenAI


class MultimodalRAG:
    def __init__(self, index, llm_client=None):
        self.index = index
        self.client = llm_client or OpenAI()

    async def query(
        self,
        text_query: str,
        image_query_path: str = None,
        collection: str = "multimodal_kb",
    ) -> dict:
        """Run a multimodal RAG query."""
        # 1. Multimodal search
        results = self.index.search(
            query=text_query,
            collection=collection,
            query_image_path=image_query_path,
            top_k=5,
        )

        # 2. Prepare the multimodal context
        context_parts = []
        images_for_llm = []

        for result in results:
            if result.get("text"):
                context_parts.append(f"Document: {result['text']}")
            if result.get("image_path"):
                # Load the image for the LLM
                with open(result["image_path"], "rb") as f:
                    img_b64 = base64.b64encode(f.read()).decode()
                images_for_llm.append({
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"},
                })

        # 3. Generate the response with a multimodal LLM
        response = await self._generate_response(
            query=text_query,
            context="\n\n".join(context_parts),
            images=images_for_llm,
            query_image=image_query_path,
        )

        return {
            "answer": response,
            "sources": results,
            "has_visual_context": len(images_for_llm) > 0,
        }

    async def _generate_response(
        self,
        query: str,
        context: str,
        images: list,
        query_image: str = None,
    ) -> str:
        """Generate a response using a multimodal LLM."""
        messages = [
            {
                "role": "system",
                "content": """You are a multimodal RAG assistant.
You have access to text documents and images.

Rules:
1. Base your answers on both the provided documents AND images
2. Describe what you see in images if relevant
3. If the user sends an image, analyze it to answer
4. Cite your sources (documents or images)""",
            }
        ]

        # Build the user message
        user_content = []

        # Add the query image if present
        if query_image:
            with open(query_image, "rb") as f:
                img_b64 = base64.b64encode(f.read()).decode()
            user_content.append({
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"},
            })

        # Add the context images
        user_content.extend(images)

        # Add the text
        user_content.append({
            "type": "text",
            "text": f"""Document context:
{context}

Question: {query}

Answer based on the context and images provided.""",
        })

        messages.append({"role": "user", "content": user_content})

        response = self.client.chat.completions.create(
            model="gpt-4-vision-preview",
            messages=messages,
            max_tokens=2000,
        )
        return response.choices[0].message.content
```

Specialized Use Cases

Visual Product Search

```python
class VisualProductSearch:
    def __init__(self, product_index, clip_embedder, llm=None):
        self.index = product_index
        self.clip = clip_embedder
        self.llm = llm  # text LLM used to draft product answers

    async def find_similar_products(
        self,
        image_path: str,
        top_k: int = 10,
        filters: dict = None,
    ) -> list:
        """Find products similar to an image."""
        # Embed the query image
        query_embedding = self.clip.embed_image(image_path)

        # Search the product index
        return self.index.search(
            query_vector=query_embedding,
            top_k=top_k,
            filters=filters,
        )

    async def answer_product_question(
        self,
        product_image_path: str,
        question: str,
    ) -> str:
        """Answer a question about a product from its image."""
        # Identify the product
        similar = await self.find_similar_products(product_image_path, top_k=1)
        if not similar:
            return "I couldn't find this product in our catalog."

        product = similar[0]

        # Generate an answer with the product context
        prompt = f"""
Identified product: {product['name']}
Description: {product['description']}
Price: ${product['price']}
Features: {product['specs']}

Customer question: {question}

Respond helpfully and commercially.
"""
        return await self.llm.generate(prompt)
```

Support with Screenshots

```python
class VisualSupportAssistant:
    def __init__(self, rag_pipeline, vision_model):
        self.rag = rag_pipeline
        self.vision = vision_model

    async def analyze_screenshot(
        self,
        screenshot_path: str,
        user_description: str = "",
    ) -> dict:
        """Analyze an error screenshot."""
        # Analyze the image with the vision model
        analysis = await self.vision.analyze(
            screenshot_path,
            prompt="""Analyze this screenshot and identify:
1. The type of application/interface visible
2. Any error message or anomaly
3. The context of the ongoing action
4. Visible technical details (codes, versions)""",
        )

        # Combine with the user's description
        combined_query = f"""
User description: {user_description}
Screenshot analysis: {analysis}
"""

        # Search the support knowledge base
        results = await self.rag.query(
            text_query=combined_query,
            image_query_path=screenshot_path,
        )

        return {
            "visual_analysis": analysis,
            "suggested_solutions": results["answer"],
            "related_articles": results["sources"],
        }
```

Best Practices

1. Image Preprocessing

  • Normalize sizes before embedding
  • Apply consistent preprocessing (crop, resize)
  • Filter low-quality images
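
The three points above can be sketched in a few lines with Pillow. The 224-pixel target and 64-pixel quality floor are illustrative defaults, not requirements of any particular embedding model:

```python
from typing import Optional

from PIL import Image


def normalize_for_embedding(
    img: Image.Image,
    target: int = 224,   # assumed embedding input size
    min_side: int = 64,  # quality floor: skip tiny images
) -> Optional[Image.Image]:
    """Center-crop to a square, resize to a fixed size, drop tiny images."""
    if min(img.size) < min_side:
        return None  # too small to embed usefully

    # Center-crop to a square before resizing, so the aspect ratio
    # does not distort the content
    w, h = img.size
    side = min(w, h)
    left, top = (w - side) // 2, (h - side) // 2
    img = img.crop((left, top, left + side, top + side))
    return img.convert("RGB").resize((target, target))
```

Whatever values you pick, apply the same ones at indexing time and at query time; mismatched preprocessing quietly degrades similarity scores.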

2. Metadata Management

Always preserve image metadata:

  • Original source and context
  • Dimensions and format
  • Creation date
  • Generated description
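
One way to keep these fields together is a small record that flattens into the vector-store payload. This is a sketch; the field names are illustrative:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ImageMetadata:
    """Illustrative metadata kept alongside each image embedding."""
    source: str                        # original file or URL
    context: str                       # surrounding document or section
    width: int
    height: int
    format: str                        # "jpeg", "png", ...
    created_at: Optional[datetime] = None
    description: str = ""              # text generated by the vision model

    def to_payload(self) -> dict:
        """Flatten into a vector-store payload."""
        return {
            "source": self.source,
            "context": self.context,
            "dimensions": f"{self.width}x{self.height}",
            "format": self.format,
            "created_at": self.created_at.isoformat() if self.created_at else None,
            "description": self.description,
        }
```

Storing the generated description in the payload means a purely textual retriever can still surface the image even when no image embedding is queried.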

3. Graceful Fallback

```python
async def safe_multimodal_query(query: str, image: str = None):
    """Handle cases where the multimodal pipeline fails."""
    try:
        if image:
            return await multimodal_rag.query(query, image)
    except Exception as e:
        logger.warning(f"Multimodal failed: {e}, falling back to text-only")

    # Fallback to text-only RAG
    return await text_rag.query(query)
```

Turnkey Multimodal RAG with Ailog

Implementing multimodal RAG requires integrating many technologies. With Ailog, access these capabilities without complexity:

  • Advanced PDF extraction with table and chart recognition
  • Integrated OCR for scanned documents
  • Image indexing with automatic descriptions
  • Hybrid text + visual search
  • Multimodal LLMs (GPT-4V, Claude 3) preconfigured

Try Ailog for free and index all your content, regardless of format.

Tags

RAG · multimodal · vision · PDF · images · OCR
