Multimodal RAG: Images, PDFs, and Beyond Text
Extend your RAG beyond text: image indexing, PDF extraction, tables, and charts for a truly complete assistant.
Traditional RAG is limited to text, but enterprise knowledge exists in many forms: PDFs with charts, product images, technical diagrams, PowerPoint presentations. Multimodal RAG extends your assistant's capabilities to understand and leverage all these formats.
Why Multimodal?
Limitations of Text-Only RAG
In a typical enterprise document base:
| Content Type | % of Information | Accessible to Text RAG |
|---|---|---|
| Plain text | 30-40% | Yes |
| PDFs (text) | 20-30% | Partially |
| PDFs (tables, charts) | 15-20% | No |
| Images, diagrams | 10-15% | No |
| Presentations | 5-10% | Partially |
Result: a text-only RAG can miss 40-50% of relevant information.
Multimodal Use Cases
- E-commerce: Visual product search, questions about images
- Technical documentation: Diagrams, architecture schemas
- Support: User error screenshots
- Training: Presentation slides, infographics
- Legal: Scanned documents, signatures
Multimodal Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                    MULTIMODAL SOURCES                       │
├──────────┬──────────┬──────────┬──────────┬─────────────────┤
│   Text   │   PDF    │  Images  │  Slides  │     Video       │
│ .md/.txt │  .pdf    │ .jpg/.png│  .pptx   │     .mp4        │
└────┬─────┴────┬─────┴────┬─────┴────┬─────┴───────┬─────────┘
     │          │          │          │             │
     │    ┌─────▼─────┐    │    ┌─────▼─────┐       │
     │    │   PDF     │    │    │   Slide   │       │
     │    │ Extractor │    │    │ Extractor │       │
     │    └─────┬─────┘    │    └─────┬─────┘       │
     │          │          │          │             │
     └──────────┴──────────┼──────────┴─────────────┘
                           │
              ┌────────────▼────────────┐
              │     Vision Encoder      │
              │  (CLIP, GPT-4V, etc.)   │
              └────────────┬────────────┘
                           │
              ┌────────────▼────────────┐
              │    Multimodal Index     │
              │ (Text + Image embeds)   │
              └────────────┬────────────┘
                           │
              ┌────────────▼────────────┐
              │     Multimodal LLM      │
              │   (GPT-4V, Claude 3)    │
              └─────────────────────────┘
```
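The fan-in at the top of the diagram can be sketched as a simple dispatcher that routes each source file to the right extractor by extension. The extractor names below are illustrative placeholders for the components shown above, not a fixed API:

```python
from pathlib import Path

def route_document(path: str) -> tuple:
    """Dispatch a source file to the appropriate extractor by extension.

    Extractor labels are illustrative stand-ins for the pipeline
    components in the architecture diagram.
    """
    ext = Path(path).suffix.lower()
    if ext in {".md", ".txt"}:
        return ("text", path)            # indexed directly
    if ext == ".pdf":
        return ("pdf_extractor", path)   # structured PDF extraction
    if ext in {".jpg", ".jpeg", ".png"}:
        return ("vision_encoder", path)  # embedded as an image
    if ext == ".pptx":
        return ("slide_extractor", path)
    if ext == ".mp4":
        return ("video_pipeline", path)
    raise ValueError(f"Unsupported format: {ext}")
```

In practice this routing layer is also a natural place to record per-file metadata (source, format, ingestion date) before anything reaches the vision encoder.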
Advanced PDF Extraction
Extraction with Structure Preservation
```python
import base64
from dataclasses import dataclass, field
from typing import List

import fitz  # PyMuPDF


@dataclass
class PDFElement:
    type: str          # "text", "table", "image", "heading"
    content: str
    page: int
    bbox: tuple        # (x0, y0, x1, y1)
    metadata: dict = field(default_factory=dict)


class AdvancedPDFExtractor:
    def __init__(self, vision_model=None):
        self.vision = vision_model

    def extract(self, pdf_path: str) -> List[PDFElement]:
        """Extract all elements from a PDF with their structure."""
        doc = fitz.open(pdf_path)
        elements = []
        for page_num in range(len(doc)):
            page = doc[page_num]
            # Text (with structure), then tables, then images
            elements.extend(self._extract_text_blocks(page, page_num))
            elements.extend(self._extract_tables(page, page_num))
            elements.extend(self._extract_images(page, page_num))
        doc.close()
        return elements

    def _extract_text_blocks(self, page, page_num: int) -> List[PDFElement]:
        """Extract text blocks with heading detection."""
        elements = []
        blocks = page.get_text("dict")["blocks"]
        for block in blocks:
            if "lines" not in block:
                continue
            text_parts = []
            max_font_size = 0
            for line in block["lines"]:
                for span in line["spans"]:
                    text_parts.append(span["text"])
                    max_font_size = max(max_font_size, span["size"])
            text = " ".join(text_parts).strip()
            if not text:
                continue
            # Heuristic: a large font size usually indicates a heading
            is_heading = max_font_size > 14
            elements.append(PDFElement(
                type="heading" if is_heading else "text",
                content=text,
                page=page_num,
                bbox=block["bbox"],
                metadata={"font_size": max_font_size},
            ))
        return elements

    def _extract_tables(self, page, page_num: int) -> List[PDFElement]:
        """Extract tables with structure recognition."""
        elements = []
        # PyMuPDF table detection (find_tables returns a TableFinder)
        for table in page.find_tables().tables:
            # Convert to Markdown so the table stays searchable as text
            markdown = self._table_to_markdown(table)
            elements.append(PDFElement(
                type="table",
                content=markdown,
                page=page_num,
                bbox=table.bbox,
                metadata={"rows": table.row_count, "cols": table.col_count},
            ))
        return elements

    def _table_to_markdown(self, table) -> str:
        """Convert a table to Markdown format."""
        rows = []
        for row_idx, row in enumerate(table.extract()):
            cells = [str(cell or "").strip() for cell in row]
            rows.append("| " + " | ".join(cells) + " |")
            # Add the separator line after the header row
            if row_idx == 0:
                rows.append("|" + "|".join(["---"] * len(cells)) + "|")
        return "\n".join(rows)

    def _extract_images(self, page, page_num: int) -> List[PDFElement]:
        """Extract images and generate textual descriptions."""
        elements = []
        for img in page.get_images(full=True):
            xref = img[0]
            base_image = page.parent.extract_image(xref)
            if not base_image:
                continue
            image_b64 = base64.b64encode(base_image["image"]).decode()
            # Generate a description with the vision model, if available
            description = self._describe_image(image_b64) if self.vision else ""
            # get_image_rects() gives the image's placement on the page
            rects = page.get_image_rects(xref)
            bbox = tuple(rects[0]) if rects else (0, 0, 0, 0)
            elements.append(PDFElement(
                type="image",
                content=description,
                page=page_num,
                bbox=bbox,
                metadata={
                    "image_b64": image_b64,
                    "format": base_image.get("ext", "unknown"),
                    "width": base_image.get("width"),
                    "height": base_image.get("height"),
                },
            ))
        return elements

    def _describe_image(self, image_b64: str) -> str:
        """Generate a textual description of the image."""
        prompt = (
            "Describe this image in detail so it can be found through "
            "text search. Include: the type of image (photo, diagram, "
            "chart, screenshot), main visible elements, visible text "
            "if any, and likely context."
        )
        return self.vision.analyze_image(image_b64, prompt)
```
OCR for Scanned Documents
```python
import cv2
import numpy as np
import pytesseract


class OCRExtractor:
    def __init__(self, languages: list = ["eng"]):
        self.languages = "+".join(languages)

    def extract_from_image(self, image_path: str) -> dict:
        """Extract text from an image with OCR."""
        # Preprocess to improve OCR accuracy
        image = cv2.imread(image_path)
        processed = self._preprocess(image)

        # OCR with Tesseract
        text = pytesseract.image_to_string(
            processed,
            lang=self.languages,
            config="--psm 1",  # automatic page segmentation with OSD
        )

        # Also get word-level coordinates and confidences
        data = pytesseract.image_to_data(
            processed,
            lang=self.languages,
            output_type=pytesseract.Output.DICT,
        )

        return {
            "text": text.strip(),
            "words": self._extract_words_with_positions(data),
            "confidence": self._average_confidence(data),
        }

    def _preprocess(self, image: np.ndarray) -> np.ndarray:
        """Preprocess the image to improve OCR."""
        # Convert to grayscale and denoise
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        denoised = cv2.fastNlMeansDenoising(gray)
        # Adaptive binarization: text ends up black on a white background
        binary = cv2.adaptiveThreshold(
            denoised, 255,
            cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
            cv2.THRESH_BINARY, 11, 2,
        )
        # Deskew correction
        return self._deskew(binary)

    def _deskew(self, image: np.ndarray) -> np.ndarray:
        """Correct document skew."""
        # Text pixels are black (0) after THRESH_BINARY, so select those
        coords = np.column_stack(np.where(image == 0)).astype(np.float32)
        angle = cv2.minAreaRect(coords)[-1]
        # Note: the angle convention of minAreaRect changed in OpenCV 4.5+;
        # this correction follows the classic (pre-4.5) convention
        if angle < -45:
            angle = -(90 + angle)
        else:
            angle = -angle
        h, w = image.shape[:2]
        M = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
        return cv2.warpAffine(
            image, M, (w, h),
            flags=cv2.INTER_CUBIC,
            borderMode=cv2.BORDER_REPLICATE,
        )

    def _extract_words_with_positions(self, data: dict) -> list:
        """Pair each recognized word with its bounding box and confidence."""
        words = []
        for i, word in enumerate(data["text"]):
            if word.strip() and int(data["conf"][i]) > 0:
                words.append({
                    "text": word,
                    "bbox": (data["left"][i], data["top"][i],
                             data["width"][i], data["height"][i]),
                    "confidence": int(data["conf"][i]),
                })
        return words

    def _average_confidence(self, data: dict) -> float:
        """Average OCR confidence over recognized words."""
        confidences = [int(c) for c in data["conf"] if int(c) > 0]
        return sum(confidences) / len(confidences) if confidences else 0.0
```
Multimodal Embeddings
CLIP for Images and Text
```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel


class CLIPEmbedder:
    def __init__(self, model_name: str = "openai/clip-vit-large-patch14"):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = CLIPModel.from_pretrained(model_name).to(self.device)
        self.processor = CLIPProcessor.from_pretrained(model_name)

    def embed_image(self, image_path: str) -> np.ndarray:
        """Generate an embedding for an image."""
        image = Image.open(image_path)
        inputs = self.processor(images=image, return_tensors="pt").to(self.device)
        with torch.no_grad():
            image_features = self.model.get_image_features(**inputs)
        return image_features.cpu().numpy().flatten()

    def embed_text(self, text: str) -> np.ndarray:
        """Generate an embedding for text."""
        inputs = self.processor(
            text=text, return_tensors="pt", padding=True
        ).to(self.device)
        with torch.no_grad():
            text_features = self.model.get_text_features(**inputs)
        return text_features.cpu().numpy().flatten()

    def similarity(self, image_path: str, text: str) -> float:
        """Cosine similarity between an image and a text query."""
        image_emb = self.embed_image(image_path)
        text_emb = self.embed_text(text)
        # Normalize, then take the dot product
        image_emb = image_emb / np.linalg.norm(image_emb)
        text_emb = text_emb / np.linalg.norm(text_emb)
        return float(np.dot(image_emb, text_emb))
```
Multimodal Index with Qdrant
```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct


class MultimodalIndex:
    def __init__(self, text_embedder, image_embedder, qdrant_url: str = "localhost"):
        self.text_embedder = text_embedder
        self.image_embedder = image_embedder
        self.client = QdrantClient(host=qdrant_url, port=6333)

    def create_collection(self, name: str):
        """Create a collection with named text AND image vectors."""
        self.client.create_collection(
            collection_name=name,
            vectors_config={
                "text": VectorParams(size=768, distance=Distance.COSINE),
                "image": VectorParams(size=768, distance=Distance.COSINE),
            },
        )

    def index_document(self, doc_id: str, content: dict, collection: str):
        """Index a multimodal document."""
        vectors = {}
        # Text embedding, if present
        if content.get("text"):
            vectors["text"] = self.text_embedder.encode(content["text"]).tolist()
        # Image embedding, if present
        if content.get("image_path"):
            vectors["image"] = self.image_embedder.embed_image(
                content["image_path"]
            ).tolist()
        point = PointStruct(
            id=hash(doc_id) % (2**63),
            vector=vectors,
            payload={
                "doc_id": doc_id,
                "text": content.get("text", ""),
                "image_path": content.get("image_path"),
                "metadata": content.get("metadata", {}),
            },
        )
        self.client.upsert(collection_name=collection, points=[point])

    def search(
        self,
        query: str,
        collection: str,
        query_image_path: str = None,
        top_k: int = 10,
    ) -> list:
        """Multimodal search: text and/or image."""
        query_vectors = []
        # Text search
        query_vectors.append(("text", self.text_embedder.encode(query).tolist()))
        # Visual search, if an image is provided
        if query_image_path:
            query_vectors.append(
                ("image", self.image_embedder.embed_image(query_image_path).tolist())
            )
        # Run one search per modality, then merge
        all_results = []
        for vector_name, vector in query_vectors:
            results = self.client.search(
                collection_name=collection,
                query_vector=(vector_name, vector),
                limit=top_k,
            )
            all_results.extend(results)
        # Deduplicate and re-score
        return self._merge_results(all_results, top_k)

    def _merge_results(self, results: list, top_k: int) -> list:
        """Merge results from the different modalities."""
        scores = {}
        docs = {}
        for result in results:
            doc_id = result.payload["doc_id"]
            if doc_id not in scores:
                scores[doc_id] = 0.0
                docs[doc_id] = result.payload
            # Simple score-sum fusion (reciprocal-rank fusion, which uses
            # ranks rather than raw scores, is a common alternative)
            scores[doc_id] += result.score
        # Sort by fused score
        sorted_docs = sorted(scores.items(), key=lambda x: x[1], reverse=True)
        return [
            {"doc_id": doc_id, "score": score, **docs[doc_id]}
            for doc_id, score in sorted_docs[:top_k]
        ]
```
Multimodal Generation with GPT-4V / Claude 3
Multimodal Response Pipeline
```python
import base64

from openai import OpenAI


class MultimodalRAG:
    def __init__(self, index, llm_client=None):
        self.index = index
        self.client = llm_client or OpenAI()

    async def query(
        self,
        text_query: str,
        image_query_path: str = None,
        collection: str = "multimodal_kb",
    ) -> dict:
        """Run a multimodal RAG query."""
        # 1. Multimodal search
        results = self.index.search(
            query=text_query,
            collection=collection,
            query_image_path=image_query_path,
            top_k=5,
        )

        # 2. Prepare the multimodal context
        context_parts = []
        images_for_llm = []
        for result in results:
            if result.get("text"):
                context_parts.append(f"Document: {result['text']}")
            if result.get("image_path"):
                # Load the image for the LLM
                with open(result["image_path"], "rb") as f:
                    img_b64 = base64.b64encode(f.read()).decode()
                images_for_llm.append({
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"},
                })

        # 3. Generate the answer with a multimodal LLM
        response = await self._generate_response(
            query=text_query,
            context="\n\n".join(context_parts),
            images=images_for_llm,
            query_image=image_query_path,
        )
        return {
            "answer": response,
            "sources": results,
            "has_visual_context": len(images_for_llm) > 0,
        }

    async def _generate_response(
        self, query: str, context: str, images: list, query_image: str = None
    ) -> str:
        """Generate a response using a multimodal LLM."""
        messages = [{
            "role": "system",
            "content": (
                "You are a multimodal RAG assistant. You have access to "
                "text documents and images.\n"
                "Rules:\n"
                "1. Base your answers on both provided documents AND images\n"
                "2. Describe what you see in images if relevant\n"
                "3. If the user sends an image, analyze it to answer\n"
                "4. Cite your sources (documents or images)"
            ),
        }]

        # Build the user message
        user_content = []
        # Add the query image, if present
        if query_image:
            with open(query_image, "rb") as f:
                img_b64 = base64.b64encode(f.read()).decode()
            user_content.append({
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"},
            })
        # Add context images, then the text
        user_content.extend(images)
        user_content.append({
            "type": "text",
            "text": (
                f"Document context:\n{context}\n\n"
                f"Question: {query}\n\n"
                "Answer based on the context and images provided."
            ),
        })
        messages.append({"role": "user", "content": user_content})

        response = self.client.chat.completions.create(
            model="gpt-4-vision-preview",
            messages=messages,
            max_tokens=2000,
        )
        return response.choices[0].message.content
```
Specialized Use Cases
Visual Product Search
```python
class VisualProductSearch:
    def __init__(self, product_index, clip_embedder, llm):
        self.index = product_index
        self.clip = clip_embedder
        self.llm = llm  # used by answer_product_question

    async def find_similar_products(
        self, image_path: str, top_k: int = 10, filters: dict = None
    ) -> list:
        """Find products visually similar to an image."""
        # Embed the query image, then search the product index
        query_embedding = self.clip.embed_image(image_path)
        return self.index.search(
            query_vector=query_embedding, top_k=top_k, filters=filters
        )

    async def answer_product_question(
        self, product_image_path: str, question: str
    ) -> str:
        """Answer a question about a product from its image."""
        # Identify the product
        similar = await self.find_similar_products(product_image_path, top_k=1)
        if not similar:
            return "I couldn't find this product in our catalog."
        product = similar[0]

        # Generate an answer with the product as context
        prompt = f"""
Identified product: {product['name']}
Description: {product['description']}
Price: ${product['price']}
Features: {product['specs']}

Customer question: {question}

Respond helpfully and commercially.
"""
        return await self.llm.generate(prompt)
```
Support with Screenshots
```python
class VisualSupportAssistant:
    def __init__(self, rag_pipeline, vision_model):
        self.rag = rag_pipeline
        self.vision = vision_model

    async def analyze_screenshot(
        self, screenshot_path: str, user_description: str = ""
    ) -> dict:
        """Analyze an error screenshot and suggest solutions."""
        # Analyze the image with the vision model
        analysis = await self.vision.analyze(
            screenshot_path,
            prompt="""Analyze this screenshot and identify:
1. The type of application/interface visible
2. Any error message or anomaly
3. The context of the ongoing action
4. Visible technical details (codes, versions)""",
        )

        # Combine with the user's own description
        combined_query = f"""
User description: {user_description}
Screenshot analysis: {analysis}
"""

        # Search the support knowledge base
        results = await self.rag.query(
            text_query=combined_query,
            image_query_path=screenshot_path,
        )
        return {
            "visual_analysis": analysis,
            "suggested_solutions": results["answer"],
            "related_articles": results["sources"],
        }
```
Best Practices
1. Image Preprocessing
- Normalize sizes before embedding
- Apply consistent preprocessing (crop, resize)
- Filter low-quality images
2. Metadata Management
Always preserve image metadata:
- Original source and context
- Dimensions and format
- Creation date
- Generated description
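One way to keep these fields consistent is a small payload type stored alongside each image embedding. The field names below are illustrative, not a required schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ImageMetadata:
    """Payload stored next to each image embedding (field names illustrative)."""
    source: str        # original file path or URL
    context: str       # surrounding document/section
    width: int
    height: int
    format: str        # "png", "jpeg", ...
    description: str = ""  # caption generated by the vision model
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_payload(self) -> dict:
        """Flatten to a dict suitable for a vector store payload."""
        return self.__dict__.copy()
```

Keeping the generated description in the payload lets you re-index with a better captioning model later without re-extracting the images.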
3. Graceful Fallback
```python
import logging

logger = logging.getLogger(__name__)


async def safe_multimodal_query(query: str, image: str = None):
    """Handle cases where the multimodal path fails."""
    try:
        if image:
            return await multimodal_rag.query(query, image)
    except Exception as e:
        logger.warning(f"Multimodal failed: {e}, falling back to text-only")
    # Fall back to text-only RAG
    return await text_rag.query(query)
```
Learn More
- Retrieval Fundamentals - Search basics
- Introduction to RAG - Overview
- Dense Retrieval - Advanced embeddings
Turnkey Multimodal RAG with Ailog
Implementing multimodal RAG requires integrating many technologies. With Ailog, access these capabilities without complexity:
- Advanced PDF extraction with table and chart recognition
- Integrated OCR for scanned documents
- Image indexing with automatic descriptions
- Hybrid search text + visual
- Multimodal LLMs (GPT-4V, Claude 3) preconfigured
Try Ailog for free and index all your content, regardless of format.