Image RAG: Vision Models and Visual Search
Complete guide to integrating images into your RAG system: vision models, multimodal embeddings, indexing and visual search with GPT-4V, Claude Vision and CLIP.
Traditional RAG systems are limited to text. Yet a large portion of enterprise information is visual: product photos, screenshots, charts, technical diagrams. Image RAG enables indexing and searching through visual content with the same precision as text-based RAG.
Why Integrate Images into RAG?
Visual Data in Business
- E-commerce: 70% of purchase decisions are influenced by product images
- Technical support: Screenshots accelerate ticket resolution by 60%
- Documentation: A diagram is often worth more than a page of text
- Compliance: Site photos, property inspections, visual evidence
Concrete Use Cases
| Sector | Usage | Example Query |
|---|---|---|
| E-commerce | Visual search | "Find dresses similar to this photo" |
| Real estate | Property analysis | "Show me modern equipped kitchens" |
| IT support | Diagnosis | "What is this error message?" |
| Manufacturing | Quality control | "Does this part have a defect?" |
Image RAG Architecture
Overview
┌─────────────────────────────────────────────────────────────┐
│ IMAGE RAG PIPELINE │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ Image │───▶│ Vision │───▶│ Embedding │ │
│ │ Input │ │ Model │ │ (CLIP/SigLIP) │ │
│ └──────────┘ └──────────────┘ └──────────────────┘ │
│ │ │ │ │
│ │ ▼ ▼ │
│ │ ┌──────────────┐ ┌──────────────────┐ │
│ │ │ Text │ │ Vector Store │ │
│ │ │ Description │ │ (Qdrant/Pine) │ │
│ │ └──────────────┘ └──────────────────┘ │
│ │ │ │ │
│ │ ▼ ▼ │
│ │ ┌─────────────────────────────────┐ │
│ └────────▶│ Multimodal Retrieval │ │
│ └─────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────┐ │
│ │ Generation (VLM/LLM) │ │
│ └─────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
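The query-time half of this flow (embed the query, retrieve from the vector store, generate with a VLM) can be sketched as plain Python wiring. The function below is a hedged illustration, not a specific library API: each stage is injected as a callable so any of the models discussed later can fill the slot.

```python
from typing import Callable


def image_rag_answer(
    query: str,
    embed_text: Callable[[str], list[float]],          # CLIP-style text encoder
    search: Callable[[list[float], int], list[dict]],  # vector store lookup
    generate: Callable[[str, list[dict]], str],        # VLM/LLM answer generation
    top_k: int = 5,
) -> str:
    """Query-time pipeline: embed -> multimodal retrieval -> generation."""
    query_vector = embed_text(query)   # project the query into the shared space
    hits = search(query_vector, top_k) # retrieve the closest images
    return generate(query, hits)       # answer grounded in the retrieved images
```

The same skeleton works for both approaches below; only the `embed_text` and `search` implementations change.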
Two Approaches
1. Description + Text RAG
- Vision model describes the image as text
- Text is indexed traditionally
- Simpler but loses visual information
2. Native Multimodal Embeddings
- Image is converted directly to vector
- Preserves complete visual information
- Enables image-to-image search
Vision Models for RAG
Proprietary Models
| Model | Max Resolution | Cost | Strengths |
|---|---|---|---|
| GPT-4V | 2048x2048 | $0.00765/image (low) | Complex reasoning, excellent OCR |
| Claude 3.5 Sonnet Vision | 8192x8192 | $0.003/image | Detailed analysis, safety |
| Gemini 1.5 Pro | Unlimited | $0.001315/image | Multi-image, long context |
Open Source Models
| Model | Params | VRAM | Usage |
|---|---|---|---|
| LLaVA 1.6 | 34B | 24GB | General description |
| CogVLM2 | 19B | 16GB | Fine understanding |
| InternVL2 | 76B | 48GB | SOTA performance |
| Qwen-VL-Max | 72B | 48GB | Multilingual |
Multimodal Embedding Models
| Model | Dimension | Languages | Open source |
|---|---|---|---|
| CLIP (OpenAI) | 512/768 | EN primarily | Yes |
| SigLIP | 384-1152 | Multilingual | Yes |
| Jina CLIP v2 | 1024 | 89 languages | Yes |
| Cohere Embed v3 | 1024 | 100+ languages | No |
Practical Implementation
Step 1: Image Extraction and Description
```python
import base64

from openai import OpenAI


def describe_image_for_rag(image_path: str, context: str = "") -> dict:
    """
    Generate a RAG-optimized description of an image.
    """
    client = OpenAI()

    # Encode image to base64
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode("utf-8")

    # Determine MIME type
    mime_type = "image/jpeg" if image_path.endswith((".jpg", ".jpeg")) else "image/png"

    prompt = """Analyze this image for a RAG system. Provide:

1. **General description** (2-3 sentences)
2. **Key elements** (bullet list of important objects/concepts)
3. **Visible text** (any readable text in the image)
4. **Suggested metadata** (category, relevant tags)

Be exhaustive but concise. The goal is to enable text search on this image."""

    if context:
        prompt += f"\n\nAdditional context: {context}"

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:{mime_type};base64,{image_data}",
                            "detail": "high"
                        }
                    }
                ]
            }
        ],
        max_tokens=1000
    )

    description = response.choices[0].message.content

    return {
        "image_path": image_path,
        "description": description,
        "model": "gpt-4o",
        "tokens_used": response.usage.total_tokens
    }
```
Step 2: Multimodal Embeddings with CLIP
```python
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel


class MultimodalEmbedder:
    def __init__(self, model_name: str = "openai/clip-vit-large-patch14"):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = CLIPModel.from_pretrained(model_name).to(self.device)
        self.processor = CLIPProcessor.from_pretrained(model_name)

    def embed_image(self, image_path: str) -> list[float]:
        """Generate embedding for an image."""
        image = Image.open(image_path).convert("RGB")
        inputs = self.processor(images=image, return_tensors="pt").to(self.device)
        with torch.no_grad():
            embedding = self.model.get_image_features(**inputs)
        # Normalize for cosine similarity
        embedding = embedding / embedding.norm(dim=-1, keepdim=True)
        return embedding.cpu().squeeze().tolist()

    def embed_text(self, text: str) -> list[float]:
        """Generate embedding for text (same space as images)."""
        inputs = self.processor(text=[text], return_tensors="pt", padding=True).to(self.device)
        with torch.no_grad():
            embedding = self.model.get_text_features(**inputs)
        embedding = embedding / embedding.norm(dim=-1, keepdim=True)
        return embedding.cpu().squeeze().tolist()

    def compute_similarity(self, image_path: str, text: str) -> float:
        """Compute image-text similarity."""
        img_emb = torch.tensor(self.embed_image(image_path))
        txt_emb = torch.tensor(self.embed_text(text))
        return (img_emb @ txt_emb).item()
```
Step 3: Indexing in Qdrant
```python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    VectorParams, Distance, PointStruct,
    Filter, FieldCondition, MatchValue
)


class ImageRAGIndex:
    def __init__(self, collection_name: str = "image_rag"):
        self.client = QdrantClient(url="http://localhost:6333")
        self.collection_name = collection_name
        self.embedder = MultimodalEmbedder()

    def create_collection(self, vector_size: int = 768):
        """Create collection with two vector spaces."""
        self.client.recreate_collection(
            collection_name=self.collection_name,
            vectors_config={
                # Visual embedding (CLIP)
                "visual": VectorParams(
                    size=vector_size,
                    distance=Distance.COSINE
                ),
                # Text embedding (description)
                "textual": VectorParams(
                    size=1536,  # Ada-002 or similar
                    distance=Distance.COSINE
                )
            }
        )

    def index_image(
        self,
        image_id: str,
        image_path: str,
        description: str,
        text_embedding: list[float],
        metadata: dict = None
    ):
        """Index an image with both embeddings."""
        visual_embedding = self.embedder.embed_image(image_path)

        point = PointStruct(
            id=hash(image_id) % (2**63),
            vector={
                "visual": visual_embedding,
                "textual": text_embedding
            },
            payload={
                "image_id": image_id,
                "image_path": image_path,
                "description": description,
                **(metadata or {})
            }
        )

        self.client.upsert(
            collection_name=self.collection_name,
            points=[point]
        )

    def search_by_text(self, query: str, limit: int = 5) -> list[dict]:
        """Search by text query."""
        query_embedding = self.embedder.embed_text(query)

        results = self.client.search(
            collection_name=self.collection_name,
            query_vector=("visual", query_embedding),  # CLIP text -> visual
            limit=limit
        )

        return [
            {
                "image_path": r.payload["image_path"],
                "description": r.payload["description"],
                "score": r.score
            }
            for r in results
        ]

    def search_by_image(self, image_path: str, limit: int = 5) -> list[dict]:
        """Search for similar images."""
        query_embedding = self.embedder.embed_image(image_path)

        results = self.client.search(
            collection_name=self.collection_name,
            query_vector=("visual", query_embedding),
            limit=limit
        )

        return [
            {
                "image_path": r.payload["image_path"],
                "description": r.payload["description"],
                "score": r.score
            }
            for r in results
        ]
```
Step 4: Generation with Visual Context
```python
import base64

from openai import OpenAI


def generate_with_images(
    query: str,
    retrieved_images: list[dict],
    client: OpenAI
) -> str:
    """
    Generate a response using retrieved images as context.
    """
    # Prepare multimodal content
    content = [
        {
            "type": "text",
            "text": f"""You are an assistant that answers questions using the provided images as information source.

User question: {query}

Available images:"""
        }
    ]

    # Add each image with its description
    for i, img in enumerate(retrieved_images[:3], 1):  # Max 3 images
        with open(img["image_path"], "rb") as f:
            img_data = base64.b64encode(f.read()).decode("utf-8")

        content.append({
            "type": "text",
            "text": f"\n**Image {i}** (score: {img['score']:.2f}):\n{img['description']}"
        })
        content.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{img_data}",
                "detail": "low"  # Save tokens
            }
        })

    content.append({
        "type": "text",
        "text": "\n\nAnswer the question based only on these images. If the images don't allow answering, say so clearly."
    })

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        max_tokens=1000
    )

    return response.choices[0].message.content
```
Advanced Optimizations
High-Resolution Image Chunking
For large images (plans, diagrams), split into tiles:
```python
from PIL import Image


def tile_large_image(image_path: str, tile_size: int = 512, overlap: int = 64):
    """Split a large image into overlapping tiles."""
    img = Image.open(image_path)
    width, height = img.size

    tiles = []
    for y in range(0, height - overlap, tile_size - overlap):
        for x in range(0, width - overlap, tile_size - overlap):
            box = (x, y, min(x + tile_size, width), min(y + tile_size, height))
            tile = img.crop(box)
            tiles.append({
                "tile": tile,
                "position": (x, y),
                "original_size": (width, height)
            })

    return tiles
```
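Each tile becomes its own vector, so index size and embedding cost grow quickly with image dimensions. A quick way to estimate the tile count per image, matching the stride `tile_size - overlap` used in the tiling loop:

```python
import math


def estimate_tile_count(width: int, height: int,
                        tile_size: int = 512, overlap: int = 64) -> int:
    """Number of tiles produced when stepping by (tile_size - overlap) per axis."""
    stride = tile_size - overlap
    cols = math.ceil((width - overlap) / stride)
    rows = math.ceil((height - overlap) / stride)
    return rows * cols
```

For example, a 2048x2048 plan with the defaults yields 25 tiles, i.e. 25 CLIP embeddings for a single source image.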
Hybrid Image + Text Search
```python
def hybrid_image_search(
    query: str,
    text_embedding: list[float],
    index: ImageRAGIndex,
    alpha: float = 0.7  # Weight of visual vs textual
) -> list[dict]:
    """Combine visual and text search."""
    # Visual search (CLIP)
    visual_results = index.search_by_text(query, limit=20)

    # Text search (on descriptions)
    text_results = index.client.search(
        collection_name=index.collection_name,
        query_vector=("textual", text_embedding),
        limit=20
    )

    # Score fusion with RRF
    combined_scores = {}
    for rank, r in enumerate(visual_results):
        img_id = r["image_path"]
        combined_scores[img_id] = combined_scores.get(img_id, 0) + alpha / (rank + 60)

    for rank, r in enumerate(text_results):
        img_id = r.payload["image_path"]
        combined_scores[img_id] = combined_scores.get(img_id, 0) + (1 - alpha) / (rank + 60)

    # Sort by combined score
    sorted_results = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)
    return [{"image_path": path, "score": score} for path, score in sorted_results[:5]]
```
Benchmarks and Costs
Retrieval Performance
| Method | Precision@5 | Recall@10 | Latency |
|---|---|---|---|
| Description only | 0.72 | 0.81 | 50ms |
| CLIP only | 0.78 | 0.85 | 30ms |
| Hybrid | 0.84 | 0.91 | 80ms |
Cost per Indexed Image
| Step | Estimated Cost | Notes |
|---|---|---|
| GPT-4V description | $0.01-0.03 | Depends on size and detail |
| CLIP embedding | $0 (local) | GPU recommended |
| Qdrant storage | ~$0.0001 | Per vector/month |
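To budget an indexing run, the rough per-image figures above can be combined into a quick estimate. The defaults below are assumptions taken from the table (midpoint of the description range), not quoted pricing:

```python
def indexing_cost_usd(n_images: int, months: int = 1,
                      description_cost: float = 0.02,        # GPT-4V description, midpoint
                      storage_cost_per_month: float = 0.0001  # per vector per month
                      ) -> float:
    """One-off description cost plus vector storage; CLIP embedding assumed local ($0)."""
    return n_images * (description_cost + storage_cost_per_month * months)
```

For example, 10,000 images held for a year comes to roughly $212 under these assumptions.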
Embedding Model Comparison
| Model | Zero-shot accuracy | Multilingual | Speed |
|---|---|---|---|
| CLIP ViT-L/14 | 75.5% | No | Fast |
| SigLIP So400m | 83.1% | Yes | Medium |
| Jina CLIP v2 | 81.2% | Yes | Fast |
Pitfalls and Solutions
Problem 1: Images with Little Visual Content
Symptom: Text screenshots are poorly indexed by CLIP.
Solution: Explicit OCR + text indexing.
```python
import pytesseract
from PIL import Image


def extract_text_from_image(image_path: str) -> str:
    """Extract text from image via OCR."""
    img = Image.open(image_path)
    text = pytesseract.image_to_string(img, lang='eng+fra')
    return text.strip()
```
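The OCR output can then be folded into the vision-model description before computing the textual embedding, so screenshots become searchable by their on-screen text. A hypothetical helper, assuming both strings are already available:

```python
def build_indexable_text(description: str, ocr_text: str,
                         max_ocr_chars: int = 2000) -> str:
    """Concatenate description and (truncated) OCR text into one indexable document."""
    parts = [description.strip()]
    ocr_text = ocr_text.strip()
    if ocr_text:
        # Cap OCR length so a text-dense screenshot doesn't drown the description
        parts.append("Visible text (OCR):\n" + ocr_text[:max_ocr_chars])
    return "\n\n".join(parts)
```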
Problem 2: Visual Duplicates
Symptom: Multiple nearly identical images pollute results.
Solution: Similarity-based deduplication.
```python
import numpy as np


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def deduplicate_images(embeddings: list, threshold: float = 0.95) -> list[int]:
    """Remove images that are too similar; returns indices to keep."""
    keep = []
    for i, emb in enumerate(embeddings):
        is_duplicate = False
        for j in keep:
            if cosine_similarity(emb, embeddings[j]) > threshold:
                is_duplicate = True
                break
        if not is_duplicate:
            keep.append(i)
    return keep
```
Problem 3: Visual vs Textual Context Contradiction
Symptom: Generated description contradicts the image.
Solution: Cross-validation and confidence score.
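One simple form of cross-validation: score the generated description against the image itself (e.g. with `compute_similarity` from Step 2, or any image-text scorer) and gate indexing on that score. The sketch below only implements the gating logic; the 0.25 threshold is an assumption to tune on your own data, not a CLIP-recommended value:

```python
def validate_description(similarity: float, min_confidence: float = 0.25) -> dict:
    """Gate a generated description on its image-text similarity score."""
    accepted = similarity >= min_confidence
    return {
        "confidence": similarity,
        "accepted": accepted,
        # Low-agreement pairs go to a review queue instead of the index
        "action": "index" if accepted else "flag_for_review",
    }
```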
Integration with Ailog
Ailog natively supports image indexing in your knowledge bases:
- Upload: Drag and drop your images in the interface
- Automatic analysis: Vision model for content extraction
- Hybrid indexing: Visual + text embeddings
- Unified search: Single query for text and images
Try Image RAG on Ailog - No configuration required.