Image RAG: Vision Models and Visual Search
Complete guide to integrating images into your RAG system: vision models, multimodal embeddings, indexing and visual search with GPT-4V, Claude Vision and CLIP.
Traditional RAG systems are limited to text. Yet a large portion of enterprise information is visual: product photos, screenshots, charts, technical diagrams. Image RAG enables indexing and searching through visual content with the same precision as text-based RAG.
Why Integrate Images into RAG?
Visual Data in Business
- E-commerce: 70% of purchase decisions are influenced by product images
- Technical support: Screenshots accelerate ticket resolution by 60%
- Documentation: A diagram is often worth more than a page of text
- Compliance: Site photos, property inspections, visual evidence
Concrete Use Cases
| Sector | Usage | Example Query |
|---|---|---|
| E-commerce | Visual search | "Find dresses similar to this photo" |
| Real estate | Property analysis | "Show me modern equipped kitchens" |
| IT support | Diagnosis | "What is this error message?" |
| Manufacturing | Quality control | "Does this part have a defect?" |
Image RAG Architecture
Overview
┌─────────────────────────────────────────────────────────────┐
│ IMAGE RAG PIPELINE │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ Image │───▶│ Vision │───▶│ Embedding │ │
│ │ Input │ │ Model │ │ (CLIP/SigLIP) │ │
│ └──────────┘ └──────────────┘ └──────────────────┘ │
│ │ │ │ │
│ │ ▼ ▼ │
│ │ ┌──────────────┐ ┌──────────────────┐ │
│ │ │ Text │ │ Vector Store │ │
│ │ │ Description │ │ (Qdrant/Pine) │ │
│ │ └──────────────┘ └──────────────────┘ │
│ │ │ │ │
│ │ ▼ ▼ │
│ │ ┌─────────────────────────────────┐ │
│ └────────▶│ Multimodal Retrieval │ │
│ └─────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────┐ │
│ │ Generation (VLM/LLM) │ │
│ └─────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
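The query-time half of this flow (embed the query, retrieve from the vector store, generate with a VLM) can be sketched as plain Python wiring. The function below is a hedged illustration, not a specific library API: each stage is injected as a callable so any of the models discussed later can fill the slot.

```python
from typing import Callable


def image_rag_answer(
    query: str,
    embed_text: Callable[[str], list[float]],          # CLIP-style text encoder
    search: Callable[[list[float], int], list[dict]],  # vector store lookup
    generate: Callable[[str, list[dict]], str],        # VLM/LLM answer generation
    top_k: int = 5,
) -> str:
    """Query-time pipeline: embed -> multimodal retrieval -> generation."""
    query_vector = embed_text(query)   # project the query into the shared space
    hits = search(query_vector, top_k) # retrieve the closest images
    return generate(query, hits)       # answer grounded in the retrieved images
```

The same skeleton works for both approaches below; only the `embed_text` and `search` implementations change.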
Two Approaches
1. Description + Text RAG
- Vision model describes the image as text
- Text is indexed traditionally
- Simpler but loses visual information
2. Native Multimodal Embeddings
- Image is converted directly to vector
- Preserves complete visual information
- Enables image-to-image search
Vision Models for RAG
Proprietary Models
| Model | Max Resolution | Cost | Strengths |
|---|---|---|---|
| GPT-4V | 2048x2048 | $0.00765/image (low) | Complex reasoning, excellent OCR |
| Claude 3.5 Sonnet Vision | 8192x8192 | $0.003/image | Detailed analysis, safety |
| Gemini 1.5 Pro | Unlimited | $0.001315/image | Multi-image, long context |
Open Source Models
| Model | Params | VRAM | Usage |
|---|---|---|---|
| LLaVA 1.6 | 34B | 24GB | General description |
| CogVLM2 | 19B | 16GB | Fine understanding |
| InternVL2 | 76B | 48GB | SOTA performance |
| Qwen-VL-Max | 72B | 48GB | Multilingual |
Multimodal Embedding Models
| Model | Dimension | Languages | Open source |
|---|---|---|---|
| CLIP (OpenAI) | 512/768 | EN primarily | Yes |
| SigLIP | 384-1152 | Multilingual | Yes |
| Jina CLIP v2 | 1024 | 89 languages | Yes |
| Cohere Embed v3 | 1024 | 100+ languages | No |
Practical Implementation
Step 1: Image Extraction and Description
```python
import base64

from openai import OpenAI


def describe_image_for_rag(image_path: str, context: str = "") -> dict:
    """
    Generate a RAG-optimized description of an image.
    """
    client = OpenAI()

    # Encode image to base64
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode("utf-8")

    # Determine MIME type
    mime_type = "image/jpeg" if image_path.endswith((".jpg", ".jpeg")) else "image/png"

    prompt = """Analyze this image for a RAG system. Provide:

1. **General description** (2-3 sentences)
2. **Key elements** (bullet list of important objects/concepts)
3. **Visible text** (any readable text in the image)
4. **Suggested metadata** (category, relevant tags)

Be exhaustive but concise. The goal is to enable text search on this image."""

    if context:
        prompt += f"\n\nAdditional context: {context}"

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:{mime_type};base64,{image_data}",
                            "detail": "high"
                        }
                    }
                ]
            }
        ],
        max_tokens=1000
    )

    description = response.choices[0].message.content

    return {
        "image_path": image_path,
        "description": description,
        "model": "gpt-4o",
        "tokens_used": response.usage.total_tokens
    }
```
Step 2: Multimodal Embeddings with CLIP
```python
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel


class MultimodalEmbedder:
    def __init__(self, model_name: str = "openai/clip-vit-large-patch14"):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = CLIPModel.from_pretrained(model_name).to(self.device)
        self.processor = CLIPProcessor.from_pretrained(model_name)

    def embed_image(self, image_path: str) -> list[float]:
        """Generate embedding for an image."""
        image = Image.open(image_path).convert("RGB")
        inputs = self.processor(images=image, return_tensors="pt").to(self.device)
        with torch.no_grad():
            embedding = self.model.get_image_features(**inputs)
        # Normalize for cosine similarity
        embedding = embedding / embedding.norm(dim=-1, keepdim=True)
        return embedding.cpu().squeeze().tolist()

    def embed_text(self, text: str) -> list[float]:
        """Generate embedding for text (same space as images)."""
        inputs = self.processor(text=[text], return_tensors="pt", padding=True).to(self.device)
        with torch.no_grad():
            embedding = self.model.get_text_features(**inputs)
        embedding = embedding / embedding.norm(dim=-1, keepdim=True)
        return embedding.cpu().squeeze().tolist()

    def compute_similarity(self, image_path: str, text: str) -> float:
        """Compute image-text similarity."""
        img_emb = torch.tensor(self.embed_image(image_path))
        txt_emb = torch.tensor(self.embed_text(text))
        return (img_emb @ txt_emb).item()
```
Step 3: Indexing in Qdrant
```python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    VectorParams, Distance, PointStruct,
    Filter, FieldCondition, MatchValue
)


class ImageRAGIndex:
    def __init__(self, collection_name: str = "image_rag"):
        self.client = QdrantClient(url="http://localhost:6333")
        self.collection_name = collection_name
        self.embedder = MultimodalEmbedder()

    def create_collection(self, vector_size: int = 768):
        """Create collection with two vector spaces."""
        self.client.recreate_collection(
            collection_name=self.collection_name,
            vectors_config={
                # Visual embedding (CLIP)
                "visual": VectorParams(
                    size=vector_size,
                    distance=Distance.COSINE
                ),
                # Text embedding (description)
                "textual": VectorParams(
                    size=1536,  # Ada-002 or similar
                    distance=Distance.COSINE
                )
            }
        )

    def index_image(
        self,
        image_id: str,
        image_path: str,
        description: str,
        text_embedding: list[float],
        metadata: dict = None
    ):
        """Index an image with both embeddings."""
        visual_embedding = self.embedder.embed_image(image_path)

        point = PointStruct(
            id=hash(image_id) % (2**63),
            vector={
                "visual": visual_embedding,
                "textual": text_embedding
            },
            payload={
                "image_id": image_id,
                "image_path": image_path,
                "description": description,
                **(metadata or {})
            }
        )

        self.client.upsert(
            collection_name=self.collection_name,
            points=[point]
        )

    def search_by_text(self, query: str, limit: int = 5) -> list[dict]:
        """Search by text query."""
        query_embedding = self.embedder.embed_text(query)

        results = self.client.search(
            collection_name=self.collection_name,
            query_vector=("visual", query_embedding),  # CLIP text -> visual
            limit=limit
        )

        return [
            {
                "image_path": r.payload["image_path"],
                "description": r.payload["description"],
                "score": r.score
            }
            for r in results
        ]

    def search_by_image(self, image_path: str, limit: int = 5) -> list[dict]:
        """Search for similar images."""
        query_embedding = self.embedder.embed_image(image_path)

        results = self.client.search(
            collection_name=self.collection_name,
            query_vector=("visual", query_embedding),
            limit=limit
        )

        return [
            {
                "image_path": r.payload["image_path"],
                "description": r.payload["description"],
                "score": r.score
            }
            for r in results
        ]
```
Step 4: Generation with Visual Context
```python
import base64

from openai import OpenAI


def generate_with_images(
    query: str,
    retrieved_images: list[dict],
    client: OpenAI
) -> str:
    """
    Generate a response using retrieved images as context.
    """
    # Prepare multimodal content
    content = [
        {
            "type": "text",
            "text": f"""You are an assistant that answers questions using the provided images as information source.

User question: {query}

Available images:"""
        }
    ]

    # Add each image with its description
    for i, img in enumerate(retrieved_images[:3], 1):  # Max 3 images
        with open(img["image_path"], "rb") as f:
            img_data = base64.b64encode(f.read()).decode("utf-8")

        content.append({
            "type": "text",
            "text": f"\n**Image {i}** (score: {img['score']:.2f}):\n{img['description']}"
        })
        content.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{img_data}",
                "detail": "low"  # Save tokens
            }
        })

    content.append({
        "type": "text",
        "text": "\n\nAnswer the question based only on these images. If the images don't allow answering, say so clearly."
    })

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        max_tokens=1000
    )

    return response.choices[0].message.content
```
Advanced Optimizations
High-Resolution Image Chunking
For large images (plans, diagrams), split into tiles:
```python
from PIL import Image


def tile_large_image(image_path: str, tile_size: int = 512, overlap: int = 64):
    """Split a large image into overlapping tiles."""
    img = Image.open(image_path)
    width, height = img.size

    tiles = []
    for y in range(0, height - overlap, tile_size - overlap):
        for x in range(0, width - overlap, tile_size - overlap):
            box = (x, y, min(x + tile_size, width), min(y + tile_size, height))
            tile = img.crop(box)
            tiles.append({
                "tile": tile,
                "position": (x, y),
                "original_size": (width, height)
            })

    return tiles
```
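Each tile becomes its own vector, so index size and embedding cost grow quickly with image dimensions. A quick way to estimate the tile count per image, matching the stride `tile_size - overlap` used in the tiling loop:

```python
import math


def estimate_tile_count(width: int, height: int,
                        tile_size: int = 512, overlap: int = 64) -> int:
    """Number of tiles produced when stepping by (tile_size - overlap) per axis."""
    stride = tile_size - overlap
    cols = math.ceil((width - overlap) / stride)
    rows = math.ceil((height - overlap) / stride)
    return rows * cols
```

For example, a 2048x2048 plan with the defaults yields 25 tiles, i.e. 25 CLIP embeddings for a single source image.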
Hybrid Image + Text Search
```python
def hybrid_image_search(
    query: str,
    text_embedding: list[float],
    index: ImageRAGIndex,
    alpha: float = 0.7  # Weight of visual vs textual
) -> list[dict]:
    """Combine visual and text search."""
    # Visual search (CLIP)
    visual_results = index.search_by_text(query, limit=20)

    # Text search (on descriptions)
    text_results = index.client.search(
        collection_name=index.collection_name,
        query_vector=("textual", text_embedding),
        limit=20
    )

    # Score fusion with RRF
    combined_scores = {}
    for rank, r in enumerate(visual_results):
        img_id = r["image_path"]
        combined_scores[img_id] = combined_scores.get(img_id, 0) + alpha / (rank + 60)

    for rank, r in enumerate(text_results):
        img_id = r.payload["image_path"]
        combined_scores[img_id] = combined_scores.get(img_id, 0) + (1 - alpha) / (rank + 60)

    # Sort by combined score
    sorted_results = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)
    return [{"image_path": path, "score": score} for path, score in sorted_results[:5]]
```
Benchmarks and Costs
Retrieval Performance
| Method | Precision@5 | Recall@10 | Latency |
|---|---|---|---|
| Description only | 0.72 | 0.81 | 50ms |
| CLIP only | 0.78 | 0.85 | 30ms |
| Hybrid | 0.84 | 0.91 | 80ms |
Cost per Indexed Image
| Step | Estimated Cost | Notes |
|---|---|---|
| GPT-4V description | $0.01-0.03 | Depends on size and detail |
| CLIP embedding | $0 (local) | GPU recommended |
| Qdrant storage | ~$0.0001 | Per vector/month |
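To budget an indexing run, the rough per-image figures above can be combined into a quick estimate. The defaults below are assumptions taken from the table (midpoint of the description range), not quoted pricing:

```python
def indexing_cost_usd(n_images: int, months: int = 1,
                      description_cost: float = 0.02,        # GPT-4V description, midpoint
                      storage_cost_per_month: float = 0.0001  # per vector per month
                      ) -> float:
    """One-off description cost plus vector storage; CLIP embedding assumed local ($0)."""
    return n_images * (description_cost + storage_cost_per_month * months)
```

For example, 10,000 images held for a year comes to roughly $212 under these assumptions.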
Embedding Model Comparison
| Model | Zero-shot accuracy | Multilingual | Speed |
|---|---|---|---|
| CLIP ViT-L/14 | 75.5% | No | Fast |
| SigLIP So400m | 83.1% | Yes | Medium |
| Jina CLIP v2 | 81.2% | Yes | Fast |
Pitfalls and Solutions
Problem 1: Images with Little Visual Content
Symptom: Text screenshots are poorly indexed by CLIP.
Solution: Explicit OCR + text indexing.
```python
import pytesseract
from PIL import Image


def extract_text_from_image(image_path: str) -> str:
    """Extract text from image via OCR."""
    img = Image.open(image_path)
    text = pytesseract.image_to_string(img, lang='eng+fra')
    return text.strip()
```
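The OCR output can then be folded into the vision-model description before computing the textual embedding, so screenshots become searchable by their on-screen text. A hypothetical helper, assuming both strings are already available:

```python
def build_indexable_text(description: str, ocr_text: str,
                         max_ocr_chars: int = 2000) -> str:
    """Concatenate description and (truncated) OCR text into one indexable document."""
    parts = [description.strip()]
    ocr_text = ocr_text.strip()
    if ocr_text:
        # Cap OCR length so a text-dense screenshot doesn't drown the description
        parts.append("Visible text (OCR):\n" + ocr_text[:max_ocr_chars])
    return "\n\n".join(parts)
```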
Problem 2: Visual Duplicates
Symptom: Multiple nearly identical images pollute results.
Solution: Similarity-based deduplication.
```python
import numpy as np


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def deduplicate_images(embeddings: list, threshold: float = 0.95) -> list[int]:
    """Remove images that are too similar; returns indices to keep."""
    keep = []
    for i, emb in enumerate(embeddings):
        is_duplicate = False
        for j in keep:
            if cosine_similarity(emb, embeddings[j]) > threshold:
                is_duplicate = True
                break
        if not is_duplicate:
            keep.append(i)
    return keep
```
Problem 3: Visual vs Textual Context Contradiction
Symptom: Generated description contradicts the image.
Solution: Cross-validation and confidence score.
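One simple form of cross-validation: score the generated description against the image itself (e.g. with `compute_similarity` from Step 2, or any image-text scorer) and gate indexing on that score. The sketch below only implements the gating logic; the 0.25 threshold is an assumption to tune on your own data, not a CLIP-recommended value:

```python
def validate_description(similarity: float, min_confidence: float = 0.25) -> dict:
    """Gate a generated description on its image-text similarity score."""
    accepted = similarity >= min_confidence
    return {
        "confidence": similarity,
        "accepted": accepted,
        # Low-agreement pairs go to a review queue instead of the index
        "action": "index" if accepted else "flag_for_review",
    }
```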
Integration with Ailog
Ailog natively supports image indexing in your knowledge bases:
- Upload: Drag and drop your images in the interface
- Automatic analysis: Vision model for content extraction
- Hybrid indexing: Visual + text embeddings
- Unified search: Single query for text and images
Try Image RAG on Ailog - No configuration required.