
Breakthrough in Multimodal RAG: New Framework Handles Text, Images, and Tables

October 15, 2025
6 min read
Ailog Research Team

Stanford and Google DeepMind researchers present MM-RAG, a unified framework for retrieving and reasoning over multiple modalities, reporting accuracy gains of up to 65% on image-only queries (46.8% overall) on their MixedQA benchmark.

Introduction

A collaborative team from Stanford and Google DeepMind has published research on MM-RAG (Multimodal Retrieval-Augmented Generation), a framework that seamlessly handles retrieval across text, images, tables, and charts within a single system.

The Multimodal Challenge

Traditional RAG systems focus on text, but real-world documents contain:

  • Images and diagrams
  • Tables and spreadsheets
  • Charts and graphs
  • Mixed layouts

Existing approaches either ignore non-text content or process each modality separately, leading to fragmented understanding.

MM-RAG Architecture

Unified Embedding Space

MM-RAG uses CLIP-based encoders to project all modalities into a shared embedding space:

Text   → Text Encoder   ─┐
Images → Vision Encoder ─┼→ [Shared 1024-dim space] → Vector DB
Tables → Table Encoder  ─┘
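
As a rough illustration of what a shared embedding space looks like in practice, the sketch below encodes text and images with an off-the-shelf CLIP checkpoint from Hugging Face; the checkpoint, its 512-dimensional projection (vs. MM-RAG's 1024), and the `embed_*` helpers are assumptions for illustration, not the released MM-RAG encoders.

```python
# Minimal sketch: project text and images into one shared embedding space
# with an off-the-shelf CLIP checkpoint (512-dim, not MM-RAG's 1024-dim space).
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_text(text: str):
    inputs = processor(text=[text], return_tensors="pt", padding=True, truncation=True)
    return model.get_text_features(**inputs)[0]   # shape: (512,)

def embed_image(path: str):
    inputs = processor(images=Image.open(path), return_tensors="pt")
    return model.get_image_features(**inputs)[0]  # shape: (512,)

# Tables can be linearized to text (or rendered to an image) before encoding.
# All vectors land in the same space, so one vector DB index serves every modality.
text_vec = embed_text("authentication flow for the REST API")
image_vec = embed_image("architecture_diagram.png")  # hypothetical file
```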

Cross-Modal Retrieval

The system can retrieve:

  • Text for text queries (standard RAG)
  • Images for visual questions
  • Tables for data queries
  • Mixed results for complex queries

Example query: "Show me the architecture diagram and explain the authentication flow" (a retrieval sketch follows the result list).

Retrieves:

  1. Architecture diagram (image)
  2. Authentication section (text)
  3. API endpoints table (structured data)
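
A minimal sketch of how a single shared-space index can return mixed-modality hits for a query like this one; the `Record` structure, in-memory index, and cosine scoring are illustrative assumptions rather than the released MM-RAG API.

```python
# Minimal sketch: one shared-space index, mixed-modality results.
from dataclasses import dataclass

import numpy as np


@dataclass
class Record:
    content: str        # raw text, an image path, or a serialized table
    modality: str       # "text" | "image" | "table"
    vector: np.ndarray  # embedding in the shared space


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def retrieve(query_vec: np.ndarray, index: list[Record], k: int = 5) -> list[Record]:
    # Rank every record, regardless of modality, by similarity in the shared space.
    return sorted(index, key=lambda r: cosine(query_vec, r.vector), reverse=True)[:k]

# For the query above, the top-k would ideally span modalities: the architecture
# diagram (image), the authentication section (text), and the API endpoints table.
```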

Multimodal Fusion

Retrieved multimodal content is processed by GPT-4V or Gemini Pro Vision:

```python
# Pseudocode
query = "Compare Q3 revenue across regions"

# Retrieve mixed modalities
results = mm_rag.retrieve(query, k=5)
# Returns: [chart_image, revenue_table, text_analysis]

# Generate answer using multimodal LLM
answer = gpt4v.generate(
    text_prompt=query,
    images=[r for r in results if r.type == 'image'],
    tables=[r for r in results if r.type == 'table'],
    context=[r for r in results if r.type == 'text'],
)
```

Benchmark Results

Tested on the newly created MixedQA benchmark (10K questions across modalities):

| Query Type | Baseline | MM-RAG | Improvement (relative) |
|------------|----------|--------|------------------------|
| Text-only  | 78.2%    | 79.1%  | +1.2%                  |
| Image-only | 45.3%    | 74.8%  | +65.1%                 |
| Table-only | 52.1%    | 81.3%  | +56.0%                 |
| Mixed      | 31.2%    | 68.7%  | +120.2%                |
| Overall    | 51.7%    | 75.9%  | +46.8%                 |

Key Innovations

Layout-Aware Chunking

MM-RAG preserves document layout during chunking (a minimal sketch follows the list):

  • Keeps images with their captions
  • Maintains table structure
  • Preserves figure references
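
Here is a minimal sketch of the chunking idea, assuming a layout parser that emits typed elements in reading order; the element dictionaries and their field names are hypothetical, not MM-RAG's actual interface.

```python
# Minimal sketch of layout-aware chunking. Assumes a layout parser that yields
# elements in reading order as dicts like {"type": "text"|"image"|"caption"|"table", ...}.
def layout_aware_chunks(elements: list[dict]) -> list[dict]:
    chunks = []
    i = 0
    while i < len(elements):
        el = elements[i]
        if el["type"] == "image":
            chunk = {"type": "image", "path": el["path"]}
            # Keep the caption that immediately follows the figure in the same chunk.
            if i + 1 < len(elements) and elements[i + 1]["type"] == "caption":
                chunk["caption"] = elements[i + 1]["text"]
                i += 1
            chunks.append(chunk)
        elif el["type"] == "table":
            # Never split a table: it stays one chunk with its structure intact.
            chunks.append({"type": "table", "rows": el["rows"]})
        else:
            # Plain text (including figure references) passes through for normal chunking.
            chunks.append({"type": "text", "text": el["text"]})
        i += 1
    return chunks
```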

Modality Routing

A routing step automatically determines which modalities to retrieve based on the query:

```python
query_intent = analyze_query(query)

if query_intent.needs_visual:
    retrieve_images = True
if query_intent.needs_data:
    retrieve_tables = True

# Always retrieve text as context
retrieve_text = True
```

Cross-Modal Reranking

After retrieval, a cross-modal reranker scores relevance (a minimal sketch follows the list):

  • Text-to-image relevance
  • Table-to-query relevance
  • Overall coherence of mixed results
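
A minimal sketch of that rerank step, assuming modality-specific relevance scorers are available (for example a CLIP similarity for text-to-image relevance); the scorer registry and candidate format below are hypothetical, not MM-RAG's interface.

```python
# Minimal sketch of cross-modal reranking: score each candidate with a scorer
# matched to its modality, then re-sort. The `scorers` registry and the
# candidate dicts are hypothetical placeholders.
def rerank(query: str, candidates: list[dict], scorers: dict) -> list[dict]:
    # e.g. scorers = {"image": clip_score, "table": table_score, "text": text_score}
    for cand in candidates:
        score_fn = scorers[cand["modality"]]
        cand["rerank_score"] = score_fn(query, cand["content"])
    # A coherence bonus for well-matched mixed result sets could be added here.
    return sorted(candidates, key=lambda c: c["rerank_score"], reverse=True)
```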

Applications

MM-RAG excels in:

Scientific Research

  • Retrieve figures from papers
  • Answer questions about experimental results
  • Compare data across studies

Business Intelligence

  • Query dashboards and reports
  • Extract insights from charts
  • Analyze tabular data

Technical Documentation

  • Find relevant diagrams
  • Understand architecture from visuals
  • Connect text explanations with illustrations

Education

  • Visual learning materials
  • Interactive textbook Q&A
  • Diagram-based explanations

Implementation Considerations

Computational Costs

Processing images and tables is expensive:

  • Image encoding: 10x slower than text
  • Table parsing: 5x slower than text
  • Multimodal LLMs: 2-3x more expensive

Storage Requirements

Embedding all modalities increases storage:

  • Text: 768-1536 dimensions
  • Images: 512-1024 dimensions + original image
  • Tables: Structured representation + embeddings

Estimated: 3-5x storage increase vs. text-only RAG
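
As a back-of-the-envelope check on that figure, the sketch below totals embedding and raw-asset storage for a hypothetical corpus; the corpus size, image count, and per-item sizes are illustrative assumptions, not numbers from the paper.

```python
# Back-of-the-envelope storage estimate; every number here is an
# illustrative assumption, not a figure from the paper.
num_chunks = 1_000_000
dim, bytes_per_float = 1024, 4        # shared 1024-dim float32 embeddings

text_only = num_chunks * dim * bytes_per_float  # ~4.1 GB of text embeddings

multimodal = (
    num_chunks * dim * bytes_per_float   # shared-space embeddings for all chunks
    + 200_000 * 50_000                   # ~200k stored images at ~50 KB each
    + 100_000 * 20_000                   # ~100k tables, structured representation
)

print(f"text-only: {text_only / 1e9:.1f} GB, "
      f"multimodal: {multimodal / 1e9:.1f} GB "
      f"(~{multimodal / text_only:.1f}x)")
# → text-only: 4.1 GB, multimodal: 16.1 GB (~3.9x), in line with the 3-5x estimate
```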

Quality Dependencies

MM-RAG quality depends on:

  • OCR accuracy for scanned documents
  • Table extraction quality
  • Image resolution and clarity
  • Multimodal LLM capabilities

Open Source Release

The team has released:

  • MM-RAG framework (Apache 2.0 license)
  • MixedQA benchmark dataset
  • Pretrained cross-modal encoders
  • Evaluation scripts

Available at: github.com/stanford-futuredata/mm-rag

Industry Adoption

Early adopters include:

  • Technical documentation platforms
  • Legal document analysis tools
  • Scientific literature search engines
  • Business intelligence providers

Limitations

Current limitations include:

  • Video not yet supported
  • Audio processing limited
  • Real-time performance challenges
  • High resource requirements

Future Work

Planned improvements:

  • Video frame retrieval
  • Audio transcription integration
  • Reduced computational overhead
  • Better handling of complex layouts

Conclusion

MM-RAG represents a significant step toward truly multimodal AI assistants that can understand and reason across all content types in documents, not just text. As multimodal LLMs improve, systems like MM-RAG will become increasingly practical for real-world applications.

Tags

multimodal, research, computer vision, RAG
