News

Cohere Embed v4: The First Production Multimodal Embedding

April 23, 2026
6 min read
Ailog Team

Cohere launches Embed v4 Multimodal, the first embedding model capable of vectorizing text, images, and interleaved documents. A revolution for multimodal RAG.

Cohere Revolutionizes Embeddings with Multimodal

Cohere has announced the general availability of Embed v4 Multimodal, a major advancement in the world of embeddings. For the first time, a production model can vectorize text, images, and mixed documents (PDFs, slides, tables) into the same semantic space.

"Embed v4 eliminates the complexity of document parsing," declares Aidan Gomez, CEO of Cohere. "You can now vectorize a PDF as-is, with its images, tables, and text, without preprocessing."

Benchmark Performance

MTEB Results

| Model | MTEB Score | Type | Context |
| --- | --- | --- | --- |
| Cohere Embed v4 | 65.2 | Multimodal | 128K |
| Google Gemini Embedding | 68.3 | Text | 2K |
| Qwen3-Embedding-8B | 70.6 | Text | 8K |
| OpenAI text-embedding-3-large | 64.6 | Text | 8K |
| Voyage-3 | 63.8 | Text | 16K |

The Real Innovation: Multimodal

The MTEB score doesn't tell the whole story. Embed v4 excels in areas where other production models simply don't compete:

| Capability | Embed v4 | Other models |
| --- | --- | --- |
| Pure text | Yes | Yes |
| Images only | Yes | No* |
| Native PDF | Yes | No |
| Visual tables | Yes | No |
| Presentation slides | Yes | No |

*Only a few experimental models support images

To understand the importance of multimodal in embeddings, check out our guide on multimodal RAG.

Technical Innovations

Unified Text-Image Embedding

Embed v4 creates a vector space where text and images coexist:

```python
import cohere

co = cohere.ClientV2('your-api-key')

# Text embedding
text_response = co.embed(
    texts=["Product description"],
    model="embed-v4",
    input_type="search_document",
    embedding_types=["float"]
)

# Image embedding (base64 or URL)
image_response = co.embed(
    images=["data:image/jpeg;base64,..."],
    model="embed-v4",
    input_type="image",
    embedding_types=["float"]
)

# Both embeddings are in the same semantic space!
```

Technical Specifications

| Specification | Value |
| --- | --- |
| Dimensions | 1536 (configurable 256-1536) |
| Text context | 128K tokens |
| Max image size | 2 megapixels |
| Supported languages | 100+ |
| Image formats | JPEG, PNG, WebP, GIF |

Matryoshka Embeddings

Embed v4 supports Matryoshka embeddings, allowing dimension reduction without re-encoding:

```python
# Full dimensions (1536)
full_embedding = co.embed(
    texts=["Your text"],
    model="embed-v4",
    embedding_types=["float"]
)

# Reduced dimensions (256) - same vector truncated
compact_embedding = co.embed(
    texts=["Your text"],
    model="embed-v4",
    embedding_types=["float"],
    output_dimension=256  # Matryoshka truncation
)
```
| Dimensions | Quality loss | Storage reduction |
| --- | --- | --- |
| 1536 | 0% | Baseline |
| 1024 | -0.5% | 33% |
| 512 | -1.2% | 67% |
| 256 | -2.8% | 83% |

This approach optimizes the cost/quality tradeoff without regenerating all your embeddings.
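The mechanics can be sanity-checked locally without calling the API: Matryoshka truncation keeps the leading dimensions of a vector and renormalizes it. A minimal sketch (the simulated unit-norm vector is our assumption for illustration, not a documented property of Embed v4):

```python
import math
import random

random.seed(0)

# Simulated 1536-dim embedding, normalized to unit length
full = [random.gauss(0, 1) for _ in range(1536)]
norm = math.sqrt(sum(x * x for x in full))
full = [x / norm for x in full]

# Matryoshka truncation: keep the leading 256 dimensions...
compact = full[:256]
# ...then renormalize, the standard practice before cosine search
cnorm = math.sqrt(sum(x * x for x in compact))
compact = [x / cnorm for x in compact]

print(len(compact))  # 256
```

Because the leading dimensions carry most of the signal in a Matryoshka-trained model, the truncated vector remains usable for similarity search at a fraction of the storage cost.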

Impact on RAG Pipelines

End of Complex Parsing

Before Embed v4, vectorizing a PDF required:

  1. Text extraction (PyPDF, pdfplumber)
  2. Image OCR (Tesseract, Azure Vision)
  3. Table detection (Camelot, Tabula)
  4. Context reconstruction
  5. Separate chunking and embedding

With Embed v4:

  1. Screenshot or image of PDF
  2. Direct embedding
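The two steps above reduce to rendering the page and wrapping the resulting bytes in the data-URI form the API accepts. A minimal sketch of the wrapping step, using only the standard library (the rendering itself, e.g. with pdf2image, is out of scope here; the helper name and placeholder bytes are ours):

```python
import base64

def to_data_uri(image_bytes: bytes, mime: str = "image/png") -> str:
    """Wrap raw image bytes in a data URI suitable for an images=[...] input."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"

# In practice image_bytes comes from a PDF page renderer;
# a placeholder byte string stands in here.
uri = to_data_uri(b"\x89PNG\r\n...")
print(uri[:22])  # data:image/png;base64,
```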

"We removed 80% of our preprocessing pipeline," testifies Marie Laurent, CTO of a French legaltech startup. "Retrieval quality improved because the model sees documents like a human does."

Transformed Use Cases

Visual E-commerce

  • Product image search
  • PDF catalogs vectorized as-is
  • Technical sheets with diagrams

Technical Documentation

  • Manuals with diagrams
  • Architecture schemas
  • Annotated screenshots

Legal and Finance

  • Scanned contracts
  • Reports with charts
  • Filled forms

Check out our guide on e-commerce RAG for concrete examples.

Pricing and Availability

Pricing

| Input type | Price |
| --- | --- |
| Text | $0.10 / 1M tokens |
| Images | $0.10 / 1,000 images |
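At these list prices, a rough monthly estimate is simple arithmetic. A sketch with a hypothetical workload of 5M text tokens and 20,000 page images (the workload figures are illustrative, not from Cohere):

```python
TEXT_PRICE_PER_M_TOKENS = 0.10   # $ per 1M tokens
IMAGE_PRICE_PER_K_IMAGES = 0.10  # $ per 1,000 images

def embedding_cost(tokens: int, images: int) -> float:
    """Estimated Embed v4 cost in dollars for a given workload."""
    return (tokens / 1_000_000 * TEXT_PRICE_PER_M_TOKENS
            + images / 1_000 * IMAGE_PRICE_PER_K_IMAGES)

print(round(embedding_cost(5_000_000, 20_000), 2))  # 2.5
```

Note that for image-heavy corpora the per-image charge dominates, which is the "higher price for large image volumes" caveat discussed below.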

Comparison with Competition

| Provider | Price/1M tokens | Multimodal |
| --- | --- | --- |
| Cohere Embed v4 | $0.10 | Yes |
| OpenAI text-embedding-3-large | $0.13 | No |
| Voyage-3 | $0.12 | No |
| Google Gemini Embedding | $0.008 | No |

Availability

Embed v4 is available on:

  • Direct Cohere API
  • Amazon Bedrock
  • Amazon SageMaker JumpStart
  • Azure AI Foundry
  • Google Cloud Vertex AI

Practical Integration

Complete Example: Multimodal RAG

```python
import cohere
from qdrant_client import QdrantClient

co = cohere.ClientV2('your-api-key')
qdrant = QdrantClient(url="http://localhost:6333")

# Index a PDF page as an image
def index_pdf_page(image_base64, metadata):
    response = co.embed(
        images=[f"data:image/png;base64,{image_base64}"],
        model="embed-v4",
        input_type="image",
        embedding_types=["float"]
    )
    qdrant.upsert(
        collection_name="documents",
        points=[{
            "id": metadata["id"],
            "vector": response.embeddings.float[0],
            "payload": metadata
        }]
    )

# Search by text (cross-modal)
def search_by_text(query):
    query_embedding = co.embed(
        texts=[query],
        model="embed-v4",
        input_type="search_query",
        embedding_types=["float"]
    )
    # Find relevant images/PDFs with a text query
    results = qdrant.search(
        collection_name="documents",
        query_vector=query_embedding.embeddings.float[0],
        limit=5
    )
    return results
```

Best Practices

1. Choose the Right input_type

  • search_document: Text to index
  • search_query: User query
  • image: Images to index or search

2. Optimize Images

  • Ideal resolution: 1024x1024 pixels
  • Maximum: 2 megapixels
  • Formats: JPEG for photos, PNG for captures
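To stay under the 2-megapixel cap while preserving aspect ratio, the target size can be computed before resizing with any image library. A small helper (the function name is ours):

```python
def fit_to_megapixels(width: int, height: int, max_pixels: int = 2_000_000):
    """Return (width, height) scaled down to at most max_pixels,
    keeping the aspect ratio. Returns the input unchanged if it fits."""
    pixels = width * height
    if pixels <= max_pixels:
        return width, height
    scale = (max_pixels / pixels) ** 0.5
    return int(width * scale), int(height * scale)

# A 12-megapixel photo (4000x3000) gets scaled to fit under 2 MP
print(fit_to_megapixels(4000, 3000))
```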

3. Batching

```python
# Up to 96 texts or 1000 images per request
response = co.embed(
    images=list_of_images[:1000],
    model="embed-v4",
    input_type="image"
)
```

Our Take

Embed v4 Multimodal is a decisive advancement for RAG applications handling rich documents. The ability to vectorize PDFs, presentations, and images without complex preprocessing radically simplifies architectures.

Strengths:

  • First production multimodal
  • 128K token context
  • Matryoshka for cost optimization
  • Native cloud integration

Points to watch:

  • Pure text MTEB score lower than Qwen3/Gemini
  • Higher price for large image volumes

For new projects with visual documents, Embed v4 is our recommendation. For pure text at very high volume, consider Qwen3-Embedding (open source) or Google Gemini Embedding.

Explore our comprehensive guide on choosing embedding models to deepen this decision.

FAQ

**Does Embed v4 replace OCR?**

Yes, for most cases. Embed v4 directly vectorizes document images (PDFs, scans, screenshots) without prior text extraction. Retrieval quality is often superior because the model captures visual context (layout, tables, charts). Only cases requiring explicit text extraction (for display or editing) still justify OCR.

**Why is its MTEB score lower than Qwen3 or Gemini?**

Embed v4 scores 65.2 on MTEB (pure text), behind Qwen3-Embedding (70.6) and Google Gemini (68.3). But this comparison is incomplete: Embed v4 is the only one to natively support multimodal input. For mixed documents (text + images), it has no equivalent. Evaluate based on your actual use case.

**What do Matryoshka embeddings bring?**

Matryoshka embeddings allow reducing dimensions from 1536 to 256 with only 2.8% quality loss. This reduces vector storage by 83%. Recommended strategy: index at 1536 dimensions, then experiment with reduced dimensions on your test dataset to find the optimal threshold.

**Can I search images with a text query?**

Yes. Since text and images share the same vector space, you can do image-to-image, text-to-image, or image-to-text search. It's ideal for visual e-commerce, product catalogs, or finding similar documents.

**What image resolution should I use?**

Maximum is 2 megapixels. For a good quality/cost balance, use 1024x1024 pixels. For documents with fine text (contracts, invoices), prefer higher resolutions. For simple images (product photos), 512x512 often suffices.

---

**Need to integrate Embed v4 into your RAG application?** [Ailog](https://ailog.fr) offers a RAG-as-a-Service platform that automatically integrates the best embedding models, including Cohere Embed v4 Multimodal. Deploy your AI assistant in minutes.

Tags

RAG, Cohere, embeddings, multimodal, MTEB
