News

Cohere Embed v4: The First Production Multimodal Embedding

April 23, 2026
6 min read
Ailog Team

Cohere launches Embed v4 Multimodal, the first embedding model capable of vectorizing text, images, and interleaved documents. A revolution for multimodal RAG.

Cohere Revolutionizes Embeddings with Multimodal

Cohere has announced the general availability of Embed v4 Multimodal, a major advancement in the world of embeddings. For the first time, a production model can vectorize text, images, and mixed documents (PDFs, slides, tables) into the same semantic space.

"Embed v4 eliminates the complexity of document parsing," declares Aidan Gomez, CEO of Cohere. "You can now vectorize a PDF as-is, with its images, tables, and text, without preprocessing."

Benchmark Performance

MTEB Results

| Model | MTEB Score | Type | Context |
| --- | --- | --- | --- |
| Cohere Embed v4 | 65.2 | Multimodal | 128K |
| Google Gemini Embedding | 68.3 | Text | 2K |
| Qwen3-Embedding-8B | 70.6 | Text | 8K |
| OpenAI text-embedding-3-large | 64.6 | Text | 8K |
| Voyage-3 | 63.8 | Text | 16K |

The Real Innovation: Multimodal

The MTEB score doesn't tell the whole story. Embed v4 excels in areas where other production models simply don't compete:

| Capability | Embed v4 | Other models |
| --- | --- | --- |
| Pure text | Yes | Yes |
| Images only | Yes | No* |
| Native PDF | Yes | No |
| Visual tables | Yes | No |
| Presentation slides | Yes | No |

*Only a few experimental models support images

To understand the importance of multimodal in embeddings, check out our guide on multimodal RAG.

Technical Innovations

Unified Text-Image Embedding

Embed v4 creates a vector space where text and images coexist:

```python
import cohere

co = cohere.ClientV2('your-api-key')

# Text embedding
text_response = co.embed(
    texts=["Product description"],
    model="embed-v4",
    input_type="search_document",
    embedding_types=["float"]
)

# Image embedding (base64 or URL)
image_response = co.embed(
    images=["data:image/jpeg;base64,..."],
    model="embed-v4",
    input_type="image",
    embedding_types=["float"]
)

# Both embeddings are in the same semantic space!
```

Technical Specifications

| Specification | Value |
| --- | --- |
| Dimensions | 1536 (configurable 256-1536) |
| Text context | 128K tokens |
| Max image size | 2 megapixels |
| Supported languages | 100+ |
| Image formats | JPEG, PNG, WebP, GIF |

Matryoshka Embeddings

Embed v4 supports Matryoshka embeddings, allowing dimension reduction without re-encoding:

```python
# Full dimensions (1536)
full_embedding = co.embed(
    texts=["Your text"],
    model="embed-v4",
    embedding_types=["float"]
)

# Reduced dimensions (256) - same vector truncated
compact_embedding = co.embed(
    texts=["Your text"],
    model="embed-v4",
    embedding_types=["float"],
    output_dimension=256  # Matryoshka truncation
)
```
| Dimensions | Quality loss | Storage reduction |
| --- | --- | --- |
| 1536 | 0% | Baseline |
| 1024 | -0.5% | 33% |
| 512 | -1.2% | 67% |
| 256 | -2.8% | 83% |

This approach optimizes the cost/quality tradeoff without regenerating all your embeddings.
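The mechanics can be sanity-checked locally without calling the API: Matryoshka truncation keeps the leading dimensions of a vector and renormalizes it. A minimal sketch (the simulated unit-norm vector is our assumption for illustration, not a documented property of Embed v4):

```python
import math
import random

random.seed(0)

# Simulated 1536-dim embedding, normalized to unit length
full = [random.gauss(0, 1) for _ in range(1536)]
norm = math.sqrt(sum(x * x for x in full))
full = [x / norm for x in full]

# Matryoshka truncation: keep the leading 256 dimensions...
compact = full[:256]
# ...then renormalize, the standard practice before cosine search
cnorm = math.sqrt(sum(x * x for x in compact))
compact = [x / cnorm for x in compact]

print(len(compact))  # 256
```

Because the leading dimensions carry most of the signal in a Matryoshka-trained model, the truncated vector remains usable for similarity search at a fraction of the storage cost.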

Impact on RAG Pipelines

End of Complex Parsing

Before Embed v4, vectorizing a PDF required:

  1. Text extraction (PyPDF, pdfplumber)
  2. Image OCR (Tesseract, Azure Vision)
  3. Table detection (Camelot, Tabula)
  4. Context reconstruction
  5. Separate chunking and embedding

With Embed v4:

  1. Screenshot or image of PDF
  2. Direct embedding
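The two steps above reduce to rendering the page and wrapping the resulting bytes in the data-URI form the API accepts. A minimal sketch of the wrapping step, using only the standard library (the rendering itself, e.g. with pdf2image, is out of scope here; the helper name and placeholder bytes are ours):

```python
import base64

def to_data_uri(image_bytes: bytes, mime: str = "image/png") -> str:
    """Wrap raw image bytes in a data URI suitable for an images=[...] input."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"

# In practice image_bytes comes from a PDF page renderer;
# a placeholder byte string stands in here.
uri = to_data_uri(b"\x89PNG\r\n...")
print(uri[:22])  # data:image/png;base64,
```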

"We removed 80% of our preprocessing pipeline," testifies Marie Laurent, CTO of a French legaltech startup. "Retrieval quality improved because the model sees documents like a human does."

Transformed Use Cases

Visual E-commerce

  • Product image search
  • PDF catalogs vectorized as-is
  • Technical sheets with diagrams

Technical Documentation

  • Manuals with diagrams
  • Architecture schemas
  • Annotated screenshots

Legal and Finance

  • Scanned contracts
  • Reports with charts
  • Filled forms

Check out our guide on e-commerce RAG for concrete examples.

Pricing and Availability

Pricing

| Input type | Price |
| --- | --- |
| Text | $0.10 / 1M tokens |
| Images | $0.10 / 1,000 images |
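At these list prices, a rough monthly estimate is simple arithmetic. A sketch with a hypothetical workload of 5M text tokens and 20,000 page images (the workload figures are illustrative, not from Cohere):

```python
TEXT_PRICE_PER_M_TOKENS = 0.10   # $ per 1M tokens
IMAGE_PRICE_PER_K_IMAGES = 0.10  # $ per 1,000 images

def embedding_cost(tokens: int, images: int) -> float:
    """Estimated Embed v4 cost in dollars for a given workload."""
    return (tokens / 1_000_000 * TEXT_PRICE_PER_M_TOKENS
            + images / 1_000 * IMAGE_PRICE_PER_K_IMAGES)

print(round(embedding_cost(5_000_000, 20_000), 2))  # 2.5
```

Note that for image-heavy corpora the per-image charge dominates, which is the "higher price for large image volumes" caveat discussed below.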

Comparison with Competition

| Provider | Price/1M tokens | Multimodal |
| --- | --- | --- |
| Cohere Embed v4 | $0.10 | Yes |
| OpenAI text-embedding-3-large | $0.13 | No |
| Voyage-3 | $0.12 | No |
| Google Gemini Embedding | $0.008 | No |

Availability

Embed v4 is available on:

  • Direct Cohere API
  • Amazon Bedrock
  • Amazon SageMaker JumpStart
  • Azure AI Foundry
  • Google Cloud Vertex AI

Practical Integration

Complete Example: Multimodal RAG

```python
import cohere
from qdrant_client import QdrantClient

co = cohere.ClientV2('your-api-key')
qdrant = QdrantClient(url="http://localhost:6333")

# Index a PDF page as an image
def index_pdf_page(image_base64, metadata):
    response = co.embed(
        images=[f"data:image/png;base64,{image_base64}"],
        model="embed-v4",
        input_type="image",
        embedding_types=["float"]
    )
    qdrant.upsert(
        collection_name="documents",
        points=[{
            "id": metadata["id"],
            "vector": response.embeddings.float[0],
            "payload": metadata
        }]
    )

# Search by text (cross-modal)
def search_by_text(query):
    query_embedding = co.embed(
        texts=[query],
        model="embed-v4",
        input_type="search_query",
        embedding_types=["float"]
    )
    # Find relevant images/PDFs with a text query
    results = qdrant.search(
        collection_name="documents",
        query_vector=query_embedding.embeddings.float[0],
        limit=5
    )
    return results
```

Best Practices

1. Choose the Right input_type

  • search_document: Text to index
  • search_query: User query
  • image: Images to index or search

2. Optimize Images

  • Ideal resolution: 1024x1024 pixels
  • Maximum: 2 megapixels
  • Formats: JPEG for photos, PNG for captures
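To stay under the 2-megapixel cap while preserving aspect ratio, the target size can be computed before resizing with any image library. A small helper (the function name is ours):

```python
def fit_to_megapixels(width: int, height: int, max_pixels: int = 2_000_000):
    """Return (width, height) scaled down to at most max_pixels,
    keeping the aspect ratio. Returns the input unchanged if it fits."""
    pixels = width * height
    if pixels <= max_pixels:
        return width, height
    scale = (max_pixels / pixels) ** 0.5
    return int(width * scale), int(height * scale)

# A 12-megapixel photo (4000x3000) gets scaled to fit under 2 MP
print(fit_to_megapixels(4000, 3000))
```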

3. Batching

```python
# Up to 96 texts or 1000 images per request
response = co.embed(
    images=list_of_images[:1000],
    model="embed-v4",
    input_type="image"
)
```

Our Take

Embed v4 Multimodal is a decisive advancement for RAG applications handling rich documents. The ability to vectorize PDFs, presentations, and images without complex preprocessing radically simplifies architectures.

Strengths:

  • First production multimodal
  • 128K token context
  • Matryoshka for cost optimization
  • Native cloud integration

Points to watch:

  • Pure text MTEB score lower than Qwen3/Gemini
  • Higher price for large image volumes

For new projects with visual documents, Embed v4 is our recommendation. For pure text at very high volume, consider Qwen3-Embedding (open source) or Google Gemini Embedding.

Explore our comprehensive guide on choosing embedding models to deepen this decision.

FAQ

**Does Embed v4 replace OCR?**

Yes, for most cases. Embed v4 directly vectorizes document images (PDFs, scans, screenshots) without prior text extraction. Retrieval quality is often superior because the model captures visual context (layout, tables, charts). Only cases requiring explicit text extraction (for display or editing) still justify OCR.

**Why is its MTEB score lower than Qwen3 or Gemini?**

Embed v4 scores 65.2 on MTEB (pure text), behind Qwen3-Embedding (70.6) and Google Gemini (68.3). But this comparison is incomplete: Embed v4 is the only one to natively support multimodal input. For mixed documents (text + images), it has no equivalent. Evaluate based on your actual use case.

**What do Matryoshka embeddings bring?**

Matryoshka embeddings allow reducing dimensions from 1536 to 256 with only 2.8% quality loss. This reduces vector storage by 83%. Recommended strategy: index at 1536 dimensions, then experiment with reduced dimensions on your test dataset to find the optimal threshold.

**Can I search images with a text query?**

Yes. Since text and images share the same vector space, you can do image-to-image, text-to-image, or image-to-text search. It's ideal for visual e-commerce, product catalogs, or finding similar documents.

**What image resolution should I use?**

Maximum is 2 megapixels. For a good quality/cost balance, use 1024x1024 pixels. For documents with fine text (contracts, invoices), prefer higher resolutions. For simple images (product photos), 512x512 often suffices.

---

**Need to integrate Embed v4 into your RAG application?** [Ailog](https://ailog.fr) offers a RAG-as-a-Service platform that automatically integrates the best embedding models, including Cohere Embed v4 Multimodal. Deploy your AI assistant in minutes.

Tags

RAG, Cohere, embeddings, multimodal, MTEB
