Cohere Embed v4: The First Production Multimodal Embedding
Cohere launches Embed v4 Multimodal, the first embedding model capable of vectorizing text, images, and interleaved documents. A revolution for multimodal RAG.
Cohere Revolutionizes Embeddings with Multimodal
Cohere has announced the general availability of Embed v4 Multimodal, a major advancement in the world of embeddings. For the first time, a production model can vectorize text, images, and mixed documents (PDFs, slides, tables) into the same semantic space.
"Embed v4 eliminates the complexity of document parsing," declares Aidan Gomez, CEO of Cohere. "You can now vectorize a PDF as-is, with its images, tables, and text, without preprocessing."
Benchmark Performance
MTEB Results
| Model | MTEB Score | Type | Context |
|---|---|---|---|
| Cohere Embed v4 | 65.2 | Multimodal | 128K |
| Google Gemini Embedding | 68.3 | Text | 2K |
| Qwen3-Embedding-8B | 70.6 | Text | 8K |
| OpenAI text-embedding-3-large | 64.6 | Text | 8K |
| Voyage-3 | 63.8 | Text | 16K |
The Real Innovation: Multimodal
The MTEB score doesn't tell the whole story. Embed v4 excels at capabilities that most other models simply lack:
| Capability | Embed v4 | Other models |
|---|---|---|
| Pure text | Yes | Yes |
| Images only | Yes | No* |
| Native PDF | Yes | No |
| Visual tables | Yes | No |
| Presentation slides | Yes | No |
*Only a few experimental models support images
To understand the importance of multimodal in embeddings, check out our guide on multimodal RAG.
Technical Innovations
Unified Text-Image Embedding
Embed v4 creates a vector space where text and images coexist:
```python
import cohere

co = cohere.ClientV2("your-api-key")

# Text embedding
text_response = co.embed(
    texts=["Product description"],
    model="embed-v4",
    input_type="search_document",
    embedding_types=["float"],
)

# Image embedding (base64 or URL)
image_response = co.embed(
    images=["data:image/jpeg;base64,..."],
    model="embed-v4",
    input_type="image",
    embedding_types=["float"],
)

# Both embeddings are in the same semantic space!
```
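Because text and image vectors share one space, cross-modal relevance is just a cosine similarity between the two vectors. A minimal sketch with NumPy, using placeholder vectors in place of real API responses:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# In practice these would come from text_response / image_response above;
# short placeholder vectors here for illustration only.
text_vec = [0.1, 0.3, 0.5]
image_vec = [0.2, 0.25, 0.55]

score = cosine_similarity(text_vec, image_vec)  # higher = more related
```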
Technical Specifications
| Specification | Value |
|---|---|
| Dimensions | 1536 (configurable 256-1536) |
| Text context | 128K tokens |
| Max image size | 2 megapixels |
| Supported languages | 100+ |
| Image formats | JPEG, PNG, WebP, GIF |
Matryoshka Embeddings
Embed v4 supports Matryoshka embeddings, allowing dimension reduction without re-encoding:
```python
# Full dimensions (1536)
full_embedding = co.embed(
    texts=["Your text"],
    model="embed-v4",
    embedding_types=["float"],
)

# Reduced dimensions (256) - same vector, truncated
compact_embedding = co.embed(
    texts=["Your text"],
    model="embed-v4",
    embedding_types=["float"],
    output_dimension=256,  # Matryoshka truncation
)
```
| Dimensions | Quality loss | Storage reduction |
|---|---|---|
| 1536 | 0% | Baseline |
| 1024 | -0.5% | 33% |
| 512 | -1.2% | 67% |
| 256 | -2.8% | 83% |
This approach optimizes the cost/quality tradeoff without regenerating all your embeddings.
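If you already have full 1536-dimension vectors stored, the Matryoshka property means you can also truncate them client-side instead of calling the API again; re-normalizing after truncation keeps cosine similarities well behaved. A sketch of that idea (assumes plain float vectors, not a specific SDK type):

```python
import numpy as np

def truncate_matryoshka(vec, dim):
    """Truncate a Matryoshka embedding to its first `dim` dimensions
    and re-normalize to unit length for cosine similarity."""
    v = np.asarray(vec, dtype=float)[:dim]
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

# A stand-in for a stored 1536-dim embedding
full = np.random.default_rng(0).normal(size=1536)
compact = truncate_matryoshka(full, 256)  # 256 dims, unit length
```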
Impact on RAG Pipelines
End of Complex Parsing
Before Embed v4, vectorizing a PDF required:
- Text extraction (PyPDF, pdfplumber)
- Image OCR (Tesseract, Azure Vision)
- Table detection (Camelot, Tabula)
- Context reconstruction
- Separate chunking and embedding
With Embed v4:
- Screenshot or image of PDF
- Direct embedding
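The new pipeline boils down to rendering each PDF page as an image and embedding it directly. One small piece you still need is wrapping the raw image bytes as the base64 data URI the API expects; a stdlib-only helper (rendering the page itself, e.g. with pdf2image, is assumed done elsewhere):

```python
import base64

def to_data_uri(image_bytes: bytes, mime: str = "image/png") -> str:
    """Encode raw PNG/JPEG bytes as a data URI suitable for the
    `images=` parameter of co.embed."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"

# Placeholder bytes; in practice, the rendered PDF page
uri = to_data_uri(b"\x89PNG\r\n...")
```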
"We removed 80% of our preprocessing pipeline," testifies Marie Laurent, CTO of a French legaltech startup. "Retrieval quality improved because the model sees documents like a human does."
Transformed Use Cases
Visual E-commerce
- Product image search
- PDF catalogs vectorized as-is
- Technical sheets with diagrams
Technical Documentation
- Manuals with diagrams
- Architecture schemas
- Annotated screenshots
Legal and Finance
- Scanned contracts
- Reports with charts
- Filled forms
Check out our guide on e-commerce RAG for concrete examples.
Pricing and Availability
Pricing
| Input type | Price/million units |
|---|---|
| Text | $0.10 / 1M tokens |
| Images | $0.10 / 1000 images |
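At these rates, a back-of-the-envelope cost estimate for an indexing job is simple arithmetic (rates taken from the pricing table above):

```python
def embed_cost(tokens: int, images: int) -> float:
    """Estimated Embed v4 cost in USD: $0.10 per 1M text tokens
    plus $0.10 per 1,000 images (rates from the pricing table)."""
    return tokens / 1_000_000 * 0.10 + images / 1_000 * 0.10

# Example: 50M tokens of text plus 20,000 PDF page images
cost = embed_cost(50_000_000, 20_000)  # 5.00 + 2.00 = 7.00 USD
```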
Comparison with Competition
| Provider | Price/1M tokens | Multimodal |
|---|---|---|
| Cohere Embed v4 | $0.10 | Yes |
| OpenAI text-embedding-3-large | $0.13 | No |
| Voyage-3 | $0.12 | No |
| Google Gemini Embedding | $0.008 | No |
Availability
Embed v4 is available on:
- Direct Cohere API
- Amazon Bedrock
- Amazon SageMaker JumpStart
- Azure AI Foundry
- Google Cloud Vertex AI
Practical Integration
Complete Example: Multimodal RAG
```python
import cohere
from qdrant_client import QdrantClient

co = cohere.ClientV2("your-api-key")
qdrant = QdrantClient(url="http://localhost:6333")

# Index a PDF page rendered as an image
def index_pdf_page(image_base64, metadata):
    response = co.embed(
        images=[f"data:image/png;base64,{image_base64}"],
        model="embed-v4",
        input_type="image",
        embedding_types=["float"],
    )
    qdrant.upsert(
        collection_name="documents",
        points=[{
            "id": metadata["id"],
            "vector": response.embeddings.float[0],
            "payload": metadata,
        }],
    )

# Search by text (cross-modal)
def search_by_text(query):
    query_embedding = co.embed(
        texts=[query],
        model="embed-v4",
        input_type="search_query",
        embedding_types=["float"],
    )
    # Find relevant images/PDFs with a text query
    results = qdrant.search(
        collection_name="documents",
        query_vector=query_embedding.embeddings.float[0],
        limit=5,
    )
    return results
```
Best Practices
1. Choose the Right input_type
- `search_document`: text to index
- `search_query`: user query
- `image`: images to index or search
2. Optimize Images
- Ideal resolution: 1024x1024 pixels
- Maximum: 2 megapixels
- Formats: JPEG for photos, PNG for captures
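To stay under the 2-megapixel cap without distorting aspect ratio, you can compute the target dimensions before resizing (the resize itself would then be done with e.g. Pillow, not shown; the 2 MP limit is taken from the specs table):

```python
import math

MAX_PIXELS = 2_000_000  # 2 megapixels, per the specs table

def fit_under_limit(width: int, height: int) -> tuple:
    """Return (width, height) scaled down if needed so that
    width * height stays under MAX_PIXELS, preserving aspect ratio."""
    pixels = width * height
    if pixels <= MAX_PIXELS:
        return width, height
    scale = math.sqrt(MAX_PIXELS / pixels)
    return max(1, int(width * scale)), max(1, int(height * scale))

fit_under_limit(4000, 3000)  # scales a 12 MP image down under the cap
```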
3. Batching
```python
# Up to 96 texts or 1,000 images per request
response = co.embed(
    images=list_of_images[:1000],
    model="embed-v4",
    input_type="image",
)
```
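For collections larger than a single request allows, a simple chunking helper keeps each call within the per-request limits mentioned above (96 texts or 1,000 images):

```python
def batched(items, batch_size):
    """Yield successive slices of `items`, each of at most `batch_size`."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# e.g. embed 2,500 images in three calls:
# for batch in batched(list_of_images, 1000):
#     co.embed(images=batch, model="embed-v4", input_type="image")
sizes = [len(b) for b in batched(list(range(2500)), 1000)]  # [1000, 1000, 500]
```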
Our Take
Embed v4 Multimodal is a decisive advancement for RAG applications handling rich documents. The ability to vectorize PDFs, presentations, and images without complex preprocessing radically simplifies architectures.
Strengths:
- First production multimodal
- 128K token context
- Matryoshka for cost optimization
- Native cloud integration
Points to watch:
- Pure text MTEB score lower than Qwen3/Gemini
- Higher price for large image volumes
For new projects with visual documents, Embed v4 is our recommendation. For pure text at very high volume, consider Qwen3-Embedding (open source) or Google Gemini Embedding.
Explore our comprehensive guide on choosing embedding models to deepen this decision.