colnomic-embed-multimodal-7b
by nomic-ai
Late-interaction multimodal embeddings — SOTA visual document retrieval without OCR
nomic-ai/colnomic-embed-multimodal-7b
mixpeek://image_extractor@v1/nomic_colnomic_multimodal_7b_v1
Overview
ColNomic Embed Multimodal 7B is Nomic AI's multi-vector late-interaction embedding model that processes text, images, PDFs, and charts without requiring OCR or image captioning as a preprocessing step. Fine-tuned from Qwen2.5-VL-7B-Instruct, it produces multiple token-level embeddings per document instead of a single vector, enabling fine-grained matching between query tokens and document tokens at retrieval time.
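The token-to-token matching described above is usually scored with the ColBERT-style MaxSim operator: each query token is matched to its most similar document token, and those best matches are summed. The card does not spell out the scoring function, so the sketch below assumes MaxSim over L2-normalized token vectors. The `Vector` type and the toy 2-D vectors are illustrative only; real ColNomic token embeddings are much higher-dimensional.

```typescript
// Sketch of ColBERT-style late-interaction (MaxSim) scoring.
// Assumption: each token embedding is an L2-normalized number[], so the
// dot product equals cosine similarity.
type Vector = number[];

function dot(a: Vector, b: Vector): number {
  if (a.length !== b.length) {
    throw new Error(`dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// For each query token, take its best-matching document token, then sum
// those maxima over all query tokens.
function maxSimScore(queryTokens: Vector[], docTokens: Vector[]): number {
  if (docTokens.length === 0) throw new Error("document has no token embeddings");
  let score = 0;
  for (const q of queryTokens) {
    let best = -Infinity;
    for (const d of docTokens) best = Math.max(best, dot(q, d));
    score += best;
  }
  return score;
}

// Toy 2-D example: two query tokens, three document tokens.
const query: Vector[] = [[1, 0], [0, 1]];
const doc: Vector[] = [[1, 0], [0.6, 0.8], [0, 1]];
console.log(maxSimScore(query, doc)); // 2
```

Because the document is kept as many vectors rather than one pooled vector, a query about a single table cell can still match strongly even when the rest of the page is about something else.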
The model achieves 62.7 NDCG@5 on Vidore-v2, a 2.8-point improvement over the previous state of the art for visual document retrieval. On Mixpeek, ColNomic powers high-precision search over PDFs, slides, and charts whose visual layout carries meaning that text-only embeddings miss: tables, diagrams, and mixed text-image pages are searchable directly from their rendered appearance.
Architecture
Multi-vector late-interaction architecture fine-tuned from Qwen2.5-VL-7B-Instruct. It produces multiple token-level embeddings per document for fine-grained matching. Training uses same-source sampling, drawing each batch from a single data source so that the in-batch negatives are harder to distinguish from the positive. The model processes interleaved text and image inputs natively.
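A minimal sketch of same-source batching, assuming it means that every training batch is cut from a single source dataset (so the other examples in the batch, which serve as in-batch negatives, look similar to the positive). The `Example` record, the `sameSourceBatches` and `inBatchNegatives` helpers, the seeded generator, and the source names are all hypothetical; this is not Nomic's training code.

```typescript
// Hypothetical training record: a query, its positive document, and its source.
interface Example {
  source: string;
  query: string;
  positiveDoc: string;
}

// Park-Miller LCG so the sketch is reproducible without dependencies.
function seededRandom(seed: number): () => number {
  let state = seed % 2147483647;
  if (state <= 0) state += 2147483646;
  return () => {
    state = (state * 16807) % 2147483647;
    return (state - 1) / 2147483646; // value in [0, 1)
  };
}

// Fisher-Yates shuffle on a copy of the input.
function shuffle<T>(items: T[], rand: () => number): T[] {
  const out = items.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    const tmp = out[i];
    out[i] = out[j];
    out[j] = tmp;
  }
  return out;
}

// Group examples by source, cut each group into full batches, then shuffle
// the batch order. Each batch is single-source; consecutive steps may differ.
function sameSourceBatches(examples: Example[], batchSize: number, seed = 1): Example[][] {
  if (batchSize < 1) throw new Error("batchSize must be at least 1");
  const rand = seededRandom(seed);
  const bySource: { [source: string]: Example[] } = {};
  for (const ex of examples) {
    if (!bySource[ex.source]) bySource[ex.source] = [];
    bySource[ex.source].push(ex);
  }
  const batches: Example[][] = [];
  for (const source of Object.keys(bySource)) {
    const group = shuffle(bySource[source], rand);
    // Drop the incomplete tail so every batch holds exactly batchSize examples.
    for (let i = 0; i + batchSize <= group.length; i += batchSize) {
      batches.push(group.slice(i, i + batchSize));
    }
  }
  return shuffle(batches, rand);
}

// In a contrastive step, the other examples' positives act as negatives.
function inBatchNegatives(batch: Example[], index: number): string[] {
  return batch.filter((_, j) => j !== index).map((ex) => ex.positiveDoc);
}

// Toy data: two hypothetical sources, four examples each.
const data: Example[] = [];
for (const source of ["slides", "financial-reports"]) {
  for (let i = 0; i < 4; i++) {
    data.push({ source, query: `${source} query ${i}`, positiveDoc: `${source} page ${i}` });
  }
}
const batches = sameSourceBatches(data, 2);
for (const batch of batches) {
  console.log(batch.map((ex) => ex.positiveDoc).join(" | "));
}
```

Mixing sources within a batch would let the model separate positives from negatives by topic or domain alone; keeping a batch single-source removes that shortcut.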
Mixpeek SDK Integration
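import { Mixpeek } from "mixpeek";

// Placeholder: replace API_KEY with your own Mixpeek API key.
const mx = new Mixpeek({ apiKey: "API_KEY" });

// Top-level await needs an ES module (or wrap the call in an async function).
await mx.collections.ingest({
  collection_id: "visual-docs",
  source: { url: "https://example.com/technical-manual.pdf" },
  feature_extractors: [
    {
      feature: "image_embedding",
      model: "nomic-ai/colnomic-embed-multimodal-7b",
    },
  ],
});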
Capabilities
- Multi-vector late interaction for fine-grained retrieval
- Direct PDF, chart, and diagram processing without OCR
- 62.7 NDCG@5 on Vidore-v2 (visual document retrieval SOTA)
- Interleaved text-image input support
- Apache 2.0 license
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| Vidore-v2 (visual doc retrieval) | NDCG@5 | 62.7 | Nomic AI, 2025 — Blog Post |
| Vidore-v2 (vs previous SOTA) | NDCG@5 delta | +2.8 points | Nomic AI, 2025 — Blog Post |
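NDCG@5 rewards placing relevant pages near the top of the first five results, discounting each hit by log2(rank + 1) and normalizing by the best achievable ordering. The sketch below computes it for a single query, assuming linear gain (some evaluators use 2^rel - 1 instead); benchmark figures such as 62.7 are typically such per-query scores averaged over the benchmark and multiplied by 100. By the table's own numbers, the previous best was 62.7 - 2.8 = 59.9 NDCG@5. The helper names are illustrative.

```typescript
// Sketch of NDCG@k with linear gain: DCG@k = sum over ranks r = 1..k of rel_r / log2(r + 1).
// Assumption: relevance labels are non-negative numbers (1/0 for binary judgments).
function log2(x: number): number {
  return Math.log(x) / Math.LN2;
}

function dcgAtK(relevances: number[], k: number): number {
  let dcg = 0;
  const n = Math.min(k, relevances.length);
  for (let i = 0; i < n; i++) {
    dcg += relevances[i] / log2(i + 2); // index i is rank i + 1
  }
  return dcg;
}

// rankedRelevances: labels of the retrieved items, in ranked order.
// judgedRelevances: labels of every judged item for the query (defines the ideal ranking).
function ndcgAtK(rankedRelevances: number[], judgedRelevances: number[], k: number): number {
  const ideal = judgedRelevances.slice().sort((a, b) => b - a);
  const idcg = dcgAtK(ideal, k);
  return idcg === 0 ? 0 : dcgAtK(rankedRelevances, k) / idcg;
}

// One relevant page for the query, retrieved at rank 2 of 5.
const ranked = [0, 1, 0, 0, 0];
console.log(ndcgAtK(ranked, [1], 5).toFixed(4)); // 0.6309
```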
Research Paper
Nomic Embed Multimodal: Open Source Multimodal Embedding Models
arxiv.org
Build a pipeline with colnomic-embed-multimodal-7b
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.