nemotron-colembed-vl-8b-v2
by nvidia
State-of-the-art late-interaction visual document retrieval
nvidia/nemotron-colembed-vl-8b-v2
mixpeek://image_extractor@v1/nvidia_nemotron_colembed_vl_8b_v2
Overview
Nemotron ColEmbed VL is an 8B-parameter ColBERT-style multi-vector embedding model built on Qwen3-VL-8B-Instruct. It produces per-token embeddings for both queries and documents, enabling fine-grained matching between query terms and document regions. This late-interaction approach is particularly powerful for visual document retrieval, where different parts of a document page (headers, tables, figures) need to match different parts of a query.
The model ranks #1 on ViDoRe V3, the visual document retrieval benchmark, with an NDCG@5 of 63.54, surpassing ColPali and ColQwen variants.
Architecture
The model applies a ColBERT-style architecture on top of Qwen3-VL-8B-Instruct. It produces multi-vector representations (one vector per token) rather than a single-vector embedding. Matching uses MaxSim: for each query token, take the maximum similarity to any document token, then sum those maxima across all query tokens.
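The MaxSim rule described above can be sketched in a few lines. This is a minimal illustration, not the model's or Mixpeek's actual scoring code: it assumes dot-product similarity over unit-length vectors, and the names `maxSim`, `query`, `pageA`, and `pageB` as well as the toy 2-dimensional values are hypothetical.

```typescript
// A token embedding is a plain array of numbers; a query or page is a list of them.
type Vec = number[];

// Dot product of two equal-length vectors (assumed similarity measure).
function dot(a: Vec, b: Vec): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let s = 0;
  for (let i = 0; i < a.length; i++) s += a[i] * b[i];
  return s;
}

// MaxSim: for each query token, keep its best-matching document token, then sum.
function maxSim(query: Vec[], doc: Vec[]): number {
  if (doc.length === 0) return 0; // nothing to match against
  let total = 0;
  for (const q of query) {
    let best = -Infinity;
    for (const d of doc) best = Math.max(best, dot(q, d));
    total += best;
  }
  return total;
}

// Toy unit-length vectors (illustrative values only).
const query: Vec[] = [[1, 0], [0, 1]];
const pageA: Vec[] = [[1, 0], [0.6, 0.8]];
const pageB: Vec[] = [[0.6, 0.8]];

// pageA scores higher because each query token finds a closer match on it.
console.log(maxSim(query, pageA), maxSim(query, pageB));
```

Because every query token contributes its own maximum, a page that covers all parts of the query outranks one that matches only a single part strongly.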
Mixpeek SDK Integration
    import { Mixpeek } from "mixpeek";

    const mx = new Mixpeek({ apiKey: "API_KEY" });

    await mx.collections.ingest({
      collection_id: "doc-collection",
      source: { url: "https://example.com/report.pdf" },
      feature_extractors: [
        {
          feature: "visual_embeddings",
          model: "nvidia/nemotron-colembed-vl-8b-v2"
        }
      ]
    });
Capabilities
- Multi-vector (ColBERT-style) embeddings for fine-grained matching
- #1 on ViDoRe V3 visual document retrieval benchmark
- Handles mixed-content documents: text, tables, charts, figures
- Supports both text queries and image queries
- Per-token matching enables localization of relevant document regions
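To illustrate the localization point in the list above, the sketch below records which document token each query token matched best. It is an assumption-laden example: similarity is taken to be a dot product, `bestMatches` and the toy vectors are hypothetical, and mapping a document token index back to a region of the page depends on the model's patch layout, which is not covered here.

```typescript
type Embedding = number[];

// Dot product of two equal-length vectors (assumed similarity measure).
function dotProduct(a: Embedding, b: Embedding): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let s = 0;
  for (let i = 0; i < a.length; i++) s += a[i] * b[i];
  return s;
}

// For each query token, return the index and score of its best document token.
// The index is -1 when the document has no tokens.
function bestMatches(
  queryTokens: Embedding[],
  docTokens: Embedding[]
): { index: number; score: number }[] {
  return queryTokens.map((q) => {
    let index = -1;
    let score = -Infinity;
    docTokens.forEach((d, i) => {
      const s = dotProduct(q, d);
      if (s > score) {
        score = s;
        index = i;
      }
    });
    return { index, score };
  });
}

// Toy example: two query tokens, three document tokens (illustrative values only).
const queryTokens: Embedding[] = [[1, 0], [0, 1]];
const docTokens: Embedding[] = [[0, 1], [1, 0], [0.6, 0.8]];

// Each entry says which document token "explains" the corresponding query token.
console.log(bestMatches(queryTokens, docTokens));
```

Summing the `score` fields reproduces the MaxSim score, so the same pass that ranks a page can also report where on the page the evidence came from.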
Use Cases on Mixpeek
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| ViDoRe V3 | NDCG@5 | 63.54 | https://huggingface.co/nvidia/nemotron-colembed-vl-8b-v2 |
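For readers unfamiliar with the metric in the table, the following sketch computes NDCG@k for a single query. It uses the common exponential-gain formulation; whether ViDoRe V3 uses this exact variant is an assumption, as is the reading of 63.54 as a value on a 0 to 100 scale. The function names and relevance lists are hypothetical.

```typescript
// Discounted cumulative gain over the first k ranked positions.
// relevances[i] is the graded relevance of the item at rank i + 1.
function dcgAtK(relevances: number[], k: number): number {
  return relevances
    .slice(0, k)
    .reduce((sum, rel, i) => sum + (Math.pow(2, rel) - 1) / Math.log2(i + 2), 0);
}

// NDCG@k: DCG of the actual ranking divided by DCG of the ideal ranking.
// `ranked` must hold the relevance of every candidate, in retrieved order.
function ndcgAtK(ranked: number[], k: number): number {
  const ideal = [...ranked].sort((a, b) => b - a);
  const idealDcg = dcgAtK(ideal, k);
  return idealDcg === 0 ? 0 : dcgAtK(ranked, k) / idealDcg;
}

// One relevant page retrieved first versus second (illustrative values only).
console.log(ndcgAtK([1, 0, 0, 0, 0], 5));
console.log(ndcgAtK([0, 1, 0, 0, 0], 5));
```

A benchmark score is typically the mean of this per-query value over all queries, so moving relevant pages from rank 2 to rank 1 raises the score noticeably.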
Specification
Research Paper
Nemotron ColEmbed VL
arxiv.org
Build a pipeline with nemotron-colembed-vl-8b-v2
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.