colnomic-embed-multimodal-7b

by nomic-ai

Late-interaction multimodal embeddings: SOTA visual document retrieval without OCR

3K downloads/month

107 likes

7B params


Identifiers

Model ID

nomic-ai/colnomic-embed-multimodal-7b

Feature URI

mixpeek://image_extractor@v1/nomic_colnomic_multimodal_7b_v1

Overview

ColNomic Embed Multimodal 7B is Nomic AI's multi-vector late-interaction embedding model that processes text, images, PDFs, and charts without requiring OCR or image captioning as a preprocessing step. Fine-tuned from Qwen2.5-VL-7B-Instruct, it produces multiple token-level embeddings per document instead of a single vector, enabling fine-grained matching between query tokens and document tokens at retrieval time.
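To make the token-level matching concrete, here is a minimal sketch of late-interaction (MaxSim) scoring as commonly used by ColBERT-style models: each query token embedding is compared against every document token embedding, the best match per query token is kept, and the maxima are summed. The function names (`maxSimScore`, `normalize`, `dot`) and the toy 2-dimensional vectors are illustrative assumptions, not part of the Nomic or Mixpeek APIs, and the real model's embedding dimension and scoring details may differ.

```typescript
// Illustrative sketch only: a generic MaxSim scorer, not the Nomic or Mixpeek implementation.
type Vec = number[];

// Plain dot product of two equal-length vectors.
function dot(a: Vec, b: Vec): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let s = 0;
  for (let i = 0; i < a.length; i++) s += a[i] * b[i];
  return s;
}

// L2-normalize so that the dot product equals cosine similarity.
function normalize(v: Vec): Vec {
  const n = Math.sqrt(dot(v, v));
  return n === 0 ? v.slice() : v.map((x) => x / n);
}

// For each query token, take its best-matching document token, then sum the maxima.
function maxSimScore(queryTokens: Vec[], docTokens: Vec[]): number {
  if (docTokens.length === 0) return 0;
  const docNorm = docTokens.map(normalize);
  let total = 0;
  for (const q of queryTokens) {
    const qn = normalize(q);
    let best = -Infinity;
    for (const d of docNorm) best = Math.max(best, dot(qn, d));
    total += best;
  }
  return total;
}

// Toy data (assumed): two query tokens, two candidate pages.
const query: Vec[] = [[1, 0], [0, 1]];
const pageA: Vec[] = [[1, 0], [0, 1], [1, 1]]; // has a close match for both query tokens
const pageB: Vec[] = [[1, 0], [-1, 0]];        // matches only the first query token

const scoreA = maxSimScore(query, pageA);
const scoreB = maxSimScore(query, pageB);
console.log({ scoreA, scoreB });
```

In this toy case `pageA` outranks `pageB` because every query token finds a strong match on it. A single-vector model would have to compress each page into one embedding before comparison, which is where fine detail such as a table cell or chart label can be lost.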

The model achieves 62.7 NDCG@5 on the ViDoRe v2 benchmark, a 2.8-point improvement over the previous state of the art for visual document retrieval. On Mixpeek, ColNomic powers high-precision document search where the visual layout of PDFs, slides, and charts carries meaning that text-only embeddings miss: tables, diagrams, and mixed text-image pages are all searchable directly from their rendered appearance.
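For readers unfamiliar with the metric, the following is a minimal sketch of how NDCG@5 is computed for one query. It assumes the linear-gain form of DCG (relevance divided by log2 of rank + 1); some evaluation tools use an exponential gain instead, and the benchmark's exact evaluation code is not reproduced here. The helper names `dcgAtK` and `ndcgAtK` are hypothetical. Benchmark figures such as 62.7 are this value averaged over queries and scaled to 0-100.

```typescript
// Illustrative sketch: NDCG@k with linear gain (an assumption; tools differ on the gain function).

// Discounted cumulative gain over the top k results; position i (0-based) is discounted by log2(i + 2).
function dcgAtK(relevances: number[], k: number): number {
  return relevances
    .slice(0, k)
    .reduce((sum, rel, i) => sum + rel / Math.log2(i + 2), 0);
}

// rankedRelevances: relevance labels of ALL judged documents for the query,
// in the order the retriever ranked them (unretrieved relevant documents go last).
function ndcgAtK(rankedRelevances: number[], k: number): number {
  const ideal = [...rankedRelevances].sort((a, b) => b - a);
  const idcg = dcgAtK(ideal, k);
  return idcg === 0 ? 0 : dcgAtK(rankedRelevances, k) / idcg;
}

// Toy example (assumed data): one relevant page among five candidates.
const perfect = ndcgAtK([1, 0, 0, 0, 0], 5);   // relevant page ranked first
const secondPlace = ndcgAtK([0, 1, 0, 0, 0], 5); // relevant page ranked second
console.log({ perfect, secondPlace });
```

Ranking the single relevant page second instead of first lowers the score from 1 to 1/log2(3), which shows how strongly the metric rewards placing the right page at the very top.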

Architecture

Multi-vector late-interaction architecture fine-tuned from Qwen2.5-VL-7B-Instruct. Produces multiple token-level embeddings per document for fine-grained matching. Training uses same-source sampling: each batch is drawn from a single data source, which makes the in-batch negatives harder to tell apart from the positives. Processes interleaved text and image inputs natively.

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "image_embedding",
    version: "v1",
    parameters: { model_id: "nomic-ai/colnomic-embed-multimodal-7b" },
  },
});