    HF | Visual Embeddings | Apache-2.0

    colnomic-embed-multimodal-7b

    by nomic-ai

    Late-interaction multimodal embeddings — SOTA visual document retrieval without OCR

    180K downloads/month
    7B params
    Identifiers
    Model ID
    nomic-ai/colnomic-embed-multimodal-7b
    Feature URI
    mixpeek://image_extractor@v1/nomic_colnomic_multimodal_7b_v1

    Overview

    ColNomic Embed Multimodal 7B is Nomic AI's multi-vector late-interaction embedding model that processes text, images, PDFs, and charts without requiring OCR or image captioning as a preprocessing step. Fine-tuned from Qwen2.5-VL-7B-Instruct, it produces multiple token-level embeddings per document instead of a single vector, enabling fine-grained matching between query tokens and document tokens at retrieval time.

    The model achieves 62.7 NDCG@5 on Vidore-v2, a 2.8-point improvement over the previous state-of-the-art for visual document retrieval. On Mixpeek, ColNomic powers high-precision document search where the visual layout of PDFs, slides, and charts carries meaning that traditional text-only embeddings miss — tables, diagrams, and mixed text-image pages are all searchable directly from their rendered appearance.

    Architecture

    ColNomic uses a multi-vector late-interaction architecture fine-tuned from Qwen2.5-VL-7B-Instruct. It produces multiple token-level embeddings per document for fine-grained matching. During training, same-source sampling generates harder in-batch negatives. The model processes interleaved text and image inputs natively.
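To make "late interaction" concrete, here is a minimal sketch of the scoring step, assuming the ColBERT-style MaxSim operator: each query token keeps its best dot-product match among the document's token embeddings, and those maxima are summed. The page does not spell out the exact operator. The names `maxSimScore`, `dot`, and the toy 2-dimensional vectors are illustrative and not part of any Mixpeek or Nomic API.

```typescript
// Illustrative only: ColBERT-style MaxSim over multi-vector embeddings.
// A document is a list of token vectors rather than one pooled vector.
type Vec = number[];

// Plain dot product; assumes both vectors share the same dimension.
function dot(a: Vec, b: Vec): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// For every query token, take its best-matching document token,
// then sum those per-token maxima into one relevance score.
function maxSimScore(queryTokens: Vec[], docTokens: Vec[]): number {
  if (docTokens.length === 0) return 0;
  let score = 0;
  for (const q of queryTokens) {
    let best = -Infinity;
    for (const d of docTokens) best = Math.max(best, dot(q, d));
    score += best;
  }
  return score;
}

// Toy data (hypothetical): two query tokens, two candidate pages.
const query: Vec[] = [[1, 0], [0, 1]];
const pageA: Vec[] = [[1, 0], [0, 1], [0.5, 0.5]];
const pageB: Vec[] = [[0.6, 0.8]];

// Rank pages by descending MaxSim score.
const ranked = [
  { id: "pageA", score: maxSimScore(query, pageA) },
  { id: "pageB", score: maxSimScore(query, pageB) },
].sort((x, y) => y.score - x.score);

console.log(ranked.map((r) => r.id));
```

Because each query token is matched independently, a page scores well only if it covers every part of the query, which is the fine-grained behavior a single pooled vector cannot express.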

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    // Replace "API_KEY" with your Mixpeek API key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest a PDF into a collection and embed it with ColNomic.
    await mx.collections.ingest({
      collection_id: "visual-docs",
      source: { url: "https://example.com/technical-manual.pdf" },
      feature_extractors: [{
        feature: "image_embedding",
        model: "nomic-ai/colnomic-embed-multimodal-7b"
      }]
    });

    Capabilities

    • Multi-vector late interaction for fine-grained retrieval
    • Direct PDF, chart, and diagram processing without OCR
    • 62.7 NDCG@5 on Vidore-v2 (visual document retrieval SOTA)
    • Interleaved text-image input support
    • Apache 2.0 license

    Use Cases on Mixpeek

    Visual document search: retrieve PDF pages, slides, and charts by layout and content
    OCR-free document retrieval: search scanned documents without preprocessing pipelines
    Technical diagram search: find engineering drawings, flowcharts, and schematics by description

    Benchmarks

    Dataset | Metric | Score | Source
    Vidore-v2 (visual doc retrieval) | NDCG@5 | 62.7 | Nomic AI, 2025, Blog Post
    Vidore-v2 (vs previous SOTA) | NDCG@5 delta | +2.8 points | Nomic AI, 2025, Blog Post
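For readers unfamiliar with the metric, the following is a small sketch of how NDCG@5 can be computed for a single query. It assumes the linear-gain form (relevance divided by log2 of rank + 1); some evaluations use an exponential gain instead, and the page does not say which variant the benchmark uses. The function names are illustrative. Reported figures such as 62.7 are presumably the per-query mean scaled by 100.

```typescript
// Illustrative only: NDCG@k with linear gain for a single query.
// `relevances` holds the graded relevance of each judged document,
// listed in the order the retriever ranked them.
function dcgAtK(relevances: number[], k: number): number {
  return relevances
    .slice(0, k)
    // Rank i (0-based) is discounted by log2(i + 2), i.e. log2(rank + 1).
    .reduce((sum, rel, i) => sum + rel / Math.log2(i + 2), 0);
}

function ndcgAtK(relevances: number[], k: number): number {
  // The ideal ranking sorts the same judged documents by relevance,
  // so pass every judged document, not just the top k.
  const ideal = [...relevances].sort((a, b) => b - a);
  const idcg = dcgAtK(ideal, k);
  return idcg === 0 ? 0 : dcgAtK(relevances, k) / idcg;
}

// Hypothetical example: the only relevant page was ranked second.
const exampleNdcg = ndcgAtK([0, 1, 0, 0, 0], 5);
console.log(exampleNdcg.toFixed(4));
```

A perfect ranking yields 1, and pushing the relevant page lower shrinks the score logarithmically, so small differences at the top ranks move the metric most.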

    Performance

    Input Size: Text + image (variable resolution pages)
    GPU Latency: ~50 ms / page (A100)
    GPU Throughput: ~20 pages/sec (A100)
    GPU Memory: ~14 GB
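The latency and throughput figures are consistent with each other (1000 ms / 50 ms per page = 20 pages/sec), which allows a rough capacity estimate. The sketch below is back-of-the-envelope arithmetic, not a Mixpeek API: the constant and function names are hypothetical, and linear scaling across GPUs is an assumption that real deployments may not achieve.

```typescript
// Illustrative only: rough ingestion-time estimate from the figures above.
const LATENCY_MS_PER_PAGE = 50;      // ~50 ms / page on an A100 (from the table)
const THROUGHPUT_PAGES_PER_SEC = 20; // ~20 pages/sec on an A100 (from the table)

// Sequential processing at 50 ms per page implies this many pages per second.
const impliedThroughput = 1000 / LATENCY_MS_PER_PAGE;

// Estimated seconds to embed `pages` pages, assuming throughput
// scales linearly with the number of GPUs (an optimistic assumption).
function estimateEmbedSeconds(pages: number, gpus: number = 1): number {
  if (pages < 0 || gpus <= 0) throw new Error("invalid input");
  return pages / (THROUGHPUT_PAGES_PER_SEC * gpus);
}

console.log(impliedThroughput);               // pages/sec implied by latency
console.log(estimateEmbedSeconds(10_000));    // seconds on one GPU
console.log(estimateEmbedSeconds(10_000, 4)); // seconds on four GPUs
```

Under these assumptions, a 10,000-page corpus takes on the order of minutes on a single A100; network transfer and PDF rendering time are not included.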

    Specification

    Framework: HF
    Organization: nomic-ai
    Feature: Visual Embeddings
    Output: 768-dim vector
    Modalities: video, image
    Retriever: Vector Search
    Parameters: 7B
    License: Apache-2.0
    Downloads/mo: 180K

    Research Paper

    Nomic Embed Multimodal: Open Source Multimodal Embedding Models

    arxiv.org

    Build a pipeline with colnomic-embed-multimodal-7b

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
