    HF | Visual Embeddings | Apache 2.0

    ColModernVBERT

    by ModernVBERT

    Compact 250M-param vision-language encoder for visual document retrieval

    Downloads: N/A per month
    Parameters: 250M

    Identifiers

    Model ID: ModernVBERT/ColModernVBERT
    Feature URI: mixpeek://image_extractor@v1/modernvbert_colmodernvbert_v1

    Overview

    ColModernVBERT is a compact late-interaction model for visual document retrieval that is reported to match the retrieval quality of models 10x its size. Built on the ModernBERT architecture extended to vision, it produces multi-vector representations of document images, which enable efficient MaxSim-based retrieval. Its small footprint lets it run on CPU hardware, making it practical for edge deployment.
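
    To make MaxSim-based retrieval over multi-vector representations concrete, here is a minimal pure-Python sketch. It assumes dot-product similarity and uses tiny hand-written 2-dimensional vectors in place of real model output; the names maxsim_score and rank_pages and the page IDs are illustrative, not part of the model or the Mixpeek SDK.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Late-interaction score: for each query vector take its best-matching
    document vector, then sum those maxima over the query."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rank_pages(
    query_vecs: List[Vector], pages: Dict[str, List[Vector]]
) -> List[Tuple[str, float]]:
    """Score every page against the query and sort best-first."""
    scored = [(page_id, maxsim_score(query_vecs, vecs)) for page_id, vecs in pages.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data (hypothetical): two query token vectors, two document pages.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "invoice_a": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    "invoice_b": [[0.6, 0.8], [0.8, 0.6]],
}

for page_id, score in rank_pages(query, pages):
    print(page_id, round(score, 3))
```

    Because each query vector only needs its single best match on the page, document vectors can be computed once at indexing time and reused for every query.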

    Architecture

    Late-interaction vision-language encoder based on ModernBERT. Uses alternating attention and MLP blocks with Flash Attention for efficient token processing. Vision inputs are patchified and projected into the same embedding space as text tokens. Retrieval uses MaxSim aggregation over per-token embeddings.
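
    The patchify-and-project step can be sketched as follows. This is only an illustration under stated assumptions: the patch size, the grayscale nested-list image, and the hand-written projection weights are hypothetical, and the real model's patch size, channels, and learned projection are not specified here.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel values


def patchify(image: Image, patch: int) -> List[List[float]]:
    """Split an H x W image into non-overlapping patch x patch tiles,
    each flattened row by row into one vector."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image dimensions must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append(
                [image[r][c]
                 for r in range(top, top + patch)
                 for c in range(left, left + patch)]
            )
    return patches


def project(patch_vec: List[float], weights: List[List[float]]) -> List[float]:
    """Linear projection: one output value per weight row."""
    return [sum(w * x for w, x in zip(row, patch_vec)) for row in weights]


# Hypothetical 4x4 image with pixel values 0..15, split into 2x2 patches.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
patches = patchify(image, patch=2)

# Hypothetical 2-dimensional projection (a real model learns these weights).
weights = [[1.0, 1.0, 1.0, 1.0], [1.0, 0.0, 0.0, 0.0]]
patch_embeddings = [project(p, weights) for p in patches]
print(len(patches), patch_embeddings[0])
```

    Each patch becomes one vector in the shared embedding space, which is what lets image patches and text tokens be compared token by token at retrieval time.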

    Mixpeek SDK Integration

    from mixpeek import Mixpeek

    # Authenticate the client with your API key.
    mixpeek = Mixpeek(api_key="YOUR_API_KEY")

    # Ingest PDFs from S3 and embed them with the ColModernVBERT feature extractor.
    mixpeek.ingest.documents(
        collection="invoices",
        source={"type": "s3", "bucket": "invoice-pdfs"},
        pipeline={
            "embedding": {
                "model": "mixpeek://image_extractor@v1/modernvbert_colmodernvbert_v1"
            }
        },
    )

    Capabilities

    • Visual document retrieval
    • CPU-friendly inference
    • Late-interaction scoring
    • Document image search

    Use Cases on Mixpeek

    • On-device document search for mobile and edge applications
    • Low-resource document retrieval servers
    • Rapid prototyping of visual search systems
    • Cost-effective document retrieval at scale

    Benchmarks

    Dataset      Metric    Score    Source
    ViDoRe V2    nDCG@5    78.9     Model card
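
    For readers unfamiliar with the metric, the sketch below shows one common way to compute nDCG@5 for a single query. It assumes the linear-gain form of DCG and a list of relevance labels in ranked order; the benchmark's own evaluation code may differ in details, and the reported 78.9 is presumably an average over queries on a 0-100 scale.

```python
import math
from typing import List


def dcg_at_k(relevances: List[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: List[float], k: int = 5) -> float:
    """DCG of the actual ranking divided by the DCG of the ideal ranking.

    ranked_relevances holds the relevance label of every judged document,
    in the order the retriever returned them.
    """
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Hypothetical example: the only relevant page is returned in second place.
print(round(ndcg_at_k([0, 1, 0, 0, 0]), 4))
```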

    Performance

    Input Size: Variable
    Latency: ~35 ms per page on A100 GPU; ~200 ms per page on CPU
    Throughput: ~28 pages/sec (GPU)
    GPU Memory: Model dependent
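
    The latency and throughput figures are related by simple arithmetic, sketched below. The calculation assumes pages are processed one at a time with no batching or parallelism, and the one-million-page corpus is a hypothetical workload chosen for illustration.

```python
def pages_per_second(latency_ms: float) -> float:
    """Sequential throughput implied by a per-page latency."""
    return 1000.0 / latency_ms


def hours_to_index(num_pages: int, latency_ms: float) -> float:
    """Wall-clock hours to embed num_pages sequentially."""
    return num_pages * latency_ms / 1000.0 / 3600.0


GPU_LATENCY_MS = 35.0   # ~35 ms per page on A100 (from the table above)
CPU_LATENCY_MS = 200.0  # ~200 ms per page on CPU (from the table above)

for name, latency in (("GPU", GPU_LATENCY_MS), ("CPU", CPU_LATENCY_MS)):
    print(
        name,
        round(pages_per_second(latency), 1), "pages/sec,",
        round(hours_to_index(1_000_000, latency), 1), "hours per 1M pages",
    )
```

    Under these assumptions, 35 ms per page works out to roughly 28 pages/sec, consistent with the listed GPU throughput, while 200 ms per page implies about 5 pages/sec on CPU.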

    Specification

    Framework: HF
    Organization: ModernVBERT
    Feature: Visual Embeddings
    Output: 768-dim vector
    Modalities: video, image
    Retriever: Vector Search
    Parameters: 250M
    License: Apache 2.0
    Downloads/mo: N/A

    Research Paper

    Model paper or technical report: arxiv.org

    Build a pipeline with ColModernVBERT

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
