    HF · Visual Embeddings · Apache 2.0

    ColMate-3B

    by ahmed-masry

    Late-interaction multimodal document retrieval with OCR-aware pretraining

    Downloads per month: N/A
    Parameters: 3B
    Identifiers
    Model ID: ahmed-masry/ColMate-3B
    Feature URI: mixpeek://image_extractor@v1/ahmed_masry_colmate_3b_v1

    Overview

    ColMate-3B is a late-interaction multimodal model for visual document retrieval. It combines OCR-based pretraining with masked contrastive learning and produces multi-vector representations that capture fine-grained interactions between query tokens and image patches. The model card reports an nDCG@5 of 86.3 on ViDoRe V2, and retrieval does not require running expensive OCR at query time.

    Architecture

    ColMate-3B uses a late-interaction architecture built on a vision-language backbone. During pretraining, the model learns OCR-aware representations through masked contrastive objectives that predict which text tokens correspond to which image patches. At retrieval time, it computes MaxSim between query token vectors and document patch vectors: for each query token it takes the maximum similarity over all patches of a page, then sums these maxima into the page score. This is the same scoring used by ColBERT, extended to the visual domain.
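The MaxSim scoring described above can be illustrated with a minimal NumPy sketch. It is not ColMate-3B's actual implementation: the function names `maxsim_score` and `rank_pages`, the toy dimension of 16, the random vectors, and the use of cosine similarity are all assumptions made for illustration.

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Late-interaction score for one (query, page) pair.

    query_vecs: (num_query_tokens, dim), doc_vecs: (num_patches, dim).
    """
    # Assumption: vectors are L2-normalised so dot products are cosine similarities.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T  # shape: (num_query_tokens, num_patches)
    # Best-matching patch per query token, summed over query tokens.
    return float(sim.max(axis=1).sum())

def rank_pages(query_vecs: np.ndarray, pages: list) -> list:
    """Return page indices sorted from highest to lowest MaxSim score."""
    scores = [maxsim_score(query_vecs, p) for p in pages]
    return sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)

# Toy data (hypothetical): 4 query tokens, two pages of 50 patches each.
rng = np.random.default_rng(0)
query = rng.normal(size=(4, 16))
page_a = rng.normal(size=(50, 16))
# page_b contains patches identical to the query tokens, so it should win.
page_b = np.vstack([rng.normal(size=(46, 16)), query])

print(rank_pages(query, [page_a, page_b]))  # → [1, 0]
```

Because every query token keeps only its single best patch, a page can score highly when small regions match individual query terms, even if most of the page is unrelated.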

    Mixpeek SDK Integration

    from mixpeek import Mixpeek

    mixpeek = Mixpeek(api_key="YOUR_API_KEY")

    mixpeek.ingest.documents(
        collection="scanned_docs",
        source={"type": "s3", "bucket": "document-archive"},
        pipeline={
            "embedding": {
                "model": "mixpeek://image_extractor@v1/ahmed_masry_colmate_3b_v1"
            }
        }
    )

    Capabilities

    • Visual document retrieval
    • Late-interaction scoring
    • OCR-free document search
    • Cross-modal matching

    Use Cases on Mixpeek

    • Searching scanned document archives without OCR preprocessing
    • Financial report retrieval from image-based PDFs
    • Patent search across visual diagrams and text
    • Legal document discovery from scanned filings

    Benchmarks

    Dataset     Metric   Score   Source
    ViDoRe V2   nDCG@5   86.3    Model card
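For readers unfamiliar with the metric, nDCG@5 rewards rankings that place relevant pages near the top of the first five results. The sketch below uses the common linear-gain, log2-discount variant. The helper names, the relevance labels, and the choice of variant are assumptions, and the page does not say which variant the benchmark uses. The score above is presumably reported on a 0 to 100 scale, whereas this sketch returns a value between 0 and 1.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=5):
    """nDCG@k for one query.

    ranked_relevances: relevance labels in the order the system returned them.
    Assumption: every relevant page appears somewhere in this list, so the
    ideal ranking is simply the same labels sorted in descending order.
    """
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# Hypothetical query: relevant pages were returned at ranks 2 and 5.
print(round(ndcg_at_k([0, 1, 0, 0, 1, 0]), 3))  # → 0.624
```

A perfect ranking, with all relevant pages first, gives 1.0 under this definition.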

    Performance

    Input Size: Variable
    GPU Latency: ~120 ms per page on A100
    GPU Throughput: ~8 pages/sec
    GPU Memory: Model dependent
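The latency and throughput figures agree with each other: one page every 120 ms is about 8.3 pages per second. As a rough sizing aid, the sketch below turns the throughput figure into an indexing-time estimate. The helper names are hypothetical, and the assumption that throughput scales linearly with GPU count is not a claim made by the page.

```python
def pages_per_second(latency_ms: float) -> float:
    """Sequential throughput implied by a per-page latency."""
    return 1000.0 / latency_ms

def indexing_hours(num_pages: int, pages_per_sec: float = 8.0, num_gpus: int = 1) -> float:
    """Estimated wall-clock hours to embed num_pages.

    Assumption: throughput scales linearly with the number of GPUs.
    """
    return num_pages / (pages_per_sec * num_gpus) / 3600.0

print(round(pages_per_second(120), 1))                  # → 8.3
print(round(indexing_hours(1_000_000), 1))              # → 34.7
print(round(indexing_hours(1_000_000, num_gpus=4), 1))  # → 8.7
```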

    Specification

    Framework: HF
    Organization: ahmed-masry
    Feature: Visual Embeddings
    Output: 768-dim vector
    Modalities: video, image
    Retriever: Vector Search
    Parameters: 3B
    License: Apache 2.0
    Downloads/mo: N/A

    Research Paper

    Model paper or technical report

    arxiv.org

    Build a pipeline with ColMate-3B

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
