ColMate-3B

by ahmed-masry

Late-interaction multimodal document retrieval with OCR-aware pretraining

N/A downloads/month

3B params

Identifiers

Model ID

ahmed-masry/ColMate-3B

Feature URI

mixpeek://image_extractor@v1/ahmed_masry_colmate_3b_v1

Overview

ColMate-3B is a late-interaction multimodal retrieval model that combines OCR-based pretraining with masked contrastive learning for visual document retrieval. It produces multi-vector representations that capture fine-grained token-patch interactions, and it achieves strong results on document retrieval benchmarks such as ViDoRe V2 without requiring expensive OCR at query time.

Architecture

ColMate-3B uses a late-interaction architecture built on a vision-language backbone. During pretraining, the model learns OCR-aware representations through masked contrastive objectives, predicting which text tokens correspond to which image patches. At retrieval time, it scores each document with MaxSim: for every query token vector it takes the maximum similarity over the document's patch vectors, then sums those maxima across query tokens. This is the same scoring used by ColBERT, extended to the visual domain.
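
The MaxSim scoring described above can be sketched in a few lines. This is a minimal illustration, not ColMate's or Mixpeek's actual implementation: the names Vector, dot, maxSimScore, and rankDocuments are hypothetical, similarity is assumed to be a plain dot product, and the embeddings are assumed to be precomputed arrays of numbers.

```typescript
// Hypothetical helpers for illustration only; not part of the Mixpeek SDK
// or the ColMate release. Embeddings are assumed to be precomputed.
type Vector = number[];

// Dot product of two equal-length vectors (assumed similarity measure).
function dot(a: Vector, b: Vector): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// MaxSim: for each query token, keep its best-matching document patch,
// then sum those best similarities over all query tokens.
function maxSimScore(queryTokens: Vector[], docPatches: Vector[]): number {
  if (docPatches.length === 0) return 0;
  let score = 0;
  for (const q of queryTokens) {
    let best = -Infinity;
    for (const p of docPatches) {
      best = Math.max(best, dot(q, p));
    }
    score += best;
  }
  return score;
}

// Rank documents by MaxSim score, highest first; returns document indices.
function rankDocuments(queryTokens: Vector[], docs: Vector[][]): number[] {
  return docs
    .map((patches, index) => ({ index, score: maxSimScore(queryTokens, patches) }))
    .sort((a, b) => b.score - a.score)
    .map((d) => d.index);
}
```

Because each query token is matched to its single best patch, a document only needs one relevant region per query term to score well, which is what makes late interaction suited to dense pages with mixed text and figures.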

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "s3",
    version: "v1",
    parameters: { model_id: "mixpeek://image_extractor@v1/ahmed_masry_colmate_3b_v1" },
  },
});