ColMate-3B
by ahmed-masry
Late-interaction multimodal document retrieval with OCR-aware pretraining
ahmed-masry/ColMate-3B
mixpeek://image_extractor@v1/ahmed_masry_colmate_3b_v1
Overview
ColMate-3B is a late-interaction multimodal retrieval model that combines OCR-based pretraining with masked contrastive learning for visual document retrieval. It produces multi-vector representations that capture fine-grained token-patch interactions, reaching an nDCG@5 of 86.3 on ViDoRe V2 (per the model card) without requiring expensive OCR at query time.
Architecture
ColMate-3B uses a late-interaction architecture built on a vision-language backbone. During pretraining, the model learns OCR-aware representations through masked contrastive objectives, predicting which text tokens correspond to which image patches. At retrieval time, it computes MaxSim between query token vectors and document patch vectors: each query token is matched to its most similar document patch, and the per-token maxima are summed into a single relevance score. This is similar to ColBERT but extended to the visual domain.
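The MaxSim scoring described above can be sketched in a few lines. This is a minimal illustration in plain Python, not ColMate-3B's actual implementation: the function names `maxsim_score` and `rank_documents` are hypothetical, and L2-normalizing the vectors (so the dot product is cosine similarity) is an assumption borrowed from ColBERT-style models. A real system would run this on batched tensors rather than Python lists.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (assumption: cosine similarity is used)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score between one query and one document.

    query_vecs: list of query token vectors.
    doc_vecs:   list of document patch vectors (must be non-empty).
    For each query token, take the highest similarity over all document
    patches, then sum those maxima over the query tokens.
    """
    q = [_normalize(v) for v in query_vecs]
    d = [_normalize(v) for v in doc_vecs]
    total = 0.0
    for qv in q:
        total += max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
    return total


def rank_documents(query_vecs, docs):
    """Return (document indices sorted best-first, raw scores)."""
    scores = [maxsim_score(query_vecs, doc) for doc in docs]
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return order, scores
```

Because each query token picks its own best-matching patch, a document can score well when different parts of the page answer different parts of the query, which a single pooled vector per page cannot express.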
Mixpeek SDK Integration
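from mixpeek import Mixpeek

# Create a client with your API key.
mixpeek = Mixpeek(api_key="YOUR_API_KEY")

# Ingest documents from an S3 bucket into a collection, embedding them with ColMate-3B.
mixpeek.ingest.documents(
    collection="scanned_docs",
    source={"type": "s3", "bucket": "document-archive"},
    pipeline={
        "embedding": {
            "model": "mixpeek://image_extractor@v1/ahmed_masry_colmate_3b_v1"
        }
    },
)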
Capabilities
- Visual document retrieval
- Late-interaction scoring
- OCR-free document search
- Cross-modal matching
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| ViDoRe V2 | nDCG@5 | 86.3 | Model card |
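For readers unfamiliar with the metric in the table, the sketch below shows one common way to compute nDCG@5 for a single query. It is illustrative only: the helper names are hypothetical, the linear-gain form of DCG is an assumption (some evaluations use an exponential gain), and the input list is assumed to contain the relevance labels of all judged documents for the query in ranked order. Benchmark scores such as the one above are typically this value averaged over queries and reported on a 0 to 100 scale, though the exact evaluation setup is not stated here.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=5):
    """nDCG@k for one query.

    ranked_relevances: relevance labels of the judged documents, in the
    order the retriever returned them (assumed to cover all judged documents).
    """
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg
```

A query whose only relevant page is ranked first scores 1.0; if that page slips to second place, the score drops to 1 / log2(3), roughly 0.63.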
Research Paper
Model paper or technical report
arxiv.org
Build a pipeline with ColMate-3B
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.