ColModernVBERT
by ModernVBERT
Compact 250M-param vision-language encoder for visual document retrieval
ModernVBERT/ColModernVBERT
mixpeek://image_extractor@v1/modernvbert_colmodernvbert_v1
Overview
ColModernVBERT is a compact late-interaction model for visual document retrieval that matches models 10x its size. Built on the ModernBERT architecture extended to vision, it produces multi-vector representations of document images that enable efficient MaxSim-based retrieval. Its small footprint means it can run on CPU hardware, making it practical for edge deployment.
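To put the small footprint in rough numbers, the sketch below estimates the memory needed for the model weights alone, using the 250M parameter count stated above. The dtype table is an illustrative assumption: this page does not say which precisions the model ships in or supports, and the estimate ignores activations, image preprocessing buffers, and stored document embeddings.

```python
# Back-of-the-envelope weight memory for a 250M-parameter model.
# Assumption: the dtypes listed here are illustrative, not a statement
# about which precisions ColModernVBERT actually supports.
PARAM_COUNT = 250_000_000

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(param_count: int, dtype: str) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return param_count * BYTES_PER_PARAM[dtype] / 1e9


estimates = {dtype: weight_memory_gb(PARAM_COUNT, dtype) for dtype in BYTES_PER_PARAM}

for dtype, gb in estimates.items():
    print(f"{dtype}: {gb:.2f} GB")
```

Under these assumptions the weights occupy about 1 GB in float32, which fits comfortably in the RAM of an ordinary CPU machine.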
Architecture
Late-interaction vision-language encoder based on ModernBERT. Uses alternating attention and MLP blocks with Flash Attention for efficient token processing. Vision inputs are patchified and projected into the same embedding space as text tokens. Retrieval uses MaxSim aggregation over per-token embeddings.
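The MaxSim aggregation mentioned above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the model's actual implementation: the function names, the toy two-dimensional vectors, and the choice to L2-normalize embeddings before scoring are all assumptions made for the example. For each query token it takes the highest similarity to any document patch vector, then sums those maxima over the query tokens.

```python
from math import sqrt


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: sum over query tokens of the best-matching
    document vector's cosine similarity."""
    q = [l2_normalize(v) for v in query_vecs]
    d = [l2_normalize(v) for v in doc_vecs]
    return sum(
        max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
        for qv in q
    )


def rank_documents(query_vecs, docs):
    """Score every document and return (doc_id, score) pairs, best first."""
    scored = [(doc_id, maxsim_score(query_vecs, vecs)) for doc_id, vecs in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: 2 query-token vectors, and per-page lists of patch vectors.
query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "page_a": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    "page_b": [[1.0, 0.0], [1.0, 0.1]],
}

ranking = rank_documents(query, docs)
print(ranking)
```

Here `page_a` ranks first because both query tokens find an exact match among its patch vectors, while `page_b` only matches the first token well. Production systems usually run the same computation as batched matrix operations instead of Python loops.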
Mixpeek SDK Integration
    from mixpeek import Mixpeek

    mixpeek = Mixpeek(api_key="YOUR_API_KEY")

    mixpeek.ingest.documents(
        collection="invoices",
        source={"type": "s3", "bucket": "invoice-pdfs"},
        pipeline={
            "embedding": {
                "model": "mixpeek://image_extractor@v1/modernvbert_colmodernvbert_v1"
            }
        },
    )
Capabilities
- Visual document retrieval
- CPU-friendly inference
- Late-interaction scoring
- Document image search
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| ViDoRe V2 | nDCG@5 | 78.9 | Model card |
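For readers unfamiliar with the metric, here is a minimal sketch of how nDCG@5 can be computed for a single query. It assumes the common linear-gain formulation and that the input list contains the relevance labels of all judged documents in ranked order; the benchmark's own evaluation code may differ in details such as gain function or tie handling. Reported scores like the one in the table are typically per-query values averaged over all queries and scaled to 0-100.

```python
from math import log2


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=5):
    """DCG of the actual ranking divided by DCG of the ideal ranking.

    Assumption: ranked_relevances holds labels for every judged document,
    so sorting it in descending order gives the ideal ordering.
    """
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Toy ranking: relevant pages returned at positions 1 and 3.
ranked_rels = [1, 0, 1, 0, 0, 0]
score = ndcg_at_k(ranked_rels, k=5)
print(f"nDCG@5 = {score:.4f}")
```

A ranking that places every relevant page first scores 1.0, and the score drops as relevant pages slide further down the top five.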
Research Paper
Model paper or technical report
arxiv.org
Build a pipeline with ColModernVBERT
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.