GTE-ModernColBERT-v1
by lightonai
Late interaction retrieval model with record-breaking long-context performance
lightonai/GTE-ModernColBERT-v1
mixpeek://text_extractor@v1/lighton_gte_moderncolbert_v1
Overview
GTE-ModernColBERT-v1 is a ColBERT-style late interaction retrieval model built on the ModernBERT architecture. Instead of compressing an entire document into a single vector, it produces a 128-dimensional embedding for every token, then scores query-document pairs using MaxSim: for each query token, find the best-matching document token, then sum those best-match scores over all query tokens. This token-level matching preserves fine-grained detail that single-vector models lose.
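The MaxSim scoring described above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the model's own implementation: the function name `maxsim_score` is hypothetical, and the toy embeddings use 2 dimensions instead of 128 so the arithmetic is easy to follow.

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction score between one query and one document.

    query_emb: (num_query_tokens, dim) token embeddings
    doc_emb:   (num_doc_tokens, dim) token embeddings
    """
    # Similarity of every query token against every document token.
    sim = query_emb @ doc_emb.T          # (num_query_tokens, num_doc_tokens)
    # Best-matching document token per query token, summed over the query.
    return float(sim.max(axis=1).sum())

# Toy 2-dim embeddings (the real model emits 128-dim vectors per token).
query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
doc_b = np.array([[0.6, 0.8]])

score_a = maxsim_score(query, doc_a)  # each query token finds an exact match
score_b = maxsim_score(query, doc_b)  # both query tokens share one partial match
```

Here `doc_a` scores higher than `doc_b` because each query token finds its own exact match, whereas a single pooled vector for `doc_b` can only partially match both.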
The model's standout capability is long-context retrieval. On the LongEmbed benchmark (documents up to 32K tokens), it reaches a mean score of 88.39, roughly 10 points above the previous state of the art. It also outperforms ColBERT-small on BEIR while natively supporting documents up to 32K tokens. It was trained in just 15K steps on MS MARCO using LightOn's PyLate library, which suggests that the ModernBERT + ColBERT recipe yields competitive results with little training compute.
Architecture
The model is a ModernBERT encoder (from Alibaba-NLP/gte-modernbert-base) followed by a linear projection layer (768 → 128 dimensions, no bias, no activation), so it produces one 128-dimensional embedding per token. The default query length is 32 tokens, and documents can be up to 32K tokens long. Scoring uses the MaxSim operator. The model was trained with knowledge distillation on MS MARCO using PyLate.
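The projection step can be illustrated with a small NumPy sketch. It uses random stand-in values rather than the real encoder outputs or trained weights, and the helper name `project_tokens` is hypothetical. The L2 normalization is an assumption based on common ColBERT practice; the description above does not state it.

```python
import numpy as np

HIDDEN_DIM, EMBED_DIM = 768, 128  # dimensions stated in the architecture description

def project_tokens(hidden_states: np.ndarray, weight: np.ndarray,
                   normalize: bool = True) -> np.ndarray:
    """Map per-token encoder states (num_tokens, 768) to (num_tokens, 128)."""
    # Pure linear map: no bias term and no activation function.
    projected = hidden_states @ weight
    if normalize:
        # Assumption: unit-length token vectors, so dot products act as cosine similarity.
        norms = np.linalg.norm(projected, axis=1, keepdims=True)
        projected = projected / np.clip(norms, 1e-12, None)
    return projected

rng = np.random.default_rng(0)
hidden = rng.normal(size=(10, HIDDEN_DIM))                 # stand-in for 10 token states
weight = rng.normal(size=(HIDDEN_DIM, EMBED_DIM)) * 0.02   # stand-in projection matrix
token_embeddings = project_tokens(hidden, weight)
```

Each document therefore stores one 128-dim vector per token, which is what MaxSim later compares against the query token vectors.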
Mixpeek SDK Integration
    from mixpeek import Mixpeek

    mx = Mixpeek(api_key="YOUR_KEY")

    # Index documents with late interaction embeddings for precision retrieval
    mx.ingest(
        collection_id="knowledge-base",
        source="s3://documents/",
        extractors=[
            {
                "type": "text_embedding",
                "model": "lightonai/GTE-ModernColBERT-v1",
                "output_feature": "colbert_tokens",
            },
            {
                "type": "text_embedding",
                "model": "BAAI/bge-m3",
                "output_feature": "dense_embedding",
            },
        ],
    )
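The ingest call indexes both a dense embedding and ColBERT token embeddings. One common way to combine two such features is to retrieve candidates with the cheap dense vectors and rerank them with MaxSim. The sketch below illustrates that pattern in plain NumPy under that assumption; it does not use or describe the Mixpeek API, and the names `two_stage_search` and `maxsim` are hypothetical.

```python
import numpy as np

def maxsim(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    # Sum over query tokens of the best-matching document token similarity.
    return float((query_tokens @ doc_tokens.T).max(axis=1).sum())

def two_stage_search(query_dense, query_tokens, doc_dense, doc_tokens,
                     k_candidates=2, k_final=1):
    """Dense first-stage retrieval followed by late-interaction reranking."""
    # Stage 1: one dot product per document against the single dense vector.
    dense_scores = doc_dense @ query_dense
    candidates = np.argsort(-dense_scores)[:k_candidates]
    # Stage 2: rerank only the candidates with token-level MaxSim.
    reranked = sorted(candidates,
                      key=lambda i: maxsim(query_tokens, doc_tokens[i]),
                      reverse=True)
    return [int(i) for i in reranked[:k_final]]

# Toy corpus of three documents with 2-dim vectors.
doc_dense = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
doc_tokens = [
    np.array([[1.0, 0.0]]),               # doc 0 covers one query token
    np.array([[1.0, 0.0], [0.0, 1.0]]),   # doc 1 covers both query tokens
    np.array([[0.0, 1.0]]),               # doc 2 is dropped in stage 1
]
query_dense = np.array([1.0, 0.0])
query_tokens = np.array([[1.0, 0.0], [0.0, 1.0]])

top = two_stage_search(query_dense, query_tokens, doc_dense, doc_tokens)
```

In this toy case the dense stage ranks document 0 first, but the rerank promotes document 1 because it matches both query tokens.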
Capabilities
- Late interaction retrieval with per-token 128-dim embeddings
- Long-context support up to 32K tokens (tested to 32,768)
- 88.39 mean on LongEmbed benchmark (~10 points above prior SOTA)
- 54.75 NDCG@10 on BEIR — outperforms ColBERT-small
- Apache 2.0 license, reproducible training with PyLate
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| BEIR (15 datasets) | NDCG@10 | 54.75 | LightOn, 2025 — Model Card |
| LongEmbed (32K context) | Mean Score | 88.39 | LightOn, 2025 — Blog Post |
| NanoBEIR | NDCG@10 | 67.58 | LightOn, 2025 — Model Card |
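For readers unfamiliar with the NDCG@10 metric used in the table, here is a minimal sketch of how it is typically computed. It assumes the linear-gain variant (relevance divided by a log2 rank discount); the exact evaluation settings behind the reported scores are not given here. The function names are hypothetical, and the table's values appear to be on a 0 to 100 scale, whereas this sketch returns values between 0 and 1.

```python
import math

def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain: lower ranks contribute less.
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """ranked_relevances: relevance labels of the results in returned order.
    all_relevances: relevance labels of every judged document for the query."""
    ideal_dcg = dcg_at_k(sorted(all_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# Two relevant documents exist; the system returns them at ranks 1 and 3.
example = ndcg_at_k([1, 0, 1], [1, 1])
```

A perfect ranking yields 1.0; placing a relevant document lower in the list reduces the score through the logarithmic discount.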
Research Paper
LightOn Releases GTE-ModernColBERT, First SOTA Late-Interaction Model Trained on PyLate
arxiv.org
Build a pipeline with GTE-ModernColBERT-v1
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.