GTE-ModernColBERT-v1

by lightonai

Late interaction retrieval model with record-breaking long-context performance

119K downloads/month

149M params


Identifiers

Model ID

lightonai/GTE-ModernColBERT-v1

Feature URI

mixpeek://text_extractor@v1/lighton_gte_moderncolbert_v1

Overview

GTE-ModernColBERT-v1 is a ColBERT-style late interaction retrieval model built on the ModernBERT architecture. Instead of compressing an entire document into a single vector, it produces 128-dimensional embeddings for every token, then scores query-document pairs using MaxSim: for each query token, find the best-matching document token and sum the scores. This token-level matching preserves fine-grained detail that single-vector models lose.
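The MaxSim step described above can be sketched in a few lines. This is a minimal illustration, not the model's or PyLate's actual implementation: it assumes the dot product as the similarity function (equivalent to cosine similarity if embeddings are L2-normalized) and uses toy 3-dimensional vectors in place of the real 128-dimensional ones. The names `TokenEmbeddings`, `dot`, and `maxSim` are made up for this example.

```typescript
// One embedding vector per token (128 dims in the real model; 3 here for brevity).
type TokenEmbeddings = number[][];

// Dot product of two equal-length vectors (assumed similarity function).
function dot(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// MaxSim: for each query token, take its best-matching document token, then sum.
function maxSim(query: TokenEmbeddings, doc: TokenEmbeddings): number {
  if (doc.length === 0) throw new Error("document has no tokens");
  let score = 0;
  for (const q of query) {
    let best = -Infinity;
    for (const d of doc) best = Math.max(best, dot(q, d));
    score += best;
  }
  return score;
}

// Toy data: docA contains an exact match for each query token, docB only partial ones.
const query: TokenEmbeddings = [[1, 0, 0], [0, 1, 0]];
const docA: TokenEmbeddings = [[1, 0, 0], [0, 1, 0], [0, 0, 1]];
const docB: TokenEmbeddings = [[0, 0, 1], [0.5, 0.5, 0]];

console.log(maxSim(query, docA)); // 2
console.log(maxSim(query, docB)); // 1
```

Because each query token picks its own best match, a document scores well only if it covers every query token somewhere in its text. A single pooled vector cannot express that.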

The model's standout capability is long-context retrieval. On the LongEmbed benchmark (documents up to 32K tokens), it reaches a mean score of 88.39, roughly 10 points above the previous state of the art. It also outperforms ColBERT-small on BEIR while natively supporting 32K-token documents. It was trained in just 15K steps on MS MARCO using LightOn's PyLate library, showing that the ModernBERT + ColBERT recipe produces competitive results with minimal training compute.

Architecture

A ModernBERT encoder (from Alibaba-NLP/gte-modernbert-base) followed by a linear projection layer (768 → 128 dimensions, no bias, no activation), producing one 128-dim embedding per token. The default query length is 32 tokens; documents can be up to 32K tokens. Query-document pairs are scored with the MaxSim operator. Trained with knowledge distillation on MS MARCO using PyLate.
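As a rough illustration of the projection step, the sketch below applies a bias-free, activation-free linear map from 768 to 128 dimensions to each token's hidden state. The weights and hidden states are made-up placeholders, not the model's learned values, and `project` is a hypothetical helper rather than part of any library.

```typescript
const IN_DIM = 768;  // encoder hidden size, per the description above
const OUT_DIM = 128; // per-token embedding size

// Bias-free linear projection: y = W x, with W shaped [outDim][inDim].
function project(hidden: number[], weight: number[][]): number[] {
  return weight.map((row) => {
    if (row.length !== hidden.length) throw new Error("dimension mismatch");
    let sum = 0;
    for (let i = 0; i < row.length; i++) sum += row[i] * hidden[i];
    return sum;
  });
}

// Placeholder weights: row r simply copies input dimension r.
// Real weights are learned during training.
const weight: number[][] = Array.from({ length: OUT_DIM }, (_, r) =>
  Array.from({ length: IN_DIM }, (_, c) => (r === c ? 1 : 0)),
);

// Placeholder encoder outputs for a 3-token input.
const hiddenStates: number[][] = Array.from({ length: 3 }, (_, t) =>
  Array.from({ length: IN_DIM }, (_, c) => t + c / IN_DIM),
);

// One 128-dim embedding per token, rather than one vector per document.
const tokenEmbeddings = hiddenStates.map((h) => project(h, weight));
console.log(tokenEmbeddings.length, tokenEmbeddings[0].length); // 3 128
```

The output keeps one row per input token, so the index grows with document length. That is the storage cost of late interaction compared with single-vector models.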

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "text_embedding",
    version: "v1",
    parameters: { model_id: "lightonai/GTE-ModernColBERT-v1" },
  },
});

Capabilities

Late interaction retrieval with per-token 128-dim embeddings
Long-context support up to 32K tokens (tested to 32,768)
88.39 mean on LongEmbed benchmark (~10 points above prior SOTA)
54.75 NDCG@10 on BEIR, outperforming ColBERT-small
Apache 2.0 license, reproducible training with PyLate

Use Cases on Mixpeek

Precision retrieval for entity-rich queries in Mixpeek multi-stage pipelines

Long-document search where single-vector compression loses detail

Second-stage rescoring after dense retrieval for factoid and exact-match queries
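The second-stage rescoring use case can be sketched as follows. This is a hypothetical, self-contained illustration and does not use the Mixpeek SDK: a first-stage dense retriever is assumed to have returned candidates with their token embeddings already available, and the `Candidate` shape, `rescore` helper, and toy 2-dimensional vectors are all invented for the example.

```typescript
type TokenEmbeddings = number[][];

// Hypothetical candidate returned by a first-stage dense retriever.
interface Candidate {
  id: string;
  denseScore: number;      // first-stage score, used only to pick candidates
  tokens: TokenEmbeddings; // per-token embeddings from the late interaction model
}

// MaxSim with dot-product similarity (assumes L2-normalized embeddings).
function maxSim(query: TokenEmbeddings, doc: TokenEmbeddings): number {
  let score = 0;
  for (const q of query) {
    let best = -Infinity;
    for (const d of doc) {
      let sim = 0;
      for (let i = 0; i < q.length; i++) sim += q[i] * d[i];
      best = Math.max(best, sim);
    }
    score += best;
  }
  return score;
}

// Re-rank candidates by MaxSim and keep the top K.
function rescore(query: TokenEmbeddings, candidates: Candidate[], topK: number) {
  return candidates
    .map((c) => ({ id: c.id, score: maxSim(query, c.tokens) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

const query: TokenEmbeddings = [[1, 0], [0, 1]];
const candidates: Candidate[] = [
  { id: "doc-1", denseScore: 0.9, tokens: [[0.7, 0.7]] },     // blends both query terms
  { id: "doc-2", denseScore: 0.8, tokens: [[1, 0], [0, 1]] }, // matches each term exactly
  { id: "doc-3", denseScore: 0.7, tokens: [[1, 0]] },         // matches only one term
];

const reranked = rescore(query, candidates, 2);
console.log(reranked.map((r) => r.id)); // [ 'doc-2', 'doc-1' ]
```

In this toy data the dense stage ranks doc-1 first, but MaxSim promotes doc-2 because it contains an exact match for every query token. Factoid and exact-match queries are where that reordering is expected to matter.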

Benchmarks

Dataset	Metric	Score	Source
BEIR (15 datasets)	NDCG@10	54.75	LightOn, 2025: Model Card
LongEmbed (32K context)	Mean Score	88.39	LightOn, 2025: Blog Post
NanoBEIR	NDCG@10	67.58	LightOn, 2025: Model Card