Reason-ModernColBERT

by lightonai

Late-interaction retriever trained for reasoning-intensive search queries

9.1K downloads/month

150M-class params


Identifiers

Model ID

lightonai/Reason-ModernColBERT

Feature URI

mixpeek://text_extractor@v1/lighton_reason_moderncolbert_v1

Overview

Reason-ModernColBERT is a PyLate ColBERT model fine-tuned from LightOn's GTE-ModernColBERT-v1 on the ReasonIR dataset. It targets retrieval problems where the query is not a short keyword string but a reasoning-heavy prompt that requires matching evidence across paragraphs.

On Mixpeek, this makes it a useful text retrieval companion for agents. After visual, audio, or document extractors produce text evidence, Reason-ModernColBERT can retrieve passages that match an agent's intermediate reasoning state with token-level MaxSim scoring instead of collapsing each document into one dense vector.

Architecture

ModernBERT-based late-interaction retriever trained with PyLate. It maps queries and passages to sequences of 128-dimensional token vectors and scores them with MaxSim. The model supports 8,192-token documents and 128-token queries according to the model card.
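To make the scoring step concrete, here is a minimal sketch of MaxSim over token vectors. It is an illustration, not PyLate's or Mixpeek's implementation: the names `TokenVector`, `maxSim`, and `rankPassages` are hypothetical, and it assumes the token vectors are already L2-normalized so that a dot product behaves like cosine similarity.

```typescript
// One embedding per token. The model emits 128-dimensional vectors;
// this sketch works for any consistent dimension.
type TokenVector = number[];

// Dot product of two vectors of equal length.
function dot(a: TokenVector, b: TokenVector): number {
  if (a.length !== b.length) {
    throw new Error(`dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

// MaxSim: for each query token, keep the similarity of its best-matching
// passage token, then sum those maxima over all query tokens.
function maxSim(query: TokenVector[], passage: TokenVector[]): number {
  if (passage.length === 0) {
    return 0; // nothing to match against
  }
  let score = 0;
  for (const q of query) {
    let best = -Infinity;
    for (const p of passage) {
      best = Math.max(best, dot(q, p));
    }
    score += best;
  }
  return score;
}

// Hypothetical helper: score every passage and sort best-first.
function rankPassages(
  query: TokenVector[],
  passages: { id: string; tokens: TokenVector[] }[],
): { id: string; score: number }[] {
  return passages
    .map((p) => ({ id: p.id, score: maxSim(query, p.tokens) }))
    .sort((a, b) => b.score - a.score);
}
```

Because each query token is matched independently, a passage can score well when different parts of it cover different parts of a long query, which a single pooled vector per document cannot express.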

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

// "API_KEY" is a placeholder; substitute your own Mixpeek API key
const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "text_embeddings",
    version: "v1",
    parameters: { model_id: "lightonai/Reason-ModernColBERT" },
  },
});

Capabilities

Reasoning-intensive retrieval over long passages
Late-interaction token matching with MaxSim
8K-token document support
Useful for agent queries that include context, constraints, and partial findings
Fine-tuned on ReasonIR data from GTE-ModernColBERT-v1
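As an illustration of the agent-query use case, the sketch below assembles a question, context, constraints, and partial findings into one query string. Everything here is hypothetical and not part of the Mixpeek SDK: the `ReasoningState` shape, the `buildAgentQuery` helper, and the default budget of 96 words. The word count is only a rough stand-in for the 128-token query limit, since the model's tokenizer does not split on whitespace.

```typescript
// Hypothetical shape for an agent's intermediate reasoning state.
interface ReasoningState {
  question: string;
  context?: string;
  constraints?: string[];
  findings?: string[];
}

// Join the parts into one query, most important first, and cut it to a
// word budget. Assumption: words approximate tokens closely enough for a
// first-pass limit; use the real tokenizer for an exact count.
function buildAgentQuery(state: ReasoningState, maxWords = 96): string {
  const constraints = state.constraints ?? [];
  const findings = state.findings ?? [];

  const parts = [
    state.question,
    state.context ? `Context: ${state.context}` : "",
    constraints.length > 0 ? `Constraints: ${constraints.join("; ")}` : "",
    findings.length > 0 ? `Findings so far: ${findings.join("; ")}` : "",
  ].filter((part) => part.length > 0);

  const words = parts
    .join(" ")
    .split(/\s+/)
    .filter((word) => word.length > 0);

  // The question comes first, so truncation drops trailing detail.
  return words.slice(0, maxWords).join(" ");
}
```

Putting the question first is a deliberate choice in this sketch: if the budget is exceeded, the findings are trimmed before the question or constraints are.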