Lance

by bytedance-research

Unified 3B model for image and video understanding, generation, and editing

32K downloads/month

3B parameters


Identifiers

Model ID

bytedance-research/Lance

Feature URI

mixpeek://video_extractor@v1/bytedance_lance_3b_v1
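The feature URI above appears to pack three pieces of information into one string. The sketch below splits it apart, assuming the layout is `mixpeek://<extractor>@<version>/<model>`; that layout is inferred from this single example rather than from a published specification, and `parseFeatureUri` and `FeatureUri` are hypothetical names, not part of the Mixpeek SDK.

```typescript
// Hypothetical helper (not part of the Mixpeek SDK).
// Assumes the layout mixpeek://<extractor>@<version>/<model>,
// inferred from the one URI shown on this page.
interface FeatureUri {
  extractor: string;
  version: string;
  model: string;
}

function parseFeatureUri(uri: string): FeatureUri {
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) {
    throw new Error(`Unrecognized feature URI: ${uri}`);
  }
  const [, extractor, version, model] = match;
  return { extractor, version, model };
}

const parsed = parseFeatureUri(
  "mixpeek://video_extractor@v1/bytedance_lance_3b_v1"
);
console.log(parsed);
// { extractor: "video_extractor", version: "v1", model: "bytedance_lance_3b_v1" }
```

Validating the string before sending it in a request can surface a mistyped identifier early, though the service remains the authority on which URIs are valid.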

Overview

Lance is ByteDance's 3B-parameter unified vision model that handles image understanding, video understanding, image generation, video generation, and image/video editing in a single architecture. It uses a vision tokenizer to convert between continuous pixel space and discrete token space, so that one shared transformer can reason over both images and video.
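To make the idea of moving between continuous and discrete representations concrete, here is a toy sketch of nearest-neighbor vector quantization, one common way a vision tokenizer can map continuous feature vectors to discrete token IDs. It is an illustration only: the page does not say which quantization scheme Lance uses, and the codebook, feature values, and function names are all made up for this example.

```typescript
// Toy vector quantization. Illustrative only; this is NOT Lance's tokenizer.
// Each continuous feature vector is replaced by the index of its nearest
// codebook entry (squared Euclidean distance).
type Vector = number[];

function squaredDistance(a: Vector, b: Vector): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return sum;
}

// Continuous features -> discrete token IDs.
function quantize(features: Vector[], codebook: Vector[]): number[] {
  return features.map((feature) => {
    let best = 0;
    let bestDist = Infinity;
    codebook.forEach((entry, index) => {
      const dist = squaredDistance(feature, entry);
      if (dist < bestDist) {
        bestDist = dist;
        best = index;
      }
    });
    return best;
  });
}

// Discrete token IDs -> approximate continuous features.
function dequantize(tokens: number[], codebook: Vector[]): Vector[] {
  return tokens.map((token) => codebook[token]);
}

// Made-up 2-D codebook and patch features for demonstration.
const codebook: Vector[] = [[0, 0], [1, 0], [0, 1], [1, 1]];
const patchFeatures: Vector[] = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.9]];

const tokens = quantize(patchFeatures, codebook);
console.log(tokens); // [0, 3, 2]
console.log(dequantize(tokens, codebook)); // [[0, 0], [1, 1], [0, 1]]
```

The round trip is lossy: decoding returns the codebook entries, not the original features. Real tokenizers learn the codebook jointly with an encoder and decoder to keep that loss small.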

On Mixpeek, Lance serves as a compact video understanding model that can caption, describe, and answer questions about both images and video. Because the architecture is unified, a single model can power scene description, visual Q&A, and content analysis pipelines.

Architecture

Unified autoregressive transformer with a learned vision tokenizer. 3B parameters. Supports text-to-image, text-to-video, image/video understanding, and editing through a shared token space.
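One simple way to picture a shared token space is to place vision tokens after the text vocabulary in a single ID range, so one autoregressive model can consume an interleaved sequence. The sketch below shows that offset scheme. It is an assumption for illustration, not a description of how Lance lays out its vocabulary, and the vocabulary sizes and function names are invented.

```typescript
// Illustrative only; the sizes and layout are assumptions, not Lance's values.
const TEXT_VOCAB_SIZE = 1000;
const VISION_CODEBOOK_SIZE = 256;

type Modality = "text" | "vision";

// Map a modality-local token ID into one shared ID range:
// text occupies [0, TEXT_VOCAB_SIZE), vision follows immediately after.
function toSharedId(modality: Modality, localId: number): number {
  const limit = modality === "text" ? TEXT_VOCAB_SIZE : VISION_CODEBOOK_SIZE;
  if (!Number.isInteger(localId) || localId < 0 || localId >= limit) {
    throw new RangeError(`Invalid ${modality} token ID: ${localId}`);
  }
  return modality === "text" ? localId : TEXT_VOCAB_SIZE + localId;
}

// Recover the modality and local ID from a shared ID.
function fromSharedId(sharedId: number): { modality: Modality; localId: number } {
  const total = TEXT_VOCAB_SIZE + VISION_CODEBOOK_SIZE;
  if (!Number.isInteger(sharedId) || sharedId < 0 || sharedId >= total) {
    throw new RangeError(`Invalid shared token ID: ${sharedId}`);
  }
  return sharedId < TEXT_VOCAB_SIZE
    ? { modality: "text", localId: sharedId }
    : { modality: "vision", localId: sharedId - TEXT_VOCAB_SIZE };
}

// A made-up prompt: two text tokens followed by three vision tokens.
const sequence: number[] = [
  ...[12, 7].map((id) => toSharedId("text", id)),
  ...[0, 3, 2].map((id) => toSharedId("vision", id)),
];
console.log(sequence); // [12, 7, 1000, 1003, 1002]
console.log(fromSharedId(1003)); // { modality: "vision", localId: 3 }
```

With every token in one ID range, understanding tasks (vision tokens in, text tokens out) and generation tasks (text tokens in, vision tokens out) become the same next-token prediction problem.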

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "s3",
    version: "v1",
    parameters: { model_id: "mixpeek://video_extractor@v1/bytedance_lance_3b_v1" },
  },
});