    HF · Scene Captioning · Apache 2.0

    Lance

    by bytedance-research

    Unified 3B model for image and video understanding, generation, and editing

    32K downloads/month
    3B params
    Identifiers
    Model ID
    bytedance-research/Lance
    Feature URI
    mixpeek://video_extractor@v1/bytedance_lance_3b_v1
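
The feature URI appears to pack three pieces of information into one string. Below is a minimal sketch of splitting it apart with the Python standard library. It assumes the layout is `mixpeek://<extractor>@<version>/<model>`, which is inferred from the single example above rather than from Mixpeek documentation, and `parse_feature_uri` is a hypothetical helper, not part of the SDK.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class FeatureURI(NamedTuple):
    extractor: str  # e.g. "video_extractor"
    version: str    # e.g. "v1"
    model: str      # e.g. "bytedance_lance_3b_v1"


def parse_feature_uri(uri: str) -> FeatureURI:
    """Split a feature URI into its parts (assumed layout, see lead-in)."""
    parts = urlsplit(uri)
    if parts.scheme != "mixpeek":
        raise ValueError(f"expected a mixpeek:// URI, got {uri!r}")
    # The authority part looks like "<extractor>@<version>".
    extractor, sep, version = parts.netloc.partition("@")
    model = parts.path.lstrip("/")
    if not (extractor and sep and version and model):
        raise ValueError(f"malformed feature URI: {uri!r}")
    return FeatureURI(extractor, version, model)


feature = parse_feature_uri("mixpeek://video_extractor@v1/bytedance_lance_3b_v1")
print(feature.extractor, feature.version, feature.model)
```

Validating the URI up front like this can catch typos before a pipeline run is submitted.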

    Overview

    Lance is ByteDance's 3B-parameter unified vision model that handles image understanding, video understanding, image generation, video generation, and image/video editing in a single architecture. It uses a vision tokenizer to convert between continuous pixel space and discrete token space, enabling a shared transformer to reason across both modalities.

    On Mixpeek, Lance serves as a compact video understanding model that can caption, describe, and answer questions about images and video. Because the architecture is unified, a single model can power scene description, visual Q&A, and content analysis pipelines.

    Architecture

    Unified autoregressive transformer with a learned vision tokenizer. 3B parameters. Supports text-to-image, text-to-video, image/video understanding, and editing through a shared token space.
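
To make the idea of a shared token space concrete, here is a toy sketch of discrete tokenization by nearest-codebook lookup (vector quantization). It is a generic illustration and not Lance's actual tokenizer: the 2-D codebook, the patch vectors, and the function names are all made up for the example, and a real learned tokenizer would use a trained encoder, a trained decoder, and a far larger codebook.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, float]

# Hypothetical 4-entry codebook; each index is one discrete "vision token".
CODEBOOK: List[Vector] = [
    (0.0, 0.0),
    (0.0, 1.0),
    (1.0, 0.0),
    (1.0, 1.0),
]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode(patches: Sequence[Vector]) -> List[int]:
    """Map each continuous patch vector to the index of its nearest code."""
    tokens = []
    for patch in patches:
        distances = [squared_distance(patch, code) for code in CODEBOOK]
        tokens.append(distances.index(min(distances)))
    return tokens


def decode(tokens: Sequence[int]) -> List[Vector]:
    """Map token indices back to (approximate) continuous vectors."""
    return [CODEBOOK[t] for t in tokens]


patches = [(0.1, 0.9), (0.8, 0.2), (0.9, 0.95)]
tokens = encode(patches)
reconstructed = decode(tokens)
print(tokens)         # [1, 2, 3]
print(reconstructed)  # [(0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
```

The integer tokens are the kind of input a transformer can consume alongside text tokens, and decoding maps generated tokens back to approximate continuous values.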

    Mixpeek SDK Integration

    from mixpeek import Mixpeek

    # Create a client with your API key.
    mixpeek = Mixpeek(api_key="YOUR_API_KEY")

    # Ingest videos from an S3 bucket and caption them with Lance.
    mixpeek.ingest.videos(
        collection="media_library",
        source={"type": "s3", "bucket": "video-assets"},
        pipeline={
            "captioning": {
                "model": "mixpeek://video_extractor@v1/bytedance_lance_3b_v1"
            }
        },
    )

    Capabilities

    • Unified image and video understanding in one model
    • Scene description and visual Q&A for both images and video
    • Compact 3B parameter count suitable for GPU-constrained deployments
    • Multi-task capability reduces pipeline complexity

    Use Cases on Mixpeek

    • Video content analysis and scene captioning pipelines
    • Unified image+video understanding without separate models
    • Content moderation across images and video
    • Compact deployment for visual Q&A at scale

    Benchmarks

    Dataset            Metric    Score  Source
    Video-MME          Accuracy  62.1   Model card
    MMMU-Pro (vision)  Score     38.4   Model card

    Performance

    Input Size: Variable
    GPU Latency: ~45 ms per frame (A100)
    GPU Throughput: ~120 frames/sec (A100)
    GPU Memory: ~6 GB
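
The latency and throughput figures only agree if several frames are processed concurrently, and the memory figure is close to what 16-bit weights alone would need. Below is a back-of-the-envelope sketch. It assumes the listed numbers are steady-state values, that the model runs in 16-bit precision (2 bytes per parameter), and that Little's law applies to frames in flight. None of these assumptions is stated in the source.

```python
LATENCY_S = 0.045        # ~45 ms per frame, from the table
THROUGHPUT_FPS = 120.0   # ~120 frames/sec, from the table
PARAMS = 3e9             # 3B parameters
BYTES_PER_PARAM = 2      # assumption: fp16/bf16 weights


def sequential_fps(latency_s: float) -> float:
    """Throughput if frames were processed strictly one at a time."""
    return 1.0 / latency_s


def frames_in_flight(throughput_fps: float, latency_s: float) -> float:
    """Little's law: average concurrency = throughput x latency."""
    return throughput_fps * latency_s


def weight_memory_gb(params: float, bytes_per_param: int) -> float:
    """Memory for the weights only, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


print(f"{sequential_fps(LATENCY_S):.1f} frames/sec one at a time")             # 22.2
print(f"{frames_in_flight(THROUGHPUT_FPS, LATENCY_S):.1f} frames in flight")   # 5.4
print(f"{weight_memory_gb(PARAMS, BYTES_PER_PARAM):.1f} GB for weights")       # 6.0
```

Under these assumptions, ~120 frames/sec implies roughly five to six frames batched or pipelined at once, and ~6 GB corresponds to the weights alone, so activations and caches would likely add to it.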

    Specification

    Framework: HF
    Organization: bytedance-research
    Feature: Scene Captioning
    Output: text
    Modalities: video, image
    Retriever: Semantic Search
    Parameters: 3B
    License: Apache 2.0
    Downloads/mo: 32K


    Build a pipeline with Lance

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
