    HF · Scene Captioning · Apache 2.0

    Vidi-7B

    by bytedance-research

    Hour-long video temporal retrieval — find any moment by text query

    48K downloads/month
    7B parameters
    Identifiers
    Model ID
    bytedance-research/Vidi-7B
    Feature URI
    mixpeek://image_extractor@v1/bytedance_vidi_7b_v1

    Overview

    Vidi 2.5 is ByteDance's video language model optimized for temporal retrieval, spatio-temporal grounding, and video question answering over hour-long videos. Unlike feature extraction models that produce per-frame embeddings, Vidi understands temporal relationships — it can find the time range where a specific event occurs, ground objects across frames, and answer questions that require reasoning over long video sequences.

    The 7B model handles videos up to 60+ minutes, making it suitable for full meeting recordings, lecture videos, surveillance feeds, and broadcast content. On Mixpeek, Vidi powers temporal search queries like 'find the moment where the presenter shows the revenue slide' across video libraries.

    Architecture

    Vidi-7B is a 7B-parameter vision-language model with a temporal-aware video encoder. It processes variable-length video using hierarchical frame sampling and supports three output modes: temporal retrieval (time ranges), spatio-temporal grounding (bounding boxes across frames), and generative QA.
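The page does not describe how the hierarchical frame sampling works, so the following is only a rough sketch of the general idea under a stated assumption: a sparse uniform pass over the whole video combined with a denser pass inside one window of interest. The function names, the two-level scheme, and all parameters are hypothetical and are not taken from Vidi or the Mixpeek SDK.

```typescript
// Illustrative only: Vidi's actual sampling scheme is not documented here.
// Assumption: level 1 samples the whole video sparsely, level 2 samples
// one window of interest densely.

// Return `count` timestamps, one at the midpoint of each equal segment.
function uniformTimestamps(startSec: number, endSec: number, count: number): number[] {
  if (count <= 0 || endSec <= startSec) return [];
  const step = (endSec - startSec) / count;
  return Array.from({ length: count }, (_, i) => startSec + step * (i + 0.5));
}

function hierarchicalTimestamps(
  durationSec: number,
  coarseCount: number,
  window: { startSec: number; endSec: number },
  fineCount: number,
): number[] {
  const coarse = uniformTimestamps(0, durationSec, coarseCount);
  const fine = uniformTimestamps(window.startSec, window.endSec, fineCount);
  // Merge both levels, drop duplicates, and sort chronologically.
  return Array.from(new Set([...coarse, ...fine])).sort((a, b) => a - b);
}

// A 60-minute video: 6 coarse frames overall, 4 fine frames in minute 10-11.
const ts = hierarchicalTimestamps(3600, 6, { startSec: 600, endSec: 660 }, 4);
console.log(ts);
```

The point of the two levels is that frame count stays small for long videos while a region of interest still gets dense coverage.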

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";
    const mx = new Mixpeek({ apiKey: "API_KEY" });
    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/lecture.mp4" },
      feature_extractors: [{
        name: "scene_caption",
        version: "v1",
        params: {
          model_id: "bytedance-research/Vidi-7B",
          enable_temporal_grounding: true
        }
      }]
    });
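Temporal retrieval returns time ranges, and a client will usually want to display them as timestamps. The page does not show the response format, so the `TimeRange` shape below (seconds from the start of the video) is a hypothetical assumption, as are the helper names; this is a minimal sketch, not part of the Mixpeek SDK.

```typescript
// Hypothetical shape: the real API response format is not shown on this page.
interface TimeRange {
  startSec: number;
  endSec: number;
}

// Format a number of seconds as HH:MM:SS, truncating fractional seconds.
function formatTimestamp(totalSec: number): string {
  const s = Math.max(0, Math.floor(totalSec));
  const hh = Math.floor(s / 3600);
  const mm = Math.floor((s % 3600) / 60);
  const ss = s % 60;
  const pad = (n: number): string => (n < 10 ? "0" + n : String(n));
  return `${pad(hh)}:${pad(mm)}:${pad(ss)}`;
}

function formatRange(r: TimeRange): string {
  return `${formatTimestamp(r.startSec)} - ${formatTimestamp(r.endSec)}`;
}

console.log(formatRange({ startSec: 1325.4, endSec: 1391 })); // prints "00:22:05 - 00:23:11"
```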

    Capabilities

    • Temporal retrieval: find time ranges matching text queries
    • Spatio-temporal grounding: track objects across video frames
    • Hour-long video understanding (60+ minutes)
    • Video QA with temporal reasoning
    • Apache 2.0 license

    Use Cases on Mixpeek

    • Temporal search: find specific moments in meeting recordings
    • Surveillance video retrieval: locate events by description
    • Lecture video indexing: jump to the exact timestamp of any concept
    • Broadcast content analysis: find highlights and key moments

    Benchmarks

    Dataset | Metric | Score | Source
    Video-MME (long) | Accuracy | 64.2% | ByteDance, 2026 (Model Card)

    Performance

    Input Size: Up to 60+ minutes of video (hierarchical sampling)
    GPU Latency: ~2.1 s per minute of video (A100)
    GPU Throughput: ~28 minutes of video per minute (A100)
    GPU Memory: ~15 GB
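The latency and throughput figures are two views of the same number, and they can be used for rough capacity planning. The sketch below assumes latency scales linearly with video length on a single A100, which the page implies but does not state; the constant and function names are illustrative.

```typescript
// Approximate A100 latency quoted above: ~2.1 s of compute per minute of video.
const SECONDS_PER_VIDEO_MINUTE = 2.1;

// Estimated wall-clock seconds to process a video, assuming linear scaling.
function estimateProcessingSeconds(videoMinutes: number): number {
  return videoMinutes * SECONDS_PER_VIDEO_MINUTE;
}

// Minutes of video processed per minute of wall-clock time.
function throughputVideoMinutesPerMinute(): number {
  return 60 / SECONDS_PER_VIDEO_MINUTE;
}

const oneHourLecture = estimateProcessingSeconds(60);
console.log(`60-minute video: ~${oneHourLecture.toFixed(0)} s`);
console.log(`Throughput: ~${throughputVideoMinutesPerMinute().toFixed(1)} video min/min`);
```

Under this assumption a 60-minute video takes roughly 126 seconds, and 60 / 2.1 gives about 28.6 minutes of video per minute, in line with the ~28 figure quoted.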

    Specification

    Framework: HF
    Organization: bytedance-research
    Feature: Scene Captioning
    Output: text
    Modalities: video, image
    Retriever: Semantic Search
    Parameters: 7B
    License: Apache 2.0
    Downloads/mo: 48K

    Research Paper

    Vidi: Large Vision-Language Models for Video

    arxiv.org

    Build a pipeline with Vidi-7B

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
