Vidi-7B

by bytedance-research

Hour-long video temporal retrieval: find any moment by text query

48K downloads/month

7B parameters


Identifiers

Model ID

bytedance-research/Vidi-7B

Feature URI

mixpeek://image_extractor@v1/bytedance_vidi_7b_v1

Overview

Vidi 2.5 is ByteDance's video language model optimized for temporal retrieval, spatio-temporal grounding, and video question answering over hour-long videos. Unlike feature extraction models that produce per-frame embeddings, Vidi understands temporal relationships: it can find the time range where a specific event occurs, ground objects across frames, and answer questions that require reasoning over long video sequences.

The 7B model handles videos of 60 minutes or more, making it suitable for full meeting recordings, lecture videos, surveillance feeds, and broadcast content. On Mixpeek, Vidi powers temporal search queries such as "find the moment where the presenter shows the revenue slide" across video libraries.
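Because a temporal search returns a time range rather than a single frame, results for hour-long videos are easier to read as HH:MM:SS timestamps. The sketch below shows one way to do that formatting. The `TimeRange` shape and both helper names are hypothetical: the source does not specify the format of a Mixpeek response.

```typescript
// Hypothetical result shape; the actual Mixpeek response format is not given in the source.
interface TimeRange {
  startSec: number; // start of the matched moment, in seconds
  endSec: number;   // end of the matched moment, in seconds
}

// Zero-pad a non-negative integer to at least two digits.
function pad2(n: number): string {
  return n < 10 ? "0" + n : String(n);
}

// Format a second count as HH:MM:SS (fractions of a second are dropped).
function formatTimestamp(totalSeconds: number): string {
  const s = Math.max(0, Math.floor(totalSeconds));
  const hours = Math.floor(s / 3600);
  const minutes = Math.floor((s % 3600) / 60);
  const seconds = s % 60;
  return pad2(hours) + ":" + pad2(minutes) + ":" + pad2(seconds);
}

// Render a matched range as "start - end".
function formatRange(range: TimeRange): string {
  return formatTimestamp(range.startSec) + " - " + formatTimestamp(range.endSec);
}
```

For example, a match spanning 754.2 to 791.9 seconds would be rendered as `00:12:34 - 00:13:11`.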

Architecture

A vision-language model (7B parameters) with a temporal-aware video encoder. It processes variable-length video using hierarchical frame sampling and supports three kinds of output: temporal retrieval (a time range), spatio-temporal grounding (bounding boxes across frames), and generative QA.
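Client code that consumes these three output kinds could model them as a discriminated union. This is a sketch only: every type and field name below is an assumption made for illustration, since the source names the output kinds but does not define a schema.

```typescript
// All type and field names here are hypothetical; the source does not define a schema.
interface TemporalRetrievalOutput {
  task: "temporal_retrieval";
  startSec: number; // start of the matched time range, in seconds
  endSec: number;   // end of the matched time range, in seconds
}

interface Box {
  x: number;
  y: number;
  width: number;
  height: number;
}

interface GroundingOutput {
  task: "spatio_temporal_grounding";
  // One bounding box per timestamp at which the object was located.
  track: { timeSec: number; box: Box }[];
}

interface QaOutput {
  task: "video_qa";
  answer: string; // generated free-text answer
}

type VideoModelOutput = TemporalRetrievalOutput | GroundingOutput | QaOutput;

// The `task` tag lets the compiler narrow the type inside each branch.
function summarize(out: VideoModelOutput): string {
  switch (out.task) {
    case "temporal_retrieval":
      return "match from " + out.startSec + "s to " + out.endSec + "s";
    case "spatio_temporal_grounding":
      return "object located at " + out.track.length + " timestamps";
    case "video_qa":
      return "answer: " + out.answer;
  }
}
```

Tagging each variant with a literal `task` field keeps downstream handling exhaustive: adding a fourth output kind to the union would make the compiler flag any `switch` that does not handle it.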

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

// Replace "API_KEY" with your Mixpeek API key
const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "scene_caption",
    version: "v1",
    parameters: { model_id: "bytedance-research/Vidi-7B" },
  },
});