Vidi-7B
by bytedance-research
Hour-long video temporal retrieval — find any moment by text query
bytedance-research/Vidi-7B
mixpeek://image_extractor@v1/bytedance_vidi_7b_v1
Overview
Vidi 2.5 is ByteDance's video language model optimized for temporal retrieval, spatio-temporal grounding, and video question answering over hour-long videos. Unlike feature extraction models that produce per-frame embeddings, Vidi understands temporal relationships — it can find the time range where a specific event occurs, ground objects across frames, and answer questions that require reasoning over long video sequences.
The 7B model handles videos up to 60+ minutes, making it suitable for full meeting recordings, lecture videos, surveillance feeds, and broadcast content. On Mixpeek, Vidi powers temporal search queries like 'find the moment where the presenter shows the revenue slide' across video libraries.
Architecture
Vidi-7B is a 7B-parameter vision-language model with a temporally aware video encoder. It processes variable-length video using hierarchical frame sampling and supports three output modes: temporal retrieval (a time range), spatio-temporal grounding (bounding boxes across frames), and generative QA.
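The three output modes described above can be modeled as a discriminated union on the client side. The following is a minimal sketch, not Mixpeek's or Vidi's actual response schema: every type and field name here (`TimeRange`, `FrameBox`, `VidiResult`, `kind`, `startSec`, and so on) is a hypothetical chosen for illustration.

```typescript
// Hypothetical result shapes. Field names are illustrative assumptions,
// not the schema returned by Mixpeek or by the Vidi model itself.

/** A time span in seconds from the start of the video. */
interface TimeRange {
  startSec: number;
  endSec: number;
}

/** A bounding box in normalized [0, 1] image coordinates at one timestamp. */
interface FrameBox {
  timeSec: number;
  x: number;
  y: number;
  width: number;
  height: number;
}

/** One variant per output mode: time ranges, boxes across frames, or a text answer. */
type VidiResult =
  | { kind: "temporal_retrieval"; query: string; ranges: TimeRange[] }
  | { kind: "grounding"; label: string; boxes: FrameBox[] }
  | { kind: "qa"; question: string; answer: string };

/** Format seconds as HH:MM:SS so positions in hour-long videos stay readable. */
function formatTimestamp(totalSec: number): string {
  const s = Math.max(0, Math.floor(totalSec));
  const pad = (n: number): string => (n < 10 ? "0" + n : String(n));
  const hours = Math.floor(s / 3600);
  const minutes = Math.floor((s % 3600) / 60);
  return `${pad(hours)}:${pad(minutes)}:${pad(s % 60)}`;
}

/** Produce a one-line summary for any of the three result kinds. */
function describeResult(result: VidiResult): string {
  switch (result.kind) {
    case "temporal_retrieval": {
      const spans = result.ranges
        .map((r) => `${formatTimestamp(r.startSec)}-${formatTimestamp(r.endSec)}`)
        .join(", ");
      return `"${result.query}" matched: ${spans || "no ranges"}`;
    }
    case "grounding":
      return `"${result.label}" located in ${result.boxes.length} frame(s)`;
    case "qa":
      return `Q: ${result.question} A: ${result.answer}`;
  }
}
```

Tagging each result with a `kind` field lets the compiler check that every output mode is handled, which is useful when a single pipeline mixes retrieval, grounding, and QA calls.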
Mixpeek SDK Integration
    import { Mixpeek } from "mixpeek";

    const mx = new Mixpeek({ apiKey: "API_KEY" });

    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/lecture.mp4" },
      feature_extractors: [
        {
          name: "scene_caption",
          version: "v1",
          params: {
            model_id: "bytedance-research/Vidi-7B",
            enable_temporal_grounding: true
          }
        }
      ]
    });
Capabilities
- Temporal retrieval: find time ranges matching text queries
- Spatio-temporal grounding: track objects across video frames
- Hour-long video understanding (60+ minutes)
- Video QA with temporal reasoning
- Apache 2.0 license
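Because temporal retrieval returns time ranges, downstream code often needs to compare or consolidate them. The sketch below shows two common helpers: temporal intersection-over-union (a metric commonly used to score a predicted range against a reference range) and merging of overlapping or nearly adjacent ranges. The `TimeRange` shape and the `gapSec` parameter are assumptions made for this example, not part of the Mixpeek SDK or the Vidi output format.

```typescript
// Hypothetical range shape, in seconds from the start of the video.
interface TimeRange {
  startSec: number;
  endSec: number;
}

/** Temporal intersection-over-union of two ranges; 0 when they do not overlap. */
function temporalIoU(a: TimeRange, b: TimeRange): number {
  const intersection = Math.max(
    0,
    Math.min(a.endSec, b.endSec) - Math.max(a.startSec, b.startSec)
  );
  const union = (a.endSec - a.startSec) + (b.endSec - b.startSec) - intersection;
  return union > 0 ? intersection / union : 0;
}

/** Merge ranges that overlap or start within `gapSec` of the previous range's end. */
function mergeRanges(ranges: TimeRange[], gapSec: number = 0): TimeRange[] {
  // Sort a copy by start time so the input array is not mutated.
  const sorted = ranges.slice().sort((p, q) => p.startSec - q.startSec);
  const merged: TimeRange[] = [];
  for (const r of sorted) {
    const last = merged[merged.length - 1];
    if (last !== undefined && r.startSec <= last.endSec + gapSec) {
      // Extend the previous range instead of starting a new one.
      last.endSec = Math.max(last.endSec, r.endSec);
    } else {
      merged.push({ startSec: r.startSec, endSec: r.endSec });
    }
  }
  return merged;
}
```

For example, `mergeRanges([{ startSec: 10, endSec: 20 }, { startSec: 0, endSec: 5 }, { startSec: 18, endSec: 25 }])` yields two ranges, 0 to 5 and 10 to 25. Passing a `gapSec` of 5 collapses them into a single range from 0 to 25.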
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| Video-MME (long) | Accuracy | 64.2% | ByteDance, 2026 — Model Card |
Research Paper
Vidi: Large Vision-Language Models for Video
arxiv.org
Build a pipeline with Vidi-7B
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.