videoprism-large-f8r288
by google
Foundational video encoder that achieves SOTA on 31 of 33 video understanding benchmarks
google/videoprism-large-f8r288
mixpeek://video_extractor@v1/google_videoprism_large_v1
Overview
VideoPrism is Google's foundational video encoder designed specifically for video understanding tasks. Unlike frame-sampling approaches that treat video as a bag of images, VideoPrism uses a factorized ViViT architecture with dedicated temporal attention that captures motion, action progression, and temporal relationships between frames.
On Mixpeek, VideoPrism provides the strongest available video features for action recognition, temporal grounding, and video classification. Its frozen features (no fine-tuning needed) outperform task-specific models on most benchmarks, making it a universal video backbone.
Architecture
VideoPrism uses a ViViT (Video Vision Transformer) with factorized spatial and temporal attention on a ViT-L backbone (~310M parameters). It was pretrained on 36M video-caption pairs plus 582M video clips. This variant processes 8 frames at 288px resolution and produces per-frame and video-level feature vectors.
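To make the input size and the benefit of factorized attention concrete, the following sketch counts tokens and attention pairs for an 8-frame, 288px clip. The patch size of 18 is an assumption for illustration (the text above does not state it), and the function names are hypothetical. The comparison is a rough count for one spatial pass plus one temporal pass, not a measurement of the actual model.

```typescript
// Sketch only: patchSize = 18 is an assumed value, not taken from the model card.
interface TokenGrid {
  patchesPerSide: number;
  spatialTokensPerFrame: number;
  totalTokens: number;
}

// Number of patch tokens produced by splitting each frame into square patches.
function tokenGrid(frames: number, resolution: number, patchSize: number): TokenGrid {
  if (resolution % patchSize !== 0) {
    throw new Error("resolution must be divisible by patchSize");
  }
  const patchesPerSide = resolution / patchSize;
  const spatialTokensPerFrame = patchesPerSide * patchesPerSide;
  return {
    patchesPerSide,
    spatialTokensPerFrame,
    totalTokens: frames * spatialTokensPerFrame,
  };
}

// Joint attention: every token attends to every other token in the clip.
function jointAttentionPairs(frames: number, spatialTokens: number): number {
  const n = frames * spatialTokens;
  return n * n;
}

// Factorized attention: tokens attend within their own frame (spatial pass),
// then each spatial position attends across frames (temporal pass).
function factorizedAttentionPairs(frames: number, spatialTokens: number): number {
  const spatial = frames * spatialTokens * spatialTokens;
  const temporal = spatialTokens * frames * frames;
  return spatial + temporal;
}

const grid = tokenGrid(8, 288, 18);
console.log(grid);
console.log("joint pairs:", jointAttentionPairs(8, grid.spatialTokensPerFrame));
console.log("factorized pairs:", factorizedAttentionPairs(8, grid.spatialTokensPerFrame));
```

Under these assumptions the clip yields 2,048 tokens, and the factorized scheme scores roughly an eighth as many query-key pairs as joint attention over the same tokens.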
Mixpeek SDK Integration
    import { Mixpeek } from "mixpeek";

    const mx = new Mixpeek({ apiKey: "API_KEY" });

    await mx.collections.ingest({
      collection_id: "video-archive",
      source: { url: "https://example.com/training-video.mp4" },
      feature_extractors: [
        {
          feature: "video_embedding",
          model: "google/videoprism-large-f8r288"
        }
      ]
    });
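When ingesting many videos, it can help to build the request object in one place. The sketch below mirrors the field names from the snippet above (`collection_id`, `source.url`, `feature_extractors`); the `IngestRequest` type and `buildVideoIngestRequest` helper are hypothetical names introduced here and are not part of the Mixpeek SDK.

```typescript
// Field names are copied from the SDK snippet; the type and helper names are
// hypothetical and exist only for this illustration.
interface FeatureExtractor {
  feature: string;
  model: string;
}

interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: FeatureExtractor[];
}

const VIDEOPRISM_MODEL = "google/videoprism-large-f8r288";

function buildVideoIngestRequest(collectionId: string, videoUrl: string): IngestRequest {
  // new URL() throws a TypeError if the string is not a valid absolute URL.
  const parsed = new URL(videoUrl);
  if (parsed.protocol !== "https:" && parsed.protocol !== "http:") {
    throw new Error(`unsupported URL scheme: ${parsed.protocol}`);
  }
  return {
    collection_id: collectionId,
    source: { url: videoUrl },
    feature_extractors: [{ feature: "video_embedding", model: VIDEOPRISM_MODEL }],
  };
}

const request = buildVideoIngestRequest(
  "video-archive",
  "https://example.com/training-video.mp4"
);
console.log(JSON.stringify(request, null, 2));
```

The resulting object has the same shape as the argument passed to `mx.collections.ingest` in the snippet above.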
Capabilities
- SOTA on 31 of 33 video understanding benchmarks with frozen features
- Factorized temporal attention captures motion and action dynamics
- Zero-shot video classification without fine-tuning
- Trained on 36M video-caption pairs + 582M video clips
- Apache 2.0 license for commercial use
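Zero-shot classification with embeddings is commonly done by comparing a video embedding against one embedding per candidate label and picking the closest. The sketch below shows that comparison with cosine similarity. It assumes the video and label embeddings already exist in a shared space; the three-dimensional vectors are toy values, and the function names are hypothetical rather than part of any SDK.

```typescript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must be non-empty and the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error("cannot compare a zero vector");
  }
  return dot / Math.sqrt(normA * normB);
}

interface LabelScore {
  label: string;
  score: number;
}

// Score every candidate label against the video embedding, best match first.
function rankLabels(video: number[], labels: Record<string, number[]>): LabelScore[] {
  return Object.entries(labels)
    .map(([label, embedding]) => ({ label, score: cosineSimilarity(video, embedding) }))
    .sort((x, y) => y.score - x.score);
}

// Toy embeddings for illustration only; real embeddings have far more dimensions.
const videoEmbedding = [0.9, 0.1, 0.0];
const labelEmbeddings: Record<string, number[]> = {
  cooking: [1, 0, 0],
  running: [0, 1, 0],
  swimming: [0, 0, 1],
};

const ranked = rankLabels(videoEmbedding, labelEmbeddings);
console.log(ranked[0].label); // "cooking"
```

Because no classifier is trained, adding a new category only requires a new label embedding.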
Use Cases on Mixpeek
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| Kinetics-400 | Top-1 Accuracy | 87.2% | Zhao et al., 2024 — arxiv:2402.13217 |
| Moments in Time | Top-1 Accuracy | 45.1% | Zhao et al., 2024 — arxiv:2402.13217 |
| Something-Something v2 | Top-1 Accuracy | 68.8% | Zhao et al., 2024 — arxiv:2402.13217 |
Performance
Specification
Research Paper
VideoPrism: A Foundational Visual Encoder for Video Understanding
arxiv.org
Build a pipeline with videoprism-large-f8r288
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.