videoprism-large-f8r288

by google

Foundational video encoder that achieves SOTA on 31 of 33 video understanding benchmarks

3K downloads/month

20 likes

~310M params

Identifiers

Model ID

google/videoprism-large-f8r288

Feature URI

mixpeek://video_extractor@v1/google_videoprism_large_v1

Overview

VideoPrism is Google's foundational video encoder designed specifically for video understanding tasks. Unlike frame-sampling approaches that treat video as a bag of images, VideoPrism uses a factorized ViViT architecture with dedicated temporal attention that captures motion, action progression, and temporal relationships between frames.

On Mixpeek, VideoPrism provides the strongest available video features for action recognition, temporal grounding, and video classification. Its frozen features (no fine-tuning needed) outperform task-specific models on most benchmarks, making it a universal video backbone.

Architecture

VideoPrism uses a ViViT (Video Vision Transformer) with factorized spatial-temporal attention on a ViT-L backbone (~310M params). It was trained on 36M video-caption pairs plus 582M video clips. This variant processes 8 frames at 288px resolution and produces both per-frame and video-level feature vectors.
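Because the model sees only 8 frames per clip, which frames are chosen matters. The sketch below shows one common way to pick them: take the centre frame of each of 8 equal-length segments. This is an illustration only. The page does not say how Mixpeek or VideoPrism samples frames, and `sampleFrameIndices` is a hypothetical helper, not part of any SDK.

```typescript
// Hypothetical helper (not a Mixpeek or VideoPrism API): choose `numFrames`
// evenly spaced frame indices from a clip with `totalFrames` decoded frames.
// Assumption: uniform sampling; the actual pipeline may sample differently.
function sampleFrameIndices(totalFrames: number, numFrames: number = 8): number[] {
  if (!Number.isInteger(totalFrames) || totalFrames <= 0) {
    throw new RangeError("totalFrames must be a positive integer");
  }
  if (!Number.isInteger(numFrames) || numFrames <= 0) {
    throw new RangeError("numFrames must be a positive integer");
  }

  const indices: number[] = [];
  for (let i = 0; i < numFrames; i++) {
    // Centre of the i-th of `numFrames` equal-length segments.
    const position = ((i + 0.5) * totalFrames) / numFrames;
    // Clamp so the index never runs past the last frame.
    indices.push(Math.min(totalFrames - 1, Math.floor(position)));
  }
  return indices;
}

// Example: a 240-frame clip (10 seconds at 24 fps) reduced to 8 frames.
const frameIndices = sampleFrameIndices(240);
console.log(frameIndices);
```

When a clip has fewer frames than requested, this sketch repeats indices rather than failing, so the output always has exactly 8 entries.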

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "video_embedding",
    version: "v1",
    parameters: { model_id: "google/videoprism-large-f8r288" },
  },
});

Capabilities

SOTA on 31 of 33 video understanding benchmarks with frozen features
Factorized temporal attention captures motion and action dynamics
Zero-shot video classification without fine-tuning
Trained on 36M video-caption pairs + 582M video clips
Apache 2.0 license for commercial use

Use Cases on Mixpeek

Action recognition: identify activities in surveillance, sports, or training videos

Video classification: categorize content by genre, topic, or activity type

Temporal grounding: locate specific actions or events within long videos

Video similarity: find visually similar video segments across archives
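The video similarity use case comes down to comparing feature vectors. A minimal sketch, assuming you already hold video-level embeddings as plain number arrays, ranks segments by cosine similarity to a query embedding. The `Segment` type, `cosineSimilarity`, and `rankBySimilarity` are hypothetical names for illustration, and the 3-dimensional vectors are toy values, not real VideoPrism output. In a managed setup this search would normally run inside Mixpeek rather than in client code.

```typescript
// Hypothetical shape for a stored segment; not a Mixpeek SDK type.
type Segment = { id: string; embedding: number[] };

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error("vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat it as dissimilar to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the `topK` segments most similar to the query, best first.
function rankBySimilarity(
  query: number[],
  segments: Segment[],
  topK: number = 5,
): { id: string; score: number }[] {
  return segments
    .map((s) => ({ id: s.id, score: cosineSimilarity(query, s.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

// Toy data standing in for real embeddings.
const archive: Segment[] = [
  { id: "clip-a", embedding: [1, 0, 0] },
  { id: "clip-b", embedding: [0, 1, 0] },
  { id: "clip-c", embedding: [1, 1, 0] },
];
console.log(rankBySimilarity([1, 0, 0], archive, 2));
```

A brute-force scan like this is fine for a few thousand segments; larger archives typically rely on an approximate nearest-neighbor index instead.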

Benchmarks

Dataset	Metric	Score	Source
Kinetics-400	Top-1 Accuracy	87.2%	Zhao et al., 2024 (arXiv:2402.13217)
Moments in Time	Top-1 Accuracy	45.1%	Zhao et al., 2024 (arXiv:2402.13217)
Something-Something v2	Top-1 Accuracy	68.8%	Zhao et al., 2024 (arXiv:2402.13217)