    HF · Visual Embeddings · Apache 2.0

    videoprism-large-f8r288

    by google

    Foundational video encoder that achieves SOTA on 31 of 33 video understanding benchmarks

    22K downloads/month
    ~310M params
    Identifiers

    Model ID: google/videoprism-large-f8r288
    Feature URI: mixpeek://video_extractor@v1/google_videoprism_large_v1

    Overview

    VideoPrism is Google's foundational video encoder designed specifically for video understanding tasks. Unlike frame-sampling approaches that treat video as a bag of images, VideoPrism uses a factorized ViViT architecture with dedicated temporal attention that captures motion, action progression, and temporal relationships between frames.
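To make the benefit of factorized attention concrete, the sketch below counts how many query-key pairs are scored when all tokens attend jointly versus when attention is split into a per-frame spatial pass and a per-position temporal pass. It is back-of-the-envelope arithmetic, not VideoPrism's implementation; the 18 px patch size is an assumption that this page does not state.

```typescript
// T = frames per clip, N = spatial tokens per frame.

// Joint space-time attention: every token attends to every other token.
function jointAttentionPairs(T: number, N: number): number {
  const tokens = T * N;
  return tokens * tokens;
}

// Factorized attention: a spatial pass followed by a temporal pass.
function factorizedAttentionPairs(T: number, N: number): number {
  // Spatial pass: each of the T frames attends within its own N tokens.
  const spatial = T * N * N;
  // Temporal pass: each of the N spatial positions attends across T frames.
  const temporal = N * T * T;
  return spatial + temporal;
}

const numFrames = 8;      // from the model name (f8)
const resolution = 288;   // from the model name (r288)
const patchSize = 18;     // ASSUMPTION for illustration only
const tokensPerFrame = (resolution / patchSize) * (resolution / patchSize); // 16 x 16 = 256

console.log(jointAttentionPairs(numFrames, tokensPerFrame));      // 4194304
console.log(factorizedAttentionPairs(numFrames, tokensPerFrame)); // 540672
```

Under these assumptions the factorized layout scores roughly an eighth as many pairs, while the temporal pass still links every frame to every other frame.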

    On Mixpeek, VideoPrism provides the strongest available video features for action recognition, temporal grounding, and video classification. Its frozen features (no fine-tuning needed) outperform task-specific models on most benchmarks, making it a universal video backbone.

    Architecture

    ViViT (Video Vision Transformer) with factorized spatial-temporal attention. ViT-L backbone (~310M params). Trained on 36M video-caption pairs + 582M video clips. Processes 8 frames at 288px resolution. Produces per-frame and video-level feature vectors.
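Because the encoder consumes a fixed 8 frames per clip, longer clips have to be subsampled before encoding. The sketch below picks evenly spaced frame indices; it illustrates one common strategy and is an assumption, since this page does not document how Mixpeek or VideoPrism actually selects frames. The function name is hypothetical.

```typescript
// Pick `numFrames` evenly spaced frame indices from a clip of `totalFrames`.
// Hypothetical helper: the real preprocessing may sample differently.
function sampleFrameIndices(totalFrames: number, numFrames: number = 8): number[] {
  if (totalFrames <= 0 || numFrames <= 0) {
    throw new RangeError("totalFrames and numFrames must be positive");
  }
  const indices: number[] = [];
  for (let i = 0; i < numFrames; i++) {
    // Take the centre of each of `numFrames` equal-length segments.
    const idx = Math.floor(((i + 0.5) * totalFrames) / numFrames);
    indices.push(Math.min(idx, totalFrames - 1));
  }
  return indices;
}

// A 240-frame clip (for example 8 s at 30 fps) yields one index per second.
console.log(sampleFrameIndices(240)); // [15, 45, 75, 105, 135, 165, 195, 225]
```

Clips shorter than 8 frames repeat indices rather than failing, which keeps the input shape fixed.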

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    const mx = new Mixpeek({ apiKey: "API_KEY" });

    await mx.collections.ingest({
      collection_id: "video-archive",
      source: { url: "https://example.com/training-video.mp4" },
      feature_extractors: [{
        feature: "video_embedding",
        model: "google/videoprism-large-f8r288"
      }]
    });
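When ingesting many videos, it can help to build the request objects separately from the call that sends them. The sketch below only constructs plain objects with the same field names as the ingest call above; `IngestRequest` and `buildIngestRequests` are hypothetical names introduced here and are not part of the Mixpeek SDK.

```typescript
// Request shape mirrored from the fields used in the ingest example above.
// This interface is an assumption, not a type exported by the SDK.
interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: { feature: string; model: string }[];
}

const VIDEOPRISM_MODEL = "google/videoprism-large-f8r288";

// Hypothetical helper: builds one request per video URL.
function buildIngestRequests(collectionId: string, urls: string[]): IngestRequest[] {
  return urls.map((url) => ({
    collection_id: collectionId,
    source: { url: url },
    feature_extractors: [{ feature: "video_embedding", model: VIDEOPRISM_MODEL }],
  }));
}

const requests = buildIngestRequests("video-archive", [
  "https://example.com/training-video.mp4",
  "https://example.com/safety-briefing.mp4",
]);
console.log(requests.length); // 2
```

Each resulting object could then be passed to `mx.collections.ingest` in a loop, assuming the SDK accepts the same shape as in the example above.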

    Capabilities

    • SOTA on 31 of 33 video understanding benchmarks with frozen features
    • Factorized temporal attention captures motion and action dynamics
    • Zero-shot video classification without fine-tuning
    • Trained on 36M video-caption pairs + 582M video clips
    • Apache 2.0 license for commercial use
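"Frozen features" means the encoder's weights stay fixed and only a small classifier is fitted on top of its embeddings. As a minimal sketch of that idea, the code below fits a nearest-centroid classifier over labelled embedding vectors. This is a few-shot illustration rather than the zero-shot setup or the evaluation protocol from the paper, and the toy 2-dimensional vectors stand in for real embeddings.

```typescript
type Vec = number[];

// Element-wise mean of a non-empty list of equal-length vectors.
function meanVector(vectors: Vec[]): Vec {
  if (vectors.length === 0) throw new Error("need at least one vector");
  const dim = vectors[0].length;
  const out: number[] = [];
  for (let i = 0; i < dim; i++) {
    let sum = 0;
    for (const v of vectors) sum += v[i];
    out.push(sum / vectors.length);
  }
  return out;
}

function squaredDistance(a: Vec, b: Vec): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
  return sum;
}

// "Training" is just averaging the frozen embeddings of each class.
function fitCentroids(examples: Record<string, Vec[]>): Record<string, Vec> {
  const centroids: Record<string, Vec> = {};
  for (const label of Object.keys(examples)) {
    centroids[label] = meanVector(examples[label]);
  }
  return centroids;
}

// Predict the label whose centroid is closest to the embedding.
function classify(embedding: Vec, centroids: Record<string, Vec>): string {
  let bestLabel = "";
  let bestDist = Infinity;
  for (const label of Object.keys(centroids)) {
    const d = squaredDistance(embedding, centroids[label]);
    if (d < bestDist) {
      bestDist = d;
      bestLabel = label;
    }
  }
  return bestLabel;
}
```

With real data, the vectors would be the embeddings returned for labelled clips, and no gradient ever reaches the encoder.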

    Use Cases on Mixpeek

    Action recognition: identify activities in surveillance, sports, or training videos
    Video classification: categorize content by genre, topic, or activity type
    Temporal grounding: locate specific actions or events within long videos
    Video similarity: find visually similar video segments across archives
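Video similarity search reduces to comparing embedding vectors. The sketch below ranks stored segments by cosine similarity to a query embedding using a brute-force scan. It illustrates the underlying comparison only; a managed vector index would normally replace the linear scan, and the `VideoSegment` shape is hypothetical.

```typescript
// Cosine similarity between two equal-length vectors, in [-1, 1].
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Treat a zero vector as dissimilar to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Hypothetical record for a stored video segment and its embedding.
interface VideoSegment {
  id: string;
  embedding: number[];
}

// Return the k segments most similar to the query, best first.
function topK(
  query: number[],
  segments: VideoSegment[],
  k: number
): { id: string; score: number }[] {
  return segments
    .map((s) => ({ id: s.id, score: cosineSimilarity(query, s.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

The same scoring applies to temporal grounding if each segment corresponds to a time window of a longer video: the best-scoring windows mark where the queried action most likely occurs.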

    Benchmarks

    Dataset                  Metric           Score   Source
    Kinetics-400             Top-1 Accuracy   87.2%   Zhao et al., 2024 (arXiv:2402.13217)
    Moments in Time          Top-1 Accuracy   45.1%   Zhao et al., 2024 (arXiv:2402.13217)
    Something-Something v2   Top-1 Accuracy   68.8%   Zhao et al., 2024 (arXiv:2402.13217)

    Performance

    Input Size: 8 frames × 288×288 px
    GPU Latency: ~15 ms / clip (A100)
    GPU Throughput: ~65 clips/sec (A100)
    GPU Memory: ~2.5 GB
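The throughput figure can be turned into a rough capacity estimate for an archive. The sketch below assumes non-overlapping clips of a fixed duration and uses the approximate 65 clips/sec A100 figure listed under Performance. The clip duration is an assumption chosen by the caller, and decoding, I/O, and queueing time are ignored, so real jobs will take longer.

```typescript
// Approximate A100 throughput quoted under Performance.
const CLIPS_PER_SECOND = 65;

// Estimate GPU seconds needed to embed `videoSeconds` of footage when it is
// cut into non-overlapping clips of `clipSeconds` each (an assumption).
function estimateGpuSeconds(
  videoSeconds: number,
  clipSeconds: number,
  clipsPerSecond: number = CLIPS_PER_SECOND
): number {
  if (clipSeconds <= 0 || clipsPerSecond <= 0) {
    throw new RangeError("clipSeconds and clipsPerSecond must be positive");
  }
  const clips = Math.ceil(videoSeconds / clipSeconds);
  return clips / clipsPerSecond;
}

// 130 s of video in 2 s clips is 65 clips, about one GPU second.
console.log(estimateGpuSeconds(130, 2)); // 1
```

The quoted latency and throughput are consistent with each other: at roughly 15 ms per clip, one clip at a time gives about 1 / 0.015 ≈ 67 clips per second.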

    Specification

    Framework: HF
    Organization: google
    Feature: Visual Embeddings
    Output: 768-dim vector
    Modalities: video, image
    Retriever: Vector Search
    Parameters: ~310M
    License: Apache 2.0
    Downloads/mo: 22K
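The output dimension determines how much raw storage a vector index needs. The sketch below takes the 768-dim figure from the specification and assumes 32-bit floats (4 bytes per value), which this page does not state. Index overhead, metadata, and any compression are ignored.

```typescript
// Raw storage for `numVectors` embeddings, excluding index overhead.
// `bytesPerValue = 4` assumes float32 storage (an assumption).
function rawIndexBytes(
  numVectors: number,
  dim: number = 768,
  bytesPerValue: number = 4
): number {
  return numVectors * dim * bytesPerValue;
}

console.log(rawIndexBytes(1));       // 3072 bytes per vector
console.log(rawIndexBytes(1000000)); // 3072000000 bytes, about 3.07 GB
```

If one embedding is stored per clip, multiplying the clip count of an archive by 3072 bytes gives a lower bound on index size under these assumptions.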

    Research Paper

    VideoPrism: A Foundational Visual Encoder for Video Understanding (Zhao et al., 2024, arXiv:2402.13217)

    Build a pipeline with videoprism-large-f8r288

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
