    HF | Visual Embeddings | MIT

    vjepa2-vitl-fpc64-256

    by facebook

    Self-supervised video encoder for retrieval, classification, and VLM perception

    Downloads: 162K/month
    Parameters: ViT-L

    Identifiers

    Model ID: facebook/vjepa2-vitl-fpc64-256
    Feature URI: mixpeek://video_extractor@v1/facebook_vjepa2_vitl_fpc64_256_v1

    Overview

    V-JEPA 2 is Meta FAIR's video representation model trained with a joint embedding predictive architecture. Instead of treating video as independent frames, it learns representations that preserve temporal structure, motion, and object dynamics.

    On Mixpeek, V-JEPA 2 is useful as a video feature extractor before retrieval or classification. It gives agents and search systems a compact representation of what happens over time, not just what appears in a sampled keyframe.

    Architecture

    Vision Transformer video encoder. The ViT-L FPC64 checkpoint samples 64 frames and exposes get_vision_features through Transformers. It can also encode still images by repeating the image across the expected frame dimension.
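The still-image case above can be pictured as tiling one frame along the time axis. The sketch below is a minimal illustration, assuming a flattened `[frames, channels, height, width]` layout in a `Float32Array`; the `repeatImageAsClip` helper and the `Clip` type are hypothetical and are not part of Transformers or the Mixpeek SDK.

```typescript
// Hypothetical clip container; the [frames, channels, height, width] layout
// is an assumption for illustration, not a documented input format.
interface Clip {
  data: Float32Array;
  shape: [number, number, number, number];
}

// Tile one still image into a clip made of identical frames.
function repeatImageAsClip(
  image: Float32Array,
  channels: number,
  height: number,
  width: number,
  numFrames = 64,
): Clip {
  const frameSize = channels * height * width;
  if (image.length !== frameSize) {
    throw new Error(`expected ${frameSize} values, got ${image.length}`);
  }
  const data = new Float32Array(numFrames * frameSize);
  for (let f = 0; f < numFrames; f++) {
    // Copy the same pixels into every frame slot.
    data.set(image, f * frameSize);
  }
  return { data, shape: [numFrames, channels, height, width] };
}

// Usage: a 3-channel 256x256 image becomes a 64-frame clip.
const still = new Float32Array(3 * 256 * 256).fill(0.5);
const clip = repeatImageAsClip(still, 3, 256, 256);
```

Because every frame is identical, the resulting clip carries no motion signal, so the embedding reflects appearance only.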

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    // Replace "API_KEY" with your own key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest one video and extract a V-JEPA 2 embedding for it.
    await mx.collections.ingest({
      collection_id: "video-library",
      source: { url: "https://example.com/training-video.mp4" },
      feature_extractors: [{
        feature: "video_embedding",
        model: "facebook/vjepa2-vitl-fpc64-256"
      }]
    });

    Capabilities

    • Video feature extraction from 64-frame clips
    • Temporal representation for retrieval and classification
    • Can serve as a video encoder for downstream VLMs
    • MIT license
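Since the model consumes 64-frame clips, a longer video has to be reduced to 64 frames before encoding. The sketch below shows one common way to do that, evenly spaced sampling. This is an assumption for illustration: the checkpoint's own preprocessing may sample differently, and `sampleFrameIndices` is a hypothetical helper.

```typescript
// Hypothetical helper: pick `numFrames` evenly spaced frame indices.
// Uniform sampling is an assumption; actual preprocessing may differ.
function sampleFrameIndices(totalFrames: number, numFrames = 64): number[] {
  if (totalFrames < 1 || numFrames < 1) {
    throw new Error("totalFrames and numFrames must be positive");
  }
  const indices: number[] = [];
  for (let i = 0; i < numFrames; i++) {
    // Take the midpoint of each of numFrames equal segments.
    const position = ((i + 0.5) * totalFrames) / numFrames;
    indices.push(Math.min(totalFrames - 1, Math.floor(position)));
  }
  return indices;
}

// Usage: a 640-frame video yields one index per 10-frame segment.
const indices = sampleFrameIndices(640);
```

When the video has fewer frames than requested, indices repeat, which mirrors the still-image case of showing the same frame more than once.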

    Use Cases on Mixpeek

    • Video similarity search across clips with comparable actions or motion
    • Agent perception over camera streams where temporal state matters
    • Pre-filtering long video into candidate clips before VLM captioning
    • Action and activity classification for media archives
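Similarity search over clip embeddings usually comes down to comparing vectors. The sketch below ranks clips by cosine similarity in plain TypeScript; the metric, the `ClipEmbedding` type, and the toy 3-dimensional vectors are assumptions chosen for readability, not a description of how Mixpeek's vector search is implemented.

```typescript
// Hypothetical record pairing a clip id with its embedding.
interface ClipEmbedding {
  id: string;
  vector: number[];
}

// Cosine similarity: dot product divided by the product of the norms.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / Math.sqrt(normA * normB);
}

// Score every clip against the query and keep the best `topK`.
function rankBySimilarity(query: number[], clips: ClipEmbedding[], topK: number) {
  return clips
    .map((c) => ({ id: c.id, score: cosineSimilarity(query, c.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

// Toy vectors for readability; real embeddings are far longer.
const library: ClipEmbedding[] = [
  { id: "pour-coffee", vector: [1, 0, 0] },
  { id: "pour-tea", vector: [0.9, 0.1, 0] },
  { id: "ride-bike", vector: [0, 0, 1] },
];
const top = rankBySimilarity([1, 0, 0], library, 2);
```

A brute-force scan like this is fine for small collections; larger libraries typically rely on an approximate nearest-neighbor index instead.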

    Performance

    Input Size: 64 video frames at 256px
    GPU Latency: ~20 ms / clip (A100, batch dependent)
    GPU Throughput: ~50 clips/sec (A100, batch dependent)
    GPU Memory: ~3 GB
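The throughput figure can be turned into a rough capacity estimate. The sketch below is back-of-the-envelope arithmetic only: the 4-second, non-overlapping clip length is a hypothetical choice, `estimateGpuSeconds` is an illustrative helper, and real numbers will vary with batch size, decoding, and I/O.

```typescript
// Approximate figure from the Performance table; batch dependent.
const CLIPS_PER_SECOND = 50;

// Estimate how many clips a video produces and how long encoding takes.
function estimateGpuSeconds(
  videoSeconds: number,
  clipSeconds: number,
  clipsPerSecond = CLIPS_PER_SECOND,
): { clips: number; gpuSeconds: number } {
  if (clipSeconds <= 0 || clipsPerSecond <= 0) {
    throw new Error("clipSeconds and clipsPerSecond must be positive");
  }
  const clips = Math.ceil(videoSeconds / clipSeconds);
  return { clips, gpuSeconds: clips / clipsPerSecond };
}

// 10 hours of footage split into hypothetical 4-second clips:
// 36000 / 4 = 9000 clips, and 9000 / 50 = 180 GPU-seconds.
const estimate = estimateGpuSeconds(10 * 3600, 4);
```

The estimate covers encoding only and ignores video decoding and network transfer, which often dominate in practice.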

    Use V-JEPA 2 as the video feature stage, then rerank the candidates with captions or transcripts when precision matters.
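The two-stage pattern above can be sketched as a vector score blended with a text signal. Everything here is an assumption for illustration: the `Candidate` shape, the simple keyword-overlap scorer, and the equal weighting are hypothetical stand-ins for whatever reranker a real pipeline would use.

```typescript
// Hypothetical candidate returned by a first-stage vector search.
interface Candidate {
  id: string;
  vectorScore: number; // similarity from the video embedding stage
  transcript: string;
}

// Fraction of query terms that appear in the text (a deliberately simple scorer).
function keywordOverlap(query: string, text: string): number {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  if (terms.length === 0) return 0;
  const words = new Set(text.toLowerCase().split(/\W+/).filter(Boolean));
  const hits = terms.filter((t) => words.has(t)).length;
  return hits / terms.length;
}

// Blend the vector score with the transcript score and sort descending.
function rerank(query: string, candidates: Candidate[], textWeight = 0.5): Candidate[] {
  const combined = (c: Candidate) =>
    (1 - textWeight) * c.vectorScore + textWeight * keywordOverlap(query, c.transcript);
  // Copy before sorting so the caller's array is left untouched.
  return [...candidates].sort((a, b) => combined(b) - combined(a));
}

const candidates: Candidate[] = [
  { id: "clip-a", vectorScore: 0.82, transcript: "the chef whisks eggs in a bowl" },
  { id: "clip-b", vectorScore: 0.8, transcript: "forklift safety training for new staff" },
];
const reranked = rerank("forklift safety", candidates);
```

In this toy input, clip-a narrowly wins on vector score alone, but the transcript match moves clip-b to the top after reranking.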

    Specification

    Framework: HF
    Organization: facebook
    Feature: Visual Embeddings
    Output: 768-dim vector
    Modalities: video, image
    Retriever: Vector Search
    Parameters: ViT-L
    License: MIT
    Downloads/mo: 162K

    Research Paper

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    arxiv.org

    Build a pipeline with vjepa2-vitl-fpc64-256

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
