    HF · Visual Embeddings · apache-2.0

    vjepa2-vitg-fpc64-256

    by facebook

    Highest-capacity V-JEPA 2 video encoder — self-supervised temporal representations

    372K downloads/month
    55 likes
    1.0B params
    Identifiers
    Model ID: facebook/vjepa2-vitg-fpc64-256
    Feature URI: mixpeek://video_extractor@v1/facebook_vjepa2_vitg_fpc64_256_v1

    Overview

    V-JEPA 2 (ViT-g) is the largest checkpoint of Meta FAIR's video representation model. What makes the JEPA (Joint-Embedding Predictive Architecture) family different from a masked autoencoder is *where* it predicts: it masks spacetime regions of a clip and predicts the missing regions' **representations in latent space**, not their raw pixels. Skipping pixel reconstruction means the model never spends capacity on texture and lighting detail it doesn't need, so it learns the semantic and dynamic structure of a scene — what moves, how, and in what order — rather than how to repaint it.
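To make the latent-space objective concrete, here is a toy sketch of a loss computed only on masked tokens, comparing predicted representations with target representations instead of pixels. It assumes an L1 distance, which the text above does not specify; the names `maskedLatentLoss` and `l1Distance` and all numbers are hypothetical and are not taken from the model's training code.

```typescript
// Toy illustration only: compare predicted and target *representations*
// at masked positions. No pixels are reconstructed anywhere.
type Vec = number[];

/** Mean absolute (L1) distance between two equal-length vectors. */
function l1Distance(a: Vec, b: Vec): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += Math.abs(a[i] - b[i]);
  return sum / a.length;
}

/**
 * Average L1 distance over masked token positions.
 * predicted[i] is the predictor's output for token i; target[i] is the
 * target encoder's representation of the same token. Unmasked tokens
 * contribute nothing to the loss.
 */
function maskedLatentLoss(predicted: Vec[], target: Vec[], maskedIdx: number[]): number {
  if (maskedIdx.length === 0) return 0;
  let total = 0;
  for (const i of maskedIdx) total += l1Distance(predicted[i], target[i]);
  return total / maskedIdx.length;
}

// Three tokens with 2-dim representations; tokens 1 and 2 are masked.
const target: Vec[] = [[1, 0], [0, 1], [1, 1]];
const predicted: Vec[] = [[9, 9], [0, 0.5], [1, 1]];
const loss = maskedLatentLoss(predicted, target, [1, 2]);
console.log(loss); // 0.125
```

Token 0 is far off but unmasked, so it is ignored; only the masked positions are scored.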

    The ViT-g variant trades latency for quality: it is the strongest V-JEPA 2 encoder, worth it when representation quality drives your retrieval or classification accuracy more than throughput does. On Mixpeek it serves as a motion-aware video embedding stage — giving an agent a compact vector of what *happens* over a clip, complementary to keyframe/caption features that describe what merely *appears*.

    Architecture

    Giant Vision Transformer video encoder (ViT-g), the largest V-JEPA 2 checkpoint. The FPC64 variant samples 64 frames and exposes get_vision_features via Transformers; it can also encode a still image by repeating it across the frame dimension. Trained self-supervised by predicting masked spacetime representations in latent space (no pixel decoder), which is the core JEPA distinction from pixel-reconstruction MAEs.
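The still-image trick mentioned above amounts to copying one frame into every slot of the clip. A minimal sketch, assuming a frame is a flat pixel buffer and reading "fpc64" as 64 frames per clip; `imageToClip` is a hypothetical helper, and the real resizing, normalisation, and tensor layout are handled by the model's own preprocessing.

```typescript
// A frame is modelled as a flat pixel buffer for illustration.
type Frame = Float32Array;

const FRAMES_PER_CLIP = 64; // assumed meaning of "fpc64" in the model ID

/** Build a clip by copying one still image into every frame slot. */
function imageToClip(image: Frame, frames: number = FRAMES_PER_CLIP): Frame[] {
  if (!Number.isInteger(frames) || frames < 1) {
    throw new Error("frames must be a positive integer");
  }
  // Each slot gets its own copy so later edits to one frame do not leak.
  return Array.from({ length: frames }, () => Float32Array.from(image));
}

const still = new Float32Array([0.1, 0.2, 0.3]);
const clip = imageToClip(still);
console.log(clip.length); // 64
```

A clip built this way contains no motion, so the resulting embedding reflects appearance only.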

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";
    
    const mx = new Mixpeek({ apiKey: "API_KEY" });
    
    // Managed: create a collection over a bucket; Mixpeek runs this model's extractor
    const collection = await mx.collections.create({
      namespace_id: "my-namespace",
      collection_name: "my-collection",
      source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
      feature_extractor: {
        feature_extractor_name: "video_embedding",
        version: "v1",
        parameters: { model_id: "facebook/vjepa2-vitg-fpc64-256" },
      },
    });

    Capabilities

    • Highest-quality V-JEPA 2 temporal embeddings (ViT-g scale)
    • Motion- and dynamics-aware representation of 64-frame clips
    • Predicts in latent space (JEPA) — semantic structure over pixel detail
    • Serves as a video perception backbone for downstream VLMs and planners
    • Apache-2.0 license

    Use Cases on Mixpeek

    • High-accuracy video similarity search where subtle motion distinguishes clips
    • Action and activity classification over media archives where quality beats latency
    • Agent perception over camera streams that must track temporal state, not stills
    • Pre-filtering long video into candidate clips before expensive VLM captioning
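For the similarity-search use case, the core operation is ranking stored clip embeddings against a query embedding. The sketch below does this locally with cosine similarity; the metric is an assumption (the page does not say which one the retriever uses), and `rankClips`, the clip IDs, and the 3-dim vectors are hypothetical stand-ins for real embeddings. On Mixpeek this step would normally run inside the managed retriever, not in client code.

```typescript
type Embedding = number[];

/** Cosine similarity in [-1, 1]; returns 0 if either vector is all zeros. */
function cosineSimilarity(a: Embedding, b: Embedding): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface ScoredClip {
  id: string;
  score: number;
}

/** Score every clip against the query and keep the topK best matches. */
function rankClips(
  query: Embedding,
  clips: { id: string; embedding: Embedding }[],
  topK: number,
): ScoredClip[] {
  return clips
    .map((c) => ({ id: c.id, score: cosineSimilarity(query, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

const ranked = rankClips(
  [1, 0, 0],
  [
    { id: "clip_a", embedding: [0, 1, 0] },
    { id: "clip_b", embedding: [2, 0, 0] },
    { id: "clip_c", embedding: [1, 1, 0] },
  ],
  2,
);
console.log(ranked.map((r) => r.id)); // [ 'clip_b', 'clip_c' ]
```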

    Performance

    Input Size: 64 video frames at 256px
    GPU Latency: higher than ViT-L (ViT-g is the largest encoder); batch dependent
    GPU Throughput: batch dependent
    GPU Memory: notably higher than ViT-L; plan GPU accordingly

    Choose ViT-g when representation quality drives accuracy; use the ViT-L checkpoint when throughput or latency matters more.
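Because the encoder takes a fixed 64-frame input, a longer video has to be cut into clips before embedding, and the number of clips directly sets the compute cost. A minimal sketch of sliding-window clipping follows; `clipWindows`, the stride value, and the choice to drop a short trailing remainder are illustrative assumptions, and how the managed extractor actually samples frames is not stated on this page.

```typescript
interface ClipWindow {
  start: number; // first frame index, inclusive
  end: number; // last frame index, exclusive
}

/**
 * Split a video of totalFrames into fixed-length windows.
 * A stride smaller than clipLen produces overlapping clips.
 * Trailing frames that cannot fill a whole window are dropped.
 */
function clipWindows(totalFrames: number, clipLen = 64, stride = 64): ClipWindow[] {
  if (clipLen < 1 || stride < 1) throw new Error("clipLen and stride must be >= 1");
  const windows: ClipWindow[] = [];
  for (let start = 0; start + clipLen <= totalFrames; start += stride) {
    windows.push({ start, end: start + clipLen });
  }
  return windows;
}

// 200 frames, 64-frame clips, 50% overlap.
const windows = clipWindows(200, 64, 32);
console.log(windows.length); // 5
console.log(windows[windows.length - 1]); // { start: 128, end: 192 }
```

Halving the stride roughly doubles the number of clips to embed, which matters more for ViT-g than for a smaller checkpoint.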

    Specification

    Framework: HF
    Organization: facebook
    Feature: Visual Embeddings
    Output: 768-dim vector
    Modalities: video, image
    Retriever: Vector Search
    Parameters: 1.0B
    License: apache-2.0
    Downloads/mo: 372K
    Likes: 55

    Research Paper

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    arxiv.org

    Build a pipeline with vjepa2-vitg-fpc64-256

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
