vjepa2-vitg-fpc64-256
by facebook
Highest-capacity V-JEPA 2 video encoder — self-supervised temporal representations
facebook/vjepa2-vitg-fpc64-256
mixpeek://video_extractor@v1/facebook_vjepa2_vitg_fpc64_256_v1
Overview
V-JEPA 2 (ViT-g) is the largest checkpoint of Meta FAIR's video representation model. What makes the JEPA (Joint-Embedding Predictive Architecture) family different from a masked autoencoder is *where* it predicts: it masks spacetime regions of a clip and predicts the missing regions' **representations in latent space**, not their raw pixels. Skipping pixel reconstruction means the model never spends capacity on texture and lighting detail it doesn't need, so it learns the semantic and dynamic structure of a scene — what moves, how, and in what order — rather than how to repaint it.
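To make the "predict in latent space" idea concrete, here is a toy sketch of a loss that compares predicted and target representations only at masked positions. It is an illustration under stated assumptions, not the training code: the function name `maskedLatentL1` is hypothetical, the L1 distance is assumed as the regression loss, and the encoder that would produce the target latents is left out entirely.

```typescript
// Toy illustration only. `maskedLatentL1` is a hypothetical helper, and the
// choice of L1 distance is an assumption made for this sketch.

/**
 * Mean absolute difference between predicted and target latent vectors,
 * averaged over the masked tokens only. No pixels are involved: both inputs
 * are representation vectors of shape [tokens][dim].
 */
function maskedLatentL1(
  predicted: number[][],
  target: number[][],
  masked: boolean[],
): number {
  let sum = 0;
  let count = 0;
  for (let t = 0; t < masked.length; t++) {
    if (!masked[t]) continue; // visible tokens contribute nothing to the loss
    for (let d = 0; d < target[t].length; d++) {
      sum += Math.abs(predicted[t][d] - target[t][d]);
      count++;
    }
  }
  return count === 0 ? 0 : sum / count;
}
```

A pixel-reconstruction objective would compare decoded frames against raw frames instead; here the comparison never leaves representation space.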
The ViT-g variant trades latency for quality: it is the strongest V-JEPA 2 encoder, worth it when representation quality drives your retrieval or classification accuracy more than throughput does. On Mixpeek it serves as a motion-aware video embedding stage — giving an agent a compact vector of what *happens* over a clip, complementary to keyframe/caption features that describe what merely *appears*.
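As a rough sketch of how such clip vectors might be used for retrieval, the snippet below ranks clips by cosine similarity to a query vector. Everything here is illustrative: the `ClipEmbedding` type and the `cosineSimilarity` and `rankClips` helpers are hypothetical names, and a managed retrieval stage would normally do this server-side rather than in client code.

```typescript
// Hypothetical types and helpers for illustration; not part of the Mixpeek SDK.
type ClipEmbedding = { clipId: string; vector: number[] };
type RankedClip = { clipId: string; score: number };

/** Cosine similarity of two equal-length vectors; returns 0 for a zero vector. */
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

/** Return the topK clips most similar to the query, best match first. */
function rankClips(query: number[], clips: ClipEmbedding[], topK = 5): RankedClip[] {
  return clips
    .map((c) => ({ clipId: c.clipId, score: cosineSimilarity(query, c.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```

The query vector would come from embedding a reference clip with the same model, so that both sides live in the same representation space.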
Architecture
ViT-g is a giant Vision Transformer video encoder and the largest V-JEPA 2 checkpoint. The fpc64 variant (64 frames per clip) samples 64 frames and exposes get_vision_features via Hugging Face Transformers; it can also encode a still image by repeating it across the frame dimension. It is trained self-supervised by predicting masked spacetime representations in latent space with no pixel decoder, which is the core distinction between JEPA and pixel-reconstruction masked autoencoders (MAEs).
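The two input-preparation ideas above, picking 64 frames from a longer video and turning a still image into a clip, can be sketched as follows. This is a minimal illustration: `sampleFrameIndices` and `repeatImageAsClip` are hypothetical helper names, and evenly spaced sampling is an assumption, since the exact sampling strategy of the model's preprocessor is not stated here.

```typescript
// Hypothetical helpers for illustration; evenly spaced sampling is an assumption.

/** Pick `numFrames` indices spread evenly over a video of `totalFrames` frames. */
function sampleFrameIndices(totalFrames: number, numFrames = 64): number[] {
  if (totalFrames < 1 || numFrames < 1) {
    throw new Error("totalFrames and numFrames must be positive");
  }
  const indices: number[] = [];
  for (let i = 0; i < numFrames; i++) {
    // Centre of the i-th of numFrames equal segments, clamped to a valid index.
    const idx = Math.floor(((i + 0.5) * totalFrames) / numFrames);
    indices.push(Math.min(idx, totalFrames - 1));
  }
  return indices;
}

/** Build a clip from one still image by repeating it along the frame axis. */
function repeatImageAsClip<T>(image: T, numFrames = 64): T[] {
  return Array.from({ length: numFrames }, () => image);
}
```

For videos shorter than 64 frames, this sketch repeats indices rather than failing, so the output always has the requested length.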
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "video_embedding",
    version: "v1",
    parameters: { model_id: "facebook/vjepa2-vitg-fpc64-256" },
  },
});

Capabilities
- Highest-quality V-JEPA 2 temporal embeddings (ViT-g scale)
- Motion- and dynamics-aware representation of 64-frame clips
- Predicts in latent space (JEPA) — semantic structure over pixel detail
- Serves as a video perception backbone for downstream VLMs and planners
- Apache-2.0 license
Use Cases on Mixpeek
Performance
Choose ViT-g when representation quality drives accuracy; use the ViT-L checkpoint when throughput or latency matters more.
Common Pipeline Companions
Specification
Research Paper
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
arxiv.org
Build a pipeline with vjepa2-vitg-fpc64-256
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.