vjepa2-vitl-fpc64-256
by facebook
Self-supervised video encoder for retrieval, classification, and VLM perception
facebook/vjepa2-vitl-fpc64-256
mixpeek://video_extractor@v1/facebook_vjepa2_vitl_fpc64_256_v1
Overview
V-JEPA 2 is Meta FAIR's video representation model trained with a joint embedding predictive architecture. Instead of treating video as independent frames, it learns representations that preserve temporal structure, motion, and object dynamics.
On Mixpeek, V-JEPA 2 is useful as a video feature extractor before retrieval or classification. It gives agents and search systems a compact representation of what happens over time, not just what appears in a sampled keyframe.
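To make "compact representation" concrete: transformer encoders usually emit one feature vector per patch token, and a common way to reduce them to a single clip-level vector is mean pooling. The sketch below assumes that approach; the page does not say how Mixpeek pools features, and `meanPool` is a hypothetical helper, not part of any SDK.

```typescript
// Hypothetical helper (not a Mixpeek or Transformers API): average a list of
// per-token feature vectors into one clip-level vector.
function meanPool(tokens: number[][]): number[] {
  const first = tokens[0];
  if (first === undefined) {
    throw new Error("meanPool needs at least one token vector");
  }
  const dim = first.length;
  const sum = new Array<number>(dim).fill(0);
  for (const token of tokens) {
    if (token.length !== dim) {
      throw new Error("all token vectors must have the same dimension");
    }
    for (let d = 0; d < dim; d++) {
      sum[d] = (sum[d] ?? 0) + (token[d] ?? 0);
    }
  }
  // Divide the per-dimension sums by the token count to get the mean.
  return sum.map((value) => value / tokens.length);
}
```

The pooled vector has the same dimension as each token vector, so it can be stored and compared like any other embedding.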
Architecture
Vision Transformer (ViT) video encoder. The ViT-L FPC64 checkpoint takes 64 frames per clip and exposes get_vision_features through the Hugging Face Transformers library. It can also encode still images by repeating the image across the expected frame dimension.
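The two input paths described above (a 64-frame clip, or a still image repeated along the frame dimension) can be sketched as follows. This is a minimal illustration, not the model's preprocessing code: `sampleFrameIndices` and `repeatImageAsClip` are hypothetical helpers, and evenly spaced sampling is an assumption, since the page does not say how the 64 frames are chosen.

```typescript
// Hypothetical helper: choose `numFrames` evenly spaced frame indices from a
// video that has `totalFrames` frames. Even spacing is an assumption here.
function sampleFrameIndices(totalFrames: number, numFrames: number = 64): number[] {
  if (!Number.isInteger(totalFrames) || totalFrames < 1) {
    throw new RangeError("totalFrames must be a positive integer");
  }
  if (!Number.isInteger(numFrames) || numFrames < 1) {
    throw new RangeError("numFrames must be a positive integer");
  }
  if (numFrames === 1) {
    return [0];
  }
  // Spread indices from the first frame (0) to the last (totalFrames - 1).
  // Short videos produce repeated indices, which pads the clip to length.
  const step = (totalFrames - 1) / (numFrames - 1);
  return Array.from({ length: numFrames }, (_, i) => Math.round(i * step));
}

// Hypothetical helper: turn a still image into a clip by repeating it along
// the frame dimension, so it matches the clip length the encoder expects.
function repeatImageAsClip<T>(image: T, numFrames: number = 64): T[] {
  return Array.from({ length: numFrames }, () => image);
}
```

For example, `sampleFrameIndices(640)` returns 64 indices that start at 0 and end at 639, and `repeatImageAsClip(image)` returns an array holding the same image 64 times.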
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "video-library",
  source: { url: "https://example.com/training-video.mp4" },
  feature_extractors: [
    {
      feature: "video_embedding",
      model: "facebook/vjepa2-vitl-fpc64-256"
    }
  ]
});
Capabilities
- Video feature extraction from 64-frame clips
- Temporal representation for retrieval and classification
- Can serve as a video encoder for downstream VLMs
- MIT license
Use Cases on Mixpeek
Performance
Use V-JEPA 2 as a video feature stage, then rerank the results with captions or transcripts when precision matters.
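The two-stage pattern above can be sketched as a cosine-similarity shortlist over video embeddings followed by a text-based rerank. This is only an illustration under stated assumptions: the `Clip` shape, `retrieveThenRerank`, and the token-overlap score are hypothetical and do not represent Mixpeek's retrieval stages, and the query is assumed to arrive as an embedding (for instance from a query clip) plus a short text string.

```typescript
// Hypothetical record shape for an indexed clip.
interface Clip {
  id: string;
  embedding: number[]; // video feature vector from the encoder stage
  transcript: string;  // caption or transcript text, used only for reranking
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Simple stand-in for a text reranker: fraction of query tokens that also
// appear in the transcript. A real system would likely use a stronger scorer.
function tokenOverlap(query: string, text: string): number {
  const tokenize = (s: string): string[] =>
    s.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
  const queryTokens = new Set(tokenize(query));
  if (queryTokens.size === 0) return 0;
  const textTokens = new Set(tokenize(text));
  let hits = 0;
  queryTokens.forEach((t) => {
    if (textTokens.has(t)) hits++;
  });
  return hits / queryTokens.size;
}

function retrieveThenRerank(
  queryEmbedding: number[],
  queryText: string,
  clips: Clip[],
  topK: number
): Clip[] {
  // Stage 1: shortlist the topK clips by video-embedding similarity.
  const shortlist = clips
    .map((clip) => ({ clip, sim: cosineSimilarity(queryEmbedding, clip.embedding) }))
    .sort((a, b) => b.sim - a.sim)
    .slice(0, topK);
  // Stage 2: reorder the shortlist by text score, breaking ties by similarity.
  return shortlist
    .map((c) => ({ ...c, text: tokenOverlap(queryText, c.clip.transcript) }))
    .sort((a, b) => b.text - a.text || b.sim - a.sim)
    .map((c) => c.clip);
}
```

Keeping the rerank limited to the shortlist means the text scorer only runs on topK candidates, so the embedding stage controls recall and the text stage refines the ordering.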
Specification
Research Paper
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
arxiv.org
Build a pipeline with vjepa2-vitl-fpc64-256
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.