dinov3-vitl16-pretrain-lvd1689m
by facebook
High-traffic DINOv3 ViT-L checkpoint for dense visual features
facebook/dinov3-vitl16-pretrain-lvd1689m
mixpeek://image_extractor@v1/facebook_dinov3_vitl_lvd1689m_v1
Overview
DINOv3 is Meta's self-supervised vision foundation model family for dense, reusable visual features. The ViT-L LVD-1689M checkpoint is one of the most downloaded DINOv3 checkpoints on Hugging Face and is a practical alternative to the larger ViT-7B model.
On Mixpeek, DINOv3 ViT-L is a strong visual embedding backbone for image collections, video keyframes, satellite imagery, and fine-grained visual similarity tasks where label-free feature quality matters.
Architecture
The model is a Vision Transformer Large (ViT-L) with 16x16 patches, distilled from the DINOv3 ViT-7B teacher and pretrained on the LVD-1689M web image dataset. It is exposed through the Transformers image-feature-extraction pipeline.
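To make the 16x16 patching concrete, the sketch below computes how many patch tokens an image of a given size produces. The helper name `patchGrid` and the `PatchGrid` type are hypothetical, introduced only for illustration. The sketch assumes the image has already been resized so that both sides are multiples of the patch size, and it counts patch tokens only, not any additional special tokens the model may emit.

```typescript
// Hypothetical helper: size of the patch grid for a ViT with square patches.
interface PatchGrid {
  rows: number;    // patches along the image height
  cols: number;    // patches along the image width
  patches: number; // total patch tokens (rows * cols)
}

function patchGrid(height: number, width: number, patchSize: number = 16): PatchGrid {
  // Assumption: inputs are pre-resized to multiples of the patch size.
  if (height % patchSize !== 0 || width % patchSize !== 0) {
    throw new Error("Image dimensions must be multiples of the patch size");
  }
  const rows = height / patchSize;
  const cols = width / patchSize;
  return { rows, cols, patches: rows * cols };
}

// A 224x224 input splits into a 14x14 grid of 16x16 patches.
const grid = patchGrid(224, 224);
console.log(grid.rows, grid.cols, grid.patches); // 14 14 196
```

Each patch token carries its own feature vector, which is what makes the output usable for dense tasks such as segmentation as well as for whole-image retrieval.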
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "image-library",
  source: { url: "https://example.com/catalog.zip" },
  feature_extractors: [
    {
      feature: "visual_embeddings",
      model: "facebook/dinov3-vitl16-pretrain-lvd1689m"
    }
  ]
});
Capabilities
- Dense image feature extraction without task labels
- Strong transfer across classification, segmentation, and retrieval tasks
- Practical ViT-L size compared with the larger ViT-7B checkpoint
- Works with the Transformers image-feature-extraction pipeline
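For the retrieval capability listed above, one common way to compare image embeddings is cosine similarity. The sketch below is a minimal illustration, not Mixpeek's or the model's actual retrieval implementation. The names `cosineSimilarity`, `rankBySimilarity`, and `EmbeddedImage` are hypothetical, and the short vectors stand in for real embeddings, which would be much longer.

```typescript
// Hypothetical record: an image ID paired with its embedding vector.
interface EmbeddedImage {
  id: string;
  embedding: number[];
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Treat a zero vector as having no similarity to anything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every image against the query and sort from most to least similar.
function rankBySimilarity(
  query: number[],
  images: EmbeddedImage[]
): { id: string; score: number }[] {
  return images
    .map((img) => ({ id: img.id, score: cosineSimilarity(query, img.embedding) }))
    .sort((x, y) => y.score - x.score);
}

// Toy usage with 2-dimensional placeholder vectors.
const results = rankBySimilarity([1, 0], [
  { id: "a", embedding: [0, 1] },
  { id: "b", embedding: [2, 0] },
]);
console.log(results.map((r) => r.id)); // [ 'b', 'a' ]
```

Cosine similarity ignores vector magnitude, so `[2, 0]` matches `[1, 0]` exactly. For large collections, a vector index would normally replace this linear scan.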
Use Cases on Mixpeek
Performance
The model is gated on Hugging Face; access requires accepting the license.
Specification
Research Paper
DINOv3
arxiv.org
Build a pipeline with dinov3-vitl16-pretrain-lvd1689m
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.