
    dinov3-vitl16-pretrain-lvd1689m

    by facebook

    High-traffic DINOv3 ViT-L checkpoint for dense visual features

    546K downloads/month
    321 likes
    303M params
    Identifiers
    Model ID
    facebook/dinov3-vitl16-pretrain-lvd1689m
    Feature URI
    mixpeek://image_extractor@v1/facebook_dinov3_vitl_lvd1689m_v1

    Overview

    DINOv3 is Meta's self-supervised vision foundation model family for dense, reusable visual features. The ViT-L LVD-1689M checkpoint is one of the most downloaded DINOv3 checkpoints on Hugging Face and, at 303M parameters, is a practical alternative to the much larger ViT-7B model.

    On Mixpeek, DINOv3 ViT-L is a strong visual embedding backbone for image collections, video keyframes, satellite imagery, and fine-grained visual similarity tasks where label-free feature quality matters.

    Architecture

    A Vision Transformer Large (ViT-L) with 16x16-pixel patches, distilled from the DINOv3 ViT-7B teacher and pretrained on the LVD-1689M web image dataset. It is exposed through the Transformers image-feature-extraction pipeline.
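
To make the 16x16 patching concrete, the sketch below counts how many patch tokens an image yields. The 224x224 input size and the helper name `patchGrid` are illustrative assumptions, not values from this page, and any special (class or register) tokens the model adds are not counted.

```typescript
// Sketch: patch-token count for a ViT with 16x16 patches.
// Assumption: the image height and width are already multiples of the patch size.
const PATCH_SIZE = 16; // the "16" in "vitl16"

interface PatchGrid {
  rows: number;
  cols: number;
  patchTokens: number;
}

// Hypothetical helper, not part of any SDK.
function patchGrid(height: number, width: number, patchSize: number = PATCH_SIZE): PatchGrid {
  if (height % patchSize !== 0 || width % patchSize !== 0) {
    throw new Error(`image size ${height}x${width} is not a multiple of ${patchSize}`);
  }
  const rows = height / patchSize;
  const cols = width / patchSize;
  return { rows, cols, patchTokens: rows * cols };
}

// A 224x224 image (an assumed example size) splits into a 14 x 14 grid.
const grid = patchGrid(224, 224);
console.log(`${grid.rows} x ${grid.cols} = ${grid.patchTokens} patch tokens`);
```

Each patch token carries its own feature vector, which is what makes the features "dense": larger inputs produce more tokens rather than a different embedding size.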

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    // Replace "API_KEY" with your Mixpeek API key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest the source into the collection and extract
    // visual embeddings with DINOv3 ViT-L.
    await mx.collections.ingest({
      collection_id: "image-library",
      source: { url: "https://example.com/catalog.zip" },
      feature_extractors: [{
        feature: "visual_embeddings",
        model: "facebook/dinov3-vitl16-pretrain-lvd1689m"
      }]
    });

    Capabilities

    • Dense image feature extraction without task labels
    • Strong transfer across classification, segmentation, and retrieval tasks
    • Practical ViT-L size compared with the larger ViT-7B checkpoint
    • Works with the Transformers image-feature-extraction pipeline

    Use Cases on Mixpeek

    • Fine-grained image similarity over product, medical, or satellite imagery
    • Visual clustering before human review or downstream labeling
    • Keyframe embeddings for video search
    • Feature reuse across classification and retrieval workflows
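
The similarity and search use cases above reduce to comparing embedding vectors. The following is a minimal sketch of cosine-similarity ranking over vectors you have already retrieved; the names `cosineSimilarity` and `topK`, the item IDs, and the toy 3-dimensional vectors are all hypothetical. It is not how Mixpeek's vector search is implemented.

```typescript
// Sketch: rank stored embeddings by cosine similarity to a query embedding.
// Assumption: vectors are plain number arrays of equal length.
type Embedding = number[];

function cosineSimilarity(a: Embedding, b: Embedding): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Treat a zero vector as dissimilar to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface Item { id: string; vector: Embedding }
interface Scored { id: string; score: number }

// Brute-force top-k: fine for small sets, too slow for large collections.
function topK(query: Embedding, items: Item[], k: number): Scored[] {
  return items
    .map(({ id, vector }) => ({ id, score: cosineSimilarity(query, vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy vectors standing in for real embeddings.
const catalog: Item[] = [
  { id: "shoe-red", vector: [1, 0, 0] },
  { id: "shoe-crimson", vector: [0.9, 0.1, 0] },
  { id: "hat-blue", vector: [0, 0, 1] },
];
const results = topK([1, 0, 0], catalog, 2);
console.log(results.map((r) => r.id)); // [ 'shoe-red', 'shoe-crimson' ]
```

A brute-force scan like this is only meant to show the comparison itself; at collection scale an approximate nearest-neighbor index would normally take its place.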

    Performance

    Input Size        Image feature extraction
    GPU Latency       ~10 ms / image (A100, batch dependent)
    GPU Throughput    ~100 images/sec (A100, batch dependent)
    GPU Memory        ~2.5 GB

    The model is gated on Hugging Face; you must accept its license before downloading it.
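
As a rough planning aid, the throughput figure above can be turned into a time estimate for embedding a collection. This sketch assumes the quoted ~100 images/sec holds for the whole run on a single GPU, which it may not since the figure is batch dependent. The function names and the one-million-image example are hypothetical.

```typescript
// Sketch: back-of-the-envelope embedding time from the quoted throughput.
// Assumption: a steady ~100 images/sec on one A100, ignoring download and decode time.
const IMAGES_PER_SECOND = 100;

function estimateEmbeddingSeconds(imageCount: number, imagesPerSecond: number = IMAGES_PER_SECOND): number {
  if (imagesPerSecond <= 0) throw new Error("imagesPerSecond must be positive");
  return imageCount / imagesPerSecond;
}

function formatDuration(seconds: number): string {
  const total = Math.round(seconds);
  const h = Math.floor(total / 3600);
  const m = Math.floor((total % 3600) / 60);
  const s = total % 60;
  return `${h}h ${m}m ${s}s`;
}

// One million images at ~100 images/sec.
const seconds = estimateEmbeddingSeconds(1_000_000);
console.log(formatDuration(seconds)); // 2h 46m 40s
```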

    Specification

    Framework       HF
    Organization    facebook
    Feature         Visual Embeddings
    Output          768-dim vector
    Modalities      video, image
    Retriever       Vector Search
    Parameters      303M
    License         other
    Downloads/mo    546K
    Likes           321
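
The output dimension listed above also determines raw vector storage. The sketch below estimates that footprint, assuming each component is stored as a 4-byte float32 and ignoring index and metadata overhead. Both are assumptions, since this page does not state the storage format, and the function names are hypothetical.

```typescript
// Sketch: raw storage needed for embeddings of a given dimension.
// Assumption: float32 components (4 bytes each), no index or metadata overhead.
const BYTES_PER_FLOAT32 = 4;
const OUTPUT_DIM = 768; // "Output" row of the specification table

function vectorBytes(dim: number = OUTPUT_DIM): number {
  return dim * BYTES_PER_FLOAT32;
}

function collectionBytes(vectorCount: number, dim: number = OUTPUT_DIM): number {
  return vectorCount * vectorBytes(dim);
}

console.log(vectorBytes()); // 3072 bytes per vector
console.log(collectionBytes(1_000_000) / 1e9); // 3.072 (GB, decimal) for one million vectors
```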

    Research Paper

    DINOv3

    arxiv.org

    Build a pipeline with dinov3-vitl16-pretrain-lvd1689m

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
