    HF | Depth Estimation | Apache-2.0

    Depth-Anything-V2-Large

    by depth-anything

    Foundation model for monocular depth estimation with synthetic-to-real training

    1.4M downloads/month
    335M params
    Identifiers

    Model ID: depth-anything/Depth-Anything-V2-Large
    Feature URI: mixpeek://image_extractor@v1/depth_anything_v2_large_v1

    Overview

    Depth Anything V2 Large is a 335M-parameter monocular depth estimation model that produces dense per-pixel depth maps from single images. Built on a DINOv2-Large encoder with a DPT decoder, it is trained via a teacher-student paradigm: a giant ViT-G teacher learns from 595K synthetic images, then supervises student models on 62M pseudo-labeled real images to bridge the synthetic-to-real domain gap.

    On Mixpeek, Depth Anything V2 extracts depth maps from video frames and images, enabling spatial-aware retrieval such as finding scenes with specific depth compositions, foreground/background separation, or 3D layout understanding.

    Architecture

    The encoder is DINOv2-Large (ViT-L, 24 transformer layers), which feeds a DPT (Dense Prediction Transformer) decoder. Intermediate features from the encoder are fused at multiple scales to produce a dense depth prediction. Training is teacher-student: a ViT-G teacher is trained on synthetic data and then supervises the student model.
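To make the encoder input concrete, the sketch below counts the patch tokens a ViT-style encoder sees for a given input size. It assumes the 14-pixel patch size used by DINOv2 and the 518x518 default input size listed under Performance. The function name `patch_grid` is illustrative and not part of any library.

```python
def patch_grid(height: int, width: int, patch_size: int = 14) -> tuple[int, int, int]:
    """Return (rows, cols, tokens) of the patch grid for an input image.

    Assumes the input has already been resized so that both dimensions
    are multiples of the patch size.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("input dimensions must be multiples of patch_size")
    rows, cols = height // patch_size, width // patch_size
    return rows, cols, rows * cols


rows, cols, tokens = patch_grid(518, 518)
print(rows, cols, tokens)  # 37 37 1369
```

Under these assumptions, the decoder works from a 37x37 token grid and upsamples its fused features back to a per-pixel depth map.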

    Mixpeek SDK Integration

    from mixpeek import Mixpeek

    mx = Mixpeek(api_key="YOUR_KEY")

    # Ingest footage and extract a depth map from each frame or image.
    mx.ingest(
        collection_id="video-scenes",
        source="s3://footage/",
        extractors=[{
            "type": "depth_estimation",
            "model": "depth-anything/Depth-Anything-V2-Large",
            "output_feature": "depth_map",
        }],
    )

    Capabilities

    • Dense per-pixel relative depth estimation
    • More than 10x faster than Stable Diffusion-based depth models, as reported in the paper
    • Robust across indoor, outdoor, and synthetic scenes
    • Fine-grained boundary preservation
    • Metric depth variant available for absolute scale
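Relative depth is only defined up to an unknown scale and shift, so comparing it with reference measurements usually requires an alignment step first. The following is a minimal sketch of a closed-form least-squares alignment, assuming flattened lists of predicted and reference values in the same space (depth or inverse depth). The helper name `align_scale_shift` is hypothetical and not part of the Mixpeek SDK or the model's code.

```python
def align_scale_shift(pred, ref):
    """Fit scale s and shift t minimizing sum((s * pred + t - ref) ** 2)."""
    n = len(pred)
    if n != len(ref) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_p = sum(pred) / n
    mean_r = sum(ref) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0:
        raise ValueError("prediction is constant; scale is undefined")
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(pred, ref))
    scale = cov / var_p
    shift = mean_r - scale * mean_p
    return scale, shift


# Toy data: the reference equals 2 * pred + 1.
pred = [0.1, 0.2, 0.4, 0.8]
ref = [1.2, 1.4, 1.8, 2.6]
scale, shift = align_scale_shift(pred, ref)
aligned = [scale * p + shift for p in pred]
```

When absolute distances are needed without a reference to align against, the metric depth variant is the more direct route.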

    Use Cases on Mixpeek

    • Spatial-aware video retrieval (find scenes by depth composition or layout)
    • 3D scene understanding for augmented reality content pipelines
    • Foreground/background separation in visual effects and media production
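As a rough illustration of foreground/background separation, the sketch below thresholds a depth map after normalizing it to the 0..1 range. It assumes the depth map is a list of rows of floats and that larger values mean nearer surfaces (an inverse-depth convention). Both the data layout and the `foreground_mask` helper are assumptions for illustration, not the Mixpeek output format.

```python
def foreground_mask(depth, near_fraction=0.5, larger_is_nearer=True):
    """Mark pixels in the nearest `near_fraction` of the value range as foreground."""
    flat = [v for row in depth for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A flat depth map has no foreground/background split.
        return [[False] * len(row) for row in depth]
    mask = []
    for row in depth:
        norm = [(v - lo) / (hi - lo) for v in row]  # rescale to 0..1
        if larger_is_nearer:
            mask.append([x >= 1.0 - near_fraction for x in norm])
        else:
            mask.append([x <= near_fraction for x in norm])
    return mask


depth = [
    [0.9, 0.8, 0.1],
    [0.7, 0.2, 0.1],
]
mask = foreground_mask(depth, near_fraction=0.5)
# Share of pixels marked as foreground, usable as a simple depth-composition feature.
coverage = sum(v for row in mask for v in row) / sum(len(row) for row in mask)
print(mask)      # [[True, True, False], [True, False, False]]
print(coverage)  # 0.5
```

A fixed threshold is the simplest possible choice. Production pipelines would more likely pick the split from the depth histogram or combine it with a segmentation model.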

    Benchmarks

    Dataset   Metric   Score   Source
    NYUv2     AbsRel   0.043   Yang et al., 2024 (Depth Anything V2 paper)
    KITTI     AbsRel   0.044   Yang et al., 2024 (Depth Anything V2 paper)
    Sintel    AbsRel   0.280   Yang et al., 2024 (Depth Anything V2 paper)
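AbsRel (absolute relative error) is the mean of |prediction - ground truth| / ground truth over pixels with valid ground truth, so lower is better. A minimal sketch of the metric follows, assuming flattened lists of depths and that non-positive ground-truth values mark invalid pixels. The exact evaluation protocol behind the scores above (alignment, depth caps, cropping) is not reproduced here.

```python
def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over pixels with valid (positive) ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


gt = [2.0, 4.0, 0.0, 10.0]    # 0.0 marks a pixel without ground truth
pred = [2.2, 3.6, 5.0, 10.0]
score = abs_rel(pred, gt)     # two pixels are 10% off, one is exact: about 0.067
```

Read this way, an AbsRel of 0.043 means predictions deviate from ground truth by about 4.3% on average.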

    Performance

    Input Size: 518x518 px (default)
    GPU Latency: ~12 ms / image (A100)
    CPU Latency: ~180 ms / image
    GPU Throughput: ~83 images/sec (A100)
    GPU Memory: ~1.4 GB
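The latency and throughput figures are two views of the same number: 1000 ms divided by 12 ms per image gives roughly 83 images per second. The sketch below works through that arithmetic and a back-of-the-envelope weight-memory estimate. It assumes sequential, one-image-at-a-time processing and 4-byte (fp32) weights, and the function names are illustrative only.

```python
def images_per_second(latency_ms: float) -> float:
    """Throughput implied by a per-image latency, with no batching or overlap."""
    return 1000.0 / latency_ms


def batch_seconds(num_images: int, latency_ms: float) -> float:
    """Total time to process num_images sequentially."""
    return num_images * latency_ms / 1000.0


def weights_gb(num_params: int, bytes_per_param: int = 4) -> float:
    """Memory taken by the weights alone, in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


gpu_rate = images_per_second(12)        # about 83.3 images/sec
cpu_rate = images_per_second(180)       # about 5.6 images/sec
hour_at_1fps = batch_seconds(3600, 12)  # 43.2 s for 3600 frames on the GPU
fp32_weights = weights_gb(335_000_000)  # 1.34 GB
```

The fp32 weight estimate of 1.34 GB is in line with the ~1.4 GB GPU memory figure. Activations and framework overhead add to it in practice.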

    Specification

    Framework: HF
    Organization: depth-anything
    Feature: Depth Estimation
    Output: depth map
    Modalities: video, image
    Retriever: Depth Filter
    Parameters: 335M
    License: Apache-2.0
    Downloads/mo: 1.4M

    Research Paper

    Depth Anything V2

    arxiv.org
