    HF | Depth Estimation | Apple Sample Code License

    DepthPro

    by apple

    Zero-shot metric monocular depth estimation with sharp boundaries in under a second

    520K downloads/month
    ~350M params
    Identifiers
    Model ID: apple/DepthPro
    Feature URI: mixpeek://image_extractor@v1/apple_depthpro_v1

    Overview

    DepthPro is Apple's foundation model for zero-shot metric monocular depth estimation, producing 2.25-megapixel depth maps (1536x1536) in 0.3 seconds on a V100 GPU. Unlike relative depth models, DepthPro predicts absolute metric depth without requiring camera intrinsics, and includes a built-in focal length estimator. Its multi-scale ViT architecture with a shared DINOv2 encoder and DPT-like fusion stage preserves sharp object boundaries.

    On Mixpeek, DepthPro enables metric-accurate spatial understanding of images and video frames, powering use cases like 3D scene reconstruction, spatial filtering in retrieval, and depth-aware content organization.
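
    To make "spatial filtering" concrete, here is a minimal sketch of a depth-based filter over already-extracted depth maps. It does not use the Mixpeek API: the function names, the record layout, and the thresholds are all hypothetical, and plain nested lists stand in for real depth arrays (values in metres).

```python
def near_fraction(depth, max_distance_m):
    """Fraction of pixels whose metric depth is at most max_distance_m."""
    pixels = [z for row in depth for z in row]
    if not pixels:
        raise ValueError("depth map is empty")
    return sum(1 for z in pixels if z <= max_distance_m) / len(pixels)


def filter_close_ups(records, max_distance_m=1.0, min_fraction=0.5):
    """Keep records where at least min_fraction of the image is nearby.

    `records` is a hypothetical list of {"id": ..., "depth": [[...], ...]} dicts.
    """
    return [
        r["id"]
        for r in records
        if near_fraction(r["depth"], max_distance_m) >= min_fraction
    ]


# Illustrative data only: two tiny 2x2 "depth maps" in metres.
records = [
    {"id": "close-up", "depth": [[0.5, 3.0], [1.0, 10.0]]},
    {"id": "wide-shot", "depth": [[8.0, 9.0], [12.0, 15.0]]},
]
close_ids = filter_close_ups(records)
```

    Because the depth values are metric, a threshold such as 1.0 means one metre in every image. With relative depth, the same threshold would mean something different for each scene.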

    Architecture

    Multi-scale Vision Transformer with shared DINOv2 encoder processing image patches at multiple resolutions. DPT-like fusion stage merges and upsamples features for dense prediction. Built-in focal length estimation head. Outputs 1536x1536 metric depth maps with absolute scale.
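
    The multi-scale patching can be illustrated with a small sketch of how one fixed patch size might tile an image pyramid. The concrete numbers below (384-pixel patches, pyramid sides of 1536, 768 and 384, and 5, 3 and 1 patches per side) are assumptions made for illustration and are not stated on this page. The helper names are hypothetical.

```python
PATCH = 384  # assumed encoder patch size in pixels

# Assumed pyramid: image side -> number of patches per side.
PYRAMID = {1536: 5, 768: 3, 384: 1}


def patch_origins(image_size, patch_size, n_per_side):
    """Top-left offsets of n_per_side evenly spaced patches along one axis."""
    if n_per_side == 1:
        return [0]
    # Evenly spaced so the last patch ends exactly at the image border;
    # a stride smaller than patch_size means neighbouring patches overlap.
    stride = (image_size - patch_size) // (n_per_side - 1)
    return [i * stride for i in range(n_per_side)]


def multiscale_patch_grid():
    """Map each pyramid level to the (y, x) origins of its patches."""
    grid = {}
    for side, n in PYRAMID.items():
        offsets = patch_origins(side, PATCH, n)
        grid[side] = [(y, x) for y in offsets for x in offsets]
    return grid


grid = multiscale_patch_grid()
total_patches = sum(len(origins) for origins in grid.values())
```

    Under these assumptions every patch has the same size, so a single shared encoder can process all levels as one batch before the fusion stage merges the features.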

    Mixpeek SDK Integration

    from mixpeek import Mixpeek

    mx = Mixpeek(api_key="YOUR_KEY")

    mx.ingest(
        collection_id="real-estate-photos",
        source="s3://listings/",
        extractors=[{
            "type": "depth_estimation",
            "model": "apple/DepthPro",
            "output_feature": "metric_depth"
        }]
    )

    Capabilities

    • Zero-shot metric depth (absolute scale, no camera intrinsics needed)
    • 2.25-megapixel output (1536x1536) in 0.3s
    • Sharp boundary preservation via multi-scale architecture
    • Built-in focal length estimation from a single image
    • State-of-the-art boundary accuracy, as reported in the Depth Pro paper

    Use Cases on Mixpeek

    • 3D scene reconstruction from single images or video frames
    • Depth-aware retrieval and spatial filtering in media pipelines
    • Augmented reality content creation with metric-accurate depth
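
    As a sketch of the reconstruction use case, a metric depth map and a focal length in pixels are enough to back-project pixels into 3D with the standard pinhole camera model. The function below is an illustration, not part of any SDK. It assumes square pixels and a principal point at the image centre, and it uses nested lists in place of a real depth array.

```python
def backproject(depth, focal_px):
    """Turn a metric depth map into a list of (X, Y, Z) points in metres.

    depth    : rows of per-pixel depth values in metres
    focal_px : focal length in pixels (e.g. from a focal length estimator)
    Assumes a pinhole camera with the principal point at the image centre.
    """
    height, width = len(depth), len(depth[0])
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            x = (u - cx) * z / focal_px  # horizontal offset scaled by depth
            y = (v - cy) * z / focal_px  # vertical offset scaled by depth
            points.append((x, y, z))
    return points


# Illustrative input: a flat 3x3 wall two metres from the camera.
wall = [[2.0, 2.0, 2.0] for _ in range(3)]
cloud = backproject(wall, focal_px=2.0)
```

    The resulting points are in metres only because the depth is metric. With relative depth, the cloud would be correct only up to an unknown scale.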

    Benchmarks

    Dataset       Metric             Score              Source
    NYUv2         AbsRel             0.036              Bochkovskii et al., 2024 (Depth Pro paper)
    KITTI         AbsRel             0.039              Bochkovskii et al., 2024 (Depth Pro paper)
    Boundary F1   F1 (depth edges)   State-of-the-art   Bochkovskii et al., 2024 (Depth Pro paper)
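
    For reference, AbsRel is the mean absolute relative error between predicted and ground-truth depth, so lower is better. A minimal sketch of the usual definition follows. It assumes that pixels with non-positive ground truth count as invalid and are skipped, and it uses flat lists of depth values.

```python
def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over pixels with valid ground truth."""
    errors = [abs(p - g) / g for p, g in zip(pred, gt) if g > 0]
    if not errors:
        raise ValueError("no valid ground-truth pixels")
    return sum(errors) / len(errors)


# Illustrative values in metres; the third pixel has no ground truth.
score = abs_rel(pred=[1.1, 2.0, 3.0], gt=[1.0, 2.0, 0.0])
```

    An AbsRel of 0.036 therefore means the predicted depth is off by about 3.6% of the true distance on average.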

    Performance

    Input Size: Variable (multi-scale, outputs 1536x1536)
    GPU Latency: ~300 ms / image (V100)
    CPU Latency: ~2.5 s / image
    GPU Throughput: ~12 images/sec (A100)
    GPU Memory: ~2.5 GB
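
    These figures support a rough capacity estimate. The sketch below is back-of-the-envelope arithmetic only: it assumes the quoted rates hold steadily and ignores I/O, decoding, and batching overhead. The 10,000-image job size is an arbitrary example.

```python
def latency_to_throughput(latency_ms):
    """Images per second when processing one image at a time."""
    return 1000.0 / latency_ms


def batch_seconds(n_images, images_per_sec):
    """Wall-clock seconds to process n_images at a steady rate."""
    return n_images / images_per_sec


v100_rate = latency_to_throughput(300)    # from ~300 ms / image
a100_job = batch_seconds(10_000, 12)      # from ~12 images/sec
cpu_job = batch_seconds(10_000, latency_to_throughput(2500))  # ~2.5 s / image
```

    Under these assumptions a V100 handles roughly 3.3 images per second sequentially. A 10,000-image job takes about 14 minutes at the quoted A100 throughput and about 7 hours on CPU.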

    Specification

    Framework: HF
    Organization: apple
    Feature: Depth Estimation
    Output: depth map
    Modalities: video, image
    Retriever: Depth Filter
    Parameters: ~350M
    License: Apple Sample Code License
    Downloads/mo: 520K

    Research Paper

    Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

    arxiv.org

    Build a pipeline with DepthPro

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
