    HF · Depth Estimation · Apache-2.0

    DA3-SMALL

    by depth-anything

    Lightweight monocular and multi-view depth estimation with unified depth-ray representation

    161K downloads/month
    24M params
    Identifiers
    Model ID: depth-anything/DA3-SMALL
    Feature URI: mixpeek://image_extractor@v1/depth_anything_v3_small_v1

    Overview

    Depth Anything 3 Small (DA3-Small) is the compact variant of ByteDance's Depth Anything 3 family, which uses a single plain Vision Transformer with a unified depth-ray representation to handle monocular depth estimation, multi-view depth estimation, stereo matching, and camera pose estimation from any number of input views.

    Unlike Depth Anything 2, which handles only single images, DA3 processes single images, stereo pairs, multi-view collections, and videos, and produces geometrically consistent outputs across them. The Small variant uses a DINOv2 ViT-Small backbone, which keeps inference fast enough for real-time applications and edge deployment. On Mixpeek, DA3-Small extracts depth maps from video frames and images, enabling spatial understanding, 3D-aware content filtering, and depth-based scene segmentation in retrieval pipelines.

    Architecture

    DA3-Small pairs a DINOv2 ViT-Small backbone with a unified depth-ray prediction head. A single plain transformer processes any number of input views, and the depth-ray representation eliminates the need for multi-task learning, so one model supports monocular, stereo, and multi-view depth estimation.
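
To make the depth-ray idea concrete, the sketch below shows how a depth value and a ray can be combined into a 3D point. It assumes each pixel carries a ray origin and a ray direction alongside its depth, which is one common reading of a depth-ray representation. The function name `unproject` and the sample values are hypothetical, and this is not the model's actual code.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def unproject(origin: Vec3, direction: Vec3, depth: float) -> Vec3:
    """Return the 3D point reached by moving `depth` units along a ray.

    Assumption: point = origin + depth * direction, applied per pixel.
    """
    return tuple(o + depth * d for o, d in zip(origin, direction))


# Hypothetical per-pixel values: a camera at the world origin looking down +Z.
pixel_origin = (0.0, 0.0, 0.0)
pixel_direction = (0.0, 0.0, 1.0)
pixel_depth = 2.5

point = unproject(pixel_origin, pixel_direction, pixel_depth)
print(point)
```

Because every view's points land in the same coordinate frame under this formulation, outputs from several views can be compared or merged without a separate alignment step, which is the sense in which the representation supports geometric consistency.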

    Mixpeek SDK Integration

    from mixpeek import Mixpeek

    mx = Mixpeek(api_key="YOUR_KEY")

    # Ingest footage and attach DA3-SMALL as an extractor
    mx.ingest(
        collection_id="video-library",
        source="s3://footage/",
        extractors=[
            {
                "type": "scene_caption",
                "model": "depth-anything/DA3-SMALL",
                "output_feature": "depth_map"
            }
        ]
    )

    Capabilities

    • Monocular, stereo, and multi-view depth estimation
    • Camera pose estimation from arbitrary view sets
    • Unified depth-ray representation for geometric consistency
    • Lightweight ViT-Small backbone for fast inference
    • DA3 family: 44.3% average improvement in camera pose accuracy over the prior state of the art (VGGT)

    Use Cases on Mixpeek

    • Spatial content filtering: retrieve scenes by depth characteristics (close-up vs. wide shot)
    • 3D-aware video analysis: extract depth maps for scene understanding in video pipelines
    • Augmented reality content indexing: tag content with spatial depth metadata for AR applications
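
As an illustration of spatial content filtering, the sketch below labels a frame as a close-up, medium, or wide shot from the median of its depth map. It assumes the depth map is available as a 2D list of distances in a consistent unit. The function `classify_shot` and both thresholds are hypothetical and would need tuning for real footage; this is not a Mixpeek API.

```python
from statistics import median
from typing import List


def classify_shot(
    depth_map: List[List[float]],
    close_up_max: float = 1.5,  # hypothetical threshold
    wide_min: float = 10.0,     # hypothetical threshold
) -> str:
    """Label a frame by the median depth across all of its pixels."""
    values = [d for row in depth_map for d in row]
    if not values:
        raise ValueError("depth_map must contain at least one value")
    typical_depth = median(values)
    if typical_depth <= close_up_max:
        return "close-up"
    if typical_depth >= wide_min:
        return "wide"
    return "medium"


# Tiny 2x2 depth maps standing in for real frames.
print(classify_shot([[0.5, 0.8], [1.0, 1.2]]))
print(classify_shot([[12.0, 15.0], [20.0, 30.0]]))
```

The median is used rather than the mean so that a few very distant background pixels do not dominate the label. If the depth values are relative rather than metric, the thresholds only make sense after normalizing per frame or per collection.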

    Benchmarks

    Comparison                         Metric                Score       Source
    DA3 family vs VGGT (camera pose)   Accuracy improvement  +44.3% avg  ByteDance, 2025 (arXiv:2511.10647)
    DA3 family vs DA2 (monocular)      Geometric accuracy    +25.1% avg  ByteDance, 2025 (arXiv:2511.10647)

    Performance

    Input Size: Variable (single image to multi-view sets)
    GPU Latency: ~8 ms / image (A100)
    GPU Throughput: ~125 images/sec (A100)
    GPU Memory: ~0.4 GB
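
The latency and throughput figures are two views of the same number: at roughly 8 ms per image, one GPU handles about 1000 / 8 = 125 images per second. The sketch below works through that arithmetic and uses it for a rough per-video estimate. It assumes frames are processed one at a time at the quoted latency and ignores decoding, batching, and I/O. The function names and the example video are hypothetical.

```python
def images_per_second(latency_ms: float) -> float:
    """Convert a per-image latency in milliseconds to images per second."""
    return 1000.0 / latency_ms


def processing_seconds(num_frames: int, latency_ms: float) -> float:
    """Estimate total model time for `num_frames` sequential frames."""
    return num_frames * latency_ms / 1000.0


LATENCY_MS = 8.0  # approximate A100 figure quoted above

# Hypothetical input: a 10-minute video sampled at 30 frames per second.
frames = 10 * 60 * 30

print(images_per_second(LATENCY_MS))
print(processing_seconds(frames, LATENCY_MS))
```

In practice, sampling fewer frames per second reduces the estimate proportionally, which is often the simplest lever when depth is only needed for coarse scene-level filtering.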

    Specification

    Framework: HF
    Organization: depth-anything
    Feature: Depth Estimation
    Output: depth map
    Modalities: video, image
    Retriever: Depth Filter
    Parameters: 24M
    License: Apache-2.0
    Downloads/mo: 161K

    Research Paper

    Depth Anything 3: Recovering the Visual Space from Any Views

    arxiv.org

    Build a pipeline with DA3-SMALL

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
