DA3-SMALL
by depth-anything
Lightweight monocular and multi-view depth estimation with unified depth-ray representation
depth-anything/DA3-SMALL
mixpeek://image_extractor@v1/depth_anything_v3_small_v1
Overview
Depth Anything 3 Small (DA3-Small) is the compact variant of ByteDance's Depth Anything 3 family, which uses a single plain Vision Transformer with a unified depth-ray representation to handle monocular depth estimation, multi-view depth estimation, stereo matching, and camera pose estimation from any number of input views.
Unlike Depth Anything 2, which handles only single images, DA3 processes single images, stereo pairs, multi-view collections, and videos, and produces geometrically consistent outputs across them. The Small variant uses a DINOv2 ViT-Small backbone, which gives fast inference suitable for real-time applications and edge deployment. On Mixpeek, DA3-Small extracts depth maps from video frames and images, enabling spatial understanding, 3D-aware content filtering, and depth-based scene segmentation in retrieval pipelines.
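As a rough illustration of depth-based scene segmentation, the sketch below splits a depth map into near, mid, and far layers by thresholding. The function name `segment_by_depth`, the thresholds, and the list-of-lists depth format are assumptions made for this example; they are not part of the Mixpeek SDK or the DA3 codebase, and the format in which Mixpeek returns depth maps is not specified here.

```python
from typing import List, Sequence


def segment_by_depth(
    depth: Sequence[Sequence[float]],
    near_max: float,
    far_min: float,
) -> List[List[str]]:
    """Label each pixel 'near', 'mid', or 'far' from its depth value.

    Hypothetical helper: thresholds are in the same units as the depth map.
    """
    if near_max > far_min:
        raise ValueError("near_max must not exceed far_min")
    labels: List[List[str]] = []
    for row in depth:
        labels.append([
            "near" if d < near_max else "far" if d >= far_min else "mid"
            for d in row
        ])
    return labels


# Toy 2x2 depth map (units are arbitrary for this example).
depth_map = [[0.8, 1.5], [3.2, 9.0]]
layers = segment_by_depth(depth_map, near_max=1.0, far_min=3.0)
```

In a retrieval pipeline, labels like these could feed a filter such as "keep frames where the subject sits in the near layer", though the right thresholds depend on the scene and on whether the depth values are metric or relative.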
Architecture
DA3-Small pairs a DINOv2 ViT-Small backbone with a unified depth-ray prediction head. A single plain transformer processes any number of input views. The depth-ray representation eliminates the need for multi-task learning, and one model supports monocular, stereo, and multi-view depth estimation.
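To make the depth-ray idea more concrete, here is a minimal sketch, assuming each pixel carries a ray origin and a ray direction and that the depth value scales the direction, so the 3D point is `origin + depth * direction`. The function name `points_from_depth_and_rays` and the nested-list layout are illustrative assumptions, not the DA3 or Mixpeek API. Whether depth is measured along the ray or along the optical axis depends on the model's convention, which this sketch does not settle.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def points_from_depth_and_rays(
    depth: Sequence[Sequence[float]],
    origins: Sequence[Sequence[Vec3]],
    directions: Sequence[Sequence[Vec3]],
) -> List[List[Vec3]]:
    """Combine a depth map with per-pixel rays into 3D points.

    Assumption (illustrative): point = origin + depth * direction.
    """
    points: List[List[Vec3]] = []
    for depth_row, origin_row, dir_row in zip(depth, origins, directions):
        row: List[Vec3] = []
        for d, (ox, oy, oz), (dx, dy, dz) in zip(depth_row, origin_row, dir_row):
            row.append((ox + d * dx, oy + d * dy, oz + d * dz))
        points.append(row)
    return points


# Toy 1x2 "image": both rays start at the camera centre (0, 0, 0).
depth = [[2.0, 4.0]]
origins = [[(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]]
directions = [[(0.0, 0.0, 1.0), (0.5, 0.0, 1.0)]]
cloud = points_from_depth_and_rays(depth, origins, directions)
```

Because the rays encode where each camera sits and where each pixel looks, points from several views land in one shared coordinate frame, which is one way to read the claim that a single representation covers both depth and pose.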
Mixpeek SDK Integration
    from mixpeek import Mixpeek

    mx = Mixpeek(api_key="YOUR_KEY")

    mx.ingest(
        collection_id="video-library",
        source="s3://footage/",
        extractors=[
            {
                "type": "scene_caption",
                "model": "depth-anything/DA3-SMALL",
                "output_feature": "depth_map",
            }
        ],
    )
Capabilities
- Monocular, stereo, and multi-view depth estimation
- Camera pose estimation from arbitrary view sets
- Unified depth-ray representation for geometric consistency
- Lightweight ViT-Small backbone for fast inference
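- Part of the DA3 family, which reports 44.3% better average camera pose accuracy than the prior SOTA (VGGT); this is a family-level result, not a figure measured for the Small variant alone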
Benchmarks
| Comparison | Metric | Result | Source |
|---|---|---|---|
| DA3 family vs VGGT (camera pose) | Accuracy improvement | +44.3% avg | ByteDance, 2025, arxiv:2511.10647 |
| DA3 family vs DA2 (monocular) | Geometric accuracy improvement | +25.1% avg | ByteDance, 2025, arxiv:2511.10647 |
Research Paper
Depth Anything 3: Recovering the Visual Space from Any Views
arxiv.org
Build a pipeline with DA3-SMALL
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.