DA3-SMALL

by depth-anything

Lightweight monocular and multi-view depth estimation with unified depth-ray representation

28K downloads/month

20 likes

34M params

Identifiers

Model ID

depth-anything/DA3-SMALL

Feature URI

mixpeek://image_extractor@v1/depth_anything_v3_small_v1
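The Feature URI above appears to pack three pieces of information into one string: an extractor name, a version, and a feature name. The sketch below splits it apart under that assumption. The `mixpeek://<extractor>@<version>/<feature>` layout is inferred from this single example rather than from Mixpeek documentation, and `parseFeatureUri` and `FeatureUri` are hypothetical names, not part of the Mixpeek SDK.

```typescript
// Hypothetical helper, not part of the Mixpeek SDK.
// Assumed layout (inferred from the one URI on this page):
//   mixpeek://<extractor>@<version>/<feature>
interface FeatureUri {
  extractor: string;
  version: string;
  feature: string;
}

function parseFeatureUri(uri: string): FeatureUri {
  // Capture everything up to "@", then up to the first "/", then the rest.
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) {
    throw new Error(`Unrecognized feature URI: ${uri}`);
  }
  const [, extractor, version, feature] = match;
  return { extractor, version, feature };
}
```

Under this reading, the URI shown above would yield the extractor `image_extractor`, the version `v1`, and the feature `depth_anything_v3_small_v1`.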

Overview

Depth Anything 3 Small (DA3-Small) is the compact variant of ByteDance's Depth Anything 3 family, which uses a single plain Vision Transformer with a unified depth-ray representation to handle monocular depth estimation, multi-view depth estimation, stereo matching, and camera pose estimation from any number of input views.

Unlike Depth Anything 2, which handles only single images, DA3 processes single images, stereo pairs, multi-view collections, and videos, producing geometrically consistent outputs. The Small variant uses a DINOv2 ViT-Small backbone, providing fast inference suitable for real-time applications and edge deployment. On Mixpeek, DA3-Small extracts depth maps from video frames and images, enabling spatial understanding, 3D-aware content filtering, and depth-based scene segmentation in retrieval pipelines.

Architecture

DINOv2 ViT-Small backbone with a unified depth-ray prediction head. A single plain transformer processes any number of input views. The depth-ray representation eliminates the need for multi-task learning. Supports monocular, stereo, and multi-view depth estimation in a single model.
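To make the depth-ray idea concrete, here is a minimal sketch of how a per-pixel depth and a per-pixel ray could be combined into a 3D point. It assumes each ray is given as an origin plus a direction, and that a point is recovered as origin + depth × direction. The `DepthRayPixel` type and the `unproject` function are illustrative names, not the model's actual output format or API.

```typescript
// Illustrative only: the field names and layout are assumptions,
// not the actual DA3 output schema.
type Vec3 = [number, number, number];

interface DepthRayPixel {
  depth: number;   // distance along the ray
  origin: Vec3;    // ray origin (camera position for this view)
  direction: Vec3; // ray direction for this pixel
}

// Recover one 3D point as: origin + depth * direction.
function unproject(pixel: DepthRayPixel): Vec3 {
  const [ox, oy, oz] = pixel.origin;
  const [dx, dy, dz] = pixel.direction;
  return [
    ox + pixel.depth * dx,
    oy + pixel.depth * dy,
    oz + pixel.depth * dz,
  ];
}

// Apply the same rule to every pixel to get a point cloud.
function unprojectAll(pixels: DepthRayPixel[]): Vec3[] {
  return pixels.map(unproject);
}
```

Because every view's pixels land in the same coordinate frame under this formulation, points from several views can be concatenated into one cloud without a separate pose-alignment step.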

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "image_extractor", // extractor named in the Feature URI above
    version: "v1",
    parameters: { model_id: "depth-anything/DA3-SMALL" },
  },
});

Capabilities

Monocular, stereo, and multi-view depth estimation
Camera pose estimation from arbitrary view sets
Unified depth-ray representation for geometric consistency
Lightweight ViT-Small backbone for fast inference
44.3% average improvement in camera pose accuracy over the prior SOTA (VGGT), reported for the DA3 family

Use Cases on Mixpeek

Spatial content filtering: retrieve scenes by depth characteristics (close-up vs. wide shot)

3D-aware video analysis: extract depth maps for scene understanding in video pipelines

Augmented reality content indexing: tag content with spatial depth metadata for AR applications
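As a rough illustration of the spatial content filtering use case, the sketch below labels a frame as a close-up or a wide shot from the median of its depth values. It assumes the depth map is already available as a flat array of numbers. The function names and the threshold are hypothetical, and since depth output may be relative rather than metric, the threshold would need calibrating on your own data.

```typescript
// Hypothetical post-processing helper, not part of the Mixpeek SDK.
type ShotType = "close-up" | "wide";

// Median of the valid (finite, positive) depth values in a frame.
function medianDepth(depths: number[]): number {
  const valid = depths
    .filter((d) => Number.isFinite(d) && d > 0)
    .sort((a, b) => a - b);
  if (valid.length === 0) {
    throw new Error("Depth map contains no valid values");
  }
  const mid = Math.floor(valid.length / 2);
  return valid.length % 2 === 1
    ? valid[mid]
    : (valid[mid - 1] + valid[mid]) / 2;
}

// A frame whose median depth falls below the threshold counts as a close-up.
// The threshold is an assumption to be tuned per dataset.
function classifyShot(depths: number[], threshold: number): ShotType {
  return medianDepth(depths) < threshold ? "close-up" : "wide";
}
```

The median is used instead of the mean so that a few very distant background pixels do not flip a close-up into the wide category.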

Benchmarks

Comparison	Metric	Score	Source
DA3 family vs VGGT (camera pose)	Accuracy improvement	+44.3% avg	ByteDance, 2025 (arXiv:2511.10647)
DA3 family vs DA2 (monocular)	Geometric accuracy improvement	+25.1% avg	ByteDance, 2025 (arXiv:2511.10647)