Depth-Anything-V2-Large

by depth-anything

Foundation model for monocular depth estimation with synthetic-to-real training

47K downloads/month

155 likes

335M params

Identifiers

Model ID

depth-anything/Depth-Anything-V2-Large

Feature URI

mixpeek://image_extractor@v1/depth_anything_v2_large_v1

Overview

Depth Anything V2 Large is a 335M-parameter monocular depth estimation model that produces dense per-pixel depth maps from single images. It pairs a DINOv2-Large encoder with a DPT decoder and is trained via a teacher-student paradigm: a giant ViT-G teacher is first trained on 595K synthetic images, then generates pseudo-labels for 62M real images. The student models are trained on those pseudo-labeled real images, which bridges the synthetic-to-real domain gap.

On Mixpeek, Depth Anything V2 extracts depth maps from video frames and images, enabling spatial-aware retrieval such as finding scenes with specific depth compositions, foreground/background separation, or 3D layout understanding.
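As an illustration of what a "depth composition" could look like in practice, the sketch below summarizes a depth map as the fraction of pixels falling into near, mid, and far bands. It is a minimal example under stated assumptions, not how Mixpeek computes or indexes anything: `depthComposition` and `DepthComposition` are hypothetical names, and the code assumes the map stores relative values where larger means closer.

```typescript
// Hypothetical descriptor: fraction of pixels in far / mid / near bands.
// Assumption: larger values mean closer (inverse-depth convention).
// Swap the band labels if your depth map uses the opposite convention.
interface DepthComposition {
  far: number;
  mid: number;
  near: number;
}

function depthComposition(depth: Float32Array): DepthComposition {
  if (depth.length === 0) {
    throw new Error("depth map is empty");
  }
  // Find the value range so the bands adapt to each image's relative scale.
  let min = Infinity;
  let max = -Infinity;
  for (let i = 0; i < depth.length; i++) {
    const v = depth[i];
    if (v < min) min = v;
    if (v > max) max = v;
  }
  const range = max - min;
  const counts = [0, 0, 0]; // index 0 = far, 1 = mid, 2 = near
  for (let i = 0; i < depth.length; i++) {
    // Normalize to [0, 1]; a completely flat map lands in the far band.
    const t = range > 0 ? (depth[i] - min) / range : 0;
    const band = Math.min(2, Math.floor(t * 3));
    counts[band] += 1;
  }
  const n = depth.length;
  return { far: counts[0] / n, mid: counts[1] / n, near: counts[2] / n };
}
```

A retrieval layer could then filter on such a descriptor, for example keeping frames where `near` exceeds some fraction to find close-up shots.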

Architecture

The encoder is DINOv2-Large (ViT-L) with 24 layers, feeding into a DPT (Dense Prediction Transformer) decoder. Intermediate features from DINOv2 are fused at multiple scales to produce the dense depth prediction. The model is trained as a student in a teacher-student scheme, with a ViT-G teacher trained on synthetic data.

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

// Replace "API_KEY" with your Mixpeek API key
const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "depth_estimation",
    version: "v1",
    parameters: { model_id: "depth-anything/Depth-Anything-V2-Large" },
  },
});

Capabilities

Dense per-pixel relative depth estimation
More than 10x faster than Stable Diffusion-based depth models, as reported in the Depth Anything V2 paper
Robust across indoor, outdoor, and synthetic scenes
Fine-grained boundary preservation
Metric depth variant available for absolute scale
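Relative depth is defined only up to an unknown scale and shift, so comparing it against absolute measurements usually involves a least-squares alignment step first. The sketch below shows that generic technique. `fitScaleShift` is a hypothetical helper, not part of the Mixpeek SDK or the model's codebase, and it assumes both arrays are already expressed in the same space (for example, both as inverse depth).

```typescript
// Hypothetical helper: closed-form least-squares fit of scale and shift
// so that scale * pred[i] + shift approximates target[i].
interface Affine {
  scale: number;
  shift: number;
}

function fitScaleShift(pred: Float32Array, target: Float32Array): Affine {
  if (pred.length !== target.length || pred.length < 2) {
    throw new Error("need two arrays of equal length with at least 2 samples");
  }
  let sumP = 0;
  let sumT = 0;
  let sumPP = 0;
  let sumPT = 0;
  for (let i = 0; i < pred.length; i++) {
    sumP += pred[i];
    sumT += target[i];
    sumPP += pred[i] * pred[i];
    sumPT += pred[i] * target[i];
  }
  const n = pred.length;
  // Denominator of the normal equations; zero when pred is constant.
  const denom = n * sumPP - sumP * sumP;
  if (Math.abs(denom) < 1e-12) {
    throw new Error("prediction is constant; scale is undetermined");
  }
  const scale = (n * sumPT - sumP * sumT) / denom;
  const shift = (sumT - scale * sumP) / n;
  return { scale, shift };
}

// Usage: align a relative prediction to reference values.
const fit = fitScaleShift(
  new Float32Array([1, 2, 3, 4]),
  new Float32Array([2.5, 4.5, 6.5, 8.5]),
);
console.log(fit.scale, fit.shift);
```

When absolute distances are needed directly, the metric depth variant avoids this per-image fitting step.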

Use Cases on Mixpeek

Spatial-aware video retrieval (find scenes by depth composition or layout)

3D scene understanding for augmented reality content pipelines

Foreground/background separation in visual effects and media production

Benchmarks

Dataset	Metric (lower is better)	Score	Source
NYUv2	AbsRel	0.043	Yang et al., 2024: Depth Anything V2 paper
KITTI	AbsRel	0.044	Yang et al., 2024: Depth Anything V2 paper
Sintel	AbsRel	0.280	Yang et al., 2024: Depth Anything V2 paper
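
For reference, AbsRel (absolute relative error) is the mean of |predicted - ground truth| / ground truth over pixels with valid ground truth. The sketch below computes it for two flat arrays. `absRel` is a hypothetical helper name, treating non-positive ground-truth values as invalid is an assumption, and a relative prediction would normally be aligned to the ground truth before this step.

```typescript
// Hypothetical helper: absolute relative error between a predicted and a
// ground-truth depth map, both given as flat arrays of equal length.
// Assumption: ground-truth values <= 0 (or NaN) mark invalid pixels and are skipped.
function absRel(pred: Float32Array, gt: Float32Array): number {
  if (pred.length !== gt.length) {
    throw new Error("pred and gt must have the same length");
  }
  let sum = 0;
  let valid = 0;
  for (let i = 0; i < gt.length; i++) {
    const g = gt[i];
    if (!(g > 0)) continue; // skips zero, negative, and NaN ground truth
    sum += Math.abs(pred[i] - g) / g;
    valid += 1;
  }
  if (valid === 0) {
    throw new Error("no valid ground-truth pixels");
  }
  return sum / valid;
}

// Usage: the last pixel has no ground truth and is ignored.
const score = absRel(
  new Float32Array([1.1, 2, 4.5, 7]),
  new Float32Array([1, 2, 5, 0]),
);
console.log(score.toFixed(3));
```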