A computer vision task that infers the depth (distance from the camera) of each pixel in an image from a single 2D photo or stereo pair. Depth estimation adds 3D spatial understanding to multimodal content analysis.
Monocular depth estimation uses a single image and a trained neural network to predict a depth map in which each pixel value represents relative or metric distance from the camera. The model learns depth cues from training data, including perspective, occlusion, relative size, and texture gradients. Stereo depth estimation instead uses two calibrated cameras and triangulates depth from the pixel disparity between the left and right views, given the known baseline and focal length.
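The stereo case reduces to a simple triangulation formula: depth = focal length × baseline / disparity. A minimal sketch with NumPy, using illustrative camera parameters (the focal length and baseline below are assumptions, not values from any specific rig):

```python
import numpy as np

# Stereo triangulation: depth = (focal_length * baseline) / disparity.
# Parameter values are illustrative assumptions for this sketch.
focal_length_px = 700.0   # focal length in pixels (assumed)
baseline_m = 0.12         # distance between the two cameras in meters (assumed)

# A toy 2x3 disparity map in pixels; larger disparity means a closer object.
disparity = np.array([[35.0, 14.0,  7.0],
                      [70.0, 28.0, 10.0]])

# Guard against zero-disparity (invalid) pixels before dividing.
depth = np.where(disparity > 0,
                 focal_length_px * baseline_m / np.maximum(disparity, 1e-6),
                 np.inf)

print(depth[0, 0])  # 700 * 0.12 / 35 = 2.4 meters
```

Note the inverse relationship: halving the disparity doubles the estimated depth, which is why stereo accuracy degrades quadratically with distance.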
Foundation models like DPT (Dense Prediction Transformer) and Depth Anything produce high-quality relative depth maps from single images, and MiDaS provides robust cross-domain generalization. Metric depth estimation additionally requires camera-specific calibration or training. Output resolution matches the input, with depth values typically stored as 16-bit integer or float32 maps. Self-supervised methods train on stereo pairs or monocular video without explicit depth labels, using view-reconstruction error as the training signal.
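Storing a relative depth map as 16-bit typically means normalizing the model's float32 output into the full uint16 range (e.g. for a 16-bit PNG). A minimal sketch, assuming a relative depth map where only the ordering of values matters; the helper name `to_uint16_depth` is hypothetical:

```python
import numpy as np

def to_uint16_depth(depth_map: np.ndarray) -> np.ndarray:
    """Normalize a float32 relative-depth map into [0, 65535] for
    compact 16-bit storage. Absolute scale is lost; ordering is kept."""
    d = depth_map.astype(np.float64)
    d_min, d_max = d.min(), d.max()
    if d_max - d_min < 1e-12:              # flat map: avoid divide-by-zero
        return np.zeros(d.shape, dtype=np.uint16)
    scaled = (d - d_min) / (d_max - d_min)  # map to [0, 1]
    return np.round(scaled * 65535).astype(np.uint16)

# Toy relative-depth output (values are illustrative).
relative_depth = np.array([[0.1, 0.5],
                           [0.9, 0.1]], dtype=np.float32)
stored = to_uint16_depth(relative_depth)
print(stored.dtype, stored.min(), stored.max())  # uint16 0 65535
```

This quantization is lossy but adequate for relative depth; metric pipelines usually keep float32 or store a separate scale/offset so absolute distances can be recovered.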