A computer vision task that infers the depth (distance from the camera) of each pixel in an image from a single 2D photo or stereo pair. Depth estimation adds 3D spatial understanding to multimodal content analysis.
Monocular depth estimation uses a single image and a trained neural network to predict a depth map where each pixel value represents relative or metric distance from the camera. The model learns depth cues from training data including perspective, occlusion, relative size, and texture gradients. Stereo depth estimation uses two calibrated cameras and computes depth from pixel disparity.
Foundation models like DPT (Dense Prediction Transformer) and Depth Anything produce high-quality relative depth maps from single images. MiDaS provides robust cross-domain generalization. Metric depth estimation requires camera-specific calibration or training. Output resolution matches input, with depth values typically stored as 16-bit or float32 maps. Self-supervised methods train on stereo video pairs without explicit depth labels.
Connect a bucket and Mixpeek runs the whole multimodal search pipeline for you: extraction, indexing, and search over your own objects. No models to wire up, nothing to host.
Start with ManagedKeep your embeddings on your own cloud and run dense, sparse, and BM25 search directly on object storage. First 1M vectors free.
Start with MVS