An Impossible Problem That Works Anyway
A single photograph is a projection: the 3D world is flattened onto a 2D grid, and the third dimension — how far each pixel is from the camera — is thrown away. Recovering it from one image is formally ill-posed. Infinitely many 3D scenes project to the exact same picture: a small object up close and a large object far away can be pixel-for-pixel identical. With one eye closed and one still frame, geometry alone cannot decide.
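The ambiguity can be checked numerically. The sketch below assumes an ideal pinhole camera with no lens distortion; the function names `project` and `scale_scene` and the example coordinates are made up for illustration.

```python
def project(point, focal_length=1.0):
    """Pinhole projection of a camera-frame point (X, Y, Z) onto the image plane."""
    x, y, z = point
    return (focal_length * x / z, focal_length * y / z)


def scale_scene(points, k):
    """Make every object k times larger and k times farther away."""
    return [(k * x, k * y, k * z) for (x, y, z) in points]


# Corners of a small square, 2 units in front of the camera (illustrative values).
near_small = [(-0.5, -0.5, 2.0), (0.5, -0.5, 2.0), (0.5, 0.5, 2.0), (-0.5, 0.5, 2.0)]

# The same square, ten times larger and ten times farther away.
far_large = scale_scene(near_small, 10.0)

image_near = [project(p) for p in near_small]
image_far = [project(p) for p in far_large]

# Both scenes land on the same image coordinates.
for a, b in zip(image_near, image_far):
    print(a, b)
```

Because the scale factor cancels in the division by Z, no amount of inspection of the image coordinates alone can tell the two scenes apart.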
Yet humans read depth from a flat photo instantly, and so do modern models. They do it the same way: not by measuring, but by exploiting learned priors about how the world usually looks. Monocular depth estimation is the task of predicting a depth value per pixel from a single RGB image, and understanding it means understanding which cues carry the signal and what the output can and cannot promise.
The Cues: What "Depth From One Image" Actually Uses
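There is no triangulation available (that needs two viewpoints). Instead the signal comes from pictorial depth cues such as linear perspective, relative and familiar size, occlusion, texture gradients, shading, and atmospheric haze. These are the same cues painters have used for centuries, now learned statistically.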
A depth model is, in effect, a machine that has seen enough of the world to invert these cues. This is also why it fails on inputs that violate the priors — forced-perspective photographs, mirrors, printed images of scenes, or unfamiliar object scales — where the learned assumptions point the wrong way.
Relative vs Metric Depth, and the Scale Ambiguity
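The single most important distinction in depth estimation is relative vs metric, and it comes straight from the ill-posedness above. Relative depth orders pixels from near to far but is defined only up to an unknown scale (and often an unknown shift); metric depth gives absolute distance in real units such as meters.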
Most general-purpose depth foundation models predict relative depth on purpose, because it transfers across any camera and any scene. If your task needs meters, you either calibrate with intrinsics or anchor on a known-size object; if it needs "foreground vs background" or "which shot is wider," relative depth is already enough.
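As a rough sketch of the anchoring idea, the code below recovers meters from a relative depth map using one object of known size. It assumes the relative output is proportional to true depth (a pure scale ambiguity) and that the focal length in pixels is known. Many models instead emit inverse depth with an additional shift, in which case a scale-and-shift fit would be needed. All names and numbers here are hypothetical.

```python
def metric_depth_of_known_object(focal_px, real_height_m, pixel_height):
    """Pinhole relation Z = f * H / h: distance to an object of known real height."""
    return focal_px * real_height_m / pixel_height


def rescale_to_metric(relative_depth, anchor_relative, anchor_metric_m):
    """Multiply the whole map by the one scale that puts the anchor at its metric depth."""
    scale = anchor_metric_m / anchor_relative
    return [[scale * d for d in row] for row in relative_depth]


# Hypothetical relative depth map (unitless, larger means farther).
relative = [[0.2, 0.4],
            [0.4, 0.8]]

# Hypothetical anchor: a 2.0 m door that spans 500 px, seen with a 1000 px focal length.
door_distance_m = metric_depth_of_known_object(
    focal_px=1000.0, real_height_m=2.0, pixel_height=500.0
)

# The door's pixels have relative depth 0.4 in the map above.
metric = rescale_to_metric(relative, anchor_relative=0.4, anchor_metric_m=door_distance_m)
print(door_distance_m, metric)
```

A single anchor fixes only one unknown, so the result is only as good as the proportionality assumption and the accuracy of the anchor measurement.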
Training Without a Tape Measure: Self-Supervision
Dense per-pixel depth ground truth is expensive (LiDAR, structured light) and scarce, which for years capped model quality. Two ideas broke the ceiling.
Self-supervised depth from geometry. Instead of labels, use the geometry of stereo pairs or video. If you predict depth for a frame and also predict the camera's motion to the next frame, you can *reproject* one frame into the other and compare. The training signal is the photometric reconstruction loss — pixels should land where they actually appear in the other view — so depth is learned purely from raw video or stereo, no annotations. The catch is that the same scale ambiguity reappears (depth and camera translation are only recoverable up to a shared scale), which is again why the result is relative.
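The reprojection step and the shared-scale ambiguity can be illustrated with a toy example. This is a minimal sketch, not a training loop: it assumes a pinhole camera, a camera motion that is a pure translation (rotation is left out for brevity), nearest-neighbour sampling, and a tiny hand-made grayscale image. Function names and values are illustrative.

```python
def reproject(u, v, depth, translation, f, cx, cy):
    """Lift pixel (u, v) to 3D with its depth, move it into the other camera, project it."""
    x, y, z = (u - cx) * depth / f, (v - cy) * depth / f, depth
    tx, ty, tz = translation
    x, y, z = x + tx, y + ty, z + tz
    return (f * x / z + cx, f * y / z + cy)


def photometric_l1(target, source, target_depth, translation, f=2.0, cx=0.0, cy=0.0):
    """Mean absolute difference between each target pixel and the source pixel it maps to."""
    h, w = len(target), len(target[0])
    total, count = 0.0, 0
    for v in range(h):
        for u in range(w):
            us, vs = reproject(u, v, target_depth[v][u], translation, f, cx, cy)
            ui, vi = round(us), round(vs)
            if 0 <= ui < w and 0 <= vi < h:  # skip pixels that leave the frame
                total += abs(target[v][u] - source[vi][ui])
                count += 1
    return total / count if count else 0.0


def constant_depth(value, h=2, w=4):
    return [[value] * w for _ in range(h)]


# A flat wall at depth 2; the camera motion shifts the image by one pixel.
target = [[10, 20, 30, 40], [10, 20, 30, 40]]
source = [[0, 10, 20, 30], [0, 10, 20, 30]]

loss_true = photometric_l1(target, source, constant_depth(2.0), (1.0, 0.0, 0.0))
# Depth and translation both doubled: the reprojection is unchanged.
loss_scaled = photometric_l1(target, source, constant_depth(4.0), (2.0, 0.0, 0.0))
# Wrong depth with the original translation: pixels land in the wrong place.
loss_wrong = photometric_l1(target, source, constant_depth(1.0), (1.0, 0.0, 0.0))
print(loss_true, loss_scaled, loss_wrong)
```

The true and the doubled solutions are indistinguishable to the loss, which is the scale ambiguity in miniature, while a depth that is inconsistent with the motion is penalized.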
Large-scale pretraining (the Depth Anything approach). The other lever is data volume: train a strong vision backbone (a ViT) on a massive mix of labeled and pseudo-labeled images — millions of scenes — so the model's priors cover almost anything it will meet at inference. This is what makes recent depth foundation models robust zero-shot across indoor, outdoor, aerial, and close-up imagery. It is the same "scale beats cleverness" story as CLIP for embeddings, applied to geometry.
Why an Agent Wants a Depth Channel
Depth turns a flat image into a scene with structure, and that unlocks queries and filters that RGB embeddings answer poorly:
1. Spatial reasoning: "the person in the foreground," "objects on the table vs on the wall," "is X in front of or behind Y." Occlusion ordering and relative depth answer these directly; a 2D detector only gives boxes.
2. Shot and framing filters: wide establishing shots, tight close-ups, and medium shots have distinct depth-histogram signatures, so an agent can retrieve by framing without a human tagging it.
3. Foreground/background separation: depth is a cheap, model-driven matte for pulling the subject out of clutter before embedding or captioning, which sharpens downstream retrieval.
4. Geometry sanity for other perception: depth cross-checks detections and helps reject impossible configurations (a "small" car that depth says is actually near is a misdetection).
The agent never consumes the depth map as pixels; it consumes derived signals — an ordering, a histogram, a foreground mask — and asks questions in those terms, the same pattern as any other perception modality.
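A minimal sketch of those three derived signals, computed from a depth map held as a list of rows. It assumes smaller values mean nearer (flip the comparisons for models that emit inverse depth or disparity), uses the median as a simple foreground threshold, and takes boxes as `(x0, y0, x1, y1)` with exclusive upper bounds. The function names are illustrative and are not part of any particular library.

```python
from statistics import median


def depth_histogram(depth, bins=4):
    """Fraction of pixels in each of `bins` equal-width depth ranges."""
    values = [d for row in depth for d in row]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid dividing by zero on a flat map
    counts = [0] * bins
    for d in values:
        idx = min(int((d - lo) / span * bins), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def foreground_mask(depth):
    """True where a pixel is nearer than the median depth of the image."""
    threshold = median(d for row in depth for d in row)
    return [[d < threshold for d in row] for row in depth]


def median_depth_in_box(depth, box):
    x0, y0, x1, y1 = box
    return median(depth[y][x] for y in range(y0, y1) for x in range(x0, x1))


def is_in_front(depth, box_a, box_b):
    """Ordering query: is the content of box_a nearer than the content of box_b?"""
    return median_depth_in_box(depth, box_a) < median_depth_in_box(depth, box_b)


# Toy depth map: a near subject on the left, a far background on the right.
depth = [[1.0, 1.0, 9.0, 9.0],
         [1.0, 1.0, 9.0, 9.0]]

histogram = depth_histogram(depth)
mask = foreground_mask(depth)
subject_first = is_in_front(depth, (0, 0, 2, 2), (2, 0, 4, 2))
print(histogram, mask, subject_first)
```

A bimodal histogram like this one suggests a subject against a distant background, whereas a histogram concentrated in one bin would suggest a flat or uniformly distant scene.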
Doing It on Mixpeek
A depth extractor runs at ingest and attaches a per-object depth signal, so retrieval and filtering can use scene geometry instead of only appearance.
from mixpeek import Mixpeek

mx = Mixpeek(api_key="API_KEY")

# Depth runs alongside the visual embedding at ingest
mx.collections.create(
    namespace_id="my-namespace",
    collection_name="my-collection",
    source={"type": "bucket", "bucket_ids": ["bkt_your_bucket"]},
    feature_extractor={"feature_extractor_name": "visual_embedding", "version": "v1"},
)

# An agent retrieves by appearance, then filters by framing/geometry
results = mx.retrievers.execute(
    retriever_id="your-retriever-id",
    query="analyst presenting at a whiteboard",
)
The visual stage recalls what *appears*; the depth signal lets the agent reason about how the scene is *arranged*. For the appearance side of that pairing see contrastive learning, and for turning geometry into a same/different decision see instance-level visual matching.