Monocular Depth Estimation: How Models Infer 3D From a Single Image

A vendor-neutral guide to inferring depth from one 2D image — why the problem is ill-posed, the pictorial cues models learn, relative vs metric depth and scale ambiguity, how self-supervision trains depth without ground truth, and why a depth channel lets an agent reason about scene geometry it otherwise can't see.

An Impossible Problem That Works Anyway

A single photograph is a projection: the 3D world is flattened onto a 2D grid, and the third dimension — how far each pixel is from the camera — is thrown away. Recovering it from one image is formally ill-posed. Infinitely many 3D scenes project to the exact same picture: a small object up close and a large object far away can be pixel-for-pixel identical. With one eye closed and one still frame, geometry alone cannot decide.
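
The ambiguity is easy to reproduce numerically. The following is a minimal sketch, assuming an ideal pinhole camera with an illustrative focal length of 1000 pixels; the names `project` and `scale_scene` are hypothetical helpers introduced here, not part of any library. It projects a small square close to the camera and a ten-times-larger square ten times farther away, and checks that both produce the same image coordinates.

```python
import math


def project(point, focal_px=1000.0):
    """Pinhole projection of a camera-frame point (X, Y, Z) to image offsets (u, v)."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera (Z > 0)")
    return (focal_px * x / z, focal_px * y / z)


def scale_scene(points, s):
    """Scale every 3D point by s: objects get s times larger and s times farther away."""
    return [(s * x, s * y, s * z) for (x, y, z) in points]


# Corners of a 0.2 m square, 1 m from the camera.
near_small = [(-0.1, -0.1, 1.0), (0.1, -0.1, 1.0), (0.1, 0.1, 1.0), (-0.1, 0.1, 1.0)]
# The same shape scaled by 10: a 2 m square, 10 m from the camera.
far_large = scale_scene(near_small, 10.0)

image_near = [project(p) for p in near_small]
image_far = [project(p) for p in far_large]

# Both scenes land on the same pixels (up to floating-point rounding).
same_image = all(
    math.isclose(a, b, abs_tol=1e-9)
    for pa, pb in zip(image_near, image_far)
    for a, b in zip(pa, pb)
)
print(same_image)
```

Because the scale factor cancels in the division by Z, any value of `s` gives the same result, which is why geometry alone cannot pick one scene over the other.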

Yet humans read depth from a flat photo instantly, and so do modern models. They do it the same way: not by measuring, but by exploiting learned priors about how the world usually looks. Monocular depth estimation is the task of predicting a depth value per pixel from a single RGB image, and understanding it means understanding which cues carry the signal and what the output can and cannot promise.

The Cues: What "Depth From One Image" Actually Uses

There is no triangulation available (that needs two viewpoints). Instead the signal comes from pictorial depth cues — the same ones painters have used for centuries, now learned statistically:

Perspective and vanishing points — parallel lines (roads, hallways, rooftops) converge with distance.

Relative and familiar size — a car rendered 20 pixels tall is far; the model has learned the real-world size of cars, faces, doorways.

Occlusion — if object A blocks object B, A is nearer. This gives *ordering* even when it gives no distance.

Texture gradient — repeated texture (grass, gravel, bricks) gets finer and denser toward the horizon.

Shading and shadow — the direction of light and cast shadows imply surface orientation and contact with the ground.

Aerial perspective — distant regions are hazier and lower-contrast.

A depth model is, in effect, a machine that has seen enough of the world to invert these cues. This is also why it fails on inputs that violate the priors — forced-perspective photographs, mirrors, printed images of scenes, or unfamiliar object scales — where the learned assumptions point the wrong way.

Relative vs Metric Depth, and the Scale Ambiguity

The single most important distinction in depth estimation is relative vs metric, and it comes straight from the ill-posedness above.

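Relative (up-to-scale) depth ranks pixels: this is nearer than that. In the strict up-to-scale case the ordering and the *ratios* are right, but there is no unit; many general-purpose models go one step further and predict depth only up to an unknown scale and shift, in which case only the ordering is reliable until the output is aligned. The same relative-depth map is consistent with a dollhouse and a real house: the model cannot recover absolute scale from one image because nothing in the image fixes the unit. This is scale ambiguity, and it is not a bug; it is a mathematical property of monocular projection.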

Metric depth assigns real distances (this pixel is 4.2 meters away). Getting metric output requires extra information the image alone lacks: camera intrinsics (focal length, sensor size), a known reference object, or training on a specific camera rig. Metric models trade generality for those constraints and degrade when the camera changes.

Most general-purpose depth foundation models predict relative depth on purpose, because it transfers across any camera and any scene. If your task needs meters, you either calibrate with intrinsics or anchor on a known-size object; if it needs "foreground vs background" or "which shot is wider," relative depth is already enough.
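
Anchoring on a known-size object can be sketched in a few lines. This is an illustration under stated assumptions, not a production calibration routine: it assumes a pinhole camera whose focal length in pixels is known, an object of known real height that is roughly parallel to the image plane, and a depth map that is truly up-to-scale (no unknown shift). The helper names `distance_from_known_size` and `anchor_to_metric` are hypothetical.

```python
def distance_from_known_size(focal_px, real_height_m, pixel_height):
    """Similar triangles under a pinhole model: Z = f * H / h."""
    if pixel_height <= 0:
        raise ValueError("pixel_height must be positive")
    return focal_px * real_height_m / pixel_height


def anchor_to_metric(relative_depth, anchor_rc, anchor_distance_m):
    """Rescale an up-to-scale depth map so the anchor pixel has the known distance."""
    row, col = anchor_rc
    scale = anchor_distance_m / relative_depth[row][col]
    return [[scale * d for d in r] for r in relative_depth]


# A 2.0 m tall doorway that spans 400 pixels with a 1000-pixel focal length.
door_distance = distance_from_known_size(1000.0, 2.0, 400.0)  # 5.0 meters

# A tiny unitless depth map; the doorway sits at row 0, column 1.
relative = [[0.5, 1.0],
            [2.0, 4.0]]
metric = anchor_to_metric(relative, (0, 1), door_distance)
print(metric)  # [[2.5, 5.0], [10.0, 20.0]]
```

If the model's output is only defined up to scale and shift, a single anchor is not enough; at least two reference distances are needed to solve for both unknowns.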

Training Without a Tape Measure: Self-Supervision

Dense per-pixel depth ground truth is expensive (LiDAR, structured light) and scarce, which for years capped model quality. Two ideas broke the ceiling.

Self-supervised depth from geometry. Instead of labels, use the geometry of stereo pairs or video. If you predict depth for a frame and also predict the camera's motion to the next frame, you can *reproject* one frame into the other and compare. The training signal is the photometric reconstruction loss (pixels should land where they actually appear in the other view), so depth is learned purely from raw video or stereo, with no annotations. The catch is that with monocular video the same scale ambiguity reappears: depth and camera translation are only recoverable up to a shared scale, which is again why the result is relative. A calibrated stereo pair with a known baseline is the exception, because the baseline fixes the unit.
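
The reprojection idea can be sketched on a toy example. The code below is a simplified illustration, not a training loop: it assumes a pinhole camera, a pure translation between the two frames (no rotation), nearest-neighbor sampling, and a plain mean absolute difference, whereas real systems typically use a full pose, bilinear sampling, and richer image losses. The functions `reproject` and `photometric_loss` and all numeric values are hypothetical.

```python
def reproject(pixel, depth, translation, focal_px, center):
    """Map a pixel in frame A to frame B using A's depth and the camera translation.

    `translation` is camera B's position expressed in camera A's frame.
    """
    u, v = pixel
    cx, cy = center
    # Back-project the pixel to a 3D point in camera A's frame.
    x = (u - cx) / focal_px * depth
    y = (v - cy) / focal_px * depth
    # Express the point in camera B's frame (pure translation, no rotation).
    tx, ty, tz = translation
    xb, yb, zb = x - tx, y - ty, depth - tz
    # Project into frame B.
    return (focal_px * xb / zb + cx, focal_px * yb / zb + cy)


def photometric_loss(image_a, image_b, depth_a, translation, focal_px, center):
    """Mean absolute difference between A's pixels and where they land in B."""
    height, width = len(image_a), len(image_a[0])
    total, count = 0.0, 0
    for v in range(height):
        for u in range(width):
            ub, vb = reproject((u, v), depth_a[v][u], translation, focal_px, center)
            col, row = round(ub), round(vb)
            if 0 <= row < height and 0 <= col < width:  # skip pixels that leave the frame
                total += abs(image_a[v][u] - image_b[row][col])
                count += 1
    return total / count if count else 0.0


# A flat wall 2.0 units away; the camera moves 0.2 units to the right,
# so with focal_px=10 the image content shifts left by exactly one pixel.
image_a = [[10, 20, 30, 40], [10, 20, 30, 40]]
image_b = [[20, 30, 40, 0], [20, 30, 40, 0]]
move = (0.2, 0.0, 0.0)
cam = dict(focal_px=10.0, center=(0.0, 0.0))

right_depth = [[2.0] * 4 for _ in range(2)]
wrong_depth = [[1.0] * 4 for _ in range(2)]
loss_right = photometric_loss(image_a, image_b, right_depth, move, **cam)
loss_wrong = photometric_loss(image_a, image_b, wrong_depth, move, **cam)
print(loss_right, loss_wrong)  # the correct depth gives the lower loss

# Scale ambiguity: doubling both depth and translation lands on the same pixel.
p1 = reproject((400, 300), 4.0, (0.2, 0.0, 0.1), 500.0, (320.0, 240.0))
p2 = reproject((400, 300), 8.0, (0.4, 0.0, 0.2), 500.0, (320.0, 240.0))
```

The last two lines show why the loss cannot fix the unit when the translation is itself predicted: `p1` and `p2` agree, so a depth map and a camera motion that are both scaled by the same factor explain the video equally well.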

Large-scale pretraining (the Depth Anything approach). The other lever is data volume: train a strong vision backbone (a ViT) on a massive mix of labeled and pseudo-labeled images — millions of scenes — so the model's priors cover almost anything it will meet at inference. This is what makes recent depth foundation models robust zero-shot across indoor, outdoor, aerial, and close-up imagery. It is the same "scale beats cleverness" story as CLIP for embeddings, applied to geometry.

Why an Agent Wants a Depth Channel

Depth turns a flat image into a scene with structure, and that unlocks queries and filters that RGB embeddings answer poorly:

1. Spatial reasoning: "the person in the foreground," "objects on the table vs on the wall," "is X in front of or behind Y." Occlusion ordering and relative depth answer these directly; a 2D detector only gives boxes.

2. Shot and framing filters: wide establishing shots, tight close-ups, and medium shots have distinct depth-histogram signatures, so an agent can retrieve by framing without a human tagging it.

3. Foreground/background separation: depth is a cheap, model-driven matte for pulling the subject out of clutter before embedding or captioning, which sharpens downstream retrieval.

4. Geometry sanity for other perception: depth cross-checks detections and helps reject impossible configurations (a "small" car that depth says is actually near is a misdetection).

The agent never consumes the depth map as pixels; it consumes derived signals — an ordering, a histogram, a foreground mask — and asks questions in those terms, the same pattern as any other perception modality.
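
The three derived signals named above can each be computed from a depth map with very little code. What follows is a minimal sketch using only the Python standard library; it assumes larger values mean farther away (a model that outputs inverse depth would need to be flipped first), and the function names, the median threshold, and the bin count are illustrative choices rather than a fixed recipe.

```python
from statistics import median


def flatten(depth):
    """Return all depth values of a 2D map as a flat list."""
    return [d for row in depth for d in row]


def foreground_mask(depth):
    """True where a pixel is nearer than the median depth of the image."""
    threshold = median(flatten(depth))
    return [[d < threshold for d in row] for row in depth]


def depth_histogram(depth, bins=4):
    """Normalized histogram of depth values between the map's min and max."""
    values = flatten(depth)
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid dividing by zero on a constant map
    counts = [0] * bins
    for d in values:
        index = min(int((d - lo) / span * bins), bins - 1)
        counts[index] += 1
    return [c / len(values) for c in counts]


def is_in_front(depth, box_a, box_b):
    """True if the median depth in box_a is smaller than in box_b.

    Boxes are (row0, col0, row1, col1) with exclusive end indices.
    """
    def box_median(box):
        r0, c0, r1, c1 = box
        return median(d for row in depth[r0:r1] for d in row[c0:c1])
    return box_median(box_a) < box_median(box_b)


# A toy scene: a near subject on the left, a far wall on the right.
depth = [[1.0, 1.0, 8.0, 9.0],
         [1.0, 2.0, 8.0, 9.0]]

mask = foreground_mask(depth)                              # left half is foreground
hist = depth_histogram(depth)                              # two peaks: near and far
subject_first = is_in_front(depth, (0, 0, 2, 2), (0, 2, 2, 4))
print(mask, hist, subject_first)
```

Using the median inside each box rather than a single pixel makes the ordering check less sensitive to noisy depth values at object edges.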

Doing It on Mixpeek

A depth extractor runs at ingest and attaches a per-object depth signal, so retrieval and filtering can use scene geometry instead of only appearance.

```python
from mixpeek import Mixpeek

mx = Mixpeek(api_key="API_KEY")

# Depth runs alongside the visual embedding at ingest.
# Only the visual embedding extractor is configured in this snippet;
# the depth extractor's name is not given in this guide.
mx.collections.create(
    namespace_id="my-namespace",
    collection_name="my-collection",
    source={"type": "bucket", "bucket_ids": ["bkt_your_bucket"]},
    feature_extractor={"feature_extractor_name": "visual_embedding", "version": "v1"},
)

# An agent retrieves by appearance, then filters by framing/geometry.
# The framing/geometry filter itself is not shown in this snippet.
results = mx.retrievers.execute(
    retriever_id="your-retriever-id",
    query="analyst presenting at a whiteboard",
)
```

The visual stage recalls what *appears*; the depth signal lets the agent reason about how the scene is *arranged*. For the appearance side of that pairing see contrastive learning, and for turning geometry into a same/different decision see instance-level visual matching.

Related guides

Multi-Object Tracking: How Agents Follow Objects Across Video Frames

Face Recognition and Identity Clustering: How Agents Recognize and Group People in Video

Instance-Level Visual Matching: Finding the Same Object, Not Just Similar Ones