Face Recognition and Identity Clustering: How Agents Recognize and Group People in Video

The Question an Agent Is Actually Asking

Two requests sound similar but are mechanically different. "Is this the same person in both clips?" is verification — a 1:1 comparison. "Who is this, and where else do they appear?" is identification — a 1:N search against everyone the system has ever seen. An agent indexing a video archive needs both, and it needs them over footage where the *same* face shows up in wildly different conditions: a three-quarter profile in shadow at frame 1,200, a frontal close-up under studio light at frame 40,000, the same person five years older in a different clip.

No bounding box solves this. A face detector tells you *where* a face is, not *whose* it is. Turning pixels into a stable identity — one that survives pose, lighting, and age — is a chain of four stages: detect → align → embed → match/cluster. The first two normalize the input, the third produces a comparable vector, and the fourth is where "recognition" actually happens. Most of the difficulty, and almost all of the interesting math, lives in the last two.

Stage 1 — Detection: Finding Faces and Landmarks

Detection localizes faces and, critically, returns a handful of facial landmarks — typically five points: both eye centers, the nose tip, and the two mouth corners. Modern detectors (the RetinaFace family and its successors) predict the box and the landmarks jointly in a single forward pass, because the landmarks are what the next stage needs.

The non-obvious design choice is that detection should *over-* rather than under-trigger on small and rotated faces, then let downstream quality filtering discard the bad ones. A face missed at detection is gone forever; a low-quality detection can be filtered later. For video specifically, detection is paired with a tracker so that a face that persists across hundreds of frames becomes a single *track* rather than hundreds of independent detections — this matters enormously for clustering (below).
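To make the idea of a track concrete, here is a minimal sketch of one simple way to link per-frame detections: greedy matching on box overlap (IoU). This is an illustrative assumption, not the tracker any particular system uses; the function names, the `(x1, y1, x2, y2)` box format, and the `min_iou` value are all hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_tracks(frames, min_iou=0.5):
    """Greedily link per-frame face boxes into tracks.

    frames: one list of boxes per video frame.
    Returns a list of tracks, each a list of (frame_index, box).
    """
    finished = []
    active = []  # tracks that received a box in the previous frame
    for t, boxes in enumerate(frames):
        unmatched = list(boxes)
        still_active = []
        for track in active:
            last_box = track[-1][1]
            best = max(unmatched, key=lambda b: iou(last_box, b), default=None)
            if best is not None and iou(last_box, best) >= min_iou:
                track.append((t, best))
                unmatched.remove(best)
                still_active.append(track)
            else:
                finished.append(track)  # the face left the frame or was missed
        # every detection that matched no existing track starts a new one
        still_active.extend([(t, b)] for b in unmatched)
        active = still_active
    return finished + active
```

A production tracker would usually also tolerate a few missed frames and use appearance cues, but even this version turns hundreds of detections of one face into a single unit for the later stages.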

Stage 2 — Alignment: Why a Canonical Pose Doubles Accuracy

You could crop the detected box and embed it directly. You should not. The embedding model performs far better if every face arrives in the same canonical geometry — eyes on a fixed horizontal line, nose centered, a standard inter-ocular distance. Alignment uses the five landmarks to compute a similarity transform (rotation, scale, translation) that warps the detected face onto that template.

The reason this helps is that it removes nuisance variation the embedder would otherwise have to learn to ignore. Every parameter the model spends becoming invariant to in-plane rotation is a parameter not spent on identity. Alignment hands the model a pre-normalized input so its entire capacity goes toward the thing that matters. In practice, alignment is one of the highest-ROI steps in the whole pipeline — cheap to compute, large effect on accuracy.
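The similarity transform itself has a compact closed form. The sketch below fits it by least squares, treating each 2D point as a complex number so that rotation plus uniform scale becomes a single complex multiplier. The template coordinates are illustrative values for a 112x112 crop, not taken from any specific model, and the function names are hypothetical.

```python
# Hypothetical canonical landmark positions in a 112x112 crop:
# left eye, right eye, nose tip, left mouth corner, right mouth corner.
TEMPLATE = [(38.0, 52.0), (74.0, 52.0), (56.0, 72.0), (42.0, 92.0), (70.0, 92.0)]


def fit_similarity(src, dst):
    """Least-squares similarity transform mapping src points onto dst points.

    A 2D similarity without reflection is z -> a*z + b over complex numbers:
    abs(a) is the uniform scale, the phase of a is the rotation angle,
    and b is the translation.
    """
    p = [complex(x, y) for x, y in src]
    q = [complex(x, y) for x, y in dst]
    p_mean = sum(p) / len(p)
    q_mean = sum(q) / len(q)
    num = sum((qi - q_mean) * (pi - p_mean).conjugate() for pi, qi in zip(p, q))
    den = sum(abs(pi - p_mean) ** 2 for pi in p)
    a = num / den
    b = q_mean - a * p_mean
    return a, b


def apply_similarity(a, b, point):
    """Map one (x, y) point through the fitted transform."""
    z = a * complex(*point) + b
    return (z.real, z.imag)
```

Calling `fit_similarity(detected_landmarks, TEMPLATE)` gives the transform that moves a detected face onto the template. In a real pipeline the same parameters would be written as a 2x3 matrix and handed to an image-warping routine such as OpenCV's `cv2.warpAffine` to produce the aligned crop.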

Stage 3 — The Embedding: Why Classification Is the Wrong Objective

Here is the central problem. You cannot deploy a classifier with one output neuron per person, because the set of people is open: the system must recognize identities it never saw in training. This is the open-set problem, and it rules out ordinary softmax classification as the deployment objective. A classification head can still be useful as a training device, as the margin-based loss described below shows, but it is discarded at inference.

The answer is metric learning: instead of predicting a label, learn a function that maps a face to a vector in a space where *distance encodes identity*. Same person → vectors close together; different people → vectors far apart. At inference you never classify; you embed two faces and compare their vectors. New identities need no retraining — they simply occupy new regions of the space.

The geometry of angular margin (ArcFace)

The hard part of metric learning is forcing the space to be *discriminative* enough. Early approaches used triplet loss (anchor, positive, negative) but were sensitive to how triplets were mined. The breakthrough that dominates modern face recognition is additive angular margin, popularized by ArcFace.

The idea is geometric. L2-normalize both the embeddings and the classifier weights so everything lives on a unit hypersphere; now the only thing that matters is the angle between an embedding and each class center. Standard softmax pushes an embedding toward its class center, but it stops as soon as the right class merely wins. ArcFace adds an angular margin m to the target class's angle *before* the softmax: the model is scored as if its embedding were m radians further from its own class than it really is, so to drive the loss down it must push the true angle a full margin tighter. The geometric effect is a deliberate gap between identities — intra-class angles shrink, inter-class angles widen, and the decision boundary gets a buffer instead of sitting flush against the data.
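A small numeric sketch can make the margin tangible. It computes the loss for a single sample from its cosines to each class center, replacing the target logit with s * cos(theta + m). The scale factor `s` is not mentioned above; it and the default values m = 0.5 and s = 64 are assumptions taken from commonly cited ArcFace settings. The sketch also skips the special handling real implementations add for the case where theta + m exceeds pi.

```python
import math


def arcface_loss(cosines, target, margin=0.5, scale=64.0):
    """Softmax cross-entropy over scaled cosines, with an additive angular
    margin applied to the target class only.

    cosines: cos(theta_j) between one normalized embedding and each
             normalized class center.
    target:  index of the true class.
    """
    logits = []
    for j, c in enumerate(cosines):
        c = max(-1.0, min(1.0, c))  # guard acos against rounding error
        if j == target:
            theta = math.acos(c)
            c = math.cos(theta + margin)  # score as if `margin` radians further away
        logits.append(scale * c)
    # numerically stable log-sum-exp
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[target]


cosines = [0.8, 0.3, 0.1]  # the embedding is already closest to class 0
plain = arcface_loss(cosines, target=0, margin=0.0)
with_margin = arcface_loss(cosines, target=0, margin=0.5)
```

With the same embedding, `with_margin` is larger than `plain`: the sample already "wins" under ordinary softmax, yet the margin still leaves loss on the table, which is what keeps pushing the true angle tighter.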

The output is an L2-normalized embedding (commonly 512-dim). Because everything is normalized, cosine similarity (equivalently, angular distance) is the only comparison you need: two faces are "the same" if their cosine is above a threshold.
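As a minimal sketch of that comparison, assuming embeddings arrive as plain lists of floats (the helper names and any threshold value are hypothetical):

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity; for unit vectors this is just the dot product."""
    u, v = l2_normalize(u), l2_normalize(v)
    return sum(a * b for a, b in zip(u, v))


def same_person(u, v, tau):
    """Verification decision: accept when the cosine clears the threshold."""
    return cosine(u, v) >= tau
```

For unit vectors the squared Euclidean distance equals 2 - 2 * cosine, so ranking by cosine, by angle, or by Euclidean distance gives the same order. That is why the choice among them rarely matters once embeddings are normalized.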

Verification vs Identification, and the Threshold That Governs Both

With embeddings in hand, verification is a single cosine comparison against a threshold τ. Everything rides on τ, and it encodes a tradeoff with two error types:

False accept (counted by the false accept rate, FAR): two different people scored as the same. The dangerous error for access control.

False reject (counted by the false reject rate, FRR): the same person scored as different. The annoying error that fragments an identity.

Sweeping τ traces a curve between these; you pick the operating point your application can tolerate (a building door wants very low FAR; a "group my vacation photos" feature tolerates higher FAR for lower FRR). Identification is the same comparison run 1:N — embed the query face, score it against every gallery identity, return the nearest above τ (or "unknown" if none clears the bar). At archive scale that 1:N search is just approximate nearest-neighbor search over the embedding index, the same machinery used for any other vector retrieval.
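The two error rates and the 1:N decision can be sketched in a few lines. This assumes you already have cosine scores for labeled same-person ("genuine") and different-person ("impostor") pairs, and a small in-memory gallery of unit-length embeddings; all names and numbers here are illustrative, and a brute-force loop stands in for the approximate nearest-neighbor index.

```python
def far_frr(genuine, impostor, tau):
    """Error rates at threshold tau.

    genuine:  cosine scores of same-person pairs.
    impostor: cosine scores of different-person pairs.
    """
    far = sum(s >= tau for s in impostor) / len(impostor)  # wrongly accepted
    frr = sum(s < tau for s in genuine) / len(genuine)     # wrongly rejected
    return far, frr


def identify(query, gallery, tau):
    """1:N search: return the best-matching gallery name, or None for 'unknown'.

    query:   unit-length embedding.
    gallery: dict mapping identity name -> unit-length embedding.
    """
    best_name, best_score = None, tau
    for name, emb in gallery.items():
        score = sum(a * b for a, b in zip(query, emb))
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


genuine = [0.9, 0.8, 0.6, 0.4]
impostor = [0.5, 0.3, 0.2, 0.1]
for tau in (0.45, 0.7):
    print(tau, far_frr(genuine, impostor, tau))
```

On this toy data, raising tau from 0.45 to 0.7 drops FAR from 0.25 to 0.0 while FRR rises from 0.25 to 0.5, which is the tradeoff the operating point has to settle.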

Stage 4 — Identity Clustering: Recognition Without a Gallery

The previous section assumed a labeled gallery ("this vector is Alice"). The harder, more common situation for an archive is that you have no labels at all — thousands of face embeddings and the question "how many distinct people are here, and which detections belong to each?" This is unsupervised identity clustering, and it has a property that breaks naive methods: *you do not know K, the number of identities, in advance.*

That rules out anything that needs K up front (plain k-means). What works are threshold-driven methods:

Agglomerative / connected-components clustering — link two faces if their cosine exceeds τ, then take connected components (or merge greedily by nearest pair). Simple, K-free, and the same τ tradeoff reappears as a clustering tradeoff: too tight over-splits one person into several identities; too loose merges distinct people into one.

Graph / rank-order methods — build a k-NN graph over embeddings and cut it into communities, which is more robust to the density differences between a person who appears in 2 frames and one who appears in 2,000.
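The connected-components variant is short enough to sketch in full. This version assumes unit-length embeddings in a plain list and compares every pair, which is quadratic and only suitable for small sets; the function name is hypothetical, and a large archive would get candidate pairs from a nearest-neighbor index instead.

```python
def cluster_by_threshold(embeddings, tau):
    """Link faces whose cosine is >= tau and return connected components.

    embeddings: list of unit-length vectors.
    Returns one integer cluster label per embedding; K is never specified.
    """
    n = len(embeddings)
    parent = list(range(n))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            score = sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
            if score >= tau:
                parent[find(i)] = find(j)  # merge the two components

    # relabel roots as 0, 1, 2, ... in order of first appearance
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


faces = [[1.0, 0.0], [0.96, 0.28], [0.0, 1.0], [0.28, 0.96]]
print(cluster_by_threshold(faces, tau=0.9))  # [0, 0, 1, 1]
print(cluster_by_threshold(faces, tau=0.5))  # [0, 0, 0, 0]
```

The second call shows the failure mode of a loose threshold: one moderately similar pair (cosine about 0.54) is enough to chain two identities into a single component.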

Two engineering moves make clustering far more reliable than the raw algorithm suggests. First, aggregate by track: pool all faces in one video track into a single high-quality representative (a mean or medoid embedding) before clustering, so a 300-frame appearance contributes one clean point instead of 300 noisy ones. Second, quality-gate the inputs: drop tiny, blurry, or extreme-profile faces before they pollute a cluster, since a single bad embedding bridging two people can merge them.
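Both moves fit in one small function. The sketch below assumes each face in a track is a dict with hypothetical keys `embedding`, `size` (face height in pixels), and `sharpness` (a 0 to 1 score from some blur measure); the gate thresholds are placeholders, not recommended values.

```python
import math


def track_representative(faces, min_size=40, min_sharpness=0.3):
    """Pool one track into a single unit-length embedding.

    Faces that fail the quality gate are dropped first; the survivors are
    averaged and the mean is renormalized so cosine comparisons still apply.
    Returns None when no face in the track passes the gate.
    """
    kept = [
        f["embedding"]
        for f in faces
        if f["size"] >= min_size and f["sharpness"] >= min_sharpness
    ]
    if not kept:
        return None
    dim = len(kept[0])
    mean = [sum(e[d] for e in kept) / len(kept) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm > 0 else None
```

The mean is the simplest choice. A medoid (the member embedding closest to all the others) is a reasonable alternative when a track may contain a few badly wrong embeddings, since it cannot be pulled toward them.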

Where It Genuinely Breaks

Honesty about failure modes is part of using this responsibly:

Pose, illumination, age, and occlusion are the classic axes of variation; alignment and margin-trained embeddings absorb a lot but not all, and large age gaps remain hard.

Demographic bias is real and measurable: error rates can differ across skin tone, age, and gender depending on training data, which is why FAR/FRR must be reported *per subgroup*, not just in aggregate. A system tuned to a global threshold can be far more error-prone for under-represented groups.

The threshold is not universal. τ that works for one archive's conditions (lighting, resolution, demographics) may not transfer; it should be calibrated on held-out data resembling deployment.
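Per-subgroup reporting is mostly bookkeeping. The sketch below assumes a held-out set of scored pairs, each tagged with a subgroup label and a ground-truth same/different flag; the labels "A" and "B" and the scores are made up for illustration.

```python
from collections import defaultdict


def error_rates_by_group(pairs, tau):
    """Compute (FAR, FRR) separately for each subgroup at threshold tau.

    pairs: iterable of (group, cosine_score, is_same_person).
    A rate is None when the group has no pairs of the relevant kind.
    """
    counts = defaultdict(lambda: {"fa": 0, "imp": 0, "fr": 0, "gen": 0})
    for group, score, same in pairs:
        c = counts[group]
        if same:
            c["gen"] += 1
            c["fr"] += score < tau   # genuine pair rejected
        else:
            c["imp"] += 1
            c["fa"] += score >= tau  # impostor pair accepted
    return {
        g: (
            c["fa"] / c["imp"] if c["imp"] else None,
            c["fr"] / c["gen"] if c["gen"] else None,
        )
        for g, c in counts.items()
    }


pairs = [
    ("A", 0.9, True), ("A", 0.8, True), ("A", 0.2, False), ("A", 0.3, False),
    ("B", 0.9, True), ("B", 0.5, True), ("B", 0.7, False), ("B", 0.3, False),
]
print(error_rates_by_group(pairs, tau=0.6))
```

In this toy set the aggregate rates at tau = 0.6 are 0.25 for both FAR and FRR, which hides the fact that group A has no errors and group B carries all of them.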

What This Unlocks for an Agent

Identity is a *join key* across an archive. Once faces are embedded and clustered, an agent can do things no transcript or tag search can:

1. Identity-conditioned retrieval: "find every scene where this person appears," answered by querying the face index with one example image.
2. Cast/presence analytics: "who appears together, and for how long," from co-occurrence of clusters across tracks.
3. Cross-clip linking: connect the same person across separate videos that share no metadata.
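The presence analytics reduce to interval arithmetic once every track carries a cluster label. A minimal sketch, assuming tracks are available as `(video_id, cluster_id, start_seconds, end_seconds)` tuples (a hypothetical layout, not a specific product's schema) and that tracks of the same cluster do not overlap each other within a video:

```python
from collections import defaultdict
from itertools import combinations


def co_occurrence_seconds(tracks):
    """Total time each pair of identity clusters is on screen together.

    tracks: list of (video_id, cluster_id, start_s, end_s).
    Returns {(cluster_a, cluster_b): seconds}, with cluster_a < cluster_b.
    """
    by_video = defaultdict(list)
    for video, cluster, start, end in tracks:
        by_video[video].append((cluster, start, end))

    together = defaultdict(float)
    for items in by_video.values():
        for (c1, s1, e1), (c2, s2, e2) in combinations(items, 2):
            if c1 == c2:
                continue
            overlap = min(e1, e2) - max(s1, s2)  # positive only if intervals intersect
            if overlap > 0:
                together[tuple(sorted((c1, c2)))] += overlap
    return dict(together)


tracks = [
    ("v1", 0, 0.0, 10.0), ("v1", 1, 5.0, 12.0), ("v1", 2, 20.0, 25.0),
    ("v2", 0, 0.0, 4.0), ("v2", 1, 1.0, 3.0),
]
print(co_occurrence_seconds(tracks))  # {(0, 1): 7.0}
```

Because cluster 0 and cluster 1 are matched across `v1` and `v2` purely by identity, the same structure also covers cross-clip linking: the cluster id is the join key.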

The agent never sees raw pixels for this; it sees a searchable space of identities and asks questions in that space — the same pattern as any other modality, with a face embedding as the query vector.

Doing It on Mixpeek

Mixpeek's face-identity extractor runs the full chain — detect → align → embed → cluster — at ingest, so each video object carries identity clusters you can search. An agent retrieves by identity with a single example face:

from mixpeek import Mixpeek

mx = Mixpeek(api_key="API_KEY")

# At ingest, the face_identity feature detects, embeds, and clusters faces per object
mx.collections.create(
    namespace_id="my-namespace",
    collection_name="my-collection",
    source={"type": "bucket", "bucket_ids": ["bkt_your_bucket"]},
    feature_extractor={"feature_extractor_name": "face_identity", "version": "v1"},
)

# An agent asks: every scene where THIS person appears
results = mx.retrievers.execute(
    retriever_id="face-retriever",
    query={"image_url": "https://example.com/person-of-interest.jpg"},
)

The retrieval surface is the same MCP tool an agent calls for any other modality — the face embedding is just the query vector, and the timestamped clusters are the result. For the broader pattern of giving agents a searchable perception layer, see multimodal perception for AI agents and agentic retrieval.
