Detection Sees a Frame; Tracking Sees Time
An object detector answers "what is in *this* frame, and where." Run it on every frame of a video and you get thousands of independent boxes with no notion that the person in frame 100 is the *same* person as in frame 101. Multi-object tracking (MOT) adds that missing dimension: it assigns each physical object a stable identity that persists across frames, turning a pile of per-frame boxes into a handful of tracks — one continuous trajectory per object.
That identity is what makes video *searchable by entity* rather than by frame. "How long was this person on screen," "count the unique vehicles," "show every clip where object #7 appears," "did A and B ever appear together" — all require knowing that detections across time belong to the same thing. Almost every production tracker follows the same paradigm, tracking-by-detection: a detector proposes boxes each frame, and a separate, cheap association step stitches them into tracks. The detector does the seeing; the tracker does the bookkeeping over time, and the bookkeeping is where the interesting problems live.
The Loop: Predict, Match, Update
For each new frame, a tracker runs a three-step cycle against its set of currently-active tracks:
1. Predict where each existing track *should* be in this frame, from its motion so far.
2. Match the new detections to those predictions.
3. Update matched tracks with their new box, start tracks for unmatched detections, and retire tracks that have gone unmatched too long.
Each step has a canonical solution, and understanding them is understanding tracking.
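The bookkeeping in the third step of the loop can be sketched in a few lines. This is a minimal illustration, not any particular tracker's implementation: `Track`, `TrackManager`, and the `max_misses` budget are hypothetical names, and the `matches` argument is assumed to come from whatever association step runs before it.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Track:
    track_id: int
    box: tuple          # (x1, y1, x2, y2)
    misses: int = 0     # consecutive frames without a matched detection


class TrackManager:
    """Hypothetical helper that applies the update step of the loop."""

    def __init__(self, max_misses=30):
        self.tracks = []
        self.max_misses = max_misses   # assumed retirement budget, in frames
        self._ids = count(1)

    def step(self, detections, matches):
        """matches: (track_index, detection_index) pairs from the match step."""
        matched_tracks = {t for t, _ in matches}
        matched_dets = {d for _, d in matches}

        # Matched tracks take the new box and reset their miss counter.
        for t, d in matches:
            self.tracks[t].box = detections[d]
            self.tracks[t].misses = 0

        # Unmatched tracks age by one frame.
        for i, track in enumerate(self.tracks):
            if i not in matched_tracks:
                track.misses += 1

        # Retire tracks that have gone unmatched for too long.
        self.tracks = [t for t in self.tracks if t.misses <= self.max_misses]

        # Every unmatched detection starts a new track with a fresh ID.
        for i, box in enumerate(detections):
            if i not in matched_dets:
                self.tracks.append(Track(next(self._ids), box))
        return self.tracks
```

Real trackers usually also require a new track to be confirmed over a few frames before reporting it; that is left out here to keep the sketch short.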
Step 1 — Motion Prediction with a Kalman Filter
Between two frames (tens of milliseconds), an object moves a little and roughly linearly. A Kalman filter models each track's state — position, size, and their velocities — and predicts the next box from constant-velocity motion. It also carries an *uncertainty* that grows while a track goes unseen and shrinks when it's confirmed by a detection.
Two reasons this matters. First, the prediction gives association a strong prior: you're not matching a detection to a stale last-known box, you're matching it to where the object *should* be now, which resolves most crossings and near-misses. Second, the Kalman prediction lets a track coast through a few frames of missed detection (brief occlusion, a blurred frame) without dying: it keeps predicting until either a detection re-confirms it or it has gone unmatched for too long, which in practice is usually a fixed frame budget rather than a direct reading of the uncertainty.
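To make the predict/confirm behaviour concrete, here is a minimal constant-velocity Kalman filter for a single coordinate (say, the box center's x position), written with NumPy. It is a sketch under stated assumptions: the noise matrices `Q` and `R` are illustrative values, the time step is one frame, and a real tracker would filter the full box state rather than one coordinate.

```python
import numpy as np

F = np.array([[1.0, 1.0],
              [0.0, 1.0]])     # constant velocity: x += v each frame
H = np.array([[1.0, 0.0]])     # the detector observes position only
Q = np.eye(2) * 0.01           # process noise (assumed value)
R = np.array([[1.0]])          # measurement noise (assumed value)


def predict(x, P):
    """Advance state x = [position, velocity] and covariance P by one frame."""
    return F @ x, F @ P @ F.T + Q


def update(x, P, z):
    """Correct the prediction with a measured position z (shape (1,))."""
    y = z - H @ x                      # innovation: measurement minus prediction
    S = H @ P @ H.T + R                # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)     # Kalman gain
    x = x + K @ y
    P = (np.eye(2) - K @ H) @ P
    return x, P


# Confirm the track for five frames, then coast for three with no detection.
x, P = np.array([0.0, 0.0]), np.eye(2) * 10.0
for z in [1.0, 2.0, 3.0, 4.0, 5.0]:
    x, P = predict(x, P)
    x, P = update(x, P, np.array([z]))
for _ in range(3):
    x, P = predict(x, P)               # position variance P[0, 0] grows each frame
```

Calling `predict` without `update` is exactly the coasting described above: the position keeps advancing at the estimated velocity while `P[0, 0]` widens, and the next `update` pulls it back down.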
Step 2 — Data Association: Matching Detections to Tracks
Now you have N predicted tracks and M new detections and must decide which detection belongs to which track. This is a bipartite assignment problem, solved in two parts.
First, a cost for every (track, detection) pair. The cheap, dominant signal is spatial overlap — IoU (intersection-over-union) between the predicted box and the detection: high overlap, low cost. Optionally you add an appearance cost (below) for robustness.
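The IoU cost can be written directly from its definition. The sketch below assumes boxes are `(x1, y1, x2, y2)` tuples in pixel coordinates; `iou` and `cost_matrix` are illustrative helper names.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cost_matrix(predicted_boxes, detections):
    """Rows are tracks, columns are detections; high overlap means low cost."""
    return [[1.0 - iou(p, d) for d in detections] for p in predicted_boxes]
```

A pair of identical boxes costs 0, and a pair with no overlap costs 1, which gives the gate in the next step a natural scale.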
Second, find the globally cheapest one-to-one matching. Greedily taking the best pair per track is fragile; the Hungarian algorithm finds the optimal assignment over the whole cost matrix in polynomial time, so a locally-tempting bad match can be rejected in favor of a better global solution. Pairs whose cost exceeds a gate (too far apart) are left unmatched: an unmatched detection can start a new track, and an unmatched track coasts on its prediction until it is retired.
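A sketch of the assignment step, using SciPy's `linear_sum_assignment`, which solves this same minimum-cost assignment problem. The `associate` helper and its `max_cost` gate of 0.7 are assumptions for illustration, and gating after the assignment (rather than masking the matrix beforehand) is a simplification.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def associate(cost, max_cost=0.7):
    """Optimal one-to-one matching over a 2-D (tracks x detections) cost matrix.

    Returns (matches, unmatched_track_indices, unmatched_detection_indices).
    """
    cost = np.asarray(cost, dtype=float)
    rows, cols = linear_sum_assignment(cost)

    # Keep only assigned pairs that pass the gate.
    matches = [(int(r), int(c)) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]
    matched_tracks = {r for r, _ in matches}
    matched_dets = {c for _, c in matches}

    unmatched_tracks = [r for r in range(cost.shape[0]) if r not in matched_tracks]
    unmatched_dets = [c for c in range(cost.shape[1]) if c not in matched_dets]
    return matches, unmatched_tracks, unmatched_dets


# Greedy would give track 0 its best detection (cost 0.1) and leave track 1
# with a 0.9 pair; the optimal assignment swaps them for a total of 0.35.
cost = [[0.10, 0.20],
        [0.15, 0.90]]
matches, lost, new = associate(cost)
```

In this example both tracks end up matched, whereas the greedy choice would have pushed track 1 past the gate and cost it its identity.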
Refining Step 2 with ByteTrack: Don't Throw Away Weak Detections
Older trackers kept only high-confidence detections and discarded low-confidence ones as noise. But a low-confidence box is often a *real* object that is briefly occluded, blurred, or partial, and those are exactly the frames where a track is most likely to be lost. ByteTrack's idea is to associate in two passes: first match high-confidence detections to tracks, then take the *leftover* tracks and try to match them against the low-confidence detections too. A weak detection that lines up with a coasting track's prediction is almost certainly that object, so recovering it keeps the identity alive through the hard moments. Weak detections that match no track are still dropped, so background noise never starts a new track. This one change, with no appearance model at all, closed most of the gap to far heavier methods, a reminder that association quality often beats detector strength.
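The two-pass idea can be sketched as follows. This is a simplified illustration of the scheme described above, not the reference ByteTrack code: the score thresholds (`high`, `low`) and IoU gates are illustrative values, and `iou`, `match`, and `two_pass_associate` are hypothetical helpers built on SciPy's `linear_sum_assignment`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match(track_boxes, det_boxes, min_iou):
    """Optimal IoU matching; returns (pairs, unmatched tracks, unmatched dets)."""
    if not track_boxes or not det_boxes:
        return [], list(range(len(track_boxes))), list(range(len(det_boxes)))
    overlap = np.array([[iou(t, d) for d in det_boxes] for t in track_boxes])
    rows, cols = linear_sum_assignment(1.0 - overlap)
    pairs = [(int(r), int(c)) for r, c in zip(rows, cols) if overlap[r, c] >= min_iou]
    used_t = {r for r, _ in pairs}
    used_d = {c for _, c in pairs}
    return (pairs,
            [t for t in range(len(track_boxes)) if t not in used_t],
            [d for d in range(len(det_boxes)) if d not in used_d])


def two_pass_associate(track_boxes, detections, high=0.6, low=0.1,
                       min_iou_high=0.3, min_iou_low=0.5):
    """detections: list of (box, score). Thresholds are illustrative."""
    strong = [i for i, (_, s) in enumerate(detections) if s >= high]
    weak = [i for i, (_, s) in enumerate(detections) if low <= s < high]

    # Pass 1: all tracks against high-confidence detections.
    pairs, left_tracks, left_strong = match(
        track_boxes, [detections[i][0] for i in strong], min_iou_high)
    matches = [(t, strong[d]) for t, d in pairs]

    # Pass 2: leftover tracks against low-confidence detections.
    pairs, still_lost, _ = match(
        [track_boxes[t] for t in left_tracks],
        [detections[i][0] for i in weak], min_iou_low)
    matches += [(left_tracks[t], weak[d]) for t, d in pairs]

    lost_tracks = [left_tracks[t] for t in still_lost]
    # Only unmatched strong detections may start new tracks; unmatched weak
    # detections are discarded as noise.
    new_track_dets = [strong[d] for d in left_strong]
    return matches, lost_tracks, new_track_dets
```

The second pass uses a stricter IoU gate here on the assumption that a weak detection should only be trusted when it agrees closely with the prediction; the exact values would need tuning per detector.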
Appearance Re-Identification: Surviving Long Occlusions
Motion + IoU handles frame-to-frame continuity, but it fails when an object *leaves and returns* — walks behind a pillar for two seconds, or exits and re-enters the shot. Position-based matching has no memory of what the object looked like, so it starts a new identity: an ID switch.
Re-identification (re-ID) fixes this with appearance. Each detection is embedded into a feature vector by a re-ID model (a metric-learned embedding, the same idea as face embeddings but for whole bodies or vehicles), and tracks remember a running appearance signature. When a new detection can't be matched by motion, the tracker compares its embedding to the signatures of recently-lost tracks; a close match *re-attaches* the old identity instead of minting a new one. This is what lets "person #3" survive a walk behind an obstacle — and it is why strong trackers pair a motion model with an embedding, one for short-term continuity and one for long-term identity.
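A minimal sketch of the re-attachment step, assuming embeddings are plain non-zero vectors produced by some re-ID model (not shown). The running signature is kept as an exponential moving average, and matching uses cosine similarity; `update_signature`, `reattach`, the `momentum` of 0.9, and the 0.7 similarity threshold are all illustrative choices.

```python
import numpy as np


def normalize(v):
    """Scale a non-zero vector to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)


def update_signature(signature, embedding, momentum=0.9):
    """Blend a new embedding into a track's running appearance signature."""
    return normalize(momentum * signature + (1.0 - momentum) * normalize(embedding))


def reattach(embedding, lost_signatures, min_similarity=0.7):
    """Return the ID of the most similar recently-lost track, or None.

    lost_signatures: dict mapping track_id -> unit-length signature.
    """
    query = normalize(embedding)
    best_id, best_sim = None, min_similarity
    for track_id, signature in lost_signatures.items():
        sim = float(query @ signature)      # cosine similarity of unit vectors
        if sim >= best_sim:
            best_id, best_sim = track_id, sim
    return best_id
```

If `reattach` returns `None`, the detection is treated as a new object and gets a fresh identity; otherwise the old track is revived with its original ID.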
The Metric That Actually Matters: ID Switches
Tracking is not judged only on per-frame box accuracy. The failure that breaks downstream reasoning is the ID switch: the same object being assigned a new identity, or two objects swapping identities when their paths cross. A tracker with great boxes but frequent ID switches is useless for "how long was X present," because X is fragmented into many short tracks. The standard metrics weight this differently: MOTA counts ID switches but is dominated by missed and false detections, IDF1 scores how consistently each object keeps a single identity, and HOTA balances detection accuracy against association accuracy. When you evaluate a tracker for an agent workload, identity stability (IDF1, or the association component of HOTA) is the number to watch.
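Counting ID switches is simple once each ground-truth object has been matched to a predicted track in every frame. The sketch below assumes that matching has already been done and is handed in as one dictionary per frame; it is a simplified illustration of the idea, not a full MOTA, IDF1, or HOTA implementation.

```python
def count_id_switches(frames):
    """Count how often a ground-truth object changes predicted track ID.

    frames: list of dicts, one per frame, mapping gt_id -> predicted track_id
            for every ground-truth object that was matched in that frame.
    """
    last_seen = {}      # gt_id -> most recent predicted track_id
    switches = 0
    for assignment in frames:
        for gt_id, pred_id in assignment.items():
            if gt_id in last_seen and last_seen[gt_id] != pred_id:
                switches += 1
            last_seen[gt_id] = pred_id
    return switches


# Objects A and B swap identities in the third frame: two switches.
crossing = [{"A": 1, "B": 2}, {"A": 1, "B": 2}, {"A": 2, "B": 1}, {"A": 2}]
# Object A is lost for a frame and comes back with a new ID: one switch.
occlusion = [{"A": 1}, {}, {"A": 7}]
```

Both failure modes from the paragraph above show up in the count: the swap at a crossing and the re-minted identity after a gap.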
What This Gives an Agent
A tracker converts raw video into a small set of tracklets — per-object trajectories with a start time, end time, path, and appearance signature. That representation answers questions frames can't:
1. Presence and duration: "how long was each person/vehicle on screen," from track lifespans.
2. Counting: unique entities = number of distinct tracks, not number of detections (which double-counts across frames).
3. Interaction and co-occurrence: which tracks overlapped in time and space (a hand-off, a meeting, a collision).
4. Entity-level retrieval: index by track, so "find every clip where this object appears" becomes one lookup, and each hit carries a timestamped trajectory.
The agent consumes tracklets and their signatures, not raw pixels — the same pattern as every other perception modality, with a trajectory as the unit instead of a single embedding.
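As a rough illustration of how those questions reduce to simple operations on tracklets, here is a sketch with a hypothetical `Tracklet` record. The field names are assumptions, the path and appearance signature are omitted, and co-occurrence is checked in time only; a spatial check would also need the trajectories.

```python
from dataclasses import dataclass


@dataclass
class Tracklet:
    track_id: int
    label: str      # e.g. "person" or "vehicle"
    start: float    # seconds from the start of the video
    end: float


def duration(t):
    """Presence: how long the entity was on screen."""
    return t.end - t.start


def unique_count(tracklets, label):
    """Counting: distinct track IDs, not per-frame detections."""
    return len({t.track_id for t in tracklets if t.label == label})


def co_occurrence(a, b):
    """Seconds during which both entities were on screen together."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


tracklets = [
    Tracklet(1, "person", 0.0, 12.0),
    Tracklet(2, "person", 10.0, 20.0),
    Tracklet(3, "vehicle", 30.0, 75.0),
]
```

With this shape, "which vehicles stayed longer than 30 seconds" is a filter on `duration`, and "did A and B ever appear together" is a check that `co_occurrence` is positive.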
Doing It on Mixpeek
Tracking runs at ingest so each video object carries per-entity tracklets you can search and filter by presence, duration, or identity.
from mixpeek import Mixpeek

mx = Mixpeek(api_key="API_KEY")

# Detection + tracking runs over the video at ingest, producing tracklets
mx.collections.create(
    namespace_id="my-namespace",
    collection_name="surveillance",
    source={"type": "bucket", "bucket_ids": ["bkt_footage"]},
    feature_extractor={"feature_extractor_name": "object_tracking", "version": "v1"},
)

# An agent retrieves by what happened over time, not a single frame
results = mx.retrievers.execute(
    retriever_id="your-retriever-id",
    query="a vehicle that stops near the entrance for more than 30 seconds",
)
The retriever returns entity-level tracks with their timestamps, so the agent reasons about trajectories rather than isolated boxes. For the long-term-identity side of tracking see face recognition and identity clustering; for how the per-frame detector upstream works see open-vocabulary object detection; and for cutting video into the units you track within see video scene segmentation.