Streaming Video Understanding: How Agents Watch an Unbounded Live Feed in Real Time

The Problem Offline Video Pipelines Quietly Assume Away

Almost every video retrieval pipeline assumes the video is a finished file. You have the whole thing on disk, you can seek to any timestamp, sample frames at leisure, embed them in a batch, and build an index before anyone asks a question. That is the comfortable, offline world, and it is what most video understanding guidance describes.

Streaming flips every one of those assumptions. The agent is watching a live feed -- a security camera, a drone downlink, a screen-share during a support call, a body cam, a robot's onboard camera -- and the stream has no end and no rewind. Frames arrive at a fixed rate and you have to decide what to do with each one *as it arrives*, before you have seen what comes next, and usually before you know what question will be asked. You cannot store every frame (an hour at 30 FPS is 108,000 frames, and a day is about 2.6 million), you cannot re-watch (the past is gone unless you kept something), and the user can ask a question at any moment about anything that has happened so far.

This is online video understanding, and it is a genuinely different algorithmic problem from long-context offline video. The core question stops being "which frames do I sample from this file?" and becomes "what do I *keep* from a stream I can only see once, and how do I keep it in bounded memory while staying responsive?"

The Causal Constraint Changes the Math

The defining property of a stream is causality: at time *t* the model only has access to frames up to *t*. An offline model answering "what happened right before the crash?" can look forward and backward freely. A streaming model has to have already decided, before the crash happened, that the preceding moments were worth remembering.

Three constraints fall out of this and they shape every design decision:

1. Bounded memory. Memory usage cannot grow with stream length. A model that accumulates one vector per frame forever will exhaust RAM (and, for a VLM, blow past its context window) in minutes. Memory must be *constant or slowly growing* regardless of how long the stream runs.

2. Bounded per-frame compute. Each incoming frame gets a fixed, small compute budget. If processing frame *t* takes longer than the inter-frame interval, you fall behind and the backlog grows without bound. Real-time means amortized per-frame cost below the frame period.

3. Anytime queryability. A question can arrive at any instant and must be answered from whatever state exists *now*. There is no "let me reprocess the video" escape hatch.

Put together: you are compressing an infinite input into a finite, queryable state under a hard per-frame deadline. Everything below is a technique for doing that compression well.
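
To make the per-frame deadline concrete, here is a small arithmetic sketch in Python. It assumes a single worker and a constant per-frame cost, both simplifications, and `backlog_frames` is a name introduced here purely for illustration.

```python
def backlog_frames(fps, per_frame_ms, seconds):
    """Frames still waiting after `seconds` of a stream.

    Assumes one worker and a constant per-frame cost (illustrative only).
    """
    arrived = fps * seconds
    processed = min(arrived, seconds * 1000.0 / per_frame_ms)
    return arrived - processed


frame_period_ms = 1000.0 / 30            # about 33.3 ms between frames at 30 FPS
under = backlog_frames(30, 25.0, 60)     # cost below the frame period: keeps up
over = backlog_frames(30, 40.0, 60)      # cost above the frame period: falls behind
print(under, over)
```

With a 40 ms cost against a 33.3 ms frame period, the backlog keeps growing for as long as the stream runs, which is why the budget has to hold on every frame and not just on average over a good minute.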

The Front Door: A Ring Buffer and a Frame Budget

The first structure in any streaming pipeline is a fixed-size ring buffer (circular buffer) of recent frames. New frames overwrite the oldest. This bounds raw-frame memory immediately and gives the model a small window of high-fidelity recent context -- the last few seconds at full detail -- which is where most immediate questions ("what is happening right now?") are answered.

But a ring buffer alone forgets everything older than its window. The whole game is deciding what graduates *out* of the buffer into longer-lived memory before it is overwritten. That promotion decision is driven by a frame budget plus a relevance signal.
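
A minimal sketch of such a buffer, using Python's `collections.deque` with a `maxlen`. The `on_evict` hook is an assumption added for illustration: it hands each frame to a promotion policy just before that frame is overwritten.

```python
from collections import deque


class FrameRingBuffer:
    """Fixed-size buffer of the most recent (timestamp, frame) pairs."""

    def __init__(self, capacity, on_evict=None):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._frames = deque(maxlen=capacity)
        self._on_evict = on_evict        # hypothetical promotion hook

    def push(self, timestamp, frame):
        if len(self._frames) == self._frames.maxlen and self._on_evict:
            self._on_evict(*self._frames[0])   # oldest entry is about to be dropped
        self._frames.append((timestamp, frame))

    def recent(self):
        """Frames currently held, oldest first."""
        return list(self._frames)


aged_out = []
buf = FrameRingBuffer(capacity=3, on_evict=lambda t, f: aged_out.append(t))
for t in range(5):
    buf.push(t, f"frame-{t}")
```

After five pushes into a three-slot buffer, the buffer holds the last three frames and the hook has seen the two that aged out. Everything downstream (segment pooling, indexing) hangs off that hook.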

The cheapest, most effective filter sits before any heavy model: inter-frame redundancy removal. A static scene produces a long run of near-identical frames that carry no new information. Drop them with a microsecond-cheap signal -- a perceptual hash or a low-dimensional feature difference -- and you spend your compute budget only on frames that changed.

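def changed_frames(stream, cheap_signature, distance, change_threshold):
    """Yield only frames that differ enough from the last frame yielded.

    cheap_signature and distance are placeholders for any cheap per-frame
    signature and its distance (e.g. a perceptual hash and Hamming distance).
    """
    prev_sig = None                            # signature of the last kept frame
    for frame in stream:                       # frames arrive at fixed FPS
        sig = cheap_signature(frame)           # e.g. perceptual hash or tiny embedding
        if prev_sig is None or distance(sig, prev_sig) > change_threshold:
            yield frame                        # changed enough -> spend budget
            prev_sig = sig                     # later frames compare against this one
        # else: near-duplicate of a frame we already handled -> skip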

On mostly static footage, such as a fixed security camera, this single step can cut the frames that reach the expensive encoder substantially, often by an order of magnitude, and it does so causally -- it only ever compares to the past.
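
One possible cheap signature for the filter above is an average hash, paired with Hamming distance. The sketch below is pure Python and assumes each frame is already a grayscale 2D list at least `grid` pixels in each dimension. A production system would more likely use an image library, and the function names here are illustrative.

```python
def average_hash(gray, grid=8):
    """Average hash of a grayscale frame given as a 2D list of numbers.

    The frame is reduced to a grid x grid thumbnail by block averaging; each
    cell contributes one bit: 1 if brighter than the thumbnail mean.
    """
    height, width = len(gray), len(gray[0])
    cells = []
    for gy in range(grid):
        y0, y1 = gy * height // grid, (gy + 1) * height // grid
        for gx in range(grid):
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            block = [gray[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming(a, b):
    """Number of bit positions where two integer hashes differ."""
    return bin(a ^ b).count("1")
```

Small brightness shifts leave the hash unchanged because every cell is compared with the frame's own mean, while a real change in layout flips many bits at once. The change threshold is a tuning parameter that depends on the camera and scene.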

Bounding Memory Inside the Model: Token Merging and KV-Cache Pruning

Once a frame survives the cheap filter and gets encoded, it becomes a set of tokens (visual patches, or a pooled frame vector). For a streaming VLM that reasons over the accumulated stream, these tokens pile up in the attention context and the KV cache. Two families of technique keep that bounded.

Token merging. Instead of keeping every token from every kept frame, merge similar tokens across time. Adjacent frames of the same scene produce highly correlated patch tokens; collapse them into one representative (a mean, or a similarity-weighted pool) and you shrink the per-frame footprint without losing the content. Token merging is the temporal analog of the frame-dedup step, applied one level deeper in the representation.
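
As a rough sketch of temporal merging, the function below takes the sequence of token vectors for one patch position across consecutive kept frames and collapses each run of similar vectors into its mean. The greedy run-based rule and the `threshold` value are illustrative assumptions, not a specific published method.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_temporal_tokens(tokens, threshold=0.9):
    """Collapse runs of consecutive similar vectors into (mean_vector, count)."""
    merged = []                                   # each entry: [running_sum, count]
    for tok in tokens:
        if merged:
            total, count = merged[-1]
            mean = [v / count for v in total]
            if cosine(mean, tok) >= threshold:    # same content as the open run
                merged[-1] = [[s + t for s, t in zip(total, tok)], count + 1]
                continue
        merged.append([list(tok), 1])             # content changed: start a new run
    return [([v / n for v in total], n) for total, n in merged]
```

Keeping the count alongside each mean lets later stages weight a merged token by how many frames it stands for.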

KV-cache pruning and frame-wise merging. A streaming VLM's KV cache is its working memory of the stream so far. Training-free streaming methods bound it by evicting or merging cache entries: drop the key/value pairs for tokens that have not been attended to recently, or merge the cache contributions of consecutive frames once they age out of the immediate window. This is the streaming counterpart to the eviction policy in a CPU cache -- you are choosing what to forget so the cache stays a fixed size. The art is in the eviction policy: purely recency-based eviction (a sliding window) is simple but throws away salient older moments; attention-weighted or saliency-weighted eviction keeps the moments the model actually used.

The shared idea across both: information that is *redundant* (similar to something already stored) or *stale* (unattended for a long time) is cheap to discard, and discarding it is what makes the memory finite.
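
The difference between recency-based and saliency-weighted eviction can be sketched with a toy cache. This is a stand-in, not a real KV cache: entries are opaque payloads, not per-layer key/value tensors, and the per-entry attention weights are assumed to be reported by the model. The class and parameter names are hypothetical.

```python
class BoundedKVCache:
    """Toy fixed-size cache that evicts the least-attended older entry."""

    def __init__(self, capacity, protect_recent):
        if not 0 <= protect_recent < capacity:
            raise ValueError("protect_recent must be in [0, capacity)")
        self.capacity = capacity
        self.protect_recent = protect_recent   # newest entries are never evicted
        self.entries = []                      # oldest first

    def append(self, entry_id, payload):
        self.entries.append({"id": entry_id, "payload": payload, "score": 0.0})
        while len(self.entries) > self.capacity:
            self._evict_one()

    def record_attention(self, weights, decay=0.9):
        """Decay old scores, then add this step's attention mass per entry id."""
        for entry in self.entries:
            entry["score"] = decay * entry["score"] + weights.get(entry["id"], 0.0)

    def _evict_one(self):
        evictable = len(self.entries) - self.protect_recent
        # min() returns the first minimum, so ties go to the oldest entry.
        victim = min(range(evictable), key=lambda i: self.entries[i]["score"])
        del self.entries[victim]
```

A pure sliding window would always drop the oldest entry. Here an old entry that the model keeps attending to survives, and an unattended one of similar age is dropped instead.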

Hierarchical Memory: Short-Term Detail, Long-Term Gist

The single most important architectural pattern in streaming video understanding is a memory hierarchy that mirrors how the constraints differ across time scales. Recent content needs detail and is queried often; distant content needs only a compressed summary and is queried rarely. So you keep multiple tiers at decreasing resolution.

A typical three-tier design:

Short-term memory (seconds): the ring buffer of recent frames at full or near-full token resolution. High fidelity, small window, overwritten continuously. Answers "what is happening now?"

Mid-term memory (minutes): per-segment summaries. As frames age out of the short-term buffer, a run of them is pooled into one segment representation -- a pooled embedding plus a short generated caption. Bounded growth: roughly one entry per scene or per fixed interval, not per frame.

Long-term memory (the whole stream): a compact, heavily compressed gist -- cluster centroids of segment embeddings, or a running summary the model periodically rewrites. Near-constant size regardless of stream length.

incoming frames
      |
  [ short-term ring buffer ]  full detail, last N seconds
      |  (age out -> pool a run of frames)
  [ mid-term segment store ]  one summary per scene/interval
      |  (cluster / compress)
  [ long-term gist ]          near-constant-size global memory

Each tier obeys a different bound, and queries route by recency: an immediate question hits the short-term buffer; "what happened a few minutes ago?" hits the mid-term store; "has this person appeared before?" or "summarize the whole shift" hits long-term memory. This is the structure many recent streaming systems converge on (hierarchical KV-cache memories, adaptive hierarchical memory banks), because it lets a system be both detailed-when-recent and bounded-when-old.
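
A compact sketch of the three tiers follows, with every tier bounded. It makes several simplifying assumptions: frames are already embedded as vectors, segments are fixed-length runs (not scene-based), and the long-term gist is a single running mean standing in for cluster centroids or a rewritten summary. All names are illustrative.

```python
from collections import deque


def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


class HierarchicalMemory:
    """Three tiers: recent frame vectors, segment summaries, one running gist."""

    def __init__(self, short_capacity, segment_len, mid_capacity):
        if not 1 <= segment_len <= short_capacity:
            raise ValueError("segment_len must be in [1, short_capacity]")
        self.short = deque()                 # (t, vector) at full detail
        self.short_capacity = short_capacity
        self.segment_len = segment_len       # frames pooled per mid-term entry
        self.mid = deque()                   # (t_start, t_end, pooled_vector)
        self.mid_capacity = mid_capacity
        self.gist = None                     # running mean of retired segments
        self.gist_count = 0

    def add_frame(self, t, vector):
        self.short.append((t, vector))
        if len(self.short) > self.short_capacity:
            self._promote_segment()

    def _promote_segment(self):
        run = [self.short.popleft() for _ in range(self.segment_len)]
        pooled = mean_vector([v for _, v in run])
        self.mid.append((run[0][0], run[-1][0], pooled))
        if len(self.mid) > self.mid_capacity:
            self._fold_into_gist(self.mid.popleft()[2])

    def _fold_into_gist(self, vector):
        self.gist_count += 1
        if self.gist is None:
            self.gist = list(vector)
        else:                                # incremental mean keeps size constant
            k = self.gist_count
            self.gist = [g + (v - g) / k for g, v in zip(self.gist, vector)]
```

Total state is capped at `short_capacity` frame vectors, `mid_capacity` segment entries, and one gist vector, no matter how long the stream runs.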

Entity Banks: Keeping Identity Consistent Across Time

A subtle failure mode of pooling and compression is losing *identity*. If you summarize a scene into "a person walks across the lobby," you lose the ability to answer "is that the same person who was here earlier?" Streaming systems address this with an entity bank: a small, separate, slowly-growing memory keyed by tracked entities (people, vehicles, objects) rather than by time.

When the model detects an entity, it computes a compact descriptor (a face or appearance embedding, a track id) and either matches it to an existing bank entry or creates a new one. The bank stores a stable id plus a running representation, so later moments can reference the same entity even after the original frames are long gone. This is what lets a streaming agent answer "the same car drove past three times" -- the raw frames were discarded, but the entity bank retained the identity link. The bank stays bounded by capping entries and merging or aging out ones that have not recurred.
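
The match-or-create step can be sketched as below. It assumes an upstream detector already supplies an appearance embedding per detection. The cosine threshold, the running-mean update, and least-recently-seen eviction are illustrative choices, and real re-identification is considerably harder than this suggests.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class EntityBank:
    """Small capped memory keyed by entity, not by time."""

    def __init__(self, capacity, match_threshold=0.8):
        self.capacity = capacity
        self.match_threshold = match_threshold
        self.entries = {}          # entity_id -> {"vec", "count", "last_seen"}
        self._next_id = 0

    def observe(self, descriptor, t):
        """Return a stable id for this descriptor, creating an entry if needed."""
        best_id, best_sim = None, self.match_threshold
        for eid, entry in self.entries.items():
            sim = cosine(entry["vec"], descriptor)
            if sim >= best_sim:
                best_id, best_sim = eid, sim
        if best_id is None:
            if len(self.entries) >= self.capacity:
                stale = min(self.entries, key=lambda k: self.entries[k]["last_seen"])
                del self.entries[stale]          # age out the least recently seen
            best_id = f"entity-{self._next_id}"
            self._next_id += 1
            self.entries[best_id] = {"vec": list(descriptor), "count": 0, "last_seen": t}
        entry = self.entries[best_id]
        entry["count"] += 1
        n = entry["count"]
        entry["vec"] = [m + (d - m) / n for m, d in zip(entry["vec"], descriptor)]
        entry["last_seen"] = t
        return best_id
```

The returned id is what gets attached to segment records, so "the same car drove past three times" becomes a count of segments sharing one entity id.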

Event-Triggered Indexing: Turning a Stream Into Something Searchable

So far the memory tiers serve the model's own reasoning. But for an *agent*, the real payoff is making the stream searchable after the fact -- so that "show me when someone entered the restricted zone" can be answered minutes or hours later, by a retriever, without the model having to have anticipated the exact question.

The pattern is event-triggered indexing. As content ages out of short-term memory, instead of only pooling it for the model's gist, you also emit a durable, indexed record:

short-term buffer --ages out--> segment summarizer
                                     |
                          +----------+-----------+
                          v                      v
                  model's gist memory     retrieval index (durable)
                  (bounded, in-RAM)       embedding + caption + t_start/t_end
                                          + entity ids + event tags

Each emitted segment becomes a retrievable unit: a pooled embedding for similarity search, a generated caption for lexical and hybrid search, a precise time span so a hit points back to a moment, entity ids from the bank, and any event tags (motion threshold crossed, new entity, anomaly). Triggering on *events* rather than on a fixed clock keeps the index information-dense: a static hour produces few segments, an eventful minute produces many. This is what bridges the streaming front end to everything an offline retrieval stack already does -- once a moment is an indexed segment with an embedding, a timestamp, and a caption, the usual multi-stage retrieval, filtering, and reranking apply unchanged.
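
A sketch of an event-triggered segmenter, under stated assumptions: frames arrive as embedding vectors, `emit` is whatever callback writes to the durable index, and `caption_fn` is a stand-in for a VLM captioner. The record fields mirror the ones listed above, and the class names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SegmentRecord:
    t_start: float
    t_end: float
    embedding: list
    caption: str
    entity_ids: list = field(default_factory=list)
    event_tags: list = field(default_factory=list)


class EventSegmenter:
    """Closes the open segment whenever an event fires, not on a fixed clock."""

    def __init__(self, emit, caption_fn):
        self.emit = emit              # callback that writes to the durable index
        self.caption_fn = caption_fn  # stand-in for a VLM captioner
        self._reset()

    def _reset(self):
        self._sum, self._count = None, 0
        self._t_start = self._t_end = None
        self._entities, self._tags = set(), []

    def add(self, t, vector, entity_ids=(), event_tag=None):
        if event_tag and self._count:
            self.flush()              # an event starts a new segment
        if self._count == 0:
            self._sum, self._t_start = [0.0] * len(vector), t
        self._sum = [s + v for s, v in zip(self._sum, vector)]
        self._count += 1
        self._t_end = t
        self._entities.update(entity_ids)
        if event_tag:
            self._tags.append(event_tag)

    def flush(self):
        if self._count == 0:
            return
        self.emit(SegmentRecord(
            t_start=self._t_start,
            t_end=self._t_end,
            embedding=[s / self._count for s in self._sum],
            caption=self.caption_fn(self._t_start, self._t_end, self._tags),
            entity_ids=sorted(self._entities),
            event_tags=list(self._tags),
        ))
        self._reset()
```

The segmenter keeps only a running sum and a count, so its own state stays constant-size however long a quiet stretch lasts. A real deployment would probably also cap segment duration so a long quiet period still flushes eventually.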

Latency Budgeting: Staying Real-Time Under Load

Real-time is a deadline, not an average. If the per-frame pipeline occasionally spikes past the frame interval, the backlog grows and never recovers. Production streaming systems defend the deadline with a few standard moves:

Decouple ingest from heavy processing with a bounded queue. The capture loop only writes frames to a fixed-size queue; if the queue is full, it drops frames (graceful degradation) rather than blocking the camera. Dropping is usually acceptable because most frames are near-duplicates that the cheap dedup step would have discarded anyway.

Two-rate processing. Run a cheap path on every frame (change detection, motion, the ring buffer) and an expensive path (full VLM encode, captioning) only on triggered frames at a much lower rate. The expensive path's rate is set so its amortized cost fits the budget.

Asynchronous indexing. Emitting a segment to the durable retrieval index happens off the hot path. The capture and memory loop never waits on a write to the index.

The unifying principle: the only work allowed on the per-frame hot path is work that is cheap and bounded. Everything heavy is triggered, batched, rate-limited, or pushed off the critical path.
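
The first two moves can be sketched with Python's standard `queue` module. For brevity the drain loop below runs single-threaded until the queue is empty; in a real system it would run in its own worker with a blocking `get`. `cheap_path`, `expensive_path`, and `min_gap` are hypothetical names for the per-frame check, the heavy encoder, and a simple rate limit.

```python
import queue


class DroppingIngest:
    """Capture-side queue that drops frames instead of blocking when full."""

    def __init__(self, maxsize):
        self.frames = queue.Queue(maxsize=maxsize)
        self.dropped = 0

    def offer(self, frame):
        """Called by the capture loop; never blocks."""
        try:
            self.frames.put_nowait(frame)
            return True
        except queue.Full:
            self.dropped += 1            # graceful degradation: count and move on
            return False


def run_two_rate(ingest, cheap_path, expensive_path, min_gap):
    """Drain the queue: cheap path on every frame, expensive path on triggers.

    min_gap is the minimum number of frames between expensive calls, so
    triggered work cannot exceed its share of the budget.
    """
    since_last = min_gap
    while True:
        try:
            frame = ingest.frames.get_nowait()
        except queue.Empty:
            return
        triggered = cheap_path(frame)    # e.g. change detection
        since_last += 1
        if triggered and since_last >= min_gap:
            expensive_path(frame)        # e.g. full encode and captioning
            since_last = 0
```

Tracking `dropped` matters in practice: a rising drop count is the earliest sign that the expensive path's rate is set too high for the hardware.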

Streaming vs Offline: Pick by Whether You Control the Tape

Situation	Use	Why
Finished file you can seek and re-read	Offline batch pipeline	Sample, embed, and index with full lookahead; no causal or memory bound
Live feed with no end and no rewind	Streaming pipeline	Bounded memory, per-frame deadline, anytime queries; keep what you cannot re-watch
Long file but tight latency to first answer	Hybrid	Stream it through the online path for an immediate index, refine offline later
Need to answer about a moment hours ago in a live feed	Streaming + event-triggered index	The model's bounded memory forgets; the durable index remembers

The dividing line is control over the tape. If you can re-read the input, prefer the offline path -- it is simpler and generally more accurate because it has lookahead. Reach for streaming only when the input genuinely arrives once.

In Mixpeek

In Mixpeek terms, the streaming front end is an upstream concern -- a capture-and-memory loop running near the camera that does cheap dedup, maintains the short-term ring buffer, and decides when a moment graduates into a durable segment. What it hands Mixpeek is exactly the unit the rest of the platform already understands: a video segment with a time span, a pooled embedding, a caption, and payload fields for entity ids and event tags. From there it is ordinary multimodal retrieval.

{
  "collection": "live_feed_segments",
  "feature_extractors": [
    { "feature": "video_embedding", "model": "google/siglip-base-patch16-224" },
    { "feature": "caption_text", "model": "vlm-caption" }
  ],
  "payload_schema": {
    "t_start": "number",
    "t_end": "number",
    "entity_ids": "keyword[]",
    "event_tags": "keyword[]"
  }
}

Because index inserts are incremental, a segment emitted by the streaming loop is searchable seconds later, so an agent can ask "find when a new vehicle entered the lot in the last hour" and get a hit that points back to a precise time span -- even though the raw frames for that moment were overwritten in the ring buffer long ago. The agent's retrieval tool filters on event tags and the time range, ranks by embedding similarity to the query, and returns the segment's time span as an evidence handle. The streaming loop's job is to make sure the *right* moments became durable segments; Mixpeek's job is to make those segments findable.
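
The filter-then-rank behavior of such a retrieval tool can be illustrated with an in-memory stand-in. This is not a Mixpeek API call. It is a plain Python sketch over a list of dicts that carry the payload fields from the schema above plus an `embedding` list, and `search_segments` is a hypothetical name.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_segments(segments, query_vec, t_min, t_max, required_tag=None, top_k=3):
    """Filter by time range and event tag, then rank by cosine similarity."""
    hits = []
    for seg in segments:
        if seg["t_end"] < t_min or seg["t_start"] > t_max:
            continue                               # outside the time window
        if required_tag and required_tag not in seg["event_tags"]:
            continue                               # missing the requested tag
        hits.append((cosine(seg["embedding"], query_vec), seg))
    hits.sort(key=lambda pair: pair[0], reverse=True)
    # Return time spans as evidence handles, best match first.
    return [(seg["t_start"], seg["t_end"], score) for score, seg in hits[:top_k]]
```

Filtering before ranking keeps the similarity computation on the small set of segments that could possibly answer the question, and the returned time span is what the agent cites as evidence.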

Key Takeaways

1. Streaming is causal and unbounded, which breaks offline assumptions. You see each frame once, with no lookahead and no rewind, under a per-frame deadline. The question shifts from "what do I sample?" to "what do I keep, in bounded memory, before it is gone?"

2. Bound memory at every layer. A fixed-size ring buffer bounds raw frames; cheap change detection bounds what reaches the encoder; token merging and KV-cache pruning bound what reaches the model's attention. Redundant or stale information is the cheap thing to discard.

3. Use a memory hierarchy. Short-term high-detail recent frames, mid-term per-segment summaries, and a near-constant-size long-term gist let the system be detailed when recent and bounded when old, with queries routed by recency.

4. Keep identity in an entity bank. Pooling loses identity; a small bank keyed by tracked entities preserves "the same person/car appeared again" even after the original frames are discarded.

5. Event-triggered indexing turns the stream into a searchable corpus. Emit durable segments (embedding, caption, time span, entity and event tags) as content ages out, so an agent can answer questions about past moments with ordinary retrieval -- without the model having anticipated the question.
