Long-Context Video Understanding for Agent Perception

The Hour-Long Video Problem

An AI agent that can answer questions about a 5-second clip is a demo. An agent that can answer questions about a 60-minute meeting recording is a product. The gap between these two is not just scale -- it requires fundamentally different approaches to how video content is represented, compressed, and searched.

Consider the query: "In the board meeting, what did the CFO say right after the slide showing declining EMEA revenue?" Answering this requires the agent to:

1. Identify which frames contain a slide about EMEA revenue
2. Find the temporal boundary where that slide ends
3. Locate the CFO's speech in the audio immediately after
4. Extract and return the relevant transcript segment

A short video fits entirely in a VLM's context window. You pass all frames and the transcript, and the model reasons over everything at once. A 60-minute video at 1 FPS produces 3,600 frames. At ~256 visual tokens per frame, that is 921,600 tokens for frames alone -- far beyond even 128K-token models.

This guide covers the algorithms and architectures that bridge this gap: how to decide which frames matter, how to compress video into searchable representations, and when to use long-context models versus retrieval-augmented approaches.

Frame Sampling: Deciding What to See

The first decision in any video understanding pipeline is frame sampling -- choosing which frames from the raw video stream to process. This decision determines the ceiling of everything downstream. A frame that is never sampled can never be found.

Uniform Sampling

The simplest approach: extract one frame every N seconds. At 1 FPS, a 60-minute video produces 3,600 frames. At 0.5 FPS, 1,800 frames. At one frame every 5 seconds, 720 frames.

Uniform sampling is predictable and cheap. It works well when content changes at a roughly constant rate -- lectures, surveillance footage, interviews. It fails when information density varies: consider a 10-minute static slide followed by 5 seconds of rapid whiteboard drawing. At one frame every 5 seconds, the static section produces 120 near-identical frames, while the important 5 seconds gets just one.
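The failure mode is easy to see by counting where the sampled timestamps land. A minimal sketch in plain Python; the function names are illustrative, and the 10-minute slide followed by 5 seconds of drawing is the scenario described above:

```python
def uniform_timestamps(duration_s: float, interval_s: float) -> list[float]:
    """Timestamps (in seconds) at which to grab one frame every interval_s seconds."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    count = int(duration_s // interval_s)
    return [i * interval_s for i in range(count)]


def frames_in_window(timestamps: list[float], start_s: float, end_s: float) -> int:
    """How many sampled timestamps fall inside [start_s, end_s)."""
    return sum(start_s <= t < end_s for t in timestamps)


# A 60-minute video sampled once every 5 seconds.
stamps = uniform_timestamps(3600, 5)
print(len(stamps))                         # 720
print(frames_in_window(stamps, 0, 600))    # 120 frames spent on the static slide
print(frames_in_window(stamps, 600, 605))  # 1 frame for the whiteboard drawing
```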

Shot-Boundary Detection

Shot-boundary detection identifies visual discontinuities -- cuts, dissolves, and fades -- that signal content changes. Instead of sampling at a fixed rate, you sample one frame per shot, or the first and last frame of each shot.

The standard algorithm computes pixel-level or histogram-level differences between consecutive frames. When the difference exceeds a threshold, a shot boundary is declared:

for each consecutive frame pair (f_t, f_{t+1}):
    diff = histogram_distance(f_t, f_{t+1})
    if diff > threshold:
        mark shot boundary at t+1

FFmpeg implements this via the select filter's scene-change score, for example select='gt(scene,0.3)'. A threshold of 0.3-0.4 catches most hard cuts while ignoring gradual brightness changes.
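Outside FFmpeg, the thresholding loop above can be sketched in plain Python. This is an illustration under stated assumptions, not FFmpeg's algorithm: frames are flat lists of 8-bit grayscale pixel values, the distance is half the L1 distance between normalized histograms, and the 0.35 threshold applies to that distance scale rather than to FFmpeg's scene score.

```python
def gray_histogram(pixels: list[int], bins: int = 16) -> list[float]:
    """Normalized histogram of 8-bit grayscale pixel values."""
    counts = [0] * bins
    for p in pixels:
        counts[min(p * bins // 256, bins - 1)] += 1
    total = len(pixels) or 1
    return [c / total for c in counts]


def histogram_distance(h1: list[float], h2: list[float]) -> float:
    """Half the L1 distance: 0.0 for identical histograms, 1.0 for disjoint ones."""
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))


def detect_shot_boundaries(frames: list[list[int]], threshold: float = 0.35) -> list[int]:
    """Return every index t+1 at which frame t+1 starts a new shot."""
    hists = [gray_histogram(f) for f in frames]
    return [
        t + 1
        for t in range(len(hists) - 1)
        if histogram_distance(hists[t], hists[t + 1]) > threshold
    ]


# Toy "video": three dark frames, then a hard cut to three bright frames.
dark = [20] * 64
bright = [230] * 64
print(detect_shot_boundaries([dark, dark, dark, bright, bright, bright]))  # [3]
```

Comparing only consecutive frames is what makes this detector sensitive to hard cuts and largely blind to slow dissolves, where each individual step stays under the threshold.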

Shot-boundary detection adapts to content density. A fast-paced montage produces many shots (many frames sampled). A static presentation produces few shots (few frames sampled). This is more efficient than uniform sampling for heterogeneous content, but it misses within-shot changes -- a presenter drawing on a whiteboard during a single continuous shot will only be captured at the shot boundaries.

Semantic Clustering

The most sophisticated approach: embed all frames (or a uniform subsample) into a visual embedding space, then cluster the embeddings. Each cluster represents a visually distinct "scene," and you select the frame closest to each cluster's centroid as its representative.

# 1. Extract frames at a moderate rate
frames = extract_frames(video, fps=2)

# 2. Embed all frames
embeddings = clip_model.encode_images(frames)

# 3. Cluster by visual similarity
clusters = kmeans(embeddings, k=num_desired_keyframes)

# 4. Select centroid frame from each cluster
keyframes = [frames[closest_to_centroid(c)] for c in clusters]

Semantic clustering tends to produce the most information-dense set of keyframes because it selects for visual diversity. Two nearly identical slides will usually end up in the same cluster, so only one representative is selected. This is the best approach when your token budget is tight and you need maximum visual coverage per frame.

The tradeoff is compute cost: you need to embed every candidate frame before you can select keyframes. For a 60-minute video at 2 FPS, that is 7,200 CLIP forward passes just for the sampling step.
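To make the selection step concrete, here is a small self-contained sketch that runs on precomputed embeddings. It is an illustration only: it uses a hand-rolled Lloyd's k-means with a deterministic farthest-point initialization (an assumption chosen for reproducibility; a production system would more likely call a library implementation), and the toy 2-D vectors stand in for real image embeddings.

```python
def _sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _init_centroids(vectors, k):
    """Deterministic farthest-point initialization."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors, key=lambda v: min(_sq_dist(v, c) for c in centroids))
        centroids.append(list(farthest))
    return centroids


def kmeans(vectors, k, iterations=10):
    """Plain Lloyd's k-means; returns (centroids, cluster index per vector)."""
    k = min(k, len(vectors))
    centroids = _init_centroids(vectors, k)
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        assignment = [
            min(range(k), key=lambda c: _sq_dist(v, centroids[c])) for v in vectors
        ]
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignment


def select_keyframes(embeddings, k):
    """Index of the frame closest to each centroid, returned in temporal order."""
    centroids, assignment = kmeans(embeddings, k)
    keyframes = []
    for c, centroid in enumerate(centroids):
        members = [i for i, a in enumerate(assignment) if a == c]
        if members:
            keyframes.append(min(members, key=lambda i: _sq_dist(embeddings[i], centroid)))
    return sorted(keyframes)


# Nine toy "frame embeddings" forming three visually distinct scenes.
embeddings = [
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
    [10.0, 0.0], [10.1, 0.0], [10.0, 0.1],
]
print(select_keyframes(embeddings, k=3))  # [0, 3, 6]
```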

Adaptive Sampling for Production

Production systems typically combine approaches: uniform sampling as a baseline, shot-boundary detection for hard cuts, and semantic deduplication to remove near-identical frames. The pipeline looks like:

1. Extract frames at 2 FPS (uniform baseline)
2. Run shot-boundary detection to add boundary frames
3. Embed all candidate frames
4. Remove frames whose embedding cosine similarity to the previous selected frame exceeds 0.95
5. Return the remaining frames as keyframes

This typically produces 50-200 keyframes for a 60-minute meeting -- far more manageable than the 7,200 candidate frames extracted at 2 FPS, while preserving the important visual content.
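The deduplication step (step 4) is the part that does most of the reduction. A minimal sketch, assuming the candidate frames are already embedded and in temporal order; the function names and toy vectors are illustrative:

```python
import math


def cosine_similarity(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def deduplicate(embeddings, max_similarity: float = 0.95) -> list[int]:
    """Keep a frame only if it differs enough from the last frame that was kept."""
    kept: list[int] = []
    for i, emb in enumerate(embeddings):
        if not kept or cosine_similarity(emb, embeddings[kept[-1]]) <= max_similarity:
            kept.append(i)
    return kept


# Three near-identical frames, then a visual change, then a near-duplicate of it.
candidates = [[1.0, 0.0], [0.99, 0.05], [0.98, 0.1], [0.0, 1.0], [0.02, 1.0]]
print(deduplicate(candidates))  # [0, 3]
```

Comparing against the last kept frame, rather than the immediately preceding candidate, prevents slow drift from slipping through: a sequence of tiny changes eventually crosses the threshold relative to the kept frame and triggers a new keyframe.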

Token Budget Management

Modern VLMs accept images as input, but each image consumes tokens. The exact cost depends on the model's vision encoder and resolution settings:

Model	Tokens per Frame	Frames in 128K Context
InternVL3 (dynamic resolution)	256-1024	125-500 frames
Qwen3-VL (min_pixels to max_pixels)	256-1280	100-500 frames
Phi-4-multimodal	256	~500 frames

At 256 tokens per frame with a 128K context window, you can fit roughly 500 frames -- about 8 minutes of video at 1 FPS, or 42 minutes at one frame per 5 seconds. This means even 128K-token models cannot naively process an hour of video at useful frame rates.
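The same arithmetic as a small helper, which makes it easy to try other settings. This is a sketch: 128K is taken as 128,000 tokens to match the round numbers above, and the reserved_tokens parameter (room for the transcript, prompt, and answer) is an assumption added for illustration.

```python
def max_frames(context_tokens: int, tokens_per_frame: int, reserved_tokens: int = 0) -> int:
    """Number of frames that fit once non-visual tokens are set aside."""
    return max(0, (context_tokens - reserved_tokens) // tokens_per_frame)


def coverage_minutes(num_frames: int, seconds_per_frame: float) -> float:
    """Minutes of video covered when sampling one frame every seconds_per_frame."""
    return num_frames * seconds_per_frame / 60


frames = max_frames(128_000, 256)
print(frames)                                    # 500
print(round(coverage_minutes(frames, 1), 1))     # 8.3
print(round(coverage_minutes(frames, 5), 1))     # 41.7
print(max_frames(128_000, 256, reserved_tokens=16_000))  # 437
```

In practice the transcript competes for the same window, so the usable frame count is lower than the headline figure.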

Token budget management is the art of deciding how to allocate those 500 frame slots across an hour of video. The strategies mirror retrieval system design:

Fixed allocation: Divide the token budget equally across the video duration. For 500 frames over 60 minutes, sample one frame every 7.2 seconds. Simple but wastes budget on static sections.

Content-adaptive allocation: Use a cheap preprocessing step (shot detection, motion estimation) to identify high-information segments, then allocate more frames to those segments. A 5-minute rapid discussion gets 100 frames; a 10-minute static slide gets 10 frames.

Query-aware allocation: If you know the query before processing, use a text-video retrieval model to identify the most relevant segments, then allocate frames disproportionately to those segments. This is the retrieval-augmented approach discussed below.
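Content-adaptive and query-aware allocation share the same mechanics: weight each segment, then split the budget proportionally. A minimal sketch, assuming each segment arrives as a (duration, score) pair where the score is a hypothetical per-second importance value (motion or shot density for content-adaptive allocation, query relevance for query-aware allocation):

```python
def allocate_frames(segments, budget, min_per_segment=1):
    """Split a frame budget across segments in proportion to duration * score.

    segments: list of (duration_seconds, score) pairs with score >= 0.
    Returns one frame count per segment; the counts sum to `budget`.
    """
    floor_total = min_per_segment * len(segments)
    if budget < floor_total:
        raise ValueError("budget too small for the per-segment minimum")
    weights = [duration * score for duration, score in segments]
    if sum(weights) == 0:  # no signal at all: fall back to duration only
        weights = [duration for duration, _ in segments]
    total = sum(weights)
    if segments and total == 0:
        raise ValueError("segments must have positive total duration")
    spare = budget - floor_total
    raw = [spare * w / total for w in weights]
    counts = [min_per_segment + int(r) for r in raw]
    # Hand out the frames lost to rounding, largest fractional part first.
    by_remainder = sorted(range(len(raw)), key=lambda i: raw[i] - int(raw[i]), reverse=True)
    for i in by_remainder[: budget - sum(counts)]:
        counts[i] += 1
    return counts


# A 5-minute rapid discussion (high score) and a 10-minute static slide (low score).
print(allocate_frames([(300, 1.0), (600, 0.05)], budget=110))  # [99, 11]
```

The per-segment minimum keeps every part of the video visible to the model, so a low-scoring segment is thinned out rather than dropped entirely.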

Hierarchical Video Representations

The most powerful approach to long video is hierarchical: represent the video at multiple levels of granularity, each optimized for different query types.

Level 1: Frame Embeddings

Every sampled frame is encoded into a dense vector by a visual embedding model (CLIP, SigLIP, DINOv2). These vectors enable visual similarity search -- "find frames that look like a bar chart" -- without reading any text or audio.

Frame embeddings are cheap to store (for example, one 768-dimensional vector per frame; the exact dimension depends on the model) and fast to search (approximate nearest neighbor lookup). But they capture only visual appearance, not temporal relationships or semantic content.
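For intuition, here is what the lookup computes, written as an exact brute-force search in plain Python. This is a sketch: real systems replace the linear scan with an approximate nearest neighbor index, and the 3-D toy vectors stand in for real frame and query embeddings.

```python
import heapq
import math


def normalize(v):
    """Scale a vector to unit length so that dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)


def top_k_frames(query_vec, frame_vecs, k=3):
    """Return (frame_index, similarity) for the k frames most similar to the query."""
    q = normalize(query_vec)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(f))), i)
        for i, f in enumerate(frame_vecs)
    )
    return [(i, score) for score, i in heapq.nlargest(k, scored)]


frame_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
hits = top_k_frames([1.0, 0.0, 0.0], frame_vecs, k=2)
print([index for index, _ in hits])  # [0, 2]
```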

Level 2: Clip Descriptions

Groups of 10-30 consecutive frames are fed to a VLM that generates a natural language description of what happens in that clip. A 60-minute video produces 20-60 clip descriptions, each a paragraph of text.

Clip 14 [07:23 - 08:01]: The presenter switches to a slide titled
"EMEA Revenue Q3." A declining bar chart is visible. The presenter
points to the rightmost bar and says "We missed target by 12 percent,
primarily driven by currency headwinds in the UK market."

Clip descriptions bridge the visual-textual gap. They can be embedded with a text embedding model and searched with natural language queries. They capture temporal flow ("switches to," "points to") that individual frame embeddings cannot.

Level 3: Scene Summaries

Scenes (groups of related clips) get higher-level summaries that capture narrative arcs: "The CFO presented the quarterly financial results, highlighting missed EMEA targets and proposing a budget reduction for Q4."

Scene summaries enable abstract queries: "What were the key financial concerns raised in the meeting?" These queries cannot be answered by frame-level or clip-level search because they require synthesizing information across multiple clips.

Level 4: Document Summary

A single summary of the entire video. Useful for classification, routing, and high-level questions ("What type of meeting is this?") but too compressed for specific retrieval.

Building the Hierarchy

The practical pipeline processes video bottom-up:

1. Sample keyframes (Level 1) using adaptive sampling
2. Group keyframes into clips by shot boundaries or fixed windows
3. Generate clip descriptions (Level 2) by feeding frame groups to a VLM
4. Cluster clip descriptions into scenes by semantic similarity
5. Summarize each scene (Level 3) by feeding clip descriptions to an LLM
6. Summarize the full video (Level 4)

Each level is stored as a separate feature in the search index. A query first hits the scene summaries (cheap, broad recall), then drills into clip descriptions (medium cost, higher precision), then retrieves specific frames (highest precision, visual evidence).
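The coarse-to-fine query path can be sketched with two of the levels. Everything here is illustrative: the Clip and Scene classes are hypothetical containers, and word overlap is a stand-in for the text-embedding similarity a real index would use.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Clip:
    start_s: float
    end_s: float
    description: str


@dataclass
class Scene:
    summary: str
    clips: list = field(default_factory=list)


def overlap_score(query: str, text: str) -> float:
    """Stand-in relevance score: fraction of query words that appear in the text."""
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    t = set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(q & t) / len(q) if q else 0.0


def coarse_to_fine(query: str, scenes: list, top_scenes: int = 1, top_clips: int = 1) -> list:
    """Rank scene summaries first, then rank only the clips inside the best scenes."""
    best_scenes = sorted(
        scenes, key=lambda s: overlap_score(query, s.summary), reverse=True
    )[:top_scenes]
    candidates = [clip for scene in best_scenes for clip in scene.clips]
    return sorted(
        candidates, key=lambda c: overlap_score(query, c.description), reverse=True
    )[:top_clips]


scenes = [
    Scene(
        "The CFO presented quarterly financial results and missed EMEA targets",
        [
            Clip(443, 481, "Slide titled EMEA Revenue Q3 with a declining bar chart"),
            Clip(481, 520, "The CFO proposes a budget reduction for Q4"),
        ],
    ),
    Scene(
        "The team discussed hiring plans for engineering",
        [Clip(900, 930, "Slide listing open engineering roles")],
    ),
]
best = coarse_to_fine("EMEA revenue slide", scenes)[0]
print(best.start_s, best.end_s)  # 443 481
```

Note that the hiring scene also contains a slide, but it is never scored at clip level because its scene summary did not match; this is where the cost saving comes from, and also where recall can be lost.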

Dense Captioning as a Preprocessing Step

Dense captioning -- generating a natural language description for every few seconds of video -- has become the dominant preprocessing strategy for video retrieval. The insight is simple: text search is a solved problem, and dense captions convert video into text.

The ShareGPT4Video pattern popularized this approach: run a VLM (GPT-4V, Gemini, InternVL) over short clips, store the captions, and search over the captions with standard text retrieval.

# Dense captioning pipeline
for clip in split_video_into_clips(video, clip_duration=30):
    frames = sample_keyframes(clip, n=5)
    caption = vlm.generate(
        images=frames,
        prompt="Describe in detail what happens in these video frames. "
               "Include visual elements, text on screen, speaker actions, "
               "and any notable events."
    )
    store_caption(
        video_id=video.id,
        start_time=clip.start,
        end_time=clip.end,
        caption=caption,
        keyframe_embeddings=[embed(f) for f in frames]
    )

Dense captions enable powerful hybrid search: semantic search over caption text (for conceptual queries like "budget discussion") combined with visual search over keyframe embeddings (for appearance queries like "slide with a red chart"). The fusion of these two signals consistently outperforms either alone.
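The fusion method is not specified above. One common, score-free choice is reciprocal rank fusion (RRF), sketched here as an assumption rather than as the method any particular system uses. The clip IDs are made up, and k=60 is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings, k: int = 60) -> list:
    """Fuse several ranked lists of IDs; items ranked high in many lists win."""
    scores: dict = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda item: scores[item], reverse=True)


caption_hits = ["clip_14", "clip_15", "clip_02"]  # semantic search over captions
visual_hits = ["clip_14", "clip_30", "clip_15"]   # visual search over keyframes
print(reciprocal_rank_fusion([caption_hits, visual_hits]))
# ['clip_14', 'clip_15', 'clip_30', 'clip_02']
```

RRF uses only ranks, so it sidesteps the problem that caption similarity scores and visual similarity scores live on different scales.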

The cost is VLM inference at ingest time. A 60-minute video split into 30-second clips requires 120 VLM calls. At ~2 seconds per call on an A100, that is 4 minutes of GPU time -- a meaningful but manageable cost for valuable video content.
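The estimate above as a helper for other clip lengths and latencies. A sketch only: it assumes calls run sequentially on one GPU, and the 2-second latency is the rough figure quoted above, not a benchmark.

```python
import math


def captioning_cost(duration_minutes: float, clip_seconds: float = 30,
                    seconds_per_call: float = 2.0) -> tuple:
    """Return (number of VLM calls, total GPU minutes) for dense captioning."""
    calls = math.ceil(duration_minutes * 60 / clip_seconds)
    return calls, calls * seconds_per_call / 60


print(captioning_cost(60))                   # (120, 4.0)
print(captioning_cost(60, clip_seconds=10))  # (360, 12.0)
```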

Long-Context VLMs vs. Retrieval-Augmented Video

The central architectural question is whether to use a long-context VLM that processes the entire video at once, or a retrieval system that finds relevant segments before processing.

Long-Context Approach

Feed all keyframes and the full transcript into a 128K+ context VLM. The model reasons over everything simultaneously.

Strengths:

No information loss from retrieval filtering

Can find unexpected connections between distant parts of the video

Simple pipeline: no retriever to build or tune

Weaknesses:

Token limits cap video length (even 128K tokens covers only ~8 minutes at 1 FPS, or ~42 minutes at one frame every 5 seconds)

Inference cost scales at least linearly with video length

Attention may not focus on the right segments for a specific query

Retrieval-Augmented Approach

First retrieve the most relevant clips/frames using embeddings or captions, then feed only those segments to a VLM for reasoning.

Strengths:

Handles arbitrarily long videos

Inference cost scales with query complexity, not video length

Retrieved segments are pre-filtered for relevance

Weaknesses:

Retrieval can miss relevant segments (recall problem)

Cannot find connections between unretrieved segments

Requires building and maintaining a retrieval index

The Hybrid Pattern

Production systems increasingly use both: a retrieval pass to identify candidate segments, followed by a long-context VLM that processes the retrieved segments plus surrounding context. This gives the VLM enough information to reason about temporal relationships while keeping the token budget manageable.

# Hybrid: retrieve then reason
candidates = retriever.search(
    query="EMEA revenue discussion",
    video_id=video_id,
    top_k=10  # Get 10 most relevant clips
)

# Expand context: include 30s before and after each candidate
expanded = expand_temporal_context(candidates, margin_seconds=30)

# Deduplicate overlapping segments
segments = merge_overlapping(expanded)

# Feed to VLM for reasoning
answer = vlm.generate(
    # Flatten the per-segment keyframe lists into one chronological frame list
    frames=[frame for s in segments for frame in s.keyframes],
    transcript=[s.transcript for s in segments],
    query="What did the CFO say right after the EMEA revenue slide?"
)

This is the pattern used by Mixpeek's video retrieval pipeline: multi-stage retrieval identifies candidate segments, temporal context expansion ensures the VLM has enough surrounding information, and a reasoning model generates the final answer.
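The two helpers in the hybrid snippet, expand_temporal_context and merge_overlapping, are not defined in this guide. A minimal sketch of what they might look like, assuming a hypothetical Segment type that carries only start and end times (real segments would also carry keyframes and transcript text):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Segment:
    start_s: float
    end_s: float


def expand_temporal_context(segments, margin_seconds: float,
                            video_duration_s: Optional[float] = None) -> list:
    """Pad each segment by margin_seconds on both sides, clamped to the video bounds."""
    expanded = []
    for seg in segments:
        start = max(0.0, seg.start_s - margin_seconds)
        end = seg.end_s + margin_seconds
        if video_duration_s is not None:
            end = min(end, video_duration_s)
        expanded.append(Segment(start, end))
    return expanded


def merge_overlapping(segments) -> list:
    """Sort by start time and merge segments that overlap or touch."""
    merged: list = []
    for seg in sorted(segments, key=lambda s: s.start_s):
        if merged and seg.start_s <= merged[-1].end_s:
            last = merged[-1]
            merged[-1] = Segment(last.start_s, max(last.end_s, seg.end_s))
        else:
            merged.append(seg)
    return merged


candidates = [Segment(443, 481), Segment(500, 530), Segment(10, 20)]
expanded = expand_temporal_context(candidates, margin_seconds=30)
print(merge_overlapping(expanded))
# [Segment(start_s=0.0, end_s=50), Segment(start_s=413, end_s=560)]
```

Merging matters for the token budget: without it, the two neighboring EMEA clips would send their shared 41 seconds of frames to the VLM twice.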

Practical Implementation on Mixpeek

Here is how these concepts map to a Mixpeek pipeline:

import asyncio

from mixpeek import Mixpeek


async def main():
    mx = Mixpeek(api_key="API_KEY")

    # Ingest: extract hierarchical features
    await mx.collections.create(
        namespace_id="my-namespace",
        collection_name="my-collection",
        source={"type": "bucket", "bucket_ids": ["bkt_your_bucket"]},
        feature_extractor={"feature_extractor_name": "image_embedding", "version": "v1"},
    )

    # Retrieve: multi-stage search
    results = await mx.retrievers.execute(
        retriever_id="your-retriever-id",
        query="CFO discussing EMEA revenue decline",
    )
    return results


# The calls above use `await`, so they must run inside an event loop
results = asyncio.run(main())

A full pipeline extracts three feature layers at ingest time (frame embeddings, dense captions, speaker-attributed transcripts); the snippet above configures only the image embedding extractor for brevity. Multi-stage retrieval then finds the relevant segments. This handles hour-long videos efficiently because the retriever narrows the search space before any expensive VLM reasoning.

Related Guides

Video Scene Segmentation -- the shot-boundary detection algorithms used in frame sampling

Omnimodal Embeddings -- the embedding models that power frame-level visual search

Multi-Stage Retrieval -- the retrieval architecture for combining visual and text search

Speaker Diarization -- adding speaker identity to video transcripts

MCP Tool Design -- exposing video search as agent tools

Models -- browse VLMs, embedding models, and ASR models for video pipelines