    Agent Perception
    20 min read
    Updated 2026-05-19

    Long-Context Video Understanding for Agent Perception

    A technical guide to how AI agents process hour-long videos. Covers frame sampling strategies, token budget management, hierarchical video representations, dense captioning for retrieval, and the tradeoffs between long-context VLMs and retrieval-augmented approaches.

    Video Understanding
    Long Context
    Agent Perception
    VLM
    Frame Sampling

    The Hour-Long Video Problem



    An AI agent that can answer questions about a 5-second clip is a demo. An agent that can answer questions about a 60-minute meeting recording is a product. The gap between these two is not just scale -- it requires fundamentally different approaches to how video content is represented, compressed, and searched.

    Consider the query: "In the board meeting, what did the CFO say right after the slide showing declining EMEA revenue?" Answering this requires the agent to:

    1. Identify which frames contain a slide about EMEA revenue
    2. Find the temporal boundary where that slide ends
    3. Locate the CFO's speech in the audio immediately after
    4. Extract and return the relevant transcript segment

    A short video fits entirely in a VLM's context window. You pass all frames and the transcript, and the model reasons over everything at once. A 60-minute video at 1 FPS produces 3,600 frames. At ~256 visual tokens per frame, that is 921,600 tokens for frames alone -- far beyond even 128K-token models.

    This guide covers the algorithms and architectures that bridge this gap: how to decide which frames matter, how to compress video into searchable representations, and when to use long-context models versus retrieval-augmented approaches.

    Frame Sampling: Deciding What to See



    The first decision in any video understanding pipeline is frame sampling -- choosing which frames from the raw video stream to process. This decision determines the ceiling of everything downstream. A frame that is never sampled can never be found.

    Uniform Sampling



    The simplest approach: extract one frame every N seconds. At 1 FPS, a 60-minute video produces 3,600 frames. At 0.5 FPS, 1,800 frames. At one frame every 5 seconds, 720 frames.

    Uniform sampling is predictable and cheap. It works well when content changes at a roughly constant rate -- lectures, surveillance footage, interviews. It fails when the rate of change varies: consider a 10-minute static slide followed by 5 seconds of rapid whiteboard drawing. At one frame every 5 seconds, the static section produces 120 near-identical frames, while the important 5 seconds gets just one.
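    The failure mode above is easy to check numerically. The following is a minimal sketch, assuming sampling starts at t = 0; the helper names `uniform_timestamps` and `count_in_window` are illustrative and not part of any library.

```python
import math


def uniform_timestamps(duration_s: float, interval_s: float) -> list[float]:
    """Return sample times 0, interval, 2*interval, ... that fall before duration_s."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    count = math.ceil(duration_s / interval_s)
    return [i * interval_s for i in range(count)]


def count_in_window(timestamps: list[float], start_s: float, end_s: float) -> int:
    """Count samples that land in the half-open window [start_s, end_s)."""
    return sum(1 for t in timestamps if start_s <= t < end_s)


# A 10-minute static slide followed by a 5-second burst of activity,
# sampled at one frame every 5 seconds.
samples = uniform_timestamps(duration_s=605, interval_s=5)
static_frames = count_in_window(samples, 0, 600)
burst_frames = count_in_window(samples, 600, 605)
print(static_frames, burst_frames)
```

    Almost all of the sampled frames land on the static slide, which is the imbalance the adaptive strategies in the following sections try to correct.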

    Shot-Boundary Detection



    Shot-boundary detection identifies visual discontinuities -- cuts, dissolves, and fades -- that signal content changes. Instead of sampling at a fixed rate, you sample one frame per shot, or the first and last frame of each shot.

    The standard algorithm computes pixel-level or histogram-level differences between consecutive frames. When the difference exceeds a threshold, a shot boundary is declared:

    for each consecutive frame pair (f_t, f_{t+1}):
        diff = histogram_distance(f_t, f_{t+1})
        if diff > threshold:
            mark shot boundary at t+1
    


    FFmpeg implements this via the `select` filter with scene change detection. A threshold of 0.3-0.4 catches most hard cuts while ignoring gradual brightness changes.

    Shot-boundary detection adapts to content density. A fast-paced montage produces many shots (many frames sampled). A static presentation produces few shots (few frames sampled). This is more efficient than uniform sampling for heterogeneous content, but it misses within-shot changes -- a presenter drawing on a whiteboard during a single continuous shot will only be captured at the shot boundaries.
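    The histogram-difference loop can be sketched in plain Python. This is a toy version under stated assumptions: frames are flat lists of 8-bit grayscale pixel values, the distance is the total variation distance between normalized histograms, and the 0.3 threshold applies to that distance (it is not the same quantity as FFmpeg's scene score, so the two thresholds are not interchangeable). All function names are illustrative.

```python
def gray_histogram(pixels: list[int], bins: int = 16) -> list[float]:
    """Normalized intensity histogram of a frame given as 0-255 pixel values."""
    if not pixels:
        raise ValueError("frame has no pixels")
    hist = [0.0] * bins
    for p in pixels:
        hist[min(p * bins // 256, bins - 1)] += 1.0
    return [h / len(pixels) for h in hist]


def histogram_distance(h1: list[float], h2: list[float]) -> float:
    """Total variation distance: 0.0 for identical histograms, 1.0 for disjoint ones."""
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))


def detect_shot_boundaries(frames: list[list[int]], threshold: float = 0.3) -> list[int]:
    """Return indices t where frame t differs sharply from frame t-1."""
    boundaries = []
    previous = None
    for t, pixels in enumerate(frames):
        current = gray_histogram(pixels)
        if previous is not None and histogram_distance(previous, current) > threshold:
            boundaries.append(t)
        previous = current
    return boundaries


# Three dark frames (one slightly brighter) followed by a cut to two bright frames.
dark = [20] * 64
slightly_brighter = [24] * 64
bright = [220] * 64
cuts = detect_shot_boundaries([dark, slightly_brighter, dark, bright, bright])
print(cuts)  # [3]
```

    The small brightness change stays inside one histogram bin and is ignored, while the hard cut is flagged. Gradual dissolves spread the change over many frame pairs and generally need a windowed comparison rather than this pairwise check.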

    Semantic Clustering



    The most sophisticated approach: embed all frames (or a uniform subsample) into a visual embedding space, then cluster the embeddings. Each cluster represents a visually distinct "scene," and you sample the centroid frame from each cluster.

    # 1. Extract frames at a moderate rate
    frames = extract_frames(video, fps=2)

    # 2. Embed all frames
    embeddings = clip_model.encode_images(frames)

    # 3. Cluster by visual similarity
    clusters = kmeans(embeddings, k=num_desired_keyframes)

    # 4. Select centroid frame from each cluster
    keyframes = [frames[closest_to_centroid(c)] for c in clusters]


    Semantic clustering produces the most information-dense set of keyframes because it guarantees visual diversity. Two nearly identical slides will end up in the same cluster, and only one representative will be selected. This is the best approach when your token budget is tight and you need maximum visual coverage per frame.

    The tradeoff is compute cost: you need to embed every candidate frame before you can select keyframes. For a 60-minute video at 2 FPS, that is 7,200 CLIP forward passes just for the sampling step.
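    To make the clustering step concrete, here is a self-contained sketch that runs k-means over toy 2-D vectors standing in for frame embeddings and returns the index of the frame closest to each centroid. Everything here is an assumption for illustration: a real pipeline would use embeddings from a vision model and a library k-means, and the deterministic farthest-point initialization is chosen only to keep the example reproducible.

```python
def sq_dist(a: list[float], b: list[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]


def kmeans(points, k, iterations=20):
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Farthest-point initialization: start from the first point, then
    # repeatedly add the point farthest from all chosen centroids.
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    for _ in range(iterations):
        labels = assign(points, centroids)
        for j in range(k):
            members = [p for p, a in zip(points, labels) if a == j]
            if members:
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assign(points, centroids)


def select_keyframes(embeddings, k):
    """Return sorted frame indices, one per non-empty cluster, nearest to its centroid."""
    centroids, labels = kmeans(embeddings, k)
    chosen = []
    for j, centroid in enumerate(centroids):
        members = [i for i, a in enumerate(labels) if a == j]
        if members:
            chosen.append(min(members, key=lambda i: sq_dist(embeddings[i], centroid)))
    return sorted(chosen)


# Nine toy "frames" forming three visually distinct groups.
toy_embeddings = [
    [0.0, 0.0], [0.2, 0.0], [0.1, 0.0],
    [5.0, 5.0], [5.2, 5.0], [5.1, 5.0],
    [10.0, 0.0], [10.0, 0.2], [10.0, 0.1],
]
keyframe_indices = select_keyframes(toy_embeddings, k=3)
print(keyframe_indices)  # [2, 5, 8]
```

    Each group contributes exactly one representative, which is the diversity property described above.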

    Adaptive Sampling for Production



    Production systems typically combine approaches: uniform sampling as a baseline, shot-boundary detection for hard cuts, and semantic deduplication to remove near-identical frames. The pipeline looks like:

    1. Extract frames at 2 FPS (uniform baseline)
    2. Run shot-boundary detection to add boundary frames
    3. Embed all candidate frames
    4. Remove frames whose embedding cosine similarity to the previous selected frame exceeds 0.95
    5. Return the remaining frames as keyframes

    This produces 50-200 keyframes for a typical 60-minute meeting -- far more manageable than the 7,200 candidate frames extracted at 2 FPS, while preserving the important visual content.
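    The deduplication step (step 4) can be sketched as a single pass over the candidate embeddings in temporal order. This is a minimal illustration with toy 2-D vectors; the function names are hypothetical, and the 0.95 threshold is the one quoted in the pipeline above.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate_frames(embeddings: list[list[float]], max_similarity: float = 0.95) -> list[int]:
    """Keep a frame only if it is not too similar to the last frame that was kept."""
    kept: list[int] = []
    for i, emb in enumerate(embeddings):
        if not kept or cosine_similarity(emb, embeddings[kept[-1]]) <= max_similarity:
            kept.append(i)
    return kept


# Frames 0-1 show one slide, 2-3 another, and frame 4 returns to the first slide.
candidates = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0], [1.0, 0.0]]
kept_indices = deduplicate_frames(candidates)
print(kept_indices)  # [0, 2, 4]
```

    Because each frame is compared only with the previously kept frame, a slide that reappears later is kept again. That preserves the timeline at the cost of some redundancy; comparing against all kept frames would remove it but lose the temporal signal.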

    Token Budget Management



    Modern VLMs accept images as input, but each image consumes tokens. The exact cost depends on the model's vision encoder and resolution settings:

    Model                                  Tokens per Frame    128K Context Budget
    InternVL3 (dynamic resolution)         256-1024            125-500 frames
    Qwen3-VL (min_pixels to max_pixels)    256-1280            100-500 frames
    Phi-4-multimodal                       256                 ~500 frames

    At 256 tokens per frame with a 128K context window, you can fit roughly 500 frames -- about 8 minutes of video at 1 FPS, or 42 minutes at one frame per 5 seconds. This means even 128K-token models cannot naively process an hour of video at useful frame rates.
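    The budget arithmetic can be reproduced with two small helpers. This sketch treats 128K as 128,000 tokens to match the figures above; the `reserved_tokens` parameter is an added assumption that models the prompt and transcript sharing the same context window.

```python
def max_frames(context_tokens: int, tokens_per_frame: int, reserved_tokens: int = 0) -> int:
    """Whole frames that fit after reserving tokens for prompt and transcript."""
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    return max(0, (context_tokens - reserved_tokens) // tokens_per_frame)


def coverage_minutes(num_frames: int, seconds_per_frame: float) -> float:
    """Minutes of video covered when one frame represents seconds_per_frame."""
    return num_frames * seconds_per_frame / 60.0


CONTEXT_TOKENS = 128_000
frame_slots = max_frames(CONTEXT_TOKENS, tokens_per_frame=256)
print(frame_slots)                                  # 500
print(round(coverage_minutes(frame_slots, 1), 1))   # 8.3
print(round(coverage_minutes(frame_slots, 5), 1))   # 41.7

# Reserving 20,000 tokens for a transcript shrinks the frame budget further.
print(max_frames(CONTEXT_TOKENS, 256, reserved_tokens=20_000))  # 421
```

    In practice the transcript of an hour-long meeting is itself tens of thousands of tokens, so the usable frame budget is usually lower than the headline number.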

    Token budget management is the art of deciding how to allocate those 500 frame slots across an hour of video. The strategies mirror retrieval system design:

    Fixed allocation: Divide the token budget equally across the video duration. For 500 frames over 60 minutes, sample one frame every 7.2 seconds. Simple but wastes budget on static sections.

    Content-adaptive allocation: Use a cheap preprocessing step (shot detection, motion estimation) to identify high-information segments, then allocate more frames to those segments. A 5-minute rapid discussion gets 100 frames; a 10-minute static slide gets 10 frames.

    Query-aware allocation: If you know the query before processing, use a text-video retrieval model to identify the most relevant segments, then allocate frames disproportionately to those segments. This is the retrieval-augmented approach discussed below.
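    All three strategies reduce to splitting a fixed number of frame slots across segments according to some weight: equal weights for fixed allocation, motion or shot-density scores for content-adaptive allocation, and retrieval scores for query-aware allocation. The sketch below uses largest-remainder rounding so the allocations always sum to the budget. The function name and the example weights are hypothetical.

```python
import math
from fractions import Fraction


def allocate_frames(weights: list[float], total_frames: int, min_per_segment: int = 0) -> list[int]:
    """Split total_frames across segments in proportion to non-negative weights."""
    n = len(weights)
    if n == 0:
        return []
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    remaining = total_frames - n * min_per_segment
    if remaining < 0:
        raise ValueError("budget too small for the per-segment minimum")
    exact = [Fraction(w) for w in weights]  # exact arithmetic avoids rounding drift
    total_weight = sum(exact)
    if total_weight == 0:
        exact, total_weight = [Fraction(1)] * n, Fraction(n)
    shares = [remaining * w / total_weight for w in exact]
    allocation = [min_per_segment + math.floor(s) for s in shares]
    # Hand out the frames lost to rounding, largest fractional part first.
    leftover = total_frames - sum(allocation)
    by_fraction = sorted(range(n), key=lambda i: shares[i] - math.floor(shares[i]), reverse=True)
    for i in by_fraction[:leftover]:
        allocation[i] += 1
    return allocation


# Hypothetical activity weights: a rapid 5-minute discussion vs. a 10-minute static slide.
adaptive = allocate_frames([50, 5], total_frames=110)
print(adaptive)  # [100, 10]

# Fixed allocation is the special case of equal weights.
fixed = allocate_frames([1, 1, 1, 1], total_frames=500)
print(fixed)  # [125, 125, 125, 125]
```

    Setting `min_per_segment` to 1 or more guarantees that no segment is skipped entirely, which protects against a cheap activity score that wrongly rates a segment as empty.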

    Hierarchical Video Representations



    The most powerful approach to long video is hierarchical: represent the video at multiple levels of granularity, each optimized for different query types.

    Level 1: Frame Embeddings



    Every sampled frame is encoded into a dense vector by a visual embedding model (CLIP, SigLIP, DINOv2). These vectors enable visual similarity search -- "find frames that look like a bar chart" -- without reading any text or audio.

    Frame embeddings are cheap to store (one vector per frame, e.g. 768 dimensions for CLIP ViT-L/14) and fast to search (approximate nearest neighbor lookup). But they capture only visual appearance, not temporal relationships or semantic content.

    Level 2: Clip Descriptions



    Groups of 10-30 consecutive frames are fed to a VLM that generates a natural language description of what happens in that clip. A 60-minute video produces 20-60 clip descriptions, each a paragraph of text.

    Clip 14 [07:23 - 08:01]: The presenter switches to a slide titled
    "EMEA Revenue Q3." A declining bar chart is visible. The presenter
    points to the rightmost bar and says "We missed target by 12 percent,
    primarily driven by currency headwinds in the UK market."
    


    Clip descriptions bridge the visual-textual gap. They can be embedded with a text embedding model and searched with natural language queries. They capture temporal flow ("switches to," "points to") that individual frame embeddings cannot.

    Level 3: Scene Summaries



    Scenes (groups of related clips) get higher-level summaries that capture narrative arcs: "The CFO presented the quarterly financial results, highlighting missed EMEA targets and proposing a budget reduction for Q4."

    Scene summaries enable abstract queries: "What were the key financial concerns raised in the meeting?" These queries cannot be answered by frame-level or clip-level search because they require synthesizing information across multiple clips.

    Level 4: Document Summary



    A single summary of the entire video. Useful for classification, routing, and high-level questions ("What type of meeting is this?") but too compressed for specific retrieval.

    Building the Hierarchy



    The practical pipeline processes video bottom-up:

    1. Sample keyframes (Level 1) using adaptive sampling
    2. Group keyframes into clips by shot boundaries or fixed windows
    3. Generate clip descriptions (Level 2) by feeding frame groups to a VLM
    4. Cluster clip descriptions into scenes by semantic similarity
    5. Summarize each scene (Level 3) by feeding clip descriptions to an LLM
    6. Summarize the full video (Level 4)

    Each level is stored as a separate feature in the search index. A query first hits the scene summaries (cheap, broad recall), then drills into clip descriptions (medium cost, higher precision), then retrieves specific frames (highest precision, visual evidence).
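    The coarse-to-fine lookup can be sketched with two small data classes. This is an illustration only: the `Clip` and `Scene` types are hypothetical, and `overlap_score` is a crude word-overlap stand-in for the embedding similarity a real index would use.

```python
from dataclasses import dataclass, field


@dataclass
class Clip:
    start_s: float
    end_s: float
    description: str                                        # Level 2 text
    keyframe_ids: list[int] = field(default_factory=list)   # pointers to Level 1 frames


@dataclass
class Scene:
    summary: str        # Level 3 text
    clips: list[Clip]


def overlap_score(query: str, text: str) -> int:
    """Count query words that also appear in text (stand-in for embedding similarity)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def drill_down(scenes: list[Scene], query: str, top_scenes: int = 1, top_clips: int = 1) -> list[Clip]:
    """Rank scenes by summary first, then rank only the clips inside the best scenes."""
    best_scenes = sorted(scenes, key=lambda s: overlap_score(query, s.summary), reverse=True)[:top_scenes]
    candidates = [clip for scene in best_scenes for clip in scene.clips]
    return sorted(candidates, key=lambda c: overlap_score(query, c.description), reverse=True)[:top_clips]


finance = Scene(
    summary="CFO presented quarterly financial results and EMEA revenue",
    clips=[
        Clip(443, 481, "slide titled EMEA revenue with a declining bar chart", [101, 102]),
        Clip(481, 520, "CFO proposes a budget reduction", [103]),
    ],
)
roadmap = Scene(
    summary="Engineering roadmap review",
    clips=[Clip(1200, 1230, "team discusses hiring plan", [200])],
)

hits = drill_down([finance, roadmap], "EMEA revenue slide")
print(hits[0].start_s, hits[0].keyframe_ids)  # 443 [101, 102]
```

    The returned clip carries its keyframe IDs, so the final step of fetching frames as visual evidence is a direct lookup rather than another search.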

    Dense Captioning as a Preprocessing Step



    Dense captioning -- generating a natural language description for every few seconds of video -- has become the dominant preprocessing strategy for video retrieval. The insight is simple: text search is a solved problem, and dense captions convert video into text.

    The ShareGPT4Video pattern popularized this approach: run a VLM (GPT-4V, Gemini, InternVL) over short clips, store the captions, and search over the captions with standard text retrieval.

    # Dense captioning pipeline
    for clip in split_video_into_clips(video, clip_duration=30):
        frames = sample_keyframes(clip, n=5)
        caption = vlm.generate(
            images=frames,
            prompt="Describe in detail what happens in these video frames. "
                   "Include visual elements, text on screen, speaker actions, "
                   "and any notable events."
        )
        store_caption(
            video_id=video.id,
            start_time=clip.start,
            end_time=clip.end,
            caption=caption,
            keyframe_embeddings=[embed(f) for f in frames]
        )
    


    Dense captions enable powerful hybrid search: semantic search over caption text (for conceptual queries like "budget discussion") combined with visual search over keyframe embeddings (for appearance queries like "slide with a red chart"). The fusion of these two signals consistently outperforms either alone.
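    One common way to fuse the two result lists is reciprocal rank fusion (RRF), which needs only the rank of each clip in each list and so avoids calibrating text and visual scores against each other. The guide does not say which fusion method is used, so treat this as one reasonable option rather than the method; the clip IDs are made up, and k = 60 is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each item scores the sum of 1 / (k + rank) over the lists it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by item ID for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


caption_hits = ["clip_14", "clip_15", "clip_02"]   # semantic search over caption text
visual_hits = ["clip_14", "clip_30", "clip_15"]    # visual search over keyframe embeddings

fused = reciprocal_rank_fusion([caption_hits, visual_hits])
print(fused)  # ['clip_14', 'clip_15', 'clip_30', 'clip_02']
```

    Clips found by both searches rise to the top, while clips found by only one are retained lower in the list instead of being dropped.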

    The cost is VLM inference at ingest time. A 60-minute video split into 30-second clips requires 120 VLM calls. At ~2 seconds per call on an A100, that is 4 minutes of GPU time -- a meaningful but manageable cost for valuable video content.

    Long-Context VLMs vs. Retrieval-Augmented Video



    The central architectural question is whether to use a long-context VLM that processes the entire video at once, or a retrieval system that finds relevant segments before processing.

    Long-Context Approach



    Feed all keyframes and the full transcript into a 128K+ context VLM. The model reasons over everything simultaneously.

    Strengths:
    - No information loss from retrieval filtering
    - Can find unexpected connections between distant parts of the video
    - Simple pipeline: no retriever to build or tune

    Weaknesses:
    - Token limits cap video length (at 256 tokens per frame, even 128K handles only ~8 minutes at 1 FPS, or ~42 minutes at one frame every 5 seconds)
    - Inference cost grows at least linearly with video length
    - Attention may not focus on the right segments for a specific query

    Retrieval-Augmented Approach



    First retrieve the most relevant clips/frames using embeddings or captions, then feed only those segments to a VLM for reasoning.

    Strengths:
    - Handles arbitrarily long videos
    - Inference cost scales with query complexity, not video length
    - Retrieved segments are pre-filtered for relevance

    Weaknesses:
    - Retrieval can miss relevant segments (recall problem)
    - Cannot find connections between unretrieved segments
    - Requires building and maintaining a retrieval index

    The Hybrid Pattern



    Production systems increasingly use both: a retrieval pass to identify candidate segments, followed by a long-context VLM that processes the retrieved segments plus surrounding context. This gives the VLM enough information to reason about temporal relationships while keeping the token budget manageable.

    # Hybrid: retrieve then reason
    candidates = retriever.search(
        query="EMEA revenue discussion",
        video_id=video_id,
        top_k=10  # Get 10 most relevant clips
    )

    # Expand context: include 30s before and after each candidate
    expanded = expand_temporal_context(candidates, margin_seconds=30)

    # Deduplicate overlapping segments
    segments = merge_overlapping(expanded)

    # Feed to VLM for reasoning
    answer = vlm.generate(
        frames=[s.keyframes for s in segments],
        transcript=[s.transcript for s in segments],
        query="What did the CFO say right after the EMEA revenue slide?"
    )


    This is the pattern used by Mixpeek's video retrieval pipeline: multi-stage retrieval identifies candidate segments, temporal context expansion ensures the VLM has enough surrounding information, and a reasoning model generates the final answer.
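    The two helpers named in the snippet above, `expand_temporal_context` and `merge_overlapping`, are not defined in this guide. A minimal sketch of what they might do follows, assuming segments are plain `(start_seconds, end_seconds)` tuples rather than whatever objects a real retriever returns; the optional `video_duration` clamp is an added assumption.

```python
from typing import Optional

Segment = tuple[float, float]


def expand_temporal_context(segments: list[Segment], margin_seconds: float,
                            video_duration: Optional[float] = None) -> list[Segment]:
    """Widen each segment by margin_seconds on both sides, clamped to the video bounds."""
    expanded = []
    for start, end in segments:
        new_start = max(0.0, start - margin_seconds)
        new_end = end + margin_seconds
        if video_duration is not None:
            new_end = min(video_duration, new_end)
        expanded.append((new_start, new_end))
    return expanded


def merge_overlapping(segments: list[Segment]) -> list[Segment]:
    """Merge segments that overlap or touch, returning them in temporal order."""
    merged: list[Segment] = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Two adjacent hits around the EMEA slide, one near the start, one much later.
retrieved = [(443, 481), (500, 530), (10, 20), (2000, 2030)]
windows = merge_overlapping(expand_temporal_context(retrieved, margin_seconds=30))
print(windows)  # [(0.0, 50), (413, 560), (1970, 2060)]
```

    Merging matters for the token budget: without it, the frames between 470 s and 511 s would be sent to the VLM twice.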

    Practical Implementation on Mixpeek



    Here is how these concepts map to a Mixpeek pipeline:

    from mixpeek import Mixpeek

    mx = Mixpeek(api_key="API_KEY")

    # Ingest: extract hierarchical features
    await mx.collections.ingest(
        collection_id="meetings",
        source={"url": "https://example.com/board-meeting.mp4"},
        feature_extractors=[
            # Level 1: Frame embeddings
            {"name": "image_embedding", "version": "v1",
             "params": {"model_id": "Qwen/Qwen3-VL-Embedding-8B"}},
            # Level 2: Dense captions
            {"name": "scene_caption", "version": "v1",
             "params": {"model_id": "bytedance-research/Vidi-7B"}},
            # Audio: Transcription with speaker labels
            {"name": "transcription", "version": "v1",
             "params": {"model_id": "ibm-granite/granite-speech-4.1-2b-plus",
                        "enable_diarization": True}}
        ]
    )

    # Retrieve: multi-stage search
    results = await mx.retrievers.retrieve(
        queries=[{"type": "text", "value": "CFO discussing EMEA revenue decline"}],
        collection_ids=["meetings"],
        stages=[
            # Broad recall over captions
            {"type": "feature_search", "feature": "scene_caption", "top_k": 50},
            # Visual search over frame embeddings
            {"type": "feature_search", "feature": "image_embedding", "top_k": 50},
            # Precision reranking
            {"type": "rerank", "model": "Qwen/Qwen3-VL-Reranker-2B", "top_k": 10}
        ]
    )


    The pipeline extracts three feature layers at ingest time (frame embeddings, dense captions, speaker-attributed transcripts), then uses multi-stage retrieval to find relevant segments. This handles hour-long videos efficiently because the retriever narrows the search space before any expensive VLM reasoning.

    Related Guides



    - Video Scene Segmentation -- the shot-boundary detection algorithms used in frame sampling
    - Omnimodal Embeddings -- the embedding models that power frame-level visual search
    - Multi-Stage Retrieval -- the retrieval architecture for combining visual and text search
    - Speaker Diarization -- adding speaker identity to video transcripts
    - MCP Tool Design -- exposing video search as agent tools
    - Models -- browse VLMs, embedding models, and ASR models for video pipelines