The Perception Gap in AI Agents
Modern AI agents can reason, plan, and use tools. But most of them are blind and deaf. They operate on text: API responses, database rows, structured JSON. When an agent encounters a video file, an audio recording, or an image, it typically has two options: ignore it, or pass it to a separate system and hope someone already extracted the right text.
This is the perception gap. The information an agent needs is locked inside unstructured media -- a product demo recorded on video, a customer call stored as audio, a scanned contract sitting as a PDF. The agent cannot search it, reason over it, or act on it.
Closing this gap requires a perception layer: a set of pipelines that decompose raw media into structured, queryable features that an agent can search and reason over at inference time. This guide explains how to build one.
What a Perception Layer Does
A perception layer sits between raw media storage and the agent's reasoning loop. It has two phases:
Ingestion (offline): Break each media file into chunks, run feature extractors, and store the results in an index. This happens once per file, ahead of time.
Retrieval (online): When the agent needs information from media, it queries the index with natural language or embedding similarity. The perception layer returns the relevant segments, features, and metadata.
The key insight is that you are not trying to "understand" the entire video at query time. You are building a pre-computed index of features at multiple granularities, so the agent can look up exactly what it needs in milliseconds.
Architecture Overview
A video perception layer has four components:
1. Chunker -- Splits the video into segments (scenes, fixed intervals, or shot boundaries)
2. Feature extractors -- Run models on each chunk to produce embeddings, labels, transcripts, and metadata
3. Index -- Stores extracted features in a searchable format (vector index + metadata store)
4. Query interface -- Lets the agent search across features using natural language, filters, or hybrid queries
Each component can be implemented independently, but the architecture decisions at each layer affect what the agent can and cannot perceive.
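A minimal sketch of how the four components could fit together, assuming an in-memory list stands in for the index and plain callables stand in for the chunker and extractors. The names `PerceptionLayer`, `Chunk`, and `fixed_chunker` are hypothetical and used for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    video_id: str
    start: float  # seconds
    end: float    # seconds
    features: dict = field(default_factory=dict)


class PerceptionLayer:
    """Toy wiring of chunker -> extractors -> index -> query interface."""

    def __init__(self, chunker, extractors):
        self.chunker = chunker        # callable: duration -> [(start, end), ...]
        self.extractors = extractors  # dict: feature name -> callable(Chunk)
        self.index = []               # in-memory stand-in for a real index

    def ingest(self, video_id, duration_sec):
        # Offline phase: chunk once, run every extractor, store the result.
        for start, end in self.chunker(duration_sec):
            chunk = Chunk(video_id, start, end)
            for name, extract in self.extractors.items():
                chunk.features[name] = extract(chunk)
            self.index.append(chunk)

    def query(self, predicate):
        # Online phase: look up pre-computed features, no model inference.
        return [c for c in self.index if predicate(c)]


def fixed_chunker(duration_sec, size_sec=10.0):
    # Hypothetical chunker: full-size chunks only, trailing remainder dropped.
    starts = [i * size_sec for i in range(int(duration_sec // size_sec))]
    return [(s, s + size_sec) for s in starts]


layer = PerceptionLayer(fixed_chunker, {"length": lambda c: c.end - c.start})
layer.ingest("demo.mp4", 30)
hits = layer.query(lambda c: c.start >= 10)
```

In a real system each extractor would wrap a model call and the list would be replaced by a vector index plus a metadata store, but the split between `ingest` and `query` stays the same.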
Step 1: Chunking Strategy
Video is continuous, but retrieval systems work with discrete units. The chunker determines what a "result" looks like when the agent searches.
Fixed-interval sampling
The simplest approach: extract one frame every N seconds. Common intervals are 1 frame per second (1 FPS) for dense understanding, or 1 frame every 5-10 seconds for coarse search.
When to use: Surveillance footage, dashcam video, any content where visual change is gradual. Also good as a baseline when you are unsure what matters.
Tradeoff: Misses brief events (a flash of a logo, a 2-second gesture). Produces many redundant frames in static scenes. At 1 FPS, a 1-hour video generates 3,600 frames to embed.
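The frame counts above follow directly from the sampling interval. A small sketch, assuming frames are sampled at the start of each interval; `sample_timestamps` is a hypothetical helper, not part of any library.

```python
def sample_timestamps(duration_sec, interval_sec):
    """Return the timestamps (in seconds) at which to grab a frame."""
    if interval_sec <= 0:
        raise ValueError("interval_sec must be positive")
    count = int(duration_sec // interval_sec)
    return [i * interval_sec for i in range(count)]


one_fps = sample_timestamps(3600, 1)   # dense: one frame per second
coarse = sample_timestamps(3600, 10)   # coarse: one frame every 10 seconds
```

The timestamps can then be passed to whatever decoder you use to seek and extract frames.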
Scene-boundary detection
Use a model to detect visual transitions -- cuts, dissolves, and gradual scene changes. Each scene becomes one chunk. Within each scene, you can sample a keyframe (the first or middle frame) or embed the entire segment.
When to use: Edited video (films, ads, presentations, news broadcasts). Scene boundaries align with semantic boundaries in professionally produced content.
Algorithms: PySceneDetect (open source, with content-difference and threshold-based detectors), TransNetV2 (a neural shot boundary detector with strong benchmark results), or simple pixel-difference thresholds for fast processing.
Tradeoff: Fails on single-shot content like lectures, interviews, and webcam recordings where there are no visual cuts.
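The simplest of the options above, a pixel-difference threshold, can be sketched in a few lines. This assumes each frame is already decoded into a flat list of grayscale values (0-255); the function names and the threshold of 30 are illustrative choices, and this approach only catches hard cuts, not dissolves.

```python
def mean_abs_diff(frame_a, frame_b):
    """Average absolute per-pixel difference between two equal-size frames."""
    return sum(abs(x - y) for x, y in zip(frame_a, frame_b)) / len(frame_a)


def detect_cuts(frames, threshold=30.0):
    """Return indices i where frame i starts a new scene."""
    cuts = []
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold:
            cuts.append(i)
    return cuts


def scenes_from_cuts(cuts, n_frames):
    """Turn cut indices into (start_frame, end_frame_exclusive) pairs."""
    starts = [0] + cuts
    ends = cuts + [n_frames]
    return list(zip(starts, ends))


# Synthetic example: three dark frames followed by two bright frames.
frames = [[10] * 16] * 3 + [[200] * 16] * 2
cuts = detect_cuts(frames)
scenes = scenes_from_cuts(cuts, len(frames))
```

Here `cuts` is `[3]` and `scenes` is `[(0, 3), (3, 5)]`. A dedicated detector is a better fit for gradual transitions.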
Semantic chunking
Combine multiple signals to determine segment boundaries: visual change, speaker turns (from diarization), topic shifts (from transcript analysis), and silence detection. This produces chunks that correspond to meaningful segments -- "the part where the speaker discusses pricing" rather than "frames 1200-1800."
When to use: Meetings, lectures, podcasts, interviews -- content where meaning is carried by speech more than visuals.
Tradeoff: Requires running transcription and diarization before chunking, which adds latency to the ingestion pipeline. More complex to implement.
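A minimal sketch of two of the signals named above, speaker turns and silence gaps, assuming diarized utterances are already available as dictionaries with `speaker`, `start`, `end`, and `text` keys. Topic-shift detection is left out, and the 2-second gap threshold is an arbitrary illustrative value.

```python
def semantic_chunks(utterances, max_gap_sec=2.0):
    """Merge consecutive utterances into chunks.

    A new chunk starts whenever the speaker changes or the silence since
    the previous utterance exceeds max_gap_sec. Input must be sorted by start.
    """
    chunks = []
    for u in utterances:
        last = chunks[-1] if chunks else None
        same_speaker = last is not None and last["speaker"] == u["speaker"]
        small_gap = last is not None and u["start"] - last["end"] <= max_gap_sec
        if same_speaker and small_gap:
            # Extend the current chunk in place.
            last["end"] = u["end"]
            last["text"] += " " + u["text"]
        else:
            chunks.append(dict(u))  # copy so the input is not mutated
    return chunks


utterances = [
    {"speaker": "SPEAKER_01", "start": 0.0, "end": 5.0, "text": "Welcome."},
    {"speaker": "SPEAKER_01", "start": 5.5, "end": 9.0, "text": "Let's begin."},
    {"speaker": "SPEAKER_02", "start": 9.2, "end": 12.0, "text": "Thanks."},
    {"speaker": "SPEAKER_02", "start": 20.0, "end": 25.0, "text": "On pricing..."},
]
chunks = semantic_chunks(utterances)
```

The first two utterances merge into one chunk; the long pause before the last utterance starts a new chunk even though the speaker is unchanged.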
Choosing a strategy
| Content type | Recommended strategy | Typical chunk count (1h video) |
| Surveillance / dashcam | Fixed interval (1 FPS) | 3,600 frames |
| Film / ads / news | Scene boundary | 50-200 scenes |
| Lectures / meetings | Semantic (speaker + topic) | 20-80 segments |
| UGC / social video | Scene boundary + fixed fallback | 30-150 segments |
Step 2: Multi-Granularity Feature Extraction
Once the video is chunked, run feature extractors on each segment. The goal is to produce multiple representations at different levels of abstraction.
Level 1: Dense embeddings (what does this look like?)
Run a vision embedding model (CLIP, SigLIP, or a multimodal embedding model like Jina Embeddings v4) on each keyframe. This produces a vector that captures the visual semantics of the frame -- objects, scene composition, colors, activities.
For video-native embedding, models like Google VideoPrism or InternVideo2 process multiple frames as a temporal sequence, producing a single vector that captures motion and change. These are more compute-intensive but better for action recognition.
Keyframe -> CLIP -> 768-dim vector -> vector index
Video segment -> VideoPrism -> 1024-dim vector -> vector index
Level 2: Structured labels (what objects are present?)
Run object detection (YOLO, DETR, Grounding DINO) and scene classification on keyframes. This produces discrete labels: "person", "car", "whiteboard", "outdoor", "office." These are stored as filterable metadata alongside the embeddings.
Keyframe -> YOLOv8 -> [{label: "person", confidence: 0.94, bbox: [120, 80, 340, 420]}, ...]
Level 3: Transcription and speech features (what is being said?)
Run automatic speech recognition (ASR) with a model such as Whisper or Parakeet on the audio track. If the content has multiple speakers, run speaker diarization (pyannote.audio) to attribute each utterance. The transcript is stored as searchable text and also embedded for semantic search.
Audio -> Whisper -> [{text: "Let me show you the Q3 results", start: 12.4, end: 15.1}, ...]
Audio -> Pyannote -> [{speaker: "SPEAKER_01", start: 12.4, end: 28.7}, ...]
Level 4: Scene descriptions (what is happening?)
Run a vision-language model (Qwen3-VL, Florence-2, Gemma 4) on keyframes or short clips to generate natural language descriptions. These captions bridge the gap between raw visual features and the text-based queries agents will use.
Keyframe -> Qwen3-VL -> "A presenter standing at a whiteboard, pointing to a bar chart
showing quarterly revenue growth. The chart shows Q3 at $4.2M."
The feature matrix
For each chunk, the perception layer stores:
| Feature | Type | Index | Query method |
| Visual embedding | 768-dim vector | Vector (HNSW) | Cosine similarity |
| Scene description | Text | Full-text + vector | Semantic search |
| Transcript | Text + timestamps | Full-text + vector | Keyword or semantic |
| Object labels | Structured | Metadata filter | Exact match / filter |
| Speaker ID | Structured | Metadata filter | Filter by speaker |
| Face embedding | 512-dim vector | Vector | Face similarity |
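One way to picture the per-chunk record is as a single typed structure whose fields are routed to different indexes. This is a sketch only; `ChunkRecord` and its method names are hypothetical, and the face embedding is omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkRecord:
    video_id: str
    start: float  # seconds on the shared video timeline
    end: float
    visual_embedding: list = field(default_factory=list)  # e.g. 768 floats
    scene_description: str = ""
    transcript: str = ""
    object_labels: list = field(default_factory=list)
    speaker_ids: list = field(default_factory=list)

    def vector_fields(self):
        """Fields destined for the vector index."""
        return {"visual_embedding": self.visual_embedding}

    def text_fields(self):
        """Fields destined for full-text (and optionally vector) search."""
        return {
            "scene_description": self.scene_description,
            "transcript": self.transcript,
        }

    def metadata(self):
        """Structured fields used as exact-match filters."""
        return {
            "video_id": self.video_id,
            "start": self.start,
            "end": self.end,
            "object_labels": self.object_labels,
            "speaker_ids": self.speaker_ids,
        }


record = ChunkRecord(
    "meeting.mp4", 12.4, 28.7,
    transcript="Let me show you the Q3 results",
    object_labels=["person", "whiteboard"],
    speaker_ids=["SPEAKER_01"],
)
```

Keeping `start` and `end` on every record is what lets results from different feature types be joined back to the same moment in the video.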
Step 3: Building the Index
The extracted features need to be stored in a way that supports fast, flexible retrieval. There are three common patterns:
Pattern A: Vector database + metadata store
Store embeddings in a vector database (Qdrant, Weaviate, Milvus) and structured metadata alongside them. Query with hybrid search: vector similarity filtered by metadata predicates.
Pros: Purpose-built for similarity search. Mature ecosystems.
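Cons: Managed vector databases typically price by vector count, dimensions, or memory footprint, so cost grows linearly with the index. A 1-hour video at 1 FPS with 4 embedding types (visual, audio, transcript, description) produces 3,600 x 4 = 14,400 vectors. At an illustrative $0.10 per 1K vectors/month, 10,000 hours of video is 144M vectors, or $14,400/month just for storage.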
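The storage arithmetic above can be reproduced with a short calculation. The $0.10 per 1K vectors/month rate is the illustrative figure from the text, not a quote from any vendor, and the function names are hypothetical.

```python
def vectors_per_hour(fps, embedding_types):
    """Vectors produced by one hour of video."""
    return int(3600 * fps * embedding_types)


def monthly_vector_cost(hours, fps=1.0, embedding_types=4, usd_per_1k=0.10):
    """Return (total_vectors, monthly_cost_usd) for a video library."""
    total = hours * vectors_per_hour(fps, embedding_types)
    cost = round(total / 1000 * usd_per_1k, 2)
    return total, cost


total_vectors, cost = monthly_vector_cost(10_000)
```

Lowering the sampling rate is the most direct lever: at one frame every 5 seconds (`fps=0.2`), the same library needs a fifth of the vectors.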
Pattern B: Object storage + lightweight index
Store extracted features as structured files (Parquet, JSON) in object storage (S3, GCS). Build a lightweight vector index (FAISS, ScaNN) that loads on demand or runs as a sidecar. Metadata queries go through a SQL or document store.
Pros: 10-50x cheaper than vector databases at scale. Object storage costs $0.02/GB/month versus $1-5/GB/month for vector databases. Storage scales to billions of vectors without a database cluster to operate.
Cons: Higher query latency (10-50ms vs 1-5ms). Requires building the query layer yourself.
Pattern C: Multimodal data warehouse
Use a platform that handles ingestion, extraction, indexing, and retrieval as a unified system. The features are stored in a warehouse-style architecture with SQL-like query semantics over both structured metadata and vector embeddings.
Pros: Fastest path to a working system. Handles the orchestration complexity of running multiple models, storing heterogeneous features, and serving hybrid queries.
Cons: Platform dependency.
Which pattern to choose
For prototyping and small-scale deployments (under 10,000 videos), Pattern A is a quick way to get running with mature, well-documented tooling. For production systems at scale, Pattern B or C is necessary to control costs. The choice between B and C depends on whether you want to build or buy the orchestration layer.
Step 4: The Query Interface
The perception layer needs an API that agents can call. The interface should support three query types:
Semantic search
The agent provides a natural language query, and the perception layer returns the most relevant video segments.
# Agent asks: "When did the presenter show the revenue chart?"
results = retriever.search(
    query="presenter showing revenue chart",
    modalities=["visual_embedding", "scene_description", "transcript"],
    top_k=5
)
# Returns: [{video_id, start_time, end_time, score, features}, ...]
This works by embedding the query with the same models used during ingestion, then running similarity search across all relevant feature types. Results from different modalities are fused using reciprocal rank fusion (RRF) or a learned reranker.
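Reciprocal rank fusion, mentioned above, needs only the rank of each result in each modality's list, not the raw scores. A minimal sketch, assuming each modality returns an ordered list of segment IDs; `k=60` is the constant commonly used with RRF, and the segment IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one ranked list.

    Each item scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda item: scores[item], reverse=True)


visual = ["seg_a", "seg_b", "seg_c"]
transcript = ["seg_b", "seg_a", "seg_d"]
description = ["seg_b", "seg_c", "seg_a"]
fused = reciprocal_rank_fusion([visual, transcript, description])
```

`seg_b` comes out on top because it ranks first in two of the three lists. Because RRF ignores score magnitudes, it sidesteps the problem that cosine similarities from different embedding models are not directly comparable.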
Filtered search
The agent narrows results using structured predicates before running similarity search.
results = retriever.search(
    query="explain the architecture",
    filters={
        "speaker": "SPEAKER_01",
        "objects_contains": "whiteboard",
        "duration_gte": 10
    },
    top_k=5
)
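Under the hood, a filtered search amounts to applying the predicates first and ranking only the survivors. The sketch below assumes chunks are plain dictionaries held in memory and that the three filter keys behave as their names suggest; that interpretation, and every name in the code, is an assumption for illustration.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def passes(chunk, filters):
    # Filter keys mirror the example above; their semantics are assumed.
    if "speaker" in filters and chunk["speaker"] != filters["speaker"]:
        return False
    if "objects_contains" in filters and filters["objects_contains"] not in chunk["objects"]:
        return False
    if "duration_gte" in filters and chunk["end"] - chunk["start"] < filters["duration_gte"]:
        return False
    return True


def filtered_search(chunks, query_vec, filters, top_k=5):
    candidates = [c for c in chunks if passes(c, filters)]  # pre-filter
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return ranked[:top_k]


chunks = [
    {"id": "c1", "speaker": "SPEAKER_01", "objects": ["whiteboard"], "start": 0, "end": 20, "vec": [1.0, 0.0]},
    {"id": "c2", "speaker": "SPEAKER_01", "objects": ["whiteboard"], "start": 20, "end": 25, "vec": [1.0, 0.0]},
    {"id": "c3", "speaker": "SPEAKER_02", "objects": ["whiteboard"], "start": 30, "end": 60, "vec": [1.0, 0.0]},
    {"id": "c4", "speaker": "SPEAKER_01", "objects": ["whiteboard"], "start": 60, "end": 90, "vec": [0.0, 1.0]},
]
filters = {"speaker": "SPEAKER_01", "objects_contains": "whiteboard", "duration_gte": 10}
top = filtered_search(chunks, [1.0, 0.1], filters)
```

Chunk `c2` is dropped for being too short and `c3` for the wrong speaker, so only `c1` and `c4` are scored. Production vector databases apply the same idea inside the index rather than in a Python loop.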
Multi-stage retrieval
For complex queries, chain multiple retrieval stages: a broad vector search followed by a reranker, or a metadata filter followed by semantic search on the filtered set.
# Stage 1: Find all segments with a whiteboard
# Stage 2: Among those, find the ones most similar to "architecture diagram"
# Stage 3: Rerank with a cross-encoder
pipeline = [
    {"stage": "filter", "field": "objects", "contains": "whiteboard"},
    {"stage": "vector_search", "query": "architecture diagram", "top_k": 20},
    {"stage": "rerank", "model": "cross-encoder", "top_k": 5}
]
results = retriever.search(pipeline=pipeline)
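To make the staging concrete, here is a sketch of how a retriever might execute such a pipeline over in-memory chunks. The scoring functions are passed in as plain callables, and a word-overlap count stands in for both the embedding similarity and the cross-encoder; `run_pipeline` and `word_overlap` are hypothetical names.

```python
def run_pipeline(chunks, pipeline, similarity, rerank_score):
    """Apply filter / vector_search / rerank stages in order."""
    candidates = list(chunks)
    query = None  # carried forward from the vector_search stage to rerank
    for stage in pipeline:
        kind = stage["stage"]
        if kind == "filter":
            candidates = [
                c for c in candidates
                if stage["contains"] in c.get(stage["field"], [])
            ]
        elif kind == "vector_search":
            query = stage["query"]
            candidates = sorted(
                candidates, key=lambda c: similarity(query, c), reverse=True
            )[: stage["top_k"]]
        elif kind == "rerank":
            candidates = sorted(
                candidates, key=lambda c: rerank_score(query, c), reverse=True
            )[: stage["top_k"]]
        else:
            raise ValueError(f"unknown stage: {kind}")
    return candidates


def word_overlap(query, chunk):
    # Stand-in scorer: number of query words found in the description.
    return len(set(query.split()) & set(chunk["description"].split()))


library = [
    {"id": "a", "objects": ["whiteboard"], "description": "architecture diagram on whiteboard"},
    {"id": "b", "objects": ["laptop"], "description": "architecture diagram on screen"},
    {"id": "c", "objects": ["whiteboard"], "description": "team lunch photo"},
]
stages = [
    {"stage": "filter", "field": "objects", "contains": "whiteboard"},
    {"stage": "vector_search", "query": "architecture diagram", "top_k": 20},
    {"stage": "rerank", "model": "cross-encoder", "top_k": 1},
]
best = run_pipeline(library, stages, word_overlap, word_overlap)
```

Each stage shrinks the candidate set, so the most expensive scorer (the reranker) only ever sees a handful of segments.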
Step 5: Connecting to the Agent
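The perception layer exposes its query interface as a tool the agent can call. In an MCP (Model Context Protocol) or function-calling setup, the tool definition looks like this, with the parameters described as a JSON Schema object (MCP names this field `inputSchema`):
{
  "name": "search_video_library",
  "description": "Search across all indexed video and audio content. Returns relevant segments with timestamps, transcripts, descriptions, and confidence scores.",
  "parameters": {
    "type": "object",
    "properties": {
      "query": {"type": "string", "description": "natural language description of what to find"},
      "filters": {"type": "object", "description": "optional structured filters (speaker, objects, date range)"},
      "modalities": {"type": "array", "items": {"type": "string"}, "description": "which feature types to search (visual, audio, transcript, all)"},
      "top_k": {"type": "integer", "description": "number of results to return"}
    },
    "required": ["query"]
  }
}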
The agent decides when to invoke this tool based on the user's request. If the user asks "What did Sarah say about the Q3 numbers in last Tuesday's meeting?", the agent:
1. Calls `search_video_library` with query "Q3 numbers discussion", filters for the meeting date and speaker "Sarah"
2. Receives timestamped transcript segments with surrounding context
3. Synthesizes the answer using the retrieved segments as grounding
This is multimodal RAG (Retrieval-Augmented Generation) applied to video. The agent does not watch the video. It searches a pre-built index and uses the retrieved features to ground its response.
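On the application side, the tool call has to be routed to the retriever and the results turned into text the model can cite. A minimal sketch, assuming the agent framework hands over the arguments as a JSON string and that each result carries `video_id`, `start_time`, `end_time`, and `transcript`; `handle_tool_call` and `fake_search` are hypothetical.

```python
import json


def handle_tool_call(arguments_json, search_fn):
    """Run the search and format the results as grounding context."""
    args = json.loads(arguments_json)
    results = search_fn(
        query=args["query"],
        filters=args.get("filters", {}),
        top_k=args.get("top_k", 5),
    )
    # One line per segment, prefixed with a citation the agent can quote.
    lines = [
        f'[{r["video_id"]} {r["start_time"]:.1f}-{r["end_time"]:.1f}s] {r["transcript"]}'
        for r in results
    ]
    return "\n".join(lines)


def fake_search(query, filters, top_k):
    # Stand-in for the real retriever.
    return [{
        "video_id": "meeting-1",
        "start_time": 12.4,
        "end_time": 15.1,
        "transcript": "Let me show you the Q3 results",
    }]


context = handle_tool_call(
    '{"query": "Q3 numbers discussion", "filters": {"speaker": "Sarah"}}',
    fake_search,
)
```

Including the video ID and time range in each line lets the agent point the user to the exact moment in the recording.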
Latency and Cost Considerations
Ingestion latency
Processing a 1-hour video through the full extraction pipeline (transcription + diarization + visual embedding + scene captioning + object detection) takes 10-30 minutes on a single GPU, depending on model sizes and frame sampling rate. The pipeline is embarrassingly parallel: each extractor can run independently on the same chunks.
For real-time or near-real-time use cases (live streams, security feeds), you need to reduce the extraction set. Running only transcription + visual embedding at 0.5 FPS brings processing time under 2x real-time on modern GPUs.
Query latency
A well-configured vector index (HNSW with M=16, ef=200) returns results in 5-20ms for collections under 10M vectors. Adding metadata filtering and reranking brings total query latency to 50-200ms -- fast enough for interactive agent use.
Storage cost at scale
| Scale | Vectors (4 features x 1 FPS) | Vector DB cost/month | Object storage cost/month |
| 100 hours | 1.44M | ~$144 | ~$3 |
| 10,000 hours | 144M | ~$14,400 | ~$280 |
| 100,000 hours | 1.44B | ~$144,000 | ~$2,800 |
Common Pitfalls
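Embedding everything at maximum resolution and frame rate. Running CLIP on every frame of a 4K video at 30 FPS produces 30 x 3,600 = 108,000 embeddings per hour. Most of these are near-duplicates of their neighbors, and CLIP resizes its input to a few hundred pixels per side anyway, so the 4K detail is discarded. Always downsample first, in time as well as in resolution.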
Ignoring the audio track. For meetings, lectures, and customer calls, the transcript typically carries most of the retrievable information. A visual-only pipeline misses it entirely.
Single-granularity indexing. If you only store dense embeddings, the agent cannot filter by speaker or object class. If you only store labels, the agent cannot do semantic similarity search. You need both.
Not aligning timestamps. Visual features, transcript segments, and audio embeddings must share a common timeline. If the transcript places "Q3 revenue" at 14.2s but the visual embedding for that moment is indexed at 15.0s, a multimodal query may fail to match the two as the same segment.
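One common way to avoid this is to index every feature by its time interval and join on overlap, not on exact timestamps. A sketch under that assumption, with hypothetical helper names and chunk boundaries chosen for illustration:

```python
def overlap_sec(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def attach_transcript(chunks, utterances):
    """Attach to each chunk every utterance that overlaps it in time."""
    for chunk in chunks:
        chunk["transcript"] = [
            u["text"]
            for u in utterances
            if overlap_sec(chunk["start"], chunk["end"], u["start"], u["end"]) > 0
        ]
    return chunks


visual_chunks = [
    {"start": 10.0, "end": 15.0},
    {"start": 15.0, "end": 20.0},
]
utterances = [
    {"text": "Q3 revenue", "start": 14.2, "end": 15.8},
    {"text": "Any questions?", "start": 18.0, "end": 19.0},
]
aligned = attach_transcript(visual_chunks, utterances)
```

The "Q3 revenue" utterance straddles the 15.0s boundary, so it is attached to both chunks and a query can match it from either side.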
Treating video as a bag of frames. Temporal order matters. A perception layer should preserve sequence: the agent needs to know that segment A comes before segment B, and that they are 30 seconds apart. This enables temporal reasoning ("what happened after the demo crashed?").
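If every segment keeps its `video_id`, `start`, and `end`, a "what happened after X" question becomes a simple ordered lookup once X has been found. A minimal sketch, assuming segments are dictionaries and a 60-second look-ahead window; `segments_after` is a hypothetical helper.

```python
def segments_after(segments, anchor, window_sec=60.0):
    """Return segments from the same video that start within window_sec
    after the anchor segment ends, in chronological order."""
    later = [
        s for s in segments
        if s["video_id"] == anchor["video_id"]
        and anchor["end"] <= s["start"] <= anchor["end"] + window_sec
    ]
    return sorted(later, key=lambda s: s["start"])


segments = [
    {"video_id": "demo", "start": 90.0, "end": 120.0, "label": "recovery"},
    {"video_id": "demo", "start": 30.0, "end": 60.0, "label": "demo crashes"},
    {"video_id": "demo", "start": 60.0, "end": 90.0, "label": "presenter restarts app"},
    {"video_id": "other", "start": 65.0, "end": 70.0, "label": "unrelated"},
    {"video_id": "demo", "start": 300.0, "end": 330.0, "label": "closing remarks"},
]
crash = segments[1]
following = segments_after(segments, crash)
```

Here `following` contains the restart and recovery segments in order, and excludes both the unrelated video and the segment that falls outside the window.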
Implementing with Mixpeek
Mixpeek provides a managed perception layer that handles the full pipeline. Here is the mapping between the architecture described above and Mixpeek's components:
| Architecture component | Mixpeek equivalent |
| Chunker | Scene detection + configurable interval sampling |
| Visual embeddings | `image_embedding` extractor (CLIP, SigLIP, Jina v4) |
| Object detection | `object_detection` extractor (DETR, YOLO, Grounding DINO) |
| Transcription | `audio_transcription` extractor (Whisper, Parakeet) |
| Speaker diarization | `speaker_diarization` extractor (Pyannote 3.1) |
| Scene description | `scene_description` extractor (Qwen3-VL, Florence-2) |
| Index | Mixpeek collections with hybrid search |
| Query interface | Retriever API with multi-stage pipelines |
| Agent tool | MCP server or function-calling endpoint |
from mixpeek import Mixpeek

client = Mixpeek(api_key="YOUR_KEY")

# Ingest a video with multi-granularity extraction
client.assets.create(
    bucket_id="video-library",
    blob_id="meeting-2026-05-27.mp4",
    collection_ids=["meetings"],
    feature_extractors=[
        {
            "name": "image_embedding",
            "version": "v1",
            "params": {
                "model_id": "openai/clip-vit-large-patch14",
                "interval_sec": 5
            }
        },
        {
            "name": "audio_transcription",
            "version": "v1",
            "params": {"model_id": "openai/whisper-large-v3"}
        },
        {
            "name": "speaker_diarization",
            "version": "v1",
            "params": {"model_id": "pyannote/speaker-diarization-3.1"}
        },
        {
            "name": "scene_description",
            "version": "v1",
            "params": {
                "model_id": "Qwen/Qwen3-VL-8B-Instruct",
                "interval_sec": 10
            }
        }
    ]
)

# Agent queries the perception layer
results = client.retrievers.search(
    retriever_id="meetings-search",
    query="When did they discuss the budget?",
    top_k=5
)