The Problem: RAG Was Built for Text
Retrieval-augmented generation has become the standard architecture for grounding LLM responses in factual data. The pattern is simple: chunk documents, embed the chunks, retrieve the most relevant ones at query time, and feed them as context to a language model. For text, this works well. Thousands of production RAG systems process PDFs, knowledge bases, and documentation this way.
But most enterprise data is not text. Surveillance footage, customer support call recordings, product demo videos, training materials, medical imaging sessions, manufacturing line recordings -- these are measured in petabytes, and they contain information that no text document captures. A text-only RAG pipeline is blind to 80% of the information an organization actually has.
Video RAG extends retrieval-augmented generation to video corpora. Instead of retrieving text passages, a Video RAG pipeline retrieves specific moments, scenes, or segments from a video library and feeds them -- as frames, transcripts, or structured descriptions -- to a vision-language model that generates the answer.
This guide walks through the architecture, algorithms, and production considerations for building a Video RAG system from scratch.
Why Video Is Harder Than Text
Text RAG has three convenient properties that video lacks:
1. Natural chunk boundaries. Text has paragraphs, sections, and sentences. You can split a document at these boundaries and each chunk is self-contained. Video has no such boundaries. A factory walkthrough, a meeting recording, or a surveillance feed is a continuous stream with no inherent segmentation.
2. Uniform modality. A text chunk is always text. You embed it, store the vector, and retrieve it. A video segment contains multiple modalities simultaneously: visual frames, speech audio, background sounds, on-screen text, and sometimes metadata like GPS coordinates or sensor readings. Each modality carries different information.
3. Small retrieval units. A text chunk is a few hundred tokens. A 10-second video clip at 30fps is 300 frames, each a 1920x1080 image. The raw data volume is orders of magnitude larger, and you cannot simply dump it into an LLM context window.
These differences mean that every stage of the RAG pipeline -- chunking, embedding, indexing, retrieval, and generation -- needs to be redesigned for video.
Stage 1: Scene-Level Chunking
The first step is decomposing continuous video into retrievable units. In text RAG, you chunk by paragraphs or fixed token windows. In Video RAG, the equivalent is scene segmentation.
Shot Boundary Detection
The most fundamental approach is shot boundary detection: identifying where the camera cuts from one shot to another. A shot is a continuous sequence of frames captured without interruption. Shot boundaries correspond to hard cuts (instant transitions), dissolves (gradual blends), and wipes.
The standard algorithm is frame-difference thresholding:
1. For each consecutive pair of frames, compute a distance metric (histogram difference, pixel-level L1/L2 distance, or embedding cosine distance). 2. If the distance exceeds a threshold, mark it as a shot boundary. 3. Merge shots that are too short (under a minimum duration) with their neighbors.
Modern approaches use learned detectors. TransNetV2 is a dilated 3D CNN that processes 100 frames at a time and predicts boundary probabilities with F1 scores above 0.96 on standard benchmarks. PySceneDetect is a widely-used open-source library that implements both threshold-based and content-aware detection.
Semantic Scene Segmentation
Shot boundaries are syntactic -- they detect camera edits but not semantic shifts. A conversation between two people may have dozens of shot-reverse-shot cuts, all belonging to the same semantic scene.
Semantic scene segmentation groups consecutive shots that share a common topic, setting, or activity. The typical approach:
1. Extract a feature vector for each shot (using the middle frame or a pooled representation across all frames). 2. Compute pairwise similarities between adjacent shots. 3. Identify scene boundaries where similarity drops below a threshold.
More sophisticated methods use temporal clustering: represent each shot as a point in embedding space, then find contiguous groups of shots that form a cluster. BaSSL (Boundary-aware Self-Supervised Learning) trains a temporal transformer to predict scene boundaries directly.
Choosing the Right Granularity
The granularity of your chunks determines the tradeoff between precision and context:
In practice, most Video RAG systems use scene-level segmentation with a minimum duration of 15-30 seconds and a maximum of 3-5 minutes. This matches the context window of current LVLMs and provides enough context for meaningful answers.
Stage 2: Dual-Channel Indexing
A text RAG pipeline has one index: text embeddings. A Video RAG pipeline needs at least two indexes that capture different modalities, because video contains information in both the visual channel and the audio/speech channel.
Channel 1: Visual Embeddings
For each scene, extract visual embeddings that capture what is shown on screen. There are two approaches:
Keyframe-based. Select representative frames from each scene (the first frame, middle frame, or the frame with the highest "information content" as measured by image entropy or sharpness). Embed each keyframe using a visual encoder like CLIP, SigLIP, or DINOv2. Store one or more vectors per scene.
Video-native. Use a video encoder that processes multiple frames jointly and captures temporal information. Models like InternVideo2, VideoPrism, or VideoMAE produce a single embedding that represents the entire clip, including motion, action sequences, and temporal dynamics. These embeddings capture information that keyframe-based approaches miss -- two scenes might have identical keyframes but very different actions.
The choice depends on your query patterns. If users search for static visual content ("find the slide about revenue projections"), keyframe embeddings suffice. If users search for actions or events ("find the moment the machine overheated"), video-native embeddings are better.
Channel 2: Transcript and Audio Embeddings
Extract the speech track and transcribe it using an ASR model (Whisper, Parakeet, or Cohere Transcribe). Align the transcript with timestamps so each word or sentence maps to a specific time in the video. Then embed the transcript segments using a text embedding model.
For non-speech audio (music, environmental sounds, machine noises), embed the audio using a model like CLAP that maps audio into the same embedding space as text descriptions.
Combining Channels
At retrieval time, you have two options for combining channels:
Late fusion. Query both indexes independently, get two ranked lists, and merge them using reciprocal rank fusion (RRF) or a learned score combiner. This is simpler to build and debug, and lets you weight channels differently per query.
Shared embedding space. Use an omnimodal model (like Jina Omni or Omni-Embed-Nemotron) that maps text, images, video, and audio into the same vector space. A single text query retrieves across all modalities simultaneously. This is architecturally simpler but depends on the quality of the unified model.
Stage 3: Retrieval
When a user query arrives, the retrieval stage must find the most relevant scenes from potentially millions of indexed segments.
Multi-Stage Retrieval for Video
Video retrieval benefits enormously from multi-stage pipelines:
Stage 1: Coarse retrieval. Use approximate nearest neighbor search (HNSW, IVF) over the visual and transcript embedding indexes. Retrieve the top 50-100 candidate scenes. This is fast (milliseconds) because it operates on pre-computed vectors.
Stage 2: Cross-modal reranking. A cross-encoder reranker jointly processes the query and each candidate scene's metadata (transcript snippet, visual description, detected objects) to produce a fine-grained relevance score. Models like Qwen3-VL-Reranker or Jina Reranker m0 can process multimodal inputs. This reduces the candidate set to 10-20 scenes.
Stage 3: Temporal grounding. For the top-ranked scenes, pinpoint the exact moment within the scene that answers the query. Temporal grounding models (like Marlin-2B or Cosmos-Reason2) take a natural language query and a video clip and return start/end timestamps. This narrows a 3-minute scene down to the specific 15-second segment the user needs.
Temporal Context Windows
A unique aspect of video retrieval is temporal context. When you retrieve a relevant 30-second scene, the scenes immediately before and after often contain important context. A conversation answer might start in scene N but be completed in scene N+1.
Production Video RAG systems typically retrieve the target scene plus a configurable temporal buffer (e.g., 30 seconds before and after). This is analogous to increasing chunk overlap in text RAG, but the cost is higher because each additional scene means more frames to process.
Stage 4: Frame Selection for LVLMs
You have retrieved the relevant scenes. Now you need to feed them to a vision-language model for answer generation. The challenge: a 2-minute scene at 30fps contains 3,600 frames. No LVLM can process all of them, and even if it could, most frames are redundant.
Frame selection determines which frames from the retrieved scenes are included in the LVLM's context window. This is the Video RAG equivalent of fitting text chunks into the LLM's context.
Strategies
Uniform sampling. Select every Nth frame to hit a target budget (e.g., 16 frames per scene). Simple and deterministic. Works well for slow-moving content (presentations, interviews) but misses brief important moments in fast-moving content.
Keyframe extraction. Select frames that maximize visual diversity within the scene. Cluster all frames by visual similarity and pick the centroid of each cluster. This ensures you capture every distinct visual state in the scene.
Query-driven selection. Score each frame against the user query using CLIP similarity, then select the top-K scoring frames. This focuses the LVLM's attention on the most query-relevant moments. The downside is that it requires computing CLIP embeddings for every frame at query time (or pre-computing and storing them).
Adaptive budgeting. Allocate more frames to scenes with higher retrieval scores and fewer frames to lower-ranked scenes. If your total budget is 32 frames and you retrieved 4 scenes, the top scene might get 12 frames while the bottom scene gets 4.
The Frame Budget Tradeoff
More frames mean more context for the LVLM but higher latency and cost. Empirical studies show diminishing returns past 32-64 frames per query for most tasks. For factual question answering, 16 frames per scene is often sufficient. For temporal reasoning ("what happened after the alarm went off"), 32-64 frames that are temporally ordered are needed.
Stage 5: Answer Generation
The final stage feeds the selected frames and transcript snippets to a vision-language model that generates the answer.
Context Assembly
The LVLM receives a structured prompt:
1. The user's question. 2. For each retrieved scene (ordered by relevance score): the selected frames, the transcript segment with timestamps, and any structured metadata (detected objects, scene description, speaker labels). 3. An instruction to cite the source scene and timestamp for any factual claims.
Grounded Generation
A critical requirement for Video RAG is that the generated answer links back to the source material. The user should be able to click a timestamp and jump to the exact moment in the video. This means the generation prompt must instruct the model to include temporal citations, and the pipeline must preserve the mapping from scene IDs to video URLs and timestamps.
Handling "Not Found"
Not every question has an answer in the video corpus. The LVLM must distinguish between "I found relevant content but the answer is ambiguous" and "nothing in the retrieved scenes addresses this question." Including the retrieval scores in the prompt (or setting a minimum score threshold before generation) helps the model calibrate its confidence.
Production Architecture
A production Video RAG system has two phases: an offline ingestion pipeline and an online query pipeline.
Offline: Ingestion Pipeline
Video Upload
|
v
Scene Segmentation (TransNetV2 / PySceneDetect)
|
v
Parallel Feature Extraction:
- Keyframes --> Visual Encoder --> Visual Embeddings
- Audio --> ASR Model --> Transcript + Text Embeddings
- Frames --> Object Detector --> Structured Metadata
|
v
Dual-Channel Vector Index (Visual + Text)
|
v
Metadata Store (scene boundaries, thumbnails, transcripts)
Online: Query Pipeline
User Query
|
v
Dual-Channel Retrieval (Visual + Transcript)
|
v
Cross-Modal Reranking
|
v
Temporal Grounding (optional: pinpoint exact moment)
|
v
Frame Selection (uniform / keyframe / query-driven)
|
v
LVLM Generation (with frames + transcript context)
|
v
Answer + Timestamp Citations
Latency Breakdown
In a typical production deployment:
| Stage | Latency | Notes |
| Embedding query | 5-15ms | Text encoder |
| ANN retrieval | 10-30ms | HNSW search across both channels |
| Reranking | 100-300ms | Cross-encoder over top-50 candidates |
| Temporal grounding | 200-500ms | Per-scene, parallelizable |
| Frame selection | 10-50ms | Pre-computed embeddings make this fast |
| LVLM generation | 1-3 seconds | Depends on frame count and model |
| Total | 1.5-4 seconds | Acceptable for interactive use |
When to Use Video RAG vs. Other Approaches
Video RAG is not the right tool for every video task:
Mixpeek Implementation
Mixpeek's pipeline architecture maps directly to the Video RAG stages described above:
from mixpeek import Mixpeek
client = Mixpeek(api_key="YOUR_API_KEY")
# Ingest video with scene segmentation + dual-channel extraction
client.ingest.videos(
collection="training_library",
source={"type": "s3", "bucket": "training-videos"},
pipeline={
"scene_segmentation": {
"model": "mixpeek://video_extractor@v1/pyannote_diarization_v3",
"min_scene_length": 15
},
"visual_embedding": {
"model": "mixpeek://video_descriptor@v1/openai_clip_large_v1",
"keyframe_strategy": "centroid",
"frames_per_scene": 4
},
"transcription": {
"model": "mixpeek://transcription@v1/openai_whisper_large_v3"
},
"transcript_embedding": {
"model": "mixpeek://text_extractor@v1/baai_bge_large_v1"
},
"scene_caption": {
"model": "mixpeek://video_extractor@v1/nvidia_cosmos_reason2_2b_v1",
"interval_sec": 5
}
}
)
# Query with multi-stage retrieval
results = client.search.text(
collection="training_library",
query="emergency shutoff procedure demonstration",
pipeline=[
{
"stage_type": "search",
"stage_id": "visual_search",
"model": "mixpeek://video_descriptor@v1/openai_clip_large_v1",
"limit": 50
},
{
"stage_type": "search",
"stage_id": "transcript_search",
"model": "mixpeek://text_extractor@v1/baai_bge_large_v1",
"limit": 50
},
{
"stage_type": "rerank",
"stage_id": "cross_modal_rerank",
"model": "mixpeek://reranker@v1/qwen3_vl_reranker_2b_v1",
"limit": 10
}
]
)