# Video Understanding

> Ingest video, extract visual + speech embeddings, and search for moments by what's shown or said

This is the full warehouse flow end-to-end: decompose video into searchable segments, store visual and speech embeddings, then reassemble answers through a retriever. Every code block below is copy-pasteable and uses the real API field names.

[Figure: Video Understanding Pipeline]
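
The curl examples on this page read three environment variables: `MP_API_URL`, `MP_API_KEY`, and `MP_NAMESPACE`. The following is a minimal setup sketch, assuming a POSIX shell. The values are placeholders (the base URL in particular is hypothetical), and `missing_env` is an illustrative helper, not part of any Mixpeek tooling.

```shell
# Placeholder values: substitute your own endpoint, key, and namespace.
export MP_API_URL="https://api.example.com"        # hypothetical base URL
export MP_API_KEY="replace-with-your-api-key"
export MP_NAMESPACE="replace-with-your-namespace"

# Hypothetical helper: print the name of every listed variable
# that is unset or empty (prints nothing when all are set).
missing_env() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}

missing_env MP_API_URL MP_API_KEY MP_NAMESPACE
```

Running the check before the first request surfaces a missing variable early, rather than as an authentication error from the API.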

## How It Works

When you ingest a video, the `multimodal_extractor` runs a multi-stage pipeline:

1. **Chunking**: the video is split into segments (scene detection / fixed intervals).
2. **Visual embeddings**: each segment's keyframes are embedded as `vertex_multimodal_embedding`.
3. **Speech embeddings**: speech is transcribed (Whisper) and embedded (feature URI `mixpeek://multimodal_extractor@v1/multilingual_e5_large_instruct_v1`).
4. **Multi-vector indexing**: both embeddings are indexed per segment, so one collection supports hybrid visual + transcript search.

At query time, a retriever searches across both embeddings and fuses the results, finding moments by what's *shown* or what's *said*.

A single `multimodal_extractor@v1` produces **both** the visual feature (URI suffix `vertex_multimodal_embedding`) and the speech/transcript feature (URI suffix `multilingual_e5_large_instruct_v1`). You do **not** need separate video/audio extractors.

## 1. Create a bucket

The bucket schema declares the fields each video object carries. Use the Mixpeek type `video` for the file field.

```bash
curl -sS -X POST "$MP_API_URL/v1/buckets" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket_name": "video-catalog",
    "bucket_schema": {
      "properties": {
        "video_url": { "type": "video" },
        "title": { "type": "text" },
        "category": { "type": "string" }
      }
    }
  }'
```

## 2. Create a collection

One collection with the `multimodal_extractor` gives you both visual and speech search. Map the extractor's `video` input to your bucket's `video_url` field.
```bash
curl -sS -X POST "$MP_API_URL/v1/collections" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_name": "video-moments",
    "source": { "type": "bucket", "bucket_ids": ["bkt_videos"] },
    "feature_extractor": {
      "feature_extractor_name": "multimodal_extractor",
      "version": "v1",
      "input_mappings": { "video": "video_url" },
      "field_passthrough": [
        { "source_path": "title" },
        { "source_path": "category" }
      ]
    }
  }'
```

A collection takes exactly **one** `feature_extractor`. To run additional extractors over the same objects, create more collections from the same bucket.

## 3. Ingest a video

Register a video object. The blob's content goes in the `data` field (a URL, `s3://` path, base64, or raw content).

```bash
curl -sS -X POST "$MP_API_URL/v1/buckets/bkt_videos/objects" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "key_prefix": "/marketing/demos",
    "blobs": [
      { "property": "video_url", "type": "video", "data": "s3://my-bucket/demos/product-launch.mp4" },
      { "property": "title", "type": "text", "data": "Product Launch Q4 2025" },
      { "property": "category", "type": "text", "data": "marketing" }
    ]
  }'
```

## 4. Process

Create a batch from your object(s), submit it, then poll the returned task until it completes.
```bash
# Create a batch (returns batch_id)
curl -sS -X POST "$MP_API_URL/v1/buckets/bkt_videos/batches" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{ "object_ids": ["obj_video_001"] }'

# Submit it for processing (returns task_id)
curl -sS -X POST "$MP_API_URL/v1/buckets/bkt_videos/batches/{batch_id}/submit" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE"

# Poll until status is terminal: COMPLETED or COMPLETED_WITH_ERRORS
curl -sS "$MP_API_URL/v1/tasks/{task_id}" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE"
```

Prefer the SDK? `client.tasks.get(task_id=...)` returns the same status. See [Monitoring ingestion](/processing/tasks) for a complete polling loop.

## 5. Create a hybrid retriever

The retriever fuses two feature searches (visual and speech) over the same collection. Note the canonical field names: `collection_identifiers`, a flat `input_schema`, and the stage `config` wrapping `stage_id` + `parameters`.
```bash
curl -sS -X POST "$MP_API_URL/v1/retrievers" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "retriever_name": "video-search",
    "collection_identifiers": ["video-moments"],
    "input_schema": {
      "query": { "type": "text", "required": true }
    },
    "stages": [
      {
        "stage_name": "hybrid_search",
        "stage_type": "filter",
        "config": {
          "stage_id": "feature_search",
          "parameters": {
            "searches": [
              {
                "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
                "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
                "top_k": 100,
                "weight": 0.6
              },
              {
                "feature_uri": "mixpeek://multimodal_extractor@v1/multilingual_e5_large_instruct_v1",
                "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
                "top_k": 100,
                "weight": 0.4
              }
            ],
            "final_top_k": 20,
            "fusion": "weighted"
          }
        }
      }
    ]
  }'
```

| Field                    | Meaning                                                                                                      |
| ------------------------ | ------------------------------------------------------------------------------------------------------------ |
| `collection_identifiers` | Collections to query (names or IDs).                                                                         |
| `input_schema`           | Flat map of input field → type. Reference inputs in stages as `{{INPUT.<field>}}`, e.g. `{{INPUT.query}}`.   |
| `searches[].feature_uri` | Which embedding index to search. Here: visual + speech.                                                      |
| `searches[].weight`      | Per-search weight used by `weighted` fusion.                                                                 |
| `final_top_k`            | Number of results returned after fusion.                                                                     |
| `fusion`                 | `rrf` (default) or `weighted`.                                                                               |

## 6. Search

Execute the retriever. The `inputs` keys must match your `input_schema`.

```bash
curl -sS -X POST "$MP_API_URL/v1/retrievers/{retriever_id}/execute" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": { "query": "people discussing electric vehicles" }
  }'
```

Each result is a video segment with its timestamps and keyframe, ranked by combined visual + transcript relevance.
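
To make "combined visual + transcript relevance" concrete, here is a small arithmetic sketch using the 0.6 / 0.4 weights from the retriever in step 5. It assumes, purely for illustration, that `weighted` fusion is a plain weighted sum of per-search scores already normalized to [0, 1]; the actual scoring and normalization the service applies are not specified on this page. `weighted_score` is a hypothetical helper, not part of any CLI or SDK.

```shell
# Hypothetical helper: weighted sum of a visual score and a speech score.
# ASSUMPTION: both scores are already normalized to [0, 1] and fusion is a
# plain weighted sum; the real `weighted` fusion may differ.
weighted_score() {
  visual="$1"
  speech="$2"
  w_visual="${3:-0.6}"   # default mirrors the visual search weight above
  w_speech="${4:-0.4}"   # default mirrors the speech search weight above
  awk -v v="$visual" -v s="$speech" -v wv="$w_visual" -v ws="$w_speech" \
    'BEGIN { printf "%.3f\n", wv * v + ws * s }'
}

weighted_score 0.80 0.30   # strong visual match, weak transcript match: prints 0.600
weighted_score 0.40 0.80   # weak visual match, strong transcript match: prints 0.560
```

Under these assumptions, a segment that matches visually outranks one that matches only in the transcript by a similar margin, because the visual search carries the larger weight. Swapping the weights reverses that preference.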
### Filter by category

Apply ad-hoc filters at execution time without changing the retriever:

```bash
curl -sS -X POST "$MP_API_URL/v1/retrievers/{retriever_id}/execute" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": { "query": "product demonstration" },
    "filters": { "field": "category", "operator": "eq", "value": "marketing" }
  }'
```

### Search by a reference image

Because `vertex_multimodal_embedding` is cross-modal, you can query the visual index with an image to find similar scenes. Add an image input to `input_schema` and set the visual search's `query` to `{ "input_mode": "content", "value": "{{INPUT.query_image}}" }`. See [Feature Search](/retrieval/stages/feature-search) for image/URL query modes.

## Moment-level search

Segments carry start/end timestamps, so you can scope results to part of a video with a filter (field names depend on your collection's output schema):

```bash
curl -sS -X POST "$MP_API_URL/v1/retrievers/{retriever_id}/execute" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": { "query": "pricing discussion" },
    "filters": { "field": "start_time", "operator": "gte", "value": 60.0 }
  }'
```

## Search the same video other ways

The same segments are searchable across every modality the `multimodal_extractor` produced. Combine or swap these into your retriever:

- Add the `face_identity_extractor` to find every clip featuring a specific person (ArcFace 1:N).
- Query `mixpeek://multimodal_extractor@v1/multilingual_e5_large_instruct_v1` to find moments by what's *said*.
- Merge matching frames into continuous time-ranges with the `moment_group` stage.
- Add a cross-encoder `rerank` stage to sharpen the top results.
- Drop or flag NSFW segments at query time with the `classify` stage.
- Add a `lexical: true` search to catch exact terms in transcripts (BM25).
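
To illustrate what "merging matching frames into continuous time-ranges" means, the sketch below merges time-sorted `start end` pairs (in seconds) whenever the gap between neighbours is small enough. It is only a local illustration of the idea, not the `moment_group` implementation. The `merge_segments` name, the `max_gap` parameter, and the input format are all assumptions made for this example.

```shell
# Illustration only: merge time-sorted "start end" lines into continuous
# ranges when the gap to the previous range is at most max_gap seconds.
# NOT the moment_group stage; max_gap is a hypothetical parameter.
merge_segments() {
  max_gap="${1:-0}"
  awk -v gap="$max_gap" '
    NR == 1        { lo = $1; hi = $2; next }            # first segment opens a range
    $1 - hi <= gap { if ($2 > hi) hi = $2; next }        # close enough: extend range
                   { print lo, hi; lo = $1; hi = $2 }    # gap too large: emit, restart
    END            { if (NR > 0) print lo, hi }          # emit the final open range
  '
}

# Four matching segments; the first two touch, the last two are 1 s apart.
printf '%s\n' "0 10" "10 20" "45 55" "56 60" | merge_segments 2
# prints:
# 0 20
# 45 60
```

With a tolerance of 2 seconds, the four hits collapse into two moments. A tolerance of 0 would keep `45 55` and `56 60` as separate ranges.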
### Search by action or activity

There is no dedicated action-recognition extractor. Instead, search **actions as natural language** against the visual embedding. A text query like `"person running"` or `"two people shaking hands"` matches against `vertex_multimodal_embedding` because the embedding captures actions depicted in frames:

```json
{
  "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
  "query": { "input_mode": "text", "value": "person running on a treadmill" },
  "top_k": 50
}
```

For higher precision on a fixed set of actions, enable `run_video_description` / `response_shape` on the extractor to emit structured action labels, which you can then filter with [`attribute_filter`](/retrieval/stages/attribute-filter) or [`classify`](/retrieval/stages/classify).

## Next steps

- Classify segments ("product demo", "interview") with a taxonomy.
- Cluster segment embeddings to surface recurring visual themes.
- Trigger alerts when new content matches a query.
- Full reference for searches, fusion, and query modes.