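This is the full warehouse flow end-to-end: decompose video into searchable segments, store visual and speech embeddings, then reassemble answers through a retriever. Every code block below uses the real API field names. Before running one, replace placeholder IDs such as bkt_videos, obj_video_001, {batch_id}, {task_id}, and {retriever_id} with your own values.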
Video Understanding Pipeline

How It Works

When you ingest a video, the multimodal_extractor runs a multi-stage pipeline:
  1. Chunking — the video is split into segments (scene detection / fixed intervals).
  2. Visual embeddings — each segment’s keyframes are embedded as vertex_multimodal_embedding.
  3. Speech embeddings — speech is transcribed (Whisper) and embedded (feature URI mixpeek://multimodal_extractor@v1/multilingual_e5_large_instruct_v1).
  4. Multi-vector indexing — both embeddings are indexed per segment, so one collection supports hybrid visual + transcript search.
At query time, a retriever searches across both embeddings and fuses the results — finding moments by what’s shown or what’s said.
A single multimodal_extractor@v1 produces both the visual feature (URI suffix vertex_multimodal_embedding) and the speech/transcript feature (URI suffix multilingual_e5_large_instruct_v1). You do not need separate video/audio extractors.
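Both feature URIs on this page follow the same visible pattern, mixpeek://<extractor>@<version>/<feature>. The sketch below is a small Python helper for composing and parsing them so the two strings are not retyped in every retriever. The FeatureURI class is hypothetical, and the URI layout is inferred from the two examples here rather than taken from a formal specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureURI:
    """Hypothetical helper; layout inferred from the URIs shown on this page."""

    extractor: str
    version: str
    feature: str

    def __str__(self) -> str:
        return f"mixpeek://{self.extractor}@{self.version}/{self.feature}"

    @classmethod
    def parse(cls, uri: str) -> "FeatureURI":
        prefix = "mixpeek://"
        if not uri.startswith(prefix):
            raise ValueError(f"not a mixpeek feature URI: {uri!r}")
        # Split "<extractor>@<version>/<feature>" into its three parts.
        head, _, feature = uri[len(prefix):].partition("/")
        extractor, _, version = head.partition("@")
        if not (extractor and version and feature):
            raise ValueError(f"incomplete feature URI: {uri!r}")
        return cls(extractor, version, feature)


# The two features produced by the single multimodal_extractor@v1.
VISUAL = FeatureURI("multimodal_extractor", "v1", "vertex_multimodal_embedding")
SPEECH = FeatureURI("multimodal_extractor", "v1", "multilingual_e5_large_instruct_v1")

if __name__ == "__main__":
    print(VISUAL)
    print(SPEECH)
```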

1. Create a bucket

The bucket schema declares the fields each video object carries. Use the Mixpeek type video for the file field.
curl -sS -X POST "$MP_API_URL/v1/buckets" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket_name": "video-catalog",
    "bucket_schema": {
      "properties": {
        "video_url": { "type": "video" },
        "title": { "type": "text" },
        "category": { "type": "string" }
      }
    }
  }'
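Every request on this page sends the same Authorization and X-Namespace headers. As a rough Python equivalent of the curl call above, using only the standard library, the sketch below builds the request without sending it. The helper names build_request and send are hypothetical, and the code assumes the same MP_API_URL, MP_API_KEY, and MP_NAMESPACE environment variables the curl examples rely on.

```python
import json
import os
import urllib.request


def build_request(path, body=None, method="POST", env=None):
    """Build (but do not send) a request with the headers used on this page."""
    env = os.environ if env is None else env
    headers = {
        "Authorization": f"Bearer {env['MP_API_KEY']}",
        "X-Namespace": env["MP_NAMESPACE"],
    }
    data = None
    if body is not None:
        headers["Content-Type"] = "application/json"
        data = json.dumps(body).encode("utf-8")
    url = env["MP_API_URL"].rstrip("/") + path
    return urllib.request.Request(url, data=data, headers=headers, method=method)


def send(request):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Same body as the curl example above.
bucket_body = {
    "bucket_name": "video-catalog",
    "bucket_schema": {
        "properties": {
            "video_url": {"type": "video"},
            "title": {"type": "text"},
            "category": {"type": "string"},
        }
    },
}

if __name__ == "__main__":
    request = build_request("/v1/buckets", bucket_body)
    print(send(request))
```

Keeping request construction separate from sending makes the payload easy to inspect or unit-test before anything reaches the API.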

2. Create a collection

One collection with the multimodal_extractor gives you both visual and speech search. Map the extractor’s video input to your bucket’s video_url field.
curl -sS -X POST "$MP_API_URL/v1/collections" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_name": "video-moments",
    "source": { "type": "bucket", "bucket_ids": ["bkt_videos"] },
    "feature_extractor": {
      "feature_extractor_name": "multimodal_extractor",
      "version": "v1",
      "input_mappings": { "video": "video_url" },
      "field_passthrough": [
        { "source_path": "title" },
        { "source_path": "category" }
      ]
    }
  }'
A collection takes exactly one feature_extractor. To run additional extractors over the same objects, create more collections from the same bucket.
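Since a collection holds exactly one extractor, running several extractors over one bucket means generating one collection payload per extractor. The Python sketch below shows that fan-out. The collection_payloads helper and the video-faces collection are hypothetical, and the version and input_mappings used for face_identity_extractor are assumptions, not confirmed values.

```python
import json


def collection_payloads(bucket_id, extractors, passthrough_fields):
    """Return one create-collection body per extractor, all reading one bucket."""
    payloads = []
    for collection_name, extractor in extractors.items():
        payloads.append({
            "collection_name": collection_name,
            "source": {"type": "bucket", "bucket_ids": [bucket_id]},
            # Exactly one feature_extractor per collection.
            "feature_extractor": {
                **extractor,
                "field_passthrough": [
                    {"source_path": field} for field in passthrough_fields
                ],
            },
        })
    return payloads


extractors = {
    # Matches the collection created above.
    "video-moments": {
        "feature_extractor_name": "multimodal_extractor",
        "version": "v1",
        "input_mappings": {"video": "video_url"},
    },
    # Hypothetical second collection; version and input mapping are assumed.
    "video-faces": {
        "feature_extractor_name": "face_identity_extractor",
        "version": "v1",
        "input_mappings": {"video": "video_url"},
    },
}

if __name__ == "__main__":
    for payload in collection_payloads("bkt_videos", extractors, ["title", "category"]):
        print(json.dumps(payload, indent=2))
```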

3. Ingest a video

Register a video object. The blob’s content goes in the data field (a URL, s3:// path, base64, or raw content).
curl -sS -X POST "$MP_API_URL/v1/buckets/bkt_videos/objects" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "key_prefix": "/marketing/demos",
    "blobs": [
      {
        "property": "video_url",
        "type": "video",
        "data": "s3://my-bucket/demos/product-launch.mp4"
      },
      { "property": "title", "type": "text", "data": "Product Launch Q4 2025" },
      { "property": "category", "type": "text", "data": "marketing" }
    ]
  }'
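When registering many videos, it can help to assemble the object body programmatically. The Python sketch below builds the same payload as the curl example and rejects any blob whose property is not declared in the bucket schema. The build_object_payload helper and its client-side check are hypothetical conveniences; how the API itself validates blobs is not described here.

```python
import json

# Property names declared in the bucket schema from step 1.
BUCKET_PROPERTIES = frozenset({"video_url", "title", "category"})


def build_object_payload(key_prefix, blobs, allowed_properties=BUCKET_PROPERTIES):
    """Build a create-object body from (property, type, data) tuples."""
    payload_blobs = []
    for prop, blob_type, data in blobs:
        if prop not in allowed_properties:
            raise ValueError(f"{prop!r} is not declared in the bucket schema")
        payload_blobs.append({"property": prop, "type": blob_type, "data": data})
    return {"key_prefix": key_prefix, "blobs": payload_blobs}


payload = build_object_payload(
    "/marketing/demos",
    [
        ("video_url", "video", "s3://my-bucket/demos/product-launch.mp4"),
        ("title", "text", "Product Launch Q4 2025"),
        ("category", "text", "marketing"),
    ],
)

if __name__ == "__main__":
    print(json.dumps(payload, indent=2))
```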

4. Process

Create a batch from your object(s), submit it, then poll the returned task until it completes.
# Create a batch (returns batch_id)
curl -sS -X POST "$MP_API_URL/v1/buckets/bkt_videos/batches" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{ "object_ids": ["obj_video_001"] }'

# Submit it for processing (returns task_id)
curl -sS -X POST "$MP_API_URL/v1/buckets/bkt_videos/batches/{batch_id}/submit" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE"

# Poll until status is terminal: COMPLETED or COMPLETED_WITH_ERRORS
curl -sS "$MP_API_URL/v1/tasks/{task_id}" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE"
Prefer the SDK? client.tasks.get(task_id=...) returns the same status. See Monitoring ingestion for a complete polling loop.
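A minimal polling sketch in Python, under stated assumptions: fetch_status is any callable you supply that returns the task's current status string (for example, a wrapper around the tasks endpoint above), and the interval and timeout defaults are arbitrary. Only the two terminal statuses named above are treated as final; if your tasks can end in other states, add them to the set.

```python
import time

# Terminal statuses named in the polling step above.
TERMINAL_STATUSES = frozenset({"COMPLETED", "COMPLETED_WITH_ERRORS"})


def wait_for_task(fetch_status, interval_s=5.0, timeout_s=600.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until it returns a terminal status or time runs out."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"task still {status!r} after {timeout_s} seconds")
        sleep(interval_s)


if __name__ == "__main__":
    # Stand-in for a real status call; the intermediate names are made up.
    fake_statuses = iter(["PENDING", "PROCESSING", "COMPLETED"])
    print(wait_for_task(lambda: next(fake_statuses), interval_s=0.0))
```

Passing sleep and clock as parameters keeps the loop testable without real waiting.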

5. Create a hybrid retriever

The retriever fuses two feature searches — visual and speech — over the same collection. Note the canonical field names: collection_identifiers, a flat input_schema, and the stage config wrapping stage_id + parameters.
curl -sS -X POST "$MP_API_URL/v1/retrievers" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "retriever_name": "video-search",
    "collection_identifiers": ["video-moments"],
    "input_schema": {
      "query": { "type": "text", "required": true }
    },
    "stages": [
      {
        "stage_name": "hybrid_search",
        "stage_type": "filter",
        "config": {
          "stage_id": "feature_search",
          "parameters": {
            "searches": [
              {
                "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
                "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
                "top_k": 100,
                "weight": 0.6
              },
              {
                "feature_uri": "mixpeek://multimodal_extractor@v1/multilingual_e5_large_instruct_v1",
                "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
                "top_k": 100,
                "weight": 0.4
              }
            ],
            "final_top_k": 20,
            "fusion": "weighted"
          }
        }
      }
    ]
  }'
Field: Meaning
collection_identifiers: Collections to query (names or IDs).
input_schema: Flat map of input field → type. Reference inputs in stages as {{INPUT.<field>}}.
searches[].feature_uri: Which embedding index to search. Here: visual + speech.
searches[].weight: Per-search weight used by weighted fusion.
final_top_k: Results returned after fusion.
fusion: rrf (default) or weighted.
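To build intuition for the two fusion options, the Python sketch below applies the textbook definitions of weighted score fusion and reciprocal rank fusion (RRF) to two small result lists. It illustrates the general techniques only: how the feature_search stage normalizes scores and which RRF constant it uses are not stated on this page, so the k=60 default (the value commonly used for RRF) and the sample scores are assumptions.

```python
def weighted_fusion(results, weights, final_top_k):
    """Sum weight * score per segment; scores are assumed comparable across searches."""
    combined = {}
    for scores, weight in zip(results, weights):
        for segment_id, score in scores.items():
            combined[segment_id] = combined.get(segment_id, 0.0) + weight * score
    ranked = sorted(combined, key=lambda s: (-combined[s], s))
    return ranked[:final_top_k]


def rrf_fusion(rankings, final_top_k, k=60):
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per segment."""
    combined = {}
    for ranking in rankings:
        for rank, segment_id in enumerate(ranking, start=1):
            combined[segment_id] = combined.get(segment_id, 0.0) + 1.0 / (k + rank)
    ranked = sorted(combined, key=lambda s: (-combined[s], s))
    return ranked[:final_top_k]


# Made-up scores for three segments.
visual = {"seg_a": 0.9, "seg_b": 0.5}
speech = {"seg_b": 0.8, "seg_c": 0.7}

if __name__ == "__main__":
    # Same 0.6 / 0.4 weights as the retriever above.
    print(weighted_fusion([visual, speech], [0.6, 0.4], final_top_k=3))
    # prints ['seg_b', 'seg_a', 'seg_c']
    print(rrf_fusion([["seg_a", "seg_b"], ["seg_b", "seg_c"]], final_top_k=3))
    # prints ['seg_b', 'seg_a', 'seg_c']
```

In both cases seg_b ranks first because it appears in both result lists, which is the behavior hybrid search is after: moments that match on what is shown and what is said rise to the top.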
Execute the retriever. The inputs keys must match your input_schema.
curl -sS -X POST "$MP_API_URL/v1/retrievers/{retriever_id}/execute" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": { "query": "people discussing electric vehicles" }
  }'
Each result is a video segment with its timestamps and keyframe, ranked by combined visual + transcript relevance.
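Because the inputs keys must line up with input_schema, a small client-side check can catch mismatches before the request is sent. The Python sketch below is one way to do that. The build_execute_body helper is hypothetical, and treating "required": true as "the key must be present" is an assumption about the schema's semantics.

```python
# Same input_schema as the retriever defined above.
INPUT_SCHEMA = {"query": {"type": "text", "required": True}}


def build_execute_body(inputs, input_schema):
    """Return an execute body, or raise if inputs do not match the schema."""
    unknown = sorted(set(inputs) - set(input_schema))
    missing = sorted(
        name
        for name, spec in input_schema.items()
        if spec.get("required") and name not in inputs
    )
    if unknown or missing:
        raise ValueError(f"unknown inputs: {unknown}; missing required inputs: {missing}")
    return {"inputs": dict(inputs)}


if __name__ == "__main__":
    print(build_execute_body({"query": "people discussing electric vehicles"}, INPUT_SCHEMA))
```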

Filter by category

Apply ad-hoc filters at execution time without changing the retriever:
curl -sS -X POST "$MP_API_URL/v1/retrievers/{retriever_id}/execute" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": { "query": "product demonstration" },
    "filters": {
      "field": "category",
      "operator": "eq",
      "value": "marketing"
    }
  }'

Search by a reference image

Because vertex_multimodal_embedding is cross-modal, you can query the visual index with an image to find similar scenes. Add an image input to input_schema and set the visual search’s query to { "input_mode": "content", "value": "{{INPUT.query_image}}" }. See Feature Search for image/URL query modes.

Segments carry start/end timestamps, so you can also scope results to part of a video with a filter (field names depend on your collection’s output schema):
curl -sS -X POST "$MP_API_URL/v1/retrievers/{retriever_id}/execute" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": { "query": "pricing discussion" },
    "filters": { "field": "start_time", "operator": "gte", "value": 60.0 }
  }'
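The reference-image variant described at the start of this section can be sketched as a retriever body. The Python snippet below assembles one with a single visual search whose query reads from an image input. The retriever and stage names are hypothetical, and the "image" type used in input_schema is an assumption; check Feature Search for the accepted input types.

```python
import json

VISUAL_URI = "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding"


def image_search_retriever(collection, top_k=100, final_top_k=20):
    """Build a retriever body that searches the visual index with an image."""
    return {
        "retriever_name": "video-search-by-image",  # hypothetical name
        "collection_identifiers": [collection],
        "input_schema": {
            # Assumed type name for an image input.
            "query_image": {"type": "image", "required": True},
        },
        "stages": [
            {
                "stage_name": "image_search",  # hypothetical name
                "stage_type": "filter",
                "config": {
                    "stage_id": "feature_search",
                    "parameters": {
                        "searches": [
                            {
                                "feature_uri": VISUAL_URI,
                                # Query mode taken from the text above.
                                "query": {
                                    "input_mode": "content",
                                    "value": "{{INPUT.query_image}}",
                                },
                                "top_k": top_k,
                            }
                        ],
                        "final_top_k": final_top_k,
                    },
                },
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(image_search_retriever("video-moments"), indent=2))
```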

Search the same video other ways

The same video can be searched in other ways. Most of these options reuse the features the multimodal_extractor already produced; face search needs an additional extractor. Combine or swap them into your retriever:

Search faces

Add the face_identity_extractor to find every clip featuring a specific person (ArcFace 1:N).

Search transcripts

Query mixpeek://multimodal_extractor@v1/multilingual_e5_large_instruct_v1 to find moments by what’s said.

Group into moments

Merge matching frames into continuous time-ranges with the moment_group stage.

Rerank for precision

Add a cross-encoder rerank stage to sharpen the top results.

Filter unsafe content

Drop or flag NSFW segments at query time with the classify stage.

Hybrid keyword search

Add a lexical: true search to catch exact terms in transcripts (BM25).
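To make the "Group into moments" idea concrete, the Python sketch below merges matching segments that touch or nearly touch into continuous time ranges. It is generic interval merging, not the moment_group stage's implementation; the stage's actual parameters are not described here, and max_gap_s is a hypothetical knob.

```python
def merge_segments(segments, max_gap_s=1.0):
    """Merge (start, end) pairs whose gap is at most max_gap_s seconds."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap_s:
            # Close enough to the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(moment) for moment in merged]


if __name__ == "__main__":
    hits = [(0, 5), (5, 10), (30, 35), (10.5, 12)]
    print(merge_segments(hits))
    # prints [(0, 12), (30, 35)]
```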

Search by action or activity

There is no dedicated action-recognition extractor. Instead, search for actions in natural language against the visual embedding. A text query such as "person running" or "two people shaking hands" can match segments through vertex_multimodal_embedding, because the embedding captures the actions depicted in each segment's keyframes:
{
  "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
  "query": { "input_mode": "text", "value": "person running on a treadmill" },
  "top_k": 50
}
For higher precision on a fixed set of actions, enable run_video_description / response_shape on the extractor so it emits structured action labels. You can then filter on those labels with attribute_filter or pass them to the classify stage.

Next steps

Auto-tag segments

Classify segments (“product demo”, “interview”) with a taxonomy.

Discover themes

Cluster segment embeddings to surface recurring visual themes.

Get notified

Trigger alerts when new content matches a query.

Tune feature search

Full reference for searches, fusion, and query modes.