
# Video Understanding

> Ingest video, extract visual + speech embeddings, and search for moments by what's shown or said

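<Tip>This is the full warehouse flow end to end: decompose video into searchable segments, store visual and speech embeddings, then reassemble answers through a retriever. The code blocks below use the real API field names; replace placeholder IDs such as `bkt_videos`, `obj_video_001`, `{batch_id}`, `{task_id}`, and `{retriever_id}` with the values your own API calls return.</Tip>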

<Frame>
  <img src="https://mintcdn.com/mixpeek/TwtTrae3Fi3EFJ72/assets/mixpeek-video-understanding.svg?fit=max&auto=format&n=TwtTrae3Fi3EFJ72&q=85&s=8370fdfc5820afd33811585e4eacf302" alt="Video Understanding Pipeline" width="1200" height="920" data-path="assets/mixpeek-video-understanding.svg" />
</Frame>

## How It Works

When you ingest a video, the `multimodal_extractor` runs a multi-stage pipeline:

1. **Chunking** — the video is split into segments (scene detection / fixed intervals).
2. **Visual embeddings** — each segment's keyframes are embedded as `vertex_multimodal_embedding`.
3. **Speech embeddings** — speech is transcribed (Whisper) and embedded (feature URI `mixpeek://multimodal_extractor@v1/multilingual_e5_large_instruct_v1`).
4. **Multi-vector indexing** — both embeddings are indexed per segment, so one collection supports hybrid visual + transcript search.

At query time, a retriever searches across both embeddings and fuses the results — finding moments by what's *shown* or what's *said*.
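
To make the chunking step concrete, here is a minimal Python sketch of the fixed-interval case only. It is an illustration, not the extractor's implementation: the `Segment` class, the `fixed_interval_segments` function, and the `start_time` / `end_time` field names are assumptions chosen for this example, and scene detection is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    index: int
    start_time: float  # seconds from the start of the video
    end_time: float    # seconds; the last segment ends at the video duration


def fixed_interval_segments(duration: float, interval: float) -> list[Segment]:
    """Split [0, duration) into consecutive windows of at most `interval` seconds."""
    if duration <= 0 or interval <= 0:
        raise ValueError("duration and interval must be positive")
    segments = []
    start = 0.0
    while start < duration:
        end = min(start + interval, duration)  # final window may be shorter
        segments.append(Segment(len(segments), start, end))
        start = end
    return segments


# A 25-second video cut every 10 seconds yields three segments.
segments = fixed_interval_segments(duration=25.0, interval=10.0)
```

Each segment produced this way would then receive its own visual and speech embeddings, which is what makes results addressable by timestamp later on.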

<Note>
  A single `multimodal_extractor@v1` produces **both** the visual feature (URI suffix `vertex_multimodal_embedding`) and the speech/transcript feature (URI suffix `multilingual_e5_large_instruct_v1`). You do **not** need separate video/audio extractors.
</Note>

## 1. Create a bucket

The bucket schema declares the fields each video object carries. Use the Mixpeek type `video` for the file field.

```bash theme={null}
curl -sS -X POST "$MP_API_URL/v1/buckets" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket_name": "video-catalog",
    "bucket_schema": {
      "properties": {
        "video_url": { "type": "video" },
        "title": { "type": "text" },
        "category": { "type": "string" }
      }
    }
  }'
```
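
If you would rather issue the same call from Python, the sketch below builds the equivalent request with the standard library only. `build_request` is a hypothetical helper written for this page, not part of any SDK, and the URL, key, and namespace values are placeholders; the headers and body mirror the curl command above. The request is built but not sent.

```python
import json
import urllib.request


def build_request(api_url, api_key, namespace, method, path, body=None):
    """Build (but do not send) an authenticated JSON request."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        url=api_url.rstrip("/") + path,
        data=data,
        method=method,
    )
    req.add_header("Authorization", f"Bearer {api_key}")
    req.add_header("X-Namespace", namespace)
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req


bucket_body = {
    "bucket_name": "video-catalog",
    "bucket_schema": {
        "properties": {
            "video_url": {"type": "video"},
            "title": {"type": "text"},
            "category": {"type": "string"},
        }
    },
}

# Placeholder values; in practice read MP_API_URL, MP_API_KEY, MP_NAMESPACE
# from the environment.
req = build_request(
    "https://api.example.test", "sk_example", "my-namespace",
    "POST", "/v1/buckets", bucket_body,
)
# To send it: urllib.request.urlopen(req), then json.load() the response.
```

The same helper shape works for every other endpoint on this page; only `method`, `path`, and `body` change.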

## 2. Create a collection

One collection with the `multimodal_extractor` gives you both visual and speech search. Map the extractor's `video` input to your bucket's `video_url` field. In the request below, `bkt_videos` stands in for the ID of the bucket you created in step 1.

```bash theme={null}
curl -sS -X POST "$MP_API_URL/v1/collections" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_name": "video-moments",
    "source": { "type": "bucket", "bucket_ids": ["bkt_videos"] },
    "feature_extractor": {
      "feature_extractor_name": "multimodal_extractor",
      "version": "v1",
      "input_mappings": { "video": "video_url" },
      "field_passthrough": [
        { "source_path": "title" },
        { "source_path": "category" }
      ]
    }
  }'
```

<Note>
  A collection takes exactly **one** `feature_extractor`. To run additional extractors over the same objects, create more collections from the same bucket.
</Note>

## 3. Ingest a video

Register a video object. The blob's content goes in the `data` field (a URL, `s3://` path, base64, or raw content).

```bash theme={null}
curl -sS -X POST "$MP_API_URL/v1/buckets/bkt_videos/objects" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "key_prefix": "/marketing/demos",
    "blobs": [
      {
        "property": "video_url",
        "type": "video",
        "data": "s3://my-bucket/demos/product-launch.mp4"
      },
      { "property": "title", "type": "text", "data": "Product Launch Q4 2025" },
      { "property": "category", "type": "text", "data": "marketing" }
    ]
  }'
```

## 4. Process

Create a batch from the IDs of the objects you registered in step 3 (`obj_video_001` is a placeholder), submit it, then poll the returned task until it reaches a terminal status.

```bash theme={null}
# Create a batch (returns batch_id)
curl -sS -X POST "$MP_API_URL/v1/buckets/bkt_videos/batches" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{ "object_ids": ["obj_video_001"] }'

# Submit it for processing (returns task_id)
curl -sS -X POST "$MP_API_URL/v1/buckets/bkt_videos/batches/{batch_id}/submit" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE"

# Poll until status is terminal: COMPLETED or COMPLETED_WITH_ERRORS
curl -sS "$MP_API_URL/v1/tasks/{task_id}" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE"
```
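
The polling step can be wrapped in a small loop. The sketch below is one way to do it in Python under stated assumptions: `wait_for_task` is a hypothetical helper, and `fetch_status` is a function you supply that performs the `GET /v1/tasks/{task_id}` call and returns the status string. Only the two terminal statuses named above are built in; the interval and timeout defaults are arbitrary.

```python
import time

# The two terminal statuses named in the polling step. Extend the set if your
# tasks can end in other states (for example a failure status).
TERMINAL_STATUSES = frozenset({"COMPLETED", "COMPLETED_WITH_ERRORS"})


def wait_for_task(fetch_status, task_id, interval_s=5.0, timeout_s=1800.0,
                  terminal=TERMINAL_STATUSES, sleep=time.sleep,
                  clock=time.monotonic):
    """Call fetch_status(task_id) until it returns a terminal status.

    Raises TimeoutError if no terminal status is seen within timeout_s.
    `sleep` and `clock` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status(task_id)
        if status in terminal:
            return status
        if clock() >= deadline:
            raise TimeoutError(
                f"task {task_id} still {status!r} after {timeout_s}s"
            )
        sleep(interval_s)
```

Treat `COMPLETED_WITH_ERRORS` as a signal to inspect the task before searching, since some objects may not have been indexed.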

<Tip>
  Prefer the SDK? `client.tasks.get(task_id=...)` returns the same status. See [Monitoring ingestion](/processing/tasks) for a complete polling loop.
</Tip>

## 5. Create a hybrid retriever

The retriever fuses two feature searches — visual and speech — over the same collection. Note the canonical field names: `collection_identifiers`, a flat `input_schema`, and the stage `config` wrapping `stage_id` + `parameters`.

```bash theme={null}
curl -sS -X POST "$MP_API_URL/v1/retrievers" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "retriever_name": "video-search",
    "collection_identifiers": ["video-moments"],
    "input_schema": {
      "query": { "type": "text", "required": true }
    },
    "stages": [
      {
        "stage_name": "hybrid_search",
        "stage_type": "filter",
        "config": {
          "stage_id": "feature_search",
          "parameters": {
            "searches": [
              {
                "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
                "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
                "top_k": 100,
                "weight": 0.6
              },
              {
                "feature_uri": "mixpeek://multimodal_extractor@v1/multilingual_e5_large_instruct_v1",
                "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
                "top_k": 100,
                "weight": 0.4
              }
            ],
            "final_top_k": 20,
            "fusion": "weighted"
          }
        }
      }
    ]
  }'
```

| Field                    | Meaning                                                                            |
| ------------------------ | ---------------------------------------------------------------------------------- |
| `collection_identifiers` | Collections to query (names or IDs).                                               |
| `input_schema`           | Flat map of input field → type. Reference inputs in stages as `{{INPUT.<field>}}`. |
| `searches[].feature_uri` | Which embedding index to search. Here: visual + speech.                            |
| `searches[].weight`      | Per-search weight used by `weighted` fusion.                                       |
| `final_top_k`            | Results returned after fusion.                                                     |
| `fusion`                 | `rrf` (default) or `weighted`.                                                     |
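
To build intuition for the two `fusion` options, the Python sketch below implements the common textbook forms of each. This page does not state the exact formulas the service uses, so treat both functions as assumptions: `weighted_fusion` sums `weight * score` per segment, and `rrf_fusion` sums `1 / (k + rank)` with the conventional `k = 60`. The segment IDs and scores are made up.

```python
from collections import defaultdict


def weighted_fusion(result_lists, weights, final_top_k):
    """Sum weight * score per segment across searches.

    result_lists holds one list of (segment_id, score) per search.
    Assumes the scores of the different searches are on comparable scales.
    """
    fused = defaultdict(float)
    for results, weight in zip(result_lists, weights):
        for segment_id, score in results:
            fused[segment_id] += weight * score
    ranked = sorted(fused.items(), key=lambda item: item[1], reverse=True)
    return ranked[:final_top_k]


def rrf_fusion(result_lists, final_top_k, k=60):
    """Reciprocal rank fusion: sum 1 / (k + rank) per segment, ranks start at 1."""
    fused = defaultdict(float)
    for results in result_lists:
        for rank, (segment_id, _score) in enumerate(results, start=1):
            fused[segment_id] += 1.0 / (k + rank)
    ranked = sorted(fused.items(), key=lambda item: item[1], reverse=True)
    return ranked[:final_top_k]


# Hypothetical per-search results, each already sorted by score.
visual = [("seg_a", 0.90), ("seg_b", 0.70), ("seg_c", 0.40)]
speech = [("seg_b", 0.95), ("seg_d", 0.50)]

weighted = weighted_fusion([visual, speech], weights=[0.6, 0.4], final_top_k=3)
rrf = rrf_fusion([visual, speech], final_top_k=3)
```

In both cases `seg_b` ranks first because it appears in both result lists. The two methods disagree on third place: weighted fusion looks at score magnitudes, while RRF looks only at rank positions, which makes it less sensitive to score scales that differ between embedding models.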

## 6. Search

Execute the retriever. The `inputs` keys must match your `input_schema`.

```bash theme={null}
curl -sS -X POST "$MP_API_URL/v1/retrievers/{retriever_id}/execute" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": { "query": "people discussing electric vehicles" }
  }'
```

Each result is a video segment with its timestamps and keyframe, ranked by combined visual + transcript relevance.
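
Since the `inputs` keys must line up with `input_schema`, a client-side check before calling `execute` can catch typos early. The sketch below is a hypothetical helper, not an API feature; it assumes the flat schema shape used in step 5, where each field maps to a dict with `type` and an optional `required` flag.

```python
def check_inputs(input_schema, inputs):
    """Return a list of problems; an empty list means the inputs fit the schema."""
    problems = []
    for name, spec in input_schema.items():
        if spec.get("required") and name not in inputs:
            problems.append(f"missing required input: {name}")
    for name in inputs:
        if name not in input_schema:
            problems.append(f"unknown input: {name}")
    return problems


input_schema = {"query": {"type": "text", "required": True}}

ok = check_inputs(input_schema, {"query": "people discussing electric vehicles"})
bad = check_inputs(input_schema, {"q": "people discussing electric vehicles"})
```

Here `ok` is an empty list, while `bad` reports both the missing `query` and the unexpected `q`.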

### Filter by category

Apply ad-hoc filters at execution time without changing the retriever:

```bash theme={null}
curl -sS -X POST "$MP_API_URL/v1/retrievers/{retriever_id}/execute" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": { "query": "product demonstration" },
    "filters": {
      "field": "category",
      "operator": "eq",
      "value": "marketing"
    }
  }'
```

### Search by a reference image

Because `vertex_multimodal_embedding` is cross-modal, you can query the visual index with an image to find similar scenes. Add an image input to `input_schema` and set the visual search's `query` to `{ "input_mode": "content", "value": "{{INPUT.query_image}}" }`. See [Feature Search](/retrieval/stages/feature-search) for image/URL query modes.
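
As a sketch of that change, the Python function below takes a retriever body shaped like the one in step 5 and returns a copy whose visual search reads an image input. `with_image_query` is a hypothetical helper, and the input type name `"image"` and the `required` flag on the new input are assumptions; check the Feature Search reference for the accepted values. The speech search is left on the text query.

```python
import copy

VISUAL_URI = "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding"
SPEECH_URI = "mixpeek://multimodal_extractor@v1/multilingual_e5_large_instruct_v1"


def with_image_query(retriever_body, input_name="query_image", input_type="image"):
    """Return a copy of a retriever body whose visual search reads an image input."""
    body = copy.deepcopy(retriever_body)  # leave the caller's dict untouched
    # Assumption: "image" is the input type name for image inputs.
    body["input_schema"][input_name] = {"type": input_type, "required": True}
    for stage in body["stages"]:
        params = stage.get("config", {}).get("parameters", {})
        for search in params.get("searches", []):
            if search["feature_uri"] == VISUAL_URI:
                search["query"] = {
                    "input_mode": "content",
                    "value": "{{INPUT." + input_name + "}}",
                }
    return body


text_query = {"input_mode": "text", "value": "{{INPUT.query}}"}
text_retriever = {
    "retriever_name": "video-search",
    "collection_identifiers": ["video-moments"],
    "input_schema": {"query": {"type": "text", "required": True}},
    "stages": [{
        "stage_name": "hybrid_search",
        "stage_type": "filter",
        "config": {
            "stage_id": "feature_search",
            "parameters": {
                "searches": [
                    {"feature_uri": VISUAL_URI, "query": dict(text_query), "top_k": 100},
                    {"feature_uri": SPEECH_URI, "query": dict(text_query), "top_k": 100},
                ],
                "final_top_k": 20,
            },
        },
    }],
}

image_retriever = with_image_query(text_retriever)
```

Because the speech search still uses `{{INPUT.query}}`, a retriever built this way expects both a text query and an image at execution time.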

## Moment-level search

Segments carry start/end timestamps, so you can scope results to part of a video with a filter (field names depend on your collection's output schema):

```bash theme={null}
curl -sS -X POST "$MP_API_URL/v1/retrievers/{retriever_id}/execute" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": { "query": "pricing discussion" },
    "filters": { "field": "start_time", "operator": "gte", "value": 60.0 }
  }'
```
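
To show what such a filter selects, the Python sketch below applies the same `field` / `operator` / `value` shape to a few in-memory segments. It only illustrates the semantics and is not how the service evaluates filters: the `matches` helper, the segment IDs, and the `start_time` / `end_time` field names are hypothetical, and only the two operators used on this page (`eq`, `gte`) are covered.

```python
import operator

# Only the two operators that appear on this page; the API likely supports more.
OPERATORS = {"eq": operator.eq, "gte": operator.ge}


def matches(flt, segment):
    """True if the segment's field satisfies the filter; missing fields never match."""
    if flt["field"] not in segment:
        return False
    compare = OPERATORS[flt["operator"]]
    return compare(segment[flt["field"]], flt["value"])


segments = [
    {"segment_id": "seg_1", "start_time": 12.0, "end_time": 30.0},
    {"segment_id": "seg_2", "start_time": 60.0, "end_time": 75.5},
    {"segment_id": "seg_3", "start_time": 142.0, "end_time": 160.0},
]

flt = {"field": "start_time", "operator": "gte", "value": 60.0}
kept = [s["segment_id"] for s in segments if matches(flt, s)]
```

With this filter, only segments that begin at or after the one-minute mark remain, so `seg_1` is dropped.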

## Search the same video other ways

The same video can be searched in other ways. Some options below reuse features the `multimodal_extractor` already produced, some add a stage to your retriever, and face search needs a separate extractor:

<CardGroup cols={2}>
  <Card title="Search faces" icon="user" href="/processing/extractors/face-identity">
    Create a second collection with the `face_identity_extractor` to find every clip featuring a specific person (ArcFace 1:N).
  </Card>

  <Card title="Search transcripts" icon="closed-captioning" href="/retrieval/stages/feature-search">
    Query `mixpeek://multimodal_extractor@v1/multilingual_e5_large_instruct_v1` to find moments by what's *said*.
  </Card>

  <Card title="Group into moments" icon="film" href="/retrieval/stages/moment-group">
    Merge matching frames into continuous time-ranges with the `moment_group` stage.
  </Card>

  <Card title="Rerank for precision" icon="arrow-up-wide-short" href="/retrieval/stages/rerank">
    Add a cross-encoder `rerank` stage to sharpen the top results.
  </Card>

  <Card title="Filter unsafe content" icon="shield" href="/retrieval/stages/classify">
    Drop or flag NSFW segments at query time with the `classify` stage.
  </Card>

  <Card title="Hybrid keyword search" icon="key" href="/retrieval/stages/feature-search#lexical-bm25-search">
    Add a `lexical: true` search to catch exact terms in transcripts (BM25).
  </Card>
</CardGroup>
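
For a sense of what grouping matches into moments involves, the Python sketch below merges matching time ranges that touch or sit within a small gap of each other. It is a plain interval merge written for illustration, not the `moment_group` stage's implementation; `group_into_moments` and the `max_gap` parameter are hypothetical names.

```python
def group_into_moments(ranges, max_gap=1.0):
    """Merge (start, end) ranges whose gap is at most max_gap seconds."""
    moments = []
    for start, end in sorted(ranges):
        if moments and start - moments[-1][1] <= max_gap:
            # Close enough to the previous moment: extend it.
            moments[-1][1] = max(moments[-1][1], end)
        else:
            moments.append([start, end])
    return [tuple(m) for m in moments]


# Four matching segments, given out of order; three of them are adjacent.
hits = [(30.0, 40.0), (10.0, 20.0), (20.5, 30.0), (95.0, 100.0)]
moments = group_into_moments(hits)
```

The three adjacent hits collapse into one moment from 10 to 40 seconds, and the isolated hit at 95 seconds stays on its own.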

### Search by action or activity

There is no dedicated action-recognition extractor. Instead, search **actions as natural language** against the visual embedding. Because `vertex_multimodal_embedding` is cross-modal, a text query like `"person running"` or `"two people shaking hands"` can match frames that depict that action:

```json theme={null}
{
  "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
  "query": { "input_mode": "text", "value": "person running on a treadmill" },
  "top_k": 50
}
```

For higher precision on a fixed set of actions, enable `run_video_description` / `response_shape` on the extractor to emit structured action labels, then filter on those labels with the [`attribute_filter`](/retrieval/stages/attribute-filter) or [`classify`](/retrieval/stages/classify) stage.

## Next steps

<CardGroup cols={2}>
  <Card title="Auto-tag segments" icon="tags" href="/enrichment/taxonomies">
    Classify segments ("product demo", "interview") with a taxonomy.
  </Card>

  <Card title="Discover themes" icon="diagram-project" href="/enrichment/clusters">
    Cluster segment embeddings to surface recurring visual themes.
  </Card>

  <Card title="Get notified" icon="bell" href="/enrichment/alerts">
    Trigger alerts when new content matches a query.
  </Card>

  <Card title="Tune feature search" icon="sliders" href="/retrieval/stages/feature-search">
    Full reference for searches, fusion, and query modes.
  </Card>
</CardGroup>
