Ingest Video from S3

This guide walks the full path from an S3 bucket of videos to a working search index: create a storage connection, point a sync at your S3 prefix, and let Mixpeek ingest and process new files automatically.

This is the automated/continuous path. To upload a single file by URL instead, see Video Understanding. For the AWS IAM policy and role setup, see AWS S3.

Prerequisites

An S3 bucket containing video files.
AWS credentials — either an access key pair or an IAM role ARN (see AWS S3 for the IAM policy).
A Mixpeek API key and namespace.
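
The requests in this guide read three shell variables: MP_API_URL, MP_API_KEY, and MP_NAMESPACE. A minimal setup sketch follows. The values are placeholders, the base URL is an assumption to confirm against your account, and require_env is a hypothetical helper, not part of any Mixpeek tooling.

```shell
# Placeholder values: replace each one before running the requests below.
# The base URL is an assumption; confirm it against your Mixpeek account.
export MP_API_URL="https://api.mixpeek.com"
export MP_API_KEY="your-api-key"
export MP_NAMESPACE="your-namespace"

# require_env NAME...: report every variable that is unset or empty and
# return non-zero if any are missing. (Hypothetical helper.)
require_env() {
  missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

require_env MP_API_URL MP_API_KEY MP_NAMESPACE
```

Running the check first turns a forgotten variable into an explicit message rather than a malformed URL or an authentication error later on.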

1. Create a storage connection

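A connection stores your S3 credentials once and can be reused across buckets. Credentials are validated before the connection is saved (test_before_save defaults to true). Send one of the two requests below: the first authenticates with an access key pair, the second with an IAM role.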

curl -sS -X POST "$MP_API_URL/v1/organizations/connections" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "prod-video-s3",
    "provider_type": "s3",
    "provider_config": {
      "provider_type": "s3",
      "region": "us-east-1",
      "credentials": {
        "type": "access_key",
        "access_key_id": "AKIA...",
        "secret_access_key": "..."
      }
    }
  }'

curl -sS -X POST "$MP_API_URL/v1/organizations/connections" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "prod-video-s3",
    "provider_type": "s3",
    "provider_config": {
      "provider_type": "s3",
      "region": "us-east-1",
      "credentials": {
        "type": "iam_role",
        "role_arn": "arn:aws:iam::123456789012:role/mixpeek-read",
        "external_id": "your-external-id"
      }
    }
  }'

The response includes a connection_id (e.g. conn_abc123). Verify connectivity any time with:

curl -sS -X POST "$MP_API_URL/v1/organizations/connections/conn_abc123/test" \
  -H "Authorization: Bearer $MP_API_KEY"
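
Later steps reference the connection ID, so it can help to capture it in a variable instead of copying it by hand. The sketch below uses a hypothetical json_field helper built on sed. It assumes a flat response with connection_id as a plain string field, and the canned response body is illustrative only. A real JSON parser such as jq is the more robust choice where it is available.

```shell
# json_field NAME: read JSON on stdin and print the first string value
# stored under "NAME". A sed-based sketch for flat responses only; it does
# not handle escaped quotes or nested structures.
json_field() {
  sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Canned response body for illustration (the exact shape is an assumption).
response='{"connection_id": "conn_abc123", "name": "prod-video-s3"}'

CONNECTION_ID=$(printf '%s\n' "$response" | json_field connection_id)
echo "$CONNECTION_ID"
```

In practice you would pipe the output of the create-connection request into json_field instead of the canned string.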

Connections are org-level and not namespace-scoped — no X-Namespace header is needed for connection calls. Buckets, syncs, and collections below are namespace-scoped, so they require X-Namespace.

2. Create a bucket

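The bucket declares the schema for each synced object. Use the Mixpeek type video for the file field. Later steps refer to this bucket by ID; bkt_videos in the examples stands in for the ID of the bucket you create here.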

curl -sS -X POST "$MP_API_URL/v1/buckets" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket_name": "video-catalog",
    "bucket_schema": {
      "properties": {
        "video_url": { "type": "video" }
      }
    }
  }'

3. Create a collection

One collection with multimodal_extractor produces both visual (vertex_multimodal_embedding) and speech (multilingual_e5_large_instruct_v1) features per segment.

curl -sS -X POST "$MP_API_URL/v1/collections" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_name": "video-moments",
    "source": { "type": "bucket", "bucket_ids": ["bkt_videos"] },
    "feature_extractor": {
      "feature_extractor_name": "multimodal_extractor",
      "version": "v1",
      "input_mappings": { "video": "video_url" }
    }
  }'

4. Sync the S3 prefix into the bucket

A sync watches an S3 path and ingests matching files. With sync_mode: "continuous", Mixpeek polls for new files and ingests them automatically; initial_only runs a single backfill.

curl -sS -X POST "$MP_API_URL/v1/buckets/bkt_videos/syncs" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "connection_id": "conn_abc123",
    "source_path": "s3://my-bucket/videos/",
    "sync_mode": "continuous",
    "polling_interval_seconds": 300,
    "batch_size": 50,
    "file_filters": { "extensions": [".mp4", ".mov"] },
    "schema_mapping": {
      "mappings": {
        "video_url": {
          "target_type": "blob",
          "source": { "type": "file" },
          "blob_type": "video",
          "blob_property": "video_url"
        }
      }
    }
  }'

| Field | Default | Description |
| --- | --- | --- |
| `connection_id` | required | The storage connection from step 1. |
| `source_path` | required | S3 URI/prefix to watch (e.g. `s3://my-bucket/videos/`). |
| `sync_mode` | `continuous` | `continuous` polls for new files; `initial_only` backfills once. |
| `polling_interval_seconds` | `300` | How often to check for new files, in seconds (30–900). |
| `batch_size` | `50` | Files processed per batch (1–100). |
| `file_filters` | none | Restrict by extension/pattern. |
| `schema_mapping` | none | Map each file to a bucket property. |
| `skip_duplicates` | `true` | Skip files already ingested (by source ID). |
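
The documented limits for polling_interval_seconds (30–900) and batch_size (1–100) can be checked locally before the sync request is sent. The following is a small sketch; in_range and check_sync_settings are hypothetical helpers, and the API remains the authority on what it accepts.

```shell
# in_range VALUE MIN MAX: succeed when VALUE is a non-negative integer
# within [MIN, MAX].
in_range() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
  esac
  [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# check_sync_settings INTERVAL BATCH: compare both values against the
# limits listed in the table above.
check_sync_settings() {
  in_range "$1" 30 900 || { echo "polling_interval_seconds must be 30-900" >&2; return 1; }
  in_range "$2" 1 100 || { echo "batch_size must be 1-100" >&2; return 1; }
}

check_sync_settings 300 50 && echo "sync settings ok"
```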

The response includes a sync_config_id (e.g. sync_xyz).

5. Run and monitor the sync

Syncs in continuous mode start on their own, but you can trigger the first run immediately and watch its progress:

# Trigger a run now
curl -sS -X POST "$MP_API_URL/v1/buckets/bkt_videos/syncs/sync_xyz/trigger" \
  -H "Authorization: Bearer $MP_API_KEY" -H "X-Namespace: $MP_NAMESPACE"

# Check sync metrics (files discovered, ingested, failed)
curl -sS "$MP_API_URL/v1/buckets/bkt_videos/syncs/sync_xyz/metrics" \
  -H "Authorization: Bearer $MP_API_KEY" -H "X-Namespace: $MP_NAMESPACE"

| Operation | Endpoint |
| --- | --- |
| Trigger now | `POST /v1/buckets/{bucket_id}/syncs/{sync_config_id}/trigger` |
| Pause / resume | `POST .../pause` · `POST .../resume` |
| Job status | `GET .../jobs/{sync_job_id}` |
| Failed files (DLQ) | `POST .../dlq` |
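
Every operation in the table shares the same URL prefix, so a small wrapper can keep scripts short. This is a sketch only: sync_url and sync_post are hypothetical helpers, the fallback base URL is an assumption, and sync_post sends no request body, which may not suit every endpoint.

```shell
# Fall back to an assumed base URL when MP_API_URL is not already set.
MP_API_URL="${MP_API_URL:-https://api.mixpeek.com}"

# sync_url BUCKET_ID SYNC_CONFIG_ID ACTION: print the endpoint URL for a
# sync operation. ACTION is the path suffix from the table, for example
# trigger, pause, resume, dlq, or jobs/<sync_job_id>.
sync_url() {
  printf '%s/v1/buckets/%s/syncs/%s/%s\n' "$MP_API_URL" "$1" "$2" "$3"
}

# sync_post BUCKET_ID SYNC_CONFIG_ID ACTION: POST to a sync endpoint with
# the same headers as the trigger example. Defined here but not invoked.
sync_post() {
  curl -sS -X POST "$(sync_url "$1" "$2" "$3")" \
    -H "Authorization: Bearer $MP_API_KEY" -H "X-Namespace: $MP_NAMESPACE"
}

sync_url bkt_videos sync_xyz pause
```

With these in place, pausing and resuming become `sync_post bkt_videos sync_xyz pause` and `sync_post bkt_videos sync_xyz resume`.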

The sync ingests files as objects and, unless skip_batch_submission is set, submits them for processing automatically, so your multimodal_extractor collection populates without any extra step. Vectors become searchable within 10–30 seconds of each batch completing.

6. Search

Create a retriever over the collection, then execute it. In the second request, replace {retriever_id} with the ID of the retriever you just created:

curl -sS -X POST "$MP_API_URL/v1/retrievers" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "retriever_name": "video-search",
    "collection_identifiers": ["video-moments"],
    "input_schema": { "query": { "type": "text", "required": true } },
    "stages": [
      {
        "stage_name": "search",
        "stage_type": "filter",
        "config": {
          "stage_id": "feature_search",
          "parameters": {
            "searches": [
              {
                "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
                "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
                "top_k": 100, "weight": 0.6
              },
              {
                "feature_uri": "mixpeek://multimodal_extractor@v1/multilingual_e5_large_instruct_v1",
                "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
                "top_k": 100, "weight": 0.4
              }
            ],
            "final_top_k": 20,
            "fusion": "weighted"
          }
        }
      }
    ]
  }'

curl -sS -X POST "$MP_API_URL/v1/retrievers/{retriever_id}/execute" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{ "inputs": { "query": "people discussing electric vehicles" } }'
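
When the query text comes from user input rather than a hard-coded string, it needs escaping before it is placed inside the JSON body. A minimal sketch, assuming queries are single-line text: json_escape and execute_payload are hypothetical helpers that handle only backslashes and double quotes, not control characters or newlines.

```shell
# json_escape: escape backslashes and double quotes on stdin so the text
# can sit inside a JSON string literal.
json_escape() {
  sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# execute_payload QUERY: print the request body for the execute call,
# using the same { "inputs": { "query": ... } } shape as the example above.
execute_payload() {
  printf '{ "inputs": { "query": "%s" } }\n' "$(printf '%s' "$1" | json_escape)"
}

execute_payload 'videos about "range anxiety"'
```

The result can be passed to curl with `-d "$(execute_payload "$user_query")"` in place of the literal body.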

Keep it fresh

Because the sync is continuous, new videos dropped into the S3 prefix are ingested and indexed automatically. To re-run clustering or enrichment on a schedule as new content lands, see Triggers.

Other storage providers

The same connection → bucket → sync flow works for every supported provider — swap provider_type and provider_config:

Google Cloud Storage

Azure Blob

Cloudflare R2