This guide walks the full path from an S3 bucket of videos to a working search index: create a storage connection, point a sync at your S3 prefix, and let Mixpeek ingest and process new files automatically.
This is the automated/continuous path. To upload a single file by URL instead, see Video Understanding. For the AWS IAM policy and role setup, see AWS S3.

Prerequisites

  • An S3 bucket containing video files.
  • AWS credentials — either an access key pair or an IAM role ARN (see AWS S3 for the IAM policy).
  • A Mixpeek API key and namespace.
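
The commands in this guide read the API base URL, API key, and namespace from the environment variables MP_API_URL, MP_API_KEY, and MP_NAMESPACE. As a minimal sketch, the helper below (require_env is a hypothetical name, not part of any Mixpeek tooling) reports which of those variables are still unset before you run anything:

```shell
# require_env NAME...: prints each unset or empty variable to stderr.
# Returns 1 if any variable is missing, 0 otherwise.
# require_env is a hypothetical helper, not part of any Mixpeek tooling.
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A typical call at the top of a script would be: require_env MP_API_URL MP_API_KEY MP_NAMESPACE || exit 1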

1. Create a storage connection

A connection stores your S3 credentials once and can be reused across buckets. Credentials are validated before the connection is saved (test_before_save defaults to true).
curl -sS -X POST "$MP_API_URL/v1/organizations/connections" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "prod-video-s3",
    "provider_type": "s3",
    "provider_config": {
      "provider_type": "s3",
      "region": "us-east-1",
      "credentials": {
        "type": "access_key",
        "access_key_id": "AKIA...",
        "secret_access_key": "..."
      }
    }
  }'
The response includes a connection_id (e.g. conn_abc123). Verify connectivity any time with:
curl -sS -X POST "$MP_API_URL/v1/organizations/connections/conn_abc123/test" \
  -H "Authorization: Bearer $MP_API_KEY"
Connections are org-level and not namespace-scoped — no X-Namespace header is needed for connection calls. Buckets, syncs, and collections below are namespace-scoped, so they require X-Namespace.
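
Later steps need the connection_id from the create response. The sketch below pulls a string field out of a JSON response without extra tooling. The helper name json_field is hypothetical, and the sample response is an assumed shape: the guide only states that a connection_id is returned.

```shell
# json_field JSON KEY: prints the string value stored under KEY
# (the last occurrence, if the key repeats).
# A deliberately small sed-based sketch: it handles plain string fields
# without escaped quotes, which is enough for an ID.
# For anything more complex, prefer a real JSON parser.
json_field() {
  printf '%s' "$1" | tr -d '\n' |
    sed -n 's/.*"'"$2"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Assumed response shape, for illustration only.
sample='{"connection_id": "conn_abc123", "name": "prod-video-s3"}'
CONNECTION_ID=$(json_field "$sample" connection_id)
echo "$CONNECTION_ID"
```

In a real script, replace the sample string with the captured output of the curl call that creates the connection.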

2. Create a bucket

The bucket declares the schema for each synced object. Use the Mixpeek type video for the file field (video_url in this example). Later steps refer to this bucket by its ID, shown as bkt_videos.
curl -sS -X POST "$MP_API_URL/v1/buckets" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket_name": "video-catalog",
    "bucket_schema": {
      "properties": {
        "video_url": { "type": "video" }
      }
    }
  }'

3. Create a collection

One collection with multimodal_extractor produces both visual (vertex_multimodal_embedding) and speech (transcription_embedding) features per segment.
curl -sS -X POST "$MP_API_URL/v1/collections" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_name": "video-moments",
    "source": { "type": "bucket", "bucket_ids": ["bkt_videos"] },
    "feature_extractor": {
      "feature_extractor_name": "multimodal_extractor",
      "version": "v1",
      "input_mappings": { "video": "video_url" }
    }
  }'

4. Sync the S3 prefix into the bucket

A sync watches an S3 path and ingests matching files. With sync_mode: "continuous", Mixpeek polls for new files and ingests them automatically; initial_only runs a single backfill.
curl -sS -X POST "$MP_API_URL/v1/buckets/bkt_videos/syncs" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "connection_id": "conn_abc123",
    "source_path": "s3://my-bucket/videos/",
    "sync_mode": "continuous",
    "polling_interval_seconds": 300,
    "batch_size": 50,
    "file_filters": { "extensions": [".mp4", ".mov"] },
    "schema_mapping": {
      "mappings": [
        { "source": "file", "target": "video_url" }
      ]
    }
  }'
| Field | Default | Description |
| --- | --- | --- |
| connection_id | required | The storage connection from step 1. |
| source_path | required | S3 URI/prefix to watch (e.g. s3://my-bucket/videos/). |
| sync_mode | continuous | continuous polls for new files; initial_only backfills once. |
| polling_interval_seconds | 300 | How often to check for new files (30–900). |
| batch_size | 50 | Files processed per batch (1–100). |
| file_filters | | Restrict by extension/pattern. |
| schema_mapping | | Map each file to a bucket property. |
| skip_duplicates | true | Skip files already ingested (by source ID). |
The response includes a sync_config_id (e.g. sync_xyz).
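
Two of the sync fields have documented bounds: polling_interval_seconds (30–900) and batch_size (1–100). A small client-side check can catch out-of-range values before the request is sent. This is only a sketch; in_range and check_sync_settings are hypothetical helper names, and the API performs its own validation regardless.

```shell
# in_range VALUE MIN MAX: succeeds when VALUE is an integer in [MIN, MAX].
in_range() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
  esac
  [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# check_sync_settings POLLING_INTERVAL_SECONDS BATCH_SIZE
# Bounds are the documented ones: 30-900 seconds and 1-100 files.
check_sync_settings() {
  in_range "$1" 30 900 || { echo "polling_interval_seconds out of range: $1" >&2; return 1; }
  in_range "$2" 1 100 || { echo "batch_size out of range: $2" >&2; return 1; }
}
```

For example, check_sync_settings 300 50 succeeds, while check_sync_settings 10 50 fails because 10 is below the minimum polling interval.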

5. Run and monitor the sync

A continuous sync starts on its own, but you can trigger the first run immediately and watch its job:
# Trigger a run now
curl -sS -X POST "$MP_API_URL/v1/buckets/bkt_videos/syncs/sync_xyz/trigger" \
  -H "Authorization: Bearer $MP_API_KEY" -H "X-Namespace: $MP_NAMESPACE"

# Check sync metrics (files discovered, ingested, failed)
curl -sS "$MP_API_URL/v1/buckets/bkt_videos/syncs/sync_xyz/metrics" \
  -H "Authorization: Bearer $MP_API_KEY" -H "X-Namespace: $MP_NAMESPACE"
| Operation | Endpoint |
| --- | --- |
| Trigger now | POST /v1/buckets/{bucket_id}/syncs/{sync_config_id}/trigger |
| Pause / resume | POST .../pause · POST .../resume |
| Job status | GET .../jobs/{sync_job_id} |
| Failed files (DLQ) | POST .../dlq |
The sync ingests files as objects and, unless skip_batch_submission is set, submits them for processing automatically, so your multimodal_extractor collection populates without any extra step. Vectors become searchable within 10–30 seconds of each batch completing.
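
To wait for a sync job from a script rather than checking by hand, a generic polling loop is one option. The sketch below is not tied to the Mixpeek response format: poll_until is a hypothetical helper that reruns any command until it prints an expected value. The status field name and its values are not documented in this guide, so the usage shown in the comments is an assumption.

```shell
# poll_until EXPECTED ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until its output equals EXPECTED or ATTEMPTS runs out.
# Returns 0 on a match, 1 otherwise.
poll_until() {
  expected=$1
  attempts=$2
  delay=$3
  shift 3
  i=1
  while [ "$i" -le "$attempts" ]; do
    if [ "$("$@")" = "$expected" ]; then
      return 0
    fi
    # Sleep only when another attempt remains.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Hypothetical usage: job_status is a function you would write to call
# GET .../jobs/{sync_job_id} and print a status string. The value
# "completed" is an assumption, not a documented status.
#   poll_until completed 20 15 job_status
```

Keeping the loop separate from the HTTP call means the same helper works for any endpoint, and the attempt limit keeps a stuck job from blocking the script indefinitely.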

6. Search the index

Create a retriever over the collection and execute it:
curl -sS -X POST "$MP_API_URL/v1/retrievers" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "retriever_name": "video-search",
    "collection_identifiers": ["video-moments"],
    "input_schema": { "query": { "type": "text", "required": true } },
    "stages": [
      {
        "stage_name": "search",
        "stage_type": "filter",
        "config": {
          "stage_id": "feature_search",
          "parameters": {
            "searches": [
              {
                "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
                "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
                "top_k": 100, "weight": 0.6
              },
              {
                "feature_uri": "mixpeek://multimodal_extractor@v1/transcription_embedding",
                "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
                "top_k": 100, "weight": 0.4
              }
            ],
            "final_top_k": 20,
            "fusion": "weighted"
          }
        }
      }
    ]
  }'
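
To build intuition for the weight values in the request above, the sketch below combines a visual score and a speech score as a plain linear sum using the same 0.6 and 0.4 weights. This is an illustrative reading of "fusion": "weighted", not Mixpeek's documented formula; the service may normalize or rank scores differently. The input scores are made up.

```shell
# weighted_score VISUAL SPEECH: linear combination with the weights
# from the retriever definition (0.6 visual, 0.4 transcription).
# Illustrative only; the actual fusion formula is an assumption.
weighted_score() {
  awk -v v="$1" -v s="$2" 'BEGIN { printf "%.2f\n", 0.6 * v + 0.4 * s }'
}

weighted_score 0.80 0.50   # strong visual match, weaker speech match
weighted_score 0.30 0.90   # weak visual match, strong speech match
```

Under this reading, the first segment scores 0.68 and the second 0.54, so a segment that matches visually outranks one that matches only in speech. Shifting weight toward transcription_embedding would reverse that preference.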
curl -sS -X POST "$MP_API_URL/v1/retrievers/{retriever_id}/execute" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{ "inputs": { "query": "people discussing electric vehicles" } }'
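
When the query comes from user input rather than a hard-coded string, quotes or backslashes in it would break the JSON body. A minimal sketch of building the execute body safely is shown below; json_escape and execute_body are hypothetical helper names, and the escaping covers only backslashes and double quotes.

```shell
# json_escape STRING: escapes backslashes and double quotes so STRING can
# sit inside a JSON string literal. Control characters are not handled.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# execute_body QUERY: prints the request body for the execute call,
# in the same shape as the example above.
execute_body() {
  printf '{ "inputs": { "query": "%s" } }\n' "$(json_escape "$1")"
}
```

The result can be passed straight to curl with -d "$(execute_body "$QUERY")", where QUERY holds the search text.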

Keep it fresh

Because the sync is continuous, new videos dropped into the S3 prefix are ingested and indexed automatically. To re-run clustering or enrichment on a schedule as new content lands, see Triggers.

Other storage providers

The same connection → bucket → sync flow works for every supported provider — swap provider_type and provider_config:

  • Google Cloud Storage
  • Azure Blob
  • Cloudflare R2