## What Gets Extracted
| Feature | Model | Dimensions | Extractor |
|---|---|---|---|
| Visual embeddings | Vertex AI multimodal | 1408D | multimodal_extractor |
| Audio transcript | Whisper | — | multimodal_extractor |
| Transcript embeddings | E5-Large | 1024D | multimodal_extractor |
| Scene descriptions | Gemini | — | multimodal_extractor |
| OCR (on-screen text) | Gemini | — | multimodal_extractor |
| Face embeddings | ArcFace (SCRFD detect) | 512D | face_identity_extractor |
| Learning units (lectures) | E5-Large + Jina Code + SigLIP | 1024D / 768D | course_content_extractor |
| Temporal segments | FFmpeg (time / scene / silence) | — | multimodal_extractor |
## Choosing an Extractor
| Goal | Extractor | Why |
|---|---|---|
| General video search (visual + spoken content) | multimodal_extractor | Unified embedding space across video, image, and text |
| Face recognition / identity matching | face_identity_extractor | 512D ArcFace embeddings with 99.8% verification accuracy |
| Educational content (lectures, slides, code) | course_content_extractor | Atomic learning units with text, code, and visual embeddings |
## Create a Collection for Video
This collection splits video into 10-second segments, transcribes the audio, and generates both visual and transcript embeddings.
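As a rough sketch, a collection definition for this might look like the following. Everything here is illustrative: the payload shape, parameter names, and endpoint are assumptions, not the platform's actual API; only the extractor name and segmentation behavior come from the tables above.

```python
# Hypothetical collection payload; field names are assumptions, not the real API.
collection = {
    "collection_name": "video_library",
    "extractors": [
        {
            "name": "multimodal_extractor",
            "parameters": {
                "segment_strategy": "time",    # per the table above: time / scene / silence
                "segment_length_seconds": 10,  # 10-second segments
                "transcribe_audio": True,      # Whisper transcription
                "generate_embeddings": ["multimodal", "transcription"],
            },
        }
    ],
}

# The payload would then be sent to a collections endpoint, e.g.:
# requests.post(f"{BASE_URL}/collections", json=collection)
```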
## Search by Visual Content

Create a retriever that searches video segments by visual similarity. A text query like “person writing on whiteboard” finds visually matching segments through Vertex AI’s cross-modal embedding space.
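A minimal sketch of such a retriever query, assuming a generic key/value query shape (the retriever name and parameter names are hypothetical; the index name matches the multimodal embedding field in the Output Schema):

```python
# Illustrative sketch; the query shape is an assumption, not the real API.
visual_query = {
    "retriever": "video_visual_search",
    "index": "multimodal_extractor_v1_multimodal_embedding",  # 1408D Vertex AI index
    "input": {"text": "person writing on whiteboard"},  # text embedded into the same cross-modal space
    "top_k": 10,
}
```

Because video, image, and text share one embedding space, the same retriever could accept an image as input instead of text.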
## Search by Transcript

To search spoken content, create a retriever that targets the transcription embedding index.
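The only difference from the visual retriever sketch is the target index (again, the query shape is an assumption; the index name follows the Output Schema):

```python
# Illustrative sketch; same hypothetical query shape, different index.
transcript_query = {
    "retriever": "video_transcript_search",
    "index": "multimodal_extractor_v1_transcription_embedding",  # 1024D E5-Large index
    "input": {"text": "gradient descent learning rate"},
    "top_k": 10,
}
```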
## Search by Face

Use a separate collection with face_identity_extractor to find video segments containing a specific person.
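A hedged sketch of a face query, assuming the query face is supplied as a reference image, embedded with ArcFace, and matched by cosine similarity against the index (the retriever and parameter names are hypothetical):

```python
# Illustrative sketch; parameter names are assumptions, not the real API.
face_query = {
    "retriever": "face_search",
    "index": "face_identity_extractor_v1_face_embedding",  # 512D ArcFace index
    "input": {"image_url": "https://example.com/reference_face.jpg"},  # hypothetical reference image
    "min_confidence": 0.8,  # drop low-confidence face detections
    "top_k": 20,
}
```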
### Example: Casting Intelligence Across Ad Creatives
A performance marketing agency indexes hundreds of video ads with face_identity_extractor to track which talent appears in which campaigns. The retriever returns face matches with timestamps and confidence scores, enabling queries like “find every ad featuring this creator” or “has this person appeared in a competitor’s campaign?”
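The grouping step behind “find every ad featuring this creator” can be sketched as follows. The result rows here are invented, and the field names assume the retriever returns the segment fields shown in the Output Schema alongside the match score:

```python
from collections import defaultdict

# Hypothetical retriever results (invented values for illustration).
results = [
    {"source_video_url": "s3://ads/campaign_a.mp4", "start_time": 4.0, "score": 0.91},
    {"source_video_url": "s3://ads/campaign_b.mp4", "start_time": 12.0, "score": 0.62},
    {"source_video_url": "s3://ads/campaign_a.mp4", "start_time": 31.0, "score": 0.88},
]

def ads_featuring(results, threshold=0.8):
    """Keep confident face matches, grouped by source video with timestamps."""
    by_video = defaultdict(list)
    for r in results:
        if r["score"] >= threshold:
            by_video[r["source_video_url"]].append(r["start_time"])
    return dict(by_video)

print(ads_featuring(results))
# {'s3://ads/campaign_a.mp4': [4.0, 31.0]}
```

Lowering the threshold widens recall at the cost of false positives, which matters for the competitor-campaign question where a missed match is costly.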
Face identity document (per detected face):

| Field | Type | Description |
|---|---|---|
| `face_identity_extractor_v1_face_embedding` | float[512] | ArcFace face embedding for identity matching |
| `face_bbox` | object | Bounding box of the detected face (normalized coordinates) |
| `face_confidence` | number | Detection confidence (0-1) |
| `score` | number | Cosine similarity to the query face (in retriever results) |
## Output Schema
After extraction, each video segment produces a document like this:

| Field | Type | Description |
|---|---|---|
| `start_time` | number | Segment start in seconds |
| `end_time` | number | Segment end in seconds |
| `transcription` | string | Whisper-transcribed audio |
| `description` | string | Gemini-generated scene description |
| `ocr_text` | string | Text visible in video frames |
| `thumbnail_url` | string | S3 URL of the segment thumbnail |
| `source_video_url` | string | Original source video URL |
| `video_segment_url` | string | URL of this specific segment clip |
| `multimodal_extractor_v1_multimodal_embedding` | float[1408] | Vertex AI visual/multimodal embedding |
| `multimodal_extractor_v1_transcription_embedding` | float[1024] | E5-Large transcript embedding |
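To make the schema concrete, here is a sketch of consuming one such document, say to render a search hit in a UI. The sample values are invented; only the field names come from the schema above:

```python
# Sample segment document matching the schema above (values are illustrative).
segment = {
    "start_time": 95.0,
    "end_time": 105.0,
    "transcription": "So the gradient points in the direction of steepest ascent.",
    "description": "Instructor writing an equation on a whiteboard.",
    "ocr_text": "f(x+h) - f(x)",
    "thumbnail_url": "https://example-bucket.s3.amazonaws.com/thumbs/seg_0010.jpg",
    "source_video_url": "https://example.com/lecture_03.mp4",
    "video_segment_url": "https://example.com/lecture_03_seg_0010.mp4",
}

def fmt_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS for display."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"

def render_result(doc: dict) -> str:
    """Format one segment document as a human-readable search hit."""
    return (
        f"[{fmt_timestamp(doc['start_time'])}-{fmt_timestamp(doc['end_time'])}] "
        f"{doc['description']} | \"{doc['transcription']}\""
    )

print(render_result(segment))
```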
## Related
- Multimodal Extractor — full parameter reference
- Face Identity Extractor — face detection and recognition
- Course Content Extractor — educational video processing
- Retrievers — build search pipelines over extracted features

