## What Gets Extracted
| Feature | Model | Dimensions | Extractor |
|---|---|---|---|
| Visual embeddings | Vertex AI multimodal | 1408D | multimodal_extractor |
| Audio transcript | Whisper | — | multimodal_extractor |
| Transcript embeddings | E5-Large | 1024D | multimodal_extractor |
| Scene descriptions | Gemini | — | multimodal_extractor |
| OCR (on-screen text) | Gemini | — | multimodal_extractor |
| Face embeddings | ArcFace (SCRFD detect) | 512D | face_identity_extractor |
| Learning units (lectures) | E5-Large + Jina Code + SigLIP | 1024D / 768D | course_content_extractor |
| Temporal segments | FFmpeg (time / scene / silence) | — | multimodal_extractor |
## Choosing an Extractor
| Goal | Extractor | Why |
|---|---|---|
| General video search (visual + spoken content) | multimodal_extractor | Unified embedding space across video, image, and text |
| Face recognition / identity matching | face_identity_extractor | 512D ArcFace embeddings with 99.8% verification accuracy |
| Educational content (lectures, slides, code) | course_content_extractor | Atomic learning units with text, code, and visual embeddings |
## Create a Collection for Video
This collection splits video into 10-second segments, transcribes the audio, and generates both visual and transcript embeddings.
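As a rough sketch, a collection definition for this might look like the following. Everything here is illustrative: the payload shape, parameter names, and endpoint are assumptions, not the platform's actual API; only the extractor name and segmentation behavior come from the tables above.

```python
# Hypothetical collection payload; field names are assumptions, not the real API.
collection = {
    "collection_name": "video_library",
    "extractors": [
        {
            "name": "multimodal_extractor",
            "parameters": {
                "segment_strategy": "time",    # per the table above: time / scene / silence
                "segment_length_seconds": 10,  # 10-second segments
                "transcribe_audio": True,      # Whisper transcription
                "generate_embeddings": ["multimodal", "transcription"],
            },
        }
    ],
}

# The payload would then be sent to a collections endpoint, e.g.:
# requests.post(f"{BASE_URL}/collections", json=collection)
```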
## Search by Visual Content

Create a retriever that searches video segments by visual similarity. A text query like “person writing on whiteboard” finds visually matching segments through Vertex AI’s cross-modal embedding space.
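A minimal sketch of such a retriever query, assuming a generic key/value query shape (the retriever name and parameter names are hypothetical; the index name matches the multimodal embedding field in the Output Schema):

```python
# Illustrative sketch; the query shape is an assumption, not the real API.
visual_query = {
    "retriever": "video_visual_search",
    "index": "multimodal_extractor_v1_multimodal_embedding",  # 1408D Vertex AI index
    "input": {"text": "person writing on whiteboard"},  # text embedded into the same cross-modal space
    "top_k": 10,
}
```

Because video, image, and text share one embedding space, the same retriever could accept an image as input instead of text.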
## Search by Transcript

To search spoken content, create a retriever that targets the transcription embedding index.
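The only difference from the visual retriever sketch is the target index (again, the query shape is an assumption; the index name follows the Output Schema):

```python
# Illustrative sketch; same hypothetical query shape, different index.
transcript_query = {
    "retriever": "video_transcript_search",
    "index": "multimodal_extractor_v1_transcription_embedding",  # 1024D E5-Large index
    "input": {"text": "gradient descent learning rate"},
    "top_k": 10,
}
```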
## Search by Face

Use a separate collection with face_identity_extractor to find video segments containing a specific person.
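A hedged sketch of a face query, assuming the query face is supplied as a reference image, embedded with ArcFace, and matched by cosine similarity against the index (the retriever and parameter names are hypothetical):

```python
# Illustrative sketch; parameter names are assumptions, not the real API.
face_query = {
    "retriever": "face_search",
    "index": "face_identity_extractor_v1_face_embedding",  # 512D ArcFace index
    "input": {"image_url": "https://example.com/reference_face.jpg"},  # hypothetical reference image
    "min_confidence": 0.8,  # drop low-confidence face detections
    "top_k": 20,
}
```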
### Example: Casting Intelligence Across Ad Creatives
A performance marketing agency indexes hundreds of video ads with face_identity_extractor to track which talent appears in which campaigns. The retriever returns face matches with timestamps and confidence scores, enabling queries like “find every ad featuring this creator” or “has this person appeared in a competitor’s campaign?”
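The grouping step behind “find every ad featuring this creator” can be sketched as follows. The result rows here are invented, and the field names assume the retriever returns the segment fields shown in the Output Schema alongside the match score:

```python
from collections import defaultdict

# Hypothetical retriever results (invented values for illustration).
results = [
    {"source_video_url": "s3://ads/campaign_a.mp4", "start_time": 4.0, "score": 0.91},
    {"source_video_url": "s3://ads/campaign_b.mp4", "start_time": 12.0, "score": 0.62},
    {"source_video_url": "s3://ads/campaign_a.mp4", "start_time": 31.0, "score": 0.88},
]

def ads_featuring(results, threshold=0.8):
    """Keep confident face matches, grouped by source video with timestamps."""
    by_video = defaultdict(list)
    for r in results:
        if r["score"] >= threshold:
            by_video[r["source_video_url"]].append(r["start_time"])
    return dict(by_video)

print(ads_featuring(results))
# {'s3://ads/campaign_a.mp4': [4.0, 31.0]}
```

Lowering the threshold widens recall at the cost of false positives, which matters for the competitor-campaign question where a missed match is costly.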
Face identity document (per detected face):

| Field | Type | Description |
|---|---|---|
| `face_identity_extractor_v1_face_embedding` | float[512] | ArcFace face embedding for identity matching |
| `face_bbox` | object | Bounding box of the detected face (normalized coordinates) |
| `face_confidence` | number | Detection confidence (0-1) |
| `score` | number | Cosine similarity to the query face (in retriever results) |
## Output Schema
After extraction, each video segment produces a document like this:

| Field | Type | Description |
|---|---|---|
| `start_time` | number | Segment start in seconds |
| `end_time` | number | Segment end in seconds |
| `transcription` | string | Whisper-transcribed audio |
| `description` | string | Gemini-generated scene description |
| `ocr_text` | string | Text visible in video frames |
| `thumbnail_url` | string | S3 URL of the segment thumbnail |
| `source_video_url` | string | Original source video URL |
| `video_segment_url` | string | URL of this specific segment clip |
| `multimodal_extractor_v1_multimodal_embedding` | float[1408] | Vertex AI visual/multimodal embedding |
| `multimodal_extractor_v1_transcription_embedding` | float[1024] | E5-Large transcript embedding |
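To make the schema concrete, here is a sketch of consuming one such document, say to render a search hit in a UI. The sample values are invented; only the field names come from the schema above:

```python
# Sample segment document matching the schema above (values are illustrative).
segment = {
    "start_time": 95.0,
    "end_time": 105.0,
    "transcription": "So the gradient points in the direction of steepest ascent.",
    "description": "Instructor writing an equation on a whiteboard.",
    "ocr_text": "f(x+h) - f(x)",
    "thumbnail_url": "https://example-bucket.s3.amazonaws.com/thumbs/seg_0010.jpg",
    "source_video_url": "https://example.com/lecture_03.mp4",
    "video_segment_url": "https://example.com/lecture_03_seg_0010.mp4",
}

def fmt_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS for display."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"

def render_result(doc: dict) -> str:
    """Format one segment document as a human-readable search hit."""
    return (
        f"[{fmt_timestamp(doc['start_time'])}-{fmt_timestamp(doc['end_time'])}] "
        f"{doc['description']} | \"{doc['transcription']}\""
    )

print(render_result(segment))
```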
## Related
- Multimodal Extractor — full parameter reference
- Face Identity Extractor — face detection and recognition
- Course Content Extractor — educational video processing
- Retrievers — build search pipelines over extracted features

