How It Works
When you ingest a video, the multimodal_extractor runs a multi-stage pipeline:
- Chunking: the video is split into segments (scene detection or fixed intervals).
- Visual embeddings: each segment’s keyframes are embedded as vertex_multimodal_embedding.
- Speech embeddings: speech is transcribed (Whisper) and embedded (feature URI mixpeek://multimodal_extractor@v1/multilingual_e5_large_instruct_v1).
- Multi-vector indexing: both embeddings are indexed per segment, so one collection supports hybrid visual + transcript search.
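The speech feature URI in the list above appears to follow the pattern mixpeek://<extractor>@<version>/<feature suffix>. A minimal sketch of that pattern, assuming the visual feature uses the same layout; the helper name feature_uri is hypothetical and not part of any SDK:

```python
def feature_uri(extractor: str, version: str, suffix: str) -> str:
    """Build a URI of the form mixpeek://<extractor>@<version>/<suffix>."""
    return f"mixpeek://{extractor}@{version}/{suffix}"


# Speech/transcript feature: this URI is quoted verbatim in the pipeline list.
SPEECH_URI = feature_uri(
    "multimodal_extractor", "v1", "multilingual_e5_large_instruct_v1"
)

# Assumption: the visual feature follows the same URI pattern as the speech one.
VISUAL_URI = feature_uri(
    "multimodal_extractor", "v1", "vertex_multimodal_embedding"
)

print(SPEECH_URI)
print(VISUAL_URI)
```

Keeping both URIs as named constants makes it harder to mistype a suffix when the retriever is defined later.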
A single multimodal_extractor@v1 produces both the visual feature (URI suffix vertex_multimodal_embedding) and the speech/transcript feature (URI suffix multilingual_e5_large_instruct_v1). You do not need separate video/audio extractors.
1. Create a bucket
The bucket schema declares the fields each video object carries. Use the Mixpeek type video for the file field.
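As a rough sketch of what such a schema could look like, the payload below declares a video_url file field of type video plus two metadata fields. The key names (bucket_name, bucket_schema, properties), the string type, and the title and category fields are all illustrative assumptions; confirm the exact shape against the API reference before using it.

```python
import json

# Hypothetical bucket definition; key names are assumptions, not confirmed API.
bucket_payload = {
    "bucket_name": "videos",  # illustrative name
    "bucket_schema": {
        "properties": {
            # The file field uses the Mixpeek video type.
            "video_url": {"type": "video"},
            # Illustrative metadata fields carried by each object.
            "title": {"type": "string"},
            "category": {"type": "string"},
        }
    },
}

print(json.dumps(bucket_payload, indent=2))
```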
2. Create a collection
One collection with the multimodal_extractor gives you both visual and speech search. Map the extractor’s video input to your bucket’s video_url field.
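The mapping described here could be expressed as a payload along these lines. This is a sketch under assumptions: the key names collection_name, source, feature_extractor_name, version, and input_mappings are hypothetical, and <bucket_id> is a placeholder for the ID returned when the bucket was created.

```python
import json

# Hypothetical collection definition; key names are assumptions.
collection_payload = {
    "collection_name": "video_segments",  # illustrative name
    "source": {"type": "bucket", "bucket_ids": ["<bucket_id>"]},  # placeholder ID
    "feature_extractor": {
        "feature_extractor_name": "multimodal_extractor",
        "version": "v1",
        # Map the extractor's "video" input to the bucket's video_url field.
        "input_mappings": {"video": "video_url"},
    },
}

print(json.dumps(collection_payload, indent=2))
```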
A collection takes exactly one feature_extractor. To run additional extractors over the same objects, create more collections from the same bucket.
3. Ingest a video
Register a video object. The blob’s content goes in the data field (a URL, s3:// path, base64, or raw content).
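A minimal sketch of an object body, assuming a blobs list whose entries name the target schema field. Only the data field is taken from the text above; the blobs, property, type, and metadata keys, the helper name video_object, and the example S3 path are hypothetical.

```python
def video_object(video_ref: str, **metadata: str) -> dict:
    """Build a hypothetical object body for one video.

    video_ref may be a URL, an s3:// path, base64, or raw content;
    it goes in the blob's "data" field.
    """
    return {
        "blobs": [
            {"property": "video_url", "type": "video", "data": video_ref}
        ],
        # Assumption: extra schema fields travel alongside the blob.
        "metadata": dict(metadata),
    }


obj = video_object("s3://example-bucket/demo.mp4", category="tutorial")
print(obj)
```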
4. Process
Create a batch from your object(s), submit it, then poll the returned task until it completes.
5. Create a hybrid retriever
The retriever fuses two feature searches, visual and speech, over the same collection. Note the canonical field names: collection_identifiers, a flat input_schema, and the stage config wrapping stage_id + parameters.
| Field | Meaning |
|---|---|
| collection_identifiers | Collections to query (names or IDs). |
| input_schema | Flat map of input field → type. Reference inputs in stages as {{INPUT.<field>}}. |
| searches[].feature_uri | Which embedding index to search. Here: visual + speech. |
| searches[].weight | Per-search weight used by weighted fusion. |
| final_top_k | Results returned after fusion. |
| fusion | rrf (default) or weighted. |
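Putting the fields from the table together, a retriever definition might look like the sketch below. The names collection_identifiers, input_schema, stage_id, parameters, searches, feature_uri, weight, final_top_k, and fusion come from this page. Everything else is an assumption: the retriever and stage names, the feature_search stage ID, the text input type and input mode, the visual feature URI (inferred from its suffix), and the weights.

```python
import json

# Assumed from the URI suffix; the speech URI is quoted on this page.
VISUAL_URI = "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding"
SPEECH_URI = "mixpeek://multimodal_extractor@v1/multilingual_e5_large_instruct_v1"


def text_search(feature_uri: str, weight: float) -> dict:
    """One entry of searches[]: which index to query and its fusion weight."""
    return {
        "feature_uri": feature_uri,
        # Assumption: "text" is the input mode for text queries.
        "query": {"input_mode": "text", "value": "{{INPUT.query}}"},
        "weight": weight,
    }


retriever_payload = {
    "retriever_name": "video_hybrid_search",       # illustrative
    "collection_identifiers": ["video_segments"],  # names or IDs
    "input_schema": {"query": "text"},             # flat map: field -> type
    "stages": [
        {
            "stage_name": "hybrid_search",         # illustrative
            "config": {
                "stage_id": "feature_search",      # assumed stage ID
                "parameters": {
                    "searches": [
                        text_search(VISUAL_URI, 0.6),
                        text_search(SPEECH_URI, 0.4),
                    ],
                    "fusion": "weighted",          # or "rrf" (default)
                    "final_top_k": 10,
                },
            },
        }
    ],
}

print(json.dumps(retriever_payload, indent=2))
```

The weights only matter when fusion is weighted; with rrf the two result lists are combined by rank instead.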
6. Search
Execute the retriever. The inputs keys must match your input_schema.
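A small client-side check can catch a key mismatch before the request is sent. This is a sketch, not part of any SDK: the helper check_inputs is hypothetical, and it treats every schema field as required, which is an assumption about how strictly the service validates inputs.

```python
def check_inputs(input_schema: dict, inputs: dict) -> None:
    """Raise ValueError if the inputs keys differ from the input_schema keys."""
    missing = set(input_schema) - set(inputs)
    unknown = set(inputs) - set(input_schema)
    if missing or unknown:
        raise ValueError(
            f"inputs mismatch: missing={sorted(missing)} unknown={sorted(unknown)}"
        )


input_schema = {"query": "text"}  # same flat map the retriever declares
execute_body = {"inputs": {"query": "chef plating a dessert"}}

check_inputs(input_schema, execute_body["inputs"])  # passes silently
print(execute_body)
```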
Filter by category
Apply ad-hoc filters at execution time without changing the retriever.
Search by a reference image
Because vertex_multimodal_embedding is cross-modal, you can query the visual index with an image to find similar scenes. Add an image input to input_schema and set the visual search’s query to { "input_mode": "content", "value": "{{INPUT.query_image}}" }. See Feature Search for image/URL query modes.
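The change described above could be sketched as follows. The query object is the one quoted in the text; the image type name in input_schema and the visual feature URI (inferred from its suffix) are assumptions. The regular expression simply confirms that the template refers to an input the schema declares.

```python
import re

# Assumption: "image" is the type name for an image input.
input_schema = {"query_image": "image"}

visual_image_search = {
    # Assumed from the URI suffix given for the visual feature.
    "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
    # Query object as quoted in the text above.
    "query": {"input_mode": "content", "value": "{{INPUT.query_image}}"},
}

# Collect every {{INPUT.<field>}} reference in the query template.
referenced = re.findall(
    r"\{\{INPUT\.(\w+)\}\}", visual_image_search["query"]["value"]
)
undeclared = [name for name in referenced if name not in input_schema]

print(referenced, undeclared)
```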
Moment-level search
Segments carry start/end timestamps, so you can scope results to part of a video with a filter (field names depend on your collection’s output schema).
Search the same video other ways
The same segments are searchable across every modality the multimodal_extractor produced. Combine or swap these into your retriever:
Search faces
Add the face_identity_extractor to find every clip featuring a specific person (ArcFace 1:N).
Search transcripts
Query mixpeek://multimodal_extractor@v1/multilingual_e5_large_instruct_v1 to find moments by what’s said.
Group into moments
Merge matching frames into continuous time-ranges with the moment_group stage.
Rerank for precision
Add a cross-encoder rerank stage to sharpen the top results.
Filter unsafe content
Drop or flag NSFW segments at query time with the classify stage.
Hybrid keyword search
Add a lexical: true search to catch exact terms in transcripts (BM25).
Search by action or activity
There is no dedicated action-recognition extractor. Instead, search for actions as natural language against the visual embedding. A text query like "person running" or "two people shaking hands" matches against vertex_multimodal_embedding because that embedding captures actions depicted in frames.
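As a sketch of an action query, the snippet below defines a single search against the visual index and prepares one execution body per example phrase. The feature URI is inferred from the visual feature's suffix, and the text input mode and the query input name are assumptions.

```python
# Assumed from the URI suffix given for the visual feature.
VISUAL_URI = "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding"

# A single search against the visual index only; no speech search is needed
# because the action is matched against what the frames depict.
action_search = {
    "feature_uri": VISUAL_URI,
    # Assumption: "text" is the input mode for a natural-language query.
    "query": {"input_mode": "text", "value": "{{INPUT.query}}"},
}

action_queries = ["person running", "two people shaking hands"]

# One execution body per action phrase; keys must match the input_schema.
execute_bodies = [{"inputs": {"query": q}} for q in action_queries]

for body in execute_bodies:
    print(body)
```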
Use run_video_description / response_shape on the extractor to emit structured action labels that you can then attribute_filter or classify.
Next steps
Auto-tag segments
Classify segments (“product demo”, “interview”) with a taxonomy.
Discover themes
Cluster segment embeddings to surface recurring visual themes.
Get notified
Trigger alerts when new content matches a query.
Tune feature search
Full reference for searches, fusion, and query modes.

