The multimodal extractor processes video, audio, image, text, and GIF content using unified Vertex embeddings (1408D). Videos/audio are decomposed into segments with transcription (Whisper), visual embeddings, OCR, and descriptions. Images and text are embedded directly without decomposition.
Pipeline Steps
1. Filter Dataset (if `collection_id` provided): filter to the specified collection.
2. Apply Input Mappings.
3. Detect Content Types (samples 100 rows): identifies video, audio, image, text, or mixed content.
4. Content Routing:
   - Video: FFmpeg chunking (time/scene/silence) → Steps 5-10
   - Audio: FFmpeg audio chunking (time/silence) → Steps 5-8
   - Image: skip to Step 8
   - Text: skip to Step 8
   - Mixed: branch by type, process each type separately, union the results
5. Transcription (conditional: `run_transcription=true`, video/audio only): Whisper API or local GPU speech-to-text.
6. Transcription Embeddings (conditional: `run_transcription_embedding=true`): E5-Large text embeddings (1024D) from the transcribed audio.
7. Multimodal Embeddings (conditional: `run_multimodal_embedding=true`): Vertex AI embeddings (1408D) for all content types; the unified embedding space enables cross-modal search.
8. Thumbnail Generation (conditional: `enable_thumbnails=true`, visual content only): 640px width at 85% quality, uploaded to S3 with optional CDN.
9. Visual Analysis (conditional: `run_video_description=true` OR `run_ocr=true`, visual content only): Gemini-based descriptions and/or OCR text extraction.
10. Output: segment/document records with embeddings, transcriptions, descriptions, OCR text, and thumbnails.
When to Use
| Use Case | Description |
|---|---|
| Video content libraries | Search and navigate video segments by content |
| Media platforms | Search across spoken and visual content |
| Educational content | Find moments in lectures and tutorials |
| Surveillance/security | Event detection in footage |
| Social media | Process user-generated video content |
| Broadcasting/streaming | Large video catalog management |
| Marketing analytics | Analyze video campaigns |
| Cross-modal search | Find videos/images using text queries |
When NOT to Use
| Scenario | Recommended Alternative |
|---|---|
| Static image collections only | image_extractor |
| Audio-only content | audio_extractor |
| Very short videos (< 5 seconds) | Processing overhead not worth it |
| Real-time live streams | Specialized streaming extractors |
| 8K+ resolution video | Consider downsampling first |
| Input | Type | Description | Processing |
|---|---|---|---|
| `video` | string | URL or S3 path | Decomposed into segments |
| `image` | string | URL or S3 path | Direct embedding (no decomposition) |
| `text` | string | Plain text content | Direct embedding |
| `gif` | string | URL or S3 path | Treated as video, frame-by-frame |
Supported formats:
- Video: MP4, MOV, AVI, MKV, WebM, FLV
- Image: JPG, PNG, WebP, BMP
- GIF: Animated GIF
Provide one of the following inputs:
```json
{
  "video": "s3://bucket/videos/lecture.mp4"
}
```
```json
{
  "image": "https://cdn.example.com/products/laptop.jpg"
}
```
```json
{
  "text": "High-performance laptop with M3 chip, perfect for developers"
}
```
| Field | Type | Description |
|---|---|---|
| `video` | string | URL/S3 path to video file. Recommended: 720p-1080p, < 2 hours |
| `image` | string | URL/S3 path to image file. Recommended: < 10MB |
| `text` | string | Plain text for cross-modal embedding |
| `gif` | string | URL/S3 path to GIF file |
| `custom_thumbnail` | string | Optional custom thumbnail URL used instead of the auto-generated one |
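For example, a video input can supply its own thumbnail in place of the auto-generated one (both URLs below are illustrative):

```json
{
  "video": "s3://bucket/videos/lecture.mp4",
  "custom_thumbnail": "https://cdn.example.com/thumbnails/lecture.jpg"
}
```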
Output Schema
Each video segment produces one document with the following fields:
| Field | Type | Description |
|---|---|---|
| `start_time` | number | Segment start time in seconds |
| `end_time` | number | Segment end time in seconds |
| `start_frame` | integer | Start frame number of the segment (start_time × fps) |
| `end_frame` | integer | End frame number of the segment (end_time × fps) |
| `fps` | number | Video frame rate in frames per second |
| `duration` | number | Segment duration in seconds |
| `transcription` | string | Transcribed audio content |
| `description` | string | AI-generated segment description |
| `ocr_text` | string | Text extracted from video frames |
| `thumbnail_url` | string | S3 URL of thumbnail image |
| `source_video_url` | string | Original source video URL |
| `video_segment_url` | string | URL of this specific segment |
| `multimodal_extractor_v1_multimodal_embedding` | float[1408] | Visual/multimodal embedding |
| `multimodal_extractor_v1_transcription_embedding` | float[1024] | Transcription text embedding |
```json
{
  "start_time": 10.0,
  "end_time": 20.0,
  "start_frame": 300,
  "end_frame": 600,
  "fps": 30.0,
  "duration": 10.0,
  "transcription": "Welcome to today's lecture on machine learning fundamentals...",
  "description": "Instructor standing at whiteboard, introducing ML concepts",
  "ocr_text": "Machine Learning 101",
  "thumbnail_url": "s3://mixpeek-storage/ns_123/thumbnails/thumb_1.jpg",
  "multimodal_extractor_v1_multimodal_embedding": [0.023, -0.041, ...],
  "multimodal_extractor_v1_transcription_embedding": [0.018, -0.032, ...]
}
```
Parameters
Video Splitting
| Parameter | Type | Default | Description |
|---|---|---|---|
| `split_method` | string | `"time"` | Primary video splitting strategy: `time`, `scene`, or `silence` |
Split Methods
Fixed-interval splitting: splits the video into segments of equal duration.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `time_split_interval` | integer | 10 | Interval in seconds for each segment |
Characteristics:
Predictable segment count: video_duration / interval (e.g., a 10-minute video at the default 10-second interval yields 60 segments)
Consistent chunk sizes for uniform processing
May cut mid-sentence or mid-scene
Best for: general purpose, consistent chunking, when you need predictable segment counts.

```json
{
  "split_method": "time",
  "time_split_interval": 10
}
```
Visual change detection: splits the video when significant visual changes occur (shot changes, transitions).

| Parameter | Type | Default | Description |
|---|---|---|---|
| `scene_detection_threshold` | float | 0.5 | Sensitivity threshold (0.0-1.0) |
Threshold guide:
0.3 - High sensitivity, detects subtle changes (more segments)
0.5 - Balanced (default)
0.7 - Low sensitivity, only major scene changes (fewer segments)
Characteristics:
Variable segment count (typically 2-20 per minute)
Segments align with visual content boundaries
Better for content with distinct shots/scenes
Best for: movies, dynamic content, shot changes, music videos, advertisements.

```json
{
  "split_method": "scene",
  "scene_detection_threshold": 0.5
}
```
Audio pause detection: splits the video at moments of silence or low audio.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `silence_db_threshold` | integer | -40 | Decibel level below which audio is considered silent |
Threshold guide:
-50 dB - Detects very quiet moments (more segments)
-40 dB - Balanced (default)
-30 dB - Only detects near-silence (fewer segments)
Characteristics:
Variable segment count (typically 5-30 per minute)
Segments align with natural speech pauses
Preserves complete sentences/thoughts
Best for: lectures, presentations, conversations, podcasts, interviews.

```json
{
  "split_method": "silence",
  "silence_db_threshold": -40
}
```
Split Methods Comparison
| Method | Segments/Min | Predictability | Best For |
|---|---|---|---|
| `time` | 60 / interval_sec | High | General purpose, batch processing |
| `scene` | Variable (2-20) | Low | Movies, ads, dynamic visual content |
| `silence` | Variable (5-30) | Medium | Lectures, podcasts, spoken content |
Processing Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `run_transcription` | boolean | true | Run Whisper transcription on audio |
| `transcription_language` | string | `"en"` | Language for transcription |
| `run_transcription_embedding` | boolean | true | Generate embeddings for transcriptions |
| `run_multimodal_embedding` | boolean | true | Generate Vertex multimodal embeddings |
| `run_video_description` | boolean | false | Generate AI descriptions (adds 1-2 min) |
| `run_ocr` | boolean | false | Extract text from video frames |
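For example, a parameters block that transcribes Spanish audio while skipping the heavier visual analysis could look like this (a sketch using only the flags above; assuming the language code follows Whisper's ISO-639-1 convention):

```json
{
  "run_transcription": true,
  "transcription_language": "es",
  "run_transcription_embedding": true,
  "run_multimodal_embedding": true,
  "run_video_description": false,
  "run_ocr": false
}
```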
Thumbnail Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `enable_thumbnails` | boolean | true | Generate thumbnail images |
| `use_cdn` | boolean | false | Use CloudFront CDN for thumbnails |

CDN benefits: faster global delivery, permanent URLs, reduced bandwidth costs.
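To serve thumbnails through CloudFront, both flags are enabled together (a minimal sketch):

```json
{
  "enable_thumbnails": true,
  "use_cdn": true
}
```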
Embedding Task
When `run_transcription_embedding` is enabled, the E5 model generates text embeddings from transcribed audio. By default, these use `retrieval_document` for asymmetric search.
Set `embedding_task` at the collection level, not on the extractor. See Collection Embedding Task for full details and examples.
This only affects the E5 transcription embeddings. Vertex AI multimodal embeddings (`run_multimodal_embedding`) are not instruction-aware and ignore this parameter.
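Because the setting lives on the collection, `embedding_task` does not appear in the extractor's `parameters`. As a rough, hypothetical sketch (the field placement and `collection_name` key below are assumed; see the Collection Embedding Task page for the authoritative shape):

```json
{
  "collection_name": "lecture_segments",
  "embedding_task": "retrieval_document"
}
```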
Description Generation Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `description_prompt` | string | `"Describe the video segment in detail."` | Prompt for Gemini |
| `generation_config.temperature` | float | 0.7 | Randomness (higher = more creative) |
| `generation_config.max_output_tokens` | integer | 1024 | Maximum description length |
| `generation_config.top_p` | float | 0.8 | Nucleus sampling |
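Put together, a description configuration might look like this (a sketch; nesting `generation_config` as an object is inferred from the dotted parameter names above):

```json
{
  "run_video_description": true,
  "description_prompt": "Describe the video segment in detail.",
  "generation_config": {
    "temperature": 0.7,
    "max_output_tokens": 1024,
    "top_p": 0.8
  }
}
```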
| Parameter | Type | Default | Description |
|---|---|---|---|
| `response_shape` | string \| object | null | Custom structured output schema |
Natural Language Mode:
```json
{
  "response_shape": "Extract product names, colors, materials, and aesthetic style labels from this fashion segment"
}
```
JSON Schema Mode:
```json
{
  "response_shape": {
    "type": "object",
    "properties": {
      "products": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "name": { "type": "string" },
            "category": { "type": "string" },
            "visibility_percentage": { "type": "integer", "minimum": 0, "maximum": 100 }
          }
        }
      },
      "aesthetic": { "type": "string" }
    }
  }
}
```
Configuration Examples
Video with Time-Based Splitting
```json
{
  "feature_extractor": {
    "feature_extractor_name": "multimodal_extractor",
    "version": "v1",
    "input_mappings": {
      "video": "payload.video_url"
    },
    "field_passthrough": [
      { "source_path": "metadata.video_id" }
    ],
    "parameters": {
      "split_method": "time",
      "time_split_interval": 10,
      "run_transcription": true,
      "run_multimodal_embedding": true,
      "enable_thumbnails": true
    }
  }
}
```
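The remaining configurations follow the same `feature_extractor` shape; the sketches below vary only the input mapping and parameters, and mapping paths such as `payload.image_url` are illustrative rather than prescribed.

Video with Scene Detection

```json
{
  "feature_extractor": {
    "feature_extractor_name": "multimodal_extractor",
    "version": "v1",
    "input_mappings": { "video": "payload.video_url" },
    "parameters": {
      "split_method": "scene",
      "scene_detection_threshold": 0.5,
      "run_transcription": true,
      "run_multimodal_embedding": true
    }
  }
}
```

Lecture Video with Silence Splitting

```json
{
  "feature_extractor": {
    "feature_extractor_name": "multimodal_extractor",
    "version": "v1",
    "input_mappings": { "video": "payload.video_url" },
    "parameters": {
      "split_method": "silence",
      "silence_db_threshold": -40,
      "run_transcription": true,
      "run_transcription_embedding": true
    }
  }
}
```

Image Embedding

```json
{
  "feature_extractor": {
    "feature_extractor_name": "multimodal_extractor",
    "version": "v1",
    "input_mappings": { "image": "payload.image_url" },
    "parameters": {
      "run_multimodal_embedding": true,
      "enable_thumbnails": true
    }
  }
}
```

Text Embedding (Cross-Modal Search)

```json
{
  "feature_extractor": {
    "feature_extractor_name": "multimodal_extractor",
    "version": "v1",
    "input_mappings": { "text": "payload.description" },
    "parameters": {
      "run_multimodal_embedding": true
    }
  }
}
```

Full Extraction with All Features

```json
{
  "feature_extractor": {
    "feature_extractor_name": "multimodal_extractor",
    "version": "v1",
    "input_mappings": { "video": "payload.video_url" },
    "parameters": {
      "split_method": "scene",
      "scene_detection_threshold": 0.5,
      "run_transcription": true,
      "run_transcription_embedding": true,
      "run_multimodal_embedding": true,
      "run_video_description": true,
      "run_ocr": true,
      "enable_thumbnails": true,
      "use_cdn": true
    }
  }
}
```

A fashion/e-commerce setup with structured extraction combines the full-extraction parameters with the `response_shape` schema shown earlier under JSON Schema Mode.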
Processing Speed
| Content Type | Speed |
|---|---|
| Video | 0.5-2x realtime (depends on features enabled) |
| Image | < 1 second |
| Text | < 100ms |
Example: 10-minute video → 5-20 minutes processing time
| Feature | Latency per Segment |
|---|---|
| Transcription | ~200ms per second of audio |
| Visual embedding | ~50ms |
| OCR | ~300ms |
| Description | ~2s |
Cost Estimates (per minute of video)
| Configuration | Cost |
|---|---|
| Minimal (transcription + embeddings) | $0.01 |
| Standard (+ OCR) | $0.05 |
| Full (+ descriptions) | $0.15 |
Images: $0.001 per image. Text: $0.0001 per query.
Vector Indexes
Multimodal Embedding
| Property | Value |
|---|---|
| Index name | `multimodal_extractor_v1_multimodal_embedding` |
| Dimensions | 1408 |
| Type | Dense |
| Distance metric | Cosine |
| Inference model | `vertex_multimodal_embedding` |
| Supported inputs | video, text, image |
Transcription Embedding
| Property | Value |
|---|---|
| Index name | `multimodal_extractor_v1_transcription_embedding` |
| Dimensions | 1024 |
| Type | Dense |
| Distance metric | Cosine |
| Inference model | `multilingual_e5_large_instruct_v1` |
| Supported inputs | text, string |
Limitations
- Video duration: recommend < 2 hours for optimal processing
- Resolution: 8K+ videos should be downsampled first
- Real-time: not suitable for live streaming
- Short videos: clips under 5 seconds carry disproportionate overhead
- Audio quality: transcription accuracy depends on audio clarity
- OCR/Description: add significant processing time; enable only when needed
Collection-to-Collection Pipelines
The `video_segment_url` output enables decomposition chains:
1. Initial collection: time-based segments (5s intervals)
2. Downstream collection: scene detection within each segment
3. Final collection: enhanced processing with different models
```json
{
  "input_mappings": {
    "video": "video_segment_url"
  }
}
```
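A downstream extractor that re-segments each time-based chunk by scene might therefore look like this (a sketch; only the `video_segment_url` mapping is confirmed above, and the rest reuses parameters documented earlier):

```json
{
  "feature_extractor": {
    "feature_extractor_name": "multimodal_extractor",
    "version": "v1",
    "input_mappings": {
      "video": "video_segment_url"
    },
    "parameters": {
      "split_method": "scene",
      "scene_detection_threshold": 0.5
    }
  }
}
```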