What Gets Extracted
| Feature | Model | Dimensions | Extractor |
|---|---|---|---|
| Audio transcript | Whisper | — | multimodal_extractor |
| Transcript embeddings | E5-Large | 1024D | multimodal_extractor |
| Multimodal audio embeddings | Vertex AI multimodal | 1408D | multimodal_extractor |
| Language detection | Whisper | — | multimodal_extractor |
Choosing an Extractor
| Goal | Extractor | Why |
|---|---|---|
| Transcribe and search spoken content | multimodal_extractor | Whisper transcription + E5-Large 1024D transcript embeddings in one pass |
| Cross-modal audio search (audio + video + text) | multimodal_extractor | Vertex AI 1408D unified embedding space across modalities |
The multimodal_extractor handles audio natively. Audio files (MP3, WAV, FLAC, AAC, OGG) are routed through the same pipeline as video, with visual processing steps skipped automatically.
Create a Collection for Audio
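As a rough sketch, a collection definition for audio could be assembled as shown here. Only the extractor name `multimodal_extractor` and the `video` input mapping come from this page. The helper `build_audio_collection_config`, the source field `audio_url`, and every other key (`collection_name`, `feature_extractor_name`, `split_method`, `run_transcription`, `run_transcription_embedding`, `run_multimodal_embedding`) are assumed names used for illustration, so check the Multimodal Extractor reference before relying on them. The snippet only builds and prints a payload; it does not call the API.

```python
import json


def build_audio_collection_config(name: str, audio_field: str) -> dict:
    """Build an illustrative collection payload for audio files.

    All key names are assumptions except the extractor name and the
    "video" input mapping, which this page documents.
    """
    return {
        "collection_name": name,  # assumed key name
        "feature_extractor": {
            "feature_extractor_name": "multimodal_extractor",
            # Audio files are mapped through the "video" input.
            "input_mappings": {"video": audio_field},
            "parameters": {
                "split_method": "silence",            # assumed: split on pauses
                "run_transcription": True,            # assumed: Whisper transcript
                "run_transcription_embedding": True,  # assumed: E5-Large 1024D
                "run_multimodal_embedding": False,    # assumed: optional 1408D
            },
        },
    }


# "podcast-audio" and "audio_url" are placeholder values.
config = build_audio_collection_config("podcast-audio", "audio_url")
print(json.dumps(config, indent=2))
```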
A collection configured for audio transcribes the files, generates transcript embeddings, and splits on silence to preserve natural speech boundaries.
Audio files use the video input mapping in multimodal_extractor. The pipeline detects the content type automatically and skips visual processing steps for audio-only files.
Search by Transcript
Create a retriever that targets transcript embeddings to find audio segments by what was said. A text query like “discussion about scaling infrastructure” finds segments where that topic is discussed.
Output Schema
Each audio segment produces a document with the following fields:
| Field | Type | Description |
|---|---|---|
| start_time | number | Segment start in seconds |
| end_time | number | Segment end in seconds |
| transcription | string | Whisper-transcribed speech |
| source_audio_url | string | Original source audio file URL |
| metadata | object | Pass-through fields from the source document |
| multimodal_extractor_v1_transcription_embedding | float[1024] | E5-Large transcript embedding |
| multimodal_extractor_v1_multimodal_embedding | float[1408] | Vertex AI multimodal embedding (if enabled) |
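To make the schema concrete, the sketch below models a segment document in Python and ranks a few segments against a query vector by cosine similarity. It illustrates how the fields fit together on the client side and is not how a retriever is implemented. The names `AudioSegment`, `rank_segments`, and `format_span` are hypothetical, the sample documents are made up, and the embeddings are toy 3-dimensional vectors standing in for the real 1024-dimensional ones.

```python
import math
from typing import List, Tuple, TypedDict

EMBEDDING_FIELD = "multimodal_extractor_v1_transcription_embedding"


class AudioSegment(TypedDict):
    """Subset of the segment fields listed in the table above."""
    start_time: float
    end_time: float
    transcription: str
    source_audio_url: str
    metadata: dict
    multimodal_extractor_v1_transcription_embedding: List[float]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_segments(
    query_embedding: List[float], segments: List[AudioSegment], top_k: int = 3
) -> List[Tuple[float, AudioSegment]]:
    """Return the top_k (score, segment) pairs, best match first."""
    scored = [
        (cosine_similarity(query_embedding, seg[EMBEDDING_FIELD]), seg)
        for seg in segments
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def format_span(segment: AudioSegment) -> str:
    """Render the segment's time range, e.g. for a search result listing."""
    return f"{segment['start_time']:.1f}s-{segment['end_time']:.1f}s"


# Made-up documents with toy embeddings.
segments: List[AudioSegment] = [
    {
        "start_time": 0.0,
        "end_time": 12.5,
        "transcription": "Welcome to the show.",
        "source_audio_url": "https://example.com/episode.mp3",
        "metadata": {},
        EMBEDDING_FIELD: [1.0, 0.0, 0.0],
    },
    {
        "start_time": 12.5,
        "end_time": 40.0,
        "transcription": "We need to talk about scaling infrastructure.",
        "source_audio_url": "https://example.com/episode.mp3",
        "metadata": {},
        EMBEDDING_FIELD: [0.0, 1.0, 0.0],
    },
]

query_embedding = [0.1, 0.9, 0.0]  # stand-in for an embedded text query
ranked = rank_segments(query_embedding, segments, top_k=1)
best_score, best_segment = ranked[0]
print(format_span(best_segment), best_segment["transcription"])
```

Because each document carries `start_time` and `end_time`, a match can be linked directly to the right offset in `source_audio_url`.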
Related
Multimodal Extractor
Full parameter reference for audio processing
Retrievers
Build search pipelines over transcript features
From Video
Video extraction uses the same transcription pipeline
Text Extractor
Re-embed transcripts with chunking for fine-grained retrieval

