

Audio extraction: audio files are split by silence, transcribed with Whisper, and embedded with E5-Large and Vertex AI
Mixpeek transcribes audio, generates embeddings, and extracts structured metadata from audio files. Each audio file is split into segments that become searchable documents with their own vector indexes, so you can search within audio at the sub-clip level.

What Gets Extracted

Feature | Model | Dimensions | Extractor
Audio transcript | Whisper | n/a | multimodal_extractor
Transcript embeddings | E5-Large | 1024D | multimodal_extractor
Multimodal audio embeddings | Vertex AI multimodal | 1408D | multimodal_extractor
Language detection | Whisper | n/a | multimodal_extractor
Speaker diarization (identifying and separating individual speakers) is on the roadmap. Today, transcriptions are returned as a single combined transcript per segment.

Choosing an Extractor

Goal | Extractor | Why
Transcribe and search spoken content | multimodal_extractor | Whisper transcription + E5-Large 1024D transcript embeddings in one pass
Cross-modal audio search (audio + video + text) | multimodal_extractor | Vertex AI 1408D unified embedding space across modalities
The multimodal_extractor handles audio natively. Audio files (MP3, WAV, FLAC, AAC, OGG) are routed through the same pipeline as video, with visual processing steps skipped automatically.

Create a Collection for Audio

This collection splits audio files by silence to preserve natural speech boundaries, transcribes each segment, and generates transcript embeddings.
curl -X POST https://api.mixpeek.com/v1/collections \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_name": "audio-library",
    "source": { "type": "bucket", "bucket_id": "bkt_audio" },
    "feature_extractor": {
      "feature_extractor_name": "multimodal_extractor",
      "version": "v1",
      "input_mappings": {
        "video": "payload.audio_url"
      },
      "field_passthrough": [
        { "source_path": "metadata.title" },
        { "source_path": "metadata.speaker" }
      ],
      "parameters": {
        "split_method": "silence",
        "silence_db_threshold": -40,
        "run_transcription": true,
        "run_transcription_embedding": true,
        "run_multimodal_embedding": false,
        "enable_thumbnails": false
      }
    }
  }'
Audio files use the video input mapping in multimodal_extractor. The pipeline detects the content type automatically and skips visual processing steps for audio-only files.
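To also enable cross-modal search over the Vertex AI embedding space (the second goal in the table above), flip run_multimodal_embedding at collection creation. A minimal sketch of the parameters block, otherwise identical to the example above:

  "parameters": {
    "split_method": "silence",
    "silence_db_threshold": -40,
    "run_transcription": true,
    "run_transcription_embedding": true,
    "run_multimodal_embedding": true,
    "enable_thumbnails": false
  }

With this enabled, each segment additionally carries the 1408D multimodal_extractor_v1_multimodal_embedding field shown in the output schema below.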

Search by Transcript

Create a retriever that targets transcript embeddings to find audio segments by what was said. A text query like “discussion about scaling infrastructure” finds segments where that topic is discussed.
curl -X POST https://api.mixpeek.com/v1/retrievers \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "retriever_name": "audio-transcript-search",
    "collection_ids": ["col_audio_library"],
    "input_schema": {
      "properties": {
        "query": { "type": "text", "required": true }
      }
    },
    "stages": [
      {
        "stage_name": "transcript_search",
        "stage_type": "filter",
        "config": {
          "stage_id": "feature_search",
          "parameters": {
            "feature_address": "mixpeek://multimodal_extractor@v1/transcription_embedding",
            "input_mapping": { "text": "query" },
            "query": "{{INPUT.query}}",
            "top_k": 20
          }
        }
      }
    ]
  }'
Execute the retriever (ret_abc123 is the ID returned when the retriever was created):
curl -X POST https://api.mixpeek.com/v1/retrievers/ret_abc123/execute \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": { "query": "discussion about scaling infrastructure" },
    "limit": 10
  }'
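To skim matching segments from the shell, pipe the execute response through jq. A sketch, assuming hits come back under a top-level results array (the envelope field name is an assumption; adjust to the actual response shape):

curl -s -X POST https://api.mixpeek.com/v1/retrievers/ret_abc123/execute \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{ "inputs": { "query": "discussion about scaling infrastructure" }, "limit": 10 }' \
  | jq -r '.results[] | "\(.start_time)s-\(.end_time)s  \(.transcription[0:80])"'

Each output line pairs a segment's time range with the first 80 characters of its transcript, using the fields documented in the output schema below.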

Output Schema

Each audio segment produces a document like this:
{
  "document_id": "doc_audio_001",
  "start_time": 142.5,
  "end_time": 168.3,
  "transcription": "The key challenge with scaling these models is memory bandwidth. You hit a wall at about 70 billion parameters unless you shard across nodes...",
  "source_audio_url": "s3://my-bucket/audio/podcast-ep-42.mp3",
  "metadata": {
    "title": "Scaling AI Infrastructure",
    "speaker": "Jane Smith"
  },
  "multimodal_extractor_v1_transcription_embedding": [0.018, -0.032, "...1024 floats"]
}
Field | Type | Description
start_time | number | Segment start in seconds
end_time | number | Segment end in seconds
transcription | string | Whisper-transcribed speech
source_audio_url | string | Original source audio file URL
metadata | object | Pass-through fields from the source document
multimodal_extractor_v1_transcription_embedding | float[1024] | E5-Large transcript embedding
multimodal_extractor_v1_multimodal_embedding | float[1408] | Vertex AI multimodal embedding (if enabled)
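Because every hit carries start_time and end_time, you can cut the matching sub-clip out of the source file once you have fetched it locally. A sketch with ffmpeg, using the example segment above (the local filename is an assumption):

ffmpeg -i podcast-ep-42.mp3 -ss 142.5 -to 168.3 -c copy segment.mp3

The -c copy flag avoids re-encoding and cuts to the nearest MP3 frame; drop it if you need a precise boundary at the cost of re-encoding.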

Related Pages

Multimodal Extractor: full parameter reference for audio processing
Retrievers: build search pipelines over transcript features
From Video: video extraction uses the same transcription pipeline
Text Extractor: re-embed transcripts with chunking for fine-grained retrieval