

Audio extraction: audio files are split by silence, transcribed with Whisper, and embedded with E5-Large and Vertex AI
Mixpeek transcribes audio, generates embeddings, and extracts structured metadata from audio files. Each audio file is split into segments that become searchable documents with their own vector indexes, so you can search within audio at the sub-clip level.

What Gets Extracted

Feature | Model | Dimensions | Extractor
Audio transcript | Whisper | n/a | multimodal_extractor
Transcript embeddings | E5-Large | 1024D | multimodal_extractor
Multimodal audio embeddings | Vertex AI multimodal | 1408D | multimodal_extractor
Language detection | Whisper | n/a | multimodal_extractor
Speaker diarization (identifying and separating individual speakers) is on the roadmap. Today, transcriptions are returned as a single combined transcript per segment.

Choosing an Extractor

Goal | Extractor | Why
Transcribe and search spoken content | multimodal_extractor | Whisper transcription + E5-Large 1024D transcript embeddings in one pass
Cross-modal audio search (audio + video + text) | multimodal_extractor | Vertex AI 1408D unified embedding space across modalities
The multimodal_extractor handles audio natively. Audio files (MP3, WAV, FLAC, AAC, OGG) are routed through the same pipeline as video, with visual processing steps skipped automatically.

Create a Collection for Audio

This collection splits audio files by silence to preserve natural speech boundaries, transcribes each segment, and generates transcript embeddings.
curl -X POST https://api.mixpeek.com/v1/collections \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_name": "audio-library",
    "source": { "type": "bucket", "bucket_id": "bkt_audio" },
    "feature_extractor": {
      "feature_extractor_name": "multimodal_extractor",
      "version": "v1",
      "input_mappings": {
        "video": "payload.audio_url"
      },
      "field_passthrough": [
        { "source_path": "metadata.title" },
        { "source_path": "metadata.speaker" }
      ],
      "parameters": {
        "split_method": "silence",
        "silence_db_threshold": -40,
        "run_transcription": true,
        "run_transcription_embedding": true,
        "run_multimodal_embedding": false,
        "enable_thumbnails": false
      }
    }
  }'
Audio files use the video input mapping in multimodal_extractor. The pipeline detects the content type automatically and skips visual processing steps for audio-only files.
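To also enable cross-modal search over the Vertex AI embedding space (the second goal in the table above), flip run_multimodal_embedding at collection creation. A minimal sketch of the parameters block, otherwise identical to the example above:

  "parameters": {
    "split_method": "silence",
    "silence_db_threshold": -40,
    "run_transcription": true,
    "run_transcription_embedding": true,
    "run_multimodal_embedding": true,
    "enable_thumbnails": false
  }

With this enabled, each segment additionally carries the 1408D multimodal_extractor_v1_multimodal_embedding field shown in the output schema below.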

Search by Transcript

Create a retriever that targets transcript embeddings to find audio segments by what was said. A text query like “discussion about scaling infrastructure” finds segments where that topic is discussed.
curl -X POST https://api.mixpeek.com/v1/retrievers \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "retriever_name": "audio-transcript-search",
    "collection_ids": ["col_audio_library"],
    "input_schema": {
      "properties": {
        "query": { "type": "text", "required": true }
      }
    },
    "stages": [
      {
        "stage_name": "transcript_search",
        "stage_type": "filter",
        "config": {
          "stage_id": "feature_search",
          "parameters": {
            "feature_address": "mixpeek://multimodal_extractor@v1/transcription_embedding",
            "input_mapping": { "text": "query" },
            "query": "{{INPUT.query}}",
            "top_k": 20
          }
        }
      }
    ]
  }'
Execute the retriever (ret_abc123 is the ID returned when the retriever was created):
curl -X POST https://api.mixpeek.com/v1/retrievers/ret_abc123/execute \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": { "query": "discussion about scaling infrastructure" },
    "limit": 10
  }'
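To skim matching segments from the shell, pipe the execute response through jq. A sketch, assuming hits come back under a top-level results array (the envelope field name is an assumption; adjust to the actual response shape):

curl -s -X POST https://api.mixpeek.com/v1/retrievers/ret_abc123/execute \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{ "inputs": { "query": "discussion about scaling infrastructure" }, "limit": 10 }' \
  | jq -r '.results[] | "\(.start_time)s-\(.end_time)s  \(.transcription[0:80])"'

Each output line pairs a segment's time range with the first 80 characters of its transcript, using the fields documented in the output schema below.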

Output Schema

Each audio segment produces a document like this:
{
  "document_id": "doc_audio_001",
  "start_time": 142.5,
  "end_time": 168.3,
  "transcription": "The key challenge with scaling these models is memory bandwidth. You hit a wall at about 70 billion parameters unless you shard across nodes...",
  "source_audio_url": "s3://my-bucket/audio/podcast-ep-42.mp3",
  "metadata": {
    "title": "Scaling AI Infrastructure",
    "speaker": "Jane Smith"
  },
  "multimodal_extractor_v1_transcription_embedding": [0.018, -0.032, "...1024 floats"]
}
Field | Type | Description
start_time | number | Segment start in seconds
end_time | number | Segment end in seconds
transcription | string | Whisper-transcribed speech
source_audio_url | string | Original source audio file URL
metadata | object | Pass-through fields from the source document
multimodal_extractor_v1_transcription_embedding | float[1024] | E5-Large transcript embedding
multimodal_extractor_v1_multimodal_embedding | float[1408] | Vertex AI multimodal embedding (if enabled)
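Because every hit carries start_time and end_time, you can cut the matching sub-clip out of the source file once you have fetched it locally. A sketch with ffmpeg, using the example segment above (the local filename is an assumption):

ffmpeg -i podcast-ep-42.mp3 -ss 142.5 -to 168.3 -c copy segment.mp3

The -c copy flag avoids re-encoding and cuts to the nearest MP3 frame; drop it if you need a precise boundary at the cost of re-encoding.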

Related Pages

Multimodal Extractor: full parameter reference for audio processing
Retrievers: build search pipelines over transcript features
From Video: video extraction uses the same transcription pipeline
Text Extractor: re-embed transcripts with chunking for fine-grained retrieval