    Audio Intelligence

    Audio Search API for Speech, Music, and Podcasts

    Part of the multimodal data warehouse. Make every second of audio searchable: transcribe speech, identify speakers, detect audio events, and retrieve precise moments from millions of hours of recordings with composable retriever pipelines.

    What is Audio Search?

    Audio search transforms unstructured audio recordings into structured, searchable data. Instead of listening through hours of content, audio search lets you query recordings the same way you query a database -- by keyword, speaker, topic, or acoustic similarity.

    The Problem with Audio Data

    Without Audio Search

    Audio is a black box. Finding a specific moment means scrubbing through recordings manually. Call centers generate thousands of hours of calls that no one can review. Podcast archives grow endlessly with no way to surface relevant content.

    With Mixpeek Audio Search

    Every recording is automatically transcribed, speaker-tagged, and indexed. Search by topic, speaker, timestamp, or acoustic similarity across your entire audio library in milliseconds.

    Beyond Simple Transcription

    Transcription Alone

    Traditional speech-to-text gives you a wall of text with no structure. No speaker labels, no semantic understanding, no ability to search by audio characteristics like tone or background sounds.

    Multimodal Audio Intelligence

    Mixpeek extracts multiple feature layers -- transcripts with speaker diarization, audio embeddings for similarity search, event classifications, and rich metadata -- all indexed into a unified namespace.

    Production Infrastructure

    DIY Audio Pipelines

    Building audio search from scratch means stitching together ASR APIs, embedding models, vector databases, and search logic. Scaling to millions of hours requires significant GPU infrastructure.

    Managed Audio Search Platform

    Mixpeek handles the entire pipeline -- feature extraction on auto-scaling Ray GPU clusters, indexing into Qdrant, and composable retriever pipelines for search. It's managed infrastructure, not a framework.

    Audio Search Capabilities

    From speech transcription to acoustic fingerprinting -- every tool you need to make audio content searchable and actionable.

    Speech-to-Text Search

    Transcribe spoken audio into searchable text using production-grade ASR models. Index every word, timestamp, and speaker turn for precise retrieval across hours of recordings.

    • Multi-language automatic speech recognition
    • Word-level timestamps for precise segment retrieval
    • Speaker diarization to attribute speech to individuals
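
    To make the word-level timestamp idea concrete, here is a minimal sketch of phrase lookup over a word-timestamped transcript. The segment structure below is invented for illustration and is not real Mixpeek output:

```python
# Toy word-timestamped transcript, the kind of structure ASR with
# word timestamps and diarization produces (illustrative data only).
transcript = [
    {"word": "machine", "start": 12.4, "end": 12.9, "speaker": "spk_0"},
    {"word": "learning", "start": 12.9, "end": 13.4, "speaker": "spk_0"},
    {"word": "in", "start": 13.4, "end": 13.5, "speaker": "spk_0"},
    {"word": "healthcare", "start": 13.5, "end": 14.2, "speaker": "spk_0"},
]

def find_phrase(words, phrase):
    """Return (start, end, speaker) of the first occurrence of a phrase."""
    tokens = phrase.lower().split()
    for i in range(len(words) - len(tokens) + 1):
        window = words[i:i + len(tokens)]
        if [w["word"].lower() for w in window] == tokens:
            return window[0]["start"], window[-1]["end"], window[0]["speaker"]
    return None

print(find_phrase(transcript, "machine learning"))  # (12.4, 13.4, 'spk_0')
```

    Because every word carries its own timestamps, a matched phrase maps directly back to a playable segment of the original recording.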

    Speaker Identification

    Detect and distinguish individual speakers within audio recordings. Cluster voice embeddings to track speakers across files, enabling search by speaker identity.

    • Voice embedding extraction per speaker
    • Cross-file speaker clustering and matching
    • Speaker timeline segmentation
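
    The idea behind cross-file speaker matching can be sketched as comparing voice embeddings by cosine similarity. The vectors below are toy 3-dimensional examples (real speaker embeddings are typically hundreds of dimensions), and Mixpeek's internal matching logic may differ:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def match_speaker(query, enrolled, threshold=0.75):
    """Return (best speaker, score), or (None, score) if below threshold."""
    best_name, best_score = None, -1.0
    for name, emb in enrolled.items():
        score = cosine_similarity(query, emb)
        if score > best_score:
            best_name, best_score = name, score
    return (best_name, best_score) if best_score >= threshold else (None, best_score)

# Toy enrolled voice embeddings (illustrative only)
enrolled = {
    "alice": [0.9, 0.1, 0.0],
    "bob": [0.1, 0.9, 0.1],
}
name, score = match_speaker([0.85, 0.15, 0.05], enrolled)
print(name)  # alice
```

    The same comparison, applied pairwise, is what lets unknown speakers be clustered across files before anyone labels them.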

    Audio Event Detection

    Go beyond speech. Detect non-speech audio events such as alarms, applause, music, machinery sounds, and environmental noise to enable rich audio content understanding.

    • Non-speech sound classification
    • Temporal event localization with timestamps
    • Custom event model training support
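
    Since detected events are localized with timestamps, filtering by event type reduces to a scan over labeled segments. The segment metadata below is a made-up illustration of that shape, not Mixpeek's actual schema:

```python
# Toy index of audio segments with event labels and timestamps,
# mirroring the kind of metadata an event extractor might attach.
segments = [
    {"start": 0.0, "end": 4.2, "events": ["speech"]},
    {"start": 4.2, "end": 6.0, "events": ["applause"]},
    {"start": 6.0, "end": 31.5, "events": ["speech", "music"]},
    {"start": 31.5, "end": 33.0, "events": ["alarm"]},
]

def find_events(segments, label):
    """Return (start, end) pairs for segments containing a given event label."""
    return [(s["start"], s["end"]) for s in segments if label in s["events"]]

print(find_events(segments, "music"))  # [(6.0, 31.5)]
```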

    Music Recognition and Similarity

    Extract audio fingerprints and embeddings from music content. Search by melody similarity, genre, tempo, or mood across large music catalogs.

    • Audio fingerprinting for track identification
    • Embedding-based similarity search
    • Genre, tempo, and mood feature extraction
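
    To illustrate the fingerprinting idea (as distinct from embedding similarity): quantize a feature vector into a hashable key, then look tracks up by exact match. Production systems hash constellations of spectrogram peaks rather than whole vectors; this toy version only conveys the principle:

```python
# Toy audio fingerprinting: quantize features into a hashable key,
# then identify tracks by exact fingerprint lookup (illustrative only).
def fingerprint(features, step=0.25):
    return tuple(round(x / step) for x in features)

tracks = {
    "track_a": [0.91, 0.20, 0.10],
    "track_b": [0.12, 0.80, 0.33],
}
index = {fingerprint(feats): name for name, feats in tracks.items()}

# A slightly noisy re-recording of track_a still quantizes to the same key.
query = [0.88, 0.22, 0.08]
print(index.get(fingerprint(query)))  # track_a
```

    Quantization is what makes the lookup robust to small recording differences while staying an O(1) dictionary hit.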

    How Audio Search Works

    From raw audio files to searchable indexes in four steps. No infrastructure to manage, no models to deploy.

    1. Ingest Audio Files

    Upload audio files to a Mixpeek bucket from S3, GCS, Azure Blob, or direct upload. Supported formats include MP3, WAV, FLAC, AAC, OGG, and M4A.

    2. Feature Extraction

    Mixpeek collections trigger audio feature extractors on Ray GPU clusters. Extractors run ASR transcription, speaker diarization, audio event classification, and embedding generation in parallel.

    3. Indexing into Namespaces

    Extracted features -- transcripts, speaker embeddings, event labels, and audio embeddings -- are stored as documents in Qdrant namespaces with full metadata payloads.

    4. Search with Retrievers

    Build composable retriever pipelines to query audio content. Combine text search over transcripts, vector similarity over audio embeddings, metadata filtering by speaker or event type, and reranking stages.

    Audio Search Use Cases

    From podcast platforms to enterprise call centers -- audio search unlocks value from recordings that were previously impossible to navigate.

    Podcast Search and Discovery

    Make every word in a podcast catalog searchable. Transcribe episodes, index speaker segments, and let users find exact moments by topic, speaker, or quoted phrase. Power discovery with semantic search across thousands of hours of content.

    Call Center Analytics

    Index customer support calls at scale. Search by topic, sentiment, speaker, or specific phrases across millions of call recordings. Identify patterns, compliance issues, and training opportunities without manual review.

    Audio Content Moderation

    Detect prohibited content in audio at scale. Flag specific speech patterns, toxic language, copyrighted music, or policy-violating audio events. Integrate moderation into ingestion pipelines for real-time enforcement.

    Media Archive Search

    Unlock decades of audio archives. Retroactively process legacy recordings -- radio broadcasts, oral histories, meeting recordings -- into searchable indexes. Enable researchers and journalists to find specific moments across massive collections.

    Mixpeek Audio Search vs. Alternatives

    See how Mixpeek compares to dedicated transcription services for audio search use cases.

    Feature | Mixpeek | Deepgram | AssemblyAI | Google Speech
    Modality Support | Audio, video, image, text, PDF (unified) | Audio and video (speech only) | Audio and video (speech only) | Audio (speech only)
    Search Infrastructure | Built-in vector + keyword hybrid search | Transcript text only (BYO search) | Transcript text only (BYO search) | No search (transcription only)
    Audio Event Detection | Built-in (speech + non-speech events) | Limited (speech-focused) | Audio intelligence features | Not supported
    Speaker Embeddings | Searchable vector embeddings per speaker | Diarization labels only | Diarization labels only | Diarization labels only
    Retriever Pipelines | Composable multi-stage pipelines | Not available | Not available | Not available
    Deployment Options | Managed, Dedicated, BYO Cloud | Managed SaaS or on-prem | Managed SaaS only | Google Cloud only

    Build Audio Search in Minutes

    A simple Python API to ingest, index, and search audio content with full-featured retriever pipelines.

    audio_search.py
    from mixpeek import Mixpeek
    
    client = Mixpeek(api_key="YOUR_API_KEY")
    
    # Create a collection with audio feature extractors
    collection = client.collections.create(
        name="podcast-archive",
        namespace="podcasts",
        extractors=[
            {
                "type": "audio_transcription",
                "model": "whisper-large-v3",
                "config": {
                    "language": "en",
                    "diarize": True,
                    "word_timestamps": True
                }
            },
            {
                "type": "audio_embedding",
                "model": "clap-large",
                "config": {
                    "segment_duration": 30
                }
            }
        ]
    )
    
    # Upload audio files to trigger processing
    client.buckets.upload(
        bucket="my-bucket",
        files=["episode_042.mp3", "episode_043.mp3"],
        collection=collection.id
    )
    
    # Search across transcripts and audio embeddings
    results = client.retrievers.execute(
        namespace="podcasts",
        stages=[
            {
                "type": "feature_search",
                "method": "hybrid",
                "query": {
                    "text": "discussion about machine learning in healthcare",
                    "modalities": ["text", "audio"]
                },
                "limit": 20
            },
            {
                "type": "filter",
                "conditions": {
                    "metadata.speaker_count": {"$gte": 2}
                }
            },
            {
                "type": "rerank",
                "model": "cross-encoder",
                "limit": 5
            }
        ]
    )
    
    for result in results:
        print(f"Episode: {result.metadata['filename']}")
        print(f"Timestamp: {result.metadata['start_time']}s")
        print(f"Speaker: {result.metadata['speaker']}")
        print(f"Transcript: {result.content}")

    Frequently Asked Questions

    What is an audio search API?

    An audio search API enables developers to index and search audio content programmatically. Instead of manually listening to recordings, you can transcribe speech, extract audio features, and build searchable indexes that support text queries over transcripts, speaker-based filtering, audio similarity search, and event detection. Mixpeek provides a complete audio search infrastructure that handles transcription, embedding generation, indexing, and retrieval in a single platform.

    How does speech-to-text search work in Mixpeek?

    Mixpeek uses production-grade ASR (automatic speech recognition) models like Whisper to transcribe audio into text with word-level timestamps and speaker labels. The transcriptions are indexed into Qdrant namespaces alongside audio embeddings. When you search, Mixpeek's retriever pipelines combine keyword matching over transcripts with vector similarity over audio embeddings, returning precise audio segments with timestamps.
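
    One common way to fuse a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). The answer above doesn't specify Mixpeek's exact fusion method, so this is only a sketch of the hybrid idea:

```python
# Reciprocal rank fusion: each list contributes 1/(k + rank) per document,
# so items ranked well by both keyword and vector search rise to the top.
def rrf(rankings, k=60):
    """Fuse ranked lists of segment ids; higher fused score ranks first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["seg_3", "seg_7", "seg_1"]  # ranked by text match over transcripts
vector_hits = ["seg_7", "seg_2", "seg_3"]   # ranked by audio embedding similarity
print(rrf([keyword_hits, vector_hits]))  # seg_7 first, then seg_3
```

    seg_7 wins because it appears near the top of both lists, which is exactly the behavior a hybrid search wants.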

    What audio formats does Mixpeek support?

    Mixpeek supports all common audio formats including MP3, WAV, FLAC, AAC, OGG, M4A, WMA, and AIFF. Audio files are uploaded to Mixpeek buckets and processed by feature extractors running on Ray GPU clusters. The platform handles format conversion, sample rate normalization, and channel mixing automatically during ingestion.

    Can I search for specific speakers across audio files?

    Yes. Mixpeek extracts speaker embeddings during ingestion using speaker diarization models. These embeddings are stored as searchable vectors in your namespace. You can find all segments spoken by a specific person across your entire audio collection, cluster unknown speakers, and filter search results by speaker identity using retriever pipeline stages.

    How is Mixpeek different from Deepgram or AssemblyAI?

    Deepgram and AssemblyAI are transcription services -- they convert speech to text but leave search infrastructure to you. Mixpeek is a complete search platform: it handles transcription, audio embedding generation, vector indexing, and composable retriever pipelines out of the box. Mixpeek also supports multimodal search, so you can search across audio, video, images, and documents in a single query.

    Does Mixpeek support real-time audio processing?

    Mixpeek supports both real-time and batch audio processing. For real-time use cases, collection triggers automatically process new audio files as they arrive in your bucket. For batch processing, you can submit large volumes of audio files and Mixpeek's Ray clusters scale horizontally to process them in parallel with progress tracking.

    Can I detect non-speech audio events?

    Yes. Beyond speech transcription, Mixpeek's audio feature extractors can classify non-speech audio events such as music, alarms, applause, machinery sounds, environmental noise, and more. These events are indexed as metadata on audio segments, enabling you to filter and search by event type in your retriever pipelines.

    How does audio search integrate with video search in Mixpeek?

    Mixpeek treats audio and video as complementary modalities within a unified namespace. When you process a video file, Mixpeek extracts both visual features (frames, objects, scenes) and audio features (transcripts, speaker embeddings, audio events) and indexes them together. A single retriever query can search across both the audio track and visual content of your video collection simultaneously.

    Make Your Audio Content Searchable

    Stop losing insights in hours of recordings. Build production audio search with managed infrastructure that scales from thousands to millions of hours.