    HF · Speaker Diarization · MIT

    speaker-diarization-3.1

    by pyannote

    Who spoke when, end-to-end neural speaker diarization

    10.9M downloads/month
    1,730 likes
    18M params
    Identifiers
    Model ID
    pyannote/speaker-diarization-3.1
    Feature URI
    mixpeek://transcription@v1/pyannote_diarization_v3

    Overview

    Pyannote's speaker diarization pipeline segments audio into speaker-homogeneous regions, determining "who spoke when" without requiring prior knowledge of the number or identity of speakers.

    On Mixpeek, speaker diarization enriches transcription data with speaker labels, enabling queries like "find all segments where Speaker A talks about budgets."
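A minimal sketch of how speaker labels could be attached to transcript segments and then queried. The `SpeakerTurn`, `TranscriptSegment`, and `LabeledSegment` shapes and both function names are hypothetical illustrations, not Mixpeek's response schema; each transcript segment simply takes the speaker whose turns overlap it the most.

```typescript
// Hypothetical shapes for illustration only (not Mixpeek's actual schema).
interface SpeakerTurn {
  speaker: string; // e.g. "SPEAKER_00"
  start: number;   // seconds
  end: number;     // seconds
}

interface TranscriptSegment {
  text: string;
  start: number; // seconds
  end: number;   // seconds
}

interface LabeledSegment extends TranscriptSegment {
  speaker: string | null; // null when no speaker turn overlaps the segment
}

// Length of the intersection of two time intervals, in seconds (0 if disjoint).
function overlapSeconds(aStart: number, aEnd: number, bStart: number, bEnd: number): number {
  return Math.max(0, Math.min(aEnd, bEnd) - Math.max(aStart, bStart));
}

// Label each transcript segment with the speaker whose turns overlap it the most.
function labelTranscript(transcript: TranscriptSegment[], turns: SpeakerTurn[]): LabeledSegment[] {
  return transcript.map((seg) => {
    const totals: Record<string, number> = {};
    for (const turn of turns) {
      const o = overlapSeconds(seg.start, seg.end, turn.start, turn.end);
      if (o > 0) totals[turn.speaker] = (totals[turn.speaker] ?? 0) + o;
    }
    let best: string | null = null;
    let bestOverlap = 0;
    for (const speaker of Object.keys(totals)) {
      if (totals[speaker] > bestOverlap) {
        best = speaker;
        bestOverlap = totals[speaker];
      }
    }
    return { ...seg, speaker: best };
  });
}

// Return the segments spoken by `speaker` whose text contains `keyword` (case-insensitive).
function findBySpeaker(segments: LabeledSegment[], speaker: string, keyword: string): LabeledSegment[] {
  const needle = keyword.toLowerCase();
  return segments.filter(
    (s) => s.speaker === speaker && s.text.toLowerCase().indexOf(needle) !== -1
  );
}
```

With segments labeled this way, a query such as "Speaker A talks about budgets" reduces to `findBySpeaker(labeled, "SPEAKER_00", "budget")`. A production system would more likely align at the word level rather than the segment level.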

    Architecture

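    Three-stage pipeline: (1) a segmentation model based on PyanNet (SincNet + LSTM + feedforward layers) that detects local speaker activity, (2) speaker embedding extraction with a WeSpeaker ResNet34 model, and (3) agglomerative clustering that groups the embeddings into global speaker labels. The segmentation stage also detects overlapping speech.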
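The clustering stage can be illustrated with a small sketch of agglomerative clustering over speaker embeddings. This is a simplified illustration, not pyannote's implementation: it assumes cosine distance with average linkage, and the `threshold` stopping value is a hypothetical parameter chosen by the caller.

```typescript
// Cosine distance between two equal-length, non-zero vectors.
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Average pairwise distance between the members of two clusters (average linkage).
function averageLinkage(c1: number[], c2: number[], embeddings: number[][]): number {
  let sum = 0;
  for (const i of c1) {
    for (const j of c2) {
      sum += cosineDistance(embeddings[i], embeddings[j]);
    }
  }
  return sum / (c1.length * c2.length);
}

// Repeatedly merge the two closest clusters until the closest pair is farther
// apart than `threshold`. Returns one cluster label per input embedding.
// The number of speakers is not given up front; it falls out of the threshold.
function agglomerativeCluster(embeddings: number[][], threshold: number): number[] {
  const clusters: number[][] = embeddings.map((_, i) => [i]);
  while (clusters.length > 1) {
    let bestA = -1;
    let bestB = -1;
    let bestDist = Infinity;
    for (let a = 0; a < clusters.length; a++) {
      for (let b = a + 1; b < clusters.length; b++) {
        const d = averageLinkage(clusters[a], clusters[b], embeddings);
        if (d < bestDist) {
          bestDist = d;
          bestA = a;
          bestB = b;
        }
      }
    }
    if (bestDist > threshold) break; // remaining clusters are distinct speakers
    clusters[bestA] = clusters[bestA].concat(clusters[bestB]);
    clusters.splice(bestB, 1);
  }
  const labels: number[] = embeddings.map(() => -1);
  clusters.forEach((members, label) => {
    for (const i of members) labels[i] = label;
  });
  return labels;
}
```

Because merging stops at a distance threshold rather than at a fixed cluster count, the speaker count is estimated from the data, which is why the pipeline does not need to be told how many speakers to expect.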

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    const mx = new Mixpeek({ apiKey: "API_KEY" });

    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/meeting.mp4" },
      feature_extractors: [{
        name: "speaker_diarization",
        version: "v1",
        params: {
          model_id: "pyannote/speaker-diarization-3.1"
        }
      }]
    });

    Capabilities

    • Automatic speaker count estimation
    • Overlapping speech detection
    • Speaker embedding extraction
    • Fine-tunable on custom speaker data
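To make the overlapping-speech capability concrete, the sketch below derives the regions where two or more speakers are active from a list of diarized turns. The `DiarizedTurn` shape and `overlapRegions` function are hypothetical names for illustration; the sketch assumes a single speaker never has two turns that overlap each other.

```typescript
// Hypothetical turn shape for illustration only.
interface DiarizedTurn {
  speaker: string;
  start: number; // seconds
  end: number;   // seconds
}

interface TimeRegion {
  start: number;
  end: number;
}

// Sweep over turn boundaries and emit every region with at least two active turns.
function overlapRegions(turns: DiarizedTurn[]): TimeRegion[] {
  const events: Array<{ time: number; delta: number }> = [];
  for (const t of turns) {
    events.push({ time: t.start, delta: 1 });
    events.push({ time: t.end, delta: -1 });
  }
  // At equal timestamps, process ends (-1) before starts (+1) so that
  // back-to-back turns are not reported as overlapping.
  events.sort((a, b) => a.time - b.time || a.delta - b.delta);

  const regions: TimeRegion[] = [];
  let active = 0;
  let regionStart = 0;
  for (const e of events) {
    const before = active;
    active += e.delta;
    if (before < 2 && active >= 2) {
      regionStart = e.time; // overlap begins
    } else if (before >= 2 && active < 2 && e.time > regionStart) {
      regions.push({ start: regionStart, end: e.time }); // overlap ends
    }
  }
  return regions;
}
```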

    Use Cases on Mixpeek

    • Meeting transcription with speaker attribution
    • Interview and podcast analysis: attribute quotes to speakers
    • Call center analytics: separate agent and customer speech
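As an illustration of the call center case, the sketch below totals talk time per speaker from diarized turns. The `CallTurn` shape, the function name, and the speaker labels are assumptions made for the example, not part of any Mixpeek API.

```typescript
// Hypothetical turn shape for illustration only.
interface CallTurn {
  speaker: string; // e.g. "agent" or "customer" after mapping diarization labels
  start: number;   // seconds
  end: number;     // seconds
}

// Sum the duration of every turn per speaker, in seconds.
function talkTimeBySpeaker(turns: CallTurn[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const t of turns) {
    totals[t.speaker] = (totals[t.speaker] ?? 0) + (t.end - t.start);
  }
  return totals;
}
```

Diarization labels are anonymous (for example `SPEAKER_00`), so mapping them to roles such as agent and customer is a separate step.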

    Benchmarks

    Dataset        Metric  Score  Source
    AMI (headset)  DER     18.2%  Plaquet & Bredin, 2023, Table 2
    DIHARD III     DER     20.5%  Plaquet & Bredin, 2023, Table 2
    VoxConverse    DER     11.2%  Plaquet & Bredin, 2023, Table 2
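DER (diarization error rate) is conventionally the sum of false alarm, missed detection, and speaker confusion durations divided by the total duration of reference speech. The sketch below computes it from those four durations; the interface and function names are illustrative, and evaluation details such as forgiveness collars or overlap handling vary by benchmark and are not modeled here.

```typescript
// Durations in seconds. Names are illustrative, not from any specific library.
interface DerComponents {
  falseAlarm: number;      // non-speech labeled as speech
  missedDetection: number; // speech that was not detected
  confusion: number;       // speech attributed to the wrong speaker
  totalSpeech: number;     // total reference speech duration
}

// DER = (false alarm + missed detection + confusion) / total reference speech.
// Lower is better; the value can exceed 1 when errors outweigh reference speech.
function diarizationErrorRate(c: DerComponents): number {
  if (c.totalSpeech <= 0) {
    throw new Error("totalSpeech must be positive");
  }
  return (c.falseAlarm + c.missedDetection + c.confusion) / c.totalSpeech;
}
```

For example, with made-up durations of 20 s false alarm, 30 s missed detection, and 17.2 s confusion over 600 s of reference speech, the function returns roughly 0.112, i.e. a DER of about 11.2%.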

    Performance

    Input Size: variable audio length
    GPU Latency: ~1.2 s / 60 s audio (A100)
    GPU Throughput: ~50× realtime (A100)
    GPU Memory: ~0.8 GB
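The latency and throughput figures are two views of the same ratio: 60 s of audio processed in about 1.2 s is about 50× realtime. The sketch below shows that arithmetic and a rough processing-time estimate; it assumes throughput scales linearly with audio length and ignores model loading and batching effects, so treat the result as a ballpark only.

```typescript
// Speedup over realtime: seconds of audio processed per second of compute.
function realtimeSpeedup(audioSeconds: number, processingSeconds: number): number {
  return audioSeconds / processingSeconds;
}

// Rough processing time for a file, assuming throughput scales linearly
// with audio length (an assumption, not a measured guarantee).
function estimateProcessingSeconds(audioSeconds: number, speedup: number): number {
  return audioSeconds / speedup;
}
```

Under these assumptions, `realtimeSpeedup(60, 1.2)` is about 50, and a one-hour recording (`estimateProcessingSeconds(3600, 50)`) would take about 72 s on the same hardware.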

    Specification

    Framework: HF
    Organization: pyannote
    Feature: Speaker Diarization
    Output: speaker segments
    Modalities: video, audio
    Retriever: Speaker Filter
    Parameters: 18M
    License: MIT
    Downloads/mo: 10.9M
    Likes: 1,730

    Research Paper

    Powerset multi-class cross entropy loss for neural speaker diarization

    arxiv.org

    Build a pipeline with speaker-diarization-3.1

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
