speaker-diarization-3.1
by pyannote
Who spoke when — end-to-end neural speaker diarization
Overview
Pyannote's speaker diarization pipeline segments audio into speaker-homogeneous regions, determining "who spoke when" without requiring prior knowledge of the number or identity of speakers.
On Mixpeek, speaker diarization enriches transcription data with speaker labels, enabling queries like "find all segments where Speaker A talks about budgets."
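A query like that can be served by filtering enriched segments once speaker labels are attached. A minimal client-side sketch, assuming a hypothetical segment shape (field names are illustrative, not the exact Mixpeek schema):

```typescript
// Hypothetical document shape: a transcript segment enriched with a
// diarization speaker label. Illustrative only, not the Mixpeek schema.
interface EnrichedSegment {
  start: number; // seconds
  end: number;
  speaker: string; // e.g. "SPEAKER_00"
  text: string;
}

// Return segments where a given speaker mentions a keyword.
function segmentsBySpeakerAndKeyword(
  segments: EnrichedSegment[],
  speaker: string,
  keyword: string
): EnrichedSegment[] {
  const kw = keyword.toLowerCase();
  return segments.filter(
    (s) => s.speaker === speaker && s.text.toLowerCase().includes(kw)
  );
}
```

In practice the filter would run as a retrieval-stage predicate rather than in application code, but the segment-level join of "who" and "what" is the same.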
Architecture
End-to-end neural pipeline in three stages: (1) local segmentation with a PyanNet-based model (SincNet features, LSTM layers, and a feed-forward classifier), which also detects overlapping speech; (2) speaker-embedding extraction for each detected local speaker (the 3.x releases use a WeSpeaker ResNet34 embedding model in place of the earlier ECAPA-TDNN); (3) agglomerative clustering of the embeddings to assign global speaker labels.
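The clustering stage can be illustrated with a simplified average-linkage agglomerative clustering over cosine distances. This is a toy stand-in for the tuned clustering in the real pipeline, with a made-up distance threshold:

```typescript
type Embedding = number[];

// Cosine distance: 0 for identical directions, up to 2 for opposite ones.
function cosineDistance(a: Embedding, b: Embedding): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Average-linkage agglomerative clustering: repeatedly merge the two
// closest clusters until no pair is closer than `threshold`.
// Returns a cluster id (speaker label) for each embedding.
function cluster(embeddings: Embedding[], threshold: number): number[] {
  let clusters: number[][] = embeddings.map((_, i) => [i]); // member indices
  const dist = (c1: number[], c2: number[]): number => {
    let sum = 0;
    for (const i of c1)
      for (const j of c2) sum += cosineDistance(embeddings[i], embeddings[j]);
    return sum / (c1.length * c2.length);
  };
  while (clusters.length > 1) {
    let best: [number, number] = [0, 1];
    let bestD = Infinity;
    for (let i = 0; i < clusters.length; i++)
      for (let j = i + 1; j < clusters.length; j++) {
        const d = dist(clusters[i], clusters[j]);
        if (d < bestD) { bestD = d; best = [i, j]; }
      }
    if (bestD > threshold) break; // remaining clusters are distinct speakers
    clusters[best[0]] = clusters[best[0]].concat(clusters[best[1]]);
    clusters.splice(best[1], 1);
  }
  const labels = new Array(embeddings.length).fill(0);
  clusters.forEach((members, id) => members.forEach((m) => (labels[m] = id)));
  return labels;
}
```

Because merging stops at a distance threshold rather than a preset cluster count, the number of speakers falls out of the data, which is how the pipeline avoids needing the speaker count in advance.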
Mixpeek SDK Integration
```typescript
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Ingest a video and run speaker diarization as a feature extractor.
await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/meeting.mp4" },
  feature_extractors: [{
    name: "speaker_diarization",
    version: "v1",
    params: {
      model_id: "pyannote/speaker-diarization-3.1"
    }
  }]
});
```

Capabilities
- Automatic speaker count estimation
- Overlapping speech detection
- Speaker embedding extraction
- Fine-tunable on custom speaker data
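Overlapping speech detection can also be applied downstream: once diarization turns carry start/end times and speaker labels, overlapped regions fall out of a sweep over turn boundaries. A minimal sketch, assuming a `Turn` shape with times in seconds (not an official Mixpeek or pyannote type):

```typescript
// Illustrative turn shape: one speaker's contiguous speech region.
interface Turn {
  start: number; // seconds
  end: number;
  speaker: string;
}

// Sweep over turn boundaries, counting active speakers; emit every
// interval during which two or more speakers are active at once.
function overlapRegions(turns: Turn[]): [number, number][] {
  const events: [number, number][] = [];
  for (const t of turns) {
    events.push([t.start, +1]);
    events.push([t.end, -1]);
  }
  // Sort by time; at equal times, process ends before starts so that
  // back-to-back turns do not register a phantom overlap.
  events.sort((a, b) => a[0] - b[0] || a[1] - b[1]);
  const out: [number, number][] = [];
  let active = 0;
  let overlapStart = 0;
  for (const [time, delta] of events) {
    if (active >= 2 && active + delta < 2 && time > overlapStart) {
      out.push([overlapStart, time]); // overlap just ended
    }
    if (active < 2 && active + delta >= 2) overlapStart = time;
    active += delta;
  }
  return out;
}
```

The same sweep generalizes to "how many speakers at time t", which is what the segmentation model predicts frame by frame.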
Research Paper
Powerset multi-class cross entropy loss for neural speaker diarization
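The cited paper reframes diarization training: instead of one independent binary activity label per speaker, each frame gets a single class drawn from the powerset of speaker subsets up to a maximum overlap, so overlapping speech becomes an ordinary class. A toy sketch of that class inventory (speaker names and limits are illustrative, not the paper's exact configuration):

```typescript
// Enumerate powerset classes: the empty set (silence) plus every
// subset of speakers of size <= maxSimultaneous.
function powersetClasses(numSpeakers: number, maxSimultaneous: number): string[] {
  const classes: string[] = ["silence"];
  const speakers = Array.from({ length: numSpeakers }, (_, i) => `spk${i}`);
  const recurse = (start: number, current: string[]): void => {
    if (current.length > 0) classes.push(current.join("+"));
    if (current.length === maxSimultaneous) return;
    for (let i = start; i < speakers.length; i++) {
      recurse(i + 1, [...current, speakers[i]]);
    }
  };
  recurse(0, []);
  return classes;
}

// 3 speakers with at most 2 active at once:
// silence, 3 singletons, 3 pairs -> 7 classes in total.
```

A standard multi-class cross entropy over these classes then trains the segmentation model, with no per-speaker thresholding needed at inference.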
Build a pipeline with speaker-diarization-3.1
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.