View extractor details at api.mixpeek.com/v1/collections/features/extractors/audio_sentiment_extractor_v1 or fetch programmatically with
GET /v1/collections/features/extractors/{feature_extractor_id}.

Pipeline Steps
- Filter Dataset (if collection_id provided)
  - Filter to specified collection
- Apply Input Mappings
  - Resolve audio/video field from source (e.g., payload.audio_url, payload.webcast_url)
- Audio Extraction (conditional: if video input)
  - FFmpeg strips audio track from MP4/MOV; supports AAC, MP3, FLAC output
- Voice Activity Detection + Segmentation
  - split_method: time (fixed-length windows, default 30s)
  - split_method: silence (split at natural speech pauses; VAD threshold configurable)
  - split_method: speaker (one segment per speaker turn; requires run_diarization=true)
- Speaker Diarization (conditional: if run_diarization=true)
  - pyannote.audio 3.x pipeline separates speakers (CEO, CFO, Analyst_1, etc.)
  - Assigns speaker_id and optionally maps to speaker_role via role manifest
- Transcription (conditional: if run_transcription=true)
  - Whisper large-v3-turbo speech-to-text with financial vocabulary prompt
  - Per-segment timestamps aligned to diarization boundaries
- FinBERT Text Sentiment (conditional: if run_finbert=true)
  - ProsusAI/FinBERT financial-domain sentiment classifier
  - Outputs sentiment_label (positive/negative/neutral), sentiment_score (-1 to +1), confidence
  - Generates 768D FinBERT CLS embedding for semantic search
- Prosodic Feature Extraction (conditional: if run_prosodics=true)
  - LibROSA + Parselmouth extract 5 features per segment:
    - Pitch variability (F0 standard deviation, Hz): hesitation and stress indicator
    - Speech rate (words per minute): confidence and urgency signal
    - Vocal energy (RMS dB): assertiveness and emotional weight
    - Pause ratio (fraction of silence): cognitive load and evasiveness marker
    - Vocal tremor (jitter + shimmer): anxiety and deception indicator
  - Normalized into a 128D prosodic embedding for similarity search
- Audio-Text Alignment Score (conditional: if run_alignment=true)
  - Cosine similarity between FinBERT sentiment direction and prosodic valence
  - Low alignment = voice contradicts words (high-value deception/stress signal)
- LLM Structured Enrichment (conditional: if run_llm_enrichment=true or response_shape set)
  - Gemini/GPT-4o processes transcription with custom prompt
  - Extracts structured signals: guidance confidence, topic classification, hedging language
- Output
  - Per-segment documents with both embedding types, raw prosodic features, sentiment scores, speaker metadata, and computed alpha signals
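
The conditional steps above can be read as a small planning function. The following is a minimal sketch, not the extractor's actual implementation: `plan_steps` and the step labels are hypothetical names chosen for illustration, while the flag names and defaults are taken from the parameter tables on this page.

```python
from typing import Any, Optional

# Defaults copied from the parameter tables on this page.
DEFAULTS: dict[str, Any] = {
    "split_method": "silence",
    "run_diarization": False,
    "run_transcription": True,
    "run_finbert": True,
    "run_prosodics": True,
    "run_alignment": True,
    "run_llm_enrichment": False,
    "response_shape": None,
}


def plan_steps(params: Optional[dict[str, Any]] = None, is_video: bool = False) -> list[str]:
    """Return the pipeline steps that would run, in documented order (hypothetical helper)."""
    cfg = {**DEFAULTS, **(params or {})}
    method = cfg["split_method"]
    if method not in ("time", "silence", "speaker"):
        raise ValueError(f"unknown split_method: {method!r}")
    # Speaker-turn segmentation depends on diarization output.
    if method == "speaker" and not cfg["run_diarization"]:
        raise ValueError("split_method='speaker' requires run_diarization=true")

    steps: list[str] = []
    if cfg.get("collection_id"):
        steps.append("filter_dataset")
    steps.append("apply_input_mappings")
    if is_video:
        steps.append("audio_extraction")
    steps.append(f"segmentation:{method}")

    flag_to_step = [
        ("run_diarization", "speaker_diarization"),
        ("run_transcription", "transcription"),
        ("run_finbert", "finbert_sentiment"),
        ("run_prosodics", "prosodic_features"),
        ("run_alignment", "alignment_score"),
    ]
    steps += [step for flag, step in flag_to_step if cfg[flag]]

    # LLM enrichment runs when the flag is set or a response_shape is supplied.
    if cfg["run_llm_enrichment"] or cfg["response_shape"] is not None:
        steps.append("llm_enrichment")
    steps.append("output")
    return steps
```

Calling `plan_steps()` with no arguments lists the steps implied by the documented defaults; passing `{"split_method": "speaker"}` without enabling diarization raises an error, mirroring the requirement stated in the segmentation step.
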
When to Use
| Use Case | Description |
|---|---|
| Earnings call analysis | Process quarterly earnings calls for CEO/CFO vocal stress relative to guidance |
| Forward guidance scoring | Score management’s confidence level on forward-looking statements |
| Analyst day processing | Build speaker-attributed sentiment timelines across presentations |
| Fed press conference monitoring | Track FOMC chair vocal markers around policy language |
| Alpha signal generation | Combine text + prosodic features for uncrowded quantitative factors |
| Sentiment divergence detection | Flag calls where management tone contradicts word sentiment |
| Comparative speaker analysis | Track individual executive vocal patterns across multiple quarters |
| Podcast / financial media | Extract sentiment from analyst interviews, TV appearances, podcasts |
When NOT to Use
| Scenario | Recommended Alternative |
|---|---|
| Text-only documents (PDFs, filings) | document_extractor or text_extractor |
| Very short clips (< 10 seconds) | None; processing overhead is disproportionate |
| Non-speech audio (music, noise) | multimodal_extractor |
| Real-time live streaming | Specialized streaming extractors |
| Non-English earnings calls | Set transcription_language explicitly and disable run_finbert; FinBERT is English-only |
Supported Input Types
| Input | Type | Description | Processing |
|---|---|---|---|
| audio | string | URL or S3 path to MP3, WAV, FLAC, M4A | Direct processing |
| video | string | URL or S3 path to MP4, MOV, MKV | Audio extracted via FFmpeg |
| url | string | Direct URL to webcast / podcast stream | Downloaded and processed |
Input Schema
Provide one of the following inputs:

| Field | Type | Description |
|---|---|---|
| audio | string | URL/S3 path to audio file. Recommended: < 3 hours per file |
| video | string | URL/S3 path to video file; audio track extracted automatically |
| url | string | Direct stream URL; downloaded before processing |
Output Schema
Each audio segment produces one document:

| Field | Type | Description |
|---|---|---|
| start_time | number | Segment start time in seconds |
| end_time | number | Segment end time in seconds |
| speaker_id | string | Diarized speaker label (e.g., SPEAKER_00) |
| speaker_role | string | Mapped role if manifest provided (e.g., CEO, CFO, Analyst) |
| transcription | string | Whisper transcription of segment |
| sentiment_label | string | positive, negative, or neutral |
| sentiment_score | number | FinBERT sentiment score: -1.0 (negative) to +1.0 (positive) |
| sentiment_confidence | number | FinBERT confidence 0.0–1.0 |
| pitch_variability_hz | number | F0 standard deviation in Hz (stress/hesitation) |
| speech_rate_wpm | number | Words per minute (confidence/urgency) |
| vocal_energy_db | number | RMS energy in dB (assertiveness) |
| pause_ratio | number | Fraction of silence 0.0–1.0 (cognitive load) |
| vocal_tremor | number | Jitter + shimmer composite 0.0–1.0 (anxiety) |
| audio_text_alignment | number | Prosody-sentiment cosine alignment -1.0 to +1.0 |
| stress_index | number | Composite vocal stress score 0.0–1.0 |
| audio_sentiment_extractor_v1_text_embedding | float[768] | FinBERT CLS embedding |
| audio_sentiment_extractor_v1_prosodic_embedding | float[128] | Normalized prosodic feature vector |
Parameters
Audio Segmentation
| Parameter | Type | Default | Description |
|---|---|---|---|
| split_method | string | "silence" | Segmentation strategy: time, silence, or speaker |

split_method: time
Fixed-interval splitting: equal-duration segments regardless of speech content.
Best for: batch processing, predictable segment counts, initial exploration

| Parameter | Type | Default | Description |
|---|---|---|---|
| time_split_interval | integer | 30 | Segment duration in seconds |
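
To illustrate fixed-interval splitting, the sketch below computes segment boundaries for a given audio duration. `time_windows` is a hypothetical helper, and how the extractor treats a shorter final window (keep, pad, or merge) is not stated on this page; this version simply keeps it.

```python
def time_windows(duration_s: float, interval_s: int = 30) -> list[tuple[float, float]]:
    """Split [0, duration_s) into consecutive windows of interval_s seconds."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    windows: list[tuple[float, float]] = []
    start = 0.0
    while start < duration_s:
        # The last window is truncated at the end of the audio (assumption).
        end = min(start + interval_s, duration_s)
        windows.append((start, end))
        start = end
    return windows


# A 95-second clip with the default 30-second interval yields four segments,
# the last one covering only 90s to 95s.
print(time_windows(95))
```
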
Feature Extraction Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| run_transcription | boolean | true | Run Whisper transcription |
| transcription_language | string | "en" | Language code for transcription |
| transcription_prompt | string | null | Domain vocabulary hint (e.g., ticker symbols, product names) |
| run_finbert | boolean | true | Run FinBERT financial sentiment classification |
| run_prosodics | boolean | true | Extract 5 prosodic features + 128D embedding |
| run_alignment | boolean | true | Compute audio-text alignment score |
| run_diarization | boolean | false | Run speaker diarization (adds ~20% processing time) |
| num_speakers | integer | null | Hint for diarization (null = auto-detect) |
Speaker Role Manifest
Map diarized speaker IDs to roles (CEO, CFO, Analyst, etc.) using a manifest. If speaker_role_manifest is not provided, roles are labeled SPEAKER_00, SPEAKER_01, etc.
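
The manifest format is not shown on this page, so the sketch below assumes the simplest possible shape: a flat mapping from diarized speaker ID to role. Both the mapping shape and the `apply_role_manifest` helper are assumptions made for illustration.

```python
from typing import Any, Optional

# Hypothetical manifest shape: diarized speaker ID -> role label.
speaker_role_manifest = {
    "SPEAKER_00": "CEO",
    "SPEAKER_01": "CFO",
}


def apply_role_manifest(
    segments: list[dict[str, Any]],
    manifest: Optional[dict[str, str]] = None,
) -> list[dict[str, Any]]:
    """Attach speaker_role to each segment, falling back to the raw speaker_id."""
    manifest = manifest or {}
    labeled = []
    for seg in segments:
        speaker_id = seg["speaker_id"]
        # Without a manifest entry, the role stays SPEAKER_00, SPEAKER_01, etc.
        labeled.append({**seg, "speaker_role": manifest.get(speaker_id, speaker_id)})
    return labeled


segments = [{"speaker_id": "SPEAKER_00"}, {"speaker_id": "SPEAKER_02"}]
print(apply_role_manifest(segments, speaker_role_manifest))
```
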
LLM Structured Extraction
| Parameter | Type | Default | Description |
|---|---|---|---|
| run_llm_enrichment | boolean | false | Run LLM over transcription segments |
| response_shape | string or object | null | Custom structured output schema |
Configuration Examples
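
As a hedged illustration, the parameter sets below mirror the three tiers in the cost table (minimal, standard, full) using only parameter names documented above. How these parameters are attached to a collection request is not shown on this page, so the sketch stops at the parameter dictionaries; the constant names are hypothetical.

```python
import json

# Minimal: transcription + FinBERT only.
MINIMAL = {
    "split_method": "silence",
    "run_transcription": True,
    "run_finbert": True,
    "run_prosodics": False,
    "run_alignment": False,
}

# Standard: adds prosodic features and the audio-text alignment score.
STANDARD = {**MINIMAL, "run_prosodics": True, "run_alignment": True}

# Full: adds diarization (required for speaker-turn splitting) and LLM enrichment.
FULL = {
    **STANDARD,
    "split_method": "speaker",
    "run_diarization": True,
    "num_speakers": None,  # null = auto-detect
    "run_llm_enrichment": True,
}

print(json.dumps(FULL, indent=2))
```
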
Performance & Costs
Processing Speed
| Configuration | Speed | Example |
|---|---|---|
| Transcription only | ~0.5x realtime | 60-min call → ~30 min |
| + FinBERT | ~0.6x realtime | 60-min call → ~36 min |
| + Prosodics | ~0.8x realtime | 60-min call → ~48 min |
| + Diarization | ~1.5x realtime | 60-min call → ~90 min |
| Full pipeline | ~1.5x realtime | 60-min call → ~90 min |
| Feature | Per-Segment Latency |
|---|---|
| Transcription (Whisper) | ~150ms/sec of audio |
| FinBERT classification | ~25ms |
| Prosodic extraction | ~50ms |
| Speaker diarization | ~200ms/sec of audio |
| LLM enrichment | ~1.5s |
Cost Estimates (per hour of audio)
| Configuration | Cost |
|---|---|
| Minimal (transcription + FinBERT) | $0.08 |
| Standard (+ prosodics + alignment) | $0.15 |
| Full (+ diarization + LLM enrichment) | $0.25 |
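
The speed and cost tables can be turned into a quick back-of-the-envelope estimate. The sketch below simply multiplies audio length by the approximate figures quoted above; the function names and dictionary keys are hypothetical, and actual throughput and billing may differ.

```python
# Approximate processing time as a multiple of audio duration (from the speed table).
SPEED_FACTOR = {
    "transcription_only": 0.5,
    "plus_finbert": 0.6,
    "plus_prosodics": 0.8,
    "plus_diarization": 1.5,
    "full_pipeline": 1.5,
}

# Approximate cost in USD per hour of audio (from the cost table).
COST_PER_AUDIO_HOUR = {"minimal": 0.08, "standard": 0.15, "full": 0.25}


def estimate_processing_minutes(audio_minutes: float, configuration: str) -> float:
    """Estimated wall-clock processing time in minutes."""
    return audio_minutes * SPEED_FACTOR[configuration]


def estimate_cost_usd(audio_minutes: float, tier: str) -> float:
    """Estimated cost in USD for the given amount of audio."""
    return audio_minutes / 60 * COST_PER_AUDIO_HOUR[tier]


# A 60-minute call through the full pipeline.
print(estimate_processing_minutes(60, "full_pipeline"), estimate_cost_usd(60, "full"))
```
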
Vector Indexes
Text Embedding (FinBERT)
| Property | Value |
|---|---|
| Index name | audio_sentiment_extractor_v1_text_embedding |
| Dimensions | 768 |
| Type | Dense |
| Distance metric | Cosine |
| Inference model | finbert_sentiment_v1 |
| Supported inputs | text (transcription segments) |
Prosodic Embedding
| Property | Value |
|---|---|
| Index name | audio_sentiment_extractor_v1_prosodic_embedding |
| Dimensions | 128 |
| Type | Dense |
| Distance metric | Cosine |
| Inference model | prosodic_encoder_v1 |
| Supported inputs | audio segments |
Alpha Signal Guide
This section describes the five core prosodic features and their interpretation as quantitative signals.
1. Pitch Variability (F0 Standard Deviation)
What it measures: Standard deviation of the fundamental frequency (F0) in Hz across the segment.
Signal interpretation:
- High variability (> 50 Hz): Elevated emotional engagement; can indicate stress or enthusiasm
- Low variability (< 15 Hz): Monotone delivery; associated with rehearsed/scripted language or disengagement
- Baseline deviation: Compare against the speaker’s historical mean F0 std dev for true anomaly detection
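
A minimal sketch of the interpretation above, assuming a per-frame F0 track in Hz where 0 marks unvoiced frames (a common convention, not confirmed by this page). The function names are hypothetical, and the sample standard deviation is used because the page does not say whether sample or population deviation applies.

```python
import statistics


def pitch_variability_hz(f0_hz: list[float]) -> float:
    """Standard deviation of F0 over voiced frames (0 = unvoiced, by assumption)."""
    voiced = [f for f in f0_hz if f > 0]
    if len(voiced) < 2:
        return 0.0
    return statistics.stdev(voiced)


def classify_pitch_variability(std_hz: float) -> str:
    """Apply the thresholds quoted above: > 50 Hz high, < 15 Hz low."""
    if std_hz > 50:
        return "high"
    if std_hz < 15:
        return "low"
    return "typical"


def baseline_z(value: float, history: list[float]) -> float:
    """Z-score of a segment's value against the speaker's own history."""
    sigma = statistics.stdev(history)
    if sigma == 0:
        return 0.0
    return (value - statistics.mean(history)) / sigma


std = pitch_variability_hz([100, 0, 120, 110])
print(std, classify_pitch_variability(std))
```

The baseline comparison matters because absolute thresholds ignore how a given speaker normally sounds; `baseline_z` expresses the same measurement relative to that speaker's prior segments.
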
2. Speech Rate (Words Per Minute)
What it measures: Words per minute derived from Whisper word-level timestamps.
Signal interpretation:
- High rate (> 180 wpm): Urgency, anxiety, or over-rehearsed scripted answers
- Low rate (< 100 wpm): Deliberate, careful language; common when discussing negative surprises
- Rate deceleration mid-answer: Suggests real-time reasoning, less scripted — higher authenticity signal
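
The sketch below shows one way to derive speech rate and mid-answer deceleration from word-level timestamps. It assumes each word is a (start, end) pair in seconds and counts a word in the window where it starts; both helpers are hypothetical and the extractor's exact windowing is not documented here.

```python
def speech_rate_wpm(
    word_times: list[tuple[float, float]], start_time: float, end_time: float
) -> float:
    """Words per minute for words starting inside [start_time, end_time)."""
    duration = end_time - start_time
    if duration <= 0:
        raise ValueError("end_time must be greater than start_time")
    n_words = sum(1 for start, _end in word_times if start_time <= start < end_time)
    return n_words / duration * 60.0


def rate_change_wpm(
    word_times: list[tuple[float, float]], start_time: float, end_time: float
) -> float:
    """Second-half rate minus first-half rate; negative values mean deceleration."""
    mid = (start_time + end_time) / 2
    first = speech_rate_wpm(word_times, start_time, mid)
    second = speech_rate_wpm(word_times, mid, end_time)
    return second - first


# 50 words in the first 10 seconds, then 10 words in the next 10 seconds.
words = [(i * 0.2, i * 0.2 + 0.1) for i in range(50)]
words += [(10 + i, 10 + i + 0.5) for i in range(10)]
print(speech_rate_wpm(words, 0, 20), rate_change_wpm(words, 0, 20))
```
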
3. Vocal Energy (RMS dB)
What it measures: Root mean square energy of the audio signal in decibels.
Signal interpretation:
- High energy: Assertiveness and confidence; common in positive guidance delivery
- Energy drop mid-sentence: Hedging or trailing off; linguistic uncertainty
- Segment-relative drop: Cross-call energy tracking shows conviction level
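
For reference, RMS energy in decibels can be computed as below. This sketch assumes samples normalized to [-1, 1] so that the result is in dB relative to full scale; the page does not state the reference level the extractor uses, and the helper names are hypothetical.

```python
import math


def rms_db(samples: list[float], eps: float = 1e-10) -> float:
    """RMS level in dB relative to full scale (equivalent to 20*log10 of the RMS)."""
    if not samples:
        raise ValueError("samples must not be empty")
    mean_square = sum(x * x for x in samples) / len(samples)
    # eps avoids log10(0) on digital silence.
    return 10.0 * math.log10(mean_square + eps)


def energy_drop_db(samples: list[float]) -> float:
    """Second-half level minus first-half level; negative means trailing off."""
    half = len(samples) // 2
    return rms_db(samples[half:]) - rms_db(samples[:half])


# A sine tone whose amplitude halves midway drops by about 6 dB.
tone = [math.sin(2 * math.pi * k / 100) for k in range(500)]
fading = tone + [0.5 * x for x in tone]
print(round(rms_db(tone), 2), round(energy_drop_db(fading), 2))
```
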
4. Pause Ratio (Silence Fraction)
What it measures: Fraction of segment duration classified as silence (VAD threshold -40 dB).
Signal interpretation:
- High pause ratio (> 0.35): Cognitive load; speaker is reasoning in real time rather than reciting
- Low pause ratio (< 0.10): Scripted, rehearsed delivery — less information content
- Q&A vs. prepared remarks delta: A large increase in pause ratio during Q&A is a well-documented stress marker
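
A simplified energy-based version of the pause ratio is sketched below. It stands in for the extractor's VAD, which is not described in detail here: the frame length, the dB-relative-to-full-scale reference, and the `pause_ratio` helper are all assumptions; only the -40 dB threshold comes from the text above.

```python
import math


def pause_ratio(
    samples: list[float], frame_len: int = 400, threshold_db: float = -40.0
) -> float:
    """Fraction of fixed-length frames whose RMS level falls below threshold_db."""
    if not samples:
        raise ValueError("samples must not be empty")
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    silent = 0
    for frame in frames:
        mean_square = sum(x * x for x in frame) / len(frame)
        level_db = 10.0 * math.log10(mean_square + 1e-12)
        if level_db < threshold_db:
            silent += 1
    return silent / len(frames)


# Three frames of tone followed by one frame of digital silence.
tone = [0.5 * math.sin(2 * math.pi * k / 40) for k in range(1200)]
audio = tone + [0.0] * 400
print(pause_ratio(audio))
```
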
5. Audio-Text Alignment Score
What it measures: Cosine similarity between the FinBERT sentiment direction (text) and prosodic valence (audio). Range: -1.0 to +1.0.
Signal interpretation:
- High alignment (> 0.6): Voice and words agree — higher conviction, less masking
- Low alignment (0.1–0.4): Moderate divergence — common in hedged language
- Negative alignment (< 0): Voice contradicts words — strongest stress/deception marker; e.g., “We feel very good about guidance” delivered with high pitch variability, low energy, and high pauses
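
The sketch below shows a plain cosine similarity and a bucketing function that follows the thresholds listed above. How the sentiment direction and prosodic valence vectors are constructed is not specified on this page, so the inputs here are generic vectors; scores that fall between the published bands (0 to 0.1 and 0.4 to 0.6) are returned as "unlabeled" rather than guessed. Both function names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


def alignment_bucket(score: float) -> str:
    """Map an audio_text_alignment score to the interpretation bands above."""
    if score < 0:
        return "negative"
    if score > 0.6:
        return "high"
    if 0.1 <= score <= 0.4:
        return "low"
    return "unlabeled"  # outside the bands documented on this page


print(alignment_bucket(cosine_similarity([1.0, 0.0], [-1.0, 0.0])))
```
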
Composite Stress Index
The stress_index field (0.0–1.0) is a normalized composite of all five prosodic features. Each feature enters as a Z-score (the _z values), computed against the speaker’s rolling 4-quarter baseline when speaker_id is consistent across calls.
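
To make the idea of a Z-score composite concrete, here is a hedged sketch. It is not the extractor's formula: the equal weights, the sign of each feature's contribution, and the logistic squashing to the 0.0–1.0 range are all placeholder assumptions, and `stress_index_sketch` is a hypothetical name. The minimum of four baseline values follows the baseline requirement noted under Limitations.

```python
import math
import statistics
from typing import Optional

FEATURES = (
    "pitch_variability_hz",
    "speech_rate_wpm",
    "vocal_energy_db",
    "pause_ratio",
    "vocal_tremor",
)


def feature_z_scores(segment: dict[str, float], baseline: dict[str, list[float]]) -> dict[str, float]:
    """Z-score each feature against the same speaker's historical values."""
    z = {}
    for name in FEATURES:
        history = baseline[name]
        if len(history) < 4:
            raise ValueError(f"need at least 4 baseline values for {name}")
        sigma = statistics.stdev(history)
        z[name] = 0.0 if sigma == 0 else (segment[name] - statistics.mean(history)) / sigma
    return z


def stress_index_sketch(
    segment: dict[str, float],
    baseline: dict[str, list[float]],
    weights: Optional[dict[str, float]] = None,
) -> float:
    """Weighted sum of Z-scores squashed to (0, 1); weights are placeholders."""
    weights = weights or {name: 1.0 / len(FEATURES) for name in FEATURES}
    z = feature_z_scores(segment, baseline)
    combined = sum(weights[name] * z[name] for name in FEATURES)
    return 1.0 / (1.0 + math.exp(-combined))
```

With these assumptions, a segment that sits exactly at the speaker's baseline scores 0.5, and segments that deviate upward on every feature approach 1.0. A production weighting would need to be fitted or taken from the extractor's own definition.
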
Recommended Factor Construction
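
One possible way to roll per-segment documents up into call-level factors is sketched below. It is illustrative only: the three factors are drawn from signals discussed in the Alpha Signal Guide (Q&A pause-ratio delta, negative alignment, stress index), but the `call_factors` helper, the factor names, and the caller-supplied `qa_start_s` boundary are assumptions rather than part of the extractor's output.

```python
from statistics import mean
from typing import Any


def call_factors(segments: list[dict[str, Any]], qa_start_s: float, role: str = "CEO") -> dict[str, float]:
    """Aggregate one speaker role's segments into call-level factors."""
    spoken = [s for s in segments if s.get("speaker_role") == role]
    prepared = [s for s in spoken if s["start_time"] < qa_start_s]
    qa = [s for s in spoken if s["start_time"] >= qa_start_s]
    if not prepared or not qa:
        raise ValueError("need segments in both prepared remarks and Q&A")
    return {
        # Increase in pause ratio from prepared remarks to Q&A.
        "pause_ratio_qa_delta": mean(s["pause_ratio"] for s in qa)
        - mean(s["pause_ratio"] for s in prepared),
        # Share of segments where voice contradicts words.
        "negative_alignment_share": sum(1 for s in spoken if s["audio_text_alignment"] < 0)
        / len(spoken),
        "mean_stress_index": mean(s["stress_index"] for s in spoken),
    }


example = [
    {"speaker_role": "CEO", "start_time": 60.0, "pause_ratio": 0.1,
     "audio_text_alignment": 0.7, "stress_index": 0.2},
    {"speaker_role": "CEO", "start_time": 1900.0, "pause_ratio": 0.4,
     "audio_text_alignment": -0.2, "stress_index": 0.6},
    {"speaker_role": "Analyst", "start_time": 1880.0, "pause_ratio": 0.2,
     "audio_text_alignment": 0.5, "stress_index": 0.3},
]
print(call_factors(example, qa_start_s=1800.0))
```
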
Limitations
- Speaker diarization accuracy: pyannote reaches about 90% accuracy (roughly 10% diarization error rate, DER) on clean 2-speaker recordings; accuracy degrades with > 8 speakers or poor audio quality
- Non-English: Whisper transcription supports 99 languages; FinBERT is English-only. For non-English calls, disable run_finbert and use multilingual sentiment models
- Audio quality: Prosodic features require 16kHz+ audio; compressed phone audio (8kHz) reduces pitch extraction accuracy by ~30%
- Baseline dependency: stress_index Z-score normalization requires at least 4 prior segments from the same speaker_id to be meaningful
- Segment length: Prosodic features are unreliable for segments < 5 seconds; short interjections are best excluded
- LLM enrichment latency: run_llm_enrichment=true adds 1–2s per segment; disable for batch throughput

