# Audio Sentiment Extractor

> Vocal intelligence for financial earnings calls: FinBERT text sentiment, prosodic audio features, and speaker diarization for quantitative alternative data

Runnable reference for every built-in Mixpeek extractor: inputs, parameters, output fields, embedding models, and copy-paste examples. Auto-generated from the live registry, so it always matches production.

Audio sentiment extractor pipeline showing speaker diarization, parallel FinBERT and prosodic feature extraction, and output alpha signals

The audio sentiment extractor processes **earnings call recordings, analyst day presentations, Fed press conferences, and financial podcasts** to produce two parallel signal streams: FinBERT financial-domain text sentiment (768D) from Whisper transcription, and a 5-feature prosodic vector (128D) capturing vocal stress, hesitation, and deception markers. Speaker diarization separates management from analysts for role-attributed sentiment.

This extractor addresses the gap identified in SEC 8-K forward guidance NLP studies: text-only sentiment models generate crowded alpha (in-sample IC \~+0.12 but poor walk-forward generalization). The five prosodic features (pitch variability, speech rate, vocal energy, pause ratio, and audio-text alignment) are largely uncorrelated with published text signals and untested at scale, representing a structural alternative data opportunity.

View extractor details at [api.mixpeek.com/v1/collections/features/extractors/audio\_sentiment\_extractor\_v1](https://api.mixpeek.com/v1/collections/features/extractors/audio_sentiment_extractor_v1) or fetch programmatically with `GET /v1/collections/features/extractors/{feature_extractor_id}`.

## Pipeline Steps

1. **Filter Dataset** (if `collection_id` provided)
   * Filter to specified collection
2. **Apply Input Mappings**
   * Resolve audio/video field from source (e.g., `payload.audio_url`, `payload.webcast_url`)
3. **Audio Extraction** (conditional: if video input)
   * FFmpeg strips audio track from MP4/MOV; supports AAC, MP3, FLAC output
4. **Voice Activity Detection + Segmentation**
   * `split_method: time` for fixed-length windows (default 30s)
   * `split_method: silence` to split at natural speech pauses (VAD threshold configurable)
   * `split_method: speaker` for one segment per speaker turn (requires `run_diarization=true`)
5. **Speaker Diarization** (conditional: if `run_diarization=true`)
   * pyannote.audio 3.x pipeline separates speakers (CEO, CFO, Analyst\_1, etc.)
   * Assigns `speaker_id` and optionally maps to `speaker_role` via role manifest
6. **Transcription** (conditional: if `run_transcription=true`)
   * Whisper large-v3-turbo speech-to-text with financial vocabulary prompt
   * Per-segment timestamps aligned to diarization boundaries
7. **FinBERT Text Sentiment** (conditional: if `run_finbert=true`)
   * ProsusAI/FinBERT financial-domain sentiment classifier
   * Outputs `sentiment_label` (positive/negative/neutral), `sentiment_score` (-1 to +1), `sentiment_confidence`
   * Generates 768D FinBERT CLS embedding for semantic search
8. **Prosodic Feature Extraction** (conditional: if `run_prosodics=true`)
   * LibROSA + Parselmouth extract 5 features per segment:
     * **Pitch variability** (F0 standard deviation, Hz): hesitation and stress indicator
     * **Speech rate** (words per minute): confidence and urgency signal
     * **Vocal energy** (RMS dB): assertiveness and emotional weight
     * **Pause ratio** (fraction of silence): cognitive load and evasiveness marker
     * **Vocal tremor** (jitter + shimmer): anxiety and deception indicator
   * Normalized into a 128D prosodic embedding for similarity search
9. **Audio-Text Alignment Score** (conditional: if `run_alignment=true`)
   * Cosine similarity between FinBERT sentiment direction and prosodic valence
   * Low alignment = voice contradicts words (high-value deception/stress signal)
10. **LLM Structured Enrichment** (conditional: if `run_llm_enrichment=true` or `response_shape` set)
    * Gemini/GPT-4o processes transcription with custom prompt
    * Extracts structured signals: guidance confidence, topic classification, hedging language
11. **Output**
    * Per-segment documents with both embedding types, raw prosodic features, sentiment scores, speaker metadata, and computed alpha signals

## When to Use

| Use Case | Description |
| --- | --- |
| **Earnings call analysis** | Process quarterly earnings calls for CEO/CFO vocal stress relative to guidance |
| **Forward guidance scoring** | Score management's confidence level on forward-looking statements |
| **Analyst day processing** | Build speaker-attributed sentiment timelines across presentations |
| **Fed press conference monitoring** | Track FOMC chair vocal markers around policy language |
| **Alpha signal generation** | Combine text + prosodic features for uncrowded quantitative factors |
| **Sentiment divergence detection** | Flag calls where management tone contradicts word sentiment |
| **Comparative speaker analysis** | Track individual executive vocal patterns across multiple quarters |
| **Podcast / financial media** | Extract sentiment from analyst interviews, TV appearances, podcasts |

## When NOT to Use

| Scenario | Recommended Alternative |
| --- | --- |
| Text-only documents (PDFs, filings) | `document_extractor` or `text_extractor` |
| Very short clips (\< 10 seconds) | Processing overhead disproportionate |
| Non-speech audio (music, noise) | `multimodal_extractor` |
| Real-time live streaming | Specialized streaming extractors |
| Non-English earnings calls | Set `transcription_language` explicitly; FinBERT is English-only |

## Supported Input Types

| Input | Type | Description | Processing |
| --- | --- | --- | --- |
| `audio` | string | URL or S3 path to MP3, WAV, FLAC, M4A | Direct processing |
| `video` | string | URL or S3 path to MP4, MOV, MKV | Audio extracted via FFmpeg |
| `url` | string | Direct URL to webcast / podcast stream | Downloaded and processed |

**Supported audio formats:** MP3, WAV, FLAC, M4A, OGG, OPUS

**Supported video formats (audio extracted):** MP4, MOV, MKV, AVI, WebM

## Input Schema

Provide **one** of the following inputs:

```json
{
  "audio": "s3://bucket/earnings/AAPL_Q4_2024.mp3"
}
```

```json
{
  "video": "s3://bucket/investor-day/msft-2024-ceo-keynote.mp4"
}
```

```json
{
  "url": "https://edge.media-server.com/mmc/p/xyz/earnings-call.mp3"
}
```

| Field | Type | Description |
| --- | --- | --- |
| `audio` | string | URL/S3 path to audio file. Recommended: \< 3 hours per file |
| `video` | string | URL/S3 path to video file; audio track extracted automatically |
| `url` | string | Direct stream URL; downloaded before processing |

## Output Schema

Each audio segment produces one document:

| Field | Type | Description |
| --- | --- | --- |
| `start_time` | number | Segment start time in seconds |
| `end_time` | number | Segment end time in seconds |
| `speaker_id` | string | Diarized speaker label (e.g., `SPEAKER_00`) |
| `speaker_role` | string | Mapped role if manifest provided (e.g., `CEO`, `CFO`, `Analyst`) |
| `transcription` | string | Whisper transcription of segment |
| `sentiment_label` | string | `positive`, `negative`, or `neutral` |
| `sentiment_score` | number | FinBERT sentiment score: -1.0 (negative) to +1.0 (positive) |
| `sentiment_confidence` | number | FinBERT confidence 0.0–1.0 |
| `pitch_variability_hz` | number | F0 standard deviation (Hz); stress/hesitation |
| `speech_rate_wpm` | number | Words per minute; confidence/urgency |
| `vocal_energy_db` | number | RMS energy in dB; assertiveness |
| `pause_ratio` | number | Fraction of silence 0.0–1.0; cognitive load |
| `vocal_tremor` | number | Jitter + shimmer composite 0.0–1.0; anxiety |
| `audio_text_alignment` | number | Prosody-sentiment cosine alignment -1.0 to +1.0 |
| `stress_index` | number | Composite vocal stress score 0.0–1.0 |
| `audio_sentiment_extractor_v1_text_embedding` | float\[768] | FinBERT CLS embedding |
| `audio_sentiment_extractor_v1_prosodic_embedding` | float\[128] | Normalized prosodic feature vector |

```json
{
  "start_time": 245.0,
  "end_time": 275.0,
  "speaker_id": "SPEAKER_00",
  "speaker_role": "CEO",
  "transcription": "We're very confident in our Q1 guidance range of twelve to fourteen dollars per share...",
  "sentiment_label": "positive",
  "sentiment_score": 0.71,
  "sentiment_confidence": 0.89,
  "pitch_variability_hz": 38.2,
  "speech_rate_wpm": 142,
  "vocal_energy_db": -18.4,
  "pause_ratio": 0.21,
  "vocal_tremor": 0.14,
  "audio_text_alignment": 0.63,
  "stress_index": 0.31,
  "audio_sentiment_extractor_v1_text_embedding": [0.041, -0.018, ...],
  "audio_sentiment_extractor_v1_prosodic_embedding": [0.72, 0.34, ...]
}
```

## Parameters

### Audio Segmentation

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `split_method` | string | `"silence"` | Segmentation strategy: `time`, `silence`, or `speaker` |

**Fixed-interval splitting**: equal-duration segments regardless of speech content.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `time_split_interval` | integer | `30` | Segment duration in seconds |

**Best for:** Batch processing, predictable segment counts, initial exploration

```json
{
  "split_method": "time",
  "time_split_interval": 30
}
```

**Voice activity detection**: splits at natural speech pauses. Preserves complete sentences and thoughts.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `silence_db_threshold` | integer | `-40` | dB level below which audio is silence |
| `min_silence_duration_ms` | integer | `500` | Minimum silence length to trigger split |

**Best for:** Earnings calls, presentations, interviews (preserves semantic units)

```json
{
  "split_method": "silence",
  "silence_db_threshold": -40,
  "min_silence_duration_ms": 500
}
```

**Speaker-turn segmentation**: one segment per speaker turn. Requires diarization. Ideal for Q\&A analysis.

**Characteristics:**

* Variable segment lengths (1s–5 min typical for earnings Q\&A)
* Each segment is a single speaker's continuous turn
* Enables per-speaker sentiment timelines

**Best for:** Q\&A sections, panel discussions, analyst questioning

```json
{
  "split_method": "speaker",
  "run_diarization": true
}
```

### Feature Extraction Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `run_transcription` | boolean | `true` | Run Whisper transcription |
| `transcription_language` | string | `"en"` | Language code for transcription |
| `transcription_prompt` | string | `null` | Domain vocabulary hint (e.g., ticker symbols, product names) |
| `run_finbert` | boolean | `true` | Run FinBERT financial sentiment classification |
| `run_prosodics` | boolean | `true` | Extract 5 prosodic features + 128D embedding |
| `run_alignment` | boolean | `true` | Compute audio-text alignment score |
| `run_diarization` | boolean | `false` | Run speaker diarization (adds \~20% processing time) |
| `num_speakers` | integer | `null` | Hint for diarization (null = auto-detect) |

### Speaker Role Manifest

Map diarized speaker IDs to roles (CEO, CFO, Analyst, etc.)
using a manifest:

```json
{
  "speaker_role_manifest": {
    "SPEAKER_00": "CEO",
    "SPEAKER_01": "CFO",
    "SPEAKER_02": "Analyst"
  }
}
```

When `speaker_role_manifest` is not provided, roles are labeled `SPEAKER_00`, `SPEAKER_01`, etc.

### LLM Structured Extraction

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `run_llm_enrichment` | boolean | `false` | Run LLM over transcription segments |
| `response_shape` | string \| object | `null` | Custom structured output schema |

**Natural Language Mode:**

```json
{
  "response_shape": "Extract: forward guidance confidence level (1-5), number of hedging phrases, primary topic discussed, and any mentioned risk factors"
}
```

**JSON Schema Mode for Quant Signals:**

```json
{
  "response_shape": {
    "type": "object",
    "properties": {
      "guidance_confidence": { "type": "integer", "minimum": 1, "maximum": 5 },
      "hedging_phrase_count": { "type": "integer" },
      "topic": { "type": "string", "enum": ["revenue", "margins", "guidance", "macro", "capex", "other"] },
      "risk_factors": { "type": "array", "items": { "type": "string" } },
      "quantitative_claims": { "type": "array", "items": { "type": "string" } }
    }
  }
}
```

## Configuration Examples

```json Earnings Call: Full Analysis
{
  "feature_extractor": {
    "feature_extractor_name": "audio_sentiment_extractor",
    "version": "v1",
    "input_mappings": { "audio": "audio_url" },
    "field_passthrough": [
      { "source_path": "metadata.ticker" },
      { "source_path": "metadata.fiscal_quarter" },
      { "source_path": "metadata.call_section" }
    ],
    "parameters": {
      "split_method": "speaker",
      "run_transcription": true,
      "transcription_language": "en",
      "run_finbert": true,
      "run_prosodics": true,
      "run_alignment": true,
      "run_diarization": true,
      "num_speakers": 6,
      "speaker_role_manifest": {
        "SPEAKER_00": "CEO",
        "SPEAKER_01": "CFO",
        "SPEAKER_02": "IR"
      }
    }
  }
}
```

```json CEO Prepared Remarks: Guidance Confidence
{
  "feature_extractor": {
    "feature_extractor_name": "audio_sentiment_extractor",
    "version": "v1",
    "input_mappings": { "audio": "prepared_remarks_url" },
    "field_passthrough": [
      { "source_path": "metadata.ticker" },
      { "source_path": "metadata.event_date" }
    ],
    "parameters": {
      "split_method": "silence",
      "silence_db_threshold": -38,
      "run_transcription": true,
      "transcription_prompt": "earnings call forward guidance revenue EPS margin",
      "run_finbert": true,
      "run_prosodics": true,
      "run_alignment": true,
      "run_llm_enrichment": true,
      "response_shape": {
        "type": "object",
        "properties": {
          "guidance_confidence": { "type": "integer", "minimum": 1, "maximum": 5 },
          "hedging_phrase_count": { "type": "integer" },
          "topic": { "type": "string" },
          "forward_looking": { "type": "boolean" }
        }
      }
    }
  }
}
```

```json Fed Press Conference: Policy Tone
{
  "feature_extractor": {
    "feature_extractor_name": "audio_sentiment_extractor",
    "version": "v1",
    "input_mappings": { "video": "press_conf_url" },
    "field_passthrough": [
      { "source_path": "metadata.fomc_date" },
      { "source_path": "metadata.rate_decision" }
    ],
    "parameters": {
      "split_method": "silence",
      "run_transcription": true,
      "run_finbert": true,
      "run_prosodics": true,
      "run_alignment": true,
      "run_diarization": false
    }
  }
}
```

```json Analyst Day: Multi-Speaker Timeline
{
  "feature_extractor": {
    "feature_extractor_name": "audio_sentiment_extractor",
    "version": "v1",
    "input_mappings": { "video": "analyst_day_recording" },
    "parameters": {
      "split_method": "speaker",
      "run_transcription": true,
      "run_finbert": true,
      "run_prosodics": true,
      "run_alignment": true,
      "run_diarization": true
    }
  }
}
```

```json Minimal: Transcription + FinBERT Only
{
  "feature_extractor": {
    "feature_extractor_name": "audio_sentiment_extractor",
    "version": "v1",
    "input_mappings": { "audio": "audio_url" },
    "parameters": {
      "split_method": "time",
      "time_split_interval": 60,
      "run_transcription": true,
      "run_finbert": true,
      "run_prosodics": false,
      "run_alignment": false,
      "run_diarization": false
    }
  }
}
```

## Performance & Costs

### Processing Speed

| Configuration | Speed | Example |
| --- | --- | --- |
| Transcription only | \~0.5x realtime | 60-min call → \~30 min |
| + FinBERT | \~0.6x realtime | 60-min call → \~36 min |
| + Prosodics | \~0.8x realtime | 60-min call → \~48 min |
| + Diarization | \~1.5x realtime | 60-min call → \~90 min |
| Full pipeline | \~1.5x realtime | 60-min call → \~90 min |

| Feature | Per-Segment Latency |
| --- | --- |
| Transcription (Whisper) | \~150ms/sec of audio |
| FinBERT classification | \~25ms |
| Prosodic extraction | \~50ms |
| Speaker diarization | \~200ms/sec of audio |
| LLM enrichment | \~1.5s |

### Cost Estimates (per hour of audio)

| Configuration | Cost |
| --- | --- |
| **Minimal** (transcription + FinBERT) | \$0.08 |
| **Standard** (+ prosodics + alignment) | \$0.15 |
| **Full** (+ diarization + LLM enrichment) | \$0.25 |

**Batch processing**: Processing 1,000 S\&P 500 earnings calls (avg 60 min) at full configuration ≈ \$250

## Vector Indexes

### Text Embedding (FinBERT)

| Property | Value |
| --- | --- |
| **Index name** | `audio_sentiment_extractor_v1_text_embedding` |
| **Dimensions** | 768 |
| **Type** | Dense |
| **Distance metric** | Cosine |
| **Inference model** | `finbert_sentiment_v1` |
| **Supported inputs** | text (transcription segments) |

### Prosodic Embedding

| Property | Value |
| --- | --- |
| **Index name** | `audio_sentiment_extractor_v1_prosodic_embedding` |
| **Dimensions** | 128 |
| **Type** | Dense |
| **Distance metric** | Cosine |
| **Inference model** | `prosodic_encoder_v1` |
| **Supported inputs** | audio segments |

## Alpha Signal Guide

This section describes the
five core prosodic features and their interpretation as quantitative signals.

### Pitch Variability

**What it measures:** Standard deviation of the fundamental frequency (F0) in Hz across the segment.

**Signal interpretation:**

* **High variability (> 50 Hz):** Elevated emotional engagement; can indicate stress or enthusiasm
* **Low variability (\< 15 Hz):** Monotone delivery; associated with rehearsed/scripted language or disengagement
* **Baseline deviation:** Compare against the speaker's historical mean F0 std dev for true anomaly detection

**Quant application:** Track CEO pitch variability during forward guidance vs. historical questions. Anomalous drops on guidance segments may precede earnings misses.

### Speech Rate

**What it measures:** Words per minute derived from Whisper word-level timestamps.

**Signal interpretation:**

* **High rate (> 180 wpm):** Urgency, anxiety, or over-rehearsed scripted answers
* **Low rate (\< 100 wpm):** Deliberate, careful language; common when discussing negative surprises
* **Rate deceleration mid-answer:** Suggests real-time, less scripted reasoning; a higher authenticity signal

**Quant application:** Significant speech rate slowdown during Q\&A relative to prepared remarks may signal management is processing unexpected analyst questions.

### Vocal Energy

**What it measures:** Root mean square energy of the audio signal in decibels.

**Signal interpretation:**

* **High energy:** Assertiveness and confidence; common in positive guidance delivery
* **Energy drop mid-sentence:** Hedging or trailing off; linguistic uncertainty
* **Segment-relative drop:** Cross-call energy tracking shows conviction level

**Quant application:** Energy drop on forward EPS guidance sentences (identifiable via LLM topic tagging) is a stress-linked signal distinct from text sentiment.

### Pause Ratio

**What it measures:** Fraction of segment duration classified as silence (VAD threshold -40 dB).
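
The measurement just described can be sketched in a few lines. This is a minimal illustration, not the extractor's implementation: it assumes mono samples normalized to the range -1.0 to 1.0, fixed 20 ms frames, and a plain per-frame RMS level compared against the -40 dB threshold. The function name `pause_ratio` and its parameters are hypothetical.

```python
import math


def pause_ratio(samples, sample_rate, threshold_db=-40.0, frame_ms=20):
    """Fraction of fixed-length frames whose RMS level is below threshold_db.

    Assumes `samples` is a list of floats in [-1.0, 1.0] (mono audio).
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    if not frames:
        return 0.0

    silent = 0
    for frame in frames:
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        # Level relative to full scale; digital silence maps to -infinity.
        level_db = 20 * math.log10(rms) if rms > 0 else float("-inf")
        if level_db < threshold_db:
            silent += 1
    return silent / len(frames)


# Usage: half a second of a 440 Hz tone followed by half a second of silence.
sample_rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate)
        for n in range(sample_rate // 2)]
silence = [0.0] * (sample_rate // 2)
print(pause_ratio(tone + silence, sample_rate))
```

A production VAD would typically add smoothing and a minimum silence duration (compare `min_silence_duration_ms`), which this sketch leaves out for brevity.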

**Signal interpretation:**

* **High pause ratio (> 0.35):** Cognitive load; speaker is reasoning in real time rather than reciting
* **Low pause ratio (\< 0.10):** Scripted, rehearsed delivery; less information content
* **Q\&A vs. prepared remarks delta:** A large increase in pause ratio during Q\&A is a well-documented stress marker

**Quant application:** Pause ratio on Q\&A segments answering analyst questions about inventory / margin / guidance has shown predictive value for negative guidance revisions in academic literature.

### Audio-Text Alignment

**What it measures:** Cosine similarity between the FinBERT sentiment direction (text) and prosodic valence (audio). Range: -1.0 to +1.0.

**Signal interpretation:**

* **High alignment (> 0.6):** Voice and words agree; higher conviction, less masking
* **Low alignment (0.1–0.4):** Moderate divergence; common in hedged language
* **Negative alignment (\< 0):** Voice contradicts words, the strongest stress/deception marker; e.g., "We feel very good about guidance" delivered with high pitch variability, low energy, and high pauses

**Quant application:** This is the most novel of the five features. Text NLP cannot capture it. Segments with positive text sentiment but negative alignment are the primary alpha generation target.

### Composite Stress Index

The `stress_index` field (0.0–1.0) is a normalized composite of all five prosodic features:

```
stress_index = normalize(
    0.25 * pitch_variability_z +
    0.20 * speech_rate_z +          # inverted: lower rate = higher stress
    0.20 * (1 - vocal_energy_z) +
    0.20 * pause_ratio_z +
    0.15 * vocal_tremor_z
)
```

Where `_z` values are Z-scores computed against the speaker's rolling 4-quarter baseline when `speaker_id` is consistent across calls.
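
To make the weighting concrete, here is a minimal sketch of the composite over the output fields listed in the schema. Several details are assumptions rather than documented behavior: `normalize` is taken to be a logistic squash into (0, 1), the speech-rate Z-score is negated to match the "inverted" comment, Z-scores use the population standard deviation of the baseline, and the baseline is passed in as a plain list of earlier segment documents for the same speaker. The names `zscore` and `stress_index` are hypothetical helpers.

```python
import math
from statistics import mean, pstdev

# Weights copied from the stress_index formula.
WEIGHTS = {
    "pitch_variability_hz": 0.25,
    "speech_rate_wpm": 0.20,
    "vocal_energy_db": 0.20,
    "pause_ratio": 0.20,
    "vocal_tremor": 0.15,
}


def zscore(value, history):
    """Z-score of `value` against a speaker's baseline observations."""
    sigma = pstdev(history)
    return 0.0 if sigma == 0 else (value - mean(history)) / sigma


def stress_index(segment, baseline):
    """Composite stress score in (0, 1) for one segment document."""
    z = {k: zscore(segment[k], [b[k] for b in baseline]) for k in WEIGHTS}
    raw = (
        WEIGHTS["pitch_variability_hz"] * z["pitch_variability_hz"]
        + WEIGHTS["speech_rate_wpm"] * -z["speech_rate_wpm"]  # assumption: inverted
        + WEIGHTS["vocal_energy_db"] * (1 - z["vocal_energy_db"])
        + WEIGHTS["pause_ratio"] * z["pause_ratio"]
        + WEIGHTS["vocal_tremor"] * z["vocal_tremor"]
    )
    # Assumption: logistic squashing stands in for the unspecified normalize().
    return 1.0 / (1.0 + math.exp(-raw))


# Usage with made-up baseline segments for a single speaker.
baseline = [
    {"pitch_variability_hz": 30.0, "speech_rate_wpm": 150.0, "vocal_energy_db": -18.0, "pause_ratio": 0.15, "vocal_tremor": 0.10},
    {"pitch_variability_hz": 34.0, "speech_rate_wpm": 145.0, "vocal_energy_db": -19.0, "pause_ratio": 0.18, "vocal_tremor": 0.12},
    {"pitch_variability_hz": 28.0, "speech_rate_wpm": 155.0, "vocal_energy_db": -17.0, "pause_ratio": 0.14, "vocal_tremor": 0.09},
    {"pitch_variability_hz": 32.0, "speech_rate_wpm": 150.0, "vocal_energy_db": -18.0, "pause_ratio": 0.17, "vocal_tremor": 0.11},
]
calm = {"pitch_variability_hz": 31.0, "speech_rate_wpm": 150.0, "vocal_energy_db": -18.0, "pause_ratio": 0.16, "vocal_tremor": 0.105}
stressed = {"pitch_variability_hz": 45.0, "speech_rate_wpm": 120.0, "vocal_energy_db": -24.0, "pause_ratio": 0.35, "vocal_tremor": 0.25}

print(stress_index(calm, baseline), stress_index(stressed, baseline))
```

With these illustrative numbers the stressed segment scores higher than the calm one; the absolute values depend entirely on the assumed squashing function and should not be compared with the extractor's own `stress_index` output.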
### Recommended Factor Construction

```python
# Example: CEO guidance confidence factor
import mixpeek

client = mixpeek.Client(api_key="YOUR_KEY")

# Query for CEO guidance segments with sentiment divergence
results = client.retrievers.run(
    retriever_id="earnings-sentiment-retriever",
    inputs={
        "query": "revenue guidance outlook fiscal year",
        "filters": {
            "speaker_role": "CEO",
            "forward_looking": True,
            "audio_text_alignment": {"$lt": 0.3},  # divergence signal
            "stress_index": {"$gt": 0.6},          # high stress
        },
    },
)
```

## Limitations

* **Speaker diarization accuracy**: pyannote reaches a diarization error rate (DER) of roughly 10% (about 90% accuracy) on clean 2-speaker recordings; accuracy degrades with > 8 speakers or poor audio quality
* **Non-English**: Whisper transcription supports 99 languages, but FinBERT is English-only; for non-English calls, disable `run_finbert` and use multilingual sentiment models
* **Audio quality**: Prosodic features require 16kHz+ audio; compressed phone audio (8kHz) reduces pitch extraction accuracy by \~30%
* **Baseline dependency**: `stress_index` Z-score normalization requires at least 4 prior segments from the same `speaker_id` to be meaningful
* **Segment length**: Prosodic features are unreliable for segments \< 5 seconds; short interjections are best excluded
* **LLM enrichment latency**: `run_llm_enrichment=true` adds 1–2s per segment; disable for batch throughput

## Related

* [Feature Extractors Overview](/processing/feature-extractors)
* [Text Extractor](/processing/extractors/text)
* [Multimodal Extractor](/processing/extractors/multimodal)
* [Document Extractor](/processing/extractors/document)