Audio sentiment extractor pipeline showing speaker diarization, parallel FinBERT and prosodic feature extraction, and output alpha signals
The audio sentiment extractor processes earnings call recordings, analyst day presentations, Fed press conferences, and financial podcasts to produce two parallel signal streams: FinBERT financial-domain text sentiment (768D) computed on Whisper transcriptions, and a 128D prosodic embedding built from five features that capture vocal stress, hesitation, and deception markers. Speaker diarization separates management from analysts for role-attributed sentiment. This extractor addresses the gap identified in SEC 8-K forward guidance NLP studies: text-only sentiment models generate crowded alpha (in-sample IC ~+0.12 but poor walk-forward generalization). The five prosodic features (pitch variability, speech rate, vocal energy, pause ratio, and vocal tremor) and the derived audio-text alignment score are largely uncorrelated with published text signals and untested at scale, representing a structural alternative data opportunity.
View extractor details at api.mixpeek.com/v1/collections/features/extractors/audio_sentiment_extractor_v1 or fetch programmatically with GET /v1/collections/features/extractors/{feature_extractor_id}.
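A minimal sketch of the programmatic fetch, assuming the API is served over HTTPS and accepts a bearer token in the Authorization header (neither detail is stated above, and `build_extractor_request` is a hypothetical helper). It only builds the request; sending it is left to the caller.

```python
from urllib.request import Request

BASE_URL = "https://api.mixpeek.com"  # assumption: HTTPS scheme


def build_extractor_request(feature_extractor_id: str, api_key: str) -> Request:
    """Build (but do not send) GET /v1/collections/features/extractors/{id}."""
    url = f"{BASE_URL}/v1/collections/features/extractors/{feature_extractor_id}"
    # Assumption: bearer-token auth; check the API reference for the real scheme.
    headers = {"Authorization": f"Bearer {api_key}"}
    return Request(url, headers=headers, method="GET")


req = build_extractor_request("audio_sentiment_extractor_v1", "YOUR_KEY")
print(req.get_method(), req.full_url)
```

The returned object can be passed to `urllib.request.urlopen` once a real key is in place.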

Pipeline Steps

  1. Filter Dataset (if collection_id provided)
    • Filter to specified collection
  2. Apply Input Mappings
    • Resolve audio/video field from source (e.g., payload.audio_url, payload.webcast_url)
  3. Audio Extraction (conditional: if video input)
    • FFmpeg extracts the audio track from MP4/MOV; supports AAC, MP3, FLAC output
  4. Voice Activity Detection + Segmentation
    • split_method: time — fixed-length windows (default 30s)
    • split_method: silence — split at natural speech pauses (VAD threshold configurable)
    • split_method: speaker — one segment per speaker turn (requires run_diarization=true)
  5. Speaker Diarization (conditional: if run_diarization=true)
    • pyannote.audio 3.x pipeline separates speakers (CEO, CFO, Analyst_1, etc.)
    • Assigns speaker_id and optionally maps to speaker_role via role manifest
  6. Transcription (conditional: if run_transcription=true)
    • Whisper large-v3-turbo speech-to-text with financial vocabulary prompt
    • Per-segment timestamps aligned to diarization boundaries
  7. FinBERT Text Sentiment (conditional: if run_finbert=true)
    • ProsusAI/finbert (FinBERT) financial-domain sentiment classifier
    • Outputs sentiment_label (positive/negative/neutral), sentiment_score (-1 to +1), confidence
    • Generates 768D FinBERT CLS embedding for semantic search
  8. Prosodic Feature Extraction (conditional: if run_prosodics=true)
    • LibROSA + Parselmouth extract 5 features per segment:
      • Pitch variability (F0 standard deviation, Hz) — hesitation and stress indicator
      • Speech rate (words per minute) — confidence and urgency signal
      • Vocal energy (RMS dB) — assertiveness and emotional weight
      • Pause ratio (fraction of silence) — cognitive load and evasiveness marker
      • Vocal tremor (jitter + shimmer) — anxiety and deception indicator
    • Normalized into a 128D prosodic embedding for similarity search
  9. Audio-Text Alignment Score (conditional: if run_alignment=true)
    • Cosine similarity between FinBERT sentiment direction and prosodic valence
    • Low alignment = voice contradicts words (high-value deception/stress signal)
  10. LLM Structured Enrichment (conditional: if run_llm_enrichment=true or response_shape set)
    • Gemini/GPT-4o processes transcription with custom prompt
    • Extracts structured signals: guidance confidence, topic classification, hedging language
  11. Output
    • Per-segment documents with both embedding types, raw prosodic features, sentiment scores, speaker metadata, and computed alpha signals

When to Use

| Use Case | Description |
| --- | --- |
| Earnings call analysis | Process quarterly earnings calls for CEO/CFO vocal stress relative to guidance |
| Forward guidance scoring | Score management’s confidence level on forward-looking statements |
| Analyst day processing | Build speaker-attributed sentiment timelines across presentations |
| Fed press conference monitoring | Track FOMC chair vocal markers around policy language |
| Alpha signal generation | Combine text + prosodic features for uncrowded quantitative factors |
| Sentiment divergence detection | Flag calls where management tone contradicts word sentiment |
| Comparative speaker analysis | Track individual executive vocal patterns across multiple quarters |
| Podcast / financial media | Extract sentiment from analyst interviews, TV appearances, podcasts |

When NOT to Use

| Scenario | Recommended Alternative |
| --- | --- |
| Text-only documents (PDFs, filings) | document_extractor or text_extractor |
| Very short clips (< 10 seconds) | Processing overhead disproportionate |
| Non-speech audio (music, noise) | multimodal_extractor |
| Real-time live streaming | Specialized streaming extractors |
| Non-English earnings calls | Set transcription_language explicitly; FinBERT is English-only |

Supported Input Types

| Input | Type | Description | Processing |
| --- | --- | --- | --- |
| audio | string | URL or S3 path to MP3, WAV, FLAC, M4A | Direct processing |
| video | string | URL or S3 path to MP4, MOV, MKV | Audio extracted via FFmpeg |
| url | string | Direct URL to webcast / podcast stream | Downloaded and processed |

Supported audio formats: MP3, WAV, FLAC, M4A, OGG, OPUS
Supported video formats (audio extracted): MP4, MOV, MKV, AVI, WebM

Input Schema

Provide one of the following inputs:
{
  "audio": "s3://bucket/earnings/AAPL_Q4_2024.mp3"
}
{
  "video": "s3://bucket/investor-day/msft-2024-ceo-keynote.mp4"
}
{
  "url": "https://edge.media-server.com/mmc/p/xyz/earnings-call.mp3"
}
| Field | Type | Description |
| --- | --- | --- |
| audio | string | URL/S3 path to audio file. Recommended: < 3 hours per file |
| video | string | URL/S3 path to video file; audio track extracted automatically |
| url | string | Direct stream URL; downloaded before processing |
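As a rough client-side check, the sketch below reads "one of" as exactly one of `audio`, `video`, or `url`. That reading is an assumption, and `validate_input` is a hypothetical helper rather than part of the API.

```python
INPUT_FIELDS = ("audio", "video", "url")


def validate_input(payload: dict) -> str:
    """Return the name of the single input field present, or raise ValueError."""
    present = [name for name in INPUT_FIELDS if payload.get(name)]
    if len(present) != 1:
        raise ValueError(f"expected exactly one of {INPUT_FIELDS}, got {present}")
    return present[0]


print(validate_input({"audio": "s3://bucket/earnings/AAPL_Q4_2024.mp3"}))
```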

Output Schema

Each audio segment produces one document:
| Field | Type | Description |
| --- | --- | --- |
| start_time | number | Segment start time in seconds |
| end_time | number | Segment end time in seconds |
| speaker_id | string | Diarized speaker label (e.g., SPEAKER_00) |
| speaker_role | string | Mapped role if manifest provided (e.g., CEO, CFO, Analyst) |
| transcription | string | Whisper transcription of segment |
| sentiment_label | string | positive, negative, or neutral |
| sentiment_score | number | FinBERT sentiment score: -1.0 (negative) to +1.0 (positive) |
| sentiment_confidence | number | FinBERT confidence, 0.0–1.0 |
| pitch_variability_hz | number | F0 standard deviation (Hz); stress/hesitation |
| speech_rate_wpm | number | Words per minute; confidence/urgency |
| vocal_energy_db | number | RMS energy in dB; assertiveness |
| pause_ratio | number | Fraction of silence, 0.0–1.0; cognitive load |
| vocal_tremor | number | Jitter + shimmer composite, 0.0–1.0; anxiety |
| audio_text_alignment | number | Prosody-sentiment cosine alignment, -1.0 to +1.0 |
| stress_index | number | Composite vocal stress score, 0.0–1.0 |
| audio_sentiment_extractor_v1_text_embedding | float[768] | FinBERT CLS embedding |
| audio_sentiment_extractor_v1_prosodic_embedding | float[128] | Normalized prosodic feature vector |
{
  "start_time": 245.0,
  "end_time": 275.0,
  "speaker_id": "SPEAKER_00",
  "speaker_role": "CEO",
  "transcription": "We're very confident in our Q1 guidance range of twelve to fourteen dollars per share...",
  "sentiment_label": "positive",
  "sentiment_score": 0.71,
  "sentiment_confidence": 0.89,
  "pitch_variability_hz": 38.2,
  "speech_rate_wpm": 142,
  "vocal_energy_db": -18.4,
  "pause_ratio": 0.21,
  "vocal_tremor": 0.14,
  "audio_text_alignment": 0.63,
  "stress_index": 0.31,
  "audio_sentiment_extractor_v1_text_embedding": [0.041, -0.018, ...],
  "audio_sentiment_extractor_v1_prosodic_embedding": [0.72, 0.34, ...]
}

Parameters

Audio Segmentation

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| split_method | string | "silence" | Segmentation strategy: time, silence, or speaker |
Fixed-interval splitting — equal-duration segments regardless of speech content.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| time_split_interval | integer | 30 | Segment duration in seconds |
Best for: Batch processing, predictable segment counts, initial exploration
{
  "split_method": "time",
  "time_split_interval": 30
}
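To illustrate what fixed-interval splitting produces, here is a small sketch that computes window boundaries for a given audio duration. It assumes the final window is shortened to fit rather than dropped or padded, which the text above does not specify; `time_windows` is a hypothetical helper.

```python
def time_windows(duration_s: float, interval_s: int = 30) -> list[tuple[float, float]]:
    """Split [0, duration_s) into consecutive (start, end) windows of interval_s seconds."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    windows = []
    start = 0.0
    while start < duration_s:
        # Assumption: the last window is truncated at the end of the audio.
        end = min(start + interval_s, duration_s)
        windows.append((start, end))
        start = end
    return windows


print(time_windows(75))
```

Under this assumption a 60-minute call with the default interval yields 120 segments, which is what makes segment counts predictable.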

Feature Extraction Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| run_transcription | boolean | true | Run Whisper transcription |
| transcription_language | string | "en" | Language code for transcription |
| transcription_prompt | string | null | Domain vocabulary hint (e.g., ticker symbols, product names) |
| run_finbert | boolean | true | Run FinBERT financial sentiment classification |
| run_prosodics | boolean | true | Extract 5 prosodic features + 128D embedding |
| run_alignment | boolean | true | Compute audio-text alignment score |
| run_diarization | boolean | false | Run speaker diarization (adds ~20% processing time) |
| num_speakers | integer | null | Hint for diarization (null = auto-detect) |

Speaker Role Manifest

Map diarized speaker IDs to roles (CEO, CFO, Analyst, etc.) using a manifest:
{
  "speaker_role_manifest": {
    "SPEAKER_00": "CEO",
    "SPEAKER_01": "CFO",
    "SPEAKER_02": "Analyst"
  }
}
When speaker_role_manifest is not provided, roles are labeled SPEAKER_00, SPEAKER_01, etc.
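The fallback behaviour can be sketched as a simple lookup. The case where a manifest is provided but omits a speaker is not covered above; this sketch assumes the diarized label is kept in that case too. `resolve_role` is a hypothetical helper.

```python
from typing import Optional


def resolve_role(speaker_id: str, manifest: Optional[dict] = None) -> str:
    """Map a diarized speaker_id to a role, falling back to the raw label."""
    # Assumption: speakers missing from a provided manifest keep their diarized label.
    return (manifest or {}).get(speaker_id, speaker_id)


manifest = {"SPEAKER_00": "CEO", "SPEAKER_01": "CFO", "SPEAKER_02": "Analyst"}
print(resolve_role("SPEAKER_01", manifest), resolve_role("SPEAKER_05", manifest))
```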

LLM Structured Extraction

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| run_llm_enrichment | boolean | false | Run LLM over transcription segments |
| response_shape | string or object | null | Custom structured output schema |
Natural Language Mode:
{
  "response_shape": "Extract: forward guidance confidence level (1-5), number of hedging phrases, primary topic discussed, and any mentioned risk factors"
}
JSON Schema Mode for Quant Signals:
{
  "response_shape": {
    "type": "object",
    "properties": {
      "guidance_confidence": { "type": "integer", "minimum": 1, "maximum": 5 },
      "hedging_phrase_count": { "type": "integer" },
      "topic": { "type": "string", "enum": ["revenue", "margins", "guidance", "macro", "capex", "other"] },
      "risk_factors": { "type": "array", "items": { "type": "string" } },
      "quantitative_claims": { "type": "array", "items": { "type": "string" } }
    }
  }
}

Configuration Examples

{
  "feature_extractor": {
    "feature_extractor_name": "audio_sentiment_extractor",
    "version": "v1",
    "input_mappings": {
      "audio": "audio_url"
    },
    "field_passthrough": [
      { "source_path": "metadata.ticker" },
      { "source_path": "metadata.fiscal_quarter" },
      { "source_path": "metadata.call_section" }
    ],
    "parameters": {
      "split_method": "speaker",
      "run_transcription": true,
      "transcription_language": "en",
      "run_finbert": true,
      "run_prosodics": true,
      "run_alignment": true,
      "run_diarization": true,
      "num_speakers": 6,
      "speaker_role_manifest": {
        "SPEAKER_00": "CEO",
        "SPEAKER_01": "CFO",
        "SPEAKER_02": "IR"
      }
    }
  }
}
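Before submitting a configuration like the one above, it can help to check the two constraints this page states: `split_method` must be one of three values, and `speaker` splitting requires `run_diarization=true`. The sketch below is a hypothetical client-side check, not the server's validation logic; the defaults are taken from the parameter tables.

```python
VALID_SPLIT_METHODS = {"time", "silence", "speaker"}


def check_parameters(params: dict) -> list[str]:
    """Return a list of human-readable problems; empty means no issue found."""
    problems = []
    split_method = params.get("split_method", "silence")  # documented default
    if split_method not in VALID_SPLIT_METHODS:
        problems.append(f"unknown split_method: {split_method!r}")
    # Speaker splitting needs diarization (documented default: false).
    if split_method == "speaker" and not params.get("run_diarization", False):
        problems.append("split_method 'speaker' requires run_diarization=true")
    return problems


print(check_parameters({"split_method": "speaker", "run_diarization": True}))
print(check_parameters({"split_method": "speaker"}))
```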

Performance & Costs

Processing Speed

| Configuration | Speed | Example |
| --- | --- | --- |
| Transcription only | ~0.5x realtime | 60-min call → ~30 min |
| + FinBERT | ~0.6x realtime | 60-min call → ~36 min |
| + Prosodics | ~0.8x realtime | 60-min call → ~48 min |
| + Diarization | ~1.5x realtime | 60-min call → ~90 min |
| Full pipeline | ~1.5x realtime | 60-min call → ~90 min |

| Feature | Per-Segment Latency |
| --- | --- |
| Transcription (Whisper) | ~150ms/sec of audio |
| FinBERT classification | ~25ms |
| Prosodic extraction | ~50ms |
| Speaker diarization | ~200ms/sec of audio |
| LLM enrichment | ~1.5s |
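In the speed table, the "x realtime" figure is processing time divided by audio duration, so a planning estimate is a single multiplication. The sketch below uses the approximate factors from that table; the dictionary keys and `estimate_processing_minutes` are hypothetical names.

```python
# Approximate processing-time multipliers from the speed table.
REALTIME_FACTOR = {
    "transcription_only": 0.5,
    "finbert": 0.6,
    "prosodics": 0.8,
    "diarization": 1.5,
    "full": 1.5,
}


def estimate_processing_minutes(audio_minutes: float, configuration: str) -> float:
    """Estimate wall-clock processing time for one file."""
    return audio_minutes * REALTIME_FACTOR[configuration]


print(estimate_processing_minutes(60, "prosodics"))
```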

Cost Estimates (per hour of audio)

| Configuration | Cost |
| --- | --- |
| Minimal (transcription + FinBERT) | $0.08 |
| Standard (+ prosodics + alignment) | $0.15 |
| Full (+ diarization + LLM enrichment) | $0.25 |
Batch processing: Processing 1,000 S&P 500 earnings calls (avg 60 min) at full configuration ≈ $250
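The batch figure follows from the per-hour rates: files × average hours × rate. A small sketch of that arithmetic, with `batch_cost_usd` and the dictionary keys as hypothetical names:

```python
# Per-hour-of-audio rates from the cost table (USD).
COST_PER_AUDIO_HOUR = {"minimal": 0.08, "standard": 0.15, "full": 0.25}


def batch_cost_usd(num_files: int, avg_minutes: float, configuration: str) -> float:
    """Estimated cost of processing a batch of similar-length files."""
    audio_hours = num_files * avg_minutes / 60.0
    return audio_hours * COST_PER_AUDIO_HOUR[configuration]


print(batch_cost_usd(1000, 60, "full"))
```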

Vector Indexes

Text Embedding (FinBERT)

| Property | Value |
| --- | --- |
| Index name | audio_sentiment_extractor_v1_text_embedding |
| Dimensions | 768 |
| Type | Dense |
| Distance metric | Cosine |
| Inference model | finbert_sentiment_v1 |
| Supported inputs | text (transcription segments) |

Prosodic Embedding

| Property | Value |
| --- | --- |
| Index name | audio_sentiment_extractor_v1_prosodic_embedding |
| Dimensions | 128 |
| Type | Dense |
| Distance metric | Cosine |
| Inference model | prosodic_encoder_v1 |
| Supported inputs | audio segments |

Alpha Signal Guide

This section describes four of the prosodic features (pitch variability, speech rate, vocal energy, pause ratio) and the audio-text alignment score, and how to interpret each as a quantitative signal.
Pitch Variability
What it measures: Standard deviation of the fundamental frequency (F0) in Hz across the segment.
Signal interpretation:
  • High variability (> 50 Hz): Elevated emotional engagement; can indicate stress or enthusiasm
  • Low variability (< 15 Hz): Monotone delivery; associated with rehearsed/scripted language or disengagement
  • Baseline deviation: Compare against the speaker’s historical mean F0 std dev for true anomaly detection
Quant application: Track CEO pitch variability on forward-guidance segments against the same speaker's level when answering questions about historical results. Anomalous drops on guidance segments may precede earnings misses.
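A minimal sketch of the pitch variability measure and its bands, assuming an F0 track (one value per frame, in Hz) from any pitch tracker, with unvoiced frames encoded as 0. The choice of population standard deviation and the label "typical" for the middle band are assumptions; both function names are hypothetical.

```python
from statistics import pstdev


def pitch_variability_hz(f0_hz: list[float]) -> float:
    """Standard deviation of F0 over voiced frames only."""
    voiced = [f for f in f0_hz if f > 0]  # assumption: 0 marks unvoiced frames
    if len(voiced) < 2:
        raise ValueError("need at least two voiced frames")
    return pstdev(voiced)


def classify_pitch_variability(std_hz: float) -> str:
    """Apply the > 50 Hz and < 15 Hz bands described above."""
    if std_hz > 50:
        return "high"
    if std_hz < 15:
        return "low"
    return "typical"  # hypothetical label for the unnamed middle band


std = pitch_variability_hz([100.0, 0.0, 120.0])
print(std, classify_pitch_variability(std))
```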
Speech Rate
What it measures: Words per minute derived from Whisper word-level timestamps.
Signal interpretation:
  • High rate (> 180 wpm): Urgency, anxiety, or over-rehearsed scripted answers
  • Low rate (< 100 wpm): Deliberate, careful language; common when discussing negative surprises
  • Rate deceleration mid-answer: Suggests real-time, less scripted reasoning; a higher authenticity signal
Quant application: Significant speech rate slowdown during Q&A relative to prepared remarks may signal management is processing unexpected analyst questions.
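Words per minute can be derived from word-level timestamps roughly as follows. The `(token, start_s, end_s)` tuple layout is an assumption about how the timestamps are represented, as is measuring duration from the first word's start to the last word's end rather than over the whole segment.

```python
def speech_rate_wpm(words: list[tuple[str, float, float]]) -> float:
    """Words per minute from (token, start_s, end_s) word timestamps."""
    if not words:
        return 0.0
    # Assumption: spoken span runs from first word start to last word end.
    duration_s = words[-1][2] - words[0][1]
    if duration_s <= 0:
        raise ValueError("word timestamps must span a positive duration")
    return len(words) / duration_s * 60.0


words = [("we", 0.0, 0.4), ("are", 0.5, 0.9), ("confident", 1.0, 1.5)]
print(speech_rate_wpm(words))
```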
Vocal Energy
What it measures: Root mean square energy of the audio signal in decibels.
Signal interpretation:
  • High energy: Assertiveness and confidence; common in positive guidance delivery
  • Energy drop mid-sentence: Hedging or trailing off; linguistic uncertainty
  • Drop relative to prior calls: Tracking a speaker's energy across calls indicates changes in conviction level
Quant application: Energy drop on forward EPS guidance sentences (identifiable via LLM topic tagging) is a stress-linked signal distinct from text sentiment.
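RMS energy in decibels can be sketched as 20·log10 of the root mean square of the samples. This assumes floating-point samples in [-1, 1] with full scale as the 0 dB reference (consistent with the negative value in the example document, but not confirmed above); the silence floor is an arbitrary choice.

```python
import math


def rms_db(samples: list[float], floor_db: float = -120.0) -> float:
    """RMS level in dB relative to full scale (1.0)."""
    if not samples:
        raise ValueError("samples must be non-empty")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        return floor_db  # avoid log10(0) on digital silence
    return 20.0 * math.log10(rms)


print(round(rms_db([0.1, -0.1, 0.1, -0.1]), 2))
```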
Pause Ratio
What it measures: Fraction of segment duration classified as silence (VAD threshold -40 dB).
Signal interpretation:
  • High pause ratio (> 0.35): Cognitive load; speaker is reasoning in real time rather than reciting
  • Low pause ratio (< 0.10): Scripted, rehearsed delivery with less information content
  • Q&A vs. prepared remarks delta: A large increase in pause ratio during Q&A is a well-documented stress marker
Quant application: Pause ratio on Q&A segments answering analyst questions about inventory / margin / guidance has shown predictive value for negative guidance revisions in academic literature.
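Given per-frame energy levels in dB, the pause ratio reduces to counting frames below the threshold. The sketch assumes equal-length frames, so the frame fraction equals the duration fraction, and uses simple energy thresholding in place of whatever VAD the extractor actually runs.

```python
def pause_ratio(frame_db: list[float], threshold_db: float = -40.0) -> float:
    """Fraction of frames whose energy falls below the silence threshold."""
    if not frame_db:
        raise ValueError("frame_db must be non-empty")
    silent = sum(1 for level in frame_db if level < threshold_db)
    return silent / len(frame_db)


prepared = pause_ratio([-18.0, -20.0, -55.0, -19.0, -17.0])
qa = pause_ratio([-50.0, -20.0, -45.0, -10.0])
print(prepared, qa, qa - prepared)  # the last value is the Q&A vs. prepared delta
```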
Audio-Text Alignment
What it measures: Cosine similarity between the FinBERT sentiment direction (text) and prosodic valence (audio). Range: -1.0 to +1.0.
Signal interpretation:
  • High alignment (> 0.6): Voice and words agree; higher conviction, less masking
  • Low alignment (0.1–0.4): Moderate divergence; common in hedged language
  • Negative alignment (< 0): Voice contradicts words; strongest stress/deception marker (e.g., “We feel very good about guidance” delivered with high pitch variability, low energy, and a high pause ratio)
Quant application: This is the most novel of the signals described here. Text NLP cannot capture it. Segments with positive text sentiment but negative alignment are the primary alpha generation target.
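The alignment bands and the "positive text, negative alignment" target translate into a short screening sketch. Values that fall between the documented bands (0 to 0.1 and 0.4 to 0.6) are not labeled above, so this sketch returns "unclassified" for them; both function names are hypothetical.

```python
def alignment_band(alignment: float) -> str:
    """Bucket an audio_text_alignment value using the documented thresholds."""
    if alignment < 0:
        return "contradiction"
    if alignment > 0.6:
        return "high"
    if 0.1 <= alignment <= 0.4:
        return "moderate_divergence"
    return "unclassified"  # gaps between the documented bands


def is_alpha_target(sentiment_score: float, alignment: float) -> bool:
    """Positive text sentiment delivered with a contradicting voice."""
    return sentiment_score > 0 and alignment < 0


print(alignment_band(0.63), is_alpha_target(0.71, 0.63), is_alpha_target(0.71, -0.2))
```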

Composite Stress Index

The stress_index field (0.0–1.0) is a normalized composite of all five prosodic features:
stress_index = normalize(
  0.25 * pitch_variability_z +
  0.20 * speech_rate_z +      # speech_rate_z is sign-inverted before weighting: lower rate = higher stress
  0.20 * (1 - vocal_energy_z) +
  0.20 * pause_ratio_z +
  0.15 * vocal_tremor_z
)
Where _z values are Z-scores computed against the speaker’s rolling 4-quarter baseline when speaker_id is consistent across calls.
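One way to read the composite as code is sketched below. Several details are assumptions rather than the extractor's implementation: the baseline uses a population standard deviation, the speech-rate Z-score is sign-flipped before weighting (following the comment in the formula), the energy term is kept as written, and `normalize` is taken to be a logistic squash into 0.0–1.0.

```python
import math
from statistics import mean, pstdev


def zscore(value: float, baseline: list[float]) -> float:
    """Z-score of value against a speaker's baseline observations."""
    sigma = pstdev(baseline)
    return 0.0 if sigma == 0 else (value - mean(baseline)) / sigma


def stress_index(z: dict) -> float:
    """Weighted composite of prosodic Z-scores, squashed into (0, 1)."""
    composite = (
        0.25 * z["pitch_variability"]
        + 0.20 * z["speech_rate"]          # assumed already sign-flipped by the caller
        + 0.20 * (1 - z["vocal_energy"])   # term kept exactly as in the formula
        + 0.20 * z["pause_ratio"]
        + 0.15 * z["vocal_tremor"]
    )
    return 1.0 / (1.0 + math.exp(-composite))  # assumption: logistic normalize


neutral = dict.fromkeys(
    ["pitch_variability", "speech_rate", "vocal_energy", "pause_ratio", "vocal_tremor"], 0.0
)
stressed = {**neutral, "pause_ratio": zscore(0.40, [0.18, 0.21, 0.20, 0.22])}
print(stress_index(neutral) < stress_index(stressed))
```

The weights sum to 1.0, so the composite stays on roughly the same scale as a single Z-score before normalization.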
# Example: CEO guidance confidence factor
import mixpeek

client = mixpeek.Client(api_key="YOUR_KEY")

# Query for CEO guidance segments with sentiment divergence
results = client.retrievers.run(
    retriever_id="earnings-sentiment-retriever",
    inputs={
        "query": "revenue guidance outlook fiscal year",
        "filters": {
            "speaker_role": "CEO",
            "forward_looking": True,                 # not in the output schema above; assumed to come from LLM enrichment or passthrough
            "audio_text_alignment": {"$lt": 0.3},   # divergence signal
            "stress_index": {"$gt": 0.6}             # high stress
        }
    }
)

Limitations

  • Speaker diarization accuracy: pyannote achieves ~90% accuracy (about 10% diarization error rate, DER) on clean 2-speaker recordings; accuracy degrades with > 8 speakers or poor audio quality
  • Non-English: Whisper transcription supports 99 languages; FinBERT is English-only — for non-English calls, disable run_finbert and use multilingual sentiment models
  • Audio quality: Prosodic features require 16kHz+ audio; compressed phone audio (8kHz) reduces pitch extraction accuracy by ~30%
  • Baseline dependency: stress_index Z-score normalization requires at least 4 prior segments from the same speaker_id to be meaningful
  • Segment length: Prosodic features are unreliable for segments < 5 seconds; short interjections are best excluded
  • LLM enrichment latency: run_llm_enrichment=true adds 1–2s per segment; disable for batch throughput
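The segment-length limitation can be applied as a post-processing filter on the output documents. This is a hypothetical client-side sketch using the `start_time` and `end_time` fields from the output schema; the 5-second cutoff is the one stated above.

```python
MIN_PROSODIC_SEGMENT_S = 5.0  # below this, prosodic features are unreliable


def has_reliable_prosodics(doc: dict, min_duration_s: float = MIN_PROSODIC_SEGMENT_S) -> bool:
    """True when the segment is long enough for its prosodic features to be used."""
    return (doc["end_time"] - doc["start_time"]) >= min_duration_s


segments = [
    {"start_time": 245.0, "end_time": 275.0, "speaker_id": "SPEAKER_00"},
    {"start_time": 275.0, "end_time": 277.5, "speaker_id": "SPEAKER_02"},  # short interjection
]
usable = [doc for doc in segments if has_reliable_prosodics(doc)]
print(len(usable))
```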