Audio Sentiment Extractor

Not yet available. This extractor is a design/roadmap page — it is not in the platform’s extractor registry today, and referencing it in a collection returns a validation error. For working alternatives see the extractor catalog.

Configuring collections by built-in extractor name is a deprecated path — collections are now created by picking features. This extractor does not yet have a direct feature-key replacement; existing feature_extractor configs keep working. See the migration guide.

Browse the extractor catalog on GitHub

Runnable reference for every built-in Mixpeek extractor — inputs, parameters, output fields, embedding models, and copy-paste examples. Auto-generated from the live registry, so it always matches production.

Figure: Audio sentiment extractor pipeline, showing speaker diarization, parallel FinBERT and prosodic feature extraction, and output alpha signals.

The audio sentiment extractor processes earnings call recordings, analyst day presentations, Fed press conferences, and financial podcasts to produce two parallel signal streams: FinBERT financial-domain text sentiment (768D) computed on the Whisper transcription, and a prosodic embedding (128D) built from five vocal features that capture stress, hesitation, and deception markers. Speaker diarization separates management from analysts for role-attributed sentiment.

This extractor addresses the gap identified in SEC 8-K forward guidance NLP studies: text-only sentiment models generate crowded alpha (in-sample IC ~+0.12 but poor walk-forward generalization). The five prosodic features (pitch variability, speech rate, vocal energy, pause ratio, and vocal tremor), together with the audio-text alignment score, are largely uncorrelated with published text signals and untested at scale, representing a structural alternative data opportunity.

View extractor details at api.mixpeek.com/v1/collections/features/extractors/audio_sentiment_extractor_v1 or fetch programmatically with GET /v1/collections/features/extractors/{feature_extractor_id}.
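As a rough illustration of the programmatic route, the sketch below only builds the GET request and does not send it. The `https` scheme and the bearer-token header are assumptions, not something this page specifies, and `build_extractor_request` is a hypothetical helper name.

```python
from urllib.request import Request

# Assumption: the host named above is served over HTTPS.
BASE_URL = "https://api.mixpeek.com"

def build_extractor_request(feature_extractor_id: str, api_key: str) -> Request:
    """Build (but do not send) the GET request for one extractor's details."""
    url = f"{BASE_URL}/v1/collections/features/extractors/{feature_extractor_id}"
    # Assumption: bearer-token auth; check the API reference for the real scheme.
    headers = {"Authorization": f"Bearer {api_key}"}
    return Request(url, headers=headers, method="GET")

req = build_extractor_request("audio_sentiment_extractor_v1", "YOUR_KEY")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would perform the call; it is left out here so the sketch runs without network access.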

Pipeline Steps

Filter Dataset (if collection_id provided)
- Filter to specified collection
Apply Input Mappings
- Resolve audio/video field from source (e.g., payload.audio_url, payload.webcast_url)
Audio Extraction (conditional: if video input)
- FFmpeg extracts the audio track from the video container (e.g., MP4, MOV); output can be AAC, MP3, or FLAC
Voice Activity Detection + Segmentation
- split_method: time — fixed-length windows (default 30s)
- split_method: silence — split at natural speech pauses (VAD threshold configurable)
- split_method: speaker — one segment per speaker turn (requires run_diarization=true)
Speaker Diarization (conditional: if run_diarization=true)
- pyannote.audio 3.x pipeline separates speakers (CEO, CFO, Analyst_1, etc.)
- Assigns speaker_id and optionally maps to speaker_role via role manifest
Transcription (conditional: if run_transcription=true)
- Whisper large-v3-turbo speech-to-text with financial vocabulary prompt
- Per-segment timestamps aligned to diarization boundaries
FinBERT Text Sentiment (conditional: if run_finbert=true)
- ProsusAI/FinBERT financial-domain sentiment classifier
- Outputs sentiment_label (positive/negative/neutral), sentiment_score (-1 to +1), confidence
- Generates 768D FinBERT CLS embedding for semantic search
Prosodic Feature Extraction (conditional: if run_prosodics=true)
- LibROSA + Parselmouth extract 5 features per segment:
  - Pitch variability (F0 standard deviation, Hz) — hesitation and stress indicator
  - Speech rate (words per minute) — confidence and urgency signal
  - Vocal energy (RMS dB) — assertiveness and emotional weight
  - Pause ratio (fraction of silence) — cognitive load and evasiveness marker
  - Vocal tremor (jitter + shimmer) — anxiety and deception indicator
- Normalized into a 128D prosodic embedding for similarity search
Audio-Text Alignment Score (conditional: if run_alignment=true)
- Cosine similarity between FinBERT sentiment direction and prosodic valence
- Low alignment = voice contradicts words (high-value deception/stress signal)
LLM Structured Enrichment (conditional: if run_llm_enrichment=true or response_shape set)
- Gemini/GPT-4o processes transcription with custom prompt
- Extracts structured signals: guidance confidence, topic classification, hedging language
Output
- Per-segment documents with both embedding types, raw prosodic features, sentiment scores, speaker metadata, and computed alpha signals

When to Use

Use Case	Description
Earnings call analysis	Process quarterly earnings calls for CEO/CFO vocal stress relative to guidance
Forward guidance scoring	Score management’s confidence level on forward-looking statements
Analyst day processing	Build speaker-attributed sentiment timelines across presentations
Fed press conference monitoring	Track FOMC chair vocal markers around policy language
Alpha signal generation	Combine text + prosodic features for uncrowded quantitative factors
Sentiment divergence detection	Flag calls where management tone contradicts word sentiment
Comparative speaker analysis	Track individual executive vocal patterns across multiple quarters
Podcast / financial media	Extract sentiment from analyst interviews, TV appearances, podcasts

When NOT to Use

Scenario	Recommended Alternative
Text-only documents (PDFs, filings)	`document_extractor` or `text_extractor`
Very short clips (< 10 seconds)	Processing overhead disproportionate
Non-speech audio (music, noise)	`multimodal_extractor`
Real-time live streaming	Specialized streaming extractors
Non-English earnings calls	Set `transcription_language` explicitly; FinBERT is English-only

Supported Input Types

Input	Type	Description	Processing
`audio`	string	URL or S3 path to MP3, WAV, FLAC, M4A	Direct processing
`video`	string	URL or S3 path to MP4, MOV, MKV	Audio extracted via FFmpeg
`url`	string	Direct URL to webcast / podcast stream	Downloaded and processed

Supported audio formats: MP3, WAV, FLAC, M4A, OGG, OPUS

Supported video formats (audio extracted): MP4, MOV, MKV, AVI, WebM

Input Schema

Provide one of the following inputs:

{
  "audio": "s3://bucket/earnings/AAPL_Q4_2024.mp3"
}

{
  "video": "s3://bucket/investor-day/msft-2024-ceo-keynote.mp4"
}

{
  "url": "https://edge.media-server.com/mmc/p/xyz/earnings-call.mp3"
}

Field	Type	Description
`audio`	string	URL/S3 path to audio file. Recommended: < 3 hours per file
`video`	string	URL/S3 path to video file; audio track extracted automatically
`url`	string	Direct stream URL; downloaded before processing

Output Schema

Each audio segment produces one document:

Field	Type	Description
`start_time`	number	Segment start time in seconds
`end_time`	number	Segment end time in seconds
`speaker_id`	string	Diarized speaker label (e.g., `SPEAKER_00`)
`speaker_role`	string	Mapped role if manifest provided (e.g., `CEO`, `CFO`, `Analyst`)
`transcription`	string	Whisper transcription of segment
`sentiment_label`	string	`positive`, `negative`, or `neutral`
`sentiment_score`	number	FinBERT sentiment score: -1.0 (negative) to +1.0 (positive)
`sentiment_confidence`	number	FinBERT confidence 0.0–1.0
`pitch_variability_hz`	number	F0 standard deviation (Hz) — stress/hesitation
`speech_rate_wpm`	number	Words per minute — confidence/urgency
`vocal_energy_db`	number	RMS energy in dB — assertiveness
`pause_ratio`	number	Fraction of silence 0.0–1.0 — cognitive load
`vocal_tremor`	number	Jitter + shimmer composite 0.0–1.0 — anxiety
`audio_text_alignment`	number	Prosody-sentiment cosine alignment -1.0 to +1.0
`stress_index`	number	Composite vocal stress score 0.0–1.0
`audio_sentiment_extractor_v1_text_embedding`	float[768]	FinBERT CLS embedding
`audio_sentiment_extractor_v1_prosodic_embedding`	float[128]	Normalized prosodic feature vector

{
  "start_time": 245.0,
  "end_time": 275.0,
  "speaker_id": "SPEAKER_00",
  "speaker_role": "CEO",
  "transcription": "We're very confident in our Q1 guidance range of twelve to fourteen dollars per share...",
  "sentiment_label": "positive",
  "sentiment_score": 0.71,
  "sentiment_confidence": 0.89,
  "pitch_variability_hz": 38.2,
  "speech_rate_wpm": 142,
  "vocal_energy_db": -18.4,
  "pause_ratio": 0.21,
  "vocal_tremor": 0.14,
  "audio_text_alignment": 0.63,
  "stress_index": 0.31,
  "audio_sentiment_extractor_v1_text_embedding": [0.041, -0.018, ...],
  "audio_sentiment_extractor_v1_prosodic_embedding": [0.72, 0.34, ...]
}
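The documented value ranges can be checked on the consumer side before a segment is used downstream. This is a minimal sketch: `check_segment` is a hypothetical helper, and it only covers the bounded fields listed in the table above.

```python
# Closed ranges taken from the output schema table.
RANGES = {
    "sentiment_score": (-1.0, 1.0),
    "sentiment_confidence": (0.0, 1.0),
    "pause_ratio": (0.0, 1.0),
    "vocal_tremor": (0.0, 1.0),
    "audio_text_alignment": (-1.0, 1.0),
    "stress_index": (0.0, 1.0),
}
LABELS = {"positive", "negative", "neutral"}

def check_segment(doc: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the doc looks sane."""
    problems = []
    if doc["start_time"] >= doc["end_time"]:
        problems.append("start_time must be before end_time")
    if "sentiment_label" in doc and doc["sentiment_label"] not in LABELS:
        problems.append(f"unexpected sentiment_label: {doc['sentiment_label']}")
    for field, (low, high) in RANGES.items():
        # Fields may be absent when the matching run_* flag is disabled.
        if field in doc and not low <= doc[field] <= high:
            problems.append(f"{field}={doc[field]} outside [{low}, {high}]")
    return problems

segment = {
    "start_time": 245.0, "end_time": 275.0, "sentiment_label": "positive",
    "sentiment_score": 0.71, "sentiment_confidence": 0.89, "pause_ratio": 0.21,
    "vocal_tremor": 0.14, "audio_text_alignment": 0.63, "stress_index": 0.31,
}
print(check_segment(segment))
```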

Parameters

Audio Segmentation

Parameter	Type	Default	Description
`split_method`	string	`"silence"`	Segmentation strategy: `time`, `silence`, or `speaker`

time

Fixed-interval splitting: equal-duration segments regardless of speech content.

Parameter	Type	Default	Description
`time_split_interval`	integer	`30`	Segment duration in seconds

Best for: Batch processing, predictable segment counts, initial exploration

{
  "split_method": "time",
  "time_split_interval": 30
}
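Fixed-interval splitting amounts to cutting the recording into consecutive windows of `time_split_interval` seconds. The sketch below assumes the final, shorter remainder is kept as its own segment; the page does not say how the extractor treats it, and `time_windows` is a hypothetical name.

```python
def time_windows(duration_s: float, interval_s: int = 30) -> list[tuple[float, float]]:
    """Cut [0, duration_s) into consecutive windows of interval_s seconds."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    duration_s = float(duration_s)
    windows = []
    start = 0.0
    while start < duration_s:
        # Assumption: the last window is truncated rather than dropped or padded.
        end = min(start + interval_s, duration_s)
        windows.append((start, end))
        start = end
    return windows

print(time_windows(75))
```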

silence

Voice activity detection: splits at natural speech pauses. Preserves complete sentences and thoughts.

Parameter	Type	Default	Description
`silence_db_threshold`	integer	`-40`	dB level below which audio is silence
`min_silence_duration_ms`	integer	`500`	Minimum silence length to trigger split

Best for: Earnings calls, presentations, interviews — preserves semantic units

{
  "split_method": "silence",
  "silence_db_threshold": -40,
  "min_silence_duration_ms": 500
}
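To show how the two silence parameters interact, here is a minimal sketch that works on a hypothetical list of per-frame loudness values in dB. The frame length, the input representation, and the name `split_on_silence` are assumptions for illustration, not the extractor's actual VAD.

```python
def split_on_silence(frame_db, frame_ms=100,
                     silence_db_threshold=-40, min_silence_duration_ms=500):
    """Return (start_ms, end_ms) speech segments separated by long-enough silences."""
    min_frames = -(-min_silence_duration_ms // frame_ms)  # ceiling division
    segments = []
    seg_start = None  # index of the first frame of the open segment
    silent_run = 0    # number of consecutive silent frames seen so far
    for i, level in enumerate(frame_db):
        if level < silence_db_threshold:
            silent_run += 1
            if seg_start is not None and silent_run == min_frames:
                # The pause is long enough: close the segment where the pause began.
                seg_end = i - min_frames + 1
                segments.append((seg_start * frame_ms, seg_end * frame_ms))
                seg_start = None
        else:
            if seg_start is None:
                seg_start = i
            silent_run = 0
    if seg_start is not None:
        # Close the trailing segment, trimming any short silence at the very end.
        segments.append((seg_start * frame_ms, (len(frame_db) - silent_run) * frame_ms))
    return segments

frames = [-20] * 10 + [-60] * 6 + [-25] * 5 + [-55] * 2 + [-22] * 3
print(split_on_silence(frames))
```

In this toy input the 600 ms pause triggers a split, while the 200 ms pause stays inside the second segment because it is shorter than `min_silence_duration_ms`.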

speaker

Speaker-turn segmentation: one segment per speaker turn. Requires diarization. Ideal for Q&A analysis.

Characteristics:

- Variable segment lengths (1s–5 min typical for earnings Q&A)
- Each segment is a single speaker’s continuous turn
- Enables per-speaker sentiment timelines

Best for: Q&A sections, panel discussions, analyst questioning

{
  "split_method": "speaker",
  "run_diarization": true
}
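One way to picture speaker-turn segmentation is as a merge of consecutive diarization chunks that share a speaker label. The sketch assumes diarization yields `(start, end, speaker_id)` tuples; that input shape and the name `speaker_turns` are hypothetical.

```python
def speaker_turns(chunks):
    """Merge consecutive chunks from the same speaker into one turn each."""
    turns = []
    for start, end, speaker in sorted(chunks):
        if turns and turns[-1]["speaker_id"] == speaker:
            # Same speaker keeps talking: extend the open turn.
            turns[-1]["end_time"] = max(turns[-1]["end_time"], end)
        else:
            turns.append({"start_time": start, "end_time": end, "speaker_id": speaker})
    return turns

chunks = [
    (0.0, 4.0, "SPEAKER_00"),
    (4.0, 9.5, "SPEAKER_00"),
    (9.5, 12.0, "SPEAKER_02"),
    (12.0, 30.0, "SPEAKER_00"),
]
for turn in speaker_turns(chunks):
    print(turn)
```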

Feature Extraction Parameters

Parameter	Type	Default	Description
`run_transcription`	boolean	`true`	Run Whisper transcription
`transcription_language`	string	`"en"`	Language code for transcription
`transcription_prompt`	string	`null`	Domain vocabulary hint (e.g., ticker symbols, product names)
`run_finbert`	boolean	`true`	Run FinBERT financial sentiment classification
`run_prosodics`	boolean	`true`	Extract 5 prosodic features + 128D embedding
`run_alignment`	boolean	`true`	Compute audio-text alignment score
`run_diarization`	boolean	`false`	Run speaker diarization (adds ~20% processing time)
`num_speakers`	integer	`null`	Hint for diarization (null = auto-detect)
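A small client-side check can catch the one cross-parameter constraint this page documents (speaker splitting requires diarization) before a config is submitted. The defaults below are copied from the tables above; `check_parameters` is a hypothetical helper, and the platform's own validation may be stricter.

```python
# Defaults as listed in the parameter tables on this page.
DEFAULTS = {
    "split_method": "silence",
    "run_transcription": True,
    "transcription_language": "en",
    "run_finbert": True,
    "run_prosodics": True,
    "run_alignment": True,
    "run_diarization": False,
    "num_speakers": None,
}
SPLIT_METHODS = {"time", "silence", "speaker"}

def check_parameters(params: dict) -> list[str]:
    """Return problems found after overlaying params on the documented defaults."""
    merged = {**DEFAULTS, **params}
    problems = []
    if merged["split_method"] not in SPLIT_METHODS:
        problems.append(f"unknown split_method: {merged['split_method']}")
    if merged["split_method"] == "speaker" and not merged["run_diarization"]:
        problems.append("split_method 'speaker' requires run_diarization=true")
    return problems

print(check_parameters({"split_method": "speaker"}))
```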

Speaker Role Manifest

Map diarized speaker IDs to roles (CEO, CFO, Analyst, etc.) using a manifest:

{
  "speaker_role_manifest": {
    "SPEAKER_00": "CEO",
    "SPEAKER_01": "CFO",
    "SPEAKER_02": "Analyst"
  }
}

When speaker_role_manifest is not provided, no role mapping is applied and speakers are identified only by their diarized labels (SPEAKER_00, SPEAKER_01, etc.).
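The mapping can be sketched as a dictionary lookup with a fallback to the diarized label. The function name `apply_role_manifest` is hypothetical, and falling back to the speaker ID for unmapped speakers is an assumption drawn from the sentence above.

```python
def apply_role_manifest(segments, manifest=None):
    """Attach speaker_role to each segment, falling back to the diarized label."""
    manifest = manifest or {}
    return [
        {**seg, "speaker_role": manifest.get(seg["speaker_id"], seg["speaker_id"])}
        for seg in segments
    ]

segments = [{"speaker_id": "SPEAKER_00"}, {"speaker_id": "SPEAKER_03"}]
manifest = {"SPEAKER_00": "CEO", "SPEAKER_01": "CFO", "SPEAKER_02": "Analyst"}
print(apply_role_manifest(segments, manifest))
```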

LLM Structured Extraction

Parameter	Type	Default	Description
`run_llm_enrichment`	boolean	`false`	Run LLM over transcription segments
`response_shape`	string \| object	`null`	Custom structured output schema

Natural Language Mode:

{
  "response_shape": "Extract: forward guidance confidence level (1-5), number of hedging phrases, primary topic discussed, and any mentioned risk factors"
}

JSON Schema Mode for Quant Signals:

{
  "response_shape": {
    "type": "object",
    "properties": {
      "guidance_confidence": { "type": "integer", "minimum": 1, "maximum": 5 },
      "hedging_phrase_count": { "type": "integer" },
      "topic": { "type": "string", "enum": ["revenue", "margins", "guidance", "macro", "capex", "other"] },
      "risk_factors": { "type": "array", "items": { "type": "string" } },
      "quantitative_claims": { "type": "array", "items": { "type": "string" } }
    }
  }
}
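LLM output does not always honor a requested schema, so a consumer may want to re-check responses. The sketch below is a deliberately small checker for the keywords used above (`type`, `minimum`, `maximum`, `enum`); it is not a full JSON Schema validator, and `check_against_shape` is a hypothetical name.

```python
TYPE_MAP = {"integer": int, "string": str, "boolean": bool, "array": list, "object": dict}

def check_against_shape(response: dict, shape: dict) -> list[str]:
    """Check top-level properties of response against a response_shape object."""
    errors = []
    for name, rule in shape.get("properties", {}).items():
        if name not in response:
            continue  # the schema above declares no "required" list
        value = response[name]
        expected = TYPE_MAP[rule["type"]]
        # bool is a subclass of int in Python, so reject it for "integer".
        is_bool_as_int = expected is int and isinstance(value, bool)
        if not isinstance(value, expected) or is_bool_as_int:
            errors.append(f"{name}: expected {rule['type']}")
            continue
        if "minimum" in rule and value < rule["minimum"]:
            errors.append(f"{name}: below minimum {rule['minimum']}")
        if "maximum" in rule and value > rule["maximum"]:
            errors.append(f"{name}: above maximum {rule['maximum']}")
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"{name}: not one of {rule['enum']}")
    return errors

shape = {
    "type": "object",
    "properties": {
        "guidance_confidence": {"type": "integer", "minimum": 1, "maximum": 5},
        "topic": {"type": "string", "enum": ["revenue", "margins", "guidance", "macro", "capex", "other"]},
    },
}
print(check_against_shape({"guidance_confidence": 7, "topic": "guidance"}, shape))
```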

Configuration Examples

{
  "feature_extractor": {
    "feature_extractor_name": "audio_sentiment_extractor",
    "version": "v1",
    "input_mappings": {
      "audio": "audio_url"
    },
    "field_passthrough": [
      { "source_path": "metadata.ticker" },
      { "source_path": "metadata.fiscal_quarter" },
      { "source_path": "metadata.call_section" }
    ],
    "parameters": {
      "split_method": "speaker",
      "run_transcription": true,
      "transcription_language": "en",
      "run_finbert": true,
      "run_prosodics": true,
      "run_alignment": true,
      "run_diarization": true,
      "num_speakers": 6,
      "speaker_role_manifest": {
        "SPEAKER_00": "CEO",
        "SPEAKER_01": "CFO",
        "SPEAKER_02": "IR"
      }
    }
  }
}

{
  "feature_extractor": {
    "feature_extractor_name": "audio_sentiment_extractor",
    "version": "v1",
    "input_mappings": {
      "audio": "prepared_remarks_url"
    },
    "field_passthrough": [
      { "source_path": "metadata.ticker" },
      { "source_path": "metadata.event_date" }
    ],
    "parameters": {
      "split_method": "silence",
      "silence_db_threshold": -38,
      "run_transcription": true,
      "transcription_prompt": "earnings call forward guidance revenue EPS margin",
      "run_finbert": true,
      "run_prosodics": true,
      "run_alignment": true,
      "run_llm_enrichment": true,
      "response_shape": {
        "type": "object",
        "properties": {
          "guidance_confidence": { "type": "integer", "minimum": 1, "maximum": 5 },
          "hedging_phrase_count": { "type": "integer" },
          "topic": { "type": "string" },
          "forward_looking": { "type": "boolean" }
        }
      }
    }
  }
}

{
  "feature_extractor": {
    "feature_extractor_name": "audio_sentiment_extractor",
    "version": "v1",
    "input_mappings": {
      "video": "press_conf_url"
    },
    "field_passthrough": [
      { "source_path": "metadata.fomc_date" },
      { "source_path": "metadata.rate_decision" }
    ],
    "parameters": {
      "split_method": "silence",
      "run_transcription": true,
      "run_finbert": true,
      "run_prosodics": true,
      "run_alignment": true,
      "run_diarization": false
    }
  }
}

{
  "feature_extractor": {
    "feature_extractor_name": "audio_sentiment_extractor",
    "version": "v1",
    "input_mappings": {
      "video": "analyst_day_recording"
    },
    "parameters": {
      "split_method": "speaker",
      "run_transcription": true,
      "run_finbert": true,
      "run_prosodics": true,
      "run_alignment": true,
      "run_diarization": true
    }
  }
}

{
  "feature_extractor": {
    "feature_extractor_name": "audio_sentiment_extractor",
    "version": "v1",
    "input_mappings": {
      "audio": "audio_url"
    },
    "parameters": {
      "split_method": "time",
      "time_split_interval": 60,
      "run_transcription": true,
      "run_finbert": true,
      "run_prosodics": false,
      "run_alignment": false,
      "run_diarization": false
    }
  }
}

Performance & Costs

Processing Speed

Configuration	Processing time (multiple of audio duration)	Example
Transcription only	~0.5x realtime	60-min call → ~30 min
+ FinBERT	~0.6x realtime	60-min call → ~36 min
+ Prosodics	~0.8x realtime	60-min call → ~48 min
+ Diarization	~1.5x realtime	60-min call → ~90 min
Full pipeline	~1.5x realtime	60-min call → ~90 min
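Reading the speed figures as multiples of the audio duration (which is how the examples in the table work out), a wall-clock estimate is a single multiplication. The configuration keys below are hypothetical labels for the table rows.

```python
# Multipliers from the table above: processing time = factor * audio duration.
SPEED_FACTOR = {
    "transcription_only": 0.5,
    "with_finbert": 0.6,
    "with_prosodics": 0.8,
    "with_diarization": 1.5,
    "full_pipeline": 1.5,
}

def estimated_minutes(audio_minutes: float, configuration: str) -> float:
    """Estimate processing time in minutes for one recording."""
    return audio_minutes * SPEED_FACTOR[configuration]

for name in SPEED_FACTOR:
    print(f"{name}: ~{estimated_minutes(60, name):.0f} min for a 60-min call")
```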

Feature	Per-Segment Latency
Transcription (Whisper)	~150ms/sec of audio
FinBERT classification	~25ms
Prosodic extraction	~50ms
Speaker diarization	~200ms/sec of audio
LLM enrichment	~1.5s

Cost Estimates (per hour of audio)

Configuration	Cost
Minimal (transcription + FinBERT)	$0.08
Standard (+ prosodics + alignment)	$0.15
Full (+ diarization + LLM enrichment)	$0.25

Batch processing: Processing 1,000 S&P 500 earnings calls (avg 60 min) at full configuration ≈ $250
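The batch figure follows from files × hours per file × hourly rate. A short sketch of that arithmetic, using hypothetical keys for the three configurations in the table:

```python
# Cost per hour of audio, from the table above (USD).
COST_PER_AUDIO_HOUR = {"minimal": 0.08, "standard": 0.15, "full": 0.25}

def batch_cost(num_files: int, avg_minutes: float, configuration: str) -> float:
    """Estimated cost in USD for a batch of recordings."""
    audio_hours = num_files * avg_minutes / 60
    return audio_hours * COST_PER_AUDIO_HOUR[configuration]

print(batch_cost(1000, 60, "full"))  # 250.0
```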

Vector Indexes

Text Embedding (FinBERT)

Property	Value
Index name	`audio_sentiment_extractor_v1_text_embedding`
Dimensions	768
Type	Dense
Distance metric	Cosine
Inference model	`finbert_sentiment_v1`
Supported inputs	text (transcription segments)

Prosodic Embedding

Property	Value
Index name	`audio_sentiment_extractor_v1_prosodic_embedding`
Dimensions	128
Type	Dense
Distance metric	Cosine
Inference model	`prosodic_encoder_v1`
Supported inputs	audio segments

Alpha Signal Guide

This section describes five core signals (four prosodic features plus the audio-text alignment score) and their interpretation as quantitative signals.

1. Pitch Variability (F0 Standard Deviation)

What it measures: Standard deviation of the fundamental frequency (F0) in Hz across the segment.

Signal interpretation:

- High variability (> 50 Hz): Elevated emotional engagement; can indicate stress or enthusiasm
- Low variability (< 15 Hz): Monotone delivery; associated with rehearsed/scripted language or disengagement
- Baseline deviation: Compare against the speaker’s historical mean F0 std dev for true anomaly detection

Quant application: Track CEO pitch variability during forward guidance vs. historical questions. Anomalous drops on guidance segments may precede earnings misses.
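As a sketch of the measurement itself, the snippet below takes a hypothetical per-frame F0 track in Hz, with unvoiced frames marked as `None`, and returns the standard deviation over voiced frames. Using the population standard deviation is an assumption; the page does not say which estimator is used.

```python
from statistics import pstdev

def pitch_variability_hz(f0_track) -> float:
    """Standard deviation of F0 (Hz) over voiced frames only."""
    voiced = [f for f in f0_track if f is not None and f > 0]
    # Fewer than two voiced frames carry no variability information.
    return pstdev(voiced) if len(voiced) >= 2 else 0.0

def variability_band(variability_hz: float) -> str:
    """Apply the > 50 Hz and < 15 Hz thresholds listed above."""
    if variability_hz > 50:
        return "high"
    if variability_hz < 15:
        return "low"
    return "typical"

track = [110.0, None, 130.0, None]
value = pitch_variability_hz(track)
print(value, variability_band(value))
```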

2. Speech Rate (Words Per Minute)

What it measures: Words per minute derived from Whisper word-level timestamps.

Signal interpretation:

- High rate (> 180 wpm): Urgency, anxiety, or over-rehearsed scripted answers
- Low rate (< 100 wpm): Deliberate, careful language; common when discussing negative surprises
- Rate deceleration mid-answer: Suggests real-time reasoning and less scripted delivery; a higher authenticity signal

Quant application: Significant speech rate slowdown during Q&A relative to prepared remarks may signal management is processing unexpected analyst questions.
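The rate calculation can be sketched as word count divided by segment duration. Measuring over the full segment (rather than only the voiced portion) is an assumption, and `speech_rate_wpm` here is a hypothetical function that happens to share its name with the output field.

```python
def speech_rate_wpm(word_count: int, start_time: float, end_time: float) -> float:
    """Words per minute over a segment, with times given in seconds."""
    duration_s = end_time - start_time
    if duration_s <= 0:
        raise ValueError("end_time must be after start_time")
    return word_count * 60 / duration_s

# 71 words in a 30-second segment
print(speech_rate_wpm(71, 245.0, 275.0))  # 142.0
```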

3. Vocal Energy (RMS dB)

What it measures: Root mean square energy of the audio signal in decibels.

Signal interpretation:

- High energy: Assertiveness and confidence; common in positive guidance delivery
- Energy drop mid-sentence: Hedging or trailing off; linguistic uncertainty
- Segment-relative drop: Cross-call energy tracking shows conviction level

Quant application: Energy drop on forward EPS guidance sentences (identifiable via LLM topic tagging) is a stress-linked signal distinct from text sentiment.
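For reference, RMS energy in decibels can be computed as below. The sketch assumes samples normalized to the range -1.0 to 1.0, so the result is relative to full scale (dBFS); the page does not state the reference level the extractor uses.

```python
import math

def vocal_energy_db(samples) -> float:
    """RMS level of normalized samples, in dB relative to full scale."""
    if not samples:
        raise ValueError("samples must not be empty")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    # Digital silence has no finite dB value.
    return 20 * math.log10(rms) if rms > 0 else float("-inf")

print(round(vocal_energy_db([0.5, -0.5, 0.5, -0.5]), 2))
```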

4. Pause Ratio (Silence Fraction)

What it measures: Fraction of segment duration classified as silence (VAD threshold -40 dB).

Signal interpretation:

- High pause ratio (> 0.35): Cognitive load; speaker is reasoning in real time rather than reciting
- Low pause ratio (< 0.10): Scripted, rehearsed delivery with less information content
- Q&A vs. prepared remarks delta: A large increase in pause ratio during Q&A is a well-documented stress marker

Quant application: Pause ratio on Q&A segments answering analyst questions about inventory / margin / guidance has shown predictive value for negative guidance revisions in academic literature.
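With equal-length frames, the pause ratio reduces to the share of frames below the silence threshold. The per-frame dB input is a hypothetical representation chosen for the sketch.

```python
def pause_ratio(frame_db, silence_db_threshold: float = -40.0) -> float:
    """Fraction of equal-length frames whose level is below the threshold."""
    if not frame_db:
        raise ValueError("frame_db must not be empty")
    silent = sum(1 for level in frame_db if level < silence_db_threshold)
    return silent / len(frame_db)

print(pause_ratio([-20, -50, -60, -30, -45]))  # 0.6
```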

5. Audio-Text Alignment Score

What it measures: Cosine similarity between the FinBERT sentiment direction (text) and prosodic valence (audio). Range: -1.0 to +1.0.

Signal interpretation:

- High alignment (> 0.6): Voice and words agree; higher conviction, less masking
- Low alignment (0.1–0.4): Moderate divergence; common in hedged language
- Negative alignment (< 0): Voice contradicts words; the strongest stress/deception marker, e.g., “We feel very good about guidance” delivered with high pitch variability, low energy, and high pauses

Quant application: This is the most novel of the five features. Text NLP cannot capture it. Segments with positive text sentiment but negative alignment are the primary alpha generation target.
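The page does not define how the sentiment direction and prosodic valence vectors are built, so the sketch below shows only the cosine step itself, on hypothetical two-dimensional vectors. Returning 0.0 for a zero-length vector is a choice made here, not documented behavior.

```python
import math

def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Assumption: treat an all-zero vector as "no alignment information".
    return dot / norm if norm else 0.0

text_direction = [1.0, 0.0]     # hypothetical: positive wording
voice_direction = [-1.0, 0.0]   # hypothetical: negative vocal valence
print(cosine_similarity(text_direction, voice_direction))  # -1.0
```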

Composite Stress Index

The stress_index field (0.0–1.0) is a normalized composite of all five prosodic features:

stress_index = normalize(
    25 * pitch_variability_z +
    20 * (-speech_rate_z) +       # inverted: lower rate = higher stress
    20 * (1 - vocal_energy_z) +
    20 * pause_ratio_z +
    15 * vocal_tremor_z
)

Where _z values are Z-scores computed against the speaker’s rolling 4-quarter baseline when speaker_id is consistent across calls.
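The composite can be sketched end to end under a few loudly stated assumptions: the speech-rate term is negated, taking the inline "inverted" comment at face value; `normalize` is undefined on this page, so a logistic squash of the weighted sum is used as a stand-in; and Z-scores use the population standard deviation of a hypothetical baseline list.

```python
import math
from statistics import mean, pstdev

# Weights from the formula above (they sum to 100).
WEIGHTS = {"pitch_variability": 25, "speech_rate": 20, "vocal_energy": 20,
           "pause_ratio": 20, "vocal_tremor": 15}

def z_score(value: float, baseline: list) -> float:
    """Z-score of value against a speaker's baseline observations."""
    spread = pstdev(baseline)
    return 0.0 if spread == 0 else (value - mean(baseline)) / spread

def stress_index(z: dict) -> float:
    """Weighted composite of feature Z-scores squashed into (0, 1)."""
    raw = (WEIGHTS["pitch_variability"] * z["pitch_variability"]
           + WEIGHTS["speech_rate"] * -z["speech_rate"]         # lower rate = higher stress
           + WEIGHTS["vocal_energy"] * (1 - z["vocal_energy"])  # term as written above
           + WEIGHTS["pause_ratio"] * z["pause_ratio"]
           + WEIGHTS["vocal_tremor"] * z["vocal_tremor"])
    # Assumption: logistic squash stands in for the undocumented normalize().
    return 1 / (1 + math.exp(-raw / 100))

calm = dict.fromkeys(WEIGHTS, 0.0)
tense = dict(calm, pause_ratio=2.0, speech_rate=-1.5)
print(round(stress_index(calm), 3), round(stress_index(tense), 3))
```

Any monotonic mapping onto 0.0–1.0 would serve the same illustrative purpose; the logistic is chosen only because it needs no knowledge of the raw score's range.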

Recommended Factor Construction

# Example: CEO guidance confidence factor
import mixpeek

client = mixpeek.Client(api_key="YOUR_KEY")

# Query for CEO guidance segments with sentiment divergence
results = client.retrievers.run(
    retriever_id="earnings-sentiment-retriever",
    inputs={
        "query": "revenue guidance outlook fiscal year",
        "filters": {
            "speaker_role": "CEO",
            "forward_looking": True,
            "audio_text_alignment": {"$lt": 0.3},   # divergence signal
            "stress_index": {"$gt": 0.6}             # high stress
        }
    }
)

Limitations

- Speaker diarization accuracy: pyannote reaches roughly 90% accuracy (a diarization error rate, DER, of about 10%) on clean 2-speaker recordings; accuracy degrades with > 8 speakers or poor audio quality
- Non-English: Whisper transcription supports 99 languages, but FinBERT is English-only; for non-English calls, disable run_finbert and use multilingual sentiment models
- Audio quality: Prosodic features require 16kHz+ audio; compressed phone audio (8kHz) reduces pitch extraction accuracy by ~30%
- Baseline dependency: stress_index Z-score normalization requires at least 4 prior segments from the same speaker_id to be meaningful
- Segment length: Prosodic features are unreliable for segments < 5 seconds; short interjections are best excluded
- LLM enrichment latency: run_llm_enrichment=true adds 1–2s per segment; disable for batch throughput
