Earnings Call Signal Extraction
Go beyond transcripts. Extract vocal stress, speech pace, sentiment shifts, and forward guidance signals from earnings call audio and text to surface alpha-generating features across 500+ tickers.
For quantitative hedge funds, systematic trading desks, and fundamental research teams analyzing 500+ earnings events per quarter
Most earnings call analysis relies on text transcripts alone, missing the vocal and behavioral signals that reveal executive confidence, evasiveness, and stress. Text-only NLP features such as forward guidance specificity show in-sample promise (IC = +0.12) but fail to generalize out-of-sample; the missing audio modality is the key gap.
Before & After Mixpeek
Before
Data source: SEC EDGAR 8-K text only (Exhibit 99.1 press releases)
Audio features: Unavailable — proxied with text-only heuristics
Cross-modal analysis: Text-only proxy (FinBERT sentiment vs. confidence phrases)
Scale: 50 tickers, 588 events, local Python scripts

After
Data source: 8-K filings + earnings call audio + video webcasts
Audio features: Speech pace, pause duration, filler words, vocal energy, pitch
Cross-modal analysis: True audio-text divergence via 1408D shared embeddings
Scale: 500+ tickers, ~6,000 events/year via Ray distributed processing
Ticker universe: 10x coverage
Feature modalities: 3 modalities
Cross-modal divergence: Native audio
Processing infrastructure: Real-time
Why Mixpeek
True cross-modal divergence measurement via shared 1408D multimodal embeddings, not text-only proxies. Pluggable feature extractors process audio streams natively through Ray-distributed pipelines. Batch processing handles 6,000+ earnings events per year across the full S&P 500.
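For intuition, the distributed batch layer maps onto plain Ray. A minimal sketch, assuming a hypothetical extract_call_features task and illustrative S3 paths (this is not Mixpeek's actual pipeline API):

```python
import ray

ray.init()  # start a local Ray runtime or connect to an existing cluster

@ray.remote
def extract_call_features(audio_uri: str) -> dict:
    # Placeholder for the per-call work: transcription, vocal feature
    # extraction, and embedding. Returns one feature row per event.
    return {"uri": audio_uri, "status": "processed"}

# ~6,000 events/year fan out across the cluster, one task per recording.
audio_uris = [f"s3://earnings-calls/{i}.mp3" for i in range(6000)]  # hypothetical paths
futures = [extract_call_features.remote(uri) for uri in audio_uris]
rows = ray.get(futures)
```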
Overview
Earnings calls contain far more information than the words spoken. Executive vocal patterns — speech pace changes between prepared remarks and Q&A, pause duration before difficult questions, filler word frequency, pitch variation under pressure — carry predictive signals that text transcripts alone cannot capture. A proof-of-concept backtest across 50 tickers and 588 events demonstrated that text features like forward guidance specificity correlate with post-earnings returns (IC = +0.12, Sharpe = 1.43 in-sample), but the signal does not generalize out-of-sample with text alone. The key gap: five defined audio features were never tested because earnings call audio data was unavailable. Mixpeek closes this gap by processing earnings call audio natively, enabling true cross-modal analysis where vocal stress diverging from optimistic text becomes a measurable, tradeable signal.
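The divergence signal itself reduces to a distance in the shared embedding space. A minimal numpy sketch, assuming audio_vec and text_vec are 1408-dimensional embeddings of the same call segment (how those embeddings are produced is left abstract here):

```python
import numpy as np

def cross_modal_divergence(audio_vec: np.ndarray, text_vec: np.ndarray) -> float:
    """One minus cosine similarity between audio and text embeddings.

    Near 0: vocal delivery matches the textual content.
    Larger values: delivery diverges from the words, e.g. vocal
    stress during optimistic prepared remarks.
    """
    cos = np.dot(audio_vec, text_vec) / (
        np.linalg.norm(audio_vec) * np.linalg.norm(text_vec)
    )
    return 1.0 - float(cos)
```

Segments where this value spikes, especially in the Q&A portion of a call, are the candidates for the tradeable signal described above.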
Challenges This Solves
Text-Only Signal Ceiling
NLP features extracted from SEC EDGAR 8-K press releases show in-sample predictive power but fail walk-forward validation (IC = -0.022 out-of-sample), indicating text alone is insufficient (see the IC sketch after this section)
Impact: Research teams invest months building text-only models that overfit to small datasets and don't generalize to production
Missing Audio Modality
Five critical audio features (speech pace, pause duration, filler words, vocal energy, pitch variation) are defined in research but untestable without audio processing infrastructure
Impact: The highest-value signals — cross-modal divergence between what executives say and how they say it — remain inaccessible
Scale Limitations
Local Python scripts with FinBERT and MiniLM models process 50 tickers in hours. Scaling to the full S&P 500 (6,000+ events/year) requires distributed infrastructure
Impact: Small sample sizes (588 events over 12 quarters) produce unstable feature estimates and unreliable walk-forward tests
Cross-Modal Proxy Problem
Audio-text sentiment divergence is approximated using text-only proxies (FinBERT sentiment minus confidence phrase density), missing true vocal-textual divergence
Impact: Proxy features show weak or negative IC, masking what may be a strong signal when measured with actual audio data
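The IC figures quoted above are rank correlations between a feature and post-earnings returns, and the walk-forward check is the same correlation computed only on quarters outside the training window. A hedged sketch with scipy (array names and the 8-quarter split are illustrative):

```python
import numpy as np
from scipy.stats import spearmanr

def information_coefficient(feature: np.ndarray, fwd_returns: np.ndarray) -> float:
    # IC = Spearman rank correlation between the feature value at the
    # earnings event and the post-earnings return.
    ic, _ = spearmanr(feature, fwd_returns)
    return float(ic)

def walk_forward_ic(feature, fwd_returns, quarters, n_train_quarters=8):
    # Score only events in quarters strictly after the training window,
    # so the reported IC is out-of-sample by construction.
    unique_q = sorted(set(quarters))
    oos = np.isin(quarters, unique_q[n_train_quarters:])
    return information_coefficient(feature[oos], fwd_returns[oos])
```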
Recipe Composition
This use case is composed of the following recipes, connected as a pipeline.
Feature Extractors Used
Audio Transcription: Transcribe audio content to text
Emotion Detection: Detect emotions in audio content
Speech to Text: Convert speech content to text with timestamps and confidence scores
Text Embedding: Extract semantic embeddings from documents, transcripts, and text content
Audio Embedding: Extract semantic embeddings from audio content for similarity search
Speaker Diarization: Identify and separate different speakers in audio content
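Before moving to managed extractors, the five audio features can be prototyped locally. A rough librosa sketch (energy threshold and pitch bounds are arbitrary illustration; filler-word counting needs the transcript, which is why transcription sits upstream):

```python
import librosa
import numpy as np

def vocal_features(path: str, transcript_word_count: int) -> dict:
    y, sr = librosa.load(path, sr=16000)
    duration_min = len(y) / sr / 60

    # Vocal energy: short-time RMS over the whole clip
    rms = librosa.feature.rms(y=y)[0]

    # Pause proxy: fraction of frames well below the mean energy
    pause_ratio = float(np.mean(rms < 0.1 * rms.mean()))

    # Pitch variation: spread of voiced f0 estimates
    f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=65, fmax=300, sr=sr)

    return {
        "pace_wpm": transcript_word_count / duration_min,  # speech pace
        "energy_mean": float(rms.mean()),
        "pause_ratio": pause_ratio,
        "pitch_std": float(np.nanstd(f0)),
    }
```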
Expected Outcomes
Feature modality coverage: Text + audio + video (vs. text-only)
Ticker universe scale: 500+ tickers (10x increase)
Cross-modal divergence: True audio-text measurement (not proxy)
Processing throughput: 6,000+ events/year via distributed pipeline
Build Earnings Call Audio Intelligence
Deploy audio and text feature extraction for earnings calls. Process call recordings, extract vocal stress signals, measure cross-modal divergence, and query signals through structured retrieval.
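Downstream, "query signals through structured retrieval" can be prototyped as a filter over the extracted feature table. A pandas sketch with an assumed schema (column names, path, and thresholds are illustrative, not Mixpeek's):

```python
import pandas as pd

# Assumed per-event output of the extraction pipeline
signals = pd.read_parquet("earnings_signals.parquet")  # hypothetical path

# Screen: optimistic text delivered under vocal stress, i.e. high
# cross-modal divergence alongside positive text sentiment.
flagged = signals[
    (signals["cross_modal_divergence"] > 0.6)
    & (signals["text_sentiment"] > 0.2)
].sort_values("cross_modal_divergence", ascending=False)

print(flagged[["ticker", "call_date", "cross_modal_divergence"]].head())
```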
Ready to Implement This Use Case?
Our team can help you get started with Earnings Call Signal Extraction in your organization.
