
# Audio Sentiment Extractor

> Vocal intelligence for financial earnings calls — FinBERT text sentiment, prosodic audio features, and speaker diarization for quantitative alternative data

<Card title="Browse the extractor catalog on GitHub" icon="github" href="https://github.com/mixpeek/mixpeek-extractors" horizontal>
  Runnable reference for every built-in Mixpeek extractor — inputs, parameters, output fields, embedding models, and copy-paste examples. Auto-generated from the live registry, so it always matches production.
</Card>

<Frame>
  <img src="https://mintcdn.com/mixpeek/TwtTrae3Fi3EFJ72/assets/extractors/audio-sentiment.svg?fit=max&auto=format&n=TwtTrae3Fi3EFJ72&q=85&s=0ecc5ff0b116775a0db182cbb648289e" alt="Audio sentiment extractor pipeline showing speaker diarization, parallel FinBERT and prosodic feature extraction, and output alpha signals" width="1200" height="520" data-path="assets/extractors/audio-sentiment.svg" />
</Frame>

The audio sentiment extractor processes **earnings call recordings, analyst day presentations, Fed press conferences, and financial podcasts** to produce two parallel signal streams: a FinBERT financial-domain text sentiment embedding (768D) computed on the Whisper transcription, and a 128D prosodic embedding built from five vocal features that capture stress, hesitation, and possible deception markers. Speaker diarization separates management from analysts for role-attributed sentiment.

This extractor addresses the gap identified in SEC 8-K forward guidance NLP studies: text-only sentiment models generate crowded alpha (in-sample IC \~+0.12 but poor walk-forward generalization). The five prosodic features (pitch variability, speech rate, vocal energy, pause ratio, and vocal tremor), together with the audio-text alignment score, are largely uncorrelated with published text signals and untested at scale, representing a structural alternative data opportunity.

<Note>
  View extractor details at [api.mixpeek.com/v1/collections/features/extractors/audio\_sentiment\_extractor\_v1](https://api.mixpeek.com/v1/collections/features/extractors/audio_sentiment_extractor_v1) or fetch programmatically with `GET /v1/collections/features/extractors/{feature_extractor_id}`.
</Note>
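
The endpoint in the note above can be called from any HTTP client. The following is a minimal sketch using only the Python standard library; the bearer-token header and the `MIXPEEK_API_KEY` environment variable are assumptions that this page does not confirm, so check the API reference for the actual authentication scheme.

```python
import json
import os
import urllib.request

BASE_URL = "https://api.mixpeek.com"
EXTRACTOR_ID = "audio_sentiment_extractor_v1"


def build_extractor_request(extractor_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the GET request for an extractor's details."""
    url = f"{BASE_URL}/v1/collections/features/extractors/{extractor_id}"
    # Assumption: bearer-token auth; the real scheme may differ.
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {api_key}"}, method="GET"
    )


def fetch_extractor(extractor_id: str = EXTRACTOR_ID) -> dict:
    """Send the request and decode the JSON response body."""
    request = build_extractor_request(extractor_id, os.environ["MIXPEEK_API_KEY"])
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```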

## Pipeline Steps

1. **Filter Dataset** (if `collection_id` provided)
   * Filter to specified collection
2. **Apply Input Mappings**
   * Resolve audio/video field from source (e.g., `payload.audio_url`, `payload.webcast_url`)
3. **Audio Extraction** (conditional: if video input)
   * FFmpeg extracts the audio track from the video container (e.g., MP4/MOV); supports AAC, MP3, FLAC output
4. **Voice Activity Detection + Segmentation**
   * `split_method: time` — fixed-length windows (default 30s)
   * `split_method: silence` — split at natural speech pauses (VAD threshold configurable)
   * `split_method: speaker` — one segment per speaker turn (requires `run_diarization=true`)
5. **Speaker Diarization** (conditional: if `run_diarization=true`)
   * pyannote.audio 3.x pipeline separates speakers (CEO, CFO, Analyst\_1, etc.)
   * Assigns `speaker_id` and optionally maps to `speaker_role` via role manifest
6. **Transcription** (conditional: if `run_transcription=true`)
   * Whisper large-v3-turbo speech-to-text with financial vocabulary prompt
   * Per-segment timestamps aligned to diarization boundaries
7. **FinBERT Text Sentiment** (conditional: if `run_finbert=true`)
   * ProsusAI/FinBERT financial-domain sentiment classifier
   * Outputs `sentiment_label` (positive/negative/neutral), `sentiment_score` (-1 to +1), `sentiment_confidence` (0 to 1)
   * Generates 768D FinBERT CLS embedding for semantic search
8. **Prosodic Feature Extraction** (conditional: if `run_prosodics=true`)
   * LibROSA + Parselmouth extract 5 features per segment:
     * **Pitch variability** (F0 standard deviation, Hz) — hesitation and stress indicator
     * **Speech rate** (words per minute) — confidence and urgency signal
     * **Vocal energy** (RMS dB) — assertiveness and emotional weight
     * **Pause ratio** (fraction of silence) — cognitive load and evasiveness marker
     * **Vocal tremor** (jitter + shimmer) — anxiety and deception indicator
   * Normalized into a 128D prosodic embedding for similarity search
9. **Audio-Text Alignment Score** (conditional: if `run_alignment=true`)
   * Cosine similarity between FinBERT sentiment direction and prosodic valence
   * Low alignment = voice contradicts words (high-value deception/stress signal)
10. **LLM Structured Enrichment** (conditional: if `run_llm_enrichment=true` or `response_shape` set)
    * Gemini/GPT-4o processes transcription with custom prompt
    * Extracts structured signals: guidance confidence, topic classification, hedging language
11. **Output**
    * Per-segment documents with both embedding types, raw prosodic features, sentiment scores, speaker metadata, and computed alpha signals
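
Two of the prosodic features in step 8 reduce to simple arithmetic, and the sketch below illustrates them. It is not the extractor's implementation: it assumes word timestamps arrive as hypothetical `(word, start_s, end_s)` tuples and that frame levels are already expressed in dB.

```python
def speech_rate_wpm(words: list[tuple[str, float, float]]) -> float:
    """Words per minute over the span covered by the word timestamps."""
    if not words:
        return 0.0
    span_s = words[-1][2] - words[0][1]  # last word end minus first word start
    return 60.0 * len(words) / span_s if span_s > 0 else 0.0


def pause_ratio(frame_db: list[float], silence_db_threshold: float = -40.0) -> float:
    """Fraction of frames whose level falls below the silence threshold."""
    if not frame_db:
        return 0.0
    silent = sum(1 for level in frame_db if level < silence_db_threshold)
    return silent / len(frame_db)


words = [("we", 0.0, 0.4), ("are", 0.5, 0.8), ("confident", 0.9, 1.5)]
print(speech_rate_wpm(words))                     # 3 words over 1.5 s
print(pause_ratio([-20.0, -45.0, -50.0, -18.0]))  # 2 of 4 frames are silent
```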

## When to Use

| Use Case                            | Description                                                                    |
| ----------------------------------- | ------------------------------------------------------------------------------ |
| **Earnings call analysis**          | Process quarterly earnings calls for CEO/CFO vocal stress relative to guidance |
| **Forward guidance scoring**        | Score management's confidence level on forward-looking statements              |
| **Analyst day processing**          | Build speaker-attributed sentiment timelines across presentations              |
| **Fed press conference monitoring** | Track FOMC chair vocal markers around policy language                          |
| **Alpha signal generation**         | Combine text + prosodic features for uncrowded quantitative factors            |
| **Sentiment divergence detection**  | Flag calls where management tone contradicts word sentiment                    |
| **Comparative speaker analysis**    | Track individual executive vocal patterns across multiple quarters             |
| **Podcast / financial media**       | Extract sentiment from analyst interviews, TV appearances, podcasts            |

## When NOT to Use

| Scenario                            | Recommended Alternative                                                                       |
| ----------------------------------- | --------------------------------------------------------------------------------------------- |
| Text-only documents (PDFs, filings) | `document_extractor` or `text_extractor`                                                      |
| Very short clips (\< 10 seconds)    | None; processing overhead is disproportionate                                                 |
| Non-speech audio (music, noise)     | `multimodal_extractor`                                                                        |
| Real-time live streaming            | Specialized streaming extractors                                                              |
| Non-English earnings calls          | Set `transcription_language` explicitly and disable `run_finbert` (FinBERT is English-only)   |

## Supported Input Types

| Input   | Type   | Description                                      | Processing                 |
| ------- | ------ | ------------------------------------------------ | -------------------------- |
| `audio` | string | URL or S3 path to MP3, WAV, FLAC, M4A, OGG, OPUS | Direct processing          |
| `video` | string | URL or S3 path to MP4, MOV, MKV, AVI, WebM       | Audio extracted via FFmpeg |
| `url`   | string | Direct URL to webcast / podcast stream           | Downloaded and processed   |

**Supported audio formats:** MP3, WAV, FLAC, M4A, OGG, OPUS

**Supported video formats (audio extracted):** MP4, MOV, MKV, AVI, WebM

## Input Schema

Provide **one** of the following inputs:

```json theme={null}
{
  "audio": "s3://bucket/earnings/AAPL_Q4_2024.mp3"
}
```

```json theme={null}
{
  "video": "s3://bucket/investor-day/msft-2024-ceo-keynote.mp4"
}
```

```json theme={null}
{
  "url": "https://edge.media-server.com/mmc/p/xyz/earnings-call.mp3"
}
```

| Field   | Type   | Description                                                    |
| ------- | ------ | -------------------------------------------------------------- |
| `audio` | string | URL/S3 path to audio file. Recommended: \< 3 hours per file    |
| `video` | string | URL/S3 path to video file; audio track extracted automatically |
| `url`   | string | Direct stream URL; downloaded before processing                |

## Output Schema

Each audio segment produces one document:

| Field                                             | Type        | Description                                                      |
| ------------------------------------------------- | ----------- | ---------------------------------------------------------------- |
| `start_time`                                      | number      | Segment start time in seconds                                    |
| `end_time`                                        | number      | Segment end time in seconds                                      |
| `speaker_id`                                      | string      | Diarized speaker label (e.g., `SPEAKER_00`)                      |
| `speaker_role`                                    | string      | Mapped role if manifest provided (e.g., `CEO`, `CFO`, `Analyst`) |
| `transcription`                                   | string      | Whisper transcription of segment                                 |
| `sentiment_label`                                 | string      | `positive`, `negative`, or `neutral`                             |
| `sentiment_score`                                 | number      | FinBERT sentiment score: -1.0 (negative) to +1.0 (positive)      |
| `sentiment_confidence`                            | number      | FinBERT confidence 0.0–1.0                                       |
| `pitch_variability_hz`                            | number      | F0 standard deviation (Hz) — stress/hesitation                   |
| `speech_rate_wpm`                                 | number      | Words per minute — confidence/urgency                            |
| `vocal_energy_db`                                 | number      | RMS energy in dB — assertiveness                                 |
| `pause_ratio`                                     | number      | Fraction of silence 0.0–1.0 — cognitive load                     |
| `vocal_tremor`                                    | number      | Jitter + shimmer composite 0.0–1.0 — anxiety                     |
| `audio_text_alignment`                            | number      | Prosody-sentiment cosine alignment -1.0 to +1.0                  |
| `stress_index`                                    | number      | Composite vocal stress score 0.0–1.0                             |
| `audio_sentiment_extractor_v1_text_embedding`     | float\[768] | FinBERT CLS embedding                                            |
| `audio_sentiment_extractor_v1_prosodic_embedding` | float\[128] | Normalized prosodic feature vector                               |

```json theme={null}
{
  "start_time": 245.0,
  "end_time": 275.0,
  "speaker_id": "SPEAKER_00",
  "speaker_role": "CEO",
  "transcription": "We're very confident in our Q1 guidance range of twelve to fourteen dollars per share...",
  "sentiment_label": "positive",
  "sentiment_score": 0.71,
  "sentiment_confidence": 0.89,
  "pitch_variability_hz": 38.2,
  "speech_rate_wpm": 142,
  "vocal_energy_db": -18.4,
  "pause_ratio": 0.21,
  "vocal_tremor": 0.14,
  "audio_text_alignment": 0.63,
  "stress_index": 0.31,
  "audio_sentiment_extractor_v1_text_embedding": [0.041, -0.018, ...],
  "audio_sentiment_extractor_v1_prosodic_embedding": [0.72, 0.34, ...]
}
```
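
Because each segment is a flat document, divergence screening can be done with plain dictionary access. The sketch below flags segments whose text sentiment is positive while `audio_text_alignment` is low; the `0.3` cutoff mirrors the filter used later on this page and is illustrative rather than a recommended value.

```python
def is_divergent(segment: dict, max_alignment: float = 0.3) -> bool:
    """True when the words are positive but the voice does not agree."""
    return (
        segment.get("sentiment_label") == "positive"
        and segment.get("audio_text_alignment", 1.0) < max_alignment
    )


# Hypothetical segments using the output schema's field names.
segments = [
    {"speaker_role": "CEO", "sentiment_label": "positive", "audio_text_alignment": 0.63},
    {"speaker_role": "CFO", "sentiment_label": "positive", "audio_text_alignment": -0.12},
    {"speaker_role": "CFO", "sentiment_label": "negative", "audio_text_alignment": 0.10},
]
flagged = [s for s in segments if is_divergent(s)]
print([s["speaker_role"] for s in flagged])
```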

## Parameters

### Audio Segmentation

| Parameter      | Type   | Default     | Description                                            |
| -------------- | ------ | ----------- | ------------------------------------------------------ |
| `split_method` | string | `"silence"` | Segmentation strategy: `time`, `silence`, or `speaker` |

<Tabs>
  <Tab title="time">
    **Fixed-interval splitting** — equal-duration segments regardless of speech content.

    | Parameter             | Type    | Default | Description                 |
    | --------------------- | ------- | ------- | --------------------------- |
    | `time_split_interval` | integer | `30`    | Segment duration in seconds |

    **Best for:** Batch processing, predictable segment counts, initial exploration

    ```json theme={null}
    {
      "split_method": "time",
      "time_split_interval": 30
    }
    ```
  </Tab>

  <Tab title="silence">
    **Voice activity detection** — splits at natural speech pauses. Preserves complete sentences and thoughts.

    | Parameter                 | Type    | Default | Description                             |
    | ------------------------- | ------- | ------- | --------------------------------------- |
    | `silence_db_threshold`    | integer | `-40`   | dB level below which audio is silence   |
    | `min_silence_duration_ms` | integer | `500`   | Minimum silence length to trigger split |

    **Best for:** Earnings calls, presentations, interviews — preserves semantic units

    ```json theme={null}
    {
      "split_method": "silence",
      "silence_db_threshold": -40,
      "min_silence_duration_ms": 500
    }
    ```
  </Tab>

  <Tab title="speaker">
    **Speaker-turn segmentation** — one segment per speaker turn. Requires diarization. Ideal for Q\&A analysis.

    **Characteristics:**

    * Variable segment lengths (1s–5 min typical for earnings Q\&A)
    * Each segment is a single speaker's continuous turn
    * Enables per-speaker sentiment timelines

    **Best for:** Q\&A sections, panel discussions, analyst questioning

    ```json theme={null}
    {
      "split_method": "speaker",
      "run_diarization": true
    }
    ```
  </Tab>
</Tabs>
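
To make the two `silence` parameters concrete, the sketch below finds split points in a sequence of per-frame dB levels. It illustrates the general technique only, not the extractor's VAD: the 10 ms frame size and the choice to split at the midpoint of each pause are assumptions.

```python
def silence_split_points(
    frame_db: list[float],
    frame_ms: int = 10,
    silence_db_threshold: float = -40.0,
    min_silence_duration_ms: int = 500,
) -> list[float]:
    """Return split times (seconds) at the midpoint of each long-enough pause."""
    splits: list[float] = []
    run_start = None  # index of the first frame in the current silent run
    for i, level in enumerate(frame_db):
        if level < silence_db_threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            # Speech resumed: keep the pause only if it lasted long enough.
            if (i - run_start) * frame_ms >= min_silence_duration_ms:
                splits.append((run_start + i) / 2 * frame_ms / 1000.0)
            run_start = None
    return splits


# 1.0 s speech, 0.6 s pause, 1.0 s speech, 0.2 s pause, 0.5 s speech
frames = [-20.0] * 100 + [-60.0] * 60 + [-20.0] * 100 + [-60.0] * 20 + [-20.0] * 50
print(silence_split_points(frames))
```

Only the 0.6 s pause exceeds `min_silence_duration_ms`, so the 0.2 s pause does not trigger a split. Silence that runs to the end of the audio is ignored in this sketch.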

### Feature Extraction Parameters

| Parameter                | Type    | Default | Description                                                  |
| ------------------------ | ------- | ------- | ------------------------------------------------------------ |
| `run_transcription`      | boolean | `true`  | Run Whisper transcription                                    |
| `transcription_language` | string  | `"en"`  | Language code for transcription                              |
| `transcription_prompt`   | string  | `null`  | Domain vocabulary hint (e.g., ticker symbols, product names) |
| `run_finbert`            | boolean | `true`  | Run FinBERT financial sentiment classification               |
| `run_prosodics`          | boolean | `true`  | Extract 5 prosodic features + 128D embedding                 |
| `run_alignment`          | boolean | `true`  | Compute audio-text alignment score                           |
| `run_diarization`        | boolean | `false` | Run speaker diarization (adds \~20% processing time)         |
| `num_speakers`           | integer | `null`  | Hint for diarization (null = auto-detect)                    |

### Speaker Role Manifest

Map diarized speaker IDs to roles (CEO, CFO, Analyst, etc.) using a manifest:

```json theme={null}
{
  "speaker_role_manifest": {
    "SPEAKER_00": "CEO",
    "SPEAKER_01": "CFO",
    "SPEAKER_02": "Analyst"
  }
}
```

When `speaker_role_manifest` is not provided, `speaker_role` falls back to the diarized label (`SPEAKER_00`, `SPEAKER_01`, etc.).
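
The lookup can be pictured as a dictionary access with a fallback. In this sketch, `resolve_speaker_role` is a hypothetical helper, and the handling of IDs missing from a partial manifest (returning the raw diarized label) is an assumption rather than documented behavior.

```python
from typing import Optional


def resolve_speaker_role(speaker_id: str, manifest: Optional[dict] = None) -> str:
    """Map a diarized speaker ID to a role, falling back to the ID itself."""
    return (manifest or {}).get(speaker_id, speaker_id)


manifest = {"SPEAKER_00": "CEO", "SPEAKER_01": "CFO", "SPEAKER_02": "Analyst"}
print(resolve_speaker_role("SPEAKER_01", manifest))  # mapped role
print(resolve_speaker_role("SPEAKER_05", manifest))  # not in manifest: raw label
print(resolve_speaker_role("SPEAKER_00"))            # no manifest: raw label
```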

### LLM Structured Extraction

| Parameter            | Type             | Default | Description                         |
| -------------------- | ---------------- | ------- | ----------------------------------- |
| `run_llm_enrichment` | boolean          | `false` | Run LLM over transcription segments |
| `response_shape`     | string \| object | `null`  | Custom structured output schema     |

**Natural Language Mode:**

```json theme={null}
{
  "response_shape": "Extract: forward guidance confidence level (1-5), number of hedging phrases, primary topic discussed, and any mentioned risk factors"
}
```

**JSON Schema Mode for Quant Signals:**

```json theme={null}
{
  "response_shape": {
    "type": "object",
    "properties": {
      "guidance_confidence": { "type": "integer", "minimum": 1, "maximum": 5 },
      "hedging_phrase_count": { "type": "integer" },
      "topic": { "type": "string", "enum": ["revenue", "margins", "guidance", "macro", "capex", "other"] },
      "risk_factors": { "type": "array", "items": { "type": "string" } },
      "quantitative_claims": { "type": "array", "items": { "type": "string" } }
    }
  }
}
```

## Configuration Examples

<CodeGroup>
  ```json Earnings Call — Full Analysis theme={null}
  {
    "feature_extractor": {
      "feature_extractor_name": "audio_sentiment_extractor",
      "version": "v1",
      "input_mappings": {
        "audio": "audio_url"
      },
      "field_passthrough": [
        { "source_path": "metadata.ticker" },
        { "source_path": "metadata.fiscal_quarter" },
        { "source_path": "metadata.call_section" }
      ],
      "parameters": {
        "split_method": "speaker",
        "run_transcription": true,
        "transcription_language": "en",
        "run_finbert": true,
        "run_prosodics": true,
        "run_alignment": true,
        "run_diarization": true,
        "num_speakers": 6,
        "speaker_role_manifest": {
          "SPEAKER_00": "CEO",
          "SPEAKER_01": "CFO",
          "SPEAKER_02": "IR"
        }
      }
    }
  }
  ```

  ```json CEO Prepared Remarks — Guidance Confidence theme={null}
  {
    "feature_extractor": {
      "feature_extractor_name": "audio_sentiment_extractor",
      "version": "v1",
      "input_mappings": {
        "audio": "prepared_remarks_url"
      },
      "field_passthrough": [
        { "source_path": "metadata.ticker" },
        { "source_path": "metadata.event_date" }
      ],
      "parameters": {
        "split_method": "silence",
        "silence_db_threshold": -38,
        "run_transcription": true,
        "transcription_prompt": "earnings call forward guidance revenue EPS margin",
        "run_finbert": true,
        "run_prosodics": true,
        "run_alignment": true,
        "run_llm_enrichment": true,
        "response_shape": {
          "type": "object",
          "properties": {
            "guidance_confidence": { "type": "integer", "minimum": 1, "maximum": 5 },
            "hedging_phrase_count": { "type": "integer" },
            "topic": { "type": "string" },
            "forward_looking": { "type": "boolean" }
          }
        }
      }
    }
  }
  ```

  ```json Fed Press Conference — Policy Tone theme={null}
  {
    "feature_extractor": {
      "feature_extractor_name": "audio_sentiment_extractor",
      "version": "v1",
      "input_mappings": {
        "video": "press_conf_url"
      },
      "field_passthrough": [
        { "source_path": "metadata.fomc_date" },
        { "source_path": "metadata.rate_decision" }
      ],
      "parameters": {
        "split_method": "silence",
        "run_transcription": true,
        "run_finbert": true,
        "run_prosodics": true,
        "run_alignment": true,
        "run_diarization": false
      }
    }
  }
  ```

  ```json Analyst Day — Multi-Speaker Timeline theme={null}
  {
    "feature_extractor": {
      "feature_extractor_name": "audio_sentiment_extractor",
      "version": "v1",
      "input_mappings": {
        "video": "analyst_day_recording"
      },
      "parameters": {
        "split_method": "speaker",
        "run_transcription": true,
        "run_finbert": true,
        "run_prosodics": true,
        "run_alignment": true,
        "run_diarization": true
      }
    }
  }
  ```

  ```json Minimal — Transcription + FinBERT Only theme={null}
  {
    "feature_extractor": {
      "feature_extractor_name": "audio_sentiment_extractor",
      "version": "v1",
      "input_mappings": {
        "audio": "audio_url"
      },
      "parameters": {
        "split_method": "time",
        "time_split_interval": 60,
        "run_transcription": true,
        "run_finbert": true,
        "run_prosodics": false,
        "run_alignment": false,
        "run_diarization": false
      }
    }
  }
  ```
</CodeGroup>
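
When these configurations are generated programmatically, it helps to catch invalid combinations before submitting them. The sketch below builds the same JSON shape as the examples above and enforces the one constraint this page states explicitly (`split_method: speaker` requires `run_diarization=true`). `build_extractor_config` is a hypothetical helper, not part of any SDK.

```python
import json


def build_extractor_config(input_field: str, input_type: str = "audio", **parameters) -> dict:
    """Assemble a feature_extractor config and validate known constraints."""
    if input_type not in {"audio", "video", "url"}:
        raise ValueError(f"unsupported input type: {input_type!r}")
    if parameters.get("split_method") == "speaker" and not parameters.get("run_diarization", False):
        raise ValueError("split_method='speaker' requires run_diarization=True")
    return {
        "feature_extractor": {
            "feature_extractor_name": "audio_sentiment_extractor",
            "version": "v1",
            "input_mappings": {input_type: input_field},
            "parameters": parameters,
        }
    }


config = build_extractor_config(
    "audio_url", split_method="speaker", run_diarization=True, num_speakers=6
)
print(json.dumps(config, indent=2))
```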

## Performance & Costs

### Processing Speed

| Configuration      | Processing time (× audio duration) | Example                |
| ------------------ | ---------------------------------- | ---------------------- |
| Transcription only | \~0.5×                             | 60-min call → \~30 min |
| + FinBERT          | \~0.6×                             | 60-min call → \~36 min |
| + Prosodics        | \~0.8×                             | 60-min call → \~48 min |
| + Diarization      | \~1.5×                             | 60-min call → \~90 min |
| Full pipeline      | \~1.5×                             | 60-min call → \~90 min |
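
The figures in the table above act as multipliers on audio duration, so a rough wall-clock estimate is a single multiplication. The sketch below hard-codes those approximate multipliers; the dictionary keys are illustrative labels, not API values.

```python
# Approximate multipliers taken from the table above.
PROCESSING_MULTIPLIER = {
    "transcription": 0.5,
    "finbert": 0.6,
    "prosodics": 0.8,
    "diarization": 1.5,
    "full": 1.5,
}


def estimated_processing_minutes(audio_minutes: float, configuration: str = "full") -> float:
    """Rough processing time for one file, in minutes."""
    return audio_minutes * PROCESSING_MULTIPLIER[configuration]


print(estimated_processing_minutes(60, "transcription"))
print(estimated_processing_minutes(60, "full"))
```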

| Feature                 | Per-Segment Latency  |
| ----------------------- | -------------------- |
| Transcription (Whisper) | \~150ms/sec of audio |
| FinBERT classification  | \~25ms               |
| Prosodic extraction     | \~50ms               |
| Speaker diarization     | \~200ms/sec of audio |
| LLM enrichment          | \~1.5s               |

### Cost Estimates (per hour of audio)

| Configuration                             | Cost   |
| ----------------------------------------- | ------ |
| **Minimal** (transcription + FinBERT)     | \$0.08 |
| **Standard** (+ prosodics + alignment)    | \$0.15 |
| **Full** (+ diarization + LLM enrichment) | \$0.25 |

**Batch processing**: Processing 1,000 S\&P 500 earnings calls (avg 60 min) at full configuration ≈ \$250
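
The batch figure follows directly from the per-hour rates: 1,000 calls × 1 hour × \$0.25 = \$250. A small sketch of that arithmetic, using the rates from the table above as rough planning numbers rather than a pricing guarantee:

```python
# Per-hour rates copied from the cost table above.
COST_PER_AUDIO_HOUR = {"minimal": 0.08, "standard": 0.15, "full": 0.25}


def estimated_cost_usd(num_files: int, avg_minutes: float, configuration: str = "full") -> float:
    """Estimated cost in USD for a batch of audio files."""
    audio_hours = num_files * avg_minutes / 60.0
    return round(audio_hours * COST_PER_AUDIO_HOUR[configuration], 2)


print(estimated_cost_usd(1000, 60, "full"))
print(estimated_cost_usd(1000, 60, "minimal"))
```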

## Vector Indexes

### Text Embedding (FinBERT)

| Property             | Value                                         |
| -------------------- | --------------------------------------------- |
| **Index name**       | `audio_sentiment_extractor_v1_text_embedding` |
| **Dimensions**       | 768                                           |
| **Type**             | Dense                                         |
| **Distance metric**  | Cosine                                        |
| **Inference model**  | `finbert_sentiment_v1`                        |
| **Supported inputs** | text (transcription segments)                 |

### Prosodic Embedding

| Property             | Value                                             |
| -------------------- | ------------------------------------------------- |
| **Index name**       | `audio_sentiment_extractor_v1_prosodic_embedding` |
| **Dimensions**       | 128                                               |
| **Type**             | Dense                                             |
| **Distance metric**  | Cosine                                            |
| **Inference model**  | `prosodic_encoder_v1`                             |
| **Supported inputs** | audio segments                                    |

## Alpha Signal Guide

This section describes five core vocal signals (four prosodic features plus the audio-text alignment score) and how to interpret each as a quantitative signal.

<AccordionGroup>
  <Accordion title="1. Pitch Variability (F0 Standard Deviation)">
    **What it measures:** Standard deviation of the fundamental frequency (F0) in Hz across the segment.

    **Signal interpretation:**

    * **High variability (> 50 Hz):** Elevated emotional engagement; can indicate stress or enthusiasm
    * **Low variability (\< 15 Hz):** Monotone delivery; associated with rehearsed/scripted language or disengagement
    * **Baseline deviation:** Compare against the speaker's historical mean F0 std dev for true anomaly detection

    **Quant application:** Track CEO pitch variability during forward guidance vs. answers about historical results. Anomalous drops on guidance segments may precede earnings misses.
  </Accordion>

  <Accordion title="2. Speech Rate (Words Per Minute)">
    **What it measures:** Words per minute derived from Whisper word-level timestamps.

    **Signal interpretation:**

    * **High rate (> 180 wpm):** Urgency, anxiety, or over-rehearsed scripted answers
    * **Low rate (\< 100 wpm):** Deliberate, careful language; common when discussing negative surprises
    * **Rate deceleration mid-answer:** Suggests real-time reasoning, less scripted — higher authenticity signal

    **Quant application:** Significant speech rate slowdown during Q\&A relative to prepared remarks may signal management is processing unexpected analyst questions.
  </Accordion>

  <Accordion title="3. Vocal Energy (RMS dB)">
    **What it measures:** Root mean square energy of the audio signal in decibels.

    **Signal interpretation:**

    * **High energy:** Assertiveness and confidence; common in positive guidance delivery
    * **Energy drop mid-sentence:** Hedging or trailing off; linguistic uncertainty
    * **Drop relative to prior calls:** Tracking a speaker's energy across calls indicates changes in conviction level

    **Quant application:** Energy drop on forward EPS guidance sentences (identifiable via LLM topic tagging) is a stress-linked signal distinct from text sentiment.
  </Accordion>

  <Accordion title="4. Pause Ratio (Silence Fraction)">
    **What it measures:** Fraction of segment duration classified as silence (VAD threshold -40 dB).

    **Signal interpretation:**

    * **High pause ratio (> 0.35):** Cognitive load; speaker is reasoning in real time rather than reciting
    * **Low pause ratio (\< 0.10):** Scripted, rehearsed delivery — less information content
    * **Q\&A vs. prepared remarks delta:** A large increase in pause ratio during Q\&A is a well-documented stress marker

    **Quant application:** Pause ratio on Q\&A segments answering analyst questions about inventory / margin / guidance has shown predictive value for negative guidance revisions in academic literature.
  </Accordion>

  <Accordion title="5. Audio-Text Alignment Score">
    **What it measures:** Cosine similarity between the FinBERT sentiment direction (text) and prosodic valence (audio). Range: -1.0 to +1.0.

    **Signal interpretation:**

    * **High alignment (> 0.6):** Voice and words agree — higher conviction, less masking
    * **Low alignment (0.1–0.4):** Moderate divergence — common in hedged language
    * **Negative alignment (\< 0):** Voice contradicts words; the strongest stress/deception marker. Example: "We feel very good about guidance" delivered with high pitch variability, low energy, and a high pause ratio

    **Quant application:** This is the most novel of the five signals, and text NLP cannot capture it. Segments with positive text sentiment but negative alignment are the primary alpha generation target.
  </Accordion>
</AccordionGroup>
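
The alignment bands in the last accordion can be encoded as a small lookup when tagging segments. The sketch below uses the thresholds stated above; the guide does not name the ranges 0 to 0.1 and 0.4 to 0.6, so the `"unlabelled"` bucket for them is an assumption.

```python
def alignment_band(alignment: float) -> str:
    """Bucket an audio_text_alignment score using the guide's thresholds."""
    if alignment < 0.0:
        return "negative"    # voice contradicts words
    if alignment > 0.6:
        return "high"        # voice and words agree
    if 0.1 <= alignment <= 0.4:
        return "low"         # moderate divergence
    return "unlabelled"      # assumption: ranges the guide does not name


for score in (-0.2, 0.3, 0.5, 0.63):
    print(score, alignment_band(score))
```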

### Composite Stress Index

The `stress_index` field (0.0–1.0) is a normalized composite of all five prosodic features:

```
stress_index = normalize(
  0.25 * pitch_variability_z +
  0.20 * (-speech_rate_z) +    # inverted: lower rate = higher stress
  0.20 * (-vocal_energy_z) +   # inverted: lower energy = higher stress
  0.20 * pause_ratio_z +
  0.15 * vocal_tremor_z
)
```

Where `_z` values are Z-scores computed against the speaker's rolling 4-quarter baseline when `speaker_id` is consistent across calls.
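
A runnable sketch of the composite follows. It makes two assumptions that the formula leaves open: the speech-rate and vocal-energy terms are treated as inverted (negated Z-scores), and `normalize` is taken to be a logistic squash into 0.0–1.0. The production normalization may differ.

```python
import math
from statistics import mean, pstdev

# Negative weights invert a feature: lower rate or energy raises stress.
WEIGHTS = {
    "pitch_variability_z": 0.25,
    "speech_rate_z": -0.20,
    "vocal_energy_z": -0.20,
    "pause_ratio_z": 0.20,
    "vocal_tremor_z": 0.15,
}


def z_score(value: float, baseline: list[float]) -> float:
    """Z-score of a value against a speaker's historical baseline."""
    sigma = pstdev(baseline)
    return (value - mean(baseline)) / sigma if sigma > 0 else 0.0


def stress_index(z: dict) -> float:
    """Weighted sum of Z-scores squashed into 0.0-1.0 (assumed logistic)."""
    raw = sum(weight * z[name] for name, weight in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-raw))


calm = dict.fromkeys(WEIGHTS, 0.0)
tense = {
    "pitch_variability_z": 2.0,
    "speech_rate_z": -1.0,
    "vocal_energy_z": -1.0,
    "pause_ratio_z": 1.5,
    "vocal_tremor_z": 1.0,
}
print(stress_index(calm), stress_index(tense))
```

With every feature at its baseline the index sits at the midpoint, and it rises as pitch variability, pauses, and tremor increase or as rate and energy fall.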

### Recommended Factor Construction

```python theme={null}
# Example: CEO guidance confidence factor
import mixpeek

client = mixpeek.Client(api_key="YOUR_KEY")

# Query for CEO guidance segments with sentiment divergence
results = client.retrievers.run(
    retriever_id="earnings-sentiment-retriever",
    inputs={
        "query": "revenue guidance outlook fiscal year",
        "filters": {
            "speaker_role": "CEO",
            "forward_looking": True,                  # from LLM enrichment; requires a response_shape that defines it
            "audio_text_alignment": {"$lt": 0.3},   # divergence signal
            "stress_index": {"$gt": 0.6}             # high stress
        }
    }
)
```
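
Segment-level results usually need to be rolled up to one value per call before they can serve as a factor. The sketch below computes a duration-weighted mean `stress_index` for one speaker role. It assumes the retrieved segments are available as plain dictionaries with the output schema's field names; the shape of the retriever response is not covered on this page.

```python
from typing import Optional


def call_level_stress(segments: list[dict], speaker_role: str = "CEO") -> Optional[float]:
    """Duration-weighted mean stress_index for one role, or None if absent."""
    total_s = 0.0
    weighted = 0.0
    for seg in segments:
        if seg.get("speaker_role") != speaker_role:
            continue
        duration = seg["end_time"] - seg["start_time"]
        total_s += duration
        weighted += duration * seg["stress_index"]
    return weighted / total_s if total_s > 0 else None


# Hypothetical segments from a single call.
segments = [
    {"speaker_role": "CEO", "start_time": 0.0, "end_time": 30.0, "stress_index": 0.2},
    {"speaker_role": "CEO", "start_time": 30.0, "end_time": 40.0, "stress_index": 0.6},
    {"speaker_role": "CFO", "start_time": 40.0, "end_time": 50.0, "stress_index": 0.9},
]
print(call_level_stress(segments, "CEO"))
```

Weighting by duration keeps short interjections from dominating the call-level value.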

## Limitations

* **Speaker diarization accuracy**: pyannote achieves \~90% accuracy (roughly 10% diarization error rate, DER) on clean 2-speaker recordings; accuracy degrades with > 8 speakers or poor audio quality
* **Non-English**: Whisper transcription supports 99 languages; FinBERT is English-only — for non-English calls, disable `run_finbert` and use multilingual sentiment models
* **Audio quality**: Prosodic features require audio sampled at 16 kHz or higher; compressed phone audio (8 kHz) reduces pitch extraction accuracy by \~30%
* **Baseline dependency**: `stress_index` Z-score normalization requires at least 4 prior segments from the same `speaker_id` to be meaningful
* **Segment length**: Prosodic features are unreliable for segments \< 5 seconds; short interjections are best excluded
* **LLM enrichment latency**: `run_llm_enrichment=true` adds 1–2s per segment; disable for batch throughput
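
The segment-length limitation is easy to enforce downstream. A minimal sketch that drops segments shorter than 5 seconds before prosodic fields are used, assuming segments are dictionaries with `start_time` and `end_time` in seconds:

```python
MIN_PROSODIC_SECONDS = 5.0  # threshold from the limitation above


def has_reliable_prosodics(segment: dict, min_seconds: float = MIN_PROSODIC_SECONDS) -> bool:
    """True when the segment is long enough for prosodic features to be trusted."""
    return (segment["end_time"] - segment["start_time"]) >= min_seconds


segments = [
    {"start_time": 245.0, "end_time": 275.0},  # 30 s turn
    {"start_time": 275.0, "end_time": 276.5},  # 1.5 s interjection
]
usable = [s for s in segments if has_reliable_prosodics(s)]
print(len(usable))
```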

## Related

* [Feature Extractors Overview](/processing/feature-extractors)
* [Text Extractor](/processing/extractors/text)
* [Multimodal Extractor](/processing/extractors/multimodal)
* [Document Extractor](/processing/extractors/document)
