# Audio Fingerprint Extractor

> Audio fingerprinting with CLAP — 512-d embeddings from audio files or video audio tracks for sound-mark matching and audio similarity

<Card title="View on GitHub" icon="github" href="https://github.com/mixpeek/mixpeek-extractors/blob/main/extractors/audio_fingerprint_extractor/README.md" horizontal>
  Runnable reference for this extractor — inputs, parameters, output fields, embedding models, and copy-paste examples. Auto-generated from the live registry.
</Card>

The audio fingerprint extractor produces **512-dimensional CLAP embeddings** (Contrastive Language-Audio Pretraining, `laion/clap-htsat-tiny`) from audio files or the audio track of a video. It segments audio into overlapping windows, embeds each segment, and L2-normalizes the vectors for cosine similarity. Use it for sound-mark matching, audio similarity, and retrieving audio by acoustic fingerprint.

<Note>
  View extractor details at [api.mixpeek.com/v1/collections/features/extractors/audio\_fingerprint\_extractor\_v1](https://api.mixpeek.com/v1/collections/features/extractors/audio_fingerprint_extractor_v1) or fetch programmatically with `GET /v1/collections/features/extractors/{feature_extractor_id}`.
</Note>
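
As a rough illustration of the programmatic route, the sketch below builds the `GET` request with Python's standard library. The helper names `build_request` and `fetch_extractor` are hypothetical, and the bearer-token `Authorization` header is an assumption rather than something this page specifies; the response shape is not documented here, so it is not shown.

```python
import json
import urllib.request

BASE_URL = "https://api.mixpeek.com/v1/collections/features/extractors"


def build_request(feature_extractor_id, api_key=None):
    """Build (but do not send) the GET request for one extractor's details."""
    req = urllib.request.Request(f"{BASE_URL}/{feature_extractor_id}", method="GET")
    if api_key:
        # Assumption: bearer-token auth. Check the API reference for the real scheme.
        req.add_header("Authorization", f"Bearer {api_key}")
    return req


def fetch_extractor(feature_extractor_id, api_key=None, timeout=10.0):
    """Send the request and decode the JSON response body."""
    req = build_request(feature_extractor_id, api_key)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

Calling `fetch_extractor("audio_fingerprint_extractor_v1")` would then request the same URL linked in the note above.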

## Pipeline Steps

1. **Resolve input** — apply `input_mappings` to get the audio or video URL.
2. **Audio extraction** — if the source is video, extract the audio track.
3. **Resample** — resample audio to `sample_rate` (48000 Hz CLAP default).
4. **Segment** — split into `segment_duration_sec` windows hopping by `segment_hop_sec` (default 50% overlap); truncate beyond `max_audio_length_sec`.
5. **CLAP embedding** — embed each segment to a 512-d vector.
6. **Normalize** (if `normalize_embeddings`) — L2-normalize to unit vectors.
7. **Output** — one document per segment with timing metadata.
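
The truncate-and-segment steps above can be sketched as follows. This is a minimal illustration, not the extractor's actual code: `segment_windows` is a hypothetical helper, and it assumes that a trailing partial window is kept and clipped to the end of the audio (the real extractor may pad or drop it instead).

```python
def segment_windows(total_sec, segment_duration_sec=5.0, segment_hop_sec=2.5,
                    max_audio_length_sec=120.0):
    """Return (segment_index, start_time_sec, end_time_sec) for each window."""
    # Audio beyond max_audio_length_sec is ignored.
    effective = min(total_sec, max_audio_length_sec)
    windows = []
    index = 0
    start = 0.0
    while start < effective:
        # Assumption: the last window is clipped rather than padded or dropped.
        end = min(start + segment_duration_sec, effective)
        windows.append((index, start, end))
        if end >= effective:
            break  # this window already reaches the end of the audio
        index += 1
        start = index * segment_hop_sec  # multiply to avoid float drift
    return windows
```

Under these assumptions, a 12-second clip with the default parameters yields four windows: 0.0 to 5.0, 2.5 to 7.5, 5.0 to 10.0, and 7.5 to 12.0.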

## When to Use

| Use Case                | Description                                                        |
| ----------------------- | ------------------------------------------------------------------ |
| **Sound-mark matching** | Detect a known jingle, sound logo, or audio cue across a corpus    |
| **Audio similarity**    | Find acoustically similar clips (music, ambience, effects)         |
| **Ad/asset detection**  | Match the audio fingerprint of an ad or asset within longer media  |
| **Video audio search**  | Search the audio track of video assets without separate extraction |

## When NOT to Use

| Scenario                               | Recommended Alternative                        |
| -------------------------------------- | ---------------------------------------------- |
| Speech-to-text / transcription         | A transcription extractor (e.g. Whisper-based) |
| Text semantic search over spoken words | Transcribe, then `text_extractor`              |
| Whole-file multimodal embedding        | `universal_extractor` / `multimodal_extractor` |
| Music metadata/tagging only            | A classification taxonomy over fingerprints    |

## Input Schema

| Field   | Type   | Required | Description                                                    |
| ------- | ------ | -------- | -------------------------------------------------------------- |
| `audio` | string | one of   | URL or path to an audio file. Populated from `input_mappings`. |
| `video` | string | one of   | URL or path to a video file; the audio track is extracted.     |

```json
{
  "audio": "s3://my-bucket/spots/jingle.wav"
}
```

Supported input types: **AUDIO, VIDEO** (max 1 each per object).

## Output Schema

One document per audio segment:

| Field                                      | Type          | Description                                            |
| ------------------------------------------ | ------------- | ------------------------------------------------------ |
| `audio_fingerprint_extractor_v1_embedding` | float\[512]   | CLAP embedding (L2-normalized when enabled)            |
| `segment_index`                            | integer       | Segment index (0-based)                                |
| `start_time_sec` / `end_time_sec`          | float         | Segment start/end time in seconds                      |
| `duration_sec`                             | float         | Duration of this segment (seconds)                     |
| `total_duration_sec`                       | float \| null | Source audio duration (seconds)                        |
| `sample_rate`                              | integer       | Sample rate used for processing                        |
| `audio_source_type`                        | string        | Source type: `audio` or `video`                        |
| `embedding_model`                          | string        | Embedding model used (default `laion/clap-htsat-tiny`) |
| `processing_time_ms`                       | float         | Per-segment processing time (milliseconds)             |

```json
{
  "audio_fingerprint_extractor_v1_embedding": [0.041, -0.018, 0.092, ...],
  "segment_index": 0,
  "start_time_sec": 0.0,
  "end_time_sec": 5.0,
  "duration_sec": 5.0,
  "sample_rate": 48000,
  "audio_source_type": "audio",
  "embedding_model": "laion/clap-htsat-tiny"
}
```
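
To show how the per-segment fields support sound-mark matching, here is a small local sketch that scores segment documents against a reference embedding and reports where the matches fall in time. It assumes the embeddings are L2-normalized, so the dot product equals cosine similarity. The `best_matches` helper and the `0.8` threshold are hypothetical; in practice the vector index would run this search rather than client code.

```python
EMBEDDING_FIELD = "audio_fingerprint_extractor_v1_embedding"


def best_matches(query, documents, threshold=0.8):
    """Return (score, segment_index, start_time_sec, end_time_sec), best first."""
    hits = []
    for doc in documents:
        # Dot product of unit vectors is their cosine similarity.
        score = sum(q * d for q, d in zip(query, doc[EMBEDDING_FIELD]))
        if score >= threshold:  # hypothetical cut-off; tune on real data
            hits.append((score, doc["segment_index"],
                         doc["start_time_sec"], doc["end_time_sec"]))
    return sorted(hits, reverse=True)
```

Each hit carries `start_time_sec` and `end_time_sec`, so a match can be located inside the longer source file.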

## Parameters

| Parameter              | Type    | Default | Range     | Description                                                                                           |
| ---------------------- | ------- | ------- | --------- | ----------------------------------------------------------------------------------------------------- |
| `segment_duration_sec` | float   | `5.0`   | 1.0–30.0  | Duration of each audio segment (seconds). 5.0 recommended for sound-mark matching                     |
| `segment_hop_sec`      | float   | `2.5`   | 0.5–15.0  | Hop between segments (seconds). 2.5 = 50% overlap. Set equal to `segment_duration_sec` for no overlap |
| `sample_rate`          | integer | `48000` | —         | Target sample rate (Hz). 48000 is the CLAP default; audio is resampled before embedding               |
| `normalize_embeddings` | boolean | `true`  | —         | L2-normalize embeddings to unit vectors (recommended for cosine similarity)                           |
| `max_audio_length_sec` | float   | `120.0` | 1.0–600.0 | Maximum audio length to process (seconds). Audio beyond this is truncated                             |
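
A client-side check against the table above might look like the sketch below. The `validate_parameters` helper is hypothetical, and it assumes the ranges are inclusive; whether the API rejects or clamps out-of-range values is not stated here.

```python
DEFAULTS = {
    "segment_duration_sec": 5.0,
    "segment_hop_sec": 2.5,
    "sample_rate": 48000,
    "normalize_embeddings": True,
    "max_audio_length_sec": 120.0,
}

# Assumption: both ends of each documented range are inclusive.
RANGES = {
    "segment_duration_sec": (1.0, 30.0),
    "segment_hop_sec": (0.5, 15.0),
    "max_audio_length_sec": (1.0, 600.0),
}


def validate_parameters(params):
    """Merge params over the defaults and return (merged, list_of_errors)."""
    errors = [f"unknown parameter: {name}"
              for name in sorted(set(params) - set(DEFAULTS))]
    known = {k: v for k, v in params.items() if k in DEFAULTS}
    merged = {**DEFAULTS, **known}
    for name, (low, high) in RANGES.items():
        if not low <= merged[name] <= high:
            errors.append(f"{name}={merged[name]} is outside {low}-{high}")
    return merged, errors
```

An empty `params` dict validates cleanly and returns the defaults, which matches the `"parameters": {}` form used in the configuration examples.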

## Configuration Examples

<CodeGroup>
  ```json Default (5s windows, 50% overlap)
  {
    "feature_extractor": {
      "feature_extractor_name": "audio_fingerprint_extractor",
      "version": "v1",
      "input_mappings": {
        "audio": "audio_url"
      },
      "parameters": {}
    }
  }
  ```

  ```json Video Audio Track
  {
    "feature_extractor": {
      "feature_extractor_name": "audio_fingerprint_extractor",
      "version": "v1",
      "input_mappings": {
        "video": "video_url"
      },
      "parameters": {}
    }
  }
  ```

  ```json Non-Overlapping Segments, Longer Audio
  {
    "feature_extractor": {
      "feature_extractor_name": "audio_fingerprint_extractor",
      "version": "v1",
      "input_mappings": {
        "audio": "audio_url"
      },
      "parameters": {
        "segment_duration_sec": 10.0,
        "segment_hop_sec": 10.0,
        "max_audio_length_sec": 300.0
      }
    }
  }
  ```
</CodeGroup>

## Performance & Costs

| Metric               | Value                                           |
| -------------------- | ----------------------------------------------- |
| **Cost**             | 3 credits per audio segment processed           |
| **Model**            | `laion/clap-htsat-tiny` (CLAP)                  |
| **Default coverage** | First 120 s of audio (configurable up to 600 s) |
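
Because cost scales with segment count, a rough estimate can be computed before ingestion. The sketch below is an approximation under one assumption: a trailing partial window counts as a full billed segment. The `estimate_segments` and `estimate_credits` helpers are hypothetical, and actual billing may differ.

```python
import math

CREDITS_PER_SEGMENT = 3  # from the table above


def estimate_segments(total_sec, segment_duration_sec=5.0, segment_hop_sec=2.5,
                      max_audio_length_sec=120.0):
    """Estimate how many segments one source produces."""
    effective = min(total_sec, max_audio_length_sec)  # truncation cap
    if effective <= 0:
        return 0
    if effective <= segment_duration_sec:
        return 1
    # One initial window, plus one per hop needed to reach the end of the audio.
    return 1 + math.ceil((effective - segment_duration_sec) / segment_hop_sec)


def estimate_credits(total_sec, **params):
    """Estimated credits for one source under the assumptions above."""
    return CREDITS_PER_SEGMENT * estimate_segments(total_sec, **params)
```

Under these assumptions, a source of 120 s or longer with default parameters produces 47 segments, or 141 credits. The non-overlapping configuration (10 s windows, 10 s hop, 300 s cap) produces 30 segments, or 90 credits, for a 300 s source.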

## Vector Index

| Property            | Value                                       |
| ------------------- | ------------------------------------------- |
| **Index name**      | `audio_fingerprint_extractor_v1_embedding`  |
| **Dimensions**      | 512                                         |
| **Type**            | Dense                                       |
| **Distance metric** | Cosine                                      |
| **Inference model** | `laion/clap-htsat-tiny`                     |
| **Normalization**   | L2 normalized (when `normalize_embeddings`) |
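
The pairing of L2 normalization with a cosine metric can be checked numerically: once vectors are scaled to unit length, their dot product equals their cosine similarity. The sketch below uses toy 2-d vectors in place of real 512-d embeddings, and the helper names are illustrative only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
# The two values agree up to floating-point rounding.
print(cosine(a, b), dot(l2_normalize(a), l2_normalize(b)))
```

If `normalize_embeddings` is disabled, a plain dot product no longer equals cosine similarity, so the comparison has to divide by the vector norms as `cosine` does here.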

## Limitations

* **Length cap**: Audio beyond `max_audio_length_sec` is truncated (default 120 s).
* **Not for transcription**: Produces acoustic fingerprints, not text — pair with a transcription extractor for spoken-word search.
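* **Segment fan-out**: Each source yields one document per window, so the default 50% overlap roughly doubles the document count (and the per-segment cost) compared with non-overlapping windows; raise `segment_hop_sec` to reduce density.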

## Related

* [Feature Extractors Overview](/processing/feature-extractors)
* [Audio Sentiment Extractor](/processing/extractors/audio-sentiment)
* [Universal Extractor](/processing/extractors/universal)
