laion/clap-htsat-tiny) from audio files or the audio track of a video. It segments audio into overlapping windows, embeds each segment, and L2-normalizes the vectors for cosine similarity. Use it for sound-mark matching, audio similarity, and retrieving audio by acoustic fingerprint.
View extractor details at api.mixpeek.com/v1/collections/features/extractors/audio_fingerprint_extractor_v1 or fetch them programmatically with `GET /v1/collections/features/extractors/{feature_extractor_id}`.
Pipeline Steps
- Resolve input: apply `input_mappings` to get the audio or video URL.
- Audio extraction: if the source is video, extract the audio track.
- Resample: resample audio to `sample_rate` (48000 Hz, the CLAP default).
- Segment: split into windows of `segment_duration_sec`, advancing by `segment_hop_sec` (50% overlap by default); audio beyond `max_audio_length_sec` is truncated.
- CLAP embedding: embed each segment to a 512-d vector.
- Normalize (if `normalize_embeddings`): L2-normalize to unit vectors.
- Output: one document per segment with timing metadata.
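The segmentation step can be sketched in a few lines of Python. This is a minimal illustration, not the extractor's actual code: the `Segment` class and `plan_segments` function are hypothetical names, and the handling of the final window (clipped to the end of the audio, then stop) is an assumption, since the page does not say how a trailing partial window is treated.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    segment_index: int
    start_time_sec: float
    end_time_sec: float

    @property
    def duration_sec(self) -> float:
        return self.end_time_sec - self.start_time_sec


def plan_segments(total_duration_sec, segment_duration_sec=5.0,
                  segment_hop_sec=2.5, max_audio_length_sec=120.0):
    """Return the window boundaries for one audio source."""
    if segment_duration_sec <= 0 or segment_hop_sec <= 0:
        raise ValueError("segment duration and hop must be positive")
    # Audio beyond max_audio_length_sec is truncated before windowing.
    usable = min(total_duration_sec, max_audio_length_sec)
    segments = []
    index = 0
    while index * segment_hop_sec < usable:
        start = index * segment_hop_sec
        # ASSUMPTION: the last window is clipped to the end of the audio.
        end = min(start + segment_duration_sec, usable)
        segments.append(Segment(index, start, end))
        if end >= usable:
            break  # ASSUMPTION: stop once a window reaches the end
        index += 1
    return segments


for seg in plan_segments(12.0):
    print(seg.segment_index, seg.start_time_sec, seg.end_time_sec)
```

Under these assumptions a 12-second clip with the defaults yields four windows starting at 0, 2.5, 5 and 7.5 seconds, with the last one clipped at 12 seconds.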
When to Use
| Use Case | Description |
|---|---|
| Sound-mark matching | Detect a known jingle, sound logo, or audio cue across a corpus |
| Audio similarity | Find acoustically similar clips (music, ambience, effects) |
| Ad/asset detection | Match the audio fingerprint of an ad or asset within longer media |
| Video audio search | Search the audio track of video assets without separate extraction |
When NOT to Use
| Scenario | Recommended Alternative |
|---|---|
| Speech-to-text / transcription | A transcription extractor (e.g. Whisper-based) |
| Text semantic search over spoken words | Transcribe, then text_extractor |
| Whole-file multimodal embedding | universal_extractor / multimodal_extractor |
| Music metadata/tagging only | A classification taxonomy over fingerprints |
Input Schema
| Field | Type | Required | Description |
|---|---|---|---|
| `audio` | string | one of | URL or path to an audio file. Populated from `input_mappings`. |
| `video` | string | one of | URL or path to a video file; the audio track is extracted. |
Output Schema
One document per audio segment:
| Field | Type | Description |
|---|---|---|
| `audio_fingerprint_extractor_v1_embedding` | float[512] | CLAP embedding (L2-normalized when enabled) |
| `segment_index` | integer | Segment index (0-based) |
| `start_time_sec` / `end_time_sec` | float | Segment start/end time in seconds |
| `duration_sec` | float | Duration of this segment (seconds) |
| `total_duration_sec` | float or null | Source audio duration |
| `sample_rate` | integer | Sample rate used for processing |
| `audio_source_type` | string | Source type: audio or video |
| `embedding_model` | string | Embedding model used (default laion/clap-htsat-tiny) |
| `processing_time_ms` | float | Per-segment processing time |
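To show how these fields might be consumed, here is a small sketch that scores segment documents against a query fingerprint and returns the matching time ranges. It assumes the documents are already available as plain dictionaries and that all embeddings are unit vectors (the `normalize_embeddings` default). The `find_matches` helper and the `min_score` threshold of 0.8 are hypothetical, and the toy vectors are 2-d only for readability; real embeddings are 512-d.

```python
EMBEDDING_FIELD = "audio_fingerprint_extractor_v1_embedding"


def find_matches(query_embedding, segment_docs, min_score=0.8):
    """Return (score, start_time_sec, end_time_sec) tuples, best first."""
    matches = []
    for doc in segment_docs:
        # For unit vectors the dot product equals cosine similarity.
        score = sum(q * d for q, d in zip(query_embedding, doc[EMBEDDING_FIELD]))
        if score >= min_score:
            matches.append((score, doc["start_time_sec"], doc["end_time_sec"]))
    matches.sort(key=lambda m: m[0], reverse=True)
    return matches


# Toy documents shaped like the output schema (2-d vectors for illustration).
docs = [
    {EMBEDDING_FIELD: [1.0, 0.0], "start_time_sec": 0.0, "end_time_sec": 5.0},
    {EMBEDDING_FIELD: [0.0, 1.0], "start_time_sec": 2.5, "end_time_sec": 7.5},
    {EMBEDDING_FIELD: [0.6, 0.8], "start_time_sec": 5.0, "end_time_sec": 10.0},
]
print(find_matches([1.0, 0.0], docs, min_score=0.5))
```

In practice the similarity search would run against the vector index rather than in application code; the sketch only illustrates how `start_time_sec` and `end_time_sec` locate a match inside the source.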
Parameters
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| `segment_duration_sec` | float | 5.0 | 1.0–30.0 | Duration of each audio segment (seconds). 5.0 is recommended for sound-mark matching |
| `segment_hop_sec` | float | 2.5 | 0.5–15.0 | Hop between segments (seconds). 2.5 = 50% overlap. Set equal to `segment_duration_sec` for no overlap |
| `sample_rate` | integer | 48000 | n/a | Target sample rate (Hz). 48000 is the CLAP default; audio is resampled before embedding |
| `normalize_embeddings` | boolean | true | n/a | L2-normalize embeddings to unit vectors (recommended for cosine similarity) |
| `max_audio_length_sec` | float | 120.0 | 1.0–600.0 | Maximum audio length to process (seconds). Audio beyond this is truncated |
Configuration Examples
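The following is a hedged sketch of how a configuration could be assembled and checked against the ranges in the Parameters table. The defaults and ranges come from that table; the overall payload layout (the `feature_extractor_id`, `input_mappings` and `parameters` keys together), the `build_extractor_config` helper, and the `audio_url` field name are assumptions for illustration, so check the API reference for the exact request shape.

```python
# Defaults and inclusive ranges copied from the Parameters table.
DEFAULT_PARAMETERS = {
    "segment_duration_sec": 5.0,
    "segment_hop_sec": 2.5,
    "sample_rate": 48000,
    "normalize_embeddings": True,
    "max_audio_length_sec": 120.0,
}
PARAMETER_RANGES = {
    "segment_duration_sec": (1.0, 30.0),
    "segment_hop_sec": (0.5, 15.0),
    "max_audio_length_sec": (1.0, 600.0),
}


def build_extractor_config(input_field, source_key="audio", **overrides):
    """Build a config dict (hypothetical layout) with validated parameters."""
    if source_key not in ("audio", "video"):
        raise ValueError("source_key must be 'audio' or 'video'")
    unknown = set(overrides) - set(DEFAULT_PARAMETERS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    parameters = {**DEFAULT_PARAMETERS, **overrides}
    for name, (low, high) in PARAMETER_RANGES.items():
        if not low <= parameters[name] <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
    return {
        # ASSUMPTION: key names of the envelope are illustrative only.
        "feature_extractor_id": "audio_fingerprint_extractor_v1",
        "input_mappings": {source_key: input_field},
        "parameters": parameters,
    }


# Non-overlapping 5-second windows over the first 10 minutes of a video.
config = build_extractor_config(
    "video_url",  # hypothetical source field name
    source_key="video",
    segment_hop_sec=5.0,
    max_audio_length_sec=600.0,
)
print(config)
```

Setting `segment_hop_sec` equal to `segment_duration_sec` removes the overlap, which roughly halves the number of segment documents compared with the defaults.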
Performance & Costs
| Metric | Value |
|---|---|
| Cost | 3 credits per audio segment processed |
| Model | laion/clap-htsat-tiny (CLAP) |
| Default coverage | First 120 s of audio (configurable to 600 s) |
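Because cost scales with the number of segments, a rough estimate can be computed before ingestion. The sketch below uses the 3-credits-per-segment figure from the table; the segment-count formula is an assumption (windows start every `segment_hop_sec` until one reaches the end of the usable audio), and `estimate_segments` and `estimate_credits` are hypothetical helper names.

```python
import math

CREDITS_PER_SEGMENT = 3  # from the cost table above


def estimate_segments(duration_sec, segment_duration_sec=5.0,
                      segment_hop_sec=2.5, max_audio_length_sec=120.0):
    """Approximate segment count for one source (assumed windowing rule)."""
    usable = min(duration_sec, max_audio_length_sec)
    if usable <= 0:
        return 0
    if usable <= segment_duration_sec:
        return 1
    # One window at t=0, plus one per hop until a window reaches the end.
    return math.ceil((usable - segment_duration_sec) / segment_hop_sec) + 1


def estimate_credits(duration_sec, **params):
    return CREDITS_PER_SEGMENT * estimate_segments(duration_sec, **params)


print(estimate_credits(300.0))                       # defaults, capped at 120 s
print(estimate_credits(300.0, segment_hop_sec=5.0))  # no overlap
```

Under these assumptions a 5-minute file processed with the defaults is capped at 120 seconds and comes to 47 segments (141 credits); with no overlap it comes to 24 segments (72 credits).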
Vector Index
| Property | Value |
|---|---|
| Index name | audio_fingerprint_extractor_v1_embedding |
| Dimensions | 512 |
| Type | Dense |
| Distance metric | Cosine |
| Inference model | laion/clap-htsat-tiny |
| Normalization | L2 normalized (when normalize_embeddings) |
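The reason L2 normalization pairs well with a cosine index is that, for unit vectors, the dot product and cosine similarity are the same number. A minimal pure-Python illustration follows; the 3-d vectors stand in for the 512-d embeddings, and the helper names are illustrative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

# Cosine on the raw vectors equals the plain dot product after normalization.
raw_cosine = cosine_similarity(a, b)
unit_dot = dot(l2_normalize(a), l2_normalize(b))
print(math.isclose(raw_cosine, unit_dot))
```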
Limitations
- Length cap: Audio beyond `max_audio_length_sec` is truncated (default 120 s).
- Not for transcription: Produces acoustic fingerprints, not text; pair with a transcription extractor for spoken-word search.
- Segment fan-out: Overlapping windows multiply the document count per source; tune `segment_hop_sec` to control density.

