Audio Fingerprint Extractor

Built-in extractor names are a deprecated alias — collections are now created by picking features. This pipeline is selected with features: ["audio_search"] for audio files, or features: ["audio_fingerprint"] as an in-video add-on. Existing feature_extractor configs keep working; see the migration guide.

Runnable reference for this extractor — inputs, parameters, output fields, embedding models, and copy-paste examples. Auto-generated from the live registry.

The audio fingerprint extractor produces 512-dimensional CLAP embeddings (Contrastive Language-Audio Pretraining, laion/clap-htsat-tiny) from audio files or the audio track of a video. It segments audio into overlapping windows, embeds each segment, and L2-normalizes the vectors for cosine similarity. Use it for sound-mark matching, audio similarity, and retrieving audio by acoustic fingerprint.

View extractor details at api.mixpeek.com/v1/collections/features/extractors/audio_fingerprint_extractor_v1 or fetch programmatically with GET /v1/collections/features/extractors/{feature_extractor_id}.
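
A minimal sketch of the programmatic fetch, using only the Python standard library. The `https://` scheme, the Bearer-token `Authorization` header, and the helper names are assumptions for illustration; this page does not state how the API authenticates requests, so check the API reference before relying on it.

```python
import json
import urllib.request

BASE_URL = "https://api.mixpeek.com"  # assumed scheme; the host comes from the text above


def extractor_details_request(feature_extractor_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the GET request for one extractor's details."""
    url = f"{BASE_URL}/v1/collections/features/extractors/{feature_extractor_id}"
    # Assumption: the API accepts a Bearer token; this page does not confirm it.
    headers = {"Authorization": f"Bearer {api_key}", "Accept": "application/json"}
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_extractor_details(feature_extractor_id: str, api_key: str) -> dict:
    """Send the request and decode the JSON response body."""
    request = extractor_details_request(feature_extractor_id, api_key)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Calling `fetch_extractor_details("audio_fingerprint_extractor_v1", api_key)` would request the details URL shown above.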

Pipeline Steps

Resolve input — apply input_mappings to get the audio or video URL.
Audio extraction — if the source is video, extract the audio track.
Resample — resample audio to sample_rate (48000 Hz CLAP default).
Segment — split into segment_duration_sec windows hopping by segment_hop_sec (default 50% overlap); truncate beyond max_audio_length_sec.
CLAP embedding — embed each segment to a 512-d vector.
Normalize (if normalize_embeddings) — L2-normalize to unit vectors.
Output — one document per segment with timing metadata.
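
The segmentation step can be sketched as follows. This is an illustration of the windowing arithmetic, not the extractor's actual implementation: the function name is hypothetical, and the handling of the final window (clipped to the end of the audio rather than dropped or padded) is an assumption this page does not confirm.

```python
from typing import List, Tuple


def segment_windows(
    total_duration_sec: float,
    segment_duration_sec: float = 5.0,
    segment_hop_sec: float = 2.5,
    max_audio_length_sec: float = 120.0,
) -> List[Tuple[int, float, float]]:
    """Return (segment_index, start_time_sec, end_time_sec) for each window."""
    if segment_duration_sec <= 0 or segment_hop_sec <= 0:
        raise ValueError("segment duration and hop must be positive")
    # Audio beyond max_audio_length_sec is truncated before windowing.
    effective = min(total_duration_sec, max_audio_length_sec)
    windows = []
    index = 0
    # Multiply instead of accumulating to avoid floating-point drift.
    while index * segment_hop_sec < effective:
        start = index * segment_hop_sec
        # Assumption: a trailing window is clipped to the end of the audio.
        end = min(start + segment_duration_sec, effective)
        windows.append((index, start, end))
        index += 1
    return windows
```

Under these assumptions, a 12-second clip with the default parameters yields five windows, the last two clipped at 12.0 s. Setting `segment_hop_sec` equal to `segment_duration_sec` (5.0) yields three.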

When to Use

| Use Case | Description |
| --- | --- |
| Sound-mark matching | Detect a known jingle, sound logo, or audio cue across a corpus |
| Audio similarity | Find acoustically similar clips (music, ambience, effects) |
| Ad/asset detection | Match the audio fingerprint of an ad or asset within longer media |
| Video audio search | Search the audio track of video assets without separate extraction |

When NOT to Use

| Scenario | Recommended Alternative |
| --- | --- |
| Speech-to-text / transcription | A transcription extractor (e.g. Whisper-based) |
| Text semantic search over spoken words | Transcribe, then `text_extractor` |
| Whole-file multimodal embedding | `universal_extractor` / `multimodal_extractor` |
| Music metadata/tagging only | A classification taxonomy over fingerprints |

Input Schema

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `audio` | string | One of `audio` or `video` | URL or path to an audio file. Populated from `input_mappings`. |
| `video` | string | One of `audio` or `video` | URL or path to a video file; the audio track is extracted. |

{
  "audio": "s3://my-bucket/spots/jingle.wav"
}

Supported input types: AUDIO, VIDEO (max 1 each per object).

Output Schema

One document per audio segment:

| Field | Type | Description |
| --- | --- | --- |
| `audio_fingerprint_extractor_v1_embedding` | float[512] | CLAP embedding (L2-normalized when enabled) |
| `segment_index` | integer | Segment index (0-based) |
| `start_time_sec` / `end_time_sec` | float | Segment start/end time in seconds |
| `duration_sec` | float | Duration of this segment (seconds) |
| `total_duration_sec` | float \| null | Source audio duration |
| `sample_rate` | integer | Sample rate used for processing |
| `audio_source_type` | string | Source type: `audio` or `video` |
| `embedding_model` | string | Embedding model used (default `laion/clap-htsat-tiny`) |
| `processing_time_ms` | float | Per-segment processing time |

{
  "audio_fingerprint_extractor_v1_embedding": [0.041, -0.018, 0.092, ...],
  "segment_index": 0,
  "start_time_sec": 0.0,
  "end_time_sec": 5.0,
  "duration_sec": 5.0,
  "sample_rate": 48000,
  "audio_source_type": "audio",
  "embedding_model": "laion/clap-htsat-tiny"
}

Parameters

| Parameter | Type | Default | Range | Description |
| --- | --- | --- | --- | --- |
| `segment_duration_sec` | float | `5.0` | 1.0–30.0 | Duration of each audio segment (seconds). 5.0 recommended for sound-mark matching |
| `segment_hop_sec` | float | `2.5` | 0.5–15.0 | Hop between segments (seconds). 2.5 = 50% overlap. Set equal to `segment_duration_sec` for no overlap |
| `sample_rate` | integer | `48000` | n/a | Target sample rate (Hz). 48000 is the CLAP default; audio is resampled before embedding |
| `normalize_embeddings` | boolean | `true` | n/a | L2-normalize embeddings to unit vectors (recommended for cosine similarity) |
| `max_audio_length_sec` | float | `120.0` | 1.0–600.0 | Maximum audio length to process (seconds). Audio beyond this is truncated |
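
A client-side check against the documented ranges can catch mistakes before a collection is created. The sketch below is illustrative only: the function name is hypothetical, the bounds are treated as inclusive by assumption, and the hop-versus-duration warning is a sanity check added here, not a rule stated by the API.

```python
# Ranges copied from the parameter table; parameters without a range are not checked.
PARAMETER_RANGES = {
    "segment_duration_sec": (1.0, 30.0),
    "segment_hop_sec": (0.5, 15.0),
    "max_audio_length_sec": (1.0, 600.0),
}


def validate_parameters(parameters: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for name, (low, high) in PARAMETER_RANGES.items():
        # Assumption: documented bounds are inclusive.
        if name in parameters and not (low <= parameters[name] <= high):
            problems.append(f"{name}={parameters[name]} is outside {low}-{high}")
    duration = parameters.get("segment_duration_sec", 5.0)
    hop = parameters.get("segment_hop_sec", 2.5)
    # Not a documented rule: a hop larger than the window leaves unembedded gaps.
    if hop > duration:
        problems.append("segment_hop_sec exceeds segment_duration_sec (gaps between segments)")
    return problems
```

For example, `validate_parameters({"segment_duration_sec": 45.0})` reports one problem, because 45.0 exceeds the documented maximum of 30.0.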

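Configuration Examples

Three configurations follow, in order: an audio source with default parameters, a video source whose audio track is extracted, and an audio source with non-overlapping 10-second windows over up to 300 seconds of audio.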

{
  "feature_extractor": {
    "feature_extractor_name": "audio_fingerprint_extractor",
    "version": "v1",
    "input_mappings": {
      "audio": "audio_url"
    },
    "parameters": {}
  }
}

{
  "feature_extractor": {
    "feature_extractor_name": "audio_fingerprint_extractor",
    "version": "v1",
    "input_mappings": {
      "video": "video_url"
    },
    "parameters": {}
  }
}

{
  "feature_extractor": {
    "feature_extractor_name": "audio_fingerprint_extractor",
    "version": "v1",
    "input_mappings": {
      "audio": "audio_url"
    },
    "parameters": {
      "segment_duration_sec": 10.0,
      "segment_hop_sec": 10.0,
      "max_audio_length_sec": 300.0
    }
  }
}
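
Configurations shaped like the ones above can also be assembled in code. The helper below is a hypothetical convenience, not part of any SDK; it only reproduces the JSON structure shown in this section and assumes that omitted parameters fall back to the documented defaults.

```python
import json


def build_extractor_config(source_type: str, source_field: str, parameters: dict = None) -> dict:
    """Build a feature_extractor config shaped like the examples above."""
    if source_type not in ("audio", "video"):
        raise ValueError("source_type must be 'audio' or 'video'")
    return {
        "feature_extractor": {
            "feature_extractor_name": "audio_fingerprint_extractor",
            "version": "v1",
            # Maps the extractor input ("audio" or "video") to a field on your objects.
            "input_mappings": {source_type: source_field},
            # Assumption: omitted parameters fall back to the documented defaults.
            "parameters": dict(parameters or {}),
        }
    }


config = build_extractor_config(
    "audio",
    "audio_url",
    {"segment_duration_sec": 10.0, "segment_hop_sec": 10.0, "max_audio_length_sec": 300.0},
)
print(json.dumps(config, indent=2))
```

The printed JSON matches the third example; calling `build_extractor_config("video", "video_url")` reproduces the second.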

Performance & Costs

| Metric | Value |
| --- | --- |
| Cost | See Billing & Pricing; rates come from `GET /v1/billing/pricing` |
| Model | `laion/clap-htsat-tiny` (CLAP) |
| Default coverage | First 120 s of audio (configurable up to 600 s) |

Vector Index

| Property | Value |
| --- | --- |
| Index name | `audio_fingerprint_extractor_v1_embedding` |
| Dimensions | 512 |
| Type | Dense |
| Distance metric | Cosine |
| Inference model | `laion/clap-htsat-tiny` |
| Normalization | L2 normalized (when `normalize_embeddings`) |
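
When embeddings are L2-normalized, cosine similarity reduces to a plain dot product, which is why normalization is recommended with a cosine index. The sketch below demonstrates this with toy 3-dimensional vectors standing in for 512-dimensional CLAP embeddings; the values are illustrative, not real model output.

```python
import math
from typing import Sequence


def l2_normalize(vector: Sequence[float]) -> list:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity for non-zero vectors of any length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-d vectors stand in for 512-d CLAP embeddings.
query = l2_normalize([3.0, 4.0, 0.0])
candidate = l2_normalize([4.0, 3.0, 0.0])

# For unit vectors, the dot product equals the cosine similarity.
dot_product = sum(x * y for x, y in zip(query, candidate))
```

Here `dot_product` and `cosine_similarity([3.0, 4.0, 0.0], [4.0, 3.0, 0.0])` agree up to floating-point rounding.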

Limitations

Length cap: Audio beyond max_audio_length_sec is truncated (default 120 s).
Not for transcription: Produces acoustic fingerprints, not text — pair with a transcription extractor for spoken-word search.
Segment fan-out: Overlapping windows multiply the document count per source; tune segment_hop_sec to control density.
