The audio fingerprint extractor produces 512-dimensional CLAP embeddings (Contrastive Language-Audio Pretraining, laion/clap-htsat-tiny) from audio files or the audio track of a video. It segments audio into overlapping windows, embeds each segment, and L2-normalizes the vectors for cosine similarity. Use it for sound-mark matching, audio similarity, and retrieving audio by acoustic fingerprint.
View extractor details at api.mixpeek.com/v1/collections/features/extractors/audio_fingerprint_extractor_v1 or fetch programmatically with GET /v1/collections/features/extractors/{feature_extractor_id}.
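As a minimal sketch of the programmatic lookup, the snippet below builds the GET request with Python's standard library. The `https://` scheme, the bearer-token `Authorization` header, and the helper names `build_extractor_request` and `fetch_extractor` are assumptions made for illustration; check the API reference for the actual authentication scheme.

```python
import json
import urllib.request

# Assumption: the API is served over HTTPS at the host named above.
BASE_URL = "https://api.mixpeek.com"


def build_extractor_request(feature_extractor_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the GET request for one extractor's details."""
    url = f"{BASE_URL}/v1/collections/features/extractors/{feature_extractor_id}"
    # Assumption: bearer-token authentication; the header may differ in practice.
    headers = {"Authorization": f"Bearer {api_key}"}
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_extractor(feature_extractor_id: str, api_key: str) -> dict:
    """Send the request and decode the JSON body."""
    request = build_extractor_request(feature_extractor_id, api_key)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


request = build_extractor_request("audio_fingerprint_extractor_v1", "YOUR_API_KEY")
print(request.get_method(), request.full_url)
```

Building the request separately from sending it makes the URL easy to inspect before any network call is made.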

Pipeline Steps

  1. Resolve input — apply input_mappings to get the audio or video URL.
  2. Audio extraction — if the source is video, extract the audio track.
  3. Resample — resample audio to sample_rate (48000 Hz CLAP default).
  4. Segment — split into segment_duration_sec windows hopping by segment_hop_sec (default 50% overlap); truncate beyond max_audio_length_sec.
  5. CLAP embedding — embed each segment to a 512-d vector.
  6. Normalize (if normalize_embeddings) — L2-normalize to unit vectors.
  7. Output — one document per segment with timing metadata.
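
The segmentation in step 4 can be sketched in plain Python. This is an illustration, not the extractor's implementation: the function name `segment_windows` is hypothetical, and keeping a shorter, clipped final window is an assumption, since the page does not say how the tail of the audio is handled.

```python
from typing import List, Tuple


def segment_windows(
    total_duration_sec: float,
    segment_duration_sec: float = 5.0,
    segment_hop_sec: float = 2.5,
    max_audio_length_sec: float = 120.0,
) -> List[Tuple[int, float, float]]:
    """Return (segment_index, start_time_sec, end_time_sec) for each window."""
    # Audio beyond max_audio_length_sec is truncated before windowing.
    usable = min(total_duration_sec, max_audio_length_sec)
    windows: List[Tuple[int, float, float]] = []
    index = 0
    start = 0.0
    while start < usable:
        # Assumption: a short final window is kept and clipped to the audio end.
        end = min(start + segment_duration_sec, usable)
        windows.append((index, start, end))
        if end >= usable:
            break  # the audio is fully covered
        index += 1
        start = index * segment_hop_sec
    return windows


for window in segment_windows(12.0):
    print(window)
```

With the defaults, a 12-second clip yields four windows starting at 0.0, 2.5, 5.0, and 7.5 seconds, and the last one is clipped to end at 12.0 seconds.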

When to Use

| Use Case | Description |
|---|---|
| Sound-mark matching | Detect a known jingle, sound logo, or audio cue across a corpus |
| Audio similarity | Find acoustically similar clips (music, ambience, effects) |
| Ad/asset detection | Match the audio fingerprint of an ad or asset within longer media |
| Video audio search | Search the audio track of video assets without separate extraction |

When NOT to Use

| Scenario | Recommended Alternative |
|---|---|
| Speech-to-text / transcription | A transcription extractor (e.g. Whisper-based) |
| Text semantic search over spoken words | Transcribe, then text_extractor |
| Whole-file multimodal embedding | universal_extractor / multimodal_extractor |
| Music metadata/tagging only | A classification taxonomy over fingerprints |

Input Schema

| Field | Type | Required | Description |
|---|---|---|---|
| audio | string | one of | URL or path to an audio file. Populated from input_mappings. |
| video | string | one of | URL or path to a video file; the audio track is extracted. |
{
  "audio": "s3://my-bucket/spots/jingle.wav"
}
Supported input types: AUDIO, VIDEO (max 1 each per object).

Output Schema

One document per audio segment:
| Field | Type | Description |
|---|---|---|
| audio_fingerprint_extractor_v1_embedding | float[512] | CLAP embedding (L2-normalized when enabled) |
| segment_index | integer | Segment index (0-based) |
| start_time_sec / end_time_sec | float | Segment start/end time in seconds |
| duration_sec | float | Duration of this segment (seconds) |
| total_duration_sec | float or null | Source audio duration |
| sample_rate | integer | Sample rate used for processing |
| audio_source_type | string | Source type: audio or video |
| embedding_model | string | Embedding model used (default laion/clap-htsat-tiny) |
| processing_time_ms | float | Per-segment processing time |
{
  "audio_fingerprint_extractor_v1_embedding": [0.041, -0.018, 0.092, ...],
  "segment_index": 0,
  "start_time_sec": 0.0,
  "end_time_sec": 5.0,
  "duration_sec": 5.0,
  "sample_rate": 48000,
  "audio_source_type": "audio",
  "embedding_model": "laion/clap-htsat-tiny"
}
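
Because each document carries its own timing metadata, segment-level hits can be turned into time ranges in the source audio. The sketch below assumes unit-length embeddings (so the dot product equals cosine similarity), uses toy 3-dimensional vectors in place of real 512-dimensional ones, and picks a hypothetical threshold of 0.9; `matching_ranges` is an illustrative helper, not part of the API.

```python
from typing import Dict, List, Sequence, Tuple

EMBEDDING_FIELD = "audio_fingerprint_extractor_v1_embedding"


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def matching_ranges(
    query: Sequence[float],
    segments: List[Dict],
    threshold: float = 0.9,  # hypothetical cut-off; tune on real data
) -> List[Tuple[float, float]]:
    """Merge overlapping segments whose similarity to the query passes the threshold."""
    hits = sorted(
        (s["start_time_sec"], s["end_time_sec"])
        for s in segments
        if dot(query, s[EMBEDDING_FIELD]) >= threshold
    )
    merged: List[Tuple[float, float]] = []
    for start, end in hits:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous hit: extend that range.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Toy unit vectors standing in for real embeddings.
segments = [
    {EMBEDDING_FIELD: [1.0, 0.0, 0.0], "start_time_sec": 0.0, "end_time_sec": 5.0},
    {EMBEDDING_FIELD: [0.6, 0.8, 0.0], "start_time_sec": 2.5, "end_time_sec": 7.5},
    {EMBEDDING_FIELD: [0.0, 1.0, 0.0], "start_time_sec": 5.0, "end_time_sec": 10.0},
    {EMBEDDING_FIELD: [1.0, 0.0, 0.0], "start_time_sec": 7.5, "end_time_sec": 12.5},
    {EMBEDDING_FIELD: [0.96, 0.28, 0.0], "start_time_sec": 10.0, "end_time_sec": 15.0},
]
print(matching_ranges([1.0, 0.0, 0.0], segments))
```

Merging matters because overlapping windows often produce several adjacent hits for a single occurrence of a sound mark.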

Parameters

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| segment_duration_sec | float | 5.0 | 1.0–30.0 | Duration of each audio segment (seconds). 5.0 recommended for sound-mark matching |
| segment_hop_sec | float | 2.5 | 0.5–15.0 | Hop between segments (seconds). 2.5 = 50% overlap. Set equal to segment_duration_sec for no overlap |
| sample_rate | integer | 48000 | | Target sample rate (Hz). 48000 is the CLAP default; audio is resampled before embedding |
| normalize_embeddings | boolean | true | | L2-normalize embeddings to unit vectors (recommended for cosine similarity) |
| max_audio_length_sec | float | 120.0 | 1.0–600.0 | Maximum audio length to process (seconds). Audio beyond this is truncated |
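
A client-side check against the documented defaults and ranges can catch bad values before a request is sent. This is a sketch under assumptions: the ranges are treated as inclusive, and `resolve_parameters` and `overlap_fraction` are hypothetical helpers; the service's own validation remains authoritative.

```python
from typing import Any, Dict

# Defaults and ranges copied from the parameter table.
DEFAULTS: Dict[str, Any] = {
    "segment_duration_sec": 5.0,
    "segment_hop_sec": 2.5,
    "sample_rate": 48000,
    "normalize_embeddings": True,
    "max_audio_length_sec": 120.0,
}
RANGES = {
    "segment_duration_sec": (1.0, 30.0),
    "segment_hop_sec": (0.5, 15.0),
    "max_audio_length_sec": (1.0, 600.0),
}


def resolve_parameters(overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Apply overrides on top of the defaults and check the documented ranges."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    params = {**DEFAULTS, **overrides}
    for name, (low, high) in RANGES.items():
        # Assumption: both bounds are inclusive.
        if not low <= params[name] <= high:
            raise ValueError(f"{name}={params[name]} is outside {low}-{high}")
    return params


def overlap_fraction(params: Dict[str, Any]) -> float:
    """Fraction of each window shared with the next one (0.0 means no overlap)."""
    return max(0.0, 1.0 - params["segment_hop_sec"] / params["segment_duration_sec"])


print(overlap_fraction(resolve_parameters({})))
print(overlap_fraction(resolve_parameters({"segment_hop_sec": 5.0})))
```

The defaults give an overlap fraction of 0.5, and setting the hop equal to the segment duration gives 0.0, matching the table.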

Configuration Examples

{
  "feature_extractor": {
    "feature_extractor_name": "audio_fingerprint_extractor",
    "version": "v1",
    "input_mappings": {
      "audio": "audio_url"
    },
    "parameters": {}
  }
}
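
For non-default settings, the same payload can be assembled in code. The sketch below assumes that overrides go into `parameters` as flat keys named as in the parameter table; `build_extractor_config` and the source field `video_url` are hypothetical.

```python
import json
from typing import Any, Dict


def build_extractor_config(input_name: str, source_field: str, **parameters: Any) -> Dict[str, Any]:
    """Build a feature_extractor payload with optional parameter overrides."""
    if input_name not in ("audio", "video"):
        raise ValueError("input_name must be 'audio' or 'video'")
    return {
        "feature_extractor": {
            "feature_extractor_name": "audio_fingerprint_extractor",
            "version": "v1",
            "input_mappings": {input_name: source_field},
            # Assumption: overrides are flat keys inside "parameters".
            "parameters": dict(parameters),
        }
    }


# Video input, non-overlapping 10 s windows, first 300 s of audio.
config = build_extractor_config(
    "video",
    "video_url",  # hypothetical source field name
    segment_duration_sec=10.0,
    segment_hop_sec=10.0,
    max_audio_length_sec=300.0,
)
print(json.dumps(config, indent=2))
```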

Performance & Costs

| Metric | Value |
|---|---|
| Cost | 3 credits per audio segment processed |
| Model | laion/clap-htsat-tiny (CLAP) |
| Default coverage | First 120 s of audio (configurable to 600 s) |
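
Since cost scales with segment count, a rough estimate follows from the segmentation parameters. The sketch below assumes that a trailing partial window counts as a billable segment, which this page does not confirm; `estimate_segments` and `estimate_credits` are hypothetical helpers.

```python
import math

CREDITS_PER_SEGMENT = 3  # from the table above


def estimate_segments(
    total_duration_sec: float,
    segment_duration_sec: float = 5.0,
    segment_hop_sec: float = 2.5,
    max_audio_length_sec: float = 120.0,
) -> int:
    """Estimate how many segments a source of the given length produces."""
    usable = min(total_duration_sec, max_audio_length_sec)
    if usable <= 0:
        return 0
    if usable <= segment_duration_sec:
        return 1
    # Assumption: a trailing partial window is emitted as its own segment.
    return math.ceil((usable - segment_duration_sec) / segment_hop_sec) + 1


def estimate_credits(total_duration_sec: float, **params: float) -> int:
    return CREDITS_PER_SEGMENT * estimate_segments(total_duration_sec, **params)


print(estimate_credits(30.0))
print(estimate_credits(120.0))
print(estimate_credits(120.0, segment_hop_sec=5.0))
```

Under these assumptions, a source of 120 s or longer with default settings produces 47 segments, or 141 credits; removing the overlap (hop of 5.0 s) drops that to 24 segments, or 72 credits.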

Vector Index

| Property | Value |
|---|---|
| Index name | audio_fingerprint_extractor_v1_embedding |
| Dimensions | 512 |
| Type | Dense |
| Distance metric | Cosine |
| Inference model | laion/clap-htsat-tiny |
| Normalization | L2 normalized (when normalize_embeddings) |
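
L2 normalization pairs naturally with a cosine index: once vectors have unit length, their dot product equals their cosine similarity. The sketch below demonstrates this with 2-dimensional toy vectors in place of 512-dimensional embeddings; the helper names are illustrative.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    return [x / norm for x in v]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


a, b = [3.0, 4.0], [4.0, 3.0]
unit_a, unit_b = l2_normalize(a), l2_normalize(b)
unit_dot = sum(x * y for x, y in zip(unit_a, unit_b))

# Both values agree up to floating-point rounding.
print(cosine_similarity(a, b), unit_dot)
```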

Limitations

  • Length cap: Audio beyond max_audio_length_sec is truncated (default 120 s).
  • Not for transcription: Produces acoustic fingerprints, not text — pair with a transcription extractor for spoken-word search.
  • Segment fan-out: Overlapping windows multiply the document count per source; tune segment_hop_sec to control density.