clap-htsat-fused
by laion
Contrastive Language-Audio Pretraining for audio-text retrieval
laion/clap-htsat-fused
mixpeek://audio_extractor@v1/laion_clap_fused_v1
Overview
CLAP learns aligned audio and text representations through contrastive learning, similar to how CLIP works for images and text. The HTSAT-fused variant uses the HTS-AT audio transformer fused with RoBERTa text embeddings.
On Mixpeek, CLAP enables semantic audio search: finding audio segments that match natural-language descriptions like "crowd cheering" or "rain on a roof."
Architecture
HTS-AT (Hierarchical Token-Semantic Audio Transformer) as audio encoder, RoBERTa as text encoder. Trained on AudioSet, Clotho, and other audio-text pair datasets with contrastive loss. Outputs 512-dim joint embedding space.
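Because audio and text embeddings share the same 512-dim joint space, retrieval reduces to ranking candidates by cosine similarity. A minimal sketch (plain TypeScript, no SDK dependency; the toy vectors stand in for real CLAP embeddings):

```typescript
// Cosine similarity between two embeddings. With CLAP, an audio
// embedding and a text embedding that describe the same sound
// should score higher than unrelated pairs.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

In practice the embeddings come from the extractor's output; the scoring itself is this one dot product over normalized vectors.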
Mixpeek SDK Integration
```typescript
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/audio.wav" },
  feature_extractors: [
    {
      name: "audio_embedding",
      version: "v1",
      params: { model_id: "laion/clap-htsat-fused" },
    },
  ],
});
```
Capabilities
- Audio-text cross-modal retrieval
- 512-dimensional audio embeddings
- Zero-shot audio classification
- Environmental sound recognition
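Zero-shot classification follows directly from the shared embedding space: embed each candidate label as text, score the audio embedding against all of them, and take the best match. A minimal sketch with toy vectors (real inputs would be CLAP embeddings, assumed pre-normalized so the dot product equals cosine similarity):

```typescript
// Dot product of two equal-length vectors.
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Pick the label whose text embedding is closest to the audio embedding.
// Assumes all embeddings are L2-normalized.
function classify(audioEmb: number[], labelEmbs: Map<string, number[]>): string {
  let best = "";
  let bestScore = -Infinity;
  for (const [label, emb] of labelEmbs) {
    const score = dot(audioEmb, emb);
    if (score > bestScore) {
      bestScore = score;
      best = label;
    }
  }
  return best;
}
```

This is the mechanism behind the ESC-50 zero-shot numbers below: no classifier head is trained, the class names themselves serve as the label set.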
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| ESC-50 | Accuracy (zero-shot) | 93.7% | Wu et al., 2023 — Table 2 |
| AudioCaps (text→audio) | Recall@1 | 36.7% | Wu et al., 2023 — Table 3 |
Research Paper
Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation
arxiv.org
Build a pipeline with clap-htsat-fused
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.