clap-htsat-fused
by laion
Contrastive Language-Audio Pretraining for audio-text retrieval
laion/clap-htsat-fused
mixpeek://audio_extractor@v1/laion_clap_fused_v1
Overview
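CLAP learns aligned audio and text representations through contrastive learning, much as CLIP does for images and text. The HTSAT-fused variant pairs the HTS-AT audio transformer with a RoBERTa text encoder; "fused" refers to the feature-fusion mechanism on the audio side, which combines global and local views of the spectrogram so that the model can handle variable-length clips.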
On Mixpeek, CLAP enables semantic audio search: finding audio segments that match natural-language descriptions such as "crowd cheering" or "rain on a roof."
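To make the retrieval idea concrete, here is a minimal sketch of how a text query could be matched against audio segments once both have been embedded. It is not Mixpeek's implementation: the `Segment` type, the `rankSegments` helper, the segment ids, and the 3-dimensional toy vectors (standing in for real 512-dimensional CLAP embeddings) are all assumptions made for illustration.

```typescript
// Toy sketch: rank audio segments against a text query by cosine similarity.
// All vectors are made-up 3-dim stand-ins for real 512-dim CLAP embeddings.

type Segment = { id: string; embedding: number[] };

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical helper: score every segment and keep the topK best matches.
function rankSegments(
  query: number[],
  segments: Segment[],
  topK: number
): { id: string; score: number }[] {
  return segments
    .map((s) => ({ id: s.id, score: cosineSimilarity(query, s.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

const segments: Segment[] = [
  { id: "stadium_0:12", embedding: [0.9, 0.1, 0.0] },
  { id: "storm_1:40", embedding: [0.0, 0.2, 0.9] },
  { id: "cafe_0:05", embedding: [0.5, 0.5, 0.1] },
];

// Pretend this is the text embedding of "crowd cheering".
const queryEmbedding = [1.0, 0.0, 0.1];

const results = rankSegments(queryEmbedding, segments, 2);
console.log(results);
```

In a real deployment the ranking would typically be done by a vector index rather than a linear scan, but the scoring principle is the same: text and audio live in one space, so a plain similarity measure compares them directly.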
Architecture
The audio encoder is HTS-AT (Hierarchical Token-Semantic Audio Transformer) and the text encoder is RoBERTa. The model is trained with a contrastive loss on AudioSet, Clotho, and other audio-text pair datasets. Both encoders project into a shared 512-dimensional embedding space.
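The contrastive objective can be sketched as a symmetric cross-entropy over a batch of paired embeddings: each audio clip should score highest against its own caption, and vice versa. The code below is a simplified illustration, not the training code. The function names are hypothetical, the vectors are toy 2-dimensional values, and the fixed temperature of 0.07 is an assumption (in CLIP-style training the temperature is usually a learned parameter).

```typescript
// Simplified sketch of a symmetric contrastive loss over paired embeddings.
// Row i of `audio` is assumed to be paired with row i of `text`.

function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return v.map((x) => x / norm);
}

function dotProduct(a: number[], b: number[]): number {
  return a.reduce((s, x, i) => s + x * b[i], 0);
}

// Mean cross-entropy where the correct class for row i is column i.
function diagonalCrossEntropy(logits: number[][]): number {
  let total = 0;
  logits.forEach((row, i) => {
    const max = Math.max(...row); // subtract max for numerical stability
    const logSumExp =
      max + Math.log(row.reduce((s, x) => s + Math.exp(x - max), 0));
    total += logSumExp - row[i];
  });
  return total / logits.length;
}

function symmetricContrastiveLoss(
  audio: number[][],
  text: number[][],
  temperature = 0.07 // assumed fixed value for this sketch
): number {
  const a = audio.map(l2Normalize);
  const t = text.map(l2Normalize);
  // Similarity matrices in both directions, scaled by the temperature.
  const audioToText = a.map((ai) => t.map((tj) => dotProduct(ai, tj) / temperature));
  const textToAudio = t.map((ti) => a.map((aj) => dotProduct(ti, aj) / temperature));
  return 0.5 * (diagonalCrossEntropy(audioToText) + diagonalCrossEntropy(textToAudio));
}

// Toy batch: two clips and two captions.
const audioBatch = [[1, 0], [0, 1]];
const alignedText = [[1, 0], [0, 1]]; // captions match their clips
const shuffledText = [[0, 1], [1, 0]]; // captions swapped

const alignedLoss = symmetricContrastiveLoss(audioBatch, alignedText);
const shuffledLoss = symmetricContrastiveLoss(audioBatch, shuffledText);
console.log({ alignedLoss, shuffledLoss });
```

The loss is near zero when each clip is closest to its own caption and grows large when the pairs are swapped, which is the pressure that pulls matching audio and text together in the shared space.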
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/audio.wav" },
  feature_extractors: [{
    name: "audio_embedding",
    version: "v1",
    params: {
      model_id: "laion/clap-htsat-fused"
    }
  }]
});
Capabilities
- Audio-text cross-modal retrieval
- 512-dimensional audio embeddings
- Zero-shot audio classification
- Environmental sound recognition
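Zero-shot classification follows from the shared embedding space: embed each candidate label as text, compare it with the audio embedding, and turn the similarities into probabilities. The sketch below assumes the embeddings have already been computed. The `zeroShotClassify` helper, the labels, the toy 3-dimensional vectors, and the `scale` value of 10 are all illustrative assumptions rather than part of CLAP or the Mixpeek SDK.

```typescript
// Toy sketch: zero-shot audio classification from precomputed embeddings.
// Vectors are made-up 3-dim stand-ins for real 512-dim CLAP embeddings.

type LabelEmbedding = { label: string; embedding: number[] };

function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const normA = Math.sqrt(a.reduce((s, x) => s + x * x, 0));
  const normB = Math.sqrt(b.reduce((s, x) => s + x * x, 0));
  return dot / (normA * normB);
}

function softmax(xs: number[]): number[] {
  const max = Math.max(...xs); // subtract max for numerical stability
  const exps = xs.map((x) => Math.exp(x - max));
  const sum = exps.reduce((s, x) => s + x, 0);
  return exps.map((x) => x / sum);
}

// Hypothetical helper: returns labels with probabilities, best match first.
function zeroShotClassify(
  audioEmbedding: number[],
  labels: LabelEmbedding[],
  scale = 10 // arbitrary sharpening factor chosen for this sketch
): { label: string; probability: number }[] {
  const logits = labels.map((l) => scale * cosine(audioEmbedding, l.embedding));
  const probs = softmax(logits);
  return labels
    .map((l, i) => ({ label: l.label, probability: probs[i] }))
    .sort((x, y) => y.probability - x.probability);
}

const labelEmbeddings: LabelEmbedding[] = [
  { label: "dog barking", embedding: [0.9, 0.1, 0.1] },
  { label: "rain", embedding: [0.1, 0.95, 0.1] },
  { label: "siren", embedding: [0.1, 0.1, 0.9] },
];

// Pretend this is the embedding of an unlabeled audio clip.
const clipEmbedding = [0.1, 0.9, 0.2];

const predictions = zeroShotClassify(clipEmbedding, labelEmbeddings);
console.log(predictions);
```

No classifier is trained here; changing the label set only means embedding different text, which is what makes the approach zero-shot.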
Research Paper
Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation
arxiv.org
Build a pipeline with clap-htsat-fused
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.