    HF · Audio Embeddings · apache-2.0

    clap-htsat-fused

    by laion

    Contrastive Language-Audio Pretraining for audio-text retrieval

    20.7M downloads/month
    73 likes
    154M params

    Identifiers

    Model ID: laion/clap-htsat-fused
    Feature URI: mixpeek://audio_extractor@v1/laion_clap_fused_v1

    Overview

    CLAP learns aligned audio and text representations through contrastive learning, similar to how CLIP works for images and text. The HTSAT-fused variant pairs the HTS-AT audio transformer with a RoBERTa text encoder; "fused" refers to the feature-fusion mechanism described in the paper for handling variable-length audio inputs.

    On Mixpeek, CLAP enables semantic audio search: finding audio segments that match natural language descriptions such as "crowd cheering" or "rain on a roof."
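To make the retrieval idea concrete, here is a minimal sketch of ranking audio segments against a text query by cosine similarity. It assumes the query text and each audio segment have already been embedded into the same space; `AudioSegment`, `cosineSimilarity`, and `rankSegments` are hypothetical names for illustration and are not part of the Mixpeek SDK.

```typescript
// Hypothetical shape: an audio segment with a precomputed embedding.
interface AudioSegment {
  id: string;
  embedding: number[];
}

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every segment against the query embedding and keep the best topK.
function rankSegments(
  query: number[],
  segments: AudioSegment[],
  topK: number
): { id: string; score: number }[] {
  return segments
    .map((s) => ({ id: s.id, score: cosineSimilarity(query, s.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```

A production system would normally use an approximate nearest-neighbor index rather than this linear scan; the scan is only meant to show what "matching" means in the shared embedding space.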

    Architecture

    The audio encoder is HTS-AT (Hierarchical Token-Semantic Audio Transformer) and the text encoder is RoBERTa. The model is trained with a contrastive loss on AudioSet, Clotho, and other audio-text pair datasets. Both encoders map into a shared 512-dimensional embedding space.
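The contrastive objective can be sketched as a CLIP-style symmetric cross-entropy over a batch of paired audio and text embeddings. This is an illustrative sketch, not the training code used for this model: the function names are hypothetical and the fixed `temperature` of 0.07 is an assumed value chosen for the example.

```typescript
// Scale a vector to unit length (returns a copy for the zero vector).
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return norm === 0 ? v.slice() : v.map((x) => x / norm);
}

function dot(a: number[], b: number[]): number {
  return a.reduce((s, x, i) => s + x * b[i], 0);
}

// Mean cross-entropy where the correct class for row i is column i,
// computed with a numerically stable log-sum-exp.
function diagonalCrossEntropy(logits: number[][]): number {
  let total = 0;
  for (let i = 0; i < logits.length; i++) {
    const row = logits[i];
    const max = Math.max(...row);
    const logSumExp =
      max + Math.log(row.reduce((s, x) => s + Math.exp(x - max), 0));
    total += logSumExp - row[i];
  }
  return total / logits.length;
}

// audio[i] and text[i] are assumed to be a matching pair.
function symmetricContrastiveLoss(
  audio: number[][],
  text: number[][],
  temperature = 0.07 // assumed value for illustration
): number {
  if (audio.length !== text.length) throw new Error("batch sizes differ");
  const a = audio.map(l2Normalize);
  const t = text.map(l2Normalize);
  // Pairwise cosine similarities, scaled by the temperature.
  const audioToText = a.map((ai) => t.map((tj) => dot(ai, tj) / temperature));
  const textToAudio = t.map((ti) => a.map((aj) => dot(ti, aj) / temperature));
  return (
    0.5 *
    (diagonalCrossEntropy(audioToText) + diagonalCrossEntropy(textToAudio))
  );
}
```

The loss is low when each audio clip is most similar to its own caption and high when it is more similar to another caption in the batch, which is what pulls matching pairs together in the shared space.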

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    // Replace "API_KEY" with your own key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest an audio file and embed it with CLAP.
    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/audio.wav" },
      feature_extractors: [
        {
          name: "audio_embedding",
          version: "v1",
          params: {
            model_id: "laion/clap-htsat-fused",
          },
        },
      ],
    });

    Capabilities

    • Audio-text cross-modal retrieval
    • 512-dimensional audio embeddings
    • Zero-shot audio classification
    • Environmental sound recognition
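Zero-shot classification follows from the shared embedding space: embed each candidate label as text, compare it with the audio embedding, and apply a softmax over the similarities. The sketch below assumes all embeddings are already L2-normalized, so a dot product equals cosine similarity. `LabelEmbedding`, `zeroShotClassify`, and the `logitScale` default are hypothetical and chosen only for illustration.

```typescript
// Hypothetical shape: a class label with its precomputed text embedding.
interface LabelEmbedding {
  label: string;
  embedding: number[];
}

// Returns labels sorted by descending probability.
// Assumes `audio` and every label embedding are L2-normalized.
function zeroShotClassify(
  audio: number[],
  labels: LabelEmbedding[],
  logitScale = 100 // assumed sharpening factor, not taken from the model
): { label: string; probability: number }[] {
  const logits = labels.map(
    (l) => logitScale * l.embedding.reduce((s, x, i) => s + x * audio[i], 0)
  );
  // Numerically stable softmax.
  const max = Math.max(...logits);
  const exps = logits.map((z) => Math.exp(z - max));
  const sum = exps.reduce((s, x) => s + x, 0);
  return labels
    .map((l, i) => ({ label: l.label, probability: exps[i] / sum }))
    .sort((a, b) => b.probability - a.probability);
}
```

No labeled audio is needed for this: changing the set of classes only means embedding a different list of label strings.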

    Use Cases on Mixpeek

    Sound effect search: find audio by description
    Music discovery: semantic similarity across audio tracks
    Environmental monitoring: classify ambient sounds

    Benchmarks

    Dataset                  Metric                 Score   Source
    ESC-50                   Accuracy (zero-shot)   93.7%   Wu et al., 2023, Table 2
    AudioCaps (text→audio)   Recall@1               36.7%   Wu et al., 2023, Table 3

    Performance

    Input Size: variable audio (10s chunks typical)
    Embedding Dim: 512
    GPU Latency: ~6ms / chunk (A100)
    GPU Throughput: ~165 chunks/sec (A100)
    GPU Memory: ~0.5 GB
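These figures allow a rough capacity estimate. The sketch below assumes non-overlapping 10-second chunks and the ~165 chunks/sec throughput quoted above, and it ignores audio decoding, I/O, and batching overhead. `estimateEmbeddingJob` is a hypothetical helper written for this example.

```typescript
// Rough estimate of chunk count and GPU time for a given amount of audio.
// Defaults mirror the figures quoted above; real throughput will vary.
function estimateEmbeddingJob(
  durationSeconds: number,
  chunkSeconds = 10,
  chunksPerSecond = 165
): { chunks: number; gpuSeconds: number } {
  if (durationSeconds < 0 || chunkSeconds <= 0 || chunksPerSecond <= 0) {
    throw new Error("inputs must be positive");
  }
  // A trailing partial chunk still costs one forward pass.
  const chunks = Math.ceil(durationSeconds / chunkSeconds);
  return { chunks, gpuSeconds: chunks / chunksPerSecond };
}

// One hour of audio: 360 chunks, roughly 2.2 seconds of GPU time.
const oneHour = estimateEmbeddingJob(3600);
```

The latency and throughput numbers are consistent with each other: 1000 ms divided by ~6 ms per chunk gives about 167 chunks per second, close to the quoted ~165.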

    Specification

    Framework: HF
    Organization: laion
    Feature: Audio Embeddings
    Output: 512-dim vector
    Modalities: video, audio
    Retriever: Audio Similarity
    Parameters: 154M
    License: apache-2.0
    Downloads/mo: 20.7M
    Likes: 73

    Research Paper

    Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation

    arxiv.org

    Build a pipeline with clap-htsat-fused

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
