    HF · Audio Embeddings · apache-2.0

    clap-htsat-fused

    by laion

    Contrastive Language-Audio Pretraining for audio-text retrieval

    20.4M downloads/month
    56 likes
    154M params

    Identifiers

    Model ID: laion/clap-htsat-fused
    Feature URI: mixpeek://audio_extractor@v1/laion_clap_fused_v1

    Overview

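    CLAP (Contrastive Language-Audio Pretraining) learns aligned audio and text representations through contrastive learning, much as CLIP does for images and text. The HTSAT-fused variant pairs the HTS-AT audio transformer with a RoBERTa text encoder; "fused" refers to the feature fusion mechanism the paper introduces to handle audio inputs of variable length.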

    On Mixpeek, CLAP enables semantic audio search — find audio segments matching natural language descriptions like "crowd cheering" or "rain on a roof."
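
Because audio and text share one embedding space, this kind of search reduces to a nearest-neighbor ranking. The sketch below is only an illustration of that idea, not Mixpeek's retrieval implementation: the `Segment` type and the `cosineSimilarity` and `rankSegments` helpers are hypothetical names, and it assumes the query text and the audio segments have already been embedded by the same CLAP model.

```typescript
// Hypothetical types and helpers for illustration; not part of the Mixpeek SDK.
type Segment = { id: string; embedding: number[] };

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat it as dissimilar to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank audio segments by similarity to a text query embedding, best first.
function rankSegments(
  queryEmbedding: number[],
  segments: Segment[],
  topK: number
): { id: string; score: number }[] {
  return segments
    .map((s) => ({ id: s.id, score: cosineSimilarity(queryEmbedding, s.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```

In practice the vectors would be 512-dimensional, and a vector index would usually replace the linear scan once the collection grows beyond a few thousand segments.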

    Architecture

    The audio encoder is HTS-AT (Hierarchical Token-Semantic Audio Transformer) and the text encoder is RoBERTa. The model is trained with a contrastive loss on AudioSet, Clotho, and other audio-text pair datasets, and it projects both modalities into a shared 512-dimensional embedding space.
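
To make the contrastive objective concrete, here is a minimal sketch of a symmetric, InfoNCE-style loss over a batch of paired audio and text embeddings. It is an illustration under assumptions, not the training code: the function names are hypothetical, and the fixed `temperature` of 0.07 is a placeholder for what is normally a learned parameter.

```typescript
// Scale a vector to unit length so dot products become cosine similarities.
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return norm === 0 ? v.slice() : v.map((x) => x / norm);
}

function dotProduct(a: number[], b: number[]): number {
  return a.reduce((s, x, i) => s + x * b[i], 0);
}

// Mean cross-entropy where the correct class for row i is column i.
function crossEntropyOnDiagonal(logits: number[][]): number {
  let total = 0;
  for (let i = 0; i < logits.length; i++) {
    const row = logits[i];
    // Subtract the row maximum before exponentiating for numerical stability.
    const max = row.reduce((m, x) => Math.max(m, x), -Infinity);
    const logSumExp = max + Math.log(row.reduce((s, x) => s + Math.exp(x - max), 0));
    total += logSumExp - row[i];
  }
  return total / logits.length;
}

// audio[i] and text[i] are assumed to describe the same clip.
// The temperature value is an assumption for illustration.
function contrastiveLoss(audio: number[][], text: number[][], temperature = 0.07): number {
  if (audio.length === 0 || audio.length !== text.length) {
    throw new Error("need equal, non-empty batches of audio and text embeddings");
  }
  const a = audio.map(l2Normalize);
  const t = text.map(l2Normalize);
  const audioToText = a.map((ai) => t.map((tj) => dotProduct(ai, tj) / temperature));
  const textToAudio = t.map((ti) => a.map((aj) => dotProduct(ti, aj) / temperature));
  // Average the two directions so neither modality dominates.
  return 0.5 * (crossEntropyOnDiagonal(audioToText) + crossEntropyOnDiagonal(textToAudio));
}
```

The loss is small when each audio embedding is closest to its own caption and large when it is closer to another clip's caption, which is what pulls matching pairs together in the joint space.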

    Mixpeek SDK Integration

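    import { Mixpeek } from "mixpeek";
    
    // Create a client; replace "API_KEY" with your own Mixpeek API key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });
    
    // Ingest an audio file and run the CLAP audio embedding extractor on it.
    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/audio.wav" },
      feature_extractors: [{
        name: "audio_embedding",
        version: "v1",
        params: {
          // Hugging Face model ID for this model
          model_id: "laion/clap-htsat-fused"
        }
      }]
    });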

    Capabilities

    • Audio-text cross-modal retrieval
    • 512-dimensional audio embeddings
    • Zero-shot audio classification
    • Environmental sound recognition
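
Zero-shot classification follows from the shared embedding space: embed each candidate label as text, compare it with the audio embedding, and turn the similarities into probabilities with a softmax. The sketch below assumes the embeddings were produced elsewhere by the same CLAP model. `LabelEmbedding`, `zeroShotClassify`, and the `logitScale` default of 100 are hypothetical choices for illustration, not values taken from the model or the Mixpeek SDK.

```typescript
// Hypothetical type for illustration: a label and the embedding of its text prompt.
type LabelEmbedding = { label: string; embedding: number[] };

function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Softmax over scaled cosine similarities; results are sorted by probability.
// logitScale sharpens the distribution; 100 is an assumed value.
function zeroShotClassify(
  audioEmbedding: number[],
  labels: LabelEmbedding[],
  logitScale = 100
): { label: string; probability: number }[] {
  if (labels.length === 0) throw new Error("at least one label is required");
  const logits = labels.map((l) => logitScale * cosine(audioEmbedding, l.embedding));
  // Subtract the maximum logit before exponentiating for numerical stability.
  const maxLogit = logits.reduce((m, x) => Math.max(m, x), -Infinity);
  const exps = logits.map((x) => Math.exp(x - maxLogit));
  const sum = exps.reduce((s, x) => s + x, 0);
  return labels
    .map((l, i) => ({ label: l.label, probability: exps[i] / sum }))
    .sort((x, y) => y.probability - x.probability);
}
```

No label-specific training is involved, so the label set can change per request; the quality of the result depends on how well the label text describes the sound.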

    Use Cases on Mixpeek

    Sound effect search — find audio by description
    Music discovery — semantic similarity across audio tracks
    Environmental monitoring — classify ambient sounds

    Specification

    Framework: HF
    Organization: laion
    Feature: Audio Embeddings
    Output: 512-dim vector
    Modalities: video, audio
    Retriever: Audio Similarity
    Parameters: 154M
    License: apache-2.0
    Downloads/mo: 20.4M
    Likes: 56

    Research Paper

    Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation

    arxiv.org

    Build a pipeline with clap-htsat-fused

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
