    HF, Audio Embeddings, apache-2.0

    clap-htsat-fused

    by laion

    Contrastive Language-Audio Pretraining for audio-text retrieval

    20.4M downloads/month
    56 likes
    154M params
    Identifiers
    Model ID: laion/clap-htsat-fused
    Feature URI: mixpeek://audio_extractor@v1/laion_clap_fused_v1
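The Feature URI above appears to follow the shape mixpeek://<extractor>@<version>/<feature>. That layout is inferred from this single example and is an assumption, not a documented grammar. Under that assumption, a minimal sketch for splitting it into parts (parseFeatureUri and the FeatureUri type are hypothetical helpers, not part of the Mixpeek SDK):

```typescript
// Hypothetical helper: the URI layout is inferred from one example on this page.
type FeatureUri = { extractor: string; version: string; feature: string };

function parseFeatureUri(uri: string): FeatureUri {
  // Assumed shape: mixpeek://<extractor>@<version>/<feature>
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (!match) throw new Error(`unrecognized feature URI: ${uri}`);
  const [, extractor, version, feature] = match;
  return { extractor, version, feature };
}

const parsed = parseFeatureUri("mixpeek://audio_extractor@v1/laion_clap_fused_v1");
console.log(parsed);
```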

    Overview

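    CLAP learns aligned audio and text representations through contrastive learning, similar to how CLIP works for images and text. The HTSAT-fused variant pairs the HTS-AT audio transformer with a RoBERTa text encoder; "fused" refers to the feature-fusion mechanism described in the paper, which lets the audio encoder handle variable-length inputs.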

    On Mixpeek, CLAP enables semantic audio search: finding audio segments that match natural language descriptions such as "crowd cheering" or "rain on a roof."
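Conceptually, this kind of search embeds the text query and each audio segment into the same space and ranks segments by similarity. The sketch below illustrates the ranking step only, using cosine similarity as the metric. The 4-dimensional vectors and segment IDs are made up for illustration (real CLAP embeddings are 512-dimensional), and this is not a description of how Mixpeek implements retrieval internally.

```typescript
// Toy illustration: rank audio segments against a text query by cosine similarity.
// All vectors are hand-written stand-ins, not real CLAP outputs.
type Segment = { id: string; embedding: number[] };

function dot(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

function cosineSimilarity(a: number[], b: number[]): number {
  const denom = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b));
  if (denom === 0) throw new Error("zero-norm vector");
  return dot(a, b) / denom;
}

// Returns segments sorted from most to least similar to the query.
function rankSegments(query: number[], segments: Segment[]): { id: string; score: number }[] {
  return segments
    .map((s) => ({ id: s.id, score: cosineSimilarity(query, s.embedding) }))
    .sort((x, y) => y.score - x.score);
}

const segments: Segment[] = [
  { id: "stadium.wav#00:12", embedding: [0.9, 0.1, 0.0, 0.2] },
  { id: "storm.wav#01:30", embedding: [0.1, 0.8, 0.3, 0.0] },
  { id: "office.wav#00:05", embedding: [0.2, 0.2, 0.9, 0.1] },
];
// Stand-in for the embedding of the text query "crowd cheering".
const crowdCheeringQuery = [1.0, 0.0, 0.1, 0.1];

const ranked = rankSegments(crowdCheeringQuery, segments);
console.log(ranked.map((r) => `${r.id} ${r.score.toFixed(3)}`));
```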

    Architecture

    The audio encoder is HTS-AT (Hierarchical Token-Semantic Audio Transformer) and the text encoder is RoBERTa. The model is trained with a contrastive loss on AudioSet, Clotho, and other audio-text pair datasets. Both encoders project into a shared 512-dimensional embedding space.
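To make "contrastive loss" concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss over a batch of paired audio and text embeddings. It is illustrative only: the temperature of 0.07 and the toy 2-dimensional vectors are assumptions, and the exact training objective and hyperparameters are those given in the paper, not this code.

```typescript
// Sketch of a symmetric (audio-to-text and text-to-audio) contrastive loss.
// Matching pairs share the same batch index; temperature is an assumed value.
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  if (norm === 0) throw new Error("zero-norm vector");
  return v.map((x) => x / norm);
}

// Mean cross-entropy where the correct class for row i is column i.
function diagonalCrossEntropy(logits: number[][]): number {
  let total = 0;
  for (let i = 0; i < logits.length; i++) {
    const row = logits[i];
    const max = Math.max(...row); // subtract the max for numerical stability
    const logSumExp = max + Math.log(row.reduce((s, x) => s + Math.exp(x - max), 0));
    total += logSumExp - row[i];
  }
  return total / logits.length;
}

function symmetricContrastiveLoss(audio: number[][], text: number[][], temperature = 0.07): number {
  if (audio.length === 0 || audio.length !== text.length) {
    throw new Error("audio and text batches must be non-empty and the same size");
  }
  const a = audio.map(l2Normalize);
  const t = text.map(l2Normalize);
  // logits[i][j] = cosine(audio_i, text_j) / temperature
  const logits = a.map((ai) => t.map((tj) => ai.reduce((s, x, k) => s + x * tj[k], 0) / temperature));
  const transposed = logits[0].map((_, j) => logits.map((row) => row[j]));
  return 0.5 * (diagonalCrossEntropy(logits) + diagonalCrossEntropy(transposed));
}

const alignedLoss = symmetricContrastiveLoss([[1, 0], [0, 1]], [[1, 0], [0, 1]]);
const mismatchedLoss = symmetricContrastiveLoss([[1, 0], [0, 1]], [[0, 1], [1, 0]]);
console.log({ alignedLoss, mismatchedLoss });
```

The loss is small when each audio clip is closest to its own caption and large when pairs are swapped, which is the pressure that pulls matching audio and text together in the shared space.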

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";
    
    // Replace "API_KEY" with your own Mixpeek API key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });
    
    // Ingest an audio file and extract CLAP embeddings from it.
    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/audio.wav" },
      feature_extractors: [{
        name: "audio_embedding",
        version: "v1",
        params: {
          model_id: "laion/clap-htsat-fused"
        }
      }]
    });

    Capabilities

    • Audio-text cross-modal retrieval
    • 512-dimensional audio embeddings
    • Zero-shot audio classification
    • Environmental sound recognition
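Zero-shot classification with a model like CLAP typically works by embedding each candidate label as text, comparing the audio embedding with every label embedding, and turning the similarities into probabilities with a softmax. A minimal sketch follows. The 3-dimensional vectors, the label set, and the logit scale of 10 are illustrative assumptions, not real model outputs or settings.

```typescript
// Toy zero-shot classifier: softmax over scaled cosine similarities.
// Embeddings are hand-written stand-ins for real 512-dim CLAP vectors.
function normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  if (norm === 0) throw new Error("zero-norm vector");
  return v.map((x) => x / norm);
}

function softmax(xs: number[]): number[] {
  const max = Math.max(...xs); // subtract the max for numerical stability
  const exps = xs.map((x) => Math.exp(x - max));
  const sum = exps.reduce((s, x) => s + x, 0);
  return exps.map((e) => e / sum);
}

function zeroShotClassify(
  audio: number[],
  labelEmbeddings: Record<string, number[]>,
  logitScale = 10, // assumed value; sharpens the distribution
): { label: string; probability: number }[] {
  const a = normalize(audio);
  const labels = Object.keys(labelEmbeddings);
  const logits = labels.map((label) => {
    const t = normalize(labelEmbeddings[label]);
    return logitScale * a.reduce((s, x, i) => s + x * t[i], 0);
  });
  const probs = softmax(logits);
  return labels
    .map((label, i) => ({ label, probability: probs[i] }))
    .sort((x, y) => y.probability - x.probability);
}

const labelEmbeddings: Record<string, number[]> = {
  "dog barking": [0.9, 0.1, 0.1],
  "rain": [0.1, 0.9, 0.2],
  "traffic": [0.1, 0.2, 0.9],
};
const clipEmbedding = [0.2, 0.95, 0.1]; // stand-in for an embedded audio clip

const predictions = zeroShotClassify(clipEmbedding, labelEmbeddings);
console.log(predictions);
```

No label-specific training is involved: adding a new class only requires embedding a new text label.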

    Use Cases on Mixpeek

    Sound effect search: find audio by description
    Music discovery: semantic similarity across audio tracks
    Environmental monitoring: classify ambient sounds

    Specification

    Framework: HF
    Organization: laion
    Feature: Audio Embeddings
    Output: 512-dim vector
    Modalities: video, audio
    Retriever: Audio Similarity
    Parameters: 154M
    License: apache-2.0
    Downloads/mo: 20.4M
    Likes: 56

    Research Paper

    Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation

    arxiv.org

    Build a pipeline with clap-htsat-fused

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
