    HF · Audio Embeddings · apache-2.0

    pe-av-large

    by facebook

    Joint audio-video-text embeddings from Meta's Perception Encoder family

    4K downloads/month
    61 likes
    2.2B params
    Identifiers
    Model ID
    facebook/pe-av-large
    Feature URI
    mixpeek://audio_extractor@v1/facebook_pe_av_large_v1

    Overview

    PE-AV Large embeds audio, video, synchronized audio-video, and text into one shared retrieval space. It is useful when the same event is expressed through motion, sound, or language, such as a siren, a crowd reaction, a machine failure, or a tennis serve.

    On Mixpeek, PE-AV Large gives agents a single evidence channel for audiovisual retrieval. Instead of searching transcripts, frames, and audio fingerprints separately, an agent can retrieve clips where the sound and visual motion jointly match the query, then pass the top results to a reasoning model.

    Architecture

    Perception Encoder audio-video model with roughly 2.2B parameters. The model aligns raw audio, video frames, audio-video pairs, and text through contrastive training so cross-modal retrieval works across all supported input combinations.
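Because every modality lands in one shared space, retrieval reduces to comparing a query vector against stored clip vectors. The following is a minimal sketch of that comparison using cosine similarity. The `ClipEmbedding` type, the `rankClips` helper, and the 3-dimensional toy vectors are illustrative assumptions, not part of the Mixpeek SDK or the model's real output.

```typescript
// Hypothetical shape for a stored clip embedding; not a Mixpeek SDK type.
interface ClipEmbedding {
  clipId: string;
  vector: number[];
}

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Treat a zero vector as having no similarity to anything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every clip against the query and keep the k best matches.
function rankClips(
  query: number[],
  clips: ClipEmbedding[],
  k: number
): { clipId: string; score: number }[] {
  return clips
    .map((c) => ({ clipId: c.clipId, score: cosineSimilarity(query, c.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy 3-dim vectors for illustration only; real embeddings are much larger.
const clips: ClipEmbedding[] = [
  { clipId: "siren", vector: [1, 0, 0] },
  { clipId: "crowd", vector: [0, 1, 0] },
  { clipId: "serve", vector: [0.6, 0.8, 0] },
];
const top = rankClips([1, 0, 0], clips, 2);
console.log(top.map((r) => r.clipId));
```

In practice the query vector would come from embedding a text, audio, or video query with the same model, and a vector index would replace the linear scan.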

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    // Replace "API_KEY" with your Mixpeek API key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest footage from the bucket and extract PE-AV Large embeddings.
    await mx.collections.ingest({
      collection_id: "av-memory",
      source: { url: "s3://camera-footage/" },
      feature_extractors: [{
        feature: "audio_embeddings",
        model: "facebook/pe-av-large"
      }]
    });

    Capabilities

    • Text-to-video, text-to-audio, and text-to-audio-video retrieval
    • Joint embeddings for synchronized sound and motion
    • Useful for clips where audio carries the key signal
    • Apache 2.0 license

    Use Cases on Mixpeek

    • Find video moments by sound events, visual motion, or both
    • Retrieve security, sports, or broadcast clips where audio changes the meaning
    • Build agent memory over camera footage with synchronized audio
    • Use one embedding family before transcript, object, or VLM reranking
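The last use case, embedding retrieval followed by reranking, can be sketched as a two-stage pipeline: shortlist by embedding score, then reorder the shortlist with a slower scorer. The `Candidate` type, the `Reranker` signature, and the toy scores below are hypothetical stand-ins; a real reranker might be a transcript match, an object detector, or a VLM.

```typescript
// Hypothetical candidate returned by the embedding stage.
interface Candidate {
  clipId: string;
  embeddingScore: number;
}

// Hypothetical second-stage scorer; higher means more relevant.
type Reranker = (clipId: string) => number;

// Stage 1: keep the best `shortlistSize` clips by embedding score.
// Stage 2: reorder only that shortlist with the reranker.
function retrieveThenRerank(
  candidates: Candidate[],
  shortlistSize: number,
  rerank: Reranker
): string[] {
  const shortlist = [...candidates]
    .sort((a, b) => b.embeddingScore - a.embeddingScore)
    .slice(0, shortlistSize);
  return shortlist
    .map((c) => ({ clipId: c.clipId, score: rerank(c.clipId) }))
    .sort((a, b) => b.score - a.score)
    .map((c) => c.clipId);
}

// Toy data for illustration only.
const candidates: Candidate[] = [
  { clipId: "a", embeddingScore: 0.91 },
  { clipId: "b", embeddingScore: 0.88 },
  { clipId: "c", embeddingScore: 0.42 },
  { clipId: "d", embeddingScore: 0.15 },
];
const toyScores: Record<string, number> = { a: 0.2, b: 0.9, c: 0.5, d: 1.0 };
const finalOrder = retrieveThenRerank(candidates, 3, (id) => toyScores[id] ?? 0);
console.log(finalOrder);
```

Note that clip "d" never reaches the reranker even though its toy rerank score is the highest: the shortlist size bounds both the cost of the second stage and the recall of the whole pipeline.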

    Performance

    Input Size: Audio, video, audio-video, or text input
    Embedding Dim: Model dependent
    GPU Latency: Input dependent
    GPU Throughput: Batch by clip for best throughput
    GPU Memory: ~5 GB plus serving overhead
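As a rough sanity check on the GPU memory figure, weight memory can be estimated as parameter count times bytes per parameter. The sketch below assumes 16-bit weights, which this page does not state, and ignores activations, buffers, and framework overhead; `weightMemoryGB` is a hypothetical helper.

```typescript
// Estimate weight memory in decimal gigabytes (1 GB = 1e9 bytes).
function weightMemoryGB(paramCount: number, bytesPerParam: number): number {
  return (paramCount * bytesPerParam) / 1e9;
}

const PARAMS = 2.2e9; // roughly 2.2B parameters, per this page

// Assumption: 2 bytes per parameter (fp16 or bf16).
const fp16GB = weightMemoryGB(PARAMS, 2);
// For comparison: 4 bytes per parameter (fp32).
const fp32GB = weightMemoryGB(PARAMS, 4);

console.log(fp16GB.toFixed(1), fp32GB.toFixed(1));
```

Under the 16-bit assumption the weights alone come to about 4.4 GB, which is in line with the listed ~5 GB once some working memory is added.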

    Specification

    Framework: HF
    Organization: facebook
    Feature: Audio Embeddings
    Output: 512-dim vector
    Modalities: video, audio
    Retriever: Audio Similarity
    Parameters: 2.2B
    License: apache-2.0
    Downloads/mo: 4K
    Likes: 61

    Research Paper

    PE Audio Video

    arxiv.org

    Build a pipeline with pe-av-large

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
