pe-av-large
by facebook
Joint audio-video-text embeddings from Meta's Perception Encoder family
facebook/pe-av-large
mixpeek://audio_extractor@v1/facebook_pe_av_large_v1
Overview
PE-AV Large embeds audio, video, synchronized audio-video, and text into one shared retrieval space. It is useful when the same event is expressed through motion, sound, or language, such as a siren, a crowd reaction, a machine failure, or a tennis serve.
On Mixpeek, PE-AV Large gives agents a single evidence channel for audiovisual retrieval. Instead of searching transcripts, frames, and audio fingerprints separately, an agent can retrieve clips where the sound and visual motion jointly match the query, then pass the top results to a reasoning model.
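To make the retrieval step concrete, here is a minimal sketch of ranking clips in a shared embedding space by cosine similarity and keeping the top matches to hand to a reasoning model. It does not call the Mixpeek API: the `ClipEmbedding` shape, the `topK` helper, and the 3-dimensional toy vectors are all assumptions for illustration, standing in for real embeddings produced by the model.

```typescript
// Hypothetical record shape: one embedding per indexed clip.
interface ClipEmbedding {
  clipId: string;
  modality: "audio" | "video" | "audio-video";
  vector: number[];
}

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank clips against a query embedding and keep the k best matches.
function topK(
  query: number[],
  clips: ClipEmbedding[],
  k: number
): { clipId: string; score: number }[] {
  return clips
    .map((clip) => ({
      clipId: clip.clipId,
      score: cosineSimilarity(query, clip.vector),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy vectors (assumed values) in place of real model output.
const clips: ClipEmbedding[] = [
  { clipId: "siren", modality: "audio-video", vector: [0.9, 0.1, 0.0] },
  { clipId: "crowd", modality: "audio", vector: [0.1, 0.9, 0.1] },
  { clipId: "tennis", modality: "video", vector: [0.0, 0.2, 0.9] },
];

// Pretend this vector came from embedding a text query.
const queryEmbedding = [1.0, 0.0, 0.1];
const results = topK(queryEmbedding, clips, 2);
console.log(results.map((r) => r.clipId));
```

Because all modalities share one space, the same ranking function serves text-to-audio, text-to-video, and text-to-audio-video queries. In production a vector index would typically replace the linear scan.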
Architecture
PE-AV Large is a Perception Encoder audio-video model with roughly 2.2B parameters. It aligns raw audio, video frames, audio-video pairs, and text through contrastive training, so cross-modal retrieval works across all supported input combinations.
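For intuition about what contrastive alignment optimizes, the sketch below computes a generic symmetric contrastive (InfoNCE-style) loss over a small batch similarity matrix, where entry `[i][j]` compares media item `i` with caption `j` and matching pairs sit on the diagonal. This is a textbook formulation chosen for illustration. The source does not state the exact objective used to train PE-AV, and the function names, the temperature value, and the toy matrices are assumptions.

```typescript
// Row-wise cross-entropy where the correct "class" for row i is column i.
function diagonalCrossEntropy(logits: number[][]): number {
  let total = 0;
  for (let i = 0; i < logits.length; i++) {
    const row = logits[i];
    const max = Math.max(...row); // subtract the max for numerical stability
    const logSumExp =
      max + Math.log(row.reduce((sum, v) => sum + Math.exp(v - max), 0));
    total += logSumExp - row[i];
  }
  return total / logits.length;
}

// Swap rows and columns of a non-empty square matrix.
function transpose(m: number[][]): number[][] {
  return m[0].map((_, j) => m.map((row) => row[j]));
}

// Symmetric loss: average of media-to-text and text-to-media directions.
function symmetricContrastiveLoss(
  similarity: number[][],
  temperature: number
): number {
  const logits = similarity.map((row) => row.map((v) => v / temperature));
  return (
    0.5 * (diagonalCrossEntropy(logits) + diagonalCrossEntropy(transpose(logits)))
  );
}

// Toy similarity matrices (assumed values).
const alignedBatch = [
  [1.0, 0.1],
  [0.0, 1.0],
];
const misalignedBatch = [
  [0.1, 1.0],
  [1.0, 0.0],
];

const alignedLoss = symmetricContrastiveLoss(alignedBatch, 0.1);
const misalignedLoss = symmetricContrastiveLoss(misalignedBatch, 0.1);
console.log(alignedLoss < misalignedLoss);
```

The loss is low when each item is most similar to its own pair and high when it is closer to a mismatched one, which is the property that makes nearest-neighbor search across modalities meaningful.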
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "av-memory",
  source: { url: "s3://camera-footage/" },
  feature_extractors: [
    {
      feature: "audio_embeddings",
      model: "facebook/pe-av-large"
    }
  ]
});
Capabilities
- Text-to-video, text-to-audio, and text-to-audio-video retrieval
- Joint embeddings for synchronized sound and motion
- Useful for clips where audio carries the key signal
- Apache 2.0 license
Use Cases on Mixpeek
Performance
Specification
Research Paper
PE Audio Video
arxiv.org
Build a pipeline with pe-av-large
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
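One common way to combine results from several extractors in a retrieval stage is reciprocal rank fusion (RRF). The sketch below is a generic illustration of that technique, not a description of how Mixpeek merges results: the `reciprocalRankFusion` helper, the constant `k = 60`, and the clip IDs are assumptions.

```typescript
// Reciprocal rank fusion: each list contributes 1 / (k + rank) per item,
// with rank starting at 1 for the best hit in that list.
function reciprocalRankFusion(
  rankedLists: string[][],
  k: number = 60
): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const previous = scores.get(id) ?? 0;
      scores.set(id, previous + 1 / (k + index + 1));
    });
  }
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Hypothetical ranked hits from two extractors for the same query.
const audioVideoHits = ["clip-7", "clip-2", "clip-9"];
const transcriptHits = ["clip-2", "clip-4", "clip-7"];

const fused = reciprocalRankFusion([audioVideoHits, transcriptHits]);
console.log(fused.map((r) => r.id));
```

RRF uses only rank positions, so it avoids having to calibrate raw similarity scores that come from different models. Clips that appear in both lists rise above clips found by only one extractor.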