HF | Audio Embeddings | apache-2.0

pe-av-large

by facebook

Joint audio-video-text embeddings from Meta's Perception Encoder family

9K downloads/month

63 likes

2.2B params

Identifiers

Model ID

facebook/pe-av-large

Feature URI

mixpeek://audio_extractor@v1/facebook_pe_av_large_v1
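
The Feature URI above appears to follow a `mixpeek://<extractor>@<version>/<feature>` layout. The following is a minimal sketch for splitting it into parts, assuming that reading is right. The `parseFeatureUri` helper and the `FeatureUri` field names are hypothetical and are not part of the Mixpeek SDK.

```typescript
// Hypothetical helper: the URI layout is inferred from the single example on
// this page, not from a published specification.
interface FeatureUri {
  extractor: string; // e.g. "audio_extractor"
  version: string;   // e.g. "v1"
  feature: string;   // e.g. "facebook_pe_av_large_v1"
}

function parseFeatureUri(uri: string): FeatureUri {
  // scheme :// extractor @ version / feature
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) {
    throw new Error(`Unrecognized feature URI: ${uri}`);
  }
  return { extractor: match[1], version: match[2], feature: match[3] };
}

const parsed = parseFeatureUri(
  "mixpeek://audio_extractor@v1/facebook_pe_av_large_v1",
);
```

Throwing on an unrecognized string keeps a malformed URI from silently producing empty fields.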

Overview

PE-AV Large embeds audio, video, synchronized audio-video, and text into one shared retrieval space. It is useful when the same event is expressed through motion, sound, or language, such as a siren, a crowd reaction, a machine failure, or a tennis serve.

On Mixpeek, PE-AV Large gives agents a single evidence channel for audiovisual retrieval. Instead of searching transcripts, frames, and audio fingerprints separately, an agent can retrieve clips where the sound and visual motion jointly match the query, then pass the top results to a reasoning model.

Architecture

PE-AV Large is a Perception Encoder audio-video model with roughly 2.2B parameters. The model aligns raw audio, video frames, audio-video pairs, and text through contrastive training so cross-modal retrieval works across all supported input combinations.
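
Because every modality lands in one space, cross-modal retrieval reduces to nearest-neighbor search over vectors. The sketch below ranks stored clips against a query vector by cosine similarity. The 3-dimensional vectors are toy values rather than real model outputs, and `Clip`, `cosine`, and `rank` are illustrative names, not Mixpeek APIs.

```typescript
interface Clip {
  id: string;
  modality: "audio" | "video" | "audio-video";
  vector: number[];
}

// Cosine similarity between two vectors of equal length.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // treat zero vectors as unrelated
  return dot / Math.sqrt(normA * normB);
}

// Return the k clips most similar to the query, best first.
function rank(query: number[], clips: Clip[], k: number): { id: string; score: number }[] {
  return clips
    .map((clip) => ({ id: clip.id, score: cosine(query, clip.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy index: one entry per modality, all in the same (assumed) space.
const clips: Clip[] = [
  { id: "siren.wav", modality: "audio", vector: [0.9, 0.1, 0.0] },
  { id: "serve.mp4", modality: "audio-video", vector: [0.0, 0.2, 0.9] },
  { id: "crowd.mp4", modality: "video", vector: [0.1, 0.9, 0.1] },
];

// Stand-in for an embedded text query such as "siren".
const textQuery = [1, 0, 0];
const top = rank(textQuery, clips, 2);
```

The ranking code never inspects `modality`. That is the practical effect of a shared space: a text query can be compared to audio, video, and audio-video entries with the same function.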

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "audio_embeddings",
    version: "v1",
    parameters: { model_id: "facebook/pe-av-large" },
  },
});

Capabilities

Text-to-video, text-to-audio, and text-to-audio-video retrieval
Joint embeddings for synchronized sound and motion
Useful for clips where audio carries the key signal
Apache 2.0 license

Use Cases on Mixpeek

Find video moments by sound events, visual motion, or both

Retrieve security, sports, or broadcast clips where audio changes the meaning

Build agent memory over camera footage with synchronized audio

Use one embedding family before transcript, object, or VLM reranking
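
The last use case, an embedding shortlist followed by a second-stage reranker, can be sketched as a two-step function. This is an illustration under stated assumptions: the `Candidate` shape, the `Reranker` type, and the transcript-match bonus are hypothetical and do not describe how Mixpeek composes its stages.

```typescript
interface Candidate {
  clipId: string;
  embeddingScore: number; // similarity from the audiovisual embedding stage
  transcript: string;     // text from a separate transcript extractor, if any
}

// A reranker maps a shortlisted candidate to a new score.
type Reranker = (candidate: Candidate) => number;

function retrieveThenRerank(
  candidates: Candidate[],
  shortlistSize: number,
  rerank: Reranker,
): Candidate[] {
  // Stage 1: keep only the best candidates by embedding score.
  const shortlist = [...candidates]
    .sort((a, b) => b.embeddingScore - a.embeddingScore)
    .slice(0, shortlistSize);
  // Stage 2: reorder the shortlist with the reranker's score.
  return shortlist
    .map((candidate) => ({ candidate, score: rerank(candidate) }))
    .sort((a, b) => b.score - a.score)
    .map((entry) => entry.candidate);
}

const candidates: Candidate[] = [
  { clipId: "cam1-0412", embeddingScore: 0.81, transcript: "" },
  { clipId: "cam3-0977", embeddingScore: 0.78, transcript: "glass breaking near the loading dock" },
  { clipId: "cam2-0150", embeddingScore: 0.35, transcript: "glass delivery scheduled" },
];

// Toy reranker: add a fixed bonus when the transcript contains the phrase.
const transcriptReranker: Reranker = (c) =>
  c.embeddingScore + (c.transcript.includes("glass breaking") ? 0.5 : 0);

const reranked = retrieveThenRerank(candidates, 2, transcriptReranker);
```

Only the shortlist reaches the reranker, so an expensive second stage such as a VLM runs on a small, fixed number of clips per query.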

Performance

Input Size: Audio, video, audio-video, or text input

Embedding Dim: Model dependent

GPU Latency: Input dependent

GPU Throughput: Batch by clip for best throughput

GPU Memory: ~5 GB plus serving overhead
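
One reading of "batch by clip" is that segments from the same clip are sent to the model together. The sketch below groups segments that way and caps each batch at a maximum size. The `Segment` type, the `batchByClip` helper, and the cap are assumptions for illustration, not documented Mixpeek behavior.

```typescript
interface Segment {
  clipId: string;
  startSec: number;
}

// Group segments by clip, then split each group into batches of at most
// maxBatchSize so that one batch never mixes clips.
function batchByClip(segments: Segment[], maxBatchSize: number): Segment[][] {
  if (maxBatchSize < 1) throw new Error("maxBatchSize must be >= 1");
  const byClip = new Map<string, Segment[]>();
  for (const segment of segments) {
    const group = byClip.get(segment.clipId) ?? [];
    group.push(segment);
    byClip.set(segment.clipId, group);
  }
  const batches: Segment[][] = [];
  // Map preserves insertion order, so clips come out in first-seen order.
  byClip.forEach((group) => {
    for (let i = 0; i < group.length; i += maxBatchSize) {
      batches.push(group.slice(i, i + maxBatchSize));
    }
  });
  return batches;
}

const batches = batchByClip(
  [
    { clipId: "a", startSec: 0 },
    { clipId: "b", startSec: 0 },
    { clipId: "a", startSec: 10 },
    { clipId: "a", startSec: 20 },
  ],
  2,
);
```

Here clip "a" yields two batches (two segments, then one) and clip "b" yields one.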

Common Pipeline Companions

openai/whisper-large-v3

Transcript search alongside audiovisual embeddings

facebook/vjepa2-vitl-fpc64-256

Temporal motion representation for video-only evidence
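
When a companion model such as a transcript extractor returns its own ranked list, the lists need to be merged. One common option is reciprocal rank fusion (RRF), sketched below. RRF is a general technique chosen here for illustration, and this page does not state which fusion method Mixpeek uses. The clip IDs are made up, and `k = 60` is the conventional default for the constant.

```typescript
// Reciprocal rank fusion: each list contributes 1 / (k + rank) per item,
// with rank starting at 1. Items are returned by descending total score.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  rankings.forEach((ranking) => {
    ranking.forEach((id, index) => {
      const previous = scores.get(id) ?? 0;
      scores.set(id, previous + 1 / (k + index + 1));
    });
  });
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Hypothetical result lists from two retrieval channels, best first.
const audiovisualHits = ["clip-7", "clip-2", "clip-9"];
const transcriptHits = ["clip-2", "clip-4", "clip-7"];

const fused = reciprocalRankFusion([audiovisualHits, transcriptHits]);
```

RRF uses only rank positions, so the two channels do not need comparable score scales. Clips that appear in both lists, such as `clip-2` and `clip-7` here, rise above clips found by only one channel.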

Specification

Framework: HF

Organization: facebook

Feature: Audio Embeddings

Output: 512-dim vector

Modalities: video, audio

Retriever: Audio Similarity

Parameters: 2.2B

License: apache-2.0

Downloads/mo: 9K

Likes: 63

Research Paper

PE Audio Video

Build a pipeline with pe-av-large

Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.

Alternative Models

laion/clap-htsat-fused

Audio Embeddings

tsinghua-ee/WAVE-7B

Audio Embeddings

facebook/encodec_24khz

Audio Embeddings

laion/larger_clap_general

Audio Embeddings

Related in Embeddings

openai/clip-vit-large-patch14

Visual Embeddings

google/siglip-base-patch16-224

Visual Embeddings

google/siglip2-giant-opt-patch16-384

Visual Embeddings

facebook/dinov2-large

Visual Embeddings