Find any scene in your video library.
Describe what you are looking for in plain English and get the exact matches back, across video, images, audio, and documents. One multimodal API on the object storage you already use.
Bring vectors
MVS
Agent-native vector store on object storage. Dense, sparse, and BM25 search. First 1M vectors free forever.
Connect files
Managed
Managed indexing extracts scenes, faces, OCR, transcripts, and embeddings from any file type.
A performance-video agency maps the creative DNA of 170K+ scenes across 8 brands: search any moment, cluster what works, reuse the winners. See how
What the agent sees.
Every object you index becomes structured, searchable features: faces and objects in a frame, layout regions in a document, speakers in audio. These are the same features an agent queries and joins across modalities.
Person · 0.95
Face · 0.98
Handbag · 0.91
Video · 00:04:12
“A woman in a white shirt walks past a red storefront.”
transcript · “…meet me at the corner in five.”
Header · Chart · Body · Signature
PDF · resume.pdf
Header, body, charts, and signature detected as typed regions.
OCR · 1 header · 3 sections · 1 signature

Who spoke when, aligned to the transcript and the timeline.
audio · 2 speakers · matched at 00:01:30
One query across every modality.
Real questions rarely fit one feature. “Find the moment our CEO said guidance while the slide read Q4 outlook” needs a face, a spoken phrase, and on-screen text to line up at the same instant.
Mixpeek ties those features to the same object and timestamp, so an agent gets back the exact clip instead of three unrelated matches.
The CEO says “guidance” as the slide behind her reads “Q4 outlook.”
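One way to picture that alignment is as an interval intersection: each feature produces time ranges on the same video, and the answer is the window where all of them overlap. The sketch below is illustrative only. The `Hit` shape, the `co_occurring` helper, and the sample timings are assumptions made for this example, not Mixpeek's schema or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One feature match on a video timeline (hypothetical shape for this sketch)."""
    feature: str   # e.g. "face", "transcript", "ocr"
    start: float   # seconds from the start of the video
    end: float


def co_occurring(*tracks: list[Hit]) -> list[tuple[float, float]]:
    """Return the time windows in which every track has a hit at the same instant."""
    if not tracks:
        return []
    windows = [(h.start, h.end) for h in tracks[0]]
    for track in tracks[1:]:
        next_windows = []
        for w_start, w_end in windows:
            for h in track:
                start, end = max(w_start, h.start), min(w_end, h.end)
                if start < end:  # keep only genuine overlaps
                    next_windows.append((start, end))
        windows = next_windows
    return sorted(windows)


# Made-up timings, in seconds, for the three signals in the example question.
faces = [Hit("face", 80.0, 95.0), Hit("face", 200.0, 210.0)]                # CEO on screen
speech = [Hit("transcript", 90.0, 91.5), Hit("transcript", 300.0, 301.0)]   # "guidance" spoken
slides = [Hit("ocr", 85.0, 120.0)]                                          # slide reads "Q4 outlook"

clip = co_occurring(faces, speech, slides)
print(clip)  # [(90.0, 91.5)]
```

Only the window starting at 90 seconds survives, because it is the one place where the face, the spoken word, and the slide text are all present together.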
One install. Two paths.
Most retrieval stacks mean gluing together a vector DB, a file pipeline, and an agent layer. Mixpeek is one install with two ways in.
Bring embeddings
Plugs into your existing stack.
Connect your storage, point Mixpeek at it, and every file becomes searchable by what's inside it. No migration, no code changes.

Mux
Every Mux upload becomes searchable by face, scene, transcript, and on-screen text, with no manual tagging.
View integration →
Backblaze B2
S3-compatible extraction at 1/5th the cost. Store on B2, extract with Mixpeek, zero egress fees.
View integration →
Iconik
Every asset in your DAM becomes findable by what's inside it: scenes, faces, spoken words, on-screen text.
View integration →
In production right now.
Visual search across 45k artworks
Upload any image and find visually similar paintings across 45,000+ artworks, or just describe what you're looking for. Hybrid image and text retrieval, ranked with RRF.
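For readers unfamiliar with RRF (Reciprocal Rank Fusion), here is a minimal sketch of the general technique: each result list contributes 1 / (k + rank) to an item's score, and items are sorted by the total. The function name `rrf`, the artwork IDs, and the constant k = 60 (a commonly used default) are assumptions for illustration, not details of how the gallery demo is configured.

```python
from collections import defaultdict


def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists with Reciprocal Rank Fusion: score = sum of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists from an image-similarity search and a text search.
image_hits = ["art_17", "art_03", "art_42"]
text_hits = ["art_03", "art_99", "art_17"]

fused = rrf([image_hits, text_hits])
print([doc_id for doc_id, _ in fused])  # ['art_03', 'art_17', 'art_99', 'art_42']
```

Because RRF uses only rank positions, it can combine image and text results without putting their raw similarity scores on a common scale.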
Try gallery search →
Posters that learn your taste
Like or dislike movie posters and watch the grid adapt to your taste in real time. Interaction signals feed learned fusion so recommendations improve from usage.
Try movie personalization →
Face search across video
Drop in a headshot and find every clip a person appears in across 63 video ads and 2,600+ faces. Full trace for takedown evidence.
Try face search →
Turn any file into searchable features.
Connect a bucket and these pipelines run as they are. Every one is documented and open source in the extractor cookbook.
Multimodal (Video/Audio/Image)
Video · Image · Audio · Text
Unified embeddings for video, audio, image, and text. Scene and silence chunking, Whisper transcription, thumbnails.
Universal All-in-One
Any file
One extractor for image, video, audio, and documents. Auto-detects modality and applies the right pipeline.
Image Embeddings (SigLIP)
Image · PDF
Dense 768-D image embeddings with Google SigLIP for text-to-image search in one contrastive space.
Text Embeddings (E5-Large)
Text
Multilingual dense text embeddings with E5-Large for semantic search and RAG out of the box.
Multi-File Object Embeddings (Gemini)
Any file
Embed ALL files of an object (images, PDFs, video, audio, text) into one 3072-D Gemini vector.
Face Identity (SCRFD + ArcFace)
Image · Video · PDF
Production face recognition that detects, aligns, and embeds faces to 512-D ArcFace vectors.
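Once faces are embedded as vectors, finding a person is typically a nearest-neighbor comparison. The sketch below shows that idea with cosine similarity, using toy 4-D vectors in place of 512-D embeddings. The `match_faces` helper, the ID format, and the 0.5 threshold are hypothetical choices for illustration, not Mixpeek's implementation or a recommended setting.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_faces(query: list[float], index: dict[str, list[float]],
                threshold: float = 0.5) -> list[tuple[str, float]]:
    """Return (face_id, score) pairs at or above the threshold, best match first."""
    scored = [(face_id, cosine(query, vec)) for face_id, vec in index.items()]
    return sorted((s for s in scored if s[1] >= threshold), key=lambda s: -s[1])


# Toy 4-D vectors stand in for 512-D face embeddings; IDs are made up.
index = {
    "ad_12@00:00:07": [0.9, 0.1, 0.0, 0.4],
    "ad_31@00:00:22": [0.0, 1.0, 0.2, 0.0],
    "ad_45@00:01:03": [0.8, 0.2, 0.1, 0.5],
}
headshot = [1.0, 0.1, 0.0, 0.5]

matches = match_faces(headshot, index)
print([face_id for face_id, _ in matches])  # ['ad_12@00:00:07', 'ad_45@00:01:03']
```

A production system would use an approximate nearest-neighbor index rather than a full scan, and would tune the threshold against labeled data.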
Free vectors. Usage-based indexing.
Two ways to pay. MVS is priced by the vector and starts with 1M free. Managed is priced by the object, with credits covering extraction, embedding, and indexing.
Bring your own embeddings and pay by the vector. Dense, sparse, and BM25 search on your own object storage, with no expiration on the free tier.
Start with MVS
Bring raw objects and pay by the object. Credits cover extraction, embedding, indexing, enrichment, and retrieval.
Start with Managed
Dedicated infrastructure, self-hosted options, SSO, SLA, security reviews, and hands-on architecture support.
Talk to us
Common questions.
Do I have to move my data?
No. Mixpeek reads from your existing S3, GCS, R2, Azure, or S3-compatible bucket. Your storage stays the system of record, and nothing leaves your cloud.
How fast is retrieval?
Hybrid queries (dense, sparse, and BM25) return in well under 100ms p95, even with vectors persisted on object storage rather than held in RAM.
Do I need embeddings to start?
No. Bring your own vectors with MVS, or point Managed at raw files and it generates embeddings and features for you.
What can Managed extract?
Faces, scenes, transcripts, OCR, labels, and embeddings from video, images, audio, PDFs, and documents, all indexed at the object level.
Can I self-host?
Yes. Deploy in your own cloud (BYO-Cloud) with SOC 2 and HIPAA-ready controls, SSO, audit trails, and namespaces.
How does pricing work?
MVS starts free with 1M vectors and no expiration. Managed is billed in usage-based credits that cover extraction, embedding, indexing, and retriever execution.
