Multimodal Recipes

Ready-to-use solutions for building multimodal applications

Documents where a chart contradicts a text claim

Identify documents where the data shown in a chart conflicts with a statement made in the nearby text. Useful for fact-checking reports in finance, healthcare, and journalism.

Modalities:

document

Feature Extractors:

pdf-extraction

chart-graph-extraction

...
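The comparison step of this recipe can be sketched roughly as follows. This is a minimal illustration, not Mixpeek's API: it assumes the extractors have already produced a numeric series for the chart and a nearby sentence, and the function names and keyword lists are hypothetical stand-ins for a real claim-detection model.

```python
import re
from typing import Optional, Sequence

# Hypothetical keyword lists; a production pipeline would likely use an NLP model.
UP_WORDS = frozenset({"increased", "rose", "grew", "growth"})
DOWN_WORDS = frozenset({"decreased", "fell", "declined", "dropped"})


def chart_direction(values: Sequence[float]) -> str:
    """Classify a chart series as 'up', 'down', or 'flat' from its endpoints."""
    if len(values) < 2 or values[-1] == values[0]:
        return "flat"
    return "up" if values[-1] > values[0] else "down"


def claim_direction(sentence: str) -> Optional[str]:
    """Return 'up' or 'down' if the sentence uses exactly one kind of trend word."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    up = bool(tokens & UP_WORDS)
    down = bool(tokens & DOWN_WORDS)
    if up == down:  # no trend word, or both kinds: treat as ambiguous
        return None
    return "up" if up else "down"


def contradicts(values: Sequence[float], sentence: str) -> bool:
    """True when the sentence claims a trend opposite to the chart's trend."""
    claimed = claim_direction(sentence)
    observed = chart_direction(values)
    return claimed is not None and observed != "flat" and claimed != observed
```

Comparing only the first and last points is a deliberate simplification; a noisy series would call for a fitted slope instead.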

Clips with a specific object and a related spoken keyword

Search for clips where a specific object is visible while a related keyword is spoken. This combines object detection with speech-to-text and keyword analysis.

Modalities:

video

audio

Feature Extractors:

object-detection

video-transcription

...
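One way to picture the combination step is a temporal join between the two extractor outputs. The sketch below assumes each extractor yields labelled time spans in seconds; the `Span` type and `co_occurrences` function are hypothetical names for illustration, not part of any Mixpeek SDK.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Span:
    """A labelled time range in seconds (assumed extractor output shape)."""
    start: float
    end: float
    label: str


def co_occurrences(
    detections: List[Span], words: List[Span], obj: str, keyword: str
) -> List[Tuple[float, float]]:
    """Return time ranges where `obj` is on screen while `keyword` is spoken."""
    hits = []
    for d in detections:
        if d.label != obj:
            continue
        for w in words:
            if w.label.lower() != keyword.lower():
                continue
            # The overlap of two ranges is [max of starts, min of ends].
            start, end = max(d.start, w.start), min(d.end, w.end)
            if start < end:
                hits.append((start, end))
    return sorted(hits)
```

The nested loop is quadratic, which is fine for a single clip; for long archives, sorting both lists and sweeping them together would scale better.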

Clips where someone is talking but no person is visible

Use speech-to-text to detect narration or dialogue, then cross-reference with object detection to confirm that no person is visible in the clip.

Modalities:

video

audio

Feature Extractors:

video-transcription

object-detection
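The cross-referencing step amounts to interval subtraction: take the time ranges that contain speech and remove the ranges where a person was detected. The following is a minimal sketch under the assumption that both extractors return `(start, end)` pairs in seconds; `speech_without_person` is a hypothetical helper, not a Mixpeek function.

```python
from typing import List, Tuple

Interval = Tuple[float, float]


def speech_without_person(
    speech: List[Interval], person: List[Interval]
) -> List[Interval]:
    """Return the parts of `speech` not covered by any `person` interval."""
    result: List[Interval] = []
    blockers = sorted(person)
    for start, end in sorted(speech):
        cursor = start  # earliest point of this segment not yet accounted for
        for p_start, p_end in blockers:
            if p_end <= cursor:
                continue  # blocker ends before the uncovered part begins
            if p_start >= end:
                break  # blockers are sorted by start, so none of the rest overlap
            if p_start > cursor:
                result.append((cursor, p_start))  # gap before this blocker
            cursor = max(cursor, p_end)
            if cursor >= end:
                break
        if cursor < end:
            result.append((cursor, end))  # tail after the last blocker
    return result
```

For example, a ten-second narration with a person on screen from 2 to 4 and from 6 to 7 seconds leaves three voice-over-only ranges: 0 to 2, 4 to 6, and 7 to 10.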

Scenes with fast movement, loud sounds, and dim lighting

Identify high-action scenes by combining action recognition with sound event detection for loud noises and scene classification for lighting conditions.

Modalities:

video

audio

Feature Extractors:

action-recognition

audio-event-detection

...

Moments where a person gestures while speaking a command word

Pinpoint interactive moments by detecting specific physical gestures alongside key spoken command words using action recognition and speech-to-text.

Modalities:

video

audio

Feature Extractors:

action-recognition

video-transcription

On-screen text with narration and background music

Detect scenes where on-screen text, human narration, and background music occur simultaneously using OCR, speech-to-text, and audio classification.

Modalities:

video

audio

text

Feature Extractors:

image-text-extraction

video-transcription

...
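Finding moments where all three signals occur at once can be expressed as an intersection of interval lists. The sketch below assumes each extractor returns sorted, non-overlapping `(start, end)` ranges in seconds; `intersect` and `simultaneous` are illustrative names rather than Mixpeek API calls.

```python
from functools import reduce
from typing import List, Tuple

Interval = Tuple[float, float]


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersect two sorted, non-overlapping interval lists with a two-pointer sweep."""
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first; it cannot overlap anything later.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def simultaneous(first: List[Interval], *rest: List[Interval]) -> List[Interval]:
    """Return the time ranges present in every track."""
    return reduce(intersect, rest, first)
```

Because the result of `intersect` is itself sorted and non-overlapping, the same function folds over any number of tracks, so adding a fourth signal needs no new logic.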

Segments with angry expressions and negative phrases

Find moments of conflict or frustration by analyzing facial expressions for anger and cross-referencing them with negative keywords from the transcript.

Modalities:

video

audio

Feature Extractors:

face-grouping

keyword-extraction

...

Frames with multiple people arguing and high visual activity

Isolate heated moments by identifying multiple speakers arguing through speaker diarization and audio event detection, combined with high visual activity from action recognition.

Modalities:

video

audio

Feature Extractors:

speaker-diarization

audio-event-detection

...

What are Mixpeek Recipes?

Mixpeek recipes are practical blueprints for multimodal search. They show how to combine multiple feature extractors to answer complex, high-value questions that traditional keyword search cannot answer on its own.

Composable Blueprints

Each recipe names the modalities and feature extractors it combines, showing how the pieces compose to answer real-world questions across your data.

Unlock Multimodal Search

Go beyond simple keyword search. Recipes demonstrate how to query across modalities, such as matching spoken words with visual elements, to find precise moments.

Practical & Actionable

Get inspired by real-world use cases. From fact-checking financial reports to analyzing video content, recipes provide actionable patterns you can adapt and use today.