Object decomposition is the first operation in a multimodal data warehouse. A single file, such as a video, document, or audio recording, is analyzed by one or more feature extractors to produce multiple independent semantic representations. Each representation (a feature) has its own embedding space and can be queried independently through retrieval pipelines.
When a file is ingested into Mixpeek, it passes through a collection's configured feature extractors. A video might produce scene embeddings (CLIP), face identities (ArcFace), logo detections (SigLIP), speech transcripts (Whisper), and audio fingerprints. Each feature is stored as a separate vector with metadata linking it back to the source object, timestamp, and spatial location.
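A minimal sketch of what such decomposed feature records could look like. The field names and `FeatureRecord` type are hypothetical illustrations, not Mixpeek's actual schema; the point is that each extractor's output is an independent vector carrying provenance metadata (source object, timestamp, spatial location) back to the original file:

```python
from dataclasses import dataclass

@dataclass
class FeatureRecord:
    source_object: str        # id of the ingested file
    extractor: str            # which feature extractor produced this vector
    vector: list              # embedding in that extractor's own space
    timestamp_s: float = None # temporal anchor within the media, if any
    bbox: tuple = None        # (x, y, w, h) spatial location, if any

# One video decomposes into several independent feature records.
video_id = "vid_123"
features = [
    FeatureRecord(video_id, "clip_scene", [0.12, -0.48], timestamp_s=4.0),
    FeatureRecord(video_id, "arcface_identity", [0.91, 0.02],
                  timestamp_s=4.0, bbox=(40, 60, 120, 180)),
    FeatureRecord(video_id, "whisper_transcript", [0.33, 0.57], timestamp_s=3.2),
]

# Each feature can be queried in its own retrieval pipeline,
# then joined back to the source object via the shared id.
by_extractor = {f.extractor: f for f in features}
```

Because every record keeps its `source_object` reference, results from any single embedding space can be resolved back to the original file and its position in time or frame.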