Core Concepts

Mixpeek is the multimodal data warehouse — a system that decomposes unstructured objects into queryable features, stores them across cost tiers, and reassembles them through multi-stage retrieval pipelines. This page introduces the warehouse primitives: the resources that make this work.

The six objects, in one sentence each

If you remember nothing else, remember these:

Object	In plain English
Namespace	Your project/environment boundary — everything lives inside one (think “database”).
Bucket	Where your raw files land before processing (think “inbox” / S3 folder).
Object	One raw item in a bucket — a video, image, PDF, or JSON record.
Collection	A recipe that turns objects into searchable documents by running the pipeline for the features you pick.
Feature	What you want extracted per modality (e.g. `video_search`, `faces`) — the plain-language capability you ask for. See Features.
Extractor	The model pipeline a feature runs under the hood (a Gemini embedder, Whisper, CLIP, …). You rarely name one directly — features resolve to extractors — but it’s what a `feature_uri` points at when you search. See Extractors.
Retriever	Your query pipeline — composable stages (search → filter → rerank → …) that return results.

The flow: you put objects in a bucket, a collection extracts the features you picked to produce searchable documents, and a retriever queries those documents — all inside a namespace.

Figure: Mixpeek flow. Objects in a bucket are processed by a collection's feature pipeline into documents and queried by a retriever, all within a namespace.

Entities & Relationships

The full resource model, including the operational and enrichment layers:

Layer	Entity	What it Represents	Related APIs
Isolation	Organization / API Key	Authentication boundary (`Authorization: Bearer …`)	API Keys
Isolation	Namespace	Tenant or environment boundary (`X-Namespace`)	Namespaces
Storage	Bucket	Schema-validated container for objects	Ingest Data
Storage	Object	Logical record referencing blobs (files/JSON)	Ingest Data
Processing	Batch	Submission that feeds objects into collection processing	Ingest Data
Processing	Collection	Document store + feature extraction recipe	Extract Features
Processing	Feature	What you want extracted per modality — resolved to a versioned pipeline internally	Features (advanced pipeline config: Feature Extractors)
Retrieval	Retriever	Stage-based search pipeline	Retrievers
Enrichment	Taxonomy	Retrieval-backed enrichment recipe (flat or hierarchical)	Taxonomies
Enrichment	Cluster	Vector-based grouping and enrichment artifacts	Clusters
Operations	Task	Status wrapper for asynchronous jobs	Tasks
Operations	Webhook	Event notification subscription	Webhooks

Dual-ID Multi-Tenancy

Mixpeek separates authentication from authorization by using two IDs per organization:

organization_id – Short, user-facing identifier returned in API responses
internal_id – 24-character key used inside services, task payloads, and database documents

Namespaces are the primary isolation boundary. Every API request must include X-Namespace unless your organization has a single shared namespace. Enforced rules:

All MongoDB collections index on namespace_id
Each namespace maps to a dedicated MVS namespace (ns_<namespace_id>)
Redis keys and Ray jobs include namespace prefixes
Cross-namespace queries are not permitted by design
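As a rough illustration of the isolation rules above, the following sketch builds the two request headers named on this page and derives the MVS namespace name from a namespace ID. The helper names `build_headers` and `mvs_namespace_name` are hypothetical and not part of any Mixpeek SDK; only the header names and the `ns_<namespace_id>` pattern come from this page.

```python
from typing import Dict, Optional


def build_headers(api_key: str, namespace: Optional[str] = None) -> Dict[str, str]:
    """Build request headers (hypothetical helper, not an SDK function).

    Authorization carries the API key; X-Namespace scopes the request.
    The namespace may be omitted only when the organization has a single
    shared namespace, as described above.
    """
    headers = {"Authorization": f"Bearer {api_key}"}
    if namespace is not None:
        headers["X-Namespace"] = namespace
    return headers


def mvs_namespace_name(namespace_id: str) -> str:
    """Apply the ns_<namespace_id> naming pattern quoted above."""
    return f"ns_{namespace_id}"


# Example values below are placeholders, not real credentials or IDs.
headers = build_headers("sk_example", "ns_dev")
```

Keeping the namespace as an explicit argument, rather than a global default, makes it harder to send a request to the wrong tenant by accident.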

Object → Document Lineage

Ingestion separates raw objects from processed documents so you can run multiple extraction tiers without duplicating data. Every document tracks:

{
  "root_object_id": "obj_video_123",
  "root_bucket_id": "bkt_marketing",
  "source_type": "collection",
  "source_collection_id": "col_frames",
  "source_document_id": "doc_frame_050",
  "lineage_path": "bkt_marketing/col_frames/col_scenes/col_highlights",
  "processing_tier": 3
}

Tier 0 – Raw object in the bucket
Tier N – Document produced by another collection (source_type = "collection")
The lineage_path is a denormalized materialized path for fast queries
Collections respect dependency tiers during extraction so downstream collections only execute when inputs are ready

Use the Object Decomposition Tree endpoint to inspect the entire lineage for a given object.
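A minimal sketch of how a client might read the lineage fields shown above. It assumes the `lineage_path` is a `/`-separated string whose first segment is the bucket and whose remaining segments are collections, which is what the sample document suggests; `parse_lineage_path` and the `Lineage` type are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Lineage:
    bucket_id: str
    collection_ids: Tuple[str, ...]


def parse_lineage_path(path: str) -> Lineage:
    """Split a materialized path into its bucket and collection segments.

    Assumption: the first segment is the bucket, the rest are collections.
    """
    parts = [p for p in path.split("/") if p]
    if not parts:
        raise ValueError("lineage_path is empty")
    return Lineage(bucket_id=parts[0], collection_ids=tuple(parts[1:]))


# The sample document from this page.
doc = {
    "root_object_id": "obj_video_123",
    "root_bucket_id": "bkt_marketing",
    "lineage_path": "bkt_marketing/col_frames/col_scenes/col_highlights",
    "processing_tier": 3,
}

lineage = parse_lineage_path(doc["lineage_path"])
# In this sample, the number of collection segments matches processing_tier.
depth = len(lineage.collection_ids)
```

In the sample, three collection segments line up with `processing_tier` 3; treat that correspondence as an observation about this example, not a documented guarantee.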

Feature URIs

Every feature a collection produces is addressed with a version-pinned URI that references the internal pipeline that produced it:

mixpeek://{extractor_name}@{version}/{output_name}

Examples:

mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1
mixpeek://image_extractor@v1/google_siglip_base_v1
mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding

You don’t construct these names yourself — discover the URIs a collection emits with GET /v1/collections/{collection_id}/features. Feature URIs are referenced by retriever stages (feature_uri), taxonomies, clustering jobs, and analytics. They guarantee query-time model compatibility with the ingestion pipeline.
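To make the URI template concrete, here is a small parsing sketch. It assumes the extractor name contains no `@` or `/` and the version contains no `/`, which holds for the examples above; `parse_feature_uri` and `FeatureURI` are hypothetical names, not SDK functions.

```python
import re
from typing import NamedTuple

# Mirrors the template mixpeek://{extractor_name}@{version}/{output_name}
_FEATURE_URI = re.compile(
    r"^mixpeek://(?P<extractor>[^@/]+)@(?P<version>[^/]+)/(?P<output>.+)$"
)


class FeatureURI(NamedTuple):
    extractor_name: str
    version: str
    output_name: str


def parse_feature_uri(uri: str) -> FeatureURI:
    """Split a feature URI into its three components, or raise ValueError."""
    match = _FEATURE_URI.match(uri)
    if match is None:
        raise ValueError(f"not a feature URI: {uri!r}")
    return FeatureURI(match["extractor"], match["version"], match["output"])


parsed = parse_feature_uri(
    "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1"
)
# parsed.extractor_name == "text_extractor", parsed.version == "v1"
```

Parsing is useful for logging or validation; the URIs themselves should still come from the features endpoint instead of being assembled by hand.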

TaskStatusEnum Standard

All asynchronous operations—batches, clustering jobs, taxonomy materialization, namespace migrations—report status using the shared TaskStatusEnum:

PENDING → PROCESSING → COMPLETED
            ↘           ↘
            FAILED      COMPLETED_WITH_ERRORS

Terminal statuses are COMPLETED, COMPLETED_WITH_ERRORS, FAILED, and CANCELED (CANCELED is not drawn in the diagram above); poll until the task reports any of these (see Tasks). Additional lifecycle values include IN_PROGRESS, SKIPPED, UNKNOWN, DRAFT, ACTIVE, ARCHIVED, and SUSPENDED. Use the Tasks API for short-term polling; for long-running workflows, check status on the resource itself (e.g., the batch or cluster).
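The polling rule can be sketched as a loop that stops on any terminal status. This is an illustration only: `fetch_status` is a hypothetical callable that you would implement against the Tasks API, and the interval and timeout defaults are arbitrary choices, not documented values.

```python
import time
from typing import Callable

# The four terminal statuses listed on this page.
TERMINAL_STATUSES = frozenset(
    {"COMPLETED", "COMPLETED_WITH_ERRORS", "FAILED", "CANCELED"}
)


def wait_for_terminal(
    fetch_status: Callable[[], str],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status() until it returns a terminal status.

    Raises TimeoutError if no terminal status is seen within timeout_s.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task still {status} after {timeout_s}s")
        sleep(interval_s)


# Simulated task that moves through the states in the diagram.
_states = iter(["PENDING", "PROCESSING", "COMPLETED_WITH_ERRORS"])
final = wait_for_terminal(lambda: next(_states), interval_s=0.0, sleep=lambda s: None)
```

Callers should treat COMPLETED_WITH_ERRORS as a distinct outcome from COMPLETED and inspect the task for partial failures.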

Caching Signatures

Mixpeek uses deterministic signatures to avoid stale results:

Collection index signatures hash document count, vector dimensions, and schema state
Retriever caches incorporate the collection signature to invalidate automatically
Stage-level caches speed up pipelines that reuse expensive stages (e.g., KNN → rerank)
The inference cache short-circuits repeated embedding requests for identical inputs

Learn more in Caching.
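The idea behind signature-based invalidation can be illustrated with a short sketch. This is not Mixpeek's actual hashing scheme: the use of SHA-256, the JSON serialization, and the function names `collection_signature` and `retriever_cache_key` are all assumptions made for illustration. Only the inputs (document count, vector dimensions, schema state) come from this page.

```python
import hashlib
import json
from typing import Any, Dict


def _stable_hash(payload: Dict[str, Any]) -> str:
    """Hash a dict deterministically by serializing with sorted keys."""
    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


def collection_signature(
    document_count: int, vector_dimensions: int, schema: Dict[str, Any]
) -> str:
    """Signature over the three inputs named above (assumed scheme)."""
    return _stable_hash(
        {
            "document_count": document_count,
            "vector_dimensions": vector_dimensions,
            "schema": schema,
        }
    )


def retriever_cache_key(
    retriever_id: str, query: Dict[str, Any], collection_sig: str
) -> str:
    """Cache key that changes whenever the collection signature changes."""
    return _stable_hash(
        {"retriever_id": retriever_id, "query": query, "collection": collection_sig}
    )


sig_before = collection_signature(1000, 768, {"title": "string"})
sig_after = collection_signature(1001, 768, {"title": "string"})  # one new document
key_before = retriever_cache_key("ret_example", {"text": "cats"}, sig_before)
key_after = retriever_cache_key("ret_example", {"text": "cats"}, sig_after)
```

Because the collection signature is folded into the retriever key, adding a document yields a different key, so old cached results are never looked up again and no explicit invalidation step is needed.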

Putting It Together

Namespace
 └── Bucket
      ├── Object (Tier 0)
      └── Batch → Collection (Tier 1)
               └── Collection (Tier 2)
                    └── ...

Documents retain lineage to the original object (root_object_id)
Enrichment layers (taxonomies, clustering) augment documents in place
Retrievers run on namespace-scoped data, returning results with presigned URLs, metrics, and cache hints

With these concepts in mind you can navigate deeper sections of the docs—whether you’re planning ingestion schemas, designing retriever pipelines, or wiring observability for production deployments.
