Mixpeek is the multimodal data warehouse — a system that decomposes unstructured objects into queryable features, stores them across cost tiers, and reassembles them through multi-stage retrieval pipelines. This page introduces the warehouse primitives: the resources that make this work.

The six objects, in one sentence each

If you remember nothing else, remember these:
Object | In plain English
Namespace | Your project/environment boundary; everything lives inside one (think “database”).
Bucket | Where your raw files land before processing (think “inbox” / S3 folder).
Object | One raw item in a bucket: a video, image, PDF, or JSON record.
Collection | A recipe that turns objects into searchable documents by running an extractor.
Extractor | The model that converts a file into vectors + metadata (e.g. multimodal_extractor for video).
Retriever | Your query pipeline: composable stages (search → filter → rerank → …) that return results.
The flow: you put objects in a bucket, a collection runs an extractor over them to produce searchable documents, and a retriever queries those documents — all inside a namespace.
[Figure: Mixpeek flow. Objects in a bucket are processed by a collection's extractor into documents, which a retriever queries, all within a namespace.]

Entities & Relationships

The full resource model, including the operational and enrichment layers:
Layer | Entity | What it Represents | Related APIs
Isolation | Organization / API Key | Authentication boundary (Authorization: Bearer …) | API Keys
Isolation | Namespace | Tenant or environment boundary (X-Namespace) | Namespaces
Storage | Bucket | Schema-validated container for objects | Ingest Data
Storage | Object | Logical record referencing blobs (files/JSON) | Ingest Data
Processing | Batch | Submission that feeds objects into extractors | Ingest Data
Processing | Collection | Document store + feature extraction recipe | Extract Features
Processing | Feature Extractor | Reusable pipeline component that emits features | Feature Extractors
Retrieval | Retriever | Stage-based search pipeline | Retrievers
Enrichment | Taxonomy | Retrieval-backed enrichment recipe (flat or hierarchical) | Taxonomies
Enrichment | Cluster | Vector-based grouping and enrichment artifacts | Clusters
Operations | Task | Status wrapper for asynchronous jobs | Tasks
Operations | Webhook | Event notification subscription | Webhooks

Dual-ID Multi-Tenancy

Mixpeek separates authentication from authorization by using two IDs per organization:
  • organization_id – Short, user-facing identifier returned in API responses
  • internal_id – 24-character key used inside services, task payloads, and database documents
Namespaces are the primary isolation boundary. Every API request must include X-Namespace unless your organization has a single shared namespace. Enforced rules:
  • All MongoDB collections index on namespace_id
  • Each namespace maps to a dedicated MVS namespace (ns_<namespace_id>)
  • Redis keys and Ray jobs include namespace prefixes
  • Cross-namespace queries are not permitted by design
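The isolation rules above can be sketched as a small helper that assembles per-request headers. This is only an illustration: the header names (Authorization: Bearer and X-Namespace) and the ns_<namespace_id> pattern come from this page, while the function names build_headers and mvs_namespace, and the example key and namespace values, are hypothetical and not part of any Mixpeek SDK.

```python
from typing import Dict, Optional


def build_headers(api_key: str, namespace: Optional[str] = None) -> Dict[str, str]:
    """Build request headers: bearer auth plus an optional X-Namespace.

    Hypothetical helper. Only the two header names are taken from the docs.
    """
    if not api_key:
        raise ValueError("api_key is required")
    headers = {"Authorization": f"Bearer {api_key}"}
    # Omit X-Namespace only when the organization has a single shared namespace.
    if namespace:
        headers["X-Namespace"] = namespace
    return headers


def mvs_namespace(namespace_id: str) -> str:
    """Apply the documented ns_<namespace_id> naming pattern."""
    return f"ns_{namespace_id}"


# Example values are placeholders, not real credentials or IDs.
headers = build_headers("demo-key", "staging")
```

Keeping the namespace as an explicit argument, rather than a module-level default, makes it harder to send a request to the wrong tenant by accident.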

Object → Document Lineage

Ingestion separates raw objects from processed documents so you can run multiple extraction tiers without duplicating data. Every document tracks:
{
  "root_object_id": "obj_video_123",
  "root_bucket_id": "bkt_marketing",
  "source_type": "collection",
  "source_collection_id": "col_frames",
  "source_document_id": "doc_frame_050",
  "lineage_path": "bkt_marketing/col_frames/col_scenes/col_highlights",
  "processing_tier": 3
}
  • Tier 0 – Raw object in the bucket
  • Tier N – Document produced by another collection (source_type = "collection")
  • The lineage_path is a denormalized materialized path for fast queries
  • Collections respect dependency tiers during extraction so downstream collections only execute when inputs are ready
Use the Object Decomposition Tree endpoint to inspect the entire lineage for a given object.
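As a rough illustration of how the lineage fields fit together, the sketch below splits a lineage_path into its bucket and collection segments. The field names come from the example document above. The helper parse_lineage is hypothetical, and the check that the number of collection segments matches processing_tier is an assumption that happens to hold for the example, not a documented guarantee.

```python
from typing import Dict, List, Union

# Lineage fields copied from the example document on this page.
example_doc = {
    "root_object_id": "obj_video_123",
    "root_bucket_id": "bkt_marketing",
    "source_type": "collection",
    "source_collection_id": "col_frames",
    "source_document_id": "doc_frame_050",
    "lineage_path": "bkt_marketing/col_frames/col_scenes/col_highlights",
    "processing_tier": 3,
}


def parse_lineage(doc: dict) -> Dict[str, Union[str, int, List[str]]]:
    """Split the materialized path into a bucket ID and ordered collection IDs."""
    bucket_id, *collection_ids = doc["lineage_path"].split("/")
    if bucket_id != doc["root_bucket_id"]:
        raise ValueError("lineage_path does not start at root_bucket_id")
    return {
        "bucket_id": bucket_id,
        "collection_ids": collection_ids,
        # Assumption: one path segment per processing tier.
        "depth": len(collection_ids),
    }


lineage = parse_lineage(example_doc)
```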

Feature URIs

Every feature emitted by an extractor is addressed with a URI:
mixpeek://{extractor_name}@{version}/{output_name}
Examples:
  • mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1
  • mixpeek://image_extractor@v1/google_siglip_base_v1
  • mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding
Feature URIs are referenced by collections (output schemas), retriever stages (feature_uri), taxonomies, clustering jobs, and analytics. They guarantee query-time model compatibility with the ingestion pipeline.
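A minimal parsing sketch for the URI template above. The template itself is from this page. The regular expression's character rules (no "@" or "/" inside the extractor name, no "/" inside the version) and the names FeatureURI and parse_feature_uri are assumptions made for illustration.

```python
import re
from typing import NamedTuple

# Assumed grammar for mixpeek://{extractor_name}@{version}/{output_name}.
FEATURE_URI_RE = re.compile(
    r"^mixpeek://(?P<extractor>[^@/]+)@(?P<version>[^/]+)/(?P<output>.+)$"
)


class FeatureURI(NamedTuple):
    extractor_name: str
    version: str
    output_name: str


def parse_feature_uri(uri: str) -> FeatureURI:
    """Split a feature URI into its three components, or raise ValueError."""
    match = FEATURE_URI_RE.match(uri)
    if match is None:
        raise ValueError(f"not a feature URI: {uri!r}")
    return FeatureURI(match["extractor"], match["version"], match["output"])


parsed = parse_feature_uri("mixpeek://image_extractor@v1/google_siglip_base_v1")
```

Comparing the parsed tuples for an ingestion-time URI and a query-time URI is one simple way to catch a model mismatch before running a retriever.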

TaskStatusEnum Standard

All asynchronous operations—batches, clustering jobs, taxonomy materialization, namespace migrations—report status using the shared TaskStatusEnum:
PENDING → PROCESSING → COMPLETED
            ↘           ↘
            FAILED      COMPLETED_WITH_ERRORS
Terminal statuses are COMPLETED, COMPLETED_WITH_ERRORS, FAILED, and CANCELED — poll until any of these (see Tasks). Additional lifecycle values include IN_PROGRESS, SKIPPED, UNKNOWN, DRAFT, ACTIVE, ARCHIVED, and SUSPENDED. Use the Tasks API for short-term polling and fall back to the resource (e.g., batch or cluster) for long-running workflows.
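The polling advice above might look like the following sketch. The set of terminal statuses is taken from this page. Everything else is hypothetical: poll_until_terminal is not a Mixpeek function, and fetch_status stands in for whatever call returns the current status string (for example, a request to the Tasks API).

```python
import time
from typing import Callable

TERMINAL_STATUSES = frozenset(
    {"COMPLETED", "COMPLETED_WITH_ERRORS", "FAILED", "CANCELED"}
)


def poll_until_terminal(
    fetch_status: Callable[[], str],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch_status until it returns a terminal status or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task still {status} after {timeout_s}s")
        sleep(interval_s)


# Simulated status sequence in place of a real API call.
_statuses = iter(["PENDING", "PROCESSING", "COMPLETED_WITH_ERRORS"])
final_status = poll_until_terminal(
    lambda: next(_statuses), interval_s=0, sleep=lambda _: None
)
```

Treating COMPLETED_WITH_ERRORS as terminal but distinct from COMPLETED lets the caller decide whether partial results are acceptable.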

Caching Signatures

Mixpeek uses deterministic signatures to avoid stale results:
  • Collection index signatures hash document count, vector dimensions, and schema state
  • Retriever caches incorporate the collection signature to invalidate automatically
  • Stage-level caches speed up pipelines that reuse expensive stages (KNN → rerank)
  • The inference cache short-circuits repeated embedding requests for identical inputs
Learn more in Caching.
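To make the invalidation idea concrete, here is a sketch of a deterministic signature and a cache key that embeds it. The inputs (document count, vector dimensions, schema state) follow the list above. The hashing scheme (SHA-256 over canonical JSON), the function names, and the example values are assumptions for illustration and do not describe Mixpeek's actual implementation.

```python
import hashlib
import json


def _digest(payload: dict) -> str:
    """Hash a dict deterministically via canonical (sorted-key) JSON."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def collection_signature(document_count: int, vector_dimensions: int, schema: dict) -> str:
    """Signature that changes whenever the indexed state changes."""
    return _digest(
        {
            "document_count": document_count,
            "vector_dimensions": vector_dimensions,
            "schema": schema,
        }
    )


def retriever_cache_key(retriever_id: str, query: dict, signature: str) -> str:
    """Cache key that embeds the collection signature, so stale entries never match."""
    return _digest(
        {"retriever_id": retriever_id, "query": query, "collection_signature": signature}
    )


# Hypothetical values: adding one document yields a new signature and a new key.
sig_before = collection_signature(100, 768, {"title": "string"})
sig_after = collection_signature(101, 768, {"title": "string"})
key_before = retriever_cache_key("ret_demo", {"text": "red shoes"}, sig_before)
key_after = retriever_cache_key("ret_demo", {"text": "red shoes"}, sig_after)
```

Because the signature is part of the key, no explicit purge is needed: entries written against the old index state simply stop being looked up.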

Putting It Together

Namespace
 └── Bucket
      ├── Object (Tier 0)
      └── Batch → Collection (Tier 1)
               └── Collection (Tier 2)
                    └── ...
  • Documents retain lineage to the original object (root_object_id)
  • Enrichment layers (taxonomies, clustering) augment documents in place
  • Retrievers run on namespace-scoped data, returning results with presigned URLs, metrics, and cache hints
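The tree above implies an execution order: a collection can only run once its source has produced output. The sketch below derives tiers and a valid order from a mapping of each collection to its source. The mapping format, the assumption of a single source per collection, and the names tier_of and extraction_order are illustrative; the IDs reuse the lineage example from earlier on this page.

```python
from graphlib import TopologicalSorter  # Python 3.9+
from typing import Dict, List

# Hypothetical config: collection ID -> source ID (a bucket or another collection).
sources: Dict[str, str] = {
    "col_frames": "bkt_marketing",
    "col_scenes": "col_frames",
    "col_highlights": "col_scenes",
}


def tier_of(collection_id: str, sources: Dict[str, str]) -> int:
    """Count hops back to the bucket: tier 1 reads raw objects directly."""
    tier, current, seen = 0, collection_id, set()
    while current in sources:
        if current in seen:
            raise ValueError(f"cycle detected at {current}")
        seen.add(current)
        tier += 1
        current = sources[current]
    return tier


def extraction_order(sources: Dict[str, str]) -> List[str]:
    """Order collections so each runs after the collection it depends on."""
    graph = {
        cid: ({src} if src in sources else set())  # buckets are not graph nodes
        for cid, src in sources.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = extraction_order(sources)
tiers = {cid: tier_of(cid, sources) for cid in sources}
```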
With these concepts in mind you can navigate deeper sections of the docs—whether you’re planning ingestion schemas, designing retriever pipelines, or wiring observability for production deployments.