> ## Documentation Index
> Fetch the complete documentation index at: https://docs.mixpeek.com/docs/llms.txt
> Use this file to discover all available pages before exploring further.

# Core Concepts

> Understand the building blocks that power Mixpeek

Mixpeek is the **multimodal data warehouse** — a system that decomposes unstructured objects into queryable features, stores them across cost tiers, and reassembles them through multi-stage retrieval pipelines. This page introduces the warehouse primitives: the resources that make this work.

## The six objects, in one sentence each

If you remember nothing else, remember these:

| Object         | In plain English                                                                                |
| -------------- | ----------------------------------------------------------------------------------------------- |
| **Namespace**  | Your project/environment boundary — everything lives inside one (think "database").             |
| **Bucket**     | Where your raw files land before processing (think "inbox" / S3 folder).                        |
| **Object**     | One raw item in a bucket — a video, image, PDF, or JSON record.                                 |
| **Collection** | A recipe that turns objects into searchable **documents** by running an extractor.              |
| **Extractor**  | The model that converts a file into vectors + metadata (e.g. `multimodal_extractor` for video). |
| **Retriever**  | Your query pipeline — composable stages (search → filter → rerank → …) that return results.     |

The flow: you put **objects** in a **bucket**, a **collection** runs an **extractor** over them to produce searchable **documents**, and a **retriever** queries those documents — all inside a **namespace**.

<Frame>
  <img src="https://mintcdn.com/mixpeek/TwtTrae3Fi3EFJ72/assets/mixpeek-flow.svg?fit=max&auto=format&n=TwtTrae3Fi3EFJ72&q=85&s=ae6474d56d2179ffd66e0576c18e64a6" alt="Mixpeek flow: objects in a bucket are processed by a collection's extractor into documents, queried by a retriever, all within a namespace" width="1000" height="500" data-path="assets/mixpeek-flow.svg" />
</Frame>

## Entities & Relationships

The full resource model, including the operational and enrichment layers:

| Layer      | Entity                     | What it Represents                                        | Related APIs                                                   |
| ---------- | -------------------------- | --------------------------------------------------------- | -------------------------------------------------------------- |
| Isolation  | **Organization / API Key** | Authentication boundary (`Authorization: Bearer …`)       | [API Keys](/api-reference/organization-api-keys/list-api-keys) |
| Isolation  | **Namespace**              | Tenant or environment boundary (`X-Namespace`)            | [Namespaces](/vector-store/namespaces)                         |
| Storage    | **Bucket**                 | Schema-validated container for objects                    | [Ingest Data](/platform/data-model)                            |
| Storage    | **Object**                 | Logical record referencing blobs (files/JSON)             | [Ingest Data](/platform/data-model)                            |
| Processing | **Batch**                  | Submission that feeds objects into extractors             | [Ingest Data](/platform/data-model)                            |
| Processing | **Collection**             | Document store + feature extraction recipe                | [Extract Features](/platform/processing)                       |
| Processing | **Feature Extractor**      | Reusable pipeline component that emits features           | [Feature Extractors](/processing/feature-extractors)           |
| Retrieval  | **Retriever**              | Stage-based search pipeline                               | [Retrievers](/retrieval/retrievers)                            |
| Enrichment | **Taxonomy**               | Retrieval-backed enrichment recipe (flat or hierarchical) | [Taxonomies](/enrichment/taxonomies)                           |
| Enrichment | **Cluster**                | Vector-based grouping and enrichment artifacts            | [Clusters](/enrichment/clusters)                               |
| Operations | **Task**                   | Status wrapper for asynchronous jobs                      | [Tasks](/processing/tasks)                                     |
| Operations | **Webhook**                | Event notification subscription                           | [Webhooks](/platform/operations#webhooks)                      |

## Dual-ID Multi-Tenancy

Mixpeek separates authentication from authorization by using two IDs per organization:

* **`organization_id`** – Short, user-facing identifier returned in API responses
* **`internal_id`** – 24-character key used inside services, task payloads, and database documents

Namespaces are the primary isolation boundary. Every API request must include the `X-Namespace` header unless your organization has a single shared namespace. Isolation is enforced at each layer:

* All MongoDB collections index on `namespace_id`
* Each namespace maps to a dedicated [MVS](https://mixpeek.com/mvs) namespace (`ns_<namespace_id>`)
* Redis keys and Ray jobs include namespace prefixes
* Cross-namespace queries are not permitted by design
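
As a minimal sketch of how a client might scope requests, the helper below builds the two headers named on this page (`Authorization: Bearer …` and `X-Namespace`) and derives the MVS namespace name from the `ns_<namespace_id>` pattern. The function names, the example key, and the example namespace value are hypothetical; whether `X-Namespace` takes a namespace name or ID is not specified here, so treat the value as a placeholder.

```python
from typing import Dict, Optional


def build_headers(api_key: str, namespace: Optional[str] = None) -> Dict[str, str]:
    """Build request headers: bearer auth plus an optional namespace scope.

    Hypothetical helper, not part of any Mixpeek SDK.
    """
    if not api_key:
        raise ValueError("api_key is required")
    headers = {"Authorization": f"Bearer {api_key}"}
    # Omit the header only when the organization has a single shared namespace.
    if namespace:
        headers["X-Namespace"] = namespace
    return headers


def mvs_namespace_name(namespace_id: str) -> str:
    """Apply the `ns_<namespace_id>` naming pattern described above."""
    return f"ns_{namespace_id}"


# Placeholder values for illustration only.
headers = build_headers("sk_example_key", namespace="marketing-prod")
print(headers)
print(mvs_namespace_name("abc123"))
```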

## Object → Document Lineage

Ingestion separates raw objects from processed documents so you can run multiple extraction tiers without duplicating data. Every document tracks:

```json theme={null}
{
  "root_object_id": "obj_video_123",
  "root_bucket_id": "bkt_marketing",
  "source_type": "collection",
  "source_collection_id": "col_frames",
  "source_document_id": "doc_frame_050",
  "lineage_path": "bkt_marketing/col_frames/col_scenes/col_highlights",
  "processing_tier": 3
}
```

* **Tier 0** – Raw object in the bucket
* **Tier N** – Document produced by another collection (`source_type = "collection"`)
* The `lineage_path` is a denormalized materialized path for fast queries
* Collections respect dependency tiers during extraction so downstream collections only execute when inputs are ready

Use the [Object Decomposition Tree](/api-reference/document-lineage/get-decomposition-tree-visualization) endpoint to inspect the entire lineage for a given object.
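
To illustrate why a materialized path is convenient, the sketch below splits a `lineage_path` into its bucket and collection chain using only string operations. It assumes the first segment is always the bucket and every later segment is a collection, and that the chain length matches `processing_tier`. Both assumptions hold for the example document above but are not stated as rules on this page. The `Lineage` class and `parse_lineage_path` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Lineage:
    bucket_id: str
    collection_chain: List[str]

    @property
    def depth(self) -> int:
        # Number of collections the document has passed through.
        return len(self.collection_chain)

    def passes_through(self, collection_id: str) -> bool:
        return collection_id in self.collection_chain


def parse_lineage_path(path: str) -> Lineage:
    """Split 'bucket/col_a/col_b/...' into its bucket and collection chain."""
    parts = [segment for segment in path.split("/") if segment]
    if not parts:
        raise ValueError("lineage_path is empty")
    return Lineage(bucket_id=parts[0], collection_chain=parts[1:])


doc = {
    "root_bucket_id": "bkt_marketing",
    "lineage_path": "bkt_marketing/col_frames/col_scenes/col_highlights",
    "processing_tier": 3,
}

lineage = parse_lineage_path(doc["lineage_path"])
print(lineage.bucket_id)                     # bkt_marketing
print(lineage.depth)                         # 3
print(lineage.passes_through("col_scenes"))  # True
```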

## Feature URIs

Every feature emitted by an extractor is addressed with a URI:

```
mixpeek://{extractor_name}@{version}/{output_name}
```

Examples:

* `mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1`
* `mixpeek://image_extractor@v1/google_siglip_base_v1`
* `mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding`

Feature URIs are referenced by collections (output schemas), retriever stages (`feature_uri`), taxonomies, clustering jobs, and analytics. Because each URI pins the extractor and its version, a query is matched against features produced by the same model that ran at ingestion.
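
The URI template can be split into its three parts with a regular expression. This is a sketch for client-side validation, not Mixpeek's own parser: the `FeatureURI` type and `parse_feature_uri` function are hypothetical, and the set of characters allowed in each segment is an assumption (anything except `@` and `/`).

```python
import re
from typing import NamedTuple


class FeatureURI(NamedTuple):
    extractor_name: str
    version: str
    output_name: str


# Mirrors mixpeek://{extractor_name}@{version}/{output_name}.
# Allowed characters per segment are assumed, not documented.
_FEATURE_URI = re.compile(
    r"^mixpeek://(?P<extractor_name>[^@/]+)@(?P<version>[^@/]+)/(?P<output_name>[^@/]+)$"
)


def parse_feature_uri(uri: str) -> FeatureURI:
    match = _FEATURE_URI.match(uri)
    if match is None:
        raise ValueError(f"not a valid feature URI: {uri!r}")
    return FeatureURI(**match.groupdict())


parsed = parse_feature_uri("mixpeek://image_extractor@v1/google_siglip_base_v1")
print(parsed.extractor_name)  # image_extractor
print(parsed.version)         # v1
print(parsed.output_name)     # google_siglip_base_v1
```

Comparing the parsed `extractor_name` and `version` of a retriever stage against the collection's output schema is one way to catch a model mismatch before a query runs.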

## TaskStatusEnum Standard

All asynchronous operations—batches, clustering jobs, taxonomy materialization, namespace migrations—report status using the shared `TaskStatusEnum`:

```
PENDING → PROCESSING → COMPLETED
            ↘           ↘
            FAILED      COMPLETED_WITH_ERRORS
```

Terminal statuses are `COMPLETED`, `COMPLETED_WITH_ERRORS`, `FAILED`, and `CANCELED` (the last is not shown in the diagram above). Poll until the task reaches any of these (see [Tasks](/processing/tasks)). Additional lifecycle values include `IN_PROGRESS`, `SKIPPED`, `UNKNOWN`, `DRAFT`, `ACTIVE`, `ARCHIVED`, and `SUSPENDED`. Use the [Tasks API](/processing/tasks) for short-term polling and fall back to the resource itself (e.g., the batch or cluster) for long-running workflows.
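
A polling loop over the terminal statuses might look like the sketch below. It deliberately takes a `fetch_status` callable rather than calling a specific endpoint, because the request shape of the Tasks API is not covered on this page. `wait_for_terminal`, its parameters, and the simulated status sequence are all hypothetical.

```python
import time
from typing import Callable

# Terminal statuses as listed above.
TERMINAL_STATUSES = frozenset(
    {"COMPLETED", "COMPLETED_WITH_ERRORS", "FAILED", "CANCELED"}
)


def wait_for_terminal(
    fetch_status: Callable[[], str],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
) -> str:
    """Call fetch_status until it returns a terminal status or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task still {status} after {timeout_s}s")
        time.sleep(interval_s)


# Simulated task; in practice fetch_status would query the task or resource.
simulated = iter(["PENDING", "PROCESSING", "PROCESSING", "COMPLETED_WITH_ERRORS"])
final = wait_for_terminal(lambda: next(simulated), interval_s=0.0)
print(final)  # COMPLETED_WITH_ERRORS
```

Treating `COMPLETED_WITH_ERRORS` as terminal but distinct from `COMPLETED` lets the caller decide whether partial results are acceptable.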

## Caching Signatures

Mixpeek uses deterministic signatures to avoid stale results:

* Collection index signatures hash document count, vector dimensions, and schema state
* Retriever caches incorporate the collection signature to invalidate automatically
* Stage-level caches speed up pipelines that reuse expensive stages (KNN → rerank)
* The inference cache short-circuits repeated embedding requests for identical inputs

Learn more in [Caching](/overview/caching).
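
The idea behind signature-based invalidation can be sketched as follows: hash the inputs that define the index state, then embed that hash in every cache key so that any change to the collection yields a different key. The actual hashing scheme and key layout Mixpeek uses are not documented on this page. SHA-256 over canonical JSON, and the function names below, are assumptions chosen for illustration.

```python
import hashlib
import json


def collection_signature(document_count: int, vector_dimensions: int, schema: dict) -> str:
    """Deterministic hash of the index state (assumed inputs and encoding)."""
    payload = json.dumps(
        {
            "document_count": document_count,
            "vector_dimensions": vector_dimensions,
            "schema": schema,
        },
        sort_keys=True,           # key order must not affect the hash
        separators=(",", ":"),    # fixed whitespace for a canonical form
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def retriever_cache_key(retriever_id: str, query: str, signature: str) -> str:
    """Cache key that changes whenever the collection signature changes."""
    query_hash = hashlib.sha256(query.encode("utf-8")).hexdigest()
    return f"{retriever_id}:{signature}:{query_hash}"


schema = {"title": "string", "embedding": "vector"}
before = collection_signature(1000, 768, schema)
after = collection_signature(1001, 768, schema)  # one document was added

key_before = retriever_cache_key("ret_demo", "red shoes", before)
key_after = retriever_cache_key("ret_demo", "red shoes", after)
print(key_before != key_after)  # True: the old entry is never read again
```

With this design no explicit invalidation step is needed; stale entries simply stop matching and can expire on their own.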

## Putting It Together

```
Namespace
 └── Bucket
      ├── Object (Tier 0)
      └── Batch → Collection (Tier 1)
               └── Collection (Tier 2)
                    └── ...
```

* Documents retain lineage to the original object (`root_object_id`)
* Enrichment layers (taxonomies, clustering) augment documents in place
* Retrievers run on namespace-scoped data, returning results with presigned URLs, metrics, and cache hints

With these concepts in mind you can navigate deeper sections of the docs—whether you’re planning ingestion schemas, designing retriever pipelines, or wiring observability for production deployments.
