# Environment Branching

> Clone namespaces, collections, retrievers, and taxonomies to create isolated staging environments and run experiments without re-processing data

## Overview

Branching an AI data environment used to mean one of two things: re-processing your entire corpus (slow, expensive) or experimenting directly in production (dangerous). Mixpeek solves this with a **clone-based branching model** that operates at every layer of the pipeline (namespace, collection, retriever, and taxonomy), so you can create fully isolated environments without re-processing data and promote changes deliberately.

This matters most when:

* You want to test a new retrieval pipeline on production data without affecting live traffic
* You need a staging namespace that mirrors production for QA without re-ingesting everything
* You're evaluating a new embedding model and want side-by-side comparison on the same corpus
* A taxonomy schema is about to change and you need a safe place to validate it first

## Branching Primitives

The core configuration of Mixpeek resources is **immutable by design**: you can't change a collection's feature extractor or a retriever's pipeline stages via PATCH, although names remain mutable. This preserves execution history, dependent results, and audit trails. Instead of editing in place, you branch with a first-class `clone` operation, available on every resource type.

### Namespace Clone (full environment branch)

The most powerful primitive. A namespace clone deep-copies the entire environment: collections (including MVS vectors), retrievers, buckets, and optionally taxonomies — remapping all internal IDs so nothing points back to production.

<CodeGroup>
  ```python Python theme={null}
  from mixpeek import Mixpeek

  client = Mixpeek(api_key="your-api-key")

  clone = client.namespaces.clone(
      "ns_prod",
      namespace_name="listings_staging",
      include_resources={
          "collections": True,   # copies MVS vectors — no reprocessing
          "retrievers": True,    # remaps all collection refs to staging copies
          "taxonomies": False    # optional: include taxonomy configs
      }
  )

  # The response returns immediately with the new namespace and a clone status.
  new_namespace_id = clone.namespace.namespace_id
  print(clone.status)            # "cloning" -> completes async to "ready"/"failed"
  print(clone.cloned_resources)  # what was copied (source_id -> cloned_id per resource)
  ```

  ```javascript JavaScript theme={null}
  import { Mixpeek } from 'mixpeek-sdk';

  const client = new Mixpeek({ apiKey: 'your-api-key' });

  const clone = await client.namespaces.clone('ns_prod', {
    namespace_name: 'listings_staging',
    include_resources: {
      collections: true,
      retrievers: true,
      taxonomies: false,
    },
  });

  const newNamespaceId = clone.namespace.namespace_id;
  console.log(clone.status);           // "cloning" -> completes async to "ready"/"failed"
  console.log(clone.cloned_resources); // source_id -> cloned_id per resource
  ```

  ```bash cURL theme={null}
  curl -X POST https://api.mixpeek.com/v1/namespaces/ns_prod/clone \
    -H "Authorization: Bearer $API_KEY" \
    -H "X-Namespace: ns_prod" \
    -H "Content-Type: application/json" \
    -d '{
      "namespace_name": "listings_staging",
      "include_resources": {
        "collections": true,
        "retrievers": true,
        "taxonomies": false
      }
    }'
  ```
</CodeGroup>

<Note>
  Namespace clone copies MVS vectors directly — your data is **not re-processed**. The clone is isolated: changes to collections, retrievers, or documents in staging have zero effect on production.
</Note>
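
Because the clone finishes asynchronously, a script that depends on the staging namespace usually has to wait for the status to leave `cloning`. The sketch below shows one way to poll, under stated assumptions: `fetch_status` is a hypothetical callable you supply (for example, a wrapper around whichever call returns the namespace's clone status), and the timeout and interval defaults are arbitrary. Only the status strings `cloning`, `ready`, and `failed` come from the example above.

```python
import time
from typing import Callable


def wait_for_clone(
    fetch_status: Callable[[], str],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the clone reports "ready", fails, or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status == "ready":
            return status
        if status == "failed":
            raise RuntimeError("namespace clone failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"clone still {status!r} after {timeout_s}s")
        sleep(interval_s)


# Demo with a stubbed status source; in practice fetch_status would call the API.
_statuses = iter(["cloning", "cloning", "ready"])
final_status = wait_for_clone(lambda: next(_statuses), sleep=lambda _: None)
print(final_status)
```

Passing `sleep` as a parameter keeps the helper testable without real delays.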

**What gets copied:**

| Resource    | What's cloned              | Notes                                   |
| ----------- | -------------------------- | --------------------------------------- |
| Collections | Metadata + MVS vectors     | Vectors copied, not recomputed          |
| Retrievers  | Full pipeline stage config | All collection refs remapped to staging |
| Buckets     | Metadata only              | S3 objects are not duplicated           |
| Taxonomies  | Config only (optional)     | Retriever refs remapped                 |
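
Since every internal ID is remapped, any application config that hard-codes production IDs needs the same translation before it can target staging. The sketch below assumes you have already flattened the `cloned_resources` field into a plain `{source_id: cloned_id}` dict; the real response shape may differ, and the IDs and config keys used here are illustrative only.

```python
from typing import Any, Mapping


def remap_ids(config: Any, id_map: Mapping[str, str]) -> Any:
    """Return a copy of config with every string value that is a source ID replaced."""
    if isinstance(config, str):
        return id_map.get(config, config)
    if isinstance(config, dict):
        # Only values are remapped; keys are left untouched.
        return {key: remap_ids(value, id_map) for key, value in config.items()}
    if isinstance(config, list):
        return [remap_ids(item, id_map) for item in config]
    return config


# Hypothetical mapping derived from a clone response.
id_map = {
    "col_content_v1": "col_content_v1_copy",
    "ret_content_search": "ret_content_search_copy",
}
prod_config = {
    "retriever_id": "ret_content_search",
    "collections": ["col_content_v1"],
    "top_k": 10,
}
staging_config = remap_ids(prod_config, id_map)
print(staging_config)
```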

### Collection Clone (swap extractor or source)

Clone a single collection and optionally override its feature extractor or source. This is the entry point for **embedding model experimentation** — run two extractors on the same corpus and compare retrieval quality.

<CodeGroup>
  ```python Python theme={null}
  # Clone with a different embedding model
  new_col = client.collections.clone(
      "col_properties",
      collection_name="properties_siglip_v2",
      feature_extractor={
          "feature_extractor_name": "image_extractor",
          "version": "v2",
          "parameters": { "model": "google_siglip_base_v1" }
      }
  )

  # Trigger reprocessing — required when changing the extractor
  client.collections.trigger(new_col.collection_id)
  ```

  ```bash cURL theme={null}
  curl -X POST https://api.mixpeek.com/v1/collections/col_properties/clone \
    -H "Authorization: Bearer $API_KEY" \
    -H "X-Namespace: $NAMESPACE_ID" \
    -H "Content-Type: application/json" \
    -d '{
      "collection_name": "properties_siglip_v2",
      "feature_extractor": {
        "feature_extractor_name": "image_extractor",
        "version": "v2",
        "parameters": { "model": "google_siglip_base_v1" }
      }
    }'
  ```
</CodeGroup>

<Note>
  When you clone a collection without changing the feature extractor, vectors are reused. When you **change the extractor**, you must trigger reprocessing — vectors are model-specific and cannot be ported across embedding spaces.
</Note>
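
The rule in the note can be encoded as a small guard so automation only triggers reprocessing when it is needed. This is a minimal sketch that assumes an override replaces the whole extractor config (rather than being merged into it) and that two configs are equivalent exactly when their dicts compare equal; `example_clip_model` is a placeholder model name.

```python
from typing import Any, Mapping, Optional


def needs_reprocessing(
    source_extractor: Mapping[str, Any],
    override: Optional[Mapping[str, Any]],
) -> bool:
    """True when the clone uses a different extractor config than its source."""
    if override is None:
        return False  # same extractor, so existing vectors can be reused
    return dict(override) != dict(source_extractor)


source = {
    "feature_extractor_name": "image_extractor",
    "version": "v1",
    "parameters": {"model": "example_clip_model"},  # placeholder name
}
candidate = {
    "feature_extractor_name": "image_extractor",
    "version": "v2",
    "parameters": {"model": "google_siglip_base_v1"},
}

print(needs_reprocessing(source, None))
print(needs_reprocessing(source, candidate))
```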

### Retriever Clone (pipeline experiment)

Because retriever stages are immutable, the safe way to test a new ranking strategy, add a rerank stage, or adjust fusion weights is to clone the retriever with overrides.

<CodeGroup>
  ```python Python theme={null}
  # Clone retriever and add an MMR rerank stage
  new_ret = client.retrievers.clone(
      "ret_ad_relevance",
      body={
          "retriever_name": "ad_relevance_mmr_v2",
          "stages": [
              {
                  "stage_name": "search",
                  "stage_type": "filter",
                  "config": {
                      "stage_id": "feature_search",
                      "parameters": {
                          "feature_uri": "mixpeek://text_extractor@v1/e5_large",
                          "query": "{{INPUT.query}}",
                          "final_top_k": 50
                      }
                  }
              },
              {
                  "stage_name": "diversify",
                  "stage_type": "sort",
                  "config": {
                      "stage_id": "mmr",
                      "parameters": { "lambda": 0.7, "top_k": 20 }
                  }
              }
          ]
      }
  )
  ```

  ```bash cURL theme={null}
  curl -X POST https://api.mixpeek.com/v1/retrievers/ret_ad_relevance/clone \
    -H "Authorization: Bearer $API_KEY" \
    -H "X-Namespace: $NAMESPACE_ID" \
    -H "Content-Type: application/json" \
    -d '{
      "retriever_name": "ad_relevance_mmr_v2",
      "stages": [...]
    }'
  ```
</CodeGroup>
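
When the experiment only adds one stage, it can be less error-prone to derive the new `stages` list from the existing one than to retype it. The helper below is an illustrative sketch, not part of the SDK: `with_stage_after` is a hypothetical name, and it assumes you already hold the source retriever's stages as a list of dicts shaped like the example above.

```python
import copy
from typing import Any

Stage = dict[str, Any]


def with_stage_after(stages: list[Stage], after_name: str, new_stage: Stage) -> list[Stage]:
    """Return a deep copy of stages with new_stage inserted after the named stage."""
    result = copy.deepcopy(stages)
    for index, stage in enumerate(result):
        if stage.get("stage_name") == after_name:
            result.insert(index + 1, copy.deepcopy(new_stage))
            return result
    raise ValueError(f"no stage named {after_name!r}")


# Stage definitions copied from the clone example above.
current_stages = [
    {
        "stage_name": "search",
        "stage_type": "filter",
        "config": {
            "stage_id": "feature_search",
            "parameters": {
                "feature_uri": "mixpeek://text_extractor@v1/e5_large",
                "query": "{{INPUT.query}}",
                "final_top_k": 50,
            },
        },
    }
]
mmr_stage = {
    "stage_name": "diversify",
    "stage_type": "sort",
    "config": {"stage_id": "mmr", "parameters": {"lambda": 0.7, "top_k": 20}},
}

experiment_stages = with_stage_after(current_stages, "search", mmr_stage)
print([stage["stage_name"] for stage in experiment_stages])
```

The deep copy means the original list is never mutated, which mirrors the clone-instead-of-modify approach used for the resources themselves.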

### Taxonomy Clone (schema version branch)

Taxonomies are immutable in their core config (`retriever_id`, `input_mappings`, `enrichment_fields`). Clone to branch a schema — swap the backing retriever, adjust the hierarchy, or test a new classification model.

<CodeGroup>
  ```python Python theme={null}
  # Branch a taxonomy to test a new classification retriever
  new_tax = client.taxonomies.clone(
      taxonomy_identifier="tax_icd_codes",
      body={
          "taxonomy_name": "icd_codes_llama_v2",
          "retriever_id": "ret_llama_classifier"  # only changed field
      }
  )
  ```

  ```bash cURL theme={null}
  curl -X POST https://api.mixpeek.com/v1/taxonomies/tax_icd_codes/clone \
    -H "Authorization: Bearer $API_KEY" \
    -H "X-Namespace: $NAMESPACE_ID" \
    -H "Content-Type: application/json" \
    -d '{
      "taxonomy_name": "icd_codes_llama_v2",
      "retriever_id": "ret_llama_classifier"
    }'
  ```
</CodeGroup>
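
The clone body above carries only the new name and the one field that changed. If you manage taxonomy configs as data, a small diff helper can produce that body for you. This is a hedged sketch: `clone_overrides` is a hypothetical helper, and the `current` config below (including `ret_gpt_classifier` and the `input_mappings` value) is made up for illustration.

```python
from typing import Any, Mapping


def clone_overrides(
    current: Mapping[str, Any],
    desired: Mapping[str, Any],
    new_name: str,
) -> dict[str, Any]:
    """Build a clone body with the new name plus only the fields that differ."""
    body: dict[str, Any] = {"taxonomy_name": new_name}
    for key, value in desired.items():
        if key != "taxonomy_name" and current.get(key) != value:
            body[key] = value
    return body


# Illustrative configs; field values are placeholders.
current = {
    "retriever_id": "ret_gpt_classifier",
    "input_mappings": [{"source": "text"}],
}
desired = {**current, "retriever_id": "ret_llama_classifier"}

body = clone_overrides(current, desired, "icd_codes_llama_v2")
print(body)
```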

***

## Common Patterns

### Pattern 1: Staging environment for a production namespace

The most common use case: a full mirror of production where QA teams can validate changes before go-live.

```
prod namespace (ns_prod)
├── col_content_v1       (CLIP embeddings, 2M docs)
├── ret_content_search   (semantic + rerank pipeline)
└── tax_iab_v3           (IAB 3.0 taxonomy)

→ clone → staging namespace (ns_staging)
├── col_content_v1_copy  (vectors copied, no reprocessing)
├── ret_content_search_copy  (points to staging collection)
└── [taxonomies excluded]
```

After cloning, engineers can modify the staging retriever pipeline, run evaluations, and only promote to prod once quality gates pass.

### Pattern 2: Embedding model A/B test

Run two embedding models on the same corpus, then compare retrieval quality with Mixpeek's [Evaluations](/retrieval/evaluations) before committing.

```python theme={null}
# Collection A — existing model (no reprocessing needed)
col_a = "col_listings_clip"   # already exists

# Collection B — new model (clone + reprocess)
col_b = client.collections.clone(
    "col_listings_clip",
    collection_name="col_listings_siglip",
    feature_extractor={
        "feature_extractor_name": "image_extractor",
        "version": "v1",
        "parameters": { "model": "google_siglip_base_v1" }
    }
)
client.collections.trigger(col_b.collection_id)

# Retriever A — current model
ret_a = "ret_property_search"

# Retriever B — points to new collection
ret_b = client.retrievers.clone(
    "ret_property_search",
    body={
        "retriever_name": "ret_property_search_siglip",
        "collection_identifiers": [col_b.collection_id]
    }
)

# Run evaluations side by side
eval_a = client.retrievers.run_evaluation("ret_property_search", dataset_id="eval_ds_001")
eval_b = client.retrievers.run_evaluation(ret_b.retriever_id, dataset_id="eval_ds_001")
```
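
To make "compare retrieval quality" concrete, here is one common offline metric, precision@k, computed for two result sets over the same labeled queries. This is a generic sketch and does not represent the Evaluations API or its output format; the query IDs, document IDs, and relevance labels are invented.

```python
from typing import Mapping, Sequence


def precision_at_k(results: Sequence[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in results[:k] if doc_id in relevant)
    return hits / k


def mean_precision_at_k(
    runs: Mapping[str, Sequence[str]],
    labels: Mapping[str, set[str]],
    k: int,
) -> float:
    """Average precision@k over every labeled query."""
    if not labels:
        raise ValueError("labels must not be empty")
    scores = [precision_at_k(runs.get(query, []), relevant, k)
              for query, relevant in labels.items()]
    return sum(scores) / len(scores)


# Invented ground truth and ranked results for two retrievers.
labels = {"q1": {"d1", "d2"}, "q2": {"d3"}}
run_a = {"q1": ["d1", "d9", "d2"], "q2": ["d8", "d3", "d7"]}
run_b = {"q1": ["d1", "d2", "d9"], "q2": ["d3", "d8", "d7"]}

score_a = mean_precision_at_k(run_a, labels, k=2)
score_b = mean_precision_at_k(run_b, labels, k=2)
print(score_a, score_b)  # 0.5 0.75
```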

### Pattern 3: Retriever pipeline experiment

Test a new retrieval stage (reranker, MMR, query expansion) without touching the live retriever.

```python theme={null}
# Current: bare semantic search
# Experiment: add query expansion + rerank

exp_retriever = client.retrievers.clone(
    "ret_content_search",
    body={
        "retriever_name": "ret_content_search_exp_rerank",
        "stages": [
            { "stage_type": "filter", "config": { "stage_id": "query_expand", "parameters": {...} } },   # new
            { "stage_type": "filter", "config": { "stage_id": "feature_search", "parameters": {...} } },
            { "stage_type": "sort", "config": { "stage_id": "rerank", "parameters": {...} } }             # new
        ]
    }
)

# Shadow-test: run both retrievers on the same queries, compare metrics
```
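
A shadow test can start as simply as running both retrievers on the same queries and measuring how much their top results overlap. The sketch below uses Jaccard overlap of the top-k document IDs as an example signal. The two search callables are stand-ins you would replace with calls to the live and experimental retrievers; how those calls look is outside this sketch.

```python
from typing import Callable, Iterable, Sequence

Search = Callable[[str], Sequence[str]]


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Size of the intersection divided by size of the union (1.0 if both are empty)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def shadow_compare(queries: Iterable[str], live: Search, experiment: Search,
                   k: int = 10) -> dict[str, float]:
    """Per-query overlap between the top-k results of two retrievers."""
    return {query: jaccard(live(query)[:k], experiment(query)[:k])
            for query in queries}


# Stub retrievers returning fixed document IDs.
def live_search(query: str) -> list[str]:
    return ["a", "b", "c"]


def experimental_search(query: str) -> list[str]:
    return ["b", "c", "d"]


report = shadow_compare(["red sofa", "oak table"], live_search, experimental_search, k=3)
print(report)
```

Low overlap is not bad in itself; it flags the queries worth inspecting by hand or scoring against labels.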

### Pattern 4: Taxonomy version checkpoint

Before migrating an IAB or ICD taxonomy schema, snapshot the current version so you can validate in parallel.

```python theme={null}
# Snapshot before migration
checkpoint = client.taxonomies.clone(
    taxonomy_identifier="tax_iab_content",
    body={
        "taxonomy_name": "tax_iab_content_v3_snapshot"
    }
)

# Apply the snapshot to a sample collection so you can validate it in parallel
client.taxonomies.apply(
    taxonomy_identifier=checkpoint.taxonomy_id,
    collection_id="col_content_sample_100"
)
```
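
Validating in parallel usually comes down to comparing the labels each taxonomy version assigns to the same documents. The following sketch reports every document whose label differs between two runs. It assumes you have exported each run as a `{document_id: label}` dict; how you obtain those dicts is not covered here, and the IDs and labels are invented.

```python
from typing import Mapping, Optional

Change = tuple[Optional[str], Optional[str]]


def label_changes(before: Mapping[str, str], after: Mapping[str, str]) -> dict[str, Change]:
    """Map each document whose label differs to its (old, new) pair; None means missing."""
    changes: dict[str, Change] = {}
    for doc_id in sorted(set(before) | set(after)):
        old, new = before.get(doc_id), after.get(doc_id)
        if old != new:
            changes[doc_id] = (old, new)
    return changes


snapshot_labels = {"doc_1": "Automotive", "doc_2": "Sports", "doc_3": "Travel"}
candidate_labels = {"doc_1": "Automotive", "doc_2": "Fitness", "doc_4": "Travel"}

changes = label_changes(snapshot_labels, candidate_labels)
print(changes)
```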

***

## Vertical Examples

<AccordionGroup>
  <Accordion title="Adtech — IAB taxonomy migration">
    **Problem:** Your IAB 2.2 taxonomy needs to be upgraded to IAB 3.0. You can't migrate production mid-campaign without validating classification quality first.

    **Solution:**

    1. Clone the production namespace → `ns_adtech_staging`
    2. Clone `tax_iab_v2` with the new retriever trained on IAB 3.0 → `tax_iab_v3_candidate`
    3. Apply `tax_iab_v3_candidate` to a sample of staging documents
    4. Validate label quality against ground truth
    5. Promote: update production taxonomy config once quality gates pass

    No live campaigns are affected. Staging vectors are reused from production (no reprocessing).
  </Accordion>

  <Accordion title="Healthcare — clinical classification model upgrade">
    **Problem:** You're switching the ICD-10 classification retriever from GPT-4o to a fine-tuned clinical model. Patient record classification in production cannot be interrupted.

    **Solution:**

    1. Clone `tax_icd10_prod` with the new retriever → `tax_icd10_clinical_v2`
    2. Apply to a test collection of de-identified records
    3. Compare classification accuracy against the production taxonomy's output on the same records
    4. Only swap production once the new model meets or exceeds current accuracy

    Both taxonomies run in parallel. There's no downtime and no disruption to the production classification pipeline.
  </Accordion>

  <Accordion title="Media — search ranking experiment">
    **Problem:** Editorial wants to test a diversity-aware ranking algorithm (MMR) on the content search endpoint before rolling out to all users.

    **Solution:**

    1. Clone `ret_content_search` → `ret_content_search_mmr`
    2. Override stages to add an MMR sort step after semantic search
    3. Route 10% of internal QA traffic to the experimental retriever (traffic splitting handled in your application layer)
    4. Compare CTR, dwell time, and result diversity metrics
    5. Promote by cloning the staging retriever config back to production

    The collection (and all its vectors) is shared between both retrievers — no extra storage cost.
  </Accordion>

  <Accordion title="CRE — new embedding model for property images">
    **Problem:** A new image embedding model shows better performance on architectural/interior photos. You want to validate before migrating 5M property images.

    **Solution:**

    1. Clone `col_property_images` with the new extractor → `col_property_images_v2`
    2. Trigger reprocessing (required for model changes — vectors are model-specific)
    3. Create `ret_property_search_v2` pointing to the new collection
    4. Run offline evaluation: compare top-5 retrieval precision on a labeled query set
    5. If metrics improve, migrate production: update `ret_property_search` to point to the new collection

    The old collection stays active during migration as a fallback. Rollback is instant: just repoint the retriever.
  </Accordion>
</AccordionGroup>
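
The media example leaves traffic splitting to your application layer. One common way to do that is deterministic hash bucketing, so a given user always sees the same retriever. The sketch below is a generic illustration, not a Mixpeek feature; the function name, user ID format, and 10% default are assumptions.

```python
import hashlib


def pick_retriever(user_id: str, control_id: str, experiment_id: str,
                   experiment_pct: int = 10) -> str:
    """Deterministically route roughly experiment_pct percent of users to the experiment."""
    if not 0 <= experiment_pct <= 100:
        raise ValueError("experiment_pct must be between 0 and 100")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return experiment_id if bucket < experiment_pct else control_id


# Measure the observed share over a synthetic population of user IDs.
users = [f"user-{i}" for i in range(10_000)]
in_experiment = sum(
    pick_retriever(user, "ret_content_search", "ret_content_search_mmr") == "ret_content_search_mmr"
    for user in users
)
print(in_experiment / len(users))
```

Hashing the user ID instead of drawing a random number per request keeps each user's experience consistent across sessions, which matters when you compare CTR or dwell time.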

***

## Promotion Workflow

Branching is only useful if you have a clear path back to production. The recommended flow:

```
dev branch  →  staging clone  →  eval gate  →  production
     ↑                                              |
     └──────────── rollback (repoint retriever) ───┘
```

**Promoting a retriever experiment to production:**

1. Run [Evaluations](/retrieval/evaluations) on the experimental retriever
2. If metrics pass, delete the old production retriever (after confirming no dependent published pages)
3. Rename the experimental retriever to the production name via PATCH (name is mutable)
4. Alternatively, skip steps 2 and 3: keep the old retriever and update your application to point to the new retriever ID directly

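**Rolling back** is safe as long as the old retriever and collection still exist unchanged: revert by updating the retriever ID in your application config. If you deleted the old production retriever during promotion, that fallback is gone, so keep it until the new retriever has proven itself.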
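
The eval gate in the flow above can be expressed as an explicit check in your release tooling. This is a minimal sketch under stated assumptions: metrics arrive as plain name-to-value dicts, the metric names and thresholds are hypothetical, and a negative threshold means a small regression is tolerated.

```python
from typing import Mapping


def should_promote(
    baseline: Mapping[str, float],
    candidate: Mapping[str, float],
    min_gain: Mapping[str, float],
) -> tuple[bool, list[str]]:
    """Pass only if every gated metric improves by at least its required margin."""
    failures: list[str] = []
    for metric, required in min_gain.items():
        if metric not in baseline or metric not in candidate:
            failures.append(f"{metric}: missing from results")
            continue
        gain = candidate[metric] - baseline[metric]
        if gain < required:
            failures.append(f"{metric}: gain {gain:+.3f} is below required {required:+.3f}")
    return (not failures, failures)


# Hypothetical metrics for the production and experimental retrievers.
baseline = {"precision_at_5": 0.60, "recall_at_20": 0.80}
candidate = {"precision_at_5": 0.66, "recall_at_20": 0.79}
gates = {"precision_at_5": 0.02, "recall_at_20": -0.02}

ok, reasons = should_promote(baseline, candidate, gates)
print(ok, reasons)
```

Returning the failure reasons alongside the boolean makes a blocked promotion easier to explain in CI output.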

***

## Best Practices

<Tip>
  **One namespace per environment** (dev, staging, prod). Clone from prod to create staging rather than maintaining them separately, so staging starts from the production data and config exactly as they were at clone time.
</Tip>

* **Clone, don't modify.** Resist the urge to patch production resources for "quick experiments." A clone takes seconds and preserves the ability to roll back.
* **Retriever clones are free** — they share the underlying collection (and all its vectors). You only pay for additional MVS storage when collection vectors diverge.
* **Trigger reprocessing only when the extractor changes.** Cloning a collection with the same extractor reuses existing vectors — no GPU time consumed.
* **Use evaluations before promoting.** The [Evaluations API](/retrieval/evaluations) lets you run offline quality checks on any retriever before it touches production traffic.
* **Name branches consistently.** A naming convention like `{resource}_staging`, `{resource}_exp_{date}`, or `{resource}_v{n}` makes it easy to identify which resources are active experiments vs. production.
* **Clean up stale branches.** Delete experimental collections and retrievers after promotion or abandonment. MVS vectors from branched collections consume storage until deleted.
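
A consistent naming convention also makes cleanup scriptable. The sketch below builds names in the `{resource}_exp_{date}` style and flags experiments older than a cutoff. The `YYYYMMDD` date format and the 30-day default are assumptions, and listing or deleting resources is left to your own tooling.

```python
import re
from datetime import date, datetime
from typing import Iterable

_EXPERIMENT = re.compile(r"^(?P<resource>.+)_exp_(?P<date>\d{8})$")


def experiment_name(resource: str, day: date) -> str:
    """Build a branch name such as ret_search_exp_20240101."""
    return f"{resource}_exp_{day:%Y%m%d}"


def stale_experiments(names: Iterable[str], today: date, max_age_days: int = 30) -> list[str]:
    """Return experiment names whose embedded date is older than max_age_days."""
    stale = []
    for name in names:
        match = _EXPERIMENT.match(name)
        if not match:
            continue  # staging, versioned, and production names are ignored
        try:
            created = datetime.strptime(match["date"], "%Y%m%d").date()
        except ValueError:
            continue  # eight digits that do not form a real calendar date
        if (today - created).days > max_age_days:
            stale.append(name)
    return stale


names = [
    experiment_name("ret_search", date(2024, 1, 1)),
    experiment_name("ret_search", date(2024, 3, 10)),
    "ret_search_staging",
    "ret_search_v2",
]
print(stale_experiments(names, today=date(2024, 3, 15)))
```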

***

## Related

* [Namespaces](/ingestion/namespaces) — isolation boundaries and multi-tenancy
* [Collections](/ingestion/collections) — processing pipelines and lifecycle states
* [Retrievers](/retrieval/retrievers) — pipeline stages and configuration
* [Evaluations](/retrieval/evaluations) — offline quality testing before promotion