# Extract Features

> Turn raw files into searchable documents with collections, extractors, and custom extractors

## Collections

Collections bind a bucket to a feature extractor. When you submit a batch, the engine runs the extractor against each object and produces searchable documents.

```bash
curl -X POST "https://api.mixpeek.com/v1/collections" \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_name": "product-embeddings",
    "source": { "type": "bucket", "bucket_ids": ["'$BUCKET_ID'"] },
    "feature_extractor": {
      "feature_extractor_name": "multimodal_extractor",
      "version": "v1",
      "input_mappings": { "image": "hero_image", "text": "product_text" }
    }
  }'
```

A single object can feed multiple collections — each running a different extractor. Documents retain lineage to the source object via `root_object_id`.
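To illustrate this fan-out, the sketch below builds two request bodies for `POST /v1/collections` that read from the same bucket but run different extractors. It is a minimal Python sketch rather than an official client: the helper `build_collection_payload`, the bucket ID `bkt_123`, and the collection name `product-text` are hypothetical, while the field names mirror the curl example above.

```python
import json


def build_collection_payload(name, bucket_id, extractor_name, input_mappings, version="v1"):
    """Return a request body shaped like the curl example (hypothetical helper)."""
    return {
        "collection_name": name,
        "source": {"type": "bucket", "bucket_ids": [bucket_id]},
        "feature_extractor": {
            "feature_extractor_name": extractor_name,
            "version": version,
            "input_mappings": input_mappings,
        },
    }


BUCKET_ID = "bkt_123"  # placeholder value; substitute your real bucket ID

# Two collections, one bucket: each object is processed once per extractor.
payloads = [
    build_collection_payload(
        "product-embeddings", BUCKET_ID, "multimodal_extractor",
        {"image": "hero_image", "text": "product_text"},
    ),
    build_collection_payload(
        "product-text", BUCKET_ID, "text_extractor",
        {"text": "product_text"},
    ),
]

for body in payloads:
    # Each body would be sent as its own POST, with the same headers as the curl call.
    print(json.dumps(body, indent=2))
```

Building the bodies from one helper keeps the `source` block identical across collections, so only the extractor configuration differs.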

[Collection API →](/api-reference/collections/create-collection)

### Embedding Task

Instruction-aware embedding models (E5, Gemini) use a **task hint** to optimize the embedding for a specific downstream use case. Set `embedding_task` at the collection level so it applies to every task-aware model in the pipeline.

```bash
curl -X POST "https://api.mixpeek.com/v1/collections" \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_name": "product-clusters",
    "embedding_task": "clustering",
    "source": { "type": "bucket", "bucket_ids": ["'$BUCKET_ID'"] },
    "feature_extractor": {
      "feature_extractor_name": "text_extractor",
      "version": "v1",
      "input_mappings": { "text": "product_text" }
    }
  }'
```

| Task                  | Use Case                                     | Default |
| --------------------- | -------------------------------------------- | ------- |
| `retrieval_document`  | Search: find documents from queries          | **Yes** |
| `retrieval_query`     | Rare at index time — query-side is automatic | No      |
| `semantic_similarity` | Symmetric comparison (dedup, matching)       | No      |
| `classification`      | Document categorization pipelines            | No      |
| `clustering`          | Grouping documents into clusters             | No      |

<Info>
  You almost never need to set this. The default `retrieval_document` is correct for search, and at query time Mixpeek automatically uses `retrieval_query`. Only override for clustering, classification, or symmetric similarity.
</Info>

<Info>
  Non-instruction-aware models (SigLIP, CLIP, Vertex multimodal) ignore this setting.
</Info>
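If you generate collection configs programmatically, a small client-side guard can catch typos in the task name before the request is sent. The sketch below is an assumption about how you might structure such a check, not Mixpeek's own validation: `with_embedding_task` is a hypothetical helper, and the accepted values are simply the five tasks listed in the table above.

```python
VALID_EMBEDDING_TASKS = {
    "retrieval_document",
    "retrieval_query",
    "semantic_similarity",
    "classification",
    "clustering",
}
DEFAULT_EMBEDDING_TASK = "retrieval_document"


def with_embedding_task(payload, task=None):
    """Return a copy of payload, adding embedding_task only when it overrides the default."""
    if task is None or task == DEFAULT_EMBEDDING_TASK:
        # Nothing to override: leave the field out and rely on the default.
        return dict(payload)
    if task not in VALID_EMBEDDING_TASKS:
        raise ValueError(f"unknown embedding_task: {task!r}")
    return {**payload, "embedding_task": task}


base = {"collection_name": "product-clusters"}
print(with_embedding_task(base, "clustering"))
```

Omitting the field for the default case keeps search collections free of redundant configuration.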

## Feature URIs

Every extracted feature is addressed by a URI that pins it to a specific extractor version:

```
mixpeek://{extractor_name}@{version}/{output_name}
```

Examples:

* `mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding`
* `mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1`
* `mixpeek://face_identity_extractor@v1/insightface__arcface`

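Feature URIs are referenced by retriever stages, taxonomies, and clustering jobs. Because each URI pins an extractor version, queries are embedded with the same extractor and version that produced the indexed features. To move to a different extractor or version, update the URI and re-embed; everything downstream that references the URI stays consistent.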
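The URI template can be split into its three parts with a regular expression. The following is a minimal Python sketch: `parse_feature_uri` and `FeatureURI` are hypothetical names, and the character rules for each component (no `@` or `/` in the extractor name, no `/` in the version) are assumptions inferred from the examples, not a published grammar.

```python
import re
from typing import NamedTuple

# Mirrors the template mixpeek://{extractor_name}@{version}/{output_name}
_FEATURE_URI = re.compile(
    r"^mixpeek://(?P<extractor>[^@/]+)@(?P<version>[^/]+)/(?P<output>.+)$"
)


class FeatureURI(NamedTuple):
    extractor: str
    version: str
    output: str


def parse_feature_uri(uri: str) -> FeatureURI:
    """Split a feature URI into extractor name, version, and output name."""
    match = _FEATURE_URI.match(uri)
    if match is None:
        raise ValueError(f"not a feature URI: {uri!r}")
    return FeatureURI(**match.groupdict())


parsed = parse_feature_uri("mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1")
print(parsed.extractor, parsed.version, parsed.output)
```

A parser like this can be used to check that a retriever stage and a collection refer to the same extractor version before a query is run.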

## Tiered Pipelines

When a batch is submitted, the engine runs a DAG of extractors:

1. **Tier 1** collections process raw objects from the bucket.
2. **Tier 2** collections consume Tier 1 documents as input, and each deeper tier consumes the documents of the tier before it.
3. Each tier waits for its dependencies to finish before executing.

```
video → scenes (Tier 1) → faces per scene (Tier 2) → expressions per face (Tier 3)
```

Collections define the pipeline through their `source` and `feature_extractor` configuration. Dependencies are resolved automatically.
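The tier ordering can be pictured as a topological sort over collection dependencies. The sketch below uses Python's standard-library `graphlib` to assign tiers for the video example; it illustrates the concept and is not the engine's actual scheduler. The function `assign_tiers` and the collection names `scenes`, `faces`, and `expressions` are hypothetical.

```python
from graphlib import TopologicalSorter


def assign_tiers(upstream: dict[str, list[str]]) -> dict[str, int]:
    """Map each collection to a tier: 1 for bucket-sourced, else 1 + deepest dependency."""
    tiers: dict[str, int] = {}
    # static_order() yields every collection after all of its dependencies.
    for name in TopologicalSorter(upstream).static_order():
        deps = upstream.get(name, [])
        tiers[name] = 1 + max((tiers[d] for d in deps), default=0)
    return tiers


# Each key lists the collections it consumes; an empty list means it reads the bucket.
pipeline = {
    "scenes": [],
    "faces": ["scenes"],
    "expressions": ["faces"],
}

tiers = assign_tiers(pipeline)
for name, tier in sorted(tiers.items(), key=lambda item: item[1]):
    print(f"Tier {tier}: {name}")
```

Collections that share a tier have no dependency on each other, which is why a tier can only start once everything it depends on has finished.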

## Built-in Extractors

| Extractor                                                    | Modality            | Output                                                      |
| ------------------------------------------------------------ | ------------------- | ----------------------------------------------------------- |
| [Multimodal](/processing/extractors/multimodal)              | Video, image, audio | Vertex AI 1408D embeddings, transcripts, scene descriptions |
| [Text](/processing/extractors/text)                          | Text                | E5-Large 1024D embeddings                                   |
| [Image](/processing/extractors/image)                        | Image               | SigLIP 768D embeddings, descriptions, structured extraction |
| [Face Identity](/processing/extractors/face-identity)        | Video, image        | ArcFace 512D face embeddings, bounding boxes                |
| [Document](/processing/extractors/document)                  | PDF, DOCX           | Text chunks, OCR, embeddings                                |
| [Gemini Multi-file](/processing/extractors/gemini-multifile) | Any                 | Gemini-powered cross-file analysis                          |
| [Web Scraper](/processing/extractors/web-scraper)            | URLs                | Scraped text content + embeddings                           |
| [Course Content](/processing/extractors/course-content)      | Video               | Lecture segments, slides, transcripts                       |
| [Passthrough](/processing/extractors/passthrough)            | Any                 | Forward metadata without extraction                         |

See the full [Extractor Reference](/processing/feature-extractors) for configuration details.

## Custom Extractors

For extraction logic beyond built-in models, build custom extractors:

```bash
pip install mixpeek
mixpeek plugin init my-extractor     # Scaffold from template
mixpeek plugin test my-extractor     # Validate locally
mixpeek plugin publish my-extractor  # Upload and deploy
```

Custom extractors run on managed infrastructure with access to GPU/CPU resources, Hugging Face models, and LLM services. They support batch processing, real-time endpoints, and custom model loading.

See the [custom extractors guide](/processing/custom-extractors) for the manifest format, pipeline hooks, security constraints, and deployment lifecycle.

[Custom Extractors →](/processing/custom-extractors)
