# Apply Taxonomy to Existing Documents

> Apply a taxonomy to all existing documents in a collection retroactively.

This endpoint triggers distributed Ray processing to enrich existing documents
with taxonomy data. Unlike automatic materialization, which runs during ingestion,
it lets you:

1. **Backfill enrichment** for documents ingested before the taxonomy was created
2. **Re-apply a taxonomy** after its configuration changes
3. **Process specific subsets** using `scroll_filters`

⚙️ **Processing Details:**
- Uses Ray datasets with `map_batches` for parallel processing
- Scales horizontally across the Ray cluster
- Non-blocking: returns immediately with a `task_id`
- Monitor progress via the Tasks API
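
The non-blocking flow described above (submit, receive a `task_id`, then poll) can be sketched as follows. This is a minimal illustration rather than an official client: the bearer-token header is an assumption (the spec on this page declares no security scheme), and `fetch_status` is a hypothetical caller-supplied function standing in for a Tasks API lookup, whose route is not documented here. The terminal status names are taken from the response schema examples and may differ in the Tasks API.

```python
import json
import time
import urllib.parse
import urllib.request

BASE_URL = "https://api.mixpeek.com"  # production server listed in the OpenAPI spec


def build_apply_request(collection_identifier, taxonomy_id, api_key=None):
    """Build (but do not send) the POST request for apply-taxonomy."""
    path = "/v1/collections/{}/apply-taxonomy".format(
        urllib.parse.quote(collection_identifier, safe="")
    )
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Assumption: bearer-token auth. The spec shown here names no scheme.
        headers["Authorization"] = "Bearer " + api_key
    body = json.dumps({"taxonomy_id": taxonomy_id}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path, data=body, headers=headers, method="POST"
    )


def wait_for_task(task_id, fetch_status, interval=2.0, timeout=600.0,
                  sleep=time.sleep):
    """Poll until the task reports a terminal status, then return it.

    fetch_status is a hypothetical callable: it receives the task_id and
    returns the current status string (for example via the Tasks API).
    """
    terminal = {"completed", "failed"}  # assumed terminal states
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status(task_id)
        if status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("task {} still {!r}".format(task_id, status))
        sleep(interval)
```

To send the request, pass the result of `build_apply_request` to `urllib.request.urlopen` and read `task_id` from the JSON body of the response.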

⚠️ **Prerequisites:**
- The taxonomy must exist and be valid
- The taxonomy must already be in the collection's `taxonomy_applications` list
- The collection must contain documents
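
A client-side pre-check of these prerequisites might look like the sketch below. The shape of a collection record is not specified on this page, so the code makes assumptions that should be verified against the Collections API: entries in `taxonomy_applications` are treated as either plain IDs or objects with a `taxonomy_id` key, and `document_count` is a hypothetical field name for the collection size.

```python
def check_prerequisites(collection, taxonomy_id):
    """Return a list of problems; an empty list means the call looks safe."""
    problems = []
    # The request schema describes taxonomy IDs as tax_*.
    if not taxonomy_id.startswith("tax_"):
        problems.append("taxonomy_id should look like tax_*")

    attached = set()
    for entry in collection.get("taxonomy_applications") or []:
        # Assumption: an entry is a plain ID or a dict with a taxonomy_id key.
        if isinstance(entry, dict):
            attached.add(entry.get("taxonomy_id"))
        else:
            attached.add(entry)
    if taxonomy_id not in attached:
        problems.append(
            "taxonomy is not in the collection's taxonomy_applications list"
        )

    # Assumption: document_count is a hypothetical name for the collection size.
    if not collection.get("document_count"):
        problems.append("collection contains no documents")
    return problems
```

The server remains the source of truth; a check like this only helps avoid an avoidable 400 or 404 round trip.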

📊 **Performance:**
- Roughly 1,000 to 5,000 documents per second, depending on cluster size
- Processing runs in parallel across multiple Ray workers
- `batch_size` (100 to 5000, default 1000) and `parallelism` (1 to 20, default 4) are configurable
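
As a rough planning aid, the quoted throughput range can be turned into a best-case and worst-case duration for a given document count. The helper below is only back-of-the-envelope arithmetic: it assumes throughput stays inside the stated range and ignores scheduling and startup overhead.

```python
def estimate_duration_seconds(estimated_documents, low_rate=1000, high_rate=5000):
    """Return (best_case, worst_case) seconds for the documented docs/second range."""
    if estimated_documents < 0:
        raise ValueError("estimated_documents must be non-negative")
    best_case = estimated_documents / high_rate   # fastest documented rate
    worst_case = estimated_documents / low_rate   # slowest documented rate
    return best_case, worst_case


# 15,000 documents (the estimate used in the response example on this page)
print(estimate_duration_seconds(15000))  # (3.0, 15.0)
```

The `estimated_documents` field of the response, when present, is a natural input for this calculation.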

🔍 **Use Cases:**
- Backfill: Apply new taxonomy to historical data
- Re-enrichment: Update after taxonomy changes
- Selective: Process filtered document subsets
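
The three use cases map onto request bodies as sketched below. The field names, limits, and defaults come from the `ApplyTaxonomyRequest` schema on this page, and the sample values mirror its examples; `build_payload` itself is a hypothetical helper, not part of any SDK.

```python
def build_payload(taxonomy_id, scroll_filters=None, batch_size=1000, parallelism=4):
    """Assemble an ApplyTaxonomyRequest body, enforcing the schema's limits."""
    if not taxonomy_id:
        raise ValueError("taxonomy_id is required")
    if not 100 <= batch_size <= 5000:
        raise ValueError("batch_size must be between 100 and 5000")
    if not 1 <= parallelism <= 20:
        raise ValueError("parallelism must be between 1 and 20")
    payload = {
        "taxonomy_id": taxonomy_id,
        "batch_size": batch_size,
        "parallelism": parallelism,
    }
    # Omitting scroll_filters means every document in the collection is enriched.
    if scroll_filters is not None:
        payload["scroll_filters"] = scroll_filters
    return payload


# Backfill: apply the taxonomy to every document in the collection.
backfill = build_payload("tax_abc123")

# Re-enrichment: re-apply after taxonomy changes, with custom batch settings.
reapply = build_payload("tax_categories", batch_size=500, parallelism=8)

# Selective: only documents whose metadata.processed is false.
selective = build_payload(
    "tax_products",
    scroll_filters={
        "must": [{"key": "metadata.processed", "match": {"value": False}}]
    },
)
```

Validating the limits client-side is optional; the API rejects out-of-range values with a 422 response in any case.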

See the Collections API and Taxonomies API documentation for details.



## OpenAPI

````yaml post /v1/collections/{collection_identifier}/apply-taxonomy
openapi: 3.1.0
info:
  title: Mixpeek API
  description: >-
    This is the Mixpeek API, providing access to various endpoints for data
    processing and retrieval.
  termsOfService: https://mixpeek.com/terms
  contact:
    name: Mixpeek Support
    url: https://mixpeek.com/contact
    email: info@mixpeek.com
  version: '0.82'
servers:
  - url: https://api.mixpeek.com
    description: Production
security: []
paths:
  /v1/collections/{collection_identifier}/apply-taxonomy:
    post:
      tags:
        - Collection Taxonomies
      summary: Apply Taxonomy to Existing Documents
      description: >-
        Apply a taxonomy to all existing documents in a collection
        retroactively.


        This endpoint triggers distributed Ray processing to enrich existing
        documents

        with taxonomy data. Unlike automatic materialization (which happens
        during ingestion),

        this endpoint allows you to:


        1. **Backfill enrichment** for documents ingested before the taxonomy
        was created

        2. **Re-apply taxonomy** after configuration changes

        3. **Process specific subsets** using scroll_filters


        ⚙️ **Processing Details:**

        - Uses Ray datasets with `map_batches` for parallel processing

        - Scales horizontally across the Ray cluster

        - Non-blocking: returns immediately with a `task_id`

        - Monitor progress via the Tasks API


        ⚠️ **Prerequisites:**

        - The taxonomy must exist and be valid

        - The taxonomy must already be in the collection's
        `taxonomy_applications` list

        - The collection must contain documents


        📊 **Performance:**

        - ~1000-5000 docs/second depending on cluster size

        - Parallel processing across multiple Ray workers

        - Batch size and parallelism configurable


        🔍 **Use Cases:**

        - Backfill: Apply new taxonomy to historical data

        - Re-enrichment: Update after taxonomy changes

        - Selective: Process filtered document subsets


        See Collections API and Taxonomies API documentation for details.
      operationId: >-
        apply_taxonomy_to_collection_v1_collections__collection_identifier__apply_taxonomy_post
      parameters:
        - name: collection_identifier
          in: path
          required: true
          schema:
            type: string
            description: Collection ID or name to apply taxonomy to
            title: Collection Identifier
          description: Collection ID or name to apply taxonomy to
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/ApplyTaxonomyRequest'
      responses:
        '200':
          description: Successful Response
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ApplyTaxonomyResponse'
        '400':
          description: Bad Request
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
        '401':
          description: Unauthorized
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
        '403':
          description: Forbidden
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
        '404':
          description: Not Found
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
        '422':
          description: Validation Error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/HTTPValidationError'
        '500':
          description: Internal Server Error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
components:
  schemas:
    ApplyTaxonomyRequest:
      properties:
        taxonomy_id:
          type: string
          title: Taxonomy Id
          description: >-
            ID of the taxonomy to apply. REQUIRED. Must be an existing taxonomy
            (tax_*). The taxonomy must already be in the collection's
            taxonomy_applications list.
          examples:
            - tax_abc123
            - tax_products
        scroll_filters:
          anyOf:
            - additionalProperties: true
              type: object
            - type: 'null'
          title: Scroll Filters
          description: >-
            Optional filters to limit which documents are enriched. NOT
            REQUIRED. If not provided, all documents in the collection will be
            enriched. Use to process specific subsets (e.g., documents missing
            enrichment).
          examples:
            - must:
                - key: metadata.category
                  match:
                    value: products
            - null
        batch_size:
          type: integer
          maximum: 5000
          minimum: 100
          title: Batch Size
          description: >-
            Number of documents to process in each parallel batch. NOT REQUIRED.
            Defaults to 1000. Larger batches = fewer Ray tasks but more memory
            per task. Smaller batches = more Ray tasks but lower memory per
            task.
          default: 1000
          examples:
            - 1000
            - 500
            - 2000
        parallelism:
          type: integer
          maximum: 20
          minimum: 1
          title: Parallelism
          description: >-
            Number of parallel Ray workers to use for processing. NOT REQUIRED.
            Defaults to 4. Higher parallelism = faster processing but more
            cluster resources. Set based on available Ray cluster capacity.
          default: 4
          examples:
            - 4
            - 8
            - 2
      type: object
      required:
        - taxonomy_id
      title: ApplyTaxonomyRequest
      description: |-
        Request to apply a taxonomy to an existing collection.

        This endpoint triggers retroactive taxonomy materialization on
        all documents in a collection using distributed Ray processing.

        Use Cases:
            - Apply taxonomy to documents that were ingested before the taxonomy was created
            - Re-apply taxonomy after taxonomy configuration changes
            - Backfill enrichment data for existing collections

        Requirements:
            - taxonomy_id: REQUIRED - Must be an existing, valid taxonomy
            - The taxonomy must already be attached to the collection via taxonomy_applications
            - Documents must exist in the collection
      examples:
        - description: Apply taxonomy to all documents
          taxonomy_id: tax_abc123
        - description: Apply taxonomy to filtered documents
          scroll_filters:
            must:
              - key: metadata.processed
                match:
                  value: false
          taxonomy_id: tax_products
        - batch_size: 500
          description: Apply with custom batch settings
          parallelism: 8
          taxonomy_id: tax_categories
    ApplyTaxonomyResponse:
      properties:
        task_id:
          type: string
          title: Task Id
          description: ID of the Ray task executing the materialization
        status:
          type: string
          title: Status
          description: Status of the materialization task
          examples:
            - submitted
            - running
            - completed
            - failed
        collection_id:
          type: string
          title: Collection Id
          description: Collection ID where taxonomy is being applied
        taxonomy_id:
          type: string
          title: Taxonomy Id
          description: Taxonomy ID being applied
        estimated_documents:
          anyOf:
            - type: integer
            - type: 'null'
          title: Estimated Documents
          description: Estimated number of documents to process (if available)
      type: object
      required:
        - task_id
        - status
        - collection_id
        - taxonomy_id
      title: ApplyTaxonomyResponse
      description: |-
        Response from applying taxonomy to collection.

        Returns the task ID and submission status, plus a document estimate when available.
      examples:
        - collection_id: col_products
          estimated_documents: 15000
          status: submitted
          task_id: ray_task_abc123
          taxonomy_id: tax_categories
    ErrorResponse:
      properties:
        success:
          type: boolean
          title: Success
          description: Always false for error responses
          default: false
        status:
          type: integer
          title: Status
          description: HTTP status code for this error
        error:
          $ref: '#/components/schemas/ErrorDetail'
          description: Error details payload
      type: object
      required:
        - status
        - error
      title: ErrorResponse
      description: Error response model.
      examples:
        - error:
            details:
              id: ns_123
              resource: namespace
            message: Namespace not found
            type: NotFoundError
          status: 404
          success: false
    HTTPValidationError:
      properties:
        detail:
          items:
            $ref: '#/components/schemas/ValidationError'
          type: array
          title: Detail
      type: object
      title: HTTPValidationError
    ErrorDetail:
      properties:
        message:
          type: string
          title: Message
          description: Human-readable error message
        type:
          type: string
          title: Type
          description: Stable error type identifier (machine-readable)
        code:
          anyOf:
            - type: string
            - type: 'null'
          title: Code
          description: >-
            Fine-grained error code for programmatic handling (e.g.,
            namespace_name_taken, feature_extractor_not_found). Present only
            when consumers may need to branch on a specific error condition.
        details:
          anyOf:
            - additionalProperties: true
              type: object
            - type: 'null'
          title: Details
          description: >-
            Optional structured details to help debugging (validation errors,
            IDs, etc.)
      type: object
      required:
        - message
        - type
      title: ErrorDetail
      description: Error detail model.
    ValidationError:
      properties:
        loc:
          items:
            anyOf:
              - type: string
              - type: integer
          type: array
          title: Location
        msg:
          type: string
          title: Message
        type:
          type: string
          title: Error Type
      type: object
      required:
        - loc
        - msg
        - type
      title: ValidationError

````
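
The spec defines three response shapes: `ApplyTaxonomyResponse` for 200, `HTTPValidationError` for 422, and `ErrorResponse` for the other error codes. One way a client could normalize them into a single log line is sketched below. `summarize_response` is a hypothetical helper, and it assumes the body has already been parsed from JSON into a dict.

```python
def summarize_response(status_code, body):
    """Turn any documented response body into a one-line summary string."""
    if status_code == 200:
        # ApplyTaxonomyResponse: estimated_documents is optional and may be null.
        estimate = body.get("estimated_documents")
        suffix = "" if estimate is None else " (~{} documents)".format(estimate)
        return "task {} {} for {} on {}{}".format(
            body["task_id"], body["status"],
            body["taxonomy_id"], body["collection_id"], suffix,
        )
    if status_code == 422:
        # HTTPValidationError: a list of ValidationError items under "detail".
        parts = [
            "{}: {}".format(".".join(str(p) for p in item["loc"]), item["msg"])
            for item in body.get("detail", [])
        ]
        return "validation error: " + "; ".join(parts)
    # ErrorResponse: message and type are required fields of ErrorDetail.
    error = body.get("error", {})
    return "{} ({}): {}".format(error.get("type"), status_code, error.get("message"))
```

When branching programmatically, the optional `error.code` field is the finer-grained value to inspect, since the schema describes it as present only where consumers may need to distinguish specific conditions.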