# Partially Update Cluster

> This endpoint partially updates a cluster (PATCH operation).
> Only provided fields will be updated. At minimum, metadata can always be updated.
> Immutable fields like cluster_id, status, and computed fields cannot be modified.


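As a rough illustration of the partial-update semantics described above, the sketch below builds (but does not send) a PATCH request whose JSON body contains only the fields being changed. It uses the server URL, path, and `PatchClusterRequest` field names from the spec that follows. The helper name `build_patch_request`, the example cluster name and metadata values, and the bearer-token `Authorization` header are assumptions for illustration; the spec's `security` list is empty, so check how your deployment authenticates.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "https://api.mixpeek.com"  # production server listed in the spec

# Field names accepted by PatchClusterRequest in the spec.
PATCHABLE_FIELDS = {
    "cluster_name", "description", "metadata", "llm_labeling",
    "filters", "face_cluster_merge", "sample_size", "algorithm_params",
}


def build_patch_request(cluster_identifier, changes, extra_headers=None):
    """Build (without sending) a PATCH request carrying only the changed fields.

    Hypothetical helper, not part of any Mixpeek SDK.
    """
    unknown = set(changes) - PATCHABLE_FIELDS
    if unknown:
        # cluster_id, status, and computed fields are documented as immutable.
        raise ValueError(f"not patchable: {sorted(unknown)}")
    # The identifier may be a cluster ID or a name, so URL-encode it.
    url = f"{BASE_URL}/v1/clusters/{quote(cluster_identifier, safe='')}"
    headers = {"Content-Type": "application/json"}
    headers.update(extra_headers or {})
    return urllib.request.Request(
        url,
        data=json.dumps(changes).encode("utf-8"),
        headers=headers,
        method="PATCH",
    )


req = build_patch_request(
    "products_clip_hdbscan",
    {"metadata": {"owner": "search-team"}, "sample_size": 5000},
    # Assumption: bearer-token auth; replace with your real credentials.
    extra_headers={"Authorization": "Bearer YOUR_API_KEY"},
)
print(req.get_method(), req.full_url)
# prints: PATCH https://api.mixpeek.com/v1/clusters/products_clip_hdbscan
```

To actually send the request, pass `req` to `urllib.request.urlopen` and parse the JSON body, which on success follows the `ClusterModel` schema. Fields omitted from `changes` are never serialized, which matches the "only provided fields will be updated" behavior.
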

## OpenAPI

````yaml patch /v1/clusters/{cluster_identifier}
openapi: 3.1.0
info:
  title: Mixpeek API
  description: >-
    This is the Mixpeek API, providing access to various endpoints for data
    processing and retrieval.
  termsOfService: https://mixpeek.com/terms
  contact:
    name: Mixpeek Support
    url: https://mixpeek.com/contact
    email: info@mixpeek.com
  version: '0.82'
servers:
  - url: https://api.mixpeek.com
    description: Production
security: []
paths:
  /v1/clusters/{cluster_identifier}:
    patch:
      tags:
        - Clusters
      summary: Partially Update Cluster
      description: |-
        This endpoint partially updates a cluster (PATCH operation).
        Only provided fields will be updated. At minimum, metadata can always be updated.
        Immutable fields like cluster_id, status, and computed fields cannot be modified.
      operationId: patch_cluster_v1_clusters__cluster_identifier__patch
      parameters:
        - name: cluster_identifier
          in: path
          required: true
          schema:
            type: string
            description: Cluster ID or name
            title: Cluster Identifier
          description: Cluster ID or name
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/PatchClusterRequest'
      responses:
        '200':
          description: Successful Response
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ClusterModel'
        '400':
          description: Bad Request
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
        '401':
          description: Unauthorized
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
        '403':
          description: Forbidden
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
        '404':
          description: Not Found
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
        '422':
          description: Validation Error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/HTTPValidationError'
        '500':
          description: Internal Server Error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
components:
  schemas:
    PatchClusterRequest:
      properties:
        cluster_name:
          anyOf:
            - type: string
            - type: 'null'
          title: Cluster Name
          description: Updated name for the cluster
        description:
          anyOf:
            - type: string
            - type: 'null'
          title: Description
          description: Updated description for the cluster
        metadata:
          anyOf:
            - additionalProperties: true
              type: object
            - type: 'null'
          title: Metadata
          description: Updated metadata for the cluster
        llm_labeling:
          anyOf:
            - $ref: '#/components/schemas/LLMLabeling-Input'
            - type: 'null'
          description: >-
            Updated LLM labeling configuration. Takes effect on the next `POST
            /v1/clusters/{id}/execute` — use this to correct a null
            `labeling_inputs` mapping that produced schema-metadata labels,
            without re-embedding or re-running HDBSCAN.
        filters:
          anyOf:
            - additionalProperties: true
              type: object
            - type: 'null'
          title: Filters
          description: >-
            Updated pre-filter for clustering input documents. Overrides the
            cluster's stored filter on subsequent execute calls.
        face_cluster_merge:
          anyOf:
            - $ref: '#/components/schemas/FaceClusterMergeConfig'
            - type: 'null'
          description: >-
            Updated post-HDBSCAN face-identity merge configuration. Takes effect
            on the next `POST /v1/clusters/{id}/execute`. Pass an object with
            enabled=false to turn the merge pass off without removing the
            config; pass null in the patch to leave the stored value untouched.
        sample_size:
          anyOf:
            - type: integer
              maximum: 1000000
            - type: 'null'
          title: Sample Size
          description: >-
            Updated per-execution document cap. Takes effect on the next `POST
            /v1/clusters/{id}/execute`. Omit to leave the stored value
            untouched; set to an integer to change it. KMeans supports up to 1M;
            O(N²) algorithms are capped at 100K by the engine.
        algorithm_params:
          anyOf:
            - additionalProperties: true
              type: object
            - type: 'null'
          title: Algorithm Params
          description: >-
            Updated algorithm parameters (e.g. min_cluster_size, min_samples for
            HDBSCAN). Takes effect on the next `POST /v1/clusters/{id}/execute`.
      type: object
      title: PatchClusterRequest
      description: Request model for partially updating a cluster (PATCH operation).
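      # Illustrative request bodies only: these examples are not part of the
      # upstream spec, and the values (owner, field paths, parameter values)
      # are hypothetical. They sketch how a partial update might look.
      examples:
        # Metadata-only patch; no other field is included in the body.
        - metadata:
            owner: search-team
        # Point labeling at a semantic payload field and adjust the HDBSCAN
        # parameters; per the field descriptions above, both take effect on
        # the next execute call.
        - llm_labeling:
            enabled: true
            model_name: gpt-4o-mini-2024-07-18
            labeling_inputs:
              input_mappings:
                - input_key: text
                  path: description
                  source_type: payload
          algorithm_params:
            min_cluster_size: 10
            min_samples: 5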
    ClusterModel:
      properties:
        collection_ids:
          anyOf:
            - items:
                type: string
              type: array
              minItems: 1
            - type: 'null'
          title: Collection Ids
          description: Collections to cluster together
        cluster_name:
          anyOf:
            - type: string
            - type: 'null'
          title: Cluster Name
          description: Optional human-friendly name for the clustering job
        cluster_type:
          $ref: '#/components/schemas/ClusterType'
          description: Vector or attribute clustering
          default: vector
        vector_config:
          anyOf:
            - $ref: '#/components/schemas/VectorBasedConfig-Output'
            - type: 'null'
          description: Required when cluster_type is 'vector'
        attribute_config:
          anyOf:
            - $ref: '#/components/schemas/AttributeBasedConfig'
            - type: 'null'
          description: Required when cluster_type is 'attribute'
        filters:
          anyOf:
            - $ref: '#/components/schemas/LogicalOperator-Output'
            - type: 'null'
          description: >-
            Optional filters to pre-filter documents before clustering (same
            format as list documents). Applied during Qdrant scroll before
            parquet export. Useful for clustering subsets like: status='active',
            category='electronics', etc.
        llm_labeling:
          anyOf:
            - $ref: '#/components/schemas/LLMLabeling-Output'
            - type: 'null'
          description: >-
            Optional configuration for LLM-based cluster labeling. When provided
            with enabled=True, clusters will have semantic labels generated by
            LLM instead of generic labels like 'Cluster 0'. When not provided or
            enabled=False, uses fallback labels.
        enrich_source_collection:
          type: boolean
          title: Enrich Source Collection
          description: >-
            If True, cluster results are written back to source collection(s)
            in-place instead of creating new output collections. Documents will
            be enriched with cluster_id, cluster_label, distance_to_centroid,
            and optionally other metadata. Similar to taxonomy enrichment
            pattern.
          default: false
        source_enrichment_config:
          anyOf:
            - $ref: '#/components/schemas/SourceEnrichmentConfig'
            - type: 'null'
          description: >-
            Configuration for source collection enrichment (only used if
            enrich_source_collection=True). Controls which fields are added to
            source documents and field naming conventions.
        auto_execute_on_batch:
          type: boolean
          title: Auto Execute On Batch
          description: >-
            Automatically execute this cluster whenever a batch completes on any
            of its input collections. When True, a ClusterApplicationConfig
            entry is added to each input collection's cluster_applications field
            at creation time. The cluster will then auto-trigger after each
            batch completion (subject to cooldown and document threshold). When
            False (default), the cluster must be executed manually via the API.
          default: false
        auto_execute_min_documents:
          anyOf:
            - type: integer
            - type: 'null'
          title: Auto Execute Min Documents
          description: >-
            Minimum number of documents required before auto-executing cluster.
            Only used when auto_execute_on_batch=True. If the collection has
            fewer documents than this threshold, clustering is skipped.
        auto_execute_cooldown_seconds:
          type: integer
          title: Auto Execute Cooldown Seconds
          description: >-
            Minimum time (in seconds) between automatic cluster executions. Only
            used when auto_execute_on_batch=True. Default: 3600 (1 hour).
          default: 3600
        cluster_id:
          type: string
          title: Cluster Id
          description: Unique cluster identifier
        parquet_path:
          anyOf:
            - type: string
            - type: 'null'
          title: Parquet Path
          description: S3 path to parquet files with cluster data
        members_key:
          anyOf:
            - type: string
            - type: 'null'
          title: Members Key
          description: S3 key to members.parquet (if saved)
        num_clusters:
          anyOf:
            - type: integer
            - type: 'null'
          title: Num Clusters
          description: Number of clusters found
        cluster_stats:
          anyOf:
            - $ref: '#/components/schemas/ClusterStats'
            - type: 'null'
          description: Clustering quality metrics
        status:
          $ref: '#/components/schemas/TaskStatusEnum'
          description: Clustering job status
          default: PENDING
        task_id:
          anyOf:
            - type: string
            - type: 'null'
          title: Task Id
          description: Associated task ID for clustering job
        last_run_id:
          anyOf:
            - type: string
            - type: 'null'
          title: Last Run Id
          description: >-
            Run ID of the most recent successful clustering execution. Used to
            retrieve execution results.
        created_at:
          type: string
          format: date-time
          title: Created At
          description: When the cluster was created
        updated_at:
          type: string
          format: date-time
          title: Updated At
          description: When the cluster was last updated
        metadata:
          additionalProperties: true
          type: object
          title: Metadata
          description: Additional user-defined metadata for the cluster
      type: object
      title: ClusterModel
      description: Cluster metadata stored in MongoDB.
      examples:
        - cluster_name: products_clip_hdbscan
          cluster_type: vector
          collection_ids:
            - col_products_v1
            - col_products_v2
          llm_labeling:
            enabled: true
            labeling_inputs:
              input_mappings:
                - input_key: text
                  path: description
                  source_type: payload
            model_name: gpt-4o-mini-2024-07-18
            provider: openai
          vector_config:
            clustering_method: hdbscan
            feature_uri: mixpeek://clip_vit_l_14@v1/embedding
            hdbscan_parameters:
              min_cluster_size: 10
              min_samples: 5
            sample_size: 5000
        - cluster_name: video_multimodal_clustering
          cluster_type: vector
          collection_ids:
            - col_videos_v1
          llm_labeling:
            enabled: true
            include_keywords: true
            include_summary: true
            labeling_inputs:
              input_mappings:
                - input_key: text
                  path: title
                  source_type: payload
                - input_key: video_url
                  path: video_url
                  source_type: payload
            model_name: gemini-2.5-flash-lite
            provider: google
          vector_config:
            clustering_method: hdbscan
            feature_uri: mixpeek://clip_vit_l_14@v1/embedding
            hdbscan_parameters:
              min_cluster_size: 5
              min_samples: 3
            sample_size: 3000
        - cluster_name: image_multimodal_clustering
          cluster_type: vector
          collection_ids:
            - col_images_v1
          llm_labeling:
            enabled: true
            labeling_inputs:
              input_mappings:
                - input_key: text
                  path: caption
                  source_type: payload
                - input_key: image_url
                  path: image_url
                  source_type: payload
            model_name: gemini-2.5-flash-lite
            provider: google
          vector_config:
            clustering_method: kmeans
            feature_uri: mixpeek://clip_vit_l_14@v1/embedding
            kmeans_parameters:
              n_clusters: 10
            sample_size: 2000
        - attribute_config:
            attributes:
              - status
            hierarchical_grouping: true
          cluster_name: orders_group_by_status
          cluster_type: attribute
          collection_ids:
            - col_orders_v1
            - col_orders_v2
            - col_orders_v3
    ErrorResponse:
      properties:
        success:
          type: boolean
          title: Success
          description: Always false for error responses
          default: false
        status:
          type: integer
          title: Status
          description: HTTP status code for this error
        error:
          $ref: '#/components/schemas/ErrorDetail'
          description: Error details payload
      type: object
      required:
        - status
        - error
      title: ErrorResponse
      description: Error response model.
      examples:
        - error:
            details:
              id: ns_123
              resource: namespace
            message: Namespace not found
            type: NotFoundError
          status: 404
          success: false
    HTTPValidationError:
      properties:
        detail:
          items:
            $ref: '#/components/schemas/ValidationError'
          type: array
          title: Detail
      type: object
      title: HTTPValidationError
    LLMLabeling-Input:
      properties:
        enabled:
          type: boolean
          title: Enabled
          description: >-
            Whether to generate labels for clusters using LLM. When enabled,
            clusters will have semantic labels like 'High-Performance Laptops'
            instead of generic labels like 'Cluster 0'.
          default: false
        labeling_inputs:
          anyOf:
            - $ref: '#/components/schemas/LLMLabelingInput-Input'
            - type: 'null'
          description: >-
            Input configuration for LLM labeling. Supports flexible input
            mappings for multimodal inputs (text, images, videos, audio). Use
            input_mappings for advanced multimodal labeling with providers like
            Gemini. If not provided (null/undefined), the full document payload
            is serialized as JSON and sent to the LLM — WARNING: in practice
            this produces schema-metadata labels (e.g. 'Mid-Timeline Video
            Events' with keywords like ['start_time', 'end_time']) because the
            LLM describes whatever field names it sees. For meaningful labels,
            set input_mappings that point at semantic fields (title,
            description, thumbnail_url, document_blobs[].url for image/video
            content) rather than relying on the default.
        provider:
          anyOf:
            - $ref: '#/components/schemas/LLMProvider'
            - type: 'null'
          description: |-
            LLM provider to use for labeling. Supported providers:
            - openai: GPT models (GPT-4o, GPT-4o-mini, GPT-4.1, O3-mini)
            - google: Gemini models (Gemini 2.5 Flash, Gemini 1.5 Flash)
            - anthropic: Claude models (Claude 3.5 Sonnet, Claude 3.5 Haiku)

            If not specified, automatically inferred from model_name.
          examples:
            - openai
            - google
            - anthropic
        model_name:
          anyOf:
            - $ref: '#/components/schemas/OpenAIModel'
            - $ref: '#/components/schemas/GoogleModel'
            - $ref: '#/components/schemas/AnthropicModel'
            - type: 'null'
          title: Model Name
          description: >-
            REQUIRED when enabled=True. Specific LLM model to use for cluster
            labeling. All models are defined as enums for type safety.


            OpenAI Models (provider='openai'):

            - gpt-4o-2024-08-06: Highest quality, best for production ($2.50/$10
            per 1M tokens)

            - gpt-4o-mini-2024-07-18: Cost-effective, recommended for most use
            cases ($0.15/$0.60 per 1M tokens)

            - gpt-4.1-2025-04-14: Latest model, future-proofed

            - gpt-4.1-mini-2025-04-14: Latest cost-optimized model

            - o3-mini-2025-01-31: Advanced reasoning, best for complex
            clustering


            Google Models (provider='google'):

            - gemini-2.5-flash-lite: Fastest, latest multimodal model,
            recommended ($0.15/$0.60 per 1M tokens)


            Anthropic Models (provider='anthropic'):

            - claude-3-5-sonnet-20241022: Best reasoning, 200K context ($3/$15
            per 1M tokens)

            - claude-3-5-haiku-20241022: Fast, cost-effective ($0.25/$1.25 per
            1M tokens)


            Recommendation:

            - Use gemini-2.5-flash-lite (DEFAULT) - multimodal support

            - Use gpt-4o-mini-2024-07-18 for OpenAI compatibility

            - Use gpt-4o-2024-08-06 for highest quality when cost is not a
            concern
          examples:
            - gemini-2.5-flash-lite
            - gemini-2.5-pro
            - gpt-4o-mini-2024-07-18
            - gpt-4o-2024-08-06
            - claude-sonnet-4-5-20250929
            - claude-haiku-4-5-20251001
            - gpt-4.1-2025-04-14
            - gpt-4.1-mini-2025-04-14
            - o3-mini-2025-01-31
        include_summary:
          type: boolean
          title: Include Summary
          description: Whether to generate cluster summaries
          default: true
        include_keywords:
          type: boolean
          title: Include Keywords
          description: Whether to extract keywords for clusters
          default: true
        max_samples_per_cluster:
          anyOf:
            - type: integer
              maximum: 20
              minimum: 1
            - type: 'null'
          title: Max Samples Per Cluster
          description: >-
            Maximum representative documents to send to LLM per cluster for
            semantic analysis. When null (default), automatically scales based
            on cluster size and spread — smaller/tighter clusters get fewer
            samples, larger/sparser clusters get more (range 3-20). Set
            explicitly to override with a fixed value.
        sample_text_max_length:
          type: integer
          maximum: 500
          minimum: 50
          title: Sample Text Max Length
          description: Maximum characters per document sample text
          default: 200
        use_embedding_dedup:
          type: boolean
          title: Use Embedding Dedup
          description: >-
            Enable embedding-based label deduplication to prevent near-duplicate
            labels (requires sentence-transformers)
          default: true
        embedding_similarity_threshold:
          type: number
          maximum: 1
          minimum: 0.5
          title: Embedding Similarity Threshold
          description: >-
            Cosine similarity threshold for duplicate label detection (labels
            above this are considered duplicates)
          default: 0.8
        cache_ttl_seconds:
          type: integer
          maximum: 2592000
          minimum: 0
          title: Cache Ttl Seconds
          description: >-
            Time-to-live for cached labels in seconds. Labels for clusters with
            identical representative documents will be reused within this TTL
            window, reducing LLM API costs. Default: 604800 (7 days). Set to 0
            to disable caching.
          default: 604800
        custom_prompt:
          anyOf:
            - type: string
            - type: 'null'
          title: Custom Prompt
          description: |-
            OPTIONAL. Custom prompt template for LLM labeling. If omitted, the
            default discriminative prompt is used. When provided, it completely
            replaces the default prompt: your custom prompt receives the cluster
            information, but you must format it yourself.

            Use when you:
            - Need domain-specific labeling (e.g., medical, legal, technical)
            - Want a different label format (e.g., emoji labels, code names)
            - Require a specific output structure
            - Have custom business logic for categorization

            The default prompt includes cluster document samples, forbidden
            labels for uniqueness, and a JSON response format. See
            engine/clusters/labeling/prompts.py for reference.

            Example: 'Analyze these product clusters and generate SHORT category
            names (2-3 words max) focusing on product type and price range.
            Return JSON: [{"cluster_id": "cl_0", "label": "..."}]'
          examples:
            - >-
              Analyze these document clusters and generate technical labels (2-3
              words). Focus on programming languages and frameworks mentioned.
              Return JSON: [{'cluster_id': 'cl_0', 'label': '...', 'keywords':
              [...]}]
            - >-
              Generate emoji-based labels for these clusters. Use 1-2 emojis
              that represent the main theme. Return JSON: [{'cluster_id':
              'cl_0', 'label': '🚀 Tech Innovation'}]
            - null
        response_shape:
          anyOf:
            - type: string
            - additionalProperties: true
              type: object
            - type: 'null'
          title: Response Shape
          description: >
            OPTIONAL. Defines a custom structured output for LLM labeling. If
            omitted, the default structure (label, summary, keywords) is used.
            When provided, the LLM output matches this structure and is
            stored in cluster documents.


            Two modes supported:

            1. Natural language prompt (string): Describe desired output in
            plain English
               - Service automatically infers JSON schema from your description
               - Example: 'Extract cluster category, confidence score (0-1), and top 3 representative terms'
               - Auto-generates schema with appropriate types (string, number, array, etc.)

            2. Explicit JSON schema (dict): Provide complete JSON schema for
            output structure
               - Full control over output structure, types, and constraints
               - Example: {'type': 'object', 'properties': {'category': {'type': 'string'}, ...}}


            Use when:
              - Need custom metadata fields (confidence scores, sentiment, complexity)
              - Want domain-specific structure (taxonomy hierarchies, entity extractions)
              - Require specific data types (arrays, nested objects, enums)
              - Have downstream schema requirements


            Output fields are automatically added to cluster collection schema
            and stored in metadata.

            Default behavior (if not provided): label (string), summary
            (string), keywords (array of strings)
          examples:
            - >-
              Extract cluster category, confidence score between 0 and 1, and
              top 3 representative keywords
            - >-
              Generate cluster theme, sentiment (positive/negative/neutral), and
              list of key entities
            - properties:
                category:
                  description: Main category
                  type: string
                subcategory:
                  description: Subcategory if applicable
                  type: string
                confidence:
                  maximum: 1
                  minimum: 0
                  type: number
                keywords:
                  items:
                    type: string
                  maxItems: 5
                  type: array
              required:
                - category
                - confidence
              type: object
            - null
        parameters:
          additionalProperties: true
          type: object
          title: Parameters
          description: >-
            Provider-specific parameters forwarded to the LLM service. For
            OpenAI: temperature, max_tokens, top_p, json_output, etc. For
            Google: temperature, top_k, max_output_tokens, json_output, etc.
      type: object
      title: LLMLabeling
      description: |-
        Configuration for LLM-based cluster labeling.

        Supports multiple LLM providers with comprehensive model selection:
        - OpenAI: GPT-4o, GPT-4o-mini, GPT-4.1, O3-mini (best for quality)
        - Google: Gemini 2.5 Flash, Gemini 1.5 Flash (best for speed and cost)
        - Anthropic: Claude 3.5 Sonnet, Claude 3.5 Haiku (best for reasoning)

        All models are defined as enums and validated at API level.
      examples:
        - description: Text-only labeling with multiple fields
          enabled: true
          include_keywords: true
          include_summary: true
          labeling_inputs:
            input_mappings:
              - input_key: title
                path: title
                source_type: payload
              - input_key: description
                path: description
                source_type: payload
              - input_key: text
                path: text
                source_type: payload
          model_name: gpt-4o-mini-2024-07-18
          provider: openai
        - description: Multimodal labeling with images (Gemini)
          enabled: true
          include_keywords: true
          include_summary: true
          labeling_inputs:
            input_mappings:
              - input_key: text
                path: headline
                source_type: payload
              - input_key: image_url
                path: thumbnail_url
                source_type: payload
          model_name: gemini-2.5-flash-lite
          provider: google
        - description: Multimodal labeling with video (Gemini)
          enabled: true
          include_keywords: true
          include_summary: true
          labeling_inputs:
            input_mappings:
              - input_key: text
                path: description
                source_type: payload
              - input_key: video_url
                path: video_url
                source_type: payload
          model_name: gemini-2.5-flash-lite
          provider: google
        - description: OpenAI GPT-4o (highest quality, text-only)
          enabled: true
          include_keywords: true
          include_summary: true
          labeling_inputs:
            input_mappings:
              - input_key: text
                path: text
                source_type: payload
          model_name: gpt-4o-2024-08-06
          provider: openai
        - description: Anthropic Claude Sonnet 4.5 (best reasoning)
          enabled: true
          include_keywords: true
          include_summary: true
          labeling_inputs:
            input_mappings:
              - input_key: text
                path: text
                source_type: payload
          model_name: claude-sonnet-4-5-20250929
          provider: anthropic
        - description: 'Minimal configuration (uses defaults: text field from payload)'
          enabled: true
        - custom_prompt: >-
            You are a medical document classifier. Analyze the following patient
            record clusters and generate HIPAA-compliant category labels (2-3
            words) that describe the medical condition or treatment type. DO NOT
            include patient names or identifiers. Return JSON array:
            [{"cluster_id": "cl_0", "label": "...", "keywords": [...]}]
          description: Custom prompt for domain-specific labeling
          enabled: true
          labeling_inputs:
            input_mappings:
              - input_key: text
                path: text
                source_type: payload
          model_name: gpt-4o-mini-2024-07-18
          provider: openai
    FaceClusterMergeConfig:
      properties:
        enabled:
          type: boolean
          title: Enabled
          description: >-
            Run the merge pass. Set False to disable without removing the
            config.
          default: true
        centroid_cosine_threshold:
          type: number
          maximum: 1
          minimum: 0
          title: Centroid Cosine Threshold
          description: Minimum centroid cosine similarity for a candidate merge.
          default: 0.55
        bbox_iou_threshold:
          type: number
          maximum: 1
          minimum: 0
          title: Bbox Iou Threshold
          description: Minimum bbox IoU on overlapping frames to satisfy the spatial half.
          default: 0.4
        scene_jaccard_threshold:
          type: number
          maximum: 1
          minimum: 0
          title: Scene Jaccard Threshold
          description: >-
            Minimum Jaccard similarity of scene-id sets to satisfy the spatial
            half.
          default: 0.3
        bbox_field:
          type: string
          title: Bbox Field
          description: >-
            Document-payload field holding the face bbox (list/tuple of 4
            floats).
          default: bbox
        frame_field:
          type: string
          title: Frame Field
          description: >-
            Document-payload field holding the frame identifier used to pair
            bboxes.
          default: frame_number
        scene_field:
          type: string
          title: Scene Field
          description: >-
            Document-payload field holding the scene identifier used for
            Jaccard.
          default: scene_id
      type: object
      title: FaceClusterMergeConfig
      description: |-
        Configuration for the post-HDBSCAN face-identity merge pass.

        Enables an agglomerative merge after HDBSCAN labels are assigned but
        before centroid calculation. Two clusters merge when the centroid
        cosine meets the cosine threshold AND at least one of the spatial
        signals (bbox IoU on overlapping frames, scene Jaccard) also clears
        its threshold. Defaults target ArcFace 512d face embeddings at the
        brand-corpus scale (~10^4 faces, ~10^2 true identities).
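    # Illustrative reading of the merge rule described above, using the
    # documented default thresholds (cosine 0.55, bbox IoU 0.4, scene Jaccard
    # 0.3). The pair scores are made-up values for illustration, not service
    # output, and this assumes "meets"/"clears" means greater than or equal:
    #
    #   cosine 0.62, bbox IoU 0.10, scene Jaccard 0.35
    #     -> merge (cosine passes; scene Jaccard passes)
    #   cosine 0.70, bbox IoU 0.20, scene Jaccard 0.10
    #     -> no merge (cosine passes; neither spatial signal passes)
    #   cosine 0.50, bbox IoU 0.80, scene Jaccard 0.90
    #     -> no merge (cosine fails, so the spatial signals are not enough)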
    ClusterType:
      type: string
      enum:
        - vector
        - attribute
      title: ClusterType
      description: >-
        Type of clustering to perform.


        Determines the clustering approach:

        - vector: Cluster documents by embedding similarity (semantic
        clustering)

        - attribute: Cluster documents by metadata attributes (business logic
        clustering)


        Use Cases:
            vector:
                - Group semantically similar content
                - Find content with similar meaning
                - Organize by topic/theme
                - Requires vector embeddings

            attribute:
                - Group by business attributes (category, brand, status, etc.)
                - Organize by explicit metadata
                - Create hierarchical groupings
                - No embeddings required
    VectorBasedConfig-Output:
      properties:
        feature_uri:
          anyOf:
            - type: string
            - type: 'null'
          title: Feature Uri
          description: >-
            DEPRECATED: Use feature_uris instead. Canonical feature URI for the
            vector embedding to cluster. Format:
            'mixpeek://{extractor}@{version}/{output}'. For multi-feature
            clustering, use feature_uris (plural) instead.
          examples:
            - mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding
            - mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1
        feature_uris:
          anyOf:
            - items:
                type: string
              type: array
              minItems: 1
            - type: 'null'
          title: Feature Uris
          description: >-
            RECOMMENDED. List of feature URIs to cluster. Format:
            'mixpeek://{extractor}@{version}/{output}'. For single-feature
            clustering, provide a list with one element. For multi-feature
            clustering, provide multiple feature URIs. Each feature must exist
            in all input collections.
          examples:
            - - mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1
            - - mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1
              - mixpeek://image_extractor@v1/embedding
        clustering_method:
          $ref: '#/components/schemas/ClusteringAlgorithm'
          description: Clustering algorithm to use
        sample_size:
          anyOf:
            - type: integer
              maximum: 1000000
            - type: 'null'
          title: Sample Size
          description: >-
            Number of samples to use for clustering. If not set, defaults are
            applied per algorithm to prevent out-of-memory:
            HDBSCAN/DBSCAN/OPTICS: 50,000 (O(N²) memory),
            Spectral/Agglomerative: 50,000, KMeans/GaussianMixture/MeanShift:
            100,000. KMeans/GMM hard max: 1,000,000; O(N²) algorithms: 100,000.
        kmeans_parameters:
          anyOf:
            - $ref: '#/components/schemas/KMeansParams'
            - additionalProperties: true
              type: object
            - type: 'null'
          title: Kmeans Parameters
          description: Parameters for K-means clustering (deprecated, use algorithm_params)
        dbscan_parameters:
          anyOf:
            - $ref: '#/components/schemas/DBSCANParams'
            - additionalProperties: true
              type: object
            - type: 'null'
          title: Dbscan Parameters
          description: Parameters for DBSCAN clustering (deprecated, use algorithm_params)
        hdbscan_parameters:
          anyOf:
            - $ref: '#/components/schemas/HDBSCANParams'
            - additionalProperties: true
              type: object
            - type: 'null'
          title: Hdbscan Parameters
          description: Parameters for HDBSCAN clustering (deprecated, use algorithm_params)
        algorithm_params:
          anyOf:
            - $ref: '#/components/schemas/KMeansParams'
            - $ref: '#/components/schemas/DBSCANParams'
            - $ref: '#/components/schemas/HDBSCANParams'
            - $ref: '#/components/schemas/AgglomerativeParams'
            - $ref: '#/components/schemas/SpectralParams'
            - $ref: '#/components/schemas/GaussianMixtureParams'
            - $ref: '#/components/schemas/MeanShiftParams'
            - $ref: '#/components/schemas/OPTICSParams'
            - $ref: '#/components/schemas/LeidenParams'
            - additionalProperties: true
              type: object
            - type: 'null'
          title: Algorithm Params
          description: Algorithm-specific parameters
        multi_feature_strategy:
          type: string
          enum:
            - concatenate
            - independent
            - weighted
          title: Multi Feature Strategy
          description: |-
            Strategy for handling multiple feature vectors:
            - concatenate: Combine embeddings into one vector, single clustering
            - independent: Run separate clustering per feature
            - weighted: Learn optimal feature weights
          default: concatenate
        normalize_features:
          type: boolean
          title: Normalize Features
          description: >-
            Apply L2 normalization to each feature block before concatenation.
            Prevents feature dominance when combining different modalities. Only
            applies when multi_feature_strategy='concatenate'.
          default: true
        feature_weights:
          anyOf:
            - additionalProperties:
                type: number
              type: object
            - type: 'null'
          title: Feature Weights
          description: >-
            Optional per-feature weights (0.0-1.0). Keys are feature URIs,
            values are weights. Example:
            {'mixpeek://text@v1/emb': 0.7, 'mixpeek://image@v1/emb': 0.3}.
            With multi_feature_strategy='concatenate', defaults to equal
            weights (1.0) if not specified. With
            multi_feature_strategy='weighted', weights provided here are used
            instead of learning; if this is None, weights are learned
            automatically using weight_learning_config.
          examples:
            - mixpeek://image_extractor@v1/embedding: 0.3
              mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1: 0.7
        weight_learning_config:
          anyOf:
            - $ref: '#/components/schemas/WeightLearningConfig'
            - type: 'null'
          description: >-
            Configuration for automatic feature weight learning. Only used when
            multi_feature_strategy='weighted' and feature_weights is None. If
            feature_weights is provided, manual weights are used instead of
            learning. If this is None when learning is needed, default
            WeightLearningConfig is used.
          examples:
            - max_iterations: 20
              method: bayesian
              metric: silhouette
              sample_size: 5000
            - max_iterations: 5
              method: grid_search
              metric: silhouette
        output_strategy:
          type: string
          enum:
            - single
            - per_feature
          title: Output Strategy
          description: >-
            Output collection creation strategy:

            - single: Create one collection with all feature vectors

            - per_feature: Create separate collections for each feature (for
            hierarchical taxonomies)
          default: single
        effective_feature_method:
          type: string
          enum:
            - mean
            - median
            - medoid
          title: Effective Feature Method
          description: |-
            Method for calculating cluster centroids:
            - mean: Average of all vectors in cluster
            - median: Median vector (robust to outliers)
            - medoid: Actual cluster member closest to centroid
          default: mean
        enrich_source:
          type: boolean
          title: Enrich Source
          description: Whether to enrich source documents with cluster_id
          default: false
        face_cluster_merge:
          anyOf:
            - $ref: '#/components/schemas/FaceClusterMergeConfig'
            - type: 'null'
          description: >-
            Optional post-HDBSCAN agglomerative merge pass for face identity
            clusters. HDBSCAN over ArcFace 512d face embeddings tends to
            over-split the same person across pose/lighting variants. When
            provided with enabled=true, two clusters merge if their centroid
            cosine meets the threshold AND one of the spatial signals (bbox IoU
            on overlapping frames, scene-id Jaccard) clears its threshold. Leave
            null (default) for non-face clusterings — existing behavior is
            preserved.
        preprocessing_steps:
          anyOf:
            - items:
                oneOf:
                  - $ref: '#/components/schemas/TSNEParams'
                  - $ref: '#/components/schemas/UMAPParams'
                  - $ref: '#/components/schemas/WhiteningParams'
                  - $ref: '#/components/schemas/NoReduction'
                discriminator:
                  propertyName: method
                  mapping:
                    none:
                      $ref: '#/components/schemas/NoReduction'
                    tsne:
                      $ref: '#/components/schemas/TSNEParams'
                    umap:
                      $ref: '#/components/schemas/UMAPParams'
                    whitening:
                      $ref: '#/components/schemas/WhiteningParams'
              type: array
            - type: 'null'
          title: Preprocessing Steps
          description: >-
            Ordered list of preprocessing steps applied before clustering. Steps
            execute in order. Common patterns:

            - [whitening, umap]: Decorrelate then reduce — best for
            high-dimensional embeddings

            - [umap]: UMAP pre-reduction only (defaults: 50D, cosine,
            n_neighbors=30)

            - [whitening]: Whitening only — improves density-based clustering
            without dimension reduction


            If set, overrides any default dimensionality reduction.
          examples:
            - - method: whitening
                regularization: 0.00001
              - method: umap
                min_dist: 0
                n_components: 50
                n_neighbors: 30
            - - method: umap
                n_components: 50
        hierarchical:
          type: boolean
          title: Hierarchical
          description: >-
            Enable recursive sub-clustering. After initial clustering, each
            cluster with enough members is sub-divided using UMAP+HDBSCAN
            recursively. Produces hierarchical cluster IDs (e.g.,
            cl_0_sub_1_sub_0).
          default: false
        max_hierarchy_depth:
          type: integer
          maximum: 5
          minimum: 1
          title: Max Hierarchy Depth
          description: Maximum recursion depth for hierarchical sub-clustering.
          default: 3
        vis_n_components:
          type: integer
          enum:
            - 2
            - 3
          title: Vis N Components
          description: >-
            Number of dimensions for visualization coordinates (2 or 3). When 3,
            the z coordinate is populated for depth/size-based rendering. Stored
            on the cluster and used as the default for all executions.
          default: 2
      type: object
      required:
        - clustering_method
      title: VectorBasedConfig
      description: >-
        Configuration for vector-based clustering.


        Use canonical feature URIs to specify which vector embeddings to
        cluster.

        Feature URIs follow the format: mixpeek://{extractor}@{version}/{output}


        Supports both single and multi-feature clustering:

        - Single feature: Provide one feature_uri for standard clustering

        - Multi-feature: Provide multiple feature_uris for hybrid clustering


        Examples:
            Single feature:
            {
                "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
                "clustering_method": "hdbscan",
                "sample_size": 1000
            }

            Multi-feature:
            {
                "feature_uris": [
                    "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
                    "mixpeek://image_extractor@v1/embedding"
                ],
                "clustering_method": "hdbscan",
                "multi_feature_strategy": "concatenate"
            }
      examples:
        - algorithm_params:
            min_cluster_size: 10
            min_samples: 5
          clustering_method: hdbscan
          description: HDBSCAN clustering with multimodal embeddings
          feature_uri: mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding
          sample_size: 1000
        - clustering_method: hdbscan
          description: Face-identity clustering with post-HDBSCAN merge
          face_cluster_merge:
            bbox_iou_threshold: 0.4
            centroid_cosine_threshold: 0.55
            enabled: true
            scene_jaccard_threshold: 0.3
          feature_uris:
            - mixpeek://face_identity_extractor@v1/face_embedding
        - algorithm_params:
            n_clusters: 10
          clustering_method: kmeans
          description: K-means clustering with text embeddings
          feature_uri: mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1
        - algorithm_params:
            eps: 0.5
            min_samples: 5
          clustering_method: dbscan
          description: DBSCAN clustering with CLIP image embeddings
          feature_uri: mixpeek://clip@v1/image_embedding
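    # A sketch of a multi-feature configuration that combines several of the
    # fields documented above (feature_uris, feature_weights, and
    # preprocessing_steps). Every value is taken from the examples in this
    # schema; the combination itself is an assumption for illustration, not a
    # recommended or tested setup:
    #
    #   clustering_method: hdbscan
    #   feature_uris:
    #     - mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1
    #     - mixpeek://image_extractor@v1/embedding
    #   multi_feature_strategy: concatenate
    #   normalize_features: true
    #   feature_weights:
    #     mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1: 0.7
    #     mixpeek://image_extractor@v1/embedding: 0.3
    #   preprocessing_steps:
    #     - method: whitening
    #       regularization: 0.00001
    #     - method: umap
    #       n_components: 50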
    AttributeBasedConfig:
      properties:
        attributes:
          items:
            type: string
          type: array
          minItems: 1
          title: Attributes
          description: >-
            List of attribute field names to use for clustering. Documents will
            be grouped by unique combinations of these attribute values.
            Supports dot-notation for nested fields (e.g., 'metadata.category').
            Order matters for hierarchical grouping: first attribute is
            top-level, subsequent are nested.
          examples:
            - - category
            - - category
              - brand
            - - status
              - priority
            - - metadata.author
              - metadata.topic
        hierarchical_grouping:
          type: boolean
          title: Hierarchical Grouping
          description: >-
            Whether to create hierarchical clusters based on attribute order.
            When true, creates parent clusters for each unique value of the
            first attribute, then child clusters for subsequent attributes
            within each parent. When false, creates flat clusters for each
            unique combination of all attributes. Example with ['category',
            'brand']: hierarchical_grouping=true → 'Electronics' (parent) →
            'Apple', 'Samsung' (children); hierarchical_grouping=false →
            'Electronics_Apple', 'Electronics_Samsung' (flat).
          default: false
        aggregation_method:
          anyOf:
            - type: string
            - type: 'null'
          title: Aggregation Method
          description: >-
            Method for aggregating attribute values when creating cluster
            centroids. Options: 'most_frequent' (default), 'first', 'last'. Most
            use cases should use the default.
          examples:
            - most_frequent
            - first
            - last
      type: object
      required:
        - attributes
      title: AttributeBasedConfig
      description: >-
        Configuration for attribute-based clustering.


        Attribute-based clustering groups documents by metadata attributes
        (e.g., category, brand, status) instead of vector similarity. This is
        useful for organizing content by business logic rather than semantic
        similarity.


        Examples:
            - Group products by category and brand
            - Organize orders by status and priority
            - Cluster content by author and topic
      examples:
        - attributes:
            - category
          description: Simple category clustering
          hierarchical_grouping: false
        - attributes:
            - category
            - brand
          description: Hierarchical category → brand clustering
          hierarchical_grouping: true
        - aggregation_method: most_frequent
          attributes:
            - status
            - priority
          description: Order status and priority (flat)
          hierarchical_grouping: false
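    # Sketch of how the two grouping modes described above would partition
    # documents, assuming attributes: [category, brand]. The three documents
    # are hypothetical, and the cluster names follow the naming shown in the
    # hierarchical_grouping description:
    #
    #   doc A: {category: Electronics, brand: Apple}
    #   doc B: {category: Electronics, brand: Samsung}
    #   doc C: {category: Electronics, brand: Apple}
    #
    #   hierarchical_grouping: true
    #     Electronics          (parent: A, B, C)
    #       Apple              (child: A, C)
    #       Samsung            (child: B)
    #   hierarchical_grouping: false
    #     Electronics_Apple    (A, C)
    #     Electronics_Samsung  (B)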
    LogicalOperator-Output:
      properties:
        AND:
          anyOf:
            - items:
                anyOf:
                  - $ref: '#/components/schemas/LogicalOperator-Output'
                  - $ref: '#/components/schemas/FilterCondition'
              type: array
            - type: 'null'
          title: And
          description: Logical AND operation - all conditions must be true
          example:
            - field: name
              operator: eq
              value: John
            - field: age
              operator: gte
              value: 30
        OR:
          anyOf:
            - items:
                anyOf:
                  - $ref: '#/components/schemas/LogicalOperator-Output'
                  - $ref: '#/components/schemas/FilterCondition'
              type: array
            - type: 'null'
          title: Or
          description: Logical OR operation - at least one condition must be true
          example:
            - field: status
              operator: eq
              value: active
            - field: role
              operator: eq
              value: admin
        NOT:
          anyOf:
            - items:
                anyOf:
                  - $ref: '#/components/schemas/LogicalOperator-Output'
                  - $ref: '#/components/schemas/FilterCondition'
              type: array
            - type: 'null'
          title: Not
          description: Logical NOT operation - all conditions must be false
          example:
            - field: department
              operator: eq
              value: HR
            - field: location
              operator: eq
              value: remote
        case_sensitive:
          anyOf:
            - type: boolean
            - type: 'null'
          title: Case Sensitive
          description: Whether to perform case-sensitive matching
          default: false
          example: true
      additionalProperties: true
      type: object
      title: LogicalOperator
      description: >-
        Represents a logical operation (AND, OR, NOT) on filter conditions.


        Allows nesting with a defined depth limit.


        Also supports shorthand syntax where field names can be passed directly
        as key-value pairs for equality filtering (e.g., {"metadata.title":
        "value"}).
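    # A sketch of a nested filter built from the operators described above.
    # The field names and values are illustrative, reusing those from the
    # AND/OR examples; the condition shape (field, operator, value) is taken
    # from those examples as well:
    #
    #   AND:
    #     - field: status
    #       operator: eq
    #       value: active
    #     - OR:
    #         - field: role
    #           operator: eq
    #           value: admin
    #         - field: age
    #           operator: gte
    #           value: 30
    #
    # Shorthand equality form from the description, as a standalone filter:
    #
    #   metadata.title: value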
    LLMLabeling-Output:
      properties:
        enabled:
          type: boolean
          title: Enabled
          description: >-
            Whether to generate labels for clusters using LLM. When enabled,
            clusters will have semantic labels like 'High-Performance Laptops'
            instead of generic labels like 'Cluster 0'.
          default: false
        labeling_inputs:
          anyOf:
            - $ref: '#/components/schemas/LLMLabelingInput-Output'
            - type: 'null'
          description: >-
            Input configuration for LLM labeling. Supports flexible input
            mappings for multimodal inputs (text, images, videos, audio). Use
            input_mappings for advanced multimodal labeling with providers like
            Gemini. If not provided (null/undefined), the full document payload
            is serialized as JSON and sent to the LLM — WARNING: in practice
            this produces schema-metadata labels (e.g. 'Mid-Timeline Video
            Events' with keywords like ['start_time', 'end_time']) because the
            LLM describes whatever field names it sees. For meaningful labels,
            set input_mappings that point at semantic fields (title,
            description, thumbnail_url, document_blobs[].url for image/video
            content) rather than relying on the default.
        provider:
          anyOf:
            - $ref: '#/components/schemas/LLMProvider'
            - type: 'null'
          description: |-
            LLM provider to use for labeling. Supported providers:
            - openai: GPT models (GPT-4o, GPT-4o-mini, GPT-4.1, O3-mini)
            - google: Gemini models (Gemini 2.5 Flash, Gemini 1.5 Flash)
            - anthropic: Claude models (Claude 3.5 Sonnet, Claude 3.5 Haiku)

            If not specified, automatically inferred from model_name.
          examples:
            - openai
            - google
            - anthropic
        model_name:
          anyOf:
            - $ref: '#/components/schemas/OpenAIModel'
            - $ref: '#/components/schemas/GoogleModel'
            - $ref: '#/components/schemas/AnthropicModel'
            - type: 'null'
          title: Model Name
          description: >-
            REQUIRED when enabled=True. Specific LLM model to use for cluster
            labeling. All models are defined as enums for type safety.


            OpenAI Models (provider='openai'):

            - gpt-4o-2024-08-06: Highest quality, best for production ($2.50/$10
            per 1M tokens)

            - gpt-4o-mini-2024-07-18: Cost-effective, recommended for most use
            cases ($0.15/$0.60 per 1M tokens)

            - gpt-4.1-2025-04-14: Latest model, future-proofed

            - gpt-4.1-mini-2025-04-14: Latest cost-optimized model

            - o3-mini-2025-01-31: Advanced reasoning, best for complex
            clustering


            Google Models (provider='google'):

            - gemini-2.5-flash-lite: Fastest, latest multimodal model,
            recommended ($0.15/$0.60 per 1M tokens)


            Anthropic Models (provider='anthropic'):

            - claude-3-5-sonnet-20241022: Best reasoning, 200K context ($3/$15
            per 1M tokens)

            - claude-3-5-haiku-20241022: Fast, cost-effective ($0.25/$1.25 per
            1M tokens)


            Recommendation:

            - Use gemini-2.5-flash-lite (DEFAULT) - multimodal support

            - Use gpt-4o-mini-2024-07-18 for OpenAI compatibility

            - Use gpt-4o-2024-08-06 for highest quality when cost is not a
            concern
          examples:
            - gemini-2.5-flash-lite
            - gemini-2.5-pro
            - gpt-4o-mini-2024-07-18
            - gpt-4o-2024-08-06
            - claude-sonnet-4-5-20250929
            - claude-haiku-4-5-20251001
            - gpt-4.1-2025-04-14
            - gpt-4.1-mini-2025-04-14
            - o3-mini-2025-01-31
        include_summary:
          type: boolean
          title: Include Summary
          description: Whether to generate cluster summaries
          default: true
        include_keywords:
          type: boolean
          title: Include Keywords
          description: Whether to extract keywords for clusters
          default: true
        max_samples_per_cluster:
          anyOf:
            - type: integer
              maximum: 20
              minimum: 1
            - type: 'null'
          title: Max Samples Per Cluster
          description: >-
            Maximum representative documents to send to LLM per cluster for
            semantic analysis. When null (default), automatically scales based
            on cluster size and spread — smaller/tighter clusters get fewer
            samples, larger/sparser clusters get more (range 3-20). Set
            explicitly to override with a fixed value.
        sample_text_max_length:
          type: integer
          maximum: 500
          minimum: 50
          title: Sample Text Max Length
          description: Maximum characters per document sample text
          default: 200
        use_embedding_dedup:
          type: boolean
          title: Use Embedding Dedup
          description: >-
            Enable embedding-based label deduplication to prevent near-duplicate
            labels (requires sentence-transformers)
          default: true
        embedding_similarity_threshold:
          type: number
          maximum: 1
          minimum: 0.5
          title: Embedding Similarity Threshold
          description: >-
            Cosine similarity threshold for duplicate label detection (labels
            above this are considered duplicates)
          default: 0.8
        cache_ttl_seconds:
          type: integer
          maximum: 2592000
          minimum: 0
          title: Cache Ttl Seconds
          description: >-
            Time-to-live for cached labels in seconds. Labels for clusters with
            identical representative documents will be reused within this TTL
            window, reducing LLM API costs. Default: 604800 (7 days). Set to 0
            to disable caching.
          default: 604800
        custom_prompt:
          anyOf:
            - type: string
            - type: 'null'
          title: Custom Prompt
          description: >-
            OPTIONAL. Custom prompt template for LLM labeling. If not provided,
            the default discriminative prompt is used. When provided, it
            completely replaces the default prompt. Your custom prompt receives
            cluster information, but you must format it yourself.


            Use when you:

            - Need domain-specific labeling (e.g., medical, legal, technical)

            - Want a different label format (e.g., emoji labels, code names)

            - Require a specific output structure

            - Have custom business logic for categorization


            The default prompt includes cluster document samples, forbidden
            labels for uniqueness, and a JSON response format. See
            engine/clusters/labeling/prompts.py for reference. Example: 'Analyze
            these product clusters and generate SHORT category names (2-3 words
            max) focusing on product type and price range. Return JSON:
            [{"cluster_id": "cl_0", "label": "..."}]'
          examples:
            - >-
              Analyze these document clusters and generate technical labels (2-3
              words). Focus on programming languages and frameworks mentioned.
              Return JSON: [{"cluster_id": "cl_0", "label": "...", "keywords":
              [...]}]
            - >-
              Generate emoji-based labels for these clusters. Use 1-2 emojis
              that represent the main theme. Return JSON: [{"cluster_id":
              "cl_0", "label": "🚀 Tech Innovation"}]
            - null
        response_shape:
          anyOf:
            - type: string
            - additionalProperties: true
              type: object
            - type: 'null'
          title: Response Shape
          description: >
            OPTIONAL. Define custom structured output for LLM labeling. NOT
            REQUIRED - uses default structure (label, summary, keywords) if not
            provided. When provided, LLM output will match this structure and be
            stored in cluster documents.


            Two modes supported:

            1. Natural language prompt (string): Describe desired output in
            plain English
               - Service automatically infers JSON schema from your description
               - Example: 'Extract cluster category, confidence score (0-1), and top 3 representative terms'
               - Auto-generates schema with appropriate types (string, number, array, etc.)

            2. Explicit JSON schema (dict): Provide complete JSON schema for
            output structure
               - Full control over output structure, types, and constraints
               - Example: {'type': 'object', 'properties': {'category': {'type': 'string'}, ...}}


            Use when:
              - Need custom metadata fields (confidence scores, sentiment, complexity)
              - Want domain-specific structure (taxonomy hierarchies, entity extractions)
              - Require specific data types (arrays, nested objects, enums)
              - Have downstream schema requirements


            Output fields are automatically added to cluster collection schema
            and stored in metadata.

            Default behavior (if not provided): label (string), summary
            (string), keywords (array of strings)
          examples:
            - >-
              Extract cluster category, confidence score between 0 and 1, and
              top 3 representative keywords
            - >-
              Generate cluster theme, sentiment (positive/negative/neutral), and
              list of key entities
            - properties:
                category:
                  description: Main category
                  type: string
                subcategory:
                  description: Subcategory if applicable
                  type: string
                confidence:
                  maximum: 1
                  minimum: 0
                  type: number
                keywords:
                  items:
                    type: string
                  maxItems: 5
                  type: array
              required:
                - category
                - confidence
              type: object
            - null
        parameters:
          additionalProperties: true
          type: object
          title: Parameters
          description: >-
            Provider-specific parameters forwarded to the LLM service. For
            OpenAI: temperature, max_tokens, top_p, json_output, etc. For
            Google: temperature, top_k, max_output_tokens, json_output, etc.
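      # Sketch of a parameters object for an OpenAI model, using option names
      # listed in the parameters description above. The values are
      # illustrative assumptions, not documented defaults:
      #
      #   parameters:
      #     temperature: 0.2
      #     max_tokens: 512
      #     json_output: true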
      type: object
      title: LLMLabeling
      description: |-
        Configuration for LLM-based cluster labeling.

        Supports multiple LLM providers with comprehensive model selection:
        - OpenAI: GPT-4o, GPT-4o-mini, GPT-4.1, O3-mini (best for quality)
        - Google: Gemini 2.5 Flash, Gemini 1.5 Flash (best for speed and cost)
        - Anthropic: Claude 3.5 Sonnet, Claude 3.5 Haiku (best for reasoning)

        All models are defined as enums and validated at API level.
      examples:
        - description: Text-only labeling with multiple fields
          enabled: true
          include_keywords: true
          include_summary: true
          labeling_inputs:
            input_mappings:
              - input_key: title
                path: title
                source_type: payload
              - input_key: description
                path: description
                source_type: payload
              - input_key: text
                path: text
                source_type: payload
          model_name: gpt-4o-mini-2024-07-18
          provider: openai
        - description: Multimodal labeling with images (Gemini)
          enabled: true
          include_keywords: true
          include_summary: true
          labeling_inputs:
            input_mappings:
              - input_key: text
                path: headline
                source_type: payload
              - input_key: image_url
                path: thumbnail_url
                source_type: payload
          model_name: gemini-2.5-flash-lite
          provider: google
        - description: Multimodal labeling with video (Gemini)
          enabled: true
          include_keywords: true
          include_summary: true
          labeling_inputs:
            input_mappings:
              - input_key: text
                path: description
                source_type: payload
              - input_key: video_url
                path: video_url
                source_type: payload
          model_name: gemini-2.5-flash-lite
          provider: google
        - description: OpenAI GPT-4o (highest quality, text-only)
          enabled: true
          include_keywords: true
          include_summary: true
          labeling_inputs:
            input_mappings:
              - input_key: text
                path: text
                source_type: payload
          model_name: gpt-4o-2024-08-06
          provider: openai
        - description: Anthropic Claude Sonnet 4.5 (best reasoning)
          enabled: true
          include_keywords: true
          include_summary: true
          labeling_inputs:
            input_mappings:
              - input_key: text
                path: text
                source_type: payload
          model_name: claude-sonnet-4-5-20250929
          provider: anthropic
        - description: 'Minimal configuration (uses defaults: text field from payload)'
          enabled: true
        - custom_prompt: >-
            You are a medical document classifier. Analyze the following patient
            record clusters and generate HIPAA-compliant category labels (2-3
            words) that describe the medical condition or treatment type. DO NOT
            include patient names or identifiers. Return JSON array:
            [{"cluster_id": "cl_0", "label": "...", "keywords": [...]}]
          description: Custom prompt for domain-specific labeling
          enabled: true
          labeling_inputs:
            input_mappings:
              - input_key: text
                path: text
                source_type: payload
          model_name: gpt-4o-mini-2024-07-18
          provider: openai
    SourceEnrichmentConfig:
      properties:
        field_mappings:
          items:
            $ref: '#/components/schemas/EnrichmentFieldMapping'
          type: array
          title: Field Mappings
          description: >-
            List of field mappings from cluster results to document fields.
            Default includes cluster_id and cluster_label. Can include:
            distance_to_centroid, member_count, keywords, visualization coords
            (x, y, z), etc.
      type: object
      title: SourceEnrichmentConfig
      description: >-
        Configuration for enriching source collection documents with cluster
        assignments.


        When enrich_source_collection=True, cluster assignments are written back
        to the original source documents, similar to taxonomy enrichment.


        Uses a flexible field mapping pattern to support any cluster result
        fields.
      examples:
        - field_mappings:
            - source_field: cluster_id
              target_field: category_id
            - source_field: cluster_label
              target_field: category_name
            - source_field: distance_to_centroid
              target_field: category_confidence
        - field_mappings:
            - source_field: cluster_id
              target_field: segment
            - source_field: cluster_label
              target_field: segment_label
            - source_field: x
              target_field: viz_x
            - source_field: 'y'
              target_field: viz_y
    ClusterStats:
      properties:
        num_clusters:
          type: integer
          title: Num Clusters
        noise_points:
          anyOf:
            - type: integer
            - type: 'null'
          title: Noise Points
          description: Number of noise points (for DBSCAN, etc.)
        silhouette_score:
          anyOf:
            - type: number
            - type: 'null'
          title: Silhouette Score
          description: Silhouette score (-1 to 1, higher is better)
        extra:
          additionalProperties: true
          type: object
          title: Extra
      type: object
      required:
        - num_clusters
      title: ClusterStats
      description: Basic clustering quality metrics.
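      # Illustrative sketch, not part of the upstream spec: what a ClusterStats
      # object might look like after a density-based run. All values below are
      # made up for illustration; only the field names and the silhouette range
      # (-1 to 1) come from the schema above.
      examples:
        - num_clusters: 7
          noise_points: 42
          silhouette_score: 0.41
          extra: {}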
    TaskStatusEnum:
      type: string
      enum:
        - PENDING
        - QUEUED
        - IN_PROGRESS
        - PROCESSING
        - COMPLETED
        - COMPLETED_WITH_ERRORS
        - FAILED
        - CANCELED
        - INTERRUPTED
        - UNKNOWN
        - SKIPPED
        - DRAFT
        - ACTIVE
        - ARCHIVED
        - SUSPENDED
      title: TaskStatusEnum
      description: |-
        Enumeration of task statuses for tracking asynchronous operations.

        Task statuses indicate the current state of asynchronous operations like
        batch processing, object ingestion, clustering, and taxonomy execution.

        Status Categories:
            Operation Statuses: Track progress of async operations
            Lifecycle Statuses: Track entity state (buckets, collections, namespaces)

        Values:
            PENDING: Task is queued but has not started processing yet
            IN_PROGRESS: Task is currently being executed
            PROCESSING: Task is actively processing data (similar to IN_PROGRESS)
            COMPLETED: Task finished successfully with no errors
            COMPLETED_WITH_ERRORS: Task finished but some items failed (partial success)
            FAILED: Task encountered an error and could not complete
            CANCELED: Task was manually canceled by a user or system
            UNKNOWN: Task status could not be determined
            SKIPPED: Task was intentionally skipped
            DRAFT: Task is in draft state and not yet submitted

            ACTIVE: Entity is active and operational (for buckets, collections, etc.)
            ARCHIVED: Entity has been archived
            SUSPENDED: Entity has been temporarily suspended

        Terminal Statuses:
            COMPLETED, COMPLETED_WITH_ERRORS, FAILED, CANCELED are terminal statuses.
            Once a task reaches these states, it will not transition to another state.

        Partial Success Handling:
            COMPLETED_WITH_ERRORS indicates that the operation completed but some
            documents/items failed. The task result includes:
            - List of successful items
            - List of failed items with error details
            - Success rate percentage
            This allows clients to handle partial success scenarios appropriately.

        Polling Guidance:
            - Poll tasks in PENDING, QUEUED, IN_PROGRESS, or PROCESSING states
            - Stop polling when task reaches COMPLETED, COMPLETED_WITH_ERRORS, FAILED, or CANCELED
            - Use exponential backoff (1s → 30s) when polling
    ErrorDetail:
      properties:
        message:
          type: string
          title: Message
          description: Human-readable error message
        type:
          type: string
          title: Type
          description: Stable error type identifier (machine-readable)
        code:
          anyOf:
            - type: string
            - type: 'null'
          title: Code
          description: >-
            Fine-grained error code for programmatic handling (e.g.,
            namespace_name_taken, feature_extractor_not_found). Present only
            when consumers may need to branch on a specific error condition.
        details:
          anyOf:
            - additionalProperties: true
              type: object
            - type: 'null'
          title: Details
          description: >-
            Optional structured details to help debugging (validation errors,
            IDs, etc.)
      type: object
      required:
        - message
        - type
      title: ErrorDetail
      description: Error detail model.
    ValidationError:
      properties:
        loc:
          items:
            anyOf:
              - type: string
              - type: integer
          type: array
          title: Location
        msg:
          type: string
          title: Message
        type:
          type: string
          title: Error Type
      type: object
      required:
        - loc
        - msg
        - type
      title: ValidationError
    LLMLabelingInput-Input:
      properties:
        input_mappings:
          items:
            $ref: '#/components/schemas/InputMapping'
          type: array
          minItems: 1
          title: Input Mappings
          description: >-
            Flexible input mappings for constructing LLM context. Supports
            multimodal inputs (text, image_url, video_url, audio_url). Each
            mapping specifies how to extract data from document payloads. At
            least one input mapping is required.
      type: object
      required:
        - input_mappings
      title: LLMLabelingInput
      description: |-
        Input configuration for LLM-based cluster labeling.

        Supports flexible input mappings similar to retrievers and buckets,
        allowing multimodal inputs (text, images, videos, audio) for providers
        like Gemini that support native multimodal understanding.

        Examples:
            # Text-only labeling:
            LLMLabelingInput(input_mappings=[
                InputMapping(input_key="headline", source_type="payload", path="headline"),
                InputMapping(input_key="description", source_type="payload", path="description")
            ])

            # Multimodal labeling with images:
            LLMLabelingInput(input_mappings=[
                InputMapping(input_key="text", source_type="payload", path="headline"),
                InputMapping(input_key="image_url", source_type="payload", path="thumbnail_url")
            ])

            # Multimodal with video (for Gemini):
            LLMLabelingInput(input_mappings=[
                InputMapping(input_key="text", source_type="payload", path="description"),
                InputMapping(input_key="video_url", source_type="payload", path="video_url")
            ])
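      # Illustrative sketch, not part of the upstream spec: the "multimodal
      # labeling with images" case from the description above, written in the
      # request-body shape rather than as Python constructor calls. The field
      # values mirror the LLMLabeling examples elsewhere in this file.
      examples:
        - input_mappings:
            - input_key: text
              source_type: payload
              path: headline
            - input_key: image_url
              source_type: payload
              path: thumbnail_url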
    LLMProvider:
      type: string
      enum:
        - openai
        - google
        - anthropic
      title: LLMProvider
      description: >-
        Supported LLM providers for content generation.


        Each provider has different strengths, pricing, and multimodal
        capabilities.

        Choose based on your use case, performance requirements, and budget.


        Values:
            OPENAI: OpenAI GPT models (GPT-4o, GPT-4.1, O3-mini)
                - Best for: General purpose, vision tasks, structured outputs
                - Multimodal: Text, images
                - Performance: Fast (100-500ms), reliable
                - Cost: Moderate to high ($0.15-$10 per 1M tokens)
                - Use when: Need high-quality generation with vision support

            GOOGLE: Google Gemini models (Gemini 2.5 Flash Lite, Gemini 2.5 Pro)
                - Best for: Fast generation, video understanding, cost-efficiency
                - Multimodal: Text, images, video, audio, PDFs
                - Performance: Very fast (50-200ms)
                - Cost: Low to moderate ($0.075-$0.40 per 1M tokens)
                - Use when: Need video/audio/PDF support or cost-efficiency

            ANTHROPIC: Anthropic Claude models (Claude Sonnet 4.5, Claude Haiku 4.5, Claude 3.5 Sonnet, Claude 3.5 Haiku)
                - Best for: Long context, complex reasoning, safety
                - Multimodal: Text, images
                - Performance: Moderate (200-800ms)
                - Cost: Moderate to high ($0.25-$15 per 1M tokens)
                - Use when: Need long context or complex reasoning

        Examples:
            - Use OPENAI for production with structured JSON outputs
            - Use GOOGLE for video summarization and cost-sensitive workloads
            - Use ANTHROPIC for complex reasoning with long documents
    OpenAIModel:
      type: string
      enum:
        - gpt-4o-2024-08-06
        - gpt-4o-mini-2024-07-18
        - gpt-4.1-2025-04-14
        - gpt-4.1-mini-2025-04-14
        - o3-mini-2025-01-31
      title: OpenAIModel
      description: |-
        OpenAI model identifiers for LLM generation.

        Models listed in order of capability and cost (highest to lowest).
        All models support vision (images) except O3-mini.

        Values:
            GPT_4O: Latest GPT-4 Omni model (2024-08-06)
                - Use for: Production, highest quality generation
                - Context: 128K tokens
                - Vision: Yes
                - Cost: $2.50/1M input, $10/1M output
                - Performance: 200-500ms per request
                - When to use: Need best quality, willing to pay premium

            GPT_41: GPT-4.1 (2025-04-14)
                - Use for: Future-proofed pipelines
                - Context: 128K tokens
                - Vision: Yes
                - Cost: TBD (expected similar to GPT-4o)
                - When to use: Want latest model features

            GPT_4O_MINI: Smaller, faster GPT-4 Omni (2024-07-18)
                - Use for: High-volume, cost-sensitive workloads
                - Context: 128K tokens
                - Vision: Yes
                - Cost: $0.15/1M input, $0.60/1M output
                - Performance: 100-200ms per request
                - When to use: Good balance of quality and cost

            GPT_41_MINI: Smaller GPT-4.1 (2025-04-14)
                - Use for: Future cost-optimized pipelines
                - Context: 128K tokens
                - Vision: Yes
                - Cost: TBD (expected similar to GPT-4o-mini)
                - When to use: Want latest features at lower cost

            O3_MINI: Reasoning-optimized model (2025-01-31)
                - Use for: Complex reasoning, math, code
                - Context: 200K tokens
                - Vision: No
                - Cost: TBD
                - When to use: Need advanced reasoning capabilities

        Examples:
            - Use GPT_4O for caption generation with images (best quality)
            - Use GPT_4O_MINI for high-volume video scene summarization (cost-effective)
            - Use O3_MINI for complex entity extraction requiring reasoning
    GoogleModel:
      type: string
      enum:
        - gemini-2.5-flash-lite
        - gemini-2.5-flash
        - gemini-2.5-pro
        - gemini-3.1-flash-lite
      title: GoogleModel
      description: >-
        Google Gemini model identifiers for LLM generation.


        Gemini models excel at multimodal understanding with best-in-class video
        support.

        All models support text, images, video, audio, and PDFs.


        Values:
            GEMINI_2_5_FLASH_LITE: Gemini 2.5 Flash Lite model (recommended, stable GA)
                - Use for: Fastest generation, cost-effective multimodal
                - Context: 1M tokens
                - Multimodal: Text, images, video, audio, PDFs
                - When to use: Default choice for all Gemini use cases

            GEMINI_2_5_PRO: Gemini 2.5 Pro model
                - Use for: Higher quality reasoning, complex tasks
                - Context: 1M tokens

            GEMINI_2_5_FLASH: Gemini 2.5 Flash model
                - Kept for backward compatibility.

            GEMINI_3_1_FLASH_LITE: Alias for gemini-2.5-flash-lite (backwards compat)
                - Note: gemini-3.1-flash-lite does NOT exist in Google's API.
                  This value is mapped to gemini-2.5-flash-lite at runtime.
    AnthropicModel:
      type: string
      enum:
        - claude-sonnet-4-5-20250929
        - claude-haiku-4-5-20251001
        - claude-3-5-sonnet-20241022
        - claude-3-5-haiku-20241022
      title: AnthropicModel
      description: |-
        Anthropic Claude model identifiers for LLM generation.

        Claude models excel at long context, complex reasoning, and safety.
        All models support text and images.

        Values:
            CLAUDE_3_5_SONNET: Most capable Claude 3.5 model
                - Use for: Complex reasoning, long documents, safety-critical
                - Context: 200K tokens
                - Vision: Yes
                - Cost: $3/1M input, $15/1M output
                - Performance: 300-800ms per request
                - When to use: Need best reasoning, safety, or long context

            CLAUDE_3_5_HAIKU: Fast, cost-effective Claude model
                - Use for: High-volume, quick summaries
                - Context: 200K tokens
                - Vision: Yes
                - Cost: $0.25/1M input, $1.25/1M output
                - Performance: 100-300ms per request
                - When to use: Good balance of quality and cost

        Examples:
            - Use CLAUDE_3_5_SONNET for complex entity extraction from contracts (best reasoning)
            - Use CLAUDE_3_5_HAIKU for high-volume content moderation (cost-effective)
    ClusteringAlgorithm:
      type: string
      enum:
        - kmeans
        - dbscan
        - hdbscan
        - agglomerative
        - spectral
        - gaussian_mixture
        - mean_shift
        - optics
        - leiden
        - attribute_based
        - auto
      title: ClusteringAlgorithm
      description: |-
        Supported clustering algorithms.

        Two types of clustering are available:
        1. Vector-based: Clusters documents by embedding similarity
        2. Attribute-based: Clusters documents by metadata attributes

        Vector-based algorithms (require feature_vector):
            - kmeans: Partitions data into K clusters by minimizing within-cluster variance
            - dbscan: Density-based clustering, finds clusters of arbitrary shape
            - hdbscan: Hierarchical DBSCAN, auto-determines number of clusters
            - agglomerative: Hierarchical clustering using linkage criteria
            - spectral: Uses graph theory to find clusters
            - gaussian_mixture: Probabilistic model assuming Gaussian distributions
            - mean_shift: Finds clusters by locating density maxima
            - optics: Ordering points to identify clustering structure
            - leiden: Community detection on a kNN graph (fast at scale, auto cluster
              count via resolution; skips the UMAP step that dominates KMeans runtime)

        Attribute-based algorithm (requires attribute_config):
            - attribute_based: Groups documents by metadata attributes (e.g., category, brand)
    KMeansParams:
      properties:
        n_clusters:
          type: integer
          maximum: 1000
          minimum: 2
          title: N Clusters
          description: Number of clusters to form
          default: 8
        max_iter:
          type: integer
          maximum: 10000
          minimum: 1
          title: Max Iter
          description: Maximum number of iterations
          default: 300
        random_state:
          anyOf:
            - type: integer
            - type: 'null'
          title: Random State
          description: Random seed for reproducibility
          default: 42
        n_init:
          type: integer
          minimum: 1
          title: N Init
          description: Number of times k-means will run with different centroid seeds
          default: 10
        tol:
          type: number
          exclusiveMinimum: 0
          title: Tol
          description: Tolerance for convergence
          default: 0.0001
        init:
          type: string
          title: Init
          description: Method for initialization ('k-means++' or 'random')
          default: k-means++
        verbose:
          type: integer
          minimum: 0
          title: Verbose
          description: Verbosity mode
          default: 0
        copy_x:
          type: boolean
          title: Copy X
          description: If True, the original data is not modified
          default: true
        algorithm:
          type: string
          title: Algorithm
          description: K-means algorithm to use ('lloyd', 'elkan', or 'auto')
          default: lloyd
      type: object
      title: KMeansParams
      description: Parameters for K-Means clustering algorithm.
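      # Illustrative sketch, not part of the upstream spec: a KMeansParams
      # object that overrides a few defaults. The specific numbers are
      # assumptions chosen to sit inside the bounds declared above
      # (n_clusters 2-1000, max_iter 1-10000); omitted fields keep their
      # defaults.
      examples:
        - n_clusters: 12
          max_iter: 500
          n_init: 10
          init: k-means++
          random_state: 42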
    DBSCANParams:
      properties:
        eps:
          type: number
          exclusiveMinimum: 0
          title: Eps
          description: >-
            Maximum distance between two samples for one to be considered in the
            neighborhood of the other
          default: 0.5
        min_samples:
          type: integer
          minimum: 1
          title: Min Samples
          description: >-
            Number of samples in a neighborhood for a point to be considered a
            core point
          default: 5
        metric:
          type: string
          title: Metric
          description: Metric to use for distance computation
          default: euclidean
        metric_params:
          additionalProperties: true
          type: object
          title: Metric Params
          description: Additional keyword arguments for the metric function
        algorithm:
          type: string
          title: Algorithm
          description: >-
            Algorithm to compute pointwise distances and find nearest neighbors
            ('auto', 'ball_tree', 'kd_tree', 'brute')
          default: auto
        leaf_size:
          type: integer
          minimum: 1
          title: Leaf Size
          description: Leaf size passed to BallTree or KDTree
          default: 30
        p:
          type: number
          exclusiveMinimum: 0
          title: P
          description: >-
            The power of the Minkowski metric to be used to calculate distance
            between points
          default: 2
        n_jobs:
          type: integer
          title: N Jobs
          description: The number of parallel jobs to run (-1 means using all processors)
          default: 1
      type: object
      title: DBSCANParams
      description: Parameters for DBSCAN clustering algorithm.
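      # Illustrative sketch, not part of the upstream spec: a DBSCANParams
      # object tuned for embedding vectors. The values are assumptions; a
      # smaller eps with a larger min_samples makes the density requirement
      # stricter, so more points are likely to be reported as noise. Whether
      # 'cosine' is accepted as a metric depends on the backend and is assumed
      # here.
      examples:
        - eps: 0.3
          min_samples: 10
          metric: cosine
          algorithm: auto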
    HDBSCANParams:
      properties:
        min_cluster_size:
          type: integer
          minimum: 2
          title: Min Cluster Size
          description: Minimum number of samples in a cluster
          default: 5
        min_samples:
          anyOf:
            - type: integer
              minimum: 1
            - type: 'null'
          title: Min Samples
          description: >-
            Number of samples in a neighborhood for a point to be considered a
            core point. Defaults to min_cluster_size if None
        cluster_selection_epsilon:
          type: number
          minimum: 0
          title: Cluster Selection Epsilon
          description: >-
            A distance threshold for cluster selection. Clusters below this
            value will be merged
          default: 0
        max_cluster_size:
          anyOf:
            - type: integer
              minimum: 1
            - type: 'null'
          title: Max Cluster Size
          description: >-
            Maximum number of samples in a cluster. Clusters above this size
            will be split
        metric:
          type: string
          title: Metric
          description: Metric to use for distance computation
          default: euclidean
        alpha:
          type: number
          exclusiveMinimum: 0
          title: Alpha
          description: A distance scaling parameter
          default: 1
        cluster_selection_method:
          type: string
          title: Cluster Selection Method
          description: Method to select clusters from the condensed tree ('eom' or 'leaf')
          default: eom
        allow_single_cluster:
          type: boolean
          title: Allow Single Cluster
          description: Allow HDBSCAN to find only a single cluster
          default: false
        prediction_data:
          type: boolean
          title: Prediction Data
          description: Whether to generate extra data for predicting cluster membership
          default: false
        match_reference_implementation:
          type: boolean
          title: Match Reference Implementation
          description: Whether to match the reference implementation exactly
          default: false
      type: object
      title: HDBSCANParams
      description: Parameters for HDBSCAN clustering algorithm.
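      # Illustrative sketch, not part of the upstream spec: an HDBSCANParams
      # object that asks for larger, finer-grained clusters. The values are
      # assumptions within the declared bounds; when min_samples is omitted it
      # falls back to min_cluster_size, as the field description states.
      examples:
        - min_cluster_size: 15
          min_samples: 5
          cluster_selection_method: leaf
          allow_single_cluster: false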
    AgglomerativeParams:
      properties:
        n_clusters:
          anyOf:
            - type: integer
              minimum: 2
            - type: 'null'
          title: N Clusters
          description: >-
            Number of clusters to find. Can be None if distance_threshold is not
            None
          default: 2
        affinity:
          type: string
          title: Affinity
          description: >-
            Metric used to compute linkage. Can be 'euclidean', 'l1', 'l2',
            'manhattan', 'cosine', or 'precomputed'
          default: euclidean
        memory:
          anyOf:
            - type: string
            - type: 'null'
          title: Memory
          description: Path to the caching directory
        connectivity:
          anyOf:
            - {}
            - type: 'null'
          title: Connectivity
          description: Connectivity matrix. Defines which samples are neighbors
        compute_full_tree:
          type: string
          title: Compute Full Tree
          description: Whether to compute the full tree ('auto', True, or False)
          default: auto
        linkage:
          type: string
          title: Linkage
          description: Linkage criterion ('ward', 'complete', 'average', 'single')
          default: ward
        distance_threshold:
          anyOf:
            - type: number
              exclusiveMinimum: 0
            - type: 'null'
          title: Distance Threshold
          description: >-
            The linkage distance threshold above which clusters will not be
            merged
        compute_distances:
          type: boolean
          title: Compute Distances
          description: Whether to compute distances between clusters
          default: false
      type: object
      title: AgglomerativeParams
      description: Parameters for Agglomerative clustering algorithm.
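      # Illustrative sketch, not part of the upstream spec: an
      # AgglomerativeParams object that cuts the tree by distance instead of
      # by a fixed cluster count. Per the n_clusters description, n_clusters
      # may be null when distance_threshold is set. The threshold value is an
      # assumption; 'average' linkage is used because 'ward' is typically
      # restricted to euclidean distances.
      examples:
        - n_clusters: null
          distance_threshold: 0.8
          affinity: cosine
          linkage: average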
    SpectralParams:
      properties:
        n_clusters:
          type: integer
          minimum: 2
          title: N Clusters
          description: Number of clusters to form
          default: 8
        eigen_solver:
          anyOf:
            - type: string
            - type: 'null'
          title: Eigen Solver
          description: >-
            The eigenvalue decomposition strategy ('arpack', 'lobpcg', 'amg', or
            None)
        n_components:
          anyOf:
            - type: integer
              minimum: 1
            - type: 'null'
          title: N Components
          description: Number of eigenvectors to use for spectral embedding
        random_state:
          anyOf:
            - type: integer
            - type: 'null'
          title: Random State
          description: Random seed for reproducibility
          default: 42
        n_init:
          type: integer
          minimum: 1
          title: N Init
          description: Number of times k-means will run with different centroid seeds
          default: 10
        gamma:
          type: number
          exclusiveMinimum: 0
          title: Gamma
          description: >-
            Kernel coefficient for rbf, poly, sigmoid, laplacian and chi2
            kernels
          default: 1
        affinity:
          type: string
          title: Affinity
          description: >-
            How to construct the affinity matrix ('nearest_neighbors', 'rbf',
            'precomputed', 'precomputed_nearest_neighbors')
          default: rbf
        n_neighbors:
          type: integer
          minimum: 1
          title: N Neighbors
          description: >-
            Number of neighbors to use when constructing the affinity matrix
            using nearest neighbors
          default: 10
        eigen_tol:
          type: number
          minimum: 0
          title: Eigen Tol
          description: Stopping criterion for eigendecomposition
          default: 0
        assign_labels:
          type: string
          title: Assign Labels
          description: >-
            Strategy to assign labels in the embedding space ('kmeans' or
            'discretize')
          default: kmeans
        degree:
          type: number
          title: Degree
          description: Degree of the polynomial kernel. Ignored by other kernels
          default: 3
        coef0:
          type: number
          title: Coef0
          description: Zero coefficient for polynomial and sigmoid kernels
          default: 1
        kernel_params:
          anyOf:
            - additionalProperties: true
              type: object
            - type: 'null'
          title: Kernel Params
          description: Parameters for the kernel function
        n_jobs:
          type: integer
          title: N Jobs
          description: Number of parallel jobs to run (-1 means using all processors)
          default: 1
        verbose:
          type: boolean
          title: Verbose
          description: Verbosity mode
          default: false
      type: object
      title: SpectralParams
      description: Parameters for Spectral clustering algorithm.
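      # Illustrative sketch, not part of the upstream spec: a SpectralParams
      # object that builds the affinity matrix from a nearest-neighbor graph
      # rather than the default rbf kernel. The values are assumptions within
      # the declared bounds; gamma is left out because it applies to
      # kernel-based affinities.
      examples:
        - n_clusters: 6
          affinity: nearest_neighbors
          n_neighbors: 15
          assign_labels: kmeans
          random_state: 42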
    GaussianMixtureParams:
      properties:
        n_components:
          type: integer
          minimum: 1
          title: N Components
          description: Number of mixture components
          default: 1
        covariance_type:
          type: string
          title: Covariance Type
          description: Type of covariance parameters ('full', 'tied', 'diag', 'spherical')
          default: full
        tol:
          type: number
          exclusiveMinimum: 0
          title: Tol
          description: Convergence threshold
          default: 0.001
        reg_covar:
          type: number
          minimum: 0
          title: Reg Covar
          description: Regularization added to the diagonal of covariance
          default: 0.000001
        max_iter:
          type: integer
          minimum: 1
          title: Max Iter
          description: Maximum number of EM iterations
          default: 100
        n_init:
          type: integer
          minimum: 1
          title: N Init
          description: Number of initializations to perform
          default: 1
        init_params:
          type: string
          title: Init Params
          description: >-
            Method used to initialize weights, means and covariances ('kmeans'
            or 'random')
          default: kmeans
        weights_init:
          anyOf:
            - items: {}
              type: array
            - type: 'null'
          title: Weights Init
          description: Initial weights
        means_init:
          anyOf:
            - items: {}
              type: array
            - type: 'null'
          title: Means Init
          description: Initial means
        precisions_init:
          anyOf:
            - items: {}
              type: array
            - type: 'null'
          title: Precisions Init
          description: Initial precisions
        random_state:
          anyOf:
            - type: integer
            - type: 'null'
          title: Random State
          description: Random seed for reproducibility
          default: 42
        warm_start:
          type: boolean
          title: Warm Start
          description: If True, use the solution of the last fit as initialization
          default: false
        verbose:
          type: integer
          minimum: 0
          title: Verbose
          description: Enable verbose output
          default: 0
        verbose_interval:
          type: integer
          minimum: 1
          title: Verbose Interval
          description: Number of iterations between each verbose message
          default: 10
      type: object
      title: GaussianMixtureParams
      description: Parameters for Gaussian Mixture Model clustering.
    MeanShiftParams:
      properties:
        bandwidth:
          anyOf:
            - type: number
              exclusiveMinimum: 0
            - type: 'null'
          title: Bandwidth
          description: >-
            Bandwidth used in the flat kernel. If None, estimated using
            sklearn.cluster.estimate_bandwidth
        seeds:
          anyOf:
            - items:
                items:
                  type: number
                type: array
              type: array
            - type: 'null'
          title: Seeds
          description: >-
            Seeds used to initialize kernels. If None, all points are used as
            seeds
        bin_seeding:
          type: boolean
          title: Bin Seeding
          description: >-
            If true, initial kernel locations are discretized into a grid to
            speed up algorithm
          default: false
        min_bin_freq:
          type: integer
          minimum: 1
          title: Min Bin Freq
          description: Minimum number of points a bin must contain to be accepted as a seed (applies when bin_seeding is true)
          default: 1
        cluster_all:
          type: boolean
          title: Cluster All
          description: >-
            If true, all points are clustered, even orphans. If false, orphans
            are given label -1
          default: true
        n_jobs:
          type: integer
          title: N Jobs
          description: Number of parallel jobs to run (-1 means using all processors)
          default: 1
        max_iter:
          type: integer
          minimum: 1
          title: Max Iter
          description: >-
            Maximum number of iterations per seed point before the algorithm
            stops
          default: 300
      type: object
      title: MeanShiftParams
      description: Parameters for Mean Shift clustering algorithm.
    OPTICSParams:
      properties:
        min_samples:
          type: integer
          minimum: 2
          title: Min Samples
          description: >-
            Number of samples in a neighborhood for a point to be considered a
            core point
          default: 5
        max_eps:
          anyOf:
            - type: number
              exclusiveMinimum: 0
            - type: 'null'
          title: Max Eps
          description: >-
            Maximum distance between two samples for one to be considered in
            the neighborhood of the other. Default (None) means no maximum
            distance
        metric:
          type: string
          title: Metric
          description: Metric to use for distance computation
          default: minkowski
        p:
          type: number
          exclusiveMinimum: 0
          title: P
          description: Parameter for the Minkowski metric
          default: 2
        metric_params:
          anyOf:
            - additionalProperties: true
              type: object
            - type: 'null'
          title: Metric Params
          description: Additional keyword arguments for the metric function
        cluster_method:
          type: string
          title: Cluster Method
          description: Method to extract clusters ('xi' or 'dbscan')
          default: xi
        eps:
          anyOf:
            - type: number
              exclusiveMinimum: 0
            - type: 'null'
          title: Eps
          description: Maximum distance for DBSCAN cluster extraction method
        xi:
          type: number
          maximum: 1
          minimum: 0
          title: Xi
          description: >-
            Minimum steepness on the reachability plot for cluster boundary (xi
            method)
          default: 0.05
        predecessor_correction:
          type: boolean
          title: Predecessor Correction
          description: Correct clusters based on predecessors (xi method)
          default: true
        min_cluster_size:
          anyOf:
            - type: number
            - type: 'null'
          title: Min Cluster Size
          description: >-
            Minimum number of samples in a cluster. Values below 1.0 are
            interpreted as a fraction of the total number of samples
        algorithm:
          type: string
          title: Algorithm
          description: >-
            Algorithm to compute pointwise distances ('auto', 'ball_tree',
            'kd_tree', 'brute')
          default: auto
        leaf_size:
          type: integer
          minimum: 1
          title: Leaf Size
          description: Leaf size passed to BallTree or KDTree
          default: 30
        n_jobs:
          type: integer
          title: N Jobs
          description: Number of parallel jobs to run (-1 means using all processors)
          default: 1
      type: object
      title: OPTICSParams
      description: Parameters for OPTICS clustering algorithm.
    LeidenParams:
      properties:
        resolution:
          type: number
          maximum: 100
          exclusiveMinimum: 0
          title: Resolution
          description: >-
            Resolution for the Leiden objective. Higher values yield more,
            smaller communities; lower values yield fewer, larger ones. Leiden
            discovers the cluster count from the graph — there is no fixed
            n_clusters.
          default: 1
        n_neighbors:
          type: integer
          maximum: 200
          minimum: 2
          title: N Neighbors
          description: >-
            Number of nearest neighbours per node when building the kNN graph.
            Larger values give a denser graph (smoother communities) at higher
            build cost.
          default: 30
        metric:
          type: string
          title: Metric
          description: Distance metric for kNN graph construction (cosine or euclidean).
          default: cosine
        min_cluster_size:
          type: integer
          minimum: 0
          title: Min Cluster Size
          description: >-
            Communities smaller than this are relabelled as noise (cluster_id
            -1). 0 keeps every community.
          default: 0
        objective_function:
          type: string
          enum:
            - RBConfiguration
            - modularity
            - CPM
          title: Objective Function
          description: >-
            Leiden quality function. RBConfiguration and CPM honour the
            resolution parameter; modularity ignores it.
          default: RBConfiguration
        n_iterations:
          type: integer
          title: N Iterations
          description: >-
            Leiden optimisation passes. -1 runs until no further improvement; a
            small positive value (e.g. 2) is faster and usually sufficient.
          default: 2
        random_state:
          anyOf:
            - type: integer
            - type: 'null'
          title: Random State
          description: Random seed for reproducible partitions.
          default: 42
      type: object
      title: LeidenParams
      description: Parameters for Leiden graph community-detection clustering.
    WeightLearningConfig:
      properties:
        method:
          type: string
          enum:
            - grid_search
            - bayesian
          title: Method
          description: >-
            Weight learning method:

            - bayesian: Gaussian process optimization (recommended, scales to 5+
            features)

            - grid_search: Exhaustive search (limited to 2-3 features, simpler
            but slower)
          default: bayesian
          examples:
            - bayesian
            - grid_search
        max_iterations:
          type: integer
          maximum: 100
          minimum: 5
          title: Max Iterations
          description: >-
            Maximum optimization iterations:

            - grid_search: Number of values to try per feature (total:
            max_iterations^n_features)

            - bayesian: Number of weight combinations to evaluate

            Recommended: 20 for bayesian, 5 for grid_search
          default: 20
          examples:
            - 20
            - 50
        metric:
          type: string
          enum:
            - silhouette
            - davies_bouldin
            - calinski_harabasz
          title: Metric
          description: >-
            Clustering quality metric to optimize:

            - silhouette: Measures how similar points are to their cluster vs
            other clusters (range: [-1, 1], higher is better)

            - davies_bouldin: Ratio of within-cluster to between-cluster
            distances (range: [0, ∞), lower is better)

            - calinski_harabasz: Ratio of between-cluster to within-cluster
            variance (range: [0, ∞), higher is better)

            Recommended: silhouette (most general-purpose)
          default: silhouette
          examples:
            - silhouette
            - davies_bouldin
            - calinski_harabasz
        sample_size:
          anyOf:
            - type: integer
              minimum: 100
            - type: 'null'
          title: Sample Size
          description: >-
            Optional: Learn weights on a random sample (speeds up large
            datasets).

            If provided and dataset has more documents, weights are learned on
            sample_size random documents, then applied to full dataset.

            Recommended: 5000 for datasets >10k documents
          examples:
            - 5000
            - 10000
        random_state:
          type: integer
          title: Random State
          description: Random seed for reproducibility of weight learning
          default: 42
          examples:
            - 42
      type: object
      title: WeightLearningConfig
      description: >-
        Configuration for automatic feature weight learning in multi-feature
        clustering.


        When multi_feature_strategy='weighted' and feature_weights is not
        provided,

        this configuration controls how optimal weights are automatically
        learned.


        The system tries different weight combinations and picks the one that

        produces the best clustering quality (measured by the configured
        metric, such as silhouette score).


        Examples:
            Bayesian optimization (recommended):
            {
                "method": "bayesian",
                "max_iterations": 20,
                "metric": "silhouette",
                "sample_size": 5000
            }

            Grid search (exhaustive, limited to 2-3 features):
            {
                "method": "grid_search",
                "max_iterations": 5,
                "metric": "silhouette"
            }
      examples:
        - max_iterations: 20
          method: bayesian
          metric: silhouette
          random_state: 42
          sample_size: 5000
        - max_iterations: 5
          method: grid_search
          metric: silhouette
          random_state: 42
    TSNEParams:
      properties:
        method:
          type: string
          const: tsne
          title: Method
          default: tsne
        n_components:
          type: integer
          maximum: 256
          minimum: 2
          title: N Components
          default: 2
        random_state:
          type: integer
          title: Random State
          default: 42
        perplexity:
          type: number
          exclusiveMinimum: 0
          title: Perplexity
          default: 30
        learning_rate:
          type: number
          exclusiveMinimum: 0
          title: Learning Rate
          default: 200
      type: object
      title: TSNEParams
    UMAPParams:
      properties:
        method:
          type: string
          const: umap
          title: Method
          default: umap
        n_components:
          type: integer
          maximum: 256
          minimum: 2
          title: N Components
          default: 50
        random_state:
          type: integer
          title: Random State
          default: 42
        n_neighbors:
          type: integer
          minimum: 2
          title: N Neighbors
          default: 30
        min_dist:
          type: number
          maximum: 1
          minimum: 0
          title: Min Dist
          default: 0
        metric:
          type: string
          title: Metric
          description: >-
            Distance metric for UMAP. 'cosine' is best for normalized
            embeddings.
          default: cosine
      type: object
      title: UMAPParams
    WhiteningParams:
      properties:
        method:
          type: string
          const: whitening
          title: Method
          default: whitening
        regularization:
          type: number
          minimum: 0
          title: Regularization
          description: Eigenvalue floor to prevent division by near-zero values.
          default: 0.00001
      type: object
      title: WhiteningParams
    NoReduction:
      properties:
        method:
          type: string
          const: none
          title: Method
          default: none
      type: object
      title: NoReduction
    FilterCondition:
      properties:
        field:
          type: string
          title: Field
          description: Field name to filter on
        operator:
          $ref: '#/components/schemas/FilterOperator'
          description: Comparison operator
          default: eq
        value:
          anyOf:
            - $ref: '#/components/schemas/DynamicValue'
            - {}
          title: Value
          description: Value to compare against
      type: object
      required:
        - field
        - value
      title: FilterCondition
      description: |-
        Represents a single filter condition.

        Attributes:
            field: The field to filter on
            operator: The comparison operator
            value: The value to compare against
    LLMLabelingInput-Output:
      properties:
        input_mappings:
          items:
            $ref: '#/components/schemas/InputMapping'
          type: array
          minItems: 1
          title: Input Mappings
          description: >-
            Flexible input mappings for constructing LLM context. Supports
            multimodal inputs (text, image_url, video_url, audio_url). Each
            mapping specifies how to extract data from document payloads. At
            least one input mapping is required.
      type: object
      required:
        - input_mappings
      title: LLMLabelingInput
      description: |-
        Input configuration for LLM-based cluster labeling.

        Supports flexible input mappings similar to retrievers and buckets,
        allowing multimodal inputs (text, images, videos, audio) for providers
        like Gemini that support native multimodal understanding.

        Examples:
            # Text-only labeling:
            LLMLabelingInput(input_mappings=[
                InputMapping(input_key="headline", source_type="payload", path="headline"),
                InputMapping(input_key="description", source_type="payload", path="description")
            ])

            # Multimodal labeling with images:
            LLMLabelingInput(input_mappings=[
                InputMapping(input_key="text", source_type="payload", path="headline"),
                InputMapping(input_key="image_url", source_type="payload", path="thumbnail_url")
            ])

            # Multimodal with video (for Gemini):
            LLMLabelingInput(input_mappings=[
                InputMapping(input_key="text", source_type="payload", path="description"),
                InputMapping(input_key="video_url", source_type="payload", path="video_url")
            ])
    EnrichmentFieldMapping:
      properties:
        source_field:
          type: string
          title: Source Field
          description: >-
            Field from cluster results to include. Available fields: cluster_id,
            cluster_label, distance_to_centroid, member_count, keywords, x, y, z
            (visualization coords), metadata.*
        target_field:
          type: string
          title: Target Field
          description: >-
            Target field name in enriched document. Example: 'category_id' for
            cluster_id, 'product_category' for cluster_label
      type: object
      required:
        - source_field
        - target_field
      title: EnrichmentFieldMapping
      description: |-
        Maps a cluster result field to a document enrichment field.

        Similar to InputMapping pattern used throughout Mixpeek.
    InputMapping:
      properties:
        input_key:
          type: string
          title: Input Key
          description: Key used in the constructed inputs payload.
        source_type:
          anyOf:
            - $ref: '#/components/schemas/InputSourceType'
            - type: 'null'
          description: Source of the value (payload, literal, vector, blob).
        path:
          anyOf:
            - type: string
            - type: 'null'
          title: Path
          description: >-
            Dot-notation path inside payload/vector when source_type is PAYLOAD
            or VECTOR.
        override:
          anyOf:
            - {}
            - type: 'null'
          title: Override
          description: Static value used when source_type is LITERAL. Overrides any path.
      type: object
      required:
        - input_key
      title: InputMapping
      description: |-
        Declarative mapping for building inputs from various sources.

        - input_key: The key used in the constructed inputs payload
        - source_type: Where to fetch the value (payload, literal, vector, blob)
        - path: Dot-notation path when source_type is PAYLOAD or VECTOR
        - override: Static value when source_type is LITERAL
      examples:
        - input_key: query_text
          path: content.title
          source_type: payload
        - input_key: lang
          override: en
          source_type: literal
        - input_key: image_vector
          path: features.clip_vit_l_14
          source_type: vector
    FilterOperator:
      type: string
      enum:
        - eq
        - ne
        - gt
        - lt
        - gte
        - lte
        - in
        - nin
        - contains
        - starts_with
        - ends_with
        - regex
        - exists
        - is_null
        - text
        - phrase
      title: FilterOperator
      description: Supported filter operators across database implementations.
    DynamicValue:
      properties:
        type:
          type: string
          const: dynamic
          title: Type
          default: dynamic
        field:
          type: string
          title: Field
          description: >-
            The dot-notation path to the value in the runtime query request,
            e.g., 'inputs.user_id'
          examples:
            - inputs.query_text
            - filters.AND[0].value
      type: object
      required:
        - field
      title: DynamicValue
      description: A value that should be dynamically resolved from the query request.
    InputSourceType:
      type: string
      enum:
        - payload
        - literal
        - vector
        - blob
      title: InputSourceType
      description: Where the value for an input should be retrieved from.

````
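
The `WeightLearningConfig` schema above states that grid search tries `max_iterations` values per feature, for a total of `max_iterations^n_features` combinations. The following is a minimal Python sketch of that arithmetic, together with a client-side check of the bounds the schema documents (`max_iterations` between 5 and 100, `sample_size` at least 100 or null, enumerated `method` and `metric`). The helper names `grid_search_evaluations` and `check_weight_learning_config` are hypothetical and not part of any Mixpeek SDK; the API itself remains the authority on validation.

```python
from typing import Any, Dict, List

# Enum values copied from the WeightLearningConfig schema above.
METHODS = {"grid_search", "bayesian"}
METRICS = {"silhouette", "davies_bouldin", "calinski_harabasz"}


def grid_search_evaluations(max_iterations: int, n_features: int) -> int:
    """Total weight combinations for grid search: max_iterations ** n_features."""
    return max_iterations ** n_features


def check_weight_learning_config(config: Dict[str, Any]) -> List[str]:
    """Return violations of the documented constraints (empty list if none).

    Hypothetical helper: missing keys fall back to the schema defaults.
    """
    errors: List[str] = []
    if config.get("method", "bayesian") not in METHODS:
        errors.append("method must be 'grid_search' or 'bayesian'")
    if config.get("metric", "silhouette") not in METRICS:
        errors.append("metric must be one of: " + ", ".join(sorted(METRICS)))
    if not 5 <= config.get("max_iterations", 20) <= 100:
        errors.append("max_iterations must be between 5 and 100")
    sample_size = config.get("sample_size")
    if sample_size is not None and sample_size < 100:
        errors.append("sample_size must be at least 100, or null")
    return errors


if __name__ == "__main__":
    # With the recommended 5 values per feature, cost grows exponentially:
    for n_features in (2, 3, 5):
        print(n_features, grid_search_evaluations(5, n_features))
    # prints: 2 25, then 3 125, then 5 3125

    # max_iterations=3 is below the documented minimum of 5.
    print(check_weight_learning_config({"method": "grid_search", "max_iterations": 3}))
```

The exponential growth is why the schema describes `grid_search` as limited to 2-3 features: at 5 values per feature, five features already require 3125 clustering runs, whereas `bayesian` evaluates only `max_iterations` weight combinations in total.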