> ## Documentation Index
> Fetch the complete documentation index at: https://docs.mixpeek.com/docs/llms.txt
> Use this file to discover all available pages before exploring further.

# Get Cluster

> Retrieve a cluster by ID or name.

    Returns cluster metadata including:
    - Configuration (cluster_type, algorithm, parameters)
    - Output collection information (output_collection_id, output_collection_name)
    - Execution results (num_clusters, num_documents_clustered, status)
    - Timestamps and metadata


## OpenAPI

````yaml get /v1/clusters/{cluster_identifier}
openapi: 3.1.0
info:
  title: Mixpeek API
  description: >-
    This is the Mixpeek API, providing access to various endpoints for data
    processing and retrieval.
  termsOfService: https://mixpeek.com/terms
  contact:
    name: Mixpeek Support
    url: https://mixpeek.com/contact
    email: info@mixpeek.com
  version: '0.82'
servers:
  - url: https://api.mixpeek.com
    description: Production
security: []
paths:
  /v1/clusters/{cluster_identifier}:
    get:
      tags:
        - Clusters
      summary: Get Cluster
      description: |-
        Retrieve a cluster by ID or name.

            Returns cluster metadata including:
            - Configuration (cluster_type, algorithm, parameters)
            - Output collection information (output_collection_id, output_collection_name)
            - Execution results (num_clusters, num_documents_clustered, status)
            - Timestamps and metadata
      operationId: get_cluster_v1_clusters__cluster_identifier__get
      parameters:
        - name: cluster_identifier
          in: path
          required: true
          schema:
            type: string
            description: Cluster ID or name
            title: Cluster Identifier
          description: Cluster ID or name
      responses:
        '200':
          description: Successful Response
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ClusterMetadata'
        '400':
          description: Bad Request
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
        '401':
          description: Unauthorized
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
        '403':
          description: Forbidden
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
        '404':
          description: Not Found
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
        '422':
          description: Validation Error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/HTTPValidationError'
        '500':
          description: Internal Server Error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
components:
  schemas:
    ClusterMetadata:
      properties:
        cluster_id:
          type: string
          title: Cluster Id
          description: Unique cluster job identifier
        cluster_name:
          type: string
          title: Cluster Name
          description: Human-readable cluster name
        namespace_id:
          type: string
          title: Namespace Id
          description: Namespace this cluster belongs to
        input_collections:
          items:
            type: string
          type: array
          title: Input Collections
          description: Source collection IDs that were clustered
        source_bucket_ids:
          anyOf:
            - items:
                type: string
              type: array
            - type: 'null'
          title: Source Bucket Ids
          description: >-
            Source bucket IDs that the input collections originated from.
            Enables bucket lineage tracking.
        filters:
          anyOf:
            - additionalProperties: true
              type: object
            - type: 'null'
          title: Filters
          description: >-
            Optional filters that were applied to pre-filter documents before
            clustering
        cluster_type:
          type: string
          enum:
            - vector
            - attribute
          title: Cluster Type
          description: >-
            Type of clustering: vector (embedding-based) or attribute
            (metadata-based)
        feature_uris:
          anyOf:
            - items:
                type: string
              type: array
            - type: 'null'
          title: Feature Uris
          description: >-
            Feature URIs that were clustered
            (mixpeek://{extractor}@{version}/{output}). Only for vector
            clustering.
        multi_feature_strategy:
          anyOf:
            - type: string
            - type: 'null'
          title: Multi Feature Strategy
          description: >-
            Strategy used if multiple features
            (concatenate/independent/weighted). Only for vector clustering.
        learned_weights:
          anyOf:
            - additionalProperties:
                type: number
              type: object
            - type: 'null'
          title: Learned Weights
          description: >-
            Automatically learned feature weights (when
            multi_feature_strategy='weighted'). Keys are feature URIs, values
            are learned weights. Only populated after clustering execution
            completes.
        learning_quality_score:
          anyOf:
            - type: number
            - type: 'null'
          title: Learning Quality Score
          description: >-
            Clustering quality score from weight learning (e.g., silhouette
            score). Only populated when multi_feature_strategy='weighted' and
            weights were learned.
        effective_feature_method:
          anyOf:
            - type: string
            - type: 'null'
          title: Effective Feature Method
          description: >-
            Method for calculating cluster centroids (mean/median/medoid). Only
            for vector clustering.
        face_cluster_merge:
          anyOf:
            - $ref: '#/components/schemas/FaceClusterMergeConfig'
            - type: 'null'
          description: >-
            Stored post-HDBSCAN face-identity merge config. Populated from
            `vector_config.face_cluster_merge` at cluster creation and replayed
            into `ClusteringConfig` on every execute. Only applies to vector
            clustering.
        sample_size:
          anyOf:
            - type: integer
            - type: 'null'
          title: Sample Size
          description: >-
            Stored per-execution document cap. Populated from
            `vector_config.sample_size` at cluster creation and replayed into
            `ClusteringConfig` on every `POST /v1/clusters/{id}/execute` so
            re-runs stay consistent with the original config. When None, the
            export is uncapped (bounded by the engine's 100,000 safety limit).
            Only applies to vector clustering.
        preprocessing_steps:
          anyOf:
            - items:
                additionalProperties: true
                type: object
              type: array
            - type: 'null'
          title: Preprocessing Steps
          description: >-
            Stored preprocessing steps from vector_config. Replayed into
            ClusteringConfig on every execute.
        hierarchical_vector:
          anyOf:
            - type: boolean
            - type: 'null'
          title: Hierarchical Vector
          description: Whether recursive sub-clustering is enabled for vector clustering.
        max_hierarchy_depth:
          anyOf:
            - type: integer
            - type: 'null'
          title: Max Hierarchy Depth
          description: Maximum recursion depth for hierarchical sub-clustering.
        vis_n_components:
          anyOf:
            - type: integer
              enum:
                - 2
                - 3
            - type: 'null'
          title: Vis N Components
          description: >-
            Stored visualization dimensionality (2D or 3D). Replayed into
            ClusteringConfig on every execute as the default.
        clustered_attributes:
          anyOf:
            - items:
                type: string
              type: array
            - type: 'null'
          title: Clustered Attributes
          description: >-
            Attribute field names that were clustered. Only for attribute
            clustering.
        hierarchical_grouping:
          anyOf:
            - type: boolean
            - type: 'null'
          title: Hierarchical Grouping
          description: >-
            Whether hierarchical clustering was used. Only for attribute
            clustering.
        aggregation_method:
          anyOf:
            - type: string
            - type: 'null'
          title: Aggregation Method
          description: >-
            Method for aggregating attributes (most_frequent/first/last). Only
            for attribute clustering.
        output_collection_ids:
          items:
            type: string
          type: array
          title: Output Collection Ids
          description: >-
            Collection IDs where cluster documents are stored. For single
            output: list with one collection ID. For per-feature output: list
            with one collection ID per feature.
        output_collection_names:
          items:
            type: string
          type: array
          title: Output Collection Names
          description: Names of output collections. Corresponds to output_collection_ids.
        algorithm:
          anyOf:
            - type: string
            - type: 'null'
          title: Algorithm
          description: Clustering algorithm used (hdbscan, kmeans, attribute_based, etc.)
        algorithm_params:
          anyOf:
            - additionalProperties: true
              type: object
            - type: 'null'
          title: Algorithm Params
          description: Algorithm-specific parameters (not used for attribute_based)
        enrich_source:
          type: boolean
          title: Enrich Source
          description: Whether source documents were enriched with cluster_id
          default: false
        source_enrichment_config:
          anyOf:
            - $ref: '#/components/schemas/SourceEnrichmentConfig'
            - type: 'null'
          description: Configuration for source enrichment (if enrich_source=True)
        llm_labeling:
          anyOf:
            - $ref: '#/components/schemas/LLMLabeling-Output'
            - type: 'null'
          description: >-
            Configuration for LLM-based cluster labeling (applies to all cluster
            types)
        num_clusters:
          anyOf:
            - type: integer
            - type: 'null'
          title: Num Clusters
          description: >-
            Number of clusters found (excludes noise/outliers, populated after
            execution)
        num_documents_clustered:
          anyOf:
            - type: integer
            - type: 'null'
          title: Num Documents Clustered
          description: Total documents processed
        execution_time_seconds:
          anyOf:
            - type: number
            - type: 'null'
          title: Execution Time Seconds
          description: Time taken to complete clustering
        quality_metrics:
          anyOf:
            - additionalProperties:
                type: number
              type: object
            - type: 'null'
          title: Quality Metrics
          description: >-
            Clustering quality metrics (silhouette_score, davies_bouldin_score,
            calinski_harabasz_score, etc.)
        hierarchy_detected:
          type: boolean
          title: Hierarchy Detected
          description: >-
            Whether implicit hierarchy was detected (multi-feature independent)
            or created (hierarchical attributes)
          default: false
        parent_cluster_id:
          anyOf:
            - type: string
            - type: 'null'
          title: Parent Cluster Id
          description: For child clusters in hierarchy
        child_cluster_ids:
          anyOf:
            - items:
                type: string
              type: array
            - type: 'null'
          title: Child Cluster Ids
          description: For parent clusters
        hierarchy_relationships:
          anyOf:
            - items:
                additionalProperties: true
                type: object
              type: array
            - type: 'null'
          title: Hierarchy Relationships
          description: Parent-child relationships detected from cluster membership overlap
        status:
          $ref: '#/components/schemas/TaskStatusEnum'
          description: Cluster job status (propagated from TaskService)
          default: PENDING
        error:
          anyOf:
            - type: string
            - type: 'null'
          title: Error
          description: >-
            Error message if cluster execution failed. Propagated from
            TaskService.
        failure_category:
          anyOf:
            - $ref: '#/components/schemas/FailureCategory'
            - type: 'null'
          description: >-
            Machine-readable failure classification. Same enum as batch
            failure_category.
        last_execution_task_id:
          anyOf:
            - type: string
            - type: 'null'
          title: Last Execution Task Id
          description: Most recent task ID for this cluster
        last_run_id:
          anyOf:
            - type: string
            - type: 'null'
          title: Last Run Id
          description: Most recent execution run ID
        created_at:
          type: string
          format: date-time
          title: Created At
          description: When cluster was created
        updated_at:
          type: string
          format: date-time
          title: Updated At
          description: When cluster was last updated
        last_executed_at:
          anyOf:
            - type: string
              format: date-time
            - type: 'null'
          title: Last Executed At
          description: Last execution timestamp
        completed_at:
          anyOf:
            - type: string
              format: date-time
            - type: 'null'
          title: Completed At
          description: When clustering completed successfully
        llm_labeling_errors:
          anyOf:
            - items:
                type: string
              type: array
            - type: 'null'
          title: Llm Labeling Errors
          description: >-
            List of errors encountered during LLM labeling (if any). Stored in
            MongoDB cluster metadata only, NOT in Qdrant cluster documents. Used
            to track LLM failures while allowing fallback labels to work.
        metadata:
          additionalProperties: true
          type: object
          title: Metadata
          description: Additional user-defined metadata
      type: object
      required:
        - cluster_name
        - namespace_id
        - input_collections
        - cluster_type
      title: ClusterMetadata
      description: |-
        Cluster job metadata stored in MongoDB clusters collection.

        This is separate from cluster documents themselves. Tracks job-level
        configuration, status, and summary statistics.

        Supports both vector and attribute clustering with appropriate metadata.
    ErrorResponse:
      properties:
        success:
          type: boolean
          title: Success
          description: Always false for error responses
          default: false
        status:
          type: integer
          title: Status
          description: HTTP status code for this error
        error:
          $ref: '#/components/schemas/ErrorDetail'
          description: Error details payload
      type: object
      required:
        - status
        - error
      title: ErrorResponse
      description: Error response model.
      examples:
        - error:
            details:
              id: ns_123
              resource: namespace
            message: Namespace not found
            type: NotFoundError
          status: 404
          success: false
    HTTPValidationError:
      properties:
        detail:
          items:
            $ref: '#/components/schemas/ValidationError'
          type: array
          title: Detail
      type: object
      title: HTTPValidationError
    FaceClusterMergeConfig:
      properties:
        enabled:
          type: boolean
          title: Enabled
          description: >-
            Run the merge pass. Set False to disable without removing the
            config.
          default: true
        centroid_cosine_threshold:
          type: number
          maximum: 1
          minimum: 0
          title: Centroid Cosine Threshold
          description: Minimum centroid cosine similarity for a candidate merge.
          default: 0.55
        bbox_iou_threshold:
          type: number
          maximum: 1
          minimum: 0
          title: Bbox Iou Threshold
          description: Minimum bbox IoU on overlapping frames to satisfy the spatial half.
          default: 0.4
        scene_jaccard_threshold:
          type: number
          maximum: 1
          minimum: 0
          title: Scene Jaccard Threshold
          description: >-
            Minimum Jaccard similarity of scene-id sets to satisfy the spatial
            half.
          default: 0.3
        bbox_field:
          type: string
          title: Bbox Field
          description: >-
            Document-payload field holding the face bbox (list/tuple of 4
            floats).
          default: bbox
        frame_field:
          type: string
          title: Frame Field
          description: >-
            Document-payload field holding the frame identifier used to pair
            bboxes.
          default: frame_number
        scene_field:
          type: string
          title: Scene Field
          description: >-
            Document-payload field holding the scene identifier used for
            Jaccard.
          default: scene_id
      type: object
      title: FaceClusterMergeConfig
      description: |-
        Configuration for the post-HDBSCAN face-identity merge pass.

        Enables an agglomerative merge after HDBSCAN labels are assigned but
        before centroid calculation. Two clusters merge when the centroid
        cosine meets the cosine threshold AND at least one of the spatial
        signals (bbox IoU on overlapping frames, scene Jaccard) also clears
        its threshold. Defaults target ArcFace 512d face embeddings at the
        brand-corpus scale (~10^4 faces, ~10^2 true identities).
    SourceEnrichmentConfig:
      properties:
        field_mappings:
          items:
            $ref: '#/components/schemas/EnrichmentFieldMapping'
          type: array
          title: Field Mappings
          description: >-
            List of field mappings from cluster results to document fields.
            Default includes cluster_id and cluster_label. Can include:
            distance_to_centroid, member_count, keywords, visualization coords
            (x, y, z), etc.
      type: object
      title: SourceEnrichmentConfig
      description: >-
        Configuration for enriching source collection documents with cluster
        assignments.


        When enrich_source_collection=True, cluster assignments are written back
        to

        the original source documents, similar to taxonomy enrichment.


        Uses flexible field mapping pattern to support any cluster result
        fields.
      examples:
        - field_mappings:
            - source_field: cluster_id
              target_field: category_id
            - source_field: cluster_label
              target_field: category_name
            - source_field: distance_to_centroid
              target_field: category_confidence
        - field_mappings:
            - source_field: cluster_id
              target_field: segment
            - source_field: cluster_label
              target_field: segment_label
            - source_field: x
              target_field: viz_x
            - source_field: 'y'
              target_field: viz_y
    LLMLabeling-Output:
      properties:
        enabled:
          type: boolean
          title: Enabled
          description: >-
            Whether to generate labels for clusters using LLM. When enabled,
            clusters will have semantic labels like 'High-Performance Laptops'
            instead of generic labels like 'Cluster 0'.
          default: false
        labeling_inputs:
          anyOf:
            - $ref: '#/components/schemas/LLMLabelingInput-Output'
            - type: 'null'
          description: >-
            Input configuration for LLM labeling. Supports flexible input
            mappings for multimodal inputs (text, images, videos, audio). Use
            input_mappings for advanced multimodal labeling with providers like
            Gemini. If not provided (null/undefined), the full document payload
            is serialized as JSON and sent to the LLM — WARNING: in practice
            this produces schema-metadata labels (e.g. 'Mid-Timeline Video
            Events' with keywords like ['start_time', 'end_time']) because the
            LLM describes whatever field names it sees. For meaningful labels,
            set input_mappings that point at semantic fields (title,
            description, thumbnail_url, document_blobs[].url for image/video
            content) rather than relying on the default.
        provider:
          anyOf:
            - $ref: '#/components/schemas/LLMProvider'
            - type: 'null'
          description: |-
            LLM provider to use for labeling. Supported providers:
            - openai: GPT models (GPT-4o, GPT-4o-mini, GPT-4.1, O3-mini)
            - google: Gemini models (Gemini 2.5 Flash, Gemini 1.5 Flash)
            - anthropic: Claude models (Claude 3.5 Sonnet, Claude 3.5 Haiku)

            If not specified, automatically inferred from model_name.
          examples:
            - openai
            - google
            - anthropic
        model_name:
          anyOf:
            - $ref: '#/components/schemas/OpenAIModel'
            - $ref: '#/components/schemas/GoogleModel'
            - $ref: '#/components/schemas/AnthropicModel'
            - type: 'null'
          title: Model Name
          description: >-
            REQUIRED when enabled=True. Specific LLM model to use for cluster
            labeling. All models are defined as enums for type safety.


            OpenAI Models (provider='openai'):

            - gpt-4o-2024-08-06: Highest quality, best for production ($2.50/$10
            per 1M tokens)

            - gpt-4o-mini-2024-07-18: Cost-effective, recommended for most use
            cases ($0.15/$0.60 per 1M tokens)

            - gpt-4.1-2025-04-14: Latest model, future-proofed

            - gpt-4.1-mini-2025-04-14: Latest cost-optimized model

            - o3-mini-2025-01-31: Advanced reasoning, best for complex
            clustering


            Google Models (provider='google'):

            - gemini-2.5-flash-lite: Fastest, latest multimodal model,
            recommended ($0.15/$0.60 per 1M tokens)


            Anthropic Models (provider='anthropic'):

            - claude-3-5-sonnet-20241022: Best reasoning, 200K context ($3/$15
            per 1M tokens)

            - claude-3-5-haiku-20241022: Fast, cost-effective ($0.25/$1.25 per
            1M tokens)


            Recommendation:

            - Use gemini-2.5-flash-lite (DEFAULT) - multimodal support

            - Use gpt-4o-mini-2024-07-18 for OpenAI compatibility

            - Use gpt-4o-2024-08-06 for highest quality when cost is not a
            concern
          examples:
            - gemini-2.5-flash-lite
            - gemini-2.5-pro
            - gpt-4o-mini-2024-07-18
            - gpt-4o-2024-08-06
            - claude-sonnet-4-5-20250929
            - claude-haiku-4-5-20251001
            - gpt-4.1-2025-04-14
            - gpt-4.1-mini-2025-04-14
            - o3-mini-2025-01-31
        include_summary:
          type: boolean
          title: Include Summary
          description: Whether to generate cluster summaries
          default: true
        include_keywords:
          type: boolean
          title: Include Keywords
          description: Whether to extract keywords for clusters
          default: true
        max_samples_per_cluster:
          anyOf:
            - type: integer
              maximum: 20
              minimum: 1
            - type: 'null'
          title: Max Samples Per Cluster
          description: >-
            Maximum representative documents to send to LLM per cluster for
            semantic analysis. When null (default), automatically scales based
            on cluster size and spread — smaller/tighter clusters get fewer
            samples, larger/sparser clusters get more (range 3-20). Set
            explicitly to override with a fixed value.
        sample_text_max_length:
          type: integer
          maximum: 500
          minimum: 50
          title: Sample Text Max Length
          description: Maximum characters per document sample text
          default: 200
        use_embedding_dedup:
          type: boolean
          title: Use Embedding Dedup
          description: >-
            Enable embedding-based label deduplication to prevent near-duplicate
            labels (requires sentence-transformers)
          default: true
        embedding_similarity_threshold:
          type: number
          maximum: 1
          minimum: 0.5
          title: Embedding Similarity Threshold
          description: >-
            Cosine similarity threshold for duplicate label detection (labels
            above this are considered duplicates)
          default: 0.8
        cache_ttl_seconds:
          type: integer
          maximum: 2592000
          minimum: 0
          title: Cache Ttl Seconds
          description: >-
            Time-to-live for cached labels in seconds. Labels for clusters with
            identical representative documents will be reused within this TTL
            window, reducing LLM API costs. Default: 604800 (7 days). Set to 0
            to disable caching.
          default: 604800
        custom_prompt:
          anyOf:
            - type: string
            - type: 'null'
          title: Custom Prompt
          description: >-
            OPTIONAL. Custom prompt template for LLM labeling. NOT REQUIRED -
            uses default discriminative prompt if not provided. When provided,
            completely replaces the default prompt. Your custom prompt receives
            cluster information but you must format it yourself. Use when:   -
            Need domain-specific labeling (e.g., medical, legal, technical)   -
            Want different label format (e.g., emoji labels, code names)   -
            Require specific output structure   - Have custom business logic for
            categorization Default prompt includes: cluster document samples,
            forbidden labels for uniqueness, and JSON response format. See
            engine/clusters/labeling/prompts.py for reference. Example: 'Analyze
            these product clusters and generate SHORT category names (2-3 words
            max) focusing on product type and price range. Return JSON:
            [{"cluster_id": "cl_0", "label": "..."}]'
          examples:
            - >-
              Analyze these document clusters and generate technical labels (2-3
              words). Focus on programming languages and frameworks mentioned.
              Return JSON: [{'cluster_id': 'cl_0', 'label': '...', 'keywords':
              [...]}]
            - >-
              Generate emoji-based labels for these clusters. Use 1-2 emojis
              that represent the main theme. Return JSON: [{'cluster_id':
              'cl_0', 'label': '🚀 Tech Innovation'}]
            - null
        response_shape:
          anyOf:
            - type: string
            - additionalProperties: true
              type: object
            - type: 'null'
          title: Response Shape
          description: >
            OPTIONAL. Define custom structured output for LLM labeling. NOT
            REQUIRED - uses default structure (label, summary, keywords) if not
            provided. When provided, LLM output will match this structure and be
            stored in cluster documents. 


            Two modes supported:

            1. Natural language prompt (string): Describe desired output in
            plain English
               - Service automatically infers JSON schema from your description
               - Example: 'Extract cluster category, confidence score (0-1), and top 3 representative terms'
               - Auto-generates schema with appropriate types (string, number, array, etc.)

            2. Explicit JSON schema (dict): Provide complete JSON schema for
            output structure
               - Full control over output structure, types, and constraints
               - Example: {'type': 'object', 'properties': {'category': {'type': 'string'}, ...}}


            Use when:
              - Need custom metadata fields (confidence scores, sentiment, complexity)
              - Want domain-specific structure (taxonomy hierarchies, entity extractions)
              - Require specific data types (arrays, nested objects, enums)
              - Have downstream schema requirements


            Output fields are automatically added to cluster collection schema
            and stored in metadata.

            Default behavior (if not provided): label (string), summary
            (string), keywords (array of strings)
          examples:
            - >-
              Extract cluster category, confidence score between 0 and 1, and
              top 3 representative keywords
            - >-
              Generate cluster theme, sentiment (positive/negative/neutral), and
              list of key entities
            - properties:
                category:
                  description: Main category
                  type: string
                subcategory:
                  description: Subcategory if applicable
                  type: string
                confidence:
                  maximum: 1
                  minimum: 0
                  type: number
                keywords:
                  items:
                    type: string
                  maxItems: 5
                  type: array
              required:
                - category
                - confidence
              type: object
            - null
        parameters:
          additionalProperties: true
          type: object
          title: Parameters
          description: >-
            Provider-specific parameters forwarded to the LLM service. For
            OpenAI: temperature, max_tokens, top_p, json_output, etc. For
            Google: temperature, top_k, max_output_tokens, json_output, etc.
      type: object
      title: LLMLabeling
      description: |-
        Configuration for LLM-based cluster labeling.

        Supports multiple LLM providers with comprehensive model selection:
        - OpenAI: GPT-4o, GPT-4o-mini, GPT-4.1, O3-mini (best for quality)
        - Google: Gemini 2.5 Flash, Gemini 1.5 Flash (best for speed and cost)
        - Anthropic: Claude 3.5 Sonnet, Claude 3.5 Haiku (best for reasoning)

        All models are defined as enums and validated at API level.
      examples:
        - description: Text-only labeling with multiple fields
          enabled: true
          include_keywords: true
          include_summary: true
          labeling_inputs:
            input_mappings:
              - input_key: title
                path: title
                source_type: payload
              - input_key: description
                path: description
                source_type: payload
              - input_key: text
                path: text
                source_type: payload
          model_name: gpt-4o-mini-2024-07-18
          provider: openai
        - description: Multimodal labeling with images (Gemini)
          enabled: true
          include_keywords: true
          include_summary: true
          labeling_inputs:
            input_mappings:
              - input_key: text
                path: headline
                source_type: payload
              - input_key: image_url
                path: thumbnail_url
                source_type: payload
          model_name: gemini-2.5-flash-lite
          provider: google
        - description: Multimodal labeling with video (Gemini)
          enabled: true
          include_keywords: true
          include_summary: true
          labeling_inputs:
            input_mappings:
              - input_key: text
                path: description
                source_type: payload
              - input_key: video_url
                path: video_url
                source_type: payload
          model_name: gemini-2.5-flash-lite
          provider: google
        - description: OpenAI GPT-4o (highest quality, text-only)
          enabled: true
          include_keywords: true
          include_summary: true
          labeling_inputs:
            input_mappings:
              - input_key: text
                path: text
                source_type: payload
          model_name: gpt-4o-2024-08-06
          provider: openai
        - description: Anthropic Claude 3.5 Sonnet (best reasoning)
          enabled: true
          include_keywords: true
          include_summary: true
          labeling_inputs:
            input_mappings:
              - input_key: text
                path: text
                source_type: payload
          model_name: claude-sonnet-4-5-20250929
          provider: anthropic
        - description: 'Minimal configuration (uses defaults: text field from payload)'
          enabled: true
        - custom_prompt: >-
            You are a medical document classifier. Analyze the following patient
            record clusters and generate HIPAA-compliant category labels (2-3
            words) that describe the medical condition or treatment type. DO NOT
            include patient names or identifiers. Return JSON array:
            [{"cluster_id": "cl_0", "label": "...", "keywords": [...]}]
          description: Custom prompt for domain-specific labeling
          enabled: true
          labeling_inputs:
            input_mappings:
              - input_key: text
                path: text
                source_type: payload
          model_name: gpt-4o-mini-2024-07-18
          provider: openai
    TaskStatusEnum:
      type: string
      enum:
        - PENDING
        - QUEUED
        - IN_PROGRESS
        - PROCESSING
        - COMPLETED
        - COMPLETED_WITH_ERRORS
        - FAILED
        - CANCELED
        - INTERRUPTED
        - UNKNOWN
        - SKIPPED
        - DRAFT
        - ACTIVE
        - ARCHIVED
        - SUSPENDED
      title: TaskStatusEnum
      description: |-
        Enumeration of task statuses for tracking asynchronous operations.

        Task statuses indicate the current state of asynchronous operations like
        batch processing, object ingestion, clustering, and taxonomy execution.

        Status Categories:
            Operation Statuses: Track progress of async operations
            Lifecycle Statuses: Track entity state (buckets, collections, namespaces)

        Values:
            PENDING: Task is queued but has not started processing yet
            IN_PROGRESS: Task is currently being executed
            PROCESSING: Task is actively processing data (similar to IN_PROGRESS)
            COMPLETED: Task finished successfully with no errors
            COMPLETED_WITH_ERRORS: Task finished but some items failed (partial success)
            FAILED: Task encountered an error and could not complete
            CANCELED: Task was manually canceled by a user or system
            UNKNOWN: Task status could not be determined
            SKIPPED: Task was intentionally skipped
            DRAFT: Task is in draft state and not yet submitted

            ACTIVE: Entity is active and operational (for buckets, collections, etc.)
            ARCHIVED: Entity has been archived
            SUSPENDED: Entity has been temporarily suspended

        Terminal Statuses:
            COMPLETED, COMPLETED_WITH_ERRORS, FAILED, CANCELED are terminal statuses.
            Once a task reaches these states, it will not transition to another state.

        Partial Success Handling:
            COMPLETED_WITH_ERRORS indicates that the operation completed but some
            documents/items failed. The task result includes:
            - List of successful items
            - List of failed items with error details
            - Success rate percentage
            This allows clients to handle partial success scenarios appropriately.

        Polling Guidance:
            - Poll tasks in PENDING, QUEUED, IN_PROGRESS, or PROCESSING states
            - Stop polling when task reaches COMPLETED, COMPLETED_WITH_ERRORS, FAILED, or CANCELED
            - Use exponential backoff (1s → 30s) when polling
    FailureCategory:
      type: string
      enum:
        - timeout
        - infrastructure
        - orphaned
        - pipeline
        - validation
        - unknown
      title: FailureCategory
      description: |-
        Batch-level failure classification.

        Coarser-grained than ErrorCategory (which classifies individual object
        errors). FailureCategory is set on the batch itself to tell users
        *why the batch as a whole failed* — timeout, infra, orphan, pipeline,
        or unknown. Drives the "Batch failed: <category>" badge in Studio and
        lets callers distinguish retryable infra blips from genuine pipeline
        bugs without parsing human-readable strings.
    ErrorDetail:
      properties:
        message:
          type: string
          title: Message
          description: Human-readable error message
        type:
          type: string
          title: Type
          description: Stable error type identifier (machine-readable)
        code:
          anyOf:
            - type: string
            - type: 'null'
          title: Code
          description: >-
            Fine-grained error code for programmatic handling (e.g.,
            namespace_name_taken, feature_extractor_not_found). Present only
            when consumers may need to branch on a specific error condition.
        details:
          anyOf:
            - additionalProperties: true
              type: object
            - type: 'null'
          title: Details
          description: >-
            Optional structured details to help debugging (validation errors,
            IDs, etc.)
      type: object
      required:
        - message
        - type
      title: ErrorDetail
      description: Error detail model.
    ValidationError:
      properties:
        loc:
          items:
            anyOf:
              - type: string
              - type: integer
          type: array
          title: Location
        msg:
          type: string
          title: Message
        type:
          type: string
          title: Error Type
      type: object
      required:
        - loc
        - msg
        - type
      title: ValidationError
    EnrichmentFieldMapping:
      properties:
        source_field:
          type: string
          title: Source Field
          description: >-
            Field from cluster results to include. Available fields: cluster_id,
            cluster_label, distance_to_centroid, member_count, keywords, x, y, z
            (visualization coords), metadata.*
        target_field:
          type: string
          title: Target Field
          description: >-
            Target field name in enriched document. Example: 'category_id' for
            cluster_id, 'product_category' for cluster_label
      type: object
      required:
        - source_field
        - target_field
      title: EnrichmentFieldMapping
      description: |-
        Maps a cluster result field to a document enrichment field.

        Similar to InputMapping pattern used throughout Mixpeek.
    LLMLabelingInput-Output:
      properties:
        input_mappings:
          items:
            $ref: '#/components/schemas/InputMapping'
          type: array
          minItems: 1
          title: Input Mappings
          description: >-
            Flexible input mappings for constructing LLM context. Supports
            multimodal inputs (text, image_url, video_url, audio_url). Each
            mapping specifies how to extract data from document payloads. At
            least one input mapping is required.
      type: object
      required:
        - input_mappings
      title: LLMLabelingInput
      description: |-
        Input configuration for LLM-based cluster labeling.

        Supports flexible input mappings similar to retrievers and buckets,
        allowing multimodal inputs (text, images, videos, audio) for providers
        like Gemini that support native multimodal understanding.

        Examples:
            # Text-only labeling:
            LLMLabelingInput(input_mappings=[
                InputMapping(input_key="headline", source_type="payload", path="headline"),
                InputMapping(input_key="description", source_type="payload", path="description")
            ])

            # Multimodal labeling with images:
            LLMLabelingInput(input_mappings=[
                InputMapping(input_key="text", source_type="payload", path="headline"),
                InputMapping(input_key="image_url", source_type="payload", path="thumbnail_url")
            ])

            # Multimodal with video (for Gemini):
            LLMLabelingInput(input_mappings=[
                InputMapping(input_key="text", source_type="payload", path="description"),
                InputMapping(input_key="video_url", source_type="payload", path="video_url")
            ])
    LLMProvider:
      type: string
      enum:
        - openai
        - google
        - anthropic
      title: LLMProvider
      description: >-
        Supported LLM providers for content generation.


        Each provider has different strengths, pricing, and multimodal
        capabilities.

        Choose based on your use case, performance requirements, and budget.


        Values:
            OPENAI: OpenAI GPT models (GPT-4o, GPT-4.1, O3-mini)
                - Best for: General purpose, vision tasks, structured outputs
                - Multimodal: Text, images
                - Performance: Fast (100-500ms), reliable
                - Cost: Moderate to high ($0.15-$10 per 1M tokens)
                - Use when: Need high-quality generation with vision support

            GOOGLE: Google Gemini models (Gemini 3.1 Flash Lite, Gemini 2.5 Pro)
                - Best for: Fast generation, video understanding, cost-efficiency
                - Multimodal: Text, images, video, audio, PDFs
                - Performance: Very fast (50-200ms)
                - Cost: Low to moderate ($0.075-$0.40 per 1M tokens)
                - Use when: Need video/audio/PDF support or cost-efficiency

            ANTHROPIC: Anthropic Claude models (Claude 3.5 Sonnet, Claude 3.5 Haiku)
                - Best for: Long context, complex reasoning, safety
                - Multimodal: Text, images
                - Performance: Moderate (200-800ms)
                - Cost: Moderate to high ($0.25-$15 per 1M tokens)
                - Use when: Need long context or complex reasoning

        Examples:
            - Use OPENAI for production with structured JSON outputs
            - Use GOOGLE for video summarization and cost-sensitive workloads
            - Use ANTHROPIC for complex reasoning with long documents
    OpenAIModel:
      type: string
      enum:
        - gpt-4o-2024-08-06
        - gpt-4o-mini-2024-07-18
        - gpt-4.1-2025-04-14
        - gpt-4.1-mini-2025-04-14
        - o3-mini-2025-01-31
      title: OpenAIModel
      description: |-
        OpenAI model identifiers for LLM generation.

        Models listed in order of capability and cost (highest to lowest).
        All models support vision (images) except O3-mini.

        Values:
            GPT_4O: Latest GPT-4 Omni model (2024-08-06)
                - Use for: Production, highest quality generation
                - Context: 128K tokens
                - Vision: Yes
                - Cost: $2.50/1M input, $10/1M output
                - Performance: 200-500ms per request
                - When to use: Need best quality, willing to pay premium

            GPT_41: GPT-4.1 (2025-04-14)
                - Use for: Future-proofed pipelines
                - Context: 128K tokens
                - Vision: Yes
                - Cost: TBD (expected similar to GPT-4o)
                - When to use: Want latest model features

            GPT_4O_MINI: Smaller, faster GPT-4 Omni (2024-07-18)
                - Use for: High-volume, cost-sensitive workloads
                - Context: 128K tokens
                - Vision: Yes
                - Cost: $0.15/1M input, $0.60/1M output
                - Performance: 100-200ms per request
                - When to use: Good balance of quality and cost

            GPT_41_MINI: Smaller GPT-4.1 (2025-04-14)
                - Use for: Future cost-optimized pipelines
                - Context: 128K tokens
                - Vision: Yes
                - Cost: TBD (expected similar to GPT-4o-mini)
                - When to use: Want latest features at lower cost

            O3_MINI: Reasoning-optimized model (2025-01-31)
                - Use for: Complex reasoning, math, code
                - Context: 200K tokens
                - Vision: No
                - Cost: TBD
                - When to use: Need advanced reasoning capabilities

        Examples:
            - Use GPT_4O for caption generation with images (best quality)
            - Use GPT_4O_MINI for high-volume video scene summarization (cost-effective)
            - Use O3_MINI for complex entity extraction requiring reasoning
    GoogleModel:
      type: string
      enum:
        - gemini-2.5-flash-lite
        - gemini-2.5-flash
        - gemini-2.5-pro
        - gemini-3.1-flash-lite
      title: GoogleModel
      description: >-
        Google Gemini model identifiers for LLM generation.


        Gemini models excel at multimodal understanding with best-in-class video
        support.

        All models support text, images, video, audio, and PDFs.


        Values:
            GEMINI_2_5_FLASH_LITE: Gemini 2.5 Flash Lite model (recommended, stable GA)
                - Use for: Fastest generation, cost-effective multimodal
                - Context: 1M tokens
                - Multimodal: Text, images, video, audio, PDFs
                - When to use: Default choice for all Gemini use cases

            GEMINI_2_5_PRO: Gemini 2.5 Pro model
                - Use for: Higher quality reasoning, complex tasks
                - Context: 1M tokens

            GEMINI_2_5_FLASH: Gemini 2.5 Flash model
                - Kept for backward compatibility.

            GEMINI_3_1_FLASH_LITE: Alias for gemini-2.5-flash-lite (backwards compat)
                - Note: gemini-3.1-flash-lite does NOT exist in Google's API.
                  This value is mapped to gemini-2.5-flash-lite at runtime.
    AnthropicModel:
      type: string
      enum:
        - claude-sonnet-4-5-20250929
        - claude-haiku-4-5-20251001
        - claude-3-5-sonnet-20241022
        - claude-3-5-haiku-20241022
      title: AnthropicModel
      description: |-
        Anthropic Claude model identifiers for LLM generation.

        Claude models excel at long context, complex reasoning, and safety.
        All models support text and images.

        Values:
            CLAUDE_3_5_SONNET: Most capable Claude model
                - Use for: Complex reasoning, long documents, safety-critical
                - Context: 200K tokens
                - Vision: Yes
                - Cost: $3/1M input, $15/1M output
                - Performance: 300-800ms per request
                - When to use: Need best reasoning, safety, or long context

            CLAUDE_3_5_HAIKU: Fast, cost-effective Claude model
                - Use for: High-volume, quick summaries
                - Context: 200K tokens
                - Vision: Yes
                - Cost: $0.25/1M input, $1.25/1M output
                - Performance: 100-300ms per request
                - When to use: Good balance of quality and cost

        Examples:
            - Use CLAUDE_3_5_SONNET for complex entity extraction from contracts (best reasoning)
            - Use CLAUDE_3_5_HAIKU for high-volume content moderation (cost-effective)
    InputMapping:
      properties:
        input_key:
          type: string
          title: Input Key
          description: Key used in the constructed inputs payload.
        source_type:
          anyOf:
            - $ref: '#/components/schemas/InputSourceType'
            - type: 'null'
          description: Source of the value (payload, literal, vector).
        path:
          anyOf:
            - type: string
            - type: 'null'
          title: Path
          description: >-
            Dot-notation path inside payload/vector when source_type is PAYLOAD
            or VECTOR.
        override:
          anyOf:
            - {}
            - type: 'null'
          title: Override
          description: Static value used when source_type is LITERAL. Overrides any path.
      type: object
      required:
        - input_key
      title: InputMapping
      description: |-
        Declarative mapping for building inputs from various sources.

        - input_key: The key used in the constructed inputs payload
        - source_type: Where to fetch the value (payload, literal, vector)
        - path: Dot-notation path when source_type is PAYLOAD or VECTOR
        - override: Static value when source_type is LITERAL
      examples:
        - input_key: query_text
          path: content.title
          source_type: payload
        - input_key: lang
          override: en
          source_type: literal
        - input_key: image_vector
          path: features.clip_vit_l_14
          source_type: vector
    InputSourceType:
      type: string
      enum:
        - payload
        - literal
        - vector
        - blob
      title: InputSourceType
      description: Where the value for an input should be retrieved from.

````