# Get extractor details

> Get detailed information about a specific extractor.

Works for both builtin extractors and custom plugins.

**Parameters:**
- `namespace_id`: Identifier of the namespace the extractor belongs to (path parameter)
- `extractor_id`: Extractor identifier (e.g., 'text_extractor_v1', 'my_custom_plugin_1_0_0')
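
As a rough illustration of how the path parameters fit together, the sketch below builds the request URL for this endpoint. The helper name `extractor_url` and the namespace value `ns_123` are illustrative assumptions, not part of the Mixpeek SDK. Authentication is left out because the spec excerpt on this page does not describe it.

```python
from urllib.parse import quote

# Production server URL as listed in the OpenAPI spec on this page.
BASE_URL = "https://api.mixpeek.com"


def extractor_url(namespace_id: str, extractor_id: str, base_url: str = BASE_URL) -> str:
    """Build the GET URL for one extractor (hypothetical helper).

    Both path parameters are percent-encoded so that characters such as
    '/' cannot change the shape of the path.
    """
    ns = quote(namespace_id, safe="")
    ex = quote(extractor_id, safe="")
    return f"{base_url}/v1/namespaces/{ns}/extractors/{ex}"


print(extractor_url("ns_123", "text_extractor_v1"))
# https://api.mixpeek.com/v1/namespaces/ns_123/extractors/text_extractor_v1
```

The resulting string can be passed to any HTTP client as a `GET` request.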

**Response includes:**
- Full schema information (input, output, parameters)
- Vector index configuration
- For custom plugins: deployment status, validation status
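
The response fields listed above might be consumed along the following lines. This is a minimal sketch, assuming the JSON body has already been decoded into a Python dict. The sample payload is hand-written for illustration (not captured from the API), and both `summarize_extractor` and the `ready` flag it computes are hypothetical conveniences, not something the API defines.

```python
from typing import Any


def summarize_extractor(resp: dict[str, Any]) -> dict[str, Any]:
    """Pick out the fields a caller is likely to inspect first (hypothetical helper)."""
    # required_vector_indexes and capabilities may be null or absent.
    indexes = resp.get("required_vector_indexes") or []
    summary: dict[str, Any] = {
        "id": resp["feature_extractor_id"],
        "source": resp["source"],
        "vector_outputs": [ix["name"] for ix in indexes],
        "realtime": "realtime" in (resp.get("capabilities") or []),
    }
    if resp["source"] == "custom":
        # deployed / validation_status are populated for custom plugins only.
        # Treating "deployed and passed" as ready is this sketch's own assumption.
        summary["ready"] = bool(resp.get("deployed")) and resp.get("validation_status") == "passed"
    return summary


# Illustrative payload covering the required fields plus a few optional ones.
sample = {
    "feature_extractor_name": "my_custom_plugin",
    "version": "1_0_0",
    "feature_extractor_id": "my_custom_plugin_1_0_0",
    "source": "custom",
    "description": "Illustrative custom plugin",
    "input_schema": {},
    "output_schema": {},
    "required_vector_indexes": [
        {
            "name": "embedding",
            "description": "Dense text embeddings for semantic search",
            "type": "single",
            "index": {
                "description": "Dense vector embedding for text content",
                "type": "dense",
                "dimensions": 1024,
            },
        }
    ],
    "capabilities": ["batch"],
    "deployed": True,
    "validation_status": "pending",
}

print(summarize_extractor(sample))
```

For a built-in extractor the `ready` key is simply omitted, since the deployment and validation fields are documented as applying to custom plugins only.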


## OpenAPI

````yaml get /v1/namespaces/{namespace_id}/extractors/{extractor_id}
openapi: 3.1.0
info:
  title: Mixpeek API
  description: >-
    This is the Mixpeek API, providing access to various endpoints for data
    processing and retrieval.
  termsOfService: https://mixpeek.com/terms
  contact:
    name: Mixpeek Support
    url: https://mixpeek.com/contact
    email: info@mixpeek.com
  version: '0.82'
servers:
  - url: https://api.mixpeek.com
    description: Production
security: []
paths:
  /v1/namespaces/{namespace_id}/extractors/{extractor_id}:
    get:
      tags:
        - Namespace Extractors
      summary: Get extractor details
      description: >-
        Get detailed information about a specific extractor.


        Works for both builtin extractors and custom plugins.


        **Parameters:**

        - `extractor_id`: Extractor identifier (e.g., 'text_extractor_v1',
        'my_custom_plugin_1_0_0')


        **Response includes:**

        - Full schema information (input, output, parameters)

        - Vector index configuration

        - For custom plugins: deployment status, validation status
      operationId: get_extractor_v1_namespaces__namespace_id__extractors__extractor_id__get
      parameters:
        - name: namespace_id
          in: path
          required: true
          schema:
            type: string
            title: Namespace Id
        - name: extractor_id
          in: path
          required: true
          schema:
            type: string
            title: Extractor Id
      responses:
        '200':
          description: Extractor details
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/UnifiedExtractorResponse'
        '400':
          description: Bad Request
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
        '401':
          description: Unauthorized
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
        '403':
          description: Forbidden
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
        '404':
          description: Extractor not found
          content:
            application/json:
              example:
                detail: Extractor 'unknown_extractor_v1' not found
        '422':
          description: Validation Error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/HTTPValidationError'
        '500':
          description: Internal Server Error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
components:
  schemas:
    UnifiedExtractorResponse:
      properties:
        feature_extractor_name:
          type: string
          title: Feature Extractor Name
          description: Name of the feature extractor
        version:
          type: string
          title: Version
          description: Version of the feature extractor
        feature_extractor_id:
          type: string
          title: Feature Extractor Id
          description: Unique identifier (name_version)
        source:
          $ref: '#/components/schemas/ExtractorSource'
          description: >-
            Origin of this extractor: 'builtin' (shipped with Mixpeek), 'custom'
            (user-uploaded plugin), or 'community' (marketplace)
        description:
          type: string
          title: Description
          description: Human-readable description
        icon:
          type: string
          title: Icon
          description: Lucide-react icon name for frontend rendering
          default: box
        input_schema:
          additionalProperties: true
          type: object
          title: Input Schema
          description: JSON schema for input data
        output_schema:
          additionalProperties: true
          type: object
          title: Output Schema
          description: JSON schema for output data
        parameter_schema:
          anyOf:
            - additionalProperties: true
              type: object
            - type: 'null'
          title: Parameter Schema
          description: JSON schema for parameters
        type_mode:
          anyOf:
            - type: string
            - type: 'null'
          title: Type Mode
          description: >-
            What input types this extractor can handle: 'type_specific' (only
            one type, e.g. video-only) or 'multimodal' (handles multiple types
            with conditional processing). Type-specific extractors cannot use
            automatic-typed bucket properties.
        expected_input_types:
          anyOf:
            - additionalProperties:
                type: string
              type: object
            - type: 'null'
          title: Expected Input Types
          description: >-
            For type-specific extractors: maps input keys to required types
            (e.g., {'video': 'video', 'thumbnail': 'image'}). For multimodal
            extractors: null.
        inference_type:
          anyOf:
            - type: string
            - type: 'null'
          title: Inference Type
          description: >-
            Kind of real-time inference this extractor provides: 'embedding',
            'rerank', 'classify', 'generate', or 'general'. Determines which
            retriever stages are compatible. Null if the extractor is
            batch-only.
        supported_input_types:
          items:
            type: string
          type: array
          title: Supported Input Types
          description: Supported input types (video, image, text, etc.)
        max_inputs:
          additionalProperties:
            type: integer
          type: object
          title: Max Inputs
          description: Maximum number of inputs per type
        default_parameters:
          additionalProperties: true
          type: object
          title: Default Parameters
          description: Default parameter values
        costs:
          anyOf:
            - $ref: '#/components/schemas/CostsInfo'
            - type: 'null'
          description: Credit cost information (builtin extractors only)
        required_vector_indexes:
          anyOf:
            - items:
                $ref: '#/components/schemas/VectorIndexDefinition'
              type: array
            - type: 'null'
          title: Required Vector Indexes
          description: Vector indexes this extractor produces
        required_payload_indexes:
          anyOf:
            - items:
                $ref: '#/components/schemas/PayloadIndexConfig-Output'
              type: array
            - type: 'null'
          title: Required Payload Indexes
          description: Payload indexes required by this extractor
        position_fields:
          items:
            type: string
          type: array
          title: Position Fields
          description: >-
            Fields that identify unique positions within output documents. Used
            for deterministic document ID generation.
        feature_uri:
          anyOf:
            - type: string
            - type: 'null'
          title: Feature Uri
          description: Primary feature URI (e.g., mixpeek://text_extractor@v1/embedding)
        capabilities:
          items:
            type: string
          type: array
          title: Capabilities
          description: >-
            What this extractor can do: 'batch' (feature extraction during
            ingestion), 'realtime' (query-time inference for retriever stages)
        example_usage:
          anyOf:
            - additionalProperties: true
              type: object
            - type: 'null'
          title: Example Usage
          description: >-
            Minimal working configuration for namespace + collection +
            input_mappings + parameters
        plugin_id:
          anyOf:
            - type: string
            - type: 'null'
          title: Plugin Id
          description: Plugin ID (custom plugins only)
        deployed:
          anyOf:
            - type: boolean
            - type: 'null'
          title: Deployed
          description: Whether the plugin is deployed (custom plugins only)
        validation_status:
          anyOf:
            - type: string
              enum:
                - passed
                - failed
                - pending
            - type: 'null'
          title: Validation Status
          description: Validation status (custom plugins only)
        created_at:
          anyOf:
            - type: string
              format: date-time
            - type: 'null'
          title: Created At
          description: Creation timestamp (custom plugins only)
        updated_at:
          anyOf:
            - type: string
              format: date-time
            - type: 'null'
          title: Updated At
          description: Last update timestamp (custom plugins only)
      type: object
      required:
        - feature_extractor_name
        - version
        - feature_extractor_id
        - source
        - description
        - input_schema
        - output_schema
      title: UnifiedExtractorResponse
      description: |-
        Unified extractor response combining builtin and custom plugins.

        This model provides a consistent view of all extractors available
        to a namespace, regardless of whether they are builtin or custom.
    ErrorResponse:
      properties:
        success:
          type: boolean
          title: Success
          description: Always false for error responses
          default: false
        status:
          type: integer
          title: Status
          description: HTTP status code for this error
        error:
          $ref: '#/components/schemas/ErrorDetail'
          description: Error details payload
      type: object
      required:
        - status
        - error
      title: ErrorResponse
      description: Error response model.
      examples:
        - error:
            details:
              id: ns_123
              resource: namespace
            message: Namespace not found
            type: NotFoundError
          status: 404
          success: false
    HTTPValidationError:
      properties:
        detail:
          items:
            $ref: '#/components/schemas/ValidationError'
          type: array
          title: Detail
      type: object
      title: HTTPValidationError
    ExtractorSource:
      type: string
      enum:
        - builtin
        - custom
        - community
      title: ExtractorSource
      description: |-
        The source/origin of a feature extractor.

        Values:
            BUILTIN: Core extractors shipped with Mixpeek (text, image, multimodal, etc.)
            CUSTOM: User-created extractors uploaded to their namespace (Enterprise only)
            COMMUNITY: Community-contributed extractors from the Mixpeek marketplace

        This field helps API consumers understand:
        - What level of support/maintenance to expect
        - Whether the extractor is available to all users or namespace-specific
        - Licensing and attribution requirements
    CostsInfo:
      properties:
        tier:
          type: integer
          maximum: 4
          minimum: 1
          title: Tier
          description: 'Cost tier (1-4): 1=SIMPLE, 2=MODERATE, 3=COMPLEX, 4=PREMIUM'
        tier_label:
          type: string
          title: Tier Label
          description: Human-readable tier label (SIMPLE, MODERATE, COMPLEX, PREMIUM)
        rates:
          items:
            $ref: '#/components/schemas/CostRate'
          type: array
          title: Rates
          description: >-
            List of cost rates for different input types this extractor
            processes
      type: object
      required:
        - tier
        - tier_label
        - rates
      title: CostsInfo
      description: >-
        Credit cost information for a feature extractor.


        Describes the pricing tier and standardized cost rates for using this
        extractor.

        Rates are defined using CostUnit types that align with extractor input
        types.
    VectorIndexDefinition:
      properties:
        feature_uri:
          anyOf:
            - type: string
            - type: 'null'
          title: Feature Uri
          description: >-
            Full feature URI for this vector index. Format:
            mixpeek://{extractor}@{version}/{output_name}. Populated at
            collection creation time. Use this URI in retriever feature_filter
            stages.
          examples:
            - mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1
            - mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding
        name:
          type: string
          minLength: 1
          title: Name
          description: >-
            REQUIRED. Short user-facing output name used in feature URIs. This
            is NOT the Qdrant collection name - it's the clean identifier for
            this output. Format: Simple snake_case name (e.g., 'embedding',
            'video_embedding', 'sparse_embedding'). Used in feature URIs:
            mixpeek://{extractor}@{version}/{THIS_NAME}. Must be unique within
            this extractor's outputs.
          examples:
            - embedding
            - video_embedding
            - transcription_embedding
            - sparse_embedding
        description:
          type: string
          minLength: 10
          title: Description
          description: >-
            REQUIRED. Human-readable description of this vector output. Explain
            what content this output embeds and when to use it. Appears in API
            documentation and helps users choose the right feature URI. Be
            specific about the embedding type and use cases.
          examples:
            - Vector index for video segment embeddings
            - Dense text embeddings for semantic search
            - Sparse keyword embeddings for explainable retrieval
        type:
          type: string
          enum:
            - single
            - multi
          title: Type
          description: >-
            REQUIRED. Index type - 'single' or 'multi'. 'single': One vector per
            document (most common). Use for standard embeddings. 'multi':
            Multiple named vectors per document (rare). Use for hybrid/ensemble.
            Determines whether 'index' field contains VectorIndex or
            MultiVectorIndex.
          examples:
            - single
            - multi
        index:
          anyOf:
            - $ref: '#/components/schemas/VectorIndex'
            - $ref: '#/components/schemas/MultiVectorIndex'
          title: Index
          description: >-
            REQUIRED. Nested index configuration. VectorIndex if type='single'
            (most common case). MultiVectorIndex if type='multi' (rare, for
            hybrid search). Contains the full storage configuration including
            Qdrant collection name, dimensions, distance metric, and inference
            service.
      type: object
      required:
        - name
        - description
        - type
        - index
      title: VectorIndexDefinition
      description: >-
        Complete vector index definition that can be either single or
        multi-vector.


        This is the USER-FACING representation that appears in feature extractor
        definitions

        and API responses. It wraps a VectorIndex (or MultiVectorIndex) and adds
        metadata.


        Key Concepts - Two-Name System:
            - VectorIndexDefinition.name: SHORT user-facing name (e.g., "embedding")
              Used in feature URIs: mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1
                                                                ^^^^^^^^^^

            - VectorIndex.name: FULL storage name (e.g., "text_extractor_v1_embedding")
              Used as Qdrant collection name for namespace isolation

        This two-level naming allows clean URIs while preventing storage
        collisions.


        Use Cases:
            - Define extractor outputs in feature extractor definitions
            - Expose available vector indexes in collection metadata
            - Enable feature URI resolution (short name → full storage name)

        Requirements:
            - name: REQUIRED - Short output name for feature URIs
            - description: REQUIRED - Explain what this output produces
            - type: REQUIRED - "single" (most common) or "multi" (rare)
            - index: REQUIRED - Nested VectorIndex or MultiVectorIndex
            - feature_uri: OPTIONAL - Populated at collection creation time
      examples:
        - description: Vector index for text embeddings using E5-Large model.
          index:
            datatype: float32
            description: Dense vector embedding for text content
            dimensions: 1024
            distance: cosine
            inference_name: multilingual_e5_large_instruct_v1
            name: text_extractor_v1_embedding
            supported_inputs:
              - text
              - string
            type: dense
          name: embedding
          type: single
        - description: Vector index for video segment embeddings using multimodal model.
          index:
            datatype: float32
            description: Dense vector embeddings for video segments
            dimensions: 1408
            distance: cosine
            inference_name: vertex_multimodal_embedding
            name: multimodal_extractor_v1_video_embedding
            supported_inputs:
              - video
              - image
            type: dense
          name: video_embedding
          type: single
        - description: Hybrid dense + sparse embeddings for enhanced retrieval.
          index:
            description: Combined dense and sparse embeddings
            name: hybrid_extractor_v1_multi
            vectors:
              dense:
                description: Dense semantic embedding
                dimensions: 1024
                distance: cosine
                inference_name: multilingual_e5_large_instruct_v1
                name: hybrid_v1_dense
                type: dense
              sparse:
                description: Sparse keyword embedding
                distance: dot
                inference_name: splade_plus_plus_v1
                name: hybrid_v1_sparse
                type: sparse
          name: hybrid_embedding
          type: multi
    PayloadIndexConfig-Output:
      properties:
        field_name:
          type: string
          maxLength: 255
          minLength: 1
          title: Field Name
          description: >-
            Name of the payload field to index. Must be unique within the
            namespace. Use dot notation for nested fields (e.g.,
            'metadata.title'). Cannot use protected system field names when
            is_protected=False.
          examples:
            - metadata.title
            - user_id
            - tags
        type:
          $ref: '#/components/schemas/PayloadSchemaType'
          description: >-
            Data type of the indexed field. Determines query capabilities and
            storage optimization. TEXT: Full-text search. KEYWORD: Exact
            matching, filtering. INTEGER/FLOAT: Range queries, sorting.
            DATETIME: Temporal queries. GEO: Geospatial queries. BOOL: Boolean
            filtering. UUID: Unique identifier matching.
        field_schema:
          anyOf:
            - $ref: '#/components/schemas/TextIndexParams'
            - $ref: '#/components/schemas/IntegerIndexParams'
            - $ref: '#/components/schemas/KeywordIndexParams'
            - $ref: '#/components/schemas/FloatIndexParams'
            - $ref: '#/components/schemas/GeoIndexParams'
            - $ref: '#/components/schemas/DatetimeIndexParams'
            - $ref: '#/components/schemas/UuidIndexParams'
            - $ref: '#/components/schemas/BoolIndexParams'
            - type: 'null'
          title: Field Schema
          description: >-
            Optional schema configuration for the index. If not provided, uses
            default parameters for the specified type. Different types support
            different parameters (e.g., KeywordIndexParams.is_tenant).
        is_protected:
          type: boolean
          title: Is Protected
          description: >-
            Whether this index is system-managed and cannot be modified by
            users. Protected indexes (is_protected=True) are created
            automatically by Mixpeek and are essential for internal operations
            like tenant isolation, lineage tracking, and document management.
            Users cannot create, modify, or delete protected indexes.
            User-created indexes always have is_protected=False.
          default: false
      type: object
      required:
        - field_name
        - type
      title: PayloadIndexConfig
      description: >-
        Configuration for a payload index.


        Defines the structure and behavior of a payload field index in Qdrant
        collections.

        Payload indexes enable efficient filtering and searching on document
        metadata.


        Protected Indexes:
            System-managed indexes (is_protected=True) cannot be modified or deleted by users.
            These are essential for Mixpeek's internal operations:
            - internal_id: Tenant isolation
            - namespace_id: Namespace scoping
            - collection_id, document_id: Document lineage
            - bucket_id, object_id, root_object_id, root_bucket_id, source_object_id: Object lineage
            - created_at, updated_at: Timestamps

        Use Cases:
            - Create custom metadata indexes for efficient filtering
            - Configure full-text search on text fields
            - Set up geospatial queries on location data
            - Enable range queries on numeric fields

        Requirements:
            - field_name: REQUIRED - Must be unique within the namespace
            - type: REQUIRED - Must match PayloadSchemaType enum
            - field_schema: OPTIONAL - Auto-generated from type if not provided
            - is_protected: OPTIONAL - Defaults to False (user-managed index)
      examples:
        - description: User-created text index for full-text search
          field_name: metadata.description
          is_protected: false
          type: text
        - description: User-created keyword index for exact filtering
          field_name: user_id
          is_protected: false
          type: keyword
        - description: System-managed protected index (created automatically)
          field_name: internal_id
          field_schema:
            is_tenant: true
          is_protected: true
          type: keyword
    ErrorDetail:
      properties:
        message:
          type: string
          title: Message
          description: Human-readable error message
        type:
          type: string
          title: Type
          description: Stable error type identifier (machine-readable)
        code:
          anyOf:
            - type: string
            - type: 'null'
          title: Code
          description: >-
            Fine-grained error code for programmatic handling (e.g.,
            namespace_name_taken, feature_extractor_not_found). Present only
            when consumers may need to branch on a specific error condition.
        details:
          anyOf:
            - additionalProperties: true
              type: object
            - type: 'null'
          title: Details
          description: >-
            Optional structured details to help debugging (validation errors,
            IDs, etc.)
      type: object
      required:
        - message
        - type
      title: ErrorDetail
      description: Error detail model.
    ValidationError:
      properties:
        loc:
          items:
            anyOf:
              - type: string
              - type: integer
          type: array
          title: Location
        msg:
          type: string
          title: Message
        type:
          type: string
          title: Error Type
      type: object
      required:
        - loc
        - msg
        - type
      title: ValidationError
    CostRate:
      properties:
        unit:
          $ref: '#/components/schemas/CostUnit'
          description: The billing unit type
        credits_per_unit:
          type: integer
          minimum: 1
          title: Credits Per Unit
          description: Number of credits charged per unit
        description:
          anyOf:
            - type: string
            - type: 'null'
          title: Description
          description: Human-readable description of what this rate covers
      type: object
      required:
        - unit
        - credits_per_unit
      title: CostRate
      description: |-
        Cost rate for a specific billing unit.

        Defines how many credits are charged per unit of a specific type.

        Example:
            CostRate(
                unit=CostUnit.MINUTE,
                credits_per_unit=200,
                description="Video processing"
            )
            # Means: 200 credits per minute of video
    VectorIndex:
      properties:
        name:
          anyOf:
            - type: string
            - type: 'null'
          title: Name
          description: >-
            OPTIONAL. Qdrant named vector identifier. If not provided,
            auto-derived from inference_service_id using the same conversion as
            inference_name (org/model -> org__model with hyphens as
            underscores). This enables cross-extractor compatibility: extractors
            using the same model will share the same named vector in Qdrant,
            allowing direct vector search across collections without fusion
            logic.
          examples:
            - intfloat__multilingual_e5_large_instruct
            - google__siglip_base_patch16_224
            - jinaai__jina_embeddings_v2_base_code
        description:
          type: string
          minLength: 10
          title: Description
          description: >-
            REQUIRED. Human-readable description of what this vector index
            represents. Explain the content type, use cases, and search
            characteristics. Shown in API documentation and collection metadata.
            Be specific about what embeddings are stored here.
          examples:
            - Dense vector embedding for text content using E5-Large model
            - Video segment embeddings for semantic visual search
            - Sparse keyword expansion embeddings for explainable search
        dimensions:
          anyOf:
            - type: integer
              minimum: 1
            - type: 'null'
          title: Dimensions
          description: >-
            Number of vector dimensions. REQUIRED for DENSE vectors (e.g., 1024
            for E5-Large, 1408 for multimodal). NOT REQUIRED for SPARSE vectors
            (dimensions determined dynamically). Must match the output
            dimensions of the inference service. Cannot be changed after index
            creation without recreating the collection.
          examples:
            - 1024
            - 1408
            - 768
            - 512
        type:
          $ref: '#/components/schemas/VectorType'
          description: >-
            REQUIRED. Vector storage format type. Determines how vectors are
            stored and searched in Qdrant. Use DENSE for traditional embeddings
            (most common), SPARSE for keyword-based models like SPLADE,
            MULTI_DENSE for late-interaction models like ColBERT. Must match the
            output format of your inference service.
          examples:
            - dense
            - sparse
            - multi_dense
        distance:
          anyOf:
            - type: string
            - type: 'null'
          title: Distance
          description: >-
            Distance metric for similarity search. OPTIONAL - defaults to
            'cosine' (normalized dot product). Options: 'cosine' (most common,
            normalized), 'dot' (raw dot product), 'euclidean' (L2 distance),
            'manhattan' (L1 distance). Cosine recommended for most embeddings as
            it's scale-invariant. Must match the metric your model was trained
            with.
          default: cosine
          examples:
            - cosine
            - dot
            - euclidean
        datatype:
          anyOf:
            - $ref: '#/components/schemas/VectorDataType'
            - type: 'null'
          description: >-
            Data type for storing vector values. OPTIONAL - defaults to FLOAT32
            (standard precision). Use FLOAT32 for general use (4 bytes per
            dimension). Use FLOAT16 to save 50% storage with minimal quality
            loss. Use UINT8 for maximum compression (quantization, ~2% quality
            loss). Lower precision = smaller storage + faster search, slightly
            lower accuracy.
          default: float32
          examples:
            - float32
            - float16
            - uint8
        on_disk:
          anyOf:
            - type: boolean
            - type: 'null'
          title: On Disk
          description: >-
            OPTIONAL. If true, vectors are stored on disk instead of in RAM.
            Defaults to true for memory efficiency. Set to false for faster
            search with higher memory usage. Trade-off: on_disk=true saves ~95%
            RAM but search is ~10x slower. Keep the default (true) unless RAM is
            abundant and low latency is critical.
        supported_inputs:
          anyOf:
            - items:
                $ref: '#/components/schemas/BucketSchemaFieldType'
              type: array
            - type: 'null'
          title: Supported Inputs
          description: >-
            OPTIONAL. List of bucket schema field types this vector can process.
            Validates that input fields are compatible with this index.
            Examples: TEXT and STRING for text embeddings, VIDEO and IMAGE for
            multimodal embeddings, DOCUMENT for PDF extractors. Used for
            validation during collection creation.
          examples:
            - - text
              - string
            - - video
              - image
            - - document
        inference_name:
          anyOf:
            - type: string
            - type: 'null'
          title: Inference Name
          description: >-
            DEPRECATED: Use inference_service_id instead. Identifier of the
            inference service to generate embeddings. Must reference a valid
            inference service registered in the system. Examples:
            'multilingual_e5_large_instruct_v1' for text,
            'vertex_multimodal_embedding' for video, 'laion_clip_vit_l_14_v1'
            for images. This determines which model creates the vectors during
            ingestion. Cannot be changed after collection creation.
          examples:
            - multilingual_e5_large_instruct_v1
            - vertex_multimodal_embedding
            - laion_clip_vit_l_14_v1
            - openai_text_embedding_3_small
        inference_service_id:
          anyOf:
            - type: string
            - type: 'null'
          title: Inference Service Id
          description: >-
            RECOMMENDED. Service ID in org/name format (e.g.,
            'intfloat/e5-large'). When set, dimensions and distance are
            automatically derived from the registry. This is the canonical
            identifier for cross-plugin compatibility. Plugins using the same
            service_id can search across each other's vectors. Takes precedence
            over inference_name when both are set.
          examples:
            - intfloat/e5-large
            - google/vertex-multimodal
            - google/siglip
            - jinaai/jina-code-v2
        purpose:
          anyOf:
            - $ref: '#/components/schemas/VectorPurpose'
            - type: 'null'
          description: >-
            RECOMMENDED. Semantic purpose of this vector index. Enables
            pipelines to look up vector configs by purpose (text, code, image)
            without needing to know the specific inference_service_id. This
            provides automatic configuration - the pipeline just says 'give me
            the text vector' and gets the correct column name. If not specified,
            pipeline must use inference_service_id lookup.
          examples:
            - text
            - code
            - image
            - multimodal
        vector_name_override:
          anyOf:
            - type: string
            - type: 'null'
          title: Vector Name Override
          description: >-
            OPTIONAL. Override for Qdrant named vector identifier. When set,
            this value is used as the Qdrant vector name instead of
            auto-deriving from inference_service_id. This enables multiple
            vectors from the same embedding model with different storage names.
            The inference_service_id is still used for cross-extractor
            compatibility checking, but storage uses this custom name. Use case:
            A single extractor producing N vectors (e.g., title_embedding,
            body_embedding) using the same model but needing separate storage.
          examples:
            - title_embedding
            - body_embedding
            - summary_embedding
        supports_multi_query:
          type: boolean
          title: Supports Multi Query
          description: >-
            Whether this vector index supports multi-content queries at
            retrieval time. When True, the feature_search stage accepts
            input_mode='multi_content' — a list of URLs and/or text strings that
            are embedded together in one API call to produce a single query
            vector. Only set for extractors whose underlying model natively
            supports multi-file input (e.g., gemini_multifile_extractor using
            Gemini Embedding 2).
          default: false
      type: object
      required:
        - description
        - type
      title: VectorIndex
      description: >-
        Configuration for a single vector index in Qdrant.


        Defines the fully-qualified vector index including storage name,
        dimensions, distance metric, and inference service. This is the actual
        index that gets created in Qdrant and used for vector similarity search.


        Key Concepts:
            - The `name` field is the FULL qualified name used as the Qdrant collection name
            - Format: {extractor}_{version}_{output} (e.g., "text_extractor_v1_embedding")
            - This ensures namespace isolation between extractors and versions
            - Different from VectorIndexDefinition.name which is the short user-facing name

        Use Cases:
            - Define vector storage configuration for feature extractors
            - Specify inference service and model parameters
            - Configure distance metrics for similarity search
            - Set storage optimization (on-disk for large vectors)

        Requirements:
            - name: REQUIRED - Must be unique across all extractors in namespace
            - description: REQUIRED - Explain what this vector represents
            - dimensions: REQUIRED for DENSE vectors, OPTIONAL for SPARSE
            - type: REQUIRED - Must match VectorType enum
            - inference_service_id: RECOMMENDED - Must reference a valid inference service (inference_name is DEPRECATED)
      examples:
        - datatype: float32
          description: >-
            Dense vector embedding for text content using E5-Large multilingual
            model. Optimized for semantic search across 100+ languages.
          dimensions: 1024
          distance: cosine
          inference_name: multilingual_e5_large_instruct_v1
          name: text_extractor_v1_embedding
          supported_inputs:
            - text
            - string
          type: dense
        - datatype: float32
          description: >-
            Dense vector embeddings for video segments using Google's multimodal
            model. Supports visual semantic search.
          dimensions: 1408
          distance: cosine
          inference_name: vertex_multimodal_embedding
          name: multimodal_extractor_v1_video_embedding
          supported_inputs:
            - video
            - image
          type: dense
        - datatype: float32
          description: >-
            Dense vector embeddings for images using SigLIP model. Supports
            visual semantic search.
          dimensions: 768
          distance: cosine
          inference_name: siglip_base_v1
          name: image_extractor_v1_embedding
          supported_inputs:
            - image
          type: dense
        - description: >-
            Title embedding using E5-Large - same model as body but stored
            separately.
          inference_service_id: intfloat/multilingual-e5-large-instruct
          type: dense
          vector_name_override: title_embedding
        - description: >-
            Body embedding using E5-Large - same model as title but stored
            separately.
          inference_service_id: intfloat/multilingual-e5-large-instruct
          type: dense
          vector_name_override: body_embedding
    MultiVectorIndex:
      properties:
        name:
          type: string
          minLength: 1
          title: Name
          description: >-
            REQUIRED. Fully-qualified name for the multi-vector index. Format:
            {extractor}_{version}_{output} (e.g., 'hybrid_extractor_v1_multi').
            Must be unique across namespace.
          examples:
            - hybrid_extractor_v1_multi
            - ensemble_v1_combined
        description:
          type: string
          minLength: 10
          title: Description
          description: >-
            REQUIRED. Human-readable description of the multi-vector index.
            Explain what vector types are included and their purposes. Describe
            use cases for this multi-vector configuration.
          examples:
            - Hybrid dense + sparse embeddings for enhanced retrieval
            - Multi-model ensemble with BERT + RoBERTa embeddings
        vectors:
          additionalProperties:
            $ref: '#/components/schemas/VectorIndex'
          type: object
          title: Vectors
          description: >-
            REQUIRED. Dictionary mapping vector output names to their
            VectorIndex configurations. Each key is a unique identifier for that
            vector type within this multi-index. Each value is a complete
            VectorIndex with its own dimensions, type, and inference service.
            Example keys: 'dense', 'sparse', 'primary', 'secondary'.
          examples:
            - dense:
                dimensions: 1024
                inference_name: e5_large
                name: hybrid_v1_dense
                type: dense
              sparse:
                inference_name: splade
                name: hybrid_v1_sparse
                type: sparse
      type: object
      required:
        - name
        - description
        - vectors
      title: MultiVectorIndex
      description: >-
        Configuration for multi-vector indexes.


        Allows a single extractor to produce multiple named vector outputs in
        one index.

        Useful for hybrid search combining different embedding types or multiple
        models.


        Use Cases:
            - Hybrid dense + sparse embeddings in one index
            - Multiple models for ensemble retrieval
            - Different granularities (sentence + paragraph embeddings)

        Requirements:
            - name: REQUIRED - Full qualified name for the multi-vector index
            - description: REQUIRED - Explain what vector combinations are included
            - vectors: REQUIRED - Dictionary mapping output names to VectorIndex configs

        Note: Currently less common than single VectorIndex. Most extractors use
        separate VectorIndexDefinitions for each output instead.
      examples:
        - description: >-
            Combined dense and sparse embeddings for hybrid search. Dense
            provides semantic understanding, sparse adds keyword precision.
          name: hybrid_extractor_v1_multi
          vectors:
            dense:
              datatype: float32
              description: Dense semantic embedding
              dimensions: 1024
              distance: cosine
              inference_name: multilingual_e5_large_instruct_v1
              name: hybrid_v1_dense
              type: dense
            sparse:
              datatype: float32
              description: Sparse keyword embedding
              distance: dot
              inference_name: splade_plus_plus_v1
              name: hybrid_v1_sparse
              type: sparse
    PayloadSchemaType:
      type: string
      enum:
        - keyword
        - integer
        - float
        - bool
        - geo
        - datetime
        - text
        - uuid
      title: PayloadSchemaType
      description: Payload schema type.
    TextIndexParams:
      properties:
        type:
          type: string
          title: Type
          default: text
        tokenizer:
          $ref: '#/components/schemas/TokenizerType'
          default: word
        min_token_len:
          type: integer
          title: Min Token Len
          default: 2
        max_token_len:
          type: integer
          title: Max Token Len
          default: 15
        lowercase:
          type: boolean
          title: Lowercase
          default: true
      type: object
      title: TextIndexParams
      description: Configuration for text index.
    IntegerIndexParams:
      properties:
        type:
          type: string
          title: Type
          default: integer
        lookup:
          type: boolean
          title: Lookup
          default: true
        range:
          type: boolean
          title: Range
          default: true
      type: object
      title: IntegerIndexParams
      description: Configuration for integer index.
    KeywordIndexParams:
      properties:
        type:
          type: string
          title: Type
          default: keyword
        is_tenant:
          type: boolean
          title: Is Tenant
          default: false
      type: object
      title: KeywordIndexParams
      description: Configuration for keyword index.
    FloatIndexParams:
      properties:
        type:
          type: string
          title: Type
          default: float
      type: object
      title: FloatIndexParams
      description: Configuration for float index.
    GeoIndexParams:
      properties:
        type:
          type: string
          title: Type
          default: geo
      type: object
      title: GeoIndexParams
      description: Configuration for geo index.
    DatetimeIndexParams:
      properties:
        type:
          type: string
          title: Type
          default: datetime
      type: object
      title: DatetimeIndexParams
      description: Configuration for datetime index.
    UuidIndexParams:
      properties:
        type:
          type: string
          title: Type
          default: uuid
        is_tenant:
          type: boolean
          title: Is Tenant
          default: false
      type: object
      title: UuidIndexParams
      description: Configuration for UUID index.
    BoolIndexParams:
      properties:
        type:
          type: string
          title: Type
          default: bool
      type: object
      title: BoolIndexParams
      description: Configuration for boolean index.
    CostUnit:
      type: string
      enum:
        - minute
        - image
        - 1k_tokens
        - page
        - face
        - extraction
      title: CostUnit
      description: |-
        Standard billing units aligned with extractor input types.

        Each unit represents a measurable quantity that extractors process:
        - MINUTE: Video/audio duration in minutes
        - IMAGE: Per image processed
        - TOKENS_1K: Text tokens in thousands
        - PAGE: Document pages (PDF, etc.)
        - FACE: Detected faces in images/video
        - EXTRACTION: Flat per-operation cost
    VectorType:
      type: string
      enum:
        - dense
        - sparse
        - multi_dense
      title: VectorType
      description: |-
        Vector types supported by the Mixpeek system.

        Defines the storage format and structure of embeddings in Qdrant.

        Values:
            DENSE: Traditional float array embeddings (e.g., [0.1, 0.2, 0.3]).
                   Most common format. Used by: text_extractor, multimodal_extractor, image_extractor.
                   Storage: ~4KB per 1024-dim vector. Fast cosine/dot similarity search.

            SPARSE: Index-value pairs for sparse embeddings (e.g., SPLADE, BM25).
                    Only stores non-zero dimensions. Format: {indices: [1,5,9], values: [0.8,0.6,0.4]}.
                    Storage: ~20KB. Keyword-based semantic search.

            MULTI_DENSE: List of dense vectors for late interaction models (e.g., ColBERT).
                         Each document has multiple embeddings. Format: [[0.1,0.2], [0.3,0.4], ...].
                         Storage: ~500KB. High-precision retrieval.

        Examples:
            - DENSE for general semantic search (text_extractor, multimodal_extractor)
            - SPARSE for keyword expansion and explainability
            - MULTI_DENSE for maximum precision retrieval
    VectorDataType:
      type: string
      enum:
        - float32
        - uint8
      title: VectorDataType
      description: Vector data type.
    BucketSchemaFieldType:
      type: string
      enum:
        - string
        - number
        - integer
        - float
        - boolean
        - object
        - array
        - date
        - datetime
        - text
        - image
        - audio
        - video
        - pdf
        - excel
      title: BucketSchemaFieldType
      description: >-
        Supported data types for bucket schema fields.


        Types fall into two categories:


        1. **Metadata Types** (JSON types):
           - Stored as object metadata
           - Standard JSON-compatible types
           - Not processed by extractors (unless explicitly mapped)
           - Examples: string, number, boolean, date

        2. **File Types** (blobs):
           - Stored as files/blobs
           - Processed by extractors
           - Require file content (URL or base64)
           - Examples: text, image, video, pdf

        **GIF Special Handling**:
            GIF files can be declared as either IMAGE or VIDEO type:

            - As IMAGE: GIF is embedded as a single static image (first frame)
            - As VIDEO: GIF is decomposed frame-by-frame with embeddings per frame

            The multimodal extractor detects GIFs via MIME type (image/gif) and routes
            them based on your schema declaration. Use VIDEO for animated GIFs where
            frame-level search is needed, IMAGE for static/thumbnail use cases.

        NOTE: For retriever input schemas that need to accept document
        references (e.g., "find similar documents"), use
        RetrieverInputSchemaFieldType instead, which includes all bucket types
        plus document_reference.
    VectorPurpose:
      type: string
      enum:
        - text
        - code
        - image
        - multimodal
        - video
        - audio
        - sparse
      title: VectorPurpose
      description: |-
        Semantic purpose of a vector index.

        Used by pipelines to look up vector configs by purpose without
        needing to know the specific inference_service_id.
    TokenizerType:
      type: string
      enum:
        - word
        - whitespace
        - prefix
        - multilingual
      title: TokenizerType
      description: Tokenizer type.

````
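
The `VectorIndex` rules in the schema above can be checked client-side before a config is sent anywhere. The following is a minimal sketch, not part of any Mixpeek SDK: the helper names (`check_vector_index`, `effective_inference`, `dense_vector_bytes`) are hypothetical, and the rule that a dense vector may omit `dimensions` when `inference_service_id` is set is an assumption drawn from the field descriptions and the `title_embedding` / `body_embedding` examples. The byte widths per datatype are the standard sizes of those numeric types.

```python
from typing import Any, Dict, List, Optional

# Values of the VectorType enum in the schema above.
VECTOR_TYPES = {"dense", "sparse", "multi_dense"}

# Standard byte widths per vector component (general fact, not Mixpeek-specific).
BYTES_PER_COMPONENT = {"float32": 4, "float16": 2, "uint8": 1}


def check_vector_index(cfg: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: list problems in a VectorIndex-shaped dict."""
    problems: List[str] = []
    # `description` and `type` are the only entries in the schema's `required` list.
    for field in ("description", "type"):
        if field not in cfg:
            problems.append(f"missing required field: {field}")
    vtype = cfg.get("type")
    if vtype is not None and vtype not in VECTOR_TYPES:
        problems.append(f"unknown vector type: {vtype}")
    # Assumption: a dense vector may omit `dimensions` only when
    # `inference_service_id` is set, since the registry then derives them.
    if vtype == "dense" and "dimensions" not in cfg and not cfg.get("inference_service_id"):
        problems.append("dense vector needs dimensions or inference_service_id")
    return problems


def effective_inference(cfg: Dict[str, Any]) -> Optional[str]:
    """inference_service_id takes precedence over the deprecated inference_name."""
    return cfg.get("inference_service_id") or cfg.get("inference_name")


def dense_vector_bytes(dimensions: int, datatype: str = "float32") -> int:
    """Raw size of one dense vector, ignoring any index overhead."""
    return dimensions * BYTES_PER_COMPONENT[datatype]


if __name__ == "__main__":
    title_cfg = {
        "description": "Title embedding using E5-Large.",
        "type": "dense",
        "inference_service_id": "intfloat/multilingual-e5-large-instruct",
        "vector_name_override": "title_embedding",
    }
    print(check_vector_index(title_cfg))
    print(effective_inference(title_cfg))
    print(dense_vector_bytes(1024, "float32"))
```

The size helper reproduces the "~4KB per 1024-dim vector" figure quoted in the `VectorType` description for `float32`. It covers raw vector data only, so actual storage in Qdrant will differ.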