    Data Extraction

    Buckets

    Schema-backed containers for organizing multimodal objects with automatic validation and lineage tracking

    Why do anything?

    Organizing multimodal data (videos, images, documents, audio) requires consistent structure and validation before processing. Without buckets, data pipelines break when inputs don't match expected formats.

    Why now?

    AI applications need reliable data staging areas. Manual validation is error-prone and doesn't scale. Collections need structured inputs to process data correctly.

    Why this feature?

    Schema-backed buckets validate blob types (text, image, video, audio, JSON) before storage, ensure metadata consistency, and maintain complete lineage from source to processed documents. They work with S3, MinIO, LocalStack, and direct uploads.

    How It Works

    Buckets are schema-backed containers that organize raw multimodal inputs before processing. They enforce validation, maintain metadata, and provide a staging area for collections to transform objects into searchable documents.

    1

    Schema Definition

    Define blob properties (text, image, video, audio, JSON) with validation rules (required, enum, description)

    2

    Object Registration

    Upload objects whose blobs match the schema properties, along with metadata for downstream use

    3

    Blob Validation

    Validate each blob against schema: check type, required fields, and content format

    4

    Storage & Indexing

    Store blob metadata in MongoDB, save blob content to S3/MinIO/LocalStack, index by key_prefix

    5

    Lineage Tracking

    Assign object_id, track root_object_id and root_bucket_id for complete decomposition history

    6

    Collection Integration

    Feed validated objects to collections for feature extraction and document creation
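    Step 3 above can be sketched as a standalone check. This is an illustrative sketch only, not Mixpeek's actual validator; the `validate_blobs` helper is hypothetical, but the schema and blob shapes mirror the Integration example below.

```python
# Illustrative sketch of schema-driven blob validation (hypothetical helper,
# not the Mixpeek implementation): check types and required properties.

ALLOWED_TYPES = {"text", "image", "video", "audio", "json"}

def validate_blobs(bucket_schema: dict, blobs: list) -> list:
    """Return a list of validation errors; an empty list means the object passes."""
    errors = []
    props = bucket_schema["properties"]
    seen = {b["property"] for b in blobs}

    # Every required property must be present.
    for name, spec in props.items():
        if spec.get("required") and name not in seen:
            errors.append("missing required blob: " + name)

    # Every blob must target a known property with the declared type.
    for blob in blobs:
        spec = props.get(blob["property"])
        if spec is None:
            errors.append("unknown property: " + blob["property"])
        elif blob["type"] not in ALLOWED_TYPES:
            errors.append("unsupported type: " + blob["type"])
        elif blob["type"] != spec["type"]:
            errors.append("type mismatch for " + blob["property"])
    return errors
```

    An object is only stored and indexed when this kind of check returns no errors, which is what keeps malformed inputs out of downstream collections.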

    Why This Approach

    Schema validation at ingestion prevents downstream processing failures. Separation of storage (buckets) from processing (collections) enables reusing the same data across multiple feature extractors without re-uploading. Lineage tracking maintains provenance from raw input to final enriched document.
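    The lineage model can be illustrated with a minimal sketch. The field names `object_id`, `root_object_id`, and `root_bucket_id` come from the steps above; the `derive_object` helper itself is hypothetical, showing how root pointers could propagate when an object is decomposed.

```python
# Illustrative lineage propagation (hypothetical helper; field names are
# from the docs). Root pointers always refer to the original upload, so
# any derived document traces back to its raw source in one hop.
import uuid

def derive_object(parent: dict) -> dict:
    """Create a child object that inherits lineage from its parent."""
    return {
        "object_id": str(uuid.uuid4()),
        "root_object_id": parent.get("root_object_id") or parent["object_id"],
        "root_bucket_id": parent.get("root_bucket_id") or parent["bucket_id"],
    }
```

    Because the root pointers are copied rather than re-derived, a grandchild produced by two rounds of decomposition still points at the original object and bucket.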

    Integration

    from mixpeek import Mixpeek

    client = Mixpeek(api_key="YOUR_API_KEY")

    # Create bucket with schema
    bucket = client.buckets.create(
        bucket_name="product_catalog",
        description="E-commerce product data",
        bucket_schema={
            "properties": {
                "product_text": {
                    "type": "text",
                    "required": True,
                    "description": "Product description"
                },
                "hero_image": {
                    "type": "image",
                    "required": True
                },
                "spec_sheet": {
                    "type": "json"
                }
            }
        }
    )

    # Upload object to bucket
    object_response = client.buckets.objects.create(
        bucket_id=bucket.bucket_id,
        key_prefix="/products/red-sneaker",
        metadata={
            "category": "footwear",
            "brand": "Acme",
            "sku": "SNK-001"
        },
        blobs=[
            {
                "property": "product_text",
                "type": "text",
                "data": "Comfortable red sneaker with foam sole."
            },
            {
                "property": "hero_image",
                "type": "image",
                "data": "https://cdn.example.com/images/red-sneaker.jpg"
            },
            {
                "property": "spec_sheet",
                "type": "json",
                "data": {
                    "size_range": "6-13",
                    "colors": ["red", "blue"],
                    "weight_oz": 12
                }
            }
        ]
    )

    # List objects with filters
    objects = client.buckets.objects.list(
        bucket_id=bucket.bucket_id,
        filters={
            "AND": [
                {"field": "metadata.category", "operator": "eq", "value": "footwear"}
            ]
        },
        page_size=50
    )
