    Data Extraction

    Buckets

    Schema-backed containers for organizing multimodal objects with automatic validation and lineage tracking

    Why do anything?

    Organizing multimodal data (videos, images, documents, audio) requires consistent structure and validation before processing. Without buckets, data pipelines break when inputs don't match expected formats.

    Why now?

    AI applications need reliable data staging areas. Manual validation is error-prone and doesn't scale. Collections need structured inputs to process data correctly.

    Why this feature?

    Schema-backed buckets validate blob types (text, image, video, audio, JSON) before storage, ensure metadata consistency, and maintain complete lineage from source to processed documents. They work with S3, MinIO, LocalStack, and direct uploads.

    How It Works

    Buckets are schema-backed containers that organize raw multimodal inputs before processing. They enforce validation, maintain metadata, and provide a staging area for collections to transform objects into searchable documents.

    1

    Schema Definition

    Define blob properties (text, image, video, audio, JSON) with validation rules (required, enum, description)

    2

    Object Registration

    Upload objects whose blobs match the schema properties, along with metadata for downstream use

    3

    Blob Validation

    Validate each blob against schema: check type, required fields, and content format

    4

    Storage & Indexing

    Store blob metadata in MongoDB, save blob content to S3/MinIO/LocalStack, index by key_prefix

    5

    Lineage Tracking

    Assign object_id, track root_object_id and root_bucket_id for complete decomposition history

    6

    Collection Integration

    Feed validated objects to collections for feature extraction and document creation
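    Step 3 above can be sketched as a standalone check. This is an illustrative sketch only, not Mixpeek's actual validator; the `validate_blobs` helper is hypothetical, but the schema and blob shapes mirror the Integration example below.

```python
# Illustrative sketch of schema-driven blob validation (hypothetical helper,
# not the Mixpeek implementation): check types and required properties.

ALLOWED_TYPES = {"text", "image", "video", "audio", "json"}

def validate_blobs(bucket_schema: dict, blobs: list) -> list:
    """Return a list of validation errors; an empty list means the object passes."""
    errors = []
    props = bucket_schema["properties"]
    seen = {b["property"] for b in blobs}

    # Every required property must be present.
    for name, spec in props.items():
        if spec.get("required") and name not in seen:
            errors.append("missing required blob: " + name)

    # Every blob must target a known property with the declared type.
    for blob in blobs:
        spec = props.get(blob["property"])
        if spec is None:
            errors.append("unknown property: " + blob["property"])
        elif blob["type"] not in ALLOWED_TYPES:
            errors.append("unsupported type: " + blob["type"])
        elif blob["type"] != spec["type"]:
            errors.append("type mismatch for " + blob["property"])
    return errors
```

    An object is only stored and indexed when this kind of check returns no errors, which is what keeps malformed inputs out of downstream collections.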

    Why This Approach

    Schema validation at ingestion prevents downstream processing failures. Separation of storage (buckets) from processing (collections) enables reusing the same data across multiple feature extractors without re-uploading. Lineage tracking maintains provenance from raw input to final enriched document.
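    The lineage model can be illustrated with a minimal sketch. The field names `object_id`, `root_object_id`, and `root_bucket_id` come from the steps above; the `derive_object` helper itself is hypothetical, showing how root pointers could propagate when an object is decomposed.

```python
# Illustrative lineage propagation (hypothetical helper; field names are
# from the docs). Root pointers always refer to the original upload, so
# any derived document traces back to its raw source in one hop.
import uuid

def derive_object(parent: dict) -> dict:
    """Create a child object that inherits lineage from its parent."""
    return {
        "object_id": str(uuid.uuid4()),
        "root_object_id": parent.get("root_object_id") or parent["object_id"],
        "root_bucket_id": parent.get("root_bucket_id") or parent["bucket_id"],
    }
```

    Because the root pointers are copied rather than re-derived, a grandchild produced by two rounds of decomposition still points at the original object and bucket.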

    Integration

    from mixpeek import Mixpeek

    client = Mixpeek(api_key="YOUR_API_KEY")

    # Create bucket with schema
    bucket = client.buckets.create(
        bucket_name="product_catalog",
        description="E-commerce product data",
        bucket_schema={
            "properties": {
                "product_text": {
                    "type": "text",
                    "required": True,
                    "description": "Product description"
                },
                "hero_image": {
                    "type": "image",
                    "required": True
                },
                "spec_sheet": {
                    "type": "json"
                }
            }
        }
    )

    # Upload object to bucket
    object_response = client.buckets.objects.create(
        bucket_id=bucket.bucket_id,
        key_prefix="/products/red-sneaker",
        metadata={
            "category": "footwear",
            "brand": "Acme",
            "sku": "SNK-001"
        },
        blobs=[
            {
                "property": "product_text",
                "type": "text",
                "data": "Comfortable red sneaker with foam sole."
            },
            {
                "property": "hero_image",
                "type": "image",
                "data": "https://cdn.example.com/images/red-sneaker.jpg"
            },
            {
                "property": "spec_sheet",
                "type": "json",
                "data": {
                    "size_range": "6-13",
                    "colors": ["red", "blue"],
                    "weight_oz": 12
                }
            }
        ]
    )

    # List objects with filters
    objects = client.buckets.objects.list(
        bucket_id=bucket.bucket_id,
        filters={
            "AND": [
                {"field": "metadata.category", "operator": "eq", "value": "footwear"}
            ]
        },
        page_size=50
    )
