Buckets
Schema-backed containers for organizing multimodal objects with automatic validation and lineage tracking
Why do anything?
Organizing multimodal data (videos, images, documents, audio) requires consistent structure and validation before processing. Without buckets, data pipelines break when inputs don't match expected formats.
Why now?
AI applications need reliable data staging areas. Manual validation is error-prone and doesn't scale. Collections need structured inputs to process data correctly.
Why this feature?
Schema-backed buckets validate blob types (text, image, video, audio, JSON) before storage, ensure metadata consistency, and maintain complete lineage from source to processed documents. They work with S3, MinIO, LocalStack, and direct uploads.
How It Works
Buckets are schema-backed containers that organize raw multimodal inputs before processing. They enforce validation, maintain metadata, and provide a staging area for collections to transform objects into searchable documents.
Schema Definition
Define blob properties (text, image, video, audio, JSON) with per-property settings such as required, enum, and description
Object Registration
Upload objects whose blobs match the schema properties, and include metadata for downstream use
Blob Validation
Validate each blob against schema: check type, required fields, and content format
Storage & Indexing
Store blob metadata in MongoDB, save blob content to S3/MinIO/LocalStack, index by key_prefix
Lineage Tracking
Assign an object_id and track root_object_id and root_bucket_id for a complete decomposition history
Collection Integration
Feed validated objects to collections for feature extraction and document creation
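The validation steps above can be sketched locally in plain Python. This is a minimal illustration, not Mixpeek's implementation: the `validate_blobs` helper and its error messages are hypothetical, and the schema and blob shapes are assumed to follow the Integration example (a `properties` mapping with `type`, `required`, and optionally `enum`, and blobs carrying `property`, `type`, and `data`).

```python
from typing import Any


def validate_blobs(schema: dict[str, Any], blobs: list[dict[str, Any]]) -> list[str]:
    """Return a list of validation errors; an empty list means the object is valid.

    Hypothetical helper: the checks and messages are assumptions for illustration.
    """
    properties = schema.get("properties", {})
    errors: list[str] = []
    seen: set[str] = set()

    for blob in blobs:
        name = blob.get("property")
        spec = properties.get(name)
        if spec is None:
            errors.append(f"unknown property: {name!r}")
            continue
        seen.add(name)
        # The declared blob type must equal the type defined in the schema.
        if blob.get("type") != spec["type"]:
            errors.append(
                f"{name}: expected type {spec['type']!r}, got {blob.get('type')!r}"
            )
        # Optional enum rule: the value must be one of the listed choices.
        if "enum" in spec and blob.get("data") not in spec["enum"]:
            errors.append(f"{name}: value not in enum {spec['enum']!r}")

    # Every property marked required must be covered by some blob.
    for name, spec in properties.items():
        if spec.get("required") and name not in seen:
            errors.append(f"missing required property: {name!r}")
    return errors


example_schema = {
    "properties": {
        "product_text": {"type": "text", "required": True},
        "hero_image": {"type": "image", "required": True},
        "spec_sheet": {"type": "json"},
    }
}

# This object omits the required hero_image blob, so one error is reported.
for error in validate_blobs(
    example_schema,
    [{"property": "product_text", "type": "text", "data": "Red sneaker."}],
):
    print(error)
```

Collecting every error rather than stopping at the first makes it easier to fix a rejected upload in one pass.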
Why This Approach
Schema validation at ingestion prevents downstream processing failures. Separation of storage (buckets) from processing (collections) enables reusing the same data across multiple feature extractors without re-uploading. Lineage tracking maintains provenance from raw input to final enriched document.
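To make the lineage fields concrete, here is a small sketch of how a derived record might carry its provenance. The `Record` dataclass, the `register` and `derive` helpers, and the ID formats are hypothetical names chosen for illustration. The field names `object_id`, `root_object_id`, and `root_bucket_id` come from the Lineage Tracking step, but how Mixpeek populates them internally is an assumption here.

```python
from dataclasses import dataclass
from uuid import uuid4


@dataclass(frozen=True)
class Record:
    object_id: str
    root_object_id: str  # the original bucket object this record descends from
    root_bucket_id: str  # the bucket that holds that original object


def register(bucket_id: str) -> Record:
    """Create a root record for a newly uploaded object; it is its own root."""
    object_id = f"obj_{uuid4().hex[:8]}"
    return Record(object_id, root_object_id=object_id, root_bucket_id=bucket_id)


def derive(parent: Record) -> Record:
    """Create a derived record that inherits the parent's root identifiers."""
    return Record(
        object_id=f"doc_{uuid4().hex[:8]}",
        root_object_id=parent.root_object_id,
        root_bucket_id=parent.root_bucket_id,
    )


raw = register("bkt_products")  # e.g. an uploaded video
scene = derive(raw)             # e.g. a scene split from that video
caption = derive(scene)         # e.g. a caption generated for the scene

# However deep the decomposition goes, the root identifiers stay the same.
assert caption.root_object_id == raw.object_id
assert caption.root_bucket_id == "bkt_products"
```

Because the root identifiers are copied rather than recomputed, any enriched document can be traced back to its raw input in a single lookup.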
Integration
from mixpeek import Mixpeek

client = Mixpeek(api_key="YOUR_API_KEY")

# Create bucket with schema
bucket = client.buckets.create(
    bucket_name="product_catalog",
    description="E-commerce product data",
    bucket_schema={
        "properties": {
            "product_text": {
                "type": "text",
                "required": True,
                "description": "Product description"
            },
            "hero_image": {"type": "image", "required": True},
            "spec_sheet": {"type": "json"}
        }
    }
)

# Upload object to bucket
object_response = client.buckets.objects.create(
    bucket_id=bucket.bucket_id,
    key_prefix="/products/red-sneaker",
    metadata={"category": "footwear", "brand": "Acme", "sku": "SNK-001"},
    blobs=[
        {
            "property": "product_text",
            "type": "text",
            "data": "Comfortable red sneaker with foam sole."
        },
        {
            "property": "hero_image",
            "type": "image",
            "data": "https://cdn.example.com/images/red-sneaker.jpg"
        },
        {
            "property": "spec_sheet",
            "type": "json",
            "data": {"size_range": "6-13", "colors": ["red", "blue"], "weight_oz": 12}
        }
    ]
)

# List objects with filters
objects = client.buckets.objects.list(
    bucket_id=bucket.bucket_id,
    filters={
        "AND": [
            {"field": "metadata.category", "operator": "eq", "value": "footwear"}
        ]
    },
    page_size=50
)
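The `filters` argument in the listing call uses a nested boolean structure. As a rough illustration of how such a filter could be read, the sketch below evaluates it against an in-memory dict. The `get_path` and `matches` helpers are hypothetical, only the `AND` group and `eq` operator shown in the example are handled, and the service's actual evaluation may differ.

```python
from typing import Any


def get_path(obj: dict[str, Any], dotted: str) -> Any:
    """Resolve a dotted field path such as 'metadata.category'; None if absent."""
    current: Any = obj
    for part in dotted.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def matches(obj: dict[str, Any], filters: dict[str, Any]) -> bool:
    """Return True if obj satisfies every condition in the filter's AND group."""
    for cond in filters.get("AND", []):
        if cond["operator"] != "eq":
            # Other operators are out of scope for this sketch.
            raise ValueError(f"unsupported operator: {cond['operator']!r}")
        if get_path(obj, cond["field"]) != cond["value"]:
            return False
    return True


footwear_filter = {
    "AND": [{"field": "metadata.category", "operator": "eq", "value": "footwear"}]
}
sneaker = {"metadata": {"category": "footwear", "brand": "Acme", "sku": "SNK-001"}}
lamp = {"metadata": {"category": "lighting"}}

print(matches(sneaker, footwear_filter))
print(matches(lamp, footwear_filter))
```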
