Buckets are the ingestion layer of the warehouse: raw files land here before the Engine decomposes them into documents and features. Each bucket enforces a JSON schema that describes the blobs you expect to ingest (text, image, audio, video, json, binary).
Create a Bucket
bucket_schema is required: every bucket must include a bucket_schema with at least one property in bucket_schema.properties. Omitting it returns a validation error.
Minimal Working Example
The simplest valid bucket creation — a single text property:
curl -sS -X POST "$MP_API_URL/v1/buckets" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket_name": "my-bucket",
    "bucket_schema": {
      "properties": {
        "content": { "type": "text", "required": true }
      }
    }
  }'
Full Example
curl -sS -X POST "$MP_API_URL/v1/buckets" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket_name": "product-catalog",
    "description": "E-commerce product data",
    "bucket_schema": {
      "properties": {
        "product_text": { "type": "text", "required": true },
        "hero_image": { "type": "image" },
        "spec_sheet": { "type": "json" }
      }
    }
  }'
Response fields:
- bucket_id
- schema with validation metadata
- object_count
- created_at
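For orientation, the response resembles the fragment below. Field names come from the list above; the values and exact shape are illustrative, not guaranteed by the API:

```json
{
  "bucket_id": "bkt_abc123",
  "bucket_name": "product-catalog",
  "schema": { "properties": { "product_text": { "type": "text", "required": true } } },
  "object_count": 0,
  "created_at": "2025-01-01T00:00:00Z"
}
```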
Bucket Schema
- Uses a lightweight JSON schema subset (type, required, enum, description).
- Validates each object’s blobs before storing metadata.
- Helps collections map input fields to feature extractor targets.
Example schema fragment:
{
  "properties": {
    "transcript": {
      "type": "text",
      "description": "Full podcast transcript",
      "required": true
    },
    "audio_file": {
      "type": "audio",
      "required": true
    }
  }
}
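To make the validation rules concrete, here is a minimal client-side sketch of the schema subset described above (required, enum; blob modality is checked server-side). The function name and the blob representation (a plain dict of property name to value) are illustrative, not part of the API:

```python
def validate_blobs(schema: dict, blobs: dict) -> list:
    """Return a list of validation errors; an empty list means valid.

    Mirrors the lightweight bucket_schema subset: `required` properties
    must be present, unknown properties are rejected, and `enum` values
    are checked. Blob modality (text/image/audio/...) is left to the server.
    """
    errors = []
    props = schema.get("properties", {})
    # Required properties must be present in the submitted blobs.
    for name, spec in props.items():
        if spec.get("required") and name not in blobs:
            errors.append("missing required property: %s" % name)
    # Submitted blobs must be declared in the schema; enum values must match.
    for name, value in blobs.items():
        spec = props.get(name)
        if spec is None:
            errors.append("unknown property: %s" % name)
            continue
        if "enum" in spec and value not in spec["enum"]:
            errors.append("%s: value not in enum" % name)
    return errors
```

Running this against the podcast schema above, a payload missing `audio_file` would fail with a single "missing required property" error.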
Manage Buckets
- Get bucket – GET /v1/buckets/{bucket_id}
- List buckets – POST /v1/buckets/list (supports filters, sort, pagination)
- Delete bucket – DELETE /v1/buckets/{bucket_id} (removes objects and blobs)
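The list endpoint accepts filters, sort, and pagination in its request body. The sketch below builds such a body; every field name (`filters`, `sort`, `pagination`, and their sub-keys) is a hypothetical illustration of the pattern, so check the API reference for the actual shape:

```python
def build_list_request(name_contains=None, sort_by="created_at",
                       descending=True, page=1, page_size=20):
    """Assemble a hypothetical POST /v1/buckets/list body.

    Field names are illustrative; only the concepts (filters, sort,
    pagination) come from the docs.
    """
    body = {
        "sort": {"field": sort_by, "descending": descending},
        "pagination": {"page": page, "page_size": page_size},
    }
    if name_contains is not None:
        # Substring match on bucket_name, scoped to the caller's namespace.
        body["filters"] = {"bucket_name": {"contains": name_contains}}
    return body
```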
Buckets are strictly namespace-scoped: the same bucket name can exist in different namespaces without conflict.
Bucket vs Collection
| Aspect | Bucket | Collection |
|---|---|---|
| Purpose | Raw input registry | Processed documents + features |
| Schema | Blob validation | Output schema (deterministic) |
| Storage | MongoDB (metadata) + S3 (blobs) | MongoDB (metadata) + MVS (vectors/payloads) |
| Processing | None | Runs feature extractors via Engine |
Best Practices
- One bucket per data domain (products, support tickets, surveillance footage).
- Keep schemas coarse; collections can slice the data differently downstream.
- Use key_prefix in objects to group files (e.g., /2025/01/).
- Leverage metadata for later filtering (set tags at ingestion time).
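A small sketch of the key_prefix convention from the list above, deriving a `/YYYY/MM/` prefix from an ingestion date (the helper name is illustrative):

```python
from datetime import date

def key_prefix_for(d: date) -> str:
    """Build a /YYYY/MM/ key_prefix so objects group by ingestion month."""
    return "/%04d/%02d/" % (d.year, d.month)
```

Objects ingested in January 2025 would then share the prefix /2025/01/, which keeps month-by-month listing and cleanup cheap.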
Buckets give you a reliable staging area for multimodal data: a clean separation point before you branch into multiple collection-specific processing pipelines.