Well-designed schemas balance validation, flexibility, and performance. This guide covers bucket schema patterns, collection field mappings, and evolution strategies to keep your data model clean and scalable.
Bucket Schema Principles
1. Validate Shape, Not Business Logic
Bucket schemas should enforce the shape of registered objects, not replicate downstream processing logic.
Good:
{
"schema": {
"properties": {
"title": { "type": "text", "required": true },
"content": { "type": "text", "required": true },
"category": { "type": "text" },
"published_at": { "type": "datetime" }
}
}
}
Avoid:
{
"schema": {
"properties": {
"title": { "type": "text", "required": true, "min_length": 10, "max_length": 200 },
"content": { "type": "text", "required": true, "must_contain": ["keyword"] },
"category": { "type": "text", "enum": ["tech", "business"] } // Hard to extend
}
}
}
Why: Collections can apply transformations and filters. Bucket schemas should validate data integrity, not business rules.
2. Use Nested Objects for Grouping
Group related fields to improve readability and support partial updates:
{
"schema": {
"properties": {
"content": {
"type": "object",
"properties": {
"title": { "type": "text" },
"body": { "type": "text" },
"summary": { "type": "text" }
}
},
"metadata": {
"type": "object",
"properties": {
"author": { "type": "text" },
"tags": { "type": "array" },
"published_at": { "type": "datetime" }
}
}
}
}
}
3. Arrays for Multi-Valued Fields
Use arrays for fields that naturally have multiple values:
{
"tags": { "type": "array", "items": { "type": "text" } },
"authors": { "type": "array", "items": { "type": "text" } },
"images": {
"type": "array",
"items": {
"type": "object",
"properties": {
"url": { "type": "url" },
"caption": { "type": "text" }
}
}
}
}
4. Separate Mutable and Immutable Fields
Structure schemas to distinguish fields that change from those that remain constant:
{
"immutable": {
"created_at": { "type": "datetime" },
"source_system": { "type": "text" },
"original_filename": { "type": "text" }
},
"mutable": {
"status": { "type": "text" },
"assignee": { "type": "text" },
"priority": { "type": "number" }
}
}
This pattern clarifies which fields can be updated via PATCH operations.
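As an illustration, a client-side guard can refuse writes to the immutable group before the PATCH ever reaches the API. A minimal sketch, assuming a requests-based client and the object-patch endpoint shown later in this guide (the base URL is an assumption, not SDK code):

import requests

MIXPEEK_API = "https://api.mixpeek.com"  # assumed base URL
IMMUTABLE_FIELDS = {"created_at", "source_system", "original_filename"}

def patch_object(api_key: str, bucket_id: str, object_id: str, updates: dict) -> dict:
    """Apply a partial update, rejecting writes to the immutable field group."""
    blocked = IMMUTABLE_FIELDS & updates.keys()
    if blocked:
        raise ValueError(f"refusing to modify immutable fields: {sorted(blocked)}")
    resp = requests.patch(
        f"{MIXPEEK_API}/v1/buckets/{bucket_id}/objects/{object_id}",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"metadata": updates},
    )
    resp.raise_for_status()
    return resp.json()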
Collection Mapping Patterns
1. Explicit Input Mappings
Always specify input_mappings explicitly rather than relying on defaults:
Good:
{
"feature_extractor": {
"feature_extractor_name": "text_extractor",
"input_mappings": {
"text": "content.body" // Clear source path
}
}
}
Avoid:
{
"feature_extractor": {
"feature_extractor_name": "text_extractor"
// Implicitly maps "text" field - fragile if bucket schema changes
}
}
2. Passthrough Only What’s Needed
Use field_passthrough to selectively propagate metadata:
{
"field_passthrough": [
{ "source_path": "metadata.category" },
{ "source_path": "metadata.tags" },
{ "source_path": "metadata.published_at" }
]
}
Don’t pass through:
- Large text blobs (duplicate storage)
- Sensitive fields not needed for retrieval
- Computed fields that can be derived on-demand
3. Namespace Feature Outputs
If multiple extractor configurations produce similar outputs, give each a unique output namespace:
{
"feature_extractor": {
"feature_extractor_name": "text_extractor",
"version": "v1",
"output_namespace": "en" // Produces mixpeek://text_extractor@v1/en/text_embedding
}
}
{
"feature_extractor": {
"feature_extractor_name": "text_extractor",
"version": "v1",
"output_namespace": "es",
"parameters": { "model": "multilingual-e5-large-instruct", "language": "es" }
}
}
This enables language-specific retrievers without collection duplication.
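The namespaced output address follows a mechanical pattern, so a small helper can build it for whichever language a retriever needs. A sketch (pure string construction, grounded in the URI format shown in the comment above):

def feature_uri(extractor: str, version: str, namespace: str, feature: str) -> str:
    """Build the namespaced feature address, e.g. mixpeek://text_extractor@v1/en/text_embedding."""
    return f"mixpeek://{extractor}@{version}/{namespace}/{feature}"

# Point a retriever at the Spanish embedding space
uri = feature_uri("text_extractor", "v1", "es", "text_embedding")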
4. Leverage Chunking Strategies
Match chunking to content type:
| Content Type | Strategy | Rationale |
|---|---|---|
| Blog posts | paragraph | Preserves narrative flow |
| Documentation | sentence | Precise Q&A matching |
| Transcripts | time_window (60s) | Natural speech boundaries |
| Code | function | Semantic units |
{
"parameters": {
"chunk_strategy": "paragraph",
"chunk_size": 512,
"chunk_overlap": 50
}
}
Schema Evolution
Adding Fields (Non-Breaking)
New optional fields are safe:
// Before
{
"schema": {
"properties": {
"title": { "type": "text" }
}
}
}
// After (safe)
{
"schema": {
"properties": {
"title": { "type": "text" },
"subtitle": { "type": "text" } // Optional, non-breaking
}
}
}
Existing objects remain valid; new objects can include subtitle.
Making Fields Required (Breaking)
Requires migration:
// Step 1: Add field as optional
{ "description": { "type": "text" } }
// Step 2: Backfill existing objects
PATCH /v1/buckets/{bucket_id}/objects/{object_id}
{ "metadata": { "description": "Default description" } }
// Step 3: Make required
{ "description": { "type": "text", "required": true } }
Changing Field Types (Breaking)
Create a new field instead of mutating:
// Before
{ "price": { "type": "text" } } // "19.99"
// Migration (add new field)
{ "price_numeric": { "type": "number" } } // 19.99
// Deprecate old field
{ "price": { "type": "text", "deprecated": true } }
Versioning Collections
For major schema changes, create a new collection:
POST /v1/collections
{
"collection_name": "products-v2",
"source": { "type": "bucket", "bucket_id": "bkt_products" },
"feature_extractor": {
// Updated mappings and extractors
}
}
Migrate documents:
- Keep products-v1 read-only
- Process new batches into products-v2
- Update retrievers to query both collections during transition (see the sketch below)
- Archive products-v1 after migration
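A sketch of the transition-period query, with search_collection standing in as a hypothetical wrapper around your retriever execution:

def search_during_migration(query: str, top_k: int = 10) -> list:
    """Query v1 and v2 side by side and merge by score while migration is in flight."""
    # search_collection is a hypothetical helper wrapping your retriever call.
    old_hits = search_collection("products-v1", query, top_k)
    new_hits = search_collection("products-v2", query, top_k)
    merged = {hit["document_id"]: hit for hit in old_hits}
    merged.update({hit["document_id"]: hit for hit in new_hits})  # prefer v2 on collisions
    return sorted(merged.values(), key=lambda h: h["score"], reverse=True)[:top_k]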
Common Anti-Patterns
❌ Storing Derived Data
Problem:
{
"metadata": {
"content": "Sample text...",
"word_count": 150, // Computed from content
"embedding": [0.1, 0.2, ...] // Computed by extractor
}
}
Solution: Store only source data in buckets; let extractors compute derived values.
❌ Inconsistent Naming Conventions
Problem:
{
"CreatedDate": "...", // PascalCase
"updated_at": "...", // snake_case
"PublishTime": "..." // Mixed
}
Solution: Enforce consistent naming (prefer snake_case for compatibility).
❌ Overusing Nested Objects
Problem:
{
"data": {
"content": {
"main": {
"text": {
"body": "..." // 5 levels deep
}
}
}
}
}
Solution: Flatten to 2-3 levels max for readability and query simplicity.
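If you inherit a deeply nested payload, flattening it to dotted paths is mechanical. A self-contained sketch:

def flatten(obj: dict, parent: str = "", sep: str = ".") -> dict:
    """Collapse nested objects into dotted keys, e.g. {"content": {"body": ...}} becomes {"content.body": ...}."""
    out = {}
    for key, value in obj.items():
        path = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            out.update(flatten(value, path, sep))
        else:
            out[path] = value
    return out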
❌ Missing Timestamps
Problem: No created_at or updated_at fields.
Solution: Always include audit timestamps:
{
"created_at": { "type": "datetime", "required": true },
"updated_at": { "type": "datetime" }
}
❌ Hardcoding Enum Values
Problem:
{
"status": { "type": "text", "enum": ["draft", "published"] }
}
Adding "archived" later requires a schema migration.
Solution: Use flexible text field + application-level validation or taxonomy enrichment.
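For example, keeping the allowed values in application code makes adding "archived" a one-line change instead of a schema migration. A sketch with an illustrative allowlist:

# Allowed statuses live in application code (or a taxonomy), not the bucket schema.
ALLOWED_STATUSES = {"draft", "published", "archived"}

def validate_status(metadata: dict) -> None:
    """Application-level check for a flexible text field."""
    status = metadata.get("status")
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"unknown status {status!r}; expected one of {sorted(ALLOWED_STATUSES)}")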
Document Schema Validation
Collections support document-level schema validation using JSON Schema (draft-07). This validates the user-defined fields on every document create, update, and patch operation.
Setting Up Document Schema Validation
Configure a JSON Schema on your collection along with a validation mode:
from mixpeek import Mixpeek
client = Mixpeek(api_key="your-api-key")
# Update an existing collection with a document schema
client.collections.update(
collection_identifier="col_abc123",
document_schema={
"type": "object",
"properties": {
"title": {"type": "string"},
"category": {"type": "string", "enum": ["tech", "business", "science"]},
"priority": {"type": "integer", "minimum": 1, "maximum": 5}
},
"required": ["title", "category"]
},
schema_validation="strict"
)
Validation Modes
| Mode | Behavior | Use Case |
|---|---|---|
| off | No validation (default) | Development, unstructured data |
| warn | Accept document, attach _schema_violations | Migration periods, gradual rollout |
| strict | Reject document with 422 error | Production, enforced data quality |
Start with warn mode to discover schema violations in existing data before switching to strict.
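In warn mode, failing documents are still written but carry the _schema_violations annotation, so you can scan for them before flipping the switch. A sketch over documents you have already fetched:

def find_violations(documents: list) -> list:
    """Return (document_id, violations) pairs for documents annotated in warn mode."""
    return [
        (doc["document_id"], doc["_schema_violations"])
        for doc in documents
        if doc.get("_schema_violations")
    ]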
Dry-Run Validation
Test a document against the collection schema without creating it:
POST /v1/collections/{collection_id}/documents/schema/validate
{
"title": "My Article",
"priority": 10
}
Response:
{
"valid": false,
"violations": [
"(root): 'category' is a required property",
"priority: 10 is greater than the maximum of 5"
],
"schema_validation_mode": "strict"
}
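The same check is easy to script against the endpoint above. A minimal requests-based sketch (the base URL is an assumption):

import requests

def dry_run_validate(api_key: str, collection_id: str, document: dict) -> dict:
    """Validate a document against the collection schema without persisting it."""
    resp = requests.post(
        f"https://api.mixpeek.com/v1/collections/{collection_id}/documents/schema/validate",
        headers={"Authorization": f"Bearer {api_key}"},
        json=document,
    )
    resp.raise_for_status()
    return resp.json()

result = dry_run_validate("your-api-key", "col_abc123", {"title": "My Article", "priority": 10})
if not result["valid"]:
    print(result["violations"])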
What Gets Validated
Document schema validation only checks user-defined fields — system fields like document_id, collection_id, features, metadata, and _internal are automatically excluded. This means your schema only needs to describe your application’s data shape.
Validation applies to the full merged document on updates. If you PATCH a single field, the entire document (existing + patched fields) is validated against the schema.
Validation Best Practices
Use Required Fields Sparingly
Mark fields required only if absolutely necessary for downstream processing:
{
"title": { "type": "text", "required": true }, // Extractors need this
"tags": { "type": "array" } // Optional, but useful
}
Validate Externally
For complex validation (e.g., “URL must be from allowed domains”), validate in your application before calling Mixpeek.
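For instance, a domain allowlist check can run just before ingestion. A sketch with an illustrative allowlist; the url field name matches the images example earlier:

from urllib.parse import urlparse

ALLOWED_DOMAINS = {"cdn.example.com", "assets.example.com"}  # illustrative allowlist

def check_url_domain(doc: dict) -> None:
    """Enforce a rule JSON Schema can't express before sending the document to Mixpeek."""
    host = urlparse(doc.get("url", "")).hostname
    if host not in ALLOWED_DOMAINS:
        raise ValueError(f"URL host {host!r} is not in the allowed domains")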
Enable Schema Linting
Check schemas before deployment:
# Validate schema before creating bucket
POST /v1/buckets/validate
{
"schema": { ... }
}
Multi-Collection Strategies
Separate by Modality
Create distinct collections per feature type:
- products-text → text embeddings
- products-images → visual embeddings
- products-metadata → structured data only
Query multiple collections via retrievers for cross-modal search.
Separate by Language
For multilingual content:
- docs-en → English embeddings
- docs-es → Spanish embeddings
- docs-fr → French embeddings
Use retriever stages to route queries by detected language.
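A routing sketch, with detect_language standing in for a hypothetical language-identification helper (e.g. a langdetect wrapper):

LANGUAGE_COLLECTIONS = {"en": "docs-en", "es": "docs-es", "fr": "docs-fr"}

def collection_for_query(query: str, default: str = "docs-en") -> str:
    """Pick the language-specific collection for a query."""
    lang = detect_language(query)  # hypothetical helper; swap in your detector
    return LANGUAGE_COLLECTIONS.get(lang, default)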
Separate by Lifecycle
For content with different retention policies:
- logs-hot → last 7 days (fast storage)
- logs-warm → 8-30 days (slower storage)
- logs-cold → 30+ days (archive)
Checklist
Design bucket schema
- Include required source fields only
- Add audit timestamps (created_at, updated_at)
- Group related fields in nested objects
- Use arrays for multi-valued fields
Define collection mappings
- Explicit input_mappings for all extractors
- Selective field_passthrough (no large blobs)
- Choose appropriate chunk_strategy
- Namespace outputs if extracting multiple times
Plan for evolution
- Add optional fields for new requirements
- Version collections for breaking changes
- Migrate data with backfill scripts
- Deprecate old fields gracefully
Validate and test
- Lint schemas before deployment
- Test with representative sample data
- Monitor __fully_enriched rates
- Review document payloads in Qdrant
Next Steps