Document intelligence uses the warehouse’s decompose layer to parse text and layout from PDFs and scanned documents, then makes them queryable through retrieval. Every code block below uses the real API field names.
How It Works
When you ingest a document, the universal_extractor runs a multi-stage pipeline:
Content extraction — text is parsed from native PDFs, with OCR fallback for scanned pages.
Chunking — documents are split into searchable segments.
Embedding — each document is embedded for semantic search.
Indexing — segments are stored with metadata for filtered vector search.
At query time, the retriever runs semantic search over the document embeddings and can filter by metadata such as document type.
Extractor | Use for
universal_extractor@v1 | Parse any file (PDF, image, scanned doc) with OCR fallback, and produce a Gemini embedding for semantic search
text_extractor@v1 | Text embeddings, NER, and summarization over already-extracted text
Mixpeek does not have separate pdf_extractor / table_extractor extractors. The universal_extractor ingests arbitrary documents (PDFs and scans) and produces a searchable embedding.
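The curl commands in the steps below read three shell variables: MP_API_URL, MP_API_KEY, and MP_NAMESPACE. A minimal sketch of a pre-flight check follows; the helper name mp_require_env is hypothetical and not part of any Mixpeek tooling, and the values you export are your own.

```shell
# Hypothetical helper (not part of any Mixpeek tooling): fail early when a
# variable used by the curl commands in this guide is unset or empty.
mp_require_env() {
  missing=0
  for name in MP_API_URL MP_API_KEY MP_NAMESPACE; do
    # Indirect lookup through eval keeps this compatible with POSIX sh.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage: export your own values first, then run the check before step 1.
#   export MP_API_URL=...    # your API base URL, without a trailing slash
#   export MP_API_KEY=...    # your API key
#   export MP_NAMESPACE=...  # your namespace
#   mp_require_env || exit 1
```

Checking up front means an unset variable shows up as one clear message instead of a malformed URL or an authorization failure later on.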
1. Create a bucket
curl -sS -X POST "$MP_API_URL/v1/buckets" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket_name": "contracts",
    "bucket_schema": {
      "properties": {
        "document_url": { "type": "pdf" },
        "document_type": { "type": "string" },
        "contract_date": { "type": "datetime" }
      }
    }
  }'
2. Create a collection
Use universal_extractor to parse each document and produce a searchable embedding. Map the extractor’s content input to your bucket’s document_url field.
curl -sS -X POST "$MP_API_URL/v1/collections" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_name": "contracts-text",
    "source": { "type": "bucket", "bucket_ids": ["bkt_contracts"] },
    "feature_extractor": {
      "feature_extractor_name": "universal_extractor",
      "version": "v1",
      "input_mappings": { "content": "document_url" },
      "field_passthrough": [
        { "source_path": "document_type" },
        { "source_path": "contract_date" }
      ]
    }
  }'
3. Ingest documents
curl -sS -X POST "$MP_API_URL/v1/buckets/bkt_contracts/objects" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "key_prefix": "/2025/agreements",
    "blobs": [
      { "property": "document_url", "type": "pdf", "data": "s3://my-bucket/contracts/vendor-001.pdf" },
      { "property": "document_type", "type": "text", "data": "vendor_agreement" }
    ]
  }'
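To ingest many documents, the request body above can be generated per file rather than written by hand. The sketch below is one way to do that in plain shell; json_escape and build_object_payload are hypothetical helper names, the key prefix is copied from the example above, and the escaping assumes values contain no newlines or other control characters.

```shell
# Escape backslashes and double quotes so a value can sit inside a JSON string.
# Assumption: values contain no newlines or other control characters.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Hypothetical helper: print the object-creation body for one document.
# $1 = document URL, $2 = document type
build_object_payload() {
  url=$(json_escape "$1")
  doc_type=$(json_escape "$2")
  printf '{"key_prefix":"/2025/agreements","blobs":[{"property":"document_url","type":"pdf","data":"%s"},{"property":"document_type","type":"text","data":"%s"}]}\n' "$url" "$doc_type"
}

# Usage: pipe each generated body to curl, which reads it from stdin (-d @-).
#   for uri in s3://my-bucket/contracts/vendor-001.pdf; do
#     build_object_payload "$uri" vendor_agreement |
#       curl -sS -X POST "$MP_API_URL/v1/buckets/bkt_contracts/objects" \
#         -H "Authorization: Bearer $MP_API_KEY" \
#         -H "X-Namespace: $MP_NAMESPACE" \
#         -H "Content-Type: application/json" \
#         -d @-
#   done
```

Building the body in a function keeps the field names in one place, so a change to the bucket schema only needs to be made once.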
4. Process
# Create a batch, then submit it; poll the returned task until COMPLETED
curl -sS -X POST "$MP_API_URL/v1/buckets/bkt_contracts/batches" \
  -H "Authorization: Bearer $MP_API_KEY" -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{ "object_ids": ["obj_001", "obj_002"] }'

curl -sS -X POST "$MP_API_URL/v1/buckets/bkt_contracts/batches/{batch_id}/submit" \
  -H "Authorization: Bearer $MP_API_KEY" -H "X-Namespace: $MP_NAMESPACE"
See Monitoring ingestion for the task-polling loop.
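As a rough illustration of what "poll until COMPLETED" can look like, here is a generic loop that is deliberately independent of the task endpoint, which this page does not show. The helper name poll_until_completed is hypothetical, the status command is supplied by you, and treating FAILED as a terminal state is an assumption; only COMPLETED is named above.

```shell
# Hypothetical helper: run a status command until it prints COMPLETED.
# $1 = name of a command or function that prints the current task status
# $2 = maximum number of attempts (default 30)
# $3 = seconds to sleep between attempts (default 5)
poll_until_completed() {
  status_cmd=$1
  max_attempts=${2:-30}
  delay=${3:-5}
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    status=$($status_cmd)
    case $status in
      COMPLETED) return 0 ;;
      # Assumption: FAILED is a terminal status; adjust to the real values.
      FAILED) echo "task failed" >&2; return 1 ;;
    esac
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  echo "gave up after $max_attempts attempts" >&2
  return 2
}

# Usage: define a function that prints the task status (for example by
# calling the task endpoint described in Monitoring ingestion), then:
#   poll_until_completed my_status_function 60 5
```

The attempt cap keeps a stuck task from blocking a script forever, and the distinct return codes let a caller tell a failure from a timeout.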
5. Create a retriever
curl -sS -X POST "$MP_API_URL/v1/retrievers" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "retriever_name": "contract-search",
    "collection_identifiers": ["contracts-text"],
    "input_schema": {
      "query": { "type": "text", "required": true }
    },
    "stages": [
      {
        "stage_name": "search",
        "stage_type": "filter",
        "config": {
          "stage_id": "feature_search",
          "parameters": {
            "searches": [
              {
                "feature_uri": "mixpeek://universal_extractor@v1/embedding",
                "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
                "top_k": 50
              }
            ],
            "final_top_k": 20
          }
        }
      }
    ]
  }'
6. Query
Filter by document type at execution time with the filters field:
curl -sS -X POST "$MP_API_URL/v1/retrievers/{retriever_id}/execute" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": { "query": "termination clauses with 30-day notice" },
    "filters": {
      "field": "document_type",
      "operator": "eq",
      "value": "vendor_agreement"
    }
  }'
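When the query text and document type come from a script rather than a hand-written body, the execute payload can be assembled as in the sketch below. The helper names are hypothetical, and leaving out the filters field when no document type is given is an assumption about the API rather than something this page confirms.

```shell
# Escape backslashes and double quotes for use inside a JSON string.
# Assumption: inputs contain no newlines or other control characters.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Hypothetical helper: print an execute body for a query.
# $1 = query text, $2 = optional document_type to filter on
build_execute_payload() {
  query=$(json_escape "$1")
  if [ -n "${2:-}" ]; then
    doc_type=$(json_escape "$2")
    printf '{"inputs":{"query":"%s"},"filters":{"field":"document_type","operator":"eq","value":"%s"}}\n' "$query" "$doc_type"
  else
    # Assumption: a request without "filters" searches all document types.
    printf '{"inputs":{"query":"%s"}}\n' "$query"
  fi
}

# Usage: pipe the body to the execute endpoint (-d @- reads from stdin).
#   build_execute_payload "termination clauses with 30-day notice" vendor_agreement |
#     curl -sS -X POST "$MP_API_URL/v1/retrievers/{retriever_id}/execute" \
#       -H "Authorization: Bearer $MP_API_KEY" \
#       -H "X-Namespace: $MP_NAMESPACE" \
#       -H "Content-Type: application/json" \
#       -d @-
```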
Multi-page assembly
Retrieve all segments from a source document using lineage:
curl -sS "$MP_API_URL/v1/documents/{document_id}/lineage" \
  -H "Authorization: Bearer $MP_API_KEY" -H "X-Namespace: $MP_NAMESPACE"
Next steps
Classify documents: Auto-classify documents by type (contract, invoice, NDA) with a taxonomy.
Discover themes: Cluster document embeddings to surface recurring contract patterns.
Get notified: Trigger alerts when new documents match a query.
Add NER & summaries: Pair with text_extractor for named-entity recognition and summaries.