Document intelligence uses the warehouse’s Decompose layer to extract structure, text, and layout from PDFs and scanned documents, then makes them queryable through multi-stage retrieval.
How It Works
When you ingest a document, Mixpeek runs a multi-stage pipeline:
- Content Extraction — Text extraction from native PDFs, OCR fallback for scanned pages
- Hierarchical Chunking — Documents split into pages, sections, or paragraphs with parent-child relationships
- Semantic Extraction — Document type detection, section classification, and metadata inference
- Multi-Vector Embeddings — Separate embeddings for titles, summaries, and full text
- Indexing — Chunks stored with metadata for filtered vector search
At query time, the retriever searches across embeddings and joins results from multiple collections (text, tables, entities) back to their source documents.
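The query-time join described above can be pictured as grouping hits by the document they came from. This is a minimal sketch for illustration, not Mixpeek's implementation; the hit fields `source_document_id`, `collection`, and `score` are hypothetical names.

```python
from collections import defaultdict

def group_hits_by_source(hits):
    """Group hits from several collections under their source document.

    Each hit is a dict with hypothetical keys: source_document_id,
    collection, and score. Documents are ordered by their best hit.
    """
    grouped = defaultdict(list)
    for hit in hits:
        grouped[hit["source_document_id"]].append(hit)
    # Rank documents by their highest-scoring hit, best first.
    return sorted(
        grouped.items(),
        key=lambda item: max(h["score"] for h in item[1]),
        reverse=True,
    )

hits = [
    {"source_document_id": "doc_1", "collection": "contracts-text", "score": 0.82},
    {"source_document_id": "doc_2", "collection": "contracts-text", "score": 0.91},
    {"source_document_id": "doc_1", "collection": "contracts-tables", "score": 0.77},
]
ranked = group_hits_by_source(hits)
```

Here `doc_1` ends up with one text hit and one table hit attached, which is the shape a downstream consumer usually wants when rendering a single result per document.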
| Extractor | Use For |
|---|---|
| pdf_extractor@v1 | Native PDF text, metadata, page chunking |
| document_extractor@v1 | OCR for scanned docs, layout detection |
| table_extractor@v1 | Table detection and cell extraction |
| text_extractor@v1 | Text embeddings, NER, summarization |
1. Create a Bucket
POST /v1/buckets
{
"bucket_name": "contracts",
"schema": {
"properties": {
"document_url": { "type": "url", "required": true },
"document_type": { "type": "text" },
"contract_date": { "type": "datetime" }
}
}
}
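One way to issue this request from Python, using only the standard library, is sketched here. The base URL and the bearer-token header are assumptions, not taken from this page; substitute the values for your account. The helper only builds the request, so nothing is sent until you call `urlopen`.

```python
import json
import urllib.request

# Assumption: placeholder base URL and bearer-token auth scheme.
BASE_URL = "https://api.mixpeek.com"

def build_request(method, path, body, api_key):
    """Build (but do not send) a JSON request for the given endpoint."""
    return urllib.request.Request(
        url=BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        method=method,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

bucket_body = {
    "bucket_name": "contracts",
    "schema": {
        "properties": {
            "document_url": {"type": "url", "required": True},
            "document_type": {"type": "text"},
            "contract_date": {"type": "datetime"},
        }
    },
}
request = build_request("POST", "/v1/buckets", bucket_body, api_key="YOUR_API_KEY")
# To send it: urllib.request.urlopen(request)
```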
2. Create Collections
For text extraction:
POST /v1/collections
{
"collection_name": "contracts-text",
"source": { "type": "bucket", "bucket_id": "bkt_contracts" },
"feature_extractor": {
"feature_extractor_name": "pdf_extractor",
"version": "v1",
"input_mappings": { "document_url": "document_url" },
"parameters": {
"chunk_strategy": "page",
"enable_ocr_fallback": true
},
"field_passthrough": [
{ "source_path": "document_type" },
{ "source_path": "contract_date" }
]
}
}
For tables:
POST /v1/collections
{
"collection_name": "contracts-tables",
"source": { "type": "bucket", "bucket_id": "bkt_contracts" },
"feature_extractor": {
"feature_extractor_name": "table_extractor",
"version": "v1",
"input_mappings": { "document_url": "document_url" },
"parameters": {
"output_format": "json",
"min_confidence": 0.7
}
}
}
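The two collection bodies share most of their structure, so a small builder can keep them consistent. This sketch only reproduces the request shapes shown above; the function name `collection_payload` is hypothetical.

```python
def collection_payload(name, bucket_id, extractor, parameters, passthrough=()):
    """Build a POST /v1/collections body in the shape shown above."""
    feature_extractor = {
        "feature_extractor_name": extractor,
        "version": "v1",
        "input_mappings": {"document_url": "document_url"},
        "parameters": parameters,
    }
    # Only include field_passthrough when there are fields to carry over.
    if passthrough:
        feature_extractor["field_passthrough"] = [
            {"source_path": field} for field in passthrough
        ]
    return {
        "collection_name": name,
        "source": {"type": "bucket", "bucket_id": bucket_id},
        "feature_extractor": feature_extractor,
    }

text_collection = collection_payload(
    "contracts-text", "bkt_contracts", "pdf_extractor",
    {"chunk_strategy": "page", "enable_ocr_fallback": True},
    passthrough=("document_type", "contract_date"),
)
tables_collection = collection_payload(
    "contracts-tables", "bkt_contracts", "table_extractor",
    {"output_format": "json", "min_confidence": 0.7},
)
```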
3. Ingest Documents
POST /v1/buckets/{bucket_id}/objects
{
"key_prefix": "/2025/agreements",
"metadata": {
"document_type": "vendor_agreement",
"contract_date": "2025-01-15T00:00:00Z"
},
"blobs": [
{
"property": "document_url",
"type": "document",
"url": "s3://my-bucket/contracts/vendor-001.pdf"
}
]
}
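When ingesting many files, it can help to generate one object body per PDF. The sketch below mirrors the request shape above and formats the contract date as a UTC ISO 8601 timestamp. The function name and the second S3 URL are hypothetical.

```python
from datetime import datetime, timezone

def object_payload(url, document_type, contract_date, key_prefix):
    """Build a POST /v1/buckets/{bucket_id}/objects body for one PDF."""
    # ISO 8601 in UTC with a trailing "Z", matching the example above.
    timestamp = contract_date.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "key_prefix": key_prefix,
        "metadata": {"document_type": document_type, "contract_date": timestamp},
        "blobs": [{"property": "document_url", "type": "document", "url": url}],
    }

urls = [
    "s3://my-bucket/contracts/vendor-001.pdf",
    "s3://my-bucket/contracts/vendor-002.pdf",  # hypothetical second file
]
signed = datetime(2025, 1, 15, tzinfo=timezone.utc)
payloads = [
    object_payload(url, "vendor_agreement", signed, "/2025/agreements")
    for url in urls
]
```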
4. Process
POST /v1/buckets/{bucket_id}/batches
{ "object_ids": ["obj_001", "obj_002"] }
POST /v1/buckets/{bucket_id}/batches/{batch_id}/submit
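Processing is a two-step flow: create the batch, then submit it. The sketch below wires the two calls together through an injected `send(method, path, body)` callable so the flow can be exercised without a network. It assumes the create response carries a `batch_id` field, which this page does not confirm.

```python
def process_objects(send, bucket_id, object_ids):
    """Create a batch for the given objects, then submit it.

    `send(method, path, body)` performs the HTTP call and returns the
    parsed JSON response.
    """
    batch = send(
        "POST", f"/v1/buckets/{bucket_id}/batches", {"object_ids": list(object_ids)}
    )
    batch_id = batch["batch_id"]  # assumed response field name
    send("POST", f"/v1/buckets/{bucket_id}/batches/{batch_id}/submit", None)
    return batch_id

# Stand-in transport that records calls instead of hitting the network.
calls = []

def fake_send(method, path, body):
    calls.append((method, path, body))
    return {"batch_id": "btch_123"}

batch_id = process_objects(fake_send, "bkt_contracts", ["obj_001", "obj_002"])
```

Swapping `fake_send` for a function that performs real HTTP requests leaves `process_objects` unchanged.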
5. Create a Retriever
POST /v1/retrievers
{
"retriever_name": "contract-search",
"collection_ids": ["col_contracts_text", "col_contracts_tables"],
"input_schema": {
"properties": {
"query": { "type": "text", "required": true },
"document_type": { "type": "text" }
}
},
"stages": [
{
"stage_name": "filter",
"version": "v1",
"parameters": {
"filters": {
"field": "metadata.document_type",
"operator": "eq",
"value": "{{inputs.document_type}}"
}
}
},
{
"stage_name": "knn_search",
"version": "v1",
"parameters": {
"feature_address": "mixpeek://pdf_extractor@v1/text_embedding",
"input_mapping": { "text": "query" },
"limit": 50
}
}
]
}
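The filter stage refers to a retriever input through the `{{inputs.document_type}}` placeholder. How Mixpeek resolves these templates server-side is not described here; the sketch below is only a local preview of what the stage parameters would look like after substitution, and the `render` helper is hypothetical.

```python
import re

# Matches placeholders such as {{inputs.document_type}}.
PLACEHOLDER = re.compile(r"\{\{\s*inputs\.(\w+)\s*\}\}")

def render(value, inputs):
    """Recursively substitute {{inputs.<name>}} placeholders in a JSON-like value."""
    if isinstance(value, dict):
        return {key: render(item, inputs) for key, item in value.items()}
    if isinstance(value, list):
        return [render(item, inputs) for item in value]
    if isinstance(value, str):
        return PLACEHOLDER.sub(lambda m: str(inputs[m.group(1)]), value)
    return value  # numbers, booleans, and None pass through unchanged

filter_parameters = {
    "filters": {
        "field": "metadata.document_type",
        "operator": "eq",
        "value": "{{inputs.document_type}}",
    }
}
rendered = render(filter_parameters, {"document_type": "vendor_agreement"})
```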
6. Query
POST /v1/retrievers/{retriever_id}/execute
{
"inputs": {
"query": "termination clauses with 30-day notice",
"document_type": "vendor_agreement"
},
"limit": 10
}
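Before calling the execute endpoint, a client can check its inputs against the retriever's `input_schema`. This is a client-side sketch with hypothetical helper names; it only checks for required fields and says nothing about validation the API itself may perform.

```python
input_schema = {
    "properties": {
        "query": {"type": "text", "required": True},
        "document_type": {"type": "text"},
    }
}

def missing_required(schema, inputs):
    """Return the names of required inputs that were not supplied."""
    return [
        name
        for name, spec in schema["properties"].items()
        if spec.get("required") and name not in inputs
    ]

def execute_body(schema, inputs, limit=10):
    """Build the execute request body, failing early on missing inputs."""
    missing = missing_required(schema, inputs)
    if missing:
        raise ValueError(f"missing required inputs: {missing}")
    return {"inputs": dict(inputs), "limit": limit}

body = execute_body(
    input_schema,
    {"query": "termination clauses with 30-day notice", "document_type": "vendor_agreement"},
)
```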
Named Entity Recognition
Enable NER to extract entities like dates, amounts, and names:
{
"feature_extractor": {
"feature_extractor_name": "text_extractor",
"version": "v1",
"parameters": {
"enable_ner": true,
"entity_types": ["PERSON", "ORG", "DATE", "MONEY"]
}
}
}
Filter by entity:
{
"filters": {
"field": "metadata.entities.ORG",
"operator": "contains",
"value": "Acme Corp"
}
}
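To reason about which chunks an entity filter would select, it can help to evaluate the filter locally against sample metadata. The sketch assumes that `metadata.entities.ORG` holds a list of strings and that `contains` means membership in that list; both are assumptions about the data model, not documented behavior.

```python
def resolve(doc, dotted_path):
    """Walk a dotted path such as 'metadata.entities.ORG' through nested dicts."""
    current = doc
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current

def matches(doc, flt):
    """Evaluate a single filter dict against a document (assumed semantics)."""
    value = resolve(doc, flt["field"])
    if flt["operator"] == "eq":
        return value == flt["value"]
    if flt["operator"] == "contains":
        return isinstance(value, (list, str)) and flt["value"] in value
    raise ValueError(f"unsupported operator: {flt['operator']}")

entity_filter = {"field": "metadata.entities.ORG", "operator": "contains", "value": "Acme Corp"}
chunk = {"metadata": {"entities": {"ORG": ["Acme Corp", "Globex"], "MONEY": ["$10,000"]}}}
other = {"metadata": {"entities": {"ORG": ["Initech"]}}}
```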
Multi-Page Assembly
Retrieve all pages from a document using lineage:
GET /v1/documents/{document_id}/lineage
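Once the page-level chunks are in hand, reassembling the document is a matter of ordering them. The response shape of the lineage endpoint is not shown on this page, so the sketch assumes each chunk is a dict with hypothetical `page_number` and `text` keys.

```python
def assemble_document(chunks):
    """Join page chunks into one string, ordered by page number."""
    ordered = sorted(chunks, key=lambda chunk: chunk["page_number"])
    return "\n\n".join(chunk["text"] for chunk in ordered)

# Chunks may arrive in any order; page_number and text are assumed field names.
chunks = [
    {"page_number": 2, "text": "Either party may terminate with 30 days notice."},
    {"page_number": 1, "text": "This Vendor Agreement is entered into by the parties."},
]
full_text = assemble_document(chunks)
```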
Classify with Taxonomies
Auto-classify documents by type (contract, invoice, NDA) using a reference collection:
POST /v1/taxonomies
{
"taxonomy_name": "document-classifier",
"taxonomy_type": "flat",
"retriever_id": "ret_contract_search",
"collection_id": "col_contracts_text",
"input_mappings": [{ "source": "payload.content", "target": "query" }],
"enrichment_fields": [{ "source": "payload.document_type", "target": "auto_doc_type" }],
"threshold": 0.7,
"execution_mode": "materialize"
}
New documents are automatically enriched with an auto_doc_type field. See Taxonomies for hierarchical taxonomies and retroactive classification.
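The effect of `threshold` can be illustrated with a small local example. It assumes the threshold is a minimum similarity score and that a document takes the label of its best-scoring reference match; the exact matching logic Mixpeek applies is not specified here.

```python
def classify(matches, threshold=0.7):
    """Return the label of the best match, or None if it scores below threshold.

    `matches` is a list of (label, score) pairs from a similarity search
    against the reference collection.
    """
    if not matches:
        return None
    label, score = max(matches, key=lambda match: match[1])
    return label if score >= threshold else None

confident = classify([("nda", 0.64), ("contract", 0.81)])
uncertain = classify([("invoice", 0.55), ("nda", 0.42)])
```

Under these assumptions, raising the threshold trades coverage for precision: fewer documents receive a label, but the labels that are assigned rest on closer matches.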
Discover Clusters
Find patterns across your document corpus:
POST /v1/clusters
{
"cluster_name": "contract-themes",
"collection_id": "col_contracts_text",
"feature_uri": "mixpeek://pdf_extractor@v1/text_embedding",
"algorithm": { "name": "agglomerative", "params": { "n_clusters": 8 } },
"llm_labeling": {
"enabled": true,
"input_mappings": [{ "source": "payload", "fields": ["document_type", "title"] }]
},
"dimension_reduction": { "method": "umap", "n_components": 2 }
}
Clusters reveal groupings such as “vendor agreements with auto-renewal” and “service contracts with SLA terms”. Promote stable clusters to taxonomy nodes. See Clusters.
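For intuition about the `agglomerative` algorithm named in the request, here is a naive pure-Python version that repeatedly merges the two closest clusters until `n_clusters` remain. It uses average linkage and Euclidean distance as illustrative choices; which linkage and metric Mixpeek uses is not stated on this page, and real workloads would rely on an optimized library.

```python
import math

def agglomerative(points, n_clusters):
    """Merge the two closest clusters (average linkage) until n_clusters remain.

    Returns a list of clusters, each a list of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Average pairwise Euclidean distance between two clusters.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        pairs = [
            (x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        ]
        x, y = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # y > x, so index x is still valid
    return clusters

# Toy 2-D "embeddings": two tight pairs and one outlier.
embeddings = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0), (9.0, 0.0)]
clusters = agglomerative(embeddings, n_clusters=3)
```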
Set Up Alerts
Get notified when new documents match specific criteria:
POST /v1/alerts
{
"alert_name": "new-vendor-contracts",
"collection_id": "col_contracts_text",
"condition": { "field": "metadata.document_type", "operator": "eq", "value": "vendor_agreement" },
"notification": { "type": "webhook", "url": "https://example.com/webhook" }
}
Set Up Webhooks
Monitor document processing and extraction status:
POST /v1/webhooks
{
"webhook_name": "doc-processing",
"url": "https://example.com/webhook",
"events": ["batch.completed", "batch.failed"]
}
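On the receiving side, the webhook endpoint needs to branch on the event type. The sketch below shows one way to do that as a plain function. The payload shape, with `event` and `batch_id` keys, is an assumption; check a real delivery before relying on these field names.

```python
def handle_event(payload):
    """Map a webhook payload to a short status message.

    Assumes the payload carries "event" and "batch_id" keys.
    """
    event = payload.get("event")
    batch_id = payload.get("batch_id")
    if event == "batch.completed":
        return f"batch {batch_id} completed"
    if event == "batch.failed":
        return f"batch {batch_id} failed"
    # Unknown events are acknowledged but otherwise ignored.
    return f"ignored event {event!r}"

completed = handle_event({"event": "batch.completed", "batch_id": "btch_123"})
failed = handle_event({"event": "batch.failed", "batch_id": "btch_124"})
```

Keeping the handler a pure function makes it easy to unit test and to mount inside whichever web framework serves the webhook URL.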