Document intelligence uses the warehouse’s Decompose layer to extract structure, text, and layout from PDFs and scanned documents, then makes them queryable through multi-stage retrieval.
Document Intelligence Pipeline

How It Works

When you ingest a document, Mixpeek runs a multi-stage pipeline:
  1. Content Extraction — Text extraction from native PDFs, OCR fallback for scanned pages
  2. Hierarchical Chunking — Documents split into pages, sections, or paragraphs with parent-child relationships
  3. Semantic Extraction — Document type detection, section classification, and metadata inference
  4. Multi-Vector Embeddings — Separate embeddings for titles, summaries, and full text
  5. Indexing — Chunks stored with metadata for filtered vector search
At query time, the retriever searches across embeddings and joins results from multiple collections (text, tables, entities) back to their source documents.
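One way to picture the join step is as grouping hits from several collections under the document they came from. The sketch below is only an illustration: the hit fields (`source_object_id`, `collection`, `score`) are hypothetical and are not taken from the API's response format.

```python
from collections import defaultdict


def group_hits_by_source(hits):
    """Group hits from several collections under their source document.

    Each hit is a dict with hypothetical keys: source_object_id,
    collection, and score (higher is better).
    """
    grouped = defaultdict(list)
    for hit in hits:
        grouped[hit["source_object_id"]].append(hit)
    # Rank documents by their best-scoring hit, best first.
    return sorted(
        grouped.items(),
        key=lambda item: max(h["score"] for h in item[1]),
        reverse=True,
    )


hits = [
    {"source_object_id": "obj_001", "collection": "contracts-text", "score": 0.82},
    {"source_object_id": "obj_002", "collection": "contracts-text", "score": 0.91},
    {"source_object_id": "obj_001", "collection": "contracts-tables", "score": 0.77},
]
ranked = group_hits_by_source(hits)
```

Here `ranked` lists `obj_002` first (best hit 0.91), followed by `obj_001` with both its text hit and its table hit attached.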

Feature Extractors

Extractor | Use For
--- | ---
pdf_extractor@v1 | Native PDF text, metadata, page chunking
document_extractor@v1 | OCR for scanned docs, layout detection
table_extractor@v1 | Table detection and cell extraction
text_extractor@v1 | Text embeddings, NER, summarization

1. Create a Bucket

POST /v1/buckets
{
  "bucket_name": "contracts",
  "schema": {
    "properties": {
      "document_url": { "type": "url", "required": true },
      "document_type": { "type": "text" },
      "contract_date": { "type": "datetime" }
    }
  }
}
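If you are calling the API from Python, the request above could be assembled roughly as follows. This is a minimal sketch using only the standard library; the base URL and the bearer-token header are assumptions, so substitute whatever host and auth scheme your account uses. The request is built but not sent.

```python
import json
import urllib.request

BASE_URL = "https://api.mixpeek.com"  # assumption: replace with your API host


def build_create_bucket_request(api_key, bucket_name, properties):
    """Build (but do not send) the POST /v1/buckets request."""
    body = {"bucket_name": bucket_name, "schema": {"properties": properties}}
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/buckets",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token auth; check your account's auth scheme.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_create_bucket_request(
    api_key="YOUR_API_KEY",
    bucket_name="contracts",
    properties={
        "document_url": {"type": "url", "required": True},
        "document_type": {"type": "text"},
        "contract_date": {"type": "datetime"},
    },
)
# To send it: urllib.request.urlopen(request)
```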

2. Create Collections

For text extraction:
POST /v1/collections
{
  "collection_name": "contracts-text",
  "source": { "type": "bucket", "bucket_id": "bkt_contracts" },
  "feature_extractor": {
    "feature_extractor_name": "pdf_extractor",
    "version": "v1",
    "input_mappings": { "document_url": "document_url" },
    "parameters": {
      "chunk_strategy": "page",
      "enable_ocr_fallback": true
    },
    "field_passthrough": [
      { "source_path": "document_type" },
      { "source_path": "contract_date" }
    ]
  }
}
For tables:
POST /v1/collections
{
  "collection_name": "contracts-tables",
  "source": { "type": "bucket", "bucket_id": "bkt_contracts" },
  "feature_extractor": {
    "feature_extractor_name": "table_extractor",
    "version": "v1",
    "input_mappings": { "document_url": "document_url" },
    "parameters": {
      "output_format": "json",
      "min_confidence": 0.7
    }
  }
}
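The two collection bodies differ only in name, extractor, parameters, and passthrough fields, so a small helper can generate both. The helper name `collection_payload` is hypothetical; the keys it emits are the ones shown in the requests above.

```python
def collection_payload(name, bucket_id, extractor, parameters, passthrough=()):
    """Return a POST /v1/collections body for one extractor."""
    feature_extractor = {
        "feature_extractor_name": extractor,
        "version": "v1",
        "input_mappings": {"document_url": "document_url"},
        "parameters": dict(parameters),
    }
    # Only include field_passthrough when there is something to pass through.
    if passthrough:
        feature_extractor["field_passthrough"] = [
            {"source_path": field} for field in passthrough
        ]
    return {
        "collection_name": name,
        "source": {"type": "bucket", "bucket_id": bucket_id},
        "feature_extractor": feature_extractor,
    }


text_collection = collection_payload(
    "contracts-text",
    "bkt_contracts",
    "pdf_extractor",
    {"chunk_strategy": "page", "enable_ocr_fallback": True},
    passthrough=("document_type", "contract_date"),
)
tables_collection = collection_payload(
    "contracts-tables",
    "bkt_contracts",
    "table_extractor",
    {"output_format": "json", "min_confidence": 0.7},
)
```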

3. Ingest Documents

POST /v1/buckets/{bucket_id}/objects
{
  "key_prefix": "/2025/agreements",
  "metadata": {
    "document_type": "vendor_agreement",
    "contract_date": "2025-01-15T00:00:00Z"
  },
  "blobs": [
    {
      "property": "document_url",
      "type": "document",
      "url": "s3://my-bucket/contracts/vendor-001.pdf"
    }
  ]
}
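When ingesting many files, it can help to generate one object body per URL. The sketch below assumes you already have the document URLs and a timezone-aware contract date for each; the helper name `object_payload` and the second URL are illustrative.

```python
from datetime import datetime, timezone


def object_payload(url, document_type, contract_date, key_prefix="/2025/agreements"):
    """Return a POST /v1/buckets/{bucket_id}/objects body for one document.

    contract_date must be timezone-aware; it is normalized to UTC.
    """
    timestamp = contract_date.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "key_prefix": key_prefix,
        "metadata": {"document_type": document_type, "contract_date": timestamp},
        "blobs": [{"property": "document_url", "type": "document", "url": url}],
    }


urls = [
    "s3://my-bucket/contracts/vendor-001.pdf",
    "s3://my-bucket/contracts/vendor-002.pdf",  # illustrative second file
]
signed_on = datetime(2025, 1, 15, tzinfo=timezone.utc)
payloads = [object_payload(url, "vendor_agreement", signed_on) for url in urls]
```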

4. Process

Create a batch from the ingested object IDs, then submit it for processing:
POST /v1/buckets/{bucket_id}/batches
{ "object_ids": ["obj_001", "obj_002"] }

POST /v1/buckets/{bucket_id}/batches/{batch_id}/submit
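For larger ingests you may want to split object IDs across several batches. The sketch below shows one way to do that; the `max_per_batch` value is a hypothetical choice, since this page does not state a batch size limit.

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def batch_bodies(object_ids, max_per_batch):
    """Return one batch-creation body per chunk of object IDs."""
    return [{"object_ids": chunk} for chunk in chunked(object_ids, max_per_batch)]


def submit_path(bucket_id, batch_id):
    """Path of the submit endpoint for a batch that was already created."""
    return f"/v1/buckets/{bucket_id}/batches/{batch_id}/submit"


bodies = batch_bodies(["obj_001", "obj_002", "obj_003"], max_per_batch=2)
```

Each body is posted to the batches endpoint; the batch ID returned by that call is then used to build the submit path.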

5. Create a Retriever

POST /v1/retrievers
{
  "retriever_name": "contract-search",
  "collection_ids": ["col_contracts_text", "col_contracts_tables"],
  "input_schema": {
    "properties": {
      "query": { "type": "text", "required": true },
      "document_type": { "type": "text" }
    }
  },
  "stages": [
    {
      "stage_name": "filter",
      "version": "v1",
      "parameters": {
        "filters": {
          "field": "metadata.document_type",
          "operator": "eq",
          "value": "{{inputs.document_type}}"
        }
      }
    },
    {
      "stage_name": "knn_search",
      "version": "v1",
      "parameters": {
        "feature_address": "mixpeek://pdf_extractor@v1/text_embedding",
        "input_mapping": { "text": "query" },
        "limit": 50
      }
    }
  ]
}
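The `{{inputs.document_type}}` placeholder in the filter stage is filled from the inputs supplied at execution time. The service performs this substitution itself; the sketch below is only a client-side illustration of the idea, and the `render` function is hypothetical.

```python
import re

_TEMPLATE = re.compile(r"\{\{\s*inputs\.(\w+)\s*\}\}")


def render(value, inputs):
    """Recursively replace {{inputs.name}} placeholders in a stage config."""
    if isinstance(value, str):
        # A missing input raises KeyError rather than silently passing through.
        return _TEMPLATE.sub(lambda m: str(inputs[m.group(1)]), value)
    if isinstance(value, dict):
        return {key: render(item, inputs) for key, item in value.items()}
    if isinstance(value, list):
        return [render(item, inputs) for item in value]
    return value


filter_stage = {
    "stage_name": "filter",
    "version": "v1",
    "parameters": {
        "filters": {
            "field": "metadata.document_type",
            "operator": "eq",
            "value": "{{inputs.document_type}}",
        }
    },
}
resolved = render(filter_stage, {"document_type": "vendor_agreement"})
```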

6. Query

POST /v1/retrievers/{retriever_id}/execute
{
  "inputs": {
    "query": "termination clauses with 30-day notice",
    "document_type": "vendor_agreement"
  },
  "limit": 10
}

Named Entity Recognition

Enable NER to extract entities like dates, amounts, and names:
{
  "feature_extractor": {
    "feature_extractor_name": "text_extractor",
    "version": "v1",
    "parameters": {
      "enable_ner": true,
      "entity_types": ["PERSON", "ORG", "DATE", "MONEY"]
    }
  }
}
Filter by entity:
{
  "filters": {
    "field": "metadata.entities.ORG",
    "operator": "contains",
    "value": "Acme Corp"
  }
}
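To reason about what a filter like this selects, it can help to evaluate it locally against a sample document. The sketch below assumes that `contains` means list membership (or substring match on strings) and that `field` is a dotted path into the document; both are assumptions about the semantics, not a statement of how the service evaluates filters.

```python
def get_path(doc, dotted):
    """Follow a dotted path such as 'metadata.entities.ORG'; None if absent."""
    current = doc
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def matches(doc, flt):
    """Evaluate a single {field, operator, value} filter against a document."""
    value = get_path(doc, flt["field"])
    if flt["operator"] == "eq":
        return value == flt["value"]
    if flt["operator"] == "contains":
        return value is not None and flt["value"] in value
    raise ValueError(f"unsupported operator: {flt['operator']}")


sample = {"metadata": {"entities": {"ORG": ["Acme Corp", "Globex"]}}}
entity_filter = {
    "field": "metadata.entities.ORG",
    "operator": "contains",
    "value": "Acme Corp",
}
is_match = matches(sample, entity_filter)
```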

Multi-Page Assembly

Retrieve all pages from a document using lineage:
GET /v1/documents/{document_id}/lineage
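Once you have the page-level chunks for a document, reassembly is a sort and a join. The sketch below assumes each chunk exposes `page_number` and `text` fields; those names are hypothetical, so adapt them to the fields your lineage response actually contains.

```python
def assemble_pages(chunks):
    """Concatenate page chunks in page order, separated by blank lines."""
    ordered = sorted(chunks, key=lambda chunk: chunk["page_number"])
    return "\n\n".join(chunk["text"] for chunk in ordered)


chunks = [
    {"page_number": 2, "text": "Either party may terminate with 30 days notice."},
    {"page_number": 1, "text": "This Vendor Agreement is entered into by the parties."},
]
full_text = assemble_pages(chunks)
```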

Classify with Taxonomies

Auto-classify documents by type (contract, invoice, NDA) using a reference collection:
POST /v1/taxonomies
{
  "taxonomy_name": "document-classifier",
  "taxonomy_type": "flat",
  "retriever_id": "ret_contract_search",
  "collection_id": "col_contracts_text",
  "input_mappings": [{ "source": "payload.content", "target": "query" }],
  "enrichment_fields": [{ "source": "payload.document_type", "target": "auto_doc_type" }],
  "threshold": 0.7,
  "execution_mode": "materialize"
}
New documents are automatically enriched with an auto_doc_type field. See Taxonomies for hierarchical taxonomies and retroactive classification.
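Conceptually, the threshold decides whether the best reference match is strong enough to copy its label onto the new document. The sketch below illustrates that rule only; the match shape (`score`, `payload`) and the inclusive comparison are assumptions, not the service's documented behavior.

```python
def enrich(document, match, threshold=0.7):
    """Return the document with auto_doc_type set if the match clears the threshold."""
    if match is None or match["score"] < threshold:
        return document
    # Mirrors the enrichment_fields mapping: payload.document_type -> auto_doc_type.
    return {**document, "auto_doc_type": match["payload"]["document_type"]}


new_doc = {"content": "This Non-Disclosure Agreement is made between the parties."}
strong = {"score": 0.84, "payload": {"document_type": "nda"}}
weak = {"score": 0.41, "payload": {"document_type": "invoice"}}

classified = enrich(new_doc, strong)
unclassified = enrich(new_doc, weak)
```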

Discover Clusters

Find patterns across your document corpus:
POST /v1/clusters
{
  "cluster_name": "contract-themes",
  "collection_id": "col_contracts_text",
  "feature_uri": "mixpeek://pdf_extractor@v1/text_embedding",
  "algorithm": { "name": "agglomerative", "params": { "n_clusters": 8 } },
  "llm_labeling": {
    "enabled": true,
    "input_mappings": [{ "source": "payload", "fields": ["document_type", "title"] }]
  },
  "dimension_reduction": { "method": "umap", "n_components": 2 }
}
Clusters reveal groupings such as “vendor agreements with auto-renewal” or “service contracts with SLA terms”. Promote stable clusters to taxonomy nodes. See Clusters.

Set Up Alerts

Get notified when new documents match specific criteria:
POST /v1/alerts
{
  "alert_name": "new-vendor-contracts",
  "collection_id": "col_contracts_text",
  "condition": { "field": "metadata.document_type", "operator": "eq", "value": "vendor_agreement" },
  "notification": { "type": "webhook", "url": "https://example.com/webhook" }
}

Set Up Webhooks

Monitor document processing and extraction status:
POST /v1/webhooks
{
  "webhook_name": "doc-processing",
  "url": "https://example.com/webhook",
  "events": ["batch.completed", "batch.failed"]
}
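On the receiving side, your endpoint needs to accept a POST and branch on the event name. The sketch below uses only the standard library and assumes the payload carries the event name under `event` and a `batch_id` field; the real payload shape may differ, so treat those keys as placeholders.

```python
import json
from http.server import BaseHTTPRequestHandler


def handle_event(payload):
    """Return a short action label for a webhook payload.

    Assumption: the event name is under "event" and the batch
    identifier under "batch_id".
    """
    event = payload.get("event")
    batch_id = payload.get("batch_id", "unknown")
    if event == "batch.completed":
        return f"index-ready:{batch_id}"
    if event == "batch.failed":
        return f"retry:{batch_id}"
    return "ignored"


class WebhookHandler(BaseHTTPRequestHandler):
    """Minimal receiver: parse the JSON body, dispatch it, reply 204."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        handle_event(payload)
        self.send_response(204)
        self.end_headers()


# To serve locally:
#   from http.server import HTTPServer
#   HTTPServer(("", 8000), WebhookHandler).serve_forever()
```

Keeping the dispatch logic in a plain function makes it testable without starting a server.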