Document Intelligence

Document intelligence uses the warehouse’s Decompose layer to extract structure, text, and layout from PDFs and scanned documents, then makes them queryable through multi-stage retrieval.

How It Works

When you ingest a document, Mixpeek runs a multi-stage pipeline:

Content Extraction — Text extraction from native PDFs, OCR fallback for scanned pages
Hierarchical Chunking — Documents split into pages, sections, or paragraphs with parent-child relationships
Semantic Extraction — Document type detection, section classification, and metadata inference
Multi-Vector Embeddings — Separate embeddings for titles, summaries, and full text
Indexing — Chunks stored with metadata for filtered vector search

At query time, the retriever searches across embeddings and joins results from multiple collections (text, tables, entities) back to their source documents.
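
To make the hierarchical chunking step more concrete, here is a minimal sketch of splitting a document into page and paragraph chunks linked by parent IDs. It is an illustration only, not Mixpeek's implementation; the `Chunk` class, the `chunk_document` function, and the ID format are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    chunk_id: str
    level: str  # "document", "page", or "paragraph"
    text: str
    parent_id: Optional[str] = None


def chunk_document(doc_id: str, pages: List[str]) -> List[Chunk]:
    """Split page texts into page and paragraph chunks with parent links."""
    chunks = [Chunk(doc_id, "document", "")]
    for p, page_text in enumerate(pages, start=1):
        page_id = f"{doc_id}/p{p}"
        chunks.append(Chunk(page_id, "page", page_text, parent_id=doc_id))
        # Paragraphs are separated by blank lines; empty pieces are skipped.
        paragraphs = [s.strip() for s in page_text.split("\n\n") if s.strip()]
        for i, para in enumerate(paragraphs, start=1):
            chunks.append(
                Chunk(f"{page_id}/para{i}", "paragraph", para, parent_id=page_id)
            )
    return chunks


chunks = chunk_document("doc_1", ["Term.\n\nTermination.", "Fees."])
for c in chunks:
    print(c.level, c.chunk_id, "->", c.parent_id)
```

Keeping a `parent_id` on every chunk is what lets a paragraph-level hit be traced back to its page and source document.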

Feature Extractors

Extractor	Use For
`pdf_extractor@v1`	Native PDF text, metadata, page chunking
`document_extractor@v1`	OCR for scanned docs, layout detection
`table_extractor@v1`	Table detection and cell extraction
`text_extractor@v1`	Text embeddings, NER, summarization

1. Create a Bucket

POST /v1/buckets
{
  "bucket_name": "contracts",
  "schema": {
    "properties": {
      "document_url": { "type": "url", "required": true },
      "document_type": { "type": "text" },
      "contract_date": { "type": "datetime" }
    }
  }
}
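
The request above could be prepared from Python as in this sketch. It only builds the request object and does not send it. The base URL, the `Authorization: Bearer` header, and the `build_request` helper are assumptions for illustration; check them against your account's API settings.

```python
import json
import urllib.request

BASE_URL = "https://api.mixpeek.com"  # assumption: replace with your API host


def build_request(path: str, body: dict, api_key: str) -> urllib.request.Request:
    """Prepare a JSON POST request without sending it."""
    return urllib.request.Request(
        url=BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumption: bearer-token auth
            "Content-Type": "application/json",
        },
        method="POST",
    )


bucket_body = {
    "bucket_name": "contracts",
    "schema": {
        "properties": {
            "document_url": {"type": "url", "required": True},
            "document_type": {"type": "text"},
            "contract_date": {"type": "datetime"},
        }
    },
}

req = build_request("/v1/buckets", bucket_body, api_key="YOUR_API_KEY")
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req)
```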

2. Create Collections

For text extraction:

POST /v1/collections
{
  "collection_name": "contracts-text",
  "source": { "type": "bucket", "bucket_id": "bkt_contracts" },
  "feature_extractor": {
    "feature_extractor_name": "pdf_extractor",
    "version": "v1",
    "input_mappings": { "document_url": "document_url" },
    "parameters": {
      "chunk_strategy": "page",
      "enable_ocr_fallback": true
    },
    "field_passthrough": [
      { "source_path": "document_type" },
      { "source_path": "contract_date" }
    ]
  }
}

For tables:

POST /v1/collections
{
  "collection_name": "contracts-tables",
  "source": { "type": "bucket", "bucket_id": "bkt_contracts" },
  "feature_extractor": {
    "feature_extractor_name": "table_extractor",
    "version": "v1",
    "input_mappings": { "document_url": "document_url" },
    "parameters": {
      "output_format": "json",
      "min_confidence": 0.7
    }
  }
}
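
Because the two collection bodies differ only in name, extractor, parameters, and passthrough fields, they can be generated from one helper. The sketch below uses a hypothetical convenience function (`collection_body` is not part of any SDK) that reproduces the JSON shown above.

```python
import json
from typing import List, Optional


def collection_body(
    name: str,
    extractor: str,
    parameters: dict,
    passthrough: Optional[List[str]] = None,
    bucket_id: str = "bkt_contracts",
) -> dict:
    """Build a collection-creation body matching the examples above."""
    feature_extractor = {
        "feature_extractor_name": extractor,
        "version": "v1",
        "input_mappings": {"document_url": "document_url"},
        "parameters": parameters,
    }
    if passthrough:
        # Each passthrough field becomes a {"source_path": ...} entry.
        feature_extractor["field_passthrough"] = [
            {"source_path": p} for p in passthrough
        ]
    return {
        "collection_name": name,
        "source": {"type": "bucket", "bucket_id": bucket_id},
        "feature_extractor": feature_extractor,
    }


text_collection = collection_body(
    "contracts-text",
    "pdf_extractor",
    {"chunk_strategy": "page", "enable_ocr_fallback": True},
    passthrough=["document_type", "contract_date"],
)
tables_collection = collection_body(
    "contracts-tables",
    "table_extractor",
    {"output_format": "json", "min_confidence": 0.7},
)
print(json.dumps(tables_collection, indent=2))
```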

3. Ingest Documents

POST /v1/buckets/{bucket_id}/objects
{
  "key_prefix": "/2025/agreements",
  "metadata": {
    "document_type": "vendor_agreement",
    "contract_date": "2025-01-15T00:00:00Z"
  },
  "blobs": [
    {
      "property": "document_url",
      "type": "document",
      "url": "s3://my-bucket/contracts/vendor-001.pdf"
    }
  ]
}

4. Process

POST /v1/buckets/{bucket_id}/batches
{ "object_ids": ["obj_001", "obj_002"] }

POST /v1/buckets/{bucket_id}/batches/{batch_id}/submit

5. Create a Retriever

POST /v1/retrievers
{
  "retriever_name": "contract-search",
  "collection_ids": ["col_contracts_text", "col_contracts_tables"],
  "input_schema": {
    "properties": {
      "query": { "type": "text", "required": true },
      "document_type": { "type": "text" }
    }
  },
  "stages": [
    {
      "stage_name": "filter",
      "version": "v1",
      "parameters": {
        "filters": {
          "field": "metadata.document_type",
          "operator": "eq",
          "value": "{{inputs.document_type}}"
        }
      }
    },
    {
      "stage_name": "knn_search",
      "version": "v1",
      "parameters": {
        "feature_address": "mixpeek://pdf_extractor@v1/text_embedding",
        "input_mapping": { "text": "query" },
        "limit": 50
      }
    }
  ]
}
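
The filter stage refers to `{{inputs.document_type}}`, which is filled from the values passed at execution time. The sketch below shows one way such placeholders can be resolved. It illustrates the idea only and is not Mixpeek's template engine; the `resolve_templates` function and its handling of missing inputs (the placeholder is left unchanged) are assumptions.

```python
import re
from typing import Any

PLACEHOLDER = re.compile(r"\{\{\s*inputs\.(\w+)\s*\}\}")


def resolve_templates(value: Any, inputs: dict) -> Any:
    """Recursively replace {{inputs.name}} placeholders inside strings."""
    if isinstance(value, dict):
        return {k: resolve_templates(v, inputs) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve_templates(v, inputs) for v in value]
    if isinstance(value, str):
        # Unknown input names are left as-is (an assumption of this sketch).
        return PLACEHOLDER.sub(
            lambda m: str(inputs.get(m.group(1), m.group(0))), value
        )
    return value


filter_params = {
    "filters": {
        "field": "metadata.document_type",
        "operator": "eq",
        "value": "{{inputs.document_type}}",
    }
}
resolved = resolve_templates(
    filter_params,
    {"query": "termination clauses", "document_type": "vendor_agreement"},
)
print(resolved["filters"]["value"])
```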

6. Query

POST /v1/retrievers/{retriever_id}/execute
{
  "inputs": {
    "query": "termination clauses with 30-day notice",
    "document_type": "vendor_agreement"
  },
  "limit": 10
}

Named Entity Recognition

Enable NER to extract entities like dates, amounts, and names:

{
  "feature_extractor": {
    "feature_extractor_name": "text_extractor",
    "version": "v1",
    "parameters": {
      "enable_ner": true,
      "entity_types": ["PERSON", "ORG", "DATE", "MONEY"]
    }
  }
}

Filter by entity:

{
  "filters": {
    "field": "metadata.entities.ORG",
    "operator": "contains",
    "value": "Acme Corp"
  }
}
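
To clarify what the filter above is expected to match, here is a small local evaluation of `eq` and `contains` conditions against a document-like dictionary. The document shape (entities stored as lists under `metadata.entities.<TYPE>`) and the `matches` helper are assumptions for illustration, not the service's actual filter engine.

```python
from typing import Any


def get_path(doc: dict, dotted: str) -> Any:
    """Follow a dotted path such as 'metadata.entities.ORG'; None if absent."""
    current: Any = doc
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def matches(doc: dict, condition: dict) -> bool:
    """Evaluate a single {field, operator, value} condition."""
    actual = get_path(doc, condition["field"])
    op, expected = condition["operator"], condition["value"]
    if op == "eq":
        return actual == expected
    if op == "contains":
        # List membership or substring match; anything else does not match.
        return isinstance(actual, (list, str)) and expected in actual
    raise ValueError(f"unsupported operator: {op}")


doc = {
    "metadata": {
        "document_type": "vendor_agreement",
        "entities": {"ORG": ["Acme Corp", "Initech"], "MONEY": ["$12,000"]},
    }
}
org_filter = {"field": "metadata.entities.ORG", "operator": "contains", "value": "Acme Corp"}
print(matches(doc, org_filter))
```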

Multi-Page Assembly

Retrieve all pages from a document using lineage:

GET /v1/documents/{document_id}/lineage

Classify with Taxonomies

Auto-classify documents by type (contract, invoice, NDA) using a reference collection:

POST /v1/taxonomies
{
  "taxonomy_name": "document-classifier",
  "taxonomy_type": "flat",
  "retriever_id": "ret_contract_search",
  "collection_id": "col_contracts_text",
  "input_mappings": [{ "source": "payload.content", "target": "query" }],
  "enrichment_fields": [{ "source": "payload.document_type", "target": "auto_doc_type" }],
  "threshold": 0.7,
  "execution_mode": "materialize"
}

New documents are automatically enriched with an auto_doc_type field. See Taxonomies for hierarchical taxonomies and retroactive classification.
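
The `threshold` of 0.7 suggests that a document is labeled only when its best match in the reference collection is similar enough. The following sketch shows that decision rule using cosine similarity over toy vectors. It is a simplified illustration under assumed data; the `classify` function, the reference vectors, and the use of cosine similarity are assumptions rather than a description of the service's internals.

```python
import math
from typing import Dict, List, Optional


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(
    vec: List[float],
    references: Dict[str, List[float]],
    threshold: float = 0.7,
) -> Optional[str]:
    """Return the best-matching label, or None if it scores below threshold."""
    best_label, best_score = None, -1.0
    for label, ref in references.items():
        score = cosine(vec, ref)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else None


# Toy reference embeddings, one per document type (made up for this example).
references = {
    "contract": [1.0, 0.0, 0.0],
    "invoice": [0.0, 1.0, 0.0],
    "nda": [0.0, 0.0, 1.0],
}
print(classify([0.9, 0.1, 0.0], references))  # close to "contract"
print(classify([0.5, 0.5, 0.5], references))  # ambiguous, below threshold
```

Raising the threshold trades coverage for precision: fewer documents receive a label, but those labels rest on closer matches.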

Discover Clusters

Find patterns across your document corpus:

POST /v1/clusters
{
  "cluster_name": "contract-themes",
  "collection_id": "col_contracts_text",
  "feature_uri": "mixpeek://pdf_extractor@v1/text_embedding",
  "algorithm": { "name": "agglomerative", "params": { "n_clusters": 8 } },
  "llm_labeling": {
    "enabled": true,
    "input_mappings": [{ "source": "payload", "fields": ["document_type", "title"] }]
  },
  "dimension_reduction": { "method": "umap", "n_components": 2 }
}

Clusters reveal groupings such as “vendor agreements with auto-renewal” and “service contracts with SLA terms”. Promote stable clusters to taxonomy nodes. See Clusters.
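
For readers unfamiliar with the `agglomerative` algorithm named in the request, the sketch below shows the bottom-up merging idea on toy 2-D points: every point starts as its own cluster, and the two closest clusters are merged until `n_clusters` remain. The source does not state which linkage the service uses; average linkage is chosen here purely for illustration, and the `agglomerative` function is hypothetical.

```python
import math
from typing import List


def agglomerative(points: List[List[float]], n_clusters: int) -> List[List[int]]:
    """Average-linkage agglomerative clustering; returns lists of point indices."""
    n_clusters = max(1, n_clusters)
    clusters = [[i] for i in range(len(points))]

    def linkage(a: List[int], b: List[int]) -> float:
        # Mean pairwise Euclidean distance between two clusters.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        # Merge the closest pair and drop the absorbed cluster.
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]]
result = agglomerative(points, n_clusters=3)
print(result)
```

This brute-force version is quadratic per merge and is only meant for small examples; real embeddings would use a library implementation.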

Set Up Alerts

Get notified when new documents match specific criteria:

POST /v1/alerts
{
  "alert_name": "new-vendor-contracts",
  "collection_id": "col_contracts_text",
  "condition": { "field": "metadata.document_type", "operator": "eq", "value": "vendor_agreement" },
  "notification": { "type": "webhook", "url": "https://example.com/webhook" }
}

Set Up Webhooks

Monitor document processing and extraction status:

POST /v1/webhooks
{
  "webhook_name": "doc-processing",
  "url": "https://example.com/webhook",
  "events": ["batch.completed", "batch.failed"]
}
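
On the receiving side, a webhook endpoint typically branches on the event type. The sketch below handles the two subscribed events with a pure function that can be called from any web framework. The payload shape (an `event` field plus a `batch_id`) is an assumption; consult the actual webhook payload before relying on these keys.

```python
import json


def handle_event(raw_body: bytes) -> str:
    """Decide what to do with one webhook delivery and return a log message."""
    payload = json.loads(raw_body)
    event = payload.get("event")  # assumed key
    batch_id = payload.get("batch_id", "unknown")  # assumed key
    if event == "batch.completed":
        return f"batch {batch_id} finished; documents are ready to query"
    if event == "batch.failed":
        return f"batch {batch_id} failed; inspect and resubmit"
    return f"ignored event: {event}"


ok = handle_event(b'{"event": "batch.completed", "batch_id": "bat_123"}')
failed = handle_event(b'{"event": "batch.failed", "batch_id": "bat_456"}')
print(ok)
print(failed)
```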
