Document Intelligence

Document intelligence uses the warehouse’s decompose layer to parse text and layout from PDFs and scanned documents, then makes them queryable through retrieval. Every code block below uses the real API field names.

How It Works

When you ingest a document, the universal_extractor runs a multi-stage pipeline:

Content extraction — text is parsed from native PDFs, with OCR fallback for scanned pages.
Chunking — documents are split into searchable segments.
Embedding — each document is embedded for semantic search.
Indexing — segments are stored with metadata for filtered vector search.

At query time, the retriever runs semantic search over the document embeddings and can filter by metadata such as document type.
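
To make the query-time behaviour concrete, here is a minimal Python sketch of a metadata-filtered vector search over a toy in-memory index. It is not Mixpeek's implementation: the `Segment` type, the three-dimensional vectors, and the `search` function are hypothetical stand-ins for real embeddings and a real vector store.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Segment:
    """A toy indexed segment: an embedding plus filterable metadata."""
    segment_id: str
    vector: list[float]
    metadata: dict = field(default_factory=dict)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index, query_vector, top_k=3, **metadata_filter):
    """Keep segments matching every metadata filter, then rank by similarity."""
    candidates = [
        s for s in index
        if all(s.metadata.get(k) == v for k, v in metadata_filter.items())
    ]
    ranked = sorted(candidates, key=lambda s: cosine(s.vector, query_vector), reverse=True)
    return ranked[:top_k]


# Hypothetical index contents, for illustration only.
index = [
    Segment("seg_a", [1.0, 0.0, 0.0], {"document_type": "vendor_agreement"}),
    Segment("seg_b", [0.9, 0.1, 0.0], {"document_type": "nda"}),
    Segment("seg_c", [0.0, 1.0, 0.0], {"document_type": "vendor_agreement"}),
]

hits = search(index, [1.0, 0.0, 0.0], top_k=2, document_type="vendor_agreement")
print([h.segment_id for h in hits])  # ['seg_a', 'seg_c']
```

Filtering before ranking means `seg_b` never competes for a slot, even though it is closer to the query than `seg_c`.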

Feature Extractors

Extractor	Use for
`universal_extractor@v1`	Parsing any file (PDF, image, scanned doc) with OCR fallback and producing a Gemini embedding for semantic search
`text_extractor@v1`	Text embeddings, NER, and summarization over already-extracted text

Mixpeek has no separate pdf_extractor or table_extractor. The universal_extractor ingests arbitrary documents (PDFs and scans) and produces a searchable embedding.

1. Create a bucket

curl -sS -X POST "$MP_API_URL/v1/buckets" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket_name": "contracts",
    "bucket_schema": {
      "properties": {
        "document_url": { "type": "pdf" },
        "document_type": { "type": "string" },
        "contract_date": { "type": "datetime" }
      }
    }
  }'
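
For readers scripting these calls rather than using curl, the following is a rough Python equivalent using only the standard library. The `build_request` helper and the placeholder URL, key, and namespace are hypothetical; the path, headers, and body mirror the curl call above. The request is built but not sent.

```python
import json
import urllib.request


def build_request(api_url: str, api_key: str, namespace: str,
                  path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST with the same headers as the curl call."""
    return urllib.request.Request(
        url=api_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "X-Namespace": namespace,
            "Content-Type": "application/json",
        },
    )


bucket_payload = {
    "bucket_name": "contracts",
    "bucket_schema": {
        "properties": {
            "document_url": {"type": "pdf"},
            "document_type": {"type": "string"},
            "contract_date": {"type": "datetime"},
        }
    },
}

# Placeholder values; in practice read MP_API_URL, MP_API_KEY and
# MP_NAMESPACE from the environment (for example via os.environ).
request = build_request(
    "https://api.example.com", "test-key", "test-namespace",
    "/v1/buckets", bucket_payload,
)

# To send it: urllib.request.urlopen(request), then json.load() the response.
```

The same helper can serve the later steps by changing `path` and `payload`.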

2. Create a collection

Use universal_extractor to parse each document and produce a searchable embedding. Map the extractor’s content input to your bucket’s document_url field.

curl -sS -X POST "$MP_API_URL/v1/collections" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_name": "contracts-text",
    "source": { "type": "bucket", "bucket_ids": ["bkt_contracts"] },
    "feature_extractor": {
      "feature_extractor_name": "universal_extractor",
      "version": "v1",
      "input_mappings": { "content": "document_url" },
      "field_passthrough": [
        { "source_path": "document_type" },
        { "source_path": "contract_date" }
      ]
    }
  }'

3. Ingest documents

curl -sS -X POST "$MP_API_URL/v1/buckets/bkt_contracts/objects" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "key_prefix": "/2025/agreements",
    "blobs": [
      { "property": "document_url", "type": "pdf", "data": "s3://my-bucket/contracts/vendor-001.pdf" },
      { "property": "document_type", "type": "text", "data": "vendor_agreement" }
    ]
  }'
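
When ingesting many files, it can help to generate one request body per document. The sketch below assumes each object carries the same two blobs as the example above; `build_object_payload` and the second file name are hypothetical.

```python
def build_object_payload(document_url: str, document_type: str, key_prefix: str) -> dict:
    """Return one request body shaped like the ingest example above."""
    return {
        "key_prefix": key_prefix,
        "blobs": [
            {"property": "document_url", "type": "pdf", "data": document_url},
            {"property": "document_type", "type": "text", "data": document_type},
        ],
    }


# (document URL, document type) pairs; the second entry is made up for illustration.
documents = [
    ("s3://my-bucket/contracts/vendor-001.pdf", "vendor_agreement"),
    ("s3://my-bucket/contracts/nda-014.pdf", "nda"),
]

payloads = [
    build_object_payload(url, doc_type, "/2025/agreements")
    for url, doc_type in documents
]
```

Each element of `payloads` would be sent as the body of a separate POST to the bucket's objects endpoint.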

4. Process

# Create a batch, then submit it; poll the returned task until COMPLETED
curl -sS -X POST "$MP_API_URL/v1/buckets/bkt_contracts/batches" \
  -H "Authorization: Bearer $MP_API_KEY" -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{ "object_ids": ["obj_001", "obj_002"] }'

curl -sS -X POST "$MP_API_URL/v1/buckets/bkt_contracts/batches/{batch_id}/submit" \
  -H "Authorization: Bearer $MP_API_KEY" -H "X-Namespace: $MP_NAMESPACE"

See Monitoring ingestion for the task-polling loop.
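
That page is the authoritative reference. As a rough idea of the shape such a loop takes, here is a generic Python sketch that polls until a terminal status or a timeout. It makes no assumption about the task endpoint: the caller supplies `fetch_status`, a hypothetical callable that returns the current status string. Only `COMPLETED` comes from this guide; `FAILED`, `PENDING`, and `PROCESSING` are assumed names.

```python
import time
from typing import Callable


def wait_for_task(
    fetch_status: Callable[[], str],
    interval_s: float = 5.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the task reaches a terminal status or the timeout elapses."""
    terminal = {"COMPLETED", "FAILED"}  # "FAILED" is an assumed status name
    waited = 0.0
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"task still {status!r} after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s


# Simulated status sequence standing in for real API responses.
statuses = iter(["PENDING", "PROCESSING", "COMPLETED"])
result = wait_for_task(lambda: next(statuses), interval_s=0.0, sleep=lambda _: None)
print(result)  # COMPLETED
```

Injecting `sleep` keeps the loop testable without real delays; in production the default `time.sleep` applies.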

5. Create a retriever

curl -sS -X POST "$MP_API_URL/v1/retrievers" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "retriever_name": "contract-search",
    "collection_identifiers": ["contracts-text"],
    "input_schema": {
      "query": { "type": "text", "required": true }
    },
    "stages": [
      {
        "stage_name": "search",
        "stage_type": "filter",
        "config": {
          "stage_id": "feature_search",
          "parameters": {
            "searches": [
              {
                "feature_uri": "mixpeek://universal_extractor@v1/gemini-embedding-2",
                "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
                "top_k": 50
              }
            ],
            "final_top_k": 20
          }
        }
      }
    ]
  }'
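
A common mistake in a definition like this is referencing an input in a stage that `input_schema` never declares. The following Python sketch is a client-side check, not part of the API: it scans the stages for `{{INPUT.name}}` placeholders and reports any name missing from `input_schema`. The function names and the tolerance for whitespace inside the braces are assumptions.

```python
import re

INPUT_PATTERN = re.compile(r"\{\{\s*INPUT\.(\w+)\s*\}\}")


def referenced_inputs(node) -> set[str]:
    """Recursively collect every name used as {{INPUT.name}} in a nested structure."""
    if isinstance(node, str):
        return set(INPUT_PATTERN.findall(node))
    if isinstance(node, dict):
        node = list(node.values())
    if isinstance(node, list):
        found = set()
        for child in node:
            found |= referenced_inputs(child)
        return found
    return set()  # numbers, booleans, None


def undeclared_inputs(retriever: dict) -> set[str]:
    """Names referenced in stages but missing from input_schema."""
    declared = set(retriever.get("input_schema", {}))
    return referenced_inputs(retriever.get("stages", [])) - declared


retriever = {
    "input_schema": {"query": {"type": "text", "required": True}},
    "stages": [{
        "stage_name": "search",
        "stage_type": "filter",
        "config": {
            "stage_id": "feature_search",
            "parameters": {
                "searches": [{
                    "feature_uri": "mixpeek://universal_extractor@v1/gemini-embedding-2",
                    "query": {"input_mode": "text", "value": "{{INPUT.query}}"},
                    "top_k": 50,
                }],
                "final_top_k": 20,
            },
        },
    }],
}

print(undeclared_inputs(retriever))  # set()
```

An empty set means every placeholder is backed by a declared input.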

6. Query

Filter by document type at execution time with the filters field:

curl -sS -X POST "$MP_API_URL/v1/retrievers/{retriever_id}/execute" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": { "query": "termination clauses with 30-day notice" },
    "filters": {
      "field": "document_type",
      "operator": "eq",
      "value": "vendor_agreement"
    }
  }'
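
If the filter is optional in your application, the request body can be assembled conditionally. This is a small sketch with a hypothetical helper name; the field names are the ones used in the curl call above.

```python
from typing import Optional


def build_execute_body(query: str, document_type: Optional[str] = None) -> dict:
    """Build the execute body, adding the document_type filter only when given."""
    body = {"inputs": {"query": query}}
    if document_type is not None:
        body["filters"] = {
            "field": "document_type",
            "operator": "eq",
            "value": document_type,
        }
    return body


filtered = build_execute_body("termination clauses with 30-day notice", "vendor_agreement")
unfiltered = build_execute_body("termination clauses with 30-day notice")
```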

Multi-page assembly

Retrieve all segments from a source document using lineage:

curl -sS "$MP_API_URL/v1/documents/{document_id}/lineage" \
  -H "Authorization: Bearer $MP_API_KEY" -H "X-Namespace: $MP_NAMESPACE"

Next steps

Classify documents

Auto-classify documents by type (contract, invoice, NDA) with a taxonomy.

Discover themes

Cluster document embeddings to surface recurring contract patterns.

Get notified

Trigger alerts when new documents match a query.

Add NER & summaries

Pair with text_extractor for named-entity recognition and summaries.
