Document Classification Pipeline

Classify documents into custom business categories using layout-aware extraction and taxonomy enrichment. The pipeline handles invoices, contracts, reports, forms, and correspondence by analyzing both the textual content and the visual structure of each document.

text

image

Multi-Stage

1.9K runs

from mixpeek import Mixpeek

client = Mixpeek(api_key="YOUR_API_KEY")

# Create taxonomy for document types
taxonomy = client.taxonomies.create(
    namespace_id="ns_your_namespace",
    name="document_types",
    taxonomy_type="hierarchical",
    hierarchy=[
        {"node_id": "invoice", "collection_id": "col_invoice_examples"},
        {"node_id": "contract", "collection_id": "col_contract_examples"},
        {"node_id": "report", "collection_id": "col_report_examples"},
        {"node_id": "form", "collection_id": "col_form_examples"},
        {"node_id": "correspondence", "collection_id": "col_letter_examples"},
    ]
)

# Create document collection with layout extraction
collection = client.collections.create(
    namespace_id="ns_your_namespace",
    name="incoming_documents",
    extractors=["document-graph-extractor", "text-extractor"]
)

# Apply taxonomy for automatic classification.
# Assumption: the create response exposes "collection_id" the same way the
# taxonomy response exposes "taxonomy_id"; otherwise substitute the real ID.
collection_id = collection["collection_id"]
client.collections.apply_taxonomy(
    collection_id=collection_id,
    taxonomy_id=taxonomy["taxonomy_id"]
)

# Upload documents for classification
client.buckets.upload(bucket_id="bkt_docs", url="s3://your-bucket/incoming/")

# Query classified documents
docs = client.documents.list(
    collection_id=collection_id,
    filters={"taxonomy_enrichment.category": "invoice"}
)
print(f"Found {len(docs['results'])} invoices")
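
In the taxonomy above, each node is backed by a collection of example documents. One way to picture how a new document could be matched to a node is nearest-example matching over embedding vectors. The sketch below is only an illustration of that idea in plain Python: the `classify` helper, the toy three-dimensional vectors, and the use of cosine similarity are all assumptions made for this example, not a description of how Mixpeek implements taxonomy enrichment.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(doc_vector, examples_by_node):
    """Return (node_id, score) for the node whose closest example best matches doc_vector.

    Hypothetical helper: examples_by_node maps a taxonomy node ID to a list of
    example embedding vectors. Nodes without examples are skipped.
    """
    best_node, best_score = None, -1.0
    for node_id, examples in examples_by_node.items():
        if not examples:
            continue
        score = max(cosine(doc_vector, ex) for ex in examples)
        if score > best_score:
            best_node, best_score = node_id, score
    return best_node, best_score


# Toy "embeddings" standing in for vectors from the example collections.
EXAMPLES = {
    "invoice": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "contract": [[0.1, 0.9, 0.1]],
    "report": [[0.0, 0.2, 0.9]],
}

node, score = classify([0.85, 0.15, 0.05], EXAMPLES)
print(node, round(score, 3))
```

A real system would likely also apply a minimum-score threshold so that documents resembling none of the examples are left unclassified rather than forced into the nearest category.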

Feature Extractors

Retriever Stages

aggregate

Compute aggregations (COUNT, SUM, AVG, etc.) on pipeline results

reduce
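
To make the aggregate stage listed above concrete, here is a minimal client-side sketch of COUNT, SUM, and AVG computed per category over a list of results. The `aggregate` function, the `category` and `page_count` fields, and the shape of the result rows are hypothetical and chosen only for illustration; the retriever stage's own configuration may look quite different.

```python
from collections import defaultdict


def aggregate(results, group_by, value_field):
    """Group rows by `group_by` and compute count, sum, and average of `value_field`."""
    groups = defaultdict(list)
    for row in results:
        groups[row[group_by]].append(row[value_field])
    return {
        key: {
            "count": len(values),
            "sum": sum(values),
            "avg": sum(values) / len(values),
        }
        for key, values in groups.items()
    }


# Hypothetical classified results; field names are assumptions.
RESULTS = [
    {"category": "invoice", "page_count": 2},
    {"category": "invoice", "page_count": 4},
    {"category": "contract", "page_count": 12},
]

summary = aggregate(RESULTS, "category", "page_count")
for category, stats in summary.items():
    print(category, stats)
```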

Use Cases Using This Recipe

Advanced

12 min

Clinical NLP at Scale

Extract structured intelligence from clinical notes, pathology reports, and medical records

Entity extraction accuracy: 94% F1 on medical NER benchmarks

healthcare

Who It's For

Healthcare IT teams, clinical informatics departments, and health systems processing thousands of clinical documents daily

Related Recipes & Resources

Multimodal Content Moderation

Document Intelligence Search

PDF Data Extraction Pipeline

Document RAG Pipeline

BYO Embeddings Vector Search

Web Scraper