Document Classification Pipeline
Classify documents into custom business categories using layout-aware extraction and taxonomy enrichment. Handles invoices, contracts, reports, forms, and correspondence by analyzing both textual content and visual document structure.
from mixpeek import Mixpeek

client = Mixpeek(api_key="YOUR_API_KEY")

# Create taxonomy for document types
taxonomy = client.taxonomies.create(
    namespace_id="ns_your_namespace",
    name="document_types",
    taxonomy_type="hierarchical",
    hierarchy=[
        {"node_id": "invoice", "collection_id": "col_invoice_examples"},
        {"node_id": "contract", "collection_id": "col_contract_examples"},
        {"node_id": "report", "collection_id": "col_report_examples"},
        {"node_id": "form", "collection_id": "col_form_examples"},
        {"node_id": "correspondence", "collection_id": "col_letter_examples"},
    ],
)

# Create document collection with layout extraction
collection = client.collections.create(
    namespace_id="ns_your_namespace",
    name="incoming_documents",
    extractors=["document-graph-extractor", "text-extractor"],
)

# Apply taxonomy for automatic classification
client.collections.apply_taxonomy(
    collection_id="col_incoming_documents",
    taxonomy_id=taxonomy["taxonomy_id"],
)

# Upload documents for classification
client.buckets.upload(bucket_id="bkt_docs", url="s3://your-bucket/incoming/")

# Query classified documents
docs = client.documents.list(
    collection_id="col_incoming_documents",
    filters={"taxonomy_enrichment.category": "invoice"},
)
print(f"Found {len(docs['results'])} invoices")
Feature Extractors
Retriever Stages
aggregate
Compute aggregations (COUNT, SUM, AVG, etc.) on pipeline results
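To illustrate what COUNT, SUM, and AVG over classified results amount to, here is a minimal local sketch in plain Python. It does not call the Mixpeek aggregate stage, whose configuration is not shown here. The helper name `aggregate`, its `group_by` and `value_field` parameters, and the `category` and `page_count` fields in the sample data are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def aggregate(results, group_by, value_field=None):
    """Group result dicts by `group_by` and compute COUNT, plus SUM and AVG
    of `value_field` when one is given. Hypothetical helper, not a Mixpeek API."""
    groups = defaultdict(list)
    for doc in results:
        groups[doc.get(group_by)].append(doc)

    summary = {}
    for key, docs in groups.items():
        row = {"count": len(docs)}
        if value_field is not None:
            # Skip documents that lack the numeric field.
            values = [d[value_field] for d in docs if value_field in d]
            row["sum"] = sum(values)
            row["avg"] = sum(values) / len(values) if values else None
        summary[key] = row
    return summary


# Hypothetical classified results; real documents would come from a query.
sample = [
    {"category": "invoice", "page_count": 2},
    {"category": "invoice", "page_count": 4},
    {"category": "contract", "page_count": 12},
]

summary = aggregate(sample, group_by="category", value_field="page_count")
for category, row in summary.items():
    print(category, row)
```

Doing this client-side is only practical for small result sets; for large collections a server-side aggregation stage would presumably avoid transferring every document.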
