
    Document Classification Pipeline

    Classify documents into custom business categories using layout-aware extraction and taxonomy enrichment. The pipeline handles invoices, contracts, reports, forms, and correspondence by analyzing both textual content and visual document structure.

    text
    image
    Multi-Stage
    1.9K runs
    from mixpeek import Mixpeek

    client = Mixpeek(api_key="YOUR_API_KEY")

    # Create taxonomy for document types
    taxonomy = client.taxonomies.create(
        namespace_id="ns_your_namespace",
        name="document_types",
        taxonomy_type="hierarchical",
        hierarchy=[
            {"node_id": "invoice", "collection_id": "col_invoice_examples"},
            {"node_id": "contract", "collection_id": "col_contract_examples"},
            {"node_id": "report", "collection_id": "col_report_examples"},
            {"node_id": "form", "collection_id": "col_form_examples"},
            {"node_id": "correspondence", "collection_id": "col_letter_examples"},
        ],
    )

    # Create document collection with layout extraction
    collection = client.collections.create(
        namespace_id="ns_your_namespace",
        name="incoming_documents",
        extractors=["document-graph-extractor", "text-extractor"],
    )

    # Apply taxonomy for automatic classification
    client.collections.apply_taxonomy(
        collection_id="col_incoming_documents",
        taxonomy_id=taxonomy["taxonomy_id"],
    )

    # Upload documents for classification
    client.buckets.upload(bucket_id="bkt_docs", url="s3://your-bucket/incoming/")

    # Query classified documents
    docs = client.documents.list(
        collection_id="col_incoming_documents",
        filters={"taxonomy_enrichment.category": "invoice"},
    )
    print(f"Found {len(docs['results'])} invoices")
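
The recipe above ties each taxonomy node to a collection of example documents, but this page does not describe how a new document is matched against those examples. As a rough intuition only, one common approach to example-based classification is nearest-centroid matching over embeddings. The sketch below is an assumption for illustration, not Mixpeek's actual method: the helper names (`cosine`, `centroid`, `classify`) and the toy 3-dimensional vectors are hypothetical, and it uses only the Python standard library.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def classify(doc_vec, examples_by_node):
    """Return (node_id, score) for the node whose example centroid is closest."""
    scores = {
        node: cosine(doc_vec, centroid(vecs))
        for node, vecs in examples_by_node.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy stand-ins for embeddings of the example collections (hypothetical values).
examples = {
    "invoice": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "contract": [[0.1, 0.9, 0.1], [0.0, 0.8, 0.3]],
}

label, score = classify([0.85, 0.15, 0.05], examples)
print(label, round(score, 3))
```

In this toy run the incoming vector sits next to the invoice examples, so it is labeled `invoice`. A real system would typically also apply a minimum-score threshold so that documents resembling no category are left unclassified.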

    Feature Extractors

    Retriever Stages

    aggregate

    Compute aggregations (COUNT, SUM, AVG, etc.) on pipeline results
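
To make the idea concrete, here is a minimal local sketch of what grouped COUNT, SUM, and AVG aggregations over pipeline results could look like. It is plain Python rather than the actual stage configuration, which this page does not show; the `aggregate` helper and the `category` and `page_count` fields are hypothetical.

```python
from collections import defaultdict


def aggregate(rows, group_by, value_field):
    """Group rows by one field and compute count, sum, and avg of another."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_by]].append(row[value_field])
    return {
        key: {"count": len(vals), "sum": sum(vals), "avg": sum(vals) / len(vals)}
        for key, vals in groups.items()
    }


# Hypothetical classified documents returned by an earlier stage.
results = [
    {"category": "invoice", "page_count": 2},
    {"category": "invoice", "page_count": 4},
    {"category": "contract", "page_count": 12},
]

summary = aggregate(results, "category", "page_count")
for category, stats in summary.items():
    print(category, stats)
```

Grouping before aggregating is what turns a flat result list into per-category statistics, such as the number of invoices and their average page count.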

    reduce