    PDF, Image & Document Processing

    Document Processing API

    Mixpeek decomposes each document into features and stores them across cost tiers in its multimodal data warehouse. Extract text, tables, and layout structure from PDFs, contracts, and reports, then query your entire corpus with composable multi-stage retrieval pipelines.

    Document Processing Capabilities

    From raw document input to searchable, structured output in a single pipeline.

    OCR & Text Extraction

    Extract text from scanned documents, photographs, and image-based PDFs using OCR models that support 100+ languages.

    Layout Analysis

    Detect and preserve document structure -- headers, paragraphs, tables, figures, and reading order -- for structured data extraction.

    Table Extraction

    Identify and extract tabular data from documents, preserving row and column relationships for downstream analysis and indexing.

    Semantic Embedding

    Generate vector embeddings from document content that capture meaning, enabling semantic search across your entire document corpus.

    How It Works

    A four-stage pipeline transforms raw documents into searchable, structured data.

    1. Upload

    Send documents through the API or upload to an S3-compatible bucket. Supports PDF, DOCX, PPTX, images, and scanned files.

    2. Classify

    Documents are automatically classified by type and routed to the appropriate extraction pipeline based on their structure and content.

    3. Extract

    OCR, layout analysis, table extraction, and text parsing run in parallel. Each extraction stage produces structured output with position metadata.

    4. Embed & Index

    Extracted content is embedded into vectors and indexed alongside structural metadata for semantic search and retrieval.
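To make the four stages concrete, here is a minimal local sketch of the same flow. It is not Mixpeek's implementation: the `Document` class, the keyword-based classifier, the two toy extractors, and the two-number "embedding" are all hypothetical stand-ins chosen so the example runs with the standard library alone. The one idea it borrows from the description is that extractors run in parallel and each returns output tagged with a page position.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class Document:
    name: str
    pages: list[str]  # plain strings stand in for real page content
    doc_type: str = "unknown"
    features: dict = field(default_factory=dict)


def classify(doc: Document) -> Document:
    # Hypothetical rule: any mention of "invoice" marks the document as an invoice.
    text = " ".join(doc.pages).lower()
    doc.doc_type = "invoice" if "invoice" in text else "generic"
    return doc


def extract_text(doc: Document) -> list[dict]:
    # Toy extractor: one record per page, with the page number as position metadata.
    return [{"page": i, "text": p} for i, p in enumerate(doc.pages, start=1)]


def extract_tables(doc: Document) -> list[dict]:
    # Toy extractor: treat any page containing "|" as holding a table.
    return [{"page": i, "raw": p} for i, p in enumerate(doc.pages, start=1) if "|" in p]


def extract(doc: Document) -> Document:
    # Run the extractors concurrently and collect their outputs by name.
    extractors = {"text": extract_text, "tables": extract_tables}
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, doc) for name, fn in extractors.items()}
        doc.features = {name: f.result() for name, f in futures.items()}
    return doc


def embed_and_index(doc: Document, index: list[dict]) -> list[dict]:
    for item in doc.features["text"]:
        # Placeholder "embedding": character count and word count, not a real model.
        vector = [float(len(item["text"])), float(len(item["text"].split()))]
        index.append(
            {"doc": doc.name, "page": item["page"], "type": doc.doc_type, "vector": vector}
        )
    return index


index: list[dict] = []
doc = Document(name="inv-001.pdf", pages=["Invoice 42", "Item | Qty | Price"])
embed_and_index(extract(classify(doc)), index)
print(len(index), "pages indexed as", doc.doc_type)
```

In a real deployment each stage would be a model or service call; the shape to notice is that classification happens before extraction, and that indexing keeps the structural metadata next to the vector.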

    Supported Document Formats

    Process all of the formats below through a single API endpoint.

    PDF Documents

    • Digital PDFs
    • Scanned PDFs
    • Multi-page documents
    • Fillable forms

    Office Documents

    • DOCX (Word)
    • PPTX (PowerPoint)
    • XLSX (Excel)
    • ODT / ODP

    Images

    • JPEG / PNG / WebP
    • TIFF (multi-page)
    • Scanned documents
    • Photographs of text

    Structured Data

    • CSV / TSV files
    • JSON documents
    • XML files
    • HTML pages
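As a rough illustration of how the four format groups above could be told apart on the client side, the sketch below maps file extensions to a family name. The family labels and the `route` function are hypothetical, and the `.jpg` and `.tif` aliases are added for convenience. Extension checks are only a first pass: they cannot distinguish a digital PDF from a scanned one, which is why classification is described above as depending on structure and content.

```python
from pathlib import Path

# Extensions come from the format lists above; family names are illustrative only.
FORMAT_FAMILIES = {
    "pdf": {".pdf"},
    "office": {".docx", ".pptx", ".xlsx", ".odt", ".odp"},
    "image": {".jpeg", ".jpg", ".png", ".webp", ".tiff", ".tif"},
    "structured": {".csv", ".tsv", ".json", ".xml", ".html"},
}


def route(filename: str) -> str:
    """Return the format family for a filename, based on its extension."""
    suffix = Path(filename).suffix.lower()
    for family, suffixes in FORMAT_FAMILIES.items():
        if suffix in suffixes:
            return family
    raise ValueError(f"unsupported format: {suffix or filename}")


for name in ["contract-2024.pdf", "scan.TIFF", "ledger.xlsx", "export.tsv"]:
    print(name, "->", route(name))
```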

    Use Cases

    Document processing powers intelligent workflows across industries.

    Contract Analysis

    Extract clauses, parties, dates, and obligations from legal contracts. Enable semantic search across thousands of agreements to find relevant precedents and terms.

    Invoice Processing

    Automatically extract vendor names, line items, amounts, and dates from invoices in any format -- digital PDFs, scanned documents, or photographed receipts.

    Medical Records

    Parse clinical documents, lab reports, and imaging results. Extract structured data from unstructured medical records for analysis and compliance.

    Research Literature

    Process academic papers, patents, and technical documents. Extract figures, tables, citations, and full text for semantic search and knowledge discovery.

    Mixpeek vs Traditional Document Processing

    See what changes when you move beyond basic text extraction.

    Feature | Traditional Tools | Mixpeek
    Input Types | Digital text PDFs only | PDF, images, scans, Office docs, structured data
    Understanding | Raw text extraction | Layout-aware extraction with structural metadata
    Search | Keyword matching | Semantic vector search across document content
    Table Handling | Tables lost in extraction | Table structure preserved with row/column relationships
    Scalability | Sequential processing | Distributed batch processing across GPU workers
    Output | Plain text | Structured data + embeddings + searchable index

    Simple API Integration

    Process documents and search across them with a few lines of code.

    document_processing.py
    from mixpeek import Mixpeek
    
    client = Mixpeek(api_key="YOUR_API_KEY")
    
    # Process a PDF document
    result = client.collections.process(
        collection_id="contracts",
        source={
            "type": "file",
            "url": "s3://legal-docs/contract-2024.pdf"
        }
    )
    
    # Search across all processed documents
    results = client.retrievers.search(
        retriever_id="document-search",
        queries=[
            {
                "type": "text",
                "value": "indemnification clauses with liability caps",
                "modalities": ["text"]
            }
        ],
        filters={
            "AND": [
                {"key": "document_type", "value": "contract", "operator": "eq"},
                {"key": "year", "value": 2024, "operator": "gte"}
            ]
        },
        limit=10
    )
    
    for doc in results:
        print(f"Page {doc.metadata['page']}: {doc.score:.3f}")
        print(f"  {doc.text[:200]}...")
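
One way to read the `filters` argument in the example above: every condition under "AND" must hold for a document's metadata. The sketch below evaluates that structure locally against plain dictionaries. It is an interpretation for illustration, not the service's filter engine; the `matches` helper is hypothetical and handles only the two operators that appear in the example, `eq` and `gte`.

```python
import operator

# Only the operators used in the example above are mapped here.
OPERATORS = {"eq": operator.eq, "gte": operator.ge}


def matches(metadata: dict, filters: dict) -> bool:
    """Return True when every condition listed under "AND" holds for the metadata."""
    for cond in filters.get("AND", []):
        if cond["key"] not in metadata:
            return False
        compare = OPERATORS[cond["operator"]]
        if not compare(metadata[cond["key"]], cond["value"]):
            return False
    return True


filters = {
    "AND": [
        {"key": "document_type", "value": "contract", "operator": "eq"},
        {"key": "year", "value": 2024, "operator": "gte"},
    ]
}

docs = [
    {"document_type": "contract", "year": 2024},
    {"document_type": "contract", "year": 2022},
    {"document_type": "invoice", "year": 2025},
]

kept = [d for d in docs if matches(d, filters)]
print(kept)
```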

    Frequently Asked Questions

    What document formats does Mixpeek support?

    Mixpeek processes PDF (digital and scanned), DOCX, PPTX, XLSX, ODT/ODP, images (JPEG, PNG, WebP, TIFF), CSV/TSV, JSON, XML, and HTML. Each document's format is detected automatically, and the document is routed to the appropriate extraction pipeline based on its format and structure.

    How does Mixpeek handle scanned documents and images of text?

    Scanned documents and images are processed through OCR models that extract text with high accuracy across 100+ languages. Layout analysis identifies document structure -- headers, paragraphs, tables, and figures -- preserving the reading order and spatial relationships that give context to the extracted text.

    Can Mixpeek extract tables from PDFs?

    Yes. The document processing pipeline includes table detection and extraction models that identify tabular structures within documents and extract cell values with preserved row and column relationships. The output includes both the raw table data and structural metadata that enables downstream processing and querying.
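
To show why preserved row and column relationships matter downstream, here is a small sketch that turns cell-level output into records you can query. The cell shape (`row`, `col`, `text`) is an assumed format for illustration and is not the documented response schema; the point is only that once each cell carries its indices, rebuilding rows is mechanical.

```python
# Assumed shape for an extracted table: one entry per cell with row/column indices.
cells = [
    {"row": 0, "col": 0, "text": "Item"},
    {"row": 0, "col": 1, "text": "Amount"},
    {"row": 1, "col": 0, "text": "Hosting"},
    {"row": 1, "col": 1, "text": "120.00"},
    {"row": 2, "col": 0, "text": "Support"},
    {"row": 2, "col": 1, "text": "45.50"},
]


def to_records(cells: list[dict]) -> list[dict]:
    """Rebuild the grid from cell indices, then key each body row by the header row."""
    n_rows = max(c["row"] for c in cells) + 1
    n_cols = max(c["col"] for c in cells) + 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        grid[c["row"]][c["col"]] = c["text"]
    header, *body = grid
    return [dict(zip(header, row)) for row in body]


records = to_records(cells)
total = sum(float(r["Amount"]) for r in records)
print(records)
print(total)
```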

    How does semantic search work on documents?

    After text and structure are extracted, the content is embedded into vector representations using language models. These embeddings capture the meaning of the content, not just keywords. You can then search across your entire document corpus using natural language queries, finding relevant passages even when they use different terminology than your search terms.
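
The ranking step behind semantic search can be sketched with cosine similarity, a common way to compare embeddings. The three-dimensional vectors below are made up by hand so the example is easy to follow; real embeddings come from a language model and have hundreds or thousands of dimensions. Note that the top passage never uses the word "indemnification", which is the behavior described above.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hand-written toy vectors, not the output of any embedding model.
passages = {
    "The supplier shall hold the buyer harmless up to a fixed cap.": [0.9, 0.1, 0.2],
    "Payment is due within thirty days of the invoice date.": [0.1, 0.9, 0.1],
    "Either party may terminate with ninety days notice.": [0.2, 0.2, 0.9],
}

# Stands in for the embedding of "indemnification clauses with liability caps".
query_vector = [0.8, 0.2, 0.1]

ranked = sorted(passages, key=lambda p: cosine(query_vector, passages[p]), reverse=True)
for passage in ranked:
    print(f"{cosine(query_vector, passages[passage]):.3f}  {passage}")
```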

    What is the processing latency for documents?

    Processing time depends on document complexity, page count, and configured extractors. A typical single-page PDF processes in 1-3 seconds including OCR, layout analysis, and embedding generation. Multi-page documents are processed in parallel across pages. Batch processing of large document collections is distributed across workers for high throughput.

    Can I process documents in languages other than English?

    Yes. The OCR and text extraction models support over 100 languages. Embedding models are available for multilingual content, enabling semantic search across documents in any supported language. Language detection is automatic, so mixed-language document collections are handled without manual configuration.

    How do I integrate document processing into my existing workflow?

    Mixpeek provides a REST API for document upload and processing, an S3-compatible bucket trigger for automated processing of new documents, and webhooks for notification when processing completes. SDKs are available for Python and JavaScript. Documents can be uploaded individually or in batches.

    Does Mixpeek preserve the original document layout?

    Yes. The layout analysis stage identifies structural elements like headers, paragraphs, lists, tables, and figures along with their spatial coordinates and reading order. This structural metadata is stored alongside the extracted text, enabling applications that need to reconstruct the original document layout or reference specific regions.
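
As an example of what spatial coordinates make possible, the sketch below recovers a reading order from region positions. The `Region` class and the `reading_order` heuristic are hypothetical: regions are grouped into lines by their top edge, then read left to right within each line. This simple rule suits single-column pages; multi-column layouts need column detection first, which is one reason a stored reading order is useful.

```python
from dataclasses import dataclass


@dataclass
class Region:
    kind: str  # e.g. "header", "paragraph", "table"
    x: float   # left edge
    y: float   # top edge, with the origin at the top-left of the page
    text: str


def reading_order(regions: list[Region], line_tolerance: float = 10.0) -> list[Region]:
    """Group regions into lines by top edge, then order each line left to right."""
    lines: list[list[Region]] = []
    current: list[Region] = []
    for region in sorted(regions, key=lambda r: r.y):
        # Start a new line once a region sits clearly below the current line's first region.
        if current and region.y - current[0].y > line_tolerance:
            lines.append(current)
            current = []
        current.append(region)
    if current:
        lines.append(current)
    return [r for line in lines for r in sorted(line, key=lambda r: r.x)]


regions = [
    Region("table", 50, 300, "Fee schedule"),
    Region("paragraph", 320, 102, "Right-hand note"),
    Region("header", 50, 40, "Master Services Agreement"),
    Region("paragraph", 50, 100, "Opening clause"),
]

for region in reading_order(regions):
    print(region.kind, "-", region.text)
```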

    Start Processing Documents Today

    One API to extract, embed, and search across PDFs, images, and office documents. Get started with our free tier or talk to us about enterprise deployment.