Document Processing API
Extract text, tables, and layout structure from PDFs, contracts, and reports, then query across your entire corpus with composable multi-stage retrieval pipelines. Each document is decomposed into features that are stored across cost tiers in the multimodal data warehouse.
Document Processing Capabilities
From raw document input to searchable, structured output in a single pipeline.
OCR & Text Extraction
Extract text from scanned documents, photographs, and image-based PDFs using state-of-the-art OCR models with support for 100+ languages.
Layout Analysis
Detect and preserve document structure -- headers, paragraphs, tables, figures, and reading order -- for structured data extraction.
Table Extraction
Identify and extract tabular data from documents, preserving row and column relationships for downstream analysis and indexing.
Semantic Embedding
Generate vector embeddings from document content that capture meaning, enabling semantic search across your entire document corpus.
How It Works
A four-stage pipeline transforms raw documents into searchable, structured data.
Upload
Send documents through the API or upload to an S3-compatible bucket. Supports PDF, DOCX, PPTX, images, and scanned files.
Classify
Documents are automatically classified by type and routed to the appropriate extraction pipeline based on their structure and content.
Extract
OCR, layout analysis, table extraction, and text parsing run in parallel. Each extractor produces structured output with position metadata.
Embed & Index
Extracted content is embedded into vectors and indexed alongside structural metadata for semantic search and retrieval.
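The Extract stage described above can be pictured as independent extractors running concurrently over the same page. What follows is a minimal sketch of that idea using only Python's standard library. The extractor functions, the page dictionary shape, and the output fields are hypothetical stand-ins chosen for illustration; they are not Mixpeek's internal implementation.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical extractors: each takes a page dict and returns structured
# output tagged with position metadata (page number and bounding box).
def extract_text(page):
    return {"kind": "text", "page": page["number"], "bbox": page["bbox"],
            "value": page["raw"].strip()}

def extract_tables(page):
    # Toy rule for illustration only: lines containing "|" are table rows.
    rows = [line.split("|") for line in page["raw"].splitlines() if "|" in line]
    return {"kind": "table", "page": page["number"], "bbox": page["bbox"],
            "value": rows}

EXTRACTORS = [extract_text, extract_tables]

def run_extractors(page):
    """Run every extractor on the page concurrently and collect the results
    in the same order as EXTRACTORS."""
    with ThreadPoolExecutor(max_workers=len(EXTRACTORS)) as pool:
        futures = [pool.submit(fn, page) for fn in EXTRACTORS]
        return [future.result() for future in futures]

# A made-up page: raw text plus the page's bounding box in points.
page = {"number": 1, "bbox": (0, 0, 612, 792),
        "raw": "Invoice\nitem|qty\npen|2\n"}
outputs = run_extractors(page)
```

Because every output carries its own page number and bounding box, a later indexing step can store the text and the table from the same page side by side without losing where each came from.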
Supported Document Formats
Process every format listed below through a single API endpoint.
PDF Documents
- Digital PDFs
- Scanned PDFs
- Multi-page documents
- Fillable forms
Office Documents
- DOCX (Word)
- PPTX (PowerPoint)
- XLSX (Excel)
- ODT / ODP
Images
- JPEG / PNG / WebP
- TIFF (multi-page)
- Scanned documents
- Photographs of text
Structured Data
- CSV / TSV files
- JSON documents
- XML files
- HTML pages
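As a rough illustration of how the four format families above might be told apart before extraction, here is a small sketch that routes a filename by its extension. The `FORMAT_FAMILIES` mapping and the `format_family` helper are hypothetical names invented for this example. The Classify stage is described as using document structure and content, so an extension lookup like this would be a first pass at most.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the format families listed above.
FORMAT_FAMILIES = {
    "pdf": {".pdf"},
    "office": {".docx", ".pptx", ".xlsx", ".odt", ".odp"},
    "image": {".jpg", ".jpeg", ".png", ".webp", ".tif", ".tiff"},
    "structured": {".csv", ".tsv", ".json", ".xml", ".html", ".htm"},
}

def format_family(filename):
    """Return the format family for a filename, or None if it is not listed."""
    suffix = Path(filename).suffix.lower()
    for family, suffixes in FORMAT_FAMILIES.items():
        if suffix in suffixes:
            return family
    return None
```

For example, `format_family("contract-2024.pdf")` falls into the `"pdf"` family, while a file with an unlisted extension returns `None` so the caller can decide how to handle it.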
Use Cases
Document processing powers intelligent workflows across industries.
Contract Analysis
Extract clauses, parties, dates, and obligations from legal contracts. Enable semantic search across thousands of agreements to find relevant precedents and terms.
Invoice Processing
Automatically extract vendor names, line items, amounts, and dates from invoices in any format -- digital PDFs, scanned documents, or photographed receipts.
Medical Records
Parse clinical documents, lab reports, and imaging results. Extract structured data from unstructured medical records for analysis and compliance.
Research Literature
Process academic papers, patents, and technical documents. Extract figures, tables, citations, and full text for semantic search and knowledge discovery.
Mixpeek vs Traditional Document Processing
See what changes when you move beyond basic text extraction.
| Feature | Traditional Tools | Mixpeek |
|---|---|---|
| Input Types | Digital text PDFs only | PDF, images, scans, Office docs, structured data |
| Understanding | Raw text extraction | Layout-aware extraction with structural metadata |
| Search | Keyword matching | Semantic vector search across document content |
| Table Handling | Tables lost in extraction | Table structure preserved with row/column relationships |
| Scalability | Sequential processing | Distributed batch processing across GPU workers |
| Output | Plain text | Structured data + embeddings + searchable index |
Simple API Integration
Process documents and search across them with a few lines of code.
from mixpeek import Mixpeek

client = Mixpeek(api_key="YOUR_API_KEY")

# Process a PDF document
result = client.collections.process(
    collection_id="contracts",
    source={
        "type": "file",
        "url": "s3://legal-docs/contract-2024.pdf"
    }
)

# Search across all processed documents
results = client.retrievers.search(
    retriever_id="document-search",
    queries=[
        {
            "type": "text",
            "value": "indemnification clauses with liability caps",
            "modalities": ["text"]
        }
    ],
    filters={
        "AND": [
            {"key": "document_type", "value": "contract", "operator": "eq"},
            {"key": "year", "value": 2024, "operator": "gte"}
        ]
    },
    limit=10
)

for doc in results:
    print(f"Page {doc.metadata['page']}: {doc.score:.3f}")
    print(f"  {doc.text[:200]}...")

Frequently Asked Questions
What document formats does Mixpeek support?
Mixpeek processes PDF (digital and scanned), DOCX, PPTX, XLSX, ODT/ODP, images (JPEG, PNG, WebP, TIFF), CSV/TSV, JSON, XML, and HTML. Document types are automatically detected and routed to the appropriate extraction pipeline based on their format and structure.
How does Mixpeek handle scanned documents and images of text?
Scanned documents and images are processed through OCR models that extract text with high accuracy across 100+ languages. Layout analysis identifies document structure -- headers, paragraphs, tables, and figures -- preserving the reading order and spatial relationships that give context to the extracted text.
Can Mixpeek extract tables from PDFs?
Yes. The document processing pipeline includes table detection and extraction models that identify tabular structures within documents and extract cell values with preserved row and column relationships. The output includes both the raw table data and structural metadata that enables downstream processing and querying.
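To make "preserved row and column relationships" concrete, the sketch below rebuilds a grid from individual cells that each carry a row and column index, then turns the grid into one record per data row. The `Cell` type and both helper functions are hypothetical and only illustrate the general idea; the actual output schema is not specified here.

```python
from dataclasses import dataclass

@dataclass
class Cell:
    row: int   # zero-based row index within the table
    col: int   # zero-based column index within the table
    text: str

def cells_to_rows(cells):
    """Rebuild a row-major grid; positions with no cell become empty strings."""
    if not cells:
        return []
    n_rows = max(c.row for c in cells) + 1
    n_cols = max(c.col for c in cells) + 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        grid[c.row][c.col] = c.text
    return grid

def rows_to_records(grid):
    """Treat the first row as the header and return one dict per data row."""
    header, *body = grid
    return [dict(zip(header, row)) for row in body]

# Made-up cells; the cell at row 2, column 1 is deliberately missing.
cells = [
    Cell(0, 0, "Item"), Cell(0, 1, "Amount"),
    Cell(1, 0, "License"), Cell(1, 1, "1200"),
    Cell(2, 0, "Support"),
]
grid = cells_to_rows(cells)
records = rows_to_records(grid)
```

Keeping the indices on each cell means a missing or merged cell shows up as an explicit gap rather than silently shifting the remaining values into the wrong column.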
How does semantic search work on documents?
After text and structure are extracted, the content is embedded into vector representations using language models. These embeddings capture the meaning of the content, not just keywords. You can then search across your entire document corpus using natural language queries, finding relevant passages even when they use different terminology than your search terms.
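The ranking step behind this kind of search can be sketched with cosine similarity over embedding vectors. The three-dimensional vectors below are made up for illustration, since real embeddings come from a model and have far more dimensions. The `cosine` and `top_k` helpers are hypothetical names, not part of any SDK.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=2):
    """Return the k (score, passage) pairs most similar to the query vector."""
    scored = [(cosine(query_vec, vec), passage) for passage, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy index: passage text mapped to an invented embedding vector.
index = {
    "The supplier shall indemnify the buyer against third-party claims.": [0.9, 0.1, 0.0],
    "Payment is due within 30 days of the invoice date.": [0.1, 0.9, 0.1],
    "Liability is capped at the total contract value.": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]  # stands in for an embedded natural-language query
hits = top_k(query, index)
```

With these toy vectors the indemnification and liability passages rank above the payment clause, because their vectors point in nearly the same direction as the query, regardless of which words the passages share with it.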
What is the processing latency for documents?
Processing time depends on document complexity, page count, and configured extractors. A typical single-page PDF processes in 1-3 seconds including OCR, layout analysis, and embedding generation. Multi-page documents are processed in parallel across pages. Batch processing of large document collections is distributed across workers for high throughput.
Can I process documents in languages other than English?
Yes. The OCR and text extraction models support over 100 languages. Embedding models are available for multilingual content, enabling semantic search across documents in any supported language. Language detection is automatic, so mixed-language document collections are handled without manual configuration.
How do I integrate document processing into my existing workflow?
Mixpeek provides a REST API for document upload and processing, an S3-compatible bucket trigger for automated processing of new documents, and webhooks for notification when processing completes. SDKs are available for Python and JavaScript. Documents can be uploaded individually or in batches.
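When uploading in batches, a client usually needs to split a long list of document locations into fixed-size groups first. The helper below is a generic sketch of that step using the standard library; the `batches` function and the example file names are hypothetical, and no particular batch size limit is implied.

```python
from itertools import islice

def batches(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

# Made-up document locations for illustration.
urls = [f"s3://legal-docs/contract-{n}.pdf" for n in range(5)]
groups = list(batches(urls, 2))
```

Each group can then be submitted as one batch request, with the final group holding whatever remains.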
Does Mixpeek preserve the original document layout?
Yes. The layout analysis stage identifies structural elements like headers, paragraphs, lists, tables, and figures along with their spatial coordinates and reading order. This structural metadata is stored alongside the extracted text, enabling applications that need to reconstruct the original document layout or reference specific regions.
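One way an application might use spatial coordinates is to recover reading order from block positions. The sketch below assumes a simple single-column page with a top-left origin, groups blocks whose vertical positions are within a tolerance into the same line, and orders each line left to right. The `Block` type, its fields, and the `reading_order` helper are hypothetical and not the stored metadata format.

```python
from dataclasses import dataclass

@dataclass
class Block:
    kind: str    # e.g. "header", "paragraph", "table"
    text: str
    x: float     # left edge
    y: float     # top edge; y grows downward from the top of the page

def reading_order(blocks, line_tolerance=5.0):
    """Order blocks top to bottom, then left to right within a line.

    Blocks whose top edge is within `line_tolerance` of the first block
    in the current line are treated as being on that line.
    """
    ordered = []
    line = []
    for block in sorted(blocks, key=lambda b: b.y):
        if line and block.y - line[0].y > line_tolerance:
            ordered.extend(sorted(line, key=lambda b: b.x))
            line = []
        line.append(block)
    ordered.extend(sorted(line, key=lambda b: b.x))
    return ordered

# Made-up blocks, deliberately given out of order.
blocks = [
    Block("paragraph", "2024-01-15", 300, 102),
    Block("header", "Service Agreement", 50, 20),
    Block("paragraph", "Effective date:", 50, 100),
]
ordered = reading_order(blocks)
```

Multi-column pages need more than this, since the left column should be read in full before the right one, which is one reason a stored reading order is more reliable than sorting coordinates after the fact.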
