
    OCR API: AI-Powered Text Extraction from Images and Documents

OCR is one extractor in the Mixpeek warehouse's Decompose layer. Go beyond simple text extraction: combine optical character recognition with multimodal document understanding to extract text, tables, and layouts into searchable indexes you can query with composable retriever pipelines.

    Beyond Traditional OCR

    Traditional OCR extracts text. Mixpeek understands documents. Combine text extraction with visual layout analysis, table detection, and multimodal embeddings for true document intelligence.

    Extract + Understand

    Traditional OCR

    Returns raw text strings with no context about document structure, table relationships, or visual layout. You get a wall of text that loses the document's meaning.

    Mixpeek Document Intelligence

    Extracts text with layout context, table structures, spatial coordinates, and reading order. Every text block knows where it sits in the document and how it relates to other elements.

    Extract + Search

    Traditional OCR

    Gives you extracted text as a file output. Building search over that text requires a separate pipeline -- text processing, embedding generation, vector database, and search API.

    Mixpeek OCR + Search

    Extracted text is automatically embedded and indexed into Qdrant namespaces. Composable retriever pipelines let you search immediately with hybrid keyword and semantic matching.

    Extract + Multimodal

    Traditional OCR

    Ignores the visual appearance of documents. A chart, diagram, or annotated image is reduced to whatever text characters appear in it, losing critical information.

    Mixpeek Multimodal OCR

    Generates document embeddings that capture both text content and visual layout. Search by what a document looks like, not just what text it contains. Charts, diagrams, and formatted content are semantically searchable.

    OCR Capabilities

    From printed text to handwritten notes, from simple pages to complex multi-column layouts -- comprehensive document text extraction.

    Printed and Handwritten Text

    Extract text from printed documents and handwritten notes with high accuracy. Support for diverse fonts, layouts, and handwriting styles across scanned documents, photographs, and digital images.

    • High-accuracy printed text recognition
    • Handwritten text extraction (ICR)
    • Mixed printed and handwritten document support

    Table Extraction

    Detect and extract structured table data from documents, invoices, and spreadsheets. Preserve row-column relationships and output structured data that is immediately searchable and queryable.

    • Automatic table boundary detection
    • Row-column relationship preservation
    • Export to structured JSON with cell coordinates
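To make the "structured JSON with cell coordinates" concrete, here is a minimal sketch of what cell-level table output can look like and how to rebuild the row-column grid from it. The field names (`row`, `col`, `text`, `bbox`) are illustrative, not Mixpeek's actual schema.

```python
# Illustrative shape for OCR table output: each cell carries its text,
# row/column indices, and a pixel bounding box. (Hypothetical schema --
# Mixpeek's actual JSON may differ.)
cells = [
    {"row": 0, "col": 0, "text": "Item", "bbox": [40, 100, 200, 130]},
    {"row": 0, "col": 1, "text": "Amount", "bbox": [210, 100, 320, 130]},
    {"row": 1, "col": 0, "text": "Consulting", "bbox": [40, 140, 200, 170]},
    {"row": 1, "col": 1, "text": "$5,000", "bbox": [210, 140, 320, 170]},
]

def to_grid(cells):
    """Rebuild the row-column grid from flat cell records."""
    n_rows = max(c["row"] for c in cells) + 1
    n_cols = max(c["col"] for c in cells) + 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        grid[c["row"]][c["col"]] = c["text"]
    return grid

print(to_grid(cells))  # [['Item', 'Amount'], ['Consulting', '$5,000']]
```

Because each cell keeps its indices and coordinates, the same records can be filtered, queried, or highlighted on the original page image.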

    Layout Preservation

    Understand document layout -- headers, paragraphs, columns, sidebars, and footnotes. Extract text with spatial context so search results reference specific document regions, not just raw text.

    • Document layout analysis and segmentation
    • Reading order detection for multi-column layouts
    • Spatial coordinates for every text block
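Reading-order detection for multi-column pages can be sketched in a few lines: group text blocks into columns by their left edge, then read each column top to bottom. The block tuple format and the fixed column split are simplifying assumptions; real layout analysis infers columns from the page geometry.

```python
# A minimal reading-order sketch for a two-column page: group text blocks
# into columns by their left edge, then read each column top to bottom.
# Block format (x0, y0, x1, y1, text) is illustrative, not Mixpeek's schema.
blocks = [
    (320, 80, 600, 110, "Right col, first line"),
    (40, 80, 300, 110, "Left col, first line"),
    (40, 120, 300, 150, "Left col, second line"),
    (320, 120, 600, 150, "Right col, second line"),
]

def reading_order(blocks, column_split=310):
    left = sorted((b for b in blocks if b[0] < column_split), key=lambda b: b[1])
    right = sorted((b for b in blocks if b[0] >= column_split), key=lambda b: b[1])
    return [b[4] for b in left + right]

print(reading_order(blocks))
```

Without this step, a naive top-to-bottom scan would interleave the two columns and scramble the document's meaning.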

    Multi-Language Recognition

    Extract text in 100+ languages with script-specific models. Handle mixed-language documents, right-to-left scripts, and CJK characters with dedicated recognition pipelines.

    • 100+ languages supported
    • Mixed-language document handling
    • Right-to-left and CJK script support
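To illustrate per-region script routing, here is a rough detector that inspects Unicode character names to decide which recognition model a text region should go to. This is a simplified sketch; production systems use trained script classifiers rather than name lookups.

```python
import unicodedata

# Rough per-region script detector: inspect Unicode character names to
# decide which recognition model to route the region to. Simplified
# sketch -- not how Mixpeek's pipeline is actually implemented.
def detect_script(text):
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("CJK"):
            return "cjk"
        if "ARABIC" in name:
            return "arabic"
        if "CYRILLIC" in name:
            return "cyrillic"
        if "DEVANAGARI" in name:
            return "devanagari"
    return "latin"

print(detect_script("Invoice total"))  # latin
print(detect_script("请求发票"))        # cjk
print(detect_script("فاتورة"))          # arabic
```

Running a detector like this per text region is what lets mixed-language documents get the right recognition model for each block.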

    Supported Document Formats

    Process documents in all standard formats. Mixpeek handles format conversion, resolution optimization, and page splitting automatically.

    PDF

    Single and multi-page documents, scanned and digital

    JPEG/PNG

    Photographs, screenshots, scanned images

    TIFF

    High-resolution scans, multi-page TIFF files

    WebP

    Web-optimized images with text content

    BMP

    Bitmap images from legacy systems

    HEIC

    Apple device photos with text content
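The format handling above amounts to a dispatch step: each supported extension maps to the preprocessing it needs before OCR. Mixpeek performs this conversion automatically; this sketch just illustrates the routing logic with a hypothetical table.

```python
from pathlib import Path

# Hypothetical routing table: which preprocessing step each supported
# format needs before OCR. Mixpeek handles this automatically; the
# step names here are illustrative.
PREPROCESS = {
    ".pdf": "split_pages",
    ".tiff": "split_pages",   # multi-page TIFF
    ".jpg": "normalize",
    ".jpeg": "normalize",
    ".png": "normalize",
    ".webp": "convert_to_png",
    ".bmp": "convert_to_png",
    ".heic": "convert_to_png",
}

def route(filename):
    ext = Path(filename).suffix.lower()
    step = PREPROCESS.get(ext)
    if step is None:
        raise ValueError(f"unsupported format: {ext}")
    return step

print(route("invoice_2024_001.pdf"))  # split_pages
print(route("receipt_scan.jpg"))      # normalize
```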

    Industry Use Cases

    From financial document processing to historical archive digitization -- OCR powers document intelligence across industries.

    Financial Document Processing

    Extract data from invoices, receipts, bank statements, and financial reports. OCR combined with table extraction automates data entry workflows. Index extracted text for compliance search and audit trail queries across years of financial documents.

    Legal Document Analysis

    Process contracts, court filings, patents, and legal correspondence at scale. Extract clauses, signatures, dates, and party names. Build searchable legal archives where attorneys can find relevant precedents and contract terms using natural language queries.

    Healthcare Records Digitization

    Convert handwritten clinical notes, prescription labels, lab reports, and patient forms into searchable text. Preserve document structure for regulatory compliance. Enable clinicians to search across patient records by symptom, medication, or diagnosis.

    Archive and Library Digitization

    Digitize historical documents, manuscripts, newspapers, and book collections. Handle degraded print, faded text, and unusual fonts common in archival materials. Build full-text search indexes across massive document collections for researchers.

    Mixpeek OCR vs. Alternatives

    See how Mixpeek compares to dedicated OCR and document processing services.

    Feature             | Mixpeek                                           | Mindee                               | Textract                     | Document AI                        | ABBYY
    OCR Approach        | Multimodal (OCR + visual understanding + embeddings) | Template-based document parsing   | Layout-aware text extraction | Document understanding with entities | Traditional OCR + structure recognition
    Search Integration  | Built-in hybrid search over extracted text        | No search (extraction only)          | No search (extraction only)  | No search (extraction only)        | No search (extraction only)
    Multimodal Context  | Text + visual layout + document embeddings        | Text extraction only                 | Text + table + form extraction | Text + entity + layout extraction | Text + layout recognition
    Custom Models       | Docker-based plugin system on Ray GPUs            | Custom document types via training   | Not supported                | Custom processor training          | Limited customization
    Retriever Pipelines | Composable multi-stage (filter, search, rerank)   | Not available                        | Not available                | Not available                      | Not available
    Deployment Options  | Managed, Dedicated, BYO Cloud                     | Managed SaaS or on-prem              | AWS only                     | Google Cloud only                  | On-prem or cloud

    Build Document Search with OCR in Minutes

    A simple Python API to extract text from documents and build searchable indexes with composable retriever pipelines.

    document_ocr.py
    from mixpeek import Mixpeek
    
    client = Mixpeek(api_key="YOUR_API_KEY")
    
    # Create a collection with OCR and document extractors
    collection = client.collections.create(
        name="invoice-archive",
        namespace="invoices",
        extractors=[
            {
                "type": "ocr",
                "model": "paddleocr-v4",
                "config": {
                    "languages": ["en", "es", "fr"],
                    "detect_tables": True,
                    "preserve_layout": True
                }
            },
            {
                "type": "document_embedding",
                "model": "colpali-v1.3",
                "config": {
                    "include_visual_features": True
                }
            },
            {
                "type": "text_embedding",
                "model": "sentence-transformers/all-MiniLM-L6-v2",
                "config": {
                    "source": "ocr_output",
                    "chunk_size": 512
                }
            }
        ]
    )
    
    # Upload documents to trigger OCR processing
    client.buckets.upload(
        bucket="my-bucket",
        files=["invoice_2024_001.pdf", "receipt_scan.jpg"],
        collection=collection.id
    )
    
    # Search extracted text with hybrid retrieval
    results = client.retrievers.execute(
        namespace="invoices",
        stages=[
            {
                "type": "feature_search",
                "method": "hybrid",
                "query": {
                    "text": "payment due date March 2026 over $5000",
                    "modalities": ["text", "image"]
                },
                "limit": 20
            },
            {
                "type": "filter",
                "conditions": {
                    "metadata.doc_type": "invoice",
                    "metadata.tables_detected": True
                }
            },
            {
                "type": "rerank",
                "model": "cross-encoder",
                "limit": 5
            }
        ]
    )
    
    for result in results:
        print(f"Document: {result.metadata['filename']}")
        print(f"Page: {result.metadata['page_number']}")
        print(f"Tables: {result.metadata.get('tables', [])}")
        print(f"Text: {result.content[:300]}")

    Frequently Asked Questions

    What is OCR and how does it work?

    OCR (Optical Character Recognition) is the technology that converts images of text into machine-readable text data. Modern OCR uses deep learning models to detect text regions in images, recognize individual characters and words, and reconstruct the document's text content. Mixpeek extends traditional OCR with multimodal understanding -- combining text extraction with visual layout analysis, table detection, and document embeddings for comprehensive document search.
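The three stages described above (detect text regions, recognize characters, reconstruct page text) can be sketched as a skeleton pipeline. The detector and recognizer below are stand-in stubs; real implementations are deep learning models.

```python
# The three OCR stages as a skeleton pipeline: detect text regions,
# recognize characters in each region, reconstruct the page text in
# reading order. Detector and recognizer here are stand-in stubs.
def detect_regions(image):
    # A real detector returns bounding boxes found in the image pixels;
    # we fake two regions with pre-cropped content.
    return [{"bbox": (0, 0, 100, 20), "crop": "INVOICE"},
            {"bbox": (0, 30, 100, 50), "crop": "Total: $5,000"}]

def recognize(region):
    # A real recognizer decodes characters from the cropped pixels.
    return region["crop"]

def ocr(image):
    # Sort regions top-to-bottom (by y0) to reconstruct reading order.
    regions = sorted(detect_regions(image), key=lambda r: r["bbox"][1])
    return "\n".join(recognize(r) for r in regions)

print(ocr("page.png"))
```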

    How is Mixpeek OCR different from traditional OCR services?

    Traditional OCR services extract text and return it as a file. You then need to build your own search infrastructure to make that text queryable. Mixpeek is an end-to-end platform: OCR extraction runs on Ray GPU clusters, extracted text is automatically indexed alongside visual document embeddings in Qdrant namespaces, and composable retriever pipelines let you search immediately. You also get multimodal context -- the visual layout and appearance of documents is searchable alongside the extracted text.

    What document formats does Mixpeek OCR support?

    Mixpeek OCR supports PDF (single and multi-page, both scanned and digital), JPEG, PNG, TIFF (including multi-page), WebP, BMP, and HEIC image formats. For PDFs, Mixpeek intelligently routes digital PDFs through text extraction and scanned PDFs through OCR processing. The platform handles format conversion, resolution optimization, and page splitting automatically.
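The digital-vs-scanned routing decision can be sketched as a simple heuristic: a page whose embedded text layer yields enough characters skips OCR, while a scanned page (little or no extractable text) goes through OCR. The 50-character threshold is an illustrative assumption, not Mixpeek's actual rule.

```python
# Routing sketch: a PDF page with a usable embedded text layer skips OCR;
# a scanned page (little or no extractable text) goes through OCR.
# The 50-character threshold is an illustrative heuristic.
def route_pdf_page(text_layer: str, min_chars: int = 50) -> str:
    return "text_extraction" if len(text_layer.strip()) >= min_chars else "ocr"

print(route_pdf_page("Lorem ipsum dolor sit amet, " * 5))  # text_extraction
print(route_pdf_page(""))                                   # ocr
```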

    Can Mixpeek extract tables from documents?

    Yes. Mixpeek's OCR extractors include table detection and extraction capabilities. They identify table boundaries, detect row and column structure, and extract cell content while preserving the relational layout. Table data is stored as structured metadata on documents in your namespace, making it filterable and searchable through retriever pipelines.

    How accurate is Mixpeek OCR for handwritten text?

    Accuracy depends on handwriting legibility and image quality. Mixpeek uses state-of-the-art ICR (Intelligent Character Recognition) models trained on large handwriting datasets. For clean handwriting on standard forms, accuracy typically exceeds 90%. For degraded or unusual handwriting, accuracy may be lower. You can improve results by using custom models fine-tuned on your specific handwriting styles via the Docker plugin system.
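Figures like "exceeds 90%" are typically character-level accuracy: one minus the character error rate (CER), where CER is the Levenshtein edit distance between the OCR output and the ground truth, divided by the ground-truth length. A self-contained sketch:

```python
# Character-level accuracy: 1 - CER, where CER is the Levenshtein edit
# distance between OCR output and ground truth, divided by the length
# of the ground truth.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def char_accuracy(ocr_text: str, truth: str) -> float:
    return 1.0 - edit_distance(ocr_text, truth) / len(truth)

# Two character substitutions ('1' for 'i', '0' for 'o') in 19 characters.
print(round(char_accuracy("The qu1ck brown f0x", "The quick brown fox"), 3))  # 0.895
```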

    Does Mixpeek preserve document layout during OCR?

    Yes. Mixpeek's layout-aware OCR preserves the spatial structure of documents -- headers, paragraphs, columns, sidebars, tables, and footnotes are identified and segmented. Each text block includes bounding box coordinates and reading order information. This means search results can reference specific regions of a document, not just raw text strings.

    How many languages does Mixpeek OCR support?

    Mixpeek OCR supports 100+ languages through models like PaddleOCR and Tesseract. This includes Latin, Cyrillic, Arabic, Chinese, Japanese, Korean, Devanagari, and other scripts. Mixed-language documents are handled automatically -- the system detects script types and applies appropriate recognition models per text region.

    Can I combine OCR with semantic search in Mixpeek?

    Yes. This is a core advantage of Mixpeek's approach. OCR-extracted text is embedded using text embedding models and indexed alongside visual document embeddings in the same namespace. A single retriever query can combine semantic search over extracted text content, visual similarity over document appearance, metadata filtering by document type or date, and keyword matching for exact terms -- all in one composable pipeline.
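The hybrid combination described above can be sketched as a weighted blend of a keyword score (term overlap) and a semantic score (cosine similarity between embeddings). The 0.5/0.5 weights and toy two-dimensional vectors are illustrative assumptions, not Mixpeek's actual scoring.

```python
import math

# Hybrid retrieval sketch: blend a keyword score (term overlap) with a
# semantic score (cosine similarity between embedding vectors).
def keyword_score(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def hybrid_score(query, doc, q_vec, d_vec, alpha=0.5):
    # alpha weights keyword matching vs. semantic similarity.
    return alpha * keyword_score(query, doc) + (1 - alpha) * cosine(q_vec, d_vec)

score = hybrid_score("payment due date", "Invoice payment due 2026-03-01",
                     [1.0, 0.0], [0.8, 0.6])
print(round(score, 3))  # 0.733
```

Keyword overlap catches exact terms like invoice numbers; the embedding side catches paraphrases the keywords miss. Blending the two is what makes a single retriever stage serve both query styles.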

    Transform Documents into Searchable Intelligence

    Stop extracting text into files that nobody searches. Build document intelligence with OCR, multimodal embeddings, and composable retriever pipelines.