
    donut-base

    by naver-clova-ix

    Document understanding transformer, OCR-free document parsing

    216K downloads/month
    252 likes
    210M parameters

    Identifiers

    Model ID: naver-clova-ix/donut-base
    Feature URI: mixpeek://document_extractor@v1/naver_donut_base_v1

    Overview

    Donut (Document Understanding Transformer) is an end-to-end model for document understanding that directly maps document images to structured outputs without relying on a separate OCR engine. This simplifies the pipeline and avoids OCR error propagation.

    On Mixpeek, Donut offers an OCR-free alternative for document structure extraction, particularly useful for visually rich documents like receipts, forms, and infographics.

    Architecture

    Donut pairs a Swin Transformer encoder, which extracts image features, with a BART decoder, which generates the output text. The model is trained end-to-end on document images paired with their JSON annotations and has no OCR dependency.
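Because the decoder emits a flat token sequence rather than JSON directly, a small post-processing step turns that sequence into a nested object. The sketch below assumes Donut-style field tokens of the form `<s_key>` and `</s_key>`; the function name `tokensToJson` is hypothetical, and the parser is deliberately simplified (no list separators, no repeated keys nested inside themselves). It illustrates the idea and is not the model's own conversion routine.

```typescript
// Hypothetical helper: converts a Donut-style token string into a nested object.
// Assumes fields are wrapped as <s_key>...</s_key>; list separators are not handled.
type Parsed = string | { [key: string]: Parsed };

function tokensToJson(tokens: string): Parsed {
  const text = tokens.trim();
  // Plain text with no opening field token is a leaf value.
  if (!text.startsWith("<s_")) {
    return text;
  }
  const result: { [key: string]: Parsed } = {};
  const openTag = /^<s_([^>]+)>/;
  let rest = text;
  while (rest.length > 0) {
    const match = openTag.exec(rest);
    if (!match) {
      break; // Trailing text outside any field is ignored in this sketch.
    }
    const key = match[1];
    const closeTag = `</s_${key}>`;
    const end = rest.indexOf(closeTag);
    if (end === -1) {
      // Unclosed field: drop the opening token and keep scanning.
      rest = rest.slice(match[0].length).trim();
      continue;
    }
    // Recurse into the field body, which may hold text or nested fields.
    result[key] = tokensToJson(rest.slice(match[0].length, end));
    rest = rest.slice(end + closeTag.length).trim();
  }
  return result;
}

const sample =
  "<s_menu><s_nm>Latte</s_nm><s_price>4.50</s_price></s_menu><s_total>4.50</s_total>";
console.log(JSON.stringify(tokensToJson(sample)));
```

For the sample string, the nested `menu` field becomes an object with `nm` and `price` keys, and `total` stays a plain string.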

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    // Create a client; replace "API_KEY" with your own key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest a document image and run Donut as the document structure extractor.
    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/receipt.jpg" },
      feature_extractors: [{
        name: "document_structure",
        version: "v1",
        params: {
          model_id: "naver-clova-ix/donut-base"
        }
      }]
    });
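When several documents need the same extractor configuration, it can help to build the request objects in one place. The sketch below mirrors the payload shape from the snippet above; the `IngestRequest` interface and the `buildDonutIngestRequest` helper are hypothetical names introduced for illustration and are not part of the Mixpeek SDK.

```typescript
// Hypothetical type describing the payload used in the snippet above.
interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: {
    name: string;
    version: string;
    params: { model_id: string };
  }[];
}

// Hypothetical helper: builds one ingest payload that selects donut-base.
function buildDonutIngestRequest(collectionId: string, url: string): IngestRequest {
  return {
    collection_id: collectionId,
    source: { url },
    feature_extractors: [
      {
        name: "document_structure",
        version: "v1",
        params: { model_id: "naver-clova-ix/donut-base" },
      },
    ],
  };
}

// Example URLs are placeholders.
const urls = [
  "https://example.com/receipt.jpg",
  "https://example.com/invoice.png",
];
const requests = urls.map((u) => buildDonutIngestRequest("my-collection", u));
console.log(requests.length);
```

Each resulting object could then be passed to `mx.collections.ingest(...)` in the same way as the single-document call.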

    Capabilities

    • OCR-free document understanding
    • Structured JSON output from document images
    • Document classification
    • Key-value extraction from forms

    Use Cases on Mixpeek

    • Receipt and invoice parsing without OCR
    • Form data extraction for automated workflows
    • Document classification and routing

    Benchmarks

    Dataset                 Metric                             Score   Source
    CORD (receipt parsing)  Tree Edit Distance (TED) accuracy  91.6%   Kim et al., 2022, Table 2
    DocVQA (test)           ANLS                               67.5    Kim et al., 2022, Table 3
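For readers unfamiliar with ANLS (Average Normalized Levenshtein Similarity), the sketch below shows how the metric is commonly computed: each prediction is scored against its closest ground-truth answer, scores whose normalized edit distance reaches the threshold (0.5 by convention) are set to zero, and the results are averaged. The function names and the trim-and-lowercase normalization are assumptions of this sketch; benchmark tables usually report the value multiplied by 100.

```typescript
// Levenshtein edit distance using a rolling row of the DP table.
function levenshtein(a: string, b: string): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// ANLS over a set of questions; groundTruths[i] lists accepted answers for predictions[i].
function anls(predictions: string[], groundTruths: string[][], threshold = 0.5): number {
  if (predictions.length === 0) {
    return 0;
  }
  let total = 0;
  predictions.forEach((pred, i) => {
    const p = pred.trim().toLowerCase();
    let best = 0;
    for (const gt of groundTruths[i]) {
      const g = gt.trim().toLowerCase();
      const maxLen = Math.max(p.length, g.length);
      // Normalized distance in [0, 1]; two empty strings count as identical.
      const nl = maxLen === 0 ? 0 : levenshtein(p, g) / maxLen;
      // Answers that are too far from the ground truth score zero.
      const score = nl < threshold ? 1 - nl : 0;
      best = Math.max(best, score);
    }
    total += best;
  });
  return total / predictions.length;
}

// One exact match, one near miss (distance 1 over length 5), one complete miss.
const score = anls(["42.50", "latte", "cat"], [["42.50"], ["late"], ["dog"]]);
console.log(score.toFixed(2));
```

The near miss still earns partial credit, which is why ANLS is more forgiving of small spelling or recognition errors than exact-match accuracy.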

    Performance

    Input Size: 2560×1920 px (max)
    GPU Latency: ~40 ms / page (A100)
    CPU Latency: ~580 ms / page
    GPU Throughput: ~25 pages/sec (A100)
    GPU Memory: ~0.8 GB
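The latency and throughput figures are consistent with each other: at roughly 40 ms per page, sequential processing yields 1000 / 40 = 25 pages per second. The sketch below applies the same arithmetic to estimate batch processing time. It assumes strictly sequential, single-device processing with no batching or I/O overhead; the function names and the 10,000-page workload are illustrative only.

```typescript
// Approximate per-page latencies taken from the figures above.
const GPU_LATENCY_MS = 40;
const CPU_LATENCY_MS = 580;

// Pages per second for sequential processing at a given per-page latency.
function pagesPerSecond(latencyMs: number): number {
  return 1000 / latencyMs;
}

// Total seconds to process a batch sequentially (ignores I/O and batching).
function batchSeconds(pages: number, latencyMs: number): number {
  return (pages * latencyMs) / 1000;
}

const pages = 10_000; // hypothetical workload
console.log(`GPU: ${pagesPerSecond(GPU_LATENCY_MS)} pages/sec, ${batchSeconds(pages, GPU_LATENCY_MS)} s total`);
console.log(`CPU: ${pagesPerSecond(CPU_LATENCY_MS).toFixed(2)} pages/sec, ${batchSeconds(pages, CPU_LATENCY_MS)} s total`);
```

Under these assumptions, a 10,000-page batch takes about 400 seconds on the GPU versus about 5,800 seconds on CPU, a 14.5× difference.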

    OCR-free architecture: reads documents directly from pixels

    Specification

    Framework: Hugging Face (HF)
    Organization: naver-clova-ix
    Feature: Document Structure
    Output: structure tokens
    Modalities: document
    Retriever: Section Filter
    Parameters: 210M
    License: MIT
    Downloads/mo: 216K
    Likes: 252

    Research Paper

    OCR-free Document Understanding Transformer

    arxiv.org

    Build a pipeline with donut-base

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
