    HF | OCR | Apache-2.0

    GOT-OCR-2.0-hf

    by stepfun-ai

    General OCR Theory -- unified end-to-end OCR for documents, scenes, formulas, and sheet music

    3.1M downloads/month
    580M params
    Identifiers
    Model ID
    stepfun-ai/GOT-OCR-2.0-hf
    Feature URI
    mixpeek://image_extractor@v1/stepfun_got_ocr2_v1

    Overview

    GOT-OCR 2.0 is StepFun's general-purpose OCR model that handles an unusually broad range of visual text recognition tasks in a single unified architecture. Beyond standard document and scene text, it processes mathematical formulas, geometric diagrams, molecular structures, charts, tables, and even sheet music notation.

    At 580M parameters, it covers all of these domains without task-specific fine-tuning; StepFun reports 85.2% accuracy on GOT-Bench across all tasks. The model pairs a vision encoder with a text decoder and emits structured text, including LaTeX for formulas and markdown for tables. On Mixpeek, it provides broad-coverage OCR extraction for diverse document types that would otherwise require multiple specialized models.

    Architecture

    Vision encoder + autoregressive text decoder, 580M parameters. Handles dynamic image resolutions. Outputs plain text, LaTeX, markdown, or structured formats depending on content type. End-to-end (no separate detection + recognition stages).
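Because the output format depends on content type, downstream code may need to tell the formats apart before storing or rendering them. The following is a rough heuristic sketch only: `classifyOcrOutput` and `OcrOutputKind` are hypothetical names, not part of GOT-OCR 2.0 or the Mixpeek SDK, and the regular expressions are assumptions about what typical LaTeX and markdown table output looks like.

```typescript
// Hypothetical helper: not part of GOT-OCR 2.0 or the Mixpeek SDK.
type OcrOutputKind = "latex" | "markdown_table" | "plain";

// Matches a markdown table separator row with at least two columns,
// e.g. "| --- | --- |" or "|:---|---:|".
const TABLE_SEPARATOR = /^\|?(\s*:?-{3,}:?\s*\|)+\s*:?-{3,}:?\s*\|?$/;

// Matches a LaTeX command such as \frac, \sum or \begin.
const LATEX_COMMAND = /\\[a-zA-Z]+/;

// Heuristically guess which format a decoded OCR string is in.
// Tables are checked first, so a table containing formulas counts as a table.
function classifyOcrOutput(text: string): OcrOutputKind {
  const lines = text.split("\n").map((line) => line.trim());
  if (lines.some((line) => TABLE_SEPARATOR.test(line))) {
    return "markdown_table";
  }
  if (LATEX_COMMAND.test(text)) {
    return "latex";
  }
  return "plain";
}

console.log(classifyOcrOutput("| a | b |\n| --- | --- |\n| 1 | 2 |")); // markdown_table
console.log(classifyOcrOutput("\\frac{a}{b} = c"));                    // latex
console.log(classifyOcrOutput("Hello world"));                         // plain
```

A heuristic like this will misclassify edge cases (single-column tables, plain text containing backslashes), so treat it as a starting point for routing rather than a reliable detector.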

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    // Replace "API_KEY" with your Mixpeek API key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest a PDF and run GOT-OCR 2.0 as the OCR feature extractor.
    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/research-paper.pdf" },
      feature_extractors: [{
        name: "ocr",
        version: "v1",
        params: {
          model_id: "stepfun-ai/GOT-OCR-2.0-hf"
        }
      }]
    });
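For a heterogeneous library, the same request shape can be generated once per document. The sketch below only builds the payloads and does not call the SDK. The `IngestRequest` and `FeatureExtractor` interfaces are inferred from the single ingest call shown above and are not verified against the Mixpeek API reference, and `buildOcrIngestRequests` is a hypothetical helper.

```typescript
// Shapes inferred from the ingest example above; treat them as assumptions,
// not as the SDK's published types.
interface FeatureExtractor {
  name: string;
  version: string;
  params: { model_id: string };
}

interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: FeatureExtractor[];
}

const GOT_OCR_MODEL_ID = "stepfun-ai/GOT-OCR-2.0-hf";

// Hypothetical helper: build one ingest request per document URL,
// each configured to run GOT-OCR 2.0 as the OCR extractor.
function buildOcrIngestRequests(collectionId: string, urls: string[]): IngestRequest[] {
  return urls.map((url) => ({
    collection_id: collectionId,
    source: { url },
    feature_extractors: [
      { name: "ocr", version: "v1", params: { model_id: GOT_OCR_MODEL_ID } },
    ],
  }));
}

const ocrRequests = buildOcrIngestRequests("my-collection", [
  "https://example.com/research-paper.pdf",
  "https://example.com/lecture-notes.pdf",
]);
console.log(ocrRequests.length); // 2
```

Each element could then be passed to `mx.collections.ingest(...)`, assuming the method accepts the same object shape as in the example above.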

    Capabilities

    • Plain document OCR (printed and handwritten)
    • Scene text recognition
    • Mathematical formula extraction (LaTeX output)
    • Table extraction (markdown output)
    • Chart and diagram understanding
    • Sheet music notation recognition

    Use Cases on Mixpeek

    • Mixed-content document digitization
    • Academic paper processing (text + equations + figures)
    • Universal OCR for heterogeneous document libraries
    • Scene text extraction from video frames

    Benchmarks

    Dataset                 Metric     Score   Source
    GOT-Bench (all tasks)   Accuracy   85.2%   StepFun, 2024 -- Paper Table 2

    Performance

    Input Size: Dynamic resolution images
    GPU Latency: ~45 ms / page (A100)
    GPU Throughput: ~22 pages/sec (A100, batch 8)
    GPU Memory: ~2.0 GB
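These figures can be turned into a back-of-the-envelope processing-time estimate. The sketch below uses only the latency and throughput numbers listed above. The function names are illustrative, and the estimate ignores download, PDF rendering, and queueing time, so real jobs will likely take longer.

```typescript
// Figures copied from the performance list above (A100).
const LATENCY_MS_PER_PAGE = 45;   // approximate per-page latency
const BATCHED_PAGES_PER_SEC = 22; // approximate throughput at batch size 8

// Pages per second implied by processing one page at a time.
function sequentialPagesPerSec(latencyMs: number): number {
  if (latencyMs <= 0) throw new RangeError("latencyMs must be positive");
  return 1000 / latencyMs;
}

// Rough wall-clock seconds to OCR `pages` pages at a given throughput.
function estimateSeconds(pages: number, pagesPerSec: number): number {
  if (pagesPerSec <= 0) throw new RangeError("pagesPerSec must be positive");
  return pages / pagesPerSec;
}

console.log(sequentialPagesPerSec(LATENCY_MS_PER_PAGE).toFixed(1)); // 22.2
console.log(estimateSeconds(2200, BATCHED_PAGES_PER_SEC));          // 100
```

At the listed throughput, a 2,200-page corpus works out to roughly 100 seconds of GPU time on a single A100.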

    Specification

    Framework: HF
    Organization: stepfun-ai
    Feature: OCR
    Output: text + bbox
    Modalities: video, image, document
    Retriever: Text-in-Image
    Parameters: 580M
    License: Apache-2.0
    Downloads/mo: 3.1M

    Research Paper

    General OCR Theory

    arxiv.org

    Build a pipeline with GOT-OCR-2.0-hf

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
