GOT-OCR-2.0-hf

by stepfun-ai

General OCR Theory -- unified end-to-end OCR for documents, scenes, formulas, and sheet music

3.1M downloads/month

580M parameters


Identifiers

Model ID

stepfun-ai/GOT-OCR-2.0-hf

Feature URI

mixpeek://image_extractor@v1/stepfun_got_ocr2_v1
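The feature URI above appears to pack several components into one string. As a minimal sketch, assuming the layout is `scheme://extractor@version/feature`, it can be split apart as follows. That layout is inferred from the single example on this page, not from a documented grammar, and `parseFeatureUri` and its field names are hypothetical helpers, not part of the Mixpeek SDK.

```typescript
// Hypothetical helper: the field names below are our own labels,
// inferred from the one URI shown on this page.
interface FeatureUri {
  scheme: string;    // e.g. "mixpeek"
  extractor: string; // e.g. "image_extractor"
  version: string;   // e.g. "v1"
  feature: string;   // e.g. "stepfun_got_ocr2_v1"
}

function parseFeatureUri(uri: string): FeatureUri {
  // Assumed layout: <scheme>://<extractor>@<version>/<feature>
  const match = /^([a-z][a-z0-9+.-]*):\/\/([^@\/]+)@([^\/]+)\/(.+)$/i.exec(uri);
  if (match === null) {
    throw new Error(`Unrecognized feature URI: ${uri}`);
  }
  const [, scheme, extractor, version, feature] = match;
  return { scheme, extractor, version, feature };
}

const parsed = parseFeatureUri("mixpeek://image_extractor@v1/stepfun_got_ocr2_v1");
console.log(parsed);
```

Throwing on an unrecognized string keeps a malformed identifier from silently propagating into later configuration.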

Overview

GOT-OCR 2.0 is StepFun's general-purpose OCR model. It handles a broad range of visual text recognition tasks in one architecture: beyond standard document and scene text, it processes mathematical formulas, geometric diagrams, molecular structures, charts, tables, and sheet music notation.

At 580M parameters, it achieves strong accuracy across these domains without task-specific fine-tuning. The model pairs a vision encoder with a text decoder and outputs structured text, including LaTeX for formulas and markdown for tables. On Mixpeek, it provides broad-coverage OCR extraction for diverse document types that would otherwise require multiple specialized models.

Architecture

A vision encoder paired with an autoregressive text decoder, 580M parameters. The model handles dynamic image resolutions and outputs plain text, LaTeX, markdown, or other structured formats depending on content type. Recognition is end-to-end, with no separate detection and recognition stages.
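Because the output format depends on content type, downstream code may need to tell the formats apart before rendering or indexing the text. The sketch below is a rough heuristic under our own assumptions: `classifyOcrOutput` and the `OcrOutputKind` labels are hypothetical, and nothing here is provided by the model or by Mixpeek.

```typescript
// Hypothetical labels for the output formats mentioned above.
type OcrOutputKind = "markdown_table" | "latex" | "plain_text";

function classifyOcrOutput(text: string): OcrOutputKind {
  const lines = text.split(/\r?\n/).map((line) => line.trim());

  // A markdown table contains a separator row such as "| --- | :---: |"
  // with at least two columns.
  const separatorRow = /^\|?\s*:?-{3,}:?\s*(\|\s*:?-{3,}:?\s*)+\|?$/;
  if (lines.some((line) => separatorRow.test(line))) {
    return "markdown_table";
  }

  // Heuristic: a backslash followed by two or more letters looks like a
  // LaTeX command (\frac, \begin, \sum). Windows-style paths would also
  // match, so treat the result as a hint rather than a guarantee.
  if (/\\[a-zA-Z]{2,}/.test(text)) {
    return "latex";
  }

  return "plain_text";
}

console.log(classifyOcrOutput("| a | b |\n| --- | --- |\n| 1 | 2 |"));
console.log(classifyOcrOutput("E = \\frac{1}{2} m v^2"));
console.log(classifyOcrOutput("Invoice total: 42"));
```

The table check runs first because a table cell can itself contain LaTeX, and in that case the table structure is usually the more useful signal.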

Mixpeek SDK Integration

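import { Mixpeek } from "mixpeek";

// Replace "API_KEY" with your Mixpeek API key.
const mx = new Mixpeek({ apiKey: "API_KEY" });

// Ingest a document by URL and run the OCR feature extractor on it.
// Top-level await requires an ES module context.
await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/research-paper.pdf" },
  feature_extractors: [{
    name: "ocr",
    version: "v1",
    params: {
      // Select GOT-OCR 2.0 by its model ID.
      model_id: "stepfun-ai/GOT-OCR-2.0-hf"
    }
  }]
});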
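When ingesting many documents, it can help to build the request object in one place and check the inputs before calling the SDK. The following is a minimal sketch that mirrors the field names in the snippet above. The interfaces and `buildOcrIngestRequest` are hypothetical and are not types exported by the Mixpeek SDK, and the URL check is our own assumption about what a valid source looks like.

```typescript
// Hypothetical types mirroring the request shape used in the snippet above.
interface FeatureExtractorSpec {
  name: string;
  version: string;
  params: { model_id: string };
}

interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: FeatureExtractorSpec[];
}

const GOT_OCR_MODEL_ID = "stepfun-ai/GOT-OCR-2.0-hf";

function buildOcrIngestRequest(collectionId: string, sourceUrl: string): IngestRequest {
  if (collectionId.trim() === "") {
    throw new Error("collectionId must not be empty");
  }
  // Assumption: sources are fetched over HTTP(S).
  if (!/^https?:\/\/\S+$/.test(sourceUrl)) {
    throw new Error(`Expected an http(s) URL, got: ${sourceUrl}`);
  }
  return {
    collection_id: collectionId,
    source: { url: sourceUrl },
    feature_extractors: [
      { name: "ocr", version: "v1", params: { model_id: GOT_OCR_MODEL_ID } },
    ],
  };
}

const request = buildOcrIngestRequest(
  "my-collection",
  "https://example.com/research-paper.pdf",
);
console.log(JSON.stringify(request, null, 2));
```

The resulting object has the same shape as the argument passed to `mx.collections.ingest` above, so it could be handed to that call directly.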