GOT-OCR-2.0-hf
by stepfun-ai
General OCR Theory -- unified end-to-end OCR for documents, scenes, formulas, and sheet music
stepfun-ai/GOT-OCR-2.0-hf
mixpeek://image_extractor@v1/stepfun_got_ocr2_v1
Overview
GOT-OCR 2.0 is StepFun's general-purpose OCR model that handles an unusually broad range of visual text recognition tasks in a single unified architecture. Beyond standard document and scene text, it processes mathematical formulas, geometric diagrams, molecular structures, charts, tables, and even sheet music notation.
At 580M parameters, it achieves strong accuracy across all these domains without task-specific fine-tuning. The model uses a vision encoder paired with a text decoder, outputting structured text including LaTeX for formulas and markdown for tables. On Mixpeek, it provides broad-coverage OCR extraction for diverse document types that would otherwise require multiple specialized models.
Architecture
Vision encoder + autoregressive text decoder, 580M parameters. Handles dynamic image resolutions. Outputs plain text, LaTeX, markdown, or structured formats depending on content type. End-to-end (no separate detection + recognition stages).
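Because the output format depends on the content type, downstream code may need to tell the formats apart before processing them. The following is a minimal sketch of one way to do that with simple pattern checks. The `OcrFormat` type and the `guessOcrFormat` helper are hypothetical names introduced for illustration; they are not part of the model or of the Mixpeek SDK, and the heuristics are assumptions rather than a documented output contract.

```typescript
// Hypothetical helper (not part of GOT-OCR or the Mixpeek SDK):
// guess which format an OCR output string uses, based on simple patterns.
type OcrFormat = "latex" | "markdown_table" | "plain";

function guessOcrFormat(text: string): OcrFormat {
  const lines = text.split("\n").map((line) => line.trim());

  // A markdown table contains a separator row such as |---|---| or | --- | :---: |.
  const separatorRow = /^\|?\s*:?-{3,}:?\s*(\|\s*:?-{3,}:?\s*)*\|?$/;
  const hasTableSeparator = lines.some(
    (line) => line.includes("|") && separatorRow.test(line)
  );
  // Tables are checked first, so a table with LaTeX in its cells counts as a table.
  if (hasTableSeparator) return "markdown_table";

  // LaTeX output contains backslash commands such as \frac or \begin.
  if (/\\[a-zA-Z]+/.test(text)) return "latex";

  return "plain";
}
```

A caller could use the result to route formula text to a math renderer and table text to a table parser, leaving everything else as plain text.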
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/research-paper.pdf" },
  feature_extractors: [
    {
      name: "ocr",
      version: "v1",
      params: {
        model_id: "stepfun-ai/GOT-OCR-2.0-hf"
      }
    }
  ]
});
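When several documents need the same OCR configuration, the request object can be built once per URL. The sketch below only constructs plain objects in the same shape as the ingest call above and makes no SDK calls. The `IngestRequest` and `FeatureExtractor` interfaces and the `buildOcrIngestRequests` helper are hypothetical names, and the field list is assumed from that single example rather than from the full SDK reference.

```typescript
// Shapes assumed from the ingest example above; the real SDK types may differ.
interface FeatureExtractor {
  name: string;
  version: string;
  params: { model_id: string };
}

interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: FeatureExtractor[];
}

const GOT_OCR_MODEL_ID = "stepfun-ai/GOT-OCR-2.0-hf";

// Hypothetical helper: build one ingest request per document URL,
// each using the GOT-OCR extractor configuration.
function buildOcrIngestRequests(
  collectionId: string,
  urls: string[]
): IngestRequest[] {
  return urls.map((url) => ({
    collection_id: collectionId,
    source: { url },
    feature_extractors: [
      {
        name: "ocr",
        version: "v1",
        params: { model_id: GOT_OCR_MODEL_ID },
      },
    ],
  }));
}
```

Each resulting object could then be passed to `mx.collections.ingest` in a loop, assuming the call accepts the same shape for every document.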
Capabilities
- Plain document OCR (printed and handwritten)
- Scene text recognition
- Mathematical formula extraction (LaTeX output)
- Table extraction (markdown output)
- Chart and diagram understanding
- Sheet music notation recognition
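Since tables come back as markdown, a consumer often wants them as structured rows. The following is a minimal sketch of that conversion, assuming a simple pipe table with a header row, a separator row, and no escaped pipes inside cells. The `parseMarkdownTable` helper is a hypothetical name for illustration and is not provided by the model or by Mixpeek.

```typescript
// Hypothetical helper: convert a simple markdown pipe table into row objects
// keyed by header name. Escaped pipes (\|) inside cells are not handled.
function parseMarkdownTable(markdown: string): Record<string, string>[] {
  // Split one table line into trimmed cell values, ignoring outer pipes.
  const splitRow = (line: string): string[] =>
    line
      .replace(/^\|/, "")
      .replace(/\|$/, "")
      .split("|")
      .map((cell) => cell.trim());

  const lines = markdown
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);

  // A table needs at least a header row and a separator row.
  if (lines.length < 2) return [];

  const headers = splitRow(lines[0]);

  // lines[1] is the separator row (|---|---|); data rows start at index 2.
  return lines.slice(2).map((line) => {
    const cells = splitRow(line);
    const row: Record<string, string> = {};
    headers.forEach((header, i) => {
      row[header] = cells[i] ?? "";
    });
    return row;
  });
}
```

Missing trailing cells are filled with empty strings so that every row has the same keys.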
Use Cases on Mixpeek
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| GOT-Bench (all tasks) | Accuracy | 85.2% | StepFun, 2024 -- Paper Table 2 |
Performance
Specification
Research Paper
General OCR Theory
arxiv.org
Build a pipeline with GOT-OCR-2.0-hf
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.