    HF · OCR · MIT

    Dolphin-v2

    by ByteDance

    End-to-end document parsing VLM — 21 element types, pixel-accurate layout

    210K downloads/month
    ~3B params
    Identifiers
    Model ID: ByteDance/Dolphin-v2
    Feature URI: mixpeek://image_extractor@v1/bytedance_dolphin_v2

    Overview

    Dolphin v2 is ByteDance's visual document parsing model that classifies and extracts 21 element categories from both digital and photographed documents: text blocks, tables, formulas, figures, code blocks, headers, footers, captions, and more. Built on a Qwen2.5-VL-3B backbone, it processes document pages end-to-end without a separate OCR pipeline.

    It scores 89.45 overall on OmniDocBench V1.5, with standout performance on tables (TEDS: 90.48) and formulas (CDM: 86.72). The key advance over v1 is absolute pixel-coordinate spatial localization: every extracted element comes with precise bounding box coordinates. On Mixpeek, Dolphin v2 powers structured document extraction for RAG pipelines that need to understand document layout, not just raw text.

    Architecture

    Dolphin v2 uses a Qwen2.5-VL-3B vision-language backbone fine-tuned for document parsing. It processes pages at native resolution with adaptive tiling and outputs structured JSON containing the element type, text content, and absolute pixel-coordinate bounding box for each detected element, across 21 element categories.
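The structured output described above can be modeled with a small type, plus a helper that converts absolute pixel boxes into page-relative coordinates (handy when pages are rendered at different resolutions). This is a minimal sketch: the field names (`type`, `text`, `bbox`), the `[x1, y1, x2, y2]` box order, and the page dimensions are assumptions for illustration, not the documented Dolphin-v2 or Mixpeek schema.

```typescript
// Hypothetical shape of one parsed element; field names are assumptions,
// not the documented Dolphin-v2 or Mixpeek schema.
interface ParsedElement {
  type: string;                            // e.g. "table", "formula", "text"
  text: string;                            // extracted content
  bbox: [number, number, number, number];  // assumed [x1, y1, x2, y2] in absolute pixels
}

interface NormalizedBox {
  left: number;
  top: number;
  width: number;
  height: number;
}

// Convert an absolute pixel box to fractions of the page size (0..1).
function normalizeBox(
  bbox: [number, number, number, number],
  pageWidth: number,
  pageHeight: number,
): NormalizedBox {
  if (pageWidth <= 0 || pageHeight <= 0) {
    throw new RangeError("page dimensions must be positive");
  }
  const [x1, y1, x2, y2] = bbox;
  return {
    left: x1 / pageWidth,
    top: y1 / pageHeight,
    width: (x2 - x1) / pageWidth,
    height: (y2 - y1) / pageHeight,
  };
}

// Illustrative element on a 1000 x 2000 pixel page (made-up values).
const exampleElement: ParsedElement = {
  type: "table",
  text: "Revenue | 2023 | 2024",
  bbox: [100, 200, 500, 600],
};
const relativeBox = normalizeBox(exampleElement.bbox, 1000, 2000);
console.log(relativeBox);
```

Normalized boxes can be stored alongside the text so that highlights still line up when a page image is rescaled for display.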

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    const mx = new Mixpeek({ apiKey: "API_KEY" });

    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/financial-report.pdf" },
      feature_extractors: [{
        name: "ocr",
        version: "v1",
        params: {
          model_id: "ByteDance/Dolphin-v2",
          extract_layout: true
        }
      }]
    });

    Capabilities

    • 21 document element categories (text, tables, formulas, figures, code, etc.)
    • Pixel-accurate bounding box localization
    • Tables with structure preservation (TEDS: 90.48)
    • Formula recognition (CDM: 86.72)
    • MIT license, 3B parameters

    Use Cases on Mixpeek

    • Structured document extraction for enterprise RAG
    • Table extraction from financial reports and invoices
    • Formula extraction from scientific papers and textbooks
    • Layout-aware document indexing for search across mixed-content pages
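For use cases like table extraction and layout-aware indexing, the extracted elements typically need to be filtered by type and put into a stable order before chunking. The sketch below shows one way to do that. The element shape, the type labels (`"table"`, `"header"`, and so on), and the `page` field are hypothetical, chosen for illustration rather than taken from the Mixpeek response format.

```typescript
// Hypothetical element shape; names, type labels, and bbox order are assumptions.
interface LayoutElement {
  type: string;
  text: string;
  page: number;
  bbox: [number, number, number, number]; // assumed [x1, y1, x2, y2], absolute pixels
}

// Keep only elements whose type is in the requested list.
function selectByType(elements: LayoutElement[], types: string[]): LayoutElement[] {
  const wanted = new Set(types);
  return elements.filter((el) => wanted.has(el.type));
}

// Approximate reading order: by page, then top edge, then left edge.
// This simple sort does not handle multi-column layouts.
function inReadingOrder(elements: LayoutElement[]): LayoutElement[] {
  return [...elements].sort(
    (a, b) => a.page - b.page || a.bbox[1] - b.bbox[1] || a.bbox[0] - b.bbox[0],
  );
}

// Made-up sample data standing in for one parsed document.
const sampleElements: LayoutElement[] = [
  { type: "text", text: "Quarterly summary", page: 1, bbox: [50, 300, 550, 340] },
  { type: "table", text: "Q1 | 10\nQ2 | 12", page: 1, bbox: [50, 400, 550, 600] },
  { type: "formula", text: "E = mc^2", page: 2, bbox: [50, 100, 300, 140] },
  { type: "header", text: "Annual Report", page: 1, bbox: [50, 20, 550, 60] },
];

const tablesOnly = selectByType(sampleElements, ["table"]);
const orderedTypes = inReadingOrder(sampleElements).map((el) => el.type);
console.log(tablesOnly.length, orderedTypes);
```

Sorting on a copy leaves the original array untouched, so the same parse result can feed several indexing strategies.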

    Benchmarks

    Dataset                      | Metric | Score | Source
    OmniDocBench V1.5 (overall)  | Score  | 89.45 | ByteDance, 2026 (Model Card)
    OmniDocBench V1.5 (tables)   | TEDS   | 90.48 | ByteDance, 2026 (Model Card)
    OmniDocBench V1.5 (formulas) | CDM    | 86.72 | ByteDance, 2026 (Model Card)

    Performance

    Input Size: Variable-resolution document pages
    GPU Latency: ~420 ms / page (A100)
    GPU Throughput: ~2.4 pages/sec (A100)
    GPU Memory: ~7 GB
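The latency and throughput figures agree with each other (1 / 0.42 s is roughly 2.4 pages per second), so either can be used for rough capacity planning. The sketch below estimates processing time for a batch of pages. It assumes throughput scales linearly with GPU count, which is an idealization; the function name and parameters are illustrative.

```typescript
// Throughput figure from the performance numbers above (single A100).
// Real throughput varies with page complexity, batching, and hardware.
const PAGES_PER_SECOND = 2.4;

// Estimate wall-clock seconds to process `pages` pages on `gpus` identical GPUs,
// assuming linear scaling with GPU count (an idealization).
function estimateSeconds(pages: number, gpus: number = 1): number {
  if (pages < 0 || gpus < 1) {
    throw new RangeError("pages must be >= 0 and gpus must be >= 1");
  }
  return pages / (PAGES_PER_SECOND * gpus);
}

// A 1,200-page corpus on one GPU comes out at roughly 500 seconds.
const singleGpuSeconds = estimateSeconds(1200);
console.log(`${singleGpuSeconds.toFixed(0)} s for 1,200 pages on one GPU`);
```

Treat the result as a lower bound: it ignores download, PDF rasterization, and queueing time.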

    Specification

    Framework: HF
    Organization: ByteDance
    Feature: OCR
    Output: text + bbox
    Modalities: video, image, document
    Retriever: Text-in-Image
    Parameters: ~3B
    License: MIT
    Downloads/mo: 210K

    Research Paper

    Dolphin: A Document Parsing Model

    arxiv.org

    Build a pipeline with Dolphin-v2

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
