    HF · OCR · MIT

    trocr-large-printed

    by microsoft

    Transformer-based OCR for printed text recognition

    554K downloads/month
    179 likes
    608M parameters
    Identifiers

    Model ID: microsoft/trocr-large-printed
    Feature URI: mixpeek://image_extractor@v1/microsoft_trocr_large_v1
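
The Feature URI appears to pack three pieces of information into one string: an extractor name, a version, and a feature name. The sketch below splits it apart under that assumption. The layout is inferred from the single example on this page, not from a published specification, and `parseFeatureUri` is a hypothetical helper rather than part of the Mixpeek SDK.

```typescript
// Hypothetical helper: the layout mixpeek://<extractor>@<version>/<feature>
// is inferred from the one URI shown on this page, not from a spec.
interface FeatureUri {
  extractor: string;
  version: string;
  feature: string;
}

function parseFeatureUri(uri: string): FeatureUri | null {
  // Capture the text before "@", between "@" and "/", and after "/".
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) {
    return null; // Not in the assumed layout.
  }
  const [, extractor, version, feature] = match;
  return { extractor, version, feature };
}

// Usage: split the URI listed under Identifiers.
const parsedUri = parseFeatureUri(
  "mixpeek://image_extractor@v1/microsoft_trocr_large_v1"
);
```

Returning `null` for strings that do not fit the assumed layout keeps the caller in charge of deciding whether that is an error.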

    Overview

    TrOCR is an end-to-end text recognition model that pairs a pre-trained image Transformer encoder with a pre-trained language-model decoder. In the large variant, the encoder is initialized from BEiT and the decoder from RoBERTa. The TrOCR paper reports that the model outperformed prior state-of-the-art approaches on printed text recognition (SROIE).

    On Mixpeek, TrOCR extracts readable text from images and video frames, making text-in-image content searchable through natural language queries.

    Architecture

    Encoder-decoder Transformer: a 24-layer image encoder initialized from BEiT-Large and a 12-layer text decoder initialized from RoBERTa-Large. Pre-trained on large-scale synthetic printed text data; this printed variant is fine-tuned on SROIE, while the handwritten variants are fine-tuned on IAM.
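
To make the encoder side more concrete, here is a small arithmetic sketch of how many patch tokens a ViT-style encoder receives per image. It assumes the 384×384 input size listed under Performance on this page and a 16×16 patch size as described in the TrOCR paper; the patch size is not stated on this page. Special tokens are not counted, and `patchTokenCount` is an illustrative name.

```typescript
// Number of patch tokens a ViT-style encoder sees for a square input.
// Assumes the image side is an exact multiple of the patch side.
function patchTokenCount(imageSide: number, patchSide: number): number {
  if (imageSide % patchSide !== 0) {
    throw new Error("image side must be a multiple of patch side");
  }
  const patchesPerSide = imageSide / patchSide;
  return patchesPerSide * patchesPerSide;
}

// 384 / 16 = 24 patches per side, so 24 * 24 = 576 patch tokens.
const patchTokens = patchTokenCount(384, 16);
```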

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    // Create a client; replace "API_KEY" with your own key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest a document and run the OCR extractor with TrOCR.
    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/document.pdf" },
      feature_extractors: [{
        name: "ocr",
        version: "v1",
        params: {
          model_id: "microsoft/trocr-large-printed"
        }
      }]
    });
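
When several files need the same OCR configuration, it can help to build the request payloads in one place. The sketch below mirrors the field names used in the ingest call above; the interfaces are written by hand from that example and are not taken from the SDK's published typings, so treat them as assumptions. `buildOcrRequests` is a hypothetical helper, and the second URL is a placeholder.

```typescript
// Shapes copied from the ingest call on this page; every field name is an
// assumption about the SDK, not a confirmed type definition.
interface FeatureExtractor {
  name: string;
  version: string;
  params: { model_id: string };
}

interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: FeatureExtractor[];
}

// Hypothetical helper: builds one OCR ingest payload per source URL.
function buildOcrRequests(collectionId: string, urls: string[]): IngestRequest[] {
  return urls.map((url) => ({
    collection_id: collectionId,
    source: { url },
    feature_extractors: [
      {
        name: "ocr",
        version: "v1",
        params: { model_id: "microsoft/trocr-large-printed" },
      },
    ],
  }));
}

const ocrRequests = buildOcrRequests("my-collection", [
  "https://example.com/document.pdf",
  "https://example.com/receipt.png", // placeholder URL
]);
```

Each payload could then be passed to `mx.collections.ingest(...)` in turn, keeping the extractor settings in a single function.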

    Capabilities

    • High-accuracy printed text recognition
    • End-to-end pipeline (no separate detection step)
    • Multi-line text extraction
    • Robust to noise, blur, and varying fonts

    Use Cases on Mixpeek

    • Extract text from video overlays, subtitles, and signage
    • Digitize scanned documents and receipts
    • Search text-in-image content across media libraries

    Benchmarks

    Dataset                   Metric         Score   Source
    SROIE (text recognition)  Word Accuracy  96.1%   Li et al., 2023, Table 3
    IAM Handwritten           CER            3.4%    Li et al., 2023, Table 2
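
For readers unfamiliar with the CER metric in the table, the following is a minimal sketch of character error rate: the Levenshtein edit distance between the recognized text and the reference, divided by the reference length. It is a generic illustration, not the evaluation script used for the figures above, and it compares UTF-16 code units without any text normalization.

```typescript
// Levenshtein distance between two strings, counted in UTF-16 code units.
function editDistance(a: string, b: string): number {
  // prev[j] holds the distance between the first i-1 chars of a and first j of b.
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const substitution = prev[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1;
      const insertion = curr[j - 1] + 1;
      curr.push(Math.min(substitution, deletion, insertion));
    }
    prev = curr;
  }
  return prev[b.length];
}

// CER = edit distance / number of characters in the reference.
function characterErrorRate(reference: string, hypothesis: string): number {
  if (reference.length === 0) {
    throw new Error("reference must be non-empty");
  }
  return editDistance(reference, hypothesis) / reference.length;
}

// One substituted character ("0" for "O") in an 11-character reference.
const exampleCer = characterErrorRate("TOTAL 12.50", "T0TAL 12.50");
```

Lower is better: a CER of 3.4% means roughly 3.4 character edits per 100 reference characters.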

    Performance

    Input Size: 384×384 px
    GPU Latency: ~18 ms / image (A100)
    CPU Latency: ~210 ms / image
    GPU Throughput: ~55 images/sec (A100)
    GPU Memory: ~1.4 GB
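
The latency figures can be turned into a rough budget for a video job. The sketch below is back-of-the-envelope arithmetic only: it assumes frames are processed one at a time at the approximate latencies quoted on this page, and it ignores video decoding, batching, and any other pipeline stages. The 1 frame-per-second sampling rate is an illustrative assumption, not a Mixpeek default.

```typescript
// Approximate per-image latencies quoted on this page, in milliseconds.
const GPU_MS_PER_IMAGE = 18;
const CPU_MS_PER_IMAGE = 210;

// Images per second implied by a per-image latency, assuming no batching.
function imagesPerSecond(msPerImage: number): number {
  return 1000 / msPerImage;
}

// Seconds of sequential OCR time for a video sampled at a fixed frame rate.
function ocrSecondsForVideo(
  videoSeconds: number,
  sampledFps: number,
  msPerImage: number
): number {
  const frames = Math.ceil(videoSeconds * sampledFps);
  return (frames * msPerImage) / 1000;
}

// A 10-minute video sampled at 1 frame per second yields 600 frames.
const gpuSeconds = ocrSecondsForVideo(600, 1, GPU_MS_PER_IMAGE); // 600 * 18 ms = 10.8 s
const cpuSeconds = ocrSecondsForVideo(600, 1, CPU_MS_PER_IMAGE); // 600 * 210 ms = 126 s
```

Note that 1000 / 18 is about 55.6, which lines up with the ~55 images/sec GPU throughput listed above.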

    Specification

    Framework: HF
    Organization: microsoft
    Feature: OCR
    Output: text + bbox
    Modalities: video, image, document
    Retriever: Text-in-Image
    Parameters: 608M
    License: MIT
    Downloads/mo: 554K
    Likes: 179

    Research Paper

    TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models

    arxiv.org

    Build a pipeline with trocr-large-printed

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.

    Alternative Models