    HF · Document Structure · cc-by-nc-sa-4.0

    layoutlmv3-base

    by microsoft

    Pre-trained multimodal transformer for document AI

    565K downloads/month
    480 likes
    125M parameters
    Identifiers
    Model ID: microsoft/layoutlmv3-base
    Feature URI: mixpeek://document_extractor@v1/microsoft_layoutlmv3_v1
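
The Feature URI above appears to pack an extractor name, a version, and a feature name into one string. A minimal sketch of splitting it apart follows. The grammar (`mixpeek://<extractor>@<version>/<feature>`) is an assumption inferred from this single example, and `parseFeatureUri` is a hypothetical helper, not part of the Mixpeek SDK.

```typescript
// Hypothetical helper: the URI grammar is inferred from the one example
// shown on this page, not taken from Mixpeek documentation.
interface FeatureUri {
  extractor: string;
  version: string;
  feature: string;
}

function parseFeatureUri(uri: string): FeatureUri | null {
  // mixpeek://<extractor>@<version>/<feature>
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) {
    return null;
  }
  const [, extractor, version, feature] = match;
  return { extractor, version, feature };
}

const parsedUri = parseFeatureUri(
  "mixpeek://document_extractor@v1/microsoft_layoutlmv3_v1"
);
console.log(parsedUri);
```

Returning `null` for strings that do not match keeps the caller in charge of how to handle unexpected identifiers.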

    Overview

    LayoutLMv3 is a pre-trained multimodal transformer that jointly models text, layout (bounding boxes), and image information for document understanding. At publication (Huang et al., 2022), it reported state-of-the-art results on form understanding, receipt extraction, and document classification.
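
To make the layout input concrete: the Hugging Face implementations of the LayoutLM family expect each word's bounding box on a 0 to 1000 scale relative to the page, rather than in raw pixels. The sketch below shows one way to do that conversion. The `Box` type and `normalizeBox` function are illustrative names, and rounding (rather than truncating) is a choice made here, not something this page specifies.

```typescript
// [x0, y0, x1, y1] in pixels, with the origin at the top-left of the page.
type Box = [number, number, number, number];

// Illustrative helper: scales pixel coordinates to the 0-1000 range
// used for LayoutLM-style layout embeddings.
function normalizeBox(box: Box, pageWidth: number, pageHeight: number): Box {
  const clamp = (v: number): number => Math.min(1000, Math.max(0, v));
  const [x0, y0, x1, y1] = box;
  return [
    clamp(Math.round((1000 * x0) / pageWidth)),
    clamp(Math.round((1000 * y0) / pageHeight)),
    clamp(Math.round((1000 * x1) / pageWidth)),
    clamp(Math.round((1000 * y1) / pageHeight)),
  ];
}

// A word box on an 850×1100 px page.
const normalizedBox = normalizeBox([85, 110, 425, 165], 850, 1100);
console.log(normalizedBox); // [100, 100, 500, 150]
```

Normalizing this way makes the layout signal independent of scan resolution, so the same form scanned at different DPI yields the same coordinates.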

    On Mixpeek, LayoutLMv3 extracts document structure, identifying headings, paragraphs, tables, and their spatial relationships for structured retrieval.

    Architecture

    A unified multimodal transformer that takes text tokens, spatial layout coordinates, and image patches as input. It is pre-trained with three objectives: Masked Language Modeling (MLM), Masked Image Modeling (MIM), and Word-Patch Alignment (WPA).
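
As a rough worked example of how the image side of the input is sized: with the 224×224 px input and 512-token text limit listed under Performance, and assuming the 16×16 px patch size described in the LayoutLMv3 paper (not stated on this page), the patch count can be computed as below. `patchCount` is an illustrative helper.

```typescript
// Number of non-overlapping square patches in a square image.
function patchCount(imageSize: number, patchSize: number): number {
  if (imageSize % patchSize !== 0) {
    throw new Error("imageSize must be a multiple of patchSize");
  }
  const perSide = imageSize / patchSize;
  return perSide * perSide;
}

const imagePatches = patchCount(224, 16); // 14 × 14 = 196 (16 px patches are an assumption)
const maxTextTokens = 512;                // from the Performance section of this page

// Upper bound on the combined sequence, excluding any special tokens
// that a particular implementation may add.
const combinedLength = maxTextTokens + imagePatches; // 708
console.log({ imagePatches, combinedLength });
```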

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    // Replace "API_KEY" with your Mixpeek API key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest a PDF and run the document structure extractor backed by LayoutLMv3.
    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/invoice.pdf" },
      feature_extractors: [{
        name: "document_structure",
        version: "v1",
        params: {
          model_id: "microsoft/layoutlmv3-base"
        }
      }]
    });
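
When ingesting many documents, it can help to build the request object in one place. The sketch below mirrors the field names of the ingest call above. The `IngestRequest` and `FeatureExtractor` types and the `buildIngestRequest` helper are hypothetical names written for this example, not exports of the Mixpeek SDK, and the URL check is a deliberately simple assumption.

```typescript
// Hypothetical types that mirror the request shape shown on this page.
interface FeatureExtractor {
  name: string;
  version: string;
  params: { model_id: string };
}

interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: FeatureExtractor[];
}

// Builds an ingest request that uses the LayoutLMv3 document structure extractor.
function buildIngestRequest(collectionId: string, url: string): IngestRequest {
  // Simple guard: only accept http(s) URLs.
  if (!/^https?:\/\//.test(url)) {
    throw new Error(`Unsupported source URL: ${url}`);
  }
  return {
    collection_id: collectionId,
    source: { url },
    feature_extractors: [
      {
        name: "document_structure",
        version: "v1",
        params: { model_id: "microsoft/layoutlmv3-base" },
      },
    ],
  };
}

const ingestRequest = buildIngestRequest(
  "my-collection",
  "https://example.com/invoice.pdf"
);
console.log(JSON.stringify(ingestRequest, null, 2));
```

The resulting object has the same shape as the argument passed to `mx.collections.ingest` in the snippet above.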

    Capabilities

    • Document layout understanding
    • Form and receipt key-value extraction
    • Document classification
    • Named entity recognition on documents
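
Key-value extraction and named entity recognition are commonly framed as token classification, where each word receives a BIO-style label that is then grouped into spans. The sketch below shows that grouping step. The label names (`B-QUESTION`, `I-ANSWER`, and so on) follow the style of form-understanding datasets such as FUNSD and are assumptions here, as are the `TaggedToken` and `Entity` types; the page does not describe Mixpeek's actual output format.

```typescript
interface TaggedToken {
  text: string;
  label: string; // e.g. "B-QUESTION", "I-QUESTION", "B-ANSWER", or "O"
}

interface Entity {
  type: string;
  text: string;
}

// Groups consecutive B-/I- tagged tokens into entities.
function groupEntities(tokens: TaggedToken[]): Entity[] {
  const entities: Entity[] = [];
  let current: Entity | null = null;
  for (const { text, label } of tokens) {
    if (label === "O") {
      current = null; // outside any entity
      continue;
    }
    const prefix = label.slice(0, 2);
    const type = label.slice(2);
    if (prefix === "I-" && current !== null && current.type === type) {
      current.text += " " + text; // continue the open entity
    } else {
      current = { type, text }; // "B-" tag, or a stray "I-" starts a new entity
      entities.push(current);
    }
  }
  return entities;
}

const entities = groupEntities([
  { text: "Invoice", label: "B-QUESTION" },
  { text: "No:", label: "I-QUESTION" },
  { text: "12345", label: "B-ANSWER" },
  { text: "Thanks", label: "O" },
]);
console.log(entities);
// [ { type: "QUESTION", text: "Invoice No:" }, { type: "ANSWER", text: "12345" } ]
```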

    Use Cases on Mixpeek

    • Intelligent document processing: extract fields from forms
    • Financial document analysis: parse invoices and statements
    • Legal document structure extraction

    Benchmarks

    Dataset         Metric   Score   Source
    FUNSD           F1       92.1%   Huang et al., 2022, Table 2
    CORD            F1       96.6%   Huang et al., 2022, Table 3
    DocVQA (test)   ANLS     83.4    Huang et al., 2022, Table 5
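
For readers unfamiliar with the DocVQA metric: ANLS (Average Normalized Levenshtein Similarity) scores each predicted answer by its edit-distance similarity to the closest reference answer, zeroes out scores whose normalized distance reaches a threshold (commonly 0.5), and averages over questions. The sketch below is an illustrative implementation, not the official evaluation script. Lowercasing and trimming are assumptions about normalization. The benchmark figure above is on a 0 to 100 scale, while this function returns a value between 0 and 1.

```typescript
// Classic dynamic-programming Levenshtein distance, two rows at a time.
function levenshtein(a: string, b: string): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// ANLS over a set of questions; references[i] lists accepted answers for question i.
function anls(predictions: string[], references: string[][], threshold = 0.5): number {
  const norm = (s: string): string => s.trim().toLowerCase();
  let total = 0;
  predictions.forEach((pred, i) => {
    let best = 0;
    for (const ref of references[i]) {
      const p = norm(pred);
      const r = norm(ref);
      const maxLen = Math.max(p.length, r.length);
      const nl = maxLen === 0 ? 0 : levenshtein(p, r) / maxLen;
      best = Math.max(best, nl < threshold ? 1 - nl : 0);
    }
    total += best;
  });
  return predictions.length === 0 ? 0 : total / predictions.length;
}

const anlsScore = anls(
  ["$1,234.00", "acme corp"],
  [["$1,234.00"], ["acme corporation"]]
);
console.log(anlsScore); // 0.78125: exact match (1.0) and a partial match (0.5625)
```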

    Performance

    Input Size: 224×224 px + 512 tokens
    GPU Latency: ~15 ms / page (A100)
    CPU Latency: ~180 ms / page
    GPU Throughput: ~65 pages/sec (A100)
    GPU Memory: ~1.1 GB
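
These figures can be turned into rough capacity estimates. The sketch below converts per-page latency into sequential throughput and estimates wall-clock time for a batch. The 10,000-page workload is a made-up example, the helper names are illustrative, and real throughput will depend on batching, I/O, and document complexity.

```typescript
// Sequential throughput implied by a per-page latency.
function pagesPerSecond(latencyMs: number): number {
  return 1000 / latencyMs;
}

// Estimated wall-clock seconds to process a batch at a given throughput.
function estimateSeconds(pages: number, throughput: number): number {
  return pages / throughput;
}

const gpuSequential = pagesPerSecond(15);  // about 66.7 pages/sec, in line with the ~65 listed
const cpuSequential = pagesPerSecond(180); // about 5.6 pages/sec

const examplePages = 10_000; // hypothetical workload
console.log({
  gpuSeconds: estimateSeconds(examplePages, 65),            // about 154 s
  cpuSeconds: estimateSeconds(examplePages, cpuSequential), // about 1800 s
});
```

By these numbers, the listed GPU throughput is roughly twelve times the sequential CPU rate.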

    Specification

    Framework: HF
    Organization: microsoft
    Feature: Document Structure
    Output: structure tokens
    Modalities: document
    Retriever: Section Filter
    Parameters: 125M
    License: cc-by-nc-sa-4.0
    Downloads/mo: 565K
    Likes: 480

    Research Paper

    Huang et al., 2022. LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. arXiv:2204.08387 (arxiv.org)

    Build a pipeline with layoutlmv3-base

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
