layoutlmv3-base
by microsoft
Pre-trained multimodal transformer for document AI
microsoft/layoutlmv3-base
mixpeek://document_extractor@v1/microsoft_layoutlmv3_v1
Overview
LayoutLMv3 is a pre-trained multimodal transformer that jointly models text, layout (bounding boxes), and image information for document understanding. It achieves state-of-the-art results on form understanding, receipt extraction, and document classification benchmarks.
On Mixpeek, LayoutLMv3 extracts document structure, identifying headings, paragraphs, tables, and their spatial relationships for structured retrieval.
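To illustrate what "spatial relationships for structured retrieval" can mean in practice, here is a small sketch. The `Block` shape and field names are illustrative assumptions, not the actual Mixpeek output schema; the sort approximates single-column reading order from layout coordinates.

```typescript
// Hypothetical shape of a per-page structure block; field names are
// illustrative, not the actual Mixpeek schema.
interface Block {
  type: "heading" | "paragraph" | "table";
  text: string;
  box: [number, number, number, number]; // x0, y0, x1, y1
}

// Order blocks top-to-bottom, then left-to-right, approximating
// single-column reading order from the layout coordinates.
function readingOrder(blocks: Block[]): Block[] {
  return [...blocks].sort((a, b) => a.box[1] - b.box[1] || a.box[0] - b.box[0]);
}

const page: Block[] = [
  { type: "paragraph", text: "Total due: $120", box: [50, 300, 400, 330] },
  { type: "heading", text: "Invoice #42", box: [50, 40, 300, 80] },
];
console.log(readingOrder(page).map((b) => b.text));
// → ["Invoice #42", "Total due: $120"]
```

Multi-column layouts would need a column-detection pass first; this sketch only shows the idea of recovering order from boxes.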
Architecture
Unified multimodal transformer that takes text tokens, spatial layout coordinates, and image patches as input. Pre-trained with Masked Language Modeling, Masked Image Modeling, and Word-Patch Alignment objectives.
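The layout input in the LayoutLM family is a bounding box per word, normalized to a 0–1000 coordinate space regardless of page size. A minimal sketch of that normalization (the `normalizeBox` helper is illustrative, not part of any SDK):

```typescript
type Box = [number, number, number, number]; // x0, y0, x1, y1 in page units

// Scale a pixel/point coordinate into LayoutLMv3's 0-1000 layout space,
// clamping to the valid range.
function normalizeBox(box: Box, pageWidth: number, pageHeight: number): Box {
  const scale = (v: number, size: number) =>
    Math.min(1000, Math.max(0, Math.round((v / size) * 1000)));
  return [
    scale(box[0], pageWidth),
    scale(box[1], pageHeight),
    scale(box[2], pageWidth),
    scale(box[3], pageHeight),
  ];
}

// A word at (61, 40)-(122, 79) on a 612x792pt US-letter PDF page:
console.log(normalizeBox([61, 40, 122, 79], 612, 792));
// → [100, 51, 199, 100]
```

On Mixpeek this happens inside the extractor; it is shown here only to make the "spatial layout coordinates" input concrete.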
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";
const mx = new Mixpeek({ apiKey: "API_KEY" });
await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/invoice.pdf" },
  feature_extractors: [{
    name: "document_structure",
    version: "v1",
    params: {
      model_id: "microsoft/layoutlmv3-base"
    }
  }]
});
Capabilities
- Document layout understanding
- Form and receipt key-value extraction
- Document classification
- Named entity recognition on documents
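Key-value extraction and NER with LayoutLMv3 are typically framed as token classification with BIO tags. This sketch decodes such tags into entity spans; the tag labels (`KEY`, `VALUE`) and the simple continuation rule are assumptions for illustration, not the model's actual label set.

```typescript
// Decode BIO token-classification tags into labeled entity spans.
// Simplification: an I- tag is appended to the last open entity
// without checking that the labels match.
function decodeBIO(
  tokens: string[],
  tags: string[]
): { label: string; text: string }[] {
  const entities: { label: string; text: string }[] = [];
  tokens.forEach((tok, i) => {
    const tag = tags[i];
    if (tag.startsWith("B-")) {
      entities.push({ label: tag.slice(2), text: tok });
    } else if (tag.startsWith("I-") && entities.length > 0) {
      entities[entities.length - 1].text += " " + tok;
    } // "O" tokens are skipped
  });
  return entities;
}

const tokens = ["Invoice", "Date", ":", "2023-05-01"];
const tags = ["B-KEY", "I-KEY", "O", "B-VALUE"];
console.log(decodeBIO(tokens, tags));
// → [{ label: "KEY", text: "Invoice Date" }, { label: "VALUE", text: "2023-05-01" }]
```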
Research Paper
LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking
arxiv.org
Build a pipeline with layoutlmv3-base
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
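Combining extractors follows the same ingest pattern as the example above, with additional entries in `feature_extractors`. A sketch, assuming a text-embedding extractor exists alongside `document_structure` — the `text_embedding` name is an assumption, not a confirmed extractor:

```typescript
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/invoice.pdf" },
  feature_extractors: [
    // Layout structure from LayoutLMv3
    { name: "document_structure", version: "v1",
      params: { model_id: "microsoft/layoutlmv3-base" } },
    // Assumed companion extractor for dense text retrieval
    { name: "text_embedding", version: "v1" },
  ],
});
```

Check the Pipeline Builder for the extractor names actually available in your account.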