layoutlmv3-base
by microsoft
Pre-trained multimodal transformer for document AI
microsoft/layoutlmv3-base
mixpeek://document_extractor@v1/microsoft_layoutlmv3_v1

Overview
LayoutLMv3 is a pre-trained multimodal transformer that jointly models text, layout (bounding boxes), and image information for document understanding. It achieves state-of-the-art results on form understanding, receipt extraction, and document classification benchmarks.
On Mixpeek, LayoutLMv3 extracts document structure — identifying headings, paragraphs, tables, and their spatial relationships for structured retrieval.
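For illustration, structured output from such an extractor might look like the following. This is a hypothetical shape, not Mixpeek's documented response schema; the field names (`type`, `text`, `page`, `bbox`) are assumptions.

```typescript
// Hypothetical extractor output: document elements with types and positions.
// Field names are illustrative, not Mixpeek's documented schema.
type ElementType = "heading" | "paragraph" | "table";

interface DocumentElement {
  type: ElementType;
  text: string;
  page: number;
  bbox: [number, number, number, number]; // [x0, y0, x1, y1] pixel coordinates
}

const example: DocumentElement[] = [
  { type: "heading", text: "Invoice #1042", page: 1, bbox: [50, 40, 400, 80] },
  { type: "paragraph", text: "Bill to: Acme Corp", page: 1, bbox: [50, 100, 300, 130] },
];

// Spatial relationships (e.g. "paragraph appears below heading") can be
// derived by comparing vertical positions of elements on the same page.
const isBelow =
  example[1].page === example[0].page &&
  example[1].bbox[1] > example[0].bbox[3];
```

Representing elements this way is what makes "structured retrieval" possible: queries can filter on element type or position rather than raw page text.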
Architecture
A unified multimodal transformer that takes text tokens, spatial layout coordinates, and image patches as input. It is pre-trained with three objectives: Masked Language Modeling (MLM), Masked Image Modeling (MIM), and Word-Patch Alignment (WPA).
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";
const mx = new Mixpeek({ apiKey: "API_KEY" });
await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/invoice.pdf" },
  feature_extractors: [{
    name: "document_structure",
    version: "v1",
    params: {
      model_id: "microsoft/layoutlmv3-base"
    }
  }]
});

Capabilities
- Document layout understanding
- Form and receipt key-value extraction
- Document classification
- Named entity recognition on documents
Research Paper
LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking
arxiv.org

Build a pipeline with layoutlmv3-base
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
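As a sketch of what a multi-extractor pipeline might look like, the ingest call above can list additional extractors alongside document_structure. Note that the second extractor's name ("text_embedding") and its params are assumptions for illustration, not confirmed Mixpeek identifiers:

```typescript
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Hypothetical multi-extractor ingest: pair layout extraction with an
// embedding stage so extracted sections become searchable. The
// "text_embedding" extractor shown here is illustrative only.
await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/invoice.pdf" },
  feature_extractors: [
    {
      name: "document_structure",
      version: "v1",
      params: { model_id: "microsoft/layoutlmv3-base" }
    },
    {
      name: "text_embedding", // hypothetical extractor name
      version: "v1",
      params: {}
    }
  ]
});
```

The layout extractor segments the document; the embedding stage then indexes those segments, giving retrieval stages structure-aware units to search over.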