Falcon-OCR

by tiiuae

300M early-fusion OCR model: plain text, LaTeX, and HTML table output from document images

195K downloads/month

300M params

Identifiers

Model ID

tiiuae/Falcon-OCR

Feature URI

mixpeek://image_extractor@v1/tiiuae_falcon_ocr_v1

Overview

Falcon-OCR is a compact 300M-parameter early-fusion vision-language model for document OCR, developed by the Technology Innovation Institute (TII). Traditional OCR pipelines chain separate detection, recognition, and layout-analysis stages. Falcon-OCR instead processes image patches and text tokens in a shared parameter space from the first transformer layer. A hybrid attention mask lets image tokens attend to one another bidirectionally, while text tokens decode causally, conditioned on the image.

At just 300M parameters, Falcon-OCR is roughly 3x smaller than competing VLM-based OCR models yet achieves 80.3% on the olmOCR benchmark and 88.64 overall on OmniDocBench. On Mixpeek, it provides fast, lightweight OCR extraction from scanned documents, receipts, and printed materials, producing plain text, LaTeX for formulas, or HTML for tables depending on the requested output format.

Architecture

Early-fusion, dense autoregressive Transformer. A single transformer processes image patches and text tokens in a shared parameter space from layer 1. Hybrid attention mask: image tokens attend bidirectionally, while text tokens decode causally, conditioned on the image. Requires PyTorch 2.5+ for FlexAttention.
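
The hybrid attention mask described above can be illustrated with a small sketch. This is not Falcon-OCR's implementation: `buildHybridMask` is a hypothetical helper. It assumes that all image tokens come before the text tokens in the sequence and that image tokens do not attend to text tokens. Neither detail is stated in the description.

```typescript
// Illustrative sketch only; not the model's actual code.
// Assumption: the sequence is `numImage` image tokens followed by `numText` text tokens.
// mask[q][k] === true means query position q may attend to key position k.
function buildHybridMask(numImage: number, numText: number): boolean[][] {
  const total = numImage + numText;
  const mask: boolean[][] = [];
  for (let q = 0; q < total; q++) {
    const row: boolean[] = [];
    const queryIsImage = q < numImage;
    for (let k = 0; k < total; k++) {
      const keyIsImage = k < numImage;
      if (queryIsImage) {
        // Image queries see every image token (bidirectional), but no text
        // (assumption: image tokens are not conditioned on text).
        row.push(keyIsImage);
      } else {
        // Text queries see all image tokens plus text up to and including
        // their own position (causal decoding conditioned on the image).
        row.push(k <= q);
      }
    }
    mask.push(row);
  }
  return mask;
}

// Example: 2 image tokens followed by 2 text tokens.
const exampleMask = buildHybridMask(2, 2);
console.log(exampleMask.map((r) => r.map((v) => (v ? "1" : "0")).join(" ")).join("\n"));
```

With 2 image tokens and 2 text tokens, the top-left 2x2 block of the mask is fully set (bidirectional image attention), and the text rows are lower-triangular. In PyTorch, the same predicate could be expressed as a mask function for FlexAttention, which is presumably why the model requires PyTorch 2.5+.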

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "ocr",
    version: "v1",
    parameters: { model_id: "tiiuae/Falcon-OCR" },
  },
});