Florence-2-large

by microsoft

Foundation model for unified vision tasks with sequence-to-sequence architecture

816K downloads/month

1,832 likes

777M params


Identifiers

Model ID

microsoft/Florence-2-large

Feature URI

mixpeek://image_extractor@v1/microsoft_florence2_large_v1

Overview

Florence-2 is a versatile vision foundation model that handles captioning, object detection, grounding, and OCR in a single unified architecture using a sequence-to-sequence paradigm. It processes images and task-specific text prompts to produce structured outputs.

On Mixpeek, Florence-2 provides detailed scene descriptions that go beyond simple captions, including spatial relationships, object attributes, and contextual information.

Architecture

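Florence-2 pairs a DaViT vision encoder with a transformer encoder-decoder that generates its output sequence-to-sequence. A task-specific prompt token selects which vision task the model performs, so a single set of weights covers all of them. The large variant has roughly 770M parameters (777M as counted on the model listing).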
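To make the prompt-token mechanism concrete, here is a minimal sketch of how a task could be turned into a Florence-2 prompt string. The token strings are the ones listed on the Hugging Face model card for microsoft/Florence-2-large; the TASK_PROMPTS map and the buildPrompt helper are hypothetical names introduced for illustration and are not part of the Mixpeek SDK or of the model's own code.

```typescript
// Task prompt tokens as listed on the Hugging Face model card (a subset).
const TASK_PROMPTS = {
  caption: "<CAPTION>",
  detailedCaption: "<DETAILED_CAPTION>",
  moreDetailedCaption: "<MORE_DETAILED_CAPTION>",
  objectDetection: "<OD>",
  denseRegionCaption: "<DENSE_REGION_CAPTION>",
  phraseGrounding: "<CAPTION_TO_PHRASE_GROUNDING>",
  ocr: "<OCR>",
  ocrWithRegion: "<OCR_WITH_REGION>",
} as const;

type Task = keyof typeof TASK_PROMPTS;

// Hypothetical helper: phrase grounding appends the caption to ground
// directly after the token; every other task here uses the token alone.
function buildPrompt(task: Task, text?: string): string {
  const token = TASK_PROMPTS[task];
  if (task === "phraseGrounding") {
    if (!text) {
      throw new Error("phraseGrounding requires a caption to ground");
    }
    return token + text;
  }
  return token;
}
```

For example, buildPrompt("ocrWithRegion") yields the bare token, while buildPrompt("phraseGrounding", "a green car") yields the token followed by the caption. The prompt string is passed to the model together with the image.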

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

// "API_KEY" is a placeholder; supply your own Mixpeek API key
const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "scene_description",
    version: "v1",
    parameters: { model_id: "microsoft/Florence-2-large" },
  },
});

Capabilities

Dense captioning with region descriptions
Referring expression comprehension
Object detection and visual grounding
OCR with text localization
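
The localization tasks above return regions as text: Florence-2 emits quantized location tokens of the form <loc_N> next to each label. The sketch below shows one way such a raw string could be decoded into pixel boxes. It assumes 1000 bins per axis, bin-center dequantization, token order x1, y1, x2, y2, and that special tokens such as </s> have already been stripped. The Box type and parseBoxes function are hypothetical names for illustration; in practice the model's own post-processor does this step.

```typescript
interface Box {
  label: string;
  x1: number;
  y1: number;
  x2: number;
  y2: number;
}

// Assumption: each coordinate is quantized into 1000 bins per axis.
const NUM_BINS = 1000;

// Decode "label<loc_a><loc_b><loc_c><loc_d>..." into pixel-space boxes.
function parseBoxes(raw: string, width: number, height: number): Box[] {
  const pattern = /([^<]*)<loc_(\d+)><loc_(\d+)><loc_(\d+)><loc_(\d+)>/g;
  const boxes: Box[] = [];
  let lastLabel = "";
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(raw)) !== null) {
    // Assumption: a box with no text before it reuses the previous label.
    const label = m[1].trim() || lastLabel;
    lastLabel = label;
    const [bx1, by1, bx2, by2] = m.slice(2, 6).map(Number);
    boxes.push({
      label,
      // Map each bin index to the center of its bin in pixels.
      x1: ((bx1 + 0.5) * width) / NUM_BINS,
      y1: ((by1 + 0.5) * height) / NUM_BINS,
      x2: ((bx2 + 0.5) * width) / NUM_BINS,
      y2: ((by2 + 0.5) * height) / NUM_BINS,
    });
  }
  return boxes;
}
```

Because the bins are relative to the image size, the same token sequence decodes to different pixel coordinates for different image dimensions, which is why width and height are passed in explicitly.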

Use Cases on Mixpeek

Rich scene understanding for video analytics

Multi-task visual extraction in a single pass

Grounded captioning for accessibility

Benchmarks

Dataset	Metric	Score	Source
COCO Captioning	CIDEr	140.0	Xiao et al., 2024 — Table 2
RefCOCO (val)	Accuracy	92.6%	Xiao et al., 2024 — Table 5
TextVQA (val)	Accuracy	78.0%	Xiao et al., 2024 — Table 4