Florence-2-large
by microsoft
Foundation model for unified vision tasks with sequence-to-sequence architecture
microsoft/Florence-2-large

Overview
Florence-2 is a versatile vision foundation model that handles captioning, object detection, grounding, and OCR in a single unified architecture using a sequence-to-sequence paradigm. It processes images and task-specific text prompts to produce structured outputs.
On Mixpeek, Florence-2 provides detailed scene descriptions that go beyond simple captions — including spatial relationships, object attributes, and contextual information.
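Florence-2 selects a task by prepending a task token to the text prompt. A minimal sketch of that mapping, using the task tokens from the Florence-2 model card (the `taskPrompt` helper itself is illustrative, not part of any SDK):

```typescript
// Florence-2 task prompt tokens (per the Florence-2 model card);
// the taskPrompt helper is illustrative, not part of any SDK.
const FLORENCE2_TASKS: Record<string, string> = {
  caption: "<CAPTION>",
  detailedCaption: "<DETAILED_CAPTION>",
  moreDetailedCaption: "<MORE_DETAILED_CAPTION>",
  objectDetection: "<OD>",
  denseRegionCaption: "<DENSE_REGION_CAPTION>",
  phraseGrounding: "<CAPTION_TO_PHRASE_GROUNDING>",
  ocr: "<OCR>",
  ocrWithRegion: "<OCR_WITH_REGION>",
};

// Build the text prompt for a task; grounding tasks also take free text.
function taskPrompt(task: keyof typeof FLORENCE2_TASKS, text = ""): string {
  return FLORENCE2_TASKS[task] + text;
}
```

For example, `taskPrompt("phraseGrounding", "a red car")` produces `<CAPTION_TO_PHRASE_GROUNDING>a red car`, which asks the model to localize that phrase in the image.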
Architecture
A DaViT vision encoder paired with a transformer-based sequence-to-sequence decoder. The same weights serve multiple vision tasks, selected via task-specific prompt tokens. The large variant has 770M parameters.
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/video.mp4" },
  feature_extractors: [{
    name: "scene_description",
    version: "v1",
    params: {
      model_id: "microsoft/Florence-2-large"
    }
  }]
});

Capabilities
- Dense captioning with region descriptions
- Referring expression comprehension
- Object detection and visual grounding
- OCR with text localization
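For detection and grounding tasks, Florence-2's post-processed output pairs a list of bounding boxes with a parallel list of labels. A sketch of turning that structure into per-object records; the `{ bboxes, labels }` shape follows the model's post-processed `<OD>` output format, but treat the exact field names as an assumption:

```typescript
// Sketch: pairing detection boxes with labels from a Florence-2 <OD>
// result. The { bboxes, labels } shape follows the model's post-processed
// output format; treat the exact field names as an assumption.
interface ODResult {
  bboxes: [number, number, number, number][]; // [x1, y1, x2, y2] in pixels
  labels: string[];                           // one label per box
}

interface Detection {
  label: string;
  box: [number, number, number, number];
}

function toDetections(od: ODResult): Detection[] {
  return od.bboxes.map((box, i) => ({ label: od.labels[i], box }));
}
```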
Research Paper
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (arxiv.org)

Build a pipeline with Florence-2-large
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
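A multi-extractor pipeline can be expressed in roughly the same shape as the ingest call above. In this sketch the second extractor name (`multimodal_embedding`) is a hypothetical placeholder, not a confirmed Mixpeek identifier:

```typescript
// Hypothetical pipeline definition: Florence-2 scene descriptions plus a
// second, placeholder embedding extractor. "multimodal_embedding" is
// illustrative only, not a confirmed Mixpeek extractor name.
const pipeline = {
  collection_id: "my-collection",
  feature_extractors: [
    {
      name: "scene_description",
      version: "v1",
      params: { model_id: "microsoft/Florence-2-large" },
    },
    {
      name: "multimodal_embedding", // hypothetical extractor name
      version: "v1",
      params: {},
    },
  ],
};

// Each extractor contributes its own output; downstream retrieval stages
// can then search over descriptions and embeddings together.
const extractorNames = pipeline.feature_extractors.map((fe) => fe.name);
```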