    HF, Scene Captioning, MIT

    Florence-2-large

    by microsoft

    Foundation model for unified vision tasks with sequence-to-sequence architecture

    1.3M downloads/month
    1,793 likes
    777M params
    Identifiers
    Model ID
    microsoft/Florence-2-large
    Feature URI
    mixpeek://image_extractor@v1/microsoft_florence2_large_v1
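The Feature URI above appears to pack three pieces of information into one string. As a minimal sketch, assuming the layout is `mixpeek://<extractor>@<version>/<feature>` (inferred from this single example, not from a published grammar), it could be split as follows. The `FeatureUri` type and `parseFeatureUri` helper are hypothetical names for illustration and are not part of the Mixpeek SDK.

```typescript
// Hypothetical shape inferred from one example URI; not an official Mixpeek type.
interface FeatureUri {
  extractor: string; // e.g. "image_extractor"
  version: string;   // e.g. "v1"
  feature: string;   // e.g. "microsoft_florence2_large_v1"
}

// Split a URI of the assumed form mixpeek://<extractor>@<version>/<feature>.
function parseFeatureUri(uri: string): FeatureUri {
  const prefix = "mixpeek://";
  const at = uri.indexOf("@");
  // Search for the separator slash only after the scheme's own "//".
  const slash = uri.indexOf("/", prefix.length);
  const malformed =
    uri.indexOf(prefix) !== 0 ||   // wrong or missing scheme
    at < prefix.length + 1 ||      // empty extractor, or no "@"
    slash <= at + 1 ||             // empty version, or no "/"
    slash === uri.length - 1;      // empty feature name
  if (malformed) {
    throw new Error("Not a recognized feature URI: " + uri);
  }
  return {
    extractor: uri.slice(prefix.length, at),
    version: uri.slice(at + 1, slash),
    feature: uri.slice(slash + 1),
  };
}

const parsedUri = parseFeatureUri(
  "mixpeek://image_extractor@v1/microsoft_florence2_large_v1"
);
console.log(parsedUri.extractor, parsedUri.version, parsedUri.feature);
```

If the real URI grammar allows extra path segments or query parameters, this helper would need to be extended accordingly.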

    Overview

    Florence-2 is a versatile vision foundation model that handles captioning, object detection, grounding, and OCR in a single unified architecture using a sequence-to-sequence paradigm. It processes images and task-specific text prompts to produce structured outputs.

    On Mixpeek, Florence-2 provides detailed scene descriptions that go beyond simple captions, including spatial relationships, object attributes, and contextual information.

    Architecture

    A DaViT vision encoder is paired with a transformer-based sequence-to-sequence decoder. A task-specific prompt token selects which vision task the model performs, so one set of weights serves all tasks. The large variant has 777M parameters (about 0.77B).
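To make the prompt-token mechanism concrete, here is a small sketch of how a task could be mapped to its prompt string. The token strings are the ones listed on the Hugging Face model card for Florence-2. The `TASK_PROMPTS` table and `buildPrompt` helper are illustrative names, not part of any SDK, and whether Mixpeek exposes task selection at all is not stated on this page.

```typescript
// Task prompt tokens as listed on the Hugging Face model card for Florence-2.
const TASK_PROMPTS = {
  caption: "<CAPTION>",
  detailedCaption: "<DETAILED_CAPTION>",
  moreDetailedCaption: "<MORE_DETAILED_CAPTION>",
  objectDetection: "<OD>",
  denseRegionCaption: "<DENSE_REGION_CAPTION>",
  phraseGrounding: "<CAPTION_TO_PHRASE_GROUNDING>",
  ocr: "<OCR>",
  ocrWithRegion: "<OCR_WITH_REGION>",
} as const;

type Task = keyof typeof TASK_PROMPTS;

// Build the text prompt: the task token, optionally followed by extra text.
// Phrase grounding needs the phrase or caption to ground, so it is required there.
function buildPrompt(task: Task, textInput: string = ""): string {
  if (task === "phraseGrounding" && textInput.trim() === "") {
    throw new Error("phraseGrounding requires a text input to ground");
  }
  return TASK_PROMPTS[task] + textInput;
}

console.log(buildPrompt("moreDetailedCaption"));
console.log(buildPrompt("phraseGrounding", "a green car parked by a wall"));
```

Keeping the tokens in one typed table means an unsupported task name fails at compile time rather than producing a prompt the model does not recognize.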

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    // Replace "API_KEY" with your own key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest a video and run Florence-2 scene description on it.
    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/video.mp4" },
      feature_extractors: [{
        name: "scene_description",
        version: "v1",
        params: {
          model_id: "microsoft/Florence-2-large"
        }
      }]
    });

    Capabilities

    • Dense captioning with region descriptions
    • Referring expression comprehension
    • Object detection and visual grounding
    • OCR with text localization
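Detection, grounding, and localized OCR all return regions as text, with coordinates written as location tokens such as `<loc_52>`. The sketch below shows one way such a string could be turned into pixel boxes. It assumes 1,000 quantization bins per axis (as described in the Florence-2 paper), assumes each token decodes to the center of its bin, and assumes special tokens such as `<s>` have already been stripped. The `Region` type and `parseRegions` helper are hypothetical, and the format Mixpeek returns may differ.

```typescript
// Hypothetical result type for illustration.
interface Region {
  label: string;
  box: [number, number, number, number]; // [x1, y1, x2, y2] in pixels
}

const BINS = 1000; // assumption: 1,000 quantization bins per axis

// Parse "label<loc_x1><loc_y1><loc_x2><loc_y2>..." into labeled pixel boxes.
function parseRegions(text: string, width: number, height: number): Region[] {
  const pattern = /([^<]*)<loc_(\d+)><loc_(\d+)><loc_(\d+)><loc_(\d+)>/g;
  // Assumption: a bin index maps to the center of its bin.
  const toPx = (bin: string, size: number): number =>
    ((Number(bin) + 0.5) * size) / BINS;
  const found: Region[] = [];
  let lastLabel = "";
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(text)) !== null) {
    // A box with no preceding text reuses the previous label
    // (one phrase may be grounded to several boxes).
    const label = m[1].trim() || lastLabel;
    lastLabel = label;
    found.push({
      label,
      box: [
        toPx(m[2], width),
        toPx(m[3], height),
        toPx(m[4], width),
        toPx(m[5], height),
      ],
    });
  }
  return found;
}

const exampleRegions = parseRegions(
  "car<loc_100><loc_200><loc_500><loc_600>person<loc_0><loc_0><loc_999><loc_999>",
  1000,
  500
);
console.log(exampleRegions);
```

In practice the Hugging Face processor for Florence-2 performs this post-processing itself, so a helper like this is mainly useful for understanding or debugging raw model output.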

    Use Cases on Mixpeek

    • Rich scene understanding for video analytics
    • Multi-task visual extraction in a single pass
    • Grounded captioning for accessibility

    Benchmarks

    Dataset          Metric    Score   Source
    COCO Captioning  CIDEr     140.0   Xiao et al., 2024, Table 2
    RefCOCO (val)    Accuracy  92.6%   Xiao et al., 2024, Table 5
    TextVQA (val)    Accuracy  78.0%   Xiao et al., 2024, Table 4

    Performance

    Input Size: 768×768 px
    GPU Latency: ~35 ms / image (A100)
    CPU Latency: ~520 ms / image
    GPU Throughput: ~28 images/sec (A100)
    GPU Memory: ~3.1 GB
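The latency and throughput figures are two views of the same number: 1000 ms ÷ 35 ms per image ≈ 28.6 images/sec. The sketch below uses the listed latencies to give a rough processing-time estimate for a video. It assumes frames are processed one at a time with no batching, and it ignores video decoding and network overhead. The frame sampling rate is a hypothetical parameter, since this page does not state how Mixpeek samples frames.

```typescript
// Latencies taken from the figures above (approximate values).
const GPU_LATENCY_MS = 35;  // per image, A100
const CPU_LATENCY_MS = 520; // per image

// Convert a per-image latency into sequential throughput.
function imagesPerSecond(latencyMs: number): number {
  return 1000 / latencyMs;
}

// Rough model time for a video, assuming one-at-a-time inference.
// sampledFps is a hypothetical sampling rate, not a documented Mixpeek setting.
function estimateSeconds(
  videoSeconds: number,
  sampledFps: number,
  latencyMs: number
): number {
  const frames = Math.ceil(videoSeconds * sampledFps);
  return (frames * latencyMs) / 1000;
}

// A 10-minute video sampled at 1 frame per second is 600 frames.
console.log(imagesPerSecond(GPU_LATENCY_MS).toFixed(1)); // 28.6
console.log(estimateSeconds(600, 1, GPU_LATENCY_MS));    // 21 seconds on GPU
console.log(estimateSeconds(600, 1, CPU_LATENCY_MS));    // 312 seconds on CPU
```

Batched inference would typically raise throughput above this sequential estimate, so treat the result as a conservative upper bound on model time only.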

    Specification

    Framework: HF
    Organization: microsoft
    Feature: Scene Captioning
    Output: text
    Modalities: video, image
    Retriever: Semantic Search
    Parameters: 777M
    License: MIT
    Downloads/mo: 1.3M
    Likes: 1,793

    Research Paper

    Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

    arxiv.org

    Build a pipeline with Florence-2-large

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
