    HF · Scene Captioning · Apache 2.0

    Step-3.7-Flash

    by stepfun-ai

    Apache-licensed multimodal MoE for image-text reasoning and fast visual QA

    Downloads: 9.3K/month
    Parameters: MoE
    Identifiers
    Model ID
    stepfun-ai/Step-3.7-Flash
    Feature URI
    mixpeek://image_extractor@v1/stepfun_step37_flash_v1

    Overview

    Step 3.7 Flash is a new multimodal Mixture-of-Experts (MoE) model from StepFun that takes image and text input and generates text. Its model card includes usage instructions for both Transformers and vLLM, which makes it practical for teams that want a self-deployable open VLM rather than an API-only model.

    On Mixpeek, Step 3.7 Flash is a candidate for scene captioning, visual question answering, screenshot analysis, and agent perception tasks where a single model needs to reason over images plus instructions.

    Architecture

    Step 3.7 Flash is a vision-language Mixture-of-Experts model that loads through custom Transformers code and can be served with vLLM. It accepts chat prompts that combine images and text, and it is released under the Apache 2.0 license.
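Because the model can be served with vLLM, one way to send it an image-text chat prompt is through vLLM's OpenAI-compatible chat endpoint. The sketch below builds such a request. It assumes the server was started with that API enabled and serves the model under its Hugging Face ID; the function names, the base URL argument, and the default instruction are illustrative and not taken from the model card.

```typescript
// Sketch only: builds an OpenAI-style chat request for a vLLM server.
// Assumption: the server exposes /v1/chat/completions and serves the
// model under the ID "stepfun-ai/Step-3.7-Flash".

type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

interface ChatRequest {
  model: string;
  messages: { role: "user"; content: ContentPart[] }[];
  max_tokens: number;
}

// Pure helper: one user turn containing an image followed by an instruction.
function buildCaptionRequest(
  imageUrl: string,
  instruction: string,
  maxTokens = 256
): ChatRequest {
  return {
    model: "stepfun-ai/Step-3.7-Flash",
    messages: [
      {
        role: "user",
        content: [
          { type: "image_url", image_url: { url: imageUrl } },
          { type: "text", text: instruction },
        ],
      },
    ],
    max_tokens: maxTokens,
  };
}

// Hypothetical caller: baseUrl is wherever your vLLM server listens.
async function captionImage(baseUrl: string, imageUrl: string): Promise<string> {
  const res = await fetch(`${baseUrl}/v1/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(
      buildCaptionRequest(imageUrl, "Describe this scene in one paragraph.")
    ),
  });
  if (!res.ok) throw new Error(`vLLM request failed: ${res.status}`);
  const data = (await res.json()) as {
    choices: { message: { content: string } }[];
  };
  return data.choices[0].message.content;
}
```

Keeping the request builder separate from the network call makes the prompt shape easy to inspect and unit-test without a running GPU server.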

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest one keyframe and caption it with Step-3.7-Flash.
    await mx.collections.ingest({
      collection_id: "media-library",
      source: { url: "https://example.com/keyframe.jpg" },
      feature_extractors: [{
        feature: "scene_caption",
        model: "stepfun-ai/Step-3.7-Flash"
      }]
    });

    Capabilities

    • Image-text-to-text generation
    • Vision-language reasoning over screenshots and natural images
    • vLLM serving support
    • Apache 2.0 license

    Use Cases on Mixpeek

    • Detailed scene descriptions for images and video keyframes
    • Agent visual QA over screenshots or camera frames
    • Extract structured observations from unstructured visual content
    • Second-pass reasoning over retrieved images
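For the structured-observation use case, a common pattern is to ask the model for JSON and then validate what comes back, since generated text may wrap the JSON in extra prose. The sketch below shows one possible validator. The `SceneObservation` shape, the prompt wording, and the function name are hypothetical choices for illustration; nothing here is a documented output format of the model or of Mixpeek.

```typescript
// Hypothetical schema for illustration; choose fields that fit your data.
interface SceneObservation {
  setting: string;
  objects: string[];
}

// Example instruction you might append to the image prompt (an assumption).
const OBSERVATION_PROMPT =
  'Return only JSON with keys "setting" (string) and "objects" (array of strings).';

// Extracts the outermost {...} span from raw model text and validates it.
// Returns null when no well-formed observation can be recovered.
function parseObservation(raw: string): SceneObservation | null {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) return null;

  let parsed: unknown;
  try {
    parsed = JSON.parse(raw.slice(start, end + 1));
  } catch {
    return null; // Not valid JSON.
  }
  if (typeof parsed !== "object" || parsed === null) return null;

  const { setting, objects } = parsed as Record<string, unknown>;
  if (typeof setting !== "string" || !Array.isArray(objects)) return null;
  if (!objects.every((o) => typeof o === "string")) return null;

  return { setting, objects: objects as string[] };
}
```

Returning `null` instead of throwing lets a pipeline skip or retry a frame whose output could not be parsed.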

    Performance

    Input Size: Image plus text prompt
    GPU Latency: Depends on vLLM configuration and output length
    GPU Throughput: Depends on vLLM configuration and output length
    GPU Memory: MoE deployment dependent

    Best used for reasoning or caption generation after cheaper retrieval stages have narrowed the candidate set.
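The "cheaper retrieval first" advice can be sketched as a two-stage flow: rank candidates with an inexpensive score, then run the captioning model only on the top few. The `Candidate` type, the `Captioner` callback, and both function names below are assumptions made for this example, not part of the Mixpeek SDK.

```typescript
// Hypothetical shape of a first-stage retrieval hit.
interface Candidate {
  id: string;
  imageUrl: string;
  score: number; // Higher means more relevant.
}

// Any function that turns an image URL into text, e.g. a call to the model.
type Captioner = (imageUrl: string) => Promise<string>;

// Stage 1 output filter: keep the k highest-scoring candidates.
function selectTopK(candidates: Candidate[], k: number): Candidate[] {
  // Copy before sorting so the caller's array is not mutated.
  return [...candidates].sort((a, b) => b.score - a.score).slice(0, k);
}

// Stage 2: run the expensive captioner only on the selected candidates.
async function captionTopK(
  candidates: Candidate[],
  k: number,
  caption: Captioner
): Promise<{ id: string; caption: string }[]> {
  const results: { id: string; caption: string }[] = [];
  for (const c of selectTopK(candidates, k)) {
    // Sequential on purpose: keeps load on a single GPU server predictable.
    results.push({ id: c.id, caption: await caption(c.imageUrl) });
  }
  return results;
}
```

The sequential loop is a conservative default; a deployment with spare serving capacity could batch or parallelize the second stage instead.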

    Specification

    Framework: HF
    Organization: stepfun-ai
    Feature: Scene Captioning
    Output: text
    Modalities: video, image
    Retriever: Semantic Search
    Parameters: MoE
    License: Apache 2.0
    Downloads/mo: 9.3K

    Research Paper

    Step 3.7 Flash model card

    arxiv.org

    Build a pipeline with Step-3.7-Flash

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
