Step-3.7-Flash
by stepfun-ai
Apache-licensed multimodal MoE for image-text reasoning and fast visual QA
stepfun-ai/Step-3.7-Flash
mixpeek://image_extractor@v1/stepfun_step37_flash_v1
Overview
Step 3.7 Flash is a new multimodal Mixture-of-Experts (MoE) model from StepFun with image-text-to-text support. Its model card ships with usage instructions for both Transformers and vLLM, which makes it more practical for teams that want a deployable open vision-language model (VLM) rather than an API-only model.
On Mixpeek, Step 3.7 Flash is a candidate for scene captioning, visual question answering, screenshot analysis, and agent perception tasks where a single model needs to reason over images plus instructions.
Architecture
A vision-language Mixture-of-Experts model exposed through custom Transformers code and vLLM. It accepts image-text chat prompts and is released under the Apache 2.0 license.
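Since the model is exposed through vLLM and accepts image-text chat prompts, a request can be pictured as one user message holding an image part and a text part. The sketch below builds such a payload in the OpenAI-style chat format that vLLM's OpenAI-compatible server accepts. The helper name `buildImageChatRequest`, the default `max_tokens` value, and the choice to place the image before the text are illustrative assumptions, not taken from the model card.

```typescript
// One content part of a multimodal chat message (OpenAI-style chat format).
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

interface ChatRequest {
  model: string;
  messages: { role: "user"; content: ContentPart[] }[];
  max_tokens: number;
}

// Hypothetical helper: builds a single-turn request pairing one image with one instruction.
function buildImageChatRequest(
  imageUrl: string,
  instruction: string,
  maxTokens: number = 256 // assumed default, tune for your workload
): ChatRequest {
  return {
    model: "stepfun-ai/Step-3.7-Flash",
    messages: [
      {
        role: "user",
        content: [
          { type: "image_url", image_url: { url: imageUrl } },
          { type: "text", text: instruction },
        ],
      },
    ],
    max_tokens: maxTokens,
  };
}

const request = buildImageChatRequest(
  "https://example.com/keyframe.jpg",
  "Describe the scene in one sentence."
);
console.log(JSON.stringify(request, null, 2));
```

Assuming a vLLM server is already serving the model, this payload would be sent as a JSON POST to its `/v1/chat/completions` route. The host, port, and any model-specific launch flags are deployment details not covered here.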
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "media-library",
  source: { url: "https://example.com/keyframe.jpg" },
  feature_extractors: [
    {
      feature: "scene_caption",
      model: "stepfun-ai/Step-3.7-Flash"
    }
  ]
});
Capabilities
- Image-text-to-text generation
- Vision-language reasoning over screenshots and natural images
- vLLM serving support
- Apache 2.0 license
Use Cases on Mixpeek
Performance
Best used for reasoning or caption generation on candidates that cheaper retrieval stages have already narrowed down.
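The staging advice above can be sketched as a two-step flow: a cheap retrieval stage scores many candidates, and only the top few are passed to the more expensive model. Everything in this sketch is hypothetical scaffolding (the `Candidate` shape, `selectTopK`, `captionTopK`, and the injected `caption` function). It does not use the Mixpeek SDK or any real model call.

```typescript
// A candidate produced by a cheaper retrieval stage (hypothetical shape).
interface Candidate {
  id: string;
  imageUrl: string;
  score: number; // higher means more relevant
}

// Stand-in for the expensive model call; inject your own implementation.
type Captioner = (imageUrl: string) => Promise<string>;

// Keep only the k highest-scoring candidates, without mutating the input.
function selectTopK(candidates: Candidate[], k: number): Candidate[] {
  return candidates
    .slice()
    .sort((a, b) => b.score - a.score)
    .slice(0, Math.max(0, k));
}

// Run the expensive captioner on the top-k candidates only.
async function captionTopK(
  candidates: Candidate[],
  k: number,
  caption: Captioner
): Promise<Record<string, string>> {
  const captions: Record<string, string> = {};
  for (const c of selectTopK(candidates, k)) {
    captions[c.id] = await caption(c.imageUrl);
  }
  return captions;
}
```

With this split, the cost of the vision-language model grows with `k` rather than with the size of the full collection. That is the reason for placing it after retrieval.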
Specification
Research Paper
Step 3.7 Flash model card
arxiv.org
Build a pipeline with Step-3.7-Flash
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.