
    MiniMax-M3

    by MiniMaxAI

    Agent-native MoE vision-language model with native video understanding at 1M context

    Downloads: 200K/month
    Parameters: 428B (23B active)
    Identifiers
    Model ID: MiniMaxAI/MiniMax-M3
    Feature URI: mixpeek://video_extractor@v1/minimax_m3_vl_v1
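The feature URI above appears to follow the shape `mixpeek://<extractor>@<version>/<feature>`. That layout is inferred from this single example, not from a published grammar, so the sketch below is only an illustration of how the parts could be split out. The `FeatureUri` type and `parseFeatureUri` helper are hypothetical names and are not part of the Mixpeek SDK.

```typescript
// Hypothetical helper: the URI layout is an assumption inferred from one
// example (mixpeek://<extractor>@<version>/<feature>), not a documented spec.
interface FeatureUri {
  extractor: string;
  version: string;
  feature: string;
}

function parseFeatureUri(uri: string): FeatureUri | null {
  // Capture the three segments; reject anything without the mixpeek:// scheme.
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) {
    return null;
  }
  const [, extractor = "", version = "", feature = ""] = match;
  return { extractor, version, feature };
}

const parsedUri = parseFeatureUri("mixpeek://video_extractor@v1/minimax_m3_vl_v1");
console.log(parsedUri);
```

Returning `null` on a mismatch, rather than throwing, keeps the helper easy to use when scanning mixed lists of identifiers.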

    Overview

    MiniMax-M3 is a sparse mixture-of-experts vision-language model with about 428B total parameters, of which roughly 23B are active per token. It was trained natively on text, images, and video from the start, rather than having vision added onto a text-only LLM. Its headline feature is MiniMax Sparse Attention (MSA), which cuts per-token attention compute to about 1/20 of dense attention and delivers 9x prefill and 15x decode speedups at a 1M-token context. This lets the model reason over long videos and multi-document sessions in a single pass.

    On Mixpeek, MiniMax-M3 is a strong scene-understanding extractor for video and image collections: it produces grounded descriptions, answers questions about frames, and drives agentic pipelines where an agent inspects footage, decides what matters, and stores the result as searchable metadata. Its long context makes it a fit for whole-clip understanding rather than isolated frames.
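To make "whole-clip understanding" concrete, the sketch below estimates how many sampled frames could fit in the 1M-token context. The 1M figure comes from this page; the tokens-per-frame value and the reserved prompt budget are placeholders chosen for illustration, not published numbers for MiniMax-M3, and `maxFramesInContext` is a hypothetical helper.

```typescript
// Rough frame-budget estimate for a long-context vision-language model.
function maxFramesInContext(
  contextTokens: number,
  tokensPerFrame: number,
  reservedTokens: number = 0,
): number {
  if (tokensPerFrame <= 0) {
    throw new RangeError("tokensPerFrame must be positive");
  }
  // Tokens left for frames after setting aside room for prompt and output.
  const available = Math.max(0, contextTokens - reservedTokens);
  return Math.floor(available / tokensPerFrame);
}

const CONTEXT_TOKENS = 1_000_000;      // from the model description
const ASSUMED_TOKENS_PER_FRAME = 256;  // ASSUMPTION: illustrative placeholder
const ASSUMED_RESERVED_TOKENS = 8_000; // ASSUMPTION: prompt + answer budget

const frameBudget = maxFramesInContext(
  CONTEXT_TOKENS,
  ASSUMED_TOKENS_PER_FRAME,
  ASSUMED_RESERVED_TOKENS,
);
// At one sampled frame per second, the budget converts directly to clip length.
const minutesAtOneFps = frameBudget / 60;
console.log(`Frame budget: ${frameBudget} (~${minutesAtOneFps.toFixed(1)} min at 1 fps)`);
```

Under these assumed values the budget covers roughly an hour of footage at one frame per second; the real figure depends on the model's actual per-frame token cost and on the sampling rate.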

    Architecture

    Sparse MoE transformer, ~428B total / ~23B active parameters, natively multimodal (text, image, video). MiniMax Sparse Attention (MSA) reduces attention compute and memory so the model sustains a 1M-token context with large prefill/decode speedups over dense attention. Custom modeling code; served via Transformers.
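As a quick worked example, the sparsity figures quoted above reduce to two small ratios. The snippet uses only numbers stated on this page (428B total, 23B active, attention at about 1/20 of dense); the constant names are illustrative.

```typescript
// Figures taken from the model description; the values are approximate.
const TOTAL_PARAMS_B = 428;
const ACTIVE_PARAMS_B = 23;
const MSA_ATTENTION_RATIO = 1 / 20;

// Share of the parameters that participate in computing any single token.
const activeFraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B;
const activePercent = `${(activeFraction * 100).toFixed(1)}%`;
const attentionPercent = `${(MSA_ATTENTION_RATIO * 100).toFixed(0)}%`;

console.log(`Active parameters per token: ${activePercent}`);  // 5.4%
console.log(`Attention compute vs. dense: ${attentionPercent}`); // 5%
```

Both ratios land near 5%, which is why the model can carry a large total parameter count and a 1M-token context while keeping per-token cost closer to that of a much smaller dense model.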

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";
    
    const mx = new Mixpeek({ apiKey: "API_KEY" });
    
    // Managed: create a collection over a bucket; Mixpeek runs this model's extractor
    const collection = await mx.collections.create({
      namespace_id: "my-namespace",
      collection_name: "my-collection",
      source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
      feature_extractor: {
        feature_extractor_name: "scene_description",
        version: "v1",
        parameters: { model_id: "MiniMaxAI/MiniMax-M3" },
      },
    });
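When the same request is created for several namespaces or buckets, it can help to build the payload in one typed place. The sketch below mirrors the field names from the example above; the `CollectionRequest` interface and `buildCollectionRequest` helper are hypothetical, written for illustration, and are not types exported by the Mixpeek SDK.

```typescript
// Payload shape copied from the example above; NOT a verified SDK type.
interface CollectionRequest {
  namespace_id: string;
  collection_name: string;
  source: { type: "bucket"; bucket_ids: string[] };
  feature_extractor: {
    feature_extractor_name: string;
    version: string;
    parameters: { model_id: string };
  };
}

// Hypothetical helper that fills in the extractor settings shown above.
function buildCollectionRequest(
  namespaceId: string,
  collectionName: string,
  bucketIds: string[],
  modelId: string = "MiniMaxAI/MiniMax-M3",
): CollectionRequest {
  if (bucketIds.length === 0) {
    throw new Error("at least one bucket id is required");
  }
  return {
    namespace_id: namespaceId,
    collection_name: collectionName,
    source: { type: "bucket", bucket_ids: [...bucketIds] },
    feature_extractor: {
      feature_extractor_name: "scene_description",
      version: "v1",
      parameters: { model_id: modelId },
    },
  };
}

const collectionRequest = buildCollectionRequest(
  "my-namespace",
  "my-collection",
  ["bkt_your_bucket"],
);
console.log(JSON.stringify(collectionRequest, null, 2));
```

The resulting object has the same shape as the literal passed to `mx.collections.create` in the example above, so it could be passed in its place, assuming the SDK accepts that shape as shown.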

    Capabilities

    • Native video understanding (85.4% on Video-MME-v2)
    • 1M-token context for whole-clip and multi-document reasoning
    • Long-horizon agentic and tool-use tasks
    • Grounded image and frame question answering

    Use Cases on Mixpeek

    • Whole-clip video understanding and captioning for search metadata
    • Agent inspection of footage where the query decides what to extract
    • Long-context reasoning over multi-page or multi-clip evidence
    • Scene descriptions that feed downstream embedding and retrieval

    Specification

    Framework: HF
    Organization: MiniMaxAI
    Feature: Scene Captioning
    Output: text
    Modalities: video, image
    Retriever: Semantic Search
    Parameters: 428B (23B active)
    License: MiniMax Community License
    Downloads/mo: 200K

    Research Paper

    MiniMax-M3

    arxiv.org

    Build a pipeline with MiniMax-M3

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
