MiniMax-M3
by MiniMaxAI
Agent-native MoE vision-language model with native video understanding at 1M context
MiniMaxAI/MiniMax-M3
mixpeek://video_extractor@v1/minimax_m3_vl_v1
Overview
MiniMax-M3 is a sparse mixture-of-experts vision-language model with about 428B total parameters, of which roughly 23B are active per token. It was trained natively on text, images, and video from the start, rather than having vision bolted onto a text LLM. Its headline trick is MiniMax Sparse Attention (MSA), which cuts per-token attention compute to about 1/20 of dense attention and delivers 9x prefill and 15x decode speedups over dense attention at a 1M-token context, so it can reason over long videos and multi-document sessions in one pass.
On Mixpeek, MiniMax-M3 is a strong scene-understanding extractor for video and image collections: it produces grounded descriptions, answers questions about frames, and drives agentic pipelines where an agent inspects footage, decides what matters, and stores the result as searchable metadata. Its long context makes it a fit for whole-clip understanding rather than isolated frames.
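As a rough planning aid for whole-clip understanding, the sketch below estimates how many sampled frames fit into a context window and how sparsely a clip would need to be sampled. The 1M-token window comes from the description above. The tokens-per-frame cost and the reserved prompt budget are hypothetical placeholders, since this page does not say how MiniMax-M3 tokenizes video, and the function names are illustrative only.

```typescript
interface FrameBudgetInput {
  contextTokens: number;  // model context window, e.g. 1000000
  tokensPerFrame: number; // ASSUMPTION: per-frame token cost is not documented here
  reservedTokens: number; // ASSUMPTION: room kept for the prompt and the answer
}

// Largest whole number of frames that fits in the remaining context.
function maxFrames(input: FrameBudgetInput): number {
  const { contextTokens, tokensPerFrame, reservedTokens } = input;
  if (tokensPerFrame <= 0) {
    throw new RangeError("tokensPerFrame must be positive");
  }
  const available = contextTokens - reservedTokens;
  return available <= 0 ? 0 : Math.floor(available / tokensPerFrame);
}

// Seconds between sampled frames needed to cover a clip with a given frame count.
function samplingIntervalSeconds(clipSeconds: number, frames: number): number {
  if (frames <= 0) {
    throw new RangeError("frames must be positive");
  }
  return clipSeconds / frames;
}

// Placeholder inputs: 256 tokens per frame, 16000 tokens reserved.
const frames = maxFrames({
  contextTokens: 1000000,
  tokensPerFrame: 256,
  reservedTokens: 16000,
});
console.log(frames); // 3843

// A two-hour clip spread across that many frames.
console.log(samplingIntervalSeconds(2 * 60 * 60, frames).toFixed(2));
```

Swap in the real per-frame token cost once it is known; the estimate is only as good as that number.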
Architecture
Sparse MoE transformer, ~428B total / ~23B active parameters, natively multimodal (text, image, video). MiniMax Sparse Attention (MSA) reduces attention compute and memory so the model sustains a 1M-token context with large prefill/decode speedups over dense attention. Custom modeling code; served via Transformers.
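To put these figures in proportion, here is a back-of-the-envelope sketch rather than a measurement. It uses only the approximate numbers quoted on this page (428B total, 23B active, about 1/20 attention compute); the constant and function names are illustrative and not part of any MiniMax or Mixpeek API.

```typescript
// Approximate figures quoted on this page; treat them as rough inputs.
const TOTAL_PARAMS_B = 428;
const ACTIVE_PARAMS_B = 23;
const MSA_ATTENTION_RATIO = 1 / 20;

// Fraction of the model's parameters used for any single token.
function activeFraction(totalB: number, activeB: number): number {
  if (totalB <= 0 || activeB < 0 || activeB > totalB) {
    throw new RangeError("expected totalB > 0 and 0 <= activeB <= totalB");
  }
  return activeB / totalB;
}

// Attention cost under a sparsity ratio, relative to a given dense cost.
function sparseAttentionCost(denseCost: number, ratio: number): number {
  return denseCost * ratio;
}

const fraction = activeFraction(TOTAL_PARAMS_B, ACTIVE_PARAMS_B);
console.log(`active per token: ${(fraction * 100).toFixed(1)}% of parameters`); // 5.4%

const relativeAttention = sparseAttentionCost(1, MSA_ATTENTION_RATIO);
console.log(`attention compute vs dense: ${(relativeAttention * 100).toFixed(0)}%`); // 5%
```

The quoted 9x prefill and 15x decode speedups are reported figures; they do not follow directly from the 1/20 ratio, because attention is only one part of end-to-end inference cost.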
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "scene_description",
    version: "v1",
    parameters: { model_id: "MiniMaxAI/MiniMax-M3" },
  },
});
Capabilities
- Native video understanding (85.4% on Video-MME-v2)
- 1M-token context for whole-clip and multi-document reasoning
- Long-horizon agentic and tool-use tasks
- Grounded image and frame question answering
Research Paper
MiniMax-M3
arxiv.org
Build a pipeline with MiniMax-M3
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
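When several collections in a pipeline use this model, the request from the SDK Integration section can be factored into a small helper. The sketch below is a hypothetical convenience wrapper: the field names and values are copied from that example, while the interface and function names are made up for illustration and are not official Mixpeek types. It only builds the payload and makes no network calls.

```typescript
// Shape copied from the SDK example on this page; NOT an official Mixpeek type.
interface SceneDescriptionCollectionRequest {
  namespace_id: string;
  collection_name: string;
  source: { type: "bucket"; bucket_ids: string[] };
  feature_extractor: {
    feature_extractor_name: string;
    version: string;
    parameters: { model_id: string };
  };
}

// Hypothetical helper: fills in the MiniMax-M3 extractor settings shown above.
function buildMiniMaxCollectionRequest(
  namespaceId: string,
  collectionName: string,
  bucketIds: string[],
): SceneDescriptionCollectionRequest {
  if (bucketIds.length === 0) {
    throw new Error("at least one bucket id is required");
  }
  return {
    namespace_id: namespaceId,
    collection_name: collectionName,
    source: { type: "bucket", bucket_ids: [...bucketIds] }, // copy to avoid aliasing
    feature_extractor: {
      feature_extractor_name: "scene_description",
      version: "v1",
      parameters: { model_id: "MiniMaxAI/MiniMax-M3" },
    },
  };
}

const request = buildMiniMaxCollectionRequest("my-namespace", "my-collection", [
  "bkt_your_bucket",
]);
console.log(JSON.stringify(request, null, 2));
```

The resulting object can be passed to `mx.collections.create(request)` exactly as in the SDK example.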