MiniMax M3 decodes 1 million tokens 15x faster – and it shouldn't be this cheap

MiniMax M3 Decodes 1 Million Tokens 15x Faster – And It Shouldn’t Be This Cheap

On June 1, a laboratory in Shanghai unveiled a groundbreaking model, MiniMax M3, capable of decoding a context of one million tokens at a speed 15.6x faster than its predecessor. Shockingly, it achieves this at just 8% of what Claude Opus costs. Over two days of exploration via the API, it became clear that the real innovation lies in its attention mechanism, not just its impressive speed. This article delves into the architecture behind MiniMax M3, offering an insightful look at the technology driving this model.

The Breakthrough: MiniMax Sparse Attention (MSA)

What sets MiniMax apart isn’t just its speed or performance on standard benchmarks like SWE-Bench, but its innovative model architecture: MiniMax Sparse Attention (MSA). As explained by Chew Loong Nian, a seasoned AI engineer, standard attention mechanisms become prohibitively expensive with long contexts. MSA tackles this by incorporating a lightweight index branch on top of bulk query attention, which selects query-relevant key-value (KV) cache blocks. This method focuses attention only on the selected blocks, using uncompressed key values and optimizing GPU memory access through a “KV external collect Q” model.

Comparison with Other Attention Approaches

MSA stands out when compared to other methods like DeepSeek’s Latent Attention (MLA) and Native Sparse Attention (NSA). These comparisons highlight MSA’s efficiency and effectiveness in handling large token contexts. However, it is important to note that the reported benchmarks are vendor-supplied, and independent testing was not feasible at launch due to the lack of published weights. While M3 excels in coding, it falls short in multimodal grounding and hallucination-related performance.

The Economic Advantage

One of MiniMax M3’s most compelling features is its cost-effectiveness. The low entry and exit costs per million tokens make it economically viable for long-context agent workflows. This affordability opens up new possibilities for industries relying on extensive data processing, making M3 a unique product category despite some uncertainties regarding benchmark independence and open weights.

Getting Started with MiniMax M3

For those eager to explore MiniMax M3, there are quick-start tips available for using the model via OpenRouter or MiniMax’s API. Conducting practical tests to observe behavior in long contexts can provide valuable insights into its capabilities and potential applications.

In conclusion, while MiniMax M3 may not be the smartest model overall, its affordability and efficiency in handling large token contexts position it as a game-changer in the industry. Its economic viability paves the way for new applications, despite the need for further independent benchmarking and the release of open weights.

Read the full blog for free on Medium.

Published via Toward AI

We are Building Enterprise-Grade AI. We Will Also Teach You How to Master It.

15 engineers. More than 100,000 students. Towards AI Academy teaches what actually survives production.

Start for free – no obligation:

→ 6-Day Agentic AI Engineering Email Guide — One Practical Lesson Per Day

→ Agents Architecture Cheatsheet — 3 years of architectural decisions in 6 pages

Our courses:

→ AI Engineering Certification — 90+ lessons from project selection to deployed product. The most comprehensive practical LLM course available.

→ Agent Engineering Course — Hands-on with production agent architectures, memory, routing, and evaluation frameworks — built from real enterprise engagements.

→ AI for Work — Understand, evaluate and apply AI for complex work tasks.

Note: The content of the article contains the views of the contributing authors and not of Towards AI.

For further details and insights, visit the original source Here.

“`

Lab-free testing moves closer to home as German Intu Diagnostics raises €1.1 million

VIPCOO H2 electric dirt bike review: A small bike with big performance

We know how to build smarter robots. Now we need to learn smarter ways to test them

The simple premise of this puzzle game hides surprising depth

MiniMax M3 decodes 1 million tokens 15x faster – and it shouldn’t be this cheap

MiniMax M3 Decodes 1 Million Tokens 15x Faster – And It Shouldn’t Be This Cheap

The Breakthrough: MiniMax Sparse Attention (MSA)

Comparison with Other Attention Approaches

The Economic Advantage

Getting Started with MiniMax M3

We are Building Enterprise-Grade AI. We Will Also Teach You How to Master It.

Lab-free testing moves closer to home as German Intu Diagnostics raises €1.1 million

VIPCOO H2 electric dirt bike review: A small bike with big performance

We know how to build smarter robots. Now we need to learn smarter ways to test them

The simple premise of this puzzle game hides surprising depth

Every Prime Day 2026 deal we’ve covered, from the Pixel 10 to Dyson’s new V16

Acceleration of Gemini Nano models on Pixel with frozen multi-token prediction

How Cara is pioneering domain-specific AI for enterprise insurance brokerages with AWS

Use Gemini to create Google Sheets

I deleted all the static Claude API keys I had. Here’s the keyless migration, vendor by vendor.

Thinking to remember: how reasoning unlocks parametric knowledge in LLMs

LEAVE A REPLY Cancel reply

Useful Links

Latest News

VIPCOO H2 electric dirt bike review: A small bike with big performance

We know how to build smarter robots. Now we need to learn smarter ways to test them

The simple premise of this puzzle game hides surprising depth

Our Newsletter