MiniMax M2.5 - The first open model to beat Sonnet, built for productivity
Introducing M2.5, an open-source frontier model designed for real-world productivity. SOTA performance on coding (SWE-Bench Verified 80.2%), search (BrowseComp 76.3%), agentic tool-calling (BFCL 76.8%), and office work. Optimized for efficient execution: 37% faster on complex tasks. At $1 per hour with 100 tps, infinite scaling of long-horizon agents is now economically possible.
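For the curious, the pricing claim works out to roughly $2.78 per million tokens. A back-of-the-envelope sketch, assuming the quoted 100 tps is sustained, billable output throughput (real agent runs mix input and output tokens, so treat this as a rough bound):

```python
# Rough cost math for the "$1/hour at 100 tps" claim.
tps = 100                    # claimed tokens per second
cost_per_hour = 1.00         # claimed USD per hour

tokens_per_hour = tps * 3600                              # 360,000 tokens
cost_per_million = cost_per_hour / tokens_per_hour * 1e6  # USD per 1M tokens

print(f"{tokens_per_hour:,} tokens/hour -> ${cost_per_million:.2f} per 1M tokens")
# 360,000 tokens/hour -> $2.78 per 1M tokens
```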


Replies
Big news for open models: MiniMax-M2.5 is out with SOTA performance on coding (SWE-Bench Verified 80.2%). The first open model to beat Sonnet. Only Opus from @Claude by Anthropic and @OpenAI's GPT-5.2 Codex score higher.
The paths of open and proprietary models are converging...
Pro tip: If you want to quickly experiment with it, @MiniMax-M2.5 is free for a week on @Kilo Code (until Thursday, Feb 19).
OSS ftw!
@fmerian How do you define “productivity” in the context of an AI model? How should users expect the model to change daily workflows?
80%+ on SWE-Bench Verified for an open model is wild — especially if it’s actually usable in real workflows and not just benchmark-flexing. Curious how it holds up on messy, legacy codebases vs clean benchmark repos?
Awesome!
is it available for opencode yet?
apparently! see pricing: https://opencode.ai/docs/zen/#pricing
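If you'd rather script against it directly than go through an editor integration, any OpenAI-compatible host should work. A minimal sketch (the base_url and model ID below are placeholders; check your provider's docs for the real values):

```python
# Minimal sketch: calling MiniMax-M2.5 through an OpenAI-compatible endpoint.
# base_url and model are hypothetical placeholders, not confirmed values.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example.com/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",
)

resp = client.chat.completions.create(
    model="minimax-m2.5",  # assumed model ID; check provider docs
    messages=[{"role": "user", "content": "Write a unit test for a debounce helper."}],
)
print(resp.choices[0].message.content)
```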
The claim of beating Sonnet on SWE-Bench is bold for an open model! :o How does the context window size compare to Sonnet when handling large codebases?
Impressive benchmarks, especially on SWE-Bench and tool-calling.
I’m curious though: in real-world workflows, where does M2.5 feel meaningfully different from existing frontier models?
For example, does the 37% speed gain translate into noticeably better agent reliability on longer tasks, or is it mostly execution time?
Would love to understand where it actually changes day-to-day usage.
That SWE-Bench score is wild for an open model. I've been running Sonnet for most of my coding workflows, and honestly the cost adds up fast when you're doing long agentic runs. $1/hr at 100 tps would be a game changer if the quality holds up in practice. Curious - how does it handle multi-file refactors? That's where I see most models fall apart; they lose context across files even when the benchmarks look great.