Launching today

Mistral Medium 3.5
A 128B model for coding, reasoning, and long tasks
Mistral Medium 3.5 is a 128B dense model merging coding, reasoning, and instruction-following in one set of weights. 256k context window, configurable reasoning effort. Open weights on Hugging Face for engineers and teams running self-hosted inference.

Mistral just shipped its most capable model yet, and it runs self-hosted on four GPUs.
What it is: Mistral Medium 3.5 is a 128B dense model that merges instruction-following, reasoning, and coding into a single set of weights, with a 256k context window and configurable reasoning effort per request.
Most frontier-class models either require massive infrastructure to self-host or lock you into proprietary APIs.
Mistral Medium 3.5 sits in an interesting position: it scores 77.6% on SWE-Bench Verified, ahead of models like Qwen3.5 397B A17B, while running on as few as four GPUs.
Reasoning effort is configurable per call, so you aren't paying or waiting for deep reasoning on a simple reply, while the same model can still handle a multi-step agentic run.
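To make that concrete, here is a minimal sketch of what per-call effort control could look like against Mistral's chat completions endpoint. The `reasoning_effort` field, its accepted values, and the `mistral-medium-3.5` model identifier are assumptions for illustration, not confirmed API surface; check Mistral's API docs for the real parameter names.

```python
import os
import requests

API_URL = "https://api.mistral.ai/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"}

def ask(prompt: str, effort: str = "low") -> str:
    """Send one chat turn, dialing reasoning effort up or down per request.

    NOTE: `reasoning_effort` and its values ("low"/"high") are assumed
    here for illustration; the real field name may differ.
    """
    payload = {
        "model": "mistral-medium-3.5",  # assumed model identifier
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,     # hypothetical per-request knob
    }
    resp = requests.post(API_URL, headers=HEADERS, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Cheap, fast path for a simple reply...
print(ask("Summarize this changelog in one line."))
# ...and deeper reasoning only where the task warrants it.
print(ask("Plan a multi-step refactor of the auth module.", effort="high"))
```

The point of the design: one deployment, one model name, and the cost/latency dial lives in the request rather than in your routing logic.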
What makes it different: This is Mistral's first "merged" flagship model, meaning instruction-following, reasoning, and coding live in one set of weights rather than being split across specialised variants.
The open weights are released under a modified MIT license on Hugging Face, and it's already the default model in both Mistral Vibe and Le Chat.
The vision encoder was trained from scratch to handle variable image sizes and aspect ratios.
Key features:
128B dense model, 256k context window
Configurable reasoning effort per request
77.6% on SWE-Bench Verified
Open weights on Hugging Face under a modified MIT license
Self-hostable on 4 GPUs (see the vLLM sketch after this list)
API at $1.50/M input tokens and $7.50/M output tokens
Powers Vibe remote coding agents and Le Chat Work mode (Pro/Team/Enterprise plans)
Available on NVIDIA's build.nvidia.com and as a NIM container
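For the self-hosting path, a minimal sketch using vLLM's offline inference API, sharding the model across four GPUs with tensor parallelism. The Hugging Face repo id is an assumption (substitute the actual one from the release), and the context length is capped well below the full 256k to keep the KV cache within memory on a typical 4x80GB node.

```python
from vllm import LLM, SamplingParams

# Assumed Hugging Face repo id; replace with the actual repo from the release.
MODEL_ID = "mistralai/Mistral-Medium-3.5"

# Shard the 128B dense model across 4 GPUs with tensor parallelism.
# max_model_len is set far below the 256k maximum here; raise it only
# if your hardware has the KV-cache headroom.
llm = LLM(
    model=MODEL_ID,
    tensor_parallel_size=4,
    max_model_len=32768,
)

params = SamplingParams(temperature=0.2, max_tokens=512)
outputs = llm.generate(
    ["Write a Python function that merges two sorted lists."],
    params,
)
print(outputs[0].outputs[0].text)
```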
Benefits:
Run a frontier-class model on your own infrastructure without a large GPU cluster
Tune reasoning depth at the API level, useful for cost-sensitive agentic pipelines
Single model handles the full range from quick chat replies to long-horizon coding tasks
Open weights mean fine-tuning, auditing, and on-prem deployment are all on the table (loading sketch below)
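Because the weights are open, the standard Hugging Face tooling applies. A minimal sketch of pulling the model down for inspection or as a fine-tuning starting point, again assuming the repo id above; a 128B dense model in bf16 still needs multiple GPUs even just to load.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-Medium-3.5"  # assumed repo id from the release

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # half the memory of fp32; still multi-GPU at 128B
    device_map="auto",           # let accelerate shard layers across available GPUs
)

# From here the usual open-weights workflows apply: inspect the config and
# parameters for auditing, or attach LoRA adapters (e.g. via peft) to fine-tune.
print(model.config)
```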
Who it's for: Backend and ML engineers evaluating open-weight alternatives to proprietary frontier models for agentic pipelines, coding tools, or self-hosted inference.
The interesting design choice here is the merged weights architecture.
Most labs at this capability tier still ship separate reasoning and instruction models.
Collapsing them with configurable effort per call is a practical tradeoff that's worth watching as other labs respond.