Zac Zuo

Qwen3.5 - The 397B native multimodal agent with 17B active params

An open-weight, native vision-language model built for long-horizon agentic tasks. Its hybrid architecture (linear attention + MoE) delivers the capabilities of a 397B giant with the inference speed of a 17B model.

Zac Zuo

Hi everyone!

Qwen3.5 is here. It is a native vision-language model with a massive 397B parameter count.

Because it is built on the Qwen3-Next architecture (linear attention + MoE), only 17B parameters are active per forward pass. This hits a specific sweet spot: you get the reasoning depth of a giant model with the inference latency of a much smaller one.
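As a rough back-of-the-envelope check on those figures, here is a minimal sketch. It uses only the 397B and 17B numbers from the post; the bytes-per-parameter values (2 for BF16, 1 for FP8) are the standard sizes for those formats, and the footprint deliberately ignores KV cache and activations.

```python
TOTAL_PARAMS = 397e9   # total parameters, as stated in the post
ACTIVE_PARAMS = 17e9   # parameters active per forward pass, as stated in the post


def active_fraction(total: float, active: float) -> float:
    """Share of the weights that take part in a single forward pass."""
    return active / total


def weight_footprint_gb(params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in GB (10^9 bytes)."""
    return params * bytes_per_param / 1e9


if __name__ == "__main__":
    print(f"active fraction: {active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS):.1%}")
    print(f"BF16 weights:    {weight_footprint_gb(TOTAL_PARAMS, 2):.0f} GB")
    print(f"FP8 weights:     {weight_footprint_gb(TOTAL_PARAMS, 1):.0f} GB")
```

Compute per token tracks the 17B active parameters, but all 397B weights still have to be resident in memory, so the footprint is that of the full model.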

In practice, this efficiency matters most for agents.

It is natively multimodal, with no glued-on vision adapters, and it shows outstanding results on agentic tasks. That means it can handle complex workflows without burning through tokens.

Apache 2.0 and ready for vLLM/SGLang out of the box!

Vadim Ermolin

Congrats @zaczuo !

Excited to test it against agentic workflows. I'm a fan of Qwen3; it has always been a rock-solid choice as a local model.

Piroune Balachandran

Linear attention keeping latency flat across long tool-call chains is the part that actually matters for agents. Standard transformers get brutal once you're 50+ steps into a workflow with accumulated context. 17B active params on a 397B base with vLLM support out of the box makes self-hosting realistic too.
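To make the scaling argument concrete, here is a toy cost model, not a benchmark. It assumes that decoding one token with full attention costs work proportional to the number of cached tokens, while a linear-attention layer reads and updates a fixed-size state. The `TOKENS_PER_STEP` value is a hypothetical figure for how much context each tool call adds, and the cost units are arbitrary.

```python
def full_attention_decode_cost(context_len: int) -> int:
    """Relative per-token cost: the new query attends to every cached token."""
    return context_len


def linear_attention_decode_cost(context_len: int, state_size: int = 1) -> int:
    """Relative per-token cost: fixed-size state, independent of context length."""
    return state_size


TOKENS_PER_STEP = 2_000  # hypothetical context added by each tool-call step

if __name__ == "__main__":
    for step in (1, 10, 50):
        context_len = step * TOKENS_PER_STEP
        print(
            f"step {step:>2}: context={context_len:>7} "
            f"full={full_attention_decode_cost(context_len):>7} "
            f"linear={linear_attention_decode_cost(context_len)}"
        )
```

In a hybrid model only some layers behave like the second function, so real latency would sit somewhere between the two curves.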

Ryan Thill

Serving a 397B MoE native multimodal model for long-horizon agents will bottleneck on KV-cache growth and multimodal prefill latency, and expert-routing variance can reduce batching efficiency at high throughput. Best practice: run it under vLLM or SGLang with continuous batching plus paged KV cache, add aggressive prompt and image embedding caching, and lean on FP8 where supported to keep cost predictable. Question: what max context length are you targeting for Qwen3.5 in production, and how stable is expert routing under long tool-using trajectories when served via vLLM or SGLang?
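The image embedding caching idea can be illustrated with a small standalone sketch: an LRU cache keyed by a hash of the raw image bytes, so a screenshot that reappears across agent steps is encoded only once. This is an assumption about how one might do it at the application layer, not a description of what vLLM or SGLang do internally; `EmbeddingCache` and `fake_encoder` are hypothetical names, and the encoder is a stand-in for a real vision encoder.

```python
import hashlib
from collections import OrderedDict
from typing import Callable, List


class EmbeddingCache:
    """LRU cache keyed by the SHA-256 digest of the raw image bytes."""

    def __init__(self, encode: Callable[[bytes], List[float]], max_items: int = 256):
        self._encode = encode
        self._max_items = max_items
        self._store: "OrderedDict[str, List[float]]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, image_bytes: bytes) -> List[float]:
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._store:
            self.hits += 1
            self._store.move_to_end(key)  # mark as most recently used
            return self._store[key]
        self.misses += 1
        embedding = self._encode(image_bytes)
        self._store[key] = embedding
        if len(self._store) > self._max_items:
            self._store.popitem(last=False)  # evict the least recently used entry
        return embedding


def fake_encoder(image_bytes: bytes) -> List[float]:
    """Hypothetical stand-in for an expensive vision encoder."""
    return [float(len(image_bytes))]


if __name__ == "__main__":
    cache = EmbeddingCache(fake_encoder, max_items=2)
    cache.get(b"screenshot-1")
    cache.get(b"screenshot-1")  # second lookup is served from the cache
    print(f"hits={cache.hits} misses={cache.misses}")
```

Hashing the bytes rather than a filename means identical images from different tool calls share one entry.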

Piroune Balachandran

@ryan_thill How does Qwen3.5's 3:1 ratio of linear attention to full attention layers hold up when tool calls return wildly different payload sizes? 397B params with only 17B active keeps inference fast, but uneven chunk lengths from mixed tool outputs could still spike memory on those full attention layers even if the linear ones stay flat.
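One way to reason about this is a rough KV-cache estimate in which only the full-attention layers store keys and values. The sketch below assumes the 3:1 linear-to-full ratio mentioned above and treats the linear layers' fixed-size state as negligible. The layer count, KV head count, and head dimension are hypothetical placeholders, not Qwen3.5's published configuration.

```python
def full_attention_layers(num_layers: int, linear_per_full: int = 3) -> int:
    """Number of full-attention layers given a linear:full ratio of linear_per_full:1."""
    return num_layers // (linear_per_full + 1)


def kv_cache_bytes(
    seq_len: int,
    full_layers: int,
    kv_heads: int,
    head_dim: int,
    bytes_per_elem: int = 2,  # 2 bytes for BF16/FP16, 1 for FP8
) -> int:
    """KV-cache size for one sequence: keys and values for every full-attention layer."""
    return 2 * full_layers * kv_heads * head_dim * seq_len * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical configuration, for illustration only.
    NUM_LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128
    hybrid_layers = full_attention_layers(NUM_LAYERS)
    for seq_len in (8_000, 100_000):
        hybrid = kv_cache_bytes(seq_len, hybrid_layers, KV_HEADS, HEAD_DIM)
        dense = kv_cache_bytes(seq_len, NUM_LAYERS, KV_HEADS, HEAD_DIM)
        print(f"{seq_len:>7} tokens: hybrid={hybrid / 1e9:.2f} GB, all-full={dense / 1e9:.2f} GB")
```

Under these assumptions the cache is a quarter of the all-full-attention case, but it still grows linearly with sequence length, so a large tool payload raises memory on the full-attention layers in proportion to its token count.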

Go Sakioka

397B with only 17B active params is impressive efficiency. The hybrid linear attention + MoE approach seems like the right direction for long-horizon agentic tasks. As someone building a vision AI app for pet health, I'm always watching open-weight multimodal models closely — excited to benchmark this against our current pipeline. Congrats on the release!

Bhavin Sheth

The 17B active params with that level of capability is impressive — efficiency like this is what actually makes real-world agent use practical, not just demos.