Qwen3.5 - The 397B native multimodal agent with 17B active params

Flowtica Scribe

•23d ago

An open-weight, native vision-language model built for long-horizon agentic tasks. Its hybrid architecture (linear attention + MoE) delivers the capabilities of a 397B giant with the inference speed of a 17B model.

Replies

Best

Flowtica Scribe

Hunter

📌

Hi everyone!

Qwen3.5 is here. It is a native vision-language model with a massive 397B parameter count.

Built on the Qwen3-Next architecture (Linear Attention + MoE), only 17B parameters are active per forward pass. This hits a specific sweet spot: you get the reasoning depth of a giant model with the inference latency of a much smaller one.

For applications, this efficiency is key for agents.

It is natively multimodal with no glued-on vision adapters, demonstrating outstanding results on agentic tasks. This means handling complex workflows without burning through tokens.

Apache 2.0 and ready for vLLM/SGLang out of the box!

Report

24d ago

Fluent

Congrats @zaczuo !

Excited to test it against agentic workflows. Being a fan of Qwen3 – always a rock solid choice as a local model.

Report

23d ago

Linear attention keeping latency flat across long tool-call chains is the part that actually matters for agents. Standard transformers get brutal once you're 50+ steps into a workflow with accumulated context. 17B active params on a 397B base with vLLM support out of the box makes self-hosting realistic too.

Report

23d ago

Serving a 397B MoE native multimodal model for long-horizon agents will bottleneck on KV-cache growth and multimodal prefill latency, and expert-routing variance can reduce batching efficiency at high throughput. Best practice: run it under vLLM or SGLang with continuous batching plus paged KV cache, add aggressive prompt and image embedding caching, and lean on FP8 where supported to keep cost predictable. :contentReference[oaicite:0]{index=0} Question: what max context length are you targeting for Qwen3.5 in production and how stable is expert routing under long tool-using trajectories when served via vLLM or SGLang?

Report

23d ago

@ryan_thill How does Qwen3.5's 3:1 ratio of linear attention to full attention layers hold up when tool calls return wildly different payload sizes? 397B params with only 17B active keeps inference fast, but uneven chunk lengths from mixed tool outputs could still spike memory on those full attention layers even if the linear ones stay flat.

Report

22d ago

397B with only 17B active params is impressive efficiency. The hybrid linear attention + MoE approach seems like the right direction for long-horizon agentic tasks. As someone building a vision AI app for pet health, I'm always watching open-weight multimodal models closely — excited to benchmark this against our current pipeline. Congrats on the release!

Report

22d ago

The 17B active params with that level of capability is impressive — efficiency like this is what actually makes real-world agent use practical, not just demos.

Report

22d ago