Zac Zuo

Phi-4-reasoning-vision - Open-weight 15B multimodal model for thinking and GUI agents

Phi-4-reasoning-vision-15B is a compact open-weight multimodal model built on a mid-fusion architecture. By balancing fast direct perception with deep chain-of-thought, it makes building capable computer-use agents and solving complex math problems highly efficient.


Replies

Zac Zuo

Hi everyone!

Phi-4-Reasoning-Vision-15B is Microsoft's new 15B open-weight model that makes multimodal reasoning feel much more efficient.

It was trained on 200B multimodal tokens, handles high-res screens well, and stays direct on simpler tasks while switching into deeper reasoning when needed.

Looks especially strong for math, science, and computer-use agents. Weights on HF.
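The "stays direct on simpler tasks, switches into deeper reasoning when needed" behavior is essentially a routing decision. A minimal toy sketch of that idea in Python (purely illustrative: the real model handles this internally, and the `choose_mode` name and word-count heuristic are placeholders of mine, not anything documented on the model card):

```python
def choose_mode(prompt: str, threshold: int = 40) -> str:
    """Toy router: send short perception-style queries down a fast 'direct'
    path, and longer, more complex prompts down a slower 'think' path.

    The word-count heuristic is a stand-in for whatever complexity signal
    the model actually learns; it just illustrates the latency tradeoff.
    """
    return "direct" if len(prompt.split()) < threshold else "think"


# A short GUI-perception query goes direct; a long multi-step task thinks.
print(choose_mode("what color is the submit button"))   # direct
print(choose_mode("plan the steps to " + "x " * 50))    # think
```

For latency-sensitive agent loops like the ones discussed below, this kind of routing is what keeps simple screen-reading steps cheap while reserving chain-of-thought for genuinely hard sub-tasks.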

Whetlan

@zaczuo 15B with mid-fusion is a sweet spot: large enough for real reasoning but still runnable on a single 24GB card. The "direct perception vs deep chain-of-thought" switching is interesting. Does it decide that automatically based on task complexity, or is there a way to force one mode over the other?

Emad Ibrahim

The GUI agent angle is what makes this really compelling. A 15B model that can handle high-res screens well enough for computer-use tasks is a big deal for anyone building browser automation or testing tools. The adaptive reasoning depth -- going direct on simple perception but switching to chain-of-thought for harder problems -- seems like the right tradeoff for latency-sensitive agent loops. Have you seen benchmarks on how it compares to larger models specifically on GUI grounding tasks?