Zac Zuo

GLM-5V-Turbo - Vision-to-code foundation model for real GUI automation

GLM-5V-Turbo is Z.AI's first multimodal coding model. It understands images, video, files, and UI layouts, then turns that visual context into runnable code, debugging help, and stronger agent workflows with Claude Code and OpenClaw.

Zac Zuo

Hi everyone!

GLM-5V-Turbo is one of the more interesting coding model releases lately because it is not just "vision added onto a code model." @Z.ai is clearly positioning it as a native multimodal coding model that can understand screenshots, design drafts, videos, document layouts, and real interfaces, then turn that into code, debugging, and action.

"Seeing the screen and writing the code" is a very real workflow, and GLM-5V is built exactly for that.

It is also deeply adapted for @Claude Code and @OpenClaw style loops, which makes it feel much more relevant than a generic VLM with some coding demos on top.

Try it on chat.z.ai or plug in the official API.
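For the API route, a minimal sketch of what a vision-to-code request could look like, assuming Z.AI keeps an OpenAI-compatible chat endpoint as with earlier GLM releases. The model id and payload shape here are assumptions, not confirmed values — check the official docs:

```python
import base64
import json

# Hypothetical model id -- verify against Z.AI's API documentation.
MODEL_ID = "glm-5v-turbo"

def build_vision_request(image_bytes: bytes, prompt: str) -> dict:
    """Build an OpenAI-style chat payload with an inline base64 screenshot."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": MODEL_ID,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
    }

payload = build_vision_request(
    b"\x89PNG...",  # placeholder bytes; read a real screenshot in practice
    "Turn this screenshot into a React + Tailwind component.",
)
print(json.dumps(payload)[:80])
```

The payload would then be POSTed to the chat completions endpoint with your API key; only the image encoding step differs from a plain text request.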

Vojtěch Hořava
@zaczuo hello Zac 🍀 how can I use it together with Claude Code? I really like this model and I really want to use it, but I know that using a Claude Code sub in a third-party app can be a problem, so I just wanted to ask 🍀🤍 Thank you, and wish you and the team all the best!
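If Z.AI ships an Anthropic-compatible endpoint for this model, as it did for earlier GLM coding models, then pointing Claude Code at it is just two environment variables. The base URL below is an assumption — verify it against Z.AI's own Claude Code guide:

```shell
# Hypothetical endpoint -- confirm the real URL in Z.AI's Claude Code guide.
export ANTHROPIC_BASE_URL="https://api.z.ai/api/anthropic"
export ANTHROPIC_AUTH_TOKEN="your-z-ai-api-key"
# then launch Claude Code as usual:
#   claude
```

Claude Code reads these variables at startup, so no subscription is involved; requests are billed against the Z.AI key instead.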
fmerian

a few months ago, @Claude by Anthropic announced Opus 4.5 and we thought they had won the AI coding race. then @MiniMax released M2.7, and now GLM-5V-Turbo by @Z.ai.

open source is so back.

pro tip: you can experiment with this new model with @Kilo Code and @KiloClaw

Bill Chirico
I was so excited for this to launch, so I tried it on my OpenClaw, and it is still really slow compared to other models. Truly disappointing to say the least.
Chintan

this looks exciting! we struggle with creating vector diagrams that we can embed in a website. they generally start as a sketch on paper, and right now the process of getting them onto the site is very cumbersome. can the model handle sketch-in → .svg-out?

Sounak Bhattacharya

The "video → runnable code" claim is the one I want to pull on. Are we talking about screen recordings of a UI workflow, where the model watches what a user does and generates automation code from that? Or is video support more like "static frames extracted and analyzed sequentially"? Those are very different capabilities with very different use cases.

Manchit Sanan

Vision-to-code is a fascinating direction. We use a simpler version of this in Krafl-IO — users upload an image and our AI describes it, then generates a LinkedIn post around it. Going from visual context to structured output is harder than it looks. Curious how GLM-5V handles ambiguous UI elements where the "right" code depends on intent, not just layout.