
Ollama
The easiest way to run large language models locally
5.0•26 reviews•1.4K followers
Run Llama 2 and other models on macOS, with Windows and Linux coming soon. Customize and create your own.
This is the 4th launch from Ollama.
Ollama v0.19
Launching today
Ollama v0.19 rebuilds Apple Silicon inference on top of MLX, bringing much faster local performance for coding and agent workflows. It also adds NVFP4 support and smarter cache reuse, snapshots, and eviction for more responsive sessions.
Flowtica Scribe
Hi everyone!
The engineering in Ollama v0.19 is a massive leap for anyone running local models on macOS. Moving to Apple's native MLX framework changes the game for performance, leveraging the unified memory architecture and the new GPU Neural Accelerators on the M5 chips.
v0.19 also adds NVFP4 support, which brings local inference closer to production parity, and the KV cache has been reworked with cache reuse across conversations, intelligent checkpoints, and smarter eviction. For branching agent workflows like @Claude Code or @OpenClaw, that should mean lower memory use and faster responses.
If you have a Mac with 32GB+ of unified memory, you can pull the new Qwen3.5-35B-A3B NVFP4 model and test this right now. Running heavy agentic workflows locally just became a lot more viable!
Been running Ollama since like v0.12 and the speed improvements keep blowing my mind. The MLX integration is huge for M-series Macs tbh.
Smarter cache reuse is the underrated feature here. I run a coding assistant locally and switching between projects used to basically cold start every time. If the KV cache actually persists across sessions that changes everything for agent workflows.
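The cold-start problem described above is exactly what prefix-based KV cache reuse addresses. This is not Ollama's actual implementation, just a toy sketch of the idea: before prefilling a prompt, find the longest already-cached token prefix and only compute the suffix.

```python
class PrefixKVCache:
    """Toy model of prefix-based KV cache reuse (illustrative only).

    A real cache stores per-layer key/value tensors; here each entry
    is just the token prefix itself, which is enough to show how much
    prefill work reuse can skip.
    """

    def __init__(self) -> None:
        self._prefixes: list[list[int]] = []  # cached token prefixes

    def longest_cached_prefix(self, tokens: list[int]) -> int:
        """Length of the longest cached prefix matching `tokens`."""
        best = 0
        for cached in self._prefixes:
            n = 0
            for a, b in zip(cached, tokens):
                if a != b:
                    break
                n += 1
            best = max(best, n)
        return best

    def prefill(self, tokens: list[int]) -> int:
        """'Prefill' a prompt and return how many tokens actually
        needed computing after reuse; the full prompt is then cached."""
        reused = self.longest_cached_prefix(tokens)
        self._prefixes.append(list(tokens))
        return len(tokens) - reused
```

With a shared system prompt plus per-project context, the second session's prefill only pays for the part that differs, which is the effect the comment above is hoping for.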
Finally, MLX-native inference. I've been running local models on my M2 Air for quick prototyping when I don't want to burn API credits, and the speed difference on Apple Silicon matters a lot when you're going back and forth between coding and testing. Curious how it handles the bigger models now, like 70B+ quantized. Does the memory management play nicer with other heavy processes running?
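On the 70B+ question, a rough rule of thumb: weight memory is roughly parameters times bits-per-weight divided by 8, before the KV cache and runtime overhead are added on top. A hypothetical helper (not from Ollama) to run the arithmetic:

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weight-only memory estimate in GB (1 GB = 1e9 bytes).

    Ignores the KV cache, activations, and runtime overhead, which all
    add more on top -- treat the result as a lower bound.
    """
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# A 70B model at 4 bits per weight needs roughly 35 GB for weights alone,
# so it stays tight on unified memory once the KV cache and other heavy
# processes are competing for the same pool.
```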
Well done! Do all the current models work automatically with MLX with this version on macOS, or do you need to download a specific version of each model?
This is huge for local-first AI workflows. Curious how much real-world speedup people are seeing on M-series chips.