OpenCut-AI now runs TurboQuant on your GPU — 7.3× KV cache compression
OpenCut-AI just shipped real GPU support for TurboQuant KV cache compression.
OpenCut-AI is an open-source, local-first AI video editor. Everything runs on your machine — transcription, voice cloning, image generation, LLM commands. No cloud, no API keys.
The catch was always memory. Running a 7B LLM, Whisper, TTS, and Stable Diffusion locally means fighting for every gigabyte of RAM. TurboQuant attacks this by compressing the KV cache — the biggest memory consumer during LLM inference — by up to 7.3×.
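To get a feel for why the KV cache dominates, here is a back-of-the-envelope sketch. The model shape below (32 layers, 32 KV heads, head dim 128) is an assumed Llama-style 7B configuration, not something taken from OpenCut-AI itself; real quantized caches also carry per-group scale/zero-point metadata, which is why measured ratios land around 7.3× rather than the raw 8× of fp16 → 2-bit.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits):
    """Approximate KV cache size: keys + values (factor of 2),
    one vector of head_dim per head per layer per token."""
    return 2 * layers * seq_len * kv_heads * head_dim * bits / 8

# Assumed Llama-style 7B shape, 8K context:
fp16_bytes = kv_cache_bytes(32, 32, 128, 8192, 16)   # full-precision cache
q2_bytes = kv_cache_bytes(32, 32, 128, 8192, 2)      # 2-bit, ignoring scale metadata

print(f"fp16: {fp16_bytes / 2**30:.1f} GiB")          # → 4.0 GiB
print(f"2-bit: {q2_bytes / 2**30:.2f} GiB")           # → 0.50 GiB
print(f"raw ratio: {fp16_bytes / q2_bytes:.0f}x")     # → 8x before metadata overhead
```

On a machine also hosting Whisper and Stable Diffusion, reclaiming ~3.5 GiB from the cache alone is the difference between fitting and swapping.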
What's new in this release:
• User-selectable Compute Mode in Settings → AI Optimization. Pick Auto, CPU, or GPU (CUDA).
• Real integration with the turboquant-gpu library. The GPU backend runs cuTile fused kernels for the full 2-bit / 3-bit KV compression path. The CPU backend uses a PyTorch fallback with physical-core thread pinning and MKLDNN acceleration.
• Live-measured compression ratios in the UI. No more static lookup tables — you see the actual compression your backend produced on the last request.
• Graceful fallback everywhere. Missing CUDA? Falls back to CPU. Missing cuTile kernels? Falls back to PyTorch. The service always comes up.
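The fallback chain can be sketched roughly as below. This is an illustrative sketch of the pattern described above, not OpenCut-AI's actual code — the backend names and the `turboquant_gpu` import are assumptions:

```python
def pick_backend(prefer: str = "auto") -> str:
    """Resolve the compute mode to an available backend, never failing outright.
    prefer: "auto", "gpu", or "cpu" (mirrors the Settings options)."""
    if prefer in ("auto", "gpu"):
        try:
            import torch
            if torch.cuda.is_available():
                try:
                    import turboquant_gpu  # hypothetical module name; cuTile kernels
                    return "gpu-cutile"
                except ImportError:
                    pass  # CUDA present but kernels missing: drop to PyTorch on CPU
        except ImportError:
            pass  # no torch with CUDA: drop to CPU path
    # CPU PyTorch fallback (thread pinning / MKLDNN would be configured here)
    return "cpu-pytorch"
```

An explicit GPU request that can't be satisfied degrades to CPU instead of erroring, which is what keeps the service coming up on any machine.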
Huge thanks to Anirudh Bharadwaj Vangara for the turboquant-gpu library that made the real GPU path possible.
OpenCut-AI: https://github.com/Ekaanth/OpenCut-AI
turboquant-gpu: https://github.com/DevTechJr/turboquant-gpu