Abhishek Sira Chandrashekar

OpenCut AI now runs 7B models on 8GB RAM -- TurboQuant KV cache compression is live

Hey everyone!

We just shipped TurboQuant into OpenCut AI, and this one changes what hardware you need to run the full AI stack.

The problem we had

OpenCut AI runs everything locally -- LLM, transcription, voice cloning, image generation. That's great for privacy, but brutal on memory. Running the full stack needed 35+ GB RAM. Most of our users have 8-16 GB laptops, so they were stuck with tiny 1B models that gave mediocre scripts, slow commands, and limited context.

What TurboQuant does

TurboQuant implements two algorithms from Google Research papers, PolarQuant and QJL, which compress the KV cache (the biggest memory bottleneck during AI inference) by up to 6x with mathematically proven quality preservation.

In plain terms: your AI models now use a fraction of the memory without getting dumber.
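The real PolarQuant and QJL algorithms are more sophisticated than this, but the underlying idea of KV cache quantization can be sketched in a few lines: store each cached value with a handful of bits plus a small amount of per-row metadata, instead of a full 32-bit float. This is an illustrative round-to-nearest scheme, not OpenCut AI's actual implementation:

```python
def quantize(values, bits=4):
    """Compress a row of KV-cache floats to `bits` bits per value.

    Returns the quantized integers plus the (scale, lo) metadata
    needed to reconstruct approximate floats later.
    """
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1          # e.g. 15 distinct levels at 4 bits
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = [round((v - lo) / scale) for v in values]
    return q, scale, lo

def dequantize(q, scale, lo):
    """Reconstruct floats; per-value error is bounded by scale / 2."""
    return [x * scale + lo for x in q]

keys = [0.5, -1.25, 3.0, 2.0]        # one toy row of cached key values
q, scale, lo = quantize(keys)         # 4 bits/value instead of 32
restored = dequantize(q, scale, lo)
```

At 4 bits per value the raw storage drops 8x versus float32; after the per-row scale/offset overhead, ending up in the "up to 6x" range quoted above is plausible.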

Before vs After

On a 16 GB machine:
- Before: Llama 3.2 1B + Whisper Base + TTS = barely fits, mediocre quality
- After: Llama 3.1 8B + Whisper Medium + TTS = runs comfortably, dramatically better output

On an 8 GB machine:
- Before: Could only run the 1B model alone
- After: Runs a 3B model + Whisper Base + TTS together

Full stack memory:
- Before: 35 GB for everything
- After: 15 GB for everything

What this means for editing

- Better AI commands: "remove the intro" actually works now, because Mistral 7B understands context far better than a 1B model
- Better transcription: Whisper Medium fits where only Whisper Base could before, so captions are more accurate
- Longer content: process hour-long podcast transcripts without running out of memory. The 6x KV cache reduction means roughly 6x longer input context
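The arithmetic behind "6x smaller cache, 6x longer context" is easy to check yourself. The KV cache stores a key and a value for every layer and KV head at every token position, so its size scales linearly with both context length and bits per value. A back-of-the-envelope calculation (using a Llama-3.1-8B-style config: 32 layers, 8 KV heads via grouped-query attention, head dimension 128; bit widths are the variable):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bits_per_value):
    # K and V each store n_layers * n_kv_heads * head_dim values per token
    values_per_token = 2 * n_layers * n_kv_heads * head_dim
    return values_per_token * seq_len * bits_per_value // 8

# 32k-token context, Llama-3.1-8B-style attention config
fp16 = kv_cache_bytes(32, 8, 128, 32_768, 16)   # uncompressed half precision
q4   = kv_cache_bytes(32, 8, 128, 32_768, 4)    # 4-bit quantized cache
print(fp16 // 2**30, "GiB vs", q4 // 2**30, "GiB")  # prints: 4 GiB vs 1 GiB
```

Under a fixed memory budget, that 4x-8x reduction per token translates directly into proportionally more tokens of context.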

One-click setup in Settings

We added a new AI Optimization panel in Settings. It auto-detects your hardware and recommends the best configuration:

- Performance Tier: Lite (4-8 GB), Standard (8-16 GB), or Pro (16-32 GB). Each tier is tagged with "Best for your hardware" based on your actual RAM.
- KV Cache Compression: Pick 4-bit (near-lossless), 3-bit (5x compression), or 2-bit (aggressive). Recommended level highlighted based on your system.
- Memory Budget: Set once, and the system optimizes everything to fit.
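The tier recommendation described above boils down to bucketing detected RAM. A minimal sketch of that mapping, mirroring the Lite/Standard/Pro ranges from the panel (illustrative only, not OpenCut AI's actual detection code):

```python
def recommend_tier(ram_gb: float) -> str:
    """Map detected system RAM to a performance tier:
    Lite (4-8 GB), Standard (8-16 GB), Pro (16-32 GB)."""
    if ram_gb < 8:
        return "Lite"
    if ram_gb < 16:
        return "Standard"
    return "Pro"

print(recommend_tier(12))  # prints: Standard
```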

Would love to hear from you: what's your RAM situation, and does this make local AI editing viable for you?


Replies

Rohan Chaubey

How does this affect render times when you're actually exporting the final video? :)

Abhishek Sira Chandrashekar

Hi @rohanrecommends,
this doesn't affect render time; it stays the same. The benefit is elsewhere: LLMs need a lot more memory as context grows, and TurboQuant reduces that required space.