Building an autonomous video agent with Gemini 2.0 & YOLOv8: roast my logic?
I'm a CS student building VIRL-ai – an autonomous agent that takes 8-hour Twitch streams and turns them into viral TikToks without human help.
Most AI clippers just look for "loud noises" (audio spikes). I found that approach tends to catch the punchline but cut off the setup to the joke.
My Solution: I built a local pipeline (Python) that does two things differently:
Context Engine: It grabs a 90-second audio buffer around spikes and uses Gemini 2.0 Flash to find the actual start/end of the interaction.
Universal Vision: I trained a YOLOv8 model to detect facecams vs. game UI to dynamically switch between "Split Screen" and "Full Screen" layouts.
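Roughly, the Context Engine's windowing step looks like this (a minimal sketch; the function name `clip_window` and the edge-clamping behavior are my own assumptions, the post only pins down the 90-second buffer):

```python
def clip_window(spike_ts: float, stream_len: float, buffer: float = 90.0):
    """Center a `buffer`-second window on an audio spike, clamped to the stream.

    The resulting audio slice is what gets transcribed and handed to
    Gemini 2.0 Flash, which trims it to the actual start/end of the interaction.
    Returns (start, end) in seconds.
    """
    start = max(0.0, spike_ts - buffer / 2)
    end = min(stream_len, start + buffer)
    # Re-clamp start in case the window ran past the end of the stream,
    # so clips near the end of an 8-hour VOD still get the full buffer.
    start = max(0.0, end - buffer)
    return start, end
```

For example, a spike 10 seconds into the stream still yields a full 90-second window starting at 0 rather than a truncated one.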
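The Universal Vision layout switch reduces to a per-frame decision along these lines (a sketch under assumptions: the class names `facecam`/`game_ui`, the confidence threshold, and the `pick_layout` helper are all hypothetical; the detections would come from the trained YOLOv8 model):

```python
def pick_layout(detections, conf_thresh: float = 0.5) -> str:
    """Choose a TikTok layout from one frame's YOLO-style detections.

    detections: list of (label, confidence) pairs, e.g. decoded from a
    YOLOv8 model trained on 'facecam' and 'game_ui' classes (assumed names).
    """
    labels = {lbl for lbl, conf in detections if conf >= conf_thresh}
    if "facecam" in labels and "game_ui" in labels:
        return "split"  # stack the facecam above the gameplay
    return "full"       # no confident facecam: full-screen gameplay
```

In practice you would likely want to smooth this decision over a few seconds of frames so the layout does not flicker when a detection drops out for a single frame.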
I'm currently running this locally but looking to move to cloud GPUs.
Question: For those building video AI, how are you handling the "ghosting" issue where facecams overlap gameplay? I wrote a "Smart Dodge" script for FFmpeg, but I'm curious whether there's a better CV approach.
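For context, one simple least-overlap heuristic for the dodge: score each candidate overlay corner by its intersection area with the detected facecam box and take the corner that overlaps least (everything here is a hypothetical sketch, not my actual script; the facecam box would come from the YOLOv8 detector, and the chosen x/y could then be passed to FFmpeg's `overlay` filter):

```python
def overlap(a, b) -> float:
    """Intersection area of two boxes given as (x1, y1, x2, y2)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, w) * max(0, h)

def dodge_position(facecam_box, frame_w, frame_h, ow, oh, margin=16):
    """Pick the (x, y) for an ow x oh overlay in the corner that
    intersects the detected facecam box the least."""
    corners = [
        (margin, margin),                                # top-left
        (frame_w - ow - margin, margin),                 # top-right
        (margin, frame_h - oh - margin),                 # bottom-left
        (frame_w - ow - margin, frame_h - oh - margin),  # bottom-right
    ]
    return min(corners, key=lambda p: overlap(
        (p[0], p[1], p[0] + ow, p[1] + oh), facecam_box))
```

With a facecam in the top-left of a 1920x1080 frame, this sends a 400x300 overlay to the top-right corner. A smoother variant would hysteresis the choice so the overlay does not jump every time the facecam box jitters.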
Would love feedback on the landing page if anyone has a moment: [Link to Virl.ai]