Marengo 3.0 by TwelveLabs - The most powerful embedding model for video understanding
Marengo 3.0 is TwelveLabs' most significant model to date, delivering human-like video understanding at scale. A multimodal embedding model, Marengo fuses video, audio, and text for holistic video understanding to power precise video search and retrieval.


Replies
DesignRevision
TwelveLabs is impressive in pushing the limits of video AI. It seems powerful and efficient. How does it handle complex scenes to ensure accurate context understanding across different video genres?
TwelveLabs
Hey Product Hunt! 👋 This is Allie from @TwelveLabs!
Today we’re launching Marengo 3.0 (M3) — our biggest upgrade yet in multimodal AI.
If you’ve ever tried to build on top of models that say they understand video but collapse on long content, sports, or anything beyond short clips… M3 is built for you.
🚀 What’s M3?
M3 is a unified multimodal foundation model powering our Search API and Embed API.
It understands video, audio, images, and text in a single space — fast, efficient, and built for production.
🔥 Highlights
⚡ Breakaway speed on long-form video processing — practical at massive scale
💾 512-d embeddings → up to 6× more storage-efficient with top-tier accuracy
🎥 True multimodality across video, audio, image, and text
🌍 Native multilingual support (English, Korean, Japanese, and more)
🏀 Elite sports intelligence: fine-grained action recognition, player tracking, and temporal reasoning
🧠 Handles hour-long videos, long queries, and composed queries (image + text)
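The "single space" idea above is what makes cross-modal search work: a text query and a video clip map to comparable vectors, so retrieval reduces to nearest-neighbor lookup. A minimal sketch of that mechanic, using random stand-in vectors (only the 512-d size comes from the launch notes; the data and the slight query perturbation are illustrative, not real Marengo output):

```python
# Sketch: text-to-video retrieval in a shared multimodal embedding space.
# Vectors are random stand-ins; 512-d matches the Marengo 3.0 launch notes.
import numpy as np

rng = np.random.default_rng(0)

def normalize(v):
    # Unit-normalize so dot product equals cosine similarity.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Pretend these came from the Embed API: one 512-d vector per video clip.
clip_embeddings = normalize(rng.standard_normal((100, 512)))

# A text query lands in the same space; here we fake one that is "about"
# clip 42 by nudging that clip's vector with a little noise.
query = normalize(clip_embeddings[42] + 0.05 * rng.standard_normal(512))

# Retrieval is a single matrix-vector product over unit vectors.
scores = clip_embeddings @ query
best = int(np.argmax(scores))   # index of the best-matching clip
```

Compact embeddings matter here because the index above scales linearly with dimension: at 512-d, a million clips fit in roughly 2 GB of float32, which is where the storage-efficiency claim comes from.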
💡 What you can build
Search platforms, AI agents that watch content, sports analytics tools, compliance systems, media workflows — anything that needs real video understanding.
🛠️ Try Marengo 3.0
Available via:
TwelveLabs SaaS (Search API + Embed API)
AWS Bedrock
I’m so proud of the research-first team behind this release — and excited to see what you build with M3.
Ask me anything below 👇
Hi, can I use it for my game promo video?
Do you have plans to integrate Marengo 3.0 with professional video editing tools (e.g., Adobe Premiere Pro) so teams can pull retrieved clips directly into timelines? Also, will the model support analysis of extra-long-form footage (e.g., 12+ hour raw interviews), common in documentary and investigative work?
Great work on TwelveLabs. The notion of AI that understands video — visuals, audio, context — like a human does, but at scale, feels like the next big leap for video workflows. Curious how well it handles noisy, real-world footage.
Hey there!
This looks incredibly useful for anyone drowning in video content. The ability to actually search through video based on what's happening, not just transcripts or tags, would save our team hours every week. I can't count how many times I've had to skim through a long design tutorial or a recording just to find one specific segment I know was in there somewhere. If Pegasus is as good as it sounds, it might finally stop me from saying, "I know I saw that somewhere in the video..." Solid solution to a very real problem.
I recommend TwelveLabs—it's a powerful AI platform that truly understands video. Using advanced multimodal models like Marengo and Pegasus, the service takes searching, analyzing, and generating text from video content to a whole new level.
Congrats on the launch! The multimodal performance looks seriously impressive, especially the long-form and multilingual handling. How does it perform on noisy user-generated content in real workflows?
Congratulations on the new release! We once made a similar service: we recognized text from videos, translated it, and generated videos with the translation. This way, YouTube bloggers could automatically create videos in 70+ languages. YouTube even officially recommended this service later.
Unloop
This is great!! Congrats on the launch
Would love to test this for auto-generating summaries of short films. Does it handle narrative structure well, or is it more optimized for action/object detection?