Launching today

Gemini Embedding 2
Google's first natively multimodal embedding model
Gemini Embedding 2 is Google's first natively multimodal embedding model. It maps text, images, video, audio, and documents into a single embedding space, enabling multimodal retrieval and classification across different types of media, and it's available now in public preview.



Gemini Embedding 2 is Google's first natively multimodal embedding model, designed to map text, images, video, audio, and documents into a single embedding space.
Most embedding pipelines today are fragmented: developers often need separate models and preprocessing steps (such as audio transcription or image captioning) before generating embeddings. Gemini Embedding 2 simplifies this by handling multiple modalities directly, enabling multimodal retrieval, classification, and semantic search from one unified model.
Key features:
Multimodal embeddings for text, images, video, audio, and PDFs
Up to 8,192 text tokens, 6 images per request, 120 seconds of video, and 6-page PDFs
Native audio embeddings without transcription
Supports 100+ languages
Interleaved multimodal inputs (e.g., text + image together)
Flexible embedding dimensions with Matryoshka Representation Learning (3072 → 768)
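With Matryoshka Representation Learning, the leading dimensions of an embedding carry most of the information, so a full 3072-dimensional vector can be truncated to 768 dimensions and re-normalized with little quality loss. A minimal sketch of that truncation step (the random vector below is a stand-in for real model output, not an actual API call):

```python
import numpy as np

def truncate_embedding(vec, dim):
    """Keep the first `dim` Matryoshka dimensions and re-normalize
    so cosine similarity still behaves as expected."""
    truncated = np.asarray(vec, dtype=np.float64)[:dim]
    norm = np.linalg.norm(truncated)
    return truncated / norm if norm > 0 else truncated

# Stand-in for a full 3072-dim embedding returned by the model.
full = np.random.default_rng(0).normal(size=3072)
small = truncate_embedding(full, 768)
print(small.shape)  # (768,)
```

Shrinking vectors this way cuts storage and search cost in a vector database roughly 4x while keeping the same similarity semantics.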
Why this matters: Developers can build RAG systems, semantic search, sentiment analysis, clustering, and multimodal retrieval much more easily with a single embedding model that understands different media types together.
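Because every modality lands in the same space, the retrieval step reduces to nearest-neighbor search over vectors. A toy sketch of that step, using tiny hand-made 4-dimensional vectors in place of real embeddings (in practice each vector would come from the model, regardless of whether the source was text, an image, or audio):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend corpus: each entry could be a text chunk, an image, or an
# audio clip, all embedded into one shared space by the model.
corpus = {
    "doc:pricing page":  np.array([0.9, 0.1, 0.0, 0.1]),
    "img:product photo": np.array([0.1, 0.9, 0.2, 0.0]),
    "aud:support call":  np.array([0.2, 0.1, 0.9, 0.1]),
}

query = np.array([0.85, 0.15, 0.05, 0.1])  # stand-in query embedding
ranked = sorted(corpus, key=lambda k: cosine_sim(query, corpus[k]), reverse=True)
print(ranked[0])  # doc:pricing page
```

The same ranking loop works for cross-modal queries: a text query embedding can be compared directly against image or audio embeddings with no captioning or transcription step in between.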
Who it’s for: AI developers, ML engineers, and teams building search, assistants, knowledge bases, and multimodal AI applications.
If you’re building the next generation of multimodal AI experiences, this is definitely worth exploring.
I hunt the latest and greatest launches in tech, SaaS and AI, follow to be notified → @rohanrecommends
Copus
A natively multimodal embedding model that maps text, images, video, and audio into the same space is a big deal. Most embedding approaches still treat modalities separately, which creates friction when you want to do cross-modal search or retrieval. This should make building multimodal RAG systems much more straightforward.