Gemini Embedding 2: Multimodal Embedding Made Easy

Ask most enterprise leaders what percentage of their data their AI can actually see, and watch the pause.

The honest answer, for most organizations, is somewhere around 20%. Maybe less. The rest isn’t missing. It’s just in formats that most AI systems treat like wallpaper: present in every room, but completely ignored.

Scanned contracts sitting in shared drives since 2019. Recorded sales calls that contain more competitive intelligence than any CRM entry. Product images, engineering diagrams, and training videos sit untouched. A decade of institutional knowledge rots in unread PDFs. Your AI can summarize the notes from the meeting. It can’t watch the meeting. For most companies, the content that would actually change decisions is exactly the content their AI can’t touch.

How Enterprise AI Got Stuck on Text

The reason most AI systems are text-first isn’t a design philosophy. It’s an infrastructure problem that calcified into a default. Until recently, supporting multiple content types meant running separate embedding models. You needed one model for each format. Then you needed a custom layer to stitch the outputs together so you could compare them.

That’s several infrastructure components (one model per format, plus the alignment layer) before writing any product code. Each has its own latency and failure modes. And even then, the results tended to disappoint. Running content through separate models and trying to align the outputs loses information at every handoff. The system might find the right document but the wrong image. It might rank the audio clip that’s most keyword-searchable rather than the one most relevant to what you actually asked. The engine works well enough to ship but not well enough to rely on.

Most enterprise teams looked at that and made the rational call: build for text, worry about the rest later. “Later” is still sitting on their roadmaps, right next to “improve data quality” and “finish the documentation.”

What Gemini Embedding 2 Changes

Google built Gemini Embedding 2, launched in public preview in March 2026, on a fundamentally different architecture. Instead of running separate models, it uses one model to map text, images, video, audio, and documents into a single shared vector space of 3,072 dimensions. A plain-English question can pull an answer directly from a video clip. A product image can surface related content from a forgotten PDF archive.
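To make the shared-space idea concrete, here is a minimal sketch of cross-format retrieval. It does not call any Google API: the file names and the four-dimensional vectors are made up for illustration and stand in for the 3,072-dimensional embeddings a real model would return. The point is only that once every format lives in one space, a single similarity function ranks all of it.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical index. In a real system each vector would come from the
# embedding model; these toy 4-d values are assumptions for illustration.
INDEX = [
    {"id": "contract_2019.pdf", "modality": "document", "vector": [0.9, 0.1, 0.0, 0.1]},
    {"id": "sales_call_0412.mp3", "modality": "audio", "vector": [0.1, 0.9, 0.1, 0.0]},
    {"id": "deposition_clip.mp4", "modality": "video", "vector": [0.8, 0.2, 0.1, 0.2]},
    {"id": "product_photo.png", "modality": "image", "vector": [0.0, 0.1, 0.9, 0.1]},
]


def search(query_vector, index, top_k=2):
    """Rank every item, whatever its format, by similarity to the query."""
    scored = [(cosine_similarity(query_vector, item["vector"]), item) for item in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(item["id"], item["modality"], score) for score, item in scored[:top_k]]


# Stand-in for the embedding of a plain-English question (also made up).
query_vector = [0.85, 0.15, 0.05, 0.15]
results = search(query_vector, INDEX)

for item_id, modality, score in results:
    print(f"{item_id} ({modality}): {score:.3f}")
```

With these toy values, the top two hits are a document and a video clip, returned by one query with no per-format alignment layer in between.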

Think about what that means for a legal team. Today, searching across thousands of contracts and deposition recordings requires multiple tools, and usually a paralegal to do the grunt work. With a unified embedding model, that becomes a single query. The model handles the question the same way regardless of format: an answer in a Word document carries the same weight as one in a recorded deposition, because both live in a shared semantic space rather than in isolated silos.

The same logic applies in healthcare. Clinically important data often lives in imaging rather than in notes. Manufacturing quality control depends on inspection photos that no AI system ever indexed. In financial services, an earnings call transcript is sanitized. The actual audio recording holds the real signal.

Better Document Context

Gemini Embedding 2 also improves document handling significantly. Previous models required chopping long documents into smaller pieces because their context windows were too short. That chopping quietly degrades answer quality in ways teams usually blame on hallucinations. Gemini Embedding 2 handles documents four times longer than those earlier models could, without needing to slice them up. The model reads the whole thing, the way you’d want a colleague to before giving you their opinion.
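A rough sketch of why the longer window matters. The limits below are illustrative assumptions, not published specs: `OLD_LIMIT` is a hypothetical input limit, `NEW_LIMIT` is simply four times that, and whitespace-separated words stand in for model tokens. The `chunk_words` helper is likewise hypothetical.

```python
def chunk_words(words, max_len, overlap=0):
    """Split a list of words into consecutive chunks of at most max_len words."""
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("max_len must be positive and overlap smaller than max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + max_len])
        if start + max_len >= len(words):
            break  # this chunk already reaches the end of the document
    return chunks


# Assumed numbers for illustration only.
OLD_LIMIT = 2000
NEW_LIMIT = 4 * OLD_LIMIT

document = ["word"] * 7000  # stand-in for a 7,000-word contract

old_chunks = chunk_words(document, OLD_LIMIT)
new_chunks = chunk_words(document, NEW_LIMIT)

print(len(old_chunks), "chunks under the old limit")
print(len(new_chunks), "chunk under the new limit")
```

Under the smaller limit the document becomes four separately embedded fragments, none of which knows what the others say. Under the larger one it is embedded once, whole.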

BigQuery users get a new ObjectRef data type. It lets you reference unstructured Cloud Storage files directly inside BigQuery tables. Multimodal vector search then runs across structured and unstructured data together in SQL. No new pipeline, no new vendor, no new infrastructure to explain to your security team.

Why the Timing Matters

At the time of writing, no other cloud provider, major or minor, has a production-ready equivalent to this capability. AWS and Azure both have multimodal AI capabilities. However, neither ships a single unified embedding model, and neither offers this level of data platform integration for cross-content search.

For ISVs, this is the kind of window that closes faster than it looks. The first product to make all customer content searchable earns immense loyalty. The second product does not. Not because customers are sentimental about who got there first, but because they build workflows around it. Workflows become habits. Habits become switching costs that no competitor feature announcement is going to dislodge.

The Practical Move

Pick one ignored content type in your customers’ environments that contains valuable signal: recording archives, image libraries, legacy document stores, or whatever the equivalent is in your vertical. Build something that makes it searchable before your competitors realize the window is open. Then consider the broader question: What would your product look like if your AI could see everything your customers have, and not just the easy stuff?

That question used to be hypothetical, but the infrastructure to answer it exists now. The only thing left to decide is whether you’re the ISV who’ll lead the way to what comes next.

Want to go deeper?

Gemini Embedding 2 on Vertex AI. Full capability specs, supported modalities, context length, and integration details.
Google blog: Gemini Embedding 2 announcement. The original launch post with architecture detail and benchmark comparisons.
BigQuery multimodal vector search. How ObjectRef and multimodal embeddings bring unstructured data into the warehouse for SQL-native search.
VentureBeat: Gemini Embedding 2 arrives with native multimodal support. Independent coverage of the launch and competitive context.