
Decoding Gemini Embedding 2 Preview: The First Native Multimodal Embedding Model and the 5 Breakthroughs Behind Its #1 MTEB Ranking

In March 2026, Google unveiled a landmark model: Gemini Embedding 2 Preview, the industry's first native multimodal embedding model. It can map text, images, video, audio, and PDF documents into a unified vector space, securing the #1 spot on the MTEB multilingual benchmark—outperforming the runner-up by more than 5 percentage points.

Core Value: By the end of this article, you'll understand the 5 major technical breakthroughs of Gemini Embedding 2 Preview, how it compares to competitors in pricing and performance, and how to quickly integrate it via API.


What is Gemini Embedding 2 Preview?

Gemini Embedding 2 Preview is the latest embedding model released by Google on March 10, 2026. Initialized based on the Gemini architecture and utilizing a bidirectional attention Transformer structure, it is Google's first embedding model with native multimodal input support.

| Specification | Details |
|---|---|
| Model ID | gemini-embedding-2-preview |
| Release Date | March 10, 2026 |
| Status | Preview (general availability TBD) |
| Default Output Dimensions | 3,072 |
| Optional Dimension Range | 128–3,072 |
| Max Input Tokens | 8,192 (4x the previous generation) |
| Multimodal Support | Text, image, video, audio, PDF |
| Language Support | 100+ languages |
| Matryoshka Training | Supported (dimensions can be truncated while preserving semantic quality) |
| Available Platforms | Gemini API, Vertex AI, APIYI apiyi.com |

Key Differences from Previous Generations

| Feature | text-embedding-004 | gemini-embedding-001 | gemini-embedding-2-preview |
|---|---|---|---|
| Max Input Tokens | 2,048 | 2,048 | 8,192 |
| Output Dimensions | Up to 768 | 128–3,072 | 128–3,072 |
| Multimodal | Text only | Text only | Text + image + video + audio + PDF |
| Task Type Specification | task_type field | task_type field | Prompt-embedded instructions |
| MRL Support | Not supported | Supported | Supported |
| Price/Million Tokens | Service discontinued | $0.15 | $0.20 |

🎯 Integration Tip: APIYI apiyi.com now supports gemini-embedding-2-preview model invocation. You can integrate it via an OpenAI-compatible interface without needing to configure a separate Google API key.

Detailed Breakdown of 5 Key Technical Breakthroughs


Breakthrough 1: Native Multimodal Unified Embedding Space

This is the biggest differentiator for Gemini Embedding 2—content from 5 different modalities is mapped into the same vector space.

| Modality | Format Requirements | Limit per Request | Notes |
|---|---|---|---|
| Text | Plain text | 8,192 tokens | Supports 100+ languages |
| Image | PNG, JPEG | Max 6 per request | Direct pixel processing |
| Video | MP4, MOV | Max 120 seconds | Auto-samples up to 32 frames |
| Audio | MP3, WAV | Max 80 seconds | Native processing, no transcription needed |
| PDF | PDF document | Max 6 pages per request | Includes OCR capabilities |

Practical Use Cases:

  • Search for images using text ("a red sports car on a racetrack" → returns matching images)
  • Search for similar video clips using an image
  • Search for relevant documents using voice descriptions
  • Build a unified, cross-modal knowledge base

This wasn't possible with previous embedding models. OpenAI's text-embedding-3 series only supports text; if you wanted image search, you'd have to use a vision model to extract a description first, which adds an extra step and loses information.

Breakthrough 2: 8,192 Token Context Window

The input window has been increased from 2,048 to 8,192 tokens, meaning you can embed much longer document segments at once.

For RAG (Retrieval-Augmented Generation) systems, this is incredibly practical:

  • Previously, you had to chop documents into small 500–1,000 token chunks.
  • Now, you can use larger 2,000–4,000 token chunks, preserving more context.
  • Larger document segments = fewer splits = more complete retrieval results.
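As a rough sketch of what larger chunks look like in practice, the splitter below targets the bigger chunk sizes the 8K window allows. It approximates tokens as whitespace-delimited words for simplicity; a real tokenizer would be more accurate, and the sizes are illustrative, not prescribed by the model.

```python
def chunk_text(text: str, max_tokens: int = 3000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks sized for an 8K-token window.

    Tokens are approximated as whitespace-delimited words; swap in a
    real tokenizer for production use.
    """
    words = text.split()
    chunks = []
    step = max_tokens - overlap  # each chunk overlaps the previous one
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return chunks
```

With the previous 2,048-token limit, the same document would need roughly four times as many chunks, each carrying less surrounding context into retrieval.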

Breakthrough 3: Matryoshka Dimensionality Scaling

Gemini Embedding 2 is trained using Matryoshka Representation Learning (MRL), which concentrates the most important semantic information into the first few dimensions of the vector.

This means you can flexibly choose the dimensionality based on your specific needs:

| Dimensions | Vector Size | Best For | Quality Loss |
|---|---|---|---|
| 3,072 (default) | 12.3 KB | Highest-precision retrieval | None |
| 1,536 | 6.1 KB | Balancing precision and storage | Minimal |
| 768 | 3.1 KB | Large-scale deployment | Small |
| 256 | 1.0 KB | Real-time recommendation systems | Moderate |
| 128 | 0.5 KB | Extreme compression scenarios | Significant |

Note: When using dimensions lower than 3,072, you need to manually normalize the vectors before calculating similarity.
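That normalization step takes only a few lines. The sketch below (function name is our own) keeps the first `dim` components of a full 3,072-dimension vector and L2-normalizes the result:

```python
import numpy as np

def truncate_and_normalize(vec: list[float], dim: int) -> np.ndarray:
    """Keep the first `dim` MRL dimensions and L2-normalize the result."""
    v = np.asarray(vec, dtype=np.float64)[:dim]
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```

A useful side effect: once vectors are unit-length, cosine similarity reduces to a plain dot product, which most vector databases compute faster.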

Breakthrough 4: 100+ Languages Supported

In the MTEB multilingual benchmark, Gemini Embedding 2 was evaluated across 250+ languages, covering a range far beyond its competitors.

Key language performance metrics:

  • Bitext Mining: 79.32 points
  • Cross-lingual Retrieval (XOR-Retrieve): Recall@5kt 90.42 points
  • Multilingual Understanding (XTREME-UP): MRR@10 64.33 points

Breakthrough 5: #1 in Multiple MTEB Rankings

| Benchmark | Score | Rank | Lead Margin |
|---|---|---|---|
| MTEB Multilingual (Mean Task) | 68.32 | #1 | +5.09 |
| MTEB Multilingual (Mean Type) | 59.64 | #1 | — |
| MTEB English v2 (Mean Task) | 73.30 | #1 | — |
| MTEB English v2 (Mean Type) | 67.67 | #1 | — |
| MTEB Code (Mean All) | 74.66 | #1 | — |

For comparison, the second-place model, gte-Qwen2-7B-instruct, scored 62.51 on the multilingual MTEB. Gemini Embedding 2 leads by nearly 6 points, which is a massive gap in the world of embedding models.

💡 Development Tip: If you're building a RAG system or a semantic search application, Gemini Embedding 2 is currently the strongest choice for multilingual and code-heavy scenarios. You can easily integrate this model via APIYI (apiyi.com), which also supports OpenAI embedding models, making it simple to quickly compare performance.

Pricing and Performance Comparison with Competitors


Text Embedding Pricing Comparison

| Model | Price/Million Tokens | Max Dimensions | Max Input | Multimodal | Multilingual Rank |
|---|---|---|---|---|---|
| Gemini Embedding 2 | $0.20 | 3,072 | 8,192 | ✅ 5 modalities | #1 |
| gemini-embedding-001 | $0.15 | 3,072 | 2,048 | Text only | — |
| OpenAI text-embedding-3-large | $0.13 | 3,072 | 8,191 | Text only | — |
| OpenAI text-embedding-3-small | $0.02 | 1,536 | 8,191 | Text only | — |

Multimodal Content Pricing (Exclusive to Gemini Embedding 2):

| Input Type | Pay-as-you-go / Million Tokens | Batch / Million Tokens |
|---|---|---|
| Text | $0.20 | $0.10 |
| Image | $0.45 (~$0.00012/image) | $0.225 |
| Audio | $6.50 (~$0.00016/sec) | $3.25 |
| Video | $12.00 (~$0.00079/frame) | $6.00 |
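To budget a workload against these rates, a small estimator helps. This sketch hardcodes the pay-as-you-go prices from the table above and assumes batch pricing is exactly half, as the listed batch prices show; the function name is our own.

```python
# Pay-as-you-go rates in USD per million tokens, from the pricing table above.
PAYG_RATES = {"text": 0.20, "image": 0.45, "audio": 6.50, "video": 12.00}

def estimate_cost(token_counts: dict[str, int], batch: bool = False) -> float:
    """Estimate embedding cost given a token count per modality."""
    scale = 0.5 if batch else 1.0  # batch prices are half of pay-as-you-go
    return sum(PAYG_RATES[m] * scale * n / 1_000_000
               for m, n in token_counts.items())
```

For example, embedding 5 million text tokens pay-as-you-go comes to $1.00, or $0.50 via batch.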

Selection Recommendations

| Use Case | Recommended Model | Reason |
|---|---|---|
| Text-only, cost-sensitive | OpenAI text-embedding-3-small | Cheapest ($0.02) |
| Text-only, high-precision | Gemini Embedding 2 or OpenAI 3-large | Similar accuracy; Gemini has better multilingual support |
| Multimodal search | Gemini Embedding 2 | Only native multimodal solution |
| Multilingual retrieval | Gemini Embedding 2 | MTEB Multilingual #1 |
| Code search | Gemini Embedding 2 | MTEB Code #1 |
| Large-scale, low-cost | OpenAI 3-small + Batch API | 10x price advantage |

🎯 Pro Tip: Choosing the right embedding model depends on your specific use case. We recommend using the APIYI (apiyi.com) platform to access both Gemini and OpenAI embedding models simultaneously. You can compare retrieval performance with real data before making a decision. The platform supports a unified interface, so you can switch models without changing your code.

API Invocation Details

Specifying Task Types (Important Change)

Unlike gemini-embedding-001, Gemini Embedding 2 no longer uses the task_type parameter. Instead, you specify the task type by embedding a task instruction directly into your input content.

Supported Task Types:

| Task Type | Query Format | Document Format |
|---|---|---|
| Search/Retrieval | `task: search result \| query: {content}` | `title: {title} \| text: {content}` |
| Q&A | `task: question answering \| query: {question}` | `title: {title} \| text: {content}` |
| Fact Checking | `task: fact checking \| query: {statement}` | `title: {title} \| text: {content}` |
| Code Retrieval | `task: code retrieval \| query: {description}` | `title: {title} \| text: {code}` |
| Classification | `task: classification \| query: {content}` | Same format |
| Clustering | `task: clustering \| query: {content}` | Same format |
| Sentence Similarity | `task: sentence similarity \| query: {sentence}` | Same format |

For the document side, if there is no title, use title: none.

Python Example

```python
import openai

# Call via the APIYI unified interface
client = openai.OpenAI(
    api_key="YOUR_APIYI_KEY",
    base_url="https://api.apiyi.com/v1"
)

# Text embedding for a search scenario
response = client.embeddings.create(
    model="gemini-embedding-2-preview",
    input="task: search result | query: What is a vector database",
    dimensions=768  # Optional: any value from 128 to 3072
)

embedding = response.data[0].embedding
print(f"Vector dimension: {len(embedding)}")
print(f"First 5 values: {embedding[:5]}")
```
Complete RAG retrieval example:
```python
import openai
import numpy as np
from typing import List

client = openai.OpenAI(
    api_key="YOUR_APIYI_KEY",
    base_url="https://api.apiyi.com/v1"
)

def get_embedding(text: str, task: str = "search result", dim: int = 768) -> List[float]:
    """Get a query embedding vector."""
    formatted = f"task: {task} | query: {text}"
    response = client.embeddings.create(
        model="gemini-embedding-2-preview",
        input=formatted,
        dimensions=dim
    )
    vec = response.data[0].embedding
    # MRL-truncated dimensions require manual normalization
    if dim < 3072:
        norm = np.linalg.norm(vec)
        vec = (np.array(vec) / norm).tolist()
    return vec

def get_doc_embedding(title: str, text: str, dim: int = 768) -> List[float]:
    """Get a document embedding vector."""
    formatted = f"title: {title} | text: {text}"
    response = client.embeddings.create(
        model="gemini-embedding-2-preview",
        input=formatted,
        dimensions=dim
    )
    vec = response.data[0].embedding
    if dim < 3072:
        norm = np.linalg.norm(vec)
        vec = (np.array(vec) / norm).tolist()
    return vec

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Calculate cosine similarity between two vectors."""
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Usage example
query_vec = get_embedding("How to optimize RAG retrieval")
doc_vec = get_doc_embedding(
    "RAG Optimization Guide",
    "This article introduces 5 methods to optimize RAG retrieval quality..."
)
similarity = cosine_similarity(query_vec, doc_vec)
print(f"Similarity: {similarity:.4f}")
```

🚀 Quick Start: We recommend using the APIYI (apiyi.com) platform to quickly integrate Gemini Embedding 2. The platform provides an OpenAI-compatible embedding interface, allowing you to complete integration in just 5 minutes. It also supports unified calls for mainstream embedding models like OpenAI, Gemini, and Cohere.

Usage Notes

Preview Status Limitations

| Limitation | Description | Impact |
|---|---|---|
| Version changes | Specifications and pricing may change during the Preview phase | Have a fallback plan for production environments |
| Vector space incompatibility | Vectors cannot be mixed with those from older models | Upgrading requires full re-indexing |
| Normalization at low dimensions | Manual normalization is required below 3,072 dimensions | Add a normalization step in your code |
| Strict rate limits | Preview quotas are lower than those of GA models | Request a limit increase for large-scale usage |
| Free tier data usage | Free-tier data is used for product improvement | Use the paid tier for sensitive data |

Migration Notes from Older Models

  1. Re-indexing is Mandatory: Vector spaces are incompatible between different models; you cannot mix them in the same database.
  2. Task Type Format Changes: The task_type parameter has been replaced by embedded instructions within the prompt.
  3. Normalization Handling: If you're using non-default dimensions, you must add normalization logic to your code.
  4. Test Before Migrating: We recommend comparing the retrieval performance of the new and old models in a test environment before deciding to migrate.

FAQ

Q1: How does Gemini Embedding 2 Preview compare to OpenAI’s text-embedding-3-large?

The main advantages lie in three areas: native multimodal support (OpenAI only supports text), a #1 ranking on the MTEB leaderboard (with a significant lead), and higher quality code embeddings. However, OpenAI's text-embedding-3-large is cheaper ($0.13 vs $0.20), and if you only need English text embeddings, the quality is very similar. You can use APIYI (apiyi.com) to call both models and compare them with your real data.

Q2: What are the practical use cases for multimodal embeddings?

The most direct application is cross-modal search: users input text, and the search returns relevant images, videos, or documents. For example, in e-commerce, you could search for products using "red dress," or in an enterprise knowledge base, you could use a text description to find relevant clips in training videos. Traditionally, you'd need to use a vision model to extract descriptions before embedding the text, but Gemini Embedding 2 handles raw images/videos directly, resulting in less information loss.

Q3: What’s the right dimension to choose? Is there a big difference between 768 and 3072?

For most applications, 768 dimensions is the "sweet spot"—the storage cost is only 1/4 of 3072 dimensions, but the retrieval quality loss is minimal (thanks to Matryoshka training). If your dataset is small (<1 million records) and you have extremely high precision requirements, go with 3072. If you have a large volume of data or need real-time retrieval, 768 or even 256 dimensions are perfectly reasonable choices.

Q4: How does APIYI support Gemini Embedding 2? Is extra configuration needed?

APIYI (apiyi.com) already supports the gemini-embedding-2-preview model. You can call it using the standard OpenAI-compatible embedding interface without needing an extra Google API key. Simply specify gemini-embedding-2-preview in the model parameter; all other parameters (like dimensions) are identical to the OpenAI embedding interface.


Summary: A New Benchmark for Multimodal Embeddings

Gemini Embedding 2 Preview marks a major milestone for embedding models—a shift from text-only to a truly unified multimodal space. By securing the #1 spot across MTEB’s multilingual, English, and coding benchmarks simultaneously, combined with an 8K context window and MRL dimension scaling, it provides the most powerful foundation currently available for RAG systems, semantic search, and knowledge base construction.

Key Takeaways:

  • The industry's first native five-modality embedding model (text + image + video + audio + PDF)
  • #1 on the MTEB multilingual benchmark, leading by over 5 points
  • 8,192-token context window, 4x larger than the previous generation
  • MRL training supports flexible dimension scaling from 128 to 3,072
  • Priced at $0.20 per million tokens, offering exceptional cost-effectiveness for multimodal scenarios

We recommend using APIYI (apiyi.com) to quickly integrate Gemini Embedding 2 Preview. With a single API key, you can access mainstream embedding models like Gemini and OpenAI, making it easy to compare and switch between them.


📝 Author: APIYI Technical Team | APIYI apiyi.com – A unified API platform for 300+ AI Large Language Models

References

  1. Official Google Blog: Gemini Embedding 2 Announcement

    • Link: blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-embedding-2/
    • Description: Covers the model's design philosophy and an introduction to its multimodal capabilities.
  2. Gemini API Embedding Documentation: Official API User Guide

    • Link: ai.google.dev/gemini-api/docs/embeddings
    • Description: Complete API parameters and usage examples.
  3. Gemini Embedding Research Paper: Technical Details and Benchmarks

    • Link: arxiv.org/html/2503.07891v1
    • Description: Detailed MTEB test data and model architecture analysis.
  4. Gemini API Pricing: Detailed Pricing Information by Modality

    • Link: ai.google.dev/gemini-api/docs/pricing
    • Description: Itemized pricing for text, image, audio, and video.
