Gemini: Google's Powerhouse in the LLM World
Introduction
Just yesterday, Google DeepMind introduced Gemini, a family of highly capable multimodal models developed in collaboration with Google Research. The models were built from the ground up to be multimodal and trained jointly across text, image, audio, and video data, so they can seamlessly generalize across modalities while exhibiting cutting-edge reasoning capabilities. Gemini advances the state of the art on 30 of 32 evaluation benchmarks (e.g., image and video understanding, audio processing), with impressive human-expert-level performance on the MMLU exam benchmark.
Model Architecture
Gemini models are Transformer decoders with architectural improvements and optimization enhancements for stable training at scale. They support a 32k-token context length and employ efficient attention mechanisms such as multi-query attention.
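To make the attention detail concrete, here is a minimal NumPy sketch of multi-query attention, in which all query heads share a single key/value head. The shapes, sizes, and weight layout are illustrative assumptions, not Gemini's actual configuration.

```python
# Minimal multi-query attention (MQA) sketch: many query heads, one shared K/V head.
# Sizes and weights are toy values for illustration only.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(x, Wq, Wk, Wv, Wo, num_heads):
    """x: (seq_len, d_model). Wq projects to num_heads * d_head; Wk and Wv to a single d_head."""
    seq_len, d_model = x.shape
    d_head = Wk.shape[1]                                 # single shared key/value head
    q = (x @ Wq).reshape(seq_len, num_heads, d_head)     # per-head queries
    k = x @ Wk                                           # shared keys:   (seq_len, d_head)
    v = x @ Wv                                           # shared values: (seq_len, d_head)
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # causal (decoder) mask
    outs = []
    for h in range(num_heads):
        scores = q[:, h, :] @ k.T / np.sqrt(d_head)
        scores = np.where(mask, scores, -1e9)
        outs.append(softmax(scores) @ v)
    return np.concatenate(outs, axis=-1) @ Wo            # (seq_len, d_model)

# toy usage
rng = np.random.default_rng(0)
d_model, d_head, heads, seq = 64, 16, 4, 10
x = rng.normal(size=(seq, d_model))
out = multi_query_attention(
    x,
    rng.normal(size=(d_model, heads * d_head)),
    rng.normal(size=(d_model, d_head)),
    rng.normal(size=(d_model, d_head)),
    rng.normal(size=(heads * d_head, d_model)),
    heads,
)
print(out.shape)  # (10, 64)
```

Compared with standard multi-head attention, sharing a single key/value head shrinks the key/value cache roughly by the number of heads, which is the main reason this kind of attention helps when serving long contexts.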
The models are trained to handle textual input interleaved with a variety of audio and visual inputs, such as natural images, charts, screenshots, PDFs, and videos, and they can produce text and image outputs (Fig. 1). The visual encoding builds on earlier foundational work on visual language models (e.g., Flamingo [3], CoCa [4], and PaLI [5]), with the models trained to be multimodal from the beginning and able to natively output images using discrete image tokens. For video understanding, video is encoded as a sequence of frames within the large context window. Audio is ingested directly as 16 kHz features from the Universal Speech Model (USM).
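As a rough illustration of what "interleaved" multimodal input might look like at the sequence level, here is a hypothetical Python sketch that flattens text, discrete image tokens, and video frames into a single token stream. The tokenizer, special token IDs, and chunk types are invented for illustration and are not Gemini's actual pipeline.

```python
# Hypothetical sketch: interleave text, image, and video chunks into one token sequence.
from dataclasses import dataclass
from typing import List, Union

@dataclass
class TextChunk:
    text: str

@dataclass
class ImageChunk:
    tokens: List[int]          # discrete image tokens, e.g. from a learned image tokenizer

@dataclass
class VideoChunk:
    frames: List[List[int]]    # each frame encoded as its own sequence of image tokens

IMG_START, IMG_END = 50000, 50001   # hypothetical special token IDs

def fake_text_tokenizer(text: str) -> List[int]:
    # stand-in for a real subword tokenizer
    return [ord(c) % 1000 for c in text]

def interleave(chunks: List[Union[TextChunk, ImageChunk, VideoChunk]]) -> List[int]:
    """Flatten text, image, and video chunks into one token stream, preserving order."""
    seq: List[int] = []
    for chunk in chunks:
        if isinstance(chunk, TextChunk):
            seq += fake_text_tokenizer(chunk.text)
        elif isinstance(chunk, ImageChunk):
            seq += [IMG_START] + chunk.tokens + [IMG_END]
        elif isinstance(chunk, VideoChunk):
            for frame in chunk.frames:           # video = sequence of frames in context
                seq += [IMG_START] + frame + [IMG_END]
    return seq

tokens = interleave([
    TextChunk("What is in this chart?"),
    ImageChunk(tokens=[7, 42, 99]),
    TextChunk("Answer:"),
])
print(len(tokens))
```

In the real system the image tokens would come from a learned visual encoder rather than being hand-specified, but the general idea of placing visual and video tokens inline with text in one long sequence is the same.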

Gemini 1.0, the first version of the Gemini models, comes in three main sizes that support a wide range of applications:
Ultra: the largest and most capable model for highly complex tasks, servable at scale on TPU accelerators
Pro: a model optimized for cost and latency that still delivers significant performance across a wide range of tasks, with strong reasoning performance and broad multimodal capabilities.
Nano: the most efficient model, targeting on-device tasks. It is distilled from the larger Gemini models and 4-bit quantized for deployment (see the quantization sketch after this list). It comes in two versions:
Nano-1 (1.8B): for low memory devices
Nano-2 (3.25B): for high memory devices.
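The report does not describe the quantization scheme beyond "4-bit", so the following is only a generic sketch of per-channel symmetric 4-bit weight quantization, the kind of technique typically used for on-device deployment, not Gemini Nano's actual method.

```python
# Generic per-channel symmetric 4-bit weight quantization sketch (illustrative only).
import numpy as np

def quantize_4bit(w: np.ndarray):
    """Quantize each output channel (row) of w to signed 4-bit integers in [-8, 7]."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # per-row scale factor
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximate float32 weight matrix from the 4-bit codes."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s)
print("max abs error:", np.abs(w - w_hat).max())
```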
Training Dataset
The models are trained on a multimodal and multilingual dataset comprising data from web documents, books, and code, including image, audio, and video data. The datasets were filtered for quality using both heuristics and model-based classifiers, and safety filters were applied to remove harmful content. The final data mixtures and weights were determined through ablations on smaller models. The authors found that data quality is critical to a high-performing model and note that finding the optimal dataset distribution for pretraining remains an open challenge.
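As a loose illustration of combining heuristic rules with a model-based quality classifier, here is a hypothetical filtering sketch; the rules, features, threshold, and classifier stand-in below are assumptions for illustration, not the filters actually used for Gemini.

```python
# Hypothetical document-quality filter: cheap heuristics plus a model-based score.
import re

def heuristic_ok(doc: str) -> bool:
    """Cheap rule-based checks: minimum length, alphabetic ratio, boilerplate markers."""
    if len(doc) < 200:
        return False
    alpha_ratio = sum(c.isalpha() for c in doc) / len(doc)
    if alpha_ratio < 0.6:
        return False
    if re.search(r"lorem ipsum|click here to subscribe", doc, re.IGNORECASE):
        return False
    return True

def classifier_score(doc: str) -> float:
    """Stand-in for a learned quality classifier returning P(high quality)."""
    # a real system would call a trained model here; this toy uses lexical diversity
    words = doc.split()
    return min(1.0, len(set(words)) / max(len(words), 1) + 0.3)

def keep(doc: str, threshold: float = 0.5) -> bool:
    """Keep a document only if it passes the heuristics and the classifier threshold."""
    return heuristic_ok(doc) and classifier_score(doc) >= threshold

docs = ["short", "a" * 300, " ".join(["useful", "well", "written", "content"] * 80)]
print([keep(d) for d in docs])
```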
Evaluation
Gemini was evaluated on various tasks, ranging from general capabilities, math, and coding to reasoning. The models' performance was compared against a suite of external LLMs, such as GPT-4/GPT-3.5, Claude 2, LLaMA-2, Inflection-2, Grok-1, and PaLM 2-L. Fig. 2 below shows Gemini Ultra's performance on text benchmarks compared to GPT-4.

Gemini Ultra scores 90% on MMLU (massive multitask language understanding), which combines 57 subjects such as math, physics, history, law, medicine, and ethics to test both world knowledge and problem-solving ability, outperforming human experts, who score 89.8%. Gemini performs better when used with chain-of-thought prompting: the model produces chains of thought for k = 8 or 32 samples, and if there is consensus above a threshold (chosen on the validation split), it selects that answer; otherwise, it reverts to a greedy, maximum-likelihood answer produced without chain-of-thought.
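The uncertainty-routed procedure described above can be sketched in a few lines. Here `sample_with_cot`, `greedy_answer`, and the fixed threshold are placeholders for real model calls and a tuned value, not Gemini's actual implementation.

```python
# Sketch of uncertainty-routed chain-of-thought: sample k CoT answers, accept the
# majority answer if its vote share clears a threshold, else fall back to greedy.
from collections import Counter
from typing import Callable, List

def uncertainty_routed_cot(
    question: str,
    sample_with_cot: Callable[[str], str],   # returns the final answer from one sampled CoT
    greedy_answer: Callable[[str], str],     # returns the maximum-likelihood answer, no CoT
    k: int = 32,
    threshold: float = 0.7,                  # in practice tuned on a validation split
) -> str:
    answers: List[str] = [sample_with_cot(question) for _ in range(k)]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    if top_count / k >= threshold:
        return top_answer                    # confident consensus across samples
    return greedy_answer(question)           # uncertain: revert to greedy decoding

# toy usage with stub "models"
import random
random.seed(0)
stub_cot = lambda q: random.choice(["A", "A", "A", "B"])
stub_greedy = lambda q: "A"
print(uncertainty_routed_cot("Which option is correct?", stub_cot, stub_greedy, k=8))
```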
Gemini was also evaluated on multimodal capabilities:
high-level object recognition using captioning or question-answering tasks such as VQAv2
fine-grained transcription using tasks such as TextVQA and DocVQA, which require the model to recognize low-level details
chart understanding requiring spatial understanding of input layout using ChartQA and InfographicVQA tasks
multimodal reasoning using tasks such as AI2D, MathVista, and MMMU
Fig. 3 presents the performance of Gemini Ultra in comparison with other models (e.g., GPT-4V, Whisper, Flamingo). Gemini achieves a state-of-the-art score of 59.4% on the new MMMU benchmark, which consists of multimodal tasks spanning different domains and requiring deliberate reasoning.

On the image benchmarks evaluated, Gemini Ultra outperformed previous state-of-the-art models without assistance from optical character recognition (OCR) systems. These benchmarks highlight Gemini's native multimodality and indicate early signs of its more complex reasoning abilities.
Deployment
The path to model deployment follows the iterative approach illustrated in Fig. 4, which focuses on Responsibility and Safety Governance. This approach consists of 5 main steps:
Impact Assessment: Identify, assess, and document key downstream societal benefits and harms associated with the development of advanced Gemini models; these assessments guide the mitigation and product delivery efforts.
Model Policy: Standardized criteria and a prioritization schema for responsible development that act as an indicator of launch readiness. Gemini model policies cover a number of domains, including child safety, hate speech, factual accuracy, fairness and inclusion, and harassment.
Evaluation: A suite of evaluations across the lifecycle of model development (i.e., development evaluations against external academic benchmarks, assurance evaluations against the Gemini policies, and external evaluations to identify blind spots).
Mitigations: Approaches to mitigate model harms across data, instruction tuning, and factuality.
Deployment: Upon the completion of reviews, approved models are released for production.
References
[1] Gemini: A Family of Highly Capable Multimodal Models (Google Technical Report)
[2] Introducing Gemini: our largest and most capable AI model (Google Blog)
[3] J.-B. Alayrac et al., Flamingo: a Visual Language Model for Few-Shot Learning, https://arxiv.org/abs/2204.14198, 2022.
[4] J. Yu, Z. Wang, et al., CoCa: Contrastive Captioners are Image-Text Foundation Models, https://arxiv.org/abs/2205.01917, 2022.
[5] X. Chen et al., PaLI: A Jointly-Scaled Multilingual Language-Image Model, https://arxiv.org/abs/2209.06794, 2022.