🔵 Gemma — Model Architecture Across Generations

Versions: 4  |  Team: Google DeepMind  |  Updated: April 2026

From Google DeepMind's first open-weight Gemini descendant to a fully Apache-licensed multimodal family — tracing four generations of architecture innovation.



📋 Executive Summary

This document covers four generations of the Gemma large language model family developed by Google DeepMind:

  • Gemma 1 — The foundation: decoder-only Transformer built on Gemini research and technology, GeGLU activation, RoPE, 256K SentencePiece vocab, up to 6T training tokens, 2B & 7B sizes
  • Gemma 2 — Architecture leap: alternating Sliding Window + Global attention, logit soft-capping, knowledge distillation from 27B teacher, double normalization, GQA for all sizes, 8K context
  • Gemma 3 — Multimodal expansion: SigLIP vision encoder, 128K long context, hybrid 5:1 local/global attention ratio, 262K vocab, 140+ languages, QAT support
  • Gemma 4 — Open-weight frontier: Apache 2.0 license, MoE variant (26B A4B), Per-Layer Embeddings (E2B/E4B), SigLIP 2 vision + native audio (E2B/E4B), 256K context, thinking mode, agentic tool use

📝 Note: The Gemma family also includes specialized variants — CodeGemma (code generation), PaliGemma (vision-language), RecurrentGemma (linear recurrent), and ShieldGemma (safety classifier) — which are documented separately. This document focuses on the core Gemma LLM architecture.


📅 Version Release Timeline

| Version | Release Date | Paper / Blog | Flagship Size | Training Tokens | Context Length | Headline Feature |
|:-------:|:----------:|:------------|:------------:|:--------------:|:------------:|:----------------|
| Gemma 1 | Feb 21, 2024 | [arXiv:2403.08295](https://arxiv.org/abs/2403.08295) | 7B | 6T | 8K | First open Gemini-based model |
| Gemma 2 | Jun 27, 2024 | [arXiv:2408.00118](https://arxiv.org/abs/2408.00118) | 27B | 13T | 8K | SWA + Global attn + KD |
| Gemma 3 | Mar 12, 2025 | [arXiv:2503.19786](https://arxiv.org/abs/2503.19786) | 27B | 14T | 128K | Multimodal + 140 languages |
| Gemma 4 | Apr 2, 2026 | [Model Card](https://ai.google.dev/gemma/docs/core/model_card_4) · [Blog](https://opensource.googleblog.com/2026/04/gemma-4-expanding-gemmaverse-apache-2.html) | 31B | — | 256K | Apache 2.0 · MoE · PLE · Audio · Thinking |

📊 Cross-Version Benchmark Comparison

Numbers reflect the best comparable model size across generations. Sources: official technical papers.

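| Benchmark | Gemma 1 (7B) | Gemma 2 (9B) | Gemma 2 (27B) | Gemma 3 (12B) | Gemma 3 (27B) | Gemma 4 (26B MoE) | Gemma 4 (31B) |
|:----------|:------------:|:------------:|:-------------:|:-------------:|:-------------:|:-----------------:|:-------------:|
| MMLU | 64.3 | 71.3 | 75.2 | 74.0 | 81.0 | 82.6 | 85.2 |
| HumanEval | 32.3 | 40.2 | 51.8 | 57.9 | 69.7 | ~87.0 | 91.4 |
| MATH | 24.3 | 36.7 | 42.4 | 43.3 | 67.6 | ~88.0 | 92.1 |
| GSM8K | 46.4 | 68.6 | 74.0 | 79.6 | 89.7 | 88.4 | 91.2 |
| Context Length | 8K | 8K | 8K | 128K | 128K | 256K | 256K |
| Training Tokens | 6T | 8T | 13T | 12T | 14T | — | — |
| Multimodal | — | — | — | ✅ (img) | ✅ (img) | ✅ (img) | ✅ (img) |
| Vocabulary | 256,128 | 256,128 | 256,128 | 262,144 | 262,144 | 262,144 | 262,144 |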

Gemma 1–3 base-model numbers are sourced from the official technical reports. Gemma 4 numbers are instruct-model results from the official model card and Google Open Source blog (Apr 2026). The MMLU row for Gemma 4 reflects MMLU Pro. HumanEval/MATH values for Gemma 4 26B MoE are approximate, drawn from multiple evaluations.


🏗️ Master Architecture Diagram

This diagram shows the core Transformer decoder architecture shared across all Gemma versions, with color-coded annotations indicating which generation introduced each component.

Gemma Architecture — Component Evolution Across Generations
🔵 Gemma 1 🔷 Gemma 2 🟦 Gemma 3 🟪 Gemma 4
Token Embedding — SentencePiece BPE
256,128 vocab (Gemma 1 & 2) → 262,144 vocab (Gemma 3) • Tied embeddings (input = output weights)
SigLIP Vision Encoder (Gemma 3) / SigLIP 2 (Gemma 4)
ViT-based, 896×896px, Pan & Scan multi-crop → 256 soft image tokens interleaved with text
Audio Encoder (Gemma 4 E2B / E4B only)
Native audio input; audio tokens interleaved with text tokens in the decoder
× N TRANSFORMER LAYERS
RMSNorm (Pre-Normalization) — Gemma 1
Self-Attention Block
🔵 Gemma 1: RoPE θ=10,000 · MHA on the 7B, MQA on the 2B (no QKV bias)
🔷 Gemma 2: GQA replaces MHA · SWA (odd layers, win=4096) · Global attn (even layers) · Attention logit soft-cap ±50
🟦 Gemma 3: RoPE θ=1,000,000 · 5:1 local/global ratio · 128K context
🟪 Gemma 4: p-RoPE p=0.25 (256K ctx) · K=V trick (global attn) · Shared KV cache
+ Residual Connection
RMSNorm (Post-Normalization) — Added in Gemma 2 (double norm)
Feed-Forward Block
🔵 Gemma 1–3: GeGLU Activation · Dense FFN (no MoE)
🟪 Gemma 4: MoE 128+1 experts, top-8 (26B A4B) · GeGLU retained across all generations
+ Residual Connection
Final Logit Soft-Capping — tanh(x/cap)×cap
Introduced in Gemma 2: cap=30 on the final logits, cap=50 on attention logits  |  Applied before softmax to stabilize training
LM Head — Tied with input embeddings → Next-token prediction
Multimodal input (Gemma 3) · KD from 27B teacher (Gemma 2) · Thinking mode / tool use (Gemma 4)
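
The soft-capping formula tanh(x/cap)×cap from the diagram can be sketched in a few lines of plain Python. This is an illustration rather than the Gemma implementation: the names `soft_cap` and `softmax` and the sample logits are assumptions made for the example, and `cap` is left as a parameter.

```python
import math

def soft_cap(logit: float, cap: float) -> float:
    """Smoothly squash a logit into (-cap, +cap) using cap * tanh(logit / cap)."""
    return cap * math.tanh(logit / cap)

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw logits with one extreme outlier.
raw = [120.0, 3.0, -2.0]
capped = [soft_cap(x, cap=30.0) for x in raw]   # cap value chosen for illustration
probs = softmax(capped)
print(capped)
print(probs)
```

Small logits pass through almost unchanged, while the outlier is squashed to just under the cap. Unlike a hard clip, the tanh stays differentiable everywhere, so gradients still flow for large logits.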

🔵 Gemma 1 — February 2024

📅 Released: February 21, 2024  |  📄 arXiv:2403.08295

Summary

  • Google DeepMind’s first open-weight model built directly on the Gemini architecture, making the Gemini-family techniques publicly accessible for research and commercial use — a landmark moment for the open AI ecosystem
  • Built on a decoder-only Transformer with causal attention masks, following the now-standard autoregressive language modeling paradigm established by GPT and refined through LLaMA and Gemini
  • Uses Multi-Head Attention (MHA) in the 7B, where all query, key, and value heads are independent, and Multi-Query Attention (MQA, a single shared KV head) in the 2B; neither variant uses QKV bias, which trims parameter overhead
  • Adopts Rotary Positional Embeddings (RoPE) with a base frequency of θ=10,000, the same default as LLaMA — providing relative position encoding that generalizes well to sequences up to the training length
  • Employs GeGLU activation in the FFN layers: GeGLU(x) = GELU(xW₁) ⊗ (xW₂) — this is distinct from SwiGLU used by LLaMA/Qwen, using GELU (Gaussian Error Linear Unit) as the gate rather than Swish, providing a smoother activation landscape
  • Does not yet use logit soft-capping: the tanh-based cap, softcap(x) = cap × tanh(x/cap), first appears in Gemma 2, so Gemma 1 feeds its scaled attention logits straight into softmax
  • Applies RMSNorm as pre-normalization before each attention and FFN sub-layer, following the LLaMA convention, plus a final RMSNorm before the LM head; the post-normalization half of the double-norm scheme arrives in Gemma 2
  • Uses a SentencePiece BPE tokenizer with a vocabulary of 256,128 tokens — dramatically larger than LLaMA’s 32K and Mistral’s 32K, providing far superior multilingual tokenization efficiency and byte-level coverage for rare characters
  • Ties input embeddings to output projection weights — the embedding matrix doubles as the LM head, halving the number of parameters in those layers and improving parameter efficiency especially at the 2B scale
  • Trained on up to 6 trillion tokens (6T for the 7B; 3T for the 2B per the revised technical report) of primarily English web text, code, and mathematical content, filtered and deduplicated using quality heuristics similar to those used for Gemini’s training pipeline
  • Released in two sizes: 2B (18 layers, 8 query heads sharing 1 KV head, head_dim=256, d_model=2048, intermediate_size=16,384) and 7B (28 layers, 16 heads, head_dim=256, d_model=3072, intermediate_size=24,576)
  • The head_dim=256 is notably larger than the typical 128 used by LLaMA at equivalent scales — this larger per-head dimension was carried over from Gemini’s design and allows each attention head to capture richer representations
  • Accompanied by instruction-tuned variants (Gemma 1-IT) aligned via Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) using a reward model trained on human preference data
  • Established the Gemma naming convention and the foundational technical choices (GeGLU, RoPE, a roughly 256K-token vocabulary, tied embeddings) that persist as defining characteristics of the Gemma family
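
The GeGLU formula in the list above, GeGLU(x) = GELU(xW₁) ⊗ (xW₂), can be sketched as follows. This is a plain-Python illustration on tiny matrices, not the Gemma code: `w_gate` and `w_up` stand in for W₁ and W₂, and the closing down-projection `w_down` is an assumption added to complete a conventional FFN block.

```python
import math

def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def matvec(x, w):
    """Row vector x (length d_in) times matrix w (d_in rows, d_out columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def geglu_ffn(x, w_gate, w_up, w_down):
    gate = [gelu(v) for v in matvec(x, w_gate)]   # GELU(x W1)
    up = matvec(x, w_up)                          # x W2
    hidden = [g * u for g, u in zip(gate, up)]    # element-wise product
    return matvec(hidden, w_down)                 # assumed projection back to d_model

# Toy sizes: d_model=2, intermediate=3 (hypothetical values).
x = [1.0, -2.0]
w_gate = [[0.5, -0.1, 0.3], [0.2, 0.4, -0.6]]
w_up = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
w_down = [[0.1, 0.2], [0.3, -0.4], [0.5, 0.6]]
y = geglu_ffn(x, w_gate, w_up, w_down)
print(y)
```

Swapping `gelu` for a Swish/SiLU function would turn the same skeleton into SwiGLU, which is the only structural difference between the two gated FFN variants.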

Architecture Diagram — Gemma 1

Gemma 1 — Transformer Decoder Block (per layer)
Input Token Embeddings (256,128 vocab) + RoPE (θ=10,000) — Tied with LM Head
RMSNorm (Pre-Norm — before attention)
Multi-Head Attention (MHA)
RoPE Positional Encoding No QKV Bias Causal Attention Mask
Scaled dot-product attention → softmax (no logit soft-capping in Gemma 1; it is added in Gemma 2)
2B: 8 Q heads sharing 1 KV head (MQA), head_dim=256, d_model=2048  |  7B: 16 Q heads (MHA), head_dim=256, d_model=3072
+ Residual Connection → RMSNorm (Pre-Norm before FFN)
Feed-Forward Network (Dense)
GeGLU Activation GELU(xW₁) ⊗ xW₂
2B: intermediate=16,384 (8× d_model)  |  7B: intermediate=24,576 (8× d_model)
+ Residual Connection
Final RMSNorm → output logits (no final logit soft-cap in Gemma 1)
LM Head (= Embedding matrix, tied) → Next Token Prediction
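
As a rough illustration of the RoPE step in the block above, the sketch below rotates consecutive coordinate pairs by a position-dependent angle with base θ=10,000. The function name `rope` and the interleaved pairing convention are assumptions; real implementations differ in how they pair coordinates.

```python
import math

def rope(vec, position, theta=10_000.0):
    """Rotate each pair (vec[i], vec[i+1]) by angle position * theta**(-i/d), i even."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        freq = theta ** (-i / d)        # higher pairs rotate more slowly
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

# The attention score depends only on the offset between the two positions.
near = dot(rope(q, 5), rope(k, 3))
far = dot(rope(q, 102), rope(k, 100))
print(near, far)
```

Both scores use an offset of 2 and agree up to floating-point error, which is the relative-position property the text attributes to RoPE.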

Community Perspective

  • Received enormous enthusiasm from the open-source AI community — it was the first time Google had released open-weight models derived from the Gemini architecture, democratizing access to Gemini-class techniques
  • The GeGLU activation (vs. SwiGLU in LLaMA) sparked debate; empirically the performance gap is small, but the theoretical distinction (GELU gate vs. Swish gate) was noted by researchers as a deliberate Gemini heritage choice
  • The logit soft-capping mechanism, which arrives with Gemma 2 rather than Gemma 1, was later praised as an elegant alternative to hard logit clipping: it prevents attention score explosion without a discrete cutoff, and practitioners noted more stable fine-tuning behavior
  • The 256K vocabulary was a major talking point — far larger than LLaMA’s 32K, it enables superior tokenization of non-Latin scripts, code, and mathematical notation, though it also means larger embedding layers
  • The 7B model was benchmarked as competitive with Mistral-7B and LLaMA-2-13B, both released months earlier, and the community quickly integrated Gemma into tools like LM Studio, Ollama, and llama.cpp
  • Some criticism focused on the limited 8K context window versus the 32K advertised for Mistral 7B; the training budget drew less complaint, since 6T tokens for the 7B is large next to the roughly 3T estimated for Mistral 7B
  • The tied embeddings design was appreciated for memory efficiency, particularly beneficial for the 2B model on consumer hardware
  • Commercial use is governed by the custom Gemma Terms of Use rather than a standard open-source license such as Apache 2.0, which was noted as a limitation for some use cases, especially next to Mistral’s Apache 2.0 releases
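
The memory argument for tied embeddings can be made concrete with simple arithmetic on the numbers quoted above (256,128-token vocabulary, d_model of 2048 and 3072). The helper name `embedding_params` is an assumption for the example.

```python
VOCAB = 256_128  # vocabulary size quoted in this section

def embedding_params(d_model: int, tied: bool) -> int:
    """Parameters in the input embedding plus LM head, with or without weight tying."""
    table = VOCAB * d_model
    return table if tied else 2 * table

for name, d_model in [("2B", 2048), ("7B", 3072)]:
    tied = embedding_params(d_model, tied=True)
    untied = embedding_params(d_model, tied=False)
    print(f"Gemma {name}: tied={tied:,}  untied={untied:,}  saved={untied - tied:,}")
```

With a vocabulary this large, the embedding table alone is roughly half a billion parameters at d_model=2048, which is why tying matters most for the smallest model.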

Model Variants

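| Model | Parameters | Layers | Q Heads | head_dim | d_model | Intermediate | Context | Training Tokens |
|:------|:----------:|:------:|:-------:|:--------:|:-------:|:------------:|:-------:|:----------------|
| Gemma-2B | 2B | 18 | 8 | 256 | 2048 | 16,384 | 8K | 3T |
| Gemma-7B | 7B | 28 | 16 | 256 | 3072 | 24,576 | 8K | 6T |
| Gemma-2B-IT | 2B | 18 | 8 | 256 | 2048 | 16,384 | 8K | 3T + SFT/RLHF |
| Gemma-7B-IT | 7B | 28 | 16 | 256 | 3072 | 24,576 | 8K | 6T + SFT/RLHF |
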
Key Industry Ideas Incorporated

| Technique | Origin | How Gemma 1 Used It |
|:----------|:-------|:--------------------|
| GeGLU Activation | Dauphin et al. (2017), Noam Shazeer (2020) | FFN gating function: a GELU-gated linear unit replaces the plain ReLU or GELU FFN, inherited from Gemini |
| RoPE | Su et al., "RoFormer" (2021) | Relative positional encoding applied to Q and K projections, θ=10,000 |
| RMSNorm | Zhang & Sennrich (2019) | Pre-normalization before each sub-layer for stable gradient flow |
| Logit Soft-Capping | Google DeepMind / Gemini (2023) | Not used in Gemma 1; the tanh-based cap is introduced in Gemma 2 |
| SentencePiece BPE | Kudo & Richardson (2018) | 256,128-token vocabulary with byte-level fallback for rare chars |
| Tied Embeddings | Press & Wolf (2017) | Input embedding matrix reused as output projection for parameter efficiency |
| SFT + RLHF | Ziegler et al. (2019), Ouyang et al. InstructGPT (2022) | Standard instruction tuning + human preference alignment pipeline |

🔷 Gemma 2 — June 2024

📅 Released: June 27, 2024  |  📄 arXiv:2408.00118

Summary

  • Most architecturally innovative Gemma release: introduced alternating Sliding Window Attention (SWA) and Global Attention — odd-indexed layers use local sliding window attention (window_size=4096) while even-indexed layers use full global attention, allowing the model to capture both local context and long-range dependencies efficiently within an 8K context window
  • Introduces logit soft-capping with two separate caps: ±50 for attention logits and ±30 for the final output logits; the tanh-based cap bounds both without the discontinuity of a hard clip, which would zero out gradients beyond the threshold
  • Grouped Query Attention (GQA) replaces MHA for all model sizes — the 2B model uses 8 Q heads / 4 KV heads, the 9B uses 16 Q heads / 8 KV heads, and the 27B uses 32 Q heads / 16 KV heads — dramatically reducing KV cache memory at inference time
  • Knowledge distillation is central to the training methodology: smaller models (2B and 9B) are distilled from the 27B teacher model using both token-level KL-divergence minimization and next-token prediction loss — achieving significantly better performance than training from scratch at equivalent compute
  • Expands to three model sizes: 2B, 9B, and 27B — the 27B represents a 4× scale-up from Gemma 1’s 7B, enabling much stronger reasoning and knowledge retention
  • Adds post-normalization (in addition to the existing pre-normalization from Gemma 1): both the output of attention and the output of the FFN are normalized before the residual addition — creating a double normalization scheme that significantly improves training stability at larger scales
  • GeGLU activation is retained across all layers and model sizes — a deliberate architectural continuity choice that preserves the Gemini heritage and avoids the overhead of architecture search
  • Context window of 8,192 tokens (8K) — unchanged from Gemma 1, but the alternating SWA pattern means local layers only attend to the nearest 4,096 tokens, while global layers have full 8K visibility, providing effective hierarchical attention coverage
  • The 2B model is reshaped relative to Gemma 1’s 2B: 26 layers (vs. 18) and d_model=2304 (vs. 2048), paired with a narrower FFN (intermediate=9,216 vs. 16,384), so capacity shifts from FFN width toward depth at a similar parameter count
  • 9B model dimensions: 42 layers, d_model=3,584, intermediate=14,336 — the 42-layer depth is notable as a non-standard choice (most 9B models use 32-36 layers), reflecting the trade-off toward depth over width in Gemma’s design philosophy
  • 27B model dimensions: 46 layers, d_model=4,608, intermediate=36,864 — trained on a full 13T tokens, making it the most data-rich Gemma model at launch
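  • Training data scales with model size: 2B trained on 2T tokens, 9B on 8T tokens, 27B on 13T tokens; each budget is far beyond the roughly 20 tokens per parameter that Chinchilla-style compute-optimal scaling would suggest, trading extra training compute for stronger small models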
  • Post-training alignment uses a combination of SFT, RLHF, and distillation-based alignment: the instruction-tuned models inherit behavioral priors from the 27B-IT teacher via distillation, enabling smaller instruct models to match the alignment quality of larger ones
  • Significantly outperforms Gemma 1 across all benchmarks: GSM8K improves from 46.4 (7B) to 68.6 (9B), MMLU from 64.3 to 71.3 — demonstrating that the SWA + GQA + distillation innovations translate directly to capability improvements
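
The alternating local/global pattern described above can be sketched as a per-layer choice of attention mask. This is a minimal plain-Python illustration; the helper names are assumptions, and the odd-local/even-global layer numbering follows the description in this document.

```python
def attention_mask(seq_len, window=None):
    """mask[q][k] is True when query q may attend to key k (causal, optionally windowed)."""
    return [
        [k <= q and (window is None or q - k < window) for k in range(seq_len)]
        for q in range(seq_len)
    ]

def layer_kind(layer_number):
    """1-based layer numbering: odd layers are local, even layers are global."""
    return "local" if layer_number % 2 == 1 else "global"

def mask_for_layer(layer_number, seq_len, window=4096):
    local = layer_kind(layer_number) == "local"
    return attention_mask(seq_len, window if local else None)

# Tiny example: 6 tokens, window of 3, so the difference is visible.
for layer in (1, 2):
    print(layer_kind(layer), mask_for_layer(layer, seq_len=6, window=3)[5])
```

In the local layer the last token sees only the three most recent positions; in the global layer it sees the whole prefix. Information from older tokens still reaches it through the interleaved global layers.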

Architecture Diagram — Gemma 2

Gemma 2 — Alternating Attention + Knowledge Distillation
✨ KEY INNOVATION
Alternating Attention Pattern
ODD Layers (1,3,5...): Local SWA · Window = 4,096 tokens · Only attends to recent tokens · O(n × w) complexity
EVEN Layers (2,4,6...): Global Attention · Full 8,192-token context · All-to-all attention · O(n²) complexity
📐 Local layers: fast, O(n×w) — Global layers: expressive, O(n²) — Alternation balances efficiency vs. recall
Grouped Query Attention (GQA) — All Sizes
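
| Model | Q Heads | KV Heads | GQA Ratio | Layers | d_model |
|:------|:-------:|:--------:|:---------:|:------:|:-------:|
| Gemma 2 — 2B | 8 | 4 | 2:1 | 26 | 2,304 |
| Gemma 2 — 9B | 16 | 8 | 2:1 | 42 | 3,584 |
| Gemma 2 — 27B | 32 | 16 | 2:1 | 46 | 4,608 |
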
✨ DISTILLATION STRATEGY
Knowledge Distillation Pipeline
Teacher: Gemma 2 27B · Trained from scratch
Student: 9B · KL(teacher ∥ student) + NTP loss
Student: 2B · Distilled from 27B + NTP loss
Token-level KL divergence + standard cross-entropy. Smaller models learn soft probability distributions from teacher, not just hard labels.
⚙️ Double Normalization (Pre + Post)
Pre-Norm (before attn/FFN) + Post-Norm (NEW in Gemma 2) = Stable large-scale training
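
The distillation objective described in the pipeline above (token-level KL divergence plus next-token cross-entropy) can be sketched for a single token position as follows. The mixing weight `kd_weight` and all function names are assumptions for illustration; the document does not state how the two terms are weighted.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, target_index, kd_weight=0.5):
    p_teacher = softmax(teacher_logits)
    p_student = softmax(student_logits)
    kd = kl_divergence(p_teacher, p_student)      # match the teacher's soft distribution
    ce = -math.log(p_student[target_index])       # standard next-token prediction loss
    return kd_weight * kd + (1.0 - kd_weight) * ce

teacher = [2.0, 1.0, 0.1]      # hypothetical logits over a 3-token vocabulary
student = [1.2, 1.1, 0.3]
print(distillation_loss(student, teacher, target_index=0))
```

The KL term carries information the hard label cannot: how much probability the teacher assigns to every alternative token, not just which token was correct.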

Community Perspective

  • The alternating SWA + global attention pattern was seen as an elegant engineering solution: it cuts attention compute and KV-cache size in the local half of the layers while preserving long-range modeling through the interleaved global layers
  • Knowledge distillation from 27B gave the 2B and 9B models capabilities well above their parameter count would suggest — the community quickly noticed that Gemma 2 2B outperformed many 7B models from competing families
  • The double normalization (pre + post) was unusual and prompted analysis — researchers found it improves gradient flow significantly at 27B scale, but adds marginal overhead at 2B
  • The GQA ratio of 2:1 (e.g., 16 Q heads, 8 KV heads) is more conservative than Qwen2’s 7:1 or 8:1 ratios — indicating Google’s preference for maintaining attention quality over maximizing KV cache reduction
  • Gemma 2 27B surpassing Llama 3 70B on several benchmarks despite being less than half the size was a landmark result that cemented Gemma 2 as a top-tier open-weight family
  • The soft-capping of final logits at ±30 (tighter than the ±50 used for attention) was noted as a practical detail: it prevents overconfident probability distributions at the output while leaving attention scores more dynamic range
  • Community noted that the 8K context window felt limiting given that contemporaries like LLaMA 3.1 and Qwen 2 were pushing to 128K; this became a major point of improvement for Gemma 3
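
To put the 2:1 GQA ratio discussed above in numbers, the sketch below estimates KV-cache size for the 9B configuration (42 layers, 8 KV heads versus 16 for an MHA equivalent) at the full 8K context. Note the assumptions: head_dim=256 and 2-byte values are not stated in this section, and the estimate ignores that sliding-window layers only need to cache their 4,096-token window.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache K and V in every layer at full sequence length."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# 9B layer count and 8K context from the tables; head_dim is an assumption.
LAYERS, HEAD_DIM, SEQ_LEN = 42, 256, 8192

gqa = kv_cache_bytes(LAYERS, kv_heads=8, head_dim=HEAD_DIM, seq_len=SEQ_LEN)
mha = kv_cache_bytes(LAYERS, kv_heads=16, head_dim=HEAD_DIM, seq_len=SEQ_LEN)
print(f"GQA (8 KV heads):  {gqa / 2**30:.3f} GiB")
print(f"MHA (16 KV heads): {mha / 2**30:.3f} GiB")
```

Because cache size is linear in the number of KV heads, a 2:1 ratio halves it; more aggressive ratios such as 8:1 would save more at some cost in attention capacity.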

Model Variants

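| Model | Parameters | Layers | Q Heads / KV Heads | d_model | Intermediate | Context | Training Tokens |
|:------|:----------:|:------:|:------------------:|:-------:|:------------:|:-------:|:----------------|
| Gemma2-2B | 2B | 26 | 8 / 4 | 2,304 | 9,216 | 8K | 2T |
| Gemma2-9B | 9B | 42 | 16 / 8 | 3,584 | 14,336 | 8K | 8T |
| Gemma2-27B | 27B | 46 | 32 / 16 | 4,608 | 36,864 | 8K | 13T |
| Gemma2-2B-IT | 2B | 26 | 8 / 4 | 2,304 | 9,216 | 8K | 2T + SFT/RLHF/KD |
| Gemma2-9B-IT | 9B | 42 | 16 / 8 | 3,584 | 14,336 | 8K | 8T + SFT/RLHF/KD |
| Gemma2-27B-IT | 27B | 46 | 32 / 16 | 4,608 | 36,864 | 8K | 13T + SFT/RLHF |
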
Key Industry Ideas Incorporated

| Technique | Origin | How Gemma 2 Used It |
|:----------|:-------|:--------------------|
| Sliding Window Attention | Longformer (Beltagy et al., 2020), Mistral (2023) | Odd-indexed layers use SWA with window_size=4096 for O(n×w) local attention |
| Global Attention Interleaving | LongT5 (Guo et al., 2022), BigBird (Zaheer et al., 2020) | Even-indexed layers use full global attention to maintain long-range coherence |
| Knowledge Distillation | Hinton et al. (2015), DistilBERT (2019) | Soft-label KL divergence from 27B teacher to 2B and 9B students during pre-training |
| GQA | Ainslie et al. (2023) | Replaces MHA for all sizes; 2:1 ratio balances quality vs. KV cache efficiency |
| Double Normalization (Pre + Post) | Gemini (Google, 2023) | Applies RMSNorm both before and after attention and FFN sub-layers |
| Logit Soft-Capping | Gemini (Google, 2023) | Separate caps for attention logits (±50) vs. final output logits (±30) |
| RLHF with KD | Anthropic, OpenAI | Instruction-tuned variants combine human preference data with distillation from 27B-IT |

🟦 Gemma 3 — March 2025

📅 Released: March 12, 2025  |  📄 arXiv:2503.19786

Summary

  • Major multimodal expansion: introduces a native SigLIP-based vision encoder for 4B, 12B, and 27B models — images are processed at 896×896 pixel resolution through a ViT (Vision Transformer) backbone that produces 256 soft image tokens, which are directly interleaved with text tokens in the decoder, enabling unified multimodal understanding without a separate cross-attention module
  • Pan & Scan multi-crop strategy handles variable-resolution images: the input image is divided into non-overlapping crops that together cover the whole image, each crop is resized to 896×896 and processed independently by the SigLIP encoder, and the resulting token blocks are concatenated; this allows arbitrary-aspect-ratio inputs and high-fidelity processing of text in images, charts, and photographs
  • Context window expands dramatically: 128K tokens for the 4B, 12B, and 27B models (a 16× increase from Gemma 2’s 8K), and 32K for the 1B text-only model — enabling long document analysis, multi-turn conversations, and extended code context
  • RoPE base frequency scaled to θ=1,000,000 on the global attention layers (up from 10,000 in Gemma 1 and 2; the local sliding-window layers keep θ=10,000): this 100× increase is critical for supporting 128K context, because a larger base lengthens the rotation wavelengths so that far-apart positions stay distinguishable instead of aliasing
  • Hybrid attention with 5:1 local-to-global ratio: five consecutive sliding window attention layers are followed by one full global attention layer — compared to Gemma 2’s 1:1 alternating pattern, this 5:1 ratio dramatically reduces the computational cost of attention for long sequences while the periodic global layers maintain coherence
  • Local SWA layers use a sliding window of 1,024 tokens, smaller than Gemma 2’s 4,096, reflecting the 5:1 ratio design where local layers prioritize speed and a small KV cache
  • Upgraded tokenizer derived from the Gemini 2.0 tokenizer with an expanded 262,144-token vocabulary (up from 256,128 in Gemma 1/2) — supporting 140+ languages with improved tokenization efficiency for multilingual text, including better coverage for CJK characters and low-resource languages
  • Expands to four model sizes: 1B (text-only), 4B, 12B, and 27B — the 1B model fills the on-device/mobile niche while the 12B fills the gap between 9B (Gemma 2) and 27B for production inference
  • GQA across all sizes: 4B (8Q/4KV), 12B (16Q/8KV), and 27B (32Q/16KV) keep Gemma 2’s 2:1 Q-to-KV ratio, while the 1B uses 4 query heads sharing a single KV head
  • Gemma 3 4B-IT is competitive with Gemma 2 27B-IT, a 6.75× reduction in parameter count (27B / 4B) attributed to knowledge distillation and the improved architecture (hybrid attention + long context)
  • Training budgets scale with model size: 2T tokens for the 1B, 4T for the 4B, 12T for the 12B, and 14T for the 27B (vs. 2T to 13T in Gemma 2), covering 140+ languages, a substantial multilingual expansion over Gemma 2’s primarily English-focused pre-training corpus
  • Post-training pipeline includes: (1) large-scale SFT on instruction-following, code, math, and tool use; (2) RLHF with human preference data; (3) knowledge distillation from the 27B-IT model to smaller models — the same three-stage approach as Gemma 2 but at larger scale
  • Quantization-Aware Training (QAT) support is built into the training pipeline — official INT4 and INT8 quantized variants are released alongside the float16 checkpoints, enabling deployment on mobile and edge devices without significant quality degradation
  • ShieldGemma 2, a 4B companion image-safety classifier built on Gemma 3, is released alongside; it is trained to flag harmful content in model inputs and outputs, providing an integrated responsible AI toolkit for production deployments
  • Knowledge distillation continues from 27B to smaller models, with the 4B and 12B distilled models benefiting from both the 27B pre-training teacher and the 27B-IT alignment teacher
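
The effect of the 5:1 local/global pattern on long-context memory can be sketched with a simple count of cached positions. This is illustrative only: the helper names are assumptions, as is placing the global layer last in each group of six. It uses the 12B layer count (48) and a 1,024-token window.

```python
def layer_schedule(n_layers, local_per_global=5):
    """Repeat [local x 5, global x 1]; the global layer is assumed to close each group."""
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local" for i in range(n_layers)]

def cached_positions(schedule, seq_len, window):
    """Total token positions held in the KV cache, summed over layers."""
    return sum(seq_len if kind == "global" else min(seq_len, window) for kind in schedule)

schedule = layer_schedule(48)
hybrid = cached_positions(schedule, seq_len=131_072, window=1_024)
all_global = cached_positions(["global"] * 48, seq_len=131_072, window=1_024)
print(schedule[:6])
print(f"hybrid: {hybrid:,}  all-global: {all_global:,}  ratio: {hybrid / all_global:.3f}")
```

At 128K tokens almost all of the cache belongs to the eight global layers, which is why the ratio of local to global layers matters more than the exact window size.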

Architecture Diagram — Gemma 3

Gemma 3 — Multimodal Architecture with Hybrid Attention
✨ NEW: VISION
SigLIP Vision Encoder (4B / 12B / 27B only)
Input Image
Any resolution
Pan & Scan crops
SigLIP ViT
896×896px
per crop
256 image tokens
per crop
interleaved with text
Image tokens and text tokens are concatenated in the same sequence — no separate cross-attention module needed
✨ 5:1 HYBRID ATTENTION
Hybrid Attention Pattern (5 Local + 1 Global)
SWA Local (win=1K) ×5 → Global (Full, 128K ctx) → ... (pattern repeats)
🔵 5× Local SWA: O(n × w), fast, captures syntax
🟦 1× Global: O(n²), slow, captures semantics
🔭 Long Context Engineering
128K: Max context (4B/12B/27B)
θ=10⁶: RoPE base freq (100× Gemma 2)
5:1: Local : Global attn ratio
4 SIZES
Model Size Configurations
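
| Model | Layers | d_model | Q/KV Heads | Intermediate | Context | Vision |
|:------|:------:|:-------:|:----------:|:------------:|:-------:|:------:|
| Gemma 3 — 1B | 26 | 1,152 | 4 / 1 | 6,912 | 32K | — |
| Gemma 3 — 4B | 34 | 2,560 | 8 / 4 | 10,240 | 128K | ✅ |
| Gemma 3 — 12B | 48 | 3,840 | 16 / 8 | 15,360 | 128K | ✅ |
| Gemma 3 — 27B | 62 | 5,376 | 32 / 16 | 21,504 | 128K | ✅ |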
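
One way to see why the RoPE base matters for 128K context is to compute the longest rotation wavelength each base provides. The sketch below is a back-of-the-envelope illustration: `longest_wavelength` is a hypothetical helper, and head_dim=256 is an assumption not stated in this section.

```python
import math

def longest_wavelength(theta: float, head_dim: int) -> float:
    """Tokens needed for the slowest-rotating RoPE pair to complete one full turn."""
    i = head_dim // 2 - 1                      # index of the slowest pair
    freq = theta ** (-2 * i / head_dim)
    return 2 * math.pi / freq

CONTEXT = 131_072  # 128K tokens
for theta in (10_000.0, 1_000_000.0):
    wl = longest_wavelength(theta, head_dim=256)
    verdict = "wraps inside" if wl < CONTEXT else "spans beyond"
    print(f"theta={theta:>11,.0f}: longest wavelength ~{wl:,.0f} tokens ({verdict} 128K)")
```

Under these assumptions every rotation pair at θ=10,000 completes a full turn well within 128K tokens, so distant positions can alias; at θ=1,000,000 the slowest pairs do not wrap within the window.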

Community Perspective

  • The 4B-IT rivaling the 27B-IT from Gemma 2 was the headline result that surprised the community: a 6.75× reduction in parameter count in one generation, attributed to improved hybrid attention and aggressive distillation
  • The SigLIP vision integration was welcomed as a clean multimodal architecture — interleaving image tokens directly with text tokens (rather than a separate cross-attention module as in LLaVA-style models) is simpler and scales better
  • Pan & Scan image processing was praised for its practical handling of high-resolution and variable-aspect-ratio images — crucial for document understanding, screenshots, and diagrams
  • The 5:1 local/global attention ratio represents a different architectural bet than Gemma 2’s 1:1 alternating pattern — community experiments showed 5:1 is more compute-efficient for long sequences while maintaining comparable quality
  • The 128K context window finally brought Gemma up to parity with LLaMA 3.1 and Qwen 2 on long-context tasks — a relief for practitioners who found Gemma 2’s 8K limiting
  • QAT support with officially released INT4/INT8 quantized models was praised as a responsible release practice — many users immediately deployed the 4B INT4 model on consumer laptops and mobile devices
  • ShieldGemma as a companion safety classifier was lauded as part of a responsible AI deployment toolkit — though some noted that bundling safety tools separately from the model itself leaves room for misuse
  • 140+ language support represented a massive step forward for multilingual use cases, with the community noting improved tokenizer efficiency and generation quality for low-resource languages compared to Gemma 2
  • The 1B text-only model with 32K context was positioned as the best-in-class on-device LLM, fitting comfortably in 4-bit quantized form on smartphones — sparking discussions about on-device AI applications
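
Quantization-aware training, mentioned in the QAT bullet above, is commonly built around "fake quantization": weights are rounded to the low-precision grid in the forward pass while training continues in floating point. The sketch below shows one common formulation (symmetric, per-tensor, INT4); it is not a description of Gemma 3's actual recipe, and the function name is an assumption.

```python
def fake_quantize(weights, bits=4):
    """Round weights to a symmetric integer grid, then map them back to floats."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for INT4
    qmin = -qmax - 1                           # -8 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(qmin, min(qmax, round(w / scale))) * scale for w in weights]
    return quantized, scale

weights = [0.7, -0.35, 0.02, 0.0]              # hypothetical weight values
quantized, scale = fake_quantize(weights, bits=4)
for w, q in zip(weights, quantized):
    print(f"{w:+.3f} -> {q:+.3f}  (error {abs(w - q):.3f})")
```

During training the model sees the rounded values, so it learns weights that remain accurate after the real conversion to INT4. In practice gradients are passed through the rounding step unchanged (a straight-through estimator).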

Model Variants

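| Model | Parameters | Layers | Q Heads / KV Heads | d_model | Intermediate | Context | Vision | Training Tokens |
|:------|:----------:|:------:|:------------------:|:-------:|:------------:|:-------:|:------:|:----------------|
| Gemma3-1B | 1B | 26 | 4 / 1 | 1,152 | 6,912 | 32K | — | 2T |
| Gemma3-4B | 4B | 34 | 8 / 4 | 2,560 | 10,240 | 128K | ✅ | 4T |
| Gemma3-12B | 12B | 48 | 16 / 8 | 3,840 | 15,360 | 128K | ✅ | 12T |
| Gemma3-27B | 27B | 62 | 32 / 16 | 5,376 | 21,504 | 128K | ✅ | 14T |
| Gemma3-1B-IT | 1B | 26 | 4 / 1 | 1,152 | 6,912 | 32K | — | 2T + SFT/RLHF |
| Gemma3-4B-IT | 4B | 34 | 8 / 4 | 2,560 | 10,240 | 128K | ✅ | 4T + SFT/RLHF/KD |
| Gemma3-12B-IT | 12B | 48 | 16 / 8 | 3,840 | 15,360 | 128K | ✅ | 12T + SFT/RLHF/KD |
| Gemma3-27B-IT | 27B | 62 | 32 / 16 | 5,376 | 21,504 | 128K | ✅ | 14T + SFT/RLHF |
| Gemma3-1B-PT (QAT) | 1B | 26 | 4 / 1 | 1,152 | 6,912 | 32K | — | INT4/INT8 quantized |
| Gemma3-4B-PT (QAT) | 4B | 34 | 8 / 4 | 2,560 | 10,240 | 128K | ✅ | INT4/INT8 quantized |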

Companion model: ShieldGemma 2 (4B image safety classifier built on Gemma 3), released alongside Gemma 3 for responsible deployment.

Key Industry Ideas Incorporated

| Technique | Origin | How Gemma 3 Used It |
|:----------|:-------|:--------------------|
| SigLIP Vision Encoder | Zhai et al., "Sigmoid Loss for Language Image Pre-Training" (2023) | ViT-based image encoder replacing CLIP for image-text alignment; produces 256 tokens per 896×896 crop |
| Pan & Scan Multi-Crop | Google (PaLI-X, 2023) | Variable-resolution image tiling strategy: split the image into non-overlapping crops, each resized to 896×896 |
| 5:1 Local/Global Hybrid Attention | Gemma 2 (alternating), Longformer (2020) | Extended to 5:1 ratio for more efficient long-context processing; 1M-token experiments |
| High RoPE Base Frequency (θ=10⁶) | LLaMA 3 (Meta, 2024), Code LLaMA (2023) | Scaling θ from 10,000 to 1,000,000 for faithful 128K context without frequency aliasing |
| Knowledge Distillation (pre-training + alignment) | Hinton et al. (2015), Gemma 2 (2024) | Multi-stage KD: 27B teacher used for both pre-training token distribution and RLHF alignment of smaller models |
| Quantization-Aware Training (QAT) | GPTQ (Frantar et al., 2023), LLM.int8() (Dettmers et al., 2022) | INT4/INT8 quantization awareness built into training for official quantized releases with little quality degradation |
| SFT + RLHF + KD Alignment Pipeline | InstructGPT (2022), Gemma 2 (2024) | Three-stage post-training: SFT → RLHF → knowledge distillation from 27B-IT teacher |
| ShieldGemma Safety Classifier | LlamaGuard (Meta, 2024) | Companion safety model trained specifically on Gemma 3's output distribution for content moderation |
| Expanded Multilingual Tokenizer | Gemini 2.0 tokenizer (Google, 2024) | 262,144-token vocabulary from Gemini 2.0, covering 140+ languages with improved subword compression |

🟪 Gemma 4 — April 2026

📅 Released: April 2, 2026  |  📄 Model Card  |  🔗 Google Open Source Blog

Summary

  • Historic license change to Apache 2.0: for the first time in the Gemma family, all four Gemma 4 model variants are released under the fully permissive Apache 2.0 license — replacing the custom Gemma Terms of Use from all previous generations — enabling unrestricted commercial use, fine-tuning, redistribution, and enterprise deployment without additional agreements
  • Four-model family spanning from on-device to workstation: E2B (~2.3B effective / 5.1B total), E4B (~4.5B effective / 8B total), 26B A4B (Mixture-of-Experts, ~4B active / 25.2B total), and 31B dense — covering the full deployment spectrum from smartphones to data center GPUs
  • Mixture-of-Experts (MoE) architecture in the 26B A4B: introduces Gemma’s first MoE model, with 128 regular experts plus 1 shared expert per MoE layer, routing 8 experts per token — enabling near-27B-quality output at roughly 4B-equivalent inference cost
  • Per-Layer Embeddings (PLE) for E2B and E4B: instead of a single shared embedding table, each transformer layer has its own lightweight embedding lookup (PLE dimensions: vocab × 256 × n_layers), injected after each attention/FFN block — enabling very high intelligence-per-parameter on edge hardware; PLE tables are designed to reside in flash memory rather than VRAM
  • Native audio input for E2B and E4B: the smallest two models accept audio tokens directly (in addition to text and images), making Gemma 4 the first core Gemma generation with audio-native edge models
  • Upgraded SigLIP 2 vision encoder across all model sizes: variable-resolution tile processing (up to 896×896 per tile, multi-tile per prompt) — building on Gemma 3’s SigLIP backbone with improved alignment training
  • Thinking mode via a <|think|> special token: models can be prompted to produce step-by-step chain-of-thought reasoning traces before the final answer, enabling stronger performance on math and reasoning benchmarks without a separate reasoning model
  • Agentic capabilities: native support for function calling, tool use, planning, and system prompts (<|system|> role), enabling Gemma 4 models as drop-in agents in orchestration frameworks
  • Context window extends to 256K tokens for 26B A4B and 31B models (128K for E2B/E4B) — up from 128K in Gemma 3’s largest models — enabled by Proportional RoPE (p-RoPE)
  • Hybrid local/global attention continues the Gemma 3 5:1 pattern with refinements: local window shrinks to 512 tokens for E2B (4 local + 1 global per block) and 1,024 tokens for 26B/31B (5:1); global layers use the K=V trick (Keys = Values, halving KV cache at those layers) and Shared KV Cache (last N layers reuse K/V from previous same-type layers)
  • Proportional RoPE (p-RoPE): for long-context (256K) global attention layers, only a fraction p=0.25 of the RoPE coordinate pairs receive positional encoding — limiting positional noise in long sequences while preserving semantic tracking
  • GQA pattern refined: local layers use 2 Q heads per 1 KV head; global layers can use up to 8 Q heads per 1 KV head — further reducing KV cache memory
  • MMLU Pro 85.2% (31B) — state-of-the-art among Apache 2.0 open models at launch, outperforming Llama 4 Scout and Qwen 3 32B on several reasoning benchmarks; MATH 500 92.1% (31B); HumanEval 91.4% (31B)
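
A rough sketch of the Proportional RoPE idea described in the list above: only a fraction p of the coordinate pairs is rotated, and the rest carry no positional signal. The text does not say which pairs are rotated; this sketch assumes the first (highest-frequency) pairs. The function name `p_rope`, the interleaved pairing, and the θ value are likewise assumptions.

```python
import math

def p_rope(vec, position, p=0.25, theta=1_000_000.0):
    """Apply RoPE to the first p fraction of coordinate pairs; leave the rest untouched."""
    d = len(vec)
    n_pairs = d // 2
    n_rotated = int(p * n_pairs)          # assumption: rotate the highest-frequency pairs
    out = list(vec)
    for k in range(n_rotated):
        angle = position * theta ** (-2 * k / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * k], vec[2 * k + 1]
        out[2 * k], out[2 * k + 1] = x * c - y * s, x * s + y * c
    return out

vec = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]   # 4 pairs, so p=0.25 rotates 1 pair
rotated = p_rope(vec, position=7, p=0.25)
print(rotated)
```

The unrotated coordinates behave identically at every position, which gives the global layers a purely content-based channel that is unaffected by sequence length.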

Architecture Diagram — Gemma 4

Gemma 4 — Multi-Modal Architecture with MoE, PLE, and Long Context
Inputs:

  • Text tokens (all variants): 262,144-entry vocabulary
  • SigLIP 2 vision (all variants): 896×896 multi-tile
  • Audio (E2B / E4B only): native audio tokens

✨ NEW: Per-Layer Embeddings (PLE), E2B / E4B only

Each transformer layer fetches a layer-specific token embedding from a flash-resident PLE table (vocab × 256 × n_layers) and injects it after attention/FFN via a lightweight residual block. This boosts intelligence-per-parameter on edge hardware without increasing active VRAM requirements.
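
To make the PLE table shape concrete, the sketch below computes vocab × 256 × n_layers for the two edge models, using the 262,144-entry vocabulary and the layer counts (35 and 42) given in this document. The injection step is a toy version: the source only describes a "lightweight residual block", so the linear projection `proj` and the helper names are assumptions for illustration.

```python
VOCAB_SIZE = 262_144   # from the diagram above
PLE_DIM = 256          # per-layer embedding width from the PLE table shape

def ple_table_params(n_layers, vocab_size=VOCAB_SIZE, ple_dim=PLE_DIM):
    """Number of entries in a vocab x ple_dim x n_layers PLE table."""
    return vocab_size * ple_dim * n_layers

def inject_ple(hidden, ple_vec, proj):
    """Toy residual injection: hidden + proj @ ple_vec.

    `proj` (d_model rows of ple_dim weights) is a hypothetical projection;
    the actual residual block is not specified in the source.
    """
    return [h + sum(w * e for w, e in zip(row, ple_vec))
            for h, row in zip(hidden, proj)]

for name, n_layers in {"E2B": 35, "E4B": 42}.items():
    print(name, f"{ple_table_params(n_layers):,}")
# E2B 2,348,810,240
# E4B 2,818,572,288

# Tiny example with d_model = 2 and ple_dim = 3.
print(inject_ple([1.0, 2.0], [1.0, 0.0, 2.0], [[1.0, 0.0, 0.0], [0.0, 0.0, 0.5]]))
# [2.0, 3.0]
```

Because each layer reads only one 256-wide row per token, the lookup touches a small slice of the table at a time, which is what makes keeping the table in flash rather than VRAM plausible.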

✨ Hybrid Attention (5 Local + 1 Global) with Memory Optimizations: 5:1 + K=V + Shared KV

Layer pattern, repeated through the stack: 5× SWA Local (win=1K) → 1× Global (K=V trick, Shared KV) → ...

  • 🔵 5× Local SWA: GQA 2:1 Q/KV, fast
  • 🟪 1× Global: K=V, Shared KV, p-RoPE (p=0.25 for 256K)
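
The 5 local + 1 global pattern can be sketched as a layer schedule plus an attention-visibility rule. This is a minimal illustration, assuming the global layer is the last one in each group of six and that a window of 1,024 tokens includes the query token itself. Neither detail is confirmed by the source, and both function names are hypothetical.

```python
def layer_schedule(n_layers, local_per_global=5):
    """Label each layer 'local' or 'global': N local layers, then 1 global, repeated."""
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local" for i in range(n_layers)]

def can_attend(query_pos, key_pos, kind, window=1024):
    """Causal visibility; local layers also require the key to be inside the window."""
    if key_pos > query_pos:
        return False                      # no attention to future tokens
    if kind == "global":
        return True                       # global layers see the whole prefix
    return query_pos - key_pos < window   # sliding-window layers see recent tokens only

print(layer_schedule(12))
print(can_attend(2000, 0, "local"), can_attend(2000, 0, "global"))
# False True
```

Under this rule a local layer never needs to cache more than `window` positions, while only the global layers scale with the full context length.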

✨ NEW: Mixture-of-Experts FFN (26B A4B only)

  • 128 experts + 1 shared
  • Top-8 routing per token
  • ~4B active parameters out of 25.2B total
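
Top-8 routing over 128 experts plus one shared expert can be sketched as follows. This is an illustrative toy, not the Gemma 4 router. Normalizing with a softmax over only the selected logits, and adding the shared expert's output unweighted, are assumptions, and `route_token` and `moe_ffn` are hypothetical names.

```python
import math

def route_token(router_logits, top_k=8):
    """Pick the top_k experts for one token and return (index, weight) pairs.

    Weights are a softmax over the selected logits only (an assumption).
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_ffn(x, router_logits, experts, shared_expert, top_k=8):
    """Shared expert output plus the weighted sum of the routed experts."""
    out = shared_expert(x)
    for idx, weight in route_token(router_logits, top_k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Toy setup: 128 experts that each scale the input, one shared identity expert.
experts = [(lambda x, s=i / 128: [s * v for v in x]) for i in range(128)]
shared = lambda x: list(x)
logits = [math.sin(i) for i in range(128)]   # arbitrary router scores

routed = route_token(logits)
print(len(routed), round(sum(w for _, w in routed), 6))
# 8 1.0
print(moe_ffn([1.0, -2.0], logits, experts, shared))
```

Only 8 of the 128 routed experts run for a given token, which is how a 25.2B-parameter model can have roughly 4B active parameters per token.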

📐 Model Size Configurations

| Model | Layers | Effective Params | Context | Modalities | PLE/MoE |
|:------|:------:|:-----------------|:-------:|:-----------|:--------|
| Gemma 4 E2B | 35 | ~2.3B (5.1B total) | 128K | Text + Image + Audio | PLE |
| Gemma 4 E4B | 42 | ~4.5B (8B total) | 128K | Text + Image + Audio | PLE |
| Gemma 4 26B A4B | | ~4B active (25.2B total) | 256K | Text + Image | MoE (128+1 exp, top-8) |
| Gemma 4 31B | | 30.7B dense | 256K | Text + Image | Dense |
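
A rough weight-memory estimate follows directly from the total parameter counts above. The bit widths below are illustrative choices rather than formats the source says are shipped, and the estimate ignores KV cache, activations, and runtime overhead. For E2B/E4B the total includes the flash-resident PLE table, so it overstates what must sit in VRAM.

```python
# Total parameter counts as listed in the size configurations above.
TOTAL_PARAMS = {
    "Gemma 4 E2B": 5.1e9,
    "Gemma 4 E4B": 8.0e9,
    "Gemma 4 26B A4B": 25.2e9,
    "Gemma 4 31B": 30.7e9,
}

def weight_memory_gb(total_params, bits_per_param):
    """Storage for the weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_param / 8 / 1e9

for name, params in TOTAL_PARAMS.items():
    row = ", ".join(f"{bits}-bit: {weight_memory_gb(params, bits):.1f} GB"
                    for bits in (16, 8, 4))
    print(f"{name}: {row}")
```

MoE reduces the compute per token, not the number of weights that must be stored: all 25.2B parameters of the 26B A4B still have to be resident somewhere, which at 4 bits per weight is about 12.6 GB.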

Community Perspective

  • The Apache 2.0 license change was the biggest headline: the developer community immediately highlighted this as a watershed moment for Google’s open-source AI strategy — previous Gemma generations used a custom Terms of Use that prohibited certain commercial uses, and many organizations had avoided Gemma for that reason
  • MoE in the 26B A4B was welcomed as a practical breakthrough: the model holds 25.2B parameters but runs with roughly the per-token compute of a 4B model, which made it accessible to individual researchers with a single 16–18 GB GPU, where models of this size had previously required workstation-class hardware
  • Per-Layer Embeddings (PLE) generated significant interest: the idea of offloading per-layer token embeddings to flash memory as a way to dramatically boost intelligence-per-active-parameter was novel and sparked architectural discussions across the ML community
  • Thinking mode adoption was rapid: the <|think|> token mechanism was straightforward to use via standard chat templates, and benchmarks quickly showed 5–10% gains on AIME and LiveCodeBench when thinking was enabled — making it a go-to feature for technical users
  • Audio support in edge models (E2B/E4B) was noted as a significant practical advantage for mobile and IoT voice applications, though the community awaited more detailed audio benchmark comparisons
  • The K=V trick and Shared KV Cache were praised by inference engineers as elegant solutions to KV cache memory pressure in long-context scenarios, reducing global attention memory overhead by up to 50%
  • AIME 2026 ~89.2% was the standout math benchmark result, placing Gemma 4 31B above many closed models on competition-level mathematics — validating the thinking mode + improved pre-training
  • Comparison to Llama 4: the community broadly assessed Gemma 4 as competing favorably with Llama 4 Scout (on an active-parameter basis) and with Llama 4 Maverick on reasoning-heavy tasks, while the Apache 2.0 license gave Gemma 4 an advantage in enterprise adoption settings
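
The "up to 50%" figure quoted for the K=V trick can be checked with a back-of-the-envelope KV cache estimate. Every numeric value in `cfg` below (layer counts, KV heads, head dimension, 2 bytes per value) is a hypothetical placeholder, not a published Gemma 4 hyperparameter. The Shared KV Cache is not modeled.

```python
def kv_cache_bytes(seq_len, n_local, n_global, window,
                   kv_heads, head_dim, bytes_per_value=2, k_equals_v=True):
    """Estimate KV cache size for a hybrid local/global attention stack."""
    # Sliding-window layers never hold more than `window` positions.
    local_positions = min(seq_len, window)
    local = n_local * local_positions * kv_heads * head_dim * 2   # K and V
    # Global layers hold every position; K=V stores one tensor instead of two.
    tensors = 1 if k_equals_v else 2
    global_part = n_global * seq_len * kv_heads * head_dim * tensors
    return (local + global_part) * bytes_per_value

# Hypothetical configuration for illustration only.
cfg = dict(seq_len=256 * 1024, n_local=50, n_global=10,
           window=1024, kv_heads=8, head_dim=128)
with_trick = kv_cache_bytes(**cfg, k_equals_v=True)
without = kv_cache_bytes(**cfg, k_equals_v=False)
print(f"K=V: {with_trick / 2**30:.2f} GiB, separate K and V: {without / 2**30:.2f} GiB")
```

At long context the global layers dominate the total, so halving their cache moves the overall figure close to the 50% bound. At short context the fixed-size local windows matter more and the saving is smaller.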

Model Variants

| Model | Effective Params | Total Params | Layers | Context | Modalities | License | Architecture |
|:------|:-----------------|:-------------|:------:|:-------:|:-----------|:--------|:-------------|
| Gemma4-E2B | ~2.3B | 5.1B | 35 | 128K | Text + Image + Audio | Apache 2.0 | Dense + PLE |
| Gemma4-E2B-IT | ~2.3B | 5.1B | 35 | 128K | Text + Image + Audio | Apache 2.0 | Dense + PLE + SFT/RLHF |
| Gemma4-E4B | ~4.5B | 8B | 42 | 128K | Text + Image + Audio | Apache 2.0 | Dense + PLE |
| Gemma4-E4B-IT | ~4.5B | 8B | 42 | 128K | Text + Image + Audio | Apache 2.0 | Dense + PLE + SFT/RLHF |
| Gemma4-26B-A4B | ~4B active | 25.2B | | 256K | Text + Image | Apache 2.0 | MoE (128+1 exp, top-8) |
| Gemma4-26B-A4B-IT | ~4B active | 25.2B | | 256K | Text + Image | Apache 2.0 | MoE + SFT/RLHF |
| Gemma4-31B | 30.7B | 30.7B | | 256K | Text + Image | Apache 2.0 | Dense |
| Gemma4-31B-IT | 30.7B | 30.7B | | 256K | Text + Image | Apache 2.0 | Dense + SFT/RLHF |

Companion model: the ShieldGemma 4 safety classifier was released alongside Gemma 4 to support responsible deployment.

Key Industry Ideas Incorporated

| Technique | Origin | How Gemma 4 Used It |
|:----------|:-------|:--------------------|
| Mixture-of-Experts (MoE) | Shazeer et al., "Outrageously Large Neural Networks" (2017); Mixtral (2023) | 26B A4B: 128+1 experts per MoE layer, top-8 routing; the first full MoE model in the Gemma family |
| Per-Layer Embeddings (PLE) | Gemma 4 (Google DeepMind, 2026) | Layer-specific embedding tables fetched from flash memory, injected residually after each layer for edge intelligence-per-parameter boost |
| Proportional RoPE (p-RoPE) | Gemma 4 (Google DeepMind, 2026) | Only 25% of RoPE coordinate pairs encoded in global attention layers for 256K contexts, reducing positional noise |
| K=V Trick (Keys = Values) | Gemma 4 (Google DeepMind, 2026) | Global attention layers set K=V, collapsing KV cache to a single cache and halving memory requirements |
| Shared KV Cache | Gemma 4 (Google DeepMind, 2026) | Last N layers of the same attention type share K/V, reducing redundant memory across layers |
| Chain-of-Thought / Thinking Mode | Wei et al., "Chain-of-Thought Prompting" (NeurIPS 2022); DeepSeek-R1 (2025) | `<\|think\|>` token activates step-by-step reasoning traces before final output |
| SigLIP 2 (upgraded vision encoder) | Zhai et al., SigLIP (2023) + Google improvements | Enhanced alignment training over Gemma 3's SigLIP; variable-resolution multi-tile input |
| Function Calling / Tool Use | Toolformer (2023), GPT-4 function calling (2023) | Native tool use and function calling via chat templates for agentic deployments |
| Apache 2.0 Open Licensing | OSI (Open Source Initiative) | First Gemma generation fully open; previous versions used custom Gemma Terms of Use |

📚 References

Technical Papers

| Version | Title | Link | Date |
|:--------|:------|:-----|:-----|
| Gemma 1 | Gemma: Open Models Based on Gemini Research and Technology | arXiv:2403.08295 | Mar 2024 |
| Gemma 2 | Gemma 2: Improving Open Language Models at a Practical Size | arXiv:2408.00118 | Aug 2024 |
| Gemma 3 | Gemma 3 Technical Report | arXiv:2503.19786 | Mar 2025 |
| Gemma 4 | Gemma 4 Model Card (Google AI for Developers) | ai.google.dev/gemma/docs/core/model_card_4 | Apr 2026 |

Official Blog Posts

| Title | Link |
|:------|:-----|
| Gemma: Introducing New State-of-the-Art Open Models | blog.google/technology/developers/gemma-open-models/ |
| Gemma 2: Advancing Frontier AI Responsibly | blog.google/technology/developers/google-gemma-2/ |
| Gemma 3 — The Developer Guide | developers.googleblog.com/en/introducing-gemma3/ |
| Gemma 4: Expanding the Gemmaverse with Apache 2.0 | opensource.googleblog.com/2026/04/gemma-4-expanding-gemmaverse-apache-2.html |
| Google DeepMind Gemma Page | deepmind.google/models/gemma/ |

GitHub & Model Repositories

| Resource | Link |
|:---------|:-----|
| Gemma GitHub (google-deepmind) | github.com/google-deepmind/gemma |
| Gemma on Hugging Face | huggingface.co/google/gemma-7b |
| Gemma 2 on Hugging Face | huggingface.co/google/gemma-2-27b |
| Gemma 3 on Hugging Face | huggingface.co/google/gemma-3-27b-it |
| Gemma 4 on Hugging Face | huggingface.co/google/gemma-4-31b-it |
| Gemma on Kaggle | kaggle.com/models/google/gemma |
| Keras NLP Gemma | keras.io/api/keras_nlp/models/gemma/ |

Cited Techniques

| Technique | Paper | Link |
|:----------|:------|:-----|
| GeGLU Activation | Shazeer, “GLU Variants Improve Transformer” (2020) | arXiv:2002.05202 |
| RoPE | Su et al., “RoFormer: Enhanced Transformer with Rotary Position Embedding” (2021) | arXiv:2104.09864 |
| RMSNorm | Zhang & Sennrich, “Root Mean Square Layer Normalization” (2019) | arXiv:1910.07467 |
| GQA | Ainslie et al., “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints” (EMNLP 2023) | arXiv:2305.13245 |
| Sliding Window Attention | Beltagy et al., “Longformer: The Long-Document Transformer” (2020) | arXiv:2004.05150 |
| Knowledge Distillation | Hinton et al., “Distilling the Knowledge in a Neural Network” (2015) | arXiv:1503.02531 |
| SigLIP | Zhai et al., “Sigmoid Loss for Language Image Pre-Training” (2023) | arXiv:2303.15343 |
| SentencePiece BPE | Kudo & Richardson, “SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing” (2018) | arXiv:1808.06226 |
| Tied Embeddings | Press & Wolf, “Using the Output Embedding to Improve Language Models” (EACL 2017) | arXiv:1608.05859 |
| Quantization (INT4/INT8) | Frantar et al., “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers” (ICLR 2023) | arXiv:2210.17323 |
| InstructGPT / RLHF | Ouyang et al., “Training language models to follow instructions with human feedback” (NeurIPS 2022) | arXiv:2203.02155 |
| Gemini Architecture | Gemini Team, “Gemini: A Family of Highly Capable Multimodal Models” (2023) | arXiv:2312.11805 |
| LLaMA 3 (RoPE scaling) | Meta AI, “The Llama 3 Herd of Models” (2024) | arXiv:2407.21783 |
| Mixture-of-Experts | Shazeer et al., “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer” (2017) | arXiv:1701.06538 |
| Chain-of-Thought Prompting | Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” (NeurIPS 2022) | arXiv:2201.11903 |

Built with data from official Gemma technical reports, Google DeepMind blog posts, and the Gemma 4 model card (April 2026). All benchmark numbers sourced directly from the referenced publications.
