🔵 Gemma — Model Architecture Across Generations

From Google DeepMind's first open-weight Gemini descendant to a fully Apache-licensed multimodal family — tracing four generations of architecture innovation.

📑 Table of Contents

Executive Summary
Version Release Timeline
Cross-Version Benchmark Comparison
Master Architecture Diagram
Gemma 1 (February 2024)
Gemma 2 (June 2024)
Gemma 3 (March 2025)
Gemma 4 (April 2026)
References

📋 Executive Summary

This document covers four generations of the Gemma large language model family developed by Google DeepMind:

Gemma 1 (the foundation): decoder-only Transformer built on Gemini research and technology, GeGLU activation, RoPE, 256K SentencePiece vocab, up to 6T training tokens, 2B & 7B sizes
Gemma 2 (architecture leap): alternating Sliding Window + Global attention, logit soft-capping, knowledge distillation from 27B teacher, double normalization, GQA for all sizes, 8K context
Gemma 3 (multimodal expansion): SigLIP vision encoder, 128K long context, hybrid 5:1 local/global attention ratio, 262K vocab, 140+ languages, QAT support
Gemma 4 (open-weight frontier): Apache 2.0 license, MoE variant (26B A4B), Per-Layer Embeddings (E2B/E4B), SigLIP 2 vision + native audio (E2B/E4B), 256K context, thinking mode, agentic tool use

📝 Note: The Gemma family also includes specialized variants — CodeGemma (code generation), PaliGemma (vision-language), RecurrentGemma (linear recurrent), and ShieldGemma (safety classifier) — which are documented separately. This document focuses on the core Gemma LLM architecture.

📅 Version Release Timeline

| Version | Release Date | Paper / Blog | Flagship Size | Training Tokens | Context Length | Headline Feature |
|:-------:|:----------:|:------------|:------------:|:--------------:|:------------:|:----------------|
| Gemma 1 | Feb 21, 2024 | [arXiv:2403.08295](https://arxiv.org/abs/2403.08295) | 7B | 6T | 8K | First open Gemini-based model |
| Gemma 2 | Jun 27, 2024 | [arXiv:2408.00118](https://arxiv.org/abs/2408.00118) | 27B | 13T | 8K | SWA + Global attn + KD |
| Gemma 3 | Mar 12, 2025 | [arXiv:2503.19786](https://arxiv.org/abs/2503.19786) | 27B | 14T | 128K | Multimodal + 140 languages |
| Gemma 4 | Apr 2, 2026 | [Model Card](https://ai.google.dev/gemma/docs/core/model_card_4) · [Blog](https://opensource.googleblog.com/2026/04/gemma-4-expanding-gemmaverse-apache-2.html) | 31B | n/a | 256K | Apache 2.0 · MoE · PLE · Audio · Thinking |

📊 Cross-Version Benchmark Comparison

Numbers reflect the best comparable model size across generations. Sources: official technical papers.

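| Benchmark | Gemma 1 (7B) | Gemma 2 (9B) | Gemma 2 (27B) | Gemma 3 (12B) | Gemma 3 (27B) | Gemma 4 (26B MoE) | Gemma 4 (31B) |
|:----------|:------------:|:------------:|:-------------:|:-------------:|:-------------:|:-----------------:|:-------------:|
| MMLU | 64.3 | 71.3 | 75.2 | 74.0 | 81.0 | 82.6 | 85.2 |
| HumanEval | 32.3 | 40.2 | 51.8 | 57.9 | 69.7 | ~87.0 | 91.4 |
| MATH | 24.3 | 36.7 | 42.4 | 43.3 | 67.6 | ~88.0 | 92.1 |
| GSM8K | 46.4 | 68.6 | 74.0 | 79.6 | 89.7 | 88.4 | 91.2 |
| Context Length | 8K | 8K | 8K | 128K | 128K | 256K | 256K |
| Training Tokens | 6T | 8T | 13T | 12T | 14T | n/a | n/a |
| Multimodal | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ (img) | ✅ (img) |
| Vocabulary | 256,128 | 256,128 | 256,128 | 262,144 | 262,144 | 262,144 | 262,144 |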

📝 Note: Gemma 1 to 3 figures are base-model numbers from the official technical reports. Gemma 4 figures are instruct-model results from the official model card and the Google Open Source blog (Apr 2026), so they are not directly comparable with the earlier columns. The MMLU row for Gemma 4 reports MMLU Pro. The HumanEval and MATH values marked ~ for Gemma 4 26B MoE are approximate, aggregated from multiple evaluations.

🏗️ Master Architecture Diagram

This diagram shows the core Transformer decoder architecture shared across all Gemma versions, with color-coded annotations indicating which generation introduced each component.

Gemma Architecture — Component Evolution Across Generations

🔵 Gemma 1 🔷 Gemma 2 🟦 Gemma 3 🟪 Gemma 4

Token Embedding — SentencePiece BPE

256,128 vocab (Gemma 1 & 2) → 262,144 vocab (Gemma 3) • Tied embeddings (input = output weights)

SigLIP Vision Encoder (Gemma 3) / SigLIP 2 (Gemma 4)

ViT-based, 896×896px, Pan & Scan multi-crop → 256 soft image tokens interleaved with text

Audio Encoder (Gemma 4 E2B / E4B only)

Native audio input; audio tokens interleaved with text tokens in the decoder

↓

× N TRANSFORMER LAYERS

RMSNorm (Pre-Normalization) — Gemma 1

Self-Attention Block

RoPE θ=10,000 • MHA (7B) / MQA (2B), no QKV bias

GQA replaces MHA • SWA (odd layers, win=4096) • Global attn (even layers) • Attention-logit soft-cap 50

RoPE θ=1,000,000 5:1 local/global ratio 128K context

p-RoPE p=0.25 (256K ctx) K=V trick (global attn) Shared KV cache

+ Residual Connection

RMSNorm (Post-Normalization) — Added in Gemma 2 (double norm)

Feed-Forward Block

GeGLU Activation Dense FFN (no MoE)

MoE 128+1 experts, top-8 (Gemma 4 26B) GeGLU retained across all generations

+ Residual Connection

↓

Final Logit Soft-Capping: cap × tanh(x / cap)

Introduced in Gemma 2: final-logit cap=30 (attention-logit cap=50) | Applied before softmax to stabilize training

↓

LM Head — Tied with input embeddings → Next-token prediction

Multimodal input (Gemma 3) • KD from 27B teacher (Gemma 2) • Thinking mode / tool use (Gemma 4)

🔵 Gemma 1 — February 2024

📅 Released: February 21, 2024 | 📄 arXiv:2403.08295

Summary

Google DeepMind’s first open-weight model built directly on the Gemini architecture, making the Gemini-family techniques publicly accessible for research and commercial use — a landmark moment for the open AI ecosystem
Built on a decoder-only Transformer with causal attention masks, following the now-standard autoregressive language modeling paradigm established by GPT and refined through LLaMA and Gemini
Uses Multi-Head Attention (MHA) in the 7B model and Multi-Query Attention (MQA, a single shared KV head) in the 2B model, with no QKV bias in either; Grouped Query Attention arrives only with Gemma 2
Adopts Rotary Positional Embeddings (RoPE) with a base frequency of θ=10,000, the same default as LLaMA — providing relative position encoding that generalizes well to sequences up to the training length
Employs GeGLU activation in the FFN layers: GeGLU(x) = GELU(xW₁) ⊗ (xW₂) — this is distinct from SwiGLU used by LLaMA/Qwen, using GELU (Gaussian Error Linear Unit) as the gate rather than Swish, providing a smoother activation landscape
Does not yet use logit soft-capping: the tanh-based cap, softcap(x) = cap × tanh(x / cap), is introduced in Gemma 2, so Gemma 1 attention logits go to the softmax uncapped
Applies RMSNorm as pre-normalization before the attention and FFN sub-layers, following the LLaMA convention, plus a final RMSNorm on the output of the last layer before the LM head
Uses a SentencePiece BPE tokenizer with a vocabulary of 256,128 tokens — dramatically larger than LLaMA’s 32K and Mistral’s 32K, providing far superior multilingual tokenization efficiency and byte-level coverage for rare characters
Ties input embeddings to output projection weights — the embedding matrix doubles as the LM head, halving the number of parameters in those layers and improving parameter efficiency especially at the 2B scale
Trained on up to 6 trillion tokens (6T for the 7B; the technical report lists 3T for the 2B) of primarily English web text, code, and mathematical content, filtered and deduplicated with quality heuristics similar to those used for Gemini’s training pipeline
Released in two sizes: 2B (18 layers, 8 heads, head_dim=256, d_model=2048, intermediate_size=16,384) and 7B (28 layers, 16 heads, head_dim=256, d_model=3072, intermediate_size=24,576)
The head_dim=256 is notably larger than the typical 128 used by LLaMA at equivalent scales — this larger per-head dimension was carried over from Gemini’s design and allows each attention head to capture richer representations
Accompanied by instruction-tuned variants (Gemma 1-IT) aligned via Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) using a reward model trained on human preference data
Established the Gemma naming convention and the foundational technical choices (GeGLU, RMSNorm, 256K vocabulary, tied embeddings) that persist as defining characteristics of the entire Gemma family
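The GeGLU formula quoted in the list above, GeGLU(x) = GELU(xW₁) ⊗ (xW₂), can be sketched in plain Python. This is a minimal illustration, not Gemma's implementation: the function names, the list-of-lists weight layout, the extra down-projection `w_down`, and the use of the exact (erf-based) GELU are all assumptions made for the example.

```python
import math

def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def matvec(x, w):
    """Row vector x (length d_in) times matrix w (d_in rows, d_out columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]

def geglu_ffn(x, w_gate, w_up, w_down):
    """GeGLU feed-forward: (GELU(x W_gate) * (x W_up)) W_down."""
    gate = [gelu(v) for v in matvec(x, w_gate)]   # GELU(x W1)
    up = matvec(x, w_up)                          # x W2
    hidden = [g * u for g, u in zip(gate, up)]    # element-wise product
    return matvec(hidden, w_down)                 # project back to d_model

# Toy usage with 2x2 identity weights (hypothetical values).
identity = [[1.0, 0.0], [0.0, 1.0]]
out = geglu_ffn([1.0, -2.0], identity, identity, identity)
```

Swapping `gelu` for a Swish/SiLU function in the gate would turn the same sketch into SwiGLU, which is the only structural difference between the two variants.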

Architecture Diagram — Gemma 1

Gemma 1 — Transformer Decoder Block (per layer)

Input Token Embeddings (256,128 vocab) + RoPE (θ=10,000) — Tied with LM Head

↓

RMSNorm (Pre-Norm — before attention)

↓

Multi-Head Attention (MHA)

RoPE Positional Encoding No QKV Bias Causal Attention Mask

No logit soft-capping in Gemma 1: attention logits pass to the softmax uncapped (capping is introduced in Gemma 2)

2B: 8 Q heads, 1 KV head (MQA), head_dim=256, d_model=2048 | 7B: 16 Q heads, 16 KV heads (MHA), head_dim=256, d_model=3072

+ Residual Connection → RMSNorm (Pre-Norm before FFN)

↓

Feed-Forward Network (Dense)

GeGLU Activation GELU(xW₁) ⊗ xW₂

2B: intermediate=16,384 (8× d_model) | 7B: intermediate=24,576 (8× d_model)

+ Residual Connection

↓

Final RMSNorm (no final-logit soft-capping in Gemma 1)

↓

LM Head (= Embedding matrix, tied) → Next Token Prediction

Community Perspective

Received enormous enthusiasm from the open-source AI community — it was the first time Google had released open-weight models derived from the Gemini architecture, democratizing access to Gemini-class techniques
The GeGLU activation (vs. SwiGLU in LLaMA) sparked debate; empirically the performance gap is small, but the theoretical distinction (GELU gate vs. Swish gate) was noted by researchers as a deliberate Gemini heritage choice
Logit soft-capping, a later hallmark of the family, is not part of this release: it arrives with Gemma 2, where a smooth tanh bounds attention and final logits without the discrete cutoff of a hard clip
The 256K vocabulary was a major talking point — far larger than LLaMA’s 32K, it enables superior tokenization of non-Latin scripts, code, and mathematical notation, though it also means larger embedding layers
The 7B model benchmarked competitively with Mistral-7B and LLaMA-2-13B, and the community quickly integrated Gemma into tools like LM Studio, Ollama, and llama.cpp
Some criticism targeted the 8K context window, which looked limited next to the 32K advertised for Mistral-7B; the 6T-token training budget, by contrast, was large for an open model at the time (LLaMA 2 was trained on 2T tokens)
The tied embeddings design was appreciated for memory efficiency, particularly beneficial for the 2B model on consumer hardware
Commercial usability under the Gemma Terms of Use (a custom license, not Apache 2.0) was noted as a limitation for some use cases, compared to Mistral’s Apache 2.0 license

Model Variants

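| Model | Parameters | Layers | Q Heads | head_dim | d_model | Intermediate | Context | Training Tokens |
|:------|:----------:|:------:|:-------:|:--------:|:-------:|:------------:|:-------:|:----------------|
| Gemma-2B | 2B | 18 | 8 | 256 | 2048 | 16,384 | 8K | 3T |
| Gemma-7B | 7B | 28 | 16 | 256 | 3072 | 24,576 | 8K | 6T |
| Gemma-2B-IT | 2B | 18 | 8 | 256 | 2048 | 16,384 | 8K | 3T + SFT/RLHF |
| Gemma-7B-IT | 7B | 28 | 16 | 256 | 3072 | 24,576 | 8K | 6T + SFT/RLHF |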
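To make the cost of the 256,128-token vocabulary and the benefit of tied embeddings concrete, the following sketch multiplies out the embedding table sizes using the vocabulary size and d_model values from the table above. The helper name `embedding_params` is illustrative.

```python
VOCAB_SIZE = 256_128  # Gemma 1 vocabulary size quoted in this document

def embedding_params(d_model: int, tied: bool = True) -> int:
    """Parameters in the input embedding plus LM head.

    With tied weights the same (vocab x d_model) matrix serves both roles,
    so it is counted once; untied weights would count it twice.
    """
    table = VOCAB_SIZE * d_model
    return table if tied else 2 * table

params_2b_tied = embedding_params(2048)                # 524,550,144
params_2b_untied = embedding_params(2048, tied=False)  # 1,049,100,288
params_7b_tied = embedding_params(3072)                # 786,825,216
```

At d_model=2048 the tied table alone is roughly half a billion parameters, which is why tying matters most at the 2B scale.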

Key Industry Ideas Incorporated

| Technique | Origin | How Gemma 1 Used It |
|:----------|:-------|:--------------------|
| GeGLU Activation | Dauphin et al. (2017), Shazeer (2020) | FFN gating function: a GELU-gated linear unit replaces the plain ReLU or GELU feed-forward layer, inherited from Gemini |
| RoPE | Su et al., "RoFormer" (2021) | Relative positional encoding applied to Q and K projections, θ=10,000 |
| RMSNorm | Zhang & Sennrich (2019) | Pre-normalization before each sub-layer for stable gradient flow |
| SentencePiece BPE | Kudo & Richardson (2018) | 256,128-token vocabulary with byte-level fallback for rare chars |
| Tied Embeddings | Press & Wolf (2017) | Input embedding matrix reused as output projection for parameter efficiency |
| SFT + RLHF | Ziegler et al. (2019), Ouyang et al. InstructGPT (2022) | Standard instruction tuning + human preference alignment pipeline |

🔷 Gemma 2 — June 2024

📅 Released: June 27, 2024 | 📄 arXiv:2408.00118

Summary

Most architecturally innovative Gemma release: introduced alternating Sliding Window Attention (SWA) and Global Attention — odd-indexed layers use local sliding window attention (window_size=4096) while even-indexed layers use full global attention, allowing the model to capture both local context and long-range dependencies efficiently within an 8K context window
Introduces logit soft-capping with two separate caps: 50 for the attention logits in every layer and 30 for the final output logits, each applied as cap × tanh(x / cap), which bounds extreme values smoothly instead of hard-clipping them
Grouped Query Attention (GQA) replaces MHA for all model sizes — the 2B model uses 8 Q heads / 4 KV heads, the 9B uses 16 Q heads / 8 KV heads, and the 27B uses 32 Q heads / 16 KV heads — dramatically reducing KV cache memory at inference time
Knowledge distillation is central to the training methodology: smaller models (2B and 9B) are distilled from the 27B teacher model using both token-level KL-divergence minimization and next-token prediction loss — achieving significantly better performance than training from scratch at equivalent compute
Expands to three model sizes: 2B, 9B, and 27B — the 27B represents a 4× scale-up from Gemma 1’s 7B, enabling much stronger reasoning and knowledge retention
Adds post-normalization (in addition to the existing pre-normalization from Gemma 1): both the output of attention and the output of the FFN are normalized before the residual addition — creating a double normalization scheme that significantly improves training stability at larger scales
GeGLU activation is retained across all layers and model sizes — a deliberate architectural continuity choice that preserves the Gemini heritage and avoids the overhead of architecture search
Context window of 8,192 tokens (8K) — unchanged from Gemma 1, but the alternating SWA pattern means local layers only attend to the nearest 4,096 tokens, while global layers have full 8K visibility, providing effective hierarchical attention coverage
The 2B model architecture is substantially larger than Gemma 1’s 2B: 26 layers (vs. 18), d_model=2304 (vs. 2048), intermediate=9,216 — reflecting the distillation-boosted efficiency allowing more capacity at equivalent parameter count
9B model dimensions: 42 layers, d_model=3,584, intermediate=14,336 — the 42-layer depth is notable as a non-standard choice (most 9B models use 32-36 layers), reflecting the trade-off toward depth over width in Gemma’s design philosophy
27B model dimensions: 46 layers, d_model=4,608, intermediate=36,864 — trained on a full 13T tokens, making it the most data-rich Gemma model at launch
Training data scales with model size: 2B trained on 2T tokens, 9B on 8T tokens, 27B on 13T tokens; at several hundred tokens per parameter, this is far beyond the roughly 20 tokens per parameter that Chinchilla-style compute-optimal scaling would suggest
Post-training alignment uses a combination of SFT, RLHF, and distillation-based alignment: the instruction-tuned models inherit behavioral priors from the 27B-IT teacher via distillation, enabling smaller instruct models to match the alignment quality of larger ones
Significantly outperforms Gemma 1 across all benchmarks: GSM8K improves from 46.4 (7B) to 68.6 (9B), MMLU from 64.3 to 71.3 — demonstrating that the SWA + GQA + distillation innovations translate directly to capability improvements
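The soft-capping function referred to in the list above has a simple closed form. The sketch below takes the cap as a parameter rather than fixing it, and the name `soft_cap` is illustrative rather than taken from any Gemma codebase.

```python
import math

def soft_cap(x: float, cap: float) -> float:
    """Smoothly squash x into [-cap, cap] via cap * tanh(x / cap)."""
    return cap * math.tanh(x / cap)

# Small logits pass through almost unchanged; large ones saturate near the cap.
small = soft_cap(1.0, 50.0)
large = soft_cap(1000.0, 30.0)
```

Unlike a hard clip, the function has a nonzero gradient for every finite input (1 − tanh²(x / cap)), which is the property that motivates using it on logits during training.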

Architecture Diagram — Gemma 2

Gemma 2 — Alternating Attention + Knowledge Distillation

✨ KEY INNOVATION

Alternating Attention Pattern

ODD Layers (1,3,5...)

Local SWA

Window = 4,096 tokens
Only attends to recent tokens
O(n × w) complexity

⇄

EVEN Layers (2,4,6...)

Global Attention

Full 8,192-token context
All-to-all attention
O(n²) complexity

📐 Local layers: fast, O(n×w) — Global layers: expressive, O(n²) — Alternation balances efficiency vs. recall
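The alternating pattern described above can be expressed as a per-layer attention rule. This is a sketch under stated assumptions: layers are numbered from 1 as in the diagram, and a local layer is taken to see the current token plus the previous window − 1 tokens; the exact window boundary convention is an assumption, and both function names are hypothetical.

```python
def layer_kinds(n_layers: int) -> list[str]:
    """Odd layers (1, 3, 5, ...) are local SWA; even layers are global."""
    return ["local" if i % 2 == 1 else "global" for i in range(1, n_layers + 1)]

def can_attend(query_pos: int, key_pos: int, kind: str, window: int = 4096) -> bool:
    """Causal attention rule for one (query, key) pair in a given layer kind."""
    if key_pos > query_pos:          # causal mask: never look ahead
        return False
    if kind == "global":             # global layers see the full prefix
        return True
    return query_pos - key_pos < window  # local layers see only recent tokens

kinds = layer_kinds(4)
```

For a query at position 5,000, `can_attend(5000, 0, "local")` is false while `can_attend(5000, 0, "global")` is true, which is the sense in which the global layers restore long-range visibility.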

Grouped Query Attention (GQA) — All Sizes

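| Model | Q Heads | KV Heads | GQA Ratio | Layers | d_model |
|:------|:-------:|:--------:|:---------:|:------:|:-------:|
| Gemma 2 2B | 8 | 4 | 2:1 | 26 | 2,304 |
| Gemma 2 9B | 16 | 8 | 2:1 | 42 | 3,584 |
| Gemma 2 27B | 32 | 16 | 2:1 | 46 | 4,608 |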
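The KV-cache saving from GQA follows directly from the head counts in the table above. The sketch below computes cache bytes per token; `head_dim=256` and 2-byte (bf16) values are assumptions for illustration, since this table does not list them, and the estimate ignores the fact that sliding-window layers only need to keep the most recent window of tokens.

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of K and V stored per token across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = K and V

# Gemma 2 9B shape from the table; head_dim=256 is an assumed value.
gqa = kv_cache_bytes_per_token(n_layers=42, n_kv_heads=8, head_dim=256)
mha = kv_cache_bytes_per_token(n_layers=42, n_kv_heads=16, head_dim=256)
full_context_gib = gqa * 8192 / 2**30  # cache for one 8K-token sequence
```

Under these assumptions the 2:1 GQA ratio halves the cache relative to an MHA layout with 16 KV heads, and a full 8K-token sequence needs about 2.6 GiB of cache.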

✨ DISTILLATION STRATEGY

Knowledge Distillation Pipeline

Teacher
Gemma 2 27B
Trained from scratch

→

Student: 9B
KL(teacher ∥ student)
+ NTP loss

→

Student: 2B
Distilled from 27B
+ NTP loss

Token-level KL divergence + standard cross-entropy. Smaller models learn soft probability distributions from teacher, not just hard labels.
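The combined objective described above (token-level KL divergence plus standard cross-entropy) can be sketched for a single token position. The mixing weight `alpha` and all function names are hypothetical; the source does not state how the two terms are weighted.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def distill_loss(teacher_logits, student_logits, target: int, alpha: float = 0.5):
    """alpha * KL(teacher || student) + (1 - alpha) * next-token cross-entropy."""
    p_teacher = softmax(teacher_logits)
    p_student = softmax(student_logits)
    kd_term = kl_divergence(p_teacher, p_student)   # match the soft distribution
    ntp_term = -math.log(p_student[target])         # match the hard label
    return alpha * kd_term + (1.0 - alpha) * ntp_term

loss = distill_loss([2.0, 0.5, -1.0], [1.0, 1.0, 0.0], target=0)
```

The KL term is what lets the student learn from the teacher's full probability distribution; with `alpha=0.0` the function reduces to ordinary next-token cross-entropy.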

⚙️ Double Normalization (Pre + Post)

Pre-Norm (before attn/FFN)

Post-Norm (NEW in Gemma 2)

Stable large-scale training

Community Perspective

The alternating SWA + global attention pattern was seen as an elegant engineering solution: half of the layers attend over at most 4,096 tokens, which cuts their attention compute and KV-cache footprint, while the interleaved global layers preserve long-range modeling
Knowledge distillation from 27B gave the 2B and 9B models capabilities well above their parameter count would suggest — the community quickly noticed that Gemma 2 2B outperformed many 7B models from competing families
The double normalization (pre + post) was unusual and prompted analysis — researchers found it improves gradient flow significantly at 27B scale, but adds marginal overhead at 2B
The GQA ratio of 2:1 (e.g., 16 Q heads, 8 KV heads) is more conservative than Qwen2’s 7:1 or 8:1 ratios — indicating Google’s preference for maintaining attention quality over maximizing KV cache reduction
Gemma 2 27B surpassing Llama 3 70B on several benchmarks despite being less than half the size was a landmark result that cemented Gemma 2 as a top-tier open-weight family
The two soft-capping values (50 for attention logits, 30 for final logits) were noted as a practical detail: the final cap prevents overconfident probability distributions at the output, while the attention cap keeps attention scores within a bounded range
Community noted that the 8K context window felt limiting given that contemporaries like LLaMA 3 and Qwen 2 were pushing to 128K — this became a major point of improvement for Gemma 3

Model Variants

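| Model | Parameters | Layers | Q Heads / KV Heads | d_model | Intermediate | Context | Training Tokens |
|:------|:----------:|:------:|:------------------:|:-------:|:------------:|:-------:|:----------------|
| Gemma2-2B | 2B | 26 | 8 / 4 | 2,304 | 9,216 | 8K | 2T |
| Gemma2-9B | 9B | 42 | 16 / 8 | 3,584 | 14,336 | 8K | 8T |
| Gemma2-27B | 27B | 46 | 32 / 16 | 4,608 | 36,864 | 8K | 13T |
| Gemma2-2B-IT | 2B | 26 | 8 / 4 | 2,304 | 9,216 | 8K | 2T + SFT/RLHF/KD |
| Gemma2-9B-IT | 9B | 42 | 16 / 8 | 3,584 | 14,336 | 8K | 8T + SFT/RLHF/KD |
| Gemma2-27B-IT | 27B | 46 | 32 / 16 | 4,608 | 36,864 | 8K | 13T + SFT/RLHF |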

Key Industry Ideas Incorporated

| Technique | Origin | How Gemma 2 Used It |
|:----------|:-------|:--------------------|
| Sliding Window Attention | Longformer (Beltagy et al., 2020), Mistral (2023) | Odd-indexed layers use SWA with window_size=4096 for O(n×w) local attention |
| Global Attention Interleaving | LongT5 (Guo et al., 2022), BigBird (Zaheer et al., 2020) | Even-indexed layers use full global attention to maintain long-range coherence |
| Knowledge Distillation | Hinton et al. (2015), DistilBERT (2019) | Soft-label KL divergence from 27B teacher to 2B and 9B students during pre-training |
| GQA | Ainslie et al. (2023) | Replaces MHA for all sizes; 2:1 ratio balances quality vs. KV cache efficiency |
| Double Normalization (Pre + Post) | Gemini (Google, 2023) | Applies RMSNorm both before and after attention and FFN sub-layers |
| Logit Soft-Capping | Gemini (Google, 2023) | Separate caps for attention logits (50) vs. final output logits (30) |
| RLHF with KD | Anthropic, OpenAI | Instruction-tuned variants combine human preference data with distillation from 27B-IT |

🟦 Gemma 3 — March 2025

📅 Released: March 12, 2025 | 📄 arXiv:2503.19786

Summary

Major multimodal expansion: introduces a native SigLIP-based vision encoder for 4B, 12B, and 27B models — images are processed at 896×896 pixel resolution through a ViT (Vision Transformer) backbone that produces 256 soft image tokens, which are directly interleaved with text tokens in the decoder, enabling unified multimodal understanding without a separate cross-attention module
Pan & Scan multi-crop strategy handles variable-resolution images: the input image is split into non-overlapping crops of equal size that cover the whole image, each crop is resized to 896×896 and encoded independently by the SigLIP encoder, and the resulting tokens are concatenated; this supports arbitrary aspect ratios and higher-fidelity reading of text in images, charts, and photographs
Context window expands dramatically: 128K tokens for the 4B, 12B, and 27B models (a 16× increase from Gemma 2’s 8K), and 32K for the 1B text-only model — enabling long document analysis, multi-turn conversations, and extended code context
RoPE base frequency raised to θ=1,000,000 on the global attention layers (up from 10,000 in Gemma 1 and 2; local layers keep 10,000): a larger base lengthens the slowest RoPE wavelengths, so positions that are 128K tokens apart still map to distinguishable rotation angles instead of wrapping around
Hybrid attention with 5:1 local-to-global ratio: five consecutive sliding window attention layers are followed by one full global attention layer — compared to Gemma 2’s 1:1 alternating pattern, this 5:1 ratio dramatically reduces the computational cost of attention for long sequences while the periodic global layers maintain coherence
Local SWA layers use a sliding window of 1,024 tokens, down from Gemma 2’s 4,096; with five local layers per global layer, the short window keeps most of the KV cache small and independent of context length
Upgraded tokenizer derived from the Gemini 2.0 tokenizer with an expanded 262,144-token vocabulary (up from 256,128 in Gemma 1/2) — supporting 140+ languages with improved tokenization efficiency for multilingual text, including better coverage for CJK characters and low-resource languages
Expands to four model sizes: 1B (text-only), 4B, 12B, and 27B — the 1B model fills the on-device/mobile niche while the 12B fills the gap between 9B (Gemma 2) and 27B for production inference
GQA across all sizes: 4B (8Q/4KV), 12B (16Q/8KV), and 27B (32Q/16KV) keep Gemma 2’s 2:1 Q-to-KV ratio, while the 1B uses 4 Q heads sharing a single KV head
Gemma 3 4B-IT is competitive with Gemma 2 27B-IT, a 6.75× reduction in parameter count at similar quality, attributed to knowledge distillation and the improved architecture and training recipe (hybrid attention + long context)
Training data scales with model size: 14T tokens for the 27B, 12T for the 12B, 4T for the 4B, and 2T for the 1B (vs. 2T to 13T in Gemma 2), covering 140+ languages, a marked multilingual expansion over Gemma 2’s primarily English-focused pre-training corpus
Post-training pipeline includes: (1) large-scale SFT on instruction-following, code, math, and tool use; (2) RLHF with human preference data; (3) knowledge distillation from the 27B-IT model to smaller models — the same three-stage approach as Gemma 2 but at larger scale
Quantization-Aware Training (QAT) support is built into the training pipeline — official INT4 and INT8 quantized variants are released alongside the float16 checkpoints, enabling deployment on mobile and edge devices without significant quality degradation
ShieldGemma 2, a companion 4B image safety classifier built on Gemma 3, is released alongside: it is trained to flag harmful content in images, providing an integrated responsible AI toolkit for production deployments
Knowledge distillation continues from 27B to smaller models, with the 4B and 12B distilled models benefiting from both the 27B pre-training teacher and the 27B-IT alignment teacher
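The link between the RoPE base θ and usable context length in the list above can be checked numerically. In standard RoPE, coordinate pair i rotates with wavelength 2π·θ^(2i/d) tokens, so the slowest pair bounds how far apart two positions can be before angles start to repeat. The sketch assumes `head_dim=256`, a value this document states only for Gemma 1, and the function name is illustrative.

```python
import math

def rope_wavelengths(theta: float, head_dim: int) -> list[float]:
    """Wavelength in tokens of each RoPE coordinate pair: 2*pi * theta**(2i/d)."""
    return [2.0 * math.pi * theta ** (2.0 * i / head_dim)
            for i in range(head_dim // 2)]

HEAD_DIM = 256          # assumed head dimension for illustration
CONTEXT = 128 * 1024    # 128K tokens

longest_10k = max(rope_wavelengths(10_000.0, HEAD_DIM))
longest_1m = max(rope_wavelengths(1_000_000.0, HEAD_DIM))
covers_context_10k = longest_10k >= CONTEXT
covers_context_1m = longest_1m >= CONTEXT
```

Under these assumptions the longest wavelength at θ=10,000 is shorter than a 128K-token context, while at θ=1,000,000 it comfortably exceeds it.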

Architecture Diagram — Gemma 3

Gemma 3 — Multimodal Architecture with Hybrid Attention

✨ NEW: VISION

SigLIP Vision Encoder (4B / 12B / 27B only)

Input Image
Any resolution
Pan & Scan crops

→

SigLIP ViT
896×896px
per crop

→

256 image tokens
per crop
interleaved with text

Image tokens and text tokens are concatenated in the same sequence — no separate cross-attention module needed
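A rough token budget for image inputs follows from the figures in the diagram (896×896 pixels in, 256 tokens out per crop). The sketch below assumes a ViT patch size of 14 and a 4×4 pooling step to get from patches to 256 tokens; neither number is stated in this document, so treat both as illustrative assumptions, along with the function names.

```python
def tokens_per_crop(image_size: int = 896, patch_size: int = 14, pool: int = 4) -> int:
    """Soft tokens produced for one square crop after patching and pooling."""
    patches_per_side = image_size // patch_size   # 896 / 14 = 64
    pooled_per_side = patches_per_side // pool    # 64 / 4 = 16
    return pooled_per_side * pooled_per_side      # 16 * 16 = 256

def image_tokens(n_crops: int) -> int:
    """Sequence positions consumed by n_crops crops interleaved with text."""
    return n_crops * tokens_per_crop()

budget_three_crops = image_tokens(3)
```

Because image tokens share the decoder sequence with text, every extra crop costs the same 256 positions of context that text tokens would otherwise use.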

✨ 5:1 HYBRID ATTENTION

Hybrid Attention Pattern (5 Local + 1 Global)

SWA
Local
win=1K

Global
Full
128K ctx

...

🔵 5× Local SWA: O(n × w), fast, captures syntax

🟦 1× Global: O(n²), slow, captures semantics
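The 5:1 pattern above can be written as a layer schedule. The sketch assumes the global layer closes each group of six; the document gives the ratio but not the position of the global layer within the group, and `hybrid_schedule` is a hypothetical helper name.

```python
def hybrid_schedule(n_layers: int, local_per_global: int = 5) -> list[str]:
    """Layer kinds for a repeating (local x N, global x 1) pattern."""
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local"
            for i in range(n_layers)]

schedule = hybrid_schedule(12)
n_global_in_34 = hybrid_schedule(34).count("global")
```

With 34 layers this yields 5 global layers and 29 local ones, so only a small fraction of layers needs a KV cache that grows with the full context.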

🔭 Long Context Engineering

128K

Max context (4B/12B/27B)

θ=10⁶

RoPE base freq (100× Gemma 2)

5:1

Local : Global attn ratio

4 SIZES

Model Size Configurations

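| Model | Layers | d_model | Q/KV Heads | Intermediate | Context | Vision |
|:------|:------:|:-------:|:----------:|:------------:|:-------:|:------:|
| Gemma 3 1B | 26 | 1,152 | 4 / 1 | 6,912 | 32K | ❌ |
| Gemma 3 4B | 34 | 2,560 | 8 / 4 | 10,240 | 128K | ✅ |
| Gemma 3 12B | 48 | 3,840 | 16 / 8 | 15,360 | 128K | ✅ |
| Gemma 3 27B | 62 | 5,376 | 32 / 16 | 21,504 | 128K | ✅ |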

Community Perspective

The 4B-IT rivaling Gemma 2’s 27B-IT was the headline result: a 6.75× reduction in parameter count in one generation, attributed to the improved hybrid attention design, the updated training recipe, and aggressive distillation
The SigLIP vision integration was welcomed as a clean multimodal architecture: interleaving image tokens directly with text tokens (rather than routing them through separate cross-attention layers as in Flamingo-style models) is simpler and scales better
Pan & Scan image processing was praised for its practical handling of high-resolution and variable-aspect-ratio images — crucial for document understanding, screenshots, and diagrams
The 5:1 local/global attention ratio represents a different architectural bet than Gemma 2’s 1:1 alternating pattern — community experiments showed 5:1 is more compute-efficient for long sequences while maintaining comparable quality
The 128K context window finally brought Gemma up to parity with LLaMA 3.1 and Qwen 2 on long-context tasks — a relief for practitioners who found Gemma 2’s 8K limiting
QAT support with officially released INT4/INT8 quantized models was praised as a responsible release practice — many users immediately deployed the 4B INT4 model on consumer laptops and mobile devices
ShieldGemma as a companion safety classifier was lauded as part of a responsible AI deployment toolkit — though some noted that bundling safety tools separately from the model itself leaves room for misuse
140+ language support represented a massive step forward for multilingual use cases, with the community noting improved tokenizer efficiency and generation quality for low-resource languages compared to Gemma 2
The 1B text-only model with 32K context was positioned as the best-in-class on-device LLM, fitting comfortably in 4-bit quantized form on smartphones — sparking discussions about on-device AI applications

Model Variants

| Model | Parameters | Layers | Q Heads / KV Heads | d_model | Intermediate | Context | Vision | Training Tokens |
|:------|:----------:|:------:|:------------------:|:-------:|:------------:|:-------:|:------:|:----------------|
| Gemma3-1B | 1B | 26 | 4 / 1 | 1,152 | 6,912 | 32K | ❌ | 2T |
| Gemma3-4B | 4B | 34 | 8 / 4 | 2,560 | 10,240 | 128K | ✅ | 4T |
| Gemma3-12B | 12B | 48 | 16 / 8 | 3,840 | 15,360 | 128K | ✅ | 12T |
| Gemma3-27B | 27B | 62 | 32 / 16 | 5,376 | 21,504 | 128K | ✅ | 14T |
| Gemma3-1B-IT | 1B | 26 | 4 / 1 | 1,152 | 6,912 | 32K | ❌ | 2T + SFT/RLHF |
| Gemma3-4B-IT | 4B | 34 | 8 / 4 | 2,560 | 10,240 | 128K | ✅ | 4T + SFT/RLHF/KD |
| Gemma3-12B-IT | 12B | 48 | 16 / 8 | 3,840 | 15,360 | 128K | ✅ | 12T + SFT/RLHF/KD |
| Gemma3-27B-IT | 27B | 62 | 32 / 16 | 5,376 | 21,504 | 128K | ✅ | 14T + SFT/RLHF |
| Gemma3-1B-PT (QAT) | 1B | 26 | 4 / 1 | 1,152 | 6,912 | 32K | ❌ | INT4/INT8 quantized |
| Gemma3-4B-PT (QAT) | 4B | 34 | 8 / 4 | 2,560 | 10,240 | 128K | ✅ | INT4/INT8 quantized |

Companion model: ShieldGemma 2 (4B image safety classifier built on Gemma 3), released alongside Gemma 3 for responsible deployment.

Key Industry Ideas Incorporated

| Technique | Origin | How Gemma 3 Used It |
|:----------|:-------|:--------------------|
| SigLIP Vision Encoder | Zhai et al., "Sigmoid Loss for Language Image Pre-Training" (2023) | ViT-based image encoder replacing CLIP for image-text alignment; produces 256 tokens per 896×896 crop |
| Pan & Scan Multi-Crop | Google (PaLI-X, 2023) | Variable-resolution image tiling strategy: split the image into non-overlapping crops, each resized to 896×896 |
| 5:1 Local/Global Hybrid Attention | Gemma 2 (alternating), Longformer (2020) | Extended to 5:1 ratio for more efficient long-context processing; 1M-token experiments |
| High RoPE Base Frequency (θ=10⁶) | LLaMA 3 (Meta, 2024), Code LLaMA (2023) | Scaling θ from 10,000 to 1,000,000 on global layers to support 128K context |
| Knowledge Distillation (pre-training + alignment) | Hinton et al. (2015), Gemma 2 (2024) | Multi-stage KD: 27B teacher used for both pre-training token distribution and RLHF alignment of smaller models |
| Quantization-Aware Training (QAT) | Jacob et al. (2018) | INT4/INT8 quantization awareness built into training for official quantized releases with limited quality degradation |
| SFT + RLHF + KD Alignment Pipeline | InstructGPT (2022), Gemma 2 (2024) | Three-stage post-training: SFT → RLHF → knowledge distillation from 27B-IT teacher |
| ShieldGemma Safety Classifier | Llama Guard (Meta, 2023) | Companion image safety classifier (ShieldGemma 2) built on Gemma 3 for content moderation |
| Expanded Multilingual Tokenizer | Gemini 2.0 tokenizer (Google, 2024) | 262,144-token vocabulary from Gemini 2.0, covering 140+ languages with improved subword compression |

🟪 Gemma 4 — April 2026

📅 Released: April 2, 2026 | 📄 Model Card | 🔗 Google Open Source Blog

Summary

Historic license change to Apache 2.0: for the first time in the Gemma family, all four Gemma 4 model variants are released under the fully permissive Apache 2.0 license — replacing the custom Gemma Terms of Use from all previous generations — enabling unrestricted commercial use, fine-tuning, redistribution, and enterprise deployment without additional agreements
Four-model family spanning from on-device to workstation: E2B (~2.3B effective / 5.1B total), E4B (~4.5B effective / 8B total), 26B A4B (Mixture-of-Experts, ~4B active / 25.2B total), and 31B dense — covering the full deployment spectrum from smartphones to data center GPUs
Mixture-of-Experts (MoE) architecture in the 26B A4B: introduces Gemma’s first MoE model, with 128 regular experts plus 1 shared expert per MoE layer, routing 8 experts per token — enabling near-27B-quality output at roughly 4B-equivalent inference cost
Per-Layer Embeddings (PLE) for E2B and E4B: instead of a single shared embedding table, each transformer layer has its own lightweight embedding lookup (PLE dimensions: vocab × 256 × n_layers), injected after each attention/FFN block — enabling very high intelligence-per-parameter on edge hardware; PLE tables are designed to reside in flash memory rather than VRAM
Native audio input for E2B and E4B: the smallest two models accept audio tokens directly (in addition to text and images), making Gemma 4 the first core Gemma generation with audio-native edge models
Upgraded SigLIP 2 vision encoder across all model sizes: variable-resolution tile processing (up to 896×896 per tile, multi-tile per prompt) — building on Gemma 3’s SigLIP backbone with improved alignment training
Thinking mode via a <|think|> special token: models can be prompted to produce step-by-step chain-of-thought reasoning traces before the final answer, enabling stronger performance on math and reasoning benchmarks without a separate reasoning model
Agentic capabilities: native support for function calling, tool use, planning, and system prompts (<|system|> role), enabling Gemma 4 models as drop-in agents in orchestration frameworks
Context window extends to 256K tokens for 26B A4B and 31B models (128K for E2B/E4B) — up from 128K in Gemma 3’s largest models — enabled by Proportional RoPE (p-RoPE)
Hybrid local/global attention continues the Gemma 3 5:1 pattern with refinements: local window shrinks to 512 tokens for E2B (4 local + 1 global per block) and 1,024 tokens for 26B/31B (5:1); global layers use the K=V trick (Keys = Values, halving KV cache at those layers) and Shared KV Cache (last N layers reuse K/V from previous same-type layers)
Proportional RoPE (p-RoPE): for long-context (256K) global attention layers, only a fraction p=0.25 of the RoPE coordinate pairs receive positional encoding — limiting positional noise in long sequences while preserving semantic tracking
GQA pattern refined: local layers use 2 Q heads per 1 KV head; global layers can use up to 8 Q heads per 1 KV head — further reducing KV cache memory
MMLU Pro 85.2% (31B) — state-of-the-art among Apache 2.0 open models at launch, outperforming Llama 4 Scout and Qwen 3 32B on several reasoning benchmarks; MATH 500 92.1% (31B); HumanEval 91.4% (31B)
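The Proportional RoPE idea in the list above (only a fraction p of coordinate pairs receive positional rotation) can be sketched as follows. This follows the description in this document only: which pairs are rotated (here the first, highest-frequency ones), the adjacent-pair layout, the default `theta`, and the name `p_rope` are all assumptions, not details confirmed for Gemma 4.

```python
import math

def p_rope(vec: list[float], pos: int, theta: float = 10_000.0, p: float = 0.25) -> list[float]:
    """Rotate only the first floor(p * d/2) coordinate pairs; leave the rest untouched."""
    d = len(vec)
    n_rotated = int(p * (d // 2))
    out = list(vec)
    for i in range(n_rotated):
        angle = pos * theta ** (-2.0 * i / d)      # standard RoPE frequency for pair i
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * c - y * s                 # 2-D rotation of the pair
        out[2 * i + 1] = x * s + y * c
    return out

query = [float(i + 1) for i in range(16)]
rotated = p_rope(query, pos=7)
```

With a 16-dimensional vector and p=0.25, only the first two pairs (four coordinates) carry position; the remaining twelve coordinates are position-independent and can match content at any distance.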

Architecture Diagram — Gemma 4

Gemma 4 — Multi-Modal Architecture with MoE, PLE, and Long Context

Inputs:
Text tokens (all variants): 262,144-entry vocabulary
SigLIP 2 vision (all variants): 896×896 tiles, multi-tile
Audio (E2B / E4B only): native audio tokens

✨ NEW: PLE (E2B / E4B)

Per-Layer Embeddings (PLE)

Each transformer layer fetches a layer-specific token embedding from a flash-resident PLE table (vocab × 256 × n_layers) and injects it after attention/FFN via a lightweight residual block — boosting intelligence-per-parameter on edge hardware without increasing active VRAM requirements.
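
To make the PLE description concrete, the sketch below computes the size of a table with the stated shape (vocab × 256 × n_layers) and shows one possible residual injection. The vocabulary size and layer counts come from this document; `inject_ple` and its projection matrix are hypothetical stand-ins, since the source only says the block is "lightweight" and residual.

```python
VOCAB_SIZE = 262_144  # vocabulary size shown in the diagram
PLE_DIM = 256         # per-layer embedding width from the text


def ple_table_entries(n_layers: int) -> int:
    """Number of values in a vocab x 256 x n_layers PLE table."""
    return VOCAB_SIZE * PLE_DIM * n_layers


def inject_ple(hidden, ple_vec, proj):
    """Return hidden + proj @ ple_vec.

    Hypothetical residual block: `proj` is a d_model x ple_dim matrix given
    as a list of rows. The real block may add gating or normalization.
    """
    return [h + sum(w * e for w, e in zip(row, ple_vec))
            for h, row in zip(hidden, proj)]


print(ple_table_entries(35))  # E2B, 35 layers
print(ple_table_entries(42))  # E4B, 42 layers
```

Under that shape, a 35-layer model has roughly 2.35 billion table values, which illustrates why keeping the table in flash rather than accelerator memory matters on edge hardware.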

✨ 5:1 + K=V + Shared KV

Hybrid Attention (5 Local + 1 Global) with Memory Optimizations

🔵 5× Local SWA: window = 1K tokens, GQA 2:1 Q/KV, fast

🟪 1× Global: K=V trick, Shared KV, p-RoPE (p=0.25 for 256K)
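
A rough sketch of the p-RoPE idea for the global layers follows: rotate only a fraction p of the coordinate pairs and pass the rest through without positional encoding. Several details are assumptions not confirmed by the source: the interleaved pair layout, the choice to rotate the highest-frequency pairs, and the base of 10000. The function name `p_rope` is also hypothetical.

```python
import math


def p_rope(x, position, p=0.25, base=10000.0):
    """Apply rotary encoding to only the first p fraction of coordinate pairs.

    Assumptions: pairs are (x[2i], x[2i+1]); pair i uses the usual RoPE
    frequency base**(-2i/d); the rotated pairs are the highest-frequency ones.
    """
    d = len(x)
    n_pairs = d // 2
    n_rotated = int(p * n_pairs)
    out = list(x)
    for i in range(n_rotated):
        theta = position * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    # Pairs n_rotated .. n_pairs-1 carry no positional signal at all.
    return out


query = [1.0] * 16
encoded = p_rope(query, position=7)
print(encoded[:4])  # rotated coordinates
print(encoded[4:])  # untouched coordinates
```

With 16 dimensions and p=0.25, only 2 of the 8 pairs are rotated; the remaining coordinates are identical at every position, so they can match content regardless of distance.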

✨ NEW: MoE (26B A4B)

Mixture-of-Experts FFN (26B A4B only)

128 experts + 1 shared → top-8 routing per token → ~4B active of 25.2B total
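
The routing step in the diagram (128 experts plus one shared, top-8 per token) might look like the following sketch. It is illustrative only: renormalizing with a softmax over the selected experts and always adding the shared expert's output are assumptions, and the toy experts here just scale their input.

```python
import math
import random


def top_k_route(logits, k=8):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}


def moe_ffn(x, router_logits, experts, shared_expert, k=8):
    """Shared expert output plus the weighted sum of the k routed experts."""
    weights = top_k_route(router_logits, k)
    out = shared_expert(x)  # assumption: the shared expert always runs
    for idx, w in weights.items():
        y = experts[idx](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out


# Toy experts: expert i multiplies its input by i (stand-ins for real FFNs).
experts = [lambda x, s=i: [s * v for v in x] for i in range(128)]
shared_expert = lambda x: list(x)

random.seed(0)
router_logits = [random.gauss(0.0, 1.0) for _ in range(128)]
print(sorted(top_k_route(router_logits)))  # indices of the 8 active experts
print(moe_ffn([1.0, 2.0], router_logits, experts, shared_expert))
```

Only 8 of the 128 routed experts run for a given token, which is how the active parameter count stays near 4B while the total is 25.2B.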

📐 Model Size Configurations

Model	Layers	Effective Params	Context	Modalities	PLE/MoE
Gemma 4 E2B	35	~2.3B (5.1B total)	128K	Text + Image + Audio	PLE
Gemma 4 E4B	42	~4.5B (8B total)	128K	Text + Image + Audio	PLE
Gemma 4 26B A4B	—	~4B active (25.2B total)	256K	Text + Image	MoE (128+1 exp, top-8)
Gemma 4 31B	—	30.7B dense	256K	Text + Image	Dense
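
As a back-of-the-envelope check on what these sizes mean for hardware, the sketch below estimates weight storage from the total parameter counts in the table. It is a simplification: it ignores the KV cache, activations, and quantization metadata, and it counts PLE tables as ordinary weights even though the text describes them as flash-resident. The precision names and the helper `weight_gb` are illustrative.

```python
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}

# Total parameters in billions, taken from the table above.
TOTAL_PARAMS_B = {"E2B": 5.1, "E4B": 8.0, "26B A4B": 25.2, "31B": 30.7}


def weight_gb(params_billion: float, dtype: str) -> float:
    """Approximate weight storage in decimal gigabytes."""
    return params_billion * BYTES_PER_PARAM[dtype]


for name, params in TOTAL_PARAMS_B.items():
    row = ", ".join(f"{dt}: {weight_gb(params, dt):.1f} GB" for dt in BYTES_PER_PARAM)
    print(f"{name:8s} {row}")
```

For the MoE model the estimate uses all 25.2B parameters, not the ~4B active ones: routing reduces compute per token, but every expert must still be stored somewhere.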

Community Perspective

The Apache 2.0 license change was the biggest headline: the developer community immediately highlighted it as a watershed moment for Google’s open-source AI strategy. Previous Gemma generations shipped under the custom Gemma Terms of Use, which restricted certain uses, and many organizations had avoided Gemma for that reason
MoE in the 26B A4B was welcomed as a practical breakthrough: running a 25B-parameter model at roughly the per-token compute of a 4B model made Gemma 4 accessible to individual researchers with a single 16–18 GB GPU, where a model of this size previously called for workstation-class hardware. Routing lowers compute per token, not the size of the stored weights, so that memory figure presumably assumes quantized weights
Per-Layer Embeddings (PLE) generated significant interest: the idea of offloading per-layer token embeddings to flash memory as a way to dramatically boost intelligence-per-active-parameter was novel and sparked architectural discussions across the ML community
Thinking mode adoption was rapid: the <|think|> token mechanism was straightforward to use via standard chat templates, and benchmarks quickly showed 5–10% gains on AIME and LiveCodeBench when thinking was enabled — making it a go-to feature for technical users
Audio support in edge models (E2B/E4B) was noted as a significant practical advantage for mobile and IoT voice applications, though the community awaited more detailed audio benchmark comparisons
The K=V trick and Shared KV Cache were praised by inference engineers as elegant solutions to KV cache memory pressure in long-context scenarios, reducing global attention memory overhead by up to 50%
AIME 2026 ~89.2% was the standout math benchmark result, placing Gemma 4 31B above many closed models on competition-level mathematics — validating the thinking mode + improved pre-training
Comparison to Llama 4: the community broadly assessed Gemma 4 as competing favorably with Llama 4 Scout (on an active-parameter basis) and Llama 4 Maverick on reasoning-heavy tasks, while the Apache 2.0 license gave Gemma 4 an advantage in enterprise adoption
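
The "up to 50%" figure for global layers follows from simple accounting: a standard layer caches two tensors (K and V) per token, while a K=V layer caches one. The sketch below works through that arithmetic. All layer counts, head counts, and dimensions in the example call are hypothetical placeholders, not Gemma 4's published configuration, and the Shared KV Cache is not modeled.

```python
def kv_cache_bytes(seq_len, n_local, n_global, window,
                   kv_heads_local, kv_heads_global, head_dim,
                   bytes_per_elem=2, k_equals_v=True):
    """Estimate KV cache size for a hybrid local/global stack."""
    # Local layers only keep the most recent `window` tokens.
    local_tokens = min(seq_len, window)
    local = n_local * 2 * kv_heads_local * head_dim * local_tokens * bytes_per_elem
    # Global layers keep every token; K=V stores one tensor instead of two.
    tensors = 1 if k_equals_v else 2
    global_ = n_global * tensors * kv_heads_global * head_dim * seq_len * bytes_per_elem
    return local + global_


# Hypothetical configuration, chosen only to make the arithmetic visible.
cfg = dict(seq_len=262_144, n_local=40, n_global=8, window=1024,
           kv_heads_local=8, kv_heads_global=2, head_dim=128)

baseline = kv_cache_bytes(**cfg, k_equals_v=False)
with_kv_trick = kv_cache_bytes(**cfg, k_equals_v=True)
print(f"separate K and V: {baseline / 2**30:.2f} GiB")
print(f"K=V on global layers: {with_kv_trick / 2**30:.2f} GiB")
```

The saving approaches 50% of the total only when global layers dominate the cache, which is the case at long context because local layers are capped at the window size.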

Model Variants

Model	Effective Params	Total Params	Layers	Context	Modalities	License	Architecture
Gemma4-E2B	~2.3B	5.1B	35	128K	Text + Image + Audio	Apache 2.0	Dense + PLE
Gemma4-E2B-IT	~2.3B	5.1B	35	128K	Text + Image + Audio	Apache 2.0	Dense + PLE + SFT/RLHF
Gemma4-E4B	~4.5B	8B	42	128K	Text + Image + Audio	Apache 2.0	Dense + PLE
Gemma4-E4B-IT	~4.5B	8B	42	128K	Text + Image + Audio	Apache 2.0	Dense + PLE + SFT/RLHF
Gemma4-26B-A4B	~4B active	25.2B	—	256K	Text + Image	Apache 2.0	MoE (128+1 exp, top-8)
Gemma4-26B-A4B-IT	~4B active	25.2B	—	256K	Text + Image	Apache 2.0	MoE + SFT/RLHF
Gemma4-31B	30.7B	30.7B	—	256K	Text + Image	Apache 2.0	Dense
Gemma4-31B-IT	30.7B	30.7B	—	256K	Text + Image	Apache 2.0	Dense + SFT/RLHF

Companion model: ShieldGemma 4 safety classifier released alongside for responsible deployment.

Key Industry Ideas Incorporated

| Technique | Origin | How Gemma 4 Used It |
|:----------|:-------|:--------------------|
| Mixture-of-Experts (MoE) | Shazeer et al., "Outrageously Large Neural Networks" (2017); Mixtral (2023) | 26B A4B: 128+1 experts per MoE layer, top-8 routing; full MoE debut in the Gemma family |
| Per-Layer Embeddings (PLE) | Gemma 4 (Google DeepMind, 2026) | Layer-specific embedding tables fetched from flash memory, injected residually after each layer for edge intelligence-per-parameter boost |
| Proportional RoPE (p-RoPE) | Gemma 4 (Google DeepMind, 2026) | Only 25% of RoPE coordinate pairs encoded in global attention layers for 256K contexts, reducing positional noise |
| K=V Trick (Keys = Values) | Gemma 4 (Google DeepMind, 2026) | Global attention layers set K=V, collapsing the K and V caches into a single cache and halving memory requirements |
| Shared KV Cache | Gemma 4 (Google DeepMind, 2026) | Last N layers of the same attention type share K/V, reducing redundant memory across layers |
| Chain-of-Thought / Thinking Mode | Wei et al., "Chain-of-Thought Prompting" (NeurIPS 2022); DeepSeek-R1 (2025) | `<\|think\|>` token activates step-by-step reasoning traces before final output |
| SigLIP 2 (upgraded vision encoder) | Zhai et al., SigLIP (2023) + Google improvements | Enhanced alignment training over Gemma 3's SigLIP; variable-resolution multi-tile input |
| Function Calling / Tool Use | Toolformer (2023), GPT-4 function calling (2023) | Native tool use and function calling via chat templates for agentic deployments |
| Apache 2.0 Open Licensing | Apache Software Foundation (2004); OSI-approved | First Gemma generation fully open; previous versions used custom Gemma Terms of Use |

📚 References

Technical Papers

Version	Title	Link	Date
Gemma 1	Gemma: Open Models Based on Gemini Research and Technology	arXiv:2403.08295	Mar 2024
Gemma 2	Gemma 2: Improving Open Language Models at a Practical Size	arXiv:2408.00118	Aug 2024
Gemma 3	Gemma 3 Technical Report	arXiv:2503.19786	Mar 2025
Gemma 4	Gemma 4 Model Card (Google AI for Developers)	ai.google.dev/gemma/docs/core/model_card_4	Apr 2026

Official Blog Posts

Title	Link
Gemma: Introducing New State-of-the-Art Open Models	blog.google/technology/developers/gemma-open-models/
Gemma 2: Advancing Frontier AI Responsibly	blog.google/technology/developers/google-gemma-2/
Gemma 3 — The Developer Guide	developers.googleblog.com/en/introducing-gemma3/
Gemma 4: Expanding the Gemmaverse with Apache 2.0	opensource.googleblog.com/2026/04/gemma-4-expanding-gemmaverse-apache-2.html
Google DeepMind Gemma Page	deepmind.google/models/gemma/

GitHub & Model Repositories

Resource	Link
Gemma GitHub (google-deepmind)	github.com/google-deepmind/gemma
Gemma on Hugging Face	huggingface.co/google/gemma-7b
Gemma 2 on Hugging Face	huggingface.co/google/gemma-2-27b
Gemma 3 on Hugging Face	huggingface.co/google/gemma-3-27b-it
Gemma 4 on Hugging Face	huggingface.co/google/gemma-4-31b-it
Gemma on Kaggle	kaggle.com/models/google/gemma
Keras NLP Gemma	keras.io/api/keras_nlp/models/gemma/

Cited Techniques

Technique	Paper	Link
GeGLU Activation	Shazeer, “GLU Variants Improve Transformer” (2020)	arXiv:2002.05202
RoPE	Su et al., “RoFormer: Enhanced Transformer with Rotary Position Embedding” (2021)	arXiv:2104.09864
RMSNorm	Zhang & Sennrich, “Root Mean Square Layer Normalization” (2019)	arXiv:1910.07467
GQA	Ainslie et al., “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints” (EMNLP 2023)	arXiv:2305.13245
Sliding Window Attention	Beltagy et al., “Longformer: The Long-Document Transformer” (2020)	arXiv:2004.05150
Knowledge Distillation	Hinton et al., “Distilling the Knowledge in a Neural Network” (2015)	arXiv:1503.02531
SigLIP	Zhai et al., “Sigmoid Loss for Language Image Pre-Training” (2023)	arXiv:2303.15343
SentencePiece BPE	Kudo & Richardson, “SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing” (EMNLP 2018)	arXiv:1808.06226
Tied Embeddings	Press & Wolf, “Using the Output Embedding to Improve Language Models” (EACL 2017)	arXiv:1608.05859
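Quantization (INT4/INT8; GPTQ is post-training, cited as background for QAT)	Frantar et al., “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers” (ICLR 2023)	arXiv:2210.17323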
InstructGPT / RLHF	Ouyang et al., “Training language models to follow instructions with human feedback” (NeurIPS 2022)	arXiv:2203.02155
Gemini Architecture	Gemini Team, “Gemini: A Family of Highly Capable Multimodal Models” (2023)	arXiv:2312.11805
LLaMA 3 (RoPE scaling)	Meta AI, “The Llama 3 Herd of Models” (2024)	arXiv:2407.21783
Mixture-of-Experts	Shazeer et al., “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer” (ICLR 2017)	arXiv:1701.06538
Chain-of-Thought Prompting	Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” (NeurIPS 2022)	arXiv:2201.11903

Built with data from official Gemma technical reports, Google DeepMind blog posts, and the Gemma 4 model card (April 2026). All benchmark numbers are sourced directly from the referenced publications.
