# Gemma 4 Architecture Reference

> Sources: Google DeepMind blog, HuggingFace blog (huggingface.co/blog/gemma4),
> Maarten Grootendorst visual guide, kaitchup.substack.com, wavespeed.ai

## Model Family

| Variant | Total Params | Effective Params | Type | Notes |
|---------|--------------|------------------|------|-------|
| E2B | ~5.1B | ~2.3B | Dense + PLE | On-device, audio+vision |
| E4B | ~8B | ~4B | Dense + PLE | On-device, audio+vision |
| 31B | 31B | 31B | Dense | 60 layers; wider but shallower than Gemma 3 27B (62 layers) |
| 26B A4B | 26B | ~4B active | MoE | 128 experts, 8 active + 1 shared |

## Attention Architecture

- **Pattern:** Local (sliding window) attention interleaved with global attention
  - E2B: 4:1 ratio (4 local, 1 global). E4B/31B/26B: 5:1 ratio
  - The last layer is always a global-attention layer
- **Sliding window:** E2B/E4B = 512 tokens; 31B/26B = 1024 tokens
- **Grouped Query Attention (GQA):**
  - Local: 2 query heads share 1 KV head
  - Global: 8 query heads share 1 KV head, with doubled key dimensions

## Positional Encoding: Proportional RoPE (p-RoPE)

- Applied to global attention layers only
- p=0.25 -> rotates only 25% of head dimensions
- theta=1M
- The remaining 75% of dimensions are position-independent -> better long-context extrapolation
- Replaces Gemma 3's 8x linear frequency scaling

## Per-Layer Embeddings (PLE): E2B/E4B Only

- Each decoder layer gets its own token representation
- Runs as a parallel, lower-dimensional pathway alongside the main residual stream
- PLE dimensions: 256 (E2B), 2560 (E4B)
- Original embedding dimensions: 1536 (E2B), 2560 (E4B)
- Applied between decoder blocks through a gating function
- This is why E2B has 5.1B total but only 2.3B effective parameters: the PLE table is large

## Shared KV Cache

- The last N layers reuse the K/V tensors of earlier layers of the same attention type
- Reported to cause no measurable quality loss in practice
- Significant memory and compute savings for long-context generation

## Vision Encoder

- Params: 150M (E2B/E4B), 550M (31B/26B)
- Patch size: 16x16 pixels
- 3x3 neighboring patches are merged into a single embedding
- Uses 2D RoPE to handle variable aspect ratios
- Token budgets: 70, 140, 280, 560, 1120 soft tokens
- Approximate resolutions: 272x176 (70 tokens) -> 1088x704 (1120 tokens)

## Audio Encoder: E2B/E4B Only

- Conformer architecture with convolutional modules
- Mel-spectrogram feature extraction
- Two 2D conv layers for downsampling
- NOT available on the 31B or 26B variants

## MoE Details (26B A4B)

- 128 total experts
- 8 experts activated per token
- 1 shared expert (3x the size of a regular expert)
- For any given token, the remaining routed experts stay idle: 120 if the shared expert is counted separately from the 128, or 119 if it is one of them

## Context Window

| Variant | Context Window | MRCR v2 8-needle @ 128K |
|---------|----------------|-------------------------|
| E2B | 128K | 19.1% |
| E4B | 128K | 25.4% |
| 26B A4B | 256K | 44.1% |
| 31B | 256K | 66.4% |
| Gemma 3 27B | 128K | 13.5% |

- Ollama default `num_ctx`: 2048, so it must be overridden to use the long context
- Retrieval accuracy diminishes beyond ~100K tokens in repetitive/unstructured text

## Vocabulary

- SentencePiece tokenizer, 262,144 tokens (2^18, i.e. 256 x 1024; up from roughly 256,000 in Gemma 1 and 2)

## Memory Requirements (approximate)

| Model | BF16 | 8-bit | 4-bit |
|-------|------|-------|-------|
| E2B | 9.6 GB | 4.6 GB | 3.2 GB |
| E4B | 15 GB | 7.5 GB | 5 GB |
| 31B Dense | 58.3 GB | 30.4 GB | 17.4 GB |
| 26B A4B (MoE) | 48 GB | 25 GB | 15.6 GB |

Note: the 26B MoE requires ALL 26B params to be loaded, even though only ~4B are active per token.

## License

Apache 2.0. This is a major change from Gemma 3's proprietary "Gemma Terms of Use": no custom clauses, no redistribution restrictions.

## Training Data Cutoff

January 2025
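
## Sketch: Local/Global Layer Layout

The interleaving described under Attention Architecture can be made concrete with a small helper. This is a minimal sketch that assumes a strictly periodic layout (N local layers, then one global layer, repeated) and forces the final layer to be global, as the notes state. The function name `layer_pattern` is hypothetical, and the real checkpoints may order layers differently.

```python
def layer_pattern(num_layers: int, local_per_global: int) -> list[str]:
    """Label each decoder layer as 'local' or 'global'.

    Assumption: every (local_per_global + 1)-th layer is global and the
    layout is strictly periodic. The last layer is forced to 'global'
    to match the rule stated in the notes.
    """
    period = local_per_global + 1
    pattern = [
        "global" if (i + 1) % period == 0 else "local"
        for i in range(num_layers)
    ]
    pattern[-1] = "global"
    return pattern


# 31B: 60 layers at a 5:1 local-to-global ratio (figures from the notes).
layers_31b = layer_pattern(num_layers=60, local_per_global=5)
num_global = layers_31b.count("global")
num_local = layers_31b.count("local")
```

With 60 layers and a period of 6, the layout divides evenly, so the forced final global layer coincides with the periodic one.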
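
## Sketch: Partial Rotation in p-RoPE

To illustrate what "rotates only 25% of head dimensions" means, the sketch below applies a rotary rotation to the first quarter of the frequency pairs of one head vector and leaves the rest untouched. Several details are assumptions rather than confirmed Gemma 4 behavior: the pairing of dimension `i` with `i + half`, the choice to rotate the highest-frequency pairs, and the frequency formula `theta ** (-i / half)`. The name `p_rope` is hypothetical.

```python
import math


def p_rope(vec: list[float], position: int, p: float = 0.25,
           theta: float = 1_000_000.0) -> list[float]:
    """Rotate only a fraction p of the (i, i + half) pairs of one head vector.

    Assumptions: half-split pairing, standard RoPE frequencies, and the
    first n_rot pairs (highest frequencies) are the ones kept rotating.
    """
    half = len(vec) // 2
    n_rot = int(p * half)  # number of pairs that receive a rotation
    out = list(vec)
    for i in range(n_rot):
        angle = position * theta ** (-i / half)
        c, s = math.cos(angle), math.sin(angle)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out


q = [float(k + 1) for k in range(16)]  # toy head vector, head_dim = 16
rotated = p_rope(q, position=7)
unchanged = [k for k in range(16) if rotated[k] == q[k]]
```

In this toy case 12 of the 16 dimensions pass through unchanged, which is the 75% position-independent share mentioned in the notes.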
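
## Sketch: Top-8 Expert Selection

The expert counts in MoE Details can be checked with a toy router. This sketch uses plain top-k selection over random scores, which is a generic MoE pattern and not a confirmed description of the Gemma 4 router. It also assumes the shared expert sits outside the 128 routed experts. The names `route`, `NUM_ROUTED_EXPERTS`, and `TOP_K` are hypothetical.

```python
import random

NUM_ROUTED_EXPERTS = 128  # from the notes; assumed to exclude the shared expert
TOP_K = 8                 # experts activated per token, from the notes


def route(router_logits: list[float], k: int = TOP_K) -> list[int]:
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda e: router_logits[e], reverse=True)
    return ranked[:k]


rng = random.Random(0)  # fixed seed so the toy example is repeatable
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_ROUTED_EXPERTS)]
active = route(logits)
idle = NUM_ROUTED_EXPERTS - len(active)  # routed experts skipped for this token
```

Different tokens generally pick different experts, so across a batch far more than 8 experts may run in one forward pass. This is also why all expert weights must stay loaded.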
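
## Sketch: Overriding `num_ctx` in Ollama

Because the Ollama default context is much smaller than the model's window, each request can carry its own `num_ctx` under `options`. The sketch below builds such a request for Ollama's `/api/generate` endpoint using only the standard library. The model tag `gemma4:e4b` is a placeholder assumption, not a confirmed tag, and `build_generate_request` and `send` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_generate_request(model: str, prompt: str, num_ctx: int) -> dict:
    """Build a non-streaming generate payload with an explicit context size."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }


def send(payload: dict) -> dict:
    """POST the payload to a locally running Ollama server."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# "gemma4:e4b" is a placeholder tag; substitute whatever tag is installed.
payload = build_generate_request("gemma4:e4b", "Summarize this document.", num_ctx=32768)
# result = send(payload)  # requires a running Ollama server
```

A larger `num_ctx` increases KV-cache memory, so the value is best sized to the actual prompt length and not set to the model maximum by default.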
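
## Sketch: Estimating Weight Memory

The BF16 column of the memory table appears to follow from total parameters times 2 bytes, expressed in GiB. The sketch below reproduces that arithmetic. It is only a lower bound on weights: it ignores KV cache, activations, and quantization overhead, which is probably why the 8-bit and 4-bit figures in the table do not scale exactly. The helper name `weight_memory_gib` is hypothetical, and the parameter counts are the rounded totals from the Model Family table.

```python
GIB = 2 ** 30  # bytes per GiB


def weight_memory_gib(total_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return total_params * bits_per_param / 8 / GIB


# Rounded TOTAL parameter counts from the notes. The MoE entry uses all
# 26B parameters, not the ~4B active ones, because every expert is loaded.
total_params = {"E2B": 5.1e9, "E4B": 8e9, "31B": 31e9, "26B A4B": 26e9}

estimates = {
    name: {bits: round(weight_memory_gib(n, bits), 1) for bits in (16, 8, 4)}
    for name, n in total_params.items()
}

for name, row in estimates.items():
    print(f"{name:8s} BF16={row[16]} GiB  8-bit={row[8]} GiB  4-bit={row[4]} GiB")
```

The BF16 estimates land within about 1 GiB of the table values; the quantized columns in the table are somewhat higher than this naive estimate.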