gemma4-research/CORPUS_architecture.md
Mortdecai 5011059f5d docs: initial Gemma 4 research corpus and synthesis
Architecture specs, benchmarks, gotchas, Ollama settings, tool calling
format, and implementation patterns from Simon and AI_Visualizer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 18:14:19 -04:00

Gemma 4 Architecture Reference

Sources: Google DeepMind blog, HuggingFace blog (huggingface.co/blog/gemma4), Maarten Grootendorst visual guide, kaitchup.substack.com, wavespeed.ai

Model Family

| Variant | Total Params | Effective Params | Type | Notes |
|---|---|---|---|---|
| E2B | ~5.1B | ~2.3B | Dense + PLE | On-device, audio+vision |
| E4B | ~8B | ~4B | Dense + PLE | On-device, audio+vision |
| 31B | 31B | 31B | Dense | 60 layers, widened vs Gemma 3 27B (62 layers) |
| 26B A4B | 26B | ~4B active | MoE | 128 experts, 8 active + 1 shared |

Attention Architecture

  • Pattern: Local (sliding window) interleaved with global attention
    • E2B: 4:1 ratio (4 local, 1 global). E4B/31B/26B: 5:1 ratio
    • Global attention is always the last layer
  • Sliding window: E2B/E4B = 512 tokens; 31B/26B = 1024 tokens
  • Grouped Query Attention (GQA):
    • Local: 2 query heads share 1 KV head
    • Global: 8 query heads share 1 KV head, doubled Key dimensions
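The interleaving and head-sharing rules above can be sketched in a few lines. This is an illustration only: the function names are made up here, the assumption is that every (ratio + 1)-th layer is global, and the query-head count of 32 is a placeholder that does not come from this document.

```python
def layer_pattern(num_layers, local_per_global):
    """Label each decoder layer "local" or "global".

    Assumes every (local_per_global + 1)-th layer is global, so the last
    layer is global only when num_layers is a multiple of that period.
    """
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local"
            for i in range(num_layers)]


def kv_heads(num_query_heads, queries_per_kv):
    """Number of KV heads under GQA when queries_per_kv query heads share one."""
    if num_query_heads % queries_per_kv != 0:
        raise ValueError("query heads must divide evenly into KV groups")
    return num_query_heads // queries_per_kv


# 31B: 60 layers at a 5:1 ratio.
pattern_31b = layer_pattern(60, 5)
print(pattern_31b.count("global"), pattern_31b[-1])

# Hypothetical 32 query heads: 2:1 sharing on local layers, 8:1 on global.
print(kv_heads(32, 2), kv_heads(32, 8))
```

With 60 layers and a 5:1 ratio this gives 10 global layers, and the final layer is global, which matches the "always the last layer" rule.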

Positional Encoding: Proportional RoPE (p-RoPE)

  • Applied to global attention layers only
  • p=0.25 -> rotates only 25% of head dimensions
  • theta=1M
  • 75% of dimensions are position-independent -> better long-context extrapolation
  • Replaces Gemma 3's 8x linear frequency scaling
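A minimal sketch of partial rotation for a single head vector, to make the "rotate only 25% of dimensions" idea concrete. Which dimensions are rotated, how they are paired, and how frequencies are assigned are assumptions made for illustration (here: the leading pairs, adjacent pairing, frequencies computed against the full head dimension). The function name `p_rope` is hypothetical.

```python
import math


def p_rope(x, pos, p=0.25, theta=1_000_000.0):
    """Rotate the first fraction p of dimension pairs of x by position pos.

    The remaining pairs are left untouched, so they carry no positional signal.
    """
    d = len(x)
    n_pairs = d // 2
    n_rot = int(n_pairs * p)              # number of rotated pairs
    out = list(x)
    for i in range(n_rot):
        freq = theta ** (-2.0 * i / d)    # assumption: standard RoPE schedule
        angle = pos * freq
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out


head = [1.0] * 16
rotated = p_rope(head, pos=7)
print(rotated[:4])   # rotated dimensions
print(rotated[4:])   # unchanged, position-independent dimensions
```

For a 16-dimensional head, 2 of the 8 pairs (4 of 16 dimensions) are rotated; the other 12 dimensions pass through unchanged.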

Per-Layer Embeddings (PLE) — E2B/E4B Only

  • Each decoder layer gets its own unique token representation
  • Parallel lower-dimensional pathway alongside main residual stream
  • PLE dimensions: 256 (E2B), 2560 (E4B)
  • Original embedding dimensions: 1536 (E2B), 2560 (E4B)
  • Applied between decoder blocks with gating function
  • This is why E2B has 5.1B total but only 2.3B effective — the PLE table is large
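A rough way to see why the PLE table dominates the total parameter count is to multiply vocabulary size by layer count by PLE width. The layer count for E2B is not given in this document, so `n_layers=35` below is a placeholder, and the assumption that the table holds one vector per token per layer is illustrative only.

```python
def ple_table_params(vocab_size, n_layers, ple_dim):
    """Parameters in a per-layer embedding table: one ple_dim vector
    per vocabulary token per decoder layer (assumed layout)."""
    return vocab_size * n_layers * ple_dim


# Vocabulary size and E2B PLE width are from this document; n_layers is a placeholder.
params = ple_table_params(vocab_size=262_144, n_layers=35, ple_dim=256)
print(f"{params / 1e9:.2f}B parameters in the PLE table")
```

Under these placeholder values the table alone holds about 2.35B parameters, the same order of magnitude as the 2.8B gap between E2B's total and effective counts.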

Shared KV Cache

  • Last N layers reuse K/V tensors from earlier layers (same attention type)
  • Reported to cause no measurable quality loss in practice
  • Significant memory + compute savings for long-context generation
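The sharing rule can be sketched as a lookup from each layer to the layer whose K/V tensors it uses. The document does not say which earlier layer is reused, so picking the nearest non-shared layer of the same attention type is an assumption, and `kv_source_layers` is a hypothetical helper.

```python
def kv_source_layers(layer_types, num_shared):
    """Map each layer index to the layer that supplies its K/V tensors.

    The last num_shared layers reuse K/V from the nearest earlier
    non-shared layer of the same attention type (assumed rule).
    """
    n = len(layer_types)
    first_shared = n - num_shared
    sources = list(range(n))              # by default each layer owns its K/V
    for i in range(first_shared, n):
        for j in range(first_shared - 1, -1, -1):
            if layer_types[j] == layer_types[i]:
                sources[i] = j
                break
        else:
            raise ValueError(f"no earlier {layer_types[i]} layer to share from")
    return sources


types = ["local", "local", "global", "local", "local", "global"]
sources = kv_source_layers(types, num_shared=3)
print(sources)
print(len(set(sources)), "K/V caches instead of", len(types))
```

In this toy example, six layers need only three K/V caches, which is where the memory savings for long contexts come from.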

Vision Encoder

  • Params: 150M (E2B/E4B), 550M (31B/26B)
  • Patch size: 16x16 pixels
  • 3x3 neighboring patches merged into single embedding
  • Uses 2D RoPE for variable aspect ratio
  • Token budgets: 70, 140, 280, 560, 1120 soft tokens
  • Approximate resolutions: 272x176 (70 tokens) -> 1088x704 (1120 tokens)

Audio Encoder — E2B/E4B Only

  • Conformer architecture with convolutional modules
  • Mel-spectrogram feature extraction
  • Two 2D conv layers for downsampling
  • NOT available on 31B or 26B variants

MoE Details (26B A4B)

  • 128 total experts
  • 8 experts activated per token
  • 1 shared expert (3x size of regular experts)
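  • For any single token, the unselected routed experts stay idle: 120 of 128 if the shared expert is counted separately, 119 if it is counted among the 128. Different tokens select different experts, so a batch as a whole touches more of them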
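Per-token routing with 8 of 128 experts can be sketched as a top-k selection over router scores. Normalizing with a softmax over only the selected experts is an assumption, since this document does not describe the gating function, and `route_token` is a hypothetical name. The shared expert is not routed; it would run for every token in addition to the selected ones.

```python
import math


def route_token(router_logits, top_k=8):
    """Pick the top_k experts for one token and return {expert_id: weight}.

    Weights are a softmax over the selected logits only (assumed gating).
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda e: router_logits[e], reverse=True)
    chosen = ranked[:top_k]
    peak = max(router_logits[e] for e in chosen)   # subtract max for stability
    exps = [math.exp(router_logits[e] - peak) for e in chosen]
    total = sum(exps)
    return {e: w / total for e, w in zip(chosen, exps)}


# Toy router scores for 128 experts.
logits = [0.01 * i for i in range(128)]
weights = route_token(logits)
print(sorted(weights))                              # selected expert ids
print(len(logits) - len(weights), "routed experts idle for this token")
```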

Context Window

| Variant | Context Window | MRCR v2 8-needle @ 128K |
|---|---|---|
| E2B | 128K | 19.1% |
| E4B | 128K | 25.4% |
| 26B A4B | 256K | 44.1% |
| 31B | 256K | 66.4% |
| Gemma 3 27B | 128K | 13.5% |

  • Ollama defaults num_ctx to 2048, so it must be overridden to use more than a small fraction of the context window
  • Retrieval accuracy diminishes beyond ~100K tokens in repetitive/unstructured text
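One way to override num_ctx is per request, through the `options` field of Ollama's `/api/generate` endpoint. The sketch below only builds the JSON body and does not send it. The model tag `gemma4:31b` is a guess at how the model might be named locally, and 131072 is an example value rather than a recommendation.

```python
import json


def ollama_generate_payload(model, prompt, num_ctx):
    """Build the JSON body for POST /api/generate with an explicit context size."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},   # overrides the server default
    })


# "gemma4:31b" is a hypothetical tag; check `ollama list` for the real one.
body = ollama_generate_payload("gemma4:31b", "Summarize this document.", 131072)
print(body)
```

The body would be posted to a local Ollama server, by default at http://localhost:11434/api/generate. Larger num_ctx values increase KV cache memory use, so the setting has to fit the available RAM or VRAM.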

Vocabulary

  • SentencePiece tokenizer, 262,144 tokens (2^18, usually rounded to 256K or 262K; Gemma 1 and 2 used 256,000)

Memory Requirements (approximate)

| Model | BF16 | 8-bit | 4-bit |
|---|---|---|---|
| E2B | 9.6 GB | 4.6 GB | 3.2 GB |
| E4B | 15 GB | 7.5 GB | 5 GB |
| 31B Dense | 58.3 GB | 30.4 GB | 17.4 GB |
| 26B A4B (MoE) | 48 GB | 25 GB | 15.6 GB |

Note: 26B MoE requires ALL 26B params loaded despite only activating ~4B per token.
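A back-of-the-envelope check of these figures is parameter count times bytes per parameter. The sketch below counts weights only and ignores KV cache, activations, and quantization metadata, so it should be read as a lower bound. The BF16 column above lines up better with GiB (2^30 bytes) than with decimal GB, which is an observation from the arithmetic and not something the sources state.

```python
GIB = 1024 ** 3


def weight_gib(num_params, bits_per_param):
    """Memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / GIB


# For the MoE, the total parameter count (26B) is used, not the ~4B active count,
# because every expert has to be resident in memory.
for name, params in [("31B Dense", 31e9), ("26B A4B (MoE)", 26e9)]:
    row = ", ".join(f"{bits}-bit: {weight_gib(params, bits):.1f} GiB"
                    for bits in (16, 8, 4))
    print(f"{name}: {row}")
```

The naive 4-bit estimates come out lower than the table, which is consistent with quantized formats storing scales and keeping some tensors at higher precision.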

License

Apache 2.0, a major change from Gemma 3's proprietary "Gemma Terms of Use". No custom clauses; redistribution is subject only to the standard Apache 2.0 attribution and notice requirements.

Training Data Cutoff

January 2025