docs: initial Gemma 4 research corpus and synthesis
Architecture specs, benchmarks, gotchas, Ollama settings, tool calling format, and implementation patterns from Simon and AI_Visualizer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,105 @@
# Gemma 4 Architecture Reference

> Sources: Google DeepMind blog, HuggingFace blog (huggingface.co/blog/gemma4),
> Maarten Grootendorst visual guide, kaitchup.substack.com, wavespeed.ai

## Model Family

| Variant | Total Params | Effective Params | Type | Notes |
|---------|-------------|-----------------|------|-------|
| E2B | ~5.1B | ~2.3B | Dense + PLE | On-device, audio+vision |
| E4B | ~8B | ~4B | Dense + PLE | On-device, audio+vision |
| 31B | 31B | 31B | Dense | 60 layers, widened vs Gemma 3 27B (62 layers) |
| 26B A4B | 26B | ~4B active | MoE | 128 experts, 8 active + 1 shared |

## Attention Architecture

- **Pattern:** Local (sliding window) interleaved with global attention
  - E2B: 4:1 ratio (4 local, 1 global). E4B/31B/26B: 5:1 ratio
  - Global attention is always the last layer
- **Sliding window:** E2B/E4B = 512 tokens; 31B/26B = 1024 tokens
- **Grouped Query Attention (GQA):**
  - Local: 2 query heads share 1 KV head
  - Global: 8 query heads share 1 KV head, doubled Key dimensions
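
The interleaving and head-sharing rules above can be sketched in a few lines. This is an illustrative reading, not the reference implementation: it assumes the global layer closes each group of local layers, and the 16-query-head figure is a made-up value used only to show the GQA arithmetic.

```python
def layer_types(n_layers: int, local_per_global: int) -> list[str]:
    """Return 'local' or 'global' per layer for an N:1 interleaving pattern."""
    period = local_per_global + 1
    types = ["global" if (i + 1) % period == 0 else "local" for i in range(n_layers)]
    types[-1] = "global"  # the final layer is always global
    return types


def kv_heads(n_query_heads: int, queries_per_kv: int) -> int:
    """Number of KV heads when `queries_per_kv` query heads share one KV head."""
    if n_query_heads % queries_per_kv:
        raise ValueError("query heads must divide evenly into KV groups")
    return n_query_heads // queries_per_kv


pattern = layer_types(60, 5)    # 31B: 60 layers at a 5:1 ratio
print(pattern[:6])              # five local layers followed by one global layer
print(pattern.count("global"))  # 10 global layers out of 60

# Hypothetical head count, chosen only to illustrate the sharing ratios.
print(kv_heads(16, 2), kv_heads(16, 8))  # local: 8 KV heads, global: 2 KV heads
```

The higher sharing ratio on global layers matters most for long contexts, because those are the layers whose KV cache grows with the full sequence rather than with the sliding window.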

## Positional Encoding: Proportional RoPE (p-RoPE)

- Applied to global attention layers only
- p=0.25 -> rotates only 25% of head dimensions
- theta=1M
- 75% of dimensions are position-independent -> better long-context extrapolation
- Replaces Gemma 3's 8x linear frequency scaling
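
A minimal pure-Python sketch of the idea, for one head vector at one position. Several details here are assumptions rather than facts from the sources: that the rotated dimensions are the first 25% of adjacent pairs, that pairs are `(2i, 2i+1)`, and that the frequency schedule is the standard RoPE `theta^(-2i/d)`. The function name `p_rope` is hypothetical.

```python
import math


def p_rope(vec: list[float], pos: int, p: float = 0.25, theta: float = 1_000_000.0) -> list[float]:
    """Rotate only the first `p` fraction of dimension pairs; leave the rest untouched."""
    d = len(vec)
    if d % 2:
        raise ValueError("head dimension must be even")
    n_pairs = d // 2
    n_rot = int(p * n_pairs)  # number of pairs that receive a positional rotation
    out = list(vec)
    for i in range(n_rot):
        # Assumed frequency schedule: standard RoPE, theta^(-2i/d).
        angle = pos * theta ** (-2.0 * i / d)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * math.cos(angle) - y * math.sin(angle)
        out[2 * i + 1] = x * math.sin(angle) + y * math.cos(angle)
    return out


q = [1.0] * 16
rotated = p_rope(q, pos=5)
print(rotated[4:] == q[4:])  # True: 12 of 16 dimensions (75%) ignore position
```

Because the untouched dimensions carry no positional signal, their contribution to the attention score depends only on content, which is the property the extrapolation claim above rests on.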

## Per-Layer Embeddings (PLE): E2B/E4B Only

- Each decoder layer gets its own unique token representation
- Parallel lower-dimensional pathway alongside main residual stream
- PLE dimensions: 256 (E2B), 2560 (E4B)
- Original embedding dimensions: 1536 (E2B), 2560 (E4B)
- Applied between decoder blocks with gating function
- This is why E2B has 5.1B total parameters but only 2.3B effective: the PLE table is large
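
A back-of-envelope check of why the PLE table is large, assuming it stores one `ple_dim`-sized row per vocabulary token per decoder layer. The layer count below is a hypothetical placeholder (the sources quoted here do not give it for E2B), so the result only shows the order of magnitude.

```python
def ple_table_params(vocab_size: int, n_layers: int, ple_dim: int) -> int:
    """Parameters in a per-layer embedding table: one row per token per layer."""
    return vocab_size * n_layers * ple_dim


VOCAB = 262_144    # vocabulary size from this document
PLE_DIM_E2B = 256  # E2B PLE dimension from the list above
N_LAYERS = 35      # hypothetical placeholder, NOT taken from the sources

table = ple_table_params(VOCAB, N_LAYERS, PLE_DIM_E2B)
gap = 5.1e9 - 2.3e9  # total minus effective parameters for E2B

print(f"PLE table: {table / 1e9:.2f}B parameters")
print(f"Total vs effective gap: {gap / 1e9:.1f}B parameters")
```

Under these assumptions the table alone lands in the low billions, the same order as the gap between total and effective parameters. A lookup table costs storage but almost no compute per token, which is why it can be excluded from the effective count.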

## Shared KV Cache

- Last N layers reuse K/V tensors from earlier layers (same attention type)
- Reported to cause no measurable quality loss in practice
- Significant memory + compute savings for long-context generation
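
One way to picture the sharing rule is as a lookup from each layer to the layer whose K/V tensors it reads. The sketch below assumes each shared layer reuses the most recent non-shared layer of the same attention type; the exact pairing used by the model is not stated in the sources, and `kv_source_layers` is a hypothetical helper.

```python
def kv_source_layers(layer_types: list[str], n_shared: int) -> list[int]:
    """For each layer, return the index of the layer whose K/V tensors it uses."""
    n = len(layer_types)
    first_shared = n - n_shared
    sources = list(range(n))  # by default every layer computes its own K/V
    for i in range(first_shared, n):
        # Walk backwards to the most recent non-shared layer of the same type.
        # If none exists, the layer keeps computing its own K/V.
        for j in range(first_shared - 1, -1, -1):
            if layer_types[j] == layer_types[i]:
                sources[i] = j
                break
    return sources


types = ["local", "local", "global", "local", "local", "global"]
print(kv_source_layers(types, n_shared=3))  # [0, 1, 2, 1, 1, 2]
```

In this toy example only three of six layers store K/V tensors, so the cache is half the size it would be without sharing.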

## Vision Encoder

- Params: 150M (E2B/E4B), 550M (31B/26B)
- Patch size: 16x16 pixels
- 3x3 neighboring patches merged into single embedding
- Uses 2D RoPE for variable aspect ratio
- Token budgets: 70, 140, 280, 560, 1120 soft tokens
- Approximate resolutions: 272x176 (70 tokens) -> 1088x704 (1120 tokens)

## Audio Encoder: E2B/E4B Only

- Conformer architecture with convolutional modules
- Mel-spectrogram feature extraction
- Two 2D conv layers for downsampling
- NOT available on 31B or 26B variants

## MoE Details (26B A4B)

- 128 total experts
- 8 experts activated per token
- 1 shared expert (3x the size of a regular expert)
- 119 experts sit idle for any given token; the idle set changes from token to token
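
The per-token selection can be sketched as top-k routing over router scores. This is a generic illustration, not the model's actual router: the softmax-over-selected-experts weighting is an assumption, the logits are synthetic, and `route_token` is a hypothetical name.

```python
import math


def route_token(router_logits: list[float], top_k: int = 8) -> dict[int, float]:
    """Pick the top-k routed experts for one token and normalise their weights."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = ranked[:top_k]
    peak = max(router_logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


# Synthetic router scores for 128 experts; a real router computes these per token.
logits = [math.sin(i) for i in range(128)]
weights = route_token(logits)

print(len(weights))                      # 8 routed experts for this token
print(round(sum(weights.values()), 6))   # weights sum to 1.0
# The shared expert is not routed: it runs for every token in addition to these 8.
```

Since a different token can pick a different set of 8, every expert's weights must be resident in memory even though most are idle at any moment.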

## Context Window

| Variant | Context Window | MRCR v2 8-needle @ 128K |
|---------|---------------|------------------------|
| E2B | 128K | 19.1% |
| E4B | 128K | 25.4% |
| 26B A4B | 256K | 44.1% |
| 31B | 256K | 66.4% |
| Gemma 3 27B | 128K | 13.5% |

- Ollama's default `num_ctx` is 2048, far below these windows; override it explicitly
- Retrieval accuracy degrades beyond ~100K tokens in repetitive or unstructured text
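
One way to override `num_ctx` is per request, through the `options` field of Ollama's `/api/generate` endpoint. The sketch below only builds the request, so it runs without a server. The model tag `gemma4:e4b` is a hypothetical placeholder, and 32768 is an arbitrary example value rather than a recommendation.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_generate_request(model: str, prompt: str, num_ctx: int) -> urllib.request.Request:
    """Build a non-streaming generate request with an explicit context length."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # overrides the default context length
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "gemma4:e4b" is a placeholder tag; substitute whatever `ollama list` reports.
req = build_generate_request("gemma4:e4b", "Summarise this document.", num_ctx=32768)
print(json.loads(req.data)["options"])

# With an Ollama server running, send it with: urllib.request.urlopen(req)
```

A larger `num_ctx` increases KV-cache memory, so the value that fits depends on the variant and the quantization in use.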

## Vocabulary

- SentencePiece tokenizer, 262,144 tokens (2^18, i.e. 256 × 1024; Gemma 1 and 2 used a 256,000-token vocabulary)

## Memory Requirements (approximate)

| Model | BF16 | 8-bit | 4-bit |
|-------|------|-------|-------|
| E2B | 9.6 GB | 4.6 GB | 3.2 GB |
| E4B | 15 GB | 7.5 GB | 5 GB |
| 31B Dense | 58.3 GB | 30.4 GB | 17.4 GB |
| 26B A4B (MoE) | 48 GB | 25 GB | 15.6 GB |

Note: 26B MoE requires ALL 26B params loaded despite only activating ~4B per token.
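
The table can be sanity-checked with a weights-only estimate: parameter count times bytes per parameter. This is a lower bound under simple assumptions (every parameter stored at the stated width, no runtime buffers or KV cache). The quantized figures in the table come out higher than this estimate, presumably because real quantized files carry scaling metadata and keep some tensors at higher precision.

```python
def weight_memory_gib(params_billions: float, bits_per_param: int) -> float:
    """Weights-only memory in GiB for a given parameter count and bit width."""
    return params_billions * 1e9 * bits_per_param / 8 / 2**30


# Total (not active) parameter counts from the Model Family table.
models = [("E2B", 5.1), ("E4B", 8.0), ("31B Dense", 31.0), ("26B A4B (MoE)", 26.0)]

for name, params in models:
    row = [f"{weight_memory_gib(params, bits):.1f}" for bits in (16, 8, 4)]
    print(f"{name}: BF16 {row[0]} GiB, 8-bit {row[1]} GiB, 4-bit {row[2]} GiB")
```

The MoE row uses the full 26B figure, not the ~4B active count, for the reason given in the note above: routing can select any expert, so all of them must be loaded.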

## License

Apache 2.0. This is a major change from Gemma 3's proprietary "Gemma Terms of Use": no custom clauses and no redistribution restrictions.

## Training Data Cutoff

January 2025