From 5011059f5d634f88528d948724b3382176480eb1 Mon Sep 17 00:00:00 2001
From: Mortdecai <admin@mortdec.ai>
Date: Sun, 12 Apr 2026 18:14:19 -0400
Subject: [PATCH] docs: initial Gemma 4 research corpus and synthesis

Architecture specs, benchmarks, gotchas, Ollama settings, tool calling
format, and implementation patterns from Simon and AI_Visualizer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
 CORPUS_architecture.md        | 105 +++++++++++++++++
 CORPUS_benchmarks.md          |  40 +++++++
 CORPUS_capabilities.md        |  55 +++++++++
 CORPUS_ollama_variants.md     |  42 +++++++
 CORPUS_tool_calling_format.md | 100 +++++++++++++++++
 GOTCHAS.md                    | 205 ++++++++++++++++++++++++++++++++++
 IMPLEMENTATIONS.md            |  95 ++++++++++++++++
 README.md                     |  25 +++++
 SYNTHESIS.md                  | 194 ++++++++++++++++++++++++++++++++
 9 files changed, 861 insertions(+)
 create mode 100644 CORPUS_architecture.md
 create mode 100644 CORPUS_benchmarks.md
 create mode 100644 CORPUS_capabilities.md
 create mode 100644 CORPUS_ollama_variants.md
 create mode 100644 CORPUS_tool_calling_format.md
 create mode 100644 GOTCHAS.md
 create mode 100644 IMPLEMENTATIONS.md
 create mode 100644 README.md
 create mode 100644 SYNTHESIS.md

diff --git a/CORPUS_architecture.md b/CORPUS_architecture.md
new file mode 100644
index 0000000..9d0bc4b
--- /dev/null
+++ b/CORPUS_architecture.md
@@ -0,0 +1,105 @@
+# Gemma 4 Architecture Reference
+
+> Sources: Google DeepMind blog, HuggingFace blog (huggingface.co/blog/gemma4),
+> Maarten Grootendorst visual guide, kaitchup.substack.com, wavespeed.ai
+
+## Model Family
+
+| Variant | Total Params | Effective Params | Type | Notes |
+|---------|-------------|-----------------|------|-------|
+| E2B | ~5.1B | ~2.3B | Dense + PLE | On-device, audio+vision |
+| E4B | ~8B | ~4B | Dense + PLE | On-device, audio+vision |
+| 31B | 31B | 31B | Dense | 60 layers, widened vs Gemma 3 27B (62 layers) |
+| 26B A4B | 26B | ~4B active | MoE | 128 experts, 8 active + 1 shared |
+
+## Attention Architecture
+
+- **Pattern:** Local (sliding window) interleaved with global attention
+  - E2B: 4:1 ratio (4 local, 1 global). E4B/31B/26B: 5:1 ratio
+  - Global attention is always the last layer
+- **Sliding window:** E2B/E4B = 512 tokens; 31B/26B = 1024 tokens
+- **Grouped Query Attention (GQA):**
+  - Local: 2 query heads share 1 KV head
+  - Global: 8 query heads share 1 KV head, doubled Key dimensions
+
+## Positional Encoding: Proportional RoPE (p-RoPE)
+
+- Applied to global attention layers only
+- p=0.25 -> rotates only 25% of head dimensions
+- theta=1M
+- 75% of dimensions are position-independent -> better long-context extrapolation
+- Replaces Gemma 3's 8x linear frequency scaling
+
+## Per-Layer Embeddings (PLE) — E2B/E4B Only
+
+- Each decoder layer gets its own unique token representation
+- Parallel lower-dimensional pathway alongside main residual stream
+- PLE dimensions: 256 (E2B), 2560 (E4B)
+- Original embedding dimensions: 1536 (E2B), 2560 (E4B)
+- Applied between decoder blocks with gating function
+- This is why E2B has 5.1B total but only 2.3B effective — the PLE table is large
+
+## Shared KV Cache
+
+- Last N layers reuse K/V tensors from earlier layers (same attention type)
+- No quality loss in practice
+- Significant memory + compute savings for long-context generation
+
+## Vision Encoder
+
+- Params: 150M (E2B/E4B), 550M (31B/26B)
+- Patch size: 16x16 pixels
+- 3x3 neighboring patches merged into single embedding
+- Uses 2D RoPE for variable aspect ratio
+- Token budgets: 70, 140, 280, 560, 1120 soft tokens
+- Approximate resolutions: 272x176 (70 tokens) -> 1088x704 (1120 tokens)
+
+## Audio Encoder — E2B/E4B Only
+
+- Conformer architecture with convolutional modules
+- Mel-spectrogram feature extraction
+- Two 2D conv layers for downsampling
+- NOT available on 31B or 26B variants
+
+## MoE Details (26B A4B)
+
+- 128 total experts
+- 8 experts activated per token
+- 1 shared expert (3x size of regular experts)
+- 119 experts sit idle for any given token, though all of them must stay loaded in memory
+
+## Context Window
+
+| Variant | Context Window | MRCR v2 8-needle @ 128K |
+|---------|---------------|------------------------|
+| E2B | 128K | 19.1% |
+| E4B | 128K | 25.4% |
+| 26B A4B | 256K | 44.1% |
+| 31B | 256K | 66.4% |
+| Gemma 3 27B | 128K | 13.5% |
+
+- Ollama default num_ctx: 2048 (must override!)
+- Retrieval accuracy diminishes beyond ~100K tokens in repetitive/unstructured text
+
+## Vocabulary
+
+- SentencePiece tokenizer, 262,144 tokens (2^18, usually quoted as 262K; up from roughly 256,000 in early Gemma releases)
+
+## Memory Requirements (approximate)
+
+| Model | BF16 | 8-bit | 4-bit |
+|-------|------|-------|-------|
+| E2B | 9.6 GB | 4.6 GB | 3.2 GB |
+| E4B | 15 GB | 7.5 GB | 5 GB |
+| 31B Dense | 58.3 GB | 30.4 GB | 17.4 GB |
+| 26B A4B (MoE) | 48 GB | 25 GB | 15.6 GB |
+
+Note: 26B MoE requires ALL 26B params loaded despite only activating ~4B per token.
+
+## License
+
+Apache 2.0 — major change from Gemma 3's proprietary "Gemma Terms of Use". No custom clauses, no redistribution restrictions.
+
+## Training Data Cutoff
+
+January 2025
diff --git a/CORPUS_benchmarks.md b/CORPUS_benchmarks.md
new file mode 100644
index 0000000..c91b1dc
--- /dev/null
+++ b/CORPUS_benchmarks.md
@@ -0,0 +1,40 @@
+# Gemma 4 Benchmarks
+
+> Source: Google DeepMind model card, HuggingFace blog, LMArena
+> Released: April 2, 2026
+
+## Gemma 4 vs Gemma 3 (biggest single-version jump in Gemma family)
+
+| Benchmark | Gemma 3 27B | Gemma 4 31B | Gemma 4 26B A4B | Delta (31B vs G3) |
+|-----------|------------|------------|----------------|-------------------|
+| MMLU Pro | 67.6% | 85.2% | 82.6% | +17.6 |
+| AIME 2026 (no tools) | 20.8% | 89.2% | 88.3% | +68.4 |
+| GPQA Diamond | 42.4% | 84.3% | 82.3% | +41.9 |
+| BigBench Extra Hard | 19.3% | 74.4% | 64.8% | +55.1 |
+| LiveCodeBench v6 | 29.1% | 80.0% | 77.1% | +50.9 |
+| Codeforces ELO | 110 | 2150 | 1718 | +2040 |
+| MMMU Pro (vision) | 49.7% | 76.9% | 73.8% | +27.2 |
+| MATH-Vision | 46.0% | 85.6% | 82.4% | +39.6 |
+| OmniDocBench (lower=better) | 0.365 | 0.131 | 0.149 | -0.234 |
+| MRCR v2 128K | 13.5% | 66.4% | 44.1% | +52.9 |
+| MMMLU (multilingual) | 70.7% | 88.4% | 86.3% | +17.7 |
+
+## Arena Scores
+
+| Model | LMArena Score | Rank |
+|-------|--------------|------|
+| Gemma 4 31B | 1452 | #3 |
+| Gemma 4 26B A4B | 1441 | #6 |
+
+## Agentic Benchmark (tau2-bench)
+
+| Model | Score |
+|-------|-------|
+| 31B | 86.4% |
+| 26B A4B | 85.5% |
+| E4B | 57.5% |
+| E2B | 29.4% |
+
+## Takeaway
+
+The jump from Gemma 3 to 4 is enormous — AIME went from 20.8% to 89.2%, Codeforces from 110 to 2150 ELO. This is not an incremental update. The 26B MoE nearly matches 31B Dense on most benchmarks while using ~4B active params.
diff --git a/CORPUS_capabilities.md b/CORPUS_capabilities.md
new file mode 100644
index 0000000..52b2cc6
--- /dev/null
+++ b/CORPUS_capabilities.md
@@ -0,0 +1,55 @@
+# Gemma 4 Capabilities Reference
+
+## Modalities
+
+### Text (all variants)
+- Standard instruction-following, chat, completion
+- System prompt support (critical — see synthesis)
+- Context window: 128K on E2B/E4B, 256K on 26B A4B/31B (see CORPUS_architecture.md)
+- 262K vocabulary
+
+### Vision (all variants)
+- **Tested and verified working** (Seth, 2026-04-10)
+- Accurately described colors, shapes, composition in 256x256 test image
+- ~25 tok/s, ~24s end-to-end on pve197 V100
+- Input: base64-encoded image in `images` field of Ollama API
+- Vision encoder: 16x16 patches, 2D RoPE, variable aspect ratio
+- Token budgets scale with resolution (70-1120 soft tokens)
+- Used in AI_Visualizer for SDXL frame quality criticism
+
+### Audio (E2B/E4B only)
+- **Not tested by Seth** — status unknown in practice
+- Conformer architecture (~300M params), mel-spectrogram input
+- **Trained on SPEECH ONLY — not music or environmental sounds**
+- Maximum 30 seconds per clip
+- NOT available on 26B or 31B variants
+- AI_Visualizer explicitly rejected audio for music analysis (DECISIONS S2) — correct call, model wasn't trained for it
+
+### Video (all variants)
+- E2B/E4B: video WITH audio (`load_audio_from_video=True`)
+- 31B/26B: video WITHOUT audio
+- Not explicitly post-trained on video but works
+- Maximum 60 seconds at 1 frame/second
+- Not tested by Seth
+
+### Tool Calling / Function Calling
+- **Verified reliable** in both Simon and AI_Visualizer
+- Ollama native tool format (OpenAI-compatible function calling)
+- Simon: 6 genealogy tools, up to 12 sequential iterations
+- Supports parallel tool calls in single response
+- Weak at deeply nested JSON schemas -> prefer sequential calls
+
+## Benchmark Context (vs Gemma 3)
+
+- 31B replaces Gemma 3 27B (60 layers vs 62, but wider)
+- MoE variant (26B) is new — no Gemma 3 equivalent
+- E-series with PLE is new — on-device focus
+- Proportional RoPE replaces linear frequency scaling -> better long-context
+- Shared KV cache is new -> more efficient inference
+
+## What Gemma 4 Does NOT Do
+
+- No native code execution / sandboxing
+- No web browsing or retrieval
+- Audio only on E-series (not the models most people run)
+- No built-in RAG — tool calling can implement it
diff --git a/CORPUS_ollama_variants.md b/CORPUS_ollama_variants.md
new file mode 100644
index 0000000..1f7be26
--- /dev/null
+++ b/CORPUS_ollama_variants.md
@@ -0,0 +1,42 @@
+# Gemma 4 on Ollama — Available Variants
+
+> Last verified against Seth's homelab: 2026-04-12
+
+## Ollama Model Tags
+
+| Tag | Params | Quant | Size on Disk | VRAM | Notes |
+|-----|--------|-------|-------------|------|-------|
+| `gemma4:e4b-it-q8_0` | ~8B total / 4B effective | Q8_0 | 11.6GB | ~12GB | Vision + audio capable. ~25 tok/s on V100 |
+| `gemma4:26b` | 25.8B | Q4_K_M (default) | 18.0GB | ~18GB | Sweet spot for quality/speed. ~134 tok/s on 3090 Ti |
+| `gemma4:31b-it-q4_K_M` | 31.3B | Q4_K_M | 19.9GB | ~24.5GB | Sharpest but 5x slower (~28 tok/s on 3090 Ti, memory pressure) |
+
+## Capabilities by Variant (from `ollama show`)
+
+All variants support:
+- Text generation (completion, chat)
+- Vision (image input via base64 in `images` field)
+- Tool/function calling (native Ollama tool format)
+
+E-series (E2B, E4B) additionally support:
+- Audio input (conformer encoder)
+
+## GPU Coexistence (pve197 V100 32GB)
+
+- gemma4:26b + SDXL Turbo: ~28.5GB peak VRAM — fits on V100-32GB
+- gemma4:31b: 24.5GB alone — memory pressure with any coexisting model
+- gemma4:e4b-it-q8_0: ~12GB — comfortable headroom
+
+## Ollama API Endpoints
+
+- `/api/generate` (single-turn, used by AI_Visualizer)
+- `/api/chat` (multi-turn with message history, used by Simon)
+- Both accept `tools`, `images`, `stream`, `options`, `keep_alive`
+
+## Important Ollama Defaults to Override
+
+| Parameter | Ollama Default | Recommended | Why |
+|-----------|---------------|-------------|-----|
+| `num_ctx` | 2048 | 4096-32768 | Default is absurdly small, causes truncation |
+| `num_predict` | 128 | 512-4096+ | Default truncates almost all useful output |
+| `think` | true (Ollama 0.20+) | false | See GOTCHAS doc |
+| `keep_alive` | 5m | 30m-4h | Prevents expensive model reload between calls |
diff --git a/CORPUS_tool_calling_format.md b/CORPUS_tool_calling_format.md
new file mode 100644
index 0000000..349f235
--- /dev/null
+++ b/CORPUS_tool_calling_format.md
@@ -0,0 +1,100 @@
+# Gemma 4 Native Tool Calling Format
+
+> Source: Google AI for Developers - Function Calling docs
+> https://ai.google.dev/gemma/docs/capabilities/text/function-calling-gemma4
+
+## Special Tokens (6 total)
+
+| Token | Purpose |
+|-------|---------|
+| `<\|tool>` / `<tool\|>` | Tool definition block |
+| `<\|tool_call>` / `<tool_call\|>` | Model's tool request |
+| `<\|tool_response>` / `<tool_response\|>` | Tool execution result |
+
+String delimiter: `<\|"\|>` (encloses all string values in native format)
+
+## Native Format (raw model tokens)
+
+### Tool definition in system prompt:
+```
+<|tool>declaration:
+get_current_temperature{
+  location:{type:<|"|>string<|"|>,description:<|"|>The city<|"|>},
+  unit:{type:<|"|>string<|"|>,enum:[<|"|>celsius<|"|>,<|"|>fahrenheit<|"|>]}
+}<tool|>
+```
+
+### Tool call from model:
+```
+<|tool_call>call:get_current_temperature{location:<|"|>London<|"|>}<tool_call|>
+```
+
+### Tool response:
+```
+<|tool_response>response:get_current_weather{temperature:15,weather:<|"|>sunny<|"|>}<tool_response|>
+```
+
+## JSON Chat Format (for Ollama / OpenAI-compatible APIs)
+
+This is what you actually use in practice. Ollama translates to/from native tokens.
+
+### Tool definition:
+```json
+{
+  "type": "function",
+  "function": {
+    "name": "get_weather",
+    "description": "Get current weather for a location",
+    "parameters": {
+      "type": "object",
+      "properties": {
+        "city": {"type": "string", "description": "The city name"}
+      },
+      "required": ["city"]
+    }
+  }
+}
+```
+
+### Model returns:
+```json
+{
+  "role": "assistant",
+  "tool_calls": [{
+    "function": {
+      "name": "get_weather",
+      "arguments": {"city": "London"}
+    }
+  }]
+}
+```
+
+### Tool result message:
+```json
+{
+  "role": "tool",
+  "content": "{\"temperature\": 15, \"weather\": \"sunny\"}"
+}
+```
+
+## Thinking Mode + Tool Calls
+
+- When thinking is enabled, preserve thoughts between tool calls
+- For long agent chains, summarize thoughts as plain text to save context
+- Recommended: **disable thinking for tool-heavy workflows** (Seth's finding)
+
+## Framework Flags
+
+| Framework | Required Flag |
+|-----------|--------------|
+| llama.cpp | `--jinja` |
+| vLLM | `--enable-auto-tool-choice` |
+| Ollama | Works via `/api/chat` endpoint with `tools` field |
+| transformers | `apply_chat_template(tools=[...])` |
+
+## Known Issues
+
+- Ollama v0.20.0-0.20.1: tool call parser broken, streaming drops tool calls
+- llama.cpp: format mismatches and continuous loops reported
+- LM Studio: compatibility issues with tool calling
+- **Workaround:** Use non-streaming mode for tool calls (proven in Simon)
diff --git a/GOTCHAS.md b/GOTCHAS.md
new file mode 100644
index 0000000..7ec7fe7
--- /dev/null
+++ b/GOTCHAS.md
@@ -0,0 +1,205 @@
+# Gemma 4 Gotchas & Known Issues
+
+> Derived from Seth's production implementations (Simon, AI_Visualizer)
+> and community reports. These are hard-won lessons.
+
+## CRITICAL: Thinking Mode Eats Context
+
+**Severity: HIGH — causes silent failures**
+
+Gemma 4 in Ollama 0.20+ defaults to `think: true`. When enabled:
+- Thinking tokens go into a hidden `thinking` field, NOT `response`
+- If `num_predict` is limited, thinking consumes the entire budget
+- `response` comes back **empty** — no error, just silence
+- On evaluative tasks, thinking inflates scores (31B scored a known-bad image 9/10 with thinking vs 7/10 without)
+
+**Fix:** Always pass `think: false` in the Ollama payload. Seth has had success ONLY with thinking off.
+
+```json
+{
+  "model": "gemma4:26b",
+  "think": false,
+  "options": { "num_predict": 4096 }
+}
+```
+
+## CRITICAL: format=json Causes Infinite Loops
+
+**Severity: HIGH — hangs indefinitely**
+
+Ollama's server-side `format: "json"` enforcer causes Gemma 4 26B (Q4) to enter an infinite retry loop when the requested schema is deeply nested.
+
+**Fix:** Never use `format: "json"`. Instead:
+1. Request JSON structure in the prompt text
+2. Parse client-side with regex + `json.loads` + json5 fallback
+
+```python
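+# DO THIS: omit `format` and slice the outermost {...} from the reply (needs `import json`; `client` assumed to be an ollama client)
+response = client.generate(model="gemma4:26b", prompt=prompt)
+body = response["response"]
+obj = json.loads(body[body.find("{"):body.rfind("}") + 1])
+
+# NOT THIS: server-side JSON enforcement can loop forever on nested schemas
+response = client.generate(model="gemma4:26b", prompt=prompt, format="json")  # HANGS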
+```
+
+## CRITICAL: Ollama Default Context is 2048
+
+**Severity: HIGH — causes truncation**
+
+Ollama defaults `num_ctx` to 2048 tokens. Gemma 4 supports 128K (E-series) to 256K (26B/31B). If you don't override it, longer prompts are silently truncated.
+
+**Fix:** Always set `num_ctx` explicitly:
+```json
+{ "options": { "num_ctx": 8192 } }
+```
+
+Scale to your needs: 4096 for simple tasks, 16384 for long inputs, 32768 for complex multi-turn.
+
+## HIGH: num_predict Default is 128
+
+**Severity: HIGH — truncates output**
+
+Ollama defaults `num_predict` to 128 tokens. Almost any useful Gemma 4 output exceeds this.
+
+**Fix:** Always set `num_predict` explicitly. Minimum recommended: 512. For JSON output: 2048+.
+
+## MEDIUM: Weak at Long/Nested JSON
+
+**Severity: MEDIUM — causes parse failures**
+
+Gemma 4 reliably produces short JSON (5-10 fields) but struggles with:
+- Deeply nested schemas (3+ levels)
+- Long arrays (20+ items)
+- Mixed nesting + length
+
+**Fix:** Sequential tool calls. Break one large JSON request into multiple smaller calls:
+- Instead of "generate a 50-item storyboard", do "generate items 1-5", "generate items 6-10", etc.
+- Because local inference is fast and free, sequential calls are cheap
+
+**Fallback pattern (AI_Visualizer):**
+```python
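+for attempt in range(MAX_RETRIES):  # body of a function; call_gemma/parse_json are project helpers
+    temp = BASE_TEMP + attempt * TEMP_BUMP  # 0.4 -> 0.5 -> 0.6
+    response = call_gemma(temperature=temp)
+    try:
+        return parse_json(response)
+    except json.JSONDecodeError:  # requires `import json`; retry at the next temperature
+        continue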
+```
+
+## MEDIUM: Identity Confusion
+
+**Severity: MEDIUM — cosmetic but confusing**
+
+Gemma 4 is ultra-compliant and highly capable but does not know who it is. It may:
+- Claim to be a different model
+- Hallucinate capabilities it doesn't have
+- Respond as a generic "AI assistant" without personality
+
+**Fix:** Explicit identity in system prompt:
+```
+You are [Name], a [role]. You are powered by Gemma 4.
+You ONLY do [X]. You NEVER do [Y].
+```
+
+Gemma 4 does NOT need hand-holding on task execution — it's very capable.
+It needs explicit instructions about identity and boundaries.
+
+## MEDIUM: Flash Attention Hang on 31B Dense (>3-4K tokens)
+
+**Severity: MEDIUM — hardware-specific, affects RTX 3090**
+
+Community-reported: Flash Attention causes Gemma 4 31B Dense to hang indefinitely during prompt evaluation when the prompt exceeds ~3-4K tokens. The 26B MoE variant handles the same prompts fine — bug is specific to the Dense model.
+
+**Source:** [ollama/ollama#15350](https://github.com/ollama/ollama/issues/15350)
+
+**Fix:** Use 26B for long prompts, or disable Flash Attention if running 31B on affected hardware.
+
+## MEDIUM: Tool Calling Broken in Ollama v0.20.0 Streaming
+
+**Severity: MEDIUM — version-specific**
+
+As of early April 2026, Gemma 4 tool calling has issues in Ollama v0.20.0: the tool call parser fails and streaming drops tool calls entirely. Community reports include format mismatches and continuous loops in llama.cpp / LM Studio.
+
+**Source:** [community reports](https://dev.to/dentity007/-gemma-4-after-24-hours-what-the-community-found-vs-what-google-promised-3a2f)
+
+**Fix:** Use non-streaming for tool calls (Simon does this). Test tool calling thoroughly when upgrading Ollama versions. Seth's implementations work reliably with non-streaming tool calls.
+
+## MEDIUM: VRAM-Hungry for Context
+
+**Severity: MEDIUM — affects hardware planning**
+
+Gemma 4 KV cache is large relative to competitors. Community reports: 31B at 262K context requires ~22GB just for KV cache on top of model weights. One user could only fit Gemma 3 27B Q4 with 20K context on a 5090, while Qwen 3.5 27B Q4 fit with 190K context on the same card.
+
+**Implication:** Don't set num_ctx higher than you need. 32K is plenty for most tasks and keeps VRAM reasonable.
+
+## MEDIUM: Safety Overfiltering
+
+**Severity: MEDIUM — blocks benign prompts**
+
+Strict safety alignment occasionally blocks technical, academic, or creative prompts that superficially resemble restricted categories. One user reported jailbreaks with basic system prompts.
+
+**Fix:** Rephrase blocked prompts to avoid trigger patterns. For system prompts, avoid language that sounds like you're asking the model to bypass restrictions — just state the task directly.
+
+## MEDIUM: KV Cache Config Bug (31B/26B ship with num_kv_shared_layers=0)
+
+**Severity: MEDIUM — crashes on first attention forward pass**
+
+The 31B and 26B ship with `num_kv_shared_layers = 0`. Because `-0 == 0` in Python, `layer_types[:-0]` is the same as `layer_types[:0]`, an empty list, so no layer slots are allocated and the first forward pass crashes.
+
+**Fix:** Patch the config. Check model card discussions for the exact fix.
+
+## LOW: vLLM Triton Fallback (~9 tok/s on RTX 4090)
+
+**Severity: LOW — vLLM-specific**
+
+Heterogeneous attention head dimensions in Gemma 4 force vLLM to fall back to a slow Triton kernel. RTX 4090 gets ~9 tok/s instead of expected ~100+.
+
+**Source:** [vllm-project/vllm#38887](https://github.com/vllm-project/vllm/issues/38887)
+
+**Fix:** Use Ollama instead of vLLM for now, or wait for the fix.
+
+## LOW: `<unused>` Token Infinite Loop (Vulkan backends)
+
+**Severity: LOW — Vulkan-specific**
+
+Gemma 4 can generate `<unused>` or `<unused24>` tokens in an infinite loop on Vulkan backends in llama.cpp.
+
+**Source:** [ggml-org/llama.cpp#21516](https://github.com/ggml-org/llama.cpp/issues/21516)
+
+## LOW: Fine-Tuning Ecosystem Issues
+
+**Severity: LOW — only relevant if fine-tuning**
+
+Day-one issues for fine-tuners:
+- HuggingFace Transformers didn't recognize gemma4 architecture (required install from source)
+- PEFT couldn't handle Gemma4ClippableLinear (new vision encoder layer type)
+- New `mm_token_type_ids` field required during training even for text-only data
+- E2B/E4B show training loss of 13-15, which is normal for multimodal models (not a bug)
+
+## LOW: Vision Validator Overrejects
+
+**Severity: LOW — specific to evaluative vision tasks**
+
+In AI_Visualizer, Gemma 4 vision was used to critique SDXL frames. It flagged images for motif-matching failures even though humans rated them equal to or better than images it passed. The validator was queued to be disabled.
+
+**Pattern:** Gemma 4 vision is good at description but unreliable for subjective quality scoring. Use it for "what's in this image?" not "is this image good?"
+
+## LOW: Keep-Alive Too Short
+
+**Severity: LOW — performance only**
+
+Default `keep_alive` is 5 minutes. If your pipeline has gaps (e.g., waiting for SDXL generation), the model gets unloaded and reloaded (~10-30s penalty).
+
+**Fix:** Set `keep_alive` to match your pipeline duration:
+```json
+{ "keep_alive": "4h" }
+```
+
+Or pin/unpin explicitly:
+```python
+client.generate(model="gemma4:26b", prompt="", keep_alive=-1, options={"num_predict": 0})  # pin
+# ... do work ...
+client.generate(model="gemma4:26b", prompt="", keep_alive=0, options={"num_predict": 0})    # unpin
+```
diff --git a/IMPLEMENTATIONS.md b/IMPLEMENTATIONS.md
new file mode 100644
index 0000000..0999935
--- /dev/null
+++ b/IMPLEMENTATIONS.md
@@ -0,0 +1,95 @@
+# Gemma 4 Implementation Reference
+
+> Patterns extracted from Seth's two production Gemma 4 projects.
+
+## Project: Simon (FreibergFamily/simon/)
+
+**Purpose:** AI genealogy historian — multi-turn chat with tool-calling agent
+
+| Setting | Value |
+|---------|-------|
+| Model | `gemma4:26b` |
+| API | `/api/chat` (multi-turn) |
+| num_ctx | 32768 |
+| num_predict | 4096 |
+| temperature | 1.0 |
+| top_p | 0.95 |
+| top_k | 64 |
+| keep_alive | 4h |
+| think | (not explicitly set — should be false) |
+| format_json | not used |
+| Vision | not used |
+| Tool calling | 6 tools, max 12 iterations |
+
+### Key Patterns
+
+1. **Aggressive system prompt:** 40+ lines defining identity, boundaries, tool usage rules, multi-step chaining requirements. Gemma 4 follows all of it.
+
+2. **Tool chaining instructions:** System prompt explicitly tells Gemma to chain tools (e.g., "after lookup_person, ALSO call get_historical_context"). Gemma 4 follows these multi-step chains reliably.
+
+3. **Parallel tool calls:** Encouraged in system prompt for multiple lookups. Gemma 4 does this.
+
+4. **History pruning:** Drops old tool results and tool-call messages, keeps assistant summaries. Prevents context bloat in multi-turn.
+
+5. **Fallback to streaming:** After 12 tool iterations, switches to stream mode (no tools) to force a text response.
+
+6. **Two modes (historian vs interview):** Completely different system prompts swapped at runtime. Gemma 4 stays in character for both.
+
+---
+
+## Project: AI Visualizer (AI_Visualizer/)
+
+**Purpose:** Music-reactive video generator — Gemma 4 as reasoning engine across 4 pipeline stages, plus a vision validator
+
+| Stage | num_predict | num_ctx | temperature | Purpose |
+|-------|-------------|---------|-------------|---------|
+| Mood Analysis | 4096 | 16384 | 0.4-0.6 | Analyze CLAP descriptors -> narratives + boundary adjustments |
+| Rate Pass | 512 | 4096 | 0.3-0.5 | Choose visual pacing rate per music segment |
+| Storyboard | 2048 | 4096 | 0.6-0.8 | Generate SDXL prompts per music segment |
+| Batch Expansion | 2048 | default | 0.7 | Interpolate between scene prompts over time |
+| Vision Validator | 256 | default | 0.2 | Critique generated frames (queued for disable) |
+
+### Key Patterns
+
+1. **No tool calling used.** All Gemma interaction is single-turn generate with JSON requested in prompt.
+
+2. **Client-side JSON extraction:**
+   ```python
+   body = response["response"]
+   start = body.find("{")
+   end = body.rfind("}")
+   obj = json.loads(body[start:end + 1])
+   ```
+
+3. **Temperature ramping on retry:** Base temp + bump per attempt. Conservative first, creative on retry.
+
+4. **think: false everywhere.** Explicitly set on every call. Critical for budget control.
+
+5. **format_json: false everywhere.** Causes infinite loops on nested schemas.
+
+6. **Model pinning:** `keep_alive=-1` to prevent GPU eviction during long SDXL pauses.
+
+7. **Explicit num_ctx:** Added after discovering Ollama defaults to 2048, which truncated mood analyzer prompts on long tracks.
+
+8. **Banned vocabulary in prompts:** List of cliche words (cinematic, dramatic, ethereal...) passed to Gemma to avoid generic output.
+
+9. **Vision for image critique:** Base64-encoded PNG -> structured SCORE/ISSUE/REASON output parsed by regex. Works but overrejects on subjective quality.
+
+---
+
+## Common Settings Across Both Projects
+
+```json
+{
+  "model": "gemma4:26b",
+  "think": false,
+  "options": {
+    "num_ctx": 4096,
+    "num_predict": 2048,
+    "temperature": 0.5
+  },
+  "keep_alive": "30m"
+}
+```
+
+Adjust num_ctx/num_predict upward for your payload size. These are safe minimums.
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..d4bf567
--- /dev/null
+++ b/README.md
@@ -0,0 +1,25 @@
+# gemma4-research
+
+Research corpus and implementation guidance for Google Gemma 4, based on production use in Seth's homelab.
+
+## Files
+
+| File | What | When to Read |
+|------|------|-------------|
+| `SYNTHESIS.md` | **Start here.** Opinionated guide — how to build with Gemma 4 | Before any new Gemma 4 implementation |
+| `GOTCHAS.md` | Known issues and workarounds, severity-ranked | When debugging Gemma 4 issues or starting a new project |
+| `IMPLEMENTATIONS.md` | Patterns from Simon and AI_Visualizer | When designing a new Gemma 4 integration |
+| `CORPUS_architecture.md` | Model architecture details (layers, attention, PLE, MoE) | When you need to understand WHY Gemma 4 behaves a certain way |
+| `CORPUS_ollama_variants.md` | Available models, sizes, VRAM, Ollama settings | When choosing a model variant or configuring Ollama |
+| `CORPUS_capabilities.md` | Modalities (vision, audio, video, tools), what it can/can't do | When scoping what Gemma 4 can handle |
+| `CORPUS_benchmarks.md` | Full benchmark table vs Gemma 3, arena scores, agentic scores | When comparing Gemma 4 to alternatives |
+| `CORPUS_tool_calling_format.md` | Native token format + JSON API format for function calling | When implementing tool calling |
+
+## Source Projects
+
+- **Simon** (`~/bin/FreibergFamily/simon/`) — Multi-turn chat agent with 6 tools, genealogy historian
+- **AI Visualizer** (`~/bin/AI_Visualizer/`) — Music video generator, 4-stage Gemma pipeline + vision
+
+## Key Insight
+
+Gemma 4 is ultra-compliant and highly capable but doesn't know who it is. It needs explicit system prompts, not hand-holding. Due to fast local inference, sequential tool calls beat long JSON requests.
diff --git a/SYNTHESIS.md b/SYNTHESIS.md
new file mode 100644
index 0000000..e9fc368
--- /dev/null
+++ b/SYNTHESIS.md
@@ -0,0 +1,194 @@
+# Gemma 4 Synthesis — How to Build With It
+
+> Opinionated guide based on two production implementations and ongoing use.
+> Seth Freiberg, 2026-04-12
+
+## The One-Paragraph Summary
+
+Gemma 4 is an ultra-compliant, highly capable model that doesn't know who it is. It doesn't need hand-holding on tasks, but it does need explicit system-prompt instructions about identity, boundaries, and output format. On Ollama, raise `num_ctx` and `num_predict` (the defaults are far too low), set `think` to false (thinking tokens silently consume the `num_predict` budget), and avoid `format: json` entirely (it can loop forever on nested schemas). Because local inference is fast and free, sequential tool calls are the preferred way to handle tasks that would otherwise need long structured output.
+
+## Mental Model
+
+Think of Gemma 4 as a very competent employee on their first day. They can do the work — you don't need to explain how. But you DO need to explain:
+- Who they are and what their job is
+- What they should and should NOT do
+- Exactly what format you want the deliverable in
+- The boundaries of their role
+
+Get those right and Gemma 4 just works. Get them wrong and you get a generic chatbot.
+
+## Mandatory Ollama Settings
+
+Every Gemma 4 call MUST include:
+
+```json
+{
+  "think": false,
+  "options": {
+    "num_ctx": 4096,
+    "num_predict": 2048
+  }
+}
+```
+
+**Why each one:**
+- `think: false` — Ollama 0.20+ defaults to think:true. Thinking tokens consume num_predict budget invisibly, returning empty responses. Seth has ONLY had success with thinking off.
+- `num_ctx: 4096+` — Ollama defaults to 2048. Your system prompt alone might exceed that.
+- `num_predict: 2048+` — Ollama defaults to 128. Any structured output gets truncated.
+
+Scale these to your task. The values above are safe minimums, not recommendations.
+
+## System Prompt Template
+
+```
+You are [NAME], a [ROLE DESCRIPTION].
+
+## What You Do
+- [Explicit list of responsibilities]
+- [Tools you have access to and when to use each one]
+
+## What You Do NOT Do
+- [Explicit list of things to refuse or avoid]
+- [Common mistakes to prevent]
+
+## Output Format
+[Exact schema, field names, example if complex]
+Respond with ONLY [format]. No prose outside the [format].
+
+## Rules
+- [Behavioral constraints]
+- [Multi-step chaining instructions if using tools]
+
+Today's date: [DATE]
+```
+
+**Key principles:**
+1. Identity first — who is this agent?
+2. Positive instructions before negative (what TO do before what NOT to do)
+3. Output format is explicit and complete — Gemma 4 follows schemas faithfully
+4. "No prose outside the JSON" prevents wrapper text that breaks parsing
+5. Date injection helps with temporal reasoning
+
+## Tool Calling Strategy
+
+Gemma 4 is **reliable for tool calling** but **weak at structuring long JSONs**.
+
+### When to use tool calling (Ollama native)
+- Multi-turn agents with 2-10 tools
+- Sequential reasoning chains (lookup A -> use A to decide B -> lookup B)
+- Any task where the model needs to gather information before responding
+
+### When to use prompt-based JSON instead
+- Single-turn generation with known output structure
+- When you need specific JSON schema control
+- When the output is a payload (prompts, configs) not a conversation
+
+### The Sequential Pattern
+
+Instead of asking Gemma 4 to produce one massive JSON:
+```
+BAD:  "Generate a 50-scene storyboard as JSON"  -> truncated/malformed
+GOOD: "Generate scenes 1-5 as JSON" x10         -> reliable every time
+```
+
+Gemma 4's inference speed makes sequential calls cheap. At ~134 tok/s on a 3090 Ti, a 10-call chain of short outputs (a few hundred tokens each) finishes in tens of seconds, not minutes. This is a key advantage of local models: latency is predictable and network-free.
+
+## JSON Extraction Pattern
+
+Since `format: "json"` is broken, always extract client-side:
+
+```python
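+# Python (`response` is the /api/generate result)
+import json
+raw = response["response"]
+start = raw.find("{")
+end = raw.rfind("}")
+# obj is None when no {...} span exists; json.loads can still raise on malformed JSON
+obj = json.loads(raw[start:end + 1]) if 0 <= start < end else None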
+```
+
+```javascript
+// JavaScript (`response` is the /api/chat result)
+const raw = response.message.content;
+const match = raw.match(/\{[\s\S]*\}/); // greedy: first "{" through last "}"
+const obj = match ? JSON.parse(match[0]) : null;
+```
+
+For top-level arrays, search for `[` and `]` instead. Add a json5 fallback to tolerate trailing commas.
+
+## Temperature Guidelines
+
+| Task Type | Temperature | Why |
+|-----------|-------------|-----|
+| Evaluation / scoring | 0.2 | Consistent, reproducible judgments |
+| Structured extraction | 0.3-0.4 | Faithful to schema |
+| Creative generation | 0.6-0.8 | Variety without chaos |
+| Conversation / chat | 0.7-1.0 | Natural feel |
+
+Retry strategy: bump temp +0.1 per retry to escape format failures.
+
+## Vision Usage
+
+**Works for:** Describing image contents (objects, colors, composition, text)
+**Unreliable for:** Subjective quality scoring, aesthetic judgment
+
+```python
+import base64
+with open("image.png", "rb") as f:
+    b64 = base64.b64encode(f.read()).decode("ascii")
+
+response = client.generate(
+    model="gemma4:26b",
+    prompt="Describe this image in detail.",
+    images=[b64],
+    think=False,
+    options={"temperature": 0.2, "num_predict": 512}
+)
+```
+
+Vision is on ALL Gemma 4 variants (E2B, E4B, 26B, 31B). Audio is E-series only.
+
+## Context Management
+
+### Multi-turn (chat agents)
+- Prune old tool results and tool-call messages
+- Keep assistant's natural-language summaries
+- Set num_ctx to 32768 for rich conversations
+- Set a tool iteration limit (12 is proven) with streaming fallback
+
+### Single-turn (pipeline stages)
+- Calculate your prompt size and set num_ctx accordingly
+- For long inputs (full track analysis), use recursive splitting at natural boundaries
+- Pin model with `keep_alive=-1` if pipeline has idle gaps
+
+## Model Selection
+
+| Use Case | Recommended | Why |
+|----------|------------|-----|
+| Production pipeline (needs GPU coexistence) | `gemma4:26b` | Best quality/speed/VRAM balance |
+| On-device / edge | `gemma4:e4b-it-q8_0` | 12GB VRAM, vision+audio |
+| Maximum quality (single-model GPU) | `gemma4:31b-it-q4_K_M` | Sharpest but slow under memory pressure |
+| Rapid prototyping / testing | `gemma4:26b` | Fast enough for interactive dev |
+
+## Anti-Patterns
+
+1. **Don't use `format: "json"`** — infinite loops on nested schemas
+2. **Don't leave `think` at default** — eats your output budget silently
+3. **Don't leave `num_predict` at default** — 128 tokens is nothing
+4. **Don't leave `num_ctx` at default** — 2048 truncates most prompts
+5. **Don't ask for huge JSON in one call** — break into sequential calls
+6. **Don't use thinking mode for evaluation** — inflates scores, wastes context
+7. **Don't skip system prompt identity** — Gemma 4 becomes a generic chatbot
+8. **Don't use audio on 26B/31B** — only E-series has audio encoder
+
+## Quick-Start Checklist
+
+- [ ] Set `think: false`
+- [ ] Set `num_predict` >= 512 (2048+ for JSON output)
+- [ ] Set `num_ctx` >= 4096 (scale to your prompt size)
+- [ ] Write explicit system prompt with identity + boundaries + output format
+- [ ] Extract JSON client-side (no `format: "json"`)
+- [ ] Set `keep_alive` >= 30m (or pin with -1)
+- [ ] For long structured output, use sequential calls
+- [ ] For vision, pass base64 in `images` array
+- [ ] Test with your actual prompt length — Ollama won't warn about truncation