docs: initial Gemma 4 research corpus and synthesis

Architecture specs, benchmarks, gotchas, Ollama settings, tool calling format, and implementation patterns from Simon and AI_Visualizer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 18:14:19 -04:00
commit 5011059f5d
9 changed files with 861 additions and 0 deletions
@@ -0,0 +1,205 @@
+# Gemma 4 Gotchas & Known Issues
+
+> Derived from Seth's production implementations (Simon, AI_Visualizer)
+> and community reports. These are hard-won lessons.
+
+## CRITICAL: Thinking Mode Eats Context
+
+**Severity: HIGH — causes silent failures**
+
+Gemma 4 in Ollama 0.20+ defaults to `think: true`. When enabled:
+- Thinking tokens go into a hidden `thinking` field, NOT `response`
+- If `num_predict` is limited, thinking consumes the entire budget
+- `response` comes back **empty** — no error, just silence
+- On evaluative tasks, thinking inflates scores (31B scored a known-bad image 9/10 with thinking vs 7/10 without)
+
+**Fix:** Always pass `think: false` in the Ollama payload. Seth has had success ONLY with thinking off.
+
+```json
+{
+  "model": "gemma4:26b",
+  "think": false,
+  "options": { "num_predict": 4096 }
+}
+```
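+
+The silent-failure case can also be caught in code. A minimal sketch, assuming the decoded reply is a dict carrying the `response` and `thinking` fields described above; `check_response` is a hypothetical helper, not part of Ollama:
+
+```python
+def check_response(data: dict) -> str:
+    """Return the visible text, or raise if thinking used up the whole budget."""
+    text = (data.get("response") or "").strip()
+    thinking = (data.get("thinking") or "").strip()
+    if not text and thinking:
+        # The model produced tokens, but all of them went to the hidden field.
+        raise RuntimeError(
+            "empty response but non-empty thinking: "
+            "pass think=false or raise num_predict"
+        )
+    return text
+```
+
+Failing loudly here turns the "no error, just silence" behavior into an exception that a pipeline can log or retry.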
+
+## CRITICAL: format=json Causes Infinite Loops
+
+**Severity: HIGH — hangs indefinitely**
+
+Ollama's server-side `format: "json"` enforcer causes Gemma 4 26B (Q4) to enter an infinite retry loop when the requested schema is deeply nested.
+
+**Fix:** Never use `format: "json"`. Instead:
+1. Request JSON structure in the prompt text
+2. Parse client-side with regex + `json.loads` + json5 fallback
+
+```python
+import json
+
+# `client` is assumed to be an `ollama.Client` instance.
+# DO THIS: omit `format` and cut the JSON object out of the text yourself
+response = client.generate(model="gemma4:26b", prompt=prompt)
+body = response["response"]
+obj = json.loads(body[body.find("{"):body.rfind("}") + 1])
+
+# NOT THIS
+response = client.generate(model="gemma4:26b", prompt=prompt, format="json")  # HANGS
+```
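+
+The "regex + `json.loads` + json5 fallback" step could look roughly like this. It is a sketch, not the parser used in Simon or AI_Visualizer: `extract_json` is a hypothetical name, and `json5` is an optional third-party package that is only imported when the strict parse fails.
+
+```python
+import json
+import re
+
+# Greedy match from the first "{" to the last "}" in the model output.
+OBJECT_RE = re.compile(r"\{.*\}", re.DOTALL)
+
+def extract_json(text: str) -> dict:
+    """Pull a JSON object out of free-form model output."""
+    match = OBJECT_RE.search(text)
+    if match is None:
+        raise ValueError("no JSON object found in model output")
+    candidate = match.group(0)
+    try:
+        return json.loads(candidate)
+    except json.JSONDecodeError:
+        try:
+            import json5  # tolerant parser: trailing commas, single quotes
+        except ImportError:
+            raise ValueError("strict parse failed and json5 is not installed") from None
+        return json5.loads(candidate)
+```
+
+Because the match runs from the first `{` to the last `}`, any chatter or Markdown fence the model wraps around the object is ignored.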
+
+## CRITICAL: Ollama Default Context is 2048
+
+**Severity: HIGH — causes truncation**
+
+Ollama defaults `num_ctx` to 2048 tokens. Gemma 4 supports 128K. If you don't override the default, any prompt longer than the window is silently truncated.
+
+**Fix:** Always set `num_ctx` explicitly:
+```json
+{ "options": { "num_ctx": 8192 } }
+```
+
+Scale to your needs: 4096 for simple tasks, 16384 for long inputs, 32768 for complex multi-turn.
+
+## HIGH: num_predict Default is 128
+
+**Severity: HIGH — truncates output**
+
+Ollama defaults `num_predict` to 128 tokens. Almost any useful Gemma 4 output exceeds this.
+
+**Fix:** Always set `num_predict` explicitly. Minimum recommended: 512. For JSON output: 2048+.
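+
+Both defaults can be overridden in one place. A sketch, assuming a local Ollama server on its default port; `build_payload`, `generate`, and the task names are hypothetical, while the context sizes are the ones suggested in the `num_ctx` section:
+
+```python
+import json
+import urllib.request
+
+OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed local endpoint
+
+# Context sizes from the guidance above; the task names are made up.
+NUM_CTX = {"simple": 4096, "long": 16384, "multi_turn": 32768}
+
+def build_payload(prompt: str, task: str = "simple", num_predict: int = 2048) -> dict:
+    """Build a request body that never relies on Ollama's defaults."""
+    return {
+        "model": "gemma4:26b",
+        "prompt": prompt,
+        "stream": False,
+        "think": False,
+        "options": {"num_ctx": NUM_CTX[task], "num_predict": num_predict},
+    }
+
+def generate(prompt: str, **kwargs) -> str:
+    """POST the payload to the server and return the response text."""
+    request = urllib.request.Request(
+        OLLAMA_URL,
+        data=json.dumps(build_payload(prompt, **kwargs)).encode("utf-8"),
+        headers={"Content-Type": "application/json"},
+    )
+    with urllib.request.urlopen(request) as reply:
+        return json.loads(reply.read())["response"]
+```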
+
+## MEDIUM: Weak at Long/Nested JSON
+
+**Severity: MEDIUM — causes parse failures**
+
+Gemma 4 reliably produces short JSON (5-10 fields) but struggles with:
+- Deeply nested schemas (3+ levels)
+- Long arrays (20+ items)
+- Mixed nesting + length
+
+**Fix:** Sequential tool calls. Break one large JSON request into multiple smaller calls:
+- Instead of "generate a 50-item storyboard", do "generate items 1-5", "generate items 6-10", etc.
+- Because Gemma 4 is fast and runs locally at no per-call cost, sequential calls are cheap
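+
+Splitting one large request into batches can be sketched as follows. The helper names and the prompt wording are illustrative assumptions, not taken from AI_Visualizer:
+
+```python
+def batch_ranges(total: int, size: int):
+    """Yield inclusive 1-based (start, end) ranges covering 1..total."""
+    for start in range(1, total + 1, size):
+        yield start, min(start + size - 1, total)
+
+def batch_prompts(total: int, size: int, noun: str = "storyboard items") -> list:
+    """One short prompt per batch instead of one prompt for everything."""
+    return [
+        f"Generate {noun} {start}-{end} of {total} as a JSON array."
+        for start, end in batch_ranges(total, size)
+    ]
+```
+
+With `total=50` and `size=5` this produces ten prompts, each asking for a short array that sits inside the range Gemma 4 handles reliably.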
+
+**Fallback pattern (AI_Visualizer):**
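+```python
+import json
+
+MAX_RETRIES, BASE_TEMP, TEMP_BUMP = 3, 0.4, 0.1
+
+def generate_json(call_gemma, parse_json=json.loads):
+    """Retry with a slightly higher temperature after each parse failure."""
+    for attempt in range(MAX_RETRIES):
+        temp = BASE_TEMP + attempt * TEMP_BUMP  # 0.4 -> 0.5 -> 0.6
+        response = call_gemma(temperature=temp)
+        try:
+            return parse_json(response)
+        except json.JSONDecodeError:
+            continue  # malformed JSON: try again at the next temperature
+    raise ValueError(f"no valid JSON after {MAX_RETRIES} attempts")
+```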
+
+## MEDIUM: Identity Confusion
+
+**Severity: MEDIUM — cosmetic but confusing**
+
+Gemma 4 is ultra-compliant and highly capable but does not know who it is. It may:
+- Claim to be a different model
+- Hallucinate capabilities it doesn't have
+- Respond as a generic "AI assistant" without personality
+
+**Fix:** Explicit identity in system prompt:
+```
+You are [Name], a [role]. You are powered by Gemma 4.
+You ONLY do [X]. You NEVER do [Y].
+```
+
+Gemma 4 does NOT need hand-holding on task execution — it's very capable.
+It needs explicit instructions about identity and boundaries.
+
+## MEDIUM: Flash Attention Hang on 31B Dense (>3-4K tokens)
+
+**Severity: MEDIUM — hardware-specific, affects RTX 3090**
+
+Community-reported: Flash Attention causes Gemma 4 31B Dense to hang indefinitely during prompt evaluation when the prompt exceeds ~3-4K tokens. The 26B MoE variant handles the same prompts fine, so the bug appears to be specific to the Dense model.
+
+**Source:** [ollama/ollama#15350](https://github.com/ollama/ollama/issues/15350)
+
+**Fix:** Use 26B for long prompts, or disable Flash Attention if running 31B on affected hardware (in Ollama this is controlled by the `OLLAMA_FLASH_ATTENTION` environment variable on the server).
+
+## MEDIUM: Tool Calling Broken in Ollama v0.20.0 Streaming
+
+**Severity: MEDIUM — version-specific**
+
+As of early April 2026, Gemma 4 tool calling has issues in Ollama v0.20.0: the tool call parser fails and streaming drops tool calls entirely. Community reports include format mismatches and continuous loops in llama.cpp / LM Studio.
+
+**Source:** [community reports](https://dev.to/dentity007/-gemma-4-after-24-hours-what-the-community-found-vs-what-google-promised-3a2f)
+
+**Fix:** Use non-streaming for tool calls (Simon does this). Test tool calling thoroughly when upgrading Ollama versions. Seth's implementations work reliably with non-streaming tool calls.
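+
+A non-streaming tool-call round trip might be structured like this. It is a sketch that assumes Ollama's `/api/chat` request and reply shapes (`tools`, `message.tool_calls`); the helper names and the weather tool are hypothetical and not taken from Simon:
+
+```python
+def build_tool_request(messages: list, tools: list, model: str = "gemma4:26b") -> dict:
+    """Chat payload with streaming disabled so tool calls arrive in one message."""
+    return {
+        "model": model,
+        "messages": messages,
+        "tools": tools,
+        "stream": False,
+        "think": False,
+    }
+
+def extract_tool_calls(reply: dict) -> list:
+    """Return (name, arguments) pairs from a decoded non-streaming chat reply."""
+    calls = reply.get("message", {}).get("tool_calls") or []
+    return [(c["function"]["name"], c["function"].get("arguments", {})) for c in calls]
+
+# Hypothetical tool definition, for illustration only.
+WEATHER_TOOL = {
+    "type": "function",
+    "function": {
+        "name": "get_weather",
+        "description": "Look up the current weather for a city",
+        "parameters": {
+            "type": "object",
+            "properties": {"city": {"type": "string"}},
+            "required": ["city"],
+        },
+    },
+}
+```
+
+Keeping the extraction in one function gives a single place to re-test after an Ollama upgrade.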
+
+## MEDIUM: VRAM-Hungry for Context
+
+**Severity: MEDIUM — affects hardware planning**
+
+Gemma 4 KV cache is large relative to competitors. Community reports: 31B at 262K context requires ~22GB just for KV cache on top of model weights. One user could only fit Gemma 3 27B Q4 with 20K context on a 5090, while Qwen 3.5 27B Q4 fit with 190K context on the same card.
+
+**Implication:** Don't set `num_ctx` higher than you need. 32K is plenty for most tasks and keeps VRAM reasonable.
+
+## MEDIUM: Safety Overfiltering
+
+**Severity: MEDIUM — blocks benign prompts**
+
+Strict safety alignment occasionally blocks technical, academic, or creative prompts that superficially resemble restricted categories. Conversely, one user reported that basic system prompts were enough to jailbreak the model.
+
+**Fix:** Rephrase blocked prompts to avoid trigger patterns. For system prompts, avoid language that sounds like you're asking the model to bypass restrictions — just state the task directly.
+
+## MEDIUM: KV Cache Config Bug (31B/26B ship with num_kv_shared_layers=0)
+
+**Severity: MEDIUM — crashes on first attention forward pass**
+
+The 31B and 26B ship with `num_kv_shared_layers = 0`. The slice `layer_types[:-0]` is the same as `layer_types[:0]` in Python, so it yields an empty list instead of every layer. With zero layer slots, the model crashes on the first forward pass.
+
+**Fix:** Patch the config. Check model card discussions for the exact fix.
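+
+The slicing pitfall itself is easy to reproduce in plain Python. The snippet below only illustrates the language behavior; the layer list and function names are made up and this is not the actual model code:
+
+```python
+layer_types = ["sliding", "sliding", "full", "sliding"]  # made-up layer list
+
+def non_shared_layers(layer_types: list, num_kv_shared_layers: int) -> list:
+    # Pitfall: -0 == 0, so layer_types[:-0] is layer_types[:0], i.e. empty.
+    return layer_types[:-num_kv_shared_layers]
+
+def non_shared_layers_fixed(layer_types: list, num_kv_shared_layers: int) -> list:
+    # Slice by an explicit end index so that 0 shared layers keeps every layer.
+    return layer_types[: len(layer_types) - num_kv_shared_layers]
+```
+
+With one shared layer both versions drop the last entry. With zero shared layers the first returns an empty list and the second returns the full list.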
+
+## LOW: vLLM Triton Fallback (~9 tok/s on RTX 4090)
+
+**Severity: LOW — vLLM-specific**
+
+Heterogeneous attention head dimensions in Gemma 4 force vLLM to fall back to a slow Triton kernel. RTX 4090 gets ~9 tok/s instead of expected ~100+.
+
+**Source:** [vllm-project/vllm#38887](https://github.com/vllm-project/vllm/issues/38887)
+
+**Fix:** Use Ollama instead of vLLM for now, or wait for the fix.
+
+## LOW: `<unused>` Token Infinite Loop (Vulkan backends)
+
+**Severity: LOW — Vulkan-specific**
+
+Gemma 4 can generate `<unused>` or `<unused24>` tokens in an infinite loop on Vulkan backends in llama.cpp.
+
+**Source:** [ggml-org/llama.cpp#21516](https://github.com/ggml-org/llama.cpp/issues/21516)
+
+## LOW: Fine-Tuning Ecosystem Issues
+
+**Severity: LOW — only relevant if fine-tuning**
+
+Day-one issues for fine-tuners:
+- HuggingFace Transformers didn't recognize the `gemma4` architecture (required install from source)
+- PEFT couldn't handle `Gemma4ClippableLinear` (new vision encoder layer type)
+- New `mm_token_type_ids` field required during training even for text-only data
+- E2B/E4B show training loss of 13-15, which is normal for multimodal models (not a bug)
+
+## LOW: Vision Validator Overrejects
+
+**Severity: LOW — specific to evaluative vision tasks**
+
+In AI_Visualizer, Gemma 4 vision was used to critique SDXL frames. It flagged images for motif-matching failures even though humans rated them as equal to or better than images that passed. The validator was queued to be disabled.
+
+**Pattern:** Gemma 4 vision is good at description but unreliable for subjective quality scoring. Use it for "what's in this image?" not "is this image good?"
+
+## LOW: Keep-Alive Too Short
+
+**Severity: LOW — performance only**
+
+Default `keep_alive` is 5 minutes. If your pipeline has gaps (e.g., waiting for SDXL generation), the model gets unloaded and reloaded (~10-30s penalty).
+
+**Fix:** Set `keep_alive` to match your pipeline duration:
+```json
+{ "keep_alive": "4h" }
+```
+
+Or pin/unpin explicitly:
+```python
+client.generate(model="gemma4:26b", prompt="", keep_alive=-1, options={"num_predict": 0})  # pin
+# ... do work ...
+client.generate(model="gemma4:26b", prompt="", keep_alive=0, options={"num_predict": 0})    # unpin
+```