gemma4-research/GOTCHAS.md
Mortdecai c61394923c fix: walk back round-1/2 conclusions — the cause was think=false all along
Seth asked "was this with think=false?" Yes — and that was the only
question that mattered. Everything I concluded in round 1 and round 2
was wrong.

Actual cause, isolated in round 3:
- At identical message state, gemma4:26b with think=false returns
  eval=4 (silent stop); with think unset or think=true, returns
  eval=165 and emits the correct tool call.
- Original round-1 write_file harness + think unset: 26B passes in
  8 iters, 20s. No mitigations needed.
- 31B dense and qwen3-coder:30b tolerate think=false; 26B MoE does not.

Red herrings (kept on-record in the bakeoff doc, not silently erased):
- Round 1: "write_file tool-call argument size" — wrong
- Round 2a: refuted the arg-size theory but for the wrong reason
  (still failed because think=false was still set)
- Round 2b: "cumulative tool-response context size" — truncating
  did make 26B pass, but by coincidence. Shorter context at the
  decision turn dodged the think=false side effect.

Why the existing "always think:false" guidance was misleading:
it was derived from AI_Visualizer (single-turn JSON pipelines) where
thinking tokens do eat num_predict invisibly. In multi-turn
tool-calling agents the channels are separate and the flag has a
different effect — catastrophic on 26B specifically.

Doc updates:
- GOTCHAS: replaced the 26B entry with the actual cause; scoped the
  original "Thinking Mode Eats Context" entry to single-turn pipelines
- SYNTHESIS: split the "Mandatory Ollama Settings" block into
  single-turn vs multi-turn variants; updated anti-patterns and
  quick-start checklist
- CORPUS_cli_coding_agent.md: revised pointer and config template
- docs/reference/bakeoff-2026-04-18.md: added Round 3 section with
  the correction notice at the top of the file and full diagnostic
  methodology

New artifacts: harness_no_think_flag.py, harness_write_no_think.py,
and 4 new log files demonstrating all three models pass when think
is left at default.
2026-04-18 18:14:05 -04:00


# Gemma 4 Gotchas & Known Issues
> Derived from Seth's production implementations (Simon, AI_Visualizer)
> and community reports. These are hard-won lessons.
## CRITICAL: Thinking Mode Eats Context (single-turn pipelines only)
**Severity: HIGH — causes silent failures in single-turn `/api/generate` workloads**
> **Scope update (2026-04-18):** This guidance applies to **single-turn JSON
> generation pipelines** (the AI_Visualizer shape: one call → one structured
> response). For **multi-turn tool-calling agents**, the opposite is true on
> `gemma4:26b` — see § "`think: false` Kills Gemma 4 26B in Multi-Turn
> Tool-Calling Loops" below. Don't copy this fix to an agent harness without
> testing.
Gemma 4 in Ollama 0.20+ defaults to `think: true`. When enabled in a single-turn
JSON pipeline:
- Thinking tokens go into a hidden `thinking` field, NOT `response`
- If `num_predict` is limited, thinking consumes the entire budget
- `response` comes back **empty** — no error, just silence
- On evaluative tasks, thinking inflates scores (31B scored a known-bad image 9/10 with thinking vs 7/10 without)
**Fix (for single-turn pipelines):** Always pass `think: false` in the Ollama payload.
```json
{
  "model": "gemma4:26b",
  "think": false,
  "options": { "num_predict": 4096 }
}
```
**Do not blindly carry this to multi-turn tool-calling agents** — verified
2026-04-18 that it silent-stops 26B specifically in that context.
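
The empty-`response` failure described in this section is easy to miss because no error is raised. A minimal sketch of a guard, assuming the caller already has the parsed JSON body of an `/api/generate` reply as a dict with the `response` and `thinking` keys named above. The helper name `diagnose_generate_response` and its return labels are hypothetical.

```python
def diagnose_generate_response(resp: dict) -> str:
    """Classify a parsed /api/generate reply (hypothetical helper).

    Returns "ok", "thinking_ate_budget", or "empty".
    """
    text = (resp.get("response") or "").strip()
    thinking = (resp.get("thinking") or "").strip()
    if text:
        return "ok"
    # Empty answer plus non-empty thinking matches the symptom described
    # above: the num_predict budget ran out before any visible output.
    if thinking:
        return "thinking_ate_budget"
    return "empty"
```

Logging the label on every call in a single-turn pipeline makes the silent case visible without changing the request itself.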
## CRITICAL: format=json Causes Infinite Loops
**Severity: HIGH — hangs indefinitely**
Ollama's server-side `format: "json"` enforcer causes Gemma 4 26B (Q4) to enter an infinite retry loop when the requested schema is deeply nested.
**Fix:** Never use `format: "json"`. Instead:
1. Request JSON structure in the prompt text
2. Parse client-side with regex + `json.loads` + json5 fallback
```python
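import json
import ollama  # assumption: official Python client; adapt if `client` is your own wrapper

client = ollama.Client()
# DO THIS: leave `format` unset, ask for JSON in the prompt, parse client-side
response = client.generate(model="gemma4:26b", prompt=prompt)
body = response["response"]
obj = json.loads(body[body.find("{"):body.rfind("}") + 1])
# NOT THIS
response = client.generate(model="gemma4:26b", prompt=prompt, format="json")  # HANGS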
```
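
The "regex + `json.loads` + json5 fallback" step can be sketched as below. This is one possible shape, not the parser used in Seth's pipelines. `extract_json` is a hypothetical name, and `json5` is a third-party package that is treated as optional here.

```python
import json
import re

try:
    import json5  # third-party, optional; only used as a lenient fallback
except ImportError:
    json5 = None


def extract_json(text: str) -> dict:
    """Pull the outermost {...} span out of model output and parse it."""
    # Greedy match from the first "{" to the last "}" so nested objects survive.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    candidate = match.group(0)
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        if json5 is None:
            raise
        # json5 tolerates trailing commas, single quotes, and comments.
        return json5.loads(candidate)
```

The regex step matters because the model often wraps the object in prose or a code fence; strict parsing is tried first so the lenient parser only runs on output that is already known to be malformed.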
## CRITICAL: Ollama Default Context is 2048
**Severity: HIGH — causes truncation**
Ollama defaults `num_ctx` to 2048 tokens. Gemma 4 supports 128K. If you don't override the default, your prompts get silently truncated.
**Fix:** Always set `num_ctx` explicitly:
```json
{ "options": { "num_ctx": 8192 } }
```
Scale to your needs: 4096 for simple tasks, 16384 for long inputs, 32768 for complex multi-turn.
## HIGH: num_predict Default is 128
**Severity: HIGH — truncates output**
Ollama defaults `num_predict` to 128 tokens. Almost any useful Gemma 4 output exceeds this.
**Fix:** Always set `num_predict` explicitly. Minimum recommended: 512. For JSON output: 2048+.
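
One way to make sure neither default is ever relied on is to build every request body through a single helper. A minimal sketch: the function name and the default values (8192 and 2048, taken from the recommendations above) are illustrative choices, and the `num_predict < num_ctx` check is an added sanity check rather than an Ollama rule.

```python
def build_generate_payload(model: str, prompt: str,
                           num_ctx: int = 8192, num_predict: int = 2048) -> dict:
    """Build an /api/generate body that always sets num_ctx and num_predict."""
    if num_predict >= num_ctx:
        # Generated tokens share the context window with the prompt, so an
        # output budget as large as the window leaves no room for input.
        raise ValueError("num_predict should be smaller than num_ctx")
    return {
        "model": model,
        "prompt": prompt,
        "options": {"num_ctx": num_ctx, "num_predict": num_predict},
    }
```

Whether to add a `think` key to this body depends on the workload, as discussed in the sections on thinking mode.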
## HIGH: `think: false` Kills Gemma 4 26B in Multi-Turn Tool-Calling Loops
**Severity: HIGH — silent agent-loop failure. The failing setting is exactly what the older guidance recommended.**
Reproduced on 2026-04-18 against `gemma4:26b` via Ollama 0.20.4 on a 3090 Ti
(steel141). Contradicts the older "always think:false" guidance (see § "Thinking
Mode Eats Context" above — now scoped to single-turn pipelines only).
### The observation
At identical message state with all else equal:

| `think` setting | `eval_count` on decision turn | Agent behavior |
|---|---|---|
| `false` | **4** (silent stop, no content, no tool_calls) | Fails — zero edits emitted |
| unset (Ollama default) | 165 | Passes — emits correct edit |
| `true` | 165 | Passes — emits correct edit |

26B passes the task in 8 iterations / 12-20s on the same harness the moment
the `think` key is removed from the Ollama payload. `write_file` vs
`apply_patch` doesn't matter. Tool-response size doesn't matter.
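
The silent stop in the table has a recognisable signature: a very low `eval_count` with neither content nor tool calls. A sketch of a detector an agent loop could run on each non-streaming `/api/chat` reply. The name `is_silent_stop` and the threshold of 8 tokens are assumptions chosen to sit well below the 165 seen on a healthy turn; they are not values from the bakeoff harness.

```python
def is_silent_stop(chat_response: dict, max_eval: int = 8) -> bool:
    """True when a turn ended almost immediately with nothing usable."""
    message = chat_response.get("message") or {}
    has_content = bool((message.get("content") or "").strip())
    has_tool_calls = bool(message.get("tool_calls"))
    few_tokens = chat_response.get("eval_count", 0) <= max_eval
    return few_tokens and not has_content and not has_tool_calls
```

Raising or logging when this returns `True` turns a quiet zero-edit run into an explicit failure that points at the request settings.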
### What I initially got wrong
The 2026-04-18 bakeoff went through two wrong hypotheses before Seth asked
"was this with think=false?" The failed-and-corrected path:
1. **"Long `write_file` argument breaks 26B"** — wrong. `apply_patch` also failed.
2. **"Large tool-response context breaks 26B"** — wrong. Truncation *did* make
26B pass (800/1200-char caps), but that's because shorter context dodged
the `think: false` side effect by coincidence of state at the decision turn.
3. **Actual cause:** `think: false` alters the decoding path in a way that makes
the 26B MoE (3.8B active params, 8-of-128 expert routing) emit near-immediate
EOS at tool-decision turns. 31B Dense and Qwen3-Coder are robust to the
flag; 26B specifically is not.
See `docs/reference/bakeoff-2026-04-18.md` § "Round 3" for full traces and the
diagnostic that isolated the flag.
### Fix
- **For multi-turn tool-calling agents, do NOT set `think: false`.** Leave it
unset (Ollama default) or `true`.
- **If your agent accumulates `thinking` field content**, prune old thinking
blobs from message history to control context growth.
- **For single-turn JSON pipelines** (the AI_Visualizer shape), the original
"always think:false" guidance still applies — see § "Thinking Mode Eats
Context" above.
- 31B Dense and Qwen3-Coder work fine either way — this gotcha is 26B-specific
on this Ollama version.
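
The first two points of the fix can be sketched as a pair of helpers: one that leaves the `think` key out of the request unless it is set on purpose, and one that prunes old `thinking` blobs from history. Function names are hypothetical, and the sketch assumes assistant messages carry a `thinking` key as described above.

```python
from typing import Optional


def build_chat_payload(model: str, messages: list, tools: list,
                       think: Optional[bool] = None) -> dict:
    """Build an /api/chat body; `think` is omitted unless set explicitly."""
    payload = {"model": model, "messages": messages, "tools": tools, "stream": False}
    if think is not None:
        payload["think"] = think
    return payload


def prune_thinking(messages: list, keep_last: int = 1) -> list:
    """Return a copy of the history with `thinking` removed from all but the
    most recent `keep_last` assistant messages that carry it."""
    with_thinking = [i for i, m in enumerate(messages)
                     if m.get("role") == "assistant" and m.get("thinking")]
    # Guard keep_last == 0 explicitly: a [-0:] slice would keep everything.
    keep = set(with_thinking[-keep_last:]) if keep_last > 0 else set()
    return [m if i in keep or i not in with_thinking
            else {k: v for k, v in m.items() if k != "thinking"}
            for i, m in enumerate(messages)]
```

Defaulting `think` to `None` makes the safe behaviour the one that needs no argument, so a single-turn habit of passing `think=False` has to be carried over deliberately.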
## MEDIUM: Weak at Long/Nested JSON
**Severity: MEDIUM — causes parse failures**
Gemma 4 reliably produces short JSON (5-10 fields) but struggles with:
- Deeply nested schemas (3+ levels)
- Long arrays (20+ items)
- Mixed nesting + length
**Fix:** Sequential tool calls. Break one large JSON request into multiple smaller calls:
- Instead of "generate a 50-item storyboard", do "generate items 1-5", "generate items 6-10", etc.
- Because Gemma 4 is fast and free to run locally, sequential calls are cheap
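
The "items 1-5, items 6-10" split can be computed rather than hand-written. A small sketch; `batch_ranges` is a hypothetical helper name and the batch size of 5 simply mirrors the example above.

```python
def batch_ranges(total: int, size: int = 5) -> list:
    """Split items 1..total into inclusive (start, end) ranges of at most `size`."""
    if total < 0 or size < 1:
        raise ValueError("total must be >= 0 and size must be >= 1")
    return [(start, min(start + size - 1, total))
            for start in range(1, total + 1, size)]


# Each range becomes one small request instead of one large one.
prompts = [f"Generate storyboard items {a}-{b} as a JSON array."
           for a, b in batch_ranges(12)]
```

For 12 items this yields the ranges 1-5, 6-10, and 11-12, so the final request is shorter rather than padded.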
**Fallback pattern (AI_Visualizer):**
```python
for attempt in range(MAX_RETRIES):
    temp = BASE_TEMP + attempt * TEMP_BUMP  # 0.4 -> 0.5 -> 0.6
    response = call_gemma(temperature=temp)
    try:
        return parse_json(response)
    except JSONDecodeError:
        continue
```
## MEDIUM: Identity Confusion
**Severity: MEDIUM — cosmetic but confusing**
Gemma 4 is ultra-compliant and highly capable but does not know who it is. It may:
- Claim to be a different model
- Hallucinate capabilities it doesn't have
- Respond as a generic "AI assistant" without personality
**Fix:** Explicit identity in system prompt:
```
You are [Name], a [role]. You are powered by Gemma 4.
You ONLY do [X]. You NEVER do [Y].
```
Gemma 4 does NOT need hand-holding on task execution — it's very capable.
It needs explicit instructions about identity and boundaries.
## MEDIUM: Flash Attention Hang on 31B Dense (>3-4K tokens)
**Severity: MEDIUM — hardware-specific, affects RTX 3090**
Community-reported: Flash Attention causes Gemma 4 31B Dense to hang indefinitely during prompt evaluation when the prompt exceeds ~3-4K tokens. The 26B MoE variant handles the same prompts fine — bug is specific to the Dense model.
**Source:** [ollama/ollama#15350](https://github.com/ollama/ollama/issues/15350)
**Fix:** Use 26B for long prompts, or disable Flash Attention if running 31B on affected hardware.
## MEDIUM: Tool Calling Broken in Ollama v0.20.0 Streaming
**Severity: MEDIUM — version-specific**
As of early April 2026, Gemma 4 tool calling has issues in Ollama v0.20.0: the tool call parser fails and streaming drops tool calls entirely. Community reports include format mismatches and continuous loops in llama.cpp / LM Studio.
**Source:** [community reports](https://dev.to/dentity007/-gemma-4-after-24-hours-what-the-community-found-vs-what-google-promised-3a2f)
**Fix:** Use non-streaming for tool calls (Simon does this). Test tool calling thoroughly when upgrading Ollama versions. Seth's implementations work reliably with non-streaming tool calls.
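
With non-streaming calls, the whole reply arrives as one JSON object, so tool calls can be read in a single pass. A sketch of that read, assuming the reply has the usual `/api/chat` shape in which `message.tool_calls` is a list of `{"function": {"name": ..., "arguments": {...}}}` entries. `extract_tool_calls` is a hypothetical helper, not code from Simon.

```python
def extract_tool_calls(chat_response: dict) -> list:
    """Return (name, arguments) pairs from a non-streaming /api/chat reply."""
    message = chat_response.get("message") or {}
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call.get("function") or {}
        calls.append((fn.get("name"), fn.get("arguments") or {}))
    return calls
```

An empty list here means the model answered in plain text or stopped, which is worth distinguishing in logs when testing a new Ollama version.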
## MEDIUM: VRAM-Hungry for Context
**Severity: MEDIUM — affects hardware planning**
Gemma 4 KV cache is large relative to competitors. Community reports: 31B at 262K context requires ~22GB just for KV cache on top of model weights. One user could only fit Gemma 3 27B Q4 with 20K context on a 5090, while Qwen 3.5 27B Q4 fit with 190K context on the same card.
**Implication:** Don't set num_ctx higher than you need. 32K is plenty for most tasks and keeps VRAM reasonable.
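
To see why `num_ctx` drives VRAM so directly, the standard full-attention KV-cache estimate is useful. The sketch below uses that generic formula only. The layer count, KV-head count, and head dimension in the example are placeholder numbers, not Gemma 4's real configuration, and models with sliding-window or shared-KV layers will need less than this estimate.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: int = 2) -> int:
    """Full-attention KV cache size: keys and values for every layer and token."""
    # Factor of 2 covers the separate key and value tensors.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Placeholder architecture (NOT Gemma 4's actual numbers), fp16 cache.
small = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, n_ctx=32768)
large = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, n_ctx=262144)
print(small / 2**30, large / 2**30)  # cache size in GiB for each context length
```

The cache grows linearly with `n_ctx`, so going from 32K to 262K multiplies the estimate by eight regardless of the architecture numbers plugged in.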
## MEDIUM: Safety Overfiltering
**Severity: MEDIUM — blocks benign prompts**
Strict safety alignment occasionally blocks technical, academic, or creative prompts that superficially resemble restricted categories. Conversely, one user reported that basic system prompts were enough to jailbreak the model.
**Fix:** Rephrase blocked prompts to avoid trigger patterns. For system prompts, avoid language that sounds like you're asking the model to bypass restrictions — just state the task directly.
## MEDIUM: KV Cache Config Bug (31B/26B ship with num_kv_shared_layers=0)
**Severity: MEDIUM — crashes on first attention forward pass**
The 31B and 26B ship with `num_kv_shared_layers = 0`, which causes `layer_types[:-0]` to collapse to zero layer slots. Crashes on first forward pass.
**Fix:** Patch the config. Check model card discussions for the exact fix.
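
The failure mode comes from a Python slicing rule: `-0` is just `0`, so `[:-0]` means `[:0]` and returns an empty list. The sketch below illustrates only that slicing behaviour with placeholder layer names. It is not the upstream modelling code or the upstream fix, and both function names are hypothetical.

```python
layer_types = ["sliding", "sliding", "full", "sliding"]  # placeholder values


def non_shared_layers(layer_types: list, num_kv_shared_layers: int) -> list:
    """Naive slice: returns [] when num_kv_shared_layers == 0."""
    return layer_types[:-num_kv_shared_layers]


def non_shared_layers_safe(layer_types: list, num_kv_shared_layers: int) -> list:
    """An explicit end index avoids the -0 == 0 trap."""
    return layer_types[:len(layer_types) - num_kv_shared_layers]


print(non_shared_layers(layer_types, 0))       # empty list: every layer slot is lost
print(non_shared_layers_safe(layer_types, 0))  # all four layers kept
```

Both versions agree for any positive count, which is why the bug only appears on checkpoints that ship with the value set to 0.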
## LOW: vLLM Triton Fallback (~9 tok/s on RTX 4090)
**Severity: LOW — vLLM-specific**
Heterogeneous attention head dimensions in Gemma 4 force vLLM to fall back to a slow Triton kernel. RTX 4090 gets ~9 tok/s instead of expected ~100+.
**Source:** [vllm-project/vllm#38887](https://github.com/vllm-project/vllm/issues/38887)
**Fix:** Use Ollama instead of vLLM for now, or wait for the fix.
## LOW: `<unused>` Token Infinite Loop (Vulkan backends)
**Severity: LOW — Vulkan-specific**
Gemma 4 can generate `<unused>` or `<unused24>` tokens in an infinite loop on Vulkan backends in llama.cpp.
**Source:** [ggml-org/llama.cpp#21516](https://github.com/ggml-org/llama.cpp/issues/21516)
## MEDIUM: `google/gemma_pytorch` Abandoned for Gemma 4
**Severity: MEDIUM — wastes time on a dead-end path**
The `google/gemma_pytorch` repo (last push 2025-05-30) has zero Gemma 4 support —
its variants validator only accepts Gemma 1/2/3 IDs. Anyone pointing at it as "the
official PyTorch reference" for Gemma 4 is wrong.
**Use instead:**
- **Inference:** `huggingface/transformers` (`AutoModelForMultimodalLM`, v5.5.4+)
- **Reference impl:** `google-deepmind/gemma` (JAX/Flax)
- **Serving:** Ollama / vLLM / llama.cpp
See `tooling/google-official/gemma-pytorch/README.md` for the original repo state.
## LOW: Fine-Tuning Ecosystem Issues
**Severity: LOW — only relevant if fine-tuning**
Day-one issues for fine-tuners:
- HuggingFace Transformers didn't recognize gemma4 architecture (required install from source)
- PEFT couldn't handle Gemma4ClippableLinear (new vision encoder layer type)
- New `mm_token_type_ids` field required during training even for text-only data
- E2B/E4B show training loss of 13-15, which is normal for multimodal models (not a bug)
- **Flash Attention 2/4 incompatible:** Gemma 4's global-attention head_dim is 512;
FA2 max is 256, FA4 max is 128. Training backends fall back to SDP or Flex Attention
(Axolotl hard-codes `sdp_attention: true` for Gemma 4). Does not affect inference
runtimes that already use SDP (Ollama, vLLM).
- **Fused LoRA kernels broken** (shared-KV layers). Axolotl disables
`lora_mlp_kernel` / `qkv_kernel` / `o_kernel` for Gemma 4; Unsloth routes around it.
- **26B A4B MoE wants ≥8-bit LoRA**, not 4-bit QLoRA — MoE expert quality degrades
at 4-bit during training. Axolotl's ScatterMoE + expert-LoRA config is the only
validated 4-bit MoE path. (This caveat is **training-only**; Q4_K_M inference is fine.)
- **New tool-call / channel tokens are learned embeddings** — if fine-tuning, set
`modules_to_save=["lm_head","embed_tokens"]` + `ensure_weight_tying=True` in
`LoraConfig`, or the adapter trains against frozen random vectors for them.
See `tooling/fine-tuning/recipe-recommendation.md` for the full training path.
## LOW: Vision Validator Overrejects
**Severity: LOW — specific to evaluative vision tasks**
In AI_Visualizer, Gemma 4 vision was used to critique SDXL frames. It flagged images for motif-matching failures that humans rated as equal to or better than images it passed. The validator was queued to be disabled.
**Pattern:** Gemma 4 vision is good at description but unreliable for subjective quality scoring. Use it for "what's in this image?" not "is this image good?"
## LOW: Keep-Alive Too Short
**Severity: LOW — performance only**
Default `keep_alive` is 5 minutes. If your pipeline has gaps (e.g., waiting for SDXL generation), the model gets unloaded and reloaded (~10-30s penalty).
**Fix:** Set `keep_alive` to match your pipeline duration:
```json
{ "keep_alive": "4h" }
```
Or pin/unpin explicitly:
```python
client.generate(model="gemma4:26b", prompt="", keep_alive=-1, options={"num_predict": 0}) # pin
# ... do work ...
client.generate(model="gemma4:26b", prompt="", keep_alive=0, options={"num_predict": 0}) # unpin
```