Files
gemma4-research/GOTCHAS.md
T
Mortdecai 7f806e0b92 feat: round-2 bakeoff — 26b silent-stop is tool-response context size
Round 2 tested the hypothesis that 26B's silent-stop was about
write_file argument size. Result: refuted.

- Patch-mode (apply_patch instead of write_file): 26B fails identically
  at iter 6. Tool-arg size is not the cause.
- Truncation sweep on tool responses reveals the real trigger: cap at
  800 or 1200 chars → 26B PASSES (1200-cap is 8.4s, fastest of any run).
  Cap at 1600, 2000, or unlimited → 26B silent-stops with eval=4.

Revised understanding: 26B silent-stops when cumulative tool-response
context crosses a shape threshold around 1200-1600 chars per response.
Not a tool-arg bug, not a raw code-gen bug — 26B emits correct code
fine in both one-shot and short-context settings.

Production CLI agents (openclaw, open code, aider) typically truncate
tool responses by default, so this failure may not surface in them.
Custom harnesses should cap ≤1200 chars per tool response when
targeting the 26B MoE.

Updates GOTCHAS (rewritten entry with the truncation sweep table),
SYNTHESIS model-selection row, CORPUS_cli_coding_agent.md pointer,
docs/reference/bakeoff-2026-04-18.md with full Round 2 methodology
and data.

Adds harness_patch.py (apply_patch edit tool), harness_patch_truncated.py
(env-configurable TOOL_RESULT_CAP), all 7 run logs, and a
.secrets.baseline for detect-secrets false positives on JSON timestamps.
2026-04-18 13:40:18 -04:00

291 lines
12 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Gemma 4 Gotchas & Known Issues
> Derived from Seth's production implementations (Simon, AI_Visualizer)
> and community reports. These are hard-won lessons.
## CRITICAL: Thinking Mode Eats Context
**Severity: HIGH — causes silent failures**
Gemma 4 in Ollama 0.20+ defaults to `think: true`. When enabled:
- Thinking tokens go into a hidden `thinking` field, NOT `response`
- If `num_predict` is limited, thinking consumes the entire budget
- `response` comes back **empty** — no error, just silence
- On evaluative tasks, thinking inflates scores (31B scored a known-bad image 9/10 with thinking vs 7/10 without)
**Fix:** Always pass `think: false` in the Ollama payload. Seth has had success ONLY with thinking off.
```json
{
"model": "gemma4:26b",
"think": false,
"options": { "num_predict": 4096 }
}
```
## CRITICAL: format=json Causes Infinite Loops
**Severity: HIGH — hangs indefinitely**
Ollama's server-side `format: "json"` enforcer causes Gemma 4 26B (Q4) to enter an infinite retry loop when the requested schema is deeply nested.
**Fix:** Never use `format: "json"`. Instead:
1. Request JSON structure in the prompt text
2. Parse client-side with regex + `json.loads` + json5 fallback
```python
# DO THIS
response = client.generate(model="gemma4:26b", prompt=prompt, format_json=False)
body = response["response"]
obj = json.loads(body[body.find("{"):body.rfind("}") + 1])
# NOT THIS
response = client.generate(model="gemma4:26b", prompt=prompt, format="json") # HANGS
```
## CRITICAL: Ollama Default Context is 2048
**Severity: HIGH — causes truncation**
Ollama defaults `num_ctx` to 2048 tokens. Gemma 4 supports 128K. If you don't override, your prompts get silently truncated.
**Fix:** Always set `num_ctx` explicitly:
```json
{ "options": { "num_ctx": 8192 } }
```
Scale to your needs: 4096 for simple tasks, 16384 for long inputs, 32768 for complex multi-turn.
## HIGH: num_predict Default is 128
**Severity: HIGH — truncates output**
Ollama defaults `num_predict` to 128 tokens. Almost any useful Gemma 4 output exceeds this.
**Fix:** Always set `num_predict` explicitly. Minimum recommended: 512. For JSON output: 2048+.
## HIGH: 26B Silent-Stops When Tool Responses Accumulate (reproducible)
**Severity: HIGH — silent agent-loop failure. Mitigatable.**
Reproduced on 2026-04-18 against `gemma4:26b` via Ollama 0.20.4 on a 3090 Ti
(steel141). Agent harness looped through `read_file` / `(write_file or apply_patch)` / `run_bash`
tools to fix a failing Python test.
### The observation
26B silent-stops (empty content, no tool calls, `eval_count=4`) at the
decision-to-edit turn, **regardless of which edit tool is offered** — tested with
both `write_file(path, full_content)` and `apply_patch(path, old, new)`.
Initial hypothesis (long tool-call argument) was **refuted**.
### The actual trigger: cumulative tool-response context shape
A sweep with progressive truncation caps on tool responses (`TOOL_RESULT_CAP`):
| Cap (chars) | Result | Halt eval_count |
|---|---|---|
| 800 | PASS | 24 (continues, hits iteration cap) |
| 1200 | **PASS****fastest of any run (8.4s)** | 27 (clean summary) |
| 1600 | FAIL | **4** (silent stop) |
| 2000 | FAIL | **4** (silent stop) |
| unlimited | FAIL | **4** (silent stop) |
Sharp transition between 1200 and 1600 chars-per-response. Below the line, 26B
emits correct code (eval_count ~165 on the patch turn). Above, it silent-stops.
Exact mechanism unproven (could be MoE expert routing, chat-template edge case,
or something else). **Actionable:** cap tool responses ≤1200 chars.
### What's NOT at fault
- **Not the edit tool surface** — `write_file` and `apply_patch` both trigger it
- **Not raw code generation** — a one-shot direct prompt asking 26B to fix the
same function returned clean correct code (eval=81)
- **Not total context size alone** — the 800-cap run continued past 3741 prompt
tokens. Failing runs halt at ~2070-2150 tokens but the 1200-cap run crossed
the same range and kept going
- **Not a Gemma-4-family issue** — `gemma4:31b-it-q4_K_M` on identical harness
handles full-size tool responses cleanly (eval=330 on the write turn)
### Fix
- **For 26B in an agent loop, cap tool responses ≤1200 chars.** 800 is safer;
this is where every production CLI agent (openclaw / open code / aider /
cline) already lives by default, so the issue may not surface in those
frameworks.
- **For raw pytest output specifically**, use `pytest -x --tb=line` or a custom
formatter to shrink per-test output to a few lines.
- **Alternative:** use `gemma4:31b-it-q4_K_M` — same harness, no mitigation,
just works. Trade: ~5× slower than 26B when 26B cooperates.
- See `docs/reference/bakeoff-2026-04-18.md` (Round 2) for full traces and the
truncation sweep methodology.
## MEDIUM: Weak at Long/Nested JSON
**Severity: MEDIUM — causes parse failures**
Gemma 4 reliably produces short JSON (5-10 fields) but struggles with:
- Deeply nested schemas (3+ levels)
- Long arrays (20+ items)
- Mixed nesting + length
**Fix:** Sequential tool calls. Break one large JSON request into multiple smaller calls:
- Instead of "generate a 50-item storyboard", do "generate items 1-5", "generate items 6-10", etc.
- Due to Gemma 4's fast speed and free local use, sequential calls are cheap
**Fallback pattern (AI_Visualizer):**
```python
for attempt in range(MAX_RETRIES):
temp = BASE_TEMP + attempt * TEMP_BUMP # 0.4 -> 0.5 -> 0.6
response = call_gemma(temperature=temp)
try:
return parse_json(response)
except JSONDecodeError:
continue
```
## MEDIUM: Identity Confusion
**Severity: MEDIUM — cosmetic but confusing**
Gemma 4 is ultra-compliant and highly capable but does not know who it is. It may:
- Claim to be a different model
- Hallucinate capabilities it doesn't have
- Respond as a generic "AI assistant" without personality
**Fix:** Explicit identity in system prompt:
```
You are [Name], a [role]. You are powered by Gemma 4.
You ONLY do [X]. You NEVER do [Y].
```
Gemma 4 does NOT need hand-holding on task execution — it's very capable.
It needs explicit instructions about identity and boundaries.
## MEDIUM: Flash Attention Hang on 31B Dense (>3-4K tokens)
**Severity: MEDIUM — hardware-specific, affects RTX 3090**
Community-reported: Flash Attention causes Gemma 4 31B Dense to hang indefinitely during prompt evaluation when the prompt exceeds ~3-4K tokens. The 26B MoE variant handles the same prompts fine — bug is specific to the Dense model.
**Source:** [ollama/ollama#15350](https://github.com/ollama/ollama/issues/15350)
**Fix:** Use 26B for long prompts, or disable Flash Attention if running 31B on affected hardware.
## MEDIUM: Tool Calling Broken in Ollama v0.20.0 Streaming
**Severity: MEDIUM — version-specific**
As of early April 2026, Gemma 4 tool calling has issues in Ollama v0.20.0: the tool call parser fails and streaming drops tool calls entirely. Community reports include format mismatches and continuous loops in llama.cpp / LM Studio.
**Source:** [community reports](https://dev.to/dentity007/-gemma-4-after-24-hours-what-the-community-found-vs-what-google-promised-3a2f)
**Fix:** Use non-streaming for tool calls (Simon does this). Test tool calling thoroughly when upgrading Ollama versions. Seth's implementations work reliably with non-streaming tool calls.
## MEDIUM: VRAM-Hungry for Context
**Severity: MEDIUM — affects hardware planning**
Gemma 4 KV cache is large relative to competitors. Community reports: 31B at 262K context requires ~22GB just for KV cache on top of model weights. One user could only fit Gemma 3 27B Q4 with 20K context on a 5090, while Qwen 3.5 27B Q4 fit with 190K context on the same card.
**Implication:** Don't set num_ctx higher than you need. 32K is plenty for most tasks and keeps VRAM reasonable.
## MEDIUM: Safety Overfiltering
**Severity: MEDIUM — blocks benign prompts**
Strict safety alignment occasionally blocks technical, academic, or creative prompts that superficially resemble restricted categories. One user reported jailbreaks with basic system prompts.
**Fix:** Rephrase blocked prompts to avoid trigger patterns. For system prompts, avoid language that sounds like you're asking the model to bypass restrictions — just state the task directly.
## MEDIUM: KV Cache Config Bug (31B/26B ship with num_kv_shared_layers=0)
**Severity: MEDIUM — crashes on first attention forward pass**
The 31B and 26B ship with `num_kv_shared_layers = 0`, which causes `layer_types[:-0]` to collapse to zero layer slots. Crashes on first forward pass.
**Fix:** Patch the config. Check model card discussions for the exact fix.
## LOW: vLLM Triton Fallback (~9 tok/s on RTX 4090)
**Severity: LOW — vLLM-specific**
Heterogeneous attention head dimensions in Gemma 4 force vLLM to fall back to a slow Triton kernel. RTX 4090 gets ~9 tok/s instead of expected ~100+.
**Source:** [vllm-project/vllm#38887](https://github.com/vllm-project/vllm/issues/38887)
**Fix:** Use Ollama instead of vLLM for now, or wait for the fix.
## LOW: `<unused>` Token Infinite Loop (Vulkan backends)
**Severity: LOW — Vulkan-specific**
Gemma 4 can generate `<unused>` or `<unused24>` tokens in an infinite loop on Vulkan backends in llama.cpp.
**Source:** [ggml-org/llama.cpp#21516](https://github.com/ggml-org/llama.cpp/issues/21516)
## MEDIUM: `google/gemma_pytorch` Abandoned for Gemma 4
**Severity: MEDIUM — wastes time on a dead-end path**
The `google/gemma_pytorch` repo (last push 2025-05-30) has zero Gemma 4 support —
its variants validator only accepts Gemma 1/2/3 IDs. Anyone pointing at it as "the
official PyTorch reference" for Gemma 4 is wrong.
**Use instead:**
- **Inference:** `huggingface/transformers` (`AutoModelForMultimodalLM`, v5.5.4+)
- **Reference impl:** `google-deepmind/gemma` (JAX/Flax)
- **Serving:** Ollama / vLLM / llama.cpp
See `tooling/google-official/gemma-pytorch/README.md` for the original repo state.
## LOW: Fine-Tuning Ecosystem Issues
**Severity: LOW — only relevant if fine-tuning**
Day-one issues for fine-tuners:
- HuggingFace Transformers didn't recognize gemma4 architecture (required install from source)
- PEFT couldn't handle Gemma4ClippableLinear (new vision encoder layer type)
- New `mm_token_type_ids` field required during training even for text-only data
- E2B/E4B show training loss of 13-15, which is normal for multimodal models (not a bug)
- **Flash Attention 2/4 incompatible:** Gemma 4's global-attention head_dim is 512;
FA2 max is 256, FA4 max is 128. Training backends fall back to SDP or Flex Attention
(Axolotl hard-codes `sdp_attention: true` for Gemma 4). Does not affect inference
runtimes that already use SDP (Ollama, vLLM).
- **Fused LoRA kernels broken** (shared-KV layers). Axolotl disables
`lora_mlp_kernel` / `qkv_kernel` / `o_kernel` for Gemma 4; Unsloth routes around it.
- **26B A4B MoE wants ≥8-bit LoRA**, not 4-bit QLoRA — MoE expert quality degrades
at 4-bit during training. Axolotl's ScatterMoE + expert-LoRA config is the only
validated 4-bit MoE path. (This caveat is **training-only**; Q4_K_M inference is fine.)
- **New tool-call / channel tokens are learned embeddings** — if fine-tuning, set
`modules_to_save=["lm_head","embed_tokens"]` + `ensure_weight_tying=True` in
`LoraConfig`, or the adapter trains against frozen random vectors for them.
See `tooling/fine-tuning/recipe-recommendation.md` for the full training path.
## LOW: Vision Validator Overrejects
**Severity: LOW — specific to evaluative vision tasks**
In AI_Visualizer, Gemma 4 vision was used to critique SDXL frames. It flagged images for motif-matching failures that humans rated as equal or better than passed images. The validator was queued for disable.
**Pattern:** Gemma 4 vision is good at description but unreliable for subjective quality scoring. Use it for "what's in this image?" not "is this image good?"
## LOW: Keep-Alive Too Short
**Severity: LOW — performance only**
Default `keep_alive` is 5 minutes. If your pipeline has gaps (e.g., waiting for SDXL generation), the model gets unloaded and reloaded (~10-30s penalty).
**Fix:** Set `keep_alive` to match your pipeline duration:
```json
{ "keep_alive": "4h" }
```
Or pin/unpin explicitly:
```python
client.generate(model="gemma4:26b", prompt="", keep_alive=-1, options={"num_predict": 0}) # pin
# ... do work ...
client.generate(model="gemma4:26b", prompt="", keep_alive=0, options={"num_predict": 0}) # unpin
```