7f806e0b92
Round 2 tested the hypothesis that 26B's silent-stop was about write_file argument size. Result: refuted. - Patch-mode (apply_patch instead of write_file): 26B fails identically at iter 6. Tool-arg size is not the cause. - Truncation sweep on tool responses reveals the real trigger: cap at 800 or 1200 chars → 26B PASSES (1200-cap is 8.4s, fastest of any run). Cap at 1600, 2000, or unlimited → 26B silent-stops with eval=4. Revised understanding: 26B silent-stops when cumulative tool-response context crosses a shape threshold around 1200-1600 chars per response. Not a tool-arg bug, not a raw code-gen bug — 26B emits correct code fine in both one-shot and short-context settings. Production CLI agents (openclaw, open code, aider) typically truncate tool responses by default, so this failure may not surface in them. Custom harnesses should cap ≤1200 chars per tool response when targeting the 26B MoE. Updates GOTCHAS (rewritten entry with the truncation sweep table), SYNTHESIS model-selection row, CORPUS_cli_coding_agent.md pointer, docs/reference/bakeoff-2026-04-18.md with full Round 2 methodology and data. Adds harness_patch.py (apply_patch edit tool), harness_patch_truncated.py (env-configurable TOOL_RESULT_CAP), all 7 run logs, and a .secrets.baseline for detect-secrets false positives on JSON timestamps.
291 lines
12 KiB
Markdown
291 lines
12 KiB
Markdown
# Gemma 4 Gotchas & Known Issues
|
||
|
||
> Derived from Seth's production implementations (Simon, AI_Visualizer)
|
||
> and community reports. These are hard-won lessons.
|
||
|
||
## CRITICAL: Thinking Mode Eats Context
|
||
|
||
**Severity: HIGH — causes silent failures**
|
||
|
||
Gemma 4 in Ollama 0.20+ defaults to `think: true`. When enabled:
|
||
- Thinking tokens go into a hidden `thinking` field, NOT `response`
|
||
- If `num_predict` is limited, thinking consumes the entire budget
|
||
- `response` comes back **empty** — no error, just silence
|
||
- On evaluative tasks, thinking inflates scores (31B scored a known-bad image 9/10 with thinking vs 7/10 without)
|
||
|
||
**Fix:** Always pass `think: false` in the Ollama payload. Seth has had success ONLY with thinking off.
|
||
|
||
```json
|
||
{
|
||
"model": "gemma4:26b",
|
||
"think": false,
|
||
"options": { "num_predict": 4096 }
|
||
}
|
||
```
|
||
|
||
## CRITICAL: format=json Causes Infinite Loops
|
||
|
||
**Severity: HIGH — hangs indefinitely**
|
||
|
||
Ollama's server-side `format: "json"` enforcer causes Gemma 4 26B (Q4) to enter an infinite retry loop when the requested schema is deeply nested.
|
||
|
||
**Fix:** Never use `format: "json"`. Instead:
|
||
1. Request JSON structure in the prompt text
|
||
2. Parse client-side with regex + `json.loads` + json5 fallback
|
||
|
||
```python
|
||
# DO THIS
|
||
response = client.generate(model="gemma4:26b", prompt=prompt, format_json=False)
|
||
body = response["response"]
|
||
obj = json.loads(body[body.find("{"):body.rfind("}") + 1])
|
||
|
||
# NOT THIS
|
||
response = client.generate(model="gemma4:26b", prompt=prompt, format="json") # HANGS
|
||
```
|
||
|
||
## CRITICAL: Ollama Default Context is 2048
|
||
|
||
**Severity: HIGH — causes truncation**
|
||
|
||
Ollama defaults `num_ctx` to 2048 tokens. Gemma 4 supports 128K. If you don't override, your prompts get silently truncated.
|
||
|
||
**Fix:** Always set `num_ctx` explicitly:
|
||
```json
|
||
{ "options": { "num_ctx": 8192 } }
|
||
```
|
||
|
||
Scale to your needs: 4096 for simple tasks, 16384 for long inputs, 32768 for complex multi-turn.
|
||
|
||
## HIGH: num_predict Default is 128
|
||
|
||
**Severity: HIGH — truncates output**
|
||
|
||
Ollama defaults `num_predict` to 128 tokens. Almost any useful Gemma 4 output exceeds this.
|
||
|
||
**Fix:** Always set `num_predict` explicitly. Minimum recommended: 512. For JSON output: 2048+.
|
||
|
||
## HIGH: 26B Silent-Stops When Tool Responses Accumulate (reproducible)
|
||
|
||
**Severity: HIGH — silent agent-loop failure. Mitigatable.**
|
||
|
||
Reproduced on 2026-04-18 against `gemma4:26b` via Ollama 0.20.4 on a 3090 Ti
|
||
(steel141). Agent harness looped through `read_file` / `(write_file or apply_patch)` / `run_bash`
|
||
tools to fix a failing Python test.
|
||
|
||
### The observation
|
||
|
||
26B silent-stops (empty content, no tool calls, `eval_count=4`) at the
|
||
decision-to-edit turn, **regardless of which edit tool is offered** — tested with
|
||
both `write_file(path, full_content)` and `apply_patch(path, old, new)`.
|
||
Initial hypothesis (long tool-call argument) was **refuted**.
|
||
|
||
### The actual trigger: cumulative tool-response context shape
|
||
|
||
A sweep with progressive truncation caps on tool responses (`TOOL_RESULT_CAP`):
|
||
|
||
| Cap (chars) | Result | Halt eval_count |
|
||
|---|---|---|
|
||
| 800 | PASS | 24 (continues, hits iteration cap) |
|
||
| 1200 | **PASS** — **fastest of any run (8.4s)** | 27 (clean summary) |
|
||
| 1600 | FAIL | **4** (silent stop) |
|
||
| 2000 | FAIL | **4** (silent stop) |
|
||
| unlimited | FAIL | **4** (silent stop) |
|
||
|
||
Sharp transition between 1200 and 1600 chars-per-response. Below the line, 26B
|
||
emits correct code (eval_count ~165 on the patch turn). Above, it silent-stops.
|
||
Exact mechanism unproven (could be MoE expert routing, chat-template edge case,
|
||
or something else). **Actionable:** cap tool responses ≤1200 chars.
|
||
|
||
### What's NOT at fault
|
||
|
||
- **Not the edit tool surface** — `write_file` and `apply_patch` both trigger it
|
||
- **Not raw code generation** — a one-shot direct prompt asking 26B to fix the
|
||
same function returned clean correct code (eval=81)
|
||
- **Not total context size alone** — the 800-cap run continued past 3741 prompt
|
||
tokens. Failing runs halt at ~2070-2150 tokens but the 1200-cap run crossed
|
||
the same range and kept going
|
||
- **Not a Gemma-4-family issue** — `gemma4:31b-it-q4_K_M` on identical harness
|
||
handles full-size tool responses cleanly (eval=330 on the write turn)
|
||
|
||
### Fix
|
||
|
||
- **For 26B in an agent loop, cap tool responses ≤1200 chars.** 800 is safer;
|
||
this is where every production CLI agent (openclaw / open code / aider /
|
||
cline) already lives by default, so the issue may not surface in those
|
||
frameworks.
|
||
- **For raw pytest output specifically**, use `pytest -x --tb=line` or a custom
|
||
formatter to shrink per-test output to a few lines.
|
||
- **Alternative:** use `gemma4:31b-it-q4_K_M` — same harness, no mitigation,
|
||
just works. Trade: ~5× slower than 26B when 26B cooperates.
|
||
- See `docs/reference/bakeoff-2026-04-18.md` (Round 2) for full traces and the
|
||
truncation sweep methodology.
|
||
|
||
## MEDIUM: Weak at Long/Nested JSON
|
||
|
||
**Severity: MEDIUM — causes parse failures**
|
||
|
||
Gemma 4 reliably produces short JSON (5-10 fields) but struggles with:
|
||
- Deeply nested schemas (3+ levels)
|
||
- Long arrays (20+ items)
|
||
- Mixed nesting + length
|
||
|
||
**Fix:** Sequential tool calls. Break one large JSON request into multiple smaller calls:
|
||
- Instead of "generate a 50-item storyboard", do "generate items 1-5", "generate items 6-10", etc.
|
||
- Due to Gemma 4's fast speed and free local use, sequential calls are cheap
|
||
|
||
**Fallback pattern (AI_Visualizer):**
|
||
```python
|
||
for attempt in range(MAX_RETRIES):
|
||
temp = BASE_TEMP + attempt * TEMP_BUMP # 0.4 -> 0.5 -> 0.6
|
||
response = call_gemma(temperature=temp)
|
||
try:
|
||
return parse_json(response)
|
||
except JSONDecodeError:
|
||
continue
|
||
```
|
||
|
||
## MEDIUM: Identity Confusion
|
||
|
||
**Severity: MEDIUM — cosmetic but confusing**
|
||
|
||
Gemma 4 is ultra-compliant and highly capable but does not know who it is. It may:
|
||
- Claim to be a different model
|
||
- Hallucinate capabilities it doesn't have
|
||
- Respond as a generic "AI assistant" without personality
|
||
|
||
**Fix:** Explicit identity in system prompt:
|
||
```
|
||
You are [Name], a [role]. You are powered by Gemma 4.
|
||
You ONLY do [X]. You NEVER do [Y].
|
||
```
|
||
|
||
Gemma 4 does NOT need hand-holding on task execution — it's very capable.
|
||
It needs explicit instructions about identity and boundaries.
|
||
|
||
## MEDIUM: Flash Attention Hang on 31B Dense (>3-4K tokens)
|
||
|
||
**Severity: MEDIUM — hardware-specific, affects RTX 3090**
|
||
|
||
Community-reported: Flash Attention causes Gemma 4 31B Dense to hang indefinitely during prompt evaluation when the prompt exceeds ~3-4K tokens. The 26B MoE variant handles the same prompts fine — bug is specific to the Dense model.
|
||
|
||
**Source:** [ollama/ollama#15350](https://github.com/ollama/ollama/issues/15350)
|
||
|
||
**Fix:** Use 26B for long prompts, or disable Flash Attention if running 31B on affected hardware.
|
||
|
||
## MEDIUM: Tool Calling Broken in Ollama v0.20.0 Streaming
|
||
|
||
**Severity: MEDIUM — version-specific**
|
||
|
||
As of early April 2026, Gemma 4 tool calling has issues in Ollama v0.20.0: the tool call parser fails and streaming drops tool calls entirely. Community reports include format mismatches and continuous loops in llama.cpp / LM Studio.
|
||
|
||
**Source:** [community reports](https://dev.to/dentity007/-gemma-4-after-24-hours-what-the-community-found-vs-what-google-promised-3a2f)
|
||
|
||
**Fix:** Use non-streaming for tool calls (Simon does this). Test tool calling thoroughly when upgrading Ollama versions. Seth's implementations work reliably with non-streaming tool calls.
|
||
|
||
## MEDIUM: VRAM-Hungry for Context
|
||
|
||
**Severity: MEDIUM — affects hardware planning**
|
||
|
||
Gemma 4 KV cache is large relative to competitors. Community reports: 31B at 262K context requires ~22GB just for KV cache on top of model weights. One user could only fit Gemma 3 27B Q4 with 20K context on a 5090, while Qwen 3.5 27B Q4 fit with 190K context on the same card.
|
||
|
||
**Implication:** Don't set num_ctx higher than you need. 32K is plenty for most tasks and keeps VRAM reasonable.
|
||
|
||
## MEDIUM: Safety Overfiltering
|
||
|
||
**Severity: MEDIUM — blocks benign prompts**
|
||
|
||
Strict safety alignment occasionally blocks technical, academic, or creative prompts that superficially resemble restricted categories. One user reported jailbreaks with basic system prompts.
|
||
|
||
**Fix:** Rephrase blocked prompts to avoid trigger patterns. For system prompts, avoid language that sounds like you're asking the model to bypass restrictions — just state the task directly.
|
||
|
||
## MEDIUM: KV Cache Config Bug (31B/26B ship with num_kv_shared_layers=0)
|
||
|
||
**Severity: MEDIUM — crashes on first attention forward pass**
|
||
|
||
The 31B and 26B ship with `num_kv_shared_layers = 0`, which causes `layer_types[:-0]` to collapse to zero layer slots. Crashes on first forward pass.
|
||
|
||
**Fix:** Patch the config. Check model card discussions for the exact fix.
|
||
|
||
## LOW: vLLM Triton Fallback (~9 tok/s on RTX 4090)
|
||
|
||
**Severity: LOW — vLLM-specific**
|
||
|
||
Heterogeneous attention head dimensions in Gemma 4 force vLLM to fall back to a slow Triton kernel. RTX 4090 gets ~9 tok/s instead of expected ~100+.
|
||
|
||
**Source:** [vllm-project/vllm#38887](https://github.com/vllm-project/vllm/issues/38887)
|
||
|
||
**Fix:** Use Ollama instead of vLLM for now, or wait for the fix.
|
||
|
||
## LOW: `<unused>` Token Infinite Loop (Vulkan backends)
|
||
|
||
**Severity: LOW — Vulkan-specific**
|
||
|
||
Gemma 4 can generate `<unused>` or `<unused24>` tokens in an infinite loop on Vulkan backends in llama.cpp.
|
||
|
||
**Source:** [ggml-org/llama.cpp#21516](https://github.com/ggml-org/llama.cpp/issues/21516)
|
||
|
||
## MEDIUM: `google/gemma_pytorch` Abandoned for Gemma 4
|
||
|
||
**Severity: MEDIUM — wastes time on a dead-end path**
|
||
|
||
The `google/gemma_pytorch` repo (last push 2025-05-30) has zero Gemma 4 support —
|
||
its variants validator only accepts Gemma 1/2/3 IDs. Anyone pointing at it as "the
|
||
official PyTorch reference" for Gemma 4 is wrong.
|
||
|
||
**Use instead:**
|
||
- **Inference:** `huggingface/transformers` (`AutoModelForMultimodalLM`, v5.5.4+)
|
||
- **Reference impl:** `google-deepmind/gemma` (JAX/Flax)
|
||
- **Serving:** Ollama / vLLM / llama.cpp
|
||
|
||
See `tooling/google-official/gemma-pytorch/README.md` for the original repo state.
|
||
|
||
## LOW: Fine-Tuning Ecosystem Issues
|
||
|
||
**Severity: LOW — only relevant if fine-tuning**
|
||
|
||
Day-one issues for fine-tuners:
|
||
- HuggingFace Transformers didn't recognize gemma4 architecture (required install from source)
|
||
- PEFT couldn't handle Gemma4ClippableLinear (new vision encoder layer type)
|
||
- New `mm_token_type_ids` field required during training even for text-only data
|
||
- E2B/E4B show training loss of 13-15, which is normal for multimodal models (not a bug)
|
||
- **Flash Attention 2/4 incompatible:** Gemma 4's global-attention head_dim is 512;
|
||
FA2 max is 256, FA4 max is 128. Training backends fall back to SDP or Flex Attention
|
||
(Axolotl hard-codes `sdp_attention: true` for Gemma 4). Does not affect inference
|
||
runtimes that already use SDP (Ollama, vLLM).
|
||
- **Fused LoRA kernels broken** (shared-KV layers). Axolotl disables
|
||
`lora_mlp_kernel` / `qkv_kernel` / `o_kernel` for Gemma 4; Unsloth routes around it.
|
||
- **26B A4B MoE wants ≥8-bit LoRA**, not 4-bit QLoRA — MoE expert quality degrades
|
||
at 4-bit during training. Axolotl's ScatterMoE + expert-LoRA config is the only
|
||
validated 4-bit MoE path. (This caveat is **training-only**; Q4_K_M inference is fine.)
|
||
- **New tool-call / channel tokens are learned embeddings** — if fine-tuning, set
|
||
`modules_to_save=["lm_head","embed_tokens"]` + `ensure_weight_tying=True` in
|
||
`LoraConfig`, or the adapter trains against frozen random vectors for them.
|
||
|
||
See `tooling/fine-tuning/recipe-recommendation.md` for the full training path.
|
||
|
||
## LOW: Vision Validator Overrejects
|
||
|
||
**Severity: LOW — specific to evaluative vision tasks**
|
||
|
||
In AI_Visualizer, Gemma 4 vision was used to critique SDXL frames. It flagged images for motif-matching failures that humans rated as equal or better than passed images. The validator was queued for disable.
|
||
|
||
**Pattern:** Gemma 4 vision is good at description but unreliable for subjective quality scoring. Use it for "what's in this image?" not "is this image good?"
|
||
|
||
## LOW: Keep-Alive Too Short
|
||
|
||
**Severity: LOW — performance only**
|
||
|
||
Default `keep_alive` is 5 minutes. If your pipeline has gaps (e.g., waiting for SDXL generation), the model gets unloaded and reloaded (~10-30s penalty).
|
||
|
||
**Fix:** Set `keep_alive` to match your pipeline duration:
|
||
```json
|
||
{ "keep_alive": "4h" }
|
||
```
|
||
|
||
Or pin/unpin explicitly:
|
||
```python
|
||
client.generate(model="gemma4:26b", prompt="", keep_alive=-1, options={"num_predict": 0}) # pin
|
||
# ... do work ...
|
||
client.generate(model="gemma4:26b", prompt="", keep_alive=0, options={"num_predict": 0}) # unpin
|
||
```
|