Ran a minimal agent loop (Ollama /api/chat + read_file/write_file/run_bash) on steel141 (3090 Ti) against 3 models on a broken-median-function task:
- gemma4:31b-it-q4_K_M: PASS (8 iters, 1 write, 44s), textbook trace
- qwen3-coder:30b: PASS (15 iters, 1 write, 22s), correct but chatty
- gemma4:26b: FAIL (6 iters, 0 writes), silently stops with eval=4 after reading source. Reproduced on second run.
One-shot probe confirms 26b CAN produce the correct fix; the failure is specifically at the write_file tool-call argument boundary.
Updates GOTCHAS with a new HIGH-severity entry, the SYNTHESIS model-selection table, and the CORPUS_cli_coding_agent.md empirical-follow-up pointer, and adds docs/reference/bakeoff-2026-04-18.md with the full writeup.
Gemma 4 Gotchas & Known Issues
Derived from Seth's production implementations (Simon, AI_Visualizer) and community reports. These are hard-won lessons.
CRITICAL: Thinking Mode Eats Context
Severity: HIGH — causes silent failures
Gemma 4 in Ollama 0.20+ defaults to think: true. When enabled:
- Thinking tokens go into a hidden thinking field, NOT response
- If num_predict is limited, thinking consumes the entire budget
- response comes back empty: no error, just silence
- On evaluative tasks, thinking inflates scores (31B scored a known-bad image 9/10 with thinking vs 7/10 without)
Fix: Always pass think: false in the Ollama payload. Seth has had success ONLY with thinking off.
{
"model": "gemma4:26b",
"think": false,
"options": { "num_predict": 4096 }
}
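A minimal sketch of guarding against the silent empty response in client code. It assumes the non-streaming generate response exposes the response and thinking fields described above; the helper names build_payload and check_not_silent are hypothetical, not part of Ollama or of Seth's code.

```python
def build_payload(model: str, prompt: str, num_predict: int = 4096) -> dict:
    """Build an Ollama generate payload with thinking explicitly disabled."""
    return {
        "model": model,
        "prompt": prompt,
        "think": False,
        "stream": False,
        "options": {"num_predict": num_predict},
    }


def check_not_silent(result: dict) -> str:
    """Return the response text, or raise if it came back empty."""
    text = (result.get("response") or "").strip()
    if not text:
        # Report how much hidden thinking was produced to aid debugging.
        thinking_len = len(result.get("thinking") or "")
        raise RuntimeError(
            f"empty response ({thinking_len} chars of thinking); "
            "is think=False set and num_predict large enough?"
        )
    return text
```

Raising on an empty response turns the silent failure into a loud one, which is usually easier to spot in a pipeline log.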
CRITICAL: format=json Causes Infinite Loops
Severity: HIGH — hangs indefinitely
Ollama's server-side format: "json" enforcer causes Gemma 4 26B (Q4) to enter an infinite retry loop when the requested schema is deeply nested.
Fix: Never use format: "json". Instead:
- Request JSON structure in the prompt text
- Parse client-side with regex + json.loads + json5 fallback
import json

# DO THIS: omit `format` and parse the JSON yourself
response = client.generate(model="gemma4:26b", prompt=prompt)
body = response["response"]
# Slice from the first "{" to the last "}" to drop any surrounding prose
obj = json.loads(body[body.find("{"):body.rfind("}") + 1])

# NOT THIS
response = client.generate(model="gemma4:26b", prompt=prompt, format="json")  # HANGS
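The regex + json.loads + json5 fallback can be sketched as a single helper. This is an illustration, not the parser used in Simon or AI_Visualizer: extract_json is a hypothetical name, and json5 is an optional third-party package that is only used if it happens to be installed.

```python
import json
import re

try:
    import json5  # optional third-party fallback; assumed installed separately
except ImportError:
    json5 = None


def extract_json(text: str):
    """Parse the outermost {...} span in a model reply, ignoring surrounding prose."""
    # Greedy match from the first "{" to the last "}" across newlines.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    candidate = match.group(0)
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        if json5 is None:
            raise
        # json5 tolerates trailing commas, comments, and single quotes.
        return json5.loads(candidate)
```

For example, extract_json('Sure! {"a": 1} Hope that helps.') returns the dict with key "a", discarding the chatter on either side.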
CRITICAL: Ollama Default Context is 2048
Severity: HIGH — causes truncation
Ollama defaults num_ctx to 2048 tokens. Gemma 4 supports 128K. If you don't override, your prompts get silently truncated.
Fix: Always set num_ctx explicitly:
{ "options": { "num_ctx": 8192 } }
Scale to your needs: 4096 for simple tasks, 16384 for long inputs, 32768 for complex multi-turn.
HIGH: num_predict Default is 128
Severity: HIGH — truncates output
Ollama defaults num_predict to 128 tokens. Almost any useful Gemma 4 output exceeds this.
Fix: Always set num_predict explicitly. Minimum recommended: 512. For JSON output: 2048+.
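One way to make both overrides hard to forget is to build the options dict in one place. The sketch below uses the tier values and minimums recommended above; the function name build_options and the 4-characters-per-token estimate are assumptions for illustration, not a real tokenizer count.

```python
# Context tiers taken from the num_ctx guidance above.
CTX_TIERS = (4096, 8192, 16384, 32768)


def build_options(prompt: str, expect_json: bool = False) -> dict:
    """Pick explicit num_ctx / num_predict instead of relying on Ollama defaults."""
    num_predict = 2048 if expect_json else 512
    # Assumption: roughly 4 characters per token. Swap in a tokenizer if you have one.
    est_prompt_tokens = len(prompt) // 4 + 1
    # The context window has to hold the prompt plus the generated tokens.
    needed = est_prompt_tokens + num_predict
    num_ctx = next((t for t in CTX_TIERS if t >= needed), CTX_TIERS[-1])
    return {"num_ctx": num_ctx, "num_predict": num_predict}
```

The result can be passed as the "options" field of the request payload. Prompts that exceed the largest tier still get 32768 here, so very long inputs need their own handling.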
HIGH: 26B Silent-Stops at write_file Tool Boundary (reproducible)
Severity: HIGH — agent-loop failure, silent
Reproduced on 2026-04-18 against gemma4:26b via Ollama 0.20.4 on a 3090 Ti
(steel141). Agent harness exposed read_file / write_file / run_bash tools
and asked the model to fix a failing Python test.
Observed pattern (both runs identical):
- Model reads README, runs pytest (sees failures), reads the buggy source file
- Next turn: empty content, no tool calls, eval_count=4, model silently exits
- Zero writes ever emitted
Isolation: a direct one-shot call asking 26B to rewrite the same function
returned the correct fix (eval_count=81). So diagnosis and code generation are
intact — failure is at the write_file(path, content) tool-call argument
boundary, where content is a ~500-char string. Consistent with the "Weak at
Long/Nested JSON" gotcha below: a long string inside a tool-call argument is
structurally the same problem.
gemma4:31b-it-q4_K_M on the same harness completed the task cleanly
(eval_count=330 on the write turn). qwen3-coder:30b also completed.
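An agent harness can detect this pattern instead of treating the empty turn as a normal finish. A minimal sketch, assuming the non-streaming /api/chat response shape (message.content, message.tool_calls, eval_count); is_silent_stop and the max_eval threshold of 8 are hypothetical choices, picked only because the observed failing turns had eval_count=4.

```python
def is_silent_stop(reply: dict, max_eval: int = 8) -> bool:
    """True when a chat turn has no text, no tool calls, and a tiny eval_count."""
    message = reply.get("message") or {}
    has_text = bool((message.get("content") or "").strip())
    has_tools = bool(message.get("tool_calls"))
    return not has_text and not has_tools and reply.get("eval_count", 0) <= max_eval
```

When this returns True the harness can log the turn as a failure, or retry with a smaller tool surface, rather than reporting the task as complete.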
Fix:
- For 26B in an agent loop, prefer a patch/diff tool surface (apply_patch(path, old, new)) over a full-content write (write_file(path, full_content)). Delta-sized arguments are inside the model's comfort zone.
- Or use 31B for the agent and keep 26B for single-shot tasks where the full response is the output, not a tool-call argument.
- See docs/reference/bakeoff-2026-04-18.md for the full trace.
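The harness side of an apply_patch(path, old, new) tool might look like the sketch below. The signature comes from the fix above; the body is an assumption, including the exactly-one-match rule and returning errors as strings so the model can read them and retry. The bakeoff did not test this implementation.

```python
from pathlib import Path


def apply_patch(path: str, old: str, new: str) -> str:
    """Replace exactly one occurrence of `old` with `new` in the file at `path`."""
    file = Path(path)
    text = file.read_text(encoding="utf-8")
    count = text.count(old)
    if count != 1:
        # Ambiguous or missing match: report it instead of guessing.
        return f"error: expected exactly 1 match for old text, found {count}"
    file.write_text(text.replace(old, new, 1), encoding="utf-8")
    return "ok"
```

Requiring a unique match keeps the arguments short while preventing the model from silently editing the wrong spot.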
MEDIUM: Weak at Long/Nested JSON
Severity: MEDIUM — causes parse failures
Gemma 4 reliably produces short JSON (5-10 fields) but struggles with:
- Deeply nested schemas (3+ levels)
- Long arrays (20+ items)
- Mixed nesting + length
Fix: Sequential tool calls. Break one large JSON request into multiple smaller calls:
- Instead of "generate a 50-item storyboard", do "generate items 1-5", "generate items 6-10", etc.
- Due to Gemma 4's fast speed and free local use, sequential calls are cheap
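The batching idea can be sketched as a small prompt generator. The name batch_prompts, the batch size of 5, and the prompt wording are illustrative assumptions; adapt the text to your own task.

```python
def batch_prompts(total: int, batch_size: int = 5, noun: str = "storyboard items"):
    """Yield one small prompt per batch instead of one prompt for all items."""
    for start in range(1, total + 1, batch_size):
        # Clamp the final batch so it does not run past `total`.
        end = min(start + batch_size - 1, total)
        yield f"Generate {noun} {start}-{end} of {total} as a JSON array."
```

Each yielded prompt is sent as its own call and parsed separately, so one malformed batch costs a retry of a few items rather than the whole output.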
Fallback pattern (AI_Visualizer):
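import json

MAX_RETRIES, BASE_TEMP, TEMP_BUMP = 3, 0.4, 0.1

def generate_json():
    for attempt in range(MAX_RETRIES):
        temp = BASE_TEMP + attempt * TEMP_BUMP  # 0.4 -> 0.5 -> 0.6
        response = call_gemma(temperature=temp)
        try:
            return parse_json(response)
        except json.JSONDecodeError:
            continue  # retry at a higher temperature
    raise ValueError("no parseable JSON after MAX_RETRIES attempts")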
MEDIUM: Identity Confusion
Severity: MEDIUM — cosmetic but confusing
Gemma 4 is ultra-compliant and highly capable but does not know who it is. It may:
- Claim to be a different model
- Hallucinate capabilities it doesn't have
- Respond as a generic "AI assistant" without personality
Fix: Explicit identity in system prompt:
You are [Name], a [role]. You are powered by Gemma 4.
You ONLY do [X]. You NEVER do [Y].
Gemma 4 does NOT need hand-holding on task execution — it's very capable. It needs explicit instructions about identity and boundaries.
MEDIUM: Flash Attention Hang on 31B Dense (>3-4K tokens)
Severity: MEDIUM — hardware-specific, affects RTX 3090
Community-reported: Flash Attention causes Gemma 4 31B Dense to hang indefinitely during prompt evaluation when the prompt exceeds ~3-4K tokens. The 26B MoE variant handles the same prompts fine — bug is specific to the Dense model.
Source: ollama/ollama#15350
Fix: Use 26B for long prompts, or disable Flash Attention if running 31B on affected hardware.
MEDIUM: Tool Calling Broken in Ollama v0.20.0 Streaming
Severity: MEDIUM — version-specific
As of early April 2026, Gemma 4 tool calling has issues in Ollama v0.20.0: the tool call parser fails and streaming drops tool calls entirely. Community reports include format mismatches and continuous loops in llama.cpp / LM Studio.
Source: community reports
Fix: Use non-streaming for tool calls (Simon does this). Test tool calling thoroughly when upgrading Ollama versions. Seth's implementations work reliably with non-streaming tool calls.
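A minimal sketch of a non-streaming tool-call request body for /api/chat. build_tool_chat_payload is a hypothetical helper, and READ_FILE_TOOL is an assumed schema written in the function-definition shape Ollama's chat API accepts; it is not copied from Simon.

```python
def build_tool_chat_payload(model: str, messages: list, tools: list) -> dict:
    """Chat payload for tool use with streaming and thinking explicitly off."""
    return {
        "model": model,
        "messages": messages,
        "tools": tools,
        "stream": False,  # streamed responses were reported to drop tool calls
        "think": False,
    }


# Hypothetical tool definition, for illustration only.
READ_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a UTF-8 text file and return its contents.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}
```

With streaming off, the tool calls arrive in a single response under message.tool_calls, which also makes a regression after an Ollama upgrade easier to spot in a test.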
MEDIUM: VRAM-Hungry for Context
Severity: MEDIUM — affects hardware planning
Gemma 4 KV cache is large relative to competitors. Community reports: 31B at 262K context requires ~22GB just for KV cache on top of model weights. One user could only fit Gemma 3 27B Q4 with 20K context on a 5090, while Qwen 3.5 27B Q4 fit with 190K context on the same card.
Implication: Don't set num_ctx higher than you need. 32K is plenty for most tasks and keeps VRAM reasonable.
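To see why num_ctx matters so much, here is the generic KV cache estimate for full attention on every layer. It is an upper bound for models that use sliding-window or shared-KV layers, and the architecture numbers in the loop are placeholders for illustration, NOT Gemma 4's actual configuration.

```python
def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Upper-bound KV cache size in GiB, assuming full attention on every layer."""
    # Factor of 2 covers keys and values; 2 bytes per element assumes fp16.
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem
    return total_bytes / 2**30


# Placeholder architecture numbers, not taken from any Gemma 4 model card.
for n_ctx in (8192, 32768, 131072):
    size = kv_cache_gib(n_ctx, n_layers=48, n_kv_heads=8, head_dim=128)
    print(f"num_ctx={n_ctx}: {size:.1f} GiB")
```

The cache grows linearly with num_ctx, so going from 8K to 32K quadruples it. Plug in the real layer count, KV head count, and head dimension from the model config to get a usable estimate.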
MEDIUM: Safety Overfiltering
Severity: MEDIUM — blocks benign prompts
Strict safety alignment occasionally blocks technical, academic, or creative prompts that superficially resemble restricted categories. Conversely, one user reported that basic system prompts were enough to jailbreak the model.
Fix: Rephrase blocked prompts to avoid trigger patterns. For system prompts, avoid language that sounds like you're asking the model to bypass restrictions — just state the task directly.
MEDIUM: KV Cache Config Bug (31B/26B ship with num_kv_shared_layers=0)
Severity: MEDIUM — crashes on first attention forward pass
The 31B and 26B ship with num_kv_shared_layers = 0, which causes layer_types[:-0] to collapse to zero layer slots. Crashes on first forward pass.
Fix: Patch the config. Check model card discussions for the exact fix.
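The failure mode is a plain Python slicing pitfall: -0 equals 0, so [:-0] is [:0], which is empty. The sketch below reproduces it with a made-up layer_types list and shows one possible guard. The helper non_shared_layers is hypothetical and is not the upstream patch.

```python
# Made-up layer list for illustration; real configs are longer.
layer_types = ["sliding", "sliding", "full", "sliding", "sliding", "full"]


def non_shared_layers(layer_types, num_kv_shared_layers):
    """Drop the last N layers, handling N == 0 explicitly."""
    if num_kv_shared_layers == 0:
        return list(layer_types)
    return layer_types[:-num_kv_shared_layers]


# The naive slice: with N == 0 every layer slot disappears.
assert layer_types[:-0] == []
# The guarded version keeps all layers when nothing is shared.
assert non_shared_layers(layer_types, 0) == layer_types
assert non_shared_layers(layer_types, 2) == layer_types[:4]
```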
LOW: vLLM Triton Fallback (~9 tok/s on RTX 4090)
Severity: LOW — vLLM-specific
Heterogeneous attention head dimensions in Gemma 4 force vLLM to fall back to a slow Triton kernel. RTX 4090 gets ~9 tok/s instead of expected ~100+.
Source: vllm-project/vllm#38887
Fix: Use Ollama instead of vLLM for now, or wait for the fix.
LOW: <unused> Token Infinite Loop (Vulkan backends)
Severity: LOW — Vulkan-specific
Gemma 4 can generate <unused> or <unused24> tokens in an infinite loop on Vulkan backends in llama.cpp.
Source: ggml-org/llama.cpp#21516
MEDIUM: google/gemma_pytorch Abandoned for Gemma 4
Severity: MEDIUM — wastes time on a dead-end path
The google/gemma_pytorch repo (last push 2025-05-30) has zero Gemma 4 support —
its variants validator only accepts Gemma 1/2/3 IDs. Anyone pointing at it as "the
official PyTorch reference" for Gemma 4 is wrong.
Use instead:
- Inference: huggingface/transformers (AutoModelForMultimodalLM, v5.5.4+)
- Reference impl: google-deepmind/gemma (JAX/Flax)
- Serving: Ollama / vLLM / llama.cpp
See tooling/google-official/gemma-pytorch/README.md for the original repo state.
LOW: Fine-Tuning Ecosystem Issues
Severity: LOW — only relevant if fine-tuning
Day-one issues for fine-tuners:
- HuggingFace Transformers didn't recognize gemma4 architecture (required install from source)
- PEFT couldn't handle Gemma4ClippableLinear (new vision encoder layer type)
- New mm_token_type_ids field required during training even for text-only data
- E2B/E4B show training loss of 13-15, which is normal for multimodal models (not a bug)
- Flash Attention 2/4 incompatible: Gemma 4's global-attention head_dim is 512; FA2 max is 256, FA4 max is 128. Training backends fall back to SDP or Flex Attention (Axolotl hard-codes sdp_attention: true for Gemma 4). Does not affect inference runtimes that already use SDP (Ollama, vLLM).
- Fused LoRA kernels broken (shared-KV layers). Axolotl disables lora_mlp_kernel/qkv_kernel/o_kernel for Gemma 4; Unsloth routes around it.
- 26B A4B MoE wants ≥8-bit LoRA, not 4-bit QLoRA: MoE expert quality degrades at 4-bit during training. Axolotl's ScatterMoE + expert-LoRA config is the only validated 4-bit MoE path. (This caveat is training-only; Q4_K_M inference is fine.)
- New tool-call / channel tokens are learned embeddings: if fine-tuning, set modules_to_save=["lm_head","embed_tokens"] + ensure_weight_tying=True in LoraConfig, or the adapter trains against frozen random vectors for them.
See tooling/fine-tuning/recipe-recommendation.md for the full training path.
LOW: Vision Validator Overrejects
Severity: LOW — specific to evaluative vision tasks
In AI_Visualizer, Gemma 4 vision was used to critique SDXL frames. It flagged images for motif-matching failures that humans rated as equal or better than passed images. The validator was queued for disable.
Pattern: Gemma 4 vision is good at description but unreliable for subjective quality scoring. Use it for "what's in this image?" not "is this image good?"
LOW: Keep-Alive Too Short
Severity: LOW — performance only
Default keep_alive is 5 minutes. If your pipeline has gaps (e.g., waiting for SDXL generation), the model gets unloaded and reloaded (~10-30s penalty).
Fix: Set keep_alive to match your pipeline duration:
{ "keep_alive": "4h" }
Or pin/unpin explicitly:
client.generate(model="gemma4:26b", prompt="", keep_alive=-1, options={"num_predict": 0}) # pin
# ... do work ...
client.generate(model="gemma4:26b", prompt="", keep_alive=0, options={"num_predict": 0}) # unpin
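The pin/unpin pair can be wrapped in a context manager so the unpin still runs if the pipeline raises. A sketch, assuming client is any object with the generate(...) call used above; the name pinned is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def pinned(client, model: str):
    """Keep `model` loaded for the duration of the block, then unload it."""
    # Empty prompt with num_predict=0 loads the model without generating text.
    client.generate(model=model, prompt="", keep_alive=-1, options={"num_predict": 0})
    try:
        yield
    finally:
        # keep_alive=0 asks Ollama to unload right away, even if the block raised.
        client.generate(model=model, prompt="", keep_alive=0, options={"num_predict": 0})
```

Usage is `with pinned(client, "gemma4:26b"): ...` around the whole pipeline run, including the slow SDXL steps between model calls.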