# Gemma 4 Gotchas & Known Issues

> Derived from Seth's production implementations (Simon, AI_Visualizer)
> and community reports. These are hard-won lessons.

## CRITICAL: Thinking Mode Eats Context (single-turn pipelines only)

**Severity: HIGH - causes silent failures in single-turn `/api/generate` workloads**

> **Scope update (2026-04-18):** This guidance applies to **single-turn JSON
> generation pipelines** (the AI_Visualizer shape: one call → one structured
> response). For **multi-turn tool-calling agents**, the opposite is true on
> `gemma4:26b`; see § "`think: false` Kills Gemma 4 26B in Multi-Turn
> Tool-Calling Loops" below. Don't copy this fix to an agent harness without
> testing.

Gemma 4 in Ollama 0.20+ defaults to `think: true`. When enabled in a single-turn JSON pipeline:

- Thinking tokens go into a hidden `thinking` field, NOT `response`
- If `num_predict` is limited, thinking consumes the entire budget
- `response` comes back **empty**: no error, just silence
- On evaluative tasks, thinking inflates scores (31B scored a known-bad image 9/10 with thinking vs 7/10 without)

**Fix (for single-turn pipelines):** Always pass `think: false` in the Ollama payload.

```json
{
  "model": "gemma4:26b",
  "think": false,
  "options": { "num_predict": 4096 }
}
```

**Do not blindly carry this to multi-turn tool-calling agents.** Verified 2026-04-18: it silent-stops 26B specifically in that context.

## CRITICAL: format=json Causes Infinite Loops

**Severity: HIGH - hangs indefinitely**

Ollama's server-side `format: "json"` enforcer causes Gemma 4 26B (Q4) to enter an infinite retry loop when the requested schema is deeply nested.

**Fix:** Never use `format: "json"`. Instead:

1. Request JSON structure in the prompt text
2. Parse client-side with regex + `json.loads` + json5 fallback

```python
import json

# DO THIS: send no `format` argument, then parse the text yourself
response = client.generate(model="gemma4:26b", prompt=prompt)
body = response["response"]
# Slice from the first "{" to the last "}" to drop any surrounding prose
obj = json.loads(body[body.find("{"):body.rfind("}") + 1])

# NOT THIS
response = client.generate(model="gemma4:26b", prompt=prompt, format="json")  # HANGS
```

## CRITICAL: Ollama Default Context is 2048

**Severity: HIGH - causes truncation**

Ollama defaults `num_ctx` to 2048 tokens. Gemma 4 supports 128K. If you don't override it, your prompts get silently truncated.

**Fix:** Always set `num_ctx` explicitly:

```json
{ "options": { "num_ctx": 8192 } }
```

Scale to your needs: 4096 for simple tasks, 16384 for long inputs, 32768 for complex multi-turn.

## HIGH: num_predict Default is 128

**Severity: HIGH - truncates output**

Ollama defaults `num_predict` to 128 tokens. Almost any useful Gemma 4 output exceeds this.

**Fix:** Always set `num_predict` explicitly. Minimum recommended: 512. For JSON output: 2048+.

## HIGH: `think: false` Kills Gemma 4 26B in Multi-Turn Tool-Calling Loops

**Severity: HIGH - silent agent-loop failure. The setting that triggers it is exactly what the older guidance recommended.**

Reproduced on 2026-04-18 against `gemma4:26b` via Ollama 0.20.4 on a 3090 Ti (steel141). Contradicts the older "always think:false" guidance (see § "Thinking Mode Eats Context" above, now scoped to single-turn pipelines only).

### The observation

At identical message state with all else equal:

| `think` setting | `eval_count` on decision turn | Agent behavior |
|---|---|---|
| `false` | **4** (silent stop, no content, no tool_calls) | Fails: zero edits emitted |
| unset (Ollama default) | 165 | Passes: emits correct edit |
| `true` | 165 | Passes: emits correct edit |

26B passes the task in 8 iterations / 12-20s on the same harness the moment the `think` key is removed from the Ollama payload. `write_file` vs `apply_patch` doesn't matter. Tool-response size doesn't matter.
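The table above comes down to one rule for request construction: agent loops send no `think` key at all, while single-turn pipelines send `think: false`. A minimal sketch of that rule follows. The helper name `apply_think_policy` and its `multi_turn_agent` flag are hypothetical, not part of Ollama or of the harness described here.

```python
from typing import Any, Dict


def apply_think_policy(payload: Dict[str, Any], *, multi_turn_agent: bool) -> Dict[str, Any]:
    """Return a copy of an Ollama request body with the `think` key adjusted.

    Hypothetical helper: agent loops get no `think` key (the "unset" row),
    single-turn pipelines get `think: false`.
    """
    out = dict(payload)  # shallow copy so the caller's dict is left untouched
    if multi_turn_agent:
        out.pop("think", None)  # remove the key entirely, do not set it to False
    else:
        out["think"] = False
    return out


base = {"model": "gemma4:26b", "think": False, "options": {"num_ctx": 8192}}
agent_payload = apply_think_policy(base, multi_turn_agent=True)
single_payload = apply_think_policy(base, multi_turn_agent=False)
```

Removing the key, rather than setting it to `true`, matches the "unset" row that was actually tested; both passing rows behaved the same in this run.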
### What I initially got wrong

The 2026-04-18 bakeoff went through two wrong hypotheses before Seth asked "was this with think=false?" The failed-and-corrected path:

1. **"Long `write_file` argument breaks 26B"**: wrong. `apply_patch` also failed.
2. **"Large tool-response context breaks 26B"**: wrong. Truncation *did* make 26B pass (800/1200-char caps), but only because the shorter context happened to dodge the `think: false` side effect at the decision turn.
3. **Actual cause:** `think: false` alters the decoding path in a way that makes the 26B MoE (3.8B active params, 8-of-128 expert routing) emit near-immediate EOS at tool-decision turns. 31B Dense and Qwen3-Coder are robust to the flag; 26B specifically is not.

See `docs/reference/bakeoff-2026-04-18.md` § "Round 3" for full traces and the diagnostic that isolated the flag.

### Fix

- **For multi-turn tool-calling agents, do NOT set `think: false`.** Leave it unset (Ollama default) or `true`.
- **If your agent accumulates `thinking` field content**, prune old thinking blobs from message history to control context growth.
- **For single-turn JSON pipelines** (the AI_Visualizer shape), the original "always think:false" guidance still applies; see § "Thinking Mode Eats Context" above.
- 31B Dense and Qwen3-Coder work fine either way; this gotcha is 26B-specific on this Ollama version.

## MEDIUM: Weak at Long/Nested JSON

**Severity: MEDIUM - causes parse failures**

Gemma 4 reliably produces short JSON (5-10 fields) but struggles with:

- Deeply nested schemas (3+ levels)
- Long arrays (20+ items)
- Mixed nesting + length

**Fix:** Sequential tool calls. Break one large JSON request into multiple smaller calls:

- Instead of "generate a 50-item storyboard", do "generate items 1-5", "generate items 6-10", etc.
- Gemma 4 is fast and free to run locally, so sequential calls are cheap

**Fallback pattern (AI_Visualizer):**

```python
from json import JSONDecodeError

MAX_RETRIES, BASE_TEMP, TEMP_BUMP = 3, 0.4, 0.1

def generate_with_retries():
    # call_gemma and parse_json are the pipeline's own helpers (not shown here)
    for attempt in range(MAX_RETRIES):
        temp = BASE_TEMP + attempt * TEMP_BUMP  # 0.4 -> 0.5 -> 0.6
        response = call_gemma(temperature=temp)
        try:
            return parse_json(response)
        except JSONDecodeError:
            continue  # retry at a slightly higher temperature
    raise RuntimeError("no parseable JSON after retries")
```

## MEDIUM: Identity Confusion

**Severity: MEDIUM - cosmetic but confusing**

Gemma 4 is ultra-compliant and highly capable but does not know who it is. It may:

- Claim to be a different model
- Hallucinate capabilities it doesn't have
- Respond as a generic "AI assistant" without personality

**Fix:** Explicit identity in system prompt:

```
You are [Name], a [role]. You are powered by Gemma 4.
You ONLY do [X]. You NEVER do [Y].
```

Gemma 4 does NOT need hand-holding on task execution; it's very capable. It needs explicit instructions about identity and boundaries.

## MEDIUM: Flash Attention Hang on 31B Dense (>3-4K tokens)

**Severity: MEDIUM - hardware-specific, affects RTX 3090**

Community-reported: Flash Attention causes Gemma 4 31B Dense to hang indefinitely during prompt evaluation when the prompt exceeds ~3-4K tokens. The 26B MoE variant handles the same prompts fine; the bug is specific to the Dense model.

**Source:** [ollama/ollama#15350](https://github.com/ollama/ollama/issues/15350)

**Fix:** Use 26B for long prompts, or disable Flash Attention if running 31B on affected hardware.

## MEDIUM: Tool Calling Broken in Ollama v0.20.0 Streaming

**Severity: MEDIUM - version-specific**

As of early April 2026, Gemma 4 tool calling has issues in Ollama v0.20.0: the tool call parser fails and streaming drops tool calls entirely. Community reports include format mismatches and continuous loops in llama.cpp / LM Studio.

**Source:** [community reports](https://dev.to/dentity007/-gemma-4-after-24-hours-what-the-community-found-vs-what-google-promised-3a2f)

**Fix:** Use non-streaming for tool calls (Simon does this).
Test tool calling thoroughly when upgrading Ollama versions. Seth's implementations work reliably with non-streaming tool calls.

## MEDIUM: VRAM-Hungry for Context

**Severity: MEDIUM - affects hardware planning**

Gemma 4 KV cache is large relative to competitors. Community reports: 31B at 262K context requires ~22GB just for KV cache on top of model weights. One user could only fit Gemma 3 27B Q4 with 20K context on a 5090, while Qwen 3.5 27B Q4 fit with 190K context on the same card.

**Implication:** Don't set `num_ctx` higher than you need. 32K is plenty for most tasks and keeps VRAM reasonable.

## MEDIUM: Safety Overfiltering

**Severity: MEDIUM - blocks benign prompts**

Strict safety alignment occasionally blocks technical, academic, or creative prompts that superficially resemble restricted categories. One user reported jailbreaks with basic system prompts.

**Fix:** Rephrase blocked prompts to avoid trigger patterns. For system prompts, avoid language that sounds like you're asking the model to bypass restrictions; just state the task directly.

## MEDIUM: KV Cache Config Bug (31B/26B ship with num_kv_shared_layers=0)

**Severity: MEDIUM - crashes on first attention forward pass**

The 31B and 26B ship with `num_kv_shared_layers = 0`, which causes `layer_types[:-0]` to collapse to zero layer slots. Crashes on first forward pass.

**Fix:** Patch the config. Check model card discussions for the exact fix.

## LOW: vLLM Triton Fallback (~9 tok/s on RTX 4090)

**Severity: LOW - vLLM-specific**

Heterogeneous attention head dimensions in Gemma 4 force vLLM to fall back to a slow Triton kernel. RTX 4090 gets ~9 tok/s instead of the expected ~100+.

**Source:** [vllm-project/vllm#38887](https://github.com/vllm-project/vllm/issues/38887)

**Fix:** Use Ollama instead of vLLM for now, or wait for the fix.

## LOW: `` Token Infinite Loop (Vulkan backends)

**Severity: LOW - Vulkan-specific**

Gemma 4 can generate `` or `` tokens in an infinite loop on Vulkan backends in llama.cpp.

**Source:** [ggml-org/llama.cpp#21516](https://github.com/ggml-org/llama.cpp/issues/21516)

## MEDIUM: `google/gemma_pytorch` Abandoned for Gemma 4

**Severity: MEDIUM - wastes time on a dead-end path**

The `google/gemma_pytorch` repo (last push 2025-05-30) has zero Gemma 4 support: its variants validator only accepts Gemma 1/2/3 IDs. Anyone pointing at it as "the official PyTorch reference" for Gemma 4 is wrong.

**Use instead:**

- **Inference:** `huggingface/transformers` (`AutoModelForMultimodalLM`, v5.5.4+)
- **Reference impl:** `google-deepmind/gemma` (JAX/Flax)
- **Serving:** Ollama / vLLM / llama.cpp

See `tooling/google-official/gemma-pytorch/README.md` for the original repo state.

## LOW: Fine-Tuning Ecosystem Issues

**Severity: LOW - only relevant if fine-tuning**

Day-one issues for fine-tuners:

- HuggingFace Transformers didn't recognize the gemma4 architecture (required install from source)
- PEFT couldn't handle `Gemma4ClippableLinear` (new vision encoder layer type)
- New `mm_token_type_ids` field required during training even for text-only data
- E2B/E4B show training loss of 13-15, which is normal for multimodal models (not a bug)
- **Flash Attention 2/4 incompatible:** Gemma 4's global-attention head_dim is 512; FA2 max is 256, FA4 max is 128. Training backends fall back to SDP or Flex Attention (Axolotl hard-codes `sdp_attention: true` for Gemma 4). Does not affect inference runtimes that already use SDP (Ollama, vLLM).
- **Fused LoRA kernels broken** (shared-KV layers). Axolotl disables `lora_mlp_kernel` / `qkv_kernel` / `o_kernel` for Gemma 4; Unsloth routes around it.
- **26B A4B MoE wants ≥8-bit LoRA**, not 4-bit QLoRA; MoE expert quality degrades at 4-bit during training. Axolotl's ScatterMoE + expert-LoRA config is the only validated 4-bit MoE path. (This caveat is **training-only**; Q4_K_M inference is fine.)
- **New tool-call / channel tokens are learned embeddings.** If fine-tuning, set `modules_to_save=["lm_head","embed_tokens"]` + `ensure_weight_tying=True` in `LoraConfig`, or the adapter trains against frozen random vectors for them.

See `tooling/fine-tuning/recipe-recommendation.md` for the full training path.

## LOW: Vision Validator Overrejects

**Severity: LOW - specific to evaluative vision tasks**

In AI_Visualizer, Gemma 4 vision was used to critique SDXL frames. It flagged images for motif-matching failures that humans rated as equal to or better than passed images. The validator was queued to be disabled.

**Pattern:** Gemma 4 vision is good at description but unreliable for subjective quality scoring. Use it for "what's in this image?" not "is this image good?"

## LOW: Keep-Alive Too Short

**Severity: LOW - performance only**

Default `keep_alive` is 5 minutes. If your pipeline has gaps (e.g., waiting for SDXL generation), the model gets unloaded and reloaded (~10-30s penalty).

**Fix:** Set `keep_alive` to match your pipeline duration:

```json
{ "keep_alive": "4h" }
```

Or pin/unpin explicitly:

```python
# pin: load the model and keep it resident indefinitely
client.generate(model="gemma4:26b", prompt="", keep_alive=-1, options={"num_predict": 0})
# ... do work ...
# unpin: release the model immediately
client.generate(model="gemma4:26b", prompt="", keep_alive=0, options={"num_predict": 0})
```
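One risk with explicit pin/unpin is that an exception mid-pipeline skips the unpin call and leaves the model resident. A minimal sketch of a guard, wrapping the same two calls in a context manager: `pinned_model` is a hypothetical name, and `client` is assumed to be any object whose `generate` accepts the keyword arguments used in the pin/unpin example.

```python
from contextlib import contextmanager


@contextmanager
def pinned_model(client, model):
    """Keep `model` loaded for the duration of a `with` block (hypothetical helper)."""
    # Pin: same call as the explicit pin example, keep_alive=-1
    client.generate(model=model, prompt="", keep_alive=-1, options={"num_predict": 0})
    try:
        yield
    finally:
        # Unpin always runs, including when the body of the `with` block raises
        client.generate(model=model, prompt="", keep_alive=0, options={"num_predict": 0})
```

Usage would look like `with pinned_model(client, "gemma4:26b"): run_pipeline()`, where `run_pipeline` stands in for your own work.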