Seth asked "was this with think=false?" Yes, and that was the only question that mattered. Everything I concluded in round 1 and round 2 was wrong.
Actual cause, isolated in round 3:
- At identical message state, gemma4:26b with think=false returns eval=4 (silent stop); with think unset or think=true, it returns eval=165 and emits the correct tool call.
- Original round-1 write_file harness + think unset: 26B passes in 8 iters, 20s. No mitigations needed.
- 31B dense and qwen3-coder:30b tolerate think=false; 26B MoE does not.
Red herrings (kept on-record in the bakeoff doc, not silently erased):
- Round 1: "write_file tool-call argument size" was wrong.
- Round 2a: refuted the arg-size theory but for the wrong reason (still failed because think=false was still set).
- Round 2b: "cumulative tool-response context size". Truncating did make 26B pass, but by coincidence: shorter context at the decision turn dodged the think=false side effect.
Why the existing "always think:false" guidance was misleading: it was derived from AI_Visualizer (single-turn JSON pipelines), where thinking tokens do eat num_predict invisibly. In multi-turn tool-calling agents the channels are separate and the flag has a different effect, which is catastrophic on 26B specifically.
Doc updates:
- GOTCHAS: replaced the 26B entry with the actual cause; scoped the original "Thinking Mode Eats Context" entry to single-turn pipelines
- SYNTHESIS: split the "Mandatory Ollama Settings" block into single-turn vs multi-turn variants; updated anti-patterns and quick-start checklist
- CORPUS_cli_coding_agent.md: revised pointer and config template
- docs/reference/bakeoff-2026-04-18.md: added Round 3 section with the correction notice at the top of the file and full diagnostic methodology
New artifacts: harness_no_think_flag.py, harness_write_no_think.py, and 4 new log files demonstrating all three models pass when think is left at default.
Gemma 4 Synthesis — How to Build With It
Opinionated guide based on two production implementations and ongoing use. Seth Freiberg, 2026-04-12
The One-Paragraph Summary
Gemma 4 is an ultra-compliant, highly capable model that doesn't know who it is. It doesn't need hand-holding on tasks, but it does need explicit instructions in the system prompt about identity, boundaries, and output format. It needs num_predict increased (Ollama defaults are absurdly low), think set to false in single-turn pipelines (thinking eats the context budget) but left unset in multi-turn tool-calling agents (think: false silent-stops gemma4:26b), and format: json avoided entirely (causes infinite loops). Because local inference is fast and free, sequential tool calls are the ideal solution for tasks that would otherwise require long structured output.
For canonical upstream source (model cards, chat templates, serving commands, fine-tuning recipes, specialized siblings like EmbeddingGemma/ShieldGemma), see tooling/README.md. That directory is 147 files / 14 MB of first-party material pulled from Google / Hugging Face / framework maintainers. This SYNTHESIS is the opinionated digest; tooling/ is the receipts.
Mental Model
Think of Gemma 4 as a very competent employee on their first day. They can do the work — you don't need to explain how. But you DO need to explain:
- Who they are and what their job is
- What they should and should NOT do
- Exactly what format you want the deliverable in
- The boundaries of their role
Get those right and Gemma 4 just works. Get them wrong and you get a generic chatbot.
Mandatory Ollama Settings
For single-turn pipelines (AI_Visualizer shape)
Every Gemma 4 call MUST include:
{
"think": false,
"options": {
"num_ctx": 4096,
"num_predict": 2048
}
}
Why each one:
- Set think: false because Ollama 0.20+ defaults to think:true. In single-turn JSON pipelines, thinking tokens consume num_predict budget invisibly, returning empty responses.
- Set num_ctx to 4096 or more because Ollama defaults to 2048. Your system prompt alone might exceed that.
- Set num_predict to 2048 or more because Ollama defaults to 128. Any structured output gets truncated.
Scale these to your task. The values above are safe minimums, not recommendations.
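The single-turn settings above can be wrapped in a small builder so that no call site forgets them. This is a minimal sketch, not the pipeline's actual code: the function name `build_single_turn_payload` is hypothetical, and rejecting values below the stated minimums is a design assumption rather than anything Ollama enforces. The dict is shaped for the body of Ollama's `/api/generate` endpoint.

```python
from typing import Any


def build_single_turn_payload(
    model: str,
    system: str,
    prompt: str,
    num_ctx: int = 4096,
    num_predict: int = 2048,
) -> dict[str, Any]:
    """Build a request body for a single-turn /api/generate call.

    Hypothetical helper: it refuses values below the safe minimums
    named in this guide so a misconfigured call fails loudly.
    """
    if num_ctx < 4096:
        raise ValueError("num_ctx below the 4096 safe minimum")
    if num_predict < 2048:
        raise ValueError("num_predict below the 2048 safe minimum")
    return {
        "model": model,
        "system": system,
        "prompt": prompt,
        "stream": False,
        # Single-turn only: thinking tokens would eat num_predict.
        "think": False,
        "options": {"num_ctx": num_ctx, "num_predict": num_predict},
    }
```

Raising on undersized values is one option; clamping up to the minimum would also work if silent correction is preferred.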
For multi-turn tool-calling agents (Simon / CLI-coding-agent shape)
Do NOT set think: false. Leave it unset (Ollama default) or true.
{
"options": {
"num_ctx": 32768,
"num_predict": 4096
}
}
Verified 2026-04-18 that think: false silently breaks gemma4:26b in multi-turn
tool-calling loops — model silent-stops with eval_count=4 at tool-decision turns.
31B Dense and Qwen3-Coder tolerate the flag; 26B MoE does not. See GOTCHAS.md
§ "think: false Kills Gemma 4 26B in Multi-Turn Tool-Calling Loops" and
docs/reference/bakeoff-2026-04-18.md § "Round 3".
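For the multi-turn case, two things may be worth encoding: a payload that never carries a `think` key, and a check for the silent-stop signature. The sketch below assumes the `/api/chat` response shape (`message.tool_calls`, `eval_count`). Both function names and the threshold of 8 tokens are hypothetical; the only observed data point in the bakeoff is `eval_count=4`.

```python
from typing import Any

# Assumption: anything at or below this many generated tokens with no
# tool call is treated as suspicious. The observed failure was 4.
SILENT_STOP_MAX_EVAL = 8


def build_agent_payload(
    model: str,
    messages: list[dict[str, Any]],
    tools: list[dict[str, Any]],
    num_ctx: int = 32768,
    num_predict: int = 4096,
) -> dict[str, Any]:
    """Build an /api/chat body that deliberately omits the think key."""
    return {
        "model": model,
        "messages": messages,
        "tools": tools,
        "stream": False,
        "options": {"num_ctx": num_ctx, "num_predict": num_predict},
    }


def is_silent_stop(
    response: dict[str, Any], max_eval: int = SILENT_STOP_MAX_EVAL
) -> bool:
    """Heuristic: no tool call and only a handful of generated tokens."""
    message = response.get("message") or {}
    has_tool_calls = bool(message.get("tool_calls"))
    eval_count = response.get("eval_count", 0)
    return not has_tool_calls and eval_count <= max_eval
```

A very short legitimate final answer would also trip this heuristic, so it is better suited to logging a warning at tool-decision turns than to aborting the loop.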
System Prompt Template
You are [NAME], a [ROLE DESCRIPTION].
## What You Do
- [Explicit list of responsibilities]
- [Tools you have access to and when to use each one]
## What You Do NOT Do
- [Explicit list of things to refuse or avoid]
- [Common mistakes to prevent]
## Output Format
[Exact schema, field names, example if complex]
Respond with ONLY [format]. No prose outside the [format].
## Rules
- [Behavioral constraints]
- [Multi-step chaining instructions if using tools]
Today's date: [DATE]
Key principles:
- Identity first — who is this agent?
- Positive instructions before negative (what TO do before what NOT to do)
- Output format is explicit and complete — Gemma 4 follows schemas faithfully
- "No prose outside the JSON" prevents wrapper text that breaks parsing
- Date injection helps with temporal reasoning
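The template can be filled programmatically so that identity, format, and date are never omitted. This is an illustrative sketch: `render_system_prompt` and its parameter names are hypothetical, and the section order simply mirrors the template above.

```python
from datetime import date
from typing import Optional


def _bullets(items: list[str]) -> str:
    """Render a list of strings as markdown bullets."""
    return "\n".join(f"- {item}" for item in items)


def render_system_prompt(
    name: str,
    role: str,
    does: list[str],
    does_not: list[str],
    output_schema: str,
    format_name: str,
    rules: list[str],
    today: Optional[date] = None,
) -> str:
    """Fill the system prompt template: identity first, date last."""
    today = today or date.today()
    return "\n".join(
        [
            f"You are {name}, a {role}.",
            "## What You Do",
            _bullets(does),
            "## What You Do NOT Do",
            _bullets(does_not),
            "## Output Format",
            output_schema,
            f"Respond with ONLY {format_name}. "
            f"No prose outside the {format_name}.",
            "## Rules",
            _bullets(rules),
            f"Today's date: {today.isoformat()}",
        ]
    )
```

Passing `today` explicitly keeps the output reproducible in tests; leaving it unset injects the current date at call time.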
Tool Calling Strategy
Gemma 4 is reliable for tool calling but weak at structuring long JSONs.
When to use tool calling (Ollama native)
- Multi-turn agents with 2-10 tools
- Sequential reasoning chains (lookup A -> use A to decide B -> lookup B)
- Any task where the model needs to gather information before responding
When to use prompt-based JSON instead
- Single-turn generation with known output structure
- When you need specific JSON schema control
- When the output is a payload (prompts, configs) not a conversation
The Sequential Pattern
Instead of asking Gemma 4 to produce one massive JSON:
BAD: "Generate a 50-scene storyboard as JSON" -> truncated/malformed
GOOD: "Generate scenes 1-5 as JSON" x10 -> reliable every time
Gemma 4's inference speed makes sequential calls cheap. A 10-call chain at ~134 tok/s on a 3090 Ti costs seconds, not minutes. This is the fundamental advantage of local models — latency is predictable and network-free.
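The sequential pattern comes down to splitting one large request into fixed-size ranges and issuing one prompt per range. A minimal sketch follows; the helper names `scene_batches` and `batch_prompts` are hypothetical, and the prompt wording is taken from the GOOD example above.

```python
def scene_batches(total: int, batch_size: int) -> list[tuple[int, int]]:
    """Return inclusive (first, last) ranges covering 1..total."""
    if total < 1 or batch_size < 1:
        raise ValueError("total and batch_size must be positive")
    return [
        (start, min(start + batch_size - 1, total))
        for start in range(1, total + 1, batch_size)
    ]


def batch_prompts(total: int, batch_size: int) -> list[str]:
    """Build one prompt per batch instead of one prompt for everything."""
    return [
        f"Generate scenes {first}-{last} as JSON"
        for first, last in scene_batches(total, batch_size)
    ]
```

For the 50-scene storyboard, `batch_prompts(50, 5)` yields ten prompts, the first covering scenes 1-5 and the last covering scenes 46-50.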
JSON Extraction Pattern
Since format: "json" is broken, always extract client-side:
# Python
import json
raw = response["response"]
start = raw.find("{")
end = raw.rfind("}")
obj = None  # stays None when no JSON object is found
if start >= 0 and end > start:
    obj = json.loads(raw[start:end + 1])
// JavaScript
const raw = response.message.content;
const match = raw.match(/\{[\s\S]*\}/);
const obj = match ? JSON.parse(match[0]) : null;
For arrays, find [ and ] instead. Add json5 fallback for trailing commas.
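The object and array cases, plus a trailing-comma fallback, can be folded into one function. The sketch below uses a regex as a stdlib stand-in for the json5 fallback mentioned above, which is an assumption on my part. The regex is cruder than json5: it would also alter a string value that happens to contain `,}` or `,]`. The name `extract_json` is hypothetical.

```python
import json
import re
from typing import Any

# Matches a comma followed only by whitespace and a closing bracket.
_TRAILING_COMMA = re.compile(r",\s*([}\]])")


def extract_json(raw: str, kind: str = "object") -> Any:
    """Pull the outermost JSON object or array out of model text."""
    open_ch, close_ch = ("{", "}") if kind == "object" else ("[", "]")
    start = raw.find(open_ch)
    end = raw.rfind(close_ch)
    if start < 0 or end <= start:
        raise ValueError(f"no JSON {kind} found in response")
    candidate = raw[start : end + 1]
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        # Fallback: strip trailing commas, then parse once more.
        # A second failure propagates as JSONDecodeError.
        return json.loads(_TRAILING_COMMA.sub(r"\1", candidate))
```

Both failure modes raise a `ValueError` (`json.JSONDecodeError` is a subclass), so a caller can treat "nothing found" and "unparseable" the same way when deciding whether to retry.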
Temperature Guidelines
| Task Type | Temperature | Why |
|---|---|---|
| Evaluation / scoring | 0.2 | Consistent, reproducible judgments |
| Structured extraction | 0.3-0.4 | Faithful to schema |
| Creative generation | 0.6-0.8 | Variety without chaos |
| Conversation / chat | 0.7-1.0 | Natural feel |
Retry strategy: bump temp +0.1 per retry to escape format failures.
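The retry rule can be expressed as a loop that raises the temperature by a fixed step after each format failure. This is a sketch under stated assumptions: `call_model` and `parse` are caller-supplied callables (hypothetical names), and a failed parse is assumed to surface as a `ValueError`, which covers `json.JSONDecodeError`.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def generate_with_retries(
    call_model: Callable[[float], str],
    parse: Callable[[str], T],
    base_temperature: float,
    max_retries: int = 3,
    step: float = 0.1,
) -> T:
    """Call the model, bumping temperature by `step` on each retry."""
    last_error: Optional[Exception] = None
    for attempt in range(max_retries + 1):
        # Round to avoid float drift such as 0.30000000000000004.
        temperature = round(base_temperature + step * attempt, 2)
        try:
            return parse(call_model(temperature))
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(
        f"no parseable output after {max_retries + 1} attempts"
    ) from last_error
```

The default of three retries is an arbitrary choice for illustration; starting from 0.3, the attempts run at 0.3, 0.4, 0.5, and 0.6.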
Vision Usage
Works for: Describing image contents (objects, colors, composition, text)
Unreliable for: Subjective quality scoring, aesthetic judgment
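import base64
import ollama  # assumes the official Python client (pip install ollama)

client = ollama.Client()  # assumes a local Ollama server on the default port
with open("image.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("ascii")
response = client.generate(
    model="gemma4:26b",
    prompt="Describe this image in detail.",
    images=[b64],
    think=False,  # single-turn call, so disabling thinking is safe here
    options={"temperature": 0.2, "num_predict": 512},
)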
Vision is on ALL Gemma 4 variants (E2B, E4B, 26B, 31B). Audio is E-series only.
Context Management
Multi-turn (chat agents)
- Prune old tool results and tool-call messages
- Keep assistant's natural-language summaries
- Set num_ctx to 32768 for rich conversations
- Set a tool iteration limit (12 is proven) with streaming fallback
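The pruning rule in the first two bullets might look like the following. It is a sketch, not the agent's actual implementation: the function name `prune_history` and the `keep_recent` window are assumptions, and messages are assumed to follow the Ollama chat shape (`role`, `content`, optional `tool_calls`).

```python
from typing import Any

Message = dict[str, Any]


def prune_history(messages: list[Message], keep_recent: int = 6) -> list[Message]:
    """Drop old tool calls and tool results; keep everything else.

    The last `keep_recent` messages are kept verbatim so the model
    still sees the tool exchange it is currently working on.
    """
    cutoff = max(len(messages) - keep_recent, 0)
    # Never start the kept window on a tool result whose originating
    # assistant tool-call message would be pruned.
    while 0 < cutoff < len(messages) and messages[cutoff].get("role") == "tool":
        cutoff -= 1
    pruned: list[Message] = []
    for index, msg in enumerate(messages):
        if index >= cutoff:
            pruned.append(msg)
            continue
        is_tool_result = msg.get("role") == "tool"
        is_tool_call = msg.get("role") == "assistant" and bool(msg.get("tool_calls"))
        if not (is_tool_result or is_tool_call):
            pruned.append(msg)
    return pruned
```

System and user messages and the assistant's plain-text summaries survive regardless of age. An old assistant message that carries both text and a tool call is dropped whole in this sketch.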
Single-turn (pipeline stages)
- Calculate your prompt size and set num_ctx accordingly
- For long inputs (full track analysis), use recursive splitting at natural boundaries
- Pin model with keep_alive=-1 if pipeline has idle gaps
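Recursive splitting at natural boundaries can be sketched as below: try the coarsest separator first and fall back to finer ones only for pieces that are still too large. Character count stands in for token count, which is an assumption; the separator order and the name `recursive_split` are also illustrative.

```python
from typing import Sequence


def recursive_split(
    text: str,
    max_chars: int,
    separators: Sequence[str] = ("\n\n", "\n", ". ", " "),
) -> list[str]:
    """Split text into chunks of at most max_chars characters."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        # Greedily pack parts into chunks that fit the limit.
        chunks: list[str] = []
        current = ""
        for part in text.split(sep):
            candidate = part if not current else current + sep + part
            if len(candidate) <= max_chars:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                current = part
        if current:
            chunks.append(current)
        # Any chunk still too large is split with finer separators.
        result: list[str] = []
        for chunk in chunks:
            result.extend(recursive_split(chunk, max_chars, separators[i + 1 :]))
        return result
    # No separator applies: hard cut as a last resort.
    return [text[i : i + max_chars] for i in range(0, len(text), max_chars)]
```

Each resulting chunk can then be sent as its own single-turn call, with num_ctx sized for the largest chunk plus the system prompt.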
Model Selection
| Use Case | Recommended | Why |
|---|---|---|
| Production pipeline (needs GPU coexistence) | gemma4:26b | MoE (3.8B active), fast, good quality/VRAM balance |
| On-device / edge | gemma4:e4b-it-q8_0 | 12GB VRAM, vision+audio (audio via llama.cpp only) |
| Maximum quality (single-model GPU) | gemma4:31b-it-q4_K_M | Dense 31B, sharpest but 5x slower, more VRAM pressure |
| Rapid prototyping / testing | gemma4:26b | Fast enough for interactive dev |
| Retrieval / embeddings | embeddinggemma (308M, separate model) | Gemma 4 has no embedding mode; use the sibling |
| CLI coding agent (openclaw / open code / pi / hermes / aider) | gemma4:26b (fastest) or gemma4:31b-it-q4_K_M (more headroom); either works, just do not set think: false in the payload | 2026-04-18 bakeoff on 3090 Ti: all three models (including Qwen3-Coder 30B) pass the same task in 8-14 iters. The only real gotcha is that think: false silently breaks 26B in multi-turn loops. See CORPUS_cli_coding_agent.md + docs/reference/bakeoff-2026-04-18.md |
Anti-Patterns
- Don't use format: "json"; it causes infinite loops on nested schemas
- For single-turn pipelines, don't leave think at default; it eats your output budget silently. For multi-turn tool-calling agents, don't SET think: false; it silent-stops 26B. See the two "Mandatory Ollama Settings" sections above.
- Don't leave num_predict at default; 128 tokens is nothing
- Don't leave num_ctx at default; 2048 truncates most prompts
- Don't ask for huge JSON in one call; break into sequential calls
- Don't use thinking mode for evaluation; it inflates scores and wastes context
- Don't skip system prompt identity; Gemma 4 becomes a generic chatbot
- Don't use audio on 26B/31B; only E-series has an audio encoder
Quick-Start Checklist
- Set think: false for single-turn pipelines only. Leave unset for multi-turn tool-calling agents (silent-stops 26B).
- Set num_predict >= 512 (2048+ for JSON output)
- Set num_ctx >= 4096 (scale to your prompt size)
- Write explicit system prompt with identity + boundaries + output format
- Extract JSON client-side (no format: "json")
- Set keep_alive >= 30m (or pin with -1)
- For long structured output, use sequential calls
- For vision, pass base64 in images array
- Test with your actual prompt length; Ollama won't warn about truncation
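Several checklist items are mechanical enough to check before a request is sent. The linter below is a rough sketch: `lint_payload` is a hypothetical name, the thresholds are the ones in this checklist, and the fallback values for missing options are the Ollama defaults as stated in this guide, not values read from Ollama itself.

```python
from typing import Any


def lint_payload(payload: dict[str, Any], multi_turn: bool) -> list[str]:
    """Return human-readable problems found in an Ollama request body."""
    problems: list[str] = []
    options = payload.get("options") or {}
    if payload.get("format") == "json":
        problems.append('format: "json" is set; extract JSON client-side')
    if multi_turn and payload.get("think") is False:
        problems.append("think: false in a multi-turn agent silent-stops 26B")
    if not multi_turn and payload.get("think") is not False:
        problems.append("think is not false in a single-turn pipeline")
    # Fallbacks are the defaults quoted in this guide (128 and 2048).
    if options.get("num_predict", 128) < 512:
        problems.append("num_predict is below 512")
    if options.get("num_ctx", 2048) < 4096:
        problems.append("num_ctx is below 4096")
    return problems
```

An empty list means the payload passes these checks. Prompt quality, keep_alive, and actual prompt length are outside what this sketch inspects.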