Ran minimal agent loop (Ollama /api/chat + read_file/write_file/run_bash) on steel141 3090 Ti against 3 models on a broken-median-function task:
- gemma4:31b-it-q4_K_M: PASS (8 iters, 1 write, 44s), textbook trace
- qwen3-coder:30b: PASS (15 iters, 1 write, 22s), correct but chatty
- gemma4:26b: FAIL (6 iters, 0 writes), silently stops with eval=4 after reading source. Reproduced on second run.
One-shot probe confirms 26b CAN produce the correct fix; the failure is specifically at the write_file tool-call argument boundary. Updates GOTCHAS with a new HIGH-severity entry, SYNTHESIS model-selection table, CORPUS_cli_coding_agent.md empirical-follow-up pointer, and adds docs/reference/bakeoff-2026-04-18.md with the full writeup.
Gemma 4 as a CLI Coding Agent
Research pass, 2026-04-18. Positions Gemma 4 against the specific use case of driving a terminal-based coding agent (openclaw / open code / aider / pi / hermes style: read_file, write_file, bash, iterate). Separate from the existing IMPLEMENTATIONS.md chat-agent patterns (Simon) and pipeline patterns (AI_Visualizer).
Empirical follow-up: docs/reference/bakeoff-2026-04-18.md has real runs of gemma4:26b, gemma4:31b-it-q4_K_M, and qwen3-coder:30b against a custom minimal CLI-agent harness on a fix-the-median-bug task. Key findings: 31B clean (8 iters, 1 write), Qwen3-Coder correct but chatty (15 iters), 26B reproducibly silent-stops at the write_file tool-call boundary even though it can produce the fix in a direct one-shot call. Read when: scoping which model to point an agent at, or hitting an unexpected tool-call halt.
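A harness can flag the silent-stop pattern described above instead of just ending the loop. The following is a minimal sketch, not the bakeoff harness itself: it assumes a non-streaming Ollama /api/chat response carrying `message.tool_calls`, `message.content`, and `eval_count`, and the function name `classify_turn` and the 16-token threshold are hypothetical choices for illustration.

```python
def classify_turn(response: dict, min_eval_tokens: int = 16) -> str:
    """Classify one non-streaming Ollama /api/chat response.

    Returns "tool_call", "text", "silent_stop", or "empty".
    The threshold is an illustrative assumption, not a measured value.
    """
    message = response.get("message") or {}
    tool_calls = message.get("tool_calls") or []
    content = (message.get("content") or "").strip()
    eval_count = response.get("eval_count", 0)

    if tool_calls:
        return "tool_call"
    if content:
        return "text"
    # No tool call and no text: a tiny eval_count suggests the model stopped
    # almost immediately rather than spending its budget elsewhere.
    if eval_count < min_eval_tokens:
        return "silent_stop"
    return "empty"
```

Logging the classification per turn makes a halt such as the 26B run (no write, very low eval count) visible in the trace rather than looking like a normal finish.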
TL;DR
- Gemma 4 is Google's first Gemma with trained (not proof-of-concept) tool use. LiveCodeBench v6 = 80.0% (31B) / 77.1% (26B). Codeforces ELO = 2150 / 1718. That's frontier-open territory on the reported benchmarks.
- Google/HF co-launched with four local CLI coding agents: openclaw, hermes, pi, open code (see tooling/huggingface/blog/gemma4-blog.md, § "Plug in your local agent"). All four use an OpenAI-compatible endpoint, so Ollama and llama.cpp work interchangeably.
- No SWE-bench or Aider polyglot number from Google. Reporting leans on competitive programming + single-file code gen. Real-world multi-file repo-scale coding is an empirical question Google didn't answer. Treat the CLI agent claim as plausible + untested, not proven.
- No specialized CodeGemma-4 sibling exists (CodeGemma is still G1). Base Gemma 4 is the Gemma-family coding path for now.
- In Seth's homelab, CT 166 openclaw2 on pve197 is the natural testbed: GPU-adjacent to CT 105 Ollama, which already serves gemma4:26b and gemma4:31b-it-q4_K_M.
What Google does and doesn't claim
The HF 31B-it model card (tooling/huggingface/model-cards/gemma-4-31B-it-README.md, line 38) says:
"Enhanced Coding & Agentic Capabilities – Achieves notable improvements in coding benchmarks alongside native function-calling support, powering highly capable autonomous agents."
Reported coding / agentic numbers (from CORPUS_benchmarks.md):
| Benchmark | 31B | 26B A4B | What it tests |
|---|---|---|---|
| LiveCodeBench v6 | 80.0% | 77.1% | Single-file code generation |
| Codeforces ELO | 2150 | 1718 | Competitive programming |
| tau2-bench | 86.4% | 85.5% | Agentic tool use — customer service, not coding |
What's not reported and worth noting:
- SWE-bench Verified / SWE-bench Lite — the standard multi-file repo-patch benchmark
- Aider polyglot — the standard diff-format / edit-quality benchmark
- HumanEval / MBPP — even the old single-function tests
The absence isn't necessarily bad news (Google could simply have prioritized novel benchmarks), but it means the claim "powering highly capable autonomous agents" has no agentic-coding-specific receipt. tau2-bench is the closest agentic number and it measures a different domain.
First-party supported CLI coding agents
From the HF launch blog (tooling/huggingface/blog/gemma4-blog.md, lines 505-572):
| Agent | Config | Endpoint |
|---|---|---|
| openclaw | openclaw onboard (auto-detects running llama-server) | OpenAI-compatible |
| hermes | hermes model (interactive model picker) | OpenAI-compatible |
| pi | ~/.pi/agent/models.json | baseUrl: http://localhost:8080/v1, api: openai-completions |
| open code | ~/.config/opencode/opencode.json (opencode.ai) | @ai-sdk/openai-compatible, baseURL: http://127.0.0.1:8080/v1 |
All four are demonstrated against llama.cpp's llama-server, which ships first-party Gemma 4 GGUFs via ggml-org/gemma-4-*-it-GGUF including mmproj for vision. Ollama's /v1/chat/completions is drop-in substitutable — same protocol, different port/path (http://<host>:11434/v1).
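To illustrate the drop-in claim, here is a small sketch that builds the same chat-completions request for either backend and changes only the base URL. The `BACKENDS` table, the `localhost` hosts, and the function name are assumptions for illustration; the request is constructed but not sent.

```python
import json
import urllib.request

# Base URLs are assumptions for a single-machine setup; substitute your hosts.
BACKENDS = {
    "llama-server": "http://localhost:8080/v1",
    "ollama": "http://localhost:11434/v1",
}


def build_chat_request(backend: str, model: str, messages: list) -> urllib.request.Request:
    """Build an OpenAI-style chat-completions POST for the chosen backend."""
    url = BACKENDS[backend].rstrip("/") + "/chat/completions"
    body = json.dumps({"model": model, "messages": messages, "stream": False})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send it. Because the payload is identical across backends, any difference you observe comes from the server and model rather than the client.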
The blog didn't test aider / continue / cline / roo code / goose. They're all OpenAI-compatible and should work, but they're outside Google's tested set. Aider in particular uses a highly structured diff format that depends on the model emitting edits cleanly — an area where Gemma 4 has a known weakness (long/nested JSON — see GOTCHAS.md).
vs qwen3-coder:30b (the realistic homelab alternative)
Seth's steel141 already has qwen3-coder:30b and qwen3-coder-next:79.7B. The honest comparison:
| Axis | Gemma 4 26B A4B | qwen3-coder:30b |
|---|---|---|
| Active params | 3.8B (MoE, 8-of-128 experts) | ~3.3B (MoE, 30B-A3B; 30B is the total parameter count) |
| Designed for | General-purpose + agentic tool use | Coding specifically |
| Vision | Native (all variants) | No |
| Agentic tool-call training | Yes, native tokens | Yes, native tokens |
| LiveCodeBench v6 | 77.1% (Google card) | not in this corpus — don't invent |
| Edit-format fidelity | Weak at long JSON (sequential-calls workaround) | Coder-tuned, strong at diffs |
| VRAM at 32K ctx | moderate (KV-hungry, see GOTCHAS) | moderate |
Picking heuristic:
- Gemma 4 if the agent does chat + tools + vision (e.g., "look at this screenshot, edit this file, re-run test") — it's the only side with native vision.
- qwen3-coder if the agent is pure code-edit loops where diff quality dominates.
- Bakeoff before committing. Swapping an OpenAI-compatible provider URL is near-free. Two runs on one real repo task beats either benchmark.
Don't treat Google's "Enhanced Coding" framing as a head-to-head result against Qwen. It's not — they're pointing at the delta from Gemma 3, not at current coder-specialized competition.
Configuration for Ollama-backed agents
The baseline settings from SYNTHESIS.md still apply. CLI coding agent-specific adjustments:
{
  "model": "gemma4:26b",
  "think": false,
  "keep_alive": "4h",
  "options": {
    "num_ctx": 32768,
    "num_predict": 4096,
    "temperature": 0.3
  }
}
- num_ctx: 32768 is the working minimum for repo-scale work. Agents interleave file reads, bash output, and edits; 4K will truncate the second read_file.
- num_predict: 4096 leaves headroom. Single edits are short, but the agent may emit a bash invocation + reasoning + tool call in one turn.
- temperature: 0.3 follows the SYNTHESIS.md temperature table, "structured extraction" tier. Coding edits want low variance.
- think: false is critical. GOTCHAS.md documents that Ollama 0.20+ thinking silently eats num_predict and drops tool calls. If an agent somehow injects think: true, you'll see empty responses.
- keep_alive: 4h because agent sessions have think pauses; this avoids the reload penalty.
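For a custom harness that talks to Ollama's native /api/chat directly, the settings above can be assembled into a request body along these lines. This is a sketch under assumptions: `AGENT_DEFAULTS` and `build_chat_payload` are hypothetical names, and forcing `think` to false on every call is one possible guard against a stray `think: true`, not something the agents listed here are known to do.

```python
import copy

# Mirrors the JSON block in this section.
AGENT_DEFAULTS = {
    "model": "gemma4:26b",
    "think": False,
    "keep_alive": "4h",
    "options": {"num_ctx": 32768, "num_predict": 4096, "temperature": 0.3},
}


def build_chat_payload(messages, tools=None, option_overrides=None):
    """Return an /api/chat body with the agent defaults applied."""
    payload = copy.deepcopy(AGENT_DEFAULTS)  # never mutate the shared defaults
    payload["options"].update(option_overrides or {})
    payload["think"] = False   # re-assert: thinking can eat num_predict
    payload["stream"] = False  # non-streaming, per the Streaming note in this document
    payload["messages"] = list(messages)
    if tools:
        payload["tools"] = list(tools)
    return payload
```

Keeping the defaults in one place means a bakeoff can vary only `model` between runs while everything else stays fixed.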
Streaming
Non-streaming mode required on Ollama 0.20.0-0.20.1. The tool-call parser drops calls on streaming endpoints (see GOTCHAS.md and CORPUS_tool_calling_format.md). Most CLI agents default to non-streaming for tool turns, but verify in the agent's config.
llama-server alternative
If you want to follow the HF blog exactly, swap Ollama for llama.cpp:
llama-server -hf ggml-org/gemma-4-26b-a4b-it-GGUF:Q4_K_M \
--jinja \
-c 32768 \
--host 0.0.0.0 --port 8080
--jinja is the critical flag — without it, the native tool-call template (with <|tool_call> / <tool_call|> asymmetric brackets — see CORPUS_tool_calling_format.md) doesn't render correctly.
Gotchas specific to CLI coding agent use
These extend (do not replace) the general GOTCHAS.md.
1. Safety overfiltering on security-adjacent code
GOTCHAS.md documents strict alignment generally. For coding agents this bites more often: pentest tooling, CTF write-ups, auth-bypass debugging, even aggressive rm -rf-style cleanup can trigger refusals or bowdlerized edits.
Workaround: The agent's system prompt should establish authorization context — "this is an authorized security test", "this is my own machine", "this is a CTF challenge". Don't rephrase as a jailbreak; state context plainly. Stock agent system prompts typically don't set this, so it's often the first thing to add.
2. Weak long JSON → favors sequential tool calls
Gemma 4 struggles with deeply-nested schemas and long arrays (existing GOTCHAS.md finding). Agent-level implication:
- Agents that drive tool-by-tool (openclaw, open code, pi, cline): good fit. Each write_file/bash/read_file is a short tool call.
- Agents that expect one-giant-structured-response (some aider edit modes, any "output the entire diff as JSON"): expect parse failures on long patches. Break into smaller edits if possible.
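When writing your own tool definitions, one way to act on this is to keep each schema flat and check its nesting before handing it to the model. The sketch below is illustrative only: `WRITE_FILE_TOOL` is a hypothetical definition in the OpenAI-style function-tool format, and `nesting_depth` is a rough heuristic, not a documented Gemma 4 limit.

```python
WRITE_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "write_file",
        "description": "Overwrite one file with new content.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "content": {"type": "string"},
            },
            "required": ["path", "content"],
        },
    },
}


def nesting_depth(schema) -> int:
    """Count nested object/array levels in a JSON Schema fragment."""
    if not isinstance(schema, dict):
        return 0
    children = list(schema.get("properties", {}).values())
    if "items" in schema:
        children.append(schema["items"])
    own = 1 if schema.get("type") in ("object", "array") else 0
    return own + max((nesting_depth(child) for child in children), default=0)
```

A flat tool like this has depth 1. A "list of edits, each with a list of hunks" schema is much deeper, and is the shape this gotcha suggests splitting into several short calls.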
3. No code execution — that's the agent's job
Gemma 4 has no sandbox / kernel / VM. It decides when to call bash; the agent runs it. This is standard but worth stating — no CodeInterpreter-style "model runs the code" path.
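On the agent side, the bash tool is just a subprocess call whose result is fed back to the model as a tool message. This is a minimal sketch with hypothetical defaults (30 s timeout, last 4000 characters of output). It does no sandboxing, so it is only suitable on a machine where the agent is trusted to run commands.

```python
import subprocess


def run_bash(command: str, timeout_s: float = 30.0, max_chars: int = 4000) -> dict:
    """Run a shell command for the model and return a compact result."""
    try:
        proc = subprocess.run(
            ["bash", "-c", command],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"exit_code": None, "output": f"timed out after {timeout_s}s"}
    # Keep only the tail so long test logs do not flood the context window.
    output = (proc.stdout + proc.stderr)[-max_chars:]
    return {"exit_code": proc.returncode, "output": output}
```

Returning the exit code alongside the output gives the model an unambiguous pass/fail signal after a test run.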
4. Long-horizon context pressure
Gemma 4 supports 256K on 26B/31B but the KV cache is VRAM-hungry (existing GOTCHAS.md). For an agent churning through a repo:
- 32K ctx = comfortable on a 24GB card
- 128K ctx = you're feeding a lot of VRAM to cache, not weights
- Prefer agent-side retrieval (grep, ripgrep, targeted file reads) over "paste the whole repo in context"
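Agent-side retrieval can be as simple as a grep-like tool that returns matching lines with their locations, so the model asks for a targeted read instead of a whole file. The following is a sketch: the function name, the hit cap, and the default suffix filter are all hypothetical.

```python
import re
from pathlib import Path


def grep_repo(root, pattern: str, max_hits: int = 20, suffixes=(".py",)) -> list:
    """Return up to max_hits 'path:line: text' matches under root."""
    root = Path(root)
    regex = re.compile(pattern)
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        try:
            lines = path.read_text(encoding="utf-8").splitlines()
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for lineno, line in enumerate(lines, start=1):
            if regex.search(line):
                hits.append(f"{path.relative_to(root)}:{lineno}: {line.strip()}")
                if len(hits) >= max_hits:
                    return hits
    return hits
```

Capping the hit count bounds how much context a single search can consume, which matters more as the session grows.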
5. Identity drift across long sessions
Gemma 4's "ultra-compliant but doesn't know who it is" behavior (existing GOTCHA) shows up in long agent sessions as subtle drift: switching voice, adopting a different tool-call style mid-session, forgetting constraints from turn 1. The SYNTHESIS.md system-prompt template (identity + what-you-do + what-you-do-not + format) is more important for a 50-turn agent loop than a 3-turn chat.
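As an illustration of the four-part shape named above, a system prompt can be assembled from explicit sections. This is a sketch only: the actual wording of the SYNTHESIS.md template is not reproduced here, so the section labels and the optional `authorization` field (for the context discussed in gotcha 1) are assumptions.

```python
def build_system_prompt(identity, does, does_not, output_format, authorization=None):
    """Assemble identity + what-you-do + what-you-do-not + format sections."""
    sections = [
        f"Identity: {identity}",
        "You do:\n" + "\n".join(f"- {item}" for item in does),
        "You do not:\n" + "\n".join(f"- {item}" for item in does_not),
        f"Format: {output_format}",
    ]
    if authorization:
        # Plain statement of context, e.g. "this is my own machine".
        sections.append(f"Context: {authorization}")
    return "\n\n".join(sections)
```

Building the prompt from structured fields makes it easy to resend the identical system message on every request, so constraints from turn 1 are always present in the context.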
6. Missing coding-specific agentic benchmark (same warning, bigger stakes)
Because Google didn't publish SWE-bench, you're operating on extrapolation from Codeforces + tau2-bench when you use Gemma 4 as a CLI coding agent. Measure on your actual repo before taking a dependency.
Homelab setup (Seth)
Natural testbed: CT 166 openclaw2 on pve197 → CT 105 Ollama on pve197.
Both are on the same host so there's no network hop. CT 105 already serves
gemma4:26b and gemma4:31b-it-q4_K_M (verified in handoff + per-node inventory
in /home/claude/bin/CLAUDE.md).
- Verify openclaw2's current model config. If it's pointing at a different backend, switch to http://192.168.0.179:11434/v1 with gemma4:26b (or 31B if VRAM permits alongside the V100 CT 167 visualizer stack).
- Set default options per the block above (num_ctx: 32768, num_predict: 4096, think: false, temperature: 0.3, keep_alive: 4h).
- Run one real task (suggested: a small addition to Mortdecai-2.0, a codebase with existing CLAUDE.md and clear conventions, good signal-to-noise).
- Capture: number of tool calls, number of retries, diff quality, wall clock.
- Same task against qwen3-coder:30b on steel141 (http://192.168.0.141:11434/v1). Don't A/B anything else: same agent, same prompt, same repo state, different backend.
- If Gemma 4 dominates on plan/navigate/describe but Qwen dominates on write_file quality, the natural step is per-role model split: let openclaw2 use Gemma for "thinking" tool calls and Qwen for edit tool calls. open code's provider config supports this cleanly.
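The countable capture metrics can be derived from the message trace of each run. The sketch below assumes assistant messages carry `tool_calls` in the Ollama shape (`function.name`, `function.arguments`). It treats a "retry" as two consecutive identical tool calls, which is a hypothetical definition, and it leaves diff quality to manual review.

```python
import json
from dataclasses import dataclass


@dataclass
class RunSummary:
    backend: str
    tool_calls: int
    writes: int
    retries: int
    wall_clock_s: float


def summarize_run(backend: str, messages: list, wall_clock_s: float) -> RunSummary:
    """Count tool calls, write_file calls, and repeated calls in a trace."""
    calls = []
    for message in messages:
        if message.get("role") != "assistant":
            continue
        for call in message.get("tool_calls") or []:
            fn = call["function"]
            # Serialize arguments so identical calls compare equal.
            calls.append((fn["name"], json.dumps(fn.get("arguments"), sort_keys=True)))
    retries = sum(1 for prev, cur in zip(calls, calls[1:]) if prev == cur)
    writes = sum(1 for name, _ in calls if name == "write_file")
    return RunSummary(backend, len(calls), writes, retries, wall_clock_s)
```

Producing one `RunSummary` per backend from the same task and prompt gives a like-for-like comparison of the two models.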
What is NOT covered by this document
- Concrete benchmark results from the proposed bakeoff (do the measurement, write a separate findings file)
- openclaw / hermes / pi / open code feature-matrix detail (each agent has its own docs — the HF blog links to all four)
- aider-specific diff-format analysis (aider wasn't in the HF blog's tested set)
- Fine-tuning Gemma 4 for coding agents (see tooling/fine-tuning/, the existing path)
- CodeGemma (still Gemma 1 base; see tooling/gemma-family/codegemma.md)
Provenance
- HF 31B-it model card: tooling/huggingface/model-cards/gemma-4-31B-it-README.md
- HF launch blog: tooling/huggingface/blog/gemma4-blog.md
- Benchmarks: CORPUS_benchmarks.md
- Tool calling: CORPUS_tool_calling_format.md
- Ollama variants: CORPUS_ollama_variants.md
- Known issues: GOTCHAS.md
- Qwen3-Coder in homelab: /home/claude/bin/CLAUDE.md § "Ollama models"