Files
gemma4-research/docs/reference/bakeoff-2026-04-18.md
T
Mortdecai a945207aab feat: CLI coding agent bakeoff — 26b reproducibly silent-stops at write_file
Ran minimal agent loop (Ollama /api/chat + read_file/write_file/run_bash) on
steel141 3090 Ti against 3 models on a broken-median-function task:

- gemma4:31b-it-q4_K_M: PASS (8 iters, 1 write, 44s) — textbook trace
- qwen3-coder:30b: PASS (15 iters, 1 write, 22s) — correct but chatty
- gemma4:26b: FAIL (6 iters, 0 writes) — silently stops with eval=4
  after reading source. Reproduced on second run. One-shot probe
  confirms 26b CAN produce the correct fix — failure is specifically
  at the write_file tool-call argument boundary.

Updates GOTCHAS with a new HIGH-severity entry, SYNTHESIS model-selection
table, CORPUS_cli_coding_agent.md empirical-follow-up pointer, and adds
docs/reference/bakeoff-2026-04-18.md with the full writeup.
2026-04-18 13:27:50 -04:00

9.0 KiB
Raw Blame History

CLI Coding Agent Bakeoff — 2026-04-18

Empirical follow-up to CORPUS_cli_coding_agent.md. Runs a minimal CLI coding agent loop against three candidate models on identical hardware and an identical broken-code task. n=1 per model (plus one re-run to check reproducibility of a failure). Treat as a smoke test, not a benchmark.

Setup

  • Host: steel141 (Seth's local box)
  • GPU: NVIDIA RTX 3090 Ti, 24 GiB, ~22.7 GiB free
  • Ollama: 0.20.4
  • Harness: scripts/bakeoff/harness.py — custom minimal agent loop, not openclaw / open code / aider / pi / hermes. Protocol: Ollama /api/chat with tools=[read_file, write_file, run_bash], non-streaming, think: false, num_ctx: 32768, num_predict: 4096, temperature: 0.3. Iteration cap = 15.
  • Task: scripts/bakeoff/task_seed/ — Python package with buggy median() function. 3 of 7 pytest tests fail on even-length inputs. Fix is ~5 lines.
  • System prompt: generic CLI-agent template (identity + allowed tools + rules: "never modify tests", "prefer minimal edits"). Not tuned per model.

All three models pulled from steel141's local Ollama, swapped in/out of GPU as each run started. First iteration per run pays the load cost; later iterations are hot.

Results

Model Pass Iterations write_file read_file run_bash Wall clock Halt reason
gemma4:26b Fail 6 0 2 3 10.9s no_tool_calls (silent empty response)
gemma4:26b (retry) Fail 6 0 2 3 11.4s no_tool_calls (reproduces exactly)
gemma4:31b-it-q4_K_M Pass 8 1 2 4 44.1s no_tool_calls (clean summary turn)
qwen3-coder:30b Pass 15 (cap) 1 4 8 22.6s no_tool_calls (at iteration cap)

Gemma 4 31B — clean run

Textbook agent trace:

  1. read_file README.md
  2. pytest (exit=2, module not found — pytest needs PYTHONPATH)
  3. ls -R
  4. PYTHONPATH=. pytest → sees 3 failures
  5. read_file calc/stats.py
  6. write_file calc/stats.py (eval_count=330, 13.4s) — correct fix
  7. PYTHONPATH=. pytest → all green
  8. summary: "I updated the median function in calc/stats.py to correctly calculate the average of the two middle elements..."

Zero wasted turns. One write. Minimal edit.

Qwen3-Coder 30B — correct but chatty

Passed, but used all 15 iterations:

  • Narrated every step ("I'll help you...", "Now let's look at...")
  • Tried to read a non-existent file (test_calc.py) — wasted iter 2
  • Tried to read_file on a directory (calc) — wasted iter 6
  • Ran several redundant bash calls (pwd && pytest, etc.)
  • Emitted a ceremonial echo "All tests pass..." bash call at iter 14
  • Final turn was a polite summary

The fix itself (iter 12) was correct on the first write. Quality is fine; efficiency isn't. Per-iteration it was fast (many 20-40 token turns) — total wall clock 22.6s beat Gemma 31B despite using nearly 2× the iterations.

Gemma 4 26B — reproducible silent stop

Both runs followed an identical trajectory:

  1. ls -R
  2. read_file README.md
  3. pytest (exit=2)
  4. PYTHONPATH=. pytest → sees 3 failures
  5. read_file calc/stats.py
  6. Empty response. eval_count=4. No tool calls. Loop terminates.

Zero writes. The model saw all the context it needed (failing tests + buggy source) and then silently declined to act.

Isolating the failure — one-shot probe

To check whether 26B can produce the fix at all, I ran a single-turn call with no tool loop:

prompt: "The following function is buggy — median([1,2,3,4]) returns 3
         but should return 2.5. Rewrite it correctly. [buggy code]"

Response (eval_count=81):

def median(numbers):
    s = sorted(numbers)
    n = len(s)
    if n % 2 == 1:
        return s[n // 2]
    else:
        return (s[n // 2 - 1] + s[n // 2]) / 2

Correct. So 26B's diagnosis and code generation are intact. The failure is specifically at the tool-call-boundary — when the model needs to emit a write_file(path, content) call where the content argument is a several-hundred-character string, it aborts with eval=4 instead.

This aligns with GOTCHAS.md § "Weak at Long/Nested JSON". A write_file tool call argument with a ~500-char string is structurally similar to a long JSON value. Gemma 4 31B handles the same surface reliably (eval=330 on that turn); the 26B MoE does not.

Interpretation

What this is evidence for

  • Gemma 4 31B is a viable CLI-coding-agent backing model on this class of task. Clean trace, minimal wasted turns, correct fix on first write.
  • Qwen3-Coder 30B also works, at the cost of more iterations and looser discipline. Diff quality was fine; agentic efficiency wasn't.
  • Gemma 4 26B has a reproducible failure mode at tool-call-argument emission. It can reason. It can code. It struggles to deliver code through a write_file tool call when the content is non-trivial.

What this is NOT evidence for

  • This is not a representative benchmark. n=1 per model. One task. One fix. One harness. Do not conclude "Gemma 4 26B is broken for coding agents" — conclude "Gemma 4 26B failed this specific setup reproducibly; investigate further before relying on it."
  • This harness is not openclaw / open code / aider / pi / hermes. Production agents wrap prompts, retries, and tool surfaces differently. The 26B failure may be avoided in a harness that:
    • Uses a patch/diff tool (apply_patch(old, new)) instead of write_file(full_content) — smaller argument surface, matches the "sequential tool calls" pattern from SYNTHESIS.md
    • Adds a retry on empty response (same as Simon's streaming-fallback pattern in IMPLEMENTATIONS.md)
    • Provides fewer but richer tools (a dedicated fix_file that re-prompts internally)
  • This compares agent behavior, not raw performance. Wall clock is noisy (model load, context size, token rate all differ). Per-iteration latency is more meaningful — but that only matters for throughput, not correctness.

Recommendations

  1. For a CLI coding agent on Seth's hardware: start with gemma4:31b-it-q4_K_M. Clean behavior, modest wall clock (44s for a simple fix), no retry needed.
  2. For comparison or backup: qwen3-coder:30b is equally correct, roughly half the per-iteration cost, ~2× the iteration count. In a longer session those extra turns add up.
  3. Do not default to gemma4:26b for this pattern. Two tests in a row silent-stopped at the write boundary. If you want to use the 26B MoE (it's strong on LiveCodeBench v6 at 77.1%), validate it against your specific agent framework first — especially whether the framework uses write_file (full content) or apply_patch (delta) as its edit primitive.
  4. Test with the real harness you plan to use in production (openclaw2, open code, etc.) before committing. A handful of this style of run takes minutes on the 3090 Ti and will tell you more than any benchmark card.

Honest caveats

  • Stochasticity. Only 26B was re-run. 31B and Qwen3-Coder might hit failure modes on a different seed or a different task. Temperature 0.3 is low but not zero.
  • System prompt bias. "Start by reading README.md" steered all three models similarly; a different prompt skeleton would produce different traces. I did not tune per model — deliberately — because a production agent won't either.
  • The 26B silent-stop hypothesis (tool-arg emission failure) is inferred, not proven. A clean confirmation would require running the same task with a smaller-surface edit tool (apply_patch(path, old, new) instead of write_file(path, full_content)) and showing 26B succeeds. That's the obvious follow-up.
  • Ollama 0.20.4 is between the 0.20.0/0.20.1 known-broken-streaming range and whatever is current. Non-streaming tool calls worked cleanly for 31B and Qwen; 26B's failure looks model-specific, not Ollama-specific, but I didn't test on a different Ollama version.
  • No openclaw / open code / aider runs. Those are the frameworks named in the HF launch blog. This was a synthetic harness; transfer is plausible but unverified.

Artifacts

  • scripts/bakeoff/harness.py — the agent loop
  • scripts/bakeoff/task_seed/ — the broken-code seed (reset between runs)
  • scripts/bakeoff/runs/gemma4-26b/log.json — full turn-by-turn trace
  • scripts/bakeoff/runs/gemma4-26b-retry/log.json
  • scripts/bakeoff/runs/gemma4-31b/log.json
  • scripts/bakeoff/runs/qwen3-coder-30b/log.json

Each log records per-turn: content, tool calls, results (truncated to 800 chars), prompt/eval token counts, wall time. Final block records halt reason, pass/fail, iteration count, tool-call totals, total wall clock.

Reproducing

cd scripts/bakeoff
python3 harness.py gemma4:31b-it-q4_K_M runs/gemma4-31b/work runs/gemma4-31b/log.json
python3 harness.py qwen3-coder:30b runs/qwen3-coder-30b/work runs/qwen3-coder-30b/log.json
python3 harness.py gemma4:26b runs/gemma4-26b/work runs/gemma4-26b/log.json

Each invocation resets the work directory from task_seed/, runs the loop, writes the log, and prints a one-line summary.