feat: CLI coding agent bakeoff — 26b reproducibly silent-stops at write_file

Ran minimal agent loop (Ollama /api/chat + read_file/write_file/run_bash) on steel141 3090 Ti against 3 models on a broken-median-function task: - gemma4:31b-it-q4_K_M: PASS (8 iters, 1 write, 44s) — textbook trace - qwen3-coder:30b: PASS (15 iters, 1 write, 22s) — correct but chatty - gemma4:26b: FAIL (6 iters, 0 writes) — silently stops with eval=4 after reading source. Reproduced on second run. One-shot probe confirms 26b CAN produce the correct fix — failure is specifically at the write_file tool-call argument boundary. Updates GOTCHAS with a new HIGH-severity entry, SYNTHESIS model-selection table, CORPUS_cli_coding_agent.md empirical-follow-up pointer, and adds docs/reference/bakeoff-2026-04-18.md with the full writeup.
2026-04-18 13:27:50 -04:00
parent 4b9c537dda
commit a945207aab
15 changed files with 1172 additions and 1 deletions
@@ -64,6 +64,37 @@ Ollama defaults `num_predict` to 128 tokens. Almost any useful Gemma 4 output ex

 **Fix:** Always set `num_predict` explicitly. Minimum recommended: 512. For JSON output: 2048+.

+## HIGH: 26B Silent-Stops at `write_file` Tool Boundary (reproducible)
+
+**Severity: HIGH — agent-loop failure, silent**
+
+Reproduced on 2026-04-18 against `gemma4:26b` via Ollama 0.20.4 on a 3090 Ti
+(steel141). Agent harness exposed `read_file` / `write_file` / `run_bash` tools
+and asked the model to fix a failing Python test.
+
+Observed pattern (both runs identical):
+1. Model reads README, runs pytest (sees failures), reads the buggy source file
+2. Next turn: **empty content, no tool calls, `eval_count=4`** — model silently exits
+3. Zero writes ever emitted
+
+Isolation: a direct one-shot call asking 26B to rewrite the same function
+returned the correct fix (`eval_count=81`). So diagnosis and code generation are
+intact — failure is at the `write_file(path, content)` tool-call argument
+boundary, where `content` is a ~500-char string. Consistent with the "Weak at
+Long/Nested JSON" gotcha below: a long string inside a tool-call argument is
+structurally the same problem.
+
+`gemma4:31b-it-q4_K_M` on the same harness completed the task cleanly
+(`eval_count=330` on the write turn). `qwen3-coder:30b` also completed.
+
+**Fix:**
+- For 26B in an agent loop, prefer a **patch/diff tool surface**
+  (`apply_patch(path, old, new)`) over a **full-content write** (`write_file(path, full_content)`).
+  Delta-sized arguments are inside the model's comfort zone.
+- Or use 31B for the agent and keep 26B for single-shot tasks where the full
+  response is the output, not a tool-call argument.
+- See `docs/reference/bakeoff-2026-04-18.md` for the full trace.
+
 ## MEDIUM: Weak at Long/Nested JSON

 **Severity: MEDIUM — causes parse failures**