feat: CLI coding agent bakeoff — 26b reproducibly silent-stops at write_file

Ran minimal agent loop (Ollama /api/chat + read_file/write_file/run_bash) on
steel141 3090 Ti against 3 models on a broken-median-function task:

- gemma4:31b-it-q4_K_M: PASS (8 iters, 1 write, 44s) — textbook trace
- qwen3-coder:30b: PASS (15 iters, 1 write, 22s) — correct but chatty
- gemma4:26b: FAIL (6 iters, 0 writes) — silently stops with eval=4
  after reading source. Reproduced on second run. One-shot probe
  confirms 26b CAN produce the correct fix — failure is specifically
  at the write_file tool-call argument boundary.

Updates GOTCHAS with a new HIGH-severity entry, SYNTHESIS model-selection
table, CORPUS_cli_coding_agent.md empirical-follow-up pointer, and adds
docs/reference/bakeoff-2026-04-18.md with the full writeup.
This commit is contained in:
Mortdecai
2026-04-18 13:27:50 -04:00
parent 4b9c537dda
commit a945207aab
15 changed files with 1172 additions and 1 deletions
+31
View File
@@ -64,6 +64,37 @@ Ollama defaults `num_predict` to 128 tokens. Almost any useful Gemma 4 output ex
**Fix:** Always set `num_predict` explicitly. Minimum recommended: 512. For JSON output: 2048+.
## HIGH: 26B Silent-Stops at `write_file` Tool Boundary (reproducible)
**Severity: HIGH — agent-loop failure, silent**
Reproduced on 2026-04-18 against `gemma4:26b` via Ollama 0.20.4 on a 3090 Ti
(steel141). Agent harness exposed `read_file` / `write_file` / `run_bash` tools
and asked the model to fix a failing Python test.
Observed pattern (both runs identical):
1. Model reads README, runs pytest (sees failures), reads the buggy source file
2. Next turn: **empty content, no tool calls, `eval_count=4`** — model silently exits
3. Zero writes ever emitted
Isolation: a direct one-shot call asking 26B to rewrite the same function
returned the correct fix (`eval_count=81`). So diagnosis and code generation are
intact — failure is at the `write_file(path, content)` tool-call argument
boundary, where `content` is a ~500-char string. Consistent with the "Weak at
Long/Nested JSON" gotcha below: a long string inside a tool-call argument is
structurally the same problem.
`gemma4:31b-it-q4_K_M` on the same harness completed the task cleanly
(`eval_count=330` on the write turn). `qwen3-coder:30b` also completed.
**Fix:**
- For 26B in an agent loop, prefer a **patch/diff tool surface**
(`apply_patch(path, old, new)`) over a **full-content write** (`write_file(path, full_content)`).
Delta-sized arguments are inside the model's comfort zone.
- Or use 31B for the agent and keep 26B for single-shot tasks where the full
response is the output, not a tool-call argument.
- See `docs/reference/bakeoff-2026-04-18.md` for the full trace.
## MEDIUM: Weak at Long/Nested JSON
**Severity: MEDIUM — causes parse failures**