feat: CLI coding agent bakeoff — 26b reproducibly silent-stops at write_file
Ran minimal agent loop (Ollama /api/chat + read_file/write_file/run_bash) on steel141 3090 Ti against 3 models on a broken-median-function task: - gemma4:31b-it-q4_K_M: PASS (8 iters, 1 write, 44s) — textbook trace - qwen3-coder:30b: PASS (15 iters, 1 write, 22s) — correct but chatty - gemma4:26b: FAIL (6 iters, 0 writes) — silently stops with eval=4 after reading source. Reproduced on second run. One-shot probe confirms 26b CAN produce the correct fix — failure is specifically at the write_file tool-call argument boundary. Updates GOTCHAS with a new HIGH-severity entry, SYNTHESIS model-selection table, CORPUS_cli_coding_agent.md empirical-follow-up pointer, and adds docs/reference/bakeoff-2026-04-18.md with the full writeup.
This commit is contained in:
+31
@@ -64,6 +64,37 @@ Ollama defaults `num_predict` to 128 tokens. Almost any useful Gemma 4 output ex
|
||||
|
||||
**Fix:** Always set `num_predict` explicitly. Minimum recommended: 512. For JSON output: 2048+.
|
||||
|
||||
## HIGH: 26B Silent-Stops at `write_file` Tool Boundary (reproducible)
|
||||
|
||||
**Severity: HIGH — agent-loop failure, silent**
|
||||
|
||||
Reproduced on 2026-04-18 against `gemma4:26b` via Ollama 0.20.4 on a 3090 Ti
|
||||
(steel141). Agent harness exposed `read_file` / `write_file` / `run_bash` tools
|
||||
and asked the model to fix a failing Python test.
|
||||
|
||||
Observed pattern (both runs identical):
|
||||
1. Model reads README, runs pytest (sees failures), reads the buggy source file
|
||||
2. Next turn: **empty content, no tool calls, `eval_count=4`** — model silently exits
|
||||
3. Zero writes ever emitted
|
||||
|
||||
Isolation: a direct one-shot call asking 26B to rewrite the same function
|
||||
returned the correct fix (`eval_count=81`). So diagnosis and code generation are
|
||||
intact — failure is at the `write_file(path, content)` tool-call argument
|
||||
boundary, where `content` is a ~500-char string. Consistent with the "Weak at
|
||||
Long/Nested JSON" gotcha below: a long string inside a tool-call argument is
|
||||
structurally the same problem.
|
||||
|
||||
`gemma4:31b-it-q4_K_M` on the same harness completed the task cleanly
|
||||
(`eval_count=330` on the write turn). `qwen3-coder:30b` also completed.
|
||||
|
||||
**Fix:**
|
||||
- For 26B in an agent loop, prefer a **patch/diff tool surface**
|
||||
(`apply_patch(path, old, new)`) over a **full-content write** (`write_file(path, full_content)`).
|
||||
Delta-sized arguments are inside the model's comfort zone.
|
||||
- Or use 31B for the agent and keep 26B for single-shot tasks where the full
|
||||
response is the output, not a tool-call argument.
|
||||
- See `docs/reference/bakeoff-2026-04-18.md` for the full trace.
|
||||
|
||||
## MEDIUM: Weak at Long/Nested JSON
|
||||
|
||||
**Severity: MEDIUM — causes parse failures**
|
||||
|
||||
Reference in New Issue
Block a user