feat: CLI coding agent bakeoff — 26b reproducibly silent-stops at write_file

Ran minimal agent loop (Ollama /api/chat + read_file/write_file/run_bash)
on steel141 3090 Ti against 3 models on a broken-median-function task:

- gemma4:31b-it-q4_K_M: PASS (8 iters, 1 write, 44s), textbook trace
- qwen3-coder:30b: PASS (15 iters, 1 write, 22s), correct but chatty
- gemma4:26b: FAIL (6 iters, 0 writes), silently stops with eval=4 after
  reading source. Reproduced on second run.

One-shot probe confirms 26b CAN produce the correct fix; the failure is
specifically at the write_file tool-call argument boundary.

Updates GOTCHAS with a new HIGH-severity entry, the SYNTHESIS
model-selection table, and the CORPUS_cli_coding_agent.md
empirical-follow-up pointer, and adds docs/reference/bakeoff-2026-04-18.md
with the full writeup.
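
The silent-stop signature described above (the run ends with no write_file call and a near-empty final turn) could be flagged automatically from a saved trace. The following is a minimal sketch, not the harness used in the bakeoff. It assumes each trace entry is a raw Ollama /api/chat response dict, and that "eval=4" refers to that response's `eval_count` field. The `classify_run` helper, the `Verdict` type, and the threshold of 8 tokens are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    iterations: int    # number of model turns in the trace
    writes: int        # number of write_file tool calls seen
    silent_stop: bool  # True when the run matches the silent-stop pattern


def classify_run(responses, max_eval_for_silent=8):
    """Classify a finished agent run from its list of /api/chat responses.

    A run counts as a silent stop when it never called write_file and its
    final turn has no tool calls and a very small eval_count.
    The threshold is an illustrative assumption, not a measured value.
    """
    writes = 0
    for resp in responses:
        # tool_calls may be absent or None on turns without tool use
        for call in resp.get("message", {}).get("tool_calls") or []:
            if call.get("function", {}).get("name") == "write_file":
                writes += 1

    last = responses[-1] if responses else {}
    ended_without_tool_call = not last.get("message", {}).get("tool_calls")
    nearly_empty = last.get("eval_count", 0) <= max_eval_for_silent

    silent = (
        bool(responses)
        and writes == 0
        and ended_without_tool_call
        and nearly_empty
    )
    return Verdict(len(responses), writes, silent)


def tool_turn(name, eval_count=40):
    """Build a synthetic response that makes one tool call (test helper)."""
    call = {"function": {"name": name, "arguments": {}}}
    return {"message": {"content": "", "tool_calls": [call]},
            "eval_count": eval_count}


if __name__ == "__main__":
    # Synthetic trace shaped like the reported failure: reads, then a halt.
    trace = [tool_turn("read_file")] * 5
    trace.append({"message": {"content": ""}, "eval_count": 4})
    print(classify_run(trace))
```

A check like this only recognizes the pattern; it does not explain why the model halts at the tool-call argument boundary.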
2026-04-18 13:27:50 -04:00
parent 4b9c537dda
commit a945207aab
15 changed files with 1172 additions and 1 deletions
@@ -6,6 +6,14 @@
 > `IMPLEMENTATIONS.md` chat-agent patterns (Simon) and pipeline patterns
 > (AI_Visualizer).

+> **Empirical follow-up:** `docs/reference/bakeoff-2026-04-18.md` — real runs of
+> `gemma4:26b`, `gemma4:31b-it-q4_K_M`, and `qwen3-coder:30b` against a custom
+> minimal CLI-agent harness on a fix-the-median-bug task. Key findings:
+> **31B clean (8 iters, 1 write), Qwen3-Coder correct but chatty (15 iters),
+> 26B reproducibly silent-stops at the `write_file` tool call boundary** even
+> though it can produce the fix in a direct one-shot call. Read when: scoping
+> which model to point an agent at, or hitting an unexpected tool-call halt.
+
 ## TL;DR

 - Gemma 4 is Google's **first Gemma with trained (not proof-of-concept) tool use**. LiveCodeBench v6 = 80.0% (31B) / 77.1% (26B). Codeforces ELO = 2150 / 1718. That's frontier-open territory on the reported benchmarks.