feat: round-2 bakeoff — 26b silent-stop is tool-response context size

Round 2 tested the hypothesis that 26B's silent-stop was about write_file argument size. Result: refuted.

- Patch-mode (apply_patch instead of write_file): 26B fails identically at iter 6. Tool-arg size is not the cause.
- Truncation sweep on tool responses reveals the real trigger: cap at 800 or 1200 chars → 26B PASSES (1200-cap is 8.4s, fastest of any run). Cap at 1600, 2000, or unlimited → 26B silent-stops with eval=4.

Revised understanding: 26B silent-stops when cumulative tool-response context crosses a shape threshold around 1200-1600 chars per response. Not a tool-arg bug, not a raw code-gen bug: 26B emits correct code fine in both one-shot and short-context settings.

Production CLI agents (openclaw, open code, aider) typically truncate tool responses by default, so this failure may not surface in them. Custom harnesses should cap ≤1200 chars per tool response when targeting the 26B MoE.

Updates GOTCHAS (rewritten entry with the truncation sweep table), SYNTHESIS model-selection row, CORPUS_cli_coding_agent.md pointer, docs/reference/bakeoff-2026-04-18.md with full Round 2 methodology and data.

Adds harness_patch.py (apply_patch edit tool), harness_patch_truncated.py (env-configurable TOOL_RESULT_CAP), all 7 run logs, and a .secrets.baseline for detect-secrets false positives on JSON timestamps.
2026-04-18 13:40:18 -04:00
parent a945207aab
commit 7f806e0b92
15 changed files with 16481 additions and 32 deletions
@@ -64,36 +64,61 @@ Ollama defaults `num_predict` to 128 tokens. Almost any useful Gemma 4 output ex

 **Fix:** Always set `num_predict` explicitly. Minimum recommended: 512. For JSON output: 2048+.

-## HIGH: 26B Silent-Stops at `write_file` Tool Boundary (reproducible)
+## HIGH: 26B Silent-Stops When Tool Responses Accumulate (reproducible)

-**Severity: HIGH — agent-loop failure, silent**
+**Severity: HIGH — silent agent-loop failure. Mitigatable.**

 Reproduced on 2026-04-18 against `gemma4:26b` via Ollama 0.20.4 on a 3090 Ti
-(steel141). Agent harness exposed `read_file` / `write_file` / `run_bash` tools
-and asked the model to fix a failing Python test.
+(steel141). The agent harness looped through `read_file`, an edit tool
+(`write_file` or `apply_patch`), and `run_bash` to fix a failing Python test.

-Observed pattern (both runs identical):
-1. Model reads README, runs pytest (sees failures), reads the buggy source file
-2. Next turn: **empty content, no tool calls, `eval_count=4`** — model silently exits
-3. Zero writes ever emitted
+### The observation

-Isolation: a direct one-shot call asking 26B to rewrite the same function
-returned the correct fix (`eval_count=81`). So diagnosis and code generation are
-intact — failure is at the `write_file(path, content)` tool-call argument
-boundary, where `content` is a ~500-char string. Consistent with the "Weak at
-Long/Nested JSON" gotcha below: a long string inside a tool-call argument is
-structurally the same problem.
+26B silent-stops (empty content, no tool calls, `eval_count=4`) at the
+decision-to-edit turn, **regardless of which edit tool is offered** — tested with
+both `write_file(path, full_content)` and `apply_patch(path, old, new)`.
+Initial hypothesis (long tool-call argument) was **refuted**.
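
The silent-stop signature described above (empty content, no tool calls, a very low `eval_count`) can be checked for in a harness loop. The following is a minimal sketch, not code from this commit: it assumes the response is the parsed JSON dict returned by Ollama's chat endpoint, and both the function name `is_silent_stop` and the `eval_ceiling` default are hypothetical choices made for illustration.

```python
def is_silent_stop(response: dict, eval_ceiling: int = 8) -> bool:
    """Return True if a chat response looks like the 26B silent stop.

    Assumes `response` is the parsed JSON of a non-streaming Ollama chat
    call, with `message.content`, optional `message.tool_calls`, and
    `eval_count`. `eval_ceiling` is a hypothetical margin above the
    observed eval_count=4, not a measured threshold.
    """
    message = response.get("message") or {}
    content = (message.get("content") or "").strip()
    tool_calls = message.get("tool_calls") or []
    eval_count = response.get("eval_count", 0)
    # All three conditions must hold: nothing said, nothing called,
    # and almost no tokens generated.
    return not content and not tool_calls and eval_count <= eval_ceiling
```

A harness could call this after each turn and log or retry with a smaller tool-response cap instead of treating the empty turn as task completion.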

-`gemma4:31b-it-q4_K_M` on the same harness completed the task cleanly
-(`eval_count=330` on the write turn). `qwen3-coder:30b` also completed.
+### The actual trigger: cumulative tool-response context shape

-**Fix:**
- For 26B in an agent loop, prefer a **patch/diff tool surface**
-  (`apply_patch(path, old, new)`) over a **full-content write** (`write_file(path, full_content)`).
-  Delta-sized arguments are inside the model's comfort zone.
- Or use 31B for the agent and keep 26B for single-shot tasks where the full
-  response is the output, not a tool-call argument.
- See `docs/reference/bakeoff-2026-04-18.md` for the full trace.
+A sweep with progressive truncation caps on tool responses (`TOOL_RESULT_CAP`):
+
+| Cap (chars) | Result | Halt eval_count |
+|---|---|---|
+| 800 | PASS | 24 (continues, hits iteration cap) |
+| 1200 | **PASS** — **fastest of any run (8.4s)** | 27 (clean summary) |
+| 1600 | FAIL | **4** (silent stop) |
+| 2000 | FAIL | **4** (silent stop) |
+| unlimited | FAIL | **4** (silent stop) |
+
+Sharp transition between 1200 and 1600 chars-per-response. Below the line, 26B
+emits correct code (eval_count ~165 on the patch turn). Above, it silent-stops.
+Exact mechanism unproven (could be MoE expert routing, chat-template edge case,
+or something else). **Actionable:** cap tool responses ≤1200 chars.
+
+### What's NOT at fault
+
+- **Not the edit tool surface** — `write_file` and `apply_patch` both trigger it
+- **Not raw code generation** — a one-shot direct prompt asking 26B to fix the
+  same function returned clean correct code (eval=81)
+- **Not total context size alone** — the 800-cap run continued past 3741 prompt
+  tokens. Failing runs halt at ~2070-2150 tokens but the 1200-cap run crossed
+  the same range and kept going
+- **Not a Gemma-4-family issue** — `gemma4:31b-it-q4_K_M` on identical harness
+  handles full-size tool responses cleanly (eval=330 on the write turn)
+
+### Fix
+
+- **For 26B in an agent loop, cap tool responses ≤1200 chars.** 800 is safer;
+  production CLI agents (openclaw / open code / aider / cline) typically
+  truncate tool responses by default, so the issue may not surface in those
+  frameworks.
+- **For raw pytest output specifically**, use `pytest -x --tb=line` or a custom
+  formatter to shrink per-test output to a few lines.
+- **Alternative:** use `gemma4:31b-it-q4_K_M` — same harness, no mitigation,
+  just works. Trade: ~5× slower than 26B when 26B cooperates.
+- See `docs/reference/bakeoff-2026-04-18.md` (Round 2) for full traces and the
+  truncation sweep methodology.
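
The capping fix in the list above can be sketched as a small helper applied to every tool result before it is appended to the conversation. This is an illustrative sketch under stated assumptions, not the contents of `harness_patch_truncated.py`: the function name `cap_tool_result`, the truncation marker, the 1200 default, and the convention that a cap of 0 or less means "unlimited" are all hypothetical. Only the `TOOL_RESULT_CAP` variable name comes from the commit.

```python
import os
from typing import Optional

DEFAULT_CAP = 1200  # assumption: the upper bound that passed in the sweep


def cap_tool_result(text: str, cap: Optional[int] = None) -> str:
    """Truncate a tool response to at most `cap` characters.

    If `cap` is None, read it from the TOOL_RESULT_CAP environment
    variable, falling back to DEFAULT_CAP. A cap <= 0 disables
    truncation (a hypothetical convention for the "unlimited" case).
    """
    if cap is None:
        cap = int(os.environ.get("TOOL_RESULT_CAP", DEFAULT_CAP))
    if cap <= 0 or len(text) <= cap:
        return text
    marker = "\n...[truncated]"
    # Reserve room for the marker so the result never exceeds `cap`.
    keep = max(cap - len(marker), 0)
    return text[:keep] + marker[: cap - keep]
```

Keeping the head of the output rather than the tail is a design choice made here for simplicity; for pytest output, where the failure summary is at the end, a harness might prefer to keep the tail instead.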

 ## MEDIUM: Weak at Long/Nested JSON