feat: round-2 bakeoff — 26b silent-stop is tool-response context size

Round 2 tested the hypothesis that 26B's silent-stop was about write_file argument size. Result: refuted. - Patch-mode (apply_patch instead of write_file): 26B fails identically at iter 6. Tool-arg size is not the cause. - Truncation sweep on tool responses reveals the real trigger: cap at 800 or 1200 chars → 26B PASSES (1200-cap is 8.4s, fastest of any run). Cap at 1600, 2000, or unlimited → 26B silent-stops with eval=4. Revised understanding: 26B silent-stops when cumulative tool-response context crosses a shape threshold around 1200-1600 chars per response. Not a tool-arg bug, not a raw code-gen bug — 26B emits correct code fine in both one-shot and short-context settings. Production CLI agents (openclaw, open code, aider) typically truncate tool responses by default, so this failure may not surface in them. Custom harnesses should cap ≤1200 chars per tool response when targeting the 26B MoE. Updates GOTCHAS (rewritten entry with the truncation sweep table), SYNTHESIS model-selection row, CORPUS_cli_coding_agent.md pointer, docs/reference/bakeoff-2026-04-18.md with full Round 2 methodology and data. Adds harness_patch.py (apply_patch edit tool), harness_patch_truncated.py (env-configurable TOOL_RESULT_CAP), all 7 run logs, and a .secrets.baseline for detect-secrets false positives on JSON timestamps.
2026-04-18 13:40:18 -04:00
parent a945207aab
commit 7f806e0b92
15 changed files with 16481 additions and 32 deletions
@@ -144,3 +144,98 @@ python3 harness.py gemma4:26b runs/gemma4-26b/work runs/gemma4-26b/log.json
 ```

 Each invocation resets the work directory from `task_seed/`, runs the loop, writes the log, and prints a one-line summary.
+
+---
+
+# Round 2 — isolating the 26B silent-stop
+
+After Round 1 I hypothesized the 26B failure was about long `write_file(path, full_content)` tool arguments. Round 2 tests that.
+
+## What was tested
+
+1. **Patch-mode harness** (`harness_patch.py`) — identical to the original but swaps `write_file(path, content)` for `apply_patch(path, old_text, new_text)`. Arguments are a small delta (~100-200 chars), not the full file.
+2. **Truncation-mode harness** (`harness_patch_truncated.py`) — same as patch-mode, but caps every tool response to `TOOL_RESULT_CAP` chars (env-configurable) before returning it to the model.
+
+All else identical: same task, same system prompt, same Ollama settings, same 3090 Ti on steel141.
+
+## Results
+
+### Round 2a — patch-mode (small edit tool arguments)
+
+| Model | Pass | Iters | patches | reads | bashes | Wall |
+|---|---|---|---|---|---|---|
+| `gemma4:31b-it-q4_K_M` | ✓ | 8 | 1 | 2 | 4 | 37s |
+| `qwen3-coder:30b` | ✓ | 14 | 1 | 3 | 9 | 22s |
+| `gemma4:26b` | **✗** | 6 | **0** | 2 | 3 | 8s |
+
+**Hypothesis refuted.** 26B fails identically on patch-mode: 6 iters, silent stop at iter 6 with eval=4, zero edits. The tool-call **argument size is not the trigger.**
+
+### Round 2b — tool-result truncation cap
+
+Ran 26B through patch-mode with progressively smaller caps on each tool response:
+
+| TOOL_RESULT_CAP | 26B Pass | Halt turn | prompt_eval at halt | eval_count at halt |
+|---|---|---|---|---|
+| **800** | ✓ | iter 15 (cap) | 3741 | 24 |
+| **1200** | ✓ | iter 8 | 2294 | 27 |
+| **1600** | ✗ | iter 6 | 2070 | **4** |
+| **2000** | ✗ | iter 6 | 2157 | **4** |
+| **unlimited** | ✗ | iter 6 | 2139 | **4** |
+
+Sharp transition between 1200 and 1600. Below the line, 26B generates code (`eval_count=165` on the patch turn). Above the line, `eval_count=4` — effectively an EOS.
+
+**The trigger is cumulative tool-response context shape, not total tokens.** The 800-cap run continued reasoning past 3741 prompt tokens without issue. The failing runs all halt at ~2070-2150 tokens — but the 1200-cap run crossed that same range (2076 at iter 7) and kept going. So "N tokens" isn't the cause — the recent-context pattern (large tool responses accumulated over 5 iterations) is.
+
+### Bonus observation: 26B at 1200-cap is the fastest passing configuration
+
+| Run | Iters | Wall clock |
+|---|---|---|
+| 26B @ 1200-cap | 8 | **8.4s** |
+| 31B @ patch | 8 | 37s |
+| Qwen3-Coder @ patch | 14 | 22s |
+
+Same task, same correct fix. 26B's MoE (3.8B active params) is ~5× faster than 31B dense when it cooperates.
+
+## Revised interpretation
+
+- **Not "26B is broken for CLI coding agents."**
+- **Not "long tool-call arguments break 26B."**
+- **Yes: "26B silent-stops when the cumulative tool-response context crosses a certain shape/size threshold, at the decision-to-edit boundary."** Observed threshold here: per-tool-response cap somewhere between 1200 and 1600 chars, on this task / this Ollama version / this model variant.
+- **The mitigation is standard.** Every production CLI agent (openclaw, open code, aider, cline, continue) truncates tool responses — this is table stakes, not exotic. 26B's "failure mode" is likely *already mitigated* in those frameworks. What my default harness did (pass full 4-6KB pytest outputs verbatim) is probably not what those frameworks do.
+- **Exact mechanism is unproven.** I'm observing behavior, not internals. Could be MoE expert routing, could be chat-template edge case, could be some interaction with the tool-call channel tokens. Finding the root cause would require model instrumentation beyond this scope.
+
+## Revised recommendation
+
+1. **Default to `gemma4:31b-it-q4_K_M`** for general CLI coding agent use. Robust to long tool responses, no mitigation needed.
+2. **Use `gemma4:26b`** if you care about latency AND your agent framework truncates tool responses (most do). 5× faster than 31B when it works.
+3. **Verify by re-running against your actual agent framework.** Don't trust this harness as a proxy — it's a diagnostic, not a production test.
+4. **If you're writing a custom agent and targeting 26B**, cap tool responses aggressively (≤1200 chars per response worked here; ≤800 is safer). pytest output in particular benefits from `--tb=line` or `-x` to shrink it.
+
+## Artifacts (Round 2)
+
+- `scripts/bakeoff/harness_patch.py` — patch-mode harness
+- `scripts/bakeoff/harness_patch_truncated.py` — truncation-mode harness (env var `TOOL_RESULT_CAP`)
+- `scripts/bakeoff/runs_patch/gemma4-26b/log.json` — patch mode, unlimited (fails)
+- `scripts/bakeoff/runs_patch/gemma4-26b-truncated/log.json` — cap=800 (passes)
+- `scripts/bakeoff/runs_patch/gemma4-26b-cap1200/log.json` — cap=1200 (passes)
+- `scripts/bakeoff/runs_patch/gemma4-26b-cap1600/log.json` — cap=1600 (fails)
+- `scripts/bakeoff/runs_patch/gemma4-26b-cap2000/log.json` — cap=2000 (fails)
+- `scripts/bakeoff/runs_patch/gemma4-31b/log.json` — patch mode, passes (control)
+- `scripts/bakeoff/runs_patch/qwen3-coder-30b/log.json` — patch mode, passes (control)
+
+## Reproducing Round 2
+
+```bash
+cd scripts/bakeoff
+
+# Patch-mode baseline (3 models)
+python3 harness_patch.py gemma4:31b-it-q4_K_M runs_patch/gemma4-31b/work runs_patch/gemma4-31b/log.json
+python3 harness_patch.py qwen3-coder:30b runs_patch/qwen3-coder-30b/work runs_patch/qwen3-coder-30b/log.json
+python3 harness_patch.py gemma4:26b runs_patch/gemma4-26b/work runs_patch/gemma4-26b/log.json
+
+# Truncation sweep on 26B
+TOOL_RESULT_CAP=800  python3 harness_patch_truncated.py gemma4:26b runs_patch/gemma4-26b-truncated/work runs_patch/gemma4-26b-truncated/log.json
+TOOL_RESULT_CAP=1200 python3 harness_patch_truncated.py gemma4:26b runs_patch/gemma4-26b-cap1200/work runs_patch/gemma4-26b-cap1200/log.json
+TOOL_RESULT_CAP=1600 python3 harness_patch_truncated.py gemma4:26b runs_patch/gemma4-26b-cap1600/work runs_patch/gemma4-26b-cap1600/log.json
+TOOL_RESULT_CAP=2000 python3 harness_patch_truncated.py gemma4:26b runs_patch/gemma4-26b-cap2000/work runs_patch/gemma4-26b-cap2000/log.json
+```