feat: round-2 bakeoff — 26b silent-stop is tool-response context size

Round 2 tested the hypothesis that 26B's silent-stop was about
write_file argument size. Result: refuted.

- Patch-mode (apply_patch instead of write_file): 26B fails identically
  at iter 6. Tool-arg size is not the cause.
- Truncation sweep on tool responses reveals the real trigger: cap at
  800 or 1200 chars → 26B PASSES (1200-cap is 8.4s, fastest of any run).
  Cap at 1600, 2000, or unlimited → 26B silent-stops with eval=4.

Revised understanding: 26B silent-stops once the tool responses
accumulated in context exceed a threshold somewhere between 1200 and
1600 chars per response. Not a tool-arg bug, not a raw code-gen bug: 26B
emits correct code in both one-shot and short-context settings.

Production CLI agents (openclaw, open code, aider) typically truncate
tool responses by default, so this failure may not surface in them.
Custom harnesses should cap ≤1200 chars per tool response when
targeting the 26B MoE.

Updates GOTCHAS (rewritten entry with the truncation sweep table),
SYNTHESIS model-selection row, CORPUS_cli_coding_agent.md pointer, and
docs/reference/bakeoff-2026-04-18.md with full Round 2 methodology
and data.

Adds harness_patch.py (apply_patch edit tool), harness_patch_truncated.py
(env-configurable TOOL_RESULT_CAP), all 7 run logs, and a
.secrets.baseline for detect-secrets false positives on JSON timestamps.
Mortdecai
2026-04-18 13:40:18 -04:00
parent a945207aab
commit 7f806e0b92
15 changed files with 16481 additions and 32 deletions
+48 -23
@@ -64,36 +64,61 @@ Ollama defaults `num_predict` to 128 tokens. Almost any useful Gemma 4 output ex
**Fix:** Always set `num_predict` explicitly. Minimum recommended: 512. For JSON output: 2048+.
## HIGH: 26B Silent-Stops at `write_file` Tool Boundary (reproducible)
## HIGH: 26B Silent-Stops When Tool Responses Accumulate (reproducible)
**Severity: HIGH — agent-loop failure, silent**
**Severity: HIGH — silent agent-loop failure. Mitigatable.**
Reproduced on 2026-04-18 against `gemma4:26b` via Ollama 0.20.4 on a 3090 Ti
(steel141). Agent harness exposed `read_file` / `write_file` / `run_bash` tools
and asked the model to fix a failing Python test.
(steel141). Agent harness looped through `read_file`, an edit tool (`write_file` or `apply_patch`), and `run_bash`
to fix a failing Python test.
Observed pattern (both runs identical):
1. Model reads README, runs pytest (sees failures), reads the buggy source file
2. Next turn: **empty content, no tool calls, `eval_count=4`** — model silently exits
3. Zero writes ever emitted
### The observation
Isolation: a direct one-shot call asking 26B to rewrite the same function
returned the correct fix (`eval_count=81`). So diagnosis and code generation are
intact — failure is at the `write_file(path, content)` tool-call argument
boundary, where `content` is a ~500-char string. Consistent with the "Weak at
Long/Nested JSON" gotcha below: a long string inside a tool-call argument is
structurally the same problem.
26B silent-stops (empty content, no tool calls, `eval_count=4`) at the
decision-to-edit turn, **regardless of which edit tool is offered** — tested with
both `write_file(path, full_content)` and `apply_patch(path, old, new)`.
Initial hypothesis (long tool-call argument) was **refuted**.
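
A harness can flag this signature instead of exiting quietly. The following is a minimal sketch, not the bakeoff harness's own code. It assumes a non-streaming Ollama chat response parsed into a dict with a `message` object (`content`, optional `tool_calls`) and a top-level `eval_count`. The name `is_silent_stop` and the `<= 4` threshold are illustrative, taken from the `eval_count=4` halts reported here.

```python
from typing import Any, Mapping

# Assumption: threshold mirrors the eval_count=4 halts observed in this entry.
SILENT_STOP_MAX_EVAL = 4


def is_silent_stop(response: Mapping[str, Any],
                   max_eval: int = SILENT_STOP_MAX_EVAL) -> bool:
    """Return True when a chat response matches the silent-stop signature:
    empty content, no tool calls, and a very small eval_count."""
    message = response.get("message") or {}
    content = (message.get("content") or "").strip()
    tool_calls = message.get("tool_calls") or []
    eval_count = response.get("eval_count", 0)
    return not content and not tool_calls and eval_count <= max_eval
```

When the check fires, a loop could log the turn and retry with smaller tool responses rather than treating the empty turn as a finished task.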
`gemma4:31b-it-q4_K_M` on the same harness completed the task cleanly
(`eval_count=330` on the write turn). `qwen3-coder:30b` also completed.
### The actual trigger: cumulative tool-response context shape
**Fix:**
- For 26B in an agent loop, prefer a **patch/diff tool surface**
(`apply_patch(path, old, new)`) over a **full-content write** (`write_file(path, full_content)`).
Delta-sized arguments are inside the model's comfort zone.
- Or use 31B for the agent and keep 26B for single-shot tasks where the full
response is the output, not a tool-call argument.
- See `docs/reference/bakeoff-2026-04-18.md` for the full trace.
A sweep with progressive truncation caps on tool responses (`TOOL_RESULT_CAP`):
| Cap (chars) | Result | Halt eval_count |
|---|---|---|
| 800 | PASS | 24 (continues, hits iteration cap) |
| 1200 | **PASS**, **fastest of any run (8.4s)** | 27 (clean summary) |
| 1600 | FAIL | **4** (silent stop) |
| 2000 | FAIL | **4** (silent stop) |
| unlimited | FAIL | **4** (silent stop) |
Sharp transition between 1200 and 1600 chars-per-response. Below the line, 26B
emits correct code (eval_count ~165 on the patch turn). Above, it silent-stops.
Exact mechanism unproven (could be MoE expert routing, chat-template edge case,
or something else). **Actionable:** cap tool responses ≤1200 chars.
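
One way to apply the cap is sketched below. This is a hedged illustration, not the contents of `harness_patch_truncated.py`. The helper names, the accepted `unlimited` value, the truncation marker, and the choice to count the marker inside the cap are all assumptions. Only the `TOOL_RESULT_CAP` variable name and the 1200-char figure come from this entry.

```python
import os
from typing import Mapping, Optional

DEFAULT_CAP = 1200                    # passing value from the sweep above
TRUNCATION_MARKER = "\n[truncated]"   # hypothetical marker text


def read_cap(env: Mapping[str, str] = os.environ) -> Optional[int]:
    """Read TOOL_RESULT_CAP; None means no cap (assumed spelling: 'unlimited')."""
    raw = env.get("TOOL_RESULT_CAP", str(DEFAULT_CAP)).strip().lower()
    if raw in ("", "unlimited", "none"):
        return None
    return int(raw)


def cap_tool_result(text: str, cap: Optional[int]) -> str:
    """Truncate a tool response so the returned string is at most `cap` chars."""
    if cap is None or len(text) <= cap:
        return text
    if cap <= len(TRUNCATION_MARKER):
        return text[:cap]
    # Keep the head of the output and spend the tail of the budget on the marker.
    return text[: cap - len(TRUNCATION_MARKER)] + TRUNCATION_MARKER
```

Each tool result would pass through `cap_tool_result(output, read_cap())` before being appended to the message history. Keeping the head rather than the tail is a design guess; for pytest output the failure summary sits at the end, so a tail-keeping variant may suit some tools better.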
### What's NOT at fault
- **Not the edit tool surface** — `write_file` and `apply_patch` both trigger it
- **Not raw code generation** — a one-shot direct prompt asking 26B to fix the
same function returned clean correct code (eval=81)
- **Not total context size alone** — the 800-cap run continued past 3741 prompt
tokens. Failing runs halt at ~2070-2150 tokens but the 1200-cap run crossed
the same range and kept going
- **Not a Gemma-4-family issue** — `gemma4:31b-it-q4_K_M` on identical harness
handles full-size tool responses cleanly (eval=330 on the write turn)
### Fix
- **For 26B in an agent loop, cap tool responses ≤1200 chars.** 800 is safer.
  Production CLI agents (openclaw / open code / aider / cline) typically
  truncate tool responses by default, so the issue may not surface in those
  frameworks.
- **For raw pytest output specifically**, use `pytest -x --tb=line` or a custom
formatter to shrink per-test output to a few lines.
- **Alternative:** use `gemma4:31b-it-q4_K_M` — same harness, no mitigation,
  just works. Trade-off: ~5× slower than 26B when 26B cooperates.
- See `docs/reference/bakeoff-2026-04-18.md` (Round 2) for full traces and the
truncation sweep methodology.
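
The pytest suggestion above could be wired into a `run_bash`-style tool roughly as follows. This is a sketch under assumptions: the function names are hypothetical, `-q` is added here for extra brevity and is not part of the original recommendation, and the simple head slice stands in for whatever truncation the harness actually uses.

```python
import subprocess
import sys
from typing import List, Tuple


def compact_pytest_cmd(target: str = ".") -> List[str]:
    """Build a pytest command that stops at the first failure (-x) and
    prints one line per failure (--tb=line)."""
    return [sys.executable, "-m", "pytest", "-x", "--tb=line", "-q", target]


def run_compact_pytest(target: str = ".", cap: int = 1200,
                       timeout: int = 120) -> Tuple[int, str]:
    """Run pytest compactly and return (exit code, output capped to `cap` chars)."""
    proc = subprocess.run(
        compact_pytest_cmd(target),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    output = proc.stdout + proc.stderr
    return proc.returncode, output[:cap]
```

Shrinking the output at the source means the cap rarely has to cut anything, so the model sees the failing line instead of a truncated traceback.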
## MEDIUM: Weak at Long/Nested JSON