feat: round-2 bakeoff — 26b silent-stop is tool-response context size
Round 2 tested the hypothesis that 26B's silent-stop was about write_file argument size. Result: refuted. - Patch-mode (apply_patch instead of write_file): 26B fails identically at iter 6. Tool-arg size is not the cause. - Truncation sweep on tool responses reveals the real trigger: cap at 800 or 1200 chars → 26B PASSES (1200-cap is 8.4s, fastest of any run). Cap at 1600, 2000, or unlimited → 26B silent-stops with eval=4. Revised understanding: 26B silent-stops when cumulative tool-response context crosses a shape threshold around 1200-1600 chars per response. Not a tool-arg bug, not a raw code-gen bug — 26B emits correct code fine in both one-shot and short-context settings. Production CLI agents (openclaw, open code, aider) typically truncate tool responses by default, so this failure may not surface in them. Custom harnesses should cap ≤1200 chars per tool response when targeting the 26B MoE. Updates GOTCHAS (rewritten entry with the truncation sweep table), SYNTHESIS model-selection row, CORPUS_cli_coding_agent.md pointer, docs/reference/bakeoff-2026-04-18.md with full Round 2 methodology and data. Adds harness_patch.py (apply_patch edit tool), harness_patch_truncated.py (env-configurable TOOL_RESULT_CAP), all 7 run logs, and a .secrets.baseline for detect-secrets false positives on JSON timestamps.
This commit is contained in:
+48
-23
@@ -64,36 +64,61 @@ Ollama defaults `num_predict` to 128 tokens. Almost any useful Gemma 4 output ex
|
||||
|
||||
**Fix:** Always set `num_predict` explicitly. Minimum recommended: 512. For JSON output: 2048+.
|
||||
|
||||
## HIGH: 26B Silent-Stops at `write_file` Tool Boundary (reproducible)
|
||||
## HIGH: 26B Silent-Stops When Tool Responses Accumulate (reproducible)
|
||||
|
||||
**Severity: HIGH — agent-loop failure, silent**
|
||||
**Severity: HIGH — silent agent-loop failure. Mitigatable.**
|
||||
|
||||
Reproduced on 2026-04-18 against `gemma4:26b` via Ollama 0.20.4 on a 3090 Ti
|
||||
(steel141). Agent harness exposed `read_file` / `write_file` / `run_bash` tools
|
||||
and asked the model to fix a failing Python test.
|
||||
(steel141). Agent harness looped through `read_file` / `(write_file or apply_patch)` / `run_bash`
|
||||
tools to fix a failing Python test.
|
||||
|
||||
Observed pattern (both runs identical):
|
||||
1. Model reads README, runs pytest (sees failures), reads the buggy source file
|
||||
2. Next turn: **empty content, no tool calls, `eval_count=4`** — model silently exits
|
||||
3. Zero writes ever emitted
|
||||
### The observation
|
||||
|
||||
Isolation: a direct one-shot call asking 26B to rewrite the same function
|
||||
returned the correct fix (`eval_count=81`). So diagnosis and code generation are
|
||||
intact — failure is at the `write_file(path, content)` tool-call argument
|
||||
boundary, where `content` is a ~500-char string. Consistent with the "Weak at
|
||||
Long/Nested JSON" gotcha below: a long string inside a tool-call argument is
|
||||
structurally the same problem.
|
||||
26B silent-stops (empty content, no tool calls, `eval_count=4`) at the
|
||||
decision-to-edit turn, **regardless of which edit tool is offered** — tested with
|
||||
both `write_file(path, full_content)` and `apply_patch(path, old, new)`.
|
||||
Initial hypothesis (long tool-call argument) was **refuted**.
|
||||
|
||||
`gemma4:31b-it-q4_K_M` on the same harness completed the task cleanly
|
||||
(`eval_count=330` on the write turn). `qwen3-coder:30b` also completed.
|
||||
### The actual trigger: cumulative tool-response context shape
|
||||
|
||||
**Fix:**
|
||||
- For 26B in an agent loop, prefer a **patch/diff tool surface**
|
||||
(`apply_patch(path, old, new)`) over a **full-content write** (`write_file(path, full_content)`).
|
||||
Delta-sized arguments are inside the model's comfort zone.
|
||||
- Or use 31B for the agent and keep 26B for single-shot tasks where the full
|
||||
response is the output, not a tool-call argument.
|
||||
- See `docs/reference/bakeoff-2026-04-18.md` for the full trace.
|
||||
A sweep with progressive truncation caps on tool responses (`TOOL_RESULT_CAP`):
|
||||
|
||||
| Cap (chars) | Result | Halt eval_count |
|
||||
|---|---|---|
|
||||
| 800 | PASS | 24 (continues, hits iteration cap) |
|
||||
| 1200 | **PASS** — **fastest of any run (8.4s)** | 27 (clean summary) |
|
||||
| 1600 | FAIL | **4** (silent stop) |
|
||||
| 2000 | FAIL | **4** (silent stop) |
|
||||
| unlimited | FAIL | **4** (silent stop) |
|
||||
|
||||
Sharp transition between 1200 and 1600 chars-per-response. Below the line, 26B
|
||||
emits correct code (eval_count ~165 on the patch turn). Above, it silent-stops.
|
||||
Exact mechanism unproven (could be MoE expert routing, chat-template edge case,
|
||||
or something else). **Actionable:** cap tool responses ≤1200 chars.
|
||||
|
||||
### What's NOT at fault
|
||||
|
||||
- **Not the edit tool surface** — `write_file` and `apply_patch` both trigger it
|
||||
- **Not raw code generation** — a one-shot direct prompt asking 26B to fix the
|
||||
same function returned clean correct code (eval=81)
|
||||
- **Not total context size alone** — the 800-cap run continued past 3741 prompt
|
||||
tokens. Failing runs halt at ~2070-2150 tokens but the 1200-cap run crossed
|
||||
the same range and kept going
|
||||
- **Not a Gemma-4-family issue** — `gemma4:31b-it-q4_K_M` on identical harness
|
||||
handles full-size tool responses cleanly (eval=330 on the write turn)
|
||||
|
||||
### Fix
|
||||
|
||||
- **For 26B in an agent loop, cap tool responses ≤1200 chars.** 800 is safer;
|
||||
this is where every production CLI agent (openclaw / open code / aider /
|
||||
cline) already lives by default, so the issue may not surface in those
|
||||
frameworks.
|
||||
- **For raw pytest output specifically**, use `pytest -x --tb=line` or a custom
|
||||
formatter to shrink per-test output to a few lines.
|
||||
- **Alternative:** use `gemma4:31b-it-q4_K_M` — same harness, no mitigation,
|
||||
just works. Trade: ~5× slower than 26B when 26B cooperates.
|
||||
- See `docs/reference/bakeoff-2026-04-18.md` (Round 2) for full traces and the
|
||||
truncation sweep methodology.
|
||||
|
||||
## MEDIUM: Weak at Long/Nested JSON
|
||||
|
||||
|
||||
Reference in New Issue
Block a user