fix: walk back round-1/2 conclusions — the cause was think=false all along

Seth asked "was this with think=false?" Yes, and that was the only question that mattered. Everything I concluded in round 1 and round 2 was wrong.

Actual cause, isolated in round 3:
- At identical message state, gemma4:26b with think=false returns eval=4 (silent stop); with think unset or think=true, returns eval=165 and emits the correct tool call.
- Original round-1 write_file harness + think unset: 26B passes in 8 iters, 20s. No mitigations needed.
- 31B dense and qwen3-coder:30b tolerate think=false; 26B MoE does not.

Red herrings (kept on-record in the bakeoff doc, not silently erased):
- Round 1: "write_file tool-call argument size" was wrong
- Round 2a: refuted the arg-size theory but for the wrong reason (still failed because think=false was still set)
- Round 2b: "cumulative tool-response context size": truncating did make 26B pass, but by coincidence. Shorter context at the decision turn dodged the think=false side effect.

Why the existing "always think:false" guidance was misleading: it was derived from AI_Visualizer (single-turn JSON pipelines) where thinking tokens do eat num_predict invisibly. In multi-turn tool-calling agents the channels are separate and the flag has a different effect, catastrophic on 26B specifically.

Doc updates:
- GOTCHAS: replaced the 26B entry with the actual cause; scoped the original "Thinking Mode Eats Context" entry to single-turn pipelines
- SYNTHESIS: split the "Mandatory Ollama Settings" block into single-turn vs multi-turn variants; updated anti-patterns and quick-start checklist
- CORPUS_cli_coding_agent.md: revised pointer and config template
- docs/reference/bakeoff-2026-04-18.md: added Round 3 section with the correction notice at the top of the file and full diagnostic methodology

New artifacts: harness_no_think_flag.py, harness_write_no_think.py, and 4 new log files demonstrating all three models pass when think is left at default.
2026-04-18 18:14:05 -04:00
parent 7f806e0b92
commit c61394923c
11 changed files with 1132 additions and 61 deletions
@@ -5,6 +5,14 @@
 > identical broken-code task. **n=1 per model** (plus one re-run to check
 > reproducibility of a failure). Treat as a smoke test, not a benchmark.

+> **Correction notice (Round 3):** Rounds 1 and 2 both misidentified the cause
+> of Gemma 4 26B's silent-stop failure. Round 1 blamed `write_file` tool-call
+> argument size. Round 2 blamed tool-response context size. **Round 3 shows
+> both were wrong: the actual cause is the `think: false` Ollama flag.** Remove
+> the flag and 26B passes on the unmodified Round 1 harness. The failed
+> hypotheses are kept below as recorded. Seth asked "was this with
+> think=false?" and the answer exposed the confounder. Never presented as Plan A.
+
 ## Setup

 - **Host:** steel141 (Seth's local box)
@@ -239,3 +247,136 @@ TOOL_RESULT_CAP=1200 python3 harness_patch_truncated.py gemma4:26b runs_patch/ge
 TOOL_RESULT_CAP=1600 python3 harness_patch_truncated.py gemma4:26b runs_patch/gemma4-26b-cap1600/work runs_patch/gemma4-26b-cap1600/log.json
 TOOL_RESULT_CAP=2000 python3 harness_patch_truncated.py gemma4:26b runs_patch/gemma4-26b-cap2000/work runs_patch/gemma4-26b-cap2000/log.json
 ```
+
+---
+
+# Round 3 — the actual cause: `think: false`
+
+Seth asked "was this with think=false?" That was the only question that mattered.
+
+## The question that unstuck it
+
+Every harness in Round 1 and Round 2 set `"think": False` in the Ollama payload —
+per existing guidance in `GOTCHAS.md`: "Always pass `think: false` in the Ollama
+payload. Seth has had success ONLY with thinking off." I copied that to the
+harnesses without testing whether it was the right choice for a multi-turn
+tool-calling agent loop (as opposed to the single-turn JSON pipeline that
+guidance came from).
+
+## The diagnostic
+
+Replayed the exact 5-iteration failing state to `gemma4:26b` three times with
+three think settings, same message history, same tool definitions:
+
+| `think` setting | `eval_count` | tool call emitted? |
+|---|---|---|
+| `false` (my harness) | **4** | ✗ |
+| unset (Ollama default) | 165 | ✓ `apply_patch` |
+| `true` | 165 | ✓ `apply_patch` |
+
+Sharp, reproducible. `think: false` → silent stop. Anything else → works.
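
A minimal sketch of the replay described above, assuming a local Ollama server at its default address and a saved copy of the failing state (`messages` and `tools`). The helper names are illustrative and are not the names used in the bakeoff harnesses. The only variable between the three requests is whether the `think` key is `False`, absent, or `True`.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local port.
OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def build_payload(model, messages, tools, think=None):
    """Build a non-streaming /api/chat payload; omit "think" entirely when None."""
    payload = {"model": model, "messages": messages, "tools": tools, "stream": False}
    if think is not None:
        payload["think"] = think
    return payload


def summarize(response):
    """Extract the two columns of the table: eval_count and emitted tool names."""
    message = response.get("message", {})
    calls = message.get("tool_calls") or []
    names = [call.get("function", {}).get("name") for call in calls]
    return {"eval_count": response.get("eval_count"), "tool_calls": names}


def post_chat(payload, url=OLLAMA_CHAT_URL, timeout=600):
    """POST one chat request and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


def replay(model, messages, tools):
    """Send the same state three times, varying only the think setting."""
    results = {}
    for label, think in (("false", False), ("unset", None), ("true", True)):
        payload = build_payload(model, messages, tools, think)
        results[label] = summarize(post_chat(payload))
    return results
```

Calling `replay("gemma4:26b", messages, tools)` with the saved 5-iteration state would produce one summary per setting, which can be compared directly against the table.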
+
+## Round 3 runs — unlimited tool responses, think flag removed
+
+| Harness | Model | Pass | Iters | Wall |
+|---|---|---|---|---|
+| `write_file` (Round-1 harness, think unset) | `gemma4:26b` | **✓** | 8 | 20.6s |
+| `apply_patch` (Round-2a harness, think unset) | `gemma4:26b` | **✓** | 8 | 12.5s |
+| `write_file`, think unset | `gemma4:31b-it-q4_K_M` | ✓ | 8 | — |
+| `apply_patch`, think unset | `gemma4:31b-it-q4_K_M` | ✓ | 8 | 66.4s |
+| `apply_patch`, think unset | `qwen3-coder:30b` | ✓ | 11 | 19.5s |
+
+**26B passes cleanly on the unmodified Round 1 harness once the think flag is
+removed.** No truncation, no patch-tool swap, no mitigations.
+
+The 31B / Qwen runs confirm the flag doesn't matter for those models (pass either
+way). 31B is visibly slower without the think flag (66s vs 37s) — likely
+because it's actually generating hidden thinking now — but it still completes.
+
+## What Rounds 1 and 2 got wrong
+
+### Round 1 (wrong): "26B silent-stops at the write_file tool-call argument boundary"
+
+The write_file tool was present. 26B failed. But 26B also fails with
+`apply_patch` (Round 2a) and passes with `write_file` when think is unset
+(Round 3). The tool surface was not the cause.
+
+### Round 2a (wrong): "Refuted the write_file hypothesis"
+
+Correctly refuted the original hypothesis, but still tested with `think: false`.
+The observation (26B still failed) was accurate, and the conclusion ("the edit
+tool is not the cause") was right, but for the wrong reason: the edit tool was
+not the cause because `think: false` was.
+
+### Round 2b (wrong): "Cumulative tool-response context size is the trigger"
+
+The truncation sweep showed a sharp 1200-vs-1600-char boundary. That was real
+behavior, but it was a *byproduct* of `think: false`. With shorter context,
+`think: false` does not trigger the silent stop at every decision point;
+apparently the decoding-path divergence is stochastic or state-dependent.
+The underlying bug was the same (the flag); truncation was just a workaround
+that happened to land on the lucky side of the dice.
+
+The prompt_eval_count threshold I identified (~2100 tokens) was the cumulative
+context size at the model's natural decision-to-edit turn. Below that many
+tokens the model survived the think=false flag; above it, `think=false` killed
+generation. The number was real but the causal story was wrong.
+
+## Why the existing GOTCHAS guidance was misleading here
+
+`GOTCHAS.md` says: *"Thinking tokens consume num_predict budget invisibly,
+returning empty responses. Seth has ONLY had success with thinking off."*
+
+That guidance was derived from `AI_Visualizer` (per `IMPLEMENTATIONS.md` §
+"Project: AI Visualizer") — single-turn JSON-generation pipelines where the
+model's thinking DOES eat the num_predict budget and returns an empty `content`
+field.
+
+In a **multi-turn tool-calling agent loop**, the mechanics are different:
+- Ollama returns separate fields for `content` and `thinking` (when populated)
+- Tool calls come out through `tool_calls`, which isn't bounded by `content`
+  generation the same way
+- Setting `think: false` here changes the chat-template / decoding path in a
+  way that makes 26B specifically — probably due to MoE routing sensitivity —
+  prefer early EOS at tool-decision turns
+- 31B and Qwen3-Coder are more robust to the same flag
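
Because the three output channels listed above arrive as separate fields, an agent loop can tell a silent stop apart from a thinking-only or text-only turn. The following is a rough sketch, assuming the response is the decoded JSON of a non-streaming chat call; `classify_turn` and its labels are hypothetical names, not part of Ollama or the bakeoff harnesses.

```python
def classify_turn(response):
    """Label one assistant turn by which output channel carried anything."""
    message = response.get("message", {})
    if message.get("tool_calls"):
        return "tool_call"
    if (message.get("content") or "").strip():
        return "text"
    if (message.get("thinking") or "").strip():
        return "thinking_only"
    # Nothing in any channel: the model ended the turn without output.
    return "silent_stop"
```

Logging this label together with `eval_count` on every turn would make a failure like the `eval_count` = 4 case visible immediately instead of surfacing as a stalled loop.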
+
+So the guidance isn't wrong; it's out of scope. It applied to AI_Visualizer,
+was over-generalized to "always think:false", and the agent corpus inherited
+that over-generalization.
+
+## Revised, correct recommendation for CLI coding agents
+
+1. **Do NOT set `think: false`** in your agent payload. Leave it unset (Ollama
+   default) or `true`.
+2. **Do manage the `content` and `thinking` fields explicitly** if they
+   accumulate in your message history — prune old thinking blobs before
+   pushing past 30K context.
+3. **The model / tool-surface choices don't matter the way I said they did.**
+   Every Round 3 combination passed with `think` unset and uncapped responses: both
+   Gemma models on `write_file` and `apply_patch`, and `qwen3-coder:30b` on `apply_patch`.
+4. **For single-turn JSON pipelines, the original "think: false" guidance still
+   applies.** This correction is scoped to multi-turn tool-calling agents.
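
Recommendation 2 can be sketched as a small history filter. This is an illustration under assumptions: it supposes assistant messages are stored as dicts that may carry a `thinking` key, and the `keep_last` parameter is a hypothetical knob rather than anything measured in the bakeoff.

```python
def prune_thinking(messages, keep_last=2):
    """Return a copy of the history with "thinking" removed from all but the
    last `keep_last` assistant messages. The input list is not modified."""
    assistant_positions = [
        i for i, m in enumerate(messages) if m.get("role") == "assistant"
    ]
    # Guard keep_last == 0: a [-0:] slice would keep every position.
    keep = set(assistant_positions[-keep_last:]) if keep_last > 0 else set()
    pruned = []
    for i, message in enumerate(messages):
        if message.get("role") == "assistant" and i not in keep and "thinking" in message:
            message = {k: v for k, v in message.items() if k != "thinking"}
        pruned.append(message)
    return pruned
```

Running the filter on the history just before each request keeps recent reasoning available while stopping older thinking blobs from accumulating in the context.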
+
+## Round 3 artifacts
+
+- `scripts/bakeoff/harness_no_think_flag.py` — patch-mode harness with no think key
+- `scripts/bakeoff/harness_write_no_think.py` — write-file harness with no think key
+- `scripts/bakeoff/runs_patch/gemma4-26b-no-think-flag/log.json` — 26B patch, no think (PASS)
+- `scripts/bakeoff/runs_patch/gemma4-26b-writefile-no-think/log.json` — 26B write, no think (PASS)
+- `scripts/bakeoff/runs_patch/gemma4-31b-no-think-flag/log.json` — 31B patch, no think (PASS)
+- `scripts/bakeoff/runs_patch/qwen3-coder-30b-no-think-flag/log.json` — Qwen patch, no think (PASS)
+
+## Reproducing Round 3
+
+```bash
+cd scripts/bakeoff
+
+# The correction: same harness as Round 1, just with think flag removed
+python3 harness_write_no_think.py gemma4:26b runs_patch/gemma4-26b-writefile-no-think/work runs_patch/gemma4-26b-writefile-no-think/log.json
+
+# Patch-mode without think flag
+python3 harness_no_think_flag.py gemma4:26b runs_patch/gemma4-26b-no-think-flag/work runs_patch/gemma4-26b-no-think-flag/log.json
+python3 harness_no_think_flag.py gemma4:31b-it-q4_K_M runs_patch/gemma4-31b-no-think-flag/work runs_patch/gemma4-31b-no-think-flag/log.json
+python3 harness_no_think_flag.py qwen3-coder:30b runs_patch/qwen3-coder-30b-no-think-flag/work runs_patch/qwen3-coder-30b-no-think-flag/log.json
+```