fix: walk back round-1/2 conclusions — the cause was think=false all along
Seth asked "was this with think=false?" Yes, and that was the only question that mattered. Everything I concluded in round 1 and round 2 was wrong.

Actual cause, isolated in round 3:
- At identical message state, gemma4:26b with think=false returns eval=4 (silent stop); with think unset or think=true, it returns eval=165 and emits the correct tool call.
- Original round-1 write_file harness + think unset: 26B passes in 8 iters, 20s. No mitigations needed.
- 31B dense and qwen3-coder:30b tolerate think=false; 26B MoE does not.

Red herrings (kept on record in the bakeoff doc, not silently erased):
- Round 1: "write_file tool-call argument size" (wrong)
- Round 2a: refuted the arg-size theory but for the wrong reason (still failed because think=false was still set)
- Round 2b: "cumulative tool-response context size". Truncating did make 26B pass, but by coincidence. Shorter context at the decision turn dodged the think=false side effect.

Why the existing "always think:false" guidance was misleading: it was derived from AI_Visualizer (single-turn JSON pipelines) where thinking tokens do eat num_predict invisibly. In multi-turn tool-calling agents the channels are separate and the flag has a different effect, one that is catastrophic on 26B specifically.

Doc updates:
- GOTCHAS: replaced the 26B entry with the actual cause; scoped the original "Thinking Mode Eats Context" entry to single-turn pipelines
- SYNTHESIS: split the "Mandatory Ollama Settings" block into single-turn vs multi-turn variants; updated anti-patterns and quick-start checklist
- CORPUS_cli_coding_agent.md: revised pointer and config template
- docs/reference/bakeoff-2026-04-18.md: added Round 3 section with the correction notice at the top of the file and full diagnostic methodology

New artifacts: harness_no_think_flag.py, harness_write_no_think.py, and 4 new log files demonstrating all three models pass when think is left at default.
@@ -5,6 +5,14 @@
> identical broken-code task. **n=1 per model** (plus one re-run to check
> reproducibility of a failure). Treat as a smoke test, not a benchmark.

> **Correction notice (Round 3):** Rounds 1 and 2 both misidentified the cause
> of Gemma 4 26B's silent-stop failure. Round 1 blamed `write_file` tool-call
> argument size. Round 2 blamed tool-response context size. **Round 3 shows
> both were wrong: the actual cause is the `think: false` Ollama flag.** Remove
> the flag and 26B passes on the original Round 1 harness unmodified. The
> failed hypotheses are kept below as recorded. Seth asked "was this with
> think=false?" and the answer exposed the confounder. Never presented as Plan A.

## Setup

- **Host:** steel141 (Seth's local box)
@@ -239,3 +247,136 @@ TOOL_RESULT_CAP=1200 python3 harness_patch_truncated.py gemma4:26b runs_patch/ge
TOOL_RESULT_CAP=1600 python3 harness_patch_truncated.py gemma4:26b runs_patch/gemma4-26b-cap1600/work runs_patch/gemma4-26b-cap1600/log.json
TOOL_RESULT_CAP=2000 python3 harness_patch_truncated.py gemma4:26b runs_patch/gemma4-26b-cap2000/work runs_patch/gemma4-26b-cap2000/log.json
```

---

# Round 3: the actual cause is `think: false`

Seth asked "was this with think=false?" That was the only question that mattered.

## The question that unstuck it

Every harness in Round 1 and Round 2 set `"think": False` in the Ollama payload,
per existing guidance in `GOTCHAS.md`: "Always pass `think: false` in the Ollama
payload. Seth has had success ONLY with thinking off." I copied that into the
harnesses without testing whether it was the right choice for a multi-turn
tool-calling agent loop (as opposed to the single-turn JSON pipeline that
guidance came from).

## The diagnostic

Replayed the exact 5-iteration failing state to `gemma4:26b` three times, once
per `think` setting, with the same message history and the same tool definitions:

| `think` setting | `eval_count` | Tool call emitted? |
|---|---|---|
| `false` (my harness) | **4** | ✗ |
| unset (Ollama default) | 165 | ✓ `apply_patch` |
| `true` | 165 | ✓ `apply_patch` |

Sharp, reproducible. `think: false` → silent stop. Anything else → works.

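
The three-way replay described above can be sketched roughly as follows. This is a minimal illustration, not the actual harness code: the function names (`build_payload`, `replay`, `compare_think_settings`), the `UNSET` sentinel, and the endpoint URL (assumed to be Ollama's default local address) are all assumptions, and the saved `messages` / `tools` state has to come from your own failing run.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"

# Sentinel meaning "leave the think key out of the payload entirely".
UNSET = object()


def build_payload(model, messages, tools, think=UNSET):
    """Build a non-streaming /api/chat payload; omit "think" when UNSET."""
    payload = {"model": model, "messages": messages, "tools": tools, "stream": False}
    if think is not UNSET:
        payload["think"] = think
    return payload


def replay(model, messages, tools, think=UNSET, url=OLLAMA_CHAT_URL):
    """Send one saved message state to Ollama and summarize the reply."""
    body = json.dumps(build_payload(model, messages, tools, think)).encode("utf-8")
    req = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    calls = data.get("message", {}).get("tool_calls") or []
    return {
        "eval_count": data.get("eval_count"),
        "tool_calls": [call["function"]["name"] for call in calls],
    }


def compare_think_settings(model, messages, tools):
    """Replay the same state with think=False, think unset, and think=True."""
    settings = {"false": False, "unset": UNSET, "true": True}
    return {label: replay(model, messages, tools, think) for label, think in settings.items()}
```

In this sketch "unset" means the key is absent from the JSON body, which is how the Round 3 harnesses are described (no think key at all), rather than sending an explicit null.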
## Round 3 runs: unlimited tool responses, think flag removed

| Harness | Model | Pass | Iters | Wall time |
|---|---|---|---|---|
| `write_file` (Round-1 harness, think unset) | `gemma4:26b` | **✓** | 8 | 20.6s |
| `apply_patch` (Round-2a harness, think unset) | `gemma4:26b` | **✓** | 8 | 12.5s |
| `write_file`, think unset | `gemma4:31b-it-q4_K_M` | ✓ | 8 | — |
| `apply_patch`, think unset | `gemma4:31b-it-q4_K_M` | ✓ | 8 | 66.4s |
| `apply_patch`, think unset | `qwen3-coder:30b` | ✓ | 11 | 19.5s |

**26B passes cleanly on the unmodified Round 1 harness once the think flag is
removed.** No truncation, no patch-tool swap, no mitigations.

The 31B and Qwen runs confirm that the flag doesn't matter for those models (they
pass either way). 31B is visibly slower without the think flag (66s vs 37s),
likely because it is now actually generating hidden thinking, but it still
completes.

## What Rounds 1 and 2 got wrong

### Round 1 (wrong): "26B silent-stops at the write_file tool-call argument boundary"

The `write_file` tool was present and 26B failed. But 26B also fails with
`apply_patch` (Round 2a) and passes with `write_file` when think is unset
(Round 3). The tool surface was not the cause.

### Round 2a (wrong): "Refuted the write_file hypothesis"

Correctly refuted the original hypothesis, but every run still had `think: false`
set. The observation (26B still failed with `apply_patch`) was right, and the
conclusion ("the edit tool is not the cause") was right for the wrong reason:
the edit tool was not the cause **because** the cause was `think: false`, which
Round 2a never varied.

### Round 2b (wrong): "Cumulative tool-response context size is the trigger"

The truncation sweep showed a sharp 1200-vs-1600-char boundary. That was real
behavior, but it was a *byproduct* of `think: false`. With shorter context,
`think: false` does not trigger the silent stop at every decision point;
apparently the decoding-path divergence is stochastic or state-dependent.
The underlying bug was the same (the flag); the truncation pattern was just a
workaround that happened to land on the lucky side of the dice.

The `prompt_eval_count` threshold I identified (~2100 tokens) was the cumulative
context size at the model's natural decision-to-edit turn. Below that many
tokens the model survived `think: false`; above it, `think: false` killed
generation. The number was real but the causal story was wrong.

## Why the existing GOTCHAS guidance was misleading here

`GOTCHAS.md` says: *"Thinking tokens consume num_predict budget invisibly,
returning empty responses. Seth has ONLY had success with thinking off."*

That guidance was derived from `AI_Visualizer` (per `IMPLEMENTATIONS.md` §
"Project: AI Visualizer"): single-turn JSON-generation pipelines where the
model's thinking DOES eat the `num_predict` budget and returns an empty
`content` field.

In a **multi-turn tool-calling agent loop**, the mechanics are different:
- Ollama returns separate fields for `content` and `thinking` (when populated)
- Tool calls come out through `tool_calls`, which isn't bounded by `content`
  generation the same way
- Setting `think: false` here changes the chat-template / decoding path in a
  way that makes 26B specifically (probably due to MoE routing sensitivity)
  prefer early EOS at tool-decision turns
- 31B and Qwen3-Coder are more robust to the same flag

So the guidance isn't wrong; it's out of scope. It applied to AI_Visualizer,
was over-generalized to "always think:false", and the agent corpus inherited
that over-generalization.

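
To make the two failure shapes easier to tell apart in logs, here is a hedged sketch of how a harness might label each turn of a non-streaming `/api/chat` response. The function name `classify_turn`, the label strings, and the `min_eval_tokens` cutoff are hypothetical choices for illustration; only the `content`, `thinking`, `tool_calls`, and `eval_count` fields are taken from the Ollama response shape discussed above.

```python
def classify_turn(response, min_eval_tokens=8):
    """Label one non-streaming /api/chat response from an agent loop.

    min_eval_tokens is an assumed cutoff for "stopped almost immediately";
    it is not a measured threshold.
    """
    message = response.get("message", {})
    tool_calls = message.get("tool_calls") or []
    content = (message.get("content") or "").strip()
    thinking = (message.get("thinking") or "").strip()
    eval_count = response.get("eval_count", 0)

    if tool_calls:
        return "tool_call"      # the agent loop can continue
    if content:
        return "text"           # plain assistant text, no tool requested
    if thinking:
        return "thinking_only"  # single-turn failure shape: thinking used the budget
    if eval_count < min_eval_tokens:
        return "silent_stop"    # multi-turn failure shape: near-immediate end of generation
    return "empty"
```

With a check like this in the loop, the Round 1 and Round 2 logs would have shown `silent_stop` at the decision turn, where a budget problem of the AI_Visualizer kind would show up as `thinking_only`.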
## Revised, correct recommendation for CLI coding agents

1. **Do NOT set `think: false`** in your agent payload. Leave it unset (Ollama
   default) or `true`.
2. **Do manage the `content` and `thinking` fields explicitly** if they
   accumulate in your message history: prune old thinking blobs before
   pushing past 30K context.
3. **The model / tool-surface choices don't matter the way I said they did.**
   Every Round 3 run passed with `think` unset, across all three models
   (`gemma4:26b`, `gemma4:31b-it-q4_K_M`, `qwen3-coder:30b`) and both edit
   tools (`write_file`, `apply_patch`).
4. **For single-turn JSON pipelines, the original "think: false" guidance still
   applies.** This correction is scoped to multi-turn tool-calling agents.

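
Recommendations 1 and 2 could look something like the sketch below. It is an illustration under assumptions, not the code in the Round 3 harnesses: `prune_thinking`, `agent_payload`, and the `keep_last` parameter are hypothetical names, and the sketch assumes assistant messages in the history carry their reasoning under a `thinking` key.

```python
def prune_thinking(messages, keep_last=1):
    """Return a copy of the history with `thinking` removed from all but the
    most recent `keep_last` assistant messages. The input list is not mutated."""
    assistant_idx = [i for i, m in enumerate(messages) if m.get("role") == "assistant"]
    keep = set(assistant_idx[-keep_last:]) if keep_last > 0 else set()
    pruned = []
    for i, msg in enumerate(messages):
        if msg.get("role") == "assistant" and i not in keep and "thinking" in msg:
            # Copy the message without its thinking blob.
            msg = {k: v for k, v in msg.items() if k != "thinking"}
        pruned.append(msg)
    return pruned


def agent_payload(model, messages, tools):
    """Multi-turn agent payload with no "think" key, so Ollama's default applies."""
    return {
        "model": model,
        "messages": prune_thinking(messages),
        "tools": tools,
        "stream": False,
    }
```

For simplicity this prunes on every turn; a real loop might only start pruning once the history approaches the context size mentioned in recommendation 2.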
## Round 3 artifacts

- `scripts/bakeoff/harness_no_think_flag.py`: patch-mode harness with no think key
- `scripts/bakeoff/harness_write_no_think.py`: write-file harness with no think key
- `scripts/bakeoff/runs_patch/gemma4-26b-no-think-flag/log.json`: 26B patch, no think (PASS)
- `scripts/bakeoff/runs_patch/gemma4-26b-writefile-no-think/log.json`: 26B write, no think (PASS)
- `scripts/bakeoff/runs_patch/gemma4-31b-no-think-flag/log.json`: 31B patch, no think (PASS)
- `scripts/bakeoff/runs_patch/qwen3-coder-30b-no-think-flag/log.json`: Qwen patch, no think (PASS)

## Reproducing Round 3

```bash
cd scripts/bakeoff

# The correction: same harness as Round 1, just with think flag removed
python3 harness_write_no_think.py gemma4:26b runs_patch/gemma4-26b-writefile-no-think/work runs_patch/gemma4-26b-writefile-no-think/log.json

# Patch-mode without think flag
python3 harness_no_think_flag.py gemma4:26b runs_patch/gemma4-26b-no-think-flag/work runs_patch/gemma4-26b-no-think-flag/log.json
python3 harness_no_think_flag.py gemma4:31b-it-q4_K_M runs_patch/gemma4-31b-no-think-flag/work runs_patch/gemma4-31b-no-think-flag/log.json
python3 harness_no_think_flag.py qwen3-coder:30b runs_patch/qwen3-coder-30b-no-think-flag/work runs_patch/qwen3-coder-30b-no-think-flag/log.json
```