feat: mort-bot think=true vs think=false bakeoff

Seth's challenge: "we experienced this context eating with every implementation that had think=true. mort-bot runs a loop. Can you do a bake-off?" Built a harness that replicates mort-bot's /api/chat loop verbatim (num_ctx=8192, num_predict=2048, temperature=0.7, gemma4:26b, STEP_BUDGET=20, exact payload shape) but with stubbed tools and a prebuilt 15-turn fake chat history. Ran 4 tasks × 2 think settings. Finding: on Ollama 0.20.4 the "thinking eats context" concern does NOT reproduce. Direct evidence: - Movies task step 2 (think=true) returned 905 chars of thinking. - Step 3 prompt_eval_count delta: +76 tokens (think=true) vs +135 tokens (think=false). If thinking had accumulated in the prompt, think=true would have grown by +360 tokens, not shrunk. - Ollama's chat template strips the `thinking` field when serializing assistant turns for subsequent prompts. All 4 tasks × 2 settings produced identical step counts and tool counts. Wall clocks comparable. Gemma only actually generated thinking on 1 of 4 tasks (the one with check_sethflix verify-loop); on the others with think=true it emitted 0 thinking tokens. Reconciled with the earlier coding-agent bakeoff: the two findings are orthogonal. Coding bakeoff was at num_ctx=32K with a different harness; mort at 8K doesn't touch the silent-stop regime either way. Seth's prior may have been correct on an older Ollama or in a different API shape (/api/generate has its own issues) but does not reproduce here. Concrete recommendation: mort-bot THINK=False is defensible but not load-bearing; THINK=True or unset-default would also work. Keep as-is unless a different need arises. New: docs/reference/mort-bakeoff-2026-04-18.md, scripts/mort-bakeoff/ (harness + 8 run logs). README updated with pointer.
2026-04-18 18:23:43 -04:00
parent c61394923c
commit 8436a91571
12 changed files with 988 additions and 2 deletions
@@ -0,0 +1,155 @@
+# mort-bot `think=true` vs `think=false` Bakeoff — 2026-04-18
+
+> Follow-up to Seth's challenge: "we experienced this context eating with every
+> implementation that had think=true. mort-bot runs a loop. Can you do a
+> bake-off to see if that bot would actually perform better with thinking on?"
+>
+> Short answer: **no measurable difference on Ollama 0.20.4**. The
+> "thinking-eats-context" concern doesn't reproduce in mort-bot's current
+> loop shape because Ollama's chat template strips the `thinking` field from
+> serialized history when it builds the prompt for subsequent turns. Either
+> setting is defensible.
+
+## Setup
+
+- Harness: `scripts/mort-bakeoff/harness.py` — replicates mort-bot `llm.py`
+  `run_tool_loop` call shape verbatim (model, options, payload structure,
+  `messages.append(msg)` behavior), but with stubbed tools and a prebuilt
+  ~15-turn fake chat history to simulate mid-session state.
+- Host / Ollama: steel141, 3090 Ti, Ollama 0.20.4
+- Exact config match to mort-bot production: `gemma4:26b`, `num_ctx=8192`,
+  `num_predict=2048`, `temperature=0.7`, `top_p=0.95`, `top_k=64`,
+  `keep_alive=2h`, `STEP_BUDGET=20`
+- Tasks: 4 scenarios representative of real traffic (movies, research,
+  memory, long-chain research)
+- n=1 per (task, think) cell — bakeoff, not benchmark
+
+## Results
+
+| Task | Think | Steps | Tools | Peak prompt | Thinking generated | Wall |
+|---|---|---|---|---|---|---|
+| memory | false | 2 | 1 | 1421 | 0 tok | 1.7s |
+| memory | true | 2 | 1 | 1422 | 0 tok | 2.0s |
+| research | false | 2 | 2 | 1593 | 0 tok | 2.3s |
+| research | true | 2 | 2 | 1594 | 0 tok | 2.3s |
+| movies | false | 3 | 2 | 1635 | 0 tok | 8.3s |
+| movies | true | 3 | 2 | 1577 | 905 chars / ~226 tok | 5.4s |
+| long | false | 7 | 6 | 2243 | 0 tok | 7.8s |
+| long | true | 7 | 6 | 2288 | 0 tok | 8.0s |
+
+Every (task, think) pair produced **identical step counts and tool counts**.
+
+## Does thinking accumulate in context?
+
+**No — verified directly.**
+
+On the `movies` task, step 2 with `think=true` returned 905 chars (~226 tok)
+in a separate `thinking` field. My harness then appends the full message dict
+(including `thinking`) to `messages` before the next request — exactly what
+mort-bot's `ollama_messages.append(msg)` does.
+
+Step 2 → Step 3 prompt_eval_count delta:
+
+| Setting | Delta | What that means |
+|---|---|---|
+| think=false | +135 tok | Tool result only |
+| think=true | +76 tok | Tool result only — thinking was **stripped** |
+
+If thinking had accumulated, the think=true delta would have been ~360 tokens
+(tool result + thinking). Instead it was smaller than think=false's delta.
+**Ollama 0.20.4's chat template does not include the `thinking` field when
+serializing an assistant turn for subsequent prompts.** Thinking is a per-turn
+response annotation, not a persisted conversation channel.
+
+## Does thinking actually happen?
+
+Gemma 4 is conservative. With `think=true`, it chose to generate thinking
+tokens on **1 of 4 tasks**:
+
+- `memory` — 0 thinking tokens (simple lookup)
+- `research` — 0 thinking tokens (clear sequential plan)
+- `long` — 0 thinking tokens across 7 steps (explicit step list to follow)
+- `movies` — 905 chars thinking on step 2 (the only task requiring
+  verification logic: check candidates → filter IN LIBRARY → replace → re-check)
+
+The model appears to decide whether to think based on whether the task has
+uncertainty to reason about. Following explicit multi-step instructions
+doesn't trigger it; generating + filtering recommendations does.
+
+## Answer to Seth's claim
+
+Seth's prior: "we experienced this context eating with every implementation
+that had think=true."
+
+On current Ollama (0.20.4) with mort-bot's loop shape: **the claim does not
+reproduce.** Possibilities for why Seth saw it before:
+
+1. **Older Ollama version.** Earlier 0.x releases may have serialized `thinking`
+   into subsequent prompts. Not tested here.
+2. **Different API shape.** `/api/generate` behaves differently from `/api/chat`
+   — mort-bot's own CLAUDE.md notes `/api/generate` with `think=true` returns
+   empty responses. That's a separate, real bug, not a context-growth bug.
+3. **AI_Visualizer-shaped pipelines** generate into `content` where thinking
+   tokens explicitly eat the `num_predict` budget. That is a real failure mode
+   and the original "always think:false" guidance addresses it correctly.
+4. **Confounded by other issues** — context truncation, model quirks, silent
+   failures — misattributed to thinking.
+
+The production `THINK=False` setting was defensible when adopted and remains
+defensible now. It's just not load-bearing in the way the original guidance
+suggested.
+
+## Concrete recommendation for mort-bot
+
+1. **Keep `THINK=False` as-is, or try unset-default, or `THINK=True` — pick based on whether you want the slight quality hedge on reasoning-heavy turns (like the movies check_sethflix verify-loop). No context-growth penalty either way on 0.20.4.**
+2. **Don't backport the "CLI coding agent" finding from `docs/reference/bakeoff-2026-04-18.md`** — that one was at `num_ctx=32768` with a coding harness, different regime. Mort's `num_ctx=8192` doesn't touch the silent-stop trigger.
+3. **If Ollama versions drift**, re-run this harness. The stripping behavior is an Ollama implementation detail; a future version could change it.
+
+## Reconciling with the earlier coding-agent bakeoff
+
+The two findings are orthogonal:
+
+| Bakeoff | Context | Harness | think=false | think=true |
+|---|---|---|---|---|
+| CLI coding (Round 3) | 32K | custom coding loop | 26B silent-stops | works |
+| mort-bot (this) | 8K | mort's real loop shape | works | works |
+
+Both can be true. The coding bakeoff ran into a state-specific `think=false`
+failure at 32K context. mort-bot at 8K doesn't reach that state. Seth's claim
+("think=true eats context") doesn't reproduce at 8K either because Ollama
+strips thinking from serialized history. The practical synthesis: **context
+size and API shape matter more than either single flag**.
+
+## Caveats
+
+- **Stubbed tools.** Real mort tools (SearXNG, SethSearch, web_fetch) return
+  variable-sized responses. This harness gave ~300-500 char deterministic
+  stubs. If production tool responses are much larger, context growth
+  dynamics could differ.
+- **No image/vision path.** mort does vision preprocessing via `/api/generate`
+  with `THINK_VISION=False`. That path is documented by mort's own notes as
+  broken with `think=true` on `/api/generate`. Out of scope here (this bakeoff
+  is `/api/chat` only).
+- **Production traffic has longer sessions and real chat-history overhead.**
+  The fake history was ~2.7KB; real sessions can accumulate more. Not tested
+  at the 20-step STEP_BUDGET limit.
+- **n=1.** Stochastic variance wasn't measured. Results at temp=0.7 can vary.
+
+## Artifacts
+
+- `scripts/mort-bakeoff/harness.py` — the harness
+- `scripts/mort-bakeoff/runs/memory-think-{false,true}.json`
+- `scripts/mort-bakeoff/runs/research-think-{false,true}.json`
+- `scripts/mort-bakeoff/runs/movies-think-{false,true}.json`
+- `scripts/mort-bakeoff/runs/long-think-{false,true}.json`
+
+## Reproducing
+
+```bash
+cd scripts/mort-bakeoff
+for task in memory research movies long; do
+  for t in false true; do
+    python3 harness.py $t $task runs/${task}-think-${t}.json
+  done
+done
+```