# mort-bot `think=true` vs `think=false` Bakeoff — 2026-04-18 > Follow-up to Seth's challenge: "we experienced this context eating with every > implementation that had think=true. mort-bot runs a loop. Can you do a > bake-off to see if that bot would actually perform better with thinking on?" > > Short answer: **no measurable difference on Ollama 0.20.4**. The > "thinking-eats-context" concern doesn't reproduce in mort-bot's current > loop shape because Ollama's chat template strips the `thinking` field from > serialized history when it builds the prompt for subsequent turns. Either > setting is defensible. ## Setup - Harness: `scripts/mort-bakeoff/harness.py` — replicates mort-bot `llm.py` `run_tool_loop` call shape verbatim (model, options, payload structure, `messages.append(msg)` behavior), but with stubbed tools and a prebuilt ~15-turn fake chat history to simulate mid-session state. - Host / Ollama: steel141, 3090 Ti, Ollama 0.20.4 - Exact config match to mort-bot production: `gemma4:26b`, `num_ctx=8192`, `num_predict=2048`, `temperature=0.7`, `top_p=0.95`, `top_k=64`, `keep_alive=2h`, `STEP_BUDGET=20` - Tasks: 4 scenarios representative of real traffic (movies, research, memory, long-chain research) - n=1 per (task, think) cell — bakeoff, not benchmark ## Results | Task | Think | Steps | Tools | Peak prompt | Thinking generated | Wall | |---|---|---|---|---|---|---| | memory | false | 2 | 1 | 1421 | 0 tok | 1.7s | | memory | true | 2 | 1 | 1422 | 0 tok | 2.0s | | research | false | 2 | 2 | 1593 | 0 tok | 2.3s | | research | true | 2 | 2 | 1594 | 0 tok | 2.3s | | movies | false | 3 | 2 | 1635 | 0 tok | 8.3s | | movies | true | 3 | 2 | 1577 | 905 chars / ~226 tok | 5.4s | | long | false | 7 | 6 | 2243 | 0 tok | 7.8s | | long | true | 7 | 6 | 2288 | 0 tok | 8.0s | Every (task, think) pair produced **identical step counts and tool counts**. ## Does thinking accumulate in context? **No — verified directly.** On the `movies` task, step 2 with `think=true` returned 905 chars (~226 tok) in a separate `thinking` field. My harness then appends the full message dict (including `thinking`) to `messages` before the next request — exactly what mort-bot's `ollama_messages.append(msg)` does. Step 2 → Step 3 prompt_eval_count delta: | Setting | Delta | What that means | |---|---|---| | think=false | +135 tok | Tool result only | | think=true | +76 tok | Tool result only — thinking was **stripped** | If thinking had accumulated, the think=true delta would have been ~360 tokens (tool result + thinking). Instead it was smaller than think=false's delta. **Ollama 0.20.4's chat template does not include the `thinking` field when serializing an assistant turn for subsequent prompts.** Thinking is a per-turn response annotation, not a persisted conversation channel. ## Does thinking actually happen? Gemma 4 is conservative. With `think=true`, it chose to generate thinking tokens on **1 of 4 tasks**: - `memory` — 0 thinking tokens (simple lookup) - `research` — 0 thinking tokens (clear sequential plan) - `long` — 0 thinking tokens across 7 steps (explicit step list to follow) - `movies` — 905 chars thinking on step 2 (the only task requiring verification logic: check candidates → filter IN LIBRARY → replace → re-check) The model appears to decide whether to think based on whether the task has uncertainty to reason about. Following explicit multi-step instructions doesn't trigger it; generating + filtering recommendations does. ## Answer to Seth's claim Seth's prior: "we experienced this context eating with every implementation that had think=true." On current Ollama (0.20.4) with mort-bot's loop shape: **the claim does not reproduce.** Possibilities for why Seth saw it before: 1. **Older Ollama version.** Earlier 0.x releases may have serialized `thinking` into subsequent prompts. Not tested here. 2. **Different API shape.** `/api/generate` behaves differently from `/api/chat` — mort-bot's own CLAUDE.md notes `/api/generate` with `think=true` returns empty responses. That's a separate, real bug, not a context-growth bug. 3. **AI_Visualizer-shaped pipelines** generate into `content` where thinking tokens explicitly eat the `num_predict` budget. That is a real failure mode and the original "always think:false" guidance addresses it correctly. 4. **Confounded by other issues** — context truncation, model quirks, silent failures — misattributed to thinking. The production `THINK=False` setting was defensible when adopted and remains defensible now. It's just not load-bearing in the way the original guidance suggested. ## Concrete recommendation for mort-bot 1. **Keep `THINK=False` as-is, or try unset-default, or `THINK=True` — pick based on whether you want the slight quality hedge on reasoning-heavy turns (like the movies check_sethflix verify-loop). No context-growth penalty either way on 0.20.4.** 2. **Don't backport the "CLI coding agent" finding from `docs/reference/bakeoff-2026-04-18.md`** — that one was at `num_ctx=32768` with a coding harness, different regime. Mort's `num_ctx=8192` doesn't touch the silent-stop trigger. 3. **If Ollama versions drift**, re-run this harness. The stripping behavior is an Ollama implementation detail; a future version could change it. ## Reconciling with the earlier coding-agent bakeoff The two findings are orthogonal: | Bakeoff | Context | Harness | think=false | think=true | |---|---|---|---|---| | CLI coding (Round 3) | 32K | custom coding loop | 26B silent-stops | works | | mort-bot (this) | 8K | mort's real loop shape | works | works | Both can be true. The coding bakeoff ran into a state-specific `think=false` failure at 32K context. mort-bot at 8K doesn't reach that state. Seth's claim ("think=true eats context") doesn't reproduce at 8K either because Ollama strips thinking from serialized history. The practical synthesis: **context size and API shape matter more than either single flag**. ## Caveats - **Stubbed tools.** Real mort tools (SearXNG, SethSearch, web_fetch) return variable-sized responses. This harness gave ~300-500 char deterministic stubs. If production tool responses are much larger, context growth dynamics could differ. - **No image/vision path.** mort does vision preprocessing via `/api/generate` with `THINK_VISION=False`. That path is documented by mort's own notes as broken with `think=true` on `/api/generate`. Out of scope here (this bakeoff is `/api/chat` only). - **Production traffic has longer sessions and real chat-history overhead.** The fake history was ~2.7KB; real sessions can accumulate more. Not tested at the 20-step STEP_BUDGET limit. - **n=1.** Stochastic variance wasn't measured. Results at temp=0.7 can vary. ## Artifacts - `scripts/mort-bakeoff/harness.py` — the harness - `scripts/mort-bakeoff/runs/memory-think-{false,true}.json` - `scripts/mort-bakeoff/runs/research-think-{false,true}.json` - `scripts/mort-bakeoff/runs/movies-think-{false,true}.json` - `scripts/mort-bakeoff/runs/long-think-{false,true}.json` ## Reproducing ```bash cd scripts/mort-bakeoff for task in memory research movies long; do for t in false true; do python3 harness.py $t $task runs/${task}-think-${t}.json done done ```