Seth's challenge: "we experienced this context eating with every implementation that had think=true. mort-bot runs a loop. Can you do a bake-off?"

Built a harness that replicates mort-bot's /api/chat loop verbatim (num_ctx=8192, num_predict=2048, temperature=0.7, gemma4:26b, STEP_BUDGET=20, exact payload shape) but with stubbed tools and a prebuilt 15-turn fake chat history. Ran 4 tasks × 2 think settings.

Finding: on Ollama 0.20.4 the "thinking eats context" concern does NOT reproduce.

Direct evidence:
- Movies task step 2 (think=true) returned 905 chars of thinking.
- Step 3 prompt_eval_count delta: +76 tokens (think=true) vs +135 tokens (think=false). If thinking had accumulated in the prompt, think=true would have grown by +360 tokens, not shrunk.
- Ollama's chat template strips the `thinking` field when serializing assistant turns for subsequent prompts.

All 4 tasks × 2 settings produced identical step counts and tool counts. Wall clocks comparable. Gemma only actually generated thinking on 1 of 4 tasks (the one with check_sethflix verify-loop); on the others with think=true it emitted 0 thinking tokens.

Reconciled with the earlier coding-agent bakeoff: the two findings are orthogonal. Coding bakeoff was at num_ctx=32K with a different harness; mort at 8K doesn't touch the silent-stop regime either way. Seth's prior may have been correct on an older Ollama or in a different API shape (/api/generate has its own issues) but does not reproduce here.

Concrete recommendation: mort-bot THINK=False is defensible but not load-bearing; THINK=True or unset-default would also work. Keep as-is unless a different need arises.

New: docs/reference/mort-bakeoff-2026-04-18.md, scripts/mort-bakeoff/ (harness + 8 run logs). README updated with pointer.
mort-bot think=true vs think=false Bakeoff — 2026-04-18
Follow-up to Seth's challenge: "we experienced this context eating with every implementation that had think=true. mort-bot runs a loop. Can you do a bake-off to see if that bot would actually perform better with thinking on?"
Short answer: no measurable difference on Ollama 0.20.4. The "thinking-eats-context" concern doesn't reproduce in mort-bot's current loop shape because Ollama's chat template strips the `thinking` field from serialized history when it builds the prompt for subsequent turns. Either setting is defensible.
Setup
- Harness: `scripts/mort-bakeoff/harness.py`, which replicates mort-bot `llm.py` `run_tool_loop` call shape verbatim (model, options, payload structure, `messages.append(msg)` behavior), but with stubbed tools and a prebuilt ~15-turn fake chat history to simulate mid-session state.
- Host / Ollama: steel141, 3090 Ti, Ollama 0.20.4
- Exact config match to mort-bot production: `gemma4:26b`, `num_ctx=8192`, `num_predict=2048`, `temperature=0.7`, `top_p=0.95`, `top_k=64`, `keep_alive=2h`, `STEP_BUDGET=20`
- Tasks: 4 scenarios representative of real traffic (movies, research, memory, long-chain research)
- n=1 per (task, think) cell: bakeoff, not benchmark
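The call shape described above can be sketched roughly as follows. This is a minimal illustration, not the contents of `harness.py` or mort-bot's `llm.py`: the function names `build_payload` and `run_tool_loop`, the injected `chat` and `run_tool` callables, `stream=False`, and the shape of the tool-result message are all assumptions made for the example. Only the model name, the option values, `keep_alive`, and `STEP_BUDGET` come from the setup list.

```python
from typing import Any, Callable, Optional

MODEL = "gemma4:26b"
STEP_BUDGET = 20
OPTIONS = {
    "num_ctx": 8192,
    "num_predict": 2048,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 64,
}


def build_payload(
    messages: list[dict[str, Any]],
    think: bool,
    tools: Optional[list[dict[str, Any]]] = None,
) -> dict[str, Any]:
    """Build an /api/chat request body with the production option values."""
    payload: dict[str, Any] = {
        "model": MODEL,
        "messages": messages,
        "stream": False,  # assumption: the source does not say whether streaming is used
        "think": think,
        "keep_alive": "2h",
        "options": dict(OPTIONS),
    }
    if tools:
        payload["tools"] = tools
    return payload


def run_tool_loop(
    messages: list[dict[str, Any]],
    think: bool,
    chat: Callable[[dict[str, Any]], dict[str, Any]],
    run_tool: Callable[[str, dict[str, Any]], str],
) -> list[int]:
    """Run up to STEP_BUDGET steps and return prompt_eval_count per step.

    `chat` stands in for an HTTP POST to /api/chat and `run_tool` for a
    stubbed tool; both are hypothetical hooks for this sketch.
    """
    prompt_counts: list[int] = []
    for _ in range(STEP_BUDGET):
        response = chat(build_payload(messages, think))
        prompt_counts.append(response.get("prompt_eval_count", 0))
        msg = response["message"]
        # Append the full message dict, including any "thinking" field.
        messages.append(msg)
        calls = msg.get("tool_calls") or []
        if not calls:
            break
        for call in calls:
            fn = call["function"]
            result = run_tool(fn["name"], fn.get("arguments", {}))
            messages.append({"role": "tool", "content": result})
    return prompt_counts
```

Injecting `chat` keeps the loop testable without a running Ollama instance; in a real run it would POST the payload to `/api/chat` and return the decoded JSON response.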
Results
| Task | Think | Steps | Tools | Peak prompt (tok) | Thinking generated | Wall |
|---|---|---|---|---|---|---|
| memory | false | 2 | 1 | 1421 | 0 tok | 1.7s |
| memory | true | 2 | 1 | 1422 | 0 tok | 2.0s |
| research | false | 2 | 2 | 1593 | 0 tok | 2.3s |
| research | true | 2 | 2 | 1594 | 0 tok | 2.3s |
| movies | false | 3 | 2 | 1635 | 0 tok | 8.3s |
| movies | true | 3 | 2 | 1577 | 905 chars / ~226 tok | 5.4s |
| long | false | 7 | 6 | 2243 | 0 tok | 7.8s |
| long | true | 7 | 6 | 2288 | 0 tok | 8.0s |
For every task, the think=false and think=true runs produced identical step counts and tool counts.
Does thinking accumulate in context?
No — verified directly.
On the movies task, step 2 with think=true returned 905 chars (~226 tok)
in a separate thinking field. My harness then appends the full message dict
(including thinking) to messages before the next request — exactly what
mort-bot's ollama_messages.append(msg) does.
Step 2 → Step 3 prompt_eval_count delta:
| Setting | Delta | What that means |
|---|---|---|
| think=false | +135 tok | Tool result only |
| think=true | +76 tok | Tool result only — thinking was stripped |
If thinking had accumulated, the think=true delta would have been ~360 tokens
(the ~135 tok delta seen with think=false plus ~226 tok of thinking). Instead
it was smaller than think=false's delta.
Ollama 0.20.4's chat template does not include the thinking field when
serializing an assistant turn for subsequent prompts. Thinking is a per-turn
response annotation, not a persisted conversation channel.
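The arithmetic behind the ~360 token figure can be checked with a few lines. This sketch assumes the 4 chars/token heuristic implied by "905 chars / ~226 tok"; the helper names are hypothetical and the inputs are the movies-task numbers reported above.

```python
def estimate_tokens(chars: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from a character count (heuristic, not a tokenizer)."""
    return round(chars / chars_per_token)


def expected_delta_if_accumulated(tool_only_delta: int, thinking_chars: int) -> int:
    """Prompt growth we would expect if thinking were replayed into the next prompt."""
    return tool_only_delta + estimate_tokens(thinking_chars)


thinking_tokens = estimate_tokens(905)               # 226
expected = expected_delta_if_accumulated(135, 905)   # 361
observed = 76                                        # think=true, step 2 -> step 3

print(f"thinking ~{thinking_tokens} tok, expected +{expected} if accumulated, observed +{observed}")
```

The observed growth is well below even the think=false delta, which is what stripping would predict.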
Does thinking actually happen?
Gemma 4 is conservative. With think=true, it chose to generate thinking
tokens on 1 of 4 tasks:
- `memory`: 0 thinking tokens (simple lookup)
- `research`: 0 thinking tokens (clear sequential plan)
- `long`: 0 thinking tokens across 7 steps (explicit step list to follow)
- `movies`: 905 chars thinking on step 2 (the only task requiring verification logic: check candidates → filter IN LIBRARY → replace → re-check)
The model appears to decide whether to think based on whether the task has uncertainty to reason about. Following explicit multi-step instructions doesn't trigger it; generating + filtering recommendations does.
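A per-step tally like the one above could be produced from the assistant messages a run appends. The sketch below assumes each assistant message is a dict that may carry an optional `thinking` string, as in Ollama's `/api/chat` responses; the function names are hypothetical and the on-disk layout of the run logs is not assumed.

```python
from typing import Any


def thinking_by_step(messages: list[dict[str, Any]]) -> dict[int, int]:
    """Map each assistant step (1-based) to the character count of its thinking."""
    counts: dict[int, int] = {}
    step = 0
    for msg in messages:
        if msg.get("role") != "assistant":
            continue  # skip user and tool messages
        step += 1
        counts[step] = len(msg.get("thinking") or "")
    return counts


def steps_with_thinking(messages: list[dict[str, Any]]) -> list[int]:
    """Return the assistant steps that produced any thinking text."""
    return [step for step, chars in thinking_by_step(messages).items() if chars > 0]
```

Pass only the messages appended during the loop; the prebuilt fake history also contains assistant turns and would otherwise shift the step numbering.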
Answer to Seth's claim
Seth's prior: "we experienced this context eating with every implementation that had think=true."
On current Ollama (0.20.4) with mort-bot's loop shape: the claim does not reproduce. Possibilities for why Seth saw it before:
- Older Ollama version. Earlier 0.x releases may have serialized `thinking` into subsequent prompts. Not tested here.
- Different API shape. `/api/generate` behaves differently from `/api/chat`: mort-bot's own CLAUDE.md notes `/api/generate` with `think=true` returns empty responses. That's a separate, real bug, not a context-growth bug.
- AI_Visualizer-shaped pipelines generate into `content` where thinking tokens explicitly eat the `num_predict` budget. That is a real failure mode and the original "always think:false" guidance addresses it correctly.
- Confounded by other issues (context truncation, model quirks, silent failures) misattributed to thinking.
The production THINK=False setting was defensible when adopted and remains
defensible now. It's just not load-bearing in the way the original guidance
suggested.
Concrete recommendation for mort-bot
- Keep `THINK=False` as-is, or try unset-default, or `THINK=True`: pick based on whether you want the slight quality hedge on reasoning-heavy turns (like the movies check_sethflix verify-loop). No context-growth penalty either way on 0.20.4.
- Don't backport the "CLI coding agent" finding from `docs/reference/bakeoff-2026-04-18.md`: that one was at `num_ctx=32768` with a coding harness, a different regime. Mort's `num_ctx=8192` doesn't touch the silent-stop trigger.
- If Ollama versions drift, re-run this harness. The stripping behavior is an Ollama implementation detail; a future version could change it.
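One way to notice version drift is to compare the running server's version against the one tested here. The sketch below is an assumption-laden illustration: it relies on Ollama's `GET /api/version` endpoint and the default `http://localhost:11434` address, and the helper names and the "any difference means re-run" policy are hypothetical choices, not part of the harness.

```python
import json
import urllib.request

TESTED_VERSION = "0.20.4"  # the version this bakeoff ran against


def parse_version(text: str) -> tuple[int, ...]:
    """Parse '0.20.4' or 'v0.20.4-rc1' into (0, 20, 4), ignoring any suffix."""
    core = text.strip().lstrip("v").split("-", 1)[0]
    return tuple(int(part) for part in core.split("."))


def needs_rerun(reported: str, tested: str = TESTED_VERSION) -> bool:
    """True when the server version differs from the version the bakeoff used."""
    return parse_version(reported) != parse_version(tested)


def fetch_ollama_version(base_url: str = "http://localhost:11434") -> str:
    """Ask a running Ollama server for its version string."""
    with urllib.request.urlopen(f"{base_url}/api/version", timeout=5) as resp:
        return json.load(resp)["version"]


# Example (requires a running server):
#   if needs_rerun(fetch_ollama_version()):
#       print("Ollama version changed; re-run scripts/mort-bakeoff/harness.py")
```

A version match does not prove the stripping behavior is unchanged; it only signals when the measurement is known to be stale.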
Reconciling with the earlier coding-agent bakeoff
The two findings are orthogonal:
| Bakeoff | Context | Harness | think=false | think=true |
|---|---|---|---|---|
| CLI coding (Round 3) | 32K | custom coding loop | 26B silent-stops | works |
| mort-bot (this) | 8K | mort's real loop shape | works | works |
Both can be true. The coding bakeoff ran into a state-specific think=false
failure at 32K context. mort-bot at 8K doesn't reach that state. Seth's claim
("think=true eats context") doesn't reproduce at 8K either because Ollama
strips thinking from serialized history. The practical synthesis: context
size and API shape matter more than either single flag.
Caveats
- Stubbed tools. Real mort tools (SearXNG, SethSearch, web_fetch) return variable-sized responses. This harness gave ~300-500 char deterministic stubs. If production tool responses are much larger, context growth dynamics could differ.
- No image/vision path. mort does vision preprocessing via `/api/generate` with `THINK_VISION=False`. That path is documented by mort's own notes as broken with `think=true` on `/api/generate`. Out of scope here (this bakeoff is `/api/chat` only).
- Production traffic has longer sessions and real chat-history overhead. The fake history was ~2.7KB; real sessions can accumulate more. Not tested at the 20-step STEP_BUDGET limit.
- n=1. Stochastic variance wasn't measured. Results at temp=0.7 can vary.
Artifacts
- `scripts/mort-bakeoff/harness.py`: the harness
- `scripts/mort-bakeoff/runs/memory-think-{false,true}.json`
- `scripts/mort-bakeoff/runs/research-think-{false,true}.json`
- `scripts/mort-bakeoff/runs/movies-think-{false,true}.json`
- `scripts/mort-bakeoff/runs/long-think-{false,true}.json`
Reproducing
cd scripts/mort-bakeoff
# Run every task once with thinking off and once with thinking on.
for task in memory research movies long; do
  for t in false true; do
    python3 harness.py "$t" "$task" "runs/${task}-think-${t}.json"
  done
done