Files
gemma4-research/docs/reference/mort-bakeoff-2026-04-18.md
Mortdecai 8436a91571 feat: mort-bot think=true vs think=false bakeoff
Seth's challenge: "we experienced this context eating with every
implementation that had think=true. mort-bot runs a loop. Can you do
a bake-off?"

Built a harness that replicates mort-bot's /api/chat loop verbatim
(num_ctx=8192, num_predict=2048, temperature=0.7, gemma4:26b,
STEP_BUDGET=20, exact payload shape) but with stubbed tools and a
prebuilt 15-turn fake chat history. Ran 4 tasks × 2 think settings.

Finding: on Ollama 0.20.4 the "thinking eats context" concern does
NOT reproduce. Direct evidence:
- Movies task step 2 (think=true) returned 905 chars of thinking.
- Step 3 prompt_eval_count delta: +76 tokens (think=true) vs +135
  tokens (think=false). If thinking had accumulated in the prompt,
  think=true would have grown by +360 tokens, not shrunk.
- Ollama's chat template strips the `thinking` field when serializing
  assistant turns for subsequent prompts.

All 4 tasks × 2 settings produced identical step counts and tool
counts. Wall clocks comparable. Gemma only actually generated
thinking on 1 of 4 tasks (the one with check_sethflix verify-loop);
on the others with think=true it emitted 0 thinking tokens.

Reconciled with the earlier coding-agent bakeoff: the two findings
are orthogonal. Coding bakeoff was at num_ctx=32K with a different
harness; mort at 8K doesn't touch the silent-stop regime either way.
Seth's prior may have been correct on an older Ollama or in a
different API shape (/api/generate has its own issues) but does not
reproduce here.

Concrete recommendation: mort-bot THINK=False is defensible but not
load-bearing; THINK=True or unset-default would also work. Keep as-is
unless a different need arises.

New: docs/reference/mort-bakeoff-2026-04-18.md, scripts/mort-bakeoff/
(harness + 8 run logs). README updated with pointer.
2026-04-18 18:23:43 -04:00

7.1 KiB

mort-bot think=true vs think=false Bakeoff — 2026-04-18

Follow-up to Seth's challenge: "we experienced this context eating with every implementation that had think=true. mort-bot runs a loop. Can you do a bake-off to see if that bot would actually perform better with thinking on?"

Short answer: no measurable difference on Ollama 0.20.4. The "thinking-eats-context" concern doesn't reproduce in mort-bot's current loop shape because Ollama's chat template strips the thinking field from serialized history when it builds the prompt for subsequent turns. Either setting is defensible.

Setup

  • Harness: scripts/mort-bakeoff/harness.py — replicates mort-bot llm.py run_tool_loop call shape verbatim (model, options, payload structure, messages.append(msg) behavior), but with stubbed tools and a prebuilt ~15-turn fake chat history to simulate mid-session state.
  • Host / Ollama: steel141, 3090 Ti, Ollama 0.20.4
  • Exact config match to mort-bot production: gemma4:26b, num_ctx=8192, num_predict=2048, temperature=0.7, top_p=0.95, top_k=64, keep_alive=2h, STEP_BUDGET=20
  • Tasks: 4 scenarios representative of real traffic (movies, research, memory, long-chain research)
  • n=1 per (task, think) cell — bakeoff, not benchmark

Results

Task Think Steps Tools Peak prompt Thinking generated Wall
memory false 2 1 1421 0 tok 1.7s
memory true 2 1 1422 0 tok 2.0s
research false 2 2 1593 0 tok 2.3s
research true 2 2 1594 0 tok 2.3s
movies false 3 2 1635 0 tok 8.3s
movies true 3 2 1577 905 chars / ~226 tok 5.4s
long false 7 6 2243 0 tok 7.8s
long true 7 6 2288 0 tok 8.0s

Every (task, think) pair produced identical step counts and tool counts.

Does thinking accumulate in context?

No — verified directly.

On the movies task, step 2 with think=true returned 905 chars (~226 tok) in a separate thinking field. My harness then appends the full message dict (including thinking) to messages before the next request — exactly what mort-bot's ollama_messages.append(msg) does.

Step 2 → Step 3 prompt_eval_count delta:

Setting Delta What that means
think=false +135 tok Tool result only
think=true +76 tok Tool result only — thinking was stripped

If thinking had accumulated, the think=true delta would have been ~360 tokens (tool result + thinking). Instead it was smaller than think=false's delta. Ollama 0.20.4's chat template does not include the thinking field when serializing an assistant turn for subsequent prompts. Thinking is a per-turn response annotation, not a persisted conversation channel.

Does thinking actually happen?

Gemma 4 is conservative. With think=true, it chose to generate thinking tokens on 1 of 4 tasks:

  • memory — 0 thinking tokens (simple lookup)
  • research — 0 thinking tokens (clear sequential plan)
  • long — 0 thinking tokens across 7 steps (explicit step list to follow)
  • movies — 905 chars thinking on step 2 (the only task requiring verification logic: check candidates → filter IN LIBRARY → replace → re-check)

The model appears to decide whether to think based on whether the task has uncertainty to reason about. Following explicit multi-step instructions doesn't trigger it; generating + filtering recommendations does.

Answer to Seth's claim

Seth's prior: "we experienced this context eating with every implementation that had think=true."

On current Ollama (0.20.4) with mort-bot's loop shape: the claim does not reproduce. Possibilities for why Seth saw it before:

  1. Older Ollama version. Earlier 0.x releases may have serialized thinking into subsequent prompts. Not tested here.
  2. Different API shape. /api/generate behaves differently from /api/chat — mort-bot's own CLAUDE.md notes /api/generate with think=true returns empty responses. That's a separate, real bug, not a context-growth bug.
  3. AI_Visualizer-shaped pipelines generate into content where thinking tokens explicitly eat the num_predict budget. That is a real failure mode and the original "always think:false" guidance addresses it correctly.
  4. Confounded by other issues — context truncation, model quirks, silent failures — misattributed to thinking.

The production THINK=False setting was defensible when adopted and remains defensible now. It's just not load-bearing in the way the original guidance suggested.

Concrete recommendation for mort-bot

  1. Keep THINK=False as-is, or try unset-default, or THINK=True — pick based on whether you want the slight quality hedge on reasoning-heavy turns (like the movies check_sethflix verify-loop). No context-growth penalty either way on 0.20.4.
  2. Don't backport the "CLI coding agent" finding from docs/reference/bakeoff-2026-04-18.md — that one was at num_ctx=32768 with a coding harness, different regime. Mort's num_ctx=8192 doesn't touch the silent-stop trigger.
  3. If Ollama versions drift, re-run this harness. The stripping behavior is an Ollama implementation detail; a future version could change it.

Reconciling with the earlier coding-agent bakeoff

The two findings are orthogonal:

Bakeoff Context Harness think=false think=true
CLI coding (Round 3) 32K custom coding loop 26B silent-stops works
mort-bot (this) 8K mort's real loop shape works works

Both can be true. The coding bakeoff ran into a state-specific think=false failure at 32K context. mort-bot at 8K doesn't reach that state. Seth's claim ("think=true eats context") doesn't reproduce at 8K either because Ollama strips thinking from serialized history. The practical synthesis: context size and API shape matter more than either single flag.

Caveats

  • Stubbed tools. Real mort tools (SearXNG, SethSearch, web_fetch) return variable-sized responses. This harness gave ~300-500 char deterministic stubs. If production tool responses are much larger, context growth dynamics could differ.
  • No image/vision path. mort does vision preprocessing via /api/generate with THINK_VISION=False. That path is documented by mort's own notes as broken with think=true on /api/generate. Out of scope here (this bakeoff is /api/chat only).
  • Production traffic has longer sessions and real chat-history overhead. The fake history was ~2.7KB; real sessions can accumulate more. Not tested at the 20-step STEP_BUDGET limit.
  • n=1. Stochastic variance wasn't measured. Results at temp=0.7 can vary.

Artifacts

  • scripts/mort-bakeoff/harness.py — the harness
  • scripts/mort-bakeoff/runs/memory-think-{false,true}.json
  • scripts/mort-bakeoff/runs/research-think-{false,true}.json
  • scripts/mort-bakeoff/runs/movies-think-{false,true}.json
  • scripts/mort-bakeoff/runs/long-think-{false,true}.json

Reproducing

cd scripts/mort-bakeoff
for task in memory research movies long; do
  for t in false true; do
    python3 harness.py $t $task runs/${task}-think-${t}.json
  done
done