# mort-bot `think=true` vs `think=false` Bakeoff — 2026-04-18

> Follow-up to Seth's challenge: "we experienced this context eating with every
> implementation that had think=true. mort-bot runs a loop. Can you do a
> bake-off to see if that bot would actually perform better with thinking on?"
>
> Short answer: **no measurable difference on Ollama 0.20.4**. The
> "thinking-eats-context" concern doesn't reproduce in mort-bot's current
> loop shape because Ollama's chat template strips the `thinking` field from
> serialized history when it builds the prompt for subsequent turns. Either
> setting is defensible.

## Setup

- Harness: `scripts/mort-bakeoff/harness.py` — replicates mort-bot `llm.py`
  `run_tool_loop` call shape verbatim (model, options, payload structure,
  `messages.append(msg)` behavior), but with stubbed tools and a prebuilt
  ~15-turn fake chat history to simulate mid-session state.
- Host / Ollama: steel141, 3090 Ti, Ollama 0.20.4
- Exact config match to mort-bot production: `gemma4:26b`, `num_ctx=8192`,
  `num_predict=2048`, `temperature=0.7`, `top_p=0.95`, `top_k=64`,
  `keep_alive=2h`, `STEP_BUDGET=20`
- Tasks: 4 scenarios representative of real traffic (movies, research,
  memory, long-chain research)
- n=1 per (task, think) cell — bakeoff, not benchmark
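
For reference, the settings above map onto an `/api/chat` request body roughly as follows. This is a minimal sketch, not the harness source: the `build_payload` helper and its argument names are hypothetical, `stream: False` is an assumption, and the placement of `think` and `keep_alive` at the top level (with sampling parameters under `options`) follows Ollama's chat API rather than anything quoted from `llm.py`.

```python
from typing import Any

# Values copied from the Setup list; the helper itself is illustrative only.
MODEL = "gemma4:26b"
OPTIONS = {
    "num_ctx": 8192,
    "num_predict": 2048,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 64,
}


def build_payload(messages: list[dict], tools: list[dict], think: bool) -> dict[str, Any]:
    """Assemble one non-streaming /api/chat request body (hypothetical helper)."""
    return {
        "model": MODEL,
        "messages": messages,
        "tools": tools,
        "think": think,           # the only field that differs between cells
        "stream": False,          # assumption: the loop reads whole responses
        "keep_alive": "2h",
        "options": dict(OPTIONS), # copy so callers cannot mutate the shared dict
    }
```

Only `think` varies between the two runs of a task; everything else is held constant so that differences in step count or prompt size can be attributed to that flag.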

## Results

| Task | Think | Steps | Tools | Peak prompt (tok) | Thinking generated | Wall time |
|---|---|---|---|---|---|---|
| memory | false | 2 | 1 | 1421 | 0 tok | 1.7s |
| memory | true | 2 | 1 | 1422 | 0 tok | 2.0s |
| research | false | 2 | 2 | 1593 | 0 tok | 2.3s |
| research | true | 2 | 2 | 1594 | 0 tok | 2.3s |
| movies | false | 3 | 2 | 1635 | 0 tok | 8.3s |
| movies | true | 3 | 2 | 1577 | 905 chars / ~226 tok | 5.4s |
| long | false | 7 | 6 | 2243 | 0 tok | 7.8s |
| long | true | 7 | 6 | 2288 | 0 tok | 8.0s |

Within each task, the `think=false` and `think=true` runs produced **identical step counts and tool counts**.

## Does thinking accumulate in context?

**No. Verified directly on the one run where thinking was generated.**

On the `movies` task, step 2 with `think=true` returned 905 chars (~226 tok)
in a separate `thinking` field. My harness then appends the full message dict
(including `thinking`) to `messages` before the next request — exactly what
mort-bot's `ollama_messages.append(msg)` does.
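
The append behavior described above can be sketched as a loop that records `prompt_eval_count` at every step. This is an illustrative reconstruction under assumptions, not the contents of `harness.py`: the function name, the injected `chat` callable, and the `tools_impl` mapping are all hypothetical, and the response shape (`message`, `tool_calls`, `prompt_eval_count`) is the one Ollama's chat API returns.

```python
def run_tool_loop_sketch(chat, messages, tools_impl, step_budget=20):
    """Drive a chat/tool loop and return prompt_eval_count for each step.

    `chat` is any callable that takes the message list and returns an
    Ollama-style response dict; it is injected so the sketch runs offline.
    """
    prompt_sizes = []
    for _ in range(step_budget):
        resp = chat(messages)
        prompt_sizes.append(resp["prompt_eval_count"])
        msg = resp["message"]
        # Append the whole assistant dict, `thinking` field included.
        messages.append(msg)
        calls = msg.get("tool_calls") or []
        if not calls:
            break  # plain answer, no further tool use
        for call in calls:
            fn = call["function"]
            result = tools_impl[fn["name"]](**fn["arguments"])
            messages.append({"role": "tool", "content": result})
    return prompt_sizes
```

Differences between consecutive entries of the returned list are the per-step prompt growth that the accumulation check relies on.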

Step 2 → Step 3 `prompt_eval_count` delta:

| Setting | Delta | What that means |
|---|---|---|
| think=false | +135 tok | Tool result only |
| think=true | +76 tok | Tool result only — thinking was **stripped** |

If thinking had accumulated, the think=true delta would have been ~360 tokens
(the ~135-token tool-result delta plus ~226 tokens of thinking). Instead it
was smaller than think=false's delta.
**Ollama 0.20.4's chat template does not include the `thinking` field when
serializing an assistant turn for subsequent prompts.** Thinking is a per-turn
response annotation, not a persisted conversation channel.
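
The arithmetic behind that conclusion, worked through with the numbers reported above. The 4 chars/token ratio is an assumption inferred from the "905 chars / ~226 tok" figure, not a tokenizer measurement, and the variable names are illustrative.

```python
def approx_tokens(chars: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; 4 chars/token is a heuristic, not a tokenizer."""
    return round(chars / chars_per_token)


thinking_tok = approx_tokens(905)   # 226
tool_only_delta = 135               # think=false, step 2 -> step 3
observed_delta = 76                 # think=true, step 2 -> step 3

# If the thinking text were replayed into the next prompt, the delta should
# be at least the tool-result growth plus the thinking tokens.
expected_if_accumulated = tool_only_delta + thinking_tok   # 361
accumulated = observed_delta >= expected_if_accumulated    # False
```

The observed delta falls short of the accumulation estimate by roughly 285 tokens, which is too large a gap to explain by differences in the assistant's visible content alone.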

## Does thinking actually happen?

Gemma 4 is conservative. With `think=true`, it chose to generate thinking
tokens on **1 of 4 tasks**:

- `memory` — 0 thinking tokens (simple lookup)
- `research` — 0 thinking tokens (clear sequential plan)
- `long` — 0 thinking tokens across 7 steps (explicit step list to follow)
- `movies` — 905 chars thinking on step 2 (the only task requiring
  verification logic: check candidates → filter IN LIBRARY → replace → re-check)

The model appears to decide whether to think based on whether the task has
uncertainty to reason about. Following explicit multi-step instructions
doesn't trigger it; generating + filtering recommendations does.
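
One way the "Thinking generated" figures could be derived from a run's message list is sketched below. The helper name is hypothetical, the run files' actual schema is not shown in this writeup, and the token figure reuses the rough 4 chars/token heuristic.

```python
def thinking_stats(messages, chars_per_token=4.0):
    """Return (chars, approx_tokens) of thinking across assistant messages."""
    chars = sum(
        len(m.get("thinking") or "")   # field may be absent, None, or empty
        for m in messages
        if m.get("role") == "assistant"
    )
    return chars, round(chars / chars_per_token)
```

A run in which no assistant message carries a non-empty `thinking` field reports `(0, 0)`, which is what three of the four `think=true` runs looked like here.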

## Answer to Seth's claim

Seth's prior: "we experienced this context eating with every implementation
that had think=true."

On current Ollama (0.20.4) with mort-bot's loop shape: **the claim does not
reproduce.** Possibilities for why Seth saw it before:

1. **Older Ollama version.** Earlier 0.x releases may have serialized `thinking`
   into subsequent prompts. Not tested here.
2. **Different API shape.** `/api/generate` behaves differently from `/api/chat`
   — mort-bot's own CLAUDE.md notes `/api/generate` with `think=true` returns
   empty responses. That's a separate, real bug, not a context-growth bug.
3. **AI_Visualizer-shaped pipelines** generate into `content` where thinking
   tokens explicitly eat the `num_predict` budget. That is a real failure mode
   and the original "always think:false" guidance addresses it correctly.
4. **Confounded by other issues** — context truncation, model quirks, silent
   failures — misattributed to thinking.

The production `THINK=False` setting was defensible when adopted and remains
defensible now. It's just not load-bearing in the way the original guidance
suggested.

## Concrete recommendation for mort-bot

1. **Keep `THINK=False`, leave it unset, or switch to `THINK=True`.** Choose based on whether you want a slight quality hedge on reasoning-heavy turns (like the movies `check_sethflix` verify loop). There is no context-growth penalty either way on 0.20.4.
2. **Don't backport the "CLI coding agent" finding from `docs/reference/bakeoff-2026-04-18.md`** — that one was at `num_ctx=32768` with a coding harness, different regime. Mort's `num_ctx=8192` doesn't touch the silent-stop trigger.
3. **If Ollama versions drift**, re-run this harness. The stripping behavior is an Ollama implementation detail; a future version could change it.
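
When re-running after an Ollama upgrade, a coarse check like the following could flag a change in stripping behavior. It is a sketch under assumptions: the `RESULTS` tuples are transcribed from the Results table, the function name is hypothetical, and the `share=0.5` threshold is an arbitrary starting point rather than a calibrated value.

```python
# task -> (think=false peak prompt, think=true peak prompt, thinking tokens)
RESULTS = {
    "memory": (1421, 1422, 0),
    "research": (1593, 1594, 0),
    "movies": (1635, 1577, 226),
    "long": (2243, 2288, 0),
}


def suspect_accumulation(results, share=0.5):
    """Return tasks whose think=true peak prompt exceeds the think=false peak
    by at least `share` of the thinking tokens generated in that run."""
    flagged = []
    for task, (peak_off, peak_on, thinking_tok) in results.items():
        # Runs with no thinking cannot show accumulation; skip them.
        if thinking_tok > 0 and peak_on - peak_off >= share * thinking_tok:
            flagged.append(task)
    return flagged
```

On the 0.20.4 numbers this flags nothing. With n=1 per cell, a flagged task on a newer version would be a prompt to inspect per-step deltas, not proof of a regression.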

## Reconciling with the earlier coding-agent bakeoff

The two findings are orthogonal:

| Bakeoff | Context | Harness | think=false | think=true |
|---|---|---|---|---|
| CLI coding (Round 3) | 32K | custom coding loop | 26B silent-stops | works |
| mort-bot (this) | 8K | mort's real loop shape | works | works |

Both can be true. The coding bakeoff ran into a state-specific `think=false`
failure at 32K context. mort-bot at 8K doesn't reach that state. Seth's claim
("think=true eats context") doesn't reproduce at 8K either, because Ollama
strips thinking from serialized history. The practical synthesis: **context
size and API shape matter more than either single flag**.

## Caveats

- **Stubbed tools.** Real mort tools (SearXNG, SethSearch, web_fetch) return
  variable-sized responses. This harness gave ~300-500 char deterministic
  stubs. If production tool responses are much larger, context growth
  dynamics could differ.
- **No image/vision path.** mort does vision preprocessing via `/api/generate`
  with `THINK_VISION=False`. That path is documented by mort's own notes as
  broken with `think=true` on `/api/generate`. Out of scope here (this bakeoff
  is `/api/chat` only).
- **Production traffic has longer sessions and real chat-history overhead.**
  The fake history was ~2.7KB; real sessions can accumulate more. Not tested
  at the 20-step STEP_BUDGET limit.
- **n=1.** Stochastic variance wasn't measured. Results at temp=0.7 can vary.

## Artifacts

- `scripts/mort-bakeoff/harness.py` — the harness
- `scripts/mort-bakeoff/runs/memory-think-{false,true}.json`
- `scripts/mort-bakeoff/runs/research-think-{false,true}.json`
- `scripts/mort-bakeoff/runs/movies-think-{false,true}.json`
- `scripts/mort-bakeoff/runs/long-think-{false,true}.json`

## Reproducing

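```bash
cd scripts/mort-bakeoff
mkdir -p runs  # no-op if the directory already exists
# One run per (task, think) cell; arguments are: think flag, task, output path.
for task in memory research movies long; do
  for t in false true; do
    python3 harness.py "$t" "$task" "runs/${task}-think-${t}.json"
  done
done
```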