feat: mort-bot think=true vs think=false bakeoff

Seth's challenge: "we experienced this context eating with every
implementation that had think=true. mort-bot runs a loop. Can you do
a bake-off?"

Built a harness that replicates mort-bot's /api/chat loop verbatim
(num_ctx=8192, num_predict=2048, temperature=0.7, gemma4:26b,
STEP_BUDGET=20, exact payload shape) but with stubbed tools and a
prebuilt 15-turn fake chat history. Ran 4 tasks × 2 think settings.

Finding: on Ollama 0.20.4 the "thinking eats context" concern does
NOT reproduce. Direct evidence:
- Movies task step 2 (think=true) returned 905 chars of thinking.
- Step 3 prompt_eval_count delta: +76 tokens (think=true) vs +135
  tokens (think=false). If thinking had accumulated in the prompt,
  think=true would have grown by +360 tokens, not shrunk.
- Ollama's chat template strips the `thinking` field when serializing
  assistant turns for subsequent prompts.

All 4 tasks × 2 settings produced identical step counts and tool
counts. Wall clocks comparable. Gemma only actually generated
thinking on 1 of 4 tasks (the one with check_sethflix verify-loop);
on the others with think=true it emitted 0 thinking tokens.

Reconciled with the earlier coding-agent bakeoff: the two findings
are orthogonal. Coding bakeoff was at num_ctx=32K with a different
harness; mort at 8K doesn't touch the silent-stop regime either way.
Seth's prior may have been correct on an older Ollama or in a
different API shape (/api/generate has its own issues) but does not
reproduce here.

Concrete recommendation: mort-bot THINK=False is defensible but not
load-bearing; THINK=True or unset-default would also work. Keep as-is
unless a different need arises.

New: docs/reference/mort-bakeoff-2026-04-18.md, scripts/mort-bakeoff/
(harness + 8 run logs). README updated with pointer.
This commit is contained in:
Mortdecai
2026-04-18 18:23:43 -04:00
parent c61394923c
commit 8436a91571
12 changed files with 988 additions and 2 deletions
+155
View File
@@ -0,0 +1,155 @@
# mort-bot `think=true` vs `think=false` Bakeoff — 2026-04-18
> Follow-up to Seth's challenge: "we experienced this context eating with every
> implementation that had think=true. mort-bot runs a loop. Can you do a
> bake-off to see if that bot would actually perform better with thinking on?"
>
> Short answer: **no measurable difference on Ollama 0.20.4**. The
> "thinking-eats-context" concern doesn't reproduce in mort-bot's current
> loop shape because Ollama's chat template strips the `thinking` field from
> serialized history when it builds the prompt for subsequent turns. Either
> setting is defensible.
## Setup
- Harness: `scripts/mort-bakeoff/harness.py` — replicates mort-bot `llm.py`
`run_tool_loop` call shape verbatim (model, options, payload structure,
`messages.append(msg)` behavior), but with stubbed tools and a prebuilt
~15-turn fake chat history to simulate mid-session state.
- Host / Ollama: steel141, 3090 Ti, Ollama 0.20.4
- Exact config match to mort-bot production: `gemma4:26b`, `num_ctx=8192`,
`num_predict=2048`, `temperature=0.7`, `top_p=0.95`, `top_k=64`,
`keep_alive=2h`, `STEP_BUDGET=20`
- Tasks: 4 scenarios representative of real traffic (movies, research,
memory, long-chain research)
- n=1 per (task, think) cell — bakeoff, not benchmark
## Results
| Task | Think | Steps | Tools | Peak prompt | Thinking generated | Wall |
|---|---|---|---|---|---|---|
| memory | false | 2 | 1 | 1421 | 0 tok | 1.7s |
| memory | true | 2 | 1 | 1422 | 0 tok | 2.0s |
| research | false | 2 | 2 | 1593 | 0 tok | 2.3s |
| research | true | 2 | 2 | 1594 | 0 tok | 2.3s |
| movies | false | 3 | 2 | 1635 | 0 tok | 8.3s |
| movies | true | 3 | 2 | 1577 | 905 chars / ~226 tok | 5.4s |
| long | false | 7 | 6 | 2243 | 0 tok | 7.8s |
| long | true | 7 | 6 | 2288 | 0 tok | 8.0s |
Every (task, think) pair produced **identical step counts and tool counts**.
## Does thinking accumulate in context?
**No — verified directly.**
On the `movies` task, step 2 with `think=true` returned 905 chars (~226 tok)
in a separate `thinking` field. My harness then appends the full message dict
(including `thinking`) to `messages` before the next request — exactly what
mort-bot's `ollama_messages.append(msg)` does.
Step 2 → Step 3 prompt_eval_count delta:
| Setting | Delta | What that means |
|---|---|---|
| think=false | +135 tok | Tool result only |
| think=true | +76 tok | Tool result only — thinking was **stripped** |
If thinking had accumulated, the think=true delta would have been ~360 tokens
(tool result + thinking). Instead it was smaller than think=false's delta.
**Ollama 0.20.4's chat template does not include the `thinking` field when
serializing an assistant turn for subsequent prompts.** Thinking is a per-turn
response annotation, not a persisted conversation channel.
## Does thinking actually happen?
Gemma 4 is conservative. With `think=true`, it chose to generate thinking
tokens on **1 of 4 tasks**:
- `memory` — 0 thinking tokens (simple lookup)
- `research` — 0 thinking tokens (clear sequential plan)
- `long` — 0 thinking tokens across 7 steps (explicit step list to follow)
- `movies` — 905 chars thinking on step 2 (the only task requiring
verification logic: check candidates → filter IN LIBRARY → replace → re-check)
The model appears to decide whether to think based on whether the task has
uncertainty to reason about. Following explicit multi-step instructions
doesn't trigger it; generating + filtering recommendations does.
## Answer to Seth's claim
Seth's prior: "we experienced this context eating with every implementation
that had think=true."
On current Ollama (0.20.4) with mort-bot's loop shape: **the claim does not
reproduce.** Possibilities for why Seth saw it before:
1. **Older Ollama version.** Earlier 0.x releases may have serialized `thinking`
into subsequent prompts. Not tested here.
2. **Different API shape.** `/api/generate` behaves differently from `/api/chat`
— mort-bot's own CLAUDE.md notes `/api/generate` with `think=true` returns
empty responses. That's a separate, real bug, not a context-growth bug.
3. **AI_Visualizer-shaped pipelines** generate into `content` where thinking
tokens explicitly eat the `num_predict` budget. That is a real failure mode
and the original "always think:false" guidance addresses it correctly.
4. **Confounded by other issues** — context truncation, model quirks, silent
failures — misattributed to thinking.
The production `THINK=False` setting was defensible when adopted and remains
defensible now. It's just not load-bearing in the way the original guidance
suggested.
## Concrete recommendation for mort-bot
1. **Keep `THINK=False` as-is, or try unset-default, or `THINK=True` — pick based on whether you want the slight quality hedge on reasoning-heavy turns (like the movies check_sethflix verify-loop). No context-growth penalty either way on 0.20.4.**
2. **Don't backport the "CLI coding agent" finding from `docs/reference/bakeoff-2026-04-18.md`** — that one was at `num_ctx=32768` with a coding harness, different regime. Mort's `num_ctx=8192` doesn't touch the silent-stop trigger.
3. **If Ollama versions drift**, re-run this harness. The stripping behavior is an Ollama implementation detail; a future version could change it.
## Reconciling with the earlier coding-agent bakeoff
The two findings are orthogonal:
| Bakeoff | Context | Harness | think=false | think=true |
|---|---|---|---|---|
| CLI coding (Round 3) | 32K | custom coding loop | 26B silent-stops | works |
| mort-bot (this) | 8K | mort's real loop shape | works | works |
Both can be true. The coding bakeoff ran into a state-specific `think=false`
failure at 32K context. mort-bot at 8K doesn't reach that state. Seth's claim
("think=true eats context") doesn't reproduce at 8K either because Ollama
strips thinking from serialized history. The practical synthesis: **context
size and API shape matter more than either single flag**.
## Caveats
- **Stubbed tools.** Real mort tools (SearXNG, SethSearch, web_fetch) return
variable-sized responses. This harness gave ~300-500 char deterministic
stubs. If production tool responses are much larger, context growth
dynamics could differ.
- **No image/vision path.** mort does vision preprocessing via `/api/generate`
with `THINK_VISION=False`. That path is documented by mort's own notes as
broken with `think=true` on `/api/generate`. Out of scope here (this bakeoff
is `/api/chat` only).
- **Production traffic has longer sessions and real chat-history overhead.**
The fake history was ~2.7KB; real sessions can accumulate more. Not tested
at the 20-step STEP_BUDGET limit.
- **n=1.** Stochastic variance wasn't measured. Results at temp=0.7 can vary.
## Artifacts
- `scripts/mort-bakeoff/harness.py` — the harness
- `scripts/mort-bakeoff/runs/memory-think-{false,true}.json`
- `scripts/mort-bakeoff/runs/research-think-{false,true}.json`
- `scripts/mort-bakeoff/runs/movies-think-{false,true}.json`
- `scripts/mort-bakeoff/runs/long-think-{false,true}.json`
## Reproducing
```bash
cd scripts/mort-bakeoff
for task in memory research movies long; do
for t in false true; do
python3 harness.py $t $task runs/${task}-think-${t}.json
done
done
```