feat: mort-bot think=true vs think=false bakeoff
Seth's challenge: "we experienced this context eating with every implementation that had think=true. mort-bot runs a loop. Can you do a bake-off?" Built a harness that replicates mort-bot's /api/chat loop verbatim (num_ctx=8192, num_predict=2048, temperature=0.7, gemma4:26b, STEP_BUDGET=20, exact payload shape) but with stubbed tools and a prebuilt 15-turn fake chat history. Ran 4 tasks × 2 think settings. Finding: on Ollama 0.20.4 the "thinking eats context" concern does NOT reproduce. Direct evidence: - Movies task step 2 (think=true) returned 905 chars of thinking. - Step 3 prompt_eval_count delta: +76 tokens (think=true) vs +135 tokens (think=false). If thinking had accumulated in the prompt, think=true would have grown by +360 tokens, not shrunk. - Ollama's chat template strips the `thinking` field when serializing assistant turns for subsequent prompts. All 4 tasks × 2 settings produced identical step counts and tool counts. Wall clocks comparable. Gemma only actually generated thinking on 1 of 4 tasks (the one with check_sethflix verify-loop); on the others with think=true it emitted 0 thinking tokens. Reconciled with the earlier coding-agent bakeoff: the two findings are orthogonal. Coding bakeoff was at num_ctx=32K with a different harness; mort at 8K doesn't touch the silent-stop regime either way. Seth's prior may have been correct on an older Ollama or in a different API shape (/api/generate has its own issues) but does not reproduce here. Concrete recommendation: mort-bot THINK=False is defensible but not load-bearing; THINK=True or unset-default would also work. Keep as-is unless a different need arises. New: docs/reference/mort-bakeoff-2026-04-18.md, scripts/mort-bakeoff/ (harness + 8 run logs). README updated with pointer.
This commit is contained in:
@@ -0,0 +1,155 @@
|
||||
# mort-bot `think=true` vs `think=false` Bakeoff — 2026-04-18
|
||||
|
||||
> Follow-up to Seth's challenge: "we experienced this context eating with every
|
||||
> implementation that had think=true. mort-bot runs a loop. Can you do a
|
||||
> bake-off to see if that bot would actually perform better with thinking on?"
|
||||
>
|
||||
> Short answer: **no measurable difference on Ollama 0.20.4**. The
|
||||
> "thinking-eats-context" concern doesn't reproduce in mort-bot's current
|
||||
> loop shape because Ollama's chat template strips the `thinking` field from
|
||||
> serialized history when it builds the prompt for subsequent turns. Either
|
||||
> setting is defensible.
|
||||
|
||||
## Setup
|
||||
|
||||
- Harness: `scripts/mort-bakeoff/harness.py` — replicates mort-bot `llm.py`
|
||||
`run_tool_loop` call shape verbatim (model, options, payload structure,
|
||||
`messages.append(msg)` behavior), but with stubbed tools and a prebuilt
|
||||
~15-turn fake chat history to simulate mid-session state.
|
||||
- Host / Ollama: steel141, 3090 Ti, Ollama 0.20.4
|
||||
- Exact config match to mort-bot production: `gemma4:26b`, `num_ctx=8192`,
|
||||
`num_predict=2048`, `temperature=0.7`, `top_p=0.95`, `top_k=64`,
|
||||
`keep_alive=2h`, `STEP_BUDGET=20`
|
||||
- Tasks: 4 scenarios representative of real traffic (movies, research,
|
||||
memory, long-chain research)
|
||||
- n=1 per (task, think) cell — bakeoff, not benchmark
|
||||
|
||||
## Results
|
||||
|
||||
| Task | Think | Steps | Tools | Peak prompt | Thinking generated | Wall |
|
||||
|---|---|---|---|---|---|---|
|
||||
| memory | false | 2 | 1 | 1421 | 0 tok | 1.7s |
|
||||
| memory | true | 2 | 1 | 1422 | 0 tok | 2.0s |
|
||||
| research | false | 2 | 2 | 1593 | 0 tok | 2.3s |
|
||||
| research | true | 2 | 2 | 1594 | 0 tok | 2.3s |
|
||||
| movies | false | 3 | 2 | 1635 | 0 tok | 8.3s |
|
||||
| movies | true | 3 | 2 | 1577 | 905 chars / ~226 tok | 5.4s |
|
||||
| long | false | 7 | 6 | 2243 | 0 tok | 7.8s |
|
||||
| long | true | 7 | 6 | 2288 | 0 tok | 8.0s |
|
||||
|
||||
Every (task, think) pair produced **identical step counts and tool counts**.
|
||||
|
||||
## Does thinking accumulate in context?
|
||||
|
||||
**No — verified directly.**
|
||||
|
||||
On the `movies` task, step 2 with `think=true` returned 905 chars (~226 tok)
|
||||
in a separate `thinking` field. My harness then appends the full message dict
|
||||
(including `thinking`) to `messages` before the next request — exactly what
|
||||
mort-bot's `ollama_messages.append(msg)` does.
|
||||
|
||||
Step 2 → Step 3 prompt_eval_count delta:
|
||||
|
||||
| Setting | Delta | What that means |
|
||||
|---|---|---|
|
||||
| think=false | +135 tok | Tool result only |
|
||||
| think=true | +76 tok | Tool result only — thinking was **stripped** |
|
||||
|
||||
If thinking had accumulated, the think=true delta would have been ~360 tokens
|
||||
(tool result + thinking). Instead it was smaller than think=false's delta.
|
||||
**Ollama 0.20.4's chat template does not include the `thinking` field when
|
||||
serializing an assistant turn for subsequent prompts.** Thinking is a per-turn
|
||||
response annotation, not a persisted conversation channel.
|
||||
|
||||
## Does thinking actually happen?
|
||||
|
||||
Gemma 4 is conservative. With `think=true`, it chose to generate thinking
|
||||
tokens on **1 of 4 tasks**:
|
||||
|
||||
- `memory` — 0 thinking tokens (simple lookup)
|
||||
- `research` — 0 thinking tokens (clear sequential plan)
|
||||
- `long` — 0 thinking tokens across 7 steps (explicit step list to follow)
|
||||
- `movies` — 905 chars thinking on step 2 (the only task requiring
|
||||
verification logic: check candidates → filter IN LIBRARY → replace → re-check)
|
||||
|
||||
The model appears to decide whether to think based on whether the task has
|
||||
uncertainty to reason about. Following explicit multi-step instructions
|
||||
doesn't trigger it; generating + filtering recommendations does.
|
||||
|
||||
## Answer to Seth's claim
|
||||
|
||||
Seth's prior: "we experienced this context eating with every implementation
|
||||
that had think=true."
|
||||
|
||||
On current Ollama (0.20.4) with mort-bot's loop shape: **the claim does not
|
||||
reproduce.** Possibilities for why Seth saw it before:
|
||||
|
||||
1. **Older Ollama version.** Earlier 0.x releases may have serialized `thinking`
|
||||
into subsequent prompts. Not tested here.
|
||||
2. **Different API shape.** `/api/generate` behaves differently from `/api/chat`
|
||||
— mort-bot's own CLAUDE.md notes `/api/generate` with `think=true` returns
|
||||
empty responses. That's a separate, real bug, not a context-growth bug.
|
||||
3. **AI_Visualizer-shaped pipelines** generate into `content` where thinking
|
||||
tokens explicitly eat the `num_predict` budget. That is a real failure mode
|
||||
and the original "always think:false" guidance addresses it correctly.
|
||||
4. **Confounded by other issues** — context truncation, model quirks, silent
|
||||
failures — misattributed to thinking.
|
||||
|
||||
The production `THINK=False` setting was defensible when adopted and remains
|
||||
defensible now. It's just not load-bearing in the way the original guidance
|
||||
suggested.
|
||||
|
||||
## Concrete recommendation for mort-bot
|
||||
|
||||
1. **Keep `THINK=False` as-is, or try unset-default, or `THINK=True` — pick based on whether you want the slight quality hedge on reasoning-heavy turns (like the movies check_sethflix verify-loop). No context-growth penalty either way on 0.20.4.**
|
||||
2. **Don't backport the "CLI coding agent" finding from `docs/reference/bakeoff-2026-04-18.md`** — that one was at `num_ctx=32768` with a coding harness, different regime. Mort's `num_ctx=8192` doesn't touch the silent-stop trigger.
|
||||
3. **If Ollama versions drift**, re-run this harness. The stripping behavior is an Ollama implementation detail; a future version could change it.
|
||||
|
||||
## Reconciling with the earlier coding-agent bakeoff
|
||||
|
||||
The two findings are orthogonal:
|
||||
|
||||
| Bakeoff | Context | Harness | think=false | think=true |
|
||||
|---|---|---|---|---|
|
||||
| CLI coding (Round 3) | 32K | custom coding loop | 26B silent-stops | works |
|
||||
| mort-bot (this) | 8K | mort's real loop shape | works | works |
|
||||
|
||||
Both can be true. The coding bakeoff ran into a state-specific `think=false`
|
||||
failure at 32K context. mort-bot at 8K doesn't reach that state. Seth's claim
|
||||
("think=true eats context") doesn't reproduce at 8K either because Ollama
|
||||
strips thinking from serialized history. The practical synthesis: **context
|
||||
size and API shape matter more than either single flag**.
|
||||
|
||||
## Caveats
|
||||
|
||||
- **Stubbed tools.** Real mort tools (SearXNG, SethSearch, web_fetch) return
|
||||
variable-sized responses. This harness gave ~300-500 char deterministic
|
||||
stubs. If production tool responses are much larger, context growth
|
||||
dynamics could differ.
|
||||
- **No image/vision path.** mort does vision preprocessing via `/api/generate`
|
||||
with `THINK_VISION=False`. That path is documented by mort's own notes as
|
||||
broken with `think=true` on `/api/generate`. Out of scope here (this bakeoff
|
||||
is `/api/chat` only).
|
||||
- **Production traffic has longer sessions and real chat-history overhead.**
|
||||
The fake history was ~2.7KB; real sessions can accumulate more. Not tested
|
||||
at the 20-step STEP_BUDGET limit.
|
||||
- **n=1.** Stochastic variance wasn't measured. Results at temp=0.7 can vary.
|
||||
|
||||
## Artifacts
|
||||
|
||||
- `scripts/mort-bakeoff/harness.py` — the harness
|
||||
- `scripts/mort-bakeoff/runs/memory-think-{false,true}.json`
|
||||
- `scripts/mort-bakeoff/runs/research-think-{false,true}.json`
|
||||
- `scripts/mort-bakeoff/runs/movies-think-{false,true}.json`
|
||||
- `scripts/mort-bakeoff/runs/long-think-{false,true}.json`
|
||||
|
||||
## Reproducing
|
||||
|
||||
```bash
|
||||
cd scripts/mort-bakeoff
|
||||
for task in memory research movies long; do
|
||||
for t in false true; do
|
||||
python3 harness.py $t $task runs/${task}-think-${t}.json
|
||||
done
|
||||
done
|
||||
```
|
||||
Reference in New Issue
Block a user