From 0f82cd71b1d87c9e5b949a76073a8456de179cd8 Mon Sep 17 00:00:00 2001
From: Mortdecai <admin@mortdec.ai>
Date: Mon, 20 Apr 2026 06:00:07 -0400
Subject: [PATCH] =?UTF-8?q?docs:=20session=20handoff=20=E2=80=94=20GPU=20b?=
 =?UTF-8?q?akeoff=20(3090=20Ti=20vs=20Strix=20Halo)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Closes out the session that produced docs/reference/gpu-bakeoff-2026-04-20.md
and the parked scripts/native-bakeoff/ scaffold. Chains (chronologically)
from the 2026-04-18 OpenWebUI setup handoff, though the topic is unrelated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 ...-04-20-055658-gpu-bakeoff-3090-vs-strix.md | 177 ++++++++++++++++++
 1 file changed, 177 insertions(+)
 create mode 100644 .claude/handoffs/2026-04-20-055658-gpu-bakeoff-3090-vs-strix.md

diff --git a/.claude/handoffs/2026-04-20-055658-gpu-bakeoff-3090-vs-strix.md b/.claude/handoffs/2026-04-20-055658-gpu-bakeoff-3090-vs-strix.md
new file mode 100644
index 0000000..d122661
--- /dev/null
+++ b/.claude/handoffs/2026-04-20-055658-gpu-bakeoff-3090-vs-strix.md
@@ -0,0 +1,177 @@
+# Handoff: GPU Bakeoff — 3090 Ti vs Strix Halo (+ parked native-bakeoff scaffold)
+
+## Session Metadata
+- Created: 2026-04-20 05:56:58
+- Project: /home/claude/bin/gemma4-research
+- Branch: master (pushed to origin)
+- Session duration: extended, multi-pivot session (~4+ hours)
+
+### Recent Commits (for context)
+  - **91842f3 docs: scrub PII/IPs from gpu-bakeoff** ← latest, end of session
+  - **22af597 docs: remove V100 from GPU bakeoff** ← V100 column dropped
+  - **b619035 feat: GPU bakeoff — 3090 Ti vs V100 vs Strix Halo** ← initial write-up (superseded by later scrubs)
+  - **df5542f feat: native-bakeoff scaffold — Ollama JSON vs native-token tool-calling** ← parked research
+  - 91aaaa4 docs: redact PII from persistent-correspondence findings
+
+## Handoff Chain
+
+- **Continues from**: [2026-04-18-233832-openwebui-setup-doc.md](./2026-04-18-233832-openwebui-setup-doc.md)
+  - Previous title: OpenWebUI Setup Doc for Gemma 4
+- **Supersedes**: None
+
+> This session is not a continuation of the OpenWebUI doc work — it's a fresh research thread on the same repo. The link is chronological, not topical. Previous handoff is only relevant if debugging OpenWebUI-related Gemma 4 behavior.
+
+## Current State Summary
+
+Session started on a native-vs-JSON tool-calling bakeoff question, pivoted to a cross-GPU throughput comparison mid-session, and shipped the latter. Final state: `docs/reference/gpu-bakeoff-2026-04-20.md` comparing `gemma4:26b` MoE and `gemma4:31b` dense decode/prefill rates on **RTX 3090 Ti (steel141)** vs **AMD Strix Halo iGPU (strix-halo host)**. V100 data was initially gathered and included but **removed** when it turned out the V100 was 95% CPU-bound due to SDXL coresident on CT 167; the published doc is a clean 2-host comparison. The native-bakeoff harness (the earlier thread) remains scaffolded and committed at `scripts/native-bakeoff/` but was not run further. Repo is clean, four commits pushed (`df5542f`, `b619035`, `22af597`, `91842f3`).
+
+## Codebase Understanding
+
+### Architecture Overview
+
+The repo is a Gemma 4 research corpus. New this session:
+- `scripts/native-bakeoff/` — three-arm tool-calling harness (Ollama JSON tools vs Ollama raw native tokens vs google-deepmind/gemma JAX ToolSampler). Arms A and B tested and functionally equivalent on `gemma4:26b` Q4 against a shared task suite lifted from mort-bakeoff. Arm C is env-gated (requires JAX + `gemma` PyPI package); wired but not run.
+- `scripts/gpu-bakeoff/` — cross-GPU throughput harness. Takes host aliases from `HOSTS` dict and resolves URLs from env vars (`OLLAMA_STEEL141_URL`, `OLLAMA_PVE197_URL`, `OLLAMA_STRIX_URL`). Runs 1 warmup + 3 measurement calls per (host × model × prompt-length), logs Ollama's canonical timing fields, aggregates min/median/max.
+- `docs/reference/gpu-bakeoff-2026-04-20.md` — the finished writeup. 3090 Ti + Strix Halo only.
+
+The `docs/reference/` tier holds experimental findings; `docs/` top-level holds applied how-to guides. Both bakeoffs landed in `docs/reference/` which is correct.
+
+### Critical Files
+
+| File | Purpose | Relevance |
+|------|---------|-----------|
+| `docs/reference/gpu-bakeoff-2026-04-20.md` | The session's primary artifact | Read this first for the session's shipped findings |
+| `scripts/gpu-bakeoff/harness.py` | GPU bakeoff harness, env-var-driven URL resolution | Re-run the bakeoff (e.g., for isolated V100) by setting env vars + invoking |
+| `scripts/gpu-bakeoff/runs/**/*.json` | Raw per-call timing data | Source of truth for the doc's numbers; each JSON has warmup + 3 runs with full Ollama timing fields |
+| `scripts/native-bakeoff/harness.py` | Parked three-arm tool-calling harness | Reference if revisiting the native-vs-JSON question; arms A and B are ready, arm C needs JAX env |
+| `scripts/native-bakeoff/arms/ollama_native.py` | Arm B — renders the canonical HF jinja chat template directly, POSTs to /api/generate raw:true | Contains a subtle fix (keep assistant `content=""` when it has `tool_calls`) that's easy to regress |
+| `tooling/huggingface/model-cards/gemma-4-E4B-it-chat_template.jinja` | Canonical Gemma 4 chat template, rendered by arm B | Authoritative source of Gemma 4's native tool-call wire format |
+| `~/bin/DECISIONS.md` | Global decision log | Three new 2026-04-20 entries: MoE-preferred, 3090 Ti primary, V100 degraded |
+
+### Key Patterns Discovered
+
+- **MoE vs dense is a latency cliff, not a smooth curve.** `gemma4:26b` (MoE, ~4B active) decodes ~4.7× faster than `gemma4:31b` (dense, 31.3B active) on every GPU tested, because memory bandwidth is the binding constraint and the active-parameter count is the cost paid on every token. Total parameter count doesn't predict latency.
+- **Ollama's JSON↔native-token tool-call translator is faithful** on `gemma4:26b` Q4. Arms A (JSON tools via `/api/chat`) and B (raw native tokens via `/api/generate raw:true`) produced identical behavioral shapes on the 4-task mort-bakeoff suite. Good for mort-bot's confidence in its production path.
+- **Ollama's `/api/generate` strips matched stop tokens from the response.** Arm B's initial version mis-handled this by checking `done_reason == "stop"` as the "already terminated" branch; the correct logic is to always re-append the stop token based on which OPEN token (`<|tool_call>` vs `<|turn>`) is present in the completion.
+- **Jinja `message.get('content')` checks the raw string, not the strip-thinking'd version.** Storing the model's `<|channel>thought\n<channel|>` prefix in an assistant message's `content` field causes the template's post-tool-response conditional to append a spurious `<turn|>\n`, corrupting the next step's prompt. Safe default: leave `content=""` when the message has `tool_calls`.
+
+## Work Completed
+
+### Tasks Finished
+
+- [x] Researched "most native Gemma 4 engine" — concluded `google-deepmind/gemma` (JAX) is the canonical reference; `gemma.cpp` verified to still NOT support Gemma 4 on dev branch (main README "CPU-only inference for: Gemma 2-3, PaliGemma 2")
+- [x] Scaffolded three-arm native-bakeoff harness (ollama-json, ollama-native, jax-native) at `scripts/native-bakeoff/`
+- [x] Ran A+B sweep on `gemma4:26b` Q4 via Strix Halo host over Tailscale; debugged arm-B parser bug; concluded Ollama's JSON↔native translator is faithful
+- [x] Probed GPU inventory across steel141 (3090 Ti), pve197 CT 105 (V100), strix-halo (Strix Halo iGPU)
+- [x] Built `scripts/gpu-bakeoff/harness.py` — env-var-keyed hosts, warmup + 3 runs, canonical timing extraction
+- [x] Ran the bakeoff; discovered V100 was 95% CPU-bound due to SDXL occupying ~31 GB of its VRAM
+- [x] Wrote `docs/reference/gpu-bakeoff-2026-04-20.md` with V100 column initially included, then removed at Seth's direction
+- [x] Scrubbed PII/IPs from the doc and harness: host alias `matt-strix` → `strix-halo`, URLs moved to env vars, `runs/` dir renamed, JSONs patched
+- [x] Updated `~/bin/DECISIONS.md` with three 2026-04-20 entries
+- [x] Added feedback memory for the PII-scrub preference
+- [x] Updated `README.md` index entry for the new bakeoff doc
+
+### Files Modified
+
+| File | Changes | Rationale |
+|------|---------|-----------|
+| `docs/reference/gpu-bakeoff-2026-04-20.md` | Created (final: 3090 Ti vs Strix Halo) | Session's primary artifact |
+| `scripts/gpu-bakeoff/` | New dir — harness + runs | Bakeoff infrastructure |
+| `scripts/native-bakeoff/` | New dir — three-arm harness, parked | Earlier research thread, parked but shippable |
+| `README.md` | One new row in the file index | Discoverability for the new doc |
+| `~/bin/DECISIONS.md` | Three new 2026-04-20 entries | MoE preference, 3090 Ti primacy, V100-SDXL contention |
+| `~/.claude/projects/-home-claude-bin-gemma4-research/memory/feedback_scrub_pii_before_publish.md` | New memory entry | Seth's preference for scrubbing artifacts before sharing |
+| `~/.claude/projects/-home-claude-bin-gemma4-research/memory/MEMORY.md` | Index entry added | Link to the new memory |
+
+### Decisions Made
+
+| Decision | Options Considered | Rationale |
+|----------|-------------------|-----------|
+| Pivot from native-bakeoff to GPU-bakeoff mid-session | Complete native-bakeoff first; park and come back | Seth explicitly pivoted ("What I really want is..."); native-bakeoff was already functionally answered (A ≡ B) |
+| Remove V100 from GPU-bakeoff doc entirely rather than keep with caveat | Keep with prominent ⚠ badge; drop the column | Seth directed "remove v100 from doc"; degraded data with caveat pollutes the narrative |
+| Env-var-ize host URLs in harness source rather than config file | .env file; hard-coded with placeholders; CLI-only | Lightest change that accomplishes scrub; localhost default keeps steel141 path usable out of the box |
+| Start the native-bakeoff thread on E4B, not 26B | Go straight to 26B (production model) | Actually reversed to 26B mid-session when strix-halo (Matt's host) was found reachable with `gemma4:26b` already pulled; the production-shape model became the shipped path |
+| Don't rewrite git history to remove IPs from earlier commits | Force-push a cleaned history | Destructive; Seth's "remove IP/PI" was scoped to current artifacts, not a history scrub. Flagged the tradeoff and did not act |
+| Chain this handoff to the previous OpenWebUI one chronologically even though topically unrelated | Link as "continues from"; mark "supersedes"; no chain | Session-handoff skill's chain field is chronological per doc conventions; the narrative separation is called out in the body |
+
+## Pending Work
+
+### Immediate Next Steps
+
+1. **(Optional) Isolated V100 re-run.** Stop CT 167 (ai-visualizer / SDXL) on pve197, then `OLLAMA_PVE197_URL=http://<ip>:11434 python3 scripts/gpu-bakeoff/harness.py --host pve197`. Expected result: V100 lands between 3090 Ti and Strix Halo based on HBM2 ~900 GB/s spec. Add a V100 column back to the doc with isolated numbers. Judgment call — worth the ai-visualizer interruption?
+2. **(Optional) Strix max-model-fit follow-up.** Strix can host models neither the 3090 Ti nor V100 can. Pull a larger model (gemma4:26b-a4b-it-q8_0 at 28 GB, or something 40B+) on the Strix Halo host; re-run harness to characterize the bandwidth/capacity ceiling for that architecture.
+3. **(Optional) Close the native-bakeoff thread with arm C.** Set up a JAX env on steel141 or in a vast-h100 session, pip install `gemma`, run the JAX ToolSampler arm against the same mort-bakeoff task suite. If arm C matches arms A/B, that's definitive "Ollama's runtime is faithful to the DeepMind reference." If it diverges, the GGUF quantization / llama.cpp runtime is the variable to investigate.
+
+### Blockers/Open Questions
+
+- **Does `gemma4:31b-it-q4_K_M` on the V100 still deserve its 2026-04-07 "primary model on V100" designation?** The new 2026-04-20 decision noting 26B-MoE preference doesn't formally supersede it — they coexist on a speed vs quality axis that wasn't measured here. If a future session cares, a quality bakeoff (same tasks, qualitatively scored outputs) would resolve it.
+- **Quantization sensitivity unmeasured.** All bakeoff numbers are Q4_K_M. The Q8 vs Q4 throughput ratio on the same model (especially on Strix, where more headroom is available) is still open; it is also listed in the doc's "open questions" section.
+
+### Deferred Items
+
+- **Native-bakeoff arm C** — env setup cost, not landing in this session.
+- **Git history scrub** — would require force-push; Seth's scrub request was interpreted as "current artifacts only" and he was informed of the tradeoff.
+- **DECISIONS.md per-project local** — considered creating a project-local decision log for the bakeoff findings but instead promoted them to the global log (`~/bin/DECISIONS.md`) since the hardware/model implications are cross-project.
+
+## Context for Resuming Agent
+
+### Important Context
+
+- **The V100 caveat is in git history (commit b619035) but not the final doc.** If someone greps the repo for "V100" and expects to find it in the current head, they won't: commit `22af597` removed it deliberately.
+- **Host aliases were scrubbed this session.** `matt-strix` was renamed to `strix-halo` in the repo; the SSH alias in `~/.ssh/config` and `~/bin/CLAUDE.md` still uses the original name. Don't "reconcile" those by renaming the alias locally — Seth uses it as-is outside the published repo.
+- **Harness requires env vars for non-local hosts now.** Running `scripts/gpu-bakeoff/harness.py --host strix-halo` without `OLLAMA_STRIX_URL` set will error out with a clear message. Set it from the SSH alias / Tailscale IP as needed.
+- **The scrubbed URL constants are NOT in this repo.** If the next session needs to re-run the bakeoff against the original hosts, pull them from `~/bin/CLAUDE.md` (SSH aliases → tailscale/LAN IPs) or probe via `ssh strix-halo hostname -I` / equivalent.
+- **gemma4:latest on steel141 is the E4B-it variant (8 GB), NOT the MoE 26B.** Confirmed during smoke-testing. Other hosts may resolve `gemma4:latest` differently.
+- **Push-on-commit is the convention** for this repo (`~/bin/CLAUDE.md` Gitea section). All four commits this session were pushed immediately.
+
+### Assumptions Made
+
+- The V100 was degraded "because of SDXL" based on `/api/ps` showing `size_vram: 1.57 GB` of a 30.5 GB model + `nvidia-smi` showing 31.7/32.7 GB used by other processes. **Not independently verified** by stopping SDXL and re-running; that's the open follow-up. If SDXL wasn't actually the culprit (e.g., Ollama version bug on that host), the finding needs revisiting.
+- strix-halo's `gemma4:31b` tag and steel141's `gemma4:31b-it-q4_K_M` tag are the same weights (both Q4_K_M, both 19.9 GB, both 31.3 B params). Verified via `/api/tags` metadata; not by hash comparison.
+- Ollama's `/api/generate` canonical timing fields (`prompt_eval_duration`, `eval_duration`, etc.) are trustworthy for throughput measurements. Supported by their deterministic behavior across runs; not compared against external profiling.
+
+### Potential Gotchas
+
+- **`keep_alive: 10m` in the harness keeps models resident.** Running the full matrix against a host with limited VRAM can leave the model loaded after the harness exits; subsequent unrelated Ollama users may see degraded performance until `keep_alive` expires or another model evicts it.
+- **The V100 runs are gone from `scripts/gpu-bakeoff/runs/`** (commit `22af597`). Git history has them at `22af597^`. Don't write new code expecting `runs/pve197/` to exist locally.
+- **The native-bakeoff `content=""` fix is subtle.** If someone "improves" arm B to preserve the model's pre-tool-call thinking text as assistant content, they'll regress the turn-termination bug. Module-level comment in `scripts/native-bakeoff/arms/ollama_native.py` calls this out but is easy to miss.
+- **gemma.cpp status as of 2026-04-20:** dev branch README still says Gemma 2/3 + PaliGemma 2 only. Don't suggest gemma.cpp as a Gemma 4 option without re-checking.
+- **Arm B's raw_completion_tail/prompt_tail/prompt_head trace fields** were added during debugging and left in place. They make the trace JSONs larger than strictly necessary; ok to remove if cleanliness matters, but don't delete the fix they were added to diagnose.
+
+## Environment State
+
+### Tools/Services Used
+
+- Local Ollama on steel141 (127.0.0.1:11434) — version and model list as of session
+- Remote Ollama on strix-halo (via Tailscale) — version 0.21.0, models: `gemma4:26b`, `gemma4:31b`
+- Remote Ollama on pve197 CT 105 — models include the Q8 MoE `gemma4:26b-a4b-it-q8_0` that only fits V100
+- Git / Gitea at `git.sethpc.xyz/Seth/gemma4-research`
+- Python 3 with `aiohttp`, `jinja2`, `urllib.request` (stdlib only for gpu-bakeoff)
+
+### Active Processes
+
+- None started or left running by this session. The `keep_alive: 10m` in harness.py may still be holding models resident briefly post-session; they'll drop when the TTL expires.
+
+### Environment Variables
+
+- `OLLAMA_STEEL141_URL` — default `http://127.0.0.1:11434` if unset
+- `OLLAMA_PVE197_URL` — no default; required if `--host pve197`
+- `OLLAMA_STRIX_URL` — no default; required if `--host strix-halo`
+- Optionally `OLLAMA_URL` for any one-off calls to a different host, though the harness doesn't read it
+
+(No values are embedded in source; none logged here per handoff security policy.)
+
+## Related Resources
+
+- [docs/reference/gpu-bakeoff-2026-04-20.md](../../docs/reference/gpu-bakeoff-2026-04-20.md) — the session's primary artifact
+- [scripts/gpu-bakeoff/](../../scripts/gpu-bakeoff/) — harness + raw traces
+- [scripts/native-bakeoff/](../../scripts/native-bakeoff/) — parked research thread, functional A+B arms
+- [tooling/huggingface/model-cards/gemma-4-E4B-it-chat_template.jinja](../../tooling/huggingface/model-cards/gemma-4-E4B-it-chat_template.jinja) — authoritative Gemma 4 chat template, rendered by arm B of native-bakeoff
+- [~/bin/DECISIONS.md](/home/claude/bin/DECISIONS.md) — three new 2026-04-20 entries relating to this session
+- [MEMORY index](/home/claude/.claude/projects/-home-claude-bin-gemma4-research/memory/MEMORY.md) — updated with PII-scrub feedback
+- Previous handoff: [2026-04-18-233832-openwebui-setup-doc.md](./2026-04-18-233832-openwebui-setup-doc.md) — chronological predecessor, topically unrelated
+- Gitea commits this session: `df5542f`, `b619035`, `22af597`, `91842f3`
+
+---
+
+**Security Reminder**: Before finalizing, run `validate_handoff.py` to check for accidental secret exposure.