# Handoff: GPU Bakeoff — 3090 Ti vs Strix Halo (+ parked native-bakeoff scaffold) ## Session Metadata - Created: 2026-04-20 05:56:58 - Project: /home/claude/bin/gemma4-research - Branch: master (pushed to origin) - Session duration: ~extended session, multi-pivot (~4+ hours) ### Recent Commits (for context) - **91842f3 docs: scrub PII/IPs from gpu-bakeoff** ← latest, end of session - **22af597 docs: remove V100 from GPU bakeoff** ← V100 column dropped - **b619035 feat: GPU bakeoff — 3090 Ti vs V100 vs Strix Halo** ← initial write-up (superseded by later scrubs) - **df5542f feat: native-bakeoff scaffold — Ollama JSON vs native-token tool-calling** ← parked research - 91aaaa4 docs: redact PII from persistent-correspondence findings ## Handoff Chain - **Continues from**: [2026-04-18-233832-openwebui-setup-doc.md](./2026-04-18-233832-openwebui-setup-doc.md) - Previous title: OpenWebUI Setup Doc for Gemma 4 - **Supersedes**: None > This session is not a continuation of the OpenWebUI doc work — it's a fresh research thread on the same repo. The link is chronological, not topical. Previous handoff is only relevant if debugging OpenWebUI-related Gemma 4 behavior. ## Current State Summary Session started on a native-vs-JSON tool-calling bakeoff question, pivoted to a cross-GPU throughput comparison mid-session, and shipped the latter. Final state: `docs/reference/gpu-bakeoff-2026-04-20.md` comparing `gemma4:26b` MoE and `gemma4:31b` dense decode/prefill rates on **RTX 3090 Ti (steel141)** vs **AMD Strix Halo iGPU (strix-halo host)**. V100 data was initially gathered and included but **removed** when it turned out the V100 was 95% CPU-bound due to SDXL coresident on CT 167 — the published doc is a clean 2-host comparison. Native-bakeoff harness (the earlier thread) remains scaffolded and committed at `scripts/native-bakeoff/` but not run further. Repo is clean, three commits pushed. ## Codebase Understanding ### Architecture Overview The repo is a Gemma 4 research corpus. New this session: - `scripts/native-bakeoff/` — three-arm tool-calling harness (Ollama JSON tools vs Ollama raw native tokens vs google-deepmind/gemma JAX ToolSampler). Arms A and B tested and functionally equivalent on `gemma4:26b` Q4 against a shared task suite lifted from mort-bakeoff. Arm C is env-gated (requires JAX + `gemma` PyPI package); wired but not run. - `scripts/gpu-bakeoff/` — cross-GPU throughput harness. Takes host aliases from `HOSTS` dict and resolves URLs from env vars (`OLLAMA_STEEL141_URL`, `OLLAMA_PVE197_URL`, `OLLAMA_STRIX_URL`). Runs 1 warmup + 3 measurement calls per (host × model × prompt-length), logs Ollama's canonical timing fields, aggregates min/median/max. - `docs/reference/gpu-bakeoff-2026-04-20.md` — the finished writeup. 3090 Ti + Strix Halo only. The `docs/reference/` tier holds experimental findings; `docs/` top-level holds applied how-to guides. Both bakeoffs landed in `docs/reference/` which is correct. ### Critical Files | File | Purpose | Relevance | |------|---------|-----------| | `docs/reference/gpu-bakeoff-2026-04-20.md` | The session's primary artifact | Read this first for the session's shipped findings | | `scripts/gpu-bakeoff/harness.py` | GPU bakeoff harness, env-var-driven URL resolution | Re-run the bakeoff (e.g., for isolated V100) by setting env vars + invoking | | `scripts/gpu-bakeoff/runs/**/*.json` | Raw per-call timing data | Source of truth for the doc's numbers; each JSON has warmup + 3 runs with full Ollama timing fields | | `scripts/native-bakeoff/harness.py` | Parked three-arm tool-calling harness | Reference if revisiting the native-vs-JSON question; arms A and B are ready, arm C needs JAX env | | `scripts/native-bakeoff/arms/ollama_native.py` | Arm B — renders the canonical HF jinja chat template directly, POSTs to /api/generate raw:true | Contains a subtle fix (keep assistant `content=""` when it has `tool_calls`) that's easy to regress | | `tooling/huggingface/model-cards/gemma-4-E4B-it-chat_template.jinja` | Canonical Gemma 4 chat template, rendered by arm B | Authoritative source of Gemma 4's native tool-call wire format | | `~/bin/DECISIONS.md` | Global decision log | Three new 2026-04-20 entries: MoE-preferred, 3090 Ti primary, V100 degraded | ### Key Patterns Discovered - **MoE vs dense is a latency cliff, not a smooth curve.** `gemma4:26b` (MoE, ~4B active) decodes ~4.7× faster than `gemma4:31b` (dense, 31.3B active) on every GPU tested, because memory bandwidth is the binding constraint and the active-parameter bill is what you pay for per token. Total parameter count doesn't predict latency. - **Ollama's JSON↔native-token tool-call translator is faithful** on `gemma4:26b` Q4. Arms A (JSON tools via `/api/chat`) and B (raw native tokens via `/api/generate raw:true`) produced identical behavioral shapes on the 4-task mort-bakeoff suite. Good for mort-bot's confidence in its production path. - **Ollama's `/api/generate` strips matched stop tokens from the response.** Arm B's initial version mis-handled this by checking `done_reason == "stop"` as the "already terminated" branch; the correct logic is to always re-append the stop token based on which OPEN token (`<|tool_call>` vs `<|turn>`) is present in the completion. - **Jinja `message.get('content')` checks the raw string, not the strip-thinking'd version.** Storing the model's `<|channel>thought\n` prefix in an assistant message's `content` field causes the template's post-tool-response conditional to append a spurious `\n`, corrupting the next step's prompt. Safe default: leave `content=""` when the message has `tool_calls`. ## Work Completed ### Tasks Finished - [x] Researched "most native Gemma 4 engine" — concluded `google-deepmind/gemma` (JAX) is the canonical reference; `gemma.cpp` verified to still NOT support Gemma 4 on dev branch (main README "CPU-only inference for: Gemma 2-3, PaliGemma 2") - [x] Scaffolded three-arm native-bakeoff harness (ollama-json, ollama-native, jax-native) at `scripts/native-bakeoff/` - [x] Ran A+B sweep on `gemma4:26b` Q4 via Strix Halo host over Tailscale; debugged arm-B parser bug; concluded Ollama's JSON↔native translator is faithful - [x] Probed GPU inventory across steel141 (3090 Ti), pve197 CT 105 (V100), strix-halo (Strix Halo iGPU) - [x] Built `scripts/gpu-bakeoff/harness.py` — env-var-keyed hosts, warmup + 3 runs, canonical timing extraction - [x] Ran the bakeoff; discovered V100 was 95% CPU-bound due to SDXL occupying ~31 GB of its VRAM - [x] Wrote `docs/reference/gpu-bakeoff-2026-04-20.md` with V100 column initially included, then removed at Seth's direction - [x] Scrubbed PII/IPs from the doc and harness: host alias `matt-strix` → `strix-halo`, URLs moved to env vars, `runs/` dir renamed, JSONs patched - [x] Updated `~/bin/DECISIONS.md` with three 2026-04-20 entries - [x] Added feedback memory for the PII-scrub preference - [x] Updated `README.md` index entry for the new bakeoff doc ### Files Modified | File | Changes | Rationale | |------|---------|-----------| | `docs/reference/gpu-bakeoff-2026-04-20.md` | Created (final: 3090 Ti vs Strix Halo) | Session's primary artifact | | `scripts/gpu-bakeoff/` | New dir — harness + runs | Bakeoff infrastructure | | `scripts/native-bakeoff/` | New dir — three-arm harness, parked | Earlier research thread, parked but shippable | | `README.md` | One new row in the file index | Discoverability for the new doc | | `~/bin/DECISIONS.md` | Three new 2026-04-20 entries | MoE preference, 3090 Ti primacy, V100-SDXL contention | | `~/.claude/projects/-home-claude-bin-gemma4-research/memory/feedback_scrub_pii_before_publish.md` | New memory entry | Seth's preference for scrubbing artifacts before sharing | | `~/.claude/projects/-home-claude-bin-gemma4-research/memory/MEMORY.md` | Index entry added | Link to the new memory | ### Decisions Made | Decision | Options Considered | Rationale | |----------|-------------------|-----------| | Pivot from native-bakeoff to GPU-bakeoff mid-session | Complete native-bakeoff first; park and come back | Seth explicitly pivoted ("What I really want is..."); native-bakeoff was already functionally answered (A ≡ B) | | Remove V100 from GPU-bakeoff doc entirely rather than keep with caveat | Keep with prominent ⚠ badge; drop the column | Seth directed "remove v100 from doc"; degraded data with caveat pollutes the narrative | | Env-var-ize host URLs in harness source rather than config file | .env file; hard-coded with placeholders; CLI-only | Lightest change that accomplishes scrub; localhost default keeps steel141 path usable out of the box | | Start GPU bakeoff on E4B, not 26B, for the native-bakeoff thread | Go straight to 26B (production model) | Actually reversed to 26B mid-session when strix-halo (Matt's host) was found reachable with `gemma4:26b` already pulled — production-shape became the shipped path | | Don't rewrite git history to remove IPs from earlier commits | Force-push a cleaned history | Destructive; Seth's "remove IP/PI" was scoped to current artifacts, not a history scrub. Flagged the tradeoff and did not act | | Chain this handoff to the previous OpenWebUI one chronologically even though topically unrelated | Link as "continues from"; mark "supersedes"; no chain | Session-handoff skill's chain field is chronological per doc conventions; the narrative separation is called out in the body | ## Pending Work ## Immediate Next Steps 1. **(Optional) Isolated V100 re-run.** Stop CT 167 (ai-visualizer / SDXL) on pve197, then `OLLAMA_PVE197_URL=http://:11434 python3 scripts/gpu-bakeoff/harness.py --host pve197`. Expected result: V100 lands between 3090 Ti and Strix Halo based on HBM2 ~900 GB/s spec. Add a V100 column back to the doc with isolated numbers. Judgment call — worth the ai-visualizer interruption? 2. **(Optional) Strix max-model-fit follow-up.** Strix can host models neither the 3090 Ti nor V100 can. Pull a larger model (gemma4:26b-a4b-it-q8_0 at 28 GB, or something 40B+) on the Strix Halo host; re-run harness to characterize the bandwidth/capacity ceiling for that architecture. 3. **(Optional) Close the native-bakeoff thread with arm C.** Set up a JAX env on steel141 or in a vast-h100 session, pip install `gemma`, run the JAX ToolSampler arm against the same mort-bakeoff task suite. If arm C matches arms A/B, that's definitive "Ollama's runtime is faithful to the DeepMind reference." If it diverges, the GGUF quantization / llama.cpp runtime is the variable to investigate. ### Blockers/Open Questions - **Does `gemma4:31b-it-q4_K_M` on the V100 still deserve its 2026-04-07 "primary model on V100" designation?** The new 2026-04-20 decision noting 26B-MoE preference doesn't formally supersede it — they coexist on a speed vs quality axis that wasn't measured here. If a future session cares, a quality bakeoff (same tasks, qualitatively scored outputs) would resolve it. - **Quantization sensitivity unmeasured.** All bakeoff numbers are Q4_K_M. Q8 vs Q4 throughput ratio on the same model (especially on Strix where more headroom is available) is an open question that came up in the "open questions" section of the doc. ### Deferred Items - **Native-bakeoff arm C** — env setup cost, not landing in this session. - **Git history scrub** — would require force-push; Seth's scrub request was interpreted as "current artifacts only" and he was informed of the tradeoff. - **DECISIONS.md per-project local** — considered creating a project-local decision log for the bakeoff findings but instead promoted them to the global log (`~/bin/DECISIONS.md`) since the hardware/model implications are cross-project. ## Context for Resuming Agent ## Important Context - **The V100 caveat is in git history (commit b619035) but not the final doc.** If someone greps the repo for "V100" and expects to find it in the current head, they won't — the final commit `22af597` removed it deliberately. - **Host aliases were scrubbed this session.** `matt-strix` was renamed to `strix-halo` in the repo; the SSH alias in `~/.ssh/config` and `~/bin/CLAUDE.md` still uses the original name. Don't "reconcile" those by renaming the alias locally — Seth uses it as-is outside the published repo. - **Harness requires env vars for non-local hosts now.** Running `scripts/gpu-bakeoff/harness.py --host strix-halo` without `OLLAMA_STRIX_URL` set will error out with a clear message. Set it from the SSH alias / Tailscale IP as needed. - **The scrubbed URL constants are NOT in this repo.** If the next session needs to re-run the bakeoff against the original hosts, pull them from `~/bin/CLAUDE.md` (SSH aliases → tailscale/LAN IPs) or probe via `ssh strix-halo hostname -I` / equivalent. - **gemma4:latest on steel141 is the E4B-it variant (8 GB), NOT the MoE 26B.** Confirmed during smoke-testing. Other hosts may resolve `gemma4:latest` differently. - **Push-on-commit is the convention** for this repo (`~/bin/CLAUDE.md` Gitea section). Both commits this session were pushed immediately. ### Assumptions Made - The V100 was degraded "because of SDXL" based on `/api/ps` showing `size_vram: 1.57 GB` of a 30.5 GB model + `nvidia-smi` showing 31.7/32.7 GB used by other processes. **Not independently verified** by stopping SDXL and re-running; that's the open follow-up. If SDXL wasn't actually the culprit (e.g., Ollama version bug on that host), the finding needs revisiting. - matt-strix's `gemma4:31b` tag and steel141's `gemma4:31b-it-q4_K_M` tag are the same weights (both Q4_K_M, both 19.9 GB, both 31.3 B params). Verified via `/api/tags` metadata; not by hash comparison. - Ollama's `/api/generate` canonical timing fields (`prompt_eval_duration`, `eval_duration`, etc.) are trustworthy for throughput measurements. Supported by their deterministic behavior across runs; not compared against external profiling. ### Potential Gotchas - **`keep_alive: 10m` in the harness keeps models resident.** Running the full matrix against a host with limited VRAM can leave the model loaded after the harness exits; subsequent unrelated Ollama users may see degraded performance until `keep_alive` expires or another model evicts it. - **The V100 runs are gone from `scripts/gpu-bakeoff/runs/`** (commit `22af597`). Git history has them at `b619035^`. Don't write new code expecting `runs/pve197/` to exist locally. - **The native-bakeoff `content=""` fix is subtle.** If someone "improves" arm B to preserve the model's pre-tool-call thinking text as assistant content, they'll regress the turn-termination bug. Module-level comment in `scripts/native-bakeoff/arms/ollama_native.py` calls this out but is easy to miss. - **gemma.cpp status as of 2026-04-20:** dev branch README still says Gemma 2/3 + PaliGemma 2 only. Don't suggest gemma.cpp as a Gemma 4 option without re-checking. - **Arm B's raw_completion_tail/prompt_tail/prompt_head trace fields** were added during debugging and left in place. They make the trace JSONs larger than strictly necessary; ok to remove if cleanliness matters, but don't delete the fix they were added to diagnose. ## Environment State ### Tools/Services Used - Local Ollama on steel141 (127.0.0.1:11434) — version and model list as of session - Remote Ollama on strix-halo (via Tailscale) — version 0.21.0, models: `gemma4:26b`, `gemma4:31b` - Remote Ollama on pve197 CT 105 — models include the Q8 MoE `gemma4:26b-a4b-it-q8_0` that only fits V100 - Git / Gitea at `git.sethpc.xyz/Seth/gemma4-research` - Python 3 with `aiohttp`, `jinja2`, `urllib.request` (stdlib only for gpu-bakeoff) ### Active Processes - None started or left running by this session. The `keep_alive: 10m` in harness.py may still be holding models resident briefly post-session; they'll drop when the TTL expires. ### Environment Variables - `OLLAMA_STEEL141_URL` — default `http://127.0.0.1:11434` if unset - `OLLAMA_PVE197_URL` — no default; required if `--host pve197` - `OLLAMA_STRIX_URL` — no default; required if `--host strix-halo` - Optionally `OLLAMA_URL` for any one-off calls to a different host, though harness doesn't read this (No values are embedded in source; none logged here per handoff security policy.) ## Related Resources - [docs/reference/gpu-bakeoff-2026-04-20.md](../../docs/reference/gpu-bakeoff-2026-04-20.md) — the session's primary artifact - [scripts/gpu-bakeoff/](../../scripts/gpu-bakeoff/) — harness + raw traces - [scripts/native-bakeoff/](../../scripts/native-bakeoff/) — parked research thread, functional A+B arms - [tooling/huggingface/model-cards/gemma-4-E4B-it-chat_template.jinja](../../tooling/huggingface/model-cards/gemma-4-E4B-it-chat_template.jinja) — authoritative Gemma 4 chat template, rendered by arm B of native-bakeoff - [~/bin/DECISIONS.md](/home/claude/bin/DECISIONS.md) — three new 2026-04-20 entries relating to this session - [MEMORY index](/home/claude/.claude/projects/-home-claude-bin-gemma4-research/memory/MEMORY.md) — updated with PII-scrub feedback - Previous handoff: [2026-04-18-233832-openwebui-setup-doc.md](./2026-04-18-233832-openwebui-setup-doc.md) — chronological predecessor, topically unrelated - Gitea commits this session: `df5542f`, `b619035`, `22af597`, `91842f3` --- **Security Reminder**: Before finalizing, run `validate_handoff.py` to check for accidental secret exposure.