gemma4-research/.claude/handoffs/2026-04-20-055658-gpu-bakeoff-3090-vs-strix.md

# Handoff: GPU Bakeoff — 3090 Ti vs Strix Halo (+ parked native-bakeoff scaffold)

## Session Metadata
- Created: 2026-04-20 05:56:58
- Project: /home/claude/bin/gemma4-research
- Branch: master (pushed to origin)
- Session duration: ~extended session, multi-pivot (~4+ hours)

### Recent Commits (for context)
  - **91842f3 docs: scrub PII/IPs from gpu-bakeoff** ← latest, end of session
  - **22af597 docs: remove V100 from GPU bakeoff** ← V100 column dropped
  - **b619035 feat: GPU bakeoff — 3090 Ti vs V100 vs Strix Halo** ← initial write-up (superseded by later scrubs)
  - **df5542f feat: native-bakeoff scaffold — Ollama JSON vs native-token tool-calling** ← parked research
  - 91aaaa4 docs: redact PII from persistent-correspondence findings

## Handoff Chain

- **Continues from**: [2026-04-18-233832-openwebui-setup-doc.md](./2026-04-18-233832-openwebui-setup-doc.md)
  - Previous title: OpenWebUI Setup Doc for Gemma 4
- **Supersedes**: None

> This session is not a continuation of the OpenWebUI doc work — it's a fresh research thread on the same repo. The link is chronological, not topical. Previous handoff is only relevant if debugging OpenWebUI-related Gemma 4 behavior.

## Current State Summary

Session started on a native-vs-JSON tool-calling bakeoff question, pivoted to a cross-GPU throughput comparison mid-session, and shipped the latter. Final state: `docs/reference/gpu-bakeoff-2026-04-20.md` comparing `gemma4:26b` MoE and `gemma4:31b` dense decode/prefill rates on **RTX 3090 Ti (steel141)** vs **AMD Strix Halo iGPU (strix-halo host)**. V100 data was initially gathered and included but **removed** when it turned out the V100 was 95% CPU-bound due to SDXL coresident on CT 167 — the published doc is a clean 2-host comparison. Native-bakeoff harness (the earlier thread) remains scaffolded and committed at `scripts/native-bakeoff/` but not run further. Repo is clean, three commits pushed.

## Codebase Understanding

### Architecture Overview

The repo is a Gemma 4 research corpus. New this session:
- `scripts/native-bakeoff/` — three-arm tool-calling harness (Ollama JSON tools vs Ollama raw native tokens vs google-deepmind/gemma JAX ToolSampler). Arms A and B tested and functionally equivalent on `gemma4:26b` Q4 against a shared task suite lifted from mort-bakeoff. Arm C is env-gated (requires JAX + `gemma` PyPI package); wired but not run.
- `scripts/gpu-bakeoff/` — cross-GPU throughput harness. Takes host aliases from `HOSTS` dict and resolves URLs from env vars (`OLLAMA_STEEL141_URL`, `OLLAMA_PVE197_URL`, `OLLAMA_STRIX_URL`). Runs 1 warmup + 3 measurement calls per (host × model × prompt-length), logs Ollama's canonical timing fields, aggregates min/median/max.
- `docs/reference/gpu-bakeoff-2026-04-20.md` — the finished writeup. 3090 Ti + Strix Halo only.

The `docs/reference/` tier holds experimental findings; `docs/` top-level holds applied how-to guides. Both bakeoffs landed in `docs/reference/` which is correct.

### Critical Files

| File | Purpose | Relevance |
|------|---------|-----------|
| `docs/reference/gpu-bakeoff-2026-04-20.md` | The session's primary artifact | Read this first for the session's shipped findings |
| `scripts/gpu-bakeoff/harness.py` | GPU bakeoff harness, env-var-driven URL resolution | Re-run the bakeoff (e.g., for isolated V100) by setting env vars + invoking |
| `scripts/gpu-bakeoff/runs/**/*.json` | Raw per-call timing data | Source of truth for the doc's numbers; each JSON has warmup + 3 runs with full Ollama timing fields |
| `scripts/native-bakeoff/harness.py` | Parked three-arm tool-calling harness | Reference if revisiting the native-vs-JSON question; arms A and B are ready, arm C needs JAX env |
| `scripts/native-bakeoff/arms/ollama_native.py` | Arm B — renders the canonical HF jinja chat template directly, POSTs to /api/generate raw:true | Contains a subtle fix (keep assistant `content=""` when it has `tool_calls`) that's easy to regress |
| `tooling/huggingface/model-cards/gemma-4-E4B-it-chat_template.jinja` | Canonical Gemma 4 chat template, rendered by arm B | Authoritative source of Gemma 4's native tool-call wire format |
| `~/bin/DECISIONS.md` | Global decision log | Three new 2026-04-20 entries: MoE-preferred, 3090 Ti primary, V100 degraded |

### Key Patterns Discovered

- **MoE vs dense is a latency cliff, not a smooth curve.** `gemma4:26b` (MoE, ~4B active) decodes ~4.7× faster than `gemma4:31b` (dense, 31.3B active) on every GPU tested, because memory bandwidth is the binding constraint and the active-parameter bill is what you pay for per token. Total parameter count doesn't predict latency.
- **Ollama's JSON↔native-token tool-call translator is faithful** on `gemma4:26b` Q4. Arms A (JSON tools via `/api/chat`) and B (raw native tokens via `/api/generate raw:true`) produced identical behavioral shapes on the 4-task mort-bakeoff suite. Good for mort-bot's confidence in its production path.
- **Ollama's `/api/generate` strips matched stop tokens from the response.** Arm B's initial version mis-handled this by checking `done_reason == "stop"` as the "already terminated" branch; the correct logic is to always re-append the stop token based on which OPEN token (`<|tool_call>` vs `<|turn>`) is present in the completion.
- **Jinja `message.get('content')` checks the raw string, not the strip-thinking'd version.** Storing the model's `<|channel>thought\n<channel|>` prefix in an assistant message's `content` field causes the template's post-tool-response conditional to append a spurious `<turn|>\n`, corrupting the next step's prompt. Safe default: leave `content=""` when the message has `tool_calls`.

## Work Completed

### Tasks Finished

- [x] Researched "most native Gemma 4 engine" — concluded `google-deepmind/gemma` (JAX) is the canonical reference; `gemma.cpp` verified to still NOT support Gemma 4 on dev branch (main README "CPU-only inference for: Gemma 2-3, PaliGemma 2")
- [x] Scaffolded three-arm native-bakeoff harness (ollama-json, ollama-native, jax-native) at `scripts/native-bakeoff/`
- [x] Ran A+B sweep on `gemma4:26b` Q4 via Strix Halo host over Tailscale; debugged arm-B parser bug; concluded Ollama's JSON↔native translator is faithful
- [x] Probed GPU inventory across steel141 (3090 Ti), pve197 CT 105 (V100), strix-halo (Strix Halo iGPU)
- [x] Built `scripts/gpu-bakeoff/harness.py` — env-var-keyed hosts, warmup + 3 runs, canonical timing extraction
- [x] Ran the bakeoff; discovered V100 was 95% CPU-bound due to SDXL occupying ~31 GB of its VRAM
- [x] Wrote `docs/reference/gpu-bakeoff-2026-04-20.md` with V100 column initially included, then removed at Seth's direction
- [x] Scrubbed PII/IPs from the doc and harness: host alias `matt-strix` → `strix-halo`, URLs moved to env vars, `runs/` dir renamed, JSONs patched
- [x] Updated `~/bin/DECISIONS.md` with three 2026-04-20 entries
- [x] Added feedback memory for the PII-scrub preference
- [x] Updated `README.md` index entry for the new bakeoff doc

### Files Modified

| File | Changes | Rationale |
|------|---------|-----------|
| `docs/reference/gpu-bakeoff-2026-04-20.md` | Created (final: 3090 Ti vs Strix Halo) | Session's primary artifact |
| `scripts/gpu-bakeoff/` | New dir — harness + runs | Bakeoff infrastructure |
| `scripts/native-bakeoff/` | New dir — three-arm harness, parked | Earlier research thread, parked but shippable |
| `README.md` | One new row in the file index | Discoverability for the new doc |
| `~/bin/DECISIONS.md` | Three new 2026-04-20 entries | MoE preference, 3090 Ti primacy, V100-SDXL contention |
| `~/.claude/projects/-home-claude-bin-gemma4-research/memory/feedback_scrub_pii_before_publish.md` | New memory entry | Seth's preference for scrubbing artifacts before sharing |
| `~/.claude/projects/-home-claude-bin-gemma4-research/memory/MEMORY.md` | Index entry added | Link to the new memory |

### Decisions Made

| Decision | Options Considered | Rationale |
|----------|-------------------|-----------|
| Pivot from native-bakeoff to GPU-bakeoff mid-session | Complete native-bakeoff first; park and come back | Seth explicitly pivoted ("What I really want is..."); native-bakeoff was already functionally answered (A ≡ B) |
| Remove V100 from GPU-bakeoff doc entirely rather than keep with caveat | Keep with prominent ⚠ badge; drop the column | Seth directed "remove v100 from doc"; degraded data with caveat pollutes the narrative |
| Env-var-ize host URLs in harness source rather than config file | .env file; hard-coded with placeholders; CLI-only | Lightest change that accomplishes scrub; localhost default keeps steel141 path usable out of the box |
| Start GPU bakeoff on E4B, not 26B, for the native-bakeoff thread | Go straight to 26B (production model) | Actually reversed to 26B mid-session when strix-halo (Matt's host) was found reachable with `gemma4:26b` already pulled — production-shape became the shipped path |
| Don't rewrite git history to remove IPs from earlier commits | Force-push a cleaned history | Destructive; Seth's "remove IP/PI" was scoped to current artifacts, not a history scrub. Flagged the tradeoff and did not act |
| Chain this handoff to the previous OpenWebUI one chronologically even though topically unrelated | Link as "continues from"; mark "supersedes"; no chain | Session-handoff skill's chain field is chronological per doc conventions; the narrative separation is called out in the body |

## Pending Work

## Immediate Next Steps

1. **(Optional) Isolated V100 re-run.** Stop CT 167 (ai-visualizer / SDXL) on pve197, then `OLLAMA_PVE197_URL=http://<ip>:11434 python3 scripts/gpu-bakeoff/harness.py --host pve197`. Expected result: V100 lands between 3090 Ti and Strix Halo based on HBM2 ~900 GB/s spec. Add a V100 column back to the doc with isolated numbers. Judgment call — worth the ai-visualizer interruption?
2. **(Optional) Strix max-model-fit follow-up.** Strix can host models neither the 3090 Ti nor V100 can. Pull a larger model (gemma4:26b-a4b-it-q8_0 at 28 GB, or something 40B+) on the Strix Halo host; re-run harness to characterize the bandwidth/capacity ceiling for that architecture.
3. **(Optional) Close the native-bakeoff thread with arm C.** Set up a JAX env on steel141 or in a vast-h100 session, pip install `gemma`, run the JAX ToolSampler arm against the same mort-bakeoff task suite. If arm C matches arms A/B, that's definitive "Ollama's runtime is faithful to the DeepMind reference." If it diverges, the GGUF quantization / llama.cpp runtime is the variable to investigate.

### Blockers/Open Questions

- **Does `gemma4:31b-it-q4_K_M` on the V100 still deserve its 2026-04-07 "primary model on V100" designation?** The new 2026-04-20 decision noting 26B-MoE preference doesn't formally supersede it — they coexist on a speed vs quality axis that wasn't measured here. If a future session cares, a quality bakeoff (same tasks, qualitatively scored outputs) would resolve it.
- **Quantization sensitivity unmeasured.** All bakeoff numbers are Q4_K_M. Q8 vs Q4 throughput ratio on the same model (especially on Strix where more headroom is available) is an open question that came up in the "open questions" section of the doc.

### Deferred Items

- **Native-bakeoff arm C** — env setup cost, not landing in this session.
- **Git history scrub** — would require force-push; Seth's scrub request was interpreted as "current artifacts only" and he was informed of the tradeoff.
- **DECISIONS.md per-project local** — considered creating a project-local decision log for the bakeoff findings but instead promoted them to the global log (`~/bin/DECISIONS.md`) since the hardware/model implications are cross-project.

## Context for Resuming Agent

## Important Context

- **The V100 caveat is in git history (commit b619035) but not the final doc.** If someone greps the repo for "V100" and expects to find it in the current head, they won't — the final commit `22af597` removed it deliberately.
- **Host aliases were scrubbed this session.** `matt-strix` was renamed to `strix-halo` in the repo; the SSH alias in `~/.ssh/config` and `~/bin/CLAUDE.md` still uses the original name. Don't "reconcile" those by renaming the alias locally — Seth uses it as-is outside the published repo.
- **Harness requires env vars for non-local hosts now.** Running `scripts/gpu-bakeoff/harness.py --host strix-halo` without `OLLAMA_STRIX_URL` set will error out with a clear message. Set it from the SSH alias / Tailscale IP as needed.
- **The scrubbed URL constants are NOT in this repo.** If the next session needs to re-run the bakeoff against the original hosts, pull them from `~/bin/CLAUDE.md` (SSH aliases → tailscale/LAN IPs) or probe via `ssh strix-halo hostname -I` / equivalent.
- **gemma4:latest on steel141 is the E4B-it variant (8 GB), NOT the MoE 26B.** Confirmed during smoke-testing. Other hosts may resolve `gemma4:latest` differently.
- **Push-on-commit is the convention** for this repo (`~/bin/CLAUDE.md` Gitea section). Both commits this session were pushed immediately.

### Assumptions Made

- The V100 was degraded "because of SDXL" based on `/api/ps` showing `size_vram: 1.57 GB` of a 30.5 GB model + `nvidia-smi` showing 31.7/32.7 GB used by other processes. **Not independently verified** by stopping SDXL and re-running; that's the open follow-up. If SDXL wasn't actually the culprit (e.g., Ollama version bug on that host), the finding needs revisiting.
- matt-strix's `gemma4:31b` tag and steel141's `gemma4:31b-it-q4_K_M` tag are the same weights (both Q4_K_M, both 19.9 GB, both 31.3 B params). Verified via `/api/tags` metadata; not by hash comparison.
- Ollama's `/api/generate` canonical timing fields (`prompt_eval_duration`, `eval_duration`, etc.) are trustworthy for throughput measurements. Supported by their deterministic behavior across runs; not compared against external profiling.

### Potential Gotchas

- **`keep_alive: 10m` in the harness keeps models resident.** Running the full matrix against a host with limited VRAM can leave the model loaded after the harness exits; subsequent unrelated Ollama users may see degraded performance until `keep_alive` expires or another model evicts it.
- **The V100 runs are gone from `scripts/gpu-bakeoff/runs/`** (commit `22af597`). Git history has them at `b619035^`. Don't write new code expecting `runs/pve197/` to exist locally.
- **The native-bakeoff `content=""` fix is subtle.** If someone "improves" arm B to preserve the model's pre-tool-call thinking text as assistant content, they'll regress the turn-termination bug. Module-level comment in `scripts/native-bakeoff/arms/ollama_native.py` calls this out but is easy to miss.
- **gemma.cpp status as of 2026-04-20:** dev branch README still says Gemma 2/3 + PaliGemma 2 only. Don't suggest gemma.cpp as a Gemma 4 option without re-checking.
- **Arm B's raw_completion_tail/prompt_tail/prompt_head trace fields** were added during debugging and left in place. They make the trace JSONs larger than strictly necessary; ok to remove if cleanliness matters, but don't delete the fix they were added to diagnose.

## Environment State

### Tools/Services Used

- Local Ollama on steel141 (127.0.0.1:11434) — version and model list as of session
- Remote Ollama on strix-halo (via Tailscale) — version 0.21.0, models: `gemma4:26b`, `gemma4:31b`
- Remote Ollama on pve197 CT 105 — models include the Q8 MoE `gemma4:26b-a4b-it-q8_0` that only fits V100
- Git / Gitea at `git.sethpc.xyz/Seth/gemma4-research`
- Python 3 with `aiohttp`, `jinja2`, `urllib.request` (stdlib only for gpu-bakeoff)

### Active Processes

- None started or left running by this session. The `keep_alive: 10m` in harness.py may still be holding models resident briefly post-session; they'll drop when the TTL expires.

### Environment Variables

- `OLLAMA_STEEL141_URL` — default `http://127.0.0.1:11434` if unset
- `OLLAMA_PVE197_URL` — no default; required if `--host pve197`
- `OLLAMA_STRIX_URL` — no default; required if `--host strix-halo`
- Optionally `OLLAMA_URL` for any one-off calls to a different host, though harness doesn't read this

(No values are embedded in source; none logged here per handoff security policy.)

## Related Resources

- [docs/reference/gpu-bakeoff-2026-04-20.md](../../docs/reference/gpu-bakeoff-2026-04-20.md) — the session's primary artifact
- [scripts/gpu-bakeoff/](../../scripts/gpu-bakeoff/) — harness + raw traces
- [scripts/native-bakeoff/](../../scripts/native-bakeoff/) — parked research thread, functional A+B arms
- [tooling/huggingface/model-cards/gemma-4-E4B-it-chat_template.jinja](../../tooling/huggingface/model-cards/gemma-4-E4B-it-chat_template.jinja) — authoritative Gemma 4 chat template, rendered by arm B of native-bakeoff
- [~/bin/DECISIONS.md](/home/claude/bin/DECISIONS.md) — three new 2026-04-20 entries relating to this session
- [MEMORY index](/home/claude/.claude/projects/-home-claude-bin-gemma4-research/memory/MEMORY.md) — updated with PII-scrub feedback
- Previous handoff: [2026-04-18-233832-openwebui-setup-doc.md](./2026-04-18-233832-openwebui-setup-doc.md) — chronological predecessor, topically unrelated
- Gitea commits this session: `df5542f`, `b619035`, `22af597`, `91842f3`

---

**Security Reminder**: Before finalizing, run `validate_handoff.py` to check for accidental secret exposure.