Files
gemma4-research/.claude/handoffs/2026-04-20-055658-gpu-bakeoff-3090-vs-strix.md
T
Mortdecai 0f82cd71b1 docs: session handoff — GPU bakeoff (3090 Ti vs Strix Halo)
Closes out the session that produced docs/reference/gpu-bakeoff-2026-04-20.md
and the parked scripts/native-bakeoff/ scaffold. Chains (chronologically)
from the 2026-04-18 OpenWebUI setup handoff though the topic is unrelated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 06:00:07 -04:00

178 lines
17 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Handoff: GPU Bakeoff — 3090 Ti vs Strix Halo (+ parked native-bakeoff scaffold)
## Session Metadata
- Created: 2026-04-20 05:56:58
- Project: /home/claude/bin/gemma4-research
- Branch: master (pushed to origin)
- Session duration: ~extended session, multi-pivot (~4+ hours)
### Recent Commits (for context)
- **91842f3 docs: scrub PII/IPs from gpu-bakeoff** ← latest, end of session
- **22af597 docs: remove V100 from GPU bakeoff** ← V100 column dropped
- **b619035 feat: GPU bakeoff — 3090 Ti vs V100 vs Strix Halo** ← initial write-up (superseded by later scrubs)
- **df5542f feat: native-bakeoff scaffold — Ollama JSON vs native-token tool-calling** ← parked research
- 91aaaa4 docs: redact PII from persistent-correspondence findings
## Handoff Chain
- **Continues from**: [2026-04-18-233832-openwebui-setup-doc.md](./2026-04-18-233832-openwebui-setup-doc.md)
- Previous title: OpenWebUI Setup Doc for Gemma 4
- **Supersedes**: None
> This session is not a continuation of the OpenWebUI doc work — it's a fresh research thread on the same repo. The link is chronological, not topical. Previous handoff is only relevant if debugging OpenWebUI-related Gemma 4 behavior.
## Current State Summary
Session started on a native-vs-JSON tool-calling bakeoff question, pivoted to a cross-GPU throughput comparison mid-session, and shipped the latter. Final state: `docs/reference/gpu-bakeoff-2026-04-20.md` comparing `gemma4:26b` MoE and `gemma4:31b` dense decode/prefill rates on **RTX 3090 Ti (steel141)** vs **AMD Strix Halo iGPU (strix-halo host)**. V100 data was initially gathered and included but **removed** when it turned out the V100 was 95% CPU-bound due to SDXL coresident on CT 167 — the published doc is a clean 2-host comparison. Native-bakeoff harness (the earlier thread) remains scaffolded and committed at `scripts/native-bakeoff/` but not run further. Repo is clean, three commits pushed.
## Codebase Understanding
### Architecture Overview
The repo is a Gemma 4 research corpus. New this session:
- `scripts/native-bakeoff/` — three-arm tool-calling harness (Ollama JSON tools vs Ollama raw native tokens vs google-deepmind/gemma JAX ToolSampler). Arms A and B tested and functionally equivalent on `gemma4:26b` Q4 against a shared task suite lifted from mort-bakeoff. Arm C is env-gated (requires JAX + `gemma` PyPI package); wired but not run.
- `scripts/gpu-bakeoff/` — cross-GPU throughput harness. Takes host aliases from `HOSTS` dict and resolves URLs from env vars (`OLLAMA_STEEL141_URL`, `OLLAMA_PVE197_URL`, `OLLAMA_STRIX_URL`). Runs 1 warmup + 3 measurement calls per (host × model × prompt-length), logs Ollama's canonical timing fields, aggregates min/median/max.
- `docs/reference/gpu-bakeoff-2026-04-20.md` — the finished writeup. 3090 Ti + Strix Halo only.
The `docs/reference/` tier holds experimental findings; `docs/` top-level holds applied how-to guides. Both bakeoffs landed in `docs/reference/` which is correct.
### Critical Files
| File | Purpose | Relevance |
|------|---------|-----------|
| `docs/reference/gpu-bakeoff-2026-04-20.md` | The session's primary artifact | Read this first for the session's shipped findings |
| `scripts/gpu-bakeoff/harness.py` | GPU bakeoff harness, env-var-driven URL resolution | Re-run the bakeoff (e.g., for isolated V100) by setting env vars + invoking |
| `scripts/gpu-bakeoff/runs/**/*.json` | Raw per-call timing data | Source of truth for the doc's numbers; each JSON has warmup + 3 runs with full Ollama timing fields |
| `scripts/native-bakeoff/harness.py` | Parked three-arm tool-calling harness | Reference if revisiting the native-vs-JSON question; arms A and B are ready, arm C needs JAX env |
| `scripts/native-bakeoff/arms/ollama_native.py` | Arm B — renders the canonical HF jinja chat template directly, POSTs to /api/generate raw:true | Contains a subtle fix (keep assistant `content=""` when it has `tool_calls`) that's easy to regress |
| `tooling/huggingface/model-cards/gemma-4-E4B-it-chat_template.jinja` | Canonical Gemma 4 chat template, rendered by arm B | Authoritative source of Gemma 4's native tool-call wire format |
| `~/bin/DECISIONS.md` | Global decision log | Three new 2026-04-20 entries: MoE-preferred, 3090 Ti primary, V100 degraded |
### Key Patterns Discovered
- **MoE vs dense is a latency cliff, not a smooth curve.** `gemma4:26b` (MoE, ~4B active) decodes ~4.7× faster than `gemma4:31b` (dense, 31.3B active) on every GPU tested, because memory bandwidth is the binding constraint and the active-parameter bill is what you pay for per token. Total parameter count doesn't predict latency.
- **Ollama's JSON↔native-token tool-call translator is faithful** on `gemma4:26b` Q4. Arms A (JSON tools via `/api/chat`) and B (raw native tokens via `/api/generate raw:true`) produced identical behavioral shapes on the 4-task mort-bakeoff suite. Good for mort-bot's confidence in its production path.
- **Ollama's `/api/generate` strips matched stop tokens from the response.** Arm B's initial version mis-handled this by checking `done_reason == "stop"` as the "already terminated" branch; the correct logic is to always re-append the stop token based on which OPEN token (`<|tool_call>` vs `<|turn>`) is present in the completion.
- **Jinja `message.get('content')` checks the raw string, not the strip-thinking'd version.** Storing the model's `<|channel>thought\n<channel|>` prefix in an assistant message's `content` field causes the template's post-tool-response conditional to append a spurious `<turn|>\n`, corrupting the next step's prompt. Safe default: leave `content=""` when the message has `tool_calls`.
## Work Completed
### Tasks Finished
- [x] Researched "most native Gemma 4 engine" — concluded `google-deepmind/gemma` (JAX) is the canonical reference; `gemma.cpp` verified to still NOT support Gemma 4 on dev branch (main README "CPU-only inference for: Gemma 2-3, PaliGemma 2")
- [x] Scaffolded three-arm native-bakeoff harness (ollama-json, ollama-native, jax-native) at `scripts/native-bakeoff/`
- [x] Ran A+B sweep on `gemma4:26b` Q4 via Strix Halo host over Tailscale; debugged arm-B parser bug; concluded Ollama's JSON↔native translator is faithful
- [x] Probed GPU inventory across steel141 (3090 Ti), pve197 CT 105 (V100), strix-halo (Strix Halo iGPU)
- [x] Built `scripts/gpu-bakeoff/harness.py` — env-var-keyed hosts, warmup + 3 runs, canonical timing extraction
- [x] Ran the bakeoff; discovered V100 was 95% CPU-bound due to SDXL occupying ~31 GB of its VRAM
- [x] Wrote `docs/reference/gpu-bakeoff-2026-04-20.md` with V100 column initially included, then removed at Seth's direction
- [x] Scrubbed PII/IPs from the doc and harness: host alias `matt-strix``strix-halo`, URLs moved to env vars, `runs/` dir renamed, JSONs patched
- [x] Updated `~/bin/DECISIONS.md` with three 2026-04-20 entries
- [x] Added feedback memory for the PII-scrub preference
- [x] Updated `README.md` index entry for the new bakeoff doc
### Files Modified
| File | Changes | Rationale |
|------|---------|-----------|
| `docs/reference/gpu-bakeoff-2026-04-20.md` | Created (final: 3090 Ti vs Strix Halo) | Session's primary artifact |
| `scripts/gpu-bakeoff/` | New dir — harness + runs | Bakeoff infrastructure |
| `scripts/native-bakeoff/` | New dir — three-arm harness, parked | Earlier research thread, parked but shippable |
| `README.md` | One new row in the file index | Discoverability for the new doc |
| `~/bin/DECISIONS.md` | Three new 2026-04-20 entries | MoE preference, 3090 Ti primacy, V100-SDXL contention |
| `~/.claude/projects/-home-claude-bin-gemma4-research/memory/feedback_scrub_pii_before_publish.md` | New memory entry | Seth's preference for scrubbing artifacts before sharing |
| `~/.claude/projects/-home-claude-bin-gemma4-research/memory/MEMORY.md` | Index entry added | Link to the new memory |
### Decisions Made
| Decision | Options Considered | Rationale |
|----------|-------------------|-----------|
| Pivot from native-bakeoff to GPU-bakeoff mid-session | Complete native-bakeoff first; park and come back | Seth explicitly pivoted ("What I really want is..."); native-bakeoff was already functionally answered (A ≡ B) |
| Remove V100 from GPU-bakeoff doc entirely rather than keep with caveat | Keep with prominent ⚠ badge; drop the column | Seth directed "remove v100 from doc"; degraded data with caveat pollutes the narrative |
| Env-var-ize host URLs in harness source rather than config file | .env file; hard-coded with placeholders; CLI-only | Lightest change that accomplishes scrub; localhost default keeps steel141 path usable out of the box |
| Start GPU bakeoff on E4B, not 26B, for the native-bakeoff thread | Go straight to 26B (production model) | Actually reversed to 26B mid-session when strix-halo (Matt's host) was found reachable with `gemma4:26b` already pulled — production-shape became the shipped path |
| Don't rewrite git history to remove IPs from earlier commits | Force-push a cleaned history | Destructive; Seth's "remove IP/PI" was scoped to current artifacts, not a history scrub. Flagged the tradeoff and did not act |
| Chain this handoff to the previous OpenWebUI one chronologically even though topically unrelated | Link as "continues from"; mark "supersedes"; no chain | Session-handoff skill's chain field is chronological per doc conventions; the narrative separation is called out in the body |
## Pending Work
## Immediate Next Steps
1. **(Optional) Isolated V100 re-run.** Stop CT 167 (ai-visualizer / SDXL) on pve197, then `OLLAMA_PVE197_URL=http://<ip>:11434 python3 scripts/gpu-bakeoff/harness.py --host pve197`. Expected result: V100 lands between 3090 Ti and Strix Halo based on HBM2 ~900 GB/s spec. Add a V100 column back to the doc with isolated numbers. Judgment call — worth the ai-visualizer interruption?
2. **(Optional) Strix max-model-fit follow-up.** Strix can host models neither the 3090 Ti nor V100 can. Pull a larger model (gemma4:26b-a4b-it-q8_0 at 28 GB, or something 40B+) on the Strix Halo host; re-run harness to characterize the bandwidth/capacity ceiling for that architecture.
3. **(Optional) Close the native-bakeoff thread with arm C.** Set up a JAX env on steel141 or in a vast-h100 session, pip install `gemma`, run the JAX ToolSampler arm against the same mort-bakeoff task suite. If arm C matches arms A/B, that's definitive "Ollama's runtime is faithful to the DeepMind reference." If it diverges, the GGUF quantization / llama.cpp runtime is the variable to investigate.
### Blockers/Open Questions
- **Does `gemma4:31b-it-q4_K_M` on the V100 still deserve its 2026-04-07 "primary model on V100" designation?** The new 2026-04-20 decision noting 26B-MoE preference doesn't formally supersede it — they coexist on a speed vs quality axis that wasn't measured here. If a future session cares, a quality bakeoff (same tasks, qualitatively scored outputs) would resolve it.
- **Quantization sensitivity unmeasured.** All bakeoff numbers are Q4_K_M. Q8 vs Q4 throughput ratio on the same model (especially on Strix where more headroom is available) is an open question that came up in the "open questions" section of the doc.
### Deferred Items
- **Native-bakeoff arm C** — env setup cost, not landing in this session.
- **Git history scrub** — would require force-push; Seth's scrub request was interpreted as "current artifacts only" and he was informed of the tradeoff.
- **DECISIONS.md per-project local** — considered creating a project-local decision log for the bakeoff findings but instead promoted them to the global log (`~/bin/DECISIONS.md`) since the hardware/model implications are cross-project.
## Context for Resuming Agent
## Important Context
- **The V100 caveat is in git history (commit b619035) but not the final doc.** If someone greps the repo for "V100" and expects to find it in the current head, they won't — the final commit `22af597` removed it deliberately.
- **Host aliases were scrubbed this session.** `matt-strix` was renamed to `strix-halo` in the repo; the SSH alias in `~/.ssh/config` and `~/bin/CLAUDE.md` still uses the original name. Don't "reconcile" those by renaming the alias locally — Seth uses it as-is outside the published repo.
- **Harness requires env vars for non-local hosts now.** Running `scripts/gpu-bakeoff/harness.py --host strix-halo` without `OLLAMA_STRIX_URL` set will error out with a clear message. Set it from the SSH alias / Tailscale IP as needed.
- **The scrubbed URL constants are NOT in this repo.** If the next session needs to re-run the bakeoff against the original hosts, pull them from `~/bin/CLAUDE.md` (SSH aliases → tailscale/LAN IPs) or probe via `ssh strix-halo hostname -I` / equivalent.
- **gemma4:latest on steel141 is the E4B-it variant (8 GB), NOT the MoE 26B.** Confirmed during smoke-testing. Other hosts may resolve `gemma4:latest` differently.
- **Push-on-commit is the convention** for this repo (`~/bin/CLAUDE.md` Gitea section). Both commits this session were pushed immediately.
### Assumptions Made
- The V100 was degraded "because of SDXL" based on `/api/ps` showing `size_vram: 1.57 GB` of a 30.5 GB model + `nvidia-smi` showing 31.7/32.7 GB used by other processes. **Not independently verified** by stopping SDXL and re-running; that's the open follow-up. If SDXL wasn't actually the culprit (e.g., Ollama version bug on that host), the finding needs revisiting.
- matt-strix's `gemma4:31b` tag and steel141's `gemma4:31b-it-q4_K_M` tag are the same weights (both Q4_K_M, both 19.9 GB, both 31.3 B params). Verified via `/api/tags` metadata; not by hash comparison.
- Ollama's `/api/generate` canonical timing fields (`prompt_eval_duration`, `eval_duration`, etc.) are trustworthy for throughput measurements. Supported by their deterministic behavior across runs; not compared against external profiling.
### Potential Gotchas
- **`keep_alive: 10m` in the harness keeps models resident.** Running the full matrix against a host with limited VRAM can leave the model loaded after the harness exits; subsequent unrelated Ollama users may see degraded performance until `keep_alive` expires or another model evicts it.
- **The V100 runs are gone from `scripts/gpu-bakeoff/runs/`** (commit `22af597`). Git history has them at `b619035^`. Don't write new code expecting `runs/pve197/` to exist locally.
- **The native-bakeoff `content=""` fix is subtle.** If someone "improves" arm B to preserve the model's pre-tool-call thinking text as assistant content, they'll regress the turn-termination bug. Module-level comment in `scripts/native-bakeoff/arms/ollama_native.py` calls this out but is easy to miss.
- **gemma.cpp status as of 2026-04-20:** dev branch README still says Gemma 2/3 + PaliGemma 2 only. Don't suggest gemma.cpp as a Gemma 4 option without re-checking.
- **Arm B's raw_completion_tail/prompt_tail/prompt_head trace fields** were added during debugging and left in place. They make the trace JSONs larger than strictly necessary; ok to remove if cleanliness matters, but don't delete the fix they were added to diagnose.
## Environment State
### Tools/Services Used
- Local Ollama on steel141 (127.0.0.1:11434) — version and model list as of session
- Remote Ollama on strix-halo (via Tailscale) — version 0.21.0, models: `gemma4:26b`, `gemma4:31b`
- Remote Ollama on pve197 CT 105 — models include the Q8 MoE `gemma4:26b-a4b-it-q8_0` that only fits V100
- Git / Gitea at `git.sethpc.xyz/Seth/gemma4-research`
- Python 3 with `aiohttp`, `jinja2`, `urllib.request` (stdlib only for gpu-bakeoff)
### Active Processes
- None started or left running by this session. The `keep_alive: 10m` in harness.py may still be holding models resident briefly post-session; they'll drop when the TTL expires.
### Environment Variables
- `OLLAMA_STEEL141_URL` — default `http://127.0.0.1:11434` if unset
- `OLLAMA_PVE197_URL` — no default; required if `--host pve197`
- `OLLAMA_STRIX_URL` — no default; required if `--host strix-halo`
- Optionally `OLLAMA_URL` for any one-off calls to a different host, though harness doesn't read this
(No values are embedded in source; none logged here per handoff security policy.)
## Related Resources
- [docs/reference/gpu-bakeoff-2026-04-20.md](../../docs/reference/gpu-bakeoff-2026-04-20.md) — the session's primary artifact
- [scripts/gpu-bakeoff/](../../scripts/gpu-bakeoff/) — harness + raw traces
- [scripts/native-bakeoff/](../../scripts/native-bakeoff/) — parked research thread, functional A+B arms
- [tooling/huggingface/model-cards/gemma-4-E4B-it-chat_template.jinja](../../tooling/huggingface/model-cards/gemma-4-E4B-it-chat_template.jinja) — authoritative Gemma 4 chat template, rendered by arm B of native-bakeoff
- [~/bin/DECISIONS.md](/home/claude/bin/DECISIONS.md) — three new 2026-04-20 entries relating to this session
- [MEMORY index](/home/claude/.claude/projects/-home-claude-bin-gemma4-research/memory/MEMORY.md) — updated with PII-scrub feedback
- Previous handoff: [2026-04-18-233832-openwebui-setup-doc.md](./2026-04-18-233832-openwebui-setup-doc.md) — chronological predecessor, topically unrelated
- Gitea commits this session: `df5542f`, `b619035`, `22af597`, `91842f3`
---
**Security Reminder**: Before finalizing, run `validate_handoff.py` to check for accidental secret exposure.