From 0f82cd71b1d87c9e5b949a76073a8456de179cd8 Mon Sep 17 00:00:00 2001
From: Mortdecai
Date: Mon, 20 Apr 2026 06:00:07 -0400
Subject: [PATCH] =?UTF-8?q?docs:=20session=20handoff=20=E2=80=94=20GPU=20b?=
 =?UTF-8?q?akeoff=20(3090=20Ti=20vs=20Strix=20Halo)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Closes out the session that produced docs/reference/gpu-bakeoff-2026-04-20.md
and the parked scripts/native-bakeoff/ scaffold. Chains (chronologically)
from the 2026-04-18 OpenWebUI setup handoff, though the topic is unrelated.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 ...-04-20-055658-gpu-bakeoff-3090-vs-strix.md | 177 ++++++++++++++++++
 1 file changed, 177 insertions(+)
 create mode 100644 .claude/handoffs/2026-04-20-055658-gpu-bakeoff-3090-vs-strix.md

diff --git a/.claude/handoffs/2026-04-20-055658-gpu-bakeoff-3090-vs-strix.md b/.claude/handoffs/2026-04-20-055658-gpu-bakeoff-3090-vs-strix.md
new file mode 100644
index 0000000..d122661
--- /dev/null
+++ b/.claude/handoffs/2026-04-20-055658-gpu-bakeoff-3090-vs-strix.md
@@ -0,0 +1,177 @@
+# Handoff: GPU Bakeoff — 3090 Ti vs Strix Halo (+ parked native-bakeoff scaffold)
+
+## Session Metadata
+- Created: 2026-04-20 05:56:58
+- Project: /home/claude/bin/gemma4-research
+- Branch: master (pushed to origin)
+- Session duration: extended, multi-pivot session (~4+ hours)
+
+### Recent Commits (for context)
+  - **91842f3 docs: scrub PII/IPs from gpu-bakeoff** ← latest, end of session
+  - **22af597 docs: remove V100 from GPU bakeoff** ← V100 column dropped
+  - **b619035 feat: GPU bakeoff — 3090 Ti vs V100 vs Strix Halo** ← initial write-up (superseded by later scrubs)
+  - **df5542f feat: native-bakeoff scaffold — Ollama JSON vs native-token tool-calling** ← parked research
+  - 91aaaa4 docs: redact PII from persistent-correspondence findings
+
+## Handoff Chain
+
+- **Continues from**: [2026-04-18-233832-openwebui-setup-doc.md](./2026-04-18-233832-openwebui-setup-doc.md)
+  - Previous title: OpenWebUI Setup Doc for Gemma 4
+- **Supersedes**: None
+
+> This session is not a continuation of the OpenWebUI doc work — it's a fresh research thread on the same repo. The link is chronological, not topical. Previous handoff is only relevant if debugging OpenWebUI-related Gemma 4 behavior.
+
+## Current State Summary
+
+Session started on a native-vs-JSON tool-calling bakeoff question, pivoted to a cross-GPU throughput comparison mid-session, and shipped the latter. Final state: `docs/reference/gpu-bakeoff-2026-04-20.md` comparing `gemma4:26b` MoE and `gemma4:31b` dense decode/prefill rates on **RTX 3090 Ti (steel141)** vs **AMD Strix Halo iGPU (strix-halo host)**. V100 data was initially gathered and included but **removed** when it turned out the V100 was 95% CPU-bound because SDXL was co-resident on CT 167 — the published doc is a clean 2-host comparison. Native-bakeoff harness (the earlier thread) remains scaffolded and committed at `scripts/native-bakeoff/` but was not run further. Repo is clean, all four session commits pushed.
+
+## Codebase Understanding
+
+### Architecture Overview
+
+The repo is a Gemma 4 research corpus. New this session:
+- `scripts/native-bakeoff/` — three-arm tool-calling harness (Ollama JSON tools vs Ollama raw native tokens vs google-deepmind/gemma JAX ToolSampler). Arms A and B tested and functionally equivalent on `gemma4:26b` Q4 against a shared task suite lifted from mort-bakeoff. Arm C is env-gated (requires JAX + `gemma` PyPI package); wired but not run.
+- `scripts/gpu-bakeoff/` — cross-GPU throughput harness. Takes host aliases from `HOSTS` dict and resolves URLs from env vars (`OLLAMA_STEEL141_URL`, `OLLAMA_PVE197_URL`, `OLLAMA_STRIX_URL`). Runs 1 warmup + 3 measurement calls per (host × model × prompt-length), logs Ollama's canonical timing fields, aggregates min/median/max.
+- `docs/reference/gpu-bakeoff-2026-04-20.md` — the finished writeup. 3090 Ti + Strix Halo only.
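The measurement loop described for `scripts/gpu-bakeoff/` can be sketched roughly as follows. This is a minimal illustration, not the harness itself: `tokens_per_second` and `summarize` are hypothetical helper names, and the sample records are invented values used only to exercise the arithmetic. It relies on Ollama reporting `eval_count` and `prompt_eval_count` as token counts and `eval_duration` and `prompt_eval_duration` in nanoseconds.

```python
import statistics

NS_PER_S = 1_000_000_000  # Ollama reports durations in nanoseconds


def tokens_per_second(count: int, duration_ns: int) -> float:
    """Convert a token count and a nanosecond duration into tokens/s."""
    if duration_ns <= 0:
        raise ValueError("duration must be positive")
    return count * NS_PER_S / duration_ns


def summarize(runs: list) -> dict:
    """Aggregate decode and prefill rates over the measurement runs.

    `runs` holds Ollama response dicts for the measured calls only;
    the warmup call is assumed to have been discarded by the caller.
    """
    decode = [tokens_per_second(r["eval_count"], r["eval_duration"]) for r in runs]
    prefill = [
        tokens_per_second(r["prompt_eval_count"], r["prompt_eval_duration"])
        for r in runs
    ]

    def stats(values):
        return {
            "min": min(values),
            "median": statistics.median(values),
            "max": max(values),
        }

    return {"decode_tok_s": stats(decode), "prefill_tok_s": stats(prefill)}


# Invented records for illustration only; these are NOT measured data.
sample_runs = [
    {"eval_count": 100, "eval_duration": 2_000_000_000,
     "prompt_eval_count": 500, "prompt_eval_duration": 1_000_000_000},
    {"eval_count": 100, "eval_duration": 2_500_000_000,
     "prompt_eval_count": 500, "prompt_eval_duration": 1_250_000_000},
    {"eval_count": 100, "eval_duration": 4_000_000_000,
     "prompt_eval_count": 500, "prompt_eval_duration": 2_000_000_000},
]
summary = summarize(sample_runs)
```

Reporting the median alongside min and max keeps a single slow call (for example, one that overlapped with a model load) from skewing the headline number.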
+
+The `docs/reference/` tier holds experimental findings; `docs/` top-level holds applied how-to guides. Both bakeoffs landed in `docs/reference/` which is correct.
+
+### Critical Files
+
+| File | Purpose | Relevance |
+|------|---------|-----------|
+| `docs/reference/gpu-bakeoff-2026-04-20.md` | The session's primary artifact | Read this first for the session's shipped findings |
+| `scripts/gpu-bakeoff/harness.py` | GPU bakeoff harness, env-var-driven URL resolution | Re-run the bakeoff (e.g., for isolated V100) by setting env vars + invoking |
+| `scripts/gpu-bakeoff/runs/**/*.json` | Raw per-call timing data | Source of truth for the doc's numbers; each JSON has warmup + 3 runs with full Ollama timing fields |
+| `scripts/native-bakeoff/harness.py` | Parked three-arm tool-calling harness | Reference if revisiting the native-vs-JSON question; arms A and B are ready, arm C needs JAX env |
+| `scripts/native-bakeoff/arms/ollama_native.py` | Arm B — renders the canonical HF jinja chat template directly, POSTs to /api/generate raw:true | Contains a subtle fix (keep assistant `content=""` when it has `tool_calls`) that's easy to regress |
+| `tooling/huggingface/model-cards/gemma-4-E4B-it-chat_template.jinja` | Canonical Gemma 4 chat template, rendered by arm B | Authoritative source of Gemma 4's native tool-call wire format |
+| `~/bin/DECISIONS.md` | Global decision log | Three new 2026-04-20 entries: MoE-preferred, 3090 Ti primary, V100 degraded |
+
+### Key Patterns Discovered
+
+- **MoE vs dense is a latency cliff, not a smooth curve.** `gemma4:26b` (MoE, ~4B active) decodes ~4.7× faster than `gemma4:31b` (dense, 31.3B active) on every GPU tested, because memory bandwidth is the binding constraint and the active-parameter bill is what you pay for per token. Total parameter count doesn't predict latency.
+- **Ollama's JSON↔native-token tool-call translator is faithful** on `gemma4:26b` Q4. Arms A (JSON tools via `/api/chat`) and B (raw native tokens via `/api/generate raw:true`) produced identical behavioral shapes on the 4-task mort-bakeoff suite. Good for mort-bot's confidence in its production path.
+- **Ollama's `/api/generate` strips matched stop tokens from the response.** Arm B's initial version mis-handled this by checking `done_reason == "stop"` as the "already terminated" branch; the correct logic is to always re-append the stop token based on which OPEN token (`<|tool_call>` vs `<|turn>`) is present in the completion.
+- **Jinja `message.get('content')` checks the raw string, not the strip-thinking'd version.** Storing the model's `<|channel>thought\n` prefix in an assistant message's `content` field causes the template's post-tool-response conditional to append a spurious `\n`, corrupting the next step's prompt. Safe default: leave `content=""` when the message has `tool_calls`.
+
+## Work Completed
+
+### Tasks Finished
+
+- [x] Researched "most native Gemma 4 engine" — concluded `google-deepmind/gemma` (JAX) is the canonical reference; `gemma.cpp` verified to still NOT support Gemma 4 on dev branch (main README "CPU-only inference for: Gemma 2-3, PaliGemma 2")
+- [x] Scaffolded three-arm native-bakeoff harness (ollama-json, ollama-native, jax-native) at `scripts/native-bakeoff/`
+- [x] Ran A+B sweep on `gemma4:26b` Q4 via Strix Halo host over Tailscale; debugged arm-B parser bug; concluded Ollama's JSON↔native translator is faithful
+- [x] Probed GPU inventory across steel141 (3090 Ti), pve197 CT 105 (V100), strix-halo (Strix Halo iGPU)
+- [x] Built `scripts/gpu-bakeoff/harness.py` — env-var-keyed hosts, warmup + 3 runs, canonical timing extraction
+- [x] Ran the bakeoff; discovered V100 was 95% CPU-bound due to SDXL occupying ~31 GB of its VRAM
+- [x] Wrote `docs/reference/gpu-bakeoff-2026-04-20.md` with V100 column initially included, then removed at Seth's direction
+- [x] Scrubbed PII/IPs from the doc and harness: host alias `matt-strix` → `strix-halo`, URLs moved to env vars, `runs/` dir renamed, JSONs patched
+- [x] Updated `~/bin/DECISIONS.md` with three 2026-04-20 entries
+- [x] Added feedback memory for the PII-scrub preference
+- [x] Updated `README.md` index entry for the new bakeoff doc
+
+### Files Modified
+
+| File | Changes | Rationale |
+|------|---------|-----------|
+| `docs/reference/gpu-bakeoff-2026-04-20.md` | Created (final: 3090 Ti vs Strix Halo) | Session's primary artifact |
+| `scripts/gpu-bakeoff/` | New dir — harness + runs | Bakeoff infrastructure |
+| `scripts/native-bakeoff/` | New dir — three-arm harness, parked | Earlier research thread, parked but shippable |
+| `README.md` | One new row in the file index | Discoverability for the new doc |
+| `~/bin/DECISIONS.md` | Three new 2026-04-20 entries | MoE preference, 3090 Ti primacy, V100-SDXL contention |
+| `~/.claude/projects/-home-claude-bin-gemma4-research/memory/feedback_scrub_pii_before_publish.md` | New memory entry | Seth's preference for scrubbing artifacts before sharing |
+| `~/.claude/projects/-home-claude-bin-gemma4-research/memory/MEMORY.md` | Index entry added | Link to the new memory |
+
+### Decisions Made
+
+| Decision | Options Considered | Rationale |
+|----------|-------------------|-----------|
+| Pivot from native-bakeoff to GPU-bakeoff mid-session | Complete native-bakeoff first; park and come back | Seth explicitly pivoted ("What I really want is..."); native-bakeoff was already functionally answered (A ≡ B) |
+| Remove V100 from GPU-bakeoff doc entirely rather than keep with caveat | Keep with prominent ⚠ badge; drop the column | Seth directed "remove v100 from doc"; degraded data with caveat pollutes the narrative |
+| Env-var-ize host URLs in harness source rather than config file | .env file; hard-coded with placeholders; CLI-only | Lightest change that accomplishes scrub; localhost default keeps steel141 path usable out of the box |
+| Start GPU bakeoff on E4B, not 26B, for the native-bakeoff thread | Go straight to 26B (production model) | Actually reversed to 26B mid-session when strix-halo (Matt's host) was found reachable with `gemma4:26b` already pulled — production-shape became the shipped path |
+| Don't rewrite git history to remove IPs from earlier commits | Force-push a cleaned history | Destructive; Seth's "remove IP/PI" was scoped to current artifacts, not a history scrub. Flagged the tradeoff and did not act |
+| Chain this handoff to the previous OpenWebUI one chronologically even though topically unrelated | Link as "continues from"; mark "supersedes"; no chain | Session-handoff skill's chain field is chronological per doc conventions; the narrative separation is called out in the body |
+
+## Pending Work
+
+### Immediate Next Steps
+
+1. **(Optional) Isolated V100 re-run.** Stop CT 167 (ai-visualizer / SDXL) on pve197, then `OLLAMA_PVE197_URL=http://:11434 python3 scripts/gpu-bakeoff/harness.py --host pve197`. Expected result: V100 lands between 3090 Ti and Strix Halo based on HBM2 ~900 GB/s spec. Add a V100 column back to the doc with isolated numbers. Judgment call — worth the ai-visualizer interruption?
+2. **(Optional) Strix max-model-fit follow-up.** Strix can host models neither the 3090 Ti nor V100 can. Pull a larger model (gemma4:26b-a4b-it-q8_0 at 28 GB, or something 40B+) on the Strix Halo host; re-run harness to characterize the bandwidth/capacity ceiling for that architecture.
+3. **(Optional) Close the native-bakeoff thread with arm C.** Set up a JAX env on steel141 or in a vast-h100 session, pip install `gemma`, run the JAX ToolSampler arm against the same mort-bakeoff task suite. If arm C matches arms A/B, that's definitive "Ollama's runtime is faithful to the DeepMind reference." If it diverges, the GGUF quantization / llama.cpp runtime is the variable to investigate.
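The expectation in step 1 rests on a bandwidth-bound view of decode: each generated token has to stream the active weights from memory once, so memory bandwidth divided by active weight bytes gives a rough upper bound on tokens/s. Below is a small sketch of that arithmetic under this simplified model, which ignores KV-cache traffic, compute limits, and runtime overhead. `decode_ceiling_tok_s` is a hypothetical helper, and the only inputs taken from this handoff are the ~900 GB/s HBM2 figure and the 19.9 GB size of the dense `gemma4:31b` Q4_K_M weights.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Rough upper bound on decode tokens/s for a bandwidth-bound model.

    Assumes every generated token reads all active weights exactly once
    and that nothing else limits throughput (a deliberate simplification).
    """
    if active_weights_gb <= 0:
        raise ValueError("active_weights_gb must be positive")
    return bandwidth_gb_s / active_weights_gb


# V100 HBM2 spec (~900 GB/s) against the dense 31B Q4_K_M weights (19.9 GB).
v100_dense_ceiling = decode_ceiling_tok_s(900.0, 19.9)
print(f"V100 dense-31B ceiling: {v100_dense_ceiling:.1f} tok/s")
```

Plugging in each host's spec-sheet bandwidth gives a predicted ordering that an isolated V100 run could be checked against; measured rates should land below these ceilings, and a result far below them (as with the contended V100) is a hint that something other than bandwidth is the bottleneck.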
+
+### Blockers/Open Questions
+
+- **Does `gemma4:31b-it-q4_K_M` on the V100 still deserve its 2026-04-07 "primary model on V100" designation?** The new 2026-04-20 decision noting 26B-MoE preference doesn't formally supersede it — they coexist on a speed vs quality axis that wasn't measured here. If a future session cares, a quality bakeoff (same tasks, qualitatively scored outputs) would resolve it.
+- **Quantization sensitivity unmeasured.** All bakeoff numbers are Q4_K_M. Q8 vs Q4 throughput ratio on the same model (especially on Strix where more headroom is available) is an open question that came up in the "open questions" section of the doc.
+
+### Deferred Items
+
+- **Native-bakeoff arm C** — env setup cost, not landing in this session.
+- **Git history scrub** — would require force-push; Seth's scrub request was interpreted as "current artifacts only" and he was informed of the tradeoff.
+- **DECISIONS.md per-project local** — considered creating a project-local decision log for the bakeoff findings but instead promoted them to the global log (`~/bin/DECISIONS.md`) since the hardware/model implications are cross-project.
+
+## Context for Resuming Agent
+
+### Important Context
+
+- **The V100 caveat is in git history (commit b619035) but not the final doc.** If someone greps the repo for "V100" and expects to find it in the current head, they won't — the final commit `22af597` removed it deliberately.
+- **Host aliases were scrubbed this session.** `matt-strix` was renamed to `strix-halo` in the repo; the SSH alias in `~/.ssh/config` and `~/bin/CLAUDE.md` still uses the original name. Don't "reconcile" those by renaming the alias locally — Seth uses it as-is outside the published repo.
+- **Harness requires env vars for non-local hosts now.** Running `scripts/gpu-bakeoff/harness.py --host strix-halo` without `OLLAMA_STRIX_URL` set will error out with a clear message. Set it from the SSH alias / Tailscale IP as needed.
+- **The scrubbed URL constants are NOT in this repo.** If the next session needs to re-run the bakeoff against the original hosts, pull them from `~/bin/CLAUDE.md` (SSH aliases → tailscale/LAN IPs) or probe via `ssh strix-halo hostname -I` / equivalent.
+- **gemma4:latest on steel141 is the E4B-it variant (8 GB), NOT the MoE 26B.** Confirmed during smoke-testing. Other hosts may resolve `gemma4:latest` differently.
+- **Push-on-commit is the convention** for this repo (`~/bin/CLAUDE.md` Gitea section). All four commits this session were pushed immediately.
+
+### Assumptions Made
+
+- The V100 was degraded "because of SDXL" based on `/api/ps` showing `size_vram: 1.57 GB` of a 30.5 GB model + `nvidia-smi` showing 31.7/32.7 GB used by other processes. **Not independently verified** by stopping SDXL and re-running; that's the open follow-up. If SDXL wasn't actually the culprit (e.g., Ollama version bug on that host), the finding needs revisiting.
+- matt-strix's `gemma4:31b` tag and steel141's `gemma4:31b-it-q4_K_M` tag are the same weights (both Q4_K_M, both 19.9 GB, both 31.3 B params). Verified via `/api/tags` metadata; not by hash comparison.
+- Ollama's `/api/generate` canonical timing fields (`prompt_eval_duration`, `eval_duration`, etc.) are trustworthy for throughput measurements. Supported by their deterministic behavior across runs; not compared against external profiling.
+
+### Potential Gotchas
+
+- **`keep_alive: 10m` in the harness keeps models resident.** Running the full matrix against a host with limited VRAM can leave the model loaded after the harness exits; subsequent unrelated Ollama users may see degraded performance until `keep_alive` expires or another model evicts it.
+- **The V100 runs are gone from `scripts/gpu-bakeoff/runs/`** (commit `22af597`). Git history has them at `b619035^`. Don't write new code expecting `runs/pve197/` to exist locally.
+- **The native-bakeoff `content=""` fix is subtle.** If someone "improves" arm B to preserve the model's pre-tool-call thinking text as assistant content, they'll regress the turn-termination bug. Module-level comment in `scripts/native-bakeoff/arms/ollama_native.py` calls this out but is easy to miss.
+- **gemma.cpp status as of 2026-04-20:** dev branch README still says Gemma 2/3 + PaliGemma 2 only. Don't suggest gemma.cpp as a Gemma 4 option without re-checking.
+- **Arm B's raw_completion_tail/prompt_tail/prompt_head trace fields** were added during debugging and left in place. They make the trace JSONs larger than strictly necessary; ok to remove if cleanliness matters, but don't delete the fix they were added to diagnose.
+
+## Environment State
+
+### Tools/Services Used
+
+- Local Ollama on steel141 (127.0.0.1:11434) — version and model list as of session
+- Remote Ollama on strix-halo (via Tailscale) — version 0.21.0, models: `gemma4:26b`, `gemma4:31b`
+- Remote Ollama on pve197 CT 105 — models include the Q8 MoE `gemma4:26b-a4b-it-q8_0` that only fits V100
+- Git / Gitea at `git.sethpc.xyz/Seth/gemma4-research`
+- Python 3 with `aiohttp`, `jinja2`, `urllib.request` (stdlib only for gpu-bakeoff)
+
+### Active Processes
+
+- None started or left running by this session. The `keep_alive: 10m` in harness.py may still be holding models resident briefly post-session; they'll drop when the TTL expires.
+
+### Environment Variables
+
+- `OLLAMA_STEEL141_URL` — default `http://127.0.0.1:11434` if unset
+- `OLLAMA_PVE197_URL` — no default; required if `--host pve197`
+- `OLLAMA_STRIX_URL` — no default; required if `--host strix-halo`
+- Optionally `OLLAMA_URL` for any one-off calls to a different host, though harness doesn't read this
+
+(No values are embedded in source; none logged here per handoff security policy.)
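The env-var contract listed above could be implemented along these lines. This is a sketch under assumptions, not the code in `scripts/gpu-bakeoff/harness.py`: the `HOSTS` layout, the `resolve_url` name, and the error text are illustrative. Only the alias names, the variable names, and the localhost default for steel141 come from this handoff.

```python
import os

# alias -> (environment variable, default URL or None when the variable is required)
# The tuple layout is an assumption made for this sketch.
HOSTS = {
    "steel141": ("OLLAMA_STEEL141_URL", "http://127.0.0.1:11434"),
    "pve197": ("OLLAMA_PVE197_URL", None),
    "strix-halo": ("OLLAMA_STRIX_URL", None),
}


def resolve_url(alias: str, env=None) -> str:
    """Return the Ollama base URL for a host alias, or raise with a clear message."""
    env = os.environ if env is None else env
    if alias not in HOSTS:
        raise KeyError(f"unknown host alias: {alias!r}")
    var, default = HOSTS[alias]
    url = env.get(var) or default
    if not url:
        raise RuntimeError(f"{var} must be set to use --host {alias}")
    return url.rstrip("/")
```

Passing `env` explicitly keeps the function testable without touching the real process environment, and keeping the URLs out of the table itself is what lets the source stay free of host addresses.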
+
+## Related Resources
+
+- [docs/reference/gpu-bakeoff-2026-04-20.md](../../docs/reference/gpu-bakeoff-2026-04-20.md) — the session's primary artifact
+- [scripts/gpu-bakeoff/](../../scripts/gpu-bakeoff/) — harness + raw traces
+- [scripts/native-bakeoff/](../../scripts/native-bakeoff/) — parked research thread, functional A+B arms
+- [tooling/huggingface/model-cards/gemma-4-E4B-it-chat_template.jinja](../../tooling/huggingface/model-cards/gemma-4-E4B-it-chat_template.jinja) — authoritative Gemma 4 chat template, rendered by arm B of native-bakeoff
+- [~/bin/DECISIONS.md](/home/claude/bin/DECISIONS.md) — three new 2026-04-20 entries relating to this session
+- [MEMORY index](/home/claude/.claude/projects/-home-claude-bin-gemma4-research/memory/MEMORY.md) — updated with PII-scrub feedback
+- Previous handoff: [2026-04-18-233832-openwebui-setup-doc.md](./2026-04-18-233832-openwebui-setup-doc.md) — chronological predecessor, topically unrelated
+- Gitea commits this session: `df5542f`, `b619035`, `22af597`, `91842f3`
+
+---
+
+**Security Reminder**: Before finalizing, run `validate_handoff.py` to check for accidental secret exposure.