Closes out the session that produced docs/reference/gpu-bakeoff-2026-04-20.md and the parked scripts/native-bakeoff/ scaffold. Chains (chronologically) from the 2026-04-18 OpenWebUI setup handoff though the topic is unrelated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
17 KiB
Handoff: GPU Bakeoff — 3090 Ti vs Strix Halo (+ parked native-bakeoff scaffold)
Session Metadata
- Created: 2026-04-20 05:56:58
- Project: /home/claude/bin/gemma4-research
- Branch: master (pushed to origin)
- Session duration: ~extended session, multi-pivot (~4+ hours)
Recent Commits (for context)
91842f3docs: scrub PII/IPs from gpu-bakeoff ← latest, end of session22af597docs: remove V100 from GPU bakeoff ← V100 column droppedb619035feat: GPU bakeoff — 3090 Ti vs V100 vs Strix Halo ← initial write-up (superseded by later scrubs)df5542ffeat: native-bakeoff scaffold — Ollama JSON vs native-token tool-calling ← parked research91aaaa4docs: redact PII from persistent-correspondence findings
Handoff Chain
- Continues from: 2026-04-18-233832-openwebui-setup-doc.md
- Previous title: OpenWebUI Setup Doc for Gemma 4
- Supersedes: None
This session is not a continuation of the OpenWebUI doc work — it's a fresh research thread on the same repo. The link is chronological, not topical. Previous handoff is only relevant if debugging OpenWebUI-related Gemma 4 behavior.
Current State Summary
Session started on a native-vs-JSON tool-calling bakeoff question, pivoted to a cross-GPU throughput comparison mid-session, and shipped the latter. Final state: docs/reference/gpu-bakeoff-2026-04-20.md comparing gemma4:26b MoE and gemma4:31b dense decode/prefill rates on RTX 3090 Ti (steel141) vs AMD Strix Halo iGPU (strix-halo host). V100 data was initially gathered and included but removed when it turned out the V100 was 95% CPU-bound due to SDXL coresident on CT 167 — the published doc is a clean 2-host comparison. Native-bakeoff harness (the earlier thread) remains scaffolded and committed at scripts/native-bakeoff/ but not run further. Repo is clean, three commits pushed.
Codebase Understanding
Architecture Overview
The repo is a Gemma 4 research corpus. New this session:
scripts/native-bakeoff/— three-arm tool-calling harness (Ollama JSON tools vs Ollama raw native tokens vs google-deepmind/gemma JAX ToolSampler). Arms A and B tested and functionally equivalent ongemma4:26bQ4 against a shared task suite lifted from mort-bakeoff. Arm C is env-gated (requires JAX +gemmaPyPI package); wired but not run.scripts/gpu-bakeoff/— cross-GPU throughput harness. Takes host aliases fromHOSTSdict and resolves URLs from env vars (OLLAMA_STEEL141_URL,OLLAMA_PVE197_URL,OLLAMA_STRIX_URL). Runs 1 warmup + 3 measurement calls per (host × model × prompt-length), logs Ollama's canonical timing fields, aggregates min/median/max.docs/reference/gpu-bakeoff-2026-04-20.md— the finished writeup. 3090 Ti + Strix Halo only.
The docs/reference/ tier holds experimental findings; docs/ top-level holds applied how-to guides. Both bakeoffs landed in docs/reference/ which is correct.
Critical Files
| File | Purpose | Relevance |
|---|---|---|
docs/reference/gpu-bakeoff-2026-04-20.md |
The session's primary artifact | Read this first for the session's shipped findings |
scripts/gpu-bakeoff/harness.py |
GPU bakeoff harness, env-var-driven URL resolution | Re-run the bakeoff (e.g., for isolated V100) by setting env vars + invoking |
scripts/gpu-bakeoff/runs/**/*.json |
Raw per-call timing data | Source of truth for the doc's numbers; each JSON has warmup + 3 runs with full Ollama timing fields |
scripts/native-bakeoff/harness.py |
Parked three-arm tool-calling harness | Reference if revisiting the native-vs-JSON question; arms A and B are ready, arm C needs JAX env |
scripts/native-bakeoff/arms/ollama_native.py |
Arm B — renders the canonical HF jinja chat template directly, POSTs to /api/generate raw:true | Contains a subtle fix (keep assistant content="" when it has tool_calls) that's easy to regress |
tooling/huggingface/model-cards/gemma-4-E4B-it-chat_template.jinja |
Canonical Gemma 4 chat template, rendered by arm B | Authoritative source of Gemma 4's native tool-call wire format |
~/bin/DECISIONS.md |
Global decision log | Three new 2026-04-20 entries: MoE-preferred, 3090 Ti primary, V100 degraded |
Key Patterns Discovered
- MoE vs dense is a latency cliff, not a smooth curve.
gemma4:26b(MoE, ~4B active) decodes ~4.7× faster thangemma4:31b(dense, 31.3B active) on every GPU tested, because memory bandwidth is the binding constraint and the active-parameter bill is what you pay for per token. Total parameter count doesn't predict latency. - Ollama's JSON↔native-token tool-call translator is faithful on
gemma4:26bQ4. Arms A (JSON tools via/api/chat) and B (raw native tokens via/api/generate raw:true) produced identical behavioral shapes on the 4-task mort-bakeoff suite. Good for mort-bot's confidence in its production path. - Ollama's
/api/generatestrips matched stop tokens from the response. Arm B's initial version mis-handled this by checkingdone_reason == "stop"as the "already terminated" branch; the correct logic is to always re-append the stop token based on which OPEN token (<|tool_call>vs<|turn>) is present in the completion. - Jinja
message.get('content')checks the raw string, not the strip-thinking'd version. Storing the model's<|channel>thought\n<channel|>prefix in an assistant message'scontentfield causes the template's post-tool-response conditional to append a spurious<turn|>\n, corrupting the next step's prompt. Safe default: leavecontent=""when the message hastool_calls.
Work Completed
Tasks Finished
- Researched "most native Gemma 4 engine" — concluded
google-deepmind/gemma(JAX) is the canonical reference;gemma.cppverified to still NOT support Gemma 4 on dev branch (main README "CPU-only inference for: Gemma 2-3, PaliGemma 2") - Scaffolded three-arm native-bakeoff harness (ollama-json, ollama-native, jax-native) at
scripts/native-bakeoff/ - Ran A+B sweep on
gemma4:26bQ4 via Strix Halo host over Tailscale; debugged arm-B parser bug; concluded Ollama's JSON↔native translator is faithful - Probed GPU inventory across steel141 (3090 Ti), pve197 CT 105 (V100), strix-halo (Strix Halo iGPU)
- Built
scripts/gpu-bakeoff/harness.py— env-var-keyed hosts, warmup + 3 runs, canonical timing extraction - Ran the bakeoff; discovered V100 was 95% CPU-bound due to SDXL occupying ~31 GB of its VRAM
- Wrote
docs/reference/gpu-bakeoff-2026-04-20.mdwith V100 column initially included, then removed at Seth's direction - Scrubbed PII/IPs from the doc and harness: host alias
matt-strix→strix-halo, URLs moved to env vars,runs/dir renamed, JSONs patched - Updated
~/bin/DECISIONS.mdwith three 2026-04-20 entries - Added feedback memory for the PII-scrub preference
- Updated
README.mdindex entry for the new bakeoff doc
Files Modified
| File | Changes | Rationale |
|---|---|---|
docs/reference/gpu-bakeoff-2026-04-20.md |
Created (final: 3090 Ti vs Strix Halo) | Session's primary artifact |
scripts/gpu-bakeoff/ |
New dir — harness + runs | Bakeoff infrastructure |
scripts/native-bakeoff/ |
New dir — three-arm harness, parked | Earlier research thread, parked but shippable |
README.md |
One new row in the file index | Discoverability for the new doc |
~/bin/DECISIONS.md |
Three new 2026-04-20 entries | MoE preference, 3090 Ti primacy, V100-SDXL contention |
~/.claude/projects/-home-claude-bin-gemma4-research/memory/feedback_scrub_pii_before_publish.md |
New memory entry | Seth's preference for scrubbing artifacts before sharing |
~/.claude/projects/-home-claude-bin-gemma4-research/memory/MEMORY.md |
Index entry added | Link to the new memory |
Decisions Made
| Decision | Options Considered | Rationale |
|---|---|---|
| Pivot from native-bakeoff to GPU-bakeoff mid-session | Complete native-bakeoff first; park and come back | Seth explicitly pivoted ("What I really want is..."); native-bakeoff was already functionally answered (A ≡ B) |
| Remove V100 from GPU-bakeoff doc entirely rather than keep with caveat | Keep with prominent ⚠ badge; drop the column | Seth directed "remove v100 from doc"; degraded data with caveat pollutes the narrative |
| Env-var-ize host URLs in harness source rather than config file | .env file; hard-coded with placeholders; CLI-only | Lightest change that accomplishes scrub; localhost default keeps steel141 path usable out of the box |
| Start GPU bakeoff on E4B, not 26B, for the native-bakeoff thread | Go straight to 26B (production model) | Actually reversed to 26B mid-session when strix-halo (Matt's host) was found reachable with gemma4:26b already pulled — production-shape became the shipped path |
| Don't rewrite git history to remove IPs from earlier commits | Force-push a cleaned history | Destructive; Seth's "remove IP/PI" was scoped to current artifacts, not a history scrub. Flagged the tradeoff and did not act |
| Chain this handoff to the previous OpenWebUI one chronologically even though topically unrelated | Link as "continues from"; mark "supersedes"; no chain | Session-handoff skill's chain field is chronological per doc conventions; the narrative separation is called out in the body |
Pending Work
Immediate Next Steps
- (Optional) Isolated V100 re-run. Stop CT 167 (ai-visualizer / SDXL) on pve197, then
OLLAMA_PVE197_URL=http://<ip>:11434 python3 scripts/gpu-bakeoff/harness.py --host pve197. Expected result: V100 lands between 3090 Ti and Strix Halo based on HBM2 ~900 GB/s spec. Add a V100 column back to the doc with isolated numbers. Judgment call — worth the ai-visualizer interruption? - (Optional) Strix max-model-fit follow-up. Strix can host models neither the 3090 Ti nor V100 can. Pull a larger model (gemma4:26b-a4b-it-q8_0 at 28 GB, or something 40B+) on the Strix Halo host; re-run harness to characterize the bandwidth/capacity ceiling for that architecture.
- (Optional) Close the native-bakeoff thread with arm C. Set up a JAX env on steel141 or in a vast-h100 session, pip install
gemma, run the JAX ToolSampler arm against the same mort-bakeoff task suite. If arm C matches arms A/B, that's definitive "Ollama's runtime is faithful to the DeepMind reference." If it diverges, the GGUF quantization / llama.cpp runtime is the variable to investigate.
Blockers/Open Questions
- Does
gemma4:31b-it-q4_K_Mon the V100 still deserve its 2026-04-07 "primary model on V100" designation? The new 2026-04-20 decision noting 26B-MoE preference doesn't formally supersede it — they coexist on a speed vs quality axis that wasn't measured here. If a future session cares, a quality bakeoff (same tasks, qualitatively scored outputs) would resolve it. - Quantization sensitivity unmeasured. All bakeoff numbers are Q4_K_M. Q8 vs Q4 throughput ratio on the same model (especially on Strix where more headroom is available) is an open question that came up in the "open questions" section of the doc.
Deferred Items
- Native-bakeoff arm C — env setup cost, not landing in this session.
- Git history scrub — would require force-push; Seth's scrub request was interpreted as "current artifacts only" and he was informed of the tradeoff.
- DECISIONS.md per-project local — considered creating a project-local decision log for the bakeoff findings but instead promoted them to the global log (
~/bin/DECISIONS.md) since the hardware/model implications are cross-project.
Context for Resuming Agent
Important Context
- The V100 caveat is in git history (commit
b619035) but not the final doc. If someone greps the repo for "V100" and expects to find it in the current head, they won't — the final commit22af597removed it deliberately. - Host aliases were scrubbed this session.
matt-strixwas renamed tostrix-haloin the repo; the SSH alias in~/.ssh/configand~/bin/CLAUDE.mdstill uses the original name. Don't "reconcile" those by renaming the alias locally — Seth uses it as-is outside the published repo. - Harness requires env vars for non-local hosts now. Running
scripts/gpu-bakeoff/harness.py --host strix-halowithoutOLLAMA_STRIX_URLset will error out with a clear message. Set it from the SSH alias / Tailscale IP as needed. - The scrubbed URL constants are NOT in this repo. If the next session needs to re-run the bakeoff against the original hosts, pull them from
~/bin/CLAUDE.md(SSH aliases → tailscale/LAN IPs) or probe viassh strix-halo hostname -I/ equivalent. - gemma4:latest on steel141 is the E4B-it variant (8 GB), NOT the MoE 26B. Confirmed during smoke-testing. Other hosts may resolve
gemma4:latestdifferently. - Push-on-commit is the convention for this repo (
~/bin/CLAUDE.mdGitea section). Both commits this session were pushed immediately.
Assumptions Made
- The V100 was degraded "because of SDXL" based on
/api/psshowingsize_vram: 1.57 GBof a 30.5 GB model +nvidia-smishowing 31.7/32.7 GB used by other processes. Not independently verified by stopping SDXL and re-running; that's the open follow-up. If SDXL wasn't actually the culprit (e.g., Ollama version bug on that host), the finding needs revisiting. - matt-strix's
gemma4:31btag and steel141'sgemma4:31b-it-q4_K_Mtag are the same weights (both Q4_K_M, both 19.9 GB, both 31.3 B params). Verified via/api/tagsmetadata; not by hash comparison. - Ollama's
/api/generatecanonical timing fields (prompt_eval_duration,eval_duration, etc.) are trustworthy for throughput measurements. Supported by their deterministic behavior across runs; not compared against external profiling.
Potential Gotchas
keep_alive: 10min the harness keeps models resident. Running the full matrix against a host with limited VRAM can leave the model loaded after the harness exits; subsequent unrelated Ollama users may see degraded performance untilkeep_aliveexpires or another model evicts it.- The V100 runs are gone from
scripts/gpu-bakeoff/runs/(commit22af597). Git history has them atb619035^. Don't write new code expectingruns/pve197/to exist locally. - The native-bakeoff
content=""fix is subtle. If someone "improves" arm B to preserve the model's pre-tool-call thinking text as assistant content, they'll regress the turn-termination bug. Module-level comment inscripts/native-bakeoff/arms/ollama_native.pycalls this out but is easy to miss. - gemma.cpp status as of 2026-04-20: dev branch README still says Gemma 2/3 + PaliGemma 2 only. Don't suggest gemma.cpp as a Gemma 4 option without re-checking.
- Arm B's raw_completion_tail/prompt_tail/prompt_head trace fields were added during debugging and left in place. They make the trace JSONs larger than strictly necessary; ok to remove if cleanliness matters, but don't delete the fix they were added to diagnose.
Environment State
Tools/Services Used
- Local Ollama on steel141 (127.0.0.1:11434) — version and model list as of session
- Remote Ollama on strix-halo (via Tailscale) — version 0.21.0, models:
gemma4:26b,gemma4:31b - Remote Ollama on pve197 CT 105 — models include the Q8 MoE
gemma4:26b-a4b-it-q8_0that only fits V100 - Git / Gitea at
git.sethpc.xyz/Seth/gemma4-research - Python 3 with
aiohttp,jinja2,urllib.request(stdlib only for gpu-bakeoff)
Active Processes
- None started or left running by this session. The
keep_alive: 10min harness.py may still be holding models resident briefly post-session; they'll drop when the TTL expires.
Environment Variables
OLLAMA_STEEL141_URL— defaulthttp://127.0.0.1:11434if unsetOLLAMA_PVE197_URL— no default; required if--host pve197OLLAMA_STRIX_URL— no default; required if--host strix-halo- Optionally
OLLAMA_URLfor any one-off calls to a different host, though harness doesn't read this
(No values are embedded in source; none logged here per handoff security policy.)
Related Resources
- docs/reference/gpu-bakeoff-2026-04-20.md — the session's primary artifact
- scripts/gpu-bakeoff/ — harness + raw traces
- scripts/native-bakeoff/ — parked research thread, functional A+B arms
- tooling/huggingface/model-cards/gemma-4-E4B-it-chat_template.jinja — authoritative Gemma 4 chat template, rendered by arm B of native-bakeoff
- ~/bin/DECISIONS.md — three new 2026-04-20 entries relating to this session
- MEMORY index — updated with PII-scrub feedback
- Previous handoff: 2026-04-18-233832-openwebui-setup-doc.md — chronological predecessor, topically unrelated
- Gitea commits this session:
df5542f,b619035,22af597,91842f3
Security Reminder: Before finalizing, run validate_handoff.py to check for accidental secret exposure.