docs: scrub PII/IPs from gpu-bakeoff

- Rename host alias matt-strix -> strix-halo (removes third-party name)
- Move host URLs to env-var lookup (OLLAMA_*_URL), drop hardcoded IPs
  from harness source. Defaults: steel141 keeps localhost; pve197 and
  strix-halo require their env var to be set before use.
- Update doc: remove the Tailscale IP and LAN-IP references, describe
  access paths without specific addresses.
- Rename runs/matt-strix -> runs/strix-halo and patch the host field
  in each JSON.

The harness remains functional for the original author (once the env
vars are set) and is safe to share without leaking routable addresses.
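
For anyone repeating the rename on their own copy of the run data, the rename-and-patch step in the last bullet could look roughly like the sketch below. It is illustrative only and not necessarily how the change was made here: `rename_host` is a hypothetical helper, and it assumes each trace is a JSON object with a top-level `host` key, as the traces in this commit are.

```python
import json
from pathlib import Path


def rename_host(runs_dir: Path, old: str, new: str) -> int:
    """Rename runs_dir/old to runs_dir/new and patch "host" in each JSON trace.

    Hypothetical helper. Returns the number of files whose host field changed.
    """
    src, dst = runs_dir / old, runs_dir / new
    if src.exists():
        if dst.exists():
            raise FileExistsError(f"{dst} already exists; refusing to overwrite")
        src.rename(dst)
    if not dst.is_dir():
        return 0

    patched = 0
    for path in sorted(dst.rglob("*.json")):
        trace = json.loads(path.read_text())
        # Only touch traces that still carry the old alias.
        if isinstance(trace, dict) and trace.get("host") == old:
            trace["host"] = new
            path.write_text(json.dumps(trace, indent=2) + "\n")
            patched += 1
    return patched
```

Calling `rename_host(Path("scripts/gpu-bakeoff/runs"), "matt-strix", "strix-halo")` a second time is a no-op, since no trace carries the old alias any more.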

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mortdecai
2026-04-20 05:50:52 -04:00
parent 22af59756f
commit 91842f30cb
9 changed files with 43 additions and 21 deletions
+1 -1
@@ -18,7 +18,7 @@ Research corpus and implementation guidance for Google Gemma 4, based on product
 | `docs/openwebui-setup.md` | How to configure Gemma 4 inside OpenWebUI — per-setting reference, two ready-to-bake Workspace Model profiles (chat + extract), and a symptom→cause troubleshooting table mapped back to GOTCHAS.md. Assumes Ollama + OpenWebUI are already running. | When setting up or debugging a Gemma 4 model in OpenWebUI, or handing the front-end config to someone else |
 | `docs/reference/bakeoff-2026-04-18.md` | CLI-coding-agent bakeoff on 3090 Ti. **Rounds 1/2 misidentified the cause; Round 3 (the correct one): `think: false` silent-stops gemma4:26b at certain multi-turn states on 32K context.** 31B and Qwen3-Coder robust to the flag. Harness at `scripts/bakeoff/` | When deciding which model to back a CLI agent with, writing a custom agent payload, or debugging a silent tool-call halt |
 | `docs/reference/mort-bakeoff-2026-04-18.md` | mort-bot-specific `think=true` vs `think=false` bakeoff on mort's actual loop shape (gemma4:26b, num_ctx=8192). **Thinking does NOT accumulate in context on Ollama 0.20.4** — strips it from serialized history. Both settings behave identically on step counts, tool counts, wall clock. Harness at `scripts/mort-bakeoff/` | When deciding mort-bot's THINK env var, or when someone claims "think=true eats context" without pinning an Ollama version |
-| `docs/reference/gpu-bakeoff-2026-04-20.md` | Cross-GPU throughput bakeoff: steel141 RTX 3090 Ti vs matt-strix (AMD Strix Halo). **3090 Ti wins decode decisively (128 tok/s on 26B MoE). Strix gets ~42% of that on ~25% of the bandwidth.** Also quantifies the MoE vs dense gap: 26B decodes ~4.7× faster than 31B on both cards. Harness at `scripts/gpu-bakeoff/` | When choosing which host to run a Gemma 4 workload on |
+| `docs/reference/gpu-bakeoff-2026-04-20.md` | Cross-GPU throughput bakeoff: steel141 RTX 3090 Ti vs strix-halo (AMD Strix Halo). **3090 Ti wins decode decisively (128 tok/s on 26B MoE). Strix gets ~42% of that on ~25% of the bandwidth.** Also quantifies the MoE vs dense gap: 26B decodes ~4.7× faster than 31B on both cards. Harness at `scripts/gpu-bakeoff/` | When choosing which host to run a Gemma 4 workload on |
 | `tooling/` | **Canonical upstream tooling** — real scripts, notebooks, model cards, and configs pulled from Google / HF / framework maintainers (147 files). Subdirs: `google-official/`, `huggingface/`, `inference-frameworks/`, `gemma-family/`, `fine-tuning/`. See `tooling/README.md` for index and findings that update the older `CORPUS_*` docs | When you need authoritative source material — model cards, chat templates, fine-tuning recipes, serving commands for vLLM / llama.cpp / MLX, or to scope a specialized sibling (ShieldGemma, EmbeddingGemma, etc.) |
 ## Source Projects
+6 -6
@@ -1,7 +1,7 @@
 # GPU Bakeoff — Gemma 4 Throughput: 3090 Ti vs Strix Halo
 **Date:** 2026-04-20
-**Host matrix:** steel141 (RTX 3090 Ti) · matt-strix (AMD Strix Halo iGPU)
+**Host matrix:** steel141 (RTX 3090 Ti) · strix-halo (AMD Strix Halo iGPU)
 **Models:** `gemma4:26b` (MoE Q4_K_M) · `gemma4:31b-it-q4_K_M` (dense Q4_K_M)
 **Harness:** `scripts/gpu-bakeoff/harness.py`
 **Raw data:** `scripts/gpu-bakeoff/runs/`
@@ -13,7 +13,7 @@
 | GPU | 26B (MoE) decode | 31B (dense) decode | Long-prompt prefill (26B) |
 |-----|------------------|--------------------|-----------------------|
 | **RTX 3090 Ti** (steel141) | **128 tok/s** | **27 tok/s** | **23,849 tok/s** |
-| **AMD Strix Halo iGPU** (matt-strix) | 54 tok/s (42%) | 11 tok/s (39%) | 14,326 tok/s (60%) |
+| **AMD Strix Halo iGPU** (strix-halo) | 54 tok/s (42%) | 11 tok/s (39%) | 14,326 tok/s (60%) |
 ### Headline findings
@@ -34,8 +34,8 @@
 | Host | GPU | VRAM | Bandwidth | Compute cap | Notes |
 |------|-----|------|-----------|-------------|-------|
-| steel141 | RTX 3090 Ti | 24 GB GDDR6X | ~1008 GB/s | 8.6 (Ampere) | Seth's workstation. Also has a GTX 1660 SUPER as aux display card — not used for inference. Ollama on 127.0.0.1:11434. |
-| matt-strix | AMD Strix Halo (Radeon 890M iGPU + XDNA 2 NPU) | Shared LPDDR5X | ~256 GB/s | — | Unified memory lets it fit models a 24 GB card can't. Ollama on 100.117.155.64:11434 via Tailscale. |
+| steel141 | RTX 3090 Ti | 24 GB GDDR6X | ~1008 GB/s | 8.6 (Ampere) | Workstation. Also has a GTX 1660 SUPER as aux display card — not used for inference. Ollama on localhost. |
+| strix-halo | AMD Strix Halo (Radeon 890M iGPU + XDNA 2 NPU) | Shared LPDDR5X | ~256 GB/s | — | Unified memory lets it fit models a 24 GB card can't. Ollama accessed via Tailscale. |
 ---
@@ -151,7 +151,7 @@ and matches or slightly exceeds proportionally.
 1. **Strix max-model fit.** Strix can host models that wouldn't fit the
 3090 Ti. A follow-up would pull a larger model (70 B+ quantized) on
-matt-strix and measure the Strix-only performance ceiling.
+strix-halo and measure the Strix-only performance ceiling.
 2. **Q8 vs Q4 on Strix.** Same model, two quantizations — quality/speed
 tradeoff characterization.
@@ -166,7 +166,7 @@ runs/
 ├── steel141/
 │   ├── gemma4-26b/{short,long}.json
 │   └── gemma4-31b/{short,long}.json
-└── matt-strix/
+└── strix-halo/
     ├── gemma4-26b/{short,long}.json
     └── gemma4-31b/{short,long}.json
 ```
+30 -8
@@ -5,7 +5,7 @@ three hosts:
 - steel141 : RTX 3090 Ti (24 GB GDDR6X, compute 8.6, ~1008 GB/s)
 - pve197 : Tesla V100-PCIE-32GB (32 GB HBM2, compute 7.0, ~900 GB/s)
-- matt-strix: AMD Strix Halo iGPU (shared LPDDR5X, ~256 GB/s)
+- strix-halo: AMD Strix Halo iGPU (shared LPDDR5X, ~256 GB/s)
 Per (host, model, prompt_length), runs 1 warmup + N measurement runs,
 records Ollama's canonical timing fields, and writes one JSON trace to
@@ -15,6 +15,13 @@ All three Ollama servers are polled via HTTP; no SSH required. All
 timings come from Ollama's own /api/generate response fields so wall-
 clock jitter between the harness and the server is excluded.
+Host URLs are resolved from environment variables so routable addresses
+don't live in source. Set these before running against non-local hosts:
+OLLAMA_STEEL141_URL=http://127.0.0.1:11434
+OLLAMA_PVE197_URL=http://<lan-ip>:11434
+OLLAMA_STRIX_URL=http://<tailscale-ip>:11434
 Invocation:
 python3 harness.py --host steel141 --model gemma4:26b --prompt short
 python3 harness.py all # runs the full planned matrix
@@ -24,6 +31,7 @@ from __future__ import annotations
 import argparse
 import json
+import os
 import sys
 import time
 import urllib.request
@@ -31,16 +39,30 @@ from pathlib import Path
 HOSTS = {
-    "steel141": {"url": "http://127.0.0.1:11434", "gpu": "RTX 3090 Ti", "vram_gb": 24},
-    "pve197": {"url": "http://192.168.0.179:11434", "gpu": "Tesla V100-PCIE-32GB", "vram_gb": 32},
-    "matt-strix": {"url": "http://100.117.155.64:11434", "gpu": "AMD Strix Halo iGPU", "vram_gb": None},
+    "steel141": {"url_env": "OLLAMA_STEEL141_URL", "default_url": "http://127.0.0.1:11434",
+        "gpu": "RTX 3090 Ti", "vram_gb": 24},
+    "pve197": {"url_env": "OLLAMA_PVE197_URL", "default_url": None,
+        "gpu": "Tesla V100-PCIE-32GB", "vram_gb": 32},
+    "strix-halo": {"url_env": "OLLAMA_STRIX_URL", "default_url": None,
+        "gpu": "AMD Strix Halo iGPU", "vram_gb": None},
 }
-# Per-host model tag mapping. matt-strix uses gemma4:31b, the others
+def _host_url(host: str) -> str:
+    cfg = HOSTS[host]
+    url = os.environ.get(cfg["url_env"]) or cfg["default_url"]
+    if not url:
+        raise RuntimeError(
+            f"host {host!r} has no URL — set ${cfg['url_env']} in env"
+        )
+    return url
+# Per-host model tag mapping. strix-halo uses gemma4:31b, the others
 # use gemma4:31b-it-q4_K_M — identical weights, different tags.
 MODEL_ALIASES = {
-    "gemma4:26b": {"steel141": "gemma4:26b", "pve197": "gemma4:26b", "matt-strix": "gemma4:26b"},
-    "gemma4:31b": {"steel141": "gemma4:31b-it-q4_K_M", "pve197": "gemma4:31b-it-q4_K_M", "matt-strix": "gemma4:31b"},
+    "gemma4:26b": {"steel141": "gemma4:26b", "pve197": "gemma4:26b", "strix-halo": "gemma4:26b"},
+    "gemma4:31b": {"steel141": "gemma4:31b-it-q4_K_M", "pve197": "gemma4:31b-it-q4_K_M", "strix-halo": "gemma4:31b"},
     # V100-only edge case — only 32 GB host has headroom for the Q8 MoE.
     "gemma4:26b-q8": {"pve197": "gemma4:26b-a4b-it-q8_0"},
 }
@@ -151,7 +173,7 @@ def run_matrix(
         return {"host": host, "model_alias": model_alias, "skipped": "model not available on host"}
     prompt = PROMPTS[prompt_key]
-    url = host_cfg["url"]
+    url = _host_url(host)
     trace = {
         "host": host,
@@ -1,5 +1,5 @@
 {
-  "host": "matt-strix",
+  "host": "strix-halo",
   "model_alias": "gemma4:26b-q8",
   "skipped": "model not available on host"
 }
@@ -1,5 +1,5 @@
 {
-  "host": "matt-strix",
+  "host": "strix-halo",
   "model_alias": "gemma4:26b-q8",
   "skipped": "model not available on host"
 }
@@ -1,5 +1,5 @@
 {
-  "host": "matt-strix",
+  "host": "strix-halo",
   "gpu": "AMD Strix Halo iGPU",
   "vram_gb": null,
   "model_alias": "gemma4:26b",
@@ -1,5 +1,5 @@
 {
-  "host": "matt-strix",
+  "host": "strix-halo",
   "gpu": "AMD Strix Halo iGPU",
   "vram_gb": null,
   "model_alias": "gemma4:26b",
@@ -1,5 +1,5 @@
 {
-  "host": "matt-strix",
+  "host": "strix-halo",
   "gpu": "AMD Strix Halo iGPU",
   "vram_gb": null,
   "model_alias": "gemma4:31b",
@@ -1,5 +1,5 @@
 {
-  "host": "matt-strix",
+  "host": "strix-halo",
   "gpu": "AMD Strix Halo iGPU",
   "vram_gb": null,
   "model_alias": "gemma4:31b",