docs: scrub PII/IPs from gpu-bakeoff

- Rename host alias matt-strix -> strix-halo (removes third-party name)
- Move host URLs to env-var lookup (OLLAMA_*_URL), drop hardcoded IPs
  from harness source. Defaults: steel141 keeps localhost; pve197 and
  strix-halo require their env var to be set before use.
- Update doc: remove the Tailscale IP and LAN-IP references, describe
  access paths without specific addresses.
- Rename runs/matt-strix -> runs/strix-halo and patch the host field in
  each JSON.

Harness still functional for the original author (set the env vars) and
safe to share without leaking routable addresses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 05:50:52 -04:00
parent 22af59756f
commit 91842f30cb
9 changed files with 43 additions and 21 deletions
@@ -18,7 +18,7 @@ Research corpus and implementation guidance for Google Gemma 4, based on product
 | `docs/openwebui-setup.md` | How to configure Gemma 4 inside OpenWebUI — per-setting reference, two ready-to-bake Workspace Model profiles (chat + extract), and a symptom→cause troubleshooting table mapped back to GOTCHAS.md. Assumes Ollama + OpenWebUI are already running. | When setting up or debugging a Gemma 4 model in OpenWebUI, or handing the front-end config to someone else |
 | `docs/reference/bakeoff-2026-04-18.md` | CLI-coding-agent bakeoff on 3090 Ti. **Rounds 1/2 misidentified the cause; Round 3 (the correct one): `think: false` silent-stops gemma4:26b at certain multi-turn states on 32K context.** 31B and Qwen3-Coder robust to the flag. Harness at `scripts/bakeoff/` | When deciding which model to back a CLI agent with, writing a custom agent payload, or debugging a silent tool-call halt |
 | `docs/reference/mort-bakeoff-2026-04-18.md` | mort-bot-specific `think=true` vs `think=false` bakeoff on mort's actual loop shape (gemma4:26b, num_ctx=8192). **Thinking does NOT accumulate in context on Ollama 0.20.4** — strips it from serialized history. Both settings behave identically on step counts, tool counts, wall clock. Harness at `scripts/mort-bakeoff/` | When deciding mort-bot's THINK env var, or when someone claims "think=true eats context" without pinning an Ollama version |
-| `docs/reference/gpu-bakeoff-2026-04-20.md` | Cross-GPU throughput bakeoff: steel141 RTX 3090 Ti vs matt-strix (AMD Strix Halo). **3090 Ti wins decode decisively (128 tok/s on 26B MoE). Strix gets ~42% of that on ~25% of the bandwidth.** Also quantifies the MoE vs dense gap: 26B decodes ~4.7× faster than 31B on both cards. Harness at `scripts/gpu-bakeoff/` | When choosing which host to run a Gemma 4 workload on |
+| `docs/reference/gpu-bakeoff-2026-04-20.md` | Cross-GPU throughput bakeoff: steel141 RTX 3090 Ti vs strix-halo (AMD Strix Halo). **3090 Ti wins decode decisively (128 tok/s on 26B MoE). Strix gets ~42% of that on ~25% of the bandwidth.** Also quantifies the MoE vs dense gap: 26B decodes ~4.7× faster than 31B on both cards. Harness at `scripts/gpu-bakeoff/` | When choosing which host to run a Gemma 4 workload on |
 | `tooling/` | **Canonical upstream tooling** — real scripts, notebooks, model cards, and configs pulled from Google / HF / framework maintainers (147 files). Subdirs: `google-official/`, `huggingface/`, `inference-frameworks/`, `gemma-family/`, `fine-tuning/`. See `tooling/README.md` for index and findings that update the older `CORPUS_*` docs | When you need authoritative source material — model cards, chat templates, fine-tuning recipes, serving commands for vLLM / llama.cpp / MLX, or to scope a specialized sibling (ShieldGemma, EmbeddingGemma, etc.) |
 ## Source Projects
@@ -1,7 +1,7 @@
 # GPU Bakeoff — Gemma 4 Throughput: 3090 Ti vs Strix Halo
 **Date:** 2026-04-20
-**Host matrix:** steel141 (RTX 3090 Ti) · matt-strix (AMD Strix Halo iGPU)
+**Host matrix:** steel141 (RTX 3090 Ti) · strix-halo (AMD Strix Halo iGPU)
 **Models:** `gemma4:26b` (MoE Q4_K_M) · `gemma4:31b-it-q4_K_M` (dense Q4_K_M)
 **Harness:** `scripts/gpu-bakeoff/harness.py`
 **Raw data:** `scripts/gpu-bakeoff/runs/`
@@ -13,7 +13,7 @@
 | GPU | 26B (MoE) decode | 31B (dense) decode | Long-prompt prefill (26B) |
 |-----|------------------|--------------------|-----------------------|
 | **RTX 3090 Ti** (steel141) | **128 tok/s** | **27 tok/s** | **23,849 tok/s** |
-| **AMD Strix Halo iGPU** (matt-strix) | 54 tok/s (42%) | 11 tok/s (39%) | 14,326 tok/s (60%) |
+| **AMD Strix Halo iGPU** (strix-halo) | 54 tok/s (42%) | 11 tok/s (39%) | 14,326 tok/s (60%) |
 ### Headline findings
@@ -34,8 +34,8 @@
 | Host | GPU | VRAM | Bandwidth | Compute cap | Notes |
 |------|-----|------|-----------|-------------|-------|
-| steel141 | RTX 3090 Ti | 24 GB GDDR6X | ~1008 GB/s | 8.6 (Ampere) | Seth's workstation. Also has a GTX 1660 SUPER as aux display card — not used for inference. Ollama on 127.0.0.1:11434. |
+| steel141 | RTX 3090 Ti | 24 GB GDDR6X | ~1008 GB/s | 8.6 (Ampere) | Workstation. Also has a GTX 1660 SUPER as aux display card — not used for inference. Ollama on localhost. |
-| matt-strix | AMD Strix Halo (Radeon 890M iGPU + XDNA 2 NPU) | Shared LPDDR5X | ~256 GB/s | — | Unified memory lets it fit models a 24 GB card can't. Ollama on 100.117.155.64:11434 via Tailscale. |
+| strix-halo | AMD Strix Halo (Radeon 890M iGPU + XDNA 2 NPU) | Shared LPDDR5X | ~256 GB/s | — | Unified memory lets it fit models a 24 GB card can't. Ollama accessed via Tailscale. |
 ---
@@ -151,7 +151,7 @@ and matches or slightly exceeds proportionally.
 1. **Strix max-model fit.** Strix can host models that wouldn't fit the
   3090 Ti. A follow-up would pull a larger model (70 B+ quantized) on
-   matt-strix and measure the Strix-only performance ceiling.
+   strix-halo and measure the Strix-only performance ceiling.
 2. **Q8 vs Q4 on Strix.** Same model, two quantizations — quality/speed
   tradeoff characterization.
@@ -166,7 +166,7 @@ runs/
 ├── steel141/
 │   ├── gemma4-26b/{short,long}.json
 │   └── gemma4-31b/{short,long}.json
-└── matt-strix/
+└── strix-halo/
    ├── gemma4-26b/{short,long}.json
    └── gemma4-31b/{short,long}.json
 ```
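The directory rename in the tree above, together with the host-field patch named in the commit message, can be sketched as follows. This is a minimal illustration, not the script the author ran: the function name `rename_host` and its arguments are hypothetical, and it assumes every trace is a JSON object with a top-level `host` key, as the JSON hunks in this commit suggest.

```python
import json
from pathlib import Path


def rename_host(runs_dir, old, new):
    """Rename runs/<old> to runs/<new> and rewrite "host" in each JSON trace.

    Hypothetical helper. Returns the number of files whose host field changed.
    """
    runs_dir = Path(runs_dir)
    src, dst = runs_dir / old, runs_dir / new
    if src.exists():
        if dst.exists():
            raise FileExistsError(f"{dst} already exists")
        src.rename(dst)
    if not dst.is_dir():
        return 0  # nothing to patch

    patched = 0
    for path in sorted(dst.rglob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        # Only touch traces that still carry the old alias.
        if isinstance(data, dict) and data.get("host") == old:
            data["host"] = new
            text = json.dumps(data, indent=2, ensure_ascii=False) + "\n"
            path.write_text(text, encoding="utf-8")
            patched += 1
    return patched
```

Running it a second time is a no-op: the source directory is gone and no trace still carries the old alias, so it returns 0.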
@@ -5,7 +5,7 @@ three hosts:
  - steel141  : RTX 3090 Ti (24 GB GDDR6X, compute 8.6, ~1008 GB/s)
  - pve197    : Tesla V100-PCIE-32GB (32 GB HBM2, compute 7.0, ~900 GB/s)
-  - matt-strix: AMD Strix Halo iGPU (shared LPDDR5X, ~256 GB/s)
+  - strix-halo: AMD Strix Halo iGPU (shared LPDDR5X, ~256 GB/s)
 Per (host, model, prompt_length), runs 1 warmup + N measurement runs,
 records Ollama's canonical timing fields, and writes one JSON trace to
@@ -15,6 +15,13 @@ All three Ollama servers are polled via HTTP; no SSH required. All
 timings come from Ollama's own /api/generate response fields so wall-
 clock jitter between the harness and the server is excluded.
+Host URLs are resolved from environment variables so routable addresses
+don't live in source. Set these before running against non-local hosts:
+    OLLAMA_STEEL141_URL=http://127.0.0.1:11434
+    OLLAMA_PVE197_URL=http://<lan-ip>:11434
+    OLLAMA_STRIX_URL=http://<tailscale-ip>:11434
 Invocation:
    python3 harness.py --host steel141 --model gemma4:26b --prompt short
    python3 harness.py all   # runs the full planned matrix
@@ -24,6 +31,7 @@ from __future__ import annotations
 import argparse
 import json
+import os
 import sys
 import time
 import urllib.request
@@ -31,16 +39,30 @@ from pathlib import Path
 HOSTS = {
-    "steel141":   {"url": "http://127.0.0.1:11434",       "gpu": "RTX 3090 Ti",           "vram_gb": 24},
-    "pve197":     {"url": "http://192.168.0.179:11434",   "gpu": "Tesla V100-PCIE-32GB",  "vram_gb": 32},
-    "matt-strix": {"url": "http://100.117.155.64:11434",  "gpu": "AMD Strix Halo iGPU",   "vram_gb": None},
+    "steel141":   {"url_env": "OLLAMA_STEEL141_URL", "default_url": "http://127.0.0.1:11434",
+                   "gpu": "RTX 3090 Ti",          "vram_gb": 24},
+    "pve197":     {"url_env": "OLLAMA_PVE197_URL",   "default_url": None,
+                   "gpu": "Tesla V100-PCIE-32GB", "vram_gb": 32},
+    "strix-halo": {"url_env": "OLLAMA_STRIX_URL",    "default_url": None,
+                   "gpu": "AMD Strix Halo iGPU",  "vram_gb": None},
 }
-# Per-host model tag mapping. matt-strix uses gemma4:31b, the others
+
+def _host_url(host: str) -> str:
+    cfg = HOSTS[host]
+    url = os.environ.get(cfg["url_env"]) or cfg["default_url"]
+    if not url:
+        raise RuntimeError(
+            f"host {host!r} has no URL — set ${cfg['url_env']} in env"
+        )
+    return url
+# Per-host model tag mapping. strix-halo uses gemma4:31b, the others
 # use gemma4:31b-it-q4_K_M — identical weights, different tags.
 MODEL_ALIASES = {
-    "gemma4:26b":  {"steel141": "gemma4:26b",            "pve197": "gemma4:26b",            "matt-strix": "gemma4:26b"},
+    "gemma4:26b":  {"steel141": "gemma4:26b",            "pve197": "gemma4:26b",            "strix-halo": "gemma4:26b"},
-    "gemma4:31b":  {"steel141": "gemma4:31b-it-q4_K_M",  "pve197": "gemma4:31b-it-q4_K_M",  "matt-strix": "gemma4:31b"},
+    "gemma4:31b":  {"steel141": "gemma4:31b-it-q4_K_M",  "pve197": "gemma4:31b-it-q4_K_M",  "strix-halo": "gemma4:31b"},
    # V100-only edge case — only 32 GB host has headroom for the Q8 MoE.
    "gemma4:26b-q8":  {"pve197": "gemma4:26b-a4b-it-q8_0"},
 }
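To make the fallback order in the hunk above concrete, here is a small self-contained sketch of the same lookup. It copies the `url_env` / `default_url` fields from the new `HOSTS` table; the `env` parameter and the `describe` helper are assumptions added here so the behaviour can be checked without touching the real process environment, and they are not part of the harness.

```python
import os

# Trimmed copy of the HOSTS table from the hunk above (URL fields only).
HOSTS = {
    "steel141": {"url_env": "OLLAMA_STEEL141_URL", "default_url": "http://127.0.0.1:11434"},
    "pve197": {"url_env": "OLLAMA_PVE197_URL", "default_url": None},
    "strix-halo": {"url_env": "OLLAMA_STRIX_URL", "default_url": None},
}


def host_url(host, env=None):
    """Resolve a host URL: env var first, then the default, else raise."""
    env = os.environ if env is None else env  # `env` is a hypothetical test hook
    cfg = HOSTS[host]
    # An unset or empty variable falls through to the default.
    url = env.get(cfg["url_env"]) or cfg["default_url"]
    if not url:
        raise RuntimeError(f"host {host!r} has no URL; set ${cfg['url_env']} in env")
    return url


def describe(env):
    """Map each host to its resolved URL, or to the error text if unresolved."""
    out = {}
    for host in HOSTS:
        try:
            out[host] = host_url(host, env)
        except RuntimeError as exc:
            out[host] = f"error: {exc}"
    return out
```

With an empty environment, `describe({})` resolves steel141 to its localhost default and reports an error for pve197 and strix-halo, which matches the defaults described in the commit message.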
@@ -151,7 +173,7 @@ def run_matrix(
        return {"host": host, "model_alias": model_alias, "skipped": "model not available on host"}
    prompt = PROMPTS[prompt_key]
-    url = host_cfg["url"]
+    url = _host_url(host)
    trace = {
        "host": host,
@@ -1,5 +1,5 @@
 {
-  "host": "matt-strix",
+  "host": "strix-halo",
  "model_alias": "gemma4:26b-q8",
  "skipped": "model not available on host"
 }
@@ -1,5 +1,5 @@
 {
-  "host": "matt-strix",
+  "host": "strix-halo",
  "model_alias": "gemma4:26b-q8",
  "skipped": "model not available on host"
 }
@@ -1,5 +1,5 @@
 {
-  "host": "matt-strix",
+  "host": "strix-halo",
  "gpu": "AMD Strix Halo iGPU",
  "vram_gb": null,
  "model_alias": "gemma4:26b",
@@ -1,5 +1,5 @@
 {
-  "host": "matt-strix",
+  "host": "strix-halo",
  "gpu": "AMD Strix Halo iGPU",
  "vram_gb": null,
  "model_alias": "gemma4:26b",
@@ -1,5 +1,5 @@
 {
-  "host": "matt-strix",
+  "host": "strix-halo",
  "gpu": "AMD Strix Halo iGPU",
  "vram_gb": null,
  "model_alias": "gemma4:31b",
@@ -1,5 +1,5 @@
 {
-  "host": "matt-strix",
+  "host": "strix-halo",
  "gpu": "AMD Strix Halo iGPU",
  "vram_gb": null,
  "model_alias": "gemma4:31b",