docs: scrub PII/IPs from gpu-bakeoff

- Rename host alias matt-strix -> strix-halo (removes third-party name)
- Move host URLs to env-var lookup (OLLAMA_*_URL), drop hardcoded IPs from harness source. Defaults: steel141 keeps localhost; pve197 and strix-halo require their env var to be set before use.
- Update doc: remove the Tailscale IP and LAN-IP references, describe access paths without specific addresses.
- Rename runs/matt-strix -> runs/strix-halo and patch the host field in each JSON.

Harness still functional for the original author (set the env vars) and safe to share without leaking routable addresses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
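
Since pve197 and strix-halo now have no default URL, a preflight check before running the full matrix may save a failed run. The following is a minimal sketch, assuming only the env-var names and the localhost default described in this commit; `HOST_URL_ENV` and `unconfigured_hosts` are hypothetical names for illustration and are not part of the harness.

```python
import os

# Env-var names and defaults as described in this commit. The dict and
# function names are hypothetical and do not come from harness.py.
HOST_URL_ENV = {
    "steel141": ("OLLAMA_STEEL141_URL", "http://127.0.0.1:11434"),
    "pve197": ("OLLAMA_PVE197_URL", None),
    "strix-halo": ("OLLAMA_STRIX_URL", None),
}


def unconfigured_hosts(env=None):
    """Return {host: env_var} for hosts with neither an env var nor a default."""
    env = os.environ if env is None else env
    return {
        host: var
        for host, (var, default) in HOST_URL_ENV.items()
        if not (env.get(var) or default)
    }


if __name__ == "__main__":
    # Report every host that would fail URL resolution in the current shell.
    for host, var in unconfigured_hosts().items():
        print(f"{host}: set {var} before running against this host")
```

Passing a plain dict as `env` keeps the check testable without touching the real environment.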
@@ -18,7 +18,7 @@ Research corpus and implementation guidance for Google Gemma 4, based on product
 | `docs/openwebui-setup.md` | How to configure Gemma 4 inside OpenWebUI — per-setting reference, two ready-to-bake Workspace Model profiles (chat + extract), and a symptom→cause troubleshooting table mapped back to GOTCHAS.md. Assumes Ollama + OpenWebUI are already running. | When setting up or debugging a Gemma 4 model in OpenWebUI, or handing the front-end config to someone else |
 | `docs/reference/bakeoff-2026-04-18.md` | CLI-coding-agent bakeoff on 3090 Ti. **Rounds 1/2 misidentified the cause; Round 3 (the correct one): `think: false` silent-stops gemma4:26b at certain multi-turn states on 32K context.** 31B and Qwen3-Coder robust to the flag. Harness at `scripts/bakeoff/` | When deciding which model to back a CLI agent with, writing a custom agent payload, or debugging a silent tool-call halt |
 | `docs/reference/mort-bakeoff-2026-04-18.md` | mort-bot-specific `think=true` vs `think=false` bakeoff on mort's actual loop shape (gemma4:26b, num_ctx=8192). **Thinking does NOT accumulate in context on Ollama 0.20.4** — strips it from serialized history. Both settings behave identically on step counts, tool counts, wall clock. Harness at `scripts/mort-bakeoff/` | When deciding mort-bot's THINK env var, or when someone claims "think=true eats context" without pinning an Ollama version |
-| `docs/reference/gpu-bakeoff-2026-04-20.md` | Cross-GPU throughput bakeoff: steel141 RTX 3090 Ti vs matt-strix (AMD Strix Halo). **3090 Ti wins decode decisively (128 tok/s on 26B MoE). Strix gets ~42% of that on ~25% of the bandwidth.** Also quantifies the MoE vs dense gap: 26B decodes ~4.7× faster than 31B on both cards. Harness at `scripts/gpu-bakeoff/` | When choosing which host to run a Gemma 4 workload on |
+| `docs/reference/gpu-bakeoff-2026-04-20.md` | Cross-GPU throughput bakeoff: steel141 RTX 3090 Ti vs strix-halo (AMD Strix Halo). **3090 Ti wins decode decisively (128 tok/s on 26B MoE). Strix gets ~42% of that on ~25% of the bandwidth.** Also quantifies the MoE vs dense gap: 26B decodes ~4.7× faster than 31B on both cards. Harness at `scripts/gpu-bakeoff/` | When choosing which host to run a Gemma 4 workload on |
 | `tooling/` | **Canonical upstream tooling** — real scripts, notebooks, model cards, and configs pulled from Google / HF / framework maintainers (147 files). Subdirs: `google-official/`, `huggingface/`, `inference-frameworks/`, `gemma-family/`, `fine-tuning/`. See `tooling/README.md` for index and findings that update the older `CORPUS_*` docs | When you need authoritative source material — model cards, chat templates, fine-tuning recipes, serving commands for vLLM / llama.cpp / MLX, or to scope a specialized sibling (ShieldGemma, EmbeddingGemma, etc.) |
 
 ## Source Projects
@@ -1,7 +1,7 @@
 # GPU Bakeoff — Gemma 4 Throughput: 3090 Ti vs Strix Halo
 
 **Date:** 2026-04-20
-**Host matrix:** steel141 (RTX 3090 Ti) · matt-strix (AMD Strix Halo iGPU)
+**Host matrix:** steel141 (RTX 3090 Ti) · strix-halo (AMD Strix Halo iGPU)
 **Models:** `gemma4:26b` (MoE Q4_K_M) · `gemma4:31b-it-q4_K_M` (dense Q4_K_M)
 **Harness:** `scripts/gpu-bakeoff/harness.py`
 **Raw data:** `scripts/gpu-bakeoff/runs/`
@@ -13,7 +13,7 @@
 | GPU | 26B (MoE) decode | 31B (dense) decode | Long-prompt prefill (26B) |
 |-----|------------------|--------------------|-----------------------|
 | **RTX 3090 Ti** (steel141) | **128 tok/s** | **27 tok/s** | **23,849 tok/s** |
-| **AMD Strix Halo iGPU** (matt-strix) | 54 tok/s (42%) | 11 tok/s (39%) | 14,326 tok/s (60%) |
+| **AMD Strix Halo iGPU** (strix-halo) | 54 tok/s (42%) | 11 tok/s (39%) | 14,326 tok/s (60%) |
 
 ### Headline findings
 
@@ -34,8 +34,8 @@
 
 | Host | GPU | VRAM | Bandwidth | Compute cap | Notes |
 |------|-----|------|-----------|-------------|-------|
-| steel141 | RTX 3090 Ti | 24 GB GDDR6X | ~1008 GB/s | 8.6 (Ampere) | Seth's workstation. Also has a GTX 1660 SUPER as aux display card — not used for inference. Ollama on 127.0.0.1:11434. |
-| matt-strix | AMD Strix Halo (Radeon 890M iGPU + XDNA 2 NPU) | Shared LPDDR5X | ~256 GB/s | — | Unified memory lets it fit models a 24 GB card can't. Ollama on 100.117.155.64:11434 via Tailscale. |
+| steel141 | RTX 3090 Ti | 24 GB GDDR6X | ~1008 GB/s | 8.6 (Ampere) | Workstation. Also has a GTX 1660 SUPER as aux display card — not used for inference. Ollama on localhost. |
+| strix-halo | AMD Strix Halo (Radeon 890M iGPU + XDNA 2 NPU) | Shared LPDDR5X | ~256 GB/s | — | Unified memory lets it fit models a 24 GB card can't. Ollama accessed via Tailscale. |
 
 ---
 
@@ -151,7 +151,7 @@ and matches or slightly exceeds proportionally.
 
 1. **Strix max-model fit.** Strix can host models that wouldn't fit the
    3090 Ti. A follow-up would pull a larger model (70 B+ quantized) on
-   matt-strix and measure the Strix-only performance ceiling.
+   strix-halo and measure the Strix-only performance ceiling.
 2. **Q8 vs Q4 on Strix.** Same model, two quantizations — quality/speed
    tradeoff characterization.
 
@@ -166,7 +166,7 @@ runs/
 ├── steel141/
 │   ├── gemma4-26b/{short,long}.json
 │   └── gemma4-31b/{short,long}.json
-└── matt-strix/
+└── strix-halo/
     ├── gemma4-26b/{short,long}.json
     └── gemma4-31b/{short,long}.json
 ```
@@ -5,7 +5,7 @@ three hosts:
 
   - steel141  : RTX 3090 Ti (24 GB GDDR6X, compute 8.6, ~1008 GB/s)
   - pve197    : Tesla V100-PCIE-32GB (32 GB HBM2, compute 7.0, ~900 GB/s)
-  - matt-strix: AMD Strix Halo iGPU (shared LPDDR5X, ~256 GB/s)
+  - strix-halo: AMD Strix Halo iGPU (shared LPDDR5X, ~256 GB/s)
 
 Per (host, model, prompt_length), runs 1 warmup + N measurement runs,
 records Ollama's canonical timing fields, and writes one JSON trace to
@@ -15,6 +15,13 @@ All three Ollama servers are polled via HTTP; no SSH required. All
 timings come from Ollama's own /api/generate response fields so wall-
 clock jitter between the harness and the server is excluded.
 
+Host URLs are resolved from environment variables so routable addresses
+don't live in source. Set these before running against non-local hosts:
+
+    OLLAMA_STEEL141_URL=http://127.0.0.1:11434
+    OLLAMA_PVE197_URL=http://<lan-ip>:11434
+    OLLAMA_STRIX_URL=http://<tailscale-ip>:11434
+
 Invocation:
     python3 harness.py --host steel141 --model gemma4:26b --prompt short
     python3 harness.py all # runs the full planned matrix
@@ -24,6 +31,7 @@ from __future__ import annotations
 
 import argparse
 import json
+import os
 import sys
 import time
 import urllib.request
@@ -31,16 +39,30 @@ from pathlib import Path
 
 
 HOSTS = {
-    "steel141": {"url": "http://127.0.0.1:11434", "gpu": "RTX 3090 Ti", "vram_gb": 24},
-    "pve197": {"url": "http://192.168.0.179:11434", "gpu": "Tesla V100-PCIE-32GB", "vram_gb": 32},
-    "matt-strix": {"url": "http://100.117.155.64:11434", "gpu": "AMD Strix Halo iGPU", "vram_gb": None},
+    "steel141": {"url_env": "OLLAMA_STEEL141_URL", "default_url": "http://127.0.0.1:11434",
+                 "gpu": "RTX 3090 Ti", "vram_gb": 24},
+    "pve197": {"url_env": "OLLAMA_PVE197_URL", "default_url": None,
+               "gpu": "Tesla V100-PCIE-32GB", "vram_gb": 32},
+    "strix-halo": {"url_env": "OLLAMA_STRIX_URL", "default_url": None,
+                   "gpu": "AMD Strix Halo iGPU", "vram_gb": None},
 }
 
-# Per-host model tag mapping. matt-strix uses gemma4:31b, the others
+
+def _host_url(host: str) -> str:
+    cfg = HOSTS[host]
+    url = os.environ.get(cfg["url_env"]) or cfg["default_url"]
+    if not url:
+        raise RuntimeError(
+            f"host {host!r} has no URL — set ${cfg['url_env']} in env"
+        )
+    return url
+
+
+# Per-host model tag mapping. strix-halo uses gemma4:31b, the others
 # use gemma4:31b-it-q4_K_M — identical weights, different tags.
 MODEL_ALIASES = {
-    "gemma4:26b": {"steel141": "gemma4:26b", "pve197": "gemma4:26b", "matt-strix": "gemma4:26b"},
-    "gemma4:31b": {"steel141": "gemma4:31b-it-q4_K_M", "pve197": "gemma4:31b-it-q4_K_M", "matt-strix": "gemma4:31b"},
+    "gemma4:26b": {"steel141": "gemma4:26b", "pve197": "gemma4:26b", "strix-halo": "gemma4:26b"},
+    "gemma4:31b": {"steel141": "gemma4:31b-it-q4_K_M", "pve197": "gemma4:31b-it-q4_K_M", "strix-halo": "gemma4:31b"},
     # V100-only edge case — only 32 GB host has headroom for the Q8 MoE.
     "gemma4:26b-q8": {"pve197": "gemma4:26b-a4b-it-q8_0"},
 }
@@ -151,7 +173,7 @@ def run_matrix(
         return {"host": host, "model_alias": model_alias, "skipped": "model not available on host"}
 
     prompt = PROMPTS[prompt_key]
-    url = host_cfg["url"]
+    url = _host_url(host)
 
     trace = {
         "host": host,
+1
-1
@@ -1,5 +1,5 @@
 {
-  "host": "matt-strix",
+  "host": "strix-halo",
   "model_alias": "gemma4:26b-q8",
   "skipped": "model not available on host"
 }
+1
-1
@@ -1,5 +1,5 @@
 {
-  "host": "matt-strix",
+  "host": "strix-halo",
   "model_alias": "gemma4:26b-q8",
   "skipped": "model not available on host"
 }
+1
-1
@@ -1,5 +1,5 @@
 {
-  "host": "matt-strix",
+  "host": "strix-halo",
   "gpu": "AMD Strix Halo iGPU",
   "vram_gb": null,
   "model_alias": "gemma4:26b",
+1
-1
@@ -1,5 +1,5 @@
 {
-  "host": "matt-strix",
+  "host": "strix-halo",
   "gpu": "AMD Strix Halo iGPU",
   "vram_gb": null,
   "model_alias": "gemma4:26b",
+1
-1
@@ -1,5 +1,5 @@
 {
-  "host": "matt-strix",
+  "host": "strix-halo",
   "gpu": "AMD Strix Halo iGPU",
   "vram_gb": null,
   "model_alias": "gemma4:31b",
+1
-1
@@ -1,5 +1,5 @@
 {
-  "host": "matt-strix",
+  "host": "strix-halo",
   "gpu": "AMD Strix Halo iGPU",
   "vram_gb": null,
   "model_alias": "gemma4:31b",
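
The directory rename and per-file `host` patch shown in the JSON hunks above could be reproduced with a short script. This is a sketch under stated assumptions, not the procedure actually used for this commit: the function name `patch_host_field` is hypothetical, every trace is assumed to be a JSON object with a top-level `host` key, and re-serializing with `indent=2` may not match the original files' whitespace exactly.

```python
import json
from pathlib import Path

OLD_HOST = "matt-strix"
NEW_HOST = "strix-halo"


def patch_host_field(runs_dir: Path, old: str = OLD_HOST, new: str = NEW_HOST) -> int:
    """Rename runs/<old> to runs/<new> and rewrite "host" in each JSON trace.

    Returns the number of JSON files whose host field was changed.
    Hypothetical helper: not part of harness.py.
    """
    old_dir = runs_dir / old
    new_dir = runs_dir / new
    # Rename only when the old directory exists and the target is free.
    if old_dir.is_dir() and not new_dir.exists():
        old_dir.rename(new_dir)
    if not new_dir.is_dir():
        return 0

    patched = 0
    for path in sorted(new_dir.rglob("*.json")):
        trace = json.loads(path.read_text(encoding="utf-8"))
        # Touch only traces that still carry the old alias.
        if isinstance(trace, dict) and trace.get("host") == old:
            trace["host"] = new
            # Assumption: 2-space indent; key order is preserved by dict ordering.
            path.write_text(
                json.dumps(trace, indent=2, ensure_ascii=False) + "\n",
                encoding="utf-8",
            )
            patched += 1
    return patched
```

Calling `patch_host_field(Path("scripts/gpu-bakeoff/runs"))` a second time returns 0, so the sketch is safe to re-run.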