feat: GPU bakeoff — 3090 Ti vs V100 vs Strix Halo

Cross-host Gemma 4 throughput comparison across three architectures.
Harness at scripts/gpu-bakeoff/; writeup at docs/reference/gpu-bakeoff-2026-04-20.md.

Key findings:
- RTX 3090 Ti wins decode decisively (128 tok/s on gemma4:26b MoE Q4, ~4.7× faster than gemma4:31b dense on the same card).
- AMD Strix Halo iGPU lands at ~42% of 3090 Ti decode on ~25% of the memory bandwidth: good SIMD utilization, especially for MoE.
- V100 numbers are DEGRADED: CT 167 ai-visualizer SDXL consumes 31/32 GB of its VRAM, forcing Gemma 4 models 95% onto CPU. Isolated V100 run requires SDXL eviction; left as follow-up.
- MoE vs dense is the dominant latency factor across all GPUs: the ~4 B active params of gemma4:26b beat the 31.3 B active params of gemma4:31b by ~4.7× to ~5.4× on every card tested.

Methodology: 1 warmup + 3 measurement runs per (host × model × prompt-length), Ollama's canonical timing fields, temp=0 greedy, num_predict=256. All three Ollama servers accessed via HTTP (Strix via Tailscale).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 05:45:26 -04:00
parent df5542f7d6
commit b6190357ba
20 changed files with 1483 additions and 0 deletions
@@ -0,0 +1,242 @@
+# GPU Bakeoff — Gemma 4 Throughput Across Three Architectures
+
+**Date:** 2026-04-20
+**Host matrix:** steel141 (RTX 3090 Ti) · pve197 CT 105 (Tesla V100) · matt-strix (AMD Strix Halo iGPU)
+**Models:** `gemma4:26b` (MoE Q4_K_M) · `gemma4:31b-it-q4_K_M` (dense Q4_K_M)
+**Harness:** `scripts/gpu-bakeoff/harness.py`
+**Raw data:** `scripts/gpu-bakeoff/runs/`
+
+---
+
+## TL;DR
+
+| GPU | 26B (MoE) decode | 31B (dense) decode | Long-prompt prefill (26B) |
+|-----|------------------|--------------------|-----------------------|
+| **RTX 3090 Ti** (steel141) | **128 tok/s** | **27 tok/s** | **23,849 tok/s** |
+| **AMD Strix Halo iGPU** (matt-strix) | 54 tok/s (42%) | 11 tok/s (39%) | 14,326 tok/s (60%) |
+| **Tesla V100** (pve197) ⚠ | 8 tok/s (6%) | 1.6 tok/s (6%) | 2,696 tok/s (11%) |
+
+> ⚠ **V100 numbers reflect degraded conditions: SDXL on CT 167 holds most
+> of the card's VRAM (31,754 of 32,768 MiB in use), forcing Ollama's
+> Gemma 4 models ~95% onto CPU.**
+> Under isolation, V100 should land between 3090 Ti and Strix based on
+> raw specs (HBM2 ~900 GB/s). See § "V100 caveat" for the evidence.
+
+### Headline findings
+
+1. **MoE changes everything.** `gemma4:26b` decodes **~4.7× to ~5.4×
+   faster** than `gemma4:31b` depending on the GPU (4.7× on the 3090 Ti,
+   5.1× on Strix Halo, 5.4× on the degraded V100), because only ~4 B of
+   its 25.8 B parameters activate per token. Total parameter counts
+   (26 B vs 31 B) don't predict latency; *active* parameters do.
+2. **3090 Ti wins decisively on decode.** Decode is bandwidth-bound, and
+   the ~1008 GB/s of consumer Ampere GDDR6X is hard to beat at this
+   price point.
+3. **Strix Halo punches above its bandwidth.** Gets 42 % of 3090 Ti
+   decode speed on only ~25 % of the memory bandwidth (~256 GB/s vs
+   ~1008 GB/s) — good SIMD utilization, especially on the MoE model.
+4. **V100 is held back by shared VRAM.** Its spec should put it closer
+   to 3090 Ti than to Strix, but coresident SDXL crowds out Ollama's
+   layer offload. The V100 column in this doc is an *as-is* reading,
+   not a *peak-capability* reading.
+
+---
+
+## Hardware inventory
+
+| Host | GPU | VRAM | Bandwidth | Compute cap | Notes |
+|------|-----|------|-----------|-------------|-------|
+| steel141 | RTX 3090 Ti | 24 GB GDDR6X | ~1008 GB/s | 8.6 (Ampere) | Seth's workstation. Also has a GTX 1660 SUPER as aux display card — not used for inference. Ollama on 127.0.0.1:11434. |
+| pve197 CT 105 | Tesla V100-PCIE-32GB | 32 GB HBM2 | ~900 GB/s | 7.0 (Volta) | LXC with GPU passthrough. Ollama on 192.168.0.179:11434. **Coresident with CT 167 ai-visualizer (SDXL) which consumes most of the VRAM.** |
+| matt-strix | AMD Strix Halo (RDNA 3.5 Radeon iGPU + XDNA 2 NPU) | Shared LPDDR5X | ~256 GB/s | n/a | Unified memory lets it fit models a 24 GB card can't. Ollama on 100.117.155.64:11434 via Tailscale. |
+
+---
+
+## Methodology
+
+- Each (host × model × prompt-length) cell:
+  - 1 warm-up call (discarded, absorbs model load time and JIT warm-up)
+  - 3 measurement calls
+  - `temperature: 0.0`, `top_k: 1` (greedy), `num_predict: 256`, `num_ctx: 4096`
+  - `keep_alive: 10m` so the model stays resident between runs
+- Two prompt lengths:
+  - **short** (~15 tokens): isolates decode performance, since prefill time is negligible
+  - **long** (~500 tokens): stresses prefill (prompt evaluation)
+- All timings come from Ollama's own `/api/generate` response fields
+  (`prompt_eval_count`, `prompt_eval_duration`, `eval_count`,
+  `eval_duration`), which are measured server-side, so HTTP latency and
+  client wall-clock jitter are excluded from the rates.
+- Rates are computed as `prompt_eval_count / prompt_eval_duration`
+  (prefill) and `eval_count / eval_duration` (decode); Ollama reports
+  durations in nanoseconds.
+- Median of the 3 measurement runs is reported in tables; min/max are in
+  the raw JSON.
+- All three hosts were reached over their HTTP Ollama endpoints
+  (matt-strix via Tailscale).
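As a rough illustration of the measurement loop described above, the sketch below builds one non-streaming `/api/generate` request with the listed options and converts Ollama's reported counts and nanosecond durations into rates. The function names (`build_payload`, `rates_from_response`, `generate`, `measure_cell`) are hypothetical and are not taken from `harness.py`; only the request and response field names come from Ollama's API.

```python
import json
import statistics
import urllib.request

NS_PER_S = 1_000_000_000  # Ollama reports all durations in nanoseconds


def build_payload(model: str, prompt: str) -> dict:
    """Request body mirroring the settings listed in the methodology."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,      # one JSON object carrying the final timing fields
        "keep_alive": "10m",  # keep the model resident between runs
        "options": {
            "temperature": 0.0,
            "top_k": 1,
            "num_predict": 256,
            "num_ctx": 4096,
        },
    }


def rates_from_response(resp: dict) -> dict:
    """Convert server-side counts and durations into tokens per second."""
    return {
        "prefill_tok_s": resp["prompt_eval_count"] * NS_PER_S / resp["prompt_eval_duration"],
        "decode_tok_s": resp["eval_count"] * NS_PER_S / resp["eval_duration"],
    }


def generate(base_url: str, payload: dict) -> dict:
    """POST one request to /api/generate and return the parsed JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as reply:
        return json.load(reply)


def measure_cell(base_url: str, model: str, prompt: str, runs: int = 3) -> dict:
    """One discarded warm-up call, then the median rates over `runs` calls."""
    payload = build_payload(model, prompt)
    generate(base_url, payload)  # warm-up, result discarded
    samples = [rates_from_response(generate(base_url, payload)) for _ in range(runs)]
    return {
        key: statistics.median(s[key] for s in samples)
        for key in ("prefill_tok_s", "decode_tok_s")
    }
```

A call such as `measure_cell("http://127.0.0.1:11434", "gemma4:26b", prompt)` would then yield one table cell; because the rates use only server-reported fields, the network path to each host does not enter the result.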
+
+---
+
+## Full results
+
+### Decode rate (tok/s, median of 3 runs)
+
+Decode is the metric that matters most for interactive LLM use — it's
+the speed of token generation after the prompt has been processed.
+
+| Model | 3090 Ti | V100 ⚠ | Strix Halo |
+|-------|---------|-------|------------|
+| gemma4:26b (MoE, ~4 B active) | **128.20** | 8.34 | 53.86 |
+| gemma4:31b (dense, 31.3 B active) | **27.15** | 1.55 | 10.64 |
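The relative figures quoted throughout this writeup can be re-derived from the medians in the table above. The sketch below is only a convenience for checking them; the dictionary layout and function names are illustrative and not part of the harness.

```python
# Median decode rates (tok/s) copied from the table above.
DECODE = {
    "3090 Ti": {"moe_26b": 128.20, "dense_31b": 27.15},
    "V100 (degraded)": {"moe_26b": 8.34, "dense_31b": 1.55},
    "Strix Halo": {"moe_26b": 53.86, "dense_31b": 10.64},
}


def moe_speedup(rates: dict) -> float:
    """How many times faster the MoE model decodes than the dense one."""
    return rates["moe_26b"] / rates["dense_31b"]


def fraction_of(gpu: str, baseline: str, model: str) -> float:
    """Decode rate of `gpu` as a fraction of `baseline` for one model."""
    return DECODE[gpu][model] / DECODE[baseline][model]


if __name__ == "__main__":
    for gpu, rates in DECODE.items():
        print(f"{gpu}: MoE decodes {moe_speedup(rates):.2f}x faster than dense")
    for model in ("moe_26b", "dense_31b"):
        share = fraction_of("Strix Halo", "3090 Ti", model)
        print(f"Strix Halo vs 3090 Ti on {model}: {share:.0%}")
```

With these medians the MoE-over-dense speedup works out to about 4.7× on the 3090 Ti, 5.1× on Strix Halo and 5.4× on the degraded V100.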
+
+### Prefill rate (tok/s, long ~500-token prompt, median)
+
+Prefill is the cost of ingesting the prompt and populating the KV cache
+before decode begins. Prompt tokens are processed in batches, so on a
+short prompt the rate is dominated by fixed overhead and is noisy (see
+the short-prompt table below); the long-prompt numbers here are the ones
+to reason from.
+
+| Model | 3090 Ti | V100 ⚠ | Strix Halo |
+|-------|---------|-------|------------|
+| gemma4:26b (long) | **23,849** | 2,696 | 14,326 |
+| gemma4:31b (long) | **7,716** | 436 | 3,278 |
+
+### Short-prompt prefill (for reference)
+
+On a 15-token prompt, prefill tokens/sec is not a meaningful rate: the
+prompt is too small to amortize fixed overhead. Included for reference
+only.
+
+| Model | 3090 Ti | V100 ⚠ | Strix Halo |
+|-------|---------|-------|------------|
+| gemma4:26b (short) | 2,063 | 240 | 1,276 |
+| gemma4:31b (short) | 661 | 41 | 292 |
+
+---
+
+## V100 caveat — why the numbers are degraded
+
+Mid-bakeoff I probed `GET /api/ps` on pve197 while the V100's Q8 MoE was
+loaded:
+
+```
+gemma4:26b-a4b-it-q8_0   size: 30.5 GB   size_vram: 1.57 GB
+```
+
+**Only 1.57 GB of the 30.5 GB model is actually resident on the V100;**
+the other 28.9 GB is running on CPU via Ollama's CPU-offload fallback.
+`nvidia-smi` corroborated: 31,754 / 32,768 MiB used, 0 % utilization
+at probe time. Almost all of that used VRAM (~29 GB) belongs not to
+Ollama but to the SDXL pipeline on CT 167 (claude-avatar + ai-visualizer).
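The probe above can be repeated programmatically. The following is a minimal sketch that assumes only the `size` and `size_vram` byte counts returned for each entry in `GET /api/ps`; the helper names are hypothetical.

```python
import json
import urllib.request


def cpu_offload_fraction(size: int, size_vram: int) -> float:
    """Share of a loaded model that is NOT resident in VRAM (0.0 to 1.0)."""
    if size <= 0:
        raise ValueError("size must be positive")
    return 1.0 - size_vram / size


def loaded_models(base_url: str) -> list:
    """Return the `models` list from Ollama's GET /api/ps."""
    with urllib.request.urlopen(f"{base_url}/api/ps") as reply:
        return json.load(reply).get("models", [])


def report(base_url: str) -> None:
    """Print how much of each loaded model sits outside VRAM."""
    for m in loaded_models(base_url):
        frac = cpu_offload_fraction(m["size"], m["size_vram"])
        print(f'{m["name"]}: {frac:.0%} off-GPU')
```

Plugging in the probed values (30.5 GB total, 1.57 GB in VRAM) gives roughly 95 % of the model off-GPU, which is where the "95% onto CPU" figure comes from.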
+
+Impact on every V100 number in this doc:
+- `gemma4:26b` Q4_K_M is 18 GB — doesn't fit in the ~1 GB of headroom
+  SDXL leaves, so it runs largely on CPU. Observed 8.3 tok/s is
+  consistent with CPU inference of a MoE 26B Q4 model.
+- `gemma4:31b` Q4_K_M is 19.9 GB — same fate. Observed 1.55 tok/s is
+  consistent with dense 31B on CPU (dense kills you on CPU; only
+  ~4 B params activate on the MoE, so the MoE suffers less).
+- The Q8 variant (28 GB) never had a chance on the V100 while SDXL is
+  loaded. Bakeoff did not attempt it.
+
+**To get isolated V100 numbers**, stop SDXL on CT 167 (or stop CT 167
+entirely) and re-run `scripts/gpu-bakeoff/harness.py --host pve197`.
+Left as a follow-up — whether that's worth the ai-visualizer
+interruption is a judgment call. See "Open questions" below.
+
+---
+
+## Why 26B decodes 4.7× faster than 31B
+
+`gemma4:26b` is the MoE variant ("A4B" in Google's naming = *activated
+4B*). Per-token inference routes through only ~4 B of its 25.8 B total
+parameters. `gemma4:31b` is dense: every one of its 31.3 B parameters
+participates in every token's forward pass. Memory bandwidth is the
+binding constraint for decode, so the ratio of *active* params is what
+you actually pay for.
+
+Rough math (3090 Ti, 1008 GB/s, Q4_K_M ≈ 0.5 bytes/param):
+- 26B MoE: 4 B params × 0.5 bytes = 2 GB read per token. Theoretical
+  max ≈ 1008 / 2 = 504 tok/s. Observed 128 tok/s = **25 % utilization**.
+- 31B dense: 31.3 B params × 0.5 bytes = 15.65 GB read per token.
+  Theoretical max ≈ 1008 / 15.65 ≈ 64 tok/s. Observed 27 tok/s =
+  **42 % utilization**.
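The same back-of-envelope estimate can be written as a small helper. This is a sketch under the simplifying assumptions used above (every active weight is read once per token at 0.5 bytes/param, with KV-cache traffic ignored); the function names are illustrative.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, active_params_b: float,
                         bytes_per_param: float = 0.5) -> float:
    """Bandwidth-bound upper limit on decode rate, in tokens per second.

    bandwidth_gb_s:  memory bandwidth in GB/s
    active_params_b: parameters touched per token, in billions
    bytes_per_param: average storage per weight (assumed 0.5 for Q4)
    """
    gb_read_per_token = active_params_b * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


def utilization(observed_tok_s: float, ceiling_tok_s: float) -> float:
    """Fraction of the bandwidth-bound ceiling actually achieved."""
    return observed_tok_s / ceiling_tok_s


if __name__ == "__main__":
    for label, active_b, observed in (("26B MoE", 4.0, 128.20),
                                      ("31B dense", 31.3, 27.15)):
        ceiling = decode_ceiling_tok_s(1008.0, active_b)
        print(f"{label}: ceiling {ceiling:.0f} tok/s, "
              f"utilization {utilization(observed, ceiling):.0%}")
```

Since Q4_K_M stores slightly more than 4 bits per weight on average and the KV cache is also read on every token, the real ceilings are somewhat lower and the true utilization somewhat higher than these figures suggest.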
+
+So dense workloads actually extract *higher* bandwidth utilization —
+they're less overhead-dominated per token. But in absolute terms, MoE
+wins by a large factor because the active-parameter bill is much
+smaller. For interactive chat this is decisive: Seth's `mort-bot`
+running `gemma4:26b` gets ~4.7× the responsiveness it would on
+`gemma4:31b`, even though the models are near-equal in total params.
+
+Why the ratio is similar on every GPU: **memory bandwidth is the
+bottleneck** across all three cards. Strix gets 42 % of 3090 Ti on 26B
+and 39 % of 3090 Ti on 31B, nearly the same fraction for both models,
+despite having only ~25 % of the bandwidth. In other words, Strix
+extracts a larger share of its available bandwidth than the 3090 Ti does.
+
+---
+
+## When to use which GPU
+
+**Interactive chat / agent workloads (decode-heavy).**
+  - Primary: **3090 Ti** — by a wide margin. 128 tok/s on 26B is
+    comfortable for real-time responses.
+  - Fallback: **Strix Halo**, at a usable 54 tok/s. The benefit is that
+    unified memory can host larger models that the 24 GB 3090 Ti can't.
+  - Avoid: V100 *while SDXL is coresident.* Without SDXL it should be
+    competitive.
+
+**Long-context / prompt-heavy workloads (prefill-heavy).**
+  - Primary: **3090 Ti** again — 23,849 tok/s prefill means a
+    500-token prompt ingests in ~21 ms.
+  - Strix at 14,326 tok/s is ~35 ms — still interactive.
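The latency figures above follow directly from the measured rates. A small sketch of the arithmetic, with illustrative function names, also shows how prefill and decode combine into an end-to-end estimate for one response:

```python
def prefill_latency_ms(prompt_tokens: int, prefill_tok_s: float) -> float:
    """Time to ingest the prompt, in milliseconds."""
    return prompt_tokens / prefill_tok_s * 1000.0


def response_time_s(prompt_tokens: int, new_tokens: int,
                    prefill_tok_s: float, decode_tok_s: float) -> float:
    """Prefill time plus decode time for one response, in seconds."""
    return prompt_tokens / prefill_tok_s + new_tokens / decode_tok_s


if __name__ == "__main__":
    # gemma4:26b medians from the results tables: (prefill, decode) in tok/s.
    for gpu, (prefill, decode) in {"3090 Ti": (23849.0, 128.20),
                                   "Strix Halo": (14326.0, 53.86)}.items():
        print(f"{gpu}: prefill {prefill_latency_ms(500, prefill):.0f} ms, "
              f"500-in/256-out total {response_time_s(500, 256, prefill, decode):.2f} s")
```

For a 500-token prompt and 256 generated tokens, decode time dominates on both machines, so the prefill gap matters mainly for much longer prompts.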
+
+**Running models that don't fit elsewhere.**
+  - Strix Halo. Unified LPDDR5X can hold 80 GB+ models that 24 GB and
+    32 GB discrete cards can't — at the cost of lower bandwidth.
+  - The largest model tested here (`gemma4:31b` Q4 at 19.9 GB) fits
+    all three. Q8 variants (28 GB+) only fit the V100 (when it is not
+    sharing VRAM) and Strix.
+
+**Fine-tuning / training.**
+  - Not measured here. 3090 Ti's 24 GB limits batch size on 20 B+
+    models; V100's 32 GB HBM2 is much more forgiving *if* isolated.
+
+---
+
+## Open questions / follow-ups
+
+1. **Isolated V100 re-run.** Stop SDXL, re-run the harness. Expected
+   outcome: V100 decode lands between 3090 Ti and Strix (probably
+   ~70-90 tok/s on 26B given HBM2 bandwidth ~900 GB/s vs 3090 Ti's
+   ~1008 GB/s). That would settle the V100's actual rank.
+2. **V100 Q8 baseline.** `gemma4:26b-a4b-it-q8_0` (28 GB) is the Q8
+   MoE variant Seth pulled on pve197 — worth measuring once isolated.
+   Q8 vs Q4 quality/speed tradeoff for the same model would be useful.
+3. **Strix max-model fit.** Strix can probably host models that
+   wouldn't fit the discrete cards. A follow-up would pull a larger
+   model (70 B+ quantized) on matt-strix and see the Strix-only
+   performance ceiling.
+4. **Contention behavior.** The V100 finding likely generalizes:
+   whenever a coresident AI workload leaves too little VRAM for Ollama's
+   layers, Gemma 4 inference spills to CPU and throughput falls off a
+   cliff. A "contention-aware routing" decision (don't send
+   latency-sensitive Ollama traffic to a card with SDXL running) may be
+   worth building into the mort-bot / openwebui gateway.
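One possible shape for the contention-aware routing idea in item 4 is sketched below. It is purely hypothetical: nothing like it exists in the gateway today, and the `HostState` fields, the 0.95 threshold, and the 80 tok/s nominal rate used for the V100 in the usage note are assumptions for illustration. The residency fraction is meant to be `size_vram / size` as reported by `/api/ps`.

```python
from dataclasses import dataclass


@dataclass
class HostState:
    name: str
    vram_resident_fraction: float  # size_vram / size for the loaded model
    nominal_decode_tok_s: float    # decode rate expected when uncontended


def pick_host(hosts: list[HostState], min_resident: float = 0.95) -> HostState:
    """Prefer the fastest host whose model is (almost) fully in VRAM.

    Hosts whose model has spilled to CPU are skipped even if their nominal
    rate is high. If every host is contended, fall back to the full list
    rather than refusing the request. Expects a non-empty list.
    """
    healthy = [h for h in hosts if h.vram_resident_fraction >= min_resident]
    pool = healthy or hosts
    return max(pool, key=lambda h: h.nominal_decode_tok_s)
```

For example, given steel141 (fully resident, 128.2 tok/s), pve197 (0.05 resident, a guessed 80 tok/s nominal) and matt-strix (fully resident, 53.86 tok/s), the function picks steel141, and falls back to matt-strix rather than pve197 if steel141 is removed from the list.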
+
+---
+
+## Raw data
+
+All per-run JSON traces are under `scripts/gpu-bakeoff/runs/`:
+
+```
+runs/
+├── steel141/
+│   ├── gemma4-26b/{short,long}.json
+│   ├── gemma4-31b/{short,long}.json
+│   └── gemma4-26b-q8/{short,long}.json    # skipped — model not on host
+├── pve197/
+│   ├── gemma4-26b/{short,long}.json        # ⚠ degraded, see caveat
+│   └── gemma4-31b/{short,long}.json        # ⚠ degraded, see caveat
+└── matt-strix/
+    ├── gemma4-26b/{short,long}.json
+    ├── gemma4-31b/{short,long}.json
+    └── gemma4-26b-q8/{short,long}.json    # skipped — model not on host
+```
+
+Each JSON contains the warmup call and all 3 measurement calls with
+every field Ollama's `/api/generate` returns (token counts, durations,
+loaded-at, context length), plus a `summary` with min/median/max for
+prefill and decode rates.