Files
gemma4-research/docs/reference/gpu-bakeoff-2026-04-20.md
T
Mortdecai b6190357ba feat: GPU bakeoff — 3090 Ti vs V100 vs Strix Halo
Cross-host Gemma 4 throughput comparison across three architectures.
Harness at scripts/gpu-bakeoff/; writeup at
docs/reference/gpu-bakeoff-2026-04-20.md.

Key findings:
- RTX 3090 Ti wins decode decisively (128 tok/s on gemma4:26b MoE Q4,
  ~4.7× faster than gemma4:31b dense on the same card).
- AMD Strix Halo iGPU lands at ~42% of 3090 Ti decode on ~25% of the
  memory bandwidth — good SIMD utilization, especially for MoE.
- V100 numbers are DEGRADED: CT 167 ai-visualizer SDXL consumes 31/32
  GB of its VRAM, forcing Gemma 4 models 95% onto CPU. Isolated V100
  run requires SDXL eviction — left as follow-up.
- MoE vs dense is the dominant latency factor across all GPUs: ~4 B
  active params of gemma4:26b beats 31.3 B active of gemma4:31b by
  the same ratio (~4.7×) on every card tested.

Methodology: 1 warmup + 3 measurement runs per (host × model ×
prompt-length), Ollama's canonical timing fields, temp=0 greedy,
num_predict=256. All three Ollama servers accessed via HTTP (Strix
via Tailscale).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 05:45:26 -04:00

243 lines
11 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# GPU Bakeoff — Gemma 4 Throughput Across Three Architectures
**Date:** 2026-04-20
**Host matrix:** steel141 (RTX 3090 Ti) · pve197 CT 105 (Tesla V100) · matt-strix (AMD Strix Halo iGPU)
**Models:** `gemma4:26b` (MoE Q4_K_M) · `gemma4:31b-it-q4_K_M` (dense Q4_K_M)
**Harness:** `scripts/gpu-bakeoff/harness.py`
**Raw data:** `scripts/gpu-bakeoff/runs/`
---
## TL;DR
| GPU | 26B (MoE) decode | 31B (dense) decode | Long-prompt prefill (26B) |
|-----|------------------|--------------------|-----------------------|
| **RTX 3090 Ti** (steel141) | **128 tok/s** | **27 tok/s** | **23,849 tok/s** |
| **AMD Strix Halo iGPU** (matt-strix) | 54 tok/s (42%) | 11 tok/s (39%) | 14,326 tok/s (60%) |
| **Tesla V100** (pve197) ⚠ | 8 tok/s (6%) | 1.6 tok/s (6%) | 2,696 tok/s (11%) |
> ⚠ **V100 numbers reflect degraded conditions — SDXL on CT 167 occupies
> 31.7 / 32.7 GB VRAM, forcing Ollama's Gemma 4 models 95% onto CPU.**
> Under isolation, V100 should land between 3090 Ti and Strix based on
> raw specs (HBM2 ~900 GB/s). See § "V100 caveat" for the evidence.
### Headline findings
1. **MoE changes everything.** `gemma4:26b` decodes **~4.7× faster** than
`gemma4:31b` on every GPU tested, because only ~4 B of its 25.8 B
parameters activate per token. Total parameter counts (26 B vs 31 B)
don't predict latency; *active* parameters do.
2. **3090 Ti wins decisively on decode.** For inference workloads the
memory-bandwidth-flop ratio of consumer Ampere GDDR6X is hard to
beat at this price point.
3. **Strix Halo punches above its bandwidth.** Gets 42 % of 3090 Ti
decode speed on only ~25 % of the memory bandwidth (~256 GB/s vs
~1008 GB/s) — good SIMD utilization, especially on the MoE model.
4. **V100 is held back by shared VRAM.** Its spec should put it closer
to 3090 Ti than to Strix, but coresident SDXL crowds out Ollama's
layer offload. The V100 column in this doc is an *as-is* reading,
not a *peak-capability* reading.
---
## Hardware inventory
| Host | GPU | VRAM | Bandwidth | Compute cap | Notes |
|------|-----|------|-----------|-------------|-------|
| steel141 | RTX 3090 Ti | 24 GB GDDR6X | ~1008 GB/s | 8.6 (Ampere) | Seth's workstation. Also has a GTX 1660 SUPER as aux display card — not used for inference. Ollama on 127.0.0.1:11434. |
| pve197 CT 105 | Tesla V100-PCIE-32GB | 32 GB HBM2 | ~900 GB/s | 7.0 (Volta) | LXC with GPU passthrough. Ollama on 192.168.0.179:11434. **Coresident with CT 167 ai-visualizer (SDXL) which consumes most of the VRAM.** |
| matt-strix | AMD Strix Halo (Radeon 890M iGPU + XDNA 2 NPU) | Shared LPDDR5X | ~256 GB/s | — | Unified memory lets it fit models a 24 GB card can't. Ollama on 100.117.155.64:11434 via Tailscale. |
---
## Methodology
- Each (host × model × prompt-length) cell:
- 1 warm-up call (discarded, absorbs model load time and JIT warm-up)
- 3 measurement calls
- `temperature: 0.0`, `top_k: 1` (greedy), `num_predict: 256`, `num_ctx: 4096`
- `keep_alive: 10m` so the model stays resident between runs
- Two prompt lengths:
- **short** (~15 tokens) — isolates decode performance, prefill time is negligible
- **long** (~500 tokens) — stresses prefill (prompt evaluation)
- All timings come from Ollama's own `/api/generate` response fields
(`prompt_eval_duration`, `eval_duration`, etc.), so HTTP and wall-clock
jitter are excluded from the rates.
- Median of the 3 measurement runs is reported in tables; min/max are in
the raw JSON.
- **No network-introduced variance** — all three hosts exposed HTTP
Ollama endpoints (matt-strix via Tailscale). The timings reported are
computed server-side from `prompt_eval_count / prompt_eval_duration`
and `eval_count / eval_duration`.
---
## Full results
### Decode rate (tok/s, median of 3 runs)
Decode is the metric that matters most for interactive LLM use — it's
the speed of token generation after the prompt has been processed.
| Model | 3090 Ti | V100 ⚠ | Strix Halo |
|-------|---------|-------|------------|
| gemma4:26b (MoE, ~4 B active) | **128.20** | 8.34 | 53.86 |
| gemma4:31b (dense, 31.3 B active) | **27.15** | 1.55 | 10.64 |
### Prefill rate (tok/s, long ~500-token prompt, median)
Prefill is the cost of ingesting the prompt and populating the KV cache
before decode begins. Batched per-token, so short-prompt prefill numbers
are noisy (dominated by fixed overhead — see raw JSON for those); the
long-prompt numbers below are the ones to reason from.
| Model | 3090 Ti | V100 ⚠ | Strix Halo |
|-------|---------|-------|------------|
| gemma4:26b (long) | **23,849** | 2,696 | 14,326 |
| gemma4:31b (long) | **7,716** | 436 | 3,278 |
### Short-prompt prefill (for reference)
On a 15-token prompt, prefill tokens/sec is meaningless — prompt is too
small to amortize overhead. Included only to confirm no regression.
| Model | 3090 Ti | V100 ⚠ | Strix Halo |
|-------|---------|-------|------------|
| gemma4:26b (short) | 2,063 | 240 | 1,276 |
| gemma4:31b (short) | 661 | 41 | 292 |
---
## V100 caveat — why the numbers are degraded
Mid-bakeoff I probed `GET /api/ps` on pve197 while the V100's Q8 MoE was
loaded:
```
gemma4:26b-a4b-it-q8_0 size: 30.5 GB size_vram: 1.57 GB
```
**Only 1.57 GB of the 30.5 GB model is actually resident on the V100;**
the other 28.9 GB is running on CPU via Ollama's CPU-offload fallback.
`nvidia-smi` corroborated: 31,754 / 32,768 MiB used, 0 % utilization
at probe time. That remaining ~29 GB of VRAM isn't free — it's held by
the SDXL pipeline on CT 167 (claude-avatar + ai-visualizer).
Impact on every V100 number in this doc:
- `gemma4:26b` Q4_K_M is 18 GB — doesn't fit in the ~1 GB of headroom
SDXL leaves, so it runs largely on CPU. Observed 8.3 tok/s is
consistent with CPU inference of a MoE 26B Q4 model.
- `gemma4:31b` Q4_K_M is 19.9 GB — same fate. Observed 1.55 tok/s is
consistent with dense 31B on CPU (dense kills you on CPU; only
~4 B params activate on the MoE, so the MoE suffers less).
- The Q8 variant (28 GB) never had a chance on the V100 while SDXL is
loaded. Bakeoff did not attempt it.
**To get isolated V100 numbers**, stop SDXL on CT 167 (or stop CT 167
entirely) and re-run `scripts/gpu-bakeoff/harness.py --host pve197`.
Left as a follow-up — whether that's worth the ai-visualizer
interruption is a judgment call. See "Open questions" below.
---
## Why 26B decodes 4.7× faster than 31B
`gemma4:26b` is the MoE variant ("A4B" in Google's naming = *activated
4B*). Per-token inference routes through only ~4 B of its 25.8 B total
parameters. `gemma4:31b` is dense: every one of its 31.3 B parameters
participates in every token's forward pass. Memory bandwidth is the
binding constraint for decode, so the ratio of *active* params is what
you actually pay for.
Rough math (3090 Ti, 1008 GB/s, Q4_K_M ≈ 0.5 bytes/param):
- 26B MoE: 4 B × 0.5 B = 2 GB per token. Theoretical max ≈ 504 tok/s.
Observed 128 tok/s = **25 % utilization**.
- 31B dense: 31.3 B × 0.5 B = 15.65 GB per token. Theoretical max ≈
64 tok/s. Observed 27 tok/s = **42 % utilization**.
So dense workloads actually extract *higher* bandwidth utilization —
they're less overhead-dominated per token. But in absolute terms, MoE
wins by a large factor because the active-parameter bill is much
smaller. For interactive chat this is decisive: Seth's `mort-bot`
running `gemma4:26b` gets ~4.7× the responsiveness it would on
`gemma4:31b`, even though the models are near-equal in total params.
Why the ratio holds on every GPU: **memory bandwidth is the bottleneck**
across all three cards. Strix gets 42 % of 3090 Ti on 26B and 39 % of
3090 Ti on 31B — identical ratios — because it has ~25 % of the
bandwidth and matches or exceeds proportionally.
---
## When to use which GPU
**Interactive chat / agent workloads (decode-heavy).**
- Primary: **3090 Ti** — by a wide margin. 128 tok/s on 26B is
comfortable for real-time responses.
- Fallback: **Strix Halo** — 54 tok/s is usable. Benefit is unified
memory can host larger models the 24 GB 3090 Ti can't.
- Avoid: V100 *while SDXL is coresident.* Without SDXL it should be
competitive.
**Long-context / prompt-heavy workloads (prefill-heavy).**
- Primary: **3090 Ti** again — 23,849 tok/s prefill means a
500-token prompt ingests in ~21 ms.
- Strix at 14,326 tok/s is ~35 ms — still interactive.
**Running models that don't fit elsewhere.**
- Strix Halo. Unified LPDDR5X can hold 80 GB+ models that 24 GB and
32 GB discrete cards can't — at the cost of lower bandwidth.
- The largest model tested here (`gemma4:31b` Q4 at 19.9 GB) fits
all three. Q8 variants (28 GB+) only fit the V100 and Strix.
**Fine-tuning / training.**
- Not measured here. 3090 Ti's 24 GB limits batch size on 20 B+
models; V100's 32 GB HBM2 is much more forgiving *if* isolated.
---
## Open questions / follow-ups
1. **Isolated V100 re-run.** Stop SDXL, re-run the harness. Expected
outcome: V100 decode lands between 3090 Ti and Strix (probably
~70-90 tok/s on 26B given HBM2 bandwidth ~900 GB/s vs 3090 Ti's
~1008 GB/s). That would settle the V100's actual rank.
2. **V100 Q8 baseline.** `gemma4:26b-a4b-it-q8_0` (28 GB) is the Q8
MoE variant Seth pulled on pve197 — worth measuring once isolated.
Q8 vs Q4 quality/speed tradeoff for the same model would be useful.
3. **Strix max-model fit.** Strix can probably host models that
wouldn't fit the discrete cards. A follow-up would pull a larger
model (70 B+ quantized) on matt-strix and see the Strix-only
performance ceiling.
4. **Contention behavior.** The V100 finding generalizes — whenever
the homelab is running coresident AI workloads, Gemma 4 inference
falls off a cliff. A "contention-aware routing" decision (don't
send latency-sensitive Ollama traffic to a card with SDXL running)
may be worth building into the mort-bot / openwebui gateway.
---
## Raw data
All per-run JSON traces are under `scripts/gpu-bakeoff/runs/`:
```
runs/
├── steel141/
│ ├── gemma4-26b/{short,long}.json
│ ├── gemma4-31b/{short,long}.json
│ └── gemma4-26b-q8/{short,long}.json # skipped — model not on host
├── pve197/
│ ├── gemma4-26b/{short,long}.json # ⚠ degraded, see caveat
│ └── gemma4-31b/{short,long}.json # ⚠ degraded, see caveat
└── matt-strix/
├── gemma4-26b/{short,long}.json
├── gemma4-31b/{short,long}.json
└── gemma4-26b-q8/{short,long}.json # skipped — model not on host
```
Each JSON contains the warmup call and all 3 measurement calls with
every field Ollama's `/api/generate` returns (token counts, durations,
loaded-at, context length), plus a `summary` with min/median/max for
prefill and decode rates.