GPU Bakeoff — Gemma 4 Throughput: 3090 Ti vs Strix Halo
Date: 2026-04-20
Host matrix: steel141 (RTX 3090 Ti) · matt-strix (AMD Strix Halo iGPU)
Models: gemma4:26b (MoE Q4_K_M) · gemma4:31b-it-q4_K_M (dense Q4_K_M)
Harness: scripts/gpu-bakeoff/harness.py
Raw data: scripts/gpu-bakeoff/runs/
TL;DR
| GPU | 26B (MoE) decode | 31B (dense) decode | Long-prompt prefill (26B) |
|---|---|---|---|
| RTX 3090 Ti (steel141) | 128 tok/s | 27 tok/s | 23,849 tok/s |
| AMD Strix Halo iGPU (matt-strix) | 54 tok/s (42%) | 11 tok/s (39%) | 14,326 tok/s (60%) |
Headline findings
- MoE changes everything. gemma4:26b decodes ~4.7× faster than gemma4:31b on the 3090 Ti and ~5.1× faster on Strix Halo, because only ~4 B of its 25.8 B parameters activate per token. Total parameter counts (26 B vs 31 B) don't predict latency; active parameters do.
- 3090 Ti wins decisively on decode. For inference workloads the ratio of memory bandwidth to compute on consumer Ampere GDDR6X is hard to beat at this price point.
- Strix Halo punches above its bandwidth. Gets 42 % of 3090 Ti decode speed on only ~25 % of the memory bandwidth (~256 GB/s vs ~1008 GB/s) — good SIMD utilization, especially on the MoE model.
Hardware inventory
| Host | GPU | VRAM | Bandwidth | Compute cap | Notes |
|---|---|---|---|---|---|
| steel141 | RTX 3090 Ti | 24 GB GDDR6X | ~1008 GB/s | 8.6 (Ampere) | Seth's workstation. Also has a GTX 1660 SUPER as aux display card — not used for inference. Ollama on 127.0.0.1:11434. |
| matt-strix | AMD Strix Halo (Radeon 890M iGPU + XDNA 2 NPU) | Shared LPDDR5X | ~256 GB/s | — | Unified memory lets it fit models a 24 GB card can't. Ollama on 100.117.155.64:11434 via Tailscale. |
Methodology
- Each (host × model × prompt-length) cell:
  - 1 warm-up call (discarded, absorbs model load time and JIT warm-up)
  - 3 measurement calls
  - temperature: 0.0, top_k: 1 (greedy), num_predict: 256, num_ctx: 4096
  - keep_alive: 10m so the model stays resident between runs
- Two prompt lengths:
  - short (~15 tokens): isolates decode performance; prefill time is negligible
  - long (~500 tokens): stresses prefill (prompt evaluation)
- All timings come from Ollama's own /api/generate response fields (prompt_eval_duration, eval_duration, etc.), so HTTP and wall-clock jitter are excluded from the rates.
- Median of the 3 measurement runs is reported in tables; min/max are in the raw JSON.
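The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the contents of scripts/gpu-bakeoff/harness.py: the function names (`build_payload`, `generate`, `rates`, `measure_cell`) are hypothetical, and the only things taken from this document are the request options, the warm-up-plus-three-runs pattern, and the use of Ollama's own duration fields (reported in nanoseconds).

```python
import json
import statistics
import urllib.request

# Request options copied from the methodology list above.
OPTIONS = {"temperature": 0.0, "top_k": 1, "num_predict": 256, "num_ctx": 4096}


def build_payload(model: str, prompt: str) -> dict:
    """Build a non-streaming /api/generate request body."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": "10m",  # keep the model resident between runs
        "options": OPTIONS,
    }


def generate(base_url: str, model: str, prompt: str) -> dict:
    """POST one request to Ollama and return the decoded JSON response."""
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(build_payload(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def rates(resp: dict) -> tuple[float, float]:
    """Return (prefill tok/s, decode tok/s) from Ollama's nanosecond durations."""
    prefill = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
    decode = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    return prefill, decode


def measure_cell(base_url, model, prompt, runs=3, call=generate):
    """One discarded warm-up call, then `runs` measured calls; report medians."""
    call(base_url, model, prompt)  # warm-up: absorbs model load time
    samples = [rates(call(base_url, model, prompt)) for _ in range(runs)]
    prefill, decode = zip(*samples)
    return {
        "prefill_tok_s": statistics.median(prefill),
        "decode_tok_s": statistics.median(decode),
    }
```

A call such as `measure_cell("http://127.0.0.1:11434", "gemma4:26b", prompt)` would then yield one table cell. The `call` parameter exists only so the loop can be exercised without a live server.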
Full results
Decode rate (tok/s, median of 3 runs)
Decode is the metric that matters most for interactive LLM use — it's the speed of token generation after the prompt has been processed.
| Model | 3090 Ti | Strix Halo |
|---|---|---|
| gemma4:26b (MoE, ~4 B active) | 128.20 | 53.86 |
| gemma4:31b (dense, 31.3 B active) | 27.15 | 10.64 |
Prefill rate (tok/s, long ~500-token prompt, median)
Prefill is the cost of ingesting the prompt and populating the KV cache before decode begins. Prompt tokens are processed as a batch, so short-prompt prefill rates are noisy and dominated by fixed overhead (see the raw JSON for those); the long-prompt numbers below are the ones to reason from.
| Model | 3090 Ti | Strix Halo |
|---|---|---|
| gemma4:26b (long) | 23,849 | 14,326 |
| gemma4:31b (long) | 7,716 | 3,278 |
Short-prompt prefill (for reference)
On a ~15-token prompt, prefill tok/s is not a meaningful rate: the prompt is too small to amortize fixed overhead. Included only to confirm no regression.
| Model | 3090 Ti | Strix Halo |
|---|---|---|
| gemma4:26b (short) | 2,063 | 1,276 |
| gemma4:31b (short) | 661 | 292 |
Why 26B decodes 4.7× faster than 31B
gemma4:26b is the MoE variant ("A4B" in Google's naming = activated
4B). Per-token inference routes through only ~4 B of its 25.8 B total
parameters. gemma4:31b is dense: every one of its 31.3 B parameters
participates in every token's forward pass. Memory bandwidth is the
binding constraint for decode, so the ratio of active params is what
you actually pay for.
Rough math (3090 Ti, 1008 GB/s, Q4_K_M ≈ 0.5 bytes/param):
- 26B MoE: 4 B params × 0.5 bytes/param = 2 GB read per token. Theoretical max ≈ 504 tok/s. Observed 128 tok/s = 25 % utilization.
- 31B dense: 31.3 B params × 0.5 bytes/param = 15.65 GB read per token. Theoretical max ≈ 64 tok/s. Observed 27 tok/s = 42 % utilization.
So dense workloads actually extract higher bandwidth utilization —
they're less overhead-dominated per token. But in absolute terms, MoE
wins by a large factor because the active-parameter bill is much
smaller. For interactive chat this is decisive: Seth's mort-bot
running gemma4:26b gets ~4.7× the responsiveness it would on
gemma4:31b, even though the models are near-equal in total params.
Why the ratio holds on both GPUs: memory bandwidth is the bottleneck on both hosts. Strix gets 42 % of 3090 Ti decode speed on 26B and 39 % on 31B, nearly identical fractions. Both are well above its ~25 % share of memory bandwidth, so Strix converts a larger fraction of its theoretical bandwidth into tokens than the 3090 Ti does.
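The rough math in this section can be reproduced with a small script. It is a back-of-envelope sketch under the same assumptions as the text: every active weight is read once per token, Q4_K_M is treated as 0.5 bytes/param, and the bandwidth figures are the approximate values from the hardware table. The helper names are hypothetical, and KV-cache traffic and other overheads are ignored.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, active_params_b: float,
                         bytes_per_param: float = 0.5) -> float:
    """Upper bound on decode rate if each active weight is read once per token."""
    gb_per_token = active_params_b * bytes_per_param
    return bandwidth_gb_s / gb_per_token


def utilization(observed_tok_s: float, bandwidth_gb_s: float,
                active_params_b: float, bytes_per_param: float = 0.5) -> float:
    """Fraction of the bandwidth-bound ceiling actually achieved."""
    ceiling = decode_ceiling_tok_s(bandwidth_gb_s, active_params_b, bytes_per_param)
    return observed_tok_s / ceiling


# Approximate memory bandwidth (GB/s) and active parameters (billions) from this doc.
BANDWIDTH = {"3090 Ti": 1008.0, "Strix Halo": 256.0}
ACTIVE_PARAMS = {"gemma4:26b": 4.0, "gemma4:31b": 31.3}

# Median decode rates (tok/s) from the "Decode rate" table.
OBSERVED = {
    ("3090 Ti", "gemma4:26b"): 128.20,
    ("3090 Ti", "gemma4:31b"): 27.15,
    ("Strix Halo", "gemma4:26b"): 53.86,
    ("Strix Halo", "gemma4:31b"): 10.64,
}

for (gpu, model), tok_s in OBSERVED.items():
    ceiling = decode_ceiling_tok_s(BANDWIDTH[gpu], ACTIVE_PARAMS[model])
    share = utilization(tok_s, BANDWIDTH[gpu], ACTIVE_PARAMS[model])
    print(f"{gpu:10s} {model:11s} ceiling {ceiling:6.1f} tok/s, utilization {share:.0%}")
```

Because 0.5 bytes/param understates the real Q4_K_M footprint (the 31B file is 19.9 GB), the ceilings are somewhat optimistic and the utilization figures correspondingly pessimistic. The comparison between rows is more informative than any single percentage.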
When to use which GPU
Interactive chat / agent workloads (decode-heavy).
- Primary: 3090 Ti — by a wide margin. 128 tok/s on 26B is comfortable for real-time responses.
- Fallback: Strix Halo. 54 tok/s is usable, and its unified memory can host larger models that the 24 GB 3090 Ti can't.
Long-context / prompt-heavy workloads (prefill-heavy).
- Primary: 3090 Ti again — 23,849 tok/s prefill means a 500-token prompt ingests in ~21 ms.
- Strix at 14,326 tok/s is ~35 ms — still interactive.
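The latency figures above follow directly from dividing token counts by the measured rates. The sketch below shows that arithmetic and extends it to a rough end-to-end estimate. The function names are hypothetical, and applying the ~500-token prefill rate to much longer prompts is an assumption that this bakeoff did not test.

```python
def prefill_ms(prompt_tokens: int, prefill_tok_s: float) -> float:
    """Time to ingest the prompt, in milliseconds, at a steady prefill rate."""
    return prompt_tokens / prefill_tok_s * 1000.0


def response_ms(prompt_tokens: int, output_tokens: int,
                prefill_tok_s: float, decode_tok_s: float) -> float:
    """Rough total latency: prefill plus decode, ignoring load time and HTTP overhead."""
    decode = output_tokens / decode_tok_s * 1000.0
    return prefill_ms(prompt_tokens, prefill_tok_s) + decode


# gemma4:26b medians from the tables above: (long-prompt prefill, decode) in tok/s.
RATES = {"3090 Ti": (23849.0, 128.20), "Strix Halo": (14326.0, 53.86)}

for gpu, (prefill_rate, decode_rate) in RATES.items():
    p = prefill_ms(500, prefill_rate)
    total = response_ms(500, 256, prefill_rate, decode_rate)
    print(f"{gpu:10s} prefill {p:5.1f} ms, 500-in/256-out total {total:7.1f} ms")
```

For a 256-token reply the decode term dominates on both hosts, which is why decode rate is the number to compare for interactive use.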
Running models that don't fit on discrete cards.
- Strix Halo. Unified LPDDR5X can hold 80 GB+ models that a 24 GB 3090 Ti can't — at the cost of lower bandwidth.
- The largest model tested here (gemma4:31b Q4 at 19.9 GB) fits both. Q8 variants (28 GB+) only fit Strix in this matrix.
Fine-tuning / training.
- Not measured here. 3090 Ti's 24 GB limits batch size on 20 B+ models.
Open questions / follow-ups
- Strix max-model fit. Strix can host models that wouldn't fit the 3090 Ti. A follow-up would pull a larger model (70 B+ quantized) on matt-strix and measure the Strix-only performance ceiling.
- Q8 vs Q4 on Strix. Same model, two quantizations — quality/speed tradeoff characterization.
Raw data
All per-run JSON traces are under scripts/gpu-bakeoff/runs/:
runs/
├── steel141/
│   ├── gemma4-26b/{short,long}.json
│   └── gemma4-31b/{short,long}.json
└── matt-strix/
    ├── gemma4-26b/{short,long}.json
    └── gemma4-31b/{short,long}.json
Each JSON contains the warmup call and all 3 measurement calls with
every field Ollama's /api/generate returns (token counts, durations,
loaded-at, context length), plus a summary with min/median/max for
prefill and decode rates.