Mortdecai 22af59756f docs: remove V100 from GPU bakeoff

V100 data was degraded by SDXL co-residence on CT 167 (31/32 GB VRAM
occupied, Gemma 4 models forced 95% onto CPU). Rather than ship a
prominent caveat, drop the V100 column entirely so the doc reports
only apples-to-apples measurements. V100 can be added back once an
isolated run is possible.

Removed: V100 column from TL;DR and per-model tables, hardware row,
caveat section, and associated raw JSONs under runs/pve197/. Harness
config keeps pve197 in HOSTS for future re-runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-20 05:47:41 -04:00

GPU Bakeoff — Gemma 4 Throughput: 3090 Ti vs Strix Halo

Date: 2026-04-20
Host matrix: steel141 (RTX 3090 Ti) · matt-strix (AMD Strix Halo iGPU)
Models: gemma4:26b (MoE Q4_K_M) · gemma4:31b-it-q4_K_M (dense Q4_K_M)
Harness: scripts/gpu-bakeoff/harness.py
Raw data: scripts/gpu-bakeoff/runs/

TL;DR

GPU	26B (MoE) decode	31B (dense) decode	Long-prompt prefill (26B)
RTX 3090 Ti (steel141)	128 tok/s	27 tok/s	23,849 tok/s
AMD Strix Halo iGPU (matt-strix)	54 tok/s (42%)	11 tok/s (39%)	14,326 tok/s (60%)

Percentages are relative to the 3090 Ti.

Headline findings

MoE changes everything. gemma4:26b decodes ~4.7× faster than gemma4:31b on the 3090 Ti and ~5.1× faster on Strix Halo, because only ~4 B of its 25.8 B parameters activate per token. Total parameter counts (26 B vs 31 B) don't predict latency; active parameters do.
3090 Ti wins decisively on decode. For inference workloads, the memory bandwidth of consumer Ampere GDDR6X is hard to beat at this price point.
Strix Halo punches above its bandwidth. It gets 42 % of 3090 Ti decode speed on only ~25 % of the memory bandwidth (~256 GB/s vs ~1008 GB/s), which points to good hardware utilization, especially on the MoE model.

Hardware inventory

Host	GPU	VRAM	Bandwidth	Compute cap	Notes
steel141	RTX 3090 Ti	24 GB GDDR6X	~1008 GB/s	8.6 (Ampere)	Seth's workstation. Also has a GTX 1660 SUPER as aux display card — not used for inference. Ollama on 127.0.0.1:11434.
matt-strix	AMD Strix Halo (Radeon 8000S-series iGPU + XDNA 2 NPU)	Shared LPDDR5X	~256 GB/s	n/a	Unified memory lets it fit models a 24 GB card can't. Ollama on 100.117.155.64:11434 via Tailscale.

Methodology

Each (host × model × prompt-length) cell:
- 1 warm-up call (discarded; absorbs model load time and JIT warm-up)
- 3 measurement calls
- temperature: 0.0, top_k: 1 (greedy), num_predict: 256, num_ctx: 4096
- keep_alive: 10m so the model stays resident between runs

Two prompt lengths:
- short (~15 tokens): isolates decode performance; prefill time is negligible
- long (~500 tokens): stresses prefill (prompt evaluation)

All timings come from Ollama's own /api/generate response fields (prompt_eval_duration, eval_duration, etc.), so HTTP and wall-clock jitter are excluded from the rates.

The median of the 3 measurement runs is reported in tables; min/max are in the raw JSON.
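
The procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the contents of scripts/gpu-bakeoff/harness.py: the function names (generate, rates, measure_cell) are hypothetical, and the only things taken from the document are the request options, the warm-up/measurement counts, and the use of Ollama's own duration fields, which are reported in nanoseconds.

```python
import json
import statistics
import urllib.request

# Sampling options taken from the methodology list.
OPTIONS = {"temperature": 0.0, "top_k": 1, "num_predict": 256, "num_ctx": 4096}


def generate(base_url, model, prompt):
    """POST one non-streaming request to /api/generate and return the JSON body."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": "10m",  # keep the model resident between runs
        "options": OPTIONS,
    }
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def rates(response):
    """Turn Ollama's token counts and nanosecond durations into (prefill, decode) tok/s."""
    prefill = response["prompt_eval_count"] / (response["prompt_eval_duration"] / 1e9)
    decode = response["eval_count"] / (response["eval_duration"] / 1e9)
    return prefill, decode


def measure_cell(call, warmups=1, runs=3):
    """Run one cell: discard the warm-up calls, then return median (prefill, decode)."""
    for _ in range(warmups):
        call()
    samples = [rates(call()) for _ in range(runs)]
    prefill_rates, decode_rates = zip(*samples)
    return statistics.median(prefill_rates), statistics.median(decode_rates)
```

A cell would then be measured with something like `measure_cell(lambda: generate("http://127.0.0.1:11434", "gemma4:26b", prompt))`. Passing the call in as a function keeps the rate arithmetic testable without a running Ollama server.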

Full results

Decode rate (tok/s, median of 3 runs)

Decode is the metric that matters most for interactive LLM use — it's the speed of token generation after the prompt has been processed.

Model	3090 Ti	Strix Halo
gemma4:26b (MoE, ~4 B active)	128.20	53.86
gemma4:31b (dense, 31.3 B active)	27.15	10.64

Prefill rate (tok/s, long ~500-token prompt, median)

Prefill is the cost of ingesting the prompt and populating the KV cache before decode begins. The prompt is processed as a batch, so short-prompt prefill rates are dominated by fixed overhead and are noisy (see the raw JSON for those); the long-prompt numbers below are the ones to reason from.

Model	3090 Ti	Strix Halo
gemma4:26b (long)	23,849	14,326
gemma4:31b (long)	7,716	3,278

Short-prompt prefill (for reference)

On a ~15-token prompt, prefill tok/s is not a meaningful rate: the prompt is too small to amortize fixed overhead. These numbers are included only to confirm no regression.

Model	3090 Ti	Strix Halo
gemma4:26b (short)	2,063	1,276
gemma4:31b (short)	661	292
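
To see why the short-prompt rates look so much lower, one can fit a crude two-point model, t(n) = fixed + n × per_token, through the short and long gemma4:26b medians on the 3090 Ti. This is only a sketch: it assumes the prompts are exactly 15 and 500 tokens (the document gives approximate lengths) and that prefill time is linear in prompt length, which is a simplification. The function names are hypothetical.

```python
def prefill_seconds(tokens, tok_per_s):
    """Prefill time implied by a token count and a reported tok/s rate."""
    return tokens / tok_per_s


def two_point_fit(short, long):
    """Fit t(n) = fixed + n * per_token through two (tokens, tok/s) points."""
    (n1, r1), (n2, r2) = short, long
    t1, t2 = prefill_seconds(n1, r1), prefill_seconds(n2, r2)
    per_token = (t2 - t1) / (n2 - n1)
    fixed = t1 - n1 * per_token
    return fixed, per_token


# gemma4:26b on the 3090 Ti: (assumed token count, median prefill tok/s)
fixed, per_token = two_point_fit((15, 2063), (500, 23849))
print(f"fixed overhead ~{fixed * 1e3:.1f} ms, marginal rate ~{1 / per_token:,.0f} tok/s")
```

Under these assumptions roughly 7 ms of every prefill is fixed cost, which dominates a 15-token prompt and explains why its tok/s figure says little about sustained prefill throughput.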

Why 26B decodes 4.7× faster than 31B

gemma4:26b is the MoE variant ("A4B" in Google's naming = activated 4B). Per-token inference routes through only ~4 B of its 25.8 B total parameters. gemma4:31b is dense: every one of its 31.3 B parameters participates in every token's forward pass. Memory bandwidth is the binding constraint for decode, so the ratio of active params is what you actually pay for.

Rough math (3090 Ti, 1008 GB/s, Q4_K_M ≈ 0.5 bytes/param):

26B MoE: 4 B params × 0.5 bytes/param = 2 GB read per token. Theoretical max ≈ 504 tok/s. Observed 128 tok/s = 25 % utilization.
31B dense: 31.3 B params × 0.5 bytes/param = 15.65 GB read per token. Theoretical max ≈ 64 tok/s. Observed 27 tok/s = 42 % utilization.
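
The same rough math as a small script, so the ceiling can be recomputed for other bandwidths or quantizations. It is a sketch of the bandwidth-bound estimate only: it assumes every active weight is read exactly once per token and ignores KV-cache reads and kernel overhead. The function names are illustrative; the 0.5 bytes/param default is the document's approximation.

```python
def decode_ceiling(bandwidth_gb_s, active_params_b, bytes_per_param=0.5):
    """Upper bound on decode tok/s if all active weights are read once per token."""
    gb_per_token = active_params_b * bytes_per_param
    return bandwidth_gb_s / gb_per_token


def utilization(observed_tok_s, bandwidth_gb_s, active_params_b, bytes_per_param=0.5):
    """Observed decode rate as a fraction of the bandwidth ceiling."""
    return observed_tok_s / decode_ceiling(bandwidth_gb_s, active_params_b, bytes_per_param)


# (model, active params in billions, observed median decode tok/s on the 3090 Ti)
for name, active_b, observed in [("gemma4:26b", 4.0, 128.20), ("gemma4:31b", 31.3, 27.15)]:
    ceiling = decode_ceiling(1008, active_b)
    share = utilization(observed, 1008, active_b)
    print(f"{name}: ceiling {ceiling:.0f} tok/s, utilization {share:.0%}")
```

The 0.5 bytes/param figure is on the low side for Q4_K_M: the 19.9 GB size quoted for gemma4:31b later in this document implies closer to 0.64 bytes/param. Passing bytes_per_param=0.64 lowers the dense ceiling to about 50 tok/s and raises the utilization estimate to roughly 54 %, so the percentages here are best read as order-of-magnitude guides.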

So dense workloads actually extract higher bandwidth utilization — they're less overhead-dominated per token. But in absolute terms, MoE wins by a large factor because the active-parameter bill is much smaller. For interactive chat this is decisive: Seth's mort-bot running gemma4:26b gets ~4.7× the responsiveness it would on gemma4:31b, even though the models are near-equal in total params.

Why the ratio holds on both GPUs: memory bandwidth is the bottleneck on both. Strix gets 42 % of 3090 Ti decode speed on 26B and 39 % on 31B. The two ratios are nearly identical, and both sit well above Strix's ~25 % share of the bandwidth, so Strix converts a larger fraction of its bandwidth into tokens than the 3090 Ti does.
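
The cross-GPU ratios quoted in this section can be re-derived from the decode table. The snippet below is only arithmetic on the document's medians and approximate bandwidth figures; the dictionary and function names are illustrative.

```python
# Median decode rates (tok/s) and approximate memory bandwidths (GB/s) from this document.
DECODE = {
    "gemma4:26b": {"3090 Ti": 128.20, "Strix Halo": 53.86},
    "gemma4:31b": {"3090 Ti": 27.15, "Strix Halo": 10.64},
}
BANDWIDTH = {"3090 Ti": 1008, "Strix Halo": 256}


def decode_ratio(model, gpu="Strix Halo", baseline="3090 Ti"):
    """Decode speed of `gpu` as a fraction of `baseline` for one model."""
    return DECODE[model][gpu] / DECODE[model][baseline]


def moe_speedup(gpu):
    """How many times faster the MoE model decodes than the dense one on `gpu`."""
    return DECODE["gemma4:26b"][gpu] / DECODE["gemma4:31b"][gpu]


bandwidth_share = BANDWIDTH["Strix Halo"] / BANDWIDTH["3090 Ti"]
for model in DECODE:
    print(f"{model}: Strix at {decode_ratio(model):.0%} of 3090 Ti decode "
          f"on {bandwidth_share:.0%} of the bandwidth")
for gpu in BANDWIDTH:
    print(f"{gpu}: MoE decodes {moe_speedup(gpu):.1f}x faster than dense")
```

Run as written, this gives 42 % and 39 % for the Strix-to-3090 Ti decode ratios against a 25 % bandwidth share, and MoE-over-dense speedups of about 4.7× on the 3090 Ti and 5.1× on Strix Halo.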

When to use which GPU

Interactive chat / agent workloads (decode-heavy).

Primary: 3090 Ti — by a wide margin. 128 tok/s on 26B is comfortable for real-time responses.
Fallback: Strix Halo. 54 tok/s is usable, and its unified memory can host models too large for the 24 GB 3090 Ti.

Long-context / prompt-heavy workloads (prefill-heavy).

Primary: 3090 Ti again — 23,849 tok/s prefill means a 500-token prompt ingests in ~21 ms.
Strix at 14,326 tok/s is ~35 ms — still interactive.
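
The latency figures above follow directly from the measured rates. The sketch below reproduces them and adds a rough end-to-end estimate for a reply of 256 tokens (the num_predict used in the benchmark). It deliberately ignores model load time, network latency, and any fixed per-request overhead, so treat the totals as lower bounds; the function names are hypothetical.

```python
def prefill_ms(prompt_tokens, prefill_tok_s):
    """Time to ingest the prompt, in milliseconds."""
    return prompt_tokens / prefill_tok_s * 1000.0


def response_ms(prompt_tokens, output_tokens, prefill_tok_s, decode_tok_s):
    """Prefill plus decode time for one reply, ignoring load and network time."""
    decode_time = output_tokens / decode_tok_s * 1000.0
    return prefill_ms(prompt_tokens, prefill_tok_s) + decode_time


# gemma4:26b medians from the results tables: (prefill tok/s, decode tok/s)
RATES = {"3090 Ti": (23849, 128.20), "Strix Halo": (14326, 53.86)}
for gpu, (prefill, decode) in RATES.items():
    total_s = response_ms(500, 256, prefill, decode) / 1000
    print(f"{gpu}: prefill {prefill_ms(500, prefill):.0f} ms, "
          f"500-in/256-out reply {total_s:.1f} s")
```

Even with a 500-token prompt, decode accounts for nearly all of the reply time on both machines (about 2.0 s on the 3090 Ti and 4.8 s on Strix Halo), which is why the decode table matters more than the prefill table for chat-style use.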

Running models that don't fit on discrete cards.

Strix Halo. Unified LPDDR5X can hold 80 GB+ models that a 24 GB 3090 Ti can't — at the cost of lower bandwidth.
The largest model tested here (gemma4:31b Q4 at 19.9 GB) fits both. Q8 variants (28 GB+) only fit Strix in this matrix.

Fine-tuning / training.

Not measured here. 3090 Ti's 24 GB limits batch size on 20 B+ models.

Open questions / follow-ups

Strix max-model fit. Strix can host models that wouldn't fit the 3090 Ti. A follow-up would pull a larger model (70 B+ quantized) on matt-strix and measure the Strix-only performance ceiling.
Q8 vs Q4 on Strix. Same model, two quantizations — quality/speed tradeoff characterization.

Raw data

All per-run JSON traces are under scripts/gpu-bakeoff/runs/:

runs/
├── steel141/
│   ├── gemma4-26b/{short,long}.json
│   └── gemma4-31b/{short,long}.json
└── matt-strix/
    ├── gemma4-26b/{short,long}.json
    └── gemma4-31b/{short,long}.json

Each JSON contains the warmup call and all 3 measurement calls with every field Ollama's /api/generate returns (token counts, durations, loaded-at, context length), plus a summary with min/median/max for prefill and decode rates.
