# GPU Bakeoff — Gemma 4 Throughput: 3090 Ti vs Strix Halo

**Date:** 2026-04-20  
**Host matrix:** steel141 (RTX 3090 Ti) · strix-halo (AMD Strix Halo iGPU)  
**Models:** `gemma4:26b` (MoE Q4_K_M) · `gemma4:31b-it-q4_K_M` (dense Q4_K_M)  
**Harness:** `scripts/gpu-bakeoff/harness.py`  
**Raw data:** `scripts/gpu-bakeoff/runs/`

---

## TL;DR

| GPU | 26B (MoE) decode | 31B (dense) decode | Long-prompt prefill (26B) |
|-----|------------------|--------------------|---------------------------|
| **RTX 3090 Ti** (steel141) | **128 tok/s** | **27 tok/s** | **23,849 tok/s** |
| **AMD Strix Halo iGPU** (strix-halo) | 54 tok/s (42%) | 11 tok/s (39%) | 14,326 tok/s (60%) |

### Headline findings

1. **MoE changes everything.** `gemma4:26b` decodes **~4.7× faster** than
   `gemma4:31b` on the 3090 Ti (and ~5.1× faster on Strix Halo), because
   only ~4 B of its 25.8 B parameters activate per token. Total parameter
   counts (26 B vs 31 B) don't predict latency; *active* parameters do.
2. **3090 Ti wins decisively on decode.** For inference workloads, the
   balance of memory bandwidth and compute on consumer Ampere with GDDR6X
   is hard to beat at this price point.
3. **Strix Halo punches above its bandwidth.** It reaches 42 % of 3090 Ti
   decode speed on only ~25 % of the memory bandwidth (~256 GB/s vs
   ~1008 GB/s), which points to good SIMD utilization, especially on the
   MoE model.

---

## Hardware inventory

| Host | GPU | VRAM | Bandwidth | Compute cap | Notes |
|------|-----|------|-----------|-------------|-------|
| steel141 | RTX 3090 Ti | 24 GB GDDR6X | ~1008 GB/s | 8.6 (Ampere) | Workstation. Also has a GTX 1660 SUPER as an auxiliary display card, not used for inference. Ollama on localhost. |
| strix-halo | AMD Strix Halo (Radeon 890M iGPU + XDNA 2 NPU) | Shared LPDDR5X | ~256 GB/s | n/a | Unified memory lets it fit models a 24 GB card can't. Ollama accessed via Tailscale. |

---

## Methodology

- Each (host × model × prompt-length) cell:
  - 1 warm-up call (discarded; absorbs model load time and JIT warm-up)
  - 3 measurement calls
  - `temperature: 0.0`, `top_k: 1` (greedy), `num_predict: 256`, `num_ctx: 4096`
  - `keep_alive: 10m` so the model stays resident between runs
- Two prompt lengths:
  - **short** (~15 tokens): isolates decode performance, since prefill time is negligible
  - **long** (~500 tokens): stresses prefill (prompt evaluation)
- All timings come from Ollama's own `/api/generate` response fields
  (`prompt_eval_duration`, `eval_duration`, etc.), so HTTP and wall-clock
  jitter are excluded from the rates.
- Median of the 3 measurement runs is reported in tables; min/max are in
  the raw JSON.
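The procedure above can be sketched in Python roughly as follows. This is an illustrative outline, not the contents of `harness.py`: the function names (`build_payload`, `post_generate`, `rates`, `measure_cell`) are hypothetical, and the sketch assumes a non-streaming `/api/generate` reply that carries `prompt_eval_count`, `prompt_eval_duration`, `eval_count` and `eval_duration`, with durations in nanoseconds. Under those assumptions, one cell could be measured with something like `measure_cell(lambda: post_generate(url, build_payload(model, prompt)))`, where `url` is the host's Ollama base URL.

```python
import json
import statistics
import urllib.request


def build_payload(model, prompt):
    """Request body mirroring the settings listed in the methodology."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for a single JSON reply that includes timings
        "keep_alive": "10m",
        "options": {
            "temperature": 0.0,
            "top_k": 1,
            "num_predict": 256,
            "num_ctx": 4096,
        },
    }


def post_generate(base_url, payload):
    """POST to /api/generate on the given host and return the decoded reply."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def rates(reply):
    """Turn token counts and nanosecond durations into tokens per second."""
    return {
        "prefill_tok_s": reply["prompt_eval_count"]
        / (reply["prompt_eval_duration"] / 1e9),
        "decode_tok_s": reply["eval_count"] / (reply["eval_duration"] / 1e9),
    }


def measure_cell(call, runs=3):
    """Run one discarded warm-up, then report the median of `runs` calls."""
    call()  # warm-up: absorbs model load, result is thrown away
    samples = [rates(call()) for _ in range(runs)]
    return {key: statistics.median(s[key] for s in samples) for key in samples[0]}
```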

---

## Full results

### Decode rate (tok/s, median of 3 runs)

Decode is the metric that matters most for interactive LLM use: it is
the speed of token generation after the prompt has been processed.

| Model | 3090 Ti | Strix Halo |
|-------|---------|------------|
| gemma4:26b (MoE, ~4 B active) | **128.20** | 53.86 |
| gemma4:31b (dense, 31.3 B active) | **27.15** | 10.64 |
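
The ratios quoted in this report can be recomputed from the decode table. The snippet below is a small arithmetic check, not part of the harness; the dictionary layout and the helper names `moe_speedup` and `relative_to_3090` are made up for illustration.

```python
# Median decode rates (tok/s) copied from the table above.
decode = {
    "3090 Ti": {"gemma4:26b": 128.20, "gemma4:31b": 27.15},
    "Strix Halo": {"gemma4:26b": 53.86, "gemma4:31b": 10.64},
}


def moe_speedup(gpu):
    """How many times faster the MoE model decodes than the dense one."""
    return decode[gpu]["gemma4:26b"] / decode[gpu]["gemma4:31b"]


def relative_to_3090(model):
    """Strix Halo decode rate as a fraction of the 3090 Ti rate."""
    return decode["Strix Halo"][model] / decode["3090 Ti"][model]


for gpu in decode:
    print(f"{gpu}: 26B decodes {moe_speedup(gpu):.2f}x faster than 31B")
for model in ("gemma4:26b", "gemma4:31b"):
    print(f"{model}: Strix Halo reaches {relative_to_3090(model):.0%} of the 3090 Ti")
```

With the table's numbers this gives about 4.72× on the 3090 Ti and 5.06× on Strix Halo for the MoE speedup, and 42 % and 39 % for Strix Halo relative to the 3090 Ti.
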
### Prefill rate (tok/s, long ~500-token prompt, median)

Prefill is the cost of ingesting the prompt and populating the KV cache
before decode begins. The prompt is processed as a batch and the rate is
reported per token, so short-prompt prefill numbers are noisy (dominated
by fixed overhead; see the raw JSON for those). The long-prompt numbers
below are the ones to reason from.

| Model | 3090 Ti | Strix Halo |
|-------|---------|------------|
| gemma4:26b (long) | **23,849** | 14,326 |
| gemma4:31b (long) | **7,716** | 3,278 |

### Short-prompt prefill (for reference)

On a 15-token prompt, prefill tokens/sec is not a meaningful rate: the
prompt is too small to amortize fixed overhead. These numbers are
included only to confirm no regression.

| Model | 3090 Ti | Strix Halo |
|-------|---------|------------|
| gemma4:26b (short) | 2,063 | 1,276 |
| gemma4:31b (short) | 661 | 292 |

---

## Why 26B decodes 4.7× faster than 31B

`gemma4:26b` is the MoE variant ("A4B" in Google's naming = *activated
4B*). Per-token inference routes through only ~4 B of its 25.8 B total
parameters. `gemma4:31b` is dense: every one of its 31.3 B parameters
participates in every token's forward pass. Memory bandwidth is the
binding constraint for decode, so the *active* parameter count is what
you actually pay for.

Rough math (3090 Ti, 1008 GB/s, Q4_K_M ≈ 0.5 bytes/param):

- 26B MoE: 4 B params × 0.5 bytes/param = 2 GB read per token.
  Theoretical max ≈ 504 tok/s. Observed 128 tok/s = **25 % utilization**.
- 31B dense: 31.3 B params × 0.5 bytes/param = 15.65 GB read per token.
  Theoretical max ≈ 64 tok/s. Observed 27 tok/s = **42 % utilization**.
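
The rough math above can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch under the same simplifying assumptions as the text (every active parameter is read once per token, at about 0.5 bytes per parameter); the function names are illustrative only. The same helpers can be called with ~256 GB/s to explore the Strix Halo side.

```python
def decode_ceiling(bandwidth_gb_s, active_params_billion, bytes_per_param=0.5):
    """Upper bound on decode tok/s if decode were purely bandwidth-bound."""
    gb_read_per_token = active_params_billion * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


def utilization(observed_tok_s, bandwidth_gb_s, active_params_billion,
                bytes_per_param=0.5):
    """Observed decode rate as a fraction of the bandwidth ceiling."""
    ceiling = decode_ceiling(bandwidth_gb_s, active_params_billion, bytes_per_param)
    return observed_tok_s / ceiling


# 3090 Ti at ~1008 GB/s, using the median decode rates from the results tables.
for name, active, observed in (("26B MoE", 4.0, 128.20), ("31B dense", 31.3, 27.15)):
    ceiling = decode_ceiling(1008, active)
    share = utilization(observed, 1008, active)
    print(f"{name}: ceiling {ceiling:.0f} tok/s, utilization {share:.0%}")
```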

So dense workloads actually extract *higher* bandwidth utilization:
they are less overhead-dominated per token. But in absolute terms, MoE
wins by a large factor because the active-parameter bill is much
smaller. For interactive chat this is decisive: Seth's `mort-bot`
running `gemma4:26b` gets ~4.7× the responsiveness it would on
`gemma4:31b`, even though the models are near-equal in total params.

Why the ratio holds on both GPUs: **memory bandwidth is the bottleneck**
on both cards. Strix gets 42 % of 3090 Ti on 26B and 39 % of 3090 Ti on
31B. The two ratios are nearly identical, as expected when both models
are bandwidth-bound, and both sit above the ~25 % that Strix's share of
memory bandwidth alone would predict.

---

## When to use which GPU

**Interactive chat / agent workloads (decode-heavy).**

- Primary: **3090 Ti**, by a wide margin. 128 tok/s on 26B is
  comfortable for real-time responses.
- Fallback: **Strix Halo**. 54 tok/s is usable, and its unified
  memory can host larger models that the 24 GB 3090 Ti can't.

**Long-context / prompt-heavy workloads (prefill-heavy).**

- Primary: **3090 Ti** again. 23,849 tok/s prefill means a
  500-token prompt ingests in ~21 ms.
- Strix at 14,326 tok/s takes ~35 ms, which is still interactive.
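
The latency figures in the two bullets above are just prompt length divided by prefill rate. A minimal sketch of that conversion, with a hypothetical helper name and the rates taken from the long-prompt prefill table:

```python
def prefill_ms(prompt_tokens, prefill_tok_s):
    """Time to ingest a prompt, in milliseconds, at a given prefill rate."""
    return prompt_tokens / prefill_tok_s * 1000.0


# Long-prompt prefill rates for gemma4:26b (tok/s).
for gpu, rate in (("3090 Ti", 23849), ("Strix Halo", 14326)):
    print(f"{gpu}: 500-token prompt in ~{prefill_ms(500, rate):.0f} ms")
```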

**Running models that don't fit on discrete cards.**

- Strix Halo. Unified LPDDR5X can hold 80 GB+ models that a 24 GB
  3090 Ti can't, at the cost of lower bandwidth.
- The largest model tested here (`gemma4:31b` Q4 at 19.9 GB) fits
  both. Q8 variants (28 GB+) only fit Strix in this matrix.

**Fine-tuning / training.**

- Not measured here. 3090 Ti's 24 GB limits batch size on 20 B+
  models.

---

## Open questions / follow-ups

1. **Strix max-model fit.** Strix can host models that wouldn't fit the
   3090 Ti. A follow-up would pull a larger model (70 B+ quantized) on
   strix-halo and measure the Strix-only performance ceiling.
2. **Q8 vs Q4 on Strix.** Same model, two quantizations: characterize
   the quality/speed tradeoff.

---

## Raw data

All per-run JSON traces are under `scripts/gpu-bakeoff/runs/`:

```
runs/
├── steel141/
│   ├── gemma4-26b/{short,long}.json
│   └── gemma4-31b/{short,long}.json
└── strix-halo/
    ├── gemma4-26b/{short,long}.json
    └── gemma4-31b/{short,long}.json
```

Each JSON contains the warm-up call and all 3 measurement calls with
every field Ollama's `/api/generate` returns (token counts, durations,
loaded-at, context length), plus a `summary` with min/median/max for
prefill and decode rates.
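
For anyone scripting against this layout, the expected file set can be enumerated from the tree above. The sketch below only builds paths and checks that they exist; it assumes it is run from the repository root and does not assume anything about the keys inside each JSON file. The helper names are hypothetical.

```python
from pathlib import Path

HOSTS = ("steel141", "strix-halo")
MODELS = ("gemma4-26b", "gemma4-31b")
PROMPTS = ("short", "long")


def expected_run_files(root="scripts/gpu-bakeoff/runs"):
    """All host/model/prompt-length trace paths implied by the layout above."""
    return [
        Path(root) / host / model / f"{prompt}.json"
        for host in HOSTS
        for model in MODELS
        for prompt in PROMPTS
    ]


def missing_run_files(root="scripts/gpu-bakeoff/runs"):
    """Subset of the expected paths that are not present on disk."""
    return [path for path in expected_run_files(root) if not path.is_file()]
```

Calling `missing_run_files()` before any analysis is a cheap way to notice an incomplete matrix, for example after re-running only one host.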