gemma4-research/scripts/gpu-bakeoff/runs/matt-strix-rerun.log
Mortdecai b6190357ba feat: GPU bakeoff — 3090 Ti vs V100 vs Strix Halo
Cross-host Gemma 4 throughput comparison across three architectures.
Harness at scripts/gpu-bakeoff/; writeup at
docs/reference/gpu-bakeoff-2026-04-20.md.

Key findings:
- RTX 3090 Ti wins decode decisively (128 tok/s on gemma4:26b MoE Q4,
  ~4.7× faster than gemma4:31b dense on the same card).
- AMD Strix Halo iGPU lands at ~42% of 3090 Ti decode on ~25% of the
  memory bandwidth — good SIMD utilization, especially for MoE.
- V100 numbers are DEGRADED: CT 167 ai-visualizer SDXL consumes 31/32
  GB of its VRAM, forcing Gemma 4 models 95% onto CPU. Isolated V100
  run requires SDXL eviction — left as follow-up.
- MoE vs dense is the dominant latency factor across all GPUs: the ~4 B
  active params of gemma4:26b beat the 31.3 B active of gemma4:31b by
  a similar ratio (~4.7× to ~5×) on every card tested.

Methodology: 1 warmup + 3 measurement runs per (host × model ×
prompt-length), Ollama's canonical timing fields, temp=0 greedy,
num_predict=256. All three Ollama servers accessed via HTTP (Strix
via Tailscale).
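
A minimal sketch of how the throughput figures could be derived from Ollama's timing fields, assuming a non-streaming /api/generate response. The field names (prompt_eval_count, prompt_eval_duration, eval_count, eval_duration, with durations in nanoseconds) and the options (temperature, num_predict) come from Ollama's API; the helper names and the choice to average the measurement runs are assumptions for illustration, not the harness's actual code.

```python
from statistics import mean

NS_PER_S = 1e9  # Ollama reports durations in nanoseconds


def build_request(model, prompt):
    """Request body for a greedy, fixed-length generation (hypothetical helper)."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0, "num_predict": 256},
    }


def throughput(resp):
    """Return (prefill tok/s, decode tok/s) from one Ollama response dict."""
    prefill = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / NS_PER_S)
    decode = resp["eval_count"] / (resp["eval_duration"] / NS_PER_S)
    return prefill, decode


def summarize(responses, warmup=1):
    """Drop the warmup run(s), then average the remaining runs.

    Averaging is an assumption; the harness may aggregate differently.
    """
    measured = [throughput(r) for r in responses[warmup:]]
    return mean(p for p, _ in measured), mean(d for _, d in measured)
```

With 1 warmup and 3 measurement runs, summarize would be called on a list of four response dicts per (host, model, prompt-length) combination.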

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 05:45:26 -04:00


[matt-strix] gemma4:26b short — prefill=1275.71 tok/s decode= 53.83 tok/s
[matt-strix] gemma4:26b long — prefill=14326.07 tok/s decode= 52.42 tok/s
[matt-strix] gemma4:31b short — prefill= 291.74 tok/s decode= 10.64 tok/s
[matt-strix] gemma4:31b long — prefill= 3277.8 tok/s decode= 10.42 tok/s
[matt-strix] gemma4:26b-q8 short — model not available on host
[matt-strix] gemma4:26b-q8 long — model not available on host
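
A rough sketch for checking the MoE-vs-dense decode ratio from log lines in the format above. The regex and the helper names (parse_line, decode_ratio) are assumptions written for this log layout; lines without timing fields, such as "model not available on host", are skipped.

```python
import re

# Matches: [host] model length <sep> prefill=<float> tok/s decode=<float> tok/s
# The separator is matched as any single non-space token, and padding
# after '=' is tolerated.
LINE_RE = re.compile(
    r"\[(?P<host>[^\]]+)\]\s+(?P<model>\S+)\s+(?P<length>\S+)\s+\S+\s+"
    r"prefill=\s*(?P<prefill>[\d.]+)\s+tok/s\s+"
    r"decode=\s*(?P<decode>[\d.]+)\s+tok/s"
)


def parse_line(line):
    """Parse one log line; return None if it carries no measurements."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {
        "host": m.group("host"),
        "model": m.group("model"),
        "length": m.group("length"),
        "prefill": float(m.group("prefill")),
        "decode": float(m.group("decode")),
    }


def decode_ratio(lines, fast_model, slow_model):
    """Decode-speed ratio fast_model / slow_model, keyed by prompt length."""
    decode = {}
    for line in lines:
        row = parse_line(line)
        if row is not None:
            decode[(row["model"], row["length"])] = row["decode"]
    return {
        length: value / decode[(slow_model, length)]
        for (model, length), value in decode.items()
        if model == fast_model and (slow_model, length) in decode
    }
```

Applied to the four measured lines of this log with fast_model="gemma4:26b" and slow_model="gemma4:31b", the decode ratio comes out at roughly 5× for both prompt lengths.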