feat: GPU bakeoff — 3090 Ti vs V100 vs Strix Halo
Cross-host Gemma 4 throughput comparison across three architectures.
Harness at scripts/gpu-bakeoff/; writeup at
docs/reference/gpu-bakeoff-2026-04-20.md.

Key findings:
- RTX 3090 Ti wins decode decisively (128 tok/s on gemma4:26b MoE Q4,
  ~4.7× faster than gemma4:31b dense on the same card).
- AMD Strix Halo iGPU lands at ~42% of 3090 Ti decode on ~25% of the
  memory bandwidth: good SIMD utilization, especially for MoE.
- V100 numbers are DEGRADED: CT 167 ai-visualizer SDXL consumes 31/32 GB
  of its VRAM, forcing Gemma 4 models 95% onto CPU. Isolated V100 run
  requires SDXL eviction; left as follow-up.
- MoE vs dense is the dominant latency factor across all GPUs: ~4 B
  active params of gemma4:26b beats 31.3 B active of gemma4:31b by the
  same ratio (~4.7×) on every card tested.

Methodology: 1 warmup + 3 measurement runs per (host × model ×
prompt-length), Ollama's canonical timing fields, temp=0 greedy,
num_predict=256. All three Ollama servers accessed via HTTP (Strix via
Tailscale).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
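The methodology above can be sketched roughly as follows. This is not the harness from scripts/gpu-bakeoff/, only a minimal illustration assuming the standard Ollama `/api/generate` endpoint with `stream` disabled, whose response reports `prompt_eval_count`, `prompt_eval_duration`, `eval_count`, `eval_duration`, `load_duration` and `total_duration` (durations in nanoseconds). The function names `build_request`, `to_metrics` and `run_once` are hypothetical, and the output keys are chosen to mirror the result file added in this commit.

```python
import json
import urllib.request

NS_PER_S = 1e9
NS_PER_MS = 1e6


def build_request(model, prompt, num_predict=256, num_ctx=4096):
    """Request body for greedy decoding (temperature 0) with a fixed output length."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # one JSON object carrying the final timing fields
        "options": {
            "temperature": 0,
            "num_predict": num_predict,
            "num_ctx": num_ctx,
        },
    }


def to_metrics(resp):
    """Convert Ollama's nanosecond timing fields into ms and tokens/s.

    Assumes every timing field is present and the durations are non-zero.
    """
    prefill_ns = resp["prompt_eval_duration"]
    decode_ns = resp["eval_duration"]
    return {
        "prompt_tokens": resp["prompt_eval_count"],
        "prompt_eval_ms": round(prefill_ns / NS_PER_MS, 1),
        "prefill_tok_per_s": round(resp["prompt_eval_count"] / (prefill_ns / NS_PER_S), 2),
        "output_tokens": resp["eval_count"],
        "eval_ms": round(decode_ns / NS_PER_MS, 1),
        "decode_tok_per_s": round(resp["eval_count"] / (decode_ns / NS_PER_S), 2),
        "load_ms": round(resp["load_duration"] / NS_PER_MS, 1),
        "total_ms": round(resp["total_duration"] / NS_PER_MS, 1),
        "done_reason": resp.get("done_reason"),
    }


def run_once(base_url, payload, timeout=600):
    """POST one generation request and return its derived metrics."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as r:
        return to_metrics(json.load(r))


def bench(base_url, model, prompt, n_runs=3):
    """One discarded-from-stats warmup followed by n_runs measured runs."""
    payload = build_request(model, prompt)
    warmup = run_once(base_url, payload)
    runs = [run_once(base_url, payload) for _ in range(n_runs)]
    return {"warmup": warmup, "runs": runs}
```

Keeping the warmup separate matters because the first request also pays for loading the model, so its load and prefill timings are not representative of steady-state runs.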
@@ -0,0 +1,81 @@
{
  "host": "matt-strix",
  "gpu": "AMD Strix Halo iGPU",
  "vram_gb": null,
  "model_alias": "gemma4:26b",
  "model_tag": "gemma4:26b",
  "prompt_key": "short",
  "prompt_chars": 78,
  "num_predict": 256,
  "num_ctx": 4096,
  "runs": [
    {
      "prompt_tokens": 28,
      "prompt_eval_ms": 21.9,
      "prefill_tok_per_s": 1278.99,
      "output_tokens": 256,
      "eval_ms": 4754.7,
      "decode_tok_per_s": 53.84,
      "load_ms": 172.3,
      "total_ms": 5008.5,
      "harness_wall_s": 5.057,
      "done_reason": "length"
    },
    {
      "prompt_tokens": 28,
      "prompt_eval_ms": 21.9,
      "prefill_tok_per_s": 1275.71,
      "output_tokens": 256,
      "eval_ms": 4755.7,
      "decode_tok_per_s": 53.83,
      "load_ms": 151.6,
      "total_ms": 4988.3,
      "harness_wall_s": 5.043,
      "done_reason": "length"
    },
    {
      "prompt_tokens": 28,
      "prompt_eval_ms": 22.0,
      "prefill_tok_per_s": 1271.11,
      "output_tokens": 256,
      "eval_ms": 4757.6,
      "decode_tok_per_s": 53.81,
      "load_ms": 154.4,
      "total_ms": 4993.2,
      "harness_wall_s": 5.048,
      "done_reason": "length"
    }
  ],
  "warmup": {
    "prompt_tokens": 28,
    "prompt_eval_ms": 93.1,
    "prefill_tok_per_s": 300.9,
    "output_tokens": 256,
    "eval_ms": 4756.6,
    "decode_tok_per_s": 53.82,
    "load_ms": 2272.4,
    "total_ms": 7250.0,
    "harness_wall_s": 7.341,
    "done_reason": "length"
  },
  "summary": {
    "prefill_tok_per_s": {
      "min": 1271.11,
      "median": 1275.71,
      "max": 1278.99,
      "n": 3
    },
    "decode_tok_per_s": {
      "min": 53.81,
      "median": 53.83,
      "max": 53.84,
      "n": 3
    },
    "total_ms": {
      "min": 4988.3,
      "median": 4993.2,
      "max": 5008.5,
      "n": 3
    }
  }
}
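The `summary` block in the result file can be reproduced from the three measured runs, leaving the warmup out. The sketch below is one plausible way to do that with the Python standard library; the helper name `summarize` is hypothetical and the actual harness may compute it differently. The input values are copied from the `runs` array above.

```python
from statistics import median


def summarize(runs, keys=("prefill_tok_per_s", "decode_tok_per_s", "total_ms")):
    """Return min / median / max / n for each metric across the measured runs."""
    out = {}
    for key in keys:
        values = [run[key] for run in runs]
        out[key] = {
            "min": min(values),
            "median": median(values),
            "max": max(values),
            "n": len(values),
        }
    return out


# Measured runs from the result file (warmup excluded).
runs = [
    {"prefill_tok_per_s": 1278.99, "decode_tok_per_s": 53.84, "total_ms": 5008.5},
    {"prefill_tok_per_s": 1275.71, "decode_tok_per_s": 53.83, "total_ms": 4988.3},
    {"prefill_tok_per_s": 1271.11, "decode_tok_per_s": 53.81, "total_ms": 4993.2},
]

summary = summarize(runs)
print(summary["decode_tok_per_s"])
```

With an odd number of runs the median is simply the middle value, so it is always one of the observed measurements. Dividing the 53.83 tok/s decode median by the 128 tok/s quoted for the 3090 Ti in the commit message gives roughly 0.42, which matches the "~42%" figure.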