gemma4-research/scripts/gpu-bakeoff/runs/steel141/gemma4-31b/short.json
Mortdecai b6190357ba feat: GPU bakeoff — 3090 Ti vs V100 vs Strix Halo
Cross-host Gemma 4 throughput comparison across three architectures.
Harness at scripts/gpu-bakeoff/; writeup at
docs/reference/gpu-bakeoff-2026-04-20.md.

Key findings:
- RTX 3090 Ti wins decode decisively (128 tok/s on gemma4:26b MoE Q4,
  ~4.7× faster than gemma4:31b dense on the same card).
- AMD Strix Halo iGPU lands at ~42% of 3090 Ti decode on ~25% of the
  memory bandwidth — good SIMD utilization, especially for MoE.
- V100 numbers are DEGRADED: CT 167 ai-visualizer SDXL consumes 31/32
  GB of its VRAM, forcing Gemma 4 models 95% onto CPU. Isolated V100
  run requires SDXL eviction — left as follow-up.
- MoE vs dense is the dominant throughput factor across all GPUs:
  gemma4:26b (~4 B active params) out-decodes gemma4:31b (31.3 B
  active) by the same ~4.7× ratio on every card tested.

Methodology: 1 warmup + 3 measurement runs per (host × model ×
prompt-length); timings taken from the duration fields Ollama reports
with each response (load, prompt eval, eval, total); temp=0 greedy,
num_predict=256. All three Ollama servers were accessed via HTTP
(Strix via Tailscale).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 05:45:26 -04:00
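
The methodology in the commit message can be illustrated with a minimal Python sketch. This is not the harness from scripts/gpu-bakeoff/ (which is not shown here); it assumes the harness posts to Ollama's /api/generate endpoint with streaming disabled and converts the nanosecond duration fields in the response into the per-run keys seen in short.json. The function names build_payload, generate, tok_per_s and run_record are hypothetical, and the rounding (1 decimal for milliseconds, 2 for tok/s) is inferred from the file rather than confirmed.

```python
import json
import urllib.request
from typing import Any, Optional

NS_PER_MS = 1_000_000
NS_PER_S = 1_000_000_000


def build_payload(model: str, prompt: str) -> dict[str, Any]:
    """Request body matching the stated settings: greedy, 256 output tokens."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # one JSON object that includes the timing fields
        "options": {"temperature": 0, "num_predict": 256, "num_ctx": 4096},
    }


def generate(base_url: str, payload: dict[str, Any]) -> dict[str, Any]:
    """POST to Ollama's /api/generate and return the parsed response."""
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def tok_per_s(tokens: int, duration_ns: int) -> Optional[float]:
    """Tokens per second, or None when no duration was reported."""
    if duration_ns <= 0:
        return None
    return round(tokens * NS_PER_S / duration_ns, 2)


def run_record(resp: dict[str, Any]) -> dict[str, Any]:
    """Map Ollama's nanosecond timings onto the keys used in short.json."""
    prompt_ns = resp.get("prompt_eval_duration", 0)
    eval_ns = resp.get("eval_duration", 0)
    prompt_tokens = resp.get("prompt_eval_count", 0)
    output_tokens = resp.get("eval_count", 0)
    return {
        "prompt_tokens": prompt_tokens,
        "prompt_eval_ms": round(prompt_ns / NS_PER_MS, 1),
        "prefill_tok_per_s": tok_per_s(prompt_tokens, prompt_ns),
        "output_tokens": output_tokens,
        "eval_ms": round(eval_ns / NS_PER_MS, 1),
        "decode_tok_per_s": tok_per_s(output_tokens, eval_ns),
        "load_ms": round(resp.get("load_duration", 0) / NS_PER_MS, 1),
        "total_ms": round(resp.get("total_duration", 0) / NS_PER_MS, 1),
        "done_reason": resp.get("done_reason"),
    }
```

A run would then look like run_record(generate("http://localhost:11434", build_payload("gemma4:31b", prompt))), where the base URL is Ollama's default local address and stands in for whichever host is under test. The harness_wall_s field would be measured client-side around the request and is left out of this sketch.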

81 lines
1.7 KiB
JSON

{
  "host": "steel141",
  "gpu": "RTX 3090 Ti",
  "vram_gb": 24,
  "model_alias": "gemma4:31b",
  "model_tag": "gemma4:31b-it-q4_K_M",
  "prompt_key": "short",
  "prompt_chars": 78,
  "num_predict": 256,
  "num_ctx": 4096,
  "runs": [
    {
      "prompt_tokens": 27,
      "prompt_eval_ms": 44.1,
      "prefill_tok_per_s": 611.75,
      "output_tokens": 256,
      "eval_ms": 9189.5,
      "decode_tok_per_s": 27.86,
      "load_ms": 373.7,
      "total_ms": 9759.8,
      "harness_wall_s": 9.762,
      "done_reason": "length"
    },
    {
      "prompt_tokens": 27,
      "prompt_eval_ms": 40.4,
      "prefill_tok_per_s": 668.59,
      "output_tokens": 256,
      "eval_ms": 9115.3,
      "decode_tok_per_s": 28.08,
      "load_ms": 340.5,
      "total_ms": 9635.7,
      "harness_wall_s": 9.638,
      "done_reason": "length"
    },
    {
      "prompt_tokens": 27,
      "prompt_eval_ms": 40.9,
      "prefill_tok_per_s": 660.95,
      "output_tokens": 256,
      "eval_ms": 9123.7,
      "decode_tok_per_s": 28.06,
      "load_ms": 325.8,
      "total_ms": 9626.6,
      "harness_wall_s": 9.629,
      "done_reason": "length"
    }
  ],
  "warmup": {
    "prompt_tokens": 27,
    "prompt_eval_ms": 139.6,
    "prefill_tok_per_s": 193.44,
    "output_tokens": 256,
    "eval_ms": 9190.0,
    "decode_tok_per_s": 27.86,
    "load_ms": 13817.9,
    "total_ms": 23488.4,
    "harness_wall_s": 23.491,
    "done_reason": "length"
  },
  "summary": {
    "prefill_tok_per_s": {
      "min": 611.75,
      "median": 660.95,
      "max": 668.59,
      "n": 3
    },
    "decode_tok_per_s": {
      "min": 27.86,
      "median": 28.06,
      "max": 28.08,
      "n": 3
    },
    "total_ms": {
      "min": 9626.6,
      "median": 9635.7,
      "max": 9759.8,
      "n": 3
    }
  }
}
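
The "summary" block above can be recomputed from "runs" as a sanity check. The following is a minimal Python sketch, assuming the summary is a plain min/median/max over the three measurement runs with the warmup excluded, which is what the numbers in this file suggest. The names summarize and recompute_summary are hypothetical and not taken from the harness.

```python
import json
from statistics import median
from typing import Any

SUMMARY_KEYS = ("prefill_tok_per_s", "decode_tok_per_s", "total_ms")


def summarize(values: list[float]) -> dict[str, Any]:
    """min / median / max / n over one metric's measurement runs."""
    return {
        "min": min(values),
        "median": median(values),
        "max": max(values),
        "n": len(values),
    }


def recompute_summary(doc: dict[str, Any]) -> dict[str, Any]:
    """Rebuild the summary from doc["runs"]; the warmup run is ignored."""
    return {
        key: summarize([run[key] for run in doc["runs"]])
        for key in SUMMARY_KEYS
    }


def summary_matches(path: str) -> bool:
    """True when the stored summary equals the one recomputed from runs."""
    with open(path, encoding="utf-8") as fh:
        doc = json.load(fh)
    return recompute_summary(doc) == doc["summary"]
```

Because there are three runs, the median is simply the middle value, so the comparison is exact and needs no floating-point tolerance. Calling summary_matches on this file's path would flag any run that was edited without regenerating the summary.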