Seth 33e3e55770 Round 5: Live RCON bake-off results (preliminary)
- Expanded dataset from 31 to 182 examples (edge cases, log extraction, bug reports)
- Tested gemma3n:e4b vs qwen3:8b on live Paper 1.21 server via RCON
- Key finding: only ~33% of commands succeed on a real server
- gemma3n wins per-command success (61% vs 9.6%), qwen3 wins accuracy (72% vs 62%)
- Results noted as preliminary — no player online inflated failure rates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 15:09:10 -04:00


Results Summary

Final Standings (all rounds combined)

| Rank | Model | Params | Cmd Match | Exact Match | Syntax OK | Safety | No Grat. Actions | Avg Latency | Avg Tokens |
|---|---|---|---|---|---|---|---|---|---|
| 1 | gemma3n:e4b | 6.9B | 80.6% | 19.4% | 77.4% | 100% | 100% | 5.9s | 98 |
| 2 | qwen3:8b (1500 tok) | 8B | 77.4% | 12.9% | 64.5% | 96.8% | 100% | 16.0s | 212 |
| 3 | qwen3-coder:30b | 30B MoE | 67.7% | 16.1% | 71.0% | 93.5% | 96.8% | 14.7s | 163 |
| 4 | phi4-mini | 3.8B | 61.3% | 9.7% | 80.6% | 93.5% | 100% | 4.5s | 59 |
| 5 | qwen3:8b (400 tok) | 8B | 41.9% | 19.4% | 87.1% | 100% | 96.8% | 8.7s | 297 |
| 6 | qwen3.5:9b | 9B | 29.0% | 22.6% | 96.8% | 96.8% | 100% | 22.6s | 271 |
| 7 | qwen3.5:4b | 4B | 19.4% | 19.4% | 100% | 100% | 100% | 7.7s | 377 |
| 8 | qwen3:4b | 4B | 16.1% | 16.1% | 100% | 100% | 100% | 5.7s | 400 |

Key Observations

  1. Size doesn't determine quality. gemma3n:e4b (6.9B) beat qwen3-coder:30b (30B MoE) on every quality metric in the standings and was also faster (5.9s vs 14.7s average latency).
  2. Token budget matters for thinking models. qwen3:8b's command match jumped from 41.9% to 77.4% just by increasing num_predict from 400 to 1500.
  3. Safety is hard. Three models (qwen3-coder, phi4-mini, qwen3.5:9b) executed dangerous commands when asked politely. The standings also show qwen3:8b at the 1500-token budget below 100% on safety (96.8%).
  4. The 4B models (qwen3.5:4b, qwen3:4b) are too small. Their perfect syntax and safety scores are misleading: they score high by producing empty responses.
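
Observation 4 can be illustrated with a small scoring sketch. It assumes, hypothetically, that syntax is scored as "every extracted command passes a validity check"; under that rule an empty response has no commands to fail and passes vacuously. The function names, the verb list, and the validity check are illustrative only and are not taken from the benchmark harness.

```python
# Illustrative subset of command verbs; an assumption, not the harness's list.
KNOWN_VERBS = {"give", "tp", "say", "time"}


def is_valid_syntax(command: str) -> bool:
    """Hypothetical stand-in for a real syntax check."""
    parts = command.split()
    return bool(parts) and parts[0].lstrip("/") in KNOWN_VERBS


def syntax_ok_naive(commands: list[str]) -> bool:
    # all() over an empty iterable is True, so an empty response "passes".
    return all(is_valid_syntax(c) for c in commands)


def syntax_ok_strict(commands: list[str]) -> bool:
    # Only credit a response that actually contains at least one command.
    return bool(commands) and syntax_ok_naive(commands)
```

With this sketch, `syntax_ok_naive([])` is true while `syntax_ok_strict([])` is false. Reporting the empty-response rate next to the syntax score, as the Round 5 table does, is another way to keep the inflated score visible.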

Round Details

  • Round 1: gemma3n:e4b vs qwen3-coder:30b (400 token budget)
  • Round 2: qwen3.5:4b + qwen3.5:9b + gemma3n:e4b (400 token budget)
  • Round 3: qwen3:4b + qwen3:8b + phi4-mini + gemma3n:e4b (400 token budget)
  • Round 4: qwen3:8b + qwen3:4b + gemma3n:e4b (1500 token budget -- the fix)
  • Round 5: gemma3n:e4b vs qwen3:8b -- live RCON execution on Paper 1.21 server (136 command_gen examples from expanded 182-example dataset). See README for full results and caveats.
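
The token-budget change between Rounds 3 and 4 comes down to a single generation option. The sketch below assumes the models are served through Ollama's HTTP API at its default local address, which the summary does not state explicitly; the helper names `build_request` and `generate` are hypothetical.

```python
import json
import urllib.request

# Ollama's default local endpoint; an assumption about the test setup.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str, num_predict: int) -> dict:
    """Build a non-streaming generate payload with an explicit token budget."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # num_predict caps the number of generated tokens, thinking included.
        "options": {"num_predict": num_predict},
    }


def generate(model: str, prompt: str, num_predict: int, timeout: float = 120.0) -> str:
    """Send the request and return the generated text (needs a running server)."""
    data = json.dumps(build_request(model, prompt, num_predict)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Under these assumptions, the Round 3 and Round 4 runs for qwen3:8b would differ only in `num_predict` (400 vs 1500).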

Round 5: Live RCON Results (preliminary)

| Metric | gemma3n:e4b | qwen3:8b | Winner |
|---|---|---|---|
| Command match | 62.5% | 72.1% | qwen3 |
| Syntax correct | 84.6% | 85.3% | qwen3 |
| RCON success (per example) | 33.1% | 34.6% | qwen3 |
| RCON cmd success (per cmd) | 61.1% | 9.6% | gemma3n |
| Empty responses | 12.5% | 18.4% | gemma3n |
| Avg latency | 13.4s | 20.7s | gemma3n |

Note: RCON success rates are artificially low because no player was online during testing, so every player-targeting command failed. See README for full caveats.
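
The per-example and per-command RCON rates can diverge sharply, as they do for qwen3:8b. The sketch below shows one way that can happen. It assumes an example counts as a success only when it issued at least one command and all of them succeeded; the summary does not define the metric, so this definition, the `ExampleResult` type, and the toy data are all illustrative and not benchmark output.

```python
from dataclasses import dataclass


@dataclass
class ExampleResult:
    # One flag per command sent over RCON for this example.
    command_ok: list[bool]


def per_example_rate(results: list[ExampleResult]) -> float:
    # Assumed definition: at least one command, and every command succeeded.
    ok = sum(1 for r in results if r.command_ok and all(r.command_ok))
    return ok / len(results)


def per_command_rate(results: list[ExampleResult]) -> float:
    # Pool every command across examples, then take the success fraction.
    flags = [f for r in results for f in r.command_ok]
    return sum(flags) / len(flags) if flags else 0.0


# Made-up data: both models succeed on one of three examples, but model_b
# floods a single example with nine failing commands.
model_a = [ExampleResult([True]), ExampleResult([False]), ExampleResult([])]
model_b = [ExampleResult([True]), ExampleResult([False] * 9), ExampleResult([])]
```

Here both models have the same per-example rate, while the per-command rate is 0.5 for `model_a` and 0.1 for `model_b`. A model that emits many failing commands in a few examples is penalized far more by the pooled metric.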