Seth 33e3e55770 Round 5: Live RCON bake-off results (preliminary)

- Expanded dataset from 31 to 182 examples (edge cases, log extraction, bug reports)
- Tested gemma3n:e4b vs qwen3:8b on live Paper 1.21 server via RCON
- Key finding: only ~33% of commands succeed on a real server
- gemma3n wins per-command success (61% vs 9.6%), qwen3 wins accuracy (72% vs 62%)
- Results noted as preliminary — no player online inflated failure rates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-18 15:09:10 -04:00

Results Summary

Final Standings (all rounds combined)

Rank	Model	Params	Cmd Match	Exact Match	Syntax OK	Safety	No Gratuitous Actions	Avg Latency	Avg Tokens
1	gemma3n:e4b	6.9B	80.6%	19.4%	77.4%	100%	100%	5.9s	98
2	qwen3:8b (1500 tok)	8B	77.4%	12.9%	64.5%	96.8%	100%	16.0s	212
3	qwen3-coder:30b	30B MoE	67.7%	16.1%	71.0%	93.5%	96.8%	14.7s	163
4	phi4-mini	3.8B	61.3%	9.7%	80.6%	93.5%	100%	4.5s	59
5	qwen3:8b (400 tok)	8B	41.9%	19.4%	87.1%	100%	96.8%	8.7s	297
6	qwen3.5:9b	9B	29.0%	22.6%	96.8%	96.8%	100%	22.6s	271
7	qwen3.5:4b	4B	19.4%	19.4%	100%	100%	100%	7.7s	377
8	qwen3:4b	4B	16.1%	16.1%	100%	100%	100%	5.7s	400

Key Observations

Size doesn't determine quality. The 6.9B model beat the 30B model on every metric.
Token budget matters for thinking models. Raising num_predict from 400 to 1500 lifted qwen3:8b's Cmd Match from 41.9% to 77.4%, although its Syntax OK fell from 87.1% to 64.5% at the larger budget.
Safety is hard. Three models (qwen3-coder:30b, phi4-mini, qwen3.5:9b) executed dangerous commands when asked politely, and qwen3:8b at the 1500-token budget also scored below 100% on Safety (96.8%).
The 4B Qwen models (qwen3.5:4b, qwen3:4b) are too small. Their perfect Syntax OK and Safety scores are misleading -- they score high by producing empty responses, and their Cmd Match stays below 20%.
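
The token-budget observation can be illustrated with a minimal sketch of how num_predict is passed to an Ollama server. It assumes the models were served through Ollama's `/api/generate` endpoint on the default local port; the prompt string and the helper names `build_request` and `generate` are hypothetical and are not taken from the benchmark harness.

```python
import json
import urllib.request

# Assumption: a local Ollama server on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str, num_predict: int) -> dict:
    """Build a non-streaming generate payload with a capped output length."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # num_predict limits how many tokens the model may generate.
        # Thinking models spend part of this budget on reasoning text.
        "options": {"num_predict": num_predict},
    }


def generate(payload: dict, url: str = OLLAMA_URL, timeout: float = 120.0) -> str:
    """POST the payload to a running Ollama server and return the reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Hypothetical prompt; the two payloads differ only in the token budget.
small = build_request("qwen3:8b", "give me a diamond sword", 400)
large = build_request("qwen3:8b", "give me a diamond sword", 1500)
```

Calling `generate(large)` requires a running server with the model pulled. If the budget runs out while the model is still reasoning, the reply may contain no command at all, which would be consistent with the gap between the 400-token and 1500-token rows.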

Round Details

Round 1: gemma3n:e4b vs qwen3-coder:30b (400 token budget)
Round 2: qwen3.5:4b + qwen3.5:9b + gemma3n:e4b (400 token budget)
Round 3: qwen3:4b + qwen3:8b + phi4-mini + gemma3n:e4b (400 token budget)
Round 4: qwen3:8b + qwen3:4b + gemma3n:e4b (1500 token budget -- the fix)
Round 5: gemma3n:e4b vs qwen3:8b -- live RCON execution on Paper 1.21 server (136 command_gen examples from expanded 182-example dataset). See README for full results and caveats.
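
For readers unfamiliar with RCON, here is a rough sketch of the packet framing used by the Source RCON protocol, which Minecraft servers also speak. It is not the harness used in Round 5; the function names are hypothetical, and UTF-8 body encoding is an assumption.

```python
import struct

# Packet types defined by the Source RCON protocol.
SERVERDATA_AUTH = 3          # login with the RCON password
SERVERDATA_EXECCOMMAND = 2   # run a console command


def encode_packet(request_id: int, packet_type: int, body: str) -> bytes:
    """Frame one RCON packet: length, request id, type, body, two null bytes."""
    payload = (
        struct.pack("<ii", request_id, packet_type)  # little-endian int32 pair
        + body.encode("utf-8")                       # assumption: UTF-8 body
        + b"\x00\x00"                                # body terminator + empty string
    )
    # The length prefix counts everything after itself.
    return struct.pack("<i", len(payload)) + payload


def decode_packet(data: bytes) -> tuple[int, int, str]:
    """Parse one complete RCON packet into (request_id, type, body)."""
    (length,) = struct.unpack_from("<i", data, 0)
    request_id, packet_type = struct.unpack_from("<ii", data, 4)
    body = data[12 : 4 + length - 2].decode("utf-8")
    return request_id, packet_type, body


packet = encode_packet(1, SERVERDATA_EXECCOMMAND, "list")
```

A client would first send a `SERVERDATA_AUTH` packet over a TCP socket and then one `SERVERDATA_EXECCOMMAND` packet per generated command, reading the server's text reply to decide whether the command succeeded.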

Round 5: Live RCON Results (preliminary)

Metric	gemma3n:e4b	qwen3:8b	Winner
Command match	62.5%	72.1%	qwen3
Syntax correct	84.6%	85.3%	qwen3
RCON success (per example)	33.1%	34.6%	qwen3
RCON cmd success (per cmd)	61.1%	9.6%	gemma3n
Empty responses	12.5%	18.4%	gemma3n
Avg latency	13.4s	20.7s	gemma3n

Note: RCON success rates are artificially low — no player was online during testing, causing all player-targeting commands to fail. See README for full caveats.
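
The per-example and per-command rows can diverge sharply, as they do for qwen3:8b. The sketch below shows one way the two rates might be computed. It assumes an example counts as successful only when it produced at least one command and every command succeeded; the actual scoring rule is not stated here. The data is a toy illustration, not benchmark output.

```python
from dataclasses import dataclass


@dataclass
class ExampleResult:
    """RCON outcomes for one dataset example: one bool per command sent."""
    command_ok: list[bool]


def per_command_rate(results: list[ExampleResult]) -> float:
    """Fraction of all commands sent that the server accepted."""
    total = sum(len(r.command_ok) for r in results)
    ok = sum(sum(r.command_ok) for r in results)
    return ok / total if total else 0.0


def per_example_rate(results: list[ExampleResult]) -> float:
    """Fraction of examples whose commands all succeeded (assumed rule).

    Empty responses (no commands) are counted as failures here.
    """
    if not results:
        return 0.0
    passed = sum(1 for r in results if r.command_ok and all(r.command_ok))
    return passed / len(results)


# Toy data: two clean single-command examples, one example that emitted
# eight failing commands, and one empty response.
results = [
    ExampleResult([True]),
    ExampleResult([True]),
    ExampleResult([False] * 8),
    ExampleResult([]),
]

example_rate = per_example_rate(results)   # 2 of 4 examples pass
command_rate = per_command_rate(results)   # 2 of 10 commands pass
```

Under this rule, a model that occasionally emits long runs of failing commands can keep a reasonable per-example rate while its per-command rate collapses, which is one possible reading of the qwen3:8b numbers.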
