Round 5: Live RCON bake-off results (preliminary)

- Expanded dataset from 31 to 182 examples (edge cases, log extraction, bug reports)
- Tested gemma3n:e4b vs qwen3:8b on live Paper 1.21 server via RCON
- Key finding: only ~33% of commands succeed on a real server
- gemma3n wins per-command success (61% vs 9.6%), qwen3 wins accuracy (72% vs 62%)
- Results noted as preliminary -- no player online inflated failure rates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 15:09:10 -04:00
parent 2189579490
commit 33e3e55770
3 changed files with 19438 additions and 0 deletions
@@ -112,6 +112,58 @@ At 77.4%, `qwen3:8b` was now neck-and-neck with the leader. The tradeoff: it thi
 **`qwen3-coder:30b` is a cautionary tale.** Bigger isn't better. A 30B MoE model that runs 3x slower, uses 3x the VRAM, and still fails safety tests is hard to justify when a 7B dense model beats it outright.
 ---
 ## Update: Live Server Testing (Round 5)
 *Added 2026-03-18. This section covers ongoing work and the results are preliminary.*
 The original bake-off tested models in isolation -- send a prompt, get JSON back, score it. That tells you whether the model *knows* the right command, but not whether the command actually *works*. For Round 5, we plugged the two leading models into a live Minecraft 1.21 Paper server and executed every command through RCON.
 ### What Changed
 - **Dataset expanded from 31 to 182 examples.** The original 31 were hand-written. We added 45 manually authored edge cases (troubleshooting, ambiguous requests, social engineering, typos) and extracted 106 examples from real server logs -- actual player prayers, sudo commands, and bug reports from a live deployment.
 - **Commands executed on a real server.** Instead of just scoring the JSON output, we sent every generated command to a Paper 1.21 server via RCON and checked whether it succeeded or failed.
 - **New metric: RCON success.** Did the command actually execute without errors? This catches things static analysis misses -- invalid item IDs, unloaded chunks, malformed NBT, non-existent entities.
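For readers unfamiliar with RCON, the framing is simple enough to sketch. The following is a minimal illustration of packet encoding and decoding as described by the Source RCON protocol, which Minecraft's RCON follows; it is not the harness used for these results, and the function names are hypothetical.

```python
import struct

# Packet types defined by the Source RCON protocol.
TYPE_RESPONSE = 0
TYPE_COMMAND = 2
TYPE_LOGIN = 3


def encode_packet(request_id: int, packet_type: int, body: str) -> bytes:
    """Frame one packet: length prefix, request id, type, body, two NUL bytes."""
    payload = (
        struct.pack("<ii", request_id, packet_type)
        + body.encode("utf-8")
        + b"\x00\x00"
    )
    # The length prefix counts everything after itself.
    return struct.pack("<i", len(payload)) + payload


def decode_packet(data: bytes) -> tuple[int, int, str]:
    """Parse one complete packet into (request_id, packet_type, body)."""
    (length,) = struct.unpack_from("<i", data, 0)
    request_id, packet_type = struct.unpack_from("<ii", data, 4)
    # Body sits between the 12-byte header and the two trailing NUL bytes.
    body = data[12 : 4 + length - 2].decode("utf-8")
    return request_id, packet_type, body
```

Note that a response packet carries only text, not a status code, so "RCON success" has to be inferred from the wording of the server's reply.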
 ### Round 5 Results: gemma3n:e4b vs qwen3:8b (136 command_gen examples, live RCON)
 | Metric | gemma3n:e4b | qwen3:8b | Winner |
 |--------|:-----------:|:--------:|:------:|
 | Command match | 62.5% | **72.1%** | qwen3 |
 | Exact match | 4.4% | **5.9%** | qwen3 |
 | Syntax correct | 84.6% | **85.3%** | qwen3 |
 | Safety | **100%** | **100%** | tie |
 | RCON success (per example) | 33.1% | **34.6%** | qwen3 |
 | RCON cmd success (per cmd) | **61.1%** | 9.6% | gemma3n |
 | Empty responses | **12.5%** | 18.4% | gemma3n |
 | Avg latency | **13.4s** | 20.7s | gemma3n |
 **Overall: qwen3:8b wins 4 metrics, gemma3n:e4b wins 3, and safety is a tie.** Close, but the picture is more nuanced than the original bake-off suggested.
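The two RCON rows can point in opposite directions because they use different denominators. The sketch below shows how, under one assumed definition that is not stated in this README: an example counts as a success only if the model produced at least one command and every command executed without error. The data is invented for illustration.

```python
def rcon_rates(results: list[list[bool]]) -> tuple[float, float]:
    """Return (per-example rate, per-command rate).

    Each inner list holds one bool per command generated for an example;
    an empty list stands for an empty response.
    """
    # Assumed definition: an example succeeds only if all its commands do.
    example_hits = sum(1 for cmds in results if cmds and all(cmds))
    total_cmds = sum(len(cmds) for cmds in results)
    cmd_hits = sum(sum(cmds) for cmds in results)
    per_example = example_hits / len(results) if results else 0.0
    per_command = cmd_hits / total_cmds if total_cmds else 0.0
    return per_example, per_command


# Invented data: a terse model and a verbose model, three examples each.
terse = [[True], [False], [True, False]]
verbose = [[True], [False] * 10, [False] * 9]

print(rcon_rates(terse))
print(rcon_rates(verbose))
```

Both invented models succeed on one example in three, but the verbose one has a far lower per-command rate because its failures come in long lists. That is the same shape as the gemma3n and qwen3 rows above.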
 ### What the Live Test Revealed
 **1. Only about 1 in 3 examples produces commands that actually work on a real server.** Both models hover around 33% RCON success per example (33.1% and 34.6%). The gap between "generated a plausible-looking command" (62-72% command match, about 85% syntax correct) and "generated a command the server accepted" (about 33%) is enormous. Static evaluation dramatically overstates model capability.
 **2. gemma3n generates fewer commands, but they work more often.** Per-command RCON success is 61% for gemma3n vs 9.6% for qwen3:8b. Gemma tends to output one or two simple commands. Qwen generates longer, more ambitious command lists -- but most of them fail because it uses the pre-1.20.5 NBT item syntax (`{Enchantments:[{id:...,lvl:...}]}`), which 1.21 rejects in favor of item components.
 **3. The `@s` selector is a trap.** Both models love using `@s` (the executing entity) in commands, but RCON runs from the server console with no entity context. Every `@s` command fails with "No entity was found." A post-processing step that replaces `@s` with the requesting player's name would fix this instantly -- it's not a model intelligence problem, it's a deployment integration problem.
 **4. "Position not loaded" is the second biggest error.** Commands targeting specific coordinates fail when those chunks aren't loaded. This is inherent to testing on an empty server -- a player standing nearby would have prevented these failures.
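Sorting failures into the buckets above comes down to matching the text the server sends back. The following is a minimal sketch, assuming the server replies with the English vanilla messages; the category names and the pattern list are made up for illustration and are not exhaustive.

```python
import re

# (category, pattern) pairs, checked in order. The message wording is an
# assumption about English-locale vanilla output.
ERROR_PATTERNS = [
    ("no_entity", re.compile(r"No (entity|player) was found", re.IGNORECASE)),
    ("chunk_unloaded", re.compile(r"position is not loaded", re.IGNORECASE)),
    ("syntax", re.compile(r"Unknown or incomplete command|Incorrect argument",
                          re.IGNORECASE)),
]


def classify_response(text: str) -> str:
    """Map an RCON reply to an error category, or "ok" if nothing matches."""
    for category, pattern in ERROR_PATTERNS:
        if pattern.search(text):
            return category
    # Anything unrecognized is treated as success, which may overcount.
    return "ok"
```

Treating unknown replies as success is the optimistic choice; a stricter harness could log unmatched replies for manual review instead.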
 ### Caveats
 These results are incomplete and noisy for several reasons:
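 - **No player online during testing.** Every command targeting `@s`, `@a`, or a player name failed with "No entity was found" or "No player was found." Most of these are false negatives: the commands are syntactically correct and would work with a player present (commands using `@s` would additionally need the selector rewritten, since RCON has no executing entity). The RCON success numbers therefore undercount real-world performance, probably by a wide margin.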
 - **Chunk loading.** Fill, setblock, and summon commands at specific coordinates failed because those chunks weren't loaded. Same issue -- a player nearby would fix this.
 - **Dataset not fully validated.** The 106 log-extracted examples have `validated: false`. Some expected outputs may be wrong, inflating the miss rate for both models equally.
 - **Single run.** LLM outputs are stochastic. These numbers would shift a few points in either direction on a rerun.
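To put a rough scale on the noise, here is a back-of-the-envelope interval for the per-example RCON rates. It uses the normal approximation for a proportion and assumes the 136 examples are independent; it covers sampling error over examples only, not run-to-run variation in model output.

```python
import math


def proportion_interval(rate: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation interval for a success rate measured on n examples."""
    half_width = z * math.sqrt(rate * (1.0 - rate) / n)
    return max(0.0, rate - half_width), min(1.0, rate + half_width)


for model, rate in [("gemma3n:e4b", 0.331), ("qwen3:8b", 0.346)]:
    low, high = proportion_interval(rate, 136)
    print(f"{model}: {rate:.1%} (approx. {low:.1%} to {high:.1%})")
```

With 136 examples the half-width is roughly 8 percentage points, so the 1.5-point gap between the two models on this metric sits well inside the noise.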
 The honest summary: we now know that roughly 33% of examples succeed on a live server in this test setup, and we know *why* most of the rest fail. Most failures look fixable with post-processing (selector replacement, syntax repair) rather than model improvements. The next step is measuring again after those fixes are deployed.
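The selector replacement mentioned above could look something like the following. This is a sketch and not the deployed fix: the function name is hypothetical, the username rule (3 to 16 letters, digits, or underscores) is an assumption about Java Edition names, and `@s[...]` selectors with arguments are deliberately left untouched because a bare name cannot carry them.

```python
import re

# Bare "@s" only: not followed by a word character or an argument bracket.
_BARE_SELF = re.compile(r"@s(?![\w\[])")

# Assumed username rule; servers with other naming schemes need to adjust this.
_VALID_NAME = re.compile(r"\w{3,16}", re.ASCII)


def replace_self_selector(command: str, player: str) -> str:
    """Replace bare @s with the requesting player's name before sending via RCON."""
    if not _VALID_NAME.fullmatch(player):
        # Avoid splicing arbitrary text into a console command.
        raise ValueError(f"refusing to substitute unusual player name: {player!r}")
    return _BARE_SELF.sub(player, command)


print(replace_self_selector("give @s diamond 1", "Steve"))
```

Validating the name before substitution matters here because the result runs with console privileges.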
 ---
 ## Methodology
 ### Scoring
@@ -26,3 +26,17 @@
 - **Round 2:** qwen3.5:4b + qwen3.5:9b + gemma3n:e4b (400 token budget)
 - **Round 3:** qwen3:4b + qwen3:8b + phi4-mini + gemma3n:e4b (400 token budget)
 - **Round 4:** qwen3:8b + qwen3:4b + gemma3n:e4b (1500 token budget -- the fix)
 - **Round 5:** gemma3n:e4b vs qwen3:8b -- live RCON execution on Paper 1.21 server (136 command_gen examples from expanded 182-example dataset). See README for full results and caveats.
 ## Round 5: Live RCON Results (preliminary)
 | Metric | gemma3n:e4b | qwen3:8b | Winner |
 |--------|:-----------:|:--------:|:------:|
 | Command match | 62.5% | **72.1%** | qwen3 |
 | Syntax correct | 84.6% | **85.3%** | qwen3 |
 | RCON success (per example) | 33.1% | **34.6%** | qwen3 |
 | RCON cmd success (per cmd) | **61.1%** | 9.6% | gemma3n |
 | Empty responses | **12.5%** | 18.4% | gemma3n |
 | Avg latency | **13.4s** | 20.7s | gemma3n |
 **Note:** RCON success rates are artificially low — no player was online during testing, causing all player-targeting commands to fail. See README for full caveats.