Round 5: Live RCON bake-off results (preliminary)

- Expanded dataset from 31 to 182 examples (edge cases, log extraction, bug reports)
- Tested gemma3n:e4b vs qwen3:8b on live Paper 1.21 server via RCON
- Key finding: only ~33% of commands succeed on a real server
- gemma3n wins per-command success (61% vs 9.6%), qwen3 wins accuracy (72% vs 62%)
- Results noted as preliminary — no player online inflated failure rates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 15:09:10 -04:00
parent 2189579490
commit 33e3e55770
3 changed files with 19438 additions and 0 deletions
@@ -112,6 +112,58 @@ At 77.4%, `qwen3:8b` was now neck-and-neck with the leader. The tradeoff: it thi
**`qwen3-coder:30b` is a cautionary tale.** Bigger isn't better. A 30B MoE model that runs 3x slower, uses 3x the VRAM, and still fails safety tests is hard to justify when a 7B dense model beats it outright.
---
## Update: Live Server Testing (Round 5)
*Added 2026-03-18. This section covers ongoing work and the results are preliminary.*
The original bake-off tested models in isolation -- send a prompt, get JSON back, score it. That tells you whether the model *knows* the right command, but not whether the command actually *works*. For Round 5, we plugged the two leading models into a live Minecraft 1.21 Paper server and executed every command through RCON.
### What Changed
- **Dataset expanded from 31 to 182 examples.** The original 31 were hand-written. We added 45 manually authored edge cases (troubleshooting, ambiguous requests, social engineering, typos) and extracted 106 examples from real server logs -- actual player prayers, sudo commands, and bug reports from a live deployment.
- **Commands executed on a real server.** Instead of just scoring the JSON output, we sent every generated command to a Paper 1.21 server via RCON and checked whether it succeeded or failed.
- **New metric: RCON success.** Did the command actually execute without errors? This catches things static analysis misses -- invalid item IDs, unloaded chunks, malformed NBT, non-existent entities.
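The RCON success check described above can be sketched roughly as follows. This is a minimal illustration, not the harness used for these results: `send` is a hypothetical callable standing in for whatever RCON client is in use, and the error fragments are assumptions based on the messages quoted in this README plus common vanilla error text.

```python
from typing import Callable, List

# Assumed error fragments (lowercase). Adjust to whatever the server actually returns.
ERROR_MARKERS = (
    "no entity was found",
    "no player was found",
    "not loaded",
    "unknown or incomplete command",
    "incorrect argument",
)


def rcon_succeeded(response: str) -> bool:
    """Treat a reply as a success unless it contains a known error fragment."""
    lowered = response.lower()
    return not any(marker in lowered for marker in ERROR_MARKERS)


def run_commands(commands: List[str], send: Callable[[str], str]) -> List[bool]:
    """Send each generated command and record one success flag per command.

    `send` is a hypothetical function that takes a command string (without
    the leading slash) and returns the server's text reply.
    """
    results = []
    for command in commands:
        try:
            reply = send(command.lstrip("/"))
        except OSError:
            # A dropped connection counts as a failure for that command.
            results.append(False)
            continue
        results.append(rcon_succeeded(reply))
    return results
```

Matching on reply text is brittle, since the server does not return a status code over RCON, so the marker list needs to be checked against real replies before the numbers are trusted.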
### Round 5 Results: gemma3n:e4b vs qwen3:8b (136 command_gen examples, live RCON)
| Metric | gemma3n:e4b | qwen3:8b | Winner |
|--------|:-----------:|:--------:|:------:|
| Command match | 62.5% | **72.1%** | qwen3 |
| Exact match | 4.4% | **5.9%** | qwen3 |
| Syntax correct | 84.6% | **85.3%** | qwen3 |
| Safety | **100%** | **100%** | tie |
| RCON success (per example) | 33.1% | **34.6%** | qwen3 |
| RCON cmd success (per cmd) | **61.1%** | 9.6% | gemma3n |
| Empty responses | **12.5%** | 18.4% | gemma3n |
| Avg latency | **13.4s** | 20.7s | gemma3n |
**Overall: qwen3:8b wins 4 metrics, gemma3n:e4b wins 3, and safety is a tie.** Close, but the picture is more nuanced than the original bake-off suggested.
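The two RCON rows in the table measure different things, and a small sketch shows how they can diverge. The definition of per-example success used here (at least one command in the example succeeded) is an assumption, since the README does not spell it out, and the sample data is made up for illustration.

```python
from typing import Sequence, Tuple


def rcon_rates(results: Sequence[Sequence[bool]]) -> Tuple[float, float]:
    """Return (per_example_rate, per_command_rate).

    results[i] holds one success flag per command generated for example i.
    An empty inner list represents an empty model response.
    Assumption: an example succeeds if at least one of its commands succeeds.
    """
    n_examples = len(results)
    n_commands = sum(len(flags) for flags in results)
    examples_ok = sum(1 for flags in results if any(flags))
    commands_ok = sum(sum(flags) for flags in results)
    per_example = examples_ok / n_examples if n_examples else 0.0
    per_command = commands_ok / n_commands if n_commands else 0.0
    return per_example, per_command


# Illustrative data only: a terse model and a verbose model.
terse = [[True], [False], [], [True]]
verbose = [[True, False, False, False], [False, False, False], [True, False, False]]

print(rcon_rates(terse))
print(rcon_rates(verbose))
```

A model that emits many commands per example can hold a similar per-example rate while its per-command rate drops sharply, which matches the pattern in the table.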
### What the Live Test Revealed
**1. Only about 1 in 3 examples produces a command that works on a real server.** Both models hover around 33% RCON success per example. The gap between "generated a plausible-looking command" (62-72% command match, roughly 85% syntactically correct) and "generated a command the server accepted" (33-35% of examples) is enormous. Static evaluation dramatically overstates model capability.
**2. gemma3n generates fewer commands, but they work more often.** Per-command RCON success is 61% for gemma3n vs 9.6% for qwen3:8b. Gemma tends to output one or two simple commands. Qwen generates longer, more ambitious command lists -- but most of them fail because it uses old NBT syntax (`{Enchantments:[{id:...,lvl:...}]}`) that 1.21 rejects.
**3. The `@s` selector is a trap.** Both models love using `@s` (the executing entity) in commands, but RCON runs from the server console with no entity context. Every `@s` command fails with "No entity was found." A post-processing step that replaces `@s` with the requesting player's name would fix this instantly -- it's not a model intelligence problem, it's a deployment integration problem.
**4. "Position not loaded" is the second biggest error.** Commands targeting specific coordinates fail when those chunks aren't loaded. This is inherent to testing on an empty server -- a player standing nearby would have prevented these failures.
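The selector replacement suggested in finding 3 could look something like the sketch below. It is a deliberately conservative assumption about how such a post-processing step might work, not deployed code: it only rewrites a bare `@s` and leaves `@s[...]` with selector arguments untouched, because those cannot be replaced by a plain name. The function name and the player-name pattern are hypothetical.

```python
import re

# Assumed shape of a valid player name; rejects anything that could inject syntax.
_PLAYER_NAME = re.compile(r"[A-Za-z0-9_]{1,16}")

# Bare @s only: not followed by '[' (selector arguments) or a word character.
_BARE_SELF = re.compile(r"@s(?![A-Za-z0-9_\[])")


def replace_self_selector(command: str, player: str) -> str:
    """Replace bare @s with the requesting player's name.

    RCON runs from the console, so @s has no entity to resolve to.
    """
    if not _PLAYER_NAME.fullmatch(player):
        raise ValueError(f"refusing to substitute unsafe player name: {player!r}")
    return _BARE_SELF.sub(player, command)


print(replace_self_selector("give @s minecraft:diamond 1", "Steve"))
print(replace_self_selector("tp @s[distance=..5] 0 64 0", "Steve"))
```

Validating the name before substitution matters here, since the result is sent to a console with operator privileges.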
### Caveats
These results are incomplete and noisy for several reasons:
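- **No player online during testing.** Every command targeting `@s`, `@a`, or a player name failed with "No entity/player found." For `@a` and player names these are false negatives -- the commands are syntactically correct and would work with a player present. `@s` is different: it fails from RCON regardless of who is online, so it needs the selector rewrite described above. Either way, the RCON success numbers undercount what a deployment with players online and that rewrite in place would see.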
- **Chunk loading.** Fill, setblock, and summon commands at specific coordinates failed because those chunks weren't loaded. Same issue -- a player nearby would fix this.
- **Dataset not fully validated.** The 106 log-extracted examples have `validated: false`. Some expected outputs may be wrong, inflating the miss rate for both models equally.
- **Single run.** LLM outputs are stochastic. These numbers would shift a few points in either direction on a rerun.
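To put the single-run caveat in perspective, the sampling error alone at 136 examples can be estimated with a normal-approximation interval. The success counts below (45 and 47 of 136) are back-computed from the reported 33.1% and 34.6%, so treat them as inferred rather than logged values; this also says nothing about run-to-run variance from the models themselves.

```python
import math
from typing import Tuple


def proportion_interval(successes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Return (rate, half_width) for an approximate 95% interval on a proportion."""
    rate = successes / total
    half_width = z * math.sqrt(rate * (1 - rate) / total)
    return rate, half_width


# Counts inferred from the reported percentages, not taken from raw logs.
for name, ok in (("gemma3n:e4b", 45), ("qwen3:8b", 47)):
    rate, half = proportion_interval(ok, 136)
    print(f"{name}: {rate:.1%} +/- {half:.1%}")
```

The half-width comes out at roughly 8 percentage points for both models, so the 1.5-point gap in per-example RCON success is well inside the noise.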
The honest summary: we now know that roughly a third of examples succeed on a live server, and we know the main reasons the other two thirds fail. Most failures look fixable with post-processing (selector replacement, syntax repair) rather than model improvements. The next step is measuring again after those fixes are deployed.
---
## Methodology
### Scoring
File diff suppressed because one or more lines are too long
@@ -26,3 +26,17 @@
- **Round 2:** qwen3.5:4b + qwen3.5:9b + gemma3n:e4b (400 token budget)
- **Round 3:** qwen3:4b + qwen3:8b + phi4-mini + gemma3n:e4b (400 token budget)
- **Round 4:** qwen3:8b + qwen3:4b + gemma3n:e4b (1500 token budget -- the fix)
- **Round 5:** gemma3n:e4b vs qwen3:8b -- live RCON execution on Paper 1.21 server (136 command_gen examples from expanded 182-example dataset). See README for full results and caveats.
## Round 5: Live RCON Results (preliminary)
| Metric | gemma3n:e4b | qwen3:8b | Winner |
|--------|:-----------:|:--------:|:------:|
| Command match | 62.5% | **72.1%** | qwen3 |
| Syntax correct | 84.6% | **85.3%** | qwen3 |
| RCON success (per example) | 33.1% | **34.6%** | qwen3 |
| RCON cmd success (per cmd) | **61.1%** | 9.6% | gemma3n |
| Empty responses | **12.5%** | 18.4% | gemma3n |
| Avg latency | **13.4s** | 20.7s | gemma3n |
**Note:** RCON success rates are artificially low -- no player was online during testing, so every player-targeting command failed. See README for full caveats.