Round 5: Live RCON bake-off results (preliminary)
- Expanded dataset from 31 to 182 examples (edge cases, log extraction, bug reports)
- Tested gemma3n:e4b vs qwen3:8b on live Paper 1.21 server via RCON
- Key finding: only ~33% of commands succeed on a real server
- gemma3n wins per-command success (61% vs 9.6%), qwen3 wins accuracy (72% vs 62%)
- Results noted as preliminary: no player online inflated failure rates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -26,3 +26,17 @@
- **Round 2:** qwen3.5:4b + qwen3.5:9b + gemma3n:e4b (400 token budget)
- **Round 3:** qwen3:4b + qwen3:8b + phi4-mini + gemma3n:e4b (400 token budget)
- **Round 4:** qwen3:8b + qwen3:4b + gemma3n:e4b (1500 token budget -- the fix)
- **Round 5:** gemma3n:e4b vs qwen3:8b -- live RCON execution on Paper 1.21 server (136 command_gen examples from expanded 182-example dataset). See README for full results and caveats.

## Round 5: Live RCON Results (preliminary)

| Metric | gemma3n:e4b | qwen3:8b | Winner |
|--------|:-----------:|:--------:|:------:|
| Command match | 62.5% | **72.1%** | qwen3 |
| Syntax correct | 84.6% | **85.3%** | qwen3 |
| RCON success (per example) | 33.1% | **34.6%** | qwen3 |
| RCON cmd success (per cmd) | **61.1%** | 9.6% | gemma3n |
| Empty responses | **12.5%** | 18.4% | gemma3n |
| Avg latency | **13.4s** | 20.7s | gemma3n |

**Note:** RCON success rates are artificially low: no player was online during testing, so all player-targeting commands failed. See README for full caveats.
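The per-example and per-command RCON rows use different denominators, which is one way two success rates for the same run can diverge sharply. The sketch below is illustrative only: the metric definitions, the `CommandResult` type, the helper names, and the sample run are all assumptions, not the benchmark's actual code or data. It also shows one possible way to control for the no-player caveat, by setting aside examples that use the `@a`, `@p`, or `@r` selectors.

```python
import re
from dataclasses import dataclass

# Assumption: these selectors resolve only when a player is online.
PLAYER_SELECTOR = re.compile(r"@[apr]\b")


@dataclass
class CommandResult:
    command: str  # command text sent over RCON
    ok: bool      # True if the server reported success


def targets_player(command: str) -> bool:
    """Return True if the command uses a player selector (@a, @p, @r)."""
    return PLAYER_SELECTOR.search(command) is not None


def per_command_rate(examples):
    """Pooled success rate over every command in every example."""
    results = [r for ex in examples for r in ex]
    return sum(r.ok for r in results) / len(results) if results else 0.0


def per_example_rate(examples):
    """Share of examples whose commands all succeeded; empty examples fail."""
    if not examples:
        return 0.0
    passed = sum(1 for ex in examples if ex and all(r.ok for r in ex))
    return passed / len(examples)


def drop_player_examples(examples):
    """Keep only examples that contain no player-targeting command."""
    return [ex for ex in examples
            if not any(targets_player(r.command) for r in ex)]


# Hypothetical run: one example emits many failing commands.
run = [
    [CommandResult("time set day", True)],
    [CommandResult("give @p diamond 1", False)],
    [CommandResult(f"tp @a 0 {y} 0", False) for y in range(60, 70)],
]

print(f"per example: {per_example_rate(run):.3f}")
print(f"per command: {per_command_rate(run):.3f}")
filtered = drop_player_examples(run)
print(f"per example, no player targets: {per_example_rate(filtered):.3f}")
```

In this made-up run a single verbose example with ten failing commands pulls the pooled per-command rate well below the per-example rate. Whether anything similar explains the gap in the table cannot be determined from the numbers shown here.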