33e3e55770
- Expanded dataset from 31 to 182 examples (edge cases, log extraction, bug reports)
- Tested gemma3n:e4b vs qwen3:8b on live Paper 1.21 server via RCON
- Key finding: only ~33% of commands succeed on a real server
- gemma3n wins per-command success (61% vs 9.6%), qwen3 wins accuracy (72% vs 62%)
- Results noted as preliminary: no player online inflated failure rates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
43 lines
2.6 KiB
Markdown
# Results Summary

## Final Standings (all rounds combined)

| Rank | Model | Params | Cmd Match | Exact Match | Syntax OK | Safety | No Grat. Actions | Avg Latency | Avg Tokens |
|:----:|-------|-------:|:---------:|:-----------:|:---------:|:------:|:----------------:|------------:|-----------:|
| 1 | **gemma3n:e4b** | 6.9B | **80.6%** | 19.4% | 77.4% | **100%** | **100%** | 5.9s | 98 |
| 2 | qwen3:8b (1500 tok) | 8B | 77.4% | 12.9% | 64.5% | 96.8% | **100%** | 16.0s | 212 |
| 3 | qwen3-coder:30b | 30B MoE | 67.7% | 16.1% | 71.0% | 93.5% | 96.8% | 14.7s | 163 |
| 4 | phi4-mini | 3.8B | 61.3% | 9.7% | 80.6% | 93.5% | **100%** | **4.5s** | 59 |
| 5 | qwen3:8b (400 tok) | 8B | 41.9% | 19.4% | **87.1%** | **100%** | 96.8% | 8.7s | 297 |
| 6 | qwen3.5:9b | 9B | 29.0% | 22.6% | 96.8% | 96.8% | **100%** | 22.6s | 271 |
| 7 | qwen3.5:4b | 4B | 19.4% | 19.4% | **100%** | **100%** | **100%** | 7.7s | 377 |
| 8 | qwen3:4b | 4B | 16.1% | 16.1% | **100%** | **100%** | **100%** | 5.7s | 400 |

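
The gap between the Cmd Match and Exact Match columns is easier to read with a concrete scoring rule in mind. The sketch below is one plausible reading and not the benchmark's actual scorer: it assumes "command match" compares only the command keyword and "exact match" compares the full command string. The function names and the sample pair are hypothetical. For scale, with the original 31 examples, 80.6% corresponds to 25 of 31 and 19.4% to 6 of 31.

```python
def command_name(cmd: str) -> str:
    """Return the command keyword, ignoring a leading slash and case."""
    parts = cmd.strip().lstrip("/").split()
    return parts[0].lower() if parts else ""


def score(expected: str, generated: str) -> dict:
    """Score one example under the ASSUMED metric definitions."""
    exp, gen = expected.strip(), generated.strip()
    return {
        # Assumed: same command keyword, and the response is not empty.
        "cmd_match": bool(gen) and command_name(exp) == command_name(gen),
        # Assumed: identical command text once a leading slash is ignored.
        "exact_match": exp.lstrip("/") == gen.lstrip("/"),
    }


def rate(results: list, key: str) -> float:
    """Percentage of examples for which the given metric is True."""
    return 100 * sum(r[key] for r in results) / len(results)


# Hypothetical pair: right command, different item spelling.
near_miss = score("give @p diamond 1", "/give @p minecraft:diamond 1")
empty = score("give @p diamond 1", "")
print(near_miss, empty)
```

Under these assumptions a model can pick the right command most of the time while rarely reproducing the reference string, which is the pattern in the top rows of the table.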
## Key Observations

1. **Size doesn't determine quality.** gemma3n:e4b (6.9B) beat qwen3-coder:30b (30B MoE) on every metric in the standings table.
2. **Token budget matters for thinking models.** qwen3:8b's command match jumped from 41.9% to 77.4% just by increasing num_predict from 400 to 1500.
3. **Safety is hard.** Three models (qwen3-coder, phi4-mini, qwen3.5:9b) executed dangerous commands when asked politely. qwen3:8b at the 1500-token budget also scored below 100% on Safety (96.8%).
4. **The 4B Qwen models are too small.** Their perfect syntax and safety scores are misleading -- qwen3.5:4b and qwen3:4b score high by producing empty responses.

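
The token-budget effect in observations 2 and 4 can be sketched as follows. This is a minimal illustration under two assumptions that the results file does not state outright: that the models are served through Ollama (whose generate endpoint accepts `options.num_predict`), and that the thinking models wrap their reasoning in `<think>` tags. The helper names and the prompt are hypothetical.

```python
import json
import re


def build_generate_request(model: str, prompt: str, num_predict: int) -> dict:
    """Build a JSON body for an Ollama-style /api/generate call (assumed backend)."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": num_predict},  # cap on generated tokens
    }


def visible_answer(raw: str) -> str:
    """Return what is left after removing reasoning blocks (assumed <think> format)."""
    # Drop completed reasoning blocks.
    text = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    # An unterminated block means the budget ran out mid-reasoning.
    text = re.sub(r"<think>.*", "", text, flags=re.DOTALL)
    return text.strip()


small = build_generate_request("qwen3:8b", "Give the player a diamond sword.", 400)
large = build_generate_request("qwen3:8b", "Give the player a diamond sword.", 1500)
print(json.dumps(large, indent=2))

# Hypothetical raw outputs: one cut off while thinking, one that finished.
print(repr(visible_answer("<think>The player wants a sword, so")))
print(repr(visible_answer("<think>ok</think>\n/give @p diamond_sword")))
```

If reasoning consumes the whole budget, the visible answer is empty. An empty answer cannot contain a syntax error or an unsafe command, which would explain how a model can score 100% on those columns while matching few commands.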
## Round Details

- **Round 1:** gemma3n:e4b vs qwen3-coder:30b (400 token budget)
- **Round 2:** qwen3.5:4b + qwen3.5:9b + gemma3n:e4b (400 token budget)
- **Round 3:** qwen3:4b + qwen3:8b + phi4-mini + gemma3n:e4b (400 token budget)
- **Round 4:** qwen3:8b + qwen3:4b + gemma3n:e4b (1500 token budget -- the fix)
- **Round 5:** gemma3n:e4b vs qwen3:8b -- live RCON execution on Paper 1.21 server (136 command_gen examples from expanded 182-example dataset). See README for full results and caveats.

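
For readers unfamiliar with RCON, the sketch below shows how a command is framed on the wire in the Source RCON protocol, which Minecraft servers also use. It only encodes and decodes packets and does not open a connection. It is not the harness used in Round 5, and the function names are hypothetical.

```python
import struct

SERVERDATA_AUTH = 3         # login packet carrying the RCON password
SERVERDATA_EXECCOMMAND = 2  # packet carrying a console command


def encode_packet(request_id: int, packet_type: int, body: str) -> bytes:
    """Frame one RCON packet: size, id, type, body, two null bytes."""
    payload = (
        struct.pack("<ii", request_id, packet_type)  # little-endian int32 pair
        + body.encode("utf-8")
        + b"\x00\x00"  # body terminator plus empty trailing string
    )
    # The size field counts everything after itself.
    return struct.pack("<i", len(payload)) + payload


def decode_packet(data: bytes) -> tuple:
    """Parse one complete RCON packet into (request_id, packet_type, body)."""
    (size,) = struct.unpack_from("<i", data, 0)
    request_id, packet_type = struct.unpack_from("<ii", data, 4)
    body = data[12:4 + size - 2].decode("utf-8")  # strip the two null bytes
    return request_id, packet_type, body


packet = encode_packet(1, SERVERDATA_EXECCOMMAND, "list")
print(packet.hex(), decode_packet(packet))
```

A live run would send an auth packet first and then one exec packet per generated command, reading the server's reply to decide whether the command succeeded.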
## Round 5: Live RCON Results (preliminary)

| Metric | gemma3n:e4b | qwen3:8b | Winner |
|--------|:-----------:|:--------:|:------:|
| Command match | 62.5% | **72.1%** | qwen3 |
| Syntax correct | 84.6% | **85.3%** | qwen3 |
| RCON success (per example) | 33.1% | **34.6%** | qwen3 |
| RCON cmd success (per cmd) | **61.1%** | 9.6% | gemma3n |
| Empty responses | **12.5%** | 18.4% | gemma3n |
| Avg latency | **13.4s** | 20.7s | gemma3n |

**Note:** RCON success rates are artificially low because no player was online during testing, so all player-targeting commands failed. See README for full caveats.

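
The two RCON rows can diverge sharply because they use different denominators. The sketch below uses made-up data and an assumed definition of per-example success (at least one command was sent and all of them succeeded); the README may define it differently. It shows how a model that emits many commands per example can match another model per example while scoring far lower per command.

```python
def rcon_rates(examples: list) -> dict:
    """examples: one list per example, holding one boolean per executed command."""
    total_cmds = sum(len(cmds) for cmds in examples)
    ok_cmds = sum(sum(cmds) for cmds in examples)
    # ASSUMED rule: an example passes only if it sent commands and none failed.
    ok_examples = sum(1 for cmds in examples if cmds and all(cmds))
    return {
        "per_example": 100 * ok_examples / len(examples),
        "per_command": 100 * ok_cmds / total_cmds if total_cmds else 0.0,
    }


# Hypothetical runs: four examples each; an empty list is an empty response.
concise = [[True], [True], [False], []]
verbose = [[True], [True], [False] * 10, [False] * 8]

print(rcon_rates(concise))
print(rcon_rates(verbose))
```

In this toy data both runs pass 2 of 4 examples, but the verbose run spreads 18 failures across its two failing examples, so its per-command rate falls to 10% against roughly 67% for the concise run.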