Bake-off (0.5.0 vs 0.4.0):
- Overall: 46.8% vs 45.2% (+1.6%), 0 errors vs 2
- Enchantments: +47% (20% → 67%)
- EssentialsX: +60% (0% → 60%)
- Effects: +25% (0% → 25%)
- Regressions: fill_build -67%, world -20%

Knowledge Lookup Tools (4 new):
- plugin.docs_lookup: WorldGuard, WorldEdit, CoreProtect, EssentialsX, LuckPerms docs
- minecraft.changelog_lookup: version history from Minecraft Wiki
- paper.docs_lookup: Paper server-specific documentation
- Wired into gateway model-driven tool loop and exploration self-play

Exploration Self-Play:
- General (vanilla MC) and plugins focus modes
- Wiki-grounded: model researches before acting, validates through RCON
- 2,243 exploration examples generated, 150 kept after quality filtering

Training Progress Chart:
- SVG chart showing training examples and inverse loss across versions
- Added to MODEL_CARD.md for Gitea display

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Model Card: Mortdecai
Model Details
| Field | Value |
|---|---|
| Name | Mortdecai |
| Version | 0.5.0 |
| Base Model | Qwen3.5-9B (Apache 2.0) |
| Adaptation | QLoRA (4-bit base + LoRA adapters in FP16) |
| Parameters | 9.4B total, 29M trainable (0.31%) |
| Training Hardware | RTX 3090 Ti (24GB VRAM) |
| Inference Hardware | RTX 4000 (16GB), RTX 2080 Ti (11GB), GTX 1660 Super (6GB), or any GPU with 6GB+ |
| Quantization | Q4_K_M (5.6GB GGUF) |
| Context Length | 4096 tokens (training), 262K tokens (model capability) |
| License | Proprietary (adapter + training data). Base model: Apache 2.0 |
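
The trainable share in the Parameters row can be checked with a few lines of arithmetic. This is only a sanity check on the rounded figures from the table; the variable names are illustrative.

```python
# Rounded figures from the Model Details table.
total_params = 9.4e9      # 9.4B total parameters
trainable_params = 29e6   # 29M trainable LoRA parameters

# Share of parameters updated during QLoRA fine-tuning, in percent.
trainable_pct = 100 * trainable_params / total_params
print(f"Trainable share: {trainable_pct:.2f}%")
```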
Intended Use
Mortdecai is designed for Minecraft Java Edition 1.21.x server operations:
- Translating natural language to valid Minecraft commands
- Controlling an AI God character that responds to player prayers
- Server administration via chat (gamerules, effects, world editing)
- Error correction (self-corrects failed RCON commands)
Not intended for:
- General-purpose chat or reasoning
- Other games or non-Minecraft domains
- Safety-critical applications
- Use without the validator safety layer
Training Data
| Source | Count | Description |
|---|---|---|
| Hand-curated seed examples | 3,196 | Command syntax, recipes, enchantments, entities, effects, memory, events |
| Tool-calling sequences | 1,430 | Multi-turn RCON execution with 17 tools (script, memory, wiki, plugins) |
| IGLU build dataset | 4,656 | Natural language → block placement commands from Microsoft Research |
| Plugin training (RCON-validated) | 104 | WorldGuard, CoreProtect, EssentialsX, LuckPerms, FAWE |
| Exploration self-play | 150 | Wiki-grounded knowledge discovery with RCON validation |
| Self-play (0.4.0 + 0.5.0) | 2,900+ | Model-generated prompts validated via RCON |
| Live server audit | 8,000+ | Wolf bot + real player interactions from 3 servers |
Total: more than 20,000 examples across all sources
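
As a rough check on the total, the per-source counts in the table can be summed. Two rows are given as lower bounds ("2,900+" and "8,000+"), so the result is a lower bound as well. The dictionary keys below are shorthand labels chosen for this sketch, not identifiers from the project.

```python
# Per-source counts from the Training Data table.
# Rows marked "+" in the table are treated as lower bounds.
source_counts = {
    "hand_curated_seed": 3196,
    "tool_calling_sequences": 1430,
    "iglu_build_dataset": 4656,
    "plugin_training": 104,
    "exploration_self_play": 150,
    "self_play_0_4_0_and_0_5_0": 2900,   # "2,900+"
    "live_server_audit": 8000,           # "8,000+"
}

lower_bound = sum(source_counts.values())
print(f"At least {lower_bound:,} examples")
```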
Tool Architecture (17 tools)
| Category | Tools |
|---|---|
| Execution | rcon.execute |
| Knowledge | minecraft.wiki_lookup, plugin.docs_lookup, minecraft.changelog_lookup, paper.docs_lookup |
| World Sensing | world.player_info, world.server_state, world.nearby_entities |
| Memory | memory.read, memory.write |
| Scripts | script.write, script.validate, script.execute, script.read, script.list, script.delete, script.schedule |
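
The table above can be mirrored as a small registry, which makes it easy to confirm the tool count and to look up a tool's category. This is a minimal sketch: the tool names come from the table, while the data structure and variable names are assumptions and do not describe how the gateway actually stores its tools.

```python
# Tool names copied from the Tool Architecture table; the layout is hypothetical.
TOOLS_BY_CATEGORY = {
    "Execution": ["rcon.execute"],
    "Knowledge": [
        "minecraft.wiki_lookup",
        "plugin.docs_lookup",
        "minecraft.changelog_lookup",
        "paper.docs_lookup",
    ],
    "World Sensing": [
        "world.player_info",
        "world.server_state",
        "world.nearby_entities",
    ],
    "Memory": ["memory.read", "memory.write"],
    "Scripts": [
        "script.write",
        "script.validate",
        "script.execute",
        "script.read",
        "script.list",
        "script.delete",
        "script.schedule",
    ],
}

# Flat list of every tool, plus a reverse lookup from tool name to category.
ALL_TOOLS = [tool for tools in TOOLS_BY_CATEGORY.values() for tool in tools]
CATEGORY_OF = {
    tool: category
    for category, tools in TOOLS_BY_CATEGORY.items()
    for tool in tools
}

print(len(ALL_TOOLS), "tools")
print(CATEGORY_OF["paper.docs_lookup"])
```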
Data Collection Methods
- Manual curation — Minecraft Wiki, command reference, recipe databases
- Live server logs — Real player interactions on Paper 1.21.x servers
- Bot collection — Mineflayer bots with Gemini/Dolphin prompt generation
- API distillation — Claude Haiku and Gemini Flash responses
- Self-play — Model generates edge cases, attempts via RCON, learns from results
- RCON validation — Every command tested against a live Minecraft server
Known Biases
- Training data is skewed toward English (~97%), with limited multilingual coverage (~3%)
- Command distribution favors `give` and `effect` over complex `execute` chains
- God persona training reflects a specific dramatic character and is not neutral
- Player interaction data comes from a small group of testers (< 10 players)
- Self-play data may overrepresent patterns the model is already good at
Evaluation
Bake-off Results (0.5.0 vs 0.4.0, 38 prompts × 12 categories)
| Metric | 0.4.0 | 0.5.0 |
|---|---|---|
| Overall success rate | 45.2% | 46.8% |
| Avg response time | 2.60s | 2.11s |
| Errors (crashes) | 2 | 0 |
| Empty responses | 0 | 0 |
Selected category results (0.5.0 vs 0.4.0; change in percentage points):
| Category | 0.4.0 | 0.5.0 | Change |
|---|---|---|---|
| Enchantments | 20% | 67% | +47 pp |
| EssentialsX | 0% | 60% | +60 pp |
| Effects | 0% | 25% | +25 pp |
| Basic commands | 75% | 75% | — |
| Teleport | 100% | 100% | — |
| Error recovery | 50% | 50% | — |
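
The Change column is the difference between the two success rates in percentage points, not a relative increase. A short sketch of that calculation over the rows of the table, with illustrative names:

```python
# (0.4.0, 0.5.0) success rates in percent, copied from the category table.
results = {
    "Enchantments": (20, 67),
    "EssentialsX": (0, 60),
    "Effects": (0, 25),
    "Basic commands": (75, 75),
    "Teleport": (100, 100),
    "Error recovery": (50, 50),
}

# Percentage-point change: new rate minus old rate.
changes = {name: new - old for name, (old, new) in results.items()}

for name, delta in changes.items():
    print(f"{name}: {delta:+d} pp")
```

A relative change is not used here because it is undefined for categories that start at 0%.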
Safety
The model uses a 5-level risk hierarchy:
- Level 0 (never): ban, kick, stop, op — hardcoded block in validator
- Level 1 (refuse): permanent server state changes
- Level 2 (warn): temporary/reversible changes, destructive actions
- Level 3 (normal): standard gameplay commands
- Level 4 (generous): full enchanted gear, large material stacks
Additional safety layers:
- Validator blocks dangerous commands even if model generates them
- Dangerous effect duration caps (levitation 15s, wither 30s)
- Fall protection (detects lethal teleports)
- Gamerule auto-revert timers
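
The first two layers can be sketched as a single validation pass over a command string. This is a minimal illustration and not the project's validator: the `Verdict` type, the `validate` function, and the token-based parsing are all assumptions. Only the blocked command names and the two duration caps come from the lists above.

```python
from dataclasses import dataclass

# Level 0 commands from the risk hierarchy: never executed.
BLOCKED_COMMANDS = {"ban", "kick", "stop", "op"}

# Duration caps in seconds for dangerous effects, as listed above.
EFFECT_CAPS = {"levitation": 15, "wither": 30}


@dataclass
class Verdict:
    allowed: bool
    command: str   # command to run, possibly rewritten; empty if blocked
    reason: str


def validate(command: str) -> Verdict:
    """Hypothetical validator pass over one command string."""
    tokens = command.strip().lstrip("/").split()
    if not tokens:
        return Verdict(False, "", "empty command")

    name = tokens[0].lower()
    if name in BLOCKED_COMMANDS:
        return Verdict(False, "", f"'{name}' is a Level 0 command")

    # Expected shape: effect give <targets> <effect> [<seconds>] ...
    if name == "effect" and len(tokens) >= 4 and tokens[1] == "give":
        effect = tokens[3].lower().removeprefix("minecraft:")
        cap = EFFECT_CAPS.get(effect)
        if cap is not None:
            duration = tokens[4] if len(tokens) > 4 else None
            # A missing, non-numeric (e.g. "infinite"), or too-long
            # duration is replaced by the cap.
            if duration is None or not duration.isdigit() or int(duration) > cap:
                tokens[4:5] = [str(cap)]
                return Verdict(True, " ".join(tokens), f"{effect} capped at {cap}s")

    return Verdict(True, " ".join(tokens), "ok")


print(validate("/op Steve"))
print(validate("effect give @p minecraft:levitation 60 1"))
```

The sketch only inspects the leading command name, so a blocked command nested inside an `execute ... run` chain would pass through it. A real validator would need to handle that case, along with fall protection and gamerule timers, which are not modeled here.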
Limitations
- Cannot determine what a player is looking at (no raycast)
- Limited awareness of world state beyond player position
- Enchantment syntax errors still occur (~15% need validator fixes)
- Empty responses on ~5% of requests
- Thinks in `<think>` blocks that must be stripped (Qwen3 behavior)
- God persona can be unpredictable by design
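
Stripping the `<think>` blocks before a response reaches the command parser can be done with a non-greedy regular expression. A minimal sketch, assuming the blocks are always closed with `</think>`; the function name is illustrative.

```python
import re

# Non-greedy match so several separate <think> blocks are removed individually;
# DOTALL lets "." span the newlines inside a block.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_think(text: str) -> str:
    """Remove <think>...</think> spans and trim surrounding whitespace."""
    return THINK_BLOCK.sub("", text).strip()


raw = "<think>The player wants a diamond.\nUse give.</think>\ngive @p diamond 1"
print(strip_think(raw))
```

An unclosed `<think>` tag, for example from a response truncated at the token limit, is left untouched by this pattern and would need separate handling.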
Environmental Impact
- Training energy: ~84 W × 4 hours ≈ 0.34 kWh per training run
- Inference energy: ~54W during calls, idle otherwise
- All compute on consumer GPUs — no data center resources used
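
The training energy figure follows directly from average power draw multiplied by run time. A short check of that arithmetic, using the approximate values stated above:

```python
# Approximate figures from the Environmental Impact list.
power_watts = 84   # average draw during training
hours = 4          # duration of one training run

# Watt-hours divided by 1000 gives kilowatt-hours.
energy_kwh = power_watts * hours / 1000
print(f"{energy_kwh:.2f} kWh per training run")
```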