Mortdecai f5118505b1 0.5.0 bake-off results, knowledge lookup tools, training progress chart
Bake-off (0.5.0 vs 0.4.0):
- Overall: 46.8% vs 45.2% (+1.6%), 0 errors vs 2
- Enchantments: +47% (20% → 67%)
- EssentialsX: +60% (0% → 60%)
- Effects: +25% (0% → 25%)
- Regressions: fill_build -67%, world -20%

Knowledge Lookup Tools (4 new):
- plugin.docs_lookup: WorldGuard, WorldEdit, CoreProtect, EssentialsX, LuckPerms docs
- minecraft.changelog_lookup: version history from Minecraft Wiki
- paper.docs_lookup: Paper server-specific documentation
- Wired into gateway model-driven tool loop and exploration self-play

Exploration Self-Play:
- General (vanilla MC) and plugins focus modes
- Wiki-grounded: model researches before acting, validates through RCON
- 2,243 exploration examples generated, 150 kept after quality filtering

Training Progress Chart:
- SVG chart showing training examples and inverse loss across versions
- Added to MODEL_CARD.md for Gitea display

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 15:28:09 -04:00


Model Card: Mortdecai

Training Progress

Model Details

| Field | Value |
|---|---|
| Name | Mortdecai |
| Version | 0.5.0 |
| Base Model | Qwen3.5-9B (Apache 2.0) |
| Adaptation | QLoRA (4-bit base + LoRA adapters in FP16) |
| Parameters | 9.4B total, 29M trainable (0.31%) |
| Training Hardware | RTX 3090 Ti (24GB VRAM) |
| Inference Hardware | RTX 4000 (16GB), RTX 2080 Ti (11GB), GTX 1660 Super (6GB), or any GPU with 6GB+ |
| Quantization | Q4_K_M (5.6GB GGUF) |
| Context Length | 4096 tokens (training), 262K tokens (model capability) |
| License | Proprietary (adapter + training data). Base model: Apache 2.0 |

Intended Use

Mortdecai is designed for Minecraft Java Edition 1.21.x server operations:

  • Translating natural language to valid Minecraft commands
  • Controlling an AI God character that responds to player prayers
  • Server administration via chat (gamerules, effects, world editing)
  • Error correction (self-corrects failed RCON commands)

Not intended for:

  • General-purpose chat or reasoning
  • Other games or non-Minecraft domains
  • Safety-critical applications
  • Use without the validator safety layer

Training Data

| Source | Count | Description |
|---|---|---|
| Hand-curated seed examples | 3,196 | Command syntax, recipes, enchantments, entities, effects, memory, events |
| Tool-calling sequences | 1,430 | Multi-turn RCON execution with 17 tools (script, memory, wiki, plugins) |
| IGLU build dataset | 4,656 | Natural language → block placement commands from Microsoft Research |
| Plugin training (RCON-validated) | 104 | WorldGuard, CoreProtect, EssentialsX, LuckPerms, FAWE |
| Exploration self-play | 150 | Wiki-grounded knowledge discovery with RCON validation |
| Self-play (0.4.0 + 0.5.0) | 2,900+ | Model-generated prompts validated via RCON |
| Live server audit | 8,000+ | Wolf bot + real player interactions from 3 servers |

Total: more than 20,000 examples across all sources

Tool Architecture (17 tools)

| Category | Tools |
|---|---|
| Execution | rcon.execute |
| Knowledge | minecraft.wiki_lookup, plugin.docs_lookup, minecraft.changelog_lookup, paper.docs_lookup |
| World Sensing | world.player_info, world.server_state, world.nearby_entities |
| Memory | memory.read, memory.write |
| Scripts | script.write, script.validate, script.execute, script.read, script.list, script.delete, script.schedule |
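
As a rough illustration, the tool table could be held in code as a category-to-names mapping. The tool names are copied from the table; the dict layout, `category_of`, and `ALL_TOOLS` are hypothetical and not taken from the gateway code.

```python
# Tool names copied from the table above; the data structure is an assumption.
TOOLS_BY_CATEGORY = {
    "Execution": ["rcon.execute"],
    "Knowledge": [
        "minecraft.wiki_lookup",
        "plugin.docs_lookup",
        "minecraft.changelog_lookup",
        "paper.docs_lookup",
    ],
    "World Sensing": [
        "world.player_info",
        "world.server_state",
        "world.nearby_entities",
    ],
    "Memory": ["memory.read", "memory.write"],
    "Scripts": [
        "script.write",
        "script.validate",
        "script.execute",
        "script.read",
        "script.list",
        "script.delete",
        "script.schedule",
    ],
}


def category_of(tool_name: str) -> str:
    """Return the category a tool belongs to, or raise KeyError if unknown."""
    for category, tools in TOOLS_BY_CATEGORY.items():
        if tool_name in tools:
            return category
    raise KeyError(f"unknown tool: {tool_name}")


# Flattened list of every tool name, in table order.
ALL_TOOLS = [name for tools in TOOLS_BY_CATEGORY.values() for name in tools]
print(len(ALL_TOOLS))  # 17
```

Keeping the names in one place like this makes it easy to check that the count in the heading still matches the table when tools are added.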

Data Collection Methods

  1. Manual curation — Minecraft Wiki, command reference, recipe databases
  2. Live server logs — Real player interactions on Paper 1.21.x servers
  3. Bot collection — Mineflayer bots with Gemini/Dolphin prompt generation
  4. API distillation — Claude Haiku and Gemini Flash responses
  5. Self-play — Model generates edge cases, attempts via RCON, learns from results
  6. RCON validation — Every command tested against a live Minecraft server
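
The self-play and RCON validation steps can be sketched roughly as follows. This is a minimal sketch, not the project's pipeline: `Attempt`, `run_self_play`, `keep_validated`, the failure markers, and the `fake_execute` stand-in are all hypothetical, and a real run would pass a function that talks to a live server over RCON.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical failure markers; a real validator would likely inspect more than this.
FAILURE_MARKERS = ("Unknown or incomplete command", "Incorrect argument")


@dataclass
class Attempt:
    prompt: str
    command: str
    response: str
    ok: bool


def looks_successful(response: str) -> bool:
    """Treat a response as a success unless it contains a known failure marker."""
    return not any(marker in response for marker in FAILURE_MARKERS)


def run_self_play(
    candidates: Iterable[tuple[str, str]],
    execute: Callable[[str], str],
) -> list[Attempt]:
    """Execute each (prompt, command) pair and record the server's response."""
    attempts = []
    for prompt, command in candidates:
        response = execute(command)
        attempts.append(Attempt(prompt, command, response, looks_successful(response)))
    return attempts


def keep_validated(attempts: list[Attempt]) -> list[Attempt]:
    """Keep only the attempts the server accepted."""
    return [a for a in attempts if a.ok]


def fake_execute(command: str) -> str:
    """Stand-in for a live server so the sketch runs offline."""
    if command.startswith("time set"):
        return "Set the time to 1000"
    return "Unknown or incomplete command, see below for error"


attempts = run_self_play(
    [("make it morning", "time set 1000"), ("make it morning", "tiem set 1000")],
    fake_execute,
)
print(len(keep_validated(attempts)))  # 1
```

Failed attempts are kept in the `attempts` list rather than discarded, since they may still be useful as error-correction examples.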

Known Biases

  • Training data is skewed toward English (~97%), with limited multilingual coverage (~3%)
  • Command distribution favors give and effect over complex execute chains
  • God persona training reflects a specific dramatic character — not neutral
  • Player interaction data comes from a small group of testers (< 10 players)
  • Self-play data may overrepresent patterns the model is already good at

Evaluation

Bake-off Results (0.5.0 vs 0.4.0, 38 prompts × 12 categories)

| Metric | 0.4.0 | 0.5.0 |
|---|---|---|
| Overall success rate | 45.2% | 46.8% |
| Avg response time | 2.60s | 2.11s |
| Errors (crashes) | 2 | 0 |
| Empty responses | 0 | 0 |

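Category results (0.5.0 vs 0.4.0; changes are in percentage points):

| Category | 0.4.0 | 0.5.0 | Change |
|---|---|---|---|
| Enchantments | 20% | 67% | +47 pts |
| EssentialsX | 0% | 60% | +60 pts |
| Effects | 0% | 25% | +25 pts |
| Basic commands | 75% | 75% | 0 |
| Teleport | 100% | 100% | 0 |
| Error recovery | 50% | 50% | 0 |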
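
The Change values in the category table are differences in percentage points, not relative improvements. A short sketch of that arithmetic, using the rates from the table (the `RESULTS` dict and `point_change` helper are illustrative names only):

```python
# Success rates in percent, copied from the category table: (0.4.0, 0.5.0).
RESULTS = {
    "Enchantments": (20, 67),
    "EssentialsX": (0, 60),
    "Effects": (0, 25),
    "Basic commands": (75, 75),
    "Teleport": (100, 100),
    "Error recovery": (50, 50),
}


def point_change(old: int, new: int) -> int:
    """Difference between two success rates, in percentage points."""
    return new - old


for category, (old, new) in RESULTS.items():
    print(f"{category}: {point_change(old, new):+d} points")
```

For example, Enchantments moving from 20% to 67% is a gain of 47 points; expressed as a relative increase it would be 235%, which is why the two should not be confused.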

Safety

The model uses a 5-level risk hierarchy:

  • Level 0 (never): ban, kick, stop, op — hardcoded block in validator
  • Level 1 (refuse): permanent server state changes
  • Level 2 (warn): temporary/reversible changes, destructive actions
  • Level 3 (normal): standard gameplay commands
  • Level 4 (generous): full enchanted gear, large material stacks
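
A minimal sketch of how the hierarchy might be represented, assuming the model supplies a risk level and the validator overrides it for the four Level 0 commands. Only the level names and the Level 0 command list come from this card; `RiskLevel`, `root_command`, `effective_level`, and `is_allowed` are hypothetical, and the rule that Levels 2 and above are executable is an assumption.

```python
from enum import IntEnum


class RiskLevel(IntEnum):
    NEVER = 0
    REFUSE = 1
    WARN = 2
    NORMAL = 3
    GENEROUS = 4


# Level 0 commands listed in this card.
HARD_BLOCKED = frozenset({"ban", "kick", "stop", "op"})


def root_command(command: str) -> str:
    """Return the lowercased command name, without a leading slash or namespace."""
    stripped = command.strip().lstrip("/")
    if not stripped:
        return ""
    first = stripped.split(maxsplit=1)[0].lower()
    # Assumption: treat "minecraft:op" the same as "op".
    return first.rsplit(":", 1)[-1]


def effective_level(command: str, model_level: RiskLevel) -> RiskLevel:
    """Force Level 0 for hard-blocked commands, whatever level the model assigned."""
    if root_command(command) in HARD_BLOCKED:
        return RiskLevel.NEVER
    return model_level


def is_allowed(command: str, model_level: RiskLevel) -> bool:
    """Assumption: Levels 2 to 4 may run; Levels 0 and 1 are never executed."""
    return effective_level(command, model_level) >= RiskLevel.WARN


print(is_allowed("/op Steve", RiskLevel.NORMAL))         # False
print(is_allowed("time set day", RiskLevel.NORMAL))      # True
```

Doing the Level 0 check on the command text, rather than trusting the model's own rating, matches the card's description of a hardcoded block in the validator.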

Additional safety layers:

  • Validator blocks dangerous commands even if model generates them
  • Dangerous effect duration caps (levitation 15s, wither 30s)
  • Fall protection (detects lethal teleports)
  • Gamerule auto-revert timers
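
The effect duration caps could be enforced along these lines. This is a sketch under stated assumptions: the two cap values come from the list above, while `cap_effect_duration`, the simple whitespace tokenization, and the choice to apply the cap when no duration (or `infinite`) is given are illustrative, not the validator's actual behavior.

```python
# Cap values from this card, in seconds.
EFFECT_CAPS_SECONDS = {"levitation": 15, "wither": 30}


def cap_effect_duration(command: str) -> str:
    """Clamp the duration of `effect give <target> <effect> [seconds] ...`.

    Commands that are not capped effects are returned unchanged. Rewritten
    commands are returned without a leading slash.
    """
    tokens = command.strip().lstrip("/").split()
    if len(tokens) < 4 or tokens[0] != "effect" or tokens[1] != "give":
        return command

    effect = tokens[3].rsplit(":", 1)[-1].lower()  # drop "minecraft:" if present
    cap = EFFECT_CAPS_SECONDS.get(effect)
    if cap is None:
        return command

    if len(tokens) == 4:
        # Assumption: an omitted duration is replaced with the cap.
        tokens.append(str(cap))
    else:
        raw = tokens[4]
        # Non-numeric values such as "infinite" are replaced with the cap.
        seconds = int(raw) if raw.isdigit() else None
        tokens[4] = str(cap if seconds is None else min(seconds, cap))
    return " ".join(tokens)


print(cap_effect_duration("effect give Steve minecraft:levitation 60 1"))
# effect give Steve minecraft:levitation 15 1
```

Target selectors that contain spaces would need a real command parser; the whitespace split here is only enough for simple player names and selectors.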

Limitations

  • Cannot determine what a player is looking at (no raycast)
  • Limited awareness of world state beyond player position
  • Enchantment syntax errors still occur (~15% need validator fixes)
  • Empty responses on ~5% of requests
  • Emits its reasoning inside <think>...</think> blocks, which must be stripped from the output before use (Qwen3 behavior)
  • God persona can be unpredictable by design
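
Stripping the `<think>` blocks mentioned above can be done with a regular expression. A minimal sketch, assuming the tags appear literally as `<think>` and `</think>`; the `strip_think` name and the handling of an unclosed block (for example when output is cut off at the token limit) are assumptions.

```python
import re

# A complete reasoning block, possibly spanning several lines.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)
# An opening tag with no closing tag, e.g. output truncated mid-thought.
_UNCLOSED_THINK = re.compile(r"<think>.*\Z", re.DOTALL)


def strip_think(text: str) -> str:
    """Remove reasoning blocks and return the remaining text, trimmed."""
    text = _THINK_BLOCK.sub("", text)
    text = _UNCLOSED_THINK.sub("", text)
    return text.strip()


raw = "<think>The player wants a sword.</think>\ngive @p diamond_sword"
print(strip_think(raw))  # give @p diamond_sword
```

If a response consists only of a truncated reasoning block, `strip_think` returns an empty string, which a caller may want to count as an empty response.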

Environmental Impact

  • Training energy: ~84W × 4 hours = 0.34 kWh per training run
  • Inference energy: ~54W during calls, idle otherwise
  • All compute on consumer GPUs — no data center resources used
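
The training energy figure follows from power multiplied by time. A short sketch of that arithmetic; the per-call estimate additionally assumes a call lasts the 2.11 s average response time reported in the bake-off, which is an illustrative assumption rather than a measured value.

```python
def energy_kwh(power_watts: float, hours: float) -> float:
    """Energy in kilowatt-hours for a constant power draw over a duration."""
    return power_watts * hours / 1000


# Training run: ~84 W for 4 hours.
training_kwh = energy_kwh(84, 4)
print(f"{training_kwh:.3f} kWh")  # 0.336 kWh

# Rough per-call inference estimate: ~54 W for one average-length response.
per_call_kwh = energy_kwh(54, 2.11 / 3600)
print(f"{per_call_kwh * 1000:.4f} Wh per call")
```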