Mortdecai f5118505b1 0.5.0 bake-off results, knowledge lookup tools, training progress chart

Bake-off (0.5.0 vs 0.4.0):
- Overall: 46.8% vs 45.2% (+1.6%), 0 errors vs 2
- Enchantments: +47% (20% → 67%)
- EssentialsX: +60% (0% → 60%)
- Effects: +25% (0% → 25%)
- Regressions: fill_build -67%, world -20%

Knowledge Lookup Tools (4 new):
- plugin.docs_lookup: WorldGuard, WorldEdit, CoreProtect, EssentialsX, LuckPerms docs
- minecraft.changelog_lookup: version history from Minecraft Wiki
- paper.docs_lookup: Paper server-specific documentation
- Wired into gateway model-driven tool loop and exploration self-play

Exploration Self-Play:
- General (vanilla MC) and plugins focus modes
- Wiki-grounded: model researches before acting, validates through RCON
- 2,243 exploration examples generated, 150 kept after quality filtering

Training Progress Chart:
- SVG chart showing training examples and inverse loss across versions
- Added to MODEL_CARD.md for Gitea display

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-21 15:28:09 -04:00

Model Card: Mortdecai

Model Details

Field	Value
Name	Mortdecai
Version	0.5.0
Base Model	Qwen3.5-9B (Apache 2.0)
Adaptation	QLoRA (4-bit base + LoRA adapters in FP16)
Parameters	9.4B total, 29M trainable (0.31%)
Training Hardware	RTX 3090 Ti (24 GB VRAM)
Inference Hardware	RTX 4000 (16 GB), RTX 2080 Ti (11 GB), GTX 1660 Super (6 GB), or any GPU with 6 GB+ of VRAM
Quantization	Q4_K_M (5.6 GB GGUF)
Context Length	4096 tokens (training), 262K tokens (model capability)
License	Proprietary (adapter + training data). Base model: Apache 2.0
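
The trainable-parameter share in the table can be checked with a line of arithmetic. This is only a sanity check on the two rounded figures quoted above (9.4B total, 29M trainable); the variable names are illustrative.

```python
# Rounded figures from the Model Details table.
total_params = 9.4e9      # 9.4B total parameters
trainable_params = 29e6   # 29M trainable LoRA parameters

trainable_pct = trainable_params / total_params * 100
print(f"Trainable share: {trainable_pct:.2f}%")  # prints "Trainable share: 0.31%"
```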

Intended Use

Mortdecai is designed for Minecraft Java Edition 1.21.x server operations:

Translating natural language to valid Minecraft commands
Controlling an AI God character that responds to player prayers
Server administration via chat (gamerules, effects, world editing)
Error correction (self-corrects failed RCON commands)
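
The error-correction item above can be pictured as a bounded retry loop. The sketch below is illustrative only: `run_with_correction`, `send_command`, and `propose_fix` are hypothetical names, not taken from the Mortdecai codebase, and the real gateway may be structured differently.

```python
from typing import Callable, List, Optional, Tuple

Attempt = Tuple[str, bool, str]  # (command, succeeded, server output)


def run_with_correction(
    command: str,
    send_command: Callable[[str], Tuple[bool, str]],
    propose_fix: Callable[[str, str], Optional[str]],
    max_attempts: int = 3,
) -> Tuple[bool, List[Attempt]]:
    """Run a command, feeding each failure back to a fixer until it succeeds.

    send_command stands in for an RCON client and propose_fix for the model;
    both are assumptions made for this sketch.
    """
    history: List[Attempt] = []
    current = command
    for _ in range(max_attempts):
        ok, output = send_command(current)
        history.append((current, ok, output))
        if ok:
            return True, history
        fixed = propose_fix(current, output)
        if fixed is None or fixed == current:
            break  # no new suggestion, so stop instead of repeating the failure
        current = fixed
    return False, history


# Toy stand-ins so the sketch runs without a server.
def fake_rcon(cmd: str) -> Tuple[bool, str]:
    if cmd.startswith("give"):
        return True, "Gave 1 [Diamond] to Steve"
    return False, "Unknown or incomplete command"


def fake_fixer(cmd: str, error: str) -> Optional[str]:
    return cmd.replace("giv ", "give ", 1) if cmd.startswith("giv ") else None


ok, history = run_with_correction("giv Steve diamond", fake_rcon, fake_fixer)
print(ok, [attempt[0] for attempt in history])
```

Capping the number of attempts keeps a command that cannot be repaired from looping indefinitely.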

Not intended for:

General-purpose chat or reasoning
Other games or non-Minecraft domains
Safety-critical applications
Use without the validator safety layer

Training Data

Source	Count	Description
Hand-curated seed examples	3,196	Command syntax, recipes, enchantments, entities, effects, memory, events
Tool-calling sequences	1,430	Multi-turn RCON execution with 17 tools (script, memory, wiki, plugins)
IGLU build dataset	4,656	Natural language → block placement commands from Microsoft Research
Plugin training (RCON-validated)	104	WorldGuard, CoreProtect, EssentialsX, LuckPerms, FAWE
Exploration self-play	150	Wiki-grounded knowledge discovery with RCON validation
Self-play (0.4.0 + 0.5.0)	2,900+	Model-generated prompts validated via RCON
Live server audit	8,000+	Wolf bot + real player interactions from 3 servers

Total: more than 20,000 examples across all sources
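
As a quick check on the total, the per-source counts in the table can be summed. Two rows are given as lower bounds ("2,900+" and "8,000+"), so the result below is a minimum, not an exact figure.

```python
# Counts copied from the Training Data table; "+" rows are treated as lower bounds.
source_counts = {
    "Hand-curated seed examples": 3196,
    "Tool-calling sequences": 1430,
    "IGLU build dataset": 4656,
    "Plugin training (RCON-validated)": 104,
    "Exploration self-play": 150,
    "Self-play (0.4.0 + 0.5.0)": 2900,  # listed as "2,900+"
    "Live server audit": 8000,          # listed as "8,000+"
}

minimum_total = sum(source_counts.values())
print(f"At least {minimum_total:,} examples")  # prints "At least 20,436 examples"
```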

Tool Architecture (17 tools)

Category	Tools
Execution	rcon.execute
Knowledge	minecraft.wiki_lookup, plugin.docs_lookup, minecraft.changelog_lookup, paper.docs_lookup
World Sensing	world.player_info, world.server_state, world.nearby_entities
Memory	memory.read, memory.write
Scripts	script.write, script.validate, script.execute, script.read, script.list, script.delete, script.schedule
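
One way to picture the table above is as a small registry keyed by category. The tool names are copied from the table; the dictionary layout and the `category_of` helper are assumptions made for illustration, not the gateway's actual data structure.

```python
from typing import Dict, List, Optional

# Tool names from the Tool Architecture table, grouped by category.
TOOLS_BY_CATEGORY: Dict[str, List[str]] = {
    "Execution": ["rcon.execute"],
    "Knowledge": [
        "minecraft.wiki_lookup",
        "plugin.docs_lookup",
        "minecraft.changelog_lookup",
        "paper.docs_lookup",
    ],
    "World Sensing": [
        "world.player_info",
        "world.server_state",
        "world.nearby_entities",
    ],
    "Memory": ["memory.read", "memory.write"],
    "Scripts": [
        "script.write",
        "script.validate",
        "script.execute",
        "script.read",
        "script.list",
        "script.delete",
        "script.schedule",
    ],
}

ALL_TOOLS = [name for tools in TOOLS_BY_CATEGORY.values() for name in tools]


def category_of(tool_name: str) -> Optional[str]:
    """Return the category a tool belongs to, or None if it is unknown."""
    for category, tools in TOOLS_BY_CATEGORY.items():
        if tool_name in tools:
            return category
    return None


print(len(ALL_TOOLS), category_of("paper.docs_lookup"))  # prints "17 Knowledge"
```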

Data Collection Methods

Manual curation — Minecraft Wiki, command reference, recipe databases
Live server logs — Real player interactions on Paper 1.21.x servers
Bot collection — Mineflayer bots with Gemini/Dolphin prompt generation
API distillation — Claude Haiku and Gemini Flash responses
Self-play — Model generates edge cases, attempts via RCON, learns from results
RCON validation — Every command tested against a live Minecraft server
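
The self-play and RCON-validation steps above amount to a generate, execute, filter loop. The sketch below shows that shape with toy stand-ins. `self_play_round`, `generate_command`, `validate`, and `quality_filter` are hypothetical names, and the actual filtering criteria are not described in this card. The last lines compute the keep rate from the figures in the commit summary (2,243 exploration examples generated, 150 kept).

```python
from typing import Callable, Dict, Iterable, List, Tuple

Example = Dict[str, object]


def self_play_round(
    prompts: Iterable[str],
    generate_command: Callable[[str], str],
    validate: Callable[[str], Tuple[bool, str]],
    quality_filter: Callable[[Example], bool],
) -> List[Example]:
    """Generate a command per prompt, run it, and keep examples that pass the filter."""
    kept: List[Example] = []
    for prompt in prompts:
        command = generate_command(prompt)
        ok, output = validate(command)  # stands in for a live RCON call
        example = {"prompt": prompt, "command": command, "ok": ok, "output": output}
        if quality_filter(example):
            kept.append(example)
    return kept


# Toy stand-ins so the sketch runs offline.
def toy_model(prompt: str) -> str:
    return {
        "give me a diamond": "give Steve diamond 1",
        "make it rain": "weather rainy",  # deliberately wrong argument
    }[prompt]


def toy_rcon(command: str) -> Tuple[bool, str]:
    if command == "weather rainy":
        return False, "Incorrect argument for command"
    return True, "OK"


kept = self_play_round(
    ["give me a diamond", "make it rain"],
    toy_model,
    toy_rcon,
    quality_filter=lambda ex: bool(ex["ok"]),
)

# Keep rate for the exploration run reported in the commit summary.
keep_rate_pct = 150 / 2243 * 100
print(len(kept), f"{keep_rate_pct:.1f}%")  # prints "1 6.7%"
```

A filter that also keeps failed attempts, paired with their corrections, would be an equally plausible design. The card does not say which approach is used.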

Known Biases

Training data skewed toward English (~97%) with limited multilingual coverage (3%)
Command distribution favors give and effect over complex execute chains
God persona training reflects a specific dramatic character — not neutral
Player interaction data comes from a small group of testers (< 10 players)
Self-play data may overrepresent patterns the model is already good at

Evaluation

Bake-off Results (0.5.0 vs 0.4.0, 38 prompts × 12 categories)

Metric	0.4.0	0.5.0
Overall success rate	45.2%	46.8%
Avg response time	2.60s	2.11s
Errors (crashes)	2	0
Empty responses	0	0

Category results (0.5.0 vs 0.4.0, selected categories; changes in percentage points):

Category	0.4.0	0.5.0	Change
Enchantments	20%	67%	+47 pp
EssentialsX	0%	60%	+60 pp
Effects	0%	25%	+25 pp
Basic commands	75%	75%	0 pp
Teleport	100%	100%	0 pp
Error recovery	50%	50%	0 pp
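
The Change column is a difference in percentage points, not a relative change. The short calculation below reproduces it from the two success-rate columns and shows why a relative change would be undefined for categories that started at 0%. The variable names are illustrative.

```python
# (0.4.0 success %, 0.5.0 success %) per category, from the table above.
results = {
    "Enchantments": (20, 67),
    "EssentialsX": (0, 60),
    "Effects": (0, 25),
    "Basic commands": (75, 75),
    "Teleport": (100, 100),
    "Error recovery": (50, 50),
}

# Absolute change in percentage points.
point_change = {cat: new - old for cat, (old, new) in results.items()}

# Relative change only makes sense when the baseline is non-zero.
relative_change = {
    cat: (new - old) / old * 100 if old else None
    for cat, (old, new) in results.items()
}

for cat, delta in point_change.items():
    print(f"{cat}: {delta:+d} pp")
```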

Safety

The model uses a 5-level risk hierarchy:

Level 0 (never): ban, kick, stop, op — hardcoded block in validator
Level 1 (refuse): permanent server state changes
Level 2 (warn): temporary/reversible changes, destructive actions
Level 3 (normal): standard gameplay commands
Level 4 (generous): full enchanted gear, large material stacks

Additional safety layers:

Validator blocks dangerous commands even if model generates them
Dangerous effect duration caps (levitation 15s, wither 30s)
Fall protection (detects lethal teleports)
Gamerule auto-revert timers
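
A minimal sketch of two of these layers, the Level 0 hard block and the effect duration caps, might look like the following. It is not the project's validator: `validate_command` and its return shape are assumptions, and the first-token check would miss commands wrapped in `execute ... run`, which a real validator would need to handle.

```python
from typing import Tuple

# Level 0 commands from the card: always blocked, whatever the model generates.
BLOCKED_COMMANDS = {"ban", "kick", "stop", "op"}
# Duration caps from the card, in seconds.
EFFECT_CAPS_SECONDS = {"levitation": 15, "wither": 30}


def validate_command(command: str) -> Tuple[bool, str]:
    """Return (allowed, command); the command may come back with a clamped duration."""
    parts = command.strip().lstrip("/").split()
    if not parts or parts[0].lower() in BLOCKED_COMMANDS:
        return False, ""
    # Vanilla syntax: effect give <targets> <effect> [<seconds>] [<amplifier>] ...
    if len(parts) >= 4 and parts[0].lower() == "effect" and parts[1] == "give":
        effect_id = parts[3].split(":")[-1]  # accept "minecraft:wither" or "wither"
        cap = EFFECT_CAPS_SECONDS.get(effect_id)
        if cap is not None:
            if len(parts) == 4:
                parts.append(str(cap))  # no duration given, so make the cap explicit
            elif parts[4] == "infinite" or (parts[4].isdigit() and int(parts[4]) > cap):
                parts[4] = str(cap)
    return True, " ".join(parts)


print(validate_command("/ban Steve"))
print(validate_command("effect give Steve minecraft:levitation 600"))
```

Clamping instead of rejecting keeps the player's request working while bounding the harm. Whether the real validator clamps or rejects is not stated in the card.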

Limitations

Cannot determine what a player is looking at (no raycast)
Limited awareness of world state beyond player position
Enchantment syntax errors still occur (~15% need validator fixes)
Empty responses on ~5% of requests
Emits its reasoning in <think> blocks, which must be stripped before the response is used (Qwen3 behavior)
God persona can be unpredictable by design
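
Stripping the `<think>` blocks mentioned above can be done with a regular expression. The sketch below is one possible approach, with `strip_think` as a hypothetical helper name. It also drops an unclosed `<think>` tail, on the assumption that a truncated reasoning block should never reach the player.

```python
import re

# Non-greedy match so multiple blocks are removed independently.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)
# An opening tag with no closing tag: drop everything after it.
UNCLOSED_THINK = re.compile(r"<think>.*\Z", re.DOTALL)


def strip_think(text: str) -> str:
    """Remove <think>...</think> blocks and any dangling unclosed block."""
    cleaned = THINK_BLOCK.sub("", text)
    cleaned = UNCLOSED_THINK.sub("", cleaned)
    return cleaned.strip()


print(strip_think("<think>player wants a diamond</think>\ngive Steve diamond 1"))
```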

Environmental Impact

Training energy: ~84 W × 4 h ≈ 0.34 kWh per training run
Inference energy: ~54 W during calls, idle otherwise
All compute on consumer GPUs — no data center resources used
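
The training-energy figure follows from average power multiplied by run time. The check below uses only the two numbers quoted above and assumes the ~84 W average holds for the full four hours.

```python
avg_power_watts = 84   # approximate average draw quoted in the card
run_hours = 4          # duration of one training run

energy_kwh = avg_power_watts * run_hours / 1000  # W x h = Wh, then / 1000 for kWh
print(f"{energy_kwh:.3f} kWh per run")  # prints "0.336 kWh per run"
```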
