Bake-off (0.5.0 vs 0.4.0):
- Overall: 46.8% vs 45.2% (+1.6 pp), 0 errors vs 2
- Enchantments: +47 pp (20% → 67%)
- EssentialsX: +60 pp (0% → 60%)
- Effects: +25 pp (0% → 25%)
- Regressions: fill_build -67%, world -20%
Knowledge Lookup Tools (4 new):
- plugin.docs_lookup: WorldGuard, WorldEdit, CoreProtect, EssentialsX, LuckPerms docs
- minecraft.changelog_lookup: version history from Minecraft Wiki
- paper.docs_lookup: Paper server-specific documentation
- Wired into gateway model-driven tool loop and exploration self-play
Exploration Self-Play:
- General (vanilla MC) and plugins focus modes
- Wiki-grounded: model researches before acting, validates through RCON
- 2,243 exploration examples generated, 150 kept after quality filtering
Training Progress Chart:
- SVG chart showing training examples and inverse loss across versions
- Added to MODEL_CARD.md for Gitea display
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GPU Scheduler (gpu.sethpc.xyz):
- Live dashboard with 4 GPUs, training monitor, loss sparklines
- Preset-based job scheduler with 3 triggers (time, finish_training, cost)
- Model selection per GPU, pipeline configuration
- Tool self-play and training pipeline types
- Behind Google OAuth, live-refresh without page reload
Tool Architecture (14 tools):
- 3 new tools: world.nearby_entities, memory.read, memory.write
- 7 script.* tools: write, validate, execute, read, list, delete, schedule
- ScriptManager: full mcfunction datapack CRUD with RCON validation
- Training data: 1,430 tool examples (up from 1,159)
Plugin Deployment (paper-ai-25567):
- WorldGuard 7.0.12, CoreProtect CE 23.1, EssentialsX 2.21.2, Vault 1.7.3
- Fresh greenfield world reset
- 104 RCON-validated plugin training examples
Event Dispatcher:
- Watches server log for deaths, joins, advancements, PvP kills
- Configurable trigger probability and cooldowns per event type
- Deployed to dev server, fires god_system prompts on events
- 21 event-response training examples
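A minimal sketch of how a per-event trigger probability and cooldown could gate prompts. The class name `EventGate`, the rule layout, and the injectable `rng`/`clock` parameters are illustrative assumptions, not the dispatcher's actual code.

```python
import random
import time


class EventGate:
    """Decides whether a log event should fire a prompt (hypothetical helper)."""

    def __init__(self, rules, rng=random.random, clock=time.monotonic):
        # rules maps event type -> (probability in 0..1, cooldown in seconds)
        self.rules = rules
        self.rng = rng
        self.clock = clock
        self.last_fired = {}

    def should_fire(self, event_type):
        if event_type not in self.rules:
            return False
        probability, cooldown = self.rules[event_type]
        now = self.clock()
        last = self.last_fired.get(event_type)
        if last is not None and now - last < cooldown:
            return False  # still inside the cooldown window
        if self.rng() >= probability:
            return False  # probability roll failed
        self.last_fired[event_type] = now
        return True
```

The cooldown is checked before the probability roll, so a failed roll does not reset the timer.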
Training Infrastructure:
- train_lora.py: --save-steps 50, --resume from checkpoint
- run_training.sh: stops Ollama, activates conda, restarts after
- Passwordless sudo for ollama services on steel141
- Dev server added to MCSManager with autoStart
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Model can now output revert_after (seconds) and revert_commands fields.
Python service schedules timer from model's response, not just heuristics.
Players notified of revert countdown. Revert announced when applied.
Training examples: temporary gamerules with explicit/implicit/no duration,
permanent changes (no revert), effects with built-in duration, combined reverts.
Key principle: no duration specified → default 5 min revert for safety.
"permanently"/"forever"/"always" → no revert.
Effects → built-in duration, no revert_after needed.
Seed dataset: 3,136 examples
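The revert rules above can be sketched as a small decision function. The `commands` key, the function name, and the way effects are detected are assumptions made for illustration; only `revert_after` and `revert_commands` come from the notes.

```python
import re

DEFAULT_REVERT_SECONDS = 300  # "default 5 min revert" from the notes above
PERMANENT_WORDS = re.compile(r"\b(permanently|forever|always)\b", re.IGNORECASE)


def resolve_revert(prompt, response):
    """Return (delay_seconds, revert_commands), or None when nothing is scheduled."""
    commands = response.get("commands", [])          # assumed field name
    revert_commands = response.get("revert_commands", [])
    if PERMANENT_WORDS.search(prompt):
        return None  # player asked for a permanent change
    if commands and all(c.lstrip("/").startswith("effect ") for c in commands):
        return None  # effects carry their own duration
    if not revert_commands:
        return None  # nothing to undo
    delay = response.get("revert_after") or DEFAULT_REVERT_SECONDS
    return delay, revert_commands
```

An explicit `revert_after` from the model wins; the 5-minute default only applies when the model gives revert commands without a duration.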
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Versioning scheme: semantic versioning (MAJOR.MINOR.PATCH)
- 0.x.0 = pre-release development
- 1.0.0 = first public/monetized release
Renamed everywhere: PLAN.md, training scripts, self-play, overnight script,
status printer, whitelist app, discord bot, all training data references.
Ollama models retagged: mortdecai-v4 → mortdecai:0.4.0
Server configs updated on all three servers.
Self-play restarted with new model name.
Entity targeting + radius-aware kill + distance scale training added.
Seed dataset: 2,503 + tool: 1,159 + self-play: 5,059 = 8,721 total examples
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Teaches the model to distinguish:
- "kill the zombie" → limit=1,sort=nearest (specific target)
- "kill all zombies" → distance=..30 (area clear)
- "what mobs are nearby" → requires world.nearby_entities tool
- "target the closest enemy" → type=!player,limit=1,sort=nearest
With LangGraph tools enabled, world.nearby_entities gives the model
entity awareness before generating kill commands.
Seed dataset: 2,486 examples
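For reference, the selector patterns above could be assembled like this. The helper names are hypothetical, and wrapping the kill in `execute at <player>` is an assumption about how the position would be anchored to the requesting player.

```python
def selector(entity_type=None, *, nearest_only=False, radius=None):
    """Build an @e selector from the pieces used in the examples above."""
    args = []
    if entity_type:
        args.append(f"type={entity_type}")
    if radius is not None:
        args.append(f"distance=..{radius}")
    if nearest_only:
        args.extend(["limit=1", "sort=nearest"])
    return "@e[" + ",".join(args) + "]"


def kill_nearest(player, entity_type):
    # "kill the zombie": one specific target, the closest one
    return f"execute at {player} run kill {selector(entity_type, nearest_only=True)}"


def kill_all_nearby(player, entity_type, radius=30):
    # "kill all zombies": area clear within a radius
    return f"execute at {player} run kill {selector(entity_type, radius=radius)}"
```

`selector("!player", nearest_only=True)` gives the "closest enemy" form, `@e[type=!player,limit=1,sort=nearest]`.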
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Teaches command ordering and dependencies:
- Build structure THEN tp inside (not reverse)
- Apply protection BEFORE spawning hostile mobs
- Create water pool BEFORE dropping player
- Effects before gear (protection active during equip)
- Clear mobs before healing (don't waste heal)
- Cage before tp victim (prevent escape)
Key principle: reasoning explains WHY order matters.
Seed dataset: 2,409 examples
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: self-play opened/closed a new TCP socket for every RCON command
(hundreds/minute). Paper's RCON listener creates a thread per connection,
overwhelming the server until it stopped.
Fix: PersistentRCON class maintains a single connection per server with
auto-reconnect. Thread-safe via lock. Connection pool keyed by host:port.
Applied to:
- mc_aigod_paper.py (prod paper-ai + dev)
- mc_aigod.py (shrink-world)
- self_play.py (training data generation)
- persistent_rcon.py (shared module)
Before: 100+ RCON connections/minute → server crash
After: 3 persistent connections total → stable
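A rough sketch of the approach, not the contents of persistent_rcon.py: one locked, auto-reconnecting connection per host:port. The method names, the `opener` parameter (there so the class can be exercised without a server), and the single-retry policy are assumptions. The packet layout follows the Source RCON format that Minecraft uses.

```python
import socket
import struct
import threading


class PersistentRCON:
    """One long-lived RCON connection that reconnects on failure (sketch)."""

    def __init__(self, host, port, password, opener=socket.create_connection):
        self.addr = (host, port)
        self.password = password
        self.opener = opener
        self.sock = None
        self.lock = threading.Lock()  # serialises commands across threads

    def _send_packet(self, req_id, ptype, body):
        payload = struct.pack("<ii", req_id, ptype) + body.encode("utf-8") + b"\x00\x00"
        self.sock.sendall(struct.pack("<i", len(payload)) + payload)

    def _recv_exact(self, n):
        data = b""
        while len(data) < n:
            chunk = self.sock.recv(n - len(data))
            if not chunk:
                raise ConnectionError("RCON connection closed")
            data += chunk
        return data

    def _recv_packet(self):
        (size,) = struct.unpack("<i", self._recv_exact(4))
        payload = self._recv_exact(size)
        req_id, _ptype = struct.unpack("<ii", payload[:8])
        return req_id, payload[8:-2].decode("utf-8", errors="replace")

    def _close(self):
        if self.sock is not None:
            try:
                self.sock.close()
            finally:
                self.sock = None

    def _connect(self):
        self.sock = self.opener(self.addr)
        self._send_packet(1, 3, self.password)  # packet type 3 = login
        req_id, _ = self._recv_packet()
        if req_id == -1:  # server answers -1 on a bad password
            self._close()
            raise RuntimeError("RCON authentication failed")

    def command(self, text):
        with self.lock:
            for attempt in range(2):  # second pass runs after a reconnect
                try:
                    if self.sock is None:
                        self._connect()
                    self._send_packet(2, 2, text)  # packet type 2 = command
                    return self._recv_packet()[1]
                except OSError:
                    self._close()
                    if attempt == 1:
                        raise


_pool = {}
_pool_lock = threading.Lock()


def get_rcon(host, port, password, opener=socket.create_connection):
    """Return the shared connection for host:port, creating it on first use."""
    key = f"{host}:{port}"
    with _pool_lock:
        if key not in _pool:
            _pool[key] = PersistentRCON(host, port, password, opener)
        return _pool[key]
```

Callers would use `get_rcon(host, port, password).command("list")` everywhere instead of opening a socket per command, so each server sees one connection regardless of command rate.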
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bumped from 20 rounds/tier to 50. Reduced sleep from 1s to 0.1s.
GPUs should run near 100% — Ollama queues requests internally.
mortdecai-sites container (CT 650) created on pve112.
Landing page live at mortdec.ai.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each cycle runs all three tiers at the same time on different GPUs:
- Tier 1 (drills) on GPU A
- Tier 2 (self-critique) on GPU B
- Tier 3 (adversarial) on GPU C
GPU assignments rotate each cycle for even wear.
3x throughput vs sequential. RCON handles concurrent commands.
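One way the rotation and parallel dispatch could look. The tier labels, the `run_tier` callback, and the use of threads are assumptions for illustration.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor

TIERS = ["tier1_drills", "tier2_self_critique", "tier3_adversarial"]


def assignments(gpus, cycle):
    """Map each tier to a GPU, shifting the mapping by one slot per cycle."""
    rotated = deque(gpus)
    rotated.rotate(-cycle)  # negative = rotate left
    return dict(zip(TIERS, rotated))


def run_cycle(gpus, cycle, run_tier):
    """Run all tiers at once, each on its assigned GPU."""
    plan = assignments(gpus, cycle)
    with ThreadPoolExecutor(max_workers=len(plan)) as pool:
        futures = {tier: pool.submit(run_tier, tier, gpu) for tier, gpu in plan.items()}
        return {tier: future.result() for tier, future in futures.items()}
```

With three GPUs the mapping repeats every three cycles, so each GPU serves each tier equally often.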
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Round-robin load balancing across three Ollama instances:
- 141:11434 (RTX 3090 Ti 24GB)
- 141:11435 (RTX 2080 Ti 11GB) — new second instance
- 179:11434 (RTX 4000 16GB)
Each tier cycles to a different GPU. 3x throughput overnight.
Cycles: Tier 1 drills → Tier 2 self-critique → Tier 3 adversarial → repeat
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Full rewrite reflecting current state:
- Model history v1→v4, infrastructure map, API spend
- Training data breakdown (3,477 total examples)
- Active TODOs: immediate, short-term, v5, infrastructure, community
- Risk hierarchy with permanence-based levels
- Key architecture decisions log
- Success criteria: v3 actual → v4 target → v5 goal
- Single-call enabled on prod (mortdecai-v3)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Python revert system (live on prod):
- Gamerule changes auto-revert after default timeout (5-10 min)
- User can specify duration: "disable mobs for 5 minutes"
- "permanently"/"forever" skips revert
- Setting back to default cancels pending revert
- Players notified of revert countdown
Training data (20 examples):
- 8 revert-aware gamerules with revert_after/revert_commands fields
- 12 drop/height/tp examples: intentional drops, safe tp, context-aware
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Validator hardcodes maximum durations for dangerous effects:
- Levitation: 15s max (player floats into sky and dies from fall)
- Wither: 30s max (drains health, can kill)
- Poison: 60s max
- Nausea: 30s max
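A hedged sketch of how such caps might be enforced on an `effect give` command. The function name and the choice to replace a missing or non-numeric duration with the cap are assumptions; only the four limits come from the list above. Requires Python 3.9+ for `str.removeprefix`.

```python
MAX_SECONDS = {"levitation": 15, "wither": 30, "poison": 60, "nausea": 30}


def clamp_effect(command):
    """Cap the duration of dangerous effects; return other commands unchanged."""
    parts = command.lstrip("/").split()
    # Expected shape: effect give <target> <effect> [seconds] [amplifier] ...
    if len(parts) < 4 or parts[0] != "effect" or parts[1] != "give":
        return command
    cap = MAX_SECONDS.get(parts[3].removeprefix("minecraft:"))
    if cap is None:
        return command  # not a capped effect
    if len(parts) == 4:
        parts.append(str(cap))  # no duration given: pin it to the cap
    elif parts[4].isdigit() and int(parts[4]) <= cap:
        return command  # already within the limit
    else:
        parts[4] = str(cap)  # too long, or a non-numeric value such as "infinite"
    return " ".join(parts)
```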
12 training examples: levitation safety, emergency clear, duration caps,
"I can't stop floating" → clear levitation + slow falling
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prod deployment:
- paper-ai and shrink-world switched from gemma3n:e4b to qwen3.5:9b
- Error correction: detects RCON errors (<--[HERE]), asks model to fix, retries
- Broadened error patterns: Unknown game mode, Unknown enchantment, etc.
- Fixed fire fallback matching "firework" as fire intent
- Fixed command format examples (WRONG vs RIGHT in prompt)
- max_tokens bumped to 600 for command calls
- Removed template workflow commands from sudo prompt
Dev server:
- Gemini 2.5 Flash ($0.15/$0.60 per M tokens) replaces Flash Lite
- 10 bots for ~$1-1.5/hr training data generation
- Dynamic pricing by model name in cost tracker
Branding:
- Rajdhani Bold as official Mortdecai font
- Logo variants: mortdecai + mortdec.ai in 6 fonts
- Whitelist page updated with Mortdecai branding + mortdec.ai domain
Whitelist UUID fix:
- Looks up real Mojang UUID via api.mojang.com
- Patches all whitelist.json files directly
- No more offline-mode UUID mismatches
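A sketch of the lookup-and-patch step, assuming the profile endpoint returns JSON with an undashed `id` field and that whitelist entries are `{"uuid", "name"}` objects. The function names and the injectable `fetch` parameter are illustrative.

```python
import json
import uuid
from urllib.parse import quote
from urllib.request import urlopen


def dashed_uuid(hex_id):
    """Convert a 32-character hex id into the dashed form used in whitelist.json."""
    return str(uuid.UUID(hex=hex_id))


def fetch_profile(name):
    """Ask api.mojang.com for a player's profile (performs a network call)."""
    url = f"https://api.mojang.com/users/profiles/minecraft/{quote(name)}"
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def patch_whitelist(entries, name, fetch=fetch_profile):
    """Overwrite a mismatched UUID for `name` in place; return True if changed."""
    real = dashed_uuid(fetch(name)["id"])
    changed = False
    for entry in entries:
        if entry["name"].lower() == name.lower() and entry["uuid"] != real:
            entry["uuid"] = real
            changed = True
    return changed
```

The caller would load each whitelist.json, run `patch_whitelist`, and write the file back only when it returns True.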
WorldEdit schematics:
- 77 schematics installed (villages, bridges, lighthouses, parks, etc.)
Mortdecai v4 training in progress: 63% complete on steel141
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bake-off: qwen3.5:9b base model, 147 cases:
- 70.1% command match (2x qwen3:8b baseline)
- 15.6% needed syntax fixes
- 29.9% miss (mostly God/prayer — no persona training)
- Avg 7.5s, median 5.7s (thinking tokens)
Model officially named Mortdecai.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Self-play (training/scripts/self_play.py):
- Model generates edge-case prompts across 9 categories
- Attempts commands via RCON, self-corrects on errors
- Successful traces → standard training examples
- Error correction traces → multi-turn tool-calling examples
- Anti-collapse: focuses on categories model is weakest in
- Ready for v4 deployment, not yet active
Qwen3.5-9B base model bake-off (147/1542 cases):
- 70.1% OK (vs 34% Qwen3-8B base) — 2x improvement
- 29.9% MISS (mostly God/prayer — no persona training)
- 15.6% needed syntax fixes
- Avg 7.5s response (thinking tokens)
- Strong v4 candidate: better base + tool-calling architecture
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
God Soul updated with quantity rules:
- Common (dirt/wood): max 320, Uncommon (iron/gold): max 128
- Rare (diamond/emerald): max 32, Very rare (netherite/elytra): max 4
- Forbidden (bedrock/command_block): never give
- Greedy → scaled back, Humble → generous within cap, Absurd → comedic
32 training examples: greedy(6), casual(6), humble(4), explicit(6),
forbidden(5), absurd(3), enchanted(2)
Dataset: 1,340 examples total
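The quantity tiers could be expressed as a lookup table like the one below. The item ids chosen for each tier, the `default_tier` fallback for unlisted items, and the function name are assumptions; the caps themselves are the ones listed above.

```python
MAX_QUANTITY = {"common": 320, "uncommon": 128, "rare": 32, "very_rare": 4, "forbidden": 0}

# Illustrative item-to-tier mapping; a real table would cover far more items.
TIER_OF = {
    "dirt": "common",
    "iron_ingot": "uncommon", "gold_ingot": "uncommon",
    "diamond": "rare", "emerald": "rare",
    "netherite_ingot": "very_rare", "elytra": "very_rare",
    "bedrock": "forbidden", "command_block": "forbidden",
}


def capped_quantity(item, requested, default_tier="uncommon"):
    """Clamp a requested amount to the cap for the item's rarity tier."""
    tier = TIER_OF.get(item, default_tier)  # the fallback tier is an assumption
    return max(0, min(requested, MAX_QUANTITY[tier]))
```

A forbidden item always yields 0, which a validator could treat as "drop the give command entirely".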
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
v3 training:
- 1,308 examples: curated + Claude-distilled + bot audit + recipes + command ref
- 1 epoch, rank 16, LR 1e-4, loss 0.55 (sweet spot)
- GGUF Q4_K_M exported, loaded in Ollama as qwen3-8b-mc-lora-v3
- Correct commands, no Chinese, proper safety refusals, dramatic God persona
API cascade for dev server:
- Stage 1: Claude Haiku ($20 budget, ~$11 spent)
- Stage 2: Gemini 2.5 Flash Lite ($20 budget)
- Stage 3: qwen3-8b-mc-lora-v3 (free, local)
- Gemini call function with persistent cost tracking
- Full status report printed at each $1 milestone
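The cascade can be sketched as "first stage with budget left wins". The `Stage` dataclass and function names are hypothetical; the stage order and budgets mirror the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Stage:
    name: str
    budget: Optional[float]  # dollars; None means free/local with no cap
    spent: float = 0.0


def pick_stage(stages):
    """Return the first stage that still has budget remaining."""
    for stage in stages:
        if stage.budget is None or stage.spent < stage.budget:
            return stage
    raise RuntimeError("every stage has exhausted its budget")


def record_cost(stage, cost):
    """Add a call's cost; return True when a whole-dollar milestone was crossed."""
    before = int(stage.spent)
    stage.spent += cost
    return int(stage.spent) > before
```

Persisting `spent` between restarts would be handled separately, and the True return from `record_cost` would be the hook for printing the status report.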
Data collection: 2,677 dev audit entries and growing
Bot status printer budget display fix
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Merged: 964 curated + 344 Claude-distilled = 1,308 total
All examples tagged with risk_level (0-4)
Model outputs risk classification in training target
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- All 644 examples tagged: 0=blocked(15), 1=refuse(33), 2=warn(24), 3=normal(498), 4=generous(74)
- Training output now includes risk_level field for decision transparency
- Model learns to classify risk before generating commands
- Validator can sanity-check: risk 0-1 should have empty commands
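The sanity check in the last bullet might look like this. The `commands` key and the function name are assumptions; the level names follow the tagging scheme above.

```python
RISK_LABELS = {0: "blocked", 1: "refuse", 2: "warn", 3: "normal", 4: "generous"}


def check_example(example):
    """Return a list of problems found in one training example."""
    problems = []
    risk = example.get("risk_level")
    if risk not in RISK_LABELS:
        problems.append(f"missing or invalid risk_level: {risk!r}")
    elif risk <= 1 and example.get("commands"):
        # blocked/refuse examples should not emit any commands
        problems.append(f"risk {risk} ({RISK_LABELS[risk]}) must have empty commands")
    return problems
```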
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Haiku cost persists to /var/log/mc_anthropic_cost.json (survives restarts)
- Status printer reads persistent cost file instead of journalctl
- Seeded at $3.08 estimated cumulative spend
- Whitelist app: Sethian Dark theme, mission description, server info
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Knowledge corpus (knowledge/mc-data/):
- 1505 items, 886 crafting recipes, 1166 blocks from minecraft-data 1.21.11
- Recipe dependency tree builder (knowledge/build_recipe_tree.py)
- Crafting chain training: "give me everything to make X from scratch"
- Smelting recipes, version awareness examples
Training data (644 examples total):
- 107 command syntax reference examples (every command + common errors)
- 176 recipe/crafting chain examples (63 crafting, 103 material-giving, 11 smelting)
- 344 Claude-distilled examples (222 sudo + 122 god via Haiku)
- Live bot audit data ingested (128 examples from dev server)
Swarm bots:
- Swimming/water escape logic
- Door opening
- Context-aware prayers (inventory, health, time, depth)
- Prefix enforcement on all Gemini/Dolphin prompts
GitHub log scraper (data/scrape_server_logs.py):
- Searches GitHub for Minecraft server logs with commands
- Strict 1.20.5+ version filter
- Extracts command pairs, converts to training format
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Swarm bots (ingame/swarm_bots.js):
- 10 survival bots with generated names (SwiftWolf, DarkWolf, etc.)
- All bots wander, take damage, auto-respawn, pray when hurt
- Gemini + Dolphin(5%) + Multilingual(3%) prompt generation
- 20-60s interaction interval per bot
Distillation results:
- 222 sudo examples via Haiku ($0.28)
- 122 god examples via Haiku ($0.37) — with God Soul personality
- Total: 344 distilled, $0.65 spent of $5 budget
- RCON validation: 74.7% fully valid, 30 real errors out of ~1000 commands
validate_distilled.py:
- Executes distilled commands on live server via RCON
- Distinguishes real errors from benign (no player online)
- Tags each example with validation status
Dev server switched to Claude Haiku via Anthropic API:
- llm_provider: anthropic with $5 budget cap
- Auto-fallback to Ollama when budget exhausted
- Cost tracking with logging
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
God Soul (agent/prompts/god_soul.md):
- Adapted from Claude's soul framework for the Minecraft God character
- Defines identity, principals hierarchy, decision-making framework
- Spectrum of responses (generous→silence), risk awareness, multilingual divinity
- Honesty within character, intervention guidelines
- Deployed to both prod and dev servers
System prompts updated:
- God prompt loads soul document dynamically
- Intervention prompt references soul for personality guidance
- Both include multilingual instruction (match player's language)
Distillation pipeline (training/scripts/distill.py):
- Sends all training examples through Claude API
- Haiku for sudo ($0.25), Sonnet for god ($0.50)
- Budget-capped, cost-tracked, --dry-run supported
- Outputs distilled.jsonl with Claude-quality responses
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ingested 128 new examples from bot-driven data collection.
Dropped: 86 duplicates, 19 language mismatches, 10 prompt leaks, 19 empty.
Changed default epochs from 3 to 1 (previous run overfit at loss 0.10).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
data/ingest_audit.py:
- Pulls training audit logs from CT 644 (dev + prod)
- Filters: language mismatch (Chinese output for English input), system
prompt leaks, empty responses, duplicates
- Keeps multilingual examples where input/output languages match
- Converts to dataset schema, appends to seed_dataset.jsonl
- --dry-run to preview, --source dev/prod/both
Tested: 237 entries → 112 kept (16 lang mismatch, 10 prompt leak, 86 dupe, 13 empty dropped)
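A simplified sketch of the four filters, not the actual ingest_audit.py logic. The `input`/`output` field names, the CJK range check used as a language test, and the leak marker string are all assumptions.

```python
import re
from collections import Counter

CJK = re.compile(r"[\u4e00-\u9fff]")  # common Chinese ideographs
LEAK_MARKERS = ("system prompt",)     # hypothetical leak indicator


def filter_entries(entries):
    """Split audit entries into kept examples and a tally of drop reasons."""
    kept, dropped, seen = [], Counter(), set()
    for entry in entries:
        prompt = entry.get("input", "").strip()
        output = entry.get("output", "").strip()
        if not output:
            dropped["empty"] += 1
        elif bool(CJK.search(output)) != bool(CJK.search(prompt)):
            dropped["lang_mismatch"] += 1  # one side is Chinese, the other is not
        elif any(marker in output.lower() for marker in LEAK_MARKERS):
            dropped["prompt_leak"] += 1
        elif (prompt, output) in seen:
            dropped["duplicate"] += 1
        else:
            seen.add((prompt, output))
            kept.append(entry)
    return kept, dropped
```

Entries where both sides contain Chinese pass the language check, which matches the rule of keeping multilingual examples whose input and output languages agree.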
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>