Commit Graph

8 Commits

Author SHA1 Message Date
Seth 910d7b4ca7 Qwen3.5-9B bake-off results, model named Mortdecai
Bake-off: qwen3.5:9b base model, 147 cases:
  - 70.1% command match (2x qwen3:8b baseline)
  - 15.6% needed syntax fixes
  - 29.9% miss (mostly God/prayer — no persona training)
  - Avg 7.5s, median 5.7s (thinking tokens)

Model officially named Mortdecai.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-19 19:46:00 -04:00
Seth 78031d16c0 Risk gradient (0-5), updated system prompts, 233 examples
Risk gradient system:
- All 233 training examples tagged with risk_level (0-5)
- 0=blocked(15), 1=refuse(9), 2=warn(17), 3=normal(169), 4=generous(23)
- Schema updated with risk_level and scoring_mode fields
- Eval harness uses risk_level for safety scoring

System prompts rewritten:
- Shared syntax rules and risk gradient reference across all modes
- Sudo: permission level 4, do what admin asks, only refuse level 0-1
- God: permission level 2-4 (mood-dependent), character-driven decisions
- God_system: permission level 3, 80% benevolent / 15% mischievous / 5% wrathful

Data:
- 20 new live playtest examples from training audit log (233 total)
- 43 wrong→right pairs (17 from validator repairs)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 16:14:54 -04:00
Seth 9d789d2524 Three-tier constraint model, mode-aware eval, boundary examples, playtest tooling
Eval harness:
- Mode-aware scoring: sudo=strict (exact match), pray/god=soft (category match,
  in-character, appropriate intensity)
- New metrics: cmd_category_match, appropriate_intensity, scoring_mode breakdown
- Eval defaults to steel141 (192.168.0.141) — prod GPU reserved for serving

Dataset (213 examples):
- Added 31 boundary/adversarial examples (safety edges, abstention, near-boundary)
- Updated pray example reasoning: character-driven logic, not prescriptive outputs
- Tagged pray examples with scoring_mode=soft

Playtest tooling:
- whitelist.sh: add/remove/list across all 3 servers
- FRIENDS_INVITE.md + Discord version: playtester recruitment docs
- Server addresses and implementation details for both training servers

PLAN.md:
- Three-tier constraint model documented (sudo/pray/god_system)
- Success criteria split by scoring mode
- All session decisions logged

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 15:57:01 -04:00
Seth 38b9a02e45 Phase 2: eval harness, 182 examples, live bake-off, playtest infrastructure
- Expanded dataset from 31 to 182 examples (45 manual + 106 extracted from server logs)
- Built eval/harness.py with per-category breakdowns and baseline tracking
- Built eval/live_bakeoff.py for RCON-verified model comparison on live server
- Extracted training data from prayer logs, sudo logs, and bug reports on CT 644
- Added Reddit post draft and modmail for playtester recruitment
- Updated server context: all servers now online-mode=false + whitelist
- Updated PLAN.md with Phase 2 progress

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 13:38:12 -04:00
Seth 48b627d498 Add LoRA training scripts and fix bake-off token budget
- training/scripts/train_lora.py: Unsloth QLoRA trainer for qwen3:8b
- training/scripts/train_lora.sh: Launch script for steel141 RTX 3090 Ti
- eval/bakeoff.py: Fixed token budget (400->1500) that caused qwen3
  models to exhaust tokens on thinking, added --no-think flag
- agent/serve.py: Default model changed to gemma3n:e4b

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 10:40:18 -04:00
Seth 6fbab8045c Add bake-off results summary (7 models, 31 examples)
gemma3n:e4b wins for production serving (80.6% cmd match, 100% safety).
qwen3:8b recommended as fine-tuning base. Full per-model analysis and
scoring methodology documented.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 09:03:40 -04:00
Seth 7da28c8800 Add model bake-off harness and base model research
Bake-off tested 7 models on 31 seed examples via GPU-accelerated Ollama
on node-197 RTX 4000. gemma3n:e4b leads for serving (80.6% cmd match,
100% safety, 5.9s). qwen3:8b recommended as fine-tuning base (Apache 2.0,
best syntax quality, strong ecosystem). Full research in MODEL_RESEARCH.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 08:54:11 -04:00
Seth 827850b8d7 Initial project scaffold: dataset schema, 31 seed training examples, Mineflayer bot framework, and 7-phase roadmap
- IDEA.md: project scope (Minecraft ops AI assistant via qwen3-coder LoRA/SFT)
- PLAN.md: complete roadmap with prior art analysis, architecture, phased plan, dev server docs
- data/schema.json: training example JSON Schema with negative_output support
- data/processed/seed_dataset.jsonl: 31 validated examples from repair code, prayer logs, session history
- data/validate_dataset.py: schema validator with summary statistics
- ingame/: Mineflayer bot framework (test_connect, spawn_bots, aware_bots with full event logging)
- Directory structure for knowledge/, eval/, training/, agent/ (Phase 1.3+ work)
2026-03-18 01:51:28 -04:00