Commit Graph

2 Commits

Author SHA1 Message Date
Seth 9abf9238c5 3-tier self-play: command drills, self-critique, adversarial
Tier 1 — Command drills:
  Random seed prompts → generate commands → RCON validates
  Teaches: accurate command syntax

Tier 2 — Single-shot self-critique:
  Model invents a tricky prompt AND responds in one call
  RCON validates the self-generated commands
  Teaches: edge-case awareness, self-evaluation

Tier 3 — Adversarial self-play:
  Session A generates challenging prompts
  Fresh Session B responds cold (no shared context with Session A, so it can't cheat)
  RCON validates; the model self-corrects on errors
  Teaches: robustness, generalization
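
The Tier 3 flow above could be sketched roughly as follows. This is a minimal illustration, not the code in this commit: `adversarial_round`, `Trace`, and the three callables (`generate_prompt`, `respond`, `validate`) are hypothetical names, and the RCON call is stood in for by a plain function that returns an error string or `None`.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Trace:
    """One prompt plus every (command, error) attempt made for it."""
    prompt: str
    attempts: list = field(default_factory=list)

    @property
    def succeeded(self) -> bool:
        # The round succeeded if the last attempt produced no error.
        return bool(self.attempts) and self.attempts[-1][1] is None


def adversarial_round(
    generate_prompt: Callable[[], str],             # stands in for Session A
    respond: Callable[[str, Optional[str]], str],   # stands in for fresh Session B
    validate: Callable[[str], Optional[str]],       # stands in for RCON validation
    max_retries: int = 2,                           # assumed retry budget
) -> Trace:
    prompt = generate_prompt()
    trace = Trace(prompt)
    error: Optional[str] = None
    for _ in range(max_retries + 1):
        # Session B only ever sees the prompt and the most recent error.
        command = respond(prompt, error)
        error = validate(command)
        trace.attempts.append((command, error))
        if error is None:
            break
    return trace
```

A trace with a single attempt is a clean success; a trace with several attempts records the error-correction path.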

Usage: --tier 1|2|3|all --rounds N --focus category
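
One way the usage line above might map onto `argparse` is sketched below. The flag names come from the usage line; the defaults, help strings, and the `tiers_to_run` helper are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Self-play runner (sketch)")
    # --tier 1|2|3|all ; the default of "all" is an assumption
    parser.add_argument("--tier", choices=["1", "2", "3", "all"], default="all",
                        help="which self-play tier to run")
    # --rounds N ; the default of 10 is an assumption
    parser.add_argument("--rounds", type=int, default=10,
                        help="number of rounds per tier")
    # --focus category ; optional, no restriction when omitted
    parser.add_argument("--focus", default=None,
                        help="restrict prompts to one category")
    return parser


def tiers_to_run(tier: str) -> list:
    """Expand the --tier value into the list of tier numbers to execute."""
    return [1, 2, 3] if tier == "all" else [int(tier)]
```

For example, `build_parser().parse_args(["--tier", "3", "--rounds", "5"])` yields `tier="3"` and `rounds=5`, and `tiers_to_run("3")` returns `[3]`.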

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-19 19:39:33 -04:00
Seth c947fc3fa9 Self-play loop, Qwen3.5-9B bake-off: 70% base accuracy
Self-play (training/scripts/self_play.py):
- Model generates edge-case prompts across 9 categories
- Attempts commands via RCON, self-corrects on errors
- Successful traces → standard training examples
- Error correction traces → multi-turn tool-calling examples
- Anti-collapse: focuses on categories model is weakest in
- Ready for v4 deployment, not yet active
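
The anti-collapse idea above (spend more rounds on the weakest categories) could be sketched as failure-rate-weighted sampling. This is an assumed approach, not necessarily what `self_play.py` does; `category_weights`, `pick_category`, and the smoothing constant are hypothetical.

```python
import random


def category_weights(stats: dict, smoothing: float = 1.0) -> dict:
    """Map each category to a smoothed failure rate.

    stats maps category -> (successes, attempts). Smoothing keeps
    untried categories at 0.5 so they are still sampled.
    """
    weights = {}
    for category, (successes, attempts) in stats.items():
        failures = attempts - successes
        weights[category] = (failures + smoothing) / (attempts + 2 * smoothing)
    return weights


def pick_category(stats: dict, rng: random.Random) -> str:
    """Sample one category, favoring those with higher failure rates."""
    weights = category_weights(stats)
    categories = list(weights)
    return rng.choices(categories, weights=[weights[c] for c in categories], k=1)[0]
```

Because every weight stays strictly positive, strong categories are still sampled occasionally, which is what keeps the generated data from collapsing onto a single category.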

Qwen3.5-9B base model bake-off (147 of 1542 cases):
- 70.1% OK (vs 34% for Qwen3-8B base), roughly a 2x improvement
- 29.9% MISS (mostly God/prayer prompts; the base model has no persona training)
- 15.6% needed syntax fixes
- Avg response time 7.5 s (thinking tokens included)
- Strong v4 candidate: better base + tool-calling architecture
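
As a quick sanity check on the bake-off figures above, the percentages can be converted back to case counts. The sketch assumes all three percentages are fractions of the 147 evaluated cases; the commit message does not say whether the syntax-fix cases are a subset of the OK cases.

```python
evaluated = 147      # cases actually run
suite_size = 1542    # full test suite
reported = {"OK": 70.1, "MISS": 29.9, "needed syntax fixes": 15.6}

# Implied whole-case counts, assuming each percentage is of the 147 cases.
counts = {label: round(pct / 100 * evaluated) for label, pct in reported.items()}
print(counts)  # {'OK': 103, 'MISS': 44, 'needed syntax fixes': 23}

# OK and MISS together account for every evaluated case.
assert counts["OK"] + counts["MISS"] == evaluated

# Improvement over the 34% Qwen3-8B base figure.
print(round(reported["OK"] / 34, 2))  # 2.06

# Share of the full suite covered by this run, in percent.
print(round(evaluated / suite_size * 100, 1))  # 9.5
```

The run covers under a tenth of the suite, so the 70.1% figure is an estimate from a sample rather than a full-suite score.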

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-19 19:35:57 -04:00