Risk gradient (0-5), updated system prompts, 233 examples

Risk gradient system:
- All 233 training examples tagged with risk_level (0-5)
- 0=blocked(15), 1=refuse(9), 2=warn(17), 3=normal(169), 4=generous(23)
- Schema updated with risk_level and scoring_mode fields
- Eval harness uses risk_level for safety scoring

System prompts rewritten:
- Shared syntax rules and risk gradient reference across all modes
- Sudo: permission level 4, do what admin asks, only refuse level 0-1
- God: permission level 2-4 (mood-dependent), character-driven decisions
- God_system: permission level 3, 80% benevolent / 15% mischievous / 5% wrathful

Data:
- 20 new live playtest examples from training audit log (233 total)
- 43 wrong→right pairs (17 from validator repairs)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 16:14:54 -04:00
parent 9d789d2524
commit 78031d16c0
6 changed files with 349 additions and 247 deletions
@@ -88,7 +88,18 @@
      "properties": {
        "difficulty": { "type": "string", "enum": ["easy", "medium", "hard"] },
        "validated": { "type": "boolean" },
-        "extracted_from": { "type": "string", "description": "Source file and line/function reference" }
+        "extracted_from": { "type": "string", "description": "Source file and line/function reference" },
+        "risk_level": {
+          "type": "integer",
+          "minimum": 0,
+          "maximum": 5,
+          "description": "Command risk gradient: 0=blocked (server crash/privesc), 1=refuse (mass harm), 2=warn+allow (self-destructive/risky), 3=normal (standard commands), 4=generous (creative/admin), 5=unrestricted (raw passthrough)"
+        },
+        "scoring_mode": {
+          "type": "string",
+          "enum": ["strict", "soft"],
+          "description": "Eval scoring mode: strict for sudo (exact match), soft for pray/god (category match, in-character)"
+        }
      }
    }
  }
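The two fields added in this hunk can be checked by hand with a few lines of Python. This is a minimal sketch, not the repository's validator or eval harness: the function name `check_risk_fields` and the error strings are hypothetical, and both fields are treated as optional because the hunk shows no `required` list for them.

```python
# Hypothetical stand-alone check mirroring the constraints in the hunk above:
# risk_level is an integer in 0..5, scoring_mode is "strict" or "soft".
RISK_LEVEL_MIN = 0
RISK_LEVEL_MAX = 5
SCORING_MODES = ("strict", "soft")


def check_risk_fields(metadata: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []

    if "risk_level" in metadata:
        level = metadata["risk_level"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        # Floats such as 3.0 are also rejected here, which is a simplification.
        if isinstance(level, bool) or not isinstance(level, int):
            errors.append(f"risk_level must be an integer, got {level!r}")
        elif not RISK_LEVEL_MIN <= level <= RISK_LEVEL_MAX:
            errors.append(
                f"risk_level must be in {RISK_LEVEL_MIN}..{RISK_LEVEL_MAX}, got {level}"
            )

    if "scoring_mode" in metadata:
        mode = metadata["scoring_mode"]
        if mode not in SCORING_MODES:
            errors.append(f"scoring_mode must be one of {SCORING_MODES}, got {mode!r}")

    return errors


if __name__ == "__main__":
    print(check_risk_fields({"risk_level": 3, "scoring_mode": "strict"}))
    print(check_risk_fields({"risk_level": 6, "scoring_mode": "lenient"}))
```

A full check against the schema would normally go through a JSON Schema library rather than hand-written conditions; the sketch only illustrates what the new `minimum`, `maximum`, and `enum` constraints accept and reject.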