Phase 2: eval harness, 182 examples, live bake-off, playtest infrastructure

- Expanded dataset from 31 to 182 examples (added 45 manual + 106 extracted from server logs)
- Built eval/harness.py with per-category breakdowns and baseline tracking
- Built eval/live_bakeoff.py for RCON-verified model comparison on live server
- Extracted training data from prayer logs, sudo logs, and bug reports on CT 644
- Added Reddit post draft and modmail for playtester recruitment
- Updated server context: all servers now online-mode=false + whitelist
- Updated PLAN.md with Phase 2 progress

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 13:38:12 -04:00
parent eaa9e0c26b
commit 38b9a02e45
10 changed files with 1522 additions and 31 deletions
@@ -0,0 +1,123 @@
# Reddit Post
**Subreddit:** r/admincraft — could also work on r/Minecraft or r/mcservers
**Title:** Looking for a handful of playtesters for an experimental Minecraft server feature (1.21, Java)
---
**Body:**
I'm working on a custom feature for my 1.21 Java Edition server and I need some players to try it out and give feedback. It involves AI-powered in-game interactions — you'll be able to do some things through chat that you normally can't on a vanilla server.
I don't want to over-explain it before people try it — half the fun is seeing how players react to it cold. What I will say:
- It's something you interact with through in-game chat
- It does things in the world based on what you say
- It's entertaining, occasionally unpredictable, and I want to see what happens when real players poke at it
**Details:**
- Whitelisted server, Java Edition 1.21.x, hosted in the US
- Looking for ~10 players for a few sessions over the next couple weeks
- Sessions will be scheduled around availability (probably evenings/weekends)
- Your in-game chat during these sessions will be logged for development purposes — no personal data beyond your Minecraft username
- This is a hobby project, not commercial
If this sounds interesting, fill out the short form below and I'll follow up with details and the server IP.
[FORM LINK]
---
*Happy to answer general questions in the comments, but I'm going to be vague about the specifics on purpose.*
---
# Form Questions
**Google Form / Typeform — "Playtest Application"**
Page header: *Quick form to make sure we get a good group. Takes ~2 minutes.*
---
### 1. What's your Minecraft Java Edition username?
*(Short answer, required)*
**Purpose:** Whitelist + Mojang API verification that the account exists.
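A minimal sketch of how the account check could work before whitelisting. It assumes the public Mojang profile-lookup endpoint and the usual Java Edition name rule (3 to 16 letters, digits, or underscores). The function names are hypothetical, and the endpoint's rate limits and not-found status should be confirmed against current documentation before relying on this.

```python
import json
import re
import urllib.error
import urllib.request

# Assumption: current Java Edition names are 3-16 chars of [A-Za-z0-9_].
USERNAME_RE = re.compile(r"[A-Za-z0-9_]{3,16}")
# Assumption: public username -> profile lookup endpoint.
PROFILE_URL = "https://api.mojang.com/users/profiles/minecraft/{name}"


def is_plausible_username(name: str) -> bool:
    """Cheap local format check, run before any network call."""
    return USERNAME_RE.fullmatch(name) is not None


def lookup_profile(name: str, opener=urllib.request.urlopen):
    """Return the profile dict for an existing account, else None.

    `opener` is injectable so the lookup can be exercised offline.
    """
    if not is_plausible_username(name):
        return None
    try:
        with opener(PROFILE_URL.format(name=name), timeout=10) as resp:
            if resp.status != 200:  # e.g. 204 when no such account
                return None
            return json.loads(resp.read())
    except urllib.error.HTTPError:  # e.g. 404 when no such account
        return None
```

Rejecting malformed names locally saves API calls on obviously bad form entries. A `None` result here only means "not verified" and is worth a manual look rather than an automatic rejection.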
---
### 2. How long have you been playing Minecraft?
*(Multiple choice, required)*
- Less than a year
- 1–3 years
- 3+ years
**Purpose:** Context. Not a dealbreaker either way.
---
### 3. Have you played on community/SMP servers before?
*(Multiple choice, required)*
- Yes, regularly
- A few times
- No, mostly singleplayer
**Purpose:** SMP players understand shared-world norms.
---
### 4. What interests you about this? (pick all that apply)
*(Checkboxes, required)*
- Curious what the feature actually is
- Helping test something new
- Trying to break things (in a helpful way)
- Looking for a server to hang out on
**Purpose:** "Looking for a server" alone is a soft red flag — they may not engage. Best candidates are curious or want to help test.
---
### 5. You're testing a new server feature and it refuses to do something you asked. What do you do?
*(Long answer, required)*
**Purpose:** The key screener. Good: curiosity, rephrasing, reporting the issue. Red flags: fixation on bypassing/forcing it, or frustration that reads as entitlement.
---
### 6. Have you ever been banned from a server? If so, what happened?
*(Long answer, required)*
**Purpose:** Honesty check. Minor/old bans with self-awareness are fine. Defensiveness or serial bans are red flags.
---
### 7. When are you generally available? (timezone + rough hours)
*(Short answer, required)*
**Purpose:** Scheduling. Also filters zero-effort applications.
---
### 8. Anything else?
*(Long answer, optional)*
**Purpose:** Personality signal. Thoughtful responses correlate with better testers.
---
# Scoring Rubric (internal, not shown to applicants)
| Signal | Green | Yellow | Red |
|--------|-------|--------|-----|
| Q4 (interest) | Multiple boxes, especially "curious" or "test" | Single box, but reasonable | Only "looking for a server" |
| Q5 (refusal) | Curious, tries alternatives, reports it | Short but benign ("I'd move on") | Wants to force/bypass, hostile tone |
| Q6 (ban history) | Clean or honest with context | Vague but not defensive | Defensive, hostile, or serial bans |
| Overall effort | Complete sentences, reads like a person | Terse but present | Single-word answers, empty fields |
Auto-approve: all signals green. Manual review: any yellow and no red. Reject: any red.
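The decision rule can be sketched in a few lines, assuming each of the four signals has already been rated by hand. The signal keys, the `Rating` enum, and the `triage` function are hypothetical names for illustration. Checking red before yellow follows from "Reject: any red".

```python
from enum import Enum


class Rating(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


# Hypothetical keys for the four rubric rows.
SIGNALS = ("q4_interest", "q5_refusal", "q6_ban_history", "overall_effort")


def triage(ratings: dict) -> str:
    """Map per-signal ratings to auto_approve / manual_review / reject."""
    missing = [s for s in SIGNALS if s not in ratings]
    if missing:
        raise ValueError(f"missing ratings: {missing}")
    values = [ratings[s] for s in SIGNALS]
    if Rating.RED in values:  # any red rejects, even alongside yellows
        return "reject"
    if Rating.YELLOW in values:  # no red, at least one yellow
        return "manual_review"
    return "auto_approve"  # every signal is green
```

For example, an applicant rated green everywhere except a terse-but-benign Q5 answer lands in `manual_review`, while a single red on Q6 yields `reject` regardless of the other signals.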