docs: bootstrap repo with bakeoff results and game-mechanics idea bank

This repo opens with the design-discovery work completed before any product code is written. Two model bakeoffs against gemma4:8b/26b/31b on a local Ollama established that:

- Whole-puzzle generation in the Connections shape is unreliable on Gemma 4 (gemma4:31b ~50% structural-pass, gemma4:26b ~20-30%); 31b is intentionally out of project scope, so the generation route is harder still.
- Atomic semantic-judging skills are reliable: 87.5%/93.75%/100% (8B/26b/31b) on JUDGE; *all three models* scored 10/10 on CREATIVE_ACCEPT, i.e. fair judging of player-INVENTED categories. That is the structural unlock vs static hand-curated word games.

The README contains the full writeup, the test bench, and a brainstormed bank of 10 distinct game-mechanics ideas across the fast/medium/slow tempo range, plus a primitives table for recombination.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 23:09:46 -04:00
commit 5a2a02e483
10 changed files with 4659 additions and 0 deletions
@@ -0,0 +1,235 @@
+# seth_semantic_game
+
+**Working title.** A self-hosted word game built around an LLM's ability to fairly judge *player-invented* semantic categories in real time — something static, hand-curated word games structurally cannot do.
+
+This repo documents the design discovery process, including two model bakeoffs that picked the architecture and a brainstormed bank of game-mechanics ideas that the actual product will draw from.
+
+---
+
+## TL;DR
+
+- **Seed idea:** clone NYT Connections (16 words → 4 hidden groups of 4) with a local LLM doing the curation.
+- **Seed idea died fast:** unaided whole-puzzle generation on Gemma 4 yields broken puzzles (duplicate tiles, mislabeled categories, fake wordplay) roughly half the time on the best model tested, and more often on 26b. See [docs/reference/gemma-generation-bakeoff-2026-04-27-221751.md](docs/reference/gemma-generation-bakeoff-2026-04-27-221751.md).
+- **The actual unlock:** Gemma 4 reliably judges whether a player-supplied category fits a player-supplied set of words. Across 35 hand-labeled cases on three model sizes, **CREATIVE_ACCEPT scored 10/10 on every model** including the 8B variant at 0.7s per call. JUDGE landed at 87.5% / 93.75% / 100% (8B / 26b / 31b). See [docs/reference/gemma-semantic-bakeoff-2026-04-27-224800.md](docs/reference/gemma-semantic-bakeoff-2026-04-27-224800.md).
+- **The pivot:** stop trying to generate Connections. Build games where the *player* invents the groupings and the LLM is the live, fair judge. That's what the static format can't do.
+- **Models in scope:** `gemma4:latest` (8B) for live judging, `gemma4:26b` for offline puzzle prep / critique. `gemma4:31b` was tested and is more accurate, but is intentionally out of scope for this project.
+
+---
+
+## What we did
+
+Two experiments, both reproducible from `scripts/` against a local Ollama (point `OLLAMA_HOST` at your instance; defaults to `http://localhost:11434`).
+
+### Experiment 1 — Generation bakeoff
+
+**Question:** can Gemma 4 generate a Connections-quality 16-word / 4-group puzzle in one shot?
+
+**Setup:** 5 puzzles per model on gemma4:26b and gemma4:31b. Strict JSON schema requesting groups + difficulty bands + claimed overlap-trap words. No `format: "json"` (a known Gemma 4 + Ollama hang); JSON is requested in the prompt and parsed client-side, with up to 3 retries and temperature bumped +0.1 on each retry.
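The parse-and-retry loop described in the setup could look roughly like the sketch below. This is not the bakeoff script's code: `extract_json`, `generate_with_retries`, and the base temperature of 0.7 are hypothetical names and values chosen for illustration, and `generate` stands in for whatever function actually calls the model.

```python
import json
from typing import Callable, Optional


def extract_json(text: str) -> Optional[dict]:
    """Return the first JSON object embedded in free-form model output, or None."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char != "{":
            continue
        try:
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue  # this brace did not open valid JSON; try the next one
        if isinstance(obj, dict):
            return obj
    return None


def generate_with_retries(
    generate: Callable[[float], str],   # hypothetical: takes a temperature, returns raw text
    base_temperature: float = 0.7,      # assumed starting value, not from the bakeoff
    max_retries: int = 3,
    bump: float = 0.1,
) -> Optional[dict]:
    """Try once, then retry up to max_retries times with a warmer temperature."""
    for attempt in range(max_retries + 1):
        temperature = base_temperature + bump * attempt
        parsed = extract_json(generate(temperature))
        if parsed is not None:
            return parsed
    return None
```

Scanning for the first decodable object tolerates chatty preambles and markdown fences around the JSON, which is the usual cost of giving up server-side JSON enforcement.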
+
+**Results:**
+
+| Model | Pass | Borderline | Fail | Avg s/puzzle |
+|---|---|---|---|---|
+| `gemma4:26b` | 1 | 1 + 1 partial | 2 | 5.2 |
+| `gemma4:31b-it-q4_K_M` | 2 | 2 | 1 | 18.2 |
+
+Failure modes ranked by severity:
+
+1. **Structural violations** — duplicate or near-duplicate words on the 16-tile board. *Trivially detectable.*
+2. **Broken category logic** — words listed in a category they don't actually fit (`DELUXE` doesn't start with the full Greek letter "DELTA"; `LIBRA` isn't a "type of scale"). *Hard to detect deterministically — needs a critique pass.*
+3. **Redundant categories** — two groups themed on the same concept. Detectable.
+4. **Self-graded traps don't always hold up** — Gemma's claimed `intended_traps` were sometimes nonsense (`PRESS` claimed to fit "Words after BLOOD," but the compound is *blood pressure*, not *blood press*). **Important consequence: the same model cannot be trusted to grade its own output.**
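Failure mode 1 is the one that can be caught without any model call. A minimal sketch of such a deterministic filter follows, assuming a hypothetical puzzle shape of `{category: [words]}`; the near-duplicate rule (strip a trailing `S`) is a deliberately crude stand-in, and broken category logic or redundant categories are not covered here.

```python
from collections import Counter
from typing import Dict, List


def normalize(word: str) -> str:
    """Crude near-duplicate key: uppercase letters only, trailing S stripped."""
    letters = "".join(ch for ch in word.upper() if ch.isalpha())
    if len(letters) > 3 and letters.endswith("S"):
        return letters[:-1]
    return letters


def structural_problems(groups: Dict[str, List[str]]) -> List[str]:
    """List deterministic violations for a {category: [words]} puzzle."""
    problems = []
    if len(groups) != 4:
        problems.append(f"expected 4 groups, got {len(groups)}")
    for category, words in groups.items():
        if len(words) != 4:
            problems.append(f"{category!r} has {len(words)} words, expected 4")
    # Count tiles by normalized key so OAK and Oaks collide.
    keys = Counter(normalize(w) for words in groups.values() for w in words)
    for key, count in sorted(keys.items()):
        if count > 1:
            problems.append(f"duplicate or near-duplicate tile: {key}")
    return problems
```

An empty list means the puzzle passed the structural gate only; it says nothing about whether the categories are actually correct.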
+
+This was decisive for the project direction. Unaided generation isn't viable, and we're explicitly capping at 26b, which is the *less* reliable generator. So we need a different game shape: one that doesn't depend on the LLM generating finished puzzles unaided.
+
+### Experiment 2 — Semantic-skill bakeoff
+
+**Question:** instead of whole-puzzle generation, can Gemma reliably perform the atomic skills a live game would need? Specifically:
+
+- **JUDGE** — given a category and 4 words, does Gemma correctly say yes/no on whether they all fit?
+- **CREATE** — given a category, does Gemma produce 4 tightly-fitting words?
+- **CREATIVE_ACCEPT** — given 4 words and a *player-proposed* category, does Gemma fairly judge whether the category validates the grouping (even if it differs from any "intended" category)?
+
+The third one is the design-relevant one. If it works, the game can let players invent their own groupings — which is structurally impossible for a hand-curated static format.
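To make the CREATIVE_ACCEPT shape concrete, here is a sketch of a prompt builder and verdict parser. The prompt wording and the `{"accept": ..., "reason": ...}` reply schema are assumptions made for illustration; the actual prompts live in the bakeoff script and may differ.

```python
import json
from typing import List, Optional


def build_creative_accept_prompt(words: List[str], category: str) -> str:
    """Assemble a single-turn judging prompt (wording is illustrative only)."""
    word_list = ", ".join(w.upper() for w in words)
    return (
        "You are the judge in a word game.\n"
        f"Words: {word_list}\n"
        f"Player's proposed category: {category}\n"
        "Do ALL of the words genuinely fit this category?\n"
        'Reply with JSON only: {"accept": true or false, "reason": "<one sentence>"}'
    )


def parse_verdict(reply: str) -> Optional[bool]:
    """Return True/False for a clean verdict, None if the reply is unusable."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        verdict = json.loads(reply[start:end + 1]).get("accept")
    except (json.JSONDecodeError, AttributeError):
        return None
    # Only a real JSON boolean counts; strings like "yes" are treated as unusable.
    return verdict if isinstance(verdict, bool) else None
```

Returning `None` instead of guessing lets the caller decide whether an unparseable reply should trigger a retry or count as a rejection.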
+
+**Setup:** 35 hand-labeled cases (16 JUDGE / 10 CREATE / 9 CREATIVE_ACCEPT + 2 deliberately ambiguous) tested across `gemma4:latest` (8B), `gemma4:26b`, and `gemma4:31b`. Each case has explicit ground truth in the test bank.
+
+**Results:**
+
+| Model | JUDGE | CREATE | CREATIVE_ACCEPT | Avg s/call |
+|---|---|---|---|---|
+| `gemma4:latest` (8B) | 14/16 (87.5%) | 8/10 | **10/10** | 0.7 |
+| `gemma4:26b` | 15/16 (93.75%) | 9/10 | **10/10** | 0.8 |
+| `gemma4:31b-it-q4_K_M` | 16/16 | 9/10 | **10/10** | 2.3 |
+
+**Key findings:**
+
+- **CREATIVE_ACCEPT is decisive across all three models.** Each model went 10/10 overall: all five player-creative-but-valid groupings were accepted (e.g. `WHIP / NUT / CODE / SMILE → "Things you can crack"`) and all five invalid ones were rejected (e.g. `OAK / MAPLE / BIRCH / PINE → "Furniture brands"`). The model gets the distinction.
+- **8B is fast enough to use as a live judge.** Sub-second on a 24 GB consumer GPU; per-guess economics are effectively free.
+- **26b is mildly over-permissive on borderline cases.** It accepted KIWI as a tech brand (`APPLE / ORANGE / KIWI / BLACKBERRY → "Tech/phone brands"`). 8B and 31b were stricter. For a live game, false-positives degrade integrity more than false-negatives — so 8B's calibration is the right tradeoff for live judging.
+- **One failure mode is shared by all three models:** "homophones-of-body-parts" (8B gave SEA/SEE/HEAR/HERE — none of which sound like body parts; 26b gave EYE which IS a body part rather than a homophone of one; 31b parse-failed three times running). Avoid this category class or scaffold prompts with worked examples.
+
+---
+
+## What we picked
+
+**Model assignments:**
+
+| Role | Model | Why |
+|---|---|---|
+| Live JUDGE (per player guess) | `gemma4:latest` (8B) | Sub-second, strict-enough calibration, 87.5% accuracy on tight cases |
+| Live CREATIVE_ACCEPT | `gemma4:latest` (8B) | 10/10 in test, sub-second |
+| Offline puzzle generation (if used at all) | `gemma4:26b` with strict filters + retries | 31b is out of scope by user constraint; 26b plus a deterministic post-filter and a critique pass is the workable path |
+| Offline critique pass | `gemma4:26b` grading 8B's work, OR a non-Gemma open-weights judge | A model cannot be trusted to grade itself — the bakeoff confirmed Gemma rubber-stamps its own structural mistakes |
+
+**Operational gotchas baked into the scripts** (all from upstream Gemma 4 + Ollama issue tracker; documented in the bakeoff scripts):
+
+- No `format: "json"` — server-side JSON enforcer hangs gemma4:26b Q4 indefinitely; ask for JSON in the prompt and parse client-side.
+- `think: false` for single-turn JSON pipelines — otherwise thinking tokens consume the response budget and `response` comes back empty.
+- Override Ollama defaults: `num_ctx` (default 2048 truncates the prompt), `num_predict` (default 128 truncates the output).
+- For multi-turn tool-calling agents the rule is the opposite: leave `think` unset on 26b. Not relevant here, but worth knowing.
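The first three gotchas translate into a request body along the lines of the sketch below, which targets Ollama's `/api/generate` endpoint using only the standard library. The function names and the specific `num_ctx` / `num_predict` / `temperature` values are assumptions, not the values the bakeoff scripts use.

```python
import json
import os
import urllib.request


def build_generate_payload(
    model: str,
    prompt: str,
    temperature: float = 0.2,   # assumed value
    num_ctx: int = 8192,        # assumed; raised so the prompt is not truncated
    num_predict: int = 1024,    # assumed; raised so the output is not truncated
) -> dict:
    """Build a single-turn request body that follows the gotchas above."""
    return {
        "model": model,
        "prompt": prompt,       # ask for JSON here, in the prompt text
        "stream": False,
        "think": False,         # keep thinking tokens out of the response budget
        # Deliberately no "format" key: JSON is parsed client-side instead.
        "options": {
            "temperature": temperature,
            "num_ctx": num_ctx,
            "num_predict": num_predict,
        },
    }


def ollama_generate(payload: dict, timeout: float = 120.0) -> str:
    """POST the payload to the Ollama instance named by OLLAMA_HOST."""
    host = os.environ.get("OLLAMA_HOST", "http://localhost:11434").rstrip("/")
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read())["response"]
```

Keeping payload construction separate from the network call makes the settings easy to unit-test without a running Ollama.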
+
+---
+
+## Game-mechanics idea bank
+
+The two bakeoffs together say: **don't build a game where the LLM is the curator. Build a game where the LLM is the live, fair judge of player creativity.** Below are 10 distinct game ideas that take that as the design constraint. None of them is Connections; each one leans on something a static game structurally can't replicate (live category validation, multi-solution puzzles, generative answer pools, semantic chains, etc.).
+
+Each idea lists its **tempo** (how fast the game feels), the **AI calls per turn** (so cost can be reasoned about), and the **structural novelty** (the thing this idea can do that a hand-curated static format cannot).
+
+### Fast-paced (≤60-second rounds)
+
+#### 1. **Pile** — speedrun categorize
+- **Tempo:** real-time, 60-second rounds.
+- **Mechanic:** A pool of ~16 random words. You drag any 3–5 of them into a box and type a category. The LLM (8B) judges in ~0.7s. Accepted → those words clear, refilled from a deck. Rejected → they stay. Score = words categorized per minute.
+- **AI calls:** 1 per submission (CREATIVE_ACCEPT shape: player-supplied category + player-supplied words).
+- **Structural novelty:** the player invents groupings under time pressure; categories aren't pre-known. A static game has a single fixed answer per puzzle; this one has open-ended valid answers as long as the LLM can confirm tightness.
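A sketch of the Pile round state, with the judge injected as a plain callable so the loop can be exercised without a model. `PileRound` and its method names are hypothetical, and the 60-second timer is left to the caller.

```python
from typing import Callable, List

Judge = Callable[[List[str], str], bool]  # (words, category) -> accepted?


class PileRound:
    """Pool of words; accepted submissions clear and refill from the deck."""

    def __init__(self, deck: List[str], judge: Judge, pool_size: int = 16):
        self.deck = [w.upper() for w in deck]
        self.judge = judge
        draw = min(pool_size, len(self.deck))
        self.pool = [self.deck.pop() for _ in range(draw)]
        self.cleared = 0  # words categorized so far; divide by elapsed minutes for score

    def submit(self, words: List[str], category: str) -> bool:
        picked = [w.upper() for w in words]
        # Reject malformed submissions locally, before spending a judge call.
        if not 3 <= len(picked) <= 5 or len(set(picked)) != len(picked):
            return False
        if any(w not in self.pool for w in picked):
            return False
        if not self.judge(picked, category):
            return False  # rejected: the words stay in the pool
        for w in picked:
            self.pool.remove(w)
            if self.deck:
                self.pool.append(self.deck.pop())
        self.cleared += len(picked)
        return True
```

In a real build the `judge` argument would wrap one CREATIVE_ACCEPT call to the 8B model.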
+
+#### 2. **Bridge** — single-word polysemy speedrun
+- **Tempo:** real-time, ~10 sec per move.
+- **Mechanic:** Two category cards on screen ("Words for sharp pain" and "Things that bite"). Type one word the LLM accepts as fitting BOTH (e.g. `STING`). Move on. Faster = more points.
+- **AI calls:** 2 JUDGE calls per submission (one per category, on the player's word).
+- **Structural novelty:** the polysemy/multi-meaning skill — a known Connections difficulty axis — turned into the *primary* gameplay loop. Static games can plant such words but can't let the player invent them on demand.
+
+#### 3. **Threaded** — semantic word chains
+- **Tempo:** real-time / continuous.
+- **Mechanic:** Words drift across a conveyor belt. You build a chain by linking consecutive words with a category the LLM accepts ("APPLE → ORANGE: both fruits" → "ORANGE → RED: both colors" → "RED → ANGRY: red with anger"). Chain length = score. One chain per game.
+- **AI calls:** 1 JUDGE per link, on the player's pair-and-category.
+- **Structural novelty:** emergent semantic graphs from arbitrary word streams. The category set isn't pre-built — it's whatever the player can find. A static game can't be open-ended on the connection vocabulary.
+
+### Medium-paced (5–15 minute sessions)
+
+#### 4. **Stretch** — push a category to its limit
+- **Tempo:** medium, 5-min sessions.
+- **Mechanic:** The game opens with a tight seed category and 4 starting words ("Types of trees: OAK, MAPLE, BIRCH, PINE"). Add a 5th word — does it still fit? LLM judges. Yes → add a 6th. Each accepted word = +1 point. First rejection ends the round. Some categories support more stretch than others (broader = more elastic).
+- **AI calls:** 1 JUDGE per word added.
+- **Structural novelty:** category *elasticity* as a gameplay dimension. There's no pre-set answer length. The player learns intuitions about which categories admit how much stretching — a meta-skill no static game develops.
+
+#### 5. **Inverse** — multi-solution sort
+- **Tempo:** medium, ~10 min per puzzle.
+- **Mechanic:** 16 words on a board with NO predetermined grouping. The player sorts them into ANY 4 groups of 4 with ANY categories of their choice. The LLM judges all 4 categories. All 4 valid → win. Bonus for tightness (LLM rates each category 1–5).
+- **AI calls:** 4 CREATIVE_ACCEPT per submission, plus optional 4 tightness-score calls.
+- **Structural novelty:** Connections has *one* valid answer; this version has thousands. Players compete on creativity and tightness, not on guessing the curator's mind.
+
+#### 6. **Misfit** — odd-one-out, then redeem
+- **Tempo:** medium, ~3 min per puzzle.
+- **Mechanic:** The game shows a category and 4–5 words; one of them doesn't quite fit. Stage 1: identify the misfit. Stage 2 (bonus): propose a category the *misfit* word DOES fit. Both stages judged by the LLM.
+- **AI calls:** 1 JUDGE on stage 1 (verifies the misfit), 1 CREATIVE_ACCEPT on stage 2 (validates the player's redemption category).
+- **Structural novelty:** the second stage — "what category does the wrong word actually fit?" — is essentially impossible without live judging. Static games can plant misfits; they can't accept arbitrary creative redemptions.
+
+### Slow / daily
+
+#### 7. **Coalition** — daily creativity leaderboard
+- **Tempo:** daily, 24-hour cycle, async.
+- **Mechanic:** Once per day, the system publishes 16 words (offline-generated by 26b with the guarded pipeline + filter + critique pass). All players worldwide get the same 16. Each player submits their own 4×4 sort with 4 self-supplied categories. Server collects all submissions. Daily leaderboard ranks by:
+  - **Validity:** all 4 categories accepted by the LLM (binary gate).
+  - **Tightness score:** LLM rates each category 1–5; submission score is the average.
+  - **Uniqueness:** how few other players used the same exact grouping (rewards creativity over the obvious solution).
+- **AI calls:** 4 CREATIVE_ACCEPT + 4 tightness ratings per submission.
+- **Structural novelty:** the social/share ritual of Wordle and Connections, but with creativity as the leaderboard axis instead of speed-to-known-answer. "I split the daily 16 with the only 'Greek myths' grouping anyone found" is a different brag than "I solved it in 2 mistakes."
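The three leaderboard axes could be combined as in the sketch below. How tightness and uniqueness are weighed against each other is not specified above, so the ordering used here (tightness first, uniqueness as tie-break) and the `1 / count` uniqueness formula are assumptions; `Submission` and `rank` are hypothetical names.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


@dataclass
class Submission:
    player: str
    groups: Dict[str, List[str]]  # category -> 4 words
    accepted: List[bool]          # one CREATIVE_ACCEPT verdict per category
    tightness: List[int]          # one 1-5 rating per category


def grouping_key(sub: Submission) -> FrozenSet[FrozenSet[str]]:
    """Identify a sort by its word groups only, ignoring category labels."""
    return frozenset(
        frozenset(w.upper() for w in words) for words in sub.groups.values()
    )


def rank(submissions: List[Submission]) -> List[Tuple[str, float, float]]:
    """Return (player, avg tightness, uniqueness) rows, best first."""
    # Validity is a binary gate: every one of the 4 categories must be accepted.
    valid = [s for s in submissions if len(s.accepted) == 4 and all(s.accepted)]
    counts = Counter(grouping_key(s) for s in valid)
    rows = []
    for s in valid:
        avg_tightness = sum(s.tightness) / len(s.tightness)
        uniqueness = 1 / counts[grouping_key(s)]  # 1.0 means nobody else found it
        rows.append((s.player, avg_tightness, uniqueness))
    rows.sort(key=lambda r: (-r[1], -r[2], r[0]))
    return rows
```

Keying uniqueness on the word groups rather than the category text means two players who label the same split differently still count as having found the same grouping.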
+
+#### 8. **Bench** — collaborative single-category foraging
+- **Tempo:** daily, 24-hour async.
+- **Mechanic:** Each day a single category is published ("Words that follow GREEN" or "Things you can break"). Players have 24 hours to submit as many words as they can; LLM judges each. Each accepted word is "claimed" by the first submitter (publicly visible). Per-player score = unique claims.
+- **AI calls:** 1 JUDGE per submitted word.
+- **Structural novelty:** the *answer set is generative*, not hand-curated. NYT can't ship an open-ended "submit anything that fits" puzzle because they don't know all the answers; the LLM does (well enough for 87.5% of cases, with the bench growing publicly to fill in the rest).
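The first-submitter claim rule can be sketched as a small in-memory ledger. `Bench`, `submit`, and `scores` are hypothetical names; skipping the judge call for already-claimed words is a cost-saving choice added here, not something the mechanic requires.

```python
from typing import Callable, Dict


class Bench:
    """One published category; each accepted word belongs to its first submitter."""

    def __init__(self, category: str, judge: Callable[[str, str], bool]):
        self.category = category
        self.judge = judge                 # (word, category) -> fits?
        self.claims: Dict[str, str] = {}   # word -> player who claimed it first

    def submit(self, player: str, word: str) -> str:
        key = word.strip().upper()
        if key in self.claims:
            return "taken"                 # already claimed; no JUDGE call spent
        if not self.judge(key, self.category):
            return "rejected"
        self.claims[key] = player
        return "claimed"

    def scores(self) -> Dict[str, int]:
        """Per-player count of unique claims."""
        totals: Dict[str, int] = {}
        for owner in self.claims.values():
            totals[owner] = totals.get(owner, 0) + 1
        return totals
```

The public `claims` ledger doubles as the growing answer set that later submissions are checked against.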
+
+### Hybrid / structurally distinctive
+
+#### 9. **Heist** — competitive bluff-and-claim
+- **Tempo:** medium-fast, 2-team multiplayer.
+- **Mechanic:** Two teams share a pool of words. Each turn, the active team **announces a category** ("Words that follow BLUE") and has 30 seconds to claim words from the pool that fit. The opposing team can **challenge** any claim — if the LLM agrees the word doesn't fit, the claiming team loses points; if it does, the challenger loses points. Bluffing dynamics emerge naturally: claim a borderline word and dare them to challenge.
+- **AI calls:** 1 JUDGE per claim (at challenge-time only — no need to judge unchallenged claims unless you want a "true scoring" cleanup pass at end-of-game).
+- **Structural novelty:** competitive *risk-taking* on category boundaries. The challenge mechanic literally requires a live, fair judge — there's no static-game equivalent because static games can't adjudicate disputes mid-play.
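The challenge rule reduces to one JUDGE call and one point transfer. In this sketch `resolve_challenge` is a hypothetical helper and the one-point `stake` is an assumed penalty size.

```python
from typing import Callable, Dict


def resolve_challenge(
    scores: Dict[str, int],
    claimer: str,
    challenger: str,
    word: str,
    category: str,
    judge: Callable[[str, str], bool],  # (word, category) -> fits?
    stake: int = 1,                     # assumed penalty size
) -> Dict[str, int]:
    """Judge a challenged claim and return updated scores (input is not mutated)."""
    fits = judge(word, category)
    # A word that fits punishes the challenger; one that doesn't punishes the claimer.
    loser = challenger if fits else claimer
    updated = dict(scores)
    updated[loser] -= stake
    return updated
```

Because the judge runs only when a claim is challenged, unchallenged bluffs cost nothing to adjudicate.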
+
+#### 10. **Hidden** — find the broadest tight category
+- **Tempo:** medium, ~5 min per puzzle.
+- **Mechanic:** 12 (or more) words on a board. Find ONE category that fits ALL of them — and the *narrower / more specific* the category, the higher the score. ("Things that exist" gets you 1 point; "Things you'd find in a 1980s bedroom" gets you 8.) LLM judges on both validity (does it actually fit all 12?) and tightness (1–5).
+- **AI calls:** 1 batched JUDGE (on category × 12 words) per submission, plus 1 tightness rating.
+- **Structural novelty:** the inversion. Every other word game asks the player to find narrow groups inside a board; this one asks for a single category broad enough to cover the *whole* board yet specific enough to *still* feel tight. A different cognitive skill, and impossible without live category judging.
+
+---
+
+## Recombinable building blocks
+
+The 10 ideas above mix six primitives. Use these to remix or design new variants:
+
+| Primitive | Variants |
+|---|---|
+| **Time pressure** | Real-time / per-move timer / per-day async / untimed |
+| **Goal direction** | Find a valid grouping · validate a player-proposed grouping · find a misfit · find a "bridge" word · find the broadest tight category · build a chain |
+| **Player count** | Solo · async-multi (Wordle-shape) · sync-co-op · sync-versus |
+| **Word source** | Daily-curated 16 · player-supplied · conveyor-fed stream · category-seeded generation |
+| **Scoring axis** | Speed · count · uniqueness vs other players · LLM-rated tightness · chain length |
+| **AI call shape** | JUDGE single · JUDGE batched (one category × N words) · CREATIVE_ACCEPT · CREATE (rare — from the bakeoff this is the least reliable axis) · tightness-rating |
+
+Easy recombinations to consider:
+
+- **Pile + Coalition** = daily 60-second speedrun on the day's curated word pool, leaderboard by score.
+- **Stretch + Hidden** = keep adding words to the board while hunting for the broadest category that still passes the tightness bar.
+- **Heist + Threaded** = chain-builder versus mode where teams steal links from each other's chains.
+- **Bench + Misfit** = daily foraging where some submissions are deliberate adversarial misfits the community has to flag.
+
+---
+
+## Open questions / things still untested
+
+1. **Adversarial player input on CREATIVE_ACCEPT.** Tests used honest categories. Real players will stress-test the judge with categories like "Words containing a vowel" (trivially true of most English words) or "Words that are 4–7 letters long" (true by construction in many cases). Player input needs a category-tightness pre-check: at minimum, require the category to *fail* for at least one word from the wider deck, or apply a specificity bar.
+2. **Cultural / contextual category robustness.** Tested categories were lexical/factual ("Roman gods", "fruits", "things you can crack"). Cultural references and time-bound categories ("Words in Beatles songs", "Common Texan slang") may break the judge.
+3. **Critique-pass effectiveness.** The generation pipeline assumes a second-model critique pass catches structural mistakes. Not yet verified — feed Experiment 1's failed puzzles into a critique prompt and check.
+4. **8B's "no" bias on hard YES cases.** It missed `judge-y3` (days of the week — said all four were misfits, which was incoherent) and `judge-y6` (cold turkey). 8B might be slightly more conservative in production than its test numbers suggest.
+5. **Diversity over time.** All 10 puzzles in Experiment 1 were unseeded; 31b reached for "scales" twice in 5 puzzles. With 26b alone for generation, the diversity question is sharper. A 30-day seeded run is the next experiment if any of the daily-puzzle ideas (Coalition, Bench) goes forward.
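For open question 1, the "must fail for at least one word from the wider deck" pre-check might be sketched as follows. This is untested against real adversarial input; `is_discriminating` and `min_rejections` are hypothetical names, and `judge_word` stands in for a single-word JUDGE call.

```python
from typing import Callable, Iterable


def is_discriminating(
    category: str,
    deck_sample: Iterable[str],
    judge_word: Callable[[str, str], bool],  # (word, category) -> fits?
    min_rejections: int = 1,
) -> bool:
    """True if the category excludes at least min_rejections sampled deck words."""
    rejections = 0
    for word in deck_sample:
        if not judge_word(word, category):
            rejections += 1
            if rejections >= min_rejections:
                return True  # stop early: no need to judge the rest of the sample
    return False
```

The check costs up to one JUDGE call per sampled word, so a batched JUDGE (one category × N words) would be the cheaper shape if this goes into a live game.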
+
+---
+
+## Repo structure
+
+```
+.
+├── README.md                          # this file
+├── IDEA.md                            # original brief, with note about the pivot
+├── DECISIONS.md                       # decision log, kept as project moves forward
+├── scripts/
+│   ├── gemma-generation-bakeoff.py    # Experiment 1 — whole-puzzle generation
+│   └── gemma-semantic-bakeoff.py      # Experiment 2 — atomic skills
+└── docs/reference/
+    ├── gemma-generation-bakeoff-2026-04-27-221751.md       # Experiment 1 report (graded)
+    ├── gemma-generation-bakeoff-2026-04-27-221751-raw.json
+    ├── gemma-semantic-bakeoff-2026-04-27-224800.md         # Experiment 2 report (graded)
+    └── gemma-semantic-bakeoff-2026-04-27-224800-raw.json
+```
+
+## Reproduce
+
+```bash
+# point at any local Ollama with gemma4:latest and gemma4:26b loaded
+export OLLAMA_HOST=http://localhost:11434
+python3 scripts/gemma-semantic-bakeoff.py    # ~5 min on a 24 GB GPU
+python3 scripts/gemma-generation-bakeoff.py  # ~10 min
+```
+
+Reports land in `docs/reference/` with timestamps. Hand-grade the CREATE outputs and any TODO grades inline in the markdown — both bakeoff scripts emit grading-friendly reports.
+
+## License
+
+Not yet specified. If you're considering using this code or the test bank in your own work, open an issue and ask.