# seth_semantic_game

**Working title.** A self-hosted word game built around an LLM's ability to fairly judge *player-invented* semantic categories in real time — something static, hand-curated word games structurally cannot do.

This repo documents the design discovery process, including two model bakeoffs that picked the architecture and a brainstormed bank of game-mechanics ideas that the actual product will draw from.

---

## TL;DR

- **Seed idea:** clone NYT Connections (16 words → 4 hidden groups of 4) with a local LLM doing the curation.
- **Seed idea died fast:** unaided whole-puzzle generation on Gemma 4 ships broken puzzles ~50% of the time (duplicate tiles, mislabeled categories, fake wordplay) — see [docs/reference/gemma-generation-bakeoff-2026-04-27-221751.md](docs/reference/gemma-generation-bakeoff-2026-04-27-221751.md).
- **The actual unlock:** Gemma 4 reliably judges whether a player-supplied category fits a player-supplied set of words. Across 35 hand-labeled cases on three model sizes, **CREATIVE_ACCEPT scored 10/10 on every model** including the 8B variant at 0.7s per call. JUDGE landed at 87.5% / 93.75% / 100% (8B / 26b / 31b). See [docs/reference/gemma-semantic-bakeoff-2026-04-27-224800.md](docs/reference/gemma-semantic-bakeoff-2026-04-27-224800.md).
- **The pivot:** stop trying to generate Connections. Build games where the *player* invents the groupings and the LLM is the live, fair judge. That's what the static format can't do.
- **Models in scope:** `gemma4:latest` (8B) for live judging, `gemma4:26b` for offline puzzle prep / critique. `gemma4:31b` was tested and is more accurate, but is intentionally out of scope for this project.

---

## What we did

Two experiments, both reproducible from `scripts/` against a local Ollama (point `OLLAMA_HOST` at your instance; defaults to `http://localhost:11434`).

### Experiment 1 — Generation bakeoff

**Question:** can Gemma 4 generate a Connections-quality 16-word / 4-group puzzle in one shot?

**Setup:** 5 puzzles per model on `gemma4:26b` and `gemma4:31b`. Strict JSON schema requesting groups + difficulty bands + claimed overlap-trap words. No `format=json` (that's a known Gemma 4 + Ollama hang); JSON parsed client-side; up to 3 retries with temperature bumped +0.1 each attempt.

**Results:**

| Model | Pass | Borderline | Fail | Avg s/puzzle |
|---|---|---|---|---|
| `gemma4:26b` | 1 | 1 + 1 partial | 2 | 5.2 |
| `gemma4:31b-it-q4_K_M` | 2 | 2 | 1 | 18.2 |

Failure modes ranked by severity:

1. **Structural violations** — duplicate or near-duplicate words on the 16-tile board. *Trivially detectable.*
2. **Broken category logic** — words listed in a category they don't actually fit (`DELUXE` doesn't start with the full Greek letter "DELTA"; `LIBRA` isn't a "type of scale"). *Hard to detect deterministically — needs a critique pass.*
3. **Redundant categories** — two groups themed on the same concept. Detectable.
4. **Self-graded traps don't always hold up** — Gemma's claimed `intended_traps` were sometimes nonsense (`PRESS` claimed to fit "Words after BLOOD," but the compound is *blood pressure*, not *blood press*). **Important consequence: the same model cannot be trusted to grade its own output.**

This was decisive for the project direction: unaided generation isn't viable, and we're explicitly capping at 26b, which is the *less* reliable generator. So we need a different game shape — one that doesn't depend on the LLM generating finished puzzles unaided.

### Experiment 2 — Semantic-skill bakeoff

**Question:** instead of whole-puzzle generation, can Gemma reliably perform the atomic skills a live game would need? Specifically:

- **JUDGE** — given a category and 4 words, does Gemma correctly say yes/no on whether they all fit?
- **CREATE** — given a category, does Gemma produce 4 tightly-fitting words?
- **CREATIVE_ACCEPT** — given 4 words and a *player-proposed* category, does Gemma fairly judge whether the category validates the grouping (even if it differs from any "intended" category)?

The third one is the design-relevant one. If it works, the game can let players invent their own groupings — which is structurally impossible for a hand-curated static format.

**Setup:** 35 hand-labeled cases (16 JUDGE / 10 CREATE / 9 CREATIVE_ACCEPT + 2 deliberately ambiguous) tested across `gemma4:latest` (8B), `gemma4:26b`, and `gemma4:31b`. Each case has explicit ground truth in the test bank.

**Results:**

| Model | JUDGE | CREATE | CREATIVE_ACCEPT | Avg s/call |
|---|---|---|---|---|
| `gemma4:latest` (8B) | 14/16 (87.5%) | 8/10 | **10/10** | 0.7 |
| `gemma4:26b` | 15/16 (93.75%) | 9/10 | **10/10** | 0.8 |
| `gemma4:31b-it-q4_K_M` | 16/16 | 9/10 | **10/10** | 2.3 |

**Key findings:**

- **CREATIVE_ACCEPT is decisive across all three models.** Every model accepted all five player-creative-but-valid groupings (e.g. `WHIP / NUT / CODE / SMILE → "Things you can crack"`) and rejected all five invalid ones (e.g. `OAK / MAPLE / BIRCH / PINE → "Furniture brands"`), for 10/10 overall. The model gets the distinction.
- **8B is fast enough to use as a live judge.** Sub-second on a 24 GB consumer GPU; per-guess economics are effectively free.
- **26b is mildly over-permissive on borderline cases.** It accepted KIWI as a tech brand (`APPLE / ORANGE / KIWI / BLACKBERRY → "Tech/phone brands"`). 8B and 31b were stricter. For a live game, false positives degrade integrity more than false negatives, so 8B's calibration is the right tradeoff for live judging.
- **One failure mode is shared by all three models:** "homophones-of-body-parts" (8B gave SEA/SEE/HEAR/HERE, none of which sound like body parts; 26b gave EYE, which IS a body part rather than a homophone of one; 31b parse-failed three times running). Avoid this category class or scaffold prompts with worked examples.

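Both experiments ask for JSON in the prompt and parse it client-side, retrying with a slightly higher temperature when the reply is unusable. The sketch below shows one way a CREATIVE_ACCEPT call could follow that pattern. It is not the code in `scripts/`: the prompt wording, the `accept` / `reason` keys, the function names, and the starting temperature of 0.2 are all assumptions made for illustration. `generate` is a hypothetical callable that takes a prompt and a temperature and returns the model's raw text, so the transport (for example a call to the local Ollama instance) stays out of the parsing logic and can be swapped for a stub in tests.

```python
import json
from typing import Callable

# Hypothetical prompt wording; the real prompts live in scripts/ and may differ.
PROMPT = (
    "Words: {words}\n"
    "Proposed category: {category}\n"
    "Do ALL of the words fit the category? Reply with JSON only, shaped like "
    '{{"accept": true, "reason": "one sentence"}}'
)


def extract_json(raw: str) -> dict:
    """Return the first JSON object found in a reply that may also contain prose."""
    decoder = json.JSONDecoder()
    for start, ch in enumerate(raw):
        if ch != "{":
            continue
        try:
            obj, _ = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            continue  # not valid JSON from this brace; try the next one
        return obj
    raise ValueError("no JSON object in model reply")


def creative_accept(
    words: list[str],
    category: str,
    generate: Callable[[str, float], str],
    max_attempts: int = 3,
    base_temperature: float = 0.2,  # assumed starting point, not from the bakeoff
) -> dict:
    """Ask the judge whether `category` validates `words`, retrying on bad output."""
    prompt = PROMPT.format(words=" / ".join(words), category=category)
    for attempt in range(max_attempts):
        # Bump temperature by 0.1 per retry, mirroring the Experiment 1 setup.
        temperature = round(base_temperature + 0.1 * attempt, 2)
        try:
            verdict = extract_json(generate(prompt, temperature))
        except ValueError:
            continue  # unparseable reply; try again at the higher temperature
        if isinstance(verdict.get("accept"), bool):
            return verdict
    raise RuntimeError(f"no parseable verdict after {max_attempts} attempts")
```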

---

## What we picked

**Model assignments:**

| Role | Model | Why |
|---|---|---|
| Live JUDGE (per player guess) | `gemma4:latest` (8B) | Sub-second, strict-enough calibration, 87.5% accuracy on tight cases |
| Live CREATIVE_ACCEPT | `gemma4:latest` (8B) | 10/10 in test, sub-second |
| Offline puzzle generation (if used at all) | `gemma4:26b` with strict filters + retries | 31b is out of scope by user constraint; 26b plus a deterministic post-filter and a critique pass is the workable path |
| Offline critique pass | `gemma4:26b` grading 8B's work, OR a non-Gemma open-weights judge | A model cannot be trusted to grade itself — the bakeoff confirmed Gemma rubber-stamps its own structural mistakes |

**Operational gotchas baked into the scripts** (all from upstream Gemma 4 + Ollama issue tracker; documented in the bakeoff scripts):

- No `format: "json"` — server-side JSON enforcer hangs gemma4:26b Q4 indefinitely; ask for JSON in the prompt and parse client-side.
- `think: false` for single-turn JSON pipelines — otherwise thinking tokens consume the response budget and `response` comes back empty.
- Override Ollama defaults: `num_ctx` (default 2048 truncates the prompt), `num_predict` (default 128 truncates the output).
- For multi-turn tool-calling agents the rule is the opposite: leave `think` unset on 26b. Not relevant here, but worth knowing.

---

## Game-mechanics idea bank

The two bakeoffs together say: **don't build a game where the LLM is the curator. Build a game where the LLM is the live, fair judge of player creativity.**

Below are 10 distinct game ideas that take that as the design constraint. None of them is Connections; each one leans on something a static game structurally can't replicate (live category validation, multi-solution puzzles, generative answer pools, semantic chains, etc.).

Each idea lists its **tempo** (how fast the game feels), the **AI calls per turn** (so cost can be reasoned about), and the **structural novelty** (the thing this idea can do that a hand-curated static format cannot).

### Fast-paced (≤60-second rounds)

#### 1. **Pile** — speedrun categorize

- **Tempo:** real-time, 60-second rounds.
- **Mechanic:** A pool of ~16 random words. You drag any 3–5 of them into a box and type a category. The LLM (8B) judges in ~0.7s. Accepted → those words clear, refilled from a deck. Rejected → they stay. Score = words categorized per minute.
- **AI calls:** 1 per submission (CREATIVE_ACCEPT shape: player-supplied category + player-supplied words).
- **Structural novelty:** the player invents groupings under time pressure; categories aren't pre-known. A static game has a single fixed answer per puzzle; this one has open-ended valid answers as long as the LLM can confirm tightness.

#### 2. **Bridge** — single-word polysemy speedrun

- **Tempo:** real-time, ~10 sec per move.
- **Mechanic:** Two category cards on screen ("Words for sharp pain" and "Things that bite"). Type one word the LLM accepts as fitting BOTH (e.g. `STING`). Move on. Faster = more points.
- **AI calls:** 2 JUDGE calls per submission (one per category, on the player's word).
- **Structural novelty:** the polysemy/multi-meaning skill — a known Connections difficulty axis — turned into the *primary* gameplay loop. Static games can plant such words but can't let the player invent them on demand.

#### 3. **Threaded** — semantic word chains

- **Tempo:** real-time / continuous.
- **Mechanic:** Words drift across a conveyor belt. You build a chain by linking consecutive words with a category the LLM accepts ("APPLE → ORANGE: both fruits" → "ORANGE → RED: both colors" → "RED → ANGRY: red with anger"). Chain length = score. One chain per game.
- **AI calls:** 1 JUDGE per link, on the player's pair-and-category.
- **Structural novelty:** emergent semantic graphs from arbitrary word streams. The category set isn't pre-built — it's whatever the player can find. A static game can't be open-ended on the connection vocabulary.

### Medium-paced (5–15 minute sessions)

#### 4. **Stretch** — push a category to its limit

- **Tempo:** medium, 5-min sessions.
- **Mechanic:** The game opens with a tight seed category and 4 starting words ("Types of trees: OAK, MAPLE, BIRCH, PINE"). Add a 5th word — does it still fit? LLM judges. Yes → add a 6th. Each accepted word = +1 point. First rejection ends the round. Some categories support more stretch than others (broader = more elastic).
- **AI calls:** 1 JUDGE per word added.
- **Structural novelty:** category *elasticity* as a gameplay dimension. There's no pre-set answer length. The player learns intuitions about which categories admit how much stretching — a meta-skill no static game develops.

#### 5. **Inverse** — multi-solution sort

- **Tempo:** medium, ~10 min per puzzle.
- **Mechanic:** 16 words on a board with NO predetermined grouping. The player sorts them into ANY 4 groups of 4 with ANY categories of their choice. The LLM judges all 4 categories. All 4 valid → win. Bonus for tightness (LLM rates each category 1–5).
- **AI calls:** 4 CREATIVE_ACCEPT per submission, plus optional 4 tightness-score calls.
- **Structural novelty:** Connections has *one* valid answer; this version has thousands. Players compete on creativity and tightness, not on guessing the curator's mind.

#### 6. **Misfit** — odd-one-out, then redeem

- **Tempo:** medium, ~3 min per puzzle.
- **Mechanic:** The game shows a category and 4–5 words; one of them doesn't quite fit. Stage 1: identify the misfit. Stage 2 (bonus): propose a category the *misfit* word DOES fit. Both stages judged by the LLM.
- **AI calls:** 1 JUDGE on stage 1 (verifies the misfit), 1 CREATIVE_ACCEPT on stage 2 (validates the player's redemption category).
- **Structural novelty:** the second stage — "what category does the wrong word actually fit?" — is essentially impossible without live judging. Static games can plant misfits; they can't accept arbitrary creative redemptions.

### Slow / daily

#### 7. **Coalition** — daily creativity leaderboard

- **Tempo:** daily, 24-hour cycle, async.
- **Mechanic:** Once per day, the system publishes 16 words (offline-generated by 26b with the guarded pipeline + filter + critique pass). All players worldwide get the same 16. Each player submits their own 4×4 sort with 4 self-supplied categories. Server collects all submissions. Daily leaderboard ranks by:
  - **Validity:** all 4 categories accepted by the LLM (binary gate).
  - **Tightness score:** LLM rates each category 1–5; submission score is the average.
  - **Uniqueness:** how few other players used the same exact grouping (rewards creativity over the obvious solution).
- **AI calls:** 4 CREATIVE_ACCEPT + 4 tightness ratings per submission.
- **Structural novelty:** the social/share ritual of Wordle and Connections, but with creativity as the leaderboard axis instead of speed-to-known-answer. "I split the daily 16 with the only 'Greek myths' grouping anyone found" is a different brag than "I solved it in 2 mistakes."

#### 8. **Bench** — collaborative single-category foraging

- **Tempo:** daily, 24-hour async.
- **Mechanic:** Each day a single category is published ("Words that follow GREEN" or "Things you can break"). Players have 24 hours to submit as many words as they can; LLM judges each. Each accepted word is "claimed" by the first submitter (publicly visible). Per-player score = unique claims.
- **AI calls:** 1 JUDGE per submitted word.
- **Structural novelty:** the *answer set is generative*, not hand-curated. NYT can't ship an open-ended "submit anything that fits" puzzle because they don't know all the answers; the LLM does (well enough for 87.5% of cases, with the bench growing publicly to fill in the rest).

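The three Coalition ranking criteria (validity gate, average tightness, uniqueness) can be combined in more than one way, and the README does not fix a formula. The sketch below is one possible reading, under stated assumptions: uniqueness is counted per group of words rather than per whole 4×4 sort, only valid submissions count toward that tally, and the leaderboard sorts by tightness first and uses uniqueness as the tie-breaker. The `Group`, `Submission`, and `rank` names are hypothetical, and the judge's verdicts are assumed to have been collected already.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from statistics import mean


@dataclass
class Group:
    words: frozenset[str]  # order-free, so identical groups compare equal
    category: str          # the player's own label
    accepted: bool         # CREATIVE_ACCEPT verdict from the judge
    tightness: int         # judge's 1-5 rating


@dataclass
class Submission:
    player: str
    groups: list[Group]


def rank(submissions: list[Submission]) -> list[tuple[str, float, int]]:
    """Return (player, avg_tightness, shared_groups) rows, best first."""
    # Validity is a binary gate: one rejected category drops the submission.
    valid = [s for s in submissions if all(g.accepted for g in s.groups)]

    # How many valid submissions contain each exact set of words.
    usage = Counter(g.words for s in valid for g in s.groups)

    rows = []
    for s in valid:
        avg_tightness = mean(g.tightness for g in s.groups)
        # Number of times another player used one of this player's groups.
        shared = sum(usage[g.words] - 1 for g in s.groups)
        rows.append((s.player, avg_tightness, shared))

    # Assumed ordering: tighter first, then fewer shared groups, then name.
    rows.sort(key=lambda r: (-r[1], r[2], r[0]))
    return rows
```

A weighted blend of tightness and uniqueness would be an equally defensible choice; the lexicographic sort here just keeps the sketch easy to reason about.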
### Hybrid / structurally distinctive

#### 9. **Heist** — competitive bluff-and-claim

- **Tempo:** medium-fast, 2-team multiplayer.
- **Mechanic:** Two teams share a pool of words. Each turn, the active team **announces a category** ("Words that follow BLUE") and has 30 seconds to claim words from the pool that fit. The opposing team can **challenge** any claim — if the LLM agrees the word doesn't fit, the claiming team loses points; if it does, the challenger loses points. Bluffing dynamics emerge naturally: claim a borderline word and dare them to challenge.
- **AI calls:** 1 JUDGE per claim (at challenge-time only — no need to judge unchallenged claims unless you want a "true scoring" cleanup pass at end-of-game).
- **Structural novelty:** competitive *risk-taking* on category boundaries. The challenge mechanic literally requires a live, fair judge — there's no static-game equivalent because static games can't adjudicate disputes mid-play.

#### 10. **Hidden** — find the broadest tight category

- **Tempo:** medium, ~5 min per puzzle.
- **Mechanic:** 12 (or more) words on a board. Find ONE category that fits ALL of them — and the *narrower / more specific* the category, the higher the score. ("Things that exist" gets you 1 point; "Things you'd find in a 1980s bedroom" gets you 8.) LLM judges on both validity (does it actually fit all 12?) and tightness (1–5).
- **AI calls:** 1 batched JUDGE (on category × 12 words) per submission, plus 1 tightness rating.
- **Structural novelty:** the inversion. Every other word game asks the player to find narrow groups inside a board; this one asks the player to find a single category broad enough to cover the *whole* board that *still* feels tight. A different cognitive skill, and impossible without live category judging.

---

## Recombinable building blocks

The 10 ideas above mix six primitives.
Use these to remix or design new variants:

| Primitive | Variants |
|---|---|
| **Time pressure** | Real-time / per-move timer / per-day async / untimed |
| **Goal direction** | Find a valid grouping · validate a player-proposed grouping · find a misfit · find a "bridge" word · find the broadest tight category · build a chain |
| **Player count** | Solo · async-multi (Wordle-shape) · sync-co-op · sync-versus |
| **Word source** | Daily-curated 16 · player-supplied · conveyor-fed stream · category-seeded generation |
| **Scoring axis** | Speed · count · uniqueness vs other players · LLM-rated tightness · chain length |
| **AI call shape** | JUDGE single · JUDGE batched (one category × N words) · CREATIVE_ACCEPT · CREATE (rare — from the bakeoff this is the least reliable axis) · tightness-rating |

Easy recombinations to consider:

- **Pile + Coalition** = daily 60-second speedrun on the day's curated word pool, leaderboard by score.
- **Stretch + Hidden** = find the longest broadest category that still passes the tightness bar.
- **Heist + Threaded** = chain-builder versus mode where teams steal links from each other's chains.
- **Bench + Misfit** = daily foraging where some submissions are deliberate adversarial misfits the community has to flag.

---

## Open questions / things still untested

1. **Adversarial player input on CREATIVE_ACCEPT.** Tests used honest categories. Real players will try to game the judge with categories like "Words containing a vowel" (trivially true of most English words) or "Words that are 4–7 letters long" (true by construction in many cases). Need a category-tightness pre-check on player input: at minimum, require the category to *fail* for at least one word from the wider deck, or apply a specificity bar.
2. **Cultural / contextual category robustness.** Tested categories were lexical/factual ("Roman gods", "fruits", "things you can crack").
   Cultural references and time-bound categories ("Words in Beatles songs", "Common Texan slang") may break the judge.
3. **Critique-pass effectiveness.** The generation pipeline assumes a second-model critique pass catches structural mistakes. Not yet verified — feed Experiment 1's failed puzzles into a critique prompt and check.
4. **8B's "no" bias on hard YES cases.** It missed `judge-y3` (days of the week — said all four were misfits, which was incoherent) and `judge-y6` (cold turkey). 8B might be slightly more conservative in production than its test numbers suggest.
5. **Diversity over time.** All 10 puzzles in Experiment 1 were unseeded; 31b reached for "scales" twice in 5 puzzles. With 26b alone for generation, the diversity question is sharper. A 30-day seeded run is the next experiment if any of the daily-puzzle ideas (Coalition, Bench) goes forward.

---

## Repo structure

```
.
├── README.md                        # this file
├── IDEA.md                          # original brief, with note about the pivot
├── DECISIONS.md                     # decision log, kept as project moves forward
├── scripts/
│   ├── gemma-generation-bakeoff.py  # Experiment 1 — whole-puzzle generation
│   └── gemma-semantic-bakeoff.py    # Experiment 2 — atomic skills
└── docs/reference/
    ├── gemma-generation-bakeoff-2026-04-27-221751.md  # Experiment 1 report (graded)
    ├── gemma-generation-bakeoff-2026-04-27-221751-raw.json
    ├── gemma-semantic-bakeoff-2026-04-27-224800.md    # Experiment 2 report (graded)
    └── gemma-semantic-bakeoff-2026-04-27-224800-raw.json
```

## Reproduce

```bash
# point at any local Ollama with gemma4:latest and gemma4:26b loaded
export OLLAMA_HOST=http://localhost:11434
python3 scripts/gemma-semantic-bakeoff.py    # ~5 min on a 24 GB GPU
python3 scripts/gemma-generation-bakeoff.py  # ~10 min
```

Reports land in `docs/reference/` with timestamps. Hand-grade the CREATE outputs and any TODO grades inline in the markdown — both bakeoff scripts emit grading-friendly reports.

## License

Not yet specified.
If you're considering using this code or the test bank in your own work, open an issue and ask.