# seth_semantic_game

**Working title.** A self-hosted word game built around an LLM's ability to fairly judge *player-invented* semantic categories in real time — something static, hand-curated word games structurally cannot do.

This repo documents the design discovery process, including two model bakeoffs that picked the architecture and a brainstormed bank of game-mechanics ideas that the actual product will draw from.

---

## TL;DR

- **Seed idea:** clone NYT Connections (16 words → 4 hidden groups of 4) with a local LLM doing the curation.
- **Seed idea died fast:** unaided whole-puzzle generation on Gemma 4 ships broken puzzles ~50% of the time (duplicate tiles, mislabeled categories, fake wordplay) — see [docs/reference/gemma-generation-bakeoff-2026-04-27-221751.md](docs/reference/gemma-generation-bakeoff-2026-04-27-221751.md).
- **The actual unlock:** Gemma 4 reliably judges whether a player-supplied category fits a player-supplied set of words. Across 35 hand-labeled cases on three model sizes, **CREATIVE_ACCEPT scored 10/10 on every model** including the 8B variant at 0.7s per call. JUDGE landed at 87.5% / 93.75% / 100% (8B / 26b / 31b). See [docs/reference/gemma-semantic-bakeoff-2026-04-27-224800.md](docs/reference/gemma-semantic-bakeoff-2026-04-27-224800.md).
- **The pivot:** stop trying to generate Connections. Build games where the *player* invents the groupings and the LLM is the live, fair judge. That's what the static format can't do.
- **Models in scope:** `gemma4:latest` (8B) for live judging, `gemma4:26b` for offline puzzle prep / critique. `gemma4:31b` was tested and is more accurate, but is intentionally out of scope for this project.

---

## What we did

Two experiments, both reproducible from `scripts/` against a local Ollama (point `OLLAMA_HOST` at your instance; defaults to `http://localhost:11434`).

### Experiment 1 — Generation bakeoff

**Question:** can Gemma 4 generate a Connections-quality 16-word / 4-group puzzle in one shot?

**Setup:** 5 puzzles per model on `gemma4:26b` and `gemma4:31b`. Strict JSON schema requesting groups + difficulty bands + claimed overlap-trap words. No `format: "json"` (a known Gemma 4 + Ollama hang); JSON is parsed client-side, with up to 3 retries and the temperature bumped +0.1 on each attempt.

**Results:**

| Model | Pass | Borderline | Fail | Avg s/puzzle |
|---|---|---|---|---|
| `gemma4:26b` | 1 | 1 + 1 partial | 2 | 5.2 |
| `gemma4:31b-it-q4_K_M` | 2 | 2 | 1 | 18.2 |

Failure modes ranked by severity:

1. **Structural violations** — duplicate or near-duplicate words on the 16-tile board. *Trivially detectable.*
2. **Broken category logic** — words listed in a category they don't actually fit (`DELUXE` doesn't start with the full Greek letter "DELTA"; `LIBRA` isn't a "type of scale"). *Hard to detect deterministically — needs a critique pass.*
3. **Redundant categories** — two groups themed on the same concept. Detectable.
4. **Self-graded traps don't always hold up** — Gemma's claimed `intended_traps` were sometimes nonsense (`PRESS` claimed to fit "Words after BLOOD," but the compound is *blood pressure*, not *blood press*). **Important consequence: the same model cannot be trusted to grade its own output.**

This was decisive for the project direction: unaided generation isn't viable, and the project is explicitly capped at 26b, which is the *less* reliable generator. So we need a different game shape, one that doesn't depend on the LLM generating finished puzzles unaided.

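The retry policy from the setup and the first failure mode (structural violations) can be sketched together. This is a minimal illustration, not the code in `scripts/`: `call_model` is a hypothetical callable standing in for the Ollama request, the `groups` / `words` JSON keys and the 0.7 base temperature are assumptions, "up to 3 retries" is read here as three attempts in total, and only exact duplicates are checked (near-duplicates would need more work).

```python
import json
import re
from typing import Callable, Optional

# Hypothetical signature: (prompt, temperature) -> raw model text.
ModelCall = Callable[[str, float], str]


def extract_json(text: str) -> Optional[dict]:
    """Parse the outermost {...} span of a reply; None if it is not a JSON object."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def structural_errors(puzzle: dict) -> list:
    """Deterministic checks only: 4 groups of 4 words, no exact duplicate tiles."""
    groups = puzzle.get("groups")
    if not isinstance(groups, list) or len(groups) != 4:
        return ["expected exactly 4 groups"]
    if any(not isinstance(g, dict) or len(g.get("words", [])) != 4 for g in groups):
        return ["expected 4 words in every group"]
    words = [str(w).strip().upper() for g in groups for w in g["words"]]
    dupes = sorted({w for w in words if words.count(w) > 1})
    return ["duplicate tile: " + w for w in dupes]


def generate_with_retries(
    call_model: ModelCall,
    prompt: str,
    base_temperature: float = 0.7,  # assumed starting point
    max_attempts: int = 3,
) -> Optional[dict]:
    """Retry on parse or structural failure, bumping temperature by 0.1 each time."""
    for attempt in range(max_attempts):
        temperature = base_temperature + 0.1 * attempt
        puzzle = extract_json(call_model(prompt, temperature))
        if puzzle is not None and not structural_errors(puzzle):
            return puzzle
    return None
```

A check like this only covers failure mode 1. Broken category logic and bad self-graded traps pass it untouched, which is why the critique pass is a separate step.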
### Experiment 2 — Semantic-skill bakeoff

**Question:** instead of whole-puzzle generation, can Gemma reliably perform the atomic skills a live game would need? Specifically:

- **JUDGE** — given a category and 4 words, does Gemma correctly say yes/no on whether they all fit?
- **CREATE** — given a category, does Gemma produce 4 tightly-fitting words?
- **CREATIVE_ACCEPT** — given 4 words and a *player-proposed* category, does Gemma fairly judge whether the category validates the grouping (even if it differs from any "intended" category)?

The third one is the design-relevant one. If it works, the game can let players invent their own groupings — which is structurally impossible for a hand-curated static format.

**Setup:** 35 hand-labeled cases (16 JUDGE / 10 CREATE / 9 CREATIVE_ACCEPT + 2 deliberately ambiguous) tested across `gemma4:latest` (8B), `gemma4:26b`, and `gemma4:31b`. Each case has explicit ground truth in the test bank.

**Results:**

| Model | JUDGE | CREATE | CREATIVE_ACCEPT | Avg s/call |
|---|---|---|---|---|
| `gemma4:latest` (8B) | 14/16 (87.5%) | 8/10 | **10/10** | 0.7 |
| `gemma4:26b` | 15/16 (93.75%) | 9/10 | **10/10** | 0.8 |
| `gemma4:31b-it-q4_K_M` | 16/16 | 9/10 | **10/10** | 2.3 |

**Key findings:**

- **CREATIVE_ACCEPT is decisive across all three models.** Every model scored 10/10: all five player-creative-but-valid groupings were accepted (e.g. `WHIP / NUT / CODE / SMILE → "Things you can crack"`) and all five invalid ones were rejected (e.g. `OAK / MAPLE / BIRCH / PINE → "Furniture brands"`). The model gets the distinction.
- **8B is fast enough to use as a live judge.** Sub-second on a 24 GB consumer GPU; per-guess economics are effectively free.
- **26b is mildly over-permissive on borderline cases.** It accepted KIWI as a tech brand (`APPLE / ORANGE / KIWI / BLACKBERRY → "Tech/phone brands"`). 8B and 31b were stricter. For a live game, false positives degrade integrity more than false negatives, so 8B's calibration is the right tradeoff for live judging.
- **One failure mode is shared by all three models:** "homophones-of-body-parts" (8B gave SEA/SEE/HEAR/HERE — none of which sound like body parts; 26b gave EYE which IS a body part rather than a homophone of one; 31b parse-failed three times running). Avoid this category class or scaffold prompts with worked examples.

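A CREATIVE_ACCEPT call reduces to a prompt plus a strict yes/no parse of the reply. The sketch below assumes a hypothetical prompt wording and a hypothetical `{"fits": ..., "reason": ...}` reply shape; the actual prompts live in the bakeoff scripts and may differ. A reply that cannot be parsed returns `None` so the caller can retry instead of guessing, which matters given the parse failures noted above.

```python
import json
from typing import Optional, Sequence


def build_creative_accept_prompt(words: Sequence[str], category: str) -> str:
    """Assumed prompt wording: ask for a one-object JSON verdict in plain text."""
    return (
        "A player grouped these words: " + ", ".join(words) + ".\n"
        'They claim the category is: "' + category + '".\n'
        "Does EVERY word genuinely fit that category?\n"
        'Reply with JSON only: {"fits": true or false, "reason": "<one sentence>"}'
    )


def parse_verdict(reply: str) -> Optional[bool]:
    """Return True/False for a well-formed verdict, None for anything else."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return None
    fits = data.get("fits") if isinstance(data, dict) else None
    # Only a real JSON boolean counts; "yes" or 1 is treated as a parse failure.
    return fits if isinstance(fits, bool) else None
```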
---

## What we picked

**Model assignments:**

| Role | Model | Why |
|---|---|---|
| Live JUDGE (per player guess) | `gemma4:latest` (8B) | Sub-second, strict-enough calibration, 87.5% accuracy on tight cases |
| Live CREATIVE_ACCEPT | `gemma4:latest` (8B) | 10/10 in test, sub-second |
| Offline puzzle generation (if used at all) | `gemma4:26b` with strict filters + retries | 31b is out of scope by user constraint; 26b plus a deterministic post-filter and a critique pass is the workable path |
| Offline critique pass | `gemma4:26b` grading 8B's work, OR a non-Gemma open-weights judge | A model cannot be trusted to grade itself — the bakeoff confirmed Gemma rubber-stamps its own structural mistakes |

**Operational gotchas baked into the scripts** (all from upstream Gemma 4 + Ollama issue tracker; documented in the bakeoff scripts):

- No `format: "json"` — server-side JSON enforcer hangs gemma4:26b Q4 indefinitely; ask for JSON in the prompt and parse client-side.
- `think: false` for single-turn JSON pipelines — otherwise thinking tokens consume the response budget and `response` comes back empty.
- Override Ollama defaults: `num_ctx` (default 2048 truncates the prompt), `num_predict` (default 128 truncates the output).
- For multi-turn tool-calling agents the rule is the opposite: leave `think` unset on 26b. Not relevant here, but worth knowing.

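Taken together, the single-turn gotchas amount to one request shape. The sketch below builds a body for Ollama's `/api/generate` endpoint using only the standard library. It is an illustration rather than the scripts' actual code: the function names are hypothetical, and the `num_ctx` / `num_predict` values are placeholder assumptions to be sized to your prompts.

```python
import json
import os
import urllib.request

OLLAMA_HOST = os.environ.get("OLLAMA_HOST", "http://localhost:11434")


def build_generate_payload(model: str, prompt: str, temperature: float = 0.0) -> dict:
    """Request body for a single-turn JSON-in-prompt call."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,   # one JSON reply instead of a token stream
        "think": False,    # keep thinking tokens out of the response budget
        # Deliberately no "format": "json"; JSON is requested in the prompt
        # and parsed client-side.
        "options": {
            "temperature": temperature,
            "num_ctx": 8192,      # assumed value, overrides the small default
            "num_predict": 1024,  # assumed value, overrides the small default
        },
    }


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """POST to /api/generate and return the raw `response` text."""
    request = urllib.request.Request(
        OLLAMA_HOST.rstrip("/") + "/api/generate",
        data=json.dumps(build_generate_payload(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))["response"]
```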
---

## Game-mechanics idea bank

The two bakeoffs together say: **don't build a game where the LLM is the curator. Build a game where the LLM is the live, fair judge of player creativity.** Below are 10 distinct game ideas that take that as the design constraint. None of them is Connections; each one leans on something a static game structurally can't replicate (live category validation, multi-solution puzzles, generative answer pools, semantic chains, etc.).

Each idea lists its **tempo** (how fast the game feels), the **AI calls per turn** (so cost can be reasoned about), and the **structural novelty** (the thing this idea can do that a hand-curated static format cannot).

### Fast-paced (≤60-second rounds)

#### 1. **Pile** — speedrun categorize
- **Tempo:** real-time, 60-second rounds.
- **Mechanic:** A pool of ~16 random words. You drag any 3–5 of them into a box and type a category. The LLM (8B) judges in ~0.7s. Accepted → those words clear, refilled from a deck. Rejected → they stay. Score = words categorized per minute.
- **AI calls:** 1 per submission (CREATIVE_ACCEPT shape: player-supplied category + player-supplied words).
- **Structural novelty:** the player invents groupings under time pressure; categories aren't pre-known. A static game has a single fixed answer per puzzle; this one has open-ended valid answers as long as the LLM can confirm tightness.

#### 2. **Bridge** — single-word polysemy speedrun
- **Tempo:** real-time, ~10 sec per move.
- **Mechanic:** Two category cards on screen ("Words for sharp pain" and "Things that bite"). Type one word the LLM accepts as fitting BOTH (e.g. `STING`). Move on. Faster = more points.
- **AI calls:** 2 JUDGE calls per submission (one per category, on the player's word).
- **Structural novelty:** the polysemy/multi-meaning skill — a known Connections difficulty axis — turned into the *primary* gameplay loop. Static games can plant such words but can't let the player invent them on demand.

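The Bridge acceptance rule is a conjunction of two JUDGE calls. A minimal sketch, assuming a hypothetical `judge(category, word)` callable that wraps the live 8B call:

```python
from typing import Callable

# Hypothetical judge signature: (category, word) -> does the word fit?
Judge = Callable[[str, str], bool]


def fits_both(word: str, category_a: str, category_b: str, judge: Judge) -> bool:
    """Accept the word only if it fits BOTH category cards."""
    candidate = word.strip().upper()
    # `and` short-circuits: a word rejected for the first card never
    # triggers the second call, so 2 calls is the worst case, not the norm.
    return judge(category_a, candidate) and judge(category_b, candidate)
```

Running the two calls in parallel instead would trade one extra call on rejected words for lower latency on accepted ones, which may matter more at a ~10-second move tempo.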
#### 3. **Threaded** — semantic word chains
- **Tempo:** real-time / continuous.
- **Mechanic:** Words drift across a conveyor belt. You build a chain by linking consecutive words with a category the LLM accepts ("APPLE → ORANGE: both fruits" → "ORANGE → RED: both colors" → "RED → ANGRY: red with anger"). Chain length = score. One chain per game.
- **AI calls:** 1 JUDGE per link, on the player's pair-and-category.
- **Structural novelty:** emergent semantic graphs from arbitrary word streams. The category set isn't pre-built — it's whatever the player can find. A static game can't be open-ended on the connection vocabulary.

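Chain scoring can be sketched as a walk over the player's links that stops at the first break. The `judge_link(word_a, word_b, category)` callable is hypothetical, and the rule that each link must start on the word the previous link ended on is an assumption read from the example chain:

```python
from typing import Callable, Iterable, Optional, Tuple

Link = Tuple[str, str, str]  # (from_word, to_word, player's category)
# Hypothetical judge signature for one pair-and-category call.
LinkJudge = Callable[[str, str, str], bool]


def chain_score(links: Iterable[Link], judge_link: LinkJudge) -> int:
    """Count accepted links until the chain breaks."""
    score = 0
    previous_end: Optional[str] = None
    for from_word, to_word, category in links:
        # Continuity is checked locally, before spending a judge call.
        if previous_end is not None and from_word != previous_end:
            break
        if not judge_link(from_word, to_word, category):
            break
        score += 1
        previous_end = to_word
    return score
```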
### Medium-paced (5–15 minute sessions)

#### 4. **Stretch** — push a category to its limit
- **Tempo:** medium, 5-min sessions.
- **Mechanic:** The game opens with a tight seed category and 4 starting words ("Types of trees: OAK, MAPLE, BIRCH, PINE"). Add a 5th word — does it still fit? LLM judges. Yes → add a 6th. Each accepted word = +1 point. First rejection ends the round. Some categories support more stretch than others (broader = more elastic).
- **AI calls:** 1 JUDGE per word added.
- **Structural novelty:** category *elasticity* as a gameplay dimension. There's no pre-set answer length. The player learns intuitions about which categories admit how much stretching — a meta-skill no static game develops.

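A Stretch round is a loop with one JUDGE call per new word and a hard stop on the first rejection. In this sketch `judge(category, word)` is a hypothetical wrapper around the live call, and silently skipping a repeated word (without a call or a penalty) is an assumed house rule, not something the design above specifies:

```python
from typing import Callable, Iterable, Sequence

# Hypothetical judge signature: (category, word) -> does the word fit?
Judge = Callable[[str, str], bool]


def play_stretch(
    category: str,
    seed_words: Sequence[str],
    attempts: Iterable[str],
    judge: Judge,
) -> int:
    """Return the round score: +1 per accepted word, ending on the first rejection."""
    on_board = {w.strip().upper() for w in seed_words}
    score = 0
    for raw in attempts:
        word = raw.strip().upper()
        if word in on_board:
            continue  # assumed rule: repeats are ignored and cost no judge call
        if not judge(category, word):
            break  # first rejection ends the round
        on_board.add(word)
        score += 1
    return score
```

For example, with the tree seed above, a player who adds ELM and CEDAR and then tries GRANITE would finish on 2 points, provided the judge accepts the first two and rejects the third.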
#### 5. **Inverse** — multi-solution sort
- **Tempo:** medium, ~10 min per puzzle.
- **Mechanic:** 16 words on a board with NO predetermined grouping. The player sorts them into ANY 4 groups of 4 with ANY categories of their choice. The LLM judges all 4 categories. All 4 valid → win. Bonus for tightness (LLM rates each category 1–5).
- **AI calls:** 4 CREATIVE_ACCEPT per submission, plus optional 4 tightness-score calls.
- **Structural novelty:** Connections has *one* valid answer; this version has thousands. Players compete on creativity and tightness, not on guessing the curator's mind.

#### 6. **Misfit** — odd-one-out, then redeem
- **Tempo:** medium, ~3 min per puzzle.
- **Mechanic:** The game shows a category and 4–5 words; one of them doesn't quite fit. Stage 1: identify the misfit. Stage 2 (bonus): propose a category the *misfit* word DOES fit. Both stages judged by the LLM.
- **AI calls:** 1 JUDGE on stage 1 (verifies the misfit), 1 CREATIVE_ACCEPT on stage 2 (validates the player's redemption category).
- **Structural novelty:** the second stage — "what category does the wrong word actually fit?" — is essentially impossible without live judging. Static games can plant misfits; they can't accept arbitrary creative redemptions.

### Slow / daily

#### 7. **Coalition** — daily creativity leaderboard
- **Tempo:** daily, 24-hour cycle, async.
- **Mechanic:** Once per day, the system publishes 16 words (offline-generated by 26b with the guarded pipeline + filter + critique pass). All players worldwide get the same 16. Each player submits their own 4×4 sort with 4 self-supplied categories. Server collects all submissions. Daily leaderboard ranks by:
  - **Validity:** all 4 categories accepted by the LLM (binary gate).
  - **Tightness score:** LLM rates each category 1–5; submission score is the average.
  - **Uniqueness:** how few other players used the same exact grouping (rewards creativity over the obvious solution).
- **AI calls:** 4 CREATIVE_ACCEPT + 4 tightness ratings per submission.
- **Structural novelty:** the social/share ritual of Wordle and Connections, but with creativity as the leaderboard axis instead of speed-to-known-answer. "I split the daily 16 with the only 'Greek myths' grouping anyone found" is a different brag than "I solved it in 2 mistakes."

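The three leaderboard axes can be combined in a few lines once the LLM verdicts are in. The sketch below makes several assumptions that the design above leaves open: "same exact grouping" is read as the whole 4×4 partition (ignoring category names and group order), uniqueness is scored as 1 / (number of valid submissions sharing that partition), and ties on tightness are broken by uniqueness. All names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Submission:
    player: str
    groups: Tuple[FrozenSet[str], ...]  # four sets of four words
    accepted: Tuple[bool, ...]          # CREATIVE_ACCEPT verdict per category
    tightness: Tuple[int, ...]          # LLM rating 1-5 per category


def partition_key(sub: Submission) -> FrozenSet[FrozenSet[str]]:
    """Identify a sort by its word partition, independent of group order."""
    return frozenset(sub.groups)


def leaderboard(subs: List[Submission]) -> List[Tuple[str, float, float]]:
    """Return (player, avg_tightness, uniqueness) rows, best first."""
    valid = [s for s in subs if all(s.accepted)]  # validity is a binary gate
    counts = Counter(partition_key(s) for s in valid)
    rows = [
        (s.player, float(mean(s.tightness)), 1.0 / counts[partition_key(s)])
        for s in valid
    ]
    # Assumed ordering: tightness first, then uniqueness, then name for stability.
    return sorted(rows, key=lambda r: (-r[1], -r[2], r[0]))
```

Keying on the partition means the uniqueness axis needs no extra AI calls; it is computed entirely from submissions the server already holds.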
#### 8. **Bench** — collaborative single-category foraging
- **Tempo:** daily, 24-hour async.
- **Mechanic:** Each day a single category is published ("Words that follow GREEN" or "Things you can break"). Players have 24 hours to submit as many words as they can; LLM judges each. Each accepted word is "claimed" by the first submitter (publicly visible). Per-player score = unique claims.
- **AI calls:** 1 JUDGE per submitted word.
- **Structural novelty:** the *answer set is generative*, not hand-curated. NYT can't ship an open-ended "submit anything that fits" puzzle because they don't know all the answers; the LLM does (well enough for 87.5% of cases, with the bench growing publicly to fill in the rest).

### Hybrid / structurally distinctive

#### 9. **Heist** — competitive bluff-and-claim
- **Tempo:** medium-fast, 2-team multiplayer.
- **Mechanic:** Two teams share a pool of words. Each turn, the active team **announces a category** ("Words that follow BLUE") and has 30 seconds to claim words from the pool that fit. The opposing team can **challenge** any claim — if the LLM agrees the word doesn't fit, the claiming team loses points; if it does, the challenger loses points. Bluffing dynamics emerge naturally: claim a borderline word and dare them to challenge.
- **AI calls:** 1 JUDGE per claim (at challenge-time only — no need to judge unchallenged claims unless you want a "true scoring" cleanup pass at end-of-game).
- **Structural novelty:** competitive *risk-taking* on category boundaries. The challenge mechanic literally requires a live, fair judge — there's no static-game equivalent because static games can't adjudicate disputes mid-play.

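The challenge rule, including the "judge only what is challenged" cost saving, can be sketched as a per-turn scorer. The point values (`claim_points`, `stake`) are hypothetical, as is the `judge(category, word)` callable; the design above only says which side loses points, not how many.

```python
from typing import Callable, Iterable, Set, Tuple

# Hypothetical judge signature: (category, word) -> does the word fit?
Judge = Callable[[str, str], bool]


def score_turn(
    category: str,
    claims: Iterable[str],
    challenged: Set[str],
    judge: Judge,
    claim_points: int = 1,  # assumed reward for a standing claim
    stake: int = 1,         # assumed penalty for losing a challenge
) -> Tuple[int, int]:
    """Return (claiming_team_delta, challenging_team_delta) for one turn."""
    claimer = challenger = 0
    for word in claims:
        if word not in challenged:
            claimer += claim_points  # unchallenged claims stand with no judge call
        elif judge(category, word):
            claimer += claim_points  # claim upheld: the challenger pays
            challenger -= stake
        else:
            claimer -= stake         # bluff caught: the claiming team pays
    return claimer, challenger
```

Because unchallenged claims are never judged, a successful bluff scores exactly like a legitimate claim, which is what makes the borderline-word gamble worth taking.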
#### 10. **Hidden** — find the broadest tight category
- **Tempo:** medium, ~5 min per puzzle.
- **Mechanic:** 12 (or more) words on a board. Find ONE category that fits ALL of them — and the *narrower / more specific* the category, the higher the score. ("Things that exist" gets you 1 point; "Things you'd find in a 1980s bedroom" gets you 8.) LLM judges on both validity (does it actually fit all 12?) and tightness (1–5).
- **AI calls:** 1 batched JUDGE (on category × 12 words) per submission, plus 1 tightness rating.
- **Structural novelty:** the inversion. Every other word game asks the player to find narrow groups inside a board; this one asks for a single category broad enough to cover the whole board that *still* feels tight. A different cognitive skill, and impossible without live category judging.

---

## Recombinable building blocks

The 10 ideas above mix six primitives. Use these to remix or design new variants:

| Primitive | Variants |
|---|---|
| **Time pressure** | Real-time / per-move timer / per-day async / untimed |
| **Goal direction** | Find a valid grouping · validate a player-proposed grouping · find a misfit · find a "bridge" word · find the broadest tight category · build a chain |
| **Player count** | Solo · async-multi (Wordle-shape) · sync-co-op · sync-versus |
| **Word source** | Daily-curated 16 · player-supplied · conveyor-fed stream · category-seeded generation |
| **Scoring axis** | Speed · count · uniqueness vs other players · LLM-rated tightness · chain length |
| **AI call shape** | JUDGE single · JUDGE batched (one category × N words) · CREATIVE_ACCEPT · CREATE (rare — from the bakeoff this is the least reliable axis) · tightness-rating |

Easy recombinations to consider:

- **Pile + Coalition** = daily 60-second speedrun on the day's curated word pool, leaderboard by score.
- **Stretch + Hidden** = find the longest broadest category that still passes the tightness bar.
- **Heist + Threaded** = chain-builder versus mode where teams steal links from each other's chains.
- **Bench + Misfit** = daily foraging where some submissions are deliberate adversarial misfits the community has to flag.

---

## Open questions / things still untested

1. **Adversarial player input on CREATIVE_ACCEPT.** Tests used honest categories. Real players will try to game the judge with categories like "Words containing a vowel" (trivially true of most English words) or "Words that are 4–7 letters long" (true by construction in many cases). Need a category-tightness pre-check on player input — at minimum, require the category to *fail* for at least one word from the wider deck, or apply a specificity bar.
2. **Cultural / contextual category robustness.** Tested categories were lexical/factual ("Roman gods", "fruits", "things you can crack"). Cultural references and time-bound categories ("Words in Beatles songs", "Common Texan slang") may break the judge.
3. **Critique-pass effectiveness.** The generation pipeline assumes a second-model critique pass catches structural mistakes. Not yet verified — feed Experiment 1's failed puzzles into a critique prompt and check.
4. **8B's "no" bias on hard YES cases.** It missed `judge-y3` (days of the week — said all four were misfits, which was incoherent) and `judge-y6` (cold turkey). 8B might be slightly more conservative in production than its test numbers suggest.
5. **Diversity over time.** All 10 puzzles in Experiment 1 were unseeded; 31b reached for "scales" twice in 5 puzzles. With 26b alone for generation, the diversity question is sharper. A 30-day seeded run is the next experiment if any of the daily-puzzle ideas (Coalition, Bench) goes forward.

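The pre-check proposed in the first open question (the category must fail for at least one word from the wider deck) could look like the sketch below. It is untested against real models: `judge(category, word)` is a hypothetical wrapper around a JUDGE call, and the sample size of 12 is an arbitrary assumption that trades extra calls for confidence.

```python
import random
from typing import Callable, Iterable, Optional

# Hypothetical judge signature: (category, word) -> does the word fit?
Judge = Callable[[str, str], bool]


def is_too_broad(
    category: str,
    deck: Iterable[str],
    judge: Judge,
    sample_size: int = 12,  # assumed; each sampled word costs one judge call
    seed: Optional[int] = None,
) -> bool:
    """True if the category fits every sampled deck word, i.e. it excludes nothing."""
    pool = sorted(set(deck))  # sorted so a fixed seed gives a reproducible sample
    if not pool:
        return False
    sample = random.Random(seed).sample(pool, min(sample_size, len(pool)))
    # all() stops at the first word the category rejects, so a reasonably
    # specific category usually costs far fewer than sample_size calls.
    return all(judge(category, word) for word in sample)
```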
---

## Repo structure

```
.
├── README.md                 # this file
├── IDEA.md                   # original brief, with note about the pivot
├── DECISIONS.md              # decision log, kept as project moves forward
├── scripts/
│   ├── gemma-generation-bakeoff.py   # Experiment 1 — whole-puzzle generation
│   └── gemma-semantic-bakeoff.py     # Experiment 2 — atomic skills
└── docs/reference/
    ├── gemma-generation-bakeoff-2026-04-27-221751.md   # Experiment 1 report (graded)
    ├── gemma-generation-bakeoff-2026-04-27-221751-raw.json
    ├── gemma-semantic-bakeoff-2026-04-27-224800.md     # Experiment 2 report (graded)
    └── gemma-semantic-bakeoff-2026-04-27-224800-raw.json
```

## Reproduce

```bash
# point at any local Ollama with gemma4:latest and gemma4:26b loaded
export OLLAMA_HOST=http://localhost:11434
python3 scripts/gemma-semantic-bakeoff.py      # ~5 min on a 24 GB GPU
python3 scripts/gemma-generation-bakeoff.py    # ~10 min
```

Reports land in `docs/reference/` with timestamps. Hand-grade the CREATE outputs and any TODO grades inline in the markdown — both bakeoff scripts emit grading-friendly reports.

## License

Not yet specified. If you're considering using this code or the test bank in your own work, open an issue and ask.