docs: bootstrap repo with bakeoff results and game-mechanics idea bank

This repo opens with the design-discovery work completed before any product
code is written. Two model bakeoffs against gemma4:8b/26b/31b on a local
Ollama established that:

- Whole-puzzle generation in the Connections shape is unreliable on Gemma 4
  (gemma4:31b ~40-50% structural-pass, gemma4:26b ~20-30%); 31b is intentionally
  out of project scope, so the generation route is harder still.
- Atomic semantic-judging skills are reliable: 87.5%/93.75%/100% (8B/26b/31b)
  on JUDGE; *all three models* scored 10/10 on CREATIVE_ACCEPT — fair judging
  of player-INVENTED categories. That is the structural unlock vs static
  hand-curated word games.

The README contains the full writeup, the test bench, and a brainstormed
bank of 10 distinct game-mechanics ideas across the fast/medium/slow tempo
range, plus a primitives table for recombination.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mortdecai
2026-04-27 23:09:46 -04:00
commit 5a2a02e483
10 changed files with 4659 additions and 0 deletions
@@ -0,0 +1,30 @@
# DECISIONS.md — seth_semantic_game Decision Log
Project-specific decisions. For global/cross-cutting decisions, see `~/bin/DECISIONS.md`.
Format: `YYYY-MM-DD: <decision> — <why>`
## Architecture
- **2026-04-27: The Gemma-enabled twist is real-time CREATIVE_ACCEPT — fair judging of player-invented categories** — Semantic bakeoff (`docs/reference/gemma-semantic-bakeoff-2026-04-27-224800.md`) showed all three Gemma 4 variants (8B, 26b, 31b) achieve 10/10 on player-creative-but-valid grouping judgments. This is the IDEA.md unlock: a derivative game that *accepts the player's own valid groupings* in real time, which the static NYT format structurally cannot do. Likely product framing: "Connections, but you can group however you can defend."
- **2026-04-27: Live judging on gemma4:latest (8B) at 0.7s/call** — 8B JUDGE accuracy is 87.5% strict, CREATIVE_ACCEPT 100%, output sub-second. Per-guess economics are effectively free. (Originally this entry called for 31b on once-per-day generation; that was superseded when 31b was removed from scope — see below.)
- **2026-04-27: 26b is NOT the live judge despite being only marginally slower than 8B** — 26b showed an "agree with the user" bias on the borderline tech-brand case (accepted KIWI as a tech brand). For CREATIVE_ACCEPT specifically, false-positives are worse than false-negatives — accepting bad groupings degrades game integrity, while rejecting valid ones is just frustrating. 8B's stricter calibration is the right tradeoff.
- **2026-04-27: Generation must go through a guarded pipeline, not a single Gemma call** — Prior bakeoff (`docs/reference/gemma-generation-bakeoff-2026-04-27-221751.md`) showed gemma4:31b produces structurally clean puzzles ~40-50% of the time and gemma4:26b ~20-30%; both produce duplicate-tile and broken-category failures unaided. Acceptable design shape: 31b generate → deterministic filter (16 distinct tiles, no dup words, all claimed-trap words present) → category-similarity check → critique pass (8B or 26b, much cheaper than a 31b critique) → cache the day's accepted puzzle. (This shape predates 31b's removal from scope; the next entry makes 26b the generator.)
- **2026-04-27: gemma4:31b is OUT OF SCOPE — only 8B and 26b are in the model lineup** — User constraint: 31b's quality edge does not justify keeping it as a project dependency; 8B and 26b are good enough. **Implication for generation**: 26b's ~20-30% structural-pass rate becomes the working number. The generation pipeline must do more work to compensate: stricter automated filters, more retry attempts, OR a shift of the design center toward player-driven generation (game ideas where the *player* supplies words/categories and Gemma judges) rather than AI-driven generation. The last option is favored because Gemma's per-call judging is reliable on both 8B and 26b (JUDGE 87.5% / 93.75%, CREATIVE_ACCEPT 10/10 on both); that is the strong axis to lean on.
- **2026-04-27: Live judging on gemma4:latest (8B), generation candidate gemma4:26b** — 8B JUDGE 14/16, CREATIVE_ACCEPT 10/10, 0.7s. 26b is the heavier model when accuracy matters more (e.g. offline puzzle gen + critique). Model use by role: live JUDGE → 8B; live CREATIVE_ACCEPT → 8B; offline generation → 26b with retries; offline critique → 26b grading 8B's output (or vice-versa) so the same model isn't rubber-stamping itself.
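
The deterministic filter step named in the guarded-pipeline entry above (16 distinct tiles, no duplicate words, all claimed-trap words present) could look roughly like the sketch below. The puzzle schema (`groups`, `words`, `intended_traps`) and the function name `structural_problems` are assumptions for illustration; the bakeoff's actual output format may differ.

```python
from collections import Counter


def structural_problems(puzzle: dict) -> list[str]:
    """Return structural problems found in a puzzle; an empty list means it passes.

    Assumed (hypothetical) schema:
      {"groups": [{"category": str, "words": [str, ...]}, ...],
       "intended_traps": [str, ...]}
    """
    problems = []
    groups = puzzle.get("groups", [])
    if len(groups) != 4:
        problems.append(f"expected 4 groups, got {len(groups)}")
    for g in groups:
        if len(g.get("words", [])) != 4:
            problems.append(f"group {g.get('category')!r} does not have 4 words")

    # Normalise case and whitespace so "Press" and "PRESS " count as one tile.
    tiles = [w.strip().upper() for g in groups for w in g.get("words", [])]
    if len(tiles) != 16:
        problems.append(f"expected 16 tiles, got {len(tiles)}")

    dupes = sorted(w for w, n in Counter(tiles).items() if n > 1)
    if dupes:
        problems.append(f"duplicate tiles: {dupes}")

    # Every word the model claims as a trap must actually be on the board.
    board = set(tiles)
    traps = [t.strip().upper() for t in puzzle.get("intended_traps", [])]
    missing = sorted(t for t in traps if t not in board)
    if missing:
        problems.append(f"trap words not on board: {missing}")
    return problems
```

A puzzle would move on to the category-similarity check and critique pass only when this returns an empty list. The check is purely mechanical, so it costs nothing per retry, which matters at a ~20-30% structural-pass rate.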
## Implementation
- **2026-04-27: Single-turn JSON pipeline payload settings (canonical for this project)** — `think: false`, `num_ctx: 8192`, `num_predict: 4096`, NO `format: "json"`, parse JSON client-side via `body[body.find('{'):body.rfind('}')+1]`, retry up to 3× with temperature bumped +0.1 each attempt. The four payload settings are mandatory per `~/bin/gemma4-research/GOTCHAS.md` for gemma4:26b/31b on Ollama 0.20.x: `format: "json"` hangs the model, the default `num_predict` of 128 truncates output, the default `num_ctx` of 2048 truncates the prompt, and an unset `think` spends the response budget on thinking tokens.
- **2026-04-27: Inference host = local 3090 Ti (24 GB)** — delivers ~94 tok/s on gemma4:26b and ~24 tok/s on gemma4:31b; sub-second per-call latency on the short JUDGE / CREATIVE_ACCEPT prompts.
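
A minimal sketch of the single-turn JSON settings above, assuming Ollama's `/api/generate` endpoint on its default local port and a non-streaming call. The function names, the base temperature of 0.2, and the reading of "up to 3×" as three total attempts are assumptions, not project decisions. The transport is passed in as `send` so the retry logic can be exercised without a running model.

```python
import json
import urllib.request
from typing import Callable


def build_payload(prompt: str, model: str = "gemma4:latest",
                  temperature: float = 0.2) -> dict:
    """Build a request body with the settings from the decision log."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "think": False,  # otherwise thinking tokens eat the response budget
        # Deliberately no "format": "json" (reported to hang the model).
        "options": {"num_ctx": 8192, "num_predict": 4096,
                    "temperature": temperature},
    }


def extract_json(body: str) -> dict:
    """Parse the outermost {...} span of a free-text response."""
    start, end = body.find("{"), body.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object in response")
    return json.loads(body[start:end + 1])


def ask_json(prompt: str, send: Callable[[dict], str],
             model: str = "gemma4:latest", base_temperature: float = 0.2,
             attempts: int = 3) -> dict:
    """Call the model, retrying with temperature +0.1 per failed parse."""
    last_error = None
    for attempt in range(attempts):
        temperature = round(base_temperature + 0.1 * attempt, 2)
        try:
            return extract_json(send(build_payload(prompt, model, temperature)))
        except ValueError as err:  # json.JSONDecodeError is a ValueError
            last_error = err
    raise RuntimeError(f"no parseable JSON after {attempts} attempts") from last_error


def send_to_ollama(payload: dict,
                   url: str = "http://localhost:11434/api/generate") -> str:
    """Assumed transport: POST the payload and return the 'response' text."""
    req = urllib.request.Request(url, data=json.dumps(payload).encode("utf-8"),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

A live call would then be `ask_json(prompt, send_to_ollama)`. Slicing from the first `{` to the last `}` tolerates chatter before and after the object, but it assumes the response contains exactly one top-level JSON object.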
## Deferred / Rejected
<!-- Decisions NOT to do something are just as valuable -- prevents re-proposing rejected ideas -->
- **2026-04-27 — REJECTED: Gemma self-grading puzzles** — In the bakeoff, Gemma's own "intended_traps" claims didn't always hold up (e.g., #3-26b claimed `PRESS` traps the "Words after BLOOD" group, but blood-press isn't a phrase). If we route the critique pass back through the same model, it will rubber-stamp the same kinds of errors it generates. Use a different judge: a non-Gemma model on the same host (any reasonably-capable open-weights model), or two different Gemma sizes against each other.
- **2026-04-27 — DEFERRED: Connections-vs-Gemma blind anchor** — The plan called for mixing one real NYT puzzle into the grading set. Skipped because Gemma's structural failures (duplicate tiles, broken categories) are obvious curator-rejections — the within-Gemma evidence was decisive on its own. Revisit before locking the design: eyeball one filter-passed Gemma puzzle next to a real NYT puzzle and confirm equivalence.
- **2026-04-27 — DEFERRED: Diversity-over-time test** — All 10 bakeoff puzzles were unseeded. With 31b alone, two of five were scale-themed; risk of long-term repetition. Need a seeded run (e.g., 30 puzzles with date-rotated theme prompts) before committing to a year-round daily-puzzle product.
- **2026-04-27 — DEFERRED: Critique-pass effectiveness test** — The architecture above assumes a second-model critique pass catches the broken categories. Not yet verified. Next experiment: feed the failed bakeoff puzzles into a critique prompt and check whether the model flags the actual structural issues.
- **2026-04-27 — DEFERRED: Adversarial-player robustness on CREATIVE_ACCEPT** — Test cases were honest player categories. Real players will try to game the judge with categories like "Words containing a vowel" (trivially true for most English words) or "Words that are 4-7 letters long" (true by construction in many cases). Need a category-tightness pre-check on player input (e.g. require the category to fail for at least one word on the board, or require category specificity above a threshold) before submitting it to Gemma for word-fit judging.
- **2026-04-27 — DEFERRED: Cultural/contextual category robustness** — Tested categories were lexical/factual ("Roman gods", "fruits", "things you can crack"). Cultural references ("Words in Beatles songs", "Common Texan slang") and time-bound categories may break the judge. Test before designing any feature that depends on them.
- **2026-04-27 — KNOWN LIMIT: Hard wordplay categories ("homophones of body parts") fail on all three Gemma 4 variants** — This is a structural model limit, not a configuration issue. If this category class is desired in puzzles, scaffold with worked examples in the prompt or human-curate the seed list; do not rely on unaided generation for it.
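
The category-tightness pre-check proposed in the adversarial-player entry above could be sketched as follows. Everything here is hypothetical: `fits` stands in for whatever decides whether one word belongs to a category (a cheap deterministic rule or a per-word JUDGE call), and the 0.5 cutoff is an illustrative threshold, not a tested value.

```python
from typing import Callable, Iterable


def is_tight_enough(category: str, board: Iterable[str],
                    fits: Callable[[str, str], bool],
                    max_fit_fraction: float = 0.5) -> bool:
    """Reject categories that are true for (nearly) every word on the board.

    fits(category, word) is a hypothetical predicate supplied by the caller.
    """
    words = list(board)
    if not words:
        return False
    hits = sum(1 for w in words if fits(category, w))
    # Rule 1: the category must fail for at least one word on the board.
    if hits == len(words):
        return False
    # Rule 2: a crude specificity threshold on the share of matching words.
    return hits / len(words) <= max_fit_fraction
```

A category such as "Words containing a vowel" would match every tile and be rejected before any word-fit judging happens. Whether a fraction-based threshold is the right specificity measure is an open question that the deferred test would need to answer.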