docs: bootstrap repo with bakeoff results and game-mechanics idea bank

This repo opens with the design-discovery work completed before any product
code is written. Two model bakeoffs against gemma4:8b/26b/31b on a local
Ollama established that:

- Whole-puzzle generation in the Connections shape is unreliable on Gemma 4
  (gemma4:31b ~50% structural-pass, gemma4:26b ~20-30%); 31b is intentionally
  out of project scope, so the generation route is harder still.
- Atomic semantic-judging skills are reliable: 87.5%/93.75%/100% (8B/26b/31b)
  on JUDGE; *all three models* scored 10/10 on CREATIVE_ACCEPT — fair judging
  of player-INVENTED categories. That is the structural unlock vs static
  hand-curated word games.

The README contains the full writeup, the test bench, and a brainstormed
bank of 10 distinct game-mechanics ideas across the fast/medium/slow tempo
range, plus a primitives table for recombination.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mortdecai
2026-04-27 23:09:46 -04:00
commit 5a2a02e483
10 changed files with 4659 additions and 0 deletions
+19
@@ -0,0 +1,19 @@
# Local backups (created by editing pipeline; not for sharing)
.backup/
# Session handoff documents — heavily homelab-internal, replaced by README
.claude/
# Python
__pycache__/
*.pyc
*.pyo
# Editor / OS
.DS_Store
*.swp
*~
# Local environment
.env
.env.*
+30
@@ -0,0 +1,30 @@
# DECISIONS.md — seth_semantic_game Decision Log
Project-specific decisions. For global/cross-cutting decisions, see `~/bin/DECISIONS.md`.
Format: `YYYY-MM-DD: <decision> — <why>`
## Architecture
- **2026-04-27: The Gemma-enabled twist is real-time CREATIVE_ACCEPT — fair judging of player-invented categories** — Semantic bakeoff (`docs/reference/gemma-semantic-bakeoff-2026-04-27-224800.md`) showed all three Gemma 4 variants (8B, 26b, 31b) achieve 10/10 on player-creative-but-valid grouping judgments. This is the IDEA.md unlock: a derivative game that *accepts the player's own valid groupings* in real time, which the static NYT format structurally cannot do. Likely product framing: "Connections, but you can group however you can defend."
- **2026-04-27: Live judging on gemma4:latest (8B) at 0.7s/call** — 8B JUDGE accuracy is 87.5% strict, CREATIVE_ACCEPT 100%, output sub-second. Per-guess economics are effectively free. (Originally this entry called for 31b on once-per-day generation; that was superseded when 31b was removed from scope — see below.)
- **2026-04-27: 26b is NOT the live judge despite being only marginally slower than 8B** — 26b showed an "agree with the user" bias on the borderline tech-brand case (accepted KIWI as a tech brand). For CREATIVE_ACCEPT specifically, false-positives are worse than false-negatives — accepting bad groupings degrades game integrity, while rejecting valid ones is just frustrating. 8B's stricter calibration is the right tradeoff.
- **2026-04-27: Generation must go through a guarded pipeline, not a single Gemma call** — Prior bakeoff (`docs/reference/gemma-generation-bakeoff-2026-04-27-221751.md`) showed gemma4:31b passes ~40-50% structurally clean and gemma4:26b ~20-30%; both produce duplicate-tile and broken-category failures unaided. Acceptable design shape: 31b generate → deterministic filter (16 distinct tiles, no dup words, all claimed-trap words present) → category-similarity check → critique pass (8B or 26b — much cheaper than 31b critique) → cache the day's accepted puzzle.
- **2026-04-27: gemma4:31b is OUT OF SCOPE — only 8B and 26b are in the model lineup** — User constraint: 31b's quality edge does not justify keeping it as a project dependency; 8B and 26b are good enough. **Implication for generation**: 26b's ~20-30% structural-pass rate becomes the working number. Generation pipeline must do more work to compensate — either stricter automated filters, more retry attempts, OR shift the design center toward player-driven generation (game ideas where the *player* supplies words/categories and Gemma judges) rather than AI-driven generation. The latter is favored because Gemma's per-call JUDGE / CREATIVE_ACCEPT is reliable on both 8B and 26b (87.5% / 93.75%) — that's the strong axis to lean on.
- **2026-04-27: Live judging on gemma4:latest (8B), generation candidate gemma4:26b** — 8B JUDGE 14/16, CREATIVE_ACCEPT 10/10, 0.7s. 26b is the heavier model when accuracy matters more (e.g. offline puzzle gen + critique). Model use by role: live JUDGE → 8B; live CREATIVE_ACCEPT → 8B; offline generation → 26b with retries; offline critique → 26b grading 8B's output (or vice-versa) so the same model isn't rubber-stamping itself.
## Implementation
- **2026-04-27: Single-turn JSON pipeline payload settings (canonical for this project)** — `think: false`, `num_ctx: 8192`, `num_predict: 4096`, NO `format: "json"`; parse JSON client-side via `body[body.find('{'):body.rfind('}')+1]`; retry up to 3× with temperature bumped +0.1 each attempt. The four payload settings are mandatory per `~/bin/gemma4-research/GOTCHAS.md` for gemma4:26b/31b on Ollama 0.20.x: `format: "json"` hangs the model, the default `num_predict` of 128 truncates output, the default `num_ctx` of 2048 truncates the prompt, and an unset `think` consumes the response budget on thinking tokens.
- **2026-04-27: Inference host = local 3090 Ti (24 GB)** — delivers ~94 tok/s on gemma4:26b and ~24 tok/s on gemma4:31b; sub-second per-call latency on the short JUDGE / CREATIVE_ACCEPT prompts.
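
The client-side parse and retry rule in the payload-settings entry can be sketched roughly as below. This is an illustration under assumptions, not the bakeoff scripts' actual code: `extract_json`, `call_with_retries`, and the `call_model` callable are hypothetical names, the base temperature of 0.7 is assumed, and "up to 3×" is read here as three attempts in total.

```python
import json
from typing import Callable, Optional


def extract_json(body: str) -> Optional[dict]:
    """Slice from the first '{' to the last '}' and parse; None on failure."""
    start, end = body.find("{"), body.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(body[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def call_with_retries(
    call_model: Callable[[float], str],  # hypothetical: temperature -> raw text
    base_temperature: float = 0.7,       # assumed starting value
    max_attempts: int = 3,               # assumed reading of "up to 3x"
) -> Optional[dict]:
    """Try up to max_attempts times, bumping temperature by +0.1 each attempt."""
    for attempt in range(max_attempts):
        temperature = base_temperature + 0.1 * attempt
        parsed = extract_json(call_model(temperature))
        if parsed is not None:
            return parsed
    return None
```

Returning `None` instead of raising leaves the caller free to count the case as a parse failure, which is how the 31b homophone case is described in the bakeoff notes.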
## Deferred / Rejected
<!-- Decisions NOT to do something are just as valuable -- prevents re-proposing rejected ideas -->
- **2026-04-27 — REJECTED: Gemma self-grading puzzles** — In the bakeoff, Gemma's own "intended_traps" claims didn't always hold up (e.g., #3-26b claimed `PRESS` traps the "Words after BLOOD" group, but blood-press isn't a phrase). If we route the critique pass back through the same model, it will rubber-stamp the same kinds of errors it generates. Use a different judge: a non-Gemma model on the same host (any reasonably-capable open-weights model), or two different Gemma sizes against each other.
- **2026-04-27 — DEFERRED: Connections-vs-Gemma blind anchor** — The plan called for mixing one real NYT puzzle into the grading set. Skipped because Gemma's structural failures (duplicate tiles, broken categories) are obvious curator-rejections — the within-Gemma evidence was decisive on its own. Revisit before locking the design: eyeball one filter-passed Gemma puzzle next to a real NYT puzzle and confirm equivalence.
- **2026-04-27 — DEFERRED: Diversity-over-time test** — All 10 bakeoff puzzles were unseeded. With 31b alone, two of five were scale-themed; risk of long-term repetition. Need a seeded run (e.g., 30 puzzles with date-rotated theme prompts) before committing to a year-round daily-puzzle product.
- **2026-04-27 — DEFERRED: Critique-pass effectiveness test** — The architecture above assumes a second-model critique pass catches the broken categories. Not yet verified. Next experiment: feed the failed bakeoff puzzles into a critique prompt and check whether the model flags the actual structural issues.
- **2026-04-27 — DEFERRED: Adversarial-player robustness on CREATIVE_ACCEPT** — Test cases were honest player categories. Real players will try to game the judge with categories like "Words containing a vowel" (trivially true for most English words) or "Words that are 4-7 letters long" (true by construction in many cases). Need a category-tightness pre-check on player input (e.g. require the category to fail for at least one word on the board, or require category specificity above a threshold) before submitting it to Gemma for word-fit judging.
- **2026-04-27 — DEFERRED: Cultural/contextual category robustness** — Tested categories were lexical/factual ("Roman gods", "fruits", "things you can crack"). Cultural references ("Words in Beatles songs", "Common Texan slang") and time-bound categories may break the judge. Test before designing any feature that depends on them.
- **2026-04-27 — KNOWN LIMIT: Hard wordplay categories ("homophones of body parts") fail on all three Gemma 4 variants** — This is a structural model limit, not a configuration issue. If this category class is desired in puzzles, scaffold with worked examples in the prompt or human-curate the seed list; do not rely on unaided generation for it.
+49
@@ -0,0 +1,49 @@
# IDEA.md — seth_semantic_game
## What is this?
A daily word game **based on NYT Connections**, powered by a locally-hosted Gemma 4
model. Connections gives the player 16 words that have to be sorted into 4 hidden
groups of 4 by shared semantic category. The twist for this project — what makes it
worth building rather than just playing the original — is whatever Gemma 4 enables
that NYT's hand-curated static format cannot.
That twist is **not yet decided**. That's what brainstorming is for.
The base mechanic is fixed:
- Connections-style grouping puzzle (semantic categories, not letters)
- Gemma 4 in the loop somewhere (puzzle generation, judging, hint system, or all of
the above)
- Daily-puzzle structure with social-shareable result (the Connections / Wordle
ritual — borrowed *only* for its sharing pattern, not its gameplay)
This is **not** Wordle-derived. The original draft of this file framed it as
"Wordle-style"; that was wrong. The mechanic is grouping, not letter-guessing.
## Problem it solves
Mostly fun and a real use of the local Gemma 4 stack. NYT Connections is hand-curated
and ships one puzzle per day; a generative version could ship infinite puzzles, accept
fuzzy or creative groupings, generate themed/seeded puzzles, or do other things the
hand-built version structurally cannot. Secondary: a daily-puzzle hook for sethpc.xyz
alongside other homelab games.
## Constraints / preferences
- Self-hosted: Ollama with Gemma 4 on commodity GPU (a single 24 GB card is enough)
- Web frontend, dark theme with orange accents
- If a puzzle is generative, output must be **deterministic per day** (every player
on a given date gets the same puzzle). Likely a date-seeded prompt with cached
output rather than a fresh generation per request.
- Per-guess judging cost should be cheap — at most one Gemma call per submission, and
ideally answers are precomputed when the daily puzzle is generated, so judging
becomes a cheap lookup.
- No login required for casual play (cookies/localStorage for streak)
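
One way the "deterministic per day" constraint could be met is sketched below: derive a stable seed from the calendar date and cache the accepted puzzle under a date-keyed path. The names `daily_seed` and `cache_path`, the salt string, and the `cache/` directory are all hypothetical, chosen only for illustration.

```python
import hashlib
from datetime import date
from pathlib import Path


def daily_seed(day: date, salt: str = "seth_semantic_game") -> int:
    """Derive a stable 32-bit seed from the calendar date (salt is assumed)."""
    digest = hashlib.sha256(f"{salt}:{day.isoformat()}".encode()).digest()
    return int.from_bytes(digest[:4], "big")


def cache_path(day: date, cache_dir: Path = Path("cache")) -> Path:
    """Where the accepted puzzle for a given day would be stored (hypothetical layout)."""
    return cache_dir / f"puzzle-{day.isoformat()}.json"
```

The seed alone does not make generation reproducible across retries or model updates, so serving the cached file remains the actual source of truth for what every player sees on a given date.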
> NOTE on history: this brief was originally a "Wordle-style" framing. That was
> wrong — the seed game is NYT Connections (16 words → 4 hidden groups of 4).
> But after the model bakeoffs (see README), the *direction* shifted again:
> rather than cloning Connections, the project pivots toward gameplay that
> uses Gemma's per-call CREATIVE_ACCEPT ability to fairly judge
> player-INVENTED categories — a thing static curated games structurally can't
> do. The brainstormed game ideas in the README are what came out of that.
+235
@@ -0,0 +1,235 @@
# seth_semantic_game
**Working title.** A self-hosted word game built around an LLM's ability to fairly judge *player-invented* semantic categories in real time — something static, hand-curated word games structurally cannot do.
This repo documents the design discovery process, including two model bakeoffs that picked the architecture and a brainstormed bank of game-mechanics ideas that the actual product will draw from.
---
## TL;DR
- **Seed idea:** clone NYT Connections (16 words → 4 hidden groups of 4) with a local LLM doing the curation.
- **Seed idea died fast:** unaided whole-puzzle generation on Gemma 4 ships broken puzzles at least half the time (duplicate tiles, mislabeled categories, fake wordplay); only 2 of 5 puzzles from 31b and 1 of 5 from 26b passed cleanly. See [docs/reference/gemma-generation-bakeoff-2026-04-27-221751.md](docs/reference/gemma-generation-bakeoff-2026-04-27-221751.md).
- **The actual unlock:** Gemma 4 reliably judges whether a player-supplied category fits a player-supplied set of words. Across 35 hand-labeled cases on three model sizes, **CREATIVE_ACCEPT scored 10/10 on every model** including the 8B variant at 0.7s per call. JUDGE landed at 87.5% / 93.75% / 100% (8B / 26b / 31b). See [docs/reference/gemma-semantic-bakeoff-2026-04-27-224800.md](docs/reference/gemma-semantic-bakeoff-2026-04-27-224800.md).
- **The pivot:** stop trying to generate Connections. Build games where the *player* invents the groupings and the LLM is the live, fair judge. That's what the static format can't do.
- **Models in scope:** `gemma4:latest` (8B) for live judging, `gemma4:26b` for offline puzzle prep / critique. `gemma4:31b` was tested and is more accurate, but is intentionally out of scope for this project.
---
## What we did
Two experiments, both reproducible from `scripts/` against a local Ollama (point `OLLAMA_HOST` at your instance; defaults to `http://localhost:11434`).
### Experiment 1 — Generation bakeoff
**Question:** can Gemma 4 generate a Connections-quality 16-word / 4-group puzzle in one shot?
**Setup:** 5 puzzles per model on gemma4:26b and gemma4:31b. Strict JSON schema requesting groups + difficulty bands + claimed overlap-trap words. No format=json (that's a known Gemma 4 + Ollama hang); JSON parsed client-side; up to 3 retries with temperature bumped +0.1 each attempt.
**Results:**
| Model | Pass | Borderline | Fail | Avg s/puzzle |
|---|---|---|---|---|
| `gemma4:26b` | 1 | 1 + 1 partial | 2 | 5.2 |
| `gemma4:31b-it-q4_K_M` | 2 | 2 | 1 | 18.2 |
Failure modes ranked by severity:
1. **Structural violations** — duplicate or near-duplicate words on the 16-tile board. *Trivially detectable.*
2. **Broken category logic** — words listed in a category they don't actually fit (`DELUXE` doesn't start with the full Greek letter "DELTA"; `LIBRA` isn't a "type of scale"). *Hard to detect deterministically — needs a critique pass.*
3. **Redundant categories** — two groups themed on the same concept. Detectable.
4. **Self-graded traps don't always hold up** — Gemma's claimed `intended_traps` were sometimes nonsense (`PRESS` claimed to fit "Words after BLOOD," but the compound is *blood pressure*, not *blood press*). **Important consequence: the same model cannot be trusted to grade its own output.**
This was decisive for the project direction: unaided generation isn't viable; AND we're explicitly capping at 26b, which is the *less* reliable generator. So we need a different game shape — one that doesn't depend on the LLM generating finished puzzles unaided.
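
The "trivially detectable" structural checks from failure mode 1 can be sketched as a deterministic filter. This is a minimal sketch, not the project's pipeline: the function name `structural_problems` is hypothetical, the field names (`groups`, `words`, `intended_traps`, `word`) follow the raw bakeoff JSON, and only exact duplicates are caught (near-duplicates and redundant categories would need further checks).

```python
from collections import Counter


def structural_problems(puzzle: dict) -> list:
    """Return human-readable structural violations; an empty list means pass."""
    problems = []
    groups = puzzle.get("groups", [])
    if len(groups) != 4:
        problems.append(f"expected 4 groups, got {len(groups)}")
    for g in groups:
        if len(g.get("words", [])) != 4:
            problems.append(f"group {g.get('category')!r} does not have 4 words")
    # Normalise tiles so that 'Fruit' and 'FRUIT' count as the same word.
    words = [w.strip().upper() for g in groups for w in g.get("words", [])]
    if len(words) != 16:
        problems.append(f"expected 16 tiles, got {len(words)}")
    dupes = sorted(w for w, n in Counter(words).items() if n > 1)
    if dupes:
        problems.append(f"duplicate tiles: {', '.join(dupes)}")
    # Every word the model claims as a trap must actually be on the board.
    board = set(words)
    for trap in puzzle.get("intended_traps", []):
        if trap.get("word", "").strip().upper() not in board:
            problems.append(f"trap word not on board: {trap.get('word')}")
    return problems
```

Run against the first gemma4:26b puzzle in the raw results, where `FRUIT` appears in both "Types of Flies" and "Words that follow 'DRAGON'", this filter reports a duplicate tile and nothing else.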
### Experiment 2 — Semantic-skill bakeoff
**Question:** instead of whole-puzzle generation, can Gemma reliably perform the atomic skills a live game would need? Specifically:
- **JUDGE** — given a category and 4 words, does Gemma correctly say yes/no on whether they all fit?
- **CREATE** — given a category, does Gemma produce 4 tightly-fitting words?
- **CREATIVE_ACCEPT** — given 4 words and a *player-proposed* category, does Gemma fairly judge whether the category validates the grouping (even if it differs from any "intended" category)?
The third one is the design-relevant one. If it works, the game can let players invent their own groupings — which is structurally impossible for a hand-curated static format.
**Setup:** 35 hand-labeled cases (16 JUDGE / 10 CREATE / 9 CREATIVE_ACCEPT + 2 deliberately ambiguous) tested across `gemma4:latest` (8B), `gemma4:26b`, and `gemma4:31b`. Each case has explicit ground truth in the test bank.
**Results:**
| Model | JUDGE | CREATE | CREATIVE_ACCEPT | Avg s/call |
|---|---|---|---|---|
| `gemma4:latest` (8B) | 14/16 (87.5%) | 8/10 | **10/10** | 0.7 |
| `gemma4:26b` | 15/16 (93.75%) | 9/10 | **10/10** | 0.8 |
| `gemma4:31b-it-q4_K_M` | 16/16 | 9/10 | **10/10** | 2.3 |
**Key findings:**
- **CREATIVE_ACCEPT is decisive across all three models.** Each scored 10/10: all five player-creative-but-valid groupings accepted (e.g. `WHIP / NUT / CODE / SMILE → "Things you can crack"`) and all five invalid ones rejected (e.g. `OAK / MAPLE / BIRCH / PINE → "Furniture brands"`). The model gets the distinction.
- **8B is fast enough to use as a live judge.** Sub-second on a 24 GB consumer GPU; per-guess economics are effectively free.
- **26b is mildly over-permissive on borderline cases.** It accepted KIWI as a tech brand (`APPLE / ORANGE / KIWI / BLACKBERRY → "Tech/phone brands"`). 8B and 31b were stricter. For a live game, false-positives degrade integrity more than false-negatives — so 8B's calibration is the right tradeoff for live judging.
- **One failure mode is shared by all three models:** "homophones-of-body-parts" (8B gave SEA/SEE/HEAR/HERE — none of which sound like body parts; 26b gave EYE which IS a body part rather than a homophone of one; 31b parse-failed three times running). Avoid this category class or scaffold prompts with worked examples.
---
## What we picked
**Model assignments:**
| Role | Model | Why |
|---|---|---|
| Live JUDGE (per player guess) | `gemma4:latest` (8B) | Sub-second, strict-enough calibration, 87.5% accuracy on tight cases |
| Live CREATIVE_ACCEPT | `gemma4:latest` (8B) | 10/10 in test, sub-second |
| Offline puzzle generation (if used at all) | `gemma4:26b` with strict filters + retries | 31b is out of scope by user constraint; 26b plus a deterministic post-filter and a critique pass is the workable path |
| Offline critique pass | `gemma4:26b` grading 8B's work, OR a non-Gemma open-weights judge | A model cannot be trusted to grade itself — the bakeoff confirmed Gemma rubber-stamps its own structural mistakes |
**Operational gotchas baked into the scripts** (all from upstream Gemma 4 + Ollama issue tracker; documented in the bakeoff scripts):
- No `format: "json"` — server-side JSON enforcer hangs gemma4:26b Q4 indefinitely; ask for JSON in the prompt and parse client-side.
- `think: false` for single-turn JSON pipelines — otherwise thinking tokens consume the response budget and `response` comes back empty.
- Override Ollama defaults: `num_ctx` (default 2048 truncates the prompt), `num_predict` (default 128 truncates the output).
- For multi-turn tool-calling agents the rule is the opposite: leave `think` unset on 26b. Not relevant here, but worth knowing.
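
As a rough illustration of the first three gotchas, a single-turn request body might be built as follows. This is a sketch under assumptions rather than the scripts' own code: `build_payload` and `generate` are hypothetical names, the `num_ctx` / `num_predict` values are the ones recorded in DECISIONS.md, and the default temperature and the 120-second timeout are assumed. It targets Ollama's `/api/generate` endpoint and reads the `response` field of the reply.

```python
import json
import os
import urllib.request


def build_payload(model: str, prompt: str, temperature: float = 0.7) -> dict:
    """Request body for /api/generate with the overrides listed above."""
    return {
        "model": model,
        "prompt": prompt,      # ask for JSON in the prompt text itself
        "stream": False,
        "think": False,        # keep thinking tokens out of the response budget
        # deliberately no "format": "json" key (server-side enforcer hangs 26b)
        "options": {
            "num_ctx": 8192,       # default 2048 truncates the prompt
            "num_predict": 4096,   # default 128 truncates the output
            "temperature": temperature,  # assumed default, bumped on retries
        },
    }


def generate(model: str, prompt: str, temperature: float = 0.7) -> str:
    """POST to the configured Ollama host and return the raw response text."""
    host = os.environ.get("OLLAMA_HOST", "http://localhost:11434")
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(build_payload(model, prompt, temperature)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:  # timeout is assumed
        return json.loads(resp.read())["response"]
```

Keeping the payload builder separate from the network call makes it easy to check the settings in a unit test without a running Ollama.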
---
## Game-mechanics idea bank
The two bakeoffs together say: **don't build a game where the LLM is the curator. Build a game where the LLM is the live, fair judge of player creativity.** Below are 10 distinct game ideas that take that as the design constraint. None of them is Connections; each one leans on something a static game structurally can't replicate (live category validation, multi-solution puzzles, generative answer pools, semantic chains, etc.).
Each idea lists its **tempo** (how fast the game feels), the **AI calls per turn** (so cost can be reasoned about), and the **structural novelty** (the thing this idea can do that a hand-curated static format cannot).
### Fast-paced (≤60-second rounds)
#### 1. **Pile** — speedrun categorize
- **Tempo:** real-time, 60-second rounds.
- **Mechanic:** A pool of ~16 random words. You drag any 3–5 of them into a box and type a category. The LLM (8B) judges in ~0.7s. Accepted → those words clear, refilled from a deck. Rejected → they stay. Score = words categorized per minute.
- **AI calls:** 1 per submission (CREATIVE_ACCEPT shape: player-supplied category + player-supplied words).
- **Structural novelty:** the player invents groupings under time pressure; categories aren't pre-known. A static game has a single fixed answer per puzzle; this one has open-ended valid answers as long as the LLM can confirm tightness.
#### 2. **Bridge** — single-word polysemy speedrun
- **Tempo:** real-time, ~10 sec per move.
- **Mechanic:** Two category cards on screen ("Words for sharp pain" and "Things that bite"). Type one word the LLM accepts as fitting BOTH (e.g. `STING`). Move on. Faster = more points.
- **AI calls:** 2 JUDGE calls per submission (one per category, on the player's word).
- **Structural novelty:** the polysemy/multi-meaning skill — a known Connections difficulty axis — turned into the *primary* gameplay loop. Static games can plant such words but can't let the player invent them on demand.
#### 3. **Threaded** — semantic word chains
- **Tempo:** real-time / continuous.
- **Mechanic:** Words drift across a conveyor belt. You build a chain by linking consecutive words with a category the LLM accepts ("APPLE → ORANGE: both fruits" → "ORANGE → RED: both colors" → "RED → ANGRY: red with anger"). Chain length = score. One chain per game.
- **AI calls:** 1 JUDGE per link, on the player's pair-and-category.
- **Structural novelty:** emergent semantic graphs from arbitrary word streams. The category set isn't pre-built — it's whatever the player can find. A static game can't be open-ended on the connection vocabulary.
### Medium-paced (5–15 minute sessions)
#### 4. **Stretch** — push a category to its limit
- **Tempo:** medium, 5-min sessions.
- **Mechanic:** The game opens with a tight seed category and 4 starting words ("Types of trees: OAK, MAPLE, BIRCH, PINE"). Add a 5th word — does it still fit? LLM judges. Yes → add a 6th. Each accepted word = +1 point. First rejection ends the round. Some categories support more stretch than others (broader = more elastic).
- **AI calls:** 1 JUDGE per word added.
- **Structural novelty:** category *elasticity* as a gameplay dimension. There's no pre-set answer length. The player learns intuitions about which categories admit how much stretching — a meta-skill no static game develops.
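
The Stretch round logic is small enough to sketch directly. This is a hypothetical outline: `play_stretch` and the `judge` callable (a stand-in for one JUDGE call on the newly added word) are illustrative names, and judging only the new word rather than the whole growing group is an assumption.

```python
from typing import Callable, Iterable


def play_stretch(
    category: str,
    proposals: Iterable[str],
    judge: Callable[[str, str], bool],  # hypothetical: (category, word) -> fits?
) -> int:
    """Score one round: +1 per accepted word, first rejection ends the round."""
    score = 0
    for word in proposals:
        if not judge(category, word):
            break
        score += 1
    return score
```

With a stub judge that only knows a fixed set of tree names, proposing `ELM`, `ASH`, `ROCK`, `FIR` in that order would score 2, since the round ends at `ROCK`.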
#### 5. **Inverse** — multi-solution sort
- **Tempo:** medium, ~10 min per puzzle.
- **Mechanic:** 16 words on a board with NO predetermined grouping. The player sorts them into ANY 4 groups of 4 with ANY categories of their choice. The LLM judges all 4 categories. All 4 valid → win. Bonus for tightness (LLM rates each category 1–5).
- **AI calls:** 4 CREATIVE_ACCEPT per submission, plus optional 4 tightness-score calls.
- **Structural novelty:** Connections has *one* valid answer; this version has thousands. Players compete on creativity and tightness, not on guessing the curator's mind.
#### 6. **Misfit** — odd-one-out, then redeem
- **Tempo:** medium, ~3 min per puzzle.
- **Mechanic:** The game shows a category and 4–5 words; one of them doesn't quite fit. Stage 1: identify the misfit. Stage 2 (bonus): propose a category the *misfit* word DOES fit. Both stages judged by the LLM.
- **AI calls:** 1 JUDGE on stage 1 (verifies the misfit), 1 CREATIVE_ACCEPT on stage 2 (validates the player's redemption category).
- **Structural novelty:** the second stage — "what category does the wrong word actually fit?" — is essentially impossible without live judging. Static games can plant misfits; they can't accept arbitrary creative redemptions.
### Slow / daily
#### 7. **Coalition** — daily creativity leaderboard
- **Tempo:** daily, 24-hour cycle, async.
- **Mechanic:** Once per day, the system publishes 16 words (offline-generated by 26b with the guarded pipeline + filter + critique pass). All players worldwide get the same 16. Each player submits their own 4×4 sort with 4 self-supplied categories. Server collects all submissions. Daily leaderboard ranks by:
- **Validity:** all 4 categories accepted by the LLM (binary gate).
- **Tightness score:** LLM rates each category 1–5; submission score is the average.
- **Uniqueness:** how few other players used the same exact grouping (rewards creativity over the obvious solution).
- **AI calls:** 4 CREATIVE_ACCEPT + 4 tightness ratings per submission.
- **Structural novelty:** the social/share ritual of Wordle and Connections, but with creativity as the leaderboard axis instead of speed-to-known-answer. "I split the daily 16 with the only 'Greek myths' grouping anyone found" is a different brag than "I solved it with 2 mistakes."
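
The three Coalition leaderboard axes could be combined along these lines. Everything here is a sketch under assumptions: `Submission`, `make_groups`, and `score` are hypothetical names, and the uniqueness formula (one minus the share of other players with the identical grouping) is just one possible reading of "how few other players used the same exact grouping".

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


def make_groups(groups) -> frozenset:
    """Order-insensitive representation of a sort (a set of word sets)."""
    return frozenset(frozenset(g) for g in groups)


@dataclass(frozen=True)
class Submission:
    player: str
    groups: frozenset   # as built by make_groups
    accepted: tuple     # one CREATIVE_ACCEPT verdict per category
    tightness: tuple    # one LLM rating (1 to 5) per category


def score(sub: Submission, everyone: list) -> Optional[dict]:
    """Return the leaderboard axes, or None if the validity gate fails."""
    if not all(sub.accepted):
        return None  # binary gate: every category must be accepted
    others = [s for s in everyone if s.player != sub.player]
    same = sum(1 for s in others if s.groups == sub.groups)
    # Assumed formula: 1.0 means nobody else found this exact grouping.
    uniqueness = 1.0 if not others else 1.0 - same / len(others)
    return {"tightness": mean(sub.tightness), "uniqueness": uniqueness}
```

Comparing groupings as sets of sets means two players who found the same partition under different category names still count as the same grouping, which matches the wording of the uniqueness rule but is a design choice worth revisiting.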
#### 8. **Bench** — collaborative single-category foraging
- **Tempo:** daily, 24-hour async.
- **Mechanic:** Each day a single category is published ("Words that follow GREEN" or "Things you can break"). Players have 24 hours to submit as many words as they can; LLM judges each. Each accepted word is "claimed" by the first submitter (publicly visible). Per-player score = unique claims.
- **AI calls:** 1 JUDGE per submitted word.
- **Structural novelty:** the *answer set is generative*, not hand-curated. NYT can't ship an open-ended "submit anything that fits" puzzle because they don't know all the answers; the LLM does (well enough for 87.5% of cases, with the bench growing publicly to fill in the rest).
### Hybrid / structurally distinctive
#### 9. **Heist** — competitive bluff-and-claim
- **Tempo:** medium-fast, 2-team multiplayer.
- **Mechanic:** Two teams share a pool of words. Each turn, the active team **announces a category** ("Words that follow BLUE") and has 30 seconds to claim words from the pool that fit. The opposing team can **challenge** any claim — if the LLM agrees the word doesn't fit, the claiming team loses points; if it does, the challenger loses points. Bluffing dynamics emerge naturally: claim a borderline word and dare them to challenge.
- **AI calls:** 1 JUDGE per claim (at challenge-time only — no need to judge unchallenged claims unless you want a "true scoring" cleanup pass at end-of-game).
- **Structural novelty:** competitive *risk-taking* on category boundaries. The challenge mechanic literally requires a live, fair judge — there's no static-game equivalent because static games can't adjudicate disputes mid-play.
#### 10. **Hidden** — find the broadest tight category
- **Tempo:** medium, ~5 min per puzzle.
- **Mechanic:** 12 (or more) words on a board. Find ONE category that fits ALL of them; the *narrower / more specific* the category, the higher the score. ("Things that exist" gets you 1 point; "Things you'd find in a 1980s bedroom" gets you 8.) LLM judges on both validity (does it actually fit all 12?) and tightness (1–5).
- **AI calls:** 1 batched JUDGE (on category × 12 words) per submission, plus 1 tightness rating.
- **Structural novelty:** the inversion. Every other word game asks the player to find narrow groups inside a board; this one asks the player to find a single category broad enough to cover the *whole* board that *still* feels tight. A different cognitive skill, and impossible without live category judging.
---
## Recombinable building blocks
The 10 ideas above mix six primitives. Use these to remix or design new variants:
| Primitive | Variants |
|---|---|
| **Time pressure** | Real-time / per-move timer / per-day async / untimed |
| **Goal direction** | Find a valid grouping · validate a player-proposed grouping · find a misfit · find a "bridge" word · find the broadest tight category · build a chain |
| **Player count** | Solo · async-multi (Wordle-shape) · sync-co-op · sync-versus |
| **Word source** | Daily-curated 16 · player-supplied · conveyor-fed stream · category-seeded generation |
| **Scoring axis** | Speed · count · uniqueness vs other players · LLM-rated tightness · chain length |
| **AI call shape** | JUDGE single · JUDGE batched (one category × N words) · CREATIVE_ACCEPT · CREATE (rare — from the bakeoff this is the least reliable axis) · tightness-rating |
Easy recombinations to consider:
- **Pile + Coalition** = daily 60-second speedrun on the day's curated word pool, leaderboard by score.
- **Stretch + Hidden** = find the longest broadest category that still passes the tightness bar.
- **Heist + Threaded** = chain-builder versus mode where teams steal links from each other's chains.
- **Bench + Misfit** = daily foraging where some submissions are deliberate adversarial misfits the community has to flag.
---
## Open questions / things still untested
1. **Adversarial player input on CREATIVE_ACCEPT.** Tests used honest categories. Real players will try to game the judge with categories like "Words containing a vowel" (trivially true for most English words) or "Words that are 4–7 letters long" (true by construction in many cases). Need a category-tightness pre-check on player input: at minimum, require the category to *fail* for at least one word from the wider deck, or apply a specificity bar.
2. **Cultural / contextual category robustness.** Tested categories were lexical/factual ("Roman gods", "fruits", "things you can crack"). Cultural references and time-bound categories ("Words in Beatles songs", "Common Texan slang") may break the judge.
3. **Critique-pass effectiveness.** The generation pipeline assumes a second-model critique pass catches structural mistakes. Not yet verified — feed Experiment 1's failed puzzles into a critique prompt and check.
4. **8B's "no" bias on hard YES cases.** It missed `judge-y3` (days of the week — said all four were misfits, which was incoherent) and `judge-y6` (cold turkey). 8B might be slightly more conservative in production than its test numbers suggest.
5. **Diversity over time.** All 10 puzzles in Experiment 1 were unseeded; 31b reached for "scales" twice in 5 puzzles. With 26b alone for generation, the diversity question is sharper. A 30-day seeded run is the next experiment if any of the daily-puzzle ideas (Coalition, Bench) goes forward.
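
For open question 1, the "must fail for at least one word from the wider deck" rule might look like the sketch below. It is untested against real adversarial input: `passes_tightness_precheck` and the `fits` callable (a stand-in for a single-word JUDGE call) are hypothetical names, and `min_rejections` is an assumed knob.

```python
from typing import Callable, Sequence


def passes_tightness_precheck(
    category: str,
    wider_deck: Sequence[str],
    fits: Callable[[str, str], bool],  # hypothetical: (category, word) -> fits?
    min_rejections: int = 1,           # assumed threshold
) -> bool:
    """Accept a category only if it excludes enough words from the wider deck."""
    rejections = 0
    for word in wider_deck:
        if not fits(category, word):
            rejections += 1
            if rejections >= min_rejections:
                return True  # stop early to save judge calls
    return False
```

A category such as "Words containing a vowel" would be refused here because no deck word is rejected. In the worst case this costs one judge call per deck word, so a batched JUDGE call (one category × N words) may be the cheaper shape.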
---
## Repo structure
```
.
├── README.md # this file
├── IDEA.md # original brief, with note about the pivot
├── DECISIONS.md # decision log, kept as project moves forward
├── scripts/
│ ├── gemma-generation-bakeoff.py # Experiment 1 — whole-puzzle generation
│ └── gemma-semantic-bakeoff.py # Experiment 2 — atomic skills
└── docs/reference/
├── gemma-generation-bakeoff-2026-04-27-221751.md # Experiment 1 report (graded)
├── gemma-generation-bakeoff-2026-04-27-221751-raw.json
├── gemma-semantic-bakeoff-2026-04-27-224800.md # Experiment 2 report (graded)
└── gemma-semantic-bakeoff-2026-04-27-224800-raw.json
```
## Reproduce
```bash
# point at any local Ollama with gemma4:latest and gemma4:26b loaded
export OLLAMA_HOST=http://localhost:11434
python3 scripts/gemma-semantic-bakeoff.py # ~5 min on a 24 GB GPU
python3 scripts/gemma-generation-bakeoff.py # ~10 min
```
Reports land in `docs/reference/` with timestamps. Hand-grade the CREATE outputs and any TODO grades inline in the markdown — both bakeoff scripts emit grading-friendly reports.
## License
Not yet specified. If you're considering using this code or the test bank in your own work, open an issue and ask.
@@ -0,0 +1,790 @@
[
{
"model": "gemma4:26b",
"i": 1,
"dt": 5.670714855194092,
"ok": true,
"attempts": 1,
"puzzle": {
"theme_seed": "An exploration of common words that can be paired with 'BIRD' or 'DRAGON' or relate to types of 'FLY'.",
"groups": [
{
"difficulty": "yellow",
"category": "Types of Flies",
"words": [
"FRUIT",
"HOUSE",
"FIRE",
"BLOW"
]
},
{
"difficulty": "green",
"category": "Words before 'FLY'",
"words": [
"DRAGON",
"MAY",
"BUTTER",
"BOTTLE"
]
},
{
"difficulty": "blue",
"category": "Types of Birds",
"words": [
"LARK",
"SWALLOW",
"CRANE",
"SWIFT"
]
},
{
"difficulty": "purple",
"category": "Words that follow 'DRAGON'",
"words": [
"FRUIT",
"FLY",
"SLAYER",
"FLAME"
]
}
],
"intended_traps": [
{
"word": "FRUIT",
"actual_group": "purple",
"trap_group": "yellow",
"reason": "Fruit fly is a very common type of fly, making it a primary red herring for the yellow group."
},
{
"word": "FIRE",
"actual_group": "yellow",
"trap_group": "purple",
"reason": "Fire dragon or dragon flame connection could lead someone to look for a dragon-themed group."
},
{
"word": "DRAGON",
"actual_group": "green",
"trap_group": "blue",
"reason": "Dragon is a mythical creature, and users might look for a 'mythical creatures' group including Crane or Swallow (which are real birds)."
},
{
"word": "FLY",
"actual_group": "purple",
"trap_group": "green",
"reason": "Dragonfly is a common word, tempting the player to put 'FLY' in the 'Words before FLY' group."
}
]
},
"eval_count": 500,
"prompt_eval_count": 440
},
{
"model": "gemma4:26b",
"i": 2,
"dt": 5.300434589385986,
"ok": true,
"attempts": 1,
"puzzle": {
"theme_seed": "Homophones and linguistic overlaps between nature and everyday objects.",
"groups": [
{
"difficulty": "yellow",
"category": "Types of Trees",
"words": [
"BIRCH",
"CEDAR",
"MAPLE",
"PINE"
]
},
{
"difficulty": "green",
"category": "Things that can be 'Dropped'",
"words": [
"BEAT",
"HINT",
"LINE",
"SUBJECT"
]
},
{
"difficulty": "blue",
"category": "Units of Measurement",
"words": [
"FOOT",
"GRAM",
"SECOND",
"STONE"
]
},
{
"difficulty": "purple",
"category": "Homophones of Body Parts",
"words": [
"EYE",
"HARE",
"MUSSEL",
"QUEUE"
]
}
],
"intended_traps": [
{
"word": "PINE",
"actual_group": "yellow",
"trap_group": "green",
"reason": "To 'pine' for someone is a feeling, but 'pine' could be mistaken for something you can 'drop' (like a pine needle)."
},
{
"word": "FOOT",
"actual_group": "blue",
"trap_group": "yellow",
"reason": "A 'foot' is a unit of measurement, but it is also a body part, potentially leading to the 'Body Part Homophones' group."
},
{
"word": "BEAT",
"actual_group": "green",
"trap_group": "blue",
"reason": "A 'beat' is a rhythm, but it could be mistaken for a unit of time/measurement in a musical context."
},
{
"word": "EYE",
"actual_group": "purple",
"trap_group": "blue",
"reason": "The word 'eye' is a body part, which might tempt players looking for a body part category, but it is actually a homophone (I) for the purple group."
}
]
},
"eval_count": 510,
"prompt_eval_count": 440
},
{
"model": "gemma4:26b",
"i": 3,
"dt": 5.103548049926758,
"ok": true,
"attempts": 1,
"puzzle": {
"theme_seed": "The concept of 'Pressure' and things that can be measured or applied.",
"groups": [
{
"difficulty": "yellow",
"category": "Units of Pressure",
"words": [
"BAR",
"PASCAL",
"PSI",
"TORR"
]
},
{
"difficulty": "green",
"category": "Words after 'BLOOD'",
"words": [
"PRESSURE",
"TYPE",
"ORANGE",
"DONOR"
]
},
{
"difficulty": "blue",
"category": "Things that can be 'Broken'",
"words": [
"HEART",
"RECORD",
"LAW",
"PROMISE"
]
},
{
"difficulty": "purple",
"category": "Words that are also Verbs for 'Applying Force'",
"words": [
"PRESS",
"STAMP",
"PUNCH",
"DRIVE"
]
}
],
"intended_traps": [
{
"word": "PRESSURE",
"actual_group": "green",
"trap_group": "yellow",
"reason": "Pressure is the concept being measured by the units in the yellow group."
},
{
"word": "BAR",
"actual_group": "yellow",
"trap_group": "blue",
"reason": "A 'bar' is something that can be 'broken' (like a chocolate bar) or refers to a legal setting."
},
{
"word": "PRESS",
"actual_group": "purple",
"trap_group": "green",
"reason": "The 'Press' is a common term in news/media, and 'Press' can be related to 'Blood' in a medical context (blood pressure)."
},
{
"word": "RECORD",
"actual_group": "blue",
"trap_group": "purple",
"reason": "A 'record' is something you can 'stamp' or 'drive' (as in a driving record)."
}
]
},
"eval_count": 492,
"prompt_eval_count": 440
},
{
"model": "gemma4:26b",
"i": 4,
"dt": 5.257405757904053,
"ok": true,
"attempts": 1,
"puzzle": {
"theme_seed": "A collection of words related to things that can be 'broken' or 'cracked', mixed with other categories like cooking and music.",
"groups": [
{
"difficulty": "yellow",
"category": "Kinds of Eggs",
"words": [
"CHICKEN",
"QUAIL",
"GOOSE",
"DUCK"
]
},
{
"difficulty": "green",
"category": "Things You Can Crack",
"words": [
"NUT",
"CODE",
"WHIP",
"SMILE"
]
},
{
"difficulty": "blue",
"category": "Musical Terms",
"words": [
"NOTE",
"CHORD",
"CLEF",
"BEAT"
]
},
{
"difficulty": "purple",
"category": "Words Before 'Egg'",
"words": [
"NEST",
"EASTER",
"BAD",
"NESTING"
]
}
],
"intended_traps": [
{
"word": "DUCK",
"actual_group": "yellow",
"trap_group": "green",
"reason": "DUCK is an action related to avoiding a whip or a blow, and can be paired with 'crack' in some linguistic contexts."
},
{
"word": "CHICKEN",
"actual_group": "yellow",
"trap_group": "purple",
"reason": "CHICKEN is an egg-related word that could tempt someone looking for '___ Egg' phrases (like Chicken Egg)."
},
{
"word": "BEAT",
"actual_group": "blue",
"trap_group": "green",
"reason": "BEAT is something you can do to a whip or a person, fitting the 'Things you can crack/strike' theme."
},
{
"word": "NEST",
"actual_group": "purple",
"trap_group": "yellow",
"reason": "NEST is intrinsically linked to the birds in the yellow group (CHICKEN, GOOSE, etc.)."
}
]
},
"eval_count": 506,
"prompt_eval_count": 440
},
{
"model": "gemma4:26b",
"i": 5,
"dt": 4.578032493591309,
"ok": true,
"attempts": 1,
"puzzle": {
"theme_seed": "A collection of words centered around various types of 'records' and 'tracks'.",
"groups": [
{
"difficulty": "yellow",
"category": "Types of Records",
"words": [
"LP",
"EP",
"Single",
"Album"
]
},
{
"difficulty": "green",
"category": "Things that have Tracks",
"words": [
"Train",
"Athlete",
"Song",
"Mud"
]
},
{
"difficulty": "blue",
"category": "Words for 'Evidence'",
"words": [
"Clue",
"Trace",
"Mark",
"Proof"
]
},
{
"difficulty": "purple",
"category": "Words that follow 'Side'",
"words": [
"Effect",
"Kick",
"Split",
"Walk"
]
}
],
"intended_traps": [
{
"word": "Single",
"actual_group": "yellow",
"trap_group": "blue",
"reason": "A 'single' can be a piece of evidence or a solitary item."
},
{
"word": "Song",
"actual_group": "green",
"trap_group": "yellow",
"reason": "A song is often released as an LP, EP, or Single."
},
{
"word": "Trace",
"actual_group": "blue",
"trap_group": "green",
"reason": "A trace can be a physical track left behind in mud."
},
{
"word": "Split",
"actual_group": "purple",
"trap_group": "green",
"reason": "A 'split' can refer to a track or division in a race."
}
]
},
"eval_count": 436,
"prompt_eval_count": 440
},
{
"model": "gemma4:31b-it-q4_K_M",
"i": 1,
"dt": 18.89196801185608,
"ok": true,
"attempts": 1,
"puzzle": {
"theme_seed": "Words related to circles, timing, and subtle wordplay with 'Hands'",
"groups": [
{
"difficulty": "yellow",
"category": "Circular Objects",
"words": [
"RING",
"COIN",
"WHEEL",
"DISC"
]
},
{
"difficulty": "green",
"category": "Units of Time",
"words": [
"SECOND",
"MINUTE",
"HOUR",
"DAY"
]
},
{
"difficulty": "blue",
"category": "Ways to Assist",
"words": [
"HELP",
"AID",
"BACK",
"SUPPORT"
]
},
{
"difficulty": "purple",
"category": "Things with Hands",
"words": [
"CLOCK",
"WATCH",
"DECK",
"PLAYER"
]
}
],
"intended_traps": [
{
"word": "SECOND",
"actual_group": "green",
"trap_group": "blue",
"reason": "To 'second' a motion is to support it."
},
{
"word": "WATCH",
"actual_group": "purple",
"trap_group": "green",
"reason": "A watch is a device used to measure the units of time (second, minute, hour)."
},
{
"word": "CLOCK",
"actual_group": "purple",
"trap_group": "green",
"reason": "Similar to watch, it is intrinsically linked to the time units group."
},
{
"word": "BACK",
"actual_group": "blue",
"trap_group": "yellow",
"reason": "Could be mistaken as a part of a circular object or a physical position."
}
]
},
"eval_count": 439,
"prompt_eval_count": 440
},
{
"model": "gemma4:31b-it-q4_K_M",
"i": 2,
"dt": 16.603100776672363,
"ok": true,
"attempts": 1,
"puzzle": {
"theme_seed": "A mix of architectural elements, types of scales, and linguistic wordplay",
"groups": [
{
"difficulty": "yellow",
"category": "Parts of a Building",
"words": [
"BEAM",
"COLUMN",
"EAVE",
"FOUNDATION"
]
},
{
"difficulty": "green",
"category": "Types of Scales",
"words": [
"LIBRA",
"MUSICAL",
"WEIGHING",
"TIMELINE"
]
},
{
"difficulty": "blue",
"category": "Synonyms for Balance",
"words": [
"EQUILIBRIUM",
"POISE",
"STEADY",
"STABILITY"
]
},
{
"difficulty": "purple",
"category": "Words that start with a Greek Letter",
"words": [
"ALPHABET",
"BETRAYAL",
"GAMUT",
"DELUXE"
]
}
],
"intended_traps": [
{
"word": "LIBRA",
"actual_group": "green",
"trap_group": "blue",
"reason": "Libra is the sign of the scales, which represents balance/equilibrium."
},
{
"word": "COLUMN",
"actual_group": "yellow",
"trap_group": "green",
"reason": "A column can be a way to organize data on a scale or chart."
},
{
"word": "GAMUT",
"actual_group": "purple",
"trap_group": "green",
"reason": "The 'gamut' refers to the full range or scale of something."
}
]
},
"eval_count": 410,
"prompt_eval_count": 440
},
{
"model": "gemma4:31b-it-q4_K_M",
"i": 3,
"dt": 17.279732942581177,
"ok": true,
"attempts": 1,
"puzzle": {
"theme_seed": "Words related to vision, optics, and ocular puns",
"groups": [
{
"difficulty": "yellow",
"category": "Parts of the Eye",
"words": [
"CORNEA",
"PUPIL",
"IRIS",
"RETINA"
]
},
{
"difficulty": "green",
"category": "Types of Lenses",
"words": [
"CONCAVE",
"CONVEX",
"CONTACT",
"BIFOCAL"
]
},
{
"difficulty": "blue",
"category": "Words meaning 'To Witness'",
"words": [
"OBSERVE",
"SPOT",
"NOTE",
"VIEW"
]
},
{
"difficulty": "purple",
"category": "___ EYE",
"words": [
"BULLS",
"BIRD",
"FISH",
"PRIVATE"
]
}
],
"intended_traps": [
{
"word": "PUPIL",
"actual_group": "yellow",
"trap_group": "blue",
"reason": "A pupil is a student, and one might 'note' or 'observe' a pupil in a classroom setting."
},
{
"word": "PRIVATE",
"actual_group": "purple",
"trap_group": "blue",
"reason": "Private can be a synonym for 'secret' or 'hidden', contrasting with the act of observing or spotting."
},
{
"word": "CONTACT",
"actual_group": "green",
"trap_group": "blue",
"reason": "To 'contact' someone can be confused with 'spotting' or 'viewing' them in a social context."
}
]
},
"eval_count": 422,
"prompt_eval_count": 440
},
{
"model": "gemma4:31b-it-q4_K_M",
"i": 4,
"dt": 19.867947340011597,
"ok": true,
"attempts": 1,
"puzzle": {
"theme_seed": "Common words that can also be types of birds or musical terms",
"groups": [
{
"difficulty": "yellow",
"category": "Things that can be 'Cast'",
"words": [
"SPELL",
"VOTE",
"SHADOW",
"NET"
]
},
{
"difficulty": "green",
"category": "Musical Dynamics",
"words": [
"PIANO",
"FORTE",
"SHARP",
"FLAT"
]
},
{
"difficulty": "blue",
"category": "Types of Birds",
"words": [
"CRANE",
"SWALLOW",
"SWIFT",
"LOON"
]
},
{
"difficulty": "purple",
"category": "Words that mean 'to move quickly'",
"words": [
"DART",
"BOLT",
"RUSH",
"FLY"
]
}
],
"intended_traps": [
{
"word": "SWIFT",
"actual_group": "blue",
"trap_group": "purple",
"reason": "Swift is a synonym for moving quickly, fitting the 'move quickly' group."
},
{
"word": "FLY",
"actual_group": "purple",
"trap_group": "blue",
"reason": "Fly is a common bird-related term (or a type of insect), making it look like it belongs with Crane and Swallow."
},
{
"word": "SWALLOW",
"actual_group": "blue",
"trap_group": "purple",
"reason": "While a bird, if the user interprets 'move quickly' as 'gulp down', it might confuse them, though primarily it serves as a bird lure."
},
{
"word": "SHARP",
"actual_group": "green",
"trap_group": "purple",
"reason": "A 'sharp' turn or movement can be associated with quickness/darting."
}
]
},
"eval_count": 487,
"prompt_eval_count": 440
},
{
"model": "gemma4:31b-it-q4_K_M",
"i": 5,
"dt": 18.18777823448181,
"ok": true,
"attempts": 2,
"puzzle": {
"theme_seed": "Words related to types of 'scales' and musical terminology",
"groups": [
{
"difficulty": "yellow",
"category": "Musical Scales",
"words": [
"MAJOR",
"MINOR",
"CHROMATIC",
"PENTATONIC"
]
},
{
"difficulty": "green",
"category": "Things that Weigh",
"words": [
"BALANCE",
"SPRING",
"DIGITAL",
"TRIPLE-BEAM"
]
},
{
"difficulty": "blue",
"category": "Ways to Climb",
"words": [
"ASCEND",
"MOUNT",
"SCALE",
"SURMOUNT"
]
},
{
"difficulty": "purple",
"category": "___ NOTE",
"words": [
"FOOT",
"BANK",
"KEY",
"HIGH"
]
}
],
"intended_traps": [
{
"word": "SCALE",
"actual_group": "blue",
"trap_group": "yellow",
"reason": "It is the root word for musical scales, leading the player to look for other scale-related terms."
},
{
"word": "KEY",
"actual_group": "purple",
"trap_group": "yellow",
"reason": "A 'key' is fundamentally linked to musical scales (e.g., the Key of C Major)."
},
{
"word": "HIGH",
"actual_group": "purple",
"trap_group": "blue",
"reason": "High is an adjective often associated with climbing or ascending."
},
{
"word": "BALANCE",
"actual_group": "green",
"trap_group": "blue",
"reason": "Balance can be seen as a state of being when climbing or mountaineering."
}
]
},
"eval_count": 453,
"prompt_eval_count": 440
}
]
@@ -0,0 +1,278 @@
# Gemma 4 Generation Bakeoff -- 2026-04-27-221751
## Setup
- Local Ollama on the test host (RTX 3090 Ti, 24 GB VRAM)
- Other GPU workloads paused for the duration of the run
- Models: `gemma4:26b`, `gemma4:31b-it-q4_K_M`
- 5 puzzles per model, base temperature 0.8
- Gemma 4 settings (per `~/bin/gemma4-research/GOTCHAS.md`): `think=false`, `num_ctx=8192`, `num_predict=4096`. No `format=json` (infinite-loop bug). JSON extracted client-side via `body[body.find('{'):body.rfind('}')+1]`.
- Up to 3 attempts per puzzle with temperature bumped +0.1 each retry (AI_Visualizer pattern). Reported metrics use the *successful* attempt.
- One-shot, unaided generation. No critique pass, no example puzzle in prompt.
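The client-side extraction and retry settings above can be sketched as follows. This is a minimal illustration, not the bakeoff script itself: `extract_json`, `generate_with_retries`, and the `call_model` callable are hypothetical names, and the sketch assumes a retry is triggered when no JSON object can be parsed from the reply (the report does not state the exact retry condition for this run).

```python
import json
from typing import Callable, Optional, Tuple


def extract_json(body: str) -> Optional[dict]:
    """Slice from the first '{' to the last '}' and parse it; None on failure."""
    start, end = body.find("{"), body.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(body[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def generate_with_retries(
    call_model: Callable[[float], str],  # hypothetical: temperature -> raw reply text
    base_temperature: float = 0.8,
    max_attempts: int = 3,
    bump: float = 0.1,
) -> Tuple[Optional[dict], int]:
    """Return (puzzle, attempts used); puzzle is None if every attempt failed."""
    for attempt in range(1, max_attempts + 1):
        # Attempt 1 runs at the base temperature; each retry adds `bump`.
        temperature = base_temperature + bump * (attempt - 1)
        puzzle = extract_json(call_model(temperature))
        if puzzle is not None:
            return puzzle, attempt
    return None, max_attempts
```

Keeping the model call behind a plain callable means the retry logic can be exercised with canned replies, without a running Ollama instance.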
## Timing
| Model | n | avg s | avg tokens | tok/s |
|---|---|---|---|---|
| `gemma4:26b` | 5 | 5.2 | 489 | 94.3 |
| `gemma4:31b-it-q4_K_M` | 5 | 18.2 | 442 | 24.3 |
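The table columns can be recomputed from the per-puzzle records in the JSON results file. The sketch below assumes `tok/s` is total `eval_count` divided by total `dt` per model, which is consistent with the numbers reported here; `timing_summary` is a hypothetical helper name, and the `dt` values are the 26b records rounded to four decimals.

```python
from collections import defaultdict


def timing_summary(records):
    """Per model: (n, average seconds, average tokens, tokens per second)."""
    by_model = defaultdict(list)
    for rec in records:
        by_model[rec["model"]].append(rec)
    summary = {}
    for model, recs in by_model.items():
        total_s = sum(r["dt"] for r in recs)
        total_tok = sum(r["eval_count"] for r in recs)
        n = len(recs)
        # Throughput is computed over the whole batch, not averaged per puzzle.
        summary[model] = (n, total_s / n, total_tok / n, total_tok / total_s)
    return summary


# The five gemma4:26b records from the results file (dt rounded).
records_26b = [
    {"model": "gemma4:26b", "dt": 5.6707, "eval_count": 500},
    {"model": "gemma4:26b", "dt": 5.3004, "eval_count": 510},
    {"model": "gemma4:26b", "dt": 5.1035, "eval_count": 492},
    {"model": "gemma4:26b", "dt": 5.2574, "eval_count": 506},
    {"model": "gemma4:26b", "dt": 4.5780, "eval_count": 436},
]
n, avg_s, avg_tok, tok_per_s = timing_summary(records_26b)["gemma4:26b"]
print(f"n={n} avg_s={avg_s:.1f} avg_tokens={avg_tok:.0f} tok/s={tok_per_s:.1f}")
```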
## JSON parse rate
- `gemma4:26b`: 5/5 parsed cleanly
- `gemma4:31b-it-q4_K_M`: 5/5 parsed cleanly
## gemma4:26b -- puzzle 1
**Theme seed:** An exploration of common words that can be paired with 'BIRD' or 'DRAGON' or relate to types of 'FLY'.
| Diff | Category | Words |
|---|---|---|
| yellow | Types of Flies | FRUIT, HOUSE, FIRE, BLOW |
| green | Words before 'FLY' | DRAGON, MAY, BUTTER, BOTTLE |
| blue | Types of Birds | LARK, SWALLOW, CRANE, SWIFT |
| purple | Words that follow 'DRAGON' | FRUIT, FLY, SLAYER, FLAME |
**Claimed traps (4):**
- `FRUIT` (actually purple, traps yellow): Fruit fly is a very common type of fly, making it a primary red herring for the yellow group.
- `FIRE` (actually yellow, traps purple): Fire dragon or dragon flame connection could lead someone to look for a dragon-themed group.
- `DRAGON` (actually green, traps blue): Dragon is a mythical creature, and users might look for a 'mythical creatures' group including Crane or Swallow (which are real birds).
- `FLY` (actually purple, traps green): Dragonfly is a common word, tempting the player to put 'FLY' in the 'Words before FLY' group.
**Grade: FAIL.** `FRUIT` appears in **two groups** (yellow Types of Flies + purple Words that follow DRAGON) — that's only 15 distinct tiles, not the required 16. Categories are also redundant: yellow ("Types of Flies") and green ("Words before FLY") are both fly-themed. A human curator would never ship this.
_dt=5.7s, tokens=500_
## gemma4:26b -- puzzle 2
**Theme seed:** Homophones and linguistic overlaps between nature and everyday objects.
| Diff | Category | Words |
|---|---|---|
| yellow | Types of Trees | BIRCH, CEDAR, MAPLE, PINE |
| green | Things that can be 'Dropped' | BEAT, HINT, LINE, SUBJECT |
| blue | Units of Measurement | FOOT, GRAM, SECOND, STONE |
| purple | Homophones of Body Parts | EYE, HARE, MUSSEL, QUEUE |
**Claimed traps (4):**
- `PINE` (actually yellow, traps green): To 'pine' for someone is a feeling, but 'pine' could be mistaken for something you can 'drop' (like a pine needle).
- `FOOT` (actually blue, traps yellow): A 'foot' is a unit of measurement, but it is also a body part, potentially leading to the 'Body Part Homophones' group.
- `BEAT` (actually green, traps blue): A 'beat' is a rhythm, but it could be mistaken for a unit of time/measurement in a musical context.
- `EYE` (actually purple, traps blue): The word 'eye' is a body part, which might tempt players looking for a body part category, but it is actually a homophone (I) for the purple group.
**Grade: PARTIAL FAIL.** Yellow/green/blue all tight. Purple "Homophones of Body Parts" is half-broken: `HARE`→hair ✓, `MUSSEL`→muscle ✓, but `EYE` *is* a body part (not a homophone of one) and `QUEUE`→cue is not a body-part homophone. 2/4 wrong on the trickiest group is a curator-rejection. Trap reasoning is weak (claims `EYE` traps blue Units — body parts aren't units).
_dt=5.3s, tokens=510_
## gemma4:26b -- puzzle 3
**Theme seed:** The concept of 'Pressure' and things that can be measured or applied.
| Diff | Category | Words |
|---|---|---|
| yellow | Units of Pressure | BAR, PASCAL, PSI, TORR |
| green | Words after 'BLOOD' | PRESSURE, TYPE, ORANGE, DONOR |
| blue | Things that can be 'Broken' | HEART, RECORD, LAW, PROMISE |
| purple | Words that are also Verbs for 'Applying Force' | PRESS, STAMP, PUNCH, DRIVE |
**Claimed traps (4):**
- `PRESSURE` (actually green, traps yellow): Pressure is the concept being measured by the units in the yellow group.
- `BAR` (actually yellow, traps blue): A 'bar' is something that can be 'broken' (like a chocolate bar) or refers to a legal setting.
- `PRESS` (actually purple, traps green): The 'Press' is a common term in news/media, and 'Press' can be related to 'Blood' in a medical context (blood pressure).
- `RECORD` (actually blue, traps purple): A 'record' is something you can 'stamp' or 'drive' (as in a driving record).
**Grade: BORDERLINE PASS.** Categories tight, 16 distinct words, ✓. `BAR` (yellow → blue trap) is a real overlap (chocolate bar, broken bar). `PRESS` claimed to trap green ("Words after BLOOD") — but blood **pressure** is the compound, not blood **press**, so the claimed trap doesn't actually hold. `RECORD`→stamp/drive is a stretch. 1/4 of Gemma's self-graded traps is broken; remainder OK.
_dt=5.1s, tokens=492_
## gemma4:26b -- puzzle 4
**Theme seed:** A collection of words related to things that can be 'broken' or 'cracked', mixed with other categories like cooking and music.
| Diff | Category | Words |
|---|---|---|
| yellow | Kinds of Eggs | CHICKEN, QUAIL, GOOSE, DUCK |
| green | Things You Can Crack | NUT, CODE, WHIP, SMILE |
| blue | Musical Terms | NOTE, CHORD, CLEF, BEAT |
| purple | Words Before 'Egg' | NEST, EASTER, BAD, NESTING |
**Claimed traps (4):**
- `DUCK` (actually yellow, traps green): DUCK is an action related to avoiding a whip or a blow, and can be paired with 'crack' in some linguistic contexts.
- `CHICKEN` (actually yellow, traps purple): CHICKEN is an egg-related word that could tempt someone looking for '___ Egg' phrases (like Chicken Egg).
- `BEAT` (actually blue, traps green): BEAT is something you can do to a whip or a person, fitting the 'Things you can crack/strike' theme.
- `NEST` (actually purple, traps yellow): NEST is intrinsically linked to the birds in the yellow group (CHICKEN, GOOSE, etc.).
**Grade: FAIL.** Purple has `NESTING` as "Words Before EGG" — *nesting egg* isn't a phrase (*nest egg* is). Worse, `NEST` and `NESTING` are near-duplicate tiles. Yellow "Kinds of Eggs" and purple "Words Before EGG" are also redundant (chicken egg vs nest egg — both compound forms). Curator reject.
_dt=5.3s, tokens=506_
## gemma4:26b -- puzzle 5
**Theme seed:** A collection of words centered around various types of 'records' and 'tracks'.
| Diff | Category | Words |
|---|---|---|
| yellow | Types of Records | LP, EP, Single, Album |
| green | Things that have Tracks | Train, Athlete, Song, Mud |
| blue | Words for 'Evidence' | Clue, Trace, Mark, Proof |
| purple | Words that follow 'Side' | Effect, Kick, Split, Walk |
**Claimed traps (4):**
- `Single` (actually yellow, traps blue): A 'single' can be a piece of evidence or a solitary item.
- `Song` (actually green, traps yellow): A song is often released as an LP, EP, or Single.
- `Trace` (actually blue, traps green): A trace can be a physical track left behind in mud.
- `Split` (actually purple, traps green): A 'split' can refer to a track or division in a race.
**Grade: PASS.** Cleanest 26b output. 16 distinct words; categories tight; the word "track" carries **three different meanings** across green (railroad/running/album), with `SONG`→Records and `TRACE`→Tracks as real overlaps. That's the multi-meaning anchor pattern Connections actually uses. `SINGLE` and `SPLIT` traps are weak, but the puzzle structure is publication-shaped.
_dt=4.6s, tokens=436_
## gemma4:31b-it-q4_K_M -- puzzle 1
**Theme seed:** Words related to circles, timing, and subtle wordplay with 'Hands'
| Diff | Category | Words |
|---|---|---|
| yellow | Circular Objects | RING, COIN, WHEEL, DISC |
| green | Units of Time | SECOND, MINUTE, HOUR, DAY |
| blue | Ways to Assist | HELP, AID, BACK, SUPPORT |
| purple | Things with Hands | CLOCK, WATCH, DECK, PLAYER |
**Claimed traps (4):**
- `SECOND` (actually green, traps blue): To 'second' a motion is to support it.
- `WATCH` (actually purple, traps green): A watch is a device used to measure the units of time (second, minute, hour).
- `CLOCK` (actually purple, traps green): Similar to watch, it is intrinsically linked to the time units group.
- `BACK` (actually blue, traps yellow): Could be mistaken as a part of a circular object or a physical position.
**Grade: BORDERLINE PASS.** `DECK` in "Things with Hands" is shaky — decks have cards, players have hands; the phrase is "all hands on deck" (sailors), not "deck has hands." Other 3 purple members fit. Strong saving grace: `SECOND`→assist (second a motion), `WATCH`→time, `CLOCK`→time are 3 genuinely strong traps. Categories otherwise tight.
_dt=18.9s, tokens=439_
## gemma4:31b-it-q4_K_M -- puzzle 2
**Theme seed:** A mix of architectural elements, types of scales, and linguistic wordplay
| Diff | Category | Words |
|---|---|---|
| yellow | Parts of a Building | BEAM, COLUMN, EAVE, FOUNDATION |
| green | Types of Scales | LIBRA, MUSICAL, WEIGHING, TIMELINE |
| blue | Synonyms for Balance | EQUILIBRIUM, POISE, STEADY, STABILITY |
| purple | Words that start with a Greek Letter | ALPHABET, BETRAYAL, GAMUT, DELUXE |
**Claimed traps (3):**
- `LIBRA` (actually green, traps blue): Libra is the sign of the scales, which represents balance/equilibrium.
- `COLUMN` (actually yellow, traps green): A column can be a way to organize data on a scale or chart.
- `GAMUT` (actually purple, traps green): The 'gamut' refers to the full range or scale of something.
**Grade: FAIL.** Two broken categories. Green "Types of Scales" includes `LIBRA` (zodiac sign, not a scale type) and `TIMELINE` (a line, not a scale). Purple "Words that start with a Greek Letter" only holds for `ALPHABET` (ALPHA + BET): `BETRAYAL` starts with "BETR", not "BETA"; `GAMUT` starts with "GAM", missing the full "GAMMA"; and `DELUXE` starts with "DELU", not the full letter name "DELTA". 3 of 4 purple words don't actually start with the complete Greek-letter name.
_dt=16.6s, tokens=410_
## gemma4:31b-it-q4_K_M -- puzzle 3
**Theme seed:** Words related to vision, optics, and ocular puns
| Diff | Category | Words |
|---|---|---|
| yellow | Parts of the Eye | CORNEA, PUPIL, IRIS, RETINA |
| green | Types of Lenses | CONCAVE, CONVEX, CONTACT, BIFOCAL |
| blue | Words meaning 'To Witness' | OBSERVE, SPOT, NOTE, VIEW |
| purple | ___ EYE | BULLS, BIRD, FISH, PRIVATE |
**Claimed traps (3):**
- `PUPIL` (actually yellow, traps blue): A pupil is a student, and one might 'note' or 'observe' a pupil in a classroom setting.
- `PRIVATE` (actually purple, traps blue): Private can be a synonym for 'secret' or 'hidden', contrasting with the act of observing or spotting.
- `CONTACT` (actually green, traps blue): To 'contact' someone can be confused with 'spotting' or 'viewing' them in a social context.
**Grade: PASS.** All 4 categories tight: parts of eye / lens types / witness verbs / `___EYE` compounds (bull's, bird's, fish, private). `PUPIL`→student is a real, NYT-style multi-meaning trap. Only 3 traps claimed (rubric asked for 2+, ✓). Closest puzzle to publication quality across both models.
_dt=17.3s, tokens=422_
## gemma4:31b-it-q4_K_M -- puzzle 4
**Theme seed:** Common words that can also be types of birds or musical terms
| Diff | Category | Words |
|---|---|---|
| yellow | Things that can be 'Cast' | SPELL, VOTE, SHADOW, NET |
| green | Musical Dynamics | PIANO, FORTE, SHARP, FLAT |
| blue | Types of Birds | CRANE, SWALLOW, SWIFT, LOON |
| purple | Words that mean 'to move quickly' | DART, BOLT, RUSH, FLY |
**Claimed traps (4):**
- `SWIFT` (actually blue, traps purple): Swift is a synonym for moving quickly, fitting the 'move quickly' group.
- `FLY` (actually purple, traps blue): Fly is a common bird-related term (or a type of insect), making it look like it belongs with Crane and Swallow.
- `SWALLOW` (actually blue, traps purple): While a bird, if the user interprets 'move quickly' as 'gulp down', it might confuse them, though primarily it serves as a bird lure.
- `SHARP` (actually green, traps purple): A 'sharp' turn or movement can be associated with quickness/darting.
**Grade: BORDERLINE PASS.** Green should be "Musical Terms" not "Dynamics": `PIANO` and `FORTE` are dynamics (loudness), but `SHARP` and `FLAT` are accidentals (pitch modifiers). Pedantic but a real category-label miss. Saving graces: `SWIFT` (a bird whose name also means fast) and `FLY` (a move-quickly word that also reads as bird-related) are exactly the kind of multi-meaning anchors a real Connections puzzle would deploy.
_dt=19.9s, tokens=487_
## gemma4:31b-it-q4_K_M -- puzzle 5
**Theme seed:** Words related to types of 'scales' and musical terminology
| Diff | Category | Words |
|---|---|---|
| yellow | Musical Scales | MAJOR, MINOR, CHROMATIC, PENTATONIC |
| green | Things that Weigh | BALANCE, SPRING, DIGITAL, TRIPLE-BEAM |
| blue | Ways to Climb | ASCEND, MOUNT, SCALE, SURMOUNT |
| purple | ___ NOTE | FOOT, BANK, KEY, HIGH |
**Claimed traps (4):**
- `SCALE` (actually blue, traps yellow): It is the root word for musical scales, leading the player to look for other scale-related terms.
- `KEY` (actually purple, traps yellow): A 'key' is fundamentally linked to musical scales (e.g., the Key of C Major).
- `HIGH` (actually purple, traps blue): High is an adjective often associated with climbing or ascending.
- `BALANCE` (actually green, traps blue): Balance can be seen as a state of being when climbing or mountaineering.
**Grade: PASS.** The whole puzzle is built around `SCALE` carrying three meanings: musical scale (yellow), weighing scale (green's theme — though Gemma mislabels it "Things that Weigh" instead of "Types of Scales"), and "to climb" (blue, where SCALE-the-word lives). That is exactly the central-anchor pattern a real NYT Connections puzzle uses. `KEY`, `HIGH`, `BALANCE` traps all genuinely overlap. Categories slightly mislabeled but structure is publication-quality.
_dt=18.2s, tokens=453_
---
## Aggregate
| Model | Pass | Borderline | Fail | Avg s | Avg tok/s |
|---|---|---|---|---|---|
| `gemma4:26b` | 1 (#5) | 1 (#3) + 1 partial (#2) | 2 (#1, #4) | 5.2 | 94.3 |
| `gemma4:31b-it-q4_K_M` | 2 (#3, #5) | 2 (#1, #4) | 1 (#2) | 18.2 | 24.3 |
**31b is materially more reliable**: 2 clean passes vs 26b's 1, and only 1 hard fail vs 26b's 2 hard fails plus a partial fail. 31b is 3.5× slower per generation (18.2 s vs 5.2 s), but for a once-per-day puzzle that latency is irrelevant. 26b is fast enough for interactive use but produced a hard or partial fail on 3 of its 5 puzzles.
### Failure modes (in order of how often they recur)
1. **Structural violations** — duplicate or near-duplicate words on the 16-tile board, or a word listed in two groups. (#1-26b: `FRUIT` × 2; #4-26b: `NEST`/`NESTING`.) Catastrophic — a real Connections board has 16 *distinct* tiles. **Trivially detectable** with a deterministic post-filter.
2. **Broken category logic** — words placed in a category that don't actually fit. (#2-26b: `EYE`/`QUEUE` aren't body-part homophones; #4-26b: `NESTING` isn't a "Word before EGG"; #2-31b: `LIBRA`/`TIMELINE` aren't scales, `DELUXE` doesn't start with the full Greek letter "DELTA"; #1-31b: `DECK` doesn't have hands.) **Hard to detect deterministically** — needs a critique/judging pass.
3. **Redundant categories** — two groups themed on the same concept (#1-26b: yellow + green both fly-themed; #4-26b: yellow + purple both egg-themed). Detectable with a category-similarity check.
4. **Weak/circular trap reasoning** — Gemma's claimed "intended_traps" sometimes don't actually hold. (#3-26b: `PRESS` doesn't fit "Words after BLOOD" — the compound is *blood pressure*, not *blood press*.) Means **Gemma cannot reliably grade its own puzzles** — independent judging required.
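Failure mode 1 is the one that needs no model at all. A minimal sketch of such a deterministic post-filter, assuming the puzzle JSON shape shown in the results file (`groups`, each with `category` and `words`); `structural_problems` is a hypothetical name, and only exact duplicates are covered here (near-duplicates such as `NEST`/`NESTING` would need an additional heuristic):

```python
def structural_problems(puzzle: dict) -> list[str]:
    """Return human-readable structural violations; an empty list means pass."""
    problems = []
    groups = puzzle.get("groups", [])
    if len(groups) != 4:
        problems.append(f"expected 4 groups, found {len(groups)}")

    seen = {}  # normalized tile -> category where it first appeared
    for group in groups:
        category = group.get("category", "?")
        words = group.get("words", [])
        if len(words) != 4:
            problems.append(f"group '{category}' has {len(words)} words, expected 4")
        for word in words:
            tile = word.strip().upper()  # compare tiles case-insensitively
            if tile in seen:
                problems.append(
                    f"{tile} appears in both '{seen[tile]}' and '{category}'"
                )
            else:
                seen[tile] = category

    if len(seen) != 16:
        problems.append(f"board has {len(seen)} distinct tiles, expected 16")
    return problems
```

Run against 26b puzzle 1, this reports the repeated `FRUIT` and the 15-tile board. The other three failure modes are semantic and are not addressed by this check.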
### Successes (when Gemma gets it right, what it does right)
- **Multi-meaning anchor words** — `SCALE` (3 meanings, #5-31b), `SWIFT`/`FLY` (bird + fast, #4-31b), `PUPIL` (eye + student, #3-31b), `TRACK` (3 meanings, #5-26b). When Gemma builds a puzzle around an anchor, it produces real Connections-grade overlap.
- **Compound-word categories** — "`___ EYE`" (#3-31b), "Side `___`" (#5-26b), "Words before EGG" (#4-26b — when Gemma doesn't poison it). These are the easiest pattern to get right.
- **Tight short labels** when Gemma sticks to well-known domains (parts of eye, types of lenses, types of records).
### Implication for design
**Generation is viable, but not unaided.** The shape of the data engine:
```
generate (gemma4:31b)
→ deterministic filter [check 16 distinct tiles, no dup words, all words appear in categories]
→ category-similarity check [reject puzzles with redundant themes]
→ critique pass [either gemma4:31b second pass, or qwen3-coder:30b as judge]
→ reject + regenerate if any fail; accept once filtered
→ cache as the day's puzzle
```
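The flow above can be read as a loop that runs checks from cheapest to most expensive and regenerates on the first failure. A rough sketch under that reading: `first_accepted`, `generate`, and the individual checks are hypothetical stand-ins for stages that do not exist yet, and caching the accepted puzzle is left out.

```python
from typing import Callable, Iterable, Optional, Tuple

Puzzle = dict
Check = Callable[[Puzzle], bool]


def first_accepted(
    generate: Callable[[], Optional[Puzzle]],  # hypothetical: one generation call
    checks: Iterable[Check],                   # ordered cheap -> expensive
    max_generations: int = 6,
) -> Tuple[Optional[Puzzle], int]:
    """Regenerate until a puzzle passes every check; return (puzzle, attempts)."""
    checks = list(checks)
    for attempt in range(1, max_generations + 1):
        puzzle = generate()
        # all() short-circuits, so a costly critique check only runs on
        # puzzles that already passed the cheaper checks before it.
        if puzzle is not None and all(check(puzzle) for check in checks):
            return puzzle, attempt
    return None, max_generations
```

Ordering the checks this way keeps the model-based critique off puzzles that a deterministic filter would have rejected anyway.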
At ~18s/generation and a roughly 50% structural-pass rate, a daily puzzle costs an expected ~2 generations + 1 critique = maybe 1 minute of GPU time per day. Effectively free.
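The arithmetic behind that estimate, as a small sketch. It assumes generation attempts are independent (so the expected number of generations is 1 / pass rate) and that one critique pass costs about as much as one generation; the critique cost is an assumption, since it has not been measured. `expected_gpu_seconds` is a hypothetical helper.

```python
def expected_gpu_seconds(
    pass_rate: float,
    gen_seconds: float,
    critique_seconds: float,
    critiques: float = 1.0,
) -> float:
    """Expected GPU time to obtain one accepted puzzle."""
    if not 0.0 < pass_rate <= 1.0:
        raise ValueError("pass_rate must be in (0, 1]")
    # Mean of a geometric distribution: expected attempts until first success.
    expected_generations = 1.0 / pass_rate
    return expected_generations * gen_seconds + critiques * critique_seconds


# 50% pass rate, ~18 s per generation, critique assumed to cost ~18 s as well.
print(expected_gpu_seconds(0.5, 18.0, 18.0))  # 54.0
```

If the critique pass also rejects some structurally valid puzzles, the effective pass rate drops and the expected cost rises accordingly, which is another reason to measure the critique step before relying on this figure.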
**26b is unsuitable as the primary generator** — too many hard fails. It could plausibly be the *judging* model (cheaper, runs on every player guess) since judging is much easier than generating. But that decision is for the brainstorm.
### Risks not yet checked
- **Diversity over time.** All 10 puzzles produced here come from a single seed-less batch. If Gemma keeps reaching for the same themes (we saw "scales" twice on 31b alone), a year-long daily stream might get repetitive. Test this with seeded prompts before committing.
- **Connections-vs-Gemma blind anchor not run.** I deferred this — the structural failures in Gemma's output (duplicate words, broken categories) are so obviously curator-rejection-tier that no human-curated puzzle would have them, so the within-Gemma comparison was decisive on its own. Still, before final design, eyeball one Gemma-pass puzzle next to a real NYT puzzle and check whether it actually feels equivalent.
- **Two-pass critique not validated.** The proposal above assumes a critique pass would catch Gemma's category mistakes. That assumption has not been tested. The next experiment is "feed Gemma's broken puzzles back to Gemma (or to a different model) and see if it flags the structural issues."
@@ -0,0 +1,514 @@
# Gemma 4 Semantic Bakeoff -- 2026-04-27 22:51
## Setup
- Local Ollama on the test host (RTX 3090 Ti, 24 GB VRAM)
- Models: `gemma4:latest`, `gemma4:26b`, `gemma4:31b-it-q4_K_M`
- Temperature 0.2 (raised +0.1 per retry on JSON parse fail, max 3 attempts)
- `think=false`, `num_ctx=4096`, `num_predict=512`, no `format=json` (per `gemma4-research/GOTCHAS.md`)
- 38 test cases: 16 JUDGE, 10 CREATE, 12 CREATIVE_ACCEPT
- Ground truth hand-labeled inline in `scripts/gemma-semantic-bakeoff.py`
## Auto-scored summary
| Model | JUDGE pass | CREATIVE_ACCEPT pass | parse fails | avg s |
|---|---|---|---|---|
| `gemma4:latest` | 14/16 | 10/10 | 0 | 0.7 |
| `gemma4:26b` | 15/16 | 10/10 | 0 | 0.8 |
| `gemma4:31b-it-q4_K_M` | 16/16 | 10/10 | 1 | 2.3 |
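For reference, the pass counts in this table map onto percentages as in the sketch below. How the bakeoff script scores a verdict is not shown in this report, so the exact-match comparison in `score_case` is an assumption, and both function names are hypothetical.

```python
def score_case(verdict: str, ground_truth: str) -> str:
    """PASS when the model's yes/no verdict matches the hand label (assumed rule)."""
    same = verdict.strip().lower() == ground_truth.strip().lower()
    return "PASS" if same else "FAIL"


def pass_rate(scores):
    """Return (passed, total, percent) for a list of PASS/FAIL strings."""
    passed = sum(1 for s in scores if s == "PASS")
    return passed, len(scores), 100.0 * passed / len(scores)


# 14 of 16 JUDGE cases passing corresponds to 87.5%.
print(pass_rate(["PASS"] * 14 + ["FAIL"] * 2))
```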
## JUDGE
### judge-y1
- Category: `Types of trees`
- Words: `OAK`, `MAPLE`, `BIRCH`, `PINE`
- Ground truth: **yes**
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | yes | All four words are common names of tree species. | PASS | 0.7s |
| `gemma4:26b` | yes | All four words are specific, well-known types of trees. | PASS | 0.8s |
| `gemma4:31b-it-q4_K_M` | yes | All four words are common types of trees. | PASS | 2.1s |
### judge-y2
- Category: `Greek letters`
- Words: `ALPHA`, `BETA`, `GAMMA`, `DELTA`
- Ground truth: **yes**
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | yes | All four words are names of Greek letters. | PASS | 0.6s |
| `gemma4:26b` | yes | All four words are standard names of letters in the Greek alphabet. | PASS | 0.7s |
| `gemma4:31b-it-q4_K_M` | yes | All four words are the first four letters of the Greek alphabet. | PASS | 2.1s |
### judge-y3
- Category: `Days of the week`
- Words: `MONDAY`, `FRIDAY`, `SUNDAY`, `WEDNESDAY`
- Ground truth: **yes**
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | no | The words are not all representative of the days of the week. (misfit: MONDAY, FRIDAY, SUNDAY, WEDNESDAY) | FAIL | 0.8s |
| `gemma4:26b` | yes | All four words are specific names of days in the week. | PASS | 0.7s |
| `gemma4:31b-it-q4_K_M` | yes | All four words are standard days of the week. | PASS | 1.9s |
### judge-y4
- Category: `Synonyms for 'happy'`
- Words: `JOYFUL`, `GLAD`, `CHEERFUL`, `ELATED`
- Ground truth: **yes**
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | yes | All four words are direct and common synonyms for 'happy'. | PASS | 0.6s |
| `gemma4:26b` | yes | All four words are direct synonyms representing different degrees of happiness. | PASS | 0.7s |
| `gemma4:31b-it-q4_K_M` | yes | All four words are direct synonyms for the state of being happy. | PASS | 2.1s |
### judge-y5
- Category: `___ HOUSE (compound words)`
- Words: `GREEN`, `ICE`, `FIRE`, `COURT`
- Ground truth: **yes**
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | yes | All four words can precede 'HOUSE' to form common compound words. | PASS | 0.7s |
| `gemma4:26b` | no | While Green, Ice, and Fire House are common, 'Court House' is typically written as two words (Courthouse) or refers to a | FAIL | 0.9s |
| `gemma4:31b-it-q4_K_M` | yes | All four words form common compound words or phrases with 'house'. | PASS | 2.1s |
### judge-y6
- Category: `Words that follow COLD`
- Words: `SHOULDER`, `FRONT`, `SNAP`, `TURKEY`
- Ground truth: **yes**
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | no | Only SHOULDER, FRONT, and SNAP fit common phrases with COLD, while TURKEY does not. (misfit: TURKEY) | FAIL | 0.8s |
| `gemma4:26b` | yes | All four words form common phrases when preceded by 'cold': cold shoulder, cold front, cold snap, and cold turkey. | PASS | 0.9s |
| `gemma4:31b-it-q4_K_M` | yes | All four words form common compound words or phrases when preceded by 'cold'. | PASS | 2.1s |
### judge-y7
- Category: `Verbs meaning 'to move quickly'`
- Words: `DART`, `BOLT`, `RUSH`, `FLY`
- Ground truth: **yes**
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | yes | All four words are strong synonyms for moving rapidly. | PASS | 0.6s |
| `gemma4:26b` | yes | All four words are synonyms for moving at high speed. | PASS | 0.7s |
| `gemma4:31b-it-q4_K_M` | yes | All four words are common synonyms for moving rapidly. | PASS | 2.0s |
### judge-y8
- Category: `Synonyms for 'idea'`
- Words: `NOTION`, `CONCEPT`, `THOUGHT`, `INKLING`
- Ground truth: **yes**
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | yes | All four words are common synonyms for a mental concept or idea. | PASS | 0.7s |
| `gemma4:26b` | yes | All four words are direct synonyms for a mental concept, notion, or vague understanding. | PASS | 0.8s |
| `gemma4:31b-it-q4_K_M` | yes | All four words are common synonyms for a mental representation or a suggestion of an idea. | PASS | 2.2s |
### judge-n1
- Category: `Types of trees`
- Words: `OAK`, `MAPLE`, `BIRCH`, `CARROT`
- Ground truth: **no** (misfit: CARROT)
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | no | Carrot is a root vegetable, not a type of tree. (misfit: CARROT) | PASS | 0.7s |
| `gemma4:26b` | no | Carrot is a root vegetable, not a type of tree. (misfit: CARROT) | PASS | 0.7s |
| `gemma4:31b-it-q4_K_M` | no | Carrot is a root vegetable, not a tree. (misfit: CARROT) | PASS | 2.1s |
### judge-n2
- Category: `Greek letters`
- Words: `ALPHA`, `BETA`, `GAMMA`, `CYRILLIC`
- Ground truth: **no** (misfit: CYRILLIC)
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | no | Cyrillic is a script, not a Greek letter. (misfit: CYRILLIC) | PASS | 0.7s |
| `gemma4:26b` | no | CYRILLIC refers to a script, not a specific Greek letter. (misfit: CYRILLIC) | PASS | 0.8s |
| `gemma4:31b-it-q4_K_M` | no | Cyrillic is a writing system, not a Greek letter. (misfit: CYRILLIC) | PASS | 2.2s |
### judge-n3
- Category: `Synonyms for 'happy'`
- Words: `JOYFUL`, `GLAD`, `SAD`, `ELATED`
- Ground truth: **no** (misfit: SAD)
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | no | The word 'SAD' is an antonym, not a synonym for 'happy'. (misfit: SAD) | PASS | 0.7s |
| `gemma4:26b` | no | The word 'SAD' is an antonym rather than a synonym. (misfit: SAD) | PASS | 0.8s |
| `gemma4:31b-it-q4_K_M` | no | SAD is an antonym of happy, not a synonym. (misfit: SAD) | PASS | 2.2s |
### judge-n4
- Category: `Days of the week`
- Words: `MONDAY`, `JANUARY`, `SUNDAY`, `WEDNESDAY`
- Ground truth: **no** (misfit: JANUARY)
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | no | January is a month, not a day of the week. (misfit: JANUARY) | PASS | 0.7s |
| `gemma4:26b` | no | JANUARY is a month, not a day of the week. (misfit: JANUARY) | PASS | 0.7s |
| `gemma4:31b-it-q4_K_M` | no | JANUARY is a month, not a day of the week. (misfit: JANUARY) | PASS | 2.1s |
### judge-n5
- Category: `Body parts`
- Words: `ARM`, `LEG`, `EYE`, `NIGHT`
- Ground truth: **no** (misfit: NIGHT)
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | no | NIGHT is not a body part. (misfit: NIGHT) | PASS | 0.6s |
| `gemma4:26b` | no | The word 'NIGHT' is not a body part. (misfit: NIGHT) | PASS | 0.8s |
| `gemma4:31b-it-q4_K_M` | no | NIGHT is a time of day, not a body part. (misfit: NIGHT) | PASS | 2.1s |
### judge-n6
- Category: `Types of birds`
- Words: `CRANE`, `SWALLOW`, `BAT`, `MOSQUITO`
- Ground truth: **no** (misfit: BAT, MOSQUITO)
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | no | Bats and mosquitoes are not types of birds. (misfit: BAT, MOSQUITO) | PASS | 0.7s |
| `gemma4:26b` | no | BAT and MOSQUITO are mammals and insects, respectively, not birds. (misfit: BAT, MOSQUITO) | PASS | 0.8s |
| `gemma4:31b-it-q4_K_M` | no | Bat and mosquito are not birds. (misfit: BAT, MOSQUITO) | PASS | 2.1s |
### judge-n7
- Category: `Things that are red`
- Words: `APPLE`, `BLUE`, `ROSE`, `GRASS`
- Ground truth: **no** (misfit: BLUE, GRASS)
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | no | Blue is a color, not necessarily a thing that is red. (misfit: BLUE) | PASS | 0.7s |
| `gemma4:26b` | no | BLUE and GRASS are not typically red. (misfit: BLUE, GRASS) | PASS | 0.7s |
| `gemma4:31b-it-q4_K_M` | no | Blue and grass are not red. (misfit: BLUE, GRASS) | PASS | 2.0s |
### judge-n8
- Category: `Words that follow COLD`
- Words: `SHOULDER`, `FRONT`, `PIZZA`, `MOUNTAIN`
- Ground truth: **no** (misfit: PIZZA, MOUNTAIN)
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | no | Only SHOULDER, FRONT, and PIZZA fit common phrases with COLD, but MOUNTAIN does not. (misfit: MOUNTAIN) | PASS | 0.8s |
| `gemma4:26b` | no | While shoulder, front, and mountain can follow 'cold', 'cold pizza' is a description of temperature rather than a common | PASS | 0.9s |
| `gemma4:31b-it-q4_K_M` | no | Cold pizza and cold mountain are not established compound words or common idioms like cold shoulder and cold front. (mis | PASS | 2.7s |
## CREATE
### create-e1
- Category: `Types of trees`
- Quality bar: Four valid tree species; e.g. OAK, MAPLE, BIRCH, PINE.
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | Oak, Maple, Pine, Birch | These are all common types of trees. | MANUAL | 0.6s |
| `gemma4:26b` | Oak, Maple, Birch, Cedar | These are all common species of trees. | MANUAL | 0.7s |
| `gemma4:31b-it-q4_K_M` | Oak, Maple, Pine, Birch | All four are common types of trees. | MANUAL | 1.9s |
### create-e2
- Category: `Greek letters`
- Quality bar: Four genuine Greek letters.
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | Alpha, Beta, Gamma, Delta | These are the first four letters of the Greek alphabet. | MANUAL | 0.6s |
| `gemma4:26b` | ALPHA, BETA, GAMMA, DELTA | These are the first four letters of the Greek alphabet. | MANUAL | 0.7s |
| `gemma4:31b-it-q4_K_M` | Alpha, Beta, Gamma, Delta | These are the first four letters of the Greek alphabet. | MANUAL | 1.9s |
### create-e3
- Category: `Synonyms for 'angry'`
- Quality bar: Four words that all genuinely mean angry/furious.
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | furious, irate, enraged, mad | All four words are direct synonyms for feeling intense anger. | MANUAL | 0.6s |
| `gemma4:26b` | Irate, Livid, Furious, Incensed | All four words are direct synonyms for being extremely angry. | MANUAL | 0.7s |
| `gemma4:31b-it-q4_K_M` | Furious, Irate, Livid, Enraged | All four words are strong synonyms for being extremely angry. | MANUAL | 2.1s |
### create-e4
- Category: `Days of the week`
- Quality bar: Four of the seven weekday names, no months or other words.
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | Monday, Tuesday, Wednesday, Thursday | These are four consecutive days of the week. | MANUAL | 0.6s |
| `gemma4:26b` | Monday, Tuesday, Wednesday, Thursday | These are all consecutive days of the week. | MANUAL | 0.7s |
| `gemma4:31b-it-q4_K_M` | Monday, Wednesday, Friday, Sunday | They are all days of the week. | MANUAL | 1.8s |
### create-m1
- Category: `___ STORM (compound words ending in STORM)`
- Quality bar: Four words that each form a real compound or fixed phrase with STORM (e.g. SAND, BRAIN, THUNDER, SNOW).
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | BLIZZARD, HEAT, STORM, SNOW | These words can all precede 'STORM' to form a type of storm. | MANUAL | 0.7s |
| `gemma4:26b` | BRAIN, DUST, FIRE, SNOW | Each word forms a common compound word when followed by 'STORM'. | MANUAL | 0.7s |
| `gemma4:31b-it-q4_K_M` | BRAIN, FIRE, THUNDER, SNOW | All four words form common compound words when followed by STORM. | MANUAL | 2.1s |
### create-m2
- Category: `Synonyms for 'small'`
- Quality bar: Four words that all genuinely mean small.
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | tiny, petite, minuscule, diminutive | All four words are direct synonyms for 'small' in size. | MANUAL | 0.7s |
| `gemma4:26b` | tiny, microscopic, diminutive, minuscule | All four words are direct synonyms for something of very small size. | MANUAL | 0.7s |
| `gemma4:31b-it-q4_K_M` | tiny, petite, slight, miniature | All four words are direct synonyms for small in size. | MANUAL | 2.0s |
### create-m3
- Category: `Words that follow BLUE`
- Quality bar: Four words that each form a real compound with BLUE (e.g. BERRY, BIRD, PRINT, BELL, GRASS).
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | JEANS, BIRD, MOON, BLOOD | All words can follow 'BLUE' to form common phrases. | MANUAL | 0.7s |
| `gemma4:26b` | BIRD, JAY, PRINT, WHALE | Each word can be preceded by 'blue' to form a common term. | MANUAL | 0.8s |
| `gemma4:31b-it-q4_K_M` | BERRY, JAY, MOON, PRINT | All four words follow 'BLUE' to form common compound words or phrases. | MANUAL | 2.2s |
### create-m4
- Category: `Things that can be 'broken'`
- Quality bar: Four words that each form a real fixed phrase with 'broken' (heart, record, law, promise, etc.).
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | heart, promise, record, bone | All four items can be metaphorically or literally broken. | MANUAL | 0.6s |
| `gemma4:26b` | Promise, Record, Heart, Law | All four are common objects or concepts that can be broken. | MANUAL | 0.7s |
| `gemma4:31b-it-q4_K_M` | Record, Promise, Silence, Heart | All four are common abstract or physical things that can be broken. | MANUAL | 2.1s |
### create-h1
- Category: `Words that are homophones of body parts but spelled differently (e.g. HARE = hair, MUSSEL = muscle)`
- Quality bar: Four words that each sound like a body part but are spelled differently. Valid examples: HARE (hair), MUSSEL (muscle), HEAL (heel), SOUL (sole), AYE/EYE-spelled-otherwise. EYE and HEEL alone do NOT count -- those are the body parts themselves, not homophones of them.
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | SEA, SEE, HEAR, HERE | All four words are homophones of common body parts (sea/see for 'C' sound, hear/here for 'hear'). | MANUAL | 0.8s |
| `gemma4:26b` | MUSSEL, HARE, EYE, KNEAD | Each word is a homophone of a body part (muscle, hair, eye, knee) but spelled differently. | MANUAL | 0.9s |
| `gemma4:31b-it-q4_K_M` | _parse fail_ | `JSONDecodeError('Extra data: line 3 column 1 (char 133)')` | PARSE_FAIL | 6.7s |
### create-h2
- Category: `Words that contain a body part as a substring (e.g. HEARTBEAT contains HEART)`
- Quality bar: Four words that each contain a body part anywhere inside them. Valid examples: HEARTH (HEART), CHESTNUT (CHEST), EARTH (EAR), HEADACHE (HEAD).
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | KNEEHIGH, EARDRUM, STOMACHACHE, EYELASH | Each word contains a body part name as a substring. | MANUAL | 0.7s |
| `gemma4:26b` | HANDSOME, FOOTPRINT, EARNEST, ARMORY | Each word contains a body part (hand, foot, ear, arm) as a substring. | MANUAL | 0.8s |
| `gemma4:31b-it-q4_K_M` | HEARTBEAT, HANDSOME, FOOTPRINT, ARMCHAIR | Each word contains a body part as a substring. | MANUAL | 2.2s |
## CREATIVE_ACCEPT
### ca-y1
- Words: `SCALE`, `MOUNT`, `ASCEND`, `CLIMB`
- Player's category: `"Verbs for going up"`
- Ground truth: **yes**
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | yes | All four words can be used as verbs meaning to go up. | PASS | 0.6s |
| `gemma4:26b` | yes | All four words are synonyms or closely related verbs describing the act of rising or ascending. | PASS | 0.7s |
| `gemma4:31b-it-q4_K_M` | yes | All four words are synonyms for the action of moving upward. | PASS | 2.1s |
### ca-y2
- Words: `APPLE`, `ORANGE`, `KIWI`, `BLACKBERRY`
- Player's category: `"Fruits"`
- Ground truth: **yes**
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | yes | All four listed items are types of fruit. | PASS | 0.6s |
| `gemma4:26b` | yes | All four words are common types of fruit. | PASS | 0.7s |
| `gemma4:31b-it-q4_K_M` | yes | All four words are common types of fruit. | PASS | 1.8s |
### ca-y3
- Words: `WHIP`, `NUT`, `CODE`, `SMILE`
- Player's category: `"Things you can crack"`
- Ground truth: **yes**
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | yes | You can crack a nut, a code, a smile, or a whip (like a whip crack). | PASS | 0.7s |
| `gemma4:26b` | yes | You can crack a whip, crack a nut, crack a code, and crack a smile. | PASS | 0.8s |
| `gemma4:31b-it-q4_K_M` | yes | All four words are common objects or concepts that can be 'cracked'. (weak: WHIP) | PASS | 2.2s |
### ca-y4
- Words: `BAT`, `BALL`, `GLOVE`, `MITT`
- Player's category: `"Baseball equipment"`
- Ground truth: **yes**
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | yes | All four items are commonly associated with baseball. | PASS | 0.6s |
| `gemma4:26b` | yes | All four words are standard pieces of equipment used in the game of baseball. | PASS | 0.7s |
| `gemma4:31b-it-q4_K_M` | yes | All four items are standard pieces of equipment used in baseball. | PASS | 2.0s |
### ca-y5
- Words: `MARS`, `VENUS`, `MERCURY`, `JUPITER`
- Player's category: `"Roman gods"`
- Ground truth: **yes**
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | yes | All four are names of Roman deities associated with planets. | PASS | 0.6s |
| `gemma4:26b` | yes | All four words are names of prominent deities in Roman mythology. | PASS | 0.7s |
| `gemma4:31b-it-q4_K_M` | yes | All four words are names of Roman deities as well as planets. | PASS | 2.0s |
### ca-n1
- Words: `OAK`, `MAPLE`, `BIRCH`, `PINE`
- Player's category: `"Furniture brands"`
- Ground truth: **no**
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | no | These words are types of trees, not furniture brands. (weak: OAK, MAPLE, BIRCH, PINE) | PASS | 0.7s |
| `gemma4:26b` | no | These words represent types of trees rather than furniture brands. (weak: OAK, MAPLE, BIRCH, PINE) | PASS | 0.8s |
| `gemma4:31b-it-q4_K_M` | no | These are types of trees/woods, not brands of furniture. (weak: OAK, MAPLE, BIRCH, PINE) | PASS | 2.7s |
### ca-n2
- Words: `ALPHA`, `BETA`, `GAMMA`, `DELTA`
- Player's category: `"Words meaning 'small'"`
- Ground truth: **no**
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | no | These words are Greek letters, not inherently meaning 'small'. (weak: ALPHA) | PASS | 0.6s |
| `gemma4:26b` | no | Alpha, beta, gamma, and delta are Greek letters used to denote various things, but none of them inherently mean 'small'. | PASS | 0.9s |
| `gemma4:31b-it-q4_K_M` | no | These are Greek letters used as designations, not synonyms for 'small'. (weak: ALPHA, BETA, GAMMA, DELTA) | PASS | 2.6s |
### ca-n3
- Words: `BAT`, `BALL`, `GLOVE`, `MITT`
- Player's category: `"Things worn on your hand"`
- Ground truth: **no**
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | no | BAT and BALL are not things worn on the hand. (weak: BAT, BALL) | PASS | 0.7s |
| `gemma4:26b` | no | Bat and ball are not items worn on the hand. (weak: BAT, BALL) | PASS | 0.8s |
| `gemma4:31b-it-q4_K_M` | no | A bat and a ball are not worn on the hand. (weak: BAT, BALL) | PASS | 2.3s |
### ca-n4
- Words: `MONDAY`, `FRIDAY`, `SUNDAY`, `WEDNESDAY`
- Player's category: `"Months of the year"`
- Ground truth: **no**
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | no | All the words listed are days of the week, not months of the year. (weak: MONDAY, FRIDAY, SUNDAY, WEDNESDAY) | PASS | 0.8s |
| `gemma4:26b` | no | None of the words are months of the year; they are all days of the week. (weak: MONDAY, FRIDAY, SUNDAY, WEDNESDAY) | PASS | 0.9s |
| `gemma4:31b-it-q4_K_M` | no | The words are days of the week, not months of the year. (weak: MONDAY, FRIDAY, SUNDAY, WEDNESDAY) | PASS | 2.6s |
### ca-n5
- Words: `WHIP`, `NUT`, `CODE`, `SMILE`
- Player's category: `"Things found in a kitchen"`
- Ground truth: **no**
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | no | Nut and smile are not typically considered kitchen items. (weak: NUT, SMILE) | PASS | 0.7s |
| `gemma4:26b` | no | While whip and nut can relate to food, code and smile have no plausible connection to kitchen items. (weak: CODE, SMILE) | PASS | 0.8s |
| `gemma4:31b-it-q4_K_M` | no | Code and smile are not items typically found in a kitchen. (weak: CODE, SMILE) | PASS | 2.4s |
### ca-b1
- Words: `APPLE`, `ORANGE`, `KIWI`, `BLACKBERRY`
- Player's category: `"Tech/phone brands"`
- Ground truth: **borderline** -- APPLE and BLACKBERRY clearly are tech brands; ORANGE is a EU/UK telecom carrier (defensible); KIWI is not a tech brand. Strict grader says no; lenient grader might accept ORANGE.
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | no | None of these words are recognized tech or phone brands. (weak: APPLE) | BORDERLINE | 0.6s |
| `gemma4:26b` | yes | All four words are names of well-known technology companies or mobile device brands. | BORDERLINE | 0.7s |
| `gemma4:31b-it-q4_K_M` | no | Orange and Kiwi are not recognized tech or phone brands. (weak: ORANGE, KIWI) | BORDERLINE | 2.3s |
### ca-b2
- Words: `SHARP`, `FLAT`, `NATURAL`, `KEY`
- Player's category: `"Real estate listing terms"`
- Ground truth: **borderline** -- FLAT (UK apartment) and KEY (keys) connect; SHARP (a sharp property) and NATURAL (natural light?) are weak. Strict no; creative yes.
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | no | Only SHARP and FLAT are common musical terms, while KEY and NATURAL are also musical terms, making 'Real estate listing | BORDERLINE | 0.8s |
| `gemma4:26b` | no | While 'flat' and 'key' can relate to real estate, 'sharp' and 'natural' have no plausible connection to real estate list | BORDERLINE | 0.9s |
| `gemma4:31b-it-q4_K_M` | no | Natural and Sharp are not standard terms used in real estate listings. (weak: NATURAL, SHARP) | BORDERLINE | 2.4s |
---
## Manual grading of CREATE outputs
Auto-grading isn't possible for CREATE — graded by hand here.
| Case | gemma4:latest (8B) | gemma4:26b | gemma4:31b |
|---|---|---|---|
| **e1** Trees | OAK MAPLE PINE BIRCH ✓ | OAK MAPLE BIRCH CEDAR ✓ | OAK MAPLE PINE BIRCH ✓ |
| **e2** Greek letters | ALPHA BETA GAMMA DELTA ✓ | ALPHA BETA GAMMA DELTA ✓ | ALPHA BETA GAMMA DELTA ✓ |
| **e3** "Angry" synonyms | furious irate enraged mad ✓ | irate livid furious incensed ✓ | furious irate livid enraged ✓ |
| **e4** Days of week | Mon Tue Wed Thu ✓ | Mon Tue Wed Thu ✓ | Mon Wed Fri Sun ✓ |
| **m1** \_\_\_ STORM | **FAIL** — listed STORM itself; HEAT-storm not standard; BLIZZARD-storm redundant | brain dust fire snow ✓ | brain fire thunder snow ✓ |
| **m2** "Small" synonyms | tiny petite minuscule diminutive ✓ | tiny microscopic diminutive minuscule ✓ | tiny petite slight miniature ✓ |
| **m3** Words after BLUE | jeans bird moon blood ✓ | bird jay print whale ✓ | berry jay moon print ✓ |
| **m4** "Broken" things | heart promise record bone ✓ | promise record heart law ✓ | record promise silence heart ✓ |
| **h1** Body-part homophones | **FAIL** — gave SEA/SEE/HEAR/HERE, missed the body-part requirement entirely | **PARTIAL** — MUSSEL/HARE/KNEAD ✓ but EYE is the body part, not a homophone of one | **PARSE_FAIL** (after 3 attempts) |
| **h2** Containing body parts | KNEEHIGH EARDRUM STOMACHACHE EYELASH ✓ | HANDSOME FOOTPRINT EARNEST ARMORY ✓ | HEARTBEAT HANDSOME FOOTPRINT ARMCHAIR ✓ |
**CREATE pass rate**: 8B = 8/10 (fails on m1 and h1; h2 passes) · 26b = 9/10 (1 partial on h1) · 31b = 9/10 (1 parse-fail on h1)
The h1 failure is consistent with the prior puzzle bakeoff (#2-26b had the same "homophones of body parts" failure: words that ARE body parts vs words that SOUND LIKE them). **No model produced a clean h1 answer:** 8B ignored the body-part requirement, 26b included a literal body part (EYE), and 31b never returned parseable JSON. Designs depending on deep wordplay categories like this need either prompt scaffolding (give a worked example) or human curation of category seeds.
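One cheap guard against the EYE-style h1 error is a post-check that rejects any answer word that is itself a body part. The sketch below is a minimal illustration under stated assumptions: `BODY_PARTS` is a hand-written, incomplete set and `literal_body_parts` is a hypothetical helper, neither of which exists in the bakeoff scripts.

```python
# Illustrative, incomplete list; a real check would need a curated vocabulary.
BODY_PARTS = {
    "EYE", "HEEL", "HAIR", "MUSCLE", "KNEE", "SOLE",
    "ARM", "LEG", "EAR", "HAND", "FOOT",
}


def literal_body_parts(words):
    """Return the answer words that are body parts themselves, not homophones of one."""
    return [w for w in words if w.strip().upper() in BODY_PARTS]


# The 26b answer from create-h1: EYE is flagged, the other three are not.
print(literal_body_parts(["MUSSEL", "HARE", "EYE", "KNEAD"]))  # ['EYE']
```

This only catches the "word IS a body part" confusion. It returns an empty list for the 8B answer (SEA, SEE, HEAR, HERE), which fails for a different reason, so it would complement rather than replace a judge pass or human review.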
## Aggregate
| Model | JUDGE | CREATE | CREATIVE_ACCEPT | Borderline accepted | Avg s | Notes |
|---|---|---|---|---|---|---|
| `gemma4:latest` (8B) | 14/16 | 8/10 | 10/10 | 0/2 (strict) | 0.7 | Fastest. Slight bias toward "no" on valid groupings (judge-y3 days-of-week miss, judge-y6 cold-turkey miss) |
| `gemma4:26b` | 15/16 | 9/10 | 10/10 | 1/2 (over-permissive: accepted KIWI as a tech brand) | 0.8 | Best speed/quality balance for live judging. Shows mild "be helpful, agree" bias |
| `gemma4:31b-it-q4_K_M` | 16/16 | 9/10 | 10/10 | 0/2 (strict) | 2.3 | Most accurate. Only candidate for once-per-day generation. 1 parse-fail on h1 (all 3 attempts failed) |
### What this proves
1. **The CREATIVE_ACCEPT axis works on every model tested.** This is the structural unlock that makes a Gemma-powered Connections derivative meaningfully different from the static NYT format: live, fair judging of player-invented groupings. 10/10 across 3 models on 5 valid + 5 invalid player categories — accept WHIP/NUT/CODE/SMILE for "things you can crack", reject OAK/MAPLE/BIRCH/PINE for "furniture brands", reject MONDAY/FRIDAY/SUNDAY/WEDNESDAY for "months." The model gets the distinction Connections cares about.
2. **Per-guess JUDGE economics are cheap.** 0.7-0.8s on the 3090 Ti for 8B/26b. Even 1000 player-guesses/day costs <15 GPU-minutes — effectively free.
3. **31b is the right generator** (validated earlier) and **the right offline critique judge.** 26b/8B are the right live judges.
4. **Hard wordplay categories (homophones-of-body-parts class) consistently fail** across all three models. Either avoid them or scaffold with examples.
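The "<15 GPU-minutes" figure in point 2 follows directly from the average latencies in the aggregate table. A small sketch of the arithmetic, assuming guesses are judged one at a time with no batching (the function name `gpu_minutes_per_day` is illustrative):

```python
def gpu_minutes_per_day(guesses_per_day: int, seconds_per_guess: float) -> float:
    """Total GPU time per day in minutes, assuming sequential judging."""
    return guesses_per_day * seconds_per_guess / 60.0


# Average seconds per call, taken from the aggregate table above.
for label, secs in [("8B", 0.7), ("26b", 0.8), ("31b", 2.3)]:
    print(f"{label}: {gpu_minutes_per_day(1000, secs):.1f} GPU-min/day")
# 8B: 11.7 GPU-min/day
# 26b: 13.3 GPU-min/day
# 31b: 38.3 GPU-min/day
```

At 1000 guesses per day both live-judge candidates stay under 15 GPU-minutes, while 31b at the same volume would need roughly three times as long.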
### What this doesn't prove (limits / next steps)
- **Borderline-case behavior is model-specific.** 26b said yes to KIWI as tech brand — that's a real false-positive risk for the CREATIVE_ACCEPT design. If the live game uses 26b, it will sometimes accept groupings a strict grader would reject. 8B's stricter bias makes it safer here despite the 87.5% JUDGE rate; 31b is consistent and would be the gold standard but is too slow for live use.
- **Cultural/contextual categories untested.** "Words in a Beatles song", "Things only true after 2020" — these may break the judge in ways simple semantics don't.
- **No adversarial player.** What if the player invents a category to *deliberately game* the system into accepting a near-wrong grouping? E.g. "Words that contain a vowel" trivially fits any 4 English words. Need a category-tightness check on player input, not just word-fit.
- **Ground truth is mine and arguable.** judge-y5 (\_\_\_ HOUSE — 26b said "courthouse is one word, not 'court' + 'house'") is a defensible call I marked as a fail. Real human-grader agreement might bump 26b to 16/16 too.
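The category-tightness check suggested in the adversarial-player bullet could be approached by asking the judge whether the player's category also fits unrelated decoy groups: a category that accepts everything is too loose to count. The sketch below is one possible shape, not a tested design. `category_is_too_loose`, `toy_judge`, and `DECOYS` are hypothetical, and `toy_judge` is a hard-coded stand-in for a real model call.

```python
from typing import Callable, Sequence


def category_is_too_loose(
    category: str,
    decoy_groups: Sequence[Sequence[str]],
    judge: Callable[[str, Sequence[str]], bool],
    max_decoy_accepts: int = 0,
) -> bool:
    """Flag a player category if the judge also accepts unrelated decoy groups."""
    accepts = sum(1 for group in decoy_groups if judge(category, group))
    return accepts > max_decoy_accepts


def toy_judge(category: str, words: Sequence[str]) -> bool:
    # Stand-in for a model call, hard-coded for two categories only.
    if category == "Words that contain a vowel":
        return all(any(c in "AEIOU" for c in w.upper()) for w in words)
    if category == "Things you can crack":
        return set(words) <= {"WHIP", "NUT", "CODE", "SMILE"}
    return False


# Decoys reuse word groups from the test cases above.
DECOYS = [
    ["OAK", "MAPLE", "BIRCH", "PINE"],
    ["MONDAY", "FRIDAY", "SUNDAY", "WEDNESDAY"],
]

print(category_is_too_loose("Words that contain a vowel", DECOYS, toy_judge))  # True
print(category_is_too_loose("Things you can crack", DECOYS, toy_judge))        # False
```

Each decoy group costs one extra judge call per player guess, so the number of decoys trades off against the per-guess latency measured above.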
+236
@@ -0,0 +1,236 @@
#!/usr/bin/env python3
"""Bakeoff: can Gemma 4 generate Connections-style structured puzzles?
Stress-tests unaided one-shot generation on gemma4:26b and gemma4:31b on a
local Ollama (point OLLAMA_HOST at your instance; default localhost:11434).
Output is graded by hand afterward against a rubric in the README:
overlap-traps, tight category labels, purple wordplay, blind anchor vs a
real human-curated puzzle.
"""
import json
import os
import sys
import time
import urllib.request
from datetime import datetime
from pathlib import Path
OLLAMA = f"{os.environ.get('OLLAMA_HOST', 'http://localhost:11434').rstrip('/')}/api/generate"
MODELS = ["gemma4:26b", "gemma4:31b-it-q4_K_M"]
N_PER_MODEL = 5
TEMPERATURE = 0.8
PROJECT_ROOT = Path(__file__).resolve().parent.parent
PROMPT = """You are designing a single puzzle in the style of NYT Connections.
A Connections puzzle has:
- Exactly 16 distinct words or short phrases
- Sorted into 4 hidden groups of 4
- Each group has a tight, specific category label
- Difficulty bands: yellow (easiest, most direct), green (medium), blue (harder, often more abstract), purple (trickiest -- wordplay, double meanings, hidden patterns; e.g. "___ HOUSE": GREEN, ICE, COURT, FIRE)
- The CRITICAL feature: at least 2-3 words must plausibly fit a different group than where they actually go. These red herrings are what make the puzzle hard. Without them, the puzzle is trivial.
Generate ONE puzzle on a theme of your choice. Output strict JSON in this shape:
{
"theme_seed": "<one-line description of what inspired the puzzle>",
"groups": [
{"difficulty": "yellow", "category": "<tight category label>", "words": ["W1","W2","W3","W4"]},
{"difficulty": "green", "category": "<...>", "words": [...]},
{"difficulty": "blue", "category": "<...>", "words": [...]},
{"difficulty": "purple", "category": "<...>", "words": [...]}
],
"intended_traps": [
{"word": "<a word from the puzzle>", "actual_group": "yellow|green|blue|purple", "trap_group": "yellow|green|blue|purple", "reason": "<why it plausibly fits the trap group>"}
]
}
Rules:
- All 16 words must be distinct
- Categories must be tight enough that the right answer feels obviously right after the reveal
- intended_traps must list at least 2 genuine red-herring words
- Output ONLY the JSON object. No preamble, no markdown fences, no commentary.
"""
def call(model: str, prompt: str, temperature: float, timeout: int = 600):
    """POST one non-streaming generate request; return (elapsed_seconds, response_dict)."""
# NOTE on Gemma 4 settings (see ~/bin/gemma4-research/GOTCHAS.md):
# - No format=json (infinite loop on gemma4:26b Q4)
# - think=false for single-turn JSON pipelines (else thinking tokens eat budget)
# - num_ctx >> 2048 default, num_predict >> 128 default
payload = {
"model": model,
"prompt": prompt,
"stream": False,
"think": False,
"options": {
"temperature": temperature,
"num_ctx": 8192,
"num_predict": 4096,
},
}
req = urllib.request.Request(
OLLAMA,
data=json.dumps(payload).encode(),
headers={"Content-Type": "application/json"},
)
t0 = time.time()
with urllib.request.urlopen(req, timeout=timeout) as r:
data = json.loads(r.read())
return time.time() - t0, data
def extract_json(body: str):
"""Pull the JSON object out of a Gemma response. Returns parsed dict or raises."""
if not body or "{" not in body or "}" not in body:
raise ValueError("no JSON object delimiters in response")
chunk = body[body.find("{"): body.rfind("}") + 1]
return json.loads(chunk)
def warm(model: str) -> None:
print(f"[warm] {model}", file=sys.stderr, flush=True)
call(model, "Reply with just the word OK.", temperature=0.1, timeout=300)
def run_model(model: str, n: int):
out = []
for i in range(1, n + 1):
# Retry with temp-bump pattern from AI_Visualizer
last_raw = ""
last_dt = 0.0
last_data = {}
last_err = None
puzzle = None
ok = False
attempts = 0
for attempt in range(3):
attempts = attempt + 1
temp = TEMPERATURE + attempt * 0.1
print(f"[{model}] puzzle {i}/{n} attempt {attempts} (temp={temp:.1f})",
file=sys.stderr, flush=True)
try:
dt, data = call(model, PROMPT, temperature=temp)
except Exception as e:
last_err = repr(e)
continue
last_dt, last_data = dt, data
last_raw = data.get("response", "") or ""
try:
puzzle = extract_json(last_raw)
ok = True
break
except Exception as e:
last_err = repr(e)
continue
if ok:
out.append({
"model": model, "i": i, "dt": last_dt, "ok": True,
"attempts": attempts,
"puzzle": puzzle,
"eval_count": last_data.get("eval_count", 0),
"prompt_eval_count": last_data.get("prompt_eval_count", 0),
})
else:
out.append({
"model": model, "i": i, "dt": last_dt, "ok": False,
"attempts": attempts,
"puzzle": {"_parse_error": last_err, "_raw": last_raw[:3000]},
"eval_count": last_data.get("eval_count", 0) if last_data else 0,
"prompt_eval_count": last_data.get("prompt_eval_count", 0) if last_data else 0,
})
return out
def render(results, stamp: str) -> str:
lines = [
f"# Gemma 4 Generation Bakeoff -- {stamp}",
"",
"## Setup",
f"- Ollama endpoint: `{OLLAMA}` (RTX 3090 Ti on the test host)",
"- Other GPU workloads paused for the duration of the run",
f"- Models: {', '.join(f'`{m}`' for m in MODELS)}",
f"- {N_PER_MODEL} puzzles per model, base temperature {TEMPERATURE}",
"- Gemma 4 settings (per `~/bin/gemma4-research/GOTCHAS.md`): `think=false`, "
"`num_ctx=8192`, `num_predict=4096`. No `format=json` (infinite-loop bug). "
"JSON extracted client-side via `body[body.find('{'):body.rfind('}')+1]`.",
"- Up to 3 attempts per puzzle with temperature bumped +0.1 each retry "
"(AI_Visualizer pattern). Reported metrics use the *successful* attempt.",
"- One-shot, unaided generation. No critique pass, no example puzzle in prompt.",
"",
"## Timing",
"",
"| Model | n | avg s | avg tokens | tok/s |",
"|---|---|---|---|---|",
]
for m in MODELS:
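        # Time only puzzles that parsed: results never carry an "error" key,
        # and a puzzle whose every call raised would add dt=0.0 to the average.
        rs = [r for r in results if r["model"] == m and r.get("ok")]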
if not rs:
lines.append(f"| `{m}` | 0 | -- | -- | -- |")
continue
avg_s = sum(r["dt"] for r in rs) / len(rs)
avg_tok = sum(r["eval_count"] for r in rs) / len(rs)
toks = avg_tok / avg_s if avg_s else 0
lines.append(f"| `{m}` | {len(rs)} | {avg_s:.1f} | {avg_tok:.0f} | {toks:.1f} |")
lines += ["", "## JSON parse rate", ""]
for m in MODELS:
rs = [r for r in results if r["model"] == m]
ok = sum(1 for r in rs if r.get("ok"))
lines.append(f"- `{m}`: {ok}/{len(rs)} parsed cleanly")
lines += [""]
for r in results:
head = f"## {r['model']} -- puzzle {r['i']}"
lines += [head, ""]
if "error" in r:
lines += [f"_API error:_ `{r['error']}`", ""]
continue
if not r.get("ok"):
err = r["puzzle"].get("_parse_error", "")
raw = r["puzzle"].get("_raw", "")[:1500]
lines += [f"_JSON parse failed:_ `{err}`", "```", raw, "```", ""]
continue
p = r["puzzle"]
lines += [f"**Theme seed:** {p.get('theme_seed', '--')}", ""]
lines += ["| Diff | Category | Words |", "|---|---|---|"]
for g in p.get("groups", []) or []:
words = ", ".join(g.get("words", []) or [])
cat = (g.get("category") or "?").replace("|", "\\|")
lines.append(f"| {g.get('difficulty', '?')} | {cat} | {words} |")
traps = p.get("intended_traps", []) or []
lines += ["", f"**Claimed traps ({len(traps)}):**"]
if not traps:
lines.append("- _none claimed_")
for t in traps:
lines.append(
f"- `{t.get('word')}` (actually {t.get('actual_group')}, traps {t.get('trap_group')}): "
f"{t.get('reason')}"
)
lines += ["", "_Grade:_ TODO", "", f"_dt={r['dt']:.1f}s, tokens={r['eval_count']}_", ""]
return "\n".join(lines)
def main() -> None:
    out_dir = PROJECT_ROOT / "docs" / "reference"
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y-%m-%d-%H%M%S")
    raw_path = out_dir / f"gemma-generation-bakeoff-{stamp}-raw.json"
    md_path = out_dir / f"gemma-generation-bakeoff-{stamp}.md"
    all_results = []
    for m in MODELS:
        warm(m)
        all_results.extend(run_model(m, N_PER_MODEL))
    raw_path.write_text(json.dumps(all_results, indent=2))
    print(f"raw -> {raw_path}", file=sys.stderr)
    md_path.write_text(render(all_results, stamp))
    print(f"md -> {md_path}", file=sys.stderr)
    # Final stdout: just the markdown path so callers can pipe.
    print(md_path)


if __name__ == "__main__":
    main()
+431
View File
@@ -0,0 +1,431 @@
#!/usr/bin/env python3
"""Bakeoff: Gemma 4's atomic semantic-matching abilities.

Three test types, all with hand-labeled ground truth:
  - JUDGE: given (category, 4 words), does Gemma correctly say yes/no on whether
    the words tightly fit?
  - CREATE: given a category, does Gemma produce 4 words that tightly fit it?
  - CREATIVE_ACCEPT: given 4 words and a player-proposed category that may or
    may not be the puzzle's intended one, does Gemma fairly judge validity?
    This is the test of whether "fuzzy / creative-grouping acceptance" -- the
    twist from IDEA.md that a static NYT format structurally can't do -- is
    feasible.

Models tested: gemma4:26b, gemma4:31b-it-q4_K_M, gemma4:latest (8B). 8B is
included because judging runs per player guess in any live design; if 8B is
reliable enough for JUDGE, the per-guess economics get a lot better.

Settings (well-known Gemma-4-on-Ollama gotchas): think=false, num_ctx=4096,
num_predict=512, no format=json (server-side JSON enforcer hangs on 26b Q4),
JSON extracted client-side. Point OLLAMA_HOST at your instance; default
localhost:11434.
"""
import json
import os
import sys
import time
import urllib.request
from datetime import datetime
from pathlib import Path
OLLAMA = f"{os.environ.get('OLLAMA_HOST', 'http://localhost:11434').rstrip('/')}/api/generate"
MODELS = ["gemma4:latest", "gemma4:26b", "gemma4:31b-it-q4_K_M"]
PROJECT_ROOT = Path(__file__).resolve().parent.parent
TEMPERATURE = 0.2 # judging is a low-creativity task; we want consistency
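# Sketch, not part of the original script: OLLAMA above assumes OLLAMA_HOST is a
# full URL including the scheme. Some Ollama setups set it to a bare host:port
# (for example 0.0.0.0:11434); the hypothetical helper below adds a scheme in
# that case. The name `ollama_generate_url` and the plain-HTTP default are
# assumptions for illustration only.
def ollama_generate_url(host: str = "http://localhost:11434") -> str:
    host = host.strip().rstrip("/")
    if "://" not in host:
        host = f"http://{host}"  # assumption: plain HTTP when no scheme is given
    return f"{host}/api/generate"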
# ---------- prompts ----------
JUDGE_PROMPT = """You are evaluating whether four words tightly fit a given semantic category, in the style of NYT Connections.
Category: {category}
Words: {w1}, {w2}, {w3}, {w4}
Do ALL FOUR words clearly fit this category? Be strict -- if even one word doesn't fit, the answer is "no". Generic loose connections do not count.
Output strict JSON, no preamble or fences:
{{"verdict": "yes" or "no", "reason": "<one short sentence>", "misfit_words": ["<any words that don't fit>"]}}
"""
CREATE_PROMPT = """You are creating a tight 4-word group in the style of NYT Connections.
Category: {category}
Produce EXACTLY four words or short phrases that tightly fit this category. Each word must clearly belong; vague or loosely-related words are not acceptable.
Output strict JSON, no preamble or fences:
{{"words": ["W1", "W2", "W3", "W4"], "reason": "<one short sentence on how all four fit>"}}
"""
CREATIVE_ACCEPT_PROMPT = """You are judging a Connections-style puzzle where the player has proposed their OWN category for four words. Their category may differ from the puzzle's intended one, but it might still be a valid alternative -- if all four words plausibly fit the player's category, accept it.
Words: {w1}, {w2}, {w3}, {w4}
Player's proposed category: "{player_category}"
Do all four words plausibly fit the player's category? Be fair: a player-creative-but-valid grouping should be accepted. But if even one word genuinely doesn't fit, reject it.
Output strict JSON, no preamble or fences:
{{"valid": "yes" or "no", "reason": "<one short sentence>", "weak_words": ["<any words that don't really fit>"]}}
"""
# ---------- test bank ----------
# Each case has hand-labeled ground truth. The "gt" field is what a thoughtful
# human grader would say (yes/no for JUDGE and CREATIVE_ACCEPT). For CREATE,
# `gt_check` describes what a passing answer should look like.
CASES = [
# ---- JUDGE: clear yes (tight fit) ----
{"id": "judge-y1", "type": "JUDGE", "category": "Types of trees",
"words": ["OAK", "MAPLE", "BIRCH", "PINE"], "gt": "yes"},
{"id": "judge-y2", "type": "JUDGE", "category": "Greek letters",
"words": ["ALPHA", "BETA", "GAMMA", "DELTA"], "gt": "yes"},
{"id": "judge-y3", "type": "JUDGE", "category": "Days of the week",
"words": ["MONDAY", "FRIDAY", "SUNDAY", "WEDNESDAY"], "gt": "yes"},
{"id": "judge-y4", "type": "JUDGE", "category": "Synonyms for 'happy'",
"words": ["JOYFUL", "GLAD", "CHEERFUL", "ELATED"], "gt": "yes"},
{"id": "judge-y5", "type": "JUDGE", "category": "___ HOUSE (compound words)",
"words": ["GREEN", "ICE", "FIRE", "COURT"], "gt": "yes"},
{"id": "judge-y6", "type": "JUDGE", "category": "Words that follow COLD",
"words": ["SHOULDER", "FRONT", "SNAP", "TURKEY"], "gt": "yes"},
{"id": "judge-y7", "type": "JUDGE", "category": "Verbs meaning 'to move quickly'",
"words": ["DART", "BOLT", "RUSH", "FLY"], "gt": "yes"},
{"id": "judge-y8", "type": "JUDGE", "category": "Synonyms for 'idea'",
"words": ["NOTION", "CONCEPT", "THOUGHT", "INKLING"], "gt": "yes"},
# ---- JUDGE: clear no (one or more words don't fit) ----
{"id": "judge-n1", "type": "JUDGE", "category": "Types of trees",
"words": ["OAK", "MAPLE", "BIRCH", "CARROT"], "gt": "no",
"gt_misfit": ["CARROT"]},
{"id": "judge-n2", "type": "JUDGE", "category": "Greek letters",
"words": ["ALPHA", "BETA", "GAMMA", "CYRILLIC"], "gt": "no",
"gt_misfit": ["CYRILLIC"]},
{"id": "judge-n3", "type": "JUDGE", "category": "Synonyms for 'happy'",
"words": ["JOYFUL", "GLAD", "SAD", "ELATED"], "gt": "no",
"gt_misfit": ["SAD"]},
{"id": "judge-n4", "type": "JUDGE", "category": "Days of the week",
"words": ["MONDAY", "JANUARY", "SUNDAY", "WEDNESDAY"], "gt": "no",
"gt_misfit": ["JANUARY"]},
{"id": "judge-n5", "type": "JUDGE", "category": "Body parts",
"words": ["ARM", "LEG", "EYE", "NIGHT"], "gt": "no",
"gt_misfit": ["NIGHT"]},
{"id": "judge-n6", "type": "JUDGE", "category": "Types of birds",
"words": ["CRANE", "SWALLOW", "BAT", "MOSQUITO"], "gt": "no",
"gt_misfit": ["BAT", "MOSQUITO"]},
{"id": "judge-n7", "type": "JUDGE", "category": "Things that are red",
"words": ["APPLE", "BLUE", "ROSE", "GRASS"], "gt": "no",
"gt_misfit": ["BLUE", "GRASS"]},
{"id": "judge-n8", "type": "JUDGE", "category": "Words that follow COLD",
"words": ["SHOULDER", "FRONT", "PIZZA", "MOUNTAIN"], "gt": "no",
"gt_misfit": ["PIZZA", "MOUNTAIN"]},
# ---- CREATE: easy categories ----
{"id": "create-e1", "type": "CREATE", "category": "Types of trees",
"gt_check": "Four valid tree species; e.g. OAK, MAPLE, BIRCH, PINE."},
{"id": "create-e2", "type": "CREATE", "category": "Greek letters",
"gt_check": "Four genuine Greek letters."},
{"id": "create-e3", "type": "CREATE", "category": "Synonyms for 'angry'",
"gt_check": "Four words that all genuinely mean angry/furious."},
{"id": "create-e4", "type": "CREATE", "category": "Days of the week",
"gt_check": "Four of the seven day-of-the-week names; no months or other words."},
# ---- CREATE: medium (compound / polysemy) ----
{"id": "create-m1", "type": "CREATE", "category": "___ STORM (compound words ending in STORM)",
"gt_check": "Four words that each form a real compound or fixed phrase with STORM (e.g. SAND, BRAIN, THUNDER, SNOW)."},
{"id": "create-m2", "type": "CREATE", "category": "Synonyms for 'small'",
"gt_check": "Four words that all genuinely mean small."},
{"id": "create-m3", "type": "CREATE", "category": "Words that follow BLUE",
"gt_check": "Four words that each form a real compound with BLUE (e.g. BERRY, BIRD, PRINT, BELL, GRASS)."},
{"id": "create-m4", "type": "CREATE", "category": "Things that can be 'broken'",
"gt_check": "Four words that each form a real fixed phrase with 'broken' (heart, record, law, promise, etc.)."},
# ---- CREATE: hard (wordplay / tight constraint) ----
{"id": "create-h1", "type": "CREATE",
"category": "Words that are homophones of body parts but spelled differently (e.g. HARE = hair, MUSSEL = muscle)",
"gt_check": "Four words that each sound like a body part but are spelled differently. Valid examples: HARE (hair), MUSSEL (muscle), HEAL (heel), SOUL (sole), AYE (eye). EYE and HEEL themselves do NOT count -- those are the body parts, not homophones of them."},
{"id": "create-h2", "type": "CREATE",
"category": "Words that contain a body part as a substring (e.g. HEARTBEAT contains HEART)",
"gt_check": "Four words that each contain a body part anywhere inside them. Valid examples: HEARTH (HEART), CHESTNUT (CHEST), EARTH (EAR), HEADACHE (HEAD)."},
# ---- CREATIVE_ACCEPT: player's grouping is genuinely valid ----
{"id": "ca-y1", "type": "CREATIVE_ACCEPT",
"words": ["SCALE", "MOUNT", "ASCEND", "CLIMB"],
"player_category": "Verbs for going up", "gt": "yes"},
{"id": "ca-y2", "type": "CREATIVE_ACCEPT",
"words": ["APPLE", "ORANGE", "KIWI", "BLACKBERRY"],
"player_category": "Fruits", "gt": "yes"},
{"id": "ca-y3", "type": "CREATIVE_ACCEPT",
"words": ["WHIP", "NUT", "CODE", "SMILE"],
"player_category": "Things you can crack", "gt": "yes"},
{"id": "ca-y4", "type": "CREATIVE_ACCEPT",
"words": ["BAT", "BALL", "GLOVE", "MITT"],
"player_category": "Baseball equipment", "gt": "yes"},
{"id": "ca-y5", "type": "CREATIVE_ACCEPT",
"words": ["MARS", "VENUS", "MERCURY", "JUPITER"],
"player_category": "Roman gods", "gt": "yes"},
# ---- CREATIVE_ACCEPT: player's grouping is wrong ----
{"id": "ca-n1", "type": "CREATIVE_ACCEPT",
"words": ["OAK", "MAPLE", "BIRCH", "PINE"],
"player_category": "Furniture brands", "gt": "no"},
{"id": "ca-n2", "type": "CREATIVE_ACCEPT",
"words": ["ALPHA", "BETA", "GAMMA", "DELTA"],
"player_category": "Words meaning 'small'", "gt": "no"},
{"id": "ca-n3", "type": "CREATIVE_ACCEPT",
"words": ["BAT", "BALL", "GLOVE", "MITT"],
"player_category": "Things worn on your hand", "gt": "no",
"gt_weak": ["BAT", "BALL"]},
{"id": "ca-n4", "type": "CREATIVE_ACCEPT",
"words": ["MONDAY", "FRIDAY", "SUNDAY", "WEDNESDAY"],
"player_category": "Months of the year", "gt": "no"},
{"id": "ca-n5", "type": "CREATIVE_ACCEPT",
"words": ["WHIP", "NUT", "CODE", "SMILE"],
"player_category": "Things found in a kitchen", "gt": "no",
"gt_weak": ["CODE", "SMILE"]},
# ---- CREATIVE_ACCEPT: borderline (deliberately ambiguous) ----
{"id": "ca-b1", "type": "CREATIVE_ACCEPT",
"words": ["APPLE", "ORANGE", "KIWI", "BLACKBERRY"],
"player_category": "Tech/phone brands", "gt": "borderline",
"gt_note": "APPLE and BLACKBERRY clearly are tech brands; ORANGE is an EU/UK telecom carrier (defensible); KIWI is not a tech brand. Strict grader says no; lenient grader might accept ORANGE."},
{"id": "ca-b2", "type": "CREATIVE_ACCEPT",
"words": ["SHARP", "FLAT", "NATURAL", "KEY"],
"player_category": "Real estate listing terms", "gt": "borderline",
"gt_note": "FLAT (UK apartment) and KEY (keys) connect; SHARP (a sharp property) and NATURAL (natural light?) are weak. Strict no; creative yes."},
]
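# Sketch, not part of the original script: a quick structural check for a test
# bank shaped like CASES, run before spending GPU time. `validate_cases` is a
# hypothetical helper; its rules are assumptions read off the entries above
# (unique ids, exactly four words for JUDGE/CREATIVE_ACCEPT, gt_check for CREATE).
def validate_cases(cases):
    problems = []
    seen = set()
    for c in cases:
        cid = c.get("id", "<missing id>")
        if cid in seen:
            problems.append(f"{cid}: duplicate id")
        seen.add(cid)
        kind = c.get("type")
        if kind in ("JUDGE", "CREATIVE_ACCEPT"):
            if len(c.get("words", [])) != 4:
                problems.append(f"{cid}: expected exactly 4 words")
            if c.get("gt") not in ("yes", "no", "borderline"):
                problems.append(f"{cid}: unexpected gt {c.get('gt')!r}")
        elif kind == "CREATE":
            if not c.get("gt_check"):
                problems.append(f"{cid}: CREATE case needs gt_check")
        else:
            problems.append(f"{cid}: unknown type {kind!r}")
    return problems  # an empty list means the bank looks structurally sound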
# ---------- runner ----------
def call(model, prompt, temperature=TEMPERATURE, timeout=300):
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "think": False,
        "options": {"temperature": temperature, "num_ctx": 4096, "num_predict": 512},
    }
    req = urllib.request.Request(
        OLLAMA, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    t0 = time.time()
    with urllib.request.urlopen(req, timeout=timeout) as r:
        data = json.loads(r.read())
    return time.time() - t0, data
def extract_json(body):
    if not body or "{" not in body or "}" not in body:
        raise ValueError("no JSON braces in response")
    return json.loads(body[body.find("{"): body.rfind("}") + 1])
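# Sketch, not part of the original script: extract_json() slices from the first
# "{" to the last "}", which fails when the model emits a stray brace after the
# object. A hypothetical alternative, `extract_first_json_object`, uses the
# standard-library json.JSONDecoder.raw_decode to parse the first decodable
# object and ignore any trailing text.
import json


def extract_first_json_object(body):
    decoder = json.JSONDecoder()
    start = (body or "").find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(body, start)
            return obj
        except json.JSONDecodeError:
            # This "{" did not open a valid object; try the next one.
            start = body.find("{", start + 1)
    raise ValueError("no JSON object in response")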
def render_prompt(case):
    if case["type"] == "JUDGE":
        return JUDGE_PROMPT.format(
            category=case["category"],
            w1=case["words"][0], w2=case["words"][1],
            w3=case["words"][2], w4=case["words"][3],
        )
    if case["type"] == "CREATE":
        return CREATE_PROMPT.format(category=case["category"])
    if case["type"] == "CREATIVE_ACCEPT":
        return CREATIVE_ACCEPT_PROMPT.format(
            w1=case["words"][0], w2=case["words"][1],
            w3=case["words"][2], w4=case["words"][3],
            player_category=case["player_category"],
        )
    raise ValueError(case["type"])
def warm(model):
    print(f"[warm] {model}", file=sys.stderr, flush=True)
    call(model, "Reply with the word OK only.", temperature=0.1, timeout=300)
def run_model(model, cases):
    out = []
    for case in cases:
        prompt = render_prompt(case)
        last_err = None
        parsed = None
        last_dt = 0.0
        last_eval = 0
        last_raw = ""
        for attempt in range(3):
            temp = TEMPERATURE + attempt * 0.1
            print(f"[{model}] {case['id']} attempt {attempt+1} (temp={temp:.1f})",
                  file=sys.stderr, flush=True)
            try:
                dt, data = call(model, prompt, temperature=temp)
            except Exception as e:
                last_err = repr(e)
                continue
            last_dt = dt
            last_eval = data.get("eval_count", 0)
            last_raw = data.get("response", "") or ""
            try:
                parsed = extract_json(last_raw)
                last_err = None
                break
            except Exception as e:
                last_err = repr(e)
                continue
        out.append({
            "case_id": case["id"], "type": case["type"], "model": model,
            "dt": last_dt, "eval_count": last_eval,
            "ok": parsed is not None,
            "parsed": parsed,
            "raw": last_raw[:1500] if parsed is None else None,
            "error": last_err,
            "case": case,
        })
    return out
def score(results):
    """Auto-score against ground truth where possible."""
    for r in results:
        c = r["case"]
        if not r["ok"]:
            r["score"] = "PARSE_FAIL"
            continue
        p = r["parsed"]
        if c["type"] == "JUDGE":
            # str() guards against a non-string verdict (e.g. JSON true/false),
            # which would otherwise raise on .strip() and abort the whole run.
            v = str(p.get("verdict") or "").strip().lower()
            r["score"] = "PASS" if v == c["gt"] else "FAIL"
        elif c["type"] == "CREATIVE_ACCEPT":
            v = str(p.get("valid") or "").strip().lower()
            if c["gt"] == "borderline":
                r["score"] = "BORDERLINE"  # human grades these
            else:
                r["score"] = "PASS" if v == c["gt"] else "FAIL"
        elif c["type"] == "CREATE":
            r["score"] = "MANUAL"  # human grades these against gt_check
    return results
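# Sketch, not part of the original script: turning scored results into a
# percentage, e.g. 14 PASS out of 16 auto-graded cases gives 87.5. `pass_rate`
# is a hypothetical helper; it assumes each result carries the "type" and
# "score" keys set by run_model() and score(), and it skips BORDERLINE and
# MANUAL rows because those are graded by a human.
def pass_rate(results, case_type):
    graded = [
        r for r in results
        if r["type"] == case_type and r.get("score") in ("PASS", "FAIL", "PARSE_FAIL")
    ]
    if not graded:
        return None  # nothing auto-gradable for this case type
    passed = sum(1 for r in graded if r["score"] == "PASS")
    return 100.0 * passed / len(graded)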
def render(results):
    by_model = {}
    for r in results:
        by_model.setdefault(r["model"], []).append(r)
    lines = [f"# Gemma 4 Semantic Bakeoff -- {datetime.now().strftime('%Y-%m-%d %H:%M')}", ""]
    lines += [
        "## Setup",
        f"- Host: steel141 (RTX 3090 Ti) `{OLLAMA}`",
        f"- Models: {', '.join('`'+m+'`' for m in MODELS)}",
        f"- Temperature {TEMPERATURE} (raised +0.1 per retry on JSON parse fail, max 3 attempts)",
        "- think=false, num_ctx=4096, num_predict=512, no format=json (per gemma4-research/GOTCHAS.md)",
        f"- {len(CASES)} test cases: "
        f"{sum(1 for c in CASES if c['type']=='JUDGE')} JUDGE, "
        f"{sum(1 for c in CASES if c['type']=='CREATE')} CREATE, "
        f"{sum(1 for c in CASES if c['type']=='CREATIVE_ACCEPT')} CREATIVE_ACCEPT",
        "- Ground truth hand-labeled inline in `scripts/gemma-semantic-bakeoff.py`",
        "",
    ]
    # ---- per-model summaries ----
    lines += ["## Auto-scored summary", ""]
    lines += ["| Model | JUDGE pass | CREATIVE_ACCEPT pass | parse fails | avg s |", "|---|---|---|---|---|"]
    for m in MODELS:
        rs = by_model.get(m, [])
        if not rs:
            lines.append(f"| `{m}` | - | - | - | - |")
            continue
        j_pass = sum(1 for r in rs if r["case"]["type"] == "JUDGE" and r.get("score") == "PASS")
        j_n = sum(1 for r in rs if r["case"]["type"] == "JUDGE")
        c_pass = sum(1 for r in rs if r["case"]["type"] == "CREATIVE_ACCEPT" and r.get("score") == "PASS")
        c_n = sum(1 for r in rs if r["case"]["type"] == "CREATIVE_ACCEPT" and r["case"].get("gt") != "borderline")
        parse_fail = sum(1 for r in rs if not r["ok"])
        avg_dt = sum(r["dt"] for r in rs) / max(len(rs), 1)
        lines.append(f"| `{m}` | {j_pass}/{j_n} | {c_pass}/{c_n} | {parse_fail} | {avg_dt:.1f} |")
    lines += [""]
    # ---- by case-type, full breakdown ----
    for tname in ["JUDGE", "CREATE", "CREATIVE_ACCEPT"]:
        lines += [f"## {tname}", ""]
        cases_of_type = [c for c in CASES if c["type"] == tname]
        for case in cases_of_type:
            lines += [f"### {case['id']}", ""]
            if tname == "JUDGE":
                lines += [
                    f"- Category: `{case['category']}`",
                    f"- Words: {', '.join('`'+w+'`' for w in case['words'])}",
                    f"- Ground truth: **{case['gt']}**" + (
                        f" (misfit: {', '.join(case.get('gt_misfit', []))})" if case.get("gt_misfit") else ""),
                    "",
                ]
            elif tname == "CREATE":
                lines += [
                    f"- Category: `{case['category']}`",
                    f"- Quality bar: {case['gt_check']}",
                    "",
                ]
            else:  # CREATIVE_ACCEPT
                lines += [
                    f"- Words: {', '.join('`'+w+'`' for w in case['words'])}",
                    f"- Player's category: `\"{case['player_category']}\"`",
                    f"- Ground truth: **{case['gt']}**" + (
                        f" -- {case.get('gt_note', '')}" if case.get("gt_note") else ""),
                    "",
                ]
            lines += ["| Model | Verdict | Reason | Score | dt |", "|---|---|---|---|---|"]
            for m in MODELS:
                r = next((r for r in by_model.get(m, []) if r["case_id"] == case["id"]), None)
                if r is None:
                    lines.append(f"| `{m}` | - | - | - | - |")
                    continue
                if not r["ok"]:
                    lines.append(f"| `{m}` | _parse fail_ | `{(r.get('error') or '')[:60]}` | PARSE_FAIL | {r['dt']:.1f}s |")
                    continue
                p = r["parsed"]
                if tname == "JUDGE":
                    v = p.get("verdict", "?")
                    reason = p.get("reason", "")
                    extra = ""
                    if p.get("misfit_words"):
                        extra = f" (misfit: {', '.join(p['misfit_words'])})"
                elif tname == "CREATE":
                    v = ", ".join(p.get("words", []) or [])[:80]
                    reason = p.get("reason", "")
                    extra = ""
                else:
                    v = p.get("valid", "?")
                    reason = p.get("reason", "")
                    extra = ""
                    if p.get("weak_words"):
                        extra = f" (weak: {', '.join(p['weak_words'])})"
                reason_short = (reason + extra).replace("|", "\\|")[:120]
                v_clean = str(v).replace("|", "\\|")[:80]
                lines.append(f"| `{m}` | {v_clean} | {reason_short} | {r.get('score', '?')} | {r['dt']:.1f}s |")
            lines += [""]
    return "\n".join(lines)
def main():
    out_dir = PROJECT_ROOT / "docs" / "reference"
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y-%m-%d-%H%M%S")
    raw_path = out_dir / f"gemma-semantic-bakeoff-{stamp}-raw.json"
    md_path = out_dir / f"gemma-semantic-bakeoff-{stamp}.md"
    all_results = []
    for m in MODELS:
        warm(m)
        all_results.extend(run_model(m, CASES))
    score(all_results)
    # Save raw results without the redundant case dict; case_id is already a key.
    raw = [{k: v for k, v in r.items() if k != "case"} for r in all_results]
    raw_path.write_text(json.dumps(raw, indent=2))
    print(f"raw -> {raw_path}", file=sys.stderr)
    md_path.write_text(render(all_results))
    print(f"md -> {md_path}", file=sys.stderr)
    print(md_path)


if __name__ == "__main__":
    main()