seth_semantic_game/docs/reference/gemma-generation-bakeoff-2026-04-27-221751.md
Mortdecai 5a2a02e483 docs: bootstrap repo with bakeoff results and game-mechanics idea bank
This repo opens with the design-discovery work completed before any product
code is written. Two model bakeoffs against gemma4:8b/26b/31b on a local
Ollama established that:

- Whole-puzzle generation in the Connections shape is unreliable on Gemma 4
  (gemma4:31b ~50% structural-pass, gemma4:26b ~20-30%); 31b is intentionally
  out of project scope, so the generation route is harder still.
- Atomic semantic-judging skills are reliable: 87.5%/93.75%/100% (8B/26b/31b)
  on JUDGE; *all three models* scored 10/10 on CREATIVE_ACCEPT — fair judging
  of player-INVENTED categories. That is the structural unlock vs static
  hand-curated word games.

The README contains the full writeup, the test bench, and a brainstormed
bank of 10 distinct game-mechanics ideas across the fast/medium/slow tempo
range, plus a primitives table for recombination.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 23:09:46 -04:00

# Gemma 4 Generation Bakeoff -- 2026-04-27-221751
## Setup
- Local Ollama on the test host (RTX 3090 Ti, 24 GB VRAM)
- Other GPU workloads paused for the duration of the run
- Models: `gemma4:26b`, `gemma4:31b-it-q4_K_M`
- 5 puzzles per model, base temperature 0.8
- Gemma 4 settings (per `~/bin/gemma4-research/GOTCHAS.md`): `think=false`, `num_ctx=8192`, `num_predict=4096`. `format=json` is deliberately not set because it triggers an infinite-loop bug; JSON is instead extracted client-side via `body[body.find('{'):body.rfind('}')+1]`.
- Up to 3 attempts per puzzle with temperature bumped +0.1 each retry (AI_Visualizer pattern). Reported metrics use the *successful* attempt.
- One-shot, unaided generation. No critique pass, no example puzzle in prompt.
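
The extraction and retry rules above can be sketched roughly as follows. This is a minimal illustration, not the bench's actual code: `call_model` is a hypothetical callable that takes a temperature and returns the raw response text (the real run called a local Ollama), and the function names are invented for this example.

```python
import json
from typing import Callable, Optional


def extract_json(body: str) -> Optional[dict]:
    """Slice from the first '{' to the last '}' and parse; None on failure."""
    start, end = body.find("{"), body.rfind("}")
    if start == -1 or end < start:
        return None
    try:
        parsed = json.loads(body[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def generate_with_retries(
    call_model: Callable[[float], str],  # hypothetical: temperature -> raw text
    base_temperature: float = 0.8,
    max_attempts: int = 3,
    bump: float = 0.1,
) -> Optional[dict]:
    """Try up to max_attempts times, raising the temperature on each retry."""
    for attempt in range(max_attempts):
        temperature = round(base_temperature + bump * attempt, 2)
        puzzle = extract_json(call_model(temperature))
        if puzzle is not None:
            return puzzle
    return None
```

With the defaults, the attempts run at temperatures 0.8, 0.9, and 1.0. This sketch only retries on a parse failure; any further acceptance criteria the bench applied are not shown here.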
## Timing
| Model | n | avg s | avg tokens | tok/s |
|---|---|---|---|---|
| `gemma4:26b` | 5 | 5.2 | 489 | 94.3 |
| `gemma4:31b-it-q4_K_M` | 5 | 18.2 | 442 | 24.3 |
## JSON parse rate
- `gemma4:26b`: 5/5 parsed cleanly
- `gemma4:31b-it-q4_K_M`: 5/5 parsed cleanly
## gemma4:26b -- puzzle 1
**Theme seed:** An exploration of common words that can be paired with 'BIRD' or 'DRAGON' or relate to types of 'FLY'.
| Diff | Category | Words |
|---|---|---|
| yellow | Types of Flies | FRUIT, HOUSE, FIRE, BLOW |
| green | Words before 'FLY' | DRAGON, MAY, BUTTER, BOTTLE |
| blue | Types of Birds | LARK, SWALLOW, CRANE, SWIFT |
| purple | Words that follow 'DRAGON' | FRUIT, FLY, SLAYER, FLAME |
**Claimed traps (4):**
- `FRUIT` (actually purple, traps yellow): Fruit fly is a very common type of fly, making it a primary red herring for the yellow group.
- `FIRE` (actually yellow, traps purple): Fire dragon or dragon flame connection could lead someone to look for a dragon-themed group.
- `DRAGON` (actually green, traps blue): Dragon is a mythical creature, and users might look for a 'mythical creatures' group including Crane or Swallow (which are real birds).
- `FLY` (actually purple, traps green): Dragonfly is a common word, tempting the player to put 'FLY' in the 'Words before FLY' group.
**Grade: FAIL.** `FRUIT` appears in **two groups** (yellow Types of Flies + purple Words that follow DRAGON) — that's only 15 distinct tiles, not the required 16. Categories are also redundant: yellow ("Types of Flies") and green ("Words before FLY") are both fly-themed. A human curator would never ship this.
_dt=5.7s, tokens=500_
## gemma4:26b -- puzzle 2
**Theme seed:** Homophones and linguistic overlaps between nature and everyday objects.
| Diff | Category | Words |
|---|---|---|
| yellow | Types of Trees | BIRCH, CEDAR, MAPLE, PINE |
| green | Things that can be 'Dropped' | BEAT, HINT, LINE, SUBJECT |
| blue | Units of Measurement | FOOT, GRAM, SECOND, STONE |
| purple | Homophones of Body Parts | EYE, HARE, MUSSEL, QUEUE |
**Claimed traps (4):**
- `PINE` (actually yellow, traps green): To 'pine' for someone is a feeling, but 'pine' could be mistaken for something you can 'drop' (like a pine needle).
- `FOOT` (actually blue, traps yellow): A 'foot' is a unit of measurement, but it is also a body part, potentially leading to the 'Body Part Homophones' group.
- `BEAT` (actually green, traps blue): A 'beat' is a rhythm, but it could be mistaken for a unit of time/measurement in a musical context.
- `EYE` (actually purple, traps blue): The word 'eye' is a body part, which might tempt players looking for a body part category, but it is actually a homophone (I) for the purple group.
**Grade: PARTIAL FAIL.** Yellow/green/blue all tight. Purple "Homophones of Body Parts" is half-broken: `HARE`→hair ✓, `MUSSEL`→muscle ✓, but `EYE` *is* a body part (not a homophone of one) and `QUEUE`→cue is not a body-part homophone. 2/4 wrong on the trickiest group is a curator-rejection. Trap reasoning is weak (claims `EYE` traps blue Units — body parts aren't units).
_dt=5.3s, tokens=510_
## gemma4:26b -- puzzle 3
**Theme seed:** The concept of 'Pressure' and things that can be measured or applied.
| Diff | Category | Words |
|---|---|---|
| yellow | Units of Pressure | BAR, PASCAL, PSI, TORR |
| green | Words after 'BLOOD' | PRESSURE, TYPE, ORANGE, DONOR |
| blue | Things that can be 'Broken' | HEART, RECORD, LAW, PROMISE |
| purple | Words that are also Verbs for 'Applying Force' | PRESS, STAMP, PUNCH, DRIVE |
**Claimed traps (4):**
- `PRESSURE` (actually green, traps yellow): Pressure is the concept being measured by the units in the yellow group.
- `BAR` (actually yellow, traps blue): A 'bar' is something that can be 'broken' (like a chocolate bar) or refers to a legal setting.
- `PRESS` (actually purple, traps green): The 'Press' is a common term in news/media, and 'Press' can be related to 'Blood' in a medical context (blood pressure).
- `RECORD` (actually blue, traps purple): A 'record' is something you can 'stamp' or 'drive' (as in a driving record).
**Grade: BORDERLINE PASS.** Categories tight, 16 distinct words, ✓. `BAR` (yellow → blue trap) is a real overlap (chocolate bar, broken bar). `PRESS` claimed to trap green ("Words after BLOOD") — but blood **pressure** is the compound, not blood **press**, so the claimed trap doesn't actually hold. `RECORD`→stamp/drive is a stretch. 1/4 of Gemma's self-graded traps is broken; remainder OK.
_dt=5.1s, tokens=492_
## gemma4:26b -- puzzle 4
**Theme seed:** A collection of words related to things that can be 'broken' or 'cracked', mixed with other categories like cooking and music.
| Diff | Category | Words |
|---|---|---|
| yellow | Kinds of Eggs | CHICKEN, QUAIL, GOOSE, DUCK |
| green | Things You Can Crack | NUT, CODE, WHIP, SMILE |
| blue | Musical Terms | NOTE, CHORD, CLEF, BEAT |
| purple | Words Before 'Egg' | NEST, EASTER, BAD, NESTING |
**Claimed traps (4):**
- `DUCK` (actually yellow, traps green): DUCK is an action related to avoiding a whip or a blow, and can be paired with 'crack' in some linguistic contexts.
- `CHICKEN` (actually yellow, traps purple): CHICKEN is an egg-related word that could tempt someone looking for '___ Egg' phrases (like Chicken Egg).
- `BEAT` (actually blue, traps green): BEAT is something you can do to a whip or a person, fitting the 'Things you can crack/strike' theme.
- `NEST` (actually purple, traps yellow): NEST is intrinsically linked to the birds in the yellow group (CHICKEN, GOOSE, etc.).
**Grade: FAIL.** Purple has `NESTING` as "Words Before EGG" — *nesting egg* isn't a phrase (*nest egg* is). Worse, `NEST` and `NESTING` are near-duplicate tiles. Yellow "Kinds of Eggs" and purple "Words Before EGG" are also redundant (chicken egg vs nest egg — both compound forms). Curator reject.
_dt=5.3s, tokens=506_
## gemma4:26b -- puzzle 5
**Theme seed:** A collection of words centered around various types of 'records' and 'tracks'.
| Diff | Category | Words |
|---|---|---|
| yellow | Types of Records | LP, EP, Single, Album |
| green | Things that have Tracks | Train, Athlete, Song, Mud |
| blue | Words for 'Evidence' | Clue, Trace, Mark, Proof |
| purple | Words that follow 'Side' | Effect, Kick, Split, Walk |
**Claimed traps (4):**
- `Single` (actually yellow, traps blue): A 'single' can be a piece of evidence or a solitary item.
- `Song` (actually green, traps yellow): A song is often released as an LP, EP, or Single.
- `Trace` (actually blue, traps green): A trace can be a physical track left behind in mud.
- `Split` (actually purple, traps green): A 'split' can refer to a track or division in a race.
**Grade: PASS.** Cleanest 26b output. 16 distinct words; categories tight; the word "track" carries **three different meanings** across green (railroad/running/album), with `SONG`→Records and `TRACE`→Tracks as real overlaps. That's the multi-meaning anchor pattern Connections actually uses. `SINGLE` and `SPLIT` traps are weak, but the puzzle structure is publication-shaped.
_dt=4.6s, tokens=436_
## gemma4:31b-it-q4_K_M -- puzzle 1
**Theme seed:** Words related to circles, timing, and subtle wordplay with 'Hands'
| Diff | Category | Words |
|---|---|---|
| yellow | Circular Objects | RING, COIN, WHEEL, DISC |
| green | Units of Time | SECOND, MINUTE, HOUR, DAY |
| blue | Ways to Assist | HELP, AID, BACK, SUPPORT |
| purple | Things with Hands | CLOCK, WATCH, DECK, PLAYER |
**Claimed traps (4):**
- `SECOND` (actually green, traps blue): To 'second' a motion is to support it.
- `WATCH` (actually purple, traps green): A watch is a device used to measure the units of time (second, minute, hour).
- `CLOCK` (actually purple, traps green): Similar to watch, it is intrinsically linked to the time units group.
- `BACK` (actually blue, traps yellow): Could be mistaken as a part of a circular object or a physical position.
**Grade: BORDERLINE PASS.** `DECK` in "Things with Hands" is shaky — decks have cards, players have hands; the phrase is "all hands on deck" (sailors), not "deck has hands." Other 3 purple members fit. Strong saving grace: `SECOND`→assist (second a motion), `WATCH`→time, `CLOCK`→time are 3 genuinely strong traps. Categories otherwise tight.
_dt=18.9s, tokens=439_
## gemma4:31b-it-q4_K_M -- puzzle 2
**Theme seed:** A mix of architectural elements, types of scales, and linguistic wordplay
| Diff | Category | Words |
|---|---|---|
| yellow | Parts of a Building | BEAM, COLUMN, EAVE, FOUNDATION |
| green | Types of Scales | LIBRA, MUSICAL, WEIGHING, TIMELINE |
| blue | Synonyms for Balance | EQUILIBRIUM, POISE, STEADY, STABILITY |
| purple | Words that start with a Greek Letter | ALPHABET, BETRAYAL, GAMUT, DELUXE |
**Claimed traps (3):**
- `LIBRA` (actually green, traps blue): Libra is the sign of the scales, which represents balance/equilibrium.
- `COLUMN` (actually yellow, traps green): A column can be a way to organize data on a scale or chart.
- `GAMUT` (actually purple, traps green): The 'gamut' refers to the full range or scale of something.
**Grade: FAIL.** Two broken categories. Green "Types of Scales" includes `LIBRA` (a zodiac sign symbolized by scales, not a type of scale) and `TIMELINE` (a line, not a scale). Purple "Words that start with a Greek Letter" holds only for `ALPHABET` (ALPHA): `BETRAYAL` starts with "BETR", not "BETA"; `GAMUT` starts with "GAMU", not "GAMMA"; and `DELUXE` starts with "DELU", not "DELTA". 3 of 4 purple words don't actually start with the complete Greek-letter name.
_dt=16.6s, tokens=410_
## gemma4:31b-it-q4_K_M -- puzzle 3
**Theme seed:** Words related to vision, optics, and ocular puns
| Diff | Category | Words |
|---|---|---|
| yellow | Parts of the Eye | CORNEA, PUPIL, IRIS, RETINA |
| green | Types of Lenses | CONCAVE, CONVEX, CONTACT, BIFOCAL |
| blue | Words meaning 'To Witness' | OBSERVE, SPOT, NOTE, VIEW |
| purple | ___ EYE | BULLS, BIRD, FISH, PRIVATE |
**Claimed traps (3):**
- `PUPIL` (actually yellow, traps blue): A pupil is a student, and one might 'note' or 'observe' a pupil in a classroom setting.
- `PRIVATE` (actually purple, traps blue): Private can be a synonym for 'secret' or 'hidden', contrasting with the act of observing or spotting.
- `CONTACT` (actually green, traps blue): To 'contact' someone can be confused with 'spotting' or 'viewing' them in a social context.
**Grade: PASS.** All 4 categories tight: parts of eye / lens types / witness verbs / `___EYE` compounds (bull's, bird's, fish, private). `PUPIL`→student is a real, NYT-style multi-meaning trap. Only 3 traps claimed (rubric asked for 2+, ✓). Closest puzzle to publication quality across both models.
_dt=17.3s, tokens=422_
## gemma4:31b-it-q4_K_M -- puzzle 4
**Theme seed:** Common words that can also be types of birds or musical terms
| Diff | Category | Words |
|---|---|---|
| yellow | Things that can be 'Cast' | SPELL, VOTE, SHADOW, NET |
| green | Musical Dynamics | PIANO, FORTE, SHARP, FLAT |
| blue | Types of Birds | CRANE, SWALLOW, SWIFT, LOON |
| purple | Words that mean 'to move quickly' | DART, BOLT, RUSH, FLY |
**Claimed traps (4):**
- `SWIFT` (actually blue, traps purple): Swift is a synonym for moving quickly, fitting the 'move quickly' group.
- `FLY` (actually purple, traps blue): Fly is a common bird-related term (or a type of insect), making it look like it belongs with Crane and Swallow.
- `SWALLOW` (actually blue, traps purple): While a bird, if the user interprets 'move quickly' as 'gulp down', it might confuse them, though primarily it serves as a bird lure.
- `SHARP` (actually green, traps purple): A 'sharp' turn or movement can be associated with quickness/darting.
**Grade: BORDERLINE PASS.** Green should be "Musical Terms", not "Musical Dynamics": `PIANO` and `FORTE` are dynamics (loudness), but `SHARP` and `FLAT` are accidentals (pitch modifiers). Pedantic, but a real category-label miss. Saving graces: `SWIFT` (a bird whose name also means fast) and `FLY` (a move-quickly verb that also reads as bird-related) are exactly the kind of multi-meaning anchors a real Connections puzzle would deploy.
_dt=19.9s, tokens=487_
## gemma4:31b-it-q4_K_M -- puzzle 5
**Theme seed:** Words related to types of 'scales' and musical terminology
| Diff | Category | Words |
|---|---|---|
| yellow | Musical Scales | MAJOR, MINOR, CHROMATIC, PENTATONIC |
| green | Things that Weigh | BALANCE, SPRING, DIGITAL, TRIPLE-BEAM |
| blue | Ways to Climb | ASCEND, MOUNT, SCALE, SURMOUNT |
| purple | ___ NOTE | FOOT, BANK, KEY, HIGH |
**Claimed traps (4):**
- `SCALE` (actually blue, traps yellow): It is the root word for musical scales, leading the player to look for other scale-related terms.
- `KEY` (actually purple, traps yellow): A 'key' is fundamentally linked to musical scales (e.g., the Key of C Major).
- `HIGH` (actually purple, traps blue): High is an adjective often associated with climbing or ascending.
- `BALANCE` (actually green, traps blue): Balance can be seen as a state of being when climbing or mountaineering.
**Grade: PASS.** The whole puzzle is built around `SCALE` carrying three meanings: musical scale (yellow), weighing scale (green's theme — though Gemma mislabels it "Things that Weigh" instead of "Types of Scales"), and "to climb" (blue, where SCALE-the-word lives). That is exactly the central-anchor pattern a real NYT Connections puzzle uses. `KEY`, `HIGH`, `BALANCE` traps all genuinely overlap. Categories slightly mislabeled but structure is publication-quality.
_dt=18.2s, tokens=453_
---
## Aggregate
| Model | Pass | Borderline | Fail | Avg s | Avg tok/s |
|---|---|---|---|---|---|
| `gemma4:26b` | 1 (#5) | 1 (#3) + 1 partial (#2) | 2 (#1, #4) | 5.2 | 94.3 |
| `gemma4:31b-it-q4_K_M` | 2 (#3, #5) | 2 (#1, #4) | 1 (#2) | 18.2 | 24.3 |
**31b is materially more reliable:** 2 clean passes vs 26b's 1, and only 1 hard fail vs 26b's 2 hard fails plus a partial fail. 31b is 3.5× slower per generation (18.2 s vs 5.2 s), but at 18 s for a once-per-day puzzle that is irrelevant. 26b is fast enough for interactive use but produced a broken puzzle in 3 of its 5 runs (2 hard fails plus 1 partial fail).
### Failure modes (in order of how often they recur)
1. **Structural violations** — duplicate or near-duplicate words on the 16-tile board, or a word listed in two groups. (#1-26b: `FRUIT` × 2; #4-26b: `NEST`/`NESTING`.) Catastrophic — a real Connections board has 16 *distinct* tiles. **Trivially detectable** with a deterministic post-filter.
2. **Broken category logic** — words placed in a category that don't actually fit. (#2-26b: `EYE`/`QUEUE` aren't body-part homophones; #4-26b: `NESTING` isn't a "Word before EGG"; #2-31b: `LIBRA`/`TIMELINE` aren't scales, `DELUXE` doesn't start with the full Greek letter "DELTA"; #1-31b: `DECK` doesn't have hands.) **Hard to detect deterministically** — needs a critique/judging pass.
3. **Redundant categories** — two groups themed on the same concept (#1-26b: yellow + green both fly-themed; #4-26b: yellow + purple both egg-themed). Detectable with a category-similarity check.
4. **Weak/circular trap reasoning** — Gemma's claimed "intended_traps" sometimes don't actually hold. (#3-26b: `PRESS` doesn't fit "Words after BLOOD" — the compound is *blood pressure*, not *blood press*.) Means **Gemma cannot reliably grade its own puzzles** — independent judging required.
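
Failure mode 1 is the one a deterministic post-filter can catch. A minimal sketch follows, assuming a puzzle shaped as `{"groups": [{"category": ..., "words": [...]}]}`; that schema and the function name are assumptions for illustration, not the bench's actual format. The near-duplicate test is a deliberately crude prefix heuristic.

```python
from collections import Counter


def structural_errors(puzzle: dict) -> list[str]:
    """Return human-readable structural problems; an empty list means pass."""
    errors = []
    groups = puzzle.get("groups", [])  # assumed schema, see lead-in
    if len(groups) != 4:
        errors.append(f"expected 4 groups, got {len(groups)}")
    for group in groups:
        if len(group.get("words", [])) != 4:
            errors.append(f"group {group.get('category')!r} does not have 4 words")

    # Exact duplicates: the same tile listed more than once on the board.
    words = [w.strip().upper() for g in groups for w in g.get("words", [])]
    for word, count in Counter(words).items():
        if count > 1:
            errors.append(f"{word} appears {count} times")

    # Near-duplicates: one tile (4+ letters) is a prefix of another tile.
    distinct = sorted(set(words))
    for i, shorter in enumerate(distinct):
        for longer in distinct[i + 1:]:
            if len(shorter) >= 4 and longer.startswith(shorter):
                errors.append(f"{shorter} / {longer} look like near-duplicates")
    return errors
```

On the boards above this flags `FRUIT` × 2 (#1-26b) and `NEST`/`NESTING` (#4-26b). The prefix heuristic would also flag `PRESS`/`PRESSURE` in #3-26b, which was graded borderline pass, so it is probably better treated as a warning for review than as an automatic reject.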
### Successes (when Gemma gets it right, what it does right)
- **Multi-meaning anchor words** — `SCALE` (3 meanings, #5-31b), `SWIFT`/`FLY` (bird + fast, #4-31b), `PUPIL` (eye + student, #3-31b), `TRACK` (3 meanings, #5-26b). When Gemma builds a puzzle around an anchor, it produces real Connections-grade overlap.
- **Compound-word categories** — "`___ EYE`" (#3-31b), "Side `___`" (#5-26b), "Words before EGG" (#4-26b — when Gemma doesn't poison it). These are the easiest pattern to get right.
- **Tight short labels** when Gemma sticks to well-known domains (parts of eye, types of lenses, types of records).
### Implication for design
**Generation is viable, but not unaided.** The shape of the data engine:
```
generate (gemma4:31b)
→ deterministic filter [check 16 distinct tiles, no dup words, all words appear in categories]
→ category-similarity check [reject puzzles with redundant themes]
→ critique pass [either gemma4:31b second pass, or qwen3-coder:30b as judge]
→ reject + regenerate if any fail; accept once filtered
→ cache as the day's puzzle
```
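
The reject-and-regenerate loop in that diagram could look roughly like this. It is a sketch under assumptions: `generate` and the entries in `checks` are hypothetical callables standing in for the model call, the deterministic filter, the category-similarity check, and the critique pass. None of those stages is implemented here, and caching is left out.

```python
from typing import Callable, Optional

Puzzle = dict
Check = Callable[[Puzzle], bool]


def daily_puzzle(
    generate: Callable[[], Optional[Puzzle]],  # hypothetical model call
    checks: list[Check],                       # ordered cheapest first
    max_generations: int = 10,
) -> Optional[Puzzle]:
    """Regenerate until one candidate passes every check, or give up."""
    for _ in range(max_generations):
        candidate = generate()
        if candidate is None:  # e.g. the response did not parse
            continue
        # all() short-circuits, so a candidate that fails a cheap check
        # never reaches the expensive checks later in the list.
        if all(check(candidate) for check in checks):
            return candidate
    return None
```

Ordering the checks from cheapest to most expensive means the GPU-bound critique only runs on candidates that already passed the deterministic filters.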
At ~18s/generation and a roughly 50% structural-pass rate, a daily puzzle costs an expected ~2 generations + 1 critique = maybe 1 minute of GPU time per day. Effectively free.
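
The arithmetic behind that estimate, as a small sketch. It assumes generation attempts are independent (so the expected number of attempts is `1 / pass_rate`) and that one critique pass costs about as much as one generation; the critique cost was not measured in this bakeoff.

```python
def expected_gpu_seconds(
    pass_rate: float,
    gen_seconds: float,
    critique_seconds: float,
    critiques: int = 1,
) -> float:
    """Expected GPU time to obtain one accepted puzzle."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    expected_generations = 1 / pass_rate  # mean of a geometric distribution
    return expected_generations * gen_seconds + critiques * critique_seconds


# 50% pass rate, 18.2 s per 31b generation, critique assumed to cost the same
print(round(expected_gpu_seconds(0.5, 18.2, 18.2), 1))
```

Under those assumptions the estimate comes to roughly 55 seconds, in line with the "maybe 1 minute" figure.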
**26b is unsuitable as the primary generator** — too many hard fails. It could plausibly be the *judging* model (cheaper, runs on every player guess) since judging is much easier than generating. But that decision is for the brainstorm.
### Risks not yet checked
- **Diversity over time.** All 10 puzzles produced here come from a single seed-less batch. If Gemma keeps reaching for the same themes (we saw "scales" twice on 31b alone), a daily stream running all year might get repetitive. Test this with seeded prompts before committing.
- **Connections-vs-Gemma blind anchor not run.** I deferred this — the structural failures in Gemma's output (duplicate words, broken categories) are so obviously curator-rejection-tier that no human-curated puzzle would have them, so the within-Gemma comparison was decisive on its own. Still, before final design, eyeball one Gemma-pass puzzle next to a real NYT puzzle and check whether it actually feels equivalent.
- **Two-pass critique not validated.** The proposal above assumes a critique pass would catch Gemma's category mistakes. That assumption has not been tested. The next experiment is "feed Gemma's broken puzzles back to Gemma (or to a different model) and see if it flags the broken categories."