seth_semantic_game/docs/reference/gemma-generation-bakeoff-2026-04-27-221751.md
Mortdecai 5a2a02e483 docs: bootstrap repo with bakeoff results and game-mechanics idea bank
This repo opens with the design-discovery work completed before any product
code is written. Two model bakeoffs against gemma4:8b/26b/31b on a local
Ollama established that:

- Whole-puzzle generation in the Connections shape is unreliable on Gemma 4
  (gemma4:31b ~50% structural-pass, gemma4:26b ~20-30%); 31b is intentionally
  out of project scope, so the generation route is harder still.
- Atomic semantic-judging skills are reliable: 87.5%/93.75%/100% (8B/26b/31b)
  on JUDGE; *all three models* scored 10/10 on CREATIVE_ACCEPT — fair judging
  of player-INVENTED categories. That is the structural unlock vs static
  hand-curated word games.

The README contains the full writeup, the test bench, and a brainstormed
bank of 10 distinct game-mechanics ideas across the fast/medium/slow tempo
range, plus a primitives table for recombination.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 23:09:46 -04:00

Gemma 4 Generation Bakeoff -- 2026-04-27-221751

Setup

  • Local Ollama on the test host (RTX 3090 Ti, 24 GB VRAM)
  • Other GPU workloads paused for the duration of the run
  • Models: gemma4:26b, gemma4:31b-it-q4_K_M
  • 5 puzzles per model, base temperature 0.8
  • Gemma 4 settings (per ~/bin/gemma4-research/GOTCHAS.md): think=false, num_ctx=8192, num_predict=4096. No format=json (infinite-loop bug). JSON extracted client-side via body[body.find('{'):body.rfind('}')+1].
  • Up to 3 attempts per puzzle with temperature bumped +0.1 each retry (AI_Visualizer pattern). Reported metrics use the successful attempt.
  • One-shot, unaided generation. No critique pass, no example puzzle in prompt.
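The client-side JSON extraction and the retry-with-temperature-bump loop described above can be sketched roughly as follows. This is a minimal sketch, not the bench's actual code: `call_model` is a hypothetical callable that takes a temperature and returns the raw response text (it would be where the Ollama request with think=false, num_ctx=8192, num_predict=4096 lives), and "successful attempt" is assumed here to mean "the first attempt whose body parses as JSON".

```python
import json
from typing import Callable, Optional, Tuple


def extract_json(body: str) -> Optional[dict]:
    """Slice from the first '{' to the last '}' and parse; None on failure."""
    start, end = body.find("{"), body.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(body[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def generate_with_retries(
    call_model: Callable[[float], str],  # hypothetical: temperature -> raw text
    base_temperature: float = 0.8,
    max_attempts: int = 3,
    bump: float = 0.1,
) -> Optional[Tuple[dict, int, float]]:
    """Return (puzzle, attempt_number, temperature) for the first parseable attempt."""
    for attempt in range(max_attempts):
        # Attempt 0 uses the base temperature; each retry adds one bump.
        temperature = round(base_temperature + bump * attempt, 2)
        puzzle = extract_json(call_model(temperature))
        if puzzle is not None:
            return puzzle, attempt + 1, temperature
    return None
```

Slicing between the outermost braces tolerates chatter before or after the object, but it will still fail if the model emits two separate objects or a stray brace in trailing prose, which is why the parse is wrapped in a try/except rather than trusted.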

Timing

| Model | n | Avg time (s) | Avg tokens | Tokens/s |
|---|---|---|---|---|
| gemma4:26b | 5 | 5.2 | 489 | 94.3 |
| gemma4:31b-it-q4_K_M | 5 | 18.2 | 442 | 24.3 |

JSON parse rate

  • gemma4:26b: 5/5 parsed cleanly
  • gemma4:31b-it-q4_K_M: 5/5 parsed cleanly

gemma4:26b -- puzzle 1

Theme seed: An exploration of common words that can be paired with 'BIRD' or 'DRAGON' or relate to types of 'FLY'.

| Difficulty | Category | Words |
|---|---|---|
| yellow | Types of Flies | FRUIT, HOUSE, FIRE, BLOW |
| green | Words before 'FLY' | DRAGON, MAY, BUTTER, BOTTLE |
| blue | Types of Birds | LARK, SWALLOW, CRANE, SWIFT |
| purple | Words that follow 'DRAGON' | FRUIT, FLY, SLAYER, FLAME |

Claimed traps (4):

  • FRUIT (actually purple, traps yellow): Fruit fly is a very common type of fly, making it a primary red herring for the yellow group.
  • FIRE (actually yellow, traps purple): Fire dragon or dragon flame connection could lead someone to look for a dragon-themed group.
  • DRAGON (actually green, traps blue): Dragon is a mythical creature, and users might look for a 'mythical creatures' group including Crane or Swallow (which are real birds).
  • FLY (actually purple, traps green): Dragonfly is a common word, tempting the player to put 'FLY' in the 'Words before FLY' group.

Grade: FAIL. FRUIT appears in two groups (yellow Types of Flies + purple Words that follow DRAGON) — that's only 15 distinct tiles, not the required 16. Categories are also redundant: yellow ("Types of Flies") and green ("Words before FLY") are both fly-themed. A human curator would never ship this.

dt=5.7s, tokens=500

gemma4:26b -- puzzle 2

Theme seed: Homophones and linguistic overlaps between nature and everyday objects.

| Difficulty | Category | Words |
|---|---|---|
| yellow | Types of Trees | BIRCH, CEDAR, MAPLE, PINE |
| green | Things that can be 'Dropped' | BEAT, HINT, LINE, SUBJECT |
| blue | Units of Measurement | FOOT, GRAM, SECOND, STONE |
| purple | Homophones of Body Parts | EYE, HARE, MUSSEL, QUEUE |

Claimed traps (4):

  • PINE (actually yellow, traps green): To 'pine' for someone is a feeling, but 'pine' could be mistaken for something you can 'drop' (like a pine needle).
  • FOOT (actually blue, traps yellow): A 'foot' is a unit of measurement, but it is also a body part, potentially leading to the 'Body Part Homophones' group.
  • BEAT (actually green, traps blue): A 'beat' is a rhythm, but it could be mistaken for a unit of time/measurement in a musical context.
  • EYE (actually purple, traps blue): The word 'eye' is a body part, which might tempt players looking for a body part category, but it is actually a homophone (I) for the purple group.

Grade: PARTIAL FAIL. Yellow/green/blue all tight. Purple "Homophones of Body Parts" is half-broken: HARE→hair ✓, MUSSEL→muscle ✓, but EYE is a body part (not a homophone of one) and QUEUE→cue is not a body-part homophone. 2/4 wrong on the trickiest group is a curator-rejection. Trap reasoning is weak (claims EYE traps blue Units — body parts aren't units).

dt=5.3s, tokens=510

gemma4:26b -- puzzle 3

Theme seed: The concept of 'Pressure' and things that can be measured or applied.

| Difficulty | Category | Words |
|---|---|---|
| yellow | Units of Pressure | BAR, PASCAL, PSI, TORR |
| green | Words after 'BLOOD' | PRESSURE, TYPE, ORANGE, DONOR |
| blue | Things that can be 'Broken' | HEART, RECORD, LAW, PROMISE |
| purple | Words that are also Verbs for 'Applying Force' | PRESS, STAMP, PUNCH, DRIVE |

Claimed traps (4):

  • PRESSURE (actually green, traps yellow): Pressure is the concept being measured by the units in the yellow group.
  • BAR (actually yellow, traps blue): A 'bar' is something that can be 'broken' (like a chocolate bar) or refers to a legal setting.
  • PRESS (actually purple, traps green): The 'Press' is a common term in news/media, and 'Press' can be related to 'Blood' in a medical context (blood pressure).
  • RECORD (actually blue, traps purple): A 'record' is something you can 'stamp' or 'drive' (as in a driving record).

Grade: BORDERLINE PASS. Categories tight, 16 distinct words, ✓. BAR (yellow → blue trap) is a real overlap (chocolate bar, broken bar). PRESS claimed to trap green ("Words after BLOOD") — but blood pressure is the compound, not blood press, so the claimed trap doesn't actually hold. RECORD→stamp/drive is a stretch. 1/4 of Gemma's self-graded traps is broken; remainder OK.

dt=5.1s, tokens=492

gemma4:26b -- puzzle 4

Theme seed: A collection of words related to things that can be 'broken' or 'cracked', mixed with other categories like cooking and music.

| Difficulty | Category | Words |
|---|---|---|
| yellow | Kinds of Eggs | CHICKEN, QUAIL, GOOSE, DUCK |
| green | Things You Can Crack | NUT, CODE, WHIP, SMILE |
| blue | Musical Terms | NOTE, CHORD, CLEF, BEAT |
| purple | Words Before 'Egg' | NEST, EASTER, BAD, NESTING |

Claimed traps (4):

  • DUCK (actually yellow, traps green): DUCK is an action related to avoiding a whip or a blow, and can be paired with 'crack' in some linguistic contexts.
  • CHICKEN (actually yellow, traps purple): CHICKEN is an egg-related word that could tempt someone looking for '___ Egg' phrases (like Chicken Egg).
  • BEAT (actually blue, traps green): BEAT is something you can do to a whip or a person, fitting the 'Things you can crack/strike' theme.
  • NEST (actually purple, traps yellow): NEST is intrinsically linked to the birds in the yellow group (CHICKEN, GOOSE, etc.).

Grade: FAIL. Purple has NESTING as "Words Before EGG" — nesting egg isn't a phrase (nest egg is). Worse, NEST and NESTING are near-duplicate tiles. Yellow "Kinds of Eggs" and purple "Words Before EGG" are also redundant (chicken egg vs nest egg — both compound forms). Curator reject.

dt=5.3s, tokens=506

gemma4:26b -- puzzle 5

Theme seed: A collection of words centered around various types of 'records' and 'tracks'.

| Difficulty | Category | Words |
|---|---|---|
| yellow | Types of Records | LP, EP, Single, Album |
| green | Things that have Tracks | Train, Athlete, Song, Mud |
| blue | Words for 'Evidence' | Clue, Trace, Mark, Proof |
| purple | Words that follow 'Side' | Effect, Kick, Split, Walk |

Claimed traps (4):

  • Single (actually yellow, traps blue): A 'single' can be a piece of evidence or a solitary item.
  • Song (actually green, traps yellow): A song is often released as an LP, EP, or Single.
  • Trace (actually blue, traps green): A trace can be a physical track left behind in mud.
  • Split (actually purple, traps green): A 'split' can refer to a track or division in a race.

Grade: PASS. Cleanest 26b output. 16 distinct words; categories tight; the word "track" carries three different meanings across green (railroad/running/album), with SONG→Records and TRACE→Tracks as real overlaps. That's the multi-meaning anchor pattern Connections actually uses. SINGLE and SPLIT traps are weak, but the puzzle structure is publication-shaped.

dt=4.6s, tokens=436

gemma4:31b-it-q4_K_M -- puzzle 1

Theme seed: Words related to circles, timing, and subtle wordplay with 'Hands'

| Difficulty | Category | Words |
|---|---|---|
| yellow | Circular Objects | RING, COIN, WHEEL, DISC |
| green | Units of Time | SECOND, MINUTE, HOUR, DAY |
| blue | Ways to Assist | HELP, AID, BACK, SUPPORT |
| purple | Things with Hands | CLOCK, WATCH, DECK, PLAYER |

Claimed traps (4):

  • SECOND (actually green, traps blue): To 'second' a motion is to support it.
  • WATCH (actually purple, traps green): A watch is a device used to measure the units of time (second, minute, hour).
  • CLOCK (actually purple, traps green): Similar to watch, it is intrinsically linked to the time units group.
  • BACK (actually blue, traps yellow): Could be mistaken as a part of a circular object or a physical position.

Grade: BORDERLINE PASS. DECK in "Things with Hands" is shaky — decks have cards, players have hands; the phrase is "all hands on deck" (sailors), not "deck has hands." Other 3 purple members fit. Strong saving grace: SECOND→assist (second a motion), WATCH→time, CLOCK→time are 3 genuinely strong traps. Categories otherwise tight.

dt=18.9s, tokens=439

gemma4:31b-it-q4_K_M -- puzzle 2

Theme seed: A mix of architectural elements, types of scales, and linguistic wordplay

| Difficulty | Category | Words |
|---|---|---|
| yellow | Parts of a Building | BEAM, COLUMN, EAVE, FOUNDATION |
| green | Types of Scales | LIBRA, MUSICAL, WEIGHING, TIMELINE |
| blue | Synonyms for Balance | EQUILIBRIUM, POISE, STEADY, STABILITY |
| purple | Words that start with a Greek Letter | ALPHABET, BETRAYAL, GAMUT, DELUXE |

Claimed traps (3):

  • LIBRA (actually green, traps blue): Libra is the sign of the scales, which represents balance/equilibrium.
  • COLUMN (actually yellow, traps green): A column can be a way to organize data on a scale or chart.
  • GAMUT (actually purple, traps green): The 'gamut' refers to the full range or scale of something.

Grade: FAIL. Two broken categories. Green "Types of Scales" includes LIBRA (a zodiac sign, not a scale type) and TIMELINE (a line, not a scale). Purple "Words that start with a Greek Letter" holds only for ALPHABET (ALPHA). BETRAYAL starts with "BET", not "BETA"; GAMUT starts with "GAM", not "GAMMA"; DELUXE starts with "DEL", not "DELTA". 3 of 4 purple words don't actually start with the complete Greek-letter name.

dt=16.6s, tokens=410

gemma4:31b-it-q4_K_M -- puzzle 3

Theme seed: Words related to vision, optics, and ocular puns

| Difficulty | Category | Words |
|---|---|---|
| yellow | Parts of the Eye | CORNEA, PUPIL, IRIS, RETINA |
| green | Types of Lenses | CONCAVE, CONVEX, CONTACT, BIFOCAL |
| blue | Words meaning 'To Witness' | OBSERVE, SPOT, NOTE, VIEW |
| purple | ___ EYE | BULLS, BIRD, FISH, PRIVATE |

Claimed traps (3):

  • PUPIL (actually yellow, traps blue): A pupil is a student, and one might 'note' or 'observe' a pupil in a classroom setting.
  • PRIVATE (actually purple, traps blue): Private can be a synonym for 'secret' or 'hidden', contrasting with the act of observing or spotting.
  • CONTACT (actually green, traps blue): To 'contact' someone can be confused with 'spotting' or 'viewing' them in a social context.

Grade: PASS. All 4 categories tight: parts of eye / lens types / witness verbs / ___EYE compounds (bull's, bird's, fish, private). PUPIL→student is a real, NYT-style multi-meaning trap. Only 3 traps claimed (rubric asked for 2+, ✓). Closest puzzle to publication quality across both models.

dt=17.3s, tokens=422

gemma4:31b-it-q4_K_M -- puzzle 4

Theme seed: Common words that can also be types of birds or musical terms

| Difficulty | Category | Words |
|---|---|---|
| yellow | Things that can be 'Cast' | SPELL, VOTE, SHADOW, NET |
| green | Musical Dynamics | PIANO, FORTE, SHARP, FLAT |
| blue | Types of Birds | CRANE, SWALLOW, SWIFT, LOON |
| purple | Words that mean 'to move quickly' | DART, BOLT, RUSH, FLY |

Claimed traps (4):

  • SWIFT (actually blue, traps purple): Swift is a synonym for moving quickly, fitting the 'move quickly' group.
  • FLY (actually purple, traps blue): Fly is a common bird-related term (or a type of insect), making it look like it belongs with Crane and Swallow.
  • SWALLOW (actually blue, traps purple): While a bird, if the user interprets 'move quickly' as 'gulp down', it might confuse them, though primarily it serves as a bird lure.
  • SHARP (actually green, traps purple): A 'sharp' turn or movement can be associated with quickness/darting.

Grade: BORDERLINE PASS. Green should be "Musical Terms", not "Dynamics": SHARP and FLAT are accidentals (pitch modifiers), not dynamics (loudness). Pedantic, but a real category-label miss. Saving graces: SWIFT (a bird whose name also means fast) and FLY (a move-quickly verb that is also what birds do) are exactly the kind of multi-meaning anchors a real Connections puzzle would deploy.

dt=19.9s, tokens=487

gemma4:31b-it-q4_K_M -- puzzle 5

Theme seed: Words related to types of 'scales' and musical terminology

| Difficulty | Category | Words |
|---|---|---|
| yellow | Musical Scales | MAJOR, MINOR, CHROMATIC, PENTATONIC |
| green | Things that Weigh | BALANCE, SPRING, DIGITAL, TRIPLE-BEAM |
| blue | Ways to Climb | ASCEND, MOUNT, SCALE, SURMOUNT |
| purple | ___ NOTE | FOOT, BANK, KEY, HIGH |

Claimed traps (4):

  • SCALE (actually blue, traps yellow): It is the root word for musical scales, leading the player to look for other scale-related terms.
  • KEY (actually purple, traps yellow): A 'key' is fundamentally linked to musical scales (e.g., the Key of C Major).
  • HIGH (actually purple, traps blue): High is an adjective often associated with climbing or ascending.
  • BALANCE (actually green, traps blue): Balance can be seen as a state of being when climbing or mountaineering.

Grade: PASS. The whole puzzle is built around SCALE carrying three meanings: musical scale (yellow), weighing scale (green's theme — though Gemma mislabels it "Things that Weigh" instead of "Types of Scales"), and "to climb" (blue, where SCALE-the-word lives). That is exactly the central-anchor pattern a real NYT Connections puzzle uses. KEY, HIGH, BALANCE traps all genuinely overlap. Categories slightly mislabeled but structure is publication-quality.

dt=18.2s, tokens=453


Aggregate

| Model | Pass | Borderline | Fail | Avg time (s) | Avg tokens/s |
|---|---|---|---|---|---|
| gemma4:26b | 1 (#5) | 1 (#3) + 1 partial (#2) | 2 (#1, #4) | 5.2 | 94.3 |
| gemma4:31b-it-q4_K_M | 2 (#3, #5) | 2 (#1, #4) | 1 (#2) | 18.2 | 24.3 |

31b is materially more reliable — 2 clean passes vs 26b's 1, and only 1 hard fail vs 26b's 2 hard fails plus a partial-fail. 31b is 3.5× slower per generation but at 18s for a once-per-day puzzle, that's irrelevant. 26b is fast enough for interactive use but produces broken puzzles half the time.

Failure modes (in order of how often they recur)

  1. Structural violations — duplicate or near-duplicate words on the 16-tile board, or a word listed in two groups. (#1-26b: FRUIT × 2; #4-26b: NEST/NESTING.) Catastrophic — a real Connections board has 16 distinct tiles. Trivially detectable with a deterministic post-filter.
  2. Broken category logic — words placed in a category that don't actually fit. (#2-26b: EYE/QUEUE aren't body-part homophones; #4-26b: NESTING isn't a "Word before EGG"; #2-31b: LIBRA/TIMELINE aren't scales, DELUXE doesn't start with the full Greek letter "DELTA"; #1-31b: DECK doesn't have hands.) Hard to detect deterministically — needs a critique/judging pass.
  3. Redundant categories — two groups themed on the same concept (#1-26b: yellow + green both fly-themed; #4-26b: yellow + purple both egg-themed). Detectable with a category-similarity check.
  4. Weak/circular trap reasoning — Gemma's claimed "intended_traps" sometimes don't actually hold. (#3-26b: PRESS doesn't fit "Words after BLOOD" — the compound is blood pressure, not blood press.) Means Gemma cannot reliably grade its own puzzles — independent judging required.
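Failure mode 1 is the one the list calls trivially detectable, so a deterministic post-filter for it might look like the sketch below. The input shape (a dict mapping each difficulty colour to its four words) and the near-duplicate rule (one tile is a prefix of another and at least 4 letters long) are assumptions made for illustration, not the bench's actual schema or rule.

```python
from typing import Dict, List


def structural_problems(groups: Dict[str, List[str]]) -> List[str]:
    """Return human-readable problems; an empty list means the board passes."""
    problems = []
    if len(groups) != 4:
        problems.append(f"expected 4 groups, got {len(groups)}")
    seen = {}  # normalized word -> group where it first appeared
    for name, words in groups.items():
        if len(words) != 4:
            problems.append(f"{name}: expected 4 words, got {len(words)}")
        for word in words:
            key = word.strip().upper()
            if key in seen:
                problems.append(f"{key} appears in both {seen[key]} and {name}")
            else:
                seen[key] = name
    # Near-duplicate heuristic (an assumption): a tile of 4+ letters that is
    # a prefix of another tile, e.g. NEST / NESTING.
    tiles = sorted(seen)
    for i, a in enumerate(tiles):
        for b in tiles[i + 1:]:
            if len(a) >= 4 and b.startswith(a):
                problems.append(f"near-duplicate tiles: {a} / {b}")
    return problems
```

Run against the boards above, this flags FRUIT in 26b puzzle 1 and NEST / NESTING in 26b puzzle 4. It would also flag PRESS / PRESSURE in 26b puzzle 3, which the manual grading let through, so the prefix rule and its length threshold are a judgment call rather than a settled definition of "near-duplicate".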

Successes (when Gemma gets it right, what it does right)

  • Multi-meaning anchor words: SCALE (3 meanings, #5-31b), SWIFT/FLY (bird + fast, #4-31b), PUPIL (eye + student, #3-31b), TRACK (3 meanings, #5-26b). When Gemma builds a puzzle around an anchor, it produces real Connections-grade overlap.
  • Compound-word categories — "___ EYE" (#3-31b), "Side ___" (#5-26b), "Words before EGG" (#4-26b — when Gemma doesn't poison it). These are the easiest pattern to get right.
  • Tight short labels when Gemma sticks to well-known domains (parts of eye, types of lenses, types of records).

Implication for design

Generation is viable, but not unaided. The shape of the data engine:

generate (gemma4:31b)
  → deterministic filter [check 16 distinct tiles, no dup words, all words appear in categories]
  → category-similarity check [reject puzzles with redundant themes]
  → critique pass [either gemma4:31b second pass, or qwen3-coder:30b as judge]
  → reject + regenerate if any fail; accept once filtered
  → cache as the day's puzzle
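
The control flow of that pipeline can be sketched as a reject-and-regenerate loop. This is an illustrative skeleton under assumptions: `generate` and the entries of `checks` are hypothetical callables standing in for the model call, the deterministic filter, the category-similarity check, and the critique pass, and `max_generations` is an invented safety cap that the design above does not specify.

```python
from typing import Callable, List, Optional

Puzzle = dict
Check = Callable[[Puzzle], bool]  # returns True when the puzzle passes


def build_daily_puzzle(
    generate: Callable[[], Puzzle],
    checks: List[Check],
    max_generations: int = 10,
) -> Optional[Puzzle]:
    """Regenerate until one candidate passes every check, in list order."""
    for _ in range(max_generations):
        candidate = generate()
        # all() short-circuits, so a later (expensive) check such as the
        # critique pass only runs on candidates that passed the earlier ones.
        if all(check(candidate) for check in checks):
            return candidate
    return None  # caller decides what to do when nothing passes
```

Ordering the checks cheapest-first means the critique model is only invoked on boards that already cleared the deterministic stages; the returned puzzle is what would then be cached for the day.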

At ~18s/generation and a roughly 50% structural-pass rate, a daily puzzle costs an expected ~2 generations + 1 critique = maybe 1 minute of GPU time per day. Effectively free.
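
The arithmetic behind that estimate can be written out as a short sketch. It assumes each generation passes independently with the same probability, so the number of generations needed follows a geometric distribution with mean 1/p, and it assumes a critique pass costs about as much as one generation; neither assumption has been measured here.

```python
def expected_daily_gpu_seconds(
    pass_rate: float,
    generation_seconds: float,
    critique_seconds: float,
    critiques: int = 1,
) -> float:
    """Expected GPU seconds to obtain one accepted puzzle."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    # Mean number of tries until the first success: 1 / p.
    expected_generations = 1.0 / pass_rate
    return expected_generations * generation_seconds + critiques * critique_seconds


# 50% pass rate, 18.2 s per generation, one critique of similar cost.
estimate = expected_daily_gpu_seconds(0.5, 18.2, 18.2)
```

With those inputs the estimate comes to roughly 55 seconds, in line with the "maybe 1 minute" figure. If every surviving candidate gets its own critique rather than only the final one, the critique term grows accordingly.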

26b is unsuitable as the primary generator — too many hard fails. It could plausibly be the judging model (cheaper, runs on every player guess) since judging is much easier than generating. But that decision is for the brainstorm.

Risks not yet checked

  • Diversity over time. All 10 puzzles produced here come from a single seed-less batch. If Gemma keeps reaching for the same themes (we saw "scales" twice on 31b alone), a daily stream running year-round might get repetitive. Test this with seeded prompts before committing.
  • Connections-vs-Gemma blind anchor not run. I deferred this — the structural failures in Gemma's output (duplicate words, broken categories) are so obviously curator-rejection-tier that no human-curated puzzle would have them, so the within-Gemma comparison was decisive on its own. Still, before final design, eyeball one Gemma-pass puzzle next to a real NYT puzzle and check whether it actually feels equivalent.
  • Two-pass critique not validated. The proposal above assumes a critique pass would catch Gemma's category mistakes. That assumption has not been tested. The next experiment is "feed Gemma's broken puzzles back to Gemma (or to a different model) and see if it flags the structural issues."
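
One way to frame the two-pass critique experiment is to score a judge against the manual grades in this report. The sketch below is a hypothetical harness: `judge` is an assumed callable that returns True when it flags a puzzle as broken (in practice it would wrap a model call), and the `labeled` records with `puzzle` and `broken` keys are an invented format for pairing each board with its human grade.

```python
from typing import Callable, Dict, List


def critique_scores(
    labeled: List[Dict],
    judge: Callable[[Dict], bool],
) -> Dict[str, float]:
    """Measure how often the judge flags broken boards and clean boards."""
    broken = [item for item in labeled if item["broken"]]
    clean = [item for item in labeled if not item["broken"]]
    caught = sum(1 for item in broken if judge(item["puzzle"]))
    false_alarms = sum(1 for item in clean if judge(item["puzzle"]))
    return {
        # Share of human-rejected boards the judge also rejects.
        "recall": caught / len(broken) if broken else 0.0,
        # Share of human-accepted boards the judge wrongly rejects.
        "false_alarm_rate": false_alarms / len(clean) if clean else 0.0,
    }
```

Recall matters most for the pipeline's safety, since a missed broken board ships; the false-alarm rate matters for cost, since every wrongly rejected board triggers another generation.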