This repo opens with the design-discovery work completed before any product code is written. Two model bakeoffs against gemma4:8b/26b/31b on a local Ollama established that:
- Whole-puzzle generation in the Connections shape is unreliable on Gemma 4 (gemma4:31b ~50% structural-pass, gemma4:26b ~20-30%); 31b is intentionally out of project scope, so the generation route is harder still.
- Atomic semantic-judging skills are reliable: 87.5%/93.75%/100% (8b/26b/31b) on JUDGE; *all three models* scored 10/10 on CREATIVE_ACCEPT, that is, fair judging of player-INVENTED categories. That is the structural unlock vs static hand-curated word games.

The README contains the full writeup, the test bench, and a brainstormed bank of 10 distinct game-mechanics ideas across the fast/medium/slow tempo range, plus a primitives table for recombination.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Gemma 4 Generation Bakeoff -- 2026-04-27-221751
Setup
- Local Ollama on the test host (RTX 3090 Ti, 24 GB VRAM)
- Other GPU workloads paused for the duration of the run
- Models: gemma4:26b, gemma4:31b-it-q4_K_M
- 5 puzzles per model, base temperature 0.8
- Gemma 4 settings (per ~/bin/gemma4-research/GOTCHAS.md): think=false, num_ctx=8192, num_predict=4096. No format=json (infinite-loop bug). JSON extracted client-side via body[body.find('{'):body.rfind('}')+1].
- Up to 3 attempts per puzzle with temperature bumped +0.1 each retry (AI_Visualizer pattern). Reported metrics use the successful attempt.
- One-shot, unaided generation. No critique pass, no example puzzle in prompt.
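The extraction and retry rules in the setup list can be sketched as follows. This is a minimal illustration, not the bench's actual code: `call_model` is a hypothetical callable that takes a temperature and returns the raw response text, so the sketch stays independent of any particular Ollama client.

```python
import json
from typing import Callable, Optional


def extract_json(body: str) -> Optional[dict]:
    """Slice from the first '{' to the last '}' and parse it; None on failure."""
    start, end = body.find("{"), body.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(body[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def generate_with_retries(
    call_model: Callable[[float], str],  # hypothetical: temperature -> raw text
    base_temperature: float = 0.8,
    max_attempts: int = 3,
    step: float = 0.1,
) -> Optional[tuple]:
    """Return (puzzle, attempt_number, temperature) for the first parseable reply."""
    for attempt in range(max_attempts):
        # Bump the temperature by `step` on every retry.
        temperature = round(base_temperature + step * attempt, 2)
        puzzle = extract_json(call_model(temperature))
        if puzzle is not None:
            return puzzle, attempt + 1, temperature
    return None  # every attempt failed to yield a JSON object
```

Returning the attempt number and temperature alongside the puzzle makes it possible to report metrics for the successful attempt only, as the setup describes.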
Timing
| Model | n | avg s | avg tokens | tok/s |
|---|---|---|---|---|
| gemma4:26b | 5 | 5.2 | 489 | 94.3 |
| gemma4:31b-it-q4_K_M | 5 | 18.2 | 442 | 24.3 |
JSON parse rate
- gemma4:26b: 5/5 parsed cleanly
- gemma4:31b-it-q4_K_M: 5/5 parsed cleanly
gemma4:26b -- puzzle 1
Theme seed: An exploration of common words that can be paired with 'BIRD' or 'DRAGON' or relate to types of 'FLY'.
| Diff | Category | Words |
|---|---|---|
| yellow | Types of Flies | FRUIT, HOUSE, FIRE, BLOW |
| green | Words before 'FLY' | DRAGON, MAY, BUTTER, BOTTLE |
| blue | Types of Birds | LARK, SWALLOW, CRANE, SWIFT |
| purple | Words that follow 'DRAGON' | FRUIT, FLY, SLAYER, FLAME |
Claimed traps (4):
- FRUIT (actually purple, traps yellow): Fruit fly is a very common type of fly, making it a primary red herring for the yellow group.
- FIRE (actually yellow, traps purple): Fire dragon or dragon flame connection could lead someone to look for a dragon-themed group.
- DRAGON (actually green, traps blue): Dragon is a mythical creature, and users might look for a 'mythical creatures' group including Crane or Swallow (which are real birds).
- FLY (actually purple, traps green): Dragonfly is a common word, tempting the player to put 'FLY' in the 'Words before FLY' group.
Grade: FAIL. FRUIT appears in two groups (yellow Types of Flies + purple Words that follow DRAGON) — that's only 15 distinct tiles, not the required 16. Categories are also redundant: yellow ("Types of Flies") and green ("Words before FLY") are both fly-themed. A human curator would never ship this.
dt=5.7s, tokens=500
gemma4:26b -- puzzle 2
Theme seed: Homophones and linguistic overlaps between nature and everyday objects.
| Diff | Category | Words |
|---|---|---|
| yellow | Types of Trees | BIRCH, CEDAR, MAPLE, PINE |
| green | Things that can be 'Dropped' | BEAT, HINT, LINE, SUBJECT |
| blue | Units of Measurement | FOOT, GRAM, SECOND, STONE |
| purple | Homophones of Body Parts | EYE, HARE, MUSSEL, QUEUE |
Claimed traps (4):
- PINE (actually yellow, traps green): To 'pine' for someone is a feeling, but 'pine' could be mistaken for something you can 'drop' (like a pine needle).
- FOOT (actually blue, traps yellow): A 'foot' is a unit of measurement, but it is also a body part, potentially leading to the 'Body Part Homophones' group.
- BEAT (actually green, traps blue): A 'beat' is a rhythm, but it could be mistaken for a unit of time/measurement in a musical context.
- EYE (actually purple, traps blue): The word 'eye' is a body part, which might tempt players looking for a body part category, but it is actually a homophone (I) for the purple group.
Grade: PARTIAL FAIL. Yellow/green/blue all tight. Purple "Homophones of Body Parts" is half-broken: HARE→hair ✓, MUSSEL→muscle ✓, but EYE is a body part (not a homophone of one) and QUEUE→cue is not a body-part homophone. 2/4 wrong on the trickiest group is a curator-rejection. Trap reasoning is weak (claims EYE traps blue Units — body parts aren't units).
dt=5.3s, tokens=510
gemma4:26b -- puzzle 3
Theme seed: The concept of 'Pressure' and things that can be measured or applied.
| Diff | Category | Words |
|---|---|---|
| yellow | Units of Pressure | BAR, PASCAL, PSI, TORR |
| green | Words after 'BLOOD' | PRESSURE, TYPE, ORANGE, DONOR |
| blue | Things that can be 'Broken' | HEART, RECORD, LAW, PROMISE |
| purple | Words that are also Verbs for 'Applying Force' | PRESS, STAMP, PUNCH, DRIVE |
Claimed traps (4):
- PRESSURE (actually green, traps yellow): Pressure is the concept being measured by the units in the yellow group.
- BAR (actually yellow, traps blue): A 'bar' is something that can be 'broken' (like a chocolate bar) or refers to a legal setting.
- PRESS (actually purple, traps green): The 'Press' is a common term in news/media, and 'Press' can be related to 'Blood' in a medical context (blood pressure).
- RECORD (actually blue, traps purple): A 'record' is something you can 'stamp' or 'drive' (as in a driving record).
Grade: BORDERLINE PASS. Categories tight, 16 distinct words, ✓. BAR (yellow → blue trap) is a real overlap (chocolate bar, broken bar). PRESS claimed to trap green ("Words after BLOOD") — but blood pressure is the compound, not blood press, so the claimed trap doesn't actually hold. RECORD→stamp/drive is a stretch. 1/4 of Gemma's self-graded traps is broken; remainder OK.
dt=5.1s, tokens=492
gemma4:26b -- puzzle 4
Theme seed: A collection of words related to things that can be 'broken' or 'cracked', mixed with other categories like cooking and music.
| Diff | Category | Words |
|---|---|---|
| yellow | Kinds of Eggs | CHICKEN, QUAIL, GOOSE, DUCK |
| green | Things You Can Crack | NUT, CODE, WHIP, SMILE |
| blue | Musical Terms | NOTE, CHORD, CLEF, BEAT |
| purple | Words Before 'Egg' | NEST, EASTER, BAD, NESTING |
Claimed traps (4):
- DUCK (actually yellow, traps green): DUCK is an action related to avoiding a whip or a blow, and can be paired with 'crack' in some linguistic contexts.
- CHICKEN (actually yellow, traps purple): CHICKEN is an egg-related word that could tempt someone looking for '___ Egg' phrases (like Chicken Egg).
- BEAT (actually blue, traps green): BEAT is something you can do to a whip or a person, fitting the 'Things you can crack/strike' theme.
- NEST (actually purple, traps yellow): NEST is intrinsically linked to the birds in the yellow group (CHICKEN, GOOSE, etc.).
Grade: FAIL. Purple has NESTING as "Words Before EGG" — nesting egg isn't a phrase (nest egg is). Worse, NEST and NESTING are near-duplicate tiles. Yellow "Kinds of Eggs" and purple "Words Before EGG" are also redundant (chicken egg vs nest egg — both compound forms). Curator reject.
dt=5.3s, tokens=506
gemma4:26b -- puzzle 5
Theme seed: A collection of words centered around various types of 'records' and 'tracks'.
| Diff | Category | Words |
|---|---|---|
| yellow | Types of Records | LP, EP, Single, Album |
| green | Things that have Tracks | Train, Athlete, Song, Mud |
| blue | Words for 'Evidence' | Clue, Trace, Mark, Proof |
| purple | Words that follow 'Side' | Effect, Kick, Split, Walk |
Claimed traps (4):
- Single (actually yellow, traps blue): A 'single' can be a piece of evidence or a solitary item.
- Song (actually green, traps yellow): A song is often released as an LP, EP, or Single.
- Trace (actually blue, traps green): A trace can be a physical track left behind in mud.
- Split (actually purple, traps green): A 'split' can refer to a track or division in a race.
Grade: PASS. Cleanest 26b output. 16 distinct words; categories tight; the word "track" carries three different meanings across green (railroad/running/album), with SONG→Records and TRACE→Tracks as real overlaps. That's the multi-meaning anchor pattern Connections actually uses. SINGLE and SPLIT traps are weak, but the puzzle structure is publication-shaped.
dt=4.6s, tokens=436
gemma4:31b-it-q4_K_M -- puzzle 1
Theme seed: Words related to circles, timing, and subtle wordplay with 'Hands'
| Diff | Category | Words |
|---|---|---|
| yellow | Circular Objects | RING, COIN, WHEEL, DISC |
| green | Units of Time | SECOND, MINUTE, HOUR, DAY |
| blue | Ways to Assist | HELP, AID, BACK, SUPPORT |
| purple | Things with Hands | CLOCK, WATCH, DECK, PLAYER |
Claimed traps (4):
- SECOND (actually green, traps blue): To 'second' a motion is to support it.
- WATCH (actually purple, traps green): A watch is a device used to measure the units of time (second, minute, hour).
- CLOCK (actually purple, traps green): Similar to watch, it is intrinsically linked to the time units group.
- BACK (actually blue, traps yellow): Could be mistaken as a part of a circular object or a physical position.
Grade: BORDERLINE PASS. DECK in "Things with Hands" is shaky — decks have cards, players have hands; the phrase is "all hands on deck" (sailors), not "deck has hands." Other 3 purple members fit. Strong saving grace: SECOND→assist (second a motion), WATCH→time, CLOCK→time are 3 genuinely strong traps. Categories otherwise tight.
dt=18.9s, tokens=439
gemma4:31b-it-q4_K_M -- puzzle 2
Theme seed: A mix of architectural elements, types of scales, and linguistic wordplay
| Diff | Category | Words |
|---|---|---|
| yellow | Parts of a Building | BEAM, COLUMN, EAVE, FOUNDATION |
| green | Types of Scales | LIBRA, MUSICAL, WEIGHING, TIMELINE |
| blue | Synonyms for Balance | EQUILIBRIUM, POISE, STEADY, STABILITY |
| purple | Words that start with a Greek Letter | ALPHABET, BETRAYAL, GAMUT, DELUXE |
Claimed traps (3):
- LIBRA (actually green, traps blue): Libra is the sign of the scales, which represents balance/equilibrium.
- COLUMN (actually yellow, traps green): A column can be a way to organize data on a scale or chart.
- GAMUT (actually purple, traps green): The 'gamut' refers to the full range or scale of something.
Grade: FAIL. Two broken categories. Green "Types of Scales" includes LIBRA (zodiac sign, not a scale type) and TIMELINE (a line, not a scale). Purple "Words that start with a Greek Letter" includes DELUXE — but DELUXE starts with "DELU", not the full letter name "DELTA". GAMUT starts with "GAM", missing the full "GAMMA". 2 of 4 purple words don't actually start with the complete Greek-letter name.
dt=16.6s, tokens=410
gemma4:31b-it-q4_K_M -- puzzle 3
Theme seed: Words related to vision, optics, and ocular puns
| Diff | Category | Words |
|---|---|---|
| yellow | Parts of the Eye | CORNEA, PUPIL, IRIS, RETINA |
| green | Types of Lenses | CONCAVE, CONVEX, CONTACT, BIFOCAL |
| blue | Words meaning 'To Witness' | OBSERVE, SPOT, NOTE, VIEW |
| purple | ___ EYE | BULLS, BIRD, FISH, PRIVATE |
Claimed traps (3):
- PUPIL (actually yellow, traps blue): A pupil is a student, and one might 'note' or 'observe' a pupil in a classroom setting.
- PRIVATE (actually purple, traps blue): Private can be a synonym for 'secret' or 'hidden', contrasting with the act of observing or spotting.
- CONTACT (actually green, traps blue): To 'contact' someone can be confused with 'spotting' or 'viewing' them in a social context.
Grade: PASS. All 4 categories tight: parts of eye / lens types / witness verbs / ___EYE compounds (bull's, bird's, fish, private). PUPIL→student is a real, NYT-style multi-meaning trap. Only 3 traps claimed (rubric asked for 2+, ✓). Closest puzzle to publication quality across both models.
dt=17.3s, tokens=422
gemma4:31b-it-q4_K_M -- puzzle 4
Theme seed: Common words that can also be types of birds or musical terms
| Diff | Category | Words |
|---|---|---|
| yellow | Things that can be 'Cast' | SPELL, VOTE, SHADOW, NET |
| green | Musical Dynamics | PIANO, FORTE, SHARP, FLAT |
| blue | Types of Birds | CRANE, SWALLOW, SWIFT, LOON |
| purple | Words that mean 'to move quickly' | DART, BOLT, RUSH, FLY |
Claimed traps (4):
- SWIFT (actually blue, traps purple): Swift is a synonym for moving quickly, fitting the 'move quickly' group.
- FLY (actually purple, traps blue): Fly is a common bird-related term (or a type of insect), making it look like it belongs with Crane and Swallow.
- SWALLOW (actually blue, traps purple): While a bird, if the user interprets 'move quickly' as 'gulp down', it might confuse them, though primarily it serves as a bird lure.
- SHARP (actually green, traps purple): A 'sharp' turn or movement can be associated with quickness/darting.
Grade: BORDERLINE PASS. Green should be "Musical Terms" not "Dynamics": SHARP and FLAT are accidentals (pitch modifiers), not dynamics (loudness). Pedantic but a real category-label miss. Saving graces: SWIFT (a bird whose name also means fast) and FLY (a move-quickly verb that also reads as bird-related) are exactly the kind of multi-meaning anchors a real Connections puzzle would deploy.
dt=19.9s, tokens=487
gemma4:31b-it-q4_K_M -- puzzle 5
Theme seed: Words related to types of 'scales' and musical terminology
| Diff | Category | Words |
|---|---|---|
| yellow | Musical Scales | MAJOR, MINOR, CHROMATIC, PENTATONIC |
| green | Things that Weigh | BALANCE, SPRING, DIGITAL, TRIPLE-BEAM |
| blue | Ways to Climb | ASCEND, MOUNT, SCALE, SURMOUNT |
| purple | ___ NOTE | FOOT, BANK, KEY, HIGH |
Claimed traps (4):
- SCALE (actually blue, traps yellow): It is the root word for musical scales, leading the player to look for other scale-related terms.
- KEY (actually purple, traps yellow): A 'key' is fundamentally linked to musical scales (e.g., the Key of C Major).
- HIGH (actually purple, traps blue): High is an adjective often associated with climbing or ascending.
- BALANCE (actually green, traps blue): Balance can be seen as a state of being when climbing or mountaineering.
Grade: PASS. The whole puzzle is built around SCALE carrying three meanings: musical scale (yellow), weighing scale (green's theme — though Gemma mislabels it "Things that Weigh" instead of "Types of Scales"), and "to climb" (blue, where SCALE-the-word lives). That is exactly the central-anchor pattern a real NYT Connections puzzle uses. KEY, HIGH, BALANCE traps all genuinely overlap. Categories slightly mislabeled but structure is publication-quality.
dt=18.2s, tokens=453
Aggregate
| Model | Pass | Borderline | Fail | Avg s | Avg tok/s |
|---|---|---|---|---|---|
| gemma4:26b | 1 (#5) | 1 (#3) + 1 partial (#2) | 2 (#1, #4) | 5.2 | 94.3 |
| gemma4:31b-it-q4_K_M | 2 (#3, #5) | 2 (#1, #4) | 1 (#2) | 18.2 | 24.3 |
31b is materially more reliable: 2 clean passes vs 26b's 1, and only 1 hard fail vs 26b's 2 hard fails plus a partial fail. 31b is 3.5× slower per generation, but at 18s for a once-per-day puzzle that is irrelevant. 26b is fast enough for interactive use but produced a hard or partial fail on 3 of its 5 puzzles.
Failure modes (in order of how often they recur)
- Structural violations: duplicate or near-duplicate words on the 16-tile board, or a word listed in two groups. (#1-26b: FRUIT × 2; #4-26b: NEST/NESTING.) Catastrophic, since a real Connections board has 16 distinct tiles. Trivially detectable with a deterministic post-filter.
- Broken category logic: words placed in a category that don't actually fit. (#2-26b: EYE/QUEUE aren't body-part homophones; #4-26b: NESTING isn't a "Word before EGG"; #2-31b: LIBRA/TIMELINE aren't scales, DELUXE doesn't start with the full Greek letter "DELTA"; #1-31b: DECK doesn't have hands.) Hard to detect deterministically; it needs a critique/judging pass.
- Redundant categories: two groups themed on the same concept (#1-26b: yellow + green both fly-themed; #4-26b: yellow + purple both egg-themed). Detectable with a category-similarity check.
- Weak/circular trap reasoning: Gemma's claimed "intended_traps" sometimes don't actually hold. (#3-26b: PRESS doesn't fit "Words after BLOOD"; the compound is blood pressure, not blood press.) This means Gemma cannot reliably grade its own puzzles, so independent judging is required.
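The deterministic post-filter for structural violations might look like the sketch below. The puzzle schema (a `groups` list whose entries carry `category` and `words`) is an assumption for illustration, and the prefix test for near-duplicates is a crude heuristic chosen because it catches NEST/NESTING; it is not a claim about how the bench grades puzzles.

```python
from itertools import combinations


def structural_problems(puzzle: dict) -> list:
    """Return human-readable structural violations; an empty list means pass."""
    problems = []
    groups = puzzle.get("groups", [])  # assumed schema, not confirmed by the source
    if len(groups) != 4:
        problems.append(f"expected 4 groups, got {len(groups)}")

    tiles = []
    for group in groups:
        words = [w.strip().upper() for w in group.get("words", [])]
        if len(words) != 4:
            problems.append(f"group {group.get('category')!r} has {len(words)} words")
        tiles.extend(words)

    # Exact duplicates: the same tile listed twice, in one group or across groups.
    seen = set()
    for word in tiles:
        if word in seen:
            problems.append(f"duplicate tile: {word}")
        seen.add(word)

    # Near-duplicates (heuristic): one tile of 4+ letters is a prefix of another.
    for a, b in combinations(sorted(seen), 2):
        short, long_ = sorted((a, b), key=len)
        if len(short) >= 4 and long_.startswith(short):
            problems.append(f"near-duplicate tiles: {short}/{long_}")
    return problems
```

Run against the two 26b hard fails, this flags the repeated FRUIT in puzzle 1 and the NEST/NESTING pair in puzzle 4, while the clean puzzle 5 returns an empty list. The prefix rule will also flag legitimate pairs such as a singular and its plural, which is probably acceptable for a reject-and-regenerate loop.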
Successes (when Gemma gets it right, what it does right)
- Multi-meaning anchor words: SCALE (3 meanings, #5-31b), SWIFT/FLY (bird + fast, #4-31b), PUPIL (eye + student, #3-31b), TRACK (3 meanings, #5-26b). When Gemma builds a puzzle around an anchor, it produces real Connections-grade overlap.
- Compound-word categories: "___ EYE" (#3-31b), "Side___" (#5-26b), "Words before EGG" (#4-26b, when Gemma doesn't poison it). These are the easiest pattern to get right.
- Tight short labels when Gemma sticks to well-known domains (parts of eye, types of lenses, types of records).
Implication for design
Generation is viable, but not unaided. The shape of the data engine:
generate (gemma4:31b)
→ deterministic filter [check 16 distinct tiles, no dup words, all words appear in categories]
→ category-similarity check [reject puzzles with redundant themes]
→ critique pass [either gemma4:31b second pass, or qwen3-coder:30b as judge]
→ reject + regenerate if any fail; accept once filtered
→ cache as the day's puzzle
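The reject-and-regenerate loop and the category-similarity step can be sketched roughly as below. Everything here is illustrative: `generate` and the entries of `checks` are hypothetical callables, the stopword list is ad hoc, and shared label terms are only a crude stand-in for a real similarity measure (an embedding or judge-model comparison would likely do better).

```python
import re
from itertools import combinations
from typing import Callable, Optional

# Ad hoc filler words (singular forms) that carry no theme information.
STOPWORDS = {"a", "an", "the", "of", "for", "to", "that", "can", "be", "are",
             "also", "have", "with", "type", "kind", "word", "thing", "way",
             "part", "before", "after", "follow"}


def label_terms(label: str) -> set:
    """Lowercase a category label, crudely singularize, drop filler words."""
    terms = set()
    for token in re.findall(r"[a-z]+", label.lower()):
        if token.endswith("ies"):
            token = token[:-3] + "y"      # flies -> fly
        elif token.endswith("s") and len(token) > 3:
            token = token[:-1]            # eggs -> egg
        if token not in STOPWORDS:
            terms.add(token)
    return terms


def redundant_labels(puzzle: dict) -> list:
    """Pairs of category labels that share at least one theme term."""
    labels = [g["category"] for g in puzzle["groups"]]
    return [(a, b) for a, b in combinations(labels, 2)
            if label_terms(a) & label_terms(b)]


def build_daily_puzzle(
    generate: Callable[[], dict],  # hypothetical generator call
    checks: list,                  # each check returns a list of problems
    max_generations: int = 5,
) -> Optional[dict]:
    """Regenerate until every check returns no problems, or give up."""
    for _ in range(max_generations):
        puzzle = generate()
        if all(not check(puzzle) for check in checks):
            return puzzle  # the caller would cache this as the day's puzzle
    return None
```

Keeping each stage as a function that returns a list of problems means the critique pass can be dropped into `checks` later without changing the loop.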
At ~18s/generation and a roughly 50% structural-pass rate, a daily puzzle costs an expected ~2 generations + 1 critique = maybe 1 minute of GPU time per day. Effectively free.
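The arithmetic behind that estimate, as a small sketch: if each generation passes independently with probability p, the expected number of generations until the first pass is 1/p. The critique duration is not measured in this report, so the example below assumes it costs about as much as one generation.

```python
def expected_daily_gpu_seconds(
    pass_rate: float,
    generation_seconds: float,
    critique_seconds: float,
    critiques: int = 1,
) -> float:
    """Expected GPU time per accepted puzzle under independent attempts."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    expected_generations = 1 / pass_rate  # mean of a geometric distribution
    return expected_generations * generation_seconds + critiques * critique_seconds


# 50% pass rate, 18.2 s per generation, critique assumed to cost 18.2 s as well:
# 2 expected generations + 1 critique, about 54.6 s, i.e. roughly one minute.
estimate = expected_daily_gpu_seconds(0.5, 18.2, 18.2)
```

If critique runs on every candidate rather than only on the accepted one, pass `critiques=2` (or the expected generation count) and the estimate grows accordingly, though it stays in the range of a minute or two.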
26b is unsuitable as the primary generator — too many hard fails. It could plausibly be the judging model (cheaper, runs on every player guess) since judging is much easier than generating. But that decision is for the brainstorm.
Risks not yet checked
- Diversity over time. All 10 puzzles produced here come from a single seed-less batch. If Gemma keeps reaching for the same themes (we saw "scales" twice on 31b alone), a daily stream running all year might get repetitive. Test this with seeded prompts before committing.
- Connections-vs-Gemma blind anchor not run. I deferred this — the structural failures in Gemma's output (duplicate words, broken categories) are so obviously curator-rejection-tier that no human-curated puzzle would have them, so the within-Gemma comparison was decisive on its own. Still, before final design, eyeball one Gemma-pass puzzle next to a real NYT puzzle and check whether it actually feels equivalent.
- Two-pass critique not validated. The proposal above assumes a critique pass would catch Gemma's category mistakes. That assumption has not been tested. The next experiment is "feed Gemma's broken puzzles back to Gemma (or to a different model) and see if it flags the structural issues."
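For the diversity risk in particular, a first-pass check could simply count theme terms that recur across a batch of theme seeds. The sketch below is one hedged way to do that; the function name, the minimum word length, and the ignore list are all illustrative assumptions rather than part of the test bench.

```python
import re
from collections import Counter

# Ad hoc filler words that show up in most theme seeds.
IGNORE = frozenset({"words", "related", "types", "that", "with", "also",
                    "common", "things", "collection"})


def recurring_theme_terms(seeds: list, min_seeds: int = 2) -> dict:
    """Map each theme term to the number of seeds it appears in (if >= min_seeds)."""
    counts = Counter()
    for seed in seeds:
        # A set per seed, so a term repeated inside one seed counts once.
        terms = {t for t in re.findall(r"[a-z]+", seed.lower())
                 if len(t) >= 4 and t not in IGNORE}
        counts.update(terms)
    return {term: n for term, n in counts.items() if n >= min_seeds}
```

On the five 31b theme seeds from this run, the repeated terms are "scales", "musical", and "wordplay", each appearing in two seeds. Tracking this count over a longer seeded run would show whether the repetition is noise or a habit.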