seth_semantic_game/docs/reference/gemma-generation-bakeoff-2026-04-27-221751.md

# Gemma 4 Generation Bakeoff -- 2026-04-27-221751

## Setup
- Local Ollama on the test host (RTX 3090 Ti, 24 GB VRAM)
- Other GPU workloads paused for the duration of the run
- Models: `gemma4:26b`, `gemma4:31b-it-q4_K_M`
- 5 puzzles per model, base temperature 0.8
- Gemma 4 settings (per `~/bin/gemma4-research/GOTCHAS.md`): `think=false`, `num_ctx=8192`, `num_predict=4096`. No `format=json` (infinite-loop bug). JSON extracted client-side via `body[body.find('{'):body.rfind('}')+1]`.
- Up to 3 attempts per puzzle with temperature bumped +0.1 each retry (AI_Visualizer pattern). Reported metrics use the *successful* attempt.
- One-shot, unaided generation. No critique pass, no example puzzle in prompt.
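
The client-side JSON extraction and the retry-with-temperature-bump pattern described above can be sketched roughly as follows. This is a minimal illustration, not the harness used for this run: `generate` is a hypothetical callable standing in for the Ollama request (it takes a temperature and returns the raw response text), and the function names are assumptions.

```python
import json
from typing import Callable, Optional, Tuple


def extract_json(body: str) -> Optional[dict]:
    """Slice from the first '{' to the last '}' and try to parse the result."""
    start, end = body.find("{"), body.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(body[start:end + 1])
    except json.JSONDecodeError:
        return None


def generate_with_retries(
    generate: Callable[[float], str],  # hypothetical stand-in for the model call
    base_temperature: float = 0.8,
    max_attempts: int = 3,
    bump: float = 0.1,
) -> Tuple[Optional[dict], Optional[float]]:
    """Retry until the output parses, raising temperature by `bump` each time.

    Returns (puzzle, temperature_used), or (None, None) if every attempt fails.
    """
    for attempt in range(max_attempts):
        temperature = round(base_temperature + bump * attempt, 2)
        puzzle = extract_json(generate(temperature))
        if puzzle is not None:
            return puzzle, temperature
    return None, None
```

Slicing between the outermost braces tolerates chatty preambles and trailing remarks, but it will still fail on truncated output, which is what the retry loop is there to absorb.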

## Timing

| Model | n | avg s | avg tokens | tok/s |
|---|---|---|---|---|
| `gemma4:26b` | 5 | 5.2 | 489 | 94.3 |
| `gemma4:31b-it-q4_K_M` | 5 | 18.2 | 442 | 24.3 |
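
As a sanity check, the averages in the table can be recomputed from the per-puzzle `dt`/`tokens` footers later in this report. The sketch below assumes those footers are the only inputs. Because they are rounded to one decimal, the recomputed tok/s can differ slightly from the table's column, which was presumably derived from unrounded timings.

```python
from statistics import mean

# (seconds, tokens) copied from the per-puzzle footers in this report.
RUNS = {
    "gemma4:26b": [(5.7, 500), (5.3, 510), (5.1, 492), (5.3, 506), (4.6, 436)],
    "gemma4:31b-it-q4_K_M": [
        (18.9, 439), (16.6, 410), (17.3, 422), (19.9, 487), (18.2, 453),
    ],
}


def summarize(runs):
    """Average latency, token count, and per-run throughput for one model."""
    seconds = [s for s, _ in runs]
    tokens = [t for _, t in runs]
    return {
        "n": len(runs),
        "avg_s": round(mean(seconds), 1),
        "avg_tokens": round(mean(tokens)),
        # Mean of per-run tokens/second, computed from the rounded footers.
        "tok_per_s": round(mean(t / s for s, t in runs), 1),
    }


for model, runs in RUNS.items():
    print(model, summarize(runs))
```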

## JSON parse rate

- `gemma4:26b`: 5/5 parsed cleanly
- `gemma4:31b-it-q4_K_M`: 5/5 parsed cleanly

## gemma4:26b -- puzzle 1

**Theme seed:** An exploration of common words that can be paired with 'BIRD' or 'DRAGON' or relate to types of 'FLY'.

| Diff | Category | Words |
|---|---|---|
| yellow | Types of Flies | FRUIT, HOUSE, FIRE, BLOW |
| green | Words before 'FLY' | DRAGON, MAY, BUTTER, BOTTLE |
| blue | Types of Birds | LARK, SWALLOW, CRANE, SWIFT |
| purple | Words that follow 'DRAGON' | FRUIT, FLY, SLAYER, FLAME |

**Claimed traps (4):**
- `FRUIT` (actually purple, traps yellow): Fruit fly is a very common type of fly, making it a primary red herring for the yellow group.
- `FIRE` (actually yellow, traps purple): Fire dragon or dragon flame connection could lead someone to look for a dragon-themed group.
- `DRAGON` (actually green, traps blue): Dragon is a mythical creature, and users might look for a 'mythical creatures' group including Crane or Swallow (which are real birds).
- `FLY` (actually purple, traps green): Dragonfly is a common word, tempting the player to put 'FLY' in the 'Words before FLY' group.

**Grade: FAIL.** `FRUIT` appears in **two groups** (yellow Types of Flies + purple Words that follow DRAGON) — that's only 15 distinct tiles, not the required 16. Categories are also redundant: yellow ("Types of Flies") and green ("Words before FLY") are both fly-themed. A human curator would never ship this.

_dt=5.7s, tokens=500_

## gemma4:26b -- puzzle 2

**Theme seed:** Homophones and linguistic overlaps between nature and everyday objects.

| Diff | Category | Words |
|---|---|---|
| yellow | Types of Trees | BIRCH, CEDAR, MAPLE, PINE |
| green | Things that can be 'Dropped' | BEAT, HINT, LINE, SUBJECT |
| blue | Units of Measurement | FOOT, GRAM, SECOND, STONE |
| purple | Homophones of Body Parts | EYE, HARE, MUSSEL, QUEUE |

**Claimed traps (4):**
- `PINE` (actually yellow, traps green): To 'pine' for someone is a feeling, but 'pine' could be mistaken for something you can 'drop' (like a pine needle).
- `FOOT` (actually blue, traps yellow): A 'foot' is a unit of measurement, but it is also a body part, potentially leading to the 'Body Part Homophones' group.
- `BEAT` (actually green, traps blue): A 'beat' is a rhythm, but it could be mistaken for a unit of time/measurement in a musical context.
- `EYE` (actually purple, traps blue): The word 'eye' is a body part, which might tempt players looking for a body part category, but it is actually a homophone (I) for the purple group.

**Grade: PARTIAL FAIL.** Yellow/green/blue all tight. Purple "Homophones of Body Parts" is half-broken: `HARE`→hair ✓, `MUSSEL`→muscle ✓, but `EYE` *is* a body part (not a homophone of one) and `QUEUE`→cue is not a body-part homophone. 2/4 wrong on the trickiest group is a curator-rejection. Trap reasoning is weak (claims `EYE` traps blue Units — body parts aren't units).

_dt=5.3s, tokens=510_

## gemma4:26b -- puzzle 3

**Theme seed:** The concept of 'Pressure' and things that can be measured or applied.

| Diff | Category | Words |
|---|---|---|
| yellow | Units of Pressure | BAR, PASCAL, PSI, TORR |
| green | Words after 'BLOOD' | PRESSURE, TYPE, ORANGE, DONOR |
| blue | Things that can be 'Broken' | HEART, RECORD, LAW, PROMISE |
| purple | Words that are also Verbs for 'Applying Force' | PRESS, STAMP, PUNCH, DRIVE |

**Claimed traps (4):**
- `PRESSURE` (actually green, traps yellow): Pressure is the concept being measured by the units in the yellow group.
- `BAR` (actually yellow, traps blue): A 'bar' is something that can be 'broken' (like a chocolate bar) or refers to a legal setting.
- `PRESS` (actually purple, traps green): The 'Press' is a common term in news/media, and 'Press' can be related to 'Blood' in a medical context (blood pressure).
- `RECORD` (actually blue, traps purple): A 'record' is something you can 'stamp' or 'drive' (as in a driving record).

**Grade: BORDERLINE PASS.** Categories tight, 16 distinct words, ✓. `BAR` (yellow → blue trap) is a real overlap (chocolate bar, broken bar). `PRESS` claimed to trap green ("Words after BLOOD") — but blood **pressure** is the compound, not blood **press**, so the claimed trap doesn't actually hold. `RECORD`→stamp/drive is a stretch. 1/4 of Gemma's self-graded traps is broken; remainder OK.

_dt=5.1s, tokens=492_

## gemma4:26b -- puzzle 4

**Theme seed:** A collection of words related to things that can be 'broken' or 'cracked', mixed with other categories like cooking and music.

| Diff | Category | Words |
|---|---|---|
| yellow | Kinds of Eggs | CHICKEN, QUAIL, GOOSE, DUCK |
| green | Things You Can Crack | NUT, CODE, WHIP, SMILE |
| blue | Musical Terms | NOTE, CHORD, CLEF, BEAT |
| purple | Words Before 'Egg' | NEST, EASTER, BAD, NESTING |

**Claimed traps (4):**
- `DUCK` (actually yellow, traps green): DUCK is an action related to avoiding a whip or a blow, and can be paired with 'crack' in some linguistic contexts.
- `CHICKEN` (actually yellow, traps purple): CHICKEN is an egg-related word that could tempt someone looking for '___ Egg' phrases (like Chicken Egg).
- `BEAT` (actually blue, traps green): BEAT is something you can do to a whip or a person, fitting the 'Things you can crack/strike' theme.
- `NEST` (actually purple, traps yellow): NEST is intrinsically linked to the birds in the yellow group (CHICKEN, GOOSE, etc.).

**Grade: FAIL.** Purple has `NESTING` as "Words Before EGG" — *nesting egg* isn't a phrase (*nest egg* is). Worse, `NEST` and `NESTING` are near-duplicate tiles. Yellow "Kinds of Eggs" and purple "Words Before EGG" are also redundant (chicken egg vs nest egg — both compound forms). Curator reject.

_dt=5.3s, tokens=506_

## gemma4:26b -- puzzle 5

**Theme seed:** A collection of words centered around various types of 'records' and 'tracks'.

| Diff | Category | Words |
|---|---|---|
| yellow | Types of Records | LP, EP, Single, Album |
| green | Things that have Tracks | Train, Athlete, Song, Mud |
| blue | Words for 'Evidence' | Clue, Trace, Mark, Proof |
| purple | Words that follow 'Side' | Effect, Kick, Split, Walk |

**Claimed traps (4):**
- `Single` (actually yellow, traps blue): A 'single' can be a piece of evidence or a solitary item.
- `Song` (actually green, traps yellow): A song is often released as an LP, EP, or Single.
- `Trace` (actually blue, traps green): A trace can be a physical track left behind in mud.
- `Split` (actually purple, traps green): A 'split' can refer to a track or division in a race.

**Grade: PASS.** Cleanest 26b output. 16 distinct words; categories tight; the word "track" carries **three different meanings** within the green group (railroad, running, album), with `SONG`→Records and `TRACE`→Tracks as real overlaps. That's the multi-meaning anchor pattern Connections actually uses. `SINGLE` and `SPLIT` traps are weak, but the puzzle structure is publication-shaped.

_dt=4.6s, tokens=436_

## gemma4:31b-it-q4_K_M -- puzzle 1

**Theme seed:** Words related to circles, timing, and subtle wordplay with 'Hands'

| Diff | Category | Words |
|---|---|---|
| yellow | Circular Objects | RING, COIN, WHEEL, DISC |
| green | Units of Time | SECOND, MINUTE, HOUR, DAY |
| blue | Ways to Assist | HELP, AID, BACK, SUPPORT |
| purple | Things with Hands | CLOCK, WATCH, DECK, PLAYER |

**Claimed traps (4):**
- `SECOND` (actually green, traps blue): To 'second' a motion is to support it.
- `WATCH` (actually purple, traps green): A watch is a device used to measure the units of time (second, minute, hour).
- `CLOCK` (actually purple, traps green): Similar to watch, it is intrinsically linked to the time units group.
- `BACK` (actually blue, traps yellow): Could be mistaken as a part of a circular object or a physical position.

**Grade: BORDERLINE PASS.** `DECK` in "Things with Hands" is shaky — decks have cards, players have hands; the phrase is "all hands on deck" (sailors), not "deck has hands." Other 3 purple members fit. Strong saving grace: `SECOND`→assist (second a motion), `WATCH`→time, `CLOCK`→time are 3 genuinely strong traps. Categories otherwise tight.

_dt=18.9s, tokens=439_

## gemma4:31b-it-q4_K_M -- puzzle 2

**Theme seed:** A mix of architectural elements, types of scales, and linguistic wordplay

| Diff | Category | Words |
|---|---|---|
| yellow | Parts of a Building | BEAM, COLUMN, EAVE, FOUNDATION |
| green | Types of Scales | LIBRA, MUSICAL, WEIGHING, TIMELINE |
| blue | Synonyms for Balance | EQUILIBRIUM, POISE, STEADY, STABILITY |
| purple | Words that start with a Greek Letter | ALPHABET, BETRAYAL, GAMUT, DELUXE |

**Claimed traps (3):**
- `LIBRA` (actually green, traps blue): Libra is the sign of the scales, which represents balance/equilibrium.
- `COLUMN` (actually yellow, traps green): A column can be a way to organize data on a scale or chart.
- `GAMUT` (actually purple, traps green): The 'gamut' refers to the full range or scale of something.

**Grade: FAIL.** Two broken categories. Green "Types of Scales" includes `LIBRA` (zodiac sign, not a scale type) and `TIMELINE` (a line, not a scale). Purple "Words that start with a Greek Letter" holds up only for `ALPHABET` (ALPHA). `BETRAYAL` starts with "BET", not "BETA"; `GAMUT` starts with "GAM", not "GAMMA"; `DELUXE` starts with "DEL", not "DELTA". 3 of 4 purple words don't actually start with the complete Greek-letter name.

_dt=16.6s, tokens=410_

## gemma4:31b-it-q4_K_M -- puzzle 3

**Theme seed:** Words related to vision, optics, and ocular puns

| Diff | Category | Words |
|---|---|---|
| yellow | Parts of the Eye | CORNEA, PUPIL, IRIS, RETINA |
| green | Types of Lenses | CONCAVE, CONVEX, CONTACT, BIFOCAL |
| blue | Words meaning 'To Witness' | OBSERVE, SPOT, NOTE, VIEW |
| purple | ___ EYE | BULLS, BIRD, FISH, PRIVATE |

**Claimed traps (3):**
- `PUPIL` (actually yellow, traps blue): A pupil is a student, and one might 'note' or 'observe' a pupil in a classroom setting.
- `PRIVATE` (actually purple, traps blue): Private can be a synonym for 'secret' or 'hidden', contrasting with the act of observing or spotting.
- `CONTACT` (actually green, traps blue): To 'contact' someone can be confused with 'spotting' or 'viewing' them in a social context.

**Grade: PASS.** All 4 categories tight: parts of eye / lens types / witness verbs / `___EYE` compounds (bull's, bird's, fish, private). `PUPIL`→student is a real, NYT-style multi-meaning trap. Only 3 traps claimed (rubric asked for 2+, ✓). Closest puzzle to publication quality across both models.

_dt=17.3s, tokens=422_

## gemma4:31b-it-q4_K_M -- puzzle 4

**Theme seed:** Common words that can also be types of birds or musical terms

| Diff | Category | Words |
|---|---|---|
| yellow | Things that can be 'Cast' | SPELL, VOTE, SHADOW, NET |
| green | Musical Dynamics | PIANO, FORTE, SHARP, FLAT |
| blue | Types of Birds | CRANE, SWALLOW, SWIFT, LOON |
| purple | Words that mean 'to move quickly' | DART, BOLT, RUSH, FLY |

**Claimed traps (4):**
- `SWIFT` (actually blue, traps purple): Swift is a synonym for moving quickly, fitting the 'move quickly' group.
- `FLY` (actually purple, traps blue): Fly is a common bird-related term (or a type of insect), making it look like it belongs with Crane and Swallow.
- `SWALLOW` (actually blue, traps purple): While a bird, if the user interprets 'move quickly' as 'gulp down', it might confuse them, though primarily it serves as a bird lure.
- `SHARP` (actually green, traps purple): A 'sharp' turn or movement can be associated with quickness/darting.

**Grade: BORDERLINE PASS.** Green should be "Musical Terms", not "Dynamics": `SHARP` and `FLAT` are accidentals (pitch modifiers), not dynamics (loudness). Pedantic but a real category-label miss. Saving graces: `SWIFT` (a bird whose name also means fast) and `FLY` (a move-quickly verb that also reads as bird-related) are exactly the kind of multi-meaning anchors a real Connections puzzle would deploy.

_dt=19.9s, tokens=487_

## gemma4:31b-it-q4_K_M -- puzzle 5

**Theme seed:** Words related to types of 'scales' and musical terminology

| Diff | Category | Words |
|---|---|---|
| yellow | Musical Scales | MAJOR, MINOR, CHROMATIC, PENTATONIC |
| green | Things that Weigh | BALANCE, SPRING, DIGITAL, TRIPLE-BEAM |
| blue | Ways to Climb | ASCEND, MOUNT, SCALE, SURMOUNT |
| purple | ___ NOTE | FOOT, BANK, KEY, HIGH |

**Claimed traps (4):**
- `SCALE` (actually blue, traps yellow): It is the root word for musical scales, leading the player to look for other scale-related terms.
- `KEY` (actually purple, traps yellow): A 'key' is fundamentally linked to musical scales (e.g., the Key of C Major).
- `HIGH` (actually purple, traps blue): High is an adjective often associated with climbing or ascending.
- `BALANCE` (actually green, traps blue): Balance can be seen as a state of being when climbing or mountaineering.

**Grade: PASS.** The whole puzzle is built around `SCALE` carrying three meanings: musical scale (yellow), weighing scale (green's theme — though Gemma mislabels it "Things that Weigh" instead of "Types of Scales"), and "to climb" (blue, where SCALE-the-word lives). That is exactly the central-anchor pattern a real NYT Connections puzzle uses. `KEY`, `HIGH`, `BALANCE` traps all genuinely overlap. Categories slightly mislabeled but structure is publication-quality.

_dt=18.2s, tokens=453_

---

## Aggregate

| Model | Pass | Borderline | Fail | Avg s | Avg tok/s |
|---|---|---|---|---|---|
| `gemma4:26b` | 1 (#5) | 1 (#3) + 1 partial fail (#2) | 2 (#1, #4) | 5.2 | 94.3 |
| `gemma4:31b-it-q4_K_M` | 2 (#3, #5) | 2 (#1, #4) | 1 (#2) | 18.2 | 24.3 |

**31b is materially more reliable** — 2 clean passes vs 26b's 1, and only 1 hard fail vs 26b's 2 hard fails plus a partial-fail. 31b is 3.5× slower per generation but at 18s for a once-per-day puzzle, that's irrelevant. 26b is fast enough for interactive use but produces broken puzzles half the time.

### Failure modes (in order of how often they recur)

1. **Structural violations** — duplicate or near-duplicate words on the 16-tile board, or a word listed in two groups. (#1-26b: `FRUIT` × 2; #4-26b: `NEST`/`NESTING`.) Catastrophic — a real Connections board has 16 *distinct* tiles. **Trivially detectable** with a deterministic post-filter.
2. **Broken category logic** — words placed in a category that don't actually fit. (#2-26b: `EYE`/`QUEUE` aren't body-part homophones; #4-26b: `NESTING` isn't a "Word before EGG"; #2-31b: `LIBRA`/`TIMELINE` aren't scales, `DELUXE` doesn't start with the full Greek letter "DELTA"; #1-31b: `DECK` doesn't have hands.) **Hard to detect deterministically** — needs a critique/judging pass.
3. **Redundant categories** — two groups themed on the same concept (#1-26b: yellow + green both fly-themed; #4-26b: yellow + purple both egg-themed). Detectable with a category-similarity check.
4. **Weak/circular trap reasoning** — Gemma's claimed "intended_traps" sometimes don't actually hold. (#3-26b: `PRESS` doesn't fit "Words after BLOOD" — the compound is *blood pressure*, not *blood press*.) Means **Gemma cannot reliably grade its own puzzles** — independent judging required.
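
The structural violations in item 1 are the ones a deterministic post-filter can catch. A minimal sketch, assuming a puzzle arrives as a mapping from category label to word list (the real output schema is not shown in this report) and treating "one tile is a prefix of another" as a stand-in heuristic for near-duplicates:

```python
def structural_problems(groups: dict[str, list[str]]) -> list[str]:
    """Return human-readable problems; an empty list means the board passes."""
    problems = []
    if len(groups) != 4:
        problems.append(f"expected 4 groups, got {len(groups)}")
    for label, words in groups.items():
        if len(words) != 4:
            problems.append(f"group {label!r} has {len(words)} words, expected 4")

    # Normalise case so 'Single' and 'SINGLE' count as the same tile.
    tiles = [w.strip().upper() for words in groups.values() for w in words]
    seen = set()
    for tile in tiles:
        if tile in seen:
            problems.append(f"duplicate tile: {tile}")
        seen.add(tile)

    # Assumed heuristic: a tile of 4+ letters that is a prefix of another tile
    # (NEST / NESTING) is reported as a near-duplicate.
    distinct = sorted(seen)
    for i, short in enumerate(distinct):
        for longer in distinct[i + 1:]:
            if len(short) >= 4 and longer.startswith(short):
                problems.append(f"near-duplicate tiles: {short} / {longer}")
    return problems


puzzle_1_26b = {
    "Types of Flies": ["FRUIT", "HOUSE", "FIRE", "BLOW"],
    "Words before 'FLY'": ["DRAGON", "MAY", "BUTTER", "BOTTLE"],
    "Types of Birds": ["LARK", "SWALLOW", "CRANE", "SWIFT"],
    "Words that follow 'DRAGON'": ["FRUIT", "FLY", "SLAYER", "FLAME"],
}
print(structural_problems(puzzle_1_26b))  # ['duplicate tile: FRUIT']
```

The prefix heuristic is deliberately crude. It also flags `PRESS` / `PRESSURE` in #3-26b, which was graded BORDERLINE PASS above, so near-duplicate hits are probably better treated as warnings for review than as automatic rejects.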

### Successes (when Gemma gets it right, what it does right)

- **Multi-meaning anchor words** — `SCALE` (3 meanings, #5-31b), `SWIFT`/`FLY` (bird + fast, #4-31b), `PUPIL` (eye + student, #3-31b), `TRACK` (3 meanings, #5-26b). When Gemma builds a puzzle around an anchor, it produces real Connections-grade overlap.
- **Compound-word categories** — "`___ EYE`" (#3-31b), "Side `___`" (#5-26b), "Words before EGG" (#4-26b — when Gemma doesn't poison it). These are the easiest pattern to get right.
- **Tight short labels** when Gemma sticks to well-known domains (parts of eye, types of lenses, types of records).

### Implication for design

**Generation is viable, but not unaided.** The shape of the data engine:

```
generate (gemma4:31b)
  → deterministic filter [check 16 distinct tiles, no dup words, all words appear in categories]
  → category-similarity check [reject puzzles with redundant themes]
  → critique pass [either gemma4:31b second pass, or qwen3-coder:30b as judge]
  → reject + regenerate if any fail; accept once filtered
  → cache as the day's puzzle
```
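
The category-similarity step in the pipeline above has not been built yet. One possible starting point, sketched here under heavy assumptions, is a keyword-overlap check on the category labels. The stopword list and the crude singularisation rule are illustrative choices, not a tested design, and an embedding-based similarity could replace them.

```python
import re
from itertools import combinations

# Assumed stopword list: filler words seen in the category labels of this run.
STOPWORDS = {
    "a", "an", "and", "or", "of", "the", "to", "that", "can", "be", "you",
    "for", "with", "are", "also", "before", "after", "follow", "start",
    "mean", "meaning", "have", "type", "kind", "word", "thing", "part",
    "unit", "way", "synonym", "verb", "term",
}


def singular(word: str) -> str:
    """Very rough singularisation: 'flies' -> 'fly', 'eggs' -> 'egg'."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def keywords(label: str) -> set[str]:
    """Lowercased, singularised content words of a category label."""
    words = (singular(w) for w in re.findall(r"[a-z]+", label.lower()))
    return {w for w in words if w not in STOPWORDS}


def redundant_pairs(labels: list[str]) -> list[tuple[str, str]]:
    """Pairs of category labels that share at least one content keyword."""
    return [
        (a, b)
        for a, b in combinations(labels, 2)
        if keywords(a) & keywords(b)
    ]


labels_1_26b = [
    "Types of Flies", "Words before 'FLY'",
    "Types of Birds", "Words that follow 'DRAGON'",
]
print(redundant_pairs(labels_1_26b))
```

On this run's data the check catches both redundant-theme failures (#1-26b fly/fly, #4-26b egg/egg), but it also flags #3-31b ("Parts of the Eye" vs "`___ EYE`"), which was graded PASS. A label overlap is therefore a signal worth passing to the critique step, not a sufficient reason to reject on its own.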

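At ~18 s per generation and a roughly 50% pass rate (31b's five puzzles split 2 pass, 2 borderline, 1 fail, so 50% is a planning figure rather than a measurement), the expected number of generations per accepted puzzle is 1/0.5 = 2. Adding 1 critique pass brings a daily puzzle to about 1 minute of GPU time. Effectively free.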
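
The estimate above follows from treating each generation as an independent trial, so the expected number of attempts until one is accepted is 1 divided by the pass rate. A small sketch of that arithmetic, assuming the critique pass costs about as much as one generation (this has not been measured) and is counted once:

```python
def expected_daily_gpu_seconds(
    gen_seconds: float,
    pass_rate: float,
    critique_seconds: float,
) -> float:
    """Expected GPU seconds per accepted puzzle under a geometric retry model."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    expected_generations = 1 / pass_rate  # mean of a geometric distribution
    return expected_generations * gen_seconds + critique_seconds


# 31b average latency from the Timing table; critique cost is an assumption.
estimate = expected_daily_gpu_seconds(
    gen_seconds=18.2, pass_rate=0.5, critique_seconds=18.2
)
print(f"{estimate:.1f} s")  # 54.6 s
```

If the true pass rate is closer to 31b's clean-pass rate of 2 in 5, the same formula gives roughly 64 s, so the "about a minute" conclusion is not sensitive to the exact figure.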

**26b is unsuitable as the primary generator** — too many hard fails. It could plausibly be the *judging* model (cheaper, runs on every player guess) since judging is much easier than generating. But that decision is for the brainstorm.

### Risks not yet checked

- **Diversity over time.** All 10 puzzles produced here are within a single seed-less batch. If Gemma keeps reaching for the same themes (we saw "scales" twice on 31b alone), a 365-day-per-year stream might get repetitive. Test this with seeded prompts before committing.
- **Connections-vs-Gemma blind anchor not run.** I deferred this — the structural failures in Gemma's output (duplicate words, broken categories) are so obviously curator-rejection-tier that no human-curated puzzle would have them, so the within-Gemma comparison was decisive on its own. Still, before final design, eyeball one Gemma-pass puzzle next to a real NYT puzzle and check whether it actually feels equivalent.
- **Two-pass critique not validated.** The proposal above assumes a critique pass would catch Gemma's category mistakes. That assumption has not been tested. The next experiment is "feed Gemma's broken puzzles back to Gemma (or to a different model) and see if it flags the structural issues."