Files
seth_semantic_game/docs/reference/gemma-semantic-bakeoff-2026-04-27-224800.md
Mortdecai 5a2a02e483 docs: bootstrap repo with bakeoff results and game-mechanics idea bank
This repo opens with the design-discovery work completed before any product
code is written. Two model bakeoffs against gemma4:8b/26b/31b on a local
Ollama established that:

- Whole-puzzle generation in the Connections shape is unreliable on Gemma 4
  (gemma4:31b ~50% structural-pass, gemma4:26b ~20-30%); 31b is intentionally
  out of project scope, so the generation route is harder still.
- Atomic semantic-judging skills are reliable: 87.5%/93.75%/100% (8B/26b/31b)
  on JUDGE; *all three models* scored 10/10 on CREATIVE_ACCEPT — fair judging
  of player-INVENTED categories. That is the structural unlock vs static
  hand-curated word games.

The README contains the full writeup, the test bench, and a brainstormed
bank of 10 distinct game-mechanics ideas across the fast/medium/slow tempo
range, plus a primitives table for recombination.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 23:09:46 -04:00

27 KiB

Gemma 4 Semantic Bakeoff -- 2026-04-27 22:51

Setup

  • Local Ollama on the test host (RTX 3090 Ti, 24 GB VRAM)
  • Models: gemma4:latest, gemma4:26b, gemma4:31b-it-q4_K_M
  • Temperature 0.2 (raised +0.1 per retry on JSON parse fail, max 3 attempts)
  • think=false, num_ctx=4096, num_predict=512, no format=json (per gemma4-research/GOTCHAS.md)
  • 38 test cases: 16 JUDGE, 10 CREATE, 12 CREATIVE_ACCEPT
  • Ground truth hand-labeled inline in scripts/gemma-semantic-bakeoff.py

Auto-scored summary

Model JUDGE pass CREATIVE_ACCEPT pass parse fails avg s
gemma4:latest 14/16 10/10 0 0.7
gemma4:26b 15/16 10/10 0 0.8
gemma4:31b-it-q4_K_M 16/16 10/10 1 2.3

JUDGE

judge-y1

  • Category: Types of trees
  • Words: OAK, MAPLE, BIRCH, PINE
  • Ground truth: yes
Model Verdict Reason Score dt
gemma4:latest yes All four words are common names of tree species. PASS 0.7s
gemma4:26b yes All four words are specific, well-known types of trees. PASS 0.8s
gemma4:31b-it-q4_K_M yes All four words are common types of trees. PASS 2.1s

judge-y2

  • Category: Greek letters
  • Words: ALPHA, BETA, GAMMA, DELTA
  • Ground truth: yes
Model Verdict Reason Score dt
gemma4:latest yes All four words are names of Greek letters. PASS 0.6s
gemma4:26b yes All four words are standard names of letters in the Greek alphabet. PASS 0.7s
gemma4:31b-it-q4_K_M yes All four words are the first four letters of the Greek alphabet. PASS 2.1s

judge-y3

  • Category: Days of the week
  • Words: MONDAY, FRIDAY, SUNDAY, WEDNESDAY
  • Ground truth: yes
Model Verdict Reason Score dt
gemma4:latest no The words are not all representative of the days of the week. (misfit: MONDAY, FRIDAY, SUNDAY, WEDNESDAY) FAIL 0.8s
gemma4:26b yes All four words are specific names of days in the week. PASS 0.7s
gemma4:31b-it-q4_K_M yes All four words are standard days of the week. PASS 1.9s

judge-y4

  • Category: Synonyms for 'happy'
  • Words: JOYFUL, GLAD, CHEERFUL, ELATED
  • Ground truth: yes
Model Verdict Reason Score dt
gemma4:latest yes All four words are direct and common synonyms for 'happy'. PASS 0.6s
gemma4:26b yes All four words are direct synonyms representing different degrees of happiness. PASS 0.7s
gemma4:31b-it-q4_K_M yes All four words are direct synonyms for the state of being happy. PASS 2.1s

judge-y5

  • Category: ___ HOUSE (compound words)
  • Words: GREEN, ICE, FIRE, COURT
  • Ground truth: yes
Model Verdict Reason Score dt
gemma4:latest yes All four words can precede 'HOUSE' to form common compound words. PASS 0.7s
gemma4:26b no While Green, Ice, and Fire House are common, 'Court House' is typically written as two words (Courthouse) or refers to a FAIL 0.9s
gemma4:31b-it-q4_K_M yes All four words form common compound words or phrases with 'house'. PASS 2.1s

judge-y6

  • Category: Words that follow COLD
  • Words: SHOULDER, FRONT, SNAP, TURKEY
  • Ground truth: yes
Model Verdict Reason Score dt
gemma4:latest no Only SHOULDER, FRONT, and SNAP fit common phrases with COLD, while TURKEY does not. (misfit: TURKEY) FAIL 0.8s
gemma4:26b yes All four words form common phrases when preceded by 'cold': cold shoulder, cold front, cold snap, and cold turkey. PASS 0.9s
gemma4:31b-it-q4_K_M yes All four words form common compound words or phrases when preceded by 'cold'. PASS 2.1s

judge-y7

  • Category: Verbs meaning 'to move quickly'
  • Words: DART, BOLT, RUSH, FLY
  • Ground truth: yes
Model Verdict Reason Score dt
gemma4:latest yes All four words are strong synonyms for moving rapidly. PASS 0.6s
gemma4:26b yes All four words are synonyms for moving at high speed. PASS 0.7s
gemma4:31b-it-q4_K_M yes All four words are common synonyms for moving rapidly. PASS 2.0s

judge-y8

  • Category: Synonyms for 'idea'
  • Words: NOTION, CONCEPT, THOUGHT, INKLING
  • Ground truth: yes
Model Verdict Reason Score dt
gemma4:latest yes All four words are common synonyms for a mental concept or idea. PASS 0.7s
gemma4:26b yes All four words are direct synonyms for a mental concept, notion, or vague understanding. PASS 0.8s
gemma4:31b-it-q4_K_M yes All four words are common synonyms for a mental representation or a suggestion of an idea. PASS 2.2s

judge-n1

  • Category: Types of trees
  • Words: OAK, MAPLE, BIRCH, CARROT
  • Ground truth: no (misfit: CARROT)
Model Verdict Reason Score dt
gemma4:latest no Carrot is a root vegetable, not a type of tree. (misfit: CARROT) PASS 0.7s
gemma4:26b no Carrot is a root vegetable, not a type of tree. (misfit: CARROT) PASS 0.7s
gemma4:31b-it-q4_K_M no Carrot is a root vegetable, not a tree. (misfit: CARROT) PASS 2.1s

judge-n2

  • Category: Greek letters
  • Words: ALPHA, BETA, GAMMA, CYRILLIC
  • Ground truth: no (misfit: CYRILLIC)
Model Verdict Reason Score dt
gemma4:latest no Cyrillic is a script, not a Greek letter. (misfit: CYRILLIC) PASS 0.7s
gemma4:26b no CYRILLIC refers to a script, not a specific Greek letter. (misfit: CYRILLIC) PASS 0.8s
gemma4:31b-it-q4_K_M no Cyrillic is a writing system, not a Greek letter. (misfit: CYRILLIC) PASS 2.2s

judge-n3

  • Category: Synonyms for 'happy'
  • Words: JOYFUL, GLAD, SAD, ELATED
  • Ground truth: no (misfit: SAD)
Model Verdict Reason Score dt
gemma4:latest no The word 'SAD' is an antonym, not a synonym for 'happy'. (misfit: SAD) PASS 0.7s
gemma4:26b no The word 'SAD' is an antonym rather than a synonym. (misfit: SAD) PASS 0.8s
gemma4:31b-it-q4_K_M no SAD is an antonym of happy, not a synonym. (misfit: SAD) PASS 2.2s

judge-n4

  • Category: Days of the week
  • Words: MONDAY, JANUARY, SUNDAY, WEDNESDAY
  • Ground truth: no (misfit: JANUARY)
Model Verdict Reason Score dt
gemma4:latest no January is a month, not a day of the week. (misfit: JANUARY) PASS 0.7s
gemma4:26b no JANUARY is a month, not a day of the week. (misfit: JANUARY) PASS 0.7s
gemma4:31b-it-q4_K_M no JANUARY is a month, not a day of the week. (misfit: JANUARY) PASS 2.1s

judge-n5

  • Category: Body parts
  • Words: ARM, LEG, EYE, NIGHT
  • Ground truth: no (misfit: NIGHT)
Model Verdict Reason Score dt
gemma4:latest no NIGHT is not a body part. (misfit: NIGHT) PASS 0.6s
gemma4:26b no The word 'NIGHT' is not a body part. (misfit: NIGHT) PASS 0.8s
gemma4:31b-it-q4_K_M no NIGHT is a time of day, not a body part. (misfit: NIGHT) PASS 2.1s

judge-n6

  • Category: Types of birds
  • Words: CRANE, SWALLOW, BAT, MOSQUITO
  • Ground truth: no (misfit: BAT, MOSQUITO)
Model Verdict Reason Score dt
gemma4:latest no Bats and mosquitoes are not types of birds. (misfit: BAT, MOSQUITO) PASS 0.7s
gemma4:26b no BAT and MOSQUITO are mammals and insects, respectively, not birds. (misfit: BAT, MOSQUITO) PASS 0.8s
gemma4:31b-it-q4_K_M no Bat and mosquito are not birds. (misfit: BAT, MOSQUITO) PASS 2.1s

judge-n7

  • Category: Things that are red
  • Words: APPLE, BLUE, ROSE, GRASS
  • Ground truth: no (misfit: BLUE, GRASS)
Model Verdict Reason Score dt
gemma4:latest no Blue is a color, not necessarily a thing that is red. (misfit: BLUE) PASS 0.7s
gemma4:26b no BLUE and GRASS are not typically red. (misfit: BLUE, GRASS) PASS 0.7s
gemma4:31b-it-q4_K_M no Blue and grass are not red. (misfit: BLUE, GRASS) PASS 2.0s

judge-n8

  • Category: Words that follow COLD
  • Words: SHOULDER, FRONT, PIZZA, MOUNTAIN
  • Ground truth: no (misfit: PIZZA, MOUNTAIN)
Model Verdict Reason Score dt
gemma4:latest no Only SHOULDER, FRONT, and PIZZA fit common phrases with COLD, but MOUNTAIN does not. (misfit: MOUNTAIN) PASS 0.8s
gemma4:26b no While shoulder, front, and mountain can follow 'cold', 'cold pizza' is a description of temperature rather than a common PASS 0.9s
gemma4:31b-it-q4_K_M no Cold pizza and cold mountain are not established compound words or common idioms like cold shoulder and cold front. (mis PASS 2.7s

CREATE

create-e1

  • Category: Types of trees
  • Quality bar: Four valid tree species; e.g. OAK, MAPLE, BIRCH, PINE.
Model Verdict Reason Score dt
gemma4:latest Oak, Maple, Pine, Birch These are all common types of trees. MANUAL 0.6s
gemma4:26b Oak, Maple, Birch, Cedar These are all common species of trees. MANUAL 0.7s
gemma4:31b-it-q4_K_M Oak, Maple, Pine, Birch All four are common types of trees. MANUAL 1.9s

create-e2

  • Category: Greek letters
  • Quality bar: Four genuine Greek letters.
Model Verdict Reason Score dt
gemma4:latest Alpha, Beta, Gamma, Delta These are the first four letters of the Greek alphabet. MANUAL 0.6s
gemma4:26b ALPHA, BETA, GAMMA, DELTA These are the first four letters of the Greek alphabet. MANUAL 0.7s
gemma4:31b-it-q4_K_M Alpha, Beta, Gamma, Delta These are the first four letters of the Greek alphabet. MANUAL 1.9s

create-e3

  • Category: Synonyms for 'angry'
  • Quality bar: Four words that all genuinely mean angry/furious.
Model Verdict Reason Score dt
gemma4:latest furious, irate, enraged, mad All four words are direct synonyms for feeling intense anger. MANUAL 0.6s
gemma4:26b Irate, Livid, Furious, Incensed All four words are direct synonyms for being extremely angry. MANUAL 0.7s
gemma4:31b-it-q4_K_M Furious, Irate, Livid, Enraged All four words are strong synonyms for being extremely angry. MANUAL 2.1s

create-e4

  • Category: Days of the week
  • Quality bar: Four of the seven weekday names, no months or other words.
Model Verdict Reason Score dt
gemma4:latest Monday, Tuesday, Wednesday, Thursday These are four consecutive days of the week. MANUAL 0.6s
gemma4:26b Monday, Tuesday, Wednesday, Thursday These are all consecutive days of the week. MANUAL 0.7s
gemma4:31b-it-q4_K_M Monday, Wednesday, Friday, Sunday They are all days of the week. MANUAL 1.8s

create-m1

  • Category: ___ STORM (compound words ending in STORM)
  • Quality bar: Four words that each form a real compound or fixed phrase with STORM (e.g. SAND, BRAIN, THUNDER, SNOW).
Model Verdict Reason Score dt
gemma4:latest BLIZZARD, HEAT, STORM, SNOW These words can all precede 'STORM' to form a type of storm. MANUAL 0.7s
gemma4:26b BRAIN, DUST, FIRE, SNOW Each word forms a common compound word when followed by 'STORM'. MANUAL 0.7s
gemma4:31b-it-q4_K_M BRAIN, FIRE, THUNDER, SNOW All four words form common compound words when followed by STORM. MANUAL 2.1s

create-m2

  • Category: Synonyms for 'small'
  • Quality bar: Four words that all genuinely mean small.
Model Verdict Reason Score dt
gemma4:latest tiny, petite, minuscule, diminutive All four words are direct synonyms for 'small' in size. MANUAL 0.7s
gemma4:26b tiny, microscopic, diminutive, minuscule All four words are direct synonyms for something of very small size. MANUAL 0.7s
gemma4:31b-it-q4_K_M tiny, petite, slight, miniature All four words are direct synonyms for small in size. MANUAL 2.0s

create-m3

  • Category: Words that follow BLUE
  • Quality bar: Four words that each form a real compound with BLUE (e.g. BERRY, BIRD, PRINT, BELL, GRASS).
Model Verdict Reason Score dt
gemma4:latest JEANS, BIRD, MOON, BLOOD All words can follow 'BLUE' to form common phrases. MANUAL 0.7s
gemma4:26b BIRD, JAY, PRINT, WHALE Each word can be preceded by 'blue' to form a common term. MANUAL 0.8s
gemma4:31b-it-q4_K_M BERRY, JAY, MOON, PRINT All four words follow 'BLUE' to form common compound words or phrases. MANUAL 2.2s

create-m4

  • Category: Things that can be 'broken'
  • Quality bar: Four words that each form a real fixed phrase with 'broken' (heart, record, law, promise, etc.).
Model Verdict Reason Score dt
gemma4:latest heart, promise, record, bone All four items can be metaphorically or literally broken. MANUAL 0.6s
gemma4:26b Promise, Record, Heart, Law All four are common objects or concepts that can be broken. MANUAL 0.7s
gemma4:31b-it-q4_K_M Record, Promise, Silence, Heart All four are common abstract or physical things that can be broken. MANUAL 2.1s

create-h1

  • Category: Words that are homophones of body parts but spelled differently (e.g. HARE = hair, MUSSEL = muscle)
  • Quality bar: Four words that each sound like a body part but are spelled differently. Valid examples: HARE (hair), MUSSEL (muscle), HEAL (heel), SOUL (sole), AYE/EYE-spelled-otherwise. EYE and HEEL alone do NOT count -- those are the body parts themselves, not homophones of them.
Model Verdict Reason Score dt
gemma4:latest SEA, SEE, HEAR, HERE All four words are homophones of common body parts (sea/see for 'C' sound, hear/here for 'hear'). MANUAL 0.8s
gemma4:26b MUSSEL, HARE, EYE, KNEAD Each word is a homophone of a body part (muscle, hair, eye, knee) but spelled differently. MANUAL 0.9s
gemma4:31b-it-q4_K_M parse fail JSONDecodeError('Extra data: line 3 column 1 (char 133)') PARSE_FAIL 6.7s

create-h2

  • Category: Words that contain a body part as a substring (e.g. HEARTBEAT contains HEART)
  • Quality bar: Four words that each contain a body part anywhere inside them. Valid examples: HEARTH (HEART), CHESTNUT (CHEST), EARTH (EAR), HEADACHE (HEAD).
Model Verdict Reason Score dt
gemma4:latest KNEEHIGH, EARDRUM, STOMACHACHE, EYELASH Each word contains a body part name as a substring. MANUAL 0.7s
gemma4:26b HANDSOME, FOOTPRINT, EARNEST, ARMORY Each word contains a body part (hand, foot, ear, arm) as a substring. MANUAL 0.8s
gemma4:31b-it-q4_K_M HEARTBEAT, HANDSOME, FOOTPRINT, ARMCHAIR Each word contains a body part as a substring. MANUAL 2.2s

CREATIVE_ACCEPT

ca-y1

  • Words: SCALE, MOUNT, ASCEND, CLIMB
  • Player's category: "Verbs for going up"
  • Ground truth: yes
Model Verdict Reason Score dt
gemma4:latest yes All four words can be used as verbs meaning to go up. PASS 0.6s
gemma4:26b yes All four words are synonyms or closely related verbs describing the act of rising or ascending. PASS 0.7s
gemma4:31b-it-q4_K_M yes All four words are synonyms for the action of moving upward. PASS 2.1s

ca-y2

  • Words: APPLE, ORANGE, KIWI, BLACKBERRY
  • Player's category: "Fruits"
  • Ground truth: yes
Model Verdict Reason Score dt
gemma4:latest yes All four listed items are types of fruit. PASS 0.6s
gemma4:26b yes All four words are common types of fruit. PASS 0.7s
gemma4:31b-it-q4_K_M yes All four words are common types of fruit. PASS 1.8s

ca-y3

  • Words: WHIP, NUT, CODE, SMILE
  • Player's category: "Things you can crack"
  • Ground truth: yes
Model Verdict Reason Score dt
gemma4:latest yes You can crack a nut, a code, a smile, or a whip (like a whip crack). PASS 0.7s
gemma4:26b yes You can crack a whip, crack a nut, crack a code, and crack a smile. PASS 0.8s
gemma4:31b-it-q4_K_M yes All four words are common objects or concepts that can be 'cracked'. (weak: WHIP) PASS 2.2s

ca-y4

  • Words: BAT, BALL, GLOVE, MITT
  • Player's category: "Baseball equipment"
  • Ground truth: yes
Model Verdict Reason Score dt
gemma4:latest yes All four items are commonly associated with baseball. PASS 0.6s
gemma4:26b yes All four words are standard pieces of equipment used in the game of baseball. PASS 0.7s
gemma4:31b-it-q4_K_M yes All four items are standard pieces of equipment used in baseball. PASS 2.0s

ca-y5

  • Words: MARS, VENUS, MERCURY, JUPITER
  • Player's category: "Roman gods"
  • Ground truth: yes
Model Verdict Reason Score dt
gemma4:latest yes All four are names of Roman deities associated with planets. PASS 0.6s
gemma4:26b yes All four words are names of prominent deities in Roman mythology. PASS 0.7s
gemma4:31b-it-q4_K_M yes All four words are names of Roman deities as well as planets. PASS 2.0s

ca-n1

  • Words: OAK, MAPLE, BIRCH, PINE
  • Player's category: "Furniture brands"
  • Ground truth: no
Model Verdict Reason Score dt
gemma4:latest no These words are types of trees, not furniture brands. (weak: OAK, MAPLE, BIRCH, PINE) PASS 0.7s
gemma4:26b no These words represent types of trees rather than furniture brands. (weak: OAK, MAPLE, BIRCH, PINE) PASS 0.8s
gemma4:31b-it-q4_K_M no These are types of trees/woods, not brands of furniture. (weak: OAK, MAPLE, BIRCH, PINE) PASS 2.7s

ca-n2

  • Words: ALPHA, BETA, GAMMA, DELTA
  • Player's category: "Words meaning 'small'"
  • Ground truth: no
Model Verdict Reason Score dt
gemma4:latest no These words are Greek letters, not inherently meaning 'small'. (weak: ALPHA) PASS 0.6s
gemma4:26b no Alpha, beta, gamma, and delta are Greek letters used to denote various things, but none of them inherently mean 'small'. PASS 0.9s
gemma4:31b-it-q4_K_M no These are Greek letters used as designations, not synonyms for 'small'. (weak: ALPHA, BETA, GAMMA, DELTA) PASS 2.6s

ca-n3

  • Words: BAT, BALL, GLOVE, MITT
  • Player's category: "Things worn on your hand"
  • Ground truth: no
Model Verdict Reason Score dt
gemma4:latest no BAT and BALL are not things worn on the hand. (weak: BAT, BALL) PASS 0.7s
gemma4:26b no Bat and ball are not items worn on the hand. (weak: BAT, BALL) PASS 0.8s
gemma4:31b-it-q4_K_M no A bat and a ball are not worn on the hand. (weak: BAT, BALL) PASS 2.3s

ca-n4

  • Words: MONDAY, FRIDAY, SUNDAY, WEDNESDAY
  • Player's category: "Months of the year"
  • Ground truth: no
Model Verdict Reason Score dt
gemma4:latest no All the words listed are days of the week, not months of the year. (weak: MONDAY, FRIDAY, SUNDAY, WEDNESDAY) PASS 0.8s
gemma4:26b no None of the words are months of the year; they are all days of the week. (weak: MONDAY, FRIDAY, SUNDAY, WEDNESDAY) PASS 0.9s
gemma4:31b-it-q4_K_M no The words are days of the week, not months of the year. (weak: MONDAY, FRIDAY, SUNDAY, WEDNESDAY) PASS 2.6s

ca-n5

  • Words: WHIP, NUT, CODE, SMILE
  • Player's category: "Things found in a kitchen"
  • Ground truth: no
Model Verdict Reason Score dt
gemma4:latest no Nut and smile are not typically considered kitchen items. (weak: NUT, SMILE) PASS 0.7s
gemma4:26b no While whip and nut can relate to food, code and smile have no plausible connection to kitchen items. (weak: CODE, SMILE) PASS 0.8s
gemma4:31b-it-q4_K_M no Code and smile are not items typically found in a kitchen. (weak: CODE, SMILE) PASS 2.4s

ca-b1

  • Words: APPLE, ORANGE, KIWI, BLACKBERRY
  • Player's category: "Tech/phone brands"
  • Ground truth: borderline -- APPLE and BLACKBERRY clearly are tech brands; ORANGE is a EU/UK telecom carrier (defensible); KIWI is not a tech brand. Strict grader says no; lenient grader might accept ORANGE.
Model Verdict Reason Score dt
gemma4:latest no None of these words are recognized tech or phone brands. (weak: APPLE) BORDERLINE 0.6s
gemma4:26b yes All four words are names of well-known technology companies or mobile device brands. BORDERLINE 0.7s
gemma4:31b-it-q4_K_M no Orange and Kiwi are not recognized tech or phone brands. (weak: ORANGE, KIWI) BORDERLINE 2.3s

ca-b2

  • Words: SHARP, FLAT, NATURAL, KEY
  • Player's category: "Real estate listing terms"
  • Ground truth: borderline -- FLAT (UK apartment) and KEY (keys) connect; SHARP (a sharp property) and NATURAL (natural light?) are weak. Strict no; creative yes.
Model Verdict Reason Score dt
gemma4:latest no Only SHARP and FLAT are common musical terms, while KEY and NATURAL are also musical terms, making 'Real estate listing BORDERLINE 0.8s
gemma4:26b no While 'flat' and 'key' can relate to real estate, 'sharp' and 'natural' have no plausible connection to real estate list BORDERLINE 0.9s
gemma4:31b-it-q4_K_M no Natural and Sharp are not standard terms used in real estate listings. (weak: NATURAL, SHARP) BORDERLINE 2.4s

Manual grading of CREATE outputs

Auto-grading isn't possible for CREATE — graded by hand here.

Case gemma4:latest (8B) gemma4:26b gemma4:31b
e1 Trees OAK MAPLE PINE BIRCH ✓ OAK MAPLE BIRCH CEDAR ✓ OAK MAPLE PINE BIRCH ✓
e2 Greek letters ALPHA BETA GAMMA DELTA ✓ ALPHA BETA GAMMA DELTA ✓ ALPHA BETA GAMMA DELTA ✓
e3 "Angry" synonyms furious irate enraged mad ✓ irate livid furious incensed ✓ furious irate livid enraged ✓
e4 Days of week Mon Tue Wed Thu ✓ Mon Tue Wed Thu ✓ Mon Wed Fri Sun ✓
m1 ___ STORM FAIL — listed STORM itself; HEAT-storm not standard; BLIZZARD-storm redundant brain dust fire snow ✓ brain fire thunder snow ✓
m2 "Small" synonyms tiny petite minuscule diminutive ✓ tiny microscopic diminutive minuscule ✓ tiny petite slight miniature ✓
m3 Words after BLUE jeans bird moon blood ✓ bird jay print whale ✓ berry jay moon print ✓
m4 "Broken" things heart promise record bone ✓ promise record heart law ✓ record promise silence heart ✓
h1 Body-part homophones FAIL — gave SEA/SEE/HEAR/HERE, missed the body-part requirement entirely PARTIAL — MUSSEL/HARE/KNEAD ✓ but EYE is the body part, not a homophone of one PARSE_FAIL (after 3 retries)
h2 Containing body parts KNEEHIGH EARDRUM STOMACHACHE EYELASH ✓ HANDSOME FOOTPRINT EARNEST ARMORY ✓ HEARTBEAT HANDSOME FOOTPRINT ARMCHAIR ✓

CREATE pass rate: 8B = 8/10 (e/m clean, both hard cases fail) · 26b = 9/10 (1 partial on h1) · 31b = 9/10 (1 parse-fail on h1)

The h1 failure is consistent with the prior puzzle bakeoff (#2-26b had the same "homophones of body parts" failure: words that ARE body parts vs words that SOUND LIKE them). All three models share this confusion. Designs depending on deep wordplay categories like this need either prompt scaffolding (give a worked example) or human curation of category seeds.

Aggregate

Model JUDGE CREATE CREATIVE_ACCEPT Borderline Avg s Notes
gemma4:latest (8B) 14/16 8/10 10/10 0/2 strict-aligned 0.7 Fastest. Slight bias toward "no" on hard YES cases (judge-y3 days-of-week miss, judge-y6 cold-turkey miss)
gemma4:26b 15/16 9/10 10/10 1/2 over-permissive (said KIWI is a tech brand) 0.8 Best speed/quality balance for live judging. Shows mild "be helpful, agree" bias
gemma4:31b-it-q4_K_M 16/16 9/10 10/10 2/2 strict 2.3 Most accurate. Only candidate for once-per-day generation. 1 parse-fail on h1 (3 retries didn't recover)

What this proves

  1. The CREATIVE_ACCEPT axis works on every model tested. This is the structural unlock that makes a Gemma-powered Connections derivative meaningfully different from the static NYT format: live, fair judging of player-invented groupings. 10/10 across 3 models on 5 valid + 5 invalid player categories — accept WHIP/NUT/CODE/SMILE for "things you can crack", reject OAK/MAPLE/BIRCH/PINE for "furniture brands", reject MONDAY/FRIDAY/SUNDAY/WEDNESDAY for "months." The model gets the distinction Connections cares about.
  2. Per-guess JUDGE economics are cheap. 0.7-0.8s on the 3090 Ti for 8B/26b. Even 1000 player-guesses/day costs <15 GPU-minutes — effectively free.
  3. 31b is the right generator (validated earlier) and the right offline critique judge. 26b/8B are the right live judges.
  4. Hard wordplay categories (homophones-of-body-parts class) consistently fail across all three models. Either avoid them or scaffold with examples.

What this doesn't prove (limits / next steps)

  • Borderline-case behavior is model-specific. 26b said yes to KIWI as tech brand — that's a real false-positive risk for the CREATIVE_ACCEPT design. If the live game uses 26b, it will sometimes accept groupings a strict grader would reject. 8B's stricter bias makes it safer here despite the 87.5% JUDGE rate; 31b is consistent and would be the gold standard but is too slow for live use.
  • Cultural/contextual categories untested. "Words in a Beatles song", "Things only true after 2020" — these may break the judge in ways simple semantics don't.
  • No adversarial player. What if the player invents a category to deliberately game the system into accepting a near-wrong grouping? E.g. "Words that contain a vowel" trivially fits any 4 English words. Need a category-tightness check on player input, not just word-fit.
  • Ground truth is mine and arguable. judge-y5 (___ HOUSE — 26b said "courthouse is one word, not 'court' + 'house'") is a defensible call I marked as a fail. Real human-grader agreement might bump 26b to 16/16 too.