5a2a02e483
This repo opens with the design-discovery work completed before any product code is written. Two model bakeoffs against gemma4:8b/26b/31b on a local Ollama established that: - Whole-puzzle generation in the Connections shape is unreliable on Gemma 4 (gemma4:31b ~50% structural-pass, gemma4:26b ~20-30%); 31b is intentionally out of project scope, so the generation route is harder still. - Atomic semantic-judging skills are reliable: 87.5%/93.75%/100% (8B/26b/31b) on JUDGE; *all three models* scored 10/10 on CREATIVE_ACCEPT — fair judging of player-INVENTED categories. That is the structural unlock vs static hand-curated word games. The README contains the full writeup, the test bench, and a brainstormed bank of 10 distinct game-mechanics ideas across the fast/medium/slow tempo range, plus a primitives table for recombination. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
27 KiB
27 KiB
Gemma 4 Semantic Bakeoff -- 2026-04-27 22:51
Setup
- Local Ollama on the test host (RTX 3090 Ti, 24 GB VRAM)
- Models:
gemma4:latest,gemma4:26b,gemma4:31b-it-q4_K_M - Temperature 0.2 (raised +0.1 per retry on JSON parse fail, max 3 attempts)
- think=false, num_ctx=4096, num_predict=512, no format=json (per gemma4-research/GOTCHAS.md)
- 38 test cases: 16 JUDGE, 10 CREATE, 12 CREATIVE_ACCEPT
- Ground truth hand-labeled inline in
scripts/gemma-semantic-bakeoff.py
Auto-scored summary
| Model | JUDGE pass | CREATIVE_ACCEPT pass | parse fails | avg s |
|---|---|---|---|---|
gemma4:latest |
14/16 | 10/10 | 0 | 0.7 |
gemma4:26b |
15/16 | 10/10 | 0 | 0.8 |
gemma4:31b-it-q4_K_M |
16/16 | 10/10 | 1 | 2.3 |
JUDGE
judge-y1
- Category:
Types of trees - Words:
OAK,MAPLE,BIRCH,PINE - Ground truth: yes
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
yes | All four words are common names of tree species. | PASS | 0.7s |
gemma4:26b |
yes | All four words are specific, well-known types of trees. | PASS | 0.8s |
gemma4:31b-it-q4_K_M |
yes | All four words are common types of trees. | PASS | 2.1s |
judge-y2
- Category:
Greek letters - Words:
ALPHA,BETA,GAMMA,DELTA - Ground truth: yes
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
yes | All four words are names of Greek letters. | PASS | 0.6s |
gemma4:26b |
yes | All four words are standard names of letters in the Greek alphabet. | PASS | 0.7s |
gemma4:31b-it-q4_K_M |
yes | All four words are the first four letters of the Greek alphabet. | PASS | 2.1s |
judge-y3
- Category:
Days of the week - Words:
MONDAY,FRIDAY,SUNDAY,WEDNESDAY - Ground truth: yes
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
no | The words are not all representative of the days of the week. (misfit: MONDAY, FRIDAY, SUNDAY, WEDNESDAY) | FAIL | 0.8s |
gemma4:26b |
yes | All four words are specific names of days in the week. | PASS | 0.7s |
gemma4:31b-it-q4_K_M |
yes | All four words are standard days of the week. | PASS | 1.9s |
judge-y4
- Category:
Synonyms for 'happy' - Words:
JOYFUL,GLAD,CHEERFUL,ELATED - Ground truth: yes
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
yes | All four words are direct and common synonyms for 'happy'. | PASS | 0.6s |
gemma4:26b |
yes | All four words are direct synonyms representing different degrees of happiness. | PASS | 0.7s |
gemma4:31b-it-q4_K_M |
yes | All four words are direct synonyms for the state of being happy. | PASS | 2.1s |
judge-y5
- Category:
___ HOUSE (compound words) - Words:
GREEN,ICE,FIRE,COURT - Ground truth: yes
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
yes | All four words can precede 'HOUSE' to form common compound words. | PASS | 0.7s |
gemma4:26b |
no | While Green, Ice, and Fire House are common, 'Court House' is typically written as two words (Courthouse) or refers to a | FAIL | 0.9s |
gemma4:31b-it-q4_K_M |
yes | All four words form common compound words or phrases with 'house'. | PASS | 2.1s |
judge-y6
- Category:
Words that follow COLD - Words:
SHOULDER,FRONT,SNAP,TURKEY - Ground truth: yes
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
no | Only SHOULDER, FRONT, and SNAP fit common phrases with COLD, while TURKEY does not. (misfit: TURKEY) | FAIL | 0.8s |
gemma4:26b |
yes | All four words form common phrases when preceded by 'cold': cold shoulder, cold front, cold snap, and cold turkey. | PASS | 0.9s |
gemma4:31b-it-q4_K_M |
yes | All four words form common compound words or phrases when preceded by 'cold'. | PASS | 2.1s |
judge-y7
- Category:
Verbs meaning 'to move quickly' - Words:
DART,BOLT,RUSH,FLY - Ground truth: yes
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
yes | All four words are strong synonyms for moving rapidly. | PASS | 0.6s |
gemma4:26b |
yes | All four words are synonyms for moving at high speed. | PASS | 0.7s |
gemma4:31b-it-q4_K_M |
yes | All four words are common synonyms for moving rapidly. | PASS | 2.0s |
judge-y8
- Category:
Synonyms for 'idea' - Words:
NOTION,CONCEPT,THOUGHT,INKLING - Ground truth: yes
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
yes | All four words are common synonyms for a mental concept or idea. | PASS | 0.7s |
gemma4:26b |
yes | All four words are direct synonyms for a mental concept, notion, or vague understanding. | PASS | 0.8s |
gemma4:31b-it-q4_K_M |
yes | All four words are common synonyms for a mental representation or a suggestion of an idea. | PASS | 2.2s |
judge-n1
- Category:
Types of trees - Words:
OAK,MAPLE,BIRCH,CARROT - Ground truth: no (misfit: CARROT)
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
no | Carrot is a root vegetable, not a type of tree. (misfit: CARROT) | PASS | 0.7s |
gemma4:26b |
no | Carrot is a root vegetable, not a type of tree. (misfit: CARROT) | PASS | 0.7s |
gemma4:31b-it-q4_K_M |
no | Carrot is a root vegetable, not a tree. (misfit: CARROT) | PASS | 2.1s |
judge-n2
- Category:
Greek letters - Words:
ALPHA,BETA,GAMMA,CYRILLIC - Ground truth: no (misfit: CYRILLIC)
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
no | Cyrillic is a script, not a Greek letter. (misfit: CYRILLIC) | PASS | 0.7s |
gemma4:26b |
no | CYRILLIC refers to a script, not a specific Greek letter. (misfit: CYRILLIC) | PASS | 0.8s |
gemma4:31b-it-q4_K_M |
no | Cyrillic is a writing system, not a Greek letter. (misfit: CYRILLIC) | PASS | 2.2s |
judge-n3
- Category:
Synonyms for 'happy' - Words:
JOYFUL,GLAD,SAD,ELATED - Ground truth: no (misfit: SAD)
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
no | The word 'SAD' is an antonym, not a synonym for 'happy'. (misfit: SAD) | PASS | 0.7s |
gemma4:26b |
no | The word 'SAD' is an antonym rather than a synonym. (misfit: SAD) | PASS | 0.8s |
gemma4:31b-it-q4_K_M |
no | SAD is an antonym of happy, not a synonym. (misfit: SAD) | PASS | 2.2s |
judge-n4
- Category:
Days of the week - Words:
MONDAY,JANUARY,SUNDAY,WEDNESDAY - Ground truth: no (misfit: JANUARY)
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
no | January is a month, not a day of the week. (misfit: JANUARY) | PASS | 0.7s |
gemma4:26b |
no | JANUARY is a month, not a day of the week. (misfit: JANUARY) | PASS | 0.7s |
gemma4:31b-it-q4_K_M |
no | JANUARY is a month, not a day of the week. (misfit: JANUARY) | PASS | 2.1s |
judge-n5
- Category:
Body parts - Words:
ARM,LEG,EYE,NIGHT - Ground truth: no (misfit: NIGHT)
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
no | NIGHT is not a body part. (misfit: NIGHT) | PASS | 0.6s |
gemma4:26b |
no | The word 'NIGHT' is not a body part. (misfit: NIGHT) | PASS | 0.8s |
gemma4:31b-it-q4_K_M |
no | NIGHT is a time of day, not a body part. (misfit: NIGHT) | PASS | 2.1s |
judge-n6
- Category:
Types of birds - Words:
CRANE,SWALLOW,BAT,MOSQUITO - Ground truth: no (misfit: BAT, MOSQUITO)
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
no | Bats and mosquitoes are not types of birds. (misfit: BAT, MOSQUITO) | PASS | 0.7s |
gemma4:26b |
no | BAT and MOSQUITO are mammals and insects, respectively, not birds. (misfit: BAT, MOSQUITO) | PASS | 0.8s |
gemma4:31b-it-q4_K_M |
no | Bat and mosquito are not birds. (misfit: BAT, MOSQUITO) | PASS | 2.1s |
judge-n7
- Category:
Things that are red - Words:
APPLE,BLUE,ROSE,GRASS - Ground truth: no (misfit: BLUE, GRASS)
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
no | Blue is a color, not necessarily a thing that is red. (misfit: BLUE) | PASS | 0.7s |
gemma4:26b |
no | BLUE and GRASS are not typically red. (misfit: BLUE, GRASS) | PASS | 0.7s |
gemma4:31b-it-q4_K_M |
no | Blue and grass are not red. (misfit: BLUE, GRASS) | PASS | 2.0s |
judge-n8
- Category:
Words that follow COLD - Words:
SHOULDER,FRONT,PIZZA,MOUNTAIN - Ground truth: no (misfit: PIZZA, MOUNTAIN)
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
no | Only SHOULDER, FRONT, and PIZZA fit common phrases with COLD, but MOUNTAIN does not. (misfit: MOUNTAIN) | PASS | 0.8s |
gemma4:26b |
no | While shoulder, front, and mountain can follow 'cold', 'cold pizza' is a description of temperature rather than a common | PASS | 0.9s |
gemma4:31b-it-q4_K_M |
no | Cold pizza and cold mountain are not established compound words or common idioms like cold shoulder and cold front. (mis | PASS | 2.7s |
CREATE
create-e1
- Category:
Types of trees - Quality bar: Four valid tree species; e.g. OAK, MAPLE, BIRCH, PINE.
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
Oak, Maple, Pine, Birch | These are all common types of trees. | MANUAL | 0.6s |
gemma4:26b |
Oak, Maple, Birch, Cedar | These are all common species of trees. | MANUAL | 0.7s |
gemma4:31b-it-q4_K_M |
Oak, Maple, Pine, Birch | All four are common types of trees. | MANUAL | 1.9s |
create-e2
- Category:
Greek letters - Quality bar: Four genuine Greek letters.
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
Alpha, Beta, Gamma, Delta | These are the first four letters of the Greek alphabet. | MANUAL | 0.6s |
gemma4:26b |
ALPHA, BETA, GAMMA, DELTA | These are the first four letters of the Greek alphabet. | MANUAL | 0.7s |
gemma4:31b-it-q4_K_M |
Alpha, Beta, Gamma, Delta | These are the first four letters of the Greek alphabet. | MANUAL | 1.9s |
create-e3
- Category:
Synonyms for 'angry' - Quality bar: Four words that all genuinely mean angry/furious.
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
furious, irate, enraged, mad | All four words are direct synonyms for feeling intense anger. | MANUAL | 0.6s |
gemma4:26b |
Irate, Livid, Furious, Incensed | All four words are direct synonyms for being extremely angry. | MANUAL | 0.7s |
gemma4:31b-it-q4_K_M |
Furious, Irate, Livid, Enraged | All four words are strong synonyms for being extremely angry. | MANUAL | 2.1s |
create-e4
- Category:
Days of the week - Quality bar: Four of the seven weekday names, no months or other words.
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
Monday, Tuesday, Wednesday, Thursday | These are four consecutive days of the week. | MANUAL | 0.6s |
gemma4:26b |
Monday, Tuesday, Wednesday, Thursday | These are all consecutive days of the week. | MANUAL | 0.7s |
gemma4:31b-it-q4_K_M |
Monday, Wednesday, Friday, Sunday | They are all days of the week. | MANUAL | 1.8s |
create-m1
- Category:
___ STORM (compound words ending in STORM) - Quality bar: Four words that each form a real compound or fixed phrase with STORM (e.g. SAND, BRAIN, THUNDER, SNOW).
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
BLIZZARD, HEAT, STORM, SNOW | These words can all precede 'STORM' to form a type of storm. | MANUAL | 0.7s |
gemma4:26b |
BRAIN, DUST, FIRE, SNOW | Each word forms a common compound word when followed by 'STORM'. | MANUAL | 0.7s |
gemma4:31b-it-q4_K_M |
BRAIN, FIRE, THUNDER, SNOW | All four words form common compound words when followed by STORM. | MANUAL | 2.1s |
create-m2
- Category:
Synonyms for 'small' - Quality bar: Four words that all genuinely mean small.
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
tiny, petite, minuscule, diminutive | All four words are direct synonyms for 'small' in size. | MANUAL | 0.7s |
gemma4:26b |
tiny, microscopic, diminutive, minuscule | All four words are direct synonyms for something of very small size. | MANUAL | 0.7s |
gemma4:31b-it-q4_K_M |
tiny, petite, slight, miniature | All four words are direct synonyms for small in size. | MANUAL | 2.0s |
create-m3
- Category:
Words that follow BLUE - Quality bar: Four words that each form a real compound with BLUE (e.g. BERRY, BIRD, PRINT, BELL, GRASS).
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
JEANS, BIRD, MOON, BLOOD | All words can follow 'BLUE' to form common phrases. | MANUAL | 0.7s |
gemma4:26b |
BIRD, JAY, PRINT, WHALE | Each word can be preceded by 'blue' to form a common term. | MANUAL | 0.8s |
gemma4:31b-it-q4_K_M |
BERRY, JAY, MOON, PRINT | All four words follow 'BLUE' to form common compound words or phrases. | MANUAL | 2.2s |
create-m4
- Category:
Things that can be 'broken' - Quality bar: Four words that each form a real fixed phrase with 'broken' (heart, record, law, promise, etc.).
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
heart, promise, record, bone | All four items can be metaphorically or literally broken. | MANUAL | 0.6s |
gemma4:26b |
Promise, Record, Heart, Law | All four are common objects or concepts that can be broken. | MANUAL | 0.7s |
gemma4:31b-it-q4_K_M |
Record, Promise, Silence, Heart | All four are common abstract or physical things that can be broken. | MANUAL | 2.1s |
create-h1
- Category:
Words that are homophones of body parts but spelled differently (e.g. HARE = hair, MUSSEL = muscle) - Quality bar: Four words that each sound like a body part but are spelled differently. Valid examples: HARE (hair), MUSSEL (muscle), HEAL (heel), SOUL (sole), AYE/EYE-spelled-otherwise. EYE and HEEL alone do NOT count -- those are the body parts themselves, not homophones of them.
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
SEA, SEE, HEAR, HERE | All four words are homophones of common body parts (sea/see for 'C' sound, hear/here for 'hear'). | MANUAL | 0.8s |
gemma4:26b |
MUSSEL, HARE, EYE, KNEAD | Each word is a homophone of a body part (muscle, hair, eye, knee) but spelled differently. | MANUAL | 0.9s |
gemma4:31b-it-q4_K_M |
parse fail | JSONDecodeError('Extra data: line 3 column 1 (char 133)') |
PARSE_FAIL | 6.7s |
create-h2
- Category:
Words that contain a body part as a substring (e.g. HEARTBEAT contains HEART) - Quality bar: Four words that each contain a body part anywhere inside them. Valid examples: HEARTH (HEART), CHESTNUT (CHEST), EARTH (EAR), HEADACHE (HEAD).
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
KNEEHIGH, EARDRUM, STOMACHACHE, EYELASH | Each word contains a body part name as a substring. | MANUAL | 0.7s |
gemma4:26b |
HANDSOME, FOOTPRINT, EARNEST, ARMORY | Each word contains a body part (hand, foot, ear, arm) as a substring. | MANUAL | 0.8s |
gemma4:31b-it-q4_K_M |
HEARTBEAT, HANDSOME, FOOTPRINT, ARMCHAIR | Each word contains a body part as a substring. | MANUAL | 2.2s |
CREATIVE_ACCEPT
ca-y1
- Words:
SCALE,MOUNT,ASCEND,CLIMB - Player's category:
"Verbs for going up" - Ground truth: yes
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
yes | All four words can be used as verbs meaning to go up. | PASS | 0.6s |
gemma4:26b |
yes | All four words are synonyms or closely related verbs describing the act of rising or ascending. | PASS | 0.7s |
gemma4:31b-it-q4_K_M |
yes | All four words are synonyms for the action of moving upward. | PASS | 2.1s |
ca-y2
- Words:
APPLE,ORANGE,KIWI,BLACKBERRY - Player's category:
"Fruits" - Ground truth: yes
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
yes | All four listed items are types of fruit. | PASS | 0.6s |
gemma4:26b |
yes | All four words are common types of fruit. | PASS | 0.7s |
gemma4:31b-it-q4_K_M |
yes | All four words are common types of fruit. | PASS | 1.8s |
ca-y3
- Words:
WHIP,NUT,CODE,SMILE - Player's category:
"Things you can crack" - Ground truth: yes
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
yes | You can crack a nut, a code, a smile, or a whip (like a whip crack). | PASS | 0.7s |
gemma4:26b |
yes | You can crack a whip, crack a nut, crack a code, and crack a smile. | PASS | 0.8s |
gemma4:31b-it-q4_K_M |
yes | All four words are common objects or concepts that can be 'cracked'. (weak: WHIP) | PASS | 2.2s |
ca-y4
- Words:
BAT,BALL,GLOVE,MITT - Player's category:
"Baseball equipment" - Ground truth: yes
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
yes | All four items are commonly associated with baseball. | PASS | 0.6s |
gemma4:26b |
yes | All four words are standard pieces of equipment used in the game of baseball. | PASS | 0.7s |
gemma4:31b-it-q4_K_M |
yes | All four items are standard pieces of equipment used in baseball. | PASS | 2.0s |
ca-y5
- Words:
MARS,VENUS,MERCURY,JUPITER - Player's category:
"Roman gods" - Ground truth: yes
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
yes | All four are names of Roman deities associated with planets. | PASS | 0.6s |
gemma4:26b |
yes | All four words are names of prominent deities in Roman mythology. | PASS | 0.7s |
gemma4:31b-it-q4_K_M |
yes | All four words are names of Roman deities as well as planets. | PASS | 2.0s |
ca-n1
- Words:
OAK,MAPLE,BIRCH,PINE - Player's category:
"Furniture brands" - Ground truth: no
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
no | These words are types of trees, not furniture brands. (weak: OAK, MAPLE, BIRCH, PINE) | PASS | 0.7s |
gemma4:26b |
no | These words represent types of trees rather than furniture brands. (weak: OAK, MAPLE, BIRCH, PINE) | PASS | 0.8s |
gemma4:31b-it-q4_K_M |
no | These are types of trees/woods, not brands of furniture. (weak: OAK, MAPLE, BIRCH, PINE) | PASS | 2.7s |
ca-n2
- Words:
ALPHA,BETA,GAMMA,DELTA - Player's category:
"Words meaning 'small'" - Ground truth: no
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
no | These words are Greek letters, not inherently meaning 'small'. (weak: ALPHA) | PASS | 0.6s |
gemma4:26b |
no | Alpha, beta, gamma, and delta are Greek letters used to denote various things, but none of them inherently mean 'small'. | PASS | 0.9s |
gemma4:31b-it-q4_K_M |
no | These are Greek letters used as designations, not synonyms for 'small'. (weak: ALPHA, BETA, GAMMA, DELTA) | PASS | 2.6s |
ca-n3
- Words:
BAT,BALL,GLOVE,MITT - Player's category:
"Things worn on your hand" - Ground truth: no
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
no | BAT and BALL are not things worn on the hand. (weak: BAT, BALL) | PASS | 0.7s |
gemma4:26b |
no | Bat and ball are not items worn on the hand. (weak: BAT, BALL) | PASS | 0.8s |
gemma4:31b-it-q4_K_M |
no | A bat and a ball are not worn on the hand. (weak: BAT, BALL) | PASS | 2.3s |
ca-n4
- Words:
MONDAY,FRIDAY,SUNDAY,WEDNESDAY - Player's category:
"Months of the year" - Ground truth: no
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
no | All the words listed are days of the week, not months of the year. (weak: MONDAY, FRIDAY, SUNDAY, WEDNESDAY) | PASS | 0.8s |
gemma4:26b |
no | None of the words are months of the year; they are all days of the week. (weak: MONDAY, FRIDAY, SUNDAY, WEDNESDAY) | PASS | 0.9s |
gemma4:31b-it-q4_K_M |
no | The words are days of the week, not months of the year. (weak: MONDAY, FRIDAY, SUNDAY, WEDNESDAY) | PASS | 2.6s |
ca-n5
- Words:
WHIP,NUT,CODE,SMILE - Player's category:
"Things found in a kitchen" - Ground truth: no
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
no | Nut and smile are not typically considered kitchen items. (weak: NUT, SMILE) | PASS | 0.7s |
gemma4:26b |
no | While whip and nut can relate to food, code and smile have no plausible connection to kitchen items. (weak: CODE, SMILE) | PASS | 0.8s |
gemma4:31b-it-q4_K_M |
no | Code and smile are not items typically found in a kitchen. (weak: CODE, SMILE) | PASS | 2.4s |
ca-b1
- Words:
APPLE,ORANGE,KIWI,BLACKBERRY - Player's category:
"Tech/phone brands" - Ground truth: borderline -- APPLE and BLACKBERRY clearly are tech brands; ORANGE is a EU/UK telecom carrier (defensible); KIWI is not a tech brand. Strict grader says no; lenient grader might accept ORANGE.
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
no | None of these words are recognized tech or phone brands. (weak: APPLE) | BORDERLINE | 0.6s |
gemma4:26b |
yes | All four words are names of well-known technology companies or mobile device brands. | BORDERLINE | 0.7s |
gemma4:31b-it-q4_K_M |
no | Orange and Kiwi are not recognized tech or phone brands. (weak: ORANGE, KIWI) | BORDERLINE | 2.3s |
ca-b2
- Words:
SHARP,FLAT,NATURAL,KEY - Player's category:
"Real estate listing terms" - Ground truth: borderline -- FLAT (UK apartment) and KEY (keys) connect; SHARP (a sharp property) and NATURAL (natural light?) are weak. Strict no; creative yes.
| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
gemma4:latest |
no | Only SHARP and FLAT are common musical terms, while KEY and NATURAL are also musical terms, making 'Real estate listing | BORDERLINE | 0.8s |
gemma4:26b |
no | While 'flat' and 'key' can relate to real estate, 'sharp' and 'natural' have no plausible connection to real estate list | BORDERLINE | 0.9s |
gemma4:31b-it-q4_K_M |
no | Natural and Sharp are not standard terms used in real estate listings. (weak: NATURAL, SHARP) | BORDERLINE | 2.4s |
Manual grading of CREATE outputs
Auto-grading isn't possible for CREATE — graded by hand here.
| Case | gemma4:latest (8B) | gemma4:26b | gemma4:31b |
|---|---|---|---|
| e1 Trees | OAK MAPLE PINE BIRCH ✓ | OAK MAPLE BIRCH CEDAR ✓ | OAK MAPLE PINE BIRCH ✓ |
| e2 Greek letters | ALPHA BETA GAMMA DELTA ✓ | ALPHA BETA GAMMA DELTA ✓ | ALPHA BETA GAMMA DELTA ✓ |
| e3 "Angry" synonyms | furious irate enraged mad ✓ | irate livid furious incensed ✓ | furious irate livid enraged ✓ |
| e4 Days of week | Mon Tue Wed Thu ✓ | Mon Tue Wed Thu ✓ | Mon Wed Fri Sun ✓ |
| m1 ___ STORM | FAIL — listed STORM itself; HEAT-storm not standard; BLIZZARD-storm redundant | brain dust fire snow ✓ | brain fire thunder snow ✓ |
| m2 "Small" synonyms | tiny petite minuscule diminutive ✓ | tiny microscopic diminutive minuscule ✓ | tiny petite slight miniature ✓ |
| m3 Words after BLUE | jeans bird moon blood ✓ | bird jay print whale ✓ | berry jay moon print ✓ |
| m4 "Broken" things | heart promise record bone ✓ | promise record heart law ✓ | record promise silence heart ✓ |
| h1 Body-part homophones | FAIL — gave SEA/SEE/HEAR/HERE, missed the body-part requirement entirely | PARTIAL — MUSSEL/HARE/KNEAD ✓ but EYE is the body part, not a homophone of one | PARSE_FAIL (after 3 retries) |
| h2 Containing body parts | KNEEHIGH EARDRUM STOMACHACHE EYELASH ✓ | HANDSOME FOOTPRINT EARNEST ARMORY ✓ | HEARTBEAT HANDSOME FOOTPRINT ARMCHAIR ✓ |
CREATE pass rate: 8B = 8/10 (e/m clean, both hard cases fail) · 26b = 9/10 (1 partial on h1) · 31b = 9/10 (1 parse-fail on h1)
The h1 failure is consistent with the prior puzzle bakeoff (#2-26b had the same "homophones of body parts" failure: words that ARE body parts vs words that SOUND LIKE them). All three models share this confusion. Designs depending on deep wordplay categories like this need either prompt scaffolding (give a worked example) or human curation of category seeds.
Aggregate
| Model | JUDGE | CREATE | CREATIVE_ACCEPT | Borderline | Avg s | Notes |
|---|---|---|---|---|---|---|
gemma4:latest (8B) |
14/16 | 8/10 | 10/10 | 0/2 strict-aligned | 0.7 | Fastest. Slight bias toward "no" on hard YES cases (judge-y3 days-of-week miss, judge-y6 cold-turkey miss) |
gemma4:26b |
15/16 | 9/10 | 10/10 | 1/2 over-permissive (said KIWI is a tech brand) | 0.8 | Best speed/quality balance for live judging. Shows mild "be helpful, agree" bias |
gemma4:31b-it-q4_K_M |
16/16 | 9/10 | 10/10 | 2/2 strict | 2.3 | Most accurate. Only candidate for once-per-day generation. 1 parse-fail on h1 (3 retries didn't recover) |
What this proves
- The CREATIVE_ACCEPT axis works on every model tested. This is the structural unlock that makes a Gemma-powered Connections derivative meaningfully different from the static NYT format: live, fair judging of player-invented groupings. 10/10 across 3 models on 5 valid + 5 invalid player categories — accept WHIP/NUT/CODE/SMILE for "things you can crack", reject OAK/MAPLE/BIRCH/PINE for "furniture brands", reject MONDAY/FRIDAY/SUNDAY/WEDNESDAY for "months." The model gets the distinction Connections cares about.
- Per-guess JUDGE economics are cheap. 0.7-0.8s on the 3090 Ti for 8B/26b. Even 1000 player-guesses/day costs <15 GPU-minutes — effectively free.
- 31b is the right generator (validated earlier) and the right offline critique judge. 26b/8B are the right live judges.
- Hard wordplay categories (homophones-of-body-parts class) consistently fail across all three models. Either avoid them or scaffold with examples.
What this doesn't prove (limits / next steps)
- Borderline-case behavior is model-specific. 26b said yes to KIWI as tech brand — that's a real false-positive risk for the CREATIVE_ACCEPT design. If the live game uses 26b, it will sometimes accept groupings a strict grader would reject. 8B's stricter bias makes it safer here despite the 87.5% JUDGE rate; 31b is consistent and would be the gold standard but is too slow for live use.
- Cultural/contextual categories untested. "Words in a Beatles song", "Things only true after 2020" — these may break the judge in ways simple semantics don't.
- No adversarial player. What if the player invents a category to deliberately game the system into accepting a near-wrong grouping? E.g. "Words that contain a vowel" trivially fits any 4 English words. Need a category-tightness check on player input, not just word-fit.
- Ground truth is mine and arguable. judge-y5 (___ HOUSE — 26b said "courthouse is one word, not 'court' + 'house'") is a defensible call I marked as a fail. Real human-grader agreement might bump 26b to 16/16 too.