Files

T

Mortdecai 3ceed5ce2a docs: gemma4:26b as persistent-correspondence agent

Single-shot test against real Andy inbound + CONTEXT.md slice.

Findings: gemma4 handles state bookkeeping well (correctly diffs
Pending, honors hard rules like rejected-analogy avoidance, uses
agreed vocabulary). Fails on precision: hallucinated message ID,
invented Figure 1 axes it had no access to, drifted off voice
register without few-shot examples.

Verdict: viable for low-stakes social correspondence + first-pass
triage; disqualified from high-stakes drafting where exact IDs
or artifact references must round-trip. Hybrid pattern proposed
(gemma4 for bookkeeping, Claude for drafting).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-18 23:42:22 -04:00

7.9 KiB

Raw Blame History

Gemma 4 as a persistent-correspondence agent

Date: 2026-04-18 Model: gemma4:26b via mcp-gemma4 (steel141 Ollama) Test fixture: Real inbound message from Andy Freiberg (Apr 19, 2026), real CONTEXT.md slice from ~/bin/spaceflight/andy/ Question: Could gemma4 run an Andy-style persistent correspondence (drafting + state management)?

TL;DR

Partial yes, supervised. Full no, unsupervised.

Gemma4 handles the bookkeeping half of persistent correspondence well — state diffs, pending-list maintenance, summarizing what's open. It fails the drafting half whenever precision matters: it hallucinates message IDs, invents figure content it can't see, and drifts off the established voice register without explicit examples.

For a low-stakes social correspondence (Discord chat with a friend, no IDs needed) gemma4 would be fine. For the Andy correspondence specifically — high-stakes scientific writing with manuscript references, figure specs, and a senior physician collaborator — it would need either Claude as a quality gate or a tool-using setup with retrieval over actual artifacts.

Test setup

Single-shot test via mcp__gemma4__ask_gemma4:

System prompt: Set Claude's persona, voice rules, draft→review→send convention.
User prompt: Slice of CONTEXT.md (Pending section, vocabulary bridge, conventions agreed) + the verbatim Apr 19 inbound from Andy + two-part task (state diff + draft reply).
Settings: temperature=0.5, num_predict=2000, default num_ctx=8192.

Total input fit comfortably in the 8K context. No tool calls. No retrieval — gemma4 worked from prompt content alone, the same constraint a real correspondence run would put on it (modulo whatever gets loaded into context per turn).

Results

What gemma4 got RIGHT

Capability	Evidence
State bookkeeping	Correctly removed inbound from "Waiting on us", added new pending action items (figure work), kept Cambridge-editors carryover. The reasoning chain is sound.
Honored hard rules	Did NOT resurrect the rejected exaptation analogy. The "don't do X" instruction in the system prompt held.
Used agreed vocabulary	Used "diel," "conspecific," etc. correctly. Did not invent terminology.
Reply structure	Addressed all four asks in order. Decisions acknowledged tersely; action items as concrete bullets. Skeleton matches Claude's style.
Caught the carryover	Asked Andy for the Cambridge editor list — correctly flagged the open item that's been pending since Apr 17.

What gemma4 got WRONG (load-bearing failures)

Failure	What it produced	What was correct	Cost
Hallucinated message ID	`19da34ng...`	`19da34bc8e6ec51a`	Disqualifying. Cannot thread/reply on the actual platform.
Hallucinated figure content	Figure 1 axes = "trade-off between metabolic cost and temporal opportunity"	Real axes = Tinbergen 4Q grid: Static/Dynamic × Proximate/Ultimate	Andy would catch on first read; we lose credibility.
Vague figure plans	"Integrate molecular signaling pathways into Figure 3"	Real plan: per-population specifics — Pachón hypocretin, Tinaja/Molino distinct QTLs, shared attenuated per1	Reply reads as a hand-wave; no actual content.
Wrong voice register	"Hi Andy" / "Best, Claude" / no AI disclaimer footer	"Dear Dr. Freiberg" / "Yours, Claude" / explicit AI-content disclaimer	Recognizable as off-brand. Disclaimer omission is a policy violation.
Wrong CONTEXT.md schema	Created a "Resolved" section	Schema has Pending / Sent / Received only	Minor — extrapolation, not invention.

Why these failures happened

Two distinct failure modes, neither fixable by prompt engineering alone:

No access to the actual artifacts. The figures live in ~/bin/spaceflight/andy/manuscript/figures/. Gemma4 was not given them; it had no way to know what Figure 1 actually contains. Faced with a "describe what you'll change" task, it generated plausible-but-fictional content. This is the classic hallucination-under-constraint failure mode: the model would rather make something up than refuse.
No few-shot examples of the target voice. The system prompt described the voice ("Maintain Claude's voice") but didn't show it. Gemma4 defaulted to its trained-in casual register ("Hi Andy" / "Best, Claude"). A few-shot prompt with one or two real Claude-to-Andy letters would likely close most of this gap; the underlying capability is there.

Where gemma4 fits

Plausible roles

Low-stakes social correspondence. Discord/Matrix chat with a friend. No precise IDs to preserve. Tolerance for vague replies is high.
First-pass triage. Given inbound + CONTEXT.md, produce the state diff and a draft outline. Claude (or a human) reviews before send. This is the highest-value role — bookkeeping is the bulk of the work and it's where gemma4 is strongest.
Scheduled status checks. "Anything new from Alice this week?" → summary. Read-only, no draft, no IDs to corrupt.
CONTEXT.md maintenance. After a send, ask gemma4 to update the Sent table and Pending list from the message header alone.

Disqualifying contexts

The Andy correspondence specifically. Too many precise references that gemma4 would invent.
Anything requiring exact ID round-tripping. Gmail message IDs, git SHAs, ticket numbers, citation keys, DOIs.
Anything where the model needs to reference attachments it can't read. Figures, manuscripts, reviewer comments. It will hallucinate content.
Long-thread continuity tasks where the conversation history exceeds 8K context and you need to reason over the full archive.

Practical hybrid architecture

The persistent-correspondence template at ~/bin/persistent-correspondence/ doesn't need to change to support a hybrid setup. The routing decision lives in each contact's CONTEXT.md "workflow" section:

## Workflow

1. Inbound trigger → gemma4 produces CONTEXT.md state diff + draft outline.
2. Claude reviews the diff, applies it. For high-stakes contacts, Claude
   rewrites the draft with full artifact context. For low-stakes contacts,
   Seth reviews gemma4's draft directly.
3. Send via the platform adapter. Gemma4 updates the Sent table from the
   send confirmation.

This pattern lets gemma4 carry the volume work (state maintenance) without putting it in the critical path on accuracy-sensitive output.

What this test did NOT cover

Multi-turn context retention. Single-shot only. Real correspondence is many turns.
Tool calling. Gemma4 supports it (tools parameter on the MCP). A retrieval-augmented gemma4 that can read_attachment(filename) would likely close the figure-hallucination gap. Not tested here.
Few-shot voice priming. No example letters in the prompt. Voice scores would likely improve significantly with 1-2 in-context examples.
Smaller/larger Gemma 4 variants. Only gemma4:26b tested. The 31b might do better on precision; the 8b would almost certainly do worse.
Other models. No comparison against gpt-oss, qwen, etc. for the same task.

Reproducing this test

The full prompt + system message used is in the conversation transcript that produced this report. Key prompt-engineering choices to replicate:

Include actual CONTEXT.md content (not a paraphrase) so the schema is concrete.
Include the verbatim inbound message, not a summary.
Split the task: state diff first (cheap, structural), draft second (expensive, precision-sensitive). Lets you grade independently.
Use temperature 0.3-0.5 for correspondence work — low enough to suppress invented content, high enough to keep the prose natural.

7.9 KiB Raw Blame History Unescape Escape