Single-shot test against real Andy inbound + CONTEXT.md slice. Findings: gemma4 handles state bookkeeping well (correctly diffs Pending, honors hard rules like rejected-analogy avoidance, uses agreed vocabulary). Fails on precision: hallucinated message ID, invented Figure 1 axes it had no access to, drifted off voice register without few-shot examples. Verdict: viable for low-stakes social correspondence + first-pass triage; disqualified from high-stakes drafting where exact IDs or artifact references must round-trip. Hybrid pattern proposed (gemma4 for bookkeeping, Claude for drafting). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
7.9 KiB
Gemma 4 as a persistent-correspondence agent
Date: 2026-04-18
Model: gemma4:26b via mcp-gemma4 (steel141 Ollama)
Test fixture: Real inbound message from Andy Freiberg (Apr 19, 2026), real CONTEXT.md slice from ~/bin/spaceflight/andy/
Question: Could gemma4 run an Andy-style persistent correspondence (drafting + state management)?
TL;DR
Partial yes, supervised. Full no, unsupervised.
Gemma4 handles the bookkeeping half of persistent correspondence well — state diffs, pending-list maintenance, summarizing what's open. It fails the drafting half whenever precision matters: it hallucinates message IDs, invents figure content it can't see, and drifts off the established voice register without explicit examples.
For a low-stakes social correspondence (Discord chat with a friend, no IDs needed) gemma4 would be fine. For the Andy correspondence specifically — high-stakes scientific writing with manuscript references, figure specs, and a senior physician collaborator — it would need either Claude as a quality gate or a tool-using setup with retrieval over actual artifacts.
Test setup
Single-shot test via mcp__gemma4__ask_gemma4:
- System prompt: Set Claude's persona, voice rules, draft→review→send convention.
- User prompt: Slice of CONTEXT.md (Pending section, vocabulary bridge, conventions agreed) + the verbatim Apr 19 inbound from Andy + two-part task (state diff + draft reply).
- Settings:
temperature=0.5,num_predict=2000, defaultnum_ctx=8192.
Total input fit comfortably in the 8K context. No tool calls. No retrieval — gemma4 worked from prompt content alone, the same constraint a real correspondence run would put on it (modulo whatever gets loaded into context per turn).
Results
What gemma4 got RIGHT
| Capability | Evidence |
|---|---|
| State bookkeeping | Correctly removed inbound from "Waiting on us", added new pending action items (figure work), kept Cambridge-editors carryover. The reasoning chain is sound. |
| Honored hard rules | Did NOT resurrect the rejected exaptation analogy. The "don't do X" instruction in the system prompt held. |
| Used agreed vocabulary | Used "diel," "conspecific," etc. correctly. Did not invent terminology. |
| Reply structure | Addressed all four asks in order. Decisions acknowledged tersely; action items as concrete bullets. Skeleton matches Claude's style. |
| Caught the carryover | Asked Andy for the Cambridge editor list — correctly flagged the open item that's been pending since Apr 17. |
What gemma4 got WRONG (load-bearing failures)
| Failure | What it produced | What was correct | Cost |
|---|---|---|---|
| Hallucinated message ID | 19da34ng... |
19da34bc8e6ec51a |
Disqualifying. Cannot thread/reply on the actual platform. |
| Hallucinated figure content | Figure 1 axes = "trade-off between metabolic cost and temporal opportunity" | Real axes = Tinbergen 4Q grid: Static/Dynamic × Proximate/Ultimate | Andy would catch on first read; we lose credibility. |
| Vague figure plans | "Integrate molecular signaling pathways into Figure 3" | Real plan: per-population specifics — Pachón hypocretin, Tinaja/Molino distinct QTLs, shared attenuated per1 | Reply reads as a hand-wave; no actual content. |
| Wrong voice register | "Hi Andy" / "Best, Claude" / no AI disclaimer footer | "Dear Dr. Freiberg" / "Yours, Claude" / explicit AI-content disclaimer | Recognizable as off-brand. Disclaimer omission is a policy violation. |
| Wrong CONTEXT.md schema | Created a "Resolved" section | Schema has Pending / Sent / Received only | Minor — extrapolation, not invention. |
Why these failures happened
Two distinct failure modes, neither fixable by prompt engineering alone:
-
No access to the actual artifacts. The figures live in
~/bin/spaceflight/andy/manuscript/figures/. Gemma4 was not given them; it had no way to know what Figure 1 actually contains. Faced with a "describe what you'll change" task, it generated plausible-but-fictional content. This is the classic hallucination-under-constraint failure mode: the model would rather make something up than refuse. -
No few-shot examples of the target voice. The system prompt described the voice ("Maintain Claude's voice") but didn't show it. Gemma4 defaulted to its trained-in casual register ("Hi Andy" / "Best, Claude"). A few-shot prompt with one or two real Claude-to-Andy letters would likely close most of this gap; the underlying capability is there.
Where gemma4 fits
Plausible roles
- Low-stakes social correspondence. Discord/Matrix chat with a friend. No precise IDs to preserve. Tolerance for vague replies is high.
- First-pass triage. Given inbound + CONTEXT.md, produce the state diff and a draft outline. Claude (or a human) reviews before send. This is the highest-value role — bookkeeping is the bulk of the work and it's where gemma4 is strongest.
- Scheduled status checks. "Anything new from Alice this week?" → summary. Read-only, no draft, no IDs to corrupt.
- CONTEXT.md maintenance. After a send, ask gemma4 to update the Sent table and Pending list from the message header alone.
Disqualifying contexts
- The Andy correspondence specifically. Too many precise references that gemma4 would invent.
- Anything requiring exact ID round-tripping. Gmail message IDs, git SHAs, ticket numbers, citation keys, DOIs.
- Anything where the model needs to reference attachments it can't read. Figures, manuscripts, reviewer comments. It will hallucinate content.
- Long-thread continuity tasks where the conversation history exceeds 8K context and you need to reason over the full archive.
Practical hybrid architecture
The persistent-correspondence template at ~/bin/persistent-correspondence/ doesn't need to change to support a hybrid setup. The routing decision lives in each contact's CONTEXT.md "workflow" section:
## Workflow
1. Inbound trigger → gemma4 produces CONTEXT.md state diff + draft outline.
2. Claude reviews the diff, applies it. For high-stakes contacts, Claude
rewrites the draft with full artifact context. For low-stakes contacts,
Seth reviews gemma4's draft directly.
3. Send via the platform adapter. Gemma4 updates the Sent table from the
send confirmation.
This pattern lets gemma4 carry the volume work (state maintenance) without putting it in the critical path on accuracy-sensitive output.
What this test did NOT cover
- Multi-turn context retention. Single-shot only. Real correspondence is many turns.
- Tool calling. Gemma4 supports it (
toolsparameter on the MCP). A retrieval-augmented gemma4 that canread_attachment(filename)would likely close the figure-hallucination gap. Not tested here. - Few-shot voice priming. No example letters in the prompt. Voice scores would likely improve significantly with 1-2 in-context examples.
- Smaller/larger Gemma 4 variants. Only
gemma4:26btested. The 31b might do better on precision; the 8b would almost certainly do worse. - Other models. No comparison against gpt-oss, qwen, etc. for the same task.
Reproducing this test
The full prompt + system message used is in the conversation transcript that produced this report. Key prompt-engineering choices to replicate:
- Include actual CONTEXT.md content (not a paraphrase) so the schema is concrete.
- Include the verbatim inbound message, not a summary.
- Split the task: state diff first (cheap, structural), draft second (expensive, precision-sensitive). Lets you grade independently.
- Use temperature 0.3-0.5 for correspondence work — low enough to suppress invented content, high enough to keep the prose natural.
See also
~/bin/persistent-correspondence/— the template abstracted from the Andy correspondence~/bin/spaceflight/andy/CONTEXT.md— the reference implementation this test was drawn from~/bin/gemma4-research/README.md— overall Gemma 4 reference and gotchas