Files
gemma4-research/docs/persistent_correspondant.md
T
Mortdecai 91aaaa48d7 docs: redact PII from persistent-correspondence findings
Strip identifying details from the gemma4 correspondence test:
contact name, file paths that imply the project, and
manuscript-specific terminology that would identify the
collaborator. Technical findings about gemma4 unchanged.

NOTE: prior commit 3ceed5c contains the unredacted version
in git history. History rewrite (force-push) requires explicit
authorization.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 23:46:00 -04:00

109 lines
7.6 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Gemma 4 as a persistent-correspondence agent
**Date:** 2026-04-18
**Model:** `gemma4:26b` via mcp-gemma4 (steel141 Ollama)
**Test fixture:** Real inbound message from a long-running scientific correspondence + the corresponding `CONTEXT.md` slice. (Source contact and project details redacted.)
**Question:** Could gemma4 run a high-stakes persistent correspondence (drafting + state management)?
## TL;DR
**Partial yes, supervised. Full no, unsupervised.**
Gemma4 handles the **bookkeeping** half of persistent correspondence well — state diffs, pending-list maintenance, summarizing what's open. It fails the **drafting** half whenever precision matters: it hallucinates message IDs, invents content from artifacts it can't see, and drifts off the established voice register without explicit examples.
For low-stakes social correspondence (Discord chat with a friend, no IDs needed) gemma4 would be fine. For a high-stakes scientific correspondence — manuscript references, figure specs, a senior expert collaborator — it would need either a stronger model as a quality gate or a tool-using setup with retrieval over actual artifacts.
## Test setup
Single-shot test via `mcp__gemma4__ask_gemma4`:
- **System prompt:** Set the assistant's persona, voice rules, draft→review→send convention.
- **User prompt:** Slice of `CONTEXT.md` (Pending section, vocabulary bridge, conventions agreed) + the verbatim inbound message + two-part task (state diff + draft reply).
- **Settings:** `temperature=0.5`, `num_predict=2000`, default `num_ctx=8192`.
Total input fit comfortably in the 8K context. No tool calls. No retrieval — gemma4 worked from prompt content alone, the same constraint a real correspondence run would put on it (modulo whatever gets loaded into context per turn).
## Results
### What gemma4 got RIGHT
| Capability | Evidence |
|------------|----------|
| State bookkeeping | Correctly removed the inbound from "Waiting on us", added new pending action items, kept a long-standing carryover item. The reasoning chain is sound. |
| Honored hard rules | Did NOT resurrect an analogy the contact had previously rejected. The "don't do X" instruction in the system prompt held. |
| Used agreed vocabulary | Used the field-specific terminology the parties had agreed on. Did not invent terminology. |
| Reply structure | Addressed all asks in order. Decisions acknowledged tersely; action items as concrete bullets. Skeleton matches the established style. |
| Caught the carryover | Asked the contact to resolve a still-open item from a prior thread. |
### What gemma4 got WRONG (load-bearing failures)
| Failure | What it produced | What was correct | Cost |
|---------|------------------|------------------|------|
| **Hallucinated message ID** | Truncated and corrupted the platform's message ID with invented characters | The exact ID supplied in the prompt | Disqualifying. Cannot thread/reply on the actual platform. |
| **Hallucinated artifact content** | Invented axes for a figure it had never seen (plausible-sounding but wrong) | Real axes were a domain-standard 2×2 grid the prompt did not describe | Recipient would catch on first read; trust loss. |
| **Vague action plans** | "Integrate molecular signaling pathways into Figure 3" — a hand-wave | Real plan had per-population specifics with citation keys | Reply reads as a hand-wave; no actual content. |
| **Wrong voice register** | "Hi [first name]" / "Best, [assistant]" / no AI disclaimer footer | Formal salutation / formal sign-off / explicit AI-content disclaimer | Recognizable as off-brand. Disclaimer omission is a policy violation. |
| **Wrong CONTEXT.md schema** | Created a "Resolved" section | Schema has Pending / Sent / Received only | Minor — extrapolation, not invention. |
### Why these failures happened
Two distinct failure modes, neither fixable by prompt engineering alone:
1. **No access to the actual artifacts.** Figures, manuscript drafts, citation databases — none of them were in context. Faced with a "describe what you'll change" task, gemma4 generated plausible-but-fictional content. This is the **classic hallucination-under-constraint failure mode**: the model would rather make something up than refuse.
2. **No few-shot examples of the target voice.** The system prompt described the voice but didn't show it. Gemma4 defaulted to its trained-in casual register. A few-shot prompt with one or two real example letters would likely close most of this gap; the underlying capability is there.
## Where gemma4 fits
### Plausible roles
- **Low-stakes social correspondence.** Discord/Matrix chat with a friend. No precise IDs to preserve. Tolerance for vague replies is high.
- **First-pass triage.** Given inbound + CONTEXT.md, produce the state diff and a draft outline. A stronger model (or a human) reviews before send. This is the highest-value role — bookkeeping is the bulk of the work and it's where gemma4 is strongest.
- **Scheduled status checks.** "Anything new from this contact this week?" → summary. Read-only, no draft, no IDs to corrupt.
- **CONTEXT.md maintenance.** After a send, ask gemma4 to update the Sent table and Pending list from the message header alone.
### Disqualifying contexts
- **High-stakes technical correspondence.** Anywhere precise references must round-trip exactly: manuscripts, citations, figure descriptions, code patches.
- **Anything requiring exact ID round-tripping.** Gmail message IDs, git SHAs, ticket numbers, citation keys, DOIs.
- **Anything where the model needs to reference attachments it can't read.** Will hallucinate content.
- **Long-thread continuity tasks** where the conversation history exceeds 8K context and you need to reason over the full archive.
## Practical hybrid architecture
The persistent-correspondence template (a separate local repo) does not need to change to support a hybrid setup. The routing decision lives in each contact's `CONTEXT.md` "workflow" section:
```
## Workflow
1. Inbound trigger → gemma4 produces CONTEXT.md state diff + draft outline.
2. A stronger model reviews the diff, applies it. For high-stakes contacts,
it rewrites the draft with full artifact context. For low-stakes contacts,
the human reviews gemma4's draft directly.
3. Send via the platform adapter. Gemma4 updates the Sent table from the
send confirmation.
```
This pattern lets gemma4 carry the volume work (state maintenance) without putting it in the critical path on accuracy-sensitive output.
## What this test did NOT cover
- **Multi-turn context retention.** Single-shot only. Real correspondence is many turns.
- **Tool calling.** Gemma4 supports it (`tools` parameter on the MCP). A retrieval-augmented gemma4 that can `read_attachment(filename)` would likely close the artifact-hallucination gap. Not tested here.
- **Few-shot voice priming.** No example letters in the prompt. Voice scores would likely improve significantly with 1-2 in-context examples.
- **Smaller/larger Gemma 4 variants.** Only `gemma4:26b` tested. The 31b might do better on precision; the 8b would almost certainly do worse.
- **Other models.** No comparison against gpt-oss, qwen, etc. for the same task.
## Reproducing this test
Key prompt-engineering choices to replicate:
- Include actual `CONTEXT.md` content (not a paraphrase) so the schema is concrete.
- Include the verbatim inbound message, not a summary.
- Split the task: state diff first (cheap, structural), draft second (expensive, precision-sensitive). Lets you grade independently.
- Use temperature 0.3-0.5 for correspondence work — low enough to suppress invented content, high enough to keep the prose natural.
## See also
- `~/bin/gemma4-research/README.md` — overall Gemma 4 reference and gotchas