docs: redact PII from persistent-correspondence findings
Strip identifying details from the gemma4 correspondence test:
contact name, file paths that imply the project, and
manuscript-specific terminology that would identify the
collaborator. Technical findings about gemma4 unchanged.
NOTE: prior commit 3ceed5c contains the unredacted version
in git history. History rewrite (force-push) requires explicit
authorization.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -2,23 +2,23 @@
|
|||||||
|
|
||||||
**Date:** 2026-04-18
|
**Date:** 2026-04-18
|
||||||
**Model:** `gemma4:26b` via mcp-gemma4 (steel141 Ollama)
|
**Model:** `gemma4:26b` via mcp-gemma4 (steel141 Ollama)
|
||||||
**Test fixture:** Real inbound message from Andy Freiberg (Apr 19, 2026), real CONTEXT.md slice from `~/bin/spaceflight/andy/`
|
**Test fixture:** Real inbound message from a long-running scientific correspondence + the corresponding `CONTEXT.md` slice. (Source contact and project details redacted.)
|
||||||
**Question:** Could gemma4 run an Andy-style persistent correspondence (drafting + state management)?
|
**Question:** Could gemma4 run a high-stakes persistent correspondence (drafting + state management)?
|
||||||
|
|
||||||
## TL;DR
|
## TL;DR
|
||||||
|
|
||||||
**Partial yes, supervised. Full no, unsupervised.**
|
**Partial yes, supervised. Full no, unsupervised.**
|
||||||
|
|
||||||
Gemma4 handles the **bookkeeping** half of persistent correspondence well — state diffs, pending-list maintenance, summarizing what's open. It fails the **drafting** half whenever precision matters: it hallucinates message IDs, invents figure content it can't see, and drifts off the established voice register without explicit examples.
|
Gemma4 handles the **bookkeeping** half of persistent correspondence well — state diffs, pending-list maintenance, summarizing what's open. It fails the **drafting** half whenever precision matters: it hallucinates message IDs, invents content from artifacts it can't see, and drifts off the established voice register without explicit examples.
|
||||||
|
|
||||||
For a low-stakes social correspondence (Discord chat with a friend, no IDs needed) gemma4 would be fine. For the Andy correspondence specifically — high-stakes scientific writing with manuscript references, figure specs, and a senior physician collaborator — it would need either Claude as a quality gate or a tool-using setup with retrieval over actual artifacts.
|
For low-stakes social correspondence (Discord chat with a friend, no IDs needed) gemma4 would be fine. For a high-stakes scientific correspondence — manuscript references, figure specs, a senior expert collaborator — it would need either a stronger model as a quality gate or a tool-using setup with retrieval over actual artifacts.
|
||||||
|
|
||||||
## Test setup
|
## Test setup
|
||||||
|
|
||||||
Single-shot test via `mcp__gemma4__ask_gemma4`:
|
Single-shot test via `mcp__gemma4__ask_gemma4`:
|
||||||
|
|
||||||
- **System prompt:** Set Claude's persona, voice rules, draft→review→send convention.
|
- **System prompt:** Set the assistant's persona, voice rules, draft→review→send convention.
|
||||||
- **User prompt:** Slice of CONTEXT.md (Pending section, vocabulary bridge, conventions agreed) + the verbatim Apr 19 inbound from Andy + two-part task (state diff + draft reply).
|
- **User prompt:** Slice of `CONTEXT.md` (Pending section, vocabulary bridge, conventions agreed) + the verbatim inbound message + two-part task (state diff + draft reply).
|
||||||
- **Settings:** `temperature=0.5`, `num_predict=2000`, default `num_ctx=8192`.
|
- **Settings:** `temperature=0.5`, `num_predict=2000`, default `num_ctx=8192`.
|
||||||
|
|
||||||
Total input fit comfortably in the 8K context. No tool calls. No retrieval — gemma4 worked from prompt content alone, the same constraint a real correspondence run would put on it (modulo whatever gets loaded into context per turn).
|
Total input fit comfortably in the 8K context. No tool calls. No retrieval — gemma4 worked from prompt content alone, the same constraint a real correspondence run would put on it (modulo whatever gets loaded into context per turn).
|
||||||
@@ -29,57 +29,57 @@ Total input fit comfortably in the 8K context. No tool calls. No retrieval — g
|
|||||||
|
|
||||||
| Capability | Evidence |
|
| Capability | Evidence |
|
||||||
|------------|----------|
|
|------------|----------|
|
||||||
| State bookkeeping | Correctly removed inbound from "Waiting on us", added new pending action items (figure work), kept Cambridge-editors carryover. The reasoning chain is sound. |
|
| State bookkeeping | Correctly removed the inbound from "Waiting on us", added new pending action items, kept a long-standing carryover item. The reasoning chain is sound. |
|
||||||
| Honored hard rules | Did NOT resurrect the rejected exaptation analogy. The "don't do X" instruction in the system prompt held. |
|
| Honored hard rules | Did NOT resurrect an analogy the contact had previously rejected. The "don't do X" instruction in the system prompt held. |
|
||||||
| Used agreed vocabulary | Used "diel," "conspecific," etc. correctly. Did not invent terminology. |
|
| Used agreed vocabulary | Used the field-specific terminology the parties had agreed on. Did not invent terminology. |
|
||||||
| Reply structure | Addressed all four asks in order. Decisions acknowledged tersely; action items as concrete bullets. Skeleton matches Claude's style. |
|
| Reply structure | Addressed all asks in order. Decisions acknowledged tersely; action items as concrete bullets. Skeleton matches the established style. |
|
||||||
| Caught the carryover | Asked Andy for the Cambridge editor list — correctly flagged the open item that's been pending since Apr 17. |
|
| Caught the carryover | Asked the contact to resolve a still-open item from a prior thread. |
|
||||||
|
|
||||||
### What gemma4 got WRONG (load-bearing failures)
|
### What gemma4 got WRONG (load-bearing failures)
|
||||||
|
|
||||||
| Failure | What it produced | What was correct | Cost |
|
| Failure | What it produced | What was correct | Cost |
|
||||||
|---------|------------------|------------------|------|
|
|---------|------------------|------------------|------|
|
||||||
| **Hallucinated message ID** | `19da34ng...` | `19da34bc8e6ec51a` | Disqualifying. Cannot thread/reply on the actual platform. |
|
| **Hallucinated message ID** | Truncated and corrupted the platform's message ID with invented characters | The exact ID supplied in the prompt | Disqualifying. Cannot thread/reply on the actual platform. |
|
||||||
| **Hallucinated figure content** | Figure 1 axes = "trade-off between metabolic cost and temporal opportunity" | Real axes = Tinbergen 4Q grid: Static/Dynamic × Proximate/Ultimate | Andy would catch on first read; we lose credibility. |
|
| **Hallucinated artifact content** | Invented axes for a figure it had never seen (plausible-sounding but wrong) | Real axes were a domain-standard 2×2 grid the prompt did not describe | Recipient would catch on first read; trust loss. |
|
||||||
| **Vague figure plans** | "Integrate molecular signaling pathways into Figure 3" | Real plan: per-population specifics — Pachón hypocretin, Tinaja/Molino distinct QTLs, shared attenuated per1 | Reply reads as a hand-wave; no actual content. |
|
| **Vague action plans** | "Integrate molecular signaling pathways into Figure 3" — a hand-wave | Real plan had per-population specifics with citation keys | Reply reads as a hand-wave; no actual content. |
|
||||||
| **Wrong voice register** | "Hi Andy" / "Best, Claude" / no AI disclaimer footer | "Dear Dr. Freiberg" / "Yours, Claude" / explicit AI-content disclaimer | Recognizable as off-brand. Disclaimer omission is a policy violation. |
|
| **Wrong voice register** | "Hi [first name]" / "Best, [assistant]" / no AI disclaimer footer | Formal salutation / formal sign-off / explicit AI-content disclaimer | Recognizable as off-brand. Disclaimer omission is a policy violation. |
|
||||||
| **Wrong CONTEXT.md schema** | Created a "Resolved" section | Schema has Pending / Sent / Received only | Minor — extrapolation, not invention. |
|
| **Wrong CONTEXT.md schema** | Created a "Resolved" section | Schema has Pending / Sent / Received only | Minor — extrapolation, not invention. |
|
||||||
|
|
||||||
### Why these failures happened
|
### Why these failures happened
|
||||||
|
|
||||||
Two distinct failure modes, neither fixable by prompt engineering alone:
|
Two distinct failure modes, neither fixable by prompt engineering alone:
|
||||||
|
|
||||||
1. **No access to the actual artifacts.** The figures live in `~/bin/spaceflight/andy/manuscript/figures/`. Gemma4 was not given them; it had no way to know what Figure 1 actually contains. Faced with a "describe what you'll change" task, it generated plausible-but-fictional content. This is the **classic hallucination-under-constraint failure mode**: the model would rather make something up than refuse.
|
1. **No access to the actual artifacts.** Figures, manuscript drafts, citation databases — none of them were in context. Faced with a "describe what you'll change" task, gemma4 generated plausible-but-fictional content. This is the **classic hallucination-under-constraint failure mode**: the model would rather make something up than refuse.
|
||||||
|
|
||||||
2. **No few-shot examples of the target voice.** The system prompt described the voice ("Maintain Claude's voice") but didn't show it. Gemma4 defaulted to its trained-in casual register ("Hi Andy" / "Best, Claude"). A few-shot prompt with one or two real Claude-to-Andy letters would likely close most of this gap; the underlying capability is there.
|
2. **No few-shot examples of the target voice.** The system prompt described the voice but didn't show it. Gemma4 defaulted to its trained-in casual register. A few-shot prompt with one or two real example letters would likely close most of this gap; the underlying capability is there.
|
||||||
|
|
||||||
## Where gemma4 fits
|
## Where gemma4 fits
|
||||||
|
|
||||||
### Plausible roles
|
### Plausible roles
|
||||||
|
|
||||||
- **Low-stakes social correspondence.** Discord/Matrix chat with a friend. No precise IDs to preserve. Tolerance for vague replies is high.
|
- **Low-stakes social correspondence.** Discord/Matrix chat with a friend. No precise IDs to preserve. Tolerance for vague replies is high.
|
||||||
- **First-pass triage.** Given inbound + CONTEXT.md, produce the state diff and a draft outline. Claude (or a human) reviews before send. This is the highest-value role — bookkeeping is the bulk of the work and it's where gemma4 is strongest.
|
- **First-pass triage.** Given inbound + CONTEXT.md, produce the state diff and a draft outline. A stronger model (or a human) reviews before send. This is the highest-value role — bookkeeping is the bulk of the work and it's where gemma4 is strongest.
|
||||||
- **Scheduled status checks.** "Anything new from Alice this week?" → summary. Read-only, no draft, no IDs to corrupt.
|
- **Scheduled status checks.** "Anything new from this contact this week?" → summary. Read-only, no draft, no IDs to corrupt.
|
||||||
- **CONTEXT.md maintenance.** After a send, ask gemma4 to update the Sent table and Pending list from the message header alone.
|
- **CONTEXT.md maintenance.** After a send, ask gemma4 to update the Sent table and Pending list from the message header alone.
|
||||||
|
|
||||||
### Disqualifying contexts
|
### Disqualifying contexts
|
||||||
|
|
||||||
- **The Andy correspondence specifically.** Too many precise references that gemma4 would invent.
|
- **High-stakes technical correspondence.** Anywhere precise references must round-trip exactly: manuscripts, citations, figure descriptions, code patches.
|
||||||
- **Anything requiring exact ID round-tripping.** Gmail message IDs, git SHAs, ticket numbers, citation keys, DOIs.
|
- **Anything requiring exact ID round-tripping.** Gmail message IDs, git SHAs, ticket numbers, citation keys, DOIs.
|
||||||
- **Anything where the model needs to reference attachments it can't read.** Figures, manuscripts, reviewer comments. It will hallucinate content.
|
- **Anything where the model needs to reference attachments it can't read.** Will hallucinate content.
|
||||||
- **Long-thread continuity tasks** where the conversation history exceeds 8K context and you need to reason over the full archive.
|
- **Long-thread continuity tasks** where the conversation history exceeds 8K context and you need to reason over the full archive.
|
||||||
|
|
||||||
## Practical hybrid architecture
|
## Practical hybrid architecture
|
||||||
|
|
||||||
The persistent-correspondence template at `~/bin/persistent-correspondence/` doesn't need to change to support a hybrid setup. The routing decision lives in each contact's `CONTEXT.md` "workflow" section:
|
The persistent-correspondence template (a separate local repo) does not need to change to support a hybrid setup. The routing decision lives in each contact's `CONTEXT.md` "workflow" section:
|
||||||
|
|
||||||
```
|
```
|
||||||
## Workflow
|
## Workflow
|
||||||
|
|
||||||
1. Inbound trigger → gemma4 produces CONTEXT.md state diff + draft outline.
|
1. Inbound trigger → gemma4 produces CONTEXT.md state diff + draft outline.
|
||||||
2. Claude reviews the diff, applies it. For high-stakes contacts, Claude
|
2. A stronger model reviews the diff, applies it. For high-stakes contacts,
|
||||||
rewrites the draft with full artifact context. For low-stakes contacts,
|
it rewrites the draft with full artifact context. For low-stakes contacts,
|
||||||
Seth reviews gemma4's draft directly.
|
the human reviews gemma4's draft directly.
|
||||||
3. Send via the platform adapter. Gemma4 updates the Sent table from the
|
3. Send via the platform adapter. Gemma4 updates the Sent table from the
|
||||||
send confirmation.
|
send confirmation.
|
||||||
```
|
```
|
||||||
@@ -89,22 +89,20 @@ This pattern lets gemma4 carry the volume work (state maintenance) without putti
|
|||||||
## What this test did NOT cover
|
## What this test did NOT cover
|
||||||
|
|
||||||
- **Multi-turn context retention.** Single-shot only. Real correspondence is many turns.
|
- **Multi-turn context retention.** Single-shot only. Real correspondence is many turns.
|
||||||
- **Tool calling.** Gemma4 supports it (`tools` parameter on the MCP). A retrieval-augmented gemma4 that can `read_attachment(filename)` would likely close the figure-hallucination gap. Not tested here.
|
- **Tool calling.** Gemma4 supports it (`tools` parameter on the MCP). A retrieval-augmented gemma4 that can `read_attachment(filename)` would likely close the artifact-hallucination gap. Not tested here.
|
||||||
- **Few-shot voice priming.** No example letters in the prompt. Voice scores would likely improve significantly with 1-2 in-context examples.
|
- **Few-shot voice priming.** No example letters in the prompt. Voice scores would likely improve significantly with 1-2 in-context examples.
|
||||||
- **Smaller/larger Gemma 4 variants.** Only `gemma4:26b` tested. The 31b might do better on precision; the 8b would almost certainly do worse.
|
- **Smaller/larger Gemma 4 variants.** Only `gemma4:26b` tested. The 31b might do better on precision; the 8b would almost certainly do worse.
|
||||||
- **Other models.** No comparison against gpt-oss, qwen, etc. for the same task.
|
- **Other models.** No comparison against gpt-oss, qwen, etc. for the same task.
|
||||||
|
|
||||||
## Reproducing this test
|
## Reproducing this test
|
||||||
|
|
||||||
The full prompt + system message used is in the conversation transcript that produced this report. Key prompt-engineering choices to replicate:
|
Key prompt-engineering choices to replicate:
|
||||||
|
|
||||||
- Include actual CONTEXT.md content (not a paraphrase) so the schema is concrete.
|
- Include actual `CONTEXT.md` content (not a paraphrase) so the schema is concrete.
|
||||||
- Include the verbatim inbound message, not a summary.
|
- Include the verbatim inbound message, not a summary.
|
||||||
- Split the task: state diff first (cheap, structural), draft second (expensive, precision-sensitive). Lets you grade independently.
|
- Split the task: state diff first (cheap, structural), draft second (expensive, precision-sensitive). Lets you grade independently.
|
||||||
- Use temperature 0.3-0.5 for correspondence work — low enough to suppress invented content, high enough to keep the prose natural.
|
- Use temperature 0.3-0.5 for correspondence work — low enough to suppress invented content, high enough to keep the prose natural.
|
||||||
|
|
||||||
## See also
|
## See also
|
||||||
|
|
||||||
- `~/bin/persistent-correspondence/` — the template abstracted from the Andy correspondence
|
|
||||||
- `~/bin/spaceflight/andy/CONTEXT.md` — the reference implementation this test was drawn from
|
|
||||||
- `~/bin/gemma4-research/README.md` — overall Gemma 4 reference and gotchas
|
- `~/bin/gemma4-research/README.md` — overall Gemma 4 reference and gotchas
|
||||||
|
|||||||
Reference in New Issue
Block a user