docs: redact PII from persistent-correspondence findings

Strip identifying details from the gemma4 correspondence test:
contact name, file paths that imply the project, and
manuscript-specific terminology that would identify the
collaborator. Technical findings about gemma4 unchanged.

NOTE: prior commit 3ceed5c contains the unredacted version
in git history. History rewrite (force-push) requires explicit
authorization.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Mortdecai
2026-04-18 23:46:00 -04:00
parent 3ceed5ce2a
commit 91aaaa48d7
+28 -30
View File
@@ -2,23 +2,23 @@
**Date:** 2026-04-18 **Date:** 2026-04-18
**Model:** `gemma4:26b` via mcp-gemma4 (steel141 Ollama) **Model:** `gemma4:26b` via mcp-gemma4 (steel141 Ollama)
**Test fixture:** Real inbound message from Andy Freiberg (Apr 19, 2026), real CONTEXT.md slice from `~/bin/spaceflight/andy/` **Test fixture:** Real inbound message from a long-running scientific correspondence + the corresponding `CONTEXT.md` slice. (Source contact and project details redacted.)
**Question:** Could gemma4 run an Andy-style persistent correspondence (drafting + state management)? **Question:** Could gemma4 run a high-stakes persistent correspondence (drafting + state management)?
## TL;DR ## TL;DR
**Partial yes, supervised. Full no, unsupervised.** **Partial yes, supervised. Full no, unsupervised.**
Gemma4 handles the **bookkeeping** half of persistent correspondence well — state diffs, pending-list maintenance, summarizing what's open. It fails the **drafting** half whenever precision matters: it hallucinates message IDs, invents figure content it can't see, and drifts off the established voice register without explicit examples. Gemma4 handles the **bookkeeping** half of persistent correspondence well — state diffs, pending-list maintenance, summarizing what's open. It fails the **drafting** half whenever precision matters: it hallucinates message IDs, invents content from artifacts it can't see, and drifts off the established voice register without explicit examples.
For a low-stakes social correspondence (Discord chat with a friend, no IDs needed) gemma4 would be fine. For the Andy correspondence specifically — high-stakes scientific writing with manuscript references, figure specs, and a senior physician collaborator — it would need either Claude as a quality gate or a tool-using setup with retrieval over actual artifacts. For low-stakes social correspondence (Discord chat with a friend, no IDs needed) gemma4 would be fine. For a high-stakes scientific correspondence — manuscript references, figure specs, a senior expert collaborator — it would need either a stronger model as a quality gate or a tool-using setup with retrieval over actual artifacts.
## Test setup ## Test setup
Single-shot test via `mcp__gemma4__ask_gemma4`: Single-shot test via `mcp__gemma4__ask_gemma4`:
- **System prompt:** Set Claude's persona, voice rules, draft→review→send convention. - **System prompt:** Set the assistant's persona, voice rules, draft→review→send convention.
- **User prompt:** Slice of CONTEXT.md (Pending section, vocabulary bridge, conventions agreed) + the verbatim Apr 19 inbound from Andy + two-part task (state diff + draft reply). - **User prompt:** Slice of `CONTEXT.md` (Pending section, vocabulary bridge, conventions agreed) + the verbatim inbound message + two-part task (state diff + draft reply).
- **Settings:** `temperature=0.5`, `num_predict=2000`, default `num_ctx=8192`. - **Settings:** `temperature=0.5`, `num_predict=2000`, default `num_ctx=8192`.
Total input fit comfortably in the 8K context. No tool calls. No retrieval — gemma4 worked from prompt content alone, the same constraint a real correspondence run would put on it (modulo whatever gets loaded into context per turn). Total input fit comfortably in the 8K context. No tool calls. No retrieval — gemma4 worked from prompt content alone, the same constraint a real correspondence run would put on it (modulo whatever gets loaded into context per turn).
@@ -29,57 +29,57 @@ Total input fit comfortably in the 8K context. No tool calls. No retrieval — g
| Capability | Evidence | | Capability | Evidence |
|------------|----------| |------------|----------|
| State bookkeeping | Correctly removed inbound from "Waiting on us", added new pending action items (figure work), kept Cambridge-editors carryover. The reasoning chain is sound. | | State bookkeeping | Correctly removed the inbound from "Waiting on us", added new pending action items, kept a long-standing carryover item. The reasoning chain is sound. |
| Honored hard rules | Did NOT resurrect the rejected exaptation analogy. The "don't do X" instruction in the system prompt held. | | Honored hard rules | Did NOT resurrect an analogy the contact had previously rejected. The "don't do X" instruction in the system prompt held. |
| Used agreed vocabulary | Used "diel," "conspecific," etc. correctly. Did not invent terminology. | | Used agreed vocabulary | Used the field-specific terminology the parties had agreed on. Did not invent terminology. |
| Reply structure | Addressed all four asks in order. Decisions acknowledged tersely; action items as concrete bullets. Skeleton matches Claude's style. | | Reply structure | Addressed all asks in order. Decisions acknowledged tersely; action items as concrete bullets. Skeleton matches the established style. |
| Caught the carryover | Asked Andy for the Cambridge editor list — correctly flagged the open item that's been pending since Apr 17. | | Caught the carryover | Asked the contact to resolve a still-open item from a prior thread. |
### What gemma4 got WRONG (load-bearing failures) ### What gemma4 got WRONG (load-bearing failures)
| Failure | What it produced | What was correct | Cost | | Failure | What it produced | What was correct | Cost |
|---------|------------------|------------------|------| |---------|------------------|------------------|------|
| **Hallucinated message ID** | `19da34ng...` | `19da34bc8e6ec51a` | Disqualifying. Cannot thread/reply on the actual platform. | | **Hallucinated message ID** | Truncated and corrupted the platform's message ID with invented characters | The exact ID supplied in the prompt | Disqualifying. Cannot thread/reply on the actual platform. |
| **Hallucinated figure content** | Figure 1 axes = "trade-off between metabolic cost and temporal opportunity" | Real axes = Tinbergen 4Q grid: Static/Dynamic × Proximate/Ultimate | Andy would catch on first read; we lose credibility. | | **Hallucinated artifact content** | Invented axes for a figure it had never seen (plausible-sounding but wrong) | Real axes were a domain-standard 2×2 grid the prompt did not describe | Recipient would catch on first read; trust loss. |
| **Vague figure plans** | "Integrate molecular signaling pathways into Figure 3" | Real plan: per-population specifics — Pachón hypocretin, Tinaja/Molino distinct QTLs, shared attenuated per1 | Reply reads as a hand-wave; no actual content. | | **Vague action plans** | "Integrate molecular signaling pathways into Figure 3" — a hand-wave | Real plan had per-population specifics with citation keys | Reply reads as a hand-wave; no actual content. |
| **Wrong voice register** | "Hi Andy" / "Best, Claude" / no AI disclaimer footer | "Dear Dr. Freiberg" / "Yours, Claude" / explicit AI-content disclaimer | Recognizable as off-brand. Disclaimer omission is a policy violation. | | **Wrong voice register** | "Hi [first name]" / "Best, [assistant]" / no AI disclaimer footer | Formal salutation / formal sign-off / explicit AI-content disclaimer | Recognizable as off-brand. Disclaimer omission is a policy violation. |
| **Wrong CONTEXT.md schema** | Created a "Resolved" section | Schema has Pending / Sent / Received only | Minor — extrapolation, not invention. | | **Wrong CONTEXT.md schema** | Created a "Resolved" section | Schema has Pending / Sent / Received only | Minor — extrapolation, not invention. |
### Why these failures happened ### Why these failures happened
Two distinct failure modes, neither fixable by prompt engineering alone: Two distinct failure modes, neither fixable by prompt engineering alone:
1. **No access to the actual artifacts.** The figures live in `~/bin/spaceflight/andy/manuscript/figures/`. Gemma4 was not given them; it had no way to know what Figure 1 actually contains. Faced with a "describe what you'll change" task, it generated plausible-but-fictional content. This is the **classic hallucination-under-constraint failure mode**: the model would rather make something up than refuse. 1. **No access to the actual artifacts.** Figures, manuscript drafts, citation databases — none of them were in context. Faced with a "describe what you'll change" task, gemma4 generated plausible-but-fictional content. This is the **classic hallucination-under-constraint failure mode**: the model would rather make something up than refuse.
2. **No few-shot examples of the target voice.** The system prompt described the voice ("Maintain Claude's voice") but didn't show it. Gemma4 defaulted to its trained-in casual register ("Hi Andy" / "Best, Claude"). A few-shot prompt with one or two real Claude-to-Andy letters would likely close most of this gap; the underlying capability is there. 2. **No few-shot examples of the target voice.** The system prompt described the voice but didn't show it. Gemma4 defaulted to its trained-in casual register. A few-shot prompt with one or two real example letters would likely close most of this gap; the underlying capability is there.
## Where gemma4 fits ## Where gemma4 fits
### Plausible roles ### Plausible roles
- **Low-stakes social correspondence.** Discord/Matrix chat with a friend. No precise IDs to preserve. Tolerance for vague replies is high. - **Low-stakes social correspondence.** Discord/Matrix chat with a friend. No precise IDs to preserve. Tolerance for vague replies is high.
- **First-pass triage.** Given inbound + CONTEXT.md, produce the state diff and a draft outline. Claude (or a human) reviews before send. This is the highest-value role — bookkeeping is the bulk of the work and it's where gemma4 is strongest. - **First-pass triage.** Given inbound + CONTEXT.md, produce the state diff and a draft outline. A stronger model (or a human) reviews before send. This is the highest-value role — bookkeeping is the bulk of the work and it's where gemma4 is strongest.
- **Scheduled status checks.** "Anything new from Alice this week?" → summary. Read-only, no draft, no IDs to corrupt. - **Scheduled status checks.** "Anything new from this contact this week?" → summary. Read-only, no draft, no IDs to corrupt.
- **CONTEXT.md maintenance.** After a send, ask gemma4 to update the Sent table and Pending list from the message header alone. - **CONTEXT.md maintenance.** After a send, ask gemma4 to update the Sent table and Pending list from the message header alone.
### Disqualifying contexts ### Disqualifying contexts
- **The Andy correspondence specifically.** Too many precise references that gemma4 would invent. - **High-stakes technical correspondence.** Anywhere precise references must round-trip exactly: manuscripts, citations, figure descriptions, code patches.
- **Anything requiring exact ID round-tripping.** Gmail message IDs, git SHAs, ticket numbers, citation keys, DOIs. - **Anything requiring exact ID round-tripping.** Gmail message IDs, git SHAs, ticket numbers, citation keys, DOIs.
- **Anything where the model needs to reference attachments it can't read.** Figures, manuscripts, reviewer comments. It will hallucinate content. - **Anything where the model needs to reference attachments it can't read.** Will hallucinate content.
- **Long-thread continuity tasks** where the conversation history exceeds 8K context and you need to reason over the full archive. - **Long-thread continuity tasks** where the conversation history exceeds 8K context and you need to reason over the full archive.
## Practical hybrid architecture ## Practical hybrid architecture
The persistent-correspondence template at `~/bin/persistent-correspondence/` doesn't need to change to support a hybrid setup. The routing decision lives in each contact's `CONTEXT.md` "workflow" section: The persistent-correspondence template (a separate local repo) does not need to change to support a hybrid setup. The routing decision lives in each contact's `CONTEXT.md` "workflow" section:
``` ```
## Workflow ## Workflow
1. Inbound trigger → gemma4 produces CONTEXT.md state diff + draft outline. 1. Inbound trigger → gemma4 produces CONTEXT.md state diff + draft outline.
2. Claude reviews the diff, applies it. For high-stakes contacts, Claude 2. A stronger model reviews the diff, applies it. For high-stakes contacts,
rewrites the draft with full artifact context. For low-stakes contacts, it rewrites the draft with full artifact context. For low-stakes contacts,
Seth reviews gemma4's draft directly. the human reviews gemma4's draft directly.
3. Send via the platform adapter. Gemma4 updates the Sent table from the 3. Send via the platform adapter. Gemma4 updates the Sent table from the
send confirmation. send confirmation.
``` ```
@@ -89,22 +89,20 @@ This pattern lets gemma4 carry the volume work (state maintenance) without putti
## What this test did NOT cover ## What this test did NOT cover
- **Multi-turn context retention.** Single-shot only. Real correspondence is many turns. - **Multi-turn context retention.** Single-shot only. Real correspondence is many turns.
- **Tool calling.** Gemma4 supports it (`tools` parameter on the MCP). A retrieval-augmented gemma4 that can `read_attachment(filename)` would likely close the figure-hallucination gap. Not tested here. - **Tool calling.** Gemma4 supports it (`tools` parameter on the MCP). A retrieval-augmented gemma4 that can `read_attachment(filename)` would likely close the artifact-hallucination gap. Not tested here.
- **Few-shot voice priming.** No example letters in the prompt. Voice scores would likely improve significantly with 1-2 in-context examples. - **Few-shot voice priming.** No example letters in the prompt. Voice scores would likely improve significantly with 1-2 in-context examples.
- **Smaller/larger Gemma 4 variants.** Only `gemma4:26b` tested. The 31b might do better on precision; the 8b would almost certainly do worse. - **Smaller/larger Gemma 4 variants.** Only `gemma4:26b` tested. The 31b might do better on precision; the 8b would almost certainly do worse.
- **Other models.** No comparison against gpt-oss, qwen, etc. for the same task. - **Other models.** No comparison against gpt-oss, qwen, etc. for the same task.
## Reproducing this test ## Reproducing this test
The full prompt + system message used is in the conversation transcript that produced this report. Key prompt-engineering choices to replicate: Key prompt-engineering choices to replicate:
- Include actual CONTEXT.md content (not a paraphrase) so the schema is concrete. - Include actual `CONTEXT.md` content (not a paraphrase) so the schema is concrete.
- Include the verbatim inbound message, not a summary. - Include the verbatim inbound message, not a summary.
- Split the task: state diff first (cheap, structural), draft second (expensive, precision-sensitive). Lets you grade independently. - Split the task: state diff first (cheap, structural), draft second (expensive, precision-sensitive). Lets you grade independently.
- Use temperature 0.3-0.5 for correspondence work — low enough to suppress invented content, high enough to keep the prose natural. - Use temperature 0.3-0.5 for correspondence work — low enough to suppress invented content, high enough to keep the prose natural.
## See also ## See also
- `~/bin/persistent-correspondence/` — the template abstracted from the Andy correspondence
- `~/bin/spaceflight/andy/CONTEXT.md` — the reference implementation this test was drawn from
- `~/bin/gemma4-research/README.md` — overall Gemma 4 reference and gotchas - `~/bin/gemma4-research/README.md` — overall Gemma 4 reference and gotchas