docs: initial Gemma 4 research corpus and synthesis

Architecture specs, benchmarks, gotchas, Ollama settings, tool calling format, and implementation patterns from Simon and AI_Visualizer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 18:14:19 -04:00
commit 5011059f5d
9 changed files with 861 additions and 0 deletions
@@ -0,0 +1,95 @@
+# Gemma 4 Implementation Reference
+
+> Patterns extracted from Seth's two production Gemma 4 projects.
+
+## Project: Simon (FreibergFamily/simon/)
+
+**Purpose:** AI genealogy historian — multi-turn chat with tool-calling agent
+
+| Setting | Value |
+|---------|-------|
+| Model | `gemma4:26b` |
+| API | `/api/chat` (multi-turn) |
+| num_ctx | 32768 |
+| num_predict | 4096 |
+| temperature | 1.0 |
+| top_p | 0.95 |
+| top_k | 64 |
+| keep_alive | 4h |
+| think | (not explicitly set — should be false) |
+| format_json | not used |
+| Vision | not used |
+| Tool calling | 6 tools, max 12 iterations |
+
+### Key Patterns
+
+1. **Aggressive system prompt:** 40+ lines defining identity, boundaries, tool usage rules, multi-step chaining requirements. Gemma 4 follows all of it.
+
+2. **Tool chaining instructions:** System prompt explicitly tells Gemma to chain tools (e.g., "after lookup_person, ALSO call get_historical_context"). Gemma 4 follows these multi-step chains reliably.
+
+3. **Parallel tool calls:** Encouraged in system prompt for multiple lookups. Gemma 4 does this.
+
+4. **History pruning:** Drops old tool results and tool-call messages but keeps assistant summaries, which prevents context bloat in multi-turn chats.
+
+5. **Fallback to streaming:** After 12 tool iterations, switches to stream mode (no tools) to force a text response.
+
+6. **Two modes (historian vs interview):** Completely different system prompts swapped at runtime. Gemma 4 stays in character for both.
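The history-pruning pattern can be sketched as below. This is a minimal illustration, not Simon's actual code: the function name `prune_history`, the `keep_recent` window, and the Ollama-style message shape (`role`, `content`, optional `tool_calls`) are all assumptions.

```python
from __future__ import annotations

from typing import Any, Dict, List


def prune_history(messages: List[Dict[str, Any]], keep_recent: int = 6) -> List[Dict[str, Any]]:
    """Drop tool traffic from older turns while keeping plain assistant text.

    Assumed message shape: {"role": ..., "content": ..., "tool_calls": [...]}.
    The last `keep_recent` messages are left untouched.
    """
    cutoff = max(len(messages) - keep_recent, 0)
    # Do not start the protected window on a tool result: step back so the
    # assistant message that issued the call stays alongside its results.
    while 0 < cutoff < len(messages) and messages[cutoff].get("role") == "tool":
        cutoff -= 1

    pruned = []
    for index, message in enumerate(messages):
        is_tool_traffic = message.get("role") == "tool" or bool(message.get("tool_calls"))
        if index < cutoff and is_tool_traffic:
            continue  # old tool call or tool result: drop it
        pruned.append(message)
    return pruned
```

Stepping `cutoff` backward avoids keeping a tool result whose originating tool call was pruned, which would leave the model with an orphaned result.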
+
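The iteration cap and text-only fallback could look roughly like this. The `chat` and `run_tool` callables are hypothetical stand-ins for the real Ollama call and the tool dispatcher, and the sketch models the fallback only as one final call with tools disabled; the streaming part is omitted.

```python
from __future__ import annotations

from typing import Any, Callable, Dict, List

MAX_TOOL_ITERATIONS = 12  # cap taken from the settings table


def run_agent(
    messages: List[Dict[str, Any]],
    chat: Callable[[List[Dict[str, Any]], bool], Dict[str, Any]],
    run_tool: Callable[[Dict[str, Any]], str],
    max_iterations: int = MAX_TOOL_ITERATIONS,
) -> str:
    """Loop over tool calls, then force a plain-text answer once the cap is hit.

    `chat(messages, tools_enabled)` is assumed to return one assistant message;
    `run_tool(call)` is assumed to return the tool result as a string.
    The `messages` list is extended in place.
    """
    for _ in range(max_iterations):
        reply = chat(messages, True)
        messages.append(reply)
        calls = reply.get("tool_calls") or []
        if not calls:
            return reply.get("content", "")  # model answered without tools
        for call in calls:  # several calls in one reply = parallel tool calls
            messages.append({"role": "tool", "content": run_tool(call)})

    # Cap reached: one more request with tools disabled, so the model
    # can only respond with text.
    final = chat(messages, False)
    return final.get("content", "")
```

Injecting `chat` and `run_tool` keeps the loop testable without a running model server.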
+---
+
+## Project: AI Visualizer (AI_Visualizer/)
+
+**Purpose:** Music-reactive video generator. Gemma 4 serves as the reasoning engine across 4 pipeline stages, plus a vision validator.
+
+| Stage | num_predict | num_ctx | temperature | Purpose |
+|-------|-------------|---------|-------------|---------|
+| Mood Analysis | 4096 | 16384 | 0.4-0.6 | Analyze CLAP descriptors -> narratives + boundary adjustments |
+| Rate Pass | 512 | 4096 | 0.3-0.5 | Choose visual pacing rate per music segment |
+| Storyboard | 2048 | 4096 | 0.6-0.8 | Generate SDXL prompts per music segment |
+| Batch Expansion | 2048 | default | 0.7 | Interpolate between scene prompts over time |
+| Vision Validator | 256 | default | 0.2 | Critique generated frames (slated to be disabled) |
+
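One way to keep the per-stage settings in a single place is a lookup table like the sketch below. The stage keys and the `stage_options` helper are hypothetical names; `None` stands for the table's `default` (no explicit `num_ctx`), and reading each temperature range as (first attempt, ceiling) is an assumption rather than something the source states.

```python
from __future__ import annotations

from typing import Any, Dict

# Values copied from the stage table; temperature is (low, high).
STAGES: Dict[str, Dict[str, Any]] = {
    "mood_analysis": {"num_predict": 4096, "num_ctx": 16384, "temperature": (0.4, 0.6)},
    "rate_pass": {"num_predict": 512, "num_ctx": 4096, "temperature": (0.3, 0.5)},
    "storyboard": {"num_predict": 2048, "num_ctx": 4096, "temperature": (0.6, 0.8)},
    "batch_expansion": {"num_predict": 2048, "num_ctx": None, "temperature": (0.7, 0.7)},
    "vision_validator": {"num_predict": 256, "num_ctx": None, "temperature": (0.2, 0.2)},
}


def stage_options(stage: str) -> Dict[str, Any]:
    """Build an Ollama `options` dict for a stage, using the low end of its range."""
    config = STAGES[stage]
    options: Dict[str, Any] = {
        "num_predict": config["num_predict"],
        "temperature": config["temperature"][0],
    }
    if config["num_ctx"] is not None:
        options["num_ctx"] = config["num_ctx"]  # omit the key to use the server default
    return options
```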
+### Key Patterns
+
+1. **No tool calling used.** Every Gemma call is a single-turn generate request, with JSON requested in the prompt.
+
+2. **Client-side JSON extraction:**
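+   ```python
+   import json
+   body = response["response"]
+   # Slice from the first "{" to the last "}" to drop prose around the JSON.
+   start = body.find("{")
+   end = body.rfind("}")
+   if start == -1 or end < start:
+       raise ValueError("no JSON object in model response")
+   obj = json.loads(body[start:end + 1])
+   ```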
+
+3. **Temperature ramping on retry:** Each attempt adds a fixed bump to the base temperature, so the first attempt is conservative and retries are more creative.
+
+4. **think: false everywhere.** Explicitly set on every call. Critical for keeping output within the `num_predict` budget.
+
+5. **format_json: false everywhere.** Enabling it caused infinite generation loops on nested schemas.
+
+6. **Model pinning:** `keep_alive=-1` to prevent GPU eviction during long SDXL pauses.
+
+7. **Explicit num_ctx:** Added after discovering Ollama defaults to 2048, which truncated mood analyzer prompts on long tracks.
+
+8. **Banned vocabulary in prompts:** List of cliche words (cinematic, dramatic, ethereal...) passed to Gemma to avoid generic output.
+
+9. **Vision for image critique:** Base64-encoded PNG -> structured SCORE/ISSUE/REASON output parsed by regex. Works but overrejects on subjective quality.
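Patterns 2 and 3 combine naturally into a retry loop. The sketch below is an illustration under stated assumptions: `generate` is a hypothetical callable wrapping the model call, and the default base temperature, bump size, and attempt count are placeholder values rather than the project's.

```python
from __future__ import annotations

import json
from typing import Any, Callable, Dict, Optional


def extract_json(body: str) -> Dict[str, Any]:
    """Parse the outermost {...} span of a model response."""
    start, end = body.find("{"), body.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object in model response")
    return json.loads(body[start:end + 1])


def generate_json(
    generate: Callable[[str, float], str],
    prompt: str,
    base_temp: float = 0.4,
    bump: float = 0.1,
    max_attempts: int = 3,
) -> Dict[str, Any]:
    """Retry until the response parses, raising the temperature each attempt."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        temperature = round(base_temp + bump * attempt, 2)
        body = generate(prompt, temperature)
        try:
            return extract_json(body)
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            last_error = exc
    raise RuntimeError(f"no valid JSON after {max_attempts} attempts") from last_error
```

Rounding the temperature keeps floating-point drift out of logs and request bodies.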
+
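For the vision critique in pattern 9, parsing might look like the following. The exact `SCORE: ...` / `ISSUE: ...` / `REASON: ...` line layout and the integer score are assumptions, since the source names only the three fields; `encode_png` shows the base64 step using the standard library.

```python
from __future__ import annotations

import base64
import re
from typing import Any, Dict

# Assumed layout: one "FIELD: value" pair per line.
FIELD_RE = re.compile(r"^(SCORE|ISSUE|REASON)\s*:\s*(.+)$", re.MULTILINE)


def encode_png(png_bytes: bytes) -> str:
    """Base64-encode raw PNG bytes for an image-capable request."""
    return base64.b64encode(png_bytes).decode("ascii")


def parse_critique(text: str) -> Dict[str, Any]:
    """Pull SCORE, ISSUE, and REASON out of the validator's reply."""
    fields = {key.lower(): value.strip() for key, value in FIELD_RE.findall(text)}
    missing = {"score", "issue", "reason"} - fields.keys()
    if missing:
        raise ValueError(f"critique is missing fields: {sorted(missing)}")
    score_match = re.search(r"\d+", fields["score"])
    if score_match is None:
        raise ValueError("SCORE has no numeric value")
    return {
        "score": int(score_match.group()),
        "issue": fields["issue"],
        "reason": fields["reason"],
    }
```

Raising on missing fields, rather than defaulting the score, makes a malformed critique visible instead of silently rejecting or accepting a frame.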
+---
+
+## Common Settings Across Both Projects
+
+```json
+{
+  "model": "gemma4:26b",
+  "think": false,
+  "options": {
+    "num_ctx": 4096,
+    "num_predict": 2048,
+    "temperature": 0.5
+  },
+  "keep_alive": "30m"
+}
+```
+
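+Treat these values as a baseline rather than either project's exact configuration. Adjust `num_ctx` and `num_predict` upward for your payload size (the values shown are safe minimums), and set `temperature` and `keep_alive` per workload, as in the tables above.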
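A small helper can apply that advice by layering overrides on the baseline. This is a sketch: `with_overrides` is a hypothetical name, and the override values in the usage lines are taken from the Simon settings table.

```python
from __future__ import annotations

import copy
from typing import Any, Dict, Optional

# Baseline copied from the JSON block above.
BASELINE: Dict[str, Any] = {
    "model": "gemma4:26b",
    "think": False,
    "options": {"num_ctx": 4096, "num_predict": 2048, "temperature": 0.5},
    "keep_alive": "30m",
}


def with_overrides(options: Optional[Dict[str, Any]] = None, **top_level: Any) -> Dict[str, Any]:
    """Return a copy of BASELINE with per-project overrides applied."""
    payload = copy.deepcopy(BASELINE)  # never mutate the shared baseline
    payload["options"].update(options or {})
    payload.update(top_level)
    return payload


# Usage: reproduce the Simon settings on top of the baseline.
simon = with_overrides(
    {"num_ctx": 32768, "num_predict": 4096, "temperature": 1.0, "top_p": 0.95, "top_k": 64},
    keep_alive="4h",
)
```

The deep copy matters because `options` is a nested dict; a shallow copy would let one project's overrides leak into the baseline.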