Architecture specs, benchmarks, gotchas, Ollama settings, tool calling format, and implementation patterns from Simon and AI_Visualizer. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3.5 KiB
Gemma 4 Implementation Reference
Patterns extracted from Seth's two production Gemma 4 projects.
Project: Simon (FreibergFamily/simon/)
Purpose: AI genealogy historian — multi-turn chat with tool-calling agent
| Setting | Value |
|---|---|
| Model | gemma4:26b |
| API | /api/chat (multi-turn) |
| num_ctx | 32768 |
| num_predict | 4096 |
| temperature | 1.0 |
| top_p | 0.95 |
| top_k | 64 |
| keep_alive | 4h |
| think | (not explicitly set — should be false) |
| format_json | not used |
| Vision | not used |
| Tool calling | 6 tools, max 12 iterations |
Key Patterns
-
Aggressive system prompt: 40+ lines defining identity, boundaries, tool usage rules, multi-step chaining requirements. Gemma 4 follows all of it.
-
Tool chaining instructions: System prompt explicitly tells Gemma to chain tools (e.g., "after lookup_person, ALSO call get_historical_context"). Gemma 4 follows these multi-step chains reliably.
-
Parallel tool calls: Encouraged in system prompt for multiple lookups. Gemma 4 does this.
-
History pruning: Drops old tool results and tool-call messages, keeps assistant summaries. Prevents context bloat in multi-turn.
-
Fallback to streaming: After 12 tool iterations, switches to stream mode (no tools) to force a text response.
-
Two modes (historian vs interview): Completely different system prompts swapped at runtime. Gemma 4 stays in character for both.
Project: AI Visualizer (AI_Visualizer/)
Purpose: Music-reactive video generator — Gemma 4 as reasoning engine across 4 pipeline stages
| Stage | num_predict | num_ctx | temperature | Purpose |
|---|---|---|---|---|
| Mood Analysis | 4096 | 16384 | 0.4-0.6 | Analyze CLAP descriptors -> narratives + boundary adjustments |
| Rate Pass | 512 | 4096 | 0.3-0.5 | Choose visual pacing rate per music segment |
| Storyboard | 2048 | 4096 | 0.6-0.8 | Generate SDXL prompts per music segment |
| Batch Expansion | 2048 | default | 0.7 | Interpolate between scene prompts over time |
| Vision Validator | 256 | default | 0.2 | Critique generated frames (queued for disable) |
Key Patterns
-
No tool calling used. All Gemma interaction is single-turn generate with JSON requested in prompt.
-
Client-side JSON extraction:
body = response["response"] start = body.find("{") end = body.rfind("}") obj = json.loads(body[start:end + 1]) -
Temperature ramping on retry: Base temp + bump per attempt. Conservative first, creative on retry.
-
think: false everywhere. Explicitly set on every call. Critical for budget control.
-
format_json: false everywhere. Causes infinite loops on nested schemas.
-
Model pinning:
keep_alive=-1to prevent GPU eviction during long SDXL pauses. -
Explicit num_ctx: Added after discovering Ollama defaults to 2048, which truncated mood analyzer prompts on long tracks.
-
Banned vocabulary in prompts: List of cliche words (cinematic, dramatic, ethereal...) passed to Gemma to avoid generic output.
-
Vision for image critique: Base64-encoded PNG -> structured SCORE/ISSUE/REASON output parsed by regex. Works but overrejects on subjective quality.
Common Settings Across Both Projects
{
"model": "gemma4:26b",
"think": false,
"options": {
"num_ctx": 4096,
"num_predict": 2048,
"temperature": 0.5
},
"keep_alive": "30m"
}
Adjust num_ctx/num_predict upward for your payload size. These are safe minimums.