From 5011059f5d634f88528d948724b3382176480eb1 Mon Sep 17 00:00:00 2001 From: Mortdecai Date: Sun, 12 Apr 2026 18:14:19 -0400 Subject: [PATCH] docs: initial Gemma 4 research corpus and synthesis Architecture specs, benchmarks, gotchas, Ollama settings, tool calling format, and implementation patterns from Simon and AI_Visualizer. Co-Authored-By: Claude Opus 4.6 (1M context) --- CORPUS_architecture.md | 105 +++++++++++++++++ CORPUS_benchmarks.md | 40 +++++++ CORPUS_capabilities.md | 55 +++++++++ CORPUS_ollama_variants.md | 42 +++++++ CORPUS_tool_calling_format.md | 100 +++++++++++++++++ GOTCHAS.md | 205 ++++++++++++++++++++++++++++++++++ IMPLEMENTATIONS.md | 95 ++++++++++++++++ README.md | 25 +++++ SYNTHESIS.md | 194 ++++++++++++++++++++++++++++++++ 9 files changed, 861 insertions(+) create mode 100644 CORPUS_architecture.md create mode 100644 CORPUS_benchmarks.md create mode 100644 CORPUS_capabilities.md create mode 100644 CORPUS_ollama_variants.md create mode 100644 CORPUS_tool_calling_format.md create mode 100644 GOTCHAS.md create mode 100644 IMPLEMENTATIONS.md create mode 100644 README.md create mode 100644 SYNTHESIS.md diff --git a/CORPUS_architecture.md b/CORPUS_architecture.md new file mode 100644 index 0000000..9d0bc4b --- /dev/null +++ b/CORPUS_architecture.md @@ -0,0 +1,105 @@ +# Gemma 4 Architecture Reference + +> Sources: Google DeepMind blog, HuggingFace blog (huggingface.co/blog/gemma4), +> Maarten Grootendorst visual guide, kaitchup.substack.com, wavespeed.ai + +## Model Family + +| Variant | Total Params | Effective Params | Type | Notes | +|---------|-------------|-----------------|------|-------| +| E2B | ~5.1B | ~2.3B | Dense + PLE | On-device, audio+vision | +| E4B | ~8B | ~4B | Dense + PLE | On-device, audio+vision | +| 31B | 31B | 31B | Dense | 60 layers, widened vs Gemma 3 27B (62 layers) | +| 26B A4B | 26B | ~4B active | MoE | 128 experts, 8 active + 1 shared | + +## Attention Architecture + +- **Pattern:** Local (sliding window) interleaved 
with global attention + - E2B: 4:1 ratio (4 local, 1 global). E4B/31B/26B: 5:1 ratio + - Global attention is always the last layer +- **Sliding window:** E2B/E4B = 512 tokens; 31B/26B = 1024 tokens +- **Grouped Query Attention (GQA):** + - Local: 2 query heads share 1 KV head + - Global: 8 query heads share 1 KV head, doubled Key dimensions + +## Positional Encoding: Proportional RoPE (p-RoPE) + +- Applied to global attention layers only +- p=0.25 -> rotates only 25% of head dimensions +- theta=1M +- 75% of dimensions are position-independent -> better long-context extrapolation +- Replaces Gemma 3's 8x linear frequency scaling + +## Per-Layer Embeddings (PLE) — E2B/E4B Only + +- Each decoder layer gets its own unique token representation +- Parallel lower-dimensional pathway alongside main residual stream +- PLE dimensions: 256 (E2B), 2560 (E4B) +- Original embedding dimensions: 1536 (E2B), 2560 (E4B) +- Applied between decoder blocks with gating function +- This is why E2B has 5.1B total but only 2.3B effective — the PLE table is large + +## Shared KV Cache + +- Last N layers reuse K/V tensors from earlier layers (same attention type) +- No quality loss in practice +- Significant memory + compute savings for long-context generation + +## Vision Encoder + +- Params: 150M (E2B/E4B), 550M (31B/26B) +- Patch size: 16x16 pixels +- 3x3 neighboring patches merged into single embedding +- Uses 2D RoPE for variable aspect ratio +- Token budgets: 70, 140, 280, 560, 1120 soft tokens +- Approximate resolutions: 272x176 (70 tokens) -> 1088x704 (1120 tokens) + +## Audio Encoder — E2B/E4B Only + +- Conformer architecture with convolutional modules +- Mel-spectrogram feature extraction +- Two 2D conv layers for downsampling +- NOT available on 31B or 26B variants + +## MoE Details (26B A4B) + +- 128 total experts +- 8 experts activated per token +- 1 shared expert (3x size of regular experts) +- 119 experts unused during any given forward pass + +## Context Window + +| Variant | 
Context Window | MRCR v2 8-needle @ 128K | +|---------|---------------|------------------------| +| E2B | 128K | 19.1% | +| E4B | 128K | 25.4% | +| 26B A4B | 256K | 44.1% | +| 31B | 256K | 66.4% | +| Gemma 3 27B | 128K | 13.5% | + +- Ollama default num_ctx: 2048 (must override!) +- Retrieval accuracy diminishes beyond ~100K tokens in repetitive/unstructured text + +## Vocabulary + +- SentencePiece tokenizer, 262,144 tokens (262K vocab, up from 256K in earlier Gemma) + +## Memory Requirements (approximate) + +| Model | BF16 | 8-bit | 4-bit | +|-------|------|-------|-------| +| E2B | 9.6 GB | 4.6 GB | 3.2 GB | +| E4B | 15 GB | 7.5 GB | 5 GB | +| 31B Dense | 58.3 GB | 30.4 GB | 17.4 GB | +| 26B A4B (MoE) | 48 GB | 25 GB | 15.6 GB | + +Note: 26B MoE requires ALL 26B params loaded despite only activating ~4B per token. + +## License + +Apache 2.0: major change from Gemma 3's proprietary "Gemma Terms of Use". No custom clauses, no redistribution restrictions. + +## Training Data Cutoff + +January 2025 diff --git a/CORPUS_benchmarks.md b/CORPUS_benchmarks.md new file mode 100644 index 0000000..c91b1dc --- /dev/null +++ b/CORPUS_benchmarks.md @@ -0,0 +1,40 @@ +# Gemma 4 Benchmarks + +> Source: Google DeepMind model card, HuggingFace blog, LMArena +> Released: April 2, 2026 + +## Gemma 4 vs Gemma 3 (biggest single-version jump in Gemma family) + +| Benchmark | Gemma 3 27B | Gemma 4 31B | Gemma 4 26B A4B | Delta (31B vs G3) | +|-----------|------------|------------|----------------|-------------------| +| MMLU Pro | 67.6% | 85.2% | 82.6% | +17.6 | +| AIME 2026 (no tools) | 20.8% | 89.2% | 88.3% | +68.4 | +| GPQA Diamond | 42.4% | 84.3% | 82.3% | +41.9 | +| BigBench Extra Hard | 19.3% | 74.4% | 64.8% | +55.1 | +| LiveCodeBench v6 | 29.1% | 80.0% | 77.1% | +50.9 | +| Codeforces ELO | 110 | 2150 | 1718 | +2040 | +| MMMU Pro (vision) | 49.7% | 76.9% | 73.8% | +27.2 | +| MATH-Vision | 46.0% | 85.6% | 82.4% | +39.6 | +| OmniDocBench (lower=better) | 0.365 | 0.131 | 0.149 | 
-0.234 | +| MRCR v2 128K | 13.5% | 66.4% | 44.1% | +52.9 | +| MMMLU (multilingual) | 70.7% | 88.4% | 86.3% | +17.7 | + +## Arena Scores + +| Model | LMArena Score | Rank | +|-------|--------------|------| +| Gemma 4 31B | 1452 | #3 | +| Gemma 4 26B A4B | 1441 | #6 | + +## Agentic Benchmark (tau2-bench) + +| Model | Score | +|-------|-------| +| 31B | 86.4% | +| 26B A4B | 85.5% | +| E4B | 57.5% | +| E2B | 29.4% | + +## Takeaway + +The jump from Gemma 3 to 4 is enormous — AIME went from 20.8% to 89.2%, Codeforces from 110 to 2150 ELO. This is not an incremental update. The 26B MoE nearly matches 31B Dense on most benchmarks while using ~4B active params. diff --git a/CORPUS_capabilities.md b/CORPUS_capabilities.md new file mode 100644 index 0000000..52b2cc6 --- /dev/null +++ b/CORPUS_capabilities.md @@ -0,0 +1,55 @@ +# Gemma 4 Capabilities Reference + +## Modalities + +### Text (all variants) +- Standard instruction-following, chat, completion +- System prompt support (critical — see synthesis) +- 128K context window (training length) +- 262K vocabulary + +### Vision (all variants) +- **Tested and verified working** (Seth, 2026-04-10) +- Accurately described colors, shapes, composition in 256x256 test image +- ~25 tok/s, ~24s end-to-end on pve197 V100 +- Input: base64-encoded image in `images` field of Ollama API +- Vision encoder: 16x16 patches, 2D RoPE, variable aspect ratio +- Token budgets scale with resolution (70-1120 soft tokens) +- Used in AI_Visualizer for SDXL frame quality criticism + +### Audio (E2B/E4B only) +- **Not tested by Seth** — status unknown in practice +- Conformer architecture (~300M params), mel-spectrogram input +- **Trained on SPEECH ONLY — not music or environmental sounds** +- Maximum 30 seconds per clip +- NOT available on 26B or 31B variants +- AI_Visualizer explicitly rejected audio for music analysis (DECISIONS S2) — correct call, model wasn't trained for it + +### Video (all variants) +- E2B/E4B: video WITH audio 
(`load_audio_from_video=True`) +- 31B/26B: video WITHOUT audio +- Not explicitly post-trained on video but works +- Maximum 60 seconds at 1 frame/second +- Not tested by Seth + +### Tool Calling / Function Calling +- **Verified reliable** in both Simon and AI_Visualizer +- Ollama native tool format (OpenAI-compatible function calling) +- Simon: 6 genealogy tools, up to 12 sequential iterations +- Supports parallel tool calls in single response +- Weak at deeply nested JSON schemas -> prefer sequential calls + +## Benchmark Context (vs Gemma 3) + +- 31B replaces Gemma 3 27B (60 layers vs 62, but wider) +- MoE variant (26B) is new — no Gemma 3 equivalent +- E-series with PLE is new — on-device focus +- Proportional RoPE replaces linear frequency scaling -> better long-context +- Shared KV cache is new -> more efficient inference + +## What Gemma 4 Does NOT Do + +- No native code execution / sandboxing +- No web browsing or retrieval +- Audio only on E-series (not the models most people run) +- No built-in RAG — tool calling can implement it diff --git a/CORPUS_ollama_variants.md b/CORPUS_ollama_variants.md new file mode 100644 index 0000000..1f7be26 --- /dev/null +++ b/CORPUS_ollama_variants.md @@ -0,0 +1,42 @@ +# Gemma 4 on Ollama — Available Variants + +> Last verified against Seth's homelab: 2026-04-12 + +## Ollama Model Tags + +| Tag | Params | Quant | Size on Disk | VRAM | Notes | +|-----|--------|-------|-------------|------|-------| +| `gemma4:e4b-it-q8_0` | ~8B total / 4B effective | Q8_0 | 11.6GB | ~12GB | Vision + audio capable. ~25 tok/s on V100 | +| `gemma4:26b` | 25.8B | Q4_K_M (default) | 18.0GB | ~18GB | Sweet spot for quality/speed. 
~134 tok/s on 3090 Ti | +| `gemma4:31b-it-q4_K_M` | 31.3B | Q4_K_M | 19.9GB | ~24.5GB | Sharpest but 5x slower (~28 tok/s on 3090 Ti, memory pressure) | + +## Capabilities by Variant (from `ollama show`) + +All variants support: +- Text generation (completion, chat) +- Vision (image input via base64 in `images` field) +- Tool/function calling (native Ollama tool format) + +E-series (E2B, E4B) additionally support: +- Audio input (conformer encoder) + +## GPU Coexistence (pve197 V100 32GB) + +- gemma4:26b + SDXL Turbo: ~28.5GB peak VRAM — fits on V100-32GB +- gemma4:31b: 24.5GB alone — memory pressure with any coexisting model +- gemma4:e4b-it-q8_0: ~12GB — comfortable headroom + +## Ollama API Endpoint + +- `/api/generate` (single-turn, used by AI_Visualizer) +- `/api/chat` (multi-turn with message history, used by Simon) +- Both accept `tools`, `images`, `stream`, `options`, `keep_alive` + +## Important Ollama Defaults to Override + +| Parameter | Ollama Default | Recommended | Why | +|-----------|---------------|-------------|-----| +| `num_ctx` | 2048 | 4096-32768 | Default is absurdly small, causes truncation | +| `num_predict` | 128 | 512-4096+ | Default truncates almost all useful output | +| `think` | true (Ollama 0.20+) | false | See GOTCHAS doc | +| `keep_alive` | 5m | 30m-4h | Prevents expensive model reload between calls | diff --git a/CORPUS_tool_calling_format.md b/CORPUS_tool_calling_format.md new file mode 100644 index 0000000..349f235 --- /dev/null +++ b/CORPUS_tool_calling_format.md @@ -0,0 +1,100 @@ +# Gemma 4 Native Tool Calling Format + +> Source: Google AI for Developers - Function Calling docs +> https://ai.google.dev/gemma/docs/capabilities/text/function-calling-gemma4 + +## Special Tokens (6 total) + +| Token | Purpose | +|-------|---------| +| `<\|tool>` / `` | Tool definition block | +| `<\|tool_call>` / `` | Model's tool request | +| `<\|tool_response>` / `` | Tool execution result | + +String delimiter: `<\|"\|>` (encloses all string 
values in native format) + +## Native Format (raw model tokens) + +### Tool definition in system prompt: +``` +<|tool>declaration: +get_current_temperature{ + location:{type:<|"|>string<|"|>,description:<|"|>The city<|"|>}, + unit:{type:<|"|>string<|"|>,enum:[<|"|>celsius<|"|>,<|"|>fahrenheit<|"|>]} +} +``` + +### Tool call from model: +``` +<|tool_call>call:get_current_temperature{location:<|"|>London<|"|>} +``` + +### Tool response: +``` +<|tool_response>response:get_current_weather{temperature:15,weather:<|"|>sunny<|"|>} +``` + +## JSON Chat Format (for Ollama / OpenAI-compatible APIs) + +This is what you actually use in practice. Ollama translates to/from native tokens. + +### Tool definition: +```json +{ + "type": "function", + "function": { + "name": "get_weather", + "description": "Get current weather for a location", + "parameters": { + "type": "object", + "properties": { + "city": {"type": "string", "description": "The city name"} + }, + "required": ["city"] + } + } +} +``` + +### Model returns: +```json +{ + "role": "assistant", + "tool_calls": [{ + "function": { + "name": "get_weather", + "arguments": {"city": "London"} + } + }] +} +``` + +### Tool result message: +```json +{ + "role": "tool", + "content": "{\"temperature\": 15, \"weather\": \"sunny\"}" +} +``` + +## Thinking Mode + Tool Calls + +- When thinking is enabled, preserve thoughts between tool calls +- For long agent chains, summarize thoughts as plain text to save context +- Recommended: **disable thinking for tool-heavy workflows** (Seth's finding) + +## Framework Flags + +| Framework | Required Flag | +|-----------|--------------| +| llama.cpp | `--jinja` | +| vLLM | `--enable-auto-tool-choice` | +| Ollama | Works via `/api/chat` endpoint with `tools` field | +| transformers | `apply_chat_template(tools=[...])` | + +## Known Issues + +- Ollama v0.20.0-0.20.1: tool call parser broken, streaming drops tool calls +- llama.cpp: format mismatches and continuous loops reported +- LM Studio: 
compatibility issues with tool calling +- **Workaround:** Use non-streaming mode for tool calls (proven in Simon) diff --git a/GOTCHAS.md b/GOTCHAS.md new file mode 100644 index 0000000..7ec7fe7 --- /dev/null +++ b/GOTCHAS.md @@ -0,0 +1,205 @@ +# Gemma 4 Gotchas & Known Issues + +> Derived from Seth's production implementations (Simon, AI_Visualizer) +> and community reports. These are hard-won lessons. + +## CRITICAL: Thinking Mode Eats Context + +**Severity: HIGH — causes silent failures** + +Gemma 4 in Ollama 0.20+ defaults to `think: true`. When enabled: +- Thinking tokens go into a hidden `thinking` field, NOT `response` +- If `num_predict` is limited, thinking consumes the entire budget +- `response` comes back **empty** — no error, just silence +- On evaluative tasks, thinking inflates scores (31B scored a known-bad image 9/10 with thinking vs 7/10 without) + +**Fix:** Always pass `think: false` in the Ollama payload. Seth has had success ONLY with thinking off. + +```json +{ + "model": "gemma4:26b", + "think": false, + "options": { "num_predict": 4096 } +} +``` + +## CRITICAL: format=json Causes Infinite Loops + +**Severity: HIGH — hangs indefinitely** + +Ollama's server-side `format: "json"` enforcer causes Gemma 4 26B (Q4) to enter an infinite retry loop when the requested schema is deeply nested. + +**Fix:** Never use `format: "json"`. Instead: +1. Request JSON structure in the prompt text +2. Parse client-side with regex + `json.loads` + json5 fallback + +```python +# DO THIS +response = client.generate(model="gemma4:26b", prompt=prompt, format_json=False) +body = response["response"] +obj = json.loads(body[body.find("{"):body.rfind("}") + 1]) + +# NOT THIS +response = client.generate(model="gemma4:26b", prompt=prompt, format="json") # HANGS +``` + +## CRITICAL: Ollama Default Context is 2048 + +**Severity: HIGH — causes truncation** + +Ollama defaults `num_ctx` to 2048 tokens. Gemma 4 supports 128K. 
If you don't override, your prompts get silently truncated. + +**Fix:** Always set `num_ctx` explicitly: +```json +{ "options": { "num_ctx": 8192 } } +``` + +Scale to your needs: 4096 for simple tasks, 16384 for long inputs, 32768 for complex multi-turn. + +## HIGH: num_predict Default is 128 + +**Severity: HIGH — truncates output** + +Ollama defaults `num_predict` to 128 tokens. Almost any useful Gemma 4 output exceeds this. + +**Fix:** Always set `num_predict` explicitly. Minimum recommended: 512. For JSON output: 2048+. + +## MEDIUM: Weak at Long/Nested JSON + +**Severity: MEDIUM — causes parse failures** + +Gemma 4 reliably produces short JSON (5-10 fields) but struggles with: +- Deeply nested schemas (3+ levels) +- Long arrays (20+ items) +- Mixed nesting + length + +**Fix:** Sequential tool calls. Break one large JSON request into multiple smaller calls: +- Instead of "generate a 50-item storyboard", do "generate items 1-5", "generate items 6-10", etc. +- Due to Gemma 4's fast speed and free local use, sequential calls are cheap + +**Fallback pattern (AI_Visualizer):** +```python +for attempt in range(MAX_RETRIES): + temp = BASE_TEMP + attempt * TEMP_BUMP # 0.4 -> 0.5 -> 0.6 + response = call_gemma(temperature=temp) + try: + return parse_json(response) + except JSONDecodeError: + continue +``` + +## MEDIUM: Identity Confusion + +**Severity: MEDIUM — cosmetic but confusing** + +Gemma 4 is ultra-compliant and highly capable but does not know who it is. It may: +- Claim to be a different model +- Hallucinate capabilities it doesn't have +- Respond as a generic "AI assistant" without personality + +**Fix:** Explicit identity in system prompt: +``` +You are [Name], a [role]. You are powered by Gemma 4. +You ONLY do [X]. You NEVER do [Y]. +``` + +Gemma 4 does NOT need hand-holding on task execution — it's very capable. +It needs explicit instructions about identity and boundaries. 
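
The identity fix above can be sketched as a small helper that fills the two-line template and places it in the system slot of a chat message list. This is a minimal sketch, not how Simon or AI_Visualizer build their prompts: the function names, the agent name "Archie", and the role text are all hypothetical.

```python
def build_identity_prompt(name, role, only_does, never_does):
    """Fill the identity template: who the agent is and where its boundaries are."""
    lines = [
        f"You are {name}, a {role}. You are powered by Gemma 4.",
        f"You ONLY do {only_does}. You NEVER do {never_does}.",
    ]
    return "\n".join(lines)


def build_messages(system_prompt, user_text):
    """Put the identity prompt first so every turn is answered in character."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_text},
    ]


# Hypothetical agent, used only to illustrate the template.
identity = build_identity_prompt(
    name="Archie",
    role="genealogy research assistant",
    only_does="family-history lookups",
    never_does="legal or medical advice",
)
messages = build_messages(identity, "Who are you?")
print(messages[0]["content"])
```

The resulting list has the shape of the `messages` array used for multi-turn chat; keeping the identity text in one helper makes it easy to swap prompts at runtime.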
+ +## MEDIUM: Flash Attention Hang on 31B Dense (>3-4K tokens) + +**Severity: MEDIUM — hardware-specific, affects RTX 3090** + +Community-reported: Flash Attention causes Gemma 4 31B Dense to hang indefinitely during prompt evaluation when the prompt exceeds ~3-4K tokens. The 26B MoE variant handles the same prompts fine — bug is specific to the Dense model. + +**Source:** [ollama/ollama#15350](https://github.com/ollama/ollama/issues/15350) + +**Fix:** Use 26B for long prompts, or disable Flash Attention if running 31B on affected hardware. + +## MEDIUM: Tool Calling Broken in Ollama v0.20.0 Streaming + +**Severity: MEDIUM — version-specific** + +As of early April 2026, Gemma 4 tool calling has issues in Ollama v0.20.0: the tool call parser fails and streaming drops tool calls entirely. Community reports include format mismatches and continuous loops in llama.cpp / LM Studio. + +**Source:** [community reports](https://dev.to/dentity007/-gemma-4-after-24-hours-what-the-community-found-vs-what-google-promised-3a2f) + +**Fix:** Use non-streaming for tool calls (Simon does this). Test tool calling thoroughly when upgrading Ollama versions. Seth's implementations work reliably with non-streaming tool calls. + +## MEDIUM: VRAM-Hungry for Context + +**Severity: MEDIUM — affects hardware planning** + +Gemma 4 KV cache is large relative to competitors. Community reports: 31B at 262K context requires ~22GB just for KV cache on top of model weights. One user could only fit Gemma 3 27B Q4 with 20K context on a 5090, while Qwen 3.5 27B Q4 fit with 190K context on the same card. + +**Implication:** Don't set num_ctx higher than you need. 32K is plenty for most tasks and keeps VRAM reasonable. + +## MEDIUM: Safety Overfiltering + +**Severity: MEDIUM — blocks benign prompts** + +Strict safety alignment occasionally blocks technical, academic, or creative prompts that superficially resemble restricted categories. One user reported jailbreaks with basic system prompts. 
+ +**Fix:** Rephrase blocked prompts to avoid trigger patterns. For system prompts, avoid language that sounds like you're asking the model to bypass restrictions — just state the task directly. + +## MEDIUM: KV Cache Config Bug (31B/26B ship with num_kv_shared_layers=0) + +**Severity: MEDIUM — crashes on first attention forward pass** + +The 31B and 26B ship with `num_kv_shared_layers = 0`, which causes `layer_types[:-0]` to collapse to zero layer slots. Crashes on first forward pass. + +**Fix:** Patch the config. Check model card discussions for the exact fix. + +## LOW: vLLM Triton Fallback (~9 tok/s on RTX 4090) + +**Severity: LOW — vLLM-specific** + +Heterogeneous attention head dimensions in Gemma 4 force vLLM to fall back to a slow Triton kernel. RTX 4090 gets ~9 tok/s instead of expected ~100+. + +**Source:** [vllm-project/vllm#38887](https://github.com/vllm-project/vllm/issues/38887) + +**Fix:** Use Ollama instead of vLLM for now, or wait for the fix. + +## LOW: `` Token Infinite Loop (Vulkan backends) + +**Severity: LOW — Vulkan-specific** + +Gemma 4 can generate `` or `` tokens in an infinite loop on Vulkan backends in llama.cpp. + +**Source:** [ggml-org/llama.cpp#21516](https://github.com/ggml-org/llama.cpp/issues/21516) + +## LOW: Fine-Tuning Ecosystem Issues + +**Severity: LOW — only relevant if fine-tuning** + +Day-one issues for fine-tuners: +- HuggingFace Transformers didn't recognize gemma4 architecture (required install from source) +- PEFT couldn't handle Gemma4ClippableLinear (new vision encoder layer type) +- New `mm_token_type_ids` field required during training even for text-only data +- E2B/E4B show training loss of 13-15, which is normal for multimodal models (not a bug) + +## LOW: Vision Validator Overrejects + +**Severity: LOW — specific to evaluative vision tasks** + +In AI_Visualizer, Gemma 4 vision was used to critique SDXL frames. It flagged images for motif-matching failures that humans rated as equal or better than passed images. 
The validator was queued for disable. + +**Pattern:** Gemma 4 vision is good at description but unreliable for subjective quality scoring. Use it for "what's in this image?" not "is this image good?" + +## LOW: Keep-Alive Too Short + +**Severity: LOW — performance only** + +Default `keep_alive` is 5 minutes. If your pipeline has gaps (e.g., waiting for SDXL generation), the model gets unloaded and reloaded (~10-30s penalty). + +**Fix:** Set `keep_alive` to match your pipeline duration: +```json +{ "keep_alive": "4h" } +``` + +Or pin/unpin explicitly: +```python +client.generate(model="gemma4:26b", prompt="", keep_alive=-1, options={"num_predict": 0}) # pin +# ... do work ... +client.generate(model="gemma4:26b", prompt="", keep_alive=0, options={"num_predict": 0}) # unpin +``` diff --git a/IMPLEMENTATIONS.md b/IMPLEMENTATIONS.md new file mode 100644 index 0000000..0999935 --- /dev/null +++ b/IMPLEMENTATIONS.md @@ -0,0 +1,95 @@ +# Gemma 4 Implementation Reference + +> Patterns extracted from Seth's two production Gemma 4 projects. + +## Project: Simon (FreibergFamily/simon/) + +**Purpose:** AI genealogy historian — multi-turn chat with tool-calling agent + +| Setting | Value | +|---------|-------| +| Model | `gemma4:26b` | +| API | `/api/chat` (multi-turn) | +| num_ctx | 32768 | +| num_predict | 4096 | +| temperature | 1.0 | +| top_p | 0.95 | +| top_k | 64 | +| keep_alive | 4h | +| think | (not explicitly set — should be false) | +| format_json | not used | +| Vision | not used | +| Tool calling | 6 tools, max 12 iterations | + +### Key Patterns + +1. **Aggressive system prompt:** 40+ lines defining identity, boundaries, tool usage rules, multi-step chaining requirements. Gemma 4 follows all of it. + +2. **Tool chaining instructions:** System prompt explicitly tells Gemma to chain tools (e.g., "after lookup_person, ALSO call get_historical_context"). Gemma 4 follows these multi-step chains reliably. + +3. 
**Parallel tool calls:** Encouraged in system prompt for multiple lookups. Gemma 4 does this. + +4. **History pruning:** Drops old tool results and tool-call messages, keeps assistant summaries. Prevents context bloat in multi-turn. + +5. **Fallback to streaming:** After 12 tool iterations, switches to stream mode (no tools) to force a text response. + +6. **Two modes (historian vs interview):** Completely different system prompts swapped at runtime. Gemma 4 stays in character for both. + +--- + +## Project: AI Visualizer (AI_Visualizer/) + +**Purpose:** Music-reactive video generator — Gemma 4 as reasoning engine across 4 pipeline stages + +| Stage | num_predict | num_ctx | temperature | Purpose | +|-------|-------------|---------|-------------|---------| +| Mood Analysis | 4096 | 16384 | 0.4-0.6 | Analyze CLAP descriptors -> narratives + boundary adjustments | +| Rate Pass | 512 | 4096 | 0.3-0.5 | Choose visual pacing rate per music segment | +| Storyboard | 2048 | 4096 | 0.6-0.8 | Generate SDXL prompts per music segment | +| Batch Expansion | 2048 | default | 0.7 | Interpolate between scene prompts over time | +| Vision Validator | 256 | default | 0.2 | Critique generated frames (queued for disable) | + +### Key Patterns + +1. **No tool calling used.** All Gemma interaction is single-turn generate with JSON requested in prompt. + +2. **Client-side JSON extraction:** + ```python + body = response["response"] + start = body.find("{") + end = body.rfind("}") + obj = json.loads(body[start:end + 1]) + ``` + +3. **Temperature ramping on retry:** Base temp + bump per attempt. Conservative first, creative on retry. + +4. **think: false everywhere.** Explicitly set on every call. Critical for budget control. + +5. **format_json: false everywhere.** Causes infinite loops on nested schemas. + +6. **Model pinning:** `keep_alive=-1` to prevent GPU eviction during long SDXL pauses. + +7. 
**Explicit num_ctx:** Added after discovering Ollama defaults to 2048, which truncated mood analyzer prompts on long tracks. + +8. **Banned vocabulary in prompts:** List of cliche words (cinematic, dramatic, ethereal...) passed to Gemma to avoid generic output. + +9. **Vision for image critique:** Base64-encoded PNG -> structured SCORE/ISSUE/REASON output parsed by regex. Works but overrejects on subjective quality. + +--- + +## Common Settings Across Both Projects + +```json +{ + "model": "gemma4:26b", + "think": false, + "options": { + "num_ctx": 4096, + "num_predict": 2048, + "temperature": 0.5 + }, + "keep_alive": "30m" +} +``` + +Adjust num_ctx/num_predict upward for your payload size. These are safe minimums. diff --git a/README.md b/README.md new file mode 100644 index 0000000..d4bf567 --- /dev/null +++ b/README.md @@ -0,0 +1,25 @@ +# gemma4-research + +Research corpus and implementation guidance for Google Gemma 4, based on production use in Seth's homelab. + +## Files + +| File | What | When to Read | +|------|------|-------------| +| `SYNTHESIS.md` | **Start here.** Opinionated guide — how to build with Gemma 4 | Before any new Gemma 4 implementation | +| `GOTCHAS.md` | Known issues and workarounds, severity-ranked | When debugging Gemma 4 issues or starting a new project | +| `IMPLEMENTATIONS.md` | Patterns from Simon and AI_Visualizer | When designing a new Gemma 4 integration | +| `CORPUS_architecture.md` | Model architecture details (layers, attention, PLE, MoE) | When you need to understand WHY Gemma 4 behaves a certain way | +| `CORPUS_ollama_variants.md` | Available models, sizes, VRAM, Ollama settings | When choosing a model variant or configuring Ollama | +| `CORPUS_capabilities.md` | Modalities (vision, audio, video, tools), what it can/can't do | When scoping what Gemma 4 can handle | +| `CORPUS_benchmarks.md` | Full benchmark table vs Gemma 3, arena scores, agentic scores | When comparing Gemma 4 to alternatives | +| 
`CORPUS_tool_calling_format.md` | Native token format + JSON API format for function calling | When implementing tool calling | + +## Source Projects + +- **Simon** (`~/bin/FreibergFamily/simon/`) — Multi-turn chat agent with 6 tools, genealogy historian +- **AI Visualizer** (`~/bin/AI_Visualizer/`) — Music video generator, 4-stage Gemma pipeline + vision + +## Key Insight + +Gemma 4 is ultra-compliant and highly capable but doesn't know who it is. It needs explicit system prompts, not hand-holding. Due to fast local inference, sequential tool calls beat long JSON requests. diff --git a/SYNTHESIS.md b/SYNTHESIS.md new file mode 100644 index 0000000..e9fc368 --- /dev/null +++ b/SYNTHESIS.md @@ -0,0 +1,194 @@ +# Gemma 4 Synthesis — How to Build With It + +> Opinionated guide based on two production implementations and ongoing use. +> Seth Freiberg, 2026-04-12 + +## The One-Paragraph Summary + +Gemma 4 is an ultra-compliant, highly-capable model that doesn't know who it is. It doesn't need hand-holding on tasks but needs explicit instructions in the system prompt about identity, boundaries, and output format. It needs `num_predict` increased (Ollama defaults are absurdly low), `think` set to false (thinking eats the context budget), and `format: json` avoided entirely (causes infinite loops). Due to its fast speed and free local inference, sequential tool calls are the ideal solution to tasks that would otherwise require long structured output. + +## Mental Model + +Think of Gemma 4 as a very competent employee on their first day. They can do the work — you don't need to explain how. But you DO need to explain: +- Who they are and what their job is +- What they should and should NOT do +- Exactly what format you want the deliverable in +- The boundaries of their role + +Get those right and Gemma 4 just works. Get them wrong and you get a generic chatbot. 
+ +## Mandatory Ollama Settings + +Every Gemma 4 call MUST include: + +```json +{ + "think": false, + "options": { + "num_ctx": 4096, + "num_predict": 2048 + } +} +``` + +**Why each one:** +- `think: false` — Ollama 0.20+ defaults to think:true. Thinking tokens consume num_predict budget invisibly, returning empty responses. Seth has ONLY had success with thinking off. +- `num_ctx: 4096+` — Ollama defaults to 2048. Your system prompt alone might exceed that. +- `num_predict: 2048+` — Ollama defaults to 128. Any structured output gets truncated. + +Scale these to your task. The values above are safe minimums, not recommendations. + +## System Prompt Template + +``` +You are [NAME], a [ROLE DESCRIPTION]. + +## What You Do +- [Explicit list of responsibilities] +- [Tools you have access to and when to use each one] + +## What You Do NOT Do +- [Explicit list of things to refuse or avoid] +- [Common mistakes to prevent] + +## Output Format +[Exact schema, field names, example if complex] +Respond with ONLY [format]. No prose outside the [format]. + +## Rules +- [Behavioral constraints] +- [Multi-step chaining instructions if using tools] + +Today's date: [DATE] +``` + +**Key principles:** +1. Identity first — who is this agent? +2. Positive instructions before negative (what TO do before what NOT to do) +3. Output format is explicit and complete — Gemma 4 follows schemas faithfully +4. "No prose outside the JSON" prevents wrapper text that breaks parsing +5. Date injection helps with temporal reasoning + +## Tool Calling Strategy + +Gemma 4 is **reliable for tool calling** but **weak at structuring long JSONs**. 
+ +### When to use tool calling (Ollama native) +- Multi-turn agents with 2-10 tools +- Sequential reasoning chains (lookup A -> use A to decide B -> lookup B) +- Any task where the model needs to gather information before responding + +### When to use prompt-based JSON instead +- Single-turn generation with known output structure +- When you need specific JSON schema control +- When the output is a payload (prompts, configs) not a conversation + +### The Sequential Pattern + +Instead of asking Gemma 4 to produce one massive JSON: +``` +BAD: "Generate a 50-scene storyboard as JSON" -> truncated/malformed +GOOD: "Generate scenes 1-5 as JSON" x10 -> reliable every time +``` + +Gemma 4's inference speed makes sequential calls cheap. A 10-call chain at ~134 tok/s on a 3090 Ti costs seconds, not minutes. This is the fundamental advantage of local models — latency is predictable and network-free. + +## JSON Extraction Pattern + +Since `format: "json"` is broken, always extract client-side: + +```python +# Python +import json +raw = response["response"] +start = raw.find("{") +end = raw.rfind("}") +if start >= 0 and end > start: + obj = json.loads(raw[start:end + 1]) +``` + +```javascript +// JavaScript +const raw = response.message.content; +const match = raw.match(/\{[\s\S]*\}/); +if (match) obj = JSON.parse(match[0]); +``` + +For arrays, find `[` and `]` instead. Add json5 fallback for trailing commas. + +## Temperature Guidelines + +| Task Type | Temperature | Why | +|-----------|-------------|-----| +| Evaluation / scoring | 0.2 | Consistent, reproducible judgments | +| Structured extraction | 0.3-0.4 | Faithful to schema | +| Creative generation | 0.6-0.8 | Variety without chaos | +| Conversation / chat | 0.7-1.0 | Natural feel | + +Retry strategy: bump temp +0.1 per retry to escape format failures. 
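
The extraction pattern and the retry strategy above can be combined into one helper. The following is a rough sketch under stated assumptions: `call_model` is a hypothetical callable standing in for whatever client wraps the Ollama request, the retry count and base temperature are illustrative values, and `fake_model` exists only so the example runs without a server.

```python
import json

BASE_TEMP = 0.4    # assumed starting temperature for structured output
TEMP_BUMP = 0.1    # +0.1 per retry, as in the retry strategy
MAX_RETRIES = 3    # assumed retry budget


def extract_json_object(raw):
    """Parse the outermost {...} span of raw; raise ValueError if none parses."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start < 0 or end <= start:
        raise ValueError("no JSON object found in model output")
    # json.JSONDecodeError is a subclass of ValueError, so callers catch one type.
    return json.loads(raw[start:end + 1])


def generate_json(call_model, prompt):
    """Call the model, raising the temperature slightly after each parse failure."""
    last_error = None
    for attempt in range(MAX_RETRIES):
        temperature = round(BASE_TEMP + attempt * TEMP_BUMP, 2)
        raw = call_model(prompt, temperature)
        try:
            return extract_json_object(raw)
        except ValueError as err:
            last_error = err
    raise RuntimeError(f"no valid JSON after {MAX_RETRIES} attempts") from last_error


def fake_model(prompt, temperature):
    """Stand-in for a real call: fails once, then returns JSON wrapped in prose."""
    fake_model.temps.append(temperature)
    if len(fake_model.temps) == 1:
        return "Sorry, here you go:"
    return 'Sure! {"scenes": [1, 2, 3]} Hope that helps.'


fake_model.temps = []
result = generate_json(fake_model, "Generate scenes 1-3 as JSON")
print(result, fake_model.temps)
```

Passing the model call in as a parameter keeps the retry logic testable offline; in a real pipeline the callable would set `think: false`, `num_ctx`, and `num_predict` on every request.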
+ +## Vision Usage + +**Works for:** Describing image contents (objects, colors, composition, text) +**Unreliable for:** Subjective quality scoring, aesthetic judgment + +```python +import base64 +with open("image.png", "rb") as f: + b64 = base64.b64encode(f.read()).decode("ascii") + +response = client.generate( + model="gemma4:26b", + prompt="Describe this image in detail.", + images=[b64], + think=False, + options={"temperature": 0.2, "num_predict": 512} +) +``` + +Vision is on ALL Gemma 4 variants (E2B, E4B, 26B, 31B). Audio is E-series only. + +## Context Management + +### Multi-turn (chat agents) +- Prune old tool results and tool-call messages +- Keep assistant's natural-language summaries +- Set num_ctx to 32768 for rich conversations +- Set a tool iteration limit (12 is proven) with streaming fallback + +### Single-turn (pipeline stages) +- Calculate your prompt size and set num_ctx accordingly +- For long inputs (full track analysis), use recursive splitting at natural boundaries +- Pin model with `keep_alive=-1` if pipeline has idle gaps + +## Model Selection + +| Use Case | Recommended | Why | +|----------|------------|-----| +| Production pipeline (needs GPU coexistence) | `gemma4:26b` | Best quality/speed/VRAM balance | +| On-device / edge | `gemma4:e4b-it-q8_0` | 12GB VRAM, vision+audio | +| Maximum quality (single-model GPU) | `gemma4:31b-it-q4_K_M` | Sharpest but slow under memory pressure | +| Rapid prototyping / testing | `gemma4:26b` | Fast enough for interactive dev | + +## Anti-Patterns + +1. **Don't use `format: "json"`** — infinite loops on nested schemas +2. **Don't leave `think` at default** — eats your output budget silently +3. **Don't leave `num_predict` at default** — 128 tokens is nothing +4. **Don't leave `num_ctx` at default** — 2048 truncates most prompts +5. **Don't ask for huge JSON in one call** — break into sequential calls +6. **Don't use thinking mode for evaluation** — inflates scores, wastes context +7. 
**Don't skip system prompt identity** — Gemma 4 becomes a generic chatbot +8. **Don't use audio on 26B/31B** — only E-series has audio encoder + +## Quick-Start Checklist + +- [ ] Set `think: false` +- [ ] Set `num_predict` >= 512 (2048+ for JSON output) +- [ ] Set `num_ctx` >= 4096 (scale to your prompt size) +- [ ] Write explicit system prompt with identity + boundaries + output format +- [ ] Extract JSON client-side (no `format: "json"`) +- [ ] Set `keep_alive` >= 30m (or pin with -1) +- [ ] For long structured output, use sequential calls +- [ ] For vision, pass base64 in `images` array +- [ ] Test with your actual prompt length — Ollama won't warn about truncation
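
The checklist can be sketched as a payload builder that refuses values below the stated minimums. This is an illustrative sketch only: the function name and its defaults are hypothetical, the `system` field is assumed to be accepted by the generate endpoint, and no request is sent, so the shape should be checked against the Ollama version in use.

```python
import base64
import json


def build_generate_payload(model, prompt, system, num_ctx=4096, num_predict=2048,
                           temperature=0.5, keep_alive="30m", image_bytes=None):
    """Build a request body that applies the checklist; no `format` key is set."""
    if num_ctx < 4096:
        raise ValueError("num_ctx should be >= 4096")
    if num_predict < 512:
        raise ValueError("num_predict should be >= 512")
    payload = {
        "model": model,
        "system": system,        # identity + boundaries + output format
        "prompt": prompt,
        "think": False,          # thinking off so the output budget is not consumed
        "stream": False,
        "keep_alive": keep_alive,
        "options": {
            "num_ctx": num_ctx,
            "num_predict": num_predict,
            "temperature": temperature,
        },
    }
    if image_bytes is not None:
        # Vision input goes in the `images` array as base64 text.
        payload["images"] = [base64.b64encode(image_bytes).decode("ascii")]
    return payload


payload = build_generate_payload(
    model="gemma4:26b",
    prompt="Describe this image in detail.",
    system="You are a hypothetical image describer. Respond with ONLY JSON.",
    image_bytes=b"\x89PNG",      # placeholder bytes, not a real image
)
print(json.dumps(payload, indent=2))
```

JSON is then extracted client-side from the response text rather than requested through a `format` field.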