docs: initial Gemma 4 research corpus and synthesis

Architecture specs, benchmarks, gotchas, Ollama settings, tool calling format, and implementation patterns from Simon and AI_Visualizer. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 18:14:19 -04:00
commit 5011059f5d
9 changed files with 861 additions and 0 deletions
@@ -0,0 +1,42 @@
+# Gemma 4 on Ollama — Available Variants
+
+> Last verified against Seth's homelab: 2026-04-12
+
+## Ollama Model Tags
+
+| Tag | Params | Quant | Size on Disk | VRAM | Notes |
+|-----|--------|-------|-------------|------|-------|
+| `gemma4:e4b-it-q8_0` | ~8B total / 4B effective | Q8_0 | 11.6GB | ~12GB | Vision + audio capable. ~25 tok/s on V100 |
+| `gemma4:26b` | 25.8B | Q4_K_M (default) | 18.0GB | ~18GB | Sweet spot for quality/speed. ~134 tok/s on 3090 Ti |
+| `gemma4:31b-it-q4_K_M` | 31.3B | Q4_K_M | 19.9GB | ~24.5GB | Sharpest but 5x slower (~28 tok/s on 3090 Ti, memory pressure) |
+
+## Capabilities by Variant (from `ollama show`)
+
+All variants support:
+- Text generation (completion, chat)
+- Vision (image input via base64 in `images` field)
+- Tool/function calling (native Ollama tool format)
+
+E-series (E2B, E4B) additionally support:
+- Audio input (conformer encoder)
+
+## GPU Coexistence (pve197 V100 32GB)
+
+- gemma4:26b + SDXL Turbo: ~28.5GB peak VRAM — fits on V100-32GB
+- gemma4:31b: 24.5GB alone — memory pressure with any coexisting model
+- gemma4:e4b-it-q8_0: ~12GB — comfortable headroom
+
+## Ollama API Endpoint
+
+- `/api/generate` (single-turn, used by AI_Visualizer)
+- `/api/chat` (multi-turn with message history, used by Simon)
+- Both accept `tools`, `images`, `stream`, `options`, `keep_alive`
+
+## Important Ollama Defaults to Override
+
+| Parameter | Ollama Default | Recommended | Why |
+|-----------|---------------|-------------|-----|
+| `num_ctx` | 2048 | 4096-32768 | Default is absurdly small, causes truncation |
+| `num_predict` | 128 | 512-4096+ | Default truncates almost all useful output |
+| `think` | true (Ollama 0.20+) | false | See GOTCHAS doc |
+| `keep_alive` | 5m | 30m-4h | Prevents expensive model reload between calls |