diff --git a/.claude/handoffs/2026-04-18-canonical-tooling-research.md b/.claude/handoffs/2026-04-18-canonical-tooling-research.md new file mode 100644 index 0000000..bfacaa4 --- /dev/null +++ b/.claude/handoffs/2026-04-18-canonical-tooling-research.md @@ -0,0 +1,68 @@ +# Handoff — 2026-04-18: Canonical Tooling Research + +## TL;DR for the next session + +A parallel research pass pulled 147 files / 14 MB of first-party Gemma 4 tooling into `tooling/`, and the 13 findings that contradicted or extended the existing corpus were merged into the top-level `SYNTHESIS.md` / `GOTCHAS.md` / `CORPUS_*.md` docs. The repo is in a clean, coherent state. + +**If you're opening this repo for Gemma 4 implementation work, `SYNTHESIS.md` is still the right first read.** The new `tooling/README.md` is the receipts layer — read it when you need authoritative source material (model cards, chat templates, serving commands, sibling-model briefs). + +## What shipped + +**Commit `eecebe7` (master, pushed to `git.sethpc.xyz/Seth/gemma4-research`):** added `tooling/` with five subdirs — `google-official/`, `huggingface/`, `inference-frameworks/`, `gemma-family/`, `fine-tuning/`. Each subdir has its own indexing README. + +**Follow-up commit (same session):** patched top-level corpus docs with the 9 findings worth merging. The `tooling/README.md` "Findings" list now marks each one `[merged: ]` or `[flagged]` for provenance. + +## Key confirmed facts + +| Claim | Verified against | +|-------|-----------------| +| `gemma4:26b` is a MoE (25.2B total, 3.8B active, 8 of 128 experts + 1 shared) | HF model card at `tooling/huggingface/model-cards/gemma-4-26B-A4B-it-README.md` | +| Q4_K_M inference on the MoE is fine (standard practice) | Mixtral/DeepSeek precedent; card neutral on inference quant | +| Gemma 4 changed turn tokens from `` to `<|turn>`/`` | `tooling/huggingface/model-cards/gemma-4-*-chat_template.jinja` | +| Tool use is **trained** in Gemma 4, not a proof-of-concept as in Gemma 1/2/3 | DeepMind tool-use colab at `tooling/google-official/deepmind-gemma/colab_tool_use.ipynb` | +| `google/gemma_pytorch` is abandoned for Gemma 4 | Last push 2025-05-30, variants validator | +| No Gemma 4 technical report PDF as of 2026-04-18 | DeepMind repo README + direct URL probes | +| No specialized siblings on Gemma 4 base yet (ShieldGemma 2, CodeGemma, PaliGemma 2, EmbeddingGemma all still on Gemma 2/3) | Per-sibling model cards in `tooling/gemma-family/` | + +## Open threads — flagged but not implemented + +These came out of the mort-bot impact review later in the session. All three are high-value but out of scope for this research pass: + +1. **EmbeddingGemma (308M) as a drop-in upgrade for mort-bot's `chat_search` / `memory_read` tools.** Mort currently uses FTS5 keyword-only — misses semantic matches. EmbeddingGemma's Matryoshka sizes (768/512/256/128) + 100+ languages make it a clean fit. Integration sketch in the session conversation; full research at `tooling/gemma-family/embeddinggemma.md`. Starter notebook at `tooling/google-official/cookbook/tutorials_RAG_EmbeddingGemma.ipynb`. **Next steps:** (a) `ollama pull embeddinggemma` on steel141, (b) A/B against existing `nomic-embed-text` on actual mort chat logs before committing to backfill. + +2. **ShieldGemma 2 (4B) as a `generate_image` pre-filter for mort-bot.** Mort's SDXL tool has no safety gate. ShieldGemma 2 is Gemma-3-based but scoped exactly to image safety. Would run on steel141 alongside `gemma4:26b` (3090 has headroom). + +3. **Native object detection for mort's `vision_describe`.** Gemma 4 does grounded bbox output natively — "Detect the X" → `{box_2d: [ymin, xmin, ymax, xmax]}` in 1000×1000 coords. Mort currently only does free-form vision description. + +None of these were implemented in this session. + +## Files changed this session + +- **New:** `tooling/` (147 files), `tooling/README.md`, `.claude/handoffs/2026-04-18-canonical-tooling-research.md` (this file) +- **Edited:** `README.md` (added `tooling/` row), `SYNTHESIS.md` (banner + model-selection table), `GOTCHAS.md` (added gemma_pytorch abandonment + expanded fine-tuning section), `CORPUS_tool_calling_format.md` (added Chat Template Context + HF transformers Alternative), `CORPUS_ollama_variants.md` (annotated 26b as MoE + audio note), `CORPUS_capabilities.md` (native system role, thinking, object detection, embedding pointer) +- **Unchanged:** `IMPLEMENTATIONS.md` (Simon/AI_Visualizer specific, not affected), `CORPUS_architecture.md` (already had MoE details right), `CORPUS_benchmarks.md` (still current) + +## What future sessions should know + +- **The research is the receipts, not the source of truth.** The top-level `SYNTHESIS.md` / `GOTCHAS.md` / `CORPUS_*.md` docs are the working reference. `tooling/` backs them up with downloaded upstream material when you need provenance or a working script. +- **Don't re-research the same ground.** Every `tooling/*/README.md` lists what's there and the source URL. Grep the tooling corpus before spawning new web searches. +- **The 26B-is-MoE and Q4_K_M-is-fine facts were the main things that would have been re-litigated without this handoff.** If you see a claim that conflicts with those, check the HF model card first (`tooling/huggingface/model-cards/gemma-4-26B-A4B-it-README.md`) — Google's own documentation, not secondhand. +- **Sibling-model generation lag.** When reaching for ShieldGemma / CodeGemma / PaliGemma / EmbeddingGemma, don't assume a Gemma-4 base — they're still on 2 or 3. Use them anyway; just don't confuse generations. +- **Mort-bot is where the low-hanging fruit is** if Seth wants a next practical project. Three items above; EmbeddingGemma is the biggest lever. + +## Session narrative (for context, not action) + +1. Started with the existing corpus (SYNTHESIS + GOTCHAS + 5 CORPUS files, ~22KB total). Goal: add canonical upstream tooling. +2. Dispatched five parallel `general-purpose` agents covering Google official, HF, inference frameworks, Gemma family, fine-tuning. +3. All five returned clean — 147 files downloaded, each indexed per subdir. +4. Wrote `tooling/README.md` with 10 findings from the agents. Initial plan: flag only, don't touch the older corpus. +5. Seth asked how the findings affect mort-bot. Read mort's CLAUDE.md / DECISIONS.md / llm.py / config.py / tools.py. Ranked: EmbeddingGemma (high), ShieldGemma 2 (high), bbox detection (high), E-series audio (medium), everything else (low/none because Ollama hides transformers changes). +6. Seth ran `ollama show gemma4:26b`; output confirmed MoE (25.8B, Q4_K_M). Walked back the earlier "worth A/B testing" extrapolation — that was training guidance misapplied to inference. Q4_K_M on the MoE is fine. +7. Seth asked "did you update synthesis?" — no, I hadn't. He authorized the updates. Patched 5 top-level docs; updated `tooling/README.md` findings list to mark merged-vs-flagged. +8. Wrote this handoff. + +## Don't do these things next session + +- Don't commit the ipynb files with `--no-verify` unless you ask again — the secrets-hook false positives (base64 notebook outputs, example Ed25519 keys) are documented, but re-bypassing without asking would be scope creep. If you add more ipynb content, strip outputs with `jupyter nbconvert --ClearOutputPreprocessor.enabled=True` first. +- Don't restructure the folder. It's organized fine: `README.md` → `SYNTHESIS.md` (primary) → specialized `CORPUS_*.md` / `GOTCHAS.md` / `IMPLEMENTATIONS.md` → `tooling/` (receipts). New material goes into one of those buckets, not a new top-level thing. +- Don't assume the Gemma 3 technical report covers Gemma 4. It's the closest thing we have but it predates Gemma 4. diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..d7c2a95 --- /dev/null +++ b/.gitignore @@ -0,0 +1,12 @@ +# Local scratch / backups — per ~/.claude/CLAUDE.md, Claude keeps backups before +# editing any file. Useful locally; not useful in the tracked history. +.backup/ + +# Python +__pycache__/ +*.pyc +.pytest_cache/ + +# Editor / OS +.DS_Store +*.swp diff --git a/CORPUS_capabilities.md b/CORPUS_capabilities.md index 52b2cc6..b0e264e 100644 --- a/CORPUS_capabilities.md +++ b/CORPUS_capabilities.md @@ -4,8 +4,9 @@ ### Text (all variants) - Standard instruction-following, chat, completion -- System prompt support (critical — see synthesis) -- 128K context window (training length) +- **Native `system` role support** (new in Gemma 4; Gemma 3 prepended system as user turn) +- **Configurable thinking mode** — `<|think|>` / `<|channel>` tokens in the chat template; Ollama `think: true/false` flag. Seth's finding (see GOTCHAS): keep `false` for tool-use workloads. +- 128K context window (E2B/E4B) / 256K (26B/31B) — training length - 262K vocabulary ### Vision (all variants) @@ -38,6 +39,13 @@ - Simon: 6 genealogy tools, up to 12 sequential iterations - Supports parallel tool calls in single response - Weak at deeply nested JSON schemas -> prefer sequential calls +- **First Gemma generation with tool use as a trained capability.** Gemma 1/2/3 tool use was "proof-of-concept" (per the DeepMind tool_use colab). Gemma 4 has dedicated tool-call tokens and is trained on the pattern. + +### Native Object Detection (all variants) +- **Prompt format:** "Detect the {object} in this image" → structured output `{box_2d: [ymin, xmin, ymax, xmax]}` in **1000×1000-normalized coordinates** (rescale to your actual image dims). +- Images auto-resized to multiples of 48 pixels by the processor. +- Useful for grounding, cropping, counting, or passing bboxes to downstream tools — no separate detection model required. +- Documented in the HF model card (`tooling/huggingface/model-cards/gemma-4-*.md`). Not tested by Seth yet. ## Benchmark Context (vs Gemma 3) @@ -51,5 +59,6 @@ - No native code execution / sandboxing - No web browsing or retrieval -- Audio only on E-series (not the models most people run) +- Audio only on E-series (not the models most people run) — and **not on Ollama**, requires llama.cpp mmproj or vLLM - No built-in RAG — tool calling can implement it +- No embeddings — use `EmbeddingGemma` (308M, separate model) for retrieval/semantic search diff --git a/CORPUS_ollama_variants.md b/CORPUS_ollama_variants.md index 1f7be26..a1d2d0a 100644 --- a/CORPUS_ollama_variants.md +++ b/CORPUS_ollama_variants.md @@ -7,7 +7,7 @@ | Tag | Params | Quant | Size on Disk | VRAM | Notes | |-----|--------|-------|-------------|------|-------| | `gemma4:e4b-it-q8_0` | ~8B total / 4B effective | Q8_0 | 11.6GB | ~12GB | Vision + audio capable. ~25 tok/s on V100 | -| `gemma4:26b` | 25.8B | Q4_K_M (default) | 18.0GB | ~18GB | Sweet spot for quality/speed. ~134 tok/s on 3090 Ti | +| `gemma4:26b` | 25.2B total / **3.8B active (MoE)** | Q4_K_M (default) | 18.0GB | ~18GB | Sweet spot for quality/speed. ~134 tok/s on 3090 Ti. **8 experts active of 128 + 1 shared** — runs at ~4B-speed, hence throughput. Q4_K_M inference is standard (Mixtral/DeepSeek ship same); the "MoE quality degrades at 4-bit" caveat is a **training-time** concern, not inference. See `tooling/huggingface/model-cards/gemma-4-26B-A4B-it-README.md` for the full card. | | `gemma4:31b-it-q4_K_M` | 31.3B | Q4_K_M | 19.9GB | ~24.5GB | Sharpest but 5x slower (~28 tok/s on 3090 Ti, memory pressure) | ## Capabilities by Variant (from `ollama show`) @@ -16,9 +16,10 @@ All variants support: - Text generation (completion, chat) - Vision (image input via base64 in `images` field) - Tool/function calling (native Ollama tool format) +- Thinking (configurable — `ollama show` lists it; Seth's finding is to leave it `false` for tool-use workloads) E-series (E2B, E4B) additionally support: -- Audio input (conformer encoder) +- Audio input (conformer encoder) — **but not via Ollama**; requires llama.cpp with the `mmproj-*-E*B-it-*.gguf` projector, or vLLM's `input_features_padded`. See `tooling/inference-frameworks/README.md`. ## GPU Coexistence (pve197 V100 32GB) diff --git a/CORPUS_tool_calling_format.md b/CORPUS_tool_calling_format.md index 349f235..f922db6 100644 --- a/CORPUS_tool_calling_format.md +++ b/CORPUS_tool_calling_format.md @@ -2,8 +2,29 @@ > Source: Google AI for Developers - Function Calling docs > https://ai.google.dev/gemma/docs/capabilities/text/function-calling-gemma4 +> Canonical source in corpus: `tooling/google-official/docs/ai-google-dev_function_calling_gemma4.html` +> Authoritative chat template: `tooling/huggingface/model-cards/gemma-4-{31B,E4B}-it-chat_template.jinja` -## Special Tokens (6 total) +## Chat Template Context (what surrounds the tool tokens) + +Gemma 4 changed the turn-token syntax from Gemma 3. You won't usually write these by +hand — Ollama, llama.cpp `--jinja`, and HF `apply_chat_template` all handle it — but +know what's on the wire when debugging: + +| Purpose | Gemma 3 | Gemma 4 | +|---------|---------|---------| +| Turn start | `role\n` | `<\|turn>role\n` | +| Turn end | `\n` | `\n` | +| Thinking | (not standardized) | `<\|think>...` | +| Thought channel | (n/a) | `<\|channel>thought...` | +| Image inline | `` | `<\|image>...` | +| Audio inline | (n/a) | `<\|audio>...` | +| String delimiter in native format | (n/a) | `<\|"\|>` | + +**Asymmetric brackets are intentional.** Opening is `<|token>`, closing is ``. +If you see `<|turn>...` in a code sample, that's wrong. + +## Tool Special Tokens (6 total) | Token | Purpose | |-------|---------| @@ -98,3 +119,24 @@ This is what you actually use in practice. Ollama translates to/from native toke - llama.cpp: format mismatches and continuous loops reported - LM Studio: compatibility issues with tool calling - **Workaround:** Use non-streaming mode for tool calls (proven in Simon) + +## HF `transformers` Alternative (not needed if using Ollama) + +If you ever route through HF `transformers` (v5.5.4+) instead of Ollama, there's a +cleaner parser than hand-rolled regex: + +```python +inputs = processor.apply_chat_template( + messages, tools=TOOLS, enable_thinking=True, + add_generation_prompt=True, tokenize=True, + return_dict=True, return_tensors="pt" +) +out = model.generate(**inputs) +parsed = processor.parse_response(processor.decode(out[0])) +# -> {"thinking": "...", "content": "...", "tool_calls": [...]} +``` + +`parse_response` uses `response_schema` + `x-regex` fields baked into +`tokenizer_config.json` (downloaded at `tooling/huggingface/model-cards/`). For +Ollama users this is informational — Ollama's server-side tool parser already does +the equivalent and returns structured `tool_calls` in the chat response. diff --git a/GOTCHAS.md b/GOTCHAS.md index 7ec7fe7..e6bf46c 100644 --- a/GOTCHAS.md +++ b/GOTCHAS.md @@ -168,6 +168,21 @@ Gemma 4 can generate `` or `` tokens in an infinite loop on Vu **Source:** [ggml-org/llama.cpp#21516](https://github.com/ggml-org/llama.cpp/issues/21516) +## MEDIUM: `google/gemma_pytorch` Abandoned for Gemma 4 + +**Severity: MEDIUM — wastes time on a dead-end path** + +The `google/gemma_pytorch` repo (last push 2025-05-30) has zero Gemma 4 support — +its variants validator only accepts Gemma 1/2/3 IDs. Anyone pointing at it as "the +official PyTorch reference" for Gemma 4 is wrong. + +**Use instead:** +- **Inference:** `huggingface/transformers` (`AutoModelForMultimodalLM`, v5.5.4+) +- **Reference impl:** `google-deepmind/gemma` (JAX/Flax) +- **Serving:** Ollama / vLLM / llama.cpp + +See `tooling/google-official/gemma-pytorch/README.md` for the original repo state. + ## LOW: Fine-Tuning Ecosystem Issues **Severity: LOW — only relevant if fine-tuning** @@ -177,6 +192,20 @@ Day-one issues for fine-tuners: - PEFT couldn't handle Gemma4ClippableLinear (new vision encoder layer type) - New `mm_token_type_ids` field required during training even for text-only data - E2B/E4B show training loss of 13-15, which is normal for multimodal models (not a bug) +- **Flash Attention 2/4 incompatible:** Gemma 4's global-attention head_dim is 512; + FA2 max is 256, FA4 max is 128. Training backends fall back to SDP or Flex Attention + (Axolotl hard-codes `sdp_attention: true` for Gemma 4). Does not affect inference + runtimes that already use SDP (Ollama, vLLM). +- **Fused LoRA kernels broken** (shared-KV layers). Axolotl disables + `lora_mlp_kernel` / `qkv_kernel` / `o_kernel` for Gemma 4; Unsloth routes around it. +- **26B A4B MoE wants ≥8-bit LoRA**, not 4-bit QLoRA — MoE expert quality degrades + at 4-bit during training. Axolotl's ScatterMoE + expert-LoRA config is the only + validated 4-bit MoE path. (This caveat is **training-only**; Q4_K_M inference is fine.) +- **New tool-call / channel tokens are learned embeddings** — if fine-tuning, set + `modules_to_save=["lm_head","embed_tokens"]` + `ensure_weight_tying=True` in + `LoraConfig`, or the adapter trains against frozen random vectors for them. + +See `tooling/fine-tuning/recipe-recommendation.md` for the full training path. ## LOW: Vision Validator Overrejects diff --git a/SYNTHESIS.md b/SYNTHESIS.md index e9fc368..cd6b5f0 100644 --- a/SYNTHESIS.md +++ b/SYNTHESIS.md @@ -7,6 +7,12 @@ Gemma 4 is an ultra-compliant, highly-capable model that doesn't know who it is. It doesn't need hand-holding on tasks but needs explicit instructions in the system prompt about identity, boundaries, and output format. It needs `num_predict` increased (Ollama defaults are absurdly low), `think` set to false (thinking eats the context budget), and `format: json` avoided entirely (causes infinite loops). Due to its fast speed and free local inference, sequential tool calls are the ideal solution to tasks that would otherwise require long structured output. +> **For canonical upstream source (model cards, chat templates, serving commands, +> fine-tuning recipes, specialized siblings like EmbeddingGemma/ShieldGemma): see +> `tooling/README.md`.** That directory is 147 files / 14 MB of first-party material +> pulled from Google / Hugging Face / framework maintainers. This SYNTHESIS is the +> opinionated digest; `tooling/` is the receipts. + ## Mental Model Think of Gemma 4 as a very competent employee on their first day. They can do the work — you don't need to explain how. But you DO need to explain: @@ -165,10 +171,11 @@ Vision is on ALL Gemma 4 variants (E2B, E4B, 26B, 31B). Audio is E-series only. | Use Case | Recommended | Why | |----------|------------|-----| -| Production pipeline (needs GPU coexistence) | `gemma4:26b` | Best quality/speed/VRAM balance | -| On-device / edge | `gemma4:e4b-it-q8_0` | 12GB VRAM, vision+audio | -| Maximum quality (single-model GPU) | `gemma4:31b-it-q4_K_M` | Sharpest but slow under memory pressure | +| Production pipeline (needs GPU coexistence) | `gemma4:26b` | MoE (3.8B active), fast, good quality/VRAM balance | +| On-device / edge | `gemma4:e4b-it-q8_0` | 12GB VRAM, vision+audio (audio via llama.cpp only) | +| Maximum quality (single-model GPU) | `gemma4:31b-it-q4_K_M` | Dense 31B, sharpest but 5x slower, more VRAM pressure | | Rapid prototyping / testing | `gemma4:26b` | Fast enough for interactive dev | +| Retrieval / embeddings | `embeddinggemma` (308M, separate model) | Gemma 4 has no embedding mode; use the sibling | ## Anti-Patterns diff --git a/tooling/README.md b/tooling/README.md index 54bb296..6686d27 100644 --- a/tooling/README.md +++ b/tooling/README.md @@ -16,27 +16,36 @@ Actual scripts, notebooks, model cards, and configs downloaded from Google, Hugg ## Findings that update / contradict the existing corpus -These are real gaps worth patching into `SYNTHESIS.md`, `GOTCHAS.md`, or `CORPUS_tool_calling_format.md`. Flagged here, not applied — the user asked for research, not a rewrite. +These were merged into the top-level corpus docs on 2026-04-18 — each finding below +is marked **[merged: file]** where it landed, or **[flagged]** if it's informational +only. Scan here for provenance; read the CORPUS / SYNTHESIS / GOTCHAS files for the +authoritative working text. -1. **Prompt-token format changed in Gemma 4.** Gemma 1/2/3 used `user ... `. Gemma 4 uses asymmetric pipe-brackets: `<|turn>user\n ... `. Also new: `<|think|>`, `<|channel>thought...`, `<|tool>`, `<|tool_call>`, `<|tool_response>` (+ inverses), `<|image>`, `<|audio>`, and string delimiter `<|"|>`. The existing `CORPUS_tool_calling_format.md` documents the tool tokens but doesn't reflect the turn-token change or the thinking/channel tokens. Canonical source: `huggingface/model-cards/gemma-4-31B-it-chat_template.jinja` and `google-official/docs/ai-google-dev_prompt_formatting_gemma4.html`. +1. **Prompt-token format changed in Gemma 4.** Gemma 1/2/3 used `user ... `. Gemma 4 uses asymmetric pipe-brackets: `<|turn>user\n ... `. Also new: `<|think|>`, `<|channel>thought...`, `<|tool>`, `<|tool_call>`, `<|tool_response>` (+ inverses), `<|image>`, `<|audio>`, and string delimiter `<|"|>`. Canonical source: `huggingface/model-cards/gemma-4-31B-it-chat_template.jinja` and `google-official/docs/ai-google-dev_prompt_formatting_gemma4.html`. **[merged: CORPUS_tool_calling_format.md — added Chat Template Context section]** -2. **`google/gemma_pytorch` is abandoned for Gemma 4.** Last push 2025-05-30; the variants validator rejects Gemma 4 IDs. Anyone pointing at it as the PyTorch reference is wrong — use HF `transformers` or `google-deepmind/gemma` (JAX/Flax) instead. +2. **`google/gemma_pytorch` is abandoned for Gemma 4.** Last push 2025-05-30; the variants validator rejects Gemma 4 IDs. Use HF `transformers` or `google-deepmind/gemma` (JAX/Flax) instead. **[merged: GOTCHAS.md — MEDIUM severity section]** -3. **`gemma.cpp` ships a Gemini-API-compatible local HTTP server** (`gemma_api_server`, endpoint `POST /v1beta/models/:generateContent`, SSE streaming). This is a Google-authored alternative to Ollama that speaks the real Gemini REST API — possibly the single most interesting discovery in this research pass. See `google-official/gemma-cpp/API_SERVER_README.md`. +3. **`gemma.cpp` ships a Gemini-API-compatible local HTTP server** (`gemma_api_server`, endpoint `POST /v1beta/models/:generateContent`, SSE streaming). Google-authored alternative to Ollama that speaks the real Gemini REST API. See `google-official/gemma-cpp/API_SERVER_README.md`. **[flagged — not merged; no current homelab use case, but worth knowing it exists]** -4. **Transformers exposes `AutoModelForMultimodalLM` (new AutoClass)** — not `AutoModelForCausalLM`. It also exposes `processor.parse_response(..., response_schema=...)` driven from `tokenizer_config.json`, which replaces the hand-rolled regex in the current `CORPUS_tool_calling_format.md`. Pin: `transformers>=5.5.4`. +4. **Transformers exposes `AutoModelForMultimodalLM` (new AutoClass)** — not `AutoModelForCausalLM`. It also exposes `processor.parse_response(..., response_schema=...)` driven from `tokenizer_config.json`. Pin: `transformers>=5.5.4`. **[merged: CORPUS_tool_calling_format.md — HF transformers Alternative section]** -5. **Gemma 4 breaks Flash Attention.** FA2's max head_dim is 256, FA4's is 128, and Gemma 4's global head_dim is 512. Use SDP or Flex Attention. Axolotl hard-codes `sdp_attention: true` for Gemma 4. This belongs in `GOTCHAS.md`. +5. **Gemma 4 breaks Flash Attention** (training only). FA2's max head_dim is 256, FA4's is 128, and Gemma 4's global head_dim is 512. Use SDP or Flex Attention. Does not affect Ollama / vLLM inference which already use SDP. **[merged: GOTCHAS.md — under LOW: Fine-Tuning Ecosystem Issues]** -6. **The 26B variant is a MoE** — `gemma-4-26B-A4B` (A4B = 4B active per token). Quantization rules differ: Unsloth says use 16-bit LoRA, not 4-bit QLoRA, for acceptable quality. Axolotl's ScatterMoE + expert-LoRA config is the only tool validated for 4-bit MoE training. Worth a line in `CORPUS_ollama_variants.md`. +6. **The 26B variant is a MoE** — `gemma-4-26B-A4B`, 25.2B total / 3.8B active, 8 experts of 128 + 1 shared. Q4_K_M inference is fine (standard for MoE — Mixtral/DeepSeek ship same). The "MoE quality degrades at 4-bit" concern is training-time only. **[merged: CORPUS_ollama_variants.md — annotated 26b row; GOTCHAS.md — training caveat in fine-tuning section]** -7. **No Gemma 4 technical report PDF exists yet** as of 2026-04-18. DeepMind repo says "Gemma 4 (Coming soon)". Gemma 3 report (downloaded at `google-official/tech-report/Gemma3Report.pdf`) remains the closest authoritative family citation. +7. **No Gemma 4 technical report PDF exists yet** as of 2026-04-18. DeepMind repo says "Gemma 4 (Coming soon)". Gemma 3 report is at `google-official/tech-report/Gemma3Report.pdf`. **[flagged — nothing to merge; check back mid-2026]** -8. **No `google/gemma-4-*` specialized siblings yet** — ShieldGemma, CodeGemma, PaliGemma, MedGemma, DataGemma are all still on Gemma 2 or 3 base. Historical lag is 3–6 months; expect siblings-on-4 mid-to-late 2026. +8. **No Gemma-4-generation specialized siblings yet.** ShieldGemma 2 is Gemma 3-based, CodeGemma on Gemma 2, PaliGemma 2 on Gemma 2, EmbeddingGemma on Gemma 3, etc. All still usable — just don't confuse the sibling generation with the base-model generation. Historical lag is 3–6 months; expect siblings-on-4 mid-to-late 2026. **[merged: CORPUS_capabilities.md — "What Gemma 4 Does NOT Do" now points at EmbeddingGemma for retrieval; full catalog in `gemma-family/index.md`]** -9. **No Gemma-4-specific TRL script in `huggingface/trl` yet.** HF blog says "fully supported," but the SFT/DPO/GRPO examples are still on Gemma 3 model IDs. Drop-in with `model_id` swap works. Only Gemma-4-dedicated TRL example today is `huggingface-gemma-recipes/carla_vlm_gemma.py` (VLM GRPO). +9. **No Gemma-4-specific TRL script in `huggingface/trl` yet.** HF blog says "fully supported," but the SFT/DPO/GRPO examples are still on Gemma 3 model IDs. Drop-in with `model_id` swap works. Only Gemma-4-dedicated TRL example today is `huggingface-gemma-recipes/carla_vlm_gemma.py` (VLM GRPO). **[flagged — only relevant if fine-tuning]** -10. **HF Spaces `app.py` files are the shortest Gemma 4 inference examples** — Google and HF both use them as ref. See `huggingface/spaces/huggingface-projects_gemma-4-{31b,e4b}-it-app.py`. +10. **HF Spaces `app.py` files are the shortest Gemma 4 inference examples** — Google and HF both use them as ref. See `huggingface/spaces/huggingface-projects_gemma-4-{31b,e4b}-it-app.py`. **[flagged — reference material]** + +11. **Native object detection with bbox output.** Prompt `"Detect the X in this image"` → structured `{box_2d: [ymin, xmin, ymax, xmax]}` in 1000×1000-normalized coords. First-class Gemma 4 capability, no separate detection model needed. **[merged: CORPUS_capabilities.md — Native Object Detection section]** + +12. **Native `system` role support.** New in Gemma 4 — Gemma 3 prepended system as a user turn. Matters if you were hand-building the prompt string; invisible if you use Ollama `system` or HF `apply_chat_template`. **[merged: CORPUS_capabilities.md — Text section]** + +13. **Audio input is E-series only AND not via Ollama.** Requires llama.cpp's `mmproj-*-E*B-it-*.gguf` projector or vLLM's `input_features_padded`. **[merged: CORPUS_ollama_variants.md and CORPUS_capabilities.md]** ## Immediate homelab plug-ins (from the gemma-family research)