docs: merge tooling findings into SYNTHESIS/GOTCHAS/CORPUS_* and add handoff

Patches the top-level corpus docs with the 13 findings flagged during the
2026-04-18 canonical tooling research pass. tooling/README.md now marks each
finding [merged: <file>] or [flagged] for provenance.

- CORPUS_ollama_variants.md: annotate gemma4:26b as MoE (25.2B total / 3.8B
  active, 8-of-128 experts + 1 shared). Note Q4_K_M inference is standard
  (the "MoE quality degrades at 4-bit" caveat is training-only). Add note
  that audio on E-series is NOT available via Ollama — llama.cpp mmproj
  or vLLM only.
- CORPUS_capabilities.md: native system role, configurable thinking mode,
  first trained tool use (vs Gemma 1/2/3 proof-of-concept), native object
  detection with bbox output in 1000x1000 coords, pointer to EmbeddingGemma
  for retrieval (Gemma 4 has no embedding mode).
- CORPUS_tool_calling_format.md: add Chat Template Context section
  documenting the <|turn>/<turn|> asymmetric brackets (new in Gemma 4,
  replaced <start_of_turn>/<end_of_turn>) plus <|think>, <|channel>,
  <|image>, <|audio> tokens. Add HF transformers Alternative section
  showing processor.parse_response with response_schema.
- GOTCHAS.md: add MEDIUM gotcha for abandoned google/gemma_pytorch (no
  Gemma 4 support since 2025-05-30). Expand fine-tuning section with FA2/FA4
  head_dim=512 break, fused LoRA kernel issues, 26B A4B training-quant
  guidance, new tool-call tokens as learned embeddings.
- SYNTHESIS.md: add banner pointing to tooling/ for canonical upstream
  material. Add embeddinggemma row to Model Selection table.

Also:
- Add .gitignore excluding .backup/ (local scratch per global CLAUDE.md
  convention, not needed in tracked history) and __pycache__/.
- Add .claude/handoffs/2026-04-18-canonical-tooling-research.md so future
  sessions can pick up cold — facts verified, open threads, what changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mortdecai
2026-04-18 12:48:26 -04:00
parent eecebe7ef5
commit 5775978899
8 changed files with 197 additions and 20 deletions
@@ -0,0 +1,68 @@
# Handoff — 2026-04-18: Canonical Tooling Research
## TL;DR for the next session
A parallel research pass pulled 147 files / 14 MB of first-party Gemma 4 tooling into `tooling/`. Of the 13 findings recorded in `tooling/README.md`, the 9 that contradicted or extended the existing corpus were merged into the top-level `SYNTHESIS.md` / `GOTCHAS.md` / `CORPUS_*.md` docs; the other 4 are flagged there as informational. The repo is in a clean, coherent state.
**If you're opening this repo for Gemma 4 implementation work, `SYNTHESIS.md` is still the right first read.** The new `tooling/README.md` is the receipts layer — read it when you need authoritative source material (model cards, chat templates, serving commands, sibling-model briefs).
## What shipped
**Commit `eecebe7` (master, pushed to `git.sethpc.xyz/Seth/gemma4-research`):** added `tooling/` with five subdirs — `google-official/`, `huggingface/`, `inference-frameworks/`, `gemma-family/`, `fine-tuning/`. Each subdir has its own indexing README.
**Follow-up commit (same session):** patched top-level corpus docs with the 9 findings worth merging. The `tooling/README.md` "Findings" list now marks each one `[merged: <file>]` or `[flagged]` for provenance.
## Key confirmed facts
| Claim | Verified against |
|-------|-----------------|
| `gemma4:26b` is a MoE (25.2B total, 3.8B active, 8 of 128 experts + 1 shared) | HF model card at `tooling/huggingface/model-cards/gemma-4-26B-A4B-it-README.md` |
| Q4_K_M inference on the MoE is fine (standard practice) | Mixtral/DeepSeek precedent; card neutral on inference quant |
| Gemma 4 changed turn tokens from `<start_of_turn>` to `<|turn>`/`<turn|>` | `tooling/huggingface/model-cards/gemma-4-*-chat_template.jinja` |
| Tool use is **trained** in Gemma 4, not a proof-of-concept as in Gemma 1/2/3 | DeepMind tool-use colab at `tooling/google-official/deepmind-gemma/colab_tool_use.ipynb` |
| `google/gemma_pytorch` is abandoned for Gemma 4 | Last push 2025-05-30; its variants validator accepts only Gemma 1/2/3 IDs |
| No Gemma 4 technical report PDF as of 2026-04-18 | DeepMind repo README + direct URL probes |
| No specialized siblings on Gemma 4 base yet (ShieldGemma 2, CodeGemma, PaliGemma 2, EmbeddingGemma all still on Gemma 2/3) | Per-sibling model cards in `tooling/gemma-family/` |
## Open threads — flagged but not implemented
These came out of the mort-bot impact review later in the session. All three are high-value but out of scope for this research pass:
1. **EmbeddingGemma (308M) as a drop-in upgrade for mort-bot's `chat_search` / `memory_read` tools.** Mort currently uses FTS5 keyword-only — misses semantic matches. EmbeddingGemma's Matryoshka sizes (768/512/256/128) + 100+ languages make it a clean fit. Integration sketch in the session conversation; full research at `tooling/gemma-family/embeddinggemma.md`. Starter notebook at `tooling/google-official/cookbook/tutorials_RAG_EmbeddingGemma.ipynb`. **Next steps:** (a) `ollama pull embeddinggemma` on steel141, (b) A/B against existing `nomic-embed-text` on actual mort chat logs before committing to backfill.
2. **ShieldGemma 2 (4B) as a `generate_image` pre-filter for mort-bot.** Mort's SDXL tool has no safety gate. ShieldGemma 2 is Gemma-3-based but scoped exactly to image safety. Would run on steel141 alongside `gemma4:26b` (3090 has headroom).
3. **Native object detection for mort's `vision_describe`.** Gemma 4 does grounded bbox output natively — "Detect the X" → `{box_2d: [ymin, xmin, ymax, xmax]}` in 1000×1000 coords. Mort currently only does free-form vision description.
None of these were implemented in this session.
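For open thread 1, the Matryoshka sizes matter mostly for the A/B test: a Matryoshka-style embedding can be shortened by keeping its leading dimensions and re-normalizing, so a single 768-dim vector can be scored at 256 or 128 dims without re-embedding the chat logs. The sketch below shows that idea only; the function names and the toy vector are illustrative assumptions, not mort-bot code or an EmbeddingGemma API.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and L2-normalize the result."""
    if dim > len(vec):
        raise ValueError(f"cannot truncate {len(vec)}-dim vector to {dim}")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # all-zero prefix: nothing to normalize
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


# Toy example: a 3-dim vector cut down to 2 dims.
short = truncate_embedding([3.0, 4.0, 12.0], 2)
```

Storing the full-size vector and truncating at query time keeps the backfill decision reversible: the smaller sizes can be evaluated against `nomic-embed-text` first, and the index only rebuilt once a size is chosen.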
## Files changed this session
- **New:** `tooling/` (147 files), `tooling/README.md`, `.claude/handoffs/2026-04-18-canonical-tooling-research.md` (this file)
- **Edited:** `README.md` (added `tooling/` row), `SYNTHESIS.md` (banner + model-selection table), `GOTCHAS.md` (added gemma_pytorch abandonment + expanded fine-tuning section), `CORPUS_tool_calling_format.md` (added Chat Template Context + HF transformers Alternative), `CORPUS_ollama_variants.md` (annotated 26b as MoE + audio note), `CORPUS_capabilities.md` (native system role, thinking, object detection, embedding pointer)
- **Unchanged:** `IMPLEMENTATIONS.md` (Simon/AI_Visualizer specific, not affected), `CORPUS_architecture.md` (already had MoE details right), `CORPUS_benchmarks.md` (still current)
## What future sessions should know
- **The research is the receipts, not the source of truth.** The top-level `SYNTHESIS.md` / `GOTCHAS.md` / `CORPUS_*.md` docs are the working reference. `tooling/` backs them up with downloaded upstream material when you need provenance or a working script.
- **Don't re-research the same ground.** Every `tooling/*/README.md` lists what's there and the source URL. Grep the tooling corpus before spawning new web searches.
- **The 26B-is-MoE and Q4_K_M-is-fine facts were the main things that would have been re-litigated without this handoff.** If you see a claim that conflicts with those, check the HF model card first (`tooling/huggingface/model-cards/gemma-4-26B-A4B-it-README.md`) — Google's own documentation, not secondhand.
- **Sibling-model generation lag.** When reaching for ShieldGemma / CodeGemma / PaliGemma / EmbeddingGemma, don't assume a Gemma-4 base — they're still on 2 or 3. Use them anyway; just don't confuse generations.
- **Mort-bot is where the low-hanging fruit is** if Seth wants a next practical project. Three items above; EmbeddingGemma is the biggest lever.
## Session narrative (for context, not action)
1. Started with the existing corpus (SYNTHESIS + GOTCHAS + 5 CORPUS files, ~22KB total). Goal: add canonical upstream tooling.
2. Dispatched five parallel `general-purpose` agents covering Google official, HF, inference frameworks, Gemma family, fine-tuning.
3. All five returned clean — 147 files downloaded, each indexed per subdir.
4. Wrote `tooling/README.md` with 10 findings from the agents. Initial plan: flag only, don't touch the older corpus.
5. Seth asked how the findings affect mort-bot. Read mort's CLAUDE.md / DECISIONS.md / llm.py / config.py / tools.py. Ranked: EmbeddingGemma (high), ShieldGemma 2 (high), bbox detection (high), E-series audio (medium), everything else (low/none because Ollama hides transformers changes).
6. Seth ran `ollama show gemma4:26b`; output confirmed MoE (25.8B, Q4_K_M). Walked back the earlier "worth A/B testing" extrapolation — that was training guidance misapplied to inference. Q4_K_M on the MoE is fine.
7. Seth asked "did you update synthesis?" — no, I hadn't. He authorized the updates. Patched 5 top-level docs; updated `tooling/README.md` findings list to mark merged-vs-flagged.
8. Wrote this handoff.
## Don't do these things next session
- Don't commit the ipynb files with `--no-verify` unless you ask again — the secrets-hook false positives (base64 notebook outputs, example Ed25519 keys) are documented, but re-bypassing without asking would be scope creep. If you add more ipynb content, strip outputs with `jupyter nbconvert --ClearOutputPreprocessor.enabled=True --to notebook --inplace <notebook>.ipynb` first.
- Don't restructure the folder. It's organized fine: `README.md` → `SYNTHESIS.md` (primary) → specialized `CORPUS_*.md` / `GOTCHAS.md` / `IMPLEMENTATIONS.md` → `tooling/` (receipts). New material goes into one of those buckets, not a new top-level thing.
- Don't assume the Gemma 3 technical report covers Gemma 4. It's the closest thing we have but it predates Gemma 4.
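For the notebook-output point above, a stdlib-only fallback can be handy when `nbconvert` is not installed. This is a minimal sketch that assumes the notebooks use the nbformat 4 layout (a top-level `cells` list whose code cells carry `outputs` and `execution_count`); the helper names are hypothetical.

```python
import copy
import json


def strip_outputs(notebook):
    """Return a copy of an nbformat-4 notebook dict with code outputs cleared."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


def strip_file(path):
    """Rewrite the notebook at `path` in place without outputs."""
    with open(path, encoding="utf-8") as fh:
        notebook = json.load(fh)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(strip_outputs(notebook), fh, indent=1, ensure_ascii=False)
        fh.write("\n")
```

Clearing outputs removes the base64 blobs that trip the secrets hook, but example keys pasted into cell source would still be flagged, so this does not replace asking before a bypass.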
+12
@@ -0,0 +1,12 @@
# Local scratch / backups — per ~/.claude/CLAUDE.md, Claude keeps backups before
# editing any file. Useful locally; not useful in the tracked history.
.backup/
# Python
__pycache__/
*.pyc
.pytest_cache/
# Editor / OS
.DS_Store
*.swp
+12 -3
@@ -4,8 +4,9 @@
### Text (all variants) ### Text (all variants)
- Standard instruction-following, chat, completion - Standard instruction-following, chat, completion
- System prompt support (critical — see synthesis) - **Native `system` role support** (new in Gemma 4; Gemma 3 prepended system as user turn)
- 128K context window (training length) - **Configurable thinking mode** — `<|think|>` / `<|channel>` tokens in the chat template; Ollama `think: true/false` flag. Seth's finding (see GOTCHAS): keep `false` for tool-use workloads.
- 128K context window (E2B/E4B) / 256K (26B/31B) — training length
- 262K vocabulary - 262K vocabulary
### Vision (all variants) ### Vision (all variants)
@@ -38,6 +39,13 @@
- Simon: 6 genealogy tools, up to 12 sequential iterations - Simon: 6 genealogy tools, up to 12 sequential iterations
- Supports parallel tool calls in single response - Supports parallel tool calls in single response
- Weak at deeply nested JSON schemas -> prefer sequential calls - Weak at deeply nested JSON schemas -> prefer sequential calls
- **First Gemma generation with tool use as a trained capability.** Gemma 1/2/3 tool use was "proof-of-concept" (per the DeepMind tool_use colab). Gemma 4 has dedicated tool-call tokens and is trained on the pattern.
### Native Object Detection (all variants)
- **Prompt format:** "Detect the {object} in this image" → structured output `{box_2d: [ymin, xmin, ymax, xmax]}` in **1000×1000-normalized coordinates** (rescale to your actual image dims).
- Images auto-resized to multiples of 48 pixels by the processor.
- Useful for grounding, cropping, counting, or passing bboxes to downstream tools — no separate detection model required.
- Documented in the HF model card (`tooling/huggingface/model-cards/gemma-4-*.md`). Not tested by Seth yet.
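The rescaling step mentioned above is simple arithmetic, sketched here under the assumption that `box_2d` values lie on a 0 to 1000 grid as described. The function name and the choice to return `(left, top, right, bottom)` are illustrative, not part of any Gemma or Hugging Face API.

```python
def rescale_box(box_2d, image_width, image_height, grid=1000):
    """Map [ymin, xmin, ymax, xmax] on a grid x grid canvas to pixel coords.

    Returns (left, top, right, bottom), the order most crop APIs expect.
    """
    ymin, xmin, ymax, xmax = box_2d
    left = round(xmin * image_width / grid)
    top = round(ymin * image_height / grid)
    right = round(xmax * image_width / grid)
    bottom = round(ymax * image_height / grid)
    # Clamp so slightly out-of-range model output still yields a valid crop.
    left, right = max(0, min(left, image_width)), max(0, min(right, image_width))
    top, bottom = max(0, min(top, image_height)), max(0, min(bottom, image_height))
    return left, top, right, bottom


# A box reported on the 1000x1000 grid, applied to a 1920x1080 frame.
print(rescale_box([100, 200, 500, 600], 1920, 1080))  # (384, 108, 1152, 540)
```

Note the axis order: the model reports y before x, while most image libraries take x first, so swapping during the rescale avoids transposed crops.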
## Benchmark Context (vs Gemma 3) ## Benchmark Context (vs Gemma 3)
@@ -51,5 +59,6 @@
- No native code execution / sandboxing - No native code execution / sandboxing
- No web browsing or retrieval - No web browsing or retrieval
- Audio only on E-series (not the models most people run) - Audio only on E-series (not the models most people run) — and **not on Ollama**, requires llama.cpp mmproj or vLLM
- No built-in RAG — tool calling can implement it - No built-in RAG — tool calling can implement it
- No embeddings — use `EmbeddingGemma` (308M, separate model) for retrieval/semantic search
+3 -2
@@ -7,7 +7,7 @@
| Tag | Params | Quant | Size on Disk | VRAM | Notes | | Tag | Params | Quant | Size on Disk | VRAM | Notes |
|-----|--------|-------|-------------|------|-------| |-----|--------|-------|-------------|------|-------|
| `gemma4:e4b-it-q8_0` | ~8B total / 4B effective | Q8_0 | 11.6GB | ~12GB | Vision + audio capable. ~25 tok/s on V100 | | `gemma4:e4b-it-q8_0` | ~8B total / 4B effective | Q8_0 | 11.6GB | ~12GB | Vision + audio capable. ~25 tok/s on V100 |
| `gemma4:26b` | 25.8B | Q4_K_M (default) | 18.0GB | ~18GB | Sweet spot for quality/speed. ~134 tok/s on 3090 Ti | | `gemma4:26b` | 25.2B total / **3.8B active (MoE)** | Q4_K_M (default) | 18.0GB | ~18GB | Sweet spot for quality/speed. ~134 tok/s on 3090 Ti. **8 experts active of 128 + 1 shared** — runs at ~4B-speed, hence throughput. Q4_K_M inference is standard (Mixtral/DeepSeek ship same); the "MoE quality degrades at 4-bit" caveat is a **training-time** concern, not inference. See `tooling/huggingface/model-cards/gemma-4-26B-A4B-it-README.md` for the full card. |
| `gemma4:31b-it-q4_K_M` | 31.3B | Q4_K_M | 19.9GB | ~24.5GB | Sharpest but 5x slower (~28 tok/s on 3090 Ti, memory pressure) | | `gemma4:31b-it-q4_K_M` | 31.3B | Q4_K_M | 19.9GB | ~24.5GB | Sharpest but 5x slower (~28 tok/s on 3090 Ti, memory pressure) |
## Capabilities by Variant (from `ollama show`) ## Capabilities by Variant (from `ollama show`)
@@ -16,9 +16,10 @@ All variants support:
- Text generation (completion, chat) - Text generation (completion, chat)
- Vision (image input via base64 in `images` field) - Vision (image input via base64 in `images` field)
- Tool/function calling (native Ollama tool format) - Tool/function calling (native Ollama tool format)
- Thinking (configurable — `ollama show` lists it; Seth's finding is to leave it `false` for tool-use workloads)
E-series (E2B, E4B) additionally support: E-series (E2B, E4B) additionally support:
- Audio input (conformer encoder) - Audio input (conformer encoder), **but not via Ollama**; requires llama.cpp with the `mmproj-*-E*B-it-*.gguf` projector, or vLLM's `input_features_padded`. See `tooling/inference-frameworks/README.md`.
## GPU Coexistence (pve197 V100 32GB) ## GPU Coexistence (pve197 V100 32GB)
+43 -1
@@ -2,8 +2,29 @@
> Source: Google AI for Developers - Function Calling docs > Source: Google AI for Developers - Function Calling docs
> https://ai.google.dev/gemma/docs/capabilities/text/function-calling-gemma4 > https://ai.google.dev/gemma/docs/capabilities/text/function-calling-gemma4
> Canonical source in corpus: `tooling/google-official/docs/ai-google-dev_function_calling_gemma4.html`
> Authoritative chat template: `tooling/huggingface/model-cards/gemma-4-{31B,E4B}-it-chat_template.jinja`
## Special Tokens (6 total) ## Chat Template Context (what surrounds the tool tokens)
Gemma 4 changed the turn-token syntax from Gemma 3. You won't usually write these by
hand — Ollama, llama.cpp `--jinja`, and HF `apply_chat_template` all handle it — but
know what's on the wire when debugging:
| Purpose | Gemma 3 | Gemma 4 |
|---------|---------|---------|
| Turn start | `<start_of_turn>role\n` | `<\|turn>role\n` |
| Turn end | `<end_of_turn>\n` | `<turn\|>\n` |
| Thinking | (not standardized) | `<\|think>...<think\|>` |
| Thought channel | (n/a) | `<\|channel>thought...<channel\|>` |
| Image inline | `<start_of_image>` | `<\|image>...<image\|>` |
| Audio inline | (n/a) | `<\|audio>...<audio\|>` |
| String delimiter in native format | (n/a) | `<\|"\|>` |
**Asymmetric brackets are intentional.** Opening is `<|token>`, closing is `<token|>`.
If you see `<|turn>...</turn|>` in a code sample, that's wrong.
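When debugging raw prompt strings, a small lint pass can catch the bracket mistake described above. This is a rough sketch built only from the Gemma 4 column of the table; `render_turn` and `lint_prompt` are hypothetical helpers, and the exact whitespace around turns is defined by the chat template, not by this code.

```python
import re

OPEN_TURN = "<|turn>"
CLOSE_TURN = "<turn|>"
# XML-style closers such as "</turn|>" are the mistake called out above.
BAD_CLOSER = re.compile(r"</\w+\|>")


def render_turn(role, content):
    """Format one turn using the asymmetric Gemma 4 turn tokens."""
    return f"{OPEN_TURN}{role}\n{content}{CLOSE_TURN}\n"


def lint_prompt(text):
    """Return a list of human-readable problems found in a raw prompt string."""
    problems = [f"XML-style closer {m.group(0)!r}" for m in BAD_CLOSER.finditer(text)]
    opens, closes = text.count(OPEN_TURN), text.count(CLOSE_TURN)
    # One unclosed opener is normal: a generation prompt ends on an open turn.
    if opens - closes not in (0, 1):
        problems.append(f"{opens} turn openers vs {closes} turn closers")
    return problems
```

For real requests, the chat template applied by Ollama, llama.cpp `--jinja`, or HF `apply_chat_template` remains the authority; this is only for inspecting what ended up on the wire.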
## Tool Special Tokens (6 total)
| Token | Purpose | | Token | Purpose |
|-------|---------| |-------|---------|
@@ -98,3 +119,24 @@ This is what you actually use in practice. Ollama translates to/from native toke
- llama.cpp: format mismatches and continuous loops reported - llama.cpp: format mismatches and continuous loops reported
- LM Studio: compatibility issues with tool calling - LM Studio: compatibility issues with tool calling
- **Workaround:** Use non-streaming mode for tool calls (proven in Simon) - **Workaround:** Use non-streaming mode for tool calls (proven in Simon)
## HF `transformers` Alternative (not needed if using Ollama)
If you ever route through HF `transformers` (v5.5.4+) instead of Ollama, there's a
cleaner parser than hand-rolled regex:
```python
# Assumes `processor`, `model`, `messages`, and `TOOLS` are already defined.
inputs = processor.apply_chat_template(
    messages, tools=TOOLS, enable_thinking=True,
    add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
)
out = model.generate(**inputs)
parsed = processor.parse_response(processor.decode(out[0]))
# -> {"thinking": "...", "content": "...", "tool_calls": [...]}
```
`parse_response` uses `response_schema` + `x-regex` fields baked into
`tokenizer_config.json` (downloaded at `tooling/huggingface/model-cards/`). For
Ollama users this is informational — Ollama's server-side tool parser already does
the equivalent and returns structured `tool_calls` in the chat response.
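On the Ollama side, the equivalent flow is a non-streaming `/api/chat` request followed by reading `message.tool_calls` from the response. The sketch below builds the request body and extracts the calls from a response dict; it sends nothing over the network. The helper names and the `num_predict` default of 2048 are assumptions for illustration, and `think` is set to false as the corpus recommends for tool-use workloads.

```python
import json


def build_chat_request(model, messages, tools, num_predict=2048):
    """Build an Ollama /api/chat body for a non-streaming tool-call turn."""
    return {
        "model": model,
        "messages": messages,
        "tools": tools,
        "stream": False,  # non-streaming, per the workaround above
        "think": False,
        "options": {"num_predict": num_predict},
    }


def extract_tool_calls(response):
    """Return [(name, arguments_dict), ...] from an Ollama chat response dict."""
    calls = []
    for call in response.get("message", {}).get("tool_calls") or []:
        fn = call.get("function", {})
        args = fn.get("arguments", {})
        if isinstance(args, str):  # tolerate JSON-encoded arguments
            args = json.loads(args)
        calls.append((fn.get("name"), args))
    return calls
```

An empty list from `extract_tool_calls` means the model answered in plain text, so the caller can fall through to `message.content` instead of dispatching a tool.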
+29
@@ -168,6 +168,21 @@ Gemma 4 can generate `<unused>` or `<unused24>` tokens in an infinite loop on Vu
**Source:** [ggml-org/llama.cpp#21516](https://github.com/ggml-org/llama.cpp/issues/21516) **Source:** [ggml-org/llama.cpp#21516](https://github.com/ggml-org/llama.cpp/issues/21516)
## MEDIUM: `google/gemma_pytorch` Abandoned for Gemma 4
**Severity: MEDIUM — wastes time on a dead-end path**
The `google/gemma_pytorch` repo (last push 2025-05-30) has zero Gemma 4 support —
its variants validator only accepts Gemma 1/2/3 IDs. Anyone pointing at it as "the
official PyTorch reference" for Gemma 4 is wrong.
**Use instead:**
- **Inference:** `huggingface/transformers` (`AutoModelForMultimodalLM`, v5.5.4+)
- **Reference impl:** `google-deepmind/gemma` (JAX/Flax)
- **Serving:** Ollama / vLLM / llama.cpp
See `tooling/google-official/gemma-pytorch/README.md` for the original repo state.
## LOW: Fine-Tuning Ecosystem Issues ## LOW: Fine-Tuning Ecosystem Issues
**Severity: LOW — only relevant if fine-tuning** **Severity: LOW — only relevant if fine-tuning**
@@ -177,6 +192,20 @@ Day-one issues for fine-tuners:
- PEFT couldn't handle Gemma4ClippableLinear (new vision encoder layer type) - PEFT couldn't handle Gemma4ClippableLinear (new vision encoder layer type)
- New `mm_token_type_ids` field required during training even for text-only data - New `mm_token_type_ids` field required during training even for text-only data
- E2B/E4B show training loss of 13-15, which is normal for multimodal models (not a bug) - E2B/E4B show training loss of 13-15, which is normal for multimodal models (not a bug)
- **Flash Attention 2/4 incompatible:** Gemma 4's global-attention head_dim is 512;
FA2 max is 256, FA4 max is 128. Training backends fall back to SDP or Flex Attention
(Axolotl hard-codes `sdp_attention: true` for Gemma 4). Does not affect inference
runtimes that already use SDP (Ollama, vLLM).
- **Fused LoRA kernels broken** (shared-KV layers). Axolotl disables
`lora_mlp_kernel` / `qkv_kernel` / `o_kernel` for Gemma 4; Unsloth routes around it.
- **26B A4B MoE wants ≥8-bit LoRA**, not 4-bit QLoRA — MoE expert quality degrades
at 4-bit during training. Axolotl's ScatterMoE + expert-LoRA config is the only
validated 4-bit MoE path. (This caveat is **training-only**; Q4_K_M inference is fine.)
- **New tool-call / channel tokens are learned embeddings** — if fine-tuning, set
`modules_to_save=["lm_head","embed_tokens"]` + `ensure_weight_tying=True` in
`LoraConfig`, or the adapter trains against frozen random vectors for them.
See `tooling/fine-tuning/recipe-recommendation.md` for the full training path.
## LOW: Vision Validator Overrejects ## LOW: Vision Validator Overrejects
+10 -3
@@ -7,6 +7,12 @@
Gemma 4 is an ultra-compliant, highly-capable model that doesn't know who it is. It doesn't need hand-holding on tasks but needs explicit instructions in the system prompt about identity, boundaries, and output format. It needs `num_predict` increased (Ollama defaults are absurdly low), `think` set to false (thinking eats the context budget), and `format: json` avoided entirely (causes infinite loops). Due to its fast speed and free local inference, sequential tool calls are the ideal solution to tasks that would otherwise require long structured output. Gemma 4 is an ultra-compliant, highly-capable model that doesn't know who it is. It doesn't need hand-holding on tasks but needs explicit instructions in the system prompt about identity, boundaries, and output format. It needs `num_predict` increased (Ollama defaults are absurdly low), `think` set to false (thinking eats the context budget), and `format: json` avoided entirely (causes infinite loops). Due to its fast speed and free local inference, sequential tool calls are the ideal solution to tasks that would otherwise require long structured output.
> **For canonical upstream source (model cards, chat templates, serving commands,
> fine-tuning recipes, specialized siblings like EmbeddingGemma/ShieldGemma): see
> `tooling/README.md`.** That directory is 147 files / 14 MB of first-party material
> pulled from Google / Hugging Face / framework maintainers. This SYNTHESIS is the
> opinionated digest; `tooling/` is the receipts.
## Mental Model ## Mental Model
Think of Gemma 4 as a very competent employee on their first day. They can do the work — you don't need to explain how. But you DO need to explain: Think of Gemma 4 as a very competent employee on their first day. They can do the work — you don't need to explain how. But you DO need to explain:
@@ -165,10 +171,11 @@ Vision is on ALL Gemma 4 variants (E2B, E4B, 26B, 31B). Audio is E-series only.
| Use Case | Recommended | Why | | Use Case | Recommended | Why |
|----------|------------|-----| |----------|------------|-----|
| Production pipeline (needs GPU coexistence) | `gemma4:26b` | Best quality/speed/VRAM balance | | Production pipeline (needs GPU coexistence) | `gemma4:26b` | MoE (3.8B active), fast, good quality/VRAM balance |
| On-device / edge | `gemma4:e4b-it-q8_0` | 12GB VRAM, vision+audio | | On-device / edge | `gemma4:e4b-it-q8_0` | 12GB VRAM, vision+audio (audio via llama.cpp or vLLM, not Ollama) |
| Maximum quality (single-model GPU) | `gemma4:31b-it-q4_K_M` | Sharpest but slow under memory pressure | | Maximum quality (single-model GPU) | `gemma4:31b-it-q4_K_M` | Dense 31B, sharpest but 5x slower, more VRAM pressure |
| Rapid prototyping / testing | `gemma4:26b` | Fast enough for interactive dev | | Rapid prototyping / testing | `gemma4:26b` | Fast enough for interactive dev |
| Retrieval / embeddings | `embeddinggemma` (308M, separate model) | Gemma 4 has no embedding mode; use the sibling |
## Anti-Patterns ## Anti-Patterns
+20 -11
@@ -16,27 +16,36 @@ Actual scripts, notebooks, model cards, and configs downloaded from Google, Hugg
## Findings that update / contradict the existing corpus ## Findings that update / contradict the existing corpus
These are real gaps worth patching into `SYNTHESIS.md`, `GOTCHAS.md`, or `CORPUS_tool_calling_format.md`. Flagged here, not applied — the user asked for research, not a rewrite. These were merged into the top-level corpus docs on 2026-04-18 — each finding below
is marked **[merged: file]** where it landed, or **[flagged]** if it's informational
only. Scan here for provenance; read the CORPUS / SYNTHESIS / GOTCHAS files for the
authoritative working text.
1. **Prompt-token format changed in Gemma 4.** Gemma 1/2/3 used `<start_of_turn>user ... <end_of_turn>`. Gemma 4 uses asymmetric pipe-brackets: `<|turn>user\n ... <turn|>`. Also new: `<|think|>`, `<|channel>thought...<channel|>`, `<|tool>`, `<|tool_call>`, `<|tool_response>` (+ inverses), `<|image>`, `<|audio>`, and string delimiter `<|"|>`. The existing `CORPUS_tool_calling_format.md` documents the tool tokens but doesn't reflect the turn-token change or the thinking/channel tokens. Canonical source: `huggingface/model-cards/gemma-4-31B-it-chat_template.jinja` and `google-official/docs/ai-google-dev_prompt_formatting_gemma4.html`. 1. **Prompt-token format changed in Gemma 4.** Gemma 1/2/3 used `<start_of_turn>user ... <end_of_turn>`. Gemma 4 uses asymmetric pipe-brackets: `<|turn>user\n ... <turn|>`. Also new: `<|think|>`, `<|channel>thought...<channel|>`, `<|tool>`, `<|tool_call>`, `<|tool_response>` (+ inverses), `<|image>`, `<|audio>`, and string delimiter `<|"|>`. Canonical source: `huggingface/model-cards/gemma-4-31B-it-chat_template.jinja` and `google-official/docs/ai-google-dev_prompt_formatting_gemma4.html`. **[merged: CORPUS_tool_calling_format.md — added Chat Template Context section]**
2. **`google/gemma_pytorch` is abandoned for Gemma 4.** Last push 2025-05-30; the variants validator rejects Gemma 4 IDs. Anyone pointing at it as the PyTorch reference is wrong — use HF `transformers` or `google-deepmind/gemma` (JAX/Flax) instead. 2. **`google/gemma_pytorch` is abandoned for Gemma 4.** Last push 2025-05-30; the variants validator rejects Gemma 4 IDs. Use HF `transformers` or `google-deepmind/gemma` (JAX/Flax) instead. **[merged: GOTCHAS.md — MEDIUM severity section]**
3. **`gemma.cpp` ships a Gemini-API-compatible local HTTP server** (`gemma_api_server`, endpoint `POST /v1beta/models/<model>:generateContent`, SSE streaming). This is a Google-authored alternative to Ollama that speaks the real Gemini REST API — possibly the single most interesting discovery in this research pass. See `google-official/gemma-cpp/API_SERVER_README.md`. 3. **`gemma.cpp` ships a Gemini-API-compatible local HTTP server** (`gemma_api_server`, endpoint `POST /v1beta/models/<model>:generateContent`, SSE streaming). Google-authored alternative to Ollama that speaks the real Gemini REST API. See `google-official/gemma-cpp/API_SERVER_README.md`. **[flagged — not merged; no current homelab use case, but worth knowing it exists]**
4. **Transformers exposes `AutoModelForMultimodalLM` (new AutoClass)** — not `AutoModelForCausalLM`. It also exposes `processor.parse_response(..., response_schema=...)` driven from `tokenizer_config.json`, which replaces the hand-rolled regex in the current `CORPUS_tool_calling_format.md`. Pin: `transformers>=5.5.4`. 4. **Transformers exposes `AutoModelForMultimodalLM` (new AutoClass)** — not `AutoModelForCausalLM`. It also exposes `processor.parse_response(..., response_schema=...)` driven from `tokenizer_config.json`. Pin: `transformers>=5.5.4`. **[merged: CORPUS_tool_calling_format.md — HF transformers Alternative section]**
5. **Gemma 4 breaks Flash Attention.** FA2's max head_dim is 256, FA4's is 128, and Gemma 4's global head_dim is 512. Use SDP or Flex Attention. Axolotl hard-codes `sdp_attention: true` for Gemma 4. This belongs in `GOTCHAS.md`. 5. **Gemma 4 breaks Flash Attention** (training only). FA2's max head_dim is 256, FA4's is 128, and Gemma 4's global head_dim is 512. Use SDP or Flex Attention. Does not affect Ollama / vLLM inference which already use SDP. **[merged: GOTCHAS.md — under LOW: Fine-Tuning Ecosystem Issues]**
6. **The 26B variant is a MoE**: `gemma-4-26B-A4B` (A4B = 4B active per token). Quantization rules differ: Unsloth says use 16-bit LoRA, not 4-bit QLoRA, for acceptable quality. Axolotl's ScatterMoE + expert-LoRA config is the only tool validated for 4-bit MoE training. Worth a line in `CORPUS_ollama_variants.md`. 6. **The 26B variant is a MoE**: `gemma-4-26B-A4B`, 25.2B total / 3.8B active, 8 experts of 128 + 1 shared. Q4_K_M inference is fine (standard for MoE — Mixtral/DeepSeek ship same). The "MoE quality degrades at 4-bit" concern is training-time only. **[merged: CORPUS_ollama_variants.md — annotated 26b row; GOTCHAS.md — training caveat in fine-tuning section]**
7. **No Gemma 4 technical report PDF exists yet** as of 2026-04-18. DeepMind repo says "Gemma 4 (Coming soon)". Gemma 3 report (downloaded at `google-official/tech-report/Gemma3Report.pdf`) remains the closest authoritative family citation. 7. **No Gemma 4 technical report PDF exists yet** as of 2026-04-18. DeepMind repo says "Gemma 4 (Coming soon)". Gemma 3 report is at `google-official/tech-report/Gemma3Report.pdf`. **[flagged — nothing to merge; check back mid-2026]**
8. **No `google/gemma-4-*` specialized siblings yet**: ShieldGemma, CodeGemma, PaliGemma, MedGemma, DataGemma are all still on Gemma 2 or 3 base. Historical lag is 3-6 months; expect siblings-on-4 mid-to-late 2026. 8. **No Gemma-4-generation specialized siblings yet.** ShieldGemma 2 is Gemma 3-based, CodeGemma on Gemma 2, PaliGemma 2 on Gemma 2, EmbeddingGemma on Gemma 3, etc. All still usable — just don't confuse the sibling generation with the base-model generation. Historical lag is 3-6 months; expect siblings-on-4 mid-to-late 2026. **[merged: CORPUS_capabilities.md — "What Gemma 4 Does NOT Do" now points at EmbeddingGemma for retrieval; full catalog in `gemma-family/index.md`]**
9. **No Gemma-4-specific TRL script in `huggingface/trl` yet.** HF blog says "fully supported," but the SFT/DPO/GRPO examples are still on Gemma 3 model IDs. Drop-in with `model_id` swap works. Only Gemma-4-dedicated TRL example today is `huggingface-gemma-recipes/carla_vlm_gemma.py` (VLM GRPO). 9. **No Gemma-4-specific TRL script in `huggingface/trl` yet.** HF blog says "fully supported," but the SFT/DPO/GRPO examples are still on Gemma 3 model IDs. Drop-in with `model_id` swap works. Only Gemma-4-dedicated TRL example today is `huggingface-gemma-recipes/carla_vlm_gemma.py` (VLM GRPO). **[flagged — only relevant if fine-tuning]**
10. **HF Spaces `app.py` files are the shortest Gemma 4 inference examples** — Google and HF both use them as ref. See `huggingface/spaces/huggingface-projects_gemma-4-{31b,e4b}-it-app.py`. 10. **HF Spaces `app.py` files are the shortest Gemma 4 inference examples** — Google and HF both use them as ref. See `huggingface/spaces/huggingface-projects_gemma-4-{31b,e4b}-it-app.py`. **[flagged — reference material]**
11. **Native object detection with bbox output.** Prompt `"Detect the X in this image"` → structured `{box_2d: [ymin, xmin, ymax, xmax]}` in 1000×1000-normalized coords. First-class Gemma 4 capability, no separate detection model needed. **[merged: CORPUS_capabilities.md — Native Object Detection section]**
12. **Native `system` role support.** New in Gemma 4 — Gemma 3 prepended system as a user turn. Matters if you were hand-building the prompt string; invisible if you use Ollama `system` or HF `apply_chat_template`. **[merged: CORPUS_capabilities.md — Text section]**
13. **Audio input is E-series only AND not via Ollama.** Requires llama.cpp's `mmproj-*-E*B-it-*.gguf` projector or vLLM's `input_features_padded`. **[merged: CORPUS_ollama_variants.md and CORPUS_capabilities.md]**
## Immediate homelab plug-ins (from the gemma-family research) ## Immediate homelab plug-ins (from the gemma-family research)