docs: merge tooling findings into SYNTHESIS/GOTCHAS/CORPUS_* and add handoff

Patches the top-level corpus docs with the 13 findings flagged during the 2026-04-18 canonical tooling research pass. tooling/README.md now marks each finding [merged: <file>] or [flagged] for provenance.

- CORPUS_ollama_variants.md: annotate gemma4:26b as MoE (25.2B total / 3.8B active, 8-of-128 experts + 1 shared). Note Q4_K_M inference is standard (the "MoE quality degrades at 4-bit" caveat is training-only). Add note that audio on E-series is NOT available via Ollama; llama.cpp mmproj or vLLM only.
- CORPUS_capabilities.md: native system role, configurable thinking mode, first trained tool use (vs Gemma 1/2/3 proof-of-concept), native object detection with bbox output in 1000x1000 coords, pointer to EmbeddingGemma for retrieval (Gemma 4 has no embedding mode).
- CORPUS_tool_calling_format.md: add Chat Template Context section documenting the <|turn>/<turn|> asymmetric brackets (new in Gemma 4, replaced <start_of_turn>/<end_of_turn>) plus <|think>, <|channel>, <|image>, <|audio> tokens. Add HF transformers Alternative section showing processor.parse_response with response_schema.
- GOTCHAS.md: add MEDIUM gotcha for abandoned google/gemma_pytorch (no Gemma 4 support since 2025-05-30). Expand fine-tuning section with FA2/FA4 head_dim=512 break, fused LoRA kernel issues, 26B A4B training-quant guidance, new tool-call tokens as learned embeddings.
- SYNTHESIS.md: add banner pointing to tooling/ for canonical upstream material. Add embeddinggemma row to Model Selection table.

Also:
- Add .gitignore excluding .backup/ (local scratch per global CLAUDE.md convention, not needed in tracked history) and __pycache__/.
- Add .claude/handoffs/2026-04-18-canonical-tooling-research.md so future sessions can pick up cold: facts verified, open threads, what changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:48:26 -04:00
parent eecebe7ef5
commit 5775978899
8 changed files with 197 additions and 20 deletions
@@ -0,0 +1,68 @@
 # Handoff — 2026-04-18: Canonical Tooling Research
 ## TL;DR for the next session
 A parallel research pass pulled 147 files / 14 MB of first-party Gemma 4 tooling into `tooling/`. Of the 13 findings that contradicted or extended the existing corpus, 9 were merged into the top-level `SYNTHESIS.md` / `GOTCHAS.md` / `CORPUS_*.md` docs and 4 were flagged in `tooling/README.md` without merging. The repo is in a clean, coherent state.
 **If you're opening this repo for Gemma 4 implementation work, `SYNTHESIS.md` is still the right first read.** The new `tooling/README.md` is the receipts layer — read it when you need authoritative source material (model cards, chat templates, serving commands, sibling-model briefs).
 ## What shipped
 **Commit `eecebe7` (master, pushed to `git.sethpc.xyz/Seth/gemma4-research`):** added `tooling/` with five subdirs — `google-official/`, `huggingface/`, `inference-frameworks/`, `gemma-family/`, `fine-tuning/`. Each subdir has its own indexing README.
 **Follow-up commit (same session):** patched top-level corpus docs with the 9 findings worth merging (of 13 total; the other 4 are flagged only). The `tooling/README.md` "Findings" list now marks each one `[merged: <file>]` or `[flagged]` for provenance.
 ## Key confirmed facts
 | Claim | Verified against |
 |-------|-----------------|
 | `gemma4:26b` is a MoE (25.2B total, 3.8B active, 8 of 128 experts + 1 shared) | HF model card at `tooling/huggingface/model-cards/gemma-4-26B-A4B-it-README.md` |
 | Q4_K_M inference on the MoE is fine (standard practice) | Mixtral/DeepSeek precedent; card neutral on inference quant |
 | Gemma 4 changed turn tokens from `<start_of_turn>` to `<|turn>`/`<turn|>` | `tooling/huggingface/model-cards/gemma-4-*-chat_template.jinja` |
 | Tool use is **trained** in Gemma 4, not a proof-of-concept as in Gemma 1/2/3 | DeepMind tool-use colab at `tooling/google-official/deepmind-gemma/colab_tool_use.ipynb` |
 | `google/gemma_pytorch` is abandoned for Gemma 4 | Last push 2025-05-30, variants validator |
 | No Gemma 4 technical report PDF as of 2026-04-18 | DeepMind repo README + direct URL probes |
 | No specialized siblings on Gemma 4 base yet (ShieldGemma 2, CodeGemma, PaliGemma 2, EmbeddingGemma all still on Gemma 2/3) | Per-sibling model cards in `tooling/gemma-family/` |
 ## Open threads — flagged but not implemented
 These came out of the mort-bot impact review later in the session. All three are high-value but out of scope for this research pass:
 1. **EmbeddingGemma (308M) as a drop-in upgrade for mort-bot's `chat_search` / `memory_read` tools.** Mort currently uses FTS5 keyword-only — misses semantic matches. EmbeddingGemma's Matryoshka sizes (768/512/256/128) + 100+ languages make it a clean fit. Integration sketch in the session conversation; full research at `tooling/gemma-family/embeddinggemma.md`. Starter notebook at `tooling/google-official/cookbook/tutorials_RAG_EmbeddingGemma.ipynb`. **Next steps:** (a) `ollama pull embeddinggemma` on steel141, (b) A/B against existing `nomic-embed-text` on actual mort chat logs before committing to backfill.
 2. **ShieldGemma 2 (4B) as a `generate_image` pre-filter for mort-bot.** Mort's SDXL tool has no safety gate. ShieldGemma 2 is Gemma-3-based but scoped exactly to image safety. Would run on steel141 alongside `gemma4:26b` (3090 has headroom).
 3. **Native object detection for mort's `vision_describe`.** Gemma 4 does grounded bbox output natively — "Detect the X" → `{box_2d: [ymin, xmin, ymax, xmax]}` in 1000×1000 coords. Mort currently only does free-form vision description.
 None of these were implemented in this session.
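
As a rough illustration of why the Matryoshka sizes in item 1 matter for the A/B test, the sketch below truncates a full-size embedding to a shorter prefix and L2-renormalizes it before comparing. It assumes the usual Matryoshka convention (keep the leading components, then renormalize); `truncate_embedding` and `cosine` are hypothetical helpers, and the vectors are synthetic stand-ins, not real EmbeddingGemma output.

```python
import math

# Matryoshka sizes listed for EmbeddingGemma in this handoff.
MATRYOSHKA_DIMS = (768, 512, 256, 128)


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and L2-renormalize (hypothetical helper)."""
    if dim not in MATRYOSHKA_DIMS:
        raise ValueError(f"unsupported Matryoshka size: {dim}")
    if len(vec) < dim:
        raise ValueError("embedding is shorter than the requested size")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity of two unit-length vectors of equal size."""
    return sum(x * y for x, y in zip(a, b))


# Synthetic stand-ins; real vectors would come from the embedding model.
doc = [math.sin(i * 0.01) for i in range(768)]
query = [math.sin(i * 0.01 + 0.05) for i in range(768)]

for dim in MATRYOSHKA_DIMS:
    score = cosine(truncate_embedding(doc, dim), truncate_embedding(query, dim))
    print(dim, round(score, 4))
```

Running the same comparison at each size on actual mort chat logs would show how much recall is lost by storing the smaller vectors before committing to a backfill.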
 ## Files changed this session
 - **New:** `tooling/` (147 files), `tooling/README.md`, `.claude/handoffs/2026-04-18-canonical-tooling-research.md` (this file)
 - **Edited:** `README.md` (added `tooling/` row), `SYNTHESIS.md` (banner + model-selection table), `GOTCHAS.md` (added gemma_pytorch abandonment + expanded fine-tuning section), `CORPUS_tool_calling_format.md` (added Chat Template Context + HF transformers Alternative), `CORPUS_ollama_variants.md` (annotated 26b as MoE + audio note), `CORPUS_capabilities.md` (native system role, thinking, object detection, embedding pointer)
 - **Unchanged:** `IMPLEMENTATIONS.md` (Simon/AI_Visualizer specific, not affected), `CORPUS_architecture.md` (already had MoE details right), `CORPUS_benchmarks.md` (still current)
 ## What future sessions should know
 - **The research is the receipts, not the source of truth.** The top-level `SYNTHESIS.md` / `GOTCHAS.md` / `CORPUS_*.md` docs are the working reference. `tooling/` backs them up with downloaded upstream material when you need provenance or a working script.
 - **Don't re-research the same ground.** Every `tooling/*/README.md` lists what's there and the source URL. Grep the tooling corpus before spawning new web searches.
 - **The 26B-is-MoE and Q4_K_M-is-fine facts were the main things that would have been re-litigated without this handoff.** If you see a claim that conflicts with those, check the HF model card first (`tooling/huggingface/model-cards/gemma-4-26B-A4B-it-README.md`) — Google's own documentation, not secondhand.
 - **Sibling-model generation lag.** When reaching for ShieldGemma / CodeGemma / PaliGemma / EmbeddingGemma, don't assume a Gemma-4 base — they're still on 2 or 3. Use them anyway; just don't confuse generations.
 - **Mort-bot is where the low-hanging fruit is** if Seth wants a next practical project. Three items above; EmbeddingGemma is the biggest lever.
 ## Session narrative (for context, not action)
 1. Started with the existing corpus (SYNTHESIS + GOTCHAS + 5 CORPUS files, ~22KB total). Goal: add canonical upstream tooling.
 2. Dispatched five parallel `general-purpose` agents covering Google official, HF, inference frameworks, Gemma family, fine-tuning.
 3. All five returned clean — 147 files downloaded, each indexed per subdir.
 4. Wrote `tooling/README.md` with 10 findings from the agents. Initial plan: flag only, don't touch the older corpus.
 5. Seth asked how the findings affect mort-bot. Read mort's CLAUDE.md / DECISIONS.md / llm.py / config.py / tools.py. Ranked: EmbeddingGemma (high), ShieldGemma 2 (high), bbox detection (high), E-series audio (medium), everything else (low/none because Ollama hides transformers changes).
 6. Seth ran `ollama show gemma4:26b`; output confirmed MoE (25.8B, Q4_K_M). Walked back the earlier "worth A/B testing" extrapolation — that was training guidance misapplied to inference. Q4_K_M on the MoE is fine.
 7. Seth asked "did you update synthesis?" — no, I hadn't. He authorized the updates. Patched 5 top-level docs; updated `tooling/README.md` findings list to mark merged-vs-flagged.
 8. Wrote this handoff.
 ## Don't do these things next session
 - Don't commit the ipynb files with `--no-verify` without asking Seth again. The secrets-hook false positives (base64 notebook outputs, example Ed25519 keys) are documented, but re-bypassing without asking would be scope creep. If you add more ipynb content, strip outputs first with `jupyter nbconvert --ClearOutputPreprocessor.enabled=True --to notebook --inplace <notebook>.ipynb`.
 - Don't restructure the folder. It's organized fine: `README.md` → `SYNTHESIS.md` (primary) → specialized `CORPUS_*.md` / `GOTCHAS.md` / `IMPLEMENTATIONS.md` → `tooling/` (receipts). New material goes into one of those buckets, not a new top-level thing.
 - Don't assume the Gemma 3 technical report covers Gemma 4. It's the closest thing we have but it predates Gemma 4.
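
If `jupyter` is not installed where the commit happens, the output-stripping step from the first bullet can be approximated with the standard library. This is a minimal sketch that assumes nbformat-v4 notebooks (a top-level `cells` list whose code cells carry `outputs` and `execution_count`); `strip_outputs` and `strip_file` are hypothetical helper names, not part of any existing tooling in this repo.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of an nbformat-v4 notebook dict with code-cell outputs cleared."""
    cleaned = json.loads(json.dumps(notebook))  # cheap deep copy of JSON data
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


def strip_file(path: str) -> None:
    """Rewrite the notebook at `path` in place without outputs."""
    with open(path, encoding="utf-8") as fh:
        notebook = json.load(fh)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(strip_outputs(notebook), fh, indent=1, ensure_ascii=False)
        fh.write("\n")


# Tiny in-memory example: one code cell with an output, one markdown cell.
example = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "source": ["print('hi')"],
            "metadata": {},
            "execution_count": 3,
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["hi\n"]}],
        },
        {"cell_type": "markdown", "source": ["# Title"], "metadata": {}},
    ],
}

cleaned = strip_outputs(example)
print(cleaned["cells"][0]["outputs"], cleaned["cells"][0]["execution_count"])
```

Because the base64 blobs that trip the secrets hook live in cell outputs, clearing them should remove most false positives; example keys embedded in cell source would still need a separate look.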
@@ -0,0 +1,12 @@
 # Local scratch / backups — per ~/.claude/CLAUDE.md, Claude keeps backups before
 # editing any file. Useful locally; not useful in the tracked history.
 .backup/
 # Python
 __pycache__/
 *.pyc
 .pytest_cache/
 # Editor / OS
 .DS_Store
 *.swp
@@ -4,8 +4,9 @@
 ### Text (all variants)
 - Standard instruction-following, chat, completion
- System prompt support (critical — see synthesis)
+- **Native `system` role support** (new in Gemma 4; Gemma 3 prepended system as user turn)
- 128K context window (training length)
+- **Configurable thinking mode** — `<|think|>` / `<|channel>` tokens in the chat template; Ollama `think: true/false` flag. Seth's finding (see GOTCHAS): keep `false` for tool-use workloads.
 - 128K context window (E2B/E4B) / 256K (26B/31B) — training length
 - 262K vocabulary
 ### Vision (all variants)
@@ -38,6 +39,13 @@
 - Simon: 6 genealogy tools, up to 12 sequential iterations
 - Supports parallel tool calls in single response
 - Weak at deeply nested JSON schemas -> prefer sequential calls
 - **First Gemma generation with tool use as a trained capability.** Gemma 1/2/3 tool use was "proof-of-concept" (per the DeepMind tool_use colab). Gemma 4 has dedicated tool-call tokens and is trained on the pattern.
 ### Native Object Detection (all variants)
 - **Prompt format:** "Detect the {object} in this image" → structured output `{box_2d: [ymin, xmin, ymax, xmax]}` in **1000×1000-normalized coordinates** (rescale to your actual image dims).
 - Images auto-resized to multiples of 48 pixels by the processor.
 - Useful for grounding, cropping, counting, or passing bboxes to downstream tools — no separate detection model required.
 - Documented in the HF model card (`tooling/huggingface/model-cards/gemma-4-*.md`). Not tested by Seth yet.
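
The rescaling step mentioned in the first bullet can be sketched as follows. This assumes the model's answer has already been reduced to a JSON object with a `box_2d` key in `[ymin, xmin, ymax, xmax]` order on a 1000×1000 grid; the sample string is made up for illustration, and `rescale_box` is a hypothetical helper.

```python
import json


def rescale_box(box_2d, image_width, image_height, grid=1000):
    """Map [ymin, xmin, ymax, xmax] on a grid x grid canvas to pixel
    coordinates, returned as (left, top, right, bottom)."""
    ymin, xmin, ymax, xmax = box_2d
    left = round(xmin * image_width / grid)
    top = round(ymin * image_height / grid)
    right = round(xmax * image_width / grid)
    bottom = round(ymax * image_height / grid)
    return left, top, right, bottom


# Made-up model output for a 1920x1080 image.
raw = '{"box_2d": [100, 250, 500, 750]}'
detection = json.loads(raw)

pixels = rescale_box(detection["box_2d"], image_width=1920, image_height=1080)
print(pixels)  # (480, 108, 1440, 540)
```

Note the axis order: the model emits y before x, while most image libraries expect `(left, top, right, bottom)`, so the helper reorders as well as rescales.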
 ## Benchmark Context (vs Gemma 3)
@@ -51,5 +59,6 @@
 - No native code execution / sandboxing
 - No web browsing or retrieval
- Audio only on E-series (not the models most people run)
+- Audio only on E-series (not the models most people run) — and **not on Ollama**, requires llama.cpp mmproj or vLLM
 - No built-in RAG — tool calling can implement it
 - No embeddings — use `EmbeddingGemma` (308M, separate model) for retrieval/semantic search
@@ -7,7 +7,7 @@
 | Tag | Params | Quant | Size on Disk | VRAM | Notes |
 |-----|--------|-------|-------------|------|-------|
 | `gemma4:e4b-it-q8_0` | ~8B total / 4B effective | Q8_0 | 11.6GB | ~12GB | Vision + audio capable. ~25 tok/s on V100 |
-| `gemma4:26b` | 25.8B | Q4_K_M (default) | 18.0GB | ~18GB | Sweet spot for quality/speed. ~134 tok/s on 3090 Ti |
+| `gemma4:26b` | 25.2B total / **3.8B active (MoE)** | Q4_K_M (default) | 18.0GB | ~18GB | Sweet spot for quality/speed. ~134 tok/s on 3090 Ti. **8 experts active of 128 + 1 shared** — runs at ~4B-speed, hence throughput. Q4_K_M inference is standard (Mixtral/DeepSeek ship same); the "MoE quality degrades at 4-bit" caveat is a **training-time** concern, not inference. See `tooling/huggingface/model-cards/gemma-4-26B-A4B-it-README.md` for the full card. |
 | `gemma4:31b-it-q4_K_M` | 31.3B | Q4_K_M | 19.9GB | ~24.5GB | Sharpest but 5x slower (~28 tok/s on 3090 Ti, memory pressure) |
 ## Capabilities by Variant (from `ollama show`)
@@ -16,9 +16,10 @@ All variants support:
 - Text generation (completion, chat)
 - Vision (image input via base64 in `images` field)
 - Tool/function calling (native Ollama tool format)
 - Thinking (configurable — `ollama show` lists it; Seth's finding is to leave it `false` for tool-use workloads)
 E-series (E2B, E4B) additionally support:
- Audio input (conformer encoder)
+- Audio input (conformer encoder) — **but not via Ollama**; requires llama.cpp with the `mmproj-*-E*B-it-*.gguf` projector, or vLLM's `input_features_padded`. See `tooling/inference-frameworks/README.md`.
 ## GPU Coexistence (pve197 V100 32GB)
@@ -2,8 +2,29 @@
 > Source: Google AI for Developers - Function Calling docs
 > https://ai.google.dev/gemma/docs/capabilities/text/function-calling-gemma4
 > Canonical source in corpus: `tooling/google-official/docs/ai-google-dev_function_calling_gemma4.html`
 > Authoritative chat template: `tooling/huggingface/model-cards/gemma-4-{31B,E4B}-it-chat_template.jinja`
-## Special Tokens (6 total)
+## Chat Template Context (what surrounds the tool tokens)
 Gemma 4 changed the turn-token syntax from Gemma 3. You won't usually write these by
 hand — Ollama, llama.cpp `--jinja`, and HF `apply_chat_template` all handle it — but
 know what's on the wire when debugging:
 | Purpose | Gemma 3 | Gemma 4 |
 |---------|---------|---------|
 | Turn start | `<start_of_turn>role\n` | `<\|turn>role\n` |
 | Turn end | `<end_of_turn>\n` | `<turn\|>\n` |
 | Thinking | (not standardized) | `<\|think>...<think\|>` |
 | Thought channel | (n/a) | `<\|channel>thought...<channel\|>` |
 | Image inline | `<start_of_image>` | `<\|image>...<image\|>` |
 | Audio inline | (n/a) | `<\|audio>...<audio\|>` |
 | String delimiter in native format | (n/a) | `<\|"\|>` |
 **Asymmetric brackets are intentional.** Opening is `<|token>`, closing is `<token|>`.
 If you see `<|turn>...</turn|>` in a code sample, that's wrong.
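
For debugging, the turn syntax in the table can be reproduced with a few lines of Python. This is only an illustration of the asymmetric brackets: the authoritative rendering, including exact whitespace, is the jinja chat template. `render_turn` and `find_bad_closers` are hypothetical helpers.

```python
import re


def render_turn(role, content):
    """Wrap one message in Gemma 4 turn brackets as shown in the table above."""
    return f"<|turn>{role}\n{content}<turn|>\n"


# XML-style closers such as </turn|> are the mistake called out above.
BAD_CLOSER = re.compile(r"</[a-z_]+\|>")


def find_bad_closers(text):
    """Return every XML-style closing token found in `text`."""
    return BAD_CLOSER.findall(text)


good = render_turn("user", "What is the capital of France?")
bad = "<|turn>user\nWhat is the capital of France?</turn|>\n"

print(repr(good))
print(find_bad_closers(good))  # []
print(find_bad_closers(bad))   # ['</turn|>']
```

A check like `find_bad_closers` is cheap to run over prompt strings or copied code samples before blaming the model for malformed output.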
 ## Tool Special Tokens (6 total)
 | Token | Purpose |
 |-------|---------|
@@ -98,3 +119,24 @@ This is what you actually use in practice. Ollama translates to/from native toke
 - llama.cpp: format mismatches and continuous loops reported
 - LM Studio: compatibility issues with tool calling
 - **Workaround:** Use non-streaming mode for tool calls (proven in Simon)
 ## HF `transformers` Alternative (not needed if using Ollama)
 If you ever route through HF `transformers` (v5.5.4+) instead of Ollama, there's a
 cleaner parser than hand-rolled regex:
 ```python
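 # Assumes `processor`, `model`, `messages`, and `TOOLS` are already defined.
 inputs = processor.apply_chat_template(
    messages, tools=TOOLS, enable_thinking=True,
    add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
 ).to(model.device)
 out = model.generate(**inputs, max_new_tokens=512)
 # Decode only the newly generated tokens so the prompt is not parsed as output.
 new_tokens = out[0][inputs["input_ids"].shape[-1]:]
 parsed = processor.parse_response(processor.decode(new_tokens))
 # -> {"thinking": "...", "content": "...", "tool_calls": [...]}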
 ```
 `parse_response` uses `response_schema` + `x-regex` fields baked into
 `tokenizer_config.json` (downloaded at `tooling/huggingface/model-cards/`). For
 Ollama users this is informational — Ollama's server-side tool parser already does
 the equivalent and returns structured `tool_calls` in the chat response.
@@ -168,6 +168,21 @@ Gemma 4 can generate `<unused>` or `<unused24>` tokens in an infinite loop on Vu
 **Source:** [ggml-org/llama.cpp#21516](https://github.com/ggml-org/llama.cpp/issues/21516)
 ## MEDIUM: `google/gemma_pytorch` Abandoned for Gemma 4
 **Severity: MEDIUM — wastes time on a dead-end path**
 The `google/gemma_pytorch` repo (last push 2025-05-30) has zero Gemma 4 support —
 its variants validator only accepts Gemma 1/2/3 IDs. Anyone pointing at it as "the
 official PyTorch reference" for Gemma 4 is wrong.
 **Use instead:**
 - **Inference:** `huggingface/transformers` (`AutoModelForMultimodalLM`, v5.5.4+)
 - **Reference impl:** `google-deepmind/gemma` (JAX/Flax)
 - **Serving:** Ollama / vLLM / llama.cpp
 See `tooling/google-official/gemma-pytorch/README.md` for the original repo state.
 ## LOW: Fine-Tuning Ecosystem Issues
 **Severity: LOW — only relevant if fine-tuning**
@@ -177,6 +192,20 @@ Day-one issues for fine-tuners:
 - PEFT couldn't handle Gemma4ClippableLinear (new vision encoder layer type)
 - New `mm_token_type_ids` field required during training even for text-only data
 - E2B/E4B show training loss of 13-15, which is normal for multimodal models (not a bug)
 - **Flash Attention 2/4 incompatible:** Gemma 4's global-attention head_dim is 512;
  FA2 max is 256, FA4 max is 128. Training backends fall back to SDP or Flex Attention
  (Axolotl hard-codes `sdp_attention: true` for Gemma 4). Does not affect inference
  runtimes that already use SDP (Ollama, vLLM).
 - **Fused LoRA kernels broken** (shared-KV layers). Axolotl disables
  `lora_mlp_kernel` / `qkv_kernel` / `o_kernel` for Gemma 4; Unsloth routes around it.
 - **26B A4B MoE wants ≥8-bit LoRA**, not 4-bit QLoRA — MoE expert quality degrades
  at 4-bit during training. Axolotl's ScatterMoE + expert-LoRA config is the only
  validated 4-bit MoE path. (This caveat is **training-only**; Q4_K_M inference is fine.)
 - **New tool-call / channel tokens are learned embeddings** — if fine-tuning, set
  `modules_to_save=["lm_head","embed_tokens"]` + `ensure_weight_tying=True` in
  `LoraConfig`, or the adapter trains against frozen random vectors for them.
 See `tooling/fine-tuning/recipe-recommendation.md` for the full training path.
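
The Flash Attention bullet reduces to a simple head-dimension check. The sketch below encodes the limits exactly as stated in this section (they are this document's figures, not values queried from any library); `pick_attention` is a hypothetical helper, and the backend names are labels rather than real config values.

```python
# Max head_dim per Flash Attention version, as stated in this section.
FLASH_ATTENTION_MAX_HEAD_DIM = {"fa2": 256, "fa4": 128}

# Global-attention head_dim for Gemma 4, as stated in this section.
GEMMA4_GLOBAL_HEAD_DIM = 512


def usable_flash_backends(head_dim):
    """Return the Flash Attention versions whose limit covers `head_dim`."""
    return sorted(
        name
        for name, limit in FLASH_ATTENTION_MAX_HEAD_DIM.items()
        if head_dim <= limit
    )


def pick_attention(head_dim, fallback="sdp"):
    """Prefer a Flash Attention backend when one fits, else fall back."""
    usable = usable_flash_backends(head_dim)
    return usable[0] if usable else fallback


print(pick_attention(GEMMA4_GLOBAL_HEAD_DIM))  # sdp
print(pick_attention(128))                     # fa2
```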
 ## LOW: Vision Validator Overrejects
@@ -7,6 +7,12 @@
 Gemma 4 is an ultra-compliant, highly-capable model that doesn't know who it is. It doesn't need hand-holding on tasks but needs explicit instructions in the system prompt about identity, boundaries, and output format. It needs `num_predict` increased (Ollama defaults are absurdly low), `think` set to false (thinking eats the context budget), and `format: json` avoided entirely (causes infinite loops). Due to its fast speed and free local inference, sequential tool calls are the ideal solution to tasks that would otherwise require long structured output.
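
The settings in the paragraph above translate into an Ollama `/api/chat` request body roughly like the following. This is a sketch, not the configuration used by any project in this corpus: `build_chat_request` is a hypothetical helper, and the `num_predict` value of 4096 is an arbitrary placeholder to tune per workload.

```python
import base64
import json


def build_chat_request(model, system, user, image_bytes=None, tools=None,
                       num_predict=4096):
    """Build a JSON-serializable body for Ollama's /api/chat endpoint."""
    message = {"role": "user", "content": user}
    if image_bytes is not None:
        # Ollama takes images as base64 strings in the message's `images` field.
        message["images"] = [base64.b64encode(image_bytes).decode("ascii")]
    payload = {
        "model": model,
        "messages": [{"role": "system", "content": system}, message],
        "stream": False,   # non-streaming is the safer mode for tool calls
        "think": False,    # thinking eats the context budget
        "options": {"num_predict": num_predict},  # raise the low default
    }
    if tools:
        payload["tools"] = tools
    # Deliberately no "format": "json" key, per the guidance above.
    return payload


request = build_chat_request(
    model="gemma4:26b",
    system="You are a genealogy assistant. Answer in plain text.",
    user="Summarize the attached record.",
    image_bytes=b"\x89PNG placeholder bytes",
)
print(json.dumps(request, indent=2)[:200])
```

The body would be sent as a POST to `/api/chat` on the Ollama host (port 11434 by default); the network call is left out so the sketch stays self-contained.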
 > **For canonical upstream source (model cards, chat templates, serving commands,
 > fine-tuning recipes, specialized siblings like EmbeddingGemma/ShieldGemma): see
 > `tooling/README.md`.** That directory is 147 files / 14 MB of first-party material
 > pulled from Google / Hugging Face / framework maintainers. This SYNTHESIS is the
 > opinionated digest; `tooling/` is the receipts.
 ## Mental Model
 Think of Gemma 4 as a very competent employee on their first day. They can do the work — you don't need to explain how. But you DO need to explain:
@@ -165,10 +171,11 @@ Vision is on ALL Gemma 4 variants (E2B, E4B, 26B, 31B). Audio is E-series only.
 | Use Case | Recommended | Why |
 |----------|------------|-----|
-| Production pipeline (needs GPU coexistence) | `gemma4:26b` | Best quality/speed/VRAM balance |
+| Production pipeline (needs GPU coexistence) | `gemma4:26b` | MoE (3.8B active), fast, good quality/VRAM balance |
-| On-device / edge | `gemma4:e4b-it-q8_0` | 12GB VRAM, vision+audio |
+| On-device / edge | `gemma4:e4b-it-q8_0` | 12GB VRAM, vision+audio (audio via llama.cpp or vLLM, not Ollama) |
-| Maximum quality (single-model GPU) | `gemma4:31b-it-q4_K_M` | Sharpest but slow under memory pressure |
+| Maximum quality (single-model GPU) | `gemma4:31b-it-q4_K_M` | Dense 31B, sharpest but 5x slower, more VRAM pressure |
 | Rapid prototyping / testing | `gemma4:26b` | Fast enough for interactive dev |
 | Retrieval / embeddings | `embeddinggemma` (308M, separate model) | Gemma 4 has no embedding mode; use the sibling |
 ## Anti-Patterns
@@ -16,27 +16,36 @@ Actual scripts, notebooks, model cards, and configs downloaded from Google, Hugg
 ## Findings that update / contradict the existing corpus
-These are real gaps worth patching into `SYNTHESIS.md`, `GOTCHAS.md`, or `CORPUS_tool_calling_format.md`. Flagged here, not applied — the user asked for research, not a rewrite.
+These were merged into the top-level corpus docs on 2026-04-18 — each finding below
 is marked **[merged: file]** where it landed, or **[flagged]** if it's informational
 only. Scan here for provenance; read the CORPUS / SYNTHESIS / GOTCHAS files for the
 authoritative working text.
-1. **Prompt-token format changed in Gemma 4.** Gemma 1/2/3 used `<start_of_turn>user ... <end_of_turn>`. Gemma 4 uses asymmetric pipe-brackets: `<|turn>user\n ... <turn|>`. Also new: `<|think|>`, `<|channel>thought...<channel|>`, `<|tool>`, `<|tool_call>`, `<|tool_response>` (+ inverses), `<|image>`, `<|audio>`, and string delimiter `<|"|>`. The existing `CORPUS_tool_calling_format.md` documents the tool tokens but doesn't reflect the turn-token change or the thinking/channel tokens. Canonical source: `huggingface/model-cards/gemma-4-31B-it-chat_template.jinja` and `google-official/docs/ai-google-dev_prompt_formatting_gemma4.html`.
+1. **Prompt-token format changed in Gemma 4.** Gemma 1/2/3 used `<start_of_turn>user ... <end_of_turn>`. Gemma 4 uses asymmetric pipe-brackets: `<|turn>user\n ... <turn|>`. Also new: `<|think|>`, `<|channel>thought...<channel|>`, `<|tool>`, `<|tool_call>`, `<|tool_response>` (+ inverses), `<|image>`, `<|audio>`, and string delimiter `<|"|>`. Canonical source: `huggingface/model-cards/gemma-4-31B-it-chat_template.jinja` and `google-official/docs/ai-google-dev_prompt_formatting_gemma4.html`. **[merged: CORPUS_tool_calling_format.md — added Chat Template Context section]**
-2. **`google/gemma_pytorch` is abandoned for Gemma 4.** Last push 2025-05-30; the variants validator rejects Gemma 4 IDs. Anyone pointing at it as the PyTorch reference is wrong — use HF `transformers` or `google-deepmind/gemma` (JAX/Flax) instead.
+2. **`google/gemma_pytorch` is abandoned for Gemma 4.** Last push 2025-05-30; the variants validator rejects Gemma 4 IDs. Use HF `transformers` or `google-deepmind/gemma` (JAX/Flax) instead. **[merged: GOTCHAS.md — MEDIUM severity section]**
-3. **`gemma.cpp` ships a Gemini-API-compatible local HTTP server** (`gemma_api_server`, endpoint `POST /v1beta/models/<model>:generateContent`, SSE streaming). This is a Google-authored alternative to Ollama that speaks the real Gemini REST API — possibly the single most interesting discovery in this research pass. See `google-official/gemma-cpp/API_SERVER_README.md`.
+3. **`gemma.cpp` ships a Gemini-API-compatible local HTTP server** (`gemma_api_server`, endpoint `POST /v1beta/models/<model>:generateContent`, SSE streaming). Google-authored alternative to Ollama that speaks the real Gemini REST API. See `google-official/gemma-cpp/API_SERVER_README.md`. **[flagged — not merged; no current homelab use case, but worth knowing it exists]**
-4. **Transformers exposes `AutoModelForMultimodalLM` (new AutoClass)** — not `AutoModelForCausalLM`. It also exposes `processor.parse_response(..., response_schema=...)` driven from `tokenizer_config.json`, which replaces the hand-rolled regex in the current `CORPUS_tool_calling_format.md`. Pin: `transformers>=5.5.4`.
+4. **Transformers exposes `AutoModelForMultimodalLM` (new AutoClass)** — not `AutoModelForCausalLM`. It also exposes `processor.parse_response(..., response_schema=...)` driven from `tokenizer_config.json`. Pin: `transformers>=5.5.4`. **[merged: CORPUS_tool_calling_format.md — HF transformers Alternative section]**
-5. **Gemma 4 breaks Flash Attention.** FA2's max head_dim is 256, FA4's is 128, and Gemma 4's global head_dim is 512. Use SDP or Flex Attention. Axolotl hard-codes `sdp_attention: true` for Gemma 4. This belongs in `GOTCHAS.md`.
+5. **Gemma 4 breaks Flash Attention** (training only). FA2's max head_dim is 256, FA4's is 128, and Gemma 4's global head_dim is 512. Use SDP or Flex Attention. Does not affect Ollama / vLLM inference which already use SDP. **[merged: GOTCHAS.md — under LOW: Fine-Tuning Ecosystem Issues]**
-6. **The 26B variant is a MoE** — `gemma-4-26B-A4B` (A4B = 4B active per token). Quantization rules differ: Unsloth says use 16-bit LoRA, not 4-bit QLoRA, for acceptable quality. Axolotl's ScatterMoE + expert-LoRA config is the only tool validated for 4-bit MoE training. Worth a line in `CORPUS_ollama_variants.md`.
+6. **The 26B variant is a MoE** — `gemma-4-26B-A4B`, 25.2B total / 3.8B active, 8 experts of 128 + 1 shared. Q4_K_M inference is fine (standard for MoE — Mixtral/DeepSeek ship same). The "MoE quality degrades at 4-bit" concern is training-time only. **[merged: CORPUS_ollama_variants.md — annotated 26b row; GOTCHAS.md — training caveat in fine-tuning section]**
-7. **No Gemma 4 technical report PDF exists yet** as of 2026-04-18. DeepMind repo says "Gemma 4 (Coming soon)". Gemma 3 report (downloaded at `google-official/tech-report/Gemma3Report.pdf`) remains the closest authoritative family citation.
+7. **No Gemma 4 technical report PDF exists yet** as of 2026-04-18. DeepMind repo says "Gemma 4 (Coming soon)". Gemma 3 report is at `google-official/tech-report/Gemma3Report.pdf`. **[flagged — nothing to merge; check back mid-2026]**
-8. **No `google/gemma-4-*` specialized siblings yet** — ShieldGemma, CodeGemma, PaliGemma, MedGemma, DataGemma are all still on Gemma 2 or 3 base. Historical lag is 3–6 months; expect siblings-on-4 mid-to-late 2026.
+8. **No Gemma-4-generation specialized siblings yet.** ShieldGemma 2 is Gemma 3-based, CodeGemma on Gemma 2, PaliGemma 2 on Gemma 2, EmbeddingGemma on Gemma 3, etc. All still usable — just don't confuse the sibling generation with the base-model generation. Historical lag is 3–6 months; expect siblings-on-4 mid-to-late 2026. **[merged: CORPUS_capabilities.md — "What Gemma 4 Does NOT Do" now points at EmbeddingGemma for retrieval; full catalog in `gemma-family/index.md`]**
-9. **No Gemma-4-specific TRL script in `huggingface/trl` yet.** HF blog says "fully supported," but the SFT/DPO/GRPO examples are still on Gemma 3 model IDs. Drop-in with `model_id` swap works. Only Gemma-4-dedicated TRL example today is `huggingface-gemma-recipes/carla_vlm_gemma.py` (VLM GRPO).
+9. **No Gemma-4-specific TRL script in `huggingface/trl` yet.** HF blog says "fully supported," but the SFT/DPO/GRPO examples are still on Gemma 3 model IDs. Drop-in with `model_id` swap works. Only Gemma-4-dedicated TRL example today is `huggingface-gemma-recipes/carla_vlm_gemma.py` (VLM GRPO). **[flagged — only relevant if fine-tuning]**
-10. **HF Spaces `app.py` files are the shortest Gemma 4 inference examples** — Google and HF both use them as ref. See `huggingface/spaces/huggingface-projects_gemma-4-{31b,e4b}-it-app.py`.
+10. **HF Spaces `app.py` files are the shortest Gemma 4 inference examples** — Google and HF both use them as ref. See `huggingface/spaces/huggingface-projects_gemma-4-{31b,e4b}-it-app.py`. **[flagged — reference material]**
 11. **Native object detection with bbox output.** Prompt `"Detect the X in this image"` → structured `{box_2d: [ymin, xmin, ymax, xmax]}` in 1000×1000-normalized coords. First-class Gemma 4 capability, no separate detection model needed. **[merged: CORPUS_capabilities.md — Native Object Detection section]**
 12. **Native `system` role support.** New in Gemma 4 — Gemma 3 prepended system as a user turn. Matters if you were hand-building the prompt string; invisible if you use Ollama `system` or HF `apply_chat_template`. **[merged: CORPUS_capabilities.md — Text section]**
 13. **Audio input is E-series only AND not via Ollama.** Requires llama.cpp's `mmproj-*-E*B-it-*.gguf` projector or vLLM's `input_features_padded`. **[merged: CORPUS_ollama_variants.md and CORPUS_capabilities.md]**
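
For finding 3, a request to the `gemma_api_server` endpoint can be sketched from the path given there plus the public Gemini REST body shape (`contents` → `parts` → `text`). The base URL, port, and model name below are placeholders, and whether the server accepts every Gemini field is an assumption to check against `google-official/gemma-cpp/API_SERVER_README.md`; `build_generate_content` is a hypothetical helper.

```python
import json
from urllib.parse import quote


def build_generate_content(base_url, model, prompt):
    """Return (url, json_body) for a Gemini-style generateContent call."""
    url = (
        f"{base_url.rstrip('/')}/v1beta/models/"
        f"{quote(model, safe='')}:generateContent"
    )
    body = {"contents": [{"role": "user", "parts": [{"text": prompt}]}]}
    return url, json.dumps(body)


# Placeholder host, port, and model name; substitute the real values.
url, body = build_generate_content(
    "http://localhost:8080/", "my-model", "Say hello in one sentence."
)
print(url)
print(body)
```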
 ## Immediate homelab plug-ins (from the gemma-family research)