docs: merge tooling findings into SYNTHESIS/GOTCHAS/CORPUS_* and add handoff
Patches the top-level corpus docs with the 13 findings flagged during the 2026-04-18 canonical tooling research pass. tooling/README.md now marks each finding [merged: <file>] or [flagged] for provenance.

- CORPUS_ollama_variants.md: annotate gemma4:26b as MoE (25.2B total / 3.8B active, 8-of-128 experts + 1 shared). Note Q4_K_M inference is standard (the "MoE quality degrades at 4-bit" caveat is training-only). Add note that audio on E-series is NOT available via Ollama (llama.cpp mmproj or vLLM only).
- CORPUS_capabilities.md: native system role, configurable thinking mode, first trained tool use (vs Gemma 1/2/3 proof-of-concept), native object detection with bbox output in 1000x1000 coords, pointer to EmbeddingGemma for retrieval (Gemma 4 has no embedding mode).
- CORPUS_tool_calling_format.md: add Chat Template Context section documenting the <|turn>/<turn|> asymmetric brackets (new in Gemma 4, replaced <start_of_turn>/<end_of_turn>) plus <|think>, <|channel>, <|image>, <|audio> tokens. Add HF transformers Alternative section showing processor.parse_response with response_schema.
- GOTCHAS.md: add MEDIUM gotcha for abandoned google/gemma_pytorch (no Gemma 4 support since 2025-05-30). Expand fine-tuning section with FA2/FA4 head_dim=512 break, fused LoRA kernel issues, 26B A4B training-quant guidance, new tool-call tokens as learned embeddings.
- SYNTHESIS.md: add banner pointing to tooling/ for canonical upstream material. Add embeddinggemma row to Model Selection table.

Also:
- Add .gitignore excluding .backup/ (local scratch per global CLAUDE.md convention, not needed in tracked history) and __pycache__/.
- Add .claude/handoffs/2026-04-18-canonical-tooling-research.md so future sessions can pick up cold: facts verified, open threads, what changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,68 @@
# Handoff (2026-04-18): Canonical Tooling Research

## TL;DR for the next session

A parallel research pass pulled 147 files / 14 MB of first-party Gemma 4 tooling into `tooling/`. Of the 13 findings that contradicted or extended the existing corpus, 9 were merged into the top-level `SYNTHESIS.md` / `GOTCHAS.md` / `CORPUS_*.md` docs and 4 were flagged as informational. The repo is in a clean, coherent state.

**If you're opening this repo for Gemma 4 implementation work, `SYNTHESIS.md` is still the right first read.** The new `tooling/README.md` is the receipts layer: read it when you need authoritative source material (model cards, chat templates, serving commands, sibling-model briefs).

## What shipped

**Commit `eecebe7` (master, pushed to `git.sethpc.xyz/Seth/gemma4-research`):** added `tooling/` with five subdirs: `google-official/`, `huggingface/`, `inference-frameworks/`, `gemma-family/`, `fine-tuning/`. Each subdir has its own indexing README.

**Follow-up commit (same session):** patched top-level corpus docs with the 9 findings worth merging. The `tooling/README.md` "Findings" list now marks each one `[merged: <file>]` or `[flagged]` for provenance.

## Key confirmed facts

| Claim | Verified against |
|-------|-----------------|
| `gemma4:26b` is a MoE (25.2B total, 3.8B active, 8 of 128 experts + 1 shared) | HF model card at `tooling/huggingface/model-cards/gemma-4-26B-A4B-it-README.md` |
| Q4_K_M inference on the MoE is fine (standard practice) | Mixtral/DeepSeek precedent; card neutral on inference quant |
| Gemma 4 changed turn tokens from `<start_of_turn>` to `<\|turn>`/`<turn\|>` | `tooling/huggingface/model-cards/gemma-4-*-chat_template.jinja` |
| Tool use is **trained** in Gemma 4, not a proof-of-concept as in Gemma 1/2/3 | DeepMind tool-use colab at `tooling/google-official/deepmind-gemma/colab_tool_use.ipynb` |
| `google/gemma_pytorch` is abandoned for Gemma 4 | Last push 2025-05-30, variants validator |
| No Gemma 4 technical report PDF as of 2026-04-18 | DeepMind repo README + direct URL probes |
| No specialized siblings on Gemma 4 base yet (ShieldGemma 2, CodeGemma, PaliGemma 2, EmbeddingGemma all still on Gemma 2/3) | Per-sibling model cards in `tooling/gemma-family/` |

## Open threads (flagged but not implemented)

These came out of the mort-bot impact review later in the session. All three are high-value but out of scope for this research pass:

1. **EmbeddingGemma (308M) as a drop-in upgrade for mort-bot's `chat_search` / `memory_read` tools.** Mort currently uses keyword-only FTS5 search, which misses semantic matches. EmbeddingGemma's Matryoshka sizes (768/512/256/128) and 100+ languages make it a clean fit. Integration sketch in the session conversation; full research at `tooling/gemma-family/embeddinggemma.md`. Starter notebook at `tooling/google-official/cookbook/tutorials_RAG_EmbeddingGemma.ipynb`. **Next steps:** (a) `ollama pull embeddinggemma` on steel141, (b) A/B against existing `nomic-embed-text` on actual mort chat logs before committing to backfill.

2. **ShieldGemma 2 (4B) as a `generate_image` pre-filter for mort-bot.** Mort's SDXL tool has no safety gate. ShieldGemma 2 is Gemma-3-based but scoped exactly to image safety. Would run on steel141 alongside `gemma4:26b` (3090 has headroom).

3. **Native object detection for mort's `vision_describe`.** Gemma 4 does grounded bbox output natively: "Detect the X" → `{box_2d: [ymin, xmin, ymax, xmax]}` in 1000×1000 coords. Mort currently only does free-form vision description.

None of these were implemented in this session.

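The Matryoshka sizes in item 1 are what would make the A/B in its next steps cheap to run at several dimensions. As a minimal sketch, assuming the usual Matryoshka convention (keep the leading components, then re-normalize), the comparison could look like the following. The function names and the toy vectors are hypothetical; nothing here calls EmbeddingGemma or Ollama.

```python
import math


def truncate_embedding(vec, dims):
    """Keep the first `dims` components and L2-normalize what is left."""
    head = list(vec[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


# Toy 4-dim vectors standing in for real 768-dim embeddings (hypothetical data).
query = [3.0, 4.0, 12.0, 0.0]
doc = [3.0, 4.0, 0.0, 12.0]

full = cosine(query, doc)
short = cosine(truncate_embedding(query, 2), truncate_embedding(doc, 2))
print(f"full-dim similarity: {full:.3f}, 2-dim similarity: {short:.3f}")
```

Running the same scoring at 768, 512, 256, and 128 dimensions on real chat logs would show how much recall is lost before committing to a storage size for the backfill.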
## Files changed this session

- **New:** `tooling/` (147 files), `tooling/README.md`, `.claude/handoffs/2026-04-18-canonical-tooling-research.md` (this file)
- **Edited:** `README.md` (added `tooling/` row), `SYNTHESIS.md` (banner + model-selection table), `GOTCHAS.md` (added gemma_pytorch abandonment + expanded fine-tuning section), `CORPUS_tool_calling_format.md` (added Chat Template Context + HF transformers Alternative), `CORPUS_ollama_variants.md` (annotated 26b as MoE + audio note), `CORPUS_capabilities.md` (native system role, thinking, object detection, embedding pointer)
- **Unchanged:** `IMPLEMENTATIONS.md` (Simon/AI_Visualizer specific, not affected), `CORPUS_architecture.md` (already had MoE details right), `CORPUS_benchmarks.md` (still current)

## What future sessions should know

- **The research is the receipts, not the source of truth.** The top-level `SYNTHESIS.md` / `GOTCHAS.md` / `CORPUS_*.md` docs are the working reference. `tooling/` backs them up with downloaded upstream material when you need provenance or a working script.
- **Don't re-research the same ground.** Every `tooling/*/README.md` lists what's there and the source URL. Grep the tooling corpus before spawning new web searches.
- **The 26B-is-MoE and Q4_K_M-is-fine facts were the main things that would have been re-litigated without this handoff.** If you see a claim that conflicts with those, check the HF model card first (`tooling/huggingface/model-cards/gemma-4-26B-A4B-it-README.md`). It is Google's own documentation, not secondhand.
- **Sibling-model generation lag.** When reaching for ShieldGemma / CodeGemma / PaliGemma / EmbeddingGemma, don't assume a Gemma-4 base; they're still on 2 or 3. Use them anyway; just don't confuse generations.
- **Mort-bot is where the low-hanging fruit is** if Seth wants a next practical project. The three items are listed under open threads above; EmbeddingGemma is the biggest lever.

## Session narrative (for context, not action)

1. Started with the existing corpus (SYNTHESIS + GOTCHAS + 5 CORPUS files, ~22KB total). Goal: add canonical upstream tooling.
2. Dispatched five parallel `general-purpose` agents covering Google official, HF, inference frameworks, Gemma family, fine-tuning.
3. All five returned clean: 147 files downloaded, each indexed per subdir.
4. Wrote `tooling/README.md` with 10 findings from the agents. Initial plan: flag only, don't touch the older corpus.
5. Seth asked how the findings affect mort-bot. Read mort's CLAUDE.md / DECISIONS.md / llm.py / config.py / tools.py. Ranked: EmbeddingGemma (high), ShieldGemma 2 (high), bbox detection (high), E-series audio (medium), everything else (low/none because Ollama hides transformers changes).
6. Seth ran `ollama show gemma4:26b`; output confirmed MoE (25.8B, Q4_K_M). Walked back the earlier "worth A/B testing" extrapolation, which was training guidance misapplied to inference. Q4_K_M on the MoE is fine.
7. Seth asked "did you update synthesis?" No, I hadn't. He authorized the updates. Patched 5 top-level docs; updated `tooling/README.md` findings list to mark merged-vs-flagged.
8. Wrote this handoff.

## Don't do these things next session

- Don't commit the ipynb files with `--no-verify` unless you ask again. The secrets-hook false positives (base64 notebook outputs, example Ed25519 keys) are documented, but re-bypassing without asking would be scope creep. If you add more ipynb content, strip outputs with `jupyter nbconvert --ClearOutputPreprocessor.enabled=True` first.
- Don't restructure the folder. It's organized fine: `README.md` → `SYNTHESIS.md` (primary) → specialized `CORPUS_*.md` / `GOTCHAS.md` / `IMPLEMENTATIONS.md` → `tooling/` (receipts). New material goes into one of those buckets, not a new top-level thing.
- Don't assume the Gemma 3 technical report covers Gemma 4. It's the closest thing we have but it predates Gemma 4.

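For the output-stripping step in the first bullet, a dependency-free alternative to `nbconvert` can be sketched in a few lines. This assumes the notebooks use the nbformat 4 JSON layout (a top-level `cells` list whose code cells carry `outputs` and `execution_count`); `strip_notebook_outputs` and `strip_file` are hypothetical helper names, not part of any existing tool in this repo.

```python
import json


def strip_notebook_outputs(nb):
    """Clear outputs and execution counts from every code cell, in place."""
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return nb


def strip_file(path):
    """Rewrite the notebook at `path` with its outputs removed."""
    with open(path, encoding="utf-8") as f:
        nb = json.load(f)
    strip_notebook_outputs(nb)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(nb, f, indent=1, ensure_ascii=False)
        f.write("\n")


# Tiny in-memory notebook (hypothetical content) to show the effect.
demo = {
    "nbformat": 4,
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {
            "cell_type": "code",
            "source": ["print('hi')"],
            "execution_count": 3,
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["hi\n"]}],
        },
    ],
}
strip_notebook_outputs(demo)
```

Because base64 image payloads live inside `outputs`, clearing that list removes the content the secrets hook was tripping on while leaving the cell sources untouched.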
+12
@@ -0,0 +1,12 @@
# Local scratch / backups: per ~/.claude/CLAUDE.md, Claude keeps backups before
# editing any file. Useful locally; not useful in the tracked history.
.backup/

# Python
__pycache__/
*.pyc
.pytest_cache/

# Editor / OS
.DS_Store
*.swp
+12
-3
@@ -4,8 +4,9 @@

### Text (all variants)
- Standard instruction-following, chat, completion
- System prompt support (critical; see synthesis)
- 128K context window (training length)
- **Native `system` role support** (new in Gemma 4; Gemma 3 prepended system as user turn)
- **Configurable thinking mode**: `<|think|>` / `<|channel>` tokens in the chat template; Ollama `think: true/false` flag. Seth's finding (see GOTCHAS): keep `false` for tool-use workloads.
- 128K context window (E2B/E4B) / 256K (26B/31B), both training length
- 262K vocabulary

### Vision (all variants)
@@ -38,6 +39,13 @@
- Simon: 6 genealogy tools, up to 12 sequential iterations
- Supports parallel tool calls in single response
- Weak at deeply nested JSON schemas -> prefer sequential calls
- **First Gemma generation with tool use as a trained capability.** Gemma 1/2/3 tool use was "proof-of-concept" (per the DeepMind tool_use colab). Gemma 4 has dedicated tool-call tokens and is trained on the pattern.

### Native Object Detection (all variants)
- **Prompt format:** "Detect the {object} in this image" → structured output `{box_2d: [ymin, xmin, ymax, xmax]}` in **1000×1000-normalized coordinates** (rescale to your actual image dims).
- Images are auto-resized to multiples of 48 pixels by the processor.
- Useful for grounding, cropping, counting, or passing bboxes to downstream tools; no separate detection model required.
- Documented in the HF model card (`tooling/huggingface/model-cards/gemma-4-*.md`). Not tested by Seth yet.

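The rescaling step mentioned in the prompt-format bullet can be sketched as below. This is a minimal illustration that assumes the `[ymin, xmin, ymax, xmax]` order and the 1000×1000 grid described above; `box_2d_to_pixels` is a hypothetical helper name, and the sample box is made up rather than real model output.

```python
def box_2d_to_pixels(box_2d, width, height, grid=1000):
    """Convert [ymin, xmin, ymax, xmax] on a grid x grid canvas to pixel
    coordinates (left, top, right, bottom) for an image of width x height."""
    ymin, xmin, ymax, xmax = box_2d

    def scale(value, size):
        # Multiply before dividing to keep the arithmetic exact for integers,
        # then clamp so a slightly out-of-range box stays inside the image.
        return max(0, min(size, round(value * size / grid)))

    return (
        scale(xmin, width),
        scale(ymin, height),
        scale(xmax, width),
        scale(ymax, height),
    )


# Hypothetical detection on a 640x480 image.
left, top, right, bottom = box_2d_to_pixels([100, 200, 500, 600], width=640, height=480)
print(left, top, right, bottom)  # 128 48 384 240
```

The returned tuple is in the (left, top, right, bottom) order that most imaging libraries expect for crops, which is why the x and y values are swapped relative to the model's output.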
## Benchmark Context (vs Gemma 3)

@@ -51,5 +59,6 @@

- No native code execution / sandboxing
- No web browsing or retrieval
- Audio only on E-series (not the models most people run)
- Audio only on E-series (not the models most people run), and **not on Ollama**: requires llama.cpp mmproj or vLLM
- No built-in RAG; tool calling can implement it
- No embeddings; use `EmbeddingGemma` (308M, separate model) for retrieval/semantic search

@@ -7,7 +7,7 @@
| Tag | Params | Quant | Size on Disk | VRAM | Notes |
|-----|--------|-------|-------------|------|-------|
| `gemma4:e4b-it-q8_0` | ~8B total / 4B effective | Q8_0 | 11.6GB | ~12GB | Vision + audio capable. ~25 tok/s on V100 |
| `gemma4:26b` | 25.8B | Q4_K_M (default) | 18.0GB | ~18GB | Sweet spot for quality/speed. ~134 tok/s on 3090 Ti |
| `gemma4:26b` | 25.2B total / **3.8B active (MoE)** | Q4_K_M (default) | 18.0GB | ~18GB | Sweet spot for quality/speed. ~134 tok/s on 3090 Ti. **8 experts active of 128 + 1 shared**, so it runs at roughly 4B speed, which explains the throughput. Q4_K_M inference is standard (Mixtral/DeepSeek ship same); the "MoE quality degrades at 4-bit" caveat is a **training-time** concern, not inference. See `tooling/huggingface/model-cards/gemma-4-26B-A4B-it-README.md` for the full card. |
| `gemma4:31b-it-q4_K_M` | 31.3B | Q4_K_M | 19.9GB | ~24.5GB | Sharpest but 5x slower (~28 tok/s on 3090 Ti, memory pressure) |

## Capabilities by Variant (from `ollama show`)
@@ -16,9 +16,10 @@ All variants support:
- Text generation (completion, chat)
- Vision (image input via base64 in `images` field)
- Tool/function calling (native Ollama tool format)
- Thinking (configurable; `ollama show` lists it, and Seth's finding is to leave it `false` for tool-use workloads)

E-series (E2B, E4B) additionally support:
- Audio input (conformer encoder)
- Audio input (conformer encoder), **but not via Ollama**; requires llama.cpp with the `mmproj-*-E*B-it-*.gguf` projector, or vLLM's `input_features_padded`. See `tooling/inference-frameworks/README.md`.

## GPU Coexistence (pve197 V100 32GB)

@@ -2,8 +2,29 @@

> Source: Google AI for Developers - Function Calling docs
> https://ai.google.dev/gemma/docs/capabilities/text/function-calling-gemma4
> Canonical source in corpus: `tooling/google-official/docs/ai-google-dev_function_calling_gemma4.html`
> Authoritative chat template: `tooling/huggingface/model-cards/gemma-4-{31B,E4B}-it-chat_template.jinja`

## Special Tokens (6 total)
## Chat Template Context (what surrounds the tool tokens)

Gemma 4 changed the turn-token syntax from Gemma 3. You won't usually write these by
hand (Ollama, llama.cpp `--jinja`, and HF `apply_chat_template` all handle it), but
know what's on the wire when debugging:

| Purpose | Gemma 3 | Gemma 4 |
|---------|---------|---------|
| Turn start | `<start_of_turn>role\n` | `<\|turn>role\n` |
| Turn end | `<end_of_turn>\n` | `<turn\|>\n` |
| Thinking | (not standardized) | `<\|think>...<think\|>` |
| Thought channel | (n/a) | `<\|channel>thought...<channel\|>` |
| Image inline | `<start_of_image>` | `<\|image>...<image\|>` |
| Audio inline | (n/a) | `<\|audio>...<audio\|>` |
| String delimiter in native format | (n/a) | `<\|"\|>` |

**Asymmetric brackets are intentional.** Opening is `<|token>`, closing is `<token|>`.
If you see `<|turn>...</turn|>` in a code sample, that's wrong.

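When debugging a hand-built prompt or a copied code sample, the asymmetric-bracket rule can be checked mechanically. The following is a small sketch, not an official validator: it only flags XML-style closers such as `</turn|>` and compares the counts of `<|turn>` and `<turn|>`. The function names are hypothetical.

```python
import re

# An XML-style closer like "</turn|>" is the mistake called out above.
SLASH_CLOSER = re.compile(r"</(\w+)\|>")


def find_slash_closers(text):
    """Return the token names that were closed with a stray '/'."""
    return SLASH_CLOSER.findall(text)


def turn_counts(text):
    """Return (number of '<|turn>' openers, number of '<turn|>' closers)."""
    return text.count("<|turn>"), text.count("<turn|>")


good = "<|turn>user\nhello<turn|>\n<|turn>model\nhi<turn|>\n"
bad = "<|turn>user\nhello</turn|>\n"

print(find_slash_closers(good), turn_counts(good))
print(find_slash_closers(bad), turn_counts(bad))
```

An unequal count is not always an error (a generation prompt legitimately ends with an open model turn), so treat the second check as a hint rather than a hard failure.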
## Tool Special Tokens (6 total)

| Token | Purpose |
|-------|---------|
@@ -98,3 +119,24 @@ This is what you actually use in practice. Ollama translates to/from native toke
- llama.cpp: format mismatches and continuous loops reported
- LM Studio: compatibility issues with tool calling
- **Workaround:** Use non-streaming mode for tool calls (proven in Simon)

## HF `transformers` Alternative (not needed if using Ollama)

If you ever route through HF `transformers` (v5.5.4+) instead of Ollama, there's a
cleaner parser than hand-rolled regex:

```python
# Assumes `processor`, `model`, `messages`, and `TOOLS` are already defined.
inputs = processor.apply_chat_template(
    messages, tools=TOOLS, enable_thinking=True,
    add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
)
out = model.generate(**inputs)
# Decode the generated ids, then let the processor split out the fields.
parsed = processor.parse_response(processor.decode(out[0]))
# -> {"thinking": "...", "content": "...", "tool_calls": [...]}
```

`parse_response` uses `response_schema` + `x-regex` fields baked into
`tokenizer_config.json` (downloaded at `tooling/huggingface/model-cards/`). For
Ollama users this is informational: Ollama's server-side tool parser already does
the equivalent and returns structured `tool_calls` in the chat response.

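On the Ollama side, consuming those structured `tool_calls` usually comes down to a small dispatch loop. The sketch below assumes the non-streaming chat response shape `{"message": {"tool_calls": [{"function": {"name": ..., "arguments": {...}}}]}}`; the registry, the `add` tool, and the shape of the returned tool messages are hypothetical and may need adjusting to the server version in use.

```python
import json


def dispatch_tool_calls(response, registry):
    """Run each requested tool and return one 'tool' message per call."""
    results = []
    message = response.get("message", {})
    for call in message.get("tool_calls") or []:
        fn = call.get("function", {})
        name = fn.get("name")
        args = fn.get("arguments") or {}
        if isinstance(args, str):
            # Some OpenAI-style servers send arguments as a JSON string.
            args = json.loads(args)
        handler = registry.get(name)
        if handler is None:
            content = f"error: unknown tool {name!r}"
        else:
            content = str(handler(**args))
        results.append({"role": "tool", "content": content})
    return results


# Hand-written response standing in for a real server reply (hypothetical data).
fake_response = {
    "message": {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {"function": {"name": "add", "arguments": {"a": 2, "b": 3}}},
            {"function": {"name": "missing", "arguments": {}}},
        ],
    }
}
tool_messages = dispatch_tool_calls(fake_response, {"add": lambda a, b: a + b})
```

Returning an error string for an unknown tool, rather than raising, lets the model see the failure on the next turn and pick a different tool, which fits the sequential-call style described earlier.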
+29
@@ -168,6 +168,21 @@ Gemma 4 can generate `<unused>` or `<unused24>` tokens in an infinite loop on Vu

**Source:** [ggml-org/llama.cpp#21516](https://github.com/ggml-org/llama.cpp/issues/21516)

## MEDIUM: `google/gemma_pytorch` Abandoned for Gemma 4

**Severity: MEDIUM (wastes time on a dead-end path)**

The `google/gemma_pytorch` repo (last push 2025-05-30) has zero Gemma 4 support:
its variants validator only accepts Gemma 1/2/3 IDs. Anyone pointing at it as "the
official PyTorch reference" for Gemma 4 is wrong.

**Use instead:**
- **Inference:** `huggingface/transformers` (`AutoModelForMultimodalLM`, v5.5.4+)
- **Reference impl:** `google-deepmind/gemma` (JAX/Flax)
- **Serving:** Ollama / vLLM / llama.cpp

See `tooling/google-official/gemma-pytorch/README.md` for the original repo state.

## LOW: Fine-Tuning Ecosystem Issues

**Severity: LOW (only relevant if fine-tuning)**
@@ -177,6 +192,20 @@ Day-one issues for fine-tuners:
- PEFT couldn't handle Gemma4ClippableLinear (new vision encoder layer type)
- New `mm_token_type_ids` field required during training even for text-only data
- E2B/E4B show training loss of 13-15, which is normal for multimodal models (not a bug)
- **Flash Attention 2/4 incompatible:** Gemma 4's global-attention head_dim is 512;
  FA2 max is 256, FA4 max is 128. Training backends fall back to SDP or Flex Attention
  (Axolotl hard-codes `sdp_attention: true` for Gemma 4). Does not affect inference
  runtimes that already use SDP (Ollama, vLLM).
- **Fused LoRA kernels broken** (shared-KV layers). Axolotl disables
  `lora_mlp_kernel` / `qkv_kernel` / `o_kernel` for Gemma 4; Unsloth routes around it.
- **26B A4B MoE wants ≥8-bit LoRA**, not 4-bit QLoRA, because MoE expert quality degrades
  at 4-bit during training. Axolotl's ScatterMoE + expert-LoRA config is the only
  validated 4-bit MoE path. (This caveat is **training-only**; Q4_K_M inference is fine.)
- **New tool-call / channel tokens are learned embeddings.** If fine-tuning, set
  `modules_to_save=["lm_head","embed_tokens"]` + `ensure_weight_tying=True` in
  `LoraConfig`, or the adapter trains against frozen random vectors for them.

See `tooling/fine-tuning/recipe-recommendation.md` for the full training path.

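The Flash Attention bullet reduces to a simple guard: compare the model's head_dim against the kernel's limit and fall back when it is exceeded. The sketch below encodes only the limits quoted above (256 for FA2, 128 for FA4); the backend labels and the `pick_attention_backend` helper are hypothetical and are not an API of Axolotl, Unsloth, or `transformers`.

```python
# Maximum head_dim per kernel, taken from the bullet above.
FLASH_MAX_HEAD_DIM = {
    "flash_attention_2": 256,
    "flash_attention_4": 128,
}


def pick_attention_backend(head_dim, preferred="flash_attention_2", fallback="sdpa"):
    """Return `preferred` unless head_dim exceeds its known limit."""
    limit = FLASH_MAX_HEAD_DIM.get(preferred)
    if limit is not None and head_dim > limit:
        return fallback
    return preferred


GEMMA4_GLOBAL_HEAD_DIM = 512  # value quoted in the bullet above

print(pick_attention_backend(GEMMA4_GLOBAL_HEAD_DIM))  # sdpa
print(pick_attention_backend(128))                     # flash_attention_2
```

A check like this is most useful at config-load time, so a training run fails over to SDP up front instead of crashing inside the first attention call.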
## LOW: Vision Validator Overrejects

+10
-3
@@ -7,6 +7,12 @@

Gemma 4 is an ultra-compliant, highly capable model that doesn't know who it is. It doesn't need hand-holding on tasks but needs explicit instructions in the system prompt about identity, boundaries, and output format. It needs `num_predict` increased (Ollama defaults are absurdly low), `think` set to false (thinking eats the context budget), and `format: json` avoided entirely (causes infinite loops). Due to its fast speed and free local inference, sequential tool calls are the ideal solution to tasks that would otherwise require long structured output.

> **For canonical upstream source (model cards, chat templates, serving commands,
> fine-tuning recipes, specialized siblings like EmbeddingGemma/ShieldGemma): see
> `tooling/README.md`.** That directory is 147 files / 14 MB of first-party material
> pulled from Google / Hugging Face / framework maintainers. This SYNTHESIS is the
> opinionated digest; `tooling/` is the receipts.

## Mental Model

Think of Gemma 4 as a very competent employee on their first day. They can do the work, so you don't need to explain how. But you DO need to explain:
@@ -165,10 +171,11 @@ Vision is on ALL Gemma 4 variants (E2B, E4B, 26B, 31B). Audio is E-series only.

| Use Case | Recommended | Why |
|----------|------------|-----|
| Production pipeline (needs GPU coexistence) | `gemma4:26b` | Best quality/speed/VRAM balance |
| On-device / edge | `gemma4:e4b-it-q8_0` | 12GB VRAM, vision+audio |
| Maximum quality (single-model GPU) | `gemma4:31b-it-q4_K_M` | Sharpest but slow under memory pressure |
| Production pipeline (needs GPU coexistence) | `gemma4:26b` | MoE (3.8B active), fast, good quality/VRAM balance |
| On-device / edge | `gemma4:e4b-it-q8_0` | 12GB VRAM, vision+audio (audio via llama.cpp only) |
| Maximum quality (single-model GPU) | `gemma4:31b-it-q4_K_M` | Dense 31B, sharpest but 5x slower, more VRAM pressure |
| Rapid prototyping / testing | `gemma4:26b` | Fast enough for interactive dev |
| Retrieval / embeddings | `embeddinggemma` (308M, separate model) | Gemma 4 has no embedding mode; use the sibling |

## Anti-Patterns

+20
-11
@@ -16,27 +16,36 @@ Actual scripts, notebooks, model cards, and configs downloaded from Google, Hugg

## Findings that update / contradict the existing corpus

These are real gaps worth patching into `SYNTHESIS.md`, `GOTCHAS.md`, or `CORPUS_tool_calling_format.md`. Flagged here, not applied; the user asked for research, not a rewrite.
These were merged into the top-level corpus docs on 2026-04-18. Each finding below
is marked **[merged: file]** where it landed, or **[flagged]** if it's informational
only. Scan here for provenance; read the CORPUS / SYNTHESIS / GOTCHAS files for the
authoritative working text.

1. **Prompt-token format changed in Gemma 4.** Gemma 1/2/3 used `<start_of_turn>user ... <end_of_turn>`. Gemma 4 uses asymmetric pipe-brackets: `<|turn>user\n ... <turn|>`. Also new: `<|think|>`, `<|channel>thought...<channel|>`, `<|tool>`, `<|tool_call>`, `<|tool_response>` (+ inverses), `<|image>`, `<|audio>`, and string delimiter `<|"|>`. The existing `CORPUS_tool_calling_format.md` documents the tool tokens but doesn't reflect the turn-token change or the thinking/channel tokens. Canonical source: `huggingface/model-cards/gemma-4-31B-it-chat_template.jinja` and `google-official/docs/ai-google-dev_prompt_formatting_gemma4.html`.
1. **Prompt-token format changed in Gemma 4.** Gemma 1/2/3 used `<start_of_turn>user ... <end_of_turn>`. Gemma 4 uses asymmetric pipe-brackets: `<|turn>user\n ... <turn|>`. Also new: `<|think|>`, `<|channel>thought...<channel|>`, `<|tool>`, `<|tool_call>`, `<|tool_response>` (+ inverses), `<|image>`, `<|audio>`, and string delimiter `<|"|>`. Canonical source: `huggingface/model-cards/gemma-4-31B-it-chat_template.jinja` and `google-official/docs/ai-google-dev_prompt_formatting_gemma4.html`. **[merged: CORPUS_tool_calling_format.md, added Chat Template Context section]**

2. **`google/gemma_pytorch` is abandoned for Gemma 4.** Last push 2025-05-30; the variants validator rejects Gemma 4 IDs. Anyone pointing at it as the PyTorch reference is wrong; use HF `transformers` or `google-deepmind/gemma` (JAX/Flax) instead.
2. **`google/gemma_pytorch` is abandoned for Gemma 4.** Last push 2025-05-30; the variants validator rejects Gemma 4 IDs. Use HF `transformers` or `google-deepmind/gemma` (JAX/Flax) instead. **[merged: GOTCHAS.md, MEDIUM severity section]**

3. **`gemma.cpp` ships a Gemini-API-compatible local HTTP server** (`gemma_api_server`, endpoint `POST /v1beta/models/<model>:generateContent`, SSE streaming). This is a Google-authored alternative to Ollama that speaks the real Gemini REST API, possibly the single most interesting discovery in this research pass. See `google-official/gemma-cpp/API_SERVER_README.md`.
3. **`gemma.cpp` ships a Gemini-API-compatible local HTTP server** (`gemma_api_server`, endpoint `POST /v1beta/models/<model>:generateContent`, SSE streaming). Google-authored alternative to Ollama that speaks the real Gemini REST API. See `google-official/gemma-cpp/API_SERVER_README.md`. **[flagged: not merged; no current homelab use case, but worth knowing it exists]**

4. **Transformers exposes `AutoModelForMultimodalLM` (new AutoClass)**, not `AutoModelForCausalLM`. It also exposes `processor.parse_response(..., response_schema=...)` driven from `tokenizer_config.json`, which replaces the hand-rolled regex in the current `CORPUS_tool_calling_format.md`. Pin: `transformers>=5.5.4`.
4. **Transformers exposes `AutoModelForMultimodalLM` (new AutoClass)**, not `AutoModelForCausalLM`. It also exposes `processor.parse_response(..., response_schema=...)` driven from `tokenizer_config.json`. Pin: `transformers>=5.5.4`. **[merged: CORPUS_tool_calling_format.md, HF transformers Alternative section]**

5. **Gemma 4 breaks Flash Attention.** FA2's max head_dim is 256, FA4's is 128, and Gemma 4's global head_dim is 512. Use SDP or Flex Attention. Axolotl hard-codes `sdp_attention: true` for Gemma 4. This belongs in `GOTCHAS.md`.
5. **Gemma 4 breaks Flash Attention** (training only). FA2's max head_dim is 256, FA4's is 128, and Gemma 4's global head_dim is 512. Use SDP or Flex Attention. Does not affect Ollama / vLLM inference which already use SDP. **[merged: GOTCHAS.md, under LOW: Fine-Tuning Ecosystem Issues]**

6. **The 26B variant is a MoE**: `gemma-4-26B-A4B` (A4B = 4B active per token). Quantization rules differ: Unsloth says use 16-bit LoRA, not 4-bit QLoRA, for acceptable quality. Axolotl's ScatterMoE + expert-LoRA config is the only tool validated for 4-bit MoE training. Worth a line in `CORPUS_ollama_variants.md`.
6. **The 26B variant is a MoE**: `gemma-4-26B-A4B`, 25.2B total / 3.8B active, 8 experts of 128 + 1 shared. Q4_K_M inference is fine (standard for MoE; Mixtral/DeepSeek ship same). The "MoE quality degrades at 4-bit" concern is training-time only. **[merged: CORPUS_ollama_variants.md, annotated 26b row; GOTCHAS.md, training caveat in fine-tuning section]**

7. **No Gemma 4 technical report PDF exists yet** as of 2026-04-18. DeepMind repo says "Gemma 4 (Coming soon)". Gemma 3 report (downloaded at `google-official/tech-report/Gemma3Report.pdf`) remains the closest authoritative family citation.
7. **No Gemma 4 technical report PDF exists yet** as of 2026-04-18. DeepMind repo says "Gemma 4 (Coming soon)". Gemma 3 report is at `google-official/tech-report/Gemma3Report.pdf`. **[flagged: nothing to merge; check back mid-2026]**

8. **No `google/gemma-4-*` specialized siblings yet.** ShieldGemma, CodeGemma, PaliGemma, MedGemma, DataGemma are all still on Gemma 2 or 3 base. Historical lag is 3–6 months; expect siblings-on-4 mid-to-late 2026.
8. **No Gemma-4-generation specialized siblings yet.** ShieldGemma 2 is Gemma 3-based, CodeGemma on Gemma 2, PaliGemma 2 on Gemma 2, EmbeddingGemma on Gemma 3, etc. All still usable; just don't confuse the sibling generation with the base-model generation. Historical lag is 3–6 months; expect siblings-on-4 mid-to-late 2026. **[merged: CORPUS_capabilities.md, "What Gemma 4 Does NOT Do" now points at EmbeddingGemma for retrieval; full catalog in `gemma-family/index.md`]**

9. **No Gemma-4-specific TRL script in `huggingface/trl` yet.** HF blog says "fully supported," but the SFT/DPO/GRPO examples are still on Gemma 3 model IDs. Drop-in with `model_id` swap works. Only Gemma-4-dedicated TRL example today is `huggingface-gemma-recipes/carla_vlm_gemma.py` (VLM GRPO).
9. **No Gemma-4-specific TRL script in `huggingface/trl` yet.** HF blog says "fully supported," but the SFT/DPO/GRPO examples are still on Gemma 3 model IDs. Drop-in with `model_id` swap works. Only Gemma-4-dedicated TRL example today is `huggingface-gemma-recipes/carla_vlm_gemma.py` (VLM GRPO). **[flagged: only relevant if fine-tuning]**

10. **HF Spaces `app.py` files are the shortest Gemma 4 inference examples.** Google and HF both use them as ref. See `huggingface/spaces/huggingface-projects_gemma-4-{31b,e4b}-it-app.py`.
10. **HF Spaces `app.py` files are the shortest Gemma 4 inference examples.** Google and HF both use them as ref. See `huggingface/spaces/huggingface-projects_gemma-4-{31b,e4b}-it-app.py`. **[flagged: reference material]**

11. **Native object detection with bbox output.** Prompt `"Detect the X in this image"` → structured `{box_2d: [ymin, xmin, ymax, xmax]}` in 1000×1000-normalized coords. First-class Gemma 4 capability, no separate detection model needed. **[merged: CORPUS_capabilities.md, Native Object Detection section]**

12. **Native `system` role support.** New in Gemma 4; Gemma 3 prepended system as a user turn. Matters if you were hand-building the prompt string; invisible if you use Ollama `system` or HF `apply_chat_template`. **[merged: CORPUS_capabilities.md, Text section]**

13. **Audio input is E-series only AND not via Ollama.** Requires llama.cpp's `mmproj-*-E*B-it-*.gguf` projector or vLLM's `input_features_padded`. **[merged: CORPUS_ollama_variants.md and CORPUS_capabilities.md]**

## Immediate homelab plug-ins (from the gemma-family research)
