Mortdecai eecebe7ef5 docs: add canonical tooling corpus (147 files) from Google/HF/frameworks
Five-lane parallel research pass. Each subdir under tooling/ has its own
README indexing downloaded files with verified upstream sources.

- google-official/: deepmind-gemma JAX examples, gemma_pytorch scripts,
  gemma.cpp API server docs, google-gemma/cookbook notebooks, ai.google.dev
  HTML snapshots, Gemma 3 tech report
- huggingface/: 8 gemma-4-* model cards, chat-template .jinja files,
  tokenizer_config.json, transformers gemma4/ source, launch blog posts,
  official HF Spaces app.py
- inference-frameworks/: vLLM/llama.cpp/MLX/Keras-hub/TGI/Gemini API/Vertex AI
  comparison, run_commands.sh with 8 working launches, 9 code snippets
- gemma-family/: 12 per-variant briefs (ShieldGemma 2, CodeGemma, PaliGemma 2,
  Recurrent/Data/Med/TxGemma, Embedding/Translate/Function/Dolphin/SignGemma)
- fine-tuning/: Unsloth Gemma 4 notebooks, Axolotl YAMLs (incl 26B-A4B MoE),
  TRL scripts, Google cookbook fine-tune notebooks, recipe-recommendation.md

Findings that update earlier CORPUS_* docs are flagged in tooling/README.md
(not applied) — notably the new <|turn>/<turn|> prompt format, gemma_pytorch
abandonment, gemma.cpp Gemini-API server, transformers AutoModelForMultimodalLM,
FA2 head_dim=512 break, 26B-A4B MoE quantization rules, no Gemma 4 tech
report PDF yet, no Gemma-4-generation specialized siblings yet.

Pre-commit secrets hook bypassed per user authorization — flagged "secrets"
are base64 notebook cell outputs and example Ed25519 keys in the HDP
agentic-security demo, not real credentials.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:24:48 -04:00

# Gemma 4 Fine-Tuning Tooling — Index
Research captured 2026-04-18. All downloads verified against upstream repos.
## TL;DR
| Tool | Gemma 4 coverage | GPU floor (LoRA) | GPU floor (full FT) | Best at |
|------|------------------|------------------|---------------------|---------|
| **Unsloth** | Full parity — all 4 sizes, text/vision/audio/GRPO/RL | E2B: 8 GB, E4B: 17 GB, 26B A4B: ~40 GB, 31B QLoRA: 22 GB | Not recommended locally | **Fastest path**, Google-blessed, free Colab |
| **TRL** | Partial — no `sft_gemma4.py` yet; `sft_gemma3.py` + `AutoModelForImageTextToText` works | Same as Unsloth w/ `load_in_4bit` | 2x H100 min for 31B | Research-grade control, DPO/GRPO/online RL, VLM GRPO on Gemma 4 (CARLA) |
| **Axolotl** | **Native Gemma 4 configs shipped** (`examples/gemma4/`) | Single 5090 (32 GB) for 26B A4B QLoRA validated | >80 GB, "not tested" per README | Declarative YAML, multi-GPU FSDP, MoE expert LoRA |
| **Google cookbook** | `docs/core/*` notebooks default to `google/gemma-4-E2B` | Depends on Colab tier | L4 (22 GB) for E4B QLoRA | Canonical baseline, paired with ai.google.dev docs |
| **HF gemma-recipes** | Inference + one GRPO VLM script (CARLA) | E2B on T4 | — | VLM GRPO with tool-calling environment |
| **Ollama** | Serves fine-tuned Gemma 4 via Modelfile `ADAPTER` | — | — | Final serving step |

**Recommendation for Seth: Unsloth.** See `recipe-recommendation.md`.
---
## 1. Unsloth (`unsloth/`)
**Upstream:** `unslothai/notebooks`, `unslothai/unsloth`
**License:** LGPL-3.0 (notebooks), Apache-2.0 (library)
**Published Gemma 4 Dynamic quants:**
- `unsloth/gemma-4-{E2B,E4B,31B,26B-A4B}-{,it}-unsloth-bnb-4bit` (dynamic 4-bit)
- `unsloth/gemma-4-{E2B,E4B,31B,26B-A4B}-it-GGUF` (GGUF for inference)
- Collection: https://huggingface.co/collections/unsloth/gemma-4
**Downloaded files (local paths under this directory):**
- `unsloth/notebooks/Gemma4_(E2B)-Text.ipynb` — **canonical SFT notebook, T4-compatible**
- `unsloth/notebooks/Gemma4_(E4B)-Text.ipynb` — 10 GB VRAM, higher accuracy
- `unsloth/notebooks/Gemma4_(26B_A4B)-Text.ipynb` — MoE SFT (needs A100+)
- `unsloth/notebooks/Gemma4_(31B)-Text.ipynb` — dense 31B SFT
- `unsloth/notebooks/Gemma4_(E2B|E4B|26B_A4B|31B)-Vision.ipynb` — vision SFT w/ `UnslothVisionDataCollator`
- `unsloth/notebooks/Gemma4_(E2B|E4B)-Audio.ipynb` — audio SFT (E2B/E4B only — 31B/26B have no audio encoder)
- `unsloth/notebooks/Gemma4_(E2B)_GRPO.ipynb` — GRPO RL w/ Python reward funcs
- `unsloth/notebooks/Gemma4_(E2B)_Reinforcement_Learning_{2048,Sudoku}_Game.ipynb` — game-playing RL
- `unsloth/python_scripts/*.py` — the same notebooks exported as `.py` scripts (easier to grep/modify)
- `unsloth/kaggle/Gemma4_(31B)-Text.ipynb`, `unsloth/kaggle/Gemma4_(E4B)-Text.ipynb` — Kaggle-flavored variants
- `unsloth/docs/unsloth-README.md` — top-level Unsloth README
**Upstream URLs (useful to share):**
- SFT E4B Colab: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma4_(E4B)-Text.ipynb
- GRPO Colab: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma4_(E2B)_GRPO.ipynb
- Unsloth Gemma 4 docs: https://unsloth.ai/docs/models/gemma-4/train
### Unsloth chat-template & masking detail (CRITICAL for Gemma 4)
Gemma 4 does **not** use Gemma 3's `<start_of_turn>` / `<end_of_turn>`. The new format is:
```
<bos><|turn>user
Hello<turn|>
<|turn>model
Hey there!<turn|>
```
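As a rough illustration of the layout above, the following sketch renders a `role`/`content` message list by hand. The helper name `render_gemma4_turns`, the `assistant` to `model` role rename, and the single newline between turns are assumptions read off the example, not the tokenizer's actual Jinja template; real training code should call `apply_chat_template` instead.

```python
from typing import Dict, List

BOS = "<bos>"
TURN_OPEN = "<|turn>"
TURN_CLOSE = "<turn|>"


def render_gemma4_turns(messages: List[Dict[str, str]]) -> str:
    """Render role/content messages in the turn layout shown above (illustrative only)."""
    # Assumption: the chat role "assistant" is written as "model" in the prompt.
    role_names = {"assistant": "model"}
    turns = []
    for message in messages:
        role = role_names.get(message["role"], message["role"])
        turns.append(f"{TURN_OPEN}{role}\n{message['content']}{TURN_CLOSE}")
    # Assumption: turns are separated by a single newline, as in the example.
    return BOS + "\n".join(turns)


example = render_gemma4_turns([
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hey there!"},
])
print(example)
```

Comparing this string against the output of the real tokenizer on the same messages is a quick way to spot a stale Gemma 3 template.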
Unsloth's helper:
```python
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(tokenizer, chat_template = "gemma-4") # literal "gemma-4", not "gemma4"
```
Response-only masking (matches Unsloth's convention; everything *before* `response_part` is loss-masked):
```python
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|turn>user\n",
    response_part = "<|turn>model\n",
)
```
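To make the masking idea concrete, here is a minimal pure-Python sketch of response-only labels on toy token ids. It is not Unsloth's implementation: `find_all`, `mask_non_responses`, and the integer ids are hypothetical, and the only outside fact relied on is that `-100` is the default ignore index of PyTorch's cross-entropy loss.

```python
from typing import List

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def find_all(haystack: List[int], needle: List[int]) -> List[int]:
    """Return every start index at which `needle` occurs in `haystack`."""
    n = len(needle)
    return [i for i in range(len(haystack) - n + 1) if haystack[i:i + n] == needle]


def mask_non_responses(input_ids: List[int],
                       instruction_ids: List[int],
                       response_ids: List[int]) -> List[int]:
    """Build labels that keep only the tokens inside response spans."""
    labels = [IGNORE_INDEX] * len(input_ids)
    instruction_starts = find_all(input_ids, instruction_ids)
    for start in find_all(input_ids, response_ids):
        body = start + len(response_ids)  # first token after the response marker
        # A span ends at the next instruction marker, or at the end of the sequence.
        end = min((i for i in instruction_starts if i >= body), default=len(input_ids))
        labels[body:end] = input_ids[body:end]
    return labels


# Toy ids (hypothetical): 1 stands for the tokenized "<|turn>user\n" marker,
# 2 for the tokenized "<|turn>model\n" marker.
ids = [0, 1, 10, 11, 2, 20, 21, 1, 12, 2, 22]
labels = mask_non_responses(ids, instruction_ids=[1], response_ids=[2])
print(labels)
```

Only the tokens that follow a response marker keep their ids; the markers themselves and every user-turn token are set to the ignore index.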
`<bos>` gotcha: `apply_chat_template` prepends `<bos>`; Unsloth's `formatting_prompts_func` strips it with `.removeprefix('<bos>')` because the SFTTrainer's data collator adds its own — double `<bos>` silently degrades training.
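The double-`<bos>` problem is easy to check on plain strings before any tokenization. The sketch below assumes hypothetical helper names (`strip_template_bos`, `count_leading_bos`) and simulates the later step that adds its own `<bos>` with simple concatenation; it does not call Unsloth or TRL.

```python
BOS = "<bos>"


def strip_template_bos(text: str) -> str:
    """Drop one leading <bos> so the <bos> added later is the only one."""
    return text.removeprefix(BOS)  # str.removeprefix requires Python 3.9+


def count_leading_bos(text: str) -> int:
    """Count how many <bos> markers are stacked at the start of `text`."""
    count = 0
    while text.startswith(BOS, count * len(BOS)):
        count += 1
    return count


templated = "<bos><|turn>user\nHello<turn|>\n"   # what a chat template returns
doubled = BOS + templated                         # later step adds <bos> again
fixed = BOS + strip_template_bos(templated)       # strip first, then add

print(count_leading_bos(doubled), count_leading_bos(fixed))
```

Running `count_leading_bos` over a decoded training batch is a cheap sanity check before a long run.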
**Tool tokens (`<|tool>`, `<|tool_call>`, `<|tool_response>`, `<|"|>`) are *not* masked** in Unsloth's default setup — they flow through as plain text inside user/assistant turns. If you're fine-tuning on tool-call data, include full `<|tool_call>...<tool_call|>` markup in the assistant `content` field; the template doesn't need a special `role=tool` branch.
### Unsloth MoE note
For 26B A4B (128 experts): Unsloth explicitly recommends **bf16/16-bit LoRA, NOT 4-bit QLoRA** ("MoE QLoRA not recommended, dense 31B is fine"). Their notebook uses `load_in_4bit = True` at >40 GB but the docs flag this as suboptimal.
---
## 2. TRL (`trl/`)
**Upstream:** `huggingface/trl`
**License:** Apache-2.0
**Gemma 4-specific scripts:** NONE in `examples/scripts/` as of 2026-04-18. The canonical Gemma 4 TRL example lives in `huggingface-gemma-recipes/scripts/carla_vlm_gemma.py` (see section 5).
**Closest-match Gemma 3 scripts downloaded (drop-in for Gemma 4 — change `model_id` to `google/gemma-4-*-it`, keep `AutoModelForImageTextToText`):**
- `trl/sft_gemma3.py` — **use this as the Gemma 4 SFT template**. Pure text SFT (Codeforces-COTS).
- `trl/sft_vlm_gemma3.py` — vision SFT template (uses `AutoModelForImageTextToText`, `all-linear` LoRA).
- `trl/sft.py`, `trl/trl_scripts_sft.py` — the generic SFTTrainer wrappers.
- `trl/sft_vlm.py` — model-agnostic VLM SFT.
- `trl/dpo.py` — DPO (1-liner using TrlParser).
- `trl/grpo_agent.py`, `trl/grpo_vlm.py` — GRPO with tool-calling environments.
- `trl/sft_tiny_aya_tool_calling.py` — tool-calling SFT pattern.
**Chat template / masking detail:** TRL's `SFTTrainer` applies `tokenizer.apply_chat_template` end-to-end, delegating to the tokenizer's built-in Jinja template. For `google/gemma-4-*-it`, that template already produces `<|turn>user…<turn|>`. TRL supports assistant-only loss via `SFTConfig(assistant_only_loss=True)` (TRL ≥ 0.22), which masks every token outside the assistant turns, so no manual `instruction_part` plumbing is needed. (`completion_only_loss` is a separate `SFTConfig` option for prompt-completion datasets.)
### Official HF blog says (verbatim):
> "Gemma 4 is fully supported for fine-tuning with TRL. … we have prepared an example on how to fine-tune Gemma 4 with TRL on Vertex AI using SFT, to showcase how to extend the function calling capabilities, **whilst freezing both the vision and audio towers**."
(see `huggingface-recipes/hf-blog-gemma4.md` §634-687)
---
## 3. Axolotl (`axolotl/`)
**Upstream:** `axolotl-ai-cloud/axolotl`, `examples/gemma4/`
**License:** Apache-2.0
**Gemma 4 status:** **Native support shipped**, day-one-class parity.
**Downloaded files:**
- `axolotl/README.md` — official Axolotl Gemma 4 guide
- `axolotl/31b-qlora.yaml` — 31B dense QLoRA, 1x80GB @ ~44 GB VRAM
- `axolotl/31b-qlora-flex.yaml` — 31B dense QLoRA + Flex Attention, 1x80GB @ ~26 GB (40% less VRAM, 50% throughput cost)
- `axolotl/26b-a4b-moe-qlora.yaml` — 26B MoE QLoRA + ScatterMoE expert-quantized + Expert-LoRA. Validated: 50 steps FineTome, loss 8.8→1.8, single RTX 5090 (32 GB), 21 GiB peak
- `axolotl/e2b-vision-lora.yaml` — E2B vision LoRA with `freeze_mm_modules: true`
**Run command (from Axolotl README):**
```bash
axolotl train examples/gemma4/26b-a4b-moe-qlora.yaml
axolotl train examples/gemma4/31b-qlora.yaml
axolotl train examples/gemma4/31b-qlora-flex.yaml
```
### Axolotl chat template & masking detail
```yaml
chat_template: gemma4
datasets:
  - path: mlabonne/FineTome-100k
    type: chat_template
    field_messages: conversations
    message_property_mappings:
      role: from
      content: value
```
`chat_template: gemma4` (no dash — Axolotl's key is different from Unsloth's `"gemma-4"`). The template applies Gemma 4 turn tokens (`<|turn>user … <turn|>`). Masking is handled automatically by `type: chat_template` — only the assistant turn counts toward loss.
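The effect of `field_messages` and `message_property_mappings` can be sketched in plain Python: each message's source keys are renamed to the `role`/`content` keys a chat template expects. This is an illustration of the mapping only, not Axolotl's code; `remap_messages` and the sample row are hypothetical, and any normalization of role values (for example ShareGPT-style names to `user`/`assistant`) is a separate step not shown here.

```python
from typing import Dict, List

# Mirrors the YAML above: target property -> source property in the dataset row.
MESSAGE_PROPERTY_MAPPINGS = {"role": "from", "content": "value"}


def remap_messages(row: Dict,
                   field_messages: str = "conversations",
                   mappings: Dict[str, str] = MESSAGE_PROPERTY_MAPPINGS) -> List[Dict[str, str]]:
    """Rename each message's keys so a chat template sees `role` and `content`."""
    return [
        {target: message[source] for target, source in mappings.items()}
        for message in row[field_messages]
    ]


# Hypothetical ShareGPT-style row; the values are made up for illustration.
row = {"conversations": [
    {"from": "human", "value": "What is LoRA?"},
    {"from": "gpt", "value": "A low-rank adapter method."},
]}
messages = remap_messages(row)
print(messages)
```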
### Axolotl hard limitations for Gemma 4 (from their README)
- **Flash Attention OFF.** FA2 caps head_dim at 256; FA4 at 128; Gemma 4's `global_head_dim=512` exceeds both. **Use SDP or Flex Attention.** (`sdp_attention: true` in every yaml.)
- **LoRA kernels OFF.** Due to Gemma 4's shared-KV layers (last N layers reuse K/V tensors): `lora_mlp_kernel: false`, `lora_qkv_kernel: false`, `lora_o_kernel: false`.
- **`lora_target_linear` is incompatible** for multimodal. You MUST use `lora_target_modules` with the regex (see below) to restrict LoRA to the text decoder and NOT the vision/audio encoders.
Axolotl's canonical regex restricts LoRA to text layers only:
```regex
model.language_model.layers.[\d]+.(_checkpoint_wrapped_module.)?(mlp|self_attn).(up|down|gate|q|k|v|o)_proj
```
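A quick way to see what this pattern selects is to run it over a few module names with Python's `re` module. The module names below are hypothetical examples rather than names read from a real checkpoint, and the use of `re.fullmatch` is an assumption about how the pattern is applied.

```python
import re

# Pattern copied from the block above; its unescaped dots match any character.
TEXT_ONLY_LORA = (
    r"model.language_model.layers.[\d]+."
    r"(_checkpoint_wrapped_module.)?(mlp|self_attn)."
    r"(up|down|gate|q|k|v|o)_proj"
)


def is_lora_target(module_name: str) -> bool:
    """Assumption: the pattern must match the whole module name."""
    return re.fullmatch(TEXT_ONLY_LORA, module_name) is not None


# Hypothetical module names for illustration.
candidates = [
    "model.language_model.layers.3.self_attn.q_proj",
    "model.language_model.layers.3._checkpoint_wrapped_module.mlp.down_proj",
    "model.vision_tower.encoder.layers.0.self_attn.q_proj",
    "model.language_model.layers.3.self_attn.q_norm",
]
for name in candidates:
    print(f"{name}: {is_lora_target(name)}")
```

The first two names match; the vision-tower projection and the non-projection module do not, which is the restriction to text-decoder projections described above.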
For 26B A4B MoE, additionally target expert 3D tensors:
```yaml
lora_target_parameters:
- experts.gate_up_proj
- experts.down_proj
```
---
## 4. Google Cookbook (`google-cookbook/`)
**Upstream:** `google-gemma/cookbook`, `docs/core/`
**License:** Apache-2.0
**Gemma 4 status:** The `docs/core/*.ipynb` fine-tuning notebooks default to `google/gemma-4-E2B` as `model_id` — they ARE the Gemma 4 path, despite generic filenames.
**Downloaded files:**
- `google-cookbook/huggingface_text_finetune_qlora.ipynb` — **text-to-SQL QLoRA tutorial** (gretel-synthetic-text-to-sql dataset, `philschmid/gretel-synthetic-text-to-sql`). This is the one ai.google.dev links to as the "official" fine-tune path.
- `google-cookbook/huggingface_text_full_finetune.ipynb` — full-weights fine-tune variant
- `google-cookbook/huggingface_vision_finetune_qlora.ipynb` — vision QLoRA on product descriptions
- `google-cookbook/lora_tuning.ipynb` — LoRA concepts tutorial
- `google-cookbook/function-calling-gemma4.ipynb` — official Google function-calling notebook (not a fine-tune, but the authoritative reference for tool-call tokens)
- `google-cookbook/Gemma_4_HDP_Agentic_Security.ipynb` + `Gemma_4_HDP_README.md` — full-app fine-tune example (agentic security)
**Upstream URLs:**
- https://ai.google.dev/gemma/docs/core/huggingface_text_finetune_qlora
- https://ai.google.dev/gemma/docs/core/huggingface_vision_finetune_qlora
- https://ai.google.dev/gemma/docs/capabilities/text/function-calling-gemma4
### Google cookbook chat template & masking detail (VERY IMPORTANT)
The cookbook notebooks use TRL's `SFTTrainer` with standard `messages` list (`role`/`content`) — chat-template is applied automatically by the tokenizer's built-in Jinja. No manual `instruction_part`/`response_part`.
**The non-obvious detail** is the `LoraConfig`:
```python
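from peft import LoraConfig

peft_config = LoraConfig(
    lora_alpha=16, lora_dropout=0.05, r=16, bias="none",
    target_modules="all-linear",
    task_type="CAUSAL_LM",
    modules_to_save=["lm_head", "embed_tokens"],  # NOTE: makes the new special-token embeddings trainable
    ensure_weight_tying=True,                     # NOTE: keeps lm_head and embed_tokens tied
)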
```
`modules_to_save=["lm_head","embed_tokens"]` + `ensure_weight_tying=True` is required because **Gemma 4 introduced new special tokens (`<|turn>`, `<|tool>`, `<|tool_call>`, `<|tool_response>`, `<|"|>`) that need their embeddings to be trainable in a fine-tune.** PEFT 0.15+ added `ensure_weight_tying` specifically for this case. Skipping it causes the adapter to see frozen random embeddings for the new tokens and training silently underperforms.
For vision, Google's cookbook uses plain `target_modules="all-linear"` (NO `exclude_modules`) — meaning it *does* train LoRA adapters on the vision tower. This is a different tradeoff from Axolotl (`freeze_mm_modules: true`) and from TRL's CARLA recipe (`exclude_modules=["vision_tower", "multi_modal_projector"]`). Pick based on whether your task needs the vision encoder to adapt (e.g., new image domain) or just the text decoder (most cases).
---
## 5. HuggingFace gemma-recipes (`huggingface-recipes/`)
**Upstream:** `huggingface/huggingface-gemma-recipes`
**License:** Apache-2.0
**Downloaded files:**
- `huggingface-recipes/carla_vlm_gemma.py` — **The canonical TRL + Gemma 4 example.** GRPO VLM training in a CARLA driving environment with tool calls. Shows `exclude_modules=["vision_tower", "multi_modal_projector"]`, `chat_template_kwargs={"enable_thinking": False}`, `max_tool_calling_iterations=10`.
- `huggingface-recipes/Gemma4_(E2B)-Multimodal.ipynb` — **inference-only** multimodal demo (vision, video, audio, function calling, object detection). Not a fine-tune but necessary reference for the input format the training data must match.
- `huggingface-recipes/README.md` — HF's top-level recipes index
- `huggingface-recipes/hf-blog-gemma4.md` — the HF blog post's raw markdown (§630-707 is the fine-tuning section)
**Run command for the CARLA VLM RL example:**
```bash
pip install git+https://github.com/huggingface/trl.git
python examples/scripts/openenv/carla_vlm_gemma.py \
--env-urls https://sergiopaniego-carla-env.hf.space https://sergiopaniego-carla-env-2.hf.space \
--model google/gemma-4-E2B-it
```
**Known gap:** HF's gemma-recipes repo has *fine-tuning* notebooks for Gemma 3 and Gemma 3n (free T4 Colab) but **no pure-SFT Gemma 4 fine-tuning notebook yet** — the Gemma 4 Colab is inference only. Their blog points users to Unsloth Studio for the easy path.
---
## 6. Ollama / llama.cpp LoRA serving (`ollama-llamacpp/`)
**Downloaded:** `ollama-llamacpp/ollama-import-lora.md` — distilled from https://docs.ollama.com/import (2026-04-18 fetch).
**Short answer:** Yes, you can serve a Gemma 4 LoRA via Ollama. Two paths:
1. **Merge then serve (simpler, recommended):** `model.save_pretrained_merged("out", tokenizer, save_method="merged_16bit")` → `llama.cpp/convert_hf_to_gguf.py` → `llama.cpp/quantize` to Q4_K_M → `ollama create mymodel -f Modelfile` with `FROM ./gemma4-mortdecai.gguf`.
2. **Adapter-only serve:** `llama.cpp/convert_lora_to_gguf.py` on the PEFT directory → Modelfile with `FROM gemma4:e4b-it-q8_0` + `ADAPTER ./adapter.gguf`.
Ollama's docs list supported architectures as Llama/Mistral/Gemma 1-2 — Gemma 4 isn't *explicitly* listed, but llama.cpp has day-one Gemma 4 support and in practice the path works. (Vision-adapter serving via Ollama is still a grey area.)
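The two Modelfile variants described above differ only in whether an `ADAPTER` line follows `FROM`. A small sketch that builds both as strings, using the paths and tags quoted in the list; the helper name `build_modelfile` is hypothetical, and real Modelfiles may carry further directives not covered here.

```python
from typing import Optional


def build_modelfile(base: str, adapter: Optional[str] = None) -> str:
    """Return Modelfile text: a FROM line plus an optional ADAPTER line."""
    lines = [f"FROM {base}"]
    if adapter is not None:
        lines.append(f"ADAPTER {adapter}")
    return "\n".join(lines) + "\n"


# Path 1: merged and quantized GGUF served directly.
merged = build_modelfile("./gemma4-mortdecai.gguf")

# Path 2: stock base model plus a converted LoRA adapter.
adapter_only = build_modelfile("gemma4:e4b-it-q8_0", adapter="./adapter.gguf")

print(merged)
print(adapter_only)
```

Writing either string to a file named `Modelfile` and running `ollama create mymodel -f Modelfile` would complete the path.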
---
## 7. Datasets the canonical tutorials pair with Gemma 4
| Tutorial | Dataset | Format | Notes |
|----------|---------|--------|-------|
| Unsloth Gemma4 E4B Text | `mlabonne/FineTome-100k` | ShareGPT-style `conversations` field | Also the Axolotl default |
| Unsloth Gemma4 GRPO | Synthetic kernel-optimization prompts in-notebook | Python reward funcs | RL w/ `function_works` / `check_only_stdlib_imports` |
| Unsloth Gemma4 Vision | `unsloth/LaTeX_OCR` | HF image-text pairs | Demonstrates `UnslothVisionDataCollator` |
| Google cookbook text QLoRA | `philschmid/gretel-synthetic-text-to-sql` | chat `messages` list | Google's "official" demo dataset for Gemma 4 |
| Google cookbook vision QLoRA | `philschmid/amazon-product-descriptions-vlm` | image + text pairs | Product-description generation |
| Axolotl Gemma 4 (all sizes) | `mlabonne/FineTome-100k` | `type: chat_template` | Validated in axolotl README |
| Axolotl E2B vision LoRA | `HuggingFaceH4/llava-instruct-mix-vsft` | vision-language SFT | Same as HF's VLM template |
| TRL sft_gemma3 (transfers) | `open-r1/codeforces-cots` | `messages` list | Chain-of-thought coding |
| TRL carla_vlm_gemma (Gemma 4 VLM GRPO) | CARLA simulator (live) | environment rollouts | Multimodal tool responses |

No one uses Alpaca or UltraChat as the canonical Gemma 4 pair. **FineTome-100k is the unofficial standard** — both Unsloth and Axolotl default to it.
---
## 8. Chat-template-and-masking matrix (the debugging cheat sheet)
| Framework | chat_template key | Turn tokens | Response masking API | BOS handling |
|-----------|-------------------|-------------|----------------------|--------------|
| Unsloth | `"gemma-4"` | `<\|turn>role\n...<turn\|>` | `train_on_responses_only(instruction_part="<\|turn>user\n", response_part="<\|turn>model\n")` | Strip `<bos>` manually with `.removeprefix('<bos>')` before passing to trainer |
| TRL | tokenizer's built-in Jinja (no key needed) | same | `SFTConfig(assistant_only_loss=True)` | Tokenizer handles automatically |
| Axolotl | `chat_template: gemma4` (no dash) | same | automatic via `type: chat_template` | Automatic |
| Google cookbook | tokenizer built-in Jinja | same | automatic via `SFTTrainer` + `messages` | Automatic |

Tool tokens (`<|tool>`, `<|tool_call>`, `<|tool_response>`, `<|"|>`) ride inside message content — none of the frameworks mask them specially, and none provide a `role="tool"` branch in the default template. If you're training tool-call data, put the complete `<|tool_call>call:{...}<tool_call|>` block in the assistant message `content`.
Also: **all Gemma 4 fine-tunes should `modules_to_save=["lm_head","embed_tokens"]` + `ensure_weight_tying=True`** in LoraConfig if you're using PEFT directly, because the new special-token embeddings need to be trainable. Unsloth and Axolotl handle this for you; naïve TRL + PEFT scripts do NOT by default.
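Because tool-call markup travels as plain text inside the assistant `content`, a truncated or unclosed block will not be caught by the template. The following is a minimal validation sketch for training data, assuming only the `<|tool_call>` and `<tool_call|>` delimiters named above; `tool_calls_balanced` is a hypothetical helper, and the payload between the delimiters is treated as opaque.

```python
TOOL_CALL_OPEN = "<|tool_call>"
TOOL_CALL_CLOSE = "<tool_call|>"


def tool_calls_balanced(content: str) -> bool:
    """True when every <|tool_call> is closed before another one opens."""
    inside = False
    i = 0
    while i < len(content):
        if content.startswith(TOOL_CALL_OPEN, i):
            if inside:          # nested or repeated opener
                return False
            inside = True
            i += len(TOOL_CALL_OPEN)
        elif content.startswith(TOOL_CALL_CLOSE, i):
            if not inside:      # closer without an opener
                return False
            inside = False
            i += len(TOOL_CALL_CLOSE)
        else:
            i += 1
    return not inside           # an opener left unclosed fails the check


good = "Checking. <|tool_call>PAYLOAD<tool_call|>"
truncated = "Checking. <|tool_call>PAYLOAD"
print(tool_calls_balanced(good), tool_calls_balanced(truncated))
```

Filtering a dataset with a check like this before training keeps malformed examples out of the assistant turns.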
---
## What's NOT here (and why)
- **Kaggle/Colab free-tier notebooks as a separate category** — the Unsloth notebooks *are* the free-tier notebooks. E2B Text runs on a free T4; 31B/26B-A4B need A100 Colab Pro. I pulled 2 Kaggle-flavored variants to `unsloth/kaggle/` for completeness.
- **Google's DeepMind JAX/Flax Gemma 4 fine-tune script** — Google's DeepMind-gemma repo ships inference/reference code, not an SFT script. Google's *canonical* fine-tune path is the HF+TRL notebook in `google-gemma/cookbook` (above), NOT JAX. If you want JAX, see the archived `.archive/Gemma/[Gemma_1]Finetune_distributed.ipynb` pattern — not ported to Gemma 4.
- **Full-weights 31B fine-tuning commands** — Axolotl's README says "heavy and has not been tested." Skip unless Seth rents an 8×H100 pod.
- **Prompt engineering / inference-only notebooks** — per scope.
## See also
- `recipe-recommendation.md` — which tool Seth should actually use for his homelab, with the exact command.
- `../../GOTCHAS.md` §"Fine-Tuning Ecosystem Issues" — day-one issues (required `mm_token_type_ids` field, Gemma4ClippableLinear PEFT issue, E2B/E4B training loss 13-15 being normal).
- `../../CORPUS_tool_calling_format.md` — the 6 tool-calling special tokens.