# Recommended Gemma 4 Fine-Tuning Recipe (Seth's Homelab)

## TL;DR

**Use Unsloth. Rent a single H100 on Vast.ai. Fine-tune Gemma 4 E4B (or 31B QLoRA). Save GGUF. `ollama create` back to CT 105.**

Why not the alternatives:

- **Your 3090 Ti(s):** can handle E2B/E4B LoRA comfortably, but 26B A4B LoRA wants ~40 GB (more than one 24 GB card) and 31B QLoRA wants 22 GB (fits, tightly). Axolotl's 5090-validated configs need Flex Attention to fit, and you lose half the throughput. An H100 at $2-3/hr for 3-4 hours is cheaper than the time you'll spend tuning memory.
- **Axolotl** is great; in particular, the 26B MoE ScatterMoE+expert-LoRA config is genuinely novel and Unsloth doesn't match it. But Axolotl has more moving parts (FSDP, kernels, flex attention), breaks more subtly on config errors, and its docs are less Gemma-4-specific than Unsloth's.
- **TRL** has no Gemma-4-specific SFT script yet, so you'd be porting `sft_gemma3.py`. Useful if you need DPO/GRPO or multimodal tool-call GRPO (the CARLA recipe), but a heavier lift than Unsloth for plain SFT.
- **Google cookbook** works and is authoritative, but it is slower than Unsloth (no fused kernels) and the notebook format is noisier to modify.

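The rental-cost claim in the first bullet is easy to sanity-check. A minimal sketch that uses only the $2-3/hr and 3-4 hour figures quoted above; `rental_cost_range` is a hypothetical helper name, and real Vast.ai prices vary by listing.

```python
def rental_cost_range(rate_low, rate_high, hours_low, hours_high):
    """Return (cheapest, most expensive) total rental cost in dollars."""
    return rate_low * hours_low, rate_high * hours_high


# Figures quoted in the bullet above: $2-3/hr for 3-4 hours.
low, high = rental_cost_range(2.0, 3.0, 3, 4)
print(f"Estimated H100 rental: ${low:.2f} to ${high:.2f}")
```

With those inputs the whole run lands between $6 and $12, which is the number to weigh against an evening of memory tuning on the 3090 Ti.
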
## Exact command

### On a rented H100 (Vast.ai `vast-h100` alias, already configured)

```bash
ssh vast-h100
# one-time setup
pip install unsloth "trl==0.22.2" "transformers>=5.5.0" timm torchcodec
```

Training script (save as `finetune_gemma4.py` on the H100):

```python
from unsloth import FastModel
from unsloth.chat_templates import get_chat_template, standardize_data_formats, train_on_responses_only
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

MODEL = "unsloth/gemma-4-E4B-it"   # swap to "unsloth/gemma-4-31B-it" if you want more headroom
DATASET = "YOUR_DATASET_HERE"      # e.g. a mortdecai-style chat JSONL on HF Hub

# 1. Load model + tokenizer in 4-bit
model, tokenizer = FastModel.from_pretrained(
    model_name = MODEL,
    max_seq_length = 4096,
    load_in_4bit = True,
    full_finetuning = False,
)

# 2. Attach LoRA
model = FastModel.get_peft_model(
    model,
    finetune_vision_layers = False,   # text-only FT
    finetune_language_layers = True,
    finetune_attention_modules = True,
    finetune_mlp_modules = True,
    r = 16,
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
)

# 3. Chat template: "gemma-4" (literal, with dash)
tokenizer = get_chat_template(tokenizer, chat_template = "gemma-4")

# 4. Dataset: expects ShareGPT-style `conversations` field with {from, value}
#    OR OpenAI-style `messages` with {role, content}; standardize_data_formats handles both.
dataset = load_dataset(DATASET, split = "train")
dataset = standardize_data_formats(dataset)

def fmt(examples):
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(c, tokenize=False, add_generation_prompt=False)
            .removeprefix('<bos>')   # critical: avoid double <bos>
        for c in convos
    ]
    return {"text": texts}

dataset = dataset.map(fmt, batched=True)

# 5. Train
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        num_train_epochs = 1,
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.001,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none",
        output_dir = "outputs",
    ),
)

# 6. Mask everything except assistant turns
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|turn>user\n",
    response_part = "<|turn>model\n",
)

trainer.train()

# 7. Save merged 16-bit for GGUF conversion
model.save_pretrained_merged("merged_out", tokenizer, save_method = "merged_16bit")

# 8. OR save directly to GGUF (Q4_K_M), Ollama-ready
model.save_pretrained_gguf("gemma4-mortdecai-v1", tokenizer, quantization_method = "q4_k_m")
```

Run:

```bash
python finetune_gemma4.py
```

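For a rough idea of run length before launching, the `SFTConfig` values in the script imply an effective batch of 2 × 4 = 8 sequences per optimizer step on one GPU. The sketch below turns that into an approximate step count. The 4,000-example dataset size is a made-up figure for illustration, `optimizer_steps` is a hypothetical helper, and the trainer's own rounding may differ by a step.

```python
import math


def optimizer_steps(num_examples, per_device_batch, grad_accum, num_gpus=1, epochs=1):
    """Return (effective batch size, approximate optimizer steps for the run)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


# per_device_train_batch_size=2, gradient_accumulation_steps=4, num_train_epochs=1
effective, steps = optimizer_steps(4000, per_device_batch=2, grad_accum=4)
print(f"effective batch = {effective}, optimizer steps = {steps}")
```

At that size, `warmup_steps = 10` covers about 2% of the run; for a much smaller dataset the same setting becomes a larger share of training, so it is worth recomputing.
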
### Pulling the result back and serving on CT 105

```bash
# On the Vast box, upload to HF Hub or scp back:
scp -r vast-h100:~/gemma4-mortdecai-v1*.gguf steel141:/tmp/

# On CT 105 (pve197 Ollama):
cat > Modelfile <<'EOF'
FROM /path/to/gemma4-mortdecai-v1.Q4_K_M.gguf
PARAMETER num_ctx 8192
PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 64
SYSTEM "You are Mortdecai, a Minecraft ops AI. You are powered by Gemma 4."
EOF
ollama create mortdecai-gemma4:v1 -f Modelfile
ollama run mortdecai-gemma4:v1
```

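If you end up producing several fine-tunes, the Modelfile can be generated instead of hand-edited. A minimal sketch that emits only the `FROM`, `PARAMETER`, and `SYSTEM` lines used above; `render_modelfile` is a hypothetical helper, the GGUF path is the same placeholder as in the heredoc, and the system prompt is assumed to contain no double quotes.

```python
def render_modelfile(gguf_path, system_prompt, **params):
    """Build Modelfile text with one PARAMETER line per keyword argument."""
    lines = [f"FROM {gguf_path}"]
    lines += [f"PARAMETER {name} {value}" for name, value in params.items()]
    # Assumption: system_prompt has no embedded double quotes.
    lines.append(f'SYSTEM "{system_prompt}"')
    return "\n".join(lines) + "\n"


modelfile_text = render_modelfile(
    "/path/to/gemma4-mortdecai-v1.Q4_K_M.gguf",
    "You are Mortdecai, a Minecraft ops AI. You are powered by Gemma 4.",
    num_ctx=8192,
    temperature=1.0,
    top_p=0.95,
    top_k=64,
)
print(modelfile_text)
```

Writing `modelfile_text` to a file named `Modelfile` gives the same input that `ollama create ... -f Modelfile` consumes above.
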
## Hardware sizing guide (from Unsloth's verified numbers)

| Variant | LoRA | QLoRA | Full FT | My recommendation |
|---------|------|-------|---------|-------------------|
| E2B | 8-10 GB | 8 GB | ~20 GB | Free Colab T4; local 3090 Ti fine |
| E4B | 17 GB | 10 GB | ~32 GB | Local 3090 Ti (24 GB) tight but fine; H100 faster |
| 26B A4B | >40 GB (16-bit recommended, NOT 4-bit) | not recommended | n/a | H100 80 GB |
| 31B dense | >48 GB | 22 GB | 2×H100 | H100 80 GB or 2×3090 Ti FSDP |

For **Mortdecai-style behavior tuning** (matches your existing qwen-based setup), start with **E4B**. It's the sweet spot: more capable than qwen3 8B in the things that matter (Gemma 4 E4B beats Gemma 3 27B on most benchmarks), vision-capable if you want it, and small enough to fit on a single 3090 Ti locally.

For a **real coding/reasoning upgrade**, use **31B QLoRA on H100**. Unsloth's 31B QLoRA notebook is the canonical recipe there.

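The sizing table can also be encoded as a small lookup to answer "what fits on this card?". This is a sketch only: the numbers are copied from the table, the upper end of a range is used (10 GB for E2B LoRA), the ">40 GB" and ">48 GB" entries are compared strictly, and methods the table marks as not recommended or without a single-GPU figure are left out. `options_that_fit` is a hypothetical helper and applies no safety headroom.

```python
# VRAM figures (GB) copied from the sizing table above.
VRAM_GB = {
    ("E2B", "lora"): 10,
    ("E2B", "qlora"): 8,
    ("E2B", "full"): 20,
    ("E4B", "lora"): 17,
    ("E4B", "qlora"): 10,
    ("E4B", "full"): 32,
    ("26B A4B", "lora"): 40,
    ("31B dense", "lora"): 48,
    ("31B dense", "qlora"): 22,
}

# Entries the table gives as ">N GB": the card must have strictly more than N.
STRICT_LOWER_BOUND = {("26B A4B", "lora"), ("31B dense", "lora")}


def options_that_fit(vram_gb):
    """Return the (variant, method) pairs whose table figure fits in vram_gb."""
    fits = []
    for key, need in VRAM_GB.items():
        ok = need < vram_gb if key in STRICT_LOWER_BOUND else need <= vram_gb
        if ok:
            fits.append(key)
    return fits


print(options_that_fit(24))   # single 3090 Ti
print(options_that_fit(80))   # single H100 80 GB
```

For a 24 GB card this lists every E2B option, E4B LoRA and QLoRA, and 31B QLoRA, which matches the recommendations above; the 22 GB figure for 31B QLoRA leaves very little slack, so treat that entry as "fits, tightly".
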
## Gemma-4-specific pitfalls to NOT miss

1. **New chat template.** Gemma 4 uses `<|turn>user\n … <turn|>`, NOT Gemma 3's `<start_of_turn>user\n … <end_of_turn>`. Unsloth's `get_chat_template(tokenizer, chat_template="gemma-4")` handles this; the HF tokenizer's built-in Jinja also handles it if you rely on `apply_chat_template`. Axolotl uses `chat_template: gemma4` (no dash; it is a different key).

2. **6 new tool-calling tokens.** `<|tool>`, `<tool|>`, `<|tool_call>`, `<tool_call|>`, `<|tool_response>`, `<tool_response|>`, plus the string-delimiter `<|"|>`. If fine-tuning on tool-call data, include the full `<|tool_call>call:fn_name{args}<tool_call|>` in the assistant turn; no `role="tool"` branch exists.

3. **`modules_to_save=["lm_head","embed_tokens"]` + `ensure_weight_tying=True`** in LoraConfig if going vanilla PEFT (Google's cookbook does this explicitly). The new special tokens are *learned embeddings*: if the embed table is frozen, the adapter sees random vectors for them and training silently underperforms. Unsloth and Axolotl bake this in.

4. **Freeze the vision/audio tower by default.** Two idioms in the wild:
   - Axolotl: `freeze_mm_modules: true` + text-only LoRA regex.
   - HF's CARLA example: `target_modules="all-linear"` + `exclude_modules=["vision_tower", "multi_modal_projector"]`.

   Only train the vision tower if your task specifically needs the encoder to adapt (new image domain). For text-mode fine-tunes like Mortdecai, always freeze.

5. **Flash Attention DOES NOT WORK on Gemma 4.** FA2 supports `head_dim` up to 256 and FA4 up to 128; Gemma 4's `global_head_dim=512` exceeds both. **Use SDP or Flex Attention.** Axolotl's configs set `sdp_attention: true`. TRL's `sft_gemma3.py` uses `attn_implementation="eager"`, which works but is slow; prefer `"sdpa"`. (Unsloth's FastModel handles this automatically.)

6. **LoRA kernels OFF.** Gemma 4's shared-KV-cache layers break the fused LoRA kernels. Axolotl sets `lora_mlp_kernel/qkv_kernel/o_kernel: false` explicitly. Unsloth's `FastModel` is fine because it uses its own kernel path that knows about shared-KV.

7. **Don't prepend a second `<bos>`.** `apply_chat_template` adds one; SFTTrainer's collator adds one; if you don't `.removeprefix('<bos>')` before passing text to the trainer, you train the model to expect `<bos><bos>`. Unsloth's example notebooks do this strip, so copy their pattern.

8. **26B A4B: use 16-bit LoRA, not QLoRA.** Unsloth's docs explicitly say "MoE QLoRA not recommended, dense 31B is fine." Axolotl has a ScatterMoE+expert-quantized+expert-LoRA config that does make 4-bit work for the MoE (validated on a 5090), but it's the only tool that does; Unsloth's 26B A4B notebook goes 16-bit for quality.

9. **Initial training loss of 13-15 on E2B/E4B is normal, not a bug.** Multimodal models start much higher than the 5-8 you might otherwise expect. If you see 13-15, don't panic; GOTCHAS.md §"Fine-Tuning Ecosystem Issues" covers this.

10. **`mm_token_type_ids` required during training even for text-only data.** Day-one PEFT/Transformers bug: the multimodal collator requires this field. Require `transformers>=5.5.0` and `peft>=0.15` to ensure the fix is present.

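Pitfall 1 interacts with step 6 of the training script: `train_on_responses_only` is given the turn markers so that only assistant text contributes to the loss. The sketch below illustrates that idea on plain strings. It is not Unsloth's implementation (which works on token IDs), and the sample conversation layout, including the newline after `<turn|>`, is an assumption; only the `<|turn>user\n`, `<|turn>model\n`, and `<turn|>` markers come from this document.

```python
INSTRUCTION_PART = "<|turn>user\n"
RESPONSE_PART = "<|turn>model\n"


def response_spans(text):
    """Return (start, end) character spans that would be kept for the loss.

    A span starts right after each response marker and ends at the next
    instruction marker, or at the end of the text if there is none.
    """
    spans = []
    pos = 0
    while True:
        start = text.find(RESPONSE_PART, pos)
        if start == -1:
            break
        start += len(RESPONSE_PART)
        end = text.find(INSTRUCTION_PART, start)
        if end == -1:
            end = len(text)
        spans.append((start, end))
        pos = end
    return spans


# Hypothetical two-turn conversation in the Gemma 4 turn format.
sample = (
    "<|turn>user\nHow do I op a player?<turn|>\n"
    "<|turn>model\nUse /op <name>.<turn|>\n"
    "<|turn>user\nThanks<turn|>\n"
    "<|turn>model\nAnytime.<turn|>\n"
)
kept = [sample[s:e] for s, e in response_spans(sample)]
print(kept)
```

If the markers passed to `train_on_responses_only` were still the Gemma 3 ones, a search like this would find no response spans at all, which is why the template change matters for masking and not only for formatting.
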
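Pitfall 5 reduces to a head-dimension comparison. The sketch below uses the FA2 limit and the Gemma 4 `global_head_dim` quoted in that item; `pick_attention` is a hypothetical helper, and the strings are the `attn_implementation` values `"flash_attention_2"` and `"sdpa"`.

```python
# Limit as quoted in pitfall 5 (FA2 supports head_dim up to 256).
FLASH_MAX_HEAD_DIM = {"flash_attention_2": 256}
GEMMA4_GLOBAL_HEAD_DIM = 512


def pick_attention(head_dim, preferred="flash_attention_2", fallback="sdpa"):
    """Return the preferred backend unless head_dim exceeds its known limit."""
    limit = FLASH_MAX_HEAD_DIM.get(preferred)
    if limit is not None and head_dim > limit:
        return fallback
    return preferred


print(pick_attention(GEMMA4_GLOBAL_HEAD_DIM))
print(pick_attention(128))
```

A guard like this is only useful in hand-rolled TRL or PEFT scripts; per the item above, Unsloth's `FastModel` already picks a working backend.
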
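Pitfall 7 can be checked on plain strings before any training run. A minimal sketch, assuming the literal `<bos>` token text used in the training script and a trainer that prepends exactly one `<bos>`; `strip_leading_bos` and `count_leading_bos` are hypothetical helper names, and the sample text is made up.

```python
BOS = "<bos>"


def strip_leading_bos(text):
    """Drop one leading <bos> so the trainer's own <bos> is the only one."""
    return text.removeprefix(BOS)  # str.removeprefix needs Python 3.9+


def count_leading_bos(text):
    """Count how many consecutive <bos> tokens the text starts with."""
    count = 0
    while text.startswith(BOS):
        count += 1
        text = text[len(BOS):]
    return count


templated = "<bos><|turn>user\nhi<turn|>\n"          # chat template output
without_strip = BOS + templated                      # trainer adds another <bos>
with_strip = BOS + strip_leading_bos(templated)      # trainer adds the only <bos>
print(count_leading_bos(without_strip), count_leading_bos(with_strip))
```

Running `count_leading_bos` over a handful of formatted dataset rows before `trainer.train()` is a cheap way to confirm the strip actually happened.
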
## Feature parity snapshot (2026-04-18)

| Feature | Unsloth | TRL | Axolotl | Google cookbook |
|---------|:-:|:-:|:-:|:-:|
| Text SFT | ✓ | ~ (via gemma3 script, change model_id) | ✓ | ✓ |
| Vision SFT | ✓ | ~ (via sft_vlm_gemma3) | ✓ (E2B) | ✓ |
| Audio SFT | ✓ (E2B/E4B) | ✗ | ✗ | ✗ |
| GRPO | ✓ (E2B + RL game notebooks) | ✓ (CARLA VLM-GRPO, official) | ✗ | ✗ |
| DPO | via TRL | ✓ | ✓ | ✗ |
| 26B MoE native | ✓ (16-bit LoRA) | ~ | ✓ (ScatterMoE + expert-LoRA, validated on 5090) | ✗ |
| 31B dense QLoRA | ✓ | ~ | ✓ (with Flex Attn) | ~ |
| Free Colab T4 path | ✓ (E2B) | ✗ | ✗ | ~ (via Colab Pro) |
| Multi-GPU FSDP | ~ | ✓ | ✓ (first-class) | ~ |

**Bottom line:** Unsloth has the broadest Gemma-4-native coverage (including audio and RL games, which no one else has). Axolotl has the best 26B MoE story. TRL has the best multimodal-RL story (CARLA). Google cookbook is the reference, not the fast path.

For Seth's stated use case (fine-tune like mortdecai), Unsloth wins on ergonomics + speed + T4 free-tier fallback.