
# Recommended Gemma 4 Fine-Tuning Recipe (Seth's Homelab)

## TL;DR

**Use Unsloth. Rent a single H100 on Vast.ai. Fine-tune Gemma 4 E4B (or 31B QLoRA). Save GGUF. `ollama create` back to CT 105.**

Why not the alternatives:
- **Your 3090 Ti(s):** can handle E2B/E4B LoRA comfortably, but 26B A4B LoRA wants more than 40 GB and 31B QLoRA wants 22 GB (fits in 24 GB, but tightly). Axolotl's 5090-validated configs need Flex Attention to fit, which costs about half the throughput. An H100 at $2-3/hr for 3-4 hours is cheaper than the time you'll spend tuning memory.
- **Axolotl** is great — in particular the 26B MoE ScatterMoE+expert-LoRA config is genuinely novel and Unsloth doesn't match it. But Axolotl has more moving parts (FSDP, kernels, flex attention), breaks more subtly on config errors, and the docs are less Gemma-4-specific than Unsloth's.
- **TRL** has no Gemma-4-specific SFT script yet — you'd be porting `sft_gemma3.py`. Useful if you need DPO/GRPO or multimodal tool-call GRPO (the CARLA recipe), but heavier lift than Unsloth for plain SFT.
- **Google cookbook** works and is authoritative but is slower than Unsloth (no fused kernels) and the notebook format is noisier to modify.
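
The rental arithmetic behind the first bullet is small enough to check directly. A minimal sketch using only the $2-3/hr and 3-4 hour figures quoted above; `rental_cost_range` is a hypothetical helper written for this note, not part of any tool mentioned here.

```python
def rental_cost_range(rate_low, rate_high, hours_low, hours_high):
    """Return (min, max) USD cost given hourly-rate and duration ranges."""
    return (rate_low * hours_low, rate_high * hours_high)


# Figures from the bullet above: $2-3/hr for 3-4 hours.
low, high = rental_cost_range(2.0, 3.0, 3, 4)
print(f"Estimated H100 rental: ${low:.2f} to ${high:.2f}")
```

Swap in the actual Vast.ai quote and your own time estimate before deciding; setup and upload time count toward the billed hours too.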

## Exact command

### On a rented H100 (Vast.ai `vast-h100` alias, already configured)

```bash
ssh vast-h100
# one-time setup
pip install unsloth "trl==0.22.2" "transformers>=5.5.0" "peft>=0.15" timm torchcodec
```

Training script (save as `finetune_gemma4.py` on the H100):

```python
from unsloth import FastModel
from unsloth.chat_templates import get_chat_template, standardize_data_formats, train_on_responses_only
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

MODEL = "unsloth/gemma-4-E4B-it"     # swap to "unsloth/gemma-4-31B-it" if you want more headroom
DATASET = "YOUR_DATASET_HERE"         # e.g. a mortdecai-style chat JSONL on HF Hub

# 1. Load model + tokenizer in 4-bit
model, tokenizer = FastModel.from_pretrained(
    model_name = MODEL,
    max_seq_length = 4096,
    load_in_4bit = True,
    full_finetuning = False,
)

# 2. Attach LoRA
model = FastModel.get_peft_model(
    model,
    finetune_vision_layers = False,   # text-only FT
    finetune_language_layers = True,
    finetune_attention_modules = True,
    finetune_mlp_modules = True,
    r = 16,
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
)

# 3. Chat template — "gemma-4" (literal, with dash)
tokenizer = get_chat_template(tokenizer, chat_template = "gemma-4")

# 4. Dataset: expects ShareGPT-style `conversations` field with {from, value}
#    OR OpenAI-style `messages` with {role, content} — standardize_data_formats handles both.
dataset = load_dataset(DATASET, split = "train")
dataset = standardize_data_formats(dataset)

def fmt(examples):
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(c, tokenize=False, add_generation_prompt=False)
            .removeprefix('<bos>')     # critical: avoid double <bos>
        for c in convos
    ]
    return {"text": texts}
dataset = dataset.map(fmt, batched=True)

# 5. Train
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        num_train_epochs = 1,
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.001,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none",
        output_dir = "outputs",
    ),
)

# 6. Mask everything except assistant turns
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|turn>user\n",
    response_part    = "<|turn>model\n",
)

trainer.train()

# 7. Save merged 16-bit weights (for manual GGUF conversion later).
#    Steps 7 and 8 are alternatives: comment out whichever you don't need.
model.save_pretrained_merged("merged_out", tokenizer, save_method = "merged_16bit")

# 8. OR save directly to GGUF (Q4_K_M), which is Ollama-ready
model.save_pretrained_gguf("gemma4-mortdecai-v1", tokenizer, quantization_method = "q4_k_m")
```
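
Step 4 accepts either ShareGPT-style or OpenAI-style records. To make the difference concrete, here is a rough sketch of that normalization in plain Python. It illustrates the idea only and is not Unsloth's `standardize_data_formats` implementation; the `ROLE_ALIASES` table and the `to_role_content` helper are assumptions made for this example.

```python
# Hypothetical alias table: ShareGPT speaker names mapped to OpenAI-style roles.
ROLE_ALIASES = {
    "human": "user",
    "user": "user",
    "gpt": "assistant",
    "assistant": "assistant",
    "system": "system",
}


def to_role_content(turns):
    """Convert one conversation to a list of {"role", "content"} dicts."""
    normalized = []
    for turn in turns:
        if "role" in turn and "content" in turn:      # OpenAI-style turn
            role, content = turn["role"], turn["content"]
        elif "from" in turn and "value" in turn:      # ShareGPT-style turn
            role, content = turn["from"], turn["value"]
        else:
            raise ValueError(f"unrecognized turn keys: {sorted(turn)}")
        if role not in ROLE_ALIASES:
            raise ValueError(f"unknown speaker: {role!r}")
        normalized.append({"role": ROLE_ALIASES[role], "content": content})
    return normalized


sharegpt = [{"from": "human", "value": "hi"}, {"from": "gpt", "value": "hello"}]
print(to_role_content(sharegpt))
```

Running a check like this over a few records before renting the GPU catches malformed turns while it is still free to fix them.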

Run:
```bash
python finetune_gemma4.py
```
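
Before launching, it can help to estimate how many optimizer steps the run will take, since `warmup_steps = 10` is a large share of a very small run. A minimal sketch, assuming a single GPU and a hypothetical dataset of 1,000 examples; the batch settings are the ones from the `SFTConfig` above, and the step count is an approximation of what the trainer will report.

```python
import math


def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs=1, num_gpus=1):
    """Approximate optimizer steps: examples divided by the effective batch size."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    return math.ceil(num_examples / effective_batch) * epochs


NUM_EXAMPLES = 1000  # assumption: replace with len(dataset)
steps = optimizer_steps(NUM_EXAMPLES, per_device_batch=2, grad_accum=4)
print(f"effective batch size: {2 * 4}")
print(f"optimizer steps: {steps}")
print(f"warmup share: {10 / steps:.0%}")
```

If the warmup share comes out above roughly 10-20%, consider lowering `warmup_steps` or training for more than one epoch.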

### Pulling the result back and serving on CT 105

```bash
# From your workstation: copy the GGUF off the Vast box (or upload it to HF Hub from there instead):
scp -r vast-h100:~/gemma4-mortdecai-v1*.gguf steel141:/tmp/

# On CT 105 (pve197 Ollama):
cat > Modelfile <<'EOF'
FROM /path/to/gemma4-mortdecai-v1.Q4_K_M.gguf
PARAMETER num_ctx 8192
PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 64
SYSTEM "You are Mortdecai, a Minecraft ops AI. You are powered by Gemma 4."
EOF
ollama create mortdecai-gemma4:v1 -f Modelfile
ollama run mortdecai-gemma4:v1
```
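
Once `ollama create` succeeds, a scripted smoke test is a quick way to confirm the model answers. The sketch below targets Ollama's `/api/generate` HTTP endpoint using only the standard library. It assumes Ollama is listening on its default port 11434 on localhost; the URL, the prompt, and the helper names are placeholders for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumption: default local endpoint


def build_generate_request(model, prompt, url=OLLAMA_URL):
    """Build a non-streaming generate request without sending it."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(model, prompt="Say hello in one sentence.", timeout=120):
    """Send one prompt and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["response"]
```

On CT 105, `print(smoke_test("mortdecai-gemma4:v1"))` should return a short reply; a connection error usually means Ollama is not running or is bound to a different address.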

## Hardware sizing guide (from Unsloth's verified numbers)

| Variant | LoRA | QLoRA | Full FT | My recommendation |
|---------|------|-------|---------|-------------------|
| E2B | 8-10 GB | 8 GB | ~20 GB | Free Colab T4; local 3090 Ti fine |
| E4B | 17 GB | 10 GB | ~32 GB | Local 3090 Ti (24 GB) tight but fine; H100 faster |
| 26B A4B | >40 GB (16-bit recommended, NOT 4-bit) | not recommended | — | H100 80 GB |
| 31B dense | >48 GB | 22 GB | 2×H100 | H100 80 GB or 2×3090 Ti FSDP |

For **Mortdecai-style behavior tuning** (matching your existing qwen-based setup), start with **E4B**. It's the sweet spot: stronger than qwen3 8B in the things that matter (Gemma 4 E4B beats Gemma 3 27B on most benchmarks), vision-capable if you want it, and small enough to fit on a single 3090 Ti locally.

For a **real coding/reasoning upgrade**, use **31B QLoRA on H100**. Unsloth's 31B QLoRA notebook is the canonical recipe there.
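
The sizing table can be turned into a quick lookup for "does this fit on my card?". A minimal sketch that hard-codes the table's figures; the `VRAM_GB` layout and the `fits` helper are hypothetical, entries marked ">" are treated as strict lower bounds, and no extra headroom for long sequences or larger batches is included.

```python
# (required_gb, is_strict_lower_bound); None means the table gives no recommended figure.
VRAM_GB = {
    ("E2B", "lora"): (10, False),      # table says 8-10 GB; use the upper end
    ("E2B", "qlora"): (8, False),
    ("E4B", "lora"): (17, False),
    ("E4B", "qlora"): (10, False),
    ("26B A4B", "lora"): (40, True),   # ">40 GB"
    ("26B A4B", "qlora"): None,        # "not recommended"
    ("31B", "lora"): (48, True),       # ">48 GB"
    ("31B", "qlora"): (22, False),
}


def fits(variant, method, vram_gb):
    """Return True if the table suggests this variant/method fits in vram_gb."""
    entry = VRAM_GB[(variant, method)]
    if entry is None:
        return False
    required, strict = entry
    return vram_gb > required if strict else vram_gb >= required


for key in VRAM_GB:
    print(key, "3090 Ti (24 GB):", fits(*key, 24), "| H100 (80 GB):", fits(*key, 80))
```

Treat a `True` near the limit (31B QLoRA on 24 GB, for instance) as "possible with care" rather than comfortable.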

## Gemma-4-specific pitfalls to NOT miss

1. **New chat template.** Gemma 4 uses `<|turn>user\n … <turn|>` — NOT Gemma 3's `<start_of_turn>user\n … <end_of_turn>`. Unsloth's `get_chat_template(tokenizer, chat_template="gemma-4")` handles this; the HF tokenizer's built-in Jinja also handles it if you rely on `apply_chat_template`. Axolotl uses `chat_template: gemma4` (no dash — different key).

2. **6 new tool-calling tokens.** `<|tool>`, `<tool|>`, `<|tool_call>`, `<tool_call|>`, `<|tool_response>`, `<tool_response|>`, plus the string delimiter `<|"|>`. If fine-tuning on tool-call data, include the full `<|tool_call>call:fn_name{args}<tool_call|>` in the assistant turn, because no `role="tool"` branch exists.

3. **`modules_to_save=["lm_head","embed_tokens"]` + `ensure_weight_tying=True`** in LoraConfig if going vanilla PEFT (Google's cookbook does this explicitly). The new special tokens are *learned embeddings* — if the embed table is frozen, the adapter sees random vectors for them and training silently underperforms. Unsloth and Axolotl bake this in.

4. **Freeze the vision/audio tower by default.** Two idioms in the wild:
   - Axolotl: `freeze_mm_modules: true` + text-only LoRA regex.
   - HF's CARLA example: `target_modules="all-linear"` + `exclude_modules=["vision_tower", "multi_modal_projector"]`.
   Only train the vision tower if your task specifically needs the encoder to adapt (new image domain). For text-mode fine-tunes like Mortdecai, always freeze.

5. **Flash Attention DOES NOT WORK on Gemma 4.** FA2's max `head_dim=256`, FA4's is 128; Gemma 4's `global_head_dim=512` exceeds both. **Use SDP or Flex Attention.** Axolotl's configs set `sdp_attention: true`. TRL's `sft_gemma3.py` uses `attn_implementation="eager"` — this works but is slow; prefer `"sdpa"`. (Unsloth's FastModel handles this automatically.)

6. **LoRA kernels OFF.** Gemma 4's shared-KV-cache layers break the fused LoRA kernels. Axolotl sets `lora_mlp_kernel/qkv_kernel/o_kernel: false` explicitly. Unsloth's `FastModel` is fine because it uses its own kernel path that knows about shared-KV.

7. **Don't prepend a second `<bos>`.** `apply_chat_template` adds one; SFTTrainer's collator adds one; if you don't `.removeprefix('<bos>')` before passing text to the trainer, you train the model to expect `<bos><bos>`. Unsloth's example notebooks do this strip — copy their pattern.

8. **26B A4B: use 16-bit LoRA, not QLoRA.** Unsloth's docs explicitly say "MoE QLoRA not recommended, dense 31B is fine." Axolotl has a ScatterMoE+expert-quantized+expert-LoRA config that does make 4-bit work for the MoE (validated on a 5090), but it's the only tool that does — Unsloth's 26B A4B notebook goes 16-bit for quality.

9. **Initial training loss of 13-15 on E2B/E4B is normal, not a bug.** Multimodal models start well above the 5-8 range you might otherwise expect, so don't panic if you see 13-15. GOTCHAS.md §"Fine-Tuning Ecosystem Issues" covers this.

10. **`mm_token_type_ids` is required during training, even for text-only data.** Day-one PEFT/Transformers bug: the multimodal collator requires this field. Install `transformers>=5.5.0` and `peft>=0.15` to ensure the fix is present.
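
Pitfalls 5 and 7 both reduce to small checks that can run before a training job starts. The sketch below uses only the numbers and strings quoted in this list; `pick_attn_implementation` and `count_leading_bos` are hypothetical helpers, and the `templated` string is a hand-written stand-in for real `apply_chat_template` output.

```python
BOS = "<bos>"
FA2_MAX_HEAD_DIM = 256         # Flash Attention 2 limit quoted in pitfall 5
GEMMA4_GLOBAL_HEAD_DIM = 512   # Gemma 4 value quoted in pitfall 5


def pick_attn_implementation(head_dim):
    """Choose an attn_implementation string that supports the given head_dim."""
    return "flash_attention_2" if head_dim <= FA2_MAX_HEAD_DIM else "sdpa"


def count_leading_bos(text):
    """Count how many BOS markers a training string starts with."""
    count = 0
    while text.startswith(BOS):
        text = text[len(BOS):]
        count += 1
    return count


print(pick_attn_implementation(GEMMA4_GLOBAL_HEAD_DIM))

# Stand-in for templated text; the trainer is assumed to prepend its own BOS.
templated = BOS + "<|turn>user\nhi<turn|>"
without_strip = BOS + templated
with_strip = BOS + templated.removeprefix(BOS)
print(count_leading_bos(without_strip), count_leading_bos(with_strip))
```

A count other than 1 on a sample of formatted examples is a sign the `.removeprefix('<bos>')` step was skipped or applied twice.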

## Feature parity snapshot (2026-04-18)

| Feature | Unsloth | TRL | Axolotl | Google cookbook |
|---------|:-:|:-:|:-:|:-:|
| Text SFT | ✓ | ~ (via gemma3 script, change model_id) | ✓ | ✓ |
| Vision SFT | ✓ | ~ (via sft_vlm_gemma3) | ✓ (E2B) | ✓ |
| Audio SFT | ✓ (E2B/E4B) | ✗ | ✗ | ✗ |
| GRPO | ✓ (E2B + RL game notebooks) | ✓ (CARLA VLM-GRPO, official) | ✗ | ✗ |
| DPO | via TRL | ✓ | ✓ | ✗ |
| 26B MoE native | ✓ (16-bit LoRA) | ~ | ✓ (ScatterMoE + expert-LoRA, validated on 5090) | ✗ |
| 31B dense QLoRA | ✓ | ~ | ✓ (with Flex Attn) | ~ |
| Free Colab T4 path | ✓ (E2B) | ✗ | ✗ | ~ (via Colab Pro) |
| Multi-GPU FSDP | ~ | ✓ | ✓ (first-class) | ~ |

**Bottom line:** Unsloth has the broadest Gemma-4-native coverage (including audio and RL games, which no one else has). Axolotl has the best 26B MoE story. TRL has the best multimodal-RL story (CARLA). Google cookbook is the reference, not the fast path.

For Seth's stated use case (fine-tune like mortdecai), Unsloth wins on ergonomics + speed + T4 free-tier fallback.