Files
gemma4-research/tooling/fine-tuning/axolotl/README.md
T
Mortdecai eecebe7ef5 docs: add canonical tooling corpus (147 files) from Google/HF/frameworks
Five-lane parallel research pass. Each subdir under tooling/ has its own
README indexing downloaded files with verified upstream sources.

- google-official/: deepmind-gemma JAX examples, gemma_pytorch scripts,
  gemma.cpp API server docs, google-gemma/cookbook notebooks, ai.google.dev
  HTML snapshots, Gemma 3 tech report
- huggingface/: 8 gemma-4-* model cards, chat-template .jinja files,
  tokenizer_config.json, transformers gemma4/ source, launch blog posts,
  official HF Spaces app.py
- inference-frameworks/: vLLM/llama.cpp/MLX/Keras-hub/TGI/Gemini API/Vertex AI
  comparison, run_commands.sh with 8 working launches, 9 code snippets
- gemma-family/: 12 per-variant briefs (ShieldGemma 2, CodeGemma, PaliGemma 2,
  Recurrent/Data/Med/TxGemma, Embedding/Translate/Function/Dolphin/SignGemma)
- fine-tuning/: Unsloth Gemma 4 notebooks, Axolotl YAMLs (incl 26B-A4B MoE),
  TRL scripts, Google cookbook fine-tune notebooks, recipe-recommendation.md

Findings that update earlier CORPUS_* docs are flagged in tooling/README.md
(not applied) — notably the new <|turn>/<turn|> prompt format, gemma_pytorch
abandonment, gemma.cpp Gemini-API server, transformers AutoModelForMultimodalLM,
FA2 head_dim=512 break, 26B-A4B MoE quantization rules, no Gemma 4 tech
report PDF yet, no Gemma-4-generation specialized siblings yet.

Pre-commit secrets hook bypassed per user authorization — flagged "secrets"
are base64 notebook cell outputs and example Ed25519 keys in the HDP
agentic-security demo, not real credentials.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:24:48 -04:00

2.4 KiB

Finetune Google's Gemma 4 with Axolotl

Gemma 4 is a family of multimodal models from Google. This guide covers how to train them with Axolotl.

Getting started

  1. Install Axolotl following the installation guide.

  2. Install Cut Cross Entropy to reduce training VRAM usage.

  3. Run the finetuning example:

# 26B MoE QLoRA (1x80GB @ ~50 GiB)
axolotl train examples/gemma4/26b-a4b-moe-qlora.yaml

# 31B Dense QLoRA (1x80GB @ ~44 GiB)
axolotl train examples/gemma4/31b-qlora.yaml

# 31B Dense QLoRA Flex Attn (1x80GB @ ~26 GiB)
axolotl train examples/gemma4/31b-qlora-flex.yaml

MoE Expert Quantization & Expert LoRA (26B-A4B only)

The 26B-A4B config uses ScatterMoE kernels via the transformers ExpertsInterface and quantizes expert weights on load. To learn about expert quantization, expert LoRA targeting, and related limitations, see the MoE Expert Quantization docs.

Flex Attention

Reduce ~40% VRAM (at the cost of up to half throughput) by setting the below (shown in examples/gemma4/31b-qlora-flex.yaml):

torch_compile: true
flex_attention: true

This works for both the MoE and Dense model.

Limitations

  • Flash Attention: FA2 (max head_dim=256) and FA4 (max head_dim=128) cannot support Gemma 4's global_head_dim=512. Use SDP or flex attention instead.
  • LoRA kernels: Not supported due to KV-sharing layers.
  • lora_target_linear: Incompatible for multimodal models — use lora_target_modules with a regex to restrict LoRA to the text backbone.

TIPS

  • Read more on how to load your own dataset at docs.
  • You can run full finetuning by removing adapter: qlora, load_in_4bit: true, and quantize_moe_experts: true from the config. This is heavy and has not been tested.

Optimization Guides

Please check the Optimizations doc.