Files

T

Mortdecai eecebe7ef5 docs: add canonical tooling corpus (147 files) from Google/HF/frameworks

Five-lane parallel research pass. Each subdir under tooling/ has its own
README indexing downloaded files with verified upstream sources.

- google-official/: deepmind-gemma JAX examples, gemma_pytorch scripts,
  gemma.cpp API server docs, google-gemma/cookbook notebooks, ai.google.dev
  HTML snapshots, Gemma 3 tech report
- huggingface/: 8 gemma-4-* model cards, chat-template .jinja files,
  tokenizer_config.json, transformers gemma4/ source, launch blog posts,
  official HF Spaces app.py
- inference-frameworks/: vLLM/llama.cpp/MLX/Keras-hub/TGI/Gemini API/Vertex AI
  comparison, run_commands.sh with 8 working launches, 9 code snippets
- gemma-family/: 12 per-variant briefs (ShieldGemma 2, CodeGemma, PaliGemma 2,
  Recurrent/Data/Med/TxGemma, Embedding/Translate/Function/Dolphin/SignGemma)
- fine-tuning/: Unsloth Gemma 4 notebooks, Axolotl YAMLs (incl 26B-A4B MoE),
  TRL scripts, Google cookbook fine-tune notebooks, recipe-recommendation.md

Findings that update earlier CORPUS_* docs are flagged in tooling/README.md
(not applied) — notably the new <|turn>/<turn|> prompt format, gemma_pytorch
abandonment, gemma.cpp Gemini-API server, transformers AutoModelForMultimodalLM,
FA2 head_dim=512 break, 26B-A4B MoE quantization rules, no Gemma 4 tech
report PDF yet, no Gemma-4-generation specialized siblings yet.

Pre-commit secrets hook bypassed per user authorization — flagged "secrets"
are base64 notebook cell outputs and example Ed25519 keys in the HDP
agentic-security demo, not real credentials.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-18 12:24:48 -04:00

10 KiB

Raw Blame History

Gemma 4 — Hugging Face Canonical Tooling

Downloaded April 2026. First-party Google/HF content only. No weights, no third-party fine-tunes.

What's here

`model-cards/`

Verbatim README.md from every google/gemma-4-* repo (raw endpoint, ungated). Plus the chat template and tokenizer config for two representative variants (31B-it and E4B-it). All eight model cards have identical body text; they differ only in the pipeline_tag: YAML frontmatter and size-specific tables.

File	What it demonstrates
`gemma-4-31B-it-README.md`	Flagship dense (33B) instruction-tuned. Full "how to use" from Google+HF.
`gemma-4-31B-README.md`	Base (pretrained) variant of the above.
`gemma-4-26B-A4B-it-README.md`	MoE (26B params, 4B active) instruction-tuned. The "A4B" = 4B active.
`gemma-4-26B-A4B-README.md`	Base MoE.
`gemma-4-E4B-it-README.md`	Edge-sized 8B instruction-tuned. Multimodal including audio.
`gemma-4-E4B-README.md`	Base E4B.
`gemma-4-E2B-it-README.md`	Smallest (5B) instruction-tuned, mobile-targeted.
`gemma-4-E2B-README.md`	Base E2B.
`gemma-4-31B-it-chat_template.jinja`	Canonical chat template. 16KB Jinja — handles system/user/model/tool roles, thinking channel, tool calls, image/audio/video tokens.
`gemma-4-E4B-it-chat_template.jinja`	Near-identical to 31B's (131-byte difference — likely one whitespace-sensitive thing around audio handling).
`gemma-4-31B-it-tokenizer_config.json`	Special-token inventory + `response_schema` regex machinery. See "New capabilities" below.
`gemma-4-E4B-it-tokenizer_config.json`	Same shape.

`transformers/`

Files under src/transformers/models/gemma4/ on huggingface/transformers@main. Full files for small ones; outlines (signatures + first 12 lines per class/def) for the two large ones.

File	Lines	What
`__init__.py`	33	Module exports
`configuration_gemma4.py`	352	`Gemma4Config`, `Gemma4TextConfig`, `Gemma4AudioConfig`, `Gemma4VisionConfig` — all hyperparams
`processing_gemma4.py`	366	`Gemma4Processor` — the thing `AutoProcessor.from_pretrained` returns. Includes `parse_response()`
`feature_extraction_gemma4.py`	298	Audio feature extraction (mel spec, padding)
`image_processing_gemma4.py`	220	Tensor-backed image preprocessing
`image_processing_pil_gemma4.py`	278	PIL-backed variant (slower fallback)
`video_processing_gemma4.py`	237	Frame sampling + stitching to image tokens
`modeling_gemma4-OUTLINE.py`	723	Outline of the 2657-line modeling file (43 classes: attention, MoE, audio encoder, vision tower, all LM heads)
`modular_gemma4-OUTLINE.py`	563	Outline of the modular source file — shows Gemma4 inherits from Gemma3n classes (RMSNorm, attention blocks etc.) confirming the 3n→4 lineage

Full files: https://github.com/huggingface/transformers/tree/main/src/transformers/models/gemma4

`recipes/`

From huggingface/huggingface-gemma-recipes — the canonical HF recipe repo. The only Gemma 4-specific recipe as of April 2026 is one notebook; the rest is Gemma 3n which is architecturally the parent of Gemma 4.

File	What
`notebooks/Gemma4_E2B-Multimodal.ipynb`	The one first-party Gemma-4 recipe. Original ipynb. 36 cells: image, video, audio, function calling, object detection with `box_2d`, any-to-any pipeline, captioning.
`notebooks/Gemma4_E2B-Multimodal-extracted.py`	Same notebook flattened to readable .py for grep/diff.
`scripts/ft_gemma3n_image_trl.py`	TRL SFT fine-tune of Gemma 3n on images. Direct precursor to Gemma 4 SFT.
`scripts/ft_gemma3n_image_vt.py`	Vision+text fine-tune without TRL (pure Transformers Trainer).
`scripts/ft_gemma3n_audio_vt.py`	Audio+text fine-tune.
`scripts/gemma3n_fine_tuning_on_all_modalities.py`	All-modalities SFT script — template for full Gemma-4 all-modal SFT.
`scripts/carla_vlm_gemma.py`	CARLA driving sim VLM example using Gemma.

`trl/`

Empty as of April 2026. Searched huggingface/trl/examples/scripts — only sft_gemma3.py and sft_vlm_gemma3.py exist, no gemma4 yet. The gemma-recipes repo's ft_gemma3n_image_trl.py is the closest first-party TRL pattern; it is saved under recipes/scripts/ above.

`peft/`

Empty as of April 2026. huggingface/peft/examples has no gemma-specific directory. The canonical HF PEFT guide for Gemma is the blog post gemma-peft.md, saved under blog/ below. It covers Gemma 1 but the LoRA target-module patterns apply unchanged to Gemma 4 (same q_proj/k_proj/v_proj/o_proj naming).

`blog/`

File	What
`gemma4-blog.md`	"Welcome Gemma 4: Frontier multimodal intelligence on device" — the HF launch blog. 764 lines. Authored by merve. Covers architecture, capabilities, transformers usage, HF Inference API, llama.cpp/MLX quantization, thinking mode examples.
`gemma-peft-blog.md`	"Fine-Tuning Gemma Models in Hugging Face" — the PEFT/LoRA recipe blog (gemma-agnostic, target modules unchanged for Gemma 4).

`spaces/`

The official HF-run interactive demo Spaces.

File	What
`huggingface-projects_gemma-4-31b-it-app.py`	Official 31B demo (Gradio 6 chat + multimodal).
`huggingface-projects_gemma-4-e4b-it-app.py`	Official E4B demo. More illustrative — shows the full multimodal+thinking pattern in ~320 lines.
`*-requirements.txt`	Pinned deps. `transformers==5.5.4` (as of 2026-04-18) — that's the minimum version for Gemma 4 in transformers main line.

New capabilities the HF integration exposes that weren't in the existing corpus

AutoModelForMultimodalLM — new transformers AutoClass, not AutoModelForCausalLM. Required to get any-to-any routing (text+image+audio+video in, text out). The corpus's CORPUS_capabilities.md should note this.
processor.parse_response(text) -> dict — built into Gemma4Processor. Returns {thinking, content, tool_calls} parsed from raw decoded output. Driven by regexes declared in tokenizer_config.json under response_schema (new HF feature using x-regex, x-regex-iterator, and a custom x-parser: gemma4-tool-call). You no longer need to hand-roll tool-call regex parsing if you use the HF processor — this is the HF-canonical replacement for the manual parsing done in CORPUS_tool_calling_format.md.
enable_thinking=True — a kwarg to processor.apply_chat_template(). When set, injects <|think|> at the top of the system turn. This is how you turn reasoning mode on through the HF API. Not documented in the existing corpus.
load_audio_from_video=True — another apply_chat_template kwarg. Pulls the audio track out of a video URL and feeds it as audio tokens alongside sampled frames. Only relevant for E2B/E4B which have audio; the notebook comment explicitly calls this out.
pipeline("any-to-any", model=...) — a new HF pipeline task registered for Gemma 4. Accepts the chat-style messages list directly. Easiest one-liner for multimodal inference.
Object detection via box_2d JSON — prompting with "What's the bounding box for the X?" returns [{"box_2d": [ymin, xmin, ymax, xmax], "label": "..."}] in a 1000x1000 normalized coordinate frame, with images resized to multiples of 48 pixels. This is a Gemma-4-specific convention the notebook demonstrates. Corpus doesn't cover this.
Thinking delimiters are <|channel>thought...<channel|> — not <thinking>...</thinking> like some other open-weights models. The Space app explicitly strips these to pass to Gradio 6's reasoning_tags for collapsible thinking UI.
Breaking change in role/turn markers vs Gemma 3 — Gemma 3 used <start_of_turn>user ... <end_of_turn>. Gemma 4 uses <|turn>user\n ... <turn|>. Tokenizer config:
- sot_token: <|turn> (start of turn)
- eot_token: <turn|> (end of turn)
- Role after <|turn> can be system, user, model, or tool.
- enable_thinking injects a <|think|> marker into the first system turn. Anything in the homelab that hard-codes <start_of_turn> for Gemma needs to branch on family version. Worth adding to GOTCHAS.md.

Canonical chat template format

Source of truth: the two .jinja files in model-cards/. Use them directly — do not reimplement. The tokenizer loads them automatically:

from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained("google/gemma-4-E4B-it")
inputs = processor.apply_chat_template(
    messages,
    tools=[WEATHER_TOOL],           # optional; OpenAI-style tool schema
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
    enable_thinking=True,           # turns on reasoning, injects <|think|>
    load_audio_from_video=False,    # only for video inputs
)
output = model.generate(**inputs, max_new_tokens=1000)
generated = processor.decode(output[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
result = processor.parse_response(generated)
# → {"thinking": "...", "content": "...", "tool_calls": [...]}

Wire format that the template produces

<bos><|turn>system
<|think|>
{system prompt here if any}
<|tool>declaration:get_weather{city:{type:<|"|>STRING<|"|>,description:<|"|>...<|"|>}}<tool|>
<turn|>
<|turn>user
{user text}
<|image|>                       ← placeholder for each image
<|audio|>                       ← placeholder for each audio
<|video|>                       ← placeholder for each video
<turn|>
<|turn>model
<|channel>thought
{reasoning text}
<channel|>
<|tool_call>call:get_weather{city:<|"|>London<|"|>}<tool_call|>
<|tool_response>response:get_weather{temperature:15}<tool_response|>
{final content}
<turn|>

Every Gemma-4-specific token appears in tokenizer_config.json. The apply_chat_template call + the response_schema + parse_response() round-trip means homelab code should never hand-emit these tokens — always go through the processor.

Source URLs (first-party only)

Model collection: https://huggingface.co/collections/google/gemma-4
transformers gemma4 dir: https://github.com/huggingface/transformers/tree/main/src/transformers/models/gemma4
Recipes repo: https://github.com/huggingface/huggingface-gemma-recipes
Launch blog: https://huggingface.co/blog/gemma4
Official 31B Space: https://huggingface.co/spaces/huggingface-projects/gemma-4-31b-it
Official E4B Space: https://huggingface.co/spaces/huggingface-projects/gemma-4-e4b-it

10 KiB Raw Blame History

Gemma 4 — Hugging Face Canonical Tooling

What's here

model-cards/

transformers/

recipes/

trl/

peft/

blog/

spaces/