Files

T

Mortdecai eecebe7ef5 docs: add canonical tooling corpus (147 files) from Google/HF/frameworks

Five-lane parallel research pass. Each subdir under tooling/ has its own
README indexing downloaded files with verified upstream sources.

- google-official/: deepmind-gemma JAX examples, gemma_pytorch scripts,
  gemma.cpp API server docs, google-gemma/cookbook notebooks, ai.google.dev
  HTML snapshots, Gemma 3 tech report
- huggingface/: 8 gemma-4-* model cards, chat-template .jinja files,
  tokenizer_config.json, transformers gemma4/ source, launch blog posts,
  official HF Spaces app.py
- inference-frameworks/: vLLM/llama.cpp/MLX/Keras-hub/TGI/Gemini API/Vertex AI
  comparison, run_commands.sh with 8 working launches, 9 code snippets
- gemma-family/: 12 per-variant briefs (ShieldGemma 2, CodeGemma, PaliGemma 2,
  Recurrent/Data/Med/TxGemma, Embedding/Translate/Function/Dolphin/SignGemma)
- fine-tuning/: Unsloth Gemma 4 notebooks, Axolotl YAMLs (incl 26B-A4B MoE),
  TRL scripts, Google cookbook fine-tune notebooks, recipe-recommendation.md

Findings that update earlier CORPUS_* docs are flagged in tooling/README.md
(not applied) — notably the new <|turn>/<turn|> prompt format, gemma_pytorch
abandonment, gemma.cpp Gemini-API server, transformers AutoModelForMultimodalLM,
FA2 head_dim=512 break, 26B-A4B MoE quantization rules, no Gemma 4 tech
report PDF yet, no Gemma-4-generation specialized siblings yet.

Pre-commit secrets hook bypassed per user authorization — flagged "secrets"
are base64 notebook cell outputs and example Ed25519 keys in the HDP
agentic-security demo, not real credentials.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-18 12:24:48 -04:00

snippets

docs: add canonical tooling corpus (147 files) from Google/HF/frameworks

2026-04-18 12:24:48 -04:00

README.md

docs: add canonical tooling corpus (147 files) from Google/HF/frameworks

2026-04-18 12:24:48 -04:00

run_commands.sh

docs: add canonical tooling corpus (147 files) from Google/HF/frameworks

2026-04-18 12:24:48 -04:00

README.md

Gemma 4 — Inference Framework Support Matrix

Non-Ollama frameworks. Ollama is covered separately in the parent research corpus. Verified against upstream repos, model cards, and docs on 2026-04-18.

Summary table

#	Framework	Gemma 4 support	Vision	Audio	Tool calling	Quantization options	Canonical run command
1	vLLM	Native, upstream merged — `gemma4.py` (text) + `gemma4_mm.py` (multimodal). Registered in `registry.py` as `Gemma4ForCausalLM` and `Gemma4ForConditionalGeneration`.	Yes (all sizes)	Yes (E2B/E4B)	Yes — OpenAI-compatible `/v1/chat/completions` with `tools=[...]`	AWQ, GPTQ, FP8, NVFP4 (via `--quantization modelopt`), BF16	`vllm serve google/gemma-4-31b-it --tensor-parallel-size 2`
2	llama.cpp / GGUF	Native — `Gemma4Model` + `Gemma4VisionAudioModel` registered in `convert_hf_to_gguf.py` (lines 7666 & 7791). Distinct `GEMMA4V` + `GEMMA4A` projector types. Official GGUFs published at `ggml-org/gemma-4-*-GGUF`.	Yes (all, via mmproj)	Yes (E-series, via mmproj)	Yes — `llama-server` exposes OpenAI-compatible tools API	Q4_K_M, Q8_0, BF16 published officially; full quant menu via self-convert	`llama-server -hf ggml-org/gemma-4-E4B-it-GGUF`
3	Apple MLX	Native in `mlx-lm` (text, `gemma4.py` + `gemma4_text.py`) and `mlx-vlm` (multimodal, `mlx_vlm/models/gemma4/` with `audio.py`, `vision.py`, `language.py`, `processing_gemma4.py`)	Yes (mlx-vlm)	Yes (mlx-vlm)	Community; no first-party tools wrapper	4bit, 8bit, bf16 via MLX quantize	`mlx_vlm.generate --model mlx-community/gemma-4-E4B-it-8bit --image URL --prompt "..."`
4	Keras / keras-hub	Native, full modular impl: `keras_hub/src/models/gemma4/` with `attention`, `audio_encoder`, `vision_encoder`, `decoder_block`, `moe`, `causal_lm`, etc. 8 presets (base + instruct × 2B/4B/26B_a4b/31B).	Yes	Yes	No (it's a training library, not an inference server)	Via Keras mixed-precision; no canonical GGUF/AWQ path	`keras_hub.models.Gemma4CausalLM.from_preset("gemma4_instruct_4b")`
5	HF Text Generation Inference (TGI)	No native support. Supported-models page stops at Gemma 3 / Gemma 3 Text. No open or merged PRs for "gemma4" (verified). Will fall back to unoptimized `AutoModelForCausalLM` path.	Fallback only, no vision kernels	No	Fallback only	Whatever HF transformers exposes on the fallback path	`text-generation-launcher --model-id google/gemma-4-31b-it` (degraded)
6	TensorRT-LLM / NVIDIA NIM	Not in the 2026-04 support matrix. Matrix lists `Gemma3ForCausalLM`/`Gemma3ForConditionalGeneration` but no Gemma 4 entry. GitHub issue #12764 tracks broken runtime on DGX Spark/GB10. NVIDIA's own `nvidia/Gemma-4-31B-IT-NVFP4` card tells users to run it on vLLM, not TRT-LLM.	N/A	N/A	N/A	NVFP4 export exists but runtime is broken; use the NVFP4 weights in vLLM instead	Avoid — use `vllm serve nvidia/Gemma-4-31B-IT-NVFP4 --quantization modelopt`
7	Gemini API (AI Studio)	Hosted. Model IDs: `gemma-4-31b-it`, `gemma-4-26b-a4b-it`. E-series NOT exposed (on-device only).	Yes (via `inlineData` parts)	No (Gemini API strips the audio path)	Yes — same `tools=[...]` schema as Gemini models	N/A (Google-managed)	`curl .../v1beta/models/gemma-4-26b-a4b-it:generateContent -d @payload.json`
8	Vertex AI Model Garden	One-click deploy. Model card: `console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma4`. Publisher ID format `google/gemma4@gemma-4-31b-it`. 26B-A4B is offered fully managed & serverless; 31B requires self-provisioned GPU endpoint.	Yes (via endpoint backend — vLLM under the hood)	Yes for E-series variants deployed that way	Yes (endpoint inherits from backing runtime)	Depends on backing image (vLLM/SAX) — BF16, FP8, AWQ selectable at deploy time	`model_garden.OpenModel("google/gemma4@gemma-4-31b-it").deploy()`

Production-readiness ranking

vLLM — most complete, most optimized, only runtime with first-party NVFP4 support and tested multimodal (image+audio+video).
llama.cpp / GGUF — best for local CPU + small GPU, only framework with audio mmproj shipping as a downloadable file for E-series, official Google-published quants via ggml-org/*.
Gemini API / Vertex AI — if you don't want to self-host; Vertex gives you the managed-endpoint exit path with vLLM under the hood.
Apple MLX — production-ready on Apple Silicon only; mlx-vlm is community-maintained but actively updated.
Keras-hub — reference/training, not inference-server.
TGI — usable as a fallback only; no optimized path yet.
TensorRT-LLM — avoid for Gemma 4. NVIDIA themselves point at vLLM.

Capabilities beyond Ollama

Native audio input — Ollama does not currently expose the E2B/E4B audio tower. Three frameworks do:
- llama.cpp with the mmproj-...-E4B-it-*.gguf projector (VisionProjectorType.GEMMA4A),
- vLLM via gemma4_mm.py (input_features_padded, input_features_mask),
- MLX via mlx-vlm/models/gemma4/audio.py. If Seth ever wants the speech-transcription path, llama.cpp with the E4B mmproj is the shortest route from where he already is.
Video with interleaved audio — vLLM's gemma4_mm.py decomposes videos into up to 32 timestamped frames; with E-series models it also loads the audio track (load_audio_from_video=True). Ollama has no video path at all.
NVFP4 on Blackwell — vLLM only. nvidia/Gemma-4-31B-IT-NVFP4 reports ~0.3 pp accuracy loss vs BF16 on GPQA Diamond / MMLU Pro.

Framework to avoid

TensorRT-LLM. Not in the upstream support matrix as of 2026-04, known runtime bug on DGX Spark/GB10 (issue #12764), and NVIDIA's own NVFP4 checkpoint directs users to vLLM. Revisit only after a future TRT-LLM release lists Gemma4ForCausalLM in the support matrix.

Files in this directory

inference-frameworks/
├── README.md                              — this file
├── run_commands.sh                        — canonical one-liners per framework
└── snippets/
    ├── llamacpp_convert_gemma4_excerpt.py — Gemma4Model + Gemma4VisionAudioModel from convert_hf_to_gguf.py (lines 7666-7840)
    ├── vllm_gemma4_head_80.py             — gemma4.py header (imports, config deref)
    ├── vllm_gemma4_mm_head_80.py          — gemma4_mm.py header (multimodal docstring lists image/audio/video)
    ├── vllm_registry_excerpt.txt          — registry.py Gemma4 registrations
    ├── mlx_gemma4_head_100.py             — mlx-lm gemma4.py (text) first 100 lines
    ├── mlx_vlm_gemma4_head_60.py          — mlx-vlm gemma4/gemma4.py (multimodal) first 60 lines
    ├── keras_hub_gemma4.py                — canonical keras-hub example + preset list
    ├── gemini_api_gemma4.sh               — canonical curl example
    └── gemini_api_gemma4.py               — canonical google-genai Python SDK example

Notable upstream references

vLLM Gemma 4 model class: vllm-project/vllm:vllm/model_executor/models/gemma4.py and gemma4_mm.py
llama.cpp HF → GGUF converter: ggml-org/llama.cpp:convert_hf_to_gguf.py lines 7666-7840
Official Google GGUF repos (verified live): ggml-org/gemma-4-{E2B,E4B,31B,26b-a4b}-it-GGUF — all ship mmproj projector files
HF blog: huggingface.co/blog/gemma4 — shows AutoModelForMultimodalLM is the canonical transformers entry point
NVIDIA NVFP4 checkpoint: nvidia/Gemma-4-31B-IT-NVFP4 — runtime=vLLM, not TRT-LLM
Gemini API doc: ai.google.dev/gemma/docs/core/gemma_on_gemini_api
Vertex AI Model Garden: console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma4
TGI supported-models list (confirming absence of Gemma 4): huggingface.co/docs/text-generation-inference/supported_models
TRT-LLM support matrix (confirming absence): nvidia.github.io/TensorRT-LLM/reference/support-matrix.html

README.md Unescape Escape

Gemma 4 — Inference Framework Support Matrix

Summary table

Production-readiness ranking

Capabilities beyond Ollama

Framework to avoid

Files in this directory

Notable upstream references

README.md