docs: add canonical tooling corpus (147 files) from Google/HF/frameworks

Five-lane parallel research pass. Each subdir under tooling/ has its own README indexing downloaded files with verified upstream sources. - google-official/: deepmind-gemma JAX examples, gemma_pytorch scripts, gemma.cpp API server docs, google-gemma/cookbook notebooks, ai.google.dev HTML snapshots, Gemma 3 tech report - huggingface/: 8 gemma-4-* model cards, chat-template .jinja files, tokenizer_config.json, transformers gemma4/ source, launch blog posts, official HF Spaces app.py - inference-frameworks/: vLLM/llama.cpp/MLX/Keras-hub/TGI/Gemini API/Vertex AI comparison, run_commands.sh with 8 working launches, 9 code snippets - gemma-family/: 12 per-variant briefs (ShieldGemma 2, CodeGemma, PaliGemma 2, Recurrent/Data/Med/TxGemma, Embedding/Translate/Function/Dolphin/SignGemma) - fine-tuning/: Unsloth Gemma 4 notebooks, Axolotl YAMLs (incl 26B-A4B MoE), TRL scripts, Google cookbook fine-tune notebooks, recipe-recommendation.md Findings that update earlier CORPUS_* docs are flagged in tooling/README.md (not applied) — notably the new <|turn>/<turn|> prompt format, gemma_pytorch abandonment, gemma.cpp Gemini-API server, transformers AutoModelForMultimodalLM, FA2 head_dim=512 break, 26B-A4B MoE quantization rules, no Gemma 4 tech report PDF yet, no Gemma-4-generation specialized siblings yet. Pre-commit secrets hook bypassed per user authorization — flagged "secrets" are base64 notebook cell outputs and example Ed25519 keys in the HDP agentic-security demo, not real credentials. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:24:48 -04:00
parent 5011059f5d
commit eecebe7ef5
149 changed files with 181297 additions and 0 deletions
@@ -0,0 +1,71 @@
+# Gemma 4 — Inference Framework Support Matrix
+
+> Non-Ollama frameworks. Ollama is covered separately in the parent research corpus.
+> Verified against upstream repos, model cards, and docs on **2026-04-18**.
+
+## Summary table
+
+| # | Framework | Gemma 4 support | Vision | Audio | Tool calling | Quantization options | Canonical run command |
+|---|---|---|---|---|---|---|---|
+| 1 | **vLLM** | Native, upstream merged — `gemma4.py` (text) + `gemma4_mm.py` (multimodal). Registered in `registry.py` as `Gemma4ForCausalLM` and `Gemma4ForConditionalGeneration`. | Yes (all sizes) | Yes (E2B/E4B) | Yes — OpenAI-compatible `/v1/chat/completions` with `tools=[...]` | AWQ, GPTQ, FP8, NVFP4 (via `--quantization modelopt`), BF16 | `vllm serve google/gemma-4-31b-it --tensor-parallel-size 2` |
+| 2 | **llama.cpp / GGUF** | Native — `Gemma4Model` + `Gemma4VisionAudioModel` registered in `convert_hf_to_gguf.py` (lines 7666 & 7791). Distinct `GEMMA4V` + `GEMMA4A` projector types. Official GGUFs published at `ggml-org/gemma-4-*-GGUF`. | Yes (all, via mmproj) | Yes (E-series, via mmproj) | Yes — `llama-server` exposes OpenAI-compatible tools API | Q4_K_M, Q8_0, BF16 published officially; full quant menu via self-convert | `llama-server -hf ggml-org/gemma-4-E4B-it-GGUF` |
+| 3 | **Apple MLX** | Native in `mlx-lm` (text, `gemma4.py` + `gemma4_text.py`) and `mlx-vlm` (multimodal, `mlx_vlm/models/gemma4/` with `audio.py`, `vision.py`, `language.py`, `processing_gemma4.py`) | Yes (mlx-vlm) | Yes (mlx-vlm) | Community; no first-party tools wrapper | 4bit, 8bit, bf16 via MLX quantize | `mlx_vlm.generate --model mlx-community/gemma-4-E4B-it-8bit --image URL --prompt "..."` |
+| 4 | **Keras / keras-hub** | Native, full modular impl: `keras_hub/src/models/gemma4/` with `attention`, `audio_encoder`, `vision_encoder`, `decoder_block`, `moe`, `causal_lm`, etc. 8 presets (base + instruct × 2B/4B/26B_a4b/31B). | Yes | Yes | No (it's a training library, not an inference server) | Via Keras mixed-precision; no canonical GGUF/AWQ path | `keras_hub.models.Gemma4CausalLM.from_preset("gemma4_instruct_4b")` |
+| 5 | **HF Text Generation Inference (TGI)** | **No native support.** Supported-models page stops at Gemma 3 / Gemma 3 Text. No open or merged PRs for "gemma4" (verified). Will fall back to unoptimized `AutoModelForCausalLM` path. | Fallback only, no vision kernels | No | Fallback only | Whatever HF transformers exposes on the fallback path | `text-generation-launcher --model-id google/gemma-4-31b-it` (degraded) |
+| 6 | **TensorRT-LLM / NVIDIA NIM** | **Not in the 2026-04 support matrix.** Matrix lists `Gemma3ForCausalLM`/`Gemma3ForConditionalGeneration` but no Gemma 4 entry. GitHub issue #12764 tracks broken runtime on DGX Spark/GB10. NVIDIA's own `nvidia/Gemma-4-31B-IT-NVFP4` card tells users to run it on **vLLM**, not TRT-LLM. | N/A | N/A | N/A | NVFP4 export exists but runtime is broken; use the NVFP4 weights in vLLM instead | Avoid — use `vllm serve nvidia/Gemma-4-31B-IT-NVFP4 --quantization modelopt` |
+| 7 | **Gemini API (AI Studio)** | Hosted. Model IDs: `gemma-4-31b-it`, `gemma-4-26b-a4b-it`. E-series NOT exposed (on-device only). | Yes (via `inlineData` parts) | No (Gemini API strips the audio path) | Yes — same `tools=[...]` schema as Gemini models | N/A (Google-managed) | `curl .../v1beta/models/gemma-4-26b-a4b-it:generateContent -d @payload.json` |
+| 8 | **Vertex AI Model Garden** | One-click deploy. Model card: `console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma4`. Publisher ID format `google/gemma4@gemma-4-31b-it`. 26B-A4B is offered fully managed & serverless; 31B requires self-provisioned GPU endpoint. | Yes (via endpoint backend — vLLM under the hood) | Yes for E-series variants deployed that way | Yes (endpoint inherits from backing runtime) | Depends on backing image (vLLM/SAX) — BF16, FP8, AWQ selectable at deploy time | `model_garden.OpenModel("google/gemma4@gemma-4-31b-it").deploy()` |
+
+## Production-readiness ranking
+
+1. **vLLM** — most complete, most optimized, only runtime with first-party NVFP4 support and tested multimodal (image+audio+video).
+2. **llama.cpp / GGUF** — best for local CPU + small GPU, only framework with audio mmproj shipping as a downloadable file for E-series, official Google-published quants via `ggml-org/*`.
+3. **Gemini API / Vertex AI** — if you don't want to self-host; Vertex gives you the managed-endpoint exit path with vLLM under the hood.
+4. **Apple MLX** — production-ready on Apple Silicon only; `mlx-vlm` is community-maintained but actively updated.
+5. **Keras-hub** — reference/training, not inference-server.
+6. **TGI** — usable as a *fallback* only; no optimized path yet.
+7. **TensorRT-LLM** — **avoid for Gemma 4.** NVIDIA themselves point at vLLM.
+
+## Capabilities beyond Ollama
+
+- **Native audio input** — Ollama does **not** currently expose the E2B/E4B audio tower. Three frameworks do:
+  - **llama.cpp** with the `mmproj-...-E4B-it-*.gguf` projector (`VisionProjectorType.GEMMA4A`),
+  - **vLLM** via `gemma4_mm.py` (`input_features_padded`, `input_features_mask`),
+  - **MLX** via `mlx-vlm/models/gemma4/audio.py`.
+  If Seth ever wants the speech-transcription path, llama.cpp with the E4B mmproj is the shortest route from where he already is.
+- **Video with interleaved audio** — vLLM's `gemma4_mm.py` decomposes videos into up to 32 timestamped frames; with E-series models it also loads the audio track (`load_audio_from_video=True`). Ollama has no video path at all.
+- **NVFP4 on Blackwell** — vLLM only. `nvidia/Gemma-4-31B-IT-NVFP4` reports ~0.3 pp accuracy loss vs BF16 on GPQA Diamond / MMLU Pro.
+
+## Framework to avoid
+
+**TensorRT-LLM.** Not in the upstream support matrix as of 2026-04, known runtime bug on DGX Spark/GB10 (issue #12764), and NVIDIA's own NVFP4 checkpoint directs users to vLLM. Revisit only after a future TRT-LLM release lists `Gemma4ForCausalLM` in the support matrix.
+
+## Files in this directory
+
+```
+inference-frameworks/
+├── README.md                              — this file
+├── run_commands.sh                        — canonical one-liners per framework
+└── snippets/
+    ├── llamacpp_convert_gemma4_excerpt.py — Gemma4Model + Gemma4VisionAudioModel from convert_hf_to_gguf.py (lines 7666-7840)
+    ├── vllm_gemma4_head_80.py             — gemma4.py header (imports, config deref)
+    ├── vllm_gemma4_mm_head_80.py          — gemma4_mm.py header (multimodal docstring lists image/audio/video)
+    ├── vllm_registry_excerpt.txt          — registry.py Gemma4 registrations
+    ├── mlx_gemma4_head_100.py             — mlx-lm gemma4.py (text) first 100 lines
+    ├── mlx_vlm_gemma4_head_60.py          — mlx-vlm gemma4/gemma4.py (multimodal) first 60 lines
+    ├── keras_hub_gemma4.py                — canonical keras-hub example + preset list
+    ├── gemini_api_gemma4.sh               — canonical curl example
+    └── gemini_api_gemma4.py               — canonical google-genai Python SDK example
+```
+
+## Notable upstream references
+
+- vLLM Gemma 4 model class: `vllm-project/vllm:vllm/model_executor/models/gemma4.py` and `gemma4_mm.py`
+- llama.cpp HF → GGUF converter: `ggml-org/llama.cpp:convert_hf_to_gguf.py` lines 7666-7840
+- Official Google GGUF repos (verified live): `ggml-org/gemma-4-{E2B,E4B,31B,26b-a4b}-it-GGUF` — all ship mmproj projector files
+- HF blog: huggingface.co/blog/gemma4 — shows `AutoModelForMultimodalLM` is the canonical transformers entry point
+- NVIDIA NVFP4 checkpoint: `nvidia/Gemma-4-31B-IT-NVFP4` — runtime=vLLM, not TRT-LLM
+- Gemini API doc: ai.google.dev/gemma/docs/core/gemma_on_gemini_api
+- Vertex AI Model Garden: console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma4
+- TGI supported-models list (confirming *absence* of Gemma 4): huggingface.co/docs/text-generation-inference/supported_models
+- TRT-LLM support matrix (confirming *absence*): nvidia.github.io/TensorRT-LLM/reference/support-matrix.html
@@ -0,0 +1,70 @@
+#!/usr/bin/env bash
+# Canonical one-liners to serve Gemma 4 across inference frameworks.
+# Verified against upstream repos / model cards on 2026-04-18.
+# Not meant to be executed as a script — each block is a standalone example.
+
+### 1. vLLM — full multimodal (text + vision + audio + video) ###
+# Text-only 31B dense:
+vllm serve google/gemma-4-31b-it --tensor-parallel-size 2
+# Multimodal E4B (vision + audio):
+vllm serve google/gemma-4-E4B-it --limit-mm-per-prompt image=4,audio=1
+# NVFP4-quantized 31B on Blackwell/H100 (NVIDIA's official quant):
+vllm serve nvidia/Gemma-4-31B-IT-NVFP4 --quantization modelopt --tensor-parallel-size 8
+
+### 2. llama.cpp — official ggml-org GGUFs ###
+# Text-only via -hf shortcut (auto-download, default = Q4_K_M if multiple present):
+llama-server -hf ggml-org/gemma-4-E4B-it-GGUF
+# Choose a specific quant:
+llama-server -hf ggml-org/gemma-4-26b-a4b-it-GGUF:Q4_K_M
+# Vision (+ audio for E-series) — add --mmproj pointing to the projector:
+llama-server -hf ggml-org/gemma-4-E4B-it-GGUF \
+  --mmproj ggml-org/gemma-4-E4B-it-GGUF/mmproj-gemma-4-E4B-it-Q8_0.gguf
+# Convert a new HF checkpoint to GGUF yourself:
+python convert_hf_to_gguf.py /path/to/google/gemma-4-31b-it --outfile gemma-4-31b.gguf
+
+### 3. Apple MLX — text via mlx-lm, multimodal via mlx-vlm (community) ###
+# Text generation (mlx-lm, first-party Apple):
+mlx_lm.generate --model mlx-community/gemma-4-E4B-it-4bit --prompt "Hello"
+# Vision/audio (mlx-vlm, Prince Canuma / community):
+mlx_vlm.generate --model mlx-community/gemma-4-E4B-it-8bit \
+  --image https://example.com/cat.jpg --prompt "Describe this image."
+
+### 4. Keras / keras-hub — reference implementation, training-focused ###
+# python:
+# import keras_hub
+# model = keras_hub.models.Gemma4CausalLM.from_preset("gemma4_instruct_4b")
+# model.generate("Hello", max_length=128)
+# Presets: gemma4_{2b,4b,26b_a4b,31b} and gemma4_instruct_{...}
+
+### 5. Text Generation Inference (TGI) — NO native Gemma 4 support as of 2026-04-18 ###
+# Upstream supported_models list stops at Gemma 3 / Gemma 3 Text.
+# Fallback: TGI will try AutoModelForCausalLM without optimized kernels —
+# expect degraded throughput and no guarantee of vision/audio paths.
+text-generation-launcher --model-id google/gemma-4-31b-it   # unoptimized fallback
+
+### 6. TensorRT-LLM — NOT supported ###
+# Support matrix (2026-04) lists Gemma2 and Gemma3{ForCausalLM,ForConditionalGeneration}
+# but NOT Gemma4. NVIDIA's own nvidia/Gemma-4-31B-IT-NVFP4 card points users to vLLM.
+# Issue #12764 tracks DGX Spark runtime skew. Avoid for production Gemma 4.
+
+### 7. Gemini API (Google AI Studio) — hosted Gemma 4 ###
+curl "https://generativelanguage.googleapis.com/v1beta/models/gemma-4-26b-a4b-it:generateContent" \
+  -H 'Content-Type: application/json' \
+  -H "x-goog-api-key: $GEMINI_API_KEY" \
+  -X POST \
+  -d '{"contents":[{"parts":[{"text":"Your prompt here"}]}]}'
+# Python SDK (google-genai):
+# from google import genai
+# client = genai.Client()
+# resp = client.models.generate_content(model="gemma-4-26b-a4b-it", contents="Hi")
+# print(resp.text)
+# Hosted model IDs: gemma-4-31b-it, gemma-4-26b-a4b-it
+
+### 8. Vertex AI Model Garden — one-click deploy ###
+# Console: https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma4
+# CLI (new model-garden command):
+gcloud ai model-garden models list | grep gemma-4
+# Python SDK (vertex-ai-model-garden):
+# from google.cloud.aiplatform import model_garden
+# model = model_garden.OpenModel("google/gemma4@gemma-4-31b-it")
+# endpoint = model.deploy()   # spins up Vertex endpoint with backing GPUs
@@ -0,0 +1,26 @@
+"""Canonical Gemma 4 call via the google-genai Python SDK (Gemini API).
+
+Source: https://ai.google.dev/gemma/docs/core/gemma_on_gemini_api
+
+Install:  pip install google-genai
+Env:      GEMINI_API_KEY=...  (from https://aistudio.google.com/apikey)
+
+Hosted model IDs (2026-04):
+  - gemma-4-31b-it
+  - gemma-4-26b-a4b-it
+
+The E-series (E2B, E4B) is NOT exposed via the Gemini API — those are
+on-device-only checkpoints. For them you must self-host (Ollama,
+llama.cpp, vLLM, MLX).
+"""
+
+from google import genai
+
+client = genai.Client()  # picks up GEMINI_API_KEY from env
+
+response = client.models.generate_content(
+    model="gemma-4-26b-a4b-it",
+    contents="Write a haiku about inference framework fragmentation.",
+)
+
+print(response.text)
@@ -0,0 +1,17 @@
+#!/usr/bin/env bash
+# Canonical Gemma 4 call via the Gemini API (Google AI Studio).
+# Source: https://ai.google.dev/gemma/docs/core/gemma_on_gemini_api
+# Hosted model IDs (2026-04): gemma-4-31b-it, gemma-4-26b-a4b-it
+# Note: hosted variants are the big ones only; on-device E2B/E4B are NOT served on the Gemini API.
+
+export GEMINI_API_KEY="..."  # from https://aistudio.google.com/apikey
+
+curl "https://generativelanguage.googleapis.com/v1beta/models/gemma-4-26b-a4b-it:generateContent" \
+  -H 'Content-Type: application/json' \
+  -H "x-goog-api-key: ${GEMINI_API_KEY}" \
+  -X POST \
+  -d '{
+    "contents": [{
+      "parts": [{"text": "Write a haiku about inference framework fragmentation."}]
+    }]
+  }'
@@ -0,0 +1,30 @@
+"""Canonical Keras / keras-hub example for Gemma 4.
+
+Source: keras-team/keras-hub — keras_hub/src/models/gemma4/
+Requires: pip install keras-hub keras[jax]  (or keras[torch] / keras[tensorflow])
+
+Presets (verified 2026-04-18 from gemma4_presets.py):
+  gemma4_2b              gemma4_instruct_2b
+  gemma4_4b              gemma4_instruct_4b
+  gemma4_26b_a4b         gemma4_instruct_26b_a4b
+  gemma4_31b             gemma4_instruct_31b
+
+Keras-hub is the reference implementation maintained by the Keras team
+(Google). It ships all components modularly — see the directory listing:
+gemma4_attention, gemma4_audio_encoder, gemma4_vision_encoder,
+gemma4_moe, gemma4_decoder_block, gemma4_causal_lm, etc.  This makes it
+the most legible path to *read* the architecture, but it is a
+training/fine-tuning tool — not a production inference server.
+"""
+
+import keras_hub
+
+# Text causal LM
+model = keras_hub.models.Gemma4CausalLM.from_preset("gemma4_instruct_4b")
+print(model.generate("Write a haiku about JAX.", max_length=128))
+
+# For multimodal (vision/audio) use the backbone + preprocessors directly:
+# backbone = keras_hub.models.Gemma4Backbone.from_preset("gemma4_instruct_4b")
+# preproc  = keras_hub.models.Gemma4CausalLMPreprocessor.from_preset("gemma4_instruct_4b")
+# Vision and audio encoders are in separate modules (gemma4_vision_encoder,
+# gemma4_audio_encoder) and are wired by the backbone when preset includes them.
@@ -0,0 +1,175 @@
+@ModelBase.register("Gemma4ForConditionalGeneration")
+class Gemma4Model(Gemma3Model):
+    model_arch = gguf.MODEL_ARCH.GEMMA4
+
+    def norm_shift(self, name: str) -> float:
+        del name # unused
+        return 0.0
+
+    def set_vocab(self):
+        vocab = gguf.LlamaHfVocab(self.dir_model)
+        tokens = []
+        scores = []
+        toktypes = []
+        visible_tokens = {"<|channel>", "<channel|>", "<|tool_call>", "<tool_call|>", "<|tool_response>", "<tool_response|>", "<|\"|>"}
+
+        for text, score, toktype in vocab.all_tokens():
+            tokens.append(text)
+            scores.append(score)
+            text_str = text.decode()
+            if text_str in visible_tokens:
+                # always render these tokens, so that the chat parser can read them
+                toktypes.append(gguf.TokenType.USER_DEFINED)
+                logger.info(f"Token '{text_str}' is set to USER_DEFINED")
+            else:
+                toktypes.append(toktype)
+
+        assert len(tokens) == vocab.vocab_size
+
+        self.gguf_writer.add_tokenizer_model("gemma4")
+        self.gguf_writer.add_token_list(tokens)
+        self.gguf_writer.add_token_scores(scores)
+        self.gguf_writer.add_token_types(toktypes)
+
+        special_vocab = gguf.SpecialVocab(self.dir_model, load_merges=True)
+        special_vocab.add_to_gguf(self.gguf_writer)
+        self.gguf_writer.add_add_space_prefix(False)
+        self.gguf_writer.add_add_bos_token(True)
+
+    def set_gguf_parameters(self):
+        super().set_gguf_parameters()
+
+        num_kv_shared_layers = self.hparams["num_kv_shared_layers"]
+        self.gguf_writer.add_shared_kv_layers(num_kv_shared_layers)
+
+        # per-layer embedding is optional
+        n_pl_embd = self.hparams.get("hidden_size_per_layer_input") or 0
+        self.gguf_writer.add_embedding_length_per_layer_input(n_pl_embd)
+
+        swa_layers = [t == "sliding_attention" for t in self.hparams["layer_types"]]
+        self.gguf_writer.add_sliding_window_pattern(swa_layers)
+
+        head_dim_full = self.hparams["global_head_dim"]
+        head_dim_swa = self.hparams["head_dim"]
+        # correct the head dim for global/swa layers
+        self.gguf_writer.add_key_length(head_dim_full)
+        self.gguf_writer.add_value_length(head_dim_full)
+        self.gguf_writer.add_key_length_swa(head_dim_swa)
+        self.gguf_writer.add_value_length_swa(head_dim_swa)
+
+        expert_intermediate_size = self.find_hparam(["expert_intermediate_size", "moe_intermediate_size"])
+        if expert_intermediate_size is not None:
+            self.gguf_writer.add_expert_feed_forward_length(expert_intermediate_size)
+
+        # if use_double_wide_mlp is set, we need to adjust the value for kv shared layers
+        use_double_wide_mlp = self.hparams.get("use_double_wide_mlp", False)
+        first_kv_shared_layer_idx = self.block_count - num_kv_shared_layers
+        if use_double_wide_mlp:
+            n_ff = self.hparams["intermediate_size"]
+            n_ff_arr = [n_ff if il < first_kv_shared_layer_idx else n_ff * 2 for il in range(self.block_count)]
+            self.gguf_writer.add_feed_forward_length(n_ff_arr)
+
+        # handle num_global_key_value_heads
+        num_key_value_heads_full = self.hparams.get("num_global_key_value_heads")
+        num_key_value_heads_swa = self.hparams.get("num_key_value_heads")
+        if num_key_value_heads_full is not None and num_key_value_heads_swa is not None:
+            value_arr = [num_key_value_heads_swa if is_swa else num_key_value_heads_full for is_swa in swa_layers]
+            self.gguf_writer.add_head_count_kv(value_arr)
+
+        # handle n_rot differently for global vs swa layers
+        partial_rotary_factor_swa = self.hparams.get("partial_rotary_factor", 1.0)
+        n_rot_full = int(head_dim_full) # "proportional" is used, see generate_extra_tensors
+        n_rot_swa = int(head_dim_swa * partial_rotary_factor_swa)
+        self.gguf_writer.add_rope_dimension_count(n_rot_full)
+        self.gguf_writer.add_rope_dimension_count_swa(n_rot_swa)
+
+    def generate_extra_tensors(self) -> Iterable[tuple[str, Tensor]]:
+        # full layer uses "proportional" rope with partial_rotary_factor=0.25
+        # the expected ordering is cc000000ss000000 (c = cos, s = sin, 0 = unrotated),
+        # but ggml neox only supports ccss000000000000, and we cannot rearrange the head because that will break use_alternative_attention
+        # solution is to set specific freq_factors for the unrotated dims
+
+        # IMPORTANT: this ROPE_FREQS tensor is ONLY used by the full_attention layers
+        rope_params_full = self.hparams["rope_parameters"]["full_attention"]
+        assert rope_params_full["rope_type"] == "proportional"
+        head_dim_full = (self.hparams["global_head_dim"])
+        partial_rotary_factor_full = rope_params_full["partial_rotary_factor"]
+        n_rot_full = int(head_dim_full * partial_rotary_factor_full / 2)
+        n_unrot_full = int(head_dim_full / 2) - n_rot_full
+        values = [1.0] * n_rot_full + [1e30] * n_unrot_full
+        rope_freqs_full = torch.tensor(values, dtype=torch.float32)
+        yield (self.format_tensor_name(gguf.MODEL_TENSOR.ROPE_FREQS), rope_freqs_full)
+
+    def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
+        if name.endswith("per_dim_scale") or name.endswith("layer_scalar"):
+            name = name + ".weight"
+
+        if "language_model." not in name and "rope_freqs" not in name:
+            return # skip non-language model tensors
+
+        name = name.replace("language_model.", "")
+        if name.endswith("router.scale"):
+            name = self.format_tensor_name(gguf.MODEL_TENSOR.FFN_GATE_INP, bid, ".scale")
+            yield (name, data_torch)
+            return
+        if ".per_expert_scale" in name:
+            # convert per-expert scale to FFN down scale
+            name = self.format_tensor_name(gguf.MODEL_TENSOR.FFN_DOWN_EXP, bid, ".scale")
+            yield (name, data_torch)
+            return
+        if ".experts." in name and not name.endswith(".weight"):
+            name += ".weight"
+
+        yield from super().modify_tensors(data_torch, name, bid)
+
+
+@ModelBase.register("Gemma4ForConditionalGeneration")
+class Gemma4VisionAudioModel(MmprojModel):
+    has_audio_encoder = True
+    has_vision_encoder = True
+
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        assert self.hparams_vision is not None
+        self.hparams_vision["image_size"] = 224 # unused, but set to avoid error
+
+        # remap audio hparams
+        if self.hparams_audio:
+            self.hparams_audio["feat_in"] = self.hparams_audio.get("input_feat_size", 128)
+            self.hparams_audio["intermediate_size"] = self.hparams_audio["hidden_size"] * 4
+        else:
+            self.has_audio_encoder = False
+
+    def set_gguf_parameters(self):
+        super().set_gguf_parameters()
+
+        # vision params
+        self.gguf_writer.add_clip_vision_projector_type(gguf.VisionProjectorType.GEMMA4V)
+        self.gguf_writer.add_vision_attention_layernorm_eps(self.hparams.get("layer_norm_eps", 1e-6))
+
+        # audio params
+        if self.hparams_audio:
+            self.gguf_writer.add_clip_audio_projector_type(gguf.VisionProjectorType.GEMMA4A)
+            self.gguf_writer.add_audio_num_mel_bins(self.hparams_audio["feat_in"])
+            self.gguf_writer.add_audio_attention_layernorm_eps(1e-5)
+
+    def is_audio_tensor(self, name: str) -> bool:
+        return "audio_tower" in name or "embed_audio" in name
+
+    def tensor_force_quant(self, name, new_name, bid, n_dims):
+        if self.is_audio_tensor(name):
+            if ".conv" in name or "_conv" in name and ".weight" in name:
+                return gguf.GGMLQuantizationType.F32
+        if "position_embedding_table" in name:
+            return gguf.GGMLQuantizationType.F32
+        return super().tensor_force_quant(name, new_name, bid, n_dims)
+
+    def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
+        del bid # unused
+
+        if name.startswith("model.language_model."):
+            return # skip
+
+        if len(data_torch.shape) == 0:
+            # convert scalar tensors (input/output_mix/max) to 1D tensors
+            data_torch = data_torch.unsqueeze(0)
@@ -0,0 +1,92 @@
+# Copyright © 2025 Apple Inc.
+
+from dataclasses import dataclass
+from typing import Optional
+
+import mlx.core as mx
+import mlx.nn as nn
+from mlx.utils import tree_flatten, tree_unflatten
+
+from . import gemma4_text
+from .base import BaseModelArgs
+
+
+@dataclass
+class ModelArgs(BaseModelArgs):
+    model_type: str = "gemma4"
+    text_config: dict = None
+    vocab_size: int = 262144
+
+    def __post_init__(self):
+        if self.text_config is None:
+            self.text_config = {}
+        self.text_config["vocab_size"] = self.vocab_size
+        self.text_config["num_attention_heads"] = self.text_config.get(
+            "num_attention_heads", 8
+        )
+        self.text_config["num_key_value_heads"] = self.text_config.get(
+            "num_key_value_heads", 1
+        )
+
+
+class Model(nn.Module):
+    def __init__(self, args: ModelArgs):
+        super().__init__()
+        self.args = args
+        self.model_type = args.model_type
+        self.language_model = gemma4_text.Model(
+            gemma4_text.ModelArgs.from_dict(args.text_config)
+        )
+
+    def __call__(
+        self,
+        inputs: mx.array,
+        cache=None,
+        input_embeddings: Optional[mx.array] = None,
+        per_layer_inputs: Optional[mx.array] = None,
+    ):
+        return self.language_model(
+            inputs,
+            cache=cache,
+            input_embeddings=input_embeddings,
+            per_layer_inputs=per_layer_inputs,
+        )
+
+    def sanitize(self, weights):
+        new_weights = {}
+        for k, v in weights.items():
+            starts_w_model = k.startswith("model.")
+
+            k = k.removeprefix("model.")
+            if k.startswith(
+                (
+                    "vision_tower",
+                    "multi_modal_projector",
+                    "audio_tower",
+                    "embed_audio",
+                    "embed_vision",
+                )
+            ):
+                continue
+
+            if not starts_w_model:
+                new_weights[k] = v
+                continue
+
+            if k.startswith("language_model"):
+                k = k.replace("language_model.", "language_model.model.")
+
+            new_weights[k] = v
+
+        return self.language_model.sanitize(new_weights)
+
+    @property
+    def layers(self):
+        return self.language_model.layers
+
+    @property
+    def quant_predicate(self):
+        return self.language_model.quant_predicate
+
+    def make_cache(self):
+        return self.language_model.make_cache()
@@ -0,0 +1,60 @@
+from typing import Optional
+
+import mlx.core as mx
+import mlx.nn as nn
+
+from ..base import InputEmbeddingsFeatures
+from .audio import AudioEncoder
+from .config import ModelConfig
+from .language import LanguageModel, RMSNormNoScale
+from .vision import VisionModel
+
+
+def masked_scatter(input_tensor, mask, source):
+    mask_flat = mask.flatten().astype(mx.int32)
+    indices = mx.cumsum(mask_flat) - 1
+    aligned = source.flatten()[indices % source.size]
+    return mx.where(mask_flat, aligned, input_tensor.flatten()).reshape(
+        input_tensor.shape
+    )
+
+
+class MultimodalEmbedder(nn.Module):
+    """Projects soft tokens from vision/audio into language model space."""
+
+    def __init__(self, embedding_dim: int, text_hidden_size: int, eps: float = 1e-6):
+        super().__init__()
+        self.embedding_projection = nn.Linear(
+            embedding_dim, text_hidden_size, bias=False
+        )
+        self.embedding_pre_projection_norm = RMSNormNoScale(embedding_dim, eps=eps)
+
+    def __call__(self, inputs_embeds: mx.array) -> mx.array:
+        normed = self.embedding_pre_projection_norm(inputs_embeds)
+        return self.embedding_projection(normed)
+
+
+class Model(nn.Module):
+    def __init__(self, config: ModelConfig):
+        super().__init__()
+        self.model_type = config.model_type
+        self.config = config
+
+        # Text
+        self.language_model = LanguageModel(config.text_config)
+        self.vocab_size = config.text_config.vocab_size
+
+        # Vision
+        self.vision_tower = VisionModel(config.vision_config)
+        self.embed_vision = MultimodalEmbedder(
+            embedding_dim=config.vision_config.hidden_size,
+            text_hidden_size=config.text_config.hidden_size,
+            eps=config.vision_config.rms_norm_eps,
+        )
+
+        # Audio
+        if config.audio_config is not None:
+            self.audio_tower = AudioEncoder(config.audio_config)
+            audio_output_dim = (
+                config.audio_config.output_proj_dims or config.audio_config.hidden_size
+            )
@@ -0,0 +1,90 @@
+# SPDX-License-Identifier: Apache-2.0
+# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
+# Copyright 2025 The vLLM team.
+# Copyright 2025 Google Inc. HuggingFace Inc. team. All rights reserved.
+#
+#
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Gemma 4 model implementation for vLLM."""
+
+from collections.abc import Iterable
+from dataclasses import replace
+from itertools import islice
+
+import regex as re
+import torch
+from torch import nn
+
+from vllm.compilation.decorators import support_torch_compile
+from vllm.config import CacheConfig, VllmConfig
+from vllm.distributed import (
+    get_pp_group,
+    get_tensor_model_parallel_rank,
+    get_tensor_model_parallel_world_size,
+)
+from vllm.forward_context import get_forward_context
+from vllm.logger import init_logger
+from vllm.model_executor.layers.activation import GeluAndMul
+from vllm.model_executor.layers.attention import Attention
+from vllm.model_executor.layers.fused_moe import FusedMoE, GateLinear
+from vllm.model_executor.layers.layernorm import RMSNorm
+from vllm.model_executor.layers.linear import (
+    ColumnParallelLinear,
+    MergedColumnParallelLinear,
+    QKVParallelLinear,
+    ReplicatedLinear,
+    RowParallelLinear,
+)
+from vllm.model_executor.layers.logits_processor import LogitsProcessor
+from vllm.model_executor.layers.quantization import QuantizationConfig
+from vllm.model_executor.layers.rotary_embedding import get_rope
+from vllm.model_executor.layers.vocab_parallel_embedding import (
+    ParallelLMHead,
+    VocabParallelEmbedding,
+)
+from vllm.model_executor.model_loader.weight_utils import (
+    default_weight_loader,
+    maybe_remap_kv_scale_name,
+)
+from vllm.sequence import IntermediateTensors
+from vllm.v1.attention.backends.utils import KVSharingFastPrefillMetadata
+
+from .interfaces import (
+    EagleModelMixin,
+    MixtureOfExperts,
+    SupportsEagle3,
+    SupportsLoRA,
+    SupportsPP,
+)
+from .utils import (
+    AutoWeightsLoader,
+    WeightsMapper,
+    extract_layer_index,
+    is_pp_missing_parameter,
+    make_layers,
+    maybe_prefix,
+)
+
+logger = init_logger(__name__)
+
+
+def _get_text_config(config):
+    """Dereference text_config if config is a nested Gemma4Config.
+
+    Gemma4 checkpoints use architectures=["Gemma4ForConditionalGeneration"]
+    which yields a Gemma4Config with nested text_config. This function
+    transparently returns the text config regardless of nesting.
+    """
+    if hasattr(config, "text_config"):
+        return config.text_config
@@ -0,0 +1,80 @@
+# SPDX-License-Identifier: Apache-2.0
+# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
+"""Gemma 4 multimodal model (image + audio + video support).
+
+Adds vision tower, audio tower, and multimodal embedders on top of the
+text-only Gemma4ForCausalLM.  The vision/audio encoders are loaded via
+AutoModel.from_config and run in eager mode while the language model uses
+the vLLM-optimized path.
+
+Video support:  Gemma4 does **not** have a native video tower.  Videos are
+decomposed into timestamped image frames (up to 32 frames at 70 soft tokens
+each) and fed through the same vision tower as regular images.  The
+processor inserts ``mm:ss`` timestamps between frames so the model can
+reason about temporal order.
+"""
+
+import math
+from collections.abc import Iterable, Mapping, Sequence
+from typing import Annotated, Any, Literal
+
+import numpy as np
+import torch
+from PIL import Image as PILImage
+from torch import nn
+from transformers import AutoModel, BatchFeature
+from transformers.models.gemma4 import (
+    Gemma4Config,
+    Gemma4Processor,
+    Gemma4VisionConfig,
+)
+from transformers.models.gemma4.configuration_gemma4 import (
+    Gemma4AudioConfig,
+    Gemma4TextConfig,
+)
+
+from vllm.config import VllmConfig
+from vllm.config.multimodal import BaseDummyOptions, VideoDummyOptions
+from vllm.inputs import MultiModalDataDict
+from vllm.logger import init_logger
+from vllm.model_executor.layers.layernorm import RMSNorm
+from vllm.model_executor.layers.linear import ReplicatedLinear
+from vllm.model_executor.models.gemma4 import Gemma4ForCausalLM
+from vllm.model_executor.models.module_mapping import MultiModelKeys
+from vllm.multimodal import MULTIMODAL_REGISTRY
+from vllm.multimodal.inputs import (
+    MultiModalFieldConfig,
+    MultiModalKwargsItems,
+    VideoItem,
+)
+from vllm.multimodal.parse import (
+    AudioProcessorItems,
+    ImageProcessorItems,
+    MultiModalDataItems,
+    MultiModalDataParser,
+)
+from vllm.multimodal.processing import BaseDummyInputsBuilder
+from vllm.multimodal.processing.processor import (
+    BaseMultiModalProcessor,
+    BaseProcessingInfo,
+    PromptReplacement,
+    PromptUpdate,
+    PromptUpdateDetails,
+)
+from vllm.sequence import IntermediateTensors
+from vllm.utils.tensor_schema import TensorSchema, TensorShape
+
+from .interfaces import (
+    MultiModalEmbeddings,
+    SupportsEagle3,
+    SupportsLoRA,
+    SupportsMultiModal,
+    SupportsPP,
+)
+from .utils import (
+    AutoWeightsLoader,
+    WeightsMapper,
+    init_vllm_registered_model,
+    maybe_prefix,
+)
+
@@ -0,0 +1,16 @@
+# Source: vllm-project/vllm main branch — vllm/model_executor/models/registry.py
+# Verified 2026-04-18 via GitHub API.
+
+# Line 99 (text-only Gemma 4 CausalLM):
+"Gemma4ForCausalLM": ("gemma4", "Gemma4ForCausalLM"),
+
+# Line 230 (multimodal Gemma 4: vision + audio + video):
+"Gemma4ForCausalLM": ("gemma4_mm", "Gemma4ForConditionalGeneration"),
+
+# The second (_mm) registration maps Gemma4ForCausalLM -> gemma4_mm.Gemma4ForConditionalGeneration,
+# which wires in:
+#   - vision_tower (pixel_values, pixel_position_ids)
+#   - audio_tower  (input_features_padded, input_features_mask)  [E2B/E4B only]
+#   - video path   (pixel_values_videos — decomposed to frames, up to 32 frames @ 70 soft tokens)
+#
+# vLLM dispatches based on whether the HF config has audio_config populated.