# Gemma 4 — Inference Framework Support Matrix > Non-Ollama frameworks. Ollama is covered separately in the parent research corpus. > Verified against upstream repos, model cards, and docs on **2026-04-18**. ## Summary table | # | Framework | Gemma 4 support | Vision | Audio | Tool calling | Quantization options | Canonical run command | |---|---|---|---|---|---|---|---| | 1 | **vLLM** | Native, upstream merged — `gemma4.py` (text) + `gemma4_mm.py` (multimodal). Registered in `registry.py` as `Gemma4ForCausalLM` and `Gemma4ForConditionalGeneration`. | Yes (all sizes) | Yes (E2B/E4B) | Yes — OpenAI-compatible `/v1/chat/completions` with `tools=[...]` | AWQ, GPTQ, FP8, NVFP4 (via `--quantization modelopt`), BF16 | `vllm serve google/gemma-4-31b-it --tensor-parallel-size 2` | | 2 | **llama.cpp / GGUF** | Native — `Gemma4Model` + `Gemma4VisionAudioModel` registered in `convert_hf_to_gguf.py` (lines 7666 & 7791). Distinct `GEMMA4V` + `GEMMA4A` projector types. Official GGUFs published at `ggml-org/gemma-4-*-GGUF`. | Yes (all, via mmproj) | Yes (E-series, via mmproj) | Yes — `llama-server` exposes OpenAI-compatible tools API | Q4_K_M, Q8_0, BF16 published officially; full quant menu via self-convert | `llama-server -hf ggml-org/gemma-4-E4B-it-GGUF` | | 3 | **Apple MLX** | Native in `mlx-lm` (text, `gemma4.py` + `gemma4_text.py`) and `mlx-vlm` (multimodal, `mlx_vlm/models/gemma4/` with `audio.py`, `vision.py`, `language.py`, `processing_gemma4.py`) | Yes (mlx-vlm) | Yes (mlx-vlm) | Community; no first-party tools wrapper | 4bit, 8bit, bf16 via MLX quantize | `mlx_vlm.generate --model mlx-community/gemma-4-E4B-it-8bit --image URL --prompt "..."` | | 4 | **Keras / keras-hub** | Native, full modular impl: `keras_hub/src/models/gemma4/` with `attention`, `audio_encoder`, `vision_encoder`, `decoder_block`, `moe`, `causal_lm`, etc. 8 presets (base + instruct × 2B/4B/26B_a4b/31B). | Yes | Yes | No (it's a training library, not an inference server) | Via Keras mixed-precision; no canonical GGUF/AWQ path | `keras_hub.models.Gemma4CausalLM.from_preset("gemma4_instruct_4b")` | | 5 | **HF Text Generation Inference (TGI)** | **No native support.** Supported-models page stops at Gemma 3 / Gemma 3 Text. No open or merged PRs for "gemma4" (verified). Will fall back to unoptimized `AutoModelForCausalLM` path. | Fallback only, no vision kernels | No | Fallback only | Whatever HF transformers exposes on the fallback path | `text-generation-launcher --model-id google/gemma-4-31b-it` (degraded) | | 6 | **TensorRT-LLM / NVIDIA NIM** | **Not in the 2026-04 support matrix.** Matrix lists `Gemma3ForCausalLM`/`Gemma3ForConditionalGeneration` but no Gemma 4 entry. GitHub issue #12764 tracks broken runtime on DGX Spark/GB10. NVIDIA's own `nvidia/Gemma-4-31B-IT-NVFP4` card tells users to run it on **vLLM**, not TRT-LLM. | N/A | N/A | N/A | NVFP4 export exists but runtime is broken; use the NVFP4 weights in vLLM instead | Avoid — use `vllm serve nvidia/Gemma-4-31B-IT-NVFP4 --quantization modelopt` | | 7 | **Gemini API (AI Studio)** | Hosted. Model IDs: `gemma-4-31b-it`, `gemma-4-26b-a4b-it`. E-series NOT exposed (on-device only). | Yes (via `inlineData` parts) | No (Gemini API strips the audio path) | Yes — same `tools=[...]` schema as Gemini models | N/A (Google-managed) | `curl .../v1beta/models/gemma-4-26b-a4b-it:generateContent -d @payload.json` | | 8 | **Vertex AI Model Garden** | One-click deploy. Model card: `console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma4`. Publisher ID format `google/gemma4@gemma-4-31b-it`. 26B-A4B is offered fully managed & serverless; 31B requires self-provisioned GPU endpoint. | Yes (via endpoint backend — vLLM under the hood) | Yes for E-series variants deployed that way | Yes (endpoint inherits from backing runtime) | Depends on backing image (vLLM/SAX) — BF16, FP8, AWQ selectable at deploy time | `model_garden.OpenModel("google/gemma4@gemma-4-31b-it").deploy()` | ## Production-readiness ranking 1. **vLLM** — most complete, most optimized, only runtime with first-party NVFP4 support and tested multimodal (image+audio+video). 2. **llama.cpp / GGUF** — best for local CPU + small GPU, only framework with audio mmproj shipping as a downloadable file for E-series, official Google-published quants via `ggml-org/*`. 3. **Gemini API / Vertex AI** — if you don't want to self-host; Vertex gives you the managed-endpoint exit path with vLLM under the hood. 4. **Apple MLX** — production-ready on Apple Silicon only; `mlx-vlm` is community-maintained but actively updated. 5. **Keras-hub** — reference/training, not inference-server. 6. **TGI** — usable as a *fallback* only; no optimized path yet. 7. **TensorRT-LLM** — **avoid for Gemma 4.** NVIDIA themselves point at vLLM. ## Capabilities beyond Ollama - **Native audio input** — Ollama does **not** currently expose the E2B/E4B audio tower. Three frameworks do: - **llama.cpp** with the `mmproj-...-E4B-it-*.gguf` projector (`VisionProjectorType.GEMMA4A`), - **vLLM** via `gemma4_mm.py` (`input_features_padded`, `input_features_mask`), - **MLX** via `mlx-vlm/models/gemma4/audio.py`. If Seth ever wants the speech-transcription path, llama.cpp with the E4B mmproj is the shortest route from where he already is. - **Video with interleaved audio** — vLLM's `gemma4_mm.py` decomposes videos into up to 32 timestamped frames; with E-series models it also loads the audio track (`load_audio_from_video=True`). Ollama has no video path at all. - **NVFP4 on Blackwell** — vLLM only. `nvidia/Gemma-4-31B-IT-NVFP4` reports ~0.3 pp accuracy loss vs BF16 on GPQA Diamond / MMLU Pro. ## Framework to avoid **TensorRT-LLM.** Not in the upstream support matrix as of 2026-04, known runtime bug on DGX Spark/GB10 (issue #12764), and NVIDIA's own NVFP4 checkpoint directs users to vLLM. Revisit only after a future TRT-LLM release lists `Gemma4ForCausalLM` in the support matrix. ## Files in this directory ``` inference-frameworks/ ├── README.md — this file ├── run_commands.sh — canonical one-liners per framework └── snippets/ ├── llamacpp_convert_gemma4_excerpt.py — Gemma4Model + Gemma4VisionAudioModel from convert_hf_to_gguf.py (lines 7666-7840) ├── vllm_gemma4_head_80.py — gemma4.py header (imports, config deref) ├── vllm_gemma4_mm_head_80.py — gemma4_mm.py header (multimodal docstring lists image/audio/video) ├── vllm_registry_excerpt.txt — registry.py Gemma4 registrations ├── mlx_gemma4_head_100.py — mlx-lm gemma4.py (text) first 100 lines ├── mlx_vlm_gemma4_head_60.py — mlx-vlm gemma4/gemma4.py (multimodal) first 60 lines ├── keras_hub_gemma4.py — canonical keras-hub example + preset list ├── gemini_api_gemma4.sh — canonical curl example └── gemini_api_gemma4.py — canonical google-genai Python SDK example ``` ## Notable upstream references - vLLM Gemma 4 model class: `vllm-project/vllm:vllm/model_executor/models/gemma4.py` and `gemma4_mm.py` - llama.cpp HF → GGUF converter: `ggml-org/llama.cpp:convert_hf_to_gguf.py` lines 7666-7840 - Official Google GGUF repos (verified live): `ggml-org/gemma-4-{E2B,E4B,31B,26b-a4b}-it-GGUF` — all ship mmproj projector files - HF blog: huggingface.co/blog/gemma4 — shows `AutoModelForMultimodalLM` is the canonical transformers entry point - NVIDIA NVFP4 checkpoint: `nvidia/Gemma-4-31B-IT-NVFP4` — runtime=vLLM, not TRT-LLM - Gemini API doc: ai.google.dev/gemma/docs/core/gemma_on_gemini_api - Vertex AI Model Garden: console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma4 - TGI supported-models list (confirming *absence* of Gemma 4): huggingface.co/docs/text-generation-inference/supported_models - TRT-LLM support matrix (confirming *absence*): nvidia.github.io/TensorRT-LLM/reference/support-matrix.html