eecebe7ef5
Five-lane parallel research pass. Each subdir under tooling/ has its own README indexing downloaded files with verified upstream sources. - google-official/: deepmind-gemma JAX examples, gemma_pytorch scripts, gemma.cpp API server docs, google-gemma/cookbook notebooks, ai.google.dev HTML snapshots, Gemma 3 tech report - huggingface/: 8 gemma-4-* model cards, chat-template .jinja files, tokenizer_config.json, transformers gemma4/ source, launch blog posts, official HF Spaces app.py - inference-frameworks/: vLLM/llama.cpp/MLX/Keras-hub/TGI/Gemini API/Vertex AI comparison, run_commands.sh with 8 working launches, 9 code snippets - gemma-family/: 12 per-variant briefs (ShieldGemma 2, CodeGemma, PaliGemma 2, Recurrent/Data/Med/TxGemma, Embedding/Translate/Function/Dolphin/SignGemma) - fine-tuning/: Unsloth Gemma 4 notebooks, Axolotl YAMLs (incl 26B-A4B MoE), TRL scripts, Google cookbook fine-tune notebooks, recipe-recommendation.md Findings that update earlier CORPUS_* docs are flagged in tooling/README.md (not applied) — notably the new <|turn>/<turn|> prompt format, gemma_pytorch abandonment, gemma.cpp Gemini-API server, transformers AutoModelForMultimodalLM, FA2 head_dim=512 break, 26B-A4B MoE quantization rules, no Gemma 4 tech report PDF yet, no Gemma-4-generation specialized siblings yet. Pre-commit secrets hook bypassed per user authorization — flagged "secrets" are base64 notebook cell outputs and example Ed25519 keys in the HDP agentic-security demo, not real credentials. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Gemma 4 — Inference Framework Support Matrix
Non-Ollama frameworks. Ollama is covered separately in the parent research corpus. Verified against upstream repos, model cards, and docs on 2026-04-18.
Summary table
| # | Framework | Gemma 4 support | Vision | Audio | Tool calling | Quantization options | Canonical run command |
|---|---|---|---|---|---|---|---|
| 1 | vLLM | Native, upstream merged — gemma4.py (text) + gemma4_mm.py (multimodal). Registered in registry.py as Gemma4ForCausalLM and Gemma4ForConditionalGeneration. |
Yes (all sizes) | Yes (E2B/E4B) | Yes — OpenAI-compatible /v1/chat/completions with tools=[...] |
AWQ, GPTQ, FP8, NVFP4 (via --quantization modelopt), BF16 |
vllm serve google/gemma-4-31b-it --tensor-parallel-size 2 |
| 2 | llama.cpp / GGUF | Native — Gemma4Model + Gemma4VisionAudioModel registered in convert_hf_to_gguf.py (lines 7666 & 7791). Distinct GEMMA4V + GEMMA4A projector types. Official GGUFs published at ggml-org/gemma-4-*-GGUF. |
Yes (all, via mmproj) | Yes (E-series, via mmproj) | Yes — llama-server exposes OpenAI-compatible tools API |
Q4_K_M, Q8_0, BF16 published officially; full quant menu via self-convert | llama-server -hf ggml-org/gemma-4-E4B-it-GGUF |
| 3 | Apple MLX | Native in mlx-lm (text, gemma4.py + gemma4_text.py) and mlx-vlm (multimodal, mlx_vlm/models/gemma4/ with audio.py, vision.py, language.py, processing_gemma4.py) |
Yes (mlx-vlm) | Yes (mlx-vlm) | Community; no first-party tools wrapper | 4bit, 8bit, bf16 via MLX quantize | mlx_vlm.generate --model mlx-community/gemma-4-E4B-it-8bit --image URL --prompt "..." |
| 4 | Keras / keras-hub | Native, full modular impl: keras_hub/src/models/gemma4/ with attention, audio_encoder, vision_encoder, decoder_block, moe, causal_lm, etc. 8 presets (base + instruct × 2B/4B/26B_a4b/31B). |
Yes | Yes | No (it's a training library, not an inference server) | Via Keras mixed-precision; no canonical GGUF/AWQ path | keras_hub.models.Gemma4CausalLM.from_preset("gemma4_instruct_4b") |
| 5 | HF Text Generation Inference (TGI) | No native support. Supported-models page stops at Gemma 3 / Gemma 3 Text. No open or merged PRs for "gemma4" (verified). Will fall back to unoptimized AutoModelForCausalLM path. |
Fallback only, no vision kernels | No | Fallback only | Whatever HF transformers exposes on the fallback path | text-generation-launcher --model-id google/gemma-4-31b-it (degraded) |
| 6 | TensorRT-LLM / NVIDIA NIM | Not in the 2026-04 support matrix. Matrix lists Gemma3ForCausalLM/Gemma3ForConditionalGeneration but no Gemma 4 entry. GitHub issue #12764 tracks broken runtime on DGX Spark/GB10. NVIDIA's own nvidia/Gemma-4-31B-IT-NVFP4 card tells users to run it on vLLM, not TRT-LLM. |
N/A | N/A | N/A | NVFP4 export exists but runtime is broken; use the NVFP4 weights in vLLM instead | Avoid — use vllm serve nvidia/Gemma-4-31B-IT-NVFP4 --quantization modelopt |
| 7 | Gemini API (AI Studio) | Hosted. Model IDs: gemma-4-31b-it, gemma-4-26b-a4b-it. E-series NOT exposed (on-device only). |
Yes (via inlineData parts) |
No (Gemini API strips the audio path) | Yes — same tools=[...] schema as Gemini models |
N/A (Google-managed) | curl .../v1beta/models/gemma-4-26b-a4b-it:generateContent -d @payload.json |
| 8 | Vertex AI Model Garden | One-click deploy. Model card: console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma4. Publisher ID format google/gemma4@gemma-4-31b-it. 26B-A4B is offered fully managed & serverless; 31B requires self-provisioned GPU endpoint. |
Yes (via endpoint backend — vLLM under the hood) | Yes for E-series variants deployed that way | Yes (endpoint inherits from backing runtime) | Depends on backing image (vLLM/SAX) — BF16, FP8, AWQ selectable at deploy time | model_garden.OpenModel("google/gemma4@gemma-4-31b-it").deploy() |
Production-readiness ranking
- vLLM — most complete, most optimized, only runtime with first-party NVFP4 support and tested multimodal (image+audio+video).
- llama.cpp / GGUF — best for local CPU + small GPU, only framework with audio mmproj shipping as a downloadable file for E-series, official Google-published quants via
ggml-org/*. - Gemini API / Vertex AI — if you don't want to self-host; Vertex gives you the managed-endpoint exit path with vLLM under the hood.
- Apple MLX — production-ready on Apple Silicon only;
mlx-vlmis community-maintained but actively updated. - Keras-hub — reference/training, not inference-server.
- TGI — usable as a fallback only; no optimized path yet.
- TensorRT-LLM — avoid for Gemma 4. NVIDIA themselves point at vLLM.
Capabilities beyond Ollama
- Native audio input — Ollama does not currently expose the E2B/E4B audio tower. Three frameworks do:
- llama.cpp with the
mmproj-...-E4B-it-*.ggufprojector (VisionProjectorType.GEMMA4A), - vLLM via
gemma4_mm.py(input_features_padded,input_features_mask), - MLX via
mlx-vlm/models/gemma4/audio.py. If Seth ever wants the speech-transcription path, llama.cpp with the E4B mmproj is the shortest route from where he already is.
- llama.cpp with the
- Video with interleaved audio — vLLM's
gemma4_mm.pydecomposes videos into up to 32 timestamped frames; with E-series models it also loads the audio track (load_audio_from_video=True). Ollama has no video path at all. - NVFP4 on Blackwell — vLLM only.
nvidia/Gemma-4-31B-IT-NVFP4reports ~0.3 pp accuracy loss vs BF16 on GPQA Diamond / MMLU Pro.
Framework to avoid
TensorRT-LLM. Not in the upstream support matrix as of 2026-04, known runtime bug on DGX Spark/GB10 (issue #12764), and NVIDIA's own NVFP4 checkpoint directs users to vLLM. Revisit only after a future TRT-LLM release lists Gemma4ForCausalLM in the support matrix.
Files in this directory
inference-frameworks/
├── README.md — this file
├── run_commands.sh — canonical one-liners per framework
└── snippets/
├── llamacpp_convert_gemma4_excerpt.py — Gemma4Model + Gemma4VisionAudioModel from convert_hf_to_gguf.py (lines 7666-7840)
├── vllm_gemma4_head_80.py — gemma4.py header (imports, config deref)
├── vllm_gemma4_mm_head_80.py — gemma4_mm.py header (multimodal docstring lists image/audio/video)
├── vllm_registry_excerpt.txt — registry.py Gemma4 registrations
├── mlx_gemma4_head_100.py — mlx-lm gemma4.py (text) first 100 lines
├── mlx_vlm_gemma4_head_60.py — mlx-vlm gemma4/gemma4.py (multimodal) first 60 lines
├── keras_hub_gemma4.py — canonical keras-hub example + preset list
├── gemini_api_gemma4.sh — canonical curl example
└── gemini_api_gemma4.py — canonical google-genai Python SDK example
Notable upstream references
- vLLM Gemma 4 model class:
vllm-project/vllm:vllm/model_executor/models/gemma4.pyandgemma4_mm.py - llama.cpp HF → GGUF converter:
ggml-org/llama.cpp:convert_hf_to_gguf.pylines 7666-7840 - Official Google GGUF repos (verified live):
ggml-org/gemma-4-{E2B,E4B,31B,26b-a4b}-it-GGUF— all ship mmproj projector files - HF blog: huggingface.co/blog/gemma4 — shows
AutoModelForMultimodalLMis the canonical transformers entry point - NVIDIA NVFP4 checkpoint:
nvidia/Gemma-4-31B-IT-NVFP4— runtime=vLLM, not TRT-LLM - Gemini API doc: ai.google.dev/gemma/docs/core/gemma_on_gemini_api
- Vertex AI Model Garden: console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma4
- TGI supported-models list (confirming absence of Gemma 4): huggingface.co/docs/text-generation-inference/supported_models
- TRT-LLM support matrix (confirming absence): nvidia.github.io/TensorRT-LLM/reference/support-matrix.html