Files
gemma4-research/tooling/inference-frameworks
Mortdecai eecebe7ef5 docs: add canonical tooling corpus (147 files) from Google/HF/frameworks
Five-lane parallel research pass. Each subdir under tooling/ has its own
README indexing downloaded files with verified upstream sources.

- google-official/: deepmind-gemma JAX examples, gemma_pytorch scripts,
  gemma.cpp API server docs, google-gemma/cookbook notebooks, ai.google.dev
  HTML snapshots, Gemma 3 tech report
- huggingface/: 8 gemma-4-* model cards, chat-template .jinja files,
  tokenizer_config.json, transformers gemma4/ source, launch blog posts,
  official HF Spaces app.py
- inference-frameworks/: vLLM/llama.cpp/MLX/Keras-hub/TGI/Gemini API/Vertex AI
  comparison, run_commands.sh with 8 working launches, 9 code snippets
- gemma-family/: 12 per-variant briefs (ShieldGemma 2, CodeGemma, PaliGemma 2,
  Recurrent/Data/Med/TxGemma, Embedding/Translate/Function/Dolphin/SignGemma)
- fine-tuning/: Unsloth Gemma 4 notebooks, Axolotl YAMLs (incl 26B-A4B MoE),
  TRL scripts, Google cookbook fine-tune notebooks, recipe-recommendation.md

Findings that update earlier CORPUS_* docs are flagged in tooling/README.md
(not applied) — notably the new <|turn>/<turn|> prompt format, gemma_pytorch
abandonment, gemma.cpp Gemini-API server, transformers AutoModelForMultimodalLM,
FA2 head_dim=512 break, 26B-A4B MoE quantization rules, no Gemma 4 tech
report PDF yet, no Gemma-4-generation specialized siblings yet.

Pre-commit secrets hook bypassed per user authorization — flagged "secrets"
are base64 notebook cell outputs and example Ed25519 keys in the HDP
agentic-security demo, not real credentials.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:24:48 -04:00
..

Gemma 4 — Inference Framework Support Matrix

Non-Ollama frameworks. Ollama is covered separately in the parent research corpus. Verified against upstream repos, model cards, and docs on 2026-04-18.

Summary table

# Framework Gemma 4 support Vision Audio Tool calling Quantization options Canonical run command
1 vLLM Native, upstream merged — gemma4.py (text) + gemma4_mm.py (multimodal). Registered in registry.py as Gemma4ForCausalLM and Gemma4ForConditionalGeneration. Yes (all sizes) Yes (E2B/E4B) Yes — OpenAI-compatible /v1/chat/completions with tools=[...] AWQ, GPTQ, FP8, NVFP4 (via --quantization modelopt), BF16 vllm serve google/gemma-4-31b-it --tensor-parallel-size 2
2 llama.cpp / GGUF Native — Gemma4Model + Gemma4VisionAudioModel registered in convert_hf_to_gguf.py (lines 7666 & 7791). Distinct GEMMA4V + GEMMA4A projector types. Official GGUFs published at ggml-org/gemma-4-*-GGUF. Yes (all, via mmproj) Yes (E-series, via mmproj) Yes — llama-server exposes OpenAI-compatible tools API Q4_K_M, Q8_0, BF16 published officially; full quant menu via self-convert llama-server -hf ggml-org/gemma-4-E4B-it-GGUF
3 Apple MLX Native in mlx-lm (text, gemma4.py + gemma4_text.py) and mlx-vlm (multimodal, mlx_vlm/models/gemma4/ with audio.py, vision.py, language.py, processing_gemma4.py) Yes (mlx-vlm) Yes (mlx-vlm) Community; no first-party tools wrapper 4bit, 8bit, bf16 via MLX quantize mlx_vlm.generate --model mlx-community/gemma-4-E4B-it-8bit --image URL --prompt "..."
4 Keras / keras-hub Native, full modular impl: keras_hub/src/models/gemma4/ with attention, audio_encoder, vision_encoder, decoder_block, moe, causal_lm, etc. 8 presets (base + instruct × 2B/4B/26B_a4b/31B). Yes Yes No (it's a training library, not an inference server) Via Keras mixed-precision; no canonical GGUF/AWQ path keras_hub.models.Gemma4CausalLM.from_preset("gemma4_instruct_4b")
5 HF Text Generation Inference (TGI) No native support. Supported-models page stops at Gemma 3 / Gemma 3 Text. No open or merged PRs for "gemma4" (verified). Will fall back to unoptimized AutoModelForCausalLM path. Fallback only, no vision kernels No Fallback only Whatever HF transformers exposes on the fallback path text-generation-launcher --model-id google/gemma-4-31b-it (degraded)
6 TensorRT-LLM / NVIDIA NIM Not in the 2026-04 support matrix. Matrix lists Gemma3ForCausalLM/Gemma3ForConditionalGeneration but no Gemma 4 entry. GitHub issue #12764 tracks broken runtime on DGX Spark/GB10. NVIDIA's own nvidia/Gemma-4-31B-IT-NVFP4 card tells users to run it on vLLM, not TRT-LLM. N/A N/A N/A NVFP4 export exists but runtime is broken; use the NVFP4 weights in vLLM instead Avoid — use vllm serve nvidia/Gemma-4-31B-IT-NVFP4 --quantization modelopt
7 Gemini API (AI Studio) Hosted. Model IDs: gemma-4-31b-it, gemma-4-26b-a4b-it. E-series NOT exposed (on-device only). Yes (via inlineData parts) No (Gemini API strips the audio path) Yes — same tools=[...] schema as Gemini models N/A (Google-managed) curl .../v1beta/models/gemma-4-26b-a4b-it:generateContent -d @payload.json
8 Vertex AI Model Garden One-click deploy. Model card: console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma4. Publisher ID format google/gemma4@gemma-4-31b-it. 26B-A4B is offered fully managed & serverless; 31B requires self-provisioned GPU endpoint. Yes (via endpoint backend — vLLM under the hood) Yes for E-series variants deployed that way Yes (endpoint inherits from backing runtime) Depends on backing image (vLLM/SAX) — BF16, FP8, AWQ selectable at deploy time model_garden.OpenModel("google/gemma4@gemma-4-31b-it").deploy()

Production-readiness ranking

  1. vLLM — most complete, most optimized, only runtime with first-party NVFP4 support and tested multimodal (image+audio+video).
  2. llama.cpp / GGUF — best for local CPU + small GPU, only framework with audio mmproj shipping as a downloadable file for E-series, official Google-published quants via ggml-org/*.
  3. Gemini API / Vertex AI — if you don't want to self-host; Vertex gives you the managed-endpoint exit path with vLLM under the hood.
  4. Apple MLX — production-ready on Apple Silicon only; mlx-vlm is community-maintained but actively updated.
  5. Keras-hub — reference/training, not inference-server.
  6. TGI — usable as a fallback only; no optimized path yet.
  7. TensorRT-LLMavoid for Gemma 4. NVIDIA themselves point at vLLM.

Capabilities beyond Ollama

  • Native audio input — Ollama does not currently expose the E2B/E4B audio tower. Three frameworks do:
    • llama.cpp with the mmproj-...-E4B-it-*.gguf projector (VisionProjectorType.GEMMA4A),
    • vLLM via gemma4_mm.py (input_features_padded, input_features_mask),
    • MLX via mlx-vlm/models/gemma4/audio.py. If Seth ever wants the speech-transcription path, llama.cpp with the E4B mmproj is the shortest route from where he already is.
  • Video with interleaved audio — vLLM's gemma4_mm.py decomposes videos into up to 32 timestamped frames; with E-series models it also loads the audio track (load_audio_from_video=True). Ollama has no video path at all.
  • NVFP4 on Blackwell — vLLM only. nvidia/Gemma-4-31B-IT-NVFP4 reports ~0.3 pp accuracy loss vs BF16 on GPQA Diamond / MMLU Pro.

Framework to avoid

TensorRT-LLM. Not in the upstream support matrix as of 2026-04, known runtime bug on DGX Spark/GB10 (issue #12764), and NVIDIA's own NVFP4 checkpoint directs users to vLLM. Revisit only after a future TRT-LLM release lists Gemma4ForCausalLM in the support matrix.

Files in this directory

inference-frameworks/
├── README.md                              — this file
├── run_commands.sh                        — canonical one-liners per framework
└── snippets/
    ├── llamacpp_convert_gemma4_excerpt.py — Gemma4Model + Gemma4VisionAudioModel from convert_hf_to_gguf.py (lines 7666-7840)
    ├── vllm_gemma4_head_80.py             — gemma4.py header (imports, config deref)
    ├── vllm_gemma4_mm_head_80.py          — gemma4_mm.py header (multimodal docstring lists image/audio/video)
    ├── vllm_registry_excerpt.txt          — registry.py Gemma4 registrations
    ├── mlx_gemma4_head_100.py             — mlx-lm gemma4.py (text) first 100 lines
    ├── mlx_vlm_gemma4_head_60.py          — mlx-vlm gemma4/gemma4.py (multimodal) first 60 lines
    ├── keras_hub_gemma4.py                — canonical keras-hub example + preset list
    ├── gemini_api_gemma4.sh               — canonical curl example
    └── gemini_api_gemma4.py               — canonical google-genai Python SDK example

Notable upstream references

  • vLLM Gemma 4 model class: vllm-project/vllm:vllm/model_executor/models/gemma4.py and gemma4_mm.py
  • llama.cpp HF → GGUF converter: ggml-org/llama.cpp:convert_hf_to_gguf.py lines 7666-7840
  • Official Google GGUF repos (verified live): ggml-org/gemma-4-{E2B,E4B,31B,26b-a4b}-it-GGUF — all ship mmproj projector files
  • HF blog: huggingface.co/blog/gemma4 — shows AutoModelForMultimodalLM is the canonical transformers entry point
  • NVIDIA NVFP4 checkpoint: nvidia/Gemma-4-31B-IT-NVFP4 — runtime=vLLM, not TRT-LLM
  • Gemini API doc: ai.google.dev/gemma/docs/core/gemma_on_gemini_api
  • Vertex AI Model Garden: console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma4
  • TGI supported-models list (confirming absence of Gemma 4): huggingface.co/docs/text-generation-inference/supported_models
  • TRT-LLM support matrix (confirming absence): nvidia.github.io/TensorRT-LLM/reference/support-matrix.html