gemma4-research/tooling/inference-frameworks/README.md

# Gemma 4 — Inference Framework Support Matrix

> Non-Ollama frameworks. Ollama is covered separately in the parent research corpus.
> Verified against upstream repos, model cards, and docs on **2026-04-18**.

## Summary table

| # | Framework | Gemma 4 support | Vision | Audio | Tool calling | Quantization options | Canonical run command |
|---|---|---|---|---|---|---|---|
| 1 | **vLLM** | Native, upstream merged — `gemma4.py` (text) + `gemma4_mm.py` (multimodal). Registered in `registry.py` as `Gemma4ForCausalLM` and `Gemma4ForConditionalGeneration`. | Yes (all sizes) | Yes (E2B/E4B) | Yes — OpenAI-compatible `/v1/chat/completions` with `tools=[...]` | AWQ, GPTQ, FP8, NVFP4 (via `--quantization modelopt`), BF16 | `vllm serve google/gemma-4-31b-it --tensor-parallel-size 2` |
| 2 | **llama.cpp / GGUF** | Native — `Gemma4Model` + `Gemma4VisionAudioModel` registered in `convert_hf_to_gguf.py` (lines 7666 & 7791). Distinct `GEMMA4V` + `GEMMA4A` projector types. Official GGUFs published at `ggml-org/gemma-4-*-GGUF`. | Yes (all, via mmproj) | Yes (E-series, via mmproj) | Yes — `llama-server` exposes OpenAI-compatible tools API | Q4_K_M, Q8_0, BF16 published officially; full quant menu via self-convert | `llama-server -hf ggml-org/gemma-4-E4B-it-GGUF` |
| 3 | **Apple MLX** | Native in `mlx-lm` (text, `gemma4.py` + `gemma4_text.py`) and `mlx-vlm` (multimodal, `mlx_vlm/models/gemma4/` with `audio.py`, `vision.py`, `language.py`, `processing_gemma4.py`) | Yes (mlx-vlm) | Yes (mlx-vlm) | Community; no first-party tools wrapper | 4bit, 8bit, bf16 via MLX quantize | `mlx_vlm.generate --model mlx-community/gemma-4-E4B-it-8bit --image URL --prompt "..."` |
| 4 | **Keras / keras-hub** | Native, full modular impl: `keras_hub/src/models/gemma4/` with `attention`, `audio_encoder`, `vision_encoder`, `decoder_block`, `moe`, `causal_lm`, etc. 8 presets (base + instruct × 2B/4B/26B_a4b/31B). | Yes | Yes | No (it's a training library, not an inference server) | Via Keras mixed-precision; no canonical GGUF/AWQ path | `keras_hub.models.Gemma4CausalLM.from_preset("gemma4_instruct_4b")` |
| 5 | **HF Text Generation Inference (TGI)** | **No native support.** Supported-models page stops at Gemma 3 / Gemma 3 Text. No open or merged PRs for "gemma4" (verified). Will fall back to unoptimized `AutoModelForCausalLM` path. | Fallback only, no vision kernels | No | Fallback only | Whatever HF transformers exposes on the fallback path | `text-generation-launcher --model-id google/gemma-4-31b-it` (degraded) |
| 6 | **TensorRT-LLM / NVIDIA NIM** | **Not in the 2026-04 support matrix.** Matrix lists `Gemma3ForCausalLM`/`Gemma3ForConditionalGeneration` but no Gemma 4 entry. GitHub issue #12764 tracks broken runtime on DGX Spark/GB10. NVIDIA's own `nvidia/Gemma-4-31B-IT-NVFP4` card tells users to run it on **vLLM**, not TRT-LLM. | N/A | N/A | N/A | NVFP4 export exists but runtime is broken; use the NVFP4 weights in vLLM instead | Avoid — use `vllm serve nvidia/Gemma-4-31B-IT-NVFP4 --quantization modelopt` |
| 7 | **Gemini API (AI Studio)** | Hosted. Model IDs: `gemma-4-31b-it`, `gemma-4-26b-a4b-it`. E-series NOT exposed (on-device only). | Yes (via `inlineData` parts) | No (Gemini API strips the audio path) | Yes — same `tools=[...]` schema as Gemini models | N/A (Google-managed) | `curl .../v1beta/models/gemma-4-26b-a4b-it:generateContent -d @payload.json` |
| 8 | **Vertex AI Model Garden** | One-click deploy. Model card: `console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma4`. Publisher ID format `google/gemma4@gemma-4-31b-it`. 26B-A4B is offered fully managed & serverless; 31B requires self-provisioned GPU endpoint. | Yes (via endpoint backend — vLLM under the hood) | Yes for E-series variants deployed that way | Yes (endpoint inherits from backing runtime) | Depends on backing image (vLLM/SAX) — BF16, FP8, AWQ selectable at deploy time | `model_garden.OpenModel("google/gemma4@gemma-4-31b-it").deploy()` |

## Production-readiness ranking

1. **vLLM** — most complete, most optimized, only runtime with first-party NVFP4 support and tested multimodal (image+audio+video).
2. **llama.cpp / GGUF** — best for local CPU + small GPU, only framework with audio mmproj shipping as a downloadable file for E-series, official Google-published quants via `ggml-org/*`.
3. **Gemini API / Vertex AI** — if you don't want to self-host; Vertex gives you the managed-endpoint exit path with vLLM under the hood.
4. **Apple MLX** — production-ready on Apple Silicon only; `mlx-vlm` is community-maintained but actively updated.
5. **Keras-hub** — reference/training, not inference-server.
6. **TGI** — usable as a *fallback* only; no optimized path yet.
7. **TensorRT-LLM** — **avoid for Gemma 4.** NVIDIA themselves point at vLLM.

## Capabilities beyond Ollama

- **Native audio input** — Ollama does **not** currently expose the E2B/E4B audio tower. Three frameworks do:
  - **llama.cpp** with the `mmproj-...-E4B-it-*.gguf` projector (`VisionProjectorType.GEMMA4A`),
  - **vLLM** via `gemma4_mm.py` (`input_features_padded`, `input_features_mask`),
  - **MLX** via `mlx-vlm/models/gemma4/audio.py`.
  If Seth ever wants the speech-transcription path, llama.cpp with the E4B mmproj is the shortest route from where he already is.
- **Video with interleaved audio** — vLLM's `gemma4_mm.py` decomposes videos into up to 32 timestamped frames; with E-series models it also loads the audio track (`load_audio_from_video=True`). Ollama has no video path at all.
- **NVFP4 on Blackwell** — vLLM only. `nvidia/Gemma-4-31B-IT-NVFP4` reports ~0.3 pp accuracy loss vs BF16 on GPQA Diamond / MMLU Pro.

## Framework to avoid

**TensorRT-LLM.** Not in the upstream support matrix as of 2026-04, known runtime bug on DGX Spark/GB10 (issue #12764), and NVIDIA's own NVFP4 checkpoint directs users to vLLM. Revisit only after a future TRT-LLM release lists `Gemma4ForCausalLM` in the support matrix.

## Files in this directory

```
inference-frameworks/
├── README.md                              — this file
├── run_commands.sh                        — canonical one-liners per framework
└── snippets/
    ├── llamacpp_convert_gemma4_excerpt.py — Gemma4Model + Gemma4VisionAudioModel from convert_hf_to_gguf.py (lines 7666-7840)
    ├── vllm_gemma4_head_80.py             — gemma4.py header (imports, config deref)
    ├── vllm_gemma4_mm_head_80.py          — gemma4_mm.py header (multimodal docstring lists image/audio/video)
    ├── vllm_registry_excerpt.txt          — registry.py Gemma4 registrations
    ├── mlx_gemma4_head_100.py             — mlx-lm gemma4.py (text) first 100 lines
    ├── mlx_vlm_gemma4_head_60.py          — mlx-vlm gemma4/gemma4.py (multimodal) first 60 lines
    ├── keras_hub_gemma4.py                — canonical keras-hub example + preset list
    ├── gemini_api_gemma4.sh               — canonical curl example
    └── gemini_api_gemma4.py               — canonical google-genai Python SDK example
```

## Notable upstream references

- vLLM Gemma 4 model class: `vllm-project/vllm:vllm/model_executor/models/gemma4.py` and `gemma4_mm.py`
- llama.cpp HF → GGUF converter: `ggml-org/llama.cpp:convert_hf_to_gguf.py` lines 7666-7840
- Official Google GGUF repos (verified live): `ggml-org/gemma-4-{E2B,E4B,31B,26b-a4b}-it-GGUF` — all ship mmproj projector files
- HF blog: huggingface.co/blog/gemma4 — shows `AutoModelForMultimodalLM` is the canonical transformers entry point
- NVIDIA NVFP4 checkpoint: `nvidia/Gemma-4-31B-IT-NVFP4` — runtime=vLLM, not TRT-LLM
- Gemini API doc: ai.google.dev/gemma/docs/core/gemma_on_gemini_api
- Vertex AI Model Garden: console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma4
- TGI supported-models list (confirming *absence* of Gemma 4): huggingface.co/docs/text-generation-inference/supported_models
- TRT-LLM support matrix (confirming *absence*): nvidia.github.io/TensorRT-LLM/reference/support-matrix.html