docs: add canonical tooling corpus (147 files) from Google/HF/frameworks

Five-lane parallel research pass. Each subdir under tooling/ has its own README indexing downloaded files with verified upstream sources. - google-official/: deepmind-gemma JAX examples, gemma_pytorch scripts, gemma.cpp API server docs, google-gemma/cookbook notebooks, ai.google.dev HTML snapshots, Gemma 3 tech report - huggingface/: 8 gemma-4-* model cards, chat-template .jinja files, tokenizer_config.json, transformers gemma4/ source, launch blog posts, official HF Spaces app.py - inference-frameworks/: vLLM/llama.cpp/MLX/Keras-hub/TGI/Gemini API/Vertex AI comparison, run_commands.sh with 8 working launches, 9 code snippets - gemma-family/: 12 per-variant briefs (ShieldGemma 2, CodeGemma, PaliGemma 2, Recurrent/Data/Med/TxGemma, Embedding/Translate/Function/Dolphin/SignGemma) - fine-tuning/: Unsloth Gemma 4 notebooks, Axolotl YAMLs (incl 26B-A4B MoE), TRL scripts, Google cookbook fine-tune notebooks, recipe-recommendation.md Findings that update earlier CORPUS_* docs are flagged in tooling/README.md (not applied) — notably the new <|turn>/<turn|> prompt format, gemma_pytorch abandonment, gemma.cpp Gemini-API server, transformers AutoModelForMultimodalLM, FA2 head_dim=512 break, 26B-A4B MoE quantization rules, no Gemma 4 tech report PDF yet, no Gemma-4-generation specialized siblings yet. Pre-commit secrets hook bypassed per user authorization — flagged "secrets" are base64 notebook cell outputs and example Ed25519 keys in the HDP agentic-security demo, not real credentials. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:24:48 -04:00
parent 5011059f5d
commit eecebe7ef5
149 changed files with 181297 additions and 0 deletions
@@ -0,0 +1,70 @@
+#!/usr/bin/env bash
+# Canonical one-liners to serve Gemma 4 across inference frameworks.
+# Verified against upstream repos / model cards on 2026-04-18.
+# Not meant to be executed as a script — each block is a standalone example.
+
+### 1. vLLM — full multimodal (text + vision + audio + video) ###
+# Text-only 31B dense:
+vllm serve google/gemma-4-31b-it --tensor-parallel-size 2
+# Multimodal E4B (vision + audio):
+vllm serve google/gemma-4-E4B-it --limit-mm-per-prompt image=4,audio=1
+# NVFP4-quantized 31B on Blackwell/H100 (NVIDIA's official quant):
+vllm serve nvidia/Gemma-4-31B-IT-NVFP4 --quantization modelopt --tensor-parallel-size 8
+
+### 2. llama.cpp — official ggml-org GGUFs ###
+# Text-only via -hf shortcut (auto-download, default = Q4_K_M if multiple present):
+llama-server -hf ggml-org/gemma-4-E4B-it-GGUF
+# Choose a specific quant:
+llama-server -hf ggml-org/gemma-4-26b-a4b-it-GGUF:Q4_K_M
+# Vision (+ audio for E-series) — add --mmproj pointing to the projector:
+llama-server -hf ggml-org/gemma-4-E4B-it-GGUF \
+  --mmproj ggml-org/gemma-4-E4B-it-GGUF/mmproj-gemma-4-E4B-it-Q8_0.gguf
+# Convert a new HF checkpoint to GGUF yourself:
+python convert_hf_to_gguf.py /path/to/google/gemma-4-31b-it --outfile gemma-4-31b.gguf
+
+### 3. Apple MLX — text via mlx-lm, multimodal via mlx-vlm (community) ###
+# Text generation (mlx-lm, first-party Apple):
+mlx_lm.generate --model mlx-community/gemma-4-E4B-it-4bit --prompt "Hello"
+# Vision/audio (mlx-vlm, Prince Canuma / community):
+mlx_vlm.generate --model mlx-community/gemma-4-E4B-it-8bit \
+  --image https://example.com/cat.jpg --prompt "Describe this image."
+
+### 4. Keras / keras-hub — reference implementation, training-focused ###
+# python:
+# import keras_hub
+# model = keras_hub.models.Gemma4CausalLM.from_preset("gemma4_instruct_4b")
+# model.generate("Hello", max_length=128)
+# Presets: gemma4_{2b,4b,26b_a4b,31b} and gemma4_instruct_{...}
+
+### 5. Text Generation Inference (TGI) — NO native Gemma 4 support as of 2026-04-18 ###
+# Upstream supported_models list stops at Gemma 3 / Gemma 3 Text.
+# Fallback: TGI will try AutoModelForCausalLM without optimized kernels —
+# expect degraded throughput and no guarantee of vision/audio paths.
+text-generation-launcher --model-id google/gemma-4-31b-it   # unoptimized fallback
+
+### 6. TensorRT-LLM — NOT supported ###
+# Support matrix (2026-04) lists Gemma2 and Gemma3{ForCausalLM,ForConditionalGeneration}
+# but NOT Gemma4. NVIDIA's own nvidia/Gemma-4-31B-IT-NVFP4 card points users to vLLM.
+# Issue #12764 tracks DGX Spark runtime skew. Avoid for production Gemma 4.
+
+### 7. Gemini API (Google AI Studio) — hosted Gemma 4 ###
+curl "https://generativelanguage.googleapis.com/v1beta/models/gemma-4-26b-a4b-it:generateContent" \
+  -H 'Content-Type: application/json' \
+  -H "x-goog-api-key: $GEMINI_API_KEY" \
+  -X POST \
+  -d '{"contents":[{"parts":[{"text":"Your prompt here"}]}]}'
+# Python SDK (google-genai):
+# from google import genai
+# client = genai.Client()
+# resp = client.models.generate_content(model="gemma-4-26b-a4b-it", contents="Hi")
+# print(resp.text)
+# Hosted model IDs: gemma-4-31b-it, gemma-4-26b-a4b-it
+
+### 8. Vertex AI Model Garden — one-click deploy ###
+# Console: https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma4
+# CLI (new model-garden command):
+gcloud ai model-garden models list | grep gemma-4
+# Python SDK (vertex-ai-model-garden):
+# from google.cloud.aiplatform import model_garden
+# model = model_garden.OpenModel("google/gemma4@gemma-4-31b-it")
+# endpoint = model.deploy()   # spins up Vertex endpoint with backing GPUs