docs: add canonical tooling corpus (147 files) from Google/HF/frameworks

Five-lane parallel research pass. Each subdir under tooling/ has its own README indexing downloaded files with verified upstream sources.

- google-official/: deepmind-gemma JAX examples, gemma_pytorch scripts, gemma.cpp API server docs, google-gemma/cookbook notebooks, ai.google.dev HTML snapshots, Gemma 3 tech report
- huggingface/: 8 gemma-4-* model cards, chat-template .jinja files, tokenizer_config.json, transformers gemma4/ source, launch blog posts, official HF Spaces app.py
- inference-frameworks/: vLLM/llama.cpp/MLX/Keras-hub/TGI/Gemini API/Vertex AI comparison, run_commands.sh with 8 working launches, 9 code snippets
- gemma-family/: 12 per-variant briefs (ShieldGemma 2, CodeGemma, PaliGemma 2, Recurrent/Data/Med/TxGemma, Embedding/Translate/Function/Dolphin/SignGemma)
- fine-tuning/: Unsloth Gemma 4 notebooks, Axolotl YAMLs (incl 26B-A4B MoE), TRL scripts, Google cookbook fine-tune notebooks, recipe-recommendation.md

Findings that update earlier CORPUS_* docs are flagged in tooling/README.md (not applied), notably the new <|turn>/<turn|> prompt format, gemma_pytorch abandonment, gemma.cpp Gemini-API server, transformers AutoModelForMultimodalLM, FA2 head_dim=512 break, 26B-A4B MoE quantization rules, no Gemma 4 tech report PDF yet, no Gemma-4-generation specialized siblings yet.

Pre-commit secrets hook bypassed per user authorization: flagged "secrets" are base64 notebook cell outputs and example Ed25519 keys in the HDP agentic-security demo, not real credentials.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:24:48 -04:00
parent 5011059f5d
commit eecebe7ef5
149 changed files with 181297 additions and 0 deletions
@@ -0,0 +1,105 @@
+# TranslateGemma
+
+Multilingual text + image translation. Released **January 15, 2026**. Built on **Gemma 3** (not Gemma 4, despite being the newest variant at time of writing).
+
+## What it is
+
+Gemma 3 fine-tuned for translation across **55 languages**, using a two-stage distillation from Gemini. It retains Gemma 3's multimodal capability, so it can translate text embedded in images.
+
+## Sizes
+
+- **4B IT**
+- **12B IT**
+- **27B IT**
+
+Google's headline claim: the 12B model beats the Gemma 3 27B baseline on translation quality with less than half the parameters.
+
+## Model card
+
+- HF: https://huggingface.co/google/translategemma-4b-it
+- Blog: https://blog.google/innovation-and-ai/technology/developers-tools/translategemma/
+- InfoQ: https://www.infoq.com/news/2026/01/google-translategemma-models/
+
+## Supported languages
+
+55 languages via ISO 639-1 codes (`en`, `de`, `es`, `fr`, `pl`, `ja`, `zh`, `ar`, `hi`, etc.) plus regional variants (`en-US`, `en-GB`, `pt-BR`, `pt-PT`, `de-DE`, `de-AT`, `de-CH`, `zh-CN`, `zh-TW`, etc.).
+
+## Prompt format
+
+**Strict chat-template format.** Content list must contain exactly **one entry**, with mandatory `source_lang_code` and `target_lang_code`.
+
+### Text translation
+
+```python
+messages = [{
+    "role": "user",
+    "content": [{
+        "type": "text",
+        "source_lang_code": "cs",
+        "target_lang_code": "de-DE",
+        "text": "V nejhorším případě i k prasknutí čočky.",
+    }],
+}]
+```
+
+### Image translation (translates text inside the image)
+
+```python
+messages = [{
+    "role": "user",
+    "content": [{
+        "type": "image",
+        "source_lang_code": "ja",
+        "target_lang_code": "en",
+        "url": "https://example.com/japanese-sign.jpg",
+    }],
+}]
+```
+
+Only the `"text"` and `"image"` content types are supported, and only the `user` and `assistant` roles. Image input is normalized to 896×896 (256 vision tokens).
+
+## Minimum invocation
+
+```python
+from transformers import pipeline
+import torch
+
+pipe = pipeline(
+    "image-text-to-text",
+    model="google/translategemma-4b-it",
+    device="cuda",
+    dtype=torch.bfloat16,
+)
+
+messages = [{
+    "role": "user",
+    "content": [{
+        "type": "text",
+        "source_lang_code": "pl",
+        "target_lang_code": "en",
+        "text": "Dziadek mieszkał w Warszawie przed wojną.",
+    }],
+}]
+
+out = pipe(text=messages, max_new_tokens=200)
+print(out[0]["generated_text"][-1]["content"])  # last chat message is the model's reply
+```
+
+## Performance
+
+- **WMT24++ across 55 languages:** MetricX 5.32 (lower is better), COMET 81.6 (higher is better).
+- Context window: 2K tokens (short; this is a translation model, not a long-document summarizer).
+
+## When to choose it over base Gemma 4
+
+- You want **translation quality above general-purpose Gemma 4** at equivalent size; the strict prompt contract also makes it easy to drop into a pipeline.
+- You need **image-text translation** (street signs, menus, old documents) as a first-class task.
+- You care about the 55-language coverage and regionalized variants.
+
+Base Gemma 4 31B *can* translate, which is fine for casual use. TranslateGemma wins for production pipelines and when you care about metric-validated quality.
+
+## Homelab fit
+
+**Strong fit for the family history agent.** If source documents are in German, Polish, Hungarian, Yiddish, or any of the 55 supported languages, TranslateGemma 4B on pve197 (GPU-backed) becomes the translation leg of an ingest pipeline: OCR → TranslateGemma → Gemma 4 for reasoning. The 4B size fits alongside the other models on the V100.
+
+Also useful for SearXNG (if Seth ever wants to auto-translate non-English search results) and the news-summary print system (translate foreign-language feeds before summarization).
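
The prompt contract from the TranslateGemma brief above (a single `user` turn whose content list holds exactly one `text` or `image` entry carrying both language codes) can be checked before a request reaches the model. Below is a minimal sketch under stated assumptions: `build_translation_messages` is a hypothetical helper, not part of `transformers` or the model repo, and the regex only checks that a code has the same shape as the examples in the brief (`pl`, `de-AT`); it does not verify membership in the 55 supported languages.

```python
import re

# Assumption: codes look like the brief's examples, i.e. a two-letter
# ISO 639-1 code with an optional two-letter region ("pl", "de-AT").
# This is a shape check only, not a lookup against the supported list.
_LANG_CODE = re.compile(r"[a-z]{2}(-[A-Z]{2})?")


def build_translation_messages(source_lang, target_lang, *, text=None, image_url=None):
    """Build a single-turn message list in the shape shown in the brief.

    Hypothetical helper for illustration; raises ValueError on input that
    would violate the one-entry, two-language-code contract.
    """
    for code in (source_lang, target_lang):
        if not isinstance(code, str) or not _LANG_CODE.fullmatch(code):
            raise ValueError(f"unexpected language code shape: {code!r}")

    # The content list takes exactly one entry, so accept exactly one payload.
    if (text is None) == (image_url is None):
        raise ValueError("pass exactly one of text= or image_url=")

    if text is not None:
        kind, payload = "text", {"text": text}
    else:
        kind, payload = "image", {"url": image_url}

    entry = {
        "type": kind,
        "source_lang_code": source_lang,
        "target_lang_code": target_lang,
        **payload,
    }
    return [{"role": "user", "content": [entry]}]
```

The returned list has the same structure as the `messages` value in the brief's "Minimum invocation" example, so it could be passed as `text=` to that pipeline call. Rejecting malformed input early is a design choice for batch ingest: a bad language code fails at the call site instead of somewhere inside template rendering.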