docs: add canonical tooling corpus (147 files) from Google/HF/frameworks

Five-lane parallel research pass. Each subdir under tooling/ has its own
README indexing downloaded files with verified upstream sources.

- google-official/: deepmind-gemma JAX examples, gemma_pytorch scripts,
  gemma.cpp API server docs, google-gemma/cookbook notebooks, ai.google.dev
  HTML snapshots, Gemma 3 tech report
- huggingface/: 8 gemma-4-* model cards, chat-template .jinja files,
  tokenizer_config.json, transformers gemma4/ source, launch blog posts,
  official HF Spaces app.py
- inference-frameworks/: vLLM/llama.cpp/MLX/Keras-hub/TGI/Gemini API/Vertex AI
  comparison, run_commands.sh with 8 working launches, 9 code snippets
- gemma-family/: 12 per-variant briefs (ShieldGemma 2, CodeGemma, PaliGemma 2,
  Recurrent/Data/Med/TxGemma, Embedding/Translate/Function/Dolphin/SignGemma)
- fine-tuning/: Unsloth Gemma 4 notebooks, Axolotl YAMLs (incl 26B-A4B MoE),
  TRL scripts, Google cookbook fine-tune notebooks, recipe-recommendation.md

Findings that update earlier CORPUS_* docs are flagged in tooling/README.md
(not applied) — notably the new <|turn>/<turn|> prompt format, gemma_pytorch
abandonment, gemma.cpp Gemini-API server, transformers AutoModelForMultimodalLM,
FA2 head_dim=512 break, 26B-A4B MoE quantization rules, no Gemma 4 tech
report PDF yet, no Gemma-4-generation specialized siblings yet.

Pre-commit secrets hook bypassed per user authorization — flagged "secrets"
are base64 notebook cell outputs and example Ed25519 keys in the HDP
agentic-security demo, not real credentials.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mortdecai
2026-04-18 12:24:48 -04:00
parent 5011059f5d
commit eecebe7ef5
149 changed files with 181297 additions and 0 deletions
@@ -0,0 +1,105 @@
# TranslateGemma
Multilingual text + image translation. Released **January 15, 2026**. Built on **Gemma 3** (not Gemma 4, despite being the newest variant at time of writing).
## What it is
Gemma 3 fine-tuned for translation across **55 languages**, using a two-stage distillation from Gemini. Retains Gemma 3's multimodal capability — can translate text embedded in images.
## Sizes
- **4B IT**
- **12B IT**
- **27B IT**
Google's headline claim: the 12B model beats the Gemma 3 27B baseline on translation quality with fewer than half the parameters.
## Model card
- HF: https://huggingface.co/google/translategemma-4b-it
- Blog: https://blog.google/innovation-and-ai/technology/developers-tools/translategemma/
- InfoQ: https://www.infoq.com/news/2026/01/google-translategemma-models/
## Supported languages
55 languages via ISO 639-1 codes (`en`, `de`, `es`, `fr`, `pl`, `ja`, `zh`, `ar`, `hi`, etc.) plus regional variants (`en-US`, `en-GB`, `pt-BR`, `pt-PT`, `de-DE`, `de-AT`, `de-CH`, `zh-CN`, `zh-TW`, etc.).
## Prompt format
**Strict chat-template format.** The `content` list of a message must contain exactly **one entry**, and that entry must carry both `source_lang_code` and `target_lang_code`.
### Text translation
```python
messages = [{
    "role": "user",
    "content": [{
        "type": "text",
        "source_lang_code": "cs",
        "target_lang_code": "de-DE",
        "text": "V nejhorším případě i k prasknutí čočky.",
    }],
}]
```
### Image translation (translates text inside the image)
```python
messages = [{
    "role": "user",
    "content": [{
        "type": "image",
        "source_lang_code": "ja",
        "target_lang_code": "en",
        "url": "https://example.com/japanese-sign.jpg",
    }],
}]
```
Only the `"text"` and `"image"` content types are supported, and only the `user` and `assistant` roles. Image input is normalized to 896×896 (256 vision tokens).
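Because the contract is strict, it can help to check messages before they reach the model. The following is a minimal sketch of such a check, not part of any TranslateGemma or transformers API: the function name `find_contract_violations` is hypothetical, the payload keys (`text`, `url`) are taken from the two examples above, and applying the single-entry rule only to `user` turns is an assumption.

```python
ALLOWED_ROLES = {"user", "assistant"}
ALLOWED_TYPES = {"text", "image"}
# Assumption: payload keys mirror the examples above ("text" for text, "url" for images).
PAYLOAD_KEY = {"text": "text", "image": "url"}


def find_contract_violations(messages):
    """Return a list of problems; an empty list means the messages look valid."""
    problems = []
    for i, message in enumerate(messages):
        role = message.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unsupported role {role!r}")
            continue
        if role != "user":
            # Assumption: only user turns are checked against the content rules.
            continue
        content = message.get("content")
        if not isinstance(content, list) or len(content) != 1:
            problems.append(f"message {i}: content must be a list with exactly one entry")
            continue
        entry = content[0]
        kind = entry.get("type")
        if kind not in ALLOWED_TYPES:
            problems.append(f"message {i}: unsupported content type {kind!r}")
            continue
        # Both language codes and the payload itself must be present and non-empty.
        for key in ("source_lang_code", "target_lang_code", PAYLOAD_KEY[kind]):
            if not entry.get(key):
                problems.append(f"message {i}: missing {key!r}")
    return problems
```

Returning a list rather than raising lets a batch job log every problem in a document set before deciding whether to skip or repair an item.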
## Minimum invocation
```python
from transformers import pipeline
import torch

# Load the 4B instruction-tuned checkpoint in bfloat16 on the GPU.
pipe = pipeline(
    "image-text-to-text",
    model="google/translategemma-4b-it",
    device="cuda",
    dtype=torch.bfloat16,
)

# One user turn with exactly one content entry, per the prompt contract.
messages = [{
    "role": "user",
    "content": [{
        "type": "text",
        "source_lang_code": "pl",
        "target_lang_code": "en",
        "text": "Dziadek mieszkał w Warszawie przed wojną.",
    }],
}]

out = pipe(text=messages, max_new_tokens=200)
# The last message in the returned conversation holds the translation.
print(out[0]["generated_text"][-1]["content"])
```
## Performance
- **WMT24++ across 55 languages:** MetricX 5.32 (an error score, lower is better), COMET 81.6 (higher is better).
- Context window: 2K tokens (short — this is a translation model, not a long-doc summarizer).
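With a 2K-token window, longer documents have to be split before translation. The sketch below shows one way to do that, greedily packing sentences under a budget. It is an illustration only: `chunk_text` is a hypothetical helper, the regex sentence splitter is naive, and the default token counter (whitespace-separated words) is a rough stand-in for the model's real tokenizer. The window presumably also has to hold the generated translation, so the budget should sit well below 2K.

```python
import re
from typing import Callable, List


def chunk_text(
    text: str,
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens each.

    A single sentence longer than max_tokens is emitted as its own
    oversized chunk rather than being cut mid-sentence.
    """
    # Naive split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and count_tokens(candidate) > max_tokens:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be wrapped in its own single-entry message. Passing a counter backed by the model's tokenizer instead of the word-count default would make the budget more accurate.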
## When to choose it over base Gemma 4
- You want **better translation quality than general-purpose Gemma 4** at an equivalent size; the strict prompt contract also makes it easy to drop into a pipeline.
- You need **image-text translation** (street signs, menus, old documents) as a first-class task.
- You care about the 55-language coverage and regionalized variants.
Base Gemma 4 31B *can* translate — fine for casual use. TranslateGemma wins for production pipelines and when you care about metric-validated quality.
## Homelab fit
**Strong fit for the family history agent.** If source documents are in German, Polish, Hungarian, Yiddish, or any of the 55 supported languages, TranslateGemma 4B on pve197 (GPU-backed) becomes the translation leg of an ingest pipeline: OCR → TranslateGemma → Gemma 4 for reasoning. The 4B size fits alongside the other models on the V100.
Also useful for SearXNG (if Seth ever wants to auto-translate non-English search results) and for the news-summary print system (translating foreign-language feeds before summarization).
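The OCR → TranslateGemma → Gemma 4 ingest pipeline described above could be wired together roughly as follows. This is a sketch under stated assumptions: `ingest_document` and `build_text_message` are hypothetical names, and the three stages are passed in as plain callables because the OCR engine and the Gemma 4 reasoning call are not specified here.

```python
from typing import Callable, List


def build_text_message(text: str, source_lang: str, target_lang: str) -> List[dict]:
    """Wrap text in the single-entry message shape the prompt format requires."""
    return [{
        "role": "user",
        "content": [{
            "type": "text",
            "source_lang_code": source_lang,
            "target_lang_code": target_lang,
            "text": text,
        }],
    }]


def ingest_document(
    image_path: str,
    source_lang: str,
    ocr: Callable[[str], str],           # hypothetical: image path -> raw text
    translate: Callable[[List[dict]], str],  # hypothetical: messages -> translation
    reason: Callable[[str], str],        # hypothetical: translated text -> analysis
    target_lang: str = "en",
) -> str:
    """Run one document through OCR, translation, then reasoning."""
    source_text = ocr(image_path)
    messages = build_text_message(source_text, source_lang, target_lang)
    translated = translate(messages)
    return reason(translated)
```

In practice `translate` might wrap the `pipe(text=messages, max_new_tokens=200)` call from the minimum invocation and return the final message's content. Keeping the stages as injected callables lets each one be swapped or stubbed independently.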