PaliGemma / PaliGemma 2
Vision-language model combining a SigLIP image encoder with a Gemma text decoder. Separate product line from base Gemma 4's built-in vision. Still on Gemma 2 as of April 2026 — no PaliGemma 3 or PaliGemma-on-Gemma-4 yet.
What it is
- PaliGemma (May 2024): Gemma 1 + SigLIP-So400m/14. Sizes: 3B only. Built for task-prefix prompting (caption, detect, segment, ocr).
- PaliGemma 2 (Dec 2024): Gemma 2 + SigLIP-So400m/14. Sizes: 3B, 10B, 28B. Each available at three resolutions: 224x224, 448x448, 896x896.
- PaliGemma 2 mix (Feb 2025): task-mixed instruction-tuned variant — works better out-of-the-box on ad-hoc VQA without per-task fine-tuning.
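The resolution choice mostly matters because it sets how many image tokens the decoder has to attend over. A rough sketch of that arithmetic, assuming one token per 14x14 patch (the "/14" in SigLIP-So400m/14) and no pooling; `image_token_count` is a hypothetical helper, not part of any library:

```python
PATCH_SIZE = 14  # assumption: taken from the "/14" in SigLIP-So400m/14


def image_token_count(resolution: int, patch_size: int = PATCH_SIZE) -> int:
    """Patch tokens for a square image, assuming one token per patch."""
    if resolution % patch_size != 0:
        raise ValueError(f"{resolution} is not a multiple of {patch_size}")
    per_side = resolution // patch_size
    return per_side * per_side


for res in (224, 448, 896):
    print(f"{res}x{res}: {image_token_count(res)} image tokens")
```

Under these assumptions each step up in resolution quadruples the image-token count, which is worth weighing before defaulting to 896.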
Sizes (PaliGemma 2)
| Text decoder | Image encoder | Total | Resolutions |
|---|---|---|---|
| Gemma 2 2B | SigLIP-So400m | ~3B | 224 / 448 / 896 |
| Gemma 2 9B | SigLIP-So400m | ~10B | 224 / 448 / 896 |
| Gemma 2 27B | SigLIP-So400m | ~28B | 224 / 448 / 896 |
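For a first-pass hardware check, the totals above can be turned into an approximate weight footprint. This is a back-of-envelope sketch only: it uses the nominal "~3B / ~10B / ~28B" figures from the table, counts weights alone (no activations, KV cache, or framework overhead), and `weight_memory_gib` is a hypothetical helper.

```python
# Bytes per parameter for common storage dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_memory_gib(params_billions: float, dtype: str = "bfloat16") -> float:
    """Approximate memory for model weights only, in GiB."""
    return params_billions * 1e9 * BYTES_PER_PARAM[dtype] / 2**30


# Nominal totals from the table; real checkpoints differ slightly.
for name, size in (("3B", 3), ("10B", 10), ("28B", 28)):
    print(f"PaliGemma 2 {name}: ~{weight_memory_gib(size):.1f} GiB in bfloat16")
```

Treat the result as a lower bound when deciding whether a variant fits on a given GPU.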
Model cards
- PaliGemma 2: https://ai.google.dev/gemma/docs/paligemma/model-card-2
- DeepMind: https://deepmind.google/models/gemma/paligemma-2/
- HF blog: https://huggingface.co/blog/paligemma2
Prompt format
PaliGemma uses task-prefix prompting, not chat turns. Format:
<image>{task} {args}
Known task prefixes (not exhaustive; Google under-documents the full list):
| Prefix | Purpose | Example |
|---|---|---|
| caption {lang} | Image captioning | <image>caption en |
| ocr | Read all text in image | <image>ocr |
| answer en {q} | VQA | <image>answer en what color is the car? |
| detect {obj} | Object detection (bounding boxes) | <image>detect cat ; dog |
| segment {obj} | Segmentation masks | <image>segment person |
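The prefixes in the table can be assembled with a couple of small string helpers. This is a minimal sketch; `build_prompt` and `detect_prompt` are hypothetical names, and the " ; " separator for multiple objects is taken from the detect example above.

```python
def build_prompt(task: str, args: str = "") -> str:
    """Assemble a task-prefix prompt of the form <image>{task} {args}."""
    body = f"{task} {args}".strip()
    return f"<image>{body}"


def detect_prompt(objects: list[str]) -> str:
    """Detection prompt for several classes, joined with ' ; '."""
    return build_prompt("detect", " ; ".join(objects))


print(build_prompt("caption", "en"))
print(build_prompt("ocr"))
print(build_prompt("answer", "en what color is the car?"))
print(detect_prompt(["cat", "dog"]))
```

Keeping prompt construction in one place makes it harder to slip into chat-style phrasing, which these checkpoints are not trained on.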
For detect and segment, the output is not plain text: it uses custom location (<loc0123>) and segmentation (<seg000>) tokens. You need the PaliGemma postprocessing routines to convert them to pixel coordinates.
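As an illustration of what that postprocessing involves for detect, here is a minimal parser sketch. It assumes each box is four consecutive location tokens in the order y_min, x_min, y_max, x_max on a 1024-bin grid, followed by the label. Verify this against the upstream postprocessing code before relying on it. `parse_detections` is a hypothetical helper and does not handle segmentation tokens.

```python
import re

# Four consecutive <locNNNN> tokens followed by a label (assumed layout).
_DETECTION = re.compile(
    r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^<;]+)"
)


def parse_detections(text: str, width: int, height: int, bins: int = 1024):
    """Convert decoded detect output into pixel-space boxes.

    Returns dicts with a label and an (x_min, y_min, x_max, y_max) box.
    """
    results = []
    for m in _DETECTION.finditer(text):
        # Assumed token order: y_min, x_min, y_max, x_max, each in [0, bins).
        y_min, x_min, y_max, x_max = (int(m.group(i)) / bins for i in range(1, 5))
        results.append({
            "label": m.group(5).strip(),
            "box": (x_min * width, y_min * height, x_max * width, y_max * height),
        })
    return results


sample = "<loc0256><loc0128><loc0768><loc0896> cat ; <loc0000><loc0000><loc0512><loc0512> dog"
for det in parse_detections(sample, width=1024, height=512):
    print(det)
```

The `width` and `height` passed in should be those of the original image, not the resized model input, so the boxes land in the source image's pixel space.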
Minimum invocation — PaliGemma 2
```python
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "google/paligemma2-3b-mix-448"
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("cuda")
processor = AutoProcessor.from_pretrained(model_id)

# Fetch a sample image and force RGB.
image = Image.open(requests.get(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png",
    stream=True
).raw).convert("RGB")

# Task-prefix prompt: <image>{task} {args}
prompt = "<image>caption en"
# Keyword arguments avoid depending on the positional order of text and images.
inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda")

out = model.generate(**inputs, max_new_tokens=200)
# Drop the prompt tokens so only the newly generated text is decoded.
gen = processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(gen)
```
When to choose it over base Gemma 4 vision
- You need structured spatial output — bounding boxes, segmentation masks. Base Gemma 4 vision returns freeform text; PaliGemma 2 returns grid-aligned location tokens.
- You're doing pure VQA or captioning at scale and want a smaller, faster, task-specialized 3B model (vs. Gemma 4 E4B at 4B-effective).
- You're fine-tuning for a narrow vision task — PaliGemma 2 is explicitly designed to be easy to fine-tune; Google ships LoRA recipes.
Use base Gemma 4 for conversational multimodal (back-and-forth with images + text reasoning). PaliGemma is the "turn image into structured text" workhorse.
Homelab fit
For ai-visualizer (CT 167, pve197 with V100): PaliGemma 2 3B-448 is a great caption-and-ground step when producing SDXL prompts from reference images. Already tested: base Gemma 4 E4B handles "describe this image" at ~25 tok/s on pve197. PaliGemma 2 would add detect/segment for spatial control (e.g., "put the character in the upper-left quadrant of the generated scene").
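One way the spatial-control idea could look in practice: take a pixel-space box from a detect pass and reduce it to a quadrant name that can be dropped into an SDXL prompt. This is a sketch under assumptions, not a tested part of the pipeline. `quadrant` is a hypothetical helper, boxes are (x_min, y_min, x_max, y_max) in pixels, and the image origin is taken to be the top-left corner.

```python
def quadrant(box: tuple[float, float, float, float], width: int, height: int) -> str:
    """Name the image quadrant containing the centre of a pixel-space box."""
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2
    cy = (y_min + y_max) / 2
    # Origin is assumed top-left, so smaller y means higher in the image.
    vertical = "upper" if cy < height / 2 else "lower"
    horizontal = "left" if cx < width / 2 else "right"
    return f"{vertical}-{horizontal}"


# Example: a box near the top-left corner of a 1024x1024 reference image.
where = quadrant((50, 40, 300, 420), width=1024, height=1024)
print(f"character in the {where} quadrant of the scene")
```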