PaliGemma / PaliGemma 2

Vision-language model combining a SigLIP image encoder with a Gemma text decoder. Separate product line from base Gemma 4's built-in vision. Still on Gemma 2 as of April 2026 — no PaliGemma 3 or PaliGemma-on-Gemma-4 yet.

What it is

  • PaliGemma (May 2024): Gemma 1 + SigLIP-So400m/14. Sizes: 3B only. Built for task-prefix prompting (caption, detect, segment, ocr).
  • PaliGemma 2 (Dec 2024): Gemma 2 + SigLIP-So400m/14. Sizes: 3B, 10B, 28B. Each available at three resolutions: 224x224, 448x448, 896x896.
  • PaliGemma 2 mix (Feb 2025): task-mixed instruction-tuned variant — works better out-of-the-box on ad-hoc VQA without per-task fine-tuning.

Sizes (PaliGemma 2)

| Text decoder | Image encoder | Total | Resolutions     |
|--------------|---------------|-------|-----------------|
| Gemma 2 2B   | SigLIP-So400m | ~3B   | 224 / 448 / 896 |
| Gemma 2 9B   | SigLIP-So400m | ~10B  | 224 / 448 / 896 |
| Gemma 2 27B  | SigLIP-So400m | ~28B  | 224 / 448 / 896 |
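
Resolution affects sequence length as well as visual detail. A rough sketch of the arithmetic, assuming the "/14" in SigLIP-So400m/14 denotes a 14-pixel patch and that each patch becomes one image token passed to the decoder (no pooling); the helper name is illustrative:

```python
# Image-token count per input resolution for a ViT-style encoder.
# Assumption: patch size 14 (the "/14" in SigLIP-So400m/14), one token per patch.
PATCH_SIZE = 14

def image_tokens(resolution: int, patch_size: int = PATCH_SIZE) -> int:
    """Number of patch tokens for a square image of side `resolution`."""
    if resolution % patch_size != 0:
        raise ValueError(f"{resolution} is not a multiple of {patch_size}")
    side = resolution // patch_size
    return side * side

for res in (224, 448, 896):
    print(f"{res}x{res}: {image_tokens(res)} image tokens")
```

Under these assumptions the image prefix grows from 256 tokens at 224 to 4096 tokens at 896, a 16x increase that is worth weighing against the extra detail.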

Model cards

Prompt format

PaliGemma uses task-prefix prompting, not chat turns. Format:

<image>{task} {args}

Known task prefixes (not exhaustive; Google under-documents the full list):

| Prefix          | Purpose                           | Example                                |
|-----------------|-----------------------------------|----------------------------------------|
| caption {lang}  | Image captioning                  | <image>caption en                      |
| ocr             | Read all text in image            | <image>ocr                             |
| answer en {q}   | VQA                               | <image>answer en what color is the car? |
| detect {obj}    | Object detection (bounding boxes) | <image>detect cat ; dog                |
| segment {obj}   | Segmentation masks                | <image>segment person                  |
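
The prefixes above are plain strings, so assembling them can be sketched with a small helper. The function name `build_prompt` is hypothetical, and joining multiple detect targets with " ; " is inferred from the detect example in the table rather than from an official specification:

```python
def build_prompt(task: str, *args: str) -> str:
    """Assemble '<image>{task} {args}' for a PaliGemma task prefix."""
    if task == "detect":
        # Multiple objects are joined with ' ; ', mirroring the detect example.
        arg_str = " ; ".join(args)
    else:
        arg_str = " ".join(args)
    # rstrip() removes the trailing space when a task takes no arguments (e.g. ocr).
    return f"<image>{task} {arg_str}".rstrip()

print(build_prompt("caption", "en"))
print(build_prompt("ocr"))
print(build_prompt("answer", "en", "what color is the car?"))
print(build_prompt("detect", "cat", "dog"))
```

The four calls reproduce the caption, ocr, answer, and detect examples from the table.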

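For detect and segment, the output uses custom location tokens (<loc0000> to <loc1023>, e.g. <loc0123>) and segmentation tokens (<seg000> to <seg127>) instead of plain numbers. Each detected box is encoded as four location tokens on a 1024-bin grid, and each mask as a further run of segmentation tokens, so the raw string is not directly usable. You need the PaliGemma postprocessing routines (or equivalent code) to convert them to pixel coordinates.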
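
For detect, the conversion is small enough to sketch inline. The following assumes the commonly documented encoding: four consecutive location tokens per box in y_min, x_min, y_max, x_max order, values normalized to the image, then the label, with boxes separated by " ; ". The function name, the regex, and the sample string are illustrative and not taken from Google's postprocessing code; segmentation tokens are not handled here.

```python
import re

# Four location tokens followed by a label,
# e.g. "<loc0256><loc0128><loc0768><loc0896> cat"
_BOX_RE = re.compile(r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;<]+)")

def parse_detections(text, width, height, bins=1024):
    """Convert detect output into pixel-space boxes (x_min, y_min, x_max, y_max)."""
    boxes = []
    for m in _BOX_RE.finditer(text):
        # Assumed token order: y_min, x_min, y_max, x_max, each in [0, bins).
        y0, x0, y1, x1 = (int(g) / bins for g in m.groups()[:4])
        boxes.append({
            "label": m.group(5).strip(),
            "box": (round(x0 * width), round(y0 * height),
                    round(x1 * width), round(y1 * height)),
        })
    return boxes

# Made-up model output for a 640x480 image.
sample = "<loc0256><loc0128><loc0768><loc0896> cat ; <loc0000><loc0512><loc0512><loc1023> dog"
for det in parse_detections(sample, width=640, height=480):
    print(det)
```

Pass the size of the original image, not the resized model input, so the boxes land in the coordinate space you draw in.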

Minimum invocation — PaliGemma 2

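import requests
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma2-3b-mix-448"
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("cuda")
processor = AutoProcessor.from_pretrained(model_id)

# Fetch a test image and force RGB (3 channels).
image = Image.open(requests.get(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png",
    stream=True
).raw).convert("RGB")

prompt = "<image>caption en"
# Pass text and images by keyword: the positional order of these arguments
# differs between transformers versions. Cast float inputs to bfloat16 so
# pixel_values match the model dtype.
inputs = processor(text=prompt, images=image, return_tensors="pt")
inputs = inputs.to(torch.bfloat16).to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=200)

# Slice off the prompt tokens so only the generated continuation is decoded.
gen = processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(gen)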

When to choose it over base Gemma 4 vision

  • You need structured spatial output — bounding boxes, segmentation masks. Base Gemma 4 vision returns freeform text; PaliGemma 2 returns grid-aligned location tokens.
  • You're doing pure VQA or captioning at scale and want a smaller, faster, task-specialized 3B model (vs. Gemma 4 E4B at 4B-effective).
  • You're fine-tuning for a narrow vision task — PaliGemma 2 is explicitly designed to be easy to fine-tune; Google ships LoRA recipes.

Use base Gemma 4 for conversational multimodal (back-and-forth with images + text reasoning). PaliGemma is the "turn image into structured text" workhorse.

Homelab fit

For ai-visualizer (CT 167, pve197 with V100): PaliGemma 2 3B-448 is a great caption-and-ground step when producing SDXL prompts from reference images. Already tested: base Gemma 4 E4B handles "describe this image" at ~25 tok/s on pve197. PaliGemma 2 would add detect/segment for spatial control (e.g., "put the character in the upper-left quadrant of the generated scene").
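
As a rough illustration of the spatial-control idea, a detected box can be reduced to a quadrant phrase for the SDXL prompt. This sketch assumes boxes are already in pixel coordinates as (x_min, y_min, x_max, y_max); the function name, the sample box, and the phrase wording are hypothetical and not part of ai-visualizer.

```python
def quadrant(box, width, height):
    """Name the image quadrant containing the center of a pixel-space box."""
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2
    cy = (y_min + y_max) / 2
    vertical = "upper" if cy < height / 2 else "lower"
    horizontal = "left" if cx < width / 2 else "right"
    return f"{vertical}-{horizontal}"

# Hypothetical detection: a character near the top-left of a 1024x1024 reference image.
box = (64, 32, 420, 480)
print(f"character in the {quadrant(box, 1024, 1024)} quadrant of the scene")
```

Using the box center keeps the rule simple; a box that straddles the midlines may need a finer grid or an overlap-based rule.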