# PaliGemma / PaliGemma 2

Vision-language model combining a **SigLIP** image encoder with a Gemma text decoder. It is a separate product line from the vision support built into base Gemma 4. As of April 2026 it is still based on Gemma 2: **no PaliGemma 3 or PaliGemma-on-Gemma-4 yet.**

## What it is

- **PaliGemma** (May 2024): Gemma 1 + SigLIP-So400m/14. Sizes: 3B only. Built for task-prefix prompting (`caption`, `detect`, `segment`, `ocr`).
- **PaliGemma 2** (Dec 2024): Gemma 2 + SigLIP-So400m/14. Sizes: 3B, 10B, 28B. Each available at three resolutions: 224x224, 448x448, 896x896.
- **PaliGemma 2 mix** (Feb 2025): task-mixed instruction-tuned variant — works better out-of-the-box on ad-hoc VQA without per-task fine-tuning.

## Sizes (PaliGemma 2)

| Text decoder | Image encoder | Total | Resolutions |
|---|---|---|---|
| Gemma 2 2B | SigLIP-So400m | ~3B | 224 / 448 / 896 |
| Gemma 2 9B | SigLIP-So400m | ~10B | 224 / 448 / 896 |
| Gemma 2 27B | SigLIP-So400m | ~28B | 224 / 448 / 896 |
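
The table translates into two numbers that matter for planning: how many image tokens each resolution feeds the decoder, and roughly how much memory the weights need. The sketch below assumes the 14-pixel patch size implied by the "So400m/14" name and 2 bytes per parameter for bfloat16. The helper names `image_tokens` and `weight_gib` are made up for illustration, and the parameter counts are the rounded totals from the table, so treat the memory figures as ballpark only (activations and KV cache are not included).

```python
# Rough sizing arithmetic for PaliGemma 2 checkpoints (a sketch, not a benchmark).
PATCH_SIZE = 14  # assumption: SigLIP-So400m/14 splits the image into 14x14-pixel patches


def image_tokens(resolution: int, patch_size: int = PATCH_SIZE) -> int:
    """Number of image tokens produced for a square input of the given side length."""
    side = resolution // patch_size
    return side * side


def weight_gib(params_billions: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GiB (2 bytes per parameter for bfloat16)."""
    return params_billions * 1e9 * bytes_per_param / 2**30


for res in (224, 448, 896):
    print(f"{res}x{res}: {image_tokens(res)} image tokens")

# Rounded totals from the table above; real checkpoints differ slightly.
for name, params in (("3B", 3.0), ("10B", 10.0), ("28B", 28.0)):
    print(f"{name}: ~{weight_gib(params):.1f} GiB of weights in bfloat16")
```

Each step up in resolution quadruples the image-token count, which is why the 896 variants are noticeably slower than the 224 ones at the same decoder size.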

## Model cards

- PaliGemma 2: https://ai.google.dev/gemma/docs/paligemma/model-card-2
- DeepMind: https://deepmind.google/models/gemma/paligemma-2/
- HF blog: https://huggingface.co/blog/paligemma2

## Prompt format

PaliGemma uses **task-prefix** prompting, not chat turns. Format:

```
<image>{task} {args}
```

Known task prefixes (not exhaustive; Google under-documents the full list):

| Prefix | Purpose | Example |
|---|---|---|
| `caption {lang}` | Image captioning | `<image>caption en` |
| `ocr` | Read all text in image | `<image>ocr` |
| `answer {lang} {q}` | VQA | `<image>answer en what color is the car?` |
| `detect {obj}` | Object detection (bounding boxes) | `<image>detect cat ; dog` |
| `segment {obj}` | Segmentation masks | `<image>segment person` |
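
Since the prompt is just a string with a fixed shape, a tiny builder keeps the prefixes consistent across a pipeline. This is a minimal sketch: `build_prompt` and `detect_prompt` are hypothetical helpers, not part of any library, and the ` ; ` separator for multiple objects is taken from the `detect` example in the table.

```python
def build_prompt(task: str, *args: str) -> str:
    """Build a PaliGemma task-prefix prompt of the form '<image>{task} {args}'."""
    parts = [task, *args]
    # Skip empty pieces so tasks without arguments (e.g. 'ocr') get no trailing space.
    return "<image>" + " ".join(p.strip() for p in parts if p.strip())


def detect_prompt(*objects: str) -> str:
    """Join several object names with ' ; ', following the table's detect example."""
    return build_prompt("detect", " ; ".join(objects))


print(build_prompt("caption", "en"))
print(build_prompt("ocr"))
print(build_prompt("answer", "en", "what color is the car?"))
print(detect_prompt("cat", "dog"))
```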

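For `detect` and `segment`, the output contains custom tokens rather than plain coordinates. Each bounding box is four location tokens (`<loc0000>` to `<loc1023>`) in the order `y_min, x_min, y_max, x_max`, where each value is a bin index on a 1024-step grid relative to the image height or width. `segment` additionally emits 16 segmentation tokens (`<seg000>` to `<seg127>`) per object; these index a VQ-VAE codebook and must be run through the mask decoder from Google's reference code. Either way, you need postprocessing to get pixel coordinates or masks.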
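
For `detect`, the conversion to pixel coordinates is simple enough to do by hand. The sketch below assumes the output format of four `<locNNNN>` tokens followed by a label, with detections separated by ` ; `, token order `y_min, x_min, y_max, x_max`, and values divided by 1024 to normalize. `Box` and `parse_detections` are illustrative names, and the sample string is made up rather than real model output. Segmentation masks are not handled here.

```python
import re
from typing import NamedTuple


class Box(NamedTuple):
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


# Four consecutive location tokens, then the label text up to the next ';' or token.
_DETECTION = re.compile(r"((?:<loc\d{4}>){4})\s*([^<;]*)")
_LOC = re.compile(r"<loc(\d{4})>")


def parse_detections(text: str, width: int, height: int) -> list[Box]:
    """Convert '<loc..>' x4 + label groups into pixel-space boxes."""
    boxes = []
    for match in _DETECTION.finditer(text):
        # Assumed token order: y_min, x_min, y_max, x_max on a 1024-bin grid.
        y1, x1, y2, x2 = (int(v) / 1024 for v in _LOC.findall(match.group(1)))
        label = match.group(2).strip()
        boxes.append(Box(label, x1 * width, y1 * height, x2 * width, y2 * height))
    return boxes


# Made-up example string in the assumed format.
sample = "<loc0256><loc0128><loc0768><loc0512> cat ; <loc0000><loc0512><loc1023><loc1023> dog"
for box in parse_detections(sample, width=1024, height=512):
    print(box)
```

Pass the original image size, not the model's input resolution, as `width` and `height`: the tokens are relative to the image, so the same output scales to any size.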

## Minimum invocation — PaliGemma 2

```python
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests, torch

model_id = "google/paligemma2-3b-mix-448"
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("cuda")
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open(requests.get(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png",
    stream=True
).raw).convert("RGB")

prompt = "<image>caption en"
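# Keyword arguments avoid depending on the processor's positional argument order.
# Cast pixel values to bfloat16 to match the weights, then move everything to the model's device.
inputs = processor(text=prompt, images=image, return_tensors="pt").to(torch.bfloat16).to(model.device)
out = model.generate(**inputs, max_new_tokens=200)
# Slice off the prompt tokens so only the newly generated text is decoded.
gen = processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)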
print(gen)
```

## When to choose it over base Gemma 4 vision

- You need **structured spatial output** — bounding boxes, segmentation masks. Base Gemma 4 vision returns freeform text; PaliGemma 2 returns grid-aligned location tokens.
- You're doing **pure VQA or captioning at scale** and want a smaller, faster, task-specialized 3B model (vs. Gemma 4 E4B at 4B-effective).
- You're **fine-tuning** for a narrow vision task — PaliGemma 2 is explicitly designed to be easy to fine-tune; Google ships LoRA recipes.

Use base Gemma 4 for **conversational multimodal** (back-and-forth with images + text reasoning). PaliGemma is the "turn image into structured text" workhorse.

## Homelab fit

For `ai-visualizer` (CT 167, pve197 with V100): PaliGemma 2 3B-448 is a good fit for the caption-and-ground step when producing SDXL prompts from reference images. Already tested: base Gemma 4 E4B handles "describe this image" at ~25 tok/s on pve197. PaliGemma 2 has not been tested there yet; it would add `detect`/`segment` for spatial control (e.g., "put the character in the upper-left quadrant of the generated scene").
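
As a rough illustration of the spatial-control idea, a pixel-space bounding box can be reduced to a quadrant phrase for an SDXL prompt. This is only a sketch: `quadrant` is a hypothetical helper, it assumes image coordinates with the origin at the top-left and y growing downward, and it classifies by box center, which is one of several reasonable choices.

```python
def quadrant(x_min: float, y_min: float, x_max: float, y_max: float,
             width: int, height: int) -> str:
    """Name the image quadrant that contains the center of a pixel-space box."""
    cx = (x_min + x_max) / 2
    cy = (y_min + y_max) / 2
    # Origin is top-left, so a smaller y means higher up in the image.
    vertical = "upper" if cy < height / 2 else "lower"
    horizontal = "left" if cx < width / 2 else "right"
    return f"{vertical}-{horizontal}"


# Illustrative box, not real detector output.
where = quadrant(100, 50, 300, 200, width=1024, height=768)
print(f"character in the {where} quadrant of the scene")
```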