# PaliGemma / PaliGemma 2

Vision-language model combining a **SigLIP** image encoder with a Gemma text decoder. Separate product line from base Gemma 4's built-in vision. Still on Gemma 2 as of April 2026: **no PaliGemma 3 or PaliGemma-on-Gemma-4 yet.**

## What it is

- **PaliGemma** (May 2024): Gemma 1 + SigLIP-So400m/14. Sizes: 3B only. Built for task-prefix prompting (`caption`, `detect`, `segment`, `ocr`).
- **PaliGemma 2** (Dec 2024): Gemma 2 + SigLIP-So400m/14. Sizes: 3B, 10B, 28B. Each available at three resolutions: 224x224, 448x448, 896x896.
- **PaliGemma 2 mix** (Feb 2025): task-mixed instruction-tuned variant that works better out of the box on ad-hoc VQA, without per-task fine-tuning.

## Sizes (PaliGemma 2)

| Text decoder | Image encoder | Total | Resolutions |
|---|---|---|---|
| Gemma 2 2B | SigLIP-So400m | ~3B | 224 / 448 / 896 |
| Gemma 2 9B | SigLIP-So400m | ~10B | 224 / 448 / 896 |
| Gemma 2 27B | SigLIP-So400m | ~28B | 224 / 448 / 896 |

## Model cards

- PaliGemma 2: https://ai.google.dev/gemma/docs/paligemma/model-card-2
- DeepMind: https://deepmind.google/models/gemma/paligemma-2/
- HF blog: https://huggingface.co/blog/paligemma2

## Prompt format

PaliGemma uses **task-prefix** prompting, not chat turns. Format:

```
{task} {args}
```

Known task prefixes (not exhaustive; Google under-documents the full list):

| Prefix | Purpose | Example |
|---|---|---|
| `caption {lang}` | Image captioning | `caption en` |
| `ocr` | Read all text in image | `ocr` |
| `answer en {q}` | VQA | `answer en what color is the car?` |
| `detect {obj}` | Object detection (bounding boxes) | `detect cat ; dog` |
| `segment {obj}` | Segmentation masks | `segment person` |

For `detect` and `segment`, the output uses custom location (`<loc0000>` to `<loc1023>`) and segmentation (`<seg000>` to `<seg127>`) tokens. You need the PaliGemma postprocessing routines to convert them to pixel coordinates.
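To make the location-token step concrete, here is a minimal sketch of turning `detect` output into pixel boxes. It is not the official postprocessing code. It assumes each detection is emitted as four `<locNNNN>` tokens in the order y_min, x_min, y_max, x_max, binned to 1024 steps, followed by the label, with detections separated by ` ; `. The helper name `parse_detections` is hypothetical.

```python
import re

# Assumed output shape: "<locYYYY><locXXXX><locYYYY><locXXXX> label ; ..."
# Token order (y_min, x_min, y_max, x_max) and the 1024-bin grid are assumptions
# stated in the lead-in, not something this snippet verifies.
_BOX = re.compile(r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;<]+)")


def parse_detections(text, width, height, bins=1024):
    """Convert location tokens to (x_min, y_min, x_max, y_max) in pixels."""
    boxes = []
    for y0, x0, y1, x1, label in _BOX.findall(text):
        boxes.append({
            "label": label.strip(),
            "box": (
                int(x0) / bins * width,   # x_min
                int(y0) / bins * height,  # y_min
                int(x1) / bins * width,   # x_max
                int(y1) / bins * height,  # y_max
            ),
        })
    return boxes


if __name__ == "__main__":
    sample = "<loc0256><loc0128><loc0768><loc0512> cat"
    print(parse_detections(sample, width=1024, height=1024))
```

Pass the size of the original image rather than the 224/448/896 model input size, since the token values are normalized. Segmentation tokens need the mask decoder from the reference postprocessing code and are not handled here.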
## Minimum invocation (PaliGemma 2)

```python
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "google/paligemma2-3b-mix-448"
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("cuda")
processor = AutoProcessor.from_pretrained(model_id)

# Load a sample image and force 3-channel RGB.
image = Image.open(requests.get(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png",
    stream=True
).raw).convert("RGB")

prompt = "caption en"
# Keyword arguments avoid depending on the processor's positional order;
# dtype= casts only the floating-point tensors (pixel values) to match the model.
inputs = processor(text=prompt, images=image, return_tensors="pt").to(
    model.device, dtype=torch.bfloat16
)

out = model.generate(**inputs, max_new_tokens=200)
# generate() returns prompt + completion; slice off the prompt tokens before decoding.
gen = processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(gen)
```

## When to choose it over base Gemma 4 vision

- You need **structured spatial output**: bounding boxes, segmentation masks. Base Gemma 4 vision returns freeform text; PaliGemma 2 returns grid-aligned location tokens.
- You're doing **pure VQA or captioning at scale** and want a smaller, faster, task-specialized 3B model (vs. Gemma 4 E4B at 4B-effective).
- You're **fine-tuning** for a narrow vision task. PaliGemma 2 is explicitly designed to be easy to fine-tune; Google ships LoRA recipes.

Use base Gemma 4 for **conversational multimodal** work (back-and-forth with images + text reasoning). PaliGemma is the "turn image into structured text" workhorse.

## Homelab fit

For `ai-visualizer` (CT 167, pve197 with V100): PaliGemma 2 3B-448 is a good fit as a caption-and-ground step when producing SDXL prompts from reference images. Already tested: base Gemma 4 E4B handles "describe this image" at ~25 tok/s on pve197. PaliGemma 2 would add `detect`/`segment` for spatial control (e.g., "put the character in the upper-left quadrant of the generated scene").
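
As a rough illustration of the spatial-control idea, the sketch below maps a pixel bounding box to a quadrant phrase that could be spliced into an SDXL prompt. The function name `box_to_region`, the `(x_min, y_min, x_max, y_max)` box layout, and the prompt wording are all assumptions for illustration; nothing here is part of PaliGemma or SDXL.

```python
def box_to_region(box, width, height):
    """Name the image quadrant that contains the centre of a pixel box.

    box is (x_min, y_min, x_max, y_max) in pixels (hypothetical layout).
    """
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2
    cy = (y_min + y_max) / 2
    horizontal = "left" if cx < width / 2 else "right"
    vertical = "upper" if cy < height / 2 else "lower"
    return f"{vertical}-{horizontal}"


def placement_hint(label, box, width, height):
    """Build a short placement phrase for a downstream image prompt."""
    return f"{label} in the {box_to_region(box, width, height)} quadrant"


if __name__ == "__main__":
    print(placement_hint("character", (0, 0, 100, 100), 1024, 1024))
```

Using the box centre keeps the rule simple; a box straddling the midline lands in whichever quadrant holds its centre, which may or may not match the intended composition.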