gemma4-research/tooling/gemma-family/shieldgemma.md

# ShieldGemma / ShieldGemma 2

Safety classifiers. Two separate product lines now: **ShieldGemma** (text, built on Gemma 2) and **ShieldGemma 2** (images, built on Gemma 3). There is no Gemma 4 generation yet.

## What it is

- **ShieldGemma (text):** LLM-as-a-judge safety classifier. Takes a prompt + optional model response + a policy, emits `Yes`/`No` (yes = violates policy). Four harm types.
- **ShieldGemma 2 (image):** Image classifier. Takes a PIL image, emits probabilities across three image-safety categories. The content under review is always an image; it does not classify text.

## Sizes

- ShieldGemma: **2B, 9B, 27B** — all instruction-tuned.
- ShieldGemma 2: **4B** — only size available.

## Model cards

- Text: https://ai.google.dev/gemma/docs/shieldgemma/model_card
- Image: https://huggingface.co/google/shieldgemma-2-4b-it
- DeepMind: https://deepmind.google/models/gemma/shieldgemma-2/

## Safety categories

**ShieldGemma (text):**
1. Sexually explicit content
2. Dangerous content
3. Hate speech
4. Harassment

**ShieldGemma 2 (image):**
1. Sexually explicit content
2. Dangerous content
3. Violence / gore

Note the image model dropped "hate" and "harassment" (hard to define visually) and added "violence/gore" (a visual primitive).

## Text prompt format (ShieldGemma)

Five-component structure:

```
<preamble establishing "you are a policy expert">

<start_of_turn>user
<user prompt here>
<end_of_turn>

<start_of_turn>model
<optional model response here>
<end_of_turn>

Our safety principle is defined in the below:
* <policy description, e.g. "No Hate Speech": ...>

Does the human prompt/response violate the above principle? Your answer must start with 'Yes' or 'No'.
```
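
The layout above can be assembled with a small helper. This is a sketch that mirrors the template as written in these notes, not the official formatter: `build_shieldgemma_prompt`, its parameters, and the `PREAMBLE` wording are illustrative assumptions, so take the exact strings from the model card before relying on it.

```python
# Sketch only: names and the preamble text are assumptions, not an official API.
PREAMBLE = (
    "You are a policy expert trying to help determine whether content "
    "violates the defined safety policies."
)


def build_shieldgemma_prompt(user_prompt, policy, model_response=None, preamble=PREAMBLE):
    """Join the five components in the order shown in the template above."""
    parts = [
        preamble,
        f"<start_of_turn>user\n{user_prompt.strip()}\n<end_of_turn>",
    ]
    # The model turn is optional: omit it to classify the user prompt alone.
    if model_response is not None:
        parts.append(f"<start_of_turn>model\n{model_response.strip()}\n<end_of_turn>")
    parts.append(f"Our safety principle is defined in the below:\n* {policy.strip()}")
    parts.append(
        "Does the human prompt/response violate the above principle? "
        "Your answer must start with 'Yes' or 'No'."
    )
    return "\n\n".join(parts)
```

Passing `model_response=None` yields the prompt-only variant; passing a string adds the model turn between the user turn and the policy.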

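The verdict is read from the first generated token: `Yes` (violates) or `No` (safe). For a continuous score, take the logits of those two tokens at that position and softmax over just the pair; the resulting P(`Yes`) is the value to threshold.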
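
The scoring step reduces to a two-way softmax. The sketch below assumes the `Yes` and `No` logits have already been pulled out of the model's next-token logits (that wiring is omitted); `violation_probability` is a hypothetical helper name.

```python
import math


def violation_probability(yes_logit, no_logit):
    """Two-way softmax over the Yes/No logits; returns P(Yes)."""
    # Subtract the max before exponentiating so large logits do not overflow.
    m = max(yes_logit, no_logit)
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)
```

Equal logits give 0.5, and the score moves toward 1.0 as the `Yes` logit pulls ahead of the `No` logit.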

## Minimum invocation — ShieldGemma 2 (image)

```python
from transformers import AutoProcessor, ShieldGemma2ForImageClassification
from PIL import Image
import torch

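model_id = "google/shieldgemma-2-4b-it"
# Load the classifier in eval mode along with its matching processor.
model = ShieldGemma2ForImageClassification.from_pretrained(model_id).eval()
processor = AutoProcessor.from_pretrained(model_id)

# Normalize to RGB so grayscale or RGBA files are handled consistently.
image = Image.open("input.jpg").convert("RGB")
inputs = processor(images=[image], return_tensors="pt")

# Inference only: skip autograd bookkeeping.
with torch.inference_mode():
    out = model(**inputs)

# One row per policy; each row is a softmax over the Yes/No tokens.
print(out.probabilities)
```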
```
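
To work with the result by name, the rows can be paired with policy labels. This is a sketch under two assumptions that should be checked against the model card: that each row is ordered `[P(Yes), P(No)]`, and that the caller knows the row order of the policies. `label_yes_scores` and the labels in the usage comment are hypothetical.

```python
def label_yes_scores(rows, policy_names):
    """Map each policy name to the first column (assumed P(Yes)) of its row."""
    if len(rows) != len(policy_names):
        raise ValueError("expected one row of probabilities per policy name")
    return {name: float(row[0]) for name, row in zip(policy_names, rows)}


# Usage with the model output (policy order here is an assumption):
# scores = label_yes_scores(out.probabilities.tolist(),
#                           ["dangerous", "sexual", "violence"])
```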

## When to choose it over base Gemma 4

- You need a **calibrated safety score**, not a free-form "is this safe?" answer from the chat model. ShieldGemma emits Yes/No token logits — easy to threshold.
- You want **policy-by-policy classification** (e.g., run each category separately with different thresholds).
- You're running a moderation pipeline and need **a small, fast, purpose-trained classifier** rather than a general chat model reasoning about safety.

Use base Gemma 4 for "explain *why* this is unsafe" narrative output. ShieldGemma is the yes/no stamp.
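
Policy-by-policy thresholding, as described in the list above, might look like the following. The category keys and threshold values are placeholders chosen for illustration, not recommended settings; tune them on your own data.

```python
# Hypothetical per-category thresholds on P(Yes).
THRESHOLDS = {
    "sexually_explicit": 0.5,
    "dangerous_content": 0.4,
    "violence_gore": 0.6,
}


def flagged_categories(scores, thresholds=THRESHOLDS, default=0.5):
    """Return, sorted by name, the categories whose score meets its threshold."""
    return sorted(
        name
        for name, p_yes in scores.items()
        if p_yes >= thresholds.get(name, default)
    )
```

An empty result means the item passed every policy; anything else names the policies that tripped, which is useful for routing to review rather than silently dropping.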

## Homelab fit

Pre-filter for `ai-visualizer` (CT 167, pve197) before publishing generated images. ShieldGemma 2 4B at Q4 fits comfortably on the Tesla V100-PCIE-32GB alongside SDXL.