gemma4-research/tooling/gemma-family/txgemma.md

# TxGemma

Therapeutic-development / drug-discovery variant. Built on **Gemma 2**. No Gemma 3 or 4 generation yet.

## What it is

Gemma 2 fine-tuned on 7M examples curated from the **Therapeutics Data Commons (TDC)** — predictive tasks across small molecules, proteins, nucleic acids, diseases, and cell lines. Beats or matches state-of-the-art on 50 of 66 TDC tasks; beats specialist models on 26 of them.

## Sizes

- **2B predict** — prediction-only, narrow prompt format.
- **9B predict** + **9B chat** — prediction plus conversational reasoning.
- **27B predict** + **27B chat** — same, larger.

## Model card

- https://developers.google.com/health-ai-developer-foundations/txgemma/model-card
- DeepMind: https://deepmind.google/models/gemma/txgemma/
- Paper: https://deepmind.google/research/publications/153799/

## Prompting modes

**Prediction mode** (all sizes): structured TDC-format prompt with instruction + context + question + optional few-shot. Output is a short prediction (sometimes a single token or a float).

**Conversational mode** (9B, 27B): chat-template interactions, can explain reasoning behind predictions.

## Minimum invocation — prediction

```python
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="google/txgemma-27b-predict",
    device="cuda",
)

prompt = (
    "Instructions: Predict whether the molecule can penetrate the blood-brain barrier.\n"
    "Context: Blood-brain barrier penetration is an important property for CNS drugs.\n"
    "Question: Given the SMILES string CN1C=NC2=C1C(=O)N(C(=O)N2C)C, "
    "predict BBB penetration. Answer with 'Yes' or 'No'.\n"
    "Answer:"
)

out = pipe(prompt, max_new_tokens=8)
print(out[0]["generated_text"])
```

## License

Health AI Developer Foundations — same terms as MedGemma. Non-clinical, research-use.

## When to choose it over base Gemma 4

- You're doing **drug-discovery research** and need TDC-format predictions out of the box.
- You want **SMILES-aware reasoning** without a custom cheminformatics stack.

Almost never chosen for general-purpose work. TxGemma's value is the training data, not the base model.

## Homelab fit

Zero. Noted for completeness.