gemma4-research/tooling/gemma-family/recurrentgemma.md

# RecurrentGemma

Griffin-architecture sibling. Built on **Gemma 1**. No Gemma 2/3/4 generation — the line has effectively stalled, with long-context Transformer variants (Gemma 4 with 256K context) overtaking the memory-efficiency argument.

## What it is

Gated linear recurrences + local sliding-window attention, replacing full self-attention. The recurrent state is fixed-size and the local-attention cache is capped at the window length → **memory is O(1) in sequence length**; nothing grows once the window is full. Per-token decode cost stays flat as context lengthens.
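To make the memory argument concrete, here is a back-of-the-envelope sketch comparing a growing KV cache with a bounded recurrent-plus-window footprint. Every dimension below (layer counts, head counts, state width, window) is a hypothetical placeholder chosen for illustration, not a RecurrentGemma or Gemma config value.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Full-attention KV cache: keys + values for every token in every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


def bounded_state_bytes(seq_len, n_recurrent_layers, state_width,
                        n_attn_layers, n_kv_heads, head_dim, window,
                        bytes_per_value=2):
    """Fixed recurrent state plus a local-attention cache capped at `window`."""
    recurrent = n_recurrent_layers * state_width * bytes_per_value
    cached_tokens = min(seq_len, window)  # cache stops growing at the window
    local_kv = 2 * n_attn_layers * n_kv_heads * head_dim * cached_tokens * bytes_per_value
    return recurrent + local_kv


if __name__ == "__main__":
    # Hypothetical dimensions, for illustration only.
    for seq_len in (2_048, 32_768, 262_144):
        full = kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128)
        flat = bounded_state_bytes(seq_len, n_recurrent_layers=24, state_width=4096,
                                   n_attn_layers=12, n_kv_heads=1, head_dim=256,
                                   window=2048)
        print(f"{seq_len:>8} tokens: full KV {full / 2**20:9.1f} MiB, "
              f"bounded {flat / 2**20:6.1f} MiB")
```

The first function scales linearly with `seq_len`; the second is constant for any `seq_len` at or beyond `window`.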

## Sizes

- **2B** pretrained + instruct
- **9B** pretrained + instruct

Only two sizes. No 27B. Griffin scaling beyond 9B is an open research question and Google didn't ship it.

## Model card

- https://ai.google.dev/gemma/docs/recurrentgemma/model_card
- DeepMind: https://deepmind.google/models/gemma/recurrentgemma/
- Paper: https://arxiv.org/abs/2404.07839
- Repo: https://github.com/google-deepmind/recurrentgemma

## Architecture highlights

- **Griffin block:** repeats a pattern of two residual recurrent blocks followed by one local multi-query attention (MQA) block.
- **State size:** fixed — independent of sequence length.
- **Sliding window:** local attention only, not global.
- **Trade-off:** loses some needle-in-haystack precision vs. a full-attention Transformer, gains memory flatness.
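The fixed-state property can be sketched with a simplified gated linear recurrence, loosely modelled on the recurrent unit described in the Griffin paper. This is a minimal sketch under stated assumptions: the gates here use per-channel scalar weights (the real layer uses learned linear projections), and `lam`, `w_r`, `w_i`, and `c` are hypothetical illustrative parameters, not model weights.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_linear_recurrence(xs, lam, w_r, w_i, c=8.0):
    """Run a simplified per-channel gated linear recurrence over a sequence.

    xs:  list of input vectors, each a list of d floats
    lam, w_r, w_i: per-channel parameters (lists of d floats), hypothetical
    Returns (final_state, outputs). The state always has d entries,
    no matter how many steps have been processed.
    """
    d = len(lam)
    h = [0.0] * d  # fixed-size hidden state
    outputs = []
    for x in xs:
        new_h = []
        for k in range(d):
            r = sigmoid(w_r[k] * x[k])       # recurrence gate
            i = sigmoid(w_i[k] * x[k])       # input gate
            a = sigmoid(lam[k]) ** (c * r)   # input-dependent decay in (0, 1)
            # Decay the old state, then mix in the gated input.
            new_h.append(a * h[k] + math.sqrt(1.0 - a * a) * (i * x[k]))
        h = new_h
        outputs.append(h)
    return h, outputs


if __name__ == "__main__":
    params = dict(lam=[2.0, 1.0], w_r=[0.5, 0.5], w_i=[0.3, 0.3])
    short, _ = gated_linear_recurrence([[1.0, -0.5]] * 10, **params)
    long_, _ = gated_linear_recurrence([[1.0, -0.5]] * 10_000, **params)
    print(len(short), len(long_))  # state width is the same for both runs
```

Each step reads only the previous state and the current input, which is why decoding needs no per-token cache for these layers. It is also why exact recall of a distant token is harder: everything older than the attention window has to survive compression into `h`.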

## Prompt format

Standard Gemma turn format — same `<start_of_turn>user … <end_of_turn>` as Gemma 1 IT. No RecurrentGemma-specific tokens.
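A small helper for assembling that turn format by hand might look like the sketch below. `format_gemma_prompt` is a hypothetical name, not a library function; it assumes the only roles are `user` and `model`, and that the tokenizer prepends the BOS token itself.

```python
def format_gemma_prompt(messages):
    """Build a Gemma-style prompt from (role, text) pairs.

    Ends with an open model turn so generation continues as the model.
    """
    parts = []
    for role, text in messages:
        if role not in ("user", "model"):
            raise ValueError(f"unsupported role: {role!r}")
        parts.append(f"<start_of_turn>{role}\n{text}<end_of_turn>\n")
    parts.append("<start_of_turn>model\n")  # cue the model's reply
    return "".join(parts)


if __name__ == "__main__":
    print(format_gemma_prompt([("user", "Write a haiku about memory.")]))
```

In practice `tokenizer.apply_chat_template(...)` from `transformers` is the less error-prone route when the checkpoint ships a chat template; the manual version is mainly useful for seeing exactly what the model receives.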

## Minimum invocation

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "google/recurrentgemma-9b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

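# Gemma turn format; the tokenizer adds the BOS token on its own.
prompt = "<start_of_turn>user\nWrite a haiku about memory.<end_of_turn>\n<start_of_turn>model\n"
# Follow wherever device_map="auto" placed the model instead of hardcoding "cuda".
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(out[0], skip_special_tokens=True))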
```

## When to choose it over base Gemma 4

Honestly: **rarely, in April 2026.**

The original pitch was "long-context generation without KV blowup." Gemma 4 now ships with 256K context on the 26B/31B and 128K on the edge models, with efficient attention implementations. The gap RecurrentGemma was filling has narrowed.

Reasonable residual cases:
- **Extremely memory-constrained hardware** (Jetson Nano tier) where even quantized Gemma 4 E2B KV cache is the limiting factor on sequence length.
- **Streaming-generation workloads** where latency-per-token must stay constant as output length grows into the tens of thousands of tokens.
- **Research interest** in recurrent LLMs.

For typical homelab use, skip. The V100 on pve197 has 32GB VRAM; Gemma 4 31B at Q4 fits with room for generous context.

## Homelab fit

Not a strong candidate for any current Seth project. Note for file: if a CPU-only streaming-transcript use case ever comes up (e.g., running on seth-pi for always-on audio processing), RecurrentGemma 2B could reappear in scope.