gemma4-research/tooling/gemma-family/recurrentgemma.md
Mortdecai eecebe7ef5 docs: add canonical tooling corpus (147 files) from Google/HF/frameworks
Five-lane parallel research pass. Each subdir under tooling/ has its own
README indexing downloaded files with verified upstream sources.

- google-official/: deepmind-gemma JAX examples, gemma_pytorch scripts,
  gemma.cpp API server docs, google-gemma/cookbook notebooks, ai.google.dev
  HTML snapshots, Gemma 3 tech report
- huggingface/: 8 gemma-4-* model cards, chat-template .jinja files,
  tokenizer_config.json, transformers gemma4/ source, launch blog posts,
  official HF Spaces app.py
- inference-frameworks/: vLLM/llama.cpp/MLX/Keras-hub/TGI/Gemini API/Vertex AI
  comparison, run_commands.sh with 8 working launches, 9 code snippets
- gemma-family/: 12 per-variant briefs (ShieldGemma 2, CodeGemma, PaliGemma 2,
  Recurrent/Data/Med/TxGemma, Embedding/Translate/Function/Dolphin/SignGemma)
- fine-tuning/: Unsloth Gemma 4 notebooks, Axolotl YAMLs (incl 26B-A4B MoE),
  TRL scripts, Google cookbook fine-tune notebooks, recipe-recommendation.md

Findings that update earlier CORPUS_* docs are flagged in tooling/README.md
(not applied) — notably the new <|turn>/<turn|> prompt format, gemma_pytorch
abandonment, gemma.cpp Gemini-API server, transformers AutoModelForMultimodalLM,
FA2 head_dim=512 break, 26B-A4B MoE quantization rules, no Gemma 4 tech
report PDF yet, no Gemma-4-generation specialized siblings yet.

Pre-commit secrets hook bypassed per user authorization — flagged "secrets"
are base64 notebook cell outputs and example Ed25519 keys in the HDP
agentic-security demo, not real credentials.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:24:48 -04:00


RecurrentGemma

Griffin-architecture sibling of Gemma, built on Gemma 1. There is no Gemma 2/3/4-generation successor: the line has effectively stalled, and long-context Transformer variants (Gemma 4 with 256K context) have weakened the memory-efficiency argument.

What it is

Gated linear recurrences + local sliding-window attention, replacing full self-attention. The recurrent state is fixed-size, so memory is O(1) in sequence length; the only attention cache is the local window, which is bounded rather than growing with context. Per-token inference cost therefore stays flat as context lengthens.

Sizes

  • 2B pretrained + instruct
  • 9B pretrained + instruct

Only two sizes. No 27B. The Griffin paper reports experiments up to 14B parameters, but Google never shipped a RecurrentGemma checkpoint above 9B.

Model card

Architecture highlights

  • Griffin block: alternates two residual recurrent blocks with a local MQA attention block.
  • State size: fixed — independent of sequence length.
  • Sliding window: local attention only, not global.
  • Trade-off: loses some needle-in-haystack precision vs. a full-attention Transformer, gains memory flatness.
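The layout and memory-flatness points above can be made concrete with a little arithmetic. This is a sketch under stated assumptions: the 2048-token window is an assumed value for illustration, and the helper names are hypothetical. It counts how many past positions an attention layer must cache, not bytes.

```python
def griffin_layer_pattern(num_layers: int) -> list[str]:
    """Repeat the (recurrent, recurrent, attention) pattern over num_layers."""
    pattern = ("recurrent", "recurrent", "attention")
    return [pattern[i % 3] for i in range(num_layers)]


def full_attention_cached_positions(seq_len: int) -> int:
    """A global-attention layer keeps keys/values for every past token."""
    return seq_len


def local_attention_cached_positions(seq_len: int, window: int) -> int:
    """A sliding-window layer keeps at most `window` past tokens."""
    return min(seq_len, window)


WINDOW = 2048  # assumed local-attention window, for illustration only

print(griffin_layer_pattern(6))
for n in (1_000, 10_000, 100_000):
    print(
        n,
        full_attention_cached_positions(n),
        local_attention_cached_positions(n, WINDOW),
    )
```

Past the window size the local-attention count stops growing, while the full-attention count tracks sequence length. The same cap is why tokens outside the window are only reachable through the recurrent state, which is where the needle-in-haystack loss comes from.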

Prompt format

Standard Gemma turn format — same <start_of_turn>user … <end_of_turn> as Gemma 1 IT. No RecurrentGemma-specific tokens.
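For multi-turn conversations, the turn format can be assembled with a small helper. This is a sketch: `build_gemma_prompt` is a hypothetical name, and in practice `tokenizer.apply_chat_template` from transformers is the usual way to produce this string from the model's own template.

```python
def build_gemma_prompt(turns):
    """Render (role, text) pairs in the Gemma turn format.

    `turns` is a list of (role, text) tuples with role "user" or "model".
    The result ends with an open model turn so generation continues from there.
    """
    parts = []
    for role, text in turns:
        if role not in ("user", "model"):
            raise ValueError(f"unsupported role: {role!r}")
        parts.append(f"<start_of_turn>{role}\n{text}<end_of_turn>\n")
    parts.append("<start_of_turn>model\n")  # cue the model to answer
    return "".join(parts)


single = build_gemma_prompt([("user", "Write a haiku about memory.")])
print(single)
```

The helper does not add a beginning-of-sequence token; this sketch assumes the tokenizer prepends it when the string is encoded.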

Minimum invocation

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "google/recurrentgemma-9b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# bfloat16 halves memory vs. float32; device_map="auto" needs the accelerate package
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Gemma turn format, ending with an open model turn
prompt = "<start_of_turn>user\nWrite a haiku about memory.<end_of_turn>\n<start_of_turn>model\n"
# Send inputs to wherever the model was placed instead of hard-coding "cuda"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(out[0], skip_special_tokens=True))

When to choose it over base Gemma 4

Honestly: rarely, in April 2026.

The original pitch was "long-context generation without KV blowup." Gemma 4 now ships with 256K context on the 26B/31B and 128K on the edge models, with efficient attention implementations. The gap RecurrentGemma was filling has narrowed.

Reasonable residual cases:

  • Extremely memory-constrained hardware (Jetson Nano tier) where the KV cache of even a quantized Gemma 4 E2B is the limiting factor on sequence length.
  • Streaming-generation workloads where latency-per-token must stay constant as output length grows into the tens of thousands of tokens.
  • Research interest in recurrent LLMs.

For typical homelab use, skip. The V100 on pve197 has 32GB VRAM; Gemma 4 31B at Q4 fits with room for generous context.

Homelab fit

Not a strong candidate for any current Seth project. Note for file: if a CPU-only streaming-transcript use case ever comes up (e.g., running on seth-pi for always-on audio processing), RecurrentGemma 2B could reappear in scope.