# RecurrentGemma Griffin-architecture sibling. Built on **Gemma 1**. No Gemma 2/3/4 generation — the line has effectively stalled, with long-context Transformer variants (Gemma 4 with 256K context) overtaking the memory-efficiency argument. ## What it is Gated linear recurrences + local sliding-window attention, replacing full self-attention. Fixed-size hidden state → **O(1) memory per token generated**, no KV cache growth. Inference stays fast and cheap as context lengthens. ## Sizes - **2B** pretrained + instruct - **9B** pretrained + instruct Only two sizes. No 27B. Griffin scaling beyond 9B is an open research question and Google didn't ship it. ## Model card - https://ai.google.dev/gemma/docs/recurrentgemma/model_card - DeepMind: https://deepmind.google/models/gemma/recurrentgemma/ - Paper: https://arxiv.org/abs/2404.07839 - Repo: https://github.com/google-deepmind/recurrentgemma ## Architecture highlights - **Griffin block:** alternates two residual recurrent blocks with a local MQA attention block. - **State size:** fixed — independent of sequence length. - **Sliding window:** local attention only, not global. - **Trade-off:** loses some needle-in-haystack precision vs. a full-attention Transformer, gains memory flatness. ## Prompt format Standard Gemma turn format — same `user … ` as Gemma 1 IT. No RecurrentGemma-specific tokens. ## Minimum invocation ```python from transformers import AutoTokenizer, AutoModelForCausalLM import torch model_id = "google/recurrentgemma-9b-it" tokenizer = AutoTokenizer.from_pretrained(model_id) model = AutoModelForCausalLM.from_pretrained( model_id, torch_dtype=torch.bfloat16, device_map="auto" ) prompt = "user\nWrite a haiku about memory.\nmodel\n" inputs = tokenizer(prompt, return_tensors="pt").to("cuda") out = model.generate(**inputs, max_new_tokens=100) print(tokenizer.decode(out[0], skip_special_tokens=True)) ``` ## When to choose it over base Gemma 4 Honestly: **rarely, in April 2026.** The original pitch was "long-context generation without KV blowup." Gemma 4 now ships with 256K context on the 26B/31B and 128K on the edge models, with efficient attention implementations. The gap RecurrentGemma was filling has narrowed. Reasonable residual cases: - **Extremely memory-constrained hardware** (Jetson Nano tier) where even quantized Gemma 4 E2B KV cache is the limiting factor on sequence length. - **Streaming-generation workloads** where latency-per-token must stay constant as output length grows into the tens of thousands of tokens. - **Research interest** in recurrent LLMs. For typical homelab use, skip. The V100 on pve197 has 32GB VRAM; Gemma 4 31B at Q4 fits with room for generous context. ## Homelab fit Not a strong candidate for any current Seth project. Note for file: if a CPU-only streaming-transcript use case ever comes up (e.g., running on seth-pi for always-on audio processing), RecurrentGemma 2B could reappear in scope.