# RecurrentGemma

Griffin-architecture sibling, built on **Gemma 1**. There is no Gemma 2/3/4 generation: the line has effectively stalled, with long-context Transformer variants (Gemma 4 with 256K context) overtaking the memory-efficiency argument.

## What it is

Gated linear recurrences plus local sliding-window attention, replacing full self-attention. The recurrent state is fixed-size and the local-attention cache is capped at the window length, so memory is **O(1) in sequence length**, with no unbounded KV cache growth. Per-token decode cost stays flat as context lengthens.

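To make the memory argument concrete, here is a rough back-of-the-envelope sketch comparing a growing full-attention KV cache with a window-capped local cache and a fixed recurrent state. The function names and every dimension below are illustrative placeholders, not the published RecurrentGemma or Gemma configuration.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes for a full-attention KV cache: keys and values for every token so far."""
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_value


def local_cache_bytes(seq_len, window, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Same cache, but a sliding window never keeps more than `window` tokens."""
    return kv_cache_bytes(min(seq_len, window), n_layers, n_kv_heads, head_dim, bytes_per_value)


def fixed_state_bytes(n_layers, state_dim, bytes_per_value=2):
    """Bytes for a recurrent state: one vector per layer, independent of sequence length."""
    return n_layers * state_dim * bytes_per_value


# Hypothetical dimensions, chosen only for illustration.
N_LAYERS, N_KV_HEADS, HEAD_DIM, STATE_DIM, WINDOW = 26, 1, 256, 2560, 2048

for seq_len in (1_024, 8_192, 65_536):
    full = kv_cache_bytes(seq_len, N_LAYERS, N_KV_HEADS, HEAD_DIM)
    local = local_cache_bytes(seq_len, WINDOW, N_LAYERS, N_KV_HEADS, HEAD_DIM)
    state = fixed_state_bytes(N_LAYERS, STATE_DIM)
    print(f"{seq_len:>6} tokens: full KV {full / 2**20:7.1f} MiB, "
          f"local KV {local / 2**20:5.1f} MiB, state {state / 2**20:5.3f} MiB")
```

The full cache grows linearly with `seq_len`, while the local cache and the recurrent state plateau; that plateau is the whole memory-flatness pitch.
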
## Sizes

- **2B** pretrained + instruct
- **9B** pretrained + instruct

Only two sizes; there is no 27B. Scaling Griffin beyond 9B remains an open research question, and Google has not shipped a larger model.

## Model card

- Model card: https://ai.google.dev/gemma/docs/recurrentgemma/model_card
- DeepMind: https://deepmind.google/models/gemma/recurrentgemma/
- Paper: https://arxiv.org/abs/2404.07839
- Repo: https://github.com/google-deepmind/recurrentgemma

## Architecture highlights

- **Griffin block:** a repeating pattern of two residual recurrent blocks followed by one local multi-query attention (MQA) block.
- **State size:** fixed, independent of sequence length.
- **Sliding window:** local attention only, not global.
- **Trade-off:** loses some needle-in-haystack precision vs. a full-attention Transformer, gains memory flatness.

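The block pattern and the local-attention constraint above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: the function names are hypothetical and the window size is arbitrary; the actual implementation lives in the linked repo and is structured differently.

```python
def griffin_layer_pattern(n_layers):
    """Repeat (recurrent, recurrent, local_attention) until n_layers entries exist."""
    unit = ("recurrent", "recurrent", "local_attention")
    return [unit[i % 3] for i in range(n_layers)]


def sliding_window_mask(seq_len, window):
    """mask[q][k] is True when query position q may attend to key position k.

    Causal and local: k must not be in the future and must lie within the
    last `window` positions ending at q.
    """
    return [[(k <= q) and (q - k < window) for k in range(seq_len)]
            for q in range(seq_len)]


print(griffin_layer_pattern(6))
for row in sliding_window_mask(seq_len=5, window=3):
    # "x" marks an allowed (query, key) pair, "." a masked one.
    print("".join("x" if allowed else "." for allowed in row))
```

No row of the mask ever has more than `window` allowed positions, which is why the attention cache stops growing, and also why a fact that has slid out of the window can only survive through the recurrent state.
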
## Prompt format

Standard Gemma turn format: the same `<start_of_turn>user … <end_of_turn>` markers as Gemma 1 IT, with the reply opened by `<start_of_turn>model`. No RecurrentGemma-specific tokens.

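As a minimal sketch of that format, the helper below renders a list of turns into a prompt string by hand. `build_gemma_prompt` is a hypothetical name introduced here for illustration; in practice the tokenizer's own chat template, where the checkpoint ships one, is the more usual route.

```python
def build_gemma_prompt(turns):
    """Render (role, text) pairs into the Gemma turn format and open a model turn.

    `turns` is a list such as [("user", "Hi"), ("model", "Hello"), ("user", "Bye")].
    """
    parts = []
    for role, text in turns:
        if role not in ("user", "model"):
            raise ValueError(f"unsupported role: {role!r}")
        parts.append(f"<start_of_turn>{role}\n{text}<end_of_turn>\n")
    # Leave the final model turn open so generation continues from here.
    parts.append("<start_of_turn>model\n")
    return "".join(parts)


prompt = build_gemma_prompt([("user", "Write a haiku about memory.")])
print(prompt)
```
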
## Minimum invocation

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "google/recurrentgemma-9b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# bfloat16 halves weight memory; device_map="auto" lets accelerate place the layers.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Gemma 1 IT turn format, ending with an open model turn for the reply.
prompt = "<start_of_turn>user\nWrite a haiku about memory.<end_of_turn>\n<start_of_turn>model\n"
# Send inputs to wherever the model was placed rather than hard-coding "cuda".
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

## When to choose it over base Gemma 4

Honestly: **rarely, in April 2026.**

The original pitch was "long-context generation without KV blowup." Gemma 4 now ships with 256K context on the 26B/31B models and 128K on the edge models, with efficient attention implementations. The gap RecurrentGemma was filling has narrowed.

Reasonable residual cases:
- **Extremely memory-constrained hardware** (Jetson Nano tier), where even the KV cache of a quantized Gemma 4 E2B is the limiting factor on sequence length.
- **Streaming-generation workloads** where per-token latency must stay constant as output length grows into the tens of thousands of tokens.
- **Research interest** in recurrent LLMs.

For typical homelab use, skip it. The V100 on pve197 has 32 GB of VRAM; Gemma 4 31B at Q4 fits with room for generous context.

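The "fits with room" claim can be sanity-checked with rough arithmetic. The sketch below estimates weight memory only. It assumes 31 billion parameters (taken from the model name) and a range of effective bits per weight for 4-bit quantization, since real Q4 formats carry some per-block overhead. KV cache, activations, and runtime overhead are deliberately left out, so treat the remainder as an upper bound on context headroom.

```python
def weight_gib(n_params, bits_per_weight):
    """Approximate weight memory in GiB for a given quantization width."""
    return n_params * bits_per_weight / 8 / 2**30


VRAM_GIB = 32     # card size quoted above; treated as GiB for simplicity
N_PARAMS = 31e9   # assumption: parameter count implied by the "31B" name

for bits in (4.0, 4.5, 5.0):  # assumed range of effective bits per weight
    weights = weight_gib(N_PARAMS, bits)
    print(f"{bits:.1f} bits/weight: weights {weights:5.1f} GiB, "
          f"left for KV cache and overhead {VRAM_GIB - weights:5.1f} GiB")
```
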
## Homelab fit

Not a strong candidate for any current Seth project. Note for the file: if a CPU-only streaming-transcript use case ever comes up (e.g., running on seth-pi for always-on audio processing), RecurrentGemma 2B could come back into scope.