Mortdecai eecebe7ef5 docs: add canonical tooling corpus (147 files) from Google/HF/frameworks

Five-lane parallel research pass. Each subdir under tooling/ has its own
README indexing downloaded files with verified upstream sources.

- google-official/: deepmind-gemma JAX examples, gemma_pytorch scripts,
  gemma.cpp API server docs, google-gemma/cookbook notebooks, ai.google.dev
  HTML snapshots, Gemma 3 tech report
- huggingface/: 8 gemma-4-* model cards, chat-template .jinja files,
  tokenizer_config.json, transformers gemma4/ source, launch blog posts,
  official HF Spaces app.py
- inference-frameworks/: vLLM/llama.cpp/MLX/Keras-hub/TGI/Gemini API/Vertex AI
  comparison, run_commands.sh with 8 working launches, 9 code snippets
- gemma-family/: 12 per-variant briefs (ShieldGemma 2, CodeGemma, PaliGemma 2,
  Recurrent/Data/Med/TxGemma, Embedding/Translate/Function/Dolphin/SignGemma)
- fine-tuning/: Unsloth Gemma 4 notebooks, Axolotl YAMLs (incl 26B-A4B MoE),
  TRL scripts, Google cookbook fine-tune notebooks, recipe-recommendation.md

Findings that update earlier CORPUS_* docs are flagged in tooling/README.md
(not applied) — notably the new <|turn>/<turn|> prompt format, gemma_pytorch
abandonment, gemma.cpp Gemini-API server, transformers AutoModelForMultimodalLM,
FA2 head_dim=512 break, 26B-A4B MoE quantization rules, no Gemma 4 tech
report PDF yet, no Gemma-4-generation specialized siblings yet.

Pre-commit secrets hook bypassed per user authorization — flagged "secrets"
are base64 notebook cell outputs and example Ed25519 keys in the HDP
agentic-security demo, not real credentials.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-18 12:24:48 -04:00

3.0 KiB

RecurrentGemma

Griffin-architecture sibling. Built on Gemma 1. No Gemma 2/3/4 generation; the line has effectively stalled, as long-context Transformer variants (Gemma 4 with 256K context) have weakened the memory-efficiency argument.

What it is

Gated linear recurrences + local sliding-window attention, replacing full self-attention. Fixed-size recurrent state plus an attention cache bounded by the window → memory is constant in sequence length, with no KV cache growth past the window. Per-token inference cost stays flat as context lengthens.
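To make the fixed-size state concrete, here is a minimal sketch of a gated linear recurrence in plain Python. It is a simplified form (h_t = a_t * h_{t-1} + (1 - a_t) * x_t), not RecurrentGemma's actual recurrent layer; the function names and the sigmoid gate parameterization are assumptions chosen for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_linear_recurrence(inputs, gate_logits):
    """Run h_t = a_t * h_{t-1} + (1 - a_t) * x_t elementwise over a sequence.

    inputs and gate_logits are equal-length lists of float vectors.
    Only one state vector is kept, so memory does not depend on len(inputs).
    Simplified illustration, not the real Griffin recurrence.
    """
    width = len(inputs[0])
    state = [0.0] * width
    for x_t, g_t in zip(inputs, gate_logits):
        for i in range(width):
            a = sigmoid(g_t[i])  # gate in (0, 1): how much old state to keep
            state[i] = a * state[i] + (1.0 - a) * x_t[i]
    return state


if __name__ == "__main__":
    width = 4
    for length in (10, 10_000):
        xs = [[1.0] * width for _ in range(length)]
        gs = [[0.0] * width for _ in range(length)]
        final = gated_linear_recurrence(xs, gs)
        print(length, len(final))  # state width is the same for both lengths
```

The state the model carries forward is a single vector of width `width` whether it has seen ten tokens or ten thousand, which is the property the memory argument rests on.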

Sizes

2B pretrained + instruct
9B pretrained + instruct

Only two sizes. No 27B. The Griffin paper reports models up to 14B, but Google did not ship a RecurrentGemma checkpoint above 9B.

Model card

Architecture highlights

Griffin block: alternates two residual recurrent blocks with a local MQA attention block.
State size: fixed — independent of sequence length.
Sliding window: local attention only, not global.
Trade-off: loses some needle-in-haystack precision vs. a full-attention Transformer, gains memory flatness.
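A small sketch of the two structural ideas above: the repeating (recurrent, recurrent, attention) layer pattern and a causal sliding-window attention mask. The helper names and the tiny sizes are illustrative assumptions, not values read from the RecurrentGemma configuration.

```python
def block_pattern(num_blocks: int) -> list[str]:
    """Repeat (recurrent, recurrent, attention), truncated to num_blocks."""
    unit = ["recurrent", "recurrent", "attention"]
    return [unit[i % len(unit)] for i in range(num_blocks)]


def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[q][k] is True when query position q may attend to key position k.

    Causal (k <= q) and local (at most `window` most recent positions,
    including q itself).
    """
    return [
        [(k <= q) and (q - k < window) for k in range(seq_len)]
        for q in range(seq_len)
    ]


if __name__ == "__main__":
    print(block_pattern(6))
    for row in sliding_window_mask(seq_len=6, window=3):
        print("".join("x" if allowed else "." for allowed in row))
```

Each row of the mask has at most `window` allowed positions, so anything older than the window reaches the current token only through the recurrent state. That is where the needle-in-haystack trade-off comes from.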

Prompt format

Standard Gemma turn format — same <start_of_turn>user … <end_of_turn> as Gemma 1 IT. No RecurrentGemma-specific tokens.
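For multi-turn use, the turn format can be assembled with a small helper. `format_gemma_prompt` is a hypothetical name introduced here; it assumes only the two roles `user` and `model` and always leaves an open model turn at the end.

```python
def format_gemma_prompt(turns: list[tuple[str, str]]) -> str:
    """Build a Gemma-style prompt from (role, text) pairs.

    Roles are "user" or "model". The result ends with an open model turn
    so that generation continues as the model.
    """
    parts = []
    for role, text in turns:
        if role not in ("user", "model"):
            raise ValueError(f"unsupported role: {role!r}")
        parts.append(f"<start_of_turn>{role}\n{text}<end_of_turn>\n")
    parts.append("<start_of_turn>model\n")
    return "".join(parts)


if __name__ == "__main__":
    print(format_gemma_prompt([("user", "Write a haiku about memory.")]))
```

If the tokenizer ships a chat template, `tokenizer.apply_chat_template` in transformers is the less error-prone route; the helper only spells out what the string looks like.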

Minimum invocation

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "google/recurrentgemma-9b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

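# The prompt ends with an open model turn so generation continues as the model.
prompt = "<start_of_turn>user\nWrite a haiku about memory.<end_of_turn>\n<start_of_turn>model\n"
# With device_map="auto", move inputs to wherever the model was placed instead of hardcoding "cuda".
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(out[0], skip_special_tokens=True))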

When to choose it over base Gemma 4

Honestly: rarely, in April 2026.

The original pitch was "long-context generation without KV blowup." Gemma 4 now ships with 256K context on the 26B/31B and 128K on the edge models, with efficient attention implementations. The gap RecurrentGemma was filling has narrowed.

Reasonable residual cases:

Extremely memory-constrained hardware (Jetson Nano tier) where even the KV cache of a quantized Gemma 4 E2B is what limits sequence length.
Streaming-generation workloads where latency-per-token must stay constant as output length grows into the tens of thousands of tokens.
Research interest in recurrent LLMs.
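The first two cases come down to how a full-attention KV cache grows with sequence length while a windowed cache does not. A rough calculator, assuming a standard K-plus-V cache layout; the layer count, KV head count, head dimension, and window used in the demo are hypothetical placeholders, not the configuration of any Gemma or RecurrentGemma model.

```python
def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes for a full-attention KV cache: K and V tensors for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


def windowed_cache_bytes(seq_len: int, window: int, num_layers: int,
                         num_kv_heads: int, head_dim: int,
                         bytes_per_value: int = 2) -> int:
    """Same layout, but only the most recent `window` positions are kept."""
    return kv_cache_bytes(min(seq_len, window), num_layers, num_kv_heads,
                          head_dim, bytes_per_value)


if __name__ == "__main__":
    # Hypothetical config for illustration only.
    cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128)
    for seq_len in (1_024, 16_384, 65_536):
        full = kv_cache_bytes(seq_len, **cfg) / 2**20
        local = windowed_cache_bytes(seq_len, window=2_048, **cfg) / 2**20
        print(f"{seq_len:>6} tokens: full {full:8.0f} MiB, windowed {local:5.0f} MiB")
```

With real model numbers plugged in, this gives a quick check on whether the cache or the weights are the binding constraint on a given device.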

For typical homelab use, skip. The V100 on pve197 has 32GB VRAM; Gemma 4 31B at Q4 fits with room for generous context.
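A back-of-the-envelope check of the "fits with room" claim, counting weights only. The 4.5 bits-per-weight figure is an assumption standing in for quantization overhead in typical 4-bit formats; activations, runtime buffers, and KV cache are not included.

```python
def weight_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GiB for a given quantization width."""
    return num_params * bits_per_weight / 8 / 2**30


if __name__ == "__main__":
    vram_gib = 32.0   # V100 capacity as stated in the text
    params = 31e9     # Gemma 4 31B parameter count as stated in the text
    for bits in (4.0, 4.5):  # 4.5 is an assumed effective width, not a measured one
        used = weight_gib(params, bits)
        print(f"{bits} bits/weight: {used:.1f} GiB weights, "
              f"{vram_gib - used:.1f} GiB left for cache and activations")
```

Whatever remains after the weights is the budget that the KV cache has to fit into.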

Homelab fit

Not a strong candidate for any current Seth project. Note for file: if a CPU-only streaming-transcript use case ever comes up (e.g., running on seth-pi for always-on audio processing), RecurrentGemma 2B could reappear in scope.
