# DataGemma

LLM grounding with Google **Data Commons**, a public knowledge graph of 240B+ statistical data points (economics, health, demographics, science). Built on **Gemma 2 27B**. No Gemma 3 or 4 generation yet.

## What it is

Two flavors:

- **DataGemma RIG** (Retrieval-Interleaved Generation): The model is fine-tuned to emit inline Data Commons queries wrapped around its own claims. Outputs look like `The population of Sunnyvale is [__DC__("population of Sunnyvale") --> "152,200"]`. An external resolver substitutes the real stat.
- **DataGemma RAG** (Retrieval-Augmented Generation): Standard RAG pipeline: query Data Commons, inject results into context, generate.

## Sizes

- **27B instruct** only (`datagemma-rig-27b-it`, `datagemma-rag-27b-it`).

## Model cards

- https://ai.google.dev/gemma/docs/datagemma
- DeepMind: https://deepmind.google/models/gemma/datagemma/
- HF RIG: https://huggingface.co/google/datagemma-rig-27b-it
- HF RAG: https://huggingface.co/google/datagemma-rag-27b-it
- Paper: https://docs.datacommons.org/papers/DataGemma-FullPaper.pdf

## Performance claim

Baseline Gemma 2 factuality on the 101-query statistical eval: **5–17%**. DataGemma RIG: **~58%**. The improvement is narrow (statistical claims only) but real.

## Prompt format

No special template. Plain natural-language input. The difference is in the **training** and the **output format**.

**RIG output example:**

```
Sunnyvale has [__DC__("total population of Sunnyvale CA") --> "152,200"] residents as of 2020, with a median age of [__DC__("median age of Sunnyvale CA") --> "34.8"].
```

Post-processing: regex out the `[__DC__("...") --> "..."]` blocks and either (a) replace with resolved Data Commons values, or (b) render as inline citations.

**RAG flow:** query Data Commons first, inject tabular context, then prompt normally.
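
The RIG post-processing step can be sketched roughly as follows. This is a minimal illustration, not the official resolver: the pattern is inferred only from the output example above, and `extract_queries`, `resolve_rig_output`, `STUB_VALUES`, and the `lookup` callback are hypothetical names. A real `lookup` would call the Data Commons API; here it is a dictionary stub that reuses a figure from the example so the sketch runs offline.

```python
import re
from typing import Callable, Optional

# Matches blocks shaped like [__DC__("question") --> "model guess"].
# Assumption: neither quoted part contains a double quote.
DC_PATTERN = re.compile(
    r'\[__DC__\("(?P<query>[^"]*)"\)\s*-->\s*"(?P<guess>[^"]*)"\]'
)


def extract_queries(text: str) -> list[tuple[str, str]]:
    """Return (query, model_guess) pairs in order of appearance."""
    return [(m.group("query"), m.group("guess")) for m in DC_PATTERN.finditer(text)]


def resolve_rig_output(text: str, lookup: Callable[[str], Optional[str]]) -> str:
    """Replace each block with the looked-up value.

    If lookup returns None, keep the model's own guess and flag it.
    """
    def _substitute(match: re.Match) -> str:
        resolved = lookup(match.group("query"))
        if resolved is None:
            return f'{match.group("guess")} [unverified]'
        return resolved

    return DC_PATTERN.sub(_substitute, text)


SAMPLE = (
    'Sunnyvale has [__DC__("total population of Sunnyvale CA") --> "152,200"] '
    'residents as of 2020, with a median age of '
    '[__DC__("median age of Sunnyvale CA") --> "34.8"].'
)

# Hypothetical stand-in for a Data Commons lookup. The value is copied from
# the example above as a placeholder, not fetched from the real service.
STUB_VALUES = {"total population of Sunnyvale CA": "152,200"}

print(extract_queries(SAMPLE))
print(resolve_rig_output(SAMPLE, STUB_VALUES.get))
```

Keeping the lookup as a callback means the same parsing code can serve both post-processing options: return the resolved value for option (a), or return a citation-style string for option (b).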

## Minimum invocation: RIG

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "google/datagemma-rig-27b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
)

prompt = "What are the demographic trends in Sunnyvale, California?"
# With device_map="auto", move inputs to the model's device
# instead of hard-coding "cuda".
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=1024)

# Decode only the newly generated tokens, not the echoed prompt.
print(tokenizer.batch_decode(
    out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```

Then run a resolver that extracts each `[__DC__("...") --> "..."]` block and queries the Data Commons API with the quoted question.

## When to choose it over base Gemma 4

- You're building a **statistics-grounded assistant** (government data, public health, economic indicators) and need low hallucination on numbers.
- You're okay with a **27B model**: DataGemma only ships at this size.
- Your domain overlaps Data Commons coverage (US-heavy, but growing internationally).

Base Gemma 4 plus a conventional RAG pipeline can do the same thing if you bring your own retriever. DataGemma's value is the **trained inline-citation behavior** (RIG); Gemma 4 won't emit that format without prompting gymnastics.

## Homelab fit

Low. No current Seth project leans on statistical grounding. It is a niche fit for a news-summary use case (POS-Automation daily print) if Seth ever wants "US inflation was X% as of Y" style interjections, but in that case a simple Data Commons API call from the script is cheaper than running a 27B model.