# Native Bakeoff — Gemma 4 Inference Path Comparison Three-arm bakeoff comparing how different inference paths handle the same Gemma 4 tool-calling workload. Isolates Ollama's JSON↔native translator and the runtime itself as variables. ## The three arms | Arm | Path | What varies | |-----|------|-------------| | A. `ollama-json` | `/api/chat` with OpenAI-style `tools:[...]` | Ollama translates JSON → native tokens on input, native tool-call tokens → structured JSON on output. | | B. `ollama-native` | `/api/generate` with `raw:true` + canonical HF jinja template | No JSON translation. Rendered tokens go straight to the model; the harness parses `<\|tool_call>` spans out of the completion. | | C. `jax-native` | `google-deepmind/gemma` reference `ToolSampler` | No Ollama. No llama.cpp. No GGUF quant. Reference Python + JAX + bf16. | ## Research question > Does Ollama's JSON tools path materially diverge from the native/reference path? - A vs B divergence ⇒ the Ollama server-side parser is the variable. - B vs C divergence ⇒ llama.cpp runtime / GGUF quantization / Ollama scheduler is the variable. - A ≡ B ≡ C ⇒ Ollama's path is faithful to the reference, current production usage is fine. ## Prerequisites **Arms A and B:** local Ollama with `gemma4:latest` (E4B 8B) or `gemma4:e4b-it-q8_0` pulled. Python 3.10+, `aiohttp`, `jinja2`. **Arm C:** separate env with `jax` and `gemma` installed; HF credentials for checkpoint download (~8GB for E4B-it). See `arms/jax_native.py` module docstring. ## Running ```bash cd scripts/native-bakeoff # One arm, one task: python3 harness.py --arm ollama-json --task memory --out runs/A/memory.json python3 harness.py --arm ollama-native --task memory --out runs/B/memory.json python3 harness.py --arm jax-native --task memory --out runs/C/memory.json # Full sweep (A + B, 4 tasks each): for arm in ollama-json ollama-native; do for task in movies research memory long; do python3 harness.py --arm "$arm" --task "$task" \ --out "runs/${arm}/${task}.json" done done ``` Default model is `gemma4:latest` for Ollama arms (the E4B-it variant). Override with `--model gemma4:26b` if you want the MoE bakeoff (expect slower; 26B is 18GB GGUF). ## Trace schema Each run writes a JSON with: - `arm`, `model`, `task`, `task_prompt` - `turns[]` — per-step metrics: `elapsed_s`, `prompt_eval_count`, `eval_count`, `tool_call_count`, `content_len`, etc. - `final` — `halt_reason`, `steps_used`, `tool_calls_total`, `wall_clock_s`, `final_history_chars` Halt reasons: `no_tool_calls` (model produced final answer), `step_budget` (hit 20-step limit), `error:*`, `env_missing` (arm C only), `sampler_error:*` (arm C only). ## Smoke test evidence First wiring run on 2026-04-19 against `gemma4:latest` on steel141 (local Ollama, CPU): | Arm | Task | Steps | Tools | Halt | Wall | |-----|------|-------|-------|------|------| | A (ollama-json) | memory | 2 | 1 | `no_tool_calls` | 10.16s | | B (ollama-native) | memory | 2 | 1 | `no_tool_calls` | 2.39s | Identical *behavioral* shape (one tool call, clean final answer) on this simple task. The wall-clock delta is interesting but not conclusive on a single run — could be cache warmth or could be Ollama's parser overhead. A full sweep will separate signal from noise. ## Known limitations - **Arm C system prompt handling.** `gm.text.ToolSampler` doesn't take a pre-populated message history cleanly, so arm C folds a compact version of `FAKE_HISTORY` into the user message. Arms A and B feed history through proper role-tagged turns. Fidelity compromise — if a C vs A/B delta traces here, rebuild `sampler.turns` directly before calling `.chat()`. - **Arm C sampler caveat.** The deepmind-gemma `ToolSampler` docstring notes "Gemma 1, 2 and 3 models were not specifically trained for tool use" and flags the sampler as a proof-of-concept. Gemma 4 *is* tool-trained, so it should do better, but if arm C underperforms A/B the sampler implementation may be the variable, not the model. - **Quantization confounder.** Ollama arms run Q8 (E4B) or Q4 (26B); arm C runs bf16. A non-trivial A vs C delta could be the quantization. Only A ≡ B ≢ C cleanly implicates the inference engine rather than the bits. ## Related artifacts - `scripts/mort-bakeoff/harness.py` — the round-3 bakeoff that established `think:false` kills 26B in multi-turn tool loops. Task definitions are lifted from there. - `docs/reference/bakeoff-2026-04-18.md` — round-3 writeup. - `CORPUS_tool_calling_format.md` — the native Gemma 4 tool-call token syntax this harness implements. - `tooling/huggingface/model-cards/gemma-4-E4B-it-chat_template.jinja` — the canonical template arm B renders.