Three-arm harness under scripts/native-bakeoff/: - arm A: /api/chat with JSON tools (current default) - arm B: /api/generate raw:true with canonical HF jinja template rendered directly - arm C: google-deepmind/gemma JAX ToolSampler (env-gated, JAX required) Interim finding from A+B sweep on matt-strix gemma4:26b Q4: Ollama's bidirectional JSON↔native tool-call translator is faithful. The "long" multi-tool task produces identical behavior (7 steps / 6 tools) on both arms. Earlier arm-B parser bug that looked like a divergence was a harness issue: preserving the model's <|channel>thought\n<channel|> prefix as assistant content tripped the jinja template's tool_response-following conditional, appending a spurious <turn|>\n that corrupted the next step's prompt. Fixed by dropping the channel prefix on the assistant message. Arm C left as scaffolded-but-not-run — the JAX/bf16 reference path would answer "does the GGUF runtime diverge from DeepMind's implementation" but requires a separate env with the `gemma` PyPI package. Parked pending SDXL eviction or vast-h100 session. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4.7 KiB
Native Bakeoff — Gemma 4 Inference Path Comparison
Three-arm bakeoff comparing how different inference paths handle the same Gemma 4 tool-calling workload. Isolates Ollama's JSON↔native translator and the runtime itself as variables.
The three arms
| Arm | Path | What varies |
|---|---|---|
A. ollama-json |
/api/chat with OpenAI-style tools:[...] |
Ollama translates JSON → native tokens on input, native tool-call tokens → structured JSON on output. |
B. ollama-native |
/api/generate with raw:true + canonical HF jinja template |
No JSON translation. Rendered tokens go straight to the model; the harness parses <|tool_call> spans out of the completion. |
C. jax-native |
google-deepmind/gemma reference ToolSampler |
No Ollama. No llama.cpp. No GGUF quant. Reference Python + JAX + bf16. |
Research question
Does Ollama's JSON tools path materially diverge from the native/reference path?
- A vs B divergence ⇒ the Ollama server-side parser is the variable.
- B vs C divergence ⇒ llama.cpp runtime / GGUF quantization / Ollama scheduler is the variable.
- A ≡ B ≡ C ⇒ Ollama's path is faithful to the reference, current production usage is fine.
Prerequisites
Arms A and B: local Ollama with gemma4:latest (E4B 8B) or
gemma4:e4b-it-q8_0 pulled. Python 3.10+, aiohttp, jinja2.
Arm C: separate env with jax and gemma installed; HF
credentials for checkpoint download (~8GB for E4B-it). See
arms/jax_native.py module docstring.
Running
cd scripts/native-bakeoff
# One arm, one task:
python3 harness.py --arm ollama-json --task memory --out runs/A/memory.json
python3 harness.py --arm ollama-native --task memory --out runs/B/memory.json
python3 harness.py --arm jax-native --task memory --out runs/C/memory.json
# Full sweep (A + B, 4 tasks each):
for arm in ollama-json ollama-native; do
for task in movies research memory long; do
python3 harness.py --arm "$arm" --task "$task" \
--out "runs/${arm}/${task}.json"
done
done
Default model is gemma4:latest for Ollama arms (the E4B-it variant).
Override with --model gemma4:26b if you want the MoE bakeoff
(expect slower; 26B is 18GB GGUF).
Trace schema
Each run writes a JSON with:
arm,model,task,task_promptturns[]— per-step metrics:elapsed_s,prompt_eval_count,eval_count,tool_call_count,content_len, etc.final—halt_reason,steps_used,tool_calls_total,wall_clock_s,final_history_chars
Halt reasons: no_tool_calls (model produced final answer),
step_budget (hit 20-step limit), error:*, env_missing (arm C
only), sampler_error:* (arm C only).
Smoke test evidence
First wiring run on 2026-04-19 against gemma4:latest on steel141
(local Ollama, CPU):
| Arm | Task | Steps | Tools | Halt | Wall |
|---|---|---|---|---|---|
| A (ollama-json) | memory | 2 | 1 | no_tool_calls |
10.16s |
| B (ollama-native) | memory | 2 | 1 | no_tool_calls |
2.39s |
Identical behavioral shape (one tool call, clean final answer) on this simple task. The wall-clock delta is interesting but not conclusive on a single run — could be cache warmth or could be Ollama's parser overhead. A full sweep will separate signal from noise.
Known limitations
- Arm C system prompt handling.
gm.text.ToolSamplerdoesn't take a pre-populated message history cleanly, so arm C folds a compact version ofFAKE_HISTORYinto the user message. Arms A and B feed history through proper role-tagged turns. Fidelity compromise — if a C vs A/B delta traces here, rebuildsampler.turnsdirectly before calling.chat(). - Arm C sampler caveat. The deepmind-gemma
ToolSamplerdocstring notes "Gemma 1, 2 and 3 models were not specifically trained for tool use" and flags the sampler as a proof-of-concept. Gemma 4 is tool-trained, so it should do better, but if arm C underperforms A/B the sampler implementation may be the variable, not the model. - Quantization confounder. Ollama arms run Q8 (E4B) or Q4 (26B); arm C runs bf16. A non-trivial A vs C delta could be the quantization. Only A ≡ B ≢ C cleanly implicates the inference engine rather than the bits.
Related artifacts
scripts/mort-bakeoff/harness.py— the round-3 bakeoff that establishedthink:falsekills 26B in multi-turn tool loops. Task definitions are lifted from there.docs/reference/bakeoff-2026-04-18.md— round-3 writeup.CORPUS_tool_calling_format.md— the native Gemma 4 tool-call token syntax this harness implements.tooling/huggingface/model-cards/gemma-4-E4B-it-chat_template.jinja— the canonical template arm B renders.