df5542f7d6
Three-arm harness under scripts/native-bakeoff/:
- arm A: /api/chat with JSON tools (current default)
- arm B: /api/generate raw:true with canonical HF jinja template rendered directly
- arm C: google-deepmind/gemma JAX ToolSampler (env-gated, JAX required)

Interim finding from A+B sweep on matt-strix gemma4:26b Q4: Ollama's bidirectional JSON↔native tool-call translator is faithful. The "long" multi-tool task produces identical behavior (7 steps / 6 tools) on both arms.

Earlier arm-B parser bug that looked like a divergence was a harness issue: preserving the model's <|channel>thought\n<channel|> prefix as assistant content tripped the jinja template's tool_response-following conditional, appending a spurious <turn|>\n that corrupted the next step's prompt. Fixed by dropping the channel prefix on the assistant message.

Arm C left as scaffolded-but-not-run: the JAX/bf16 reference path would answer "does the GGUF runtime diverge from DeepMind's implementation" but requires a separate env with the `gemma` PyPI package. Parked pending SDXL eviction or vast-h100 session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
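The arm-B fix described in the commit message (dropping the channel prefix before the completion is stored as assistant content) might look roughly like the sketch below. It assumes the prefix is exactly the `<|channel>thought\n<channel|>` string that appears at the start of each `raw_completion_head` in this log; the name `strip_channel_prefix` is hypothetical and is not taken from the harness.

```python
# Assumption: the empty thought-channel block always has this exact form,
# as seen at the start of every raw_completion_head in the log.
CHANNEL_PREFIX = "<|channel>thought\n<channel|>"


def strip_channel_prefix(completion: str) -> str:
    """Return the completion without a leading empty thought-channel block."""
    if completion.startswith(CHANNEL_PREFIX):
        return completion[len(CHANNEL_PREFIX):]
    # Completions without the prefix pass through unchanged.
    return completion


raw = '<|channel>thought\n<channel|><|tool_call>call:chat_search{query:<|"|>home automation<|"|>}'
cleaned = strip_channel_prefix(raw)
```

With the prefix removed, the assistant message starts directly at `<|tool_call>`, which is the form that appears in the `prompt_tail` fields of the later steps.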
117 lines
8.9 KiB
JSON
{
"arm": "ollama-native",
"model": "gemma4:26b",
"num_ctx": 8192,
"num_predict": 2048,
"started_at": 1776600755.785702,
"turns": [
{
"step": 1,
"elapsed_s": 1.05,
"prompt_eval_count": 1389,
"eval_count": 27,
"content_len": 109,
"tool_call_count": 1,
"stop_reason": "stop",
"history_chars_before_append": 3009,
"raw_completion_head": "<|channel>thought\n<channel|><|tool_call>call:memory_read{query:<|\"|>home_automation<|\"|>,user:<|\"|>seth<|\"|>}",
"raw_completion_tail": "",
"prompt_tail": "ry for what I have on home_automation, (2) search sethflix for any home-automation documentaries, (3) web_search for current news about Home Assistant version releases, (4) fetch the top search result for details, (5) check chat_search for prior mentions, (6) summarize all findings and write a new memory entry with the summary. Do each step in order and report back at the end.<turn|>\n<|turn>model\n",
"prompt_head": "<bos><|turn>system\nYou are Mort, a direct and witty AI assistant on Seth's Matrix server. Powered by Gemma 4. Current time: Saturday, April 18 2026 02:30 PM EDT.\n\nWhen a tool can answer the question, "
},
{
"step": 2,
"elapsed_s": 1.07,
"prompt_eval_count": 1509,
"eval_count": 28,
"content_len": 126,
"tool_call_count": 1,
"stop_reason": "stop",
"history_chars_before_append": 3281,
"raw_completion_head": "<|channel>thought\n<channel|><|tool_call>call:sethsearch{query:<|\"|>home automation documentary<|\"|>,source:<|\"|>sethflix<|\"|>}",
"raw_completion_tail": "",
"prompt_tail": "<|\"|>home_automation<|\"|>,user:<|\"|>seth<|\"|>}<tool_call|><|tool_response>response:memory_read{value:<|\"|>memories matching 'home_automation':\n- home_automation: Seth uses Home Assistant on VM 706 (pve173) with Zigbee2MQTT and MQTT broker on CT 149. Integrates with LG TV, lights, and Frigate NVR.\n- preferences: dark theme with orange accents (#D35400), Sethflix/Sethian brand.<|\"|>}<tool_response|>",
"prompt_head": "<bos><|turn>system\nYou are Mort, a direct and witty AI assistant on Seth's Matrix server. Powered by Gemma 4. Current time: Saturday, April 18 2026 02:30 PM EDT.\n\nWhen a tool can answer the question, "
},
{
"step": 3,
"elapsed_s": 0.92,
"prompt_eval_count": 1649,
"eval_count": 23,
"content_len": 116,
"tool_call_count": 1,
"stop_reason": "stop",
"history_chars_before_append": 3572,
"raw_completion_head": "<|channel>thought\n<channel|><|tool_call>call:web_search{query:<|\"|>Home Assistant current version release news<|\"|>}",
"raw_completion_tail": "",
"prompt_tail": ">,source:<|\"|>sethflix<|\"|>}<tool_call|><|tool_response>response:sethsearch{value:<|\"|>sethflix search 'home automation documentary': The Matrix (1999), The Matrix Reloaded (2003), The Matrix Revolutions (2003), The Matrix Resurrections (2021), Equilibrium (2002), Dark City (1998), Minority Report (2002), Ex Machina (2014), Blade Runner 2049 (2017), Ghost in the Shell (1995).<|\"|>}<tool_response|>",
"prompt_head": "<bos><|turn>system\nYou are Mort, a direct and witty AI assistant on Seth's Matrix server. Powered by Gemma 4. Current time: Saturday, April 18 2026 02:30 PM EDT.\n\nWhen a tool can answer the question, "
},
{
"step": 4,
"elapsed_s": 0.97,
"prompt_eval_count": 1814,
"eval_count": 24,
"content_len": 93,
"tool_call_count": 1,
"stop_reason": "stop",
"history_chars_before_append": 4188,
"raw_completion_head": "<|channel>thought\n<channel|><|tool_call>call:web_fetch{url:<|\"|>https://example.com/one<|\"|>}",
"raw_completion_tail": "",
"prompt_tail": "a community discussion with multiple perspectives and useful links to follow up on. https://example.com/two\n3. Example result three \u2014 official documentation or reference material. https://example.com/three\n4. Example result four \u2014 a recent news article with relevant context. https://example.com/four\n5. Example result five \u2014 a tutorial or how-to guide. https://example.com/five<|\"|>}<tool_response|>",
"prompt_head": "<bos><|turn>system\nYou are Mort, a direct and witty AI assistant on Seth's Matrix server. Powered by Gemma 4. Current time: Saturday, April 18 2026 02:30 PM EDT.\n\nWhen a tool can answer the question, "
},
{
"step": 5,
"elapsed_s": 0.88,
"prompt_eval_count": 1931,
"eval_count": 19,
"content_len": 89,
"tool_call_count": 1,
"stop_reason": "stop",
"history_chars_before_append": 4616,
"raw_completion_head": "<|channel>thought\n<channel|><|tool_call>call:chat_search{query:<|\"|>home automation<|\"|>}",
"raw_completion_tail": "",
"prompt_tail": "le body with several paragraphs of extracted text. It covers the topic requested with examples and context. The full text runs to about 2000 characters of real prose in production; here's a reasonable approximation for the bakeoff harness. Key details are preserved \u2014 author, date, main argument \u2014 followed by supporting evidence and a conclusion that ties back to the headline.<|\"|>}<tool_response|>",
"prompt_head": "<bos><|turn>system\nYou are Mort, a direct and witty AI assistant on Seth's Matrix server. Powered by Gemma 4. Current time: Saturday, April 18 2026 02:30 PM EDT.\n\nWhen a tool can answer the question, "
},
{
"step": 6,
"elapsed_s": 4.38,
"prompt_eval_count": 2063,
"eval_count": 188,
"content_len": 692,
"tool_call_count": 1,
"stop_reason": "stop",
"history_chars_before_append": 4843,
"raw_completion_head": "<|channel>thought\n<channel|><|tool_call>call:memory_write{content:<|\"|>Summary of research on home_automation: \n1. Memory: Seth uses Home Assistant on VM 706 (pve173) with Zigbee2MQTT and MQTT broker on CT 149, integrating LG TV, lights, an",
"raw_completion_tail": "sults (details from example source).\n4. Chat History: Previous discussions focused on setting up a shared Grafana dashboard for the Proxmox cluster on CT 300.<|\"|>,key:<|\"|>home_automation_research_summary_20260418<|\"|>,user:<|\"|>seth<|\"|>}",
"prompt_tail": "eadline.<|\"|>}<tool_response|><|tool_call>call:chat_search{query:<|\"|>home automation<|\"|>}<tool_call|><|tool_response>response:chat_search{value:<|\"|>chat_search results:\n[2026-03-14 22:00] @seth:sethpc.xyz in #general: we should set up a shared grafana dashboard for the proxmox cluster\n[2026-03-20 18:30] @seth:sethpc.xyz in #infra: done, it's on CT 300 at grafana.sethpc.xyz<|\"|>}<tool_response|>",
"prompt_head": "<bos><|turn>system\nYou are Mort, a direct and witty AI assistant on Seth's Matrix server. Powered by Gemma 4. Current time: Saturday, April 18 2026 02:30 PM EDT.\n\nWhen a tool can answer the question, "
},
{
"step": 7,
"elapsed_s": 5.08,
"prompt_eval_count": 2296,
"eval_count": 217,
"content_len": 873,
"tool_call_count": 0,
"stop_reason": "stop",
"history_chars_before_append": 4958,
"raw_completion_head": "<|channel>thought\n<channel|>Here is the summary of my research:\n\n1. **Existing Memory**: You are running Home Assistant on VM 706 (pve173), utilizing Zigbee2MQTT and an MQTT broker on CT 149, with integrations for your LG TV, lights, and F",
"raw_completion_tail": "tory**: Previous mentions of automation/monitoring were related to setting up a shared Grafana dashboard for the Proxmox cluster on CT 300.\n\nI have saved this summary to your memory under the key `home_automation_research_summary_20260418`.",
"prompt_tail": ". Chat History: Previous discussions focused on setting up a shared Grafana dashboard for the Proxmox cluster on CT 300.<|\"|>,key:<|\"|>home_automation_research_summary_20260418<|\"|>,user:<|\"|>seth<|\"|>}<tool_call|><|tool_response>response:memory_write{value:<|\"|>stored: home_automation_research_summary_20260418 = Summary of research on home_automation: \n1. Memory: Seth use...<|\"|>}<tool_response|>",
"prompt_head": "<bos><|turn>system\nYou are Mort, a direct and witty AI assistant on Seth's Matrix server. Powered by Gemma 4. Current time: Saturday, April 18 2026 02:30 PM EDT.\n\nWhen a tool can answer the question, "
}
],
"final": {
"halt_reason": "no_tool_calls",
"steps_used": 7,
"tool_calls_total": 6,
"wall_clock_s": 14.34,
"final_message_count": 26,
"final_history_chars": 5831
},
"task": "long",
"task_prompt": "Research question with multiple steps: (1) check memory for what I have on home_automation, (2) search sethflix for any home-automation documentaries, (3) web_search for current news about Home Assistant version releases, (4) fetch the top search result for details, (5) check chat_search for prior mentions, (6) summarize all findings and write a new memory entry with the summary. Do each step in order and report back at the end."
}
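The `raw_completion_head` fields in the log above record each tool call in the model's native form, `<|tool_call>call:NAME{key:<|"|>value<|"|>,...}`. The sketch below shows one way such a string could be turned back into a name and an argument dict. It is an illustration only: `parse_tool_call` is a hypothetical helper, not the harness's parser, and it assumes every argument is a quoted string as in this log, so other value types or values that themselves contain the quote token are not handled.

```python
import re

# Matches the call name and the brace-delimited argument body.
CALL_RE = re.compile(r"<\|tool_call>call:(\w+)\{(.*)\}", re.DOTALL)
# Matches one key:<|"|>value<|"|> pair; assumes string-valued arguments only.
ARG_RE = re.compile(r'(\w+):<\|"\|>(.*?)<\|"\|>', re.DOTALL)


def parse_tool_call(raw: str):
    """Return (name, args) for the first native tool call in raw, or None."""
    match = CALL_RE.search(raw)
    if match is None:
        return None
    name, body = match.group(1), match.group(2)
    return name, dict(ARG_RE.findall(body))


step1 = (
    '<|channel>thought\n<channel|><|tool_call>call:memory_read'
    '{query:<|"|>home_automation<|"|>,user:<|"|>seth<|"|>}'
)
parsed = parse_tool_call(step1)
```

For the step 1 completion this yields the name `memory_read` with arguments `query` and `user`; a completion with no tool call, such as the final summary in step 7, returns `None`.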