df5542f7d6
Three-arm harness under scripts/native-bakeoff/:
- arm A: /api/chat with JSON tools (current default)
- arm B: /api/generate raw:true with the canonical HF jinja template rendered directly
- arm C: google-deepmind/gemma JAX ToolSampler (env-gated, JAX required)

Interim finding from the A+B sweep on matt-strix gemma4:26b Q4: Ollama's bidirectional JSON↔native tool-call translator is faithful. The "long" multi-tool task produces identical behavior (7 steps / 6 tools) on both arms.

An earlier arm-B parser bug that looked like a divergence was a harness issue. Preserving the model's <|channel>thought\n<channel|> prefix as assistant content tripped the jinja template's tool_response-following conditional, which appended a spurious <turn|>\n that corrupted the next step's prompt. Fixed by dropping the channel prefix from the assistant message.

Arm C is scaffolded but not run. The JAX/bf16 reference path would answer "does the GGUF runtime diverge from DeepMind's implementation?" but requires a separate env with the `gemma` PyPI package. Parked pending SDXL eviction or a vast-h100 session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
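The arm-B fix (drop the channel prefix before storing the assistant message) and the parsing of Gemma's native tool-call syntax can be sketched as follows. This is a minimal sketch, not the harness's actual code: the function names `strip_channel_prefix` and `parse_native_tool_call` are hypothetical, and both regexes are assumptions inferred from the `raw_completion_head` values in the trace that follows (prefixes such as `<|channel>thought\n<channel|>`, `thought\n<channel|>` and `\t<channel|>`), not from a published Gemma 4 specification. Only string-valued arguments delimited by `<|"|>` are handled.

```python
import re

# Assumption: prefix shapes inferred from raw_completion_head values in this trace.
CHANNEL_PREFIX = re.compile(r"^\s*(?:<\|channel>)?\s*(?:thought\s*)?<channel\|>")

# Native tool-call shape as seen in this trace:
#   <|tool_call>call:NAME{key:<|"|>value<|"|>,key2:<|"|>value2<|"|>}
TOOL_CALL = re.compile(r"<\|tool_call>call:(\w+)\{(.*)\}", re.DOTALL)
STRING_ARG = re.compile(r'(\w+):<\|"\|>(.*?)<\|"\|>', re.DOTALL)


def strip_channel_prefix(completion: str) -> str:
    """Drop the leading thought-channel block before storing assistant content."""
    return CHANNEL_PREFIX.sub("", completion, count=1)


def parse_native_tool_call(completion: str):
    """Return (tool_name, args) for the first native tool call, or None.

    Hypothetical helper: only string-valued arguments wrapped in <|"|> are parsed.
    """
    match = TOOL_CALL.search(completion)
    if match is None:
        return None
    name, body = match.group(1), match.group(2)
    # Lazy match stops each value at its closing <|"|>, so commas and
    # newlines inside a value (as in memory_write content) are preserved.
    args = {key: value for key, value in STRING_ARG.findall(body)}
    return name, args


raw = '<|channel>thought\n<channel|><|tool_call>call:memory_read{query:<|"|>home_automation<|"|>,user:<|"|>seth<|"|>}'
print(strip_channel_prefix(raw))
print(parse_native_tool_call(raw))  # ('memory_read', {'query': 'home_automation', 'user': 'seth'})
```

Stripping before the message is appended to history is what keeps the template's tool_response conditional from seeing leftover channel text; the parser runs on the same completion either before or after stripping, since it searches for the `<|tool_call>` marker rather than anchoring at the start.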
117 lines
8.8 KiB
JSON
{
"arm": "ollama-native",
"model": "gemma4:26b",
"num_ctx": 8192,
"num_predict": 2048,
"started_at": 1776600600.7731934,
"turns": [
{
"step": 1,
"elapsed_s": 4.12,
"prompt_eval_count": 1389,
"eval_count": 27,
"content_len": 109,
"tool_call_count": 1,
"stop_reason": "stop",
"history_chars_before_append": 3009,
"raw_completion_head": "<|channel>thought\n<channel|><|tool_call>call:memory_read{query:<|\"|>home_automation<|\"|>,user:<|\"|>seth<|\"|>}",
"raw_completion_tail": "",
"prompt_tail": "ry for what I have on home_automation, (2) search sethflix for any home-automation documentaries, (3) web_search for current news about Home Assistant version releases, (4) fetch the top search result for details, (5) check chat_search for prior mentions, (6) summarize all findings and write a new memory entry with the summary. Do each step in order and report back at the end.<turn|>\n<|turn>model\n",
"prompt_head": "<bos><|turn>system\nYou are Mort, a direct and witty AI assistant on Seth's Matrix server. Powered by Gemma 4. Current time: Saturday, April 18 2026 02:30 PM EDT.\n\nWhen a tool can answer the question, "
},
{
"step": 2,
"elapsed_s": 1.0,
"prompt_eval_count": 1511,
"eval_count": 27,
"content_len": 116,
"tool_call_count": 1,
"stop_reason": "stop",
"history_chars_before_append": 3309,
"raw_completion_head": "thought\n<channel|><|tool_call>call:sethsearch{query:<|\"|>home automation documentary<|\"|>,source:<|\"|>sethflix<|\"|>}",
"raw_completion_tail": "",
"prompt_tail": "e_automation<|\"|>,user:<|\"|>seth<|\"|>}<tool_call|><|tool_response>response:memory_read{value:<|\"|>memories matching 'home_automation':\n- home_automation: Seth uses Home Assistant on VM 706 (pve173) with Zigbee2MQTT and MQTT broker on CT 149. Integrates with LG TV, lights, and Frigate NVR.\n- preferences: dark theme with orange accents (#D35400), Sethflix/Sethian brand.<|\"|>}<tool_response|><turn|>\n",
"prompt_head": "<bos><|turn>system\nYou are Mort, a direct and witty AI assistant on Seth's Matrix server. Powered by Gemma 4. Current time: Saturday, April 18 2026 02:30 PM EDT.\n\nWhen a tool can answer the question, "
},
{
"step": 3,
"elapsed_s": 0.82,
"prompt_eval_count": 1654,
"eval_count": 20,
"content_len": 98,
"tool_call_count": 1,
"stop_reason": "stop",
"history_chars_before_append": 3618,
"raw_completion_head": "<channel|><|tool_call>call:web_search{query:<|\"|>current Home Assistant version release news<|\"|>}",
"raw_completion_tail": "",
"prompt_tail": "ethflix<|\"|>}<tool_call|><|tool_response>response:sethsearch{value:<|\"|>sethflix search 'home automation documentary': The Matrix (1999), The Matrix Reloaded (2003), The Matrix Revolutions (2003), The Matrix Resurrections (2021), Equilibrium (2002), Dark City (1998), Minority Report (2002), Ex Machina (2014), Blade Runner 2049 (2017), Ghost in the Shell (1995).<|\"|>}<tool_response|>thought<turn|>\n",
"prompt_head": "<bos><|turn>system\nYou are Mort, a direct and witty AI assistant on Seth's Matrix server. Powered by Gemma 4. Current time: Saturday, April 18 2026 02:30 PM EDT.\n\nWhen a tool can answer the question, "
},
{
"step": 4,
"elapsed_s": 0.93,
"prompt_eval_count": 1821,
"eval_count": 23,
"content_len": 83,
"tool_call_count": 1,
"stop_reason": "stop",
"history_chars_before_append": 4244,
"raw_completion_head": "thought\n<channel|><|tool_call>call:web_fetch{url:<|\"|>https://example.com/one<|\"|>}",
"raw_completion_tail": "",
"prompt_tail": "ity discussion with multiple perspectives and useful links to follow up on. https://example.com/two\n3. Example result three \u2014 official documentation or reference material. https://example.com/three\n4. Example result four \u2014 a recent news article with relevant context. https://example.com/four\n5. Example result five \u2014 a tutorial or how-to guide. https://example.com/five<|\"|>}<tool_response|><turn|>\n",
"prompt_head": "<bos><|turn>system\nYou are Mort, a direct and witty AI assistant on Seth's Matrix server. Powered by Gemma 4. Current time: Saturday, April 18 2026 02:30 PM EDT.\n\nWhen a tool can answer the question, "
},
{
"step": 5,
"elapsed_s": 0.81,
"prompt_eval_count": 1941,
"eval_count": 18,
"content_len": 79,
"tool_call_count": 1,
"stop_reason": "stop",
"history_chars_before_append": 4690,
"raw_completion_head": "thought\n<channel|><|tool_call>call:chat_search{query:<|\"|>home automation<|\"|>}",
"raw_completion_tail": "",
"prompt_tail": "veral paragraphs of extracted text. It covers the topic requested with examples and context. The full text runs to about 2000 characters of real prose in production; here's a reasonable approximation for the bakeoff harness. Key details are preserved \u2014 author, date, main argument \u2014 followed by supporting evidence and a conclusion that ties back to the headline.<|\"|>}<tool_response|>thought<turn|>\n",
"prompt_head": "<bos><|turn>system\nYou are Mort, a direct and witty AI assistant on Seth's Matrix server. Powered by Gemma 4. Current time: Saturday, April 18 2026 02:30 PM EDT.\n\nWhen a tool can answer the question, "
},
{
"step": 6,
"elapsed_s": 3.97,
"prompt_eval_count": 2076,
"eval_count": 171,
"content_len": 652,
"tool_call_count": 1,
"stop_reason": "stop",
"history_chars_before_append": 4935,
"raw_completion_head": "\nthought\n<channel|><|tool_call>call:memory_write{content:<|\"|>Summary of Home Automation research: \n1. Memory: Seth uses Home Assistant on VM 706 (pve173) with Zigbee2MQTT/MQTT (CT 149), integrating LG TV, lights, and Frigate.\n2. Sethflix: ",
"raw_completion_tail": " Assistant updates and technical implementations.\n4. Chat History: Previous discussions involved setting up Grafana dashboards for infrastructure monitoring (CT 300).<|\"|>,key:<|\"|>home_automation_research_20260418<|\"|>,user:<|\"|>seth<|\"|>}",
"prompt_tail": "thought<turn|>\n<|tool_call>call:chat_search{query:<|\"|>home automation<|\"|>}<tool_call|><|tool_response>response:chat_search{value:<|\"|>chat_search results:\n[2026-03-14 22:00] @seth:sethpc.xyz in #general: we should set up a shared grafana dashboard for the proxmox cluster\n[2026-03-20 18:30] @seth:sethpc.xyz in #infra: done, it's on CT 300 at grafana.sethpc.xyz<|\"|>}<tool_response|>thought<turn|>\n",
"prompt_head": "<bos><|turn>system\nYou are Mort, a direct and witty AI assistant on Seth's Matrix server. Powered by Gemma 4. Current time: Saturday, April 18 2026 02:30 PM EDT.\n\nWhen a tool can answer the question, "
},
{
"step": 7,
"elapsed_s": 5.66,
"prompt_eval_count": 2292,
"eval_count": 248,
"content_len": 975,
"tool_call_count": 0,
"stop_reason": "stop",
"history_chars_before_append": 5060,
"raw_completion_head": "\t<channel|>Here's the breakdown of my research:\n\n1. **Memory Check**: You're running Home Assistant on **VM 706 (pve173)**, utilizing Zigbee2MQTT and an MQTT broker on **CT 149**, with integrations for your LG TV, lights, and Frigate NVR.\n",
"raw_completion_tail": "me automation\" in chat, though you previously discussed setting up the **Grafana dashboard** (CT 300) for cluster monitoring.\n6. **Memory Updated**: I've stored a summary of these findings under the key `home_automation_research_20260418`.",
"prompt_tail": "ons.\n4. Chat History: Previous discussions involved setting up Grafana dashboards for infrastructure monitoring (CT 300).<|\"|>,key:<|\"|>home_automation_research_20260418<|\"|>,user:<|\"|>seth<|\"|>}<tool_call|><|tool_response>response:memory_write{value:<|\"|>stored: home_automation_research_20260418 = Summary of Home Automation research: \n1. Memory: Seth uses H...<|\"|>}<tool_response|>thought<turn|>\n",
"prompt_head": "<bos><|turn>system\nYou are Mort, a direct and witty AI assistant on Seth's Matrix server. Powered by Gemma 4. Current time: Saturday, April 18 2026 02:30 PM EDT.\n\nWhen a tool can answer the question, "
}
],
"final": {
"halt_reason": "no_tool_calls",
"steps_used": 7,
"tool_calls_total": 6,
"wall_clock_s": 17.31,
"final_message_count": 26,
"final_history_chars": 6034
},
"task": "long",
"task_prompt": "Research question with multiple steps: (1) check memory for what I have on home_automation, (2) search sethflix for any home-automation documentaries, (3) web_search for current news about Home Assistant version releases, (4) fetch the top search result for details, (5) check chat_search for prior mentions, (6) summarize all findings and write a new memory entry with the summary. Do each step in order and report back at the end."
}
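The `final` block of this trace should be recoverable from the per-turn records: `steps_used` (7) is the number of turns, `tool_calls_total` (6) is the sum of `tool_call_count`, and here `wall_clock_s` (17.31) equals the sum of the seven `elapsed_s` values. A minimal consistency check under those assumptions might look like the sketch below. `summarize_trace`, `check_final` and the 0.05 s clock tolerance are hypothetical, not part of the harness, and the inline trace is a trimmed copy of this file's numbers. Because `raw_completion_head` is truncated (see step 6), only tool names are extracted from it, not arguments.

```python
import re

TOOL_NAME = re.compile(r"<\|tool_call>call:(\w+)\{")


def summarize_trace(trace: dict) -> dict:
    """Recompute the `final` counters from per-turn records and list the tools called."""
    turns = trace["turns"]
    tools = []
    for turn in turns:
        # raw_completion_head is truncated, so only the tool name is reliable.
        tools.extend(TOOL_NAME.findall(turn.get("raw_completion_head", "")))
    return {
        "steps_used": len(turns),
        "tool_calls_total": sum(t["tool_call_count"] for t in turns),
        "wall_clock_s": round(sum(t["elapsed_s"] for t in turns), 2),
        "tools": tools,
    }


def check_final(trace: dict, clock_tolerance_s: float = 0.05) -> list:
    """Return human-readable mismatches between `final` and the recomputed values."""
    recomputed = summarize_trace(trace)
    final = trace["final"]
    problems = []
    for key in ("steps_used", "tool_calls_total"):
        if recomputed[key] != final[key]:
            problems.append(f"{key}: recorded {final[key]}, recomputed {recomputed[key]}")
    # Assumption: wall_clock_s is approximately the sum of per-step elapsed_s.
    if abs(recomputed["wall_clock_s"] - final["wall_clock_s"]) > clock_tolerance_s:
        problems.append(
            f"wall_clock_s: recorded {final['wall_clock_s']}, summed {recomputed['wall_clock_s']}"
        )
    return problems


# Trimmed copy of this trace: only the fields the checks read.
elapsed = [4.12, 1.0, 0.82, 0.93, 0.81, 3.97, 5.66]
counts = [1, 1, 1, 1, 1, 1, 0]
names = ["memory_read", "sethsearch", "web_search", "web_fetch", "chat_search", "memory_write"]
turns = [
    {
        "elapsed_s": e,
        "tool_call_count": c,
        "raw_completion_head": f"<channel|><|tool_call>call:{n}{{" if n else "<channel|>Here's the breakdown",
    }
    for e, c, n in zip(elapsed, counts, names + [None])
]
trace = {"turns": turns, "final": {"steps_used": 7, "tool_calls_total": 6, "wall_clock_s": 17.31}}

print(summarize_trace(trace))
print(check_final(trace))
```

On the full file, `trace` would instead come from `json.load` on the trace path (not given here); an empty list from `check_final` means the recorded summary agrees with the turns.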