gemma4-research/scripts/native-bakeoff/tasks.py
Mortdecai df5542f7d6 feat: native-bakeoff scaffold — Ollama JSON vs native-token tool-calling
Three-arm harness under scripts/native-bakeoff/:
- arm A: /api/chat with JSON tools (current default)
- arm B: /api/generate raw:true with canonical HF jinja template rendered directly
- arm C: google-deepmind/gemma JAX ToolSampler (env-gated, JAX required)
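As a rough illustration of how arms A and B differ at the request level, the sketch below builds the two Ollama request bodies without sending them. The helper names, the `stream: False` setting, and the `rendered_prompt` argument (standing in for the output of the jinja template) are assumptions made for illustration, not the harness's actual code.

```python
def arm_a_payload(model: str, messages: list, tools: list) -> dict:
    """Body for POST /api/chat: Ollama translates the JSON tools itself."""
    return {"model": model, "messages": messages, "tools": tools, "stream": False}


def arm_b_payload(model: str, rendered_prompt: str) -> dict:
    """Body for POST /api/generate with raw:true: no server-side templating."""
    return {"model": model, "prompt": rendered_prompt, "raw": True, "stream": False}
```

With this split, any difference in behavior between the two arms would come from the prompt that reaches the model, since the model and the task inputs are the same.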

Interim finding from the A+B sweep on matt-strix gemma4:26b Q4: Ollama's
bidirectional JSON↔native tool-call translator is faithful. The "long"
multi-tool task produces identical behavior (7 steps / 6 tools) on both
arms. An earlier arm-B result that looked like a divergence was a
harness bug: preserving the model's <|channel>thought\n<channel|>
prefix as assistant content tripped the jinja template's
tool_response-following conditional, which appended a spurious <turn|>\n
and corrupted the next step's prompt. Fixed by dropping the channel
prefix from the assistant message.
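
The fix described above could look roughly like the following. This is a minimal sketch, assuming the prefix always sits at the very start of the assistant content and uses the `<|channel>` / `<channel|>` delimiters quoted in this message; the function name is hypothetical and the harness may implement the fix differently.

```python
CHANNEL_OPEN = "<|channel>"
CHANNEL_CLOSE = "<channel|>"


def strip_channel_prefix(content: str) -> str:
    """Drop a leading <|channel>...<channel|> block, if one is present."""
    if not content.startswith(CHANNEL_OPEN):
        return content
    end = content.find(CHANNEL_CLOSE)
    if end == -1:
        # Unterminated prefix: leave the content untouched rather than guess.
        return content
    return content[end + len(CHANNEL_CLOSE):].lstrip("\n")
```

Applying this to each assistant message before it is appended to the history keeps the channel prefix out of the text the template renders on the next step.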

Arm C left as scaffolded-but-not-run — the JAX/bf16 reference path
would answer "does the GGUF runtime diverge from DeepMind's
implementation" but requires a separate env with the `gemma` PyPI
package. Parked pending SDXL eviction or vast-h100 session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 05:45:12 -04:00

140 lines
10 KiB
Python

"""Shared task definitions for the native-bakeoff.

Lifted verbatim from scripts/mort-bakeoff/harness.py so all three arms
(ollama-json, ollama-native, jax-native) see identical tasks, stubs,
system prompt, and fake history. If these ever drift, the comparison
becomes meaningless.

The goal of this harness is to isolate the *inference path* as the
only variable:
  - Arm A: Ollama /api/chat with JSON `tools:[...]` (current default)
  - Arm B: Ollama /api/generate with raw:true + native Gemma tokens
  - Arm C: google-deepmind/gemma JAX `gm.text.ToolSampler`
"""
from __future__ import annotations

SYSTEM_PROMPT = """You are Mort, a direct and witty AI assistant on Seth's Matrix server. Powered by Gemma 4. Current time: Saturday, April 18 2026 02:30 PM EDT.
When a tool can answer the question, invoke it immediately — do not narrate intent or describe what you would do. Chain tools when a single call isn't sufficient: search → fetch → synthesize. If a tool returns an error or empty results, try an alternative tool or query before answering from memory. Base your response on tool results, not your training data — cite what you found.
## Tools
- **sethsearch** — search Seth's homelab (repos, wiki, media, feeds). Use `source: "sethflix"` for movies/TV/music.
- **check_sethflix** — verify which titles are in sethflix. Pass a comma-separated list.
- **web_search** — search the internet for current information
- **chat_search** — search message history across all rooms
- **memory_read / memory_write** — recall and store durable facts about users and topics
- **web_fetch** — fetch and extract text from a URL
- **generate_image** — generate an image via SDXL.
## Boundaries
- Only persist durable facts to memory, not ephemeral chat
- You have no memory between sessions. Your context is a sliding window — older messages fall off silently. Do not claim to "remember," promise to "do better," or describe your own architecture.
"""

# OpenAI-style tool schema. Arm A consumes this as-is. Arm B serializes
# it into Gemma's native <|tool>declaration:...<tool|> syntax. Arm C
# wraps each tool into a gm.tools.Tool subclass.
TOOLS = [
    {"type": "function", "function": {"name": "web_search", "description": "Search the web.", "parameters": {"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]}}},
    {"type": "function", "function": {"name": "sethsearch", "description": "Search Seth's homelab or sethflix (use source='sethflix' for movies/TV).", "parameters": {"type": "object", "properties": {"query": {"type": "string"}, "source": {"type": "string"}, "limit": {"type": "integer"}}, "required": ["query"]}}},
    {"type": "function", "function": {"name": "check_sethflix", "description": "Verify which titles are in sethflix.", "parameters": {"type": "object", "properties": {"titles": {"type": "string", "description": "comma-separated"}}, "required": ["titles"]}}},
    {"type": "function", "function": {"name": "memory_read", "description": "Look up stored facts.", "parameters": {"type": "object", "properties": {"query": {"type": "string"}, "user": {"type": "string"}}, "required": ["query"]}}},
    {"type": "function", "function": {"name": "memory_write", "description": "Store a fact.", "parameters": {"type": "object", "properties": {"key": {"type": "string"}, "content": {"type": "string"}, "user": {"type": "string"}}, "required": ["key", "content"]}}},
    {"type": "function", "function": {"name": "web_fetch", "description": "Fetch URL text.", "parameters": {"type": "object", "properties": {"url": {"type": "string"}}, "required": ["url"]}}},
    {"type": "function", "function": {"name": "chat_search", "description": "Search message history.", "parameters": {"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]}}},
    {"type": "function", "function": {"name": "generate_image", "description": "Generate an image.", "parameters": {"type": "object", "properties": {"prompt": {"type": "string"}}, "required": ["prompt"]}}},
]

FAKE_HISTORY = [
    {"role": "user", "content": "[2026-04-18 13:45] @alice:sethpc.xyz: what's the weather looking like for Monday"},
    {"role": "assistant", "content": "Checking forecasts now — looks like Monday in Seth's area will be mostly cloudy, high of 67°F, 30% chance of rain in the afternoon."},
    {"role": "user", "content": "[2026-04-18 13:47] @bob:sethpc.xyz: any good movies added to sethflix lately"},
    {"role": "assistant", "content": "Fresh adds this week include *Dune: Part Two*, *The Zone of Interest*, *Poor Things*, and a 4K remaster of *The Thing* (1982). Anything catch your eye?"},
    {"role": "user", "content": "[2026-04-18 13:50] @alice:sethpc.xyz: I liked the thing, any horror recs along that line"},
    {"role": "assistant", "content": "If you liked *The Thing*, you'll probably want *Annihilation*, *The Mist*, *Bone Tomahawk*, or *Event Horizon* — all in the library. Body-horror/dread tone, same energy."},
    {"role": "user", "content": "[2026-04-18 13:55] @bob:sethpc.xyz: can you remind me what the sethpc monitoring dashboard is called"},
    {"role": "assistant", "content": "That's Uptime Kuma — runs on CT 147 at `https://uptime.sethpc.xyz`. Covers all the containers and external services."},
    {"role": "user", "content": "[2026-04-18 14:10] @alice:sethpc.xyz: noted thanks"},
    {"role": "user", "content": "[2026-04-18 14:15] @bob:sethpc.xyz: hey mort what's the deal with the proxmox cluster nodes again"},
    {"role": "assistant", "content": "Four nodes: pve173 (tank ZFS host, PowerEdge R820), pve112 (workhorse), pve241 (caddy + game servers), and pve197 (GPU inference). Corosync on 10.10.10.0/24."},
]

TASKS = {
    "movies": "Recommend 3 sci-fi movies NOT already in my sethflix library. Check your picks against check_sethflix before finalizing.",
    "research": "Look up what Home Assistant is, then check chat history for any prior mentions of it in this server.",
    "memory": "What do I have stored about home automation? If anything, summarize it briefly.",
    "long": ("Research question with multiple steps: (1) check memory for what I have on home_automation, "
             "(2) search sethflix for any home-automation documentaries, (3) web_search for current news about "
             "Home Assistant version releases, (4) fetch the top search result for details, (5) check chat_search "
             "for prior mentions, (6) summarize all findings and write a new memory entry with the summary. "
             "Do each step in order and report back at the end."),
}


def execute_tool_stub(name: str, args: dict) -> str:
    """Deterministic tool stubs — same as mort-bakeoff/harness.py."""
    if name == "web_search":
        q = args.get("query", "")
        return (f"Search results for '{q}':\n"
                "1. Example result one — a detailed article that covers the topic at length "
                "with concrete examples and technical background. https://example.com/one\n"
                "2. Example result two — a community discussion with multiple perspectives "
                "and useful links to follow up on. https://example.com/two\n"
                "3. Example result three — official documentation or reference material. "
                "https://example.com/three\n"
                "4. Example result four — a recent news article with relevant context. "
                "https://example.com/four\n"
                "5. Example result five — a tutorial or how-to guide. https://example.com/five")
    if name == "sethsearch":
        src = args.get("source", "general")
        q = args.get("query", "")
        if src == "sethflix":
            return (f"sethflix search '{q}': The Matrix (1999), The Matrix Reloaded (2003), "
                    "The Matrix Revolutions (2003), The Matrix Resurrections (2021), "
                    "Equilibrium (2002), Dark City (1998), Minority Report (2002), "
                    "Ex Machina (2014), Blade Runner 2049 (2017), Ghost in the Shell (1995).")
        return (f"homelab search '{q}': 3 repos, 5 wiki pages, 2 service docs matched. "
                "Top hits: services_directory.md, DECISIONS.md, CORPUS_architecture.md.")
    if name == "check_sethflix":
        titles = args.get("titles", "")
        items = [t.strip() for t in titles.split(",") if t.strip()]
        in_lib = {"The Matrix", "Blade Runner 2049", "Ex Machina", "The Thing"}
        return "\n".join(
            f"- {t}: IN LIBRARY" if t in in_lib else f"- {t}: NOT IN LIBRARY"
            for t in items
        )
    if name == "memory_read":
        q = args.get("query", "")
        return (f"memories matching '{q}':\n"
                "- home_automation: Seth uses Home Assistant on VM 706 (pve173) with "
                "Zigbee2MQTT and MQTT broker on CT 149. Integrates with LG TV, lights, "
                "and Frigate NVR.\n"
                "- preferences: dark theme with orange accents (#D35400), Sethflix/Sethian brand.")
    if name == "memory_write":
        return f"stored: {args.get('key', '?')} = {args.get('content', '?')[:60]}..."
    if name == "web_fetch":
        return ("fetched content (trimmed): This is a typical article body with several "
                "paragraphs of extracted text. It covers the topic requested with examples "
                "and context. The full text runs to about 2000 characters of real prose in "
                "production; here's a reasonable approximation for the bakeoff harness. "
                "Key details are preserved — author, date, main argument — followed by "
                "supporting evidence and a conclusion that ties back to the headline.")
    if name == "chat_search":
        return ("chat_search results:\n"
                "[2026-03-14 22:00] @seth:sethpc.xyz in #general: we should set up a shared "
                "grafana dashboard for the proxmox cluster\n"
                "[2026-03-20 18:30] @seth:sethpc.xyz in #infra: done, it's on CT 300 at "
                "grafana.sethpc.xyz")
    if name == "generate_image":
        return f"image generated: /mxc/abc123/sunset.png (SDXL, 1024x1024, prompt={args.get('prompt','')[:40]}...)"
    return f"ERROR: unknown tool {name}"
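
Because the tool schema and the stub function are maintained separately, one cheap guard against drift is to check that every declared tool has a stub. The sketch below is an illustration only: `unstubbed_tools`, `demo_tools`, and `demo_execute` are hypothetical names, and the demo values are stand-ins so the block runs on its own. It relies on the stubs accepting an empty argument dict, which holds here because every branch reads its arguments through `args.get` with a default.

```python
def unstubbed_tools(tools, execute):
    """Return names declared in the schema that the stub function rejects."""
    missing = []
    for tool in tools:
        name = tool["function"]["name"]
        # The stub's fallthrough branch returns "ERROR: unknown tool <name>".
        if execute(name, {}).startswith("ERROR: unknown tool"):
            missing.append(name)
    return missing


# Stand-ins for illustration; in this module you would pass TOOLS and
# execute_tool_stub instead.
demo_tools = [
    {"type": "function", "function": {"name": "web_fetch"}},
    {"type": "function", "function": {"name": "not_a_real_tool"}},
]


def demo_execute(name, args):
    if name == "web_fetch":
        return "fetched content (trimmed): ..."
    return f"ERROR: unknown tool {name}"


print(unstubbed_tools(demo_tools, demo_execute))
```

Run against the real definitions, `unstubbed_tools(TOOLS, execute_tool_stub)` should come back empty as long as the two stay in sync.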