Mortdecai 5775978899 docs: merge tooling findings into SYNTHESIS/GOTCHAS/CORPUS_* and add handoff

Patches the top-level corpus docs with the 13 findings flagged during the
2026-04-18 canonical tooling research pass. tooling/README.md now marks each
finding [merged: <file>] or [flagged] for provenance.

- CORPUS_ollama_variants.md: annotate gemma4:26b as MoE (25.2B total / 3.8B
  active, 8-of-128 experts + 1 shared). Note Q4_K_M inference is standard
  (the "MoE quality degrades at 4-bit" caveat is training-only). Add note
  that audio on E-series is NOT available via Ollama — llama.cpp mmproj
  or vLLM only.
- CORPUS_capabilities.md: native system role, configurable thinking mode,
  first trained tool use (vs Gemma 1/2/3 proof-of-concept), native object
  detection with bbox output in 1000x1000 coords, pointer to EmbeddingGemma
  for retrieval (Gemma 4 has no embedding mode).
- CORPUS_tool_calling_format.md: add Chat Template Context section
  documenting the <|turn>/<turn|> asymmetric brackets (new in Gemma 4,
  replaced <start_of_turn>/<end_of_turn>) plus <|think>, <|channel>,
  <|image>, <|audio> tokens. Add HF transformers Alternative section
  showing processor.parse_response with response_schema.
- GOTCHAS.md: add MEDIUM gotcha for abandoned google/gemma_pytorch (no
  Gemma 4 support since 2025-05-30). Expand fine-tuning section with FA2/FA4
  head_dim=512 break, fused LoRA kernel issues, 26B A4B training-quant
  guidance, new tool-call tokens as learned embeddings.
- SYNTHESIS.md: add banner pointing to tooling/ for canonical upstream
  material. Add embeddinggemma row to Model Selection table.

Also:
- Add .gitignore excluding .backup/ (local scratch per global CLAUDE.md
  convention, not needed in tracked history) and __pycache__/.
- Add .claude/handoffs/2026-04-18-canonical-tooling-research.md so future
  sessions can pick up cold — facts verified, open threads, what changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-18 12:48:26 -04:00

4.4 KiB


Gemma 4 Native Tool Calling Format

Source: Google AI for Developers - Function Calling docs https://ai.google.dev/gemma/docs/capabilities/text/function-calling-gemma4
Canonical source in corpus: tooling/google-official/docs/ai-google-dev_function_calling_gemma4.html
Authoritative chat template: tooling/huggingface/model-cards/gemma-4-{31B,E4B}-it-chat_template.jinja

Chat Template Context (what surrounds the tool tokens)

Gemma 4 changed the turn-token syntax from Gemma 3. You won't usually write these by hand — Ollama, llama.cpp --jinja, and HF apply_chat_template all handle it — but know what's on the wire when debugging:

| Purpose | Gemma 3 | Gemma 4 |
| --- | --- | --- |
| Turn start | `<start_of_turn>role\n` | `<\|turn>role\n` |
| Turn end | `<end_of_turn>\n` | `<turn\|>\n` |
| Thinking | (not standardized) | `<\|think>...<think\|>` |
| Thought channel | (n/a) | `<\|channel>thought...<channel\|>` |
| Image inline | `<start_of_image>` | `<\|image>...<image\|>` |
| Audio inline | (n/a) | `<\|audio>...<audio\|>` |
| String delimiter in native format | (n/a) | `<\|"\|>` |

Asymmetric brackets are intentional. Opening is <|token>, closing is <token|>. If you see <|turn>...</turn|> in a code sample, that's wrong.
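
When debugging hand-built prompts, a small check for the asymmetric bracket rule can help. The following is a minimal sketch, not part of any Gemma tooling: the helper names (`open_token`, `close_token`, `render_turn`, `find_bad_closers`) are made up for illustration, and the turn layout follows the table above.

```python
import re


def open_token(name):
    """Opening form: pipe on the left, e.g. <|turn>."""
    return f"<|{name}>"


def close_token(name):
    """Closing form: pipe on the right, e.g. <turn|>."""
    return f"<{name}|>"


def render_turn(role, content):
    """Render one turn as laid out in the table: <|turn>role\\n ... <turn|>\\n."""
    return f"{open_token('turn')}{role}\n{content}{close_token('turn')}\n"


# XML-style closers such as </turn|> are the mistake called out above.
BAD_CLOSER = re.compile(r"</\w+\|>")


def find_bad_closers(text):
    """Return every XML-style closer found in a prompt string."""
    return BAD_CLOSER.findall(text)
```

Running `find_bad_closers` over a logged prompt returns an empty list when every closer uses the `<token|>` form.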

Tool Special Tokens (6 total)

| Token | Purpose |
| --- | --- |
| `<\|tool>` / `<tool\|>` | Tool definition block |
| `<\|tool_call>` / `<tool_call\|>` | Model's tool request |
| `<\|tool_response>` / `<tool_response\|>` | Tool execution result |

String delimiter: `<|"|>` (encloses all string values in native format)

Native Format (raw model tokens)

Tool definition in system prompt:

<|tool>declaration:
get_current_temperature{
  location:{type:<|"|>string<|"|>,description:<|"|>The city<|"|>},
  unit:{type:<|"|>string<|"|>,enum:[<|"|>celsius<|"|>,<|"|>fahrenheit<|"|>]}
}<tool|>

Tool call from model:

<|tool_call>call:get_current_temperature{location:<|"|>London<|"|>}<tool_call|>

Tool response:

<|tool_response>response:get_current_temperature{temperature:15,weather:<|"|>sunny<|"|>}<tool_response|>
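
If you ever have to read the native format yourself (for example, when inspecting raw output from a runtime whose parser is misbehaving), the call syntax above can be converted to JSON-style data with a little string handling. This is a rough sketch based only on the examples in this section, not the official grammar: the function names are hypothetical, and it assumes string values never contain the `<|"|>` delimiter or the `}<tool_call|>` sequence.

```python
import json
import re

STRING_DELIM = '<|"|>'
# Non-greedy body match that must end at the closing }<tool_call|> pair,
# so nested braces inside the arguments are kept.
CALL_RE = re.compile(r"<\|tool_call>call:(\w+)\{(.*?)\}<tool_call\|>", re.DOTALL)
BARE_KEY_RE = re.compile(r"([A-Za-z_]\w*)\s*:")


def native_args_to_dict(body):
    """Convert a native argument body (text between the outer braces) to a dict."""
    parts = body.split(STRING_DELIM)
    pieces = []
    for i, part in enumerate(parts):
        if i % 2 == 1:
            # Odd segments sit between delimiters: string values.
            pieces.append(json.dumps(part))
        else:
            # Even segments are structure: quote the bare keys for JSON.
            pieces.append(BARE_KEY_RE.sub(r'"\1":', part))
    return json.loads("{" + "".join(pieces) + "}")


def parse_native_tool_calls(text):
    """Return a list of {"name": ..., "arguments": {...}} found in raw output."""
    return [
        {"name": m.group(1), "arguments": native_args_to_dict(m.group(2))}
        for m in CALL_RE.finditer(text)
    ]
```

Splitting on the string delimiter first keeps commas and colons inside string values from being mistaken for structure.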

JSON Chat Format (for Ollama / OpenAI-compatible APIs)

This is what you actually use in practice. Ollama translates to/from native tokens.

Tool definition:

{
  "type": "function",
  "function": {
    "name": "get_weather",
    "description": "Get current weather for a location",
    "parameters": {
      "type": "object",
      "properties": {
        "city": {"type": "string", "description": "The city name"}
      },
      "required": ["city"]
    }
  }
}

Model returns:

{
  "role": "assistant",
  "tool_calls": [{
    "function": {
      "name": "get_weather",
      "arguments": {"city": "London"}
    }
  }]
}

Tool result message:

{
  "role": "tool",
  "content": "{\"temperature\": 15, \"weather\": \"sunny\"}"
}
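
The three message shapes above fit together in a simple dispatch step: read `tool_calls` from the assistant message, run the matching function, and append a `role: "tool"` message with the JSON-encoded result. A minimal sketch follows; `get_weather` is a stand-in with canned output and `run_tool_calls` is a hypothetical helper, neither comes from Ollama or any SDK.

```python
import json


def get_weather(city):
    """Stand-in tool for illustration; returns canned data."""
    return {"temperature": 15, "weather": "sunny"}


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(assistant_message, registry=TOOL_REGISTRY):
    """Execute each requested tool and return the resulting tool messages."""
    results = []
    for call in assistant_message.get("tool_calls", []):
        fn = call["function"]
        args = fn["arguments"]
        if isinstance(args, str):
            # OpenAI-style responses carry arguments as a JSON string.
            args = json.loads(args)
        handler = registry.get(fn["name"])
        if handler is None:
            content = json.dumps({"error": f"unknown tool: {fn['name']}"})
        else:
            content = json.dumps(handler(**args))
        results.append({"role": "tool", "content": content})
    return results
```

Append the assistant message and the returned tool messages to the conversation before sending the next request, so the model sees its own call alongside the result.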

Thinking Mode + Tool Calls

- When thinking is enabled, preserve thoughts between tool calls
- For long agent chains, summarize thoughts as plain text to save context
- Recommended: disable thinking for tool-heavy workflows (Seth's finding)
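
One way to apply the second point is to shorten older thoughts in the message history while leaving the most recent ones intact. The sketch below is illustrative only: it assumes each assistant message stores its thoughts under a `thinking` key (an assumption about your own message structure), and plain truncation stands in for a real summarizer.

```python
def compact_thoughts(messages, keep_last=2, max_chars=200):
    """Return a copy of messages with older assistant thoughts shortened.

    The newest `keep_last` assistant messages that carry a "thinking" field
    are left untouched; earlier ones are cut to `max_chars` characters.
    """
    with_thoughts = [
        i for i, m in enumerate(messages)
        if m.get("role") == "assistant" and m.get("thinking")
    ]
    if keep_last > 0:
        to_shorten = set(with_thoughts[:-keep_last])
    else:
        to_shorten = set(with_thoughts)

    compacted = []
    for i, message in enumerate(messages):
        message = dict(message)  # shallow copy so the input list is not mutated
        if i in to_shorten and len(message["thinking"]) > max_chars:
            # Placeholder summarizer: truncate. Swap in a real summary step as needed.
            message["thinking"] = message["thinking"][:max_chars].rstrip() + " [truncated]"
        compacted.append(message)
    return compacted
```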

Framework Flags

| Framework | Required Flag |
| --- | --- |
| llama.cpp | `--jinja` |
| vLLM | `--enable-auto-tool-choice` |
| Ollama | Works via `/api/chat` endpoint with `tools` field |
| transformers | `apply_chat_template(tools=[...])` |

Known Issues

- Ollama v0.20.0-0.20.1: tool call parser broken, streaming drops tool calls
- llama.cpp: format mismatches and continuous loops reported
- LM Studio: compatibility issues with tool calling
- Workaround: Use non-streaming mode for tool calls (proven in Simon)
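
The non-streaming workaround comes down to sending `"stream": false` in the `/api/chat` request body. A minimal stdlib sketch, assuming a local Ollama server on its default port 11434; `build_chat_request` and `send_chat_request` are hypothetical helper names, and the model tag is whatever you have pulled locally.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # default local endpoint


def build_chat_request(model, messages, tools, url=OLLAMA_CHAT_URL):
    """Build a non-streaming /api/chat request that carries tool definitions."""
    payload = {
        "model": model,
        "messages": messages,
        "tools": tools,
        "stream": False,  # the workaround: ask for one complete response
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=120):
    """Send the request and return the decoded JSON body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With streaming off, the whole reply arrives as a single JSON object, so any `tool_calls` are read from one `message` rather than reassembled from chunks.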

HF `transformers` Alternative (not needed if using Ollama)

If you ever route through HF transformers (v5.5.4+) instead of Ollama, there's a cleaner parser than hand-rolled regex:

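# Assumes `processor`, `model`, `messages`, and `TOOLS` are already loaded/defined.
inputs = processor.apply_chat_template(
    messages, tools=TOOLS, enable_thinking=True,
    add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
)
out = model.generate(**inputs)
# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = out[0][inputs["input_ids"].shape[-1]:]
parsed = processor.parse_response(processor.decode(new_tokens))
# -> {"thinking": "...", "content": "...", "tool_calls": [...]}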

parse_response uses response_schema + x-regex fields baked into tokenizer_config.json (downloaded at tooling/huggingface/model-cards/). For Ollama users this is informational — Ollama's server-side tool parser already does the equivalent and returns structured tool_calls in the chat response.
