gemma4-research/CORPUS_tool_calling_format.md
Mortdecai 5775978899 docs: merge tooling findings into SYNTHESIS/GOTCHAS/CORPUS_* and add handoff
Patches the top-level corpus docs with the 13 findings flagged during the
2026-04-18 canonical tooling research pass. tooling/README.md now marks each
finding [merged: <file>] or [flagged] for provenance.

- CORPUS_ollama_variants.md: annotate gemma4:26b as MoE (25.2B total / 3.8B
  active, 8-of-128 experts + 1 shared). Note Q4_K_M inference is standard
  (the "MoE quality degrades at 4-bit" caveat is training-only). Add note
  that audio on E-series is NOT available via Ollama — llama.cpp mmproj
  or vLLM only.
- CORPUS_capabilities.md: native system role, configurable thinking mode,
  first trained tool use (vs Gemma 1/2/3 proof-of-concept), native object
  detection with bbox output in 1000x1000 coords, pointer to EmbeddingGemma
  for retrieval (Gemma 4 has no embedding mode).
- CORPUS_tool_calling_format.md: add Chat Template Context section
  documenting the <|turn>/<turn|> asymmetric brackets (new in Gemma 4,
  replaced <start_of_turn>/<end_of_turn>) plus <|think>, <|channel>,
  <|image>, <|audio> tokens. Add HF transformers Alternative section
  showing processor.parse_response with response_schema.
- GOTCHAS.md: add MEDIUM gotcha for abandoned google/gemma_pytorch (no
  Gemma 4 support since 2025-05-30). Expand fine-tuning section with FA2/FA4
  head_dim=512 break, fused LoRA kernel issues, 26B A4B training-quant
  guidance, new tool-call tokens as learned embeddings.
- SYNTHESIS.md: add banner pointing to tooling/ for canonical upstream
  material. Add embeddinggemma row to Model Selection table.

Also:
- Add .gitignore excluding .backup/ (local scratch per global CLAUDE.md
  convention, not needed in tracked history) and __pycache__/.
- Add .claude/handoffs/2026-04-18-canonical-tooling-research.md so future
  sessions can pick up cold — facts verified, open threads, what changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:48:26 -04:00


Gemma 4 Native Tool Calling Format

Source: Google AI for Developers, Function Calling docs: https://ai.google.dev/gemma/docs/capabilities/text/function-calling-gemma4
Canonical source in corpus: tooling/google-official/docs/ai-google-dev_function_calling_gemma4.html
Authoritative chat template: tooling/huggingface/model-cards/gemma-4-{31B,E4B}-it-chat_template.jinja

Chat Template Context (what surrounds the tool tokens)

Gemma 4 changed the turn-token syntax from Gemma 3. You won't usually write these by hand — Ollama, llama.cpp --jinja, and HF apply_chat_template all handle it — but know what's on the wire when debugging:

Purpose                             Gemma 3                 Gemma 4
Turn start                          <start_of_turn>role\n   <|turn>role\n
Turn end                            <end_of_turn>\n         <turn|>\n
Thinking                            (not standardized)      <|think>...<think|>
Thought channel                     (n/a)                   <|channel>thought...<channel|>
Image inline                        <start_of_image>        <|image>...<image|>
Audio inline                        (n/a)                   <|audio>...<audio|>
String delimiter in native format   (n/a)                   <|"|>

Asymmetric brackets are intentional. Opening is <|token>, closing is <token|>. If you see <|turn>...</turn|> in a code sample, that's wrong.
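When debugging a captured prompt, a quick pairing check can catch the mistake described above. The following is a minimal sketch, not part of any Gemma tooling: the function name check_brackets is hypothetical, and it assumes every token in the table is used as an open/close pair. A prompt that ends with an open model turn (the generation prompt) will be reported as an unclosed opener, which is expected.

```python
import re

# Opening tokens look like <|name>, closing tokens like <name|>.
OPEN_RE = re.compile(r"<\|([a-z_]+)>")
CLOSE_RE = re.compile(r"<([a-z_]+)\|>")
# XML-style closers such as </turn|> or </turn> are a common mistake.
BAD_CLOSE_RE = re.compile(r"</\|?[a-z_]+\|?>")


def check_brackets(text: str) -> list[str]:
    """Return a list of human-readable problems; empty means the pairs balance."""
    problems = [f"XML-style closer: {m.group(0)}" for m in BAD_CLOSE_RE.finditer(text)]

    # Collect openers and closers in document order.
    tokens = [(m.start(), "open", m.group(1)) for m in OPEN_RE.finditer(text)]
    tokens += [(m.start(), "close", m.group(1)) for m in CLOSE_RE.finditer(text)]

    stack = []
    for _, kind, name in sorted(tokens):
        if kind == "open":
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            problems.append(f"unmatched closer: <{name}|>")

    problems += [f"unclosed opener: <|{name}>" for name in stack]
    return problems
```

The string delimiter <|"|> is deliberately ignored by both patterns, since it is not a paired bracket.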

Tool Special Tokens (6 total)

Token                                   Purpose
<|tool> / <tool|>                       Tool definition block
<|tool_call> / <tool_call|>             Model's tool request
<|tool_response> / <tool_response|>     Tool execution result

String delimiter: <|"|> (encloses all string values in native format)

Native Format (raw model tokens)

Tool definition in system prompt:

<|tool>declaration:
get_current_temperature{
  location:{type:<|"|>string<|"|>,description:<|"|>The city<|"|>},
  unit:{type:<|"|>string<|"|>,enum:[<|"|>celsius<|"|>,<|"|>fahrenheit<|"|>]}
}<tool|>

Tool call from model:

<|tool_call>call:get_current_temperature{location:<|"|>London<|"|>}<tool_call|>

Tool response:

<|tool_response>response:get_current_temperature{temperature:15,weather:<|"|>sunny<|"|>}<tool_response|>
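For inspecting raw model output during debugging, the call syntax above can be pulled apart with a small regex. This is a rough sketch only: parse_tool_calls is a hypothetical helper, it handles flat arguments (delimited strings and bare numbers) and does not attempt nested objects or arrays, and in normal use the serving framework does this parsing for you.

```python
import re

# Matches <|tool_call>call:NAME{...}<tool_call|> and captures NAME and the body.
CALL_RE = re.compile(r"<\|tool_call>call:(\w+)\{(.*?)\}<tool_call\|>", re.DOTALL)
# Matches key:<|"|>string<|"|> or key:bare_value inside the body.
ARG_RE = re.compile(r'(\w+):(?:<\|"\|>(.*?)<\|"\|>|([^,{}]+))', re.DOTALL)


def _coerce(raw: str):
    """Turn a bare (undelimited) value into int or float when possible."""
    raw = raw.strip()
    try:
        return int(raw)
    except ValueError:
        pass
    try:
        return float(raw)
    except ValueError:
        return raw


def parse_tool_calls(text: str) -> list[dict]:
    """Extract flat tool calls from raw native-format output."""
    calls = []
    for call in CALL_RE.finditer(text):
        name, body = call.group(1), call.group(2)
        args = {}
        for arg in ARG_RE.finditer(body):
            key, string_value, bare_value = arg.groups()
            # A delimited string is kept verbatim; anything else is coerced.
            args[key] = string_value if string_value is not None else _coerce(bare_value)
        calls.append({"name": name, "arguments": args})
    return calls
```

Applied to the tool call shown above, this yields a single entry with name get_current_temperature and arguments {"location": "London"}.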

JSON Chat Format (for Ollama / OpenAI-compatible APIs)

This is what you actually use in practice. Ollama translates to/from native tokens.

Tool definition:

{
  "type": "function",
  "function": {
    "name": "get_weather",
    "description": "Get current weather for a location",
    "parameters": {
      "type": "object",
      "properties": {
        "city": {"type": "string", "description": "The city name"}
      },
      "required": ["city"]
    }
  }
}

Model returns:

{
  "role": "assistant",
  "tool_calls": [{
    "function": {
      "name": "get_weather",
      "arguments": {"city": "London"}
    }
  }]
}

Tool result message:

{
  "role": "tool",
  "content": "{\"temperature\": 15, \"weather\": \"sunny\"}"
}
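The round trip from an assistant message to tool result messages can be sketched as below. The registry, the stub get_weather implementation, and the run_tool_calls name are all hypothetical and chosen for illustration; the message shapes follow the JSON shown above. The string check is there because OpenAI-compatible endpoints deliver arguments as a JSON-encoded string rather than an object.

```python
import json


def get_weather(city: str) -> dict:
    # Hypothetical stand-in for a real weather lookup.
    return {"temperature": 15, "weather": "sunny"}


# Maps tool names (as declared in the tool definition) to Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(assistant_message: dict, registry: dict = TOOL_REGISTRY) -> list[dict]:
    """Execute each requested tool and return one role="tool" message per call."""
    results = []
    for call in assistant_message.get("tool_calls", []):
        function = call["function"]
        name, args = function["name"], function["arguments"]
        if isinstance(args, str):
            # OpenAI-style responses encode arguments as a JSON string.
            args = json.loads(args)
        handler = registry.get(name)
        if handler is None:
            payload = {"error": f"unknown tool: {name}"}
        else:
            payload = handler(**args)
        results.append({"role": "tool", "content": json.dumps(payload)})
    return results
```

The returned messages are appended to the conversation after the assistant message, and the whole history is sent back for the model's next turn.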

Thinking Mode + Tool Calls

  • When thinking is enabled, preserve thoughts between tool calls
  • For long agent chains, summarize thoughts as plain text to save context
  • Recommended: disable thinking for tool-heavy workflows (Seth's finding)
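One way to apply the second point is to keep the most recent thoughts verbatim and shrink the older ones. The sketch below assumes thoughts are stored on assistant messages under a "thinking" key; compact_thoughts is a hypothetical helper, and the default summarizer is a plain truncation placeholder that a real setup would likely replace with a proper summary step.

```python
def compact_thoughts(messages, keep_last=1, summarize=None):
    """Return a copy of `messages` with older assistant thoughts summarized.

    keep_last: how many of the most recent thought-bearing assistant
               messages keep their thoughts untouched.
    summarize: callable str -> str; defaults to truncating to 200 characters
               (a placeholder, not a real summary).
    """
    if summarize is None:
        summarize = lambda thought: thought[:200]

    # Indices of assistant messages that actually carry thoughts.
    with_thoughts = [
        i for i, m in enumerate(messages)
        if m.get("role") == "assistant" and m.get("thinking")
    ]
    if keep_last > 0:
        to_compact = set(with_thoughts[:-keep_last])
    else:
        to_compact = set(with_thoughts)

    compacted = []
    for i, message in enumerate(messages):
        message = dict(message)  # shallow copy so the caller's history is untouched
        if i in to_compact:
            message["thinking"] = summarize(message["thinking"])
        compacted.append(message)
    return compacted
```

Calling it before each request keeps the latest reasoning intact for the model while bounding how much context older thoughts consume.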

Framework Flags

Framework      Required Flag
llama.cpp      --jinja
vLLM           --enable-auto-tool-choice
Ollama         Works via /api/chat endpoint with tools field
transformers   apply_chat_template(tools=[...])

Known Issues

  • Ollama v0.20.0-0.20.1: tool call parser broken, streaming drops tool calls
  • llama.cpp: format mismatches and continuous loops reported
  • LM Studio: compatibility issues with tool calling
  • Workaround: Use non-streaming mode for tool calls (proven in Simon)
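The non-streaming workaround amounts to setting "stream" to false in the /api/chat request. A minimal sketch using only the standard library follows; the URL is the usual local Ollama default and is an assumption here, and build_chat_payload and chat_once are hypothetical helper names.

```python
import json
import urllib.request

# Assumed default local Ollama endpoint; adjust for your deployment.
OLLAMA_URL = "http://localhost:11434/api/chat"


def build_chat_payload(model: str, messages: list, tools: list) -> dict:
    """Build an /api/chat request body with streaming disabled."""
    return {
        "model": model,
        "messages": messages,
        "tools": tools,
        "stream": False,  # the workaround: one complete JSON response
    }


def chat_once(model: str, messages: list, tools: list,
              url: str = OLLAMA_URL, timeout: float = 120.0) -> dict:
    """Send one non-streaming chat request and return the assistant message."""
    body = json.dumps(build_chat_payload(model, messages, tools)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["message"]
```

With streaming off, any tool_calls arrive in the single returned message, so there are no partial chunks for a broken parser to drop.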

HF transformers Alternative (not needed if using Ollama)

If you ever route through HF transformers (v5.5.4+) instead of Ollama, there's a cleaner parser than hand-rolled regex:

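# Assumes `processor`, `model`, `messages`, and TOOLS are already defined.
inputs = processor.apply_chat_template(
    messages, tools=TOOLS, enable_thinking=True,
    add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
)
out = model.generate(**inputs)
# Decode only the newly generated tokens so the prompt is not re-parsed.
new_tokens = out[0][inputs["input_ids"].shape[-1]:]
parsed = processor.parse_response(processor.decode(new_tokens))
# -> {"thinking": "...", "content": "...", "tool_calls": [...]}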

parse_response uses response_schema + x-regex fields baked into tokenizer_config.json (downloaded at tooling/huggingface/model-cards/). For Ollama users this is informational — Ollama's server-side tool parser already does the equivalent and returns structured tool_calls in the chat response.