Gemma 4 Native Tool Calling Format
Source: Google AI for Developers, Function Calling docs: https://ai.google.dev/gemma/docs/capabilities/text/function-calling-gemma4
Canonical source in corpus: tooling/google-official/docs/ai-google-dev_function_calling_gemma4.html
Authoritative chat template: tooling/huggingface/model-cards/gemma-4-{31B,E4B}-it-chat_template.jinja
Chat Template Context (what surrounds the tool tokens)
Gemma 4 changed the turn-token syntax from Gemma 3. You won't usually write these by
hand — Ollama, llama.cpp --jinja, and HF apply_chat_template all handle it — but
know what's on the wire when debugging:
| Purpose | Gemma 3 | Gemma 4 |
|---|---|---|
| Turn start | <start_of_turn>role\n | <|turn>role\n |
| Turn end | <end_of_turn>\n | <turn|>\n |
| Thinking | (not standardized) | <|think>...<think|> |
| Thought channel | (n/a) | <|channel>thought...<channel|> |
| Image inline | <start_of_image> | <|image>...<image|> |
| Audio inline | (n/a) | <|audio>...<audio|> |
| String delimiter in native format | (n/a) | <|"|> |
Asymmetric brackets are intentional. Opening is <|token>, closing is <token|>.
If you see <|turn>...</turn|> in a code sample, that's wrong.
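A quick way to catch the mistake described above is to scan templates or logged prompts for closers written XML-style. The following is a minimal sketch, not part of any Gemma tooling: the function name `find_malformed_tokens` and the token list are illustrative assumptions, and the check only covers the turn, think, channel, image, and audio tokens from the table.

```python
import re

# Token names taken from the table above; extend if you need others.
TOKEN_NAMES = ("turn", "think", "channel", "image", "audio")
_NAMES = "|".join(TOKEN_NAMES)

# Correct forms are "<|name>" (open) and "<name|>" (close).
# Flag XML-style closers ("</turn|>", "</turn>") and symmetric "<|turn|>".
MALFORMED = re.compile(rf"</(?:{_NAMES})\|?>|<\|(?:{_NAMES})\|>")


def find_malformed_tokens(text: str) -> list[str]:
    """Return every malformed Gemma 4 bracket token found in text."""
    return MALFORMED.findall(text)


if __name__ == "__main__":
    print(find_malformed_tokens("<|turn>user\nhi</turn|>\n"))
```

An empty list means none of the known-bad spellings were found; it does not prove the prompt is otherwise valid.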
Tool Special Tokens (6 total)
| Token | Purpose |
|---|---|
| <|tool> / <tool|> | Tool definition block |
| <|tool_call> / <tool_call|> | Model's tool request |
| <|tool_response> / <tool_response|> | Tool execution result |
String delimiter: <|"|> (encloses all string values in native format)
Native Format (raw model tokens)
Tool definition in system prompt:
<|tool>declaration:
get_current_temperature{
location:{type:<|"|>string<|"|>,description:<|"|>The city<|"|>},
unit:{type:<|"|>string<|"|>,enum:[<|"|>celsius<|"|>,<|"|>fahrenheit<|"|>]}
}<tool|>
Tool call from model:
<|tool_call>call:get_current_temperature{location:<|"|>London<|"|>}<tool_call|>
Tool response:
<|tool_response>response:get_current_temperature{temperature:15,weather:<|"|>sunny<|"|>}<tool_response|>
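When debugging raw output it can help to convert between these native strings and plain Python values. The sketch below is a hand-rolled illustration based only on the examples above, not the official parser. It assumes flat arguments (no nested objects or arrays) and that non-string scalars are written JSON-style. The names `parse_tool_calls` and `render_tool_response` are hypothetical.

```python
import json
import re

STR = '<|"|>'  # native string delimiter

_CALL = re.compile(r"<\|tool_call>call:(\w+)\{(.*?)\}<tool_call\|>", re.DOTALL)
# A value is either a delimited string or a bare scalar up to the next comma.
_ARG = re.compile(r'(\w+):(?:<\|"\|>(.*?)<\|"\|>|([^,{}]+))', re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Extract flat tool calls from raw model output."""
    calls = []
    for name, body in _CALL.findall(text):
        args = {}
        for m in _ARG.finditer(body):
            key, quoted, bare = m.group(1), m.group(2), m.group(3)
            if quoted is not None:
                args[key] = quoted
            else:
                try:
                    args[key] = json.loads(bare)  # numbers, true/false, null
                except ValueError:
                    args[key] = bare.strip()
        calls.append({"name": name, "arguments": args})
    return calls


def render_tool_response(name: str, result: dict) -> str:
    """Render a flat result dict as a native tool response block."""
    def value(v):
        # Assumption: non-string scalars are serialized as JSON would write them.
        return f"{STR}{v}{STR}" if isinstance(v, str) else json.dumps(v)

    body = ",".join(f"{k}:{value(v)}" for k, v in result.items())
    return f"<|tool_response>response:{name}{{{body}}}<tool_response|>"
```

In normal use the chat template and the serving framework do this for you; a helper like this is only useful for inspecting or constructing wire-level strings in tests.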
JSON Chat Format (for Ollama / OpenAI-compatible APIs)
This is what you actually use in practice. Ollama translates to/from native tokens.
Tool definition:
{
  "type": "function",
  "function": {
    "name": "get_weather",
    "description": "Get current weather for a location",
    "parameters": {
      "type": "object",
      "properties": {
        "city": {"type": "string", "description": "The city name"}
      },
      "required": ["city"]
    }
  }
}
Model returns:
{
  "role": "assistant",
  "tool_calls": [{
    "function": {
      "name": "get_weather",
      "arguments": {"city": "London"}
    }
  }]
}
Tool result message:
{
  "role": "tool",
  "content": "{\"temperature\": 15, \"weather\": \"sunny\"}"
}
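The three JSON shapes above fit together in a simple dispatch step: read `tool_calls` from the assistant message, run the matching local function, and append one `tool` message per call. Here is a minimal sketch under stated assumptions: `get_weather` is a stub with hard-coded output, and `TOOL_REGISTRY` and `run_tool_calls` are hypothetical names, not part of Ollama or any client library.

```python
import json


def get_weather(city: str) -> dict:
    # Hypothetical local tool; a real one would query a weather service.
    return {"temperature": 15, "weather": "sunny"}


# Maps the tool name the model emits to the Python callable that implements it.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(assistant_message: dict) -> list[dict]:
    """Execute each requested tool and return the tool-result messages."""
    results = []
    for call in assistant_message.get("tool_calls", []):
        fn = call["function"]
        args = fn["arguments"]
        if isinstance(args, str):
            # OpenAI-compatible endpoints send arguments as a JSON string.
            args = json.loads(args)
        handler = TOOL_REGISTRY.get(fn["name"])
        if handler is None:
            payload = {"error": f"unknown tool: {fn['name']}"}
        else:
            payload = handler(**args)
        results.append({"role": "tool", "content": json.dumps(payload)})
    return results
```

The returned messages are appended to the conversation after the assistant message, and the whole list is sent back so the model can produce its final answer.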
Thinking Mode + Tool Calls
- When thinking is enabled, preserve thoughts between tool calls
- For long agent chains, summarize thoughts as plain text to save context
- Recommended: disable thinking for tool-heavy workflows (Seth's finding)
Framework Flags
| Framework | Required Flag |
|---|---|
| llama.cpp | --jinja |
| vLLM | --enable-auto-tool-choice |
| Ollama | Works via /api/chat endpoint with tools field |
| transformers | apply_chat_template(tools=[...]) |
Known Issues
- Ollama v0.20.0-0.20.1: tool call parser broken, streaming drops tool calls
- llama.cpp: format mismatches and continuous loops reported
- LM Studio: compatibility issues with tool calling
- Workaround: Use non-streaming mode for tool calls (proven in Simon)
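The non-streaming workaround amounts to setting `"stream": false` on the `/api/chat` request. The sketch below uses only the standard library and assumes Ollama is listening on its default local address; `build_chat_request` and `chat` are illustrative helper names, and the model tag is whatever you have pulled locally.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # Ollama's default local address


def build_chat_request(model: str, messages: list, tools: list) -> dict:
    """Build an /api/chat body with streaming disabled."""
    return {
        "model": model,
        "messages": messages,
        "tools": tools,
        "stream": False,  # the workaround: get one complete response object
    }


def chat(model: str, messages: list, tools: list, url: str = OLLAMA_CHAT_URL) -> dict:
    """POST the request and return the assistant message (needs a running server)."""
    body = json.dumps(build_chat_request(model, messages, tools)).encode("utf-8")
    req = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]
```

With streaming off, any `tool_calls` arrive in a single message object, which sidesteps the dropped-call behavior noted for streamed responses.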
HF transformers Alternative (not needed if using Ollama)
If you ever route through HF transformers (v5.5.4+) instead of Ollama, there's a
cleaner parser than hand-rolled regex:
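# Assumes processor, model, messages, and TOOLS are already defined.
inputs = processor.apply_chat_template(
    messages, tools=TOOLS, enable_thinking=True,
    add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
out = model.generate(**inputs)
# Decode only the newly generated tokens; out[0] also contains the prompt.
prompt_len = inputs["input_ids"].shape[-1]
parsed = processor.parse_response(processor.decode(out[0][prompt_len:]))
# -> {"thinking": "...", "content": "...", "tool_calls": [...]}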
parse_response uses response_schema + x-regex fields baked into
tokenizer_config.json (downloaded at tooling/huggingface/model-cards/). For
Ollama users this is informational — Ollama's server-side tool parser already does
the equivalent and returns structured tool_calls in the chat response.