gemma4-research/CORPUS_tool_calling_format.md

# Gemma 4 Native Tool Calling Format

> Source: Google AI for Developers - Function Calling docs
> https://ai.google.dev/gemma/docs/capabilities/text/function-calling-gemma4
> Canonical source in corpus: `tooling/google-official/docs/ai-google-dev_function_calling_gemma4.html`
> Authoritative chat template: `tooling/huggingface/model-cards/gemma-4-{31B,E4B}-it-chat_template.jinja`

## Chat Template Context (what surrounds the tool tokens)

Gemma 4 changed the turn-token syntax from Gemma 3. You won't usually write these by
hand — Ollama, llama.cpp `--jinja`, and HF `apply_chat_template` all handle it — but
know what's on the wire when debugging:

| Purpose | Gemma 3 | Gemma 4 |
|---------|---------|---------|
| Turn start | `<start_of_turn>role\n` | `<\|turn>role\n` |
| Turn end | `<end_of_turn>\n` | `<turn\|>\n` |
| Thinking | (not standardized) | `<\|think>...<think\|>` |
| Thought channel | (n/a) | `<\|channel>thought...<channel\|>` |
| Image inline | `<start_of_image>` | `<\|image>...<image\|>` |
| Audio inline | (n/a) | `<\|audio>...<audio\|>` |
| String delimiter in native format | (n/a) | `<\|"\|>` |

**Asymmetric brackets are intentional.** Opening is `<|token>`, closing is `<token|>`.
If you see `<|turn>...</turn|>` in a code sample, that's wrong.
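
When debugging raw prompts, a small check for the symmetric or XML-style closers described above can save time. The following is a minimal sketch, not part of any official tooling: `render_turn` and `find_malformed_tokens` are hypothetical helper names, and the whitespace in `render_turn` simply follows the table above.

```python
import re

# Matches the usual mistakes: XML-style closers such as `</turn|>` or `</turn>`,
# and symmetric tokens such as `<|turn|>`. Well-formed Gemma 4 tokens
# (`<|turn>` to open, `<turn|>` to close) do not match.
BAD_TOKEN_RE = re.compile(r"</\|?[a-z_]+\|?>|<\|[a-z_]+\|>")


def render_turn(role: str, content: str) -> str:
    """Render one turn using the Gemma 4 turn tokens from the table above."""
    return f"<|turn>{role}\n{content}<turn|>\n"


def find_malformed_tokens(text: str) -> list[str]:
    """Return every token in `text` that uses the wrong bracket style."""
    return BAD_TOKEN_RE.findall(text)


if __name__ == "__main__":
    print(repr(render_turn("user", "hi")))
    print(find_malformed_tokens("<|turn>user\nhi</turn|>\n"))
```

The check is purely lexical, so it flags bad closers in any sample text regardless of which runtime produced it.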

## Tool Special Tokens (6 total)

| Token | Purpose |
|-------|---------|
| `<\|tool>` / `<tool\|>` | Tool definition block |
| `<\|tool_call>` / `<tool_call\|>` | Model's tool request |
| `<\|tool_response>` / `<tool_response\|>` | Tool execution result |

String delimiter: `<|"|>` (encloses all string values in native format)

## Native Format (raw model tokens)

### Tool definition in system prompt:
```
<|tool>declaration:
get_current_temperature{
  location:{type:<|"|>string<|"|>,description:<|"|>The city<|"|>},
  unit:{type:<|"|>string<|"|>,enum:[<|"|>celsius<|"|>,<|"|>fahrenheit<|"|>]}
}<tool|>
```

### Tool call from model:
```
<|tool_call>call:get_current_temperature{location:<|"|>London<|"|>}<tool_call|>
```

### Tool response:
```
<|tool_response>response:get_current_temperature{temperature:15,weather:<|"|>sunny<|"|>}<tool_response|>
```
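
If you ever have to read or write the native format by hand (for example, when a runtime's parser is broken), the examples above can be handled with a small amount of regex. This is a rough sketch under stated assumptions: it only covers flat arguments, the function names are hypothetical, and the bare `true`/`false`/`null` spellings are an assumption rather than something the examples confirm. Nested objects and arrays are not handled.

```python
import re

STR = '<|"|>'  # Gemma 4 native string delimiter

CALL_RE = re.compile(r"<\|tool_call>call:(\w+)\{(.*?)\}<tool_call\|>", re.DOTALL)
# Either key:<|"|>string<|"|> or key:bare_value (number, boolean, null).
ARG_RE = re.compile(r'(\w+):(?:<\|"\|>(.*?)<\|"\|>|([^,{}]+))', re.DOTALL)


def _coerce(raw: str):
    """Convert a bare (undelimited) value to a Python value."""
    if raw in ("true", "false"):  # assumed spelling of booleans
        return raw == "true"
    if raw == "null":  # assumed spelling of null
        return None
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def parse_tool_calls(text: str) -> list[dict]:
    """Extract flat tool calls from raw model output."""
    calls = []
    for name, body in CALL_RE.findall(text):
        args = {}
        for key, string_value, bare_value in ARG_RE.findall(body):
            # bare_value is empty whenever the delimited-string branch matched.
            args[key] = string_value if bare_value == "" else _coerce(bare_value.strip())
        calls.append({"name": name, "arguments": args})
    return calls


def format_tool_response(name: str, result: dict) -> str:
    """Serialize a flat result dict into a native tool response block."""
    parts = []
    for key, value in result.items():
        if isinstance(value, str):
            parts.append(f"{key}:{STR}{value}{STR}")
        elif isinstance(value, bool):
            parts.append(f"{key}:{'true' if value else 'false'}")
        else:
            parts.append(f"{key}:{value}")
    return "<|tool_response>response:" + name + "{" + ",".join(parts) + "}<tool_response|>"
```

In normal use the chat template and the runtime's parser do this for you; the sketch is mainly useful for inspecting what is on the wire.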

## JSON Chat Format (for Ollama / OpenAI-compatible APIs)

This is what you actually use in practice. Ollama translates to/from native tokens.

### Tool definition:
```json
{
  "type": "function",
  "function": {
    "name": "get_weather",
    "description": "Get current weather for a location",
    "parameters": {
      "type": "object",
      "properties": {
        "city": {"type": "string", "description": "The city name"}
      },
      "required": ["city"]
    }
  }
}
```

### Model returns:
```json
{
  "role": "assistant",
  "tool_calls": [{
    "function": {
      "name": "get_weather",
      "arguments": {"city": "London"}
    }
  }]
}
```

### Tool result message:
```json
{
  "role": "tool",
  "content": "{\"temperature\": 15, \"weather\": \"sunny\"}"
}
```
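
The three JSON shapes above fit together in a simple dispatch step: read `tool_calls` from the assistant message, run the matching local function, and append one `role: "tool"` message per call. A minimal sketch, assuming a hypothetical `TOOL_REGISTRY` and a stubbed `get_weather` that returns fixed data:

```python
import json


def get_weather(city: str) -> dict:
    # Hypothetical stand-in: a real tool would query a weather service.
    return {"temperature": 15, "weather": "sunny"}


# Maps tool names (as declared in the tool definition) to local callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(assistant_message: dict) -> list[dict]:
    """Execute each requested tool and build the tool result messages."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        function = call["function"]
        name, args = function["name"], function["arguments"]
        if isinstance(args, str):
            # OpenAI-compatible endpoints send arguments as a JSON string,
            # while Ollama's native API sends an object.
            args = json.loads(args)
        handler = TOOL_REGISTRY.get(name)
        if handler is None:
            payload = {"error": f"unknown tool: {name}"}
        else:
            payload = handler(**args)
        results.append({"role": "tool", "content": json.dumps(payload)})
    return results
```

Returning an error payload for unknown tools, rather than raising, lets the model see the failure and recover on its next turn.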

## Thinking Mode + Tool Calls

- When thinking is enabled, preserve thoughts between tool calls
- For long agent chains, summarize thoughts as plain text to save context
- Recommended: **disable thinking for tool-heavy workflows** (Seth's finding)
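
For the second bullet, one possible shape is a helper that swaps each raw thought block for a short plain-text note before the text goes back into the history. This is only a sketch: it assumes thoughts appear inline as `<|channel>thought...<channel|>` in the stored text (some runtimes return them in a separate field instead), and `compact_thoughts` plus the `summarize` callback are hypothetical names. How the summary is produced (truncation, a second model call) is left to the caller.

```python
import re
from typing import Callable

THOUGHT_RE = re.compile(r"<\|channel>thought(.*?)<channel\|>", re.DOTALL)


def compact_thoughts(text: str, summarize: Callable[[str], str]) -> str:
    """Replace each thought block with a plain-text summary line."""

    def _replace(match: re.Match) -> str:
        summary = summarize(match.group(1).strip())
        return f"[earlier reasoning: {summary}]\n"

    return THOUGHT_RE.sub(_replace, text)


if __name__ == "__main__":
    raw = (
        "<|channel>thought\nUser wants London weather. "
        "I should call the tool.<channel|>Checking now."
    )
    # Crude summarizer for illustration: keep only the first sentence.
    print(compact_thoughts(raw, lambda thought: thought.split(".")[0]))
```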

## Framework Flags

| Framework | Required Flag |
|-----------|--------------|
| llama.cpp | `--jinja` |
| vLLM | `--enable-auto-tool-choice` (set together with a `--tool-call-parser` choice) |
| Ollama | Works via `/api/chat` endpoint with `tools` field |
| transformers | `apply_chat_template(tools=[...])` |

## Known Issues

- Ollama v0.20.0-0.20.1: tool call parser broken, streaming drops tool calls
- llama.cpp: format mismatches and continuous loops reported
- LM Studio: compatibility issues with tool calling
- **Workaround:** Use non-streaming mode for tool calls (proven in Simon)
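
The non-streaming workaround amounts to setting `"stream": false` on the `/api/chat` request, so the whole response, including any `tool_calls`, arrives in one JSON body. A minimal sketch using only the standard library; the model tag `gemma4` is a placeholder, and `build_chat_payload` and `chat_once` are hypothetical helper names:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local address
MODEL = "gemma4"  # placeholder tag: substitute whatever `ollama list` shows


def build_chat_payload(messages: list[dict], tools: list[dict]) -> dict:
    """Build a non-streaming chat request body."""
    return {"model": MODEL, "messages": messages, "tools": tools, "stream": False}


def chat_once(messages: list[dict], tools: list[dict], timeout: float = 120.0) -> dict:
    """Send one non-streaming request and return the assistant message."""
    body = json.dumps(build_chat_payload(messages, tools)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # A non-streaming response is a single JSON object with a `message` key.
        return json.loads(response.read())["message"]
```

Calling `chat_once` requires a running Ollama server; `build_chat_payload` can be inspected on its own to confirm what is being sent.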

## HF `transformers` Alternative (not needed if using Ollama)

If you ever route through HF `transformers` (v5.5.4+) instead of Ollama, there's a
cleaner parser than hand-rolled regex:

```python
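# Assumes `processor`, `model`, `messages`, and `TOOLS` are already defined.
inputs = processor.apply_chat_template(
    messages, tools=TOOLS, enable_thinking=True,
    add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
)
out = model.generate(**inputs)
# Decode only the newly generated tokens; out[0] also contains the prompt.
new_tokens = out[0][inputs["input_ids"].shape[-1]:]
parsed = processor.parse_response(processor.decode(new_tokens))
# -> {"thinking": "...", "content": "...", "tool_calls": [...]}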
```

`parse_response` uses `response_schema` + `x-regex` fields baked into
`tokenizer_config.json` (downloaded at `tooling/huggingface/model-cards/`). For
Ollama users this is informational — Ollama's server-side tool parser already does
the equivalent and returns structured `tool_calls` in the chat response.