# Gemma 4 Native Tool Calling Format

> Source: Google AI for Developers - Function Calling docs
>
> https://ai.google.dev/gemma/docs/capabilities/text/function-calling-gemma4

## Special Tokens (6 total)

| Token | Purpose |
|-------|---------|
| `<\|tool>` / `<tool\|>` | Tool definition block |
| `<\|tool_call>` / `<tool_call\|>` | Model's tool request |
| `<\|tool_response>` / `<tool_response\|>` | Tool execution result |

String delimiter: `<|"|>` (encloses all string values in native format)

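If you ever handle raw model output yourself (e.g. a plain completion endpoint with no server-side parser), the first step is pulling out the text between a marker pair. This is a minimal Python sketch that assumes the markers appear exactly as listed in the table, with no extra whitespace inside them. The non-greedy regex means an unterminated block, such as one cut off mid-stream, is simply not matched.

```python
import re

# Opening/closing marker pairs from the Special Tokens table.
TOKENS = {
    "tool": ("<|tool>", "<tool|>"),
    "tool_call": ("<|tool_call>", "<tool_call|>"),
    "tool_response": ("<|tool_response>", "<tool_response|>"),
}


def extract_blocks(text: str, kind: str) -> list[str]:
    """Return the payload of every complete `kind` block in `text`, in order."""
    open_tok, close_tok = TOKENS[kind]
    # re.escape is required: the markers contain the regex metacharacter '|'.
    pattern = re.escape(open_tok) + r"(.*?)" + re.escape(close_tok)
    return re.findall(pattern, text, flags=re.DOTALL)
```

Because `<|tool>` is followed by `>` while `<|tool_call>` continues with `_`, asking for `"tool"` blocks never picks up tool calls by accident.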
## Native Format (raw model tokens)

### Tool definition in system prompt:

```
<|tool>declaration:
get_current_temperature{
location:{type:<|"|>string<|"|>,description:<|"|>The city<|"|>},
unit:{type:<|"|>string<|"|>,enum:[<|"|>celsius<|"|>,<|"|>fahrenheit<|"|>]}
}<tool|>
```


### Tool call from model:

```
<|tool_call>call:get_current_temperature{location:<|"|>London<|"|>}<tool_call|>
```

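The call payload can be decoded back into a function name and a Python dict. The sketch below is a small recursive-descent parser whose grammar is inferred from the examples in these notes, not taken from a published spec: bare keys, `<|"|>`-delimited strings, bare numbers, objects and lists. Parsing `true`/`false` is an assumption, and a literal `<|"|>` inside a string value is not handled.

```python
DELIM = '<|"|>'  # string delimiter from the Special Tokens section


class NativeArgsParser:
    """Parser for the native argument notation (assumed grammar):

    value  := string | number | true | false | object | list
    string := <|"|> text <|"|>
    object := { key:value (, key:value)* }   (keys are bare identifiers)
    list   := [ value (, value)* ]
    """

    def __init__(self, text: str):
        self.s = text
        self.i = 0

    def peek(self, tok: str) -> bool:
        # Skip whitespace/newlines, then check whether `tok` comes next.
        while self.i < len(self.s) and self.s[self.i].isspace():
            self.i += 1
        return self.s.startswith(tok, self.i)

    def expect(self, tok: str) -> None:
        if not self.peek(tok):
            raise ValueError(f"expected {tok!r} at offset {self.i}")
        self.i += len(tok)

    def value(self):
        if self.peek(DELIM):
            start = self.i + len(DELIM)
            end = self.s.index(DELIM, start)  # ValueError if unterminated
            self.i = end + len(DELIM)
            return self.s[start:end]
        if self.peek("{"):
            return self.obj()
        if self.peek("["):
            self.i += 1
            items = []
            while not self.peek("]"):
                items.append(self.value())
                if self.peek(","):
                    self.i += 1
            self.i += 1
            return items
        # Bare scalar: read up to the next structural character.
        j = self.i
        while j < len(self.s) and self.s[j] not in ",}]":
            j += 1
        raw = self.s[self.i:j].strip()
        self.i = j
        if raw in ("true", "false"):
            return raw == "true"
        try:
            return int(raw)
        except ValueError:
            return float(raw)  # raises ValueError on anything else

    def obj(self) -> dict:
        self.expect("{")
        out = {}
        while not self.peek("}"):
            colon = self.s.index(":", self.i)
            key = self.s[self.i:colon].strip()
            self.i = colon + 1
            out[key] = self.value()
            if self.peek(","):
                self.i += 1
        self.expect("}")
        return out


def parse_tool_call(payload: str) -> tuple:
    """Split 'call:NAME{...}' into (NAME, arguments dict)."""
    p = NativeArgsParser(payload)
    p.expect("call:")
    brace = p.s.index("{", p.i)
    name = p.s[p.i:brace].strip()
    p.i = brace
    return name, p.obj()
```

Ollama, vLLM and llama.cpp (with `--jinja`) do this parsing server-side, so a hand-written parser like this is mainly a debugging aid for raw prompts.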

### Tool response:

```
<|tool_response>response:get_current_temperature{temperature:15,weather:<|"|>sunny<|"|>}<tool_response|>
```

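Going the other way, serializing Python values into this notation, is what a chat template does when it injects declarations and tool results. This is a minimal sketch whose rules are inferred only from the examples above (bare keys, `<|"|>`-wrapped strings, bare numbers); writing booleans as `true`/`false` is an assumption. It emits everything on one line, while the declaration example breaks lines, and an official template may add fields this sketch omits. Treat it as a reading and debugging aid, not a substitute for the template.

```python
DELIM = '<|"|>'  # string delimiter from the Special Tokens section


def encode(value) -> str:
    """Serialize a Python value in the notation the examples use (assumed rules)."""
    if isinstance(value, bool):  # check before int: bool is a subclass of int
        return "true" if value else "false"  # assumption: not shown in the examples
    if isinstance(value, str):
        return f"{DELIM}{value}{DELIM}"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, list):
        return "[" + ",".join(encode(v) for v in value) + "]"
    if isinstance(value, dict):
        # Keys stay bare; only values are encoded.
        return "{" + ",".join(f"{k}:{encode(v)}" for k, v in value.items()) + "}"
    raise TypeError(f"unsupported type: {type(value).__name__}")


def render_declaration(name: str, params: dict) -> str:
    """Single-line <|tool> block in the shape of the declaration example."""
    return f"<|tool>declaration:{name}{encode(params)}<tool|>"


def render_tool_response(name: str, result: dict) -> str:
    """Wrap a tool's result; `name` should be the function the model called."""
    return f"<|tool_response>response:{name}{encode(result)}<tool_response|>"
```

Labeling the response with the same function name the model called keeps the call and its result unambiguous when several tools are in play.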

## JSON Chat Format (for Ollama / OpenAI-compatible APIs)

This is the format you actually use in practice: the server (e.g. Ollama) translates these JSON messages to and from the native tokens.

### Tool definition:

```json
{
  "type": "function",
  "function": {
    "name": "get_weather",
    "description": "Get current weather for a location",
    "parameters": {
      "type": "object",
      "properties": {
        "city": {"type": "string", "description": "The city name"}
      },
      "required": ["city"]
    }
  }
}
```

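A minimal sketch of sending this definition to a local Ollama server using only the Python standard library. It assumes Ollama is listening on its default address `http://localhost:11434`. `stream` is set to `False` because of the streaming problems listed under Known Issues.

```python
import json
import urllib.request

# Assumption: Ollama running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/chat"

# The tool definition above, as a Python dict.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "The city name"}
            },
            "required": ["city"],
        },
    },
}


def build_chat_request(model: str, messages: list, tools: list) -> dict:
    """Assemble an /api/chat body with streaming turned off."""
    return {"model": model, "messages": messages, "tools": tools, "stream": False}


def post_chat(body: dict, url: str = OLLAMA_URL) -> dict:
    """POST `body` as JSON and return the decoded response (needs a running server)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For example, `post_chat(build_chat_request("gemma4", [{"role": "user", "content": "Weather in London?"}], [WEATHER_TOOL]))["message"]` returns the assistant message. The tag `gemma4` is a placeholder for whichever model tag you actually pulled.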
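### Model returns:

```json
{
  "role": "assistant",
  "tool_calls": [{
    "function": {
      "name": "get_weather",
      "arguments": {"city": "London"}
    }
  }]
}
```

This is the shape of Ollama's native `/api/chat` response, where `arguments` is a JSON object. OpenAI-compatible endpoints (`/v1/chat/completions`) instead return `arguments` as a JSON-encoded string and give each call an `id` and `"type": "function"`.
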
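Handling the reply means looking up each requested function and running it. The sketch below uses a hypothetical `get_weather` implementation and a name-to-function registry. It accepts `arguments` either as an object (Ollama) or as a JSON string (OpenAI-compatible) and returns one `role: tool` message per call. Turning an unknown tool name into an error payload for the model, instead of raising, is a design choice of this sketch rather than something these notes prescribe.

```python
import json


def get_weather(city: str) -> dict:
    """Hypothetical tool implementation; a real one would call a weather API."""
    return {"temperature": 15, "weather": "sunny"}


# Maps the tool names the model may request to local Python callables.
TOOLS = {"get_weather": get_weather}


def run_tool_calls(message: dict, registry: dict) -> list[dict]:
    """Execute each tool call in an assistant message; return `role: tool` messages."""
    results = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        args = fn.get("arguments") or {}
        if isinstance(args, str):  # OpenAI-compatible APIs send a JSON string
            args = json.loads(args)
        handler = registry.get(fn["name"])
        if handler is None:
            output = {"error": f"unknown tool {fn['name']}"}
        else:
            output = handler(**args)
        tool_msg = {"role": "tool", "content": json.dumps(output)}
        if "id" in call:  # OpenAI-compatible APIs match results to calls by id
            tool_msg["tool_call_id"] = call["id"]
        results.append(tool_msg)
    return results
```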
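### Tool result message:

```json
{
  "role": "tool",
  "content": "{\"temperature\": 15, \"weather\": \"sunny\"}"
}
```

`content` is a string, so structured results are JSON-encoded first. OpenAI-compatible endpoints also expect a `tool_call_id` matching the `id` of the call being answered.
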
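Order matters when results go back. The assistant message containing `tool_calls` is appended first, then one tool message per call, and then the model is called again. Below is a minimal loop in which the model call (`chat`) and the tool executor (`run_tools`) are hypothetical callables passed in, so the loop can be exercised without a server. `max_rounds` is an arbitrary guard against endless tool-call loops.

```python
from typing import Callable


def chat_with_tools(
    chat: Callable[[list], dict],
    run_tools: Callable[[dict], list],
    messages: list,
    max_rounds: int = 5,
) -> str:
    """Alternate model turns and tool executions until the model answers in text.

    chat(messages)   -> assistant message dict (e.g. response["message"])
    run_tools(reply) -> list of {"role": "tool", ...} messages, empty if no calls
    """
    for _ in range(max_rounds):
        reply = chat(messages)
        messages.append(reply)  # the assistant turn, tool_calls included
        tool_messages = run_tools(reply)
        if not tool_messages:
            return reply.get("content", "")
        messages.extend(tool_messages)  # results follow the turn that requested them
    raise RuntimeError("model kept requesting tools; max_rounds reached")
```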
## Thinking Mode + Tool Calls

- When thinking is enabled, preserve the model's thoughts in the history between tool calls.
- For long agent chains, replace older thoughts with a short plain-text summary to save context.
- Recommended: **disable thinking for tool-heavy workflows** (Seth's finding).

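With Ollama, thinking is toggled per request through the `think` field of the `/api/chat` body, and thoughts come back in the assistant message's `thinking` field. For long chains, here is a sketch of trimming older thoughts, assuming the history stores them under that `thinking` key. It truncates rather than summarizes, a crude stand-in for a real summary, which would need another model call; `keep_last` and `max_chars` are arbitrary illustrative defaults.

```python
def compact_thoughts(messages: list, keep_last: int = 1, max_chars: int = 200) -> list:
    """Copy `messages`, truncating `thinking` on all but the newest assistant turns."""
    assistant = [i for i, m in enumerate(messages) if m.get("role") == "assistant"]
    # Indices of assistant turns old enough to compact (keep_last=0 compacts all).
    old = set(assistant[:-keep_last] if keep_last else assistant)
    compacted = []
    for i, msg in enumerate(messages):
        msg = dict(msg)  # shallow copy: the caller's history is left untouched
        if i in old and msg.get("thinking"):
            msg["thinking"] = msg["thinking"][:max_chars]
        compacted.append(msg)
    return compacted
```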
## Framework Flags

| Framework | Required flag / usage |
|-----------|-----------------------|
| llama.cpp | `--jinja` |
| vLLM | `--enable-auto-tool-choice` (must be combined with `--tool-call-parser`) |
| Ollama | No flag needed; pass a `tools` field to the `/api/chat` endpoint |
| transformers | `apply_chat_template(tools=[...])` |

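If you launch the servers from a Python harness, keeping the command lines in one place makes the required flags explicit. This sketch assumes the `llama-server` binary from llama.cpp and the `vllm serve` CLI are on `PATH`. The vLLM tool-parser name is deliberately left to the caller because these notes do not name one for Gemma 4.

```python
def llama_cpp_server_cmd(model_path: str) -> list[str]:
    """llama.cpp server; --jinja makes it use the Jinja chat template tool calls rely on."""
    return ["llama-server", "-m", model_path, "--jinja"]


def vllm_serve_cmd(model: str, tool_parser: str) -> list[str]:
    """vLLM server; auto tool choice is only valid together with --tool-call-parser."""
    # tool_parser: whatever parser your vLLM version documents for this model.
    return [
        "vllm", "serve", model,
        "--enable-auto-tool-choice",
        "--tool-call-parser", tool_parser,
    ]
```

For example, `subprocess.Popen(llama_cpp_server_cmd("model.gguf"))` starts the server, where `model.gguf` is a placeholder path.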
## Known Issues

- Ollama v0.20.0 to v0.20.1: the tool-call parser is broken, and streaming drops tool calls.
- llama.cpp: format mismatches and continuous loops have been reported.
- LM Studio: compatibility issues with tool calling have been reported.
- **Workaround:** use non-streaming mode for tool calls (proven in Simon).