docs: initial Gemma 4 research corpus and synthesis

Architecture specs, benchmarks, gotchas, Ollama settings, tool calling format, and implementation patterns from Simon and AI_Visualizer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 18:14:19 -04:00
commit 5011059f5d
9 changed files with 861 additions and 0 deletions
@@ -0,0 +1,100 @@
+# Gemma 4 Native Tool Calling Format
+
+> Source: Google AI for Developers - Function Calling docs
+> https://ai.google.dev/gemma/docs/capabilities/text/function-calling-gemma4
+
+## Special Tokens (6 total, in 3 open/close pairs)
+
+| Token | Purpose |
+|-------|---------|
+| `<\|tool>` / `<tool\|>` | Tool definition block |
+| `<\|tool_call>` / `<tool_call\|>` | Model's tool request |
+| `<\|tool_response>` / `<tool_response\|>` | Tool execution result |
+
+String delimiter: `<|"|>` (encloses all string values in native format)
+
+## Native Format (raw model tokens)
+
+### Tool definition in system prompt:
+```
+<|tool>declaration:
+get_current_temperature{
+  location:{type:<|"|>string<|"|>,description:<|"|>The city<|"|>},
+  unit:{type:<|"|>string<|"|>,enum:[<|"|>celsius<|"|>,<|"|>fahrenheit<|"|>]}
+}<tool|>
+```
+
+### Tool call from model:
+```
+<|tool_call>call:get_current_temperature{location:<|"|>London<|"|>}<tool_call|>
+```
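Sometimes a backend hands you raw text instead of parsed `tool_calls`, for example while debugging the llama.cpp format mismatches listed under Known Issues. In that case the call can be pulled out with a small parser. This is a minimal sketch that assumes the flat shape shown above: one level of `key:value` pairs whose values are `<|"|>`-delimited strings, numbers, or `true`/`false`. Nested objects and lists are not handled. `parse_tool_calls` and `parse_args` are hypothetical helper names, not part of any Gemma or Ollama API.

```python
import re

STR = '<|"|>'  # native string delimiter
# One tool call: <|tool_call>call:NAME{BODY}<tool_call|>
CALL_RE = re.compile(r"<\|tool_call>call:(\w+)\{(.*?)\}<tool_call\|>", re.DOTALL)


def _scalar(raw: str):
    """Convert an undelimited value (number or boolean) to a Python value."""
    if raw in ("true", "false"):
        return raw == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw  # leave anything unrecognized as a plain string


def parse_args(body: str) -> dict:
    """Parse a flat `key:value,key:value` body (no nested objects or lists)."""
    args, i = {}, 0
    while i < len(body):
        colon = body.index(":", i)
        key, i = body[i:colon], colon + 1
        if body.startswith(STR, i):
            # Delimited string: may contain commas, so scan to the closing delimiter.
            end = body.index(STR, i + len(STR))
            args[key] = body[i + len(STR):end]
            i = end + len(STR)
        else:
            end = body.find(",", i)
            end = len(body) if end == -1 else end
            args[key] = _scalar(body[i:end])
            i = end
        if body.startswith(",", i):
            i += 1  # skip the separator before the next key
    return args


def parse_tool_calls(text: str) -> list:
    """Return (function_name, arguments) for every tool call in model output."""
    return [(m.group(1), parse_args(m.group(2))) for m in CALL_RE.finditer(text)]


raw = '<|tool_call>call:get_current_temperature{location:<|"|>London<|"|>}<tool_call|>'
print(parse_tool_calls(raw))  # [('get_current_temperature', {'location': 'London'})]
```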
+
+### Tool response:
+```
+<|tool_response>response:get_current_temperature{temperature:15,weather:<|"|>sunny<|"|>}<tool_response|>
+```
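All three block types share one encoding. Strings are wrapped in `<|"|>`, numbers are written bare, and objects and lists are written as `{key:value}` and `[a,b]` with unquoted keys. The sketch below uses hypothetical `encode` and `render` helpers to reproduce the examples above in compact, whitespace-free form. Booleans are rendered as `true`/`false`, which is an assumption because none of the examples above contain one. The sketch only mirrors the shapes shown here. The official chat template may emit extra fields, such as a description or a parameters wrapper in declarations. In practice, let the template or server produce these blocks, and use this only to inspect or debug prompts.

```python
STR = '<|"|>'  # native string delimiter

# Block token -> keyword written before the function name.
KEYWORDS = {"tool": "declaration", "tool_call": "call", "tool_response": "response"}


def encode(value) -> str:
    """Serialize a Python value in the native (unquoted-key) format."""
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return "true" if value else "false"  # assumed boolean spelling
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, str):
        return f"{STR}{value}{STR}"
    if isinstance(value, list):
        return "[" + ",".join(encode(v) for v in value) + "]"
    if isinstance(value, dict):
        return "{" + ",".join(f"{k}:{encode(v)}" for k, v in value.items()) + "}"
    raise TypeError(f"cannot encode {type(value).__name__}")


def render(block: str, name: str, payload: dict) -> str:
    """Wrap an encoded payload as <|block>keyword:name{...}<block|>."""
    return f"<|{block}>{KEYWORDS[block]}:{name}{encode(payload)}<{block}|>"


print(render("tool_response", "get_current_temperature",
             {"temperature": 15, "weather": "sunny"}))
# <|tool_response>response:get_current_temperature{temperature:15,weather:<|"|>sunny<|"|>}<tool_response|>
```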
+
+## JSON Chat Format (for Ollama / OpenAI-compatible APIs)
+
+This is what you actually use in practice. Ollama translates to/from native tokens.
+
+### Tool definition:
+```json
+{
+  "type": "function",
+  "function": {
+    "name": "get_weather",
+    "description": "Get current weather for a location",
+    "parameters": {
+      "type": "object",
+      "properties": {
+        "city": {"type": "string", "description": "The city name"}
+      },
+      "required": ["city"]
+    }
+  }
+}
+```
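Writing these schemas by hand gets tedious once there are several tools. The sketch below derives the definition above from a Python function's signature using the standard `inspect` module, assuming plain `str`/`int`/`float`/`bool` parameters. Parameters without defaults become `required`. `tool_from_function` and its `param_docs` argument are hypothetical names. Per-parameter descriptions are passed in explicitly rather than parsed from the docstring.

```python
import inspect

# Assumed mapping from Python annotations to JSON Schema types.
JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def tool_from_function(fn, param_docs: dict) -> dict:
    """Build an Ollama/OpenAI-style tool definition from fn's signature."""
    properties, required = {}, []
    for pname, param in inspect.signature(fn).parameters.items():
        prop = {"type": JSON_TYPES[param.annotation]}
        if pname in param_docs:
            prop["description"] = param_docs[pname]
        properties[pname] = prop
        if param.default is inspect.Parameter.empty:
            required.append(pname)  # no default -> the model must supply it
    return {
        "type": "function",
        "function": {
            "name": fn.__name__,
            "description": inspect.getdoc(fn) or "",
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


def get_weather(city: str) -> dict:
    """Get current weather for a location"""
    return {"temperature": 15, "weather": "sunny"}  # hypothetical stub


tool = tool_from_function(get_weather, {"city": "The city name"})
```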
+
+### Model returns:
+```json
+{
+  "role": "assistant",
+  "tool_calls": [{
+    "function": {
+      "name": "get_weather",
+      "arguments": {"city": "London"}
+    }
+  }]
+}
+```
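The shape above is what Ollama's `/api/chat` returns, with `arguments` as a JSON object. OpenAI-compatible endpoints, including Ollama's `/v1/chat/completions`, send `arguments` as a JSON-encoded string instead and add an `id` and `"type": "function"` to each call. A small normalizer lets the rest of the code ignore that difference. The sketch below uses a hypothetical `extract_calls` helper.

```python
import json


def extract_calls(message: dict) -> list:
    """Return (name, arguments_dict) pairs from an assistant message.

    Accepts both Ollama /api/chat (arguments as an object) and
    OpenAI-compatible responses (arguments as a JSON string).
    """
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        args = fn.get("arguments") or {}
        if isinstance(args, str):
            args = json.loads(args)  # OpenAI-style: decode the JSON string
        calls.append((fn["name"], args))
    return calls
```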
+
+### Tool result message:
+```json
+{
+  "role": "tool",
+  "content": "{\"temperature\": 15, \"weather\": \"sunny\"}"
+}
+```
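A tool loop ties these messages together. It appends the assistant message, runs each requested function, and appends one `role: "tool"` message per result. It then calls the model again until the model answers in plain text. The sketch assumes Ollama's non-streaming `/api/chat` on the default `localhost:11434` port, matching the workaround under Known Issues. `run_agent`, `ollama_chat`, `tools_impl`, and `max_turns` are hypothetical names. `chat` is injected so the loop can be tested without a server.

```python
import json
import urllib.request
from typing import Callable


def ollama_chat(model: str, tools: list,
                url: str = "http://localhost:11434/api/chat") -> Callable[[list], dict]:
    """Return a chat(messages) function backed by a non-streaming /api/chat call."""
    def chat(messages: list) -> dict:
        body = json.dumps({"model": model, "messages": messages,
                           "tools": tools, "stream": False}).encode()
        req = urllib.request.Request(url, data=body,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["message"]  # the assistant message dict
    return chat


def run_agent(chat: Callable[[list], dict], messages: list,
              tools_impl: dict, max_turns: int = 5) -> str:
    """Call the model, execute requested tools, repeat until a plain answer."""
    for _ in range(max_turns):
        reply = chat(messages)
        messages.append(reply)  # keep the assistant turn, tool_calls included
        calls = reply.get("tool_calls") or []
        if not calls:
            return reply.get("content", "")
        for call in calls:
            fn = call["function"]
            result = tools_impl[fn["name"]](**fn["arguments"])
            # Tool results go back as a JSON string in a "tool" message.
            messages.append({"role": "tool", "content": json.dumps(result)})
    raise RuntimeError("model still requesting tools after max_turns")
```

With a server running, `run_agent(ollama_chat(MODEL, [tool_def]), messages, {"get_weather": get_weather})` drives the whole exchange. `MODEL` is whichever Gemma 4 tag you pulled. OpenAI-compatible endpoints also expect a `tool_call_id` on each tool message. This sketch targets `/api/chat` only.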
+
+## Thinking Mode + Tool Calls
+
+- When thinking is enabled, preserve thoughts between tool calls
+- For long agent chains, summarize thoughts as plain text to save context
+- Recommended: **disable thinking for tool-heavy workflows** (Seth's finding)
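On Ollama, thinking can be switched off per request with the `think` field. When thinking is on, the assistant message carries a separate `thinking` field that is sent back with the history. The sketch below assumes both fields, which are present in recent Ollama releases, so check your version. `build_payload` disables thinking for tool-heavy runs, and `drop_old_thinking` trims older turns. Both helper names are hypothetical. `drop_old_thinking` discards stale thoughts outright rather than summarizing them, which is a cruder substitute for the summarization suggested above.

```python
def build_payload(model: str, messages: list, tools: list, think: bool = False) -> dict:
    """Request body for Ollama /api/chat; thinking off by default for tool-heavy runs."""
    return {"model": model, "messages": messages, "tools": tools,
            "think": think, "stream": False}


def drop_old_thinking(messages: list, keep_last: int = 1) -> list:
    """Copy of messages with `thinking` removed from all but the newest
    `keep_last` assistant turns (older turns keep their tool_calls/content)."""
    assistant = [i for i, m in enumerate(messages) if m.get("role") == "assistant"]
    keep = set(assistant[-keep_last:]) if keep_last > 0 else set()
    out = []
    for i, m in enumerate(messages):
        if i not in keep and m.get("role") == "assistant" and "thinking" in m:
            m = {k: v for k, v in m.items() if k != "thinking"}  # new dict; input untouched
        out.append(m)
    return out
```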
+
+## Framework Flags
+
+| Framework | Required flag / setup |
+|-----------|--------------|
+| llama.cpp | `--jinja` |
+| vLLM | `--enable-auto-tool-choice` together with `--tool-call-parser <parser>` |
+| Ollama | Works via `/api/chat` endpoint with `tools` field |
+| transformers | `apply_chat_template(tools=[...])` |
+
+## Known Issues
+
+- Ollama v0.20.0 to v0.20.1: tool-call parser broken; streaming responses drop tool calls
+- llama.cpp: format mismatches and continuous loops reported
+- LM Studio: reported compatibility issues with tool calling
+- **Workaround:** use non-streaming mode for tool calls (verified in Simon)