docs: merge tooling findings into SYNTHESIS/GOTCHAS/CORPUS_* and add handoff

Patches the top-level corpus docs with the 13 findings flagged during the
2026-04-18 canonical tooling research pass. tooling/README.md now marks each
finding [merged: <file>] or [flagged] for provenance.

- CORPUS_ollama_variants.md: annotate gemma4:26b as MoE (25.2B total / 3.8B
  active, 8-of-128 experts + 1 shared). Note Q4_K_M inference is standard
  (the "MoE quality degrades at 4-bit" caveat is training-only). Add note
  that audio on E-series is NOT available via Ollama; llama.cpp mmproj or
  vLLM only.
- CORPUS_capabilities.md: native system role, configurable thinking mode,
  first trained tool use (vs Gemma 1/2/3 proof-of-concept), native object
  detection with bbox output in 1000x1000 coords, pointer to EmbeddingGemma
  for retrieval (Gemma 4 has no embedding mode).
- CORPUS_tool_calling_format.md: add Chat Template Context section
  documenting the <|turn>/<turn|> asymmetric brackets (new in Gemma 4,
  replaced <start_of_turn>/<end_of_turn>) plus <|think>, <|channel>,
  <|image>, <|audio> tokens. Add HF transformers Alternative section showing
  processor.parse_response with response_schema.
- GOTCHAS.md: add MEDIUM gotcha for abandoned google/gemma_pytorch (no
  Gemma 4 support since 2025-05-30). Expand fine-tuning section with FA2/FA4
  head_dim=512 break, fused LoRA kernel issues, 26B A4B training-quant
  guidance, new tool-call tokens as learned embeddings.
- SYNTHESIS.md: add banner pointing to tooling/ for canonical upstream
  material. Add embeddinggemma row to Model Selection table.

Also:

- Add .gitignore excluding .backup/ (local scratch per global CLAUDE.md
  convention, not needed in tracked history) and __pycache__/.
- Add .claude/handoffs/2026-04-18-canonical-tooling-research.md so future
  sessions can pick up cold: facts verified, open threads, what changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:48:26 -04:00
parent eecebe7ef5
commit 5775978899
8 changed files with 197 additions and 20 deletions
@@ -2,8 +2,29 @@

 > Source: Google AI for Developers - Function Calling docs
 > https://ai.google.dev/gemma/docs/capabilities/text/function-calling-gemma4
+> Canonical source in corpus: `tooling/google-official/docs/ai-google-dev_function_calling_gemma4.html`
+> Authoritative chat template: `tooling/huggingface/model-cards/gemma-4-{31B,E4B}-it-chat_template.jinja`

-## Special Tokens (6 total)
+## Chat Template Context (what surrounds the tool tokens)
+
+Gemma 4 changed the turn-token syntax from Gemma 3. You won't usually write these by
+hand — Ollama, llama.cpp `--jinja`, and HF `apply_chat_template` all handle it — but
+know what's on the wire when debugging:
+
+| Purpose | Gemma 3 | Gemma 4 |
+|---------|---------|---------|
+| Turn start | `<start_of_turn>role\n` | `<\|turn>role\n` |
+| Turn end | `<end_of_turn>\n` | `<turn\|>\n` |
+| Thinking | (not standardized) | `<\|think>...<think\|>` |
+| Thought channel | (n/a) | `<\|channel>thought...<channel\|>` |
+| Image inline | `<start_of_image>` | `<\|image>...<image\|>` |
+| Audio inline | (n/a) | `<\|audio>...<audio\|>` |
+| String delimiter in native format | (n/a) | `<\|"\|>` |
+
+**Asymmetric brackets are intentional.** Opening is `<|token>`, closing is `<token|>`.
+If you see `<|turn>...</turn|>` in a code sample, that's wrong: the closer is
+`<turn|>`, with no slash.
+
+## Tool Special Tokens (6 total)

 | Token | Purpose |
 |-------|---------|
@@ -98,3 +119,24 @@ This is what you actually use in practice. Ollama translates to/from native toke
 - llama.cpp: format mismatches and continuous loops reported
 - LM Studio: compatibility issues with tool calling
 - **Workaround:** Use non-streaming mode for tool calls (proven in Simon)
+
+## HF `transformers` Alternative (not needed if using Ollama)
+
+If you ever route through HF `transformers` (v5.5.4+) instead of Ollama, there's a
+cleaner parser than hand-rolled regex:
+
+```python
+inputs = processor.apply_chat_template(
+    messages, tools=TOOLS, enable_thinking=True,
+    add_generation_prompt=True, tokenize=True,
+    return_dict=True, return_tensors="pt"
+)
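+out = model.generate(**inputs)
+# generate() returns prompt + completion; decode only the newly generated tokens
+prompt_len = inputs["input_ids"].shape[-1]
+parsed = processor.parse_response(processor.decode(out[0][prompt_len:]))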
+# -> {"thinking": "...", "content": "...", "tool_calls": [...]}
+```
+
+`parse_response` uses the `response_schema` and `x-regex` fields baked into
+`tokenizer_config.json` (downloaded to `tooling/huggingface/model-cards/`). For
+Ollama users this is informational only: Ollama's server-side tool parser already
+does the equivalent and returns structured `tool_calls` in the chat response.
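
Review note on the Chat Template Context hunk: the asymmetric-bracket rule is easy to check mechanically when debugging raw prompt text. The sketch below is illustrative only and not part of the commit. It uses the token names from the Gemma 3 / Gemma 4 table in the diff. The helper names `lint_template_text` and `split_turns` are hypothetical, and the turn splitter assumes roles are single words such as `user` or `model`.

```python
import re

# Token names come from the Gemma 4 column of the table in the diff.
# Everything else here (helper names, messages) is a hypothetical illustration.
GEMMA4_NAMES = ("turn", "think", "channel", "image", "audio")

# A correct Gemma 4 closer looks like <turn|>; a slash form such as </turn|>
# is the mistake the doc warns about.
_SLASH_CLOSER = re.compile(r"</(%s)\|>" % "|".join(GEMMA4_NAMES))

# Gemma 3 tokens listed in the table as replaced in Gemma 4.
_GEMMA3_TOKEN = re.compile(r"<(?:start_of_turn|end_of_turn|start_of_image)>")

# One turn on the wire: <|turn>role\n ... <turn|>
# Assumption: the role is a single word (e.g. user, model, system).
_TURN = re.compile(r"<\|turn>(\w+)\n(.*?)<turn\|>", re.DOTALL)


def lint_template_text(text):
    """Return (offset, token, message) tuples for suspicious tokens."""
    problems = []
    for m in _SLASH_CLOSER.finditer(text):
        problems.append(
            (m.start(), m.group(0), "closing token should be <%s|>" % m.group(1))
        )
    for m in _GEMMA3_TOKEN.finditer(text):
        problems.append((m.start(), m.group(0), "Gemma 3 token in Gemma 4 text"))
    return sorted(problems)


def split_turns(text):
    """Split well-formed Gemma 4 wire text into (role, content) pairs."""
    return [(m.group(1), m.group(2)) for m in _TURN.finditer(text)]


if __name__ == "__main__":
    good = "<|turn>user\nhi<turn|>\n<|turn>model\nhello<turn|>\n"
    bad = "<|turn>user\nhi</turn|>\n<start_of_turn>model\n"
    print(split_turns(good))
    for offset, token, message in lint_template_text(bad):
        print(offset, token, message)
```

This only inspects text. It does not replace `apply_chat_template` or Ollama's own templating, which remain the way to produce prompts.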