Mortdecai 5775978899 docs: merge tooling findings into SYNTHESIS/GOTCHAS/CORPUS_* and add handoff

Patches the top-level corpus docs with the 13 findings flagged during the
2026-04-18 canonical tooling research pass. tooling/README.md now marks each
finding [merged: <file>] or [flagged] for provenance.

- CORPUS_ollama_variants.md: annotate gemma4:26b as MoE (25.2B total / 3.8B
  active, 8-of-128 experts + 1 shared). Note Q4_K_M inference is standard
  (the "MoE quality degrades at 4-bit" caveat is training-only). Add note
  that audio on E-series is NOT available via Ollama — llama.cpp mmproj
  or vLLM only.
- CORPUS_capabilities.md: native system role, configurable thinking mode,
  first trained tool use (vs Gemma 1/2/3 proof-of-concept), native object
  detection with bbox output in 1000x1000 coords, pointer to EmbeddingGemma
  for retrieval (Gemma 4 has no embedding mode).
- CORPUS_tool_calling_format.md: add Chat Template Context section
  documenting the <|turn>/<turn|> asymmetric brackets (new in Gemma 4,
  replaced <start_of_turn>/<end_of_turn>) plus <|think>, <|channel>,
  <|image>, <|audio> tokens. Add HF transformers Alternative section
  showing processor.parse_response with response_schema.
- GOTCHAS.md: add MEDIUM gotcha for abandoned google/gemma_pytorch (no
  Gemma 4 support since 2025-05-30). Expand fine-tuning section with FA2/FA4
  head_dim=512 break, fused LoRA kernel issues, 26B A4B training-quant
  guidance, new tool-call tokens as learned embeddings.
- SYNTHESIS.md: add banner pointing to tooling/ for canonical upstream
  material. Add embeddinggemma row to Model Selection table.

Also:
- Add .gitignore excluding .backup/ (local scratch per global CLAUDE.md
  convention, not needed in tracked history) and __pycache__/.
- Add .claude/handoffs/2026-04-18-canonical-tooling-research.md so future
  sessions can pick up cold — facts verified, open threads, what changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-18 12:48:26 -04:00

Gemma 4 Synthesis — How to Build With It

Opinionated guide based on two production implementations and ongoing use. Seth Freiberg, 2026-04-12

The One-Paragraph Summary

Gemma 4 is an ultra-compliant, highly capable model that doesn't know who it is. It doesn't need hand-holding on tasks, but it does need explicit system-prompt instructions about identity, boundaries, and output format. In Ollama, raise num_predict (the defaults are absurdly low), set think to false (thinking eats the token budget), and avoid format: json entirely (it causes infinite loops). Because local inference is fast and free, sequential tool calls are the ideal way to handle tasks that would otherwise require long structured output.

For canonical upstream source (model cards, chat templates, serving commands, fine-tuning recipes, specialized siblings like EmbeddingGemma/ShieldGemma): see tooling/README.md. That directory is 147 files / 14 MB of first-party material pulled from Google / Hugging Face / framework maintainers. This SYNTHESIS is the opinionated digest; tooling/ is the receipts.

Mental Model

Think of Gemma 4 as a very competent employee on their first day. They can do the work — you don't need to explain how. But you DO need to explain:

Who they are and what their job is
What they should and should NOT do
Exactly what format you want the deliverable in
The boundaries of their role

Get those right and Gemma 4 just works. Get them wrong and you get a generic chatbot.

Mandatory Ollama Settings

Every Gemma 4 call MUST include:

{
  "think": false,
  "options": {
    "num_ctx": 4096,
    "num_predict": 2048
  }
}

Why each one:

think: false — Ollama 0.20+ defaults to think:true. Thinking tokens consume num_predict budget invisibly, returning empty responses. Seth has ONLY had success with thinking off.
num_ctx: 4096+ — Ollama defaults to 2048. Your system prompt alone might exceed that.
num_predict: 2048+ — Ollama defaults to 128. Any structured output gets truncated.

Scale these to your task. The values above are safe minimums, not recommendations.
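One way to keep these settings from being forgotten is to build every request through a single helper. The sketch below assumes the request body shape of Ollama's /api/chat endpoint; the helper name `build_chat_payload` and the `stream` flag are illustrative additions, not part of the settings above.

```python
import json

# Safe minimums from the settings above; scale them up per task.
GEMMA4_DEFAULTS = {"num_ctx": 4096, "num_predict": 2048}


def build_chat_payload(model, messages, **option_overrides):
    """Return a chat request body with the mandatory Gemma 4 settings applied.

    Hypothetical helper: callers may raise num_ctx / num_predict via keyword
    overrides, but think is always forced off.
    """
    options = {**GEMMA4_DEFAULTS, **option_overrides}
    return {
        "model": model,
        "messages": messages,
        "think": False,      # thinking off, per the guidance above
        "stream": False,     # assumption: non-streaming call for simplicity
        "options": options,
    }


payload = build_chat_payload(
    "gemma4:26b",
    [{"role": "user", "content": "Say hello."}],
    num_ctx=8192,  # override one default, keep the other
)
print(json.dumps(payload, indent=2))
```

The resulting dict can be sent as the JSON body of a POST to a local Ollama server, or unpacked into a client library call that accepts the same fields.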

System Prompt Template

You are [NAME], a [ROLE DESCRIPTION].

## What You Do
- [Explicit list of responsibilities]
- [Tools you have access to and when to use each one]

## What You Do NOT Do
- [Explicit list of things to refuse or avoid]
- [Common mistakes to prevent]

## Output Format
[Exact schema, field names, example if complex]
Respond with ONLY [format]. No prose outside the [format].

## Rules
- [Behavioral constraints]
- [Multi-step chaining instructions if using tools]

Today's date: [DATE]

Key principles:

Identity first — who is this agent?
Positive instructions before negative (what TO do before what NOT to do)
Output format is explicit and complete — Gemma 4 follows schemas faithfully
"No prose outside the JSON" prevents wrapper text that breaks parsing
Date injection helps with temporal reasoning
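The template above can be filled in programmatically so that identity, format, and date injection are never skipped. This is a minimal sketch; the function name `build_system_prompt`, its parameters, and the example agent are all hypothetical.

```python
from datetime import date


def build_system_prompt(name, role, do, dont, format_name, schema, rules, today=None):
    """Render the system prompt template; list arguments become bullet lists."""
    today = today or date.today().isoformat()

    def bullets(items):
        return "\n".join(f"- {item}" for item in items)

    return (
        f"You are {name}, {role}.\n\n"
        f"## What You Do\n{bullets(do)}\n\n"
        f"## What You Do NOT Do\n{bullets(dont)}\n\n"
        f"## Output Format\n{schema}\n"
        f"Respond with ONLY {format_name}. No prose outside the {format_name}.\n\n"
        f"## Rules\n{bullets(rules)}\n\n"
        f"Today's date: {today}"
    )


prompt = build_system_prompt(
    name="Scribe",                      # hypothetical agent
    role="a release-notes summarizer",
    do=["Summarize merged changes", "Group entries by component"],
    dont=["Invent changes that are not in the input"],
    format_name="JSON",
    schema='{"component": string, "summary": string}',
    rules=["One entry per component"],
    today="2026-04-18",
)
print(prompt)
```

Passing `today` explicitly keeps prompts reproducible in tests; leaving it out falls back to the current date.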

Tool Calling Strategy

Gemma 4 is reliable at tool calling but weak at producing long JSON documents in a single response.

When to use tool calling (Ollama native)

Multi-turn agents with 2-10 tools
Sequential reasoning chains (lookup A -> use A to decide B -> lookup B)
Any task where the model needs to gather information before responding
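A multi-turn tool agent of this kind boils down to a bounded loop: call the model, run any requested tools, feed the results back, and stop when the model answers in plain text. The sketch below injects the model call as a `chat` function so it runs without a server. It assumes an assistant message shaped like Ollama's chat responses (an optional `tool_calls` list of `{"function": {"name", "arguments"}}` entries) and a `tool` role message carrying a `tool_name` field; treat both shapes as assumptions to check against your client. The default limit of 12 iterations is the figure this guide cites under Context Management.

```python
import json


def run_tool_loop(chat, tools, messages, max_iterations=12):
    """Run a bounded tool-calling loop.

    chat(messages) -> assistant message dict (hypothetical callable).
    tools maps tool name -> Python callable taking keyword arguments.
    Returns (final_text, full_message_history).
    """
    messages = list(messages)  # do not mutate the caller's list
    for _ in range(max_iterations):
        reply = chat(messages)
        messages.append(reply)
        calls = reply.get("tool_calls") or []
        if not calls:
            return reply.get("content", ""), messages
        for call in calls:
            name = call["function"]["name"]
            args = call["function"].get("arguments", {})
            handler = tools.get(name)
            result = handler(**args) if handler else {"error": f"unknown tool {name}"}
            messages.append(
                {"role": "tool", "tool_name": name, "content": json.dumps(result)}
            )
    raise RuntimeError("tool iteration limit reached")


def fake_chat(messages):
    """Stand-in model: asks for one lookup, then answers from the result."""
    if messages[-1]["role"] == "tool":
        return {"role": "assistant", "content": "Ada is on the pro plan."}
    return {
        "role": "assistant",
        "content": "",
        "tool_calls": [{"function": {"name": "lookup_plan", "arguments": {"user": "ada"}}}],
    }


answer, history = run_tool_loop(
    fake_chat,
    {"lookup_plan": lambda user: {"user": user, "plan": "pro"}},
    [{"role": "user", "content": "What plan is ada on?"}],
)
print(answer)
```

Raising on the iteration limit is one choice; a production agent might instead fall back to a final call without tools.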

When to use prompt-based JSON instead

Single-turn generation with known output structure
When you need specific JSON schema control
When the output is a payload (prompts, configs) not a conversation

The Sequential Pattern

Instead of asking Gemma 4 to produce one massive JSON:

BAD:  "Generate a 50-scene storyboard as JSON"  -> truncated/malformed
GOOD: "Generate scenes 1-5 as JSON" x10         -> reliable every time

Gemma 4's inference speed makes sequential calls cheap. A 10-call chain at ~134 tok/s on a 3090 Ti costs seconds, not minutes. This is the fundamental advantage of local models — latency is predictable and network-free.
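The batching idea can be sketched as a loop that requests a few items per call and concatenates the parsed results. The model call is injected as `generate` so the example runs offline; `generate_in_batches`, `extract_json_array`, and the prompt wording are hypothetical.

```python
import json
import re


def extract_json_array(raw):
    """Parse the span from the first '[' to the last ']' in the model output."""
    start, end = raw.find("["), raw.rfind("]")
    if start < 0 or end <= start:
        raise ValueError("no JSON array in model output")
    return json.loads(raw[start:end + 1])


def generate_in_batches(generate, total, batch_size=5):
    """Ask for `total` scenes in batches of `batch_size`, one call per batch."""
    items = []
    for first in range(1, total + 1, batch_size):
        last = min(first + batch_size - 1, total)
        prompt = (
            f"Generate scenes {first}-{last} as a JSON array. "
            "Respond with ONLY the JSON array."
        )
        items.extend(extract_json_array(generate(prompt)))
    return items


def fake_generate(prompt):
    """Stand-in model that returns the requested scene numbers with wrapper text."""
    first, last = map(int, re.search(r"scenes (\d+)-(\d+)", prompt).groups())
    return "Here you go: " + json.dumps([{"scene": n} for n in range(first, last + 1)])


scenes = generate_in_batches(fake_generate, total=50)
print(len(scenes))  # 50
```

As a rough sense of cost: at the ~134 tok/s quoted above, ten calls of about 500 output tokens each would take roughly 37 seconds of generation time, ignoring prompt processing.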

JSON Extraction Pattern

Since format: "json" is broken, always extract client-side:

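# Python
import json

# `response` is the parsed /api/generate reply; chat replies keep the text
# under response["message"]["content"] instead.
raw = response["response"]
start = raw.find("{")
end = raw.rfind("}")
obj = None  # stays None when no JSON object is found
if start >= 0 and end > start:
    obj = json.loads(raw[start:end + 1])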

// JavaScript
const raw = response.message.content;
// Greedy match from the first "{" to the last "}"
const match = raw.match(/\{[\s\S]*\}/);
const obj = match ? JSON.parse(match[0]) : null;

For arrays, find [ and ] instead. Add json5 fallback for trailing commas.
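A single helper can cover both objects and arrays and retry once with trailing commas removed. This is a sketch under stated assumptions: the function name `extract_json` is hypothetical, and the regex cleanup is a crude stand-in for a real JSON5 parser (it can corrupt string values that happen to contain `,}` or `,]`).

```python
import json
import re

# Matches a comma followed only by whitespace before a closing brace/bracket.
TRAILING_COMMA = re.compile(r",\s*([}\]])")


def extract_json(raw):
    """Return the first parseable object or array found in raw, else None."""
    candidates = []
    for open_ch, close_ch in (("{", "}"), ("[", "]")):
        start, end = raw.find(open_ch), raw.rfind(close_ch)
        if start >= 0 and end > start:
            candidates.append((start, raw[start:end + 1]))
    # Try the span that starts earliest first: it is the outermost structure.
    for _, text in sorted(candidates):
        for attempt in (text, TRAILING_COMMA.sub(r"\1", text)):
            try:
                return json.loads(attempt)
            except json.JSONDecodeError:
                continue
    return None


print(extract_json('Sure! {"a": [1, 2,],}'))
print(extract_json('[{"x": 1}, {"x": 2}]'))
print(extract_json("no json here"))
```

Returning None instead of raising lets the caller decide whether to retry the generation.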

Temperature Guidelines

| Task Type | Temperature | Why |
| --- | --- | --- |
| Evaluation / scoring | 0.2 | Consistent, reproducible judgments |
| Structured extraction | 0.3-0.4 | Faithful to schema |
| Creative generation | 0.6-0.8 | Variety without chaos |
| Conversation / chat | 0.7-1.0 | Natural feel |

Retry strategy: bump temp +0.1 per retry to escape format failures.
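The retry rule can be sketched as a loop that raises the temperature by 0.1 after each output that fails to parse. The model call is injected as `generate(prompt, temperature)`; that signature, the function name `generate_with_retries`, and the retry count are assumptions for illustration.

```python
import json


def generate_with_retries(generate, prompt, base_temperature=0.3, max_retries=3):
    """Call generate until its output contains a parseable JSON object.

    Returns (parsed_object, temperature_used). Each retry adds 0.1 to the
    temperature to nudge the model out of a repeated format failure.
    """
    last_error = None
    for attempt in range(max_retries + 1):
        temperature = round(base_temperature + 0.1 * attempt, 2)
        raw = generate(prompt, temperature)
        start, end = raw.find("{"), raw.rfind("}")
        try:
            if start < 0 or end <= start:
                raise ValueError("no JSON object in output")
            return json.loads(raw[start:end + 1]), temperature
        except ValueError as err:  # json.JSONDecodeError subclasses ValueError
            last_error = err
    raise RuntimeError(f"no valid JSON after {max_retries + 1} attempts") from last_error


calls = []


def flaky_generate(prompt, temperature):
    """Stand-in model that fails twice, then returns wrapped JSON."""
    calls.append(temperature)
    return "oops, not json" if len(calls) < 3 else 'Result: {"score": 7}'


obj, used = generate_with_retries(flaky_generate, "Score this draft.")
print(obj, used, calls)
```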

Vision Usage

Works for: Describing image contents (objects, colors, composition, text)
Unreliable for: Subjective quality scoring, aesthetic judgment

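import base64

import ollama  # assumes the official `ollama` Python client is installed

client = ollama.Client()  # talks to the local Ollama server by default

# Images are passed as base64-encoded strings
with open("image.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("ascii")

response = client.generate(
    model="gemma4:26b",
    prompt="Describe this image in detail.",
    images=[b64],
    think=False,
    options={"temperature": 0.2, "num_predict": 512}
)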

Vision is on ALL Gemma 4 variants (E2B, E4B, 26B, 31B). Audio is E-series only.

Context Management

Multi-turn (chat agents)

Prune old tool results and tool-call messages
Keep assistant's natural-language summaries
Set num_ctx to 32768 for rich conversations
Set a tool iteration limit (12 is proven) with streaming fallback
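The pruning rules above might look like the following. This is a sketch, not a tested production routine: `prune_history`, the `keep_recent` window, and the message shape (dicts with `role`, `content`, optional `tool_calls`) are assumptions.

```python
def prune_history(messages, keep_recent=6):
    """Drop old tool results and tool-call-only messages, keep summaries.

    The last `keep_recent` messages and all system messages are kept intact.
    Older assistant messages keep their natural-language content but lose
    any attached tool_calls.
    """
    cutoff = max(len(messages) - keep_recent, 0)
    # Do not start the kept window on a tool result whose call would be pruned.
    while cutoff > 0 and messages[cutoff]["role"] == "tool":
        cutoff -= 1

    pruned = []
    for index, msg in enumerate(messages):
        if index >= cutoff or msg["role"] == "system":
            pruned.append(msg)
        elif msg["role"] == "tool":
            continue  # old tool result
        elif msg["role"] == "assistant" and msg.get("tool_calls"):
            if msg.get("content"):
                # Keep the prose, drop the stale call metadata.
                pruned.append({k: v for k, v in msg.items() if k != "tool_calls"})
        else:
            pruned.append(msg)
    return pruned


call = [{"function": {"name": "lookup", "arguments": {}}}]
history = [
    {"role": "system", "content": "You are Scribe."},
    {"role": "user", "content": "Question A"},
    {"role": "assistant", "content": "", "tool_calls": call},
    {"role": "tool", "content": "{...}"},
    {"role": "assistant", "content": "Summary A"},
    {"role": "user", "content": "Question B"},
    {"role": "assistant", "content": "", "tool_calls": call},
    {"role": "tool", "content": "{...}"},
    {"role": "assistant", "content": "Summary B"},
]
trimmed = prune_history(history, keep_recent=4)
print([m["role"] for m in trimmed])
```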

Single-turn (pipeline stages)

Calculate your prompt size and set num_ctx accordingly
For long inputs (full track analysis), use recursive splitting at natural boundaries
Pin model with keep_alive=-1 if pipeline has idle gaps
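Recursive splitting at natural boundaries can be sketched as: if the text fits, keep it; otherwise cut at the coarsest separator found inside the size window and recurse on the remainder. The function name `split_at_boundaries`, the separator list, and the use of a character count as a proxy for tokens are all assumptions; a real pipeline would size chunks against the tokenizer and num_ctx.

```python
def split_at_boundaries(text, max_chars, separators=("\n\n", "\n", ". ", " ")):
    """Split text into chunks of at most max_chars, preferring coarse boundaries.

    Separators are tried from coarsest (paragraph) to finest (space); if none
    occurs inside the window, the text is cut hard at max_chars.
    """
    if len(text) <= max_chars:
        return [text] if text else []
    for sep in separators:
        # rfind with an end bound only matches separators fully inside the window.
        cut = text.rfind(sep, 0, max_chars)
        if cut > 0:
            split_at = cut + len(sep)
            break
    else:
        split_at = max_chars
    head, tail = text[:split_at], text[split_at:]
    return [head] + split_at_boundaries(tail, max_chars, separators)


document = "para one.\n\npara two is here.\n\npara three."
chunks = split_at_boundaries(document, max_chars=25)
for chunk in chunks:
    print(repr(chunk))
```

Chunks keep their trailing separator, so joining them reproduces the input exactly. Recursion depth grows with the number of chunks, so very long inputs may call for an iterative rewrite.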

Model Selection

| Use Case | Recommended | Why |
| --- | --- | --- |
| Production pipeline (needs GPU coexistence) | `gemma4:26b` | MoE (3.8B active), fast, good quality/VRAM balance |
| On-device / edge | `gemma4:e4b-it-q8_0` | 12GB VRAM, vision+audio (audio is not available via Ollama; llama.cpp mmproj or vLLM only) |
| Maximum quality (single-model GPU) | `gemma4:31b-it-q4_K_M` | Dense 31B, sharpest but 5x slower, more VRAM pressure |
| Rapid prototyping / testing | `gemma4:26b` | Fast enough for interactive dev |
| Retrieval / embeddings | `embeddinggemma` (308M, separate model) | Gemma 4 has no embedding mode; use the sibling |

Anti-Patterns

Don't use format: "json" — infinite loops on nested schemas
Don't leave think at default — eats your output budget silently
Don't leave num_predict at default — 128 tokens is nothing
Don't leave num_ctx at default — 2048 truncates most prompts
Don't ask for huge JSON in one call — break into sequential calls
Don't use thinking mode for evaluation — inflates scores, wastes context
Don't skip system prompt identity — Gemma 4 becomes a generic chatbot
Don't use audio on 26B/31B — only E-series has audio encoder

Quick-Start Checklist

Set think: false
Set num_predict >= 512 (2048+ for JSON output)
Set num_ctx >= 4096 (scale to your prompt size)
Write explicit system prompt with identity + boundaries + output format
Extract JSON client-side (no format: "json")
Set keep_alive >= 30m (or pin with -1)
For long structured output, use sequential calls
For vision, pass base64 in images array
Test with your actual prompt length — Ollama won't warn about truncation
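Several items on this checklist can be checked mechanically before a request is sent. The linter below is a hypothetical sketch: `check_request` is not part of any library, and it assumes the request is a dict shaped like an Ollama /api/generate or /api/chat body.

```python
def check_request(payload):
    """Return a list of checklist warnings for an Ollama-style request dict."""
    warnings = []
    options = payload.get("options", {})
    if payload.get("think") is not False:
        warnings.append("set think: false")
    if options.get("num_predict", 0) < 512:
        warnings.append("set num_predict >= 512 (2048+ for JSON output)")
    if options.get("num_ctx", 0) < 4096:
        warnings.append("set num_ctx >= 4096")
    if payload.get("format"):
        warnings.append("remove format and extract JSON client-side")
    if "keep_alive" not in payload:
        warnings.append("set keep_alive >= 30m (or -1 to pin)")
    has_system = bool(payload.get("system")) or any(
        m.get("role") == "system" for m in payload.get("messages", [])
    )
    if not has_system:
        warnings.append("add a system prompt with identity, boundaries, output format")
    return warnings


risky = {"model": "gemma4:26b", "prompt": "hi", "format": "json"}
safe = {
    "model": "gemma4:26b",
    "think": False,
    "keep_alive": "30m",
    "options": {"num_ctx": 4096, "num_predict": 2048},
    "messages": [
        {"role": "system", "content": "You are Scribe, a release-notes summarizer."},
        {"role": "user", "content": "Summarize these changes."},
    ],
}
print(check_request(risky))
print(check_request(safe))
```

The linter cannot tell whether num_ctx actually covers the prompt, so the last checklist item still needs a test against real prompt lengths.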
