gemma4-research/CORPUS_capabilities.md
Mortdecai 5775978899 docs: merge tooling findings into SYNTHESIS/GOTCHAS/CORPUS_* and add handoff
Patches the top-level corpus docs with the 13 findings flagged during the
2026-04-18 canonical tooling research pass. tooling/README.md now marks each
finding [merged: <file>] or [flagged] for provenance.

- CORPUS_ollama_variants.md: annotate gemma4:26b as MoE (25.2B total / 3.8B
  active, 8-of-128 experts + 1 shared). Note Q4_K_M inference is standard
  (the "MoE quality degrades at 4-bit" caveat is training-only). Add note
  that audio on E-series is NOT available via Ollama — llama.cpp mmproj
  or vLLM only.
- CORPUS_capabilities.md: native system role, configurable thinking mode,
  first trained tool use (vs Gemma 1/2/3 proof-of-concept), native object
  detection with bbox output in 1000x1000 coords, pointer to EmbeddingGemma
  for retrieval (Gemma 4 has no embedding mode).
- CORPUS_tool_calling_format.md: add Chat Template Context section
  documenting the <|turn>/<turn|> asymmetric brackets (new in Gemma 4,
  replaced <start_of_turn>/<end_of_turn>) plus <|think>, <|channel>,
  <|image>, <|audio> tokens. Add HF transformers Alternative section
  showing processor.parse_response with response_schema.
- GOTCHAS.md: add MEDIUM gotcha for abandoned google/gemma_pytorch (no
  Gemma 4 support since 2025-05-30). Expand fine-tuning section with FA2/FA4
  head_dim=512 break, fused LoRA kernel issues, 26B A4B training-quant
  guidance, new tool-call tokens as learned embeddings.
- SYNTHESIS.md: add banner pointing to tooling/ for canonical upstream
  material. Add embeddinggemma row to Model Selection table.

Also:
- Add .gitignore excluding .backup/ (local scratch per global CLAUDE.md
  convention, not needed in tracked history) and __pycache__/.
- Add .claude/handoffs/2026-04-18-canonical-tooling-research.md so future
  sessions can pick up cold — facts verified, open threads, what changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:48:26 -04:00

# Gemma 4 Capabilities Reference
## Modalities
### Text (all variants)
- Standard instruction-following, chat, completion
- **Native `system` role support** (new in Gemma 4; Gemma 3 prepended system as user turn)
- **Configurable thinking mode** — `<|think|>` / `<|channel>` tokens in the chat template; Ollama `think: true/false` flag. Seth's finding (see GOTCHAS): keep `false` for tool-use workloads.
- 128K context window (E2B/E4B) / 256K (26B/31B) — training length
- 262K vocabulary
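
As a minimal sketch of the two points above (native `system` role, thinking disabled), the request body for Ollama's `/api/chat` endpoint might look like the following. The helper name `build_chat_request` is hypothetical, the model tag `gemma4:26b` is borrowed from the variants doc, and the top-level `think` field is assumed to match the Ollama flag mentioned above.

```python
import json


def build_chat_request(system_prompt, user_prompt, model="gemma4:26b", think=False):
    """Build a request body for Ollama's /api/chat (hypothetical helper).

    The system prompt goes in its own `system` turn rather than being
    prepended to the user turn. `think=False` follows the guidance to keep
    thinking off for tool-use workloads.
    """
    return {
        "model": model,  # assumed tag; substitute whichever variant is pulled
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "think": think,
        "stream": False,
    }


body = build_chat_request("You are a terse assistant.", "Summarize RoPE in one line.")
print(json.dumps(body, indent=2))
```
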
### Vision (all variants)
- **Tested and verified working** (Seth, 2026-04-10)
- Accurately described colors, shapes, composition in 256x256 test image
- ~25 tok/s, ~24s end-to-end on pve197 V100
- Input: base64-encoded image in `images` field of Ollama API
- Vision encoder: 16x16 patches, 2D RoPE, variable aspect ratio
- Token budgets scale with resolution (70-1120 soft tokens)
- Used in AI_Visualizer for SDXL frame quality criticism
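
The input format described above (a base64-encoded image in the `images` field) can be sketched as follows. This only builds the payload for Ollama's `/api/generate` endpoint; the helper name `build_vision_request` and the model tag are assumptions for illustration, and sending the request is left out.

```python
import base64


def build_vision_request(image_bytes, prompt, model="gemma4:26b"):
    """Build an Ollama /api/generate body carrying one image (hypothetical helper).

    Ollama expects each entry in `images` to be a base64 string,
    without a `data:` URI prefix.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,  # assumed tag; substitute whichever variant is pulled
        "prompt": prompt,
        "images": [encoded],
        "stream": False,
    }


# Stand-in bytes; in practice read them with open(path, "rb").read().
fake_image = b"\x89PNG\r\n\x1a\n"
request = build_vision_request(fake_image, "Describe the colors and shapes.")
print(request["images"][0])
```
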
### Audio (E2B/E4B only)
- **Not tested by Seth** — status unknown in practice
- Conformer architecture (~300M params), mel-spectrogram input
- **Trained on SPEECH ONLY — not music or environmental sounds**
- Maximum 30 seconds per clip
- NOT available on 26B or 31B variants
- AI_Visualizer explicitly rejected audio for music analysis (DECISIONS S2) — correct call, model wasn't trained for it
### Video (all variants)
- E2B/E4B: video WITH audio (`load_audio_from_video=True`)
- 31B/26B: video WITHOUT audio
- Not explicitly post-trained on video, but video input works
- Maximum 60 seconds at 1 frame/second
- Not tested by Seth
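
To make the 60-second, 1 frame/second limit concrete, here is a small sketch that picks the timestamps at which frames would be sampled. Truncating longer clips to the first 60 seconds is an assumption made for illustration; the source does not say how over-length video is handled, and `sample_timestamps` is a hypothetical helper.

```python
def sample_timestamps(duration_s, fps=1.0, max_seconds=60.0):
    """Return frame timestamps (in seconds) within the documented video budget.

    Assumption: clips longer than `max_seconds` are truncated, not subsampled.
    """
    usable = min(duration_s, max_seconds)
    count = int(usable * fps)
    return [i / fps for i in range(count)]


stamps = sample_timestamps(90.0)
print(len(stamps), stamps[:3], stamps[-1])
```

At the default settings this caps the input at 60 frames regardless of clip length.
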
### Tool Calling / Function Calling
- **Verified reliable** in both Simon and AI_Visualizer
- Ollama native tool format (OpenAI-compatible function calling)
- Simon: 6 genealogy tools, up to 12 sequential iterations
- Supports parallel tool calls in single response
- Weak at deeply nested JSON schemas, so prefer sequential calls
- **First Gemma generation with tool use as a trained capability.** Gemma 1/2/3 tool use was "proof-of-concept" (per the DeepMind tool_use colab). Gemma 4 has dedicated tool-call tokens and is trained on the pattern.
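
The sequential pattern described above (call the model, run any requested tools, feed the results back, stop after a fixed number of rounds) can be sketched as below. This is not Simon's actual code: `run_tool_loop`, the `chat` callable, and the `lookup_person` tool are hypothetical, and the message shapes assume Ollama's chat format (`tool_calls[].function.name/arguments`, tool results sent back as `role: "tool"` with an optional `tool_name`).

```python
import json


def run_tool_loop(chat, messages, tools, max_iterations=12):
    """Drive a model/tool conversation until the model stops requesting tools.

    chat:  callable taking the message list and returning an assistant message dict
    tools: dict mapping tool name -> Python callable
    """
    for _ in range(max_iterations):
        reply = chat(messages)
        messages.append(reply)
        calls = reply.get("tool_calls") or []
        if not calls:
            return reply.get("content", "")
        # A single reply may carry several (parallel) tool calls.
        for call in calls:
            fn = call["function"]
            args = fn["arguments"]
            if isinstance(args, str):  # some servers send arguments as a JSON string
                args = json.loads(args)
            result = tools[fn["name"]](**args)
            messages.append({
                "role": "tool",
                "tool_name": fn["name"],
                "content": json.dumps(result),
            })
    raise RuntimeError("tool loop exceeded max_iterations")


# Demo with a scripted stand-in for the model (no network needed).
def scripted_chat(messages):
    if messages[-1]["role"] == "tool":
        return {"role": "assistant", "content": "Found: " + messages[-1]["content"]}
    return {"role": "assistant", "content": "", "tool_calls": [
        {"function": {"name": "lookup_person", "arguments": {"name": "Ada"}}}]}


tools = {"lookup_person": lambda name: {"name": name, "born": 1815}}
answer = run_tool_loop(scripted_chat, [{"role": "user", "content": "Who is Ada?"}], tools)
print(answer)
```

Passing `chat` in as a callable keeps the loop independent of the transport, so the same function works with a real Ollama client or a scripted stub. The default cap of 12 mirrors the iteration count reported for Simon.
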
### Native Object Detection (all variants)
- **Prompt format:** "Detect the {object} in this image" → structured output `{box_2d: [ymin, xmin, ymax, xmax]}` in **1000×1000-normalized coordinates** (rescale to your actual image dims).
- Images auto-resized to multiples of 48 pixels by the processor.
- Useful for grounding, cropping, counting, or passing bboxes to downstream tools — no separate detection model required.
- Documented in the HF model card (`tooling/huggingface/model-cards/gemma-4-*.md`). Not tested by Seth yet.
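
Rescaling the 1000×1000-normalized box to real pixels is simple arithmetic; a minimal sketch follows. The helper name `rescale_box` is hypothetical, and it assumes the `[ymin, xmin, ymax, xmax]` order given above, with coordinates measured against the original image rather than the processor's resized copy.

```python
def rescale_box(box_2d, image_width, image_height, grid=1000):
    """Convert a [ymin, xmin, ymax, xmax] box on a 0..grid scale to pixels.

    Returns (left, top, right, bottom), the order used by most
    imaging libraries for crop rectangles.
    """
    ymin, xmin, ymax, xmax = box_2d
    left = round(xmin * image_width / grid)
    top = round(ymin * image_height / grid)
    right = round(xmax * image_width / grid)
    bottom = round(ymax * image_height / grid)
    return (left, top, right, bottom)


pixel_box = rescale_box([100, 200, 500, 800], image_width=640, image_height=480)
print(pixel_box)  # (128, 48, 512, 240)
```

Note the axis swap: the model emits y before x, while the returned tuple puts x first.
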
## Benchmark Context (vs Gemma 3)
- 31B replaces Gemma 3 27B (60 layers vs the 27B's 62, but wider)
- MoE variant (26B) is new — no Gemma 3 equivalent
- E-series with PLE is new — on-device focus
- Proportional RoPE replaces linear frequency scaling, which improves long-context behavior
- Shared KV cache is new, which makes inference more efficient
## What Gemma 4 Does NOT Do
- No native code execution / sandboxing
- No web browsing or retrieval
- Audio only on E-series (not the models most people run) — and **not on Ollama**, requires llama.cpp mmproj or vLLM
- No built-in RAG — tool calling can implement it
- No embeddings — use `EmbeddingGemma` (308M, separate model) for retrieval/semantic search