
Mortdecai 5775978899 docs: merge tooling findings into SYNTHESIS/GOTCHAS/CORPUS_* and add handoff

Patches the top-level corpus docs with the 13 findings flagged during the
2026-04-18 canonical tooling research pass. tooling/README.md now marks each
finding [merged: <file>] or [flagged] for provenance.

- CORPUS_ollama_variants.md: annotate gemma4:26b as MoE (25.2B total / 3.8B
  active, 8-of-128 experts + 1 shared). Note Q4_K_M inference is standard
  (the "MoE quality degrades at 4-bit" caveat is training-only). Add note
  that audio on E-series is NOT available via Ollama — llama.cpp mmproj
  or vLLM only.
- CORPUS_capabilities.md: native system role, configurable thinking mode,
  first trained tool use (vs Gemma 1/2/3 proof-of-concept), native object
  detection with bbox output in 1000x1000 coords, pointer to EmbeddingGemma
  for retrieval (Gemma 4 has no embedding mode).
- CORPUS_tool_calling_format.md: add Chat Template Context section
  documenting the <|turn>/<turn|> asymmetric brackets (new in Gemma 4,
  replaced <start_of_turn>/<end_of_turn>) plus <|think>, <|channel>,
  <|image>, <|audio> tokens. Add HF transformers Alternative section
  showing processor.parse_response with response_schema.
- GOTCHAS.md: add MEDIUM gotcha for abandoned google/gemma_pytorch (no
  Gemma 4 support since 2025-05-30). Expand fine-tuning section with FA2/FA4
  head_dim=512 break, fused LoRA kernel issues, 26B A4B training-quant
  guidance, new tool-call tokens as learned embeddings.
- SYNTHESIS.md: add banner pointing to tooling/ for canonical upstream
  material. Add embeddinggemma row to Model Selection table.

Also:
- Add .gitignore excluding .backup/ (local scratch per global CLAUDE.md
  convention, not needed in tracked history) and __pycache__/.
- Add .claude/handoffs/2026-04-18-canonical-tooling-research.md so future
  sessions can pick up cold — facts verified, open threads, what changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-18 12:48:26 -04:00


Gemma 4 on Ollama — Available Variants

Last verified against Seth's homelab: 2026-04-12

Ollama Model Tags

| Tag | Params | Quant | Size on Disk | VRAM | Notes |
|---|---|---|---|---|---|
| `gemma4:e4b-it-q8_0` | ~8B total / 4B effective | Q8_0 | 11.6GB | ~12GB | Vision + audio capable (audio input is not exposed through Ollama). ~25 tok/s on V100. |
| `gemma4:26b` | 25.2B total / 3.8B active (MoE) | Q4_K_M (default) | 18.0GB | ~18GB | Sweet spot for quality/speed. ~134 tok/s on 3090 Ti. 8 of 128 experts active plus 1 shared, so it runs at roughly 4B speed, which explains the throughput. Q4_K_M inference is standard (Mixtral/DeepSeek ship the same); the "MoE quality degrades at 4-bit" caveat is a training-time concern, not an inference one. See `tooling/huggingface/model-cards/gemma-4-26B-A4B-it-README.md` for the full card. |
| `gemma4:31b-it-q4_K_M` | 31.3B | Q4_K_M | 19.9GB | ~24.5GB | Sharpest, but roughly 5x slower (~28 tok/s on 3090 Ti, memory pressure). |

Capabilities by Variant (from `ollama show`)

All variants support:

Text generation (completion, chat)
Vision (image input via base64 in images field)
Tool/function calling (native Ollama tool format)
Thinking (configurable — ollama show lists it; Seth's finding is to leave it false for tool-use workloads)
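The capabilities above map onto fields of a single chat request. The following is a minimal sketch, assuming Ollama's documented `/api/chat` body shape (`messages`, `images`, `tools`, `think`); the `build_chat_request` helper, the `get_weather` tool, and the placeholder image bytes are hypothetical and exist only for illustration. It builds the request body without sending it.

```python
import base64
import json


def build_chat_request(model, prompt, image_bytes=None, tools=None):
    """Assemble a JSON-serializable body for Ollama's /api/chat endpoint."""
    message = {"role": "user", "content": prompt}
    if image_bytes is not None:
        # Images travel as plain base64 strings in the message's `images` list.
        message["images"] = [base64.b64encode(image_bytes).decode("ascii")]
    body = {
        "model": model,
        "messages": [message],
        "stream": False,
        # Thinking is left off, following the tool-use recommendation above.
        "think": False,
    }
    if tools:
        body["tools"] = tools
    return body


# Hypothetical tool definition, not part of any real service.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

request_body = build_chat_request(
    "gemma4:26b",
    "What is in this image?",
    image_bytes=b"\x89PNG placeholder bytes",  # stand-in for real image data
    tools=[get_weather_tool],
)
print(json.dumps(request_body, indent=2))
```

In practice the resulting dictionary would be POSTed as JSON to the Ollama server; the transport is omitted here so the sketch stays independent of any HTTP client.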

E-series (E2B, E4B) additionally support:

Audio input (conformer encoder) — but not via Ollama; requires llama.cpp with the mmproj-*-E*B-it-*.gguf projector, or vLLM's input_features_padded. See tooling/inference-frameworks/README.md.

GPU Coexistence (pve197 V100 32GB)

gemma4:26b + SDXL Turbo: ~28.5GB peak VRAM — fits on V100-32GB
gemma4:31b: 24.5GB alone — memory pressure with any coexisting model
gemma4:e4b-it-q8_0: ~12GB — comfortable headroom
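The headroom behind these three notes is simple subtraction. A small sketch, using only the peak figures listed above and a 32 GB card; the names `PEAK_VRAM_GB` and `headroom_gb` are illustrative, and real usage will vary with context length and runtime overhead.

```python
V100_VRAM_GB = 32.0

# Peak VRAM figures copied from the list above (approximate, in GB).
PEAK_VRAM_GB = {
    "gemma4:26b + SDXL Turbo": 28.5,
    "gemma4:31b": 24.5,
    "gemma4:e4b-it-q8_0": 12.0,
}


def headroom_gb(workload, total_gb=V100_VRAM_GB):
    """Return the VRAM left on the card after the named workload's peak."""
    return total_gb - PEAK_VRAM_GB[workload]


for name in PEAK_VRAM_GB:
    print(f"{name}: {headroom_gb(name):.1f} GB free")
# gemma4:26b + SDXL Turbo: 3.5 GB free
# gemma4:31b: 7.5 GB free
# gemma4:e4b-it-q8_0: 20.0 GB free
```

The 7.5 GB left beside `gemma4:31b` is less than the ~12 GB that even the smallest variant listed here needs, which is consistent with the memory-pressure note.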

Ollama API Endpoint

/api/generate (single-turn, used by AI_Visualizer)
/api/chat (multi-turn with message history, used by Simon)
Both accept tools, images, stream, options, keep_alive
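The practical difference between the two endpoints is the shape of the input: `/api/generate` takes a single `prompt` string, while `/api/chat` takes a `messages` list carrying the history. A rough sketch of both body shapes follows; the base URL assumes Ollama's default local port, and the helper names and sample conversation are made up for illustration.

```python
import json

OLLAMA_BASE_URL = "http://localhost:11434"  # assumed default local address


def generate_request(model, prompt, **extra):
    """Single-turn: one `prompt` string, destined for /api/generate."""
    body = {"model": model, "prompt": prompt, **extra}
    return f"{OLLAMA_BASE_URL}/api/generate", body


def chat_request(model, messages, **extra):
    """Multi-turn: a list of role/content messages, destined for /api/chat."""
    body = {"model": model, "messages": list(messages), **extra}
    return f"{OLLAMA_BASE_URL}/api/chat", body


generate_url, generate_body = generate_request(
    "gemma4:26b", "Describe a sunset in one sentence.", stream=False
)

# Illustrative history; the caller resends the full list on every turn.
history = [
    {"role": "user", "content": "My name is Ada."},
    {"role": "assistant", "content": "Nice to meet you, Ada."},
    {"role": "user", "content": "What is my name?"},
]
chat_url, chat_body = chat_request("gemma4:26b", history, stream=False)

print(generate_url, json.dumps(generate_body))
print(chat_url, json.dumps(chat_body))
```

Shared fields such as `stream`, `options`, and `keep_alive` pass through `**extra` unchanged for either endpoint.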

Important Ollama Defaults to Override

| Parameter | Ollama Default | Recommended | Why |
|---|---|---|---|
| `num_ctx` | 2048 | 4096-32768 | Default is absurdly small, causes truncation |
| `num_predict` | 128 | 512-4096+ | Default truncates almost all useful output |
| `think` | true (Ollama 0.20+) | false | See GOTCHAS doc |
| `keep_alive` | 5m | 30m-4h | Prevents expensive model reload between calls |
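One detail that is easy to miss when applying these overrides: in the Ollama request body, `num_ctx` and `num_predict` belong inside the nested `options` object, while `think` and `keep_alive` are top-level fields. A minimal sketch, where the helper name and the specific values (picked arbitrarily from within the recommended ranges) are assumptions:

```python
import json


def recommended_overrides(num_ctx=8192, num_predict=1024, keep_alive="30m"):
    """Return request fields that replace the defaults listed above."""
    return {
        # Top-level request fields.
        "think": False,
        "keep_alive": keep_alive,
        # Model runtime parameters are nested under `options`.
        "options": {"num_ctx": num_ctx, "num_predict": num_predict},
    }


body = {
    "model": "gemma4:26b",
    "messages": [{"role": "user", "content": "Summarize this log file."}],
    "stream": False,
    **recommended_overrides(),
}
print(json.dumps(body, indent=2))
```

Keeping the overrides in one function means every caller sends the same settings, rather than some calls silently falling back to the server defaults.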
