# Gemma 4 on Ollama — Available Variants

> Last verified against Seth's homelab: 2026-04-12

## Ollama Model Tags

| Tag | Params | Quant | Size on Disk | VRAM | Notes |
|-----|--------|-------|--------------|------|-------|
| `gemma4:e4b-it-q8_0` | ~8B total / 4B effective | Q8_0 | 11.6GB | ~12GB | Vision + audio capable (audio is not exposed via Ollama; see Capabilities by Variant). ~25 tok/s on V100 |
| `gemma4:26b` | 25.2B total / **3.8B active (MoE)** | Q4_K_M (default) | 18.0GB | ~18GB | Sweet spot for quality/speed. ~134 tok/s on 3090 Ti. **8 of 128 experts active + 1 shared**, so it runs at roughly 4B speed, which explains the throughput. Q4_K_M inference is standard (Mixtral/DeepSeek ship the same); the "MoE quality degrades at 4-bit" caveat is a **training-time** concern, not an inference one. See `tooling/huggingface/model-cards/gemma-4-26B-A4B-it-README.md` for the full card. |
| `gemma4:31b-it-q4_K_M` | 31.3B | Q4_K_M | 19.9GB | ~24.5GB | Sharpest output, but ~5x slower than `gemma4:26b` (~28 tok/s on 3090 Ti, under memory pressure) |

## Capabilities by Variant (from `ollama show`)

All variants support:

- Text generation (completion, chat)
- Vision (image input via base64 in the `images` field)
- Tool/function calling (native Ollama tool format)
- Thinking (configurable; `ollama show` lists it, and Seth's finding is to leave it `false` for tool-use workloads)

E-series (E2B, E4B) additionally support:

- Audio input (conformer encoder), **but not via Ollama**: it requires llama.cpp with the `mmproj-*-E*B-it-*.gguf` projector, or vLLM's `input_features_padded`. See `tooling/inference-frameworks/README.md`.
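The capability list above can also be checked programmatically rather than by reading `ollama show` output. The following is a minimal sketch, assuming a local Ollama server at the default `http://localhost:11434` and assuming the `/api/show` response carries a top-level `capabilities` list; the helper names and the sample response body are hypothetical illustrations, not captured output.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: default local Ollama address


def parse_capabilities(show_response: dict) -> set[str]:
    """Return the capability names found in an /api/show response body."""
    # Assumption: the response has a top-level "capabilities" list of strings.
    return set(show_response.get("capabilities", []))


def fetch_capabilities(model: str, base_url: str = OLLAMA_URL) -> set[str]:
    """POST to /api/show and return the capabilities the server reports."""
    request = urllib.request.Request(
        f"{base_url}/api/show",
        data=json.dumps({"model": model}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return parse_capabilities(json.load(response))


# Offline illustration with a made-up response body (no server needed).
sample = {"capabilities": ["completion", "vision", "tools", "thinking"]}
caps = parse_capabilities(sample)
missing = {"vision", "tools"} - caps  # empty set when both are advertised
```

Calling `fetch_capabilities("gemma4:26b")` against a running server would replace the sample body with the live one. Keeping the parsing separate from the network call makes the check easy to exercise without a GPU host.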

## GPU Coexistence (pve197 V100 32GB)

- `gemma4:26b` + SDXL Turbo: ~28.5GB peak VRAM, which fits on the V100 32GB
- `gemma4:31b`: ~24.5GB alone, so any coexisting model causes memory pressure
- `gemma4:e4b-it-q8_0`: ~12GB, leaving comfortable headroom
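The coexistence figures above amount to simple budget arithmetic, sketched here in Python. This is an illustration under stated assumptions: footprints are treated as additive, the SDXL Turbo figure is inferred as 28.5GB minus the 18.0GB for `gemma4:26b` (it is not measured separately in this document), and `reserve_gb` is a hypothetical safety margin.

```python
# Approximate VRAM footprints in GB, copied from the figures in this document.
MODEL_VRAM_GB = {
    "gemma4:e4b-it-q8_0": 12.0,
    "gemma4:26b": 18.0,
    "gemma4:31b-it-q4_K_M": 24.5,
}

# Inferred, not measured: 28.5GB combined peak minus 18.0GB for gemma4:26b.
SDXL_TURBO_GB = 28.5 - 18.0


def fits(model: str, other_gb: float, total_gb: float = 32.0, reserve_gb: float = 1.0) -> bool:
    """Return True if the model plus a coexisting workload fits in total_gb.

    reserve_gb is a hypothetical margin for fragmentation and driver overhead.
    """
    return MODEL_VRAM_GB[model] + other_gb + reserve_gb <= total_gb


for tag in MODEL_VRAM_GB:
    verdict = "fits" if fits(tag, SDXL_TURBO_GB) else "does not fit"
    print(f"{tag} + SDXL Turbo: {verdict} on a 32GB card")
```

Under these assumptions the 26B and E4B variants leave room for SDXL Turbo on the 32GB card and the 31B variant does not. Real peak usage depends on context length and KV cache size, so treat the result as a first-pass filter.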

## Ollama API Endpoints

- `/api/generate` (single-turn, used by AI_Visualizer)
- `/api/chat` (multi-turn with message history, used by Simon)
- Both accept `tools`, `images`, `stream`, `options`, `keep_alive` (in `/api/chat`, `images` is attached to individual messages)
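To make the request shape concrete, here is a minimal sketch of a `/api/chat` body that carries an image and a tool definition. The `build_chat_payload` helper and the `get_weather` tool are hypothetical names used only for illustration. The tool schema follows the function-style format Ollama documents for tool calling, and the placeholder bytes stand in for a real image file.

```python
import base64
import json


def build_chat_payload(model, user_text, image_bytes=None, tools=None):
    """Assemble a JSON-serializable body for POST /api/chat."""
    message = {"role": "user", "content": user_text}
    if image_bytes is not None:
        # Images travel as base64 strings inside the message itself.
        message["images"] = [base64.b64encode(image_bytes).decode("ascii")]
    payload = {
        "model": model,
        "messages": [message],
        "stream": False,      # one JSON object instead of a chunk stream
        "keep_alive": "30m",  # keep the model loaded between calls
    }
    if tools:
        payload["tools"] = tools
    return payload


# Hypothetical tool definition, for illustration only.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

payload = build_chat_payload(
    "gemma4:26b",
    "What is in this picture?",
    image_bytes=b"\x89PNG",  # placeholder bytes, not a valid image
    tools=[get_weather_tool],
)
body = json.dumps(payload)  # send this as the POST body to /api/chat
```

For multi-turn use, append each assistant reply and the next user message to `messages` before re-sending, since `/api/chat` expects the full history on every call.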

## Important Ollama Defaults to Override

| Parameter | Ollama Default | Recommended | Why |
|-----------|----------------|-------------|-----|
| `num_ctx` | 2048 | 4096-32768 | Default is absurdly small, causes truncation |
| `num_predict` | 128 | 512-4096+ | Default truncates almost all useful output |
| `think` | true (Ollama 0.20+) | false | See GOTCHAS doc |
| `keep_alive` | 5m | 30m-4h | Prevents expensive model reload between calls |
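One way to apply these overrides consistently is to centralize them in a small helper, sketched below for `/api/generate`. The specific values are picked from inside the recommended ranges and are not prescriptive. The helper names and the default server address are assumptions. Note that, as far as the Ollama API is documented, `think` and `keep_alive` are top-level request fields while `num_ctx` and `num_predict` go under `options`.

```python
import json
import urllib.request

# Values chosen from the recommended ranges above; tune per workload.
RECOMMENDED_OPTIONS = {"num_ctx": 8192, "num_predict": 2048}


def build_generate_payload(model, prompt, **option_overrides):
    """Build a /api/generate body with the defaults overridden."""
    # Merge into a new dict so the shared defaults are never mutated.
    options = {**RECOMMENDED_OPTIONS, **option_overrides}
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "think": False,      # top-level field, not part of "options"
        "keep_alive": "1h",  # top-level field, within the 30m-4h range
        "options": options,
    }


def generate(payload, base_url="http://localhost:11434"):
    """Send the payload to a local Ollama server and return the text."""
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["response"]


# Per-call override: a long-document job that needs a larger context window.
long_job = build_generate_payload("gemma4:26b", "Summarize this log.", num_ctx=32768)
```

Passing `long_job` to `generate` performs the actual call. Building the payload separately keeps the override logic testable without a running server.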