V100 data was degraded by SDXL co-residence on CT 167 (31/32 GB VRAM
occupied, Gemma 4 models forced 95% onto CPU). Rather than ship a
prominent caveat, drop the V100 column entirely so the doc reports
only apples-to-apples measurements. V100 can be added back once an
isolated run is possible.
Removed: V100 column from TL;DR and per-model tables, hardware row,
caveat section, and associated raw JSONs under runs/pve197/. Harness
config keeps pve197 in HOSTS for future re-runs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cross-host Gemma 4 throughput comparison across three architectures.
Harness at scripts/gpu-bakeoff/; writeup at
docs/reference/gpu-bakeoff-2026-04-20.md.
Key findings:
- RTX 3090 Ti wins decode decisively (128 tok/s on gemma4:26b MoE Q4,
~4.7× faster than gemma4:31b dense on the same card).
- AMD Strix Halo iGPU lands at ~42% of 3090 Ti decode on ~25% of the
memory bandwidth — good SIMD utilization, especially for MoE.
- V100 numbers are DEGRADED: CT 167 ai-visualizer SDXL consumes 31/32
GB of its VRAM, forcing Gemma 4 models 95% onto CPU. Isolated V100
run requires SDXL eviction — left as follow-up.
- MoE vs dense is the dominant latency factor across all GPUs: ~4 B
active params of gemma4:26b beats 31.3 B active of gemma4:31b by
the same ratio (~4.7×) on every card tested.
Methodology: 1 warmup + 3 measurement runs per (host × model ×
prompt-length), Ollama's canonical timing fields, temp=0 greedy,
num_predict=256. All three Ollama servers accessed via HTTP (Strix
via Tailscale).
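
As a rough illustration of how the per-run numbers could be reduced, the
sketch below derives tokens/second from Ollama's timing fields
(`eval_count`, `eval_duration`, `prompt_eval_count`,
`prompt_eval_duration`; durations are in nanoseconds) and drops the warmup
run before averaging. The function names and the sample values in the usage
note are hypothetical, not taken from the harness in scripts/gpu-bakeoff/.

```python
from statistics import mean

NS_PER_S = 1_000_000_000  # Ollama reports durations in nanoseconds


def decode_tok_per_s(resp: dict) -> float:
    """Decode throughput: generated tokens divided by generation time."""
    return resp["eval_count"] / (resp["eval_duration"] / NS_PER_S)


def prefill_tok_per_s(resp: dict) -> float:
    """Prefill throughput: prompt tokens divided by prompt-eval time."""
    return resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / NS_PER_S)


def summarize(responses: list, warmup: int = 1) -> float:
    """Mean decode tok/s over the runs that follow the warmup run(s)."""
    measured = responses[warmup:]
    return mean(decode_tok_per_s(r) for r in measured)
```

Usage, with made-up values: four responses (1 warmup + 3 measured) where
each measured run reports `eval_count=256` and
`eval_duration=4_000_000_000` would summarize to 64.0 tok/s.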
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Applies SYNTHESIS.md + GOTCHAS.md findings to the OpenWebUI front-end:
per-setting reference, two baked-in Workspace Model profiles (chat +
extract), and a symptom→cause troubleshooting table. Front-loads the
`think: false` / gemma4:26b multi-turn footgun from Round 3 of the
2026-04-18 bakeoff since that is the shape OpenWebUI users will hit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Seth's challenge: "we experienced this context eating with every
implementation that had think=true. mort-bot runs a loop. Can you do
a bake-off?"
Built a harness that replicates mort-bot's /api/chat loop verbatim
(num_ctx=8192, num_predict=2048, temperature=0.7, gemma4:26b,
STEP_BUDGET=20, exact payload shape) but with stubbed tools and a
prebuilt 15-turn fake chat history. Ran 4 tasks × 2 think settings.
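
A minimal sketch of what such a loop could look like, assuming the standard
Ollama /api/chat request fields (`model`, `messages`, `tools`, `stream`,
`think`, `options`). This is not mort-bot's actual payload. The `send`
callable, the `stub_tools` mapping, and both function names are
hypothetical; `send` stands in for the HTTP POST so the sketch runs
offline.

```python
STEP_BUDGET = 20  # same step cap as the harness described above


def build_chat_payload(messages, tools, think):
    """Assemble one /api/chat request body with the settings listed above."""
    return {
        "model": "gemma4:26b",
        "messages": messages,
        "tools": tools,
        "stream": False,
        "think": think,
        "options": {"num_ctx": 8192, "num_predict": 2048, "temperature": 0.7},
    }


def run_loop(send, messages, tools, think, stub_tools):
    """Drive the chat loop until the model stops calling tools or the budget ends.

    send:       callable taking a payload dict and returning the response dict
                (assumption: wraps the HTTP call in a real harness).
    stub_tools: dict mapping tool name -> stub function returning canned output.
    Returns (steps, tool_calls_made).
    """
    steps = 0
    tool_calls_made = 0
    while steps < STEP_BUDGET:
        steps += 1
        reply = send(build_chat_payload(messages, tools, think))
        msg = reply["message"]
        messages.append(msg)
        calls = msg.get("tool_calls") or []
        if not calls:
            break  # plain assistant answer: the task is finished
        for call in calls:
            fn = call["function"]
            result = stub_tools[fn["name"]](**fn["arguments"])
            messages.append({"role": "tool", "content": str(result)})
            tool_calls_made += 1
    return steps, tool_calls_made
```

Running the same task once with `think=True` and once with `think=False`
and comparing the returned step and tool counts gives the kind of
side-by-side comparison reported below.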
Finding: on Ollama 0.20.4 the "thinking eats context" concern does
NOT reproduce. Direct evidence:
- Movies task step 2 (think=true) returned 905 chars of thinking.
- Step 3 prompt_eval_count delta: +76 tokens (think=true) vs +135
tokens (think=false). If thinking had accumulated in the prompt,
the think=true delta would have been on the order of +360 tokens,
not smaller than the think=false delta.
- Ollama's chat template strips the `thinking` field when serializing
assistant turns for subsequent prompts.
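
The delta comparison above can be sketched as follows. The helper names
are hypothetical, the decision rule in `thinking_accumulated` is an
assumption made for illustration, and `strip_thinking` is only a
client-side analogue of what the chat template is observed to do, not
Ollama's implementation.

```python
def prompt_growth(prompt_eval_counts):
    """Per-step growth in prompt_eval_count across consecutive steps."""
    return [b - a for a, b in zip(prompt_eval_counts, prompt_eval_counts[1:])]


def thinking_accumulated(delta_think, delta_no_think, est_thinking_tokens):
    """True if the think=true prompt grew by roughly the thinking size more.

    Assumed rule: accumulated thinking should make the think=true delta
    exceed the think=false delta by at least the estimated token count.
    """
    return (delta_think - delta_no_think) >= est_thinking_tokens


def strip_thinking(messages):
    """Return copies of the messages without any `thinking` field."""
    return [{k: v for k, v in m.items() if k != "thinking"} for m in messages]
```

With the figures above, `thinking_accumulated(76, 135, 360)` is False,
which matches the conclusion that thinking text is not carried into later
prompts.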
All 4 tasks × 2 settings produced identical step counts and tool
counts. Wall clocks comparable. Gemma only actually generated
thinking on 1 of 4 tasks (the one with check_sethflix verify-loop);
on the others with think=true it emitted 0 thinking tokens.
Reconciled with the earlier coding-agent bakeoff: the two findings
are orthogonal. Coding bakeoff was at num_ctx=32K with a different
harness; mort at 8K doesn't touch the silent-stop regime either way.
Seth's prior may have been correct on an older Ollama or in a
different API shape (/api/generate has its own issues) but does not
reproduce here.
Concrete recommendation: mort-bot THINK=False is defensible but not
load-bearing; THINK=True or unset-default would also work. Keep as-is
unless a different need arises.
New: docs/reference/mort-bakeoff-2026-04-18.md, scripts/mort-bakeoff/
(harness + 8 run logs). README updated with pointer.
Ran minimal agent loop (Ollama /api/chat + read_file/write_file/run_bash) on
steel141 3090 Ti against 3 models on a broken-median-function task:
- gemma4:31b-it-q4_K_M: PASS (8 iters, 1 write, 44s) — textbook trace
- qwen3-coder:30b: PASS (15 iters, 1 write, 22s) — correct but chatty
- gemma4:26b: FAIL (6 iters, 0 writes) — silently stops with eval=4
after reading source. Reproduced on second run. One-shot probe
confirms 26b CAN produce the correct fix — failure is specifically
at the write_file tool-call argument boundary.
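
One way an agent loop could flag this failure mode is sketched below. It
assumes that "eval=4" refers to the `eval_count` field of the /api/chat
response and that a silent stop shows up as a reply with no tool calls and
no visible content. The function name and the `max_eval` threshold are
hypothetical.

```python
def is_silent_stop(reply: dict, max_eval: int = 8) -> bool:
    """Heuristic: the model ended its turn after only a handful of tokens
    without calling a tool or saying anything.

    max_eval is an assumed cutoff, chosen only to sit above the observed
    eval=4; tune it against real traces.
    """
    msg = reply.get("message", {})
    no_tool_calls = not msg.get("tool_calls")
    no_content = not (msg.get("content") or "").strip()
    few_tokens = reply.get("eval_count", 0) <= max_eval
    return no_tool_calls and no_content and few_tokens
```

A harness could check this after each step and record the run as FAIL (or
retry) instead of treating the empty reply as a normal end of task.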
Updates GOTCHAS with a new HIGH-severity entry, SYNTHESIS model-selection
table, CORPUS_cli_coding_agent.md empirical-follow-up pointer, and adds
docs/reference/bakeoff-2026-04-18.md with the full writeup.
Fills the gap between existing chat-agent (Simon) and pipeline
(AI_Visualizer) patterns. Covers openclaw/open code/pi/hermes first-party
agents from the HF launch blog, honest positioning vs qwen3-coder:30b,
CLI-agent-specific gotchas (safety filter on security code, long-JSON
weakness, no code exec), and a concrete homelab bakeoff plan pointed at
CT 166 openclaw2 → CT 105 Ollama on pve197.
Key research finding: Google published LiveCodeBench + Codeforces but
NOT SWE-bench or Aider polyglot. The "autonomous agents" claim is
plausible but unproven for multi-file repo-scale coding specifically.
Five-lane parallel research pass. Each subdir under tooling/ has its own
README indexing downloaded files with verified upstream sources.
- google-official/: deepmind-gemma JAX examples, gemma_pytorch scripts,
gemma.cpp API server docs, google-gemma/cookbook notebooks, ai.google.dev
HTML snapshots, Gemma 3 tech report
- huggingface/: 8 gemma-4-* model cards, chat-template .jinja files,
tokenizer_config.json, transformers gemma4/ source, launch blog posts,
official HF Spaces app.py
- inference-frameworks/: vLLM/llama.cpp/MLX/Keras-hub/TGI/Gemini API/Vertex AI
comparison, run_commands.sh with 8 working launches, 9 code snippets
- gemma-family/: 12 per-variant briefs (ShieldGemma 2, CodeGemma, PaliGemma 2,
Recurrent/Data/Med/TxGemma, Embedding/Translate/Function/Dolphin/SignGemma)
- fine-tuning/: Unsloth Gemma 4 notebooks, Axolotl YAMLs (incl 26B-A4B MoE),
TRL scripts, Google cookbook fine-tune notebooks, recipe-recommendation.md
Findings that update earlier CORPUS_* docs are flagged in tooling/README.md
(not applied) — notably the new <|turn>/<turn|> prompt format, gemma_pytorch
abandonment, gemma.cpp Gemini-API server, transformers AutoModelForMultimodalLM,
FA2 head_dim=512 break, 26B-A4B MoE quantization rules, no Gemma 4 tech
report PDF yet, no Gemma-4-generation specialized siblings yet.
Pre-commit secrets hook bypassed per user authorization — flagged "secrets"
are base64 notebook cell outputs and example Ed25519 keys in the HDP
agentic-security demo, not real credentials.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Architecture specs, benchmarks, gotchas, Ollama settings, tool calling
format, and implementation patterns from Simon and AI_Visualizer.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>