Applies SYNTHESIS.md + GOTCHAS.md findings to the OpenWebUI front-end: per-setting reference, two baked-in Workspace Model profiles (chat + extract), and a symptom→cause troubleshooting table. Front-loads the `think: false` / gemma4:26b multi-turn footgun from Round 3 of the 2026-04-18 bakeoff since that is the shape OpenWebUI users will hit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
12 KiB
Gemma 4 in OpenWebUI — Setup Guide
Derived from
SYNTHESIS.md,GOTCHAS.md, and the 2026-04-18 bakeoff. Assumes: OpenWebUI is already running andgemma4:*is pulled in the backing Ollama. Covers every setting that matters and what to set it to.
TL;DR — Just Tell Me What To Toggle
Create a Workspace Model (don't edit per-chat), pick a base variant, and set:
| Setting | Multi-turn chat (the default OpenWebUI shape) | Single-turn JSON pipeline |
|---|---|---|
| Base model | gemma4:26b (fast) or gemma4:31b-it-q4_K_M (sharper) |
same |
| System Prompt | Required — identity + boundaries + format (template below) | Required |
Context Length (num_ctx) |
32768 |
4096–16384 (scale to prompt) |
Max Tokens (num_predict) |
4096 |
2048+ |
| Temperature | 0.7 |
0.2–0.4 |
| Stream | On for text-only chat; Off if you attach a Tool | Off |
| Function Calling | Native | N/A |
Reasoning / think |
LEAVE UNSET on 26B (do NOT force off). Unset or On on 31B. | Force Off |
| Response Format / JSON mode | Off (always) | Off (always) |
| Keep Alive | 30m or -1 |
match pipeline duration (4h / -1) |
The single biggest failure mode in OpenWebUI: setting think: false on
gemma4:26b in a chat. The model silent-stops at tool-decision turns with
eval_count=4. If you see "model just stops answering after a few messages"
on 26B, check this first. See GOTCHAS.md § "think: false Kills Gemma 4 26B
in Multi-Turn Tool-Calling Loops".
Where Settings Live in OpenWebUI
OpenWebUI has four layers. Later layers override earlier ones:
- Ollama defaults — baked in, almost all wrong for Gemma 4
(
num_ctx=2048,num_predict=128,keep_alive=5m). - Admin Panel → Settings → Models — global defaults for all models. Touch this only to set sane fleet-wide floors.
- Workspace → Models → [Create/Edit] — named presets. This is where you bake Gemma 4 settings. A Workspace model = base model + system prompt + advanced params + tags + optional tool server bindings.
- Per-chat controls (right-hand panel / top of chat) — overrides for a single conversation. Useful for experimentation, bad for persistence.
Rule: every knob below goes in layer 3 (Workspace Model) unless noted. Per-chat overrides are for debugging only.
Step 1 — Create the Workspace Model
Workspace → Models → + Add Model (or Create a Model). Fill in:
- Name:
gemma4-26b-chat(or whatever matches your use case) - Base Model: pick from Ollama list. Recommended:
gemma4:26b— fastest, great defaultgemma4:31b-it-q4_K_M— sharper, 5x slower, more VRAMgemma4:e4b-it-q8_0— 12GB VRAM, vision + audio (audio via llama.cpp only)
- Description: what this preset is for. Future-you will thank you.
- Profile Image / Tags: optional.
- System Prompt: required (see Step 2).
- Advanced Params: expand and configure (see Step 3).
- Tools / Knowledge / Filters: optional — attach any tool servers here.
- Capabilities (at bottom): toggle Vision if you want image input. Gemma 4 supports vision on all variants.
Save. The model now appears in the main chat dropdown.
Step 2 — System Prompt (Required)
Gemma 4 is ultra-compliant but doesn't know who it is. A blank or generic system prompt gets you a generic chatbot — and sporadic overfiltering.
Use the template from SYNTHESIS.md:
You are [NAME], a [ROLE DESCRIPTION]. You are powered by Gemma 4.
## What You Do
- [Explicit list of responsibilities]
- [Tools you have access to and when to use each one, if any]
## What You Do NOT Do
- [Explicit list of things to refuse or avoid]
- [Common mistakes to prevent]
## Output Format
[For free-form chat: "Respond in clear Markdown with code in fenced blocks."]
[For structured output: exact schema, field names, example if complex.]
## Rules
- [Behavioral constraints]
- [Multi-step chaining instructions if using tools]
Today's date: 2026-04-18
Principles:
- Identity first.
- Positive instructions (what to do) before negative (what not to do).
- Output format is explicit.
- Don't use language that sounds like you're asking the model to bypass restrictions — just state the task directly (safety overfilter trigger).
Step 3 — Advanced Params Reference
Expand Advanced Params in the Workspace Model editor. Every field, what to set, and why.
Sampling / Output Shape
| Field | Ollama default | Set to | Why |
|---|---|---|---|
| Temperature | 0.8 | 0.7 (chat) / 0.3 (extraction) / 0.2 (scoring) |
Per SYNTHESIS.md temperature table. |
| Top K | 40 | leave default | Gemma 4 is well-behaved at default. |
| Top P | 0.9 | leave default | Same. |
| Min P | 0.0 | leave default | Same. |
| Seed | random | leave blank | Set only for A/B reproduction. |
| Stop Sequences | none | leave blank | Gemma 4 emits proper EOS. |
| Mirostat / Eta / Tau | off | leave off | Not needed; Min P / Top P work fine. |
| Frequency Penalty | 0 | leave 0 | Any value biases style for little gain. |
| Repeat Penalty | 1.1 | leave default | Fine at default. |
| Repeat Last N | 64 | leave default | Fine at default. |
| Presence Penalty | 0 | leave 0 | Same as frequency. |
Context / Memory — these are the ones that bite
| Field | Ollama default | Set to | Why |
|---|---|---|---|
Context Length (num_ctx) |
2048 | 32768 chat / 4096–16384 pipeline |
Default truncates mid-system-prompt. GOTCHAS.md § "Ollama Default Context is 2048". |
Max Tokens (num_predict) |
128 | 4096 chat / 2048+ JSON |
Default truncates any useful reply. GOTCHAS.md § "num_predict Default is 128". |
Batch Size (num_batch) |
512 | leave default | Prompt-eval throughput; no Gemma 4 issue. |
Tokens to Keep (num_keep) |
4 | leave default | System-prompt header anchor. |
| Use Mmap | on | leave on | Standard. |
| Use Mlock | off | leave off | Standard. |
Threads (num_thread) |
auto | leave default | Ollama picks. |
| Keep Alive | 5m | 30m for chat, 4h or -1 for pipelines |
Default unloads the model between messages — ~10–30s reload penalty. GOTCHAS.md § "Keep-Alive Too Short". |
Reasoning / Thinking — the OpenWebUI 26B killer
| Field | Set to | Why |
|---|---|---|
| Reasoning / Thinking / Think toggle | Leave UNSET on 26B in chat. Optional on 31B. Force Off only in single-turn JSON pipelines. | Ollama 0.20+ defaults think: true. On gemma4:26b in multi-turn chat, forcing think: false causes silent stops (eval_count=4, empty content, no tool call) — the model just… stops. Verified 2026-04-18. 31B and Qwen3-Coder tolerate the flag. In single-turn JSON pipelines (AI_Visualizer shape) the old advice still applies: force off so thinking tokens don't eat num_predict. See GOTCHAS.md § "think: false Kills Gemma 4 26B" and § "Thinking Mode Eats Context". |
OpenWebUI exposes this as a "Reasoning" toggle or a raw
thinkfield depending on version. If your version exposes it as tri-state (On / Off / Default), pick Default on 26B chat. If it's binary (On / Off), leave it On on 26B chat. Never Off on 26B chat.
Response Format — never use JSON mode
| Field | Set to | Why |
|---|---|---|
| Response Format / Format = JSON | Off / None / Text | Ollama's server-side format: "json" enforcer causes Gemma 4 26B to infinite-loop on nested schemas. Ask for JSON in the prompt and parse client-side. GOTCHAS.md § "format=json Causes Infinite Loops". |
Streaming & Function Calling
| Field | Set to | Why |
|---|---|---|
| Stream Chat Response | On for text-only chat. Off if you've attached Tools. | Ollama v0.20.0 drops tool calls on streaming responses (community-reported, and matches Simon's non-streaming choice). GOTCHAS.md § "Tool Calling Broken in Ollama v0.20.0 Streaming". |
| Function Calling | Native if you're attaching tools; otherwise Default / off. |
Native uses Ollama's /api/chat tool_calls field. Gemma 4 has a native tool-calling token format. |
Vision
Enable the Vision capability (bottom of model editor). All Gemma 4
variants support vision. Paste or upload images in chat. Works great for
description; unreliable for subjective quality scoring (see
GOTCHAS.md § "Vision Validator Overrejects").
Audio (E-series only)
26B / 31B have no audio encoder. Only gemma4:e4b-it-* variants support
audio, and currently only via llama.cpp — Ollama doesn't pipe audio through
OpenWebUI today. Skip this in OpenWebUI for now.
Step 4 — Global Admin Defaults (Optional Floor)
Admin Panel → Settings → Models sets defaults that apply when a Workspace
Model doesn't override. Set these as a safety net for ad-hoc chats against
gemma4:* base models directly:
- Default Context Length: 8192
- Default Max Tokens: 2048
- Default Keep Alive: 30m
These are only floors. The Workspace Model's explicit settings still take over for named presets.
Two Profiles Worth Baking
Profile A: gemma4-26b-chat (default daily driver)
- Base:
gemma4:26b - System Prompt: "You are a helpful assistant powered by Gemma 4. Respond in clear Markdown. Use fenced code blocks for code. Today's date: …"
- Temp
0.7,num_ctx 32768,num_predict 4096 - Reasoning: Default (unset) or On — never Off
- Stream On, Format Off, Keep Alive
30m
Profile B: gemma4-26b-extract (structured output)
- Base:
gemma4:26b - System Prompt: explicit schema with "Respond with ONLY JSON. No prose."
- Temp
0.3,num_ctx 8192,num_predict 2048 - Reasoning: Off (single-turn — thinking would eat
num_predict) - Stream Off, Format Off (still!), Keep Alive
1h - Parse client-side with the regex pattern in
SYNTHESIS.md.
For tool-using agent chats, Profile A is correct — don't flip Reasoning off.
Troubleshooting Map
| Symptom | Most likely cause | Fix |
|---|---|---|
| 26B "stops answering" mid-conversation, blank reply | think: false in payload |
Set Reasoning to Default/On in Workspace Model |
| Reply truncates mid-sentence | num_predict too low |
Bump to 4096 |
| Long prompt ignored / forgets system prompt | num_ctx too low |
Set 32768 |
| JSON request hangs forever | Response Format = JSON | Turn it off; parse client-side |
| Tool call not fired despite model "deciding" to call it | Streaming + tool call, Ollama v0.20.0 | Disable Stream when tools attached |
| 10–30s latency on first message after idle | keep_alive default 5m |
Set 30m or -1 |
| Model generic / no personality / confuses identity | Empty or weak system prompt | Use the template in Step 2 |
| 31B hangs at long prompts | Flash Attention + 31B Dense + >3–4K tokens on 3090 class | Use 26B for long prompts, or disable FA in Ollama |
| Chat refuses a benign technical prompt | Safety overfilter | Rephrase; state task directly without "bypass"/"ignore" language |
What This Doc Does Not Cover
- Installing Ollama or OpenWebUI — assumed done.
- Pulling Gemma 4 models —
ollama pull gemma4:26boutside scope. - Tool server development — see
CORPUS_tool_calling_format.mdand Simon (~/bin/FreibergFamily/simon/). - Embeddings / retrieval — Gemma 4 has no embedding mode; use
embeddinggemma(308M) as a sibling model. - Fine-tuning — see
GOTCHAS.md§ "Fine-Tuning Ecosystem Issues" andtooling/fine-tuning/recipe-recommendation.md.
Related Docs in This Repo
SYNTHESIS.md— opinionated guide this doc is derived from.GOTCHAS.md— every known issue, severity-ranked.CORPUS_ollama_variants.md— model inventory, VRAM, Ollama settings.docs/reference/bakeoff-2026-04-18.md— thethink: false/ 26B evidence trail.CORPUS_cli_coding_agent.md— if the OpenWebUI chat is really an agent front-end, read this for model-choice nuance.