d9477da52a
Applies SYNTHESIS.md + GOTCHAS.md findings to the OpenWebUI front-end: per-setting reference, two baked-in Workspace Model profiles (chat + extract), and a symptom→cause troubleshooting table. Front-loads the `think: false` / gemma4:26b multi-turn footgun from Round 3 of the 2026-04-18 bakeoff since that is the shape OpenWebUI users will hit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
258 lines
12 KiB
Markdown
258 lines
12 KiB
Markdown
# Gemma 4 in OpenWebUI — Setup Guide
|
||
|
||
> Derived from `SYNTHESIS.md`, `GOTCHAS.md`, and the 2026-04-18 bakeoff.
|
||
> Assumes: OpenWebUI is already running and `gemma4:*` is pulled in the
|
||
> backing Ollama. Covers every setting that matters and what to set it to.
|
||
|
||
## TL;DR — Just Tell Me What To Toggle
|
||
|
||
Create a **Workspace Model** (don't edit per-chat), pick a base variant, and set:
|
||
|
||
| Setting | Multi-turn chat (the default OpenWebUI shape) | Single-turn JSON pipeline |
|
||
|---|---|---|
|
||
| Base model | `gemma4:26b` (fast) or `gemma4:31b-it-q4_K_M` (sharper) | same |
|
||
| System Prompt | **Required** — identity + boundaries + format (template below) | **Required** |
|
||
| Context Length (`num_ctx`) | `32768` | `4096`–`16384` (scale to prompt) |
|
||
| Max Tokens (`num_predict`) | `4096` | `2048`+ |
|
||
| Temperature | `0.7` | `0.2`–`0.4` |
|
||
| Stream | **On** for text-only chat; **Off** if you attach a Tool | Off |
|
||
| Function Calling | Native | N/A |
|
||
| Reasoning / `think` | **LEAVE UNSET** on 26B (do NOT force off). Unset or On on 31B. | **Force Off** |
|
||
| Response Format / JSON mode | **Off** (always) | **Off** (always) |
|
||
| Keep Alive | `30m` or `-1` | match pipeline duration (`4h` / `-1`) |
|
||
|
||
**The single biggest failure mode in OpenWebUI:** setting `think: false` on
|
||
`gemma4:26b` in a chat. The model silent-stops at tool-decision turns with
|
||
`eval_count=4`. If you see "model just stops answering after a few messages"
|
||
on 26B, check this first. See `GOTCHAS.md` § "think: false Kills Gemma 4 26B
|
||
in Multi-Turn Tool-Calling Loops".
|
||
|
||
---
|
||
|
||
## Where Settings Live in OpenWebUI
|
||
|
||
OpenWebUI has four layers. Later layers override earlier ones:
|
||
|
||
1. **Ollama defaults** — baked in, almost all wrong for Gemma 4
|
||
(`num_ctx=2048`, `num_predict=128`, `keep_alive=5m`).
|
||
2. **Admin Panel → Settings → Models** — global defaults for all models.
|
||
Touch this only to set sane fleet-wide floors.
|
||
3. **Workspace → Models → [Create/Edit]** — named presets. **This is where
|
||
you bake Gemma 4 settings.** A Workspace model = base model + system
|
||
prompt + advanced params + tags + optional tool server bindings.
|
||
4. **Per-chat controls** (right-hand panel / top of chat) — overrides for
|
||
a single conversation. Useful for experimentation, bad for persistence.
|
||
|
||
**Rule:** every knob below goes in layer 3 (Workspace Model) unless noted.
|
||
Per-chat overrides are for debugging only.
|
||
|
||
---
|
||
|
||
## Step 1 — Create the Workspace Model
|
||
|
||
Workspace → Models → **+ Add Model** (or **Create a Model**). Fill in:
|
||
|
||
- **Name**: `gemma4-26b-chat` (or whatever matches your use case)
|
||
- **Base Model**: pick from Ollama list. Recommended:
|
||
- `gemma4:26b` — fastest, great default
|
||
- `gemma4:31b-it-q4_K_M` — sharper, 5x slower, more VRAM
|
||
- `gemma4:e4b-it-q8_0` — 12GB VRAM, vision + audio (audio via llama.cpp only)
|
||
- **Description**: what this preset is for. Future-you will thank you.
|
||
- **Profile Image / Tags**: optional.
|
||
- **System Prompt**: **required** (see Step 2).
|
||
- **Advanced Params**: expand and configure (see Step 3).
|
||
- **Tools / Knowledge / Filters**: optional — attach any tool servers here.
|
||
- **Capabilities** (at bottom): toggle Vision if you want image input. Gemma
|
||
4 supports vision on all variants.
|
||
|
||
Save. The model now appears in the main chat dropdown.
|
||
|
||
---
|
||
|
||
## Step 2 — System Prompt (Required)
|
||
|
||
Gemma 4 is ultra-compliant but doesn't know who it is. A blank or generic
|
||
system prompt gets you a generic chatbot — and sporadic overfiltering.
|
||
|
||
Use the template from `SYNTHESIS.md`:
|
||
|
||
```
|
||
You are [NAME], a [ROLE DESCRIPTION]. You are powered by Gemma 4.
|
||
|
||
## What You Do
|
||
- [Explicit list of responsibilities]
|
||
- [Tools you have access to and when to use each one, if any]
|
||
|
||
## What You Do NOT Do
|
||
- [Explicit list of things to refuse or avoid]
|
||
- [Common mistakes to prevent]
|
||
|
||
## Output Format
|
||
[For free-form chat: "Respond in clear Markdown with code in fenced blocks."]
|
||
[For structured output: exact schema, field names, example if complex.]
|
||
|
||
## Rules
|
||
- [Behavioral constraints]
|
||
- [Multi-step chaining instructions if using tools]
|
||
|
||
Today's date: 2026-04-18
|
||
```
|
||
|
||
Principles:
|
||
1. Identity first.
|
||
2. Positive instructions (what to do) before negative (what not to do).
|
||
3. Output format is explicit.
|
||
4. Don't use language that sounds like you're asking the model to bypass
|
||
restrictions — just state the task directly (safety overfilter trigger).
|
||
|
||
---
|
||
|
||
## Step 3 — Advanced Params Reference
|
||
|
||
Expand **Advanced Params** in the Workspace Model editor. Every field, what
|
||
to set, and why.
|
||
|
||
### Sampling / Output Shape
|
||
|
||
| Field | Ollama default | Set to | Why |
|
||
|---|---|---|---|
|
||
| **Temperature** | 0.8 | `0.7` (chat) / `0.3` (extraction) / `0.2` (scoring) | Per `SYNTHESIS.md` temperature table. |
|
||
| **Top K** | 40 | leave default | Gemma 4 is well-behaved at default. |
|
||
| **Top P** | 0.9 | leave default | Same. |
|
||
| **Min P** | 0.0 | leave default | Same. |
|
||
| **Seed** | random | leave blank | Set only for A/B reproduction. |
|
||
| **Stop Sequences** | none | leave blank | Gemma 4 emits proper EOS. |
|
||
| **Mirostat / Eta / Tau** | off | leave off | Not needed; Min P / Top P work fine. |
|
||
| **Frequency Penalty** | 0 | leave 0 | Any value biases style for little gain. |
|
||
| **Repeat Penalty** | 1.1 | leave default | Fine at default. |
|
||
| **Repeat Last N** | 64 | leave default | Fine at default. |
|
||
| **Presence Penalty** | 0 | leave 0 | Same as frequency. |
|
||
|
||
### Context / Memory — **these are the ones that bite**
|
||
|
||
| Field | Ollama default | Set to | Why |
|
||
|---|---|---|---|
|
||
| **Context Length** (`num_ctx`) | **2048** | `32768` chat / `4096`–`16384` pipeline | Default truncates mid-system-prompt. `GOTCHAS.md` § "Ollama Default Context is 2048". |
|
||
| **Max Tokens** (`num_predict`) | **128** | `4096` chat / `2048`+ JSON | Default truncates any useful reply. `GOTCHAS.md` § "num_predict Default is 128". |
|
||
| **Batch Size** (`num_batch`) | 512 | leave default | Prompt-eval throughput; no Gemma 4 issue. |
|
||
| **Tokens to Keep** (`num_keep`) | 4 | leave default | System-prompt header anchor. |
|
||
| **Use Mmap** | on | leave on | Standard. |
|
||
| **Use Mlock** | off | leave off | Standard. |
|
||
| **Threads** (`num_thread`) | auto | leave default | Ollama picks. |
|
||
| **Keep Alive** | 5m | `30m` for chat, `4h` or `-1` for pipelines | Default unloads the model between messages — `~10–30s` reload penalty. `GOTCHAS.md` § "Keep-Alive Too Short". |
|
||
|
||
### Reasoning / Thinking — **the OpenWebUI 26B killer**
|
||
|
||
| Field | Set to | Why |
|
||
|---|---|---|
|
||
| **Reasoning / Thinking** / **Think** toggle | **Leave UNSET** on 26B in chat. Optional on 31B. **Force Off** only in single-turn JSON pipelines. | Ollama 0.20+ defaults `think: true`. On `gemma4:26b` in multi-turn chat, forcing `think: false` causes silent stops (`eval_count=4`, empty content, no tool call) — the model just… stops. Verified 2026-04-18. 31B and Qwen3-Coder tolerate the flag. In single-turn JSON pipelines (AI_Visualizer shape) the old advice still applies: force off so thinking tokens don't eat `num_predict`. See `GOTCHAS.md` § "think: false Kills Gemma 4 26B" and § "Thinking Mode Eats Context". |
|
||
|
||
> OpenWebUI exposes this as a **"Reasoning"** toggle or a raw **`think`**
|
||
> field depending on version. If your version exposes it as tri-state
|
||
> (On / Off / Default), pick **Default** on 26B chat. If it's binary
|
||
> (On / Off), leave it **On** on 26B chat. **Never Off on 26B chat.**
|
||
|
||
### Response Format — **never use JSON mode**
|
||
|
||
| Field | Set to | Why |
|
||
|---|---|---|
|
||
| **Response Format** / **Format = JSON** | **Off / None / Text** | Ollama's server-side `format: "json"` enforcer causes Gemma 4 26B to infinite-loop on nested schemas. Ask for JSON in the prompt and parse client-side. `GOTCHAS.md` § "format=json Causes Infinite Loops". |
|
||
|
||
### Streaming & Function Calling
|
||
|
||
| Field | Set to | Why |
|
||
|---|---|---|
|
||
| **Stream Chat Response** | **On** for text-only chat. **Off** if you've attached Tools. | Ollama v0.20.0 drops tool calls on streaming responses (community-reported, and matches Simon's non-streaming choice). `GOTCHAS.md` § "Tool Calling Broken in Ollama v0.20.0 Streaming". |
|
||
| **Function Calling** | `Native` if you're attaching tools; otherwise `Default` / off. | Native uses Ollama's `/api/chat` tool_calls field. Gemma 4 has a native tool-calling token format. |
|
||
|
||
### Vision
|
||
|
||
Enable the **Vision** capability (bottom of model editor). All Gemma 4
|
||
variants support vision. Paste or upload images in chat. Works great for
|
||
description; **unreliable for subjective quality scoring** (see
|
||
`GOTCHAS.md` § "Vision Validator Overrejects").
|
||
|
||
### Audio (E-series only)
|
||
|
||
26B / 31B have no audio encoder. Only `gemma4:e4b-it-*` variants support
|
||
audio, and currently only via llama.cpp — Ollama doesn't pipe audio through
|
||
OpenWebUI today. Skip this in OpenWebUI for now.
|
||
|
||
---
|
||
|
||
## Step 4 — Global Admin Defaults (Optional Floor)
|
||
|
||
Admin Panel → Settings → Models sets defaults that apply when a Workspace
|
||
Model doesn't override. Set these as a safety net for ad-hoc chats against
|
||
`gemma4:*` base models directly:
|
||
|
||
- Default Context Length: **8192**
|
||
- Default Max Tokens: **2048**
|
||
- Default Keep Alive: **30m**
|
||
|
||
These are only floors. The Workspace Model's explicit settings still take
|
||
over for named presets.
|
||
|
||
---
|
||
|
||
## Two Profiles Worth Baking
|
||
|
||
### Profile A: `gemma4-26b-chat` (default daily driver)
|
||
|
||
- Base: `gemma4:26b`
|
||
- System Prompt: "You are a helpful assistant powered by Gemma 4. Respond
|
||
in clear Markdown. Use fenced code blocks for code. Today's date: …"
|
||
- Temp `0.7`, `num_ctx 32768`, `num_predict 4096`
|
||
- Reasoning: **Default** (unset) or On — **never Off**
|
||
- Stream On, Format Off, Keep Alive `30m`
|
||
|
||
### Profile B: `gemma4-26b-extract` (structured output)
|
||
|
||
- Base: `gemma4:26b`
|
||
- System Prompt: explicit schema with "Respond with ONLY JSON. No prose."
|
||
- Temp `0.3`, `num_ctx 8192`, `num_predict 2048`
|
||
- Reasoning: **Off** (single-turn — thinking would eat `num_predict`)
|
||
- Stream Off, Format **Off** (still!), Keep Alive `1h`
|
||
- Parse client-side with the regex pattern in `SYNTHESIS.md`.
|
||
|
||
For tool-using agent chats, Profile A is correct — don't flip Reasoning off.
|
||
|
||
---
|
||
|
||
## Troubleshooting Map
|
||
|
||
| Symptom | Most likely cause | Fix |
|
||
|---|---|---|
|
||
| 26B "stops answering" mid-conversation, blank reply | `think: false` in payload | Set Reasoning to Default/On in Workspace Model |
|
||
| Reply truncates mid-sentence | `num_predict` too low | Bump to 4096 |
|
||
| Long prompt ignored / forgets system prompt | `num_ctx` too low | Set 32768 |
|
||
| JSON request hangs forever | Response Format = JSON | Turn it off; parse client-side |
|
||
| Tool call not fired despite model "deciding" to call it | Streaming + tool call, Ollama v0.20.0 | Disable Stream when tools attached |
|
||
| 10–30s latency on first message after idle | `keep_alive` default 5m | Set `30m` or `-1` |
|
||
| Model generic / no personality / confuses identity | Empty or weak system prompt | Use the template in Step 2 |
|
||
| 31B hangs at long prompts | Flash Attention + 31B Dense + >3–4K tokens on 3090 class | Use 26B for long prompts, or disable FA in Ollama |
|
||
| Chat refuses a benign technical prompt | Safety overfilter | Rephrase; state task directly without "bypass"/"ignore" language |
|
||
|
||
---
|
||
|
||
## What This Doc Does Not Cover
|
||
|
||
- **Installing Ollama or OpenWebUI** — assumed done.
|
||
- **Pulling Gemma 4 models** — `ollama pull gemma4:26b` outside scope.
|
||
- **Tool server development** — see `CORPUS_tool_calling_format.md` and
|
||
Simon (`~/bin/FreibergFamily/simon/`).
|
||
- **Embeddings / retrieval** — Gemma 4 has no embedding mode; use
|
||
`embeddinggemma` (308M) as a sibling model.
|
||
- **Fine-tuning** — see `GOTCHAS.md` § "Fine-Tuning Ecosystem Issues" and
|
||
`tooling/fine-tuning/recipe-recommendation.md`.
|
||
|
||
## Related Docs in This Repo
|
||
|
||
- `SYNTHESIS.md` — opinionated guide this doc is derived from.
|
||
- `GOTCHAS.md` — every known issue, severity-ranked.
|
||
- `CORPUS_ollama_variants.md` — model inventory, VRAM, Ollama settings.
|
||
- `docs/reference/bakeoff-2026-04-18.md` — the `think: false` / 26B
|
||
evidence trail.
|
||
- `CORPUS_cli_coding_agent.md` — if the OpenWebUI chat is really an
|
||
agent front-end, read this for model-choice nuance.
|