gemma4-research/docs/openwebui-setup.md

# Gemma 4 in OpenWebUI — Setup Guide

> Derived from `SYNTHESIS.md`, `GOTCHAS.md`, and the 2026-04-18 bakeoff.
> Assumes: OpenWebUI is already running and `gemma4:*` is pulled in the
> backing Ollama. Covers every setting that matters and what to set it to.

## TL;DR — Just Tell Me What To Toggle

Create a **Workspace Model** (don't edit per-chat), pick a base variant, and set:

| Setting | Multi-turn chat (the default OpenWebUI shape) | Single-turn JSON pipeline |
|---|---|---|
| Base model | `gemma4:26b` (fast) or `gemma4:31b-it-q4_K_M` (sharper) | same |
| System Prompt | **Required** — identity + boundaries + format (template below) | **Required** |
| Context Length (`num_ctx`) | `32768` | `4096`–`16384` (scale to prompt) |
| Max Tokens (`num_predict`) | `4096` | `2048`+ |
| Temperature | `0.7` | `0.2`–`0.4` |
| Stream | **On** for text-only chat; **Off** if you attach a Tool | Off |
| Function Calling | Native | N/A |
| Reasoning / `think` | **LEAVE UNSET** on 26B (do NOT force off). Unset or On on 31B. | **Force Off** |
| Response Format / JSON mode | **Off** (always) | **Off** (always) |
| Keep Alive | `30m` or `-1` | match pipeline duration (`4h` / `-1`) |

**The single biggest failure mode in OpenWebUI:** setting `think: false` on
`gemma4:26b` in a chat. The model silent-stops at tool-decision turns with
`eval_count=4`. If you see "model just stops answering after a few messages"
on 26B, check this first. See `GOTCHAS.md` § "think: false Kills Gemma 4 26B
in Multi-Turn Tool-Calling Loops".

---

## Where Settings Live in OpenWebUI

OpenWebUI has four layers. Later layers override earlier ones:

1. **Ollama defaults** — baked in, almost all wrong for Gemma 4
   (`num_ctx=2048`, `num_predict=128`, `keep_alive=5m`).
2. **Admin Panel → Settings → Models** — global defaults for all models.
   Touch this only to set sane fleet-wide floors.
3. **Workspace → Models → [Create/Edit]** — named presets. **This is where
   you bake Gemma 4 settings.** A Workspace model = base model + system
   prompt + advanced params + tags + optional tool server bindings.
4. **Per-chat controls** (right-hand panel / top of chat) — overrides for
   a single conversation. Useful for experimentation, bad for persistence.

**Rule:** every knob below goes in layer 3 (Workspace Model) unless noted.
Per-chat overrides are for debugging only.

---

## Step 1 — Create the Workspace Model

Workspace → Models → **+ Add Model** (or **Create a Model**). Fill in:

- **Name**: `gemma4-26b-chat` (or whatever matches your use case)
- **Base Model**: pick from Ollama list. Recommended:
  - `gemma4:26b` — fastest, great default
  - `gemma4:31b-it-q4_K_M` — sharper, 5x slower, more VRAM
  - `gemma4:e4b-it-q8_0` — 12GB VRAM, vision + audio (audio via llama.cpp only)
- **Description**: what this preset is for. Future-you will thank you.
- **Profile Image / Tags**: optional.
- **System Prompt**: **required** (see Step 2).
- **Advanced Params**: expand and configure (see Step 3).
- **Tools / Knowledge / Filters**: optional — attach any tool servers here.
- **Capabilities** (at bottom): toggle Vision if you want image input. Gemma
  4 supports vision on all variants.

Save. The model now appears in the main chat dropdown.

---

## Step 2 — System Prompt (Required)

Gemma 4 is ultra-compliant but doesn't know who it is. A blank or generic
system prompt gets you a generic chatbot — and sporadic overfiltering.

Use the template from `SYNTHESIS.md`:

```
You are [NAME], a [ROLE DESCRIPTION]. You are powered by Gemma 4.

## What You Do
- [Explicit list of responsibilities]
- [Tools you have access to and when to use each one, if any]

## What You Do NOT Do
- [Explicit list of things to refuse or avoid]
- [Common mistakes to prevent]

## Output Format
[For free-form chat: "Respond in clear Markdown with code in fenced blocks."]
[For structured output: exact schema, field names, example if complex.]

## Rules
- [Behavioral constraints]
- [Multi-step chaining instructions if using tools]

Today's date: 2026-04-18
```

Principles:
1. Identity first.
2. Positive instructions (what to do) before negative (what not to do).
3. Output format is explicit.
4. Don't use language that sounds like you're asking the model to bypass
   restrictions — just state the task directly (safety overfilter trigger).

---

## Step 3 — Advanced Params Reference

Expand **Advanced Params** in the Workspace Model editor. Every field, what
to set, and why.

### Sampling / Output Shape

| Field | Ollama default | Set to | Why |
|---|---|---|---|
| **Temperature** | 0.8 | `0.7` (chat) / `0.3` (extraction) / `0.2` (scoring) | Per `SYNTHESIS.md` temperature table. |
| **Top K** | 40 | leave default | Gemma 4 is well-behaved at default. |
| **Top P** | 0.9 | leave default | Same. |
| **Min P** | 0.0 | leave default | Same. |
| **Seed** | random | leave blank | Set only for A/B reproduction. |
| **Stop Sequences** | none | leave blank | Gemma 4 emits proper EOS. |
| **Mirostat / Eta / Tau** | off | leave off | Not needed; Min P / Top P work fine. |
| **Frequency Penalty** | 0 | leave 0 | Any value biases style for little gain. |
| **Repeat Penalty** | 1.1 | leave default | Fine at default. |
| **Repeat Last N** | 64 | leave default | Fine at default. |
| **Presence Penalty** | 0 | leave 0 | Same as frequency. |

### Context / Memory — **these are the ones that bite**

| Field | Ollama default | Set to | Why |
|---|---|---|---|
| **Context Length** (`num_ctx`) | **2048** | `32768` chat / `4096`–`16384` pipeline | Default truncates mid-system-prompt. `GOTCHAS.md` § "Ollama Default Context is 2048". |
| **Max Tokens** (`num_predict`) | **128** | `4096` chat / `2048`+ JSON | Default truncates any useful reply. `GOTCHAS.md` § "num_predict Default is 128". |
| **Batch Size** (`num_batch`) | 512 | leave default | Prompt-eval throughput; no Gemma 4 issue. |
| **Tokens to Keep** (`num_keep`) | 4 | leave default | System-prompt header anchor. |
| **Use Mmap** | on | leave on | Standard. |
| **Use Mlock** | off | leave off | Standard. |
| **Threads** (`num_thread`) | auto | leave default | Ollama picks. |
| **Keep Alive** | 5m | `30m` for chat, `4h` or `-1` for pipelines | Default unloads the model between messages — `~10–30s` reload penalty. `GOTCHAS.md` § "Keep-Alive Too Short". |

### Reasoning / Thinking — **the OpenWebUI 26B killer**

| Field | Set to | Why |
|---|---|---|
| **Reasoning / Thinking** / **Think** toggle | **Leave UNSET** on 26B in chat. Optional on 31B. **Force Off** only in single-turn JSON pipelines. | Ollama 0.20+ defaults `think: true`. On `gemma4:26b` in multi-turn chat, forcing `think: false` causes silent stops (`eval_count=4`, empty content, no tool call) — the model just… stops. Verified 2026-04-18. 31B and Qwen3-Coder tolerate the flag. In single-turn JSON pipelines (AI_Visualizer shape) the old advice still applies: force off so thinking tokens don't eat `num_predict`. See `GOTCHAS.md` § "think: false Kills Gemma 4 26B" and § "Thinking Mode Eats Context". |

> OpenWebUI exposes this as a **"Reasoning"** toggle or a raw **`think`**
> field depending on version. If your version exposes it as tri-state
> (On / Off / Default), pick **Default** on 26B chat. If it's binary
> (On / Off), leave it **On** on 26B chat. **Never Off on 26B chat.**

### Response Format — **never use JSON mode**

| Field | Set to | Why |
|---|---|---|
| **Response Format** / **Format = JSON** | **Off / None / Text** | Ollama's server-side `format: "json"` enforcer causes Gemma 4 26B to infinite-loop on nested schemas. Ask for JSON in the prompt and parse client-side. `GOTCHAS.md` § "format=json Causes Infinite Loops". |

### Streaming & Function Calling

| Field | Set to | Why |
|---|---|---|
| **Stream Chat Response** | **On** for text-only chat. **Off** if you've attached Tools. | Ollama v0.20.0 drops tool calls on streaming responses (community-reported, and matches Simon's non-streaming choice). `GOTCHAS.md` § "Tool Calling Broken in Ollama v0.20.0 Streaming". |
| **Function Calling** | `Native` if you're attaching tools; otherwise `Default` / off. | Native uses Ollama's `/api/chat` tool_calls field. Gemma 4 has a native tool-calling token format. |

### Vision

Enable the **Vision** capability (bottom of model editor). All Gemma 4
variants support vision. Paste or upload images in chat. Works great for
description; **unreliable for subjective quality scoring** (see
`GOTCHAS.md` § "Vision Validator Overrejects").

### Audio (E-series only)

26B / 31B have no audio encoder. Only `gemma4:e4b-it-*` variants support
audio, and currently only via llama.cpp — Ollama doesn't pipe audio through
OpenWebUI today. Skip this in OpenWebUI for now.

---

## Step 4 — Global Admin Defaults (Optional Floor)

Admin Panel → Settings → Models sets defaults that apply when a Workspace
Model doesn't override. Set these as a safety net for ad-hoc chats against
`gemma4:*` base models directly:

- Default Context Length: **8192**
- Default Max Tokens: **2048**
- Default Keep Alive: **30m**

These are only floors. The Workspace Model's explicit settings still take
over for named presets.

---

## Two Profiles Worth Baking

### Profile A: `gemma4-26b-chat` (default daily driver)

- Base: `gemma4:26b`
- System Prompt: "You are a helpful assistant powered by Gemma 4. Respond
  in clear Markdown. Use fenced code blocks for code. Today's date: …"
- Temp `0.7`, `num_ctx 32768`, `num_predict 4096`
- Reasoning: **Default** (unset) or On — **never Off**
- Stream On, Format Off, Keep Alive `30m`

### Profile B: `gemma4-26b-extract` (structured output)

- Base: `gemma4:26b`
- System Prompt: explicit schema with "Respond with ONLY JSON. No prose."
- Temp `0.3`, `num_ctx 8192`, `num_predict 2048`
- Reasoning: **Off** (single-turn — thinking would eat `num_predict`)
- Stream Off, Format **Off** (still!), Keep Alive `1h`
- Parse client-side with the regex pattern in `SYNTHESIS.md`.

For tool-using agent chats, Profile A is correct — don't flip Reasoning off.

---

## Troubleshooting Map

| Symptom | Most likely cause | Fix |
|---|---|---|
| 26B "stops answering" mid-conversation, blank reply | `think: false` in payload | Set Reasoning to Default/On in Workspace Model |
| Reply truncates mid-sentence | `num_predict` too low | Bump to 4096 |
| Long prompt ignored / forgets system prompt | `num_ctx` too low | Set 32768 |
| JSON request hangs forever | Response Format = JSON | Turn it off; parse client-side |
| Tool call not fired despite model "deciding" to call it | Streaming + tool call, Ollama v0.20.0 | Disable Stream when tools attached |
| 10–30s latency on first message after idle | `keep_alive` default 5m | Set `30m` or `-1` |
| Model generic / no personality / confuses identity | Empty or weak system prompt | Use the template in Step 2 |
| 31B hangs at long prompts | Flash Attention + 31B Dense + >3–4K tokens on 3090 class | Use 26B for long prompts, or disable FA in Ollama |
| Chat refuses a benign technical prompt | Safety overfilter | Rephrase; state task directly without "bypass"/"ignore" language |

---

## What This Doc Does Not Cover

- **Installing Ollama or OpenWebUI** — assumed done.
- **Pulling Gemma 4 models** — `ollama pull gemma4:26b` outside scope.
- **Tool server development** — see `CORPUS_tool_calling_format.md` and
  Simon (`~/bin/FreibergFamily/simon/`).
- **Embeddings / retrieval** — Gemma 4 has no embedding mode; use
  `embeddinggemma` (308M) as a sibling model.
- **Fine-tuning** — see `GOTCHAS.md` § "Fine-Tuning Ecosystem Issues" and
  `tooling/fine-tuning/recipe-recommendation.md`.

## Related Docs in This Repo

- `SYNTHESIS.md` — opinionated guide this doc is derived from.
- `GOTCHAS.md` — every known issue, severity-ranked.
- `CORPUS_ollama_variants.md` — model inventory, VRAM, Ollama settings.
- `docs/reference/bakeoff-2026-04-18.md` — the `think: false` / 26B
  evidence trail.
- `CORPUS_cli_coding_agent.md` — if the OpenWebUI chat is really an
  agent front-end, read this for model-choice nuance.