Mortdecai d9477da52a docs: OpenWebUI setup guide for Gemma 4

Applies SYNTHESIS.md + GOTCHAS.md findings to the OpenWebUI front-end:
per-setting reference, two baked-in Workspace Model profiles (chat +
extract), and a symptom→cause troubleshooting table. Front-loads the
`think: false` / gemma4:26b multi-turn footgun from Round 3 of the
2026-04-18 bakeoff since that is the shape OpenWebUI users will hit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-18 20:47:17 -04:00


Gemma 4 in OpenWebUI — Setup Guide

Derived from SYNTHESIS.md, GOTCHAS.md, and the 2026-04-18 bakeoff. Assumes: OpenWebUI is already running and gemma4:* is pulled in the backing Ollama. Covers every setting that matters and what to set it to.

TL;DR — Just Tell Me What To Toggle

Create a Workspace Model (don't edit per-chat), pick a base variant, and set:

Setting	Multi-turn chat (the default OpenWebUI shape)	Single-turn JSON pipeline
Base model	`gemma4:26b` (fast) or `gemma4:31b-it-q4_K_M` (sharper)	same
System Prompt	Required — identity + boundaries + format (template below)	Required
Context Length (`num_ctx`)	`32768`	`4096`–`16384` (scale to prompt)
Max Tokens (`num_predict`)	`4096`	`2048`+
Temperature	`0.7`	`0.2`–`0.4`
Stream	On for text-only chat; Off if you attach a Tool	Off
Function Calling	Native	N/A
Reasoning / `think`	LEAVE UNSET on 26B (do NOT force off). Unset or On on 31B.	Force Off
Response Format / JSON mode	Off (always)	Off (always)
Keep Alive	`30m` or `-1`	match pipeline duration (`4h` / `-1`)

The single biggest failure mode in OpenWebUI: setting think: false on gemma4:26b in a chat. The model silent-stops at tool-decision turns with eval_count=4. If you see "model just stops answering after a few messages" on 26B, check this first. See GOTCHAS.md § "think: false Kills Gemma 4 26B in Multi-Turn Tool-Calling Loops".
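
For readers who call Ollama directly instead of going through the UI, the table above corresponds roughly to the request body sketched below. It assumes Ollama's `/api/chat` fields (`options`, `keep_alive`, `stream`, `tools`, `think`); `build_payload` and its `shape` argument are hypothetical names used only for illustration, and the pipeline values are picked from inside the ranges in the table.

```python
def build_payload(model, messages, shape="chat", tools=None):
    """Build an Ollama /api/chat request body following the TL;DR table."""
    chat = shape == "chat"
    payload = {
        "model": model,
        "messages": messages,
        # Streaming is only used for text-only chat; it is off when tools are attached.
        "stream": chat and not tools,
        "options": {
            "num_ctx": 32768 if chat else 8192,      # pipeline: 4096-16384, scale to prompt
            "num_predict": 4096 if chat else 2048,
            "temperature": 0.7 if chat else 0.3,
        },
        "keep_alive": "30m" if chat else "4h",
    }
    if tools:
        payload["tools"] = tools
    if not chat:
        # Single-turn JSON pipeline: thinking tokens would eat num_predict.
        payload["think"] = False
    # In chat, "think" is deliberately left out (unset), never forced to False.
    # "format" is never set; JSON is requested in the prompt and parsed client-side.
    return payload
```

The point of the sketch is what is absent: a chat payload for `gemma4:26b` carries no `think` key at all, and neither shape carries a `format` key.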

Where Settings Live in OpenWebUI

OpenWebUI has four layers. Later layers override earlier ones:

1. Ollama defaults — baked in, almost all wrong for Gemma 4 (num_ctx=2048, num_predict=128, keep_alive=5m).
2. Admin Panel → Settings → Models — global defaults for all models. Touch this only to set sane fleet-wide floors.
3. Workspace → Models → [Create/Edit] — named presets. This is where you bake Gemma 4 settings. A Workspace model = base model + system prompt + advanced params + tags + optional tool server bindings.
4. Per-chat controls (right-hand panel / top of chat) — overrides for a single conversation. Useful for experimentation, bad for persistence.

Rule: every knob below goes in layer 3 (Workspace Model) unless noted. Per-chat overrides are for debugging only.
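
The precedence described above can be modelled as a simple layered lookup. This is only a sketch of the override order, not OpenWebUI's actual implementation; `effective_settings` and the dict keys are illustrative names.

```python
from collections import ChainMap

# Layer 1: the Ollama defaults quoted above.
OLLAMA_DEFAULTS = {"num_ctx": 2048, "num_predict": 128, "keep_alive": "5m"}

def effective_settings(admin=None, workspace=None, per_chat=None):
    """Resolve settings across the four layers; later layers win."""
    # ChainMap searches left to right, so the highest-priority layer goes first.
    layers = ChainMap(per_chat or {}, workspace or {}, admin or {}, OLLAMA_DEFAULTS)
    return dict(layers)

settings = effective_settings(
    admin={"num_ctx": 8192, "keep_alive": "30m"},
    workspace={"num_ctx": 32768, "num_predict": 4096},
)
```

Here `settings` ends up with the Workspace Model's `num_ctx` and `num_predict`, and falls back to the admin `keep_alive` because the Workspace layer did not set one.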

Step 1 — Create the Workspace Model

Workspace → Models → + Add Model (or Create a Model). Fill in:

Name: gemma4-26b-chat (or whatever matches your use case)
Base Model: pick from Ollama list. Recommended:
- gemma4:26b — fastest, great default
- gemma4:31b-it-q4_K_M — sharper, 5x slower, more VRAM
- gemma4:e4b-it-q8_0 — 12GB VRAM, vision + audio (audio via llama.cpp only)
Description: what this preset is for. Future-you will thank you.
Profile Image / Tags: optional.
System Prompt: required (see Step 2).
Advanced Params: expand and configure (see Step 3).
Tools / Knowledge / Filters: optional — attach any tool servers here.
Capabilities (at bottom): toggle Vision if you want image input. Gemma 4 supports vision on all variants.

Save. The model now appears in the main chat dropdown.

Step 2 — System Prompt (Required)

Gemma 4 is ultra-compliant but doesn't know who it is. A blank or generic system prompt gets you a generic chatbot — and sporadic overfiltering.

Use the template from SYNTHESIS.md:

You are [NAME], a [ROLE DESCRIPTION]. You are powered by Gemma 4.

## What You Do
- [Explicit list of responsibilities]
- [Tools you have access to and when to use each one, if any]

## What You Do NOT Do
- [Explicit list of things to refuse or avoid]
- [Common mistakes to prevent]

## Output Format
[For free-form chat: "Respond in clear Markdown with code in fenced blocks."]
[For structured output: exact schema, field names, example if complex.]

## Rules
- [Behavioral constraints]
- [Multi-step chaining instructions if using tools]

Today's date: 2026-04-18

Principles:

Identity first.
Positive instructions (what to do) before negative (what not to do).
Output format is explicit.
State the task directly. Language that sounds like a request to bypass restrictions can trigger the safety overfilter.
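
If you generate the prompt programmatically (for example to keep the date current), the template can be filled in along these lines. This is a minimal sketch; `render_system_prompt` and its parameter names are hypothetical, and the section order simply mirrors the template above.

```python
from datetime import date

def _bullets(items):
    """Format a list of strings as Markdown bullets."""
    return "\n".join(f"- {item}" for item in items)

def render_system_prompt(name, role, does, does_not, output_format, rules, today=None):
    """Fill in the system prompt template: identity first, then do / don't / format / rules."""
    today = today or date.today()
    sections = [
        f"You are {name}, a {role}. You are powered by Gemma 4.",
        "## What You Do\n" + _bullets(does),
        "## What You Do NOT Do\n" + _bullets(does_not),
        "## Output Format\n" + output_format,
        "## Rules\n" + _bullets(rules),
        f"Today's date: {today.isoformat()}",
    ]
    return "\n\n".join(sections)
```

Passing `today` explicitly keeps the output reproducible; leaving it out stamps the current date each time the prompt is rendered.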

Step 3 — Advanced Params Reference

Expand Advanced Params in the Workspace Model editor. Every field, what to set, and why.

Sampling / Output Shape

Field	Ollama default	Set to	Why
Temperature	0.8	`0.7` (chat) / `0.3` (extraction) / `0.2` (scoring)	Per `SYNTHESIS.md` temperature table.
Top K	40	leave default	Gemma 4 is well-behaved at default.
Top P	0.9	leave default	Same.
Min P	0.0	leave default	Same.
Seed	random	leave blank	Set only for A/B reproduction.
Stop Sequences	none	leave blank	Gemma 4 emits proper EOS.
Mirostat / Eta / Tau	off	leave off	Not needed; Min P / Top P work fine.
Frequency Penalty	0	leave 0	Any value biases style for little gain.
Repeat Penalty	1.1	leave default	Fine at default.
Repeat Last N	64	leave default	Fine at default.
Presence Penalty	0	leave 0	Same as frequency.

Context / Memory — these are the ones that bite

Field	Ollama default	Set to	Why
Context Length (`num_ctx`)	2048	`32768` chat / `4096`–`16384` pipeline	Default truncates mid-system-prompt. `GOTCHAS.md` § "Ollama Default Context is 2048".
Max Tokens (`num_predict`)	128	`4096` chat / `2048`+ JSON	Default truncates any useful reply. `GOTCHAS.md` § "num_predict Default is 128".
Batch Size (`num_batch`)	512	leave default	Prompt-eval throughput; no Gemma 4 issue.
Tokens to Keep (`num_keep`)	4	leave default	System-prompt header anchor.
Use Mmap	on	leave on	Standard.
Use Mlock	off	leave off	Standard.
Threads (`num_thread`)	auto	leave default	Ollama picks.
Keep Alive	5m	`30m` for chat, `4h` or `-1` for pipelines	Default unloads the model after 5 minutes idle, so the next message pays a `~10–30s` reload penalty. `GOTCHAS.md` § "Keep-Alive Too Short".
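
For pipelines, "scale to prompt" can be made concrete with a small sizing helper. The rule below (prompt tokens plus `num_predict`, rounded up to the next power of two within the 4096 to 16384 range) is an assumed heuristic for illustration, not something taken from `SYNTHESIS.md`; `pick_num_ctx` is a hypothetical name.

```python
def pick_num_ctx(prompt_tokens, num_predict=2048, floor=4096, ceiling=16384):
    """Pick a context length that fits the prompt plus the reply budget."""
    # The context window has to hold both the prompt and the generated tokens.
    needed = prompt_tokens + num_predict
    ctx = floor
    # Double until the window is large enough or the ceiling is reached.
    while ctx < needed and ctx < ceiling:
        ctx *= 2
    return ctx
```

If the result equals the ceiling and is still smaller than `prompt_tokens + num_predict`, the prompt does not fit the pipeline range and belongs in a larger-context profile instead.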

Reasoning / Thinking — the OpenWebUI 26B killer

Field	Set to	Why
Reasoning / Thinking / Think toggle	Leave UNSET on 26B in chat. Optional on 31B. Force Off only in single-turn JSON pipelines.	Ollama 0.20+ defaults `think: true`. On `gemma4:26b` in multi-turn chat, forcing `think: false` causes silent stops (`eval_count=4`, empty content, no tool call) — the model just… stops. Verified 2026-04-18. 31B and Qwen3-Coder tolerate the flag. In single-turn JSON pipelines (AI_Visualizer shape) the old advice still applies: force off so thinking tokens don't eat `num_predict`. See `GOTCHAS.md` § "think: false Kills Gemma 4 26B" and § "Thinking Mode Eats Context".

OpenWebUI exposes this as a "Reasoning" toggle or a raw think field depending on version. If your version exposes it as tri-state (On / Off / Default), pick Default on 26B chat. If it's binary (On / Off), leave it On on 26B chat. Never Off on 26B chat.
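
The decision rule in this section is small enough to write down as code. This is a sketch of the guidance only; the function names and the `"tri-state"` / `"binary"` labels are illustrative and do not correspond to OpenWebUI identifiers.

```python
def reasoning_setting(shape, ui="tri-state"):
    """Return the Reasoning toggle value for a usage shape and UI style."""
    if shape == "pipeline":
        return "Off"      # single-turn JSON: thinking would eat num_predict
    if ui == "tri-state":
        return "Default"  # leaves `think` unset in the request
    return "On"           # binary UI: On is the safe chat choice on 26B

def is_safe(model, shape, value):
    """False only for the known-bad combination: 26B, multi-turn chat, Off."""
    return not (model.startswith("gemma4:26b") and shape == "chat" and value == "Off")
```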

Response Format — never use JSON mode

Field	Set to	Why
Response Format / Format = JSON	Off / None / Text	Ollama's server-side `format: "json"` enforcer causes Gemma 4 26B to infinite-loop on nested schemas. Ask for JSON in the prompt and parse client-side. `GOTCHAS.md` § "format=json Causes Infinite Loops".
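
Parsing client-side can be as simple as scanning the reply for the first decodable JSON object. The sketch below uses Python's `json.JSONDecoder.raw_decode` rather than the regex pattern from `SYNTHESIS.md` (which is not reproduced here), so treat it as one possible approach; `extract_json` is a hypothetical helper name.

```python
import json

def extract_json(reply):
    """Return the first JSON object found in a model reply, or None."""
    decoder = json.JSONDecoder()
    start = reply.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores any trailing prose or code-fence markers.
            obj, _end = decoder.raw_decode(reply, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = reply.find("{", start + 1)
    return None
```

Because the decoder stops at the end of the first complete object, this tolerates replies that wrap the JSON in a sentence or a fenced block even when the prompt asked for JSON only.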

Streaming & Function Calling

Field	Set to	Why
Stream Chat Response	On for text-only chat. Off if you've attached Tools.	Ollama v0.20.0 drops tool calls on streaming responses (community-reported, and matches Simon's non-streaming choice). `GOTCHAS.md` § "Tool Calling Broken in Ollama v0.20.0 Streaming".
Function Calling	`Native` if you're attaching tools; otherwise `Default` / off.	Native uses Ollama's `/api/chat` tool_calls field. Gemma 4 has a native tool-calling token format.
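
Outside the UI, the same two rules look roughly like this. The sketch assumes the non-streaming `/api/chat` response shape in which tool calls arrive under `message.tool_calls`, each with a `function.name` and `function.arguments`; `stream_setting` and `extract_tool_calls` are illustrative helper names.

```python
def stream_setting(tools):
    """Stream only when no tools are attached."""
    return not tools

def extract_tool_calls(response):
    """Return (name, arguments) pairs from a non-streaming /api/chat response."""
    message = response.get("message", {})
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call.get("function", {})
        calls.append((fn.get("name"), fn.get("arguments", {})))
    return calls
```

An empty list from `extract_tool_calls` on a turn where the model was expected to call a tool is the signal to check whether streaming was left on.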

Vision

Enable the Vision capability (bottom of model editor). All Gemma 4 variants support vision. Paste or upload images in chat. Works great for description; unreliable for subjective quality scoring (see GOTCHAS.md § "Vision Validator Overrejects").

Audio (E-series only)

26B / 31B have no audio encoder. Only gemma4:e4b-it-* variants support audio, and currently only via llama.cpp — Ollama doesn't pipe audio through OpenWebUI today. Skip this in OpenWebUI for now.

Step 4 — Global Admin Defaults (Optional Floor)

Admin Panel → Settings → Models sets defaults that apply when a Workspace Model doesn't override. Set these as a safety net for ad-hoc chats against gemma4:* base models directly:

Default Context Length: 8192
Default Max Tokens: 2048
Default Keep Alive: 30m

These are only floors. The Workspace Model's explicit settings still take over for named presets.

Two Profiles Worth Baking

Profile A: `gemma4-26b-chat` (default daily driver)

Base: gemma4:26b
System Prompt: "You are a helpful assistant powered by Gemma 4. Respond in clear Markdown. Use fenced code blocks for code. Today's date: …"
Temp 0.7, num_ctx 32768, num_predict 4096
Reasoning: Default (unset) or On — never Off
Stream On, Format Off, Keep Alive 30m

Profile B: `gemma4-26b-extract` (structured output)

Base: gemma4:26b
System Prompt: explicit schema with "Respond with ONLY JSON. No prose."
Temp 0.3, num_ctx 8192, num_predict 2048
Reasoning: Off (single-turn — thinking would eat num_predict)
Stream Off, Format Off (still!), Keep Alive 1h
Parse client-side with the regex pattern in SYNTHESIS.md.

For tool-using agent chats, Profile A is correct — don't flip Reasoning off.
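
The two profiles can also be kept as data and checked against the rules in this guide before they are entered into the UI. This is a sketch: the dict keys (`base`, `shape`, `think`, and so on) are hypothetical and are not OpenWebUI's export format.

```python
PROFILE_A = {
    "base": "gemma4:26b", "shape": "chat",
    "system": "You are a helpful assistant powered by Gemma 4.",
    "temperature": 0.7, "num_ctx": 32768, "num_predict": 4096,
    "stream": True, "keep_alive": "30m",  # no "think" key: Reasoning left unset
}
PROFILE_B = {
    "base": "gemma4:26b", "shape": "pipeline",
    "system": "Respond with ONLY JSON. No prose.",
    "temperature": 0.3, "num_ctx": 8192, "num_predict": 2048,
    "think": False, "stream": False, "keep_alive": "1h",
}

def lint_profile(profile):
    """Return a list of warnings for a profile dict; empty means it follows the guide."""
    warnings = []
    chat = profile.get("shape", "chat") == "chat"
    if profile.get("format"):
        warnings.append("Response Format must stay off; parse JSON client-side")
    if chat and profile["base"].startswith("gemma4:26b") and profile.get("think") is False:
        warnings.append("think: false on 26B chat causes silent stops")
    if not chat and profile.get("think") is not False:
        warnings.append("pipeline profiles should force think off")
    if chat and profile.get("tools") and profile.get("stream", True):
        warnings.append("disable Stream when tools are attached")
    if not profile.get("system"):
        warnings.append("system prompt is required")
    return warnings
```

Both profiles above lint clean; flipping `think` to `False` on Profile A, or attaching tools while leaving `stream` on, produces a warning.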

Troubleshooting Map

Symptom	Most likely cause	Fix
26B "stops answering" mid-conversation, blank reply	`think: false` in payload	Set Reasoning to Default/On in Workspace Model
Reply truncates mid-sentence	`num_predict` too low	Bump to 4096
Long prompt ignored / forgets system prompt	`num_ctx` too low	Set 32768
JSON request hangs forever	Response Format = JSON	Turn it off; parse client-side
Tool call not fired despite model "deciding" to call it	Streaming + tool call, Ollama v0.20.0	Disable Stream when tools attached
10–30s latency on first message after idle	`keep_alive` default 5m	Set `30m` or `-1`
Model generic / no personality / confuses identity	Empty or weak system prompt	Use the template in Step 2
31B hangs at long prompts	Flash Attention + 31B Dense + >3–4K tokens on 3090 class	Use 26B for long prompts, or disable FA in Ollama
Chat refuses a benign technical prompt	Safety overfilter	Rephrase; state task directly without "bypass"/"ignore" language
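
Several rows of the table can be checked mechanically from a request/response pair. The sketch below assumes the Ollama response fields `eval_count`, `done_reason`, and `message`, and assumes that a `done_reason` of `"length"` marks a reply cut off by `num_predict`; `triage` is a hypothetical helper and covers only the rows that are visible in the payload.

```python
def triage(payload, response):
    """Map an Ollama request/response pair to likely causes from the table."""
    causes = []
    message = response.get("message", {})
    empty = not message.get("content") and not message.get("tool_calls")
    # Silent stop: empty reply, a handful of tokens, and think forced off.
    if empty and payload.get("think") is False and response.get("eval_count", 0) <= 4:
        causes.append("think: false silent stop; set Reasoning to Default/On")
    # Generation hit the token limit instead of a natural end of sequence.
    if response.get("done_reason") == "length":
        causes.append("num_predict too low; raise it")
    if payload.get("format"):
        causes.append("Response Format set; turn it off and parse client-side")
    if payload.get("stream") and payload.get("tools"):
        causes.append("streaming with tools attached; disable Stream")
    return causes
```

An empty result does not mean the exchange is healthy; rows such as the 31B long-prompt hang or the safety overfilter need the manual checks from the table.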

What This Doc Does Not Cover

Installing Ollama or OpenWebUI — assumed done.
Pulling Gemma 4 models — `ollama pull gemma4:26b` and friends are outside the scope of this doc.
Tool server development — see CORPUS_tool_calling_format.md and Simon (~/bin/FreibergFamily/simon/).
Embeddings / retrieval — Gemma 4 has no embedding mode; use embeddinggemma (308M) as a sibling model.
Fine-tuning — see GOTCHAS.md § "Fine-Tuning Ecosystem Issues" and tooling/fine-tuning/recipe-recommendation.md.

See Also

SYNTHESIS.md — opinionated guide this doc is derived from.
GOTCHAS.md — every known issue, severity-ranked.
CORPUS_ollama_variants.md — model inventory, VRAM, Ollama settings.
docs/reference/bakeoff-2026-04-18.md — the think: false / 26B evidence trail.
CORPUS_cli_coding_agent.md — if the OpenWebUI chat is really an agent front-end, read this for model-choice nuance.
