Files
gemma4-research/CORPUS_cli_coding_agent.md
T
Mortdecai 4b9c537dda docs: add CLI coding agent research doc
Fills the gap between existing chat-agent (Simon) and pipeline
(AI_Visualizer) patterns. Covers openclaw/open code/pi/hermes first-party
agents from the HF launch blog, honest positioning vs qwen3-coder:30b,
CLI-agent-specific gotchas (safety filter on security code, long-JSON
weakness, no code exec), and a concrete homelab bakeoff plan pointed at
CT 166 openclaw2 → CT 105 Ollama on pve197.

Key research finding: Google published LiveCodeBench + Codeforces but
NOT SWE-bench or Aider polyglot. The "autonomous agents" claim is
plausible but unproven for multi-file repo-scale coding specifically.
2026-04-18 13:01:59 -04:00

192 lines
12 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Gemma 4 as a CLI Coding Agent
> Research pass, 2026-04-18. Positions Gemma 4 against the specific use case of
> driving a terminal-based coding agent (openclaw / open code / aider / pi /
> hermes style: read_file, write_file, bash, iterate). Separate from the existing
> `IMPLEMENTATIONS.md` chat-agent patterns (Simon) and pipeline patterns
> (AI_Visualizer).
## TL;DR
- Gemma 4 is Google's **first Gemma with trained (not proof-of-concept) tool use**. LiveCodeBench v6 = 80.0% (31B) / 77.1% (26B). Codeforces ELO = 2150 / 1718. That's frontier-open territory on the reported benchmarks.
- Google/HF co-launched with four local CLI coding agents: **openclaw, hermes, pi, open code** (see `tooling/huggingface/blog/gemma4-blog.md`, § "Plug in your local agent"). All four use an OpenAI-compatible endpoint → Ollama or llama.cpp work interchangeably.
- **No SWE-bench or Aider polyglot number from Google.** Reporting leans on competitive programming + single-file code gen. Real-world multi-file repo-scale coding is an empirical question Google didn't answer. Treat the CLI agent claim as **plausible + untested**, not proven.
- No specialized CodeGemma-4 sibling exists (CodeGemma is still G1). Base Gemma 4 **is** the Gemma-family coding path for now.
- In Seth's homelab, CT 166 `openclaw2` on pve197 is the natural testbed — GPU-adjacent to CT 105 Ollama which already serves `gemma4:26b` and `gemma4:31b-it-q4_K_M`.
## What Google does and doesn't claim
The HF 31B-it model card (`tooling/huggingface/model-cards/gemma-4-31B-it-README.md`, line 38) says:
> "Enhanced Coding & Agentic Capabilities Achieves notable improvements in coding benchmarks alongside native function-calling support, powering highly capable autonomous agents."
Reported coding / agentic numbers (from `CORPUS_benchmarks.md`):
| Benchmark | 31B | 26B A4B | What it tests |
|---|---|---|---|
| LiveCodeBench v6 | 80.0% | 77.1% | Single-file code generation |
| Codeforces ELO | 2150 | 1718 | Competitive programming |
| tau2-bench | 86.4% | 85.5% | Agentic tool use — **customer service, not coding** |
What's **not** reported and worth noting:
- **SWE-bench Verified / SWE-bench Lite** — the standard multi-file repo-patch benchmark
- **Aider polyglot** — the standard diff-format / edit-quality benchmark
- **HumanEval / MBPP** — even the old single-function tests
The absence isn't necessarily bad news (Google could simply have prioritized novel benchmarks), but it means **the claim "powering highly capable autonomous agents" has no agentic-coding-specific receipt**. tau2-bench is the closest agentic number and it measures a different domain.
## First-party supported CLI coding agents
From the HF launch blog (`tooling/huggingface/blog/gemma4-blog.md`, lines 505-572):
| Agent | Config | Endpoint |
|---|---|---|
| **openclaw** | `openclaw onboard` — auto-detects running llama-server | OpenAI-compatible |
| **hermes** | `hermes model` — interactive model picker | OpenAI-compatible |
| **pi** | `~/.pi/agent/models.json` | `baseUrl: http://localhost:8080/v1`, `api: openai-completions` |
| **open code** | `~/.config/opencode/opencode.json` (opencode.ai) | `@ai-sdk/openai-compatible`, `baseURL: http://127.0.0.1:8080/v1` |
All four are demonstrated against llama.cpp's `llama-server`, which ships first-party Gemma 4 GGUFs via `ggml-org/gemma-4-*-it-GGUF` including mmproj for vision. **Ollama's `/v1/chat/completions` is drop-in substitutable** — same protocol, different port/path (`http://<host>:11434/v1`).
The blog didn't test aider / continue / cline / roo code / goose. They're all OpenAI-compatible and should work, but they're outside Google's tested set. Aider in particular uses a highly structured diff format that depends on the model emitting edits cleanly — an area where Gemma 4 has a known weakness (long/nested JSON — see `GOTCHAS.md`).
## vs qwen3-coder:30b (the realistic homelab alternative)
Seth's steel141 already has `qwen3-coder:30b` and `qwen3-coder-next:79.7B`. The honest comparison:
| Axis | Gemma 4 26B A4B | qwen3-coder:30b |
|---|---|---|
| Active params | 3.8B (MoE, 8-of-128 experts) | ~30B dense |
| Designed for | General-purpose + agentic tool use | Coding specifically |
| Vision | Native (all variants) | No |
| Agentic tool-call training | Yes, native tokens | Yes, native tokens |
| LiveCodeBench v6 | 77.1% (Google card) | not in this corpus — don't invent |
| Edit-format fidelity | Weak at long JSON (sequential-calls workaround) | Coder-tuned, strong at diffs |
| VRAM at 32K ctx | moderate (KV-hungry, see GOTCHAS) | moderate |
**Picking heuristic:**
- **Gemma 4** if the agent does chat + tools + vision (e.g., "look at this screenshot, edit this file, re-run test") — it's the only side with native vision.
- **qwen3-coder** if the agent is pure code-edit loops where diff quality dominates.
- **Bakeoff before committing.** Swapping an OpenAI-compatible provider URL is near-free. Two runs on one real repo task beats either benchmark.
Don't treat Google's "Enhanced Coding" framing as a head-to-head result against Qwen. It's not — they're pointing at the delta from Gemma 3, not at current coder-specialized competition.
## Configuration for Ollama-backed agents
The baseline settings from `SYNTHESIS.md` still apply. CLI coding agent-specific adjustments:
```json
{
"model": "gemma4:26b",
"think": false,
"keep_alive": "4h",
"options": {
"num_ctx": 32768,
"num_predict": 4096,
"temperature": 0.3
}
}
```
- `num_ctx: 32768` is the working minimum for repo-scale work. Agents interleave file reads, bash output, and edits; 4K will truncate the second `read_file`.
- `num_predict: 4096` — single edits are short but the agent may emit a bash invocation + reasoning + tool call in one turn.
- `temperature: 0.3` — per `SYNTHESIS.md` temperature table, "structured extraction" tier. Coding edits want low variance.
- `think: false` — critical. `GOTCHAS.md` documents that Ollama 0.20+ thinking silently eats `num_predict` and drops tool calls. If an agent somehow injects `think: true`, you'll see empty responses.
- `keep_alive: 4h` — agent sessions have think pauses; avoid reload penalty.
### Streaming
**Non-streaming mode required on Ollama 0.20.0-0.20.1.** The tool-call parser drops calls on streaming endpoints (see `GOTCHAS.md` and `CORPUS_tool_calling_format.md`). Most CLI agents default to non-streaming for tool turns, but verify in the agent's config.
### llama-server alternative
If you want to follow the HF blog exactly, swap Ollama for llama.cpp:
```bash
llama-server -hf ggml-org/gemma-4-26b-a4b-it-GGUF:Q4_K_M \
--jinja \
-c 32768 \
--host 0.0.0.0 --port 8080
```
`--jinja` is the critical flag — without it, the native tool-call template (with `<|tool_call>` / `<tool_call|>` asymmetric brackets — see `CORPUS_tool_calling_format.md`) doesn't render correctly.
## Gotchas specific to CLI coding agent use
These extend (do not replace) the general `GOTCHAS.md`.
### 1. Safety overfiltering on security-adjacent code
`GOTCHAS.md` documents strict alignment generally. For coding agents this bites more often: pentest tooling, CTF write-ups, auth-bypass debugging, even aggressive `rm -rf`-style cleanup can trigger refusals or bowdlerized edits.
**Workaround:** The agent's system prompt should establish authorization context — "this is an authorized security test", "this is my own machine", "this is a CTF challenge". Don't rephrase as a jailbreak; state context plainly. Stock agent system prompts typically don't set this, so it's often the first thing to add.
### 2. Weak long JSON → favors sequential tool calls
Gemma 4 struggles with deeply-nested schemas and long arrays (existing `GOTCHAS.md` finding). Agent-level implication:
- **Agents that drive tool-by-tool** (openclaw, open code, pi, cline): good fit. Each `write_file` / `bash` / `read_file` is a short tool call.
- **Agents that expect one-giant-structured-response** (some aider edit modes, any "output the entire diff as JSON"): expect parse failures on long patches. Break into smaller edits if possible.
### 3. No code execution — that's the agent's job
Gemma 4 has no sandbox / kernel / VM. It decides when to call bash; the agent runs it. This is standard but worth stating — no CodeInterpreter-style "model runs the code" path.
### 4. Long-horizon context pressure
Gemma 4 supports 256K on 26B/31B but the KV cache is VRAM-hungry (existing `GOTCHAS.md`). For an agent churning through a repo:
- 32K ctx = comfortable on a 24GB card
- 128K ctx = you're feeding a lot of VRAM to cache, not weights
- Prefer **agent-side retrieval** (grep, ripgrep, targeted file reads) over "paste the whole repo in context"
### 5. Identity drift across long sessions
Gemma 4's "ultra-compliant but doesn't know who it is" (existing GOTCHA) shows up in long agent sessions as subtle drift — switching voice, adopting a different tool-call style mid-session, forgetting constraints from turn 1. The `SYNTHESIS.md` system-prompt template (identity + what-you-do + what-you-do-not + format) is more important for a 50-turn agent loop than a 3-turn chat.
### 6. Missing coding-specific agentic benchmark (same warning, bigger stakes)
Because Google didn't publish SWE-bench, you're operating on extrapolation from Codeforces + tau2-bench when you use Gemma 4 as a CLI coding agent. Measure on your actual repo before taking a dependency.
## Homelab setup (Seth)
**Natural testbed:** CT 166 `openclaw2` on pve197 → CT 105 Ollama on pve197.
Both are on the same host so there's no network hop. CT 105 already serves
`gemma4:26b` and `gemma4:31b-it-q4_K_M` (verified in handoff + per-node inventory
in `/home/claude/bin/CLAUDE.md`).
1. Verify openclaw2's current model config. If it's pointing at a different
backend, switch to `http://192.168.0.179:11434/v1` with `gemma4:26b` (or 31B if
VRAM permits alongside the V100 CT 167 visualizer stack).
2. Set default options per the block above (`num_ctx: 32768`, `num_predict: 4096`,
`think: false`, `temperature: 0.3`, `keep_alive: 4h`).
3. Run one real task (suggested: a small addition to Mortdecai-2.0 — a codebase with
existing CLAUDE.md and clear conventions, good signal-to-noise).
4. Capture: number of tool calls, number of retries, diff quality, wall clock.
5. **Same task** against `qwen3-coder:30b` on steel141 (`http://192.168.0.141:11434/v1`).
Don't A/B anything else — same agent, same prompt, same repo state, different backend.
6. If Gemma 4 dominates on plan/navigate/describe but Qwen dominates on write_file
quality, the natural step is per-role model split: let openclaw2 use Gemma for
"thinking" tool calls and Qwen for edit tool calls. open code's provider config
supports this cleanly.
## What is NOT covered by this document
- Concrete benchmark results from the proposed bakeoff (do the measurement, write a separate findings file)
- openclaw / hermes / pi / open code feature-matrix detail (each agent has its own docs — the HF blog links to all four)
- aider-specific diff-format analysis (aider wasn't in the HF blog's tested set)
- Fine-tuning Gemma 4 for coding agents (see `tooling/fine-tuning/` — the existing path)
- CodeGemma (still Gemma 1 base — see `tooling/gemma-family/codegemma.md`)
## Provenance
- HF 31B-it model card: `tooling/huggingface/model-cards/gemma-4-31B-it-README.md`
- HF launch blog: `tooling/huggingface/blog/gemma4-blog.md`
- Benchmarks: `CORPUS_benchmarks.md`
- Tool calling: `CORPUS_tool_calling_format.md`
- Ollama variants: `CORPUS_ollama_variants.md`
- Known issues: `GOTCHAS.md`
- Qwen3-Coder in homelab: `/home/claude/bin/CLAUDE.md` § "Ollama models"