docs: add CLI coding agent research doc
Fills the gap between existing chat-agent (Simon) and pipeline (AI_Visualizer) patterns. Covers openclaw/open code/pi/hermes first-party agents from the HF launch blog, honest positioning vs qwen3-coder:30b, CLI-agent-specific gotchas (safety filter on security code, long-JSON weakness, no code exec), and a concrete homelab bakeoff plan pointed at CT 166 openclaw2 → CT 105 Ollama on pve197. Key research finding: Google published LiveCodeBench + Codeforces but NOT SWE-bench or Aider polyglot. The "autonomous agents" claim is plausible but unproven for multi-file repo-scale coding specifically.
This commit is contained in:
@@ -0,0 +1,191 @@
|
|||||||
|
# Gemma 4 as a CLI Coding Agent
|
||||||
|
|
||||||
|
> Research pass, 2026-04-18. Positions Gemma 4 against the specific use case of
|
||||||
|
> driving a terminal-based coding agent (openclaw / open code / aider / pi /
|
||||||
|
> hermes style: read_file, write_file, bash, iterate). Separate from the existing
|
||||||
|
> `IMPLEMENTATIONS.md` chat-agent patterns (Simon) and pipeline patterns
|
||||||
|
> (AI_Visualizer).
|
||||||
|
|
||||||
|
## TL;DR
|
||||||
|
|
||||||
|
- Gemma 4 is Google's **first Gemma with trained (not proof-of-concept) tool use**. LiveCodeBench v6 = 80.0% (31B) / 77.1% (26B). Codeforces ELO = 2150 / 1718. That's frontier-open territory on the reported benchmarks.
|
||||||
|
- Google/HF co-launched with four local CLI coding agents: **openclaw, hermes, pi, open code** (see `tooling/huggingface/blog/gemma4-blog.md`, § "Plug in your local agent"). All four use an OpenAI-compatible endpoint → Ollama or llama.cpp work interchangeably.
|
||||||
|
- **No SWE-bench or Aider polyglot number from Google.** Reporting leans on competitive programming + single-file code gen. Real-world multi-file repo-scale coding is an empirical question Google didn't answer. Treat the CLI agent claim as **plausible + untested**, not proven.
|
||||||
|
- No specialized CodeGemma-4 sibling exists (CodeGemma is still G1). Base Gemma 4 **is** the Gemma-family coding path for now.
|
||||||
|
- In Seth's homelab, CT 166 `openclaw2` on pve197 is the natural testbed — GPU-adjacent to CT 105 Ollama which already serves `gemma4:26b` and `gemma4:31b-it-q4_K_M`.
|
||||||
|
|
||||||
|
## What Google does and doesn't claim
|
||||||
|
|
||||||
|
The HF 31B-it model card (`tooling/huggingface/model-cards/gemma-4-31B-it-README.md`, line 38) says:
|
||||||
|
|
||||||
|
> "Enhanced Coding & Agentic Capabilities – Achieves notable improvements in coding benchmarks alongside native function-calling support, powering highly capable autonomous agents."
|
||||||
|
|
||||||
|
Reported coding / agentic numbers (from `CORPUS_benchmarks.md`):
|
||||||
|
|
||||||
|
| Benchmark | 31B | 26B A4B | What it tests |
|
||||||
|
|---|---|---|---|
|
||||||
|
| LiveCodeBench v6 | 80.0% | 77.1% | Single-file code generation |
|
||||||
|
| Codeforces ELO | 2150 | 1718 | Competitive programming |
|
||||||
|
| tau2-bench | 86.4% | 85.5% | Agentic tool use — **customer service, not coding** |
|
||||||
|
|
||||||
|
What's **not** reported and worth noting:
|
||||||
|
|
||||||
|
- **SWE-bench Verified / SWE-bench Lite** — the standard multi-file repo-patch benchmark
|
||||||
|
- **Aider polyglot** — the standard diff-format / edit-quality benchmark
|
||||||
|
- **HumanEval / MBPP** — even the old single-function tests
|
||||||
|
|
||||||
|
The absence isn't necessarily bad news (Google could simply have prioritized novel benchmarks), but it means **the claim "powering highly capable autonomous agents" has no agentic-coding-specific receipt**. tau2-bench is the closest agentic number and it measures a different domain.
|
||||||
|
|
||||||
|
## First-party supported CLI coding agents
|
||||||
|
|
||||||
|
From the HF launch blog (`tooling/huggingface/blog/gemma4-blog.md`, lines 505-572):
|
||||||
|
|
||||||
|
| Agent | Config | Endpoint |
|
||||||
|
|---|---|---|
|
||||||
|
| **openclaw** | `openclaw onboard` — auto-detects running llama-server | OpenAI-compatible |
|
||||||
|
| **hermes** | `hermes model` — interactive model picker | OpenAI-compatible |
|
||||||
|
| **pi** | `~/.pi/agent/models.json` | `baseUrl: http://localhost:8080/v1`, `api: openai-completions` |
|
||||||
|
| **open code** | `~/.config/opencode/opencode.json` (opencode.ai) | `@ai-sdk/openai-compatible`, `baseURL: http://127.0.0.1:8080/v1` |
|
||||||
|
|
||||||
|
All four are demonstrated against llama.cpp's `llama-server`, which ships first-party Gemma 4 GGUFs via `ggml-org/gemma-4-*-it-GGUF` including mmproj for vision. **Ollama's `/v1/chat/completions` is drop-in substitutable** — same protocol, different port/path (`http://<host>:11434/v1`).
|
||||||
|
|
||||||
|
The blog didn't test aider / continue / cline / roo code / goose. They're all OpenAI-compatible and should work, but they're outside Google's tested set. Aider in particular uses a highly structured diff format that depends on the model emitting edits cleanly — an area where Gemma 4 has a known weakness (long/nested JSON — see `GOTCHAS.md`).
|
||||||
|
|
||||||
|
## vs qwen3-coder:30b (the realistic homelab alternative)
|
||||||
|
|
||||||
|
Seth's steel141 already has `qwen3-coder:30b` and `qwen3-coder-next:79.7B`. The honest comparison:
|
||||||
|
|
||||||
|
| Axis | Gemma 4 26B A4B | qwen3-coder:30b |
|
||||||
|
|---|---|---|
|
||||||
|
| Active params | 3.8B (MoE, 8-of-128 experts) | ~30B dense |
|
||||||
|
| Designed for | General-purpose + agentic tool use | Coding specifically |
|
||||||
|
| Vision | Native (all variants) | No |
|
||||||
|
| Agentic tool-call training | Yes, native tokens | Yes, native tokens |
|
||||||
|
| LiveCodeBench v6 | 77.1% (Google card) | not in this corpus — don't invent |
|
||||||
|
| Edit-format fidelity | Weak at long JSON (sequential-calls workaround) | Coder-tuned, strong at diffs |
|
||||||
|
| VRAM at 32K ctx | moderate (KV-hungry, see GOTCHAS) | moderate |
|
||||||
|
|
||||||
|
**Picking heuristic:**
|
||||||
|
- **Gemma 4** if the agent does chat + tools + vision (e.g., "look at this screenshot, edit this file, re-run test") — it's the only side with native vision.
|
||||||
|
- **qwen3-coder** if the agent is pure code-edit loops where diff quality dominates.
|
||||||
|
- **Bakeoff before committing.** Swapping an OpenAI-compatible provider URL is near-free. Two runs on one real repo task beats either benchmark.
|
||||||
|
|
||||||
|
Don't treat Google's "Enhanced Coding" framing as a head-to-head result against Qwen. It's not — they're pointing at the delta from Gemma 3, not at current coder-specialized competition.
|
||||||
|
|
||||||
|
## Configuration for Ollama-backed agents
|
||||||
|
|
||||||
|
The baseline settings from `SYNTHESIS.md` still apply. CLI coding agent-specific adjustments:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"model": "gemma4:26b",
|
||||||
|
"think": false,
|
||||||
|
"keep_alive": "4h",
|
||||||
|
"options": {
|
||||||
|
"num_ctx": 32768,
|
||||||
|
"num_predict": 4096,
|
||||||
|
"temperature": 0.3
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
- `num_ctx: 32768` is the working minimum for repo-scale work. Agents interleave file reads, bash output, and edits; 4K will truncate the second `read_file`.
|
||||||
|
- `num_predict: 4096` — single edits are short but the agent may emit a bash invocation + reasoning + tool call in one turn.
|
||||||
|
- `temperature: 0.3` — per `SYNTHESIS.md` temperature table, "structured extraction" tier. Coding edits want low variance.
|
||||||
|
- `think: false` — critical. `GOTCHAS.md` documents that Ollama 0.20+ thinking silently eats `num_predict` and drops tool calls. If an agent somehow injects `think: true`, you'll see empty responses.
|
||||||
|
- `keep_alive: 4h` — agent sessions have think pauses; avoid reload penalty.
|
||||||
|
|
||||||
|
### Streaming
|
||||||
|
|
||||||
|
**Non-streaming mode required on Ollama 0.20.0-0.20.1.** The tool-call parser drops calls on streaming endpoints (see `GOTCHAS.md` and `CORPUS_tool_calling_format.md`). Most CLI agents default to non-streaming for tool turns, but verify in the agent's config.
|
||||||
|
|
||||||
|
### llama-server alternative
|
||||||
|
|
||||||
|
If you want to follow the HF blog exactly, swap Ollama for llama.cpp:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
llama-server -hf ggml-org/gemma-4-26b-a4b-it-GGUF:Q4_K_M \
|
||||||
|
--jinja \
|
||||||
|
-c 32768 \
|
||||||
|
--host 0.0.0.0 --port 8080
|
||||||
|
```
|
||||||
|
|
||||||
|
`--jinja` is the critical flag — without it, the native tool-call template (with `<|tool_call>` / `<tool_call|>` asymmetric brackets — see `CORPUS_tool_calling_format.md`) doesn't render correctly.
|
||||||
|
|
||||||
|
## Gotchas specific to CLI coding agent use
|
||||||
|
|
||||||
|
These extend (do not replace) the general `GOTCHAS.md`.
|
||||||
|
|
||||||
|
### 1. Safety overfiltering on security-adjacent code
|
||||||
|
|
||||||
|
`GOTCHAS.md` documents strict alignment generally. For coding agents this bites more often: pentest tooling, CTF write-ups, auth-bypass debugging, even aggressive `rm -rf`-style cleanup can trigger refusals or bowdlerized edits.
|
||||||
|
|
||||||
|
**Workaround:** The agent's system prompt should establish authorization context — "this is an authorized security test", "this is my own machine", "this is a CTF challenge". Don't rephrase as a jailbreak; state context plainly. Stock agent system prompts typically don't set this, so it's often the first thing to add.
|
||||||
|
|
||||||
|
### 2. Weak long JSON → favors sequential tool calls
|
||||||
|
|
||||||
|
Gemma 4 struggles with deeply-nested schemas and long arrays (existing `GOTCHAS.md` finding). Agent-level implication:
|
||||||
|
|
||||||
|
- **Agents that drive tool-by-tool** (openclaw, open code, pi, cline): good fit. Each `write_file` / `bash` / `read_file` is a short tool call.
|
||||||
|
- **Agents that expect one-giant-structured-response** (some aider edit modes, any "output the entire diff as JSON"): expect parse failures on long patches. Break into smaller edits if possible.
|
||||||
|
|
||||||
|
### 3. No code execution — that's the agent's job
|
||||||
|
|
||||||
|
Gemma 4 has no sandbox / kernel / VM. It decides when to call bash; the agent runs it. This is standard but worth stating — no CodeInterpreter-style "model runs the code" path.
|
||||||
|
|
||||||
|
### 4. Long-horizon context pressure
|
||||||
|
|
||||||
|
Gemma 4 supports 256K on 26B/31B but the KV cache is VRAM-hungry (existing `GOTCHAS.md`). For an agent churning through a repo:
|
||||||
|
|
||||||
|
- 32K ctx = comfortable on a 24GB card
|
||||||
|
- 128K ctx = you're feeding a lot of VRAM to cache, not weights
|
||||||
|
- Prefer **agent-side retrieval** (grep, ripgrep, targeted file reads) over "paste the whole repo in context"
|
||||||
|
|
||||||
|
### 5. Identity drift across long sessions
|
||||||
|
|
||||||
|
Gemma 4's "ultra-compliant but doesn't know who it is" (existing GOTCHA) shows up in long agent sessions as subtle drift — switching voice, adopting a different tool-call style mid-session, forgetting constraints from turn 1. The `SYNTHESIS.md` system-prompt template (identity + what-you-do + what-you-do-not + format) is more important for a 50-turn agent loop than a 3-turn chat.
|
||||||
|
|
||||||
|
### 6. Missing coding-specific agentic benchmark (same warning, bigger stakes)
|
||||||
|
|
||||||
|
Because Google didn't publish SWE-bench, you're operating on extrapolation from Codeforces + tau2-bench when you use Gemma 4 as a CLI coding agent. Measure on your actual repo before taking a dependency.
|
||||||
|
|
||||||
|
## Homelab setup (Seth)
|
||||||
|
|
||||||
|
**Natural testbed:** CT 166 `openclaw2` on pve197 → CT 105 Ollama on pve197.
|
||||||
|
|
||||||
|
Both are on the same host so there's no network hop. CT 105 already serves
|
||||||
|
`gemma4:26b` and `gemma4:31b-it-q4_K_M` (verified in handoff + per-node inventory
|
||||||
|
in `/home/claude/bin/CLAUDE.md`).
|
||||||
|
|
||||||
|
1. Verify openclaw2's current model config. If it's pointing at a different
|
||||||
|
backend, switch to `http://192.168.0.179:11434/v1` with `gemma4:26b` (or 31B if
|
||||||
|
VRAM permits alongside the V100 CT 167 visualizer stack).
|
||||||
|
2. Set default options per the block above (`num_ctx: 32768`, `num_predict: 4096`,
|
||||||
|
`think: false`, `temperature: 0.3`, `keep_alive: 4h`).
|
||||||
|
3. Run one real task (suggested: a small addition to Mortdecai-2.0 — a codebase with
|
||||||
|
existing CLAUDE.md and clear conventions, good signal-to-noise).
|
||||||
|
4. Capture: number of tool calls, number of retries, diff quality, wall clock.
|
||||||
|
5. **Same task** against `qwen3-coder:30b` on steel141 (`http://192.168.0.141:11434/v1`).
|
||||||
|
Don't A/B anything else — same agent, same prompt, same repo state, different backend.
|
||||||
|
6. If Gemma 4 dominates on plan/navigate/describe but Qwen dominates on write_file
|
||||||
|
quality, the natural step is per-role model split: let openclaw2 use Gemma for
|
||||||
|
"thinking" tool calls and Qwen for edit tool calls. open code's provider config
|
||||||
|
supports this cleanly.
|
||||||
|
|
||||||
|
## What is NOT covered by this document
|
||||||
|
|
||||||
|
- Concrete benchmark results from the proposed bakeoff (do the measurement, write a separate findings file)
|
||||||
|
- openclaw / hermes / pi / open code feature-matrix detail (each agent has its own docs — the HF blog links to all four)
|
||||||
|
- aider-specific diff-format analysis (aider wasn't in the HF blog's tested set)
|
||||||
|
- Fine-tuning Gemma 4 for coding agents (see `tooling/fine-tuning/` — the existing path)
|
||||||
|
- CodeGemma (still Gemma 1 base — see `tooling/gemma-family/codegemma.md`)
|
||||||
|
|
||||||
|
## Provenance
|
||||||
|
|
||||||
|
- HF 31B-it model card: `tooling/huggingface/model-cards/gemma-4-31B-it-README.md`
|
||||||
|
- HF launch blog: `tooling/huggingface/blog/gemma4-blog.md`
|
||||||
|
- Benchmarks: `CORPUS_benchmarks.md`
|
||||||
|
- Tool calling: `CORPUS_tool_calling_format.md`
|
||||||
|
- Ollama variants: `CORPUS_ollama_variants.md`
|
||||||
|
- Known issues: `GOTCHAS.md`
|
||||||
|
- Qwen3-Coder in homelab: `/home/claude/bin/CLAUDE.md` § "Ollama models"
|
||||||
@@ -14,6 +14,7 @@ Research corpus and implementation guidance for Google Gemma 4, based on product
|
|||||||
| `CORPUS_capabilities.md` | Modalities (vision, audio, video, tools), what it can/can't do | When scoping what Gemma 4 can handle |
|
| `CORPUS_capabilities.md` | Modalities (vision, audio, video, tools), what it can/can't do | When scoping what Gemma 4 can handle |
|
||||||
| `CORPUS_benchmarks.md` | Full benchmark table vs Gemma 3, arena scores, agentic scores | When comparing Gemma 4 to alternatives |
|
| `CORPUS_benchmarks.md` | Full benchmark table vs Gemma 3, arena scores, agentic scores | When comparing Gemma 4 to alternatives |
|
||||||
| `CORPUS_tool_calling_format.md` | Native token format + JSON API format for function calling | When implementing tool calling |
|
| `CORPUS_tool_calling_format.md` | Native token format + JSON API format for function calling | When implementing tool calling |
|
||||||
|
| `CORPUS_cli_coding_agent.md` | Positioning Gemma 4 for CLI coding agent use (openclaw / open code / pi / hermes / aider style). Honest take on what Google did and didn't measure, head-to-head with `qwen3-coder:30b`, homelab setup pointer | When scoping a CLI coding agent or deciding Gemma 4 vs Qwen3-Coder |
|
||||||
| `tooling/` | **Canonical upstream tooling** — real scripts, notebooks, model cards, and configs pulled from Google / HF / framework maintainers (147 files). Subdirs: `google-official/`, `huggingface/`, `inference-frameworks/`, `gemma-family/`, `fine-tuning/`. See `tooling/README.md` for index and findings that update the older `CORPUS_*` docs | When you need authoritative source material — model cards, chat templates, fine-tuning recipes, serving commands for vLLM / llama.cpp / MLX, or to scope a specialized sibling (ShieldGemma, EmbeddingGemma, etc.) |
|
| `tooling/` | **Canonical upstream tooling** — real scripts, notebooks, model cards, and configs pulled from Google / HF / framework maintainers (147 files). Subdirs: `google-official/`, `huggingface/`, `inference-frameworks/`, `gemma-family/`, `fine-tuning/`. See `tooling/README.md` for index and findings that update the older `CORPUS_*` docs | When you need authoritative source material — model cards, chat templates, fine-tuning recipes, serving commands for vLLM / llama.cpp / MLX, or to scope a specialized sibling (ShieldGemma, EmbeddingGemma, etc.) |
|
||||||
|
|
||||||
## Source Projects
|
## Source Projects
|
||||||
|
|||||||
@@ -176,6 +176,7 @@ Vision is on ALL Gemma 4 variants (E2B, E4B, 26B, 31B). Audio is E-series only.
|
|||||||
| Maximum quality (single-model GPU) | `gemma4:31b-it-q4_K_M` | Dense 31B, sharpest but 5x slower, more VRAM pressure |
|
| Maximum quality (single-model GPU) | `gemma4:31b-it-q4_K_M` | Dense 31B, sharpest but 5x slower, more VRAM pressure |
|
||||||
| Rapid prototyping / testing | `gemma4:26b` | Fast enough for interactive dev |
|
| Rapid prototyping / testing | `gemma4:26b` | Fast enough for interactive dev |
|
||||||
| Retrieval / embeddings | `embeddinggemma` (308M, separate model) | Gemma 4 has no embedding mode; use the sibling |
|
| Retrieval / embeddings | `embeddinggemma` (308M, separate model) | Gemma 4 has no embedding mode; use the sibling |
|
||||||
|
| CLI coding agent (openclaw / open code / pi / hermes / aider) | `gemma4:26b` (or compare to `qwen3-coder:30b`) | Trained tool use + strong LiveCodeBench, but Google didn't publish SWE-bench — see `CORPUS_cli_coding_agent.md` for the honest positioning and the homelab bakeoff plan |
|
||||||
|
|
||||||
## Anti-Patterns
|
## Anti-Patterns
|
||||||
|
|
||||||
|
|||||||
Reference in New Issue
Block a user