Files
gemma4-research/docs/reference/bakeoff-2026-04-18.md
T
Mortdecai 7f806e0b92 feat: round-2 bakeoff — 26b silent-stop is tool-response context size
Round 2 tested the hypothesis that 26B's silent-stop was about
write_file argument size. Result: refuted.

- Patch-mode (apply_patch instead of write_file): 26B fails identically
  at iter 6. Tool-arg size is not the cause.
- Truncation sweep on tool responses reveals the real trigger: cap at
  800 or 1200 chars → 26B PASSES (1200-cap is 8.4s, fastest of any run).
  Cap at 1600, 2000, or unlimited → 26B silent-stops with eval=4.

Revised understanding: 26B silent-stops when cumulative tool-response
context crosses a shape threshold around 1200-1600 chars per response.
Not a tool-arg bug, not a raw code-gen bug — 26B emits correct code
fine in both one-shot and short-context settings.

Production CLI agents (openclaw, open code, aider) typically truncate
tool responses by default, so this failure may not surface in them.
Custom harnesses should cap ≤1200 chars per tool response when
targeting the 26B MoE.

Updates GOTCHAS (rewritten entry with the truncation sweep table),
SYNTHESIS model-selection row, CORPUS_cli_coding_agent.md pointer,
docs/reference/bakeoff-2026-04-18.md with full Round 2 methodology
and data.

Adds harness_patch.py (apply_patch edit tool), harness_patch_truncated.py
(env-configurable TOOL_RESULT_CAP), all 7 run logs, and a
.secrets.baseline for detect-secrets false positives on JSON timestamps.
2026-04-18 13:40:18 -04:00

242 lines
15 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# CLI Coding Agent Bakeoff — 2026-04-18
> Empirical follow-up to `CORPUS_cli_coding_agent.md`. Runs a minimal CLI coding
> agent loop against three candidate models on identical hardware and an
> identical broken-code task. **n=1 per model** (plus one re-run to check
> reproducibility of a failure). Treat as a smoke test, not a benchmark.
## Setup
- **Host:** steel141 (Seth's local box)
- **GPU:** NVIDIA RTX 3090 Ti, 24 GiB, ~22.7 GiB free
- **Ollama:** 0.20.4
- **Harness:** `scripts/bakeoff/harness.py` — custom minimal agent loop, **not** openclaw / open code / aider / pi / hermes. Protocol: Ollama `/api/chat` with `tools=[read_file, write_file, run_bash]`, non-streaming, `think: false`, `num_ctx: 32768`, `num_predict: 4096`, `temperature: 0.3`. Iteration cap = 15.
- **Task:** `scripts/bakeoff/task_seed/` — Python package with buggy `median()` function. 3 of 7 pytest tests fail on even-length inputs. Fix is ~5 lines.
- **System prompt:** generic CLI-agent template (identity + allowed tools + rules: "never modify tests", "prefer minimal edits"). Not tuned per model.
All three models pulled from steel141's local Ollama, swapped in/out of GPU as each run started. First iteration per run pays the load cost; later iterations are hot.
## Results
| Model | Pass | Iterations | write_file | read_file | run_bash | Wall clock | Halt reason |
|---|---|---|---|---|---|---|---|
| `gemma4:26b` | **Fail** | 6 | **0** | 2 | 3 | 10.9s | `no_tool_calls` (silent empty response) |
| `gemma4:26b` (retry) | **Fail** | 6 | **0** | 2 | 3 | 11.4s | `no_tool_calls` (reproduces exactly) |
| `gemma4:31b-it-q4_K_M` | **Pass** | 8 | 1 | 2 | 4 | 44.1s | `no_tool_calls` (clean summary turn) |
| `qwen3-coder:30b` | **Pass** | 15 (cap) | 1 | 4 | 8 | 22.6s | `no_tool_calls` (at iteration cap) |
### Gemma 4 31B — clean run
Textbook agent trace:
1. `read_file README.md`
2. `pytest` (exit=2, module not found — pytest needs PYTHONPATH)
3. `ls -R`
4. `PYTHONPATH=. pytest` → sees 3 failures
5. `read_file calc/stats.py`
6. `write_file calc/stats.py` (eval_count=330, 13.4s) — correct fix
7. `PYTHONPATH=. pytest` → all green
8. summary: *"I updated the `median` function in `calc/stats.py` to correctly calculate the average of the two middle elements..."*
Zero wasted turns. One write. Minimal edit.
### Qwen3-Coder 30B — correct but chatty
Passed, but used all 15 iterations:
- Narrated every step ("I'll help you...", "Now let's look at...")
- Tried to read a non-existent file (`test_calc.py`) — wasted iter 2
- Tried to `read_file` on a directory (`calc`) — wasted iter 6
- Ran several redundant bash calls (`pwd && pytest`, etc.)
- Emitted a ceremonial `echo "All tests pass..."` bash call at iter 14
- Final turn was a polite summary
The fix itself (iter 12) was correct on the first write. Quality is fine; efficiency isn't. Per-iteration it was fast (many 20-40 token turns) — total wall clock 22.6s beat Gemma 31B despite using nearly 2× the iterations.
### Gemma 4 26B — reproducible silent stop
Both runs followed an identical trajectory:
1. `ls -R`
2. `read_file README.md`
3. `pytest` (exit=2)
4. `PYTHONPATH=. pytest` → sees 3 failures
5. `read_file calc/stats.py`
6. **Empty response. `eval_count=4`. No tool calls. Loop terminates.**
Zero writes. The model saw all the context it needed (failing tests + buggy source) and then silently declined to act.
### Isolating the failure — one-shot probe
To check whether 26B can produce the fix at all, I ran a single-turn call with no tool loop:
```
prompt: "The following function is buggy — median([1,2,3,4]) returns 3
but should return 2.5. Rewrite it correctly. [buggy code]"
```
Response (eval_count=81):
```python
def median(numbers):
s = sorted(numbers)
n = len(s)
if n % 2 == 1:
return s[n // 2]
else:
return (s[n // 2 - 1] + s[n // 2]) / 2
```
**Correct.** So 26B's diagnosis and code generation are intact. The failure is specifically at the **tool-call-boundary** — when the model needs to emit a `write_file(path, content)` call where the `content` argument is a several-hundred-character string, it aborts with eval=4 instead.
This aligns with `GOTCHAS.md` § "Weak at Long/Nested JSON". A `write_file` tool call argument with a ~500-char string is structurally similar to a long JSON value. Gemma 4 31B handles the same surface reliably (eval=330 on that turn); the 26B MoE does not.
## Interpretation
### What this is evidence for
- **Gemma 4 31B is a viable CLI-coding-agent backing model on this class of task.** Clean trace, minimal wasted turns, correct fix on first write.
- **Qwen3-Coder 30B also works**, at the cost of more iterations and looser discipline. Diff quality was fine; agentic efficiency wasn't.
- **Gemma 4 26B has a reproducible failure mode** at tool-call-argument emission. It can reason. It can code. It struggles to deliver code through a `write_file` tool call when the content is non-trivial.
### What this is NOT evidence for
- **This is not a representative benchmark.** n=1 per model. One task. One fix. One harness. Do not conclude "Gemma 4 26B is broken for coding agents" — conclude "Gemma 4 26B failed this specific setup reproducibly; investigate further before relying on it."
- **This harness is not openclaw / open code / aider / pi / hermes.** Production agents wrap prompts, retries, and tool surfaces differently. The 26B failure may be avoided in a harness that:
- Uses a **patch/diff tool** (`apply_patch(old, new)`) instead of `write_file(full_content)` — smaller argument surface, matches the "sequential tool calls" pattern from `SYNTHESIS.md`
- Adds a **retry on empty response** (same as Simon's streaming-fallback pattern in `IMPLEMENTATIONS.md`)
- Provides fewer but richer tools (a dedicated `fix_file` that re-prompts internally)
- **This compares agent behavior, not raw performance.** Wall clock is noisy (model load, context size, token rate all differ). Per-iteration latency is more meaningful — but that only matters for throughput, not correctness.
### Recommendations
1. **For a CLI coding agent on Seth's hardware:** start with `gemma4:31b-it-q4_K_M`. Clean behavior, modest wall clock (44s for a simple fix), no retry needed.
2. **For comparison or backup:** `qwen3-coder:30b` is equally correct, roughly half the per-iteration cost, ~2× the iteration count. In a longer session those extra turns add up.
3. **Do not default to `gemma4:26b` for this pattern.** Two tests in a row silent-stopped at the write boundary. If you want to use the 26B MoE (it's strong on `LiveCodeBench v6` at 77.1%), validate it against your specific agent framework first — especially whether the framework uses `write_file` (full content) or `apply_patch` (delta) as its edit primitive.
4. **Test with the real harness you plan to use in production** (openclaw2, open code, etc.) before committing. A handful of this style of run takes minutes on the 3090 Ti and will tell you more than any benchmark card.
## Honest caveats
- **Stochasticity.** Only 26B was re-run. 31B and Qwen3-Coder might hit failure modes on a different seed or a different task. Temperature 0.3 is low but not zero.
- **System prompt bias.** "Start by reading README.md" steered all three models similarly; a different prompt skeleton would produce different traces. I did not tune per model — deliberately — because a production agent won't either.
- **The 26B silent-stop hypothesis (tool-arg emission failure) is inferred, not proven.** A clean confirmation would require running the same task with a smaller-surface edit tool (`apply_patch(path, old, new)` instead of `write_file(path, full_content)`) and showing 26B succeeds. That's the obvious follow-up.
- **Ollama 0.20.4** is between the 0.20.0/0.20.1 known-broken-streaming range and whatever is current. Non-streaming tool calls worked cleanly for 31B and Qwen; 26B's failure looks model-specific, not Ollama-specific, but I didn't test on a different Ollama version.
- **No openclaw / open code / aider runs.** Those are the frameworks named in the HF launch blog. This was a synthetic harness; transfer is plausible but unverified.
## Artifacts
- `scripts/bakeoff/harness.py` — the agent loop
- `scripts/bakeoff/task_seed/` — the broken-code seed (reset between runs)
- `scripts/bakeoff/runs/gemma4-26b/log.json` — full turn-by-turn trace
- `scripts/bakeoff/runs/gemma4-26b-retry/log.json`
- `scripts/bakeoff/runs/gemma4-31b/log.json`
- `scripts/bakeoff/runs/qwen3-coder-30b/log.json`
Each log records per-turn: content, tool calls, results (truncated to 800 chars), prompt/eval token counts, wall time. Final block records halt reason, pass/fail, iteration count, tool-call totals, total wall clock.
## Reproducing
```bash
cd scripts/bakeoff
python3 harness.py gemma4:31b-it-q4_K_M runs/gemma4-31b/work runs/gemma4-31b/log.json
python3 harness.py qwen3-coder:30b runs/qwen3-coder-30b/work runs/qwen3-coder-30b/log.json
python3 harness.py gemma4:26b runs/gemma4-26b/work runs/gemma4-26b/log.json
```
Each invocation resets the work directory from `task_seed/`, runs the loop, writes the log, and prints a one-line summary.
---
# Round 2 — isolating the 26B silent-stop
After Round 1 I hypothesized the 26B failure was about long `write_file(path, full_content)` tool arguments. Round 2 tests that.
## What was tested
1. **Patch-mode harness** (`harness_patch.py`) — identical to the original but swaps `write_file(path, content)` for `apply_patch(path, old_text, new_text)`. Arguments are a small delta (~100-200 chars), not the full file.
2. **Truncation-mode harness** (`harness_patch_truncated.py`) — same as patch-mode, but caps every tool response to `TOOL_RESULT_CAP` chars (env-configurable) before returning it to the model.
All else identical: same task, same system prompt, same Ollama settings, same 3090 Ti on steel141.
## Results
### Round 2a — patch-mode (small edit tool arguments)
| Model | Pass | Iters | patches | reads | bashes | Wall |
|---|---|---|---|---|---|---|
| `gemma4:31b-it-q4_K_M` | ✓ | 8 | 1 | 2 | 4 | 37s |
| `qwen3-coder:30b` | ✓ | 14 | 1 | 3 | 9 | 22s |
| `gemma4:26b` | **✗** | 6 | **0** | 2 | 3 | 8s |
**Hypothesis refuted.** 26B fails identically on patch-mode: 6 iters, silent stop at iter 6 with eval=4, zero edits. The tool-call **argument size is not the trigger.**
### Round 2b — tool-result truncation cap
Ran 26B through patch-mode with progressively smaller caps on each tool response:
| TOOL_RESULT_CAP | 26B Pass | Halt turn | prompt_eval at halt | eval_count at halt |
|---|---|---|---|---|
| **800** | ✓ | iter 15 (cap) | 3741 | 24 |
| **1200** | ✓ | iter 8 | 2294 | 27 |
| **1600** | ✗ | iter 6 | 2070 | **4** |
| **2000** | ✗ | iter 6 | 2157 | **4** |
| **unlimited** | ✗ | iter 6 | 2139 | **4** |
Sharp transition between 1200 and 1600. Below the line, 26B generates code (`eval_count=165` on the patch turn). Above the line, `eval_count=4` — effectively an EOS.
**The trigger is cumulative tool-response context shape, not total tokens.** The 800-cap run continued reasoning past 3741 prompt tokens without issue. The failing runs all halt at ~2070-2150 tokens — but the 1200-cap run crossed that same range (2076 at iter 7) and kept going. So "N tokens" isn't the cause — the recent-context pattern (large tool responses accumulated over 5 iterations) is.
### Bonus observation: 26B at 1200-cap is the fastest passing configuration
| Run | Iters | Wall clock |
|---|---|---|
| 26B @ 1200-cap | 8 | **8.4s** |
| 31B @ patch | 8 | 37s |
| Qwen3-Coder @ patch | 14 | 22s |
Same task, same correct fix. 26B's MoE (3.8B active params) is ~5× faster than 31B dense when it cooperates.
## Revised interpretation
- **Not "26B is broken for CLI coding agents."**
- **Not "long tool-call arguments break 26B."**
- **Yes: "26B silent-stops when the cumulative tool-response context crosses a certain shape/size threshold, at the decision-to-edit boundary."** Observed threshold here: per-tool-response cap somewhere between 1200 and 1600 chars, on this task / this Ollama version / this model variant.
- **The mitigation is standard.** Every production CLI agent (openclaw, open code, aider, cline, continue) truncates tool responses — this is table stakes, not exotic. 26B's "failure mode" is likely *already mitigated* in those frameworks. What my default harness did (pass full 4-6KB pytest outputs verbatim) is probably not what those frameworks do.
- **Exact mechanism is unproven.** I'm observing behavior, not internals. Could be MoE expert routing, could be chat-template edge case, could be some interaction with the tool-call channel tokens. Finding the root cause would require model instrumentation beyond this scope.
## Revised recommendation
1. **Default to `gemma4:31b-it-q4_K_M`** for general CLI coding agent use. Robust to long tool responses, no mitigation needed.
2. **Use `gemma4:26b`** if you care about latency AND your agent framework truncates tool responses (most do). 5× faster than 31B when it works.
3. **Verify by re-running against your actual agent framework.** Don't trust this harness as a proxy — it's a diagnostic, not a production test.
4. **If you're writing a custom agent and targeting 26B**, cap tool responses aggressively (≤1200 chars per response worked here; ≤800 is safer). pytest output in particular benefits from `--tb=line` or `-x` to shrink it.
## Artifacts (Round 2)
- `scripts/bakeoff/harness_patch.py` — patch-mode harness
- `scripts/bakeoff/harness_patch_truncated.py` — truncation-mode harness (env var `TOOL_RESULT_CAP`)
- `scripts/bakeoff/runs_patch/gemma4-26b/log.json` — patch mode, unlimited (fails)
- `scripts/bakeoff/runs_patch/gemma4-26b-truncated/log.json` — cap=800 (passes)
- `scripts/bakeoff/runs_patch/gemma4-26b-cap1200/log.json` — cap=1200 (passes)
- `scripts/bakeoff/runs_patch/gemma4-26b-cap1600/log.json` — cap=1600 (fails)
- `scripts/bakeoff/runs_patch/gemma4-26b-cap2000/log.json` — cap=2000 (fails)
- `scripts/bakeoff/runs_patch/gemma4-31b/log.json` — patch mode, passes (control)
- `scripts/bakeoff/runs_patch/qwen3-coder-30b/log.json` — patch mode, passes (control)
## Reproducing Round 2
```bash
cd scripts/bakeoff
# Patch-mode baseline (3 models)
python3 harness_patch.py gemma4:31b-it-q4_K_M runs_patch/gemma4-31b/work runs_patch/gemma4-31b/log.json
python3 harness_patch.py qwen3-coder:30b runs_patch/qwen3-coder-30b/work runs_patch/qwen3-coder-30b/log.json
python3 harness_patch.py gemma4:26b runs_patch/gemma4-26b/work runs_patch/gemma4-26b/log.json
# Truncation sweep on 26B
TOOL_RESULT_CAP=800 python3 harness_patch_truncated.py gemma4:26b runs_patch/gemma4-26b-truncated/work runs_patch/gemma4-26b-truncated/log.json
TOOL_RESULT_CAP=1200 python3 harness_patch_truncated.py gemma4:26b runs_patch/gemma4-26b-cap1200/work runs_patch/gemma4-26b-cap1200/log.json
TOOL_RESULT_CAP=1600 python3 harness_patch_truncated.py gemma4:26b runs_patch/gemma4-26b-cap1600/work runs_patch/gemma4-26b-cap1600/log.json
TOOL_RESULT_CAP=2000 python3 harness_patch_truncated.py gemma4:26b runs_patch/gemma4-26b-cap2000/work runs_patch/gemma4-26b-cap2000/log.json
```