feat: CLI coding agent bakeoff — 26b reproducibly silent-stops at write_file

Ran minimal agent loop (Ollama /api/chat + read_file/write_file/run_bash) on steel141 3090 Ti against 3 models on a broken-median-function task:

- gemma4:31b-it-q4_K_M: PASS (8 iters, 1 write, 44s); textbook trace
- qwen3-coder:30b: PASS (15 iters, 1 write, 22s); correct but chatty
- gemma4:26b: FAIL (6 iters, 0 writes); silently stops with eval=4 after reading source. Reproduced on second run.

One-shot probe confirms 26b CAN produce the correct fix; failure is specifically at the write_file tool-call argument boundary.

Updates GOTCHAS with a new HIGH-severity entry, SYNTHESIS model-selection table, CORPUS_cli_coding_agent.md empirical-follow-up pointer, and adds docs/reference/bakeoff-2026-04-18.md with the full writeup.
2026-04-18 13:27:50 -04:00
parent 4b9c537dda
commit a945207aab
15 changed files with 1172 additions and 1 deletions
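For orientation, here is a minimal sketch of the kind of fix the task asks for. It is not taken from this commit, and the exact code the passing models wrote is not shown in this diff. It assumes the only required change is to average the two middle elements on even-length input, reusing the `median` name and signature from the fixture.

```python
def median(numbers):
    """Return the median of a list of numbers (illustrative fix, not from the commit)."""
    s = sorted(numbers)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        # Odd length: the single middle element is the median.
        return s[mid]
    # Even length: average the lower-middle and upper-middle elements.
    return (s[mid - 1] + s[mid]) / 2


if __name__ == "__main__":
    # Same inputs as the median tests in the fixture's test file.
    print(median([1, 2, 3]))
    print(median([1, 2, 3, 4]))
    print(median([3, 1, 4, 1, 5, 9, 2, 6]))
```

Like the original, this sketch raises `IndexError` on an empty list. The fixture's tests do not cover that case, so it is left unhandled here.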
@@ -0,0 +1,14 @@
+# Bakeoff Task
+
+A tiny Python package (`calc/`) with a statistics module. Run `pytest` from this
+directory. Three tests currently fail because `median` returns the upper-middle
+element instead of averaging the two middle elements on even-length inputs.
+
+Your job: make all tests pass. Do not disable or modify the tests.
+
+Allowed tools:
+- `read_file(path)` — read a file (relative to this directory)
+- `write_file(path, content)` — overwrite a file (relative to this directory)
+- `run_bash(command)` — run a shell command (cwd is this directory)
+
+When all tests pass, reply with a short summary of the fix and stop calling tools.
@@ -0,0 +1,35 @@
+"""Basic statistics helpers."""
+
+
+def mean(numbers):
+    """Arithmetic mean of a non-empty list."""
+    return sum(numbers) / len(numbers)
+
+
+def median(numbers):
+    """Return the median of a list of numbers."""
+    s = sorted(numbers)
+    n = len(s)
+    return s[n // 2]
+
+
+def mode(numbers):
+    """Return the most common value. Ties broken by first occurrence."""
+    counts = {}
+    for x in numbers:
+        counts[x] = counts.get(x, 0) + 1
+    best = None
+    best_count = -1
+    for x in numbers:
+        if counts[x] > best_count:
+            best = x
+            best_count = counts[x]
+    return best
+
+
+def variance(numbers):
+    """Sample variance (divides by n-1)."""
+    if len(numbers) < 2:
+        raise ValueError("variance requires at least 2 values")
+    m = mean(numbers)
+    return sum((x - m) ** 2 for x in numbers) / (len(numbers) - 1)
@@ -0,0 +1,30 @@
+from calc.stats import mean, median, mode, variance
+
+
+def test_mean_basic():
+    assert mean([1, 2, 3, 4, 5]) == 3.0
+
+
+def test_median_odd():
+    assert median([1, 2, 3]) == 2
+
+
+def test_median_even():
+    assert median([1, 2, 3, 4]) == 2.5
+
+
+def test_median_unsorted():
+    assert median([3, 1, 4, 1, 5, 9, 2, 6]) == 3.5
+
+
+def test_median_floats():
+    assert median([1.0, 2.0, 3.0, 4.0]) == 2.5
+
+
+def test_mode_basic():
+    assert mode([1, 2, 2, 3]) == 2
+
+
+def test_variance_basic():
+    # sample variance (n-1) of [1, 2, 3, 4, 5] is 10/4 = 2.5
+    assert variance([1, 2, 3, 4, 5]) == 2.5