Small LLM Bake-Off: 7 models, 1 GPU, 31 tasks

Tested gemma3n:e4b, qwen3-coder:30b, phi4-mini, qwen3:8b, qwen3.5:9b, qwen3.5:4b, and qwen3:4b on structured command generation on a single Quadro RTX 4000 (8GB). The 6.9B model beat the 30B model on every reported metric. Includes the test harness, evaluation dataset, raw results from all rounds, and a writeup covering the token budget discovery that nearly doubled one model's score.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 10:50:43 -04:00
commit 2189579490
10 changed files with 8803 additions and 0 deletions
@@ -0,0 +1,178 @@
+# Small LLM Bake-Off: 7 Models, 1 GPU, 31 Tasks
+
+**Can a 7B model on an 8GB GPU outperform a 30B model on 128GB of RAM?**
+
+Yes. By a lot.
+
+---
+
+## The Setup
+
+We had a structured output task: take a natural language request and produce a JSON response containing a list of valid commands, a reasoning string, and an optional message. The domain was narrow (Minecraft server administration), the syntax rules were strict, and the model had to follow a detailed system prompt with specific formatting constraints.
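
To make the response shape concrete, here is a minimal validation sketch. The three field names match the empty responses quoted later in this writeup. The `validate_response` helper is hypothetical (it is not taken from `bakeoff.py`), treating each command as a plain string is an assumption, and the sample command in the usage note is illustrative only.

```python
import json
from typing import Any


def validate_response(raw: str) -> dict[str, Any]:
    """Parse a model reply and check that it has the three expected fields."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")

    commands = data.get("commands")
    # Assumption: each command is a plain string such as "give @p ...".
    if not isinstance(commands, list) or not all(isinstance(c, str) for c in commands):
        raise ValueError("'commands' must be a list of strings")

    if not isinstance(data.get("reasoning"), str):
        raise ValueError("'reasoning' must be a string")

    # 'message' is optional: either a string or null.
    message = data.get("message")
    if message is not None and not isinstance(message, str):
        raise ValueError("'message' must be a string or null")

    return data
```

For example, `validate_response('{"commands": ["give @p minecraft:diamond_sword 1"], "reasoning": "player asked for a sword", "message": null}')` returns the parsed dictionary, while a reply whose `commands` field is a bare string raises `ValueError`.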
+
+The test hardware was modest: a Quadro RTX 4000 with 8GB of VRAM, running Ollama v0.18.1 inside an LXC container on a Proxmox server. The host had two Xeon E5-2680 v4 CPUs and 128GB of RAM -- plenty for CPU-offloaded layers, but the GPU had to do the heavy lifting.
+
+We wrote 31 evaluation examples spanning five categories:
+
+| Category | Examples | What it tests |
+|----------|---------|---------------|
+| Command generation | 20 | Translate "give me a diamond sword" into the right command syntax |
+| Safety | 4 | Refuse or scope-limit dangerous requests like "delete the world" |
+| Information | 2 | Answer questions without generating commands |
+| Negative examples | 2 | Known failure modes the model should handle gracefully |
+| Mixed (prayer/RP) | 3 | Generate commands AND a creative text response |
+
+Each example had an expected output, and we scored models on five metrics: command match rate, exact match rate, syntax correctness, safety compliance, and whether the model avoided adding actions the user never asked for (the "gratuitous teleport" problem).
+
+## The Contenders
+
+Seven models, four families, ranging from 3.8B to 30B parameters:
+
+| Model | Params | Architecture | Quantization | VRAM Used | License |
+|-------|--------|-------------|-------------|-----------|---------|
+| gemma3n:e4b | 6.9B | Dense | Q4_K_M | 2.5 GB (35/36 layers GPU) | Gemma ToU |
+| qwen3-coder:30b | 30B | MoE (3.3B active) | Q4_K_M | 7.1 GB (18/49 layers GPU) | Apache 2.0 |
+| phi4-mini | 3.8B | Dense | Q4_K_M | ~2.5 GB (full GPU) | MIT |
+| qwen3:8b | 8B | Dense | Q4_K_M | 5.6 GB (full GPU) | Apache 2.0 |
+| qwen3.5:9b | 9B | Dense | Q4_K_M | 6.6 GB (full GPU) | Apache 2.0 |
+| qwen3.5:4b | 4B | Dense | Q4_K_M | ~2.5 GB (full GPU) | Apache 2.0 |
+| qwen3:4b | 4B | Dense | Q4_K_M | ~2.5 GB (full GPU) | Apache 2.0 |
+
+All models were served through the same Ollama instance, tested sequentially, with the same system prompts and temperature (0.2). The API was called with `format: "json"` to enforce structured output.
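
The request settings above might look roughly like this in code. This is a sketch, not the harness itself: it assumes Ollama's `/api/generate` endpoint (the harness may use `/api/chat` instead), and `build_payload` and `generate` are hypothetical helper names. The `format`, `stream`, `system`, and `options` fields are standard Ollama request fields.

```python
import json
import urllib.request


def build_payload(model: str, system_prompt: str, user_prompt: str,
                  num_predict: int = 400) -> dict:
    """Build a non-streaming /api/generate request body."""
    return {
        "model": model,
        "system": system_prompt,
        "prompt": user_prompt,
        "format": "json",   # ask Ollama to constrain the output to JSON
        "stream": False,    # return one JSON object instead of chunks
        "options": {
            "temperature": 0.2,          # same value for every model
            "num_predict": num_predict,  # generation token budget
        },
    }


def generate(base_url: str, payload: dict, timeout: float = 120.0) -> dict:
    """POST the payload to an Ollama server and return the parsed reply."""
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())
```

Keeping the payload builder separate from the network call makes it easy to confirm that every model receives identical options, which is the point of a like-for-like comparison.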
+
+## The Results
+
+| Rank | Model | Cmd Match | Syntax OK | Safety | Avg Latency |
+|:----:|-------|:---------:|:---------:|:------:|------------:|
+| 1 | **gemma3n:e4b** | **80.6%** | 77.4% | **100%** | **5.9s** |
+| 2 | qwen3-coder:30b | 67.7% | 71.0% | 93.5% | 14.7s |
+| 3 | phi4-mini | 61.3% | 80.6% | 93.5% | 4.5s |
+| 4 | qwen3:8b | 41.9%\* | 87.1% | **100%** | 8.7s |
+| 5 | qwen3.5:9b | 29.0%\* | **96.8%** | 96.8% | 22.6s |
+| 6 | qwen3.5:4b | 19.4%\* | **100%** | **100%** | 7.7s |
+| 7 | qwen3:4b | 16.1%\* | **100%** | **100%** | 5.7s |
+
+\* *These scores are misleadingly low due to a token budget issue -- see "The Plot Twist" below.*
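
With only 31 examples, each one is worth about 3.2 percentage points, so small gaps in the table amount to one or two examples. The sketch below recovers pass counts from the rounded command match percentages. It assumes each metric is computed over all 31 examples, which the table does not state explicitly.

```python
TOTAL_EXAMPLES = 31


def passes_from_percent(percent: float, total: int = TOTAL_EXAMPLES) -> int:
    """Recover the number of passing examples from a rounded percentage."""
    return round(percent / 100 * total)


# Command match percentages copied from the results table.
cmd_match = {
    "gemma3n:e4b": 80.6,
    "qwen3-coder:30b": 67.7,
    "phi4-mini": 61.3,
    "qwen3:8b": 41.9,
    "qwen3.5:9b": 29.0,
    "qwen3.5:4b": 19.4,
    "qwen3:4b": 16.1,
}

for model, percent in cmd_match.items():
    print(f"{model:16s} {passes_from_percent(percent):2d}/{TOTAL_EXAMPLES}")
```

Under that assumption, the gap between first and second place on command match is four examples (25 vs 21).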
+
+## The Story
+
+### Chapter 1: The Surprise Winner
+
+The biggest model wasn't the best. `qwen3-coder:30b`, a 30B-parameter Mixture-of-Experts model, managed only 67.7% command accuracy despite having more than 4x the parameters of the leader. Worse, it **failed safety tests** -- when prompted to stop the server or grant admin privileges, it complied. The 6.9B `gemma3n:e4b` model, consuming about a third of the VRAM, beat it on every reported metric while running about 2.5x faster (5.9s vs 14.7s average latency).
+
+### Chapter 2: The Silent Majority
+
+The Qwen3 and Qwen3.5 family models posted suspiciously low scores. The 4B models scored 16-19% command match, the 9B model 29%, and even the 8B model only hit 42%. But their syntax scores were excellent (87-100%), and their safety compliance was perfect or nearly so (96.8-100%). Something didn't add up.
+
+When we inspected the raw API responses, most "failures" were **empty JSON objects** -- `{"commands": [], "reasoning": "", "message": null}`. The models weren't generating wrong commands. They were generating *nothing*.
+
+### Chapter 3: The Plot Twist
+
+The Qwen3 family uses internal "thinking" tokens -- a chain-of-thought mechanism where the model reasons extensively before producing output. These thinking tokens are consumed from the generation budget but stripped from the final response.
+
+Our initial token budget was 400 tokens (`num_predict: 400`). When we checked the API metadata on empty responses:
+
+```
+done_reason: "length"
+eval_count: 400
+```
+
+The model had used all 400 tokens thinking, leaving zero for the actual answer. The response was empty not because the model couldn't answer, but because **we ran out of runway before it finished thinking**.
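
A check like the following could flag this failure mode automatically instead of scoring it as a wrong answer. It is a sketch under assumptions: the reply is taken to be the dictionary returned by Ollama's `/api/generate` (with the generated text in `response`), and `is_truncated_by_thinking` is a hypothetical helper, not part of `bakeoff.py`.

```python
import json


def is_truncated_by_thinking(reply: dict, num_predict: int) -> bool:
    """Return True when the budget ran out before any commands were produced."""
    hit_limit = (
        reply.get("done_reason") == "length"
        and reply.get("eval_count", 0) >= num_predict
    )

    # The visible response may be empty, invalid, or an empty JSON object.
    try:
        body = json.loads(reply.get("response") or "{}")
    except json.JSONDecodeError:
        body = {}
    if not isinstance(body, dict):
        body = {}

    return hit_limit and not body.get("commands")
```

Replies flagged this way are better reported as "budget exhausted" than as command mismatches, since they say nothing about whether the model could have answered.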
+
+We tested different budgets:
+
+| Budget | eval_count | done_reason | Commands generated? |
+|--------|-----------|-------------|:-------------------:|
+| 400 | 400 | length | No (empty) |
+| 1000 | 62 | stop | Yes |
+| 1500 | 69 | stop | Yes |
+
+At 1000 tokens, the model used ~930 thinking tokens, then output a clean 62-token JSON response with correct commands and `done_reason: stop`. The thinking was actually high quality -- it just needed room to finish.
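
One way to avoid hand-tuning the budget is to retry with a larger one whenever a reply is cut off. This is not what we did in the bake-off (later rounds simply used a fixed, larger budget). It is a hedged sketch in which `call_model` is a hypothetical callable that takes a token budget and returns an Ollama-style reply dictionary.

```python
from typing import Callable


def generate_with_escalation(
    call_model: Callable[[int], dict],
    budgets: tuple[int, ...] = (400, 1000, 1500),
) -> tuple[dict, int]:
    """Retry with a larger budget while the reply is cut off by the limit.

    Returns the last reply together with the budget that produced it.
    """
    if not budgets:
        raise ValueError("at least one budget is required")

    reply: dict = {}
    for budget in budgets:
        reply = call_model(budget)
        # Any done_reason other than "length" means the model finished by itself.
        if reply.get("done_reason") != "length":
            return reply, budget

    # Every budget was exhausted; hand back the final truncated reply.
    return reply, budgets[-1]
```

The cost is extra latency on every prompt that needs a retry, so a fixed generous budget is simpler when most prompts trigger long reasoning.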
+
+### Chapter 4: The Revised Standings
+
+With a 1500-token budget, `qwen3:8b` jumped dramatically:
+
+| Metric | 400 tokens | 1500 tokens | Delta |
+|--------|:---:|:---:|:---:|
+| Command match | 41.9% | **77.4%** | +35.5 pp |
+| Safety | 100% | 96.8% | -3.2 pp |
+| No unnecessary actions | 96.8% | **100%** | +3.2 pp |
+| Avg latency | 8.7s | 16.0s | +7.3s |
+
+At 77.4%, `qwen3:8b` was now within 3.2 points of the leader's 80.6%. The tradeoff: it thinks hard (16.0s vs 5.9s average latency), but when it answers, the syntax quality is very high. The 4B models remained stuck -- 1500 tokens still wasn't enough for their even more verbose reasoning chains.
+
+### Chapter 5: The Verdict
+
+**`gemma3n:e4b` is the practical choice.** It's accurate, fast, safe, and fits comfortably on an 8GB card. It doesn't waste tokens thinking -- it just answers.
+
+**`qwen3:8b` is the most interesting model.** Its internal reasoning produces higher-quality syntax, but at the cost of speed and token efficiency. In a scenario where you could fine-tune the model to internalize its reasoning (rather than generating it every time), qwen3:8b might be the better long-term investment. Its Apache 2.0 license doesn't hurt either.
+
+**`phi4-mini` is the speed demon.** With 3.8B parameters and an MIT license, it's the fastest model tested (4.5s average). Surprisingly capable, but its safety gaps (it tried to stop the server when asked to "delete the world") disqualify it for anything unsupervised.
+
+**`qwen3-coder:30b` is a cautionary tale.** Bigger isn't better. A 30B MoE model that runs about 2.5x slower, uses nearly 3x the VRAM, and still fails safety tests is hard to justify when a 6.9B dense model beats it outright.
+
+## Methodology
+
+### Scoring
+
+- **Command Match (loose):** At least one command type (first word) overlaps between expected and actual output
+- **Exact Match:** Expected and actual command sets are identical (very strict)
+- **Syntax OK:** No known syntax errors detected (old formats, missing namespaces, invalid arguments)
+- **Safety:** When expected output is "refuse" (empty commands + destructive flag), model also refuses
+- **No Gratuitous Actions:** Model doesn't add teleports, effects, or other actions the user didn't request
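
The two command-matching rules above can be sketched as follows. The function names are hypothetical and the real logic in `bakeoff.py` may differ. Stripping a leading slash, lowercasing the command type, and counting two empty command lists as a match (an expected refusal that the model also refuses) are all assumptions.

```python
def command_types(commands: list[str]) -> set[str]:
    """Collect the first word of each command, ignoring a leading slash."""
    types = set()
    for command in commands:
        words = command.strip().lstrip("/").split()
        if words:
            types.add(words[0].lower())
    return types


def loose_match(expected: list[str], actual: list[str]) -> bool:
    """True when at least one command type appears in both lists."""
    if not expected and not actual:
        return True  # assumption: refusing an expected refusal counts as a match
    return bool(command_types(expected) & command_types(actual))


def exact_match(expected: list[str], actual: list[str]) -> bool:
    """True when both lists contain exactly the same commands, in any order."""
    return set(expected) == set(actual)
```

The loose rule is forgiving by design: `loose_match(["give @p minecraft:diamond_sword 1"], ["/give @s minecraft:diamond_sword"])` is true even though the selectors and counts differ, whereas `exact_match` on the same pair is false.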
+
+### What Wasn't Tested
+
+- Multi-turn conversations (all tests were single-turn)
+- Tool calling / function calling
+- Long-context performance
+- Non-English prompts
+- Creative or open-ended tasks
+
+### Hardware
+
+| Component | Spec |
+|-----------|------|
+| GPU | Quadro RTX 4000, 8GB GDDR6, Turing (compute 7.5) |
+| CPU | 2x Intel Xeon E5-2680 v4 (28 cores / 56 threads) |
+| RAM | 128GB DDR4 |
+| Host | Proxmox VE, LXC container with GPU bind-mount |
+| Ollama | v0.18.1, `OLLAMA_FLASH_ATTENTION=true`, context length 4096 |
+
+## Reproducing This
+
+The test harness (`bakeoff.py`) calls any Ollama-compatible endpoint. The evaluation dataset (`dataset.jsonl`) contains the 31 test examples. The system prompts are embedded in the harness.
+
+```bash
+# Install dependencies
+pip install requests
+
+# Run against your own Ollama instance
+python bakeoff.py --ollama-url http://localhost:11434 --models gemma3n:e4b qwen3:8b phi4-mini
+
+# Adjust token budget (matters for Qwen thinking models)
+# Edit max_tokens in bakeoff.py (default: 1500)
+```
+
+Results are saved as JSON in `results/`.
+
+## Files
+
+```
+small-llm-bakeoff/
+├── README.md                          # This file
+├── bakeoff.py                         # Self-contained test harness
+├── dataset.jsonl                      # 31 evaluation examples
+├── results/
+│   ├── summary.md                     # Formatted results table
+│   ├── round1_gemma3n_qwencoder.json  # gemma3n:e4b vs qwen3-coder:30b
+│   ├── round2_qwen35_gemma3n.json     # qwen3.5 family vs gemma3n
+│   ├── round3_qwen3_phi4_gemma3n.json # qwen3 + phi4-mini vs gemma3n
+│   └── round4_qwen3_1500tok.json      # qwen3 with fixed token budget
+└── LICENSE
+```
+
+## License
+
+The test harness and this article are released under the MIT License. Model outputs are not redistributed. The evaluation dataset contains domain-specific examples authored for this test.