small-llm-bakeoff

Seth/small-llm-bakeoff

Commit Graph

Author	SHA1	Message	Date

Seth

33e3e55770

Round 5: Live RCON bake-off results (preliminary)

- Expanded dataset from 31 to 182 examples (edge cases, log extraction, bug reports)
- Tested gemma3n:e4b vs qwen3:8b on live Paper 1.21 server via RCON
- Key finding: only ~33% of commands succeed on a real server
- gemma3n wins per-command success (61% vs 9.6%), qwen3 wins accuracy (72% vs 62%)
- Results are preliminary: no player was online, which inflated failure rates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-18 15:09:10 -04:00

Seth

2189579490

Small LLM Bake-Off: 7 models, 1 GPU, 31 tasks

Tested gemma3n:e4b, qwen3-coder:30b, phi4-mini, qwen3:8b, qwen3.5:9b,
qwen3.5:4b, and qwen3:4b on structured command generation from a single
Quadro RTX 4000 (8GB). The 6.9B model beat the 30B model on every metric.

Includes the test harness, evaluation dataset, raw results from all rounds,
and a writeup covering the token budget discovery that doubled one model's
score overnight.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-18 10:50:43 -04:00

2 Commits