Files

T

Seth 7da28c8800 Add model bake-off harness and base model research

Bake-off tested 7 models on 31 seed examples via GPU-accelerated Ollama
on node-197 RTX 4000. gemma3n:e4b leads for serving (80.6% cmd match,
100% safety, 5.9s). qwen3:8b recommended as fine-tuning base (Apache 2.0,
best syntax quality, strong ecosystem). Full research in MODEL_RESEARCH.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-18 08:54:11 -04:00

12 KiB

Raw Blame History

Model Research: Small LMs for LoRA/QLoRA Fine-Tuning

Date: 2026-03-18 Purpose: Evaluate small language models (4-14B) as base models for the Minecraft server ops assistant. Constraints:

8GB VRAM for inference (Q4 quantized via Ollama)

24GB VRAM for training (QLoRA)

Permissive license (Apache 2.0, MIT -- NOT community/restricted licenses)

Available on both Ollama (serving) and HuggingFace in safetensors/PyTorch (training)

Good instruction following and structured JSON output

Active fine-tuning ecosystem (Unsloth, Axolotl, PEFT, LlamaFactory)

Ranked Recommendations

1. Qwen3-8B (RECOMMENDED)

Attribute	Detail
Parameters	8B dense
Release	April 2025
License	Apache 2.0
HuggingFace	`Qwen/Qwen3-8B` -- safetensors, BF16
Ollama	`ollama pull qwen3:8b`
Q4 VRAM	~5.5 GB (fits 8GB comfortably)
QLoRA VRAM	~14-16 GB (fits 24GB easily)
Context	128K native

Why #1:

Outperforms Qwen2.5-14B on benchmarks despite being smaller. MMLU-Redux ~87, MATH-500 ~98.
Apache 2.0 with no usage restrictions -- the cleanest license in this list.
First-class Unsloth support with dedicated notebooks and 2x training speedup.
Supported by Axolotl, LlamaFactory, PEFT, and TRL out of the box.
Native thinking/non-thinking mode toggle -- useful for complex command generation vs. quick lookups.
Strong structured output support; JSON format instructions work reliably.
Massive community: most fine-tuned derivatives on HuggingFace of any model this size.

Caveats:

Newer than some alternatives, so fewer battle-tested fine-tunes in production.

2. Qwen3.5-4B

Attribute	Detail
Parameters	4B dense
Release	February 2026
License	Apache 2.0
HuggingFace	`Qwen/Qwen3.5-4B` -- safetensors, BF16/F32
Ollama	`ollama pull qwen3.5:4b` (~3.4 GB)
Q4 VRAM	~2.5-3 GB
QLoRA VRAM	~8-10 GB
Context	256K native

Why #2:

The newest model on this list (Feb 2026) with latest training techniques.
Extremely lightweight -- leaves massive headroom for context on 8GB cards.
256K context window is best-in-class for this parameter range.
Full Unsloth + LlamaFactory support confirmed.
Apache 2.0 license, no restrictions.
Ideal if your training data is small (<1000 examples) -- smaller models fine-tune faster and can still match larger models on narrow domains.

Caveats:

4B may struggle with complex multi-step reasoning compared to 8B.
Fewer community fine-tunes available yet (very new release).

3. Qwen3-4B

Attribute	Detail
Parameters	4B dense (36-layer transformer)
Release	April 2025
License	Apache 2.0
HuggingFace	`Qwen/Qwen3-4B` -- safetensors
Ollama	`ollama pull qwen3:4b`
Q4 VRAM	~2.5 GB
QLoRA VRAM	~8-10 GB
Context	128K native

Why #3:

Benchmarks rival Qwen2.5-72B-Instruct (!!) according to Qwen team claims.
MMLU-Redux 83.7, MATH-500 97.0 -- exceptional for 4B.
Well-established Unsloth support with notebooks and GGUF export pipeline.
Best fine-tuning benchmark results per distillabs.ai evaluation: "Qwen3-4B-Instruct-2507 delivers the best overall fine-tuned performance, matching a 120B+ teacher."
Apache 2.0.

Caveats:

Slightly older than Qwen3.5-4B; same parameter count but older architecture.

4. Phi-4-mini-instruct (3.8B)

Attribute	Detail
Parameters	3.8B
Release	February 2025
License	MIT
HuggingFace	`microsoft/Phi-4-mini-instruct` -- safetensors
Ollama	`ollama pull phi4-mini:3.8b`
Q4 VRAM	~2.5 GB
QLoRA VRAM	~8-10 GB
Context	128K

Why #4:

MIT license -- the most permissive option available.
Microsoft provides an official LoRA fine-tuning script in the HuggingFace repo.
Performance comparable to 7-9B models (Llama-3.1-8B level) despite being 3.8B.
200K vocabulary, grouped-query attention -- modern architecture.
JSON tool-calling format built into the chat template.
Unsloth support confirmed with dedicated notebooks.

Caveats:

Smaller community of fine-tuners compared to Qwen.
3.8B is the smallest viable option; may need more training data to match larger models on nuanced tasks.
Microsoft's Phi models have historically had some quirks with non-English content and repetition.

5. Gemma 3 4B-IT

Attribute	Detail
Parameters	4B (multimodal -- text + image)
Release	March 2025
License	Gemma Terms of Use (NOT Apache 2.0 -- see caveats)
HuggingFace	`google/gemma-3-4b-it` -- safetensors
Ollama	`ollama pull gemma3:4b` (~3.3 GB)
Q4 VRAM	~2.5 GB
QLoRA VRAM	~8-10 GB
Context	128K

Why #5:

Outperforms Gemma 2 27B on benchmarks -- a 7x smaller model beating its predecessor's flagship.
Google provides official LoRA fine-tuning docs with Keras and HuggingFace PEFT.
QAT (Quantization-Aware Training) variants available for better quantized performance.
Native function calling and structured output support.
Multimodal capability (text + images) could be useful for screenshot-based troubleshooting.
Unsloth, Axolotl, and LlamaFactory all support Gemma 3.

Caveats:

License is NOT Apache 2.0. Gemma Terms of Use allow commercial use but include a Prohibited Use Policy covering sensitive domains. Google retains the right to "restrict (remotely or otherwise) usage." This is more restrictive than Apache 2.0/MIT.
For a personal Minecraft server project this is likely fine, but it fails the strict "permissive license" requirement.

6. Gemma 3 12B-IT

Attribute	Detail
Parameters	12B (multimodal)
Release	March 2025
License	Gemma Terms of Use (same caveats as 4B)
HuggingFace	`google/gemma-3-12b-it` -- safetensors
Ollama	`ollama pull gemma3:12b`
Q4 VRAM	~6.6 GB (Google claims RTX 4060 8GB works)
QLoRA VRAM	~18-20 GB (fits 24GB)
Context	128K

Why #6:

The largest model that can fit in 8GB VRAM at Q4.
Best raw capability of any model on this list.
QAT Q4 variants from Google specifically optimized for consumer GPUs.
Full Unsloth support.

Caveats:

Tight fit on 8GB -- leaves little headroom for KV cache with long prompts.
Same license concerns as Gemma 3 4B.
QLoRA training at 12B needs more VRAM; will use ~18-20 GB of your 24GB budget.

7. Mistral NeMo 12B

Attribute	Detail
Parameters	12B
Release	July 2024
License	Apache 2.0
HuggingFace	`mistralai/Mistral-Nemo-Instruct-2407` -- safetensors
Ollama	`ollama pull mistral-nemo:12b`
Q4 VRAM	~7 GB
QLoRA VRAM	~18-22 GB (higher due to large vocabulary)
Context	128K

Why #7:

Apache 2.0 license, built with NVIDIA collaboration.
128K context, strong multilingual support.
Established fine-tuning ecosystem with mistral-finetune tool.

Caveats:

Oldest model on this list (July 2024) -- outperformed by newer 4-8B models on many benchmarks.
Large vocabulary (32K+ tokens) increases memory requirements for fine-tuning beyond what the parameter count suggests.
Tight fit on 8GB VRAM at Q4 with limited context headroom.
Not recommended over Qwen3-8B which is newer, smaller, and benchmarks better.

Models Considered and Rejected

Model	Reason for Rejection
Llama 3.2 (1B/3B)	Llama Community License prohibits using outputs to train non-Llama models. Distillation restrictions. Not truly permissive.
Llama 3.1-8B / 3.3-70B	Same license restrictions as above. The 700M MAU clause and output training restrictions disqualify it.
Qwen3-Coder (30B-A3B, 480B)	All variants are massive MoE models. Even the smallest (30B-A3B with 3B active) has 30B total parameters -- too large for 8GB inference and questionable for 24GB QLoRA.
Mistral Small 3 (24B)	24B parameters -- requires ~14 GB VRAM at Q4. Does not fit 8GB.
Phi-4 (14B)	Fits 8GB at Q4 (~8-9 GB) only marginally. QLoRA at 14B needs ~22-24 GB, cutting it very close. The 3.8B Phi-4-mini is a better fit for this project.
Gemma 2 (9B/27B)	Superseded by Gemma 3. No reason to use older generation.
Qwen2.5 (7B/14B)	Superseded by Qwen3 and Qwen3.5 with significantly better benchmarks.

Fine-Tuning Ecosystem Comparison (as of March 2026)

Framework	Qwen3/3.5	Phi-4-mini	Gemma 3	Mistral NeMo
Unsloth	Full support, dedicated notebooks, 2x speedup	Supported, notebooks available	Supported, Gemma 3n confirmed	Supported
Axolotl	Supported	Supported	Supported	Supported
LlamaFactory	Supported, Ollama export	Supported	Supported	Supported
HF PEFT/TRL	Supported	Supported, official script	Supported, Google official docs	Supported
Community notebooks	Abundant	Moderate	Abundant	Moderate

Recommendation for This Project

Primary: Qwen3-8B -- Best balance of capability, VRAM fit, license cleanliness, and fine-tuning ecosystem. It significantly outperforms older 14B models while fitting comfortably in 8GB at Q4. Apache 2.0 means zero legal concerns.

Secondary: Qwen3-4B or Qwen3.5-4B -- If training data is limited (<500 examples) or you want faster iteration cycles, a 4B model will fine-tune faster and still perform well on the narrow domain of Minecraft server operations. Qwen3.5-4B is newer with a 256K context window; Qwen3-4B has more proven fine-tuning results.

Note on qwen3-coder: The current PLAN.md references qwen3-coder as the base model. All Qwen3-Coder variants are large MoE models (30B+ total parameters) that do not fit the 8GB inference constraint. The recommendation is to use Qwen3-8B (or Qwen3-4B) as the base model instead. The coding/command-generation capability can be developed through fine-tuning on domain-specific data rather than requiring a code-specialized base model.

12 KiB Raw Blame History

Model Research: Small LMs for LoRA/QLoRA Fine-Tuning

Ranked Recommendations

1. Qwen3-8B (RECOMMENDED)

2. Qwen3.5-4B

3. Qwen3-4B

4. Phi-4-mini-instruct (3.8B)

5. Gemma 3 4B-IT

6. Gemma 3 12B-IT

7. Mistral NeMo 12B

Models Considered and Rejected

Fine-Tuning Ecosystem Comparison (as of March 2026)

Recommendation for This Project

Sources

12 KiB

Raw Blame History