Files
Mortdecai/training/MODEL_RESEARCH.md
T
Seth 7da28c8800 Add model bake-off harness and base model research
Bake-off tested 7 models on 31 seed examples via GPU-accelerated Ollama
on node-197 RTX 4000. gemma3n:e4b leads for serving (80.6% cmd match,
100% safety, 5.9s). qwen3:8b recommended as fine-tuning base (Apache 2.0,
best syntax quality, strong ecosystem). Full research in MODEL_RESEARCH.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 08:54:11 -04:00

12 KiB

Model Research: Small LMs for LoRA/QLoRA Fine-Tuning

Date: 2026-03-18 Purpose: Evaluate small language models (4-14B) as base models for the Minecraft server ops assistant. Constraints:

  • 8GB VRAM for inference (Q4 quantized via Ollama)
  • 24GB VRAM for training (QLoRA)
  • Permissive license (Apache 2.0, MIT -- NOT community/restricted licenses)
  • Available on both Ollama (serving) and HuggingFace in safetensors/PyTorch (training)
  • Good instruction following and structured JSON output
  • Active fine-tuning ecosystem (Unsloth, Axolotl, PEFT, LlamaFactory)

Ranked Recommendations

Attribute Detail
Parameters 8B dense
Release April 2025
License Apache 2.0
HuggingFace Qwen/Qwen3-8B -- safetensors, BF16
Ollama ollama pull qwen3:8b
Q4 VRAM ~5.5 GB (fits 8GB comfortably)
QLoRA VRAM ~14-16 GB (fits 24GB easily)
Context 128K native

Why #1:

  • Outperforms Qwen2.5-14B on benchmarks despite being smaller. MMLU-Redux ~87, MATH-500 ~98.
  • Apache 2.0 with no usage restrictions -- the cleanest license in this list.
  • First-class Unsloth support with dedicated notebooks and 2x training speedup.
  • Supported by Axolotl, LlamaFactory, PEFT, and TRL out of the box.
  • Native thinking/non-thinking mode toggle -- useful for complex command generation vs. quick lookups.
  • Strong structured output support; JSON format instructions work reliably.
  • Massive community: most fine-tuned derivatives on HuggingFace of any model this size.

Caveats:

  • Newer than some alternatives, so fewer battle-tested fine-tunes in production.

2. Qwen3.5-4B

Attribute Detail
Parameters 4B dense
Release February 2026
License Apache 2.0
HuggingFace Qwen/Qwen3.5-4B -- safetensors, BF16/F32
Ollama ollama pull qwen3.5:4b (~3.4 GB)
Q4 VRAM ~2.5-3 GB
QLoRA VRAM ~8-10 GB
Context 256K native

Why #2:

  • The newest model on this list (Feb 2026) with latest training techniques.
  • Extremely lightweight -- leaves massive headroom for context on 8GB cards.
  • 256K context window is best-in-class for this parameter range.
  • Full Unsloth + LlamaFactory support confirmed.
  • Apache 2.0 license, no restrictions.
  • Ideal if your training data is small (<1000 examples) -- smaller models fine-tune faster and can still match larger models on narrow domains.

Caveats:

  • 4B may struggle with complex multi-step reasoning compared to 8B.
  • Fewer community fine-tunes available yet (very new release).

3. Qwen3-4B

Attribute Detail
Parameters 4B dense (36-layer transformer)
Release April 2025
License Apache 2.0
HuggingFace Qwen/Qwen3-4B -- safetensors
Ollama ollama pull qwen3:4b
Q4 VRAM ~2.5 GB
QLoRA VRAM ~8-10 GB
Context 128K native

Why #3:

  • Benchmarks rival Qwen2.5-72B-Instruct (!!) according to Qwen team claims.
  • MMLU-Redux 83.7, MATH-500 97.0 -- exceptional for 4B.
  • Well-established Unsloth support with notebooks and GGUF export pipeline.
  • Best fine-tuning benchmark results per distillabs.ai evaluation: "Qwen3-4B-Instruct-2507 delivers the best overall fine-tuned performance, matching a 120B+ teacher."
  • Apache 2.0.

Caveats:

  • Slightly older than Qwen3.5-4B; same parameter count but older architecture.

4. Phi-4-mini-instruct (3.8B)

Attribute Detail
Parameters 3.8B
Release February 2025
License MIT
HuggingFace microsoft/Phi-4-mini-instruct -- safetensors
Ollama ollama pull phi4-mini:3.8b
Q4 VRAM ~2.5 GB
QLoRA VRAM ~8-10 GB
Context 128K

Why #4:

  • MIT license -- the most permissive option available.
  • Microsoft provides an official LoRA fine-tuning script in the HuggingFace repo.
  • Performance comparable to 7-9B models (Llama-3.1-8B level) despite being 3.8B.
  • 200K vocabulary, grouped-query attention -- modern architecture.
  • JSON tool-calling format built into the chat template.
  • Unsloth support confirmed with dedicated notebooks.

Caveats:

  • Smaller community of fine-tuners compared to Qwen.
  • 3.8B is the smallest viable option; may need more training data to match larger models on nuanced tasks.
  • Microsoft's Phi models have historically had some quirks with non-English content and repetition.

5. Gemma 3 4B-IT

Attribute Detail
Parameters 4B (multimodal -- text + image)
Release March 2025
License Gemma Terms of Use (NOT Apache 2.0 -- see caveats)
HuggingFace google/gemma-3-4b-it -- safetensors
Ollama ollama pull gemma3:4b (~3.3 GB)
Q4 VRAM ~2.5 GB
QLoRA VRAM ~8-10 GB
Context 128K

Why #5:

  • Outperforms Gemma 2 27B on benchmarks -- a 7x smaller model beating its predecessor's flagship.
  • Google provides official LoRA fine-tuning docs with Keras and HuggingFace PEFT.
  • QAT (Quantization-Aware Training) variants available for better quantized performance.
  • Native function calling and structured output support.
  • Multimodal capability (text + images) could be useful for screenshot-based troubleshooting.
  • Unsloth, Axolotl, and LlamaFactory all support Gemma 3.

Caveats:

  • License is NOT Apache 2.0. Gemma Terms of Use allow commercial use but include a Prohibited Use Policy covering sensitive domains. Google retains the right to "restrict (remotely or otherwise) usage." This is more restrictive than Apache 2.0/MIT.
  • For a personal Minecraft server project this is likely fine, but it fails the strict "permissive license" requirement.

6. Gemma 3 12B-IT

Attribute Detail
Parameters 12B (multimodal)
Release March 2025
License Gemma Terms of Use (same caveats as 4B)
HuggingFace google/gemma-3-12b-it -- safetensors
Ollama ollama pull gemma3:12b
Q4 VRAM ~6.6 GB (Google claims RTX 4060 8GB works)
QLoRA VRAM ~18-20 GB (fits 24GB)
Context 128K

Why #6:

  • The largest model that can fit in 8GB VRAM at Q4.
  • Best raw capability of any model on this list.
  • QAT Q4 variants from Google specifically optimized for consumer GPUs.
  • Full Unsloth support.

Caveats:

  • Tight fit on 8GB -- leaves little headroom for KV cache with long prompts.
  • Same license concerns as Gemma 3 4B.
  • QLoRA training at 12B needs more VRAM; will use ~18-20 GB of your 24GB budget.

7. Mistral NeMo 12B

Attribute Detail
Parameters 12B
Release July 2024
License Apache 2.0
HuggingFace mistralai/Mistral-Nemo-Instruct-2407 -- safetensors
Ollama ollama pull mistral-nemo:12b
Q4 VRAM ~7 GB
QLoRA VRAM ~18-22 GB (higher due to large vocabulary)
Context 128K

Why #7:

  • Apache 2.0 license, built with NVIDIA collaboration.
  • 128K context, strong multilingual support.
  • Established fine-tuning ecosystem with mistral-finetune tool.

Caveats:

  • Oldest model on this list (July 2024) -- outperformed by newer 4-8B models on many benchmarks.
  • Large vocabulary (32K+ tokens) increases memory requirements for fine-tuning beyond what the parameter count suggests.
  • Tight fit on 8GB VRAM at Q4 with limited context headroom.
  • Not recommended over Qwen3-8B which is newer, smaller, and benchmarks better.

Models Considered and Rejected

Model Reason for Rejection
Llama 3.2 (1B/3B) Llama Community License prohibits using outputs to train non-Llama models. Distillation restrictions. Not truly permissive.
Llama 3.1-8B / 3.3-70B Same license restrictions as above. The 700M MAU clause and output training restrictions disqualify it.
Qwen3-Coder (30B-A3B, 480B) All variants are massive MoE models. Even the smallest (30B-A3B with 3B active) has 30B total parameters -- too large for 8GB inference and questionable for 24GB QLoRA.
Mistral Small 3 (24B) 24B parameters -- requires ~14 GB VRAM at Q4. Does not fit 8GB.
Phi-4 (14B) Fits 8GB at Q4 (~8-9 GB) only marginally. QLoRA at 14B needs ~22-24 GB, cutting it very close. The 3.8B Phi-4-mini is a better fit for this project.
Gemma 2 (9B/27B) Superseded by Gemma 3. No reason to use older generation.
Qwen2.5 (7B/14B) Superseded by Qwen3 and Qwen3.5 with significantly better benchmarks.

Fine-Tuning Ecosystem Comparison (as of March 2026)

Framework Qwen3/3.5 Phi-4-mini Gemma 3 Mistral NeMo
Unsloth Full support, dedicated notebooks, 2x speedup Supported, notebooks available Supported, Gemma 3n confirmed Supported
Axolotl Supported Supported Supported Supported
LlamaFactory Supported, Ollama export Supported Supported Supported
HF PEFT/TRL Supported Supported, official script Supported, Google official docs Supported
Community notebooks Abundant Moderate Abundant Moderate

Recommendation for This Project

Primary: Qwen3-8B -- Best balance of capability, VRAM fit, license cleanliness, and fine-tuning ecosystem. It significantly outperforms older 14B models while fitting comfortably in 8GB at Q4. Apache 2.0 means zero legal concerns.

Secondary: Qwen3-4B or Qwen3.5-4B -- If training data is limited (<500 examples) or you want faster iteration cycles, a 4B model will fine-tune faster and still perform well on the narrow domain of Minecraft server operations. Qwen3.5-4B is newer with a 256K context window; Qwen3-4B has more proven fine-tuning results.

Note on qwen3-coder: The current PLAN.md references qwen3-coder as the base model. All Qwen3-Coder variants are large MoE models (30B+ total parameters) that do not fit the 8GB inference constraint. The recommendation is to use Qwen3-8B (or Qwen3-4B) as the base model instead. The coding/command-generation capability can be developed through fine-tuning on domain-specific data rather than requiring a code-specialized base model.


Sources