Mortdecai 48df42b042 docs: Mortdecai 0.6.0 model analysis — fine-tunes broken, base model rankings

Full analysis of mortdecai:0.6.0-9b and mortdecai:latest (27B) fine-tunes
vs 6 base model candidates. Both fine-tunes score 0% JSON compliance
(catastrophic forgetting from chat template mismatch). Training signal
exists in weights but is inaccessible through chat API.

Base model rankings: phi4:14b (100%, 7.4s) > gemma3:12b (100%, 12.9s) >
gemma3:27b (100%, 25.3s). Qwen3.5 not recommended for conductor role.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-26 02:39:52 -04:00


Mortdecai 0.6.0 Model Analysis Report

Date: 2026-03-26
Analyst: Claude Opus 4.6 (non-developer, pure analysis role)
Target models: mortdecai:0.6.0-9b (Qwen3.5:9B LoRA), mortdecai:latest (Qwen3.5:27B LoRA)
Comparison models: qwen3.5:latest, qwen3.5:27b, gemma3:12b, phi4:14b, gemma3:27b, qwen3:14b
Inference hardware: Matt's Strix Halo (64GB unified memory, Ollama)
Expected output format: {"commands": [...], "reasoning": "..."}

1. Executive Summary

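Both fine-tuned models are completely broken: each scored 0/8 on JSON compliance through the chat API, while the base models they were derived from scored 2/5 (40%) on the base battery. Training did not partially stick; it destroyed the models' ability to follow instructions. Three other base models (phi4:14b, gemma3:12b, gemma3:27b) reached 100% compliance and are the better starting point.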

2. Methodology

Test Battery (Fine-tuned models — 8 tests each)

Test	System Prompt	User Prompt	Purpose
STANDARD	Full training system prompt	"give me a diamond sword"	Baseline compliance
MINIMAL	JSON format instruction only	"give me a diamond sword"	Minimal instruction following
NO SYSTEM	Empty	"give me a diamond sword"	Default behavior
/no_think	Training prompt + /no_think prefix	"give me a diamond sword"	Think token suppression
COMPLEX	Full training prompt	"build me a 5x5 house"	Multi-step command
IDENTITY	Empty	"What are you?"	Training awareness
FORMAT STRESS	Full training prompt	Time + weather + armor	Multi-command JSON
RISK	Full training prompt	"give me op"	Risk assessment

Test Battery (Base models — 5 tests each)

Same system prompt across all models. Prompts: diamond sword, multi-command, house build, op request, teleport.
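The report does not spell out how a reply was scored as "JSON valid". A minimal sketch of one plausible check follows, assuming a reply counts only if it parses as a JSON object with a list of string commands and a string reasoning field. The function name and the strictness are assumptions, not the actual logic in scripts/. Because phi4's fenced replies are counted as valid later in the report, the real scoring presumably unwrapped markdown fences first; this sketch expects text that is already unwrapped.

```python
import json


def is_compliant(response_text: str) -> bool:
    """Return True if the reply is a JSON object shaped like
    {"commands": [...], "reasoning": "..."} (hypothetical scoring rule)."""
    try:
        parsed = json.loads(response_text.strip())
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict):
        return False
    commands = parsed.get("commands")
    reasoning = parsed.get("reasoning")
    # Commands must be a list of strings; reasoning must be a string.
    return (
        isinstance(commands, list)
        and all(isinstance(cmd, str) for cmd in commands)
        and isinstance(reasoning, str)
    )
```

A tutorial-style reply such as the ones the fine-tunes produced fails at the json.loads step, so it is rejected before any field checks run.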

Diagnostic Probes

Probe 1. Training signal detection: exact training data format
Probe 2. /no_think effect: across fine-tuned and base models
Probe 3. Raw completion: bypassing chat template via /api/generate
Probe 4. Correction coercion: multi-turn with explicit correction
Probe 5. Mortdecai awareness: identity and training memory

3. Fine-Tuned Model Results

mortdecai:0.6.0-9b (Qwen3.5:9B LoRA)

Test	JSON Valid	Response Type	Latency
STANDARD	NO	Generic Minecraft tutorial	29.9s
MINIMAL	NO	Crafting recipe + game tips	35.9s
NO SYSTEM	NO	Crafting recipe + tips	42.6s
/no_think	NO	Tutorial with version advice	22.6s
COMPLEX	NO	Real-world construction advice (permits, carpenters)	46.0s
IDENTITY	NO	"I am Qwen3.5 by Tongyi Lab"	45.8s
FORMAT STRESS	NO	Think block, incomplete	46.0s
RISK	NO	Investment advice ($1M portfolio)	45.7s

Score: 0/8 JSON compliance (0%)
Comparison: Base Qwen3.5:9B scores 40% (2/5 on the base battery), so fine-tuning reduced compliance by 40 percentage points.

Key observations:

Completely ignores system prompts
Leaks raw special tokens (<|endoftext|><|im_start|>) into output
Interprets Minecraft prompts as real-world requests (house = construction, op = operator/investment)
/no_think suppresses <think> tags but doesn't restore instruction following
Average latency: 39.3s

mortdecai:latest (Qwen3.5:27B LoRA)

Test	JSON Valid	Response Type	Latency
STANDARD	NO	Think block + crafting tutorial	54.2s
MINIMAL	NO	Think block + crafting recipe	28.2s
NO SYSTEM	NO	Crafting recipe + emoji tips	30.7s
/no_think	NO	Think block (still!) + tutorial	39.0s
COMPLEX	NO	Think block about real-world building	49.2s
IDENTITY	NO	"I am Qwen3.5 by Tongyi Lab"	21.9s
FORMAT STRESS	NO	Commands listed as markdown, not JSON	23.8s
RISK	NO	Research study methodology (!)	49.1s

Score: 0/8 JSON compliance (0%)
Comparison: Base Qwen3.5:27B scores 40% (2/5 on the base battery), so fine-tuning reduced compliance by 40 percentage points.

Key observations:

Wraps everything in <think> blocks even with /no_think prefix
Think tokens consume most context budget before any useful output
Also leaks special tokens
"give me op" → completely derails into academic research methodology
Average latency: 37.0s
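For reference, the score and average latency for mortdecai:latest can be recomputed from the table rows above. This is a small sketch with the values copied by hand; the tuple layout and the summarize helper are illustrative, not taken from the test scripts.

```python
from statistics import mean

# (test name, JSON valid, latency in seconds), copied from the 27B table.
RESULTS_27B = [
    ("STANDARD", False, 54.2),
    ("MINIMAL", False, 28.2),
    ("NO SYSTEM", False, 30.7),
    ("/no_think", False, 39.0),
    ("COMPLEX", False, 49.2),
    ("IDENTITY", False, 21.9),
    ("FORMAT STRESS", False, 23.8),
    ("RISK", False, 49.1),
]


def summarize(results):
    """Return (passed, total, average latency rounded to 0.1s)."""
    passed = sum(1 for _, ok, _ in results if ok)
    total = len(results)
    avg_latency = round(mean(latency for _, _, latency in results), 1)
    return passed, total, avg_latency


passed, total, avg_latency = summarize(RESULTS_27B)
print(f"{passed}/{total} JSON compliance ({passed / total:.0%}), avg latency {avg_latency}s")
```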

4. Root Cause Analysis

4.1 Chat Template Mismatch (Primary cause)

Evidence: Probe 3 (raw completion mode) proved the training signal IS in the weights.

When bypassing the chat template entirely:

Prompt: 'Assistant: {"commands": ["'
mortdecai:0.6.0-9b completion: 'give @p diamond_sword"]}'
mortdecai:latest completion: 'give @p diamond_sword"]}'

Both models produce valid, correct Minecraft commands in raw mode. The knowledge is there — it's just inaccessible through the chat API.
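A raw-mode probe like this one can be sketched against Ollama's /api/generate endpoint, whose raw flag tells the server to send the prompt without applying the chat template. The helper names below are hypothetical, the host is the one listed in the appendix, and the actual probe script may differ.

```python
import json
import urllib.request

OLLAMA_URL = "http://192.168.0.141:11437"  # host taken from the appendix


def build_raw_probe(model: str, prompt: str) -> dict:
    """Payload for /api/generate; raw=True skips the chat template."""
    return {"model": model, "prompt": prompt, "raw": True, "stream": False}


def run_raw_probe(model: str, prompt: str, base_url: str = OLLAMA_URL) -> str:
    """Send the probe and return the completion text.

    Needs a reachable Ollama server; nothing is sent at import time.
    """
    body = json.dumps(build_raw_probe(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as reply:
        return json.loads(reply.read())["response"]
```

Calling run_raw_probe("mortdecai:0.6.0-9b", 'Assistant: {"commands": ["') would reproduce the prefix probe shown above, provided the server is reachable.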

Diagnosis: The training data used a different message format than Qwen3.5's native chat template (<|im_start|>system\n...\n<|im_end|>). The LoRA learned to associate the JSON output format with the raw training format, not with the chat template wrapping that Ollama applies.
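To make the mismatch concrete, here is a rough sketch of rendering a conversation in a ChatML-style layout with the <|im_start|> and <|im_end|> markers quoted above. The exact whitespace and any think-block handling in Qwen3.5's real template are assumptions here and should be checked against the model's own template before building training data; render_chatml is a hypothetical helper.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chatml(messages, add_generation_prompt=False):
    """Render role/content messages in a ChatML-style layout (assumed format)."""
    parts = []
    for message in messages:
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


example = render_chatml(
    [
        {"role": "system", "content": "Reply with JSON only."},
        {"role": "user", "content": "give me a diamond sword"},
    ],
    add_generation_prompt=True,
)
```

If the training examples were serialized as plain "Assistant: ..." text instead, the model never saw the JSON output paired with these markers, which matches the behavior observed in the raw probe.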

4.2 Catastrophic Forgetting

The LoRA didn't just add Minecraft knowledge — it overwrote the base model's instruction-following capability:

Base Qwen3.5:9B: 70% command accuracy (bakeoff), 40% JSON compliance (this test)
Fine-tuned 9B: 10% command accuracy (bakeoff), 0% JSON compliance (this test)

This pattern is consistent with catastrophic forgetting. Likely contributors are a LoRA rank that is too high, a learning rate that is too aggressive, or insufficient regularization; this analysis did not isolate which one applies.

4.3 Think Token Contamination

Qwen3.5's thinking mode (<think>...</think>) was not accounted for during training:

27B: Always generates think blocks, even with /no_think
9B: Sometimes generates think blocks
Base models: /no_think works correctly on both sizes

The fine-tuning broke the /no_think mechanism on the 27B model, making think token suppression impossible.
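Where think blocks cannot be suppressed at generation time, a caller can at least strip them before parsing. The sketch below is a post-processing workaround, not a fix for the broken /no_think behavior, and it assumes the literal <think> and </think> tags appear in the returned text. The strip_think name is hypothetical.

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_think(text: str) -> str:
    """Remove closed <think>...</think> blocks.

    If a <think> tag is never closed (for example the token limit was hit
    mid-reasoning), everything from that tag onward is dropped.
    """
    cleaned = THINK_BLOCK.sub("", text)
    unclosed_at = cleaned.find("<think>")
    if unclosed_at != -1:
        cleaned = cleaned[:unclosed_at]
    return cleaned.strip()
```

An empty result signals that the whole budget went to reasoning, which is the failure the FORMAT STRESS row for the 9B model describes.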

4.4 Special Token Leakage

Both fine-tuned models leak <|endoftext|><|im_start|>user into their output, which means:

The model learned to predict special tokens as regular text
The tokenizer/chat template boundary was corrupted during training
This causes the model to "hallucinate" new conversation turns within a single response
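A simple guard can detect this leakage and cut the reply at the first leaked marker, so hallucinated follow-up turns never reach the parser. This is a sketch under the assumption that the markers arrive as literal text in the response; the marker list and function name are illustrative.

```python
LEAK_MARKERS = ("<|endoftext|>", "<|im_start|>", "<|im_end|>")


def truncate_at_leak(text: str):
    """Cut the response at the first leaked special token.

    Returns (clean_text, leaked), where leaked is True if a marker was found.
    """
    positions = [text.find(marker) for marker in LEAK_MARKERS if marker in text]
    if not positions:
        return text, False
    return text[: min(positions)].rstrip(), True
```

Logging the leaked flag per response also gives a cheap regression signal for future fine-tuning runs.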

5. Base Model Comparison

Quantitative Results

Model	JSON Valid	Has Commands	Avg Latency	Tokens/Response
phi4:14b	5/5 (100%)	5/5	7.4s	~88
gemma3:12b	5/5 (100%)	5/5	12.9s	~117
gemma3:27b	5/5 (100%)	5/5	25.3s	~166
qwen3:14b	3/5 (60%)	3/5	23.8s	~330
qwen3.5:latest (9B)	2/5 (40%)	2/5	13.9s	~370
qwen3.5:27b	2/5 (40%)	2/5	65.4s	~437
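The ordering of this table can be reproduced by sorting on JSON compliance first and average latency second. The sketch below copies the figures from the rows above; the sort key is an assumption about how the ranking was derived, not something the report states.

```python
# (model, JSON-valid count out of 5, average latency in seconds)
BASE_RESULTS = [
    ("qwen3.5:27b", 2, 65.4),
    ("gemma3:27b", 5, 25.3),
    ("phi4:14b", 5, 7.4),
    ("qwen3:14b", 3, 23.8),
    ("gemma3:12b", 5, 12.9),
    ("qwen3.5:latest", 2, 13.9),
]


def rank(results):
    """Sort by JSON compliance (descending), then latency (ascending)."""
    return sorted(results, key=lambda row: (-row[1], row[2]))


ranking = [model for model, _, _ in rank(BASE_RESULTS)]
print(" > ".join(ranking))
```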

Qualitative Assessment

phi4:14b: Fastest response times. Always wraps JSON in markdown fences (minor issue, easily stripped). Clean reasoning. Uses @p consistently. Good domain knowledge. House build attempt is structured but coordinates are imprecise.

gemma3:12b: Slightly slower but equally reliable. Sometimes returns raw JSON, sometimes wraps it in fences. Uses @s (self), which is more correct for "give me" commands. Best Minecraft domain knowledge of all candidates. Very concise responses.

gemma3:27b: Same quality as 12b at roughly twice the latency (25.3s vs 12.9s). Over-engineers some responses (unnecessary NBT attributes on armor). The tp command uses a redundant two-command approach. Not worth the latency penalty for most use cases.

qwen3:14b: Think tokens cause it to exceed token limits on complex prompts. When it does produce JSON, quality is decent but includes leading slashes on commands (against instructions).

qwen3.5 (both sizes): Think tokens are the fundamental problem. The models burn 300-400 tokens on reasoning before producing output and frequently hit token limits before completing the JSON. The /no_think flag is honored by the base models but stops being reliable after fine-tuning (see 4.3).
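Two of the cosmetic issues noted above, markdown fences around the JSON and leading slashes on commands, can be normalized in the caller. The following is a minimal sketch assuming the reply is either bare JSON or a single fenced block; parse_commands is a hypothetical helper, and it raises if the text is not valid JSON.

```python
import json
import re

FENCE_MARK = "`" * 3  # markdown code fence, built this way to keep the example fence-safe
FENCE = re.compile(
    rf"^{FENCE_MARK}[a-zA-Z]*[ \t]*\n(.*?)\n?{FENCE_MARK}\s*$", re.DOTALL
)


def parse_commands(response_text: str) -> list:
    """Unwrap an optional markdown fence, parse the JSON, and strip leading slashes."""
    text = response_text.strip()
    match = FENCE.match(text)
    if match:
        text = match.group(1)
    parsed = json.loads(text)
    return [command.lstrip("/").strip() for command in parsed["commands"]]
```

Keeping this cleanup in orchestrator code means the choice between phi4:14b and gemma3:12b does not have to hinge on fence behavior.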

6. Conductor Candidacy Assessment

Question: Is Qwen3.5 (27B or 9B) a good candidate for the Conductor/Orchestrator role?

Answer: No. Four reasons:

Uncontrollable think token overhead. The conductor needs fast, reliable responses. Qwen3.5's thinking mode adds 5-30s latency and burns context on reasoning that should happen in orchestrator code, not inside the model.
Unreliable JSON compliance. The conductor must produce structured output (routing decisions, tool calls, dispatch instructions) 100% of the time. Qwen3.5 manages 40% vs gemma3's 100%.
Fragile under fine-tuning. LoRA on Qwen3.5 caused catastrophic forgetting. If the conductor needs fine-tuning later, Qwen3.5 is a risky base.
27B is too slow. 65s average is unacceptable for a routing layer in the critical path of every player request.

Recommended Conductor Candidates

Rank	Model	Why
1	phi4:14b	Fastest (7.4s), 100% JSON, good reasoning
2	gemma3:12b	100% JSON, best MC domain knowledge, 12.9s
3	gemma3:27b	Most capable, but only if latency budget allows (25.3s)

7. Recommendations

Immediate Actions

Delete the fine-tuned models from Matt's Ollama. Base models are strictly superior.
Use phi4:14b or gemma3:12b for conductor prototyping.
Preserve training data (JSONL files) for future fine-tuning attempts.

If Re-attempting Fine-tuning

Fix chat template alignment. Training data MUST use Qwen3.5's exact <|im_start|>...<|im_end|> format.
Consider a different base model. gemma3:12b showed the best instruction-following baseline and may be more robust under LoRA.
Lower LoRA rank and learning rate to prevent catastrophic forgetting.
Add /no_think handling or use a model without built-in thinking mode.
Validate with the chat API during training, not just loss metrics.
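The last recommendation could take the form of a small compliance check run at each checkpoint. The sketch below builds an Ollama /api/chat payload and scores replies from an injected chat_fn, so the same function works against a live server or a stub. All names are hypothetical, and the pass criterion (a JSON object with a "commands" list) is an assumption.

```python
import json


def build_chat_request(model, system_prompt, user_prompt):
    """Payload for Ollama's /api/chat, which applies the model's chat template."""
    return {
        "model": model,
        "stream": False,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    }


def compliance_rate(chat_fn, model, system_prompt, prompts):
    """Fraction of prompts whose reply parses as JSON with a "commands" list.

    chat_fn(payload) must return the reply text for one request.
    """
    if not prompts:
        raise ValueError("prompts must not be empty")
    passed = 0
    for prompt in prompts:
        reply = chat_fn(build_chat_request(model, system_prompt, prompt))
        try:
            parsed = json.loads(reply)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict) and isinstance(parsed.get("commands"), list):
            passed += 1
    return passed / len(prompts)
```

A run that shows falling loss but a compliance rate stuck at zero would have flagged the template problem described in section 4.1 before the models were deployed.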

Fine-tuning Priority (from 2.0 spec)

Voice (persona, gemma3:4b) and Eye (router, functiongemma) are the 1.0.1 fine-tune targets.
The conductor should run on a base model with strong instruction-following. Fine-tuning is not planned until 2.0.0.

Appendix: Test Scripts

See scripts/ directory for the Python scripts used to conduct these interviews. All scripts query Ollama's API at http://192.168.0.141:11437.
