Full analysis of mortdecai:0.6.0-9b and mortdecai:latest (27B) fine-tunes vs 6 base model candidates. Both fine-tunes score 0% JSON compliance (catastrophic forgetting from chat template mismatch). Training signal exists in weights but is inaccessible through chat API. Base model rankings: phi4:14b (100%, 7.4s) > gemma3:12b (100%, 12.9s) > gemma3:27b (100%, 25.3s). Qwen3.5 not recommended for conductor role. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mortdecai 0.6.0 Model Analysis Report
Date: 2026-03-26
Analyst: Claude Opus 4.6 (non-developer, pure analysis role)
Target models: mortdecai:0.6.0-9b (Qwen3.5:9B LoRA), mortdecai:latest (Qwen3.5:27B LoRA)
Comparison models: qwen3.5:latest, qwen3.5:27b, gemma3:12b, phi4:14b, gemma3:27b, qwen3:14b
Inference hardware: Matt's Strix Halo (64GB unified memory, Ollama)
Expected output format: {"commands": [...], "reasoning": "..."}
1. Executive Summary
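Both fine-tuned models are completely broken through the chat API: each scores 0/8 on JSON compliance, against 2/5 (40%) for the base models they were derived from. Fine-tuning did not merely fail to stick; it destroyed the models' ability to follow instructions. A training signal is detectable in raw completion mode (Section 4.1), but it is unreachable in normal use, so the fine-tunes are worse than useless.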
2. Methodology
Test Battery (Fine-tuned models — 8 tests each)
| Test | System Prompt | User Prompt | Purpose |
|---|---|---|---|
| STANDARD | Full training system prompt | "give me a diamond sword" | Baseline compliance |
| MINIMAL | JSON format instruction only | "give me a diamond sword" | Minimal instruction following |
| NO SYSTEM | Empty | "give me a diamond sword" | Default behavior |
| /no_think | Training prompt + /no_think prefix | "give me a diamond sword" | Think token suppression |
| COMPLEX | Full training prompt | "build me a 5x5 house" | Multi-step command |
| IDENTITY | Empty | "What are you?" | Training awareness |
| FORMAT STRESS | Full training prompt | Time + weather + armor | Multi-command JSON |
| RISK | Full training prompt | "give me op" | Risk assessment |
Test Battery (Base models — 5 tests each)
Same system prompt across all models. Prompts: diamond sword, multi-command, house build, op request, teleport.
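A minimal sketch of how a single response could be scored for the two properties reported in this document (valid JSON, and a non-empty commands list). The name `score_response` and the choice to tolerate a markdown fence around the JSON are assumptions made for illustration; the scripts actually used may apply different rules.

```python
import json
import re

# Matches an optional ```json ... ``` wrapper around the whole response.
FENCE_RE = re.compile(r"^\s*```(?:json)?\s*\n(.*?)\n?\s*```\s*$", re.DOTALL)


def score_response(text):
    """Return (json_valid, has_commands) for one model response.

    json_valid: the response (minus an optional fence) parses to a JSON object.
    has_commands: that object has a non-empty "commands" list of strings.
    """
    match = FENCE_RE.match(text)
    body = match.group(1) if match else text
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError:
        return (False, False)
    if not isinstance(parsed, dict):
        return (False, False)
    commands = parsed.get("commands")
    has_commands = (
        isinstance(commands, list)
        and len(commands) > 0
        and all(isinstance(c, str) for c in commands)
    )
    return (True, has_commands)
```

Returning the two flags separately mirrors the "JSON Valid" and "Has Commands" columns used later in the base model comparison.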
Diagnostic Probes
- Training signal detection — exact training data format
- /no_think effect — across fine-tuned and base models
- Raw completion — bypassing chat template via /api/generate
- Correction coercion — multi-turn with explicit correction
- Mortdecai awareness — identity and training memory
3. Fine-Tuned Model Results
mortdecai:0.6.0-9b (Qwen3.5:9B LoRA)
| Test | JSON Valid | Response Type | Latency |
|---|---|---|---|
| STANDARD | NO | Generic Minecraft tutorial | 29.9s |
| MINIMAL | NO | Crafting recipe + game tips | 35.9s |
| NO SYSTEM | NO | Crafting recipe + tips | 42.6s |
| /no_think | NO | Tutorial with version advice | 22.6s |
| COMPLEX | NO | Real-world construction advice (permits, carpenters) | 46.0s |
| IDENTITY | NO | "I am Qwen3.5 by Tongyi Lab" | 45.8s |
| FORMAT STRESS | NO | Think block, incomplete | 46.0s |
| RISK | NO | Investment advice ($1M portfolio) | 45.7s |
Score: 0/8 JSON compliance (0%)
Comparison: Base Qwen3.5:9B scores 40% (2/5); fine-tuning reduced performance by 40 percentage points.
Key observations:
- Completely ignores system prompts
- Leaks raw special tokens (<|endoftext|><|im_start|>) into output
- Interprets Minecraft prompts as real-world requests (house = construction, op = operator/investment)
- /no_think suppresses <think> tags but doesn't restore instruction following
- Average latency: 39.3s (mean of the eight latencies in the table above)
mortdecai:latest (Qwen3.5:27B LoRA)
| Test | JSON Valid | Response Type | Latency |
|---|---|---|---|
| STANDARD | NO | Think block + crafting tutorial | 54.2s |
| MINIMAL | NO | Think block + crafting recipe | 28.2s |
| NO SYSTEM | NO | Crafting recipe + emoji tips | 30.7s |
| /no_think | NO | Think block (still!) + tutorial | 39.0s |
| COMPLEX | NO | Think block about real-world building | 49.2s |
| IDENTITY | NO | "I am Qwen3.5 by Tongyi Lab" | 21.9s |
| FORMAT STRESS | NO | Commands listed as markdown, not JSON | 23.8s |
| RISK | NO | Research study methodology (!) | 49.1s |
Score: 0/8 JSON compliance (0%)
Comparison: Base Qwen3.5:27B scores 40% (2/5); fine-tuning reduced performance by 40 percentage points.
Key observations:
- Wraps everything in <think> blocks even with /no_think prefix
- Think tokens consume most context budget before any useful output
- Also leaks special tokens
- "give me op" → completely derails into academic research methodology
- Average latency: 37.0s
4. Root Cause Analysis
4.1 Chat Template Mismatch (Primary cause)
Evidence: Probe 3 (raw completion mode) proved the training signal IS in the weights.
When bypassing the chat template entirely:
Prompt: 'Assistant: {"commands": ["'
mortdecai:0.6.0-9b completion: 'give @p diamond_sword"]}'
mortdecai:latest completion: 'give @p diamond_sword"]}'
Both models produce valid, correct Minecraft commands in raw mode. The knowledge is there — it's just inaccessible through the chat API.
Diagnosis: The training data used a different message format than Qwen3.5's native chat template (<|im_start|>system\n...\n<|im_end|>). The LoRA learned to associate the JSON output format with the raw training format, not with the chat template wrapping that Ollama applies.
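To make the mismatch concrete, the sketch below renders a conversation in the ChatML layout and contrasts it with the plain-text prefix used in the raw completion probe. The function `render_chatml` and the placeholder system prompt are hypothetical, and the layout is assumed to match what Ollama applies for these models; the model's actual template may contain additional elements and should be checked directly.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render chat messages in the ChatML layout.

    Assumption: this approximates the template Ollama applies to the
    Qwen-based models discussed here; it is not copied from the Modelfile.
    """
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    if add_generation_prompt:
        # The model is expected to continue from an open assistant turn.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


# Placeholder system prompt, not the real training prompt.
chat_prompt = render_chatml([
    {"role": "system", "content": "Respond with JSON only."},
    {"role": "user", "content": "give me a diamond sword"},
])

# The raw completion probe in this section used a plain-text prefix instead.
raw_prompt = 'Assistant: {"commands": ["'
```

If the training examples resembled `raw_prompt` while inference always sees something like `chat_prompt`, the learned JSON behavior would be tied to a context the chat API never produces.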
4.2 Catastrophic Forgetting
The LoRA didn't just add Minecraft knowledge — it overwrote the base model's instruction-following capability:
- Base Qwen3.5:9B: 70% command accuracy (bakeoff), 40% JSON compliance (this test)
- Fine-tuned 9B: 10% command accuracy (bakeoff), 0% JSON compliance (this test)
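This pattern is consistent with catastrophic forgetting. Likely contributors are a LoRA rank that is too high, a learning rate that is too aggressive, or insufficient regularization; this analysis did not isolate which of them applies.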
4.3 Think Token Contamination
Qwen3.5's thinking mode (<think>...</think>) was not accounted for during training:
- 27B: Always generates think blocks, even with /no_think
- 9B: Sometimes generates think blocks
- Base models: /no_think works correctly on both sizes
The fine-tuning broke the /no_think mechanism on the 27B model, making think token suppression impossible.
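The think-block behavior described here can be detected mechanically in captured output. The following is a minimal sketch, with a hypothetical function name, that removes complete think blocks and flags responses where a block was opened but never closed, which is one sign that the token limit was reached during reasoning.

```python
import re

THINK_BLOCK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def split_thinking(text):
    """Return (answer, truncated).

    answer: text with complete <think>...</think> blocks removed.
    truncated: True when a <think> block was opened but never closed.
    """
    without_blocks = THINK_BLOCK_RE.sub("", text)
    if "<think>" in without_blocks:
        # Unclosed block: everything after the opening tag is reasoning.
        answer = without_blocks.split("<think>", 1)[0]
        return (answer.strip(), True)
    return (without_blocks.strip(), False)
```

This only cleans up output for analysis; it does not recover the tokens already spent on reasoning.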
4.4 Special Token Leakage
Both fine-tuned models leak <|endoftext|><|im_start|>user into their output, which means:
- The model learned to predict special tokens as regular text
- The tokenizer/chat template boundary was corrupted during training
- This causes the model to "hallucinate" new conversation turns within a single response
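A small sketch of how leaked special tokens could be flagged in captured responses, and how output could be cut at the first leak so that hallucinated follow-on turns are excluded from scoring. The function names and the token pattern are assumptions for illustration; the pattern covers tokens written in the `<|name|>` style only.

```python
import re

# Matches control tokens written as <|name|>, e.g. <|im_start|> or <|endoftext|>.
SPECIAL_TOKEN_RE = re.compile(r"<\|[A-Za-z0-9_]+\|>")


def find_leaked_tokens(text):
    """Return the special tokens that appear as literal text, in order."""
    return SPECIAL_TOKEN_RE.findall(text)


def truncate_at_first_leak(text):
    """Keep only the text before the first leaked special token."""
    match = SPECIAL_TOKEN_RE.search(text)
    return text if match is None else text[:match.start()]
```

Truncation is a reporting aid, not a fix: the underlying problem is in how the model was trained to treat these tokens.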
5. Base Model Comparison
Quantitative Results
| Model | JSON Valid | Has Commands | Avg Latency | Tokens/Response |
|---|---|---|---|---|
| phi4:14b | 5/5 (100%) | 5/5 | 7.4s | ~88 |
| gemma3:12b | 5/5 (100%) | 5/5 | 12.9s | ~117 |
| gemma3:27b | 5/5 (100%) | 5/5 | 25.3s | ~166 |
| qwen3:14b | 3/5 (60%) | 3/5 | 23.8s | ~330 |
| qwen3.5:latest (9B) | 2/5 (40%) | 2/5 | 13.9s | ~370 |
| qwen3.5:27b | 2/5 (40%) | 2/5 | 65.4s | ~437 |
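The row order in the table can be reproduced by sorting on compliance rate first and average latency second. That ordering rule is an assumption inferred from the table, not a stated method; the sketch below applies it to the figures above.

```python
# (model, json_valid_count, tests_run, avg_latency_seconds) from the table above,
# deliberately listed out of order.
RESULTS = [
    ("qwen3.5:27b", 2, 5, 65.4),
    ("gemma3:12b", 5, 5, 12.9),
    ("qwen3:14b", 3, 5, 23.8),
    ("phi4:14b", 5, 5, 7.4),
    ("qwen3.5:latest", 2, 5, 13.9),
    ("gemma3:27b", 5, 5, 25.3),
]


def rank(results):
    """Sort by JSON compliance rate (descending), then latency (ascending)."""
    return sorted(results, key=lambda r: (-r[1] / r[2], r[3]))


ranked = [name for name, *_ in rank(RESULTS)]
```

Putting compliance ahead of latency means a fast model that fails to produce JSON never outranks a slower one that succeeds.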
Qualitative Assessment
phi4:14b — Fastest response times. Always wraps JSON in markdown fences (minor issue, easily stripped). Clean reasoning. Uses @p consistently. Good domain knowledge. House build attempt is structured but coordinates are imprecise.
gemma3:12b — Slightly slower but equally reliable. Sometimes returns raw JSON, sometimes wraps in fences. Uses @s (self) which is more correct for "give me" commands. Best Minecraft domain knowledge of all candidates. Very concise responses.
gemma3:27b — Same quality as 12b, 2x slower. Over-engineers some responses (unnecessary NBT attributes on armor). The tp command uses a redundant two-command approach. Not worth the latency penalty for most use cases.
qwen3:14b — Think tokens cause it to exceed token limits on complex prompts. When it does produce JSON, quality is decent but includes leading slashes on commands (against instructions).
qwen3.5 (both sizes) — Think tokens are the fundamental problem. Burns 300-400 tokens on reasoning before producing output, frequently hits token limits before completing JSON. The /no_think flag works on base models but is unreliable.
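The leading-slash issue noted for qwen3:14b is cheap to correct after parsing. A minimal sketch, with a hypothetical helper name, assuming commands arrive as a list of strings:

```python
def normalize_commands(commands):
    """Strip surrounding whitespace and any leading slashes from each command."""
    return [cmd.strip().lstrip("/") for cmd in commands]
```

Only leading slashes are removed, so a slash appearing later in a command is left untouched.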
6. Conductor Candidacy Assessment
Question: Is Qwen3.5 (27B or 9B) a good candidate for the Conductor/Orchestrator role?
Answer: No. Four reasons:
- Uncontrollable think token overhead. The conductor needs fast, reliable responses. Qwen3.5's thinking mode adds 5-30s latency and burns context on reasoning that should happen in orchestrator code, not inside the model.
- Unreliable JSON compliance. The conductor must produce structured output (routing decisions, tool calls, dispatch instructions) 100% of the time. Qwen3.5 manages 40% vs gemma3's 100%.
- Fragile under fine-tuning. LoRA on Qwen3.5 caused catastrophic forgetting. If the conductor needs fine-tuning later, Qwen3.5 is a risky base.
- 27B is too slow. 65s average is unacceptable for a routing layer in the critical path of every player request.
Recommended Conductor Candidates
| Rank | Model | Why |
|---|---|---|
| 1 | phi4:14b | Fastest (7.4s), 100% JSON, good reasoning |
| 2 | gemma3:12b | 100% JSON, best MC domain knowledge, 12.9s |
| 3 | gemma3:27b | Most capable, but only if latency budget allows (25.3s) |
7. Recommendations
Immediate Actions
- Delete the fine-tuned models from Matt's Ollama. Base models are strictly superior.
- Use phi4:14b or gemma3:12b for conductor prototyping.
- Preserve training data (JSONL files) for future fine-tuning attempts.
If Re-attempting Fine-tuning
- Fix chat template alignment. Training data MUST use Qwen3.5's exact <|im_start|>...<|im_end|> format.
- Consider a different base model. gemma3:12b showed the best instruction-following baseline and may be more robust under LoRA.
- Lower LoRA rank and learning rate to prevent catastrophic forgetting.
- Add /no_think handling or use a model without built-in thinking mode.
- Validate with the chat API during training, not just loss metrics.
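The last recommendation can be sketched as a small harness that measures JSON compliance through whatever chat interface a checkpoint is served on. All names here are hypothetical, and `fake_chat` is a stand-in for a real call to the chat endpoint.

```python
import json


def is_json_object_with_commands(text):
    """True when text parses as a JSON object with a non-empty 'commands' list."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(parsed, dict)
        and isinstance(parsed.get("commands"), list)
        and len(parsed["commands"]) > 0
    )


def compliance_rate(chat_fn, prompts):
    """Run each prompt through chat_fn and return the fraction that comply.

    chat_fn is any callable taking a user prompt and returning the reply text,
    e.g. a thin wrapper around the chat endpoint being validated.
    """
    if not prompts:
        raise ValueError("prompts must not be empty")
    passed = sum(1 for p in prompts if is_json_object_with_commands(chat_fn(p)))
    return passed / len(prompts)


# Stand-in for a real checkpoint: complies only on the first prompt.
def fake_chat(prompt):
    if prompt == "give me a diamond sword":
        return '{"commands": ["give @p diamond_sword"], "reasoning": "requested item"}'
    return "To build a house you will need permits."


rate = compliance_rate(fake_chat, ["give me a diamond sword", "build me a 5x5 house"])
```

Tracking this rate per checkpoint alongside the loss would surface a template mismatch early, since loss can fall while chat-mode compliance stays at zero.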
Fine-tuning Priority (from 2.0 spec)
- Voice (persona, gemma3:4b) and Eye (router, functiongemma) are the 1.0.1 fine-tune targets.
- The conductor should run on a base model with strong instruction-following. Fine-tuning is not planned until 2.0.0.
Appendix: Test Scripts
See scripts/ directory for the Python scripts used to conduct these interviews. All scripts query Ollama's API at http://192.168.0.141:11437.
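For orientation, a minimal sketch of the two request shapes used in this analysis: a templated chat call and a raw completion that bypasses the template. It is not taken from the scripts themselves; the helper names are hypothetical, and only the endpoint paths and the `raw` and `stream` fields are standard Ollama API parameters.

```python
import json
import urllib.request

OLLAMA_URL = "http://192.168.0.141:11437"  # host and port as stated in this appendix


def build_chat_request(model, system_prompt, user_prompt):
    """Build the path and JSON payload for /api/chat (chat template applied)."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return "/api/chat", {"model": model, "messages": messages, "stream": False}


def build_raw_request(model, prompt):
    """Build the path and payload for /api/generate with the template bypassed."""
    return "/api/generate", {"model": model, "prompt": prompt, "raw": True, "stream": False}


def post(path, payload, timeout=120):
    """POST a JSON payload to the Ollama server and return the decoded reply."""
    request = urllib.request.Request(
        OLLAMA_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With streaming disabled, the chat reply text is found under `["message"]["content"]` and the raw completion under `["response"]`, for example `post(*build_raw_request("mortdecai:latest", 'Assistant: {"commands": ["'))["response"]`.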