# Mortdecai 0.6.0 Model Analysis Report

**Date:** 2026-03-26
**Analyst:** Claude Opus 4.6 (non-developer, pure analysis role)
**Target models:** mortdecai:0.6.0-9b (Qwen3.5:9B LoRA), mortdecai:latest (Qwen3.5:27B LoRA)
**Comparison models:** qwen3.5:latest, qwen3.5:27b, gemma3:12b, phi4:14b, gemma3:27b, qwen3:14b
**Inference hardware:** Matt's Strix Halo (64GB unified memory, Ollama)
**Expected output format:** `{"commands": [...], "reasoning": "..."}`

---

## 1. Executive Summary

Both fine-tuned models are completely broken: each scores 0/8 on JSON compliance. Training didn't partially stick; it actively destroyed the models' ability to follow instructions. The fine-tunes are worse than useless: the base models they were derived from score 40% (2/5) on the base-model battery and dramatically outperform them.

---

## 2. Methodology

### Test Battery (Fine-tuned models, 8 tests each)

| Test | System Prompt | User Prompt | Purpose |
|------|--------------|-------------|---------|
| STANDARD | Full training system prompt | "give me a diamond sword" | Baseline compliance |
| MINIMAL | JSON format instruction only | "give me a diamond sword" | Minimal instruction following |
| NO SYSTEM | Empty | "give me a diamond sword" | Default behavior |
| /no_think | Training prompt + /no_think prefix | "give me a diamond sword" | Think token suppression |
| COMPLEX | Full training prompt | "build me a 5x5 house" | Multi-step command |
| IDENTITY | Empty | "What are you?" | Training awareness |
| FORMAT STRESS | Full training prompt | Time + weather + armor | Multi-command JSON |
| RISK | Full training prompt | "give me op" | Risk assessment |

### Test Battery (Base models, 5 tests each)

Same system prompt across all models. Prompts: diamond sword, multi-command, house build, op request, teleport.

### Diagnostic Probes

1. **Training signal detection**: exact training data format
2. **/no_think effect**: across fine-tuned and base models
3. **Raw completion**: bypassing chat template via /api/generate
4. **Correction coercion**: multi-turn with explicit correction
5. **Mortdecai awareness**: identity and training memory

---

## 3. Fine-Tuned Model Results

### mortdecai:0.6.0-9b (Qwen3.5:9B LoRA)

| Test | JSON Valid | Response Type | Latency |
|------|-----------|---------------|---------|
| STANDARD | NO | Generic Minecraft tutorial | 29.9s |
| MINIMAL | NO | Crafting recipe + game tips | 35.9s |
| NO SYSTEM | NO | Crafting recipe + tips | 42.6s |
| /no_think | NO | Tutorial with version advice | 22.6s |
| COMPLEX | NO | **Real-world construction advice** (permits, carpenters) | 46.0s |
| IDENTITY | NO | "I am Qwen3.5 by Tongyi Lab" | 45.8s |
| FORMAT STRESS | NO | Think block, incomplete | 46.0s |
| RISK | NO | **Investment advice** ($1M portfolio) | 45.7s |

**Score: 0/8 JSON compliance (0%)**

**Comparison: Base Qwen3.5:9B scores 40% (2/5); fine-tuning reduced JSON compliance by 40 percentage points**

Key observations:

- Completely ignores system prompts
- Leaks raw special tokens (`<|endoftext|><|im_start|>`) into output
- Interprets Minecraft prompts as real-world requests (house = construction, op = operator/investment)
- `/no_think` suppresses `<think>` tags but doesn't restore instruction following
- Average latency: 39.3s (mean of the eight latencies above)

### mortdecai:latest (Qwen3.5:27B LoRA)

| Test | JSON Valid | Response Type | Latency |
|------|-----------|---------------|---------|
| STANDARD | NO | Think block + crafting tutorial | 54.2s |
| MINIMAL | NO | Think block + crafting recipe | 28.2s |
| NO SYSTEM | NO | Crafting recipe + emoji tips | 30.7s |
| /no_think | NO | Think block (still!) + tutorial | 39.0s |
| COMPLEX | NO | Think block about real-world building | 49.2s |
| IDENTITY | NO | "I am Qwen3.5 by Tongyi Lab" | 21.9s |
| FORMAT STRESS | NO | Commands listed as markdown, not JSON | 23.8s |
| RISK | NO | Research study methodology (!) | 49.1s |

**Score: 0/8 JSON compliance (0%)**

**Comparison: Base Qwen3.5:27B scores 40% (2/5); fine-tuning reduced JSON compliance by 40 percentage points**

Key observations:

- Wraps everything in `<think>` blocks even with `/no_think` prefix
- Think tokens consume most of the context budget before any useful output
- Also leaks special tokens
- "give me op" completely derails into academic research methodology
- Average latency: 37.0s

---

## 4. Root Cause Analysis

### 4.1 Chat Template Mismatch (Primary cause)

**Evidence:** Probe 3 (raw completion mode) showed that the training signal IS in the weights. When bypassing the chat template entirely:

```
Prompt:                         'Assistant: {"commands": ["'
mortdecai:0.6.0-9b completion:  'give @p diamond_sword"]}'
mortdecai:latest completion:    'give @p diamond_sword"]}'
```

Both models produce valid, correct Minecraft commands in raw mode. The knowledge is there; it's just inaccessible through the chat API.

**Diagnosis:** The training data used a different message format than Qwen3.5's native chat template (`<|im_start|>system\n...\n<|im_end|>`). The LoRA learned to associate the JSON output format with the raw training format, not with the chat template wrapping that Ollama applies.

### 4.2 Catastrophic Forgetting

The LoRA didn't just add Minecraft knowledge; it overwrote the base model's instruction-following capability:

- Base Qwen3.5:9B: 70% command accuracy (bakeoff), 40% JSON compliance (this test)
- Fine-tuned 9B: 10% command accuracy (bakeoff), 0% JSON compliance (this test)

This pattern is consistent with classic catastrophic forgetting, typically caused by a LoRA rank that is too high, a learning rate that is too aggressive, or insufficient regularization.

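To make the template diagnosis from section 4.1 concrete, the following is a minimal sketch of rendering one training example into ChatML-style text before fine-tuning. It assumes the `<|im_start|>role ... <|im_end|>` layout the report attributes to Qwen3.5; the exact whitespace and any thinking-mode markers should be taken from the chat template that ships with the model, not from this sketch. The function name `render_chatml`, the system prompt text, and the `reasoning` string are hypothetical and used only for illustration.

```python
from typing import Dict, List

# Assumed special tokens, matching the ChatML-style markers quoted in the report.
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Render role/content messages as ChatML-style text (assumed layout)."""
    parts = []
    for message in messages:
        # One turn: start marker, role, newline, content, end marker, newline.
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from this point.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


# Hypothetical training example; only the command string comes from the report.
example = [
    {"role": "system", "content": "Respond only with JSON."},
    {"role": "user", "content": "give me a diamond sword"},
    {
        "role": "assistant",
        "content": '{"commands": ["give @p diamond_sword"], "reasoning": "Player asked for a sword."}',
    },
]

training_text = render_chatml(example)
inference_prompt = render_chatml(example[:2], add_generation_prompt=True)
```

The point of the sketch is that `training_text` and `inference_prompt` share the same turn wrapping, so whatever the adapter learns during training is conditioned on the same markers the chat API will send at inference time.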
### 4.3 Think Token Contamination

Qwen3.5's thinking mode (`<think>...</think>`) was not accounted for during training:

- 27B: Always generates think blocks, even with `/no_think`
- 9B: Sometimes generates think blocks
- Base models: `/no_think` works correctly on both sizes

The fine-tuning broke the `/no_think` mechanism on the 27B model, making think token suppression impossible.

### 4.4 Special Token Leakage

Both fine-tuned models leak `<|endoftext|><|im_start|>user` into their output, which means:

- The model learned to predict special tokens as regular text
- The tokenizer/chat template boundary was corrupted during training
- This causes the model to "hallucinate" new conversation turns within a single response

---

## 5. Base Model Comparison

### Quantitative Results

| Model | JSON Valid | Has Commands | Avg Latency | Tokens/Response |
|-------|-----------|-------------|-------------|-----------------|
| **phi4:14b** | **5/5 (100%)** | **5/5** | **7.4s** | ~88 |
| **gemma3:12b** | **5/5 (100%)** | **5/5** | **12.9s** | ~117 |
| **gemma3:27b** | **5/5 (100%)** | **5/5** | 25.3s | ~166 |
| qwen3:14b | 3/5 (60%) | 3/5 | 23.8s | ~330 |
| qwen3.5:latest (9B) | 2/5 (40%) | 2/5 | 13.9s | ~370 |
| qwen3.5:27b | 2/5 (40%) | 2/5 | 65.4s | ~437 |

### Qualitative Assessment

**phi4:14b**: Fastest response times. Always wraps JSON in markdown fences (minor issue, easily stripped). Clean reasoning. Uses `@p` consistently. Good domain knowledge. House build attempt is structured but coordinates are imprecise.

**gemma3:12b**: Slightly slower but equally reliable. Sometimes returns raw JSON, sometimes wraps in fences. Uses `@s` (self), which is more correct for "give me" commands. Best Minecraft domain knowledge of all candidates. Very concise responses.

**gemma3:27b**: Same quality as 12b, 2x slower. Over-engineers some responses (unnecessary NBT attributes on armor). The tp command uses a redundant two-command approach. Not worth the latency penalty for most use cases.

**qwen3:14b**: Think tokens cause it to exceed token limits on complex prompts. When it does produce JSON, quality is decent but includes leading slashes on commands (against instructions).

**qwen3.5 (both sizes)**: Think tokens are the fundamental problem. Burns 300-400 tokens on reasoning before producing output, and frequently hits token limits before completing JSON. The `/no_think` flag works on base models but is unreliable.

---

## 6. Conductor Candidacy Assessment

**Question:** Is Qwen3.5 (27B or 9B) a good candidate for the Conductor/Orchestrator role?

**Answer: No.** Four reasons:

1. **Uncontrollable think token overhead.** The conductor needs fast, reliable responses. Qwen3.5's thinking mode adds 5-30s latency and burns context on reasoning that should happen in orchestrator code, not inside the model.
2. **Unreliable JSON compliance.** The conductor must produce structured output (routing decisions, tool calls, dispatch instructions) 100% of the time. Qwen3.5 manages 40% vs gemma3's 100%.
3. **Fragile under fine-tuning.** LoRA on Qwen3.5 caused catastrophic forgetting. If the conductor needs fine-tuning later, Qwen3.5 is a risky base.
4. **27B is too slow.** 65s average is unacceptable for a routing layer in the critical path of every player request.

### Recommended Conductor Candidates

| Rank | Model | Why |
|------|-------|-----|
| 1 | **phi4:14b** | Fastest (7.4s), 100% JSON, good reasoning |
| 2 | **gemma3:12b** | 100% JSON, best MC domain knowledge, 12.9s |
| 3 | **gemma3:27b** | Most capable, but only if latency budget allows (25.3s) |

---

## 7. Recommendations

### Immediate Actions

1. **Delete the fine-tuned models** from Matt's Ollama. Base models are strictly superior.
2. **Use phi4:14b or gemma3:12b** for conductor prototyping.
3. **Preserve training data** (JSONL files) for future fine-tuning attempts.

### If Re-attempting Fine-tuning

1. **Fix chat template alignment.** Training data MUST use Qwen3.5's exact `<|im_start|>...<|im_end|>` format.
2. **Consider a different base model.** gemma3:12b showed the best instruction-following baseline and may be more robust under LoRA.
3. **Lower LoRA rank and learning rate** to prevent catastrophic forgetting.
4. **Add `/no_think` handling** or use a model without built-in thinking mode.
5. **Validate with the chat API during training**, not just loss metrics.

### Fine-tuning Priority (from 2.0 spec)

- Voice (persona, gemma3:4b) and Eye (router, functiongemma) are the 1.0.1 fine-tune targets.
- The conductor should run on a base model with strong instruction-following. Fine-tuning is not planned until 2.0.0.

---

## Appendix: Test Scripts

See `scripts/` directory for the Python scripts used to conduct these interviews. All scripts query Ollama's API at `http://192.168.0.141:11437`.
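
The scripts themselves are not reproduced in this report. As a rough illustration of the JSON-compliance criterion used throughout, here is a minimal sketch of a checker for the expected `{"commands": [...], "reasoning": "..."}` format. It is not the actual test script: the function names are hypothetical, and it assumes that both keys are required, that `commands` must be a list of strings, and that a single surrounding markdown fence (as phi4:14b emits) should be stripped before parsing.

```python
import json
import re
from typing import List, Optional

# Matches a response that consists of exactly one fenced block, e.g. ```json ... ```
FENCE_RE = re.compile(r"^```[a-zA-Z]*\s*\n(.*?)\n?```\s*$", re.DOTALL)


def strip_fences(text: str) -> str:
    """Return the fence contents if the whole response is one fenced block."""
    text = text.strip()
    match = FENCE_RE.match(text)
    return match.group(1).strip() if match else text


def parse_response(text: str) -> Optional[dict]:
    """Return the parsed payload if it matches the expected format, else None."""
    try:
        payload = json.loads(strip_fences(text))
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    commands = payload.get("commands")
    # Assumption: commands is a list of strings and reasoning is a string.
    if not isinstance(commands, list) or not all(isinstance(c, str) for c in commands):
        return None
    if not isinstance(payload.get("reasoning"), str):
        return None
    return payload


def compliance_rate(responses: List[str]) -> float:
    """Fraction of responses that parse into the expected format."""
    if not responses:
        return 0.0
    return sum(parse_response(r) is not None for r in responses) / len(responses)
```

Under these assumptions, a tutorial-style answer or a think block with no JSON counts as non-compliant, while a fenced JSON object counts as compliant once the fence is removed.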