mortdecai-model-analysis/analysis-report.md
Mortdecai 48df42b042 docs: Mortdecai 0.6.0 model analysis — fine-tunes broken, base model rankings
Full analysis of mortdecai:0.6.0-9b and mortdecai:latest (27B) fine-tunes
vs 6 base model candidates. Both fine-tunes score 0% JSON compliance
(catastrophic forgetting from chat template mismatch). Training signal
exists in weights but is inaccessible through chat API.

Base model rankings: phi4:14b (100%, 7.4s) > gemma3:12b (100%, 12.9s) >
gemma3:27b (100%, 25.3s). Qwen3.5 not recommended for conductor role.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 02:39:52 -04:00

Mortdecai 0.6.0 Model Analysis Report

Date: 2026-03-26
Analyst: Claude Opus 4.6 (non-developer, pure analysis role)
Target models: mortdecai:0.6.0-9b (Qwen3.5:9B LoRA), mortdecai:latest (Qwen3.5:27B LoRA)
Comparison models: qwen3.5:latest, qwen3.5:27b, gemma3:12b, phi4:14b, gemma3:27b, qwen3:14b
Inference hardware: Matt's Strix Halo (64GB unified memory, Ollama)
Expected output format: {"commands": [...], "reasoning": "..."}


1. Executive Summary

Both fine-tuned models are completely broken: each scores 0% on JSON compliance through the chat API. Fine-tuning did not merely fail to stick; it destroyed the models' ability to follow instructions. The base models they were derived from outperform them by a wide margin.


2. Methodology

Test Battery (Fine-tuned models — 8 tests each)

| Test | System Prompt | User Prompt | Purpose |
|---|---|---|---|
| STANDARD | Full training system prompt | "give me a diamond sword" | Baseline compliance |
| MINIMAL | JSON format instruction only | "give me a diamond sword" | Minimal instruction following |
| NO SYSTEM | Empty | "give me a diamond sword" | Default behavior |
| /no_think | Training prompt + /no_think prefix | "give me a diamond sword" | Think token suppression |
| COMPLEX | Full training prompt | "build me a 5x5 house" | Multi-step command |
| IDENTITY | Empty | "What are you?" | Training awareness |
| FORMAT STRESS | Full training prompt | Time + weather + armor | Multi-command JSON |
| RISK | Full training prompt | "give me op" | Risk assessment |

Test Battery (Base models — 5 tests each)

Same system prompt across all models. Prompts: diamond sword, multi-command, house build, op request, teleport.
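The "JSON compliance" score used throughout this report can be sketched as a small checker. This is a minimal sketch, not the scripts actually used: the function names `is_compliant` and `compliance_rate` are hypothetical, and the strictness chosen here (both keys required, no tolerance for markdown fences) is an assumption about how the expected output format was enforced.

```python
import json


def is_compliant(response_text: str) -> bool:
    """True if the text is a JSON object shaped like
    {"commands": [<str>, ...], "reasoning": <str>}."""
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    commands = payload.get("commands")
    reasoning = payload.get("reasoning")
    return (
        isinstance(commands, list)
        and all(isinstance(c, str) for c in commands)
        and isinstance(reasoning, str)
    )


def compliance_rate(responses: list[str]) -> float:
    """Fraction of responses that pass is_compliant (0.0 for an empty list)."""
    if not responses:
        return 0.0
    return sum(is_compliant(r) for r in responses) / len(responses)
```

A tutorial-style reply such as "To craft a diamond sword you need..." fails at the `json.loads` step, which is how every fine-tuned response in Section 3 would be counted as non-compliant under this check.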

Diagnostic Probes

  1. Training signal detection — exact training data format
  2. /no_think effect — across fine-tuned and base models
  3. Raw completion — bypassing chat template via /api/generate
  4. Correction coercion — multi-turn with explicit correction
  5. Mortdecai awareness — identity and training memory

3. Fine-Tuned Model Results

mortdecai:0.6.0-9b (Qwen3.5:9B LoRA)

| Test | JSON Valid | Response Type | Latency |
|---|---|---|---|
| STANDARD | NO | Generic Minecraft tutorial | 29.9s |
| MINIMAL | NO | Crafting recipe + game tips | 35.9s |
| NO SYSTEM | NO | Crafting recipe + tips | 42.6s |
| /no_think | NO | Tutorial with version advice | 22.6s |
| COMPLEX | NO | Real-world construction advice (permits, carpenters) | 46.0s |
| IDENTITY | NO | "I am Qwen3.5 by Tongyi Lab" | 45.8s |
| FORMAT STRESS | NO | Think block, incomplete | 46.0s |
| RISK | NO | Investment advice ($1M portfolio) | 45.7s |

Score: 0/8 JSON compliance (0%)
Comparison: Base Qwen3.5:9B scores 40% (2/5 on the base-model battery), so fine-tuning reduced JSON compliance by 40 percentage points.

Key observations:

  • Completely ignores system prompts
  • Leaks raw special tokens (<|endoftext|><|im_start|>) into output
  • Interprets Minecraft prompts as real-world requests (house = construction, op = operator/investment)
  • /no_think suppresses <think> tags but doesn't restore instruction following
  • Average latency: 39.3s (mean of the eight latencies in the table above)

mortdecai:latest (Qwen3.5:27B LoRA)

| Test | JSON Valid | Response Type | Latency |
|---|---|---|---|
| STANDARD | NO | Think block + crafting tutorial | 54.2s |
| MINIMAL | NO | Think block + crafting recipe | 28.2s |
| NO SYSTEM | NO | Crafting recipe + emoji tips | 30.7s |
| /no_think | NO | Think block (still!) + tutorial | 39.0s |
| COMPLEX | NO | Think block about real-world building | 49.2s |
| IDENTITY | NO | "I am Qwen3.5 by Tongyi Lab" | 21.9s |
| FORMAT STRESS | NO | Commands listed as markdown, not JSON | 23.8s |
| RISK | NO | Research study methodology (!) | 49.1s |

Score: 0/8 JSON compliance (0%)
Comparison: Base Qwen3.5:27B scores 40% (2/5 on the base-model battery), so fine-tuning reduced JSON compliance by 40 percentage points.

Key observations:

  • Wraps everything in <think> blocks even with /no_think prefix
  • Think tokens consume most context budget before any useful output
  • Also leaks special tokens
  • "give me op" → completely derails into academic research methodology
  • Average latency: 37.0s

4. Root Cause Analysis

4.1 Chat Template Mismatch (Primary cause)

Evidence: Probe 3 (raw completion mode) proved the training signal IS in the weights.

When bypassing the chat template entirely:

Prompt: 'Assistant: {"commands": ["'
mortdecai:0.6.0-9b completion: 'give @p diamond_sword"]}'
mortdecai:latest completion: 'give @p diamond_sword"]}'

Both models produce valid, correct Minecraft commands in raw mode. The knowledge is there — it's just inaccessible through the chat API.
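The difference between the two access paths can be sketched as two request bodies for Ollama's HTTP API. This is an illustrative sketch rather than the original probe script: the helper names `chat_request`, `raw_request`, and `post` are hypothetical. It assumes the standard Ollama endpoints, where `/api/chat` applies the model's chat template and `/api/generate` with `"raw": true` sends the prompt without any templating.

```python
import json
import urllib.request

OLLAMA_URL = "http://192.168.0.141:11437"  # endpoint named in the report's appendix


def chat_request(model, system_prompt, user_prompt):
    """Body for POST /api/chat; Ollama wraps the messages in the chat template."""
    messages = []
    if system_prompt:  # the NO SYSTEM test sends no system message
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return {"model": model, "messages": messages, "stream": False}


def raw_request(model, prompt):
    """Body for POST /api/generate with templating disabled (raw mode)."""
    return {"model": model, "prompt": prompt, "raw": True, "stream": False}


def post(path, body, timeout=120):
    """Send a JSON body to the Ollama server and return the decoded JSON reply."""
    req = urllib.request.Request(
        OLLAMA_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


# Example (needs a reachable server, so it is left commented out):
# reply = post("/api/generate", raw_request("mortdecai:0.6.0-9b", 'Assistant: {"commands": ["'))
# print(reply["response"])
```

Running the same user request through both bodies against the same model is what separates "the weights lack the knowledge" from "the template hides the knowledge".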

Diagnosis: The training data used a different message format than Qwen3.5's native chat template (<|im_start|>system\n...\n<|im_end|>). The LoRA learned to associate the JSON output format with the raw training format, not with the chat template wrapping that Ollama applies.
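To illustrate what template-aligned training data might look like, the sketch below renders one training example in a ChatML-style layout built from the `<|im_start|>` and `<|im_end|>` markers named above. The function `render_chatml` is hypothetical, and the exact whitespace and any thinking-mode markers are assumptions; for real training the model's own tokenizer template should be treated as the source of truth.

```python
def render_chatml(system_prompt, user_prompt, assistant_reply):
    """Render one training example as ChatML-style text.

    Assumed layout per turn: <|im_start|>{role}\n{content}<|im_end|>\n
    Thinking-mode markers are not handled here.
    """
    turns = [
        ("system", system_prompt),
        ("user", user_prompt),
        ("assistant", assistant_reply),
    ]
    return "".join(
        f"<|im_start|>{role}\n{content}<|im_end|>\n" for role, content in turns
    )


example = render_chatml(
    "Respond only with JSON.",  # placeholder, not the real training prompt
    "give me a diamond sword",
    '{"commands": ["give @p diamond_sword"], "reasoning": "direct item request"}',
)
```

If the fine-tuning data had instead used a layout such as `Assistant: {...}`, the LoRA would associate the JSON output with that prefix, which matches the behaviour seen in Probe 3.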

4.2 Catastrophic Forgetting

The LoRA didn't just add Minecraft knowledge — it overwrote the base model's instruction-following capability:

  • Base Qwen3.5:9B: 70% command accuracy (bakeoff), 40% JSON compliance (this test)
  • Fine-tuned 9B: 10% command accuracy (bakeoff), 0% JSON compliance (this test)

This pattern is consistent with catastrophic forgetting. Typical causes are a LoRA rank that is too high, a learning rate that is too aggressive, or insufficient regularization.

4.3 Think Token Contamination

Qwen3.5's thinking mode (<think>...</think>) was not accounted for during training:

  • 27B: Always generates think blocks, even with /no_think
  • 9B: Sometimes generates think blocks
  • Base models: /no_think works correctly on both sizes

The fine-tuning broke the /no_think mechanism on the 27B model, making think token suppression impossible.
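When evaluating outputs, it can help to separate think blocks from the answer and to flag responses where the think block never closed (the "Think block, incomplete" case). The following is a minimal sketch with a hypothetical helper name, `split_thinking`. It only post-processes text and does not restore `/no_think` behaviour in the model.

```python
import re

# Non-greedy match so several closed think blocks are removed one by one.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def split_thinking(text):
    """Return (answer, truncated).

    answer: the text with closed <think>...</think> blocks removed.
    truncated: True if a <think> tag was opened but never closed, which
    usually means the token limit was hit mid-reasoning.
    """
    without_closed = THINK_BLOCK.sub("", text)
    open_idx = without_closed.find("<think>")
    if open_idx != -1:
        return without_closed[:open_idx].strip(), True
    return without_closed.strip(), False
```

A response that returns an empty answer with `truncated` set to True spent its whole budget on reasoning, which is the failure mode described for the 27B fine-tune.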

4.4 Special Token Leakage

Both fine-tuned models leak <|endoftext|><|im_start|>user into their output, which means:

  • The model learned to predict special tokens as regular text
  • The tokenizer/chat template boundary was corrupted during training
  • This causes the model to "hallucinate" new conversation turns within a single response
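Leakage of this kind can be detected, and its downstream effect limited, by scanning for the special tokens as literal text. This is a sketch under the assumption that the three markers listed below are the relevant ones; the helper names are hypothetical, and truncation is a mitigation at the output layer, not a fix for the underlying training problem.

```python
# Special tokens that should never appear as visible text in a reply.
SPECIAL_TOKENS = ("<|endoftext|>", "<|im_start|>", "<|im_end|>")


def find_leak(text):
    """Return the index of the earliest leaked special token, or -1 if none."""
    positions = [text.find(tok) for tok in SPECIAL_TOKENS]
    positions = [p for p in positions if p != -1]
    return min(positions) if positions else -1


def truncate_at_leak(text):
    """Drop everything from the first leaked special token onward."""
    idx = find_leak(text)
    return text if idx == -1 else text[:idx]
```

Cutting at the first leaked token discards any hallucinated follow-up turns that the model appended to its own response.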

5. Base Model Comparison

Quantitative Results

| Model | JSON Valid | Has Commands | Avg Latency | Tokens/Response |
|---|---|---|---|---|
| phi4:14b | 5/5 (100%) | 5/5 | 7.4s | ~88 |
| gemma3:12b | 5/5 (100%) | 5/5 | 12.9s | ~117 |
| gemma3:27b | 5/5 (100%) | 5/5 | 25.3s | ~166 |
| qwen3:14b | 3/5 (60%) | 3/5 | 23.8s | ~330 |
| qwen3.5:latest (9B) | 2/5 (40%) | 2/5 | 13.9s | ~370 |
| qwen3.5:27b | 2/5 (40%) | 2/5 | 65.4s | ~437 |
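The ordering of this table can be reproduced by sorting on JSON compliance first and average latency second. The sketch below uses the figures from the table; the sort key is an assumption about how the ranking was derived, not something the report states.

```python
# (model, json_valid, tests_run, avg_latency_seconds) from the table above
results = [
    ("qwen3.5:27b", 2, 5, 65.4),
    ("gemma3:27b", 5, 5, 25.3),
    ("qwen3:14b", 3, 5, 23.8),
    ("phi4:14b", 5, 5, 7.4),
    ("qwen3.5:latest", 2, 5, 13.9),
    ("gemma3:12b", 5, 5, 12.9),
]

# Higher compliance first (negated for ascending sort), then lower latency.
ranked = sorted(results, key=lambda r: (-r[1] / r[2], r[3]))

for model, valid, total, latency in ranked:
    print(f"{model:16s} {valid}/{total}  {latency:5.1f}s")
```

Under this key, latency only breaks ties between models with equal compliance, so a fast but unreliable model can never outrank a slower reliable one.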

Qualitative Assessment

phi4:14b — Fastest response times. Always wraps JSON in markdown fences (minor issue, easily stripped). Clean reasoning. Uses @p consistently. Good domain knowledge. House build attempt is structured but coordinates are imprecise.

gemma3:12b — Slightly slower but equally reliable. Sometimes returns raw JSON, sometimes wraps in fences. Uses @s (self) which is more correct for "give me" commands. Best Minecraft domain knowledge of all candidates. Very concise responses.

gemma3:27b — Same quality as 12b, 2x slower. Over-engineers some responses (unnecessary NBT attributes on armor). The tp command uses a redundant two-command approach. Not worth the latency penalty for most use cases.

qwen3:14b — Think tokens cause it to exceed token limits on complex prompts. When it does produce JSON, quality is decent but includes leading slashes on commands (against instructions).

qwen3.5 (both sizes) — Think tokens are the fundamental problem. Burns 300-400 tokens on reasoning before producing output, frequently hits token limits before completing JSON. The /no_think flag works on base models but is unreliable.
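Two of the issues noted above, markdown fences around the JSON and leading slashes on commands, are cosmetic and can be handled in a post-processing step. The following is a minimal sketch with hypothetical helper names (`unwrap_fence`, `normalize`); it assumes the reply is a single JSON object with a "commands" list and raises if it is not.

```python
import json
import re

# A reply that is entirely one fenced block, with an optional language tag.
FENCE = re.compile(r"^`{3}[a-zA-Z]*\s*\n(.*?)\n?`{3}\s*$", re.DOTALL)


def unwrap_fence(text):
    """Return the body of a fenced block, or the stripped text if unfenced."""
    stripped = text.strip()
    match = FENCE.match(stripped)
    return match.group(1).strip() if match else stripped


def normalize(text):
    """Parse a model reply and strip leading slashes from each command.

    Raises json.JSONDecodeError or KeyError on malformed replies so the
    caller can count them as non-compliant.
    """
    payload = json.loads(unwrap_fence(text))
    payload["commands"] = [c.lstrip("/") for c in payload["commands"]]
    return payload
```

Keeping this step in orchestrator code means a model is only penalized for structural failures, not for formatting habits that are trivial to undo.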


6. Conductor Candidacy Assessment

Question: Is Qwen3.5 (27B or 9B) a good candidate for the Conductor/Orchestrator role?

Answer: No. Four reasons:

  1. Uncontrollable think token overhead. The conductor needs fast, reliable responses. Qwen3.5's thinking mode adds 5-30s latency and burns context on reasoning that should happen in orchestrator code, not inside the model.

  2. Unreliable JSON compliance. The conductor must produce structured output (routing decisions, tool calls, dispatch instructions) 100% of the time. Qwen3.5 manages 40% vs gemma3's 100%.

  3. Fragile under fine-tuning. LoRA on Qwen3.5 caused catastrophic forgetting. If the conductor needs fine-tuning later, Qwen3.5 is a risky base.

  4. 27B is too slow. 65s average is unacceptable for a routing layer in the critical path of every player request.

| Rank | Model | Why |
|---|---|---|
| 1 | phi4:14b | Fastest (7.4s), 100% JSON, good reasoning |
| 2 | gemma3:12b | 100% JSON, best MC domain knowledge, 12.9s |
| 3 | gemma3:27b | Most capable, but only if latency budget allows (25.3s) |

7. Recommendations

Immediate Actions

  1. Delete the fine-tuned models from Matt's Ollama. Base models are strictly superior.
  2. Use phi4:14b or gemma3:12b for conductor prototyping.
  3. Preserve training data (JSONL files) for future fine-tuning attempts.

If Re-attempting Fine-tuning

  1. Fix chat template alignment. Training data MUST use Qwen3.5's exact <|im_start|>...<|im_end|> format.
  2. Consider a different base model. gemma3:12b showed the best instruction-following baseline and may be more robust under LoRA.
  3. Lower LoRA rank and learning rate to prevent catastrophic forgetting.
  4. Add /no_think handling or use a model without built-in thinking mode.
  5. Validate with the chat API during training, not just loss metrics.

Fine-tuning Priority (from 2.0 spec)

  • Voice (persona, gemma3:4b) and Eye (router, functiongemma) are the 1.0.1 fine-tune targets.
  • The conductor should run on a base model with strong instruction-following. Fine-tuning is not planned until 2.0.0.

Appendix: Test Scripts

See scripts/ directory for the Python scripts used to conduct these interviews. All scripts query Ollama's API at http://192.168.0.141:11437.