Mortdecai 48df42b042 docs: Mortdecai 0.6.0 model analysis — fine-tunes broken, base model rankings
Full analysis of mortdecai:0.6.0-9b and mortdecai:latest (27B) fine-tunes
vs 6 base model candidates. Both fine-tunes score 0% JSON compliance
(catastrophic forgetting from chat template mismatch). Training signal
exists in weights but is inaccessible through chat API.

Base model rankings: phi4:14b (100%, 7.4s) > gemma3:12b (100%, 12.9s) >
gemma3:27b (100%, 25.3s). Qwen3.5 not recommended for conductor role.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 02:39:52 -04:00

Mortdecai Model Analysis

Analysis of Mortdecai 0.6.0 fine-tuned models vs base model candidates for the Conductor/Hand roles in Mortdecai 2.0.

Date: 2026-03-26 Conducted by: Claude Opus 4.6 (analyst role) Hardware: Matt's Strix Halo (64GB unified memory) running Ollama

Summary

Both Mortdecai 0.6.0 fine-tunes (Qwen3.5 9B and 27B) are completely broken — 0% JSON compliance across all tests. The training signal exists in the weights (proven via raw completion mode) but is inaccessible through the chat API due to chat template misalignment during training.

Base models dramatically outperform the fine-tunes. gemma3:12b and phi4:14b both achieve 100% JSON compliance with zero fine-tuning.

Files

File Description
analysis-report.md Full analysis with methodology, findings, and recommendations
data/mortdecai-interview.txt Raw output from fine-tuned model interviews (8 tests each)
data/base-model-interview.txt Raw output from base model comparison (6 models, 5 tests each)
data/deep-probes.txt Diagnostic probes: training signal detection, chat template, identity
scripts/model_interview.py Interview script for fine-tuned models
scripts/base_model_interview.py Comparison script for base models
scripts/deep_probe.py Deep diagnostic probe script
S
Description
Analysis of Mortdecai 0.6.0 fine-tuned models vs base model candidates for conductor role
Readme 52 KiB
Languages
Python 100%