task-queue-proof/README.md

# Unweighted Average Completion Time Is Not a Fair Metric for Task Scheduling

A mathematical proof that unweighted average task completion time is a biased
statistic that incentivizes cherry-picking easy work, and that any scheduling
advantage it appears to reveal is an artifact of the metric — not a reflection
of genuine throughput or service quality.

---

## 1. Definitions

Let there be **n** tasks with processing times $p_1, p_2, \ldots, p_n$.

A **schedule** $\sigma$ is a permutation of $\{1, 2, \ldots, n\}$ assigning
tasks to execution order on a single executor.

The **completion time** of task $\sigma(k)$ under schedule $\sigma$ is:

$$C_{\sigma(k)} = \sum_{j=1}^{k} p_{\sigma(j)}$$

The **unweighted mean completion time** is:

$$\bar{C}(\sigma) = \frac{1}{n} \sum_{k=1}^{n} C_{\sigma(k)}$$

The **work-weighted mean completion time** is:

$$\bar{C}_w(\sigma) = \frac{\sum_{k=1}^{n} p_{\sigma(k)} \cdot C_{\sigma(k)}}{\sum_{k=1}^{n} p_{\sigma(k)}}$$

---

## 2. SPT Is Optimal for the Unweighted Statistic

**Theorem 1.** The schedule that minimizes $\bar{C}(\sigma)$ is Shortest
Processing Time first (SPT): sort tasks so that $p_{\sigma(1)} \le p_{\sigma(2)} \le \cdots \le p_{\sigma(n)}$.

**Proof (exchange argument).**

Consider any schedule $\sigma$ in which two adjacent tasks $i, j$ satisfy
$p_i > p_j$ with task $i$ scheduled immediately before task $j$. Let $t$ be the
start time of task $i$.

| | Task $i$ finishes | Task $j$ finishes | Sum |
|---|---|---|---|
| **Before swap** ($i$ then $j$) | $t + p_i$ | $t + p_i + p_j$ | $2t + 2p_i + p_j$ |
| **After swap** ($j$ then $i$) | $t + p_j$ | $t + p_j + p_i$ | $2t + p_i + 2p_j$ |

The swap reduces the sum of completion times by:

$$(2p_i + p_j) - (p_i + 2p_j) = p_i - p_j > 0$$

Every swap of a longer-before-shorter adjacent pair strictly reduces the total.
Any non-SPT schedule contains such a pair, and repeated swaps terminate in an
SPT order. Therefore SPT minimizes $\bar{C}(\sigma)$, uniquely up to the order
of tasks with equal processing times. $\blacksquare$
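The exchange argument can be sanity-checked by brute force on a small instance. The sketch below is illustrative only: the processing times are hypothetical, and the function name `mean_completion_time` is not part of the original text.

```python
from itertools import permutations

def mean_completion_time(order):
    """Unweighted mean completion time for processing times run in the given order."""
    elapsed, total = 0, 0
    for p in order:
        elapsed += p      # completion time of this task
        total += elapsed  # running sum of completion times
    return total / len(order)

# Hypothetical processing times (distinct, so the optimum is unique).
times = [4, 1, 3, 2]

# Exhaustive search over all n! schedules.
best = min(permutations(times), key=mean_completion_time)
spt = tuple(sorted(times))

print(best == spt)                # True
print(mean_completion_time(spt))  # 5.0
```

Exhaustive search is only feasible for small $n$; it is a check on the theorem, not a substitute for the proof.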

---

## 3. The Work-Weighted Statistic Is Schedule-Invariant

**Theorem 2.** The work-weighted mean completion time $\bar{C}_w(\sigma)$ is
the same for every schedule $\sigma$.

**Proof.**

Expand the numerator:

$$\sum_{k=1}^{n} p_{\sigma(k)} \cdot C_{\sigma(k)} = \sum_{k=1}^{n} p_{\sigma(k)} \sum_{j=1}^{k} p_{\sigma(j)}$$

Reindex by letting $a = \sigma(k)$ and $b = \sigma(j)$. The double sum counts
every ordered pair $(a, b)$ where $b$ is scheduled no later than $a$:

$$= \sum_{\substack{a, b \\ b \preceq_\sigma a}} p_a \, p_b$$

For any pair $(a, b)$ with $a \ne b$, exactly one of $\{b \preceq_\sigma a\}$
or $\{a \prec_\sigma b\}$ holds. The diagonal terms ($a = b$) contribute $p_a^2$
regardless of order. Therefore:

$$\sum_{\substack{a, b \\ b \preceq_\sigma a}} p_a \, p_b = \sum_{a} p_a^2 + \sum_{\substack{a \ne b \\ b \prec_\sigma a}} p_a \, p_b$$

Now consider the complementary sum:

$$\sum_{\substack{a \ne b \\ a \prec_\sigma b}} p_a \, p_b$$

Together the two off-diagonal sums cover every ordered pair $(a, b)$ with $a \ne b$ exactly once:

$$\sum_{\substack{a \ne b \\ b \prec_\sigma a}} p_a \, p_b + \sum_{\substack{a \ne b \\ a \prec_\sigma b}} p_a \, p_b = \sum_{a \ne b} p_a \, p_b$$

The right-hand side is schedule-independent. By symmetry of $p_a p_b$, both
off-diagonal sums are equal:

$$\sum_{\substack{a \ne b \\ b \prec_\sigma a}} p_a \, p_b = \frac{1}{2} \sum_{a \ne b} p_a \, p_b$$

Therefore:

$$\sum_{k=1}^{n} p_{\sigma(k)} \cdot C_{\sigma(k)} = \sum_a p_a^2 + \frac{1}{2} \sum_{a \ne b} p_a \, p_b = \frac{1}{2}\left(\sum_a p_a\right)^2 + \frac{1}{2}\sum_a p_a^2$$

This expression contains no reference to $\sigma$. Since the denominator
$\sum p_a$ is also schedule-independent:

$$\bar{C}_w(\sigma) = \frac{\frac{1}{2}\left(\sum p_a\right)^2 + \frac{1}{2}\sum p_a^2}{\sum p_a}$$

is **constant across all schedules**. $\blacksquare$
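A short numerical check of Theorem 2, under the assumption of integer processing times so that exact rational arithmetic can be used. The helper names `work_weighted_mean` and `closed_form` are illustrative, and the three-task instance is hypothetical.

```python
from fractions import Fraction
from itertools import permutations

def work_weighted_mean(order):
    """Work-weighted mean completion time, computed exactly."""
    elapsed, numerator = 0, 0
    for p in order:
        elapsed += p
        numerator += p * elapsed  # p_i * C_i
    return Fraction(numerator, sum(order))

def closed_form(times):
    """(1/2 (sum p)^2 + 1/2 sum p^2) / sum p, the schedule-free expression."""
    total = sum(times)
    squares = sum(p * p for p in times)
    return Fraction(total * total + squares, 2 * total)

times = [1, 10, 3]  # hypothetical instance
values = {work_weighted_mean(order) for order in permutations(times)}

print(values == {closed_form(times)})  # True: one value across all 6 schedules
```

Collecting the results in a set makes the invariance visible directly: every permutation collapses to the same element.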

---

## 4. Concrete Example

Two tasks: $A$ with $p_A = 1$ hour, $B$ with $p_B = 10$ hours.

### SPT order (A first)

| Task | Completion time |
|------|----------------|
| A | 1 |
| B | 11 |

- Unweighted mean: $(1 + 11) / 2 = 6.0$
- Work-weighted mean: $(1 \times 1 + 10 \times 11) / 11 = 111/11 \approx 10.09$

### Reverse order (B first)

| Task | Completion time |
|------|----------------|
| B | 10 |
| A | 11 |

- Unweighted mean: $(10 + 11) / 2 = 10.5$
- Work-weighted mean: $(10 \times 10 + 1 \times 11) / 11 = 111/11 \approx 10.09$

SPT appears **4.5 hours better** on the unweighted metric but provides
**zero improvement** on the work-weighted metric. The apparent advantage exists
only because the unweighted statistic lets a 1-hour task "vote" equally with
a 10-hour task.

---

## 5. Connection to Little's Law

Little's Law states $L = \lambda W$, where $L$ is the time-averaged number
of tasks in the system, $\lambda$ is the arrival rate, and $W$ is the
average time a task spends in the system.

In a *steady-state* queueing system with fixed arrival and service rates,
$\lambda$ and the long-run service rate are determined by the workload, not
by scheduling policy. Little's Law then tells us that $L$ and $W$ are
linked, but in the batch case (all $n$ tasks present at time 0), $L$ and
$W$ are both schedule-dependent: $\bar{C} = W$, and
$L = \sum C_i / \sum p_i$, both of which SPT minimizes.
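The batch-case reading of Little's Law can be checked numerically. The sketch below assumes the averaging window is the makespan $\sum p_i$ and treats $\lambda$ as $n$ tasks over that window; both are assumptions of this illustration rather than statements from the text. The function name `batch_little` is hypothetical.

```python
from fractions import Fraction

def batch_little(order):
    """Return (L, lambda, W) for a batch of tasks run in the given order."""
    n = len(order)
    makespan = sum(order)
    # While the k-th task (0-based) runs, n - k tasks are still in the system.
    area = sum((n - k) * p for k, p in enumerate(order))
    elapsed, sum_completions = 0, 0
    for p in order:
        elapsed += p
        sum_completions += elapsed
    L = Fraction(area, makespan)        # time-averaged number in system
    lam = Fraction(n, makespan)         # tasks per unit time over the window
    W = Fraction(sum_completions, n)    # mean time in system
    return L, lam, W

L_spt, lam_spt, W_spt = batch_little([1, 10])
L_rev, lam_rev, W_rev = batch_little([10, 1])

print(L_spt == lam_spt * W_spt, L_rev == lam_rev * W_rev)  # True True
print(L_spt, L_rev)                                        # 12/11 21/11
```

Both $L$ and $W$ change with the schedule while their ratio $\lambda$ does not, which matches the batch-case statement above.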

The invariance we proved in Theorem 2 is more specific: *work-weighted*
mean completion time $\bar{C}_w$ is constant across schedules. This
corresponds to measuring the system from the perspective of "how long does
a unit of *work* wait" rather than "how long does a *task* wait." The
unweighted statistic measures the latter and is gameable precisely because
it counts completions rather than work.

---

## 6. Consequences

**Theorem 3 (Metric Bias).** Any scheduling policy that minimizes unweighted
mean completion time necessarily maximizes the completion time of the largest
task relative to other schedules.

**Proof.** SPT places the largest task last. Its completion time equals the
total processing time $\sum p_i$, which is the maximum possible completion
time for any individual task. Any order that does not run the largest task
last would let it finish earlier. $\blacksquare$

This creates a **starvation incentive**: rational agents optimizing the
unweighted statistic will indefinitely defer large tasks in favor of
small ones.

### Real-world manifestations

| Domain | Gameable metric | Perverse outcome |
|--------|----------------|------------------|
| Support desks | Tickets closed / day | Complex issues ignored |
| Sprint planning | Story count velocity | Work split into trivial pieces |
| Emergency rooms | Average wait time | Critical patients deprioritized |
| Academic publishing | Papers per year | Incremental work favored over deep research |

---

## 7. Impact on Client Satisfaction and Team Productivity

The preceding theorems are not merely abstract. They have direct, provable
consequences for client satisfaction and team productivity when a team adopts
unweighted mean completion time as its performance metric.

### 7.1 Defining Client Satisfaction: The Slowdown Ratio

A client submitting a task of size $p_i$ has an expectation anchored to that
size. The natural measure of their experience is the **slowdown ratio**:

$$S_i = \frac{C_i}{p_i}$$

This is the factor by which the client's wait exceeds the task's inherent
processing time. A slowdown of 1 means no queuing delay at all. A slowdown
of 10 means the client waited 10x longer than the work itself required.

Client satisfaction is inversely related to slowdown: a client who waits
2x their task size is more satisfied than one who waits 20x, regardless of
the absolute times involved.

**Theorem 4 (SPT Maximizes Completion Time of the Largest Task).**
SPT assigns the maximum possible completion time ($\sum p_i$) to the largest
task, as does every other schedule that runs the largest task last.

**Proof.**

SPT sorts tasks in ascending order of $p_i$, placing the largest task
$p_{\max}$ in the last position. The last task in any schedule has
completion time $\sum_{i=1}^{n} p_i$, which is the maximum completion time
any individual task can receive. Therefore, under SPT:

$$C_{\max\text{-task}}^{\text{SPT}} = \sum_{i=1}^{n} p_i$$

Under any schedule that does not place $p_{\max}$ last, the largest task
completes strictly before $\sum p_i$. SPT is not the only schedule that runs
the largest task last, but it is the one that minimizing $\bar{C}$ selects
(Theorem 1), so optimizing the unweighted metric always produces this worst
case for the largest task.

Note on slowdown: SPT actually *compresses* slowdown ratios ($S_i = C_i / p_i$)
because larger tasks in later positions have large denominators that absorb
the accumulated sum. For example, with tasks $[1, 5, 10]$:

- SPT: slowdowns $[1, 1.2, 1.6]$ — low variance
- LPT: slowdowns $[1, 3, 16]$ — high variance

SPT's harm to large-task clients is not visible in the slowdown ratio. It is
visible in **absolute completion time**: the largest task finishes last, at
$\sum p_i$, while under any ordering that does not run it last it finishes
earlier. $\blacksquare$
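The slowdown figures quoted for tasks $[1, 5, 10]$ can be reproduced with a few lines. This is a minimal sketch; the helper names are illustrative.

```python
def completion_times(order):
    """Completion time of each task, in execution order."""
    elapsed, out = 0, []
    for p in order:
        elapsed += p
        out.append(elapsed)
    return out

def slowdowns(order):
    """Slowdown ratio C_i / p_i of each task, in execution order."""
    return [c / p for c, p in zip(completion_times(order), order)]

tasks = [1, 5, 10]
spt = sorted(tasks)                # shortest first
lpt = sorted(tasks, reverse=True)  # longest first

print(slowdowns(spt))  # [1.0, 1.2, 1.6]
print(slowdowns(lpt))  # [1.0, 3.0, 16.0]

# Completion time of the 10-unit task under each order.
print(completion_times(spt)[spt.index(10)], completion_times(lpt)[lpt.index(10)])  # 16 10
```

The last line shows the point of the theorem: the ratio looks benign under SPT, while the absolute completion time of the largest task is at its maximum.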

**Corollary 4.1.** A team optimizing unweighted mean completion time will
systematically deliver the worst experience to clients with the most
complex needs.

This is not a side effect — it is the *mechanism* by which the metric improves.
The only way to lower the unweighted average is to complete more small tasks
early, which necessarily means completing large tasks later. The metric
improves *because* high-effort clients are deprioritized.

### 7.2 The Absolute Delay Burden

The slowdown ratio $S_i = C_i / p_i$ might suggest SPT is *fair*: it tends
to compress slowdown variance, keeping ratios low when task sizes are spread
out. But this obscures the real cost. The correct measure of burden is the
**absolute delay** experienced by each task:

$$\Delta_i = C_i - p_i$$

This is the time a task spends waiting for other tasks, independent of its
own size. Under any sequential schedule, the total delay across all tasks
is schedule-dependent (it equals $\sum C_i - \sum p_i$), and SPT minimizes
this total. But the *distribution* of delay matters.

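**Theorem 5 (SPT Concentrates Delay on the Largest Task).** Under SPT, the
largest task bears the maximum possible absolute delay: at least as much as
under any other schedule, and strictly more than under any schedule that does
not run it last.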

**Proof.** Under SPT, the largest task is in position $n$ with:

$$\Delta_{\max\text{-task}}^{\text{SPT}} = C_n - p_n = \sum_{i=1}^{n-1} p_i$$

This is the sum of all other tasks' processing times — the maximum possible
delay for any single task. Under any schedule where the largest task is not
last, its delay is strictly less than $\sum_{i \ne \max} p_i$.

Meanwhile, SPT gives the smallest task zero delay ($\Delta_1^{\text{SPT}} = 0$).
The queuing burden is shifted away from small tasks and falls most heavily on
the largest ones.
$\blacksquare$

The tension is this: SPT minimizes total delay (good for aggregate
efficiency) by concentrating delay onto the tasks best able to "absorb" it
in slowdown-ratio terms. But in absolute terms — hours spent waiting — the
largest task bears the full weight. If that task represents a critical
business need, the absolute delay, not the ratio, determines the damage.
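To make the distribution of absolute delay concrete, the following sketch computes $\Delta_i = C_i - p_i$ per task for the same hypothetical $[1, 5, 10]$ instance used earlier. The function name `delays` is illustrative.

```python
def delays(order):
    """Absolute delay (completion time minus own processing time), in run order."""
    waited, out = 0, []
    for p in order:
        out.append(waited)  # time spent waiting for earlier tasks
        waited += p
    return out

spt_order = [1, 5, 10]
lpt_order = [10, 5, 1]

spt_delays = delays(spt_order)
lpt_delays = delays(lpt_order)

print(spt_delays, sum(spt_delays))  # [0, 1, 6] 7
print(lpt_delays, sum(lpt_delays))  # [0, 10, 15] 25
```

SPT has the smaller total (7 versus 25), yet the 10-unit task waits 6 units under SPT and 0 under the reverse order, which is the tension described above.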

### 7.3 Productivity Is Not Improved

**Theorem 6 (Throughput Invariance).** Total work completed over any time
horizon $T$ is identical under all scheduling policies.

**Proof.** Assume the executor processes work at a fixed rate and is never
idle while tasks remain. Over time $T$, the total work completed is:

$$W(T) = \sum_{\{i : C_i \le T\}} p_i + \text{(partial progress on current task)}$$

With partial progress counted, $W(T) = \min(T, \sum p_i)$ under every
schedule. If only finished tasks are credited (the non-preemptive view, in
which a task counts once it has run to completion), $W(T)$ can differ between
schedules by less than the size of the largest task, depending on which task
is in progress at time $T$. Over any horizon $T \ge \sum p_i$ (i.e., long
enough to complete all tasks), the total work done is exactly $\sum p_i$
regardless of order.

For the steady-state case with ongoing arrivals and a continuously busy
executor, the long-run throughput is determined by the service rate $\mu$
and is independent of scheduling:

$$\lim_{T \to \infty} \frac{W(T)}{T} = \mu \quad \text{for all schedules } \sigma$$

$\blacksquare$
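The distinction between work done and work in finished tasks can be illustrated with two small functions. This is a sketch under the assumption of a unit-rate executor that never idles; `work_done` and `finished_work` are hypothetical names.

```python
def work_done(order, horizon):
    """Work completed by `horizon`, counting partial progress on the running task."""
    done = 0
    for p in order:
        if horizon <= 0:
            break
        step = min(p, horizon)
        done += step
        horizon -= step
    return done

def finished_work(order, horizon):
    """Work contained in tasks that have fully completed by `horizon`."""
    elapsed, done = 0, 0
    for p in order:
        elapsed += p
        if elapsed > horizon:
            break
        done += p
    return done

spt, rev = [1, 10], [10, 1]

print(work_done(spt, 5), work_done(rev, 5))            # 5 5
print(finished_work(spt, 5), finished_work(rev, 5))    # 1 0
print(finished_work(spt, 11), finished_work(rev, 11))  # 11 11
```

Mid-horizon, the schedules differ only in how much finished-task credit they show; the executor has done the same amount of work in both cases, and the difference vanishes once every task is complete.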

**Corollary 6.1.** A team that switches from any scheduling policy to SPT
will observe an improvement in unweighted mean completion time with
**zero change in actual throughput**.

The metric improves. The output does not.

### 7.4 The Compound Effect: Satisfaction Down, Productivity Flat

Combining Theorems 4, 5, and 6:

| Measure | Effect of optimizing unweighted mean |
|---------|--------------------------------------|
| Throughput (work/time) | No change (Theorem 6) |
| Delay for small tasks | Minimized — approaches zero (SPT) |
| Delay for large tasks | **Maximized** — bears all queuing burden (Theorem 5) |
| Completion time of largest task | **Maximum possible**: $\sum p_i$ (Theorem 4) |
| Overall perceived quality of service | **Net negative** (see below) |

The net effect on perceived quality is negative because:

1. **Loss aversion is asymmetric.** A client whose 100-hour task is
   deprioritized to last experiences a large, salient negative. A client
   whose 1-hour task moves from position 5 to position 1 experiences a
   small, often unnoticed positive. The absolute dissatisfaction created
   exceeds the absolute satisfaction gained.

2. **High-effort tasks correlate with high-value clients.** Large tasks
   are disproportionately likely to come from major clients, complex
   contracts, or critical business needs. Systematically giving these
   clients the worst experience is anti-correlated with revenue and
   retention.

3. **Starvation compounds.** In a continuous system (Theorem 3), large
   tasks are not merely delayed — they may be **indefinitely deferred**
   as new small tasks keep arriving. The affected client's satisfaction
   does not merely decrease; it collapses entirely.

**Theorem 7 (The Core Result).** For a team processing tasks of non-uniform
size, adopting unweighted mean completion time as a performance metric:

(a) provides **zero productivity gain** (Theorem 6),
(b) **assigns the maximum possible completion time** to the largest task
    (Theorem 4), and
(c) **concentrates queuing delay** onto the largest tasks while
    minimizing it for the smallest (Theorem 5).

This is not a tradeoff — there is no compensating benefit on the productivity
side. The metric creates a pure transfer of service quality from high-effort
clients to low-effort clients, with no net work gained.

**A team using unweighted mean completion time as its performance metric
will, under rational optimization, simultaneously fail to improve
productivity and systematically degrade the experience of its most
demanding clients.** $\blacksquare$

---

## 8. When Unweighted Mean Completion Time Is Valid

For completeness: the unweighted metric is appropriate **if and only if**
all tasks are approximately equal in size ($p_i \approx p_j$ for all $i, j$).
In this case, the work-weighted and unweighted statistics converge, SPT and
FIFO produce nearly the same set of completion times, and the distribution of
slowdown ratios is nearly the same under every schedule.

The pathology arises specifically from **variance in task size**. The greater
the variance, the greater the distortion, and the more damage the metric
causes when optimized.

---

## 9. Complete Breakdown Under Priority Classification

The preceding sections proved that unweighted mean completion time is biased
when tasks vary in size. We now show that introducing a **priority system** —
as virtually all real teams use — causes the metric to become not merely
biased but **actively adversarial** to the organization's stated goals.

### 9.1 Extended Model: Tasks With Priority

Let each task $i$ have processing time $p_i$ and a priority class
$q_i \in \{1, 2, 3, 4\}$ where 1 is the highest priority (critical) and
4 is the lowest (cosmetic/enhancement). Assign priority weights:

$$w(q) = \begin{cases} 8 & q = 1 \text{ (Critical)} \\ 4 & q = 2 \text{ (High)} \\ 2 & q = 3 \text{ (Medium)} \\ 1 & q = 4 \text{ (Low)} \end{cases}$$

The specific weights are illustrative; the results hold for any strictly
decreasing weight function. The key property is that priority is assigned
by **business impact**, not by task size.

### 9.2 The Metric Contradicts the Priority System

**Theorem 8 (Priority-Size Inversion).** When priority is independent of
task size, the schedule that minimizes unweighted mean completion time (SPT)
will, in expectation, complete low-priority tasks before high-priority tasks
of greater size.

**Proof.**

SPT orders tasks by $p_i$ ascending, regardless of $q_i$. Consider two tasks:

- Task A: $p_A = 40$ hours, $q_A = 1$ (Critical — e.g., server outage)
- Task B: $p_B = 0.5$ hours, $q_B = 4$ (Low — e.g., cosmetic UI fix)

SPT schedules B before A. The unweighted mean completion time for this pair:

$$\bar{C}^{\text{SPT}} = \frac{0.5 + 40.5}{2} = 20.5$$

The priority-respecting order (A before B):

$$\bar{C}^{\text{priority}} = \frac{40 + 40.5}{2} = 40.25$$

The metric declares SPT nearly **twice as good** — despite completing a
cosmetic fix while a server outage burns for an additional 0.5 hours.

In general, for $n$ tasks where priority $q_i$ is statistically independent
of processing time $p_i$ (a reasonable assumption, since priority reflects
business impact while processing time reflects technical complexity):

$$\text{Corr}(p_i, q_i) \approx 0$$

SPT's ordering is determined entirely by $p_i$. The expected position of a
task in the SPT schedule has **zero correlation** with its priority. A
Critical task is equally likely to be scheduled first or last.

More precisely: the expected fraction of Critical tasks in the bottom half
of the SPT schedule equals the fraction of Critical tasks whose processing
time exceeds the median. In practice, Critical tasks (outages, security
incidents, data loss) often require more work, so this fraction exceeds 50%.
The metric is not merely uncorrelated with priority — it is plausibly
**anti-correlated**. $\blacksquare$

### 9.3 Dimensionality Collapse

The unweighted mean completion time reduces a three-dimensional task
$(p_i, q_i, C_i)$ to a one-dimensional signal ($C_i$), then averages
that signal uniformly. This discards two of the three dimensions:

1. **Priority ($q_i$) is completely ignored.** A critical task and a
   cosmetic task contribute identically to the mean.
2. **Size ($p_i$) is implicitly inverted.** Small tasks are rewarded with
   early completion, large tasks are punished — regardless of their
   importance.

**Theorem 9 (Information Destruction).** Let $I(\sigma)$ be the mutual
information between the schedule's implicit priority ranking (position in
schedule) and the actual priority assignment $q_i$. For SPT:

$$I(\sigma_{\text{SPT}}) = 0 \quad \text{when } p_i \perp q_i$$

**Proof.** SPT assigns positions based solely on $p_i$. When $p_i$ and $q_i$
are independent, knowing a task's position in the SPT schedule provides
zero information about its priority. The schedule is statistically
independent of the priority system.

Contrast this with a priority-first schedule, where $I > 0$ by construction.
$\blacksquare$

**Corollary 9.1.** A team that optimizes unweighted mean completion time
is operating a scheduling system that carries zero information about its
own priority classification. The priority field in their ticketing system
is, with respect to execution order, decorative.

### 9.4 Quantifying the Damage: Priority-Weighted Delay Cost

Define the **priority-weighted delay cost** of a schedule:

$$D(\sigma) = \sum_{i=1}^{n} w(q_i) \cdot C_i$$

This measures the total business-impact-weighted time spent waiting.

**Theorem 10 (SPT and Priority-Weighted Delay Cost).**
The optimal schedule for minimizing priority-weighted delay cost $D(\sigma)$
is WSJF: order by $w(q_i)/p_i$ descending. SPT orders by $1/p_i$ descending,
ignoring priority entirely, and produces a strictly higher $D$ whenever its
order departs from WSJF on tasks with different $w(q_i)/p_i$ ratios.

**Proof.** By the standard exchange argument (as in Theorem 1), swapping
adjacent tasks $i, j$, with $i$ scheduled immediately before $j$, reduces $D$ by:

$$\Delta D = w(q_j) \cdot p_i - w(q_i) \cdot p_j$$

The swap improves $D$ when $\Delta D > 0$, i.e., when $w(q_j)/p_j > w(q_i)/p_i$
but $j$ is scheduled after $i$. Therefore the optimal order is decreasing
$w(q_i)/p_i$ — this is the WSJF rule.

SPT orders by $p_i$ ascending (equivalently, $1/p_i$ descending), which
corresponds to WSJF only when $w(q_i) = \text{const}$ — i.e., when all
tasks have equal priority.

**Example.** Two tasks: Critical ($w = 8$, $p_H = 10$) and Low ($w = 1$, $p_L = 1$).

WSJF scores: Critical = $8/10 = 0.8$, Low = $1/1 = 1.0$.

WSJF places the Low task first (higher $w/p$), same as SPT. Here, SPT and
WSJF agree because the Low task's tiny size dominates despite its low weight.

Now consider: Critical ($w = 8$, $p_H = 3$) and Low ($w = 1$, $p_L = 2$).

WSJF scores: Critical = $8/3 = 2.67$, Low = $1/2 = 0.5$.

WSJF places Critical first. SPT places Low first (smaller $p$). The costs:

- SPT (Low first): $D = 1 \cdot 2 + 8 \cdot 5 = 42$
- WSJF (Critical first): $D = 8 \cdot 3 + 1 \cdot 5 = 29$

SPT incurs 45% more priority-weighted delay because it ignores the 8x
priority weight of the Critical task.

In general, SPT diverges from WSJF, and produces suboptimal $D$, whenever
ordering by $p_i$ disagrees with ordering by $w(q_i)/p_i$: for example, when
a higher-priority task is larger than a lower-priority one, but by a smaller
factor than the ratio of their weights. In practice, Critical tasks tend to
be larger (outages, security incidents), making the divergence systematic
rather than occasional. $\blacksquare$
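The two-task example and the WSJF rule can be checked with a brute-force search. This is a minimal sketch: tasks are represented as `(weight, processing_time)` pairs, and the four-task instance is hypothetical, chosen so that all $w/p$ ratios are distinct.

```python
from itertools import permutations

def weighted_delay_cost(order):
    """D = sum of w_i * C_i for (weight, processing_time) pairs run in order."""
    elapsed, cost = 0, 0
    for w, p in order:
        elapsed += p
        cost += w * elapsed
    return cost

def wsjf(tasks):
    """Order by w / p descending."""
    return sorted(tasks, key=lambda t: t[0] / t[1], reverse=True)

def spt(tasks):
    """Order by processing time ascending."""
    return sorted(tasks, key=lambda t: t[1])

pair = [(8, 3), (1, 2)]  # Critical and Low from the example above
print(weighted_delay_cost(spt(pair)), weighted_delay_cost(wsjf(pair)))  # 42 29

# Hypothetical four-task instance with distinct w/p ratios.
tasks = [(8, 3), (1, 2), (4, 4), (2, 1)]
brute_force_min = min(weighted_delay_cost(o) for o in permutations(tasks))
print(brute_force_min == weighted_delay_cost(wsjf(tasks)))  # True
```

The exhaustive minimum matches the WSJF order, while SPT on the same instance pays a higher cost.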

---

## 10. A Proposed Solution: Priority-Weighted Completion Score

### 10.1 The Metric

Replace unweighted mean completion time with the **Priority-Weighted
Completion Score (PWCS)**:

$$\text{PWCS}(\sigma) = \frac{\sum_{i=1}^{n} w(q_i) \cdot \frac{C_i}{p_i}}{\sum_{i=1}^{n} w(q_i)}$$

This is the priority-weighted mean slowdown ratio. It measures:

- **How long each task waited relative to its size** (the slowdown $C_i / p_i$),
  weighted by
- **How much that task mattered** (the priority weight $w(q_i)$).

Lower is better. A PWCS of 1.0 means no task experienced any queuing delay,
which on a single executor is attainable only for a one-task queue. A PWCS
of 3.0 means the priority-weighted average task completed at 3x its
processing time.

### 10.2 Properties of PWCS

**Property 1: Priority-respecting.** PWCS penalizes delays to high-priority
tasks more heavily than low-priority tasks. A 2-hour delay to a Critical
task costs 8x more than the same delay to a Low task of the same size.

**Property 2: Size-fair.** By using the slowdown ratio $C_i / p_i$ rather
than raw completion time $C_i$, the metric does not inherently penalize
large tasks for being large. A 40-hour task that completes after 80 hours
contributes the same slowdown (2.0) as a 1-hour task that completes after
2 hours.

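**Property 3: Not gameable by SPT alone.** Because the metric weights by
priority, ordering purely by processing time is not optimal when priorities
differ: a delayed Critical task costs more than a delayed Low task of the
same size. The $1/p_i$ normalization still rewards finishing small tasks
early, so the best strategy combines priority and size rather than following
size alone. A team cannot optimize the score without **taking the priority
system into account**.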

**Property 4: Reduces to unweighted mean when tasks are uniform.** If all
tasks have equal priority and equal size, PWCS equals the unweighted mean
completion time divided by the common task size. It is a strict
generalization.

### 10.3 Optimal Policy for PWCS

**Theorem 11.** The schedule minimizing PWCS processes tasks in order of
decreasing $w(q_i) / p_i^2$. The schedule minimizing the related
**priority-weighted completion time** (PWCT, defined below) processes tasks
in order of decreasing $w(q_i) / p_i$.

**Proof (exchange argument, as in Theorem 1).**

Consider adjacent tasks $i, j$ with $i$ before $j$. Only these two tasks'
completion times change when they are swapped. Swapping $i$ and $j$ reduces
the weighted slowdown sum $\sum_k w(q_k) \, C_k / p_k$ by:

$$w(q_j) \cdot \frac{p_i}{p_j} - w(q_i) \cdot \frac{p_j}{p_i}$$

The swap improves PWCS when this quantity is positive, i.e., when:

$$\frac{w(q_j)}{p_j^2} > \frac{w(q_i)}{p_i^2}$$

The PWCS-optimal order is therefore decreasing $w(q_i)/p_i^2$. Because size
enters squared, a task needs 4x the weight to offset being 2x as long, so
PWCS leans toward small tasks more than its normalization suggests. For
scheduling purposes, consider instead the more practical
**priority-weighted completion time**:

$$\text{PWCT}(\sigma) = \frac{\sum_{i=1}^{n} w(q_i) \cdot C_i}{\sum_{i=1}^{n} w(q_i)}$$

For PWCT, the exchange argument gives: the swap improves the score when
$w(q_j) \cdot p_i > w(q_i) \cdot p_j$, i.e., when $w(q_j)/p_j > w(q_i)/p_i$
but $j$ is scheduled after $i$. The optimal order is therefore decreasing
$w(q_i)/p_i$, which is the **Weighted Shortest Job First (WSJF)** rule:

$$\text{Schedule by: } \frac{w(q_i)}{p_i} \text{ descending}$$

This means: within a priority class, do short tasks first; across priority
classes, a Critical 8-hour task ($w/p = 8/8 = 1.0$) ties with a Low 1-hour
task ($w/p = 1/1 = 1.0$), but a Critical 4-hour task ($w/p = 8/4 = 2.0$)
beats both. $\blacksquare$
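As a check on the exchange arguments for both scores, the sketch below finds the minimizing order by exhaustive search and compares it with the orders by $w/p$ and by $w/p^2$. The four `(weight, hours)` tasks are hypothetical, chosen so that neither ratio has ties; the helper names are illustrative. Only the numerators are compared, since $\sum w(q_i)$ is the same for every schedule.

```python
from itertools import permutations

def weighted_completion_sum(order, weight_of):
    """Sum of weight_of(w, p) * C over (w, p) tasks run in the given order."""
    elapsed, total = 0.0, 0.0
    for w, p in order:
        elapsed += p
        total += weight_of(w, p) * elapsed
    return total

def pwct_numerator(order):
    return weighted_completion_sum(order, lambda w, p: w)      # w * C

def pwcs_numerator(order):
    return weighted_completion_sum(order, lambda w, p: w / p)  # w * C / p

# Hypothetical (weight, hours) tasks with distinct w/p and w/p^2 values.
tasks = [(2, 1), (8, 6), (4, 4), (1, 3)]

best_pwct = list(min(permutations(tasks), key=pwct_numerator))
best_pwcs = list(min(permutations(tasks), key=pwcs_numerator))

by_w_over_p = sorted(tasks, key=lambda t: t[0] / t[1], reverse=True)
by_w_over_p2 = sorted(tasks, key=lambda t: t[0] / t[1] ** 2, reverse=True)

print(best_pwct == by_w_over_p)   # True
print(best_pwcs == by_w_over_p2)  # True
print(best_pwct == best_pwcs)     # False
```

On this instance the PWCS optimum runs the weight-4, 4-hour task ahead of the weight-8, 6-hour task, while the PWCT optimum does the reverse, so the two scores reward different schedules.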

### 10.4 Applied Example: IT Service Desk

Consider an IT team with the following ticket queue on a Monday morning:

| Ticket | Priority | Type | Est. Hours |
|--------|----------|------|-----------|
| T1 | P1 (Critical) | Email server down | 6 |
| T2 | P2 (High) | VPN failing for remote team | 4 |
| T3 | P3 (Medium) | New employee laptop setup | 2 |
| T4 | P4 (Low) | Update desktop wallpaper policy | 0.5 |
| T5 | P3 (Medium) | Install software license | 1 |
| T6 | P1 (Critical) | Database backup failing | 3 |
| T7 | P2 (High) | Printer fleet offline | 2 |
| T8 | P4 (Low) | Archive old shared drive folder | 0.25 |

**SPT order (optimizing unweighted mean):** T8, T4, T5, T3, T7, T6, T2, T1

| Position | Ticket | Priority | Hours | Completion | Slowdown |
|----------|--------|----------|-------|------------|----------|
| 1 | T8 (archive folder) | P4 Low | 0.25 | 0.25 | 1.0 |
| 2 | T4 (wallpaper) | P4 Low | 0.5 | 0.75 | 1.5 |
| 3 | T5 (software) | P3 Med | 1 | 1.75 | 1.75 |
| 4 | T3 (laptop) | P3 Med | 2 | 3.75 | 1.875 |
| 5 | T7 (printers) | P2 High | 2 | 5.75 | 2.875 |
| 6 | T6 (backups) | P1 Crit | 3 | 8.75 | 2.917 |
| 7 | T2 (VPN) | P2 High | 4 | 12.75 | 3.1875 |
| 8 | T1 (email) | P1 Crit | 6 | 18.75 | 3.125 |

- **Unweighted mean completion:** $(0.25 + 0.75 + 1.75 + 3.75 + 5.75 + 8.75 + 12.75 + 18.75) / 8 = 6.5625$ hours
- **PWCT:** $(1 \cdot 0.25 + 1 \cdot 0.75 + 2 \cdot 1.75 + 2 \cdot 3.75 + 4 \cdot 5.75 + 8 \cdot 8.75 + 4 \cdot 12.75 + 8 \cdot 18.75) / 30 = 306/30 = 10.2$ hours
- Email server is down for **18.75 hours**. Database backups fail for **8.75 hours**.

**WSJF order (optimizing PWCT by $w(q)/p$ descending):**

| Ticket | Priority | Hours | $w/p$ |
|--------|----------|-------|-------|
| T8 | P4 Low | 0.25 | 1/0.25 = 4.0 |
| T6 | P1 Crit | 3 | 8/3 = 2.667 |
| T5 | P3 Med | 1 | 2/1 = 2.0 |
| T4 | P4 Low | 0.5 | 1/0.5 = 2.0 |
| T7 | P2 High | 2 | 4/2 = 2.0 |
| T1 | P1 Crit | 6 | 8/6 = 1.333 |
| T2 | P2 High | 4 | 4/4 = 1.0 |
| T3 | P3 Med | 2 | 2/2 = 1.0 |

T8 has $w/p = 4.0$, the highest, so pure WSJF runs a Low-priority task
first, ahead of both Critical tickets. This reveals an important practical
point: **pure WSJF can still be gamed by tiny tasks** because their small $p$
inflates the ratio. In practice, this is mitigated by enforcing strict
priority class ordering and only applying WSJF *within* priority classes.

**Practical WSJF (priority-class-first, then $w/p$ within class):**

| Position | Ticket | Priority | Hours | Completion |
|----------|--------|----------|-------|------------|
| 1 | T6 (backups) | P1 Crit | 3 | 3 |
| 2 | T1 (email) | P1 Crit | 6 | 9 |
| 3 | T7 (printers) | P2 High | 2 | 11 |
| 4 | T2 (VPN) | P2 High | 4 | 15 |
| 5 | T5 (software) | P3 Med | 1 | 16 |
| 6 | T3 (laptop) | P3 Med | 2 | 18 |
| 7 | T8 (archive) | P4 Low | 0.25 | 18.25 |
| 8 | T4 (wallpaper) | P4 Low | 0.5 | 18.75 |

- **Unweighted mean completion:** $(3 + 9 + 11 + 15 + 16 + 18 + 18.25 + 18.75) / 8 = 13.625$ hours
- **PWCT:** $(8 \cdot 3 + 8 \cdot 9 + 4 \cdot 11 + 4 \cdot 15 + 2 \cdot 16 + 2 \cdot 18 + 1 \cdot 18.25 + 1 \cdot 18.75) / 30 = 305/30 = 10.167$ hours
- Email server restored in **9 hours**. Backups fixed in **3 hours**.

### Comparison

| Metric | SPT | Practical WSJF | Winner |
|--------|-----|----------------|--------|
| Unweighted mean completion | **6.5625 hrs** | 13.625 hrs | SPT |
| Priority-weighted completion (PWCT) | 10.2 hrs | **10.167 hrs** | WSJF |
| Time to fix email server | 18.75 hrs | **9 hrs** | WSJF |
| Time to fix database backups | 8.75 hrs | **3 hrs** | WSJF |
| Time to fix printers | **5.75 hrs** | 11 hrs | SPT |
| Time to update wallpaper | **0.75 hrs** | 18.75 hrs | SPT |

The PWCT values are nearly identical (10.2 vs 10.167) because the
class-first schedule is not the PWCT optimum: running the long P1 and P2
tickets ahead of every short ticket pushes back all later completions, which
offsets most of the gain from finishing critical work early. (Pure WSJF
would reach $D = 273$, or PWCT $= 9.1$ hours, by running the Low-priority T8
first.) **PWCT is not the right metric for this comparison.** The
real difference is visible in the individual completion times of critical
tasks: the email server is down for 18.75 hours under SPT versus 9 hours
under WSJF. The database backups fail for 8.75 hours versus 3 hours.

The better comparison metric is the **priority-weighted delay cost**
$D = \sum w(q_i) \cdot C_i$ (not normalized):

- SPT: $D = 306$ priority-weighted hours
- Practical WSJF: $D = 305$ priority-weighted hours

Again, the aggregate is similar. The damage from SPT is not in the
aggregate — it is in the *distribution*: critical systems burn while
cosmetic tasks are polished. A metric that cannot distinguish between these
two schedules — despite one leaving the email server down for twice as long
— is not measuring what matters.

The unweighted metric, however, confidently reports SPT as **more than twice
as efficient** (6.56 vs 13.63), rewarding the team that updated desktop
wallpaper while the email server was on fire.

### 10.5 Recommended Metric Suite

The IT example reveals that even priority-weighted aggregate metrics (PWCT)
can fail to distinguish good from bad schedules, because aggregation hides
distributional damage. No single metric suffices. A complete measurement
system for a priority-based team should track:

| Metric | What it measures | Formula |
|--------|-----------------|---------|
| **Mean completion by priority class** | Per-class responsiveness | $\bar{C}$ filtered by $q$ |
| **P1 mean time to resolution** | Critical incident response | $\bar{C}$ filtered to $q = 1$ |
| **Throughput** | Raw work capacity | Work-hours completed / calendar time |
| **Aging violations** | Starvation prevention | Count of tasks exceeding SLA by priority |
| **Max completion time (P1/P2)** | Worst-case critical response | $\max(C_i)$ filtered to $q \le 2$ |

The key insight from our analysis: **per-priority-class metrics** (rows 1-2,
5) expose scheduling failures that aggregate metrics hide. If P1 mean time
to resolution is 14 hours while P4 mean is 0.5 hours, the team is
optimizing the wrong metric — regardless of what the aggregate says.
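A sketch of how the per-priority-class rows of the suite might be computed from a schedule. The input layout, a list of `(priority_class, hours)` pairs in execution order, and the function name `per_class_metrics` are assumptions of this illustration; the data is the Section 10.4 queue.

```python
from collections import defaultdict

def per_class_metrics(schedule):
    """Mean and max completion time per priority class.

    `schedule` is a list of (priority_class, hours) in execution order.
    """
    elapsed = 0.0
    by_class = defaultdict(list)
    for q, hours in schedule:
        elapsed += hours
        by_class[q].append(elapsed)
    return {q: {"mean": sum(cs) / len(cs), "max": max(cs)}
            for q, cs in sorted(by_class.items())}

# Section 10.4 queue in SPT order and in class-first order.
spt_schedule = [(4, 0.25), (4, 0.5), (3, 1), (3, 2), (2, 2), (1, 3), (2, 4), (1, 6)]
class_first_schedule = [(1, 3), (1, 6), (2, 2), (2, 4), (3, 1), (3, 2), (4, 0.25), (4, 0.5)]

spt_metrics = per_class_metrics(spt_schedule)
class_first_metrics = per_class_metrics(class_first_schedule)

print(spt_metrics[1], spt_metrics[4])
# {'mean': 13.75, 'max': 18.75} {'mean': 0.5, 'max': 0.75}
print(class_first_metrics[1], class_first_metrics[4])
# {'mean': 6.0, 'max': 9.0} {'mean': 18.5, 'max': 18.75}
```

Under SPT the P1 mean is 13.75 hours against a P4 mean of 0.5 hours, which is the kind of gap the per-class rows are meant to surface.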

---

## 11. Devil's Advocate: The Case for Unweighted Mean Completion Time

Intellectual honesty requires acknowledging where the preceding argument
has limits. The following are genuine counterarguments — not strawmen.

### 11.1 Simplicity Has Real Value

**Argument.** The unweighted mean is trivially computable: sum the completion
times, divide by the count. It requires no priority weights, no task-size
estimates, no calibration. Every alternative proposed in Section 10 requires
estimating $p_i$ (task size) before the task is complete — and these
estimates are notoriously unreliable.

**Assessment: This is true.** PWCS and PWCT require inputs (priority
weights, size estimates) that introduce their own sources of error. If size
estimates are systematically wrong — and in software engineering they often
are, with large tasks underestimated and small tasks overestimated — then
the weighted metric inherits that noise.

However, the unweighted metric does not avoid this problem — it *hides* it
by implicitly setting all weights to 1 and all sizes to 1. That is not
"making no assumptions"; it is making the specific assumption that all tasks
are equally important and equally sized, which is demonstrably false in any
real system. **A known-imprecise estimate of task size is still more
informative than the implicit assumption that all sizes are equal.**

### 11.2 Minimizing the Number of People Waiting

**Argument.** If each task represents one client, then unweighted mean
completion time minimizes the total person-hours spent waiting. SPT is
optimal for this because completing short tasks first "frees" the most
people from the queue earliest.

**Assessment: This is mathematically correct.** The sum $\sum C_i$ counts
total person-time in the system. SPT genuinely minimizes this quantity.
If you run a DMV and every person's time is equally valuable regardless of
why they're there, SPT is the right policy.
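The optimality claim is easy to confirm by brute force on a small queue. This is a minimal sketch with made-up processing times (one task per waiting person); `total_completion_time` is an illustrative helper, not something defined elsewhere in this document.

```python
from itertools import accumulate, permutations

def total_completion_time(order):
    """Sum of completion times when tasks run back to back in the given order."""
    return sum(accumulate(order))

# Hypothetical processing times in hours, one task per waiting person.
times = [4.0, 0.25, 1.5, 8.0, 0.5]

# Try every possible order and keep the one with the least total waiting.
best = min(permutations(times), key=total_completion_time)
spt = tuple(sorted(times))

print(best, total_completion_time(best))
```

The exhaustive search returns the shortest-first order, which is what the exchange argument of Theorem 1 predicts.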

The argument breaks down when:

1. **Tasks are not 1:1 with clients.** In IT, one client may submit tasks
   of varying size. Across a relationship, SPT systematically fast-tracks
   their easy requests and starves their hard ones — which is not perceived
   as good service.

2. **Waiting cost is not uniform.** A person waiting for a server outage
   to be fixed is not equivalent to a person waiting for a wallpaper change.
   The cost of waiting is proportional to the *impact* of the unresolved
   task, which is what priority encodes.

3. **The metric is applied to teams, not DMVs.** When a team's performance
   is measured by unweighted mean, the rational response is to cherry-pick
   — which is individually rational but collectively destructive.

### 11.3 SPT as a Triage Heuristic

**Argument.** In high-volume systems where task sizes cluster tightly
(e.g., a call center where most calls are 3-7 minutes), SPT approximates
FIFO and the unweighted mean approximates the weighted mean. The pathologies
described in this paper only manifest when task sizes span orders of
magnitude.

**Assessment: This is correct.** As shown in Section 8, when task sizes are
approximately uniform, all scheduling policies converge and all metrics
agree. The coefficient of variation of task size, $CV = \sigma_p / \bar{p}$,
determines the severity of the distortion:

| $CV$ | Task size distribution | Metric distortion |
|------|----------------------|-------------------|
| < 0.3 | Tight (call center) | Negligible |
| 0.3 - 1.0 | Moderate (mixed IT) | Moderate |
| > 1.0 | Wide (typical IT queue) | Severe |

For a typical IT service desk, task sizes range from 15 minutes (password
reset) to 40+ hours (infrastructure migration). A spread of more than two
orders of magnitude can put $CV$ well above 1, and above 2 when short
tickets dominate the queue. The distortion is not a theoretical edge case;
it is the default condition.
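To place a given queue in the table above, $CV$ can be estimated from a sample of task sizes. The sketch below assumes the population standard deviation is the intended $\sigma_p$ (the text does not say which estimator to use), and both sample lists are invented for illustration.

```python
from statistics import mean, pstdev

def coefficient_of_variation(sizes):
    """CV = population standard deviation / mean of the task sizes."""
    return pstdev(sizes) / mean(sizes)

def distortion_band(cv):
    """Map CV to the qualitative bands used in the table above."""
    if cv < 0.3:
        return "Negligible"
    if cv <= 1.0:
        return "Moderate"
    return "Severe"

call_center_minutes = [3, 4, 5, 6, 7]                 # hypothetical sample
service_desk_hours = [0.25, 0.25, 0.5, 1, 2, 8, 40]   # hypothetical sample

for name, sizes in [("call center", call_center_minutes),
                    ("service desk", service_desk_hours)]:
    cv = coefficient_of_variation(sizes)
    print(f"{name}: CV = {cv:.2f} -> {distortion_band(cv)}")
```

The tightly clustered sample falls in the negligible band, while the mixed service-desk sample falls in the severe band.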

### 11.4 Gaming Requires Malice

**Argument.** The theorems show that the metric *can* be gamed, not that it
*will* be gamed. A well-intentioned team might use the unweighted mean as
a rough health indicator without actively optimizing for it, avoiding the
pathologies described.

**Assessment: This is the strongest counterargument.** If the metric is
used purely for monitoring — "are we completing things at a reasonable
pace?" — and not for performance evaluation, rewards, or scheduling
decisions, then the gaming incentive is absent and the metric is relatively
harmless.

However, this argument requires the metric to remain purely informational
and never influence behavior. In practice, any metric that is reported to
management, tied to OKRs, or used in sprint retrospectives will influence
behavior — this is Goodhart's Law, and it applies to well-intentioned teams
as reliably as to cynical ones. The team need not be gaming the metric
consciously; it is sufficient that completing three easy tickets "feels
productive" while staring at one hard ticket does not. The metric validates
the feeling, and the drift happens organically.

### 11.5 Summary: When the Unweighted Mean Is Defensible

The unweighted mean completion time is a defensible metric **only when all
four conditions hold simultaneously**:

1. Task sizes are approximately uniform ($CV < 0.3$)
2. There is no priority differentiation (all tasks are equally important)
3. Each task represents exactly one client
4. The metric is not used to evaluate, reward, or direct team behavior

In a system satisfying all four conditions — such as a simple FIFO queue
with uniform jobs and no priority system — the unweighted mean is adequate,
and its simplicity is a genuine advantage.

In any system that violates even one of these conditions — which includes
virtually every IT service desk, development team, and support organization
— the metric produces the distortions proven in Sections 2-9.
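The four conditions can be written down as a simple checklist. This is only a sketch: the function name, its parameters, and the way each condition is reduced to a boolean are assumptions made for illustration, and the 0.3 threshold is the one from Section 11.3.

```python
def violated_conditions(size_cv, has_priority_classes, one_task_per_client,
                        metric_drives_behavior, cv_threshold=0.3):
    """List which of the four defensibility conditions fail for a queue."""
    checks = [
        ("uniform task sizes", size_cv < cv_threshold),
        ("no priority differentiation", not has_priority_classes),
        ("one client per task", one_task_per_client),
        ("metric not used to steer behavior", not metric_drives_behavior),
    ]
    return [name for name, holds in checks if not holds]

# Hypothetical queues: a uniform FIFO queue and a typical service desk.
uniform_fifo = violated_conditions(0.2, False, True, False)
service_desk = violated_conditions(2.1, True, False, True)

print("uniform FIFO violations:", uniform_fifo)
print("service desk violations:", service_desk)
```

An empty list means the unweighted mean is defensible for that queue; any entry in the list is enough to bring back the distortions.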

The honest conclusion is not that the unweighted mean is always wrong. It is
that the conditions under which it is right are narrow, easily identified,
and rarely met in the systems where it is most commonly used.

---

## 12. Manager Internalization: The Actionable Solution

This document presents two extremes: reject the metric entirely
(Sections 1-10) or surrender to it (Appendix A). In practice, most
managers cannot unilaterally change the metric: it is set at the
organizational level, reported across teams, and embedded in dashboards
that other stakeholders consume. The best solution is company-wide metric
reform. The *actionable* solution is what a single informed manager can
do right now.

### 12.1 The Strategy

A manager who understands the proof can **internalize the metric's
limitations without propagating them to the team**. The approach:

1. **Schedule primarily by priority.** The team works critical tasks
   first, exactly as professional judgment and the priority system
   dictate. This is the default — the team need not know why.

2. **Tactically interleave small tasks to maintain metric parity.** When
   the queue contains a small, low-priority task that can be completed
   quickly without materially delaying any high-priority work, do it.
   Not because the metric demands it, but because the small task *also
   needs to get done*, and doing it now costs almost nothing.

3. **Never reveal the metric as the motivation.** The team is told "knock
   out this quick one while we're waiting on the vendor callback for the
   P1" — not "we need to bring our average down." The team's
   professional judgment and intrinsic motivation (Appendix B) remain
   intact. The manager absorbs the metric-management burden.

This is a **constrained optimization**: minimize priority-weighted delay
(do the right work in the right order) subject to the constraint that
the reported unweighted mean stays within an acceptable band.

### 12.2 Formalization

Let $\bar{C}_{\text{target}}$ be the unweighted mean completion time that
other teams report — the parity threshold. The manager's problem is:

$$\min_{\sigma} \sum_{i=1}^{n} w(q_i) \cdot C_i \quad \text{subject to} \quad \bar{C}(\sigma) \le \bar{C}_{\text{target}}$$

This is a single-machine scheduling problem with a budget constraint on
the unweighted mean. A practical way to approach it is a modified
priority schedule:

- Start from the priority-first ordering (all P1 first, then P2, etc.).
- Within each priority class, run shorter tasks first. This reduces
  $\bar{C}$ without displacing any higher-priority task.
- If $\bar{C}$ is still above target, move a small lower-priority task
  ahead of higher-priority work only when the marginal improvement to
  $\bar{C}$ exceeds the marginal cost to priority-weighted delay.
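For a small queue, the constrained problem can be solved by exhaustive search rather than by heuristic. The sketch below is an assumption-laden illustration: the weights `WEIGHTS` stand in for $w(q)$ and are hypothetical, the task list is invented, and trying all $n!$ orders is only feasible for a handful of tasks.

```python
from itertools import permutations

# Hypothetical priority weights w(q); the weights used elsewhere may differ.
WEIGHTS = {1: 100.0, 2: 10.0, 3: 3.0, 4: 1.0}

def evaluate(order):
    """Return (priority-weighted delay, unweighted mean) for a task order.

    Each task is a (priority class q, processing time p) pair.
    """
    clock = weighted = total = 0.0
    for q, p in order:
        clock += p                      # completion time of this task
        weighted += WEIGHTS[q] * clock
        total += clock
    return weighted, total / len(order)

def best_schedule(tasks, target_mean):
    """Feasible order with minimum weighted delay, or None if none exists."""
    feasible = [o for o in permutations(tasks) if evaluate(o)[1] <= target_mean]
    return min(feasible, key=lambda o: evaluate(o)[0]) if feasible else None

tasks = [(1, 6.0), (2, 3.0), (4, 0.5), (4, 1.0)]

for target in (9.0, 8.0, 4.0):
    print(target, best_schedule(tasks, target))
```

With these numbers the priority-first order meets a target of 9.0 hours unchanged. Tightening the target to 8.0 forces the two small P4 tasks ahead of the P2 task, and a target of 4.0 is infeasible because even pure SPT averages 4.25.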

**Theorem 12 (Bounded Metric Cost of Priority Scheduling).** For a
schedule of $n$ tasks that orders priority classes first and uses SPT
within each class, the gap between its unweighted mean
$\bar{C}_{\text{priority}}$ and the SPT-optimal unweighted mean
$\bar{C}_{\text{SPT}}$ satisfies:

$$\bar{C}_{\text{priority}} - \bar{C}_{\text{SPT}} = \frac{1}{n} \sum_{(i,j) \in I} (p_i - p_j) \le \frac{n-1}{2}\,(p_{\max} - p_{\min})$$

where $I$ is the set of cross-class inversions (pairs in which task $i$
runs before task $j$ because of its higher priority even though
$p_i > p_j$), and $p_{\max}$ and $p_{\min}$ are the largest and smallest
processing times in the queue.

**Proof sketch.** The gap arises entirely from the cross-class ordering:
within each priority class, the manager can use SPT (shortest first) at
no priority cost, since all tasks in the class have equal priority. The
only deviation from global SPT is the *between-class* ordering, where
large high-priority tasks are placed before small low-priority tasks.
The sum of completion times equals $\sum_i p_i$ plus, for every pair of
tasks, the processing time of whichever task in the pair runs first. SPT
always runs the shorter task of a pair first, so each inversion adds
exactly $p_i - p_j$ to the unweighted sum. There are at most
$n(n-1)/2$ pairs, each costing at most $p_{\max} - p_{\min}$; dividing by
$n$ gives the bound.
$\blacksquare$

In practice, this means: **a manager who uses SPT within each priority
class and priority ordering between classes pays a metric cost only for
cross-class inversions.** The bound is loose: the actual gap is small
when high-priority tasks are few or not much longer than the
low-priority tasks they precede, and the schedule respects the priority
system entirely.
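The inversion argument in the proof sketch can be checked numerically. The sketch below uses an invented queue of `(q, p)` pairs and compares the measured gap against the summed cost of every pair in which a longer task runs before a shorter one; it also confirms that, with SPT inside each class, all such pairs span two different classes.

```python
from itertools import accumulate, combinations

def mean_completion(times):
    """Unweighted mean completion time for tasks run in the given order."""
    return sum(accumulate(times)) / len(times)

# Hypothetical queue: (priority class q, processing time p in hours).
tasks = [(1, 6.0), (1, 2.0), (2, 3.0), (3, 0.5), (3, 4.0), (4, 1.0)]

# Priority ordering between classes, SPT within each class.
priority_first = sorted(tasks, key=lambda t: (t[0], t[1]))
spt = sorted(tasks, key=lambda t: t[1])

gap = (mean_completion([p for _, p in priority_first])
       - mean_completion([p for _, p in spt]))

# Pairs (a, b) where a runs before b although a is longer.
inversions = [(a, b) for a, b in combinations(priority_first, 2) if a[1] > b[1]]
inversion_cost = sum(a[1] - b[1] for a, b in inversions)
cross_class_only = all(a[0] != b[0] for a, b in inversions)

print(f"gap = {gap}, inversion cost / n = {inversion_cost / len(tasks)}")
```

For this queue the two quantities agree, and every inversion is between classes, so within-class SPT contributes nothing to the gap.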

### 12.3 Why This Works: The Manager as Information Barrier

The strategy works because the manager serves as an **information
barrier** between the metric and the team:

| Layer | Sees the metric | Sees the priorities | Sees the proof |
|-------|----------------|--------------------|-----------------|
| Organization | Yes | Nominally | No |
| Manager | Yes | Yes | **Yes** |
| Team | No (shielded) | Yes | Irrelevant |
| Client | Yes (dashboard) | Via SLA | No |

The manager is the only actor who holds all three pieces of information.
By internalizing the proof, the manager can:

- Present a metric that satisfies organizational reporting (the number
  is reasonable)
- Direct the team by priority (professional judgment preserved)
- Shield the team from the metric's perverse incentives (Appendix B
  costs avoided)

This is *not* manipulation. The manager is not fabricating numbers or
misreporting. They are doing the right work in the right order, and
the metric happens to be acceptable because within-class SPT is free
and between-class inversions are bounded (Theorem 12).

### 12.4 The Competitive Breakdown

This strategy fails when the metric becomes **competitive between teams**.

Model $m$ teams, each managed independently. Team $j$ reports
$\bar{C}_j(\sigma_j)$. If teams are ranked, rewarded, or compared on
$\bar{C}$:

**Case 1: Cooperative** — Teams are measured for parity, not ranking.
The threshold is "stay within a reasonable band." Each manager
independently uses the internalization strategy. All teams do
approximately the right work. The metric is decorative but harmless.
This is a **coordination game** with a stable cooperative equilibrium.

**Case 2: Competitive** — Teams are ranked by $\bar{C}$. Promotions,
resources, or recognition go to the lowest average. This is a
**prisoner's dilemma**:

| | Team B: Priority-first | Team B: SPT |
|---|---|---|
| **Team A: Priority-first** | (Good work, Good work) | (A looks bad, B looks good) |
| **Team A: SPT** | (A looks good, B looks bad) | (Both look good, both do wrong work) |

The dominant strategy for each team is SPT. The Nash equilibrium is
(SPT, SPT) — all teams optimize the metric, all teams do the wrong
work, and the organization reports excellent numbers while critical
tasks rot across every queue.

The internalization strategy is a **cooperative equilibrium that is not
stable under competition**. A single team that defects to pure SPT will
outperform all others on the metric, forcing other managers to choose
between doing the right work (and looking bad) or following suit (and
abandoning their professional judgment).
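The equilibrium claim can be verified mechanically once the matrix is given numeric payoffs. The numbers below are hypothetical ordinal payoffs chosen only to respect the ranking implied by the table (looking good while the rival looks bad is best, looking bad alone is worst); they are not derived from the model.

```python
from itertools import product

STRATEGIES = ("priority", "spt")

# Hypothetical ordinal payoffs (team A, team B); higher is better.
PAYOFFS = {
    ("priority", "priority"): (3, 3),  # both do good work
    ("priority", "spt"):      (1, 4),  # A looks bad, B looks good
    ("spt", "priority"):      (4, 1),  # A looks good, B looks bad
    ("spt", "spt"):           (2, 2),  # both look good, both do wrong work
}

def is_nash(a, b):
    """True if neither team gains by unilaterally switching strategy."""
    a_ok = all(PAYOFFS[(a, b)][0] >= PAYOFFS[(alt, b)][0] for alt in STRATEGIES)
    b_ok = all(PAYOFFS[(a, b)][1] >= PAYOFFS[(a, alt)][1] for alt in STRATEGIES)
    return a_ok and b_ok

def is_dominant_for_a(strategy):
    """True if the strategy is a best response for A whatever B plays."""
    return all(PAYOFFS[(strategy, b)][0] >= PAYOFFS[(alt, b)][0]
               for b in STRATEGIES for alt in STRATEGIES)

equilibria = [pair for pair in product(STRATEGIES, repeat=2) if is_nash(*pair)]
print(equilibria)
```

Under this payoff ordering the only equilibrium is mutual SPT, even though both teams prefer the outcome where both schedule by priority.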

### 12.5 The Scope of the Solution

| Condition | Strategy viability |
|-----------|-------------------|
| Metric used for health-check / parity | **Viable** — cooperative equilibrium holds |
| Metric visible but not ranked | **Viable** — no competitive pressure to defect |
| Metric ranked across teams | **Fragile** — viable only if all managers cooperate |
| Metric tied to compensation / resources | **Not viable** — prisoner's dilemma dominates |
| Metric reform possible at org level | **Unnecessary** — fix the metric instead |

The internalization strategy is actionable *right now*, by a single
manager, without organizational permission or metric reform. It
preserves team psychology (Appendix B), respects priorities (Sections
9-10), and produces an acceptable reported metric (Theorem 12).

Its limitation is structural: it requires the metric to be a
**reporting formality**, not a **competitive instrument**. The moment
the metric drives resource allocation or team ranking, the cooperative
equilibrium collapses and only organizational reform — replacing the
metric with a priority-weighted alternative (Section 10) — can prevent
the race to the bottom.

**The best solution is company-wide. The actionable solution is a
manager who understands this proof, shields their team from the metric,
schedules by priority, and uses SPT only within priority classes to
keep the number reasonable.**

---

## 13. Conclusion

The unweighted average completion time is a **biased statistic** that:

1. **Can be gamed** by scheduling policy (Theorem 1), unlike work-weighted
   completion time which is schedule-invariant (Theorem 2).
2. **Incentivizes starvation** of large tasks (Theorem 3).
3. **Contradicts Little's Law** unless tasks are uniformly sized.
4. **Degrades client satisfaction** with zero compensating productivity
   gain (Theorem 7).
5. **Actively contradicts priority systems** by carrying zero information
   about business-impact classification (Theorem 9).
6. **Ignores priority entirely** in its scheduling recommendation,
   producing suboptimal priority-weighted delay whenever priority and
   size are not perfectly inversely correlated (Theorem 10).

A metric that can be improved by reordering work, without doing any
additional work, is measuring the scheduling policy, not the system's
capacity or effectiveness. When combined with a priority system, the metric
does not merely fail to reflect priorities: whenever the highest-priority
tasks are also the largest, it recommends the schedule that delays them the
most.
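The contrast between the two statistics from Section 1 can be seen by evaluating both over every possible order of a small task set. This is a minimal sketch with invented processing times; the function names are illustrative.

```python
from itertools import accumulate, permutations

def unweighted_mean(order):
    """Mean of the completion times for tasks run in the given order."""
    return sum(accumulate(order)) / len(order)

def work_weighted_mean(order):
    """Completion times weighted by each task's own processing time."""
    completions = accumulate(order)
    return sum(p * c for p, c in zip(order, completions)) / sum(order)

times = [0.5, 1.0, 2.0, 8.0]  # hypothetical processing times in hours

unweighted = {unweighted_mean(o) for o in permutations(times)}
weighted = {round(work_weighted_mean(o), 9) for o in permutations(times)}

print(f"unweighted mean ranges from {min(unweighted)} to {max(unweighted)}")
print(f"work-weighted mean takes {len(weighted)} distinct value(s)")
```

Reordering alone moves the unweighted mean across a wide range, while the work-weighted mean stays at a single value.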

The unweighted mean is defensible only under narrow, identifiable conditions
(Section 11.5): uniform task sizes, no priority system, one-to-one
client-task mapping, and no behavioral influence from the metric. These
conditions are rarely met in practice.

**Unweighted average completion time is not a fair or accurate measurement
of task execution performance. Its adoption as a team metric will
rationally produce starvation of complex work, violation of stated
priorities, inequitable client outcomes, and the illusion of productivity
where none exists.**

---

## Appendix A. When the Metric Is the Product

The preceding sections rest on an implicit assumption: that client
satisfaction is a function of *experienced service quality* — how long
*their* task took, relative to its size and urgency. If this assumption
holds, the proof is valid and the unweighted mean is a destructive metric.

But there exists a scenario in which the assumption fails and the entire
argument collapses.

### A.1 The Self-Referential Metric

Suppose the service provider reports the unweighted mean completion time
directly to the client — on a dashboard, in an SLA report, on a marketing
page — and the client's satisfaction is derived primarily from *that number*
rather than from their individual experience.

Define client satisfaction as:

$$U_{\text{client}} = f\!\left(\bar{C}(\sigma)\right), \quad f' < 0$$

That is: the client sees "Average resolution time: 6.56 hours" and is
satisfied, without checking whether *their* ticket — the critical email
outage — took 6.56 hours or 18.75 hours.

Under this model, SPT genuinely maximizes client satisfaction (Theorem 1).
The service provider's throughput is unchanged (Theorem 6). The business
outcome improves: same work done, happier client.

**Every theorem in this paper remains mathematically correct. But the
conclusion inverts.** The metric is no longer a proxy for service quality
that can be gamed — it *is* the service quality, because the client has
agreed to evaluate quality by the aggregate number rather than by their
individual experience.

### A.2 The Economics

This creates a coherent, stable business equilibrium:

| Actor | Behavior | Outcome |
|-------|----------|---------|
| Provider | Optimizes unweighted mean (SPT) | Metric improves, no extra work |
| Client | Reads dashboard, sees low average | Reports satisfaction |
| Management | Sees satisfied client + good metric | Rewards team |

Throughput is unchanged (Theorem 6), so the same revenue-generating work
is completed. The only thing that changed is the *order* — and therefore
the reported number. Real resources were rearranged, no additional value
was created, but the business metrics all moved in the right direction.

This is *profitable*. The provider extracts satisfaction from the client
at zero marginal cost, by optimizing a number that the client has accepted
as a proxy for quality. The client is no worse off *in their own estimation*,
because they evaluate the aggregate, not their individual experience.

### A.3 The Fragility

This equilibrium is stable only as long as the client never inspects
their own experience. It breaks the moment any of the following occur:

**1. The client checks their own ticket.**

A CTO whose email server was down for 18.75 hours will not be reassured
by a dashboard reading "Average resolution: 6.56 hours." The aggregate
metric and the individual experience diverge maximally for high-priority
tasks (Theorem 4). The clients most likely to inspect their own experience
are exactly the ones receiving the worst service.

**2. A competitor offers per-ticket SLAs.**

If an alternative provider guarantees "P1 incidents resolved within 4 hours"
instead of "average resolution under 7 hours," the aggregate-metric provider
cannot compete for clients with critical needs — which are typically the
highest-value clients.

**3. The provider's team internalizes the metric.**

If the team believes the metric reflects real performance (rather than
consciously gaming it), they lose the ability to recognize when critical
work is being neglected. The metric becomes an epistemic hazard: it
tells the team they are performing well, preventing them from seeing that
they are not.

### A.4 The General Pattern

This is not unique to task scheduling. The structure is:

1. A measurable proxy is established for an unmeasured quality.
2. The proxy is reported as if it were the quality itself.
3. The proxy is optimized, improving the reported number.
4. The underlying quality diverges from the proxy, but no one measures
   the underlying quality because the proxy exists.
5. The system is stable until an exogenous shock forces inspection of
   the underlying quality.

This pattern appears across domains:

| Domain | Proxy metric | Underlying quality | Divergence |
|--------|-------------|-------------------|------------|
| IT support | Avg. resolution time | Critical system uptime | Server down for 19 hrs, avg says 6.5 |
| Education | Standardized test scores | Actual learning | Teaching to the test, understanding declines |
| Healthcare | Patient throughput | Patient outcomes | Faster discharges, higher readmission rates |
| Finance | Quarterly earnings | Long-term value creation | Cost-cutting inflates EPS, erodes capability |
| Software | Velocity (story points) | Deliverable product quality | Point inflation, features half-finished |

In each case, the proxy is optimized, the number improves, and the system
*functions* — profitably, even — until the moment the underlying quality
is tested by reality.

### A.5 A Mathematical Note on Equilibrium Stability

Model the system as a game between provider (P) and client (C).

**Information structure:**
- P observes individual completion times $\{C_i\}$ and chooses schedule $\sigma$
- C observes only the reported aggregate $\bar{C}(\sigma)$

**Payoffs:**
- P's payoff increases with C's satisfaction and is independent of schedule
  (throughput is invariant)
- C's *reported* satisfaction $U_C = f(\bar{C})$ is maximized by SPT
- C's *actual* welfare (if they could observe it) depends on individual
  $C_i$ values, especially for high-priority tasks

This is a **moral hazard** problem. P takes an action (the schedule
$\sigma$, and with it the distribution of $C_i$) that C cannot observe.
P's optimal strategy is to minimize the observable signal ($\bar{C}$)
regardless of the unobservable distribution, which is exactly SPT.

The equilibrium is a **pooling equilibrium**: P's schedule looks identical
to the client regardless of the underlying priority-weighted performance.
A provider with PWCT = 10.2 and a provider with PWCT = 10.167 both report
$\bar{C} = 6.56$ under SPT. The client cannot distinguish between them.
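The pooling effect can be reproduced with a toy example. The sketch below assumes a priority-weighted mean of the form $\sum w(q_i) C_i / \sum w(q_i)$ with hypothetical weights `WEIGHTS`; this may not match the exact PWCT definition used earlier, and the two task lists are invented.

```python
from itertools import accumulate

WEIGHTS = {1: 8.0, 2: 4.0, 3: 2.0, 4: 1.0}  # hypothetical w(q)

def spt_report(tasks):
    """Return (unweighted mean, priority-weighted mean) under SPT.

    Each task is a (priority class q, processing time p) pair.
    """
    ordered = sorted(tasks, key=lambda t: t[1])
    completions = list(accumulate(p for _, p in ordered))
    unweighted = sum(completions) / len(completions)
    total_weight = sum(WEIGHTS[q] for q, _ in ordered)
    weighted = sum(WEIGHTS[q] * c
                   for (q, _), c in zip(ordered, completions)) / total_weight
    return unweighted, weighted

# Same processing times, different priority assignments.
provider_a = [(1, 8.0), (3, 2.0), (4, 0.5)]  # critical task is the longest
provider_b = [(4, 8.0), (3, 2.0), (1, 0.5)]  # critical task is the shortest

report_a = spt_report(provider_a)
report_b = spt_report(provider_b)
print(report_a, report_b)
```

Both providers publish the same unweighted mean, yet the priority-weighted figure is several times worse for the provider whose critical task is the long one. A client who sees only the aggregate cannot tell them apart.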

This equilibrium is stable under the standard game-theoretic condition:
**C has no incentive to deviate** (they have no better information source)
and **P has no incentive to deviate** (any other schedule worsens $\bar{C}$
with zero throughput benefit).

It is *unstable* under **information revelation**: if C obtains access to
individual $C_i$ values (via a customer portal, a competing vendor's
transparency, or a sufficiently painful incident), the pooling equilibrium
collapses and C's evaluation shifts to the underlying quality.

### A.6 The Uncomfortable Conclusion

The honest answer to "does optimizing the unweighted mean hurt the
business?" is: **not necessarily, as long as the client never looks
behind the number**.

The honest answer to "does it hurt the client?" is: **only when they
have a problem large enough to notice** — which is precisely when the
metric's distortion is largest (Theorem 4).

The honest answer to "is this sustainable?" is: it is exactly as
sustainable as any system in which the seller knows more than the buyer.
Such systems are historically stable for extended periods and then
collapse rapidly when the information asymmetry is punctured — by a
crisis, a competitor, or a regulator.

The mathematical structure is clear: the unweighted mean creates an
information asymmetry between the metric and the reality. Optimizing
the metric under this asymmetry is *locally rational* for the provider,
*locally satisfying* for the uninspecting client, and *globally fragile*
for the relationship.

Whether one calls this "efficient market behavior" or "a dystopian
consequence of optimizing legible numbers over illegible reality" is not
a mathematical question. The math says only this: **the incentive exists,
the equilibrium is real, and it holds until it doesn't.**

---

## Appendix B. The Psychological Cost of Knowing

Appendix A modeled the provider as a unitary rational actor — "the team"
optimizes the metric. But teams are composed of individuals, and those
individuals have their own utility functions. When a team member
understands the proof — when they *know* the metric is synthetic, that
the dashboard is theater, that the email server is still down while they
close wallpaper tickets — a new cost appears that the equilibrium model
did not account for.

### B.1 The Hidden Variable: Team Awareness

Appendix A's game has three actors: provider, client, management. But the
provider is not monolithic. Decompose it:

- **Management (M):** sets the metric, evaluates the team, reports to client
- **Team member (T):** executes the work, observes individual task states
- **Client (C):** observes only the reported aggregate

The information structure changes:

| Actor | Observes individual $C_i$ | Observes aggregate $\bar{C}$ | Understands the proof |
|-------|--------------------------|-----------------------------|-----------------------|
| M | Possibly | Yes | Varies |
| T | **Yes** | Yes | **Yes** (in this scenario) |
| C | No | Yes | No |

The team member has *full information*. They see the ticket queue. They
know the email server has been down since 7 AM. They know they are closing
a wallpaper ticket because it will improve the number. And they know *why*
this is happening — not from vague discomfort, but from a precise
mathematical understanding that the metric rewards this behavior.

### B.2 Cognitive Dissonance Under Full Information

Cognitive dissonance (Festinger, 1957) arises when an individual holds
two contradictory cognitions simultaneously. The standard resolution is
to modify one cognition to reduce the conflict.

A team member operating under the synthetic metric holds:

- **Cognition A:** "I am a competent professional. My job is to solve
  important problems for clients."
- **Cognition B:** "I am closing a wallpaper ticket while the email
  server is down, because it makes the number look better."

In the absence of understanding *why*, Cognition B can be rationalized:
"management knows best," "maybe there's a reason," "the system works
overall." This is uncomfortable but tolerable — the ambiguity provides
cognitive cover.

**Understanding the proof removes the ambiguity entirely.** The team
member now holds:

- **Cognition A:** Same as above.
- **Cognition B':** "I am closing a wallpaper ticket while the email
  server is down, because the metric is mathematically biased toward
  small tasks (Theorem 1), the reordering produces zero additional
  throughput (Theorem 6), and the only beneficiary is the dashboard
  (Appendix A). I can prove this."

B' is strictly harder to rationalize than B. The team member cannot
retreat into uncertainty because they possess the proof. The dissonance
is now *load-bearing*: it must be resolved, and the available resolutions
are:

1. **Reject Cognition A** — "I am not here to solve important problems;
   I am here to move numbers." This is psychologically costly. It
   requires abandoning professional identity.

2. **Reject Cognition B'** — "The proof must be wrong, or doesn't apply
   here." This is intellectually costly. The proof is simple enough to
   verify, and the IT example maps directly to their daily experience.

3. **Change the situation** — advocate for better metrics, refuse to
   cherry-pick, escalate. This is *professionally* costly in an
   environment that rewards the metric.

4. **Leave** — resolve the dissonance by exiting the system entirely.

None of these resolutions are free. Each one imposes a cost on the team
member that did not exist before they understood the proof — and *none of
them appear in the business equilibrium model of Appendix A*.

### B.3 Self-Determination Theory: Three Needs Violated

Deci and Ryan's Self-Determination Theory (1985, 2000) identifies three
innate psychological needs whose satisfaction predicts intrinsic motivation,
job satisfaction, and well-being:

**1. Autonomy** — the need to feel volitional control over one's actions.

A team member who understands the proof knows that the metric constrains
their choices in a way that is mathematically suboptimal for the client.
Their scheduling decisions are not autonomous expressions of professional
judgment; they are coerced responses to a flawed incentive. The *knowledge*
of the coercion — not just the coercion itself — is what damages autonomy.
A worker who doesn't understand why they're doing something can still feel
autonomous ("I'm choosing to follow the process"). A worker who understands
that the process is provably counterproductive cannot.

**2. Competence** — the need to feel effective at meaningful tasks.

The proof demonstrates that the metric rewards *apparent* effectiveness
(low $\bar{C}$) while being invariant to *actual* effectiveness (throughput,
Theorem 6). A team member who understands this knows that the metric
cannot distinguish between a competent team and an incompetent one that
happens to cherry-pick small tasks. Their competence is invisible to the
measurement system. Worse: genuine competence — choosing to fix the email
server first — is *punished* by the metric ($\bar{C}$ increases from 6.56
to 13.63 in the IT example).

When a measurement system punishes competent decisions and rewards
incompetent ones, and the team member *knows this*, the need for
competence is not merely unsatisfied — it is actively contradicted.

**3. Relatedness** — the need to feel connected to others and to
contribute to something meaningful.

The team member knows the client's email server is down. They know the
client is suffering. They know they could help. They are instead updating
a wallpaper policy — not because it helps anyone, but because it helps
a number. The connection between the team member's work and the client's
well-being has been severed by the metric, and the team member *can see
the severed ends*.

### B.4 Moral Injury

The concept of moral injury (Shay, 1994; Litz et al., 2009) was developed
in military psychology to describe the lasting harm caused by
"perpetrating, failing to prevent, bearing witness to, or learning about
acts that transgress deeply held moral beliefs." It has since been applied
to healthcare workers, first responders, and — increasingly — to
knowledge workers in bureaucratic systems.

The key distinction from burnout: **burnout is exhaustion from doing too
much. Moral injury is damage from doing the wrong thing, or being
prevented from doing the right thing.**

A team member who:
- Knows the email server is down (witnessing the harm)
- Knows they should fix it (moral belief about professional duty)
- Closes a wallpaper ticket instead (transgressing that belief)
- Does so because the metric requires it (institutional causation)

...is experiencing the structural conditions for moral injury. The
proof doesn't cause the injury — the metric does. But the proof
eliminates the psychological buffer of ignorance that would otherwise
mitigate it.

### B.5 Learned Helplessness and Metric Fatalism

Seligman's learned helplessness framework (1967, 1975) describes the
phenomenon where exposure to uncontrollable negative outcomes leads to
passivity even when control becomes available.

The sequence for an aware team member:

1. **Observation:** The metric is flawed (proof understood).
2. **Action:** Advocate for change ("we should use priority-weighted
   metrics").
3. **Outcome:** Rejected ("the client is happy with the current
   dashboard," "this is how we've always measured," "the numbers are
   good, don't rock the boat").
4. **Repetition:** Steps 2-3 repeat, with decreasing conviction.
5. **Helplessness:** "The metric is what it is. I'll just close tickets."

The terminal state — metric fatalism — is characterized by:
- Disengagement from professional judgment ("I just do what the queue
  says")
- Reduced initiative ("why bother triaging if the metric doesn't care?")
- Cynicism toward measurement generally ("all metrics are fake")
- Withdrawal of discretionary effort on complex tasks

This is not laziness. It is the rational psychological response to a
system that punishes correct behavior and rewards incorrect behavior,
when the individual lacks the power to change the system.

### B.6 The Turnover Equation

The costs described in B.2-B.5 are borne by the team member, not the
organization — initially. They become organizational costs through
**turnover**.

Model the team member's stay/leave decision:

$$\text{Stay if: } \quad V_{\text{compensation}} + V_{\text{intrinsic}} > V_{\text{outside option}}$$

The synthetic metric degrades $V_{\text{intrinsic}}$ through each of the
mechanisms described above:

| Mechanism | Component degraded | Effect on $V_{\text{intrinsic}}$ |
|-----------|-------------------|----------------------------------|
| Cognitive dissonance (B.2) | Psychological comfort | Decreased |
| Autonomy violation (B.3.1) | Sense of agency | Decreased |
| Competence contradiction (B.3.2) | Professional identity | Decreased |
| Relatedness severance (B.3.3) | Sense of purpose | Decreased |
| Moral injury (B.4) | Ethical well-being | Decreased |
| Learned helplessness (B.5) | Belief in efficacy | Decreased |

As $V_{\text{intrinsic}}$ decreases, the organization must increase
$V_{\text{compensation}}$ to retain the team member, or accept their
departure.
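The stay/leave rule and the compensation needed to offset a loss of intrinsic value can be written out directly. This is a sketch only: the values are in arbitrary units and entirely hypothetical, and the function names are illustrative.

```python
def stays(v_compensation, v_intrinsic, v_outside):
    """Stay/leave rule: stay only if total value exceeds the outside option."""
    return v_compensation + v_intrinsic > v_outside

def break_even_compensation(v_intrinsic, v_outside):
    """Compensation at which the member is exactly indifferent."""
    return v_outside - v_intrinsic

# Hypothetical values in arbitrary units.
v_outside = 120
before = stays(v_compensation=100, v_intrinsic=30, v_outside=v_outside)
after = stays(v_compensation=100, v_intrinsic=10, v_outside=v_outside)

print("stays before metric:", before)
print("stays after metric:", after)
print("break-even compensation now:", break_even_compensation(10, v_outside))
```

In this toy case a drop in intrinsic value from 30 to 10 flips the decision to leave, and compensation would have to rise above the break-even level to retain the member.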

Crucially: **the team members most affected are those with the strongest
professional identity and the deepest understanding of the work.** These
are the most competent members — the ones most capable of recognizing the
metric's absurdity, most troubled by it, and most able to find employment
elsewhere. The metric selects for the departure of the team's best people.

### B.7 The Adversarial Selection Spiral

Combining Appendix A's equilibrium with the turnover dynamic:

1. Organization adopts unweighted mean completion time.
2. Metric looks good (SPT). Client is satisfied (Appendix A). Management
   is satisfied.
3. Aware, competent team members experience psychological costs (B.2–B.5).
4. Those members leave. They are replaced by members who either:
   (a) do not understand the metric's flaws (less competent), or
   (b) do not care (less engaged).
5. The metric continues to look good — it always does under SPT,
   regardless of team competence (Theorem 6, Corollary 6.1).
6. Actual service quality degrades (less competent team), but the metric
   cannot detect this (Theorem 9, Corollary 9.1).
7. Return to step 2.

This is an **adversarial selection spiral**: the metric selects *against*
the people who would improve the system and *for* the people who will not
challenge it. The system stabilizes at a lower level of actual competence,
invisible to its own measurement apparatus, staffed by people who have
made peace with — or are unaware of — the gap between the number and the
reality.

The dashboard still looks good.
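The loop in steps 2 through 7 can be made concrete with a small deterministic simulation. Everything here is an illustrative assumption rather than a result of this paper: awareness is taken to equal competence, intrinsic value erodes each round in proportion to competence, a member leaves once intrinsic value reaches zero, and every replacement arrives with a fixed lower competence. The metric is computed from processing times alone, so its independence from competence is built in to mirror step 5, not demonstrated.

```python
from itertools import accumulate


def spt_mean_completion(processing_times):
    """Unweighted mean completion time when tasks run shortest-first (SPT)."""
    completions = list(accumulate(sorted(processing_times)))
    return sum(completions) / len(completions)


def simulate_spiral(competences, tasks, rounds, erosion=1.0,
                    base_intrinsic=3.0, replacement_competence=0.25):
    """Return one (metric, mean competence) pair per round.

    All parameters are hypothetical knobs chosen for illustration.
    """
    team = [{"competence": c, "intrinsic": base_intrinsic} for c in competences]
    history = []
    for _ in range(rounds):
        # Step 3: psychological cost scales with competence (assumed = awareness).
        for member in team:
            member["intrinsic"] -= erosion * member["competence"]
        # Step 4: members with no intrinsic value left depart and are
        # replaced by a less competent newcomer.
        team = [
            member if member["intrinsic"] > 0
            else {"competence": replacement_competence, "intrinsic": base_intrinsic}
            for member in team
        ]
        mean_competence = sum(m["competence"] for m in team) / len(team)
        # Steps 5-6: the reported metric depends only on the task list.
        history.append((spt_mean_completion(tasks), mean_competence))
    return history


history = simulate_spiral([1.0, 0.75, 0.5], tasks=[8, 1, 4, 2], rounds=6)
for round_number, (metric, competence) in enumerate(history, start=1):
    print(f"round {round_number}: metric={metric:.2f} mean competence={competence:.2f}")
```

With these inputs the reported metric is 6.5 in every round, while mean competence falls from 0.75 to 0.25 as the three original members leave in order of competence, most competent first.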

### B.8 The Complete Cost Model

Appendix A concluded that the synthetic-metric equilibrium is stable and
profitable. Appendix B reveals the hidden costs that model omitted:

| Appendix A (visible) | Appendix B (hidden) |
|---------------------|---------------------|
| Client satisfied (sees good number) | Team dissatisfied (sees bad reality) |
| Throughput unchanged | Discretionary effort withdrawn |
| Metric improves | Competent members leave |
| Business equilibrium stable | Institutional competence degrades |
| Zero marginal cost | Replacement/training costs accumulate |

The business equilibrium of Appendix A is real. The psychological costs
of Appendix B are also real. They operate on different timescales:
the equilibrium is visible quarterly; the competence degradation is
visible over years.

The complete model is neither "the metric works" (Appendix A) nor "the
metric is destructive" (Sections 1–12) alone. It is: **the metric works,
and it is destructive, and the destruction is invisible to the metric.**

An organization can run profitably for an extended period on synthetic
metrics and hollowed-out competence, just as a building can stand for
years with corroded rebar. The metric is the fresh paint. Appendix A
proved the paint is convincing. This appendix merely notes that it is
still paint.

---

## References

### Scheduling Theory

[1] Smith, W. E. (1956). Various optimizers for single-stage production.
*Naval Research Logistics Quarterly*, 3(1–2), 59–66.
doi:[10.1002/nav.3800030106](https://doi.org/10.1002/nav.3800030106)

> Origin of the SPT optimality result (Theorem 1), the weighted completion
> time rule $w_i/p_i$ descending (WSJF, Theorem 11), and the adjacent-job
> pairwise interchange (exchange argument) proof technique used throughout
> this paper.

[2] Conway, R. W., Maxwell, W. L., & Miller, L. W. (1967). *Theory of
Scheduling*. Addison-Wesley.

> Comprehensive treatment of single-machine and multi-machine scheduling
> theory, extending Smith's results. Standard textbook reference for the
> exchange argument and its generalizations.

[3] Little, J. D. C. (1961). A proof for the queuing formula: L = λW.
*Operations Research*, 9(3), 383–387.
doi:[10.1287/opre.9.3.383](https://doi.org/10.1287/opre.9.3.383)

> First rigorous proof of Little's Law, referenced in Section 5. The
> result was known informally before 1961; this paper provided the
> general proof requiring only stationarity and finite expectations.

[4] Little, J. D. C. (2011). Little's Law as viewed on its 50th
anniversary. *Operations Research*, 59(3), 536–549.
doi:[10.1287/opre.1110.0941](https://doi.org/10.1287/opre.1110.0941)

> Retrospective discussing the law's scope, limitations, and
> common misapplications — including the batch-case subtleties
> noted in Section 5 of this paper.

[5] Reinertsen, D. G. (2009). *The Principles of Product Development
Flow: Second Generation Lean Product Development*. Celeritas Publishing.
ISBN: 978-1-935401-00-1.

> Popularized the term "Weighted Shortest Job First" (WSJF) and the
> "Cost of Delay divided by Duration" formulation in agile/lean product
> development contexts. The underlying mathematical result is Smith
> (1956) [1].

### Measurement and Incentives

[6] Goodhart, C. A. E. (1984). Problems of monetary management: The
U.K. experience. In C. A. E. Goodhart, *Monetary Theory and Practice:
The UK Experience* (pp. 91–121). Macmillan.

> Source of Goodhart's Law. Original wording: "Any observed statistical
> regularity will tend to collapse once pressure is placed upon it for
> control purposes." First presented as a working paper for the Reserve
> Bank of Australia in 1975.

[7] Strathern, M. (1997). 'Improving ratings': Audit in the British
university system. *European Review*, 5(3), 305–321.
doi:[10.1002/(SICI)1234-981X(199707)5:3<305::AID-EURO184>3.0.CO;2-4](https://doi.org/10.1002/(SICI)1234-981X(199707)5:3%3C305::AID-EURO184%3E3.0.CO;2-4)

> Generalized Goodhart's observation into the form commonly cited today:
> "When a measure becomes a target, it ceases to be a good measure."
> Referenced implicitly in Sections 6, 11.4, and Appendix A.4.

### Behavioral Economics

[8] Kahneman, D., & Tversky, A. (1979). Prospect theory: An analysis of
decision under risk. *Econometrica*, 47(2), 263–292.
doi:[10.2307/1914185](https://doi.org/10.2307/1914185)

> Introduced loss aversion: the finding that losses loom larger than
> equivalent gains in subjective evaluation, by roughly a factor of two
> in later estimates. Referenced in Section 7.4 to argue that the dissatisfaction
> of deprioritized large-task clients outweighs the satisfaction gained
> by small-task clients under SPT.

### Game Theory and Contract Theory

[9] Akerlof, G. A. (1970). The market for "lemons": Quality uncertainty
and the market mechanism. *The Quarterly Journal of Economics*, 84(3),
488–500. doi:[10.2307/1879431](https://doi.org/10.2307/1879431)

> Foundational model of information asymmetry and adverse selection.
> The pooling equilibrium described in Appendix A.5 — where the client
> cannot distinguish high-quality from low-quality service because both
> produce the same aggregate metric — is structurally analogous to
> Akerlof's lemons problem.

[10] Hölmstrom, B. (1979). Moral hazard and observability. *The Bell
Journal of Economics*, 10(1), 74–91.
doi:[10.2307/3003320](https://doi.org/10.2307/3003320)

> Formal treatment of moral hazard — the problem arising when an agent's
> actions are not fully observable by the principal. The metric-reporting
> scenario in Appendix A.5 is a moral hazard problem: the provider
> (agent) chooses the schedule, but the client (principal) observes only
> the aggregate outcome.

### Psychology

[11] Festinger, L. (1957). *A Theory of Cognitive Dissonance*. Stanford
University Press. ISBN: 978-0-8047-0131-0.

> Foundational theory of cognitive dissonance. Referenced in Appendix
> B.2: an individual holding contradictory cognitions experiences
> psychological discomfort and is motivated to reduce the contradiction.
> The proof eliminates the ambiguity that would normally allow
> rationalization, making the dissonance load-bearing.

[12] Deci, E. L., & Ryan, R. M. (1985). *Intrinsic Motivation and
Self-Determination in Human Behavior*. Plenum Press.
ISBN: 978-0-306-42022-1.

> Original book-length treatment of Self-Determination Theory,
> identifying autonomy, competence, and relatedness as innate
> psychological needs. Referenced in Appendix B.3.

[13] Ryan, R. M., & Deci, E. L. (2000). Self-determination theory and
the facilitation of intrinsic motivation, social development, and
well-being. *American Psychologist*, 55(1), 68–78.
doi:[10.1037/0003-066X.55.1.68](https://doi.org/10.1037/0003-066X.55.1.68)

> Overview and update of Self-Determination Theory, linking need
> satisfaction to intrinsic motivation, job satisfaction, and
> psychological well-being. The three-need framework (autonomy,
> competence, relatedness) is applied in Appendix B.3.

[14] Seligman, M. E. P., & Maier, S. F. (1967). Failure to escape
traumatic shock. *Journal of Experimental Psychology*, 74(1), 1–9.
doi:[10.1037/h0024514](https://doi.org/10.1037/h0024514)

> Original experimental demonstration of learned helplessness.
> Referenced in Appendix B.5: repeated exposure to uncontrollable
> outcomes (failed advocacy for better metrics) produces passivity and
> disengagement.

[15] Seligman, M. E. P. (1975). *Helplessness: On Depression,
Development, and Death*. W. H. Freeman.
ISBN: 978-0-7167-0752-3.

> Extended treatment connecting learned helplessness to human depression
> and institutional behavior. The concept of "metric fatalism" described
> in Appendix B.5 is a domain-specific instance of learned helplessness
> in organizational settings.

[16] Shay, J. (1994). *Achilles in Vietnam: Combat Trauma and the
Undoing of Character*. Atheneum / Simon & Schuster.
ISBN: 978-0-689-12182-3.

> Introduced the concept of moral injury through analysis of Vietnam
> combat veterans' experiences, drawing parallels to Homer's *Iliad*.
> Defined moral injury as arising from a betrayal of "what's right" by
> someone in legitimate authority in a high-stakes situation. Referenced
> in Appendix B.4.

[17] Litz, B. T., Stein, N., Delaney, E., Lebowitz, L., Nash, W. P.,
Silva, C., & Maguen, S. (2009). Moral injury and moral repair in war
veterans: A preliminary model and intervention strategy. *Clinical
Psychology Review*, 29(8), 695–706.
doi:[10.1016/j.cpr.2009.07.003](https://doi.org/10.1016/j.cpr.2009.07.003)

> Formalized moral injury as a clinical construct and proposed a
> treatment model. Defined moral injury as resulting from "perpetrating,
> failing to prevent, bearing witness to, or learning about acts that
> transgress deeply held moral beliefs and expectations." This definition
> is quoted in Appendix B.4 and applied to knowledge workers operating
> under synthetic metrics.

---

*This proof was developed conversationally and formalized on 2026-03-28.*