Cost model:
- Marginal billing: only charge for watts above idle
- Dedicated billing: charge for all uptime (optional)
- Labor rate: $/hr for operator time, manually logged
- Profit margin: percentage markup on electricity cost
- All parameters adjustable live via POST /config
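
A minimal sketch of how the two billing modes and the total could be computed. All names here (`CostConfig`, `electricity_cost`, `total_owed`) are hypothetical, and the dedicated-mode formula (idle draw while waiting, load draw while inferring) is an assumption rather than the gateway's confirmed logic:

```python
from dataclasses import dataclass


@dataclass
class CostConfig:
    # Field names are illustrative assumptions, not the gateway's real config keys.
    electricity_rate: float  # $ per kWh
    idle_watts: float        # draw when no inference is running
    load_watts: float        # draw during inference
    labor_rate: float        # $ per operator hour
    margin_pct: float        # percentage markup on electricity cost
    dedicated: bool = False  # True: bill all uptime, not just watts above idle


def electricity_cost(cfg: CostConfig, inference_hours: float, uptime_hours: float) -> float:
    if cfg.dedicated:
        # Assumed dedicated model: idle draw while waiting plus load draw while inferring.
        idle_hours = uptime_hours - inference_hours
        watt_hours = cfg.idle_watts * idle_hours + cfg.load_watts * inference_hours
    else:
        # Marginal model: only the watts above idle, only while inferring.
        watt_hours = (cfg.load_watts - cfg.idle_watts) * inference_hours
    return watt_hours / 1000.0 * cfg.electricity_rate


def total_owed(cfg: CostConfig, inference_hours: float, uptime_hours: float,
               labor_hours: float) -> float:
    electricity = electricity_cost(cfg, inference_hours, uptime_hours)
    labor = labor_hours * cfg.labor_rate
    margin = electricity * cfg.margin_pct / 100.0
    return electricity + labor + margin
```

The margin is applied to electricity only, matching the "markup on electricity cost" wording above.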
Dashboard shows:
- Cost breakdown with progress bar
- Power model (idle→load for GPU and system)
- Marginal watts per inference call
- Labor hours + labor cost
- Total owed (electricity + labor + margin)
- GPU utilization, temperature, power draw
- Avg cost per request, estimated remaining requests
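
The last two dashboard figures could be derived roughly as follows. This is a sketch that assumes "remaining requests" is measured against a spending cap; the function name and arguments are hypothetical:

```python
import math


def request_estimates(total_cost: float, request_count: int, spending_cap: float):
    """Return (average cost per request, estimated requests left under the cap)."""
    if request_count == 0 or total_cost <= 0:
        # No history yet, so no meaningful average or projection.
        return None, None
    avg = total_cost / request_count
    # Assumption: remaining budget divided by the running average, never negative.
    remaining = max(0, math.floor((spending_cap - total_cost) / avg))
    return avg, remaining
```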
Endpoints:
- GET /config — view current cost config
- POST /config — update any parameter live
- GET /stats — full usage stats + cost config (auth required)
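
A sketch of building requests for these endpoints with the standard library. The base URL, the JSON body keys, and the `Bearer` authorization scheme are assumptions; the notes above state only that /stats requires auth and do not say how the key is sent:

```python
import json
import urllib.request


def config_update_request(base_url: str, changes: dict) -> urllib.request.Request:
    # POST /config with a JSON body of the parameters to change (key names assumed).
    body = json.dumps(changes).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/config",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def stats_request(base_url: str, api_key: str) -> urllib.request.Request:
    # GET /stats; the Bearer scheme is an assumption about how the API key is passed.
    return urllib.request.Request(
        f"{base_url}/stats",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )
```

Passing either object to `urllib.request.urlopen` would send it; the helpers only construct the request, so they can be inspected without a running gateway.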
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
./update-model.sh [url] [name]
Downloads the GGUF file and loads it into Ollama. No remote access needed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Gateway: POST /admin/update-model downloads a new GGUF and reloads the model.
Disabled by default — requires ALLOW_MODEL_UPDATES=true in .env.
Matt controls whether remote model updates are allowed.
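
The opt-in gate might look like the sketch below. Only the variable name ALLOW_MODEL_UPDATES and its default-off behavior come from the notes above; the handler name and the status codes are assumptions:

```python
import os


def model_updates_allowed(env=None) -> bool:
    # Disabled unless ALLOW_MODEL_UPDATES is explicitly set to "true".
    env = os.environ if env is None else env
    return env.get("ALLOW_MODEL_UPDATES", "false").strip().lower() == "true"


def handle_update_model(env=None):
    # Hypothetical handler for POST /admin/update-model; status codes are assumed.
    if not model_updates_allowed(env):
        return 403, "model updates are disabled"
    return 202, "update accepted"
```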
Self-play: --api-key flag for authenticated gateway connections.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Setup script now:
1. Generates API key
2. Starts Docker containers
3. Downloads GGUF from mortdec.ai automatically (~5.3GB)
4. Creates Ollama model with correct chat template
5. Runs test inference
6. Prints connection details for Seth
Matt just runs ./setup.sh — no manual file copying.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- API key auth on all inference endpoints
- Power/cost tracking: GPU TDP × inference time × electricity rate
- Spending cap enforcement
- Web dashboard with live stats
- Docker compose for AMD ROCm (Strix Halo) or NVIDIA
- Auto-setup script with GGUF loading
- Tested against local Ollama
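
A sketch of the tracking formula and cap check listed above, assuming TDP in watts, time in seconds, and the rate in $/kWh. `SpendTracker` and its methods are hypothetical names, and refusing new requests once the cap is reached is an assumed enforcement policy:

```python
def inference_cost(tdp_watts: float, seconds: float, rate_per_kwh: float) -> float:
    # GPU TDP x inference time x electricity rate, converted to kWh.
    return (tdp_watts / 1000.0) * (seconds / 3600.0) * rate_per_kwh


class SpendTracker:
    def __init__(self, cap: float):
        self.cap = cap      # spending cap in dollars
        self.spent = 0.0    # running total in dollars

    def allow(self) -> bool:
        # Assumed policy: refuse new requests once the cap has been reached.
        return self.spent < self.cap

    def record(self, cost: float) -> None:
        self.spent += cost
```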
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>