# AI Hell — Design Spec
## Overview

Passive horror webapp. You open it, and it slowly destroys your comfort. AI-generated imagery shifts between abstract hellscape and recognizable-but-wrong forms (faces, rooms, shapes). Audio is an ambient dread soundscape with AI-voiced whispers cloned from random non-voice audio samples. Escalation is infinite: it never peaks and always gets worse, with randomized timing to prevent habituation. No avatar, no interaction, no UI. You just watch.

Based on the claude-avatar project infrastructure (FastAPI, WebSocket, SDXL Turbo, XTTS v2), but it replaces the 3D avatar with a fullscreen 2D shader compositor.

## Architecture

```
Viewer opens hell.sethpc.xyz
          |
          v
+-------------------+
|  FastAPI Server   |
| (escalation brain)|----> WebSocket: phase updates, audio clips, asset URLs
+---------+---------+
          |
    +-----+------+
    |            |
    v            v
+--------+  +-----------+
|  SDXL  |  |  XTTS v2  |
| Turbo  |  | (voices)  |
| ~5GB   |  |  ~1.5GB   |
+--------+  +-----------+
    |            |
    v            v
Asset Pool   Audio Pool
(on disk)    (on disk)
    |            |
    v            v
+-----------------------------------+
|          Browser (WebGL)          |
| - Shader distortion layer         |
| - Asset compositor (blend/morph)  |
| - Web Audio (ambient + whispers)  |
| - Escalation renderer             |
+-----------------------------------+
```

Server streams assets and commands, not frames. Frontend composites at 60fps.

### Container

Same pattern as claude-avatar: LXC on pve197, V100 32GB GPU passthrough, Caddy reverse proxy at `hell.sethpc.xyz`. Can reuse CT 167 or create a new CT for isolation.

### VRAM Budget

~6.5GB total (SDXL 5GB + XTTS 1.5GB). No MuseTalk/LivePortrait needed.

## Escalation Engine

Server-side brain. Continuous intensity value on a logarithmic curve (natural log):

```
intensity = log(1 + elapsed_seconds * rate)
```

Default `rate = 0.05`. At this rate: intensity 1.0 at ~34s, 2.0 at ~130s, 3.0 at ~6.4min, 4.0 at ~18min. Tunable via config.

Fast early escalation (first minute is dramatic), then slower creep that never stops. 5 minutes and 30 minutes are very different experiences, but neither has peaked.

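
The curve and its inverse can be sketched in a few lines, assuming `log` means the natural logarithm. The function names here are illustrative and not taken from `escalation.py`; the inverse is handy for checking when a given intensity is reached at a given `rate`.

```python
import math

DEFAULT_RATE = 0.05  # default from the spec; tunable via config


def intensity(elapsed_seconds: float, rate: float = DEFAULT_RATE) -> float:
    """Unbounded escalation curve: fast early growth, slow creep later."""
    # log1p(x) computes log(1 + x) accurately for small x.
    return math.log1p(max(0.0, elapsed_seconds) * rate)


def seconds_to_reach(target: float, rate: float = DEFAULT_RATE) -> float:
    """Invert the curve: elapsed seconds at which intensity equals target."""
    # expm1(x) computes exp(x) - 1, the exact inverse of log1p.
    return math.expm1(target) / rate


if __name__ == "__main__":
    for level in (1.0, 2.0, 3.0, 4.0):
        print(f"intensity {level:.1f} reached after {seconds_to_reach(level):.0f}s")
```

Because the curve has no upper bound, anything that maps intensity to a bounded parameter (a shader uniform in 0..1, for example) needs its own clamp or squashing step.
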
### What Intensity Controls

| Parameter | Low (0-1) | Medium (1-3) | High (3+) |
|-----------|-----------|--------------|-----------|
| Image content | Abstract textures, dark gradients | Faces emerge, distorted architecture | Body horror, impossible geometry, rapid cycling |
| Morph speed | Slow crossfades (5-10s) | Moderate blending (2-4s) | Fast cuts, stuttering, strobe |
| Shader severity | Subtle chromatic aberration, slight warping | Visible glitch, color bleeding, pulse | Screen tearing, melt, inversion, tremor |
| Audio | Low drone, silence gaps | Whispers fade in, dissonant tones | Layered voices, rising pitch, sudden stops |
| Voice frequency | Rare (every 60s+) | Occasional (every 20-30s) | Frequent, overlapping, direct address |
| Surprise events | None | Rare (face flash, audio spike) | Unpredictable timing, fake UI glitches |

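
One way to turn the table's columns into code is a small band lookup. This is only a sketch: the table does not say which band the boundary values 1 and 3 belong to, so assigning them to the higher band is an assumption, and `band` and `crossfade_range` are hypothetical names.

```python
from typing import Optional, Tuple

# Crossfade durations in seconds, copied from the "Morph speed" row.
# The High band uses hard cuts, so it has no crossfade range.
CROSSFADE_SECONDS = {
    "low": (5.0, 10.0),
    "medium": (2.0, 4.0),
}


def band(intensity: float) -> str:
    """Map a continuous intensity to the table's Low / Medium / High column."""
    if intensity < 1.0:
        return "low"
    if intensity < 3.0:
        return "medium"
    return "high"


def crossfade_range(intensity: float) -> Optional[Tuple[float, float]]:
    """Return the (min, max) crossfade duration, or None when cuts are hard."""
    return CROSSFADE_SECONDS.get(band(intensity))
```
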
### Randomized Timing

The escalation floor only goes up, but delivery is stochastic:

- **Asset swaps:** random intervals within a range that shrinks with intensity (early: 5-15s gaps, later: 0.5-4s)
- **Silence gaps:** random-length pauses where nothing happens, then something does
- **Cluster bursts:** occasionally stack multiple events close together, then go quiet
- **Voice timing:** Poisson process (exponentially distributed gaps), mean interval decreases with intensity
- **Fake calm:** occasionally intensity *presentation* drops for 10-30s before spiking (the "it stopped... wait" effect)

Predictable timing kills horror. The randomness prevents habituation.

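
Two of these timing rules can be sketched with the standard `random` module. The spec only gives the endpoints of the swap range, so the linear blend that saturates at intensity 4.0 is an assumption, as are the 60s starting mean, the halving per intensity unit, and the 3s floor used for voice timing.

```python
import random
from typing import Tuple


def swap_interval_range(intensity: float) -> Tuple[float, float]:
    """Shrink the asset-swap gap range from 5-15s toward 0.5-4s."""
    # Assumption: linear interpolation, fully shrunk at intensity 4.0.
    t = min(max(intensity / 4.0, 0.0), 1.0)
    low = 5.0 + (0.5 - 5.0) * t
    high = 15.0 + (4.0 - 15.0) * t
    return low, high


def next_swap_delay(intensity: float, rng: random.Random) -> float:
    """Uniform random gap inside the current swap range."""
    low, high = swap_interval_range(intensity)
    return rng.uniform(low, high)


def voice_mean_interval(intensity: float) -> float:
    """Mean seconds between voice events (assumed curve, not from the spec)."""
    return max(3.0, 60.0 * 0.5 ** intensity)


def next_voice_delay(intensity: float, rng: random.Random) -> float:
    """A Poisson process has exponentially distributed gaps between events."""
    return rng.expovariate(1.0 / voice_mean_interval(intensity))
```

Passing an explicit `random.Random` instance keeps the schedule reproducible in tests while staying unpredictable in production, where it can be seeded from the OS.
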
### Phase Updates

Server pushes over WebSocket:

```json
{"type": "phase", "intensity": 2.4, "params": {
  "morph_speed": 0.35,
  "shader_severity": 0.6,
  "palette": "crimson_void"
}}
```

Frontend interpolates between current and target params smoothly, unless the server sends a deliberate hard scare.

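
The interpolation could look roughly like the following. The real frontend would do this in JavaScript; the logic is sketched in Python here to match the server code. Exponential easing, the 1.5s `time_constant`, the `snap` flag for hard scares, and the `step_toward` name are all assumptions.

```python
import math
from typing import Any, Dict


def step_toward(current: Dict[str, Any], target: Dict[str, Any], dt: float,
                time_constant: float = 1.5, snap: bool = False) -> Dict[str, Any]:
    """Advance params one frame toward the latest phase update.

    Numeric params ease exponentially; non-numeric params (such as the
    palette name) switch immediately. snap=True models a hard scare.
    """
    if snap:
        return dict(target)
    # Frame-rate independent easing factor in the range 0..1.
    alpha = 1.0 - math.exp(-dt / time_constant)
    result: Dict[str, Any] = {}
    for key, goal in target.items():
        now = current.get(key, goal)
        if isinstance(goal, (int, float)) and isinstance(now, (int, float)):
            result[key] = now + (goal - now) * alpha
        else:
            result[key] = goal
    return result
```

Called once per frame with `dt` set to the frame time (about 1/60 s), this converges on the target without overshooting, and a new phase update simply replaces `target` mid-flight.
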
## Asset Generation Pipeline
### Startup Batch

- Server generates 30-50 initial images across severity tiers (mild, medium, extreme)
- SDXL prompts curated per tier: abstract textures, distorted faces, body horror, impossible spaces
- 10-20 voice clips from random non-voice source samples via XTTS
- Assets tagged with `severity: float` for the escalation engine to select from

### Background Generation (Continuous)

- Worker thread generates new assets every 10-30s while the server runs
- Severity of new assets biased toward what connected viewers currently need
- Old assets rotated out to prevent disk bloat (cap: ~200 images, ~50 audio clips)
- Prompts randomized from a curated horror prompt pool + procedural combination

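
A capped, severity-tagged pool along these lines might be sketched as follows. The `Asset` and `AssetPool` names, the `spread` window, and the nearest-severity fallback are assumptions rather than the contents of `asset_pool.py`. This version only tracks entries in memory; deleting rotated files from disk is left out.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    url: str
    severity: float


class AssetPool:
    """Fixed-size pool: adding beyond the cap drops the oldest entry."""

    def __init__(self, cap: int = 200) -> None:
        self._items: deque = deque(maxlen=cap)

    def add(self, asset: Asset) -> None:
        self._items.append(asset)

    def __len__(self) -> int:
        return len(self._items)

    def pick(self, intensity: float, rng: random.Random,
             spread: float = 0.75) -> Asset:
        """Pick a random asset near the requested intensity.

        Falls back to the closest severity when nothing is within spread.
        """
        if not self._items:
            raise LookupError("asset pool is empty")
        near = [a for a in self._items if abs(a.severity - intensity) <= spread]
        if near:
            return rng.choice(near)
        return min(self._items, key=lambda a: abs(a.severity - intensity))
```

The fallback matters because intensity is unbounded while generated severities are not: once a viewer's intensity exceeds every tagged asset, the pool keeps serving its most severe content instead of failing.
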
### Asset Format

- Images: 512x512 PNG (SDXL Turbo native res), served via static endpoint
- Audio: WAV clips, 2-10s each, served via static endpoint or pushed over WebSocket
- Ambient drones: pre-baked looping tracks bundled with frontend (not AI-generated)

## Frontend Rendering

Fullscreen WebGL shader canvas. No 3D scene, just a 2D compositor with shader effects.

### Layer Stack (bottom to top)

1. **Base layer**: current AI-generated image, fills viewport
2. **Blend layer**: next image, crossfading in (or hard-cutting at high intensity)
3. **Morphing layer**: WebGL shader that warps/distorts the composited image
4. **Overlay layer**: procedural effects (fog, particles, vignette, scanlines)
5. **Flash layer**: momentary full-screen events (face flash, white-out, inversion)

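
The per-pixel arithmetic behind the base, blend, and flash layers is a pair of linear mixes. The sketch below shows it in Python for a single RGB pixel with channels in 0..1; in the actual frontend this would live in a GLSL fragment shader, and the morphing and overlay layers are omitted. `composite_pixel` and its parameters are hypothetical names.

```python
from typing import Optional, Tuple

Color = Tuple[float, float, float]


def mix(a: float, b: float, t: float) -> float:
    """Linear interpolation, same meaning as GLSL mix(a, b, t)."""
    return a + (b - a) * t


def composite_pixel(base: Color, nxt: Color, blend: float,
                    flash: Optional[Color] = None,
                    flash_alpha: float = 0.0) -> Color:
    """Crossfade base into nxt, then lay the flash colour on top.

    blend=0 shows only the base image and blend=1 only the next image.
    A hard cut is the same call with blend jumping straight from 0 to 1.
    """
    out = tuple(mix(b, n, blend) for b, n in zip(base, nxt))
    if flash is not None:
        out = tuple(mix(o, f, flash_alpha) for o, f in zip(out, flash))
    return out
```
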
### Shader Effects

| Effect | Description |
|--------|-------------|
| Chromatic aberration | RGB channel split: subtle wrongness early, extreme separation later |
| Mesh warp | Sin-wave displacement: gentle breathing early, violent convulsion later |
| Melt | Downward pixel smear: faces drip |
| Glitch | Horizontal scanline offset blocks: digital corruption |
| Pulse | Brightness oscillation synced to ambient drone |
| Color shift | Palette rotation: desaturates then pushes into reds/greens |
| Noise | Film grain -> TV static progression |
| Inversion | Brief negative-image flashes |

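
As an illustration of how two of these effects scale with `shader_severity`, here is the coordinate math for mesh warp and chromatic aberration, written in Python instead of GLSL. The frequency, maximum amplitude, maximum shift, and function names are placeholder assumptions, with severity taken to be in 0..1.

```python
import math
from typing import Tuple


def mesh_warp(u: float, v: float, time: float, severity: float,
              frequency: float = 6.0,
              max_amplitude: float = 0.05) -> Tuple[float, float]:
    """Displace texture coordinates with sine waves.

    Amplitude scales with severity, so low values read as gentle
    breathing and high values as convulsion.
    """
    amplitude = max_amplitude * severity
    du = amplitude * math.sin(frequency * v + time)
    dv = amplitude * math.sin(frequency * u + time * 1.3)
    return u + du, v + dv


def chromatic_offsets(severity: float,
                      max_shift: float = 0.02) -> Tuple[float, float, float]:
    """Horizontal UV shifts for sampling the R, G and B channels.

    Red and blue are pushed in opposite directions; green stays put.
    """
    shift = max_shift * severity
    return (-shift, 0.0, shift)
```
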
### Transition Modes

- Slow crossfade (low intensity)
- Dissolve through black
- Glitch-cut (hard swap with artifacting)
- Melt-morph (one image melts into the next via displacement map)

### UI

None. Black background. Cursor hidden after 3s idle. No title bar hint. Fullscreen API requested on first click.

## Audio System

Three layers mixed in the Web Audio API.

### Layer 1: Ambient Drone (client-side)

- 2-3 pre-baked dark ambient loops bundled with the frontend
- Crossfade between them as intensity rises
- Web Audio gain/filter nodes shift tone: low-pass opens up, reverb increases, pitch drops
- Always playing: the foundation

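
The crossfade between loops can be expressed as one gain value per track. The sketch below uses an equal-power curve so perceived loudness stays roughly constant during the fade; that choice, the `span` of 2.0 intensity units per fade, and the `drone_gains` name are assumptions. In the browser each value would be written to the corresponding `GainNode`.

```python
import math
from typing import List


def drone_gains(intensity: float, loops: int = 3, span: float = 2.0) -> List[float]:
    """Equal-power gains for crossfading between adjacent drone loops.

    Loop 0 plays alone at intensity 0. Every `span` units of intensity
    the mix moves one loop further, and it holds on the last loop.
    """
    if loops < 2:
        raise ValueError("need at least two loops to crossfade")
    position = min(max(intensity / span, 0.0), float(loops - 1))
    lower = min(int(position), loops - 2)
    frac = position - lower
    gains = [0.0] * loops
    # cos/sin pair keeps gain_a^2 + gain_b^2 == 1 throughout the fade.
    gains[lower] = math.cos(frac * math.pi / 2)
    gains[lower + 1] = math.sin(frac * math.pi / 2)
    return gains
```
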
### Layer 2: Whisper Pool (server-pushed)

- XTTS clones random *non-voice* audio samples as voice sources: dog barks, machinery, wind, static, instruments, household objects, reversed audio
- Speaks horror phrases through these cloned "voices"; results range from "almost human but wrong" to "this should not be talking"
- Pre-generated at startup from a `samples/` directory (just drop WAVs in to customize)
- Server pushes clip references over WebSocket at random intervals
- Each clip plays with randomized stereo pan, volume (some barely audible), and reverb amount
- Content: fragments of sentences, numbers, names, nonsense syllables, distorted laughter
- At high intensity: clips overlap, stack, pan rapidly

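
A whisper event with randomized playback parameters could be assembled like this. The field names follow the `whisper` message in the API section; the value ranges, the way the volume ceiling rises with intensity, and the `whisper_message` name are assumptions.

```python
import json
import random
from typing import Any, Dict


def whisper_message(url: str, intensity: float,
                    rng: random.Random) -> Dict[str, Any]:
    """Build a whisper event with randomized pan, volume and reverb."""
    # Assumption: the loudest allowed whisper grows with intensity,
    # so early clips stay close to inaudible.
    loudest = min(1.0, 0.3 + 0.2 * intensity)
    return {
        "type": "whisper",
        "url": url,
        "pan": round(rng.uniform(-1.0, 1.0), 2),       # -1 left, +1 right
        "volume": round(rng.uniform(0.05, loudest), 2),
        "reverb": round(rng.uniform(0.2, 1.0), 2),
    }


if __name__ == "__main__":
    message = whisper_message("/assets/audio/whisper_017.wav", 2.4,
                              random.Random())
    print(json.dumps(message))
```
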
### Layer 3: Direct Address (server-triggered, real-time)

- Server generates XTTS audio on the fly for "it sees you" moments
- Phrases: "you're still here", "don't leave", "I can see you", "why"
- Sparse at low intensity, more frequent later
- Played dry (no reverb), so it cuts through the ambient and feels close and intimate
- These are meant to be the moments that actually scare people

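
Since direct-address audio is generated on demand, it travels inline instead of by URL. The sketch below packs WAV bytes into the `address` message shape from the API section. `silent_wav` is a hypothetical stand-in for real XTTS output, and the 24 kHz mono 16-bit format is an assumption about that output.

```python
import base64
import io
import json
import wave


def silent_wav(seconds: float = 0.1, sample_rate: int = 24000) -> bytes:
    """Stand-in for synthesized speech: mono 16-bit silence as WAV bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        writer.setframerate(sample_rate)
        writer.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buffer.getvalue()


def address_message(wav_bytes: bytes, text: str) -> str:
    """Serialize a direct-address event with the audio base64-encoded."""
    return json.dumps({
        "type": "address",
        "audio": base64.b64encode(wav_bytes).decode("ascii"),
        "text": text,
    })
```

Base64 adds roughly a third to the payload size, which is acceptable for clips of a few seconds and avoids a second round trip at the moment the line needs to land.
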
### Audio Escalation

- **Low:** quiet drone, occasional barely-audible whisper
- **Medium:** drone louder, whispers clearly audible, first direct address
- **High:** layered voices, drone distorted, direct address frequent, sudden silences followed by spikes

## Project Structure

```
ai-hell/
  server/
    main.py              # FastAPI app, WebSocket, static serving
    config.py            # Intensity params, timing ranges
    escalation.py        # Escalation engine, phase calculator, randomized timing
    asset_generator.py   # SDXL wrapper (repurposed from claude-avatar face_generator.py)
    voice_generator.py   # XTTS wrapper (repurposed from claude-avatar voice.py)
    asset_pool.py        # Pool management, severity tagging, rotation
    prompts.py           # Horror prompt library + procedural combiner
  frontend/
    index.html           # Fullscreen WebGL compositor
    shaders/             # GLSL fragment shaders for distortion effects
    ambient/             # Pre-baked drone loops (2-3 tracks)
  samples/               # Random audio files for XTTS voice cloning source
  requirements.txt
  IDEA.md
  SESSION.md
  DECISIONS.md
  CLAUDE.md
```

### Reused from claude-avatar

FastAPI skeleton, WebSocket streaming, SDXL Turbo loading/inference, XTTS loading/inference, config pattern, systemd service pattern, Caddy reverse proxy pattern.

### Dropped

Three.js 3D scene, GLB model, morph targets, MuseTalk, LivePortrait, animation loop.

### New

Escalation engine, asset pool manager, shader compositor, horror prompt system, multi-layer Web Audio mixer.

## API
### WebSocket (`/stream`)

Server -> Client:

```json
{"type": "phase", "intensity": 2.4, "params": {"morph_speed": 0.35, "shader_severity": 0.6, "palette": "crimson_void"}}
{"type": "asset", "url": "/assets/img_0042.png", "severity": 1.8, "transition": "melt"}
{"type": "whisper", "url": "/assets/audio/whisper_017.wav", "pan": -0.3, "volume": 0.4, "reverb": 0.7}
{"type": "address", "audio": "<base64 wav>", "text": "you're still here"}
{"type": "scare", "effect": "face_flash", "duration_ms": 150}
```

Client -> Server:

```json
{"type": "ping"}
```

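
A receiver for these messages mostly needs to check the `type` and the fields that go with it. The validator below is a sketch built from the five example messages above; treating every field shown in an example as required is an assumption, and `parse_server_message` is a hypothetical name.

```python
import json
from typing import Any, Dict

# Fields seen in the example message for each server -> client type.
REQUIRED_FIELDS = {
    "phase": {"intensity", "params"},
    "asset": {"url", "severity", "transition"},
    "whisper": {"url", "pan", "volume", "reverb"},
    "address": {"audio", "text"},
    "scare": {"effect", "duration_ms"},
}


def parse_server_message(raw: str) -> Dict[str, Any]:
    """Decode one JSON message and verify it has the expected fields."""
    message = json.loads(raw)
    if not isinstance(message, dict):
        raise ValueError("message must be a JSON object")
    kind = message.get("type")
    if kind not in REQUIRED_FIELDS:
        raise ValueError(f"unknown message type: {kind!r}")
    missing = REQUIRED_FIELDS[kind] - set(message)
    if missing:
        raise ValueError(f"{kind} message missing fields: {sorted(missing)}")
    return message
```

The same table can be reused on the server side in tests, to catch a message being emitted without one of the fields the frontend expects.
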
### REST

```
GET /status
  -> {"intensity": 2.4, "connected_clients": 1, "asset_pool_size": 87, "audio_pool_size": 23}

POST /reset
  -> {"status": "ok"}  # restart escalation for this session
```

## Future Path (Out of scope for v1)

- Webcam integration: detect viewer's face, use it in the horror
- Multi-viewer awareness: "there are others here"
- Mobile haptics: vibration API timed to scares
- Viewer voice cloning: mic input -> XTTS clones them back at them
- Seasonal themes: different horror palettes/prompts