d9046322c66597d9d926480d3350c192c7392d1c
Least-Prefill-Work-Left: score = pending_prefill_tokens + max(0, input - cache_hit_here), pure argmin with (num_requests, round-robin) tie-break. Zero hyperparameters — derived from the agentic pattern: decode is cheap (I/O ~217x) so outstanding prefill-token-work is the only load worth modelling. Dropping LMetric's x num_requests factor (a) un-swallows the cache signal so affinity emerges with no gate, and (b) makes an idle-but- decoding host score `input` (its true marginal cost) instead of 0, removing the empty-batch degeneracy. Stick-vs-spill crossover is computed from real token-work, replacing overload_factor + cache_ratio gate. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Description
No description provided
Languages
Python
82.9%
Shell
17.1%