95c8ef853cde525c0add7518e439a5172923ef53
The proxy maintains shadow counters (num_requests, ongoing_tokens,
pending_prefill_tokens, ongoing_decode_tokens) used by every routing
picker. They are incremented in _handle_local_request and decremented
in the generator's finally block. When the StreamingResponse generator
never enters (client disconnect between proxy returning the response
and Starlette starting iteration, or Starlette failing before
iteration), the decrement never fires and the counter stays elevated
forever. Over a multi-hour run the shadow accumulates "phantom" load
on the affected instances and biases the router away from them.
Concrete observation that prompted the fix: during the unified_kv_both
B3 run, engine_0 sat at proxy num_requests=1 / ongoing_decode_tokens=80406
while vLLM's own /metrics reported num_running=0 num_waiting=0 and the
GPU sat at 0% utilization. Every routing decision after that point
believed engine_0 was busy with an 80k-token decode that did not exist.
Fix: extend _reconcile_loop to actively poll each instance's
/metrics every 30 s. If the proxy's num_requests has been higher than
vLLM's (running + waiting) for two consecutive cycles (~60 s of stable
drift), reduce the shadow to vLLM's truth. When vLLM is fully idle
(running=0, waiting=0), zero ongoing_tokens, ongoing_decode_tokens,
and pending_prefill_tokens as well.
Two-cycle persistence avoids correcting transient mismatches where
the proxy has just incremented for a new request that vLLM has not
scheduled yet. A single ~30 s blip is not large enough to corrupt
routing decisions; only persistent drift gets corrected.
The previous _reconcile_loop only clamped negatives. Phantom positives
are now caught and logged ("[reconcile] {url}: phantom drift ...").
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Description
No description provided
Languages
Python
82.9%
Shell
17.1%