Commit Graph

2 Commits

Author SHA1 Message Date
5b26c345f4 P2: all routing policies read real state via eff_ accessors + ablation harness
InstanceState.eff_{num_requests,pending_prefill,ongoing_decode,ongoing_tokens}
= max(shadow, real) when feed fresh (fixes 30s-stale under-count, keeps
in-flight RaceFix), plus real-only r_max_prefill_remaining / r_kv_used_frac.
Wired into load_only, lmetric, sticky, unified(_kv_both), unified_v3, and
snapshot logging. Feed off => identical to before. run_v3_trace.sh gains ES=1
toggle (always deploys enhanced proxy); run_ablation_es.sh runs each config
ES0-vs-ES1 to test whether real state changes policy performance/ranking.
All unit-tested without GPU.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 20:21:12 +08:00
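
The eff_ rule described in 5b26c345f4 (max of shadow and real while the feed is fresh, shadow-only otherwise) can be sketched as follows. This is a minimal illustration, not the repository's code: the field names shadow_num_requests, real_num_requests, real_updated_at, the max_age_s freshness window, and the choice of a method rather than a property are all assumptions made for the example.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class InstanceState:
    # Router-side shadow counter (hypothetical field name).
    shadow_num_requests: int = 0
    # Latest value from the engine-state feed; None when the feed is off.
    real_num_requests: Optional[int] = None
    # Monotonic timestamp of the last feed update (hypothetical field).
    real_updated_at: Optional[float] = None
    # Freshness window in seconds; an assumed value, not taken from the commit.
    max_age_s: float = 1.0

    def feed_fresh(self, now: Optional[float] = None) -> bool:
        """True only if a real value exists and is recent enough."""
        if self.real_num_requests is None or self.real_updated_at is None:
            return False
        now = time.monotonic() if now is None else now
        return (now - self.real_updated_at) <= self.max_age_s

    def eff_num_requests(self, now: Optional[float] = None) -> int:
        """max(shadow, real) when the feed is fresh, else the shadow value."""
        if self.feed_fresh(now):
            return max(self.shadow_num_requests, self.real_num_requests)
        return self.shadow_num_requests
```

Taking the max rather than the real value alone keeps requests the router has dispatched but the engine has not yet reported, which appears to be the in-flight case the commit message refers to. With no feed data, the accessor returns the shadow counter, matching "Feed off => identical to before".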
be948d32b8 P2: real engine-state feed replaces stale shadow counters for migration targeting
vLLM scheduler publishes real state (running/waiting, KV free, and the
max-in-progress-prefill signal /metrics lacks) to a tmpfs/redis store ~20Hz;
router reads it and avoids GIL-stall (mid-large-prefill) + KV-capacity-wall
targets, using real load over 30s-stale shadow counters. Components:
engine_state.py (canonical+reader), instrument_engine_state.py (scheduler
patch, file/redis writer), migration_target.py (scorer), proxy wiring
(--engine-state-uri, off=unchanged). All unit-tested without GPU; not yet
run live. See P2_ENGINE_STATE.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 20:01:26 +08:00
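
The publish/read/target flow described in be948d32b8 can be sketched roughly as below, using the file-backed variant only. This is an illustration under stated assumptions, not the contents of engine_state.py or migration_target.py: the function names, the JSON layout, the field names (running, waiting, kv_used_frac, max_prefill_remaining), and every threshold are hypothetical.

```python
import json
import os
import tempfile
import time
from typing import Dict, Optional


def publish_state(path: str, state: dict, now: Optional[float] = None) -> None:
    """Write a timestamped JSON snapshot, then rename it into place.

    os.replace is atomic on POSIX within one filesystem, so a reader
    never observes a half-written file.
    """
    snapshot = dict(state, ts=time.time() if now is None else now)
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(snapshot, f)
    os.replace(tmp, path)


def read_state(path: str, max_age_s: float = 0.5,
               now: Optional[float] = None) -> Optional[dict]:
    """Return the snapshot, or None if it is missing, corrupt, or stale."""
    try:
        with open(path) as f:
            snapshot = json.load(f)
    except (OSError, ValueError):
        return None
    now = time.time() if now is None else now
    if now - snapshot.get("ts", 0.0) > max_age_s:
        return None
    return snapshot


def pick_target(states: Dict[str, Optional[dict]],
                prefill_limit: int = 2048,      # assumed threshold
                kv_limit: float = 0.9) -> Optional[str]:  # assumed threshold
    """Pick the least-loaded instance that is neither mid-large-prefill
    nor close to its KV-cache capacity; None if no instance qualifies."""
    ok = {
        name: s for name, s in states.items()
        if s is not None
        and s["max_prefill_remaining"] <= prefill_limit
        and s["kv_used_frac"] < kv_limit
    }
    if not ok:
        return None
    return min(ok, key=lambda n: ok[n]["running"] + ok[n]["waiting"])
```

Returning None for a stale or unreadable snapshot lets the caller fall back to its existing shadow counters, which is one way to honor the "off=unchanged" behavior the commit message describes.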