InstanceState.eff_{num_requests,pending_prefill,ongoing_decode,ongoing_tokens}
= max(shadow, real) when feed fresh (fixes 30s-stale under-count, keeps
in-flight RaceFix), plus real-only r_max_prefill_remaining / r_kv_used_frac.
Wired into load_only, lmetric, sticky, unified(_kv_both), unified_v3, and
snapshot logging. Feed off => identical to before. run_v3_trace.sh gains ES=1
toggle (always deploys enhanced proxy); run_ablation_es.sh runs each config
ES0-vs-ES1 to test whether real state changes policy performance/ranking.
All unit-tested without GPU.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
P2: real engine-state feed for migration target selection
Problem: the router (cache_aware_proxy.py) decides migration targets from
shadow counters it maintains itself (incremented at dispatch, decremented
at completion) and reconciles to vLLM /metrics only every 30 s
(_reconcile_loop). Every routing/migration decision is therefore made on
state that may lag the engines by up to 30 s. Worse, the signal that predicts
the ~45% control-plane stall, namely whether the target is in the middle of a
large prefill (a big prefill holds the GIL and starves the mooncake
receiver_loop), is not visible at all, and /metrics does not expose it either.
Fix: vLLM publishes real per-engine state to a shared store at ~20 Hz; the router reads ground truth and avoids GIL-stall / capacity-wall targets.
Components (all unit-tested without GPUs)
- engine_state.py: canonical compute_snapshot(scheduler, id), StateWriter, StateReader. Schema per engine: ts, num_running, num_waiting, gpu_blocks_total/free, gpu_kv_used_frac, pending_prefill_tokens, ongoing_decode_tokens, num_prefilling, max_prefill_remaining.
- instrument_engine_state.py: vLLM Scheduler patch (apply/revert markers ES_INSTRUMENT_*): a daemon thread publishes the snapshot every AGENTIC_ENGINE_STATE_PERIOD_MS (50 ms) off the forward hot path. Inlined writer (engine process needs no repo import). Coexists with MB5.
- migration_target.py: pure target scorer: avoid max_prefill_remaining ≥ es_big_prefill_threshold (GIL stall) and gpu_kv_used_frac ≥ es_kv_wall_frac (capacity wall), then rank by cache-richness and real load.
- cache_aware_proxy.WRITEMODE.py: wired: InstanceState.real_state, _engine_state_poll_loop (instance i ← engine_{i}), _real_load/Gate-3 and Mechanism-B now real-state-aware. --engine-state-uri flag; off ⇒ identical to before (shadow only).
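The target-scoring rule in migration_target.py can be illustrated with a minimal sketch. This is not the repository's code: the class names, the cached_tokens measure of cache-richness, the default thresholds, and the lexicographic ordering (cache first, then load) are all assumptions made for illustration. Only the two avoidance conditions and the fallback when the feed is missing come from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RealState:
    # Subset of the per-engine schema listed above.
    max_prefill_remaining: int
    gpu_kv_used_frac: float
    num_running: int
    num_waiting: int


@dataclass
class Candidate:
    name: str
    shadow_load: int                   # router-side shadow counter
    cached_tokens: int                 # hypothetical cache-richness measure
    real: Optional[RealState] = None   # None when the feed is off or stale


def pick_target(cands: List[Candidate],
                big_prefill_threshold: int = 32768,   # assumed default
                kv_wall_frac: float = 0.95) -> Candidate:  # assumed default
    def blocked(c: Candidate) -> bool:
        r = c.real
        if r is None:
            return False  # no ground truth: cannot rule the target out
        return (r.max_prefill_remaining >= big_prefill_threshold   # GIL stall
                or r.gpu_kv_used_frac >= kv_wall_frac)             # capacity wall

    def load(c: Candidate) -> int:
        if c.real is None:
            return c.shadow_load  # graceful fallback to shadow counters
        return max(c.shadow_load, c.real.num_running + c.real.num_waiting)

    # If every candidate is blocked, fall back to the full set.
    pool = [c for c in cands if not blocked(c)] or list(cands)
    # Assumed ordering: most cached tokens first, then lowest load.
    return min(pool, key=lambda c: (-c.cached_tokens, load(c)))
```

With three equally cache-rich candidates, one mid-130k-prefill with the lowest shadow load, one at the KV wall, and one idle but with higher shadow load, this sketch selects the third, matching the "avoid GIL-stall target even when its shadow load is lower" test case.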
Transport (AGENTIC_ENGINE_STATE_URI / --engine-state-uri):
file:///dev/shm/agentic_engine_state (default, zero-dep, single-node) or
redis://host:port/0 (multi-node; needs redis-py + server — not installed on
dash0, so file backend is the working default).
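A minimal sketch of how a file backend of this kind could work, assuming one JSON file per engine under the directory named by the URI, written atomically and dropped by the reader when stale. The function names, the file layout, and the max_age_s parameter are hypothetical; the real StateWriter/StateReader may differ.

```python
import json
import os
import tempfile
import time
from typing import Optional
from urllib.parse import urlparse


def root_from_uri(uri: str) -> str:
    """Map a file:// URI to a directory path (only the file backend is sketched)."""
    parsed = urlparse(uri)
    if parsed.scheme != "file":
        raise ValueError(f"unsupported scheme in this sketch: {parsed.scheme!r}")
    return parsed.path


def write_state(root: str, engine_id: str, snapshot: dict) -> None:
    """Publish a snapshot atomically: write a temp file, then rename over the target."""
    os.makedirs(root, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=root, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(snapshot, f)
    # os.replace is atomic within one filesystem, so readers never see a partial file.
    os.replace(tmp, os.path.join(root, f"{engine_id}.json"))


def read_state(root: str, engine_id: str,
               max_age_s: float = 1.0,          # assumed staleness bound
               now: Optional[float] = None) -> Optional[dict]:
    """Return the latest snapshot, or None if it is missing, unreadable, or stale."""
    path = os.path.join(root, f"{engine_id}.json")
    try:
        with open(path) as f:
            snap = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return None
    now = time.time() if now is None else now
    if now - snap.get("ts", 0.0) > max_age_s:
        return None  # staleness drop: caller falls back to shadow counters
    return snap
```

Returning None for a stale or missing snapshot lets the caller treat "feed off" and "feed stalled" the same way, which is what keeps the shadow-only path intact.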
Tests (no GPU)
- compute_snapshot field math (mock scheduler): running/waiting, max_prefill_remaining, pending, decode, kv_used_frac.
- writer→reader round-trip + staleness drop (file backend).
- target scorer: 5 cases incl. avoid GIL-stall target even when its shadow load is lower, real load beats stale shadow, cache-rich wins, avoid KV wall, graceful fallback when feed missing.
- end-to-end: publish 8 engines (one mid-130k-prefill) → proxy inlined reader → target selection avoids it.
Enabling in a GPU run (when free)
- instrument_engine_state.py --apply on the dash0 venv.
- export AGENTIC_ENGINE_STATE_URI=file:///dev/shm/agentic_engine_state before the launcher (vLLM instances inherit it; AGENTIC_WORKER_ID=engine_{i} already set by b3_isolated_policy.sh → publishes as engine_{i}).
- Proxy: EXTRA_PROXY_ARGS="--engine-state-uri file:///dev/shm/agentic_engine_state ...".
- Revert the patch + rm -rf /dev/shm/agentic_engine_state after.
ALL policies now read the real state (update)
InstanceState exposes effective accessors used by every picker:
eff_num_requests / eff_pending_prefill / eff_ongoing_decode / eff_ongoing_tokens = max(shadow, real) when the feed is fresh (real fixes
the 30s-stale under-count; shadow's atomic pre-await reservation still covers
the in-flight window, preserving the RaceFix), plus real-only
r_max_prefill_remaining / r_kv_used_frac. Wired into: load_only, lmetric,
sticky, pick_instance (legacy), pick_instance_unified_hybrid
(unified / unified_kv_both), pick_instance_unified_v3 (gate + Mechanism B),
and snapshot_workers (logged scores now match the decision + real fields).
Feed off ⇒ real_state is None ⇒ accessors return shadow ⇒ byte-identical to
before. (legacy unified_v2 left on shadow — retired, not in the ablation.)
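The accessor rule can be sketched as follows. This is an illustration under stated assumptions, not the actual InstanceState: the class name, the fresh_s freshness bound, the injectable clock, and the mapping of the real request count to num_running + num_waiting are hypothetical. The max(shadow, real) rule and the shadow-only fallback are taken from the text above.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class InstanceStateSketch:
    num_requests: int = 0                     # shadow counter (reserved pre-await)
    pending_prefill: int = 0                  # shadow counter
    real_state: Optional[dict] = None         # latest published snapshot, if any
    fresh_s: float = 1.0                      # assumed freshness bound
    clock: Callable[[], float] = time.time    # injectable for tests

    def _fresh_real(self) -> Optional[dict]:
        r = self.real_state
        if r is None or self.clock() - r["ts"] > self.fresh_s:
            return None  # feed off or stale: behave exactly like shadow-only
        return r

    @property
    def eff_num_requests(self) -> int:
        r = self._fresh_real()
        if r is None:
            return self.num_requests
        # max() keeps the shadow reservation for in-flight dispatches while
        # letting ground truth correct a stale under-count.
        return max(self.num_requests, r["num_running"] + r["num_waiting"])

    @property
    def eff_pending_prefill(self) -> int:
        r = self._fresh_real()
        if r is None:
            return self.pending_prefill
        return max(self.pending_prefill, r["pending_prefill_tokens"])

    @property
    def r_max_prefill_remaining(self) -> Optional[int]:
        # Real-only field: there is no shadow equivalent to fall back to.
        r = self._fresh_real()
        return None if r is None else r["max_prefill_remaining"]
```

Taking the maximum rather than replacing shadow with real means a request that has been dispatched but not yet seen by the engine is still counted, which is the property the RaceFix depends on.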
Ablation (when GPU free)
run_v3_trace.sh gains ES=1 (apply engine-state patch + feed + proxy flag)
and always deploys the enhanced proxy (dormant when feed/write-mode off).
run_ablation_es.sh runs each config twice (ES=0 vs ES=1) so the only
difference is the state source. Default decisive set (4 runs): champion
unified+A+B and unified_v3+A+B+layerwise, each ES0/ES1. Extend CONFIGS for
lmetric / unified_kv_both / load_only. Compares per-policy TTFT
(overall + migrated) and whether the ranking changes with ground-truth
state.
Status / scope
- Built + unit-tested (snapshot, round-trip, target scorer, eff_ accessors, end-to-end publish→read→select); NOT yet run against live engines (GPU busy).
- TP=1 only (one EngineCore/instance → one publisher/engine_id). TP>1 needs per-rank ids.