# P2: real engine-state feed for migration target selection

Problem: the router (`cache_aware_proxy.py`) decides migration targets from **shadow counters** it maintains itself (incremented at dispatch, decremented at completion) and reconciles to vLLM `/metrics` only every **30 s** (`_reconcile_loop`). So every routing/migration decision is made on stale state. Worse, the signal that predicts the ~45% control-plane stall, *is the target mid-large-prefill?* (a big prefill holds the GIL and starves the mooncake receiver_loop), isn't visible at all, and `/metrics` doesn't expose it either.

Fix: vLLM publishes **real** per-engine state to a shared store at ~20 Hz; the router reads ground truth and avoids GIL-stall / capacity-wall targets.

## Components (all unit-tested without GPUs)

- `engine_state.py`: canonical `compute_snapshot(scheduler, id)`, `StateWriter`, `StateReader`. Schema per engine: `ts, num_running, num_waiting, gpu_blocks_total/free, gpu_kv_used_frac, pending_prefill_tokens, ongoing_decode_tokens, num_prefilling, max_prefill_remaining`.
- `instrument_engine_state.py`: vLLM `Scheduler` patch (apply/revert markers `ES_INSTRUMENT_*`): a daemon thread publishes the snapshot every `AGENTIC_ENGINE_STATE_PERIOD_MS` (50 ms) off the forward hot path. Inlined writer (engine process needs no repo import). Coexists with MB5.
- `migration_target.py`: pure target scorer: avoid `max_prefill_remaining ≥ es_big_prefill_threshold` (GIL stall) and `gpu_kv_used_frac ≥ es_kv_wall_frac` (capacity wall), then rank by cache-richness and **real** load.
- `cache_aware_proxy.WRITEMODE.py`: wired: `InstanceState.real_state`, `_engine_state_poll_loop` (instance i ← `engine_{i}`), `_real_load`/Gate-3 and Mechanism-B now real-state-aware. `--engine-state-uri` flag; off ⇒ identical to before (shadow only).
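
The scorer's avoid-then-rank rule can be sketched roughly as below. This is an illustration, not the contents of `migration_target.py`: the `Candidate` type, the threshold values, the lexicographic ranking (cache-richness first, then load), and the fallback when every candidate is risky are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """Hypothetical view of one migration target."""
    name: str
    cached_tokens: int            # cache-richness for this request (assumed metric)
    shadow_load: int              # router's own dispatch/completion counter
    real: Optional[dict] = None   # fresh engine snapshot, or None if the feed is missing


def pick_target(cands, big_prefill_threshold=32_000, kv_wall_frac=0.95):
    """Avoid GIL-stall / KV-wall targets, then prefer cache-rich, lightly loaded ones."""

    def load(c):
        # Use real load when a snapshot exists, otherwise fall back to shadow.
        if c.real is None:
            return c.shadow_load
        return c.real["num_running"] + c.real["num_waiting"]

    def risky(c):
        # Without a snapshot nothing is known, so the candidate is not excluded.
        if c.real is None:
            return False
        return (c.real["max_prefill_remaining"] >= big_prefill_threshold
                or c.real["gpu_kv_used_frac"] >= kv_wall_frac)

    safe = [c for c in cands if not risky(c)]
    pool = safe or cands  # assumed fallback: if everything is risky, rank them all
    return min(pool, key=lambda c: (-c.cached_tokens, load(c)))


cands = [
    Candidate("engine_0", cached_tokens=0, shadow_load=1,
              real={"num_running": 1, "num_waiting": 0,
                    "max_prefill_remaining": 130_000, "gpu_kv_used_frac": 0.40}),
    Candidate("engine_1", cached_tokens=0, shadow_load=3,
              real={"num_running": 3, "num_waiting": 0,
                    "max_prefill_remaining": 0, "gpu_kv_used_frac": 0.50}),
]
print(pick_target(cands).name)  # engine_1: engine_0 is mid-large-prefill
```

Here `engine_0` has the lower shadow load but is skipped because its remaining prefill exceeds the (assumed) threshold.
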

Transport (`AGENTIC_ENGINE_STATE_URI` / `--engine-state-uri`): `file:///dev/shm/agentic_engine_state` (default, zero-dep, single-node) or `redis://host:port/0` (multi-node; needs redis-py + server; not installed on dash0, so the file backend is the working default).

## Tests (no GPU)

- `compute_snapshot` field math (mock scheduler): running/waiting, max_prefill_remaining, pending, decode, kv_used_frac.
- writer→reader round-trip + staleness drop (file backend).
- target scorer: 5 cases incl. *avoid GIL-stall target even when its shadow load is lower*, *real load beats stale shadow*, *cache-rich wins*, *avoid KV wall*, *graceful fallback when feed missing*.
- end-to-end: publish 8 engines (one mid-130k-prefill) → proxy inlined reader → target selection avoids it.

## Enabling in a GPU run (when free)

1. `instrument_engine_state.py --apply` on the dash0 venv.
2. `export AGENTIC_ENGINE_STATE_URI=file:///dev/shm/agentic_engine_state` before the launcher (vLLM instances inherit it; `AGENTIC_WORKER_ID=engine_{i}` already set by `b3_isolated_policy.sh` → publishes as `engine_{i}`).
3. Proxy: `EXTRA_PROXY_ARGS="--engine-state-uri file:///dev/shm/agentic_engine_state ..."`.
4. Revert the patch + `rm -rf /dev/shm/agentic_engine_state` after.

## ALL policies now read the real state (update)

`InstanceState` exposes effective accessors used by **every** picker: `eff_num_requests / eff_pending_prefill / eff_ongoing_decode / eff_ongoing_tokens` = `max(shadow, real)` when the feed is fresh (real fixes the 30s-stale under-count; shadow's atomic pre-await reservation still covers the in-flight window, preserving the RaceFix), plus real-only `r_max_prefill_remaining / r_kv_used_frac`.

Wired into: `load_only`, `lmetric`, `sticky`, `pick_instance` (legacy), `pick_instance_unified_hybrid` (unified / unified_kv_both), `pick_instance_unified_v3` (gate + Mechanism B), and `snapshot_workers` (logged scores now match the decision + real fields).
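
A minimal sketch of the `max(shadow, real)` accessor pattern, assuming `ts` is epoch seconds and that the real request count is `num_running + num_waiting`. The class name, the `max_age_s` freshness window, and the `now` parameter are hypothetical and only there to make the example testable; the real `InstanceState` may differ.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class InstanceStateSketch:
    """Hypothetical stand-in for InstanceState, reduced to two counters."""
    num_requests: int = 0               # shadow: reserved at dispatch, released at completion
    pending_prefill: int = 0            # shadow: prefill tokens dispatched but not finished
    real_state: Optional[dict] = None   # latest engine snapshot, None when the feed is off
    max_age_s: float = 0.5              # assumed freshness window

    def _fresh(self, now=None):
        """Return the snapshot if it exists and is recent enough, else None."""
        if self.real_state is None:
            return None
        now = time.time() if now is None else now
        if now - self.real_state["ts"] > self.max_age_s:
            return None
        return self.real_state

    def eff_num_requests(self, now=None):
        real = self._fresh(now)
        if real is None:
            return self.num_requests  # feed off or stale: shadow only
        # max() keeps the shadow reservation for requests the engine has not seen yet.
        return max(self.num_requests, real["num_running"] + real["num_waiting"])

    def eff_pending_prefill(self, now=None):
        real = self._fresh(now)
        if real is None:
            return self.pending_prefill
        return max(self.pending_prefill, real["pending_prefill_tokens"])

    def r_max_prefill_remaining(self, now=None):
        # Real-only signal: there is no shadow equivalent, so report 0 without a feed.
        real = self._fresh(now)
        return 0 if real is None else real["max_prefill_remaining"]
```

Taking the maximum rather than the real value alone means a request that was just dispatched, and is not yet visible in the engine snapshot, still counts against the instance.
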

Feed off ⇒ `real_state is None` ⇒ accessors return shadow ⇒ byte-identical to before. (Legacy `unified_v2` left on shadow: retired, not in the ablation.)

## Ablation (when GPU free)

`run_v3_trace.sh` gains `ES=1` (apply engine-state patch + feed + proxy flag) and always deploys the enhanced proxy (dormant when feed/write-mode off). `run_ablation_es.sh` runs each config twice (ES=0 vs ES=1) so the only difference is the state source.

Default decisive set (4 runs): champion `unified+A+B` and `unified_v3+A+B+layerwise`, each ES0/ES1. Extend CONFIGS for `lmetric` / `unified_kv_both` / `load_only`. Compares per-policy TTFT (overall + migrated) and whether the **ranking** changes with ground-truth state.

## Status / scope

- Built + unit-tested (snapshot, round-trip, target scorer, eff_ accessors, end-to-end publish→read→select); NOT yet run against live engines (GPU busy).
- TP=1 only (one EngineCore/instance → one publisher/engine_id). TP>1 needs per-rank ids.
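
For reference, the file-backend round-trip with staleness drop (one of the unit-tested pieces listed above) could look roughly like the sketch below. The one-JSON-file-per-engine layout, the function names, and the `max_age_s` default are assumptions, not the actual `StateWriter` / `StateReader` code.

```python
import json
import os
import tempfile
import time
from pathlib import Path
from typing import Optional


def write_snapshot(root, engine_id: str, snap: dict) -> None:
    """Publish one engine snapshot atomically (write to a temp file, then rename)."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=root, prefix=f".{engine_id}.")
    with os.fdopen(fd, "w") as f:
        json.dump(snap, f)
    # os.replace is atomic within one filesystem, so a reader never sees a partial file.
    os.replace(tmp, root / f"{engine_id}.json")


def read_snapshot(root, engine_id: str, max_age_s: float = 0.5,
                  now: Optional[float] = None) -> Optional[dict]:
    """Return the snapshot, or None if it is missing, unreadable, or too old."""
    path = Path(root) / f"{engine_id}.json"
    try:
        snap = json.loads(path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return None
    now = time.time() if now is None else now
    if now - snap.get("ts", 0.0) > max_age_s:
        return None  # staleness drop: caller falls back to shadow counters
    return snap
```

A reader that gets `None` treats the feed as missing for that engine, which matches the graceful-fallback case in the scorer tests.
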