# P2: real engine-state feed for migration target selection

Problem: the router (`cache_aware_proxy.py`) decides migration targets from
**shadow counters** it maintains itself (incremented at dispatch, decremented
at completion) and reconciles to vLLM `/metrics` only every **30 s**
(`_reconcile_loop`). So every routing/migration decision is made on state that
can be up to 30 s out of sync with the engine.
Worse, the signal that predicts the ~45% control-plane stall — *is the target
mid-large-prefill?* (a big prefill holds the GIL and starves the mooncake
receiver_loop) — isn't visible at all, and `/metrics` doesn't expose it either.

Fix: vLLM publishes **real** per-engine state to a shared store at ~20 Hz; the
router reads ground truth and avoids GIL-stall / capacity-wall targets.
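
A minimal sketch of the publishing side, assuming a background daemon thread that wakes every 50 ms. `start_publisher`, `get_snapshot` and `publish` are hypothetical names standing in for the real snapshot and writer code, not the patch itself:

```python
import threading
import time


def start_publisher(get_snapshot, publish, period_s=0.05):
    """Publish get_snapshot() every period_s seconds from a daemon thread.

    Hypothetical helper: get_snapshot and publish are placeholders for the
    real snapshot function and store writer.
    """
    stop = threading.Event()

    def loop():
        while not stop.is_set():
            try:
                publish(get_snapshot())
            except Exception:
                # A failed publish must never take the engine down.
                pass
            # Event.wait doubles as an interruptible sleep.
            stop.wait(period_s)

    thread = threading.Thread(target=loop, name="engine-state-publisher", daemon=True)
    thread.start()
    return stop, thread
```

Calling `stop.set()` ends the loop at the next wake-up. Because the thread is a daemon, it also does not block process exit.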

## Components (all unit-tested without GPUs)

- `engine_state.py` — canonical `compute_snapshot(scheduler, id)`, `StateWriter`,
  `StateReader`. Schema per engine: `ts, num_running, num_waiting,
  gpu_blocks_total/free, gpu_kv_used_frac, pending_prefill_tokens,
  ongoing_decode_tokens, num_prefilling, max_prefill_remaining`.
- `instrument_engine_state.py` — vLLM `Scheduler` patch (apply/revert markers
  `ES_INSTRUMENT_*`): a daemon thread publishes the snapshot every
  `AGENTIC_ENGINE_STATE_PERIOD_MS` (50 ms) off the forward hot path. Inlined
  writer (engine process needs no repo import). Coexists with MB5.
- `migration_target.py` — pure target scorer: avoid `max_prefill_remaining ≥
  es_big_prefill_threshold` (GIL stall) and `gpu_kv_used_frac ≥ es_kv_wall_frac`
  (capacity wall), then rank by cache-richness and **real** load.
- `cache_aware_proxy.WRITEMODE.py` — wired: `InstanceState.real_state`,
  `_engine_state_poll_loop` (instance i ← `engine_{i}`), `_real_load`/Gate-3 and
  Mechanism-B now real-state-aware. `--engine-state-uri` flag; off ⇒ identical
  to before (shadow only).
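
The target-scoring rule can be sketched as follows. This is an illustration under assumptions, not the code in `migration_target.py`: the threshold defaults, the `(name, cached_tokens, shadow_load, snapshot)` candidate tuple, the use of `num_running + num_waiting` as "real load", and the cache-first tie-break order are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EngineSnapshot:
    # Subset of the per-engine schema fields.
    num_running: int
    num_waiting: int
    gpu_kv_used_frac: float
    max_prefill_remaining: int


def pick_target(candidates, big_prefill_threshold=8192, kv_wall_frac=0.9):
    """Return the name of the best target, or None if every candidate is avoided.

    candidates: iterable of (name, cached_tokens, shadow_load, snapshot_or_None).
    Threshold defaults are placeholders, not the project's values.
    """
    eligible = []
    for name, cached_tokens, shadow_load, snap in candidates:
        if snap is not None:
            if snap.max_prefill_remaining >= big_prefill_threshold:
                continue  # mid-large-prefill: likely GIL stall
            if snap.gpu_kv_used_frac >= kv_wall_frac:
                continue  # capacity wall
            load = snap.num_running + snap.num_waiting  # real load
        else:
            load = shadow_load  # feed missing: fall back to shadow counters
        # Assumed ordering: more cached tokens first, then lower load.
        eligible.append((-cached_tokens, load, name))
    if not eligible:
        return None
    return min(eligible)[2]
```

With this shape, an engine that is mid-large-prefill is skipped even if its shadow load is the lowest, and a missing snapshot degrades to shadow-only ranking.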

Transport (`AGENTIC_ENGINE_STATE_URI` / `--engine-state-uri`):
`file:///dev/shm/agentic_engine_state` (default, zero-dep, single-node) or
`redis://host:port/0` (multi-node; needs redis-py + server — not installed on
dash0, so file backend is the working default).
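
A rough sketch of how a file backend could work, assuming one JSON file per engine under the store directory, atomic replacement on write, and a reader that drops stale snapshots. The file layout and the names `write_state`, `read_state` and `max_age_s` are assumptions for illustration, not the actual `StateWriter`/`StateReader` API:

```python
import json
import os
import tempfile
import time


def write_state(root, engine_id, snapshot):
    """Atomically write one engine's snapshot as <root>/<engine_id>.json."""
    os.makedirs(root, exist_ok=True)
    payload = dict(snapshot, ts=time.time())
    # Write to a temp file in the same directory, then rename over the target,
    # so a reader never sees a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=root, prefix=f".{engine_id}.")
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f)
    os.replace(tmp_path, os.path.join(root, f"{engine_id}.json"))


def read_state(root, engine_id, max_age_s=1.0, now=None):
    """Return the snapshot dict, or None if it is missing, unreadable or stale."""
    path = os.path.join(root, f"{engine_id}.json")
    try:
        with open(path) as f:
            payload = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return None
    now = time.time() if now is None else now
    if now - payload.get("ts", 0.0) > max_age_s:
        return None  # staleness drop: caller falls back to shadow counters
    return payload
```

Under the default URI, `root` would correspond to `/dev/shm/agentic_engine_state`. Keeping the store in shared memory avoids disk I/O on each 50 ms publish.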

## Tests (no GPU)
- `compute_snapshot` field math (mock scheduler): running/waiting,
  max_prefill_remaining, pending, decode, kv_used_frac.
- writer→reader round-trip + staleness drop (file backend).
- target scorer: 5 cases incl. *avoid GIL-stall target even when its shadow
  load is lower*, *real load beats stale shadow*, *cache-rich wins*,
  *avoid KV wall*, *graceful fallback when feed missing*.
- end-to-end: publish 8 engines (one mid-130k-prefill) → proxy inlined reader →
  target selection avoids it.

## Enabling in a GPU run (when free)
1. `instrument_engine_state.py --apply` on the dash0 venv.
2. `export AGENTIC_ENGINE_STATE_URI=file:///dev/shm/agentic_engine_state`
   before the launcher (vLLM instances inherit it; `AGENTIC_WORKER_ID=engine_{i}`
   already set by `b3_isolated_policy.sh` → publishes as `engine_{i}`).
3. Proxy: `EXTRA_PROXY_ARGS="--engine-state-uri file:///dev/shm/agentic_engine_state ..."`.
4. Revert the patch + `rm -rf /dev/shm/agentic_engine_state` after.

## ALL policies now read the real state (update)
`InstanceState` exposes effective accessors used by **every** picker:
`eff_num_requests / eff_pending_prefill / eff_ongoing_decode /
`eff_ongoing_tokens` = `max(shadow, real)` when the feed is fresh (real corrects
the under-count left by 30 s-stale reconciliation; shadow's atomic pre-await
reservation still covers the in-flight window, preserving the RaceFix), plus real-only
`r_max_prefill_remaining / r_kv_used_frac`. Wired into: `load_only`, `lmetric`,
`sticky`, `pick_instance` (legacy), `pick_instance_unified_hybrid`
(unified / unified_kv_both), `pick_instance_unified_v3` (gate + Mechanism B),
and `snapshot_workers` (logged scores now match the decision + real fields).
Feed off ⇒ `real_state is None` ⇒ accessors return shadow ⇒ byte-identical to
before. (Legacy `unified_v2` stays on shadow: it is retired and not part of the ablation.)
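
One of the effective accessors might look like the sketch below. The class name, the `shadow_num_requests` field, and the choice of `num_running + num_waiting` as the real request count are assumptions. It also assumes the poll loop sets `real_state` to `None` when the feed is off or stale:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InstanceStateSketch:
    # Hypothetical stand-in for InstanceState, reduced to one counter.
    shadow_num_requests: int = 0
    real_state: Optional[dict] = None  # None when the feed is off or stale

    @property
    def eff_num_requests(self) -> int:
        if self.real_state is None:
            # Feed off: behave exactly as the shadow-only proxy did.
            return self.shadow_num_requests
        real = self.real_state["num_running"] + self.real_state["num_waiting"]
        # max() keeps shadow's in-flight reservations while letting the
        # real count correct any under-count.
        return max(self.shadow_num_requests, real)
```

Taking the maximum rather than replacing shadow with real means a request dispatched but not yet visible to the engine still counts toward load.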

## Ablation (when GPU free)
`run_v3_trace.sh` gains `ES=1` (apply engine-state patch + feed + proxy flag)
and always deploys the enhanced proxy (dormant when feed/write-mode off).
`run_ablation_es.sh` runs each config twice (ES=0 vs ES=1) so the only
difference is the state source. Default decisive set (4 runs): champion
`unified+A+B` and `unified_v3+A+B+layerwise`, each ES0/ES1. Extend CONFIGS for
`lmetric` / `unified_kv_both` / `load_only`. Compares per-policy TTFT
(overall + migrated) and whether the **ranking** changes with ground-truth
state.

## Status / scope
- Built + unit-tested (snapshot, round-trip, target scorer, eff_ accessors,
  end-to-end publish→read→select); NOT yet run against live engines (GPU busy).
- TP=1 only (one EngineCore/instance → one publisher/engine_id). TP>1 needs
  per-rank ids.