agentic-kvc/microbench/connector_tax/layerwise/P2_ENGINE_STATE.md
Gahow Wang 5b26c345f4 P2: all routing policies read real state via eff_ accessors + ablation harness
InstanceState.eff_{num_requests,pending_prefill,ongoing_decode,ongoing_tokens}
= max(shadow, real) when feed fresh (fixes 30s-stale under-count, keeps
in-flight RaceFix), plus real-only r_max_prefill_remaining / r_kv_used_frac.
Wired into load_only, lmetric, sticky, unified(_kv_both), unified_v3, and
snapshot logging. Feed off => identical to before. run_v3_trace.sh gains ES=1
toggle (always deploys enhanced proxy); run_ablation_es.sh runs each config
ES0-vs-ES1 to test whether real state changes policy performance/ranking.
All unit-tested without GPU.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 20:21:12 +08:00

P2: real engine-state feed for migration target selection

Problem: the router (cache_aware_proxy.py) picks migration targets from shadow counters it maintains itself (incremented at dispatch, decremented at completion) and reconciles them against vLLM /metrics only every 30 s (_reconcile_loop), so every routing/migration decision may rest on state that is up to 30 s stale. Worse, the signal that predicts the ~45% control-plane stall is not visible at all: whether the target is in the middle of a large prefill (a big prefill holds the GIL and starves the mooncake receiver_loop). /metrics does not expose it either.

Fix: each vLLM engine publishes its real state to a shared store at ~20 Hz; the router reads that ground truth and avoids targets that are GIL-stalled or at the capacity wall.

Components (all unit-tested without GPUs)

  • engine_state.py — canonical compute_snapshot(scheduler, id), StateWriter, StateReader. Schema per engine: ts, num_running, num_waiting, gpu_blocks_total/free, gpu_kv_used_frac, pending_prefill_tokens, ongoing_decode_tokens, num_prefilling, max_prefill_remaining.
  • instrument_engine_state.py — vLLM Scheduler patch (apply/revert markers ES_INSTRUMENT_*): a daemon thread publishes the snapshot every AGENTIC_ENGINE_STATE_PERIOD_MS (50 ms) off the forward hot path. Inlined writer (engine process needs no repo import). Coexists with MB5.
  • migration_target.py — pure target scorer: avoid max_prefill_remaining ≥ es_big_prefill_threshold (GIL stall) and gpu_kv_used_frac ≥ es_kv_wall_frac (capacity wall), then rank by cache-richness and real load.
  • cache_aware_proxy.WRITEMODE.py — wired: InstanceState.real_state, _engine_state_poll_loop (instance i ← engine_{i}), _real_load/Gate-3 and Mechanism-B now real-state-aware. --engine-state-uri flag; off ⇒ identical to before (shadow only).
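
The scorer's two-stage rule (filter out risky targets, then rank the rest) can be sketched as below. This is a minimal illustration, not the contents of migration_target.py: the Candidate fields cache_hit_tokens and load, the placeholder threshold defaults, and the fallback to the full candidate set when every target is filtered out are all assumptions. Only the names es_big_prefill_threshold, es_kv_wall_frac, max_prefill_remaining and gpu_kv_used_frac come from the note above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Candidate:
    name: str
    cache_hit_tokens: int                         # hypothetical cache-richness measure
    load: int                                     # hypothetical load measure
    max_prefill_remaining: Optional[int] = None   # None => no fresh real state
    gpu_kv_used_frac: Optional[float] = None      # None => no fresh real state


def pick_target(
    cands: Sequence[Candidate],
    es_big_prefill_threshold: int = 32768,   # placeholder value, not from the repo
    es_kv_wall_frac: float = 0.95,           # placeholder value, not from the repo
) -> Optional[Candidate]:
    def risky(c: Candidate) -> bool:
        # GIL-stall risk: the engine is in the middle of a large prefill.
        big = (c.max_prefill_remaining is not None
               and c.max_prefill_remaining >= es_big_prefill_threshold)
        # Capacity wall: the KV cache is nearly full.
        wall = (c.gpu_kv_used_frac is not None
                and c.gpu_kv_used_frac >= es_kv_wall_frac)
        return big or wall

    safe = [c for c in cands if not risky(c)]
    pool = safe or list(cands)   # assumed fallback: never refuse to pick
    if not pool:
        return None
    # Prefer cache-rich targets, break ties by lower load.
    return min(pool, key=lambda c: (-c.cache_hit_tokens, c.load))
```

Candidates without real state (both real fields None) are never filtered, which mirrors the "graceful fallback when feed missing" behaviour: the ranking then depends only on whatever cache and load figures the caller supplies.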

Transport (AGENTIC_ENGINE_STATE_URI / --engine-state-uri): file:///dev/shm/agentic_engine_state (default, zero-dep, single-node) or redis://host:port/0 (multi-node; needs redis-py and a Redis server, which are not installed on dash0, so the file backend is the working default).
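
A file backend of this kind can be sketched as one JSON file per engine, written atomically and dropped by the reader when stale. The layout (one `<engine_id>.json` per engine), the JSON encoding, the function names and the max_age_s parameter are assumptions for illustration; the real StateWriter / StateReader in engine_state.py may differ.

```python
import json
import os
import tempfile
import time
from typing import Optional
from urllib.parse import urlparse


def state_dir(uri: str) -> str:
    """Map a file:// URI to a directory path (the redis:// case is not sketched)."""
    parsed = urlparse(uri)
    if parsed.scheme != "file":
        raise ValueError(f"sketch only handles file:// URIs, got {uri!r}")
    return parsed.path


def write_state(directory: str, engine_id: str, snapshot: dict) -> None:
    """Publish one snapshot; os.replace makes the update atomic for readers."""
    os.makedirs(directory, exist_ok=True)
    payload = dict(snapshot, ts=time.time())
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=f".{engine_id}.")
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f)
    os.replace(tmp, os.path.join(directory, f"{engine_id}.json"))


def read_state(directory: str, engine_id: str,
               max_age_s: float = 1.0,
               now: Optional[float] = None) -> Optional[dict]:
    """Return the latest snapshot, or None if it is missing, unreadable or stale."""
    path = os.path.join(directory, f"{engine_id}.json")
    try:
        with open(path) as f:
            snap = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return None
    now = time.time() if now is None else now
    if now - snap.get("ts", 0.0) > max_age_s:
        return None   # staleness drop: caller falls back to shadow counters
    return snap
```

Writing to a temporary file in the same directory and renaming it keeps a reader from ever seeing a half-written snapshot, which matters when the writer runs every 50 ms.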

Tests (no GPU)

  • compute_snapshot field math (mock scheduler): running/waiting, max_prefill_remaining, pending, decode, kv_used_frac.
  • writer→reader round-trip + staleness drop (file backend).
  • target scorer: 5 cases incl. avoid GIL-stall target even when its shadow load is lower, real load beats stale shadow, cache-rich wins, avoid KV wall, graceful fallback when feed missing.
  • end-to-end: publish 8 engines (one mid-130k-prefill) → proxy inlined reader → target selection avoids it.
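
The field-math test can be pictured with a mock scheduler along these lines. MockRequest, MockScheduler and their fields are hypothetical stand-ins, not vLLM's real classes, and the exact definition of each field is an assumption: here max_prefill_remaining looks only at running requests, pending_prefill_tokens sums over running and waiting, and ongoing_decode_tokens sums the processed tokens of requests that have finished prefill.

```python
import time
from dataclasses import dataclass, field
from typing import List


@dataclass
class MockRequest:
    prompt_tokens: int
    computed_tokens: int = 0   # tokens processed so far (prompt plus generated)

    @property
    def prefill_remaining(self) -> int:
        return max(0, self.prompt_tokens - self.computed_tokens)


@dataclass
class MockScheduler:
    running: List[MockRequest] = field(default_factory=list)
    waiting: List[MockRequest] = field(default_factory=list)
    gpu_blocks_total: int = 0
    gpu_blocks_free: int = 0


def compute_snapshot(scheduler: MockScheduler, engine_id: str) -> dict:
    prefilling = [r for r in scheduler.running if r.prefill_remaining > 0]
    decoding = [r for r in scheduler.running if r.prefill_remaining == 0]
    all_reqs = scheduler.running + scheduler.waiting
    total = scheduler.gpu_blocks_total
    used_frac = (total - scheduler.gpu_blocks_free) / total if total else 0.0
    return {
        "engine_id": engine_id,
        "ts": time.time(),
        "num_running": len(scheduler.running),
        "num_waiting": len(scheduler.waiting),
        "gpu_blocks_total": total,
        "gpu_blocks_free": scheduler.gpu_blocks_free,
        "gpu_kv_used_frac": used_frac,
        # Assumed definition: prompt tokens still to be prefilled, running + waiting.
        "pending_prefill_tokens": sum(r.prefill_remaining for r in all_reqs),
        # Assumed definition: tokens held by requests that are past prefill.
        "ongoing_decode_tokens": sum(r.computed_tokens for r in decoding),
        "num_prefilling": len(prefilling),
        # Assumed definition: largest unfinished prefill among running requests.
        "max_prefill_remaining": max((r.prefill_remaining for r in prefilling), default=0),
    }
```

Because the function only reads plain attributes, it can be exercised without a GPU or a vLLM import.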

Enabling in a GPU run (when free)

  1. Run instrument_engine_state.py --apply in the dash0 venv.
  2. export AGENTIC_ENGINE_STATE_URI=file:///dev/shm/agentic_engine_state before the launcher (vLLM instances inherit it; AGENTIC_WORKER_ID=engine_{i} already set by b3_isolated_policy.sh → publishes as engine_{i}).
  3. Proxy: EXTRA_PROXY_ARGS="--engine-state-uri file:///dev/shm/agentic_engine_state ...".
  4. After the run, revert the patch and rm -rf /dev/shm/agentic_engine_state.

ALL policies now read the real state (update)

InstanceState exposes effective accessors used by every picker: eff_num_requests, eff_pending_prefill, eff_ongoing_decode and eff_ongoing_tokens, each equal to max(shadow, real) when the feed is fresh. The real value fixes the 30 s-stale under-count; the shadow value's atomic pre-await reservation still covers the in-flight window, preserving the RaceFix. Two real-only fields, r_max_prefill_remaining and r_kv_used_frac, are exposed alongside. Wired into: load_only, lmetric, sticky, pick_instance (legacy), pick_instance_unified_hybrid (unified / unified_kv_both), pick_instance_unified_v3 (gate + Mechanism B), and snapshot_workers (logged scores now match the decision and include the real fields). Feed off ⇒ real_state is None ⇒ accessors return shadow ⇒ byte-identical to before. Legacy unified_v2 is left on shadow: it is retired and not in the ablation.
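
The accessor rule can be sketched as below. InstanceStateSketch is a hypothetical stand-in for the real InstanceState; the freshness window max_age_s and the mapping of eff_num_requests to num_running + num_waiting are assumptions. Only the max(shadow, real) rule and the shadow-only fallback come from the description above.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class InstanceStateSketch:
    num_requests: int = 0                # shadow counter kept by the router
    real_state: Optional[dict] = None    # last snapshot read from the feed
    max_age_s: float = 1.0               # hypothetical freshness window

    def _fresh(self, now: Optional[float] = None) -> Optional[dict]:
        """Return the real snapshot only if the feed is on and recent."""
        if self.real_state is None:
            return None
        now = time.time() if now is None else now
        if now - self.real_state.get("ts", 0.0) > self.max_age_s:
            return None
        return self.real_state

    def eff_num_requests(self, now: Optional[float] = None) -> int:
        real = self._fresh(now)
        if real is None:
            return self.num_requests     # feed off or stale: shadow only
        # Real corrects a stale under-count; shadow still covers requests
        # reserved by the router but not yet visible to the engine.
        return max(self.num_requests, real["num_running"] + real["num_waiting"])

    def r_max_prefill_remaining(self, now: Optional[float] = None) -> Optional[int]:
        real = self._fresh(now)
        return None if real is None else real["max_prefill_remaining"]
```

Taking the maximum rather than trusting the feed outright is what keeps the in-flight reservation visible: a request the router has just dispatched raises the shadow counter before the engine's next snapshot can include it.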

Ablation (when GPU free)

run_v3_trace.sh gains ES=1 (apply engine-state patch + feed + proxy flag) and always deploys the enhanced proxy (dormant when feed/write-mode off). run_ablation_es.sh runs each config twice (ES=0 vs ES=1) so the only difference is the state source. Default decisive set (4 runs): champion unified+A+B and unified_v3+A+B+layerwise, each ES0/ES1. Extend CONFIGS for lmetric / unified_kv_both / load_only. The comparison covers per-policy TTFT (overall + migrated) and whether the policy ranking changes with ground-truth state.
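
The ranking check at the end of the ablation could look like the sketch below. It is not part of run_ablation_es.sh: the function names and the input shape (a dict from policy name to a list of TTFT samples) are assumptions, and ranking by mean TTFT is one possible choice of summary statistic.

```python
from statistics import mean
from typing import Dict, List


def rank_by_mean_ttft(ttft_by_policy: Dict[str, List[float]]) -> List[str]:
    """Order policies from best (lowest mean TTFT) to worst."""
    # The policy name is a tie-breaker so the order is deterministic.
    return sorted(ttft_by_policy, key=lambda p: (mean(ttft_by_policy[p]), p))


def ranking_changed(es0: Dict[str, List[float]],
                    es1: Dict[str, List[float]]) -> bool:
    """True if the policy order differs between the ES=0 and ES=1 runs."""
    return rank_by_mean_ttft(es0) != rank_by_mean_ttft(es1)
```

The same two functions can be applied separately to the overall and the migrated-only TTFT samples, since the note compares both.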

Status / scope

  • Built + unit-tested (snapshot, round-trip, target scorer, eff_ accessors, end-to-end publish→read→select); NOT yet run against live engines (GPU busy).
  • TP=1 only (one EngineCore/instance → one publisher/engine_id). TP>1 needs per-rank ids.