Merge layerwise KV transfer + engine-state ablation onto main

Brings the worktree-mooncake-layerwise line (layerwise Mooncake connector, write-mode proxy, real engine-state feed + eff_ accessors, mb7 microbench, v3 trace re-profile, A/B x migration matrix runner) into main so the repo is self-contained for these experiments. Disjoint paths (microbench/connector_tax/layerwise/*) => clean merge. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 11:53:40 +08:00
parent ee5db0b321 7947831e0f
commit 8708b75520
22 changed files with 11105 additions and 0 deletions
--- a/microbench/connector_tax/layerwise/DESIGN.md
+++ b/microbench/connector_tax/layerwise/DESIGN.md
@@ -0,0 +1,188 @@
 # Layer-wise KV transfer on Mooncake — exploration
 Goal: make vLLM's `MooncakeConnector` push KV **per-layer during prefill**
 (write mode) instead of the current **post-hoc full-request transfer**, then
 microbench correctness + whether it hides the transfer behind prefill compute
 (the thing MoRIIO's write mode does on AMD; no NVIDIA connector ships it).
 Everything here is isolated in worktree `worktree-mooncake-layerwise`. The
 dash0 venv connector is backed up at `mooncake_connector.py.ORIG_BACKUP`;
 revert = copy the backup back. Opt-in via env `MOONCAKE_LAYERWISE=1`, so with
 the env unset the connector behaves exactly as upstream.
 ## Baseline flow (post-hoc, what we have)
 1. Proxy: prefill on src (`do_remote_decode`, max_tokens=1) → **await done** →
   decode on dst (`do_remote_prefill`) which pulls.
 2. dst `start_load_kv`→`receive_kv` sends ZMQ `MooncakeXferMetadata` (its block
   addrs) to src bootstrap.
 3. src `send_kv_to_decode`: waits `send_meta.ready` (set at `request_finished`,
   i.e. **after full prefill**) → `_build_transfer_params` (all layers) →
   `_send_blocks` (one big `batch_transfer_sync_write`) → FINISH response.
 Measured: this full transfer is on the critical path, runs at ~3 GB/s under
 load (vs ~10 GB/s idle), dominating migration TTFT.
 ## Layer-wise flow (write mode, this exploration)
 Key idea: keep all RDMA + completion on the `sender_loop` thread (clean), but
 issue **one `batch_transfer_sync_write` per layer**, each fired as soon as that
 layer's KV is computed — so writes overlap the remaining prefill compute.
 Signaling: `save_kv_layer(layer_name, ...)` (called by vLLM's attention hook
 after each layer's forward, on the main worker thread) records "layer L
 computed" and wakes the sender_loop. `send_kv_to_decode` loops L=0..N-1,
 waits until L is computed, writes layer L's blocks, then sends FINISH.
 ### Edits to `mooncake_connector.py` (all gated by `_lw_enabled`)
 1. **Worker `__init__`**: `_lw_enabled` (env), layer-name→position map,
   `_lw_computed: dict[transfer_id,int]`, `_lw_active: set[transfer_id]`,
   wake event, lock.
 2. **`register_kv_caches`**: build `_lw_layer_pos[layer_name]` (0..N-1) and
   `_lw_addr_idx[pos]` = indices into `kv_caches_base_addr` (×2 if
   `split_k_and_v`).
 3. **Scheduler `update_state_after_alloc`** (`do_remote_decode` branch): in
   layer-wise mode capture `blocks.get_block_ids()[0]` and store non-empty in
   `_reqs_need_send` so the worker learns local block_ids + sets `ready`
   **before** prefill finishes.
 4. **Worker `note_layer_computed(layer_name)`** (new) called from
   `MooncakeConnector.save_kv_layer`: bump `_lw_computed[tid]` for active
   producers, `call_soon_threadsafe(wake.set)`.
 5. **Worker `send_kv_to_decode`**: in layer-wise mode, mark transfer active,
   loop layers: await `_lw_computed[tid] >= L`, `_send_blocks` for layer L
   only (subset of `_build_transfer_params`), then send FINISH.
 6. **Worker `_build_layer_transfer_params`** (new): like
   `_build_transfer_params` but only the addr indices for one layer position.
 ### Microbench requirements
 - Disable chunked prefill (`--max-num-batched-tokens` ≥ prompt) so prefill is a
  single forward and `save_kv_layer` fires once per layer in order.
 - Dispatch the dst (`do_remote_prefill`) request **first/concurrently** so the
  ZMQ handshake reaches src during prefill.
 - Correctness: dst follow-up `cached_tokens == prompt_len` (KV landed),
  identical to baseline.
 - Perf: src prefill wall-clock (does layer-wise slow it?) and dst TTFT (does
  transfer leave the critical path?), swept over KV size, vs baseline.
 ## Status
 - [x] worktree + connector backup + design
 - [x] modified connector (LAYERWISE.py, +193/-4 lines, env-gated)
 - [x] correctness microbench (mb7_layerwise.py) + launcher (run_mb7.sh)
 - [x] correctness run on dash0 — PASS (KV lands; cached == prompt)
 - [x] perf run + verdict — POSITIVE (transfer hidden behind prefill)
 ## Results (2-instance, idle, chunked-prefill off, Qwen3-30B-A3B, 48 layers)
 Metric: `overhead = total − prefill_only` = the transfer cost left on the
 critical path (TTFT). Baseline = post-hoc full pull (sequential).
 | KV size | baseline overhead | **layerwise overhead** | reduction |
 |--------:|------------------:|-----------------------:|----------:|
 | 8192 (0.75 GiB)  | 123 ms | **58 ms** | 2.1× |
 | 16384 (1.5 GiB)  | 202 ms | **58 ms** | 3.5× |
 | 32768 (3.0 GiB)  | 529 ms | **57 ms** | 9.3× |
 Key signatures:
 - **Layerwise overhead is ~constant (~58 ms)** regardless of KV size, while
  baseline grows O(KV size). The 58 ms is handshake + last-layer tail + 1
  decode; the bulk transfer is hidden behind prefill compute.
 - **Prefill did NOT slow down**: layerwise `t_A` (575/1495/4440 ms) ==
  `prefill_only` (574/1492/4440 ms). The concurrent RDMA was "free" on idle
  GPUs — no measurable HBM contention with prefill compute here.
 - Producer logs confirm the transfer itself took 0.39/0.55/4.37 s (grows with
  size) yet ran *inside* the prefill window, so it left the critical path.
 - **Correctness PASS**: B's follow-up cached == prompt for all sizes; the
  48-layer / 96-base-addr (split K&V) per-layer addressing is correct.
 ## Caveats (why this is a proof-of-concept, not a verdict for production)
 1. **Idle instances only.** Real migration happens between *busy* instances.
   Under load both prefill and transfer slow; transfer (even at ~3 GB/s) is
   still < prefill for big contexts so it should still hide, but receive-side
   (B) and HBM contention during prefill are untested here. NEXT: rerun with
   background load on both A and B.
 2. **Chunked prefill disabled.** The monotonic layer counter assumes one
   forward, layers in order. Production uses chunked prefill (multi-step),
   which needs per-(chunk,layer) tracking — not implemented.
 3. **Single concurrent producer transfer.** Global counter; real migration is
   concurrent. Would need per-transfer state.
 4. **Microbench dispatch.** mb7 fires B then A with a 50 ms head start to get
   the handshake to A before its forward. The real proxy path
   (`_handle_combined_pd_sep_v2`) dispatches sequentially and would need the
   write-mode (concurrent) restructure.
 ## Results under LOAD (bg=16 background decode streams, 8 per instance)
 Critical-path transfer overhead (ms), `total − prefill_only`:
 | KV size | idle base | idle LW | **load base** | **load LW** |
 |--------:|----------:|--------:|--------------:|------------:|
 | 8k  | 123 | 58 | 158 | **94** |
 | 16k | 202 | 58 | 239 | **83** |
 | 32k | 529 | 57 | **749** | **95** |
 The overlap **survives load**: layerwise overhead stays ~constant (~90 ms)
 under load while baseline grows to 749 ms at 32k (7.9× reduction). Prefill did
 not slow (load LW `t_A` == load `prefill_only`); the transfer (0.56/1.46/4.37 s,
 producer logs) ran inside the prefill window even with 16 concurrent decodes.
 Correctness PASS under load.
 ## FULL 1200-req v3 TRACE re-profile (chunk-safe + concurrent + write-mode)
 Hardened connector (per-step incremental shipping, per-transfer state) +
 write-mode proxy (concurrent prefill/decode dispatch). Two passes of
 `w600_r0.0015_st30.jsonl` under `unified_v3`, differing only in transfer mode.
 Correctness: layer-wise **1213/1214 success** (1 connection-error on the 128k
 req, not KV corruption); byte-level KV correctness validated on mb7
 (chunked + 3-way concurrent, `cached==prompt`); producer logs confirm
 incremental shipping (e.g. `shipped 7872/7872 blocks`).
 Migration sets differ between runs (write-mode timing shifts which requests
 trigger migration; only 4 migrated in both), but are distributionally
 comparable (median new_local/input ≈ 0.42 vs 0.46). **Matched migrations
 all improved**, scaling with the transfer hidden behind prefill:
 | request | input | new_local | base TTFT | LW TTFT | Δ |
 |---|--:|--:|--:|--:|--:|
 | 1268630 | 102k | 97k | 41.20 | 33.96 | **−7.23s** |
 | 1334223 | 37k | 14k | 6.04 | 3.23 | −2.81s |
 | 1279412 | 40k | 8k | 5.50 | 2.92 | −2.58s |
 | 1271459 | 8.9k | 8.9k | 37.01 | 36.98 | −0.03s (queue-bound) |
 Trace-level TTFT (different sets, directional): overall p90 9.79→9.16 (−6%),
 p99 44.89→42.85 (−5%). **Modest** because (a) migrations are only 25/1214 ≈
 **2%** of requests, and (b) several migrations are queue/contention-bound, not
 transfer-bound — layer-wise removes the transfer component but not the
 control-plane/queue residual (the ~45% from the b3_v3_fullbreak profile).
 **Verdict on the trace re-profile:** layer-wise does exactly what the profile
 predicted — it removes the transfer half of migration overhead (matched
 migrations −2.6 to −7.2s, biggest where there's the most prefill to hide
 behind), but the trace-level gain is small because migrations are rare and
 partly queue-bound. It does NOT, on its own, flip migration to a clear win
 over unified for this agentic workload.
 ## Verdict (microbench)
 The mechanism **works and the benefit holds under load**: layer-wise push turns
 migration's KV-transfer cost from O(KV size) on the critical path into a
 near-constant ~90 ms tail, by overlapping it with prefill compute — what
 MoRIIO's write mode does on AMD, now demonstrated on NVIDIA/Mooncake.
 **BUT this is single-transfer, non-chunked.** Running the actual 1200-req trace
 correctly needs two more pieces this PoC does NOT have:
 1. **Chunk-safe tracking** — long agentic prompts force chunked prefill;
   `save_kv_layer` then fires per-chunk and the monotonic counter would ship
   uncomputed blocks. Needs slot-mapping-aware per-(request,chunk) tracking.
 2. **Concurrent-transfer safety** — the global counter assumes one producer at
   a time; the trace migrates from busy instances running other forwards.
 Also: even with those fixed, layer-wise only removes the **transfer half** of
 the measured migration overhead. The b3_v3_fullbreak profile showed dst-side
 `T_kv_pull` = ~55% RDMA + ~45% control-plane GIL-dispatch stalls; layer-wise
 hides the RDMA half but the control-plane half is orthogonal. So a trace
 re-profile would show roughly the transfer half collapse, not the whole thing.
--- a/microbench/connector_tax/layerwise/P2_ENGINE_STATE.md
+++ b/microbench/connector_tax/layerwise/P2_ENGINE_STATE.md
@@ -0,0 +1,82 @@
 # P2: real engine-state feed for migration target selection
 Problem: the router (`cache_aware_proxy.py`) decides migration targets from
 **shadow counters** it maintains itself (incremented at dispatch, decremented
 at completion) and reconciles to vLLM `/metrics` only every **30 s**
 (`_reconcile_loop`). So every routing/migration decision is on stale state.
 Worse, the signal that predicts the ~45% control-plane stall — *is the target
 mid-large-prefill?* (a big prefill holds the GIL and starves the mooncake
 receiver_loop) — isn't visible at all, and `/metrics` doesn't expose it either.
 Fix: vLLM publishes **real** per-engine state to a shared store ~20 Hz; the
 router reads ground truth and avoids GIL-stall / capacity-wall targets.
 ## Components (all unit-tested without GPUs)
 - `engine_state.py` — canonical `compute_snapshot(scheduler, id)`, `StateWriter`,
  `StateReader`. Schema per engine: `ts, num_running, num_waiting,
  gpu_blocks_total/free, gpu_kv_used_frac, pending_prefill_tokens,
  ongoing_decode_tokens, num_prefilling, max_prefill_remaining`.
 - `instrument_engine_state.py` — vLLM `Scheduler` patch (apply/revert markers
  `ES_INSTRUMENT_*`): a daemon thread publishes the snapshot every
  `AGENTIC_ENGINE_STATE_PERIOD_MS` (50 ms) off the forward hot path. Inlined
  writer (engine process needs no repo import). Coexists with MB5.
 - `migration_target.py` — pure target scorer: avoid `max_prefill_remaining ≥
  es_big_prefill_threshold` (GIL stall) and `gpu_kv_used_frac ≥ es_kv_wall_frac`
  (capacity wall), then rank by cache-richness and **real** load.
 - `cache_aware_proxy.WRITEMODE.py` — wired: `InstanceState.real_state`,
  `_engine_state_poll_loop` (instance i ← `engine_{i}`), `_real_load`/Gate-3 and
  Mechanism-B now real-state-aware. `--engine-state-uri` flag; off ⇒ identical
  to before (shadow only).
 Transport (`AGENTIC_ENGINE_STATE_URI` / `--engine-state-uri`):
 `file:///dev/shm/agentic_engine_state` (default, zero-dep, single-node) or
 `redis://host:port/0` (multi-node; needs redis-py + server — not installed on
 dash0, so file backend is the working default).
 ## Tests (no GPU)
 - `compute_snapshot` field math (mock scheduler): running/waiting,
  max_prefill_remaining, pending, decode, kv_used_frac.
 - writer→reader round-trip + staleness drop (file backend).
 - target scorer: 5 cases incl. *avoid GIL-stall target even when its shadow
  load is lower*, *real load beats stale shadow*, *cache-rich wins*,
  *avoid KV wall*, *graceful fallback when feed missing*.
 - end-to-end: publish 8 engines (one mid-130k-prefill) → proxy inlined reader →
  target selection avoids it.
 ## Enabling in a GPU run (when free)
 1. `instrument_engine_state.py --apply` on the dash0 venv.
 2. `export AGENTIC_ENGINE_STATE_URI=file:///dev/shm/agentic_engine_state`
   before the launcher (vLLM instances inherit it; `AGENTIC_WORKER_ID=engine_{i}`
   already set by `b3_isolated_policy.sh` → publishes as `engine_{i}`).
 3. Proxy: `EXTRA_PROXY_ARGS="--engine-state-uri file:///dev/shm/agentic_engine_state ..."`.
 4. Revert the patch + `rm -rf /dev/shm/agentic_engine_state` after.
 ## ALL policies now read the real state (update)
 `InstanceState` exposes effective accessors used by **every** picker:
 `eff_num_requests / eff_pending_prefill / eff_ongoing_decode /
 eff_ongoing_tokens` = `max(shadow, real)` when the feed is fresh (real fixes
 the 30s-stale under-count; shadow's atomic pre-await reservation still covers
 the in-flight window, preserving the RaceFix), plus real-only
 `r_max_prefill_remaining / r_kv_used_frac`. Wired into: `load_only`, `lmetric`,
 `sticky`, `pick_instance` (legacy), `pick_instance_unified_hybrid`
 (unified / unified_kv_both), `pick_instance_unified_v3` (gate + Mechanism B),
 and `snapshot_workers` (logged scores now match the decision + real fields).
 Feed off ⇒ `real_state is None` ⇒ accessors return shadow ⇒ byte-identical to
 before. (legacy `unified_v2` left on shadow — retired, not in the ablation.)
 ## Ablation (when GPU free)
 `run_v3_trace.sh` gains `ES=1` (apply engine-state patch + feed + proxy flag)
 and always deploys the enhanced proxy (dormant when feed/write-mode off).
 `run_ablation_es.sh` runs each config twice (ES=0 vs ES=1) so the only
 difference is the state source. Default decisive set (4 runs): champion
 `unified+A+B` and `unified_v3+A+B+layerwise`, each ES0/ES1. Extend CONFIGS for
 `lmetric` / `unified_kv_both` / `load_only`. Compares per-policy TTFT
 (overall + migrated) and whether the **ranking** changes with ground-truth
 state.
 ## Status / scope
 - Built + unit-tested (snapshot, round-trip, target scorer, eff_ accessors,
  end-to-end publish→read→select); NOT yet run against live engines (GPU busy).
 - TP=1 only (one EngineCore/instance → one publisher/engine_id). TP>1 needs
  per-rank ids.
--- a/microbench/connector_tax/layerwise/cache_aware_proxy.WRITEMODE.py
+++ b/microbench/connector_tax/layerwise/cache_aware_proxy.WRITEMODE.py
--- a/microbench/connector_tax/layerwise/engine_state.py
+++ b/microbench/connector_tax/layerwise/engine_state.py
@@ -0,0 +1,140 @@
 #!/usr/bin/env python3
 """Engine-state store: canonical snapshot + writer/reader, shared schema.
 The vLLM scheduler patch (instrument_engine_state.py) inlines a faithful copy
 of `compute_snapshot` + the file/redis writer (engine process needs no repo
 import). The router (cache_aware_proxy) imports `StateReader` here to read the
 real per-engine state instead of its stale shadow counters.
 Schema (one record per engine, key = engine_id):
  ts, engine_id, num_running, num_waiting, gpu_blocks_total, gpu_blocks_free,
  gpu_kv_used_frac, pending_prefill_tokens, ongoing_decode_tokens,
  num_prefilling, max_prefill_remaining
 Transport URIs:
  file:///dev/shm/agentic_engine_state   (default; atomic temp+rename)
  redis://host:port/0                      (optional; needs redis-py)
 """
 from __future__ import annotations
 import json
 import os
 import time
 def compute_snapshot(scheduler, engine_id: str) -> dict:
    """Cheap O(batch) read of routing-relevant real state from a live
    vLLM V1 Scheduler (duck-typed for testability)."""
    try:
        pool = scheduler.kv_cache_manager.block_pool
        total = int(pool.num_gpu_blocks)
        free = int(pool.get_num_free_blocks())
    except Exception:
        total = free = -1
    n_run = pend = dec = n_pref = max_pref = 0
    try:
        for r in scheduler.running:
            n_run += 1
            npr = int(getattr(r, "num_prompt_tokens", 0))
            nct = int(getattr(r, "num_computed_tokens", 0))
            if nct < npr:
                rem = npr - nct
                pend += rem
                n_pref += 1
                max_pref = max(max_pref, rem)
            else:
                dec += int(getattr(r, "num_tokens", 0))
    except Exception:
        pass
    n_wait = 0
    try:
        n_wait = len(scheduler.waiting) + len(getattr(scheduler, "skipped_waiting", []))
        for r in list(scheduler.waiting):
            pend += max(0, int(getattr(r, "num_prompt_tokens", 0))
                        - int(getattr(r, "num_computed_tokens", 0)))
    except Exception:
        pass
    used = ((total - free) / total) if (total and total > 0) else -1.0
    return {
        "ts": time.time(),
        "engine_id": engine_id,
        "num_running": n_run,
        "num_waiting": int(n_wait),
        "gpu_blocks_total": total,
        "gpu_blocks_free": free,
        "gpu_kv_used_frac": used,
        "pending_prefill_tokens": int(pend),
        "ongoing_decode_tokens": int(dec),
        "num_prefilling": n_pref,
        "max_prefill_remaining": int(max_pref),
    }
 class StateWriter:
    def __init__(self, uri: str, engine_id: str):
        self.engine_id = engine_id
        self.kind = None
        if uri.startswith("file://"):
            self.kind = "file"
            self.dir = uri[len("file://"):]
            os.makedirs(self.dir, exist_ok=True)
            self.path = os.path.join(self.dir, f"{engine_id}.json")
            self.tmp = self.path + f".tmp.{os.getpid()}"
        elif uri.startswith("redis://"):
            self.kind = "redis"
            import redis
            self.r = redis.Redis.from_url(uri)
            self.key = f"engine_state:{engine_id}"
        else:
            raise ValueError(f"unsupported engine-state URI: {uri}")
    def publish(self, state: dict):
        if self.kind == "file":
            with open(self.tmp, "w") as f:
                f.write(json.dumps(state))
            os.replace(self.tmp, self.path)
        elif self.kind == "redis":
            self.r.set(self.key, json.dumps(state), ex=5)
 class StateReader:
    """Router-side reader. read_all() returns {engine_id: state}, dropping
    records older than max_age_s (so a dead/hung engine is ignored)."""
    def __init__(self, uri: str, max_age_s: float = 2.0):
        self.uri = uri
        self.max_age_s = max_age_s
        self.kind = None
        if uri.startswith("file://"):
            self.kind = "file"
            self.dir = uri[len("file://"):]
        elif uri.startswith("redis://"):
            self.kind = "redis"
            import redis
            self.r = redis.Redis.from_url(uri)
        else:
            raise ValueError(f"unsupported engine-state URI: {uri}")
    def read_all(self) -> dict[str, dict]:
        now = time.time()
        out: dict[str, dict] = {}
        try:
            if self.kind == "file":
                import glob
                for p in glob.glob(os.path.join(self.dir, "*.json")):
                    try:
                        s = json.load(open(p))
                    except Exception:
                        continue
                    if now - s.get("ts", 0) <= self.max_age_s:
                        out[s.get("engine_id", os.path.basename(p)[:-5])] = s
            elif self.kind == "redis":
                for k in self.r.scan_iter("engine_state:*"):
                    v = self.r.get(k)
                    if not v:
                        continue
                    s = json.loads(v)
                    if now - s.get("ts", 0) <= self.max_age_s:
                        out[s.get("engine_id")] = s
        except Exception:
            pass
        return out
--- a/microbench/connector_tax/layerwise/instrument_engine_state.py
+++ b/microbench/connector_tax/layerwise/instrument_engine_state.py
@@ -0,0 +1,234 @@
 #!/usr/bin/env python3
 """Patch vLLM V1 scheduler to publish REAL engine state to a shared store,
 so the global router reads ground truth instead of its own stale shadow
 counters (reconciled only every 30s).
 Published per engine (key = AGENTIC_ENGINE_ID), throttled ~20 Hz from a
 daemon thread (off the forward hot path):
  {ts, num_running, num_waiting, gpu_blocks_total, gpu_blocks_free,
   gpu_kv_used_frac, pending_prefill_tokens, ongoing_decode_tokens,
   num_prefilling, max_prefill_remaining}
 `max_prefill_remaining` is the key signal /metrics does NOT expose: the
 largest in-progress prefill on the engine. A big in-progress prefill holds
 the GIL and stalls the mooncake receiver_loop — so the router should avoid
 migrating KV to such an instance (P2).
 Transport (env AGENTIC_ENGINE_STATE_URI):
  file:///dev/shm/agentic_engine_state   (default; atomic temp+rename)
  redis://host:port/0                      (optional; needs redis-py + server)
 Self-contained (inlined writer) so the engine process needs no repo import.
 Apply/revert markers: # ES_INSTRUMENT_START / # ES_INSTRUMENT_END.
 Usage:
  python instrument_engine_state.py --apply  [--venv PATH]
  python instrument_engine_state.py --revert [--venv PATH]
  python instrument_engine_state.py --check  [--venv PATH]
 """
 from __future__ import annotations
 import argparse
 import re
 from pathlib import Path
 DEFAULT_VENV = Path("/home/admin/cpfs/wjh/agentic-kv/.venv")
 TARGET_REL = "lib/python3.12/site-packages/vllm/v1/core/sched/scheduler.py"
 START = "# ES_INSTRUMENT_START"
 END = "# ES_INSTRUMENT_END"
 # ---- Patch 1: header (writer + publisher thread), before class Scheduler ----
 HEADER_ANCHOR = "class Scheduler(SchedulerInterface):"
 HEADER = f'''{START}
 import json as _es_json
 import os as _es_os
 import threading as _es_threading
 import time as _es_time
 _ES_URI = _es_os.environ.get("AGENTIC_ENGINE_STATE_URI", "")
 _ES_ID = _es_os.environ.get("AGENTIC_ENGINE_ID") or _es_os.environ.get(
    "AGENTIC_WORKER_ID", f"engine_{{_es_os.getpid()}}")
 _ES_PERIOD_S = float(_es_os.environ.get("AGENTIC_ENGINE_STATE_PERIOD_MS", "50")) / 1000.0
 class _ESWriter:
    """Pluggable state writer: file:// (atomic temp+rename) or redis://."""
    def __init__(self, uri: str, engine_id: str):
        self.engine_id = engine_id
        self.kind = None
        if uri.startswith("file://"):
            self.kind = "file"
            self.dir = uri[len("file://"):]
            _es_os.makedirs(self.dir, exist_ok=True)
            self.path = _es_os.path.join(self.dir, f"{{engine_id}}.json")
            self.tmp = self.path + f".tmp.{{_es_os.getpid()}}"
        elif uri.startswith("redis://"):
            self.kind = "redis"
            import redis  # lazy
            self.r = redis.Redis.from_url(uri)
            self.key = f"engine_state:{{engine_id}}"
    def publish(self, state: dict):
        try:
            if self.kind == "file":
                with open(self.tmp, "w") as f:
                    f.write(_es_json.dumps(state))
                _es_os.replace(self.tmp, self.path)  # atomic
            elif self.kind == "redis":
                self.r.set(self.key, _es_json.dumps(state), ex=5)
        except Exception:
            pass
 def _es_compute_snapshot(scheduler) -> dict:
    """Cheap O(batch) state read from the live scheduler."""
    try:
        kvm = scheduler.kv_cache_manager
        pool = kvm.block_pool
        total = int(pool.num_gpu_blocks)
        free = int(pool.get_num_free_blocks())
    except Exception:
        total = free = -1
    n_run = 0
    pend = 0
    dec = 0
    n_pref = 0
    max_pref = 0
    try:
        for r in scheduler.running:
            n_run += 1
            npr = int(getattr(r, "num_prompt_tokens", 0))
            nct = int(getattr(r, "num_computed_tokens", 0))
            if nct < npr:  # still prefilling
                rem = npr - nct
                pend += rem
                n_pref += 1
                if rem > max_pref:
                    max_pref = rem
            else:  # decoding
                dec += int(getattr(r, "num_tokens", 0))
    except Exception:
        pass
    n_wait = 0
    try:
        n_wait = len(scheduler.waiting) + len(getattr(scheduler, "skipped_waiting", []))
        for r in list(scheduler.waiting):
            pend += max(0, int(getattr(r, "num_prompt_tokens", 0))
                        - int(getattr(r, "num_computed_tokens", 0)))
    except Exception:
        pass
    used_frac = ((total - free) / total) if (total and total > 0) else -1.0
    return {{
        "ts": _es_time.time(),
        "engine_id": _ES_ID,
        "num_running": n_run,
        "num_waiting": int(n_wait),
        "gpu_blocks_total": total,
        "gpu_blocks_free": free,
        "gpu_kv_used_frac": used_frac,
        "pending_prefill_tokens": int(pend),
        "ongoing_decode_tokens": int(dec),
        "num_prefilling": n_pref,
        "max_prefill_remaining": int(max_pref),
    }}
 class _ESPublisher:
    def __init__(self, scheduler):
        self._sched = scheduler
        self._writer = _ESWriter(_ES_URI, _ES_ID)
        self._stop = _es_threading.Event()
        self._t = _es_threading.Thread(target=self._loop, daemon=True)
        self._t.start()
    def _loop(self):
        while not self._stop.is_set():
            try:
                self._writer.publish(_es_compute_snapshot(self._sched))
            except Exception:
                pass
            _es_time.sleep(_ES_PERIOD_S)
 {END}
 '''
 # ---- Patch 2: start the publisher at the end of Scheduler.__init__ ----------
 # Anchor on the existing agentic step-log block tail in __init__.
 INIT_ANCHOR = """        _step_path = _os.environ.get("AGENTIC_STEP_LOG_PATH")"""
 INIT_INSERT = f"""        {START}
        if _ES_URI:
            try:
                self._es_publisher = _ESPublisher(self)
                logger.info("agentic engine-state publisher: uri=%s id=%s",
                            _ES_URI, _ES_ID)
            except Exception as _e:
                logger.warning("engine-state publisher disabled (%r)", _e)
        {END}
        _step_path = _os.environ.get("AGENTIC_STEP_LOG_PATH")"""
 PATCHES = [
    ("header", HEADER_ANCHOR, HEADER + HEADER_ANCHOR),
    ("init", INIT_ANCHOR, INIT_INSERT),
 ]
 def find_target(venv: Path) -> Path:
    for c in (venv / TARGET_REL, DEFAULT_VENV / TARGET_REL):
        if c.is_file():
            return c
    raise FileNotFoundError(f"cannot find {TARGET_REL} under {venv}")
 def is_patched(t: str) -> bool:
    return START in t
 def apply(target: Path):
    text = target.read_text()
    if is_patched(text):
        print(f"[es-instr] already patched: {target}")
        return
    new = text
    for name, src, dst in PATCHES:
        if src not in new:
            raise RuntimeError(f"patch {name!r}: anchor not found in {target}")
        new = new.replace(src, dst, 1)
    target.write_text(new)
    print(f"[es-instr] applied {len(PATCHES)} patches -> {target}")
 def revert(target: Path):
    text = target.read_text()
    if not is_patched(text):
        print(f"[es-instr] not patched: {target}")
        return
    pat = re.compile(r"[ \t]*" + re.escape(START) + r".*?" + re.escape(END) + r"\n",
                     flags=re.DOTALL)
    new = pat.sub("", text)
    new = re.sub(r"\n{3,}class Scheduler\(", "\n\nclass Scheduler(", new)
    target.write_text(new)
    print(f"[es-instr] reverted: {target}")
 def main():
    p = argparse.ArgumentParser()
    p.add_argument("--apply", action="store_true")
    p.add_argument("--revert", action="store_true")
    p.add_argument("--check", action="store_true")
    p.add_argument("--venv", type=Path, default=DEFAULT_VENV)
    a = p.parse_args()
    t = find_target(a.venv)
    if a.apply:
        apply(t)
    elif a.revert:
        revert(t)
    elif a.check:
        print(f"[es-instr] {'PATCHED' if is_patched(t.read_text()) else 'CLEAN'}: {t}")
    else:
        p.error("specify --apply/--revert/--check")
 if __name__ == "__main__":
    main()
--- a/microbench/connector_tax/layerwise/mb7_layerwise.py
+++ b/microbench/connector_tax/layerwise/mb7_layerwise.py
@@ -0,0 +1,294 @@
 #!/usr/bin/env python3
 """MB7: correctness + perf of layer-wise KV push vs post-hoc transfer.
 Two 2-instance modes against A (src/producer) and B (dst/consumer):
  baseline  : prefill A (await) -> THEN B pulls (post-hoc full transfer).
              T_total = T_prefill + T_xfer  (sequential)
  layerwise : dispatch B's remote-prefill (handshake) and A's prefill
              CONCURRENTLY, so A pushes each layer as it computes it.
              If overlap works, T_total ~= max(T_prefill, T_xfer) ~= T_prefill.
 Reference: T_prefill_only = a plain prefill on A with no transfer.
 Correctness: after the transfer, a plain follow-up to B on the same prompt
 must report cached_tokens >= ~prompt_len (the KV actually landed on B).
 The connector mode is selected by the launcher (run_mb7.sh): baseline uses the
 stock connector; layerwise deploys mooncake_connector.LAYERWISE.py +
 MOONCAKE_LAYERWISE=1. This script just drives the requests and measures.
 Usage:
  python mb7_layerwise.py --mode layerwise --sizes 8192,32768,65536 --repeats 3 \
      --src-port 8000 --dst-port 8001 --src-bp 8998 --dst-bp 8999 --out mb7.json
 """
 from __future__ import annotations
 import argparse
 import asyncio
 import json
 import statistics
 import time
 import uuid
 from pathlib import Path
 import httpx
 MODEL = "/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct"
 KV_PER_TOK = 98304
 def synth_prompt(seed: int, n: int) -> list[int]:
    import random
    rng = random.Random(seed)
    return [rng.randint(100, 150000) for _ in range(n)]
 async def get_engine_id(client, host, bp):
    r = await client.get(f"http://{host}:{bp}/query")
    r.raise_for_status()
    return r.json()["0"]["engine_id"]
 async def completion(client, host, port, prompt, max_tokens, ktp=None):
    payload = {
        "model": MODEL, "prompt": prompt, "max_tokens": max_tokens,
        "min_tokens": max_tokens if max_tokens == 1 else 1,
        "temperature": 0.0, "stream": False,
    }
    if ktp:
        payload["kv_transfer_params"] = ktp
    t0 = time.perf_counter()
    r = await client.post(f"http://{host}:{port}/v1/completions",
                          json=payload, timeout=600.0)
    dt = time.perf_counter() - t0
    r.raise_for_status()
    return dt, r.json()
 def cached_of(resp) -> int:
    usage = resp.get("usage") or {}
    det = usage.get("prompt_tokens_details") or {}
    return det.get("cached_tokens", 0) or usage.get("cached_tokens", 0) or 0
 async def _stream_completion(client, host, port, prompt, max_tokens):
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens,
               "min_tokens": 1, "temperature": 0.0, "stream": True}
    async with client.stream("POST", f"http://{host}:{port}/v1/completions",
                             json=payload, timeout=600.0) as r:
        r.raise_for_status()
        async for _ in r.aiter_bytes():
            pass
 class BackgroundLoad:
    """Hold N concurrent long-decode streams across endpoints to keep busy."""
    def __init__(self, client, endpoints, n, prompt_tokens=2000, out_tokens=6000):
        self.client, self.endpoints, self.n = client, endpoints, n
        self.pt, self.ot = prompt_tokens, out_tokens
        self._stop = asyncio.Event()
        self._tasks = []
    async def _w(self, idx):
        host, port = self.endpoints[idx % len(self.endpoints)]
        seed = 800000 + idx
        while not self._stop.is_set():
            try:
                await _stream_completion(self.client, host, port,
                                         synth_prompt(seed, self.pt), self.ot)
            except Exception:
                await asyncio.sleep(0.5)
            seed += 1
    def start(self):
        self._tasks = [asyncio.create_task(self._w(i)) for i in range(self.n)]
    async def stop(self):
        self._stop.set()
        for t in self._tasks:
            t.cancel()
        await asyncio.gather(*self._tasks, return_exceptions=True)
 async def num_running(client, host, port):
    try:
        r = await client.get(f"http://{host}:{port}/metrics", timeout=5.0)
        for line in r.text.splitlines():
            if line.startswith("vllm:num_requests_running"):
                return int(float(line.split()[-1]))
    except Exception:
        pass
    return -1
 async def prefill_only(client, host, port, prompt):
    """Reference: plain prefill cost on A, no transfer."""
    dt, _ = await completion(client, host, port, prompt, max_tokens=1)
    return dt
 async def measure_baseline(client, A, B, src_eid, src_bp_addr, prompt, seed):
    tid = uuid.uuid4().hex
    t0 = time.perf_counter()
    t_pf, _ = await completion(client, *A, prompt, 1,
                               ktp={"do_remote_decode": True, "transfer_id": tid})
    t_xfer, _ = await completion(client, *B, prompt, 1,
                                 ktp={"do_remote_prefill": True, "transfer_id": tid,
                                      "remote_engine_id": src_eid,
                                      "remote_bootstrap_addr": src_bp_addr})
    t_total = time.perf_counter() - t0
    # correctness: B follow-up should hit cache
    _, fr = await completion(client, *B, prompt, 1)
    return {"t_prefill_s": t_pf, "t_xfer_s": t_xfer, "t_total_s": t_total,
            "cached": cached_of(fr)}
 async def measure_layerwise(client, A, B, src_eid, src_bp_addr, prompt, seed):
    """Dispatch B handshake + A prefill concurrently => layer-wise overlap."""
    tid = uuid.uuid4().hex
    t0 = time.perf_counter()
    async def run_B():
        return await completion(client, *B, prompt, 1,
                                ktp={"do_remote_prefill": True, "transfer_id": tid,
                                     "remote_engine_id": src_eid,
                                     "remote_bootstrap_addr": src_bp_addr})
    async def run_A():
        # small head start for B's handshake to reach A before A's forward
        await asyncio.sleep(0.05)
        return await completion(client, *A, prompt, 1,
                                ktp={"do_remote_decode": True, "transfer_id": tid})
    b_task = asyncio.create_task(run_B())
    a_task = asyncio.create_task(run_A())
    (t_b, _), (t_a, _) = await asyncio.gather(b_task, a_task)
    t_total = time.perf_counter() - t0
    _, fr = await completion(client, *B, prompt, 1)
    return {"t_A_s": t_a, "t_B_s": t_b, "t_total_s": t_total,
            "cached": cached_of(fr)}
 async def main_async(a):
    sizes = [int(s) for s in a.sizes.split(",")]
    A = (a.src_host, a.src_port)
    B = (a.dst_host, a.dst_port)
    limits = httpx.Limits(max_connections=64, max_keepalive_connections=64)
    async with httpx.AsyncClient(limits=limits, trust_env=False) as client:
        src_eid = await get_engine_id(client, a.src_host, a.src_bp)
        src_bp_addr = f"http://{a.src_host}:{a.src_bp}"
        print(f"[mb7] mode={a.mode} bg_load={a.bg_load} src_eid={src_eid[:16]}...")
        loader = None
        if a.bg_load > 0:
            loader = BackgroundLoad(client, [A, B], a.bg_load)
            loader.start()
            print(f"[mb7] ramping background load ({a.bg_load}) ...")
            for _ in range(40):
                await asyncio.sleep(1.0)
                na = await num_running(client, *A)
                nb = await num_running(client, *B)
                if na >= 1 and nb >= 1:
                    print(f"[mb7] busy: A_run={na} B_run={nb}")
                    break
        # --- concurrent correctness mode: fire N transfers at once ----------
        if a.concurrent > 1 and a.mode == "layerwise":
            print(f"[mb7] CONCURRENT correctness: {a.concurrent} simultaneous "
                  f"transfers per size (src=A stresses concurrent producing)")
            all_ok = True
            for sz in sizes:
                tasks = [
                    asyncio.create_task(measure_layerwise(
                        client, A, B, src_eid, src_bp_addr,
                        synth_prompt(sz * 1000 + j, sz), sz * 1000 + j))
                    for j in range(a.concurrent)
                ]
                rows = await asyncio.gather(*tasks)
                oks = [r["cached"] >= int(sz * 0.9) for r in rows]
                all_ok = all_ok and all(oks)
                print(f"  sz={sz:>6} x{a.concurrent}: cached="
                      f"{[r['cached'] for r in rows]} correct={oks}")
            print(f"[mb7] CONCURRENT correctness: "
                  f"{'ALL PASS' if all_ok else 'FAILURE'}")
            if loader:
                await loader.stop()
            return
        results = []
        for sz in sizes:
            for rep in range(a.repeats):
                prompt = synth_prompt(sz * 100 + rep, sz)
                # reference prefill-only cost (fresh prompt, different seed so no cache)
                t_pf_only = await prefill_only(
                    client, *A, synth_prompt(sz * 100 + rep + 555, sz))
                if a.mode == "baseline":
                    row = await measure_baseline(client, A, B, src_eid, src_bp_addr,
                                                 prompt, sz * 100 + rep)
                else:
                    row = await measure_layerwise(client, A, B, src_eid, src_bp_addr,
                                                  prompt, sz * 100 + rep)
                row.update({"mode": a.mode, "size": sz, "rep": rep,
                            "t_prefill_only_s": t_pf_only,
                            "kv_gib": sz * KV_PER_TOK / 2**30,
                            "correct": row["cached"] >= int(sz * 0.9)})
                results.append(row)
                extra = (f"xfer={row.get('t_xfer_s', 0)*1000:.0f}ms"
                         if a.mode == "baseline"
                         else f"tA={row.get('t_A_s',0)*1000:.0f}ms tB={row.get('t_B_s',0)*1000:.0f}ms")
                print(f"  sz={sz:>6} rep={rep} pf_only={t_pf_only*1000:6.0f}ms "
                      f"total={row['t_total_s']*1000:7.0f}ms {extra} "
                      f"cached={row['cached']}/{sz} correct={row['correct']}")
        if loader:
            await loader.stop()
    # summary
    print(f"\n=== {a.mode} (bg={a.bg_load}) summary ===")
    print(f"{'size':>7} {'n':>2} {'pf_only_ms':>11} {'total_ms':>9} "
          f"{'overhead_ms':>12} {'correct':>8}")
    summary = []
    for sz in sizes:
        rs = [r for r in results if r["size"] == sz]
        if not rs:
            continue
        pf = statistics.median(r["t_prefill_only_s"] for r in rs) * 1000
        tot = statistics.median(r["t_total_s"] for r in rs) * 1000
        allok = all(r["correct"] for r in rs)
        # overhead = total - prefill_only = the part NOT hidden behind prefill
        overhead = tot - pf
        summary.append({"size": sz, "n": len(rs), "pf_only_ms": pf,
                        "total_ms": tot, "overhead_ms": overhead,
                        "all_correct": allok})
        print(f"{sz:>7} {len(rs):>2} {pf:>11.0f} {tot:>9.0f} {overhead:>12.0f} "
              f"{str(allok):>8}")
    Path(a.out).write_text(json.dumps(
        {"mode": a.mode, "model": MODEL, "raw": results, "summary": summary}, indent=2))
    print(f"\n[mb7] wrote {a.out}")
 def main():
    p = argparse.ArgumentParser()
    p.add_argument("--mode", choices=["baseline", "layerwise"], required=True)
    p.add_argument("--src-host", default="127.0.0.1")
    p.add_argument("--dst-host", default="127.0.0.1")
    p.add_argument("--src-port", type=int, default=8000)
    p.add_argument("--dst-port", type=int, default=8001)
    p.add_argument("--src-bp", type=int, default=8998)
    p.add_argument("--dst-bp", type=int, default=8999)
    p.add_argument("--sizes", default="8192,32768,65536")
    p.add_argument("--repeats", type=int, default=3)
    p.add_argument("--bg-load", type=int, default=0,
                   help="N concurrent background decode streams across A+B")
    p.add_argument("--concurrent", type=int, default=1,
                   help="layerwise: fire N simultaneous transfers to test "
                        "concurrent-producing correctness")
    p.add_argument("--out", default="mb7_result.json")
    args = p.parse_args()
    asyncio.run(main_async(args))
 if __name__ == "__main__":
    main()
--- a/microbench/connector_tax/layerwise/migration_target.py
+++ b/microbench/connector_tax/layerwise/migration_target.py
@@ -0,0 +1,79 @@
 #!/usr/bin/env python3
 """P2: real-state-aware migration target selection.
 Pure helpers (no proxy deps) so they're unit-testable. The router calls
 `rank_migration_targets` to pick the decode target, using REAL engine state
 (from the engine-state store) when available, falling back to shadow counters.
 Key fix over the shadow-only Mechanism B: deprioritise targets that are
 mid-large-prefill (`max_prefill_remaining` high) — those hold the GIL and
 stall the mooncake receiver_loop, which is the ~45% control-plane residual
 that layer-wise transfer does NOT fix. Also avoid targets near the KV
 capacity wall (`gpu_kv_used_frac` high).
 """
 from __future__ import annotations
 from dataclasses import dataclass
@dataclass
 class TargetCandidate:
    idx: int
    cache_hit: int                 # estimated transfer bytes saved (tokens)
    shadow_num_req: int            # proxy shadow counter (fallback)
    ongoing_tokens: int            # shadow tertiary
    real_state: dict | None = None # engine-state record, or None if stale/missing
 def real_load(c: TargetCandidate) -> float:
    """Effective load: prefer real (running + waiting); else shadow."""
    rs = c.real_state
    if rs is not None:
        return float(rs.get("num_running", 0) + rs.get("num_waiting", 0))
    return float(c.shadow_num_req)
 def big_prefill_remaining(c: TargetCandidate) -> int:
    """Largest in-progress prefill on the candidate (GIL-stall predictor).
    0 when unknown (no real state) so we don't over-penalise blind."""
    rs = c.real_state
    return int(rs.get("max_prefill_remaining", 0)) if rs is not None else 0
 def kv_used_frac(c: TargetCandidate) -> float:
    rs = c.real_state
    if rs is not None:
        f = rs.get("gpu_kv_used_frac", -1.0)
        return float(f) if f is not None and f >= 0 else 0.0
    return 0.0
 def target_sort_key(
    c: TargetCandidate,
    big_prefill_threshold: int = 16000,
    kv_wall_frac: float = 0.90,
 ):
    """Sort key (lower = better). Ordering of concerns:
      1. NOT mid-large-prefill (avoid the GIL-stall dst)         [bool]
      2. NOT near the KV capacity wall                            [bool]
      3. most cache-rich  (fewest transfer bytes)  -> -cache_hit
      4. lowest real load
      5. lowest ongoing_tokens (shadow tertiary tie-break)
    """
    stalls = 1 if big_prefill_remaining(c) >= big_prefill_threshold else 0
    near_wall = 1 if kv_used_frac(c) >= kv_wall_frac else 0
    return (stalls, near_wall, -c.cache_hit, real_load(c), c.ongoing_tokens)
 def rank_migration_targets(
    candidates: list[TargetCandidate],
    big_prefill_threshold: int = 16000,
    kv_wall_frac: float = 0.90,
 ) -> TargetCandidate | None:
    """Return the best candidate, or None if the list is empty."""
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda c: target_sort_key(c, big_prefill_threshold, kv_wall_frac),
    )
--- a/microbench/connector_tax/layerwise/mooncake_connector.BASE.py
+++ b/microbench/connector_tax/layerwise/mooncake_connector.BASE.py
--- a/microbench/connector_tax/layerwise/mooncake_connector.LAYERWISE.py
+++ b/microbench/connector_tax/layerwise/mooncake_connector.LAYERWISE.py
--- a/microbench/connector_tax/layerwise/mooncake_connector.LAYERWISE.v1_singlexfer.py
+++ b/microbench/connector_tax/layerwise/mooncake_connector.LAYERWISE.v1_singlexfer.py
--- a/microbench/connector_tax/layerwise/results/mb7_baseline.json
+++ b/microbench/connector_tax/layerwise/results/mb7_baseline.json
@@ -0,0 +1,140 @@
 {
  "mode": "baseline",
  "model": "/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct",
  "raw": [
    {
      "t_prefill_s": 0.5736213000018324,
      "t_xfer_s": 0.36388630099827424,
      "t_total_s": 0.9375749369974073,
      "cached": 8176,
      "mode": "baseline",
      "size": 8192,
      "rep": 0,
      "t_prefill_only_s": 1.0551288530004967,
      "kv_gib": 0.75,
      "correct": true
    },
    {
      "t_prefill_s": 0.5740011439993395,
      "t_xfer_s": 0.12374231500143651,
      "t_total_s": 0.6978207100000873,
      "cached": 8176,
      "mode": "baseline",
      "size": 8192,
      "rep": 1,
      "t_prefill_only_s": 0.5743715360003989,
      "kv_gib": 0.75,
      "correct": true
    },
    {
      "t_prefill_s": 0.5732713990000775,
      "t_xfer_s": 0.10885842400239198,
      "t_total_s": 0.6821924389987544,
      "cached": 8176,
      "mode": "baseline",
      "size": 8192,
      "rep": 2,
      "t_prefill_only_s": 0.5745713680007611,
      "kv_gib": 0.75,
      "correct": true
    },
    {
      "t_prefill_s": 1.4892208660021424,
      "t_xfer_s": 0.2091717740004242,
      "t_total_s": 1.6984740270017937,
      "cached": 16368,
      "mode": "baseline",
      "size": 16384,
      "rep": 0,
      "t_prefill_only_s": 1.4990949730017746,
      "kv_gib": 1.5,
      "correct": true
    },
    {
      "t_prefill_s": 1.4885207330007688,
      "t_xfer_s": 0.2010940889995254,
      "t_total_s": 1.6896768289989268,
      "cached": 16368,
      "mode": "baseline",
      "size": 16384,
      "rep": 1,
      "t_prefill_only_s": 1.4898170189990196,
      "kv_gib": 1.5,
      "correct": true
    },
    {
      "t_prefill_s": 1.4895933570005582,
      "t_xfer_s": 0.2026357979993918,
      "t_total_s": 1.6922962099997676,
      "cached": 16368,
      "mode": "baseline",
      "size": 16384,
      "rep": 2,
      "t_prefill_only_s": 1.4907751430000644,
      "kv_gib": 1.5,
      "correct": true
    },
    {
      "t_prefill_s": 4.438586502998078,
      "t_xfer_s": 0.37847799000155646,
      "t_total_s": 4.817142683001293,
      "cached": 32752,
      "mode": "baseline",
      "size": 32768,
      "rep": 0,
      "t_prefill_only_s": 4.437922253000579,
      "kv_gib": 3.0,
      "correct": true
    },
    {
      "t_prefill_s": 4.4350325649975275,
      "t_xfer_s": 0.5313337980005599,
      "t_total_s": 4.966431269000168,
      "cached": 32752,
      "mode": "baseline",
      "size": 32768,
      "rep": 1,
      "t_prefill_only_s": 4.437473922000208,
      "kv_gib": 3.0,
      "correct": true
    },
    {
      "t_prefill_s": 4.436279826000828,
      "t_xfer_s": 0.6335160570015432,
      "t_total_s": 5.069869226001174,
      "cached": 32752,
      "mode": "baseline",
      "size": 32768,
      "rep": 2,
      "t_prefill_only_s": 4.440119222999783,
      "kv_gib": 3.0,
      "correct": true
    }
  ],
  "summary": [
    {
      "size": 8192,
      "n": 3,
      "pf_only_ms": 574.5713680007611,
      "total_ms": 697.8207100000873,
      "overhead_ms": 123.24934199932613,
      "all_correct": true
    },
    {
      "size": 16384,
      "n": 3,
      "pf_only_ms": 1490.7751430000644,
      "total_ms": 1692.2962099997676,
      "overhead_ms": 201.52106699970318,
      "all_correct": true
    },
    {
      "size": 32768,
      "n": 3,
      "pf_only_ms": 4437.922253000579,
      "total_ms": 4966.431269000168,
      "overhead_ms": 528.5090159995889,
      "all_correct": true
    }
  ]
 }
--- a/microbench/connector_tax/layerwise/results/mb7_baseline_bg16.json
+++ b/microbench/connector_tax/layerwise/results/mb7_baseline_bg16.json
@@ -0,0 +1,140 @@
 {
  "mode": "baseline",
  "model": "/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct",
  "raw": [
    {
      "t_prefill_s": 0.5868483350022871,
      "t_xfer_s": 0.19584889299949282,
      "t_total_s": 0.7827702419999696,
      "cached": 8176,
      "mode": "baseline",
      "size": 8192,
      "rep": 0,
      "t_prefill_only_s": 0.5920699099988269,
      "kv_gib": 0.75,
      "correct": true
    },
    {
      "t_prefill_s": 0.5875704979989678,
      "t_xfer_s": 0.1554814909977722,
      "t_total_s": 0.7431365060001554,
      "cached": 8176,
      "mode": "baseline",
      "size": 8192,
      "rep": 1,
      "t_prefill_only_s": 0.5814537600017502,
      "kv_gib": 0.75,
      "correct": true
    },
    {
      "t_prefill_s": 0.5852241569991747,
      "t_xfer_s": 0.15129724399957922,
      "t_total_s": 0.7365909610016388,
      "cached": 8176,
      "mode": "baseline",
      "size": 8192,
      "rep": 2,
      "t_prefill_only_s": 0.5846994370003813,
      "kv_gib": 0.75,
      "correct": true
    },
    {
      "t_prefill_s": 1.498547145001794,
      "t_xfer_s": 0.2475714690008317,
      "t_total_s": 1.7462187470009667,
      "cached": 16368,
      "mode": "baseline",
      "size": 16384,
      "rep": 0,
      "t_prefill_only_s": 1.5670790190015396,
      "kv_gib": 1.5,
      "correct": true
    },
    {
      "t_prefill_s": 1.5025789940009417,
      "t_xfer_s": 0.24532966799961287,
      "t_total_s": 1.7479741930001182,
      "cached": 16368,
      "mode": "baseline",
      "size": 16384,
      "rep": 1,
      "t_prefill_only_s": 1.5008903820016712,
      "kv_gib": 1.5,
      "correct": true
    },
    {
      "t_prefill_s": 1.5021674179988622,
      "t_xfer_s": 0.24640760400143336,
      "t_total_s": 1.7486415580024186,
      "cached": 16368,
      "mode": "baseline",
      "size": 16384,
      "rep": 2,
      "t_prefill_only_s": 1.509417139001016,
      "kv_gib": 1.5,
      "correct": true
    },
    {
      "t_prefill_s": 4.444555983998725,
      "t_xfer_s": 0.4227471090016479,
      "t_total_s": 4.86737214599998,
      "cached": 32752,
      "mode": "baseline",
      "size": 32768,
      "rep": 0,
      "t_prefill_only_s": 4.4467717689985875,
      "kv_gib": 3.0,
      "correct": true
    },
    {
      "t_prefill_s": 4.442135782999685,
      "t_xfer_s": 0.7519038230020669,
      "t_total_s": 5.194113359000767,
      "cached": 32752,
      "mode": "baseline",
      "size": 32768,
      "rep": 1,
      "t_prefill_only_s": 4.445541313998547,
      "kv_gib": 3.0,
      "correct": true
    },
    {
      "t_prefill_s": 4.439772993999213,
      "t_xfer_s": 0.7855456319994119,
      "t_total_s": 5.225392060998274,
      "cached": 32752,
      "mode": "baseline",
      "size": 32768,
      "rep": 2,
      "t_prefill_only_s": 4.442906365002273,
      "kv_gib": 3.0,
      "correct": true
    }
  ],
  "summary": [
    {
      "size": 8192,
      "n": 3,
      "pf_only_ms": 584.6994370003813,
      "total_ms": 743.1365060001554,
      "overhead_ms": 158.43706899977406,
      "all_correct": true
    },
    {
      "size": 16384,
      "n": 3,
      "pf_only_ms": 1509.417139001016,
      "total_ms": 1747.9741930001182,
      "overhead_ms": 238.5570539991022,
      "all_correct": true
    },
    {
      "size": 32768,
      "n": 3,
      "pf_only_ms": 4445.541313998547,
      "total_ms": 5194.113359000767,
      "overhead_ms": 748.57204500222,
      "all_correct": true
    }
  ]
 }
--- a/microbench/connector_tax/layerwise/results/mb7_layerwise.json
+++ b/microbench/connector_tax/layerwise/results/mb7_layerwise.json
@@ -0,0 +1,140 @@
 {
  "mode": "layerwise",
  "model": "/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct",
  "raw": [
    {
      "t_A_s": 0.5749198459998297,
      "t_B_s": 0.6508419569981925,
      "t_total_s": 0.6509377910006151,
      "cached": 8176,
      "mode": "layerwise",
      "size": 8192,
      "rep": 0,
      "t_prefill_only_s": 1.0447357020020718,
      "kv_gib": 0.75,
      "correct": true
    },
    {
      "t_A_s": 0.574626908000937,
      "t_B_s": 0.6306310719992325,
      "t_total_s": 0.6307087300010608,
      "cached": 8176,
      "mode": "layerwise",
      "size": 8192,
      "rep": 1,
      "t_prefill_only_s": 0.5731983850018878,
      "kv_gib": 0.75,
      "correct": true
    },
    {
      "t_A_s": 0.5756587910000235,
      "t_B_s": 0.6316753270002664,
      "t_total_s": 0.6317471290021786,
      "cached": 8176,
      "mode": "layerwise",
      "size": 8192,
      "rep": 2,
      "t_prefill_only_s": 0.5737888650000968,
      "kv_gib": 0.75,
      "correct": true
    },
    {
      "t_A_s": 1.4953326409995498,
      "t_B_s": 1.5502465710014803,
      "t_total_s": 1.5503262860001996,
      "cached": 16368,
      "mode": "layerwise",
      "size": 16384,
      "rep": 0,
      "t_prefill_only_s": 1.5000705940001353,
      "kv_gib": 1.5,
      "correct": true
    },
    {
      "t_A_s": 1.493850356000621,
      "t_B_s": 1.5505031290012994,
      "t_total_s": 1.5505791659998067,
      "cached": 16368,
      "mode": "layerwise",
      "size": 16384,
      "rep": 1,
      "t_prefill_only_s": 1.4924546469992492,
      "kv_gib": 1.5,
      "correct": true
    },
    {
      "t_A_s": 1.4979969070009247,
      "t_B_s": 1.554968774002191,
      "t_total_s": 1.5551903560008213,
      "cached": 16368,
      "mode": "layerwise",
      "size": 16384,
      "rep": 2,
      "t_prefill_only_s": 1.4914496510027675,
      "kv_gib": 1.5,
      "correct": true
    },
    {
      "t_A_s": 4.4403588690001925,
      "t_B_s": 4.496483378999983,
      "t_total_s": 4.4965666819989565,
      "cached": 32752,
      "mode": "layerwise",
      "size": 32768,
      "rep": 0,
      "t_prefill_only_s": 4.440080869000667,
      "kv_gib": 3.0,
      "correct": true
    },
    {
      "t_A_s": 4.44209005599987,
      "t_B_s": 4.499940814999718,
      "t_total_s": 4.500021006002498,
      "cached": 32752,
      "mode": "layerwise",
      "size": 32768,
      "rep": 1,
      "t_prefill_only_s": 4.440225810998527,
      "kv_gib": 3.0,
      "correct": true
    },
    {
      "t_A_s": 4.437084657998639,
      "t_B_s": 4.496842522999941,
      "t_total_s": 4.496926485000586,
      "cached": 32752,
      "mode": "layerwise",
      "size": 32768,
      "rep": 2,
      "t_prefill_only_s": 4.439449855002749,
      "kv_gib": 3.0,
      "correct": true
    }
  ],
  "summary": [
    {
      "size": 8192,
      "n": 3,
      "pf_only_ms": 573.7888650000968,
      "total_ms": 631.7471290021786,
      "overhead_ms": 57.958264002081705,
      "all_correct": true
    },
    {
      "size": 16384,
      "n": 3,
      "pf_only_ms": 1492.4546469992492,
      "total_ms": 1550.5791659998067,
      "overhead_ms": 58.124519000557484,
      "all_correct": true
    },
    {
      "size": 32768,
      "n": 3,
      "pf_only_ms": 4440.080869000667,
      "total_ms": 4496.926485000586,
      "overhead_ms": 56.845615999918664,
      "all_correct": true
    }
  ]
 }
--- a/microbench/connector_tax/layerwise/results/mb7_layerwise_bg16.json
+++ b/microbench/connector_tax/layerwise/results/mb7_layerwise_bg16.json
@@ -0,0 +1,140 @@
 {
  "mode": "layerwise",
  "model": "/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct",
  "raw": [
    {
      "t_A_s": 0.5905098549992545,
      "t_B_s": 0.6900827390018094,
      "t_total_s": 0.6904724189989793,
      "cached": 8176,
      "mode": "layerwise",
      "size": 8192,
      "rep": 0,
      "t_prefill_only_s": 0.5852864849985053,
      "kv_gib": 0.75,
      "correct": true
    },
    {
      "t_A_s": 0.5897548109969648,
      "t_B_s": 0.6827381169969158,
      "t_total_s": 0.6828304180016858,
      "cached": 8176,
      "mode": "layerwise",
      "size": 8192,
      "rep": 1,
      "t_prefill_only_s": 0.5890174580017629,
      "kv_gib": 0.75,
      "correct": true
    },
    {
      "t_A_s": 0.5850713190011447,
      "t_B_s": 0.6744917560026806,
      "t_total_s": 0.6745770380002796,
      "cached": 8176,
      "mode": "layerwise",
      "size": 8192,
      "rep": 2,
      "t_prefill_only_s": 0.5943713950000529,
      "kv_gib": 0.75,
      "correct": true
    },
    {
      "t_A_s": 1.5030149390004226,
      "t_B_s": 1.596173029000056,
      "t_total_s": 1.597060264000902,
      "cached": 16368,
      "mode": "layerwise",
      "size": 16384,
      "rep": 0,
      "t_prefill_only_s": 1.5130829510017065,
      "kv_gib": 1.5,
      "correct": true
    },
    {
      "t_A_s": 1.499876754998695,
      "t_B_s": 1.5940461120007967,
      "t_total_s": 1.5948001770011615,
      "cached": 16368,
      "mode": "layerwise",
      "size": 16384,
      "rep": 1,
      "t_prefill_only_s": 1.5024838620010996,
      "kv_gib": 1.5,
      "correct": true
    },
    {
      "t_A_s": 1.5068977490009274,
      "t_B_s": 1.5950395179970656,
      "t_total_s": 1.59571184500237,
      "cached": 16368,
      "mode": "layerwise",
      "size": 16384,
      "rep": 2,
      "t_prefill_only_s": 1.5303227439981129,
      "kv_gib": 1.5,
      "correct": true
    },
    {
      "t_A_s": 4.4503932609986805,
      "t_B_s": 4.538851200999488,
      "t_total_s": 4.539281312001549,
      "cached": 32752,
      "mode": "layerwise",
      "size": 32768,
      "rep": 0,
      "t_prefill_only_s": 4.446753306998289,
      "kv_gib": 3.0,
      "correct": true
    },
    {
      "t_A_s": 4.44226107799841,
      "t_B_s": 4.551636377997056,
      "t_total_s": 4.552389411001059,
      "cached": 32752,
      "mode": "layerwise",
      "size": 32768,
      "rep": 1,
      "t_prefill_only_s": 4.44538704000297,
      "kv_gib": 3.0,
      "correct": true
    },
    {
      "t_A_s": 4.440309538000292,
      "t_B_s": 4.539836316998844,
      "t_total_s": 4.540553365997766,
      "cached": 32752,
      "mode": "layerwise",
      "size": 32768,
      "rep": 2,
      "t_prefill_only_s": 4.443476915999781,
      "kv_gib": 3.0,
      "correct": true
    }
  ],
  "summary": [
    {
      "size": 8192,
      "n": 3,
      "pf_only_ms": 589.0174580017629,
      "total_ms": 682.8304180016858,
      "overhead_ms": 93.8129599999229,
      "all_correct": true
    },
    {
      "size": 16384,
      "n": 3,
      "pf_only_ms": 1513.0829510017065,
      "total_ms": 1595.71184500237,
      "overhead_ms": 82.62889400066342,
      "all_correct": true
    },
    {
      "size": 32768,
      "n": 3,
      "pf_only_ms": 4445.38704000297,
      "total_ms": 4540.553365997766,
      "overhead_ms": 95.16632599479635,
      "all_correct": true
    }
  ]
 }
--- a/microbench/connector_tax/layerwise/results/trace/baseline_breakdown.json
+++ b/microbench/connector_tax/layerwise/results/trace/baseline_breakdown.json
--- a/microbench/connector_tax/layerwise/results/trace/baseline_metrics.jsonl
+++ b/microbench/connector_tax/layerwise/results/trace/baseline_metrics.jsonl
--- a/microbench/connector_tax/layerwise/results/trace/layerwise_breakdown.json
+++ b/microbench/connector_tax/layerwise/results/trace/layerwise_breakdown.json
--- a/microbench/connector_tax/layerwise/results/trace/layerwise_metrics.jsonl
+++ b/microbench/connector_tax/layerwise/results/trace/layerwise_metrics.jsonl
--- a/microbench/connector_tax/layerwise/run_ab_matrix.sh
+++ b/microbench/connector_tax/layerwise/run_ab_matrix.sh
@@ -0,0 +1,33 @@
 #!/usr/bin/env bash
 # A/B x migration matrix on the 1200-req trace (sequential, ~47 min each).
 #   1. unified           (no A/B, no migration)            anchor
 #   2. unified + A+B      (documented champion, no mig)
 #   3. unified_v3 + A+B + layer-wise (champion + cheap mig)
 # We already have: unified_v3 + layer-wise (no A/B) from the prior run.
 #
 # Q1 (migration benefit w/ layer-wise): #1 vs prior v3+layerwise(noAB)
 # Q2 (does migration add to champion):  #2 vs #3
 set -uo pipefail
 PROJ_DIR="${PROJ_DIR:-/home/admin/cpfs/wjh/agentic-kv}"
 R="$PROJ_DIR/microbench/connector_tax/layerwise/run_v3_trace.sh"
 AB="--overload-factor 1.3 --lmetric-decode-weight 0.01"
 LOGD=/tmp/dst_break_logs; mkdir -p "$LOGD"
 echo "########## 1/3 unified plain ##########"
 TAG=unified_plain POLICY=unified MODE=baseline AB_FLAGS="" \
    bash "$R" 2>&1 | tee "$LOGD/abmatrix_1_unified_plain.log" | tail -6
 echo "########## 2/3 unified + A+B ##########"
 TAG=unified_AB POLICY=unified MODE=baseline AB_FLAGS="$AB" \
    bash "$R" 2>&1 | tee "$LOGD/abmatrix_2_unified_AB.log" | tail -6
 echo "########## 3/3 unified_v3 + A+B + layer-wise ##########"
 TAG=v3_AB_lw POLICY=unified_v3 MODE=layerwise AB_FLAGS="$AB" \
    bash "$R" 2>&1 | tee "$LOGD/abmatrix_3_v3_AB_lw.log" | tail -6
 echo "########## MATRIX DONE ##########"
 for t in unified_plain unified_AB v3_AB_lw; do
    D=$(ls -dt "$PROJ_DIR"/outputs/v3trace_${t}_*/unified_v3 2>/dev/null | head -1)
    echo "=== $t ($D) ==="
    sed -n '/\[stats\]/,/\[done\]/p' "$LOGD"/abmatrix_*_${t}.log 2>/dev/null | grep -E "requests:|TTFT|migrations:" || true
 done
--- a/microbench/connector_tax/layerwise/run_ablation_es.sh
+++ b/microbench/connector_tax/layerwise/run_ablation_es.sh
@@ -0,0 +1,42 @@
 #!/usr/bin/env bash
 # Ablation: does the REAL engine-state feed (P2) change each policy's
 # performance and ranking vs the stale-shadow baseline?
 #
 # Each config is run twice (ES=0 shadow-only, ES=1 real-state feed) so the
 # ONLY difference is the state source. Sequential, ~47 min each.
 #
 # Default = the 4 decisive runs (champion + migration, with/without feed).
 # Extend CONFIGS for the full sweep (lmetric / unified_kv_both / load_only).
 set -uo pipefail
 PROJ_DIR="${PROJ_DIR:-/home/admin/cpfs/wjh/agentic-kv}"
 R="$PROJ_DIR/microbench/connector_tax/layerwise/run_v3_trace.sh"
 AB="--overload-factor 1.3 --lmetric-decode-weight 0.01"
 LOGD=/tmp/dst_break_logs; mkdir -p "$LOGD"
 # CONFIG format: "TAG|POLICY|MODE|AB?|ES"
 CONFIGS=(
  "unified_AB_es0|unified|baseline|AB|0"
  "unified_AB_es1|unified|baseline|AB|1"
  "v3_AB_lw_es0|unified_v3|layerwise|AB|0"
  "v3_AB_lw_es1|unified_v3|layerwise|AB|1"
  # --- extend for the full sweep ---
  # "lmetric_es0|lmetric|baseline|noAB|0"
  # "lmetric_es1|lmetric|baseline|noAB|1"
  # "ukvboth_AB_es0|unified_kv_both|baseline|AB|0"
  # "ukvboth_AB_es1|unified_kv_both|baseline|AB|1"
 )
 for cfg in "${CONFIGS[@]}"; do
  IFS='|' read -r tag policy mode ab es <<< "$cfg"
  ab_flags=""; [ "$ab" = "AB" ] && ab_flags="$AB"
  echo "########## $tag (policy=$policy mode=$mode ab=$ab es=$es) ##########"
  TAG="$tag" POLICY="$policy" MODE="$mode" AB_FLAGS="$ab_flags" ES="$es" \
      bash "$R" 2>&1 | tee "$LOGD/abl_${tag}.log" | tail -6
 done
 echo "########## ABLATION DONE — summary ##########"
 for cfg in "${CONFIGS[@]}"; do
  IFS='|' read -r tag _ _ _ _ <<< "$cfg"
  echo "=== $tag ==="
  grep -E "requests:|TTFT|migrations:" "$LOGD/abl_${tag}.log" 2>/dev/null || true
 done
--- a/microbench/connector_tax/layerwise/run_mb7.sh
+++ b/microbench/connector_tax/layerwise/run_mb7.sh
@@ -0,0 +1,114 @@
 #!/usr/bin/env bash
 # MB7 launcher (runs on dash0). Two 2-instance modes selected by MODE env:
 #   MODE=baseline  : restore stock connector, no layerwise env
 #   MODE=layerwise : deploy mooncake_connector.LAYERWISE.py + MOONCAKE_LAYERWISE=1
 #
 # Chunked prefill is DISABLED (max-num-batched-tokens >= max prompt) so the
 # producer prefill is a single forward and save_kv_layer fires once per layer
 # in order — the layer-wise counter assumes this.
 #
 # The connector is always restored from .ORIG_BACKUP on exit.
 #
 # Usage (on dash0):
 #   MODE=baseline  bash run_mb7.sh
 #   MODE=layerwise bash run_mb7.sh
 set -uo pipefail
 MODE="${MODE:-baseline}"
 PROJ_DIR="${PROJ_DIR:-/home/admin/cpfs/wjh/agentic-kv}"
 VENV="${VENV:-$PROJ_DIR/.venv}"
 MODEL="${MODEL:-/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct}"
 GPUS=(${GPUS:-0 1})
 SIZES="${SIZES:-8192,16384,32768}"
 REPEATS="${REPEATS:-3}"
 BG_LOAD="${BG_LOAD:-0}"
 CONCURRENT="${CONCURRENT:-1}"
 MAX_BATCHED="${MAX_BATCHED:-40960}"   # >= max prompt => no chunked prefill
 DATE="$(date +%Y%m%d_%H%M)"
 OUTDIR="${OUTDIR:-$PROJ_DIR/outputs/mb7_${MODE}_${DATE}}"
 PYTHON="$VENV/bin/python"
 MC_FILE="$VENV/lib/python3.12/site-packages/vllm/distributed/kv_transfer/kv_connector/v1/mooncake/mooncake_connector.py"
 LW_SRC="${LW_SRC:-/tmp/mooncake_connector.LAYERWISE.py}"
 DRIVER="$PROJ_DIR/microbench/connector_tax/layerwise/mb7_layerwise.py"
 mkdir -p "$OUTDIR/logs"
 PORTS=(8000 8001); BPS=(8998 8999)
 echo "=== MB7 ($MODE) ==="
 echo "Out: $OUTDIR ; connector: $MC_FILE"
 restore_connector() {
    if [ -f "$MC_FILE.ORIG_BACKUP" ]; then
        cp -f "$MC_FILE.ORIG_BACKUP" "$MC_FILE"
        echo "[restore] connector reset to ORIG"
    fi
 }
 cleanup() {
    pkill -9 -f "vllm serve" 2>/dev/null || true
    pkill -9 -f "EngineCore" 2>/dev/null || true
    sleep 4
    restore_connector
 }
 trap cleanup EXIT
 pkill -9 -f "vllm serve" 2>/dev/null || true; sleep 3
 # Deploy the connector for the chosen mode.
 if [ "$MODE" = "layerwise" ]; then
    if [ ! -f "$LW_SRC" ]; then echo "FATAL: $LW_SRC not found (scp it first)"; exit 1; fi
    cp -f "$LW_SRC" "$MC_FILE"
    "$PYTHON" -c "import ast; ast.parse(open('$MC_FILE').read()); print('[deploy] LAYERWISE connector AST OK')" || exit 1
    LW_ENV="MOONCAKE_LAYERWISE=1"
 else
    restore_connector
    LW_ENV=""
 fi
 echo "[launch] 2 instances (max-num-batched-tokens=$MAX_BATCHED, chunked-prefill off)"
 i=0
 for gpu in "${GPUS[@]:0:2}"; do
    port=${PORTS[$i]}; bp=${BPS[$i]}; master=$((29700 + i))
    env $LW_ENV \
        PYTHONHASHSEED=42 VLLM_MOONCAKE_BOOTSTRAP_PORT=$bp \
        CUDA_VISIBLE_DEVICES=$gpu MASTER_PORT=$master \
        nohup "$VENV/bin/vllm" serve "$MODEL" \
        --host 0.0.0.0 --port "$port" --tensor-parallel-size 1 \
        --trust-remote-code --enable-prefix-caching --dtype auto \
        --gpu-memory-utilization 0.9 --max-model-len 200000 \
        --max-num-batched-tokens "$MAX_BATCHED" \
        --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_both"}' \
        --enable-prompt-tokens-details \
        > "$OUTDIR/logs/vllm_${i}_gpu${gpu}.log" 2>&1 &
    disown; sleep 2; i=$((i + 1))
 done
 echo "[health] waiting ..."
 for i in 0 1; do
    port=${PORTS[$i]}; tries=0
    while ! curl -sf "http://127.0.0.1:$port/health" >/dev/null 2>&1; do
        tries=$((tries + 1)); [ $tries -gt 180 ] && { echo "FATAL inst_$i"; exit 1; }
        sleep 2
    done
    echo "  inst_$i ready"
 done
 for i in 0 1; do
    bp=${BPS[$i]}; tries=0
    while ! curl -sf "http://127.0.0.1:$bp/query" >/dev/null 2>&1; do
        tries=$((tries+1)); [ $tries -gt 60 ] && { echo "WARN bp $bp"; break; }; sleep 2
    done
 done
 echo "[run] mb7 --mode $MODE"
 "$PYTHON" "$DRIVER" --mode "$MODE" \
    --src-port "${PORTS[0]}" --dst-port "${PORTS[1]}" \
    --src-bp "${BPS[0]}" --dst-bp "${BPS[1]}" \
    --sizes "$SIZES" --repeats "$REPEATS" --bg-load "$BG_LOAD" \
    --concurrent "$CONCURRENT" --out "$OUTDIR/mb7_result.json" \
    2>&1 | tee "$OUTDIR/mb7_run.txt"
 echo "[done] $OUTDIR"
 # grep layerwise transfer logs from the producer (gpu0) for sanity
 if [ "$MODE" = "layerwise" ]; then
    echo "=== producer layerwise log lines ==="
    grep -i "layerwise" "$OUTDIR/logs/vllm_0_gpu${GPUS[0]}.log" | tail -10 || true
 fi
--- a/microbench/connector_tax/layerwise/run_v3_trace.sh
+++ b/microbench/connector_tax/layerwise/run_v3_trace.sh
@@ -0,0 +1,114 @@
 #!/usr/bin/env bash
 # Full 1200-req v3 trace, two modes (MODE env), for layer-wise re-profile.
 #   MODE=baseline  : stock connector + stock proxy (post-hoc transfer)
 #   MODE=layerwise : LAYERWISE connector + write-mode proxy (overlapped)
 # Both: unified_v3 routing + DR-fix. Connector & proxy restored from backup
 # on exit. Output-equivalence/correctness gate = success rate + migrated-req
 # TTFT distribution (byte-level KV correctness already validated on mb7).
 #
 # Usage (on dash0):  MODE=baseline  bash run_v3_trace.sh
 #                    MODE=layerwise bash run_v3_trace.sh
 set -uo pipefail
 MODE="${MODE:-baseline}"
 POLICY="${POLICY:-unified_v3}"
 AB_FLAGS="${AB_FLAGS:-}"        # e.g. "--overload-factor 1.3 --lmetric-decode-weight 0.01"
 TAG="${TAG:-$MODE}"
 PROJ_DIR="${PROJ_DIR:-/home/admin/cpfs/wjh/agentic-kv}"
 VENV="$PROJ_DIR/.venv"
 VLLM_ROOT="$VENV/lib/python3.12/site-packages/vllm"
 TRACE="${TRACE:-$PROJ_DIR/traces/w600_r0.0015_st30.jsonl}"
 DATE="$(date +%Y%m%d_%H%M)"
 OUTROOT="${OUTROOT:-$PROJ_DIR/outputs/v3trace_${TAG}_${DATE}}"
 PYTHON="$VENV/bin/python"
 DR_FIX="$PROJ_DIR/microbench/connector_tax/cache_sweep/apply_direct_read_fix.py"
 MC_FILE="$VLLM_ROOT/distributed/kv_transfer/kv_connector/v1/mooncake/mooncake_connector.py"
 PROXY_FILE="$PROJ_DIR/scripts/cache_aware_proxy.py"
 # Staging on shared cpfs (visible on dash0/dash1), not node-local /tmp.
 _LWDIR="$PROJ_DIR/microbench/connector_tax/layerwise"
 LW_CONN="${LW_CONN:-$_LWDIR/mooncake_connector.LAYERWISE.py}"
 WM_PROXY="${WM_PROXY:-$_LWDIR/cache_aware_proxy.WRITEMODE.py}"
 ES_INSTR="$_LWDIR/instrument_engine_state.py"
 ES="${ES:-0}"                  # 1 = enable real engine-state feed (P2)
 ES_DIR="/dev/shm/agentic_engine_state_${TAG}"
 mkdir -p "$OUTROOT"
 cfg_dir="$OUTROOT/unified_v3"; mkdir -p "$cfg_dir"
 # Backups (connector backup already exists as .ORIG_BACKUP; make proxy one).
 [ -f "$MC_FILE.ORIG_BACKUP" ] || cp "$MC_FILE" "$MC_FILE.ORIG_BACKUP"
 [ -f "$PROXY_FILE.ORIG_BACKUP" ] || cp "$PROXY_FILE" "$PROXY_FILE.ORIG_BACKUP"
 restore() {
    cp -f "$MC_FILE.ORIG_BACKUP" "$MC_FILE"
    cp -f "$PROXY_FILE.ORIG_BACKUP" "$PROXY_FILE"
    "$PYTHON" "$DR_FIX" --revert --vllm-root "$VLLM_ROOT" 2>/dev/null || true
    "$PYTHON" "$ES_INSTR" --revert --venv "$VENV" 2>/dev/null || true
    rm -rf "$ES_DIR" 2>/dev/null || true
    echo "[restore] connector+proxy reset to ORIG, DR-fix + ES-patch reverted"
 }
 cleanup() {
    pkill -9 -f cache_aware_proxy 2>/dev/null || true
    pkill -9 -f "vllm serve" 2>/dev/null || true
    pkill -9 -f "EngineCore" 2>/dev/null || true
    sleep 5
    restore
 }
 trap cleanup EXIT
 pkill -9 -f "vllm serve" 2>/dev/null || true; sleep 3
 restore   # start from clean
 echo "=== v3 trace (mode=$MODE es=$ES tag=$TAG) -> $OUTROOT ==="
 # Always deploy the enhanced proxy (write-mode + engine-state, both env/flag
 # gated; with feed off + write-mode off it behaves identically to stock).
 cp -f "$WM_PROXY" "$PROXY_FILE"
 if [ "$MODE" = "layerwise" ]; then
    cp -f "$LW_CONN" "$MC_FILE"
    export MOONCAKE_LAYERWISE=1
    export EAR_WRITE_MODE=1
 fi
 "$PYTHON" -c "import ast; ast.parse(open('$MC_FILE').read()); ast.parse(open('$PROXY_FILE').read()); print('[deploy] proxy + connector AST OK')" || exit 1
 PROXY_ES_ARG=""
 if [ "$ES" = "1" ]; then
    echo "[ES] apply engine-state patch + enable feed at $ES_DIR"
    "$PYTHON" "$ES_INSTR" --apply --venv "$VENV"
    mkdir -p "$ES_DIR"
    export AGENTIC_ENGINE_STATE_URI="file://$ES_DIR"
    PROXY_ES_ARG="--engine-state-uri file://$ES_DIR"
 fi
 echo "[DR-fix] apply"
 "$PYTHON" "$DR_FIX" --apply --vllm-root "$VLLM_ROOT"
 export VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC=1
 echo "[run] $POLICY AB=[$AB_FLAGS] (MOONCAKE_LAYERWISE=${MOONCAKE_LAYERWISE:-0} EAR_WRITE_MODE=${EAR_WRITE_MODE:-0})"
 EXTRA_PROXY_ARGS="$AB_FLAGS $PROXY_ES_ARG" bash "$PROJ_DIR/scripts/b3_isolated_policy.sh" "$POLICY" "$TRACE" "$cfg_dir" \
    2>&1 | tee "$cfg_dir/orchestrator.log" | tail -20
 pkill -9 -f cache_aware_proxy 2>/dev/null || true
 pkill -9 -f "vllm serve" 2>/dev/null || true
 sleep 5
 echo "[stats] $MODE"
 "$PYTHON" - "$cfg_dir" << 'PYEOF'
 import json, sys, statistics
 d = sys.argv[1]
 ms = [json.loads(l) for l in open(f"{d}/metrics.jsonl")]
 ok = [m for m in ms if not m.get("error")]
 ttft = sorted(m["ttft_s"] for m in ok if m.get("ttft_s") is not None)
 def p(q): return ttft[min(len(ttft)-1, int(q*len(ttft)))] if ttft else 0
 print(f"  requests: {len(ms)}  success: {len(ok)} ({len(ok)/max(1,len(ms))*100:.1f}%)")
 print(f"  TTFT s  : p50={p(.5):.2f} p90={p(.9):.2f} p99={p(.99):.2f}")
 # migrated reqs from proxy breakdown
 try:
    bd = json.load(open(f"{d}/breakdown.json"))
    mig = [x for x in bd if x.get("route_class") == "PD_SEP_V2"]
    mids = {x["request_id"] for x in mig}
    mt = sorted(m["ttft_s"] for m in ok if m["request_id"] in mids and m.get("ttft_s"))
    print(f"  migrations: {len(mig)}  migrated-req TTFT: "
          f"p50={mt[len(mt)//2]:.2f} p90={mt[int(len(mt)*.9)]:.2f} max={mt[-1]:.2f}" if mt else f"  migrations: {len(mig)}")
 except Exception as e:
    print(f"  (breakdown parse: {e})")
 PYEOF
 echo "[done] $cfg_dir  (metrics.jsonl, breakdown.json)"