Migration transfer-cost study: KV transfer is slow on busy GPUs

MIGRATION_TRANSFER_COST.md: under real load, migration KV transfer runs at ~3 GB/s vs ~10 GB/s idle. Decomposed (instruments + MB6 microbench) into ~55% RDMA-actual (HBM/PCIe contention with running kernels: 7.6->4.0 GB/s) + ~45% control-plane GIL starvation during long prefills. Reproduced on a fresh upstream venv (byte-identical transfer path) -> upstream/hardware inherent, not our patch. Layerwise is the wrong lever; the tax is structural on a loaded agentic cluster.

Includes mb6_transfer_under_load + run_mb6, instrument_dst_migration/mooncake, and the dst/transfer decomposition analyzers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 11:53:01 +08:00
parent 67fcec7933
commit 1262c9c22e
11 changed files with 2055 additions and 0 deletions
--- a/microbench/connector_tax/cache_sweep/MIGRATION_TRANSFER_COST.md
+++ b/microbench/connector_tax/cache_sweep/MIGRATION_TRANSFER_COST.md
@@ -0,0 +1,178 @@
+# Why KV-transfer is slow during migration under real load
+
+**Question.** EAR's unified+A+B routing beats migration (v3) on agentic
+workloads. We wanted to know whether *layerwise* KV transfer would shrink
+migration's overhead enough to make it viable. Investigating that led to a
+sharper question: **in a real (loaded) cluster, when we migrate, the KV
+transfer is already slow — the effective bandwidth is far below the
+~10 GB/s wire rate. Why?**
+
+This doc answers that with instrumented measurements.
+
+**TL;DR.** Migration fires precisely when instances are *busy* (that's the
+trigger). But on a busy instance, KV transfer runs at **~3 GB/s instead of
+~10 GB/s**, because:
+
+1. **The RDMA write itself slows ~2× under compute load** — GPU-direct RDMA
+   (`batch_transfer_sync_write`) contends with the running attention/MLP
+   kernels for **HBM and PCIe bandwidth**. (idle 7.6 GB/s → busy 4.0 GB/s)
+2. **The connector's Python control plane gets GIL-starved** — mooncake's
+   ZMQ handshake + transfer orchestration run on asyncio threads inside the
+   engine process; when the engine's main thread is doing a long forward
+   pass (e.g. a 100k-token prefill), those threads stall for *seconds*.
+
+Both are **inherent to upstream vLLM 0.18.1 + mooncake** (reproduced on a
+clean fresh venv; the transfer path is byte-identical to upstream — our
+patches did not cause this), and both get **worse**, not better, with
+layerwise transfer. So the bandwidth gap is not a layerwise problem; it is a
+*transfer-on-a-busy-GPU* problem.
+
+---
+
+## 1. Evidence chain
+
+Three independent measurements, all on dash0 (8×H100, Qwen3-Coder-30B-A3B,
+TP=1), Mooncake `kv_both`.
+
+### 1a. Instrumented v3 trace replay — where does migration time go?
+
+Run `outputs/b3_v3_fullbreak_20260528_0338/`. Instruments:
+`instrument_dst_migration.py` (dst scheduler lifecycle) +
+`instrument_mooncake.py` (connector internals: `send_blocks` RDMA,
+`receive_kv` window, `ready_wait`).
+
+25 migrations fired over the trace. Dst-side migration overhead
+(`T_kv_pull` = scheduler marks `WAITING_FOR_REMOTE_KVS` → `finished_recving`):
+
+| component | share | what it is |
+|---|---:|---|
+| RDMA-actual (`batch_transfer_sync_write`) | **55%** (55.2 s) | the real RDMA write |
+| dst control-plane gap | **45%** (45.4 s) | scheduler↔receiver_loop dispatch + completion propagation |
+| `ready_wait` (src KV not committed) | 0% | 25/25 already committed — **ruled out** |
+
+- Pure RDMA aggregate rate: **2.03 GB/s** (vs MB2 idle 9.7 GB/s).
+- RDMA rate **collapses with transfer size**: <3 GiB → 4–9.5 GB/s,
+  >5 GiB → 0.9–2.6 GB/s.
+- The control-plane gap is **bimodal**: median 0.04 s, but a handful of
+  requests stall ~10 s. Those are small-KV transfers (0.18 s of actual RDMA)
+  whose `T_kv_pull` is 8–11 s — i.e. the dst's `receiver_loop` thread was
+  GIL-starved for ~10 s while the engine did a big forward pass.
+
+> Earlier (pre-instrumentation) we wrongly attributed ~90% of migration
+> overhead to "dst scheduler queueing" by estimating transfer at clean wire
+> speed. With real instrumentation, dst *scheduler admission* is ~0
+> (`T_admission_post_kv` = 0.003 s); the time is the transfer phase (RDMA +
+> connector control plane), both degraded by instance busy-ness.
+
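The split in 1a can be sketched as follows. This is a minimal illustration, assuming each migration is reduced to two durations; the `Migration` record, its field names, and the 1 s stall threshold are hypothetical and are not the instruments' actual schema. The demo values are synthetic, not measured data.

```python
from dataclasses import dataclass


@dataclass
class Migration:
    # Hypothetical record: field names are illustrative only.
    t_kv_pull_s: float  # WAITING_FOR_REMOTE_KVS -> finished_recving
    rdma_s: float       # time spent inside the RDMA write


def decompose(migs, stall_threshold_s=1.0):
    """Split total T_kv_pull into RDMA-actual and control-plane gap."""
    rdma = sum(m.rdma_s for m in migs)
    # Clamp at zero: clock skew between logs can make a gap slightly negative.
    gaps = [max(0.0, m.t_kv_pull_s - m.rdma_s) for m in migs]
    gap = sum(gaps)
    total = rdma + gap
    return {
        "rdma_share": rdma / total if total else 0.0,
        "gap_share": gap / total if total else 0.0,
        # Requests whose control-plane gap alone exceeds the threshold.
        "n_stalled": sum(1 for g in gaps if g >= stall_threshold_s),
    }


# Synthetic example: two fast transfers and one stalled small transfer.
demo = [Migration(2.0, 1.9), Migration(3.0, 2.9), Migration(10.0, 0.2)]
result = decompose(demo)
print(result)
```

A single stalled request dominates the gap share here, which is the same bimodal shape the trace shows: a tiny median gap with a few multi-second outliers.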
+### 1b. MB6 controlled microbench — does busy-ness cause it?
+
+`microbench/fresh_setup/mb6_transfer_under_load.py` + `run_mb6.sh`: 2
+instances, transfer a fixed-size KV (prefill on A → migrate to B) while
+holding *N* background decode streams on both. Sweep N.
+
+Effective transfer bandwidth (65k-token KV ≈ 6 GiB), main venv:
+
+| background streams (N) | 65k transfer time | effective bandwidth |
+|---|---:|---:|
+| **0 (idle)** | 747 ms | **8.76 GB/s** |
+| 8 (4/instance) | ~1420 ms | 4.53 GB/s |
+| **24 (12/instance)** | 2015 ms | **3.33 GB/s** |
+
+Monotonic degradation with load. **The busy level (3.3 GB/s) matches the
+v3 trace's 3.3 GB/s median exactly** — because agentic instances run
+~10+ concurrent requests, i.e. the bg=24 regime.
+
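As we read it, the effective-bandwidth column is bytes moved divided by wall time. A minimal sketch, assuming decimal GB (1e9 bytes) for rates and the approximate 6 GiB payload stated above; the function names are illustrative. Because the payload is only approximately 6 GiB, the computed idle figure lands near the table's 8.76 GB/s rather than on it.

```python
GIB = 1024 ** 3  # the KV size is quoted in GiB, the rates in decimal GB/s


def eff_bandwidth_gbps(n_bytes, wall_ms):
    """Effective bandwidth in GB/s: bytes moved over end-to-end wall time."""
    return (n_bytes / 1e9) / (wall_ms / 1e3)


def slowdown(idle_gbps, busy_gbps):
    """How many times slower the busy transfer is than the idle one."""
    return idle_gbps / busy_gbps


# Assumed payload of exactly 6 GiB with the idle transfer time from the table.
idle_est = eff_bandwidth_gbps(6 * GIB, 747.0)

# Slowdown computed from the table's own idle and bg=24 bandwidths.
busy_slowdown = slowdown(8.76, 3.33)

print(f"idle estimate: {idle_est:.2f} GB/s, busy slowdown: {busy_slowdown:.2f}x")
```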
+Decomposing the 65k transfer into RDMA-actual vs control-plane:
+
+| bg | RDMA rate | control-plane share |
+|---|---:|---:|
+| 0 (idle) | 7.56 GB/s | 13% |
+| 8 | 4.07 GB/s | 11% |
+| 24 (busy) | 3.97 GB/s | 15% |
+
+In the clean microbench the **RDMA write itself is the dominant degrading
+term** (7.6 → 4.0 GB/s). The ~10 s control-plane stalls seen in the trace
+(1a) don't reproduce here because steady decode forward passes are short;
+they require the long (100k-token) prefills that the real trace has.
+
+### 1c. Fresh-venv comparison — is it our patch?
+
+Same MB6 sweep on `agentic-kv-fresh/.venv` (clean upstream-style 0.18.1):
+
+| bg | 65k eff (fresh) | 65k eff (main/patched) |
+|---|---:|---:|
+| 0 | 8.73 GB/s | 8.76 GB/s |
+| 8 | 4.52 GB/s | 4.53 GB/s |
+| 24 | 3.27 GB/s | 3.33 GB/s |
+
+**Identical within noise.** Plus a static check: the v3 transfer path
+(`send_kv_to_decode`, `_send_blocks`/`batch_transfer_sync_write`,
+`_build_transfer_params`) is **byte-identical** to pristine upstream 0.18.1
+(commit `445e491`); `receive_kv_from_single_worker` differs only by a 4-line
+error branch. Our mooncake commits (`a7df84b` direct-read,
+`ea51497` partial-prefill, `e3a1d70` read→push) only touch a *separate*
+`direct_read` path that v3 does **not** use (v3 requests carry no
+`direct_read` flag → normal push path).
+
+→ **The slowdown is upstream/hardware-inherent, not introduced by us.**
+
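One way to run a static check of this kind is to hash each function's exact source text in both trees and compare. The sketch below is not the procedure used for this study; `function_digests` and `differing_functions` are hypothetical helpers, and the two source strings are toy stand-ins for the upstream and patched connector files.

```python
import ast
import hashlib


def function_digests(source):
    """Map each function name to a SHA-256 of its exact source text.

    Note: same-named methods in different classes overwrite each other;
    that is acceptable for a sketch but not for a real audit.
    """
    digests = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            segment = ast.get_source_segment(source, node)
            digests[node.name] = hashlib.sha256(segment.encode()).hexdigest()
    return digests


def differing_functions(src_a, src_b, names):
    """Return the names whose source differs; a missing name counts as differing."""
    a, b = function_digests(src_a), function_digests(src_b)
    return [n for n in names if a.get(n) is None or a.get(n) != b.get(n)]


# Toy stand-ins: `send` is unchanged, `recv` gains an error branch.
upstream = "def send(x):\n    return x + 1\n\ndef recv(x):\n    return x\n"
patched = (
    "def send(x):\n    return x + 1\n\n"
    "def recv(x):\n    if x is None:\n        raise ValueError\n    return x\n"
)
changed = differing_functions(upstream, patched, ["send", "recv"])
print(changed)
```

Hashing the raw text is deliberately strict: a whitespace-only change also counts as a difference, which matches a byte-identical claim.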
+---
+
+## 2. Root cause
+
+Migration in agentic serving transfers KV **between instances that are
+concurrently busy with compute** — by construction, since v3 migrates *away
+from* a busy host. On a busy instance:
+
+- **HBM/PCIe contention (the dominant, irreducible part).** Mooncake's
+  transfer is GPU-direct RDMA: the NIC DMAs KV straight out of / into GPU
+  HBM. While the GPU runs attention+MLP kernels, those kernels saturate HBM
+  bandwidth, so the NIC's RDMA gets a smaller slice. Effective transfer
+  bandwidth roughly halves (7.6 → 4.0 GB/s at our load), and degrades
+  further for large multi-segment transfers.
+- **Control-plane GIL starvation (secondary, bursty).** The connector runs
+  its ZMQ handshake + `send_kv_to_decode`/`receive_kv` orchestration on
+  asyncio threads (`sender_loop`/`receiver_loop`) *inside the engine
+  process*. A long forward pass (100k-token prefill) holds the GIL for
+  seconds, stalling those threads → multi-second dispatch gaps even when the
+  actual transfer is 0.2 s.
+
+MB2 measured 9.7 GB/s precisely because both endpoints were **idle**. The
+real-workload gap is the difference between "idle benchmark" and "transfer
+while the GPU is doing the day job."
+
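The control-plane effect can be observed with a heartbeat thread that records how late its ticks arrive while the main thread is busy. This is a toy sketch, not the instrument used above: a pure-Python busy loop yields the GIL at the interpreter's switch interval (5 ms by default), so the gaps it produces are small. The multi-second stalls in the trace would need main-thread work that holds the GIL for the whole forward pass, which this sketch assumes rather than reproduces.

```python
import threading
import time


def heartbeat(stop, gaps, interval_s=0.001):
    """Background loop standing in for a connector thread; records tick gaps."""
    last = time.perf_counter()
    while not stop.is_set():
        time.sleep(interval_s)  # sleeping releases the GIL
        now = time.perf_counter()
        gaps.append(now - last)  # the gap grows when the thread cannot reacquire the GIL
        last = now


def busy_main(duration_s):
    """Pure-Python CPU work on the main thread, standing in for a forward pass."""
    deadline = time.perf_counter() + duration_s
    x = 0
    while time.perf_counter() < deadline:
        x += 1
    return x


stop, gaps = threading.Event(), []
t = threading.Thread(target=heartbeat, args=(stop, gaps))
t.start()
busy_main(0.2)
stop.set()
t.join()

max_gap_s = max(gaps, default=0.0)
print(f"ticks={len(gaps)} max_gap={max_gap_s * 1e3:.2f} ms")
```

Comparing `max_gap_s` against the requested 1 ms interval gives a rough measure of how long the background thread waited for the GIL.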
+---
+
+## 3. Implication: layerwise is the wrong lever; migration's tax is largely irreducible
+
+| lever | effect on the gap |
+|---|---|
+| **Model-level layerwise transfer** (push each layer's KV during prefill) | **Worse.** Prefill is the most HBM-intensive phase, so per-layer transfers contend *harder* for HBM (TL;DR cause 1), and they multiply the control-plane round-trips (TL;DR cause 2). |
+| **Control-plane fix** (move mooncake orchestration off the GIL-contended threads / separate process) | Addresses only the control-plane share: 11–15% of transfer time in the clean microbench (1b), ~45% of `T_kv_pull` in the trace (1a), where it is driven by the bursty ~10 s stalls. Does **not** touch the HBM-contention half. |
+| **Reduce bytes** (cache-aware target so less KV moves) | Helps linearly; v3 Mechanism B already does some. Orthogonal. |
+| **Migrate to/from idle instances** | Would restore ~10 GB/s — but defeats the purpose (we migrate *because* the host is busy). |
+
+The dominant cost (RDMA contending with compute for HBM on busy instances)
+is a **hardware reality**, not a software bug we can patch away, and not
+something layerwise improves. This reinforces
+[UNIFIED_ABLATION.md](UNIFIED_ABLATION.md): the unified no-migration path
+(A+B'+RaceFix) remains the right default; migration's transfer tax is
+structural on a loaded agentic cluster.
+
+---
+
+## 4. Repro / artifacts
+
+- Instrumented v3 breakdown: `outputs/b3_v3_fullbreak_20260528_0338/unified_v3/`
+  (`transfer_decomp.txt`, `dst_migration_breakdown.{csv,png}`,
+  `transfer_rootcause.png`)
+- MB6 main: `outputs/mb6_agentic-kv_20260528_0552/mb6_result.json`
+- MB6 fresh: `outputs/mb6_fresh_20260528_0559/mb6_result.json`
+- Instruments: `microbench/fresh_setup/instrument_dst_migration.py`,
+  `microbench/fresh_setup/instrument_mooncake.py`
+- Microbench: `microbench/fresh_setup/mb6_transfer_under_load.py` +
+  `run_mb6.sh` (`VENV=… bash run_mb6.sh`)
+- Analyzers: `analyze_dst_migration.py`, `analyze_transfer_decomp.py`
+
+All instruments apply/revert cleanly via `--apply`/`--revert`; both venvs
+were restored after the runs.
--- a/microbench/connector_tax/cache_sweep/analyze_dst_migration.py
+++ b/microbench/connector_tax/cache_sweep/analyze_dst_migration.py
@@ -0,0 +1,333 @@
+#!/usr/bin/env python3
+"""Analyze dst-side migration breakdown for unified_v3 runs.
+
+Joins the proxy `breakdown.json` (per-request route + phase timestamps)
+with the dst engine per-PID logs written by
+`instrument_dst_migration.py` (`dm_mig_pid<pid>.jsonl`), to attribute
+each migration's dst-side wall-clock into:
+
+  T_relay              proxy decode-sent → dst arrival
+  T_admission_pre_kv   dst arrival → status=WAITING_FOR_REMOTE_KVS
+                       (waiting in dst's scheduler queue before KV pull
+                        is even initiated)
+  T_kv_pull            WAITING_FOR_REMOTE_KVS → finished_recving
+                       (the actual RDMA transfer + connector ack)
+  T_admission_post_kv  finished_recving → first time in self.running
+                       (KV ready, waiting for batch slot)
+  T_first_iter         first scheduled → first generated token
+                       (one decode-iter compute + sampler latency)
+
+Layerwise transfer can at best eliminate T_kv_pull. Everything else is
+queueing or compute that layerwise does not touch.
+
+Usage:
+    python analyze_dst_migration.py \
+        --proxy-breakdown <RUNDIR>/breakdown.json \
+        --dst-log-dir     <DST_LOG_DIR>
+        [--output         <RUNDIR>/dst_migration_breakdown.csv]
+        [--plot           <RUNDIR>/dst_migration_breakdown.png]
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import os
+import re
+import statistics
+import sys
+from pathlib import Path
+
+
+def _core_req_id(rid: str) -> str:
+    """Normalize a vLLM engine req_id back to the proxy's request_id.
+
+    vLLM wraps the proxy id `S:T:U:N` as `cmpl-S:T:U:N-<dp_rank>-<hex>`.
+    Strip the `cmpl-` prefix and the trailing `-<digits>-<hex>` suffix so
+    it joins against the proxy `breakdown.json` request_id.
+    """
+    if not rid:
+        return rid
+    s = rid
+    if s.startswith("cmpl-"):
+        s = s[len("cmpl-"):]
+    m = re.match(r"^(.*)-\d+-[0-9a-fA-F]+$", s)
+    if m:
+        s = m.group(1)
+    return s
+
+
+def _pct(vals: list[float], q: float) -> float:
+    if not vals:
+        return float("nan")
+    vs = sorted(vals)
+    i = max(0, min(len(vs) - 1, int(math.ceil(q * len(vs))) - 1))
+    return vs[i]
+
+
+def _summary(name: str, vals: list[float]) -> dict:
+    if not vals:
+        return {"name": name, "n": 0}
+    return {
+        "name": name,
+        "n": len(vals),
+        "mean_s": statistics.mean(vals),
+        "p50_s": _pct(vals, 0.5),
+        "p90_s": _pct(vals, 0.9),
+        "p99_s": _pct(vals, 0.99),
+        "max_s": max(vals),
+        "sum_s": sum(vals),
+    }
+
+
+def load_dst_log(dst_log_dir: Path) -> dict[str, dict]:
+    by_req: dict[str, dict] = {}
+    found_files = sorted(dst_log_dir.glob("dm_mig_pid*.jsonl"))
+    print(f"[analyze] dst log files: {len(found_files)} under {dst_log_dir}")
+    for f in found_files:
+        with f.open() as fh:
+            for line in fh:
+                try:
+                    rec = json.loads(line)
+                except Exception:
+                    continue
+                rid = rec.get("req_id")
+                if not rid:
+                    continue
+                key = _core_req_id(rid)
+                rec["_raw_req_id"] = rid
+                # If a req shows up twice (shouldn't, but be safe), prefer the
+                # one with t_first_token_unix populated.
+                prev = by_req.get(key)
+                if prev is None or (
+                    rec.get("t_first_token_unix") and
+                    not prev.get("t_first_token_unix")
+                ):
+                    by_req[key] = rec
+    print(f"[analyze] unique dst records: {len(by_req)}")
+    return by_req
+
+
+def load_proxy_breakdown(path: Path) -> list[dict]:
+    with path.open() as fh:
+        data = json.load(fh)
+    assert isinstance(data, list), f"unexpected breakdown.json shape: {type(data)}"
+    return data
+
+
+def decompose(proxy_recs: list[dict], dst_by_req: dict[str, dict]) -> list[dict]:
+    """Build per-migration breakdown rows by joining proxy + dst by req_id."""
+    rows: list[dict] = []
+    migrations = [x for x in proxy_recs if x.get("route_class") == "PD_SEP_V2"]
+    print(f"[analyze] proxy migrations: {len(migrations)} "
+          f"(of {len(proxy_recs)} total requests)")
+
+    miss_in_dst = 0
+    missing_phases = 0
+    for p in migrations:
+        rid = p.get("request_id")
+        dst = dst_by_req.get(rid)
+        if dst is None:
+            miss_in_dst += 1
+            continue
+        if dst.get("t_first_token_unix") is None:
+            missing_phases += 1
+            # still include the row but mark phases as NaN downstream
+        t_decode_sent = p.get("t_decode_sent_unix")
+        t_first_tok = p.get("t_first_token_unix")
+        t_arrival = dst.get("t_arrival_unix")
+        t_wait_kvs = dst.get("t_wait_for_kvs_unix")
+        t_kv_done = dst.get("t_kv_recv_done_unix")
+        t_first_sched = dst.get("t_first_scheduled_unix")
+        t_first_tok_dst = dst.get("t_first_token_unix")
+
+        def _diff(a, b):
+            if a is None or b is None:
+                return None
+            return float(a) - float(b)
+
+        rows.append({
+            "request_id": rid,
+            "session_id": p.get("session_id"),
+            "input_length": p.get("input_length"),
+            "v3_new_local": p.get("v3_new_local"),
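+            # Explicit None check: instance index 0 is valid but falsy, so a
+            # plain `or` would wrongly fall through to the fallback key.
+            "v3_target_idx": (p.get("v3_target_idx")
+                              if p.get("v3_target_idx") is not None
+                              else p.get("v3_decode_target_idx")),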
+            "arrival_n_running": (dst.get("arrival_state") or {}).get("n_running"),
+            "arrival_n_waiting": (dst.get("arrival_state") or {}).get("n_waiting"),
+            "arrival_pending_prefill_tok": (dst.get("arrival_state") or {}).get("pending_prefill_tok"),
+            "arrival_n_waiting_for_kvs": (dst.get("arrival_state") or {}).get("n_waiting_for_kvs"),
+            # Phase durations (seconds)
+            "T_proxy_total_dst_first_token_s": _diff(t_first_tok, t_decode_sent),
+            "T_relay_s": _diff(t_arrival, t_decode_sent),
+            "T_admission_pre_kv_s": _diff(t_wait_kvs, t_arrival),
+            "T_kv_pull_s": _diff(t_kv_done, t_wait_kvs),
+            "T_admission_post_kv_s": _diff(t_first_sched, t_kv_done),
+            "T_first_iter_s": _diff(t_first_tok_dst, t_first_sched),
+            # Raw timestamps for debugging
+            "t_decode_sent_unix": t_decode_sent,
+            "t_dst_arrival_unix": t_arrival,
+            "t_dst_wait_for_kvs_unix": t_wait_kvs,
+            "t_dst_kv_recv_done_unix": t_kv_done,
+            "t_dst_first_scheduled_unix": t_first_sched,
+            "t_dst_first_token_unix": t_first_tok_dst,
+            "t_proxy_first_token_unix": t_first_tok,
+        })
+
+    print(f"[analyze] missing in dst log: {miss_in_dst}")
+    print(f"[analyze] dst record incomplete (no t_first_token): {missing_phases}")
+    return rows
+
+
+def emit_summary(rows: list[dict]) -> None:
+    if not rows:
+        print("[analyze] no rows — nothing to summarize.")
+        return
+
+    phase_keys = [
+        "T_proxy_total_dst_first_token_s",
+        "T_relay_s",
+        "T_admission_pre_kv_s",
+        "T_kv_pull_s",
+        "T_admission_post_kv_s",
+        "T_first_iter_s",
+    ]
+
+    print()
+    print("=" * 88)
+    print(f"Migration dst-side phase breakdown  (n_migrations={len(rows)})")
+    print("=" * 88)
+    print(f"{'phase':<36} {'n':>4} {'mean(s)':>9} {'p50':>8} {'p90':>8} "
+          f"{'p99':>8} {'max':>8} {'sum(s)':>9}")
+    print("-" * 88)
+    for k in phase_keys:
+        vals = [r[k] for r in rows if r.get(k) is not None]
+        if not vals:
+            print(f"{k:<36} {'n/a':>4}")
+            continue
+        s = _summary(k, vals)
+        print(f"{k:<36} {s['n']:>4} {s['mean_s']:>9.3f} {s['p50_s']:>8.3f} "
+              f"{s['p90_s']:>8.3f} {s['p99_s']:>8.3f} {s['max_s']:>8.3f} "
+              f"{s['sum_s']:>9.2f}")
+
+    print()
+    print("Aggregate attribution (sum across all migrations):")
+    sums = {}
+    for k in ("T_relay_s", "T_admission_pre_kv_s", "T_kv_pull_s",
+              "T_admission_post_kv_s", "T_first_iter_s"):
+        sums[k] = sum(r[k] for r in rows if r.get(k) is not None)
+    total = sum(sums.values())
+    total_proxy = sum(r["T_proxy_total_dst_first_token_s"] for r in rows
+                      if r.get("T_proxy_total_dst_first_token_s") is not None)
+    print(f"  decomposed sum   : {total:>8.2f} s")
+    print(f"  proxy total sum  : {total_proxy:>8.2f} s   "
+          f"(should be ~equal; gap = uninstrumented)")
+    if total > 0:
+        for k, v in sums.items():
+            print(f"    {k:<28} {v:>8.2f} s  ({v/total*100:5.1f} %)")
+
+    # Headline: "How much could layerwise save?"
+    layerwise_addressable = sums.get("T_kv_pull_s", 0.0)
+    queue_residual = sum(v for k, v in sums.items() if k != "T_kv_pull_s")
+    print()
+    print("Layerwise-addressable vs queue-residual:")
+    print(f"  T_kv_pull_s (addressable by layerwise)   : {layerwise_addressable:>8.2f} s "
+          f"({layerwise_addressable / total * 100 if total else 0:5.1f} %)")
+    print(f"  everything else (queue/admission/iter)   : {queue_residual:>8.2f} s "
+          f"({queue_residual / total * 100 if total else 0:5.1f} %)")
+
+
+def write_csv(rows: list[dict], path: Path) -> None:
+    import csv
+    if not rows:
+        path.write_text("")
+        return
+    fields = list(rows[0].keys())
+    with path.open("w", newline="") as fh:
+        w = csv.DictWriter(fh, fieldnames=fields)
+        w.writeheader()
+        w.writerows(rows)
+    print(f"[analyze] wrote CSV: {path}  (n={len(rows)})")
+
+
+def maybe_plot(rows: list[dict], out_path: Path) -> None:
+    try:
+        import matplotlib
+        matplotlib.use("Agg")
+        import matplotlib.pyplot as plt
+    except Exception as e:
+        print(f"[analyze] matplotlib unavailable ({e}); skipping plot.")
+        return
+
+    if not rows:
+        return
+
+    rows_sorted = sorted(
+        rows,
+        key=lambda r: r.get("T_proxy_total_dst_first_token_s") or 0.0,
+    )
+    n = len(rows_sorted)
+    idx = list(range(n))
+
+    def col(k):
+        return [(r.get(k) or 0.0) for r in rows_sorted]
+
+    relay = col("T_relay_s")
+    pre = col("T_admission_pre_kv_s")
+    pull = col("T_kv_pull_s")
+    post = col("T_admission_post_kv_s")
+    first_iter = col("T_first_iter_s")
+
+    fig, ax = plt.subplots(figsize=(11, 5))
+    bot = [0.0] * n
+    for vals, label, color in [
+        (relay, "HTTP relay", "#cccccc"),
+        (pre, "admission pre-KV", "#f4a261"),
+        (pull, "KV pull (layerwise-addressable)", "#e76f51"),
+        (post, "admission post-KV", "#2a9d8f"),
+        (first_iter, "first decode iter", "#264653"),
+    ]:
+        ax.bar(idx, vals, bottom=bot, color=color, label=label, width=0.85)
+        bot = [b + v for b, v in zip(bot, vals)]
+
+    ax.set_xticks(idx)
+    ax.set_xticklabels([str(i + 1) for i in idx], rotation=0, fontsize=8)
+    ax.set_xlabel("Migrated request (sorted by total dst wait, ascending)")
+    ax.set_ylabel("Time (s)")
+    ax.set_title("Per-migration dst-side phase breakdown (v3 unified_v3 run)")
+    ax.legend(loc="upper left", fontsize=9)
+    ax.grid(axis="y", linestyle=":", alpha=0.5)
+    fig.tight_layout()
+    fig.savefig(out_path, dpi=120)
+    plt.close(fig)
+    print(f"[analyze] wrote plot: {out_path}")
+
+
+def main() -> None:
+    p = argparse.ArgumentParser()
+    p.add_argument("--proxy-breakdown", type=Path, required=True)
+    p.add_argument("--dst-log-dir", type=Path, required=True)
+    p.add_argument("--output", type=Path, default=None,
+                   help="CSV path (default: <run>/dst_migration_breakdown.csv)")
+    p.add_argument("--plot", type=Path, default=None,
+                   help="PNG path (default: <run>/dst_migration_breakdown.png)")
+    args = p.parse_args()
+
+    if not args.proxy_breakdown.is_file():
+        sys.exit(f"missing proxy breakdown: {args.proxy_breakdown}")
+    if not args.dst_log_dir.is_dir():
+        sys.exit(f"missing dst log dir: {args.dst_log_dir}")
+
+    run_dir = args.proxy_breakdown.parent
+    out_csv = args.output or (run_dir / "dst_migration_breakdown.csv")
+    out_png = args.plot or (run_dir / "dst_migration_breakdown.png")
+
+    proxy_recs = load_proxy_breakdown(args.proxy_breakdown)
+    dst_by_req = load_dst_log(args.dst_log_dir)
+    rows = decompose(proxy_recs, dst_by_req)
+    emit_summary(rows)
+    write_csv(rows, out_csv)
+    maybe_plot(rows, out_png)
+
+
+if __name__ == "__main__":
+    main()
--- a/microbench/connector_tax/cache_sweep/analyze_migration_log.py
+++ b/microbench/connector_tax/cache_sweep/analyze_migration_log.py
@@ -0,0 +1,133 @@
+#!/usr/bin/env python3
+"""Per-migration log + per-instance summary for a v3 trace replay.
+
+Reads <run_dir>/breakdown.json and <run_dir>/metrics.jsonl and emits:
+  1. A row per migration showing src→dst, per-side state snapshots, and
+     the resulting TTFT.
+  2. Histograms: migrations received per inst, sent per inst, all
+     (src→dst) pairs.
+  3. Post-rotation tail: how many turns of migrated sessions ended up on
+     each inst (downstream impact of rotation).
+  4. Anti-hotspot signal: recent_mig_received_in_window at decision time.
+
+Run any v3 replay through this to spot pathological clustering of
+migrations on the same dst within a short window.
+
+Usage:
+    python analyze_migration_log.py <RUN_DIR>
+where <RUN_DIR> contains breakdown.json + metrics.jsonl (i.e. the proxy's
+per-policy output folder, e.g. .../b3_v3_20260527_1344/unified_v3).
+"""
+import json
+import sys
+from collections import Counter, defaultdict
+from pathlib import Path
+
+
+def main(run_dir: Path) -> None:
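+    with open(run_dir / "breakdown.json") as fh:
+        bd = json.load(fh)
+    # Parse each metrics line once, skip blank lines, and close the file.
+    with open(run_dir / "metrics.jsonl") as fh:
+        recs = [json.loads(line) for line in fh if line.strip()]
+    m = {r["request_id"]: r for r in recs}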
+
+    mig = [e for e in bd if e.get("v3_migrate")]
+    mig.sort(key=lambda x: x.get("t_decision_unix", 0))
+
+    print(f"=== {len(mig)} migrations in {run_dir.name} ===\n")
+    cols = (
+        "#", "t_rel", "session", "turn",
+        "src", "dst", "src_nreq", "src_dec_tok",
+        "dst_nreq", "dst_cache", "dst_recent_recv",
+        "inlen", "self_ttft_ms",
+    )
+    print("  " + " ".join(f"{c:>13}" for c in cols))
+    print("-" * (15 * len(cols)))
+
+    t0 = mig[0]["t_decision_unix"] if mig else 0
+    for i, e in enumerate(mig):
+        rid = e["request_id"]
+        src_idx = e.get("v3_src_idx", e["chosen_idx"])
+        dst_idx = e.get("v3_target_idx", -1)
+        src_state = e.get("v3_src_state") or {}
+        dst_state = e.get("v3_target_state") or {}
+        cands = {c["idx"]: c for c in e.get("candidate_scores", [])}
+        # Fall back to candidate_scores if dedicated v3_*_state fields aren't present.
+        src_nreq = src_state.get("num_requests", cands.get(src_idx, {}).get("num_requests", "-"))
+        src_dec_tok = src_state.get("ongoing_decode_tokens",
+                                    cands.get(src_idx, {}).get("ongoing_decode_tokens", "-"))
+        dst_nreq = dst_state.get("num_requests", cands.get(dst_idx, {}).get("num_requests", "-"))
+        dst_cache = e.get("v3_target_cache_hit", dst_state.get("cache_hit_estimate", 0))
+        dst_recent = e.get("v3_target_recent_received",
+                            dst_state.get("recent_mig_received_in_window", "-"))
+        inlen = e.get("input_length") or m.get(rid, {}).get("input_length", 0)
+        ttft = m.get(rid, {}).get("ttft_s") or 0
+        t_rel = e["t_decision_unix"] - t0
+        turn = m.get(rid, {}).get("turn_id", "?")
+        print(
+            f"  {i+1:>13} {t_rel:>13.1f} {e['session_id']:>13} {turn:>13} "
+            f"{src_idx:>13} {dst_idx:>13} {src_nreq:>13} {src_dec_tok:>13} "
+            f"{dst_nreq:>13} {dst_cache:>13} {dst_recent:>13} "
+            f"{inlen:>13} {ttft*1000:>13.0f}"
+        )
+
+    # Aggregate counts
+    print("\n=== Migrations TO each instance ===")
+    to_count = Counter(e.get("v3_target_idx", -1) for e in mig)
+    for idx in range(8):
+        print(f"  inst_{idx}: {to_count.get(idx, 0)} migrations received")
+
+    print("\n=== Migrations FROM each instance ===")
+    from_count = Counter(e.get("v3_src_idx", e["chosen_idx"]) for e in mig)
+    for idx in range(8):
+        print(f"  inst_{idx}: {from_count.get(idx, 0)} migrations sent")
+
+    print("\n=== Migration pairs (src→dst, count) ===")
+    pair_count = Counter(
+        (e.get("v3_src_idx", e["chosen_idx"]), e.get("v3_target_idx", -1))
+        for e in mig
+    )
+    for (s, d), n in sorted(pair_count.items(), key=lambda x: -x[1]):
+        print(f"  {s} → {d}: {n}")
+
+    print("\n=== Sessions migrating multiple times ===")
+    sess_mig = defaultdict(list)
+    for e in mig:
+        sess_mig[e["session_id"]].append(
+            (e.get("t_decision_unix", 0),
+             e.get("v3_src_idx", e["chosen_idx"]),
+             e.get("v3_target_idx", -1))
+        )
+    multi = {s: ev for s, ev in sess_mig.items() if len(ev) > 1}
+    if not multi:
+        print("  (none)")
+    for sess, events in sorted(multi.items()):
+        chain = " → ".join(f"{s}->{d}" for _, s, d in sorted(events))
+        print(f"  session {sess}: {chain}")
+
+    # Recent-received hotspot signal — non-zero values mean the picker
+    # accepted a target that recently got another migration.
+    print("\n=== Anti-hotspot signal: dst.recent_mig_received_in_window ===")
+    rec = [e.get("v3_target_recent_received", 0) for e in mig]
+    if rec:
+        nonzero = [v for v in rec if v]
+        print(f"  total migrations: {len(rec)}, "
+              f"with recent_received > 0: {len(nonzero)}, "
+              f"max recent_received: {max(rec)}")
+
+    # Post-rotation tail: turns of migrated sessions after their LAST mig
+    print("\n=== Post-rotation tail per inst (turns of migrated sessions after last mig) ===")
+    tail = Counter()
+    for sess, events in sess_mig.items():
+        final_dst = sorted(events)[-1][2]
+        last_t = max(t for t, _, _ in events)
+        sess_turns = [mm for rid, mm in m.items() if mm["session_id"] == sess]
+        tail[final_dst] += sum(1 for mm in sess_turns
+                               if mm.get("t_dispatch_unix", 0) > last_t)
+    for idx in range(8):
+        print(f"  inst_{idx}: {tail.get(idx, 0)} tail turns")
+
+
+if __name__ == "__main__":
+    if len(sys.argv) < 2:
+        print("usage: analyze_migration_log.py <run_dir>", file=sys.stderr)
+        sys.exit(1)
+    main(Path(sys.argv[1]))
--- a/microbench/connector_tax/cache_sweep/analyze_transfer_decomp.py
+++ b/microbench/connector_tax/cache_sweep/analyze_transfer_decomp.py
@@ -0,0 +1,237 @@
+#!/usr/bin/env python3
+"""Decompose migration KV-transfer time into RDMA-actual vs control-plane.
+
+Joins three logs from an instrumented unified_v3 run:
+
+  proxy breakdown.json        — per-request route + phase timestamps
+  dst_mig_log/dm_mig_pid*.jsonl  — dst lifecycle (instrument_dst_migration.py)
+                                   gives T_kv_pull = wait_for_kvs -> recv_done
+  mooncake xfer/mb2_transfer_pid*.jsonl — connector internals
+                                   (instrument_mooncake.py):
+      send_blocks            : pure RDMA (total_bytes, duration_s)  [producer]
+      receive_kv_enter/finish: consumer-observed transfer window    [consumer]
+      ready_wait             : producer wait for src KV commit       [producer]
+      send_kv_to_decode_enter: producer received the pull request    [producer]
+
+Decisive question: of the 87% dst-side overhead that is T_kv_pull, how
+much is the actual RDMA write (`send_blocks`) vs control-plane
+(handshake / ready-wait / GIL starvation on the busy src)?
+
+  - send_blocks bandwidth ~ wire (10 GB/s) AND send_blocks duration << T_kv_pull
+        => loss is control-plane; layerwise (which only moves WHEN the
+           RDMA fires) will NOT fix it.
+  - send_blocks bandwidth << wire
+        => the RDMA write itself is slow (NIC / src-side servicing);
+           characterize with a load microbench next.
+
+Usage:
+  python analyze_transfer_decomp.py \
+      --proxy-breakdown <RUN>/unified_v3/breakdown.json \
+      --dst-log-dir     <RUN>/dst_mig_log \
+      --xfer-log-dir    <RUN>/xfer_log
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import re
+import statistics
+import sys
+from pathlib import Path
+
+
+def _core_req_id(rid: str) -> str:
+    if not rid:
+        return rid
+    s = rid
+    if s.startswith("cmpl-"):
+        s = s[len("cmpl-"):]
+    m = re.match(r"^(.*)-\d+-[0-9a-fA-F]+$", s)
+    if m:
+        s = m.group(1)
+    return s
+
+
+def _pct(vals, q):
+    if not vals:
+        return float("nan")
+    vs = sorted(vals)
+    i = max(0, min(len(vs) - 1, int(math.ceil(q * len(vs))) - 1))
+    return vs[i]
+
+
+def _stat_line(name, vals, unit="s"):
+    if not vals:
+        print(f"{name:<34} n=0")
+        return
+    print(f"{name:<34} n={len(vals):>3}  mean={statistics.mean(vals):>8.3f}  "
+          f"p50={_pct(vals,0.5):>8.3f}  p90={_pct(vals,0.9):>8.3f}  "
+          f"max={max(vals):>8.3f}  sum={sum(vals):>8.2f} {unit}")
+
+
+def load_events(xfer_dir: Path):
+    files = sorted(xfer_dir.glob("mb2_transfer_pid*.jsonl"))
+    print(f"[xfer] log files: {len(files)} under {xfer_dir}")
+    send_blocks, recv_enter, recv_finish, ready_wait, send_enter = [], [], [], [], []
+    for f in files:
+        pid = f.stem.replace("mb2_transfer_pid", "")
+        with f.open() as fh:
+            for line in fh:
+                try:
+                    e = json.loads(line)
+                except Exception:
+                    continue
+                e["_pid"] = pid
+                ev = e.get("event")
+                if ev == "send_blocks":
+                    send_blocks.append(e)
+                elif ev == "receive_kv_enter":
+                    recv_enter.append(e)
+                elif ev == "receive_kv_finish":
+                    recv_finish.append(e)
+                elif ev == "ready_wait":
+                    ready_wait.append(e)
+                elif ev == "send_kv_to_decode_enter":
+                    send_enter.append(e)
+    print(f"[xfer] events: send_blocks={len(send_blocks)} "
+          f"recv_enter={len(recv_enter)} recv_finish={len(recv_finish)} "
+          f"ready_wait={len(ready_wait)} send_enter={len(send_enter)}")
+    return send_blocks, recv_enter, recv_finish, ready_wait, send_enter
+
+
+def main():
+    p = argparse.ArgumentParser()
+    p.add_argument("--proxy-breakdown", type=Path, required=True)
+    p.add_argument("--dst-log-dir", type=Path, required=True)
+    p.add_argument("--xfer-log-dir", type=Path, required=True)
+    args = p.parse_args()
+
+    for pth in (args.proxy_breakdown, args.dst_log_dir, args.xfer_log_dir):
+        if not pth.exists():
+            sys.exit(f"missing: {pth}")
+
+    proxy = json.loads(args.proxy_breakdown.read_text())
+    migrations = [x for x in proxy if x.get("route_class") == "PD_SEP_V2"]
+    mig_ids = {x.get("request_id") for x in migrations}
+    print(f"[proxy] migrations: {len(migrations)} / {len(proxy)} total")
+
+    # dst lifecycle: T_kv_pull per migration (core req id)
+    dst_pull = {}
+    for f in sorted(args.dst_log_dir.glob("dm_mig_pid*.jsonl")):
+        for line in f.read_text().splitlines():
+            try:
+                r = json.loads(line)
+            except Exception:
+                continue
+            tw = r.get("t_wait_for_kvs_unix")
+            td = r.get("t_kv_recv_done_unix")
+            if tw and td:
+                dst_pull[_core_req_id(r.get("req_id"))] = td - tw
+
+    sb, re_enter, re_finish, rw, se = load_events(args.xfer_log_dir)
+
+    # ---- 1. Pure RDMA bandwidth from send_blocks (the decisive number) ----
+    print("\n" + "=" * 90)
+    print("1. PURE RDMA WRITE rate (`send_blocks` = batch_transfer_sync_write)")
+    print("=" * 90)
+    bws, durs, bytes_l = [], [], []
+    for e in sb:
+        b = e.get("total_bytes", 0)
+        d = e.get("duration_s", 0)
+        if d and d > 0 and b > 0:
+            bws.append(b / 1e9 / d)
+            durs.append(d)
+            bytes_l.append(b)
+    if bws:
+        tot_b = sum(bytes_l)
+        tot_d = sum(durs)
+        print(f"  send_blocks calls: {len(bws)}")
+        print(f"  total bytes moved : {tot_b/2**30:.2f} GiB")
+        print(f"  total RDMA time   : {tot_d:.2f} s")
+        print(f"  AGGREGATE rate    : {tot_b/1e9/tot_d:.2f} GB/s  "
+              f"(MB2 idle-src steady-state = ~9.7-10 GB/s)")
+        _stat_line("  per-call rate (GB/s)", bws, unit="GB/s")
+        _stat_line("  per-call duration", durs)
+        # bandwidth vs size — small ops are latency-bound
+        print("\n  rate vs transfer size:")
+        pairs = sorted(zip(bytes_l, bws))
+        for b, w in pairs:
+            bar = "#" * int(min(40, w * 4))
+            print(f"    {b/2**20:>8.1f} MiB  {w:>6.2f} GB/s  {bar}")
+    else:
+        print("  no send_blocks events with positive duration")
+
+    # ---- 2. Producer ready-wait (src KV commit) ----
+    print("\n" + "=" * 90)
+    print("2. PRODUCER ready-wait (src KV not yet committed when pull arrived)")
+    print("=" * 90)
+    rw_vals = [e.get("ready_wait_s", 0) for e in rw if e.get("ready_wait_s") is not None]
+    already = sum(1 for e in rw if e.get("ready_already_set"))
+    _stat_line("  ready_wait", rw_vals)
+    print(f"  ready_already_set at entry: {already}/{len(rw)} "
+          f"(if most are True, src commit is not the bottleneck)")
+
+    # ---- 3. Consumer-observed receive_kv window ----
+    print("\n" + "=" * 90)
+    print("3. CONSUMER receive_kv window (enter->FINISH, ~most of T_kv_pull)")
+    print("=" * 90)
+    rf_vals = [e.get("duration_s", 0) for e in re_finish if e.get("duration_s")]
+    _stat_line("  receive_kv duration", rf_vals)
+
+    # ---- 4. Per-migration join: T_kv_pull vs receive_kv vs ready_wait ----
+    print("\n" + "=" * 90)
+    print("4. PER-MIGRATION join (T_kv_pull from dst vs connector internals)")
+    print("=" * 90)
+    # index connector events by core req id
+    rf_by_req = {}
+    for e in re_finish:
+        for rid in e.get("req_ids", []):
+            rf_by_req[_core_req_id(rid)] = e.get("duration_s")
+    rw_by_req = {}
+    for e in rw:
+        rw_by_req[_core_req_id(e.get("d_req_id", ""))] = e.get("ready_wait_s")
+
+    joined = 0
+    sum_pull = sum_recv = sum_rw = 0.0
+    rows = []
+    for m in migrations:
+        core = m.get("request_id")
+        pull = dst_pull.get(core)
+        recv = rf_by_req.get(core)
+        rwv = rw_by_req.get(core)
+        if pull is None and recv is None:
+            continue
+        joined += 1
+        if pull: sum_pull += pull
+        if recv: sum_recv += recv
+        if rwv: sum_rw += rwv
+        rows.append((core, m.get("input_length"), m.get("v3_target_cache_hit"),
+                     pull, recv, rwv))
+    print(f"  joined migrations: {joined}")
+    print(f"  Σ T_kv_pull (dst)        = {sum_pull:8.2f} s")
+    print(f"  Σ receive_kv (consumer)  = {sum_recv:8.2f} s")
+    print(f"  Σ ready_wait (producer)  = {sum_rw:8.2f} s")
+    # The RDMA share: best-effort total send_blocks time
+    sum_rdma = sum(durs) if durs else 0.0
+    print(f"  Σ send_blocks RDMA       = {sum_rdma:8.2f} s   (all transfers, "
+          f"not just migrations)")
+    if sum_pull > 0:
+        print(f"\n  RDMA-actual / T_kv_pull   ≈ {sum_rdma/sum_pull*100:5.1f} %")
+        print(f"  ready-wait / T_kv_pull    ≈ {sum_rw/sum_pull*100:5.1f} %")
+        resid = sum_pull - sum_rdma - sum_rw
+        print(f"  control-plane residual    ≈ {resid/sum_pull*100:5.1f} %  "
+              f"(handshake / ZMQ / GIL starvation)")
+
+    print("\n  per-migration detail:")
+    print(f"  {'req_id':<22} {'in_len':>7} {'dst_hit':>8} {'kv_pull':>8} "
+          f"{'recv_kv':>8} {'rdy_wait':>8}")
+    for core, il, hit, pull, recv, rwv in sorted(
+            rows, key=lambda r: -(r[3] or 0)):
+        def s(v): return f"{v:.2f}" if v is not None else "  --"
+        print(f"  {str(core):<22} {str(il):>7} {str(hit):>8} {s(pull):>8} "
+              f"{s(recv):>8} {s(rwv):>8}")
+
+
+if __name__ == "__main__":
+    main()
--- a/microbench/connector_tax/cache_sweep/run_v3_dst_breakdown.sh
+++ b/microbench/connector_tax/cache_sweep/run_v3_dst_breakdown.sh
@@ -0,0 +1,90 @@
+#!/usr/bin/env bash
+# v3 trace replay with dst-side migration breakdown instrumentation.
+#
+# Same trace + DR_FIX as `run_v3_replay.sh`, plus:
+#   - instrument_dst_migration.py applied to vLLM scheduler
+#   - DM_LOG_DIR exported to all 8 vLLM instances so per-PID
+#     dst-migration logs land in <RUNDIR>/dst_mig_log/
+#   - analyze_dst_migration.py runs on completion to print the
+#     T_kv_pull vs queue-residual decomposition
+#
+# Usage: bash run_v3_dst_breakdown.sh
+
+set -uo pipefail
+
+PROJ_DIR="${PROJ_DIR:-/home/admin/cpfs/wjh/agentic-kv}"
+TRACE="${TRACE:-$PROJ_DIR/traces/w600_r0.0015_st30.jsonl}"
+DATE="$(date +%Y%m%d_%H%M)"
+OUTROOT="${OUTROOT:-$PROJ_DIR/outputs/b3_v3_dstbreak_${DATE}}"
+PYTHON="$PROJ_DIR/.venv/bin/python"
+VLLM_ROOT="${VLLM_ROOT:-$PROJ_DIR/.venv/lib/python3.12/site-packages/vllm}"
+DR_FIX_SCRIPT="$PROJ_DIR/microbench/connector_tax/cache_sweep/apply_direct_read_fix.py"
+DM_INSTRUMENT="$PROJ_DIR/microbench/fresh_setup/instrument_dst_migration.py"
+ANALYZE="$PROJ_DIR/microbench/connector_tax/cache_sweep/analyze_dst_migration.py"
+
+mkdir -p "$OUTROOT"
+DST_LOG_DIR="$OUTROOT/dst_mig_log"
+mkdir -p "$DST_LOG_DIR"
+
+echo "=== unified_v3 + dst-side migration breakdown ==="
+echo "Trace      : $TRACE"
+echo "Out        : $OUTROOT"
+echo "DST logs   : $DST_LOG_DIR"
+echo ""
+
+cleanup_all() {
+    pkill -9 -f cache_aware_proxy 2>/dev/null || true
+    pkill -9 -f "vllm serve" 2>/dev/null || true
+    pkill -9 -f "EngineCore" 2>/dev/null || true
+    sleep 5
+    "$PYTHON" "$DR_FIX_SCRIPT" --revert --vllm-root "$VLLM_ROOT" 2>/dev/null || true
+    "$PYTHON" "$DM_INSTRUMENT" --revert --venv "$PROJ_DIR/.venv" 2>/dev/null || true
+}
+trap cleanup_all EXIT
+cleanup_all
+
+echo "[stage 0a] applying CT_DR_FIX (env-gated)"
+"$PYTHON" "$DR_FIX_SCRIPT" --apply --vllm-root "$VLLM_ROOT"
+
+echo "[stage 0b] applying DST migration instrumentation"
+"$PYTHON" "$DM_INSTRUMENT" --apply --venv "$PROJ_DIR/.venv"
+"$PYTHON" "$DM_INSTRUMENT" --check --venv "$PROJ_DIR/.venv"
+
+cfg_dir="$OUTROOT/unified_v3"
+mkdir -p "$cfg_dir"
+
+# Activate DR-fix env gate (consistent with run_v3_replay.sh)
+export VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC=1
+# Export DM_LOG_DIR — every vLLM EngineCore inherits this env and writes
+# its own dm_mig_pid<pid>.jsonl into it.
+export DM_LOG_DIR="$DST_LOG_DIR"
+
+echo ""
+echo "====== unified_v3 ; DR_SYNC_DISABLED=1 ; DM_LOG_DIR=$DST_LOG_DIR ======"
+bash "$PROJ_DIR/scripts/b3_isolated_policy.sh" "unified_v3" "$TRACE" "$cfg_dir" \
+    2>&1 | tee "$cfg_dir/orchestrator.log" | tail -30
+
+pkill -9 -f cache_aware_proxy 2>/dev/null || true
+pkill -9 -f "vllm serve" 2>/dev/null || true
+pkill -9 -f "EngineCore" 2>/dev/null || true
+sleep 5
+
+echo ""
+echo "[stage Z] reverting DR_FIX + DM instrument"
+"$PYTHON" "$DR_FIX_SCRIPT" --revert --vllm-root "$VLLM_ROOT"
+"$PYTHON" "$DM_INSTRUMENT" --revert --venv "$PROJ_DIR/.venv"
+
+echo ""
+echo "[stage analyze] dst-side migration breakdown"
+"$PYTHON" "$ANALYZE" \
+    --proxy-breakdown "$cfg_dir/breakdown.json" \
+    --dst-log-dir "$DST_LOG_DIR" \
+    --output "$cfg_dir/dst_migration_breakdown.csv" \
+    --plot "$cfg_dir/dst_migration_breakdown.png" \
+    2>&1 | tee "$cfg_dir/dst_migration_breakdown.txt"
+
+echo ""
+echo "Done."
+echo "  proxy breakdown : $cfg_dir/breakdown.json"
+echo "  dst per-PID log : $DST_LOG_DIR/"
+echo "  decomposition   : $cfg_dir/dst_migration_breakdown.{csv,png,txt}"
--- a/microbench/connector_tax/cache_sweep/run_v3_full_breakdown.sh
+++ b/microbench/connector_tax/cache_sweep/run_v3_full_breakdown.sh
@@ -0,0 +1,102 @@
+#!/usr/bin/env bash
+# v3 trace replay with FULL migration instrumentation:
+#   - instrument_dst_migration.py  : dst lifecycle -> T_kv_pull
+#   - instrument_mooncake.py        : connector internals (send_blocks RDMA,
+#                                      receive_kv window, ready_wait)
+# Goal: decompose the 87% T_kv_pull into RDMA-actual vs control-plane to
+# explain why effective bandwidth is far below the ~10 GB/s wire rate.
+#
+# Usage: bash run_v3_full_breakdown.sh
+
+set -uo pipefail
+
+PROJ_DIR="${PROJ_DIR:-/home/admin/cpfs/wjh/agentic-kv}"
+TRACE="${TRACE:-$PROJ_DIR/traces/w600_r0.0015_st30.jsonl}"
+DATE="$(date +%Y%m%d_%H%M)"
+OUTROOT="${OUTROOT:-$PROJ_DIR/outputs/b3_v3_fullbreak_${DATE}}"
+PYTHON="$PROJ_DIR/.venv/bin/python"
+VENV="$PROJ_DIR/.venv"
+VLLM_ROOT="${VLLM_ROOT:-$VENV/lib/python3.12/site-packages/vllm}"
+DR_FIX="$PROJ_DIR/microbench/connector_tax/cache_sweep/apply_direct_read_fix.py"
+DM_INSTR="$PROJ_DIR/microbench/fresh_setup/instrument_dst_migration.py"
+MC_INSTR="$PROJ_DIR/microbench/fresh_setup/instrument_mooncake.py"
+ANALYZE_DST="$PROJ_DIR/microbench/connector_tax/cache_sweep/analyze_dst_migration.py"
+ANALYZE_XFER="$PROJ_DIR/microbench/connector_tax/cache_sweep/analyze_transfer_decomp.py"
+
+mkdir -p "$OUTROOT"
+DST_LOG_DIR="$OUTROOT/dst_mig_log"
+XFER_LOG_DIR="$OUTROOT/xfer_log"
+mkdir -p "$DST_LOG_DIR" "$XFER_LOG_DIR"
+
+echo "=== unified_v3 + FULL migration breakdown ==="
+echo "Out      : $OUTROOT"
+echo "DST logs : $DST_LOG_DIR"
+echo "XFER logs: $XFER_LOG_DIR"
+echo ""
+
+cleanup_all() {
+    pkill -9 -f cache_aware_proxy 2>/dev/null || true
+    pkill -9 -f "vllm serve" 2>/dev/null || true
+    pkill -9 -f "EngineCore" 2>/dev/null || true
+    sleep 5
+    "$PYTHON" "$DR_FIX"   --revert --vllm-root "$VLLM_ROOT" 2>/dev/null || true
+    "$PYTHON" "$DM_INSTR" --revert --venv "$VENV" 2>/dev/null || true
+    "$PYTHON" "$MC_INSTR" --revert --venv "$VLLM_ROOT/distributed/kv_transfer/kv_connector/v1/mooncake/mooncake_connector.py" 2>/dev/null || true
+}
+trap cleanup_all EXIT
+cleanup_all
+
+echo "[0a] DR_FIX"
+"$PYTHON" "$DR_FIX" --apply --vllm-root "$VLLM_ROOT"
+echo "[0b] DST migration instrument"
+"$PYTHON" "$DM_INSTR" --apply --venv "$VENV"
+echo "[0c] Mooncake transfer instrument"
+"$PYTHON" "$MC_INSTR" --apply --venv "$VLLM_ROOT/distributed/kv_transfer/kv_connector/v1/mooncake/mooncake_connector.py"
+"$PYTHON" "$DM_INSTR" --check --venv "$VENV"
+"$PYTHON" "$MC_INSTR" --check --venv "$VLLM_ROOT/distributed/kv_transfer/kv_connector/v1/mooncake/mooncake_connector.py"
+
+cfg_dir="$OUTROOT/unified_v3"
+mkdir -p "$cfg_dir"
+
+export VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC=1
+export DM_LOG_DIR="$DST_LOG_DIR"
+export MB2_LOG_DIR="$XFER_LOG_DIR"
+
+echo ""
+echo "====== unified_v3 ; DM_LOG_DIR + MB2_LOG_DIR set ======"
+bash "$PROJ_DIR/scripts/b3_isolated_policy.sh" "unified_v3" "$TRACE" "$cfg_dir" \
+    2>&1 | tee "$cfg_dir/orchestrator.log" | tail -25
+
+pkill -9 -f cache_aware_proxy 2>/dev/null || true
+pkill -9 -f "vllm serve" 2>/dev/null || true
+pkill -9 -f "EngineCore" 2>/dev/null || true
+sleep 5
+
+echo ""
+echo "[Z] revert all instruments"
+"$PYTHON" "$DR_FIX"   --revert --vllm-root "$VLLM_ROOT"
+"$PYTHON" "$DM_INSTR" --revert --venv "$VENV"
+"$PYTHON" "$MC_INSTR" --revert --venv "$VLLM_ROOT/distributed/kv_transfer/kv_connector/v1/mooncake/mooncake_connector.py"
+
+echo ""
+echo "[analyze 1] dst-side T_kv_pull breakdown"
+"$PYTHON" "$ANALYZE_DST" \
+    --proxy-breakdown "$cfg_dir/breakdown.json" \
+    --dst-log-dir "$DST_LOG_DIR" \
+    --output "$cfg_dir/dst_migration_breakdown.csv" \
+    --plot "$cfg_dir/dst_migration_breakdown.png" \
+    2>&1 | tee "$cfg_dir/dst_migration_breakdown.txt" || echo "(dst analyze failed)"
+
+echo ""
+echo "[analyze 2] transfer decomposition: RDMA-actual vs control-plane"
+"$PYTHON" "$ANALYZE_XFER" \
+    --proxy-breakdown "$cfg_dir/breakdown.json" \
+    --dst-log-dir "$DST_LOG_DIR" \
+    --xfer-log-dir "$XFER_LOG_DIR" \
+    2>&1 | tee "$cfg_dir/transfer_decomp.txt" || echo "(xfer analyze failed)"
+
+echo ""
+echo "Done. Artifacts in $cfg_dir/"
+echo "  dst_migration_breakdown.{csv,png,txt}"
+echo "  transfer_decomp.txt"
+echo "  raw: $DST_LOG_DIR/  $XFER_LOG_DIR/"
--- a/microbench/fresh_setup/analyze_goodput.py
+++ b/microbench/fresh_setup/analyze_goodput.py
@@ -0,0 +1,198 @@
+"""SLO-goodput analyzer + PD_advantage for the PD-disagg crossover study.
+
+Reads per-arm replayer output (replay_metrics.jsonl) and computes, per arm:
+  - completion rate (error-free fraction)
+  - raw TTFT / TPOT / E2E percentiles (over successes — reported for context
+    only; NEVER the verdict metric, since failing arms have a small success set)
+  - SLO-goodput: fraction of OFFERED requests that are error-free AND meet a
+    (TTFT, TPOT) SLO. This is the verdict metric.
+
+The two arms must replay the IDENTICAL trace (same seed), so they are paired
+request-for-request. PD_advantage = goodput(arm) / goodput(baseline); y=1 is
+the crossover line — PD_advantage >= 1 means PD-disagg wins.
+
+Goodput is computed over a grid of SLO thresholds so the conclusion does not
+hinge on one arbitrary cutoff.
+
+Usage:
+  python analyze_goodput.py \
+      --arm 8C-proxy  .../8C-proxy/replay_metrics.jsonl \
+      --arm 4P+4D     .../4P+4D/replay_metrics.jsonl \
+      --baseline 8C-proxy \
+      --ttft-slo 0.5 1 2 5 --tpot-slo 0.05 0.1 0.2
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import statistics
+from pathlib import Path
+
+
+def load_metrics(path: Path) -> list[dict]:
+    rows = []
+    with path.open("r", encoding="utf-8") as fh:
+        for line in fh:
+            line = line.strip()
+            if line:
+                rows.append(json.loads(line))
+    return rows
+
+
+def percentile(sorted_vals: list[float], pct: float) -> float:
+    n = len(sorted_vals)
+    if n == 0:
+        return float("nan")
+    if n == 1:
+        return sorted_vals[0]
+    rank = pct * (n - 1)
+    lo = int(rank)
+    hi = min(lo + 1, n - 1)
+    frac = rank - lo
+    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac
+
+
+def pstats(vals: list[float]) -> dict:
+    clean = sorted(v for v in vals if v is not None)
+    if not clean:
+        return {"count": 0}
+    return {
+        "count": len(clean),
+        "mean": statistics.fmean(clean),
+        "p50": percentile(clean, 0.50),
+        "p90": percentile(clean, 0.90),
+        "p99": percentile(clean, 0.99),
+    }
+
+
+def offered_window_s(rows: list[dict]) -> float:
+    ts = [r.get("trace_timestamp_s") for r in rows if r.get("trace_timestamp_s") is not None]
+    if len(ts) < 2:
+        return 0.0
+    return max(ts) - min(ts)
+
+
+def meets_slo(r: dict, ttft_slo: float, tpot_slo: float) -> bool:
+    if r.get("error") is not None:
+        return False
+    ttft = r.get("ttft_s")
+    tpot = r.get("tpot_s")
+    if ttft is None:
+        return False
+    if ttft > ttft_slo:
+        return False
+    # tpot is 0 only for single-token outputs; a zero or missing tpot passes any SLO.
+    if tpot is not None and tpot > tpot_slo:
+        return False
+    return True
+
+
+def load_summary(jsonl_path: Path) -> dict:
+    """Read the sibling replay_metrics.summary.json (wall-clock, amplification)."""
+    sp = jsonl_path.with_suffix(".summary.json")
+    if sp.exists():
+        try:
+            return json.loads(sp.read_text())
+        except Exception:
+            return {}
+    return {}
+
+
+def summarize_arm(name: str, jsonl_path: Path, rows: list[dict]) -> dict:
+    n = len(rows)
+    ok = [r for r in rows if r.get("error") is None]
+    window = offered_window_s(rows)
+    summ = load_summary(jsonl_path)
+    return {
+        "name": name,
+        "n_offered": n,
+        "n_success": len(ok),
+        "completion_rate": len(ok) / n if n else 0.0,
+        "offered_window_s": window,
+        "offered_qps": n / window if window > 0 else 0.0,
+        # Throughput: how much longer than the offered window it took to drain.
+        # ~1.0 = keeps up; >1 = falling behind (the cleanest PD-collapse signal).
+        "wall_clock_s": summ.get("wall_clock_s"),
+        "amplification": summ.get("amplification"),
+        "ttft": pstats([r.get("ttft_s") for r in ok]),
+        "tpot": pstats([r.get("tpot_s") for r in ok]),
+        "e2e": pstats([r.get("latency_s") for r in ok]),
+        "_rows": rows,
+    }
+
+
+def main() -> None:
+    p = argparse.ArgumentParser(description=__doc__,
+                                formatter_class=argparse.RawDescriptionHelpFormatter)
+    p.add_argument("--arm", nargs=2, action="append", metavar=("NAME", "PATH"),
+                   required=True, help="arm name + replay_metrics.jsonl path (repeatable)")
+    p.add_argument("--baseline", required=True, help="arm name to use as PD_advantage denominator")
+    p.add_argument("--ttft-slo", nargs="+", type=float, default=[0.5, 1.0, 2.0, 5.0])
+    p.add_argument("--tpot-slo", nargs="+", type=float, default=[0.05, 0.1, 0.2])
+    p.add_argument("--out-json", type=Path, default=None)
+    args = p.parse_args()
+
+    arms = {}
+    for name, path in args.arm:
+        arms[name] = summarize_arm(name, Path(path), load_metrics(Path(path)))
+
+    if args.baseline not in arms:
+        raise SystemExit(f"baseline {args.baseline!r} not among arms {list(arms)}")
+
+    # ---- per-arm overview ------------------------------------------------
+    print("=" * 78)
+    print("PER-ARM OVERVIEW (latency stats over successes only — context, not verdict)")
+    print("=" * 78)
+    hdr = f"{'arm':<12}{'offered':>8}{'compl%':>8}{'ampl':>6}{'oQPS':>7}" \
+          f"{'TTFTp50':>9}{'TTFTp90':>9}{'TPOTp50':>9}{'TPOTp99':>9}{'E2Ep90':>9}"
+    print(hdr)
+    for name, a in arms.items():
+        t, tp, e = a["ttft"], a["tpot"], a["e2e"]
+        ampl = a.get("amplification")
+        ampl_s = f"{ampl:>6.2f}" if isinstance(ampl, (int, float)) else f"{'--':>6}"
+        print(f"{name:<12}{a['n_offered']:>8}{100*a['completion_rate']:>7.1f}%"
+              f"{ampl_s}{a['offered_qps']:>7.2f}"
+              f"{t.get('p50', float('nan')):>9.2f}{t.get('p90', float('nan')):>9.2f}"
+              f"{1000*tp.get('p50', float('nan')):>8.0f}m{1000*tp.get('p99', float('nan')):>8.0f}m"
+              f"{e.get('p90', float('nan')):>9.2f}")
+
+    # ---- SLO-goodput grid + PD_advantage --------------------------------
+    base = arms[args.baseline]
+    grid = []
+    print()
+    print("=" * 78)
+    print("SLO-GOODPUT (attainment = error-free AND TTFT<=slo AND TPOT<=slo)")
+    print(f"PD_advantage = attainment(arm) / attainment(baseline={args.baseline}); "
+          f">=1 means arm wins")
+    print("=" * 78)
+    for ttft_slo in args.ttft_slo:
+        for tpot_slo in args.tpot_slo:
+            row = {"ttft_slo_s": ttft_slo, "tpot_slo_s": tpot_slo, "arms": {}}
+            base_n = sum(1 for r in base["_rows"] if meets_slo(r, ttft_slo, tpot_slo))
+            base_att = base_n / base["n_offered"] if base["n_offered"] else 0.0
+            line = f"TTFT<={ttft_slo:>4}s TPOT<={int(1000*tpot_slo):>4}ms | "
+            cells = []
+            for name, a in arms.items():
+                n_slo = sum(1 for r in a["_rows"] if meets_slo(r, ttft_slo, tpot_slo))
+                att = n_slo / a["n_offered"] if a["n_offered"] else 0.0
+                adv = (att / base_att) if base_att > 0 else float("nan")
+                row["arms"][name] = {"attainment": att, "pd_advantage": adv, "n_slo": n_slo}
+                tag = "" if name == args.baseline else f" adv={adv:.2f}"
+                cells.append(f"{name}={100*att:>5.1f}%{tag}")
+            print(line + "  ".join(cells))
+            grid.append(row)
+
+    if args.out_json:
+        out = {
+            "baseline": args.baseline,
+            "arms": {n: {k: v for k, v in a.items() if k != "_rows"}
+                     for n, a in arms.items()},
+            "slo_grid": grid,
+        }
+        args.out_json.write_text(json.dumps(out, indent=2))
+        print(f"\nwrote {args.out_json}")
+
+
+if __name__ == "__main__":
+    main()
--- a/microbench/fresh_setup/instrument_dst_migration.py
+++ b/microbench/fresh_setup/instrument_dst_migration.py
@@ -0,0 +1,360 @@
+#!/usr/bin/env python3
+"""Instrument vLLM V1 scheduler to dump per-request DST-side migration timeline.
+
+For each request that arrives at the engine with `kv_transfer_params`
+containing `do_remote_prefill=True` (i.e., the decode-target of an
+EAR v3 migration), record:
+
+  t_arrival_unix         — Scheduler.add_request() entry
+  t_wait_for_kvs_unix    — status set to WAITING_FOR_REMOTE_KVS (KV pull start)
+  t_kv_recv_done_unix    — req_id added to finished_recving_kv_req_ids
+  t_first_scheduled_unix — first time req appears in self.running after KV done
+  t_first_token_unix     — first new_token_ids appended in update_from_output
+  arrival_state          — {n_running, n_waiting, pending_prefill_tok,
+                            n_waiting_for_kvs}
+
+These timestamps complement the proxy `breakdown.json` (t_decode_sent_unix /
+t_first_token_unix) and split the migration's dst-side wait into:
+  HTTP relay + admission_pre_kv + KV pull + admission_post_kv + first_iter
+
+One JSONL per EngineCore PID at $DM_LOG_DIR/dm_mig_pid<pid>.jsonl
+(default DM_LOG_DIR=/tmp). Records are flushed when t_first_token is
+reached or when the request is aborted/finished.
+
+Co-exists with MB5 KV snapshot patches (different START/END markers).
+
+Usage:
+    python instrument_dst_migration.py --apply  [--venv PATH]
+    python instrument_dst_migration.py --revert [--venv PATH]
+    python instrument_dst_migration.py --check  [--venv PATH]
+"""
+from __future__ import annotations
+
+import argparse
+import re
+from pathlib import Path
+
+DEFAULT_VENV = Path("/home/admin/cpfs/wjh/agentic-kv/.venv")
+TARGET_REL = "lib/python3.12/site-packages/vllm/v1/core/sched/scheduler.py"
+
+START_MARK = "# DM_INSTRUMENT_START"
+END_MARK = "# DM_INSTRUMENT_END"
+
+# ---------- Patch 1: module-level header (helpers + globals) -----------------
+# Anchor: the very first `class Scheduler(SchedulerInterface):` line. We insert
+# the entire helper block immediately before that, so MB5's prior block (if
+# present) is preserved and our block lives in module scope. The anchor must
+# stay outside our own START/END markers so revert() can re-find it.
+HEADER_ANCHOR = "class Scheduler(SchedulerInterface):"
+
+HEADER_INSERT = f"""{START_MARK}
+import json as _dm_json
+import os as _dm_os
+import threading as _dm_threading
+import time as _dm_time
+_DM_LOG_DIR = _dm_os.environ.get("DM_LOG_DIR", "/tmp")
+try:
+    _dm_os.makedirs(_DM_LOG_DIR, exist_ok=True)
+except Exception:
+    pass
+_DM_LOG_PATH = _dm_os.path.join(
+    _DM_LOG_DIR, f"dm_mig_pid{{_dm_os.getpid()}}.jsonl"
+)
+_DM_LOG_FILE = None
+_DM_LOG_LOCK = _dm_threading.Lock()
+# req_id -> in-flight record. We pop and flush when t_first_token lands or on
+# finish/abort.
+_DM_DATA: dict = {{}}
+
+
+def _dm_write_event(d: dict) -> None:
+    global _DM_LOG_FILE
+    with _DM_LOG_LOCK:
+        if _DM_LOG_FILE is None:
+            _DM_LOG_FILE = open(_DM_LOG_PATH, "a", buffering=1)
+        _DM_LOG_FILE.write(_dm_json.dumps(d) + "\\n")
+
+
+def _dm_is_migrated(request) -> bool:
+    ktp = getattr(request, "kv_transfer_params", None)
+    if not isinstance(ktp, dict):
+        return False
+    return bool(ktp.get("do_remote_prefill"))
+
+
+def _dm_snapshot_arrival(scheduler) -> dict:
+    try:
+        n_running = len(scheduler.running)
+    except Exception:
+        n_running = -1
+    try:
+        n_waiting_main = len(scheduler.waiting)
+    except Exception:
+        n_waiting_main = -1
+    try:
+        n_skipped = len(scheduler.skipped_waiting)
+    except Exception:
+        n_skipped = 0
+    pending_tok = 0
+    n_kv = 0
+    try:
+        from vllm.v1.request import RequestStatus as _RS
+        for r in list(scheduler.waiting):
+            try:
+                if getattr(r, "status", None) == _RS.WAITING_FOR_REMOTE_KVS:
+                    n_kv += 1
+                npr = int(getattr(r, "num_prompt_tokens", 0))
+                nct = int(getattr(r, "num_computed_tokens", 0))
+                pending_tok += max(0, npr - nct)
+            except Exception:
+                pass
+        for r in list(scheduler.skipped_waiting):
+            try:
+                if getattr(r, "status", None) == _RS.WAITING_FOR_REMOTE_KVS:
+                    n_kv += 1
+            except Exception:
+                pass
+    except Exception:
+        pass
+    return {{
+        "n_running": int(n_running),
+        "n_waiting": int(n_waiting_main + n_skipped),
+        "pending_prefill_tok": int(pending_tok),
+        "n_waiting_for_kvs": int(n_kv),
+    }}
+
+
+def _dm_emit_and_drop(req_id: str, reason: str = "first_token") -> None:
+    rec = _DM_DATA.pop(req_id, None)
+    if rec is None:
+        return
+    rec["flush_reason"] = reason
+    rec["t_flush_unix"] = _dm_time.time()
+    _dm_write_event(rec)
+
+
+{END_MARK}
+
+
+"""
+
+# ---------- Patch 2: add_request() hook --------------------------------------
+# Right after self.requests[request.request_id] = request (line ~1927) and the
+# if self.log_stats: block. Anchor includes the QUEUED record_event line so it
+# is uniquely matchable.
+ADD_REQUEST_ANCHOR = """            self._enqueue_waiting_request(request)
+            self.requests[request.request_id] = request
+            if self.log_stats:
+                request.record_event(EngineCoreEventType.QUEUED)
+"""
+
+ADD_REQUEST_REPLACE = f"""            self._enqueue_waiting_request(request)
+            self.requests[request.request_id] = request
+            if self.log_stats:
+                request.record_event(EngineCoreEventType.QUEUED)
+            {START_MARK}
+            try:
+                if _dm_is_migrated(request):
+                    _DM_DATA[request.request_id] = {{
+                        "req_id": str(request.request_id),
+                        "is_migrated": True,
+                        "n_prompt_tokens": int(getattr(request, "num_prompt_tokens", 0)),
+                        "t_arrival_unix": _dm_time.time(),
+                        "t_wait_for_kvs_unix": None,
+                        "t_kv_recv_done_unix": None,
+                        "t_first_scheduled_unix": None,
+                        "t_first_token_unix": None,
+                        "arrival_state": _dm_snapshot_arrival(self),
+                    }}
+            except Exception:
+                pass
+            {END_MARK}
+"""
+
+# ---------- Patch 3: WAITING_FOR_REMOTE_KVS transition -----------------------
+WAIT_KV_ANCHOR = """                    request.status = RequestStatus.WAITING_FOR_REMOTE_KVS
+                    step_skipped_waiting.prepend_request(request)
+"""
+
+WAIT_KV_REPLACE = f"""                    request.status = RequestStatus.WAITING_FOR_REMOTE_KVS
+                    {START_MARK}
+                    try:
+                        _rec = _DM_DATA.get(request.request_id)
+                        if _rec is not None and _rec["t_wait_for_kvs_unix"] is None:
+                            _rec["t_wait_for_kvs_unix"] = _dm_time.time()
+                    except Exception:
+                        pass
+                    {END_MARK}
+                    step_skipped_waiting.prepend_request(request)
+"""
+
+# ---------- Patch 4: finished_recving signal ---------------------------------
+FINISHED_RECV_ANCHOR = """            if req.status == RequestStatus.WAITING_FOR_REMOTE_KVS:
+                self.finished_recving_kv_req_ids.add(req_id)
+            elif RequestStatus.is_finished(req.status):
+"""
+
+FINISHED_RECV_REPLACE = f"""            if req.status == RequestStatus.WAITING_FOR_REMOTE_KVS:
+                self.finished_recving_kv_req_ids.add(req_id)
+                {START_MARK}
+                try:
+                    _rec = _DM_DATA.get(req_id)
+                    if _rec is not None and _rec["t_kv_recv_done_unix"] is None:
+                        _rec["t_kv_recv_done_unix"] = _dm_time.time()
+                except Exception:
+                    pass
+                {END_MARK}
+            elif RequestStatus.is_finished(req.status):
+"""
+
+# ---------- Patch 5: first scheduled (sweep at end of schedule()) ------------
+# Anchored on the MB5 snapshot block, so the MB5 patch must already be applied.
+SCHED_END_ANCHOR = """        # MB5_INSTRUMENT_START
+        _mb5_snapshot(self)
+        # MB5_INSTRUMENT_END
+        return scheduler_output
+"""
+
+SCHED_END_REPLACE = f"""        # MB5_INSTRUMENT_START
+        _mb5_snapshot(self)
+        # MB5_INSTRUMENT_END
+        {START_MARK}
+        try:
+            if _DM_DATA:
+                _now_dm = _dm_time.time()
+                for _r in self.running:
+                    _rec = _DM_DATA.get(_r.request_id)
+                    if _rec is not None and _rec["t_first_scheduled_unix"] is None:
+                        _rec["t_first_scheduled_unix"] = _now_dm
+        except Exception:
+            pass
+        {END_MARK}
+        return scheduler_output
+"""
+
+# ---------- Patch 6: first new token in update_from_output -------------------
+FIRST_TOK_ANCHOR = """            # Check for stop and update request status.
+            if new_token_ids:
+                new_token_ids, stopped = self._update_request_with_output(
+                    request, new_token_ids
+                )
+"""
+
+FIRST_TOK_REPLACE = f"""            # Check for stop and update request status.
+            if new_token_ids:
+                {START_MARK}
+                try:
+                    _rec = _DM_DATA.get(request.request_id)
+                    if _rec is not None and _rec["t_first_token_unix"] is None:
+                        _rec["t_first_token_unix"] = _dm_time.time()
+                        _dm_emit_and_drop(request.request_id, reason="first_token")
+                except Exception:
+                    pass
+                {END_MARK}
+                new_token_ids, stopped = self._update_request_with_output(
+                    request, new_token_ids
+                )
+"""
+
+# ---------- Patch 7: abort/finish — flush partial record ---------------------
+FINISH_ANCHOR = """            request.status = finished_status
+            self._free_request(request, delay_free_blocks=delay_free_blocks)
+"""
+
+FINISH_REPLACE = f"""            request.status = finished_status
+            {START_MARK}
+            try:
+                if request.request_id in _DM_DATA:
+                    _dm_emit_and_drop(request.request_id, reason="finish_or_abort")
+            except Exception:
+                pass
+            {END_MARK}
+            self._free_request(request, delay_free_blocks=delay_free_blocks)
+"""
+
+PATCHES = [
+    ("header",            HEADER_ANCHOR,         HEADER_INSERT + HEADER_ANCHOR),
+    ("add_request",       ADD_REQUEST_ANCHOR,    ADD_REQUEST_REPLACE),
+    ("wait_for_kvs",      WAIT_KV_ANCHOR,        WAIT_KV_REPLACE),
+    ("finished_recving",  FINISHED_RECV_ANCHOR,  FINISHED_RECV_REPLACE),
+    ("first_scheduled",   SCHED_END_ANCHOR,      SCHED_END_REPLACE),
+    ("first_token",       FIRST_TOK_ANCHOR,      FIRST_TOK_REPLACE),
+    ("finish_flush",      FINISH_ANCHOR,         FINISH_REPLACE),
+]
+
+
+def find_target(venv_or_path: Path) -> Path:
+    candidates = [venv_or_path / TARGET_REL, DEFAULT_VENV / TARGET_REL]
+    for c in candidates:
+        if c.is_file():
+            return c
+    raise FileNotFoundError(f"cannot find {TARGET_REL} under {venv_or_path} or {DEFAULT_VENV}")
+
+
+def is_patched(text: str) -> bool:
+    return START_MARK in text
+
+
+def apply(target: Path) -> None:
+    text = target.read_text()
+    if is_patched(text):
+        print(f"[dm-instr] already patched: {target}")
+        return
+    new = text
+    for name, src, dst in PATCHES:
+        if src not in new:
+            raise RuntimeError(
+                f"patch {name!r}: anchor not found in {target}. "
+                f"Anchor head: {src.splitlines()[0]!r}"
+            )
+        new = new.replace(src, dst, 1)
+    target.write_text(new)
+    print(f"[dm-instr] applied {len(PATCHES)} patches -> {target}")
+
+
+def revert(target: Path) -> None:
+    text = target.read_text()
+    if not is_patched(text):
+        print(f"[dm-instr] not patched (nothing to revert): {target}")
+        return
+    # Strip our DM_* block, including the trailing newline that
+    # terminated the END_MARK line. We do NOT collapse other blank-line
+    # runs (MB5_* whitespace and original spacing between methods are
+    # preserved).
+    pat = re.compile(
+        r"[ \t]*" + re.escape(START_MARK) + r".*?" + re.escape(END_MARK) + r"\n",
+        flags=re.DOTALL,
+    )
+    new = pat.sub("", text)
+    # The header insert places the DM_* block, followed by two blank
+    # lines, ahead of the anchor. Removing the block plus the newline that
+    # ended its END_MARK line can leave one extra blank line before the
+    # class. That is harmless at runtime, but we still collapse the narrow
+    # case "\n\n\nclass Scheduler" -> "\n\nclass Scheduler" so that revert
+    # is byte-identical for that anchor.
+    new = re.sub(r"\n{3,}class Scheduler\(", "\n\nclass Scheduler(", new)
+    target.write_text(new)
+    print(f"[dm-instr] reverted: {target}")
+
+
+def main() -> None:
+    p = argparse.ArgumentParser()
+    p.add_argument("--apply", action="store_true")
+    p.add_argument("--revert", action="store_true")
+    p.add_argument("--check", action="store_true")
+    p.add_argument("--venv", type=Path, default=DEFAULT_VENV)
+    args = p.parse_args()
+    target = find_target(args.venv)
+    if args.apply:
+        apply(target)
+    elif args.revert:
+        revert(target)
+    elif args.check:
+        state = "PATCHED" if is_patched(target.read_text()) else "CLEAN"
+        print(f"[dm-instr] {state}: {target}")
+    else:
+        p.error("specify --apply / --revert / --check")
+
+
+if __name__ == "__main__":
+    main()
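Reviewer note: the apply/revert pair above depends on two properties: every inserted block is wrapped in the same start/end markers, and revert strips each block together with its indentation and the newline after the end marker. The following is a minimal sketch of that round trip on an in-memory string. The marker strings, the anchor, and the sample source are hypothetical stand-ins, not the script's actual constants.

```python
import re

# Hypothetical markers; the real script defines its own START_MARK / END_MARK.
START = "# DEMO_INSTRUMENT_START"
END = "# DEMO_INSTRUMENT_END"

# Anchor text that must exist verbatim in the target, and its replacement.
ANCHOR = "    return total\n"
REPLACE = f"    {START}\n    print('total', total)\n    {END}\n    return total\n"

SOURCE = "def add(a, b):\n    total = a + b\n    return total\n"


def apply_patch(text: str) -> str:
    """Insert the marked block once; fail loudly if the anchor is missing."""
    if START in text:
        return text  # already patched, so apply stays idempotent
    if ANCHOR not in text:
        raise RuntimeError("anchor not found")
    return text.replace(ANCHOR, REPLACE, 1)


def revert_patch(text: str) -> str:
    """Strip every marked block, its leading indentation, and its final newline."""
    pat = re.compile(
        r"[ \t]*" + re.escape(START) + r".*?" + re.escape(END) + r"\n",
        flags=re.DOTALL,
    )
    return pat.sub("", text)


patched = apply_patch(SOURCE)
restored = revert_patch(patched)
```

In this sketch `restored` equals `SOURCE` byte for byte, which is the property the revert comments above aim for. The non-greedy `.*?` matters once a file holds several marked blocks, because a greedy match would also delete the original code between them.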
--- a/microbench/fresh_setup/instrument_mooncake.py
+++ b/microbench/fresh_setup/instrument_mooncake.py
@@ -151,11 +151,65 @@ RECV_FINISH_REPLACE = f"""                    if response.status == MooncakeXfer
                        {END_MARK}
                        break"""

+# ---- Patch 5: send_kv_to_decode entry (P-side, producer receives pull req) ----
+
+SEND_ENTRY_TARGET = """    async def send_kv_to_decode(
+        self, identity: bytes, sock: zmq.asyncio.Socket, meta: MooncakeXferMetadata
+    ):
+        pending_reqs: dict[ReqId, SendBlockMeta] = {}"""
+
+SEND_ENTRY_REPLACE = f"""    async def send_kv_to_decode(
+        self, identity: bytes, sock: zmq.asyncio.Socket, meta: MooncakeXferMetadata
+    ):
+        pending_reqs: dict[ReqId, SendBlockMeta] = {{}}
+        {START_MARK}
+        try:
+            _mb2_log_event({{"event": "send_kv_to_decode_enter",
+                              "d_req_ids": [str(r) for r in meta.req_blocks],
+                              "t_start_unix": _mb2_time.time(),
+                              "tp_rank": getattr(self, "tp_rank", -1)}})
+        except Exception:
+            pass
+        {END_MARK}"""
+
+# ---- Patch 6: wait_and_ret ready-wait timing (P-side, src KV commit wait) ----
+
+READY_WAIT_TARGET = """        async def wait_and_ret(
+            d_req_id: ReqId, send_meta: SendBlockMeta
+        ) -> tuple[ReqId, SendBlockMeta]:
+            await send_meta.ready.wait()
+            return d_req_id, send_meta"""
+
+READY_WAIT_REPLACE = f"""        async def wait_and_ret(
+            d_req_id: ReqId, send_meta: SendBlockMeta
+        ) -> tuple[ReqId, SendBlockMeta]:
+            {START_MARK}
+            _mb2_rw_start = _mb2_time.perf_counter()
+            _mb2_rw_start_unix = _mb2_time.time()
+            _mb2_rw_already = send_meta.ready.is_set()
+            {END_MARK}
+            await send_meta.ready.wait()
+            {START_MARK}
+            try:
+                _mb2_log_event({{"event": "ready_wait",
+                                  "d_req_id": str(d_req_id),
+                                  "transfer_id": str(getattr(send_meta, "transfer_id", "")),
+                                  "ready_already_set": bool(_mb2_rw_already),
+                                  "ready_wait_s": _mb2_time.perf_counter() - _mb2_rw_start,
+                                  "t_start_unix": _mb2_rw_start_unix,
+                                  "tp_rank": getattr(self, "tp_rank", -1)}})
+            except Exception:
+                pass
+            {END_MARK}
+            return d_req_id, send_meta"""
+
 PATCHES = [
    ("header",               HEADER_ANCHOR,      HEADER_ANCHOR + HEADER_INSERT),
    ("_send_blocks",         SEND_TARGET,        SEND_REPLACE),
    ("receive_kv (entry)",   RECV_ENTRY_TARGET,  RECV_ENTRY_REPLACE),
    ("receive_kv (FINISH)",  RECV_FINISH_TARGET, RECV_FINISH_REPLACE),
+    ("send_kv (entry)",      SEND_ENTRY_TARGET,  SEND_ENTRY_REPLACE),
+    ("ready_wait",           READY_WAIT_TARGET,  READY_WAIT_REPLACE),
 ]


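Reviewer note: the `ready_wait` patch above records two things around `await send_meta.ready.wait()`: whether the event was already set on entry, and how long the coroutine blocked. A minimal sketch of that measurement follows, using a plain `asyncio.Event` and a simulated producer. The function names and the 50 ms delay are illustrative assumptions, not part of the connector.

```python
import asyncio
import time
from typing import Optional


async def timed_ready_wait(ready: asyncio.Event) -> dict:
    """Wait on `ready`; report whether it was pre-set and how long we blocked."""
    already = ready.is_set()
    t0 = time.perf_counter()
    await ready.wait()
    return {
        "ready_already_set": already,
        "ready_wait_s": time.perf_counter() - t0,
    }


async def demo(delay_s: Optional[float]) -> dict:
    ready = asyncio.Event()
    if delay_s is None:
        ready.set()  # producer finished before the consumer asked
    else:
        # Simulate a producer that signals readiness after delay_s seconds.
        asyncio.get_running_loop().call_later(delay_s, ready.set)
    return await timed_ready_wait(ready)


late = asyncio.run(demo(0.05))   # consumer has to wait for the producer
early = asyncio.run(demo(None))  # event already set, wait returns at once
```

Logging `ready_already_set` next to the duration separates "the source had not finished yet" from "the event loop was slow to resume this coroutine", which is the distinction the control-plane analysis needs.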
--- a/microbench/fresh_setup/mb6_transfer_under_load.py
+++ b/microbench/fresh_setup/mb6_transfer_under_load.py
@@ -0,0 +1,261 @@
+#!/usr/bin/env python3
+"""MB6: KV-transfer bandwidth vs instance busy-ness.
+
+Confirms the causal hypothesis from the v3 breakdown: the migration
+transfer runs far below wire speed because it happens between instances
+that are concurrently busy with compute (GIL-starved control plane +
+HBM/PCIe contention), NOT because of a wire/NIC limit.
+
+Method (reuses the MB2 transfer primitive):
+  prefill on A (do_remote_decode, max_tokens=1) -> migrate to B
+  (do_remote_prefill). Time step 2 = the KV transfer.
+
+For each background-load level B in --bg-loads, we hold B concurrent
+long-decode streams in total, split round-robin across BOTH instances,
+then run --repeats measured transfers per size. With the MB2 mooncake
+instrument applied (MB2_LOG_DIR set), the analyzer can split the e2e
+transfer into RDMA-actual (`send_blocks`) vs control-plane.
+
+Expected: bg=0 reproduces MB2 (~10 GB/s); higher bg degrades toward the
+~2-3 GB/s seen in the v3 trace.
+
+Usage:
+  python mb6_transfer_under_load.py \
+    --src-port 8000 --dst-port 8001 --src-bp 8998 --dst-bp 8999 \
+    --sizes 16384,65536 --bg-loads 0,8,24 --repeats 4 \
+    --out mb6_result.json
+"""
+from __future__ import annotations
+
+import argparse
+import asyncio
+import json
+import statistics
+import time
+import uuid
+from pathlib import Path
+
+import httpx
+
+MODEL_PATH = "/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct"
+KV_PER_TOK = 98304  # estimated KV-cache bytes per token for Qwen3-30B-A3B
+
+
+def synth_prompt(seed: int, n: int) -> list[int]:
+    import random
+    rng = random.Random(seed)
+    return [rng.randint(100, 150000) for _ in range(n)]
+
+
+async def get_engine_id(client, host, bp):
+    r = await client.get(f"http://{host}:{bp}/query")
+    r.raise_for_status()
+    return r.json()["0"]["engine_id"]
+
+
+async def completion(client, host, port, prompt, max_tokens, ktp=None, stream=False):
+    payload = {
+        "model": MODEL_PATH, "prompt": prompt, "max_tokens": max_tokens,
+        "min_tokens": 1,
+        "temperature": 0.0, "stream": stream,
+    }
+    if ktp:
+        payload["kv_transfer_params"] = ktp
+    t0 = time.perf_counter()
+    if stream:
+        # consume the stream to keep the instance decoding
+        async with client.stream("POST", f"http://{host}:{port}/v1/completions",
+                                 json=payload, timeout=600.0) as r:
+            r.raise_for_status()
+            async for _ in r.aiter_bytes():
+                pass
+        return time.perf_counter() - t0, {}
+    r = await client.post(f"http://{host}:{port}/v1/completions",
+                          json=payload, timeout=600.0)
+    elapsed = time.perf_counter() - t0
+    r.raise_for_status()
+    return elapsed, r.json()
+
+
+async def num_running(client, host, port) -> int:
+    """Read vLLM running-request gauge from /metrics."""
+    try:
+        r = await client.get(f"http://{host}:{port}/metrics", timeout=5.0)
+        for line in r.text.splitlines():
+            if line.startswith("vllm:num_requests_running"):
+                return int(float(line.split()[-1]))
+    except Exception:
+        pass
+    return -1
+
+
+class BackgroundLoad:
+    """Maintain N concurrent long-decode streams, round-robin over endpoints."""
+    def __init__(self, client, endpoints, concurrency, prompt_tokens=2000,
+                 out_tokens=6000):
+        self.client = client
+        self.endpoints = endpoints
+        self.concurrency = concurrency
+        self.prompt_tokens = prompt_tokens
+        self.out_tokens = out_tokens
+        self._stop = asyncio.Event()
+        self._tasks: list[asyncio.Task] = []
+
+    async def _worker(self, idx):
+        host, port = self.endpoints[idx % len(self.endpoints)]
+        seed = 900000 + idx
+        while not self._stop.is_set():
+            prompt = synth_prompt(seed, self.prompt_tokens)
+            seed += 1
+            try:
+                await completion(self.client, host, port, prompt,
+                                 max_tokens=self.out_tokens, stream=True)
+            except Exception:
+                await asyncio.sleep(0.5)
+
+    def start(self):
+        self._tasks = [asyncio.create_task(self._worker(i))
+                       for i in range(self.concurrency)]
+
+    async def stop(self):
+        self._stop.set()
+        for t in self._tasks:
+            t.cancel()
+        await asyncio.gather(*self._tasks, return_exceptions=True)
+        self._tasks = []
+
+
+async def measure_transfer(client, src_host, src_port, dst_host, dst_port,
+                           src_eid, src_bootstrap_addr, input_tokens, seed):
+    prompt = synth_prompt(seed, input_tokens)
+    transfer_id = uuid.uuid4().hex
+    # step 1: prefill on A
+    await completion(client, src_host, src_port, prompt, max_tokens=1,
+                     ktp={"do_remote_decode": True, "transfer_id": transfer_id})
+    # step 2: migrate to B (this is the timed transfer)
+    t_start_unix = time.time()
+    t_xfer, _ = await completion(
+        client, dst_host, dst_port, prompt, max_tokens=1,
+        ktp={"do_remote_prefill": True, "transfer_id": transfer_id,
+             "remote_engine_id": src_eid,
+             "remote_bootstrap_addr": src_bootstrap_addr})
+    return {
+        "input_tokens": input_tokens,
+        "t_transfer_s": t_xfer,
+        "t_step2_start_unix": t_start_unix,
+        "t_step2_end_unix": time.time(),
+        "kv_bytes": input_tokens * KV_PER_TOK,
+        "eff_gbps": input_tokens * KV_PER_TOK / 1e9 / t_xfer if t_xfer > 0 else 0,
+    }
+
+
+async def main_async(a):
+    sizes = [int(s) for s in a.sizes.split(",")]
+    bg_loads = [int(s) for s in a.bg_loads.split(",")]
+    src_host, dst_host = a.src_host, a.dst_host
+    limits = httpx.Limits(max_connections=256, max_keepalive_connections=256)
+    async with httpx.AsyncClient(limits=limits, trust_env=False) as client:
+        src_eid = await get_engine_id(client, src_host, a.src_bp)
+        src_bootstrap_addr = f"http://{src_host}:{a.src_bp}"
+        print(f"[mb6] src eid={src_eid[:16]}...  endpoints A={src_host}:{a.src_port} "
+              f"B={dst_host}:{a.dst_port}")
+
+        endpoints = [(src_host, a.src_port), (dst_host, a.dst_port)]
+        results = []
+        for bg in bg_loads:
+            loader = None
+            if bg > 0:
+                loader = BackgroundLoad(client, endpoints, bg,
+                                        prompt_tokens=a.bg_prompt,
+                                        out_tokens=a.bg_out)
+                loader.start()
+                # wait for instances to actually be busy
+                print(f"[mb6] bg={bg}: ramping background load ...")
+                for _ in range(40):
+                    await asyncio.sleep(1.0)
+                    na = await num_running(client, src_host, a.src_port)
+                    nb = await num_running(client, dst_host, a.dst_port)
+                    if na >= 1 and nb >= 1:
+                        print(f"[mb6] bg={bg}: busy (A running={na} B running={nb})")
+                        break
+            else:
+                print("[mb6] bg=0: idle baseline")
+                # ensure idle
+                await asyncio.sleep(2.0)
+
+            for sz in sizes:
+                for rep in range(a.repeats):
+                    na = await num_running(client, src_host, a.src_port)
+                    nb = await num_running(client, dst_host, a.dst_port)
+                    row = await measure_transfer(
+                        client, src_host, a.src_port, dst_host, a.dst_port,
+                        src_eid, src_bootstrap_addr, sz, seed=sz * 100 + rep + bg * 7)
+                    row["bg_load"] = bg
+                    row["A_running_at_measure"] = na
+                    row["B_running_at_measure"] = nb
+                    results.append(row)
+                    kv_mib = sz * KV_PER_TOK / 2**20
+                    print(f"  bg={bg:>3} size={sz:>6} ({kv_mib:6.0f}MiB) rep={rep} "
+                          f"A_run={na:>2} B_run={nb:>2} "
+                          f"transfer={row['t_transfer_s']*1000:7.0f}ms "
+                          f"eff={row['eff_gbps']:5.2f}GB/s")
+
+            if loader:
+                await loader.stop()
+                # let the instances drain before next bg level
+                print(f"[mb6] bg={bg}: draining ...")
+                for _ in range(60):
+                    await asyncio.sleep(1.0)
+                    na = await num_running(client, src_host, a.src_port)
+                    nb = await num_running(client, dst_host, a.dst_port)
+                    if na <= 0 and nb <= 0:
+                        break
+
+    # summary per (bg, size)
+    print("\n=== summary: effective transfer bandwidth vs background load ===")
+    print(f"{'bg':>4} {'size':>7} {'n':>3} {'xfer_p50_ms':>12} {'eff_p50_GBps':>13} "
+          f"{'eff_mean':>9}")
+    summary = []
+    for bg in bg_loads:
+        for sz in sizes:
+            rs = [r for r in results if r["bg_load"] == bg and r["input_tokens"] == sz]
+            if not rs:
+                continue
+            xfer = [r["t_transfer_s"] for r in rs]
+            eff = [r["eff_gbps"] for r in rs]
+            p50x = statistics.median(xfer)  # true median, also for even n
+            p50e = statistics.median(eff)
+            meane = statistics.mean(eff)
+            summary.append({"bg": bg, "size": sz, "n": len(rs),
+                            "xfer_p50_ms": p50x * 1000,
+                            "eff_p50_gbps": p50e, "eff_mean_gbps": meane})
+            print(f"{bg:>4} {sz:>7} {len(rs):>3} {p50x*1000:>12.0f} "
+                  f"{p50e:>13.2f} {meane:>9.2f}")
+
+    Path(a.out).write_text(json.dumps(
+        {"model": MODEL_PATH, "kv_bytes_per_token": KV_PER_TOK,
+         "label": a.label, "raw": results, "summary": summary}, indent=2))
+    print(f"\n[mb6] wrote {a.out}")
+
+
+def main():
+    p = argparse.ArgumentParser()
+    p.add_argument("--src-host", default="127.0.0.1")
+    p.add_argument("--dst-host", default="127.0.0.1")
+    p.add_argument("--src-port", type=int, default=8000)
+    p.add_argument("--dst-port", type=int, default=8001)
+    p.add_argument("--src-bp", type=int, default=8998)
+    p.add_argument("--dst-bp", type=int, default=8999)
+    p.add_argument("--sizes", default="16384,65536")
+    p.add_argument("--bg-loads", default="0,8,24")
+    p.add_argument("--repeats", type=int, default=4)
+    p.add_argument("--bg-prompt", type=int, default=2000)
+    p.add_argument("--bg-out", type=int, default=6000)
+    p.add_argument("--label", default="main-venv")
+    p.add_argument("--out", default="mb6_result.json")
+    args = p.parse_args()
+    asyncio.run(main_async(args))
+
+
+if __name__ == "__main__":
+    main()
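Reviewer note: the driver's `eff_gbps` is simple arithmetic on `KV_PER_TOK`, so it is worth checking where 98304 bytes/token could come from and what transfer times the quoted bandwidths imply. The sketch below assumes a model shape of 48 layers, 4 KV heads, head dimension 128, and 2-byte KV entries. These values are an assumption that happens to reproduce the constant, and they are not stated anywhere in the driver.

```python
import statistics

# Assumed model shape (not taken from the driver): 48 layers, 4 KV heads,
# head dimension 128, 2-byte (bf16/fp16) KV entries.
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 48, 4, 128, 2


def kv_bytes_per_token() -> int:
    # The factor 2 covers the separate K and V tensors.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES


def eff_gbps(input_tokens: int, t_transfer_s: float) -> float:
    """Effective bandwidth in GB/s (1e9 bytes per second)."""
    if t_transfer_s <= 0:
        return 0.0
    return input_tokens * kv_bytes_per_token() / 1e9 / t_transfer_s


kv_total = 65536 * kv_bytes_per_token()  # bytes moved for a 65536-token prompt
t_at_10 = kv_total / 10e9                # seconds at 10 GB/s
t_at_3 = kv_total / 3e9                  # seconds at 3 GB/s

# With an even sample count, the median averages the two middle values.
median_of_four = statistics.median([1.0, 2.0, 3.0, 4.0])
```

Under these assumptions a 65536-token prompt carries 6 GiB of KV, which takes about 0.64 s at 10 GB/s and about 2.15 s at 3 GB/s. The timed step also includes the HTTP round trip and one decode step, so the reported figure is a lower bound on the raw transfer rate.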
--- a/microbench/fresh_setup/run_mb6.sh
+++ b/microbench/fresh_setup/run_mb6.sh
@@ -0,0 +1,109 @@
+#!/usr/bin/env bash
+# MB6 launcher: 2 vLLM instances (kv_both, Mooncake) + transfer-under-load
+# sweep. Parameterized by VENV so it runs on either the patched main venv
+# or the fresh upstream venv, to test whether the bandwidth degradation is
+# our patch or inherent to upstream mooncake.
+#
+# Usage:
+#   VENV=/home/admin/cpfs/wjh/agentic-kv/.venv       bash run_mb6.sh   # main
+#   VENV=/home/admin/cpfs/wjh/agentic-kv-fresh/.venv bash run_mb6.sh   # fresh
+
+set -uo pipefail
+
+PROJ_DIR="${PROJ_DIR:-/home/admin/cpfs/wjh/agentic-kv}"
+VENV="${VENV:-$PROJ_DIR/.venv}"
+LABEL="${LABEL:-$(basename "$(dirname "$VENV")")}"
+MODEL="${MODEL:-/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct}"
+GPUS="${GPUS:-0 1}"
+SIZES="${SIZES:-16384,65536}"
+BG_LOADS="${BG_LOADS:-0,8,24}"
+REPEATS="${REPEATS:-4}"
+DATE="$(date +%Y%m%d_%H%M)"
+OUTDIR="${OUTDIR:-$PROJ_DIR/outputs/mb6_${LABEL}_${DATE}}"
+PYTHON="$VENV/bin/python"
+MC_INSTR="$PROJ_DIR/microbench/fresh_setup/instrument_mooncake.py"
+DRIVER="$PROJ_DIR/microbench/fresh_setup/mb6_transfer_under_load.py"
+MC_FILE="$VENV/lib/python3.12/site-packages/vllm/distributed/kv_transfer/kv_connector/v1/mooncake/mooncake_connector.py"
+
+mkdir -p "$OUTDIR/logs"
+XFER_LOG_DIR="$OUTDIR/xfer_log"; mkdir -p "$XFER_LOG_DIR"
+
+echo "=== MB6 transfer-under-load ($LABEL) ==="
+echo "VENV : $VENV"
+echo "Out  : $OUTDIR"
+echo ""
+
+PORTS=(8000 8001); BPS=(8998 8999)
+gpu_arr=($GPUS)
+
+cleanup() {
+    pkill -9 -f "vllm serve" 2>/dev/null || true
+    pkill -9 -f "EngineCore" 2>/dev/null || true
+    sleep 4
+    "$PYTHON" "$MC_INSTR" --venv "$MC_FILE" --revert 2>/dev/null || true
+}
+trap cleanup EXIT
+cleanup
+
+echo "[0] apply MB2 mooncake instrument to $LABEL venv"
+"$PYTHON" "$MC_INSTR" --venv "$MC_FILE" --apply
+"$PYTHON" "$MC_INSTR" --venv "$MC_FILE" --check
+
+echo "[1] launch 2 instances"
+i=0
+for gpu in "${gpu_arr[@]:0:2}"; do
+    port=${PORTS[$i]}; bp=${BPS[$i]}; master=$((29600 + i))
+    PYTHONHASHSEED=42 \
+    VLLM_MOONCAKE_BOOTSTRAP_PORT=$bp \
+    MB2_LOG_DIR="$XFER_LOG_DIR" \
+    CUDA_VISIBLE_DEVICES=$gpu \
+    MASTER_PORT=$master \
+    nohup "$VENV/bin/vllm" serve "$MODEL" \
+        --host 0.0.0.0 --port "$port" \
+        --tensor-parallel-size 1 --trust-remote-code --enable-prefix-caching \
+        --dtype auto --gpu-memory-utilization 0.9 --max-model-len 200000 \
+        --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_both"}' \
+        --enable-prompt-tokens-details \
+        > "$OUTDIR/logs/vllm_${i}_gpu${gpu}.log" 2>&1 &
+    disown
+    sleep 2
+    i=$((i + 1))
+done
+
+echo "[2] wait for health"
+for i in 0 1; do
+    port=${PORTS[$i]}; tries=0
+    while ! curl -sf "http://127.0.0.1:$port/health" >/dev/null 2>&1; do
+        tries=$((tries + 1))
+        if [ $tries -gt 180 ]; then echo "FATAL inst_$i not healthy"; exit 1; fi
+        sleep 2
+    done
+    echo "  inst_$i ready"
+done
+# bootstrap /query reachable?
+for i in 0 1; do
+    bp=${BPS[$i]}; tries=0
+    while ! curl -sf "http://127.0.0.1:$bp/query" >/dev/null 2>&1; do
+        tries=$((tries + 1))
+        if [ $tries -gt 60 ]; then echo "WARN bootstrap $bp not ready"; break; fi
+        sleep 2
+    done
+done
+
+echo "[3] run MB6 driver"
+"$PYTHON" "$DRIVER" \
+    --src-port "${PORTS[0]}" --dst-port "${PORTS[1]}" \
+    --src-bp "${BPS[0]}" --dst-bp "${BPS[1]}" \
+    --sizes "$SIZES" --bg-loads "$BG_LOADS" --repeats "$REPEATS" \
+    --label "$LABEL" --out "$OUTDIR/mb6_result.json" \
+    2>&1 | tee "$OUTDIR/mb6_run.txt"
+
+echo "[4] teardown + revert"
+pkill -9 -f "vllm serve" 2>/dev/null || true
+pkill -9 -f "EngineCore" 2>/dev/null || true
+sleep 4
+"$PYTHON" "$MC_INSTR" --venv "$MC_FILE" --revert
+
+echo ""
+echo "Done. Artifacts in $OUTDIR/"
+echo "  mb6_result.json  mb6_run.txt  xfer_log/  logs/"