Layerwise under load: overlap benefit survives (bg=16)

mb7 with background decode load (8/instance). Critical-path transfer overhead stays ~constant ~90ms for layerwise vs 158/239/749ms baseline (up to 7.9x at 32k), prefill not slowed, KV correct. Confirms the overlap holds on busy instances. DESIGN.md updated with idle-vs-load table + the two blockers (chunk-safety, concurrent-transfer safety) that the full 1200-req trace needs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 16:30:14 +08:00
parent fec50fa45d
commit e77bdcac5a
5 changed files with 387 additions and 9 deletions
--- a/microbench/connector_tax/layerwise/DESIGN.md
+++ b/microbench/connector_tax/layerwise/DESIGN.md
@@ -114,11 +114,39 @@ Key signatures:
   (`_handle_combined_pd_sep_v2`) dispatches sequentially and would need the
   write-mode (concurrent) restructure.

+## Results under LOAD (bg=16 background decode streams, 8 per instance)
+
+Critical-path transfer overhead (ms), `total − prefill_only`:
+
+| KV size | idle base | idle LW | **load base** | **load LW** |
+|--------:|----------:|--------:|--------------:|------------:|
+| 8k  | 123 | 58 | 158 | **94** |
+| 16k | 202 | 58 | 239 | **83** |
+| 32k | 529 | 57 | **749** | **95** |
+
+The overlap **survives load**: layerwise overhead stays ~constant (~90 ms)
+under load while baseline grows to 749 ms at 32k (7.9× reduction). Prefill did
+not slow (load LW `t_A` == load `prefill_only`); the transfer (0.56/1.46/4.37 s,
+producer logs) ran inside the prefill window even with 16 concurrent decodes.
+Correctness PASS under load.
+
 ## Verdict

-The mechanism **works and delivers the predicted benefit**: layer-wise push
-turns migration's KV-transfer cost from O(KV size) on the critical path into a
-near-constant tail, by overlapping it with prefill compute — exactly what
-MoRIIO's write mode does on AMD, now demonstrated on NVIDIA/Mooncake. Whether
-it flips agentic *migration* to net-positive still depends on the busy-instance
-behavior (caveat 1) and is the next experiment.
+The mechanism **works and the benefit holds under load**: layer-wise push turns
+migration's KV-transfer cost from O(KV size) on the critical path into a
+near-constant ~90 ms tail, by overlapping it with prefill compute — what
+MoRIIO's write mode does on AMD, now demonstrated on NVIDIA/Mooncake.
+
+**BUT this is single-transfer, non-chunked.** Running the actual 1200-req trace
+correctly needs two more pieces this PoC does NOT have:
+1. **Chunk-safe tracking** — long agentic prompts force chunked prefill;
+   `save_kv_layer` then fires per-chunk and the monotonic counter would ship
+   uncomputed blocks. Needs slot-mapping-aware per-(request,chunk) tracking.
+2. **Concurrent-transfer safety** — the global counter assumes one producer at
+   a time; the trace migrates from busy instances running other forwards.
+
+Also: even with those fixed, layer-wise only removes the **transfer half** of
+the measured migration overhead. The b3_v3_fullbreak profile showed dst-side
+`T_kv_pull` = ~55% RDMA + ~45% control-plane GIL-dispatch stalls; layer-wise
+hides the RDMA half but the control-plane half is orthogonal. So a trace
+re-profile would show roughly the transfer half collapse, not the whole thing.
--- a/microbench/connector_tax/layerwise/mb7_layerwise.py
+++ b/microbench/connector_tax/layerwise/mb7_layerwise.py
@@ -72,6 +72,56 @@ def cached_of(resp) -> int:
    return det.get("cached_tokens", 0) or usage.get("cached_tokens", 0) or 0


+async def _stream_completion(client, host, port, prompt, max_tokens):
+    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens,
+               "min_tokens": 1, "temperature": 0.0, "stream": True}
+    async with client.stream("POST", f"http://{host}:{port}/v1/completions",
+                             json=payload, timeout=600.0) as r:
+        r.raise_for_status()
+        async for _ in r.aiter_bytes():
+            pass
+
+
+class BackgroundLoad:
+    """Hold N concurrent long-decode streams across endpoints to keep busy."""
+    def __init__(self, client, endpoints, n, prompt_tokens=2000, out_tokens=6000):
+        self.client, self.endpoints, self.n = client, endpoints, n
+        self.pt, self.ot = prompt_tokens, out_tokens
+        self._stop = asyncio.Event()
+        self._tasks = []
+
+    async def _w(self, idx):
+        host, port = self.endpoints[idx % len(self.endpoints)]
+        seed = 800000 + idx
+        while not self._stop.is_set():
+            try:
+                await _stream_completion(self.client, host, port,
+                                         synth_prompt(seed, self.pt), self.ot)
+            except Exception:
+                await asyncio.sleep(0.5)
+            seed += 1
+
+    def start(self):
+        self._tasks = [asyncio.create_task(self._w(i)) for i in range(self.n)]
+
+    async def stop(self):
+        self._stop.set()
+        for t in self._tasks:
+            t.cancel()
+        await asyncio.gather(*self._tasks, return_exceptions=True)
+
+
+async def num_running(client, host, port):
+    try:
+        r = await client.get(f"http://{host}:{port}/metrics", timeout=5.0)
+        for line in r.text.splitlines():
+            if line.startswith("vllm:num_requests_running"):
+                return int(float(line.split()[-1]))
+    except Exception:
+        pass
+    return -1
+
+
 async def prefill_only(client, host, port, prompt):
    """Reference: plain prefill cost on A, no transfer."""
    dt, _ = await completion(client, host, port, prompt, max_tokens=1)
@@ -128,7 +178,20 @@ async def main_async(a):
    async with httpx.AsyncClient(limits=limits, trust_env=False) as client:
        src_eid = await get_engine_id(client, a.src_host, a.src_bp)
        src_bp_addr = f"http://{a.src_host}:{a.src_bp}"
-        print(f"[mb7] mode={a.mode} src_eid={src_eid[:16]}...")
+        print(f"[mb7] mode={a.mode} bg_load={a.bg_load} src_eid={src_eid[:16]}...")
+
+        loader = None
+        if a.bg_load > 0:
+            loader = BackgroundLoad(client, [A, B], a.bg_load)
+            loader.start()
+            print(f"[mb7] ramping background load ({a.bg_load}) ...")
+            for _ in range(40):
+                await asyncio.sleep(1.0)
+                na = await num_running(client, *A)
+                nb = await num_running(client, *B)
+                if na >= 1 and nb >= 1:
+                    print(f"[mb7] busy: A_run={na} B_run={nb}")
+                    break

        results = []
        for sz in sizes:
@@ -155,8 +218,11 @@ async def main_async(a):
                      f"total={row['t_total_s']*1000:7.0f}ms {extra} "
                      f"cached={row['cached']}/{sz} correct={row['correct']}")

+        if loader:
+            await loader.stop()
+
    # summary
-    print(f"\n=== {a.mode} summary ===")
+    print(f"\n=== {a.mode} (bg={a.bg_load}) summary ===")
    print(f"{'size':>7} {'n':>2} {'pf_only_ms':>11} {'total_ms':>9} "
          f"{'overhead_ms':>12} {'correct':>8}")
    summary = []
@@ -191,6 +257,8 @@ def main():
    p.add_argument("--dst-bp", type=int, default=8999)
    p.add_argument("--sizes", default="8192,32768,65536")
    p.add_argument("--repeats", type=int, default=3)
+    p.add_argument("--bg-load", type=int, default=0,
+                   help="N concurrent background decode streams across A+B")
    p.add_argument("--out", default="mb7_result.json")
    args = p.parse_args()
    asyncio.run(main_async(args))
--- a/microbench/connector_tax/layerwise/results/mb7_baseline_bg16.json
+++ b/microbench/connector_tax/layerwise/results/mb7_baseline_bg16.json
@@ -0,0 +1,140 @@
+{
+  "mode": "baseline",
+  "model": "/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct",
+  "raw": [
+    {
+      "t_prefill_s": 0.5868483350022871,
+      "t_xfer_s": 0.19584889299949282,
+      "t_total_s": 0.7827702419999696,
+      "cached": 8176,
+      "mode": "baseline",
+      "size": 8192,
+      "rep": 0,
+      "t_prefill_only_s": 0.5920699099988269,
+      "kv_gib": 0.75,
+      "correct": true
+    },
+    {
+      "t_prefill_s": 0.5875704979989678,
+      "t_xfer_s": 0.1554814909977722,
+      "t_total_s": 0.7431365060001554,
+      "cached": 8176,
+      "mode": "baseline",
+      "size": 8192,
+      "rep": 1,
+      "t_prefill_only_s": 0.5814537600017502,
+      "kv_gib": 0.75,
+      "correct": true
+    },
+    {
+      "t_prefill_s": 0.5852241569991747,
+      "t_xfer_s": 0.15129724399957922,
+      "t_total_s": 0.7365909610016388,
+      "cached": 8176,
+      "mode": "baseline",
+      "size": 8192,
+      "rep": 2,
+      "t_prefill_only_s": 0.5846994370003813,
+      "kv_gib": 0.75,
+      "correct": true
+    },
+    {
+      "t_prefill_s": 1.498547145001794,
+      "t_xfer_s": 0.2475714690008317,
+      "t_total_s": 1.7462187470009667,
+      "cached": 16368,
+      "mode": "baseline",
+      "size": 16384,
+      "rep": 0,
+      "t_prefill_only_s": 1.5670790190015396,
+      "kv_gib": 1.5,
+      "correct": true
+    },
+    {
+      "t_prefill_s": 1.5025789940009417,
+      "t_xfer_s": 0.24532966799961287,
+      "t_total_s": 1.7479741930001182,
+      "cached": 16368,
+      "mode": "baseline",
+      "size": 16384,
+      "rep": 1,
+      "t_prefill_only_s": 1.5008903820016712,
+      "kv_gib": 1.5,
+      "correct": true
+    },
+    {
+      "t_prefill_s": 1.5021674179988622,
+      "t_xfer_s": 0.24640760400143336,
+      "t_total_s": 1.7486415580024186,
+      "cached": 16368,
+      "mode": "baseline",
+      "size": 16384,
+      "rep": 2,
+      "t_prefill_only_s": 1.509417139001016,
+      "kv_gib": 1.5,
+      "correct": true
+    },
+    {
+      "t_prefill_s": 4.444555983998725,
+      "t_xfer_s": 0.4227471090016479,
+      "t_total_s": 4.86737214599998,
+      "cached": 32752,
+      "mode": "baseline",
+      "size": 32768,
+      "rep": 0,
+      "t_prefill_only_s": 4.4467717689985875,
+      "kv_gib": 3.0,
+      "correct": true
+    },
+    {
+      "t_prefill_s": 4.442135782999685,
+      "t_xfer_s": 0.7519038230020669,
+      "t_total_s": 5.194113359000767,
+      "cached": 32752,
+      "mode": "baseline",
+      "size": 32768,
+      "rep": 1,
+      "t_prefill_only_s": 4.445541313998547,
+      "kv_gib": 3.0,
+      "correct": true
+    },
+    {
+      "t_prefill_s": 4.439772993999213,
+      "t_xfer_s": 0.7855456319994119,
+      "t_total_s": 5.225392060998274,
+      "cached": 32752,
+      "mode": "baseline",
+      "size": 32768,
+      "rep": 2,
+      "t_prefill_only_s": 4.442906365002273,
+      "kv_gib": 3.0,
+      "correct": true
+    }
+  ],
+  "summary": [
+    {
+      "size": 8192,
+      "n": 3,
+      "pf_only_ms": 584.6994370003813,
+      "total_ms": 743.1365060001554,
+      "overhead_ms": 158.43706899977406,
+      "all_correct": true
+    },
+    {
+      "size": 16384,
+      "n": 3,
+      "pf_only_ms": 1509.417139001016,
+      "total_ms": 1747.9741930001182,
+      "overhead_ms": 238.5570539991022,
+      "all_correct": true
+    },
+    {
+      "size": 32768,
+      "n": 3,
+      "pf_only_ms": 4445.541313998547,
+      "total_ms": 5194.113359000767,
+      "overhead_ms": 748.57204500222,
+      "all_correct": true
+    }
+  ]
+}
--- a/microbench/connector_tax/layerwise/results/mb7_layerwise_bg16.json
+++ b/microbench/connector_tax/layerwise/results/mb7_layerwise_bg16.json
@@ -0,0 +1,140 @@
+{
+  "mode": "layerwise",
+  "model": "/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct",
+  "raw": [
+    {
+      "t_A_s": 0.5905098549992545,
+      "t_B_s": 0.6900827390018094,
+      "t_total_s": 0.6904724189989793,
+      "cached": 8176,
+      "mode": "layerwise",
+      "size": 8192,
+      "rep": 0,
+      "t_prefill_only_s": 0.5852864849985053,
+      "kv_gib": 0.75,
+      "correct": true
+    },
+    {
+      "t_A_s": 0.5897548109969648,
+      "t_B_s": 0.6827381169969158,
+      "t_total_s": 0.6828304180016858,
+      "cached": 8176,
+      "mode": "layerwise",
+      "size": 8192,
+      "rep": 1,
+      "t_prefill_only_s": 0.5890174580017629,
+      "kv_gib": 0.75,
+      "correct": true
+    },
+    {
+      "t_A_s": 0.5850713190011447,
+      "t_B_s": 0.6744917560026806,
+      "t_total_s": 0.6745770380002796,
+      "cached": 8176,
+      "mode": "layerwise",
+      "size": 8192,
+      "rep": 2,
+      "t_prefill_only_s": 0.5943713950000529,
+      "kv_gib": 0.75,
+      "correct": true
+    },
+    {
+      "t_A_s": 1.5030149390004226,
+      "t_B_s": 1.596173029000056,
+      "t_total_s": 1.597060264000902,
+      "cached": 16368,
+      "mode": "layerwise",
+      "size": 16384,
+      "rep": 0,
+      "t_prefill_only_s": 1.5130829510017065,
+      "kv_gib": 1.5,
+      "correct": true
+    },
+    {
+      "t_A_s": 1.499876754998695,
+      "t_B_s": 1.5940461120007967,
+      "t_total_s": 1.5948001770011615,
+      "cached": 16368,
+      "mode": "layerwise",
+      "size": 16384,
+      "rep": 1,
+      "t_prefill_only_s": 1.5024838620010996,
+      "kv_gib": 1.5,
+      "correct": true
+    },
+    {
+      "t_A_s": 1.5068977490009274,
+      "t_B_s": 1.5950395179970656,
+      "t_total_s": 1.59571184500237,
+      "cached": 16368,
+      "mode": "layerwise",
+      "size": 16384,
+      "rep": 2,
+      "t_prefill_only_s": 1.5303227439981129,
+      "kv_gib": 1.5,
+      "correct": true
+    },
+    {
+      "t_A_s": 4.4503932609986805,
+      "t_B_s": 4.538851200999488,
+      "t_total_s": 4.539281312001549,
+      "cached": 32752,
+      "mode": "layerwise",
+      "size": 32768,
+      "rep": 0,
+      "t_prefill_only_s": 4.446753306998289,
+      "kv_gib": 3.0,
+      "correct": true
+    },
+    {
+      "t_A_s": 4.44226107799841,
+      "t_B_s": 4.551636377997056,
+      "t_total_s": 4.552389411001059,
+      "cached": 32752,
+      "mode": "layerwise",
+      "size": 32768,
+      "rep": 1,
+      "t_prefill_only_s": 4.44538704000297,
+      "kv_gib": 3.0,
+      "correct": true
+    },
+    {
+      "t_A_s": 4.440309538000292,
+      "t_B_s": 4.539836316998844,
+      "t_total_s": 4.540553365997766,
+      "cached": 32752,
+      "mode": "layerwise",
+      "size": 32768,
+      "rep": 2,
+      "t_prefill_only_s": 4.443476915999781,
+      "kv_gib": 3.0,
+      "correct": true
+    }
+  ],
+  "summary": [
+    {
+      "size": 8192,
+      "n": 3,
+      "pf_only_ms": 589.0174580017629,
+      "total_ms": 682.8304180016858,
+      "overhead_ms": 93.8129599999229,
+      "all_correct": true
+    },
+    {
+      "size": 16384,
+      "n": 3,
+      "pf_only_ms": 1513.0829510017065,
+      "total_ms": 1595.71184500237,
+      "overhead_ms": 82.62889400066342,
+      "all_correct": true
+    },
+    {
+      "size": 32768,
+      "n": 3,
+      "pf_only_ms": 4445.38704000297,
+      "total_ms": 4540.553365997766,
+      "overhead_ms": 95.16632599479635,
+      "all_correct": true
+    }
+  ]
+}
--- a/microbench/connector_tax/layerwise/run_mb7.sh
+++ b/microbench/connector_tax/layerwise/run_mb7.sh
@@ -22,6 +22,7 @@ MODEL="${MODEL:-/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct}"
 GPUS=(${GPUS:-0 1})
 SIZES="${SIZES:-8192,16384,32768}"
 REPEATS="${REPEATS:-3}"
+BG_LOAD="${BG_LOAD:-0}"
 MAX_BATCHED="${MAX_BATCHED:-40960}"   # >= max prompt => no chunked prefill
 DATE="$(date +%Y%m%d_%H%M)"
 OUTDIR="${OUTDIR:-$PROJ_DIR/outputs/mb7_${MODE}_${DATE}}"
@@ -100,7 +101,8 @@ echo "[run] mb7 --mode $MODE"
 "$PYTHON" "$DRIVER" --mode "$MODE" \
    --src-port "${PORTS[0]}" --dst-port "${PORTS[1]}" \
    --src-bp "${BPS[0]}" --dst-bp "${BPS[1]}" \
-    --sizes "$SIZES" --repeats "$REPEATS" --out "$OUTDIR/mb7_result.json" \
+    --sizes "$SIZES" --repeats "$REPEATS" --bg-load "$BG_LOAD" \
+    --out "$OUTDIR/mb7_result.json" \
    2>&1 | tee "$OUTDIR/mb7_run.txt"

 echo "[done] $OUTDIR"