Layerwise under load: overlap benefit survives (bg=16)

mb7 with background decode load (8/instance). Critical-path transfer overhead
stays ~constant ~90ms for layerwise vs 158/239/749ms baseline (up to 7.9x at
32k), prefill not slowed, KV correct. Confirms the overlap holds on busy
instances. DESIGN.md updated with idle-vs-load table + the two blockers
(chunk-safety, concurrent-transfer safety) that the full 1200-req trace needs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
2026-05-28 16:30:14 +08:00
parent fec50fa45d
commit e77bdcac5a
5 changed files with 387 additions and 9 deletions

View File

@@ -114,11 +114,39 @@ Key signatures:
(`_handle_combined_pd_sep_v2`) dispatches sequentially and would need the
write-mode (concurrent) restructure.
## Results under LOAD (bg=16 background decode streams, 8 per instance)
Critical-path transfer overhead (ms), `total prefill_only`:
| KV size | idle base | idle LW | **load base** | **load LW** |
|--------:|----------:|--------:|--------------:|------------:|
| 8k | 123 | 58 | 158 | **94** |
| 16k | 202 | 58 | 239 | **83** |
| 32k | 529 | 57 | **749** | **95** |
The overlap **survives load**: layerwise overhead stays ~constant (~90 ms)
under load while baseline grows to 749 ms at 32k (7.9× reduction). Prefill did
not slow (load LW `t_A` == load `prefill_only`); the transfer (0.56/1.46/4.37 s,
producer logs) ran inside the prefill window even with 16 concurrent decodes.
Correctness PASS under load.
## Verdict
The mechanism **works and delivers the predicted benefit**: layer-wise push
turns migration's KV-transfer cost from O(KV size) on the critical path into a
near-constant tail, by overlapping it with prefill compute exactly what
MoRIIO's write mode does on AMD, now demonstrated on NVIDIA/Mooncake. Whether
it flips agentic *migration* to net-positive still depends on the busy-instance
behavior (caveat 1) and is the next experiment.
The mechanism **works and the benefit holds under load**: layer-wise push turns
migration's KV-transfer cost from O(KV size) on the critical path into a
near-constant ~90 ms tail, by overlapping it with prefill compute what
MoRIIO's write mode does on AMD, now demonstrated on NVIDIA/Mooncake.
**BUT this is single-transfer, non-chunked.** Running the actual 1200-req trace
correctly needs two more pieces this PoC does NOT have:
1. **Chunk-safe tracking** long agentic prompts force chunked prefill;
`save_kv_layer` then fires per-chunk and the monotonic counter would ship
uncomputed blocks. Needs slot-mapping-aware per-(request,chunk) tracking.
2. **Concurrent-transfer safety** the global counter assumes one producer at
a time; the trace migrates from busy instances running other forwards.
Also: even with those fixed, layer-wise only removes the **transfer half** of
the measured migration overhead. The b3_v3_fullbreak profile showed dst-side
`T_kv_pull` = ~55% RDMA + ~45% control-plane GIL-dispatch stalls; layer-wise
hides the RDMA half but the control-plane half is orthogonal. So a trace
re-profile would show roughly the transfer half collapse, not the whole thing.

View File

@@ -72,6 +72,56 @@ def cached_of(resp) -> int:
return det.get("cached_tokens", 0) or usage.get("cached_tokens", 0) or 0
async def _stream_completion(client, host, port, prompt, max_tokens):
payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens,
"min_tokens": 1, "temperature": 0.0, "stream": True}
async with client.stream("POST", f"http://{host}:{port}/v1/completions",
json=payload, timeout=600.0) as r:
r.raise_for_status()
async for _ in r.aiter_bytes():
pass
class BackgroundLoad:
"""Hold N concurrent long-decode streams across endpoints to keep busy."""
def __init__(self, client, endpoints, n, prompt_tokens=2000, out_tokens=6000):
self.client, self.endpoints, self.n = client, endpoints, n
self.pt, self.ot = prompt_tokens, out_tokens
self._stop = asyncio.Event()
self._tasks = []
async def _w(self, idx):
host, port = self.endpoints[idx % len(self.endpoints)]
seed = 800000 + idx
while not self._stop.is_set():
try:
await _stream_completion(self.client, host, port,
synth_prompt(seed, self.pt), self.ot)
except Exception:
await asyncio.sleep(0.5)
seed += 1
def start(self):
self._tasks = [asyncio.create_task(self._w(i)) for i in range(self.n)]
async def stop(self):
self._stop.set()
for t in self._tasks:
t.cancel()
await asyncio.gather(*self._tasks, return_exceptions=True)
async def num_running(client, host, port):
try:
r = await client.get(f"http://{host}:{port}/metrics", timeout=5.0)
for line in r.text.splitlines():
if line.startswith("vllm:num_requests_running"):
return int(float(line.split()[-1]))
except Exception:
pass
return -1
async def prefill_only(client, host, port, prompt):
"""Reference: plain prefill cost on A, no transfer."""
dt, _ = await completion(client, host, port, prompt, max_tokens=1)
@@ -128,7 +178,20 @@ async def main_async(a):
async with httpx.AsyncClient(limits=limits, trust_env=False) as client:
src_eid = await get_engine_id(client, a.src_host, a.src_bp)
src_bp_addr = f"http://{a.src_host}:{a.src_bp}"
print(f"[mb7] mode={a.mode} src_eid={src_eid[:16]}...")
print(f"[mb7] mode={a.mode} bg_load={a.bg_load} src_eid={src_eid[:16]}...")
loader = None
if a.bg_load > 0:
loader = BackgroundLoad(client, [A, B], a.bg_load)
loader.start()
print(f"[mb7] ramping background load ({a.bg_load}) ...")
for _ in range(40):
await asyncio.sleep(1.0)
na = await num_running(client, *A)
nb = await num_running(client, *B)
if na >= 1 and nb >= 1:
print(f"[mb7] busy: A_run={na} B_run={nb}")
break
results = []
for sz in sizes:
@@ -155,8 +218,11 @@ async def main_async(a):
f"total={row['t_total_s']*1000:7.0f}ms {extra} "
f"cached={row['cached']}/{sz} correct={row['correct']}")
if loader:
await loader.stop()
# summary
print(f"\n=== {a.mode} summary ===")
print(f"\n=== {a.mode} (bg={a.bg_load}) summary ===")
print(f"{'size':>7} {'n':>2} {'pf_only_ms':>11} {'total_ms':>9} "
f"{'overhead_ms':>12} {'correct':>8}")
summary = []
@@ -191,6 +257,8 @@ def main():
p.add_argument("--dst-bp", type=int, default=8999)
p.add_argument("--sizes", default="8192,32768,65536")
p.add_argument("--repeats", type=int, default=3)
p.add_argument("--bg-load", type=int, default=0,
help="N concurrent background decode streams across A+B")
p.add_argument("--out", default="mb7_result.json")
args = p.parse_args()
asyncio.run(main_async(args))

View File

@@ -0,0 +1,140 @@
{
"mode": "baseline",
"model": "/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct",
"raw": [
{
"t_prefill_s": 0.5868483350022871,
"t_xfer_s": 0.19584889299949282,
"t_total_s": 0.7827702419999696,
"cached": 8176,
"mode": "baseline",
"size": 8192,
"rep": 0,
"t_prefill_only_s": 0.5920699099988269,
"kv_gib": 0.75,
"correct": true
},
{
"t_prefill_s": 0.5875704979989678,
"t_xfer_s": 0.1554814909977722,
"t_total_s": 0.7431365060001554,
"cached": 8176,
"mode": "baseline",
"size": 8192,
"rep": 1,
"t_prefill_only_s": 0.5814537600017502,
"kv_gib": 0.75,
"correct": true
},
{
"t_prefill_s": 0.5852241569991747,
"t_xfer_s": 0.15129724399957922,
"t_total_s": 0.7365909610016388,
"cached": 8176,
"mode": "baseline",
"size": 8192,
"rep": 2,
"t_prefill_only_s": 0.5846994370003813,
"kv_gib": 0.75,
"correct": true
},
{
"t_prefill_s": 1.498547145001794,
"t_xfer_s": 0.2475714690008317,
"t_total_s": 1.7462187470009667,
"cached": 16368,
"mode": "baseline",
"size": 16384,
"rep": 0,
"t_prefill_only_s": 1.5670790190015396,
"kv_gib": 1.5,
"correct": true
},
{
"t_prefill_s": 1.5025789940009417,
"t_xfer_s": 0.24532966799961287,
"t_total_s": 1.7479741930001182,
"cached": 16368,
"mode": "baseline",
"size": 16384,
"rep": 1,
"t_prefill_only_s": 1.5008903820016712,
"kv_gib": 1.5,
"correct": true
},
{
"t_prefill_s": 1.5021674179988622,
"t_xfer_s": 0.24640760400143336,
"t_total_s": 1.7486415580024186,
"cached": 16368,
"mode": "baseline",
"size": 16384,
"rep": 2,
"t_prefill_only_s": 1.509417139001016,
"kv_gib": 1.5,
"correct": true
},
{
"t_prefill_s": 4.444555983998725,
"t_xfer_s": 0.4227471090016479,
"t_total_s": 4.86737214599998,
"cached": 32752,
"mode": "baseline",
"size": 32768,
"rep": 0,
"t_prefill_only_s": 4.4467717689985875,
"kv_gib": 3.0,
"correct": true
},
{
"t_prefill_s": 4.442135782999685,
"t_xfer_s": 0.7519038230020669,
"t_total_s": 5.194113359000767,
"cached": 32752,
"mode": "baseline",
"size": 32768,
"rep": 1,
"t_prefill_only_s": 4.445541313998547,
"kv_gib": 3.0,
"correct": true
},
{
"t_prefill_s": 4.439772993999213,
"t_xfer_s": 0.7855456319994119,
"t_total_s": 5.225392060998274,
"cached": 32752,
"mode": "baseline",
"size": 32768,
"rep": 2,
"t_prefill_only_s": 4.442906365002273,
"kv_gib": 3.0,
"correct": true
}
],
"summary": [
{
"size": 8192,
"n": 3,
"pf_only_ms": 584.6994370003813,
"total_ms": 743.1365060001554,
"overhead_ms": 158.43706899977406,
"all_correct": true
},
{
"size": 16384,
"n": 3,
"pf_only_ms": 1509.417139001016,
"total_ms": 1747.9741930001182,
"overhead_ms": 238.5570539991022,
"all_correct": true
},
{
"size": 32768,
"n": 3,
"pf_only_ms": 4445.541313998547,
"total_ms": 5194.113359000767,
"overhead_ms": 748.57204500222,
"all_correct": true
}
]
}

View File

@@ -0,0 +1,140 @@
{
"mode": "layerwise",
"model": "/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct",
"raw": [
{
"t_A_s": 0.5905098549992545,
"t_B_s": 0.6900827390018094,
"t_total_s": 0.6904724189989793,
"cached": 8176,
"mode": "layerwise",
"size": 8192,
"rep": 0,
"t_prefill_only_s": 0.5852864849985053,
"kv_gib": 0.75,
"correct": true
},
{
"t_A_s": 0.5897548109969648,
"t_B_s": 0.6827381169969158,
"t_total_s": 0.6828304180016858,
"cached": 8176,
"mode": "layerwise",
"size": 8192,
"rep": 1,
"t_prefill_only_s": 0.5890174580017629,
"kv_gib": 0.75,
"correct": true
},
{
"t_A_s": 0.5850713190011447,
"t_B_s": 0.6744917560026806,
"t_total_s": 0.6745770380002796,
"cached": 8176,
"mode": "layerwise",
"size": 8192,
"rep": 2,
"t_prefill_only_s": 0.5943713950000529,
"kv_gib": 0.75,
"correct": true
},
{
"t_A_s": 1.5030149390004226,
"t_B_s": 1.596173029000056,
"t_total_s": 1.597060264000902,
"cached": 16368,
"mode": "layerwise",
"size": 16384,
"rep": 0,
"t_prefill_only_s": 1.5130829510017065,
"kv_gib": 1.5,
"correct": true
},
{
"t_A_s": 1.499876754998695,
"t_B_s": 1.5940461120007967,
"t_total_s": 1.5948001770011615,
"cached": 16368,
"mode": "layerwise",
"size": 16384,
"rep": 1,
"t_prefill_only_s": 1.5024838620010996,
"kv_gib": 1.5,
"correct": true
},
{
"t_A_s": 1.5068977490009274,
"t_B_s": 1.5950395179970656,
"t_total_s": 1.59571184500237,
"cached": 16368,
"mode": "layerwise",
"size": 16384,
"rep": 2,
"t_prefill_only_s": 1.5303227439981129,
"kv_gib": 1.5,
"correct": true
},
{
"t_A_s": 4.4503932609986805,
"t_B_s": 4.538851200999488,
"t_total_s": 4.539281312001549,
"cached": 32752,
"mode": "layerwise",
"size": 32768,
"rep": 0,
"t_prefill_only_s": 4.446753306998289,
"kv_gib": 3.0,
"correct": true
},
{
"t_A_s": 4.44226107799841,
"t_B_s": 4.551636377997056,
"t_total_s": 4.552389411001059,
"cached": 32752,
"mode": "layerwise",
"size": 32768,
"rep": 1,
"t_prefill_only_s": 4.44538704000297,
"kv_gib": 3.0,
"correct": true
},
{
"t_A_s": 4.440309538000292,
"t_B_s": 4.539836316998844,
"t_total_s": 4.540553365997766,
"cached": 32752,
"mode": "layerwise",
"size": 32768,
"rep": 2,
"t_prefill_only_s": 4.443476915999781,
"kv_gib": 3.0,
"correct": true
}
],
"summary": [
{
"size": 8192,
"n": 3,
"pf_only_ms": 589.0174580017629,
"total_ms": 682.8304180016858,
"overhead_ms": 93.8129599999229,
"all_correct": true
},
{
"size": 16384,
"n": 3,
"pf_only_ms": 1513.0829510017065,
"total_ms": 1595.71184500237,
"overhead_ms": 82.62889400066342,
"all_correct": true
},
{
"size": 32768,
"n": 3,
"pf_only_ms": 4445.38704000297,
"total_ms": 4540.553365997766,
"overhead_ms": 95.16632599479635,
"all_correct": true
}
]
}

View File

@@ -22,6 +22,7 @@ MODEL="${MODEL:-/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct}"
GPUS=(${GPUS:-0 1})
SIZES="${SIZES:-8192,16384,32768}"
REPEATS="${REPEATS:-3}"
BG_LOAD="${BG_LOAD:-0}"
MAX_BATCHED="${MAX_BATCHED:-40960}" # >= max prompt => no chunked prefill
DATE="$(date +%Y%m%d_%H%M)"
OUTDIR="${OUTDIR:-$PROJ_DIR/outputs/mb7_${MODE}_${DATE}}"
@@ -100,7 +101,8 @@ echo "[run] mb7 --mode $MODE"
"$PYTHON" "$DRIVER" --mode "$MODE" \
--src-port "${PORTS[0]}" --dst-port "${PORTS[1]}" \
--src-bp "${BPS[0]}" --dst-bp "${BPS[1]}" \
--sizes "$SIZES" --repeats "$REPEATS" --out "$OUTDIR/mb7_result.json" \
--sizes "$SIZES" --repeats "$REPEATS" --bg-load "$BG_LOAD" \
--out "$OUTDIR/mb7_result.json" \
2>&1 | tee "$OUTDIR/mb7_run.txt"
echo "[done] $OUTDIR"