Migration transfer-cost study: KV transfer is slow on busy GPUs
MIGRATION_TRANSFER_COST.md: under real load, migration KV transfer runs at ~3 GB/s vs ~10 GB/s idle. Decomposed (instruments + MB6 microbench) into ~55% RDMA-actual (HBM/PCIe contention with running kernels: 7.6->4.0 GB/s) + ~45% control-plane GIL starvation during long prefills. Reproduced on a fresh upstream venv (byte-identical transfer path) -> upstream/hardware inherent, not our patch. Layerwise is the wrong lever; the tax is structural on a loaded agentic cluster.

Includes mb6_transfer_under_load + run_mb6, instrument_dst_migration/mooncake, and the dst/transfer decomposition analyzers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
178
microbench/connector_tax/cache_sweep/MIGRATION_TRANSFER_COST.md
Normal file
@@ -0,0 +1,178 @@
# Why KV-transfer is slow during migration under real load

**Question.** EAR's unified+A+B routing beats migration (v3) on agentic
workloads. We wanted to know whether *layerwise* KV transfer would shrink
migration's overhead enough to make it viable. Investigating that led to a
sharper question: **in a real (loaded) cluster, when we migrate, the KV
transfer is already slow: the effective bandwidth is far below the
~10 GB/s wire rate. Why?**

This doc answers that with instrumented measurements.

**TL;DR.** Migration fires precisely when instances are *busy* (that's the
trigger). But on a busy instance, KV transfer runs at **~3 GB/s instead of
~10 GB/s**, because:

1. **The RDMA write itself slows ~2× under compute load.** GPU-direct RDMA
   (`batch_transfer_sync_write`) contends with the running attention/MLP
   kernels for **HBM and PCIe bandwidth**. (idle 7.6 GB/s → busy 4.0 GB/s)
2. **The connector's Python control plane gets GIL-starved.** Mooncake's
   ZMQ handshake + transfer orchestration run on asyncio threads inside the
   engine process; when the engine's main thread is doing a long forward
   pass (e.g. a 100k-token prefill), those threads stall for *seconds*.

Both are **inherent to upstream vLLM 0.18.1 + mooncake** (reproduced on a
clean fresh venv; the transfer path is byte-identical to upstream, so our
patches did not cause this), and both get **worse**, not better, with
layerwise transfer. So the bandwidth gap is not a layerwise problem; it is a
*transfer-on-a-busy-GPU* problem.

---

## 1. Evidence chain

Three independent measurements, all on dash0 (8×H100, Qwen3-Coder-30B-A3B,
TP=1), Mooncake `kv_both`.

### 1a. Instrumented v3 trace replay: where does migration time go?

Run `outputs/b3_v3_fullbreak_20260528_0338/`. Instruments:
`instrument_dst_migration.py` (dst scheduler lifecycle) +
`instrument_mooncake.py` (connector internals: `send_blocks` RDMA,
`receive_kv` window, `ready_wait`).

25 migrations fired over the trace. Dst-side migration overhead
(`T_kv_pull` = scheduler marks `WAITING_FOR_REMOTE_KVS` → `finished_recving`):

| component | share | what it is |
|---|---:|---|
| RDMA-actual (`batch_transfer_sync_write`) | **55%** (55.2 s) | the real RDMA write |
| dst control-plane gap | **45%** (45.4 s) | scheduler↔receiver_loop dispatch + completion propagation |
| `ready_wait` (src KV not committed) | 0% | 25/25 already committed, so **ruled out** |

- Pure RDMA aggregate rate: **2.03 GB/s** (vs MB2 idle 9.7 GB/s).
- RDMA rate **collapses with transfer size**:
  <3 GiB → 4–9.5 GB/s, >5 GiB → 0.9–2.6 GB/s.
- The control-plane gap is **bimodal**: median 0.04 s, but a handful of
  requests stall ~10 s. Those are small-KV transfers (0.18 s of actual RDMA)
  whose `T_kv_pull` is 8–11 s, i.e. the dst's `receiver_loop` thread was
  GIL-starved for ~10 s while the engine did a big forward pass.

> Earlier (pre-instrumentation) we wrongly attributed ~90% of migration
> overhead to "dst scheduler queueing" by estimating transfer at clean wire
> speed. With real instrumentation, dst *scheduler admission* is ~0
> (`T_admission_post_kv` = 0.003 s); the time is the transfer phase (RDMA +
> connector control plane), both degraded by instance busy-ness.

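The shares in the 1a table follow directly from the summed seconds. The sketch below redoes that arithmetic and, assuming the 2.03 GB/s aggregate means total bytes divided by total RDMA seconds (our reading; the text does not spell it out), estimates how long the same bytes would have taken at MB2's idle rate. All names are illustrative and do not come from the analyzers.

```python
# Seconds summed over the 25 migrations, copied from the 1a table.
COMPONENT_SECONDS = {
    "rdma_actual": 55.2,
    "control_plane_gap": 45.4,
    "ready_wait": 0.0,
}

AGGREGATE_RDMA_GBPS = 2.03  # busy trace, pure RDMA aggregate rate
IDLE_RDMA_GBPS = 9.7        # MB2 idle reference


def shares(seconds):
    """Return each component's fraction of the summed overhead."""
    total = sum(seconds.values())
    return {name: value / total for name, value in seconds.items()}


def implied_gigabytes(rate_gbps, rdma_seconds):
    """Bytes moved, ASSUMING aggregate rate = total bytes / total RDMA time."""
    return rate_gbps * rdma_seconds


component_shares = shares(COMPONENT_SECONDS)
moved_gb = implied_gigabytes(AGGREGATE_RDMA_GBPS, COMPONENT_SECONDS["rdma_actual"])
idle_seconds = moved_gb / IDLE_RDMA_GBPS

for name, frac in component_shares.items():
    print(f"{name:<18} {frac:6.1%}")
print(f"implied bytes moved : {moved_gb:.0f} GB")
print(f"same bytes at idle  : {idle_seconds:.1f} s (vs 55.2 s measured)")
```

Under that assumption, roughly 112 GB moved in 55.2 s of RDMA time would have needed about 11.6 s at the idle rate.
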
### 1b. MB6 controlled microbench: does busy-ness cause it?

`microbench/fresh_setup/mb6_transfer_under_load.py` + `run_mb6.sh`: 2
instances, transfer a fixed-size KV (prefill on A → migrate to B) while
holding *N* background decode streams on both. Sweep N.

Effective transfer bandwidth (65k-token KV ≈ 6 GiB), main venv:

| background load | 65k transfer | eff bandwidth |
|---|---:|---:|
| **0 (idle)** | 747 ms | **8.76 GB/s** |
| 8 (4/instance) | 1423 ms | 4.53 GB/s |
| **24 (12/instance)** | 2015 ms | **3.33 GB/s** |

Monotonic degradation with load. **The busy level (3.3 GB/s) matches the
v3 trace's 3.3 GB/s median exactly**, because agentic instances run
~10+ concurrent requests, i.e. the bg=24 regime.

Decomposing the 65k transfer into RDMA-actual vs control-plane:

| bg | RDMA rate | control-plane share |
|---|---:|---:|
| 0 (idle) | 7.56 GB/s | 13% |
| 8 | 4.07 GB/s | 11% |
| 24 (busy) | 3.97 GB/s | 15% |

In the clean microbench the **RDMA write itself is the dominant degrading
term** (7.6 → 4.0 GB/s). The ~10 s control-plane stalls seen in the trace
(1a) don't reproduce here because steady decode forward passes are short;
they require the long (100k-token) prefills that the real trace has.

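To make the "~2×" figure concrete, the sketch below divides each idle bandwidth by its loaded counterpart, using only the bandwidth columns of the two 1b tables. The dictionary and function names are illustrative.

```python
# Bandwidths in GB/s, keyed by number of background decode streams (bg).
EFFECTIVE_GBPS = {0: 8.76, 8: 4.53, 24: 3.33}  # end-to-end, first 1b table
RDMA_GBPS = {0: 7.56, 8: 4.07, 24: 3.97}       # RDMA-actual, second 1b table


def slowdown_vs_idle(gbps_by_bg):
    """Idle bandwidth divided by the bandwidth at each load level."""
    idle = gbps_by_bg[0]
    return {bg: idle / gbps for bg, gbps in gbps_by_bg.items()}


effective_slowdown = slowdown_vs_idle(EFFECTIVE_GBPS)
rdma_slowdown = slowdown_vs_idle(RDMA_GBPS)

for bg in sorted(EFFECTIVE_GBPS):
    print(f"bg={bg:>2}  effective {effective_slowdown[bg]:.2f}x  "
          f"rdma {rdma_slowdown[bg]:.2f}x")
```

At bg=24 the RDMA write alone comes out about 1.9× slower than idle, while the end-to-end figure is about 2.6× slower.
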
### 1c. Fresh-venv comparison: is it our patch?

Same MB6 sweep on `agentic-kv-fresh/.venv` (clean upstream-style 0.18.1):

| bg | 65k eff (fresh) | 65k eff (main/patched) |
|---|---:|---:|
| 0 | 8.73 GB/s | 8.76 GB/s |
| 8 | 4.52 GB/s | 4.53 GB/s |
| 24 | 3.27 GB/s | 3.33 GB/s |

**Identical within noise.** Plus a static check: the v3 transfer path
(`send_kv_to_decode`, `_send_blocks`/`batch_transfer_sync_write`,
`_build_transfer_params`) is **byte-identical** to pristine upstream 0.18.1
(commit `445e491`); `receive_kv_from_single_worker` differs only by a 4-line
error branch. Our mooncake commits (`a7df84b` direct-read,
`ea51497` partial-prefill, `e3a1d70` read→push) only touch a *separate*
`direct_read` path that v3 does **not** use (v3 requests carry no
`direct_read` flag → normal push path).

→ **The slowdown is upstream/hardware-inherent, not introduced by us.**

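A static check of this kind can be scripted. The sketch below is one possible approach, not the procedure used for this study: it hashes the source text of named functions in two versions of a module and reports which ones differ. `function_digests` and `differing_functions` are hypothetical helpers, and the two inline sources are toy stand-ins for the patched and upstream connector files (in practice one would read both files from disk). If a name is defined more than once in a file, the last definition visited wins.

```python
import ast
import hashlib


def function_digests(source, names):
    """SHA-256 of the exact source text of each named function or method."""
    digests = {}
    for node in ast.walk(ast.parse(source)):
        is_func = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_func and node.name in names:
            segment = ast.get_source_segment(source, node)
            digests[node.name] = hashlib.sha256(segment.encode("utf-8")).hexdigest()
    return digests


def differing_functions(ours, upstream, names):
    """Names whose source differs, or that are missing from either side."""
    a = function_digests(ours, names)
    b = function_digests(upstream, names)
    return sorted(n for n in names if a.get(n) != b.get(n) or n not in a)


# Toy stand-ins for the two connector sources (illustrative only).
UPSTREAM = '''
def _send_blocks(x):
    return x + 1

def receive_kv_from_single_worker(x):
    return x
'''
OURS = '''
def _send_blocks(x):
    return x + 1

def receive_kv_from_single_worker(x):
    if x is None:
        raise ValueError("no worker")
    return x
'''

names = {"_send_blocks", "receive_kv_from_single_worker"}
print(differing_functions(OURS, UPSTREAM, names))
```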

---

## 2. Root cause

Migration in agentic serving transfers KV **between instances that are
concurrently busy with compute**, by construction, since v3 migrates *away
from* a busy host. On a busy instance:

- **HBM/PCIe contention (the dominant, irreducible part).** Mooncake's
  transfer is GPU-direct RDMA: the NIC DMAs KV straight out of / into GPU
  HBM. While the GPU runs attention+MLP kernels, those kernels saturate HBM
  bandwidth, so the NIC's RDMA gets a smaller slice. Effective transfer
  bandwidth roughly halves (7.6 → 4.0 GB/s at our load), and degrades
  further for large multi-segment transfers.
- **Control-plane GIL starvation (secondary, bursty).** The connector runs
  its ZMQ handshake + `send_kv_to_decode`/`receive_kv` orchestration on
  asyncio threads (`sender_loop`/`receiver_loop`) *inside the engine
  process*. A long forward pass (100k-token prefill) holds the GIL for
  seconds, stalling those threads → multi-second dispatch gaps even when the
  actual transfer is 0.2 s.

MB2 measured 9.7 GB/s precisely because both endpoints were **idle**. The
real-workload gap is the difference between "idle benchmark" and "transfer
while the GPU is doing the day job."

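The second mechanism can be reproduced in miniature without vLLM or mooncake. The sketch below is a stand-in, under the assumption of a standard GIL-enabled CPython build: a background "heartbeat" thread plays the role of `receiver_loop`, and a single long C-level call (`sum(range(...))`, which does not hand the GIL back while it runs) plays the role of a long forward pass. The helper names are hypothetical, and the exact gaps depend on the machine; the expectation is only that the busy gap is close to the full duration of the GIL-holding call, while the idle gap stays near the tick interval.

```python
import threading
import time


def max_heartbeat_gap(workload, tick_s=0.001):
    """Run `workload` on the calling thread while a background thread ticks.

    Returns the largest gap (seconds) between consecutive ticks, i.e. the
    longest stretch the background thread went without being scheduled.
    """
    stop = threading.Event()
    gaps = []

    def heartbeat():
        last = time.perf_counter()
        while not stop.is_set():
            time.sleep(tick_s)
            now = time.perf_counter()
            gaps.append(now - last)
            last = now

    thread = threading.Thread(target=heartbeat, daemon=True)
    thread.start()
    time.sleep(0.05)  # let the heartbeat record a few ticks first
    workload()
    stop.set()
    thread.join()
    return max(gaps)


def idle_wait():
    time.sleep(0.2)  # sleeping releases the GIL


def gil_holding_call():
    sum(range(20_000_000))  # one long C-level call made while holding the GIL


idle_gap = max_heartbeat_gap(idle_wait)
busy_gap = max_heartbeat_gap(gil_holding_call)
print(f"max heartbeat gap, idle: {idle_gap * 1e3:.1f} ms")
print(f"max heartbeat gap, busy: {busy_gap * 1e3:.1f} ms")
```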

---

## 3. Implication: layerwise is the wrong lever; migration's tax is largely irreducible

| lever | effect on the gap |
|---|---|
| **Model-level layerwise transfer** (push each layer's KV during prefill) | **Worse.** Prefill is the most HBM-intensive phase, so per-layer transfers contend *harder* for HBM (Cause 1); and they multiply the control-plane round-trips (Cause 2). |
| **Control-plane fix** (move mooncake orchestration off the GIL-contended threads / separate process) | Addresses only the bursty ~10 s stalls (~15% in the clean case, up to ~45% of the trace tail). Does **not** touch the HBM-contention half. |
| **Reduce bytes** (cache-aware target so less KV moves) | Helps linearly; v3 Mechanism B already does some. Orthogonal. |
| **Migrate to/from idle instances** | Would restore ~10 GB/s, but defeats the purpose (we migrate *because* the host is busy). |

The dominant cost (RDMA contending with compute for HBM on busy instances)
is a **hardware reality**, not a software bug we can patch away, and not
something layerwise improves. This reinforces
[UNIFIED_ABLATION.md](UNIFIED_ABLATION.md): the unified no-migration path
(A+B'+RaceFix) remains the right default; migration's transfer tax is
structural on a loaded agentic cluster.

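The "helps linearly" entry can be illustrated with a first-order model in which transfer time is bytes moved divided by effective bandwidth. The per-token KV size below assumes 48 layers, 4 KV heads, head dimension 128 and 2-byte (bf16) entries. These values come from our reading of the public Qwen3-Coder-30B-A3B model card, not from this study; with them a 65,536-token context comes to exactly 6 GiB, in line with the "≈ 6 GiB" figure in 1b. The model ignores control-plane gaps and the size-dependent RDMA collapse noted in 1a, so its outputs are rough estimates only.

```python
# ASSUMED model shape (check against the model's config.json before relying on it).
NUM_LAYERS = 48
NUM_KV_HEADS = 4
HEAD_DIM = 128
BYTES_PER_ELEM = 2  # bf16

BUSY_GBPS = 3.33  # bg=24 effective bandwidth from section 1b
IDLE_GBPS = 8.76  # bg=0 effective bandwidth from section 1b


def kv_bytes(num_tokens):
    """KV-cache bytes for `num_tokens`: K and V, every layer, every KV head."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM * num_tokens


def transfer_seconds(num_tokens, cached_fraction, gbps):
    """First-order estimate: only the uncached tokens' KV has to move."""
    tokens_to_move = round(num_tokens * (1.0 - cached_fraction))
    return kv_bytes(tokens_to_move) / (gbps * 1e9)


tokens = 65_536
print(f"KV size: {kv_bytes(tokens) / 2**30:.1f} GiB")
for cached in (0.0, 0.5, 0.9):
    busy = transfer_seconds(tokens, cached, BUSY_GBPS)
    idle = transfer_seconds(tokens, cached, IDLE_GBPS)
    print(f"cached={cached:.0%}  busy {busy:.2f} s  idle {idle:.2f} s")
```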

---

## 4. Repro / artifacts

- Instrumented v3 breakdown: `outputs/b3_v3_fullbreak_20260528_0338/unified_v3/`
  (`transfer_decomp.txt`, `dst_migration_breakdown.{csv,png}`,
  `transfer_rootcause.png`)
- MB6 main: `outputs/mb6_agentic-kv_20260528_0552/mb6_result.json`
- MB6 fresh: `outputs/mb6_fresh_20260528_0559/mb6_result.json`
- Instruments: `microbench/fresh_setup/instrument_dst_migration.py`,
  `microbench/fresh_setup/instrument_mooncake.py`
- Microbench: `microbench/fresh_setup/mb6_transfer_under_load.py` +
  `run_mb6.sh` (`VENV=… bash run_mb6.sh`)
- Analyzers: `analyze_dst_migration.py`, `analyze_transfer_decomp.py`

All instruments apply/revert cleanly via `--apply`/`--revert`; both venvs
were restored after the runs.
333
microbench/connector_tax/cache_sweep/analyze_dst_migration.py
Executable file
@@ -0,0 +1,333 @@
#!/usr/bin/env python3
"""Analyze dst-side migration breakdown for unified_v3 runs.

Joins the proxy `breakdown.json` (per-request route + phase timestamps)
with the dst engine per-PID logs written by
`instrument_dst_migration.py` (`dm_mig_pid<pid>.jsonl`), to attribute
each migration's dst-side wall-clock into:

  T_relay              proxy decode-sent → dst arrival
  T_admission_pre_kv   dst arrival → status=WAITING_FOR_REMOTE_KVS
                       (waiting in dst's scheduler queue before KV pull
                       is even initiated)
  T_kv_pull            WAITING_FOR_REMOTE_KVS → finished_recving
                       (the actual RDMA transfer + connector ack)
  T_admission_post_kv  finished_recving → first time in self.running
                       (KV ready, waiting for batch slot)
  T_first_iter         first scheduled → first generated token
                       (one decode-iter compute + sampler latency)

Layerwise transfer can at best eliminate T_kv_pull. Everything else is
queueing or compute that layerwise does not touch.

Usage:
  python analyze_dst_migration.py \
      --proxy-breakdown <RUNDIR>/breakdown.json \
      --dst-log-dir <DST_LOG_DIR>
      [--output <RUNDIR>/dst_migration_breakdown.csv]
      [--plot <RUNDIR>/dst_migration_breakdown.png]
"""
from __future__ import annotations

import argparse
import json
import math
import os
import re
import statistics
import sys
from pathlib import Path


def _core_req_id(rid: str) -> str:
    """Normalize a vLLM engine req_id back to the proxy's request_id.

    vLLM wraps the proxy id `S:T:U:N` as `cmpl-S:T:U:N-<dp_rank>-<hex>`.
    Strip the `cmpl-` prefix and the trailing `-<digits>-<hex>` suffix so
    it joins against the proxy `breakdown.json` request_id.
    """
    if not rid:
        return rid
    s = rid
    if s.startswith("cmpl-"):
        s = s[len("cmpl-"):]
    m = re.match(r"^(.*)-\d+-[0-9a-fA-F]+$", s)
    if m:
        s = m.group(1)
    return s


def _pct(vals: list[float], q: float) -> float:
    if not vals:
        return float("nan")
    vs = sorted(vals)
    i = max(0, min(len(vs) - 1, int(math.ceil(q * len(vs))) - 1))
    return vs[i]


def _summary(name: str, vals: list[float]) -> dict:
    if not vals:
        return {"name": name, "n": 0}
    return {
        "name": name,
        "n": len(vals),
        "mean_s": statistics.mean(vals),
        "p50_s": _pct(vals, 0.5),
        "p90_s": _pct(vals, 0.9),
        "p99_s": _pct(vals, 0.99),
        "max_s": max(vals),
        "sum_s": sum(vals),
    }


def load_dst_log(dst_log_dir: Path) -> dict[str, dict]:
    by_req: dict[str, dict] = {}
    found_files = sorted(dst_log_dir.glob("dm_mig_pid*.jsonl"))
    print(f"[analyze] dst log files: {len(found_files)} under {dst_log_dir}")
    for f in found_files:
        with f.open() as fh:
            for line in fh:
                try:
                    rec = json.loads(line)
                except Exception:
                    continue
                rid = rec.get("req_id")
                if not rid:
                    continue
                key = _core_req_id(rid)
                rec["_raw_req_id"] = rid
                # If a req shows up twice (shouldn't, but be safe), prefer the
                # one with t_first_token_unix populated.
                prev = by_req.get(key)
                if prev is None or (
                    rec.get("t_first_token_unix") and
                    not prev.get("t_first_token_unix")
                ):
                    by_req[key] = rec
    print(f"[analyze] unique dst records: {len(by_req)}")
    return by_req


def load_proxy_breakdown(path: Path) -> list[dict]:
    with path.open() as fh:
        data = json.load(fh)
    assert isinstance(data, list), f"unexpected breakdown.json shape: {type(data)}"
    return data


def decompose(proxy_recs: list[dict], dst_by_req: dict[str, dict]) -> list[dict]:
    """Build per-migration breakdown rows by joining proxy + dst by req_id."""
    rows: list[dict] = []
    migrations = [x for x in proxy_recs if x.get("route_class") == "PD_SEP_V2"]
    print(f"[analyze] proxy migrations: {len(migrations)} "
          f"(of {len(proxy_recs)} total requests)")

    miss_in_dst = 0
    missing_phases = 0
    for p in migrations:
        rid = p.get("request_id")
        dst = dst_by_req.get(rid)
        if dst is None:
            miss_in_dst += 1
            continue
        if dst.get("t_first_token_unix") is None:
            missing_phases += 1
            # still include the row but mark phases as NaN downstream
        t_decode_sent = p.get("t_decode_sent_unix")
        t_first_tok = p.get("t_first_token_unix")
        t_arrival = dst.get("t_arrival_unix")
        t_wait_kvs = dst.get("t_wait_for_kvs_unix")
        t_kv_done = dst.get("t_kv_recv_done_unix")
        t_first_sched = dst.get("t_first_scheduled_unix")
        t_first_tok_dst = dst.get("t_first_token_unix")

        def _diff(a, b):
            if a is None or b is None:
                return None
            return float(a) - float(b)

        rows.append({
            "request_id": rid,
            "session_id": p.get("session_id"),
            "input_length": p.get("input_length"),
            "v3_new_local": p.get("v3_new_local"),
            "v3_target_idx": p.get("v3_target_idx") or p.get("v3_decode_target_idx"),
            "arrival_n_running": (dst.get("arrival_state") or {}).get("n_running"),
            "arrival_n_waiting": (dst.get("arrival_state") or {}).get("n_waiting"),
            "arrival_pending_prefill_tok": (dst.get("arrival_state") or {}).get("pending_prefill_tok"),
            "arrival_n_waiting_for_kvs": (dst.get("arrival_state") or {}).get("n_waiting_for_kvs"),
            # Phase durations (seconds)
            "T_proxy_total_dst_first_token_s": _diff(t_first_tok, t_decode_sent),
            "T_relay_s": _diff(t_arrival, t_decode_sent),
            "T_admission_pre_kv_s": _diff(t_wait_kvs, t_arrival),
            "T_kv_pull_s": _diff(t_kv_done, t_wait_kvs),
            "T_admission_post_kv_s": _diff(t_first_sched, t_kv_done),
            "T_first_iter_s": _diff(t_first_tok_dst, t_first_sched),
            # Raw timestamps for debugging
            "t_decode_sent_unix": t_decode_sent,
            "t_dst_arrival_unix": t_arrival,
            "t_dst_wait_for_kvs_unix": t_wait_kvs,
            "t_dst_kv_recv_done_unix": t_kv_done,
            "t_dst_first_scheduled_unix": t_first_sched,
            "t_dst_first_token_unix": t_first_tok_dst,
            "t_proxy_first_token_unix": t_first_tok,
        })

    print(f"[analyze] missing in dst log: {miss_in_dst}")
    print(f"[analyze] dst record incomplete (no t_first_token): {missing_phases}")
    return rows


def emit_summary(rows: list[dict]) -> None:
    if not rows:
        print("[analyze] no rows; nothing to summarize.")
        return

    phase_keys = [
        "T_proxy_total_dst_first_token_s",
        "T_relay_s",
        "T_admission_pre_kv_s",
        "T_kv_pull_s",
        "T_admission_post_kv_s",
        "T_first_iter_s",
    ]

    print()
    print("=" * 88)
    print(f"Migration dst-side phase breakdown (n_migrations={len(rows)})")
    print("=" * 88)
    print(f"{'phase':<36} {'n':>4} {'mean(s)':>9} {'p50':>8} {'p90':>8} "
          f"{'p99':>8} {'max':>8} {'sum(s)':>9}")
    print("-" * 88)
    for k in phase_keys:
        vals = [r[k] for r in rows if r.get(k) is not None]
        if not vals:
            print(f"{k:<36} {'n/a':>4}")
            continue
        s = _summary(k, vals)
        print(f"{k:<36} {s['n']:>4} {s['mean_s']:>9.3f} {s['p50_s']:>8.3f} "
              f"{s['p90_s']:>8.3f} {s['p99_s']:>8.3f} {s['max_s']:>8.3f} "
              f"{s['sum_s']:>9.2f}")

    print()
    print("Aggregate attribution (sum across all migrations):")
    sums = {}
    for k in ("T_relay_s", "T_admission_pre_kv_s", "T_kv_pull_s",
              "T_admission_post_kv_s", "T_first_iter_s"):
        sums[k] = sum(r[k] for r in rows if r.get(k) is not None)
    total = sum(sums.values())
    total_proxy = sum(r["T_proxy_total_dst_first_token_s"] for r in rows
                      if r.get("T_proxy_total_dst_first_token_s") is not None)
    print(f" decomposed sum : {total:>8.2f} s")
    print(f" proxy total sum : {total_proxy:>8.2f} s "
          f"(should be ~equal; gap = uninstrumented)")
    if total > 0:
        for k, v in sums.items():
            print(f" {k:<28} {v:>8.2f} s ({v/total*100:5.1f} %)")

    # Headline: "How much could layerwise save?"
    layerwise_addressable = sums.get("T_kv_pull_s", 0.0)
    queue_residual = sum(v for k, v in sums.items() if k != "T_kv_pull_s")
    print()
    print("Layerwise-addressable vs queue-residual:")
    print(f" T_kv_pull_s (addressable by layerwise) : {layerwise_addressable:>8.2f} s "
          f"({layerwise_addressable / total * 100 if total else 0:5.1f} %)")
    print(f" everything else (queue/admission/iter) : {queue_residual:>8.2f} s "
          f"({queue_residual / total * 100 if total else 0:5.1f} %)")


def write_csv(rows: list[dict], path: Path) -> None:
    import csv
    if not rows:
        path.write_text("")
        return
    fields = list(rows[0].keys())
    with path.open("w", newline="") as fh:
        w = csv.DictWriter(fh, fieldnames=fields)
        w.writeheader()
        w.writerows(rows)
    print(f"[analyze] wrote CSV: {path} (n={len(rows)})")


def maybe_plot(rows: list[dict], out_path: Path) -> None:
    try:
        import matplotlib
        matplotlib.use("Agg")
        import matplotlib.pyplot as plt
    except Exception as e:
        print(f"[analyze] matplotlib unavailable ({e}); skipping plot.")
        return

    if not rows:
        return

    rows_sorted = sorted(
        rows,
        key=lambda r: r.get("T_proxy_total_dst_first_token_s") or 0.0,
    )
    n = len(rows_sorted)
    idx = list(range(n))

    def col(k):
        return [(r.get(k) or 0.0) for r in rows_sorted]

    relay = col("T_relay_s")
    pre = col("T_admission_pre_kv_s")
    pull = col("T_kv_pull_s")
    post = col("T_admission_post_kv_s")
    first_iter = col("T_first_iter_s")

    fig, ax = plt.subplots(figsize=(11, 5))
    bot = [0.0] * n
    for vals, label, color in [
        (relay, "HTTP relay", "#cccccc"),
        (pre, "admission pre-KV", "#f4a261"),
        (pull, "KV pull (layerwise-addressable)", "#e76f51"),
        (post, "admission post-KV", "#2a9d8f"),
        (first_iter, "first decode iter", "#264653"),
    ]:
        ax.bar(idx, vals, bottom=bot, color=color, label=label, width=0.85)
        bot = [b + v for b, v in zip(bot, vals)]

    ax.set_xticks(idx)
    ax.set_xticklabels([str(i + 1) for i in idx], rotation=0, fontsize=8)
    ax.set_xlabel("Migrated request (sorted by total dst wait, ascending)")
    ax.set_ylabel("Time (s)")
    ax.set_title("Per-migration dst-side phase breakdown (v3 unified_v3 run)")
    ax.legend(loc="upper left", fontsize=9)
    ax.grid(axis="y", linestyle=":", alpha=0.5)
    fig.tight_layout()
    fig.savefig(out_path, dpi=120)
    plt.close(fig)
    print(f"[analyze] wrote plot: {out_path}")


def main() -> None:
    p = argparse.ArgumentParser()
    p.add_argument("--proxy-breakdown", type=Path, required=True)
    p.add_argument("--dst-log-dir", type=Path, required=True)
    p.add_argument("--output", type=Path, default=None,
                   help="CSV path (default: <run>/dst_migration_breakdown.csv)")
    p.add_argument("--plot", type=Path, default=None,
                   help="PNG path (default: <run>/dst_migration_breakdown.png)")
    args = p.parse_args()

    if not args.proxy_breakdown.is_file():
        sys.exit(f"missing proxy breakdown: {args.proxy_breakdown}")
    if not args.dst_log_dir.is_dir():
        sys.exit(f"missing dst log dir: {args.dst_log_dir}")

    run_dir = args.proxy_breakdown.parent
    out_csv = args.output or (run_dir / "dst_migration_breakdown.csv")
    out_png = args.plot or (run_dir / "dst_migration_breakdown.png")

    proxy_recs = load_proxy_breakdown(args.proxy_breakdown)
    dst_by_req = load_dst_log(args.dst_log_dir)
    rows = decompose(proxy_recs, dst_by_req)
    emit_summary(rows)
    write_csv(rows, out_csv)
    maybe_plot(rows, out_png)


if __name__ == "__main__":
    main()
133
microbench/connector_tax/cache_sweep/analyze_migration_log.py
Normal file
@@ -0,0 +1,133 @@
#!/usr/bin/env python3
"""Per-migration log + per-instance summary for a v3 trace replay.

Reads <run_dir>/breakdown.json and <run_dir>/metrics.jsonl and emits:
  1. A row per migration showing src→dst, per-side state snapshots, and
     the resulting TTFT.
  2. Histograms: migrations received per inst, sent per inst, all
     (src→dst) pairs.
  3. Post-rotation tail: how many turns of migrated sessions ended up on
     each inst (downstream impact of rotation).
  4. Anti-hotspot signal: recent_mig_received_in_window at decision time.

Run any v3 replay through this to spot pathological clustering of
migrations on the same dst within a short window.

Usage:
  python analyze_migration_log.py <RUN_DIR>
where <RUN_DIR> contains breakdown.json + metrics.jsonl (i.e. the proxy's
per-policy output folder, e.g. .../b3_v3_20260527_1344/unified_v3).
"""
import json
import sys
from collections import Counter, defaultdict
from pathlib import Path


def main(run_dir: Path) -> None:
|
||||||
|
bd = json.load(open(run_dir / "breakdown.json"))
|
||||||
|
m = {json.loads(l)["request_id"]: json.loads(l)
|
||||||
|
for l in open(run_dir / "metrics.jsonl")}
|
||||||
|
|
||||||
|
mig = [e for e in bd if e.get("v3_migrate")]
|
||||||
|
mig.sort(key=lambda x: x.get("t_decision_unix", 0))
|
||||||
|
|
||||||
|
print(f"=== {len(mig)} migrations in {run_dir.name} ===\n")
|
||||||
|
cols = (
|
||||||
|
"#", "t_rel", "session", "turn",
|
||||||
|
"src", "dst", "src_nreq", "src_dec_tok",
|
||||||
|
"dst_nreq", "dst_cache", "dst_recent_recv",
|
||||||
|
"inlen", "self_ttft_ms",
|
||||||
|
)
|
||||||
|
print(" " + " ".join(f"{c:>13}" for c in cols))
|
||||||
|
print("-" * (15 * len(cols)))
|
||||||
|
|
||||||
|
t0 = mig[0]["t_decision_unix"] if mig else 0
|
||||||
|
for i, e in enumerate(mig):
|
||||||
|
rid = e["request_id"]
|
||||||
|
src_idx = e.get("v3_src_idx", e["chosen_idx"])
|
||||||
|
dst_idx = e.get("v3_target_idx", -1)
|
||||||
|
src_state = e.get("v3_src_state") or {}
|
||||||
|
dst_state = e.get("v3_target_state") or {}
|
||||||
|
cands = {c["idx"]: c for c in e.get("candidate_scores", [])}
|
||||||
|
# Fall back to candidate_scores if dedicated v3_*_state fields aren't present.
|
||||||
|
src_nreq = src_state.get("num_requests", cands.get(src_idx, {}).get("num_requests", "-"))
|
||||||
|
src_dec_tok = src_state.get("ongoing_decode_tokens",
|
||||||
|
cands.get(src_idx, {}).get("ongoing_decode_tokens", "-"))
|
||||||
|
dst_nreq = dst_state.get("num_requests", cands.get(dst_idx, {}).get("num_requests", "-"))
|
||||||
|
dst_cache = e.get("v3_target_cache_hit", dst_state.get("cache_hit_estimate", 0))
|
||||||
|
dst_recent = e.get("v3_target_recent_received",
|
||||||
|
dst_state.get("recent_mig_received_in_window", "-"))
|
||||||
|
inlen = e.get("input_length") or m.get(rid, {}).get("input_length", 0)
|
||||||
|
ttft = m.get(rid, {}).get("ttft_s") or 0
|
||||||
|
t_rel = e["t_decision_unix"] - t0
|
||||||
|
turn = m.get(rid, {}).get("turn_id", "?")
|
||||||
|
print(
|
||||||
|
f" {i+1:>13} {t_rel:>13.1f} {e['session_id']:>13} {turn:>13} "
|
||||||
|
f"{src_idx:>13} {dst_idx:>13} {src_nreq:>13} {src_dec_tok:>13} "
|
||||||
|
f"{dst_nreq:>13} {dst_cache:>13} {dst_recent:>13} "
|
||||||
|
f"{inlen:>13} {ttft*1000:>13.0f}"
|
||||||
|
)
|
||||||
|
|
||||||
|
# Aggregate counts
|
||||||
|
print("\n=== Migrations TO each instance ===")
|
||||||
|
to_count = Counter(e.get("v3_target_idx", -1) for e in mig)
|
||||||
|
for idx in range(8):
|
||||||
|
print(f" inst_{idx}: {to_count.get(idx, 0)} migrations received")
|
||||||
|
|
||||||
|
print("\n=== Migrations FROM each instance ===")
|
||||||
|
from_count = Counter(e.get("v3_src_idx", e["chosen_idx"]) for e in mig)
|
||||||
|
for idx in range(8):
|
||||||
|
print(f" inst_{idx}: {from_count.get(idx, 0)} migrations sent")
|
||||||
|
|
||||||
|
print("\n=== Migration pairs (src→dst, count) ===")
|
||||||
|
pair_count = Counter(
|
||||||
|
(e.get("v3_src_idx", e["chosen_idx"]), e.get("v3_target_idx", -1))
|
||||||
|
for e in mig
|
||||||
|
)
|
||||||
|
for (s, d), n in sorted(pair_count.items(), key=lambda x: -x[1]):
|
||||||
|
print(f" {s} → {d}: {n}")
|
||||||
|
|
||||||
|
print("\n=== Sessions migrating multiple times ===")
|
||||||
|
sess_mig = defaultdict(list)
|
||||||
|
for e in mig:
|
||||||
|
sess_mig[e["session_id"]].append(
|
||||||
|
(e.get("t_decision_unix", 0),
|
||||||
|
e.get("v3_src_idx", e["chosen_idx"]),
|
||||||
|
e.get("v3_target_idx", -1))
|
||||||
|
)
|
||||||
|
multi = {s: ev for s, ev in sess_mig.items() if len(ev) > 1}
|
||||||
|
if not multi:
|
||||||
|
print(" (none)")
|
||||||
|
for sess, events in sorted(multi.items()):
|
||||||
|
chain = " → ".join(f"{s}->{d}" for _, s, d in sorted(events))
|
||||||
|
print(f" session {sess}: {chain}")
|
||||||
|
|
||||||
|
# Recent-received hotspot signal — non-zero values mean the picker
|
||||||
|
# accepted a target that recently got another migration.
|
||||||
|
print("\n=== Anti-hotspot signal: dst.recent_mig_received_in_window ===")
|
||||||
|
rec = [e.get("v3_target_recent_received", 0) for e in mig]
|
||||||
|
if rec:
|
||||||
|
nonzero = [v for v in rec if v]
|
||||||
|
print(f" total migrations: {len(rec)}, "
|
||||||
|
f"with recent_received > 0: {len(nonzero)}, "
|
||||||
|
f"max recent_received: {max(rec)}")
|
||||||
|
|
||||||
|
# Post-rotation tail: turns of migrated sessions after their LAST mig
|
||||||
|
print("\n=== Post-rotation tail per inst (turns of migrated sessions after last mig) ===")
|
||||||
|
tail = Counter()
|
||||||
|
for sess, events in sess_mig.items():
|
||||||
|
final_dst = sorted(events)[-1][2]
|
||||||
|
last_t = max(t for t, _, _ in events)
|
||||||
|
sess_turns = [mm for rid, mm in m.items() if mm["session_id"] == sess]
|
||||||
|
tail[final_dst] += sum(1 for mm in sess_turns
|
||||||
|
if mm.get("t_dispatch_unix", 0) > last_t)
|
||||||
|
for idx in range(8):
|
||||||
|
print(f" inst_{idx}: {tail.get(idx, 0)} tail turns")
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
if len(sys.argv) < 2:
|
||||||
|
print("usage: analyze_migration_log.py <run_dir>", file=sys.stderr)
|
||||||
|
sys.exit(1)
|
||||||
|
main(Path(sys.argv[1]))
|
||||||
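The chain and post-rotation tail computations at the end of `analyze_migration_log.py` can be tried on toy data. The following is a minimal sketch, assuming the same event fields as the analyzer (`session_id`, `t_decision_unix`, `v3_src_idx`, `v3_target_idx`) and a per-turn `t_dispatch_unix`. The helper names `migration_chains` and `tail_turns` and all sample values are hypothetical, not part of the commit.

```python
from collections import Counter, defaultdict


def migration_chains(mig):
    """Group migration events per session, ordered by decision time."""
    sess_mig = defaultdict(list)
    for e in mig:
        sess_mig[e["session_id"]].append(
            (e.get("t_decision_unix", 0), e["v3_src_idx"], e["v3_target_idx"])
        )
    return {s: sorted(ev) for s, ev in sess_mig.items()}


def tail_turns(chains, turns):
    """Count turns dispatched after each session's last migration, per final dst."""
    tail = Counter()
    for sess, events in chains.items():
        last_t, _, final_dst = events[-1]
        tail[final_dst] += sum(
            1 for t in turns
            if t["session_id"] == sess and t.get("t_dispatch_unix", 0) > last_t
        )
    return tail


# Hypothetical sample: session "a" migrates 0->3 and later 3->5.
mig = [
    {"session_id": "a", "t_decision_unix": 20.0, "v3_src_idx": 3, "v3_target_idx": 5},
    {"session_id": "a", "t_decision_unix": 10.0, "v3_src_idx": 0, "v3_target_idx": 3},
    {"session_id": "b", "t_decision_unix": 15.0, "v3_src_idx": 1, "v3_target_idx": 2},
]
turns = [
    {"session_id": "a", "t_dispatch_unix": 12.0},
    {"session_id": "a", "t_dispatch_unix": 25.0},
    {"session_id": "a", "t_dispatch_unix": 30.0},
    {"session_id": "b", "t_dispatch_unix": 14.0},
]

chains = migration_chains(mig)
tail = tail_turns(chains, turns)
chain_a = " -> ".join(f"{s}->{d}" for _, s, d in chains["a"])
```

Only turns dispatched after the last migration of a session count toward the tail, so the turn of session "a" at t=12.0 (between its two migrations) is not attributed to any instance.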
237
microbench/connector_tax/cache_sweep/analyze_transfer_decomp.py
Executable file
@@ -0,0 +1,237 @@
#!/usr/bin/env python3
"""Decompose migration KV-transfer time into RDMA-actual vs control-plane.

Joins three logs from an instrumented unified_v3 run:

  proxy breakdown.json                  — per-request route + phase timestamps
  dst_mig_log/dm_mig_pid*.jsonl         — dst lifecycle (instrument_dst_migration.py)
                                          gives T_kv_pull = wait_for_kvs -> recv_done
  mooncake xfer/mb2_transfer_pid*.jsonl — connector internals
                                          (instrument_mooncake.py):
      send_blocks            : pure RDMA (total_bytes, duration_s) [producer]
      receive_kv_enter/finish: consumer-observed transfer window [consumer]
      ready_wait             : producer wait for src KV commit [producer]
      send_kv_to_decode_enter: producer received the pull request [producer]

Decisive question: of the 87% dst-side overhead that is T_kv_pull, how
much is the actual RDMA write (`send_blocks`) vs control-plane
(handshake / ready-wait / GIL starvation on the busy src)?

  - send_blocks bandwidth ~ wire (10 GB/s) AND << T_kv_pull
      => loss is control-plane; layerwise (which only moves WHEN the
         RDMA fires) will NOT fix it.
  - send_blocks bandwidth << wire
      => the RDMA write itself is slow (NIC / src-side servicing);
         characterize with a load microbench next.

Usage:
  python analyze_transfer_decomp.py \
      --proxy-breakdown <RUN>/unified_v3/breakdown.json \
      --dst-log-dir <RUN>/dst_mig_log \
      --xfer-log-dir <RUN>/xfer_log
"""
from __future__ import annotations

import argparse
import json
import math
import re
import statistics
import sys
from pathlib import Path


def _core_req_id(rid: str) -> str:
    if not rid:
        return rid
    s = rid
    if s.startswith("cmpl-"):
        s = s[len("cmpl-"):]
    m = re.match(r"^(.*)-\d+-[0-9a-fA-F]+$", s)
    if m:
        s = m.group(1)
    return s


def _pct(vals, q):
    if not vals:
        return float("nan")
    vs = sorted(vals)
    i = max(0, min(len(vs) - 1, int(math.ceil(q * len(vs))) - 1))
    return vs[i]


def _stat_line(name, vals, unit="s"):
    if not vals:
        print(f"{name:<34} n=0")
        return
    print(f"{name:<34} n={len(vals):>3} mean={statistics.mean(vals):>8.3f} "
          f"p50={_pct(vals,0.5):>8.3f} p90={_pct(vals,0.9):>8.3f} "
          f"max={max(vals):>8.3f} sum={sum(vals):>8.2f} {unit}")


def load_events(xfer_dir: Path):
    files = sorted(xfer_dir.glob("mb2_transfer_pid*.jsonl"))
    print(f"[xfer] log files: {len(files)} under {xfer_dir}")
    send_blocks, recv_enter, recv_finish, ready_wait, send_enter = [], [], [], [], []
    for f in files:
        pid = f.stem.replace("mb2_transfer_pid", "")
        with f.open() as fh:
            for line in fh:
                try:
                    e = json.loads(line)
                except Exception:
                    continue
                e["_pid"] = pid
                ev = e.get("event")
                if ev == "send_blocks":
                    send_blocks.append(e)
                elif ev == "receive_kv_enter":
                    recv_enter.append(e)
                elif ev == "receive_kv_finish":
                    recv_finish.append(e)
                elif ev == "ready_wait":
                    ready_wait.append(e)
                elif ev == "send_kv_to_decode_enter":
                    send_enter.append(e)
    print(f"[xfer] events: send_blocks={len(send_blocks)} "
          f"recv_enter={len(recv_enter)} recv_finish={len(recv_finish)} "
          f"ready_wait={len(ready_wait)} send_enter={len(send_enter)}")
    return send_blocks, recv_enter, recv_finish, ready_wait, send_enter


def main():
    p = argparse.ArgumentParser()
    p.add_argument("--proxy-breakdown", type=Path, required=True)
    p.add_argument("--dst-log-dir", type=Path, required=True)
    p.add_argument("--xfer-log-dir", type=Path, required=True)
    args = p.parse_args()

    for pth in (args.proxy_breakdown, args.dst_log_dir, args.xfer_log_dir):
        if not pth.exists():
            sys.exit(f"missing: {pth}")

    # Close the breakdown file deterministically instead of leaking the handle.
    with args.proxy_breakdown.open() as fh:
        proxy = json.load(fh)
    migrations = [x for x in proxy if x.get("route_class") == "PD_SEP_V2"]
    mig_ids = {x.get("request_id") for x in migrations}
    print(f"[proxy] migrations: {len(migrations)} / {len(proxy)} total")

    # dst lifecycle: T_kv_pull per migration (core req id)
    dst_pull = {}
    for f in sorted(args.dst_log_dir.glob("dm_mig_pid*.jsonl")):
        with f.open() as fh:
            for line in fh:
                try:
                    r = json.loads(line)
                except Exception:
                    continue
                tw = r.get("t_wait_for_kvs_unix")
                td = r.get("t_kv_recv_done_unix")
                if tw and td:
                    dst_pull[_core_req_id(r.get("req_id"))] = td - tw

    sb, re_enter, re_finish, rw, se = load_events(args.xfer_log_dir)

    # ---- 1. Pure RDMA bandwidth from send_blocks (the decisive number) ----
    print("\n" + "=" * 90)
    print("1. PURE RDMA WRITE rate (`send_blocks` = batch_transfer_sync_write)")
    print("=" * 90)
    bws, durs, bytes_l = [], [], []
    for e in sb:
        b = e.get("total_bytes", 0)
        d = e.get("duration_s", 0)
        if d and d > 0 and b > 0:
            bws.append(b / 1e9 / d)
            durs.append(d)
            bytes_l.append(b)
    if bws:
        tot_b = sum(bytes_l)
        tot_d = sum(durs)
        print(f" send_blocks calls: {len(bws)}")
        print(f" total bytes moved : {tot_b/2**30:.2f} GiB")
        print(f" total RDMA time : {tot_d:.2f} s")
        print(f" AGGREGATE rate : {tot_b/1e9/tot_d:.2f} GB/s "
              f"(MB2 idle-src steady-state = ~9.7-10 GB/s)")
        _stat_line(" per-call rate (GB/s)", bws, unit="GB/s")
        _stat_line(" per-call duration", durs)
        # bandwidth vs size — small ops are latency-bound
        print("\n rate vs transfer size:")
        pairs = sorted(zip(bytes_l, bws))
        for b, w in pairs:
            bar = "#" * int(min(40, w * 4))
            print(f" {b/2**20:>8.1f} MiB {w:>6.2f} GB/s {bar}")
    else:
        print(" no send_blocks events with positive duration")

    # ---- 2. Producer ready-wait (src KV commit) ----
    print("\n" + "=" * 90)
    print("2. PRODUCER ready-wait (src KV not yet committed when pull arrived)")
    print("=" * 90)
    rw_vals = [e.get("ready_wait_s", 0) for e in rw if e.get("ready_wait_s") is not None]
    already = sum(1 for e in rw if e.get("ready_already_set"))
    _stat_line(" ready_wait", rw_vals)
    print(f" ready_already_set at entry: {already}/{len(rw)} "
          f"(if most are True, src commit is not the bottleneck)")

    # ---- 3. Consumer-observed receive_kv window ----
    print("\n" + "=" * 90)
    print("3. CONSUMER receive_kv window (enter->FINISH, ~most of T_kv_pull)")
    print("=" * 90)
    rf_vals = [e.get("duration_s", 0) for e in re_finish if e.get("duration_s")]
    _stat_line(" receive_kv duration", rf_vals)

    # ---- 4. Per-migration join: T_kv_pull vs receive_kv vs ready_wait ----
    print("\n" + "=" * 90)
    print("4. PER-MIGRATION join (T_kv_pull from dst vs connector internals)")
    print("=" * 90)
    # index connector events by core req id
    rf_by_req = {}
    for e in re_finish:
        for rid in e.get("req_ids", []):
            rf_by_req[_core_req_id(rid)] = e.get("duration_s")
    rw_by_req = {}
    for e in rw:
        rw_by_req[_core_req_id(e.get("d_req_id", ""))] = e.get("ready_wait_s")

    joined = 0
    sum_pull = sum_recv = sum_rw = 0.0
    rows = []
    for m in migrations:
        core = m.get("request_id")
        pull = dst_pull.get(core)
        recv = rf_by_req.get(core)
        rwv = rw_by_req.get(core)
        if pull is None and recv is None:
            continue
        joined += 1
        if pull:
            sum_pull += pull
        if recv:
            sum_recv += recv
        if rwv:
            sum_rw += rwv
        rows.append((core, m.get("input_length"), m.get("v3_target_cache_hit"),
                     pull, recv, rwv))
    print(f" joined migrations: {joined}")
    print(f" Σ T_kv_pull (dst) = {sum_pull:8.2f} s")
    print(f" Σ receive_kv (consumer) = {sum_recv:8.2f} s")
    print(f" Σ ready_wait (producer) = {sum_rw:8.2f} s")
    # The RDMA share: best-effort total send_blocks time
    sum_rdma = sum(durs) if durs else 0.0
    print(f" Σ send_blocks RDMA = {sum_rdma:8.2f} s (all transfers, "
          f"not just migrations)")
    if sum_pull > 0:
        print(f"\n RDMA-actual / T_kv_pull ≈ {sum_rdma/sum_pull*100:5.1f} %")
        print(f" ready-wait / T_kv_pull ≈ {sum_rw/sum_pull*100:5.1f} %")
        resid = sum_pull - sum_rdma - sum_rw
        print(f" control-plane residual ≈ {resid/sum_pull*100:5.1f} % "
              f"(handshake / ZMQ / GIL starvation)")

    print("\n per-migration detail:")
    print(f" {'req_id':<22} {'in_len':>7} {'dst_hit':>8} {'kv_pull':>8} "
          f"{'recv_kv':>8} {'rdy_wait':>8}")

    def s(v):
        return f"{v:.2f}" if v is not None else " --"

    for core, il, hit, pull, recv, rwv in sorted(
            rows, key=lambda r: -(r[3] or 0)):
        # str(core) keeps the row printable even if request_id is missing.
        print(f" {str(core):<22} {str(il):>7} {str(hit):>8} {s(pull):>8} "
              f"{s(recv):>8} {s(rwv):>8}")


if __name__ == "__main__":
    main()
|
||||||
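The two headline numbers printed by `analyze_transfer_decomp.py` are a bytes-weighted aggregate RDMA rate (section 1) and the split of `T_kv_pull` into RDMA, ready-wait and a control-plane residual (section 4). The sketch below reproduces that arithmetic on made-up inputs, assuming the `send_blocks` event fields `total_bytes` and `duration_s` described in the docstring. The function names `aggregate_rate_gbps` and `decompose` are hypothetical, and the sample values are illustrative rather than measurements.

```python
def aggregate_rate_gbps(send_blocks):
    """Bytes-weighted rate: total bytes / total time, in GB/s (1e9 bytes)."""
    valid = [
        (e["total_bytes"], e["duration_s"])
        for e in send_blocks
        if (e.get("duration_s") or 0) > 0 and (e.get("total_bytes") or 0) > 0
    ]
    tot_b = sum(b for b, _ in valid)
    tot_d = sum(d for _, d in valid)
    return tot_b / 1e9 / tot_d if tot_d > 0 else float("nan")


def decompose(sum_pull, sum_rdma, sum_rw):
    """Split T_kv_pull into RDMA, ready-wait and control-plane residual shares."""
    if sum_pull <= 0:
        raise ValueError("sum_pull must be positive")
    resid = sum_pull - sum_rdma - sum_rw
    return {
        "rdma": sum_rdma / sum_pull,
        "ready_wait": sum_rw / sum_pull,
        "control_plane_residual": resid / sum_pull,
    }


# Illustrative events: one large fast transfer, one small slow one, and one
# zero-duration record that is skipped.
events = [
    {"total_bytes": 6_000_000_000, "duration_s": 1.0},   # 6.0 GB/s
    {"total_bytes": 1_000_000_000, "duration_s": 0.5},   # 2.0 GB/s
    {"total_bytes": 5, "duration_s": 0},
]
agg = aggregate_rate_gbps(events)
per_call_mean = (6.0 + 2.0) / 2
shares = decompose(sum_pull=8.0, sum_rdma=4.0, sum_rw=1.0)
```

The aggregate rate weights each call by its duration, so it differs from the mean of per-call rates whenever transfer sizes vary; the analyzer reports both for that reason. Note also that the analyzer's RDMA total covers all transfers while `T_kv_pull` covers only joined migrations, so the RDMA share it prints is an upper-bound style estimate.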
90
microbench/connector_tax/cache_sweep/run_v3_dst_breakdown.sh
Executable file
@@ -0,0 +1,90 @@
#!/usr/bin/env bash
# v3 trace replay with dst-side migration breakdown instrumentation.
#
# Same trace + DR_FIX as `run_v3_replay.sh`, plus:
#   - instrument_dst_migration.py applied to vLLM scheduler
#   - DM_LOG_DIR exported to all 8 vLLM instances so per-PID
#     dst-migration logs land in <RUNDIR>/dst_mig_log/
#   - analyze_dst_migration.py runs on completion to print the
#     T_kv_pull vs queue-residual decomposition
#
# Usage: bash run_v3_dst_breakdown.sh

set -uo pipefail

PROJ_DIR="${PROJ_DIR:-/home/admin/cpfs/wjh/agentic-kv}"
TRACE="${TRACE:-$PROJ_DIR/traces/w600_r0.0015_st30.jsonl}"
DATE="$(date +%Y%m%d_%H%M)"
OUTROOT="${OUTROOT:-$PROJ_DIR/outputs/b3_v3_dstbreak_${DATE}}"
PYTHON="$PROJ_DIR/.venv/bin/python"
VLLM_ROOT="${VLLM_ROOT:-$PROJ_DIR/.venv/lib/python3.12/site-packages/vllm}"
DR_FIX_SCRIPT="$PROJ_DIR/microbench/connector_tax/cache_sweep/apply_direct_read_fix.py"
DM_INSTRUMENT="$PROJ_DIR/microbench/fresh_setup/instrument_dst_migration.py"
ANALYZE="$PROJ_DIR/microbench/connector_tax/cache_sweep/analyze_dst_migration.py"

mkdir -p "$OUTROOT"
DST_LOG_DIR="$OUTROOT/dst_mig_log"
mkdir -p "$DST_LOG_DIR"

echo "=== unified_v3 + dst-side migration breakdown ==="
echo "Trace : $TRACE"
echo "Out : $OUTROOT"
echo "DST logs : $DST_LOG_DIR"
echo ""

cleanup_all() {
    pkill -9 -f cache_aware_proxy 2>/dev/null || true
    pkill -9 -f "vllm serve" 2>/dev/null || true
    pkill -9 -f "EngineCore" 2>/dev/null || true
    sleep 5
    "$PYTHON" "$DR_FIX_SCRIPT" --revert --vllm-root "$VLLM_ROOT" 2>/dev/null || true
    "$PYTHON" "$DM_INSTRUMENT" --revert --venv "$PROJ_DIR/.venv" 2>/dev/null || true
}
trap cleanup_all EXIT
cleanup_all

echo "[stage 0a] applying CT_DR_FIX (env-gated)"
"$PYTHON" "$DR_FIX_SCRIPT" --apply --vllm-root "$VLLM_ROOT"

echo "[stage 0b] applying DST migration instrumentation"
"$PYTHON" "$DM_INSTRUMENT" --apply --venv "$PROJ_DIR/.venv"
"$PYTHON" "$DM_INSTRUMENT" --check --venv "$PROJ_DIR/.venv"

cfg_dir="$OUTROOT/unified_v3"
mkdir -p "$cfg_dir"

# Activate DR-fix env gate (consistent with run_v3_replay.sh)
export VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC=1
# Export DM_LOG_DIR — every vLLM EngineCore inherits this env and writes
# its own dm_mig_pid<pid>.jsonl into it.
export DM_LOG_DIR="$DST_LOG_DIR"

echo ""
echo "====== unified_v3 ; DR_SYNC_DISABLED=1 ; DM_LOG_DIR=$DST_LOG_DIR ======"
bash "$PROJ_DIR/scripts/b3_isolated_policy.sh" "unified_v3" "$TRACE" "$cfg_dir" \
    2>&1 | tee "$cfg_dir/orchestrator.log" | tail -30

pkill -9 -f cache_aware_proxy 2>/dev/null || true
pkill -9 -f "vllm serve" 2>/dev/null || true
pkill -9 -f "EngineCore" 2>/dev/null || true
sleep 5

echo ""
echo "[stage Z] reverting DR_FIX + DM instrument"
"$PYTHON" "$DR_FIX_SCRIPT" --revert --vllm-root "$VLLM_ROOT"
"$PYTHON" "$DM_INSTRUMENT" --revert --venv "$PROJ_DIR/.venv"

echo ""
echo "[stage analyze] dst-side migration breakdown"
"$PYTHON" "$ANALYZE" \
    --proxy-breakdown "$cfg_dir/breakdown.json" \
    --dst-log-dir "$DST_LOG_DIR" \
    --output "$cfg_dir/dst_migration_breakdown.csv" \
    --plot "$cfg_dir/dst_migration_breakdown.png" \
    2>&1 | tee "$cfg_dir/dst_migration_breakdown.txt"

echo ""
echo "Done."
echo " proxy breakdown : $cfg_dir/breakdown.json"
echo " dst per-PID log : $DST_LOG_DIR/"
echo " decomposition : $cfg_dir/dst_migration_breakdown.{csv,png,txt}"
|
||||||
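The per-PID files that this script collects under `DM_LOG_DIR` are what yield `T_kv_pull` (the interval from `t_wait_for_kvs_unix` to `t_kv_recv_done_unix`). The sketch below shows that extraction on in-memory JSONL lines, reusing the request-id normalization regex from `analyze_transfer_decomp.py`. The function names `core_req_id` and `kv_pull_by_request` and the sample request ids are assumptions for illustration; the real engine ids may look different.

```python
import json
import re


def core_req_id(rid):
    """Strip an optional 'cmpl-' prefix and a trailing '-<int>-<hex>' suffix."""
    if not rid:
        return rid
    s = rid[len("cmpl-"):] if rid.startswith("cmpl-") else rid
    m = re.match(r"^(.*)-\d+-[0-9a-fA-F]+$", s)
    return m.group(1) if m else s


def kv_pull_by_request(jsonl_lines):
    """Map core request id -> T_kv_pull seconds for records with both timestamps."""
    out = {}
    for line in jsonl_lines:
        try:
            r = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partially written or corrupt lines
        tw = r.get("t_wait_for_kvs_unix")
        td = r.get("t_kv_recv_done_unix")
        if tw is not None and td is not None:
            out[core_req_id(r.get("req_id"))] = td - tw
    return out


# Hypothetical records: one complete, one still waiting for KV, one corrupt.
lines = [
    '{"req_id": "cmpl-req42-0-1a2b3c4d", "t_wait_for_kvs_unix": 100.0, '
    '"t_kv_recv_done_unix": 101.5}',
    '{"req_id": "cmpl-req43-0-deadbeef", "t_wait_for_kvs_unix": 200.0}',
    'not json',
]
pulls = kv_pull_by_request(lines)
```

Records that never reach `t_kv_recv_done_unix` (for example aborted requests) are dropped here rather than counted as zero, which matches how the analyzer only stores a pull time when both timestamps are present.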
102
microbench/connector_tax/cache_sweep/run_v3_full_breakdown.sh
Executable file
@@ -0,0 +1,102 @@
#!/usr/bin/env bash
# v3 trace replay with FULL migration instrumentation:
#   - instrument_dst_migration.py : dst lifecycle -> T_kv_pull
#   - instrument_mooncake.py      : connector internals (send_blocks RDMA,
#                                   receive_kv window, ready_wait)
# Goal: decompose the 87% T_kv_pull into RDMA-actual vs control-plane to
# explain why effective bandwidth is far below the ~10 GB/s wire rate.
#
# Usage: bash run_v3_full_breakdown.sh

set -uo pipefail

PROJ_DIR="${PROJ_DIR:-/home/admin/cpfs/wjh/agentic-kv}"
TRACE="${TRACE:-$PROJ_DIR/traces/w600_r0.0015_st30.jsonl}"
DATE="$(date +%Y%m%d_%H%M)"
OUTROOT="${OUTROOT:-$PROJ_DIR/outputs/b3_v3_fullbreak_${DATE}}"
PYTHON="$PROJ_DIR/.venv/bin/python"
VENV="$PROJ_DIR/.venv"
VLLM_ROOT="${VLLM_ROOT:-$VENV/lib/python3.12/site-packages/vllm}"
DR_FIX="$PROJ_DIR/microbench/connector_tax/cache_sweep/apply_direct_read_fix.py"
DM_INSTR="$PROJ_DIR/microbench/fresh_setup/instrument_dst_migration.py"
MC_INSTR="$PROJ_DIR/microbench/fresh_setup/instrument_mooncake.py"
ANALYZE_DST="$PROJ_DIR/microbench/connector_tax/cache_sweep/analyze_dst_migration.py"
ANALYZE_XFER="$PROJ_DIR/microbench/connector_tax/cache_sweep/analyze_transfer_decomp.py"

mkdir -p "$OUTROOT"
DST_LOG_DIR="$OUTROOT/dst_mig_log"
XFER_LOG_DIR="$OUTROOT/xfer_log"
mkdir -p "$DST_LOG_DIR" "$XFER_LOG_DIR"

echo "=== unified_v3 + FULL migration breakdown ==="
echo "Out : $OUTROOT"
echo "DST logs : $DST_LOG_DIR"
echo "XFER logs: $XFER_LOG_DIR"
echo ""

cleanup_all() {
    pkill -9 -f cache_aware_proxy 2>/dev/null || true
    pkill -9 -f "vllm serve" 2>/dev/null || true
    pkill -9 -f "EngineCore" 2>/dev/null || true
    sleep 5
    "$PYTHON" "$DR_FIX" --revert --vllm-root "$VLLM_ROOT" 2>/dev/null || true
    "$PYTHON" "$DM_INSTR" --revert --venv "$VENV" 2>/dev/null || true
    "$PYTHON" "$MC_INSTR" --revert --venv "$VLLM_ROOT/distributed/kv_transfer/kv_connector/v1/mooncake/mooncake_connector.py" 2>/dev/null || true
}
trap cleanup_all EXIT
cleanup_all

echo "[0a] DR_FIX"
"$PYTHON" "$DR_FIX" --apply --vllm-root "$VLLM_ROOT"
echo "[0b] DST migration instrument"
"$PYTHON" "$DM_INSTR" --apply --venv "$VENV"
echo "[0c] Mooncake transfer instrument"
"$PYTHON" "$MC_INSTR" --apply --venv "$VLLM_ROOT/distributed/kv_transfer/kv_connector/v1/mooncake/mooncake_connector.py"
"$PYTHON" "$DM_INSTR" --check --venv "$VENV"
"$PYTHON" "$MC_INSTR" --check --venv "$VLLM_ROOT/distributed/kv_transfer/kv_connector/v1/mooncake/mooncake_connector.py"

cfg_dir="$OUTROOT/unified_v3"
mkdir -p "$cfg_dir"

export VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC=1
export DM_LOG_DIR="$DST_LOG_DIR"
export MB2_LOG_DIR="$XFER_LOG_DIR"

echo ""
echo "====== unified_v3 ; DM_LOG_DIR + MB2_LOG_DIR set ======"
bash "$PROJ_DIR/scripts/b3_isolated_policy.sh" "unified_v3" "$TRACE" "$cfg_dir" \
    2>&1 | tee "$cfg_dir/orchestrator.log" | tail -25

pkill -9 -f cache_aware_proxy 2>/dev/null || true
pkill -9 -f "vllm serve" 2>/dev/null || true
pkill -9 -f "EngineCore" 2>/dev/null || true
sleep 5

echo ""
echo "[Z] revert all instruments"
"$PYTHON" "$DR_FIX" --revert --vllm-root "$VLLM_ROOT"
"$PYTHON" "$DM_INSTR" --revert --venv "$VENV"
"$PYTHON" "$MC_INSTR" --revert --venv "$VLLM_ROOT/distributed/kv_transfer/kv_connector/v1/mooncake/mooncake_connector.py"

echo ""
echo "[analyze 1] dst-side T_kv_pull breakdown"
"$PYTHON" "$ANALYZE_DST" \
    --proxy-breakdown "$cfg_dir/breakdown.json" \
    --dst-log-dir "$DST_LOG_DIR" \
    --output "$cfg_dir/dst_migration_breakdown.csv" \
    --plot "$cfg_dir/dst_migration_breakdown.png" \
    2>&1 | tee "$cfg_dir/dst_migration_breakdown.txt" || echo "(dst analyze failed)"

echo ""
echo "[analyze 2] transfer decomposition: RDMA-actual vs control-plane"
"$PYTHON" "$ANALYZE_XFER" \
    --proxy-breakdown "$cfg_dir/breakdown.json" \
    --dst-log-dir "$DST_LOG_DIR" \
    --xfer-log-dir "$XFER_LOG_DIR" \
    2>&1 | tee "$cfg_dir/transfer_decomp.txt" || echo "(xfer analyze failed)"

echo ""
echo "Done. Artifacts in $cfg_dir/"
echo " dst_migration_breakdown.{csv,png,txt}"
echo " transfer_decomp.txt"
echo " raw: $DST_LOG_DIR/ $XFER_LOG_DIR/"
|
||||||
198
microbench/fresh_setup/analyze_goodput.py
Normal file
@@ -0,0 +1,198 @@
"""SLO-goodput analyzer + PD_advantage for the PD-disagg crossover study.

Reads per-arm replayer output (replay_metrics.jsonl) and computes, per arm:
  - completion rate (error-free fraction)
  - raw TTFT / TPOT / E2E percentiles (over successes — reported for context
    only; NEVER the verdict metric, since failing arms have a small success set)
  - SLO-goodput: fraction of OFFERED requests that are error-free AND meet a
    (TTFT, TPOT) SLO. This is the verdict metric.

The two arms must replay the IDENTICAL trace (same seed), so they are paired
request-for-request. PD_advantage = goodput(arm) / goodput(baseline); y=1 is
the crossover line — PD_advantage >= 1 means PD-disagg wins.

Goodput is computed over a grid of SLO thresholds so the conclusion does not
hinge on one arbitrary cutoff.

Usage:
  python analyze_goodput.py \
      --arm 8C-proxy .../8C-proxy/replay_metrics.jsonl \
      --arm 4P+4D .../4P+4D/replay_metrics.jsonl \
      --baseline 8C-proxy \
      --ttft-slo 0.5 1 2 5 --tpot-slo 0.05 0.1 0.2
"""

from __future__ import annotations

import argparse
import json
import statistics
from pathlib import Path


def load_metrics(path: Path) -> list[dict]:
    rows = []
    with path.open("r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


def percentile(sorted_vals: list[float], pct: float) -> float:
    n = len(sorted_vals)
    if n == 0:
        return float("nan")
    if n == 1:
        return sorted_vals[0]
    rank = pct * (n - 1)
    lo = int(rank)
    hi = min(lo + 1, n - 1)
    frac = rank - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def pstats(vals: list[float]) -> dict:
    clean = sorted(v for v in vals if v is not None)
    if not clean:
        return {"count": 0}
    return {
        "count": len(clean),
        "mean": statistics.fmean(clean),
        "p50": percentile(clean, 0.50),
        "p90": percentile(clean, 0.90),
        "p99": percentile(clean, 0.99),
    }


def offered_window_s(rows: list[dict]) -> float:
    ts = [r.get("trace_timestamp_s") for r in rows if r.get("trace_timestamp_s") is not None]
    if len(ts) < 2:
        return 0.0
    return max(ts) - min(ts)


def meets_slo(r: dict, ttft_slo: float, tpot_slo: float) -> bool:
    if r.get("error") is not None:
        return False
    ttft = r.get("ttft_s")
    tpot = r.get("tpot_s")
    if ttft is None:
        return False
    if ttft > ttft_slo:
        return False
    # tpot=0 happens only for single-token outputs; treat as meeting any SLO.
    if tpot is not None and tpot > tpot_slo:
        return False
    return True


def load_summary(jsonl_path: Path) -> dict:
    """Read the sibling replay_metrics.summary.json (wall-clock, amplification)."""
    sp = jsonl_path.with_suffix(".summary.json")
    if sp.exists():
        try:
            return json.loads(sp.read_text())
        except Exception:
            return {}
    return {}


def summarize_arm(name: str, jsonl_path: Path, rows: list[dict]) -> dict:
    n = len(rows)
    ok = [r for r in rows if r.get("error") is None]
    window = offered_window_s(rows)
    summ = load_summary(jsonl_path)
    return {
        "name": name,
        "n_offered": n,
        "n_success": len(ok),
        "completion_rate": len(ok) / n if n else 0.0,
        "offered_window_s": window,
        "offered_qps": n / window if window > 0 else 0.0,
        # Throughput: how much longer than the offered window it took to drain.
        # ~1.0 = keeps up; >1 = falling behind (the cleanest PD-collapse signal).
        "wall_clock_s": summ.get("wall_clock_s"),
        "amplification": summ.get("amplification"),
        "ttft": pstats([r.get("ttft_s") for r in ok]),
        "tpot": pstats([r.get("tpot_s") for r in ok]),
        "e2e": pstats([r.get("latency_s") for r in ok]),
        "_rows": rows,
    }


def main() -> None:
    p = argparse.ArgumentParser(description=__doc__,
                                formatter_class=argparse.RawDescriptionHelpFormatter)
    p.add_argument("--arm", nargs=2, action="append", metavar=("NAME", "PATH"),
                   required=True, help="arm name + replay_metrics.jsonl path (repeatable)")
    p.add_argument("--baseline", required=True, help="arm name to use as PD_advantage denominator")
    p.add_argument("--ttft-slo", nargs="+", type=float, default=[0.5, 1.0, 2.0, 5.0])
    p.add_argument("--tpot-slo", nargs="+", type=float, default=[0.05, 0.1, 0.2])
    p.add_argument("--out-json", type=Path, default=None)
    args = p.parse_args()

    arms = {}
    for name, path in args.arm:
        arms[name] = summarize_arm(name, Path(path), load_metrics(Path(path)))

    if args.baseline not in arms:
        raise SystemExit(f"baseline {args.baseline!r} not among arms {list(arms)}")

    # ---- per-arm overview ------------------------------------------------
    print("=" * 78)
    print("PER-ARM OVERVIEW (latency stats over successes only — context, not verdict)")
    print("=" * 78)
    hdr = f"{'arm':<12}{'offered':>8}{'compl%':>8}{'ampl':>6}{'oQPS':>7}" \
          f"{'TTFTp50':>9}{'TTFTp90':>9}{'TPOTp50':>9}{'TPOTp99':>9}{'E2Ep90':>9}"
    print(hdr)
    for name, a in arms.items():
        t, tp, e = a["ttft"], a["tpot"], a["e2e"]
        ampl = a.get("amplification")
        ampl_s = f"{ampl:>6.2f}" if isinstance(ampl, (int, float)) else f"{'--':>6}"
        print(f"{name:<12}{a['n_offered']:>8}{100*a['completion_rate']:>7.1f}%"
              f"{ampl_s}{a['offered_qps']:>7.2f}"
              f"{t.get('p50', float('nan')):>9.2f}{t.get('p90', float('nan')):>9.2f}"
              f"{1000*tp.get('p50', float('nan')):>8.0f}m{1000*tp.get('p99', float('nan')):>8.0f}m"
              f"{e.get('p90', float('nan')):>9.2f}")

    # ---- SLO-goodput grid + PD_advantage --------------------------------
    base = arms[args.baseline]
    grid = []
    print()
    print("=" * 78)
    print(f"SLO-GOODPUT (attainment = error-free AND TTFT<=slo AND TPOT<=slo)")
    print(f"PD_advantage = attainment(arm) / attainment(baseline={args.baseline}); "
          f">=1 means arm wins")
    print("=" * 78)
    for ttft_slo in args.ttft_slo:
        for tpot_slo in args.tpot_slo:
            row = {"ttft_slo_s": ttft_slo, "tpot_slo_s": tpot_slo, "arms": {}}
            base_n = sum(1 for r in base["_rows"] if meets_slo(r, ttft_slo, tpot_slo))
            base_att = base_n / base["n_offered"] if base["n_offered"] else 0.0
            line = f"TTFT<={ttft_slo:>4}s TPOT<={int(1000*tpot_slo):>4}ms | "
            cells = []
            for name, a in arms.items():
                n_slo = sum(1 for r in a["_rows"] if meets_slo(r, ttft_slo, tpot_slo))
                att = n_slo / a["n_offered"] if a["n_offered"] else 0.0
                adv = (att / base_att) if base_att > 0 else float("nan")
                row["arms"][name] = {"attainment": att, "pd_advantage": adv, "n_slo": n_slo}
                tag = "" if name == args.baseline else f" adv={adv:.2f}"
                cells.append(f"{name}={100*att:>5.1f}%{tag}")
            print(line + " ".join(cells))
            grid.append(row)

    if args.out_json:
        out = {
            "baseline": args.baseline,
            "arms": {n: {k: v for k, v in a.items() if k != "_rows"}
                     for n, a in arms.items()},
            "slo_grid": grid,
        }
        args.out_json.write_text(json.dumps(out, indent=2))
        print(f"\nwrote {args.out_json}")


if __name__ == "__main__":
    main()
|
||||||
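The verdict metric in `analyze_goodput.py` is attainment over offered requests, and `PD_advantage` is the ratio of two attainments. The following minimal sketch works one grid cell by hand, assuming the same row fields (`error`, `ttft_s`, `tpot_s`). The `attainment` helper and both sample arms are hypothetical toy data, not results from the study.

```python
def meets_slo(r, ttft_slo, tpot_slo):
    """True if the request is error-free and within both SLO thresholds."""
    if r.get("error") is not None:
        return False
    ttft = r.get("ttft_s")
    if ttft is None or ttft > ttft_slo:
        return False
    tpot = r.get("tpot_s")
    # A missing tpot is treated as passing, as in the analyzer.
    return tpot is None or tpot <= tpot_slo


def attainment(rows, ttft_slo, tpot_slo):
    """Fraction of OFFERED requests (failures included) that meet the SLO."""
    if not rows:
        return 0.0
    return sum(meets_slo(r, ttft_slo, tpot_slo) for r in rows) / len(rows)


# Toy arms replaying the same four requests.
baseline = [
    {"ttft_s": 0.4, "tpot_s": 0.04, "error": None},
    {"ttft_s": 0.8, "tpot_s": 0.04, "error": None},
    {"ttft_s": 3.0, "tpot_s": 0.04, "error": None},       # misses TTFT
    {"ttft_s": None, "tpot_s": None, "error": "timeout"},  # failed request
]
arm = [
    {"ttft_s": 0.3, "tpot_s": 0.03, "error": None},
    {"ttft_s": 0.6, "tpot_s": 0.03, "error": None},
    {"ttft_s": 0.9, "tpot_s": 0.03, "error": None},
    {"ttft_s": 0.9, "tpot_s": 0.30, "error": None},        # misses TPOT
]

base_att = attainment(baseline, ttft_slo=1.0, tpot_slo=0.05)
arm_att = attainment(arm, ttft_slo=1.0, tpot_slo=0.05)
pd_advantage = arm_att / base_att if base_att > 0 else float("nan")
```

Dividing by the number of offered requests rather than successes is what keeps a failing arm from looking good: the timed-out request in the baseline lowers its attainment even though it contributes no latency sample.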
360
microbench/fresh_setup/instrument_dst_migration.py
Executable file
@@ -0,0 +1,360 @@
#!/usr/bin/env python3
"""Instrument vLLM V1 scheduler to dump per-request DST-side migration timeline.

For each request that arrives at the engine with `kv_transfer_params`
containing `do_remote_prefill=True` (i.e., the decode-target of an
EAR v3 migration), record:

  t_arrival_unix         — Scheduler.add_request() entry
  t_wait_for_kvs_unix    — status set to WAITING_FOR_REMOTE_KVS (KV pull start)
  t_kv_recv_done_unix    — req_id added to finished_recving_kv_req_ids
  t_first_scheduled_unix — first time req appears in self.running after KV done
  t_first_token_unix     — first new_token_ids appended in update_from_output
  arrival_state          — {n_running, n_waiting, pending_prefill_tok,
                            n_waiting_for_kvs}

We complement the proxy `breakdown.json` (t_decode_sent_unix /
t_first_token_unix) to attribute the migration's dst-side wait into:
  HTTP relay + admission_pre_kv + KV pull + admission_post_kv + first_iter

One JSONL per EngineCore PID at $DM_LOG_DIR/dm_mig_pid<pid>.jsonl
(default DM_LOG_DIR=/tmp). Records are flushed when t_first_token is
reached or when the request is aborted/finished.

Co-exists with MB5 KV snapshot patches (different START/END markers).

Usage:
  python instrument_dst_migration.py --apply [--venv PATH]
  python instrument_dst_migration.py --revert [--venv PATH]
  python instrument_dst_migration.py --check [--venv PATH]
"""
from __future__ import annotations

import argparse
import re
from pathlib import Path

DEFAULT_VENV = Path("/home/admin/cpfs/wjh/agentic-kv/.venv")
TARGET_REL = "lib/python3.12/site-packages/vllm/v1/core/sched/scheduler.py"

START_MARK = "# DM_INSTRUMENT_START"
END_MARK = "# DM_INSTRUMENT_END"

# ---------- Patch 1: module-level header (helpers + globals) -----------------
# Anchor: the very first `class Scheduler(SchedulerInterface):` line. We insert
# the entire helper block immediately before that, so MB5's prior block (if
# present) is preserved and our block lives in module scope. The anchor must
# stay outside our own START/END markers so revert() can re-find it.
HEADER_ANCHOR = "class Scheduler(SchedulerInterface):"

HEADER_INSERT = f"""{START_MARK}
import json as _dm_json
import os as _dm_os
import threading as _dm_threading
import time as _dm_time
_DM_LOG_DIR = _dm_os.environ.get("DM_LOG_DIR", "/tmp")
try:
    _dm_os.makedirs(_DM_LOG_DIR, exist_ok=True)
except Exception:
    pass
_DM_LOG_PATH = _dm_os.path.join(
    _DM_LOG_DIR, f"dm_mig_pid{{_dm_os.getpid()}}.jsonl"
)
_DM_LOG_FILE = None
_DM_LOG_LOCK = _dm_threading.Lock()
# req_id -> in-flight record. We pop and flush when t_first_token lands or on
# finish/abort.
_DM_DATA: dict = {{}}


def _dm_write_event(d: dict) -> None:
    global _DM_LOG_FILE
    # Open lazily under the lock so two concurrent first writers cannot both
    # create (and leak) a file handle.
    with _DM_LOG_LOCK:
        if _DM_LOG_FILE is None:
            _DM_LOG_FILE = open(_DM_LOG_PATH, "a", buffering=1)
        _DM_LOG_FILE.write(_dm_json.dumps(d) + "\\n")
|
||||||
|
|
||||||
|
|
||||||
|
def _dm_is_migrated(request) -> bool:
|
||||||
|
ktp = getattr(request, "kv_transfer_params", None)
|
||||||
|
if not isinstance(ktp, dict):
|
||||||
|
return False
|
||||||
|
return bool(ktp.get("do_remote_prefill"))
|
||||||
|
|
||||||
|
|
||||||
|
def _dm_snapshot_arrival(scheduler) -> dict:
|
||||||
|
try:
|
||||||
|
n_running = len(scheduler.running)
|
||||||
|
except Exception:
|
||||||
|
n_running = -1
|
||||||
|
try:
|
||||||
|
n_waiting_main = len(scheduler.waiting)
|
||||||
|
except Exception:
|
||||||
|
n_waiting_main = -1
|
||||||
|
try:
|
||||||
|
n_skipped = len(scheduler.skipped_waiting)
|
||||||
|
except Exception:
|
||||||
|
n_skipped = 0
|
||||||
|
pending_tok = 0
|
||||||
|
n_kv = 0
|
||||||
|
try:
|
||||||
|
from vllm.v1.request import RequestStatus as _RS
|
||||||
|
for r in list(scheduler.waiting):
|
||||||
|
try:
|
||||||
|
if getattr(r, "status", None) == _RS.WAITING_FOR_REMOTE_KVS:
|
||||||
|
n_kv += 1
|
||||||
|
npr = int(getattr(r, "num_prompt_tokens", 0))
|
||||||
|
nct = int(getattr(r, "num_computed_tokens", 0))
|
||||||
|
pending_tok += max(0, npr - nct)
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
for r in list(scheduler.skipped_waiting):
|
||||||
|
try:
|
||||||
|
if getattr(r, "status", None) == _RS.WAITING_FOR_REMOTE_KVS:
|
||||||
|
n_kv += 1
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
return {{
|
||||||
|
"n_running": int(n_running),
|
||||||
|
"n_waiting": int(n_waiting_main + n_skipped),
|
||||||
|
"pending_prefill_tok": int(pending_tok),
|
||||||
|
"n_waiting_for_kvs": int(n_kv),
|
||||||
|
}}
|
||||||
|
|
||||||
|
|
||||||
|
def _dm_emit_and_drop(req_id: str, reason: str = "first_token") -> None:
|
||||||
|
rec = _DM_DATA.pop(req_id, None)
|
||||||
|
if rec is None:
|
||||||
|
return
|
||||||
|
rec["flush_reason"] = reason
|
||||||
|
rec["t_flush_unix"] = _dm_time.time()
|
||||||
|
_dm_write_event(rec)
|
||||||
|
|
||||||
|
|
||||||
|
{END_MARK}
|
||||||
|
|
||||||
|
|
||||||
|
"""
|
||||||
|
|
||||||
|
# ---------- Patch 2: add_request() hook --------------------------------------
|
||||||
|
# Right after self.requests[request.request_id] = request (line ~1927) and the
|
||||||
|
# if self.log_stats: block. Anchor includes the QUEUED record_event line so it
|
||||||
|
# is uniquely matchable.
|
||||||
|
ADD_REQUEST_ANCHOR = """ self._enqueue_waiting_request(request)
|
||||||
|
self.requests[request.request_id] = request
|
||||||
|
if self.log_stats:
|
||||||
|
request.record_event(EngineCoreEventType.QUEUED)
|
||||||
|
"""
|
||||||
|
|
||||||
|
ADD_REQUEST_REPLACE = f""" self._enqueue_waiting_request(request)
|
||||||
|
self.requests[request.request_id] = request
|
||||||
|
if self.log_stats:
|
||||||
|
request.record_event(EngineCoreEventType.QUEUED)
|
||||||
|
{START_MARK}
|
||||||
|
try:
|
||||||
|
if _dm_is_migrated(request):
|
||||||
|
_DM_DATA[request.request_id] = {{
|
||||||
|
"req_id": str(request.request_id),
|
||||||
|
"is_migrated": True,
|
||||||
|
"n_prompt_tokens": int(getattr(request, "num_prompt_tokens", 0)),
|
||||||
|
"t_arrival_unix": _dm_time.time(),
|
||||||
|
"t_wait_for_kvs_unix": None,
|
||||||
|
"t_kv_recv_done_unix": None,
|
||||||
|
"t_first_scheduled_unix": None,
|
||||||
|
"t_first_token_unix": None,
|
||||||
|
"arrival_state": _dm_snapshot_arrival(self),
|
||||||
|
}}
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
{END_MARK}
|
||||||
|
"""
|
||||||
|
|
||||||
|
# ---------- Patch 3: WAITING_FOR_REMOTE_KVS transition -----------------------
|
||||||
|
WAIT_KV_ANCHOR = """ request.status = RequestStatus.WAITING_FOR_REMOTE_KVS
|
||||||
|
step_skipped_waiting.prepend_request(request)
|
||||||
|
"""
|
||||||
|
|
||||||
|
WAIT_KV_REPLACE = f""" request.status = RequestStatus.WAITING_FOR_REMOTE_KVS
|
||||||
|
{START_MARK}
|
||||||
|
try:
|
||||||
|
_rec = _DM_DATA.get(request.request_id)
|
||||||
|
if _rec is not None and _rec["t_wait_for_kvs_unix"] is None:
|
||||||
|
_rec["t_wait_for_kvs_unix"] = _dm_time.time()
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
{END_MARK}
|
||||||
|
step_skipped_waiting.prepend_request(request)
|
||||||
|
"""
|
||||||
|
|
||||||
|
# ---------- Patch 4: finished_recving signal ---------------------------------
|
||||||
|
FINISHED_RECV_ANCHOR = """ if req.status == RequestStatus.WAITING_FOR_REMOTE_KVS:
|
||||||
|
self.finished_recving_kv_req_ids.add(req_id)
|
||||||
|
elif RequestStatus.is_finished(req.status):
|
||||||
|
"""
|
||||||
|
|
||||||
|
FINISHED_RECV_REPLACE = f""" if req.status == RequestStatus.WAITING_FOR_REMOTE_KVS:
|
||||||
|
self.finished_recving_kv_req_ids.add(req_id)
|
||||||
|
{START_MARK}
|
||||||
|
try:
|
||||||
|
_rec = _DM_DATA.get(req_id)
|
||||||
|
if _rec is not None and _rec["t_kv_recv_done_unix"] is None:
|
||||||
|
_rec["t_kv_recv_done_unix"] = _dm_time.time()
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
{END_MARK}
|
||||||
|
elif RequestStatus.is_finished(req.status):
|
||||||
|
"""
|
||||||
|
|
||||||
|
# ---------- Patch 5: first scheduled (sweep at end of schedule()) ------------
|
||||||
|
# Co-exists with the MB5 snapshot inserted at the same location.
|
||||||
|
SCHED_END_ANCHOR = """ # MB5_INSTRUMENT_START
|
||||||
|
_mb5_snapshot(self)
|
||||||
|
# MB5_INSTRUMENT_END
|
||||||
|
return scheduler_output
|
||||||
|
"""
|
||||||
|
|
||||||
|
SCHED_END_REPLACE = f""" # MB5_INSTRUMENT_START
|
||||||
|
_mb5_snapshot(self)
|
||||||
|
# MB5_INSTRUMENT_END
|
||||||
|
{START_MARK}
|
||||||
|
try:
|
||||||
|
if _DM_DATA:
|
||||||
|
_now_dm = _dm_time.time()
|
||||||
|
for _r in self.running:
|
||||||
|
_rec = _DM_DATA.get(_r.request_id)
|
||||||
|
if _rec is not None and _rec["t_first_scheduled_unix"] is None:
|
||||||
|
_rec["t_first_scheduled_unix"] = _now_dm
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
{END_MARK}
|
||||||
|
return scheduler_output
|
||||||
|
"""
|
||||||
|
|
||||||
|
# ---------- Patch 6: first new token in update_from_output -------------------
|
||||||
|
FIRST_TOK_ANCHOR = """ # Check for stop and update request status.
|
||||||
|
if new_token_ids:
|
||||||
|
new_token_ids, stopped = self._update_request_with_output(
|
||||||
|
request, new_token_ids
|
||||||
|
)
|
||||||
|
"""
|
||||||
|
|
||||||
|
FIRST_TOK_REPLACE = f""" # Check for stop and update request status.
|
||||||
|
if new_token_ids:
|
||||||
|
{START_MARK}
|
||||||
|
try:
|
||||||
|
_rec = _DM_DATA.get(request.request_id)
|
||||||
|
if _rec is not None and _rec["t_first_token_unix"] is None:
|
||||||
|
_rec["t_first_token_unix"] = _dm_time.time()
|
||||||
|
_dm_emit_and_drop(request.request_id, reason="first_token")
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
{END_MARK}
|
||||||
|
new_token_ids, stopped = self._update_request_with_output(
|
||||||
|
request, new_token_ids
|
||||||
|
)
|
||||||
|
"""
|
||||||
|
|
||||||
|
# ---------- Patch 7: abort/finish — flush partial record ---------------------
|
||||||
|
FINISH_ANCHOR = """ request.status = finished_status
|
||||||
|
self._free_request(request, delay_free_blocks=delay_free_blocks)
|
||||||
|
"""
|
||||||
|
|
||||||
|
FINISH_REPLACE = f""" request.status = finished_status
|
||||||
|
{START_MARK}
|
||||||
|
try:
|
||||||
|
if request.request_id in _DM_DATA:
|
||||||
|
_dm_emit_and_drop(request.request_id, reason="finish_or_abort")
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
{END_MARK}
|
||||||
|
self._free_request(request, delay_free_blocks=delay_free_blocks)
|
||||||
|
"""
|
||||||
|
|
||||||
|
PATCHES = [
|
||||||
|
("header", HEADER_ANCHOR, HEADER_INSERT + HEADER_ANCHOR),
|
||||||
|
("add_request", ADD_REQUEST_ANCHOR, ADD_REQUEST_REPLACE),
|
||||||
|
("wait_for_kvs", WAIT_KV_ANCHOR, WAIT_KV_REPLACE),
|
||||||
|
("finished_recving", FINISHED_RECV_ANCHOR, FINISHED_RECV_REPLACE),
|
||||||
|
("first_scheduled", SCHED_END_ANCHOR, SCHED_END_REPLACE),
|
||||||
|
("first_token", FIRST_TOK_ANCHOR, FIRST_TOK_REPLACE),
|
||||||
|
("finish_flush", FINISH_ANCHOR, FINISH_REPLACE),
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
def find_target(venv_or_path: Path) -> Path:
|
||||||
|
candidates = [venv_or_path / TARGET_REL, DEFAULT_VENV / TARGET_REL]
|
||||||
|
for c in candidates:
|
||||||
|
if c.is_file():
|
||||||
|
return c
|
||||||
|
raise FileNotFoundError(f"cannot find {TARGET_REL} under {venv_or_path}")
|
||||||
|
|
||||||
|
|
||||||
|
def is_patched(text: str) -> bool:
|
||||||
|
return START_MARK in text
|
||||||
|
|
||||||
|
|
||||||
|
def apply(target: Path) -> None:
|
||||||
|
text = target.read_text()
|
||||||
|
if is_patched(text):
|
||||||
|
print(f"[dm-instr] already patched: {target}")
|
||||||
|
return
|
||||||
|
new = text
|
||||||
|
for name, src, dst in PATCHES:
|
||||||
|
if src not in new:
|
||||||
|
raise RuntimeError(
|
||||||
|
f"patch {name!r}: anchor not found in {target}. "
|
||||||
|
f"Anchor head: {src.splitlines()[0]!r}"
|
||||||
|
)
|
||||||
|
new = new.replace(src, dst, 1)
|
||||||
|
target.write_text(new)
|
||||||
|
print(f"[dm-instr] applied {len(PATCHES)} patches -> {target}")
|
||||||
|
|
||||||
|
|
||||||
|
def revert(target: Path) -> None:
|
||||||
|
text = target.read_text()
|
||||||
|
if not is_patched(text):
|
||||||
|
print(f"[dm-instr] not patched (nothing to revert): {target}")
|
||||||
|
return
|
||||||
|
# Strip our DM_* block, including the trailing newline that
|
||||||
|
# terminated the END_MARK line. We do NOT collapse other blank-line
|
||||||
|
# runs (MB5_* whitespace and original spacing between methods are
|
||||||
|
# preserved).
|
||||||
|
pat = re.compile(
|
||||||
|
r"[ \t]*" + re.escape(START_MARK) + r".*?" + re.escape(END_MARK) + r"\n",
|
||||||
|
flags=re.DOTALL,
|
||||||
|
)
|
||||||
|
new = pat.sub("", text)
|
||||||
|
# The header insert added a leading "# DM_INSTRUMENT_START\n" with
|
||||||
|
# two trailing blank lines and the anchor; revert removed the block
|
||||||
|
# plus its trailing newline, leaving one extra blank line before the
|
||||||
|
# class — harmless. We additionally collapse the very narrow case of
|
||||||
|
# "\n\n\nclass Scheduler" -> "\n\nclass Scheduler" so revert is
|
||||||
|
# byte-identical for that anchor.
|
||||||
|
new = re.sub(r"\n{3,}class Scheduler\(", "\n\nclass Scheduler(", new)
|
||||||
|
target.write_text(new)
|
||||||
|
print(f"[dm-instr] reverted: {target}")
|
||||||
|
|
||||||
|
|
||||||
|
def main() -> None:
|
||||||
|
p = argparse.ArgumentParser()
|
||||||
|
p.add_argument("--apply", action="store_true")
|
||||||
|
p.add_argument("--revert", action="store_true")
|
||||||
|
p.add_argument("--check", action="store_true")
|
||||||
|
p.add_argument("--venv", type=Path, default=DEFAULT_VENV)
|
||||||
|
args = p.parse_args()
|
||||||
|
target = find_target(args.venv)
|
||||||
|
if args.apply:
|
||||||
|
apply(target)
|
||||||
|
elif args.revert:
|
||||||
|
revert(target)
|
||||||
|
elif args.check:
|
||||||
|
state = "PATCHED" if is_patched(target.read_text()) else "CLEAN"
|
||||||
|
print(f"[dm-instr] {state}: {target}")
|
||||||
|
else:
|
||||||
|
p.error("specify --apply / --revert / --check")
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
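
The timestamps carried by each record are enough to split the dst-side wait into the stages named in the module docstring. A minimal sketch of that split, assuming the stage boundaries below (they are inferred from the field names, and `decompose_record` / `decompose_jsonl` are hypothetical helpers, not the analyzers shipped in this commit):

```python
import json

# (stage name, start field, end field). This mapping is an assumption
# inferred from the field names written by the instrument above.
STAGES = [
    ("admission_pre_kv", "t_arrival_unix", "t_wait_for_kvs_unix"),
    ("kv_pull", "t_wait_for_kvs_unix", "t_kv_recv_done_unix"),
    ("admission_post_kv", "t_kv_recv_done_unix", "t_first_scheduled_unix"),
    ("first_iter", "t_first_scheduled_unix", "t_first_token_unix"),
]


def decompose_record(rec: dict) -> dict:
    """Return per-stage durations in seconds; None where a timestamp is missing."""
    out = {}
    for name, start_key, end_key in STAGES:
        start, end = rec.get(start_key), rec.get(end_key)
        out[name] = None if start is None or end is None else end - start
    return out


def decompose_jsonl(path: str) -> list:
    """Decompose every record of one dm_mig_pid<pid>.jsonl file."""
    with open(path) as fh:
        return [decompose_record(json.loads(line)) for line in fh if line.strip()]
```

Records flushed with `flush_reason="finish_or_abort"` may lack later timestamps, which is why missing stages come back as `None` rather than raising. The HTTP relay stage is left out because it needs the proxy-side `t_decode_sent_unix` from `breakdown.json`.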
|
||||||
@@ -151,11 +151,65 @@ RECV_FINISH_REPLACE = f""" if response.status == MooncakeXfer
{END_MARK}
break"""
|
||||||
|
|
||||||
|
# ---- Patch 5: send_kv_to_decode entry (P-side, producer receives pull req) ----
|
||||||
|
|
||||||
|
SEND_ENTRY_TARGET = """ async def send_kv_to_decode(
|
||||||
|
self, identity: bytes, sock: zmq.asyncio.Socket, meta: MooncakeXferMetadata
|
||||||
|
):
|
||||||
|
pending_reqs: dict[ReqId, SendBlockMeta] = {}"""
|
||||||
|
|
||||||
|
SEND_ENTRY_REPLACE = f""" async def send_kv_to_decode(
|
||||||
|
self, identity: bytes, sock: zmq.asyncio.Socket, meta: MooncakeXferMetadata
|
||||||
|
):
|
||||||
|
pending_reqs: dict[ReqId, SendBlockMeta] = {{}}
|
||||||
|
{START_MARK}
|
||||||
|
try:
|
||||||
|
_mb2_log_event({{"event": "send_kv_to_decode_enter",
|
||||||
|
"d_req_ids": [str(r) for r in meta.req_blocks],
|
||||||
|
"t_start_unix": _mb2_time.time(),
|
||||||
|
"tp_rank": getattr(self, "tp_rank", -1)}})
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
{END_MARK}"""
|
||||||
|
|
||||||
|
# ---- Patch 6: wait_and_ret ready-wait timing (P-side, src KV commit wait) ----
|
||||||
|
|
||||||
|
READY_WAIT_TARGET = """ async def wait_and_ret(
|
||||||
|
d_req_id: ReqId, send_meta: SendBlockMeta
|
||||||
|
) -> tuple[ReqId, SendBlockMeta]:
|
||||||
|
await send_meta.ready.wait()
|
||||||
|
return d_req_id, send_meta"""
|
||||||
|
|
||||||
|
READY_WAIT_REPLACE = f""" async def wait_and_ret(
|
||||||
|
d_req_id: ReqId, send_meta: SendBlockMeta
|
||||||
|
) -> tuple[ReqId, SendBlockMeta]:
|
||||||
|
{START_MARK}
|
||||||
|
_mb2_rw_start = _mb2_time.perf_counter()
|
||||||
|
_mb2_rw_start_unix = _mb2_time.time()
|
||||||
|
_mb2_rw_already = send_meta.ready.is_set()
|
||||||
|
{END_MARK}
|
||||||
|
await send_meta.ready.wait()
|
||||||
|
{START_MARK}
|
||||||
|
try:
|
||||||
|
_mb2_log_event({{"event": "ready_wait",
|
||||||
|
"d_req_id": str(d_req_id),
|
||||||
|
"transfer_id": str(getattr(send_meta, "transfer_id", "")),
|
||||||
|
"ready_already_set": bool(_mb2_rw_already),
|
||||||
|
"ready_wait_s": _mb2_time.perf_counter() - _mb2_rw_start,
|
||||||
|
"t_start_unix": _mb2_rw_start_unix,
|
||||||
|
"tp_rank": getattr(self, "tp_rank", -1)}})
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
{END_MARK}
|
||||||
|
return d_req_id, send_meta"""
|
||||||
|
|
||||||
PATCHES = [
    ("header", HEADER_ANCHOR, HEADER_ANCHOR + HEADER_INSERT),
    ("_send_blocks", SEND_TARGET, SEND_REPLACE),
    ("receive_kv (entry)", RECV_ENTRY_TARGET, RECV_ENTRY_REPLACE),
    ("receive_kv (FINISH)", RECV_FINISH_TARGET, RECV_FINISH_REPLACE),
    ("send_kv (entry)", SEND_ENTRY_TARGET, SEND_ENTRY_REPLACE),
    ("ready_wait", READY_WAIT_TARGET, READY_WAIT_REPLACE),
]
|||||||
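
The `ready_wait` events added by Patch 6 distinguish pulls that found the source KV already committed from pulls that had to block on it. A small sketch of how such events might be summarized, assuming they have already been loaded into a list of dicts (the loading step and `summarize_ready_wait` are hypothetical; only the field names come from the patch above):

```python
import statistics


def summarize_ready_wait(events: list) -> dict:
    """Summarize `ready_wait` events by whether the ready flag was already set."""
    waits = [e for e in events if e.get("event") == "ready_wait"]
    # Only pulls that actually had to block contribute to the wait statistics.
    blocked = [e["ready_wait_s"] for e in waits if not e["ready_already_set"]]
    return {
        "n_total": len(waits),
        "n_blocked": len(blocked),
        "blocked_wait_p50_s": statistics.median(blocked) if blocked else None,
        "blocked_wait_max_s": max(blocked) if blocked else None,
    }
```

A high share of blocked pulls would point at the source still committing KV when the pull request arrives, rather than at the transfer itself.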
261
microbench/fresh_setup/mb6_transfer_under_load.py
Executable file
@@ -0,0 +1,261 @@
|
|||||||
|
#!/usr/bin/env python3
"""MB6: KV-transfer bandwidth vs instance busy-ness.

Confirms the causal hypothesis from the v3 breakdown: the migration
transfer runs far below wire speed because it happens between instances
that are concurrently busy with compute (GIL-starved control plane +
HBM/NIC contention), NOT because of a wire/NIC limit.

Method (reuses the MB2 transfer primitive):
    prefill on A (do_remote_decode, max_tokens=1) -> migrate to B
    (do_remote_prefill). Time step 2 = the KV transfer.

For each background-load level N in --bg-loads, we hold N concurrent
long-decode streams, assigned round-robin across BOTH instances (about
N/2 each), to keep them busy, then run --repeats measured transfers per
size. With the MB2 mooncake instrument applied (MB2_LOG_DIR set), the
analyzer can split the e2e transfer into RDMA-actual (`send_blocks`) vs
control-plane.

Expected: bg=0 reproduces MB2 (~10 GB/s); higher bg degrades toward the
~2-3 GB/s seen in the v3 trace.

Usage:
    python mb6_transfer_under_load.py \
        --src-port 8000 --dst-port 8001 --src-bp 8998 --dst-bp 8999 \
        --sizes 16384,65536 --bg-loads 0,8,24 --repeats 4 \
        --out mb6_result.json
"""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import asyncio
|
||||||
|
import json
|
||||||
|
import statistics
|
||||||
|
import time
|
||||||
|
import uuid
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import httpx
|
||||||
|
|
||||||
|
MODEL_PATH = "/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct"
|
||||||
|
KV_PER_TOK = 98304 # Qwen3-30B-A3B est bytes/token
|
||||||
|
|
||||||
|
|
||||||
|
def synth_prompt(seed: int, n: int) -> list[int]:
|
||||||
|
import random
|
||||||
|
rng = random.Random(seed)
|
||||||
|
return [rng.randint(100, 150000) for _ in range(n)]
|
||||||
|
|
||||||
|
|
||||||
|
async def get_engine_id(client, host, bp):
|
||||||
|
r = await client.get(f"http://{host}:{bp}/query")
|
||||||
|
r.raise_for_status()
|
||||||
|
return r.json()["0"]["engine_id"]
|
||||||
|
|
||||||
|
|
||||||
|
async def completion(client, host, port, prompt, max_tokens, ktp=None, stream=False):
|
||||||
|
payload = {
|
||||||
|
"model": MODEL_PATH, "prompt": prompt, "max_tokens": max_tokens,
|
||||||
|
"min_tokens": max_tokens if max_tokens == 1 else 1,
|
||||||
|
"temperature": 0.0, "stream": stream,
|
||||||
|
}
|
||||||
|
if ktp:
|
||||||
|
payload["kv_transfer_params"] = ktp
|
||||||
|
t0 = time.perf_counter()
|
||||||
|
if stream:
|
||||||
|
# consume the stream to keep the instance decoding
|
||||||
|
async with client.stream("POST", f"http://{host}:{port}/v1/completions",
|
||||||
|
json=payload, timeout=600.0) as r:
|
||||||
|
r.raise_for_status()
|
||||||
|
async for _ in r.aiter_bytes():
|
||||||
|
pass
|
||||||
|
return time.perf_counter() - t0, {}
|
||||||
|
r = await client.post(f"http://{host}:{port}/v1/completions",
|
||||||
|
json=payload, timeout=600.0)
|
||||||
|
elapsed = time.perf_counter() - t0
|
||||||
|
r.raise_for_status()
|
||||||
|
return elapsed, r.json()
|
||||||
|
|
||||||
|
|
||||||
|
async def num_running(client, host, port) -> int:
|
||||||
|
"""Read vLLM running-request gauge from /metrics."""
|
||||||
|
try:
|
||||||
|
r = await client.get(f"http://{host}:{port}/metrics", timeout=5.0)
|
||||||
|
for line in r.text.splitlines():
|
||||||
|
if line.startswith("vllm:num_requests_running"):
|
||||||
|
return int(float(line.split()[-1]))
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
return -1
|
||||||
|
|
||||||
|
|
||||||
|
class BackgroundLoad:
|
||||||
|
"""Maintain N concurrent long-decode streams on a set of (host,port)."""
|
||||||
|
def __init__(self, client, endpoints, concurrency, prompt_tokens=2000,
|
||||||
|
out_tokens=6000):
|
||||||
|
self.client = client
|
||||||
|
self.endpoints = endpoints
|
||||||
|
self.concurrency = concurrency
|
||||||
|
self.prompt_tokens = prompt_tokens
|
||||||
|
self.out_tokens = out_tokens
|
||||||
|
self._stop = asyncio.Event()
|
||||||
|
self._tasks: list[asyncio.Task] = []
|
||||||
|
|
||||||
|
async def _worker(self, idx):
|
||||||
|
host, port = self.endpoints[idx % len(self.endpoints)]
|
||||||
|
seed = 900000 + idx
|
||||||
|
while not self._stop.is_set():
|
||||||
|
prompt = synth_prompt(seed, self.prompt_tokens)
|
||||||
|
seed += 1
|
||||||
|
try:
|
||||||
|
await completion(self.client, host, port, prompt,
|
||||||
|
max_tokens=self.out_tokens, stream=True)
|
||||||
|
except Exception:
|
||||||
|
await asyncio.sleep(0.5)
|
||||||
|
|
||||||
|
def start(self):
|
||||||
|
self._tasks = [asyncio.create_task(self._worker(i))
|
||||||
|
for i in range(self.concurrency)]
|
||||||
|
|
||||||
|
async def stop(self):
|
||||||
|
self._stop.set()
|
||||||
|
for t in self._tasks:
|
||||||
|
t.cancel()
|
||||||
|
await asyncio.gather(*self._tasks, return_exceptions=True)
|
||||||
|
self._tasks = []
|
||||||
|
|
||||||
|
|
||||||
|
async def measure_transfer(client, src_host, src_port, dst_host, dst_port,
|
||||||
|
src_eid, src_bootstrap_addr, input_tokens, seed):
|
||||||
|
prompt = synth_prompt(seed, input_tokens)
|
||||||
|
transfer_id = uuid.uuid4().hex
|
||||||
|
# step 1: prefill on A
|
||||||
|
await completion(client, src_host, src_port, prompt, max_tokens=1,
|
||||||
|
ktp={"do_remote_decode": True, "transfer_id": transfer_id})
|
||||||
|
# step 2: migrate to B (this is the timed transfer)
|
||||||
|
t_start_unix = time.time()
|
||||||
|
t_xfer, _ = await completion(
|
||||||
|
client, dst_host, dst_port, prompt, max_tokens=1,
|
||||||
|
ktp={"do_remote_prefill": True, "transfer_id": transfer_id,
|
||||||
|
"remote_engine_id": src_eid,
|
||||||
|
"remote_bootstrap_addr": src_bootstrap_addr})
|
||||||
|
return {
|
||||||
|
"input_tokens": input_tokens,
|
||||||
|
"t_transfer_s": t_xfer,
|
||||||
|
"t_step2_start_unix": t_start_unix,
|
||||||
|
"t_step2_end_unix": time.time(),
|
||||||
|
"kv_bytes": input_tokens * KV_PER_TOK,
|
||||||
|
"eff_gbps": input_tokens * KV_PER_TOK / 1e9 / t_xfer if t_xfer > 0 else 0,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
async def main_async(a):
    sizes = [int(s) for s in a.sizes.split(",")]
    bg_loads = [int(s) for s in a.bg_loads.split(",")]
    src_host, dst_host = a.src_host, a.dst_host
    limits = httpx.Limits(max_connections=256, max_keepalive_connections=256)
    async with httpx.AsyncClient(limits=limits, trust_env=False) as client:
        src_eid = await get_engine_id(client, src_host, a.src_bp)
        src_bootstrap_addr = f"http://{src_host}:{a.src_bp}"
        print(f"[mb6] src eid={src_eid[:16]}... endpoints A={src_host}:{a.src_port} "
              f"B={dst_host}:{a.dst_port}")

        endpoints = [(src_host, a.src_port), (dst_host, a.dst_port)]
        results = []
        for bg in bg_loads:
            loader = None
            if bg > 0:
                loader = BackgroundLoad(client, endpoints, bg,
                                        prompt_tokens=a.bg_prompt,
                                        out_tokens=a.bg_out)
                loader.start()
                # wait for instances to actually be busy
                print(f"[mb6] bg={bg}: ramping background load ...")
                for _ in range(40):
                    await asyncio.sleep(1.0)
                    na = await num_running(client, src_host, a.src_port)
                    nb = await num_running(client, dst_host, a.dst_port)
                    if na >= 1 and nb >= 1:
                        print(f"[mb6] bg={bg}: busy (A running={na} B running={nb})")
                        break
            else:
                print("[mb6] bg=0: idle baseline")
                # ensure idle
                await asyncio.sleep(2.0)

            for sz in sizes:
                for rep in range(a.repeats):
                    na = await num_running(client, src_host, a.src_port)
                    nb = await num_running(client, dst_host, a.dst_port)
                    row = await measure_transfer(
                        client, src_host, a.src_port, dst_host, a.dst_port,
                        src_eid, src_bootstrap_addr, sz, seed=sz * 100 + rep + bg * 7)
                    row["bg_load"] = bg
                    row["A_running_at_measure"] = na
                    row["B_running_at_measure"] = nb
                    results.append(row)
                    kv_mib = sz * KV_PER_TOK / 2**20
                    print(f" bg={bg:>3} size={sz:>6} ({kv_mib:6.0f}MiB) rep={rep} "
                          f"A_run={na:>2} B_run={nb:>2} "
                          f"transfer={row['t_transfer_s']*1000:7.0f}ms "
                          f"eff={row['eff_gbps']:5.2f}GB/s")

            if loader:
                await loader.stop()
                # let the instances drain before next bg level
                print(f"[mb6] bg={bg}: draining ...")
                for _ in range(60):
                    await asyncio.sleep(1.0)
                    na = await num_running(client, src_host, a.src_port)
                    nb = await num_running(client, dst_host, a.dst_port)
                    if na <= 0 and nb <= 0:
                        break
|
||||||
|
|
||||||
|
    # summary per (bg, size)
    print("\n=== summary: effective transfer bandwidth vs background load ===")
    print(f"{'bg':>4} {'size':>7} {'n':>3} {'xfer_p50_ms':>12} {'eff_p50_GBps':>13} "
          f"{'eff_mean':>9}")
    summary = []
    for bg in bg_loads:
        for sz in sizes:
            rs = [r for r in results if r["bg_load"] == bg and r["input_tokens"] == sz]
            if not rs:
                continue
            xfer = [r["t_transfer_s"] for r in rs]
            eff = [r["eff_gbps"] for r in rs]
            # statistics.median averages the two middle samples when n is even.
            # Indexing sorted[n // 2] would instead pair the slower middle
            # transfer time with the faster middle bandwidth.
            p50x = statistics.median(xfer)
            p50e = statistics.median(eff)
            meane = statistics.mean(eff)
            summary.append({"bg": bg, "size": sz, "n": len(rs),
                            "xfer_p50_ms": p50x * 1000,
                            "eff_p50_gbps": p50e, "eff_mean_gbps": meane})
            print(f"{bg:>4} {sz:>7} {len(rs):>3} {p50x*1000:>12.0f} "
                  f"{p50e:>13.2f} {meane:>9.2f}")

    Path(a.out).write_text(json.dumps(
        {"model": MODEL_PATH, "kv_bytes_per_token": KV_PER_TOK,
         "label": a.label, "raw": results, "summary": summary}, indent=2))
    print(f"\n[mb6] wrote {a.out}")
|
||||||
|
|
||||||
|
|
||||||
|
def main():
|
||||||
|
p = argparse.ArgumentParser()
|
||||||
|
p.add_argument("--src-host", default="127.0.0.1")
|
||||||
|
p.add_argument("--dst-host", default="127.0.0.1")
|
||||||
|
p.add_argument("--src-port", type=int, default=8000)
|
||||||
|
p.add_argument("--dst-port", type=int, default=8001)
|
||||||
|
p.add_argument("--src-bp", type=int, default=8998)
|
||||||
|
p.add_argument("--dst-bp", type=int, default=8999)
|
||||||
|
p.add_argument("--sizes", default="16384,65536")
|
||||||
|
p.add_argument("--bg-loads", default="0,8,24")
|
||||||
|
p.add_argument("--repeats", type=int, default=4)
|
||||||
|
p.add_argument("--bg-prompt", type=int, default=2000)
|
||||||
|
p.add_argument("--bg-out", type=int, default=6000)
|
||||||
|
p.add_argument("--label", default="main-venv")
|
||||||
|
p.add_argument("--out", default="mb6_result.json")
|
||||||
|
args = p.parse_args()
|
||||||
|
asyncio.run(main_async(args))
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
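
For orientation, the `eff_gbps` figure reported by the driver is payload bytes divided by the wall-clock time of step 2. A worked sketch using the script's own `KV_PER_TOK` estimate (the helper names here are hypothetical, and the ~10 and ~3 GB/s rates are the round figures quoted in this study, not new measurements):

```python
KV_PER_TOK = 98304  # bytes/token, the same estimate the driver uses


def kv_bytes(input_tokens: int) -> int:
    """Estimated KV payload for a prompt of `input_tokens` tokens."""
    return input_tokens * KV_PER_TOK


def eff_gbps(input_tokens: int, t_transfer_s: float) -> float:
    """Effective bandwidth in GB/s (1e9 bytes), 0.0 for a non-positive time."""
    if t_transfer_s <= 0:
        return 0.0
    return kv_bytes(input_tokens) / 1e9 / t_transfer_s


def transfer_time_s(input_tokens: int, rate_gbps: float) -> float:
    """Time the same payload would take at a given effective rate."""
    return kv_bytes(input_tokens) / (rate_gbps * 1e9)


if __name__ == "__main__":
    for tokens in (16384, 65536):
        idle = transfer_time_s(tokens, 10.0)
        busy = transfer_time_s(tokens, 3.0)
        print(f"{tokens:>6} tok  {kv_bytes(tokens) / 1e9:5.2f} GB  "
              f"idle~{idle:.2f}s  busy~{busy:.2f}s")
```

Under this estimate a 65536-token prompt carries about 6.44 GB of KV, so it moves in roughly 0.64 s at 10 GB/s and roughly 2.1 s at 3 GB/s. Because step 2 is timed end to end over HTTP, the measured time also includes admission and the first decode iteration, not only the RDMA write.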
|
||||||
109
microbench/fresh_setup/run_mb6.sh
Executable file
@@ -0,0 +1,109 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
# MB6 launcher: 2 vLLM instances (kv_both, Mooncake) + transfer-under-load
|
||||||
|
# sweep. Parameterized by VENV so it runs on either the patched main venv
|
||||||
|
# or the fresh upstream venv, to test whether the bandwidth degradation is
|
||||||
|
# our patch or inherent to upstream mooncake.
|
||||||
|
#
|
||||||
|
# Usage:
|
||||||
|
# VENV=/home/admin/cpfs/wjh/agentic-kv/.venv bash run_mb6.sh # main
|
||||||
|
# VENV=/home/admin/cpfs/wjh/agentic-kv-fresh/.venv bash run_mb6.sh # fresh
|
||||||
|
|
||||||
|
set -uo pipefail
|
||||||
|
|
||||||
|
PROJ_DIR="${PROJ_DIR:-/home/admin/cpfs/wjh/agentic-kv}"
|
||||||
|
VENV="${VENV:-$PROJ_DIR/.venv}"
|
||||||
|
LABEL="${LABEL:-$(basename "$(dirname "$VENV")")}"
|
||||||
|
MODEL="${MODEL:-/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct}"
|
||||||
|
GPUS="${GPUS:-0 1}"
|
||||||
|
SIZES="${SIZES:-16384,65536}"
|
||||||
|
BG_LOADS="${BG_LOADS:-0,8,24}"
|
||||||
|
REPEATS="${REPEATS:-4}"
|
||||||
|
DATE="$(date +%Y%m%d_%H%M)"
|
||||||
|
OUTDIR="${OUTDIR:-$PROJ_DIR/outputs/mb6_${LABEL}_${DATE}}"
|
||||||
|
PYTHON="$VENV/bin/python"
|
||||||
|
MC_INSTR="$PROJ_DIR/microbench/fresh_setup/instrument_mooncake.py"
|
||||||
|
DRIVER="$PROJ_DIR/microbench/fresh_setup/mb6_transfer_under_load.py"
|
||||||
|
MC_FILE="$VENV/lib/python3.12/site-packages/vllm/distributed/kv_transfer/kv_connector/v1/mooncake/mooncake_connector.py"
|
||||||
|
|
||||||
|
mkdir -p "$OUTDIR/logs"
|
||||||
|
XFER_LOG_DIR="$OUTDIR/xfer_log"; mkdir -p "$XFER_LOG_DIR"
|
||||||
|
|
||||||
|
echo "=== MB6 transfer-under-load ($LABEL) ==="
|
||||||
|
echo "VENV : $VENV"
|
||||||
|
echo "Out : $OUTDIR"
|
||||||
|
echo ""
|
||||||
|
|
||||||
|
PORTS=(8000 8001); BPS=(8998 8999)
|
||||||
|
gpu_arr=($GPUS)
|
||||||
|
|
||||||
|
cleanup() {
|
||||||
|
pkill -9 -f "vllm serve" 2>/dev/null || true
|
||||||
|
pkill -9 -f "EngineCore" 2>/dev/null || true
|
||||||
|
sleep 4
|
||||||
|
"$PYTHON" "$MC_INSTR" --venv "$MC_FILE" --revert 2>/dev/null || true
|
||||||
|
}
|
||||||
|
trap cleanup EXIT
|
||||||
|
cleanup
|
||||||
|
|
||||||
|
echo "[0] apply MB2 mooncake instrument to $LABEL venv"
|
||||||
|
"$PYTHON" "$MC_INSTR" --venv "$MC_FILE" --apply
|
||||||
|
"$PYTHON" "$MC_INSTR" --venv "$MC_FILE" --check
|
||||||
|
|
||||||
|
echo "[1] launch 2 instances"
|
||||||
|
i=0
|
||||||
|
for gpu in "${gpu_arr[@]:0:2}"; do
|
||||||
|
port=${PORTS[$i]}; bp=${BPS[$i]}; master=$((29600 + i))
|
||||||
|
PYTHONHASHSEED=42 \
|
||||||
|
VLLM_MOONCAKE_BOOTSTRAP_PORT=$bp \
|
||||||
|
MB2_LOG_DIR="$XFER_LOG_DIR" \
|
||||||
|
CUDA_VISIBLE_DEVICES=$gpu \
|
||||||
|
MASTER_PORT=$master \
|
||||||
|
nohup "$VENV/bin/vllm" serve "$MODEL" \
|
||||||
|
--host 0.0.0.0 --port "$port" \
|
||||||
|
--tensor-parallel-size 1 --trust-remote-code --enable-prefix-caching \
|
||||||
|
--dtype auto --gpu-memory-utilization 0.9 --max-model-len 200000 \
|
||||||
|
--kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_both"}' \
|
||||||
|
--enable-prompt-tokens-details \
|
||||||
|
> "$OUTDIR/logs/vllm_${i}_gpu${gpu}.log" 2>&1 &
|
||||||
|
disown
|
||||||
|
sleep 2
|
||||||
|
i=$((i + 1))
|
||||||
|
done
|
||||||
|
|
||||||
|
echo "[2] wait for health"
|
||||||
|
for i in 0 1; do
|
||||||
|
port=${PORTS[$i]}; tries=0
|
||||||
|
while ! curl -sf "http://127.0.0.1:$port/health" >/dev/null 2>&1; do
|
||||||
|
tries=$((tries + 1))
|
||||||
|
if [ $tries -gt 180 ]; then echo "FATAL inst_$i not healthy"; exit 1; fi
|
||||||
|
sleep 2
|
||||||
|
done
|
||||||
|
echo " inst_$i ready"
|
||||||
|
done
|
||||||
|
# bootstrap /query reachable?
|
||||||
|
for i in 0 1; do
|
||||||
|
bp=${BPS[$i]}; tries=0
|
||||||
|
while ! curl -sf "http://127.0.0.1:$bp/query" >/dev/null 2>&1; do
|
||||||
|
tries=$((tries + 1))
|
||||||
|
if [ $tries -gt 60 ]; then echo "WARN bootstrap $bp not ready"; break; fi
|
||||||
|
sleep 2
|
||||||
|
done
|
||||||
|
done
|
||||||
|
|
||||||
|
echo "[3] run MB6 driver"
|
||||||
|
"$PYTHON" "$DRIVER" \
|
||||||
|
--src-port "${PORTS[0]}" --dst-port "${PORTS[1]}" \
|
||||||
|
--src-bp "${BPS[0]}" --dst-bp "${BPS[1]}" \
|
||||||
|
--sizes "$SIZES" --bg-loads "$BG_LOADS" --repeats "$REPEATS" \
|
||||||
|
--label "$LABEL" --out "$OUTDIR/mb6_result.json" \
|
||||||
|
2>&1 | tee "$OUTDIR/mb6_run.txt"
|
||||||
|
|
||||||
|
echo "[4] teardown + revert"
|
||||||
|
pkill -9 -f "vllm serve" 2>/dev/null || true
|
||||||
|
pkill -9 -f "EngineCore" 2>/dev/null || true
|
||||||
|
sleep 4
|
||||||
|
"$PYTHON" "$MC_INSTR" --venv "$MC_FILE" --revert
|
||||||
|
|
||||||
|
echo ""
|
||||||
|
echo "Done. Artifacts in $OUTDIR/"
|
||||||
|
echo " mb6_result.json mb6_run.txt xfer_log/ logs/"
|
||||||