Merge layerwise KV transfer + engine-state ablation onto main

Brings the worktree-mooncake-layerwise line (layerwise Mooncake connector,
write-mode proxy, real engine-state feed + eff_ accessors, mb7 microbench,
v3 trace re-profile, A/B x migration matrix runner) into main so the repo
is self-contained for these experiments. Disjoint paths
(microbench/connector_tax/layerwise/*) => clean merge.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 11:53:40 +08:00
22 changed files with 11105 additions and 0 deletions


@@ -0,0 +1,188 @@
# Layer-wise KV transfer on Mooncake — exploration
Goal: make vLLM's `MooncakeConnector` push KV **per-layer during prefill**
(write mode) instead of the current **post-hoc full-request transfer**, then
microbench correctness + whether it hides the transfer behind prefill compute
(the thing MoRIIO's write mode does on AMD; no NVIDIA connector ships it).
Everything here is isolated in worktree `worktree-mooncake-layerwise`. The
dash0 venv connector is backed up at `mooncake_connector.py.ORIG_BACKUP`;
revert = copy the backup back. Opt-in via env `MOONCAKE_LAYERWISE=1`, so with
the env unset the connector behaves exactly as upstream.
## Baseline flow (post-hoc, what we have)
1. Proxy: prefill on src (`do_remote_decode`, max_tokens=1) → **await done** →
decode on dst (`do_remote_prefill`) which pulls.
2. dst `start_load_kv` → `receive_kv` sends ZMQ `MooncakeXferMetadata` (its block
addrs) to src bootstrap.
3. src `send_kv_to_decode`: waits `send_meta.ready` (set at `request_finished`,
i.e. **after full prefill**) → `_build_transfer_params` (all layers) →
`_send_blocks` (one big `batch_transfer_sync_write`) → FINISH response.
Measured: this full transfer is on the critical path, runs at ~3 GB/s under
load (vs ~10 GB/s idle), dominating migration TTFT.
## Layer-wise flow (write mode, this exploration)
Key idea: keep all RDMA + completion on the `sender_loop` thread (clean), but
issue **one `batch_transfer_sync_write` per layer**, each fired as soon as that
layer's KV is computed — so writes overlap the remaining prefill compute.
Signaling: `save_kv_layer(layer_name, ...)` (called by vLLM's attention hook
after each layer's forward, on the main worker thread) records "layer L
computed" and wakes the sender_loop. `send_kv_to_decode` loops L=0..N-1,
waits until L is computed, writes layer L's blocks, then sends FINISH.
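
The signaling described above can be sketched with plain `asyncio` primitives. This is a minimal illustration under stated assumptions, not the connector's code: `LayerTracker`, `fake_prefill`, `send_layers`, and `demo` are hypothetical names, a list append stands in for the per-layer `batch_transfer_sync_write`, and a short `time.sleep` stands in for each layer's forward pass.

```python
import asyncio
import threading
import time


class LayerTracker:
    """Counts layers finished by the compute thread (hypothetical helper)."""

    def __init__(self, loop: asyncio.AbstractEventLoop, num_layers: int):
        self.loop = loop
        self.num_layers = num_layers
        self.computed = 0              # number of layers computed so far
        self.lock = threading.Lock()
        self.wake = asyncio.Event()    # only touched on the loop thread

    def note_layer_computed(self) -> None:
        """Called from the compute thread after each layer's forward."""
        with self.lock:
            self.computed += 1
        # asyncio.Event is not thread-safe, so set() is handed to the loop.
        self.loop.call_soon_threadsafe(self.wake.set)

    async def wait_for_layer(self, layer: int) -> None:
        """Runs on the loop thread; returns once `layer` has been computed."""
        while True:
            self.wake.clear()
            with self.lock:
                if self.computed > layer:
                    return
            await self.wake.wait()


def fake_prefill(tracker: LayerTracker) -> None:
    # Stand-in for the forward pass: one notification per layer, in order.
    for _ in range(tracker.num_layers):
        time.sleep(0.001)
        tracker.note_layer_computed()


async def send_layers(tracker: LayerTracker, sent: list) -> None:
    for layer in range(tracker.num_layers):
        await tracker.wait_for_layer(layer)
        sent.append(layer)             # stand-in for the per-layer RDMA write
    sent.append("FINISH")


async def demo(num_layers: int = 4) -> list:
    tracker = LayerTracker(asyncio.get_running_loop(), num_layers)
    sent: list = []
    worker = threading.Thread(target=fake_prefill, args=(tracker,))
    worker.start()
    await send_layers(tracker, sent)
    worker.join()                      # worker has already finished all layers
    return sent


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The event is cleared before the counter is re-checked, so a notification that arrives between the check and the wait is not lost.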
### Edits to `mooncake_connector.py` (all gated by `_lw_enabled`)
1. **Worker `__init__`**: `_lw_enabled` (env), layer-name→position map,
`_lw_computed: dict[transfer_id,int]`, `_lw_active: set[transfer_id]`,
wake event, lock.
2. **`register_kv_caches`**: build `_lw_layer_pos[layer_name]` (0..N-1) and
`_lw_addr_idx[pos]` = indices into `kv_caches_base_addr` (×2 if
`split_k_and_v`).
3. **Scheduler `update_state_after_alloc`** (`do_remote_decode` branch): in
layer-wise mode capture `blocks.get_block_ids()[0]` and store non-empty in
`_reqs_need_send` so the worker learns local block_ids + sets `ready`
**before** prefill finishes.
4. **Worker `note_layer_computed(layer_name)`** (new) called from
`MooncakeConnector.save_kv_layer`: bump `_lw_computed[tid]` for active
producers, `call_soon_threadsafe(wake.set)`.
5. **Worker `send_kv_to_decode`**: in layer-wise mode, mark transfer active,
loop layers: await `_lw_computed[tid] >= L`, `_send_blocks` for layer L
only (subset of `_build_transfer_params`), then send FINISH.
6. **Worker `_build_layer_transfer_params`** (new): like
`_build_transfer_params` but only the addr indices for one layer position.
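
Edits 2 and 6 rely on a layer-to-address mapping, which the following sketch illustrates. It assumes `kv_caches_base_addr` is ordered by layer and that, with `split_k_and_v`, each layer contributes two adjacent entries (K then V). That ordering is an assumption of this sketch, not something these notes confirm, and `build_layer_addr_idx` is a hypothetical name.

```python
def build_layer_addr_idx(layer_names, split_k_and_v):
    """Return (layer_name -> position, position -> base-address indices).

    Assumes one base address per layer, or two adjacent ones (K, V) when
    split_k_and_v is set; the real connector layout may differ.
    """
    per_layer = 2 if split_k_and_v else 1
    layer_pos = {name: pos for pos, name in enumerate(layer_names)}
    addr_idx = {
        pos: list(range(pos * per_layer, (pos + 1) * per_layer))
        for pos in layer_pos.values()
    }
    return layer_pos, addr_idx


if __name__ == "__main__":
    names = [f"model.layers.{i}.self_attn.attn" for i in range(48)]
    pos, idx = build_layer_addr_idx(names, split_k_and_v=True)
    print(idx[0], idx[47])
```

A per-layer transfer would then pass only `addr_idx[pos]` to the parameter builder instead of every index.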
### Microbench requirements
- Disable chunked prefill (`--max-num-batched-tokens` ≥ prompt) so prefill is a
single forward and `save_kv_layer` fires once per layer in order.
- Dispatch the dst (`do_remote_prefill`) request **first/concurrently** so the
ZMQ handshake reaches src during prefill.
- Correctness: dst follow-up `cached_tokens == prompt_len` (KV landed),
identical to baseline.
- Perf: src prefill wall-clock (does layer-wise slow it?) and dst TTFT (does
transfer leave the critical path?), swept over KV size, vs baseline.
## Status
- [x] worktree + connector backup + design
- [x] modified connector (LAYERWISE.py, +193/-4 lines, env-gated)
- [x] correctness microbench (mb7_layerwise.py) + launcher (run_mb7.sh)
- [x] correctness run on dash0 — PASS (KV lands; cached == prompt)
- [x] perf run + verdict — POSITIVE (transfer hidden behind prefill)
## Results (2-instance, idle, chunked-prefill off, Qwen3-30B-A3B, 48 layers)
Metric: `overhead = total - prefill_only` = the transfer cost left on the
critical path (TTFT). Baseline = post-hoc full pull (sequential).
| KV size | baseline overhead | **layerwise overhead** | reduction |
|--------:|------------------:|-----------------------:|----------:|
| 8192 (0.75 GiB) | 123 ms | **58 ms** | 2.1× |
| 16384 (1.5 GiB) | 202 ms | **58 ms** | 3.5× |
| 32768 (3.0 GiB) | 529 ms | **57 ms** | 9.3× |
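
As a worked check of the table, the KV sizes and the reduction column can be recomputed from the reported numbers. `KV_PER_TOK = 98304` bytes per token is the constant used in `mb7_layerwise.py`; the helper names here are illustrative only.

```python
KV_PER_TOK = 98304  # bytes of KV cache per token (constant from mb7_layerwise.py)


def kv_gib(tokens: int) -> float:
    """KV footprint of a prompt in GiB."""
    return tokens * KV_PER_TOK / 2**30


def overhead_ms(total_ms: float, prefill_only_ms: float) -> float:
    """Transfer cost left on the critical path: total minus prefill-only."""
    return total_ms - prefill_only_ms


def reduction(baseline_overhead_ms: float, layerwise_overhead_ms: float) -> float:
    """How many times smaller the layer-wise overhead is."""
    return baseline_overhead_ms / layerwise_overhead_ms


# (baseline overhead, layerwise overhead) in ms, copied from the table above
ROWS = {8192: (123, 58), 16384: (202, 58), 32768: (529, 57)}

if __name__ == "__main__":
    for tokens, (base, lw) in ROWS.items():
        print(f"{tokens:>6} tok  {kv_gib(tokens):.2f} GiB  {reduction(base, lw):.1f}x")
```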
Key signatures:
- **Layerwise overhead is ~constant (~58 ms)** regardless of KV size, while
baseline grows O(KV size). The 58 ms is handshake + last-layer tail + 1
decode; the bulk transfer is hidden behind prefill compute.
- **Prefill did NOT slow down**: layerwise `t_A` (575/1495/4440 ms) ==
`prefill_only` (574/1492/4440 ms). The concurrent RDMA was "free" on idle
GPUs — no measurable HBM contention with prefill compute here.
- Producer logs confirm the transfer itself took 0.39/0.55/4.37 s (grows with
size) yet ran *inside* the prefill window, so it left the critical path.
- **Correctness PASS**: B's follow-up cached == prompt for all sizes; the
48-layer / 96-base-addr (split K&V) per-layer addressing is correct.
## Caveats (why this is a proof-of-concept, not a verdict for production)
1. **Idle instances only.** Real migration happens between *busy* instances.
Under load both prefill and transfer slow; transfer (even at ~3 GB/s) is
still < prefill for big contexts so it should still hide, but receive-side
(B) and HBM contention during prefill are untested here. NEXT: rerun with
background load on both A and B.
2. **Chunked prefill disabled.** The monotonic layer counter assumes one
forward, layers in order. Production uses chunked prefill (multi-step),
which needs per-(chunk,layer) tracking not implemented.
3. **Single concurrent producer transfer.** Global counter; real migration is
concurrent. Would need per-transfer state.
4. **Microbench dispatch.** mb7 fires B then A with a 50 ms head start to get
the handshake to A before its forward. The real proxy path
(`_handle_combined_pd_sep_v2`) dispatches sequentially and would need the
write-mode (concurrent) restructure.
## Results under LOAD (bg=16 background decode streams, 8 per instance)
Critical-path transfer overhead (ms), `total - prefill_only`:
| KV size | idle base | idle LW | **load base** | **load LW** |
|--------:|----------:|--------:|--------------:|------------:|
| 8k | 123 | 58 | 158 | **94** |
| 16k | 202 | 58 | 239 | **83** |
| 32k | 529 | 57 | **749** | **95** |
The overlap **survives load**: layerwise overhead stays ~constant (~90 ms)
under load while baseline grows to 749 ms at 32k (7.9× reduction). Prefill did
not slow (load LW `t_A` == load `prefill_only`); the transfer (0.56/1.46/4.37 s,
producer logs) ran inside the prefill window even with 16 concurrent decodes.
Correctness PASS under load.
## FULL 1200-req v3 TRACE re-profile (chunk-safe + concurrent + write-mode)
Hardened connector (per-step incremental shipping, per-transfer state) +
write-mode proxy (concurrent prefill/decode dispatch). Two passes of
`w600_r0.0015_st30.jsonl` under `unified_v3`, differing only in transfer mode.
Correctness: layer-wise **1213/1214 success** (1 connection-error on the 128k
req, not KV corruption); byte-level KV correctness validated on mb7
(chunked + 3-way concurrent, `cached==prompt`); producer logs confirm
incremental shipping (e.g. `shipped 7872/7872 blocks`).
Migration sets differ between runs (write-mode timing shifts which requests
trigger migration; only 4 migrated in both), but are distributionally
comparable (median new_local/input 0.42 vs 0.46). **Matched migrations
all improved**, scaling with the transfer hidden behind prefill:
| request | input | new_local | base TTFT (s) | LW TTFT (s) | Δ (base − LW) |
|---|--:|--:|--:|--:|--:|
| 1268630 | 102k | 97k | 41.20 | 33.96 | **7.23s** |
| 1334223 | 37k | 14k | 6.04 | 3.23 | 2.81s |
| 1279412 | 40k | 8k | 5.50 | 2.92 | 2.58s |
| 1271459 | 8.9k | 8.9k | 37.01 | 36.98 | 0.03s (queue-bound) |
Trace-level TTFT (different sets, directional): overall p90 9.79 → 9.16 s (−6%),
p99 44.89 → 42.85 s (−5%). **Modest** because (a) migrations are only 25/1214
≈ **2%** of requests, and (b) several migrations are queue/contention-bound, not
transfer-bound: layer-wise removes the transfer component but not the
control-plane/queue residual (the ~45% from the b3_v3_fullbreak profile).
**Verdict on the trace re-profile:** layer-wise does exactly what the profile
predicted: it removes the transfer half of migration overhead (matched
migrations improved by 2.6 to 7.2 s, most where there is the most prefill to
hide behind), but the trace-level gain is small because migrations are rare and
partly queue-bound. It does NOT, on its own, flip migration to a clear win
over unified for this agentic workload.
## Verdict (microbench)
The mechanism **works and the benefit holds under load**: layer-wise push turns
migration's KV-transfer cost from O(KV size) on the critical path into a
near-constant ~90 ms tail, by overlapping it with prefill compute. This is what
MoRIIO's write mode does on AMD, now demonstrated on NVIDIA/Mooncake.
**BUT this is single-transfer, non-chunked.** Running the actual 1200-req trace
correctly needs two more pieces this PoC does NOT have:
1. **Chunk-safe tracking**: long agentic prompts force chunked prefill;
`save_kv_layer` then fires per-chunk and the monotonic counter would ship
uncomputed blocks. Needs slot-mapping-aware per-(request,chunk) tracking.
2. **Concurrent-transfer safety**: the global counter assumes one producer at
a time; the trace migrates from busy instances running other forwards.
Also: even with those fixed, layer-wise only removes the **transfer half** of
the measured migration overhead. The b3_v3_fullbreak profile showed dst-side
`T_kv_pull` = ~55% RDMA + ~45% control-plane GIL-dispatch stalls; layer-wise
hides the RDMA half but the control-plane half is orthogonal. So a trace
re-profile would show roughly the transfer half collapse, not the whole thing.


@@ -0,0 +1,82 @@
# P2: real engine-state feed for migration target selection
Problem: the router (`cache_aware_proxy.py`) decides migration targets from
**shadow counters** it maintains itself (incremented at dispatch, decremented
at completion) and reconciles to vLLM `/metrics` only every **30 s**
(`_reconcile_loop`). So every routing/migration decision is on stale state.
Worse, the signal that predicts the ~45% control-plane stall — *is the target
mid-large-prefill?* (a big prefill holds the GIL and starves the mooncake
receiver_loop) — isn't visible at all, and `/metrics` doesn't expose it either.
Fix: vLLM publishes **real** per-engine state to a shared store at ~20 Hz; the
router reads ground truth and avoids GIL-stall / capacity-wall targets.
## Components (all unit-tested without GPUs)
- `engine_state.py` — canonical `compute_snapshot(scheduler, id)`, `StateWriter`,
`StateReader`. Schema per engine: `ts, num_running, num_waiting,
gpu_blocks_total/free, gpu_kv_used_frac, pending_prefill_tokens,
ongoing_decode_tokens, num_prefilling, max_prefill_remaining`.
- `instrument_engine_state.py` — vLLM `Scheduler` patch (apply/revert markers
`ES_INSTRUMENT_*`): a daemon thread publishes the snapshot every
`AGENTIC_ENGINE_STATE_PERIOD_MS` (50 ms) off the forward hot path. Inlined
writer (engine process needs no repo import). Coexists with MB5.
- `migration_target.py` — pure target scorer: avoid `max_prefill_remaining ≥
es_big_prefill_threshold` (GIL stall) and `gpu_kv_used_frac ≥ es_kv_wall_frac`
(capacity wall), then rank by cache-richness and **real** load.
- `cache_aware_proxy.WRITEMODE.py` — wired: `InstanceState.real_state`,
`_engine_state_poll_loop` (instance i ← `engine_{i}`), `_real_load`/Gate-3 and
Mechanism-B now real-state-aware. `--engine-state-uri` flag; off ⇒ identical
to before (shadow only).
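
The scorer's filter-then-rank idea can be illustrated as follows. `migration_target.py` itself is not shown here, so this is only a sketch under assumptions: the state field names come from the schema above, but `pick_target`, the `cached_tokens` / `shadow_load` inputs, the default threshold values, the exact ranking key, and the fallback when every candidate is filtered out are all hypothetical.

```python
def pick_target(candidates, big_prefill_threshold=32768, kv_wall_frac=0.9):
    """Pick a migration target.

    candidates: {instance_id: {"state": dict or None,
                               "cached_tokens": int,
                               "shadow_load": int}}
    "state" is the real engine snapshot, or None when the feed is missing.
    """

    def load(c):
        s = c["state"]
        if s is None:                       # graceful fallback: shadow counter
            return c["shadow_load"]
        return s["num_running"] + s["num_waiting"]

    def blocked(c):
        s = c["state"]
        if s is None:
            return False
        return (s["max_prefill_remaining"] >= big_prefill_threshold   # GIL stall
                or s["gpu_kv_used_frac"] >= kv_wall_frac)             # KV wall

    allowed = {k: c for k, c in candidates.items() if not blocked(c)}
    pool = allowed or candidates            # assumption: never return nothing
    # Rank: most cached tokens first, then lowest load, then id for stability.
    return min(pool, key=lambda k: (-pool[k]["cached_tokens"], load(pool[k]), k))
```

With this shape, an instance in the middle of a very large prefill is skipped even when its shadow load is the lowest of the group.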
Transport (`AGENTIC_ENGINE_STATE_URI` / `--engine-state-uri`):
`file:///dev/shm/agentic_engine_state` (default, zero-dep, single-node) or
`redis://host:port/0` (multi-node; needs redis-py + server — not installed on
dash0, so file backend is the working default).
## Tests (no GPU)
- `compute_snapshot` field math (mock scheduler): running/waiting,
max_prefill_remaining, pending, decode, kv_used_frac.
- writer→reader round-trip + staleness drop (file backend).
- target scorer: 5 cases incl. *avoid GIL-stall target even when its shadow
load is lower*, *real load beats stale shadow*, *cache-rich wins*,
*avoid KV wall*, *graceful fallback when feed missing*.
- end-to-end: publish 8 engines (one mid-130k-prefill) → proxy inlined reader →
target selection avoids it.
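
The round-trip and staleness check can be pictured with the self-contained sketch below. It mirrors the file backend (temp file plus `os.replace`) in a temporary directory instead of importing the repo modules; `publish`, `read_fresh`, and `round_trip_demo` are illustrative names, not the repo's test code.

```python
import glob
import json
import os
import tempfile
import time


def publish(directory: str, state: dict) -> None:
    """Write one engine record atomically: temp file, then rename over target."""
    path = os.path.join(directory, f"{state['engine_id']}.json")
    tmp = f"{path}.tmp.{os.getpid()}"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def read_fresh(directory: str, max_age_s: float = 2.0, now=None) -> dict:
    """Return {engine_id: state}, skipping records older than max_age_s."""
    now = time.time() if now is None else now
    out = {}
    for p in glob.glob(os.path.join(directory, "*.json")):
        with open(p) as f:
            s = json.load(f)
        if now - s.get("ts", 0) <= max_age_s:
            out[s["engine_id"]] = s
    return out


def round_trip_demo() -> list:
    with tempfile.TemporaryDirectory() as d:
        now = time.time()
        publish(d, {"engine_id": "engine_0", "ts": now, "num_running": 2})
        publish(d, {"engine_id": "engine_1", "ts": now - 10.0, "num_running": 7})
        return sorted(read_fresh(d, max_age_s=2.0, now=now))


if __name__ == "__main__":
    print(round_trip_demo())
```

Readers never observe a half-written record because the rename replaces the file in one step.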
## Enabling in a GPU run (when free)
1. `instrument_engine_state.py --apply` on the dash0 venv.
2. `export AGENTIC_ENGINE_STATE_URI=file:///dev/shm/agentic_engine_state`
before the launcher (vLLM instances inherit it; `AGENTIC_WORKER_ID=engine_{i}`
already set by `b3_isolated_policy.sh` → publishes as `engine_{i}`).
3. Proxy: `EXTRA_PROXY_ARGS="--engine-state-uri file:///dev/shm/agentic_engine_state ..."`.
4. Revert the patch + `rm -rf /dev/shm/agentic_engine_state` after.
## ALL policies now read the real state (update)
`InstanceState` exposes effective accessors used by **every** picker:
`eff_num_requests / eff_pending_prefill / eff_ongoing_decode /
eff_ongoing_tokens` = `max(shadow, real)` when the feed is fresh (real fixes
the 30s-stale under-count; shadow's atomic pre-await reservation still covers
the in-flight window, preserving the RaceFix), plus real-only
`r_max_prefill_remaining / r_kv_used_frac`. Wired into: `load_only`, `lmetric`,
`sticky`, `pick_instance` (legacy), `pick_instance_unified_hybrid`
(unified / unified_kv_both), `pick_instance_unified_v3` (gate + Mechanism B),
and `snapshot_workers` (logged scores now match the decision + real fields).
Feed off ⇒ `real_state is None` ⇒ accessors return shadow ⇒ byte-identical to
before. (legacy `unified_v2` left on shadow — retired, not in the ablation.)
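
A minimal sketch of one effective accessor, assuming `real_state` is `None` whenever the feed is off or stale. The class name `InstanceStateSketch` and the mapping of the real request count to `num_running + num_waiting` are assumptions made for illustration; the proxy's actual fields may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InstanceStateSketch:
    shadow_num_requests: int = 0         # router-side counter (reserved pre-await)
    real_state: Optional[dict] = None    # fresh engine snapshot, or None

    @property
    def eff_num_requests(self) -> int:
        """max(shadow, real) when the feed is fresh, else shadow only."""
        if self.real_state is None:
            return self.shadow_num_requests
        real = self.real_state["num_running"] + self.real_state["num_waiting"]
        return max(self.shadow_num_requests, real)
```

Taking the maximum keeps the shadow reservation for requests that are dispatched but not yet visible to the engine, while the real value corrects an under-count.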
## Ablation (when GPU free)
`run_v3_trace.sh` gains `ES=1` (apply engine-state patch + feed + proxy flag)
and always deploys the enhanced proxy (dormant when feed/write-mode off).
`run_ablation_es.sh` runs each config twice (ES=0 vs ES=1) so the only
difference is the state source. Default decisive set (4 runs): champion
`unified+A+B` and `unified_v3+A+B+layerwise`, each ES0/ES1. Extend CONFIGS for
`lmetric` / `unified_kv_both` / `load_only`. Compares per-policy TTFT
(overall + migrated) and whether the **ranking** changes with ground-truth
state.
## Status / scope
- Built + unit-tested (snapshot, round-trip, target scorer, eff_ accessors,
end-to-end publish→read→select); NOT yet run against live engines (GPU busy).
- TP=1 only (one EngineCore/instance → one publisher/engine_id). TP>1 needs
per-rank ids.

File diff suppressed because it is too large


@@ -0,0 +1,140 @@
#!/usr/bin/env python3
"""Engine-state store: canonical snapshot + writer/reader, shared schema.

The vLLM scheduler patch (instrument_engine_state.py) inlines a faithful copy
of `compute_snapshot` + the file/redis writer (engine process needs no repo
import). The router (cache_aware_proxy) imports `StateReader` here to read the
real per-engine state instead of its stale shadow counters.

Schema (one record per engine, key = engine_id):
  ts, engine_id, num_running, num_waiting, gpu_blocks_total, gpu_blocks_free,
  gpu_kv_used_frac, pending_prefill_tokens, ongoing_decode_tokens,
  num_prefilling, max_prefill_remaining

Transport URIs:
  file:///dev/shm/agentic_engine_state     (default; atomic temp+rename)
  redis://host:port/0                      (optional; needs redis-py)
"""
from __future__ import annotations

import json
import os
import time


def compute_snapshot(scheduler, engine_id: str) -> dict:
    """Cheap O(batch) read of routing-relevant real state from a live
    vLLM V1 Scheduler (duck-typed for testability)."""
    try:
        pool = scheduler.kv_cache_manager.block_pool
        total = int(pool.num_gpu_blocks)
        free = int(pool.get_num_free_blocks())
    except Exception:
        total = free = -1

    n_run = pend = dec = n_pref = max_pref = 0
    try:
        for r in scheduler.running:
            n_run += 1
            npr = int(getattr(r, "num_prompt_tokens", 0))
            nct = int(getattr(r, "num_computed_tokens", 0))
            if nct < npr:  # still prefilling
                rem = npr - nct
                pend += rem
                n_pref += 1
                max_pref = max(max_pref, rem)
            else:  # decoding
                dec += int(getattr(r, "num_tokens", 0))
    except Exception:
        pass

    n_wait = 0
    try:
        n_wait = len(scheduler.waiting) + len(getattr(scheduler, "skipped_waiting", []))
        for r in list(scheduler.waiting):
            pend += max(0, int(getattr(r, "num_prompt_tokens", 0))
                        - int(getattr(r, "num_computed_tokens", 0)))
    except Exception:
        pass

    used = ((total - free) / total) if (total and total > 0) else -1.0
    return {
        "ts": time.time(),
        "engine_id": engine_id,
        "num_running": n_run,
        "num_waiting": int(n_wait),
        "gpu_blocks_total": total,
        "gpu_blocks_free": free,
        "gpu_kv_used_frac": used,
        "pending_prefill_tokens": int(pend),
        "ongoing_decode_tokens": int(dec),
        "num_prefilling": n_pref,
        "max_prefill_remaining": int(max_pref),
    }


class StateWriter:
    def __init__(self, uri: str, engine_id: str):
        self.engine_id = engine_id
        self.kind = None
        if uri.startswith("file://"):
            self.kind = "file"
            self.dir = uri[len("file://"):]
            os.makedirs(self.dir, exist_ok=True)
            self.path = os.path.join(self.dir, f"{engine_id}.json")
            self.tmp = self.path + f".tmp.{os.getpid()}"
        elif uri.startswith("redis://"):
            self.kind = "redis"
            import redis  # lazy: only needed for the redis backend
            self.r = redis.Redis.from_url(uri)
            self.key = f"engine_state:{engine_id}"
        else:
            raise ValueError(f"unsupported engine-state URI: {uri}")

    def publish(self, state: dict):
        if self.kind == "file":
            # Write to a temp file, then rename over the target (atomic).
            with open(self.tmp, "w") as f:
                f.write(json.dumps(state))
            os.replace(self.tmp, self.path)
        elif self.kind == "redis":
            self.r.set(self.key, json.dumps(state), ex=5)


class StateReader:
    """Router-side reader. read_all() returns {engine_id: state}, dropping
    records older than max_age_s (so a dead/hung engine is ignored)."""

    def __init__(self, uri: str, max_age_s: float = 2.0):
        self.uri = uri
        self.max_age_s = max_age_s
        self.kind = None
        if uri.startswith("file://"):
            self.kind = "file"
            self.dir = uri[len("file://"):]
        elif uri.startswith("redis://"):
            self.kind = "redis"
            import redis  # lazy: only needed for the redis backend
            self.r = redis.Redis.from_url(uri)
        else:
            raise ValueError(f"unsupported engine-state URI: {uri}")

    def read_all(self) -> dict[str, dict]:
        now = time.time()
        out: dict[str, dict] = {}
        try:
            if self.kind == "file":
                import glob
                for p in glob.glob(os.path.join(self.dir, "*.json")):
                    try:
                        # Context manager so the file handle is closed promptly.
                        with open(p) as f:
                            s = json.load(f)
                    except Exception:
                        continue
                    if now - s.get("ts", 0) <= self.max_age_s:
                        out[s.get("engine_id", os.path.basename(p)[:-5])] = s
            elif self.kind == "redis":
                for k in self.r.scan_iter("engine_state:*"):
                    v = self.r.get(k)
                    if not v:
                        continue
                    s = json.loads(v)
                    if now - s.get("ts", 0) <= self.max_age_s:
                        out[s.get("engine_id")] = s
        except Exception:
            pass
        return out


@@ -0,0 +1,234 @@
#!/usr/bin/env python3
"""Patch vLLM V1 scheduler to publish REAL engine state to a shared store,
so the global router reads ground truth instead of its own stale shadow
counters (reconciled only every 30s).

Published per engine (key = AGENTIC_ENGINE_ID), throttled ~20 Hz from a
daemon thread (off the forward hot path):
  {ts, num_running, num_waiting, gpu_blocks_total, gpu_blocks_free,
   gpu_kv_used_frac, pending_prefill_tokens, ongoing_decode_tokens,
   num_prefilling, max_prefill_remaining}

`max_prefill_remaining` is the key signal /metrics does NOT expose: the
largest in-progress prefill on the engine. A big in-progress prefill holds
the GIL and stalls the mooncake receiver_loop, so the router should avoid
migrating KV to such an instance (P2).

Transport (env AGENTIC_ENGINE_STATE_URI):
  file:///dev/shm/agentic_engine_state     (default; atomic temp+rename)
  redis://host:port/0                      (optional; needs redis-py + server)

Self-contained (inlined writer) so the engine process needs no repo import.
Apply/revert markers: # ES_INSTRUMENT_START / # ES_INSTRUMENT_END.

Usage:
  python instrument_engine_state.py --apply  [--venv PATH]
  python instrument_engine_state.py --revert [--venv PATH]
  python instrument_engine_state.py --check  [--venv PATH]
"""
from __future__ import annotations

import argparse
import re
from pathlib import Path

DEFAULT_VENV = Path("/home/admin/cpfs/wjh/agentic-kv/.venv")
TARGET_REL = "lib/python3.12/site-packages/vllm/v1/core/sched/scheduler.py"
START = "# ES_INSTRUMENT_START"
END = "# ES_INSTRUMENT_END"

# ---- Patch 1: header (writer + publisher thread), before class Scheduler ----
HEADER_ANCHOR = "class Scheduler(SchedulerInterface):"
HEADER = f'''{START}
import json as _es_json
import os as _es_os
import threading as _es_threading
import time as _es_time

_ES_URI = _es_os.environ.get("AGENTIC_ENGINE_STATE_URI", "")
_ES_ID = _es_os.environ.get("AGENTIC_ENGINE_ID") or _es_os.environ.get(
    "AGENTIC_WORKER_ID", f"engine_{{_es_os.getpid()}}")
_ES_PERIOD_S = float(_es_os.environ.get("AGENTIC_ENGINE_STATE_PERIOD_MS", "50")) / 1000.0


class _ESWriter:
    """Pluggable state writer: file:// (atomic temp+rename) or redis://."""

    def __init__(self, uri: str, engine_id: str):
        self.engine_id = engine_id
        self.kind = None
        if uri.startswith("file://"):
            self.kind = "file"
            self.dir = uri[len("file://"):]
            _es_os.makedirs(self.dir, exist_ok=True)
            self.path = _es_os.path.join(self.dir, f"{{engine_id}}.json")
            self.tmp = self.path + f".tmp.{{_es_os.getpid()}}"
        elif uri.startswith("redis://"):
            self.kind = "redis"
            import redis  # lazy
            self.r = redis.Redis.from_url(uri)
            self.key = f"engine_state:{{engine_id}}"

    def publish(self, state: dict):
        try:
            if self.kind == "file":
                with open(self.tmp, "w") as f:
                    f.write(_es_json.dumps(state))
                _es_os.replace(self.tmp, self.path)  # atomic
            elif self.kind == "redis":
                self.r.set(self.key, _es_json.dumps(state), ex=5)
        except Exception:
            pass


def _es_compute_snapshot(scheduler) -> dict:
    """Cheap O(batch) state read from the live scheduler."""
    try:
        kvm = scheduler.kv_cache_manager
        pool = kvm.block_pool
        total = int(pool.num_gpu_blocks)
        free = int(pool.get_num_free_blocks())
    except Exception:
        total = free = -1
    n_run = 0
    pend = 0
    dec = 0
    n_pref = 0
    max_pref = 0
    try:
        for r in scheduler.running:
            n_run += 1
            npr = int(getattr(r, "num_prompt_tokens", 0))
            nct = int(getattr(r, "num_computed_tokens", 0))
            if nct < npr:  # still prefilling
                rem = npr - nct
                pend += rem
                n_pref += 1
                if rem > max_pref:
                    max_pref = rem
            else:  # decoding
                dec += int(getattr(r, "num_tokens", 0))
    except Exception:
        pass
    n_wait = 0
    try:
        n_wait = len(scheduler.waiting) + len(getattr(scheduler, "skipped_waiting", []))
        for r in list(scheduler.waiting):
            pend += max(0, int(getattr(r, "num_prompt_tokens", 0))
                        - int(getattr(r, "num_computed_tokens", 0)))
    except Exception:
        pass
    used_frac = ((total - free) / total) if (total and total > 0) else -1.0
    return {{
        "ts": _es_time.time(),
        "engine_id": _ES_ID,
        "num_running": n_run,
        "num_waiting": int(n_wait),
        "gpu_blocks_total": total,
        "gpu_blocks_free": free,
        "gpu_kv_used_frac": used_frac,
        "pending_prefill_tokens": int(pend),
        "ongoing_decode_tokens": int(dec),
        "num_prefilling": n_pref,
        "max_prefill_remaining": int(max_pref),
    }}


class _ESPublisher:
    def __init__(self, scheduler):
        self._sched = scheduler
        self._writer = _ESWriter(_ES_URI, _ES_ID)
        self._stop = _es_threading.Event()
        self._t = _es_threading.Thread(target=self._loop, daemon=True)
        self._t.start()

    def _loop(self):
        while not self._stop.is_set():
            try:
                self._writer.publish(_es_compute_snapshot(self._sched))
            except Exception:
                pass
            _es_time.sleep(_ES_PERIOD_S)
{END}


'''

# ---- Patch 2: start the publisher at the end of Scheduler.__init__ ----------
# Anchor on the existing agentic step-log block tail in __init__.
INIT_ANCHOR = """        _step_path = _os.environ.get("AGENTIC_STEP_LOG_PATH")"""
INIT_INSERT = f"""        {START}
        if _ES_URI:
            try:
                self._es_publisher = _ESPublisher(self)
                logger.info("agentic engine-state publisher: uri=%s id=%s",
                            _ES_URI, _ES_ID)
            except Exception as _e:
                logger.warning("engine-state publisher disabled (%r)", _e)
        {END}
        _step_path = _os.environ.get("AGENTIC_STEP_LOG_PATH")"""

PATCHES = [
    ("header", HEADER_ANCHOR, HEADER + HEADER_ANCHOR),
    ("init", INIT_ANCHOR, INIT_INSERT),
]


def find_target(venv: Path) -> Path:
    for c in (venv / TARGET_REL, DEFAULT_VENV / TARGET_REL):
        if c.is_file():
            return c
    raise FileNotFoundError(f"cannot find {TARGET_REL} under {venv}")


def is_patched(t: str) -> bool:
    return START in t


def apply(target: Path):
    text = target.read_text()
    if is_patched(text):
        print(f"[es-instr] already patched: {target}")
        return
    new = text
    for name, src, dst in PATCHES:
        if src not in new:
            raise RuntimeError(f"patch {name!r}: anchor not found in {target}")
        new = new.replace(src, dst, 1)
    target.write_text(new)
    print(f"[es-instr] applied {len(PATCHES)} patches -> {target}")


def revert(target: Path):
    text = target.read_text()
    if not is_patched(text):
        print(f"[es-instr] not patched: {target}")
        return
    # Drop every START..END block, including its leading indentation.
    pat = re.compile(r"[ \t]*" + re.escape(START) + r".*?" + re.escape(END) + r"\n",
                     flags=re.DOTALL)
    new = pat.sub("", text)
    # Collapse the extra blank lines the header block left before the class.
    new = re.sub(r"\n{3,}class Scheduler\(", "\n\nclass Scheduler(", new)
    target.write_text(new)
    print(f"[es-instr] reverted: {target}")


def main():
    p = argparse.ArgumentParser()
    p.add_argument("--apply", action="store_true")
    p.add_argument("--revert", action="store_true")
    p.add_argument("--check", action="store_true")
    p.add_argument("--venv", type=Path, default=DEFAULT_VENV)
    a = p.parse_args()
    t = find_target(a.venv)
    if a.apply:
        apply(t)
    elif a.revert:
        revert(t)
    elif a.check:
        print(f"[es-instr] {'PATCHED' if is_patched(t.read_text()) else 'CLEAN'}: {t}")
    else:
        p.error("specify --apply/--revert/--check")


if __name__ == "__main__":
    main()


@@ -0,0 +1,294 @@
#!/usr/bin/env python3
"""MB7: correctness + perf of layer-wise KV push vs post-hoc transfer.

Two 2-instance modes against A (src/producer) and B (dst/consumer):

  baseline  : prefill A (await) -> THEN B pulls (post-hoc full transfer).
              T_total = T_prefill + T_xfer  (sequential)
  layerwise : dispatch B's remote-prefill (handshake) and A's prefill
              CONCURRENTLY, so A pushes each layer as it computes it.
              If overlap works, T_total ~= max(T_prefill, T_xfer) ~= T_prefill.

Reference: T_prefill_only = a plain prefill on A with no transfer.

Correctness: after the transfer, a plain follow-up to B on the same prompt
must report cached_tokens >= ~prompt_len (the KV actually landed on B).

The connector mode is selected by the launcher (run_mb7.sh): baseline uses the
stock connector; layerwise deploys mooncake_connector.LAYERWISE.py +
MOONCAKE_LAYERWISE=1. This script just drives the requests and measures.

Usage:
  python mb7_layerwise.py --mode layerwise --sizes 8192,32768,65536 --repeats 3 \
      --src-port 8000 --dst-port 8001 --src-bp 8998 --dst-bp 8999 --out mb7.json
"""
from __future__ import annotations

import argparse
import asyncio
import json
import statistics
import time
import uuid
from pathlib import Path

import httpx

MODEL = "/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct"
KV_PER_TOK = 98304  # bytes of KV cache per token


def synth_prompt(seed: int, n: int) -> list[int]:
    import random
    rng = random.Random(seed)
    return [rng.randint(100, 150000) for _ in range(n)]


async def get_engine_id(client, host, bp):
    r = await client.get(f"http://{host}:{bp}/query")
    r.raise_for_status()
    return r.json()["0"]["engine_id"]


async def completion(client, host, port, prompt, max_tokens, ktp=None):
    payload = {
        "model": MODEL, "prompt": prompt, "max_tokens": max_tokens,
        "min_tokens": max_tokens if max_tokens == 1 else 1,
        "temperature": 0.0, "stream": False,
    }
    if ktp:
        payload["kv_transfer_params"] = ktp
    t0 = time.perf_counter()
    r = await client.post(f"http://{host}:{port}/v1/completions",
                          json=payload, timeout=600.0)
    dt = time.perf_counter() - t0
    r.raise_for_status()
    return dt, r.json()


def cached_of(resp) -> int:
    usage = resp.get("usage") or {}
    det = usage.get("prompt_tokens_details") or {}
    return det.get("cached_tokens", 0) or usage.get("cached_tokens", 0) or 0


async def _stream_completion(client, host, port, prompt, max_tokens):
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens,
               "min_tokens": 1, "temperature": 0.0, "stream": True}
    async with client.stream("POST", f"http://{host}:{port}/v1/completions",
                             json=payload, timeout=600.0) as r:
        r.raise_for_status()
        async for _ in r.aiter_bytes():
            pass


class BackgroundLoad:
    """Hold N concurrent long-decode streams across endpoints to keep busy."""

    def __init__(self, client, endpoints, n, prompt_tokens=2000, out_tokens=6000):
        self.client, self.endpoints, self.n = client, endpoints, n
        self.pt, self.ot = prompt_tokens, out_tokens
        self._stop = asyncio.Event()
        self._tasks = []

    async def _w(self, idx):
        host, port = self.endpoints[idx % len(self.endpoints)]
        seed = 800000 + idx
        while not self._stop.is_set():
            try:
                await _stream_completion(self.client, host, port,
                                         synth_prompt(seed, self.pt), self.ot)
            except Exception:
                await asyncio.sleep(0.5)
            seed += 1

    def start(self):
        self._tasks = [asyncio.create_task(self._w(i)) for i in range(self.n)]

    async def stop(self):
        self._stop.set()
        for t in self._tasks:
            t.cancel()
        await asyncio.gather(*self._tasks, return_exceptions=True)


async def num_running(client, host, port):
    try:
        r = await client.get(f"http://{host}:{port}/metrics", timeout=5.0)
        for line in r.text.splitlines():
            if line.startswith("vllm:num_requests_running"):
                return int(float(line.split()[-1]))
    except Exception:
        pass
    return -1


async def prefill_only(client, host, port, prompt):
    """Reference: plain prefill cost on A, no transfer."""
    dt, _ = await completion(client, host, port, prompt, max_tokens=1)
    return dt


async def measure_baseline(client, A, B, src_eid, src_bp_addr, prompt, seed):
    tid = uuid.uuid4().hex
    t0 = time.perf_counter()
    t_pf, _ = await completion(client, *A, prompt, 1,
                               ktp={"do_remote_decode": True, "transfer_id": tid})
    t_xfer, _ = await completion(client, *B, prompt, 1,
                                 ktp={"do_remote_prefill": True, "transfer_id": tid,
                                      "remote_engine_id": src_eid,
                                      "remote_bootstrap_addr": src_bp_addr})
    t_total = time.perf_counter() - t0
    # correctness: B follow-up should hit cache
    _, fr = await completion(client, *B, prompt, 1)
    return {"t_prefill_s": t_pf, "t_xfer_s": t_xfer, "t_total_s": t_total,
            "cached": cached_of(fr)}


async def measure_layerwise(client, A, B, src_eid, src_bp_addr, prompt, seed):
    """Dispatch B handshake + A prefill concurrently => layer-wise overlap."""
    tid = uuid.uuid4().hex
    t0 = time.perf_counter()

    async def run_B():
        return await completion(client, *B, prompt, 1,
                                ktp={"do_remote_prefill": True, "transfer_id": tid,
                                     "remote_engine_id": src_eid,
                                     "remote_bootstrap_addr": src_bp_addr})

    async def run_A():
        # small head start for B's handshake to reach A before A's forward
        await asyncio.sleep(0.05)
        return await completion(client, *A, prompt, 1,
                                ktp={"do_remote_decode": True, "transfer_id": tid})

    b_task = asyncio.create_task(run_B())
    a_task = asyncio.create_task(run_A())
    (t_b, _), (t_a, _) = await asyncio.gather(b_task, a_task)
    t_total = time.perf_counter() - t0
    _, fr = await completion(client, *B, prompt, 1)
    return {"t_A_s": t_a, "t_B_s": t_b, "t_total_s": t_total,
            "cached": cached_of(fr)}
async def main_async(a):
    sizes = [int(s) for s in a.sizes.split(",")]
    A = (a.src_host, a.src_port)
    B = (a.dst_host, a.dst_port)
    limits = httpx.Limits(max_connections=64, max_keepalive_connections=64)
    async with httpx.AsyncClient(limits=limits, trust_env=False) as client:
        src_eid = await get_engine_id(client, a.src_host, a.src_bp)
        src_bp_addr = f"http://{a.src_host}:{a.src_bp}"
        print(f"[mb7] mode={a.mode} bg_load={a.bg_load} src_eid={src_eid[:16]}...")
        loader = None
        if a.bg_load > 0:
            loader = BackgroundLoad(client, [A, B], a.bg_load)
            loader.start()
            print(f"[mb7] ramping background load ({a.bg_load}) ...")
            for _ in range(40):
                await asyncio.sleep(1.0)
                na = await num_running(client, *A)
                nb = await num_running(client, *B)
                if na >= 1 and nb >= 1:
                    print(f"[mb7] busy: A_run={na} B_run={nb}")
                    break

        # --- concurrent correctness mode: fire N transfers at once ----------
        if a.concurrent > 1 and a.mode == "layerwise":
            print(f"[mb7] CONCURRENT correctness: {a.concurrent} simultaneous "
                  f"transfers per size (src=A stresses concurrent producing)")
            all_ok = True
            for sz in sizes:
                tasks = [
                    asyncio.create_task(measure_layerwise(
                        client, A, B, src_eid, src_bp_addr,
                        synth_prompt(sz * 1000 + j, sz), sz * 1000 + j))
                    for j in range(a.concurrent)
                ]
                rows = await asyncio.gather(*tasks)
                oks = [r["cached"] >= int(sz * 0.9) for r in rows]
                all_ok = all_ok and all(oks)
                print(f" sz={sz:>6} x{a.concurrent}: cached="
                      f"{[r['cached'] for r in rows]} correct={oks}")
            print(f"[mb7] CONCURRENT correctness: "
                  f"{'ALL PASS' if all_ok else 'FAILURE'}")
            if loader:
                await loader.stop()
            return

        results = []
        for sz in sizes:
            for rep in range(a.repeats):
                prompt = synth_prompt(sz * 100 + rep, sz)
                # reference prefill-only cost (fresh prompt, different seed so no cache)
                t_pf_only = await prefill_only(
                    client, *A, synth_prompt(sz * 100 + rep + 555, sz))
                if a.mode == "baseline":
                    row = await measure_baseline(client, A, B, src_eid, src_bp_addr,
                                                 prompt, sz * 100 + rep)
                else:
                    row = await measure_layerwise(client, A, B, src_eid, src_bp_addr,
                                                  prompt, sz * 100 + rep)
                row.update({"mode": a.mode, "size": sz, "rep": rep,
                            "t_prefill_only_s": t_pf_only,
                            "kv_gib": sz * KV_PER_TOK / 2**30,
                            "correct": row["cached"] >= int(sz * 0.9)})
                results.append(row)
                extra = (f"xfer={row.get('t_xfer_s', 0)*1000:.0f}ms"
                         if a.mode == "baseline"
                         else f"tA={row.get('t_A_s',0)*1000:.0f}ms tB={row.get('t_B_s',0)*1000:.0f}ms")
                print(f" sz={sz:>6} rep={rep} pf_only={t_pf_only*1000:6.0f}ms "
                      f"total={row['t_total_s']*1000:7.0f}ms {extra} "
                      f"cached={row['cached']}/{sz} correct={row['correct']}")
        if loader:
            await loader.stop()

    # summary
    print(f"\n=== {a.mode} (bg={a.bg_load}) summary ===")
    print(f"{'size':>7} {'n':>2} {'pf_only_ms':>11} {'total_ms':>9} "
          f"{'overhead_ms':>12} {'correct':>8}")
    summary = []
    for sz in sizes:
        rs = [r for r in results if r["size"] == sz]
        if not rs:
            continue
        pf = statistics.median(r["t_prefill_only_s"] for r in rs) * 1000
        tot = statistics.median(r["t_total_s"] for r in rs) * 1000
        allok = all(r["correct"] for r in rs)
        # overhead = total - prefill_only = the part NOT hidden behind prefill
        overhead = tot - pf
        summary.append({"size": sz, "n": len(rs), "pf_only_ms": pf,
                        "total_ms": tot, "overhead_ms": overhead,
                        "all_correct": allok})
        print(f"{sz:>7} {len(rs):>2} {pf:>11.0f} {tot:>9.0f} {overhead:>12.0f} "
              f"{str(allok):>8}")
    Path(a.out).write_text(json.dumps(
        {"mode": a.mode, "model": MODEL, "raw": results, "summary": summary}, indent=2))
    print(f"\n[mb7] wrote {a.out}")


def main():
    p = argparse.ArgumentParser()
    p.add_argument("--mode", choices=["baseline", "layerwise"], required=True)
    p.add_argument("--src-host", default="127.0.0.1")
    p.add_argument("--dst-host", default="127.0.0.1")
    p.add_argument("--src-port", type=int, default=8000)
    p.add_argument("--dst-port", type=int, default=8001)
    p.add_argument("--src-bp", type=int, default=8998)
    p.add_argument("--dst-bp", type=int, default=8999)
    p.add_argument("--sizes", default="8192,32768,65536")
    p.add_argument("--repeats", type=int, default=3)
    p.add_argument("--bg-load", type=int, default=0,
                   help="N concurrent background decode streams across A+B")
    p.add_argument("--concurrent", type=int, default=1,
                   help="layerwise: fire N simultaneous transfers to test "
                        "concurrent-producing correctness")
    p.add_argument("--out", default="mb7_result.json")
    args = p.parse_args()
    asyncio.run(main_async(args))


if __name__ == "__main__":
    main()
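
The overlap that `measure_layerwise` relies on can be illustrated without a vLLM deployment. The sketch below is a stand-in, not the benchmark itself: `fake_request` and `dispatch_overlapped` are hypothetical names, and `asyncio.sleep` replaces the real HTTP completions. It only demonstrates the ordering: B is dispatched first, A starts after a short head start, and B completes after A.

```python
import asyncio


async def fake_request(name, delay_s, log):
    """Hypothetical stand-in for an HTTP completion; records start/done order."""
    log.append(f"{name}:start")
    await asyncio.sleep(delay_s)
    log.append(f"{name}:done")
    return name


async def dispatch_overlapped(head_start_s=0.01):
    log = []

    async def run_b():
        # Consumer side: issued first so its handshake is already pending.
        return await fake_request("B", 0.05, log)

    async def run_a():
        # Producer side: delayed slightly, mirroring the head start in the driver.
        await asyncio.sleep(head_start_s)
        return await fake_request("A", 0.02, log)

    # gather() returns results in argument order, not completion order.
    results = await asyncio.gather(run_b(), run_a())
    return list(results), log


results, log = asyncio.run(dispatch_overlapped())
print(results, log)
```

Because `gather` preserves argument order, the tuple unpacking in the driver (`(t_b, _), (t_a, _)`) is safe even though A finishes before B.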


@@ -0,0 +1,79 @@
#!/usr/bin/env python3
"""P2: real-state-aware migration target selection.

Pure helpers (no proxy deps) so they're unit-testable. The router calls
`rank_migration_targets` to pick the decode target, using REAL engine state
(from the engine-state store) when available, falling back to shadow counters.

Key fix over the shadow-only Mechanism B: deprioritise targets that are
mid-large-prefill (`max_prefill_remaining` high), because those hold the GIL
and stall the mooncake receiver_loop, which is the ~45% control-plane residual
that layer-wise transfer does NOT fix. Also avoid targets near the KV
capacity wall (`gpu_kv_used_frac` high).
"""
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TargetCandidate:
    idx: int
    cache_hit: int  # estimated transfer saved by cache hits (tokens)
    shadow_num_req: int  # proxy shadow counter (fallback)
    ongoing_tokens: int  # shadow tertiary
    real_state: dict | None = None  # engine-state record, or None if stale/missing


def real_load(c: TargetCandidate) -> float:
    """Effective load: prefer real (running + waiting); else shadow."""
    rs = c.real_state
    if rs is not None:
        return float(rs.get("num_running", 0) + rs.get("num_waiting", 0))
    return float(c.shadow_num_req)


def big_prefill_remaining(c: TargetCandidate) -> int:
    """Largest in-progress prefill on the candidate (GIL-stall predictor).

    0 when unknown (no real state) so we don't over-penalise blind."""
    rs = c.real_state
    return int(rs.get("max_prefill_remaining", 0)) if rs is not None else 0


def kv_used_frac(c: TargetCandidate) -> float:
    rs = c.real_state
    if rs is not None:
        f = rs.get("gpu_kv_used_frac", -1.0)
        return float(f) if f is not None and f >= 0 else 0.0
    return 0.0


def target_sort_key(
    c: TargetCandidate,
    big_prefill_threshold: int = 16000,
    kv_wall_frac: float = 0.90,
):
    """Sort key (lower = better). Ordering of concerns:

    1. NOT mid-large-prefill (avoid the GIL-stall dst)      [bool]
    2. NOT near the KV capacity wall                         [bool]
    3. most cache-rich (fewest transfer bytes)               -> -cache_hit
    4. lowest real load
    5. lowest ongoing_tokens (shadow tertiary tie-break)
    """
    stalls = 1 if big_prefill_remaining(c) >= big_prefill_threshold else 0
    near_wall = 1 if kv_used_frac(c) >= kv_wall_frac else 0
    return (stalls, near_wall, -c.cache_hit, real_load(c), c.ongoing_tokens)


def rank_migration_targets(
    candidates: list[TargetCandidate],
    big_prefill_threshold: int = 16000,
    kv_wall_frac: float = 0.90,
) -> TargetCandidate | None:
    """Return the best candidate, or None if the list is empty."""
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda c: target_sort_key(c, big_prefill_threshold, kv_wall_frac),
    )
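
The lexicographic ordering in the sort-key docstring is easiest to see on a toy input. The sketch below is a simplified, self-contained restatement rather than the module itself: `Candidate` and `sort_key` are hypothetical stand-ins that take already-resolved load, prefill and KV values instead of reading them from `real_state`, and the candidate numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical flattened view of a migration target (illustration only)."""
    idx: int
    cache_hit: int          # tokens already cached on the target
    load: float             # running + waiting requests
    prefill_remaining: int  # largest in-progress prefill, in tokens
    kv_frac: float          # fraction of GPU KV cache in use


def sort_key(c, big_prefill_threshold=16000, kv_wall_frac=0.90):
    # Tuples compare element by element, so earlier fields dominate later ones.
    stalls = 1 if c.prefill_remaining >= big_prefill_threshold else 0
    near_wall = 1 if c.kv_frac >= kv_wall_frac else 0
    return (stalls, near_wall, -c.cache_hit, c.load)


candidates = [
    Candidate(0, cache_hit=30000, load=2, prefill_remaining=20000, kv_frac=0.40),
    Candidate(1, cache_hit=8000, load=5, prefill_remaining=0, kv_frac=0.95),
    Candidate(2, cache_hit=4000, load=1, prefill_remaining=0, kv_frac=0.50),
    Candidate(3, cache_hit=4000, load=3, prefill_remaining=0, kv_frac=0.50),
]

order = [c.idx for c in sorted(candidates, key=sort_key)]
print(order)  # [2, 3, 1, 0]
```

Candidate 0 has by far the largest cache hit but ranks last because it is mid-large-prefill; candidates 2 and 3 tie on cache hit and are separated by load.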

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff


@@ -0,0 +1,140 @@
{
"mode": "baseline",
"model": "/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct",
"raw": [
{
"t_prefill_s": 0.5736213000018324,
"t_xfer_s": 0.36388630099827424,
"t_total_s": 0.9375749369974073,
"cached": 8176,
"mode": "baseline",
"size": 8192,
"rep": 0,
"t_prefill_only_s": 1.0551288530004967,
"kv_gib": 0.75,
"correct": true
},
{
"t_prefill_s": 0.5740011439993395,
"t_xfer_s": 0.12374231500143651,
"t_total_s": 0.6978207100000873,
"cached": 8176,
"mode": "baseline",
"size": 8192,
"rep": 1,
"t_prefill_only_s": 0.5743715360003989,
"kv_gib": 0.75,
"correct": true
},
{
"t_prefill_s": 0.5732713990000775,
"t_xfer_s": 0.10885842400239198,
"t_total_s": 0.6821924389987544,
"cached": 8176,
"mode": "baseline",
"size": 8192,
"rep": 2,
"t_prefill_only_s": 0.5745713680007611,
"kv_gib": 0.75,
"correct": true
},
{
"t_prefill_s": 1.4892208660021424,
"t_xfer_s": 0.2091717740004242,
"t_total_s": 1.6984740270017937,
"cached": 16368,
"mode": "baseline",
"size": 16384,
"rep": 0,
"t_prefill_only_s": 1.4990949730017746,
"kv_gib": 1.5,
"correct": true
},
{
"t_prefill_s": 1.4885207330007688,
"t_xfer_s": 0.2010940889995254,
"t_total_s": 1.6896768289989268,
"cached": 16368,
"mode": "baseline",
"size": 16384,
"rep": 1,
"t_prefill_only_s": 1.4898170189990196,
"kv_gib": 1.5,
"correct": true
},
{
"t_prefill_s": 1.4895933570005582,
"t_xfer_s": 0.2026357979993918,
"t_total_s": 1.6922962099997676,
"cached": 16368,
"mode": "baseline",
"size": 16384,
"rep": 2,
"t_prefill_only_s": 1.4907751430000644,
"kv_gib": 1.5,
"correct": true
},
{
"t_prefill_s": 4.438586502998078,
"t_xfer_s": 0.37847799000155646,
"t_total_s": 4.817142683001293,
"cached": 32752,
"mode": "baseline",
"size": 32768,
"rep": 0,
"t_prefill_only_s": 4.437922253000579,
"kv_gib": 3.0,
"correct": true
},
{
"t_prefill_s": 4.4350325649975275,
"t_xfer_s": 0.5313337980005599,
"t_total_s": 4.966431269000168,
"cached": 32752,
"mode": "baseline",
"size": 32768,
"rep": 1,
"t_prefill_only_s": 4.437473922000208,
"kv_gib": 3.0,
"correct": true
},
{
"t_prefill_s": 4.436279826000828,
"t_xfer_s": 0.6335160570015432,
"t_total_s": 5.069869226001174,
"cached": 32752,
"mode": "baseline",
"size": 32768,
"rep": 2,
"t_prefill_only_s": 4.440119222999783,
"kv_gib": 3.0,
"correct": true
}
],
"summary": [
{
"size": 8192,
"n": 3,
"pf_only_ms": 574.5713680007611,
"total_ms": 697.8207100000873,
"overhead_ms": 123.24934199932613,
"all_correct": true
},
{
"size": 16384,
"n": 3,
"pf_only_ms": 1490.7751430000644,
"total_ms": 1692.2962099997676,
"overhead_ms": 201.52106699970318,
"all_correct": true
},
{
"size": 32768,
"n": 3,
"pf_only_ms": 4437.922253000579,
"total_ms": 4966.431269000168,
"overhead_ms": 528.5090159995889,
"all_correct": true
}
]
}
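
As a reading aid for the baseline rows above, the sketch below derives two quantities from the recorded fields only: KV bytes per token (`kv_gib` over `size`) and an effective transfer rate (`kv_gib` over the median `t_xfer_s`). The timings are copied from the raw rows and rounded to four decimals. One caveat is an assumption on my part: `t_xfer_s` is the latency of the whole decode-side request, so the rate is a lower bound on link bandwidth rather than a direct measurement.

```python
import statistics

GIB = 2**30

# size (tokens) -> (kv_gib, t_xfer_s per repeat), rounded from the raw rows
baseline_rows = {
    8192: (0.75, [0.3639, 0.1237, 0.1089]),
    16384: (1.5, [0.2092, 0.2011, 0.2026]),
    32768: (3.0, [0.3785, 0.5313, 0.6335]),
}

bytes_per_token = {}
rate_gib_s = {}
for size, (kv_gib, times) in baseline_rows.items():
    bytes_per_token[size] = kv_gib * GIB / size
    rate_gib_s[size] = kv_gib / statistics.median(times)
    print(f"size={size:>6} bytes/token={bytes_per_token[size]:.0f} "
          f"effective_rate={rate_gib_s[size]:.2f} GiB/s")
```

Every size gives the same 98304 bytes per token, which is consistent with `kv_gib` being computed as `sz * KV_PER_TOK / 2**30` in the driver.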


@@ -0,0 +1,140 @@
{
"mode": "baseline",
"model": "/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct",
"raw": [
{
"t_prefill_s": 0.5868483350022871,
"t_xfer_s": 0.19584889299949282,
"t_total_s": 0.7827702419999696,
"cached": 8176,
"mode": "baseline",
"size": 8192,
"rep": 0,
"t_prefill_only_s": 0.5920699099988269,
"kv_gib": 0.75,
"correct": true
},
{
"t_prefill_s": 0.5875704979989678,
"t_xfer_s": 0.1554814909977722,
"t_total_s": 0.7431365060001554,
"cached": 8176,
"mode": "baseline",
"size": 8192,
"rep": 1,
"t_prefill_only_s": 0.5814537600017502,
"kv_gib": 0.75,
"correct": true
},
{
"t_prefill_s": 0.5852241569991747,
"t_xfer_s": 0.15129724399957922,
"t_total_s": 0.7365909610016388,
"cached": 8176,
"mode": "baseline",
"size": 8192,
"rep": 2,
"t_prefill_only_s": 0.5846994370003813,
"kv_gib": 0.75,
"correct": true
},
{
"t_prefill_s": 1.498547145001794,
"t_xfer_s": 0.2475714690008317,
"t_total_s": 1.7462187470009667,
"cached": 16368,
"mode": "baseline",
"size": 16384,
"rep": 0,
"t_prefill_only_s": 1.5670790190015396,
"kv_gib": 1.5,
"correct": true
},
{
"t_prefill_s": 1.5025789940009417,
"t_xfer_s": 0.24532966799961287,
"t_total_s": 1.7479741930001182,
"cached": 16368,
"mode": "baseline",
"size": 16384,
"rep": 1,
"t_prefill_only_s": 1.5008903820016712,
"kv_gib": 1.5,
"correct": true
},
{
"t_prefill_s": 1.5021674179988622,
"t_xfer_s": 0.24640760400143336,
"t_total_s": 1.7486415580024186,
"cached": 16368,
"mode": "baseline",
"size": 16384,
"rep": 2,
"t_prefill_only_s": 1.509417139001016,
"kv_gib": 1.5,
"correct": true
},
{
"t_prefill_s": 4.444555983998725,
"t_xfer_s": 0.4227471090016479,
"t_total_s": 4.86737214599998,
"cached": 32752,
"mode": "baseline",
"size": 32768,
"rep": 0,
"t_prefill_only_s": 4.4467717689985875,
"kv_gib": 3.0,
"correct": true
},
{
"t_prefill_s": 4.442135782999685,
"t_xfer_s": 0.7519038230020669,
"t_total_s": 5.194113359000767,
"cached": 32752,
"mode": "baseline",
"size": 32768,
"rep": 1,
"t_prefill_only_s": 4.445541313998547,
"kv_gib": 3.0,
"correct": true
},
{
"t_prefill_s": 4.439772993999213,
"t_xfer_s": 0.7855456319994119,
"t_total_s": 5.225392060998274,
"cached": 32752,
"mode": "baseline",
"size": 32768,
"rep": 2,
"t_prefill_only_s": 4.442906365002273,
"kv_gib": 3.0,
"correct": true
}
],
"summary": [
{
"size": 8192,
"n": 3,
"pf_only_ms": 584.6994370003813,
"total_ms": 743.1365060001554,
"overhead_ms": 158.43706899977406,
"all_correct": true
},
{
"size": 16384,
"n": 3,
"pf_only_ms": 1509.417139001016,
"total_ms": 1747.9741930001182,
"overhead_ms": 238.5570539991022,
"all_correct": true
},
{
"size": 32768,
"n": 3,
"pf_only_ms": 4445.541313998547,
"total_ms": 5194.113359000767,
"overhead_ms": 748.57204500222,
"all_correct": true
}
]
}


@@ -0,0 +1,140 @@
{
"mode": "layerwise",
"model": "/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct",
"raw": [
{
"t_A_s": 0.5749198459998297,
"t_B_s": 0.6508419569981925,
"t_total_s": 0.6509377910006151,
"cached": 8176,
"mode": "layerwise",
"size": 8192,
"rep": 0,
"t_prefill_only_s": 1.0447357020020718,
"kv_gib": 0.75,
"correct": true
},
{
"t_A_s": 0.574626908000937,
"t_B_s": 0.6306310719992325,
"t_total_s": 0.6307087300010608,
"cached": 8176,
"mode": "layerwise",
"size": 8192,
"rep": 1,
"t_prefill_only_s": 0.5731983850018878,
"kv_gib": 0.75,
"correct": true
},
{
"t_A_s": 0.5756587910000235,
"t_B_s": 0.6316753270002664,
"t_total_s": 0.6317471290021786,
"cached": 8176,
"mode": "layerwise",
"size": 8192,
"rep": 2,
"t_prefill_only_s": 0.5737888650000968,
"kv_gib": 0.75,
"correct": true
},
{
"t_A_s": 1.4953326409995498,
"t_B_s": 1.5502465710014803,
"t_total_s": 1.5503262860001996,
"cached": 16368,
"mode": "layerwise",
"size": 16384,
"rep": 0,
"t_prefill_only_s": 1.5000705940001353,
"kv_gib": 1.5,
"correct": true
},
{
"t_A_s": 1.493850356000621,
"t_B_s": 1.5505031290012994,
"t_total_s": 1.5505791659998067,
"cached": 16368,
"mode": "layerwise",
"size": 16384,
"rep": 1,
"t_prefill_only_s": 1.4924546469992492,
"kv_gib": 1.5,
"correct": true
},
{
"t_A_s": 1.4979969070009247,
"t_B_s": 1.554968774002191,
"t_total_s": 1.5551903560008213,
"cached": 16368,
"mode": "layerwise",
"size": 16384,
"rep": 2,
"t_prefill_only_s": 1.4914496510027675,
"kv_gib": 1.5,
"correct": true
},
{
"t_A_s": 4.4403588690001925,
"t_B_s": 4.496483378999983,
"t_total_s": 4.4965666819989565,
"cached": 32752,
"mode": "layerwise",
"size": 32768,
"rep": 0,
"t_prefill_only_s": 4.440080869000667,
"kv_gib": 3.0,
"correct": true
},
{
"t_A_s": 4.44209005599987,
"t_B_s": 4.499940814999718,
"t_total_s": 4.500021006002498,
"cached": 32752,
"mode": "layerwise",
"size": 32768,
"rep": 1,
"t_prefill_only_s": 4.440225810998527,
"kv_gib": 3.0,
"correct": true
},
{
"t_A_s": 4.437084657998639,
"t_B_s": 4.496842522999941,
"t_total_s": 4.496926485000586,
"cached": 32752,
"mode": "layerwise",
"size": 32768,
"rep": 2,
"t_prefill_only_s": 4.439449855002749,
"kv_gib": 3.0,
"correct": true
}
],
"summary": [
{
"size": 8192,
"n": 3,
"pf_only_ms": 573.7888650000968,
"total_ms": 631.7471290021786,
"overhead_ms": 57.958264002081705,
"all_correct": true
},
{
"size": 16384,
"n": 3,
"pf_only_ms": 1492.4546469992492,
"total_ms": 1550.5791659998067,
"overhead_ms": 58.124519000557484,
"all_correct": true
},
{
"size": 32768,
"n": 3,
"pf_only_ms": 4440.080869000667,
"total_ms": 4496.926485000586,
"overhead_ms": 56.845615999918664,
"all_correct": true
}
]
}
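
To compare this layerwise summary with a baseline one, the `overhead_ms` values can be set side by side. The sketch below uses the figures from this file and from the first baseline result file above, rounded to one decimal. It assumes the two files come from matching conditions (same sizes, same background load), which the diff does not state, so treat the percentages as indicative only.

```python
# overhead_ms = total_ms - pf_only_ms, per prompt size (tokens)
baseline_overhead_ms = {8192: 123.2, 16384: 201.5, 32768: 528.5}
layerwise_overhead_ms = {8192: 58.0, 16384: 58.1, 32768: 56.8}

hidden_frac = {}
for size, base in baseline_overhead_ms.items():
    lw = layerwise_overhead_ms[size]
    # Share of the baseline overhead that no longer shows up after prefill.
    hidden_frac[size] = (base - lw) / base
    print(f"size={size:>6} baseline={base:6.1f}ms layerwise={lw:5.1f}ms "
          f"hidden={hidden_frac[size] * 100:.0f}%")
```

In these numbers the layerwise overhead stays near 57 to 58 ms across sizes while the baseline overhead grows with prompt length, so the hidden share rises with size.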


@@ -0,0 +1,140 @@
{
"mode": "layerwise",
"model": "/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct",
"raw": [
{
"t_A_s": 0.5905098549992545,
"t_B_s": 0.6900827390018094,
"t_total_s": 0.6904724189989793,
"cached": 8176,
"mode": "layerwise",
"size": 8192,
"rep": 0,
"t_prefill_only_s": 0.5852864849985053,
"kv_gib": 0.75,
"correct": true
},
{
"t_A_s": 0.5897548109969648,
"t_B_s": 0.6827381169969158,
"t_total_s": 0.6828304180016858,
"cached": 8176,
"mode": "layerwise",
"size": 8192,
"rep": 1,
"t_prefill_only_s": 0.5890174580017629,
"kv_gib": 0.75,
"correct": true
},
{
"t_A_s": 0.5850713190011447,
"t_B_s": 0.6744917560026806,
"t_total_s": 0.6745770380002796,
"cached": 8176,
"mode": "layerwise",
"size": 8192,
"rep": 2,
"t_prefill_only_s": 0.5943713950000529,
"kv_gib": 0.75,
"correct": true
},
{
"t_A_s": 1.5030149390004226,
"t_B_s": 1.596173029000056,
"t_total_s": 1.597060264000902,
"cached": 16368,
"mode": "layerwise",
"size": 16384,
"rep": 0,
"t_prefill_only_s": 1.5130829510017065,
"kv_gib": 1.5,
"correct": true
},
{
"t_A_s": 1.499876754998695,
"t_B_s": 1.5940461120007967,
"t_total_s": 1.5948001770011615,
"cached": 16368,
"mode": "layerwise",
"size": 16384,
"rep": 1,
"t_prefill_only_s": 1.5024838620010996,
"kv_gib": 1.5,
"correct": true
},
{
"t_A_s": 1.5068977490009274,
"t_B_s": 1.5950395179970656,
"t_total_s": 1.59571184500237,
"cached": 16368,
"mode": "layerwise",
"size": 16384,
"rep": 2,
"t_prefill_only_s": 1.5303227439981129,
"kv_gib": 1.5,
"correct": true
},
{
"t_A_s": 4.4503932609986805,
"t_B_s": 4.538851200999488,
"t_total_s": 4.539281312001549,
"cached": 32752,
"mode": "layerwise",
"size": 32768,
"rep": 0,
"t_prefill_only_s": 4.446753306998289,
"kv_gib": 3.0,
"correct": true
},
{
"t_A_s": 4.44226107799841,
"t_B_s": 4.551636377997056,
"t_total_s": 4.552389411001059,
"cached": 32752,
"mode": "layerwise",
"size": 32768,
"rep": 1,
"t_prefill_only_s": 4.44538704000297,
"kv_gib": 3.0,
"correct": true
},
{
"t_A_s": 4.440309538000292,
"t_B_s": 4.539836316998844,
"t_total_s": 4.540553365997766,
"cached": 32752,
"mode": "layerwise",
"size": 32768,
"rep": 2,
"t_prefill_only_s": 4.443476915999781,
"kv_gib": 3.0,
"correct": true
}
],
"summary": [
{
"size": 8192,
"n": 3,
"pf_only_ms": 589.0174580017629,
"total_ms": 682.8304180016858,
"overhead_ms": 93.8129599999229,
"all_correct": true
},
{
"size": 16384,
"n": 3,
"pf_only_ms": 1513.0829510017065,
"total_ms": 1595.71184500237,
"overhead_ms": 82.62889400066342,
"all_correct": true
},
{
"size": 32768,
"n": 3,
"pf_only_ms": 4445.38704000297,
"total_ms": 4540.553365997766,
"overhead_ms": 95.16632599479635,
"all_correct": true
}
]
}

File diff suppressed because one or more lines are too long

File diff suppressed because it is too large Load Diff

File diff suppressed because one or more lines are too long

File diff suppressed because it is too large Load Diff


@@ -0,0 +1,33 @@
#!/usr/bin/env bash
# A/B x migration matrix on the 1200-req trace (sequential, ~47 min each).
# 1. unified (no A/B, no migration) anchor
# 2. unified + A+B (documented champion, no mig)
# 3. unified_v3 + A+B + layer-wise (champion + cheap mig)
# We already have: unified_v3 + layer-wise (no A/B) from the prior run.
#
# Q1 (migration benefit w/ layer-wise): #1 vs prior v3+layerwise(noAB)
# Q2 (does migration add to champion): #2 vs #3
set -uo pipefail
PROJ_DIR="${PROJ_DIR:-/home/admin/cpfs/wjh/agentic-kv}"
R="$PROJ_DIR/microbench/connector_tax/layerwise/run_v3_trace.sh"
AB="--overload-factor 1.3 --lmetric-decode-weight 0.01"
LOGD=/tmp/dst_break_logs; mkdir -p "$LOGD"
echo "########## 1/3 unified plain ##########"
TAG=unified_plain POLICY=unified MODE=baseline AB_FLAGS="" \
bash "$R" 2>&1 | tee "$LOGD/abmatrix_1_unified_plain.log" | tail -6
echo "########## 2/3 unified + A+B ##########"
TAG=unified_AB POLICY=unified MODE=baseline AB_FLAGS="$AB" \
bash "$R" 2>&1 | tee "$LOGD/abmatrix_2_unified_AB.log" | tail -6
echo "########## 3/3 unified_v3 + A+B + layer-wise ##########"
TAG=v3_AB_lw POLICY=unified_v3 MODE=layerwise AB_FLAGS="$AB" \
bash "$R" 2>&1 | tee "$LOGD/abmatrix_3_v3_AB_lw.log" | tail -6
echo "########## MATRIX DONE ##########"
for t in unified_plain unified_AB v3_AB_lw; do
D=$(ls -dt "$PROJ_DIR"/outputs/v3trace_${t}_*/unified_v3 2>/dev/null | head -1)
echo "=== $t ($D) ==="
sed -n '/\[stats\]/,/\[done\]/p' "$LOGD"/abmatrix_*_${t}.log 2>/dev/null | grep -E "requests:|TTFT|migrations:" || true
done


@@ -0,0 +1,42 @@
#!/usr/bin/env bash
# Ablation: does the REAL engine-state feed (P2) change each policy's
# performance and ranking vs the stale-shadow baseline?
#
# Each config is run twice (ES=0 shadow-only, ES=1 real-state feed) so the
# ONLY difference is the state source. Sequential, ~47 min each.
#
# Default = the 4 decisive runs (champion + migration, with/without feed).
# Extend CONFIGS for the full sweep (lmetric / unified_kv_both / load_only).
set -uo pipefail
PROJ_DIR="${PROJ_DIR:-/home/admin/cpfs/wjh/agentic-kv}"
R="$PROJ_DIR/microbench/connector_tax/layerwise/run_v3_trace.sh"
AB="--overload-factor 1.3 --lmetric-decode-weight 0.01"
LOGD=/tmp/dst_break_logs; mkdir -p "$LOGD"
# CONFIG format: "TAG|POLICY|MODE|AB?|ES"
CONFIGS=(
"unified_AB_es0|unified|baseline|AB|0"
"unified_AB_es1|unified|baseline|AB|1"
"v3_AB_lw_es0|unified_v3|layerwise|AB|0"
"v3_AB_lw_es1|unified_v3|layerwise|AB|1"
# --- extend for the full sweep ---
# "lmetric_es0|lmetric|baseline|noAB|0"
# "lmetric_es1|lmetric|baseline|noAB|1"
# "ukvboth_AB_es0|unified_kv_both|baseline|AB|0"
# "ukvboth_AB_es1|unified_kv_both|baseline|AB|1"
)
for cfg in "${CONFIGS[@]}"; do
IFS='|' read -r tag policy mode ab es <<< "$cfg"
ab_flags=""; [ "$ab" = "AB" ] && ab_flags="$AB"
echo "########## $tag (policy=$policy mode=$mode ab=$ab es=$es) ##########"
TAG="$tag" POLICY="$policy" MODE="$mode" AB_FLAGS="$ab_flags" ES="$es" \
bash "$R" 2>&1 | tee "$LOGD/abl_${tag}.log" | tail -6
done
echo "########## ABLATION DONE — summary ##########"
for cfg in "${CONFIGS[@]}"; do
IFS='|' read -r tag _ _ _ _ <<< "$cfg"
echo "=== $tag ==="
grep -E "requests:|TTFT|migrations:" "$LOGD/abl_${tag}.log" 2>/dev/null || true
done


@@ -0,0 +1,114 @@
#!/usr/bin/env bash
# MB7 launcher (runs on dash0). Two 2-instance modes selected by MODE env:
# MODE=baseline : restore stock connector, no layerwise env
# MODE=layerwise : deploy mooncake_connector.LAYERWISE.py + MOONCAKE_LAYERWISE=1
#
# Chunked prefill is DISABLED (max-num-batched-tokens >= max prompt) so the
# producer prefill is a single forward and save_kv_layer fires once per layer
# in order — the layer-wise counter assumes this.
#
# The connector is always restored from .ORIG_BACKUP on exit.
#
# Usage (on dash0):
# MODE=baseline bash run_mb7.sh
# MODE=layerwise bash run_mb7.sh
set -uo pipefail
MODE="${MODE:-baseline}"
PROJ_DIR="${PROJ_DIR:-/home/admin/cpfs/wjh/agentic-kv}"
VENV="${VENV:-$PROJ_DIR/.venv}"
MODEL="${MODEL:-/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct}"
GPUS=(${GPUS:-0 1})
SIZES="${SIZES:-8192,16384,32768}"
REPEATS="${REPEATS:-3}"
BG_LOAD="${BG_LOAD:-0}"
CONCURRENT="${CONCURRENT:-1}"
MAX_BATCHED="${MAX_BATCHED:-40960}" # >= max prompt => no chunked prefill
DATE="$(date +%Y%m%d_%H%M)"
OUTDIR="${OUTDIR:-$PROJ_DIR/outputs/mb7_${MODE}_${DATE}}"
PYTHON="$VENV/bin/python"
MC_FILE="$VENV/lib/python3.12/site-packages/vllm/distributed/kv_transfer/kv_connector/v1/mooncake/mooncake_connector.py"
LW_SRC="${LW_SRC:-/tmp/mooncake_connector.LAYERWISE.py}"
DRIVER="$PROJ_DIR/microbench/connector_tax/layerwise/mb7_layerwise.py"
mkdir -p "$OUTDIR/logs"
PORTS=(8000 8001); BPS=(8998 8999)
echo "=== MB7 ($MODE) ==="
echo "Out: $OUTDIR ; connector: $MC_FILE"
restore_connector() {
if [ -f "$MC_FILE.ORIG_BACKUP" ]; then
cp -f "$MC_FILE.ORIG_BACKUP" "$MC_FILE"
echo "[restore] connector reset to ORIG"
fi
}
cleanup() {
pkill -9 -f "vllm serve" 2>/dev/null || true
pkill -9 -f "EngineCore" 2>/dev/null || true
sleep 4
restore_connector
}
trap cleanup EXIT
pkill -9 -f "vllm serve" 2>/dev/null || true; sleep 3
# Deploy the connector for the chosen mode.
if [ "$MODE" = "layerwise" ]; then
if [ ! -f "$LW_SRC" ]; then echo "FATAL: $LW_SRC not found (scp it first)"; exit 1; fi
cp -f "$LW_SRC" "$MC_FILE"
"$PYTHON" -c "import ast; ast.parse(open('$MC_FILE').read()); print('[deploy] LAYERWISE connector AST OK')" || exit 1
LW_ENV="MOONCAKE_LAYERWISE=1"
else
restore_connector
LW_ENV=""
fi
echo "[launch] 2 instances (max-num-batched-tokens=$MAX_BATCHED, chunked-prefill off)"
i=0
for gpu in "${GPUS[@]:0:2}"; do
port=${PORTS[$i]}; bp=${BPS[$i]}; master=$((29700 + i))
env $LW_ENV \
PYTHONHASHSEED=42 VLLM_MOONCAKE_BOOTSTRAP_PORT=$bp \
CUDA_VISIBLE_DEVICES=$gpu MASTER_PORT=$master \
nohup "$VENV/bin/vllm" serve "$MODEL" \
--host 0.0.0.0 --port "$port" --tensor-parallel-size 1 \
--trust-remote-code --enable-prefix-caching --dtype auto \
--gpu-memory-utilization 0.9 --max-model-len 200000 \
--max-num-batched-tokens "$MAX_BATCHED" \
--kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_both"}' \
--enable-prompt-tokens-details \
> "$OUTDIR/logs/vllm_${i}_gpu${gpu}.log" 2>&1 &
disown; sleep 2; i=$((i + 1))
done
echo "[health] waiting ..."
for i in 0 1; do
port=${PORTS[$i]}; tries=0
while ! curl -sf "http://127.0.0.1:$port/health" >/dev/null 2>&1; do
tries=$((tries + 1)); [ $tries -gt 180 ] && { echo "FATAL inst_$i"; exit 1; }
sleep 2
done
echo " inst_$i ready"
done
for i in 0 1; do
bp=${BPS[$i]}; tries=0
while ! curl -sf "http://127.0.0.1:$bp/query" >/dev/null 2>&1; do
tries=$((tries+1)); [ $tries -gt 60 ] && { echo "WARN bp $bp"; break; }; sleep 2
done
done
echo "[run] mb7 --mode $MODE"
"$PYTHON" "$DRIVER" --mode "$MODE" \
--src-port "${PORTS[0]}" --dst-port "${PORTS[1]}" \
--src-bp "${BPS[0]}" --dst-bp "${BPS[1]}" \
--sizes "$SIZES" --repeats "$REPEATS" --bg-load "$BG_LOAD" \
--concurrent "$CONCURRENT" --out "$OUTDIR/mb7_result.json" \
2>&1 | tee "$OUTDIR/mb7_run.txt"
echo "[done] $OUTDIR"
# grep layerwise transfer logs from the producer (gpu0) for sanity
if [ "$MODE" = "layerwise" ]; then
echo "=== producer layerwise log lines ==="
grep -i "layerwise" "$OUTDIR/logs/vllm_0_gpu${GPUS[0]}.log" | tail -10 || true
fi


@@ -0,0 +1,114 @@
#!/usr/bin/env bash
# Full 1200-req v3 trace, two modes (MODE env), for layer-wise re-profile.
# MODE=baseline : stock connector + stock proxy (post-hoc transfer)
# MODE=layerwise : LAYERWISE connector + write-mode proxy (overlapped)
# Both: unified_v3 routing + DR-fix. Connector & proxy restored from backup
# on exit. Output-equivalence/correctness gate = success rate + migrated-req
# TTFT distribution (byte-level KV correctness already validated on mb7).
#
# Usage (on dash0): MODE=baseline bash run_v3_trace.sh
# MODE=layerwise bash run_v3_trace.sh
set -uo pipefail
MODE="${MODE:-baseline}"
POLICY="${POLICY:-unified_v3}"
AB_FLAGS="${AB_FLAGS:-}" # e.g. "--overload-factor 1.3 --lmetric-decode-weight 0.01"
TAG="${TAG:-$MODE}"
PROJ_DIR="${PROJ_DIR:-/home/admin/cpfs/wjh/agentic-kv}"
VENV="$PROJ_DIR/.venv"
VLLM_ROOT="$VENV/lib/python3.12/site-packages/vllm"
TRACE="${TRACE:-$PROJ_DIR/traces/w600_r0.0015_st30.jsonl}"
DATE="$(date +%Y%m%d_%H%M)"
OUTROOT="${OUTROOT:-$PROJ_DIR/outputs/v3trace_${TAG}_${DATE}}"
PYTHON="$VENV/bin/python"
DR_FIX="$PROJ_DIR/microbench/connector_tax/cache_sweep/apply_direct_read_fix.py"
MC_FILE="$VLLM_ROOT/distributed/kv_transfer/kv_connector/v1/mooncake/mooncake_connector.py"
PROXY_FILE="$PROJ_DIR/scripts/cache_aware_proxy.py"
# Staging on shared cpfs (visible on dash0/dash1), not node-local /tmp.
_LWDIR="$PROJ_DIR/microbench/connector_tax/layerwise"
LW_CONN="${LW_CONN:-$_LWDIR/mooncake_connector.LAYERWISE.py}"
WM_PROXY="${WM_PROXY:-$_LWDIR/cache_aware_proxy.WRITEMODE.py}"
ES_INSTR="$_LWDIR/instrument_engine_state.py"
ES="${ES:-0}" # 1 = enable real engine-state feed (P2)
ES_DIR="/dev/shm/agentic_engine_state_${TAG}"
mkdir -p "$OUTROOT"
cfg_dir="$OUTROOT/unified_v3"; mkdir -p "$cfg_dir"
# Backups (connector backup already exists as .ORIG_BACKUP; make proxy one).
[ -f "$MC_FILE.ORIG_BACKUP" ] || cp "$MC_FILE" "$MC_FILE.ORIG_BACKUP"
[ -f "$PROXY_FILE.ORIG_BACKUP" ] || cp "$PROXY_FILE" "$PROXY_FILE.ORIG_BACKUP"
restore() {
cp -f "$MC_FILE.ORIG_BACKUP" "$MC_FILE"
cp -f "$PROXY_FILE.ORIG_BACKUP" "$PROXY_FILE"
"$PYTHON" "$DR_FIX" --revert --vllm-root "$VLLM_ROOT" 2>/dev/null || true
"$PYTHON" "$ES_INSTR" --revert --venv "$VENV" 2>/dev/null || true
rm -rf "$ES_DIR" 2>/dev/null || true
echo "[restore] connector+proxy reset to ORIG, DR-fix + ES-patch reverted"
}
cleanup() {
pkill -9 -f cache_aware_proxy 2>/dev/null || true
pkill -9 -f "vllm serve" 2>/dev/null || true
pkill -9 -f "EngineCore" 2>/dev/null || true
sleep 5
restore
}
trap cleanup EXIT
pkill -9 -f "vllm serve" 2>/dev/null || true; sleep 3
restore # start from clean
echo "=== v3 trace (mode=$MODE es=$ES tag=$TAG) -> $OUTROOT ==="
# Always deploy the enhanced proxy (write-mode + engine-state, both env/flag
# gated; with feed off + write-mode off it behaves identically to stock).
cp -f "$WM_PROXY" "$PROXY_FILE"
if [ "$MODE" = "layerwise" ]; then
cp -f "$LW_CONN" "$MC_FILE"
export MOONCAKE_LAYERWISE=1
export EAR_WRITE_MODE=1
fi
"$PYTHON" -c "import ast; ast.parse(open('$MC_FILE').read()); ast.parse(open('$PROXY_FILE').read()); print('[deploy] proxy + connector AST OK')" || exit 1
PROXY_ES_ARG=""
if [ "$ES" = "1" ]; then
echo "[ES] apply engine-state patch + enable feed at $ES_DIR"
"$PYTHON" "$ES_INSTR" --apply --venv "$VENV"
mkdir -p "$ES_DIR"
export AGENTIC_ENGINE_STATE_URI="file://$ES_DIR"
PROXY_ES_ARG="--engine-state-uri file://$ES_DIR"
fi
echo "[DR-fix] apply"
"$PYTHON" "$DR_FIX" --apply --vllm-root "$VLLM_ROOT"
export VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC=1
echo "[run] $POLICY AB=[$AB_FLAGS] (MOONCAKE_LAYERWISE=${MOONCAKE_LAYERWISE:-0} EAR_WRITE_MODE=${EAR_WRITE_MODE:-0})"
EXTRA_PROXY_ARGS="$AB_FLAGS $PROXY_ES_ARG" bash "$PROJ_DIR/scripts/b3_isolated_policy.sh" "$POLICY" "$TRACE" "$cfg_dir" \
2>&1 | tee "$cfg_dir/orchestrator.log" | tail -20
pkill -9 -f cache_aware_proxy 2>/dev/null || true
pkill -9 -f "vllm serve" 2>/dev/null || true
sleep 5
echo "[stats] $MODE"
"$PYTHON" - "$cfg_dir" << 'PYEOF'
import json, sys, statistics
d = sys.argv[1]
ms = [json.loads(l) for l in open(f"{d}/metrics.jsonl")]
ok = [m for m in ms if not m.get("error")]
ttft = sorted(m["ttft_s"] for m in ok if m.get("ttft_s") is not None)
def p(q): return ttft[min(len(ttft)-1, int(q*len(ttft)))] if ttft else 0
print(f" requests: {len(ms)} success: {len(ok)} ({len(ok)/max(1,len(ms))*100:.1f}%)")
print(f" TTFT s : p50={p(.5):.2f} p90={p(.9):.2f} p99={p(.99):.2f}")
# migrated reqs from proxy breakdown
try:
    bd = json.load(open(f"{d}/breakdown.json"))
    mig = [x for x in bd if x.get("route_class") == "PD_SEP_V2"]
    mids = {x["request_id"] for x in mig}
    mt = sorted(m["ttft_s"] for m in ok if m["request_id"] in mids and m.get("ttft_s"))
    print(f" migrations: {len(mig)} migrated-req TTFT: "
          f"p50={mt[len(mt)//2]:.2f} p90={mt[int(len(mt)*.9)]:.2f} max={mt[-1]:.2f}" if mt else f" migrations: {len(mig)}")
except Exception as e:
    print(f" (breakdown parse: {e})")
PYEOF
echo "[done] $cfg_dir (metrics.jsonl, breakdown.json)"
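
The `p(q)` helper in the stats block above picks an element by index rather than interpolating. The sketch below restates that rule as a standalone function; `nearest_rank` is a hypothetical name and the TTFT values are made up for illustration.

```python
def nearest_rank(sorted_vals, q):
    """Return sorted_vals[int(q * n)], clamped to the last index; 0 if empty."""
    if not sorted_vals:
        return 0
    return sorted_vals[min(len(sorted_vals) - 1, int(q * len(sorted_vals)))]


ttft = sorted([0.4, 0.1, 0.9, 0.3, 0.2, 0.8, 0.5, 0.7, 0.6, 1.0])

p50 = nearest_rank(ttft, 0.5)   # index 5 -> 0.6
p90 = nearest_rank(ttft, 0.9)   # index 9 -> 1.0
p99 = nearest_rank(ttft, 0.99)  # index 9 (clamped) -> 1.0
print(p50, p90, p99)
```

With an even number of samples this p50 is the upper of the two middle values (0.6 here), whereas `statistics.median` would average them to 0.55, so the trace-level p50 and the microbench medians are not computed the same way.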