Merge layerwise KV transfer + engine-state ablation onto main
Brings the worktree-mooncake-layerwise line (layerwise Mooncake connector, write-mode proxy, real engine-state feed + eff_ accessors, mb7 microbench, v3 trace re-profile, A/B x migration matrix runner) into main so the repo is self-contained for these experiments. Disjoint paths (microbench/connector_tax/layerwise/*) => clean merge.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
188
microbench/connector_tax/layerwise/DESIGN.md
Normal file
@@ -0,0 +1,188 @@
# Layer-wise KV transfer on Mooncake: exploration

Goal: make vLLM's `MooncakeConnector` push KV **per-layer during prefill**
(write mode) instead of the current **post-hoc full-request transfer**, then
microbenchmark correctness and whether it hides the transfer behind prefill
compute (the thing MoRIIO's write mode does on AMD; no NVIDIA connector ships it).

Everything here is isolated in worktree `worktree-mooncake-layerwise`. The
dash0 venv connector is backed up at `mooncake_connector.py.ORIG_BACKUP`;
to revert, copy the backup back. Opt-in via env `MOONCAKE_LAYERWISE=1`, so with
the env unset the connector behaves exactly as upstream.

## Baseline flow (post-hoc, what we have)

1. Proxy: prefill on src (`do_remote_decode`, max_tokens=1) → **await done** →
   decode on dst (`do_remote_prefill`), which pulls.
2. dst `start_load_kv`→`receive_kv` sends ZMQ `MooncakeXferMetadata` (its block
   addrs) to src bootstrap.
3. src `send_kv_to_decode`: waits for `send_meta.ready` (set at `request_finished`,
   i.e. **after full prefill**) → `_build_transfer_params` (all layers) →
   `_send_blocks` (one big `batch_transfer_sync_write`) → FINISH response.

Measured: this full transfer is on the critical path and runs at ~3 GB/s under
load (vs ~10 GB/s idle), dominating migration TTFT.

## Layer-wise flow (write mode, this exploration)

Key idea: keep all RDMA + completion on the `sender_loop` thread (clean), but
issue **one `batch_transfer_sync_write` per layer**, each fired as soon as that
layer's KV is computed, so writes overlap the remaining prefill compute.

Signaling: `save_kv_layer(layer_name, ...)` (called by vLLM's attention hook
after each layer's forward, on the main worker thread) records "layer L
computed" and wakes the sender_loop. `send_kv_to_decode` loops L=0..N-1,
waits until L is computed, writes layer L's blocks, and sends FINISH after the
last layer.

### Edits to `mooncake_connector.py` (all gated by `_lw_enabled`)

1. **Worker `__init__`**: `_lw_enabled` (env), layer-name→position map,
   `_lw_computed: dict[transfer_id,int]`, `_lw_active: set[transfer_id]`,
   wake event, lock.
2. **`register_kv_caches`**: build `_lw_layer_pos[layer_name]` (0..N-1) and
   `_lw_addr_idx[pos]` = indices into `kv_caches_base_addr` (×2 if
   `split_k_and_v`).
3. **Scheduler `update_state_after_alloc`** (`do_remote_decode` branch): in
   layer-wise mode, capture `blocks.get_block_ids()[0]` and store it (non-empty)
   in `_reqs_need_send` so the worker learns the local block_ids and sets
   `ready` **before** prefill finishes.
4. **Worker `note_layer_computed(layer_name)`** (new), called from
   `MooncakeConnector.save_kv_layer`: bump `_lw_computed[tid]` for active
   producers, then `call_soon_threadsafe(wake.set)`.
5. **Worker `send_kv_to_decode`**: in layer-wise mode, mark the transfer active,
   loop over layers: await `_lw_computed[tid] >= L`, `_send_blocks` for layer L
   only (subset of `_build_transfer_params`), then send FINISH.
6. **Worker `_build_layer_transfer_params`** (new): like
   `_build_transfer_params` but only the addr indices for one layer position.
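
The signaling between the compute thread and the sender loop can be sketched as below. This is a toy model, not the connector code: `LayerSignal`, `send_layerwise`, and `demo` are hypothetical names, the counter here is "number of layers computed" (so layer `L` is ready once the counter exceeds `L`), and `send_layer` merely stands in for one per-layer write.

```python
import asyncio
import threading


class LayerSignal:
    """Toy per-transfer 'layers computed' counter (all names are illustrative)."""

    def __init__(self, loop: asyncio.AbstractEventLoop):
        self._loop = loop
        self._computed = 0            # number of layers whose KV is ready
        self._lock = threading.Lock()
        self._wake = asyncio.Event()  # only set/cleared on the sender loop

    def note_layer_computed(self) -> None:
        """Called from the compute thread after each layer's forward."""
        with self._lock:
            self._computed += 1
        # asyncio.Event is not thread-safe: hand the wake-up to the loop.
        self._loop.call_soon_threadsafe(self._wake.set)

    async def wait_for_layer(self, layer: int) -> None:
        """Return once layer index `layer` (0-based) has been computed."""
        while True:
            with self._lock:
                if self._computed > layer:
                    return
            # No await between the check and clear(), so a wake-up scheduled
            # in between still runs after clear() and is not lost.
            self._wake.clear()
            await self._wake.wait()


async def send_layerwise(signal, num_layers, send_layer) -> str:
    """Ship layers in order, each as soon as it is computed."""
    for layer in range(num_layers):
        await signal.wait_for_layer(layer)
        send_layer(layer)  # stand-in for one per-layer batch write
    return "FINISH"


async def demo(num_layers: int = 4):
    signal = LayerSignal(asyncio.get_running_loop())
    sent = []

    def fake_prefill():
        # Pretend each iteration is one layer's forward pass.
        for _ in range(num_layers):
            signal.note_layer_computed()

    worker = threading.Thread(target=fake_prefill)
    worker.start()
    status = await send_layerwise(signal, num_layers, sent.append)
    worker.join()
    return sent, status
```

Running `asyncio.run(demo())` ships the layers in index order regardless of how the two threads interleave, because the sender only advances when the counter has passed the layer it is waiting for.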

### Microbench requirements

- Disable chunked prefill (`--max-num-batched-tokens` ≥ prompt) so prefill is a
  single forward and `save_kv_layer` fires once per layer in order.
- Dispatch the dst (`do_remote_prefill`) request **first/concurrently** so the
  ZMQ handshake reaches src during prefill.
- Correctness: dst follow-up `cached_tokens == prompt_len` (KV landed),
  identical to baseline.
- Perf: src prefill wall-clock (does layer-wise slow it?) and dst TTFT (does
  transfer leave the critical path?), swept over KV size, vs baseline.

## Status

- [x] worktree + connector backup + design
- [x] modified connector (LAYERWISE.py, +193/-4 lines, env-gated)
- [x] correctness microbench (mb7_layerwise.py) + launcher (run_mb7.sh)
- [x] correctness run on dash0: PASS (KV lands; cached == prompt)
- [x] perf run + verdict: POSITIVE (transfer hidden behind prefill)

## Results (2-instance, idle, chunked-prefill off, Qwen3-30B-A3B, 48 layers)

Metric: `overhead = total − prefill_only`, i.e. the transfer cost left on the
critical path (TTFT). Baseline = post-hoc full pull (sequential).

| KV size, tokens (bytes) | baseline overhead | **layerwise overhead** | reduction |
|--------:|------------------:|-----------------------:|----------:|
| 8192 (0.75 GiB) | 123 ms | **58 ms** | 2.1× |
| 16384 (1.5 GiB) | 202 ms | **58 ms** | 3.5× |
| 32768 (3.0 GiB) | 529 ms | **57 ms** | 9.3× |

Key signatures:
- **Layerwise overhead is ~constant (~58 ms)** regardless of KV size, while
  baseline grows O(KV size). The 58 ms is handshake + last-layer tail + 1
  decode; the bulk transfer is hidden behind prefill compute.
- **Prefill did NOT slow down**: layerwise `t_A` (575/1495/4440 ms) ==
  `prefill_only` (574/1492/4440 ms). The concurrent RDMA was "free" on idle
  GPUs: no measurable HBM contention with prefill compute here.
- Producer logs confirm the transfer itself took 0.39/0.55/4.37 s (grows with
  size) yet ran *inside* the prefill window, so it left the critical path.
- **Correctness PASS**: B's follow-up cached == prompt for all sizes; the
  48-layer / 96-base-addr (split K&V) per-layer addressing is correct.
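
The reduction column follows directly from the two overhead columns. A minimal check, using only the numbers in the idle table (the helper name `reduction` is made up for illustration):

```python
# Idle overheads in ms, copied from the table: tokens -> (baseline, layerwise).
results_ms = {8192: (123, 58), 16384: (202, 58), 32768: (529, 57)}


def reduction(baseline_ms: float, layerwise_ms: float) -> float:
    """How many times smaller the critical-path overhead became."""
    return baseline_ms / layerwise_ms


for tokens, (base, lw) in results_ms.items():
    print(f"{tokens:>6} tokens: {base} ms -> {lw} ms ({reduction(base, lw):.1f}x)")
```

Because the layerwise overhead stays near 58 ms while the baseline grows with KV size, the ratio keeps growing with context length.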

## Caveats (why this is a proof-of-concept, not a verdict for production)

1. **Idle instances only.** Real migration happens between *busy* instances.
   Under load both prefill and transfer slow down; the transfer (even at
   ~3 GB/s) still takes less time than prefill for big contexts, so it should
   still hide, but receive-side (B) and HBM contention during prefill are
   untested here. NEXT: rerun with background load on both A and B.
2. **Chunked prefill disabled.** The monotonic layer counter assumes one
   forward, layers in order. Production uses chunked prefill (multi-step),
   which needs per-(chunk,layer) tracking; that is not implemented.
3. **Single concurrent producer transfer.** Global counter; real migration is
   concurrent. Would need per-transfer state.
4. **Microbench dispatch.** mb7 fires B then A with a 50 ms head start to get
   the handshake to A before its forward. The real proxy path
   (`_handle_combined_pd_sep_v2`) dispatches sequentially and would need the
   write-mode (concurrent) restructure.
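
The dispatch order in caveat 4 (decode-side request first, prefill-side request shortly after, both in flight at once) can be sketched with `asyncio`. This is only an illustration of the ordering: `dispatch_write_mode` and the two callables are hypothetical, and the real proxy handler is not shown here.

```python
import asyncio


async def dispatch_write_mode(send_decode, send_prefill, head_start_s=0.05):
    """Start the dst (B) request, wait a short head start, then run src (A).

    `send_decode` and `send_prefill` are caller-supplied coroutine functions
    (placeholders for the two HTTP requests).
    """
    decode_task = asyncio.create_task(send_decode())  # B goes out first
    await asyncio.sleep(head_start_s)                 # let the handshake land
    prefill_result = await send_prefill()             # A prefills while B waits
    decode_result = await decode_task
    return prefill_result, decode_result


async def demo():
    events = []

    async def fake_decode():
        events.append("B start")
        await asyncio.sleep(0.10)  # B blocks until KV has arrived
        events.append("B done")
        return "decoded"

    async def fake_prefill():
        events.append("A start")
        await asyncio.sleep(0.02)
        events.append("A done")
        return "prefilled"

    results = await dispatch_write_mode(fake_decode, fake_prefill, head_start_s=0.01)
    return events, results
```

A sequential proxy would instead await the prefill before creating the decode request, which is exactly the ordering that keeps the transfer on the critical path.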

## Results under LOAD (bg=16 background decode streams, 8 per instance)

Critical-path transfer overhead (ms), `total − prefill_only`:

| KV size (tokens) | idle base | idle LW | **load base** | **load LW** |
|--------:|----------:|--------:|--------------:|------------:|
| 8k | 123 | 58 | 158 | **94** |
| 16k | 202 | 58 | 239 | **83** |
| 32k | 529 | 57 | **749** | **95** |

The overlap **survives load**: layerwise overhead stays ~constant (~90 ms)
under load while baseline grows to 749 ms at 32k (7.9× reduction). Prefill did
not slow (load LW `t_A` == load `prefill_only`); the transfer (0.56/1.46/4.37 s,
producer logs) ran inside the prefill window even with 16 concurrent decodes.
Correctness PASS under load.

## FULL 1200-req v3 TRACE re-profile (chunk-safe + concurrent + write-mode)

Hardened connector (per-step incremental shipping, per-transfer state) +
write-mode proxy (concurrent prefill/decode dispatch). Two passes of
`w600_r0.0015_st30.jsonl` under `unified_v3`, differing only in transfer mode.

Correctness: layer-wise **1213/1214 success** (1 connection-error on the 128k
req, not KV corruption); byte-level KV correctness validated on mb7
(chunked + 3-way concurrent, `cached==prompt`); producer logs confirm
incremental shipping (e.g. `shipped 7872/7872 blocks`).

Migration sets differ between runs (write-mode timing shifts which requests
trigger migration; only 4 migrated in both), but are distributionally
comparable (median new_local/input ≈ 0.42 vs 0.46). **Matched migrations
all improved**, scaling with the transfer hidden behind prefill:

| request | input (tokens) | new_local (tokens) | base TTFT (s) | LW TTFT (s) | Δ |
|---|--:|--:|--:|--:|--:|
| 1268630 | 102k | 97k | 41.20 | 33.96 | **−7.23s** |
| 1334223 | 37k | 14k | 6.04 | 3.23 | −2.81s |
| 1279412 | 40k | 8k | 5.50 | 2.92 | −2.58s |
| 1271459 | 8.9k | 8.9k | 37.01 | 36.98 | −0.03s (queue-bound) |

Trace-level TTFT in seconds (different sets, directional): overall p90
9.79→9.16 (−6%), p99 44.89→42.85 (−5%). **Modest** because (a) migrations are
only 25/1214 ≈ **2%** of requests, and (b) several migrations are
queue/contention-bound, not transfer-bound: layer-wise removes the transfer
component but not the control-plane/queue residual (the ~45% from the
b3_v3_fullbreak profile).
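
The trace-level percentages can be re-derived from the quoted numbers. A small sketch (the helper `pct_change` is illustrative; inputs are the p90/p99 TTFT values and the migration count stated in the text):

```python
def pct_change(before: float, after: float) -> float:
    """Relative change in percent; negative means the metric went down."""
    return 100.0 * (after - before) / before


p90_change = pct_change(9.79, 9.16)      # overall p90 TTFT, seconds
p99_change = pct_change(44.89, 42.85)    # overall p99 TTFT, seconds
migrated_share = 100.0 * 25 / 1214       # percent of requests that migrated

print(f"p90 {p90_change:.1f}%  p99 {p99_change:.1f}%  migrated {migrated_share:.1f}%")
```

With only about 2% of requests migrating, even multi-second gains on individual migrations move the overall percentiles by a few percent at most.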

**Verdict on the trace re-profile:** layer-wise does exactly what the profile
predicted: it removes the transfer half of migration overhead (matched
migrations −2.6 to −7.2s, biggest where there's the most prefill to hide
behind), but the trace-level gain is small because migrations are rare and
partly queue-bound. It does NOT, on its own, flip migration to a clear win
over unified for this agentic workload.

## Verdict (microbench)

The mechanism **works and the benefit holds under load**: layer-wise push turns
migration's KV-transfer cost from O(KV size) on the critical path into a
near-constant ~90 ms tail by overlapping it with prefill compute. This is what
MoRIIO's write mode does on AMD, now demonstrated on NVIDIA/Mooncake.

**BUT this is single-transfer, non-chunked.** Running the actual 1200-req trace
correctly needs two more pieces this PoC does NOT have:
1. **Chunk-safe tracking**: long agentic prompts force chunked prefill;
   `save_kv_layer` then fires per-chunk and the monotonic counter would ship
   uncomputed blocks. Needs slot-mapping-aware per-(request,chunk) tracking.
2. **Concurrent-transfer safety**: the global counter assumes one producer at
   a time; the trace migrates from busy instances running other forwards.

Also: even with those fixed, layer-wise only removes the **transfer half** of
the measured migration overhead. The b3_v3_fullbreak profile showed dst-side
`T_kv_pull` = ~55% RDMA + ~45% control-plane GIL-dispatch stalls; layer-wise
hides the RDMA half but the control-plane half is orthogonal. So a trace
re-profile would show roughly the transfer half collapse, not the whole thing.
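
The expectation in the last paragraph amounts to a one-line model. A sketch, assuming the ~55% / ~45% split holds and that layer-wise hides the whole RDMA share (the function name and the 2.0 s input are hypothetical, chosen only to show the shape of the estimate):

```python
def residual_after_layerwise(t_kv_pull_s: float, rdma_frac: float = 0.55) -> float:
    """Part of T_kv_pull still on the critical path if only the RDMA share is hidden."""
    return t_kv_pull_s * (1.0 - rdma_frac)


# Hypothetical 2.0 s pull: roughly 0.9 s of control-plane stall would remain.
remaining = residual_after_layerwise(2.0)
```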
82
microbench/connector_tax/layerwise/P2_ENGINE_STATE.md
Normal file
@@ -0,0 +1,82 @@
# P2: real engine-state feed for migration target selection

Problem: the router (`cache_aware_proxy.py`) decides migration targets from
**shadow counters** it maintains itself (incremented at dispatch, decremented
at completion) and reconciles to vLLM `/metrics` only every **30 s**
(`_reconcile_loop`). So every routing/migration decision is made on stale
state. Worse, the signal that predicts the ~45% control-plane stall, *is the
target mid-large-prefill?* (a big prefill holds the GIL and starves the mooncake
receiver_loop), isn't visible at all, and `/metrics` doesn't expose it either.

Fix: vLLM publishes **real** per-engine state to a shared store at ~20 Hz; the
router reads ground truth and avoids GIL-stall / capacity-wall targets.

## Components (all unit-tested without GPUs)

- `engine_state.py`: canonical `compute_snapshot(scheduler, id)`, `StateWriter`,
  `StateReader`. Schema per engine: `ts, num_running, num_waiting,
  gpu_blocks_total/free, gpu_kv_used_frac, pending_prefill_tokens,
  ongoing_decode_tokens, num_prefilling, max_prefill_remaining`.
- `instrument_engine_state.py`: vLLM `Scheduler` patch (apply/revert markers
  `ES_INSTRUMENT_*`): a daemon thread publishes the snapshot every
  `AGENTIC_ENGINE_STATE_PERIOD_MS` (50 ms) off the forward hot path. Inlined
  writer (engine process needs no repo import). Coexists with MB5.
- `migration_target.py`: pure target scorer: avoid `max_prefill_remaining ≥
  es_big_prefill_threshold` (GIL stall) and `gpu_kv_used_frac ≥ es_kv_wall_frac`
  (capacity wall), then rank by cache-richness and **real** load.
- `cache_aware_proxy.WRITEMODE.py`: wired: `InstanceState.real_state`,
  `_engine_state_poll_loop` (instance i ← `engine_{i}`), `_real_load`/Gate-3 and
  Mechanism-B now real-state-aware. `--engine-state-uri` flag; off ⇒ identical
  to before (shadow only).
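
The scorer described for `migration_target.py` is not part of this diff, so the following is only a guess at its shape: filter out GIL-stall and capacity-wall targets, then rank the rest. The function name, the candidate fields (`cached_tokens`, `shadow_load`), the default thresholds, and the exact ranking order are all assumptions made for illustration; only the snapshot field names come from the schema above.

```python
def pick_migration_target(candidates, real_state,
                          big_prefill_threshold=32768, kv_wall_frac=0.9):
    """Return the chosen engine id, or None if every candidate is filtered out.

    candidates: {engine_id: {"cached_tokens": int, "shadow_load": int}}
    real_state: {engine_id: snapshot dict}; entries may be missing when the
                feed is off or stale, in which case shadow load is used.
    """
    scored = []
    for eid, cand in candidates.items():
        snap = real_state.get(eid)
        if snap is not None:
            if snap["max_prefill_remaining"] >= big_prefill_threshold:
                continue  # target is mid-large-prefill: expect a GIL stall
            if snap["gpu_kv_used_frac"] >= kv_wall_frac:
                continue  # target is at the KV capacity wall
            load = snap["num_running"] + snap["num_waiting"]  # real load
        else:
            load = cand["shadow_load"]  # graceful fallback to shadow counters
        # Assumed ranking: richest cache first, then lowest load.
        scored.append((-cand["cached_tokens"], load, eid))
    return min(scored)[2] if scored else None
```

With this ordering, an engine that looks idle in the shadow counters but reports a large `max_prefill_remaining` is skipped, which is the case the unit tests call out.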

Transport (`AGENTIC_ENGINE_STATE_URI` / `--engine-state-uri`):
`file:///dev/shm/agentic_engine_state` (default, zero-dep, single-node) or
`redis://host:port/0` (multi-node; needs redis-py + server, which are not
installed on dash0, so the file backend is the working default).

## Tests (no GPU)
- `compute_snapshot` field math (mock scheduler): running/waiting,
  max_prefill_remaining, pending, decode, kv_used_frac.
- writer→reader round-trip + staleness drop (file backend).
- target scorer: 5 cases incl. *avoid GIL-stall target even when its shadow
  load is lower*, *real load beats stale shadow*, *cache-rich wins*,
  *avoid KV wall*, *graceful fallback when feed missing*.
- end-to-end: publish 8 engines (one mid-130k-prefill) → proxy inlined reader →
  target selection avoids it.

## Enabling in a GPU run (when free)
1. `instrument_engine_state.py --apply` on the dash0 venv.
2. `export AGENTIC_ENGINE_STATE_URI=file:///dev/shm/agentic_engine_state`
   before the launcher (vLLM instances inherit it; `AGENTIC_WORKER_ID=engine_{i}`
   already set by `b3_isolated_policy.sh` → publishes as `engine_{i}`).
3. Proxy: `EXTRA_PROXY_ARGS="--engine-state-uri file:///dev/shm/agentic_engine_state ..."`.
4. Revert the patch + `rm -rf /dev/shm/agentic_engine_state` after.

## ALL policies now read the real state (update)
`InstanceState` exposes effective accessors used by **every** picker:
`eff_num_requests / eff_pending_prefill / eff_ongoing_decode /
eff_ongoing_tokens` = `max(shadow, real)` when the feed is fresh (real fixes
the 30s-stale under-count; shadow's atomic pre-await reservation still covers
the in-flight window, preserving the RaceFix), plus real-only
`r_max_prefill_remaining / r_kv_used_frac`. Wired into: `load_only`, `lmetric`,
`sticky`, `pick_instance` (legacy), `pick_instance_unified_hybrid`
(unified / unified_kv_both), `pick_instance_unified_v3` (gate + Mechanism B),
and `snapshot_workers` (logged scores now match the decision + real fields).
Feed off ⇒ `real_state is None` ⇒ accessors return shadow ⇒ byte-identical to
before. (Legacy `unified_v2` is left on shadow: retired, not in the ablation.)
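
The `max(shadow, real)` rule can be illustrated with a stripped-down stand-in for `InstanceState`. The class below is hypothetical, and mapping "real requests" to `num_running + num_waiting` is an assumption; the proxy's actual field names are in the suppressed diff.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InstanceStateSketch:
    """Illustrative stand-in for the proxy's InstanceState (fields are assumptions)."""
    shadow_num_requests: int = 0
    real_state: Optional[dict] = None  # None when the feed is off or stale

    @property
    def eff_num_requests(self) -> int:
        if self.real_state is None:
            return self.shadow_num_requests  # feed off: shadow only
        real = self.real_state["num_running"] + self.real_state["num_waiting"]
        # Real fixes a stale under-count; shadow still covers requests that
        # were reserved at dispatch but are not yet visible to the engine.
        return max(self.shadow_num_requests, real)
```

Taking the maximum means neither source can make an instance look emptier than the other believes it is, which is the conservative choice for load-based pickers.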

## Ablation (when GPU free)
`run_v3_trace.sh` gains `ES=1` (apply engine-state patch + feed + proxy flag)
and always deploys the enhanced proxy (dormant when feed/write-mode off).
`run_ablation_es.sh` runs each config twice (ES=0 vs ES=1) so the only
difference is the state source. Default decisive set (4 runs): champion
`unified+A+B` and `unified_v3+A+B+layerwise`, each ES0/ES1. Extend CONFIGS for
`lmetric` / `unified_kv_both` / `load_only`. Compares per-policy TTFT
(overall + migrated) and whether the **ranking** changes with ground-truth
state.

## Status / scope
- Built + unit-tested (snapshot, round-trip, target scorer, eff_ accessors,
  end-to-end publish→read→select); NOT yet run against live engines (GPU busy).
- TP=1 only (one EngineCore/instance → one publisher/engine_id). TP>1 needs
  per-rank ids.
1907
microbench/connector_tax/layerwise/cache_aware_proxy.WRITEMODE.py
Normal file
File diff suppressed because it is too large
140
microbench/connector_tax/layerwise/engine_state.py
Normal file
@@ -0,0 +1,140 @@
#!/usr/bin/env python3
"""Engine-state store: canonical snapshot + writer/reader, shared schema.

The vLLM scheduler patch (instrument_engine_state.py) inlines a faithful copy
of `compute_snapshot` + the file/redis writer (engine process needs no repo
import). The router (cache_aware_proxy) imports `StateReader` here to read the
real per-engine state instead of its stale shadow counters.

Schema (one record per engine, key = engine_id):
  ts, engine_id, num_running, num_waiting, gpu_blocks_total, gpu_blocks_free,
  gpu_kv_used_frac, pending_prefill_tokens, ongoing_decode_tokens,
  num_prefilling, max_prefill_remaining

Transport URIs:
  file:///dev/shm/agentic_engine_state   (default; atomic temp+rename)
  redis://host:port/0                    (optional; needs redis-py)
"""
from __future__ import annotations

import json
import os
import time


def compute_snapshot(scheduler, engine_id: str) -> dict:
    """Cheap O(batch) read of routing-relevant real state from a live
    vLLM V1 Scheduler (duck-typed for testability)."""
    try:
        pool = scheduler.kv_cache_manager.block_pool
        total = int(pool.num_gpu_blocks)
        free = int(pool.get_num_free_blocks())
    except Exception:
        total = free = -1  # block pool unavailable: report sentinel values
    n_run = pend = dec = n_pref = max_pref = 0
    try:
        for r in scheduler.running:
            n_run += 1
            npr = int(getattr(r, "num_prompt_tokens", 0))
            nct = int(getattr(r, "num_computed_tokens", 0))
            if nct < npr:  # still prefilling
                rem = npr - nct
                pend += rem
                n_pref += 1
                max_pref = max(max_pref, rem)
            else:  # decoding
                dec += int(getattr(r, "num_tokens", 0))
    except Exception:
        pass
    n_wait = 0
    try:
        n_wait = len(scheduler.waiting) + len(getattr(scheduler, "skipped_waiting", []))
        # Waiting requests also contribute their not-yet-computed prompt tokens.
        for r in list(scheduler.waiting):
            pend += max(0, int(getattr(r, "num_prompt_tokens", 0))
                        - int(getattr(r, "num_computed_tokens", 0)))
    except Exception:
        pass
    used = ((total - free) / total) if (total and total > 0) else -1.0
    return {
        "ts": time.time(),
        "engine_id": engine_id,
        "num_running": n_run,
        "num_waiting": int(n_wait),
        "gpu_blocks_total": total,
        "gpu_blocks_free": free,
        "gpu_kv_used_frac": used,
        "pending_prefill_tokens": int(pend),
        "ongoing_decode_tokens": int(dec),
        "num_prefilling": n_pref,
        "max_prefill_remaining": int(max_pref),
    }


class StateWriter:
    def __init__(self, uri: str, engine_id: str):
        self.engine_id = engine_id
        self.kind = None
        if uri.startswith("file://"):
            self.kind = "file"
            self.dir = uri[len("file://"):]
            os.makedirs(self.dir, exist_ok=True)
            self.path = os.path.join(self.dir, f"{engine_id}.json")
            self.tmp = self.path + f".tmp.{os.getpid()}"
        elif uri.startswith("redis://"):
            self.kind = "redis"
            import redis
            self.r = redis.Redis.from_url(uri)
            self.key = f"engine_state:{engine_id}"
        else:
            raise ValueError(f"unsupported engine-state URI: {uri}")

    def publish(self, state: dict):
        if self.kind == "file":
            with open(self.tmp, "w") as f:
                f.write(json.dumps(state))
            os.replace(self.tmp, self.path)
        elif self.kind == "redis":
            self.r.set(self.key, json.dumps(state), ex=5)


class StateReader:
    """Router-side reader. read_all() returns {engine_id: state}, dropping
    records older than max_age_s (so a dead/hung engine is ignored)."""
    def __init__(self, uri: str, max_age_s: float = 2.0):
        self.uri = uri
        self.max_age_s = max_age_s
        self.kind = None
        if uri.startswith("file://"):
            self.kind = "file"
            self.dir = uri[len("file://"):]
        elif uri.startswith("redis://"):
            self.kind = "redis"
            import redis
            self.r = redis.Redis.from_url(uri)
        else:
            raise ValueError(f"unsupported engine-state URI: {uri}")

    def read_all(self) -> dict[str, dict]:
        now = time.time()
        out: dict[str, dict] = {}
        try:
            if self.kind == "file":
                import glob
                for p in glob.glob(os.path.join(self.dir, "*.json")):
                    try:
                        # Close the handle promptly; this runs on every poll.
                        with open(p) as f:
                            s = json.load(f)
                    except Exception:
                        continue
                    if now - s.get("ts", 0) <= self.max_age_s:
                        out[s.get("engine_id", os.path.basename(p)[:-5])] = s
            elif self.kind == "redis":
                for k in self.r.scan_iter("engine_state:*"):
                    v = self.r.get(k)
                    if not v:
                        continue
                    s = json.loads(v)
                    if now - s.get("ts", 0) <= self.max_age_s:
                        out[s.get("engine_id")] = s
        except Exception:
            pass
        return out
234
microbench/connector_tax/layerwise/instrument_engine_state.py
Normal file
@@ -0,0 +1,234 @@
#!/usr/bin/env python3
"""Patch vLLM V1 scheduler to publish REAL engine state to a shared store,
so the global router reads ground truth instead of its own stale shadow
counters (reconciled only every 30s).

Published per engine (key = AGENTIC_ENGINE_ID), throttled ~20 Hz from a
daemon thread (off the forward hot path):

  {ts, num_running, num_waiting, gpu_blocks_total, gpu_blocks_free,
   gpu_kv_used_frac, pending_prefill_tokens, ongoing_decode_tokens,
   num_prefilling, max_prefill_remaining}

`max_prefill_remaining` is the key signal /metrics does NOT expose: the
largest in-progress prefill on the engine. A big in-progress prefill holds
the GIL and stalls the mooncake receiver_loop, so the router should avoid
migrating KV to such an instance (P2).

Transport (env AGENTIC_ENGINE_STATE_URI):
  file:///dev/shm/agentic_engine_state   (default; atomic temp+rename)
  redis://host:port/0                    (optional; needs redis-py + server)

Self-contained (inlined writer) so the engine process needs no repo import.
Apply/revert markers: # ES_INSTRUMENT_START / # ES_INSTRUMENT_END.

Usage:
  python instrument_engine_state.py --apply  [--venv PATH]
  python instrument_engine_state.py --revert [--venv PATH]
  python instrument_engine_state.py --check  [--venv PATH]
"""
from __future__ import annotations

import argparse
import re
from pathlib import Path

DEFAULT_VENV = Path("/home/admin/cpfs/wjh/agentic-kv/.venv")
TARGET_REL = "lib/python3.12/site-packages/vllm/v1/core/sched/scheduler.py"
START = "# ES_INSTRUMENT_START"
END = "# ES_INSTRUMENT_END"

# ---- Patch 1: header (writer + publisher thread), before class Scheduler ----
HEADER_ANCHOR = "class Scheduler(SchedulerInterface):"
HEADER = f'''{START}
import json as _es_json
import os as _es_os
import threading as _es_threading
import time as _es_time

_ES_URI = _es_os.environ.get("AGENTIC_ENGINE_STATE_URI", "")
_ES_ID = _es_os.environ.get("AGENTIC_ENGINE_ID") or _es_os.environ.get(
    "AGENTIC_WORKER_ID", f"engine_{{_es_os.getpid()}}")
_ES_PERIOD_S = float(_es_os.environ.get("AGENTIC_ENGINE_STATE_PERIOD_MS", "50")) / 1000.0


class _ESWriter:
    """Pluggable state writer: file:// (atomic temp+rename) or redis://."""
    def __init__(self, uri: str, engine_id: str):
        self.engine_id = engine_id
        self.kind = None
        if uri.startswith("file://"):
            self.kind = "file"
            self.dir = uri[len("file://"):]
            _es_os.makedirs(self.dir, exist_ok=True)
            self.path = _es_os.path.join(self.dir, f"{{engine_id}}.json")
            self.tmp = self.path + f".tmp.{{_es_os.getpid()}}"
        elif uri.startswith("redis://"):
            self.kind = "redis"
            import redis  # lazy
            self.r = redis.Redis.from_url(uri)
            self.key = f"engine_state:{{engine_id}}"

    def publish(self, state: dict):
        try:
            if self.kind == "file":
                with open(self.tmp, "w") as f:
                    f.write(_es_json.dumps(state))
                _es_os.replace(self.tmp, self.path)  # atomic
            elif self.kind == "redis":
                self.r.set(self.key, _es_json.dumps(state), ex=5)
        except Exception:
            pass


def _es_compute_snapshot(scheduler) -> dict:
    """Cheap O(batch) state read from the live scheduler."""
    try:
        kvm = scheduler.kv_cache_manager
        pool = kvm.block_pool
        total = int(pool.num_gpu_blocks)
        free = int(pool.get_num_free_blocks())
    except Exception:
        total = free = -1
    n_run = 0
    pend = 0
    dec = 0
    n_pref = 0
    max_pref = 0
    try:
        for r in scheduler.running:
            n_run += 1
            npr = int(getattr(r, "num_prompt_tokens", 0))
            nct = int(getattr(r, "num_computed_tokens", 0))
            if nct < npr:  # still prefilling
                rem = npr - nct
                pend += rem
                n_pref += 1
                if rem > max_pref:
                    max_pref = rem
            else:  # decoding
                dec += int(getattr(r, "num_tokens", 0))
    except Exception:
        pass
    n_wait = 0
    try:
        n_wait = len(scheduler.waiting) + len(getattr(scheduler, "skipped_waiting", []))
        for r in list(scheduler.waiting):
            pend += max(0, int(getattr(r, "num_prompt_tokens", 0))
                        - int(getattr(r, "num_computed_tokens", 0)))
    except Exception:
        pass
    used_frac = ((total - free) / total) if (total and total > 0) else -1.0
    return {{
        "ts": _es_time.time(),
        "engine_id": _ES_ID,
        "num_running": n_run,
        "num_waiting": int(n_wait),
        "gpu_blocks_total": total,
        "gpu_blocks_free": free,
        "gpu_kv_used_frac": used_frac,
        "pending_prefill_tokens": int(pend),
        "ongoing_decode_tokens": int(dec),
        "num_prefilling": n_pref,
        "max_prefill_remaining": int(max_pref),
    }}


class _ESPublisher:
    def __init__(self, scheduler):
        self._sched = scheduler
        self._writer = _ESWriter(_ES_URI, _ES_ID)
        self._stop = _es_threading.Event()
        self._t = _es_threading.Thread(target=self._loop, daemon=True)
        self._t.start()

    def _loop(self):
        while not self._stop.is_set():
            try:
                self._writer.publish(_es_compute_snapshot(self._sched))
            except Exception:
                pass
            _es_time.sleep(_ES_PERIOD_S)
{END}


'''

# ---- Patch 2: start the publisher at the end of Scheduler.__init__ ----------
# Anchor on the existing agentic step-log block tail in __init__.
INIT_ANCHOR = """        _step_path = _os.environ.get("AGENTIC_STEP_LOG_PATH")"""
INIT_INSERT = f"""        {START}
        if _ES_URI:
            try:
                self._es_publisher = _ESPublisher(self)
                logger.info("agentic engine-state publisher: uri=%s id=%s",
                            _ES_URI, _ES_ID)
            except Exception as _e:
                logger.warning("engine-state publisher disabled (%r)", _e)
        {END}
        _step_path = _os.environ.get("AGENTIC_STEP_LOG_PATH")"""

PATCHES = [
    ("header", HEADER_ANCHOR, HEADER + HEADER_ANCHOR),
    ("init", INIT_ANCHOR, INIT_INSERT),
]
|
||||||
|
|
||||||
|
|
||||||
|
def find_target(venv: Path) -> Path:
    # try the requested venv first, then fall back to the default one
    for c in (venv / TARGET_REL, DEFAULT_VENV / TARGET_REL):
        if c.is_file():
            return c
    raise FileNotFoundError(f"cannot find {TARGET_REL} under {venv}")


def is_patched(t: str) -> bool:
    return START in t


def apply(target: Path):
    text = target.read_text()
    if is_patched(text):
        print(f"[es-instr] already patched: {target}")
        return
    new = text
    for name, src, dst in PATCHES:
        # every anchor must be present, otherwise nothing is written
        if src not in new:
            raise RuntimeError(f"patch {name!r}: anchor not found in {target}")
        new = new.replace(src, dst, 1)
    target.write_text(new)
    print(f"[es-instr] applied {len(PATCHES)} patches -> {target}")


def revert(target: Path):
    text = target.read_text()
    if not is_patched(text):
        print(f"[es-instr] not patched: {target}")
        return
    # drop every START..END block, including leading indentation and the END newline
    pat = re.compile(r"[ \t]*" + re.escape(START) + r".*?" + re.escape(END) + r"\n",
                     flags=re.DOTALL)
    new = pat.sub("", text)
    new = re.sub(r"\n{3,}class Scheduler\(", "\n\nclass Scheduler(", new)
    target.write_text(new)
    print(f"[es-instr] reverted: {target}")


def main():
    p = argparse.ArgumentParser()
    p.add_argument("--apply", action="store_true")
    p.add_argument("--revert", action="store_true")
    p.add_argument("--check", action="store_true")
    p.add_argument("--venv", type=Path, default=DEFAULT_VENV)
    a = p.parse_args()
    t = find_target(a.venv)
    if a.apply:
        apply(t)
    elif a.revert:
        revert(t)
    elif a.check:
        print(f"[es-instr] {'PATCHED' if is_patched(t.read_text()) else 'CLEAN'}: {t}")
    else:
        p.error("specify --apply/--revert/--check")


if __name__ == "__main__":
    main()
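The apply/revert pair above relies on marker-delimited blocks inserted at a known anchor. A minimal, self-contained sketch of that round trip follows; the marker strings, the `ES_URI` name, and the in-memory `original` text are hypothetical stand-ins, not the values used by the real script.

```python
import re

START = "# >>> demo-instr >>>"   # hypothetical marker strings
END = "# <<< demo-instr <<<"


def apply_patch(text: str, anchor: str, payload: str) -> str:
    """Insert a marker-delimited payload immediately before the anchor."""
    if START in text:              # already patched: idempotent no-op
        return text
    if anchor not in text:
        raise RuntimeError("anchor not found")
    block = f"{START}\n{payload}\n{END}\n"
    return text.replace(anchor, block + anchor, 1)


def revert_patch(text: str) -> str:
    """Remove every START..END block, including the END line's newline."""
    pat = re.compile(r"[ \t]*" + re.escape(START) + r".*?" + re.escape(END) + r"\n",
                     flags=re.DOTALL)
    return pat.sub("", text)


original = "import os\n\n\nclass Scheduler:\n    pass\n"
patched = apply_patch(original, "class Scheduler:", "_ES_URI = os.environ.get('ES_URI')")
restored = revert_patch(patched)
print(START in patched, restored == original)
```

Because the inserted block carries its own trailing newline and the revert pattern consumes it, the round trip restores the text byte for byte in this sketch.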
294
microbench/connector_tax/layerwise/mb7_layerwise.py
Normal file
@@ -0,0 +1,294 @@
#!/usr/bin/env python3
"""MB7: correctness + perf of layer-wise KV push vs post-hoc transfer.

Two 2-instance modes against A (src/producer) and B (dst/consumer):

  baseline  : prefill A (await) -> THEN B pulls (post-hoc full transfer).
              T_total = T_prefill + T_xfer (sequential)
  layerwise : dispatch B's remote-prefill (handshake) and A's prefill
              CONCURRENTLY, so A pushes each layer as it computes it.
              If overlap works, T_total ~= max(T_prefill, T_xfer) ~= T_prefill.

Reference: T_prefill_only = a plain prefill on A with no transfer.

Correctness: after the transfer, a plain follow-up to B on the same prompt
must report cached_tokens close to prompt_len (this script accepts >= 90%),
i.e. the KV actually landed on B.

The connector mode is selected by the launcher (run_mb7.sh): baseline uses the
stock connector; layerwise deploys mooncake_connector.LAYERWISE.py +
MOONCAKE_LAYERWISE=1. This script just drives the requests and measures.

Usage:
  python mb7_layerwise.py --mode layerwise --sizes 8192,32768,65536 --repeats 3 \
      --src-port 8000 --dst-port 8001 --src-bp 8998 --dst-bp 8999 --out mb7.json
"""
from __future__ import annotations

import argparse
import asyncio
import json
import statistics
import time
import uuid
from pathlib import Path

import httpx

MODEL = "/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct"
KV_PER_TOK = 98304  # bytes of KV cache per token (used to report kv_gib)


def synth_prompt(seed: int, n: int) -> list[int]:
    # deterministic pseudo-random token ids, so a given seed always yields the same prompt
    import random
    rng = random.Random(seed)
    return [rng.randint(100, 150000) for _ in range(n)]


async def get_engine_id(client, host, bp):
    r = await client.get(f"http://{host}:{bp}/query")
    r.raise_for_status()
    return r.json()["0"]["engine_id"]


async def completion(client, host, port, prompt, max_tokens, ktp=None):
    payload = {
        "model": MODEL, "prompt": prompt, "max_tokens": max_tokens,
        "min_tokens": 1,  # always generate at least one token
        "temperature": 0.0, "stream": False,
    }
    if ktp:
        payload["kv_transfer_params"] = ktp
    t0 = time.perf_counter()
    r = await client.post(f"http://{host}:{port}/v1/completions",
                          json=payload, timeout=600.0)
    dt = time.perf_counter() - t0  # wall time of the full request, in seconds
    r.raise_for_status()
    return dt, r.json()


def cached_of(resp) -> int:
    # cached_tokens may sit under usage.prompt_tokens_details or directly under usage
    usage = resp.get("usage") or {}
    det = usage.get("prompt_tokens_details") or {}
    return det.get("cached_tokens", 0) or usage.get("cached_tokens", 0) or 0


async def _stream_completion(client, host, port, prompt, max_tokens):
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens,
               "min_tokens": 1, "temperature": 0.0, "stream": True}
    async with client.stream("POST", f"http://{host}:{port}/v1/completions",
                             json=payload, timeout=600.0) as r:
        r.raise_for_status()
        # drain the stream; the content is irrelevant, only the load matters
        async for _ in r.aiter_bytes():
            pass


class BackgroundLoad:
    """Hold N concurrent long-decode streams across endpoints to keep the engines busy."""
    def __init__(self, client, endpoints, n, prompt_tokens=2000, out_tokens=6000):
        self.client, self.endpoints, self.n = client, endpoints, n
        self.pt, self.ot = prompt_tokens, out_tokens
        self._stop = asyncio.Event()
        self._tasks = []

    async def _w(self, idx):
        # each worker is pinned to one endpoint (round-robin by index)
        host, port = self.endpoints[idx % len(self.endpoints)]
        seed = 800000 + idx
        while not self._stop.is_set():
            try:
                await _stream_completion(self.client, host, port,
                                         synth_prompt(seed, self.pt), self.ot)
            except Exception:
                await asyncio.sleep(0.5)
            seed += 1

    def start(self):
        self._tasks = [asyncio.create_task(self._w(i)) for i in range(self.n)]

    async def stop(self):
        self._stop.set()
        for t in self._tasks:
            t.cancel()
        await asyncio.gather(*self._tasks, return_exceptions=True)


async def num_running(client, host, port):
    # returns -1 when the metric cannot be read
    try:
        r = await client.get(f"http://{host}:{port}/metrics", timeout=5.0)
        for line in r.text.splitlines():
            if line.startswith("vllm:num_requests_running"):
                return int(float(line.split()[-1]))
    except Exception:
        pass
    return -1


async def prefill_only(client, host, port, prompt):
    """Reference: plain prefill cost on A, no transfer."""
    dt, _ = await completion(client, host, port, prompt, max_tokens=1)
    return dt


async def measure_baseline(client, A, B, src_eid, src_bp_addr, prompt, seed):
    tid = uuid.uuid4().hex
    t0 = time.perf_counter()
    # sequential: A finishes its prefill before B starts pulling
    t_pf, _ = await completion(client, *A, prompt, 1,
                               ktp={"do_remote_decode": True, "transfer_id": tid})
    t_xfer, _ = await completion(client, *B, prompt, 1,
                                 ktp={"do_remote_prefill": True, "transfer_id": tid,
                                      "remote_engine_id": src_eid,
                                      "remote_bootstrap_addr": src_bp_addr})
    t_total = time.perf_counter() - t0
    # correctness: B follow-up should hit cache
    _, fr = await completion(client, *B, prompt, 1)
    return {"t_prefill_s": t_pf, "t_xfer_s": t_xfer, "t_total_s": t_total,
            "cached": cached_of(fr)}


async def measure_layerwise(client, A, B, src_eid, src_bp_addr, prompt, seed):
    """Dispatch B handshake + A prefill concurrently => layer-wise overlap."""
    tid = uuid.uuid4().hex
    t0 = time.perf_counter()

    async def run_B():
        return await completion(client, *B, prompt, 1,
                                ktp={"do_remote_prefill": True, "transfer_id": tid,
                                     "remote_engine_id": src_eid,
                                     "remote_bootstrap_addr": src_bp_addr})

    async def run_A():
        # small head start for B's handshake to reach A before A's forward;
        # note that this 50 ms sleep falls inside the t_total window
        await asyncio.sleep(0.05)
        return await completion(client, *A, prompt, 1,
                                ktp={"do_remote_decode": True, "transfer_id": tid})

    b_task = asyncio.create_task(run_B())
    a_task = asyncio.create_task(run_A())
    (t_b, _), (t_a, _) = await asyncio.gather(b_task, a_task)
    t_total = time.perf_counter() - t0
    # correctness: B follow-up should hit cache
    _, fr = await completion(client, *B, prompt, 1)
    return {"t_A_s": t_a, "t_B_s": t_b, "t_total_s": t_total,
            "cached": cached_of(fr)}


async def main_async(a):
    sizes = [int(s) for s in a.sizes.split(",")]
    A = (a.src_host, a.src_port)
    B = (a.dst_host, a.dst_port)
    limits = httpx.Limits(max_connections=64, max_keepalive_connections=64)
    async with httpx.AsyncClient(limits=limits, trust_env=False) as client:
        src_eid = await get_engine_id(client, a.src_host, a.src_bp)
        src_bp_addr = f"http://{a.src_host}:{a.src_bp}"
        print(f"[mb7] mode={a.mode} bg_load={a.bg_load} src_eid={src_eid[:16]}...")

        loader = None
        if a.bg_load > 0:
            loader = BackgroundLoad(client, [A, B], a.bg_load)
            loader.start()
            print(f"[mb7] ramping background load ({a.bg_load}) ...")
            # wait up to 40 s until both engines report at least one running request
            for _ in range(40):
                await asyncio.sleep(1.0)
                na = await num_running(client, *A)
                nb = await num_running(client, *B)
                if na >= 1 and nb >= 1:
                    print(f"[mb7] busy: A_run={na} B_run={nb}")
                    break

        # --- concurrent correctness mode: fire N transfers at once ----------
        if a.concurrent > 1 and a.mode == "layerwise":
            print(f"[mb7] CONCURRENT correctness: {a.concurrent} simultaneous "
                  f"transfers per size (src=A stresses concurrent producing)")
            all_ok = True
            for sz in sizes:
                tasks = [
                    asyncio.create_task(measure_layerwise(
                        client, A, B, src_eid, src_bp_addr,
                        synth_prompt(sz * 1000 + j, sz), sz * 1000 + j))
                    for j in range(a.concurrent)
                ]
                rows = await asyncio.gather(*tasks)
                oks = [r["cached"] >= int(sz * 0.9) for r in rows]
                all_ok = all_ok and all(oks)
                print(f" sz={sz:>6} x{a.concurrent}: cached="
                      f"{[r['cached'] for r in rows]} correct={oks}")
            print(f"[mb7] CONCURRENT correctness: "
                  f"{'ALL PASS' if all_ok else 'FAILURE'}")
            if loader:
                await loader.stop()
            return

        results = []
        for sz in sizes:
            for rep in range(a.repeats):
                prompt = synth_prompt(sz * 100 + rep, sz)
                # reference prefill-only cost (fresh prompt, different seed so no cache)
                t_pf_only = await prefill_only(
                    client, *A, synth_prompt(sz * 100 + rep + 555, sz))
                if a.mode == "baseline":
                    row = await measure_baseline(client, A, B, src_eid, src_bp_addr,
                                                 prompt, sz * 100 + rep)
                else:
                    row = await measure_layerwise(client, A, B, src_eid, src_bp_addr,
                                                  prompt, sz * 100 + rep)
                row.update({"mode": a.mode, "size": sz, "rep": rep,
                            "t_prefill_only_s": t_pf_only,
                            "kv_gib": sz * KV_PER_TOK / 2**30,
                            "correct": row["cached"] >= int(sz * 0.9)})
                results.append(row)
                extra = (f"xfer={row.get('t_xfer_s', 0)*1000:.0f}ms"
                         if a.mode == "baseline"
                         else f"tA={row.get('t_A_s',0)*1000:.0f}ms tB={row.get('t_B_s',0)*1000:.0f}ms")
                print(f" sz={sz:>6} rep={rep} pf_only={t_pf_only*1000:6.0f}ms "
                      f"total={row['t_total_s']*1000:7.0f}ms {extra} "
                      f"cached={row['cached']}/{sz} correct={row['correct']}")

        if loader:
            await loader.stop()

        # summary
        print(f"\n=== {a.mode} (bg={a.bg_load}) summary ===")
        print(f"{'size':>7} {'n':>2} {'pf_only_ms':>11} {'total_ms':>9} "
              f"{'overhead_ms':>12} {'correct':>8}")
        summary = []
        for sz in sizes:
            rs = [r for r in results if r["size"] == sz]
            if not rs:
                continue
            pf = statistics.median(r["t_prefill_only_s"] for r in rs) * 1000
            tot = statistics.median(r["t_total_s"] for r in rs) * 1000
            allok = all(r["correct"] for r in rs)
            # overhead = total - prefill_only = the part NOT hidden behind prefill
            overhead = tot - pf
            summary.append({"size": sz, "n": len(rs), "pf_only_ms": pf,
                            "total_ms": tot, "overhead_ms": overhead,
                            "all_correct": allok})
            print(f"{sz:>7} {len(rs):>2} {pf:>11.0f} {tot:>9.0f} {overhead:>12.0f} "
                  f"{str(allok):>8}")

        Path(a.out).write_text(json.dumps(
            {"mode": a.mode, "model": MODEL, "raw": results, "summary": summary}, indent=2))
        print(f"\n[mb7] wrote {a.out}")


def main():
    p = argparse.ArgumentParser()
    p.add_argument("--mode", choices=["baseline", "layerwise"], required=True)
    p.add_argument("--src-host", default="127.0.0.1")
    p.add_argument("--dst-host", default="127.0.0.1")
    p.add_argument("--src-port", type=int, default=8000)
    p.add_argument("--dst-port", type=int, default=8001)
    p.add_argument("--src-bp", type=int, default=8998)
    p.add_argument("--dst-bp", type=int, default=8999)
    p.add_argument("--sizes", default="8192,32768,65536")
    p.add_argument("--repeats", type=int, default=3)
    p.add_argument("--bg-load", type=int, default=0,
                   help="N concurrent background decode streams across A+B")
    p.add_argument("--concurrent", type=int, default=1,
                   help="layerwise: fire N simultaneous transfers to test "
                        "concurrent-producing correctness")
    p.add_argument("--out", default="mb7_result.json")
    args = p.parse_args()
    asyncio.run(main_async(args))


if __name__ == "__main__":
    main()
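The summary step reduces each size to a median prefill-only time, a median total time, and their difference (the overhead not hidden behind prefill). The sketch below reproduces that arithmetic on the three 8192-token baseline rows recorded in `results/mb7_baseline.json`; the helper names `summarize` and `kv_gib` are illustrative and do not exist in the script.

```python
import statistics

KV_PER_TOK = 98304  # bytes of KV cache per token, as in mb7_layerwise.py


def summarize(rows):
    """Median prefill-only and total times (ms), plus the un-hidden overhead."""
    pf_ms = statistics.median(r["t_prefill_only_s"] for r in rows) * 1000
    total_ms = statistics.median(r["t_total_s"] for r in rows) * 1000
    return {"pf_only_ms": pf_ms, "total_ms": total_ms,
            "overhead_ms": total_ms - pf_ms}


def kv_gib(num_tokens):
    """KV volume of a prompt in GiB."""
    return num_tokens * KV_PER_TOK / 2**30


# the three size=8192 rows from the baseline run (only the fields used here)
rows = [
    {"t_prefill_only_s": 1.0551288530004967, "t_total_s": 0.9375749369974073},
    {"t_prefill_only_s": 0.5743715360003989, "t_total_s": 0.6978207100000873},
    {"t_prefill_only_s": 0.5745713680007611, "t_total_s": 0.6821924389987544},
]
s = summarize(rows)
print(round(s["overhead_ms"], 1), kv_gib(8192))
```

Using the median rather than the mean keeps the slow first repetition (about 1.06 s of prefill-only time here) from skewing the reference.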
79
microbench/connector_tax/layerwise/migration_target.py
Normal file
@@ -0,0 +1,79 @@
#!/usr/bin/env python3
"""P2: real-state-aware migration target selection.

Pure helpers (no proxy deps) so they're unit-testable. The router calls
`rank_migration_targets` to pick the decode target, using REAL engine state
(from the engine-state store) when available, falling back to shadow counters.

Key fix over the shadow-only Mechanism B: deprioritise targets that are
mid-large-prefill (`max_prefill_remaining` high); those hold the GIL and
stall the mooncake receiver_loop, which is the ~45% control-plane residual
that layer-wise transfer does NOT fix. Also avoid targets near the KV
capacity wall (`gpu_kv_used_frac` high).
"""
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TargetCandidate:
    idx: int
    cache_hit: int                  # estimated transfer saved by cache hits, in tokens
    shadow_num_req: int             # proxy shadow counter (fallback)
    ongoing_tokens: int             # shadow tertiary
    real_state: dict | None = None  # engine-state record, or None if stale/missing


def real_load(c: TargetCandidate) -> float:
    """Effective load: prefer real (running + waiting); else shadow."""
    rs = c.real_state
    if rs is not None:
        return float(rs.get("num_running", 0) + rs.get("num_waiting", 0))
    return float(c.shadow_num_req)


def big_prefill_remaining(c: TargetCandidate) -> int:
    """Largest in-progress prefill on the candidate (GIL-stall predictor).
    0 when unknown (no real state) so we don't over-penalise blind."""
    rs = c.real_state
    return int(rs.get("max_prefill_remaining", 0)) if rs is not None else 0


def kv_used_frac(c: TargetCandidate) -> float:
    # a missing or negative fraction is treated as "unknown" and mapped to 0.0
    rs = c.real_state
    if rs is not None:
        f = rs.get("gpu_kv_used_frac", -1.0)
        return float(f) if f is not None and f >= 0 else 0.0
    return 0.0


def target_sort_key(
    c: TargetCandidate,
    big_prefill_threshold: int = 16000,
    kv_wall_frac: float = 0.90,
):
    """Sort key (lower = better). Ordering of concerns:
      1. NOT mid-large-prefill (avoid the GIL-stall dst)   [bool]
      2. NOT near the KV capacity wall                     [bool]
      3. most cache-rich (fewest transfer bytes)           -> -cache_hit
      4. lowest real load
      5. lowest ongoing_tokens (shadow tertiary tie-break)
    """
    stalls = 1 if big_prefill_remaining(c) >= big_prefill_threshold else 0
    near_wall = 1 if kv_used_frac(c) >= kv_wall_frac else 0
    return (stalls, near_wall, -c.cache_hit, real_load(c), c.ongoing_tokens)


def rank_migration_targets(
    candidates: list[TargetCandidate],
    big_prefill_threshold: int = 16000,
    kv_wall_frac: float = 0.90,
) -> TargetCandidate | None:
    """Return the best candidate, or None if the list is empty."""
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda c: target_sort_key(c, big_prefill_threshold, kv_wall_frac),
    )
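To make the ordering of concerns concrete, here is a condensed, self-contained sketch of the same lexicographic key applied to four made-up candidates. The `Candidate` class and `sort_key` function are simplified stand-ins for `TargetCandidate` and `target_sort_key`, and all state values are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:  # simplified stand-in for TargetCandidate
    idx: int
    cache_hit: int
    shadow_num_req: int
    ongoing_tokens: int
    real_state: Optional[dict] = None


def sort_key(c, big_prefill_threshold=16000, kv_wall_frac=0.90):
    """Lower is better: (stalls, near_wall, -cache_hit, load, ongoing_tokens)."""
    rs = c.real_state
    known = rs is not None
    prefill = int(rs.get("max_prefill_remaining", 0)) if known else 0
    frac = rs.get("gpu_kv_used_frac", -1.0) if known else -1.0
    frac = float(frac) if frac is not None and frac >= 0 else 0.0
    load = (rs.get("num_running", 0) + rs.get("num_waiting", 0)) if known else c.shadow_num_req
    stalls = 1 if prefill >= big_prefill_threshold else 0
    near_wall = 1 if frac >= kv_wall_frac else 0
    return (stalls, near_wall, -c.cache_hit, float(load), c.ongoing_tokens)


candidates = [
    # most cache-rich, but in the middle of a 24k-token prefill
    Candidate(0, 30000, 1, 100, {"max_prefill_remaining": 24000, "gpu_kv_used_frac": 0.3}),
    # healthy, real load = 3 running + 1 waiting
    Candidate(1, 8000, 9, 500, {"num_running": 3, "num_waiting": 1, "gpu_kv_used_frac": 0.4}),
    # no real state: falls back to the shadow counter (2)
    Candidate(2, 8000, 2, 500, None),
    # cache-rich, but at 95% KV usage
    Candidate(3, 20000, 1, 100, {"gpu_kv_used_frac": 0.95}),
]
ranked = sorted(candidates, key=sort_key)
best = ranked[0]
print([c.idx for c in ranked])
```

In this toy setup the two cache-rich candidates lose to less cache-rich ones because the stall and KV-wall flags sort ahead of `cache_hit`, which is the behaviour the docstring describes.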
1490
microbench/connector_tax/layerwise/mooncake_connector.BASE.py
Normal file
File diff suppressed because it is too large
1713
microbench/connector_tax/layerwise/mooncake_connector.LAYERWISE.py
Normal file
File diff suppressed because it is too large
File diff suppressed because it is too large
140
microbench/connector_tax/layerwise/results/mb7_baseline.json
Normal file
@@ -0,0 +1,140 @@
|
|||||||
|
{
|
||||||
|
"mode": "baseline",
|
||||||
|
"model": "/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct",
|
||||||
|
"raw": [
|
||||||
|
{
|
||||||
|
"t_prefill_s": 0.5736213000018324,
|
||||||
|
"t_xfer_s": 0.36388630099827424,
|
||||||
|
"t_total_s": 0.9375749369974073,
|
||||||
|
"cached": 8176,
|
||||||
|
"mode": "baseline",
|
||||||
|
"size": 8192,
|
||||||
|
"rep": 0,
|
||||||
|
"t_prefill_only_s": 1.0551288530004967,
|
||||||
|
"kv_gib": 0.75,
|
||||||
|
"correct": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"t_prefill_s": 0.5740011439993395,
|
||||||
|
"t_xfer_s": 0.12374231500143651,
|
||||||
|
"t_total_s": 0.6978207100000873,
|
||||||
|
"cached": 8176,
|
||||||
|
"mode": "baseline",
|
||||||
|
"size": 8192,
|
||||||
|
"rep": 1,
|
||||||
|
"t_prefill_only_s": 0.5743715360003989,
|
||||||
|
"kv_gib": 0.75,
|
||||||
|
"correct": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"t_prefill_s": 0.5732713990000775,
|
||||||
|
"t_xfer_s": 0.10885842400239198,
|
||||||
|
"t_total_s": 0.6821924389987544,
|
||||||
|
"cached": 8176,
|
||||||
|
"mode": "baseline",
|
||||||
|
"size": 8192,
|
||||||
|
"rep": 2,
|
||||||
|
"t_prefill_only_s": 0.5745713680007611,
|
||||||
|
"kv_gib": 0.75,
|
||||||
|
"correct": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"t_prefill_s": 1.4892208660021424,
|
||||||
|
"t_xfer_s": 0.2091717740004242,
|
||||||
|
"t_total_s": 1.6984740270017937,
|
||||||
|
"cached": 16368,
|
||||||
|
"mode": "baseline",
|
||||||
|
"size": 16384,
|
||||||
|
"rep": 0,
|
||||||
|
"t_prefill_only_s": 1.4990949730017746,
|
||||||
|
"kv_gib": 1.5,
|
||||||
|
"correct": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"t_prefill_s": 1.4885207330007688,
|
||||||
|
"t_xfer_s": 0.2010940889995254,
|
||||||
|
"t_total_s": 1.6896768289989268,
|
||||||
|
"cached": 16368,
|
||||||
|
"mode": "baseline",
|
||||||
|
"size": 16384,
|
||||||
|
"rep": 1,
|
||||||
|
"t_prefill_only_s": 1.4898170189990196,
|
||||||
|
"kv_gib": 1.5,
|
||||||
|
"correct": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"t_prefill_s": 1.4895933570005582,
|
||||||
|
"t_xfer_s": 0.2026357979993918,
|
||||||
|
"t_total_s": 1.6922962099997676,
|
||||||
|
"cached": 16368,
|
||||||
|
"mode": "baseline",
|
||||||
|
"size": 16384,
|
||||||
|
"rep": 2,
|
||||||
|
"t_prefill_only_s": 1.4907751430000644,
|
||||||
|
"kv_gib": 1.5,
|
||||||
|
"correct": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"t_prefill_s": 4.438586502998078,
|
||||||
|
"t_xfer_s": 0.37847799000155646,
|
||||||
|
"t_total_s": 4.817142683001293,
|
||||||
|
"cached": 32752,
|
||||||
|
"mode": "baseline",
|
||||||
|
"size": 32768,
|
||||||
|
"rep": 0,
|
||||||
|
"t_prefill_only_s": 4.437922253000579,
|
||||||
|
"kv_gib": 3.0,
|
||||||
|
"correct": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"t_prefill_s": 4.4350325649975275,
|
||||||
|
"t_xfer_s": 0.5313337980005599,
|
||||||
|
"t_total_s": 4.966431269000168,
|
||||||
|
"cached": 32752,
|
||||||
|
"mode": "baseline",
|
||||||
|
"size": 32768,
|
||||||
|
"rep": 1,
|
||||||
|
"t_prefill_only_s": 4.437473922000208,
|
||||||
|
"kv_gib": 3.0,
|
||||||
|
"correct": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"t_prefill_s": 4.436279826000828,
|
||||||
|
"t_xfer_s": 0.6335160570015432,
|
||||||
|
"t_total_s": 5.069869226001174,
|
||||||
|
"cached": 32752,
|
||||||
|
"mode": "baseline",
|
||||||
|
"size": 32768,
|
||||||
|
"rep": 2,
|
||||||
|
"t_prefill_only_s": 4.440119222999783,
|
||||||
|
"kv_gib": 3.0,
|
||||||
|
"correct": true
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"summary": [
|
||||||
|
{
|
||||||
|
"size": 8192,
|
||||||
|
"n": 3,
|
||||||
|
"pf_only_ms": 574.5713680007611,
|
||||||
|
"total_ms": 697.8207100000873,
|
||||||
|
"overhead_ms": 123.24934199932613,
|
||||||
|
"all_correct": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"size": 16384,
|
||||||
|
"n": 3,
|
||||||
|
"pf_only_ms": 1490.7751430000644,
|
||||||
|
"total_ms": 1692.2962099997676,
|
||||||
|
"overhead_ms": 201.52106699970318,
|
||||||
|
"all_correct": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"size": 32768,
|
||||||
|
"n": 3,
|
||||||
|
"pf_only_ms": 4437.922253000579,
|
||||||
|
"total_ms": 4966.431269000168,
|
||||||
|
"overhead_ms": 528.5090159995889,
|
||||||
|
"all_correct": true
|
||||||
|
}
|
||||||
|
]
|
||||||
|
}
|
||||||
@@ -0,0 +1,140 @@
|
|||||||
|
{
|
||||||
|
"mode": "baseline",
|
||||||
|
"model": "/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct",
|
||||||
|
"raw": [
|
||||||
|
{
|
||||||
|
"t_prefill_s": 0.5868483350022871,
|
||||||
|
"t_xfer_s": 0.19584889299949282,
|
||||||
|
"t_total_s": 0.7827702419999696,
|
||||||
|
"cached": 8176,
|
||||||
|
"mode": "baseline",
|
||||||
|
"size": 8192,
|
||||||
|
"rep": 0,
|
||||||
|
"t_prefill_only_s": 0.5920699099988269,
|
||||||
|
"kv_gib": 0.75,
|
||||||
|
"correct": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"t_prefill_s": 0.5875704979989678,
|
||||||
|
"t_xfer_s": 0.1554814909977722,
|
||||||
|
"t_total_s": 0.7431365060001554,
|
||||||
|
"cached": 8176,
|
||||||
|
"mode": "baseline",
|
||||||
|
"size": 8192,
|
||||||
|
"rep": 1,
|
||||||
|
"t_prefill_only_s": 0.5814537600017502,
|
||||||
|
"kv_gib": 0.75,
|
||||||
|
"correct": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"t_prefill_s": 0.5852241569991747,
|
||||||
|
"t_xfer_s": 0.15129724399957922,
|
||||||
|
"t_total_s": 0.7365909610016388,
|
||||||
|
"cached": 8176,
|
||||||
|
"mode": "baseline",
|
||||||
|
"size": 8192,
|
||||||
|
"rep": 2,
|
||||||
|
"t_prefill_only_s": 0.5846994370003813,
|
||||||
|
"kv_gib": 0.75,
|
||||||
|
"correct": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"t_prefill_s": 1.498547145001794,
|
||||||
|
"t_xfer_s": 0.2475714690008317,
|
||||||
|
"t_total_s": 1.7462187470009667,
|
||||||
|
"cached": 16368,
|
||||||
|
"mode": "baseline",
|
||||||
|
"size": 16384,
|
||||||
|
"rep": 0,
|
||||||
|
"t_prefill_only_s": 1.5670790190015396,
|
||||||
|
"kv_gib": 1.5,
|
||||||
|
"correct": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"t_prefill_s": 1.5025789940009417,
|
||||||
|
"t_xfer_s": 0.24532966799961287,
|
||||||
|
"t_total_s": 1.7479741930001182,
|
||||||
|
"cached": 16368,
|
||||||
|
"mode": "baseline",
|
||||||
|
"size": 16384,
|
||||||
|
"rep": 1,
|
||||||
|
"t_prefill_only_s": 1.5008903820016712,
|
||||||
|
"kv_gib": 1.5,
|
||||||
|
"correct": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"t_prefill_s": 1.5021674179988622,
|
||||||
|
"t_xfer_s": 0.24640760400143336,
|
||||||
|
"t_total_s": 1.7486415580024186,
|
||||||
|
"cached": 16368,
|
||||||
|
"mode": "baseline",
|
||||||
|
"size": 16384,
|
||||||
|
"rep": 2,
|
||||||
|
"t_prefill_only_s": 1.509417139001016,
|
||||||
|
"kv_gib": 1.5,
|
||||||
|
"correct": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"t_prefill_s": 4.444555983998725,
|
||||||
|
"t_xfer_s": 0.4227471090016479,
|
||||||
|
"t_total_s": 4.86737214599998,
|
||||||
|
"cached": 32752,
|
||||||
|
"mode": "baseline",
|
||||||
|
"size": 32768,
|
||||||
|
"rep": 0,
|
||||||
|
"t_prefill_only_s": 4.4467717689985875,
|
||||||
|
"kv_gib": 3.0,
|
||||||
|
"correct": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"t_prefill_s": 4.442135782999685,
|
||||||
|
"t_xfer_s": 0.7519038230020669,
|
||||||
|
"t_total_s": 5.194113359000767,
|
||||||
|
"cached": 32752,
|
||||||
|
"mode": "baseline",
|
||||||
|
"size": 32768,
|
||||||
|
"rep": 1,
|
||||||
|
"t_prefill_only_s": 4.445541313998547,
|
||||||
|
"kv_gib": 3.0,
|
||||||
|
"correct": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"t_prefill_s": 4.439772993999213,
|
||||||
|
"t_xfer_s": 0.7855456319994119,
|
||||||
|
"t_total_s": 5.225392060998274,
|
||||||
|
"cached": 32752,
|
||||||
|
"mode": "baseline",
|
||||||
|
"size": 32768,
|
||||||
|
"rep": 2,
|
||||||
|
"t_prefill_only_s": 4.442906365002273,
|
||||||
|
"kv_gib": 3.0,
|
||||||
|
"correct": true
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"summary": [
|
||||||
|
{
|
||||||
|
"size": 8192,
|
||||||
|
"n": 3,
|
||||||
|
"pf_only_ms": 584.6994370003813,
|
||||||
|
"total_ms": 743.1365060001554,
|
||||||
|
"overhead_ms": 158.43706899977406,
|
||||||
|
"all_correct": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"size": 16384,
|
||||||
|
"n": 3,
|
||||||
|
"pf_only_ms": 1509.417139001016,
|
||||||
|
"total_ms": 1747.9741930001182,
|
||||||
|
"overhead_ms": 238.5570539991022,
|
||||||
|
"all_correct": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"size": 32768,
|
||||||
|
"n": 3,
|
||||||
|
"pf_only_ms": 4445.541313998547,
|
||||||
|
"total_ms": 5194.113359000767,
|
||||||
|
"overhead_ms": 748.57204500222,
|
||||||
|
"all_correct": true
|
||||||
|
}
|
||||||
|
]
|
||||||
|
}
|
||||||
140
microbench/connector_tax/layerwise/results/mb7_layerwise.json
Normal file
@@ -0,0 +1,140 @@
|
|||||||
|
{
|
||||||
|
"mode": "layerwise",
|
||||||
|
"model": "/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct",
|
||||||
|
"raw": [
|
||||||
|
{
|
||||||
|
"t_A_s": 0.5749198459998297,
|
||||||
|
"t_B_s": 0.6508419569981925,
|
||||||
|
"t_total_s": 0.6509377910006151,
|
||||||
|
"cached": 8176,
|
||||||
|
"mode": "layerwise",
|
||||||
|
"size": 8192,
|
||||||
|
"rep": 0,
|
||||||
|
"t_prefill_only_s": 1.0447357020020718,
|
||||||
|
"kv_gib": 0.75,
|
||||||
|
"correct": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"t_A_s": 0.574626908000937,
|
||||||
|
"t_B_s": 0.6306310719992325,
|
||||||
|
"t_total_s": 0.6307087300010608,
|
||||||
|
"cached": 8176,
|
||||||
|
"mode": "layerwise",
|
||||||
|
"size": 8192,
|
||||||
|
"rep": 1,
|
||||||
|
"t_prefill_only_s": 0.5731983850018878,
|
||||||
|
"kv_gib": 0.75,
|
||||||
|
"correct": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"t_A_s": 0.5756587910000235,
|
||||||
|
"t_B_s": 0.6316753270002664,
|
||||||
|
"t_total_s": 0.6317471290021786,
|
||||||
|
"cached": 8176,
|
||||||
|
"mode": "layerwise",
|
||||||
|
"size": 8192,
|
||||||
|
"rep": 2,
|
||||||
|
"t_prefill_only_s": 0.5737888650000968,
|
||||||
|
"kv_gib": 0.75,
|
||||||
|
"correct": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"t_A_s": 1.4953326409995498,
|
||||||
|
"t_B_s": 1.5502465710014803,
|
||||||
|
"t_total_s": 1.5503262860001996,
|
||||||
|
"cached": 16368,
|
||||||
|
"mode": "layerwise",
|
||||||
|
"size": 16384,
|
||||||
|
"rep": 0,
|
||||||
|
"t_prefill_only_s": 1.5000705940001353,
|
||||||
|
"kv_gib": 1.5,
|
||||||
|
"correct": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"t_A_s": 1.493850356000621,
|
||||||
|
"t_B_s": 1.5505031290012994,
|
||||||
|
"t_total_s": 1.5505791659998067,
|
||||||
|
"cached": 16368,
|
||||||
|
"mode": "layerwise",
|
||||||
|
"size": 16384,
|
||||||
|
"rep": 1,
|
||||||
|
"t_prefill_only_s": 1.4924546469992492,
|
||||||
|
"kv_gib": 1.5,
|
||||||
|
"correct": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"t_A_s": 1.4979969070009247,
|
||||||
|
"t_B_s": 1.554968774002191,
      "t_total_s": 1.5551903560008213,
      "cached": 16368,
      "mode": "layerwise",
      "size": 16384,
      "rep": 2,
      "t_prefill_only_s": 1.4914496510027675,
      "kv_gib": 1.5,
      "correct": true
    },
    {
      "t_A_s": 4.4403588690001925,
      "t_B_s": 4.496483378999983,
      "t_total_s": 4.4965666819989565,
      "cached": 32752,
      "mode": "layerwise",
      "size": 32768,
      "rep": 0,
      "t_prefill_only_s": 4.440080869000667,
      "kv_gib": 3.0,
      "correct": true
    },
    {
      "t_A_s": 4.44209005599987,
      "t_B_s": 4.499940814999718,
      "t_total_s": 4.500021006002498,
      "cached": 32752,
      "mode": "layerwise",
      "size": 32768,
      "rep": 1,
      "t_prefill_only_s": 4.440225810998527,
      "kv_gib": 3.0,
      "correct": true
    },
    {
      "t_A_s": 4.437084657998639,
      "t_B_s": 4.496842522999941,
      "t_total_s": 4.496926485000586,
      "cached": 32752,
      "mode": "layerwise",
      "size": 32768,
      "rep": 2,
      "t_prefill_only_s": 4.439449855002749,
      "kv_gib": 3.0,
      "correct": true
    }
  ],
  "summary": [
    {
      "size": 8192,
      "n": 3,
      "pf_only_ms": 573.7888650000968,
      "total_ms": 631.7471290021786,
      "overhead_ms": 57.958264002081705,
      "all_correct": true
    },
    {
      "size": 16384,
      "n": 3,
      "pf_only_ms": 1492.4546469992492,
      "total_ms": 1550.5791659998067,
      "overhead_ms": 58.124519000557484,
      "all_correct": true
    },
    {
      "size": 32768,
      "n": 3,
      "pf_only_ms": 4440.080869000667,
      "total_ms": 4496.926485000586,
      "overhead_ms": 56.845615999918664,
      "all_correct": true
    }
  ]
}
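The summary rows above are numerically consistent with taking the per-size median of `t_prefill_only_s` and `t_total_s` over the reps and subtracting one from the other. That is an inference from the numbers, not from the driver source (`mb7_layerwise.py` is not shown here), so the sketch below is only an illustration of that reading; the helper name `summarize` is hypothetical.

```python
import statistics
from collections import defaultdict


def summarize(raw):
    """Collapse per-rep rows into one summary row per prompt size (medians)."""
    by_size = defaultdict(list)
    for row in raw:
        by_size[row["size"]].append(row)

    summary = []
    for size in sorted(by_size):
        rows = by_size[size]
        pf_only_ms = statistics.median(r["t_prefill_only_s"] for r in rows) * 1e3
        total_ms = statistics.median(r["t_total_s"] for r in rows) * 1e3
        summary.append({
            "size": size,
            "n": len(rows),
            "pf_only_ms": pf_only_ms,
            "total_ms": total_ms,
            # Overhead = what the KV hand-off adds on top of a prefill-only run.
            "overhead_ms": total_ms - pf_only_ms,
            "all_correct": all(r["correct"] for r in rows),
        })
    return summary


# The three 32768-token reps from the raw list above (only the fields used here).
raw_32k = [
    {"size": 32768, "rep": 0, "t_prefill_only_s": 4.440080869000667,
     "t_total_s": 4.4965666819989565, "correct": True},
    {"size": 32768, "rep": 1, "t_prefill_only_s": 4.440225810998527,
     "t_total_s": 4.500021006002498, "correct": True},
    {"size": 32768, "rep": 2, "t_prefill_only_s": 4.439449855002749,
     "t_total_s": 4.496926485000586, "correct": True},
]

rows = summarize(raw_32k)
print(rows[0])
```

Using the median rather than the mean keeps a single slow rep (such as rep 1's 4.50 s total) from shifting the reported overhead.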
@@ -0,0 +1,140 @@
{
  "mode": "layerwise",
  "model": "/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct",
  "raw": [
    {
      "t_A_s": 0.5905098549992545,
      "t_B_s": 0.6900827390018094,
      "t_total_s": 0.6904724189989793,
      "cached": 8176,
      "mode": "layerwise",
      "size": 8192,
      "rep": 0,
      "t_prefill_only_s": 0.5852864849985053,
      "kv_gib": 0.75,
      "correct": true
    },
    {
      "t_A_s": 0.5897548109969648,
      "t_B_s": 0.6827381169969158,
      "t_total_s": 0.6828304180016858,
      "cached": 8176,
      "mode": "layerwise",
      "size": 8192,
      "rep": 1,
      "t_prefill_only_s": 0.5890174580017629,
      "kv_gib": 0.75,
      "correct": true
    },
    {
      "t_A_s": 0.5850713190011447,
      "t_B_s": 0.6744917560026806,
      "t_total_s": 0.6745770380002796,
      "cached": 8176,
      "mode": "layerwise",
      "size": 8192,
      "rep": 2,
      "t_prefill_only_s": 0.5943713950000529,
      "kv_gib": 0.75,
      "correct": true
    },
    {
      "t_A_s": 1.5030149390004226,
      "t_B_s": 1.596173029000056,
      "t_total_s": 1.597060264000902,
      "cached": 16368,
      "mode": "layerwise",
      "size": 16384,
      "rep": 0,
      "t_prefill_only_s": 1.5130829510017065,
      "kv_gib": 1.5,
      "correct": true
    },
    {
      "t_A_s": 1.499876754998695,
      "t_B_s": 1.5940461120007967,
      "t_total_s": 1.5948001770011615,
      "cached": 16368,
      "mode": "layerwise",
      "size": 16384,
      "rep": 1,
      "t_prefill_only_s": 1.5024838620010996,
      "kv_gib": 1.5,
      "correct": true
    },
    {
      "t_A_s": 1.5068977490009274,
      "t_B_s": 1.5950395179970656,
      "t_total_s": 1.59571184500237,
      "cached": 16368,
      "mode": "layerwise",
      "size": 16384,
      "rep": 2,
      "t_prefill_only_s": 1.5303227439981129,
      "kv_gib": 1.5,
      "correct": true
    },
    {
      "t_A_s": 4.4503932609986805,
      "t_B_s": 4.538851200999488,
      "t_total_s": 4.539281312001549,
      "cached": 32752,
      "mode": "layerwise",
      "size": 32768,
      "rep": 0,
      "t_prefill_only_s": 4.446753306998289,
      "kv_gib": 3.0,
      "correct": true
    },
    {
      "t_A_s": 4.44226107799841,
      "t_B_s": 4.551636377997056,
      "t_total_s": 4.552389411001059,
      "cached": 32752,
      "mode": "layerwise",
      "size": 32768,
      "rep": 1,
      "t_prefill_only_s": 4.44538704000297,
      "kv_gib": 3.0,
      "correct": true
    },
    {
      "t_A_s": 4.440309538000292,
      "t_B_s": 4.539836316998844,
      "t_total_s": 4.540553365997766,
      "cached": 32752,
      "mode": "layerwise",
      "size": 32768,
      "rep": 2,
      "t_prefill_only_s": 4.443476915999781,
      "kv_gib": 3.0,
      "correct": true
    }
  ],
  "summary": [
    {
      "size": 8192,
      "n": 3,
      "pf_only_ms": 589.0174580017629,
      "total_ms": 682.8304180016858,
      "overhead_ms": 93.8129599999229,
      "all_correct": true
    },
    {
      "size": 16384,
      "n": 3,
      "pf_only_ms": 1513.0829510017065,
      "total_ms": 1595.71184500237,
      "overhead_ms": 82.62889400066342,
      "all_correct": true
    },
    {
      "size": 32768,
      "n": 3,
      "pf_only_ms": 4445.38704000297,
      "total_ms": 4540.553365997766,
      "overhead_ms": 95.16632599479635,
      "all_correct": true
    }
  ]
}
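One way to read the summary above is to relate `overhead_ms` to the prefill time and to the KV volume (`kv_gib`, taken from the raw rows of the same size). The sketch below does only that arithmetic; the function name `overhead_profile` and the two derived field names are hypothetical, and the interpretation (a roughly flat overhead while KV size quadruples suggests the transfer is mostly overlapped with prefill) is a reading of these numbers, not a claim made by the result file itself.

```python
# Summary rows from the result file above, with kv_gib merged in from the raw rows.
summary = [
    {"size": 8192, "kv_gib": 0.75,
     "pf_only_ms": 589.0174580017629, "overhead_ms": 93.8129599999229},
    {"size": 16384, "kv_gib": 1.5,
     "pf_only_ms": 1513.0829510017065, "overhead_ms": 82.62889400066342},
    {"size": 32768, "kv_gib": 3.0,
     "pf_only_ms": 4445.38704000297, "overhead_ms": 95.16632599479635},
]


def overhead_profile(rows):
    """Express the hand-off overhead relative to prefill time and KV volume."""
    profile = []
    for r in rows:
        profile.append({
            "size": r["size"],
            # Share of a prefill-only run that the hand-off adds on top.
            "overhead_pct_of_prefill": 100.0 * r["overhead_ms"] / r["pf_only_ms"],
            # If the overhead scaled with KV volume this would stay constant.
            "overhead_ms_per_gib": r["overhead_ms"] / r["kv_gib"],
        })
    return profile


profile = overhead_profile(summary)
for row in profile:
    print(f'{row["size"]:>6}  {row["overhead_pct_of_prefill"]:5.2f}% of prefill  '
          f'{row["overhead_ms_per_gib"]:6.1f} ms/GiB')
```

Both derived columns fall as the prompt grows, because the numerator stays in the 80 to 100 ms band while prefill time and KV volume increase.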
File diff suppressed because one or more lines are too long
File diff suppressed because it is too large
33
microbench/connector_tax/layerwise/run_ab_matrix.sh
Executable file
@@ -0,0 +1,33 @@
#!/usr/bin/env bash
# A/B x migration matrix on the 1200-req trace (sequential, ~47 min each).
#   1. unified (no A/B, no migration)          anchor
#   2. unified + A+B                           (documented champion, no mig)
#   3. unified_v3 + A+B + layer-wise           (champion + cheap mig)
# We already have: unified_v3 + layer-wise (no A/B) from the prior run.
#
# Q1 (migration benefit w/ layer-wise): #1 vs prior v3+layerwise(noAB)
# Q2 (does migration add to champion):  #2 vs #3
set -uo pipefail
PROJ_DIR="${PROJ_DIR:-/home/admin/cpfs/wjh/agentic-kv}"
R="$PROJ_DIR/microbench/connector_tax/layerwise/run_v3_trace.sh"
AB="--overload-factor 1.3 --lmetric-decode-weight 0.01"
LOGD=/tmp/dst_break_logs; mkdir -p "$LOGD"

echo "########## 1/3 unified plain ##########"
TAG=unified_plain POLICY=unified MODE=baseline AB_FLAGS="" \
  bash "$R" 2>&1 | tee "$LOGD/abmatrix_1_unified_plain.log" | tail -6

echo "########## 2/3 unified + A+B ##########"
TAG=unified_AB POLICY=unified MODE=baseline AB_FLAGS="$AB" \
  bash "$R" 2>&1 | tee "$LOGD/abmatrix_2_unified_AB.log" | tail -6

echo "########## 3/3 unified_v3 + A+B + layer-wise ##########"
TAG=v3_AB_lw POLICY=unified_v3 MODE=layerwise AB_FLAGS="$AB" \
  bash "$R" 2>&1 | tee "$LOGD/abmatrix_3_v3_AB_lw.log" | tail -6

echo "########## MATRIX DONE ##########"
for t in unified_plain unified_AB v3_AB_lw; do
  D=$(ls -dt "$PROJ_DIR"/outputs/v3trace_${t}_*/unified_v3 2>/dev/null | head -1)
  echo "=== $t ($D) ==="
  sed -n '/\[stats\]/,/\[done\]/p' "$LOGD"/abmatrix_*_${t}.log 2>/dev/null | grep -E "requests:|TTFT|migrations:" || true
done
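The final loop of `run_ab_matrix.sh` only greps the `requests:`, `TTFT` and `migrations:` lines out of each log. To answer Q2 (#2 vs #3) numerically, those lines can be parsed into a dict and subtracted. The following is a minimal sketch under the assumption that the lines look like the ones printed by the stats block in `run_v3_trace.sh`; `parse_stats` and `ttft_delta` are hypothetical helpers, and the sample log text uses made-up placeholder values, not measured results.

```python
import re

REQ_RE = re.compile(r"requests:\s*(\d+)\s+success:\s*(\d+)")
# "TTFT s" (with the unit) only appears on the overall TTFT line, so this
# does not match the "migrated-req TTFT:" part of the migrations line.
TTFT_RE = re.compile(r"TTFT s\s*:\s*p50=([\d.]+)\s+p90=([\d.]+)\s+p99=([\d.]+)")
MIG_RE = re.compile(r"migrations:\s*(\d+)")


def parse_stats(text):
    """Extract request counts, overall TTFT percentiles and migration count."""
    stats = {}
    for line in text.splitlines():
        if m := REQ_RE.search(line):
            stats["requests"], stats["success"] = int(m[1]), int(m[2])
        elif m := TTFT_RE.search(line):
            stats["p50"], stats["p90"], stats["p99"] = (float(g) for g in m.groups())
        elif m := MIG_RE.search(line):
            stats["migrations"] = int(m[1])
    return stats


def ttft_delta(base, other):
    """TTFT change (seconds) of `other` relative to `base`; negative is faster."""
    return {k: round(other[k] - base[k], 2) for k in ("p50", "p90", "p99")}


# Placeholder log excerpts (illustrative values only).
log_unified_ab = """\
 requests: 1200 success: 1200 (100.0%)
 TTFT s : p50=1.00 p90=2.00 p99=3.00
 migrations: 0
"""
log_v3_ab_lw = """\
 requests: 1200 success: 1200 (100.0%)
 TTFT s : p50=0.90 p90=1.50 p99=2.80
 migrations: 10 migrated-req TTFT: p50=1.50 p90=2.50 max=3.50
"""

a, b = parse_stats(log_unified_ab), parse_stats(log_v3_ab_lw)
print(ttft_delta(a, b), "migrations:", b["migrations"])
```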
42
microbench/connector_tax/layerwise/run_ablation_es.sh
Executable file
@@ -0,0 +1,42 @@
#!/usr/bin/env bash
# Ablation: does the REAL engine-state feed (P2) change each policy's
# performance and ranking vs the stale-shadow baseline?
#
# Each config is run twice (ES=0 shadow-only, ES=1 real-state feed) so the
# ONLY difference is the state source. Sequential, ~47 min each.
#
# Default = the 4 decisive runs (champion + migration, with/without feed).
# Extend CONFIGS for the full sweep (lmetric / unified_kv_both / load_only).
set -uo pipefail
PROJ_DIR="${PROJ_DIR:-/home/admin/cpfs/wjh/agentic-kv}"
R="$PROJ_DIR/microbench/connector_tax/layerwise/run_v3_trace.sh"
AB="--overload-factor 1.3 --lmetric-decode-weight 0.01"
LOGD=/tmp/dst_break_logs; mkdir -p "$LOGD"

# CONFIG format: "TAG|POLICY|MODE|AB?|ES"
CONFIGS=(
  "unified_AB_es0|unified|baseline|AB|0"
  "unified_AB_es1|unified|baseline|AB|1"
  "v3_AB_lw_es0|unified_v3|layerwise|AB|0"
  "v3_AB_lw_es1|unified_v3|layerwise|AB|1"
  # --- extend for the full sweep ---
  # "lmetric_es0|lmetric|baseline|noAB|0"
  # "lmetric_es1|lmetric|baseline|noAB|1"
  # "ukvboth_AB_es0|unified_kv_both|baseline|AB|0"
  # "ukvboth_AB_es1|unified_kv_both|baseline|AB|1"
)

for cfg in "${CONFIGS[@]}"; do
  IFS='|' read -r tag policy mode ab es <<< "$cfg"
  ab_flags=""; [ "$ab" = "AB" ] && ab_flags="$AB"
  echo "########## $tag (policy=$policy mode=$mode ab=$ab es=$es) ##########"
  TAG="$tag" POLICY="$policy" MODE="$mode" AB_FLAGS="$ab_flags" ES="$es" \
    bash "$R" 2>&1 | tee "$LOGD/abl_${tag}.log" | tail -6
done

echo "########## ABLATION DONE: summary ##########"
for cfg in "${CONFIGS[@]}"; do
  IFS='|' read -r tag _ _ _ _ <<< "$cfg"
  echo "=== $tag ==="
  grep -E "requests:|TTFT|migrations:" "$LOGD/abl_${tag}.log" 2>/dev/null || true
done
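The `"TAG|POLICY|MODE|AB?|ES"` strings carry the whole experiment design: each ES=0 run has an ES=1 twin that differs only in the state source. A small sketch of that pairing, mirroring the `IFS='|' read` split in the script, is shown below; `parse_config` and `pair_by_feed` are hypothetical helper names, and only the four default configs are included.

```python
from collections import defaultdict

AB = "--overload-factor 1.3 --lmetric-decode-weight 0.01"

CONFIGS = [
    "unified_AB_es0|unified|baseline|AB|0",
    "unified_AB_es1|unified|baseline|AB|1",
    "v3_AB_lw_es0|unified_v3|layerwise|AB|0",
    "v3_AB_lw_es1|unified_v3|layerwise|AB|1",
]


def parse_config(cfg):
    """Split one CONFIG entry into the env vars that run_v3_trace.sh reads."""
    tag, policy, mode, ab, es = cfg.split("|")
    return {
        "TAG": tag,
        "POLICY": policy,
        "MODE": mode,
        # Only the literal marker "AB" switches the A+B proxy flags on.
        "AB_FLAGS": AB if ab == "AB" else "",
        "ES": es,
    }


def pair_by_feed(configs):
    """Group runs that are identical except for the engine-state feed (ES)."""
    pairs = defaultdict(dict)
    for cfg in configs:
        env = parse_config(cfg)
        key = (env["POLICY"], env["MODE"], env["AB_FLAGS"] != "")
        pairs[key][env["ES"]] = env["TAG"]
    return dict(pairs)


pairs = pair_by_feed(CONFIGS)
for key, tags in pairs.items():
    print(key, "shadow-only:", tags["0"], "| real feed:", tags["1"])
```

Checking that every key has both a `"0"` and a `"1"` entry is a cheap guard against extending `CONFIGS` with an unpaired run.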
114
microbench/connector_tax/layerwise/run_mb7.sh
Normal file
@@ -0,0 +1,114 @@
#!/usr/bin/env bash
# MB7 launcher (runs on dash0). Two 2-instance modes selected by MODE env:
#   MODE=baseline  : restore stock connector, no layerwise env
#   MODE=layerwise : deploy mooncake_connector.LAYERWISE.py + MOONCAKE_LAYERWISE=1
#
# Chunked prefill is DISABLED (max-num-batched-tokens >= max prompt) so the
# producer prefill is a single forward and save_kv_layer fires once per layer
# in order; the layer-wise counter assumes this.
#
# The connector is always restored from .ORIG_BACKUP on exit.
#
# Usage (on dash0):
#   MODE=baseline  bash run_mb7.sh
#   MODE=layerwise bash run_mb7.sh

set -uo pipefail

MODE="${MODE:-baseline}"
PROJ_DIR="${PROJ_DIR:-/home/admin/cpfs/wjh/agentic-kv}"
VENV="${VENV:-$PROJ_DIR/.venv}"
MODEL="${MODEL:-/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct}"
GPUS=(${GPUS:-0 1})
SIZES="${SIZES:-8192,16384,32768}"
REPEATS="${REPEATS:-3}"
BG_LOAD="${BG_LOAD:-0}"
CONCURRENT="${CONCURRENT:-1}"
MAX_BATCHED="${MAX_BATCHED:-40960}"   # >= max prompt => no chunked prefill
DATE="$(date +%Y%m%d_%H%M)"
OUTDIR="${OUTDIR:-$PROJ_DIR/outputs/mb7_${MODE}_${DATE}}"
PYTHON="$VENV/bin/python"
MC_FILE="$VENV/lib/python3.12/site-packages/vllm/distributed/kv_transfer/kv_connector/v1/mooncake/mooncake_connector.py"
LW_SRC="${LW_SRC:-/tmp/mooncake_connector.LAYERWISE.py}"
DRIVER="$PROJ_DIR/microbench/connector_tax/layerwise/mb7_layerwise.py"

mkdir -p "$OUTDIR/logs"
PORTS=(8000 8001); BPS=(8998 8999)

echo "=== MB7 ($MODE) ==="
echo "Out: $OUTDIR ; connector: $MC_FILE"

restore_connector() {
  if [ -f "$MC_FILE.ORIG_BACKUP" ]; then
    cp -f "$MC_FILE.ORIG_BACKUP" "$MC_FILE"
    echo "[restore] connector reset to ORIG"
  fi
}
cleanup() {
  pkill -9 -f "vllm serve" 2>/dev/null || true
  pkill -9 -f "EngineCore" 2>/dev/null || true
  sleep 4
  restore_connector
}
trap cleanup EXIT
pkill -9 -f "vllm serve" 2>/dev/null || true; sleep 3

# Deploy the connector for the chosen mode.
if [ "$MODE" = "layerwise" ]; then
  if [ ! -f "$LW_SRC" ]; then echo "FATAL: $LW_SRC not found (scp it first)"; exit 1; fi
  # Take the backup if it is missing (same guard as run_v3_trace.sh), so the
  # overwrite below can always be undone by restore_connector.
  [ -f "$MC_FILE.ORIG_BACKUP" ] || cp "$MC_FILE" "$MC_FILE.ORIG_BACKUP"
  cp -f "$LW_SRC" "$MC_FILE"
  "$PYTHON" -c "import ast; ast.parse(open('$MC_FILE').read()); print('[deploy] LAYERWISE connector AST OK')" || exit 1
  LW_ENV="MOONCAKE_LAYERWISE=1"
else
  restore_connector
  LW_ENV=""
fi

echo "[launch] 2 instances (max-num-batched-tokens=$MAX_BATCHED, chunked-prefill off)"
i=0
for gpu in "${GPUS[@]:0:2}"; do
  port=${PORTS[$i]}; bp=${BPS[$i]}; master=$((29700 + i))
  env $LW_ENV \
    PYTHONHASHSEED=42 VLLM_MOONCAKE_BOOTSTRAP_PORT=$bp \
    CUDA_VISIBLE_DEVICES=$gpu MASTER_PORT=$master \
    nohup "$VENV/bin/vllm" serve "$MODEL" \
    --host 0.0.0.0 --port "$port" --tensor-parallel-size 1 \
    --trust-remote-code --enable-prefix-caching --dtype auto \
    --gpu-memory-utilization 0.9 --max-model-len 200000 \
    --max-num-batched-tokens "$MAX_BATCHED" \
    --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_both"}' \
    --enable-prompt-tokens-details \
    > "$OUTDIR/logs/vllm_${i}_gpu${gpu}.log" 2>&1 &
  disown; sleep 2; i=$((i + 1))
done

echo "[health] waiting ..."
for i in 0 1; do
  port=${PORTS[$i]}; tries=0
  while ! curl -sf "http://127.0.0.1:$port/health" >/dev/null 2>&1; do
    tries=$((tries + 1)); [ $tries -gt 180 ] && { echo "FATAL inst_$i"; exit 1; }
    sleep 2
  done
  echo " inst_$i ready"
done
for i in 0 1; do
  bp=${BPS[$i]}; tries=0
  while ! curl -sf "http://127.0.0.1:$bp/query" >/dev/null 2>&1; do
    tries=$((tries+1)); [ $tries -gt 60 ] && { echo "WARN bp $bp"; break; }; sleep 2
  done
done

echo "[run] mb7 --mode $MODE"
"$PYTHON" "$DRIVER" --mode "$MODE" \
  --src-port "${PORTS[0]}" --dst-port "${PORTS[1]}" \
  --src-bp "${BPS[0]}" --dst-bp "${BPS[1]}" \
  --sizes "$SIZES" --repeats "$REPEATS" --bg-load "$BG_LOAD" \
  --concurrent "$CONCURRENT" --out "$OUTDIR/mb7_result.json" \
  2>&1 | tee "$OUTDIR/mb7_run.txt"

echo "[done] $OUTDIR"
# grep layerwise transfer logs from the producer (gpu0) for sanity
if [ "$MODE" = "layerwise" ]; then
  echo "=== producer layerwise log lines ==="
  grep -i "layerwise" "$OUTDIR/logs/vllm_0_gpu${GPUS[0]}.log" | tail -10 || true
fi
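The header of `run_mb7.sh` relies on `max-num-batched-tokens >= max prompt` so that each prefill is a single forward pass. When `SIZES` or `MAX_BATCHED` is overridden from the environment, that precondition can silently break. The check below is a rough sketch of a pre-flight guard; `single_forward_prefill` is a hypothetical helper, and it compares nominal prompt sizes only, ignoring any extra tokens a chat template or tokenizer might add.

```python
def single_forward_prefill(sizes_csv, max_batched):
    """Map each nominal prompt size to whether it fits in one prefill batch.

    A size larger than max_batched would be split into chunks by the
    scheduler, which the layer-wise counter in this experiment does not expect.
    """
    sizes = [int(s) for s in sizes_csv.split(",") if s.strip()]
    return {size: size <= max_batched for size in sizes}


# Script defaults: SIZES=8192,16384,32768 and MAX_BATCHED=40960.
defaults = single_forward_prefill("8192,16384,32768", 40960)
print(defaults)

# A lowered MAX_BATCHED would chunk the largest prompt.
lowered = single_forward_prefill("8192,16384,32768", 16384)
too_big = [size for size, fits in lowered.items() if not fits]
print("would be chunked:", too_big)
```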
114
microbench/connector_tax/layerwise/run_v3_trace.sh
Executable file
@@ -0,0 +1,114 @@
#!/usr/bin/env bash
# Full 1200-req v3 trace, two modes (MODE env), for layer-wise re-profile.
#   MODE=baseline  : stock connector + stock proxy (post-hoc transfer)
#   MODE=layerwise : LAYERWISE connector + write-mode proxy (overlapped)
# Both: unified_v3 routing by default (override with POLICY) + DR-fix.
# Connector & proxy restored from backup on exit.
# Output-equivalence/correctness gate = success rate + migrated-req
# TTFT distribution (byte-level KV correctness already validated on mb7).
#
# Usage (on dash0):  MODE=baseline  bash run_v3_trace.sh
#                    MODE=layerwise bash run_v3_trace.sh

set -uo pipefail
MODE="${MODE:-baseline}"
POLICY="${POLICY:-unified_v3}"
AB_FLAGS="${AB_FLAGS:-}"   # e.g. "--overload-factor 1.3 --lmetric-decode-weight 0.01"
TAG="${TAG:-$MODE}"
PROJ_DIR="${PROJ_DIR:-/home/admin/cpfs/wjh/agentic-kv}"
VENV="$PROJ_DIR/.venv"
VLLM_ROOT="$VENV/lib/python3.12/site-packages/vllm"
TRACE="${TRACE:-$PROJ_DIR/traces/w600_r0.0015_st30.jsonl}"
DATE="$(date +%Y%m%d_%H%M)"
OUTROOT="${OUTROOT:-$PROJ_DIR/outputs/v3trace_${TAG}_${DATE}}"
PYTHON="$VENV/bin/python"
DR_FIX="$PROJ_DIR/microbench/connector_tax/cache_sweep/apply_direct_read_fix.py"
MC_FILE="$VLLM_ROOT/distributed/kv_transfer/kv_connector/v1/mooncake/mooncake_connector.py"
PROXY_FILE="$PROJ_DIR/scripts/cache_aware_proxy.py"
# Staging on shared cpfs (visible on dash0/dash1), not node-local /tmp.
_LWDIR="$PROJ_DIR/microbench/connector_tax/layerwise"
LW_CONN="${LW_CONN:-$_LWDIR/mooncake_connector.LAYERWISE.py}"
WM_PROXY="${WM_PROXY:-$_LWDIR/cache_aware_proxy.WRITEMODE.py}"
ES_INSTR="$_LWDIR/instrument_engine_state.py"
ES="${ES:-0}"   # 1 = enable real engine-state feed (P2)
ES_DIR="/dev/shm/agentic_engine_state_${TAG}"

mkdir -p "$OUTROOT"
# The subdirectory name is fixed to unified_v3 regardless of POLICY;
# run_ab_matrix.sh globs on this name when collecting results.
cfg_dir="$OUTROOT/unified_v3"; mkdir -p "$cfg_dir"

# Backups (connector backup already exists as .ORIG_BACKUP; make proxy one).
[ -f "$MC_FILE.ORIG_BACKUP" ] || cp "$MC_FILE" "$MC_FILE.ORIG_BACKUP"
[ -f "$PROXY_FILE.ORIG_BACKUP" ] || cp "$PROXY_FILE" "$PROXY_FILE.ORIG_BACKUP"

restore() {
  cp -f "$MC_FILE.ORIG_BACKUP" "$MC_FILE"
  cp -f "$PROXY_FILE.ORIG_BACKUP" "$PROXY_FILE"
  "$PYTHON" "$DR_FIX" --revert --vllm-root "$VLLM_ROOT" 2>/dev/null || true
  "$PYTHON" "$ES_INSTR" --revert --venv "$VENV" 2>/dev/null || true
  rm -rf "$ES_DIR" 2>/dev/null || true
  echo "[restore] connector+proxy reset to ORIG, DR-fix + ES-patch reverted"
}
cleanup() {
  pkill -9 -f cache_aware_proxy 2>/dev/null || true
  pkill -9 -f "vllm serve" 2>/dev/null || true
  pkill -9 -f "EngineCore" 2>/dev/null || true
  sleep 5
  restore
}
trap cleanup EXIT
pkill -9 -f "vllm serve" 2>/dev/null || true; sleep 3
restore   # start from clean

echo "=== v3 trace (mode=$MODE es=$ES tag=$TAG) -> $OUTROOT ==="
# Always deploy the enhanced proxy (write-mode + engine-state, both env/flag
# gated; with feed off + write-mode off it behaves identically to stock).
cp -f "$WM_PROXY" "$PROXY_FILE"
if [ "$MODE" = "layerwise" ]; then
  cp -f "$LW_CONN" "$MC_FILE"
  export MOONCAKE_LAYERWISE=1
  export EAR_WRITE_MODE=1
fi
"$PYTHON" -c "import ast; ast.parse(open('$MC_FILE').read()); ast.parse(open('$PROXY_FILE').read()); print('[deploy] proxy + connector AST OK')" || exit 1

PROXY_ES_ARG=""
if [ "$ES" = "1" ]; then
  echo "[ES] apply engine-state patch + enable feed at $ES_DIR"
  "$PYTHON" "$ES_INSTR" --apply --venv "$VENV"
  mkdir -p "$ES_DIR"
  export AGENTIC_ENGINE_STATE_URI="file://$ES_DIR"
  PROXY_ES_ARG="--engine-state-uri file://$ES_DIR"
fi

echo "[DR-fix] apply"
"$PYTHON" "$DR_FIX" --apply --vllm-root "$VLLM_ROOT"
export VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC=1

echo "[run] $POLICY AB=[$AB_FLAGS] (MOONCAKE_LAYERWISE=${MOONCAKE_LAYERWISE:-0} EAR_WRITE_MODE=${EAR_WRITE_MODE:-0})"
EXTRA_PROXY_ARGS="$AB_FLAGS $PROXY_ES_ARG" bash "$PROJ_DIR/scripts/b3_isolated_policy.sh" "$POLICY" "$TRACE" "$cfg_dir" \
  2>&1 | tee "$cfg_dir/orchestrator.log" | tail -20

pkill -9 -f cache_aware_proxy 2>/dev/null || true
pkill -9 -f "vllm serve" 2>/dev/null || true
sleep 5

echo "[stats] $MODE"
"$PYTHON" - "$cfg_dir" << 'PYEOF'
import json, sys, statistics
d = sys.argv[1]
ms = [json.loads(l) for l in open(f"{d}/metrics.jsonl")]
ok = [m for m in ms if not m.get("error")]
ttft = sorted(m["ttft_s"] for m in ok if m.get("ttft_s") is not None)
def p(q): return ttft[min(len(ttft)-1, int(q*len(ttft)))] if ttft else 0
print(f" requests: {len(ms)} success: {len(ok)} ({len(ok)/max(1,len(ms))*100:.1f}%)")
print(f" TTFT s : p50={p(.5):.2f} p90={p(.9):.2f} p99={p(.99):.2f}")
# migrated reqs from proxy breakdown
try:
    bd = json.load(open(f"{d}/breakdown.json"))
    mig = [x for x in bd if x.get("route_class") == "PD_SEP_V2"]
    mids = {x["request_id"] for x in mig}
    mt = sorted(m["ttft_s"] for m in ok if m["request_id"] in mids and m.get("ttft_s"))
    print(f" migrations: {len(mig)} migrated-req TTFT: "
          f"p50={mt[len(mt)//2]:.2f} p90={mt[int(len(mt)*.9)]:.2f} max={mt[-1]:.2f}" if mt else f" migrations: {len(mig)}")
except Exception as e:
    print(f" (breakdown parse: {e})")
PYEOF
echo "[done] $cfg_dir (metrics.jsonl, breakdown.json)"
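The stats block above reports percentiles with a one-line index rule, `ttft[min(len-1, int(q*len))]`, on the sorted TTFT list. That is a nearest-rank pick rather than an interpolated percentile, so for an even number of samples its p50 is the upper of the two middle values. The standalone sketch below reproduces the rule on a toy list to make the difference visible; the function name `percentile` and the toy data are illustrative only.

```python
import statistics


def percentile(sorted_vals, q):
    """Nearest-rank style pick, same index rule as p(q) in the stats block."""
    if not sorted_vals:
        return 0
    idx = min(len(sorted_vals) - 1, int(q * len(sorted_vals)))
    return sorted_vals[idx]


toy = list(range(1, 11))  # 1..10, already sorted

p50 = percentile(toy, 0.5)    # index 5 -> the 6th value
p90 = percentile(toy, 0.9)    # index 9 -> the last value
p99 = percentile(toy, 0.99)   # int(9.9) = 9 -> also the last value

print("index-rule p50:", p50, "| interpolated median:", statistics.median(toy))
print("p90:", p90, "p99:", p99)
```

With small samples, p90 and p99 collapse onto the maximum, which matters for the migrated-request subset because it can be far smaller than the full 1200-request trace.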