Commit Graph

235 Commits

Author SHA1 Message Date
837df6bc9e v2 exp(a): three-tier KV-hit latency microbench (GPU >> CPU >> miss)
Measures TTFT to serve a reused prefix of length L from each KV tier on a
single H20 (Qwen3-Coder-30B-A3B, vLLM 0.18.1): miss (recompute), CPU-tier
hit (native DRAM offload), GPU-tier hit (HBM prefix cache). Each measured
request is bracketed by /metrics scrapes so the tier is verified
(vllm:prefix_cache_hits vs external_prefix_cache_hits), not assumed.

Result: GPU hit is ~flat (42->111 ms over 1k->64k tokens); CPU hit is
transfer-bound (PCIe H2D ~54 GB/s, 57->272 ms); miss grows superlinearly
(78 ms -> 15.2 s). GPU beats CPU 1.4-2.5x (gap grows with context);
miss/CPU up to 56x, miss/GPU up to 137x. pcie_transfer.py is the
independent CPU-hit floor backstop. Evidence for the GPU-hit-first
principle (paper section 2.2).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 11:23:04 +08:00
cf812b6264 Workload characterization C1-C3 on full production trace
Joint/temporal characterizations of the full 051315 cluster trace (2.11M
req / 1.31M sessions / 2h), beyond the existing single-variable marginals:

- C1 mixture: 90.3% sessions single-turn, but multi-turn (9.7%) = 44% reqs /
  67% prefill mass; continuation hazard rises 10%->94% (Lindy); heaviness
  unpredictable at turn 1 (corr 0.04-0.15) => reactive routing justified.
- C2 resident/delta: resident context 11k->56k while new-prefill 2.7k->~200;
  per-turn reuse ->99.6%; resident/delta ("PD tax") ->~250-450x.
- C3 prefill/decode: token mass 98.7% input / 1.3% output, BUT decode ~70% of
  TIME (robust 68-71%); "decode negligible" is wrong (tokens != time). Correct
  colo argument = roofline complementarity, not "no decode".

Maps each to (1) PD-colocation and (2) routing. compute_chars.py + chars.json
+ figs/workload_chars/. Raw-file exact validation (cached_tokens, real
timings) pending.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 18:19:39 +08:00
847f52f03b PD-disagg crossover: regular synthetic trace + goodput sweep + figure
gen_synthetic_trace.py --mode regular: maximally-regular multi-turn trace
(fixed prefix/delta/turns, constant arrivals, zero session skew) to isolate
the structural PD cost (per-turn full-context transfer + P/D capacity split)
from the skew/hot-pin artifact.

analysis/crossover/: SLO-goodput PD_advantage sweeps bracketing the
prefill<->decode bottleneck axis (D1 grow input -> prefill-bound; D2 grow
output -> decode-bound). figs/crossover_pd_advantage.png shows the crossover
(y=1) with the agentic operating region annotated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 18:19:23 +08:00
48ae72467a Replayer: closed-loop inter-turn think-time mode
Add --inter-turn-think (env REPLAY_INTER_TURN_THINK_S): turn 1 fires on
session admission, each later turn a FIXED think-time after the previous
turn COMPLETES, ignoring absolute trace timestamps. Combined with
--max-inflight-sessions (env REPLAY_MAX_INFLIGHT) this is a stable N-user
closed loop, removing the open-loop "fire immediately because timestamp is
in the past" retrigger artifact. Needed for the dispatch-coupling
(wall-clock amplification) sweep.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 18:19:12 +08:00
657cd36f3d Gate evict_sent_blocks behind VLLM_EVICT_SENT_BLOCKS
Fork commit e13391e unconditionally evicts sent blocks from the prefix
cache on every KV transfer. That is correct only for session MIGRATION
(source won't see the session again); for plain PD-disagg producer->
consumer transfers it destroys cross-turn producer reuse and contaminates
PD reuse experiments. Default OFF; enable for migration runs via
VLLM_EVICT_SENT_BLOCKS=1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 18:18:59 +08:00
a0db3cbe77 Add leastwork_kappa decode-aware ablation (net-negative, documented)
--policy leastwork_kappa + --kappa (default 2.5e-6, derived from KV ~100KB/tok
/ HBM 4TB/s / TPOT 10ms on H20+Qwen3-30B-A3B): score = prefill_work * (1 +
kappa * ongoing_decode_tokens), modelling decode as a fractional throughput tax
on a new prefill.

Result on the 600s trace: NET-NEGATIVE vs plain leastwork — TTFT p90 +18%,
E2E p90 +14%, balance 1.55x->1.97x, and it does NOT fix the E2E-p99 it targeted.
Decode is too cheap in agentic (output p50~80) for the term to help; it just
bounces heavy reqs off their cache-owner into cold re-prefill. The E2E-p99 tail
is the structural HEAVY+>50k floor (per-class p99 ~51-52k for ALL policies), not
decode interference. Kept in-tree as a documented ablation justifying LPWL's
omission of any decode term; do not revive without a decode-heavy regime.
See analysis/lpwl_5policy_600s.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 17:07:23 +08:00
71b0747b3b 600s-truncated trace + LPWL 5-policy results
traces/w600_r0.0015_st30_first600s.jsonl: first-600s cut of the shipped w600
trace (807 reqs, 274 sessions, all turn-1s + early later-turns; theoretical
APC ceiling ~70% vs 80% full). Faster iteration (~18 min/arm) but a colder,
lower-locality regime; whitelisted alongside the parent anonymized trace.

analysis/lpwl_5policy_600s.md: LPWL vs LMetric/sticky/unified/unified+A+B on
the 600s trace (dash1 8xH20, cold APC, n=1). LPWL is overall best with zero
knobs — TTFT p90 7983ms vs tuned A+B 11562 (-31%), E2E p90 -16%, best request
balance; APC 0.648 (emergent affinity, far above LMetric 0.507); only loss is
E2E p99 from heavy-class decode concentration. Demonstrates anti-overfit: A+B
was tuned on full w600 yet is beaten by the knob-free policy on this regime.
Includes the run_5policy_600s.sh repro driver.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 16:08:35 +08:00
160c29133d Unified bench report: mean+TPS+per-worker GPU util, auto-captured
scripts/bench_report.py is now the canonical analyzer: per run + per input-
class it emits TTFT/TPOT/E2E mean+p50+p90+p99, decode/prefill TPS (aggregate
and per-worker), APC, per-worker GPU util mean/max, and load-spread ratios.

b3_isolated_policy.sh auto-captures the inputs for every run: gpu_util.csv
(via gpu_monitor.sh, 5s, replay-window only) + bench_config.json (worker->GPU
map); teardown stops the sampler. Future runs populate per-worker GPU util
automatically.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 16:08:22 +08:00
d9046322c6 Add parameter-free LPWL routing policy (--policy leastwork)
Least-Prefill-Work-Left: score = pending_prefill_tokens + max(0, input -
cache_hit_here), pure argmin with (num_requests, round-robin) tie-break.
Zero hyperparameters — derived from the agentic pattern: decode is cheap
(I/O ~217x) so outstanding prefill-token-work is the only load worth
modelling. Dropping LMetric's x num_requests factor (a) un-swallows the
cache signal so affinity emerges with no gate, and (b) makes an idle-but-
decoding host score `input` (its true marginal cost) instead of 0,
removing the empty-batch degeneracy. Stick-vs-spill crossover is computed
from real token-work, replacing overload_factor + cache_ratio gate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 16:08:10 +08:00
8a876e90d1 traces/README: clarify w600 is the session-start window, not span
The trace actually spans ~2912 s (~48.5 min): all 274 sessions START within
the 600 s --window-seconds window, but their later multi-turn requests (34%
of rows, inter-turn gaps up to ~700 s) extend well past t=600 s. Remove the
misleading "~600 s span".

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 12:04:14 +08:00
e532e83d3e mb5_run: scrape per-instance prefix-cache counters before teardown
Per-port vllm:prefix_cache_{queries,hits}_total -> instance_apc.txt. For PD
this is the only honest reuse signal: producer ports show cross-turn prefix
hits, while the consumer's per-request cached_tokens just counts transferred KV.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 11:56:43 +08:00
d376d91fe1 Engine-state ablation: full sweep harness + results
Real-time engine state is NOT the routing lever. Across 6 policies × es0/es1,
real state reshuffles 44-76% of decisions but never beats the champion
(unified+A+B, p90 7.62s). The effect's SIGN is set by reactivity: one-shot
placement (sticky) HELPS -26%; per-request affinity-dominated is a wash;
per-request pure-load (lmetric +17%, load_only +27%) HURTS via herding (stale
shadow was a dampener). Feed verified fresh (median 25ms, <=92ms during
prefills). Prior shadow-state results stand. ES_ABLATION_RESULTS.md has the
table + mechanism; run_full_ablation.sh / fresh_sampler.py / cmp_es.py are the
harness.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 11:55:49 +08:00
08c3cf48aa Ship anonymized benchmark trace w600_r0.0015_st30 + provenance
Whitelist the sampled replay trace (1214 reqs / 274 sessions / ~600 s) past
the traces/ ignore so the repo is runnable without dash0 access. Metadata
only (token counts, opaque KV-block hashes, timing, session structure) — no
prompts/outputs/PII. traces/README documents schema, provenance (sampled
from the internal GLM-5.1 production trace via scripts/sample_trace.py), and
the regeneration command.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 11:54:43 +08:00
8708b75520 Merge layerwise KV transfer + engine-state ablation onto main
Brings the worktree-mooncake-layerwise line (layerwise Mooncake connector,
write-mode proxy, real engine-state feed + eff_ accessors, mb7 microbench,
v3 trace re-profile, A/B x migration matrix runner) into main so the repo
is self-contained for these experiments. Disjoint paths
(microbench/connector_tax/layerwise/*) => clean merge.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 11:53:40 +08:00
ee5db0b321 MB5 driver updates: PD-proxy + snapshot instrument + launcher tweaks
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 11:53:27 +08:00
bad512d3c5 PD-disagg crossover: synthetic-trace generator + morpher + plotter
gen_synthetic_trace (vanilla Poisson, zero prefix reuse — the regime where
PD-disagg is expected to win), mutate_trace (morph reuse/burst/skew toward
the agentic regime), and plot_crossover. Emits the replayer's JSONL schema.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 11:53:21 +08:00
41a0c1c48f Migration correctness smoke tests: direct-read, partial-transfer, NIXL
Standalone smoke tests validating KV-migration correctness paths before
trace replay: full migrate-cache, partial-prefill transfer, and a
NIXL-connector variant, each with a runner.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 11:53:13 +08:00
1262c9c22e Migration transfer-cost study: KV transfer is slow on busy GPUs
MIGRATION_TRANSFER_COST.md: under real load, migration KV transfer runs at
~3 GB/s vs ~10 GB/s idle. Decomposed (instruments + MB6 microbench) into
~55% RDMA-actual (HBM/PCIe contention with running kernels: 7.6->4.0 GB/s)
+ ~45% control-plane GIL starvation during long prefills. Reproduced on a
fresh upstream venv (byte-identical transfer path) -> upstream/hardware
inherent, not our patch. Layerwise is the wrong lever; the tax is structural
on a loaded agentic cluster. Includes mb6_transfer_under_load + run_mb6,
instrument_dst_migration/mooncake, and the dst/transfer decomposition analyzers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 11:53:01 +08:00
67fcec7933 Unified-routing A+B ablation: decode-aware LMetric + v3 anti-hotspot
cache_aware_proxy: add lmetric_decode_weight (decode-load penalty in the
LMetric fallback score) and a v3 anti-hotspot recent-migration penalty
(effective_load = num_req + recent-migration count over a sliding window),
preventing back-to-back migration clustering. UNIFIED_ABLATION.md documents
the A (overload_factor=1.3) + B' (decode-weight, max(num_req,1)) + RaceFix
sweep: A+B'+RaceFix reaches TTFT p90 7770ms, beating v3 PD-sep migration by
~20%. Runners/analyzer for the b3 trace replay included.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 11:52:44 +08:00
a2f2645fda PD_DISAGG_RESULTS §6.3: producer hot-pinning figure
Direct per-producer KV-pool evidence for the session-affinity backfire.
At the same 4P+4D ratio:
- round-robin: 4 producers within 1pp of each other (spread 0pp, CV 0.01)
- session-affinity: spread 49pp (one producer ~93%, another 45%; CV 0.25)

A 25x jump in producer load imbalance — heavy multi-turn sessions
concentrate onto single producers, the same hot-pinning pathology as
sticky routing in the colocated §3.3 study.

plot_producer_hotspot.py: reduce (numpy, per-producer KV timeline from
snapshots, runs on the serving host) + plot (matplotlib, 2-panel rr vs
session comparison) — same two-stage pattern as aggregate_mb5.py.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 00:38:20 +08:00
7947831e0f run_v3_trace.sh: stage LAYERWISE conn + enhanced proxy from shared cpfs (dash1-ready) 2026-05-29 00:29:56 +08:00
6243b78bba PD_DISAGG_RESULTS §6: session-affinity routing does not rescue PD
Swept session-affinity P routing (MB5_P_ROUTING=session) across all
four ratios on the metrics-fixed stack. Findings:

- Strictly worse than round-robin at every ratio. 4P+4D: round-robin
  100% vs session-affinity 36% completion.
- Success DECREASES monotonically as decode capacity grows
  (6P+2D 59% -> 4P+4D 36% -> 3P+5D 24% -> 2P+6D 19%) — refutes the
  "session prefill is faster so it needs more D" hypothesis.
- GPUs sit at ~0% utilization (2P+6D entirely idle) — the cluster
  stalls on KV-transfer/admission coordination, not compute. This is
  the deepest anti-PD argument: paid-for hardware does nothing while
  requests pile up; colocation keeps every GPU busy.
- Mechanism: session-affinity pins heavy multi-turn sessions onto
  single producers (producer hot-pinning, same pathology as sticky
  routing in the colocated §3.3 study); fewer producers -> worse
  concentration -> the monotonic decline. Failed transfers also pin
  producer KV (kv_load_failure_policy=fail), compounding to deadlock.

Verdict: neither ratio tuning nor routing policy rescues static
PD-disagg for this agentic workload — the failure is structural.

mb5_launch.sh: add 5P+3D / 3P+5D ratios for the sweep.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 00:25:10 +08:00
5b26c345f4 P2: all routing policies read real state via eff_ accessors + ablation harness
InstanceState.eff_{num_requests,pending_prefill,ongoing_decode,ongoing_tokens}
= max(shadow, real) when feed fresh (fixes 30s-stale under-count, keeps
in-flight RaceFix), plus real-only r_max_prefill_remaining / r_kv_used_frac.
Wired into load_only, lmetric, sticky, unified(_kv_both), unified_v3, and
snapshot logging. Feed off => identical to before. run_v3_trace.sh gains ES=1
toggle (always deploys enhanced proxy); run_ablation_es.sh runs each config
ES0-vs-ES1 to test whether real state changes policy performance/ranking.
All unit-tested without GPU.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 20:21:12 +08:00
be948d32b8 P2: real engine-state feed replaces stale shadow counters for migration targeting
vLLM scheduler publishes real state (running/waiting, KV free, and the
max-in-progress-prefill signal /metrics lacks) to a tmpfs/redis store ~20Hz;
router reads it and avoids GIL-stall (mid-large-prefill) + KV-capacity-wall
targets, using real load over 30s-stale shadow counters. Components:
engine_state.py (canonical+reader), instrument_engine_state.py (scheduler
patch, file/redis writer), migration_target.py (scorer), proxy wiring
(--engine-state-uri, off=unchanged). All unit-tested without GPU; not yet
run live. See P2_ENGINE_STATE.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 20:01:26 +08:00
19191940e6 A/B x migration matrix runner (parameterized run_v3_trace.sh + wrapper) 2026-05-28 19:23:16 +08:00
63387f614d Full v3 trace re-profile with layer-wise: matched migrations improve
1213/1214 success; matched migrations (4 common) improved -2.6 to -7.2s,
scaling with prefill hidden behind transfer. Trace-level TTFT p90 -6% / p99
-5% (modest: migrations are 2% of reqs and partly queue-bound). Confirms
layer-wise removes the transfer half of migration overhead but not the
control-plane/queue residual. DESIGN.md updated with results.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 19:16:37 +08:00
21db2affb4 Trace runner (run_v3_trace.sh) + concurrent mb7 correctness test 2026-05-28 17:28:48 +08:00
e705bb33b6 Proxy write-mode: concurrent prefill+decode dispatch for v3 (EAR_WRITE_MODE=1) 2026-05-28 17:22:18 +08:00
4242bba034 Chunk-safe + concurrent layer-wise connector (per-step incremental shipping)
Scheduler tracks per-producer block_ids (accumulated from scheduler_output)
and emits per-step LWSendMeta with cumulative computed_tokens. Worker
lw_wait_for_save records a CUDA event per step and enqueues progress; the
sender-loop ship loop drains it, shipping only computed+dst-wanted+unshipped
blocks in order (correct under chunked prefill). Per-transfer state =
concurrent-safe. Keeps v1 single-transfer version as reference.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 17:15:54 +08:00
4cd71b6631 Working-set figure: extend left panel to ~50 nodes
Include T=600s/1800s points so the diminishing-returns tail is visible:
14 -> 52 nodes buys only +6pp APC (74%->79.8%), still under the 80.4%
ceiling that oracle/LRU reaches at 14 nodes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 17:11:12 +08:00
2247d1de08 Working-set figure: right panel = W(t) time series
Replace the (redundant) nodes-vs-T cost curve with the working-set
W(t) over wall-clock time for T=2/30/300s. Shows footprint is steady
(peak ~ median) after a short warm-up, so peak-based sizing is sound;
the 300s curve hugs the 14-node ceiling throughout.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 16:31:26 +08:00
e77bdcac5a Layerwise under load: overlap benefit survives (bg=16)
mb7 with background decode load (8/instance). Critical-path transfer overhead
stays ~constant ~90ms for layerwise vs 158/239/749ms baseline (up to 7.9x at
32k), prefill not slowed, KV correct. Confirms the overlap holds on busy
instances. DESIGN.md updated with idle-vs-load table + the two blockers
(chunk-safety, concurrent-transfer safety) that the full 1200-req trace needs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 16:30:14 +08:00
c94b2e237a Working-set figure: linear node axes + benefit/cost split
Drop log node axis (decade ticks were unreadable). Left = APC vs #nodes
(linear), right = #nodes vs retention window T. Mark the 1-node budget
crossing (~7s reuse, ~8% APC) and the 14-node oracle ceiling.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 16:24:15 +08:00
3b8be5bb61 Working-set figure: express footprint in node count, not GB
Both axes now in "# nodes" (footprint / per-node KV pool) so the
cluster-size implication is direct: 1-node budget line + 14-node oracle
ceiling, instead of raw GB.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 16:16:00 +08:00
dae98c6472 Working-set sizing tool + GLM-5.1-FP8/B300 result
Configurable KV working-set analyzer (GPU model x TP/PP/EP x model
config.json with MLA/GQA auto x KV/weight dtype). Computes Denning W(T),
oracle [first,last], and retain-forever footprints vs a per-replica KV
pool, plus the APC captured at each retention window.

GLM-5.1-FP8 (MLA, 43.9 KiB/token) on 1x B300 node (1528 GB KV pool):
live KV fits trivially (~533 GB), but the full 80.4% APC ceiling needs
~14 nodes (oracle) -> long-tail reuse motivates DRAM offload, not HBM.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 16:03:25 +08:00
fec50fa45d Layerwise KV transfer on Mooncake: PoC + microbench (worktree exploration)
Implements per-layer KV push during prefill (write mode) on vLLM's
MooncakeConnector, env-gated by MOONCAKE_LAYERWISE=1. 2-instance microbench
(mb7) shows correctness (KV lands, cached==prompt) and that the transfer is
hidden behind prefill compute: critical-path overhead drops from O(KV size)
(123/202/529ms for 8k/16k/32k) to a flat ~58ms (2-9x), with no prefill
slowdown, on idle instances. Caveats: idle-only, chunked-prefill disabled,
single concurrent transfer — see DESIGN.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 15:34:43 +08:00
2e6a369046 PD_DISAGG_RESULTS §5.1: D-pool pressure crashes consumers
Document the consumer EngineCore crash chain (D-pool 97% -> 112k-token
KV transfer fails -> negative prompt-token counter -> prometheus
ValueError -> engine dead -> cliff failure). Explains the round-robin
6P+2D rep variance (100/56/80%) as intermittent consumer death, and
notes the counter-clamp patch needed to compare routing arms fairly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 13:02:21 +08:00
3957c2df86 MB5 patch: clamp PD-consumer metrics counter underflow
Root cause of the 6P+2D run-to-run collapse (rep1 100%, rep2 56%,
rep3 80%, session-routing 6.6%): not load-shedding, but a consumer
EngineCore crash.

Failure chain observed in the consumer logs:
  1. D-pool fills to ~97% (decode-side capacity ceiling, the H1 story)
  2. a large request's KV transfer fails: "Mooncake transfer engine
     returned -1" (112k-token request, pool full)
  3. scheduler fails the request (kv_load_failure_policy=fail)
  4. PromptTokenStats.local_cache_hit = num_cached + recomputed -
     num_external_computed goes NEGATIVE (external transfer exceeded
     cached count)
  5. loggers.record() calls Counter.inc(negative) -> prometheus raises
     "Counters can only be incremented by non-negative amounts."
  6. EngineCore dies -> every subsequent request fails (the cliff:
     all successes in the first ~110s, zero after)

This turns ONE failed request into a total config collapse, and is
what made the round-robin 6P+2D reps look randomly variable.

Fix: clamp the three per-source prompt-token counts to >= 0 in
loggers.record() before they hit Counter.inc(). Pure insertion,
revertible via the existing sentinel mechanism. Lets a transfer
failure stay a single failed request instead of killing the engine,
so routing arms can be compared on equal footing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 13:01:23 +08:00
8596135680 MB5 analysis: per-role KV split proves static-partition mismatch
aggregate_mb5.py:
- Split the cluster KV timeline by role (P-pool vs D-pool) using a
  PID->role map parsed from vllm_logs filenames. The cluster average
  hid the result — 6P+2D/4P+4D look ~45% utilized but the decode pool
  is actually pegged at ~100% while prefill idles at ~30%.
- Two-stage reduce/plot: --reduce-to (numpy-only, runs on the serving
  host over multi-GB snapshot dirs) dumps a compact JSON; --from-reduced
  (matplotlib) renders locally. matplotlib import is now lazy.
- New plot_role_split figure + p/d peak/steady columns in the CSV.

PD_DISAGG_RESULTS.md: consolidated writeup with figures inline.
Verdict: no static P:D ratio beats 8C colocation. The binding
constraint moves with the ratio (D-pool saturates at 6P+2D/4P+4D,
P-pool jams at 2P+6D -> 91% request loss); 8C's shared pool stays
elastic at 34% steady, 100% completion. PD wins TPOT (10-35x cleaner,
the MB1 phase-isolation benefit is real) but loses TTFT and sheds
load. Round-robin P routing also zeroes prefix-cache reuse; a
session-affinity re-run of 6P+2D is in flight to test the fix.

Figures (rep1): mb5_kv_timeline, mb5_role_split, mb5_peak_utilization,
mb5_latency_compare + mb5_summary.csv.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 12:05:17 +08:00
e8980ce957 MB5 proxy: session-affinity P routing (MB5_P_ROUTING=session)
The upstream mooncake_connector_proxy round-robins both P and D
selection. For agentic multi-turn sessions this destroys prefix-cache
reuse on the producer side — every turn of a session lands on a
different P, so the prefix-cache hit ratio collapses to 0 (observed in
the 6P+2D round-robin baseline) and every turn re-prefills from
scratch, piling extra load on the P pool.

Add an env-gated routing mode so the same proxy serves both arms of a
clean A/B:
  MB5_P_ROUTING=rr       round-robin (default, = upstream behavior)
  MB5_P_ROUTING=session  consistent md5 hash on X-Session-Id -> same
                         producer for all turns of a session

Decode side stays round-robin (load balance) in both modes — decode
KV is freshly transferred per turn, so D gains nothing from affinity
but everything from even load spreading.

mb5_launch.sh threads MB5_P_ROUTING through to the proxy and logs the
active mode. Default path is byte-for-byte the old behavior, so an
in-flight round-robin sweep is unaffected if this is redeployed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 11:05:25 +08:00
b13ca10d19 PD_DISAGG_INVESTIGATION: snapshot Phase 0 done + sweep in flight
Phase 0 infrastructure (vendored proxy, dual-file vLLM patcher,
per-instance + cross-config plotters) is fully assembled and
smoke-validated. Sweep RUN_TAG=20260527_164040 (4 configs × 3 reps
on w600) is running on dash1.

Also realigned the figure list with what `aggregate_mb5.py`
actually produces (mb5_kv_timeline, mb5_peak_utilization,
mb5_latency_compare, mb5_summary.csv).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 00:51:28 +08:00
a66f24d242 MB5 aggregate: cross-config KV-pool + latency comparison
Reads sweep root + tag, for each (config, rep):
- merges per-PID snapshots into cluster-wide KV timeline (carry-forward
  for PIDs without a sample in the bin)
- computes peak (max) and steady-state (10-90% median) pool utilization
- pulls latency p50/p90/p99 from replay_metrics.summary.json

Produces 4 outputs in --out-dir:
- mb5_kv_timeline.png    — N-panel cluster KV % over time, one panel per
                            config, faint per-rep lines + bold median
- mb5_peak_utilization.png — bar chart (peak vs steady) with ±std error bars
- mb5_latency_compare.png  — bar chart p50/p90/p99 e2e latency per config
- mb5_summary.csv          — flat per-(config, rep) table for the writeup

Validated on 4P+4D × 20-req smoke:
  4P+4D rep1: peak=12.8% steady=10.7% peak_wait=1
  p50=1.3s p90=10.5s p99=17.1s (vs. <1s for 8C — expected gap).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 00:49:21 +08:00
a9c7310f4a MB5 PD-disagg pipeline: working end-to-end
Three independent bugs were blocking PD-disagg smoke; each fix is
isolated so the next PD experiment doesn't re-hit them.

1. mb5_launch.sh
   - stop_all() also kills mb5_pd_proxy.py (our vendored copy),
     not just the upstream filename, and asserts ports 8000-8007 +
     PROXY_PORT are free before launching — stale proxies were
     silently passing the readiness check.
   - Proxy readiness uses a generic "any HTTP response" probe;
     mooncake_connector_proxy only exposes /v1/completions so
     /v1/models 404 is expected.

2. mb5_pd_proxy.py (vendored from third_party so deploy.sh ships it)
   - Force min_tokens=1 on the prefill leg. Clients that set
     min_tokens == max_tokens (our replayer does) collide with
     vLLM's min_tokens<=max_tokens check after the proxy caps
     max_tokens=1.

3. instrument_kv_snapshot.py
   - Adds a second patch target: initialize
     MooncakeConnectorWorker.bootstrap_server = None in __init__.
     vLLM 0.18.1 only sets it under the is_kv_producer branch, so
     kv_consumer hits AttributeError as soon as the first remote
     prefill request lands.
   - apply/revert refactored to iterate over (path, patches) pairs.

plot_kv_pool_timeline.py also handles snapshot files that never
captured a running request (would otherwise IndexError on an empty
stackplot input).

Smoke: 4P+4D × 20 reqs → 20/20 success, mean 3.9s, p99 17s, 8 PIDs
all writing snapshots (601 total), well above the 8C baseline.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 00:14:22 +08:00
e0d3b5150a MB5 driver fixes: bash env-prefix + replayer flag names + python date math
Two bugs caught by 8C smoke:

mb5_launch.sh
  ${env_bp_arg} expanded as a literal command line prefix doesn't work
  when env_bp_arg is itself a variable — bash only treats VAR=val as
  an env assignment if it sees the literal in the parsed command, not
  after expansion. Fix: always export VLLM_MOONCAKE_BOOTSTRAP_PORT as
  a literal, defaulting to 9999 when caller passed no port (consumer
  mode ignores the var so the placeholder is harmless).

mb5_run.sh
  replayer's actual CLI flags are --trace / --output / --endpoint /
  --model, not the --*-path / --*-name variants I had. Plus dash1
  has no `bc`; compute wall_clock_s via python instead.

Both fixed; 8C smoke (CONFIG=8C REPS=1 REQUEST_LIMIT=20) now runs
end-to-end in ~30 s:
  - 8 vLLM kv_both instances on GPU 0-7 come up
  - replayer round-robins 20 reqs across them
  - MB5 instrumentation captures 8 snapshot files (one per EngineCore
    PID), ranging 7-139 snapshots each = ~10 Hz throttle works
  - plot_kv_pool_timeline.py renders the stacked-area + queue-depth
    chart cleanly (figs/mb5_smoke/*.png)

Pipeline validated. Ready for the real PD-ratio sweep.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 23:23:23 +08:00
e9abd70c8d MB5 driver: launcher, orchestrator, KV-pool timeline plotter
Three new files to drive the PD ratio sweep + per-request KV occupancy
capture, plus a deploy.sh update so the patched replayer rides along
to the fresh-venv host.

mb5_launch.sh
  One script handles all four configs we plan to sweep:
    CONFIG=8C / 6P+2D / 4P+4D / 2P+6D
  - For 8C: 8 vLLM instances with kv_role=kv_both on GPU 0-7. Replayer
    talks to them via the existing comma-separated round-robin in
    replayer/replay.py — no proxy.
  - For PD configs: kv_role=kv_producer for the P pool (with
    VLLM_MOONCAKE_BOOTSTRAP_PORT) + kv_role=kv_consumer for the D pool,
    routed by the official vLLM example
    third_party/vllm/examples/online_serving/disaggregated_serving/
    mooncake_connector/mooncake_connector_proxy.py — no policy choice
    made by us, per user instruction to use the standard recipe.
  - Applies instrument_kv_snapshot.py before launching so every
    EngineCore writes its per-step KV snapshot to
    $RUN_ROOT/kv_snapshots/mb5_kv_snapshot_pid<pid>.jsonl
  - Reverts the patch on stop.
  - Emits ENDPOINTS= line on stdout for the orchestrator to read.

mb5_run.sh
  For each CONFIG × rep: launch, replay w600 trace via the existing
  replayer, capture wall-clock, tear down, cool down 10 s. Defaults:
    CONFIGS="8C 6P+2D 4P+4D 2P+6D"
    REPS=3
    TRACE=traces/w600_r0.0015_st30.jsonl
  All artefacts go under $FRESH_ROOT/mb5_runs/$RUN_TAG_${config}_rep${rep}/
  (vllm_logs/, kv_snapshots/, replay_metrics.jsonl, wall_clock_s.txt).

plot_kv_pool_timeline.py
  Reads one or more mb5_kv_snapshot_pid*.jsonl files and renders a
  stacked-area chart per file:
    x = wall-clock since first snapshot
    y = KV block count, stacked by per-request contribution
    overlay: pool-total ceiling, 90% line, waiting-queue depth subplot
  Bands are colored by a deterministic hash of request_id so individual
  requests are visually tractable across the run.
  This is the figure the user asked for — turns headline "PD-disagg is
  10× worse" into a system-level picture of *where* the KV pool is
  blocked, when, and by which requests.

deploy.sh
  Also tar-syncs the local replayer/ dir to
  /home/admin/cpfs/wjh/agentic-kv-fresh/replayer/ so mb5_run.sh can
  `python -m replayer` against the patched (trace_span_s/amplification)
  version, not the older copy under /home/admin/cpfs/wjh/agentic-kv/.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 23:02:57 +08:00
a4f5dd56aa MB5 instrumentation: per-request KV-block snapshot from vLLM V1 scheduler
The §3.2 H1 (D-pool capacity wall) argument needs system-level evidence,
not just headline latency. This patch lets us record, every ~100 ms,
the exact composition of each vLLM instance's KV pool:

  - total / free / used block counts
  - for each RUNNING request: blocks held, computed tokens, prompt tokens
  - for each WAITING request: prompt tokens, status

Hook: inside Scheduler.schedule() right before the return. Per-request
blocks come from coordinator.single_type_managers[*].req_to_blocks
(vLLM 0.18.1's own per-request bookkeeping; no new tracking layer).
Throttled by MB5_PERIOD_MS env var (default 100 ms = 10 Hz) so a
13-min trace replay produces ~8 k snapshots per instance instead of
~80 k unthrottled.

Output: $MB5_LOG_DIR/mb5_kv_snapshot_pid<pid>.jsonl
(default MB5_LOG_DIR=/tmp). One file per EngineCore PID.

Apply/revert idempotent, same pattern as instrument_mooncake.py.
Markers: # MB5_INSTRUMENT_START / # MB5_INSTRUMENT_END.

Validated on dash1 venv: apply → py_compile ok → revert → py_compile ok.

With this in place we can build the stacked-area "KV pool composition
over time" figure the user asked for: x = wall-clock, y = block count,
colored bands = per-request portions. Comparing 8C colo vs 4P+4D
on the same trace will directly show whether (and when) the D pool
hits its ceiling — turning "PD-disagg is X× worse" into "PD-disagg
is X× worse BECAUSE these specific requests at this specific time
filled the pool and forced this queue depth".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 22:30:53 +08:00
4a93096c1e Add PD_DISAGG_INVESTIGATION.md — living TODO for proving H1–H4
We don't have paper-grade evidence yet that PD-disagg fails in agentic.
MB1+MB2 corrected accounting puts phase-isolation cost-benefit on
PD-disagg's side; the only direct support is colleague's one data point
on a patched dash0 build (TTFT p50 62×, success 52%) and the f4b
geometric capacity argument. To close §3.2 properly we need fresh-venv
empirical replication PLUS system-level instrumentation that tells the
reviewer *which* component is the bottleneck — not just headline
latency.

This document tracks the four candidate failure hypotheses (H1 D-pool
capacity, H2 static-partition mismatch, H3 cache reuse + P-pool hotspot,
H4 end-to-end throughput loss), their current evidence status, and the
phased experiment plan to address each.

Key findings already recorded:
- Phase 0 TODO 0.1 (find standard PD-disagg deployment) is done — vLLM
  ships an official example at
  examples/online_serving/disaggregated_serving/mooncake_connector/
  with a kv_producer+kv_consumer launcher and a Mooncake-aware proxy
  that supports arbitrary P:D ratios via env vars. Per user direction,
  we will NOT polish PD-disagg policy ourselves; we use the official
  recipe as the "PD-disagg" baseline in §3.2 / §5.2.
- Phase 1 (MB5+3 combined: PD ratio sweep with D-pool occupancy logging)
  is the critical path. Designed to either confirm H1 with system
  breakdown evidence (D-pool ≥ 90% for ≥ 30% of trace + queue depth
  spike) or falsify it (some ratio matches 8C colo, in which case §3.2
  needs rewriting).
- D-pool occupancy timeline is the single most important new
  instrumentation — turns "PD-disagg is 10× worse" into "PD-disagg is
  10× worse BECAUSE the D pool sits at >90% for X% of the trace".

Configurations to run on dash1 8-GPU first:
  8C (colo baseline), 6P+2D, 4P+4D, 2P+6D  ×  3 reps  ×  w600 trace.

Open question still in the doc: vLLM 0.18.1 had an AttributeError on
self.bootstrap_server in kv_consumer mode when we hit it during MB2
sanity; likely the issue was bad kv_transfer_params from our side
(missing transfer_id, wrong field names), which we have since fixed.
Official proxy uses the same handshake we now have, so it should
just work. If not, single-line patch to initialize self.bootstrap_server
= None for consumer mode.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 22:24:31 +08:00
f739f7d461 Proxy/runner support for Nixl connector + unified_v3 (offload-decode) policy
scripts/b3_isolated_policy.sh:
  Recognize unified_v3 as a kv_both-requiring policy; respect explicit
  KV_CONNECTOR=Nixl override (so unified_v2 / unified_v3 / unified_kv_both
  can run against either Mooncake or Nixl back-end). When Nixl is
  selected, skip the bootstrap-ports plumbing — Nixl uses its own UCX
  side-channel and the proxy forwards kv_transfer_params from the src
  response body instead of pre-baking engine_id/bootstrap_addr.

scripts/cache_aware_proxy.py:
  - New unified_v3 policy (~250 lines): prefill stays on session-affinity
    host (preserves intra-session prefix-cache reuse), decode is migrated
    to a lower-load target when the affinity host is busy with concurrent
    decodes. KV transfer flows prefill_host → decode_target, opposite of
    v2. Knobs: v3_min_new_tokens, v3_min_prefill_decode_busy,
    v3_target_load_ratio, v3_min_load_gap, v3_rotate_affinity,
    v3_prefer_cache_target. cache_miss_audit found rotation hurts cross-
    turn locality (9.5% hit with vs ~80% without) so default
    v3_rotate_affinity=False.
  - New connector_type setting ("mooncake" | "nixl") gating the PD-sep
    handshake form: mooncake uses pre-baked kv_transfer_params,
    nixl forwards them from the response body.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 22:05:19 +08:00
da39ab6804 Correct PD-disagg cost/benefit framing across repo
The §3.2 cost-vs-benefit math in commits 029821c (MB1 plot +
pd_cost_vs_benefit.png) and abde010 (RESULTS_SUMMARY.md) was wrong.

What was wrong:
  I framed PD-disagg's max phase-isolation benefit as "≤ decode duration
  of the new request (~50–200 ms)" — implicitly treating the benefit as
  per-request and bounded by that request's own decode. The correct
  accounting is per-prefill-event across all stalled streams:

      benefit_per_prefill = D × T_prefill × (1 − TPOT_baseline/TPOT_during)
                          ≈ D × T_prefill

  which follows from the chunked-prefill math (each of L/N chunks slows
  D ongoing decode steps from ~10 ms to t ms, summing to D × T_prefill).

Plug MB1 + MB2 numbers in:

  prefill size | T_prefill | T_transfer | D=8 benefit | cost/benefit
   2k tok      | 0.14 s    |     8 ms   |   1.1 s     |    0.7 %
  33k tok      | 4.5  s    |  320 ms    |  36   s     |    0.9 %
 125k tok      | 57   s    |  1.9 s     | 456   s     |    0.4 %

On the phase-isolation axis alone, PD-disagg WINS by 100×–250× — the
opposite of what the deleted figure showed.

The actual dominant reason static PD-disagg fails in agentic is the
D-side KV pool capacity wall (figs/f4b_pdsep_kv_wall.png) — p99
single-request KV is 11.5 GiB, per-D-instance pool is 38 GiB, so 4P+4D
halves system decode capacity. Colleague's 4P+4D experiment showed
TTFT p50 62× worse and success rate 99.5% → 52%, driven by pool
overflow + queueing, not by transfer latency.

Changes (all touched files explicitly listed; no `git add -u`):
- figs/pd_cost_vs_benefit.png : DELETED (figure built on wrong math)
- microbench/fresh_setup/plot_mb1.py : drop the pd_cost_vs_benefit
  function; keep mb1_interference.png and update its title to note
  per-prefill aggregate stall = D × T_prefill (not capped by decode)
- figs/mb1_interference.png : regenerated, no misleading band annotation
- analysis/mb1/README.md : Summary block rewritten ("what MB1 measures";
  no more "max benefit = decode duration" claim); §3.2 implications
  section replaced with the corrected per-prefill-event table; explicit
  ⚠ Correction note documents what was wrong
- analysis/mb2/README.md : Summary block + §3.2 implications section
  rewritten the same way; ⚠ Correction note links to RESULTS_SUMMARY §4
- RESULTS_SUMMARY.md §4 + §6 : §4 reordered to lead with the D-side
  capacity argument (the real failure mode), MB1/MB2 demoted from
  "kill-shot for PD-disagg" to "supporting context inputs to a
  cost-benefit table that actually favors PD-disagg on this axis";
  §6 paper-claims list reordered to remove the wrong "PD-disagg loses
  on cost-vs-benefit" claim and replace with the corrected ones

PAPER_OUTLINE.md and MEETING.md were checked and never picked up this
specific wrong claim — they already (correctly) frame §3.2 around the
D-side KV memory wall.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 22:04:49 +08:00
abde010b64 Add RESULTS_SUMMARY.md — concise Chinese summary of current findings
One-page distillation of what the paper can claim today, with figure /
data path next to each row. Sections:

  1. Workload 性质 — intra-session reuse, skew, KV footprint
  2. Dispatch Coupling — agentic vs chatbot inter-turn gap regime
  3. 现有调度三类失败 — load-balance / static PD-disagg / pure sticky
  4. PD-disagg cost vs benefit — MB2 (transfer 9.7 GB/s ceiling,
     topology-independent) + MB1 (decode halted during prefill 15-200x),
     joined into the §3.2 cost > benefit headline for any KV ≥ 80 MiB
  5. EAR 实证状态 — Pillar 1 (affinity) validated, Pillar 2 (migration)
     substrate validated + strategy-layer pending
  6. 已能写的 paper 主张(按 confidence 排序)
  7. 待做(MB3-5, migration e2e, wall-clock sweep, scale-out)

Designed to be the one doc to read when re-entering the project after
a break.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 21:38:28 +08:00