The §3.2 cost-vs-benefit math in commits 029821c (MB1 plot +
pd_cost_vs_benefit.png) and abde010 (RESULTS_SUMMARY.md) was wrong.
What was wrong:
I framed PD-disagg's max phase-isolation benefit as "≤ decode duration
of the new request (~50–200 ms)" — implicitly treating the benefit as
per-request and bounded by that request's own decode. The correct
accounting is per-prefill-event across all stalled streams:
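    benefit_per_prefill = D × T_prefill × (1 − TPOT_baseline/TPOT_during)
                        ≈ D × T_prefill
which follows from the chunked-prefill math: each of the L/N chunks
(prompt length L, chunk size N) slows D ongoing decode steps from
~10 ms to t ms, the per-chunk step time. Since (L/N) × t = T_prefill,
the stall summed over D streams is D × T_prefill.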
Plug MB1 + MB2 numbers in:
  prefill size | T_prefill | T_transfer | D=8 benefit | cost/benefit
  2k tok       | 0.14 s    | 8 ms       | 1.1 s       | 0.7 %
  33k tok      | 4.5 s     | 320 ms     | 36 s        | 0.9 %
  125k tok     | 57 s      | 1.9 s      | 456 s       | 0.4 %
On the phase-isolation axis alone, PD-disagg WINS by 100×–250× — the
opposite of what the deleted figure showed.
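The table above can be re-derived with a few lines of arithmetic. This is a minimal sketch, not the repository's analysis script: the function names are made up for illustration, and the inputs are the rounded T_prefill / T_transfer values quoted in the table.

```python
# Inputs copied from the table above; D is the number of stalled decode streams.
D = 8
ROWS = [
    # (label, T_prefill in seconds, T_transfer in seconds)
    ("2k tok", 0.14, 0.008),
    ("33k tok", 4.5, 0.320),
    ("125k tok", 57.0, 1.9),
]


def benefit_per_prefill(d, t_prefill, tpot_baseline=None, tpot_during=None):
    """Aggregate decode stall avoided per prefill event, in seconds.

    With both TPOT values given, the (1 - baseline/during) factor is applied;
    otherwise the D * T_prefill approximation from the table is used.
    """
    if tpot_baseline is None or tpot_during is None:
        return d * t_prefill
    return d * t_prefill * (1.0 - tpot_baseline / tpot_during)


def cost_over_benefit(d, t_prefill, t_transfer):
    """KV-transfer cost divided by the approximate per-prefill benefit."""
    return t_transfer / benefit_per_prefill(d, t_prefill)


for label, t_prefill, t_transfer in ROWS:
    ratio = cost_over_benefit(D, t_prefill, t_transfer)
    print(f"{label:>9}: benefit {benefit_per_prefill(D, t_prefill):6.1f} s, "
          f"cost/benefit {100 * ratio:.1f} %")
```

Passing the measured TPOT pair to `benefit_per_prefill` shrinks the benefit somewhat for small prefills, where the slowdown factor is only a few times baseline, but does not change the order of magnitude.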
The actual dominant reason static PD-disagg fails on agentic workloads is the
D-side KV pool capacity wall (figs/f4b_pdsep_kv_wall.png) — p99
single-request KV is 11.5 GiB, per-D-instance pool is 38 GiB, so 4P+4D
halves system decode capacity. Colleague's 4P+4D experiment showed
TTFT p50 62× worse and success rate 99.5% → 52%, driven by pool
overflow + queueing, not by transfer latency.
Changes (all touched files explicitly listed; no `git add -u`):
- figs/pd_cost_vs_benefit.png : DELETED (figure built on wrong math)
- microbench/fresh_setup/plot_mb1.py : drop the pd_cost_vs_benefit
function; keep mb1_interference.png and update its title to note
per-prefill aggregate stall = D × T_prefill (not capped by decode)
- figs/mb1_interference.png : regenerated, no misleading band annotation
- analysis/mb1/README.md : Summary block rewritten ("what MB1 measures";
no more "max benefit = decode duration" claim); §3.2 implications
section replaced with the corrected per-prefill-event table; explicit
⚠ Correction note documents what was wrong
- analysis/mb2/README.md : Summary block + §3.2 implications section
rewritten the same way; ⚠ Correction note links to RESULTS_SUMMARY §4
- RESULTS_SUMMARY.md §4 + §6 : §4 reordered to lead with the D-side
capacity argument (the real failure mode), MB1/MB2 demoted from
"kill-shot for PD-disagg" to "supporting context inputs to a
cost-benefit table that actually favors PD-disagg on this axis";
§6 paper-claims list reordered to remove the wrong "PD-disagg loses
on cost-vs-benefit" claim and replace with the corrected ones
PAPER_OUTLINE.md and MEETING.md were checked and never picked up this
specific wrong claim — they already (correctly) frame §3.2 around the
D-side KV memory wall.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
One-page distillation of what the paper can claim today, with figure /
data path next to each row. Sections:
1. Workload properties: intra-session reuse, skew, KV footprint
2. Dispatch Coupling: agentic vs chatbot inter-turn gap regime
3. Three failure classes of existing scheduling: load-balance /
   static PD-disagg / pure sticky
4. PD-disagg cost vs benefit: MB2 (transfer 9.7 GB/s ceiling,
   topology-independent) + MB1 (decode halted during prefill 15–200×),
   joined into the §3.2 cost > benefit headline for any KV ≥ 80 MiB
5. EAR empirical status: Pillar 1 (affinity) validated, Pillar 2
   (migration) substrate validated + strategy-layer pending
6. Paper claims that can already be written (ordered by confidence)
7. To do (MB3-5, migration e2e, wall-clock sweep, scale-out)
Designed to be the one doc to read when re-entering the project after
a break.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Single-GPU bench on dash1 GPU 0 (vanilla vLLM 0.18.1, chunked-prefill on,
no kv_connector). 3 decode batch sizes × 5 prefill sizes × 3 reps.
Method recap (driver: microbench/interference/driver.py, repurposed):
- Pin D streaming decode requests at constant max_tokens
- Inject one prefill-only request (max_tokens=1) of varying input length
- Bin decode-stream token timestamps into "during prefill" vs baseline
- Headline metric: effective per-stream TPOT during the prefill burst,
= prefill_ttft / (num_tokens_during_prefill / D). This is the average
rate at which each decode stream produces tokens during the burst.
p50 of inter-token intervals is deceptive (chunked-prefill makes most
intervals look normal); the burst-average gives the true cost.
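The headline metric can be sketched as follows. This is an illustration under stated assumptions, not the code in analyze_mb1.py: the function name and the list-of-timestamp-lists input are hypothetical, and the burst window is taken to be [prefill_start, prefill_start + prefill_ttft).

```python
def effective_tpot_during(prefill_start, prefill_end, stream_token_times):
    """Burst-average per-stream TPOT (seconds per token) during a prefill.

    stream_token_times: one list of token timestamps (seconds) per decode
    stream, so D = len(stream_token_times).
    Implements prefill_ttft / (num_tokens_during_prefill / D).
    """
    d = len(stream_token_times)
    tokens_during = sum(
        1
        for times in stream_token_times
        for t in times
        if prefill_start <= t < prefill_end
    )
    if d == 0 or tokens_during == 0:
        # No decode progress at all inside the burst window.
        return float("inf")
    prefill_ttft = prefill_end - prefill_start
    return prefill_ttft / (tokens_during / d)


# Synthetic example: two streams, two tokens each inside a 1 s burst.
streams = [[0.5, 1.25, 1.75, 2.5], [0.6, 1.25, 1.75, 2.6]]
print(effective_tpot_during(1.0, 2.0, streams))
```

Counting tokens over the whole burst, rather than taking a percentile of inter-token intervals, is what keeps the few very long stalls from being hidden by the many normal-looking intervals.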
Results (D=8 row, the most agentic-realistic case):
  P (tokens) | prefill_ttft | per-stream TPOT during | penalty
  2048       | 143 ms       | 32 ms                  | 4×
  8192       | 583 ms       | 114 ms                 | 15×
  32768      | 4520 ms      | 388 ms                 | 52×
  65536      | 15615 ms     | 757 ms                 | 99×
  131072     | 56991 ms     | 1419 ms                | 183×
Baseline TPOT at D=8: ~7.7 ms. So during a 131k-token prefill burst
each ongoing decode is running ~183× slower (i.e. essentially halted)
for ~57 seconds.
§3.2 implication: PD-disagg's promised phase-isolation benefit per
agentic request is bounded by the decode duration, which is 50–200 ms
for tool-call output. MB2 says the KV-transfer cost of PD-disagg
is 300 ms – 10 s for agentic-size requests. Cost > benefit for every
KV size above ~80 MiB (well below trace mean 192 MiB).
The new figs/pd_cost_vs_benefit.png overlays MB1 benefit ceiling
(50–200 ms band, capped by decode) onto MB2 transfer cost curve and
marks the agentic-distribution waypoints (trace mean, p90, p95, p99)
on the x-axis. Across the entire agentic distribution, the cost curve
sits above the benefit band.
Adds:
- microbench/fresh_setup/mb1_launch.sh: single-GPU vLLM launcher (no
kv_connector, default chunked_prefill=on, max_num_batched_tokens=8192)
- microbench/fresh_setup/mb1_driver.py: copy of the existing
microbench/interference/driver.py for cpfs deployment
- microbench/fresh_setup/analyze_mb1.py: aggregator emitting
per-(D, P) effective-TPOT-during + max PD-disagg-benefit table
- microbench/fresh_setup/plot_mb1.py: mb1 standalone +
pd_cost_vs_benefit headline figure
- analysis/mb1/summary.csv: 45 raw rows from the sweep
- analysis/mb1/breakdown.json: per-(D, P) aggregate
- analysis/mb1/README.md: persistent doc
- figs/mb1_interference.png: effective TPOT during prefill, one line per D
- figs/pd_cost_vs_benefit.png: §3.2 headline (cost > benefit everywhere)
Caveats noted in README:
- chunk_tokens=8192 only; Sarathi-Serve's smaller chunks would
interleave decode more aggressively. Chunk-size sensitivity is
flagged as next run.
- D ≤ 8; higher D may saturate or shrink the penalty further.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sweep on dash1 GPU 0 → dash2 GPU 0 over 200 Gbps RoCE.
remote_bootstrap_addr=http://172.27.123.142:8998. Same 9-size × 5-rep
config as the 2026-05-27 intra-node run.
Per-size pure_transfer (p50) lines up within about 4% of the
intra-node numbers for every size below except 64k tok, which differs
by 8.5% and is bimodal in both runs:
  size      intra p50 (ms)  inter p50 (ms)
  512 tok   5.3             5.2
  2048 tok  20.6            20.0
  8192 tok  83.7            80.9
  32k tok   320.9           309.6
  64k tok   1895            1734   (bimodal in both)
  128k tok  2835            2818   (bimodal in both)
=> Mooncake's batch_transfer_sync_write **does not use NVLink** for
intra-node peers; both paths go through the 200 Gbps RDMA NIC, so the
NIC path (not the GPU interconnect) is the bottleneck. The ~9.7 GB/s
steady-state ceiling and the high-variance regime at 6+ GiB are
identical across topologies.
Operational implication for §3.2: PD-disaggregation does not get
cheaper by co-locating P and D on the same node — every routed request
pays the same ~10 GB/s ceiling for KV transfer, no matter where it
lands. No topology choice buys back the transfer cost.
Caveat: B's receive_kv events did not log on dash2 — `MB2_LOG_DIR`
env var did not propagate through vLLM's EngineCore subprocess on
the consumer host (cat /proc/$ENGINE_PID/environ is empty on dash2
for that var, but the producer host on dash1 worked). For this run
pure_transfer numbers are from A's send_blocks alone; full rx_total
breakdown is not available, but pure_transfer is the dominant term.
Adds:
- analyze_mb2_send_only.py — analyzer that works from A's send_blocks
alone when B's receive_kv events are absent
- plot_mb2_compare.py — overlay intra vs inter on the same axes
- plot_mb2.py — tolerate the `rows`-less send-only schema
- figs/mb2_transfer_{time,bw}_inter.png — inter-node single-curve
- figs/mb2_transfer_{time,bw}_compare.png — intra vs inter overlay
- analysis/mb2/A_inter_kvboth.jsonl, inter_kvboth_client.json,
inter_kvboth_breakdown.json
- analysis/mb2/README.md — Summary block updated to reference both
paths, dated 2026-05-27 run-log entry appended with the full table
and the topology-independence framing
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the pieces needed to run the producer on dash1 and the consumer on
dash2 with the same shared cpfs venv:
start_vllm_single.sh
INSTANCE / GPU / PORT / BP / MASTER / ROLE env vars; brings up ONE
vLLM instance + applies the mooncake instrumentation patch (idempotent
since the venv is cpfs-shared, so the first invocation applies and the
second is a no-op). Per-instance MB2_LOG_DIR keeps producer/consumer
events separate even though both directories live on the same cpfs
path visible to both hosts.
mb2_kv_transfer.py
New --src-host / --dst-host args. Defaults stay 127.0.0.1 for
backward-compat with the intra-node sweep. /v1/completions URLs and
/query URLs now use the supplied hosts. remote_bootstrap_addr is
built as http://<src_host>:<src_bp> so the consumer's
do_remote_prefill request carries a routable address.
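A minimal sketch of the address construction described above. The helper name is hypothetical and is not taken from mb2_kv_transfer.py; only the `http://<src_host>:<src_bp>` shape comes from this message.

```python
def build_bootstrap_addr(src_host: str, src_bp: int) -> str:
    """Return the remote_bootstrap_addr string http://<src_host>:<src_bp>."""
    if "://" in src_host:
        # The scheme is added here, so callers should pass a bare host.
        raise ValueError("pass a bare host; the http:// scheme is added here")
    if not 0 < src_bp < 65536:
        raise ValueError(f"invalid bootstrap port: {src_bp}")
    return f"http://{src_host}:{src_bp}"


# Intra-node default host versus an explicitly supplied producer host.
print(build_bootstrap_addr("127.0.0.1", 8998))
print(build_bootstrap_addr("172.27.123.142", 8998))
```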
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Lifts the MB2 intra-node results out of commit messages into a single
place the paper can cite. Structure:
  Summary: one-line table + headline numbers for §3.2
  Setup: exact hardware/software/config
  Method: 3-step bench, instrumentation, pair-by-time-window
  Results: full per-size table (latest run dated)
  Known limitations: kv_both vs strict, serial-only, intra-only,
    sanity preamble in the logs
  §3.2 implications: transfer/decode ratio table at agentic sizes
  Open questions / next runs: inter-node, bandwidth-ceiling
    investigation, concurrent transfers, strict kv_producer/consumer
  Reproduction: exact commands
  Run log: dated entries; new runs append here
The latest "intra-node" entry references `de164e5` for the raw
artifacts + figures.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Full sweep result on dash1 GPU 0+1 with vanilla vLLM 0.18.1 +
mooncake-transfer-engine 0.3.11, kv_both connector. Per-stage decomposition
via the instrumentation patch (analyze_mb2.py pairs A's send_blocks with
B's receive_kv enter/finish by time window).
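One way to pair events by time window is sketched below. It is an assumption about the approach, not the logic in analyze_mb2.py: it matches each send_blocks event to the earliest unused receive_kv event whose [start, start + duration] window, widened by a hypothetical `slack_s`, contains the send. Only the `t_start_unix` and `duration_s` field names come from the instrumentation schema.

```python
def pair_by_time_window(send_events, recv_events, slack_s=0.05):
    """Greedily pair send and receive events whose time windows nest.

    Each event is a dict with 't_start_unix' and 'duration_s' (seconds).
    Returns a list of (send, recv) tuples; unmatched events are dropped.
    """
    pairs = []
    # sorted() copies, so the caller's lists are left untouched.
    unused = sorted(recv_events, key=lambda e: e["t_start_unix"])
    for send in sorted(send_events, key=lambda e: e["t_start_unix"]):
        s0 = send["t_start_unix"]
        s1 = s0 + send["duration_s"]
        for i, recv in enumerate(unused):
            r0 = recv["t_start_unix"] - slack_s
            r1 = recv["t_start_unix"] + recv["duration_s"] + slack_s
            if r0 <= s0 and s1 <= r1:
                pairs.append((send, unused.pop(i)))
                break  # stop iterating right after mutating `unused`
    return pairs


sends = [{"t_start_unix": 10.0, "duration_s": 0.08},
         {"t_start_unix": 20.0, "duration_s": 0.30}]
recvs = [{"t_start_unix": 19.99, "duration_s": 0.32},
         {"t_start_unix": 9.99, "duration_s": 0.10}]
print(len(pair_by_time_window(sends, recvs)))
```

This only works for a serial bench, where windows from different requests do not overlap; concurrent transfers would need a request identifier to join on.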
Steady-state (1k..32k tokens, 96 MiB..3 GiB KV):
pure_transfer ≈ size / 9.7 GB/s
rx_overhead ≈ 2–3 ms (ZMQ handshake + P-side setup)
bandwidth ≈ 9.6–10.1 GB/s, very stable
Large-size regime (65k..131k tokens, 6..12 GiB):
p50 bandwidth collapses to 3.4–4.5 GB/s
max bandwidth still hits ~9.7 GB/s (some runs achieve it)
p99 agentic request (11.5 GiB) lands here
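The steady-state relation can be turned into a quick estimator. This is a sketch under assumptions: 9.7 GB/s is read as decimal gigabytes per second, the 98304 B per-token KV size is the Qwen3-Coder-30B-A3B figure quoted elsewhere in this log, and the function name is hypothetical. It does not model the large-size regime, where p50 bandwidth collapses.

```python
BYTES_PER_TOKEN = 98304          # per-token KV size quoted elsewhere in this log
STEADY_BW_BYTES_PER_S = 9.7e9    # assumption: "GB/s" means 1e9 bytes per second


def estimate_transfer_s(input_tokens, overhead_s=0.0,
                        bw_bytes_per_s=STEADY_BW_BYTES_PER_S):
    """Estimate pure_transfer (+ optional rx overhead) in seconds.

    Only meaningful in the steady-state regime (roughly 1k to 32k tokens).
    """
    return input_tokens * BYTES_PER_TOKEN / bw_bytes_per_s + overhead_s


for tokens in (1024, 8192, 32768):
    print(f"{tokens:>6} tok: {1000 * estimate_transfer_s(tokens):7.1f} ms")
```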
Implication for §3.2 PD-disaggregation cost argument:
median agentic decode = 50–200 ms (tool-call JSON output)
median agentic-tail KV transfer (p99 11.5 GiB):
best case (9.7 GB/s) ≈ 1.19 s
observed range 1.5 – 10 s
⇒ KV transfer is 8–100× larger than the decode it enables.
This is intra-node — the lower-bound transfer cost. Inter-node RDMA
will be slower; that's MB2 phase 2.
Adds:
- analyze_mb2.py: pair A.send_blocks ↔ B.receive_kv by time window;
per-size aggregation (n, ms_p50, ms_min/max, GB/s_p50/max)
- plot_mb2.py: log-log transfer-time chart + bandwidth-vs-size chart
- analysis/mb2/A_intra_kvboth.jsonl, B_intra_kvboth.jsonl: raw events
(51 + 102 events including the sanity preamble)
- analysis/mb2/intra_kvboth_breakdown.json: paired and aggregated
- figs/mb2_transfer_time_intra.png, figs/mb2_transfer_bw_intra.png
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit closes the loop on the fresh-venv MB2 path. Three corrections
on top of the previous scaffold made the bench fire successfully on
dash1 GPU 0+1 with kv_both connector roles:
1. Re-target instrumentation patch to vLLM's shipped MooncakeConnector
(vllm/distributed/kv_transfer/kv_connector/v1/mooncake/mooncake_connector.py).
The mooncake-package's own mooncake_connector_v1.py turned out not to
be the implementation vLLM 0.18.1 loads — the
'{"kv_connector": "MooncakeConnector"}' config picks up the vLLM-shipped
one. Patches go at _send_blocks (P-side) and receive_kv_from_single_worker
(D-side, async, both entry and FINISH branch).
2. /query lives on the mooncake bootstrap port, not the vLLM HTTP port.
Add --src-bp / --dst-bp args; default 8998 / 8999.
3. kv_transfer_params schema for the vanilla connector:
do_remote_decode → {transfer_id}
do_remote_prefill → {transfer_id, remote_engine_id, remote_bootstrap_addr}
where remote_bootstrap_addr must include the http:// scheme. The dash0
smoke_test_migrate_cache.py was written for the patched build, which
used a different field-name set (remote_host, remote_port,
remote_block_ids); those are rejected here.
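The schema in item 3 might be assembled as below. Treat this as a sketch: the field names are the ones listed above, but the helper names and the layout with `do_remote_decode` / `do_remote_prefill` as boolean keys alongside the other fields are assumptions that have not been checked against the connector source.

```python
# Field names used by the patched dash0 build, rejected by the vanilla connector.
PATCHED_ONLY_FIELDS = {"remote_host", "remote_port", "remote_block_ids"}


def remote_decode_params(transfer_id):
    """kv_transfer_params for step 1 (producer side). Layout is assumed."""
    return {"do_remote_decode": True, "transfer_id": transfer_id}


def remote_prefill_params(transfer_id, remote_engine_id, remote_bootstrap_addr):
    """kv_transfer_params for step 2 (consumer side). Layout is assumed."""
    if not remote_bootstrap_addr.startswith("http://"):
        raise ValueError("remote_bootstrap_addr must include the http:// scheme")
    return {
        "do_remote_prefill": True,
        "transfer_id": transfer_id,
        "remote_engine_id": remote_engine_id,
        "remote_bootstrap_addr": remote_bootstrap_addr,
    }


def uses_patched_fields(params):
    """True if a params dict still carries patched-build field names."""
    return bool(PATCHED_ONLY_FIELDS & set(params))


params = remote_prefill_params("t-0001", "engine-A", "http://127.0.0.1:8998")
print(sorted(params))
```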
Also discovered (and worked around): vLLM 0.18.1 with kv_role=kv_consumer
raises AttributeError on `self.bootstrap_server` because that attribute
is only assigned conditionally inside `if not self.is_kv_consumer`. We
sidestep by running kv_both for the microbench — transfer mechanics are
identical (same batch_transfer_sync_write call); the role gate only
affects which request types each instance accepts. For §5 strict PD-disagg
baseline we'll need either to fix this bug or front the pair with a
role-aware proxy.
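The failure pattern described above is easy to reproduce in isolation. The classes below are hypothetical stand-ins, not vLLM code; they only illustrate why an attribute assigned inside `if not self.is_kv_consumer` raises AttributeError on a consumer, and one possible shape of a fix (default the attribute to None).

```python
class ConnectorSketch:
    """Stand-in showing the conditional-assignment pattern."""

    def __init__(self, is_kv_consumer: bool):
        self.is_kv_consumer = is_kv_consumer
        if not self.is_kv_consumer:
            self.bootstrap_server = object()  # placeholder for a real server

    def get_bootstrap_server(self):
        # Unguarded read: fails when the attribute was never assigned.
        return self.bootstrap_server


class ConnectorSketchFixed(ConnectorSketch):
    """Same class, but the attribute always exists."""

    def __init__(self, is_kv_consumer: bool):
        self.bootstrap_server = None
        super().__init__(is_kv_consumer)


try:
    ConnectorSketch(is_kv_consumer=True).get_bootstrap_server()
except AttributeError as exc:
    print("consumer role fails:", exc)

print(ConnectorSketchFixed(is_kv_consumer=True).get_bootstrap_server())
```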
Sanity smoke (3 sizes × 2 repeats, dash1 GPU 0+1, kv_both intra-node):
  input (tok)  KV (MiB)  send_blocks_ms (P)  receive_kv_ms (D)  client_step2_ms
  512          48        5–23                7–33               18–91
  2048         192       21                  23                 37
  8192         768       85                  88                 110
=> intra-node bandwidth ~9 GB/s on the actual transfer for 768 MiB,
which is well below NVLink p2p; likely PCIe-staged. Worth verifying.
Next step (in flight): full sweep 512..128k tokens × 5 repeats with
the per-stage analyzer.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
start_vllm_pair.sh
ROLE_A / ROLE_B env vars (default kv_producer / kv_consumer for strict
PD-disagg). Override to kv_both for the kv_both control. The role is
injected into --kv-transfer-config so vLLM imposes the role restriction.
mb2_kv_transfer.py
--skip-verify flag drops step 3 (the plain completion sanity-check on
the destination), required when the dst is kv_consumer-only since a
kv_consumer instance refuses to serve a request without
do_remote_prefill. The transfer-time itself is still measured from
step 2 (do_remote_prefill on the consumer).
Also: per-step client-side wall-clock timestamps (t_step1_client_unix,
t_step2_client_unix, t_step2_end_unix) are now captured so the
post-hoc breakdown analyzer can join with the per-instance JSONL logs
on absolute time.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per-stage breakdown of "step 2" (the B-side do_remote_prefill) requires
vLLM/mooncake-internal timing — we cannot infer it from black-box HTTP
E2E. This commit adds the four pieces to do that breakdown:
instrument_mooncake.py
apply / revert / check patches on mooncake_connector_v1.py to emit
structured JSONL transfer events at two key sites:
send_blocks (P-side, on batch_transfer_sync_write):
{event, remote_session, total_bytes, duration_s, t_start_unix,
ret, tp_rank, t_log_unix}
receive_kv (D-side, on the ZMQ-driven pull request):
{event, path, local_req_ids, remote_req_ids, duration_s,
t_start_unix, tp_rank, t_log_unix}
All injected code is bracketed by `# MB2_INSTRUMENT_START/END` so the
--revert pass is a single regex scan. Apply-revert round-trip
validated on dash1 (PATCHED → py_compile ok → revert → CLEAN → ok).
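The marker-bracketed revert could look like the sketch below. It is not the code in instrument_mooncake.py: the regex and function names are illustrative, and it assumes each marker sits on its own line, optionally indented.

```python
import re

# Matches from a START marker line through the matching END marker line,
# including the trailing newline, so reverted files keep their line layout.
_INSTRUMENT_BLOCK = re.compile(
    r"^[ \t]*# MB2_INSTRUMENT_START\b.*?^[ \t]*# MB2_INSTRUMENT_END\b[^\n]*\n?",
    re.DOTALL | re.MULTILINE,
)


def is_patched(source: str) -> bool:
    """True if the source still contains an injected block."""
    return "# MB2_INSTRUMENT_START" in source


def revert(source: str) -> str:
    """Strip every injected block in a single regex pass."""
    return _INSTRUMENT_BLOCK.sub("", source)


patched = (
    "a = 1\n"
    "    # MB2_INSTRUMENT_START\n"
    "    log_event()\n"
    "    # MB2_INSTRUMENT_END\n"
    "b = 2\n"
)
print(revert(patched))
```

Because `revert` is a no-op on clean source, running it twice is safe, which matches the idempotent apply/revert round trip described above.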
start_vllm_pair.sh (updated)
- Picks up instrument_mooncake.py via SCRIPT_DIR.
- On `start`: applies patch before launching the two vLLM instances.
- On `stop` (or trap exit): reverts patch.
- Sets per-instance MB2_LOG_DIR = $FRESH_ROOT/mb2_transfer_logs/{A,B}/
so send-side and receive-side events land in cleanly separated dirs.
deploy.sh
tar-over-ssh sync of microbench/fresh_setup/ → cpfs
/home/admin/cpfs/wjh/agentic-kv-fresh/scripts/ so dash1 / dash2 see
the same scripts (dash{1,2} don't have rsync; tar pipe works).
The mb2_kv_transfer.py client still uses black-box E2E timing — the
next commit will teach it to ingest the per-instance JSONL logs to
produce the 4-way breakdown (queueing / setup / transfer / decode).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two new files prepare measurement of T_transfer(KV_size, network_path),
the gap §3.2's PD-disagg cost argument has had since day one.
microbench/fresh_setup/start_vllm_pair.sh
start | status | stop two vLLM 0.18.1 instances on local GPUs (A, B)
with --kv-transfer-config '{"kv_connector":"MooncakeConnector",
"kv_role":"kv_both"}' running off the fresh venv (vanilla wheel +
vanilla mooncake 0.3.11, NOT the dash0 patched build). GPU IDs and
ports are env-overridable so the same script drives the intra-node
pair (GPU_A=0 GPU_B=1 on one host) and the inter-node pair (GPU_A=0
on dash1, GPU_B=0 on dash2 — launched per host separately).
microbench/fresh_setup/mb2_kv_transfer.py
Three-step measurement borrowed from connector_tax/.../smoke_test_
migrate_cache.py:
1. do_remote_decode on A (compute & cache KV; max_tokens=1)
2. do_remote_prefill on B (pull KV from A — this is the timed step)
3. plain completion on B (sanity check: cached_tokens ≈ prompt len)
Sweeps input_tokens ∈ {512, 1k, 2k, 4k, 8k, 16k, 32k, 64k} with 5
repeats each; reports mean / p50 / p90 transfer time and a per-size
raw log. Per-token KV is 98304 B (Qwen3-Coder-30B-A3B), so the upper
end is ≈ 6 GiB per transfer: roughly half of the p99 11.5 GiB from §2
(the model's max_model_len of 200000 caps the absolute upper bound).
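The size arithmetic behind the sweep is short enough to spell out. A minimal sketch with hypothetical helper names; the only input taken from this message is the 98304 B per-token KV size.

```python
BYTES_PER_TOKEN = 98304  # per-token KV for Qwen3-Coder-30B-A3B, as stated above

SWEEP_TOKENS = [512, 1024, 2048, 4096, 8192, 16384, 32768, 65536]


def kv_bytes(tokens: int) -> int:
    """KV cache size in bytes for a prompt of `tokens` tokens."""
    return tokens * BYTES_PER_TOKEN


def kv_mib(tokens: int) -> float:
    """KV cache size in MiB."""
    return kv_bytes(tokens) / 2**20


def tokens_for_gib(gib: float) -> int:
    """Largest whole token count whose KV fits in `gib` GiB."""
    return int(gib * 2**30 // BYTES_PER_TOKEN)


for tokens in SWEEP_TOKENS:
    print(f"{tokens:>6} tok -> {kv_mib(tokens):8.0f} MiB")

# Token count corresponding to the p99 11.5 GiB request from §2.
print(tokens_for_gib(11.5))
```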
What we will NOT learn from this design:
- Bandwidth saturation when the system is loaded (single-request bench)
- vLLM-internal scheduling overhead vs pure transfer (the timed step
folds them together — but for the §3.2 argument that's the right
"what does PD-disagg actually pay" number)
Intentionally not committed yet: an orchestrator that loops over
intra-/inter-node configs. We start manual on dash1 intra-node to
verify the measurement is sane before scaling out.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Install script lives in microbench/fresh_setup/install.sh. Single shared
venv at /home/admin/cpfs/wjh/agentic-kv-fresh/.venv (cpfs is mounted at
the same path on dash0/1/2 so one install serves all three).
vllm : 0.18.1 (official wheel)
mooncake-transfer-engine: 0.3.11.post1
Smoke-tested on dash1 + dash2: imports succeed, kv_transfer module
resolves. This venv is the vanilla reference for all subsequent
microbench / PD-disagg experiments — not the dash0 patched build that
carries the connector_tax fix.
The script defines proxyOn inline (ipads 127.0.0.1:11235) so it works
under non-interactive ssh (~/.bashrc proxyOn is interactive-only).
Sets -eo pipefail (not -u) because venv activation references unset
PS1-like vars under -u.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous §2.3 narrative said "chatbot has T_human ≈ 30 s think-time,
agentic has T_external ≈ 0, so agentic is always closed-loop and chatbot
never is". The new T_external measurements on the production chatbot
trace (qwen3-max, n=42 k inter-turn gaps from formatted parent_chat_id
sessions) show the binary framing is wrong:
agentic p50 1.6 s, 39% gaps < 1 s, p99 738 s
chatbot p50 7.2 s, 4% gaps < 1 s, p99 43 s
Both have nonzero T_external. The right distinction is the *shape*:
chatbot is unimodal around 5–15 s (human cadence); agentic is bimodal
with a sub-second tool-call mass (39 % vs chatbot's 4 %) plus a long-
pause tail (13 % > 30 s). The agentic sub-second mass is what activates
dispatch coupling — for any W_turn > 1 s scheduler those turns satisfy
W_turn ≫ T_external by construction.
The empirical regime split:
  unified TTFT p90 = 7.3 s  → agentic 73% closed-loop, chatbot 32%
  lmetric TTFT p90 = 15.7 s → agentic 80% closed-loop, chatbot 88%
lmetric is bad enough that it drags the chatbot regime into closed-loop
too. This is a direct empirical explanation for lmetric underperforming
on both workloads.
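The regime split above reduces to one fraction per (trace, scheduler) pair. The sketch below assumes a turn counts as closed-loop when its T_external is smaller than the scheduler's TTFT p90, used here as a proxy for W_turn; the function name and the gap values are synthetic, not the production trace.

```python
def closed_loop_fraction(gaps_s, w_turn_s):
    """Fraction of inter-turn gaps with T_external < W_turn."""
    if not gaps_s:
        raise ValueError("need at least one inter-turn gap")
    return sum(1 for g in gaps_s if g < w_turn_s) / len(gaps_s)


# Synthetic gaps in seconds, for illustration only.
gaps = [0.2, 0.5, 0.9, 3.0, 8.0, 40.0, 0.7, 12.0, 1.5, 60.0]
for name, ttft_p90 in (("unified", 7.3), ("lmetric", 15.7)):
    print(name, closed_loop_fraction(gaps, ttft_p90))
```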
Updates:
- PAPER_OUTLINE.md §2.3: lead with the regime threshold W_turn ≷
T_external, replace the "T_human dominates" Little's Law with the
general form L = Λ · N · (W_turn(L) + T_external), embed f3a CDF,
add the empirical regime table; correct the small-perturbation
formula to include the +T_external dampening term.
- MEETING.md §1: same reframe, condensed (CDF figure, two-row regime
table, one-line conclusion).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
User-requested comparison of inter-turn external gap distribution between
the production agentic trace (Qwen3-Coder) and a production chatbot trace
(qwen3-max chat). Both computed as
T_external = next_turn.start_ms - prev_turn.end_ms
on the same kind of pipeline (raw input + raw output join on request_id,
session structure from the formatted trace's parent_chat_id chains).
The chatbot trace lives as two files on dash0:
input : bailian-trace/qwen-trace-260321-260327/qwen3-max-input-032309-032311.jsonl
output : bailian-trace/qwen-trace-260321-260327/qwen3-max-output-032109-032711.jsonl
The raw input has no session_id (uuid is per-record, user_id has only 4
distinct tenant values for 346 k requests). We recover session structure
from the formatted file (qwen_chat_blksz_64_032309-032311.jsonl, which
groups requests by parent_chat_id), matching each formatted record to a
raw record by (timestamp, output_length) — prompt_token_num is anonymized
to 0 in this trace, so we use generate_token_num as the join key.
End time is derived from time_to_finish_token (ms duration) not the "time"
string field (which is the log-write time, not request completion).
Numbers (chatbot, 42 228 inter-turn gaps over 32 262 multi-turn sessions):
p25 4.85 s p50 7.18 s p75 8.22 s p90 15.0 s p99 43 s
4% gaps < 1 s 29% < 5 s 78% < 10 s 98% < 30 s
Compare to agentic (same metric, scripts/compute_inter_turn_gap_remote.py):
p25 0.69 s p50 1.6 s p75 8.6 s p90 44 s p99 738 s
39% gaps < 1 s 67% < 5 s 77% < 10 s 87% < 30 s
Distributions differ in shape, not just location:
- Chatbot is tight, unimodal around 5–10 s (human interaction).
- Agentic is bimodal: a sub-second autonomous tool-call mode (39 % < 1 s)
plus a long-pause tail (13 % > 30 s, p99 = 738 s) for sessions where
the operator steps away.
- The sub-second tool-call mass is where dispatch coupling lives —
those turns have W_turn ≫ T_external for any current scheduler.
The earlier "chatbot has T_human ≈ 30 s" hand-wave was wrong empirically.
The right framing for §2.3 is "agentic has a sub-second tool-call mode
that chatbot doesn't", not "chatbot has think-time and agentic doesn't".
Adds:
- scripts/compute_inter_turn_gap_chatbot.py: dash0-side aggregator
(raw input/output join + formatted alignment by ts + output_length)
- analysis/characterization/data/chatbot_inter_turn_gap.json: CDF cache
- scripts/plot_inter_turn_gap.py: overlays both curves on log-x
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Why the LMetric → load_only APC gap is only +3.3pp despite LMetric
explicitly being "cache-aware load routing":
P = pending_prefill_tokens + (input_length - cache_hit)
score = P × num_requests <-- multiplicative
cache_hit appears only as a reduction inside P. Because score is
multiplicative in num_requests, a session-affinity instance whose
num_requests has climbed will lose argmin to a cold instance even
when cache_hit on the warm one is ~90%. Worked example:
warm: P=2500, num_req=5 -> score 12500
cold: P=10000, num_req=1 -> score 10000 <-- LMetric picks cold
  policy     APC    Δ vs load_only  how cache enters the decision
  load_only  53.9%  (baseline)      not at all (pure num_requests)
  LMetric    57.2%  +3.3pp          additive cost term
  sticky     77.7%  +23.8pp         hard constraint
  unified    78.7%  +24.8pp         hard+soft hybrid
Lesson worth stating explicitly in §3.1: cache awareness folded into
a multiplicative load cost-model is structurally insufficient. Affinity
must be a separate routing branch (sticky / unified hybrid), not a
correction term inside a load score.
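The worked example can be replayed in code. This is a sketch of the scoring rule as stated in this message, not LMetric's actual implementation: the `Instance` class is hypothetical, and the split of each P into pending tokens and cache hit (for a 10000-token input) is chosen only to reproduce P=2500 and P=10000.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    pending_prefill_tokens: int
    num_requests: int
    cache_hit: int  # cached prefix tokens for the incoming request


def lmetric_score(inst: Instance, input_length: int) -> int:
    """score = P × num_requests, with P as defined above."""
    p = inst.pending_prefill_tokens + (input_length - inst.cache_hit)
    return p * inst.num_requests


INPUT_LENGTH = 10000
warm = Instance("warm", pending_prefill_tokens=1500, num_requests=5, cache_hit=9000)
cold = Instance("cold", pending_prefill_tokens=0, num_requests=1, cache_hit=0)

for inst in (warm, cold):
    print(inst.name, lmetric_score(inst, INPUT_LENGTH))

# argmin over the score picks the cold instance despite the 90% cache hit.
chosen = min((warm, cold), key=lambda i: lmetric_score(i, INPUT_LENGTH))
print("picked:", chosen.name)
```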
PAPER_OUTLINE.md §3.1 gets the design analysis + the new APC table;
MEETING.md gets a one-paragraph version of the same point.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
User flagged unified_v2 as a still-buggy build. Regenerate the four
per-policy figures with only the four stable policies:
lmetric, load_only, sticky, unified
Story is now directly comparable to v1: unified still dominates p90
TTFT (8.8s) and E2E p90 (20.0s) over the other three on the fresh run.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
User-provided fresh run with five policies (lmetric, load_only, sticky,
unified, plus a new unified_v2 variant). Reproduces the v1 set under
figs/v2/ so we can A/B the same panels:
f4a_apc_loss.png — APC bars per policy
f4c_per_worker_ttft.png — per-worker TTFT p90 panel per policy
f6_e2e_latency_bars.png — TTFT/TPOT/E2E p90 bars per policy
f6_e2e_latency_full_grid — mean/p50/p90/p99 × TTFT/TPOT/E2E grid
scripts/render_b3_figures_v2.py is a standalone driver that reads each
policy's metrics.summary.json and breakdown.json directly from the run
directory — the breakdown.json `routed_to` field is required to recover
per-worker assignment because the new setup routes every request
through a proxy (127.0.0.1:9300), so metrics.jsonl's endpoint_url no
longer identifies the backend.
Headline numbers, new (v2) vs v1 ("-" = policy not in that run):
  metric                      run  lmetric    load_only  sticky     unified    unified_v2
  APC                         v2   57.2%      53.9%      77.7%      78.7%      78.4%
                              v1   56.9%      54.1%      77.2%      79.4%      -
  TTFT p90 (s)                v2   14.8       20.1       14.8       8.8        10.1
                              v1   15.7       20.2       18.0       7.3        -
  E2E p90 (s)                 v2   25.4       33.9       30.3       20.0       24.1
                              v1   24.8       33.5       34.6       18.0       -
  Worker p90 (s, median/max)  v2   13.3/30.4  21.3/29.2  13.5/33.0  10.0/35.1  8.6/34.2
                              v1   13.9/31.3  19.4/25.1  20.3/55.4  10.3/37.7  -
Story is unchanged: unified beats lmetric, load_only, and sticky at p90
across TTFT/E2E and on median-worker latency. unified_v2 is competitive
at p50 and has the lowest median-worker p90 (8.6 s), but is worse than
unified at system-level TTFT and E2E p90.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The earlier conversation suggested agentic might "have no human think-time"
and therefore live in a strict closed-loop regime. The user pushed back:
tool calls also take time and might restore a chatbot-like buffer between
turns. To resolve this, we go to the actual data.
The previously-published per-record formatted trace only carries arrival
timestamps, so an arrival-to-arrival diff conflates W_turn + T_external.
The raw trace (/home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/
051315-051317-raw.jsonl on dash0) additionally carries request_end_time_ms,
which lets us compute the pure inter-turn external gap
T_external = next.request_ready_time_ms - prev.request_end_time_ms
for each session's consecutive turn pair.
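A minimal sketch of the gap computation, assuming each raw record is a dict. `request_ready_time_ms` and `request_end_time_ms` are the fields named above; the `session_id` key and the function name are hypothetical, and this is not the code in compute_inter_turn_gap_remote.py.

```python
from collections import defaultdict


def inter_turn_gaps_ms(records):
    """Return T_external (ms) for every consecutive turn pair in each session."""
    by_session = defaultdict(list)
    for rec in records:
        by_session[rec["session_id"]].append(rec)

    gaps = []
    for turns in by_session.values():
        turns.sort(key=lambda r: r["request_ready_time_ms"])
        for prev, nxt in zip(turns, turns[1:]):
            # Pure external gap: next turn ready minus previous turn finished.
            gaps.append(nxt["request_ready_time_ms"] - prev["request_end_time_ms"])
    return gaps


records = [
    {"session_id": "s1", "request_ready_time_ms": 1600, "request_end_time_ms": 2500},
    {"session_id": "s2", "request_ready_time_ms": 100, "request_end_time_ms": 900},
    {"session_id": "s1", "request_ready_time_ms": 0, "request_end_time_ms": 1000},
    {"session_id": "s1", "request_ready_time_ms": 3000, "request_end_time_ms": 3500},
]
print(sorted(inter_turn_gaps_ms(records)))
```

Single-turn sessions contribute no gaps, and overlapping turns would show up as negative values; how the real aggregator treats those is not stated here.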
Headline numbers (n = 783 k inter-turn gaps over 127 k multi-turn sessions):
p25 = 0.69 s
p50 = 1.6 s
p75 = 8.6 s
p90 = 44 s
mean = 37 s (heavy long-tail; paused/abandoned sessions)
39 % of gaps < 1 s
67 % of gaps < 5 s
87 % of gaps < 30 s
The bulk of the distribution is dominated by sub-second to a-few-seconds
tool-call latencies. Under any current scheduler (e.g. unified TTFT p90 =
7.3 s, lmetric 15.7 s), W_turn is already near or above the 75th
percentile of T_external (8.6 s), so dispatch coupling is the dominant
regime for the majority of turns, not a corner case.
This corrects the earlier conflated arrival-to-arrival "median gap 11 s"
figure (which folded W_turn into T_external). The true T_external median
is 1.6 s.
Adds:
- scripts/compute_inter_turn_gap_remote.py: dash0-side aggregator
- analysis/characterization/data/agentic_inter_turn_gap.json: 500-point
CDF cache + summary stats, scp'd back from dash0
- scripts/plot_inter_turn_gap.py: local figure renderer
- figs/f3a_inter_turn_gap.png: log-x CDF with p25/p50/p75/p90 anchors and
unified/lmetric TTFT p90 reference lines
Next step (per user): pull a chatbot trace through the same pipeline and
compare distributions side by side; this will let §2.3 stop hand-waving
about "no think-time" and instead present the regime split empirically.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Old f2c plotted per-request KV footprint MiB against an "H20 ~95 GiB
usable" reference line. That ceiling was wrong — a 30B-A3B bf16
deployment burns roughly:
~50% HBM for model params (~48 GiB on 96 GiB H20)
~10% for runtime activation buffers
~40% left for the KV cache pool (~38.4 GiB)
so 95 GiB was overstating the available pool by 2.5×.
New f2c reframes the same data into the answer that actually motivates
the paper: how many concurrent decodes does a single instance hold,
and how does PD-disagg change that? Grouped bars per percentile show
system-wide concurrent decode capacity for three 8-GPU deployments:
Combined 8C, PD-disagg 4P+4D (N_D=4), PD-disagg 6P+2D (N_D=2)
Key reads off the figure:
  percentile  KV/req    fit/inst  system-wide (8C / 4P+4D / 6P+2D)
  p50         1.8 GiB   20        160 / 80 / 40
  p90         8.0 GiB   4         32 / 16 / 8
  p95         9.6 GiB   4         32 / 16 / 8
  p99         11.5 GiB  3         24 / 12 / 6
PD-disagg 4P+4D literally halves the decode population at the same
per-request KV pressure — this is the concrete §3.2 "KV memory wall"
penalty stated in terms users care about (concurrency).
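The capacity figures follow from floor(KV_pool / req_size) × N_D. The sketch below uses the rounded 38.4 GiB pool and the tail percentile sizes quoted above; the names are hypothetical. The p50 row is left out because the rounded 1.8 GiB input does not reproduce the quoted fit count exactly.

```python
import math

KV_POOL_GIB = 38.4  # per-instance KV pool estimated above

# Number of decode-capable instances in each 8-GPU deployment.
DEPLOYMENTS = {"Combined 8C": 8, "PD-disagg 4P+4D": 4, "PD-disagg 6P+2D": 2}

TAIL_REQ_GIB = {"p90": 8.0, "p95": 9.6, "p99": 11.5}


def fits_per_instance(req_gib: float, pool_gib: float = KV_POOL_GIB) -> int:
    """How many requests of this KV size fit in one instance's pool."""
    return math.floor(pool_gib / req_gib)


def system_capacity(req_gib: float, n_decode: int) -> int:
    """Concurrent decodes the whole deployment can hold at this KV size."""
    return fits_per_instance(req_gib) * n_decode


for pct, req_gib in TAIL_REQ_GIB.items():
    row = " / ".join(str(system_capacity(req_gib, n)) for n in DEPLOYMENTS.values())
    print(f"{pct}: {fits_per_instance(req_gib)} fit/inst -> {row}")
```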
- analysis/characterization/render_window1_figures.py:
fig_kv_footprint_cdf rewritten; reads same kv_footprint_summary.json
but computes floor(KV_pool / req_size) × N_D and annotates the
per-instance fit count below each percentile group.
- figs/f2c_kv_footprint_cdf.png: regenerated.
- MEETING.md / PAPER_OUTLINE.md §2.1, §2.4: prose updated with the
new ceiling and the "3 p99 decodes per instance / halved by PD-disagg"
framing.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The headline f6_e2e_latency_bars only shows p90, hiding three regimes:
- mean: unified dominates (3.3s TTFT, 7.0s E2E vs sticky 5.6s / 12.1s)
- p50: sticky and unified are tied on first-turn TTFT (0.5s each) —
sticky's first turn of each session is free, after which queues
accumulate. Unified beats sticky everywhere else.
- p99: tail amplification reveals unified's biggest gap —
TTFT 42.3s vs sticky 74.1s; E2E 68.8s vs sticky 139.7s.
The 12-panel figure is the honest full picture; the 3-panel headline
stays for slide-friendly summary.
- analysis/characterization/window_1_results/raw_stats/{policy}.json:
cached ttft/tpot/e2e {mean,p50,p90,p99} pulled from dash0
/home/admin/cpfs/wjh/agentic-kv/outputs/b3_sweep_20260525_095043/
(b3_policy_comparison.json doesn't record mean, only percentiles).
- analysis/characterization/render_window1_figures.py:
new fig_b3_latency_full_grid renders the 4×3 grid from the cache.
- figs/f6_e2e_latency_full_grid.png: 12-panel companion.
- PAPER_OUTLINE.md §5.2: both figures embedded; main table column
renamed from "Hotspot idx" to "Worker p90 (median / max)" to match
the new metric convention.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The max/median ratio inverts the actual user-facing p90 ranking:
sticky: hotspot=2.73 but system e2e p90 = 34.6s (worst)
unified: hotspot=3.67 but system e2e p90 = 18.0s (best)
because sticky's median is also high (everyone slow) while unified
concentrates the damage on one worker and keeps the other 7 fast.
Any "imbalance" metric structurally punishes the affinity-then-escape
schemes that we actually want to advocate for.
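The inversion can be checked directly from the (median, max) worker p90 pairs and system E2E p90 values reported for the v1 run in this log. A small sketch with hypothetical names; the ratio is recomputed from the rounded pairs, so its last digit may differ slightly from the figures quoted above.

```python
# (median worker p90, max worker p90, system E2E p90), all in seconds, v1 run.
V1 = {
    "sticky": (20.3, 55.4, 34.6),
    "unified": (10.3, 37.7, 18.0),
}


def hotspot_ratio(median_p90: float, max_p90: float) -> float:
    """The retired metric: max worker p90 divided by median worker p90."""
    return max_p90 / median_p90


by_ratio = sorted(V1, key=lambda p: hotspot_ratio(V1[p][0], V1[p][1]))
by_e2e = sorted(V1, key=lambda p: V1[p][2])

print("best by hotspot ratio:", by_ratio[0])
print("best by system E2E p90:", by_e2e[0])
```

The two rankings disagree, which is the reason the per-worker (median, max) pair is reported instead of the ratio.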
Changes:
- analysis/characterization/render_window1_figures.py:
fig_b3_per_worker_ttft now annotates each subplot with
"median X.Xs · max Y.Ys" instead of "hotspot=Y.YY"; docstring
documents why we drop the ratio.
- figs/f4c_per_worker_ttft.png: regenerated with new titles.
- figs/f4c_apc_vs_hotspot_tradeoff.png: deleted. The scatter's y-axis
was the deprecated ratio; superseded by f4c per-worker bars + f6
e2e bars which together carry the same information honestly.
- PAPER_OUTLINE.md: C3, §3.3, §4.1 wording, §5 metric list, §8
conclusion — replace "hotspot index" mentions with
"worst-worker p90" or "(median, max) worker p90"; promote the
§3.3 methodology note to a top-level sub-finding ("hot pin
failure must be measured with per-worker absolute latency,
not normalized ratio").
- MEETING.md: §3.3 narrative reworded to lead with the (median, max)
pair directly; explicit one-line note on why the ratio is dropped.
Conceptual uses of "hot session" / "hot instance" / "hot pin" remain
unchanged — only the *metric* called hotspot index is retired.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Companion to the figure cleanup: prose in §3.1 was still quoting
"capped 31.6% APC" as one of the failure-mode datapoints. Same reason
as the figures — capped is a workload manipulation, not a policy, so
it doesn't belong in the §3.1 routing-policy narrative.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
'capped' is not a routing policy — it's lmetric run on a separately
truncated trace (sessions capped to 8 turns via build_capped_trace.py).
Putting it alongside lmetric/load_only/sticky/unified in per-policy
comparison figures is misleading because the workload differs, not
the routing decision. Comparing apples to a different-trace orange
inflates/deflates apparent policy gaps for the wrong reasons.
Regenerated 4 figures with --exclude-policies capped on
analysis/characterization/render_window1_figures.py:
- f4a_apc_loss.png (APC bars)
- f4c_apc_vs_hotspot_tradeoff.png (APC vs hotspot scatter)
- f4c_per_worker_ttft.png (per-worker TTFT panel)
- f6_e2e_latency_bars.png (TTFT/TPOT/E2E bars)
Added --exclude-policies CLI flag to the renderer so this is a
reversible choice, not a permanent script mutation. capped data remains
in b3_policy_comparison.json and can be brought back in workload-
sensitivity sections (where it actually belongs) by omitting the flag.
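A minimal sketch of how such a reversible filter could look, assuming the renderer holds a policy -> stats mapping. Only the `--exclude-policies` flag name comes from the change above; `parse_args`, `select_policies`, and the placeholder dict are hypothetical.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="render per-policy figures (sketch)")
    # Flag name is from the commit; nargs/default are assumptions of this sketch.
    parser.add_argument("--exclude-policies", nargs="*", default=[],
                        help="policy names to leave out of per-policy figures")
    return parser.parse_args(argv)


def select_policies(stats, excluded):
    """Return a filtered copy of stats; the input mapping is not mutated."""
    drop = set(excluded)
    return {name: row for name, row in stats.items() if name not in drop}


if __name__ == "__main__":
    # Placeholder rows, not measured data.
    stats = {p: {} for p in ["lmetric", "load_only", "sticky", "unified", "capped"]}
    args = parse_args(["--exclude-policies", "capped"])
    print(sorted(select_policies(stats, args.exclude_policies)))
```

Because the filter returns a copy, omitting the flag leaves every policy, including capped, in the rendered set.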
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Single-curve variant of f2b — production trace only, no replay overlay
and no uniform reference. Cleaner for boss-meeting/talk slides where the
extra context is noise. The combined three-curve figure is unchanged.
scripts/plot_session_skew_cdf.py: split into plot_combined +
plot_production_solo helpers; one run emits both PNGs.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pulls 456 (rank%, cum%) sample points from the raw production trace at
dash0:/home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl,
cached locally so the figure is reproducible without ssh access. Sampled
anchors match the precomputed summary exactly:
top 1% = 46.5%, top 5% = 66.5%, top 10% = 74.6%
plus newly readable points:
top 25% = 87.5%, top 50% = 96.0%
Workload characterization is now consistent with the production
distribution rather than the small replay subset. Replay window CDF kept
as an overlay to show the same hockey-stick shape on the data §5 actually
uses.
- analysis/characterization/data/production_session_skew_cdf.json: cached
sample points (29 KB), so the figure rebuilds locally
- scripts/plot_session_skew_cdf.py: now plots from the cache + replay raw
- MEETING.md / PAPER_OUTLINE.md: revert numbers to production trace,
add top-25%/50% data points
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous f2b_session_skew.png was a 3-bar chart (top 1/5/10%) computed
from the production trace summary (which is not present locally, only its
precomputed JSON). The new figure is a continuous CDF of cumulative
input-token mass vs session rank percentile, generated directly from the
replay trace traces/w600_r0.0015_st30.jsonl so any percentile is readable.
Headline numbers update accordingly:
replay trace (n=274 sessions): top 1% = 24.3%, top 5% = 61.9%, top 10% = 75.8%
production trace (n=1.3M): top 1% = 46.5%, top 5% = 66.5%, top 10% = 74.6%
Both show extreme skew well above the y=x uniform reference; the replay
trace is less extreme at top-1% because n=274 makes that bucket only
~3 sessions. We standardize §2/§3 narrative on the replay-trace numbers
so motivation matches §5 evaluation; production numbers kept as a side
note for context.
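A sketch of the CDF described above (cumulative input-token mass vs session rank percentile), assuming per-session input-token totals are already aggregated. The function names are hypothetical and this is not the code in scripts/plot_session_skew_cdf.py.

```python
def skew_cdf(tokens_per_session):
    """Return [(rank_pct, cum_pct)] with sessions sorted heaviest first."""
    ordered = sorted(tokens_per_session, reverse=True)
    total = sum(ordered)
    points, running = [], 0
    for i, tokens in enumerate(ordered, start=1):
        running += tokens
        points.append((100.0 * i / len(ordered), 100.0 * running / total))
    return points


def top_share(points, pct):
    """Cumulative share (%) held by sessions whose rank percentile is <= pct."""
    eligible = [cum for rank, cum in points if rank <= pct]
    return eligible[-1] if eligible else 0.0


if __name__ == "__main__":
    # Synthetic totals for illustration only.
    pts = skew_cdf([5, 50, 10, 30, 5])
    print(top_share(pts, 40))
```

How the bucket edge is rounded is a choice of this sketch (ranks <= pct); with small n that choice visibly moves the top-1% number.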
- scripts/plot_session_skew_cdf.py: reproducible figure generator
- MEETING.md / PAPER_OUTLINE.md: update narrative + caption
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
User pointed out the apparent paradox: in fig_b3_per_worker_ttft_p90, unified
has hotspot index 3.67 while sticky has 2.73, yet unified e2e p90 is roughly
half of sticky's. Resolution: hotspot index (max/median) is a *ratio* and
misleading on its own. Per-worker absolute TTFT p90:
sticky : median 20.3s, max 55.4s -> system e2e p90 34.6s
unified: median 10.3s, max 37.7s -> system e2e p90 18.0s
Mechanism: top 1% sessions own 46.5% input mass and there are more hot
sessions than instances (8), so sticky's hash binding gives *every* worker
its own hot session and the median worker is also slow. Unified's LMetric
fallback re-routes cold/new sessions away from hot affinity instances,
keeping 7 of 8 workers fast. System p90 is dominated by the majority of
requests landing on fast workers, hence the 2x e2e gap.
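The inversion can be reproduced from the per-worker numbers quoted above. This is an illustrative sketch: the dict layout and function name are hypothetical, and ratios recomputed from these rounded values may differ in the last digit from the reported 2.73 / 3.67.

```python
# Per-worker TTFT p90 and system e2e p90 as quoted in this message (seconds).
workers = {
    "sticky": {"median": 20.3, "max": 55.4, "system_e2e_p90": 34.6},
    "unified": {"median": 10.3, "max": 37.7, "system_e2e_p90": 18.0},
}


def hotspot_ratio(row):
    """The retired metric: worst worker divided by median worker."""
    return row["max"] / row["median"]


for name, row in workers.items():
    print(f"{name}: ratio={hotspot_ratio(row):.2f} "
          f"median={row['median']}s max={row['max']}s "
          f"e2e_p90={row['system_e2e_p90']}s")
```

The ratio ranks unified as worse while every absolute column ranks it as better, which is why the (median, max) pair is reported instead.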
Changes:
- Replace §3.3 figure with figs/f4c_per_worker_ttft.png (per-worker bars)
instead of figs/f4c_apc_vs_hotspot_tradeoff.png (the ratio scatter)
- §3.3 narrative in PAPER_OUTLINE.md and MEETING.md rewritten around
absolute median + max + system e2e p90 instead of hotspot ratio
- Add a §3.3 sub-finding: "hot pin failure must be measured with
per-worker absolute latency, not normalized ratio"
- Keep the scatter as supplementary for §5 multi-policy summary
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 trace-replay A/B/C (commit ef9e010) shows the kv_both substrate
is net positive on current codebase, not just neutral:
- TTFT p90: 11.97s plain → 9.74s kv_both (−18.6%) → 7.58s with DR-fix (−36.6%)
This reverses the elastic_migration_v2 paper's +45% kv_both penalty claim
and removes the primary cause of the 4 prior migration reverts.
Reframes EAR Pillar 2 from "DEFERRED" to "PARTIAL" — substrate verified,
e2e strategy-layer validation (trigger thresholds + target selection in
the dispatch-coupling feedback loop) remains as the only open risk.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Re-runs the elastic_migration_v2 trace (w600 r0.0015 st30, 1214 reqs,
274 sessions, 8×TP1 vLLM + cache_aware_proxy) with three configs:
- plain unified
- unified + Mooncake kv_both
- unified + Mooncake kv_both + DR-fix (env-gated O(|cache|) hash sync removal)
TTFT p90: 11.97 s → 9.74 s (−18.6%) → 7.58 s (−36.6% vs plain)
E2E p90: 23.48 s → 21.25 s (−9.5%) → 17.93 s (−23.6% vs plain)
Two findings:
1. The "+45% kv_both penalty" claim from elastic_migration_v2 is OBSOLETE
on current codebase — kv_both is now *faster* than plain at p90.
Likely fixed by e3a1d70 (RDMA-READ → bootstrap PUSH refactor) and
the connector-mode delay_free_blocks extending cross-turn prefix
cache hits on a 93%-intra-session-reuse trace.
2. DR-fix removes another 22% from TTFT p90 by skipping the
O(|cache|) hash sync in build_connector_meta. Cache-sweep with
DR-fix shows slope drops from +94.5 to +2.3 μs/1k blocks.
Adds:
- run_trace_replay_drfix.sh: A/B/C harness (env CT_DR_FIX gates patch)
- analyze_trace_replay.py: TTFT/TPOT/E2E delta analysis
- REPORT_TRACE_REPLAY.md: summary + reproduction
- results/20260526_1627_drfix/: cache-sweep with DR-fix
- results/trace_replay_20260526_1652/: full trace-replay A/B/C
Implication for EAR paper: the kv_both substrate is no longer the
bottleneck blocking session migration. The prior 4 migration reverts
were dominated by transfer overhead that has now been characterized
and (partially) removed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Convert figs/f4b_pdsep_kv_wall.pdf to PNG via pdftoppm @ 150 DPI so
MEETING.md and PAPER_OUTLINE.md render the figure inline on GitHub /
any standard markdown viewer (PDF ![]() embeds don't render).
- PAPER_OUTLINE.md F2, F4, F6: switch from backtick code references to
proper ![]() image embeds so the doc is actually viewable as a deck.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- replayer/replay.py: emit trace_span_s and amplification in summary
(Phase 1 of the wall-clock amplification measurement plan; needed for
§2.3 dispatch coupling empirical closure)
- figs/: 8 reusable figures copied from analysis/ with paper-spec names
(f2a/b/c workload, f4a/b/c/d failure modes, f6 e2e partial)
- PAPER_OUTLINE.md: real figure paths, explicit TBD markers for
custom drawings and pending data; new "Validation Status" table at top
and reorganized "Work Plan" splitting can-do-now vs migration-deferred
Migration validation deferred per user: 4 prior attempts (6b255fa,
e991960/5772149, cc6e562, 4c583f2) were reverted due to transfer
overhead; pending re-test on top of connector_tax DR-fix.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Initial 8-section outline for "Elastic Affinity Router" — agentic LLM
scheduler with session-affinity routing + hot-triggered session migration.
Centerpiece is §2.3's dispatch coupling argument: agentic workloads close
Little's Law on themselves (no human think-time), so per-turn W enters Λ,
amplifying small latency differences into throughput differences. This is
the intellectual hook the design hangs on.
§3 attacks three baselines on three orthogonal failure modes (load-balance
loses locality, static PD-disagg hits D-side KV wall, pure sticky creates
hot pin). §4 frames EAR as the single scheduler that addresses all three.
All figures and several numbers (T_hot, T_cool, EAR wall-clock factor) are
TBD — see Open Items at bottom.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds an env-gated skip for the per-step `set(cache.keys())` walk in
MooncakeConnectorScheduler.build_connector_meta() that was introduced
in our own commit a7df84b (Direct RDMA read). Re-runs the cache_sweep
A/B with three configs: plain (control), mooncake_both (baseline), and
mooncake_both_drfix (VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC=1).
Files:
- apply_direct_read_fix.py: one-line env-gate patch (marker-delimited for revert)
- run_drfix.sh: orchestrator for plain + mooncake_both + drfix
- analyze.py: extended to compare mooncake_both_drfix vs plain
and mooncake_both vs mooncake_both_drfix
- REPORT_DRFIX.md: findings
- results/20260526_1543_drfix/: run artifacts
Headline:
config | slope (μs/1k blocks) | step_dur p50 @ 16.6k
----------------------|----------------------|---------------------
mooncake_both | +81.0 | 1 550 μs
mooncake_both_drfix | -0.7 (≈ 0) | 95 μs
plain (control) | -1.8 (≈ 0) | 72 μs
build_meta p50 @ 16.6k blocks:
mooncake_both = 1 459 μs
mooncake_both_drfix = 6 μs (residual loop bookkeeping)
worker get_finished p50:
mooncake_both = 178 μs (unchanged; this fix doesn't touch it)
mooncake_both_drfix = 183 μs
The fix recovers 1 453 μs (99.6 %) of the scheduler-side cost at
|cache|=16.6k blocks. drfix's per-bin step_dur tracks plain within
±50 μs across the full cache range — that's noise-level. The slope
goes from +81 to essentially zero.
Worker-side get_finished (180 μs constant) is unchanged because the
DR-fix touches scheduler.build_connector_meta only. That's the next
target if we want to bring kv_both fully back to plain-level.
Extrapolation to trace-replay (|cache|≈13k, APC≈79%):
before: build_meta 1 060 μs + get_finished 180 μs = 1.24 ms/step
after DR-fix: build_meta 6 μs + get_finished 180 μs = ~0.19 ms/step
→ 85% reduction in per-step connector cost
→ TPOT inflation drops from ~+18% to ~+3% on a 7 ms decode step
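The extrapolation above is plain arithmetic and can be re-derived with a few lines. The function name is hypothetical; the inputs are the figures quoted in this message.

```python
def tpot_inflation_pct(build_meta_us, get_finished_us, decode_step_ms=7.0):
    """Per-step connector cost as a percentage of one decode step."""
    connector_ms = (build_meta_us + get_finished_us) / 1000.0
    return 100.0 * connector_ms / decode_step_ms


before_us = 1060 + 180   # build_meta + get_finished at |cache| ~ 13k, before the fix
after_us = 6 + 180       # same, with DR-fix

before = tpot_inflation_pct(1060, 180)
after = tpot_inflation_pct(6, 180)
reduction = 100.0 * (1 - after_us / before_us)
print(f"before={before:.1f}% after={after:.1f}% connector cost reduction={reduction:.0f}%")
```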
Confirms: the entire O(|cache|) slope was introduced by our own
direct-RDMA-read implementation (commit a7df84b), not upstream
Mooncake. Production fix: gate the sync on the presence of any
direct_read consumer, or replace per-step diff with an incremental
delta listener fed by block_pool add/remove callbacks.
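To make the second option concrete, here is a sketch contrasting the per-step full diff with a callback-fed delta. All class and method names are hypothetical, and it assumes the block pool fires exactly one callback per real add or remove; it is not the Mooncake connector code.

```python
class FullDiffSync:
    """Per-step approach: rebuild the key set and diff it, O(|cache|) per step."""

    def __init__(self):
        self.known = set()

    def step(self, cache):
        current = set(cache.keys())  # walks the whole cache every step
        added, removed = current - self.known, self.known - current
        self.known = current
        return added, removed


class DeltaListener:
    """Incremental approach: callbacks record changes, O(|delta|) per step."""

    def __init__(self):
        self.added, self.removed = set(), set()

    def on_add(self, key):
        if key in self.removed:
            self.removed.discard(key)  # removed then re-added in one step: no-op
        else:
            self.added.add(key)

    def on_remove(self, key):
        if key in self.added:
            self.added.discard(key)    # added then removed in one step: no-op
        else:
            self.removed.add(key)

    def step(self):
        delta = (self.added, self.removed)
        self.added, self.removed = set(), set()
        return delta
```

Both variants report the same (added, removed) pair per step; only the listener's cost is independent of cache size.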
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Follow-up to Microbench 3 that finally tests H5 (cache-size
dependence) and instruments worker-side connector callbacks the
original patch missed.
Patch v2 (apply_step_timing_v2.py) adds:
scheduler: `cache_size` field in engine_step.jsonl
worker: `get_finished_us` + `start_load_kv_us` in worker_step.r0.jsonl
uses BLOCK_BEGIN/END sentinels for safe multi-line revert
(the original v1 patch survives this v2's apply/revert cycle)
Driver: continuous open-loop (1.5 req/s, 4096x256 random per req)
that lets APC fill from 0 → ceiling within one vLLM lifetime so a
single run produces the full cache_size sweep. Decode-only steps
are filtered post-hoc to remove prefill-mix variance.
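A sketch of the post-hoc analysis this driver enables: filter to decode-only steps, then fit step duration against cache size. `cache_size` is the field added by the v2 patch; treating `step_duration_us` as a JSON field and the `is_decode_only` flag are assumptions of this sketch.

```python
import json


def load_decode_steps(lines, decode_flag="is_decode_only"):
    """Parse JSONL rows and keep (cache_size, step_duration_us) for decode-only steps."""
    rows = [json.loads(line) for line in lines if line.strip()]
    return [(r["cache_size"], r["step_duration_us"]) for r in rows if r.get(decode_flag)]


def slope_us_per_1k_blocks(points):
    """Ordinary least-squares slope, scaled to microseconds per 1k cached blocks."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return 1000.0 * sxy / sxx
```

A flat config should come out near zero; a config that walks the cache every step should show a clearly positive slope.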
Findings (H20 96GB, ceiling reached ~17.5k blocks; n=15-18k decode
steps per config):
config | slope (μs / 1k blocks) | step_dur p50 @ |cache|=16.6k
---------------|------------------------|-----------------------------
mooncake_both | +85.6 | 1528 μs (build_meta=1442, 94%)
noop_connector | -0.8 (≈0) | 79 μs
plain | +1.0 (≈0) | 84 μs
Worker-side get_finished p50/p90/p99 (μs/step):
mooncake_both: 180 / 257 / 333
noop_connector: 0 / 0 / 2
H5 PASSES. mooncake_both step_duration scales linearly with |cache|
because build_connector_meta walks set(cache.keys()) every step
(`mooncake_connector.py:434-450`). plain and noop are flat.
The previously-uninstrumented get_finished() adds a constant
180 μs/step on top — two `run_coroutine_threadsafe(...).result()`
blocking waits in kv_both mode (`mooncake_connector.py:1107-1137`)
fire every step even when no transfer is pending.
Trace-replay reconciliation (APC ≈ 79% → |cache| ≈ 13k blocks):
build_meta @ 13k ≈ 1060 μs + get_finished ≈ 180 μs = 1.24 ms/step
On ~7 ms decode forward → +15-20% TPOT per step.
This explains most of the trace-replay +25% TPOT p90 gap from
single-instance per-step cost alone, leaving a smaller residual
for multi-instance coupling than originally assumed.
Two clear fixes pointed out in REPORT.md:
1. replace O(|cache|) per-step walk with incremental delta
listener using block_pool's add/remove callbacks
2. short-circuit get_finished() when both producer/consumer
queues are empty in kv_both
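Fix 2 can be sketched as a fast path in front of the two blocking cross-thread waits. `FinishedPoller` and its attributes are hypothetical stand-ins, not the connector's real structure; only the use of `run_coroutine_threadsafe(...).result()` mirrors the description above.

```python
import asyncio
import threading


class FinishedPoller:
    def __init__(self):
        # Background event loop, standing in for the connector's transfer thread.
        self.loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self.loop.run_forever, daemon=True)
        self._thread.start()
        self.pending_send = set()
        self.pending_recv = set()
        self.blocking_waits = 0

    async def _drain(self, pending):
        done = set(pending)
        pending.clear()
        return done

    def get_finished(self):
        # Fast path: nothing in flight, so skip both cross-thread round trips.
        if not self.pending_send and not self.pending_recv:
            return set(), set()
        self.blocking_waits += 2
        sent = asyncio.run_coroutine_threadsafe(
            self._drain(self.pending_send), self.loop).result()
        recv = asyncio.run_coroutine_threadsafe(
            self._drain(self.pending_recv), self.loop).result()
        return sent, recv

    def close(self):
        self.loop.call_soon_threadsafe(self.loop.stop)
        self._thread.join()
        self.loop.close()
```

With the fast path, idle steps cost two set truthiness checks instead of two blocking waits.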
Heavy raw artifacts (engine_step.jsonl, vllm_stdout/stderr,
.vllm.pid) are .gitignored — they re-derive from `bash run_all.sh`
and SUMMARY.md / per_config.json fully capture the conclusions.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The prior write-up presented one specific reading of the data as
the headline without flagging methodology gaps. Three corrections:
1. The "0% low-concurrency tax" comes from a single back-to-back
mooncake_both_v2/plain_v2 rerun. The original Phase A pair
showed TTFT p90 +29%, TPOT p90 +54%, E2E p90 +55% at rate=2
— a 40 percentage-point swing between two consecutive runs
that the original write-up did not call out. The run-to-run
noise floor is too high to claim "0%" at low concurrency.
2. get_finished() was never instrumented. The patch only times
step_duration_us and build_meta_us. "100% of per-step cost is
build_meta" is an upper bound on what was timed, not a true
decomposition.
3. H5 (cache-size dependence) was the central hypothesis but
was never tested in the prior run; random content kept APC
near empty.
The +7-9% high-concurrency (single instance, 512x64, rate=8-16)
and +17% 8-instance-saturated numbers are kept; they were
measured with adequate sample sizes and are reproducible.
The follow-up sweep in cache_sweep/ tests H5 directly and
revises the decomposition.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
8×TP1 + load_only proxy, shape 512×64, rates 32/64/128 req/s total:
Rate=32 (non-saturated, thr=0.95-0.97):
plain TTFT p90=64ms, mooncake_both=65ms → +2% (noise)
Rate=64 (non-saturated, thr=0.96):
plain TTFT p90=114ms, mooncake_both=107ms → -6% (noise)
Rate=128 (saturated, thr=0.70-0.71):
plain TTFT p90=702ms, mooncake_both=822ms → +17%
plain TTFT p50=339ms, mooncake_both=470ms → +39%
Conclusion: The elastic_migration_v2 +45% is a saturation artifact.
Under SLO-compliant load (TTFT<10s, thr_ratio>0.9), mooncake_both's
1.4ms/step build_connector_meta overhead is completely masked by the
scheduler-model async pipeline. The tax only manifests when the system
is already saturated and queueing amplifies per-step differences.
For practical deployment: enabling kv_role=kv_both has effectively zero
cost as long as the serving system stays within SLO capacity bounds.
- Regenerate uv.lock after adding fastapi/uvicorn deps so uv sync
--locked no longer fails
- B3 scripts: default MODEL to $HOME/models/... matching documented
convention and other launch scripts (repo has no models/ directory)
- launch_elastic_p2p: append || true to each trap command so set -e
doesn't abort cleanup when jobs -p is empty and EngineCore orphans
remain
Adds unified_nixl_both to elastic_migration_v2: same picker as
unified_kv_both (never triggers PD-sep), but launches vLLM with
NixlConnector instead of MooncakeConnector. Compared against plain
unified and unified_kv_both (Mooncake) we can now attribute the
substrate overhead between "v1 connector framework irreducible
cost" (proxied by the leaner NIXL) and "Mooncake implementation
extra" (Mooncake - NIXL).
Result (vs plain unified, both substrates never PD-sep):
metric       | plain  | NIXL   | Mooncake | note
-------------|--------|--------|----------|---------------------
TTFT p90     | 7.35s  | +37.9% | +45.3%   | NIXL: +7pp better
TPOT p90     | 17.1ms | +15.5% | +24.5%   | NIXL: +9pp better
E2E p90      | 18.03s | +17.4% | +27.0%   | NIXL: +10pp better
hotspot      | 3.667  | +0.2%  | +19.0%   | NIXL: keeps it flat
APC          | 79.4%  | -0.3pp | -1.1pp   |
interference | -      | 5.58   | 8.57     | NIXL: ~35% lower
The cleanest signal is hotspot: NIXL preserves plain-unified's
distribution (3.674 vs 3.667), while Mooncake's per-scheduler-step
O(|cache|) `set(self._block_pool.cache.keys())` diff against
_known_hash_keys (mooncake_connector.py:432-456) inflates routing
imbalance by 19%. The hash sync runs unconditionally even when no
direct_read consumer is present.
Attribution: NIXL-plain ~= v1 framework irreducible cost (kv_buffer
GPU memory, per-step SchedulerOutput.kv_connector_metadata
round-trip, altered kv_cache_manager block-lifecycle). Mooncake-NIXL
~= Mooncake-specific overhead (the hash-sync loop and stricter
delay_free semantics).
Practical implication: NIXL is meaningfully better than Mooncake on
this stack, but even NIXL imposes 16-38% across metrics — too
expensive for selective-PD-sep on agentic workloads where the
trigger rate is < 0.5%.
Launch fixes required for NIXL multi-instance:
- VLLM_NIXL_SIDE_CHANNEL_PORT must be unique per instance (default
5600; we use 5600+i). Without this, 7 of 8 instances silently hang
in `zmq.error.ZMQError: Address already in use` and the launcher
trap kills all of them at health-check timeout.
- Health-check timeout raised from 180s to 360s; NIXL initialization
(UCX agent + memory registration) is ~100-150s per instance under
8-way concurrent load, vs Mooncake's ~30-60s.
New figure: fig_connector_substrate_attribution.png stacks plain /
framework / Mooncake-extra / v2-branch overhead per metric.
Existing figures (fig_kv_both_overhead, fig_three_way_hotspot)
updated to include NIXL as a fourth bar.
README updated with 4-way table, Result 1 reframed as "the cost is
mostly framework, not Mooncake — but Mooncake adds the hotspot
penalty", and the substrate-vs-PD-sep tradeoff math.
Refs: nixl_connector.py:700 handshake listener bind, factory.py
register_connector for the NixlConnector entry.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Critical:
- cache_aware_proxy: _handle_pd_sep leaked p_inst.num_requests (never
decremented) and never managed d_inst.num_requests; fix media_type
from application/json to text/event-stream for SSE stream
High:
- b3_sweep/b3_isolated_policy/b3_analyze: replace hardcoded
/home/admin/cpfs/wjh/ ROOT with script-relative $(dirname "$0")/..
- b3_analyze: replace hardcoded 8-port WORKER_MAP with dynamic
generation from BASE_PORT and N_INSTANCES
Medium:
- analyze_breakdown: warn on stderr when records are skipped (was silent)
- deploy_vllm_patches: fail-fast on SSH/SCP errors instead of
continuing with empty VENV_SITE
- pyproject.toml: declare fastapi and uvicorn as runtime dependencies
- launch_elastic_p2p: kill EngineCore and proxy in trap handler to
prevent GPU memory leaks on exit
NIXL's _nixl_handshake_listener (vllm/distributed/kv_transfer/
kv_connector/v1/nixl_connector.py:700) binds a ZMQ ROUTER socket on
the side_channel_port, which defaults to 5600. When 8 NIXL vLLMs
launch concurrently on the same host all 8 race for tcp://localhost:5600;
exactly one succeeds and the others silently hang in the listener
thread with:
zmq.error.ZMQError: Address already in use (addr='tcp://localhost:5600')
The engines themselves never reach "Application startup complete"
and the b3_isolated_policy.sh health-check times out. First observed
when 7 of 8 inst_X.log files contained the ZMQ error and the 8th
(by random ordering) was the one healthy instance.
Fix: set VLLM_NIXL_SIDE_CHANNEL_PORT=$((5600 + i)) per instance in
the NIXL launch branch. Each engine now gets a distinct handshake
port (5600..5607 by default). Verified: all 8 instances now reach
"Application startup complete" within the 360 s health budget.
This is NIXL-specific; Mooncake uses VLLM_MOONCAKE_BOOTSTRAP_PORT
which we were already varying per instance.
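The same per-instance port assignment, sketched in Python for clarity (the actual fix lives in the shell launch branch). `VLLM_NIXL_SIDE_CHANNEL_PORT` and the 5600 base are from this message; the helper names are hypothetical.

```python
import os


def nixl_instance_env(index, base_env=None, side_channel_base=5600):
    """Return a copy of the environment with a distinct handshake port for instance `index`."""
    env = dict(os.environ if base_env is None else base_env)
    env["VLLM_NIXL_SIDE_CHANNEL_PORT"] = str(side_channel_base + index)
    return env


def check_unique_ports(envs):
    """Fail fast at launch time instead of letting instances hang on bind."""
    ports = [e["VLLM_NIXL_SIDE_CHANNEL_PORT"] for e in envs]
    if len(set(ports)) != len(ports):
        raise ValueError(f"duplicate side-channel ports: {ports}")
    return ports


if __name__ == "__main__":
    envs = [nixl_instance_env(i, base_env={}) for i in range(8)]
    print(check_unique_ports(envs))
```

A pre-launch uniqueness check like this turns the silent listener hang into an immediate, readable error.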
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds a NIXL-backed counterpart to unified_kv_both so we can attribute
the kv_both substrate overhead measured in the elastic_migration_v2
section to either Mooncake-specific code or a generic v1-connector
cost shared by all connectors.
- scripts/cache_aware_proxy.py: register --policy unified_nixl_both.
Picker is identical to unified (and unified_kv_both); routing
decisions never go through the PD-sep branch. Differs only at the
vLLM launch layer.
- scripts/b3_isolated_policy.sh: new KV_CONNECTOR env var
(Mooncake|Nixl), auto-set based on POLICY. NIXL launch path uses
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
with no VLLM_MOONCAKE_BOOTSTRAP_PORT (NIXL uses UCX side-channels).
- Health-check timeout: 90 iterations * 2s -> 180 iterations * 2s
(180s -> 360s). Empirically NIXL needs ~100-150s per instance to
initialize the UCX agent and register KV cache memory; 8
concurrent NIXL launches frequently overshoot the previous 180s
budget. Mooncake is unaffected (still finishes well inside the new
budget). The 8-vLLM unified_nixl_both first launch tripped the
old timeout despite 7/8 instances reaching startup-complete.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
plot_interference.py reads the interference sweep summary (4 D × 4 P × 3 reps,
cold prefill prompts) and produces:
fig_interference_heatmap.png
TPOT p90 interference index over (D, P): 14x at D=8 P=2k → 214x at D=1 P=32k.
fig_interference_lines.png
(a) TPOT p90 during prefill vs P, log-y, one line per D + baseline dashed
(b) Cold prefill TTFT vs P (interference window length)
Confirms B2 finding: cold prefill on the same worker stalls overlapping
decodes for 14-214x baseline TPOT. The interference window grows linearly
with P (from ~140ms at 2k to ~4.6s at 32k) and is essentially independent
of decode batch size — prefill compute time dominates.
Instrumentation patches (microbench/patches/):
- pd_profile.py: shared event emitter (VLLM_PD_PROFILE_LOG env var)
- apply_patches.py: idempotent patch installer for mooncake_connector.py
and scheduler.py, marks insertions with # PD_PROFILE_PATCH
- analyze_events.py: joins per-process JSONL event logs by transfer_id
into per-request phase durations
Seven events captured per request:
D_get_num_matched → P_zmq_received → P_prefill_done →
P_rdma_start → P_rdma_end → D_recv_complete → D_request_promoted
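A sketch of the join that turns these seven events into per-request phase durations. The event names are the ones listed above; the JSON field names (`transfer_id`, `event`, `ts`) and the assumption that P and D timestamps share a clock are this sketch's, not necessarily analyze_events.py's.

```python
import json
from collections import defaultdict

EVENT_ORDER = [
    "D_get_num_matched", "P_zmq_received", "P_prefill_done",
    "P_rdma_start", "P_rdma_end", "D_recv_complete", "D_request_promoted",
]


def join_events(jsonl_streams):
    """Merge per-process JSONL logs into {transfer_id: {event: timestamp}}."""
    by_transfer = defaultdict(dict)
    for stream in jsonl_streams:
        for line in stream:
            if not line.strip():
                continue
            rec = json.loads(line)
            by_transfer[rec["transfer_id"]][rec["event"]] = rec["ts"]
    return dict(by_transfer)


def phase_durations(events):
    """Durations between consecutive events; phases with a missing endpoint are skipped."""
    out = {}
    for start, end in zip(EVENT_ORDER, EVENT_ORDER[1:]):
        if start in events and end in events:
            out[f"{start}->{end}"] = events[end] - events[start]
    return out
```

Skipping phases with a missing endpoint keeps partially logged requests from producing bogus durations.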
Driver fix (microbench/lifecycle/driver.py):
seed_prefix_cache now sends via the proxy URL so P and D both cache
the seeded prefix with matching block hashes. Previously seeding D
directly produced different block hashes than the proxy-routed
measurement requests, making incremental transfer impossible.
Real breakdown (fig_breakdown_real.png, server_breakdown.csv, n=93):
prefill_compute 620 ms median (95% of overhead)
rdma_transfer 42 ms median (~71 Gbps effective)
other overhead 10 ms median (dispatch + params + signal + promote)
Mooncake transfer is NOT the bottleneck. Even with bulk RDMA the
transfer cost is <10% of prefill cost for Qwen3-30B-A3B on H20.
New analysis/characterization/elastic_migration_v2/ packages the
unified_v2 + unified_kv_both experiments into a self-contained
results section that the paper can cite as the "we tried selective
PD-sep migration" case study. The section finds three independent
reasons PD-sep doesn't help on agentic w600:
1. Mooncake kv_both substrate alone (no PD-sep ever firing) imposes
TTFT p90 +45%, TPOT p90 +25%, hotspot index +19% vs plain
unified. Per-step KVConnectorMetadata maintenance and block
reservation semantics dominate even when no transfer is pending.
2. PD-sep gate fires on only 0.16-0.41% of requests across two
gate-tightness configurations. 88% and 76%, respectively, are killed by
new_local < threshold because 93% intra-session reuse on agentic
traces leaves a small uncached tail; 19% are killed by
chosen_no_active_decode (snapshot-time gate). Even relaxed
thresholds can't grow trigger rate past 0.5%.
3. When PD-sep fires, the calibrated cost model
(0.3s + bytes / 2.7 GB/s) is wrong by 10-20x. 5 triggered
requests in v2.1 saw realized TTFT 12-45s vs model-predicted
migrate cost 0.7-2.2s, consistent with the E2 audit's finding
that D-side block pre-reservation and missing layerwise
pipelining dominate the decode_sent -> first_token clock.
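The cost model in item 3 is simple enough to write down directly. This sketch assumes GB means 1e9 bytes; the function names and the example payload are hypothetical, not values from the run.

```python
def predicted_migrate_cost_s(num_bytes, fixed_s=0.3, bandwidth_bytes_per_s=2.7e9):
    """Calibrated model from the text: 0.3 s fixed cost plus bytes over 2.7 GB/s."""
    return fixed_s + num_bytes / bandwidth_bytes_per_s


def misprediction_factor(realized_ttft_s, num_bytes):
    """How many times larger the realized TTFT is than the model's prediction."""
    return realized_ttft_s / predicted_migrate_cost_s(num_bytes)


if __name__ == "__main__":
    # Hypothetical example: a 2.7 GB KV payload that took 26 s end to end.
    print(predicted_migrate_cost_s(2.7e9), misprediction_factor(26.0, 2.7e9))
```

Logging this factor per triggered request would make the model's miss visible as soon as PD-sep fires, rather than in a later audit.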
Three-way comparison (unified vs unified_kv_both vs unified_v2):
v2 vs the kv_both control is roughly net-zero (-10% hotspot,
-14% TPOT p90, +3% TTFT p90, +9% TTFT p99). v2 vs plain unified is
strictly worse by 27-49% across latency percentiles because the
kv_both substrate tax is unavoidable when the policy is enabled.
Contents:
- README.md: the four results sections, the three-way comparison
table, an explicit "what this claims for the paper" list, and a
cross-reference index to the earlier characterization documents.
- data/: b3_policy_comparison.json + per-policy breakdown.json
+ per-policy hotspot_index.json for the four policies in scope.
- figures/: 4 PNGs rendered by render_figures.py:
* fig_kv_both_overhead.png — 4-metric bar chart with delta
annotations showing kv_both alone costs +45% TTFT p90.
* fig_v2_trigger_funnel.png — per-reason request count for the
two gate configurations on log scale.
* fig_v2_predicted_vs_actual.png — scatter of model-predicted
migrate cost vs realized TTFT for the 5 triggered requests,
with y=x, 10x, and 20x reference lines.
* fig_three_way_hotspot.png — per-worker TTFT p90 grouped bars
across the three policies.
The section is intentionally self-contained: it lists what the
experiment validates (cost model picks correct candidates;
shadow-drift fix is necessary; same-worker interference is real)
alongside what it disproves (per-request PD-sep on agentic via
Mooncake is not a net win in current implementation).
Refs: E1/E2 subagent audits, B2 microbench, unified_v2 commits
19f69a9 / 4b833d3 / 95c8ef8.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The proxy maintains shadow counters (num_requests, ongoing_tokens,
pending_prefill_tokens, ongoing_decode_tokens) used by every routing
picker. They are incremented in _handle_local_request and decremented
in the generator's finally block. When the StreamingResponse generator
never enters (client disconnect between proxy returning the response
and Starlette starting iteration, or Starlette failing before
iteration), the decrement never fires and the counter stays elevated
forever. Over a multi-hour run the shadow accumulates "phantom" load
on the affected instances and biases the router away from them.
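The leak mechanism has a plain-Python analogue: a generator that is never started never runs its `finally` block. This sketch uses a synchronous generator and a hypothetical `shadow` dict; the real proxy streams through Starlette, so treat it as an illustration of the failure mode only.

```python
shadow = {"num_requests": 0}


def handle_request():
    shadow["num_requests"] += 1          # incremented before streaming starts

    def body():
        try:
            yield b"data: chunk\n\n"
        finally:
            shadow["num_requests"] -= 1  # runs only if the generator was started

    return body()


# Case 1: the response generator is created but never iterated.
gen = handle_request()
gen.close()
leaked = shadow["num_requests"]

# Case 2: iteration starts, then the client goes away.
gen = handle_request()
next(gen)
gen.close()
after_started = shadow["num_requests"]

print(leaked, after_started)
```

Only the never-started case leaves a phantom count behind; once iteration has begun, `close()` runs the `finally` block and the decrement fires.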
Concrete observation that prompted the fix: during the unified_kv_both
B3 run, engine_0 sat at proxy num_requests=1 / ongoing_decode_tokens=80406
while vLLM's own /metrics reported num_running=0 num_waiting=0 and the
GPU sat at 0% utilization. Every routing decision after that point
believed engine_0 was busy with an 80k-token decode that did not exist.
Fix: extend _reconcile_loop to actively poll each instance's
/metrics every 30 s. If the proxy's num_requests has been higher than
vLLM's (running + waiting) for two consecutive cycles (~60 s of stable
drift), reduce the shadow to vLLM's truth. When vLLM is fully idle
(running=0, waiting=0), zero ongoing_tokens, ongoing_decode_tokens,
and pending_prefill_tokens as well.
Two-cycle persistence avoids correcting transient mismatches where
the proxy has just incremented for a new request that vLLM has not
scheduled yet. A single ~30 s blip is not long enough to corrupt
routing decisions; only persistent drift gets corrected.
The previous _reconcile_loop only clamped negatives. Phantom positives
are now caught and logged ("[reconcile] {url}: phantom drift ...").
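The two-cycle rule can be sketched as a small state machine, one per instance. `ShadowReconciler` and `observe` are hypothetical names; the counter names and the engine_0 values are the ones quoted above, and the /metrics polling itself is left out.

```python
class ShadowReconciler:
    """Correct phantom-positive drift only after it persists for N consecutive polls."""

    def __init__(self, required_cycles=2):
        self.required_cycles = required_cycles
        self.drift_cycles = 0

    def observe(self, shadow, vllm_running, vllm_waiting):
        """Update `shadow` in place; return True if a correction was applied."""
        truth = vllm_running + vllm_waiting
        if shadow["num_requests"] <= truth:
            self.drift_cycles = 0        # transient mismatch resolved itself
            return False
        self.drift_cycles += 1
        if self.drift_cycles < self.required_cycles:
            return False
        shadow["num_requests"] = truth
        if truth == 0:                   # engine fully idle: clear token counters too
            for key in ("ongoing_tokens", "ongoing_decode_tokens",
                        "pending_prefill_tokens"):
                shadow[key] = 0
        self.drift_cycles = 0
        return True
```

Called once per 30 s poll, this corrects the stuck engine_0 state on the second consecutive drift observation and leaves one-cycle blips alone.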
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
v2.0 ran on B3 and triggered PD-sep only 2 / 1214 times (0.2%). The
gates were too conservative; the v2-vs-v1 latency gap (TTFT p90
7.35 -> 8.96 s) is therefore probably attributable to kv_both
always-on overhead, not to the PD-sep mechanism itself. v2.1 has two
fixes plus an isolation control.
Bug fix:
- The "chosen has live decodes worth protecting" gate combined
num_requests and ongoing_decode_tokens with AND, falling through
when EITHER was small. Under agentic workloads each worker rarely
stacks more than 1-2 concurrent requests, so the gate killed 84%
of v2.0 candidates that reached it. Replace with a pure
ongoing_decode_tokens == 0 check ("chosen_no_active_decode"):
same semantics, much higher recall.
Threshold relaxation (B2 microbench is the calibration source):
- pd_sep_min_new_tokens: 16000 -> 8000 (B2 TPOT idx 1.9x already
at 8k, TTFT idx 12x — strictly worth migrating)
- pd_sep_min_decodes_protected: 2 -> 1
- pd_sep_min_src_cache_tokens: 8000 -> 4000
- pd_sep_min_extra_cache_tokens: 4000 -> 2000
Isolation control:
- New --policy unified_kv_both option. Uses the exact same picker as
--policy unified but the vLLMs are launched in kv_role=kv_both
(the same launch mode unified_v2 requires). PD-sep never fires.
Compares against unified_v2 to attribute any v2 effect to the
PD-sep branch alone, not the kv_both always-on overhead.
- Both unified_kv_both and unified_v2 auto-enable kv_both launch in
b3_isolated_policy.sh.
Tests:
- Updated the existing "chosen has no decodes" test for the new
gate name and semantics.
- All 24 proxy tests pass.
Refs: window_1_results/v2_breakdown analysis (88.7% of candidates
caught by old new_local_below_threshold; 84% of the remainder
caught by the old few_decodes gate).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>