Reuse and concurrency axes redone with proper controlled variables, plus
the orchestration used to run them on dash0:
- run_reuse_fixed.sh: hold REAL prefill work (delta) constant, vary only
cached prefix -> reuse = C/(C+U). Supersedes old fig1 (which held
input=8192 and sliced prefix out, confounding "more reuse" with "less
prefill").
- run_conc.sh: agentic-corner config (in=32768, delta=512, reuse=0.984,
out=128) that exposes PD's structural KV-transfer tax. Supersedes old fig3.
- run_campaign{,2,3}.sh, backfill_d2048o128.sh: serial campaign drivers
(strictly one driver at a time), out=128 sweeps, PD wall-cap for
collapse-draining high-reuse arms, and flaked-arm backfill.
- mb5_run_gpu.sh: per-config bring-up / replay / teardown orchestrator.
- plot_pd_crossover.py: render the reuse_compare figures from fig_agg dumps.
- fig_agg.py: tolerate null stats from fully-collapsed arms (0 successes
write the stat keys as null; `dict.get(k, {})` returns null, not {}).
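
A minimal sketch of the null-tolerance pattern from the fig_agg.py item, assuming a per-arm summary dict loaded from JSON. The key names (`ttft_ms`, `p50`) and the helper `stat` are hypothetical, not fig_agg.py's actual schema:

```python
import json


def stat(summary, key, field):
    """Return summary[key][field], or None if the stat block is missing or null."""
    # .get(key, {}) only falls back when the key is ABSENT. A key that is
    # present with JSON null comes back as None, so coalesce with `or {}`.
    block = summary.get(key) or {}
    return block.get(field)


collapsed = json.loads('{"ttft_ms": null}')            # arm with 0 successes
healthy = json.loads('{"ttft_ms": {"p50": 1000.0}}')   # arm with real stats

print(stat(collapsed, "ttft_ms", "p50"))  # None
print(stat(healthy, "ttft_ms", "p50"))    # 1000.0
```

The `or {}` coalescing also swallows other falsy values; that is acceptable here only because a stat block is assumed to be either a non-empty dict or null.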
Data: fig1_reuse_fixed.json, fig1_reuse_d{1024,2048}_o128.json
Figs: reuse_compare_AB.png, reuse_compare_ABC.png
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Follow-up to the LMetric sweep: rerun with --policy linear (cache-aware
load + sticky session affinity, the cache_aware_proxy default) and cap
each PD-disagg arm at 2x the colo bench wall (SIGTERM bench.sh once cap
is exceeded; the cleanup trap clears vLLM and proxy; capped runs lack
metrics.summary.json so the analysis script computes from raw
metrics.jsonl).
Headline: the success-rate ceiling is policy-invariant.
  arm         linear (capped at 2x)        lmetric (uncapped)
  colo        807/807 = 100%,  964s        807/807 = 100%,  1021s
  pd6 (6:2)   472/807 =  58%, 2280s ⊗      474/807 =  59%,  3325s
  pd4 (4:4)   349/807 =  43%, 2281s ⊗      348/807 =  43%,  6850s
  pd2 (2:6)   176/807 =  22%, 2280s ⊗      180/807 =  22%, 19275s
Routing affects only how much wall is wasted timing out unreachable
requests at 600s each. Linear hits the same ceiling in 2280s as
LMetric does in 3300-19000s. This *strengthens* the §5 D-pool
capacity-ceiling thesis -- the cap is structural, not a routing
artifact.
Artifacts:
analysis/v2/fig4r_linear.json -- 4-arm linear summary
analysis/v2/PD_DISAGG_LMETRIC.md -- extended with wall-cap section
figs/v2/fig4_linear_vs_lmetric.png -- 3-panel side-by-side comparison
microbench/fresh_setup/plot_fig4_linear_vs_lmetric.py
Anchor experiment for the clean-stack PD comparison using the canonical
cache-aware proxy with --policy lmetric (scripts/bench.sh harness). Two
traces x four arms = eight runs on dash1.
Headline: with the right routing baseline (LMetric), PD-colo holds 100%
completion on both traces while every static PD-disagg ratio fails
(14-65% completion), and the failure mode rotates with the split --
no static partition has a working operating point on this workload.
LMetric improves colo dramatically (TTFT p50 1.0s vs original §3 RR
7.0s; 7x) but does NOT rescue PD-disagg, confirming the bottleneck is
structural (D-pool admission + multi-turn KV accumulation), not routing.
Completion matrix:
               first600s   full
  colo         100%        100%
  pd6 (6:2)    58.7%       65.3%   (decode-bound)
  pd4 (4:4)    43.1%       43.9%   (both bottlenecks)
  pd2 (2:6)    22.3%       13.9%   (prefill-bound)
The original §3 RR "100% PD completion" appears to be a measurement
artifact of e13391e: producer-KV eviction acted as a relief valve,
letting more requests squeeze under the 600s timeout at the (uncosted)
price of cross-turn re-prefill. With the eviction off, PD-disagg is
worse than §3 advertised, not better.
Artifacts:
analysis/v2/fig4l_lmetric.json -- 8-arm summary data
analysis/v2/PD_DISAGG_LMETRIC.md -- writeup + reproduce recipe
figs/v2/fig4_lmetric_pd_vs_colo.png -- 4-panel comparison figure
microbench/fresh_setup/plot_fig4l_lmetric.py -- plot script
Adds dated, non-destructive correction notes to the contaminated PD-vs-colo
artifacts after the producer-eviction bug (`evict_blocks(sent_block_ids)` on
`finished_sending`, deployed over the "fresh" pip vLLM by
`scripts/deploy_vllm_patches.sh`) was found and gated behind
`VLLM_EVICT_SENT_BLOCKS` (default off).
- PD_DISAGG_RESULTS.md: top CORRECTION banner + §6 RETRACTED marker. §6
  (session-affinity hot-pin) was an `e13391e` artifact under controlled
  concurrency; §3 RR, §4 TPOT win, §5 D-pool ceiling, §5.1 consumer crash
  stand.
- RESULTS_SUMMARY.md: §4 confirm+refine note: clean ablation confirms the
  D-pool capacity thesis and adds regime-dependence.
- pd_separation_analysis.md: scoped caution: thesis confirmed; flags only
  reuse-dependent figures for cross-check (this study used a different
  stack).
- figs/mb5/CORRECTION.md: flags mb5_producer_hotspot.png as retracted;
  §3 RR and §5 D-pool figures stand.
Ran unified vs unified_v4 A/B on dash2 (8×H20, kv_both+DR-fix substrate,
w600_r0.0015_st30_first600s trace). Key findings:
- At 1x QPS (~1.3 req/s): zero migrations. pending_prefill_tokens is 0 for
95% of routing decisions because instances complete prefill before the next
request arrives. The relative arm (src_pp > fleet_median*1.5) never fires.
- At 2x QPS (~2.7 req/s): 4 migrations (0.5%). src_pp>0 rises to 24% of
eligible decisions. Trigger correctly identifies genuinely overloaded
instances (src_pp 13k–73k vs fleet median 3.8k–33k).
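
A minimal sketch of the relative arm of the trigger, assuming it is only the comparison quoted above (src_pp > fleet_median*1.5). The function name and the example fleets are hypothetical, and the real picker may gate on further conditions:

```python
from statistics import median


def relative_arm_fires(src_pp, fleet_pp, factor=1.5):
    """True when the source's pending prefill tokens exceed factor x fleet median."""
    return src_pp > median(fleet_pp) * factor


# Idle fleet (the 1x QPS case): every instance reports 0 pending prefill
# tokens, so the comparison is 0 > 0 and the arm can never fire.
print(relative_arm_fires(0, [0] * 8))  # False

# Hypothetical loaded fleet: one instance at 13k pending tokens while the
# rest sit at 3.8k, so the median is 3.8k and the threshold is 5.7k.
print(relative_arm_fires(13_000, [3_800] * 7 + [13_000]))  # True
```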
Conclusion: mechanism is correct but migration benefit requires higher
concurrency (scale-out or >3x QPS) where queue pressure makes the signal
non-trivial. At single-node 8-instance scale, Pillar 1 (affinity routing)
is sufficient and Pillar 2 gracefully degrades to no-op.
Next: scale-out validation (16+ GPU) where session skew naturally
concentrates load and triggers migration.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
gen_synthetic_trace.py --mode regular: maximally-regular multi-turn trace
(fixed prefix/delta/turns, constant arrivals, zero session skew) to isolate
the structural PD cost (per-turn full-context transfer + P/D capacity split)
from the skew/hot-pin artifact.
analysis/crossover/: SLO-goodput PD_advantage sweeps bracketing the
prefill<->decode bottleneck axis (D1 grow input -> prefill-bound; D2 grow
output -> decode-bound). figs/crossover_pd_advantage.png shows the crossover
(y=1) with the agentic operating region annotated.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
--policy leastwork_kappa + --kappa (default 2.5e-6, derived from KV ~100KB/tok
/ HBM 4TB/s / TPOT 10ms on H20+Qwen3-30B-A3B): score = prefill_work * (1 +
kappa * ongoing_decode_tokens), modelling decode as a fractional throughput tax
on a new prefill.
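
The kappa derivation and the score can be sketched as follows. This is an illustration of the formula as stated, not the proxy's code; the function names and the example inputs are hypothetical:

```python
def derive_kappa(kv_bytes_per_token=100e3, hbm_bytes_per_s=4e12, tpot_s=10e-3):
    """Fraction of one decode step spent reading one token's KV from HBM."""
    return kv_bytes_per_token / hbm_bytes_per_s / tpot_s


def leastwork_kappa_score(prefill_work, ongoing_decode_tokens, kappa=2.5e-6):
    """score = prefill_work * (1 + kappa * ongoing_decode_tokens)."""
    return prefill_work * (1 + kappa * ongoing_decode_tokens)


print(f"kappa = {derive_kappa():.2e}")

# Hypothetical instance: 8192 tokens of prefill work, 100k decode tokens in
# flight. The decode term inflates the score by 25%.
print(leastwork_kappa_score(8192, 100_000))
```

With kappa this small, the decode term only matters once an instance holds tens of thousands of in-flight decode tokens.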
Result on the 600s trace: NET-NEGATIVE vs plain leastwork — TTFT p90 +18%,
E2E p90 +14%, balance 1.55x->1.97x, and it does NOT fix the E2E-p99 it targeted.
Decode is too cheap in agentic (output p50~80) for the term to help; it just
bounces heavy reqs off their cache-owner into cold re-prefill. The E2E-p99 tail
is the structural HEAVY+>50k floor (per-class p99 ~51-52k for ALL policies), not
decode interference. Kept in-tree as a documented ablation justifying LPWL's
omission of any decode term; do not revive without a decode-heavy regime.
See analysis/lpwl_5policy_600s.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
traces/w600_r0.0015_st30_first600s.jsonl: first-600s cut of the shipped w600
trace (807 reqs, 274 sessions, all turn-1s + early later-turns; theoretical
APC ceiling ~70% vs 80% full). Faster iteration (~18 min/arm) but a colder,
lower-locality regime; whitelisted alongside the parent anonymized trace.
analysis/lpwl_5policy_600s.md: LPWL vs LMetric/sticky/unified/unified+A+B on
the 600s trace (dash1 8xH20, cold APC, n=1). LPWL is overall best with zero
knobs — TTFT p90 7983ms vs tuned A+B 11562 (-31%), E2E p90 -16%, best request
balance; APC 0.648 (emergent affinity, far above LMetric 0.507); only loss is
E2E p99 from heavy-class decode concentration. Demonstrates anti-overfit: A+B
was tuned on full w600 yet is beaten by the knob-free policy on this regime.
Includes the run_5policy_600s.sh repro driver.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Direct per-producer KV-pool evidence for the session-affinity backfire.
At the same 4P+4D ratio:
- round-robin: 4 producers within 1pp of each other (spread 0pp, CV 0.01)
- session-affinity: spread 49pp (one producer ~93%, another 45%; CV 0.25)
A 25x jump in producer load imbalance — heavy multi-turn sessions
concentrate onto single producers, the same hot-pinning pathology as
sticky routing in the colocated §3.3 study.
plot_producer_hotspot.py: reduce (numpy, per-producer KV timeline from
snapshots, runs on the serving host) + plot (matplotlib, 2-panel rr vs
session comparison) — same two-stage pattern as aggregate_mb5.py.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Configurable KV working-set analyzer (GPU model x TP/PP/EP x model
config.json with MLA/GQA auto x KV/weight dtype). Computes Denning W(T),
oracle [first,last], and retain-forever footprints vs a per-replica KV
pool, plus the APC captured at each retention window.
GLM-5.1-FP8 (MLA, 43.9 KiB/token) on 1x B300 node (1528 GB KV pool):
live KV fits trivially (~533 GB), but the full 80.4% APC ceiling needs
~14 nodes (oracle) -> long-tail reuse motivates DRAM offload, not HBM.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
aggregate_mb5.py:
- Split the cluster KV timeline by role (P-pool vs D-pool) using a
PID->role map parsed from vllm_logs filenames. The cluster average
hid the result — 6P+2D/4P+4D look ~45% utilized but the decode pool
is actually pegged at ~100% while prefill idles at ~30%.
- Two-stage reduce/plot: --reduce-to (numpy-only, runs on the serving
host over multi-GB snapshot dirs) dumps a compact JSON; --from-reduced
(matplotlib) renders locally. matplotlib import is now lazy.
- New plot_role_split figure + p/d peak/steady columns in the CSV.
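
A minimal sketch of the two-stage reduce/plot split with a lazy matplotlib import. It is not aggregate_mb5.py itself: the reduce step here is pure stdlib rather than numpy, the sample data is hypothetical, and only the `--reduce-to` / `--from-reduced` flag names come from the list above:

```python
import argparse
import json


def reduce_snapshots(samples):
    """Collapse (role, utilization) samples into per-role mean utilization."""
    sums, counts = {}, {}
    for role, util in samples:
        sums[role] = sums.get(role, 0.0) + util
        counts[role] = counts.get(role, 0) + 1
    return {role: sums[role] / counts[role] for role in sums}


def plot_reduced(reduced, out_png):
    # Lazy import: the reduce stage never needs matplotlib on the serving host.
    import matplotlib
    matplotlib.use("Agg")
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bar(list(reduced), list(reduced.values()))
    ax.set_ylabel("mean KV utilization")
    fig.savefig(out_png)


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--reduce-to")
    parser.add_argument("--from-reduced")
    parser.add_argument("--out", default="role_split.png")  # hypothetical flag
    args = parser.parse_args(argv)
    if args.reduce_to:
        # Hypothetical samples standing in for the multi-GB snapshot dirs.
        samples = [("P", 0.30), ("P", 0.30), ("D", 1.00), ("D", 1.00)]
        with open(args.reduce_to, "w") as f:
            json.dump(reduce_snapshots(samples), f)
    elif args.from_reduced:
        with open(args.from_reduced) as f:
            plot_reduced(json.load(f), args.out)
```

Calling `main(["--reduce-to", "reduced.json"])` on the serving host and `main(["--from-reduced", "reduced.json"])` locally mirrors the split: only the small JSON file crosses hosts.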
PD_DISAGG_RESULTS.md: consolidated writeup with figures inline.
Verdict: no static P:D ratio beats 8C colocation. The binding
constraint moves with the ratio (D-pool saturates at 6P+2D/4P+4D,
P-pool jams at 2P+6D -> 91% request loss); 8C's shared pool stays
elastic at 34% steady, 100% completion. PD wins TPOT (10-35x cleaner,
the MB1 phase-isolation benefit is real) but loses TTFT and sheds
load. Round-robin P routing also zeroes prefix-cache reuse; a
session-affinity re-run of 6P+2D is in flight to test the fix.
Figures (rep1): mb5_kv_timeline, mb5_role_split, mb5_peak_utilization,
mb5_latency_compare + mb5_summary.csv.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The §3.2 cost-vs-benefit math in commits 029821c (MB1 plot +
pd_cost_vs_benefit.png) and abde010 (RESULTS_SUMMARY.md) was wrong.
What was wrong:
I framed PD-disagg's max phase-isolation benefit as "≤ decode duration
of the new request (~50–200 ms)" — implicitly treating the benefit as
per-request and bounded by that request's own decode. The correct
accounting is per-prefill-event across all stalled streams:
benefit_per_prefill = D × T_prefill × (1 − TPOT_baseline/TPOT_during)
≈ D × T_prefill
which follows from the chunked-prefill math (each of L/N chunks slows
D ongoing decode steps from ~10 ms to t ms, summing to D × T_prefill).
Plug MB1 + MB2 numbers in:
prefill size | T_prefill | T_transfer | D=8 benefit | cost/benefit
2k tok | 0.14 s | 8 ms | 1.1 s | 0.7 %
33k tok | 4.5 s | 320 ms | 36 s | 0.9 %
125k tok | 57 s | 1.9 s | 456 s | 0.4 %
On the phase-isolation axis alone, PD-disagg WINS by 100×–250× — the
opposite of what the deleted figure showed.
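
The table rows can be re-derived with a few lines of arithmetic. This sketch uses the approximation benefit ≈ D × T_prefill (dropping the 1 − TPOT_baseline/TPOT_during factor); the function name is hypothetical:

```python
def cost_over_benefit(t_prefill_s, t_transfer_s, d_streams=8):
    """KV-transfer cost as a fraction of the aggregate decode stall avoided."""
    benefit_s = d_streams * t_prefill_s  # D stalled streams x prefill duration
    return t_transfer_s / benefit_s


# (label, T_prefill in s, T_transfer in s) taken from the table rows.
rows = [("2k tok", 0.14, 0.008), ("33k tok", 4.5, 0.320), ("125k tok", 57.0, 1.9)]
for label, t_prefill, t_transfer in rows:
    ratio = cost_over_benefit(t_prefill, t_transfer)
    print(f"{label}: benefit {8 * t_prefill:.1f} s, cost/benefit {ratio:.1%}")
```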
The actual dominant reason static PD-disagg fails in agentic is the
D-side KV pool capacity wall (figs/f4b_pdsep_kv_wall.png) — p99
single-request KV is 11.5 GiB, per-D-instance pool is 38 GiB, so 4P+4D
halves system decode capacity. A colleague's 4P+4D experiment showed
TTFT p50 62× worse and success rate 99.5% → 52%, driven by pool
overflow + queueing, not by transfer latency.
Changes (all touched files explicitly listed; no `git add -u`):
- figs/pd_cost_vs_benefit.png : DELETED (figure built on wrong math)
- microbench/fresh_setup/plot_mb1.py : drop the pd_cost_vs_benefit
function; keep mb1_interference.png and update its title to note
per-prefill aggregate stall = D × T_prefill (not capped by decode)
- figs/mb1_interference.png : regenerated, no misleading band annotation
- analysis/mb1/README.md : Summary block rewritten ("what MB1 measures";
no more "max benefit = decode duration" claim); §3.2 implications
section replaced with the corrected per-prefill-event table; explicit
⚠ Correction note documents what was wrong
- analysis/mb2/README.md : Summary block + §3.2 implications section
rewritten the same way; ⚠ Correction note links to RESULTS_SUMMARY §4
- RESULTS_SUMMARY.md §4 + §6 : §4 reordered to lead with the D-side
capacity argument (the real failure mode), MB1/MB2 demoted from
"kill-shot for PD-disagg" to "supporting context inputs to a
cost-benefit table that actually favors PD-disagg on this axis";
§6 paper-claims list reordered to remove the wrong "PD-disagg loses
on cost-vs-benefit" claim and replace with the corrected ones
PAPER_OUTLINE.md and MEETING.md were checked and never picked up this
specific wrong claim — they already (correctly) frame §3.2 around the
D-side KV memory wall.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Single-GPU bench on dash1 GPU 0 (vanilla vLLM 0.18.1, chunked-prefill on,
no kv_connector). 3 decode batch sizes × 5 prefill sizes × 3 reps.
Method recap (driver: microbench/interference/driver.py, repurposed):
- Pin D streaming decode requests at constant max_tokens
- Inject one prefill-only request (max_tokens=1) of varying input length
- Bin decode-stream token timestamps into "during prefill" vs baseline
- Headline metric: effective per-stream TPOT during the prefill burst,
= prefill_ttft / (num_tokens_during_prefill / D). This is the average
rate at which each decode stream produces tokens during the burst.
p50 of inter-token intervals is deceptive (chunked-prefill makes most
intervals look normal); the burst-average gives the true cost.
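
The headline metric can be sketched as below, assuming per-token timestamps pooled across all D decode streams. The function names and the example numbers are hypothetical, not driver.py's actual interface:

```python
def tokens_during(timestamps_ms, start_ms, end_ms):
    """Count decode tokens (pooled over all streams) emitted inside the burst."""
    return sum(start_ms <= t < end_ms for t in timestamps_ms)


def effective_tpot_ms(prefill_ttft_ms, num_tokens_during_prefill, d_streams):
    """prefill_ttft / (num_tokens_during_prefill / D), in ms per token per stream."""
    tokens_per_stream = num_tokens_during_prefill / d_streams
    return prefill_ttft_ms / tokens_per_stream


# Hypothetical burst: a 4520 ms prefill during which 8 streams emit 93 tokens
# in total, i.e. about 11.6 tokens per stream.
print(f"{effective_tpot_ms(4520, 93, 8):.1f} ms per token per stream")
```

Because the numerator is the whole burst duration, a handful of long stalls dominates the result even when most inter-token intervals look normal.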
Results (D=8 row, the most agentic-realistic case):
P (tokens) | prefill_ttft | per-stream TPOT during | penalty
2048 | 143 ms | 32 ms | 4×
8192 | 583 ms | 114 ms | 15×
32768 | 4520 ms | 388 ms | 52×
65536 | 15615 ms | 757 ms | 99×
131072 | 56991 ms | 1419 ms | 183×
Baseline TPOT at D=8: ~7.7 ms. So during a 131k-token prefill burst
each ongoing decode is running ~183× slower (i.e. essentially halted)
for ~57 seconds.
§3.2 implication: PD-disagg's promised phase-isolation benefit per
agentic request is bounded by the decode duration, which is 50–200 ms
for tool-call output. MB2 says the KV-transfer cost of PD-disagg
is 300 ms – 10 s for agentic-size requests. Cost > benefit for every
KV size above ~80 MiB (well below trace mean 192 MiB).
The new figs/pd_cost_vs_benefit.png overlays MB1 benefit ceiling
(50–200 ms band, capped by decode) onto MB2 transfer cost curve and
marks the agentic-distribution waypoints (trace mean, p90, p95, p99)
on the x-axis. Across the entire agentic distribution, the cost curve
sits above the benefit band.
Adds:
- microbench/fresh_setup/mb1_launch.sh: single-GPU vLLM launcher (no
kv_connector, default chunked_prefill=on, max_num_batched_tokens=8192)
- microbench/fresh_setup/mb1_driver.py: copy of the existing
microbench/interference/driver.py for cpfs deployment
- microbench/fresh_setup/analyze_mb1.py: aggregator emitting
per-(D, P) effective-TPOT-during + max PD-disagg-benefit table
- microbench/fresh_setup/plot_mb1.py: mb1 standalone +
pd_cost_vs_benefit headline figure
- analysis/mb1/summary.csv: 45 raw rows from the sweep
- analysis/mb1/breakdown.json: per-(D, P) aggregate
- analysis/mb1/README.md: persistent doc
- figs/mb1_interference.png: effective TPOT during prefill, one line per D
- figs/pd_cost_vs_benefit.png: §3.2 headline (cost > benefit everywhere)
Caveats noted in README:
- chunk_tokens=8192 only; Sarathi-Serve's smaller chunks would
interleave decode more aggressively. Chunk-size sensitivity is
flagged as next run.
- D ≤ 8; higher D may saturate or shrink the penalty further.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sweep on dash1 GPU 0 → dash2 GPU 0 over 200 Gbps RoCE.
remote_bootstrap_addr=http://172.27.123.142:8998. Same 9-size × 5-rep
config as the 2026-05-27 intra-node run.
Per-size pure_transfer (p50) lines up within 4% of the intra-node
numbers at every size except 64k tok (~8.5% lower inter-node, in the
bimodal regime):
  size        intra p50    inter p50
  512 tok        5.3 ms       5.2 ms
  2048 tok      20.6 ms      20.0 ms
  8192 tok      83.7 ms      80.9 ms
  32k tok      320.9 ms     309.6 ms
  64k tok       1895 ms      1734 ms   (bimodal in both)
  128k tok      2835 ms      2818 ms   (bimodal in both)
=> Mooncake's batch_transfer_sync_write **does not use NVLink** for
intra-node peers; both paths go through the 200 Gbps RDMA NIC, so the NIC
(not the GPU interconnect) is the bottleneck. The
~9.7 GB/s steady-state ceiling and the 6+ GiB variance regime are
identical across topologies.
Operational implication for §3.2: PD-disaggregation does not get
cheaper by co-locating P and D on the same node — every routed request
pays the same ~10 GB/s ceiling for KV transfer, no matter where it
lands. Topology cannot buy back any of the transfer cost.
Caveat: B's receive_kv events did not log on dash2 — `MB2_LOG_DIR`
env var did not propagate through vLLM's EngineCore subprocess on
the consumer host (cat /proc/$ENGINE_PID/environ is empty on dash2
for that var, but the producer host on dash1 worked). For this run
pure_transfer numbers are from A's send_blocks alone; full rx_total
breakdown is not available, but pure_transfer is the dominant term.
Adds:
- analyze_mb2_send_only.py — analyzer that works from A's send_blocks
alone when B's receive_kv events are absent
- plot_mb2_compare.py — overlay intra vs inter on the same axes
- plot_mb2.py — tolerate the `rows`-less send-only schema
- figs/mb2_transfer_{time,bw}_inter.png — inter-node single-curve
- figs/mb2_transfer_{time,bw}_compare.png — intra vs inter overlay
- analysis/mb2/A_inter_kvboth.jsonl, inter_kvboth_client.json,
inter_kvboth_breakdown.json
- analysis/mb2/README.md — Summary block updated to reference both
paths, dated 2026-05-27 run-log entry appended with the full table
and the topology-independence framing
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Lifts the MB2 intra-node results out of commit messages into a single
place the paper can cite. Structure:
  Summary                    -- one-line table + headline numbers for §3.2
  Setup                      -- exact hardware/software/config
  Method                     -- 3-step bench, instrumentation, pair-by-time-window
  Results                    -- full per-size table (latest run dated)
  Known limitations          -- kv_both vs strict, serial-only, intra-only,
                                sanity preamble in the logs
  §3.2 implications          -- transfer/decode ratio table at agentic sizes
  Open questions / next runs -- inter-node, bandwidth-ceiling investigation,
                                concurrent transfers, strict
                                kv_producer/consumer
  Reproduction               -- exact commands
  Run log                    -- dated entries; new runs append here
The latest "intra-node" entry references `de164e5` for the raw
artifacts + figures.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Full sweep result on dash1 GPU 0+1 with vanilla vLLM 0.18.1 +
mooncake-transfer-engine 0.3.11, kv_both connector. Per-stage decomposition
via the instrumentation patch (analyze_mb2.py pairs A's send_blocks with
B's receive_kv enter/finish by time window).
Steady-state (1k..32k tokens, 96 MiB..3 GiB KV):
pure_transfer ≈ size / 9.7 GB/s
rx_overhead ≈ 2–3 ms (ZMQ handshake + P-side setup)
bandwidth ≈ 9.6–10.1 GB/s, very stable
Large-size regime (65k..131k tokens, 6..12 GiB):
p50 bandwidth collapses to 3.4–4.5 GB/s
max bandwidth still hits ~9.7 GB/s (some runs achieve it)
p99 agentic request (11.5 GiB) lands here
Implication for §3.2 PD-disaggregation cost argument:
median agentic decode = 50–200 ms (tool-call JSON output)
median agentic-tail KV transfer (p99 11.5 GiB):
best case (9.7 GB/s) ≈ 1.19 s
observed range 1.5 – 10 s
⇒ KV transfer is 8–100× larger than the decode it enables.
This is intra-node — the lower-bound transfer cost. Inter-node RDMA
will be slower; that's MB2 phase 2.
Adds:
- analyze_mb2.py: pair A.send_blocks ↔ B.receive_kv by time window;
per-size aggregation (n, ms_p50, ms_min/max, GB/s_p50/max)
- plot_mb2.py: log-log transfer-time chart + bandwidth-vs-size chart
- analysis/mb2/A_intra_kvboth.jsonl, B_intra_kvboth.jsonl: raw events
(51 + 102 events including the sanity preamble)
- analysis/mb2/intra_kvboth_breakdown.json: paired and aggregated
- figs/mb2_transfer_time_intra.png, figs/mb2_transfer_bw_intra.png
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
User-requested comparison of inter-turn external gap distribution between
the production agentic trace (Qwen3-Coder) and a production chatbot trace
(qwen3-max chat). Both computed as
T_external = next_turn.start_ms - prev_turn.end_ms
on the same kind of pipeline (raw input + raw output join on request_id,
session structure from the formatted trace's parent_chat_id chains).
The chatbot trace lives as two files on dash0:
input : bailian-trace/qwen-trace-260321-260327/qwen3-max-input-032309-032311.jsonl
output : bailian-trace/qwen-trace-260321-260327/qwen3-max-output-032109-032711.jsonl
The raw input has no session_id (uuid is per-record, user_id has only 4
distinct tenant values for 346 k requests). We recover session structure
from the formatted file (qwen_chat_blksz_64_032309-032311.jsonl, which
groups requests by parent_chat_id), matching each formatted record to a
raw record by (timestamp, output_length) — prompt_token_num is anonymized
to 0 in this trace, so we use generate_token_num as the join key.
End time is derived from time_to_finish_token (ms duration) not the "time"
string field (which is the log-write time, not request completion).
Numbers (chatbot, 42 228 inter-turn gaps over 32 262 multi-turn sessions):
p25 4.85 s p50 7.18 s p75 8.22 s p90 15.0 s p99 43 s
4% gaps < 1 s 29% < 5 s 78% < 10 s 98% < 30 s
Compare to agentic (same metric, scripts/compute_inter_turn_gap_remote.py):
p25 0.69 s p50 1.6 s p75 8.6 s p90 44 s p99 738 s
39% gaps < 1 s 67% < 5 s 77% < 10 s 87% < 30 s
Distributions differ in shape, not just location:
- Chatbot is tight, unimodal around 5–10 s (human interaction).
- Agentic is bimodal: a sub-second autonomous tool-call mode (39 % < 1 s)
plus a long-pause tail (13 % > 30 s, p99 = 738 s) for sessions where
the operator steps away.
- The sub-second tool-call mass is where dispatch coupling lives —
those turns have W_turn ≫ T_external for any current scheduler.
The earlier "chatbot has T_human ≈ 30 s" hand-wave was wrong empirically.
The right framing for §2.3 is "agentic has a sub-second tool-call mode
that chatbot doesn't", not "chatbot has think-time and agentic doesn't".
Adds:
- scripts/compute_inter_turn_gap_chatbot.py: dash0-side aggregator
(raw input/output join + formatted alignment by ts + output_length)
- analysis/characterization/data/chatbot_inter_turn_gap.json: CDF cache
- scripts/plot_inter_turn_gap.py: overlays both curves on log-x
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The earlier conversation suggested agentic might "have no human think-time"
and therefore live in a strict closed-loop regime. The user pushed back:
tool calls also take time and might restore a chatbot-like buffer between
turns. To resolve this, we go to the actual data.
The previously-published per-record formatted trace only carries arrival
timestamps, so an arrival-to-arrival diff conflates W_turn + T_external.
The raw trace (/home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/
051315-051317-raw.jsonl on dash0) additionally carries request_end_time_ms,
which lets us compute the pure inter-turn external gap
T_external = next.request_ready_time_ms - prev.request_end_time_ms
for each session's consecutive turn pair.
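
A minimal sketch of the gap computation, assuming each raw record is a dict carrying `request_ready_time_ms` and `request_end_time_ms` as named above. The `session_id` key and the sample records are hypothetical:

```python
from collections import defaultdict


def external_gaps_ms(records):
    """T_external for every consecutive turn pair within each session."""
    by_session = defaultdict(list)
    for rec in records:
        by_session[rec["session_id"]].append(rec)

    gaps = []
    for turns in by_session.values():
        turns.sort(key=lambda rec: rec["request_ready_time_ms"])
        for prev, nxt in zip(turns, turns[1:]):
            # Gap from the end of the previous turn to readiness of the next;
            # the previous turn's own service time (W_turn) is excluded.
            gaps.append(nxt["request_ready_time_ms"] - prev["request_end_time_ms"])
    return gaps


records = [
    {"session_id": "a", "request_ready_time_ms": 0, "request_end_time_ms": 4000},
    {"session_id": "a", "request_ready_time_ms": 4700, "request_end_time_ms": 9000},
    {"session_id": "a", "request_ready_time_ms": 10600, "request_end_time_ms": 12000},
    {"session_id": "b", "request_ready_time_ms": 100, "request_end_time_ms": 900},
]
print(external_gaps_ms(records))  # [700, 1600]
```

An arrival-to-arrival diff on the same records would give 4700 and 5900 ms, which is the conflation with W_turn that this metric removes.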
Headline numbers (n = 783 k inter-turn gaps over 127 k multi-turn sessions):
p25 = 0.69 s
p50 = 1.6 s
p75 = 8.6 s
p90 = 44 s
mean = 37 s (heavy long-tail; paused/abandoned sessions)
39 % of gaps < 1 s
67 % of gaps < 5 s
87 % of gaps < 30 s
The bulk of the distribution is dominated by sub-second to a-few-seconds
tool-call latencies. Under any current scheduler (e.g. unified TTFT p90 =
7.3 s, lmetric 15.7 s), W_turn is already near or above the 75th percentile
of T_external, so dispatch coupling is the dominant regime for the
majority of turns — not a corner case.
This corrects the earlier conflated arrival-to-arrival "median gap 11 s"
figure (which folded W_turn into T_external). The true T_external median
is 1.6 s.
Adds:
- scripts/compute_inter_turn_gap_remote.py: dash0-side aggregator
- analysis/characterization/data/agentic_inter_turn_gap.json: 500-point
CDF cache + summary stats, scp'd back from dash0
- scripts/plot_inter_turn_gap.py: local figure renderer
- figs/f3a_inter_turn_gap.png: log-x CDF with p25/p50/p75/p90 anchors and
unified/lmetric TTFT p90 reference lines
Next step (per user): pull a chatbot trace through the same pipeline and
compare distributions side by side; this will let §2.3 stop hand-waving
about "no think-time" and instead present the regime split empirically.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Old f2c plotted per-request KV footprint MiB against an "H20 ~95 GiB
usable" reference line. That ceiling was wrong — a 30B-A3B bf16
deployment burns roughly:
~50% HBM for model params (~48 GiB on 96 GiB H20)
~10% for runtime activation buffers
~40% left for the KV cache pool (~38.4 GiB)
so 95 GiB was overstating the available pool by 2.5×.
New f2c reframes the same data into the answer that actually motivates
the paper: how many concurrent decodes does a single instance hold,
and how does PD-disagg change that? Grouped bars per percentile show
system-wide concurrent decode capacity for three 8-GPU deployments:
Combined 8C, PD-disagg 4P+4D (N_D=4), PD-disagg 6P+2D (N_D=2)
Key reads off the figure:
p50 (1.8 GiB/req): 20 fit/inst → 160 / 80 / 40 system-wide
p90 (8.0 GiB/req): 4 fit/inst → 32 / 16 / 8
p95 (9.6 GiB/req): 4 fit/inst → 32 / 16 / 8
p99 (11.5 GiB/req): 3 fit/inst → 24 / 12 / 6
PD-disagg 4P+4D literally halves the decode population at the same
per-request KV pressure — this is the concrete §3.2 "KV memory wall"
penalty stated in terms users care about (concurrency).
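
The capacity arithmetic can be reproduced as below, assuming the ~38.4 GiB pool and the floor(KV_pool / req_size) × N_D rule. Function names are hypothetical, and the p50 row is left out because its displayed request size is rounded:

```python
import math

KV_POOL_GIB = 38.4  # ~40% of a 96 GiB H20, per the budget above


def fit_per_instance(req_gib, pool_gib=KV_POOL_GIB):
    """How many requests of this KV size one instance's pool can hold."""
    return math.floor(pool_gib / req_gib)


def system_capacity(req_gib, n_decode_instances):
    """Concurrent decodes across all instances that can host decode."""
    return fit_per_instance(req_gib) * n_decode_instances


# N_D = 8 for combined 8C, 4 for 4P+4D, 2 for 6P+2D.
for label, req_gib in [("p90", 8.0), ("p95", 9.6), ("p99", 11.5)]:
    caps = [system_capacity(req_gib, n) for n in (8, 4, 2)]
    print(label, fit_per_instance(req_gib), caps)
```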
- analysis/characterization/render_window1_figures.py:
fig_kv_footprint_cdf rewritten; reads same kv_footprint_summary.json
but computes floor(KV_pool / req_size) × N_D and annotates the
per-instance fit count below each percentile group.
- figs/f2c_kv_footprint_cdf.png: regenerated.
- MEETING.md / PAPER_OUTLINE.md §2.1, §2.4: prose updated with the
new ceiling and the "3 p99 decodes per instance / halved by PD-disagg"
framing.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The headline f6_e2e_latency_bars only shows p90, hiding three regimes:
- mean: unified dominates (3.3s TTFT, 7.0s E2E vs sticky 5.6s / 12.1s)
- p50: sticky and unified are tied on first-turn TTFT (0.5s each) —
sticky's first turn of each session is free, after which queues
accumulate. Unified beats sticky everywhere else.
- p99: tail amplification reveals unified's biggest gap —
TTFT 42.3s vs sticky 74.1s; E2E 68.8s vs sticky 139.7s.
The 12-panel figure is the honest full picture; the 3-panel headline
stays for slide-friendly summary.
- analysis/characterization/window_1_results/raw_stats/{policy}.json:
cached ttft/tpot/e2e {mean,p50,p90,p99} pulled from dash0
/home/admin/cpfs/wjh/agentic-kv/outputs/b3_sweep_20260525_095043/
(b3_policy_comparison.json doesn't record mean, only percentiles).
- analysis/characterization/render_window1_figures.py:
new fig_b3_latency_full_grid renders the 4×3 grid from the cache.
- figs/f6_e2e_latency_full_grid.png: 12-panel companion.
- PAPER_OUTLINE.md §5.2: both figures embedded; main table column
renamed from "Hotspot idx" to "Worker p90 (median / max)" to match
the new metric convention.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The max/median ratio inverts the actual user-facing p90 ranking:
sticky: hotspot=2.73 but system e2e p90 = 34.6s (worst)
unified: hotspot=3.67 but system e2e p90 = 18.0s (best)
because sticky's median is also high (everyone slow) while unified
concentrates the damage on one worker and keeps the other 7 fast.
Any "imbalance" metric structurally punishes the affinity-then-escape
schemes that we actually want to advocate for.
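
A toy illustration of how the ratio can invert the ranking. The per-worker p90 values below are hypothetical, not the measured data, and `summarize` is an illustrative helper:

```python
from statistics import median


def summarize(per_worker_p90_s):
    """Return the (median, max) pair plus the retired max/median ratio."""
    med, worst = median(per_worker_p90_s), max(per_worker_p90_s)
    return {"median": med, "max": worst, "ratio": worst / med}


# Hypothetical per-worker TTFT p90 values in seconds.
everyone_slow = [20, 22, 24, 25, 26, 28, 30, 60]  # high median, lower ratio
one_hot_worker = [4, 4, 5, 5, 5, 6, 6, 20]        # low median, higher ratio

print(summarize(everyone_slow))
print(summarize(one_hot_worker))
```

In this toy fleet the second configuration is faster on every worker, including its worst one, yet scores the higher ratio; the (median, max) pair keeps both facts visible.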
Changes:
- analysis/characterization/render_window1_figures.py:
fig_b3_per_worker_ttft now annotates each subplot with
"median X.Xs · max Y.Ys" instead of "hotspot=Y.YY"; docstring
documents why we drop the ratio.
- figs/f4c_per_worker_ttft.png: regenerated with new titles.
- figs/f4c_apc_vs_hotspot_tradeoff.png: deleted. The scatter's y-axis
was the deprecated ratio; superseded by f4c per-worker bars + f6
e2e bars which together carry the same information honestly.
- PAPER_OUTLINE.md: C3, §3.3, §4.1 wording, §5 metric list, §8
conclusion — replace "hotspot index" mentions with
"worst-worker p90" or "(median, max) worker p90"; promote the
§3.3 methodology note to a top-level sub-finding ("hot pin
failure must be measured with per-worker absolute latency,
not normalized ratio").
- MEETING.md: §3.3 narrative reworded to lead with the (median, max)
pair directly; explicit one-line note on why the ratio is dropped.
Conceptual uses of "hot session" / "hot instance" / "hot pin" remain
unchanged — only the *metric* called hotspot index is retired.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
'capped' is not a routing policy — it's lmetric run on a separately
truncated trace (sessions capped to 8 turns via build_capped_trace.py).
Putting it alongside lmetric/load_only/sticky/unified in per-policy
comparison figures is misleading because the workload differs, not
the routing decision. Comparing apples to a different-trace orange
inflates/deflates apparent policy gaps for the wrong reasons.
Regenerated 4 figures with --exclude-policies capped on
analysis/characterization/render_window1_figures.py:
- f4a_apc_loss.png (APC bars)
- f4c_apc_vs_hotspot_tradeoff.png (APC vs hotspot scatter)
- f4c_per_worker_ttft.png (per-worker TTFT panel)
- f6_e2e_latency_bars.png (TTFT/TPOT/E2E bars)
Added --exclude-policies CLI flag to the renderer so this is a
reversible choice, not a permanent script mutation. capped data remains
in b3_policy_comparison.json and can be brought back in workload-
sensitivity sections (where it actually belongs) by omitting the flag.
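
A minimal sketch of such a flag, assuming it takes space-separated policy names. The parsing details and the `select_policies` helper are assumptions, not the renderer's actual code:

```python
import argparse

ALL_POLICIES = ["lmetric", "load_only", "sticky", "unified", "capped"]


def select_policies(argv=None):
    """Return the policies to plot after applying --exclude-policies."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--exclude-policies", nargs="*", default=[])
    args = parser.parse_args(argv)
    excluded = set(args.exclude_policies)
    return [p for p in ALL_POLICIES if p not in excluded]


print(select_policies(["--exclude-policies", "capped"]))
print(select_policies([]))  # omitting the flag brings capped back
```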
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pulls 456 (rank%, cum%) sample points from the raw production trace at
dash0:/home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl,
cached locally so the figure is reproducible without ssh access. Sampled
anchors match the precomputed summary exactly:
top 1% = 46.5%, top 5% = 66.5%, top 10% = 74.6%
plus newly readable points:
top 25% = 87.5%, top 50% = 96.0%
Workload characterization is now consistent with the production
distribution rather than the small replay subset. Replay window CDF kept
as an overlay to show the same hockey-stick shape on the data §5 actually
uses.
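
A sketch of how (rank%, cum%) points and top-k% anchors can be computed from per-session load. What "load" counts (requests or tokens) is an assumption here, and the sample values and function names are hypothetical:

```python
def skew_cdf(per_session_load):
    """(rank%, cum%) points with sessions sorted from heaviest to lightest."""
    ordered = sorted(per_session_load, reverse=True)
    total, running, points = sum(ordered), 0, []
    for rank, load in enumerate(ordered, start=1):
        running += load
        points.append((100 * rank / len(ordered), 100 * running / total))
    return points


def top_share(per_session_load, top_fraction):
    """Share of total load carried by the heaviest top_fraction of sessions."""
    ordered = sorted(per_session_load, reverse=True)
    k = max(1, round(len(ordered) * top_fraction))
    return sum(ordered[:k]) / sum(ordered)


loads = [50, 20, 10, 5, 5, 4, 3, 1, 1, 1]  # hypothetical per-session load
print(skew_cdf(loads)[0])      # (10.0, 50.0)
print(top_share(loads, 0.10))  # 0.5
```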
- analysis/characterization/data/production_session_skew_cdf.json: cached
sample points (29 KB), so the figure rebuilds locally
- scripts/plot_session_skew_cdf.py: now plots from the cache + replay raw
- MEETING.md / PAPER_OUTLINE.md: revert numbers to production trace,
add top-25%/50% data points
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds unified_nixl_both to elastic_migration_v2: same picker as
unified_kv_both (never triggers PD-sep), but launches vLLM with
NixlConnector instead of MooncakeConnector. Compared against plain
unified and unified_kv_both (Mooncake) we can now attribute the
substrate overhead between "v1 connector framework irreducible
cost" (proxied by the leaner NIXL) and "Mooncake implementation
extra" (Mooncake - NIXL).
Result (vs plain unified, both substrates never PD-sep):
  metric         plain     NIXL      Mooncake
  TTFT p90       7.35s     +37.9%    +45.3%    (NIXL: +7pp better)
  TPOT p90       17.1ms    +15.5%    +24.5%    (NIXL: +9pp better)
  E2E p90        18.03s    +17.4%    +27.0%    (NIXL: +10pp better)
  hotspot        3.667     +0.2%     +19.0%    (NIXL: keeps it flat)
  APC            79.4%     -0.3pp    -1.1pp
  interference   -         5.58      8.57      (NIXL: ~35% lower)
The cleanest signal is hotspot: NIXL preserves plain-unified's
distribution (3.674 vs 3.667), while Mooncake's per-scheduler-step
O(|cache|) `set(self._block_pool.cache.keys())` diff against
_known_hash_keys (mooncake_connector.py:432-456) inflates routing
imbalance by 19%. The hash sync runs unconditionally even when no
direct_read consumer is present.
Attribution: NIXL-plain ~= v1 framework irreducible cost (kv_buffer
GPU memory, per-step SchedulerOutput.kv_connector_metadata
round-trip, altered kv_cache_manager block-lifecycle). Mooncake-NIXL
~= Mooncake-specific overhead (the hash-sync loop and stricter
delay_free semantics).
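The attribution above is a per-metric subtraction. A minimal sketch using
the percentages from the result table; the dictionary layout and the
function name are illustrative and are not taken from render_figures.py:

```python
# Overhead vs plain unified, in percent, copied from the result table.
OVERHEAD_PCT = {
    "TTFT p90": {"nixl": 37.9, "mooncake": 45.3},
    "TPOT p90": {"nixl": 15.5, "mooncake": 24.5},
    "E2E p90": {"nixl": 17.4, "mooncake": 27.0},
    "hotspot": {"nixl": 0.2, "mooncake": 19.0},
}


def attribute(overhead_pct):
    """Split each metric's Mooncake overhead into framework and Mooncake-extra."""
    out = {}
    for metric, row in overhead_pct.items():
        framework = row["nixl"]                # NIXL - plain: proxy for framework cost
        extra = row["mooncake"] - row["nixl"]  # Mooncake - NIXL: Mooncake-specific
        out[metric] = {
            "framework_pp": round(framework, 1),
            "mooncake_extra_pp": round(extra, 1),
        }
    return out


if __name__ == "__main__":
    for metric, parts in attribute(OVERHEAD_PCT).items():
        print(f"{metric:9s} framework={parts['framework_pp']:+.1f}pp "
              f"mooncake_extra={parts['mooncake_extra_pp']:+.1f}pp")
```

On these numbers the framework share dominates every latency metric,
while hotspot is almost entirely Mooncake-extra.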
Practical implication: NIXL is meaningfully better than Mooncake on
this stack, but even NIXL imposes 16-38% overhead on the latency
metrics (TPOT/E2E/TTFT p90), which is too expensive for
selective-PD-sep on agentic workloads where the trigger rate is
< 0.5%.
Launch fixes required for NIXL multi-instance:
- VLLM_NIXL_SIDE_CHANNEL_PORT must be unique per instance (default
5600; we use 5600+i). Without this, 7 of 8 instances silently hang
in `zmq.error.ZMQError: Address already in use` and the launcher
trap kills all of them at health-check timeout.
- Health-check timeout raised from 180s to 360s; NIXL initialization
(UCX agent + memory registration) is ~100-150s per instance under
8-way concurrent load, vs Mooncake's ~30-60s.
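A minimal sketch of the per-instance port assignment described above.
VLLM_NIXL_SIDE_CHANNEL_PORT, the 5600+i scheme and the 360s timeout come
from this message; the helper names are hypothetical and this is not the
actual launcher code:

```python
import os

NIXL_BASE_SIDE_CHANNEL_PORT = 5600   # default noted above
HEALTH_CHECK_TIMEOUT_S = 360         # raised from 180s for NIXL initialization


def instance_env(i, base_env=None):
    """Return the environment for vLLM instance i with a unique side-channel port."""
    env = dict(os.environ if base_env is None else base_env)
    env["VLLM_NIXL_SIDE_CHANNEL_PORT"] = str(NIXL_BASE_SIDE_CHANNEL_PORT + i)
    return env


def check_unique_ports(envs):
    """Fail fast instead of letting instances collide on the same port."""
    ports = [e["VLLM_NIXL_SIDE_CHANNEL_PORT"] for e in envs]
    if len(set(ports)) != len(ports):
        raise ValueError(f"duplicate NIXL side-channel ports: {ports}")
    return ports


if __name__ == "__main__":
    envs = [instance_env(i, base_env={}) for i in range(8)]
    print(check_unique_ports(envs))
```

Checking uniqueness before launch turns the silent 7-of-8 hang into an
immediate, readable error.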
New figure: fig_connector_substrate_attribution.png stacks plain /
framework / Mooncake-extra / v2-branch overhead per metric.
Existing figures (fig_kv_both_overhead, fig_three_way_hotspot)
updated to include NIXL as a fourth bar.
README updated with 4-way table, Result 1 reframed as "the cost is
mostly framework, not Mooncake — but Mooncake adds the hotspot
penalty", and the substrate-vs-PD-sep tradeoff math.
Refs: nixl_connector.py:700 handshake listener bind, factory.py
register_connector for the NixlConnector entry.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
New analysis/characterization/elastic_migration_v2/ packages the
unified_v2 + unified_kv_both experiments into a self-contained
results section that the paper can cite as the "we tried selective
PD-sep migration" case study. The section finds three independent
reasons PD-sep doesn't help on agentic w600:
1. Mooncake kv_both substrate alone (no PD-sep ever firing) imposes
TTFT p90 +45%, TPOT p90 +25%, hotspot index +19% vs plain
unified. Per-step KVConnectorMetadata maintenance and block
reservation semantics dominate even when no transfer is pending.
2. PD-sep gate fires only 0.16-0.41% of requests across two
gate-tightness configurations. 88-76% are killed by
new_local < threshold because 93% intra-session reuse on agentic
traces leaves a small uncached tail; 19% are killed by
chosen_no_active_decode (snapshot-time gate). Even relaxed
thresholds can't grow trigger rate past 0.5%.
3. When PD-sep fires, the calibrated cost model
(0.3s + bytes / 2.7 GB/s) is wrong by 10-20x. 5 triggered
requests in v2.1 saw realized TTFT 12-45s vs model-predicted
migrate cost 0.7-2.2s, consistent with the E2 audit's finding
that D-side block pre-reservation and missing layerwise
pipelining dominate the decode_sent -> first_token clock.
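The cost model in item 3 can be written down directly. A minimal sketch,
assuming decimal gigabytes (1 GB = 1e9 bytes); the function names are
hypothetical, and pairing the low ends of the two reported ranges (0.7s
predicted, 12s realized) is an illustration, not a per-request record:

```python
GB = 1e9  # assumption: decimal gigabytes


def predicted_migrate_cost_s(kv_bytes, fixed_s=0.3, bandwidth_gb_s=2.7):
    """Calibrated model from item 3: 0.3s + bytes / 2.7 GB/s."""
    return fixed_s + kv_bytes / (bandwidth_gb_s * GB)


def misprediction_factor(realized_ttft_s, kv_bytes):
    """How many times larger the realized TTFT is than the model's prediction."""
    return realized_ttft_s / predicted_migrate_cost_s(kv_bytes)


if __name__ == "__main__":
    kv_bytes = 1.08 * GB  # payload that the model prices at 0.7s
    print(f"predicted: {predicted_migrate_cost_s(kv_bytes):.2f}s")
    print(f"factor at 12s realized: {misprediction_factor(12.0, kv_bytes):.1f}x")
```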
Three-way comparison (unified vs unified_kv_both vs unified_v2):
v2 vs the kv_both control is roughly net-zero (-10% hotspot,
-14% TPOT p90, +3% TTFT p90, +9% TTFT p99). v2 vs plain unified is
strictly worse by 27-49% across latency percentiles because the
kv_both substrate tax is unavoidable when the policy is enabled.
Contents:
- README.md: the four results sections, the three-way comparison
table, an explicit "what this claims for the paper" list, and a
cross-reference index to the earlier characterization documents.
- data/: b3_policy_comparison.json + per-policy breakdown.json
+ per-policy hotspot_index.json for the four policies in scope.
- figures/: 4 PNGs rendered by render_figures.py:
* fig_kv_both_overhead.png — 4-metric bar chart with delta
annotations showing kv_both alone costs +45% TTFT p90.
* fig_v2_trigger_funnel.png — per-reason request count for the
two gate configurations on log scale.
* fig_v2_predicted_vs_actual.png — scatter of model-predicted
migrate cost vs realized TTFT for the 5 triggered requests,
with y=x, 10x, and 20x reference lines.
* fig_three_way_hotspot.png — per-worker TTFT p90 grouped bars
across the three policies.
The section is intentionally self-contained: it lists what the
experiment validates (cost model picks correct candidates;
shadow-drift fix is necessary; same-worker interference is real)
alongside what it disproves (per-request PD-sep on agentic via
Mooncake is not a net win in current implementation).
Refs: E1/E2 subagent audits, B2 microbench, unified_v2 commits
19f69a9 / 4b833d3 / 95c8ef8.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
User's 2026-05-25 draft aligning three threads (agentic-kv vLLM
experiments, dash0 artifacts, agentic-pd-hybrid SGLang work) into
a single story for the paper. Tracked so future iterations and
review history are in version control.
Co-Authored-By: Gahow Wang <chiahaco@gmail.com>
After the B3 audit bug fixes (joined_analysis hotspot median +
b3_analyze percentile interp), regenerate b3_policy_comparison.json
and the per-policy hotspot_index.json from the same raw run on
dash0 and re-render the three affected figures (apc-vs-hotspot,
latency-bars, per-worker TTFT).
Key number changes in window_1_results.md:
- hotspot_index magnitudes corrected (all five policies; lmetric
smallest delta at +0.7%, sticky largest at +16.1%)
- "capped reduces hotspot 13%" -> "~10% (2.253 -> 2.020)"
- TTFT/E2E/TPOT percentiles shift only slightly from floor->interp
(e.g. unified TTFT p90 7.24 -> 7.35 s, +1.5%)
Restructured "Caveats" into "Limitations (read this before quoting
B3 numbers)":
1. Agentic dispatch coupling is by design — promoted from caveat
to top-level methodology framing, tied to
agentic_dispatch_coupling.md
2. B3 interference_index is binary (not size-graded) — added
3. Hot-sweep cache contamination (<1%) — kept
4. Unified interference unrecoverable — kept with explicit warning
not to read unified's failure attribution as causal
5. w600 is a sample, not full trace — kept
6. Reuse decomposition is per-token in expectation — added
current_results/characterization_claim_matrix.md updates:
- The "heavy-tail not sole cause" claim now cites the corrected
~10% drop with the median bug noted
- New supported claim: "B3 saturated-replay latency gaps include an
agentic dispatch-coupling feedback term, which is intentional and
matches production"; cited against agentic_dispatch_coupling.md.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three fixes from the B3 audit:
1) joined_analysis.hotspot_index used sorted[n//2] as median, which
returns the upper-middle element (the ~60th percentile) for n=8
(even-length). That overstates the median and so systematically
under-states the hotspot index (max/median). Recomputed values:
lmetric 2.238 -> 2.253 (+0.7%)
load_only 1.140 -> 1.294 (+13.5%)
sticky 2.349 -> 2.728 (+16.1%)
unified 3.350 -> 3.667 (+9.5%)
capped 1.937 -> 2.020 (+4.3%)
Qualitative ranking preserved; "capped only modestly reduces hotspot"
story holds with ~10% drop instead of the previously reported 13%.
Added test_hotspot_index_uses_true_median_for_even_n to lock in the
fix.
2) b3_analyze.sh's pct() helper used floor-indexed percentile
sorted[int(p*(n-1))], inconsistent with metrics._percentile and
joined_analysis._percentile which both use linear interpolation.
Now matches.
3) b3_sweep.sh's capped step called run_policy "capped", but the
proxy's argparse has no "capped" choice, so the hot-sweep variant
would have crashed on this step. The actual capped data was
produced via b3_isolated_policy.sh with --policy lmetric. Replace
the broken inline call with an explicit launch_proxy lmetric +
inline replayer block so the sweep script matches the data path
it documents.
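Fixes 1 and 2 can be illustrated side by side. A minimal sketch with
made-up input values; the function names are hypothetical stand-ins and
not the code in joined_analysis or b3_analyze.sh:

```python
import statistics


def hotspot_index_buggy(worker_p90s):
    """Old behavior: sorted[n//2] is the upper-middle element for even n."""
    s = sorted(worker_p90s)
    return s[-1] / s[len(s) // 2]


def hotspot_index(worker_p90s):
    """Fixed behavior: max over the true median (mean of the two middles for even n)."""
    return max(worker_p90s) / statistics.median(worker_p90s)


def percentile_floor(values, p):
    """Old pct() helper: floor-indexed, never interpolates."""
    s = sorted(values)
    return s[int(p * (len(s) - 1))]


def percentile_interp(values, p):
    """Linear interpolation between the two neighboring order statistics."""
    s = sorted(values)
    pos = p * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


if __name__ == "__main__":
    p90s = [1, 2, 3, 4, 6, 7, 8, 12]          # 8 workers, illustrative values
    print(hotspot_index_buggy(p90s), hotspot_index(p90s))
    vals = [10, 20, 30, 40]
    print(percentile_floor(vals, 0.9), percentile_interp(vals, 0.9))
```

For even n the buggy median is never smaller than the true one, so the
buggy hotspot index is never larger than the corrected one, which matches
the direction of every recomputed value above.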
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The B3 audit flagged the trace replayer's "fire turn N+1 immediately
if turn N is behind schedule" semantics as a potential benchmark
crime, because under saturation the effective arrival process becomes
policy-dependent (slow policy -> longer session lifetimes -> more
concurrent in-flight -> harder system -> still slower). The audit
called this dispatch slip.
But in agentic workloads, turn N+1 is generated by a tool-call
response or an autonomous-loop step, not by a human reading the
previous reply. There is no inter-turn think-time. So the replayer's
"no think-time, sequential within session, fire-immediately-when-
ready" behavior is the correct model of agentic production, and the
feedback amplification is a real property of production systems
under saturation rather than an artifact of the replayer.
The note (analysis/characterization/agentic_dispatch_coupling.md)
lays out:
- The dispatch rule and the apparent feedback loop
- Why agentic workloads do not have user think-time
- Application of Little's Law: slower policy carries higher concurrent
in-flight load, so the policy x feedback gap is real, not artifact
- Reframes B3 as the "production-replay" experiment and B4 as the
orthogonal "controlled-load" experiment, complementary not
hierarchical
- Calls the feedback amplification itself out as a finding worth
reporting (e.g. unified's ~2x latency-p90 advantage over lmetric in B3
reflects both the routing improvement and the in-flight reduction)
- Contrasts with chat workloads (human think-time partially breaks
the feedback loop, agentic removes that floor)
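The Little's Law step can be made concrete. A minimal sketch with
hypothetical numbers (arrival rate, turns per session and per-turn
latency are all made up); it only shows that, with no think-time, a
policy that doubles per-turn latency doubles the concurrent in-flight
sessions at the same arrival rate:

```python
def concurrent_sessions(session_arrival_per_s, turns_per_session, turn_latency_s):
    """Little's Law L = lambda * W, with W = session lifetime.

    With sequential turns and no inter-turn think-time, the session
    lifetime is just turns * per-turn latency.
    """
    session_lifetime_s = turns_per_session * turn_latency_s
    return session_arrival_per_s * session_lifetime_s


if __name__ == "__main__":
    fast = concurrent_sessions(0.5, turns_per_session=10, turn_latency_s=2.0)
    slow = concurrent_sessions(0.5, turns_per_session=10, turn_latency_s=4.0)
    print(f"fast policy: {fast:.0f} in flight, slow policy: {slow:.0f} in flight")
```

The sketch is open-loop: it does not model the second-order effect where
the extra in-flight load raises per-turn latency further.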
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The B2 same-worker TPOT p90 idx is non-monotone: 7.89x at 32k drops
to 2.26x at 65k. The naive reading is "interference gets weaker for
huge prefills"; the actual mechanism is a regime shift, and reading
TPOT p90 alone is misleading.
Three superimposed effects:
1. Cost migration TPOT -> TTFT. A 32k prefill is short enough that
chunked-prefill keeps interleaving decode steps, so overlapping
decodes trickle tokens out at painful per-token rates. A 65k
prefill is long enough that overlapping decodes are *fully*
blocked for ~10s; once they break through, the injection is
winding down and subsequent iterations run unobstructed. The
cost lands on the TTFT clock (14s) instead of inflating TPOT.
2. Bimodal TPOT distribution. At 65k overlap, decodes split into
"blocked entire prefill then normal rate" and "trickled slowly
through prefill chunks". p99 sits on the second population and
grows 59 -> 169.5 ms; p90 sits on the first and shrinks.
3. "Clean" stops being clean. With 4x ~10s injections in 60s, the
110 "clean" decodes at 65k are squeezed into 2-3s recovery
pockets. TPOT p90 clean rises 6.9 -> 9.6 ms (~40%), inflating
the denominator of the ratio and so shrinking the idx.
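Effect 2 is easy to reproduce on synthetic numbers. A minimal sketch; the
TPOT samples below are invented to mimic the two regimes and are not
measured data:

```python
def percentile(values, p):
    """Linear-interpolation percentile, p in [0, 1]."""
    s = sorted(values)
    pos = p * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


# Synthetic TPOT samples in ms (NOT measured data).
# "32k-like": every overlapping decode trickles at a uniformly slow rate.
tpot_32k_like = [50.0] * 100
# "65k-like": 92 decodes were blocked and then ran at normal rate,
# 8 decodes trickled slowly through the prefill chunks.
tpot_65k_like = [8.0] * 92 + [170.0] * 8

if __name__ == "__main__":
    for name, xs in (("32k-like", tpot_32k_like), ("65k-like", tpot_65k_like)):
        print(f"{name}: p90={percentile(xs, 0.90):.1f} ms "
              f"p99={percentile(xs, 0.99):.1f} ms")
```

When the slow population is smaller than 10% of the samples, p90 lands on
the fast mode and falls while p99 lands on the slow mode and rises.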
window_1_results.md adds a new B2 subsection laying out the
mechanism with the per-cell data table and the explicit reading
rule: headline interference metric is TTFT idx (monotone); TPOT
p99 is the right tail indicator; TPOT p90 alone is unsafe across
regime shifts. Direct implication: TTFT and TPOT need separate
SLO thresholds under PD-colo, because they measure costs from
different points in the request lifecycle and the cost migration
between them is workload-dependent.
current_results/characterization_claim_matrix.md adds a new
supported claim for the cost migration, listed against the existing
B2 evidence. current_results/reviewer_risk_register.md adds a
low-severity entry warning future readers off TPOT p90 alone.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Refresh the standing audit package now that B1' / B2 / B3 are complete.
current_results/characterization_claim_matrix.md
Flips seven entries from "not_yet_supported" / "partially_supported"
to "supported" with pointers into window_1_results/. New entries
cover per-session sequentiality, KV per request, real reuse
decomposition, theoretical APC ceiling, the LMetric locality gap,
Unified breaking the locality-vs-latency tradeoff, B2 causal
interference proof, sticky's interference inflation, and the
partial heavy-tail / hot-spot story. B4 SRR + B5 attribution stay
"not_yet_supported" (Window 2 work).
current_results/main_claim_allowed_runs.md
New "Allowed For Routing-Policy Comparison" section pins the five
B3 policy directories. New "Allowed For PD-colo Interference"
section pins the B2 sweep. Legacy section retained for the
pre-instrumentation 200/500/1000-req runs.
current_results/reviewer_risk_register.md
Marks the two old "high"-severity risks (sequentiality / reuse
decomposition) as resolved; adds new entries for the APC
contamination empirics, the b3_analyze.sh truncate-write bug that
cost unified's interference index, the GPU-0 EngineCore ghost
cleanup, the saturated-replay caveat for trace-timestamp dispatch,
and the synthetic B2 decode workload.
current_results/all_figures_index.md
Adds the 8 new Window 1 figures alongside the existing 6 from the
legacy summarize_runs run.
current_results/reproduction_commands.sh
Records the full B3 + B2 + figure pipeline.
analysis/characterization_todo_for_interns.md
Updates the Progress Snapshot table: B0, B1, B2, B3, B6 all DONE;
only B4 and B5 remain (Window 2).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
analysis/characterization/window_1_results.md is the headline write-up
for Window 1: workload characterization (KV per request, real reuse
decomposition, APC theoretical ceilings), B3 5-policy sweep with
per-policy interpretation, B2 same-vs-different-worker interference
microbench with causal reading, and an explicit list of what Window 1
does *not* answer (deferred to B4 SRR sweep + B5 attribution).
Under window_1_results/:
- 5 raw result JSONs from the B3 sweep, the B2 microbench, the APC
upper bound, and the KV footprint
- per-policy hotspot_index.json snapshots so render_window1_figures.py
can plot per-worker TTFT p90 distributions
- 8 PNG figures (figures/) covering the headline claims
Three takeaways the figures pin down:
1) intra-session reuse dominates (93.2%), so session-affinity routing
is the right primary lever
2) unified hybrid affinity hits 79.4% APC (99.7% of the 79.6% intra-
session ceiling) AND cuts TTFT p90 from lmetric's 15.6s to 7.24s
3) B2 different-worker control sits at idx ≈ 1.0 across 32× prefill-
size variation; same-worker TTFT idx scales 2.15× -> 218×, which
is the cleanest causal evidence for same-worker prefill-decode
interference
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three CPU-only analysis pieces that turn raw Window 1 artifacts into
publishable numbers and figures.
scripts/compute_apc_upper_bound.py
Block-level trie walk over hash_ids to compute the theoretical APC
ceiling on a trace, decomposed into intra-session / any-session /
shared-prefix-only. Gives a fixed reference for what each routing
policy could *possibly* achieve. w600 result: 79.6% intra-session,
80.3% any-session, 0.1% shared-prefix.
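A minimal sketch of a block-level prefix walk of this kind, assuming each
request is a (session_id, hash_ids) pair in arrival order and that a
block counts as reusable only if it extends an already-seen prefix. The
names are hypothetical, the shared-prefix-only split is omitted, and this
is not the code in compute_apc_upper_bound.py:

```python
from collections import defaultdict


def _match_and_insert(trie, hash_ids):
    """Count leading blocks already present in the trie, then insert the path."""
    node, matched, still_matching = trie, 0, True
    for h in hash_ids:
        if still_matching and h in node:
            matched += 1
        else:
            still_matching = False
        node = node.setdefault(h, {})
    return matched


def apc_ceiling(requests):
    """Return the best-case cached-block fraction, intra-session and any-session."""
    global_trie = {}
    session_tries = defaultdict(dict)
    total = intra = any_session = 0
    for session_id, hash_ids in requests:
        total += len(hash_ids)
        intra += _match_and_insert(session_tries[session_id], hash_ids)
        any_session += _match_and_insert(global_trie, hash_ids)
    if total == 0:
        return {"intra_session": 0.0, "any_session": 0.0}
    return {"intra_session": intra / total, "any_session": any_session / total}


if __name__ == "__main__":
    trace = [("s1", [1, 2, 3]), ("s1", [1, 2, 3, 4]), ("s2", [1, 2, 9])]
    print(apc_ceiling(trace))
```

The ceiling assumes infinite cache capacity and perfect routing, which is
what makes it a fixed reference rather than an achievable target.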
analysis/characterization/b2_sweep_analysis.py (rewrite)
Previous version used joined_analysis.interference_index() which
labeled overlap = "any prefill in any other request during this
decode". With short-prompt decode load this is always true
(everyone's prefill overlaps everyone else's decode); n_overlap
was 239/240 even in the different-worker control.
New version labels overlap iff the decode's [t_first_token, t_finish]
intersects an actual large *injection* window, computed from the
cell's "prefill"-tagged metric rows. Different-worker control now
cleanly sits at idx ≈ 1.0, same-worker scales monotonically.
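The new labeling rule is an interval-intersection test. A minimal sketch,
assuming decode records carry t_first_token / t_finish fields and that
injection windows are (start, end) pairs on the same clock; the record
layout and function names are hypothetical:

```python
def overlaps_injection(t_first_token, t_finish, injection_windows):
    """True iff [t_first_token, t_finish] intersects any (start, end) window."""
    return any(start <= t_finish and t_first_token <= end
               for start, end in injection_windows)


def split_by_overlap(decodes, injection_windows):
    """Partition decode records into (overlap, clean) lists."""
    overlap, clean = [], []
    for d in decodes:
        hit = overlaps_injection(d["t_first_token"], d["t_finish"], injection_windows)
        (overlap if hit else clean).append(d)
    return overlap, clean


if __name__ == "__main__":
    windows = [(10.0, 20.0)]
    decodes = [
        {"id": "a", "t_first_token": 1.0, "t_finish": 5.0},
        {"id": "b", "t_first_token": 8.0, "t_finish": 12.0},
        {"id": "c", "t_first_token": 21.0, "t_finish": 25.0},
    ]
    overlap, clean = split_by_overlap(decodes, windows)
    print([d["id"] for d in overlap], [d["id"] for d in clean])
```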
analysis/characterization/render_window1_figures.py
Renders 8 PNGs from the result JSONs: B3 latency / APC vs ceiling
/ APC vs hotspot scatter / per-worker TTFT / failure breakdown,
B2 TPOT and TTFT curves (overlap vs clean and idx), reuse
decomposition, KV footprint.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Documents each pick_instance_* function from cache_aware_proxy.py in
pseudocode so the policy semantics can be cited without re-reading
implementation details. Covers lmetric (main baseline), load_only
(no cache / no affinity control), sticky (hard affinity control),
unified (gated affinity + LMetric fallback), and capped (lmetric on
a per-session turn-capped trace).
Includes a decision matrix that maps each policy to whether it uses
session affinity, cache awareness, load awareness, and overload
break, plus a one-liner per control explaining what comparison
isolates which factor.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
scripts/b2_interference.py is the controlled microbench. It runs two
coroutines that bypass the proxy and hit the vLLM endpoints directly:
- decode_load: continuous short-prompt requests at fixed QPS into a
designated decode instance, to keep it decode-saturated.
- prefill_injections: N large one-token requests at fixed interval,
pointed at either the same instance (same-worker variant) or a
paired one (different-worker control).
Each cell (variant × prefill_size) gets its own metrics.jsonl plus a
run_window.json containing t_start_unix/t_end_unix. The shared
engine_*.jsonl from the scheduler patch is sliced by that window in
the aggregator.
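A minimal sketch of the window slicing, assuming each engine_*.jsonl row
carries a Unix timestamp under a key named t_unix; that key name and the
function name are assumptions, only t_start_unix / t_end_unix come from
run_window.json as described above:

```python
import json


def slice_step_log(lines, t_start_unix, t_end_unix, ts_key="t_unix"):
    """Keep only step-log rows whose timestamp falls inside the cell's window."""
    rows = (json.loads(line) for line in lines if line.strip())
    return [r for r in rows if t_start_unix <= r[ts_key] <= t_end_unix]


if __name__ == "__main__":
    log = [
        '{"t_unix": 5, "step": 1}',
        '{"t_unix": 15, "step": 2}',
        '',
        '{"t_unix": 25, "step": 3}',
    ]
    window = {"t_start_unix": 10, "t_end_unix": 20}
    print(slice_step_log(log, window["t_start_unix"], window["t_end_unix"]))
```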
analysis/characterization/b2_sweep_analysis.py walks the cell tree,
slices the per-worker step log by each cell's window, runs the A5
interference_index() against the slice, and emits a single
b2_sweep_summary.json with one row per cell. This is what feeds the
"interference vs uncached prefill size" figure.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Smoke validation on dash0 surfaced three real bugs that broke
interference and failure-attribution labels end-to-end:
1. endpoint_url in metrics is the proxy URL (e.g. http://h:9200);
the vLLM worker URL lives in breakdown's routed_to. The
interference index and label path were taking endpoint_url first,
so every request looked routed to a non-existent worker and the
overlap counter stayed at zero.
2. _normalize_worker hard-coded base port 8000, so a smoke run on
port 9100 resolved to engine_1100 instead of engine_0. Added a
--worker-map URL=engine_id CLI flag and _resolve_worker() that
prefers the explicit map and falls back to the heuristic.
3. vLLM rewrites the per-step rid as cmpl-<proxy_id>-<i>-<hash>, so
the str equality check between per_req rid and our proxy
request_id never matched -> every prefill step looked like
"other request prefill", which would have flipped overlap to
100%. Added _vllm_rid_matches() that strips the cmpl-/chatcmpl-
prefix.
After the fixes, the same smoke run reports interference_index = 22.9
across 24 overlap / 6 clean requests on a single instance, which is
the expected shape for serial dispatch into a cold engine.
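Bugs 2 and 3 both come down to small lookup helpers. A minimal sketch of
what such helpers could look like; the names mirror _resolve_worker and
_vllm_rid_matches but the bodies are a reconstruction under the rid
format quoted in item 3, not the committed code:

```python
from urllib.parse import urlparse

_VLLM_PREFIXES = ("chatcmpl-", "cmpl-")


def vllm_rid_matches(vllm_rid, proxy_request_id):
    """Match cmpl-<proxy_id>-<i>-<hash> (or chatcmpl-...) against the proxy id."""
    for prefix in _VLLM_PREFIXES:
        if vllm_rid.startswith(prefix):
            vllm_rid = vllm_rid[len(prefix):]
            break
    return vllm_rid == proxy_request_id or vllm_rid.startswith(proxy_request_id + "-")


def resolve_worker(url, worker_map=None, base_port=8000):
    """Prefer the explicit URL=engine_id map, fall back to the port heuristic."""
    if worker_map and url in worker_map:
        return worker_map[url]
    port = urlparse(url).port
    if port is None:
        raise ValueError(f"no port in worker URL: {url}")
    return f"engine_{port - base_port}"


if __name__ == "__main__":
    print(vllm_rid_matches("cmpl-req42-0-abc123", "req42"))
    print(resolve_worker("http://h:9100"))
    print(resolve_worker("http://h:9100", {"http://h:9100": "engine_0"}))
```

Requiring the "-" separator after the proxy id keeps req42 from matching
a rid that belongs to req421.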
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Captures 5 runs from the experiment matrix (combined-ca x3 seeds,
pdsep-4p4d seed1, pdsep-6p2d seed1) on traces/w600_r0.0015_st30.jsonl
with cuda graphs enabled. The headline:
  combined-ca:  TTFT p50  0.91s   success 99.5%
  pdsep-4p4d:   TTFT p50  62.8s   success 52%    (69x worse, half dropped)
  pdsep-6p2d:   TTFT p50  51.1s   success 68%    (56x worse, third dropped)
C2 (fig_c2): headline bars per config with error bars.
C3 (fig_c3): per-instance KV utilization time-series. Both PD-sep
splits hit the memory wall, but the side differs by P:D ratio --
4P+4D pins the P-side, 6P+2D pins both sides (D-side back-pressures
P-side).
C4 (fig_c4): TTFT stacked breakdown. 99% of PD-sep TTFT is P-side
prefill compute; D-side wait + first token is <=1.2s. The bottleneck
is P-side prefill queueing, not D-side decode wait as the original
analytical model assumed.
system_analysis.md gains a Layer 5b that reconciles the analytical
KV-wall model (which considered D-side only) with the empirical
finding that the wall hits whichever side has fewer GPUs, and
co-saturates both at extreme splits via D-side back-pressure.
plot_pd_matrix.py ingests outputs/pd_matrix/* into all four figures.
bench.sh gained AGENTIC_STEP_LOG_DIR hooks for future runs (set during
this work but not used by the current matrix's data).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
New analysis/characterization/joined_analysis.py joins replayer
metrics.jsonl + proxy breakdown.json + worker_state.jsonl by
request_id, plus engine_*.jsonl by worker_id, and emits:
- joined.jsonl               per-request merged record
- reuse_decomposition.json   real intra/cross/shared classification
                             using session_id + hash_ids + cached_tokens
- interference_index.json    TPOT_p90(same-worker prefill overlap)
                             / TPOT_p90(clean), per Batch 2
- hotspot_index.json         max/median worker TTFT-p90, per Batch 3
- failure_label.jsonl        per-slow-request cause label, per Batch 5
- failure_breakdown.json     label histogram
- window_summary.json        SRR warmup/steady/drain aggregates
Closes the analyzer side of Phase A; replaces the
status: unavailable placeholders the existing scaffold emits when
join sources are missing.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Add Progress Snapshot table to the intern TODO so per-batch status
(DONE / partial / blocked-on-instrumentation) is visible at a glance.
- New analysis/claude_characterization_work_plan.md scopes the Phase A
instrumentation tasks (A1-A5) plus Window 1 (B1'+B2+B3) and Window 2
(B4+B5) on dash0, with locked decisions for model, topology, trace,
SLO style, and GPU phasing.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the experiment harness that gates the empirical claims (C2/C3/C4/C5)
in the PD-sep paper section. Three pieces:
1. scripts/bench.sh: new --mode pdsep with --pd-ratio P:D, and an
--eager flag to re-enable --enforce-eager for the cuda-graph
ablation. pdsep reuses the elastic-mode Mooncake kv_both launch and
swaps the proxy command from --combined to --prefill/--decode.
baseline and elastic flows are unchanged.
2. analysis/pd_sep_paper_section/scripts/bench_pd_matrix.sh: matrix
driver that runs {combined-ca, pdsep-4p4d, pdsep-6p2d} x cudagraph
x 3 seeds by default (~2 h on dash0). --with-rr adds combined-rr;
--with-eager doubles to ~5 h with the cuda-graph ablation. Skips
completed runs, captures per-instance vLLM logs (needed for C3
step-level KV-utilization mining).
3. fig_kv_memory_wall.pdf: empirical anchor (star) at REPORT.md §3.3's
observed 6P+2D 97% KV utilization. The marker lands on the model's
predicted curve at p90 input, confirming the steady-state analysis.
README updated with the run command, output layout, and the followup
plotters that consume outputs/pd_matrix/.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the system-level argument resolving the roofline/PD-sep paradox.
Even at 95% cache reuse prefill stays compute-bound (the C6 roofline
fact), yet PD separation regresses TTFT 72%. The new system_analysis.md
walks through six layers showing why the roofline claim is necessary
but not sufficient, with the falsifiable condition being decode-side
KV memory budget: concurrent_decode * KV_per_req / (N_D * HBM_pool).
For chatbot this ratio is << 1 at any layout; for agentic at p90+
context it goes >> 1 under 4P+4D and 6P+2D, predicting the empirical
97% decode KV occupancy. fig_kv_memory_wall.pdf visualizes the model
with audit-able constants; fig_c1a/b ground the per-request KV-size
inputs in the actual sampled trace (input p50=33.5k, p90=101k,
intra-session reuse 79.2%).
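The falsifiable condition can be evaluated numerically. A minimal sketch;
only the formula and the p90 input length (101k tokens) come from this
message, while the per-token KV size, HBM pool size, concurrency and the
chatbot context length are hypothetical placeholders, not the constants
used in fig_kv_memory_wall.pdf:

```python
def decode_kv_pressure(concurrent_decode, kv_bytes_per_req, n_decode_gpus,
                       hbm_pool_bytes_per_gpu):
    """concurrent_decode * KV_per_req / (N_D * HBM_pool); > 1 means over budget."""
    return (concurrent_decode * kv_bytes_per_req
            / (n_decode_gpus * hbm_pool_bytes_per_gpu))


# Hypothetical constants, for illustration only.
KV_BYTES_PER_TOKEN = 160 * 1024      # assumed per-token KV footprint
HBM_POOL_BYTES = 40 * 2**30          # assumed KV pool per decode GPU
CONCURRENT_DECODE = 64               # assumed concurrent decoding requests

if __name__ == "__main__":
    for label, tokens in (("chatbot (2k ctx, assumed)", 2_000),
                          ("agentic p90 (101k ctx)", 101_000)):
        for n_d in (4, 2):           # 4P+4D and 6P+2D decode pools
            ratio = decode_kv_pressure(CONCURRENT_DECODE,
                                       tokens * KV_BYTES_PER_TOKEN,
                                       n_d, HBM_POOL_BYTES)
            print(f"{label:26s} N_D={n_d}: ratio={ratio:.2f}")
```

Under these placeholder constants the chatbot case stays below 1 at both
layouts and the agentic p90 case exceeds 1 at both, which is the shape of
the argument rather than a reproduction of the paper's numbers.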
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds analysis/pd_sep_paper_section/ as the home for the "PD separation is
net negative under agentic workloads" paper section: plot scripts for C1
(workload chars), C6 (roofline), C7 (routing-vs-PD-sep lever), the C6/C7
PDFs already rendered, and a README mapping candidate claims to required
figures plus open re-run items.
Removes --enforce-eager from bench.sh and all active launch scripts so
cuda graphs are captured -- the prior methodology suppressed one of
PD-sep's structural advantages (D-node fixed-shape decode). Legacy
scripts under scripts/legacy/ are intentionally untouched as historical
records.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per analysis/unified_routing_fix_review.md #2, several docs still
presented the retired single-argmin + PUSH-migration design as the
final algorithm. Mark them superseded and document the current hybrid
direction (commit 255c8e6).
- REPORT.md §1.1 / §3.9: add errata callout and section header noting
the "Final Design" framing was retired after cc6e562 / 4c583f2;
point readers to docs/migration-policy-design.md.
- docs/migration-policy-design.md: rewrite. Opens with the current
hybrid algorithm (LMetric base + cache_ratio>0.5 affinity gate +
tie-breaker), then a "What Was Retired" commit table, then the old
Approach A numbers preserved as "Historical Baseline-Mode Comparison".
- analysis/research_findings.md §2.2 / §5: correct the LMetric framing.
LMetric isn't "neutralized by affinity constraints" (pure --policy
lmetric has no affinity at all); it converges to similar placements
because P_tokens includes new_uncached_tokens, giving it implicit
soft affinity.
- analysis/elastic_hypotheses.md: same LMetric correction in the
"DOESN'T work" summary, plus a footer cross-referencing the current
routing direction.
- analysis/unified_routing_fix_review.md: track this file (was
untracked); it is the review handoff cited from the updated docs.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Key finding: at 16 concurrent sessions (2 per GPU), TPOT p90 degrades
from 0.073 to 0.106 (+45%), with MEDIUM TPOT at 0.197 (+149%).
This is the first time we've reproduced real prefill-decode interference
in controlled experiments.
Elastic RDMA at 16 sessions doesn't help: only 13/500 offloaded (cache-gate
correct for cold turn-1), kv_both adds ~16% TPOT overhead at high concurrency.
Load scaling: 1000req_ts20, 200req_ts10, 200req_ts5, 500req_ts10 all show
~30% GPU util at 8 sessions. The bottleneck is max_inflight_sessions, not
arrival rate.
Updated elastic_hypotheses.md with H8, H9, and comprehensive final analysis.
The real bottleneck is vLLM's chunked prefill scheduling, not routing or
PD disaggregation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>