Direct per-producer KV-pool evidence for the session-affinity backfire.
At the same 4P+4D ratio:
- round-robin: 4 producers within 1pp of each other (spread <1pp, CV 0.01)
- session-affinity: spread 49pp (one producer ~93%, another 45%; CV 0.25)
A 25x jump in producer load imbalance — heavy multi-turn sessions
concentrate onto single producers, the same hot-pinning pathology as
sticky routing in the colocated §3.3 study.
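A minimal sketch of how the two imbalance metrics above can be computed from per-producer steady-state KV-pool utilization. The utilization lists are hypothetical placeholders, not the measured values, and the use of the population standard deviation for CV is an assumption about the analysis script:

```python
from statistics import mean, pstdev


def imbalance(per_producer_util_pct):
    """Return (spread in percentage points, coefficient of variation)."""
    spread_pp = max(per_producer_util_pct) - min(per_producer_util_pct)
    # Assumption: CV uses the population standard deviation.
    cv = pstdev(per_producer_util_pct) / mean(per_producer_util_pct)
    return spread_pp, cv


# Hypothetical utilization (%) for 4 producers under each routing policy.
round_robin = [60.0, 60.5, 60.2, 60.8]
session_affinity = [93.0, 45.0, 70.0, 62.0]

for name, util in (("round-robin", round_robin), ("session-affinity", session_affinity)):
    spread, cv = imbalance(util)
    print(f"{name}: spread {spread:.1f}pp, CV {cv:.2f}")
```

Spread reports only the two extreme producers, while CV reflects all of them, which is why both are quoted.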
plot_producer_hotspot.py: reduce (numpy, per-producer KV timeline from
snapshots, runs on the serving host) + plot (matplotlib, 2-panel rr vs
session comparison) — same two-stage pattern as aggregate_mb5.py.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Include T=600s/1800s points so the diminishing-returns tail is visible:
14 -> 52 nodes buys only +6pp APC (74%->79.8%), still under the 80.4%
ceiling that oracle/LRU reaches at 14 nodes.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the (redundant) nodes-vs-T cost curve with the working-set
W(t) over wall-clock time for T=2/30/300s. Shows footprint is steady
(peak ~ median) after a short warm-up, so peak-based sizing is sound;
the 300s curve hugs the 14-node ceiling throughout.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drop log node axis (decade ticks were unreadable). Left = APC vs #nodes
(linear), right = #nodes vs retention window T. Mark the 1-node budget
crossing (~7s reuse, ~8% APC) and the 14-node oracle ceiling.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Both axes now in "# nodes" (footprint / per-node KV pool) so the
cluster-size implication is direct: 1-node budget line + 14-node oracle
ceiling, instead of raw GB.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Configurable KV working-set analyzer (GPU model x TP/PP/EP x model
config.json with MLA/GQA auto x KV/weight dtype). Computes Denning W(T),
oracle [first,last], and retain-forever footprints vs a per-replica KV
pool, plus the APC captured at each retention window.
GLM-5.1-FP8 (MLA, 43.9 KiB/token) on 1x B300 node (1528 GB KV pool):
live KV fits trivially (~533 GB), but the full 80.4% APC ceiling needs
~14 nodes (oracle) -> long-tail reuse motivates DRAM offload, not HBM.
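For reference, a sketch of the Denning working-set computation in its simplest form: W(t, T) is the set of distinct KV blocks referenced in the window (t - T, t]. The reference-stream layout, the 64-token block size, and the helper names are assumptions for illustration; only the 43.9 KiB/token and 1528 GB figures come from the numbers above:

```python
import math

KIB_PER_TOKEN = 43.9      # GLM-5.1-FP8 MLA figure quoted above
POOL_BYTES = 1528e9       # per-node KV pool quoted above (decimal GB assumed)


def working_set_sizes(refs, window_s, sample_times):
    """refs: iterable of (timestamp_s, block_id).

    Returns the number of distinct blocks referenced in (t - window_s, t]
    for each t in sample_times. Quadratic, but fine for a sketch.
    """
    refs = list(refs)
    sizes = []
    for t in sample_times:
        live = {block for ts, block in refs if t - window_s < ts <= t}
        sizes.append(len(live))
    return sizes


def footprint_bytes(n_blocks, block_tokens=64):
    """Convert a block count to bytes (block_tokens is a hypothetical value)."""
    return n_blocks * block_tokens * KIB_PER_TOKEN * 1024


def nodes_needed(n_bytes):
    """Nodes required to hold n_bytes of KV, rounding up."""
    return max(1, math.ceil(n_bytes / POOL_BYTES))


refs = [(0, "a"), (1, "b"), (2, "a"), (10, "c")]
print(working_set_sizes(refs, window_s=5, sample_times=[2, 10]))
```

A longer retention window T can only grow W(t, T), which is the trade-off the analyzer sweeps.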
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
aggregate_mb5.py:
- Split the cluster KV timeline by role (P-pool vs D-pool) using a
PID->role map parsed from vllm_logs filenames. The cluster average
hid the result — 6P+2D/4P+4D look ~45% utilized but the decode pool
is actually pegged at ~100% while prefill idles at ~30%.
- Two-stage reduce/plot: --reduce-to (numpy-only, runs on the serving
host over multi-GB snapshot dirs) dumps a compact JSON; --from-reduced
(matplotlib) renders locally. matplotlib import is now lazy.
- New plot_role_split figure + p/d peak/steady columns in the CSV.
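A rough sketch of the two-stage pattern described above, not the actual aggregate_mb5.py code. The `--reduce-to` and `--from-reduced` flags are the ones named in this commit; the sample layout, the summary fields, and the function names are hypothetical:

```python
import argparse
import json
from statistics import mean


def reduce_snapshots(samples):
    """samples: list of (t_s, role, util_pct) -> compact per-role summary."""
    by_role = {}
    for _, role, util in samples:
        by_role.setdefault(role, []).append(util)
    return {role: {"peak": max(v), "mean": mean(v)} for role, v in by_role.items()}


def plot_reduced(reduced, out_png):
    # Lazy import: the reduce stage must run on hosts without matplotlib.
    import matplotlib
    matplotlib.use("Agg")
    import matplotlib.pyplot as plt

    roles = sorted(reduced)
    fig, ax = plt.subplots()
    ax.bar(roles, [reduced[r]["peak"] for r in roles])
    ax.set_ylabel("peak KV pool utilization (%)")
    fig.savefig(out_png)


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--reduce-to", metavar="JSON", help="write compact summary")
    parser.add_argument("--from-reduced", metavar="JSON", help="plot from summary")
    return parser


# Hypothetical samples: prefill pool idling, decode pool pegged.
samples = [(0, "P", 30.0), (1, "P", 32.0), (0, "D", 99.0), (1, "D", 100.0)]
reduced = reduce_snapshots(samples)
print(json.dumps(reduced))
```

Keeping the matplotlib import inside `plot_reduced` means the reduce stage never touches it, so the serving host needs no plotting dependencies.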
PD_DISAGG_RESULTS.md: consolidated writeup with figures inline.
Verdict: no static P:D ratio beats 8C colocation. The binding
constraint moves with the ratio (D-pool saturates at 6P+2D/4P+4D,
P-pool jams at 2P+6D -> 91% request loss); 8C's shared pool stays
elastic at 34% steady, 100% completion. PD wins TPOT (10-35x cleaner,
the MB1 phase-isolation benefit is real) but loses TTFT and sheds
load. Round-robin P routing also zeroes prefix-cache reuse; a
session-affinity re-run of 6P+2D is in flight to test the fix.
Figures (rep1): mb5_kv_timeline, mb5_role_split, mb5_peak_utilization,
mb5_latency_compare + mb5_summary.csv.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The §3.2 cost-vs-benefit math in commits 029821c (MB1 plot +
pd_cost_vs_benefit.png) and abde010 (RESULTS_SUMMARY.md) was wrong.
What was wrong:
I framed PD-disagg's max phase-isolation benefit as "≤ decode duration
of the new request (~50–200 ms)" — implicitly treating the benefit as
per-request and bounded by that request's own decode. The correct
accounting is per-prefill-event across all stalled streams:
benefit_per_prefill = D × T_prefill × (1 − TPOT_baseline/TPOT_during)
≈ D × T_prefill
which follows from the chunked-prefill math (each of L/N chunks slows
D ongoing decode steps from ~10 ms to t ms, summing to D × T_prefill).
Plug MB1 + MB2 numbers in:
prefill size | T_prefill | T_transfer | D=8 benefit | cost/benefit
2k tok | 0.14 s | 8 ms | 1.1 s | 0.7 %
33k tok | 4.5 s | 320 ms | 36 s | 0.9 %
125k tok | 57 s | 1.9 s | 456 s | 0.4 %
On the phase-isolation axis alone, PD-disagg WINS by 100×–250× — the
opposite of what the deleted figure showed.
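The table above can be reproduced with a few lines. This is a sketch using only the rounded values from the table and the benefit_per_prefill formula; the function name and the optional TPOT arguments are illustrative:

```python
def benefit_per_prefill(d_streams, t_prefill_s, tpot_baseline_s=None, tpot_during_s=None):
    """Aggregate decode stall avoided per prefill event, in seconds.

    With both TPOT values given, applies the (1 - baseline/during) factor;
    otherwise uses the D x T_prefill approximation.
    """
    factor = 1.0
    if tpot_baseline_s is not None and tpot_during_s is not None:
        factor = 1.0 - tpot_baseline_s / tpot_during_s
    return d_streams * t_prefill_s * factor


# (label, T_prefill in s, T_transfer in s), as listed in the table above.
rows = [("2k tok", 0.14, 0.008), ("33k tok", 4.5, 0.320), ("125k tok", 57.0, 1.9)]

for label, t_prefill, t_transfer in rows:
    benefit = benefit_per_prefill(8, t_prefill)
    ratio_pct = 100.0 * t_transfer / benefit
    print(f"{label:>8}: benefit {benefit:6.1f} s, cost/benefit {ratio_pct:.1f} %")
```

The correction factor is close to 1 whenever TPOT during the burst is far above baseline, which is why the approximation is adequate for the large prefills.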
The actual dominant reason static PD-disagg fails in agentic is the
D-side KV pool capacity wall (figs/f4b_pdsep_kv_wall.png) — p99
single-request KV is 11.5 GiB, per-D-instance pool is 38 GiB, so 4P+4D
halves system decode capacity. Colleague's 4P+4D experiment showed
TTFT p50 62× worse and success rate 99.5% → 52%, driven by pool
overflow + queueing, not by transfer latency.
Changes (all touched files explicitly listed; no `git add -u`):
- figs/pd_cost_vs_benefit.png : DELETED (figure built on wrong math)
- microbench/fresh_setup/plot_mb1.py : drop the pd_cost_vs_benefit
function; keep mb1_interference.png and update its title to note
per-prefill aggregate stall = D × T_prefill (not capped by decode)
- figs/mb1_interference.png : regenerated, no misleading band annotation
- analysis/mb1/README.md : Summary block rewritten ("what MB1 measures";
no more "max benefit = decode duration" claim); §3.2 implications
section replaced with the corrected per-prefill-event table; explicit
⚠ Correction note documents what was wrong
- analysis/mb2/README.md : Summary block + §3.2 implications section
rewritten the same way; ⚠ Correction note links to RESULTS_SUMMARY §4
- RESULTS_SUMMARY.md §4 + §6 : §4 reordered to lead with the D-side
capacity argument (the real failure mode), MB1/MB2 demoted from
"kill-shot for PD-disagg" to "supporting context inputs to a
cost-benefit table that actually favors PD-disagg on this axis";
§6 paper-claims list reordered to remove the wrong "PD-disagg loses
on cost-vs-benefit" claim and replace with the corrected ones
PAPER_OUTLINE.md and MEETING.md were checked and never picked up this
specific wrong claim — they already (correctly) frame §3.2 around the
D-side KV memory wall.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Single-GPU bench on dash1 GPU 0 (vanilla vLLM 0.18.1, chunked-prefill on,
no kv_connector). 3 decode batch sizes × 5 prefill sizes × 3 reps.
Method recap (driver: microbench/interference/driver.py, repurposed):
- Pin D streaming decode requests at constant max_tokens
- Inject one prefill-only request (max_tokens=1) of varying input length
- Bin decode-stream token timestamps into "during prefill" vs baseline
- Headline metric: effective per-stream TPOT during the prefill burst,
= prefill_ttft / (num_tokens_during_prefill / D). This is the average
rate at which each decode stream produces tokens during the burst.
p50 of inter-token intervals is deceptive (chunked-prefill makes most
intervals look normal); the burst-average gives the true cost.
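A small sketch of the headline metric as defined above. The function names and the example numbers are hypothetical; the real binning lives in the driver/analyzer:

```python
def count_tokens_during(token_timestamps_s, burst_start_s, burst_end_s):
    """Number of decode tokens (across all streams) emitted inside the burst."""
    return sum(burst_start_s <= ts < burst_end_s for ts in token_timestamps_s)


def effective_tpot_ms(prefill_ttft_ms, num_tokens_during_prefill, d_streams):
    """Burst-average per-stream TPOT: prefill_ttft / (tokens_during / D)."""
    tokens_per_stream = num_tokens_during_prefill / d_streams
    if tokens_per_stream == 0:
        return float("inf")  # decode fully halted for the whole burst
    return prefill_ttft_ms / tokens_per_stream


# Hypothetical burst: 1000 ms prefill, 40 decode tokens across 8 streams,
# i.e. 5 tokens per stream, so each stream averaged 200 ms per token.
print(effective_tpot_ms(1000.0, 40, 8))
```

Dividing by the per-stream token count rather than taking a percentile of intervals is what keeps the few very long stalls from being hidden by the many normal-looking intervals.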
Results (D=8 row, the most agentic-realistic case):
P (tokens) | prefill_ttft | per-stream TPOT during | penalty
2048 | 143 ms | 32 ms | 4×
8192 | 583 ms | 114 ms | 15×
32768 | 4520 ms | 388 ms | 52×
65536 | 15615 ms | 757 ms | 99×
131072 | 56991 ms | 1419 ms | 183×
Baseline TPOT at D=8: ~7.7 ms. So during a 131k-token prefill burst
each ongoing decode is running ~183× slower (i.e. essentially halted)
for ~57 seconds.
§3.2 implication: PD-disagg's promised phase-isolation benefit per
agentic request is bounded by the decode duration, which is 50–200 ms
for tool-call output. MB2 says the KV-transfer cost of PD-disagg
is 300 ms – 10 s for agentic-size requests. Cost > benefit for every
KV size above ~80 MiB (well below trace mean 192 MiB).
The new figs/pd_cost_vs_benefit.png overlays MB1 benefit ceiling
(50–200 ms band, capped by decode) onto MB2 transfer cost curve and
marks the agentic-distribution waypoints (trace mean, p90, p95, p99)
on the x-axis. Across the entire agentic distribution, the cost curve
sits above the benefit band.
Adds:
- microbench/fresh_setup/mb1_launch.sh: single-GPU vLLM launcher (no
kv_connector, default chunked_prefill=on, max_num_batched_tokens=8192)
- microbench/fresh_setup/mb1_driver.py: copy of the existing
microbench/interference/driver.py for cpfs deployment
- microbench/fresh_setup/analyze_mb1.py: aggregator emitting
per-(D, P) effective-TPOT-during + max PD-disagg-benefit table
- microbench/fresh_setup/plot_mb1.py: mb1 standalone +
pd_cost_vs_benefit headline figure
- analysis/mb1/summary.csv: 45 raw rows from the sweep
- analysis/mb1/breakdown.json: per-(D, P) aggregate
- analysis/mb1/README.md: persistent doc
- figs/mb1_interference.png: effective TPOT during prefill, one line per D
- figs/pd_cost_vs_benefit.png: §3.2 headline (cost > benefit everywhere)
Caveats noted in README:
- chunk_tokens=8192 only; Sarathi-Serve's smaller chunks would
interleave decode more aggressively. Chunk-size sensitivity is
flagged as next run.
- D ≤ 8; higher D may saturate or shrink the penalty further.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sweep on dash1 GPU 0 → dash2 GPU 0 over 200 Gbps RoCE.
remote_bootstrap_addr=http://172.27.123.142:8998. Same 9-size × 5-rep
config as the 2026-05-27 intra-node run.
Per-size pure_transfer (p50) is within about 1–4% of the intra-node
numbers for every size except 64k tok, where the inter-node p50 is
about 8.5% lower (that size is bimodal in both runs):
size        intra p50    inter p50
512 tok     5.3 ms       5.2 ms
2048 tok    20.6 ms      20.0 ms
8192 tok    83.7 ms      80.9 ms
32k tok     320.9 ms     309.6 ms
64k tok     1895 ms      1734 ms    (bimodal in both)
128k tok    2835 ms      2818 ms    (bimodal in both)
=> Mooncake's batch_transfer_sync_write **does not use NVLink** for
intra-node peers; both paths go through the 200 Gbps RDMA NIC, so the
NIC path (not the GPU interconnect) is the bottleneck. The ~9.7 GB/s
steady-state ceiling and the high-variance regime at 6+ GiB are
identical across topologies.
Operational implication for §3.2: PD-disaggregation does not get
cheaper by co-locating P and D on the same node. Every routed request
pays the same ~10 GB/s ceiling for KV transfer, no matter where it
lands, so topology cannot buy back any of the transfer cost.
Caveat: B's receive_kv events did not log on dash2 — `MB2_LOG_DIR`
env var did not propagate through vLLM's EngineCore subprocess on
the consumer host (cat /proc/$ENGINE_PID/environ is empty on dash2
for that var, but the producer host on dash1 worked). For this run
pure_transfer numbers are from A's send_blocks alone; full rx_total
breakdown is not available, but pure_transfer is the dominant term.
Adds:
- analyze_mb2_send_only.py — analyzer that works from A's send_blocks
alone when B's receive_kv events are absent
- plot_mb2_compare.py — overlay intra vs inter on the same axes
- plot_mb2.py — tolerate the `rows`-less send-only schema
- figs/mb2_transfer_{time,bw}_inter.png — inter-node single-curve
- figs/mb2_transfer_{time,bw}_compare.png — intra vs inter overlay
- analysis/mb2/A_inter_kvboth.jsonl, inter_kvboth_client.json,
inter_kvboth_breakdown.json
- analysis/mb2/README.md — Summary block updated to reference both
paths, dated 2026-05-27 run-log entry appended with the full table
and the topology-independence framing
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Full sweep result on dash1 GPU 0+1 with vanilla vLLM 0.18.1 +
mooncake-transfer-engine 0.3.11, kv_both connector. Per-stage decomposition
via the instrumentation patch (analyze_mb2.py pairs A's send_blocks with
B's receive_kv enter/finish by time window).
Steady-state (1k..32k tokens, 96 MiB..3 GiB KV):
pure_transfer ≈ size / 9.7 GB/s
rx_overhead ≈ 2–3 ms (ZMQ handshake + P-side setup)
bandwidth ≈ 9.6–10.1 GB/s, very stable
Large-size regime (65k..131k tokens, 6..12 GiB):
p50 bandwidth collapses to 3.4–4.5 GB/s
max bandwidth still hits ~9.7 GB/s (some runs achieve it)
p99 agentic request (11.5 GiB) lands here
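A quick sketch of the steady-state model above (pure_transfer ≈ size / 9.7 GB/s). It assumes GB/s is decimal (1e9 bytes) while KV sizes are binary GiB, which is consistent with the 32k-token point (3 GiB in roughly 0.32 s, about 10 GB/s); helper names are illustrative:

```python
GIB = 2**30


def transfer_time_s(kv_bytes, bandwidth_gb_per_s=9.7):
    """Predicted pure_transfer time, assuming decimal GB/s."""
    return kv_bytes / (bandwidth_gb_per_s * 1e9)


def achieved_gb_per_s(kv_bytes, elapsed_s):
    """Bandwidth implied by a measured transfer, in decimal GB/s."""
    return kv_bytes / elapsed_s / 1e9


kv_32k = 3 * GIB  # 32k tokens, per the steady-state range above
print(f"predicted: {transfer_time_s(kv_32k) * 1e3:.0f} ms")
print(f"implied bandwidth at 320.9 ms: {achieved_gb_per_s(kv_32k, 0.3209):.2f} GB/s")
```

The same helper applied to sizes in the 6..12 GiB regime gives only the best case, since p50 bandwidth there falls well below the ceiling.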
Implication for §3.2 PD-disaggregation cost argument:
median agentic decode = 50–200 ms (tool-call JSON output)
agentic-tail KV transfer (p99 request, 11.5 GiB ≈ 12.3 GB):
best case (9.7 GB/s) ≈ 1.27 s
observed range 1.5 – 10 s
⇒ KV transfer is 8–100× larger than the decode it enables.
This is intra-node — the lower-bound transfer cost. Inter-node RDMA
will be slower; that's MB2 phase 2.
Adds:
- analyze_mb2.py: pair A.send_blocks ↔ B.receive_kv by time window;
per-size aggregation (n, ms_p50, ms_min/max, GB/s_p50/max)
- plot_mb2.py: log-log transfer-time chart + bandwidth-vs-size chart
- analysis/mb2/A_intra_kvboth.jsonl, B_intra_kvboth.jsonl: raw events
(51 + 102 events including the sanity preamble)
- analysis/mb2/intra_kvboth_breakdown.json: paired and aggregated
- figs/mb2_transfer_time_intra.png, figs/mb2_transfer_bw_intra.png
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
User-requested comparison of inter-turn external gap distribution between
the production agentic trace (Qwen3-Coder) and a production chatbot trace
(qwen3-max chat). Both computed as
T_external = next_turn.start_ms - prev_turn.end_ms
on the same kind of pipeline (raw input + raw output join on request_id,
session structure from the formatted trace's parent_chat_id chains).
The chatbot trace lives as two files on dash0:
input : bailian-trace/qwen-trace-260321-260327/qwen3-max-input-032309-032311.jsonl
output : bailian-trace/qwen-trace-260321-260327/qwen3-max-output-032109-032711.jsonl
The raw input has no session_id (uuid is per-record, user_id has only 4
distinct tenant values for 346 k requests). We recover session structure
from the formatted file (qwen_chat_blksz_64_032309-032311.jsonl, which
groups requests by parent_chat_id), matching each formatted record to a
raw record by (timestamp, output_length) — prompt_token_num is anonymized
to 0 in this trace, so we use generate_token_num as the join key.
End time is derived from time_to_finish_token (ms duration) not the "time"
string field (which is the log-write time, not request completion).
Numbers (chatbot, 42 228 inter-turn gaps over 32 262 multi-turn sessions):
p25 4.85 s p50 7.18 s p75 8.22 s p90 15.0 s p99 43 s
4% gaps < 1 s 29% < 5 s 78% < 10 s 98% < 30 s
Compare to agentic (same metric, scripts/compute_inter_turn_gap_remote.py):
p25 0.69 s p50 1.6 s p75 8.6 s p90 44 s p99 738 s
39% gaps < 1 s 67% < 5 s 77% < 10 s 87% < 30 s
Distributions differ in shape, not just location:
- Chatbot is tight, unimodal around 5–10 s (human interaction).
- Agentic is bimodal: a sub-second autonomous tool-call mode (39 % < 1 s)
plus a long-pause tail (13 % > 30 s, p99 = 738 s) for sessions where
the operator steps away.
- The sub-second tool-call mass is where dispatch coupling lives —
those turns have W_turn ≫ T_external for any current scheduler.
The earlier "chatbot has T_human ≈ 30 s" hand-wave was wrong empirically.
The right framing for §2.3 is "agentic has a sub-second tool-call mode
that chatbot doesn't", not "chatbot has think-time and agentic doesn't".
Adds:
- scripts/compute_inter_turn_gap_chatbot.py: dash0-side aggregator
(raw input/output join + formatted alignment by ts + output_length)
- analysis/characterization/data/chatbot_inter_turn_gap.json: CDF cache
- scripts/plot_inter_turn_gap.py: overlays both curves on log-x
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
User flagged unified_v2 as a still-buggy build. Regenerate the four
per-policy figures with only the four stable policies:
lmetric, load_only, sticky, unified
Story is now directly comparable to v1: unified still dominates p90
TTFT (8.8s) and E2E p90 (20.0s) over the other three on the fresh run.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
User-provided fresh run with five policies (lmetric, load_only, sticky,
unified, plus a new unified_v2 variant). Reproduces the v1 set under
figs/v2/ so we can A/B the same panels:
f4a_apc_loss.png — APC bars per policy
f4c_per_worker_ttft.png — per-worker TTFT p90 panel per policy
f6_e2e_latency_bars.png — TTFT/TPOT/E2E p90 bars per policy
f6_e2e_latency_full_grid — mean/p50/p90/p99 × TTFT/TPOT/E2E grid
scripts/render_b3_figures_v2.py is a standalone driver that reads each
policy's metrics.summary.json and breakdown.json directly from the run
directory — the breakdown.json `routed_to` field is required to recover
per-worker assignment because the new setup routes every request
through a proxy (127.0.0.1:9300), so metrics.jsonl's endpoint_url no
longer identifies the backend.
Headline numbers, new vs v1:
APC v2: lmetric 57.2% / load_only 53.9% / sticky 77.7%
unified 78.7% / unified_v2 78.4%
v1: lmetric 56.9% / load_only 54.1% / sticky 77.2% / unified 79.4%
TTFT p90 (s) v2: lmetric 14.8 / load_only 20.1 / sticky 14.8 /
unified 8.8 / unified_v2 10.1
v1: lmetric 15.7 / load_only 20.2 / sticky 18.0 / unified 7.3
E2E p90 (s) v2: lmetric 25.4 / load_only 33.9 / sticky 30.3 /
unified 20.0 / unified_v2 24.1
v1: lmetric 24.8 / load_only 33.5 / sticky 34.6 / unified 18.0
Worker p90 (s, median / max)
v2: lmetric 13.3/30.4 · load_only 21.3/29.2 · sticky 13.5/33.0
unified 10.0/35.1 · unified_v2 8.6/34.2
v1: lmetric 13.9/31.3 · load_only 19.4/25.1 · sticky 20.3/55.4
unified 10.3/37.7
Story is unchanged: unified dominates at p90 across TTFT/E2E and on
median-worker latency; unified_v2 is competitive at p50 but slightly
worse than unified at p90.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The earlier conversation suggested agentic might "have no human think-time"
and therefore live in a strict closed-loop regime. The user pushed back:
tool calls also take time and might restore a chatbot-like buffer between
turns. To resolve this, we go to the actual data.
The previously-published per-record formatted trace only carries arrival
timestamps, so an arrival-to-arrival diff conflates W_turn + T_external.
The raw trace (/home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/
051315-051317-raw.jsonl on dash0) additionally carries request_end_time_ms,
which lets us compute the pure inter-turn external gap
T_external = next.request_ready_time_ms - prev.request_end_time_ms
for each session's consecutive turn pair.
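A minimal sketch of the gap computation above. `request_ready_time_ms` and `request_end_time_ms` are the fields named in this message; the `session_id` key and the in-memory record layout are assumptions (the real aggregator streams the raw JSONL on dash0):

```python
from collections import defaultdict


def inter_turn_gaps_ms(records):
    """T_external for every consecutive turn pair within each session."""
    by_session = defaultdict(list)
    for rec in records:
        by_session[rec["session_id"]].append(rec)  # session_id is hypothetical

    gaps = []
    for turns in by_session.values():
        turns.sort(key=lambda r: r["request_ready_time_ms"])
        for prev, nxt in zip(turns, turns[1:]):
            gaps.append(nxt["request_ready_time_ms"] - prev["request_end_time_ms"])
    return gaps


def share_below(gaps_ms, threshold_ms):
    """Fraction of gaps strictly below threshold_ms."""
    return sum(g < threshold_ms for g in gaps_ms) / len(gaps_ms)


records = [
    {"session_id": "s1", "request_ready_time_ms": 0, "request_end_time_ms": 500},
    {"session_id": "s1", "request_ready_time_ms": 1200, "request_end_time_ms": 2000},
    {"session_id": "s1", "request_ready_time_ms": 2100, "request_end_time_ms": 2500},
    {"session_id": "s2", "request_ready_time_ms": 50, "request_end_time_ms": 900},
]
gaps = inter_turn_gaps_ms(records)
print(sorted(gaps), share_below(gaps, 1000))
```

Single-turn sessions contribute no gaps, which matches the "multi-turn sessions" denominator used for the headline numbers.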
Headline numbers (n = 783 k inter-turn gaps over 127 k multi-turn sessions):
p25 = 0.69 s
p50 = 1.6 s
p75 = 8.6 s
p90 = 44 s
mean = 37 s (heavy long-tail; paused/abandoned sessions)
39 % of gaps < 1 s
67 % of gaps < 5 s
87 % of gaps < 30 s
The bulk of the distribution is dominated by sub-second to a-few-seconds
tool-call latencies. Under any current scheduler (e.g. unified TTFT p90 =
7.3 s, lmetric 15.7 s), tail W_turn is already close to or above the 75th
percentile of T_external (8.6 s), so dispatch coupling is the dominant
regime for the majority of turns, not a corner case.
This corrects the earlier conflated arrival-to-arrival "median gap 11 s"
figure (which folded W_turn into T_external). The true T_external median
is 1.6 s.
Adds:
- scripts/compute_inter_turn_gap_remote.py: dash0-side aggregator
- analysis/characterization/data/agentic_inter_turn_gap.json: 500-point
CDF cache + summary stats, scp'd back from dash0
- scripts/plot_inter_turn_gap.py: local figure renderer
- figs/f3a_inter_turn_gap.png: log-x CDF with p25/p50/p75/p90 anchors and
unified/lmetric TTFT p90 reference lines
Next step (per user): pull a chatbot trace through the same pipeline and
compare distributions side by side; this will let §2.3 stop hand-waving
about "no think-time" and instead present the regime split empirically.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Old f2c plotted per-request KV footprint MiB against an "H20 ~95 GiB
usable" reference line. That ceiling was wrong — a 30B-A3B bf16
deployment burns roughly:
~50% HBM for model params (~48 GiB on 96 GiB H20)
~10% for runtime activation buffers
~40% left for the KV cache pool (~38.4 GiB)
so 95 GiB was overstating the available pool by 2.5×.
New f2c reframes the same data into the answer that actually motivates
the paper: how many concurrent decodes does a single instance hold,
and how does PD-disagg change that? Grouped bars per percentile show
system-wide concurrent decode capacity for three 8-GPU deployments:
Combined 8C, PD-disagg 4P+4D (N_D=4), PD-disagg 6P+2D (N_D=2)
Key reads off the figure:
p50 (1.8 GiB/req): 20 fit/inst → 160 / 80 / 40 system-wide
p90 (8.0 GiB/req): 4 fit/inst → 32 / 16 / 8
p95 (9.6 GiB/req): 4 fit/inst → 32 / 16 / 8
p99 (11.5 GiB/req): 3 fit/inst → 24 / 12 / 6
PD-disagg 4P+4D literally halves the decode population at the same
per-request KV pressure — this is the concrete §3.2 "KV memory wall"
penalty stated in terms users care about (concurrency).
- analysis/characterization/render_window1_figures.py:
fig_kv_footprint_cdf rewritten; reads same kv_footprint_summary.json
but computes floor(KV_pool / req_size) × N_D and annotates the
per-instance fit count below each percentile group.
- figs/f2c_kv_footprint_cdf.png: regenerated.
- MEETING.md / PAPER_OUTLINE.md §2.1, §2.4: prose updated with the
new ceiling and the "3 p99 decodes per instance / halved by PD-disagg"
framing.
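The capacity arithmetic behind the figure, as a sketch: floor(KV_pool / req_size) × N_D with the ~38.4 GiB pool estimated above. The function name is illustrative; note that the p50 row depends on the unrounded p50 size, so only the p90 and p99 rows are reproduced here:

```python
import math

KV_POOL_GIB = 38.4  # ~40% of a 96 GiB H20, per the breakdown above


def concurrent_decodes(kv_pool_gib, req_kv_gib, n_decode_instances):
    """Return (fits per instance, system-wide concurrent decodes)."""
    per_instance = math.floor(kv_pool_gib / req_kv_gib)
    return per_instance, per_instance * n_decode_instances


deployments = {"8C": 8, "4P+4D": 4, "6P+2D": 2}
for label, req_gib in (("p90", 8.0), ("p99", 11.5)):
    row = {name: concurrent_decodes(KV_POOL_GIB, req_gib, n)[1]
           for name, n in deployments.items()}
    print(label, row)
```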
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The headline f6_e2e_latency_bars only shows p90, hiding three regimes:
- mean: unified dominates (3.3s TTFT, 7.0s E2E vs sticky 5.6s / 12.1s)
- p50: sticky and unified are tied on first-turn TTFT (0.5s each) —
sticky's first turn of each session is free, after which queues
accumulate. Unified beats sticky everywhere else.
- p99: tail amplification reveals unified's biggest gap —
TTFT 42.3s vs sticky 74.1s; E2E 68.8s vs sticky 139.7s.
The 12-panel figure is the honest full picture; the 3-panel headline
stays for slide-friendly summary.
- analysis/characterization/window_1_results/raw_stats/{policy}.json:
cached ttft/tpot/e2e {mean,p50,p90,p99} pulled from dash0
/home/admin/cpfs/wjh/agentic-kv/outputs/b3_sweep_20260525_095043/
(b3_policy_comparison.json doesn't record mean, only percentiles).
- analysis/characterization/render_window1_figures.py:
new fig_b3_latency_full_grid renders the 4×3 grid from the cache.
- figs/f6_e2e_latency_full_grid.png: 12-panel companion.
- PAPER_OUTLINE.md §5.2: both figures embedded; main table column
renamed from "Hotspot idx" to "Worker p90 (median / max)" to match
the new metric convention.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The max/median ratio inverts the actual user-facing p90 ranking:
sticky: hotspot=2.73 but system e2e p90 = 34.6s (worst)
unified: hotspot=3.67 but system e2e p90 = 18.0s (best)
because sticky's median is also high (everyone slow) while unified
concentrates the damage on one worker and keeps the other 7 fast.
Any "imbalance" metric structurally punishes the affinity-then-escape
schemes that we actually want to advocate for.
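A small illustration of the inversion, using the per-worker TTFT p90 median/max values reported for the same runs (sticky 20.3 s / 55.4 s, unified 10.3 s / 37.7 s). The dictionary layout is hypothetical, and ratios recomputed from these rounded inputs can differ from the quoted ones in the last digit:

```python
workers = {
    "sticky":  {"median_s": 20.3, "max_s": 55.4, "system_e2e_p90_s": 34.6},
    "unified": {"median_s": 10.3, "max_s": 37.7, "system_e2e_p90_s": 18.0},
}


def hotspot_ratio(stats):
    """The retired metric: worst-worker p90 divided by median-worker p90."""
    return stats["max_s"] / stats["median_s"]


# Lower is "better" under both metrics, yet they pick different winners.
best_by_ratio = min(workers, key=lambda p: hotspot_ratio(workers[p]))
best_by_e2e = min(workers, key=lambda p: workers[p]["system_e2e_p90_s"])

for policy, stats in workers.items():
    print(f"{policy}: ratio {hotspot_ratio(stats):.2f}, "
          f"median {stats['median_s']}s, max {stats['max_s']}s")
print("best by ratio:", best_by_ratio, "| best by e2e p90:", best_by_e2e)
```

Reporting the (median, max) pair avoids the problem because a uniformly slow fleet can no longer look balanced.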
Changes:
- analysis/characterization/render_window1_figures.py:
fig_b3_per_worker_ttft now annotates each subplot with
"median X.Xs · max Y.Ys" instead of "hotspot=Y.YY"; docstring
documents why we drop the ratio.
- figs/f4c_per_worker_ttft.png: regenerated with new titles.
- figs/f4c_apc_vs_hotspot_tradeoff.png: deleted. The scatter's y-axis
was the deprecated ratio; superseded by f4c per-worker bars + f6
e2e bars which together carry the same information honestly.
- PAPER_OUTLINE.md: C3, §3.3, §4.1 wording, §5 metric list, §8
conclusion — replace "hotspot index" mentions with
"worst-worker p90" or "(median, max) worker p90"; promote the
§3.3 methodology note to a top-level sub-finding ("hot pin
failure must be measured with per-worker absolute latency,
not normalized ratio").
- MEETING.md: §3.3 narrative reworded to lead with the (median, max)
pair directly; explicit one-line note on why the ratio is dropped.
Conceptual uses of "hot session" / "hot instance" / "hot pin" remain
unchanged — only the *metric* called hotspot index is retired.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
'capped' is not a routing policy — it's lmetric run on a separately
truncated trace (sessions capped to 8 turns via build_capped_trace.py).
Putting it alongside lmetric/load_only/sticky/unified in per-policy
comparison figures is misleading because the workload differs, not
the routing decision. Comparing apples to a different-trace orange
inflates/deflates apparent policy gaps for the wrong reasons.
Regenerated 4 figures with --exclude-policies capped on
analysis/characterization/render_window1_figures.py:
- f4a_apc_loss.png (APC bars)
- f4c_apc_vs_hotspot_tradeoff.png (APC vs hotspot scatter)
- f4c_per_worker_ttft.png (per-worker TTFT panel)
- f6_e2e_latency_bars.png (TTFT/TPOT/E2E bars)
Added --exclude-policies CLI flag to the renderer so this is a
reversible choice, not a permanent script mutation. capped data remains
in b3_policy_comparison.json and can be brought back in workload-
sensitivity sections (where it actually belongs) by omitting the flag.
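A sketch of how such a flag can be wired with argparse. Only the flag name `--exclude-policies` comes from this commit; the `nargs="*"` choice and the helper names are assumptions, not the renderer's actual code:

```python
import argparse

ALL_POLICIES = ["lmetric", "load_only", "sticky", "unified", "capped"]


def build_parser():
    parser = argparse.ArgumentParser()
    # Assumption: zero or more policy names, space-separated.
    parser.add_argument("--exclude-policies", nargs="*", default=[], metavar="POLICY")
    return parser


def select_policies(all_policies, excluded):
    """Keep the original order and drop anything listed in excluded."""
    excluded = set(excluded)
    return [p for p in all_policies if p not in excluded]


args = build_parser().parse_args(["--exclude-policies", "capped"])
print(select_policies(ALL_POLICIES, args.exclude_policies))
```

Because the default is an empty list, omitting the flag leaves every policy in place, so nothing has to be edited to bring capped back.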
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Single-curve variant of f2b — production trace only, no replay overlay
and no uniform reference. Cleaner for boss-meeting/talk slides where the
extra context is noise. The combined three-curve figure is unchanged.
scripts/plot_session_skew_cdf.py: split into plot_combined +
plot_production_solo helpers; one run emits both PNGs.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pulls 456 (rank%, cum%) sample points from the raw production trace at
dash0:/home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl,
cached locally so the figure is reproducible without ssh access. Sampled
anchors match the precomputed summary exactly:
top 1% = 46.5%, top 5% = 66.5%, top 10% = 74.6%
plus newly readable points:
top 25% = 87.5%, top 50% = 96.0%
Workload characterization is now consistent with the production
distribution rather than the small replay subset. Replay window CDF kept
as an overlay to show the same hockey-stick shape on the data §5 actually
uses.
- analysis/characterization/data/production_session_skew_cdf.json: cached
sample points (29 KB), so the figure rebuilds locally
- scripts/plot_session_skew_cdf.py: now plots from the cache + replay raw
- MEETING.md / PAPER_OUTLINE.md: revert numbers to production trace,
add top-25%/50% data points
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous f2b_session_skew.png was a 3-bar chart (top 1/5/10%) computed
from the production trace summary (which is not present locally, only its
precomputed JSON). The new figure is a continuous CDF of cumulative
input-token mass vs session rank percentile, generated directly from the
replay trace traces/w600_r0.0015_st30.jsonl so any percentile is readable.
Headline numbers update accordingly:
replay trace (n=274 sessions): top 1% = 24.3%, top 5% = 61.9%, top 10% = 75.8%
production trace (n=1.3M): top 1% = 46.5%, top 5% = 66.5%, top 10% = 74.6%
Both show extreme skew well above the y=x uniform reference; the replay
trace is less extreme at top-1% because n=274 makes that bucket only
~3 sessions. We standardize §2/§3 narrative on the replay-trace numbers
so motivation matches §5 evaluation; production numbers kept as a side
note for context.
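A sketch of the CDF construction described above: sort sessions by input-token mass, then accumulate. The function names, the toy input, and the rounding used to size the top-k% bucket are assumptions; the real generator reads the replay trace:

```python
def skew_cdf(session_tokens):
    """Return [(rank percentile, cumulative token share %)], heaviest first."""
    totals = sorted(session_tokens, reverse=True)
    grand_total = sum(totals)
    points, running = [], 0
    for i, tokens in enumerate(totals, start=1):
        running += tokens
        points.append((100.0 * i / len(totals), 100.0 * running / grand_total))
    return points


def top_share(session_tokens, top_pct):
    """Share (%) of input tokens owned by the heaviest top_pct% of sessions."""
    totals = sorted(session_tokens, reverse=True)
    k = max(1, round(len(totals) * top_pct / 100.0))  # bucket rounding is assumed
    return 100.0 * sum(totals[:k]) / sum(totals)


tokens = [50, 30, 10, 5, 5]  # toy per-session input-token totals
print(skew_cdf(tokens)[0], top_share(tokens, 20))
```

With small n the top-1% bucket holds only a handful of sessions, so its share is sensitive to exactly how k is rounded.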
- scripts/plot_session_skew_cdf.py: reproducible figure generator
- MEETING.md / PAPER_OUTLINE.md: update narrative + caption
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
User pointed out the apparent paradox: in fig_b3_per_worker_ttft_p90, unified
has hotspot index 3.67 while sticky has 2.73, yet unified e2e p90 is roughly
half of sticky's. Resolution: hotspot index (max/median) is a *ratio* and
misleading on its own. Per-worker absolute TTFT p90:
sticky : median 20.3s, max 55.4s -> system e2e p90 34.6s
unified: median 10.3s, max 37.7s -> system e2e p90 18.0s
Mechanism: top 1% sessions own 46.5% input mass and there are more hot
sessions than instances (8), so sticky's hash binding gives *every* worker
its own hot session and the median worker is also slow. Unified's LMetric
fallback re-routes cold/new sessions away from hot affinity instances,
keeping 7 of the 8 workers fast. System p90 is dominated by the majority of
requests landing on fast workers, hence the 2x e2e gap.
Changes:
- Replace §3.3 figure with figs/f4c_per_worker_ttft.png (per-worker bars)
instead of figs/f4c_apc_vs_hotspot_tradeoff.png (the ratio scatter)
- §3.3 narrative in PAPER_OUTLINE.md and MEETING.md rewritten around
absolute median + max + system e2e p90 instead of hotspot ratio
- Add a §3.3 sub-finding: "hot pin failure must be measured with
per-worker absolute latency, not normalized ratio"
- Keep the scatter as supplementary for §5 multi-policy summary
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Convert figs/f4b_pdsep_kv_wall.pdf to PNG via pdftoppm @ 150 DPI so
MEETING.md and PAPER_OUTLINE.md render the figure inline on GitHub /
any standard markdown viewer (PDF ![]() embeds don't render).
- PAPER_OUTLINE.md F2, F4, F6: switch from backtick code references to
proper ![]() image embeds so the doc is actually viewable as a deck.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- replayer/replay.py: emit trace_span_s and amplification in summary
(Phase 1 of the wall-clock amplification measurement plan; needed for
§2.3 dispatch coupling empirical closure)
- figs/: 8 reusable figures copied from analysis/ with paper-spec names
(f2a/b/c workload, f4a/b/c/d failure modes, f6 e2e partial)
- PAPER_OUTLINE.md: real figure paths, explicit TBD markers for
custom drawings and pending data; new "Validation Status" table at top
and reorganized "Work Plan" splitting can-do-now vs migration-deferred
Migration validation deferred per user: 4 prior attempts (6b255fa,
e991960/5772149, cc6e562, 4c583f2) were reverted due to transfer
overhead; pending re-test on top of connector_tax DR-fix.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>