User-provided fresh run with five policies (lmetric, load_only, sticky,
unified, plus a new unified_v2 variant). Reproduces the v1 set under
figs/v2/ so we can A/B the same panels:
f4a_apc_loss.png — APC bars per policy
f4c_per_worker_ttft.png — per-worker TTFT p90 panel per policy
f6_e2e_latency_bars.png — TTFT/TPOT/E2E p90 bars per policy
f6_e2e_latency_full_grid — mean/p50/p90/p99 × TTFT/TPOT/E2E grid
scripts/render_b3_figures_v2.py is a standalone driver that reads each
policy's metrics.summary.json and breakdown.json directly from the run
directory — the breakdown.json `routed_to` field is required to recover
per-worker assignment because the new setup routes every request
through a proxy (127.0.0.1:9300), so metrics.jsonl's endpoint_url no
longer identifies the backend.
Headline numbers, new vs v1:
APC v2: lmetric 57.2% / load_only 53.9% / sticky 77.7%
unified 78.7% / unified_v2 78.4%
v1: lmetric 56.9% / load_only 54.1% / sticky 77.2% / unified 79.4%
TTFT p90 (s) v2: lmetric 14.8 / load_only 20.1 / sticky 14.8 /
unified 8.8 / unified_v2 10.1
v1: lmetric 15.7 / load_only 20.2 / sticky 18.0 / unified 7.3
E2E p90 (s) v2: lmetric 25.4 / load_only 33.9 / sticky 30.3 /
unified 20.0 / unified_v2 24.1
v1: lmetric 24.8 / load_only 33.5 / sticky 34.6 / unified 18.0
Worker p90 (s, median / max)
v2: lmetric 13.3/30.4 · load_only 21.3/29.2 · sticky 13.5/33.0
unified 10.0/35.1 · unified_v2 8.6/34.2
v1: lmetric 13.9/31.3 · load_only 19.4/25.1 · sticky 20.3/55.4
unified 10.3/37.7
Story is unchanged: unified dominates at p90 across TTFT/E2E and on
median-worker latency; unified_v2 is competitive at p50 but slightly
worse than unified at p90.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The earlier conversation suggested agentic might "have no human think-time"
and therefore live in a strict closed-loop regime. The user pushed back:
tool calls also take time and might restore a chatbot-like buffer between
turns. To resolve this, we go to the actual data.
The previously-published per-record formatted trace only carries arrival
timestamps, so an arrival-to-arrival diff conflates W_turn + T_external.
The raw trace (/home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/
051315-051317-raw.jsonl on dash0) additionally carries request_end_time_ms,
which lets us compute the pure inter-turn external gap
T_external = next.request_ready_time_ms - prev.request_end_time_ms
for each session's consecutive turn pair.
Headline numbers (n = 783 k inter-turn gaps over 127 k multi-turn sessions):
p25 = 0.69 s
p50 = 1.6 s
p75 = 8.6 s
p90 = 44 s
mean = 37 s (heavy long-tail; paused/abandoned sessions)
39 % of gaps < 1 s
67 % of gaps < 5 s
87 % of gaps < 30 s
The bulk of the distribution is dominated by sub-second to a-few-seconds
tool-call latencies. Under any current scheduler (e.g. unified TTFT p90 =
7.3 s, lmetric 15.7 s), W_turn is already at or above the 75th percentile
of T_external, so dispatch coupling is the dominant regime for the
majority of turns — not a corner case.
This corrects the earlier conflated arrival-to-arrival "median gap 11 s"
figure (which folded W_turn into T_external). The true T_external median
is 1.6 s.
Adds:
- scripts/compute_inter_turn_gap_remote.py: dash0-side aggregator
- analysis/characterization/data/agentic_inter_turn_gap.json: 500-point
CDF cache + summary stats, scp'd back from dash0
- scripts/plot_inter_turn_gap.py: local figure renderer
- figs/f3a_inter_turn_gap.png: log-x CDF with p25/p50/p75/p90 anchors and
unified/lmetric TTFT p90 reference lines
Next step (per user): pull a chatbot trace through the same pipeline and
compare distributions side by side; this will let §2.3 stop hand-waving
about "no think-time" and instead present the regime split empirically.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Old f2c plotted per-request KV footprint MiB against an "H20 ~95 GiB
usable" reference line. That ceiling was wrong — a 30B-A3B bf16
deployment burns roughly:
~50% HBM for model params (~48 GiB on 96 GiB H20)
~10% for runtime activation buffers
~40% left for the KV cache pool (~38.4 GiB)
so 95 GiB was overstating the available pool by 2.5×.
New f2c reframes the same data into the answer that actually motivates
the paper: how many concurrent decodes does a single instance hold,
and how does PD-disagg change that? Grouped bars per percentile show
system-wide concurrent decode capacity for three 8-GPU deployments:
Combined 8C, PD-disagg 4P+4D (N_D=4), PD-disagg 6P+2D (N_D=2)
Key reads off the figure:
p50 (1.8 GiB/req): 20 fit/inst → 160 / 80 / 40 system-wide
p90 (8.0 GiB/req): 4 fit/inst → 32 / 16 / 8
p95 (9.6 GiB/req): 4 fit/inst → 32 / 16 / 8
p99 (11.5 GiB/req): 3 fit/inst → 24 / 12 / 6
PD-disagg 4P+4D literally halves the decode population at the same
per-request KV pressure — this is the concrete §3.2 "KV memory wall"
penalty stated in terms users care about (concurrency).
- analysis/characterization/render_window1_figures.py:
fig_kv_footprint_cdf rewritten; reads same kv_footprint_summary.json
but computes floor(KV_pool / req_size) × N_D and annotates the
per-instance fit count below each percentile group.
- figs/f2c_kv_footprint_cdf.png: regenerated.
- MEETING.md / PAPER_OUTLINE.md §2.1, §2.4: prose updated with the
new ceiling and the "3 p99 decodes per instance / halved by PD-disagg"
framing.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The headline f6_e2e_latency_bars only shows p90, hiding three regimes:
- mean: unified dominates (3.3s TTFT, 7.0s E2E vs sticky 5.6s / 12.1s)
- p50: sticky and unified are tied on first-turn TTFT (0.5s each) —
sticky's first turn of each session is free, after which queues
accumulate. Unified beats sticky everywhere else.
- p99: tail amplification reveals unified's biggest gap —
TTFT 42.3s vs sticky 74.1s; E2E 68.8s vs sticky 139.7s.
The 12-panel figure is the honest full picture; the 3-panel headline
stays for slide-friendly summary.
- analysis/characterization/window_1_results/raw_stats/{policy}.json:
cached ttft/tpot/e2e {mean,p50,p90,p99} pulled from dash0
/home/admin/cpfs/wjh/agentic-kv/outputs/b3_sweep_20260525_095043/
(b3_policy_comparison.json doesn't record mean, only percentiles).
- analysis/characterization/render_window1_figures.py:
new fig_b3_latency_full_grid renders the 4×3 grid from the cache.
- figs/f6_e2e_latency_full_grid.png: 12-panel companion.
- PAPER_OUTLINE.md §5.2: both figures embedded; main table column
renamed from "Hotspot idx" to "Worker p90 (median / max)" to match
the new metric convention.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The max/median ratio inverts the actual user-facing p90 ranking:
sticky: hotspot=2.73 but system e2e p90 = 34.6s (worst)
unified: hotspot=3.67 but system e2e p90 = 18.0s (best)
because sticky's median is also high (everyone slow) while unified
concentrates the damage on one worker and keeps the other 7 fast.
Any "imbalance" metric structurally punishes the affinity-then-escape
schemes that we actually want to advocate for.
Changes:
- analysis/characterization/render_window1_figures.py:
fig_b3_per_worker_ttft now annotates each subplot with
"median X.Xs · max Y.Ys" instead of "hotspot=Y.YY"; docstring
documents why we drop the ratio.
- figs/f4c_per_worker_ttft.png: regenerated with new titles.
- figs/f4c_apc_vs_hotspot_tradeoff.png: deleted. The scatter's y-axis
was the deprecated ratio; superseded by f4c per-worker bars + f6
e2e bars which together carry the same information honestly.
- PAPER_OUTLINE.md: C3, §3.3, §4.1 wording, §5 metric list, §8
conclusion — replace "hotspot index" mentions with
"worst-worker p90" or "(median, max) worker p90"; promote the
§3.3 methodology note to a top-level sub-finding ("hot pin
failure must be measured with per-worker absolute latency,
not normalized ratio").
- MEETING.md: §3.3 narrative reworded to lead with the (median, max)
pair directly; explicit one-line note on why the ratio is dropped.
Conceptual uses of "hot session" / "hot instance" / "hot pin" remain
unchanged — only the *metric* called hotspot index is retired.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
'capped' is not a routing policy — it's lmetric run on a separately
truncated trace (sessions capped to 8 turns via build_capped_trace.py).
Putting it alongside lmetric/load_only/sticky/unified in per-policy
comparison figures is misleading because the workload differs, not
the routing decision. Comparing apples to a different-trace orange
inflates/deflates apparent policy gaps for the wrong reasons.
Regenerated 4 figures with --exclude-policies capped on
analysis/characterization/render_window1_figures.py:
- f4a_apc_loss.png (APC bars)
- f4c_apc_vs_hotspot_tradeoff.png (APC vs hotspot scatter)
- f4c_per_worker_ttft.png (per-worker TTFT panel)
- f6_e2e_latency_bars.png (TTFT/TPOT/E2E bars)
Added --exclude-policies CLI flag to the renderer so this is a
reversible choice, not a permanent script mutation. capped data remains
in b3_policy_comparison.json and can be brought back in workload-
sensitivity sections (where it actually belongs) by omitting the flag.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Single-curve variant of f2b — production trace only, no replay overlay
and no uniform reference. Cleaner for boss-meeting/talk slides where the
extra context is noise. The combined three-curve figure is unchanged.
scripts/plot_session_skew_cdf.py: split into plot_combined +
plot_production_solo helpers; one run emits both PNGs.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pulls 456 (rank%, cum%) sample points from the raw production trace at
dash0:/home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl,
cached locally so the figure is reproducible without ssh access. Sampled
anchors match the precomputed summary exactly:
top 1% = 46.5%, top 5% = 66.5%, top 10% = 74.6%
plus newly readable points:
top 25% = 87.5%, top 50% = 96.0%
Workload characterization is now consistent with the production
distribution rather than the small replay subset. Replay window CDF kept
as an overlay to show the same hockey-stick shape on the data §5 actually
uses.
- analysis/characterization/data/production_session_skew_cdf.json: cached
sample points (29 KB), so the figure rebuilds locally
- scripts/plot_session_skew_cdf.py: now plots from the cache + replay raw
- MEETING.md / PAPER_OUTLINE.md: revert numbers to production trace,
add top-25%/50% data points
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous f2b_session_skew.png was a 3-bar chart (top 1/5/10%) computed
from the production trace summary (which is not present locally, only its
precomputed JSON). The new figure is a continuous CDF of cumulative
input-token mass vs session rank percentile, generated directly from the
replay trace traces/w600_r0.0015_st30.jsonl so any percentile is readable.
Headline numbers update accordingly:
replay trace (n=274 sessions): top 1% = 24.3%, top 5% = 61.9%, top 10% = 75.8%
production trace (n=1.3M): top 1% = 46.5%, top 5% = 66.5%, top 10% = 74.6%
Both show extreme skew well above the y=x uniform reference; the replay
trace is less extreme at top-1% because n=274 makes that bucket only
~3 sessions. We standardize §2/§3 narrative on the replay-trace numbers
so motivation matches §5 evaluation; production numbers kept as a side
note for context.
- scripts/plot_session_skew_cdf.py: reproducible figure generator
- MEETING.md / PAPER_OUTLINE.md: update narrative + caption
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
User pointed out the apparent paradox: in fig_b3_per_worker_ttft_p90, unified
has hotspot index 3.67 while sticky has 2.73, yet unified e2e p90 is roughly
half of sticky's. Resolution: hotspot index (max/median) is a *ratio* and
misleading on its own. Per-worker absolute TTFT p90:
sticky : median 20.3s, max 55.4s -> system e2e p90 34.6s
unified: median 10.3s, max 37.7s -> system e2e p90 18.0s
Mechanism: top 1% sessions own 46.5% input mass and there are more hot
sessions than instances (8), so sticky's hash binding gives *every* worker
its own hot session and the median worker is also slow. Unified's LMetric
fallback re-routes cold/new sessions away from hot affinity instances,
preserving 7/8 worker speed. System p90 is dominated by the majority of
requests landing on fast workers, hence the 2x e2e gap.
Changes:
- Replace §3.3 figure with figs/f4c_per_worker_ttft.png (per-worker bars)
instead of figs/f4c_apc_vs_hotspot_tradeoff.png (the ratio scatter)
- §3.3 narrative in PAPER_OUTLINE.md and MEETING.md rewritten around
absolute median + max + system e2e p90 instead of hotspot ratio
- Add a §3.3 sub-finding: "hot pin failure must be measured with
per-worker absolute latency, not normalized ratio"
- Keep the scatter as supplementary for §5 multi-policy summary
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Convert figs/f4b_pdsep_kv_wall.pdf to PNG via pdftoppm @ 150 DPI so
MEETING.md and PAPER_OUTLINE.md render the figure inline on GitHub /
any standard markdown viewer (PDF !() embeds don't render).
- PAPER_OUTLINE.md F2, F4, F6: switch from backtick code references to
proper ![]() image embeds so the doc is actually viewable as a deck.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- replayer/replay.py: emit trace_span_s and amplification in summary
(Phase 1 of the wall-clock amplification measurement plan; needed for
§2.3 dispatch coupling empirical closure)
- figs/: 8 reusable figures copied from analysis/ with paper-spec names
(f2a/b/c workload, f4a/b/c/d failure modes, f6 e2e partial)
- PAPER_OUTLINE.md: real figure paths, explicit TBD markers for
custom drawings and pending data; new "Validation Status" table at top
and reorganized "Work Plan" splitting can-do-now vs migration-deferred
Migration validation deferred per user: 4 prior attempts (6b255fa,
e991960/5772149, cc6e562, 4c583f2) were reverted due to transfer
overhead; pending re-test on top of connector_tax DR-fix.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>