agentic-kvc

Author	SHA1	Message	Date
Gahow Wang	03d8c5d0d1	Render 4 per-policy figures on b3_replay_20260527_0114 into figs/v2/ User-provided fresh run with five policies (lmetric, load_only, sticky, unified, plus a new unified_v2 variant). Reproduces the v1 set under figs/v2/ so we can A/B the same panels: f4a_apc_loss.png (APC bars per policy); f4c_per_worker_ttft.png (per-worker TTFT p90 panel per policy); f6_e2e_latency_bars.png (TTFT/TPOT/E2E p90 bars per policy); f6_e2e_latency_full_grid (mean/p50/p90/p99 × TTFT/TPOT/E2E grid). scripts/render_b3_figures_v2.py is a standalone driver that reads each policy's metrics.summary.json and breakdown.json directly from the run directory. The breakdown.json `routed_to` field is required to recover per-worker assignment because the new setup routes every request through a proxy (127.0.0.1:9300), so metrics.jsonl's endpoint_url no longer identifies the backend. Headline numbers, new vs v1: APC v2: lmetric 57.2% / load_only 53.9% / sticky 77.7% / unified 78.7% / unified_v2 78.4%; v1: lmetric 56.9% / load_only 54.1% / sticky 77.2% / unified 79.4%. TTFT p90 (s) v2: lmetric 14.8 / load_only 20.1 / sticky 14.8 / unified 8.8 / unified_v2 10.1; v1: lmetric 15.7 / load_only 20.2 / sticky 18.0 / unified 7.3. E2E p90 (s) v2: lmetric 25.4 / load_only 33.9 / sticky 30.3 / unified 20.0 / unified_v2 24.1; v1: lmetric 24.8 / load_only 33.5 / sticky 34.6 / unified 18.0. Worker p90 (s, median / max) v2: lmetric 13.3/30.4 · load_only 21.3/29.2 · sticky 13.5/33.0 · unified 10.0/35.1 · unified_v2 8.6/34.2; v1: lmetric 13.9/31.3 · load_only 19.4/25.1 · sticky 20.3/55.4 · unified 10.3/37.7. Story is unchanged: unified dominates at p90 across TTFT/E2E and on median-worker latency; unified_v2 is competitive at p50 but slightly worse than unified at p90. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 13:52:17 +08:00
Gahow Wang	41232f49d3	Measure inter-turn T_external on the raw production trace; add f3a CDF The earlier conversation suggested agentic workloads might "have no human think-time" and therefore live in a strict closed-loop regime. The user pushed back: tool calls also take time and might restore a chatbot-like buffer between turns. To resolve this, we go to the actual data. The previously-published per-record formatted trace only carries arrival timestamps, so an arrival-to-arrival diff conflates W_turn + T_external. The raw trace (/home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/051315-051317-raw.jsonl on dash0) additionally carries request_end_time_ms, which lets us compute the pure inter-turn external gap T_external = next.request_ready_time_ms - prev.request_end_time_ms for each session's consecutive turn pair. Headline numbers (n = 783 k inter-turn gaps over 127 k multi-turn sessions): p25 = 0.69 s p50 = 1.6 s p75 = 8.6 s p90 = 44 s mean = 37 s (heavy long-tail; paused/abandoned sessions) 39 % of gaps < 1 s 67 % of gaps < 5 s 87 % of gaps < 30 s The bulk of the distribution is dominated by sub-second to a-few-seconds tool-call latencies. Under any current scheduler (e.g. unified TTFT p90 = 7.3 s, lmetric 15.7 s), W_turn is already at or above the 75th percentile of T_external, so dispatch coupling is the dominant regime for the majority of turns, not a corner case. This corrects the earlier conflated arrival-to-arrival "median gap 11 s" figure (which folded W_turn into T_external). The true T_external median is 1.6 s. Adds: - scripts/compute_inter_turn_gap_remote.py: dash0-side aggregator - analysis/characterization/data/agentic_inter_turn_gap.json: 500-point CDF cache + summary stats, scp'd back from dash0 - scripts/plot_inter_turn_gap.py: local figure renderer - figs/f3a_inter_turn_gap.png: log-x CDF with p25/p50/p75/p90 anchors and unified/lmetric TTFT p90 reference lines Next step (per user): pull a chatbot trace through the same pipeline and compare distributions side by side; this will let §2.3 stop hand-waving about "no think-time" and instead present the regime split empirically. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 12:37:32 +08:00
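The T_external definition in the commit above can be sketched as follows. This is a minimal illustration, not the contents of scripts/compute_inter_turn_gap_remote.py: the `session_id` key and the in-memory list of dicts are assumptions, while `request_ready_time_ms` and `request_end_time_ms` are the field names quoted in the message.

```python
from collections import defaultdict


def inter_turn_gaps_s(records):
    """Return T_external (seconds) for every consecutive turn pair per session.

    T_external = next.request_ready_time_ms - prev.request_end_time_ms
    """
    by_session = defaultdict(list)
    for rec in records:
        # "session_id" is a hypothetical key; the real trace schema may differ.
        by_session[rec["session_id"]].append(rec)

    gaps = []
    for turns in by_session.values():
        # Order turns by when each request became ready.
        turns.sort(key=lambda r: r["request_ready_time_ms"])
        for prev, nxt in zip(turns, turns[1:]):
            gap_ms = nxt["request_ready_time_ms"] - prev["request_end_time_ms"]
            gaps.append(gap_ms / 1000.0)
    return gaps
```

Single-turn sessions contribute no gaps, which is consistent with the message counting gaps only over multi-turn sessions.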
Gahow Wang	555cabcf1f	f2c: switch to per-instance decode-concurrency view; correct KV pool ceiling Old f2c plotted per-request KV footprint MiB against an "H20 ~95 GiB usable" reference line. That ceiling was wrong — a 30B-A3B bf16 deployment burns roughly: ~50% HBM for model params (~48 GiB on 96 GiB H20) ~10% for runtime activation buffers ~40% left for the KV cache pool (~38.4 GiB) so 95 GiB was overstating the available pool by 2.5×. New f2c reframes the same data into the answer that actually motivates the paper: how many concurrent decodes does a single instance hold, and how does PD-disagg change that? Grouped bars per percentile show system-wide concurrent decode capacity for three 8-GPU deployments: Combined 8C, PD-disagg 4P+4D (N_D=4), PD-disagg 6P+2D (N_D=2) Key reads off the figure: p50 (1.8 GiB/req): 20 fit/inst → 160 / 80 / 40 system-wide p90 (8.0 GiB/req): 4 fit/inst → 32 / 16 / 8 p95 (9.6 GiB/req): 4 fit/inst → 32 / 16 / 8 p99 (11.5 GiB/req): 3 fit/inst → 24 / 12 / 6 PD-disagg 4P+4D literally halves the decode population at the same per-request KV pressure — this is the concrete §3.2 "KV memory wall" penalty stated in terms users care about (concurrency). - analysis/characterization/render_window1_figures.py: fig_kv_footprint_cdf rewritten; reads same kv_footprint_summary.json but computes floor(KV_pool / req_size) × N_D and annotates the per-instance fit count below each percentile group. - figs/f2c_kv_footprint_cdf.png: regenerated. - MEETING.md / PAPER_OUTLINE.md §2.1, §2.4: prose updated with the new ceiling and the "3 p99 decodes per instance / halved by PD-disagg" framing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 11:28:47 +08:00
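The capacity rule in the commit above, floor(KV_pool / req_size) × N_D, can be checked with a short sketch. The function names are illustrative and not taken from render_window1_figures.py; the 38.4 GiB pool and the per-request sizes are the figures quoted in the message. The p50 row is left out because the rounded 1.8 GiB value on its own gives 21 fits rather than the stated 20, presumably because the unrounded footprint is slightly larger.

```python
from fractions import Fraction
from math import floor

# KV cache pool per instance: ~40% of a 96 GiB H20, per the commit message.
KV_POOL_GIB = Fraction("38.4")

# Decode instances per 8-GPU deployment, as listed in the commit message.
DEPLOYMENTS = {"Combined 8C": 8, "PD-disagg 4P+4D": 4, "PD-disagg 6P+2D": 2}


def fits_per_instance(req_gib: str) -> int:
    """How many requests of the given KV footprint fit in one instance's pool."""
    # Fractions avoid float rounding when the pool is an exact multiple (9.6 GiB).
    return floor(KV_POOL_GIB / Fraction(req_gib))


def system_capacity(req_gib: str, n_decode_instances: int) -> int:
    """System-wide concurrent decodes: per-instance fit count times N_D."""
    return fits_per_instance(req_gib) * n_decode_instances


if __name__ == "__main__":
    for label, req in [("p90", "8.0"), ("p95", "9.6"), ("p99", "11.5")]:
        row = " / ".join(str(system_capacity(req, n)) for n in DEPLOYMENTS.values())
        print(f"{label} ({req} GiB/req): {fits_per_instance(req)} fit/inst -> {row}")
```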
Gahow Wang	922d79ac95	Add full latency grid (mean/p50/p90/p99 × TTFT/TPOT/E2E) as f6 companion The headline f6_e2e_latency_bars only shows p90, hiding three regimes: - mean: unified dominates (3.3s TTFT, 7.0s E2E vs sticky 5.6s / 12.1s) - p50: sticky and unified are tied on first-turn TTFT (0.5s each) — sticky's first turn of each session is free, after which queues accumulate. Unified beats sticky everywhere else. - p99: tail amplification reveals unified's biggest gap — TTFT 42.3s vs sticky 74.1s; E2E 68.8s vs sticky 139.7s. The 12-panel figure is the honest full picture; the 3-panel headline stays for slide-friendly summary. - analysis/characterization/window_1_results/raw_stats/{policy}.json: cached ttft/tpot/e2e {mean,p50,p90,p99} pulled from dash0 /home/admin/cpfs/wjh/agentic-kv/outputs/b3_sweep_20260525_095043/ (b3_policy_comparison.json doesn't record mean, only percentiles). - analysis/characterization/render_window1_figures.py: new fig_b3_latency_full_grid renders the 4×3 grid from the cache. - figs/f6_e2e_latency_full_grid.png: 12-panel companion. - PAPER_OUTLINE.md §5.2: both figures embedded; main table column renamed from "Hotspot idx" to "Worker p90 (median / max)" to match the new metric convention. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 11:15:18 +08:00
Gahow Wang	5e6e98aee7	Replace max/median hotspot index with (median, max) absolute pair The max/median ratio inverts the actual user-facing p90 ranking: sticky: hotspot=2.73 but system e2e p90 = 34.6s (worst) unified: hotspot=3.67 but system e2e p90 = 18.0s (best) because sticky's median is also high (everyone slow) while unified concentrates the damage on one worker and keeps the other 7 fast. Any "imbalance" metric structurally punishes the affinity-then-escape schemes that we actually want to advocate for. Changes: - analysis/characterization/render_window1_figures.py: fig_b3_per_worker_ttft now annotates each subplot with "median X.Xs · max Y.Ys" instead of "hotspot=Y.YY"; docstring documents why we drop the ratio. - figs/f4c_per_worker_ttft.png: regenerated with new titles. - figs/f4c_apc_vs_hotspot_tradeoff.png: deleted. The scatter's y-axis was the deprecated ratio; superseded by f4c per-worker bars + f6 e2e bars which together carry the same information honestly. - PAPER_OUTLINE.md: C3, §3.3, §4.1 wording, §5 metric list, §8 conclusion — replace "hotspot index" mentions with "worst-worker p90" or "(median, max) worker p90"; promote the §3.3 methodology note to a top-level sub-finding ("hot pin failure must be measured with per-worker absolute latency, not normalized ratio"). - MEETING.md: §3.3 narrative reworded to lead with the (median, max) pair directly; explicit one-line note on why the ratio is dropped. Conceptual uses of "hot session" / "hot instance" / "hot pin" remain unchanged — only the metric called hotspot index is retired. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 11:07:12 +08:00
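A small sketch of why the ratio was retired, using only the sticky and unified numbers quoted in these commits (per-worker TTFT p90 median and max, plus system E2E p90). The dictionary layout and function name are illustrative. Ratios recomputed from the rounded values may differ in the last digit from the quoted ones.

```python
# Per-worker TTFT p90 (median, max) and system E2E p90, in seconds,
# as quoted in the commit messages.
STATS = {
    "sticky": {"median": 20.3, "max": 55.4, "system_e2e_p90": 34.6},
    "unified": {"median": 10.3, "max": 37.7, "system_e2e_p90": 18.0},
}


def hotspot_ratio(stats):
    """The retired metric: worst-worker p90 divided by median-worker p90."""
    return stats["max"] / stats["median"]


# Ranking by the ratio (lower looks "better") versus by user-facing E2E p90.
by_ratio = sorted(STATS, key=lambda p: hotspot_ratio(STATS[p]))
by_e2e = sorted(STATS, key=lambda p: STATS[p]["system_e2e_p90"])

if __name__ == "__main__":
    print("best by ratio:  ", by_ratio[0])
    print("best by E2E p90:", by_e2e[0])
```

The two rankings disagree: the ratio favors sticky because its median is also high, while the absolute (median, max) pair and the E2E p90 both favor unified.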
Gahow Wang	09ff1069c3	Drop 'capped' from per-policy figures (f4a, f4c×2, f6) 'capped' is not a routing policy: it is lmetric run on a separately truncated trace (sessions capped to 8 turns via build_capped_trace.py). Putting it alongside lmetric/load_only/sticky/unified in per-policy comparison figures is misleading because the workload differs, not the routing decision. Comparing apples to a different-trace orange inflates/deflates apparent policy gaps for the wrong reasons. Regenerated 4 figures with --exclude-policies capped on analysis/characterization/render_window1_figures.py: - f4a_apc_loss.png (APC bars) - f4c_apc_vs_hotspot_tradeoff.png (APC vs hotspot scatter) - f4c_per_worker_ttft.png (per-worker TTFT panel) - f6_e2e_latency_bars.png (TTFT/TPOT/E2E bars) Added --exclude-policies CLI flag to the renderer so this is a reversible choice, not a permanent script mutation. capped data remains in b3_policy_comparison.json and can be brought back in workload-sensitivity sections (where it actually belongs) by omitting the flag. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 10:57:43 +08:00
Gahow Wang	74e0c2157a	Add solo production-trace CDF figure (f2b_session_skew_prod.png) Single-curve variant of f2b — production trace only, no replay overlay and no uniform reference. Cleaner for boss-meeting/talk slides where the extra context is noise. The combined three-curve figure is unchanged. scripts/plot_session_skew_cdf.py: split into plot_combined + plot_production_solo helpers; one run emits both PNGs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 10:53:30 +08:00
Gahow Wang	1220da249c	f2b: regenerate CDF from production trace (1.3M sessions on dash0) Pulls 456 (rank%, cum%) sample points from the raw production trace at dash0:/home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl, cached locally so the figure is reproducible without ssh access. Sampled anchors match the precomputed summary exactly: top 1% = 46.5%, top 5% = 66.5%, top 10% = 74.6% plus newly readable points: top 25% = 87.5%, top 50% = 96.0% Workload characterization is now consistent with the production distribution rather than the small replay subset. Replay window CDF kept as an overlay to show the same hockey-stick shape on the data §5 actually uses. - analysis/characterization/data/production_session_skew_cdf.json: cached sample points (29 KB), so the figure rebuilds locally - scripts/plot_session_skew_cdf.py: now plots from the cache + replay raw - MEETING.md / PAPER_OUTLINE.md: revert numbers to production trace, add top-25%/50% data points Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 10:41:53 +08:00
Gahow Wang	22c4aa58e4	f2b: replace top-1/5/10% bars with full CDF; align all docs to replay-trace numbers The previous f2b_session_skew.png was a 3-bar chart (top 1/5/10%) computed from the production trace summary (which is not present locally, only its precomputed JSON). The new figure is a continuous CDF of cumulative input-token mass vs session rank percentile, generated directly from the replay trace traces/w600_r0.0015_st30.jsonl so any percentile is readable. Headline numbers update accordingly: replay trace (n=274 sessions): top 1% = 24.3%, top 5% = 61.9%, top 10% = 75.8% production trace (n=1.3M): top 1% = 46.5%, top 5% = 66.5%, top 10% = 74.6% Both show extreme skew well above the y=x uniform reference; the replay trace is less extreme at top-1% because n=274 makes that bucket only ~3 sessions. We standardize §2/§3 narrative on the replay-trace numbers so motivation matches §5 evaluation; production numbers kept as a side note for context. - scripts/plot_session_skew_cdf.py: reproducible figure generator - MEETING.md / PAPER_OUTLINE.md: update narrative + caption Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 10:37:22 +08:00
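The skew CDF described above (cumulative input-token mass against session rank percentile) can be sketched as below. This is an illustration under assumptions, not the code in scripts/plot_session_skew_cdf.py: it takes a plain list of per-session input-token totals, and the rounding used to pick the top-k bucket is a guess.

```python
def skew_cdf(session_tokens):
    """Return (rank %, cumulative token %) points, heaviest sessions first."""
    ranked = sorted(session_tokens, reverse=True)
    total = sum(ranked)
    points = []
    running = 0
    for i, tokens in enumerate(ranked, start=1):
        running += tokens
        points.append((100 * i / len(ranked), 100 * running / total))
    return points


def top_share(session_tokens, top_fraction):
    """Fraction of all input tokens owned by the heaviest `top_fraction` of sessions."""
    ranked = sorted(session_tokens, reverse=True)
    # Bucket size rounding is an assumption; at least one session is always kept.
    k = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)
```

With n=274 sessions the top-1% bucket under this rounding holds 3 sessions, which matches the "~3 sessions" remark in the message and explains why that point is noisier on the replay trace than on the production trace.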
Gahow Wang	020a5c79a7	§3.3 reframe: hot pin failure is uniformly-slow workers, not max/median ratio User pointed out the apparent paradox: in fig_b3_per_worker_ttft_p90, unified has hotspot index 3.67 while sticky has 2.73, yet unified e2e p90 is roughly half of sticky's. Resolution: hotspot index (max/median) is a ratio and misleading on its own. Per-worker absolute TTFT p90: sticky : median 20.3s, max 55.4s -> system e2e p90 34.6s unified: median 10.3s, max 37.7s -> system e2e p90 18.0s Mechanism: top 1% sessions own 46.5% input mass and there are more hot sessions than instances (8), so sticky's hash binding gives every worker its own hot session and the median worker is also slow. Unified's LMetric fallback re-routes cold/new sessions away from hot affinity instances, preserving 7/8 worker speed. System p90 is dominated by the majority of requests landing on fast workers, hence the 2x e2e gap. Changes: - Replace §3.3 figure with figs/f4c_per_worker_ttft.png (per-worker bars) instead of figs/f4c_apc_vs_hotspot_tradeoff.png (the ratio scatter) - §3.3 narrative in PAPER_OUTLINE.md and MEETING.md rewritten around absolute median + max + system e2e p90 instead of hotspot ratio - Add a §3.3 sub-finding: "hot pin failure must be measured with per-worker absolute latency, not normalized ratio" - Keep the scatter as supplementary for §5 multi-policy summary Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 10:10:23 +08:00
Gahow Wang	df0ee5a02b	Use PNG for KV memory wall figure; switch outline to inline image embeds - Convert figs/f4b_pdsep_kv_wall.pdf to PNG via pdftoppm @ 150 DPI so MEETING.md and PAPER_OUTLINE.md render the figure inline on GitHub / any standard markdown viewer (PDF ![]() embeds don't render). - PAPER_OUTLINE.md F2, F4, F6: switch from backtick code references to proper ![]() image embeds so the doc is actually viewable as a deck. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 09:13:26 +08:00
Gahow Wang	52cdb80367	EAR outline: copy reusable figures, mark migration sections deferred - replayer/replay.py: emit trace_span_s and amplification in summary (Phase 1 of the wall-clock amplification measurement plan; needed for §2.3 dispatch coupling empirical closure) - figs/: 8 reusable figures copied from analysis/ with paper-spec names (f2a/b/c workload, f4a/b/c/d failure modes, f6 e2e partial) - PAPER_OUTLINE.md: real figure paths, explicit TBD markers for custom drawings and pending data; new "Validation Status" table at top and reorganized "Work Plan" splitting can-do-now vs migration-deferred Migration validation deferred per user: 4 prior attempts (`6b255fa`, `e991960`/`5772149`, `cc6e562`, `4c583f2`) were reverted due to transfer overhead; pending re-test on top of connector_tax DR-fix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 01:44:13 +08:00

12 Commits