f2b: replace top-1/5/10% bars with full CDF; align all docs to replay-trace numbers

The previous f2b_session_skew.png was a 3-bar chart (top 1/5/10%) computed from the production trace summary (which is not present locally, only its precomputed JSON). The new figure is a continuous CDF of cumulative input-token mass vs session rank percentile, generated directly from the replay trace traces/w600_r0.0015_st30.jsonl so any percentile is readable.

Headline numbers update accordingly:

  replay trace (n=274 sessions): top 1% = 24.3%, top 5% = 61.9%, top 10% = 75.8%
  production trace (n=1.3M): top 1% = 46.5%, top 5% = 66.5%, top 10% = 74.6%

Both show extreme skew well above the y=x uniform reference; the replay trace is less extreme at top-1% because n=274 makes that bucket only ~3 sessions.

We standardize §2/§3 narrative on the replay-trace numbers so motivation matches §5 evaluation; production numbers kept as a side note for context.

- scripts/plot_session_skew_cdf.py: reproducible figure generator
- MEETING.md / PAPER_OUTLINE.md: update narrative + caption

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 10:37:22 +08:00
parent 020a5c79a7
commit 22c4aa58e4
4 changed files with 95 additions and 5 deletions
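
The commit message describes the new figure as a CDF of cumulative input-token mass against session rank percentile. A minimal sketch of that computation follows; it is not the contents of scripts/plot_session_skew_cdf.py. The JSONL field names `session_id` and `input_tokens` and all three function names are assumptions for illustration, since the trace schema is not shown here. Rounding the top-k bucket up with `ceil` is also an assumption, chosen because it matches the "~3 sessions" figure quoted for top 1% of n=274.

```python
import json
import math
from collections import defaultdict


def session_token_totals(lines):
    """Sum input tokens per session from JSONL lines.

    Field names "session_id" and "input_tokens" are hypothetical.
    """
    totals = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the trace file
        rec = json.loads(line)
        totals[rec["session_id"]] += rec["input_tokens"]
    return dict(totals)


def skew_cdf(totals):
    """Return (rank_pct, cum_share), heaviest sessions first, both in percent."""
    masses = sorted(totals.values(), reverse=True)
    if not masses or sum(masses) == 0:
        raise ValueError("no input tokens to rank")
    grand, n = sum(masses), len(masses)
    rank_pct, cum_share, running = [], [], 0
    for i, mass in enumerate(masses, start=1):
        running += mass
        rank_pct.append(100.0 * i / n)           # x: session rank percentile
        cum_share.append(100.0 * running / grand)  # y: cumulative token share
    return rank_pct, cum_share


def top_share(totals, pct):
    """Percent of input tokens held by the top `pct` percent of sessions."""
    masses = sorted(totals.values(), reverse=True)
    if not masses or sum(masses) == 0:
        raise ValueError("no input tokens to rank")
    # Assumption: round the bucket size up, with at least one session.
    k = max(1, math.ceil(len(masses) * pct / 100.0))
    return 100.0 * sum(masses[:k]) / sum(masses)
```

A uniform workload would trace the y=x diagonal in `skew_cdf`, so the distance of the curve above that diagonal is the skew. With small n the head buckets are coarse: `math.ceil(274 * 1 / 100.0)` is 3, so a single session shifts the top-1% figure noticeably.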
--- a/PAPER_OUTLINE.md
+++ b/PAPER_OUTLINE.md
@@ -48,7 +48,7 @@ Three essential differences between agentic workloads and chatbots:

 - **Multi-turn, programmatic continuation**: each turn is triggered by the tool-call result of the previous turn, with no human think-time
 - **Prefill-dominated**: input/output token ratio of **75x**, with 98% of compute in the prefill phase (1-10x for chatbots)
-- **Skewed sessions**: the top 1% of sessions contribute **46.5%** of input tokens
+- **Skewed sessions**: on the replay trace the top 1% of sessions contribute **24.3%** of input tokens, the top 5% **61.9%**, and the top 10% **75.8%** (vs 1/5/10% under a uniform distribution); the full production trace (1.3M sessions) shows more extreme skew, with the top 1% reaching 46.5%

 Average session length is TBD turns and TBD input tokens; p99 per-request KV footprint is **11.49 GiB** (12% of the H20's 96GB HBM).

@@ -68,7 +68,7 @@ Breakdown of KV reuse on the trace:

 ![F2a Reuse topology — intra 93.2% / cross 5.7% / shared 1.1%](figs/f2a_reuse_topology.png)

-![F2b Session skew — top 1% = 46.5% input mass](figs/f2b_session_skew.png)
+![F2b Session skew CDF — top 1% = 24.3%, top 5% = 61.9%, top 10% = 75.8% input mass (replay trace)](figs/f2b_session_skew.png)

 ![F2c KV footprint CDF — p99 = 11.8 GiB ≈ 12% of H20](figs/f2c_kv_footprint_cdf.png)

@@ -137,7 +137,7 @@ Round-robin and load-aware routing (e.g. LMetric, OSDI'26) maximize instance
 | `unified` (affinity + LMetric fallback) | **10.3 s** | 37.7 s | **18.0 s** |
 | `lmetric` | 14.0 s | 31.3 s | 24.8 s |

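-Mechanism: the top 1% of sessions account for 46.5% of input mass, and the number of hot sessions is ≥ the number of instances (8); sticky's hash binding makes **every worker carry its own hot session**, so even the median worker is slowed to the 20 s range. unified uses the LMetric fallback to reroute cold/new sessions to non-hot workers, preserving the speed of 7/8 workers. System p90 is determined by the majority of requests, so unified is ~2x faster than sticky on e2e p90.
+Mechanism: the top 5% of sessions account for ~62% of input mass, and the number of hot sessions is far greater than the number of instances (8); sticky's hash binding makes **every worker carry its own hot session**, so even the median worker is slowed to the 20 s range. unified uses the LMetric fallback to reroute cold/new sessions to non-hot workers, preserving the speed of 7/8 workers. System p90 is determined by the majority of requests, so unified is ~2x faster than sticky on e2e p90.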

 **Note**: the hotspot ratio (max/median) is misleading on its own: sticky's 2.73 is *lower* than unified's 3.67, but because sticky's median is also high (20.3s vs 10.3s for unified), the system as a whole is slower. A useful §3.3 sub-finding: **hot pin failure must be measured by per-worker absolute latency, not by a normalized ratio**.