f2b: regenerate CDF from production trace (1.3M sessions on dash0)

Pulls 456 (rank%, cum%) sample points from the raw production trace at
dash0:/home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl,
cached locally so the figure is reproducible without ssh access. Sampled
anchors match the precomputed summary exactly:
  top 1% = 46.5%, top 5% = 66.5%, top 10% = 74.6%
plus newly readable points:
  top 25% = 87.5%, top 50% = 96.0%
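
The (rank%, cum%) points above can be derived roughly as follows. This is a minimal sketch, assuming each session's input tokens have already been summed; the function names `session_skew_cdf` and `top_share` are hypothetical and not taken from scripts/plot_session_skew_cdf.py, and the data is a toy list, not the production trace.

```python
from typing import Iterable, List, Tuple


def session_skew_cdf(tokens_per_session: Iterable[float]) -> List[Tuple[float, float]]:
    """Return (rank %, cumulative input-token %) points, heaviest session first."""
    totals = sorted(tokens_per_session, reverse=True)
    grand_total = sum(totals)
    if not totals or grand_total <= 0:
        return []
    points = []
    running = 0.0
    for rank, tokens in enumerate(totals, start=1):
        running += tokens
        points.append((100.0 * rank / len(totals), 100.0 * running / grand_total))
    return points


def top_share(points: List[Tuple[float, float]], rank_pct: float) -> float:
    """Cumulative share (%) held by the top `rank_pct` percent of sessions."""
    # Take the last point whose rank does not exceed the requested percentile.
    eligible = [cum for rank, cum in points if rank <= rank_pct + 1e-9]
    return eligible[-1] if eligible else 0.0


# Toy per-session input-token totals (hypothetical values).
toy = [50, 20, 10, 5, 5, 4, 3, 1, 1, 1]
points = session_skew_cdf(toy)
print("top 10% share:", top_share(points, 10))
print("top 50% share:", top_share(points, 50))
```

Sorting heaviest-first is what produces the hockey-stick shape: the curve rises steeply over the first few percent of sessions and then flattens.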

Workload characterization is now consistent with the production
distribution rather than the small replay subset. Replay window CDF kept
as an overlay to show the same hockey-stick shape on the data §5 actually
uses.

- analysis/characterization/data/production_session_skew_cdf.json: cached
  sample points (29 KB), so the figure rebuilds locally
- scripts/plot_session_skew_cdf.py: now plots from the cached production
  samples plus the raw replay data
- MEETING.md / PAPER_OUTLINE.md: revert numbers to production trace,
  add top-25%/50% data points

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 10:41:53 +08:00
parent 22c4aa58e4
commit 1220da249c
5 changed files with 67 additions and 42 deletions

@@ -48,7 +48,7 @@ Three essential differences between agentic workloads and chatbots:
- **Multi-turn, programmatic continuation**: each turn is triggered by the tool-call result of the previous turn, with no human think-time
- **Prefill-dominated**: input/output token ratio is **75x**, and 98% of compute is in the prefill phase (chatbot: 1-10x)
- **Skewed sessions**: on the replay trace, the top 1% of sessions contribute **24.3%** of input tokens, the top 5% **61.9%**, and the top 10% **75.8%** (vs. uniform 1/5/10%); on the full production trace (1.3M sessions) the skew is more extreme, with the top 1% reaching 46.5%
- **Skewed sessions** (from the Qwen3 production trace, n=1.3M sessions / 2.1M req / 7200s): the top 1% contribute **46.5%** of input tokens, the top 5% **66.5%**, the top 10% **74.6%**, the top 25% **87.5%**, and the top 50% **96.0%**; half of the sessions account for almost all of the input mass
Average session length is TBD turns and TBD input tokens; p99 per-request KV footprint is **11.49 GiB** (12% of the H20's 96GB HBM).
@@ -68,7 +68,7 @@ Breakdown of KV reuse on the trace:
![F2a Reuse topology — intra 93.2% / cross 5.7% / shared 1.1%](figs/f2a_reuse_topology.png)
![F2b Session skew CDF — top 1% = 24.3%, top 5% = 61.9%, top 10% = 75.8% input mass (replay trace)](figs/f2b_session_skew.png)
![F2b Session input-token mass CDF — production trace top 1%/5%/10%/25%/50% = 46.5%/66.5%/74.6%/87.5%/96.0% (replay window overlaid for sanity)](figs/f2b_session_skew.png)
![F2c KV footprint CDF — p99 = 11.8 GiB ≈ 12% of H20](figs/f2c_kv_footprint_cdf.png)
@@ -137,7 +137,7 @@ Round-robin and load-aware routing (e.g. LMetric, OSDI'26) maximize instance
| `unified` (affinity + LMetric fallback) | **10.3 s** | 37.7 s | **18.0 s** |
| `lmetric` | 14.0 s | 31.3 s | 24.8 s |
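Mechanism: the top 5% of sessions hold ~62% of the input mass, and the number of hot sessions far exceeds the instance count (8). Sticky hash binding makes **every worker take on its own share of hot sessions**, so even the median worker is slowed to the 20s range. unified's LMetric fallback reroutes cold/new sessions to non-hot workers, preserving the speed of 7/8 of the workers. System p90 is determined by the majority of requests, so unified's e2e p90 is ~2x faster than sticky's.
Mechanism: on the production trace, the top 1% of sessions hold 46.5% of the input mass and the top 5% hold 66.5%, and the number of hot sessions far exceeds the instance count (8). Sticky hash binding makes **every worker take on its own share of hot sessions**, so even the median worker is slowed to the 20s range. unified's LMetric fallback reroutes cold/new sessions to non-hot workers, preserving the speed of 7/8 of the workers. System p90 is determined by the majority of requests, so unified's e2e p90 is ~2x faster than sticky's.
**Note**: the hotspot ratio (max/median) is misleading on its own: sticky's 2.73 is *lower* than unified's 3.67, but because sticky's median is also high (20.3s vs. unified's 10.3s), the system is slower overall. A useful §3.3 sub-finding: **hot pin failure must be measured with per-worker absolute latency, not a normalized ratio**.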
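
The point about normalized versus absolute metrics can be checked with a few lines of arithmetic. This is a sketch only: it assumes the first two columns of the table rows above are median and max worker latency (consistent with 37.7 / 10.3 ≈ 3.67 for `unified`), and it takes sticky's median and ratio directly from the note because sticky's max is not shown in this hunk. The helper `hotspot_ratio` is a hypothetical name.

```python
def hotspot_ratio(max_s: float, median_s: float) -> float:
    """Normalized hotspot ratio: slowest worker latency over median worker latency."""
    if median_s <= 0:
        raise ValueError("median latency must be positive")
    return max_s / median_s


# (median worker latency in seconds, hotspot ratio) per routing policy.
# unified / lmetric: assumed to be the (median, max) table columns.
# sticky: median and ratio as quoted in the note.
policies = {
    "unified": (10.3, hotspot_ratio(37.7, 10.3)),
    "lmetric": (14.0, hotspot_ratio(31.3, 14.0)),
    "sticky": (20.3, 2.73),
}

# Rank policies best-to-worst under each metric.
by_ratio = sorted(policies, key=lambda name: policies[name][1])
by_median = sorted(policies, key=lambda name: policies[name][0])

print("best-to-worst by hotspot ratio: ", by_ratio)
print("best-to-worst by median latency:", by_median)
```

Under these assumptions the two rankings disagree: the ratio places `unified` last, while the absolute median places it first and `sticky` last.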