f2b: regenerate CDF from production trace (1.3M sessions on dash0)
Pulls 456 (rank%, cum%) sample points from the raw production trace at
dash0:/home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl,
cached locally so the figure is reproducible without ssh access.

Sampled anchors match the precomputed summary exactly:
  top 1% = 46.5%, top 5% = 66.5%, top 10% = 74.6%
plus newly readable points:
  top 25% = 87.5%, top 50% = 96.0%

Workload characterization is now consistent with the production
distribution rather than the small replay subset. Replay window CDF kept
as an overlay to show the same hockey-stick shape on the data §5 actually
uses.

- analysis/characterization/data/production_session_skew_cdf.json: cached sample points (29 KB), so the figure rebuilds locally
- scripts/plot_session_skew_cdf.py: now plots from the cache + replay raw
- MEETING.md / PAPER_OUTLINE.md: revert numbers to production trace, add top-25%/50% data points

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
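For readers unfamiliar with the (rank%, cum%) representation, the following is a minimal sketch of how such points could be derived from per-session input-token totals and how a "top k%" anchor is read back from them. It is not the code in scripts/plot_session_skew_cdf.py; the function names `top_share_curve` and `share_of_top` and the toy token counts are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (rank %, cumulative input-token share %)


def top_share_curve(session_tokens: Sequence[float]) -> List[Point]:
    """Sort sessions heaviest first and accumulate their share of all tokens."""
    ordered = sorted(session_tokens, reverse=True)
    total = sum(ordered)
    if not ordered or total <= 0:
        raise ValueError("need at least one session with a positive token count")
    points: List[Point] = []
    running = 0.0
    for i, tokens in enumerate(ordered, start=1):
        running += tokens
        points.append((100.0 * i / len(ordered), 100.0 * running / total))
    return points


def share_of_top(points: Sequence[Point], rank_pct: float) -> float:
    """Share (%) held by the heaviest rank_pct% of sessions.

    Interpolates linearly between neighbouring points, which only matters
    when the curve has been down-sampled (hypothetically, to a few hundred
    points as in the cached JSON).
    """
    prev_rank, prev_cum = 0.0, 0.0
    for rank, cum in points:
        if rank >= rank_pct:
            span = rank - prev_rank
            frac = (rank_pct - prev_rank) / span if span else 1.0
            return prev_cum + frac * (cum - prev_cum)
        prev_rank, prev_cum = rank, cum
    return 100.0


if __name__ == "__main__":
    # Toy data (made up): one heavy session dominates ten sessions.
    toy = [70, 10, 5, 5, 4, 3, 1, 1, 1, 0]
    curve = top_share_curve(toy)
    for k in (10, 50):
        print(f"top {k}% of sessions -> {share_of_top(curve, k):.1f}% of input tokens")
```

A strongly concave curve of this kind is what the commit message calls the hockey-stick shape: most of the cumulative share is reached within the first few rank percent.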
@@ -48,7 +48,7 @@ Three fundamental differences between agentic workloads and chatbots:
- **Multi-turn, programmatic continuation**: each turn is triggered by the tool-call result of the previous turn, with no human think-time
- **Prefill-dominated**: the input/output token ratio is **75x**, and 98% of compute is spent in the prefill phase (chatbots: 1-10x)
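-- **Skewed sessions**: on the replay trace, the top 1% of sessions contribute **24.3%** of input tokens, the top 5% **61.9%**, and the top 10% **75.8%** (vs. 1/5/10% under a uniform distribution); the full production trace (1.3M sessions) is even more skewed, with the top 1% reaching 46.5%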
+- **Skewed sessions** (from the Qwen3 production trace, n = 1.3M sessions / 2.1M requests / 7200 s): the top 1% of sessions contribute **46.5%** of input tokens, the top 5% **66.5%**, the top 10% **74.6%**, the top 25% **87.5%**, and the top 50% **96.0%**; half of the sessions account for nearly all of the input mass
Average session length is TBD turns and TBD input tokens; the p99 per-request KV footprint is **11.49 GiB** (12% of the 96 GB HBM on an H20).
@@ -68,7 +68,7 @@ Breakdown of KV reuse on the trace:
@@ -137,7 +137,7 @@ Round-robin and load-aware routing (e.g., LMetric, OSDI'26) maximize instance
| `unified` (affinity + LMetric fallback) | **10.3 s** | 37.7 s | **18.0 s** |
| `lmetric` | 14.0 s | 31.3 s | 24.8 s |
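-Mechanism: the top 5% of sessions account for ~62% of input mass, and hot sessions far outnumber the instances (8); sticky's hash binding makes **every worker take on its own share of hot sessions**, so even the median worker is slowed to the 20 s range. unified uses the LMetric fallback to reroute cold/new sessions to non-hot workers, preserving the speed of 7/8 workers. System p90 is determined by the majority of requests, so unified is ~2x faster than sticky on e2e p90.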
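+Mechanism: on the production trace, the top 1% of sessions account for 46.5% of input mass and the top 5% for 66.5%, and hot sessions far outnumber the instances (8); sticky's hash binding makes **every worker take on its own share of hot sessions**, so even the median worker is slowed to the 20 s range. unified uses the LMetric fallback to reroute cold/new sessions to non-hot workers, preserving the speed of 7/8 workers. System p90 is determined by the majority of requests, so unified is ~2x faster than sticky on e2e p90.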
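The contrast between the two policies in the mechanism paragraph can be sketched roughly as follows. This is an illustrative model only, not the routers evaluated in the table: `sticky_route`, `unified_route`, the `hot_threshold` parameter, and the use of plain least-loaded selection as a stand-in for LMetric scoring are all assumptions, since the source does not show how either policy is implemented.

```python
import hashlib
from typing import Dict, List


def sticky_route(session_id: str, n_workers: int) -> int:
    """Hash binding: a session always lands on the same worker, hot or not."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_workers


def unified_route(
    session_id: str,
    affinity: Dict[str, int],
    load: List[float],
    hot_threshold: float,
) -> int:
    """Affinity first; cold/new sessions fall back to a load-aware choice.

    `load` holds one number per worker (hypothetical unit, e.g. queued
    prefill tokens). Workers at or above `hot_threshold` are treated as hot
    and avoided for new sessions when any non-hot worker exists.
    """
    worker = affinity.get(session_id)
    if worker is not None:
        return worker  # known session: stay where its KV cache lives
    non_hot = [w for w, value in enumerate(load) if value < hot_threshold]
    candidates = non_hot or list(range(len(load)))  # all hot: use everyone
    worker = min(candidates, key=lambda w: load[w])  # stand-in for LMetric
    affinity[session_id] = worker
    return worker


if __name__ == "__main__":
    affinity: Dict[str, int] = {}
    load = [9.0, 1.0, 5.0]  # worker 0 is hot under threshold 8.0
    print("sticky :", sticky_route("session-a", len(load)))
    print("unified:", unified_route("session-a", affinity, load, hot_threshold=8.0))
```

Under this model the hash-bound policy spreads hot sessions over every worker by construction, while the fallback keeps new arrivals off workers that are already saturated.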
**Note**: the hotspot ratio (max/median) is misleading on its own. sticky's 2.73 is *lower* than unified's 3.67, but because sticky's median is also high (20.3 s vs. unified's 10.3 s), the system as a whole is slower. A useful §3.3 sub-finding: **hot pin failure must be measured by per-worker absolute latency, not by a normalized ratio**.
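A small sketch makes the disagreement between the two metrics concrete. It uses only the figures quoted in the note (sticky: ratio 2.73, median 20.3 s; unified: ratio 3.67, median 10.3 s); the `PolicyLatency` type and the two ranking helpers are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PolicyLatency:
    name: str
    median_worker_s: float  # median per-worker latency, in seconds
    hotspot_ratio: float    # max worker latency / median worker latency


def rank_by_ratio(policies: List[PolicyLatency]) -> List[str]:
    """Best first according to the normalized hotspot ratio (lower is better)."""
    return [p.name for p in sorted(policies, key=lambda p: p.hotspot_ratio)]


def rank_by_median(policies: List[PolicyLatency]) -> List[str]:
    """Best first according to absolute median worker latency (lower is better)."""
    return [p.name for p in sorted(policies, key=lambda p: p.median_worker_s)]


POLICIES = [
    PolicyLatency("sticky", median_worker_s=20.3, hotspot_ratio=2.73),
    PolicyLatency("unified", median_worker_s=10.3, hotspot_ratio=3.67),
]

if __name__ == "__main__":
    print("by hotspot ratio :", rank_by_ratio(POLICIES))
    print("by median latency:", rank_by_median(POLICIES))
```

The two rankings come out in opposite order: the ratio favors sticky because its denominator is inflated, whereas the absolute median favors unified.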