f2b: replace top-1/5/10% bars with full CDF; align all docs to replay-trace numbers

The previous f2b_session_skew.png was a 3-bar chart (top 1/5/10%) computed
from the production trace summary (which is not present locally, only its
precomputed JSON). The new figure is a continuous CDF of cumulative
input-token mass vs session rank percentile, generated directly from the
replay trace traces/w600_r0.0015_st30.jsonl so any percentile is readable.

Headline numbers update accordingly:
  replay trace (n=274 sessions): top 1% = 24.3%, top 5% = 61.9%, top 10% = 75.8%
  production trace (n=1.3M):     top 1% = 46.5%, top 5% = 66.5%, top 10% = 74.6%

Both show extreme skew well above the y=x uniform reference; the replay
trace is less extreme at top-1% because n=274 makes that bucket only
~3 sessions. We standardize §2/§3 narrative on the replay-trace numbers
so motivation matches §5 evaluation; production numbers kept as a side
note for context.
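
A quick way to see why the top-1% bucket is so coarse on the replay trace is to count the sessions in each bucket. The sketch below is illustrative only: the helper name `top_bucket_size` is hypothetical, the round-up convention is assumed to match the one in scripts/plot_session_skew_cdf.py, and 1.3M is taken as a round figure for the production trace.

```python
import math


def top_bucket_size(n_sessions: int, pct: float) -> int:
    """Number of sessions in the top `pct` percent, rounding up (assumed convention)."""
    return math.ceil(n_sessions * pct / 100)


# 274 = replay trace; 1_300_000 = approximate production session count.
for n in (274, 1_300_000):
    sizes = {p: top_bucket_size(n, p) for p in (1, 5, 10)}
    print(n, sizes)
```

At n=274 the top-1% bucket holds 3 sessions, so a single session moves that number by several points; the top-5% and top-10% buckets (14 and 28 sessions) are less sensitive.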

- scripts/plot_session_skew_cdf.py: reproducible figure generator
- MEETING.md / PAPER_OUTLINE.md: update narrative + caption

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 10:37:22 +08:00
parent 020a5c79a7
commit 22c4aa58e4
4 changed files with 95 additions and 5 deletions


@@ -26,7 +26,7 @@ L = Λ · N · W_turn(L) # agentic, T_human≈0
| | Data | Figure |
|---|---|---|
| KV reuse is almost entirely intra-session | intra 93.2% / cross 5.7% / shared 1.1% | ![](figs/f2a_reuse_topology.png) |
-| Sessions are extremely skewed | top 1% = 46.5% input mass | ![](figs/f2b_session_skew.png) |
+| Sessions are extremely skewed | on replay, top 1% / 5% / 10% = 24% / 62% / 76% of input mass; the full production trace is steeper (top 1% = 46.5%) | ![](figs/f2b_session_skew.png) |
| Per-request KV footprint is already large | p99 = 11.8 GiB ≈ 12% of an H20 | ![](figs/f2c_kv_footprint_cdf.png) |
Theoretical APC upper bound = intra-session 79.6% / any-session 80.3%, a gap of <1pp. **Any scheduler without affinity loses the vast majority of reuse.**
@@ -58,7 +58,7 @@ the average agentic request is 33.6k tokens and needs 3.3GB of KV; 4P+4D / 6P+2D in the agentic regime
| sticky | **20.3s** | 55.4s | **34.6s** |
| unified | **10.3s** | 37.7s | **18.0s** |
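-Mechanism: the top 1% of sessions account for 46.5% of input, and hot sessions outnumber the 8 instances, so sticky hash binding makes **every worker carry its own hot session**, and even the median worker is slowed down. Unified's LMetric fallback reroutes cold/new sessions to non-hot workers, preserving the speed of 7/8 of the workers. System p90 is determined by the majority of requests, so unified is almost 2x faster.
+Mechanism: the top 5% of sessions account for ~62% of input, and hot sessions outnumber the 8 instances, so sticky hash binding makes **every worker carry its own hot session**, and even the median worker is slowed down. Unified's LMetric fallback reroutes cold/new sessions to non-hot workers, preserving the speed of 7/8 of the workers. System p90 is determined by the majority of requests, so unified is almost 2x faster.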
---


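@@ -48,7 +48,7 @@ Three essential differences between agentic workloads and chatbots:
- **Multi-turn, programmatic continuation**: each turn is triggered by the previous turn's tool-call result, with no human think-time
- **Prefill-dominated**: input/output token ratio of **75x**, with 98% of compute in the prefill stage (chatbots: 1-10x)
-- **Skewed sessions**: the top 1% of sessions contribute **46.5%** of input tokens
+- **Skewed sessions**: on the replay trace, the top 1% of sessions contribute **24.3%** of input tokens, the top 5% **61.9%**, and the top 10% **75.8%** (vs. 1/5/10% under a uniform distribution); on the full production trace (1.3M sessions) the skew is more extreme, with the top 1% reaching 46.5%
Average session length is TBD turns and TBD input tokens; p99 per-request KV footprint is **11.49 GiB** (12% of the H20's 96GB HBM).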
@@ -68,7 +68,7 @@ Breakdown of KV reuse on the trace:
![F2a Reuse topology — intra 93.2% / cross 5.7% / shared 1.1%](figs/f2a_reuse_topology.png)
-![F2b Session skew — top 1% = 46.5% input mass](figs/f2b_session_skew.png)
+![F2b Session skew CDF — top 1% = 24.3%, top 5% = 61.9%, top 10% = 75.8% input mass (replay trace)](figs/f2b_session_skew.png)
![F2c KV footprint CDF — p99 = 11.8 GiB ≈ 12% of H20](figs/f2c_kv_footprint_cdf.png)
@@ -137,7 +137,7 @@ Round-robin and load-aware routing (e.g. LMetric, OSDI'26) maximize instance
| `unified` (affinity + LMetric fallback) | **10.3 s** | 37.7 s | **18.0 s** |
| `lmetric` | 14.0 s | 31.3 s | 24.8 s |
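-Mechanism: the top 1% of sessions account for 46.5% of input mass, and hot sessions outnumber the 8 instances, so sticky hash binding makes **every worker carry its own hot session**, and even the median worker is slowed to the 20s range. Unified's LMetric fallback reroutes cold/new sessions to non-hot workers, preserving the speed of 7/8 of the workers. System p90 is determined by the majority of requests, so unified's e2e p90 is ~2x faster than sticky's.
+Mechanism: the top 5% of sessions account for ~62% of input mass, and hot sessions far outnumber the 8 instances, so sticky hash binding makes **every worker carry its own hot session**, and even the median worker is slowed to the 20s range. Unified's LMetric fallback reroutes cold/new sessions to non-hot workers, preserving the speed of 7/8 of the workers. System p90 is determined by the majority of requests, so unified's e2e p90 is ~2x faster than sticky's.
**Note**: the hotspot ratio (max/median) is misleading on its own: sticky's 2.73 is *lower* than unified's 3.67, but because sticky's median is also high (20.3s vs. unified's 10.3s), the system as a whole is slower. A useful §3.3 sub-finding: **hot pin failure must be measured by per-worker absolute latency, not by a normalized ratio**.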
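
The note on the hotspot ratio can be checked with a few lines of arithmetic. The sketch below copies the sticky and unified latencies from the table; reading the first two columns as (median worker, max worker) is inferred from the note rather than stated in the table header, and the name `hotspot_ratio` is hypothetical. Because the inputs are rounded table values, the printed ratios may differ from the reported ones in the last digit.

```python
# Per-worker latencies in seconds, copied from the table above.
# Assumption: the two columns used here are (median worker, max worker).
workers = {
    "sticky": {"median": 20.3, "max": 55.4},
    "unified": {"median": 10.3, "max": 37.7},
}


def hotspot_ratio(stats: dict) -> float:
    """Normalized hotspot ratio: slowest worker divided by median worker."""
    return stats["max"] / stats["median"]


for name, stats in workers.items():
    print(f"{name}: ratio={hotspot_ratio(stats):.2f}, median={stats['median']}s")
```

Sticky shows the lower ratio but the higher median, which is why the ratio alone ranks the two policies the wrong way round.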

Binary file not shown. Size before: 55 KiB. Size after: 94 KiB.


@@ -0,0 +1,90 @@
#!/usr/bin/env python3
"""Plot a CDF of cumulative input-token mass by session rank.

Reads a JSONL trace (chat_id, session_id, input_length, ...), aggregates
per-session input_length, sorts sessions descending by total, and plots
cumulative fraction of input-token mass vs session-rank percentile.

The figure replaces the previous discrete top-1%/5%/10% bars with a
continuous curve so any percentile can be read off directly.
"""
from __future__ import annotations

import argparse
import json
from collections import defaultdict
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np


def load_session_input_tokens(trace_path: Path) -> dict[str, int]:
    """Sum input_length per session_id over all rows of the trace."""
    totals: dict[str, int] = defaultdict(int)
    with trace_path.open() as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines, e.g. a trailing newline
            row = json.loads(line)
            totals[row["session_id"]] += int(row["input_length"])
    return dict(totals)


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--trace",
        default="traces/w600_r0.0015_st30.jsonl",
        help="JSONL trace path",
    )
    parser.add_argument(
        "--out",
        default="figs/f2b_session_skew.png",
        help="Output figure path",
    )
    args = parser.parse_args()

    session_totals = load_session_input_tokens(Path(args.trace))
    n_sessions = len(session_totals)
    if n_sessions == 0:
        # An empty trace would otherwise divide by zero below.
        raise SystemExit(f"no sessions found in {args.trace}")

    # Heaviest sessions first, then cumulative share of total input tokens.
    sorted_vals = np.sort(np.array(list(session_totals.values())))[::-1]
    cum = np.cumsum(sorted_vals) / sorted_vals.sum()
    rank_pct = np.arange(1, n_sessions + 1) / n_sessions * 100

    # Index of the last session inside each top-p% bucket (bucket size rounds up).
    marks = [1, 5, 10, 25, 50]
    mark_idx = [int(np.ceil(n_sessions * p / 100)) - 1 for p in marks]

    fig, ax = plt.subplots(figsize=(8, 5))
    ax.plot(rank_pct, cum * 100, color="#2f6fab", lw=2.2,
            label="cumulative input-token mass")
    ax.plot([0, 100], [0, 100], color="#999", ls="--", lw=1,
            label="uniform reference (y = x)")

    # Annotate the headline percentiles on the curve.
    for p, i in zip(marks, mark_idx):
        y = cum[i] * 100
        ax.scatter([p], [y], color="#c44e52", zorder=5, s=40)
        ax.annotate(
            f"top {p}% → {y:.1f}%",
            xy=(p, y),
            xytext=(p + 2, y - 5),
            fontsize=9,
            color="#333",
        )

    ax.set_xlim(0, 100)
    ax.set_ylim(0, 102)
    ax.set_xlabel("Session rank percentile (top → bottom by input-token mass)")
    ax.set_ylabel("Cumulative % of input-token mass")
    ax.set_title(
        f"Session input-token mass CDF "
        f"(n={n_sessions} sessions, "
        f"total={sorted_vals.sum() / 1e6:.1f} M tokens)"
    )
    ax.grid(True, alpha=0.3)
    ax.legend(loc="lower right", framealpha=0.9)

    out_path = Path(args.out)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    print(f"wrote {out_path}")


if __name__ == "__main__":
    main()
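
The aggregation and top-p% lookup in the script can be sanity-checked without matplotlib or the real trace. The following is a minimal stdlib-only sketch, not part of the commit: the helper names `session_totals` and `top_share` and the toy rows are hypothetical, while the field names `session_id` and `input_length` and the round-up bucket convention follow the script above.

```python
from __future__ import annotations

import json
import math
import tempfile
from collections import defaultdict
from pathlib import Path


def session_totals(trace_path: Path) -> dict[str, int]:
    """Sum input_length per session_id, mirroring the script's loader."""
    totals: dict[str, int] = defaultdict(int)
    with trace_path.open() as f:
        for line in f:
            if line.strip():
                row = json.loads(line)
                totals[row["session_id"]] += int(row["input_length"])
    return dict(totals)


def top_share(totals: dict[str, int], pct: float) -> float:
    """Fraction of input tokens held by the top `pct` percent of sessions."""
    vals = sorted(totals.values(), reverse=True)
    k = math.ceil(len(vals) * pct / 100)  # bucket size rounds up
    return sum(vals[:k]) / sum(vals)


# Synthetic toy trace: one heavy session "a" and three light ones.
rows = [
    {"session_id": "a", "input_length": 700},
    {"session_id": "a", "input_length": 200},
    {"session_id": "b", "input_length": 50},
    {"session_id": "c", "input_length": 30},
    {"session_id": "d", "input_length": 20},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy.jsonl"
    path.write_text("\n".join(json.dumps(r) for r in rows) + "\n")
    totals = session_totals(path)

print(totals)
print(top_share(totals, 25))  # top 25% of 4 sessions = session "a" -> 0.9
```

On the toy data the heaviest of four sessions carries 900 of 1000 tokens, so the top-25% share is 0.9; the same two helpers applied to a real trace should reproduce the annotated points on the CDF, assuming the trace uses the same field names.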