Use PNG for KV memory wall figure; switch outline to inline image embeds

- Convert figs/f4b_pdsep_kv_wall.pdf to PNG via pdftoppm @ 150 DPI so
  MEETING.md and PAPER_OUTLINE.md render the figure inline on GitHub /
  any standard markdown viewer (PDF ![]() embeds don't render).
- PAPER_OUTLINE.md F2, F4, F6: switch from backtick code references to
  proper ![]() image embeds so the doc is actually viewable as a deck.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
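The rewrite described in the commit message (backtick figure references becoming image embeds, with the PDF figure pointed at its PNG rendering) can be sketched as a small text transform. This is a minimal sketch, not the script used for the commit: the helper name `to_image_embed`, the regex, and the choice of a same-named `.png` output are all assumptions for illustration. It also assumes the PDF was rasterized beforehand, for example with `pdftoppm -png -r 150`; the commit only states "pdftoppm @ 150 DPI".

```python
import re

# Matches a backtick-quoted figure path such as `figs/f2a_reuse_topology.png`.
# The figs/ prefix and the png/pdf extensions are assumptions based on the
# paths that appear in this diff.
FIG_REF = re.compile(r"`(figs/[\w.\-]+\.(?:png|pdf))`")


def to_image_embed(line: str) -> str:
    """Replace the first backtick figure reference with a Markdown image embed.

    A .pdf path is rewritten to a same-named .png, on the assumption that the
    PDF has already been converted to PNG.
    """
    def repl(match: re.Match) -> str:
        path = re.sub(r"\.pdf$", ".png", match.group(1))
        return f"![]({path})"

    return FIG_REF.sub(repl, line, count=1)


if __name__ == "__main__":
    print(to_image_embed("- §3.2 `figs/f4b_pdsep_kv_wall.pdf` ✅"))
    print(to_image_embed("- `figs/f2a_reuse_topology.png` ✅"))
```

Lines without a figure reference pass through unchanged, so the function can be mapped over a whole Markdown file.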
@@ -64,9 +64,12 @@ Breakdown of KV reuse on the trace:
 Theoretical APC upper bound: any-session **80.3%**, intra-session-only **79.6%**, a gap of <1pp. **The cache is essentially session-local**; any scheduling policy that does not preserve session affinity throws away the vast majority of reuse opportunities.

 **Figure 2: Workload characterization (3 panels)** (existing data can be reused):
-- `figs/f2a_reuse_topology.png` ✅: intra-session 93.2% / cross-session 5.7% / shared 1.1% bar
-- `figs/f2b_session_skew.png` ✅: top 1%/5%/10% session input-token mass
-- `figs/f2c_kv_footprint_cdf.png` ✅: per-request KV footprint p50/p90/p95/p99 (p99 = 11.8 GiB ≈ 12% of H20)
+
+![](figs/f2a_reuse_topology.png)
+
+![](figs/f2b_session_skew.png)
+
+![](figs/f2c_kv_footprint_cdf.png)

 > 📝 Layout TBD: combine the three panels into one 1×3 figure, or spread them across §2.1/§2.2/§2.4, one each.

@@ -128,11 +131,18 @@ Round-robin and load-aware routing (e.g. LMetric, OSDI'26) maximize instance
 Hard session-to-instance binding restores locality (APC **77.2%**, 97% of the upper bound), but it pins the large sessions in the skew to a single instance, and **the interference index doubles from LMetric's 6.53 to 13.65** (same trace, same hardware). This violates the skew-tolerance requirement of §2.4.

 **Figure 4: Three baselines, three failure modes** (split into three sub-figures, placed in §3.1/§3.2/§3.3 respectively):
-- §3.1 `figs/f4a_apc_loss.png` ✅: measured APC vs. theoretical upper bound 79.6% (lmetric 56.9%, load_only 54.1%, capped 31.6%, sticky 77.2%, unified 79.4%)
-- §3.2 `figs/f4b_pdsep_kv_wall.pdf` ✅: D-side KV pool occupancy vs. per-request KV footprint; both 4P+4D and 6P+2D cross the 90% memory wall in the agentic regime
-- §3.3 `figs/f4c_apc_vs_hotspot_tradeoff.png` ✅: scatter plot of APC vs. hotspot index (unified/sticky in the high-APC, high-hotspot region; lmetric/load_only in the low-APC, low-hotspot region)

-> 📝 Optional: `figs/f4d_pd_interference.png` ✅, prefill-decode interference (an 8k prefill on the same GPU degrades TPOT by 66x); place in §3.3 to support the interference argument about sticky.
+§3.1: measured APC vs. theoretical upper bound 79.6% (lmetric 56.9%, load_only 54.1%, capped 31.6%, sticky 77.2%, unified 79.4%):
+![](figs/f4a_apc_loss.png)
+
+§3.2: D-side KV pool occupancy vs. per-request KV footprint; both 4P+4D and 6P+2D cross the 90% memory wall in the agentic regime:
+![](figs/f4b_pdsep_kv_wall.png)
+
+§3.3: scatter plot of APC vs. hotspot index (unified/sticky in the high-APC, high-hotspot region; lmetric/load_only in the low-APC, low-hotspot region):
+![](figs/f4c_apc_vs_hotspot_tradeoff.png)
+
+> 📝 Optional supporting figure: prefill-decode interference (an 8k prefill on the same GPU degrades TPOT by 66x); place in §3.3 to support the interference argument about sticky:
+![](figs/f4d_pd_interference.png)

 ### §3.4 Takeaway

@@ -215,9 +225,11 @@ KV transfer happens on the critical path of the request that triggers the migration, but

 ### §5.2 End-to-end Performance

-**Figure 6: End-to-end performance**: `figs/f6_e2e_latency_bars.png` ✅ (PARTIAL)
-> Existing: TTFT/TPOT/E2E p90 bar chart × 5 policies (lmetric / load_only / sticky / unified / capped).
-> **🚧 TBD (NEW DATA)**: the `static PD-disagg` column is missing; the EAR column is also TBD (needs migration validation). A second figure in the same format covering all 6 baselines still has to be added.
+**Figure 6: End-to-end performance**: ✅ (PARTIAL, PD-disagg column missing)
+
+![](figs/f6_e2e_latency_bars.png)
+
+> **🚧 TBD (NEW DATA)**: the figure above is missing the `static PD-disagg` column; the EAR column is also TBD (needs migration validation). A second figure in the same format covering all 6 baselines still has to be added.

 | Scheduler | TTFT p50 | TTFT p90 | TPOT p90 | APC | Hotspot idx | Wall-clock factor |
 |---|---|---|---|---|---|---|
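
The derived figures quoted in the outline text of these hunks (the <1pp gap, "97% of the upper bound", the doubled interference index, "≈ 12% of H20") can be cross-checked with simple arithmetic. The sketch below uses only numbers that appear in the diff, with one exception: the 96 GiB H20 memory capacity is an assumption that the diff does not state. Variable names are illustrative.

```python
def pct(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Values quoted in the outline text.
apc_upper_any = 80.3      # theoretical APC upper bound, any-session (%)
apc_upper_intra = 79.6    # theoretical APC upper bound, intra-session-only (%)
apc_sticky = 77.2         # measured APC under hard session-instance binding (%)
interf_lmetric = 6.53     # interference index, LMetric
interf_sticky = 13.65     # interference index, sticky binding
kv_p99_gib = 11.8         # p99 per-request KV footprint (GiB)

# ASSUMPTION: H20 memory capacity of 96 GiB; not stated in the diff.
h20_capacity_gib = 96.0

gap_pp = apc_upper_any - apc_upper_intra
sticky_vs_bound = pct(apc_sticky, apc_upper_intra)
interference_ratio = interf_sticky / interf_lmetric
kv_share = pct(kv_p99_gib, h20_capacity_gib)

print(f"any-session vs intra-session gap: {gap_pp:.1f} pp")
print(f"sticky APC as share of upper bound: {sticky_vs_bound:.1f}%")
print(f"interference index ratio: {interference_ratio:.2f}x")
print(f"p99 KV footprint as share of H20: {kv_share:.1f}%")
```

Under the stated assumption, all four derived figures agree with the rounded values in the text.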