Update MEETING.md + PAPER_OUTLINE.md with connector_tax substrate validation

2026-05-27 trace-replay A/B/C (commit ef9e010) shows the kv_both substrate
is net positive on the current codebase, not just neutral:
- TTFT p90: 11.97s plain → 9.74s kv_both (−18.6% vs plain) → 7.58s with DR-fix (−36.6% vs plain)

This reverses the elastic_migration_v2 paper's +45% kv_both penalty claim
and removes the primary cause of the 4 prior migration reverts.

Reframes EAR Pillar 2 from "DEFERRED" to "PARTIAL": substrate verified;
e2e strategy-layer validation (trigger thresholds + target selection in
the dispatch-coupling feedback loop) remains as the only open risk.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
MEETING.md
@@ -63,9 +63,16 @@ APC rises to 77-79% (close to the upper bound), but the hotspot index doubles: sticky 2.73, u
 New sessions are assigned to a host by load balancing; subsequent turns are routed by the session→host binding.
 → This is the current `unified` algorithm (hybrid LMetric + high-cache affinity): APC 79.4%, reaching 97% of the upper bound.
 
-**Pillar 2: Hot-triggered session migration (empirical validation pending)**
+**Pillar 2: Hot-triggered session migration (end-to-end validation pending; substrate verified)**
 When a host's `pending_prefill_tokens > T_hot`, migrate the whole session's KV to a lighter instance via the mooncake `kv_connector`; update the session binding; route subsequent turns to the new host.
 
+> 🆕 **2026-05-27 data** (commit `ef9e010`): the `kv_both` substrate overhead previously considered the migration blocker no longer exists. A/B/C comparison on an 8×TP1 trace replay:
+> - plain unified: TTFT p90 = 11.97s
+> - unified + `kv_both` (without DR-fix): 9.74s (**−18.6%** vs plain)
+> - unified + `kv_both` + DR-fix: 7.58s (**−36.6%** vs plain)
+>
+> In other words, the "+45% kv_both penalty" from the original elastic_migration_v2 paper is obsolete; the current substrate is **net positive** (in connector mode, `delay_free_blocks=True` lengthens the cross-turn cache-hit window on a trace with 93% intra-session reuse). The main cause of the 4 earlier migration reverts is gone.
+
 Key design:
 - Target selection uses **observable pending prefill tokens**, **not** cost-model prediction (the measured mooncake cost-model error is 10-21x, so it is bypassed)
 - Per-session cooldown to prevent thrashing
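The Pillar 2 design in the hunk above (trigger on `pending_prefill_tokens > T_hot`, pick the target by observed pending prefill tokens, per-session cooldown) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `Host`, `MigrationPolicy`, `pick_target`, `maybe_migrate`, the cooldown expressed in seconds, and the "target must be lighter than the source" check are all hypothetical, and the KV transfer through the mooncake `kv_connector` is represented only by a comment.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Host:
    name: str
    pending_prefill_tokens: int  # observed queue depth, not a cost-model prediction


@dataclass
class MigrationPolicy:
    t_hot: int         # hypothetical trigger threshold, in tokens
    cooldown_s: float  # hypothetical per-session cooldown, in seconds
    binding: Dict[str, str] = field(default_factory=dict)          # session -> host name
    last_migrated: Dict[str, float] = field(default_factory=dict)  # session -> timestamp

    def pick_target(self, hosts: Dict[str, Host], current: str) -> Optional[str]:
        """Return the lightest other host by observed pending prefill tokens."""
        others = [h for h in hosts.values() if h.name != current]
        if not others:
            return None
        best = min(others, key=lambda h: h.pending_prefill_tokens)
        # Assumption: only move if the target is strictly lighter than the source.
        if best.pending_prefill_tokens >= hosts[current].pending_prefill_tokens:
            return None
        return best.name

    def maybe_migrate(self, session: str, hosts: Dict[str, Host], now: float) -> Optional[str]:
        """Return the new host name if the session migrates, else None."""
        current = self.binding[session]
        if hosts[current].pending_prefill_tokens <= self.t_hot:
            return None  # current host is not hot
        last = self.last_migrated.get(session)
        if last is not None and now - last < self.cooldown_s:
            return None  # session migrated recently; skip to avoid thrashing
        target = self.pick_target(hosts, current)
        if target is None:
            return None
        # A real system would transfer the session's KV to the target here.
        self.binding[session] = target
        self.last_migrated[session] = now
        return target
```

With three hosts at 9000, 1200 and 3000 pending tokens and `t_hot=8000`, a session bound to the first host would be rebound to the second; a second call inside the cooldown window returns `None` even if the new host has become hot.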
@@ -83,6 +90,7 @@ APC rises to 77-79% (close to the upper bound), but the hotspot index doubles: sticky 2.73, u
 - Pillar 1 affinity routing is implemented and tested (the current `unified` algorithm)
 - Formal derivation of dispatch coupling via Little's Law
 - `replayer/replay.py` patched to output `amplification`
+- 🆕 **kv_both substrate validation** (commit `ef9e010`): trace replay A/B/C shows the substrate is already net positive (TTFT p90 −18.6%, and −36.6% with DR-fix, both vs plain); the original +45% penalty is obsolete
 
 ### 🟢 Can be done now, independent of migration
 
@@ -91,17 +99,17 @@ APC rises to 77-79% (close to the upper bound), but the hotspot index doubles: sticky 2.73, u
 3. Sensitivity along three axes: λ / skew / KV pool
 4. Draft the body text of §1-§4 (data is complete)
 
-### 🚧 Pending migration validation
+### 🚧 Pending migration end-to-end validation
 
-- §4.3 migration mechanism: re-test on top of the `connector_tax` DR-fix
+- §4.3 migration mechanism: e2e trigger + target selection experiments (the substrate works; only the strategy layer is missing)
 - Full ablation (migration-only + both)
 - §5.6 migration microbench
 
 ### Risks
 
-- The 4 earlier migration attempts (`6b255fa`, `e991960/5772149`, `cc6e562`, `4c583f2`) were all reverted because transfer overhead swallowed the gains
-- The recent DR-fix took the `build_connector_meta` slope from +81 to -0.7 μs/1k blocks, but this is **not yet validated on trace replay**
-- If migration validation fails, the paper can pivot to "affinity-only is enough": still publishable, one notch weaker
+- The 4 earlier migration attempts (`6b255fa`, `e991960/5772149`, `cc6e562`, `4c583f2`) were all reverted because transfer overhead swallowed the gains. **That overhead was verified on 2026-05-27 to no longer exist** (substrate net positive)
+- Still not directly verified: whether the e2e migration strategy layer (trigger + target selection) yields a net gain inside the feedback loop; two risk classes remain, "wrong decisions" and "cooldown thrashing", both independent of the substrate
+- Even if e2e migration remains marginal, the empirical case for the affinity-only pillar already stands on its own, so the paper has at least a strong-affinity storyline
 
 ---
 
PAPER_OUTLINE.md
@@ -13,14 +13,15 @@
 | §3.2 Static PD-disagg hits the KV wall | ✅ Complete (`f4b`) | - |
 | §3.3 Sticky creates hot pins | ✅ Complete (`f4c`, `f4d`) | - |
 | §4.1-2 Affinity routing | ✅ Implemented (current `unified` algorithm) | - |
-| §4.3 Migration mechanism | 🚧 **DEFERRED** | Re-test after the connector_tax fix |
+| `kv_both` substrate cost | ✅ **VERIFIED net-positive** (2026-05-27, commit `ef9e010`) | TTFT p90 −18.6% w/o DR-fix, −36.6% w/ DR-fix |
+| §4.3 Migration mechanism (e2e) | 🚧 **PARTIAL** | Substrate works; e2e trigger + target selection experiments not yet run |
 | §5.2 End-to-end | ⚠️ 5/6 baselines have data (`f6`) | Missing static PD-disagg; EAR column pending migration |
 | §5.3 Ablation | 🚧 **PARTIAL DEFER** | Only affinity-only is feasible now; full ablation pending migration |
 | §5.4 Dispatch coupling validation | 🚧 **NEW DATA NEEDED** | Re-run wall-clock for 5 baselines (after the Phase 1 patch) |
 | §5.5 Sensitivity | 🚧 **PARTIAL DEFER** | λ/skew/KV pool feasible; `T_hot`/`T_cool` pending migration |
 | §5.6 Migration microbench | 🚧 **FULL DEFER** | Fully depends on migration validation |
 
-**Background**: the team's 4 earlier migration attempts were all reverted because of transfer overhead (see `analysis/unified_routing_fix_review.md`); the DR-fix from the recent `connector_tax` work cut the 1.4ms/step overhead of build_connector_meta to near 0, but the full migration experiment has not been run yet. **The migration part of EAR is currently design intent; empirical results will be written in after re-testing.**
+**Background**: the team's 4 earlier migration attempts were all reverted because of transfer overhead (see `analysis/unified_routing_fix_review.md`); the 2026-05-27 trace-replay A/B/C (`microbench/connector_tax/cache_sweep/REPORT_TRACE_REPLAY.md`) shows that the `kv_both` substrate has flipped: not only is the +45% penalty obsolete, the substrate itself is net positive (TTFT p90 −18.6% vs plain, −36.6% with DR-fix). **The largest root cause of the 4 earlier migration reverts is gone, but the e2e migration strategy layer (the real gain of trigger + target selection inside the feedback loop) is still not directly verified. The migration experiments for EAR no longer carry substrate risk; only strategy-layer risk remains.**
 
 ---
 
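@@ -165,11 +166,13 @@ EAR is a router sitting on top of N homogeneous instances. Each instance is symmetric
 - **Warm path**: every subsequent turn of an established session is routed to its current host
 - **Effect**: intra-session KV reuse is preserved by construction, and APC approaches the 79.6% upper bound from §2.2
 
-### §4.3 Pillar 2: Hot-Triggered Session Migration 🚧 DEFERRED VALIDATION
+### §4.3 Pillar 2: Hot-Triggered Session Migration 🚧 PARTIAL VALIDATION
 
 The key mechanism that keeps Pillar 1 from degenerating into pure sticky routing.
 
-> **Status**: The design description is complete, but empirical data awaits a re-test after the `connector_tax` DR-fix. The 4 earlier migration attempts (`6b255fa`, `e991960/5772149`, `cc6e562`, `4c583f2`) were all reverted because of transfer overhead: until the DR-fix, the measured gain from migration was always swallowed by overhead. A new round of validation has not been run.
+> **Status (updated 2026-05-27)**:
+> - **Substrate validation PASS** (commit `ef9e010`): the `kv_both` connector is net positive on trace replay (TTFT p90 −18.6% vs plain), with a further −22% relative to `kv_both` without DR-fix once the DR-fix is applied. The transfer overhead previously considered the migration blocker no longer exists.
+> - **Strategy-layer e2e validation PENDING**: the real gain of trigger thresholds + target selection inside the agentic feedback loop has still not been measured directly. The main cause (substrate overhead) behind the 4 earlier reverted migration attempts (`6b255fa`, `e991960/5772149`, `cc6e562`, `4c583f2`) is gone, but wrong trigger decisions and cooldown thrashing are independent risks that need a new round of e2e experiments to confirm.
 
 #### §4.3.1 Trigger signal
 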
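For the Pillar 1 warm path shown at the top of this hunk, combined with load-balanced placement of new sessions, a rough sketch might look like the following. It is an illustration only, not the `unified` algorithm itself: `AffinityRouter`, `route`, and the least-loaded-host rule for the cold path are assumptions made for the example.

```python
from typing import Dict


class AffinityRouter:
    """Session-affinity routing: cold path balances load, warm path follows the binding."""

    def __init__(self, host_load: Dict[str, int]) -> None:
        # host name -> current load (assumed here to be pending prefill tokens)
        self.host_load = host_load
        self.binding: Dict[str, str] = {}  # session id -> host name

    def route(self, session_id: str) -> str:
        host = self.binding.get(session_id)
        if host is None:
            # Cold path: the first turn of a new session goes to the least-loaded host.
            host = min(self.host_load, key=self.host_load.get)
            self.binding[session_id] = host
        # Warm path: later turns reuse the bound host, so the session's KV stays local.
        return host
```

Because the binding is consulted before load, a session stays on its host even after that host becomes the busiest one, which is exactly the hot-pin behaviour that Pillar 2 is meant to relieve.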
@@ -340,7 +343,7 @@ KV transfer happens on the critical path of the request that triggers the migration, but
 
 ### 🚧 Deferred (pending migration validation)
 
-- [ ] §4.3 migration mechanism re-test (run after the `connector_tax` DR-fix)
+- [ ] §4.3 migration mechanism e2e validation: the substrate works (commit `ef9e010`); the strategy-layer experiments for trigger + target selection are missing
 - [ ] §5.3 full ablation (the migration-only and both configurations)
 - [ ] §5.5 sensitivity along the `T_hot` / `T_cool` axes
 - [ ] §5.6 migration microbench, all of it