Two cleanups:
1. Drop "E1: naive 1P3D default" experiment from the onboarding manual.
GPU hours are precious; naive 1P3D + policy=default almost always loses
multi-turn cache hits (it's round-robin without prefix awareness),
so the comparison adds no information beyond E1=naive 1P3D kv-aware.
The new manifest has only 2 runs: E1 (naive 1P3D kv-aware) + E2 (KVC
v2 + RDMA). Run-time budget drops from 16.5h serial to 11h serial /
5.5h parallel. Updated:
- §0 TL;DR ("3 组" -> "2 组")
- §2 H1 hypothesis (drop "default and kv-aware each one" -> just kv-aware)
- §3.1 experiment matrix (3 rows -> 2 rows + rationale for the drop)
- §3.2 startup config (drop E1 default section, renumber E2/E3 -> E1/E2)
- §6 decision table + expected-range table
- §7 FAQ ("3 个 E1-E3" -> "2 个 E1-E2")
- §9 deliverables
2. Move 8 deprecated docs to docs/archive/:
- AGENTIC_FIT_ANALYSIS_ZH.md (ts=10 era analysis; superseded)
- STRUCTURAL_VALIDATION_REPORT_ZH.md (ts=10 era validation; superseded)
- KVC_DEBUG_JOURNEY_V1_TO_V5.md (v1-v5 sweep process notes)
- V5_PROFILE_INVESTIGATION_ZH.md (v5 1Hz polling investigation)
- REFACTOR_PLAN_ZH.md (v0 plan; superseded by V1)
- KVCACHE_CENTRIC_PROGRESS_ZH.md (earliest 2026-04-27 progress)
- SWEBENCH_EXPERIMENT_PROGRESS.md (early SWE trace setup)
- SWEBENCH_EXPERIMENT_RESULTS.md (early SWE result snapshot)
All cross-references in active docs (V2_DEEP_ANALYSIS / V2_RESULTS /
REFACTOR_PLAN_V1 / TEAM_REPORT / ONBOARDING) rewritten from
`docs/FOO.md` to `docs/archive/FOO.md` via sed pass.
Added `docs/archive/README.md` explaining what each archived doc is
and when (if ever) to reopen it. Designed so a new reader hitting
the archive dir immediately knows it's not required reading.
After this commit the active docs in docs/ are 9 files (down from 17),
which should make the onboarding doc's "Level 1 / Level 2 / Level 3"
classification self-evident.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
124 lines
5.9 KiB
Markdown
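# Refactor Plan v0: Minimal Version

**Date**: 2026-05-06
**Goal**: With minimal code changes and lightweight experiments, verify whether the structural defects raised in `docs/AGENTIC_FIT_ANALYSIS_ZH.md` actually exist and how large their impact is.
**Budget**: 8h of GPU time (about 4-6 smoke runs of roughly 30-60 min each).
**KISS boundary**: Do not touch the main-loop structure of SGLang's `scheduler.py`; do not introduce a new mooncake protocol; do not implement cross-D session migration; do not split admission into probe/commit; do not change the LRU eviction policy.
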
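## Plan conclusions (confirmed with the user)

Re-reviewing plan-v0 showed that the two changes originally planned for Phase 1 are **both not bugs**:

- `_estimate_session_resident_tokens` returning the full prompt is by design: every call site that needs the "increment" already does the `target - current` subtraction (`replay.py:1247-1254`, `:1393-1394`, `:1490-1491`).
- `decode_resident_blocks` not shrinking only wastes a few MB of memory and **does not affect routing decisions** (the hash_ids in the SWE trace are session-unique, so the policy still picks the correct D).

The final minimal version makes a single code change (**adding backpressure**) plus extensive instrumentation.
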
## The only code change: backpressure signal

### Change 1: add two fields to the SGLang `admit_direct_append` response

Files: `third_party/sglang/python/sglang/srt/managers/io_struct.py`, `scheduler.py`

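```python
from dataclasses import dataclass


@dataclass
class DirectAppendAdmissionReqOutput:
    ...  # existing fields are kept unchanged
    recommended_pause_ms: int = 0  # new: how long the caller should back off
    queue_depth: int = 0  # new: transfer-queue depth when the hint was computed
```
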
The hint is computed at the end of `scheduler.py:admit_direct_append`:

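```python
# Scheduler method (excerpt). Returns whole milliseconds so the value
# matches the `recommended_pause_ms: int` field.
def _compute_backpressure_pause_hint(self) -> int:
    depth = len(self.disagg_decode_transfer_queue.queue)
    if depth < 8:
        return 0  # shallow queue: no backoff requested
    return min(2000, depth * 100)  # simple linear ramp, capped at 2000 ms
```
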
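To make the shape of the hint concrete, here is a minimal standalone sketch of the same linear rule. The function name and the keyword parameters (`threshold`, `ms_per_item`, `cap_ms`) are hypothetical and exist only for illustration; the plan itself hard-codes 8, 100 and 2000.

```python
def backpressure_pause_hint_ms(
    depth: int,
    threshold: int = 8,      # hypothetical parameter; the plan hard-codes 8
    ms_per_item: int = 100,  # hypothetical parameter; the plan hard-codes 100
    cap_ms: int = 2000,      # hypothetical parameter; the plan hard-codes 2000
) -> int:
    """Return 0 below the threshold, otherwise depth * ms_per_item capped at cap_ms."""
    if depth < threshold:
        return 0
    return min(cap_ms, depth * ms_per_item)


# Sample the rule at a few queue depths.
SAMPLE_HINTS = {d: backpressure_pause_hint_ms(d) for d in (0, 7, 8, 12, 20, 50)}
```

With these constants the hint is 0 ms up to depth 7, jumps straight to 800 ms at depth 8, and saturates at 2000 ms from depth 20 onward. The step at the threshold follows directly from the formula and may be worth keeping in mind when reading `backpressure-events.jsonl`.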
### Change 2: the replay side backs off according to the hint

File: `src/agentic_pd_hybrid/replay.py`

- `DecodeResidencyState` gains `pause_until_s: dict[str, float]`
- `_query_decode_direct_admission` parses `recommended_pause_ms` from the response and sets `pause_until_s[server_url] = now + pause_ms / 1000`
- Before calling `_invoke_router` / `_invoke_decode_session_direct`, check `pause_until_s[decode_url]`; if `now < pause_until`, sleep until that time

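The three steps above could be sketched roughly as follows. This is an illustration under assumptions, not the code in `replay.py`: the helper names `record_pause_hint` and `wait_for_backpressure` are hypothetical, the admission response is assumed to be available as a plain dict, and the `DecodeResidencyState` shown here contains only the new field.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DecodeResidencyState:
    # Only the new field is shown; the real class has other fields.
    pause_until_s: dict[str, float] = field(default_factory=dict)


def record_pause_hint(
    state: DecodeResidencyState, server_url: str, response: dict, now_s: float
) -> None:
    """Store the deadline before which this decode server should not be called."""
    pause_ms = int(response.get("recommended_pause_ms", 0))  # assumed dict response
    state.pause_until_s[server_url] = now_s + pause_ms / 1000.0


def wait_for_backpressure(
    state: DecodeResidencyState,
    decode_url: str,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Sleep until the stored deadline, if any. Returns the seconds slept."""
    remaining = state.pause_until_s.get(decode_url, 0.0) - clock()
    if remaining > 0:
        sleep(remaining)
        return remaining
    return 0.0
```

Passing `clock` and `sleep` as parameters is a design choice for this sketch only: it lets the backoff be unit-tested without real waiting. The value passed as `now_s` must come from the same clock that `wait_for_backpressure` reads.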
### Change 3: new CLI flag

`src/agentic_pd_hybrid/cli.py`, `benchmark.py`:

```
--enable-backpressure    # default false; preserves baseline behavior
```

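A default-off boolean flag like this is commonly declared with `argparse` and `action="store_true"`. The sketch below assumes the CLI is built on `argparse`, which this plan does not state; the `prog` name and the `build_parser` helper are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; only the new flag is shown.
    parser = argparse.ArgumentParser(prog="agentic-pd-hybrid")
    parser.add_argument(
        "--enable-backpressure",
        action="store_true",
        default=False,
        help="Honor recommended_pause_ms hints from decode servers (off = baseline).",
    )
    return parser


args_default = build_parser().parse_args([])
args_enabled = build_parser().parse_args(["--enable-backpressure"])
```

Keeping the default at `False` means runs launched without the flag behave like the existing baseline.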
### Change 4: observability logs

Each run dir gains three new jsonl files:

- `admission-events.jsonl`: one record per admission RPC (timestamp, session, D, can_admit, queue_depth, pause_ms, latency_s, available_tokens, evicted_session_count)
- `backpressure-events.jsonl`: one record per actual sleep (timestamp, D, sleep_ms, queue_depth_at_signal)
- `session-d-binding.jsonl`: one record the first time each session is opened on a given D (timestamp, session, D, turn_id)

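As an illustration of the log format, the following sketch appends one JSON object per line to a file in the run dir. The helper names `append_jsonl` and `make_admission_event` are hypothetical; the keys mirror the field list for `admission-events.jsonl` above, and the exact value types are assumptions.

```python
import json
import time
from pathlib import Path
from typing import Optional


def append_jsonl(path: Path, event: dict) -> None:
    """Append one event as a single JSON line, creating parent dirs if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event, ensure_ascii=False) + "\n")


def make_admission_event(
    session: str,
    decode_url: str,
    can_admit: bool,
    queue_depth: int,
    pause_ms: int,
    latency_s: float,
    available_tokens: int,
    evicted_session_count: int,
    timestamp: Optional[float] = None,
) -> dict:
    """Build one admission record with the fields listed in the plan."""
    return {
        "timestamp": time.time() if timestamp is None else timestamp,
        "session": session,
        "D": decode_url,
        "can_admit": can_admit,
        "queue_depth": queue_depth,
        "pause_ms": pause_ms,
        "latency_s": latency_s,
        "available_tokens": available_tokens,
        "evicted_session_count": evicted_session_count,
    }
```

A caller would write something like `append_jsonl(run_dir / "admission-events.jsonl", make_admission_event(...))` after each admission RPC. Append mode with one object per line keeps the files readable even if a run is interrupted partway.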
## Experiment matrix (within the 8h budget)

Ordered by "anchor first, then single-variable comparisons". The rightmost column of each row is the estimated machine time.

| ID | Config | Purpose | Machine time |
|---|---|---|---|
| **E0 (existing)** | v5 baseline, time-scale=10, no backpressure | Anchor; already exists at `outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/run1` | 0 |
| **E1** | v5 + backpressure ON, time-scale=10, full trace | Test Claim §3 (whether backpressure eliminates the KVTransferError avalanche) | ~50 min |
| **E2** | v5 baseline, time-scale=1, **short trace** (first 12 sessions ≈ 1000 reqs) | Test Claim §7 (time-scale=10 distortion); backpressure off | ~60 min |
| **E3** | 8DP CA, time-scale=1, same trace as E2 | Control for E2: under realistic timing, does KVC still lose to DP? | ~60 min |
| **E4** | v5 + backpressure, time-scale=1, same trace as E2 | Is backpressure still useful under realistic timing? | ~60 min |
| **E5** (optional) | v5 baseline, time-scale=10, **concurrency=4**, full trace | Test Claim §1 (whether high concurrency is a necessary condition) | ~50 min |

Total: 4-5 runs, ~3-5h. The remaining budget goes to rerunning failed runs and to analysis.

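The total can be checked with a few lines of arithmetic. This sketch treats the "~" estimates in the table as point values, which is an assumption; real runs will vary.

```python
# Estimated machine time per new run, in minutes (E0 already exists and costs 0).
RUN_MINUTES = {"E1": 50, "E2": 60, "E3": 60, "E4": 60, "E5": 50}
BUDGET_HOURS = 8.0


def total_hours(run_ids: list[str]) -> float:
    """Sum the estimated minutes for the given runs and convert to hours."""
    return sum(RUN_MINUTES[r] for r in run_ids) / 60.0


core_hours = total_hours(["E1", "E2", "E3", "E4"])             # 230 min
with_optional = total_hours(["E1", "E2", "E3", "E4", "E5"])    # 280 min
slack_hours = BUDGET_HOURS - with_optional                      # left for reruns/analysis
```

Under these point estimates, E1-E4 come to about 3.8h and adding E5 brings it to about 4.7h, leaving roughly 3.3h of the 8h budget for reruns and analysis.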
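## Experiment goals: mapping back to §1-§7 one by one

| Doc § | Claim | Which exp refutes/supports it | Metrics needed |
|---|---|---|---|
| §1 | Permanent session pinning + capacity-blind selection causes a bimodal distribution | Existing E0 data is sufficient | per-session distribution of direct-to-D rate |
| §2 | LRU cannot keep up with pressure | Existing E0 logs are sufficient, plus E1 to see how the trim/error ratio changes once backpressure is on | trim event count vs OOM count |
| §3 | Missing backpressure is the source of the avalanche | E0 vs E1 | KVTransferError count, P99 latency |
| §4 | Admission RPC interferes with the scheduler | Out of scope this round (verifying it requires the admission probe split, which we are not doing) | – |
| §5 | P-side is unaware of D health | Existing E0 logs are sufficient (prefill-0 vs prefill-1 error counts) | per-P KVTransferError |
| §6 | (withdrawn) | – | – |
| §7 | time-scale=10 distortion | E0 vs E2 (same KVC, different time-scale); E2 vs E3 (same time-scale, KVC vs DP) | latency distribution, direct-to-D rate |
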
## Final experiment report deliverable

After the runs finish, produce `docs/STRUCTURAL_VALIDATION_REPORT_ZH.md`, giving for each of §1-§7:

- **Claim as stated**
- **Data evidence** (which exp, which metric)
- **Conclusion**: holds / partially holds / refuted
- **Quantified impact**: numeric differences
- **Uncertainty**: N=1 risk, other confounders

## What we are not doing (KISS boundary)

| Tempting but not doing | Reason |
|---|---|
| Run N=3 repeats | Does not fit in 8h; a single run is enough to see the general direction |
| Full parameter sweep | Only vary time-scale and the single backpressure boolean |
| Change LRU eviction | Out of scope this round |
| Cross-D migration | Out of scope this round |
| Admission probe/commit split | Out of scope this round |
| P-side D-health routing | Out of scope this round |
| Fix the two "non-bugs" (estimate / aging) | Verified not to be real bugs |

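## Anticipated failure paths

- **GPU resources are tight**: shrink the smoke trace further (first 8 sessions / 600 reqs)
- **A time-scale=1 run exceeds 1.5h**: truncate to the portion that completes within 600s
- **Backpressure is misconfigured**: start with the simple linear sleep_ms = depth * 100; if it cannot be tuned to work, roll back to 0 (no backpressure)
- **SGLang patch fails to compile**: all patches are confined to a few lines in io_struct.py and scheduler.py, so each can be reverted on its own with git restore

---

Next: implement → run smoke → write the report.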