
kzlin 7590e55189 docs: archive deprecated docs to docs/archive/, drop E1 from onboarding

Two cleanups:

1. Drop "E1: naive 1P3D default" experiment from the onboarding manual.
   GPU hours are precious; naive 1P3D + policy=default has near-certain
   loss on multi-turn cache hit (it's round-robin without prefix awareness),
   so the comparison doesn't add information vs E1=naive 1P3D kv-aware.
   The new manifest has only 2 runs: E1 (naive 1P3D kv-aware) + E2 (KVC
   v2 + RDMA). Run-time budget drops from 16.5h serial to 11h serial /
   5.5h parallel. Updated:
   - §0 TL;DR ("3 组" -> "2 组")
   - §2 H1 hypothesis (drop "default and kv-aware each one" -> just kv-aware)
   - §3.1 experiment matrix (3 rows -> 2 rows + rationale for the drop)
   - §3.2 startup config (drop E1 default section, renumber E2/E3 -> E1/E2)
   - §6 decision table + expected-range table
   - §7 FAQ ("3 个 E1-E3" -> "2 个 E1-E2")
   - §9 deliverables

2. Move 8 deprecated docs to docs/archive/:
     AGENTIC_FIT_ANALYSIS_ZH.md         (ts=10 era analysis; superseded)
     STRUCTURAL_VALIDATION_REPORT_ZH.md (ts=10 era validation; superseded)
     KVC_DEBUG_JOURNEY_V1_TO_V5.md      (v1-v5 sweep process notes)
     V5_PROFILE_INVESTIGATION_ZH.md     (v5 1Hz polling investigation)
     REFACTOR_PLAN_ZH.md                (v0 plan; superseded by V1)
     KVCACHE_CENTRIC_PROGRESS_ZH.md     (earliest 2026-04-27 progress)
     SWEBENCH_EXPERIMENT_PROGRESS.md    (early SWE trace setup)
     SWEBENCH_EXPERIMENT_RESULTS.md     (early SWE result snapshot)

   All cross-references in active docs (V2_DEEP_ANALYSIS / V2_RESULTS /
   REFACTOR_PLAN_V1 / TEAM_REPORT / ONBOARDING) rewritten from
   `docs/FOO.md` to `docs/archive/FOO.md` via sed pass.

   Added `docs/archive/README.md` explaining what each archived doc is
   and when (if ever) to reopen it. Designed so a new reader hitting
   the archive dir immediately knows it's not required reading.

After this commit the active docs in docs/ are 9 files (down from 17),
which should make the onboarding doc's "Level 1 / Level 2 / Level 3"
classification self-evident.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-11 22:40:35 +08:00


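Refactor Plan v0: Minimal Version

Date: 2026-05-06

Goal: with minimal changes plus lightweight experiments, verify whether the structural defects raised in docs/AGENTIC_FIT_ANALYSIS_ZH.md actually exist and how large their impact is.

Budget: 8h of GPU time (about 4-6 smoke runs of ~30-60 min each).

KISS boundaries: do not touch the main-loop structure of SGLang scheduler.py; do not introduce a new mooncake protocol; do not implement cross-D session migration; do not split admission into probe/commit; do not touch the LRU eviction policy.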

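Plan conclusions (confirmed with the user)

A re-review of plan-v0 found that neither of the two issues targeted by the original Phase 1 changes is a bug:

- _estimate_session_resident_tokens returning the full prompt is by design: every call site that needs the "increment" already computes target - current (replay.py:1247-1254, :1393-1394, :1490-1491).
- decode_resident_blocks never shrinking only wastes a few MB of memory and does not affect routing decisions (the hash_ids in the SWE trace are session-unique, so the policy still picks the correct D).

The final minimal version makes a single code change (adding backpressure) plus a large amount of instrumentation.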

The only code change: a backpressure signal

Change 1: add two fields to the SGLang `admit_direct_append` response

Files: third_party/sglang/python/sglang/srt/managers/io_struct.py, scheduler.py

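from dataclasses import dataclass

@dataclass
class DirectAppendAdmissionReqOutput:
    ...                             # existing fields are kept
    recommended_pause_ms: int = 0   # new: suggested client-side pause, in ms
    queue_depth: int = 0            # new: transfer queue depth at admission time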

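At the end of scheduler.py:admit_direct_append, compute the hint:

def _compute_backpressure_pause_hint(self) -> int:
    # Pause hint in ms; returns int to match the recommended_pause_ms field.
    depth = len(self.disagg_decode_transfer_queue.queue)
    if depth < 8:
        return 0
    return min(2000, depth * 100)   # simple linear ramp, capped at 2000 ms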
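
To make the shape of this rule concrete, here is a minimal standalone sketch that takes the queue depth as a plain argument. The function name `compute_backpressure_pause_hint_ms` is hypothetical and exists only for illustration; the threshold (8), slope (100 ms per queued item), and cap (2000 ms) are copied from the rule above.

```python
def compute_backpressure_pause_hint_ms(depth: int) -> int:
    """Standalone copy of the linear rule; `depth` is the transfer queue depth."""
    if depth < 8:
        return 0                      # below the threshold: no pause requested
    return min(2000, depth * 100)     # 100 ms per queued item, capped at 2000 ms


if __name__ == "__main__":
    for depth in (0, 7, 8, 12, 20, 50):
        print(depth, compute_backpressure_pause_hint_ms(depth))
```

Note that the hint is not continuous at the threshold: it jumps from 0 ms at depth 7 to 800 ms at depth 8, and it saturates from depth 20 upward.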

Change 2: the replay side backs off according to the hint

File: src/agentic_pd_hybrid/replay.py

- DecodeResidencyState gains a new field pause_until_s: dict[str, float].
- _query_decode_direct_admission parses recommended_pause_ms from the response and sets pause_until_s[server_url] = now + pause_ms / 1000.
- Before calling _invoke_router / _invoke_decode_session_direct, check pause_until_s[decode_url]; if now < pause_until, sleep until that time.
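
The replay-side bookkeeping described above could look roughly like the sketch below. It is not the code in replay.py: the `DecodeResidencyState` here is a reduced stand-in holding only the new field, the helper names (`record_pause_hint`, `remaining_pause_s`, `wait_for_backpressure`) are hypothetical, and a blocking `time.sleep` is assumed. If the replay loop is asynchronous, `asyncio.sleep` would be the analogous call.

```python
import time
from dataclasses import dataclass, field


@dataclass
class DecodeResidencyState:
    # Reduced stand-in: only the field this plan adds (decode URL -> deadline in seconds).
    pause_until_s: dict[str, float] = field(default_factory=dict)


def record_pause_hint(state: DecodeResidencyState, server_url: str,
                      recommended_pause_ms: int, now_s: float) -> None:
    """Store the deadline implied by an admission response's pause hint."""
    state.pause_until_s[server_url] = now_s + recommended_pause_ms / 1000.0


def remaining_pause_s(state: DecodeResidencyState, decode_url: str,
                      now_s: float) -> float:
    """Seconds still to wait for this decode server (0.0 if none)."""
    return max(0.0, state.pause_until_s.get(decode_url, 0.0) - now_s)


def wait_for_backpressure(state: DecodeResidencyState, decode_url: str,
                          clock=time.monotonic, sleep=time.sleep) -> float:
    """Sleep until the recorded deadline, if any; return the time slept."""
    delay = remaining_pause_s(state, decode_url, clock())
    if delay > 0:
        sleep(delay)
    return delay
```

Injecting `clock` and `sleep` keeps the helper testable without real waiting, and the returned delay is the value a backpressure event log would record as the actual sleep.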

Change 3: a new CLI flag

src/agentic_pd_hybrid/cli.py, benchmark.py:

--enable-backpressure   # defaults to false, preserving baseline behavior
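
A default-off boolean flag like this is commonly declared with `argparse`'s `store_true` action. The sketch below assumes an argparse-based CLI, which the plan does not state; the parser name and help text are illustrative only.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; the real cli.py may be organized differently.
    parser = argparse.ArgumentParser(prog="replay-sketch")
    parser.add_argument(
        "--enable-backpressure",
        action="store_true",   # absent -> False, so baseline behavior is unchanged
        help="honor recommended_pause_ms hints from decode admission responses",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.enable_backpressure)
```

With `store_true`, existing invocations that do not pass the flag keep `enable_backpressure == False`, which is what lets E0 serve as the unchanged anchor.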

Change 4: observability logs

Each run dir gains three new jsonl files:

- admission-events.jsonl: one record per admission RPC (timestamp, session, D, can_admit, queue_depth, pause_ms, latency_s, available_tokens, evicted_session_count)
- backpressure-events.jsonl: one record per actual sleep (timestamp, D, sleep_ms, queue_depth_at_signal)
- session-d-binding.jsonl: one record the first time each session is opened on a given D (timestamp, session, D, turn_id)
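
As an illustration of the record shape, the following sketch appends one admission event per line. The field names follow the list above; the helper names `append_jsonl` and `admission_event` are hypothetical, and the value types are assumptions since the plan does not specify them.

```python
import json
from pathlib import Path


def append_jsonl(path: Path, record: dict) -> None:
    """Append one JSON object per line, creating the run dir if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def admission_event(timestamp: float, session: str, decode_url: str,
                    can_admit: bool, queue_depth: int, pause_ms: int,
                    latency_s: float, available_tokens: int,
                    evicted_session_count: int) -> dict:
    """Build one admission-events.jsonl record with the fields listed above."""
    return {
        "timestamp": timestamp,
        "session": session,
        "D": decode_url,
        "can_admit": can_admit,
        "queue_depth": queue_depth,
        "pause_ms": pause_ms,
        "latency_s": latency_s,
        "available_tokens": available_tokens,
        "evicted_session_count": evicted_session_count,
    }
```

Append-only, one-object-per-line output means a run that is truncated or crashes still leaves every completed record parseable.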

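Experiment matrix (within the 8h budget)

Ordered as "anchor first, then single-variable comparisons". The last column is the estimated machine time.

| ID | Config | Purpose | Machine time |
| --- | --- | --- | --- |
| E0 (existing) | v5 baseline, time-scale=10, no backpressure | Anchor; already exists at `outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/run1` | 0 |
| E1 | v5 + backpressure ON, time-scale=10, full trace | Verify Claim §3 (can backpressure eliminate the KVTransferError avalanche?) | ~50 min |
| E2 | v5 baseline, time-scale=1, short trace (first 12 sessions ≈ 1000 reqs) | Verify Claim §7 (time-scale=10 distortion); backpressure off | ~60 min |
| E3 | 8DP CA, time-scale=1, same trace as E2 | Control for E2: under real timing, does KVC still lose to DP? | ~60 min |
| E4 | v5 + backpressure, time-scale=1, same trace as E2 | Is backpressure still useful under real timing? | ~60 min |
| E5 (optional) | v5 baseline, time-scale=10, concurrency=4, full trace | Verify Claim §1 (is high concurrency a necessary condition?) | ~50 min |

Total: 4-5 runs, ~3-5h. The remaining budget is reserved for reruns of failed runs and for analysis.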
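
The total can be checked directly from the per-run estimates in the matrix. This is plain arithmetic on the numbers above; only the variable names are introduced here.

```python
# Estimated machine time per new run, in minutes (E0 already exists and costs 0).
core_minutes = {"E1": 50, "E2": 60, "E3": 60, "E4": 60}
optional_minutes = {"E5": 50}
budget_h = 8.0

core_h = sum(core_minutes.values()) / 60             # 4 runs
full_h = core_h + sum(optional_minutes.values()) / 60  # 5 runs, including E5

print(f"core runs: {core_h:.2f} h, with E5: {full_h:.2f} h")
print(f"slack: {budget_h - full_h:.2f} h to {budget_h - core_h:.2f} h")
```

The four core runs come to about 3.83h and all five to about 4.67h, which leaves roughly 3.3-4.2h of the 8h budget for reruns and analysis.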

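Experiment goals: mapping back to §1-§7 one by one

| Doc § | Claim | Experiment that refutes/supports it | Metrics needed |
| --- | --- | --- | --- |
| §1 | Permanent session pinning plus capacity-blind selection causes a bimodal distribution | Existing E0 data is sufficient | Distribution of per-session direct-to-D rate |
| §2 | LRU cannot keep up with the pressure | Existing E0 logs are sufficient, plus E1 to see how the trim/error ratio changes with backpressure | Number of trim events vs number of OOMs |
| §3 | The lack of backpressure is the source of the avalanche | E0 vs E1 | KVTransferError count, P99 latency |
| §4 | The admission RPC interferes with the scheduler | Out of scope for this round (verifying it requires the admission probe split, which is not being done) | - |
| §5 | The P side is unaware of D health | Existing E0 logs are sufficient (error counts for prefill-0 vs prefill-1) | Per-P KVTransferError |
| §6 | (withdrawn) | - | - |
| §7 | time-scale=10 distorts results | E0 vs E2 (same KVC, different time-scale); E2 vs E3 (same time-scale, KVC vs DP) | Latency distribution, direct-to-D rate |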
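
For the §3 comparison (E0 vs E1), the two metrics could be computed along the lines of the sketch below. It assumes each run has already been reduced to a list of per-request latencies and a list of error type names; how those are extracted from the run directories is not specified here. The nearest-rank percentile is one of several common definitions, chosen for simplicity, and both function names are hypothetical.

```python
import math


def percentile_nearest_rank(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data <= it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_run(latencies_s: list[float], error_types: list[str]) -> dict:
    """Reduce one run to the two §3 metrics."""
    return {
        "kv_transfer_errors": sum(1 for e in error_types if e == "KVTransferError"),
        "p99_latency_s": percentile_nearest_rank(latencies_s, 99),
    }


if __name__ == "__main__":
    # Synthetic placeholder data, NOT measured results.
    demo = summarize_run([0.2, 0.3, 0.25, 4.0], ["KVTransferError", "Timeout"])
    print(demo)
```

With a single run per configuration, a P99 taken over roughly a thousand requests rests on about ten samples in the tail, so small E0 vs E1 differences in this metric should be read with caution.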

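Final experiment report deliverable

After the runs finish, produce docs/STRUCTURAL_VALIDATION_REPORT_ZH.md, giving for each of §1-§7:

- The claim as stated
- Data evidence (which exp, which metric)
- Conclusion: holds / partially holds / refuted
- Quantified impact: the numeric difference
- Uncertainty: N=1 risk, other confounders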

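What is not being done (KISS boundaries)

| Tempting but not doing | Reason |
| --- | --- |
| Running N=3 repeats | Does not fit in 8h; a single run is enough to see the overall direction |
| Full parameter sweep | Only time-scale and the single backpressure boolean are varied |
| Changing LRU eviction | Out of scope for this round |
| Cross-D migration | Out of scope for this round |
| Admission probe/commit split | Out of scope for this round |
| P-side D-health routing | Out of scope for this round |
| Fixing the two "non-bugs" (estimate / aging) | Verified not to be real bugs |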

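Expected failure paths

- GPU resources are tight: shrink the smoke trace further (first 8 sessions / 600 reqs).
- A time-scale=1 run exceeds 1.5h: truncate to the portion that can complete within 600s.
- Backpressure is misconfigured: start with the simple linear sleep_ms = depth * 100; if it cannot be tuned, roll back to 0 (no backpressure).
- The SGLang patch fails to build: all patches are confined to a few lines in io_struct.py and scheduler.py and can be reverted individually with git restore.

Next: implement → run smoke → write the report.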
