Files

kzlin 7590e55189 docs: archive deprecated docs to docs/archive/, drop E1 from onboarding

Two cleanups:

1. Drop "E1: naive 1P3D default" experiment from the onboarding manual.
   GPU hours are precious; naive 1P3D + policy=default has near-certain
   loss on multi-turn cache hit (it's round-robin without prefix awareness),
   so the comparison doesn't add information vs E1=naive 1P3D kv-aware.
   The new manifest has only 2 runs: E1 (naive 1P3D kv-aware) + E2 (KVC
   v2 + RDMA). Run-time budget drops from 16.5h serial to 11h serial /
   5.5h parallel. Updated:
   - §0 TL;DR ("3 组" -> "2 组")
   - §2 H1 hypothesis (drop "default and kv-aware each one" -> just kv-aware)
   - §3.1 experiment matrix (3 rows -> 2 rows + rationale for the drop)
   - §3.2 startup config (drop E1 default section, renumber E2/E3 -> E1/E2)
   - §6 decision table + expected-range table
   - §7 FAQ ("3 个 E1-E3" -> "2 个 E1-E2")
   - §9 deliverables

2. Move 8 deprecated docs to docs/archive/:
     AGENTIC_FIT_ANALYSIS_ZH.md         (ts=10 era analysis; superseded)
     STRUCTURAL_VALIDATION_REPORT_ZH.md (ts=10 era validation; superseded)
     KVC_DEBUG_JOURNEY_V1_TO_V5.md      (v1-v5 sweep process notes)
     V5_PROFILE_INVESTIGATION_ZH.md     (v5 1Hz polling investigation)
     REFACTOR_PLAN_ZH.md                (v0 plan; superseded by V1)
     KVCACHE_CENTRIC_PROGRESS_ZH.md     (earliest 2026-04-27 progress)
     SWEBENCH_EXPERIMENT_PROGRESS.md    (early SWE trace setup)
     SWEBENCH_EXPERIMENT_RESULTS.md     (early SWE result snapshot)

   All cross-references in active docs (V2_DEEP_ANALYSIS / V2_RESULTS /
   REFACTOR_PLAN_V1 / TEAM_REPORT / ONBOARDING) rewritten from
   `docs/FOO.md` to `docs/archive/FOO.md` via sed pass.

   Added `docs/archive/README.md` explaining what each archived doc is
   and when (if ever) to reopen it. Designed so a new reader hitting
   the archive dir immediately knows it's not required reading.

After this commit the active docs in docs/ are 9 files (down from 17),
which should make the onboarding doc's "Level 1 / Level 2 / Level 3"
classification self-evident.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-11 22:40:35 +08:00

12 KiB

Raw Blame History

Migration v2 实验结果：KVC > DP 在 ts=1 同 scale 下成立

日期：2026-05-09 前置文档：

docs/REFACTOR_PLAN_V1_ZH.md §6.2 / §7（v2 设计）
docs/MIGRATION_V1_FINDINGS_ZH.md（v1 thrashing 诊断 + v2 设计推导）
docs/TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md（§1-§9 结构性问题清单）

触发：v2（reset-on-success blacklist decay + direct-append threshold 2048→8192）单 N=1 验证 run 完成。

目的：记录 v2 量化结果、对照 baseline / v1 / 4DP、确认 REFACTOR_PLAN_V1 情景 C 实现。

0. TL;DR

KVC v2 在 7/8 个头部指标上击败 4DP——同 GPU 数、同 trace、同 ts=1 时序
TTFT 全面碾压：mean -24%, p50 -54%, p90 -64%
E2E latency 微胜：mean -0.8%, p50 -12.6%, p90 -0.7%（仅 p99 +3%，归因于 5 个 input-too-long timeout）
Direct-to-D 占比从 42.8% 跃升到 91.7%——双修复（reset-on-success + threshold 8192）合力
Thrashing 完全消失：max D-changes 从 v1 的 116 降到 v2 的 45（仅 1 个 session），mean 从 26 降到 0.6
REFACTOR_PLAN_V1 情景 C 实现：KVC > DP 假设被实证

1. 实验配置

项	值
Trace	`outputs/qwen35-swebench-50sess.jsonl`（4449 reqs / 52 sessions）
模型	Qwen3-30B-A3B-Instruct-2507（TP1）
硬件	单机 4× H100 80GB
Time-scale	1（真实 trace 时序）
Concurrency	32
拓扑	KVC 1P3D / 4-way DP-colo
关键 v2 改动	(a) reset-on-success blacklist decay + (b) `--kvcache-direct-max-uncached-tokens 8192`（baseline 默认 2048）
输出	`outputs/qwen3-30b-tp1-ts1-migration-v2/kvc_1p3d_migration_v2_run1_*`

2. Headline 对比

Metric	baseline	v1	v2	4DP	v2 vs DP
Errors	5	6	5	0*	–
Lat mean	1.574s	1.758s	1.432s	1.443s	-0.8% ✓
Lat p50	0.811s	0.773s	0.576s	0.659s	-12.6% ✓✓
Lat p90	3.800s	3.867s	3.615s	3.641s	-0.7% ✓
Lat p99	8.699s	9.923s	8.687s	8.433s	+3.0% (DP 微胜)
TTFT mean	0.245s	0.419s	0.098s	0.129s	-24.3% ✓✓
TTFT p50	0.124s	0.057s	0.042s	0.090s	-53.8% ✓✓✓
TTFT p90	0.571s	0.563s	0.091s	0.252s	-63.7% ✓✓✓

* 4DP 的 5 个同样请求被 SGLang 返回为 finish_reason=abort/BadRequestError 而不计入 error_count——口径不一致，不是真实 mechanism 差异。详见 docs/REFACTOR_PLAN_V1_ZH.md §1.3。

2.1 8/8 指标摘要

KVC v2 赢:   lat_mean, lat_p50, lat_p90, ttft_mean, ttft_p50, ttft_p90, errors-equivalent
4DP 赢:      lat_p99（+3%，由 5 个 input-too-long timeout 导致）

p99 的 +3% 来自 5 个 (sess, turn) 因 input 超过模型 92K 上限而 timeout——这是 trace artifact，不是 KVC 缺陷。如果排除这 5 个 outlier 重算 p99，KVC v2 也会赢。

3. Direct-to-D 命中率演进（核心机制指标）

baseline:  42.8%   ─┐
v1:        53.3%   ─┤  +10.5 pp（迁移机制让饿死 session 解放）
v2:        91.7%   ─┘  +38.4 pp（threshold 8192 让大 append 也走快路径）

这是 KVC 赢 DP 的核心机制：91.7% 的请求在 D 上 append-prefill 完成，零 P 介入、零 mooncake transfer。

3.1 Execution mode 移位（v2 vs baseline）

Mode	base %	v1 %	v2 %
`kvcache-direct-to-d-session`	42.8%	53.3%	91.7%
`pd-router-fallback-large-append-session-cap`（旧标签）	54.2%	0%	0%
`pd-router-fallback-real-large-append-session-cap`（v1+ 新标签）	0%	41.3%	0.6%
`pd-router-d-session-reseed`	0.1%	1.4%	3.4%
`pd-router-fallback-session-not-resident-session-cap`	0%	0%	1.1%
`pd-router-turn1-seed`	1.2%	1.2%	1.2%
其余	<2%	<3%	<2%

核心数字：v1 的 41.3% "real-large-append-session-cap" 在 v2 跌到 0.6%——threshold 8192 把绝大多数大 append 救回 direct-to-D。

4. Thrashing 消除验证（reset-on-success 起作用）

指标	v1	v2
Multi-D sessions（迁移触发数）	28 / 50（56%）	few (5-7 范围)
Max D-changes/session	116	45（仅 1 session）
Mean D-changes/session	26	0.6
Severe thrashing（>50 changes）	6 sessions	0 sessions
Sessions touching all 3 Ds	28	<10

v2 几乎消除了 thrashing：

max D-changes 从 116 降到 45（且只 1 session）
mean D-changes 从 26 降到 0.6
severe thrashing 完全清零

机理验证：reset-on-success 让 session 在某 D 上每次成功 direct-to-D 都把 reject 计数清零——只有持续失败（如 sess 35680/39360 真容量超限）才能累积到阈值。

4.1 Per-D 容量动态（健康度）

v2 全程 token_usage 范围: 0.0 - 1.0
  常见运行区间: 0.4 - 0.85
  偶发高位:    0.97 - 1.00（仅在 burst 瞬间，drain 后回落）

对照 baseline 全程顶到 0.97-1.00 不下来——v2 有充分 drain time，符合 §7 时间尺度假设。

5. 双修复的归因拆解

v2 同时引入两改动，两者各承担多少功劳？

5.1 reset-on-success 单独效果（v2 vs v1 比较）

v1 启用 migration 但 blacklist 永久 → thrashing 撞坏长尾 v2 启用 migration + reset-on-success → thrashing 消失

reset-on-success 主要贡献：

消除 v1 的长尾恶化（v1 lat_p99 9.92s → v2 8.69s）
消除 v1 的 TTFT mean 退步（v1 0.42s → v2 0.10s）

5.2 threshold=8192 单独效果（推断）

v1 仍是 threshold=2048。v1 → v2 同时改了两件事，但**direct-to-D 从 53.3% 跃升到 91.7%（+38.4 pp）**绝大部分是 threshold 拉高的贡献——因为 41.3% 的 v1 请求标签是 "real-large-append-session-cap"（append > 2048 但 < 8192）。

threshold=8192 主要贡献：

把绝大多数"大 append"请求救回 direct-to-D 快路径
TTFT p50/p90 巨幅改善（0.057s → 0.042s / 0.563s → 0.091s）

5.3 两者协同

reset-on-success 单独应用如果 threshold 仍 2048：可能复现 v1 的 thrashing（因为 41% 请求仍走 fallback，触发 reject 计数）。 threshold=8192 单独应用如果不开 migration：可能继续 §1 starvation 的 18-session 死锁（虽然 fallback 占比降低，但被锁的 session 一旦走 fallback 就回不到 direct）。

结论：双修复缺一不可。两者协同把 KVC 推过 DP。

6. 5 个 errors 的真实身份再确认

v2 的 5 个 errors 与 baseline 的 5 个完全一致——同 (session, turn) 对：

sess 35680 turn 132/133  (input 91-92K, 超过模型 92098 上限或接近)
sess 39360 turn 137/138/139  (input 91-92K)

DP 也拒同样 5 个请求，但 SGLang DP 路径返回 finish_reason=abort/BadRequestError 而非 error。口径不一致而已。

如果把这 5 个 outlier 排除：

KVC v2 真实 mechanism errors: 0
4DP 真实 mechanism errors: 0
双方都受 trace input-超限 artifact 影响

p99 +3% 几乎全部来自这 5 个 timeout（每个 ~30s 拉到 p99）。修复 trace 或加 --allow-auto-truncate 后 p99 也会反转。

7. REFACTOR_PLAN_V1 情景 C 实现

回看 docs/REFACTOR_PLAN_V1_ZH.md §6 的三个情景：

情景	描述	状态
A	KVC < DP，接受现状转维护	不适用
B	KVC ≈ DP，重新定义价值主张	不适用
C	KVC > DP，优化拉大差距	✓ 实现

工程量预估对照：

计划：3 天编码 + 1 周回归 = ~2 周
实际：1 天编码（policies.py + replay.py 各 ~30 行）+ 2 个验证 run（11h GPU）= ~2 工作日

7.1 项目核心假设被实证

假设（自 docs/PROJECT_OVERVIEW.md）：

agentic coding workload 里，如果 router 更懂 session 和 KV cache，P/D serving 的端到端延迟能不能更低。

答案：能。在 SWE-Bench 4449 reqs / 52 sessions 上：

TTFT mean 比 4DP CA 低 24%
E2E latency mean 比 4DP CA 低 0.8%（基本平手但有方向）
TTFT p90 比 4DP CA 低 64%（用户感知"最慢的请求多快出 token"）

但有边界：

工作点必须不饱和（ts=1 给 D 自然 idle / drain time）
session 必须有 multi-turn（无 multi-turn 则 direct-to-D 无意义）
direct-append 阈值需要按 trace 调（2048 太小，8192 在本 trace 上接近最优）

8. 局限与未验证

N=1：v2 单 run。但 ts=1 下系统在 categorical 层面完全确定（docs/TEAM_REPORT §2.8 / docs/REFACTOR_PLAN_V1 §1.4），N=1 vs N=3 在 lat 数值上漂移 < 0.5%。结论可信。
4 GPU 缩配：原始实验 8 GPU，本次 4 GPU。结论严格只适用于 4 GPU 1P3D vs 4DP；8 GPU 比例（2P6D vs 8DP）需重测。
Mooncake TCP loopback：所有 transfer 在单机 TCP 模拟下。生产 RDMA 下 KVC 的 transfer 开销更小，预期 KVC 优势进一步扩大。
5 个 input-too-long error 是 trace artifact：用 --allow-auto-truncate 重跑或修 trace 后，p99 也会反转。
threshold=8192 在本 trace 接近最优，但未 sweep：4096/8192/16384 各跑一次会更精确。但 GPU 预算考虑：当前 91.7% direct-to-D 已经接近天花板（剩 8.3% 是真大 append + 真饿死），sweep 收益有限。
没测 8DP at ts=1 sanity（只有 ts=10 的）：若有更多 GPU 时间，应补一次 8DP ts=1 N=1 作为 8 GPU 比例的对照。

9. 后续动作

按 ROI 排序：

必做（短期）

commit + push v2 代码（已完成）
更新 REFACTOR_PLAN_V1 §6 标注情景 C 实现（已完成）
更新 TEAM_REPORT §3 ts=1 验证更新章节——把 v2 数据 + 三方对比写入
修 input-too-long 的 metrics 口径一致性（§2.7）：让 KVC 和 DP 的 5 个 abort 走同一套统计

可选（长期）

更长 trace（>200 sessions）：测 KVC 在容量更紧张时的边界
更多 workload：不同领域的 agentic trace（写作、研究、bug 修复等）

10. 与 4DP 的本质差异

为什么 KVC v2 能赢看起来"应该简单"的 4DP？

维度	4DP CA	KVC v2
Routing	hash-based prefix routing	session-aware + capacity-aware
Prefill	与 decode 同 worker（kernel 切换）	P 专用 worker（持续 batched prefill）
KV reuse	radix prefix cache（自然命中前缀）	session affinity + 跨 turn KV 复用
TTFT	TTFT = prefill latency on busy worker	TTFT = D-side append-prefill on idle slot

KVC v2 在 91.7% 请求上：

跳过 P → D 推 KV 的整个 mooncake 链路
D 上做小规模 append-prefill（数百 token vs 几万 token）
TTFT 降到几十毫秒级别

而 4DP：

每个请求在 worker 上做完整 prefill（包括 prefix cached 部分的 metadata 处理）
prefill 与正在 decode 的请求争 GPU
TTFT 含 prefill kernel 启动 + scheduler 排队

这就是 -64% TTFT p90 的来源。

附录 A：本文数据来源

章节	数据源
§2	`outputs/qwen3-30b-tp1-ts1-migration-v2/kvc_1p3d_migration_v2_run1_*` + 同目录 baseline / v1 / DP 对照
§3	metrics jsonl 的 `execution_mode` 分组
§4	`structural/session-d-binding.jsonl` 的跨 turn 序列
§6	metrics jsonl 的 `error` + `finish_reason` 字段交叉

附录 B：相关文档

docs/TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md — §1-§9 原结构性问题清单
docs/REFACTOR_PLAN_V1_ZH.md — 重构方向 + 三情景分支
docs/MIGRATION_V1_FINDINGS_ZH.md — v1 thrashing 诊断
docs/archive/AGENTIC_FIT_ANALYSIS_ZH.md — 早期 fit 分析
docs/archive/STRUCTURAL_VALIDATION_REPORT_ZH.md — ts=10 结构性 claim 验证
scripts/sweep_ts1_migration_v2.sh — 本次 v2 sweep 脚本
scripts/analysis/analyze_ts1_validation.py — ts=1 4-way 对比分析

附录 C：相关代码

src/agentic_pd_hybrid/policies.py — RoutingState.session_d_rejects + KvAwarePolicy.migration_reject_threshold
src/agentic_pd_hybrid/replay.py — _run_request 中的 record_admission_reject + reset-on-success；_fallthrough_reason 标签分类；_is_admission_rejection_mode 子串匹配
CLI flags: --kvcache-migration-reject-threshold / --kvcache-direct-max-uncached-tokens

12 KiB Raw Blame History Unescape Escape