
kzlin 1d51704dad docs(kvc): agentic-fit analysis, refactor plan, validation report

Three new docs covering the structural-fit investigation:

- AGENTIC_FIT_ANALYSIS_ZH.md: §1-§7 of structural design issues that
  surface KVC vs vanilla DP gap on real agentic workloads (SWE 50sess).
  Quantifies session pinning, LRU shortfall, P-side imbalance,
  time-scale distortion, etc., with code citations and N=3 rerun data.

- REFACTOR_PLAN_ZH.md: KISS-edition refactor plan. After verifying the
  original "estimate inflation" and "resident_blocks aging" claims were
  not real bugs, scope shrinks to one code change (backpressure) plus a
  4-run smoke sweep within an 8h budget.

- STRUCTURAL_VALIDATION_REPORT_ZH.md: validates §1-§7 claims using
  existing v5 baseline rerun data + 8DP CA baseline. Each claim labeled
  fully-supported / indirect / retracted with the data source. Notes
  that backpressure E2E validation is pending GPU smoke run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-06 21:30:11 +08:00

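Refactor Plan v0: Minimal Edition

Date: 2026-05-06
Goal: use minimal changes plus lightweight experiments to verify whether the structural defects raised in docs/AGENTIC_FIT_ANALYSIS_ZH.md actually exist and how large their impact is.
Budget: 8h of GPU time (roughly 4-6 smoke runs of ~30-60 min each).
KISS boundary: do not touch the main-loop structure of SGLang scheduler.py; do not introduce a new mooncake protocol; do not implement cross-D session migration; do not split admission into probe/commit; do not touch the LRU eviction policy.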

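Plan conclusions (already confirmed with the user)

A re-review of plan-v0 found that neither of the two original Phase 1 changes addresses a real bug:

_estimate_session_resident_tokens returning the full prompt is by design: every call site that needs an "increment" already subtracts target - current (replay.py:1247-1254, :1393-1394, :1490-1491).
decode_resident_blocks never shrinking only wastes a few MB of memory and does not affect routing decisions (the hash_ids in the SWE trace are session-unique, so the policy still picks the correct D).

The final minimal edition makes exactly one code change (adding backpressure) plus extensive instrumentation.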

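The only code change: a backpressure signal

Change 1: add two fields to the SGLang `admit_direct_append` response

Files: third_party/sglang/python/sglang/srt/managers/io_struct.py, scheduler.py

from dataclasses import dataclass

@dataclass
class DirectAppendAdmissionReqOutput:
    ...                             # existing fields are kept
    recommended_pause_ms: int = 0   # new: pause the caller should honor, in ms
    queue_depth: int = 0            # new: decode transfer queue depth at admission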

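scheduler.py: compute the hint at the end of admit_direct_append:

def _compute_backpressure_pause_hint(self) -> int:
    # Recommended pause in ms, derived from the decode transfer queue depth.
    depth = len(self.disagg_decode_transfer_queue.queue)
    if depth < 8:
        return 0                    # shallow queue: no backpressure
    return min(2000, depth * 100)   # simple linear ramp, capped at 2000 ms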
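
To make the shape of the hint rule above easier to check without a running scheduler, here is a minimal standalone sketch of the same threshold-and-cap logic. The function name `pause_hint_ms` and the constant names are hypothetical. Only the threshold of 8, the slope of 100 ms per queued entry, and the 2000 ms cap are taken from the plan.

```python
QUEUE_DEPTH_THRESHOLD = 8   # below this depth no pause is recommended
PAUSE_PER_ENTRY_MS = 100    # linear slope per queued entry
MAX_PAUSE_MS = 2000         # upper bound on the recommended pause


def pause_hint_ms(depth: int) -> int:
    """Return the recommended pause in ms for a given transfer-queue depth."""
    if depth < QUEUE_DEPTH_THRESHOLD:
        return 0
    return min(MAX_PAUSE_MS, depth * PAUSE_PER_ENTRY_MS)
```

Under this rule the hint is not continuous: it jumps from 0 ms at depth 7 to 800 ms at depth 8, and it saturates at 2000 ms from depth 20 onward.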

Change 2: the replay side backs off according to the hint

File: src/agentic_pd_hybrid/replay.py

DecodeResidencyState gains a new field pause_until_s: dict[str, float].
_query_decode_direct_admission parses recommended_pause_ms from the response and sets pause_until_s[server_url] = now + pause_ms / 1000.
Before calling _invoke_router / _invoke_decode_session_direct, check pause_until_s[decode_url]; if now < pause_until, sleep until that time.
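
The three steps above could be sketched roughly as follows. This is not the actual replay.py code: the helper names `record_pause_hint`, `remaining_pause_s`, and `wait_for_backpressure` are hypothetical, the dataclass models only the new field, and the use of `time.monotonic` and a blocking `time.sleep` is an assumption (an asyncio-based replay loop would presumably await `asyncio.sleep` instead).

```python
import time
from dataclasses import dataclass, field


@dataclass
class DecodeResidencyState:
    # Only the field added by this change is modeled here.
    pause_until_s: dict[str, float] = field(default_factory=dict)


def record_pause_hint(state, server_url, recommended_pause_ms, now=None):
    """Store the time before which server_url should not be called again."""
    now = time.monotonic() if now is None else now
    state.pause_until_s[server_url] = now + recommended_pause_ms / 1000
    return state.pause_until_s[server_url]


def remaining_pause_s(state, decode_url, now=None):
    """Seconds still to wait for decode_url; 0.0 if no pause is active."""
    now = time.monotonic() if now is None else now
    return max(0.0, state.pause_until_s.get(decode_url, 0.0) - now)


def wait_for_backpressure(state, decode_url, sleep=time.sleep, now=None):
    """Sleep until the recorded deadline, returning the delay that was applied."""
    delay = remaining_pause_s(state, decode_url, now=now)
    if delay > 0:
        sleep(delay)
    return delay
```

Passing `now` and `sleep` as parameters is purely a testing convenience in this sketch; it lets the back-off arithmetic be checked without real waiting.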

Change 3: new CLI flag

Files: src/agentic_pd_hybrid/cli.py, benchmark.py:

--enable-backpressure   # defaults to false, preserving baseline behavior
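
A minimal sketch of how such an opt-in flag might be declared, assuming the CLI is built on `argparse` (the plan does not say which parser cli.py uses). The `build_parser` function and the `prog` name are hypothetical; only the flag name and its default come from the plan.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # The prog name is a placeholder, not the project's real entry point.
    parser = argparse.ArgumentParser(prog="agentic-pd-hybrid")
    parser.add_argument(
        "--enable-backpressure",
        action="store_true",  # absent flag -> False, so baseline runs are unchanged
        help="Honor recommended_pause_ms hints returned by admission.",
    )
    return parser
```

With `store_true`, existing baseline invocations need no changes, and the parsed value is available as `args.enable_backpressure`.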

Change 4: observability logs

Each run dir gains three new jsonl files:

admission-events.jsonl: one record per admission RPC (timestamp, session, D, can_admit, queue_depth, pause_ms, latency_s, available_tokens, evicted_session_count)
backpressure-events.jsonl: one record per actual sleep (timestamp, D, sleep_ms, queue_depth_at_signal)
session-d-binding.jsonl: one record the first time each session is opened on a given D (timestamp, session, D, turn_id)
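
One simple way to produce these files is an append-only helper that writes one JSON object per line. The sketch below is illustrative only: `append_event` is a hypothetical name, and the choice of wall-clock `time.time()` for the timestamp is an assumption. The file names and field names are the ones listed above.

```python
import json
import time
from pathlib import Path


def append_event(run_dir, filename, **fields):
    """Append one JSON object as a single line to run_dir/filename and return it."""
    # An explicit timestamp passed in fields overrides the default one.
    record = {"timestamp": time.time(), **fields}
    path = Path(run_dir) / filename
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

For example, `append_event(run_dir, "backpressure-events.jsonl", D=decode_url, sleep_ms=800, queue_depth_at_signal=8)` would add one line per actual sleep. Opening the file in append mode for every event keeps the helper stateless, at the cost of one open/close per record.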

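Experiment matrix (within the 8h budget)

Ordered as "anchor first, then single-variable comparisons". The rightmost column of each row is the estimated machine time.

ID	Configuration	Purpose	Machine time
E0 (existing)	v5 baseline, time-scale=10, no backpressure	Anchor; already exists at `outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/run1`	0
E1	v5 + backpressure ON, time-scale=10, full trace	Validate Claim §3 (whether backpressure eliminates the KVTransferError avalanche)	~50 min
E2	v5 baseline, time-scale=1, short trace (first 12 sessions ≈ 1000 reqs)	Validate Claim §7 (time-scale=10 distortion); backpressure off	~60 min
E3	8DP CA, time-scale=1, same trace as E2	Control for E2: does KVC still lose to DP under real timing?	~60 min
E4	v5 + backpressure, time-scale=1, same trace as E2	Is backpressure still useful under real timing?	~60 min
E5 (optional)	v5 baseline, time-scale=10, concurrency=4, full trace	Validate Claim §1 (whether high concurrency is a necessary condition)	~50 min

Total: 4-5 runs, ~3-5h. The remaining budget is reserved for rerunning failed runs and for analysis.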
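
As a quick sanity check on the budget, the per-run estimates from the matrix can be summed directly. This is only back-of-the-envelope arithmetic on the estimates in the table; the names `RUN_MINUTES` and `planned_hours` are illustrative.

```python
# Estimated machine time per new run, in minutes, copied from the matrix.
RUN_MINUTES = {"E1": 50, "E2": 60, "E3": 60, "E4": 60, "E5": 50}
BUDGET_HOURS = 8


def planned_hours(run_ids):
    """Total estimated GPU hours for the given run IDs."""
    return sum(RUN_MINUTES[run_id] for run_id in run_ids) / 60


core_hours = planned_hours(["E1", "E2", "E3", "E4"])   # 230 min
all_hours = planned_hours(RUN_MINUTES)                  # 280 min, including optional E5
slack_hours = BUDGET_HOURS - all_hours                  # left for reruns and analysis
```

By these estimates the four core runs take about 3.8h and all five about 4.7h, leaving roughly 3.3h of the 8h budget as slack.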

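Experiment goals: mapping back to §1-§7 one by one

Doc §	Claim	Experiment that refutes/supports it	Metrics needed
§1	Permanent session pinning + capacity-blind selection causes a bimodal distribution	Existing E0 data is sufficient	per-session distribution of direct-to-D rate
§2	LRU cannot keep up with pressure	Existing E0 logs are sufficient, plus E1 to see how the trim/error ratio changes with backpressure	trim event count vs OOM count
§3	Lack of backpressure is the source of the avalanche	E0 vs E1	KVTransferError count, P99 latency
§4	Admission RPC interferes with the scheduler	Out of scope for this round (validating it requires the admission probe split, which is not being done)	–
§5	P-side is unaware of D health	Existing E0 logs are sufficient (prefill-0 vs prefill-1 error counts)	per-P KVTransferError
§6	(retracted)	–	–
§7	time-scale=10 distortion	E0 vs E2 (same KVC, different time-scale); E2 vs E3 (same time-scale, KVC vs DP)	latency distribution, direct-to-D rate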

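Final experiment report deliverable

After the runs finish, produce docs/STRUCTURAL_VALIDATION_REPORT_ZH.md, giving the following for each of §1-§7:

The claim, verbatim
Data evidence (which experiment, which metric)
Verdict: holds / partially holds / overturned
Impact quantification: the numeric difference
Uncertainty: N=1 risk, other confounders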

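What we are not doing (KISS boundary)

Wanted but not doing	Reason
Running N=3 repeats	Does not fit in 8h; a single run is enough to see the overall direction
Full parameter sweep	Only time-scale and the single backpressure boolean are varied
Changing LRU eviction	Out of scope for this round
Cross-D migration	Out of scope for this round
Admission probe/commit split	Out of scope for this round
P-side D-health routing	Out of scope for this round
Fixing the two "non-bugs" (estimate / aging)	Verified not to be real bugs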

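Anticipated failure paths

GPU resources are tight: shrink the smoke trace further (first 8 sessions / 600 reqs)
A time-scale=1 run exceeds 1.5h: truncate to the portion that can complete within 600s
Backpressure is misconfigured: start with the simple linear sleep_ms = depth * 100; if it cannot be tuned to work, roll back to 0 (no backpressure)
The SGLang patch fails to compile: all patches are confined to a small number of lines in io_struct.py and scheduler.py and can be reverted individually with git restore

Next: implement → run smoke → write the report.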
