Compare commits: `h200-cu130` ... `improve/au` (13 commits)

| SHA1 |
|---|
| 110bd68000 |
| d93228e156 |
| 9a81c993ab |
| dbb9eee471 |
| 4021f27ee2 |
| c5f552e122 |
| a785b83023 |
| 76a79dfdda |
| 591cd6d382 |
| fd37eda367 |
| 683c44bd71 |
| baa843a3f9 |
| 6cdea52f28 |
24
AGENTS.md
@@ -1,9 +1,33 @@
# AGENTS.md

## For new collaborators / agents

Before doing anything else, read [docs/INDEX_ZH.md](docs/INDEX_ZH.md). It points to the
3 must-read docs and a role-based reading path (new SWE, paper reviewer,
reproducing student, control-plane reader).

Cross-branch progress, weaknesses, and roadmap live in
[docs/AUDIT_AND_ROADMAP_ZH.md](docs/AUDIT_AND_ROADMAP_ZH.md). It is the single source of truth
for "what's done, what's broken, what to do next."

Two engineering work items are pre-specced and ready to pick up:
- block-level eviction refactor: [docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md)
- D→P incremental KV sync: [docs/D_TO_P_SYNC_CONTRACT_ZH.md](docs/D_TO_P_SYNC_CONTRACT_ZH.md)

Evaluation protocol (paper-quality N, paired CI, stratification,
baselines) is in [docs/EVALUATION_PROTOCOL_ZH.md](docs/EVALUATION_PROTOCOL_ZH.md).

## Environment

Use `uv` to manage all Python environments. Use `uv add` to manage dependencies so that `uv sync` reproduces exactly the same runnable environment each time.

Algorithm-layer unit tests (no GPU, no SGLang):

```bash
uv sync --group test
uv run pytest
```

## Goal

Build a minimal prototype on top of **SGLang xPyD** to test whether **session-aware / KV-cache-aware P/D routing** can improve **end-to-end latency** for agentic coding workloads.
28
README.md
@@ -6,6 +6,9 @@

A fuller but still concise description is in [docs/PROJECT_OVERVIEW.md](docs/PROJECT_OVERVIEW.md).

New collaborators: start with [docs/INDEX_ZH.md](docs/INDEX_ZH.md) and pick the 3 must-read docs that match your role.
For an overview of current progress, weak points, and the roadmap, see [docs/AUDIT_AND_ROADMAP_ZH.md](docs/AUDIT_AND_ROADMAP_ZH.md).

## What has been done so far

- Launch a single-node SGLang P/D stack.
@@ -99,3 +102,28 @@ uv run agentic-pd-hybrid replay \
- SGLang changes: `feat(sglang): ...` / `fix(sglang): ...`.
- The baseline for `third_party/sglang` is a clean SGLang `v0.5.10` snapshot.
- Do not commit `outputs/`, logs, `__pycache__`, or virtual environments.

## Unit tests (no GPU)

The algorithm layer (policies, Algorithm 1 / Theorem 1) has pure-Python unit tests; running them requires neither a GPU nor SGLang:

```bash
uv sync --group test
uv run pytest
```

See [tests/README.md](tests/README.md) for details.

## Evaluation scripts

After collecting data according to [docs/EVALUATION_PROTOCOL_ZH.md](docs/EVALUATION_PROTOCOL_ZH.md):

```bash
# M3: bucket by turn_id / input_length / overlap_ratio / append_tokens
scripts/analysis/stratified.py outputs/<run>/request-metrics.jsonl

# M2: paired-on-same-trial bootstrap 95% CI
scripts/analysis/paired_compare.py \
    --baseline outputs/run-dp/request-metrics.jsonl \
    --candidate outputs/run-kvc/request-metrics.jsonl
```
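
The paired comparison named in M2 can be illustrated with a minimal sketch. This is not the repository's `paired_compare.py`: the function name, the percentile-bootstrap method, and the input layout (two equal-length latency lists aligned by trial) are assumptions made for illustration.

```python
import random
import statistics


def paired_bootstrap_ci(baseline, candidate, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean paired difference (candidate - baseline).

    Hypothetical helper: baseline[i] and candidate[i] must come from the same trial.
    """
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("inputs must be non-empty and aligned by trial")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the per-trial differences with replacement and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return statistics.fmean(diffs), lo, hi
```

Resampling the per-trial differences, rather than each arm separately, is what makes the comparison paired: trial-to-trial variance that affects both arms cancels out. If the resulting interval excludes zero, the difference is unlikely to be resampling noise.
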
140
docs/AUDIT_AND_ROADMAP_ZH.md
Normal file
@@ -0,0 +1,140 @@
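# Project-wide Audit and Next-phase Roadmap

**Date**: 2026-05-12
**Branch base**: `improve/audit-and-foundations` (based on `h200-cu130`)
**Nature**: cross-branch consolidation + roadmap, so collaborators can judge whether each commit is worth merging
**Audience**: the project's next SWE / research agent, plus pre-reading for paper reviewers

This document gathers in one place the progress, established contributions, weak points, and the roadmap toward SOSP/OSDI and production quality across five branches (`main` / `kvc-debug-journey-v1-to-v4` / `feat/d-to-p-sync` / `h200-cu130` / `kvc-real-ali-iter-v1`), so that everyone can align quickly.

---
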
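## 0. TL;DR

1. **Already established**: the v1 → v2 algorithm (reset-on-success, lexicographic Route, worker-mode Admit RPC) has a formal definition, two theorems, and measurements on SWE-Bench (50 sessions, ts=1) in which it beats 4DP CA on 6 of 8 metrics.
2. **Core weak points**: (a) session-level eviction conflicts with the design intent of KVC; (b) incremental D→P KV sync does not exist, and the TTFT p99 long tail comes from this; (c) the mooncake "instance not alive" cascade is a fundamental control-layer availability problem; (d) the evaluation still lacks strong statistics across multiple baselines and multiple traces.
3. **Work that can progress without a GPU**: algorithm-layer unit tests, formal design docs (block-level evict, the D→P sync interface contract), the evaluation protocol, stratified analysis tools, and consolidating the documentation. Most of Milestone 1 in this roadmap falls into this category.
4. **Required for OSDI/SOSP**: execute §S1 (block-level evict) + §S2 (D→P sync POC) + §M2/M3/M4 (paired protocol / full Ali / multiple baselines). Estimated at 3–4 months for one or two people.

---
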
## 1. Status overview of the five branches

| Branch | Role | Current state | Key output |
|---|---|---|---|
| `main` | "Released" baseline | 18 commits behind origin; 2P4D + worker-admission + seed-min2 reports a 9% mean / 19% p90 improvement vs default PD | The two policy tiers in `KVCACHE_CENTRIC_PROGRESS_ZH.md`: latency-best vs stable |
| `kvc-debug-journey-v1-to-v4` | Main working branch | Full v1→v5 algorithm evolution; `KVC_ROUTER_ALGORITHM.md` has the three-part algorithm + two theorems | SWE-Bench, 50 sessions, ts=1: v2 beats 4DP CA on 6/8 metrics; **TTFT p99 still loses by 3×** (1.28s vs 0.43s), diagnosed as the 8.3% reseed slow path |
| `feat/d-to-p-sync` | Placeholder branch | No code, only `RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md` | Rules out the misconception that "capacity-backup is D→P sync"; lists 4 engineering subtasks |
| `h200-cu130` | Real hardware + RDMA validation | Ran E1/E2/E3 on 4×H200 + mlx5_60 NDR 400 Gb/s | **E2 80% failure** (mooncake dead-link cascade); **E3 triggered an SGLang patch invariant crash at 16 min**; the latest `KVC_EVICTION_GRANULARITY_DESIGN_ZH.md` elevates the root cause to "session-level is the wrong eviction granularity" |
| `kvc-real-ali-iter-v1` | Real Ali trace validation | 8×H20, 179-request KVC-fit slice + 600-request / 15-min cold window | KVC vs DP: KVC-fit p50 −46% ✅; real 15-min p90 +19s ❌, 53 errors vs 1 for DP; KVC OOMs at the default mem-fraction, which must be lowered to 0.82 |

---

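## 2. Contributions that are already solidly established

Measured by "can a reviewer refute it?":

1. **Reset-on-success fixes v1 thrashing**: the v1 failure mode (permanent blacklist → migration livelock) is backed by measurements, the Algorithm 3 formalization, and Theorem 1's proof of starvation freedom (`KVC_ROUTER_ALGORITHM.md` §3.4 / §4.1).
2. **Clear division of labor across three algorithms**: Algorithm 1 (lexicographic Route) + Algorithm 2 (D-autonomous Admit RPC) + Algorithm 3 (Dispatch + reset-on-success). v5 changed admission from a router-side estimate to a D-side RPC (Option D), which is the right layering because it decouples capacity ground truth from the routing score.
3. **Deterministic hits on the direct-to-D fast path** (Theorem 2): whenever the three conditions residency ⊇ prefix ∧ append ≤ τ_append ∧ cap_ok hold together, the request must take the fast path; the 91.6% hit rate on SWE-Bench and TTFT p50 = 0.43s are structural results.
4. **Every negative result has a forensic-grade explanation**: mooncake death, cold-D, the reseed slow path, and session-level evict each have code locations, a timeline, and a counterexample. This is a real plus for the paper.
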
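
The three-part condition in Theorem 2 can be written as a single predicate. This is a minimal sketch, assuming residency and prefix are represented as sets of block hashes; the function name, argument names, and set representation are illustrative and are not taken from `KVC_ROUTER_ALGORITHM.md`.

```python
def takes_fast_path(resident_hashes, prefix_hashes, append_tokens, tau_append, cap_ok):
    """Return True iff all three Theorem 2 conditions hold (hypothetical signature)."""
    return (
        set(prefix_hashes) <= set(resident_hashes)  # residency ⊇ prefix
        and append_tokens <= tau_append             # append ≤ τ_append
        and bool(cap_ok)                            # cap_ok
    )
```

Because the predicate is a pure function of router-visible state plus the D-side capacity answer, the fast-path hit rate is determined by the workload and residency, which is why the document calls the 91.6% figure structural.
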
---

## 3. Weak points that would let a reviewer land a fatal blow

### 3.1 Evaluation methodology

- **M1 Insufficient N**: on SWE-Bench v2 the baseline is confirmed as categorical at N=3, but v2 itself has too few runs; bootstrap CIs are missing.
- **M2 Unequal comparison basis**: with E2 at 80% failure, latency was computed over "successful only" requests and compared against the full E1 set; the paper must use paired-on-same-trial comparison.
- **M3 Trace biased toward KVC-friendly cases**: the KVC-fit slice was filtered for small-append + high overlap; the diluted result on full Ali (turn2+ ratio 26%, very many single-turn sessions) has never been run.
- **M4 Baselines not strong enough**: none of vLLM + prefix-cache, DistServe, SplitWise, or Mooncake-Master is included.
- **M5 Single trace**: missing ShareGPT / Mooncake traces, a long-context tool-use agent benchmark, and synthetic adversarial traces.
- **M6 Hardware coverage**: single-node with ≤ 8 GPUs only; no cross-node runs and no measurements on clusters of ≥ 32 GPUs.

### 3.2 System design

- **S1 Session-level eviction conflicts with KVC's design intent**: 90 evictions, 67K tokens freed per eviction on average, and 25/50 sessions forced to re-prefill 50–90K tokens. `KVC_EVICTION_GRANULARITY_DESIGN_ZH.md` identifies the issue, but no fix is implemented.
- **S2 No incremental D→P sync**: 50% of the TTFT p99 long tail comes from re-prefill on P. `capacity-backup` is a static seed-time snapshot, not D→P sync. The fix requires changing the single-producer assumption in SGLang's radix cache.
- **S3 Mooncake cascading death**: admission no-space → repeated seed retries → heartbeat loss → SGLang aborts the whole batch (E2: 1054/1285 failed). A fundamental control-layer availability bug.
- **S4 Admission RPC blocks synchronously**: no backoff / hedging / staleness budget. GIL jitter in the D scheduler is enough to stall the router.
- **S5 Cold-D / overlap-pinning**: the boilerplate 24-token block hash makes every session overlap with D0/D1 → D2/D3 get 0 bindings. The load-floor bonus is a patch, not a first-principles fix.
- **S6 The local SGLang patch is already 785 lines across 10 files**, including changes to hot-path invariants such as `schedule_batch.py:1646`; the E3 crash is a latent landmine introduced by the vendored patch.
- **S7 Failure recovery / idempotency**: under chunked-prefill retry, streaming-session idempotency relies on `SessionSlot.restore_to_req`; there is no recovery protocol for worker crash, mooncake reconnect, or partial KV corruption.
- **S8 No multi-tenant / SLO-aware scheduling**: the algorithm's objective implicitly uses w_ttft = w_lat = 1. Production needs separate tiers for interactive / batch / background.
- **S9 Topology fixed at boot**: the P/D ratio is a startup parameter. Production workloads need elasticity.
- **S10 Backpressure pause hint is not closed-loop**: it fired 20 times, but under no-BP nobody responded; the control plane is not wired up.

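
The missing backoff named in S4 could take the following shape. This is a sketch of a generic "full jitter" exponential backoff, not the project's admission client: `call_with_backoff`, its parameters, and the default delays are all hypothetical.

```python
import random
import time


def backoff_delays(base=0.05, cap=2.0, attempts=5, rng=None):
    """Yield 'full jitter' delays: uniform in [0, min(cap, base * 2**attempt)]."""
    rng = rng or random.Random()
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng.uniform(0.0, ceiling)


def call_with_backoff(rpc, attempts=5, sleep=None, **backoff_kwargs):
    """Call rpc() until it returns a truthy result or the attempts run out.

    Sleeps for a jittered delay after every failed attempt; returns None on exhaustion.
    """
    sleep = sleep or time.sleep
    for delay in backoff_delays(attempts=attempts, **backoff_kwargs):
        result = rpc()
        if result:
            return result
        sleep(delay)
    return None
```

Randomizing the whole delay, rather than adding a small jitter to a fixed schedule, spreads retries from many sessions apart in time, which is the property that matters when a saturated D worker is the reason the admissions failed.
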
### 3.3 Engineering infrastructure

- **Observability**: metrics are jsonl + offline `recompute_summary.py`; production needs Prometheus + Grafana + OpenTelemetry traces.
- **Formal testing**: the algorithm and state layers lack unit tests; idempotency of `SessionSlot.restore_to_req` is an invariant the authors flagged themselves.
- **Chaos injection**: control-plane failures such as mooncake death need a fault-injection harness.
- **Code size**: `replay.py` is 2460 lines and combines orchestration, policy hooks, control plane, and metrics; fine for a prototype, weak as a paper-quality artifact.

---

## 4. Roadmap

There are three milestones. Each can be delivered independently (as a paper section or an engineering release).

### Milestone 1: Defensible SOSP/OSDI submission (3–4 months, one or two people)

**Goal**: consolidate the existing algorithm and failure diagnoses into a draft that can survive the first PC round.

1. **Execute §S1 (block-level eviction refactor)**: see `docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md`.
   - Streaming-session decode output is committed incrementally into the radix tree via `cache_finished_req` when each turn finishes.
   - `SessionSlot` is reduced to pure metadata (holding only `last_node` + lock_ref).
   - `release_session` becomes `dec_lock_ref` + slot deletion; eviction is left entirely to SGLang's radix LRU.
   - Expected: eviction granularity drops from 67K tokens to 24 tokens per eviction; reseed frequency drops by an order of magnitude.
2. **Execute §S2 (incremental D→P sync POC)**: see `docs/D_TO_P_SYNC_CONTRACT_ZH.md`.
   - A microbenchmark should show: after a D append completes, KV blocks are pushed asynchronously back into the P-side radix → the next reseed skips re-prefill.
3. **Fix §S3 (mooncake death cascade)**: admission RPC backoff + jitter; per-D pending-seed budget; decouple the mooncake heartbeat from admission.
4. **First-principles fix for §S5**: redefine `overlap` as "the number of prefix hashes that the session holds exclusively on D" (dropping the contribution of shared boilerplate hashes), so that scores spread out naturally.
5. **Redo the evaluation**: see `docs/EVALUATION_PROTOCOL_ZH.md`. N≥3 + bootstrap CI + multiple baselines + full Ali + stratified reporting.
6. **Extend the formalization**: add Theorem 3 (upper bound on re-prefill cost under block-level evict) + Theorem 4 (relationship between the D→P sync staleness budget β and reseed cost).
7. **Artifact**: one-command script + Dockerfile + reproduction of the core tables/figures in one hour on 4×A100.

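
Item 4's redefinition of `overlap` can be sketched as follows. The sketch assumes that a boilerplate hash is one that appears in more than a chosen fraction of sessions; that threshold, the function names, and the data layout are assumptions for illustration and do not describe the project's `KvAwarePolicy`.

```python
from collections import Counter


def shared_hashes(session_prefix_hashes, max_share=0.5):
    """Hashes present in more than max_share of all sessions count as boilerplate."""
    counts = Counter(
        h for hashes in session_prefix_hashes.values() for h in set(hashes)
    )
    limit = max_share * len(session_prefix_hashes)
    return {h for h, c in counts.items() if c > limit}


def exclusive_overlap(session_hashes, resident_on_d, boilerplate):
    """Count the session's prefix hashes resident on D, ignoring boilerplate hashes."""
    return len((set(session_hashes) & set(resident_on_d)) - boilerplate)
```

With the shared system-prompt block excluded, a session that has never run on a given D scores zero overlap there, so the boilerplate block alone no longer pins every session to D0/D1.
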
### Milestone 2: Production-quality serving substrate (a further 3–6 months, 2–3 people)

8. **Control-plane layering**: split `replay.py` into `router/` / `control/` / `obs/` / `orch/`.
9. **Elastic topology**: an autoscaling controller with inputs (P queue, D transfer queue, D KV usage).
10. **Multi-tenant + SLO classes**: independent admission budgets for the interactive / batch / background tiers.
11. **Failure injection harness**: mooncake link flap / D OOM kill / router GC pause / partial KV corruption; each case has a recovery SLA.
12. **Persistent KV tier**: CPU DRAM + NVMe + RDMA-attached pool; evict becomes demote.
13. **Cross-node + heterogeneous**: mixed H100 + H200 + L40S with topology-aware routing.
14. **Observability**: per-request OpenTelemetry + per-D Prometheus + a main Grafana dashboard.

### Milestone 3: Research increments that can genuinely get into OSDI'27 (6–12 months)

15. **Learning-based admission / migration**: multi-armed bandit / RL to control τ_reject and K; train a session-aliveness predictor from traces.
16. **Cross-router residency consensus**: lightweight gossip to share `Σ.resident[d]`.
17. **Provable competitive ratio**: under an oracle KV-residency model, prove that the ratio of KVC's expected TTFT to the offline optimum is bounded.
18. **Distributed prefix tree**: map a logical prefix to physical replicas on multiple D workers, supporting multi-tenant prefix sharing (system prompt / tool schema).
19. **Energy-aware variant**: add GPU SM utilization + PCIe/RDMA energy cost to the objective.
20. **End-to-end agent serving framing**: move from request-level latency up to agent task completion time (one coding-agent task spans 30+ turns).

---

## 5. Work that can progress without a GPU

In order of ROI:

- [x] This roadmap (`AUDIT_AND_ROADMAP_ZH.md`).
- [x] Collaborator entry point (`docs/INDEX_ZH.md`).
- [x] Concrete block-level eviction design (`docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md`).
- [x] D→P sync interface contract (`docs/D_TO_P_SYNC_CONTRACT_ZH.md`).
- [x] Evaluation protocol (`docs/EVALUATION_PROTOCOL_ZH.md`).
- [x] Extraction of the `KvAwarePolicy` score into a pure function + unit tests (Algorithm 1).
- [x] Starvation-freedom property test (Theorem 1).
- [x] Stratified analysis script (bucketing along three dimensions: turn-index / append-size / overlap).
- [x] Paired-comparison protocol helper.
- [ ] Reproducible mock harness for mooncake death (runnable without a GPU).
- [ ] Classification of the SGLang patch surface (label each patch "required" / "experimental" / "retirable").
- [ ] Failure-mode taxonomy document (cold-D, overlap-pin, mooncake death, reseed storm, evict storm).

---

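## 6. One-sentence conclusion

> This project already has the material for a SOSP/OSDI workshop paper or poster; to get into the main track, it needs to make §S1 (block-level evict) and §S2 (D→P sync) real, fill in §M3 (full Ali) and §M4 (two strong baselines), and write the control-plane fix for §S3 (mooncake cascading death) into a repeatable artifact. If only one thing can be done, do the block-level eviction refactor first: it fixes "reseed happens too often" and also provides the prerequisite for extending the P-side radix to multiple producers.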
309
docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md
Normal file
@@ -0,0 +1,309 @@
# Block-level Eviction Refactor: Design Document

**Date**: 2026-05-12
**Prerequisite**: [KVC_EVICTION_GRANULARITY_DESIGN_ZH.md](KVC_EVICTION_GRANULARITY_DESIGN_ZH.md) (architecture-level manifesto)
**Nature**: implementation-level design + API draft + test plan, for the next collaborator to code from directly
**Status**: draft, not implemented. All code is quoted from `third_party/sglang/python/sglang/srt/mem_cache/session_aware_cache.py @ origin/h200-cu130`

---

## 0. TL;DR

Change `SessionAwareCache`'s current semantics of **freeing a streaming session's entire KV in one shot** to:

1. Streaming-session decode output is **committed incrementally into the radix tree** at turn finish.
2. `SessionSlot` is reduced to **pure metadata** (holding only `last_node` + lock_ref state) and no longer exclusively owns a KV range.
3. `release_session` only does dec_lock_ref + slot deletion, **letting SGLang's standard radix LRU reclaim the prefix block by block**.

Expected benefit: eviction granularity drops from ~67K tokens per eviction to ~24 tokens (page_size tokens), and reseed frequency drops by an order of magnitude. The change also turns the P-side radix tree into one that can be fed data externally, paving the way for [D_TO_P_SYNC_CONTRACT_ZH.md](D_TO_P_SYNC_CONTRACT_ZH.md).

---

## 1. Walkthrough of the current code

### 1.1 Key files and functions

`third_party/sglang/python/sglang/srt/mem_cache/session_aware_cache.py`

| Function / field | Current semantics |
|---|---|
| `SessionSlot.req_pool_idx` | req_pool slot exclusively owned by the streaming session |
| `SessionSlot.kv_committed_len` | KV length already committed when the previous turn finished (the part counted in cache_protected_len has entered the radix) |
| `SessionSlot.kv_allocated_len` | KV length currently allocated but **not in the radix** (the "session-exclusive tail") |
| `SessionSlot.cache_protected_len` | protected boundary from when the first turn was committed to the radix |
| `match_prefix(streaming req)` | slot hit → returns `req_to_token[req_pool_idx, :prefix_len]`, bypassing the radix |
| `cache_unfinished_req(streaming req)` | subsequent turns → **skips inner entirely** (nothing enters the radix) |
| `cache_finished_req(streaming req)` | calls `slot.save_from_req`, **does not call inner.cache_finished_req** |
| `release_session(sid)` | `dec_lock_ref(slot.last_node)` + `free(req_to_token[req_pool_idx, cache_protected_len:kv_allocated_len])` + reclaims the req_pool slot |

### 1.2 Why the current behavior is wrong (restated)

`[cache_protected_len, kv_allocated_len)` holds all the decode output accumulated after the first turn entered the radix, plus the extends of later turns. Measured on Inferact / SWE-Bench:

- `cache_protected_len` ≈ first-turn boilerplate, ~12K
- `kv_allocated_len` accumulates to 50–100K
- each `release_session` frees 38–88K in one shot; this part **never entered the radix** and cannot benefit from gradual leaf-by-leaf eviction

→ After a session is evicted, the full length must be re-prefilled from the client's original prompt and transferred in full over mooncake, which is equivalent to naive PD-disagg (see manifesto §1).

---

## 2. Target behavior

| Scenario | Current | Target |
|---|---|---|
| Session has accumulated 50K KV and D is full | `release_session` frees 38K at once | radix LRU evicts starting from the oldest leaf, ~24 tokens at a time |
| An evicted session arrives again | must reseed 50K | re-prefill only the evicted leaf portion (typically ≤ 5K) |
| Evicted session TTFT | 50–90K reseed ≈ 3–7s | 5K append-prefill ≈ 200ms |
| Sessions that are not evicted | turns within a session are append-only | still append-only (unchanged) |
| Direct-to-D fast path hit rate | 91.6% (SWE-Bench) / 38% (E3 Inferact) | should be ≥ 85% even under saturation |

---

## 3. Design

### 3.1 Slimming down the SessionSlot fields

**After refactor**:

```python
import time
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class SessionSlot:
    virtual_node: _VirtualNode = field(default_factory=_VirtualNode)

    # Pointer into the radix tree — the deepest node owned by this session's
    # committed prefix. Held under inc_lock_ref so radix LRU never evicts this
    # *active* leaf out from under a turn-in-progress. Released by
    # release_session.
    last_node: Any = None
    swa_uuid_for_lock: Optional[str] = None

    # Bookkeeping fields (no longer authoritative ownership of KV indices).
    last_access_time: float = field(default_factory=time.monotonic)

    # Mamba state stays slot-owned (mamba doesn't fit the radix model).
    mamba_pool_idx: Any = None
    mamba_ping_pong_track_buffer: Any = None
    mamba_next_track_idx: Any = None
    mamba_last_track_seqlen: Any = None
    mamba_branching_seqlen: Any = None
```

**Removed**: `req_pool_idx`, `kv_committed_len`, `kv_allocated_len`, `cache_protected_len`, `swa_evicted_seqlen`. The ground truth for these fields is now maintained jointly by the radix tree + req_to_token_pool.

### 3.2 Changes to `cache_finished_req`

**After refactor**:

```python
def cache_finished_req(self, req: Req, is_insert: bool = True, **kwargs):
    if not _is_streaming(req):
        return self.inner.cache_finished_req(req, is_insert=is_insert, **kwargs)

    session_id = req.session.session_id
    slot = self.slots.setdefault(session_id, SessionSlot())

    # KEY CHANGE: always delegate to inner — this inserts the new tokens
    # (kv_committed_len .. fill_ids end) as radix-tree blocks. Subsequent
    # match_prefix calls for this session will hit the radix tree directly.
    result = self.inner.cache_finished_req(req, is_insert=is_insert, **kwargs)

    # Update slot bookkeeping only (no KV ownership).
    slot.last_node = req.last_node
    slot.swa_uuid_for_lock = req.swa_uuid_for_lock
    slot.last_access_time = time.monotonic()

    # Mamba state still goes through slot.
    slot.mamba_pool_idx = req.mamba_pool_idx
    ...
    return result
```

**Invariants**:
- `inner.cache_finished_req` inserts the page_size-aligned KV in the range `[kv_committed_len_old, kv_committed_len_new)` into the radix. This semantics comes from the standard SGLang implementation, so inner needs no change.
- `slot.last_node` now points to **the tail node of the session's committed prefix** and advances after every turn.
- `dec_lock_ref(old_last_node)` + `inc_lock_ref(new_last_node)` must be executed at each turn switch.

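
The lock handoff required by the last invariant can be sketched as below. The toy classes are stand-ins so the example runs on its own: `ToyNode`, `ToyCache`, `ToySlot`, and `advance_session_lock` are hypothetical names, and the toy `inc_lock_ref` / `dec_lock_ref` only touch a single node, which simplifies whatever the real radix cache does internally.

```python
class ToyNode:
    """Stand-in for a radix node; only the lock counter is modeled."""
    def __init__(self):
        self.lock_ref = 0


class ToyCache:
    """Stand-in for the inner cache; records the order of lock operations."""
    def __init__(self):
        self.log = []

    def inc_lock_ref(self, node):
        node.lock_ref += 1
        self.log.append("inc")

    def dec_lock_ref(self, node):
        node.lock_ref -= 1
        self.log.append("dec")


class ToySlot:
    """Stand-in for SessionSlot; only last_node is modeled."""
    def __init__(self):
        self.last_node = None


def advance_session_lock(cache, slot, new_node):
    """Move the session's lock from slot.last_node to new_node.

    The new tail is locked before the old one is released, so the committed
    prefix is never left unlocked in between.
    """
    old_node = slot.last_node
    if new_node is old_node:
        return
    if new_node is not None:
        cache.inc_lock_ref(new_node)
    if old_node is not None:
        cache.dec_lock_ref(old_node)
    slot.last_node = new_node
```

Locking the new node first matches the ordering concern raised in §6.2: if the release came first, an LRU pass running in between could evict the leaf that was just committed.
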
### 3.3 Changes to `cache_unfinished_req`

For streaming sessions, subsequent turns **no longer skip inner**. The reason: `match_prefix` now goes through the radix, so the intermediate chunked-prefill state also needs to be maintained by inner:

```python
def cache_unfinished_req(self, req: Req, **kwargs):
    if _is_streaming(req) and kwargs.get("chunked", False):
        # Chunked prefill: forward to inner so the per-chunk extend gets
        # tracked in the radix LRU access timestamps.
        ...
    self.inner.cache_unfinished_req(req, **kwargs)
```

The chunked handling must keep the logic that rebuilds `prefix_indices` (see lines 215–225 of the current implementation), but the call to `inner.cache_unfinished_req` must not be skipped.

### 3.4 Changes to `match_prefix`

This is reduced to a **pure forward to inner**, because SessionSlot no longer holds KV pointers:

```python
def match_prefix(self, params: MatchPrefixParams) -> MatchResult:
    # No more slot-fast-path. Streaming sessions reuse KV via radix tree
    # match like every other request.
    return self.inner.match_prefix(params)
```

The "committed prefix length of this session" that callers need is now derived from `inner.match_prefix(...).device_indices.shape[0]`.

### 3.5 Changes to `release_session`

**After refactor**:

```python
def release_session(self, session_id: str) -> int:
    slot = self.slots.pop(session_id, None)
    if slot is None:
        return 0

    # Just release our radix lock — radix LRU can now reclaim our prefix
    # leaves at its own pace. NO direct token_to_kv_pool free.
    if slot.last_node is not None:
        if slot.swa_uuid_for_lock is not None:
            self.inner.dec_lock_ref(
                slot.last_node,
                DecLockRefParams(swa_uuid_for_lock=slot.swa_uuid_for_lock),
            )
        else:
            self.inner.dec_lock_ref(slot.last_node)

    # Mamba state still needs explicit cleanup if present.
    if slot.mamba_pool_idx is not None:
        ...

    return 0  # "freed_tokens" no longer meaningful; radix LRU shed lazily
```

### 3.6 Changes to `get_session_status` / `list_session_statuses`

The ground truth for `resident_tokens` now comes from the radix tree. inner needs to expose a helper:

```python
# In BasePrefixCache / RadixCache:
def tokens_under(self, node) -> int:
    """Count tokens in the path from root to `node` (inclusive)."""
    ...

# In SessionAwareCache:
def get_session_status(self, session_id: str) -> Optional[Dict[str, Any]]:
    slot = self.slots.get(session_id)
    if slot is None:
        return None
    # Compare against None explicitly so a falsy-but-valid node is not skipped.
    resident_tokens = (
        self.inner.tokens_under(slot.last_node) if slot.last_node is not None else 0
    )
    return {
        "session_id": session_id,
        "resident": resident_tokens > 0,
        "resident_tokens": int(resident_tokens),
        "last_access_time": float(slot.last_access_time),
    }
```

The capacity check in `admit_direct_append` switches to the radix ground truth for `resident_tokens`, which removes the possibility of `kv_committed_len` and `kv_allocated_len` disagreeing.

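
One possible body for the `tokens_under` helper is sketched below. It assumes each node stores the token ids of its incoming edge in `key` and a `parent` pointer, with an empty key on the root; `ToyNode` is a stand-in written for this example and the real node layout should be checked against `radix_cache.py` before reuse.

```python
class ToyNode:
    """Stand-in radix node: `key` holds this edge's token ids, `parent` the parent node."""
    def __init__(self, key=(), parent=None):
        self.key = tuple(key)
        self.parent = parent


def tokens_under(node):
    """Count tokens on the path from the root to `node` (inclusive)."""
    total = 0
    while node is not None:
        total += len(node.key)
        node = node.parent
    return total


# Example: a root with two chained edges of 24 and 48 tokens.
root = ToyNode()
first = ToyNode(range(24), parent=root)
second = ToyNode(range(48), parent=first)
resident = tokens_under(second)
```

Walking parent pointers costs time proportional to the depth of the path, not the size of the tree, so calling it once per status query should stay cheap.
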
### 3.7 Accompanying changes on the SGLang scheduling path

See `schedule_batch.py:1572-1646`: the current streaming-session correction (introduced in commits b8e6f13 / 986f351) is built on SessionSlot owning a separate KV range. After the block-level refactor this correction path **no longer needs to exist**, because a req's fill_ids / prefix_indices get consistent values directly from the inner radix `match_prefix`.

**Items to remove**:
- The correction block `actual_extend_len = max(0, len(fill_ids) - len(prefix_indices))` at `schedule_batch.py:1572-1585`.
- The `assert seq_len - pre_len == req.extend_input_len` at `schedule_batch.py:1646` (after the refactor this invariant holds structurally).
- The latent landmine triggered by E3 ([E3_FINDINGS_ZH.md](E3_FINDINGS_ZH.md) §2) disappears along with them.

---

## 4. Invariants (must be covered by the PR's self-tests)

| Inv | Statement |
|---|---|
| I1 | After `release_session(sid)`, the `match_prefix` behavior of the next request from the same session depends only on what is resident in the radix tree, not on the `slots` dict. |
| I2 | After the `cache_finished_req` call for any (session_id, turn_id), the radix tree must contain a root→leaf path covering all committed tokens of that turn (i.e. `tokens_under(slot.last_node)` never decreases). |
| I3 | `restore_to_req` must be **idempotent**: in chunked-prefill retry scenarios it may be called several times on the same req and the final req state is equivalent. The current implementation achieves this by "not clearing slot fields" → after the refactor it is guaranteed by the pure-function nature of radix `match_prefix`. |
| I4 | Requests without a streaming session (`req.session is None`) behave **unchanged**: every path short-circuits to inner. |
| I5 | After any turn ends, every `inc_lock_ref` on `slot.last_node` must have a matching `dec_lock_ref`, and `release_session` is the final release point. |

---

## 5. Test plan (runnable without a GPU)

### 5.1 Unit tests (mock inner cache)

Write a `MockRadixCache(BasePrefixCache)` that records the sequence of all `cache_finished_req / cache_unfinished_req / match_prefix / evict / dec_lock_ref` calls. Then:

| Test | Assertion |
|---|---|
| `test_release_session_no_direct_free` | after calling `release_session`, the mock sees **no** direct `free(kv_indices)` call, only `dec_lock_ref` |
| `test_subsequent_turn_inserts_radix` | simulate three `cache_finished_req` calls for turns 0 → 1 → 2 and assert that each one triggers `inner.cache_finished_req` |
| `test_match_prefix_uses_inner` | streaming and non-streaming requests both go only through `inner.match_prefix` |
| `test_restore_idempotent` | simulate a chunked-prefill retry; two consecutive `match_prefix` calls return identical `device_indices` |
| `test_eviction_under_pressure_is_block_level` | inject a "pool full, must evict 24 tokens" state and assert that `release_session` is not triggered and inner's LRU advances a single step |

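
The first test in the table could look roughly like this. The sketch is self-contained and does not import SGLang: `RecordingCache` stands in for the proposed `MockRadixCache`, and `SessionCacheSketch` models only the part of the refactored `release_session` that the assertion needs. Both classes are hypothetical.

```python
class RecordingCache:
    """Stand-in for MockRadixCache: records the name of every call it receives."""
    def __init__(self):
        self.calls = []

    def dec_lock_ref(self, node, *args):
        self.calls.append("dec_lock_ref")

    def free(self, kv_indices):
        self.calls.append("free")


class SessionCacheSketch:
    """Models only the lock-release part of the refactored release_session."""
    def __init__(self, inner):
        self.inner = inner
        self.slots = {}  # session_id -> last_node

    def release_session(self, session_id):
        last_node = self.slots.pop(session_id, None)
        if last_node is not None:
            self.inner.dec_lock_ref(last_node)
        return 0


def test_release_session_no_direct_free():
    inner = RecordingCache()
    cache = SessionCacheSketch(inner)
    cache.slots["s1"] = object()
    assert cache.release_session("s1") == 0
    # Only the lock release is observed; no direct free of KV indices.
    assert inner.calls == ["dec_lock_ref"]
    # Releasing an unknown session is a no-op.
    assert cache.release_session("s1") == 0
    assert inner.calls == ["dec_lock_ref"]
```

Recording call names instead of mocking return values keeps the test focused on the contract in I1 and I5: which inner operations are allowed to happen, and in what order.
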
### 5.2 Property-based tests

```python
from hypothesis import given
from hypothesis.strategies import integers, lists


@given(turns=lists(integers(min_value=24, max_value=2048), min_size=1, max_size=50))
def test_committed_tokens_monotone(turns):
    """tokens_under(slot.last_node) is monotonically non-decreasing across turns."""
    ...
```

### 5.3 Integration smoke (needs a GPU, but lives in the sweep scripts)

Run `sweep_e2_kvc_v2_rdma.sh` on the same trace with the same configuration and compare:
- total number of evictions (expected 90 → < 10)
- average tokens per eviction (expected 67K → < 500)
- TTFT p99 (expected 1.28s → < 0.7s)
- direct-to-D hit rate (expected ≥ 85%)

---

## 6. Effort and risks

### 6.1 Effort

| Work | Estimate | Risk |
|---|---|---|
| §3.1–§3.6 SessionAwareCache changes | 2–3 days | Medium: requires familiarity with the radix-internal lock_ref / evict protocol |
| §3.7 schedule_batch cleanup | 0.5 day | Low: it only deletes code |
| §4 invariant unit tests | 2 days | Low |
| §5.3 GPU smoke + data comparison | 2 days | Medium: mooncake may still trigger the E2 cascading death, so it needs to run together with the §S3 fix |
| **Total** | **~1 week** | |

### 6.2 Key risks

1. **Compatibility of `inner.cache_finished_req` with streaming-session reqs**: the standard SGLang radix currently assumes that a req reaching cache_finished_req has "completed its full prefill + decode". A streaming-session req still leaves an "unfinished conversation" behind at the end of each turn, so inner must not treat decode-only tokens as a discardable tail when inserting. This needs an audit of the implementation at `radix_cache.py:cache_finished_req`.

2. **lock_ref ordering**: turn N+1 starts with `match_prefix` → inc_lock_ref(new_node), and turn N ends with dec_lock_ref(old_node). If the order is reversed, under concurrency the LRU can mistakenly evict the leaf that was just committed. Suggested assertion: `inc_lock_ref` must arrive before `dec_lock_ref`.

3. **chunked-prefill retry**: see I3. SGLang's current `restore_to_req` leaves slot fields uncleared precisely to support this retry. After the refactor we must confirm that inner radix `match_prefix` is also idempotent under retry (a standard radix tree is, but a test should lock this property in explicitly).

---

## 7. Relationship to the D→P sync work

Block-level evict is a **prerequisite** for [D_TO_P_SYNC_CONTRACT_ZH.md](D_TO_P_SYNC_CONTRACT_ZH.md):

- D→P sync needs the P-side radix tree to **accept KV blocks fed in from outside**.
- The P-side radix currently assumes a single producer (the local worker's model output).
- Once the block-level refactor is done, streaming-session KV already follows the standard radix path, so letting the radix tree accept an additional "externally fed" producer is just an extension of the insert API rather than the invention of a new storage path.

→ The two can be done in sequence: block-level evict first, then D→P sync.

---

## 8. Minimal actions for the agent taking over

1. Fork a `feat/block-level-evict` branch (from `improve/audit-and-foundations` or `h200-cu130`).
2. Implement §3.1–§3.6.
3. Write the §5.1 + §5.2 unit tests.
4. Run the §5.3 smoke on 8×H100 / H200 and compare eviction frequency and TTFT p99.
5. If risk 1 in §6.2 materializes, go into SGLang's `radix_cache.py` and check whether streaming-session reqs need an `is_session_active=True` flag to prevent "dropping the decode tail".

---

**Key takeaway**: keep the session as the lifecycle boundary, but do **not** let it be the eviction boundary (hand that over to the radix LRU). This refactor removes two blockers at once: "reseed happens too often" and "the P-side radix cannot be fed externally".
247
docs/D_TO_P_SYNC_CONTRACT_ZH.md
Normal file
@@ -0,0 +1,247 @@
# Incremental D→P KV Sync: Interface Contract and Rollout Plan

**Date**: 2026-05-12
**Prerequisites**: [RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md](RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md) (gap analysis) + [BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) (precondition)
**Nature**: cross-layer interface contract + formalization of the staleness budget + phased rollout
**Status**: draft. The `feat/d-to-p-sync` branch is currently empty; this is the design document that branch should land first

---

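## 0. TL;DR

50% of the reseed slow path is spent on re-prefill at P, so **fixing the transfer stage (enabling RDMA) solves only half of it**. The only way to eliminate the long tail completely is to have the P-side backup incrementally keep up with D-side appends:

> D completes a turn on the direct-to-D path → asynchronously pushes the newly committed KV blocks back to the P-side radix → at the next reseed the P-side radix hits the full prefix, so no re-prefill is needed, only a single P→D transfer.

This document gives the interface contracts for three layers (mooncake / SGLang / agentic-pd-hybrid), a formal definition of a **staleness budget β**, and a four-stage rollout plan, so that this work can proceed decoupled from block-level eviction.

---
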
## 1. Staleness Budget β: formal definition

Let `L_D(s, t)` be the committed prefix length of session `s` on D (the instantaneous value at time `t`), and let `L_P(s, t)` be the backup prefix length of the same session on P.

```
staleness(s, t) := L_D(s, t) - L_P(s, t) ≥ 0
```

The **staleness budget β** is the upper bound the system commits to maintaining:

```
∀ s, ∀ t : staleness(s, t) ≤ β
```

Intuition: the smaller β is → the more likely a reseed hits the P-side backup → reseed degenerates into a single P→D transfer plus a re-prefill of ≤ β tokens.

- **β = 0**: fully synchronous (D blocks waiting for P's ack on every committed block). High latency cost; not recommended.
- **β = ∞**: the current state (the P-side backup is forever a static seed-time snapshot).
- **β = one page (24 tokens)**: single-block sync. Theoretically the optimal granularity, but every D-side append triggers a D→P RPC.
- **β = O(append_len) (typically 1K–4K)**: batched sync. Recommended starting point: aggregate the decode output of one turn and push it as a single batch.
- **β = O(turn_size) (typically ~50K)**: coarse-grained sync. The reseed bypass stops working and only transfer is reduced. Not advisable.

→ For rollout, the recommended setting is β = `max(page_size, min(committed_in_turn, β_max))`, with `β_max` defaulting to 4096.

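
The recommended β and the staleness definition above translate directly into two small functions. This is an illustrative sketch: the function names are made up here, while `page_size = 24` and `β_max = 4096` are the values stated in this section.

```python
def recommended_beta(committed_in_turn, page_size=24, beta_max=4096):
    """β = max(page_size, min(committed_in_turn, β_max))."""
    return max(page_size, min(committed_in_turn, beta_max))


def staleness(l_d, l_p):
    """staleness := L_D - L_P, which the definition requires to be non-negative."""
    if l_p > l_d:
        raise ValueError("P-side backup cannot be longer than D's committed prefix")
    return l_d - l_p


def within_budget(l_d, l_p, beta):
    """True iff the staleness bound staleness ≤ β holds for this session."""
    return staleness(l_d, l_p) <= beta
```

The clamp has two jobs: the lower bound stops sub-page appends from triggering a sync RPC, and the upper bound caps how much re-prefill a reseed can ever need, regardless of how long a single turn runs.
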
---

## 2. Three-layer interface contract

### 2.1 Mooncake layer: dual roles

**Current state** (see [RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md](RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md) §3):

- `MooncakeKVManager` is strictly bound to one role at initialization by `disaggregation_mode ∈ {PREFILL, DECODE}`.
- `MooncakeKVSender` is instantiated only in PREFILL mode, and `MooncakeKVReceiver` only in DECODE mode.
- `add_transfer_request` contains the hard constraint `assert disaggregation_mode == PREFILL`.

**Target interface**:

```python
# third_party/sglang/python/sglang/srt/disaggregation/base/conn.py
from enum import Enum


# Defined before BaseKVManager so the class-level annotation below resolves.
class KVRole(Enum):
    PREFILL = "prefill"
    DECODE = "decode"
    PREFILL_BACKUP_RECEIVER = "prefill_backup_receiver"  # new: P side receives D→P sync
    DECODE_BACKUP_SENDER = "decode_backup_sender"  # new: D side sends D→P sync


class BaseKVManager:
    roles: set[KVRole]  # replaces the original single-valued field; allows {PREFILL, DECODE}
```

**New classes** (~400 LOC at the implementation level):

| Class | Role | Key methods |
|---|---|---|
| `DecodeKVSender` | D side pushes the new post-append KV blocks back to P | `enqueue_sync(session_id, kv_blocks, target_p)` enqueues asynchronously and returns a `sync_id` |
| `PrefillKVReceiver` | P side receives D→P sync packets | `recv_loop()` background thread; each packet triggers a callback that injects into the radix tree |

**Bootstrap channel**: a second bootstrap socket is needed, independent of the existing P→D channel (to avoid conflicts in buffer pointer negotiation). Configuration:
- Disabled by default; enabled with the ServerArgs flag `--enable-d2p-sync`
- New port range `BOOTSTRAP_D2P_PORT_BASE = 22000`

### 2.2 SGLang layer: multi-producer radix extension

**Current state**: the P-side radix assumes a single producer (the local worker's model output). `RadixCache.cache_finished_req` takes KV indices directly from `req_to_token_pool[req_pool_idx, :]` and inserts them into the tree.

**Target interface** (after [BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) is complete):

```python
class RadixCache(BasePrefixCache):
    def insert_external(
        self,
        token_ids: Sequence[int],
        kv_tensor: torch.Tensor,
        *,
        source_worker_id: str,
        session_id: str,
        pin: bool = False,  # referenced by the docstring below; not pinned by default
    ) -> InsertExternalResult:
        """
        Insert KV blocks supplied by an external worker (D→P sync).

        Allocates fresh slots in token_to_kv_pool, copies kv_tensor into them,
        and threads the resulting indices through the radix tree exactly like
        cache_finished_req would for a local prefill.

        Invariants:
        - Same model layout (verified at handshake time, not per-call).
        - On collision with existing radix path, no-op for the shared prefix
          and only insert the diverging suffix.
        - Inserted nodes get lock_ref += 1 if `pin=True`, default False.
          D→P sync is best-effort; LRU is allowed to evict the inserted leaves.
        """
```

**Key design decisions**:

| Decision | Options | Recommendation |
|---|---|---|
| KV index remapping | A) D sends the original indices and P remaps; B) D sends a tightly packed tensor and P reallocates | **B**: avoids leaking indices across workers |
| Failure handling | A) D→P failure → fall back to re-prefill; B) retry N times | **A**, plus taking the old path at a later reseed if P misses |
| Reference counting | Is KV synced into P pinned? | **Not pinned**: the P-side LRU manages it naturally, so backups do not crowd out production KV |
| Coordination with evict | What if P is full when a sync arrives? | Let the sync insert trigger inner.evict → fair LRU competition with locally produced KV |
| Multiple P instances for one session | What if the router round-robins turns to different P workers? | **Accept multi-source**: each P keeps its own backup; at reseed, pick the one with the smallest staleness |

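
The collision rule in the `insert_external` docstring (no-op for the shared prefix, insert only the diverging suffix) can be illustrated on flat token lists. This sketch assumes that matching is page-aligned with a 24-token page and represents the existing radix path as a plain list; `split_external_insert` is a hypothetical helper, not part of SGLang.

```python
def split_external_insert(existing_tokens, incoming_tokens, page_size=24):
    """Return (shared_len, suffix) for an external insert.

    shared_len is the common-prefix length rounded down to a page boundary;
    suffix is the part of incoming_tokens that would actually be inserted.
    """
    shared = 0
    for a, b in zip(existing_tokens, incoming_tokens):
        if a != b:
            break
        shared += 1
    shared -= shared % page_size  # keep only whole pages of the shared prefix
    return shared, list(incoming_tokens[shared:])


# Example: P already holds 50 tokens; D sends 48 matching tokens plus one new page.
existing = list(range(50))
incoming = list(range(48)) + [999] * 24
shared_len, suffix = split_external_insert(existing, incoming)
```

Only `suffix` needs fresh slots in the KV pool, so a sync that mostly repeats what P already holds costs almost no P-side memory.
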
### 2.3 agentic-pd-hybrid layer: hooks and state machine

**New CLI flags**:

```bash
--enable-d2p-sync                     # off by default
--d2p-staleness-budget-tokens 4096    # β_max
--d2p-sync-batch-min-tokens 24        # trigger only with at least 1 page
--d2p-sync-target-policy {last_p, round_robin, broadcast}
    # last_p: push back to the P that last seeded this session
    # broadcast: push to every P (flexible at reseed, but bandwidth-heavy)
```

|
||||
**新增 state 字段**(`replay.py` 的 `DirectSessionState`):
|
||||
|
||||
```python
|
||||
@dataclass
|
||||
class DirectSessionState:
|
||||
...
|
||||
# NEW: per-P backup view, populated by D->P sync callbacks.
|
||||
prefill_resident_tokens_by_p: dict[str, int] = field(default_factory=dict)
|
||||
last_d2p_sync_at: float | None = None
|
||||
```
|
||||
|
||||
**Hook after `_invoke_session_direct` completes**:

```python
async def _invoke_session_direct(...):
    ...
    response = await self._stream_direct_to_d(...)
    if response.ok and self.config.enable_d2p_sync:
        new_committed = response.kv_committed_len
        prev_p_resident = max(session.prefill_resident_tokens_by_p.values(), default=0)
        staleness = new_committed - prev_p_resident
        if staleness >= self.config.d2p_sync_batch_min_tokens:
            target_p = self._choose_d2p_target(session)
            asyncio.create_task(
                self._issue_d2p_sync(session, target_p, prev_p_resident, new_committed)
            )
```

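The trigger condition in the hook above can be isolated into a pure function, which makes it testable without a router or GPU. This is a sketch only: `plan_d2p_sync` is a hypothetical helper name, and the default of 24 tokens mirrors the `--d2p-sync-batch-min-tokens` example value.

```python
from typing import Dict, Optional, Tuple


def plan_d2p_sync(
    kv_committed_len: int,
    resident_by_p: Dict[str, int],
    batch_min_tokens: int = 24,
) -> Optional[Tuple[int, int]]:
    """Return the [start, end) token range to push to P, or None to skip."""
    # Freshest backup across all prefill workers; 0 if no P holds anything yet.
    prev_p_resident = max(resident_by_p.values(), default=0)
    staleness = kv_committed_len - prev_p_resident
    if staleness < batch_min_tokens:
        return None  # not enough new tokens to justify a transfer
    return prev_p_resident, kv_committed_len


skip = plan_d2p_sync(1000, {"p0": 990})
push = plan_d2p_sync(1000, {"p0": 900, "p1": 400})
cold = plan_d2p_sync(50, {})
```

Note that, as in the hook, the range starts from the freshest backup on any P; whether the chosen target is that same P depends on the target policy.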
**Hook on the reseed path** (`_invoke_kvcache_seeded_router`):

```python
async def _invoke_kvcache_seeded_router(..., request):
    ...
    if self.config.enable_d2p_sync:
        # Probe P-side residency before issuing full re-prefill.
        probe = await self._probe_prefill_residency(session_id)
        # β_max; assumed to be exposed on config by --d2p-staleness-budget-tokens.
        beta_max = self.config.d2p_staleness_budget_tokens
        if probe.resident_tokens >= request.prefix_len - beta_max:
            # Use the up-to-date backup: skip re-prefill, just trigger P→D transfer.
            return await self._invoke_p_to_d_transfer_only(...)
    # Fall back to existing path.
    return await self._invoke_kvcache_seeded_router_legacy(...)
```

---

## 3. Properties (to be proven)

### 3.1 Candidate Theorem 4 (paper form)

*Suppose the staleness budget β is maintained. For a session `s` with accumulated length L on D that is evicted and then triggers a reseed:*

```
reseed_cost(s) ≤ T_p2d(L) + T_prefill(min(β, L))
```

*where T_p2d is the P→D transfer time (~L · 4 ns/token over RDMA) and T_prefill is the prefill time (~50K tokens/s on H100 TP1 Qwen3-30B). When β ≪ L, the cost degenerates to being dominated by a single P→D transfer.*

**Baseline for comparison** (no D→P sync): `reseed_cost = T_p2d(L) + T_prefill(L − seed_size)`, where re-prefill dominates.

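To make the two cost expressions concrete, the sketch below plugs in the rough rates quoted above. The constants are taken at face value from the text, the function names are hypothetical, and the session length and β are illustrative values, not measurements.

```python
P2D_SECONDS_PER_TOKEN = 4e-9        # "~4 ns/token" figure quoted above
PREFILL_TOKENS_PER_SECOND = 50_000  # "~50K tokens/s" figure quoted above


def t_p2d(tokens: int) -> float:
    """P→D transfer time for `tokens` tokens."""
    return tokens * P2D_SECONDS_PER_TOKEN


def t_prefill(tokens: int) -> float:
    """Prefill time for `tokens` tokens (negative inputs clamp to zero)."""
    return max(0, tokens) / PREFILL_TOKENS_PER_SECOND


def reseed_bound_with_sync(length: int, beta: int) -> float:
    """Upper bound from candidate Theorem 4."""
    return t_p2d(length) + t_prefill(min(beta, length))


def reseed_cost_baseline(length: int, seed_size: int) -> float:
    """Baseline without D→P sync: re-prefill everything past the seed."""
    return t_p2d(length) + t_prefill(length - seed_size)


with_sync = reseed_bound_with_sync(length=60_000, beta=4096)
baseline = reseed_cost_baseline(length=60_000, seed_size=0)
```

With these constants, L = 60,000 and β = 4096 give a bound of about 0.08 s against about 1.2 s for the baseline; the residual β-token prefill is still the larger term of the bound under these particular numbers.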
### 3.2 Relation to Theorem 2

Theorem 2 only guarantees fast hits on the direct-to-D path. Theorem 4 additionally pushes the fallback cost on a fast-path miss down to sub-second, which makes it possible for KVC to beat DP **at every quantile**.

---

## 4. Four-phase rollout

| Phase | Scope | GPU requirement | Acceptance metric |
|---|---|---|---|
| **P1** | block-level eviction refactor ([BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md)) | 4×H100 smoke | mean tokens freed per evict ≤ 500 |
| **P2** | make Mooncake dual-role + microbench (D→P single-packet RTT, bandwidth utilization) | single node + RDMA | P→D RTT < 50 ms (local); bandwidth for a single 16K-token block ≥ 50% of the theoretical ceiling |
| **P3** | SGLang `insert_external` + agentic-pd-hybrid hook (best-effort only, no reseed probe) | 4×H100 + RDMA | > 80% of triggered syncs complete within the same turn; no new failure mode introduced |
| **P4** | reseed probe wired up + end-to-end evaluation | 4×H100 + RDMA | single reseed < 0.5 s (vs. 3–7 s today), TTFT p99 < 0.5 s |

**Key decision point**: an audit is required between P1 and P2 to confirm that the SGLang radix `insert_external` does not conflict with the streaming-session decode path. If a serious conflict is found, introduce a placeholder "P-only sync mode" and open it up once the architecture has stabilized.

---

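## 5. Risks and mitigations

| Risk | Impact | Mitigation |
|---|---|---|
| Making Mooncake dual-role breaks the existing one-way P→D path | E2 already exposed the Mooncake "instance not alive" cascade; adding another channel could amplify it | In P2, start with a separate bootstrap channel behind a feature flag; keep a disable path |
| D→P sync consumes D egress bandwidth and hurts direct-to-D append-prefill latency | Directly degrades the main path | Run sync on a low-priority QP (RDMA SL=0), trigger it in batches, and allow at most one sync per turn |
| The P-side radix fills up with backups and crowds out locally produced KV | P-side prefill slows down | Sync inserts are not pinned (§2.2), so LRU competes fairly |
| Coordinating multiple backup views across multiple Ps is complex | The router must account for staleness when choosing target_p | Start with the `last_p` policy (recency-biased); decide on `broadcast` after observing the measured distribution |
| `insert_external` drifts from the upstream API across SGLang patch upgrades | Maintenance burden | Confine the API to our vendor patch boundary (without polluting the upstream radix) and write a contract test |

---

## 6. Decoupling from block-level eviction

| Work item | Depends on the other? |
|---|---|
| block-level eviction | Does not depend on D→P sync and can ship independently. On its own it reduces reseed frequency |
| D→P sync | **Depends** on block-level eviction: the P-side radix must be the source of truth for streaming-session KV |
| Both together | Largest gain: reseed frequency drops by an order of magnitude, and so does the time per reseed |

→ Rollout order: block-level eviction lands first; D→P sync follows on `feat/d-to-p-sync`. The two **should not** be merged into a single PR.

---

## 7. Minimal actions for the next agent

1. Land this document on the `feat/d-to-p-sync` branch.
2. Once block-level eviction is in main, start phase P2: Mooncake dual-role + microbench (unit tests, no coupling to the SGLang main path).
3. In phase P3, add `insert_external` and the hooks; merge to main disabled by default.
4. Decide the reseed probe policy (`last_p` vs `broadcast`) only after the P4 end-to-end evaluation.

---

**Core takeaway**: D→P incremental sync is more than "adding one more network channel"; the key is extending the P-side radix from a single producer to one that accepts best-effort external feeds. Block-level eviction is the prerequisite for this, so the two work items can go one after the other but not in reverse order.
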
185
docs/EVALUATION_PROTOCOL_ZH.md
Normal file
@@ -0,0 +1,185 @@
# Evaluation protocol (paper-quality)

**Date**: 2026-05-12
**Nature**: evaluation protocol specification covering all weaknesses M1–M6 in [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) §3.1
**Audience**: collaborators running experiments; paper authors; artifact reviewers

---

## 0. General principles

> Every number in the paper must be able to answer two questions:
> 1. **How large is the sampling error?** (bootstrap CI, N, std)
> 2. **Is the comparison fair?** (same trial, same trace, same token cap, same timeout, paired)

The current sweep reports (`KVCACHE_CENTRIC_PROGRESS_ZH.md` / `V2_RESULTS_ZH.md`) satisfy neither criterion. This document provides a compliant template.

---

## 1. Evaluation dimensions (one-to-one with M1–M6)

### 1.1 M1: statistical significance

| Decision | Rule |
|---|---|
| `N`: minimum runs per config | **3** (headline numbers) / **5** (final ablation values) |
| Reported statistics | `mean ± std`, **plus the 2.5/97.5 bootstrap CI** |
| Aggregating runs | Append the per-request latencies of all runs, then bootstrap over the pooled set; do not take a per-run mean and then average the means |
| Significance of differences | paired bootstrap p-value (≥ 5000 samples) |
| `N=1` allowed only for | smoke / sanity checks; **never in a headline table** |

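The pooling rule in the table above (pool per-request latencies first, then bootstrap) can be sketched with the standard library alone. This is an illustrative percentile bootstrap, not the contents of any script in the repo; `bootstrap_ci` is a hypothetical name and the latencies are made-up values.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=5000, lo=2.5, hi=97.5, seed=0):
    """Percentile bootstrap CI for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int(lo / 100 * (n_resamples - 1))
    hi_idx = int(hi / 100 * (n_resamples - 1))
    return means[lo_idx], means[hi_idx]


# Three hypothetical runs of the same config (seconds per request).
runs = [[0.8, 1.1, 0.9, 2.4], [1.0, 0.7, 1.3, 0.9], [1.2, 0.8, 3.1, 1.0]]
pooled = [lat for run in runs for lat in run]  # pool first, then bootstrap
ci_low, ci_high = bootstrap_ci(pooled)
mean = statistics.fmean(pooled)
```

Averaging per-run means instead would weight a short run the same as a long one and hide the per-request tail, which is why the table rules it out.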
### 1.2 M2: fair paired comparison

| Decision | Rule |
|---|---|
| Trace fixity | Use the same `samples-*.jsonl` file; lock replay with `--use-trace-as-sample` |
| Timeout | Same `--request-timeout-s` for every mechanism; one group at 600s and another at 300s is not allowed |
| Token cap | Same `--max-input-len` (take the minimum across all baselines and truncate explicitly) |
| Errors / aborts | Do **not** count successful requests only; list abort and timeout separately as `error_count`, and report metrics over the full set (errors included) or paired on the same trial mask |
| Time window | Consistent `time_scale`; changing it within a sweep is not allowed |
| Worker count / GPU type | Identical; any topology difference must be annotated |

**Counterexample**: the current `E1 vs E2` table ([E1_E2_RESULTS_ZH.md](E1_E2_RESULTS_ZH.md) §4) explicitly states "not a fair head-to-head": 80% of E2 failed, and its successful-only latency is compared against the full E1 set. **A table like this cannot go into the paper as is.**

### 1.3 M3: trace stratification

| Dimension | Suggested buckets |
|---|---|
| `turn_id` | `{1, 2-5, 6-20, 21+}` |
| `append_len` | `{≤128, 128-1K, 1K-8K, >8K}` |
| `overlap_ratio` | `{≤0.3, 0.3-0.7, >0.7}` |
| `inter_turn_gap_s` | `{≤5, 5-30, 30-300, >300}` |
| `input_len` | `{≤8K, 8K-64K, >64K}` |

**Reporting requirement**: beyond the headline numbers, include at least one "turn_id × append_len" heatmap so reviewers can see which slice the gain comes from.

**Counterexample**: the current Real Ali experiment reports -46% p50 only on the KVC-fit slice (high overlap + small append + 100% direct-eligible). That is an upper bound, not an average. A paired table on full Ali must be given alongside it.

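The `turn_id × append_len` cells requested above can be produced with a small bucketing helper. This is a sketch under assumptions: the record field names follow the table, "1K" is read as 1024, bucket upper edges are treated as inclusive, and `bucket` / `stratify` are hypothetical names rather than the planned `stratified.py` API.

```python
import bisect
from collections import Counter

# Upper-inclusive bucket edges; 1K is taken as 1024 here (an assumption).
TURN_EDGES, TURN_LABELS = [1, 5, 20], ["1", "2-5", "6-20", "21+"]
APPEND_EDGES = [128, 1024, 8192]
APPEND_LABELS = ["<=128", "128-1K", "1K-8K", ">8K"]


def bucket(value, edges, labels):
    """Return the label of the first bucket whose upper edge is >= value."""
    return labels[bisect.bisect_left(edges, value)]


def stratify(requests):
    """Count requests per (turn_id bucket, append_len bucket) cell."""
    cells = Counter()
    for req in requests:
        key = (
            bucket(req["turn_id"], TURN_EDGES, TURN_LABELS),
            bucket(req["append_len"], APPEND_EDGES, APPEND_LABELS),
        )
        cells[key] += 1
    return cells


sample = [
    {"turn_id": 1, "append_len": 12000},
    {"turn_id": 3, "append_len": 128},
    {"turn_id": 21, "append_len": 129},
]
cells = stratify(sample)
```

The same `bucket` helper applies to the other dimensions by swapping in their edges and labels.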
### 1.4 M4: baseline matrix

Run at least **2** of the following baselines:

| Baseline | Category | Library |
|---|---|---|
| vLLM + automatic prefix caching | same model, single-worker prefix cache | vLLM main |
| SGLang DP cache-aware (4×TP1) | current primary baseline | SGLang vendored in this repo |
| SGLang PD-disaggregation (kv-aware) | naive but cache-aware topology | this repo |
| DistServe | P/D-disaggregated baseline | DistServe upstream |
| SplitWise | P/D split + adaptive routing | open-source impl |
| Mooncake-Master scheduler | contemporary design | mooncake-master |

**Also recommended**: run an "oracle" baseline that assumes `Σ.resident[d]` is perfectly known and admission never fails, as an upper-bound reference for KVC.

### 1.5 M5: trace mix

| Trace | Purpose |
|---|---|
| Ali coding agent (full) | main result; includes single-turn dilution |
| Ali KVC-fit slice | demonstrates the KVC upper bound |
| SWE-Bench 50 sess | already available; multi-turn, high-overlap workload |
| ShareGPT | contrasting chat workload (short turns, low overlap). **Used to show that KVC does not regress on an unsuitable workload** |
| Inferact | tool-use-heavy agent workload |
| Mooncake trace | baseline trace for single-turn LLM serving |
| Synthetic adversarial | self-built: a burst of 100 new sessions seeding simultaneously, to test robustness against mooncake death and of reset-on-success |

**Minimum mix**: Ali full + SWE-Bench + ShareGPT + Synthetic adversarial.

### 1.6 M6: hardware coverage

| Tier | Purpose |
|---|---|
| Single node, ≤ 8 GPUs | all current results |
| Two nodes, NVLink + IB | validates cross-node D→P sync and Mooncake behavior |
| 4-node cluster (≥ 16 GPUs) | scaling numbers, cluster-scheduler assumptions |
| Heterogeneous (H100 + L40S) | topology-aware routing |

**Minimum mix**: single node 4×H200 + two nodes with NVLink + IB. The remaining two tiers can be left to future work.

---

## 2. Report templates

### 2.1 Main results table (Table 1)

```
| Config | N | mean ± std | p50 [CI] | p90 [CI] | p99 [CI] | err% | timeout% |
|--------|---|------------|----------|----------|----------|------|----------|
```

Annotate with: trace name, time_scale, `max_input_len`, `request_timeout_s`, and all shared parameters.

### 2.2 Paired delta table

```
| Pair | N pairs | mean delta [CI] | p50 delta [CI] | wins / losses | p-value |
```

`N pairs` = number of trials that succeeded on both sides. `wins` = number of trials with `latency_kvc < latency_baseline`.

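The columns of the paired delta table can be computed as follows. This is a hedged sketch, not the planned `paired_compare.py`: the function name and input shape are assumptions, and the two-sided p-value uses one common percentile convention (how often the resampled mean delta crosses zero), which the protocol itself does not pin down.

```python
import random
import statistics


def paired_compare(kvc, baseline, n_resamples=5000, seed=0):
    """kvc/baseline map trial_id -> latency (None marks a failed trial)."""
    deltas = [
        kvc[t] - baseline[t]
        for t in sorted(kvc.keys() & baseline.keys())
        if kvc[t] is not None and baseline[t] is not None
    ]
    wins = sum(d < 0 for d in deltas)
    losses = sum(d > 0 for d in deltas)
    rng = random.Random(seed)
    # Resample whole pairs (their deltas), never the two sides independently.
    boot = [
        statistics.fmean(rng.choices(deltas, k=len(deltas)))
        for _ in range(n_resamples)
    ]
    frac_nonneg = sum(m >= 0 for m in boot) / n_resamples
    p_value = min(1.0, 2 * min(frac_nonneg, 1 - frac_nonneg))
    return {
        "n_pairs": len(deltas),
        "mean_delta": statistics.fmean(deltas),
        "wins": wins,
        "losses": losses,
        "p_value": p_value,
    }


# Hypothetical latencies; trial t3 failed on the KVC side and is not paired.
kvc = {"t1": 0.9, "t2": 1.4, "t3": None, "t4": 0.7}
base = {"t1": 1.2, "t2": 1.3, "t3": 2.0, "t4": 1.1}
result = paired_compare(kvc, base)
```

Dropped pairs should still be reported through `err%` in Table 1, so that pairing on successes does not hide failures.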
### 2.3 Stratified tables (Table 2)

One table per stratification dimension (§1.3).

### 2.4 Negative-result section (mandatory)

The paper must have a dedicated section listing:

- Concrete numbers where KVC is slower than the baseline on ShareGPT.
- The trace percentiles / slices where KVC does not win.
- The diagnostic chain for failed sweeps (mooncake death, E3 crash).

→ Honest negative results noticeably improve a reviewer's impression. The draft in [V2_DEEP_ANALYSIS_ZH.md](V2_DEEP_ANALYSIS_ZH.md) §4 can be expanded into this section.

---

## 3. Tooling (scripts this repo needs)

| Script | Status | Notes |
|---|---|---|
| `scripts/analysis/recompute_summary.py` | ✅ exists | Fixes abort-contaminated latency; the main data entry point for this protocol |
| `scripts/analysis/stratified.py` | ⏳ added on this branch | Buckets along the §1.3 dimensions and outputs tables |
| `scripts/analysis/paired_compare.py` | ⏳ added on this branch | Paired bootstrap; outputs the §2.2 table |
| `scripts/analysis/plot_*` | ✅ exists | TTFT PDF, GPU utilization, cache efficiency |

→ Once this branch's stratified + paired scripts land, collaborators running experiments can produce the tables with a single command.

---

## 4. Artifact requirements (SOSP/OSDI AE)

| Item | Standard |
|---|---|
| Dockerfile | A single `Dockerfile.artifact` that starts on 4×A100/H100 |
| One-click script | `bash artifact/reproduce_main_table.sh` produces Table 1 within 1 hour |
| Dataset | Provide a subset of `outputs/sample-*.jsonl` (within roughly 5 GB); full Ali is obtained by following instructions |
| Reproducibility | Bootstrap CIs that overlap with the paper's count as reproduced; bit-exactness is not required |
| Documentation | `artifact/README.md`, listing the command for every table / figure |

→ Prepare the artifact only after §M1 of this roadmap is fixed.

---

## 5. Self-check list (before submitting a paper draft)

- [ ] Every table has N ≥ 3, with mean±std and 95% CI.
- [ ] The phrase "successful only" appears nowhere; all errors are counted in `err%`.
- [ ] All baselines use the same `max_input_len` / `request_timeout_s` / `time_scale`.
- [ ] At least 3 traces + 1 synthetic adversarial.
- [ ] At least 1 non-SGLang baseline.
- [ ] There is a negative-result section.
- [ ] There is dilution data for KVC on single-turn workloads.
- [ ] Formal part: Algorithm 1/2/3 + Theorem 1/2, plus Theorem 4 once D→P sync is complete.
- [ ] Failure-mode forensics: mooncake death, E3 crash, and cold-D all go into §Limitations or §Discussion.

---

## 6. Roadmap linkage

- [ ] Phase A: implement `scripts/analysis/stratified.py` + `scripts/analysis/paired_compare.py` on this branch (no GPU needed).
- [ ] Phase B: re-derive stratified / paired tables from the existing `kvc-real-ali-iter-v1` 600-req/15min data with the new tools and store them in `outputs/` (no GPU rerun needed).
- [ ] Phase C: run the ShareGPT + Synthetic adversarial baselines (~12h of GPU).
- [ ] Phase D: pick 1 non-SGLang baseline (vLLM + prefix caching recommended) to complete M4 (~24h of GPU).

---

**Core takeaway**: the current results "look like a win already", but once re-reported under this protocol the magnitude of the win will shrink, the winning slice will narrow, and the negative slices will be exposed. Every paper has to go through this; the earlier it is done, the cheaper it is.

222
docs/FAILURE_MODES_ZH.md
Normal file
@@ -0,0 +1,222 @@
# Failure-mode Taxonomy

**Date**: 2026-05-13
**Nature**: consolidated checklist + diagnostic handbook
**Audience**: collaborators who need an immediate lookup when an experiment fails; authors citing it in the paper's §Limitations; the answer when a reviewer asks "why do you think it will be more stable this time"

This document organizes the failure modes identified so far into a table of "symptom → root cause → trigger → current mitigation → real fix". Every entry has a forensic link to the original experiment doc.

---

## 0. TL;DR

Five identified failure modes, grouped by whether they block the paper claim:

| Class | Name | Blocks paper | Real fix |
|---|---|:---:|---|
| **A. Control-plane cascade** | Mooncake "instance not alive" cascade | ✅ | admission backoff + per-D pending-seed budget |
| **B. Routing bias** | Cold-D / overlap-pinning | ✅ | first-principles overlap term redefinition |
| **C. KV thrashing** | Evict storm (session-level evict) | ✅ | [BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) |
| **C'. KV thrashing** | Reseed storm (concurrent large turn-1 seeds) | ✅ | per-D pending-seed budget + (frequency drops by itself once C is mitigated) |
| **D. Vendor invariant** | streaming-session correction invariant crash (E3) | ❌ (hotfix landed) | remove the correction path (after block-level evict is complete) |

Classes A / B / C must be resolved in Milestone 1; C' is a secondary cause of A; D has a temporary stopgap, but its root fix is tied to C.

---

## 1. A: Mooncake "instance not alive" cascade

### 1.1 Symptoms

- Client side: `RuntimeError: generate stream ended before producing any token`
- D scheduler log: `[mooncake] Decode instance could be dead, dropping ...`
- Whole batches of requests are aborted; a single sweep goes from healthy to 80% failure within minutes ([E1_E2_RESULTS_ZH.md](E1_E2_RESULTS_ZH.md) E2: 1054 / 1285 failed)

### 1.2 Root cause (forensic chain)

```
admission no-space (D KV pool full)
→ router immediately falls back to the seed/reseed path
→ multiple concurrent seeds hit mooncake P→D at once
→ P→D egress queues up; the handshake phase times out
→ mooncake marks the peer dead
→ SGLang aborts every in-flight req on the dead link
→ the client sees a batch of generate-stream interruptions
```

### 1.3 Trigger conditions

- D KV pool close to full (≥ ρ·K_d, default 0.95)
- The router fallback chain issues multiple reseeds within a millisecond-scale window
- Mooncake heartbeat timeout (the default window is short)

### 1.4 Current mitigation

- `--kvcache-seed-min-turn-id=2` skips the large turn-1 seed and reduces the initial burst (stable config on main)
- `--mc-transfer-timeout=1800s` as the default (commit 905d671) reduces false "dead" verdicts
- `--request-timeout-s=180/300` keeps clients from hanging for a full hour, but does not stop the cascade itself

→ These all treat symptoms, not the cause. E2 still failed 80% on real 4×H200 NDR hardware ([E1_E2_RESULTS_ZH.md](E1_E2_RESULTS_ZH.md)).

### 1.5 Real fix (roadmap §S3)

1. **Admission RPC backoff + jitter**: on rejection, do not fall back immediately; give the D scheduler room to breathe.
2. **Per-D pending-seed budget**: at most K seeds in the transfer queue at any moment; the excess queues instead of bursting.
3. **Decouple mooncake heartbeat from admission**: the admission path no longer implies "peer alive".
4. **Close the loop on the backpressure pause hint** ([SGLANG_PATCH_INVENTORY_ZH.md](SGLANG_PATCH_INVENTORY_ZH.md) §2.3, currently EXPERIMENTAL).

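Items 1 and 2 of the fix list above can be sketched with plain `asyncio`. Everything here is illustrative: `PendingSeedBudget` and `backoff_delay` are hypothetical names, the transfer is a sleep standing in for a P→D KV transfer, and full-jitter exponential backoff is one possible choice that the roadmap does not prescribe.

```python
import asyncio
import random


class PendingSeedBudget:
    """At most `k` seeds in flight per decode worker; extras wait in line."""

    def __init__(self, decode_ids, k):
        self._sems = {d: asyncio.Semaphore(k) for d in decode_ids}
        self.in_flight = {d: 0 for d in decode_ids}
        self.peak = {d: 0 for d in decode_ids}

    async def run_seed(self, decode_id, transfer):
        async with self._sems[decode_id]:
            self.in_flight[decode_id] += 1
            self.peak[decode_id] = max(
                self.peak[decode_id], self.in_flight[decode_id]
            )
            try:
                return await transfer()
            finally:
                self.in_flight[decode_id] -= 1


def backoff_delay(attempt, base_s=0.05, cap_s=2.0, rng=random):
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))


async def _demo():
    budget = PendingSeedBudget(["d0"], k=2)

    async def fake_transfer():
        await asyncio.sleep(0.01)  # stand-in for a P→D KV transfer
        return "ok"

    results = await asyncio.gather(
        *(budget.run_seed("d0", fake_transfer) for _ in range(8))
    )
    return budget.peak["d0"], results


peak, results = asyncio.run(_demo())
```

Eight concurrent seeds against one decode worker never exceed two in flight; the rest queue on the semaphore instead of hitting the transfer path at once.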
---

## 2. B: Cold-D / overlap-pinning

### 2.1 Symptoms

- With N=k decode workers, only ~k-1 actually carry traffic; some D has 0 bindings
- The per-D load histogram is heavily skewed (E2: D0:600 / D1:685 / **D2:0**)
- Overall throughput is bounded by the busiest D; raw latency is not necessarily worse, but capacity utilization is off by 33%+

### 2.2 Root cause

The Inferact / Ali coding agent traces start every session with ~12K of "system prompt + tool schema", and these 24-token blocks share hashes across all sessions. The `overlap` term of the kv-aware policy treats them as "this D already holds these hashes" → every new session is scored toward D0/D1 (the first two to warm up) → D2 always has 0 overlap → is never chosen → stays cold forever.

### 2.3 Trigger conditions

- Multi-session workload + shared boilerplate prefix
- `migration_reject_threshold > 0` and reject never fires (because D0/D1 are not yet full)

### 2.4 Current mitigation

`KvAwarePolicy.load_floor_bonus` (commit 93fce42):

```
floor_bonus = K * max(0, mean - assigned) / max(1, mean)
```

In E3, the measured D2 binding share rose from 0 to 22.5% ([E3_FINDINGS_ZH.md](E3_FINDINGS_ZH.md) §1).

→ This is a patch, not a fix. `K` is a magic number; D stays cold whenever the number of boilerplate hashes exceeds `K / sticky_bonus`.

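Evaluating the `floor_bonus` formula on the E2 binding counts quoted in §2.1 shows how the bonus targets only under-loaded workers. The helper name and the value K = 100 are illustrative assumptions; only the formula itself comes from the text.

```python
from statistics import fmean


def load_floor_bonus(assigned_by_d, k):
    """floor_bonus = K * max(0, mean - assigned) / max(1, mean), per worker."""
    mean = fmean(assigned_by_d.values())
    return {
        d: k * max(0.0, mean - assigned) / max(1.0, mean)
        for d, assigned in assigned_by_d.items()
    }


# E2 binding counts quoted above; K = 100 is an arbitrary illustrative value.
bonus = load_floor_bonus({"D0": 600, "D1": 685, "D2": 0}, k=100.0)
```

Workers at or above the mean get no bonus, while a fully cold worker gets the whole of `K`, which is why the result is only as good as the choice of `K`.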
### 2.5 Real fix (roadmap §S5)

Redefine `overlap` as **"the number of prefix hashes that this session exclusively holds on this D"**:

```
exclusive_overlap(s, d) := |prefix_hashes(s) ∩ resident[d] ∩ session_owned[s]|
```

where `session_owned[s]` excludes hashes that other sessions also hold. Shared boilerplate hashes do not enter `exclusive_overlap`, so scores spread out naturally. This requires the D side to return, in the `admit_direct_append` response, a sketch (Bloom filter / minhash) of the per-session resident hash set.

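The definition above can be checked on a tiny example with exact sets in place of the Bloom filter / minhash sketch. The hash values, session names, and helper names are hypothetical; the point is only that shared boilerplate stops contributing to the score.

```python
from collections import Counter


def session_owned(prefix_hashes_by_session):
    """Hashes held by exactly one session; shared boilerplate is excluded."""
    holders = Counter(
        h for hashes in prefix_hashes_by_session.values() for h in set(hashes)
    )
    return {
        s: {h for h in hashes if holders[h] == 1}
        for s, hashes in prefix_hashes_by_session.items()
    }


def exclusive_overlap(session, resident_d, prefix_hashes_by_session):
    """|prefix_hashes(s) ∩ resident[d] ∩ session_owned[s]|."""
    owned = session_owned(prefix_hashes_by_session)[session]
    return len(set(prefix_hashes_by_session[session]) & resident_d & owned)


prefixes = {
    "s1": ["sys0", "sys1", "a1", "a2"],  # sys* is the shared boilerplate
    "s2": ["sys0", "sys1", "b1"],
}
resident_d0 = {"sys0", "sys1", "a1"}  # D0 was warmed by s1

plain_s2 = len(set(prefixes["s2"]) & resident_d0)
exclusive_s1 = exclusive_overlap("s1", resident_d0, prefixes)
exclusive_s2 = exclusive_overlap("s2", resident_d0, prefixes)
```

Under the plain overlap, the new session `s2` looks partly resident on D0 purely because of boilerplate; under the exclusive definition it scores zero there and is free to go to a cold worker.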
---

## 3. C: Evict storm (session-level eviction)

### 3.1 Symptoms

- Under workloads that put D memory under pressure, a KV pool release spike of 30–90K tokens appears every 1–2 minutes
- The next request of the same session then triggers a `Reseed`: P re-prefills 50K + mooncake transfers 50K (3–7s)
- The TTFT long tail is entirely dominated by these reseeds ([V2_DEEP_ANALYSIS_ZH.md](V2_DEEP_ANALYSIS_ZH.md) §3.2)

### 3.2 Root cause

`SessionAwareCache.release_session` does a one-shot `free([cache_protected_len, kv_allocated_len))`, i.e. the entire session-exclusive tail. Measured in E3: 90 evicts, 67,726 tokens freed per evict on average, 25/50 sessions affected ([KVC_EVICTION_GRANULARITY_DESIGN_ZH.md](KVC_EVICTION_GRANULARITY_DESIGN_ZH.md) §0).

→ This contrasts sharply with the leaf-by-leaf incremental eviction of the standard SGLang radix. This KV never entered the radix, so it cannot benefit from fine-grained LRU nibbling.

### 3.3 Trigger conditions

- D KV pool close to full
- `maybe_trim_decode_session_cache` is triggered by the scheduler (when `DecodePreallocQueue` detects `available_size() <= 0`)

### 3.4 Current mitigation

- `--kvcache-session-soft-cap=N` (main): limits the number of sessions resident on D → trims early and avoids hitting the ceiling
- `--kvcache-direct-max-uncached-tokens=8192` (v2): slows the rate at which the direct path consumes KV

→ Both only slow things down; neither solves the root problem that a single free is too large.

### 3.5 Real fix (roadmap §S1)

[BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md): at the end of each turn, streaming-session decode output goes into the radix via `inner.cache_finished_req` → `release_session` reduces to `dec_lock_ref` + slot deletion → radix LRU nibbles away in 24-token leaves.

Expected: a single evict drops from 67K to ≤ 500 tokens; reseed frequency drops by an order of magnitude.

---

## 4. C': Reseed storm (concurrent large turn-1 seeds)

### 4.1 Symptoms

- In the startup phase of the workload (first 30–60s), all sessions hit turn 1 at the same time
- Multiple concurrent `Seed`s (~50–90K tokens each) hit mooncake → overlaps with the §1 cascade

### 4.2 Root cause

During startup, `resident[d]` is empty for `KvAwarePolicy`, so all Ds score the same, yet ε retries + per-trial admit do nothing to prevent concurrency.

### 4.3 Trigger conditions

- Under `time_scale=1` replay, sessions start simultaneously at the original arrival density
- No per-D pending-seed rate limit

### 4.4 Current mitigation

- `--kvcache-seed-min-turn-id=2`: skips the turn-1 seed entirely (stable config on main)
- Side effect: the turn-1 KV injection is lost and turn 2 must reseed (which is actually more stable, because reseeds are spread out over time)

### 4.5 Real fix

- Per-D pending-seed budget (same as item 2 of §1.5)
- Once §3.5 is done, evict frequency drops by itself, which indirectly lowers reseed frequency

---

## 5. D: Streaming-session correction invariant crash (E3 landmine)

### 5.1 Symptoms

- The D scheduler raises `AssertionError` at `schedule_batch.py:1646`: `seq_len - pre_len == req.extend_input_len`
- The whole D worker process exits → the router sees the peer as dead → §1 cascade

### 5.2 Root cause

[E3_FINDINGS_ZH.md](E3_FINDINGS_ZH.md) §2: the streaming-session correction (commit b8e6f13) rewrites `extend_input_len` to `max(0, fill_len - prefix_len)`, but the downstream invariant is still computed from the original fill_ids/prefix_indices. When `fill_len < prefix_len` (the prefix accumulated over multiple turns exceeds the current turn's increment), the invariant is mathematically impossible to satisfy.

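The arithmetic behind the crash fits in a few lines. This sketch assumes, following the root-cause paragraph above, that `seq_len` and `pre_len` in the assertion are the lengths of the raw `fill_ids` and `prefix_indices`; the function names are hypothetical and this is not the SGLang code.

```python
def corrected_extend_input_len(fill_len: int, prefix_len: int) -> int:
    """The correction described above: clamp the extend length at zero."""
    return max(0, fill_len - prefix_len)


def invariant_holds(fill_len: int, prefix_len: int) -> bool:
    # Assumption: seq_len and pre_len come from the raw fill_ids and
    # prefix_indices lengths, untouched by the correction.
    seq_len, pre_len = fill_len, prefix_len
    return seq_len - pre_len == corrected_extend_input_len(fill_len, prefix_len)


normal_turn = invariant_holds(fill_len=1200, prefix_len=1000)   # 200 vs 200
landmine_turn = invariant_holds(fill_len=300, prefix_len=1000)  # -700 vs 0
```

Once the clamp kicks in, the left side is negative and the right side is zero, so no further tweak to the corrected value can make the assertion pass.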
### 5.3 Trigger conditions

- In a streaming session, the prefix already committed across turns is longer than the fill_ids added in the current turn
- E2 never reached this state because its pipeline was blocked; E3 fixed the cold-D bottleneck → faster pipeline → the landmine was exposed

### 5.4 Current mitigation

The pre-filter pass in commit 986f351 drops such reqs at the entry of `prepare_for_extend` (so the client sees an error response instead of a worker crash). It is a stopgap.

### 5.5 Real fix

The entire correction path at `schedule_batch.py:1572–1646` is **structurally unnecessary** once the block-level eviction refactor is complete; [BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) §3.7 explains that after the refactor, fill_ids / prefix_indices consistency is guaranteed automatically by radix `match_prefix`.

→ Do not add more correction clauses; delete the whole segment.

---

## 6. Failure diagnosis cheat sheet

When running a sweep, look up symptoms in this table:

| You see | Most likely | Check first |
|---|---|---|
| Client `RuntimeError: generate stream ended before...` | §1 cascade | search the D scheduler log for `instance could be dead` |
| Some D has `binding=0` while other Ds are busy | §2 cold-D | `per_decode_load` histogram |
| TTFT p99 suddenly jumps to the 5–8s range | §3 evict storm | `release_session` call frequency + mean tokens freed |
| High failure rate at sweep startup, low at steady state | §4 reseed storm | peak of the mooncake transfer queue in the first 30s |
| D worker process exits abnormally | §5 invariant crash | search the scheduler log for `AssertionError`, `extend_input_len` |

---

## 7. Roadmap linkage

- Items 1/3/4 of Milestone 1 in [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) correspond to the real fixes for C / A / B in this table, respectively. After Milestone 1, §1–§4 here should all move from "unfixed" to "mitigated", and §5 should disappear outright.
- The paper's §Limitations must state the current situation honestly: "we identify five failure modes; A/C are addressed by this work, B/C' are partially addressed, D is a transient artifact of the in-progress refactor."

---

**Core takeaway**: treat failure modes as first-class artifacts. Giving every failure the five fields "symptom → root cause → trigger → mitigation → real fix" is a key tool for taking a prototype to production grade. A reviewer who sees you enumerate failures is far more convinced than one who only sees you beat the baseline.

119
docs/INDEX_ZH.md
Normal file
@@ -0,0 +1,119 @@
# Documentation index

**Purpose**: let any collaborator find the document they need within 10 minutes, and tell reviewers what to read first.

---

## 0. The 3 to read when short on time

Reading these in order is enough to join the discussion:

1. [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md): current project progress, weaknesses, roadmap.
2. [KVC_ROUTER_ALGORITHM.md](KVC_ROUTER_ALGORITHM.md): algorithm formalization (Algorithm 1/2/3 + Theorem 1/2).
3. [V2_DEEP_ANALYSIS_ZH.md](V2_DEEP_ANALYSIS_ZH.md) §0 + §6: current v2 win/lose snapshot.

---

## 1. By topic

### 1.1 Progress / status

| Document | Content |
|---|---|
| [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) | Cross-branch consolidation + roadmap (main entry point for this branch) |
| [PROJECT_OVERVIEW.md](PROJECT_OVERVIEW.md) | Project goals + terminology distinguishing the three mechanisms (pd-disagg / pd-colo / kvcache-centric) |
| [ONBOARDING_NEXT_AGENT_ZH.md](ONBOARDING_NEXT_AGENT_ZH.md) | 30-minute onboarding handbook for the next agent (from `kvc-debug-journey-v1-to-v4`) |

### 1.2 Algorithm / formalization

| Document | Content |
|---|---|
| [KVC_ROUTER_ALGORITHM.md](KVC_ROUTER_ALGORITHM.md) | Algorithm 1 (Route) / 2 (Admit) / 3 (Dispatch) + Theorem 1 (no starvation) + Theorem 2 (lower bound on fast-path hits) |
| [MIGRATION_V1_FINDINGS_ZH.md](MIGRATION_V1_FINDINGS_ZH.md) | Measurements of the v1 thrashing pathology + why reset-on-success is the key fix |

### 1.3 Experimental results

| Document | Content |
|---|---|
| [V2_DEEP_ANALYSIS_ZH.md](V2_DEEP_ANALYSIS_ZH.md) | SWE-Bench 50 sess ts=1: v2 vs 4DP CA, 6/8 wins + why TTFT p99 lags |
| [V2_RESULTS_ZH.md](V2_RESULTS_ZH.md) | Original v2 report (headline numbers slightly optimistic; read together with the deep analysis) |
| [E1_E2_RESULTS_ZH.md](E1_E2_RESULTS_ZH.md) | E1 (naive 1P3D + kv-aware) vs E2 (KVC v2) on H200 + RDMA; forensics of E2's 80% failure |
| [E3_FINDINGS_ZH.md](E3_FINDINGS_ZH.md) | E3 (+load-floor bonus) triggers the SGLang patch invariant crash at 16 min |
| [E1_E2_FIX_DESIGN_ZH.md](E1_E2_FIX_DESIGN_ZH.md) | Fix design for Q1 (mooncake death) + Q2 (cold-D2) |

### 1.4 Current key design discussions

| Document | Content |
|---|---|
| [KVC_EVICTION_GRANULARITY_DESIGN_ZH.md](KVC_EVICTION_GRANULARITY_DESIGN_ZH.md) | Architecture-level reflection: session-level evict conflicts with the KVC continuity design |
| [BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) | Concrete API / steps / test plan for the block-level evict refactor (new on this branch) |
| [RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md](RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md) | Reseed slow-path timeline + forensics of the D→P sync gap |
| [D_TO_P_SYNC_CONTRACT_ZH.md](D_TO_P_SYNC_CONTRACT_ZH.md) | Interface contract, staleness budget, and rollout phases for D→P sync (new on this branch) |

### 1.5 Evaluation / methodology

| Document | Content |
|---|---|
| [EVALUATION_PROTOCOL_ZH.md](EVALUATION_PROTOCOL_ZH.md) | Paper-quality evaluation protocol (N, CI, paired, stratify, baseline list, trace mix); new on this branch |
| [REFACTOR_PLAN_V1_ZH.md](REFACTOR_PLAN_V1_ZH.md) | Why we switched from ts=10 to ts=1 |
| [TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md](TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md) | List of structural problems from the ts=10 era (mostly superseded) |

### 1.6 Engineering debt / failure modes

| Document | Content |
|---|---|
| [SGLANG_PATCH_INVENTORY_ZH.md](SGLANG_PATCH_INVENTORY_ZH.md) | Categorized inventory of the 785-line vendored SGLang patch (MUST-HAVE / WORKAROUND / EXPERIMENTAL / INSTRUMENTATION); new on this branch |
| [FAILURE_MODES_ZH.md](FAILURE_MODES_ZH.md) | Diagnosis + mitigation + real fix for the 5 failure modes (mooncake cascade / cold-D / evict storm / reseed storm / E3 invariant); new on this branch |

### 1.7 Environment

| Document | Content |
|---|---|
| [H200_DRIVER570_SETUP_ZH.md](H200_DRIVER570_SETUP_ZH.md) | H200 + driver 570 + cu12.8 environment setup + 11 lessons learned |

### 1.8 Archive (historical reference only)

Content under `docs/archive/` has been superseded by newer documents and need not be read:

- `AGENTIC_FIT_ANALYSIS_ZH.md`, `STRUCTURAL_VALIDATION_REPORT_ZH.md`: early ts=10 analysis.
- `KVCACHE_CENTRIC_PROGRESS_ZH.md`: early project snapshot.
- `KVC_DEBUG_JOURNEY_V1_TO_V5.md`, `V5_PROFILE_INVESTIGATION_ZH.md`: tuning notes from v1–v5.
- `REFACTOR_PLAN_ZH.md`: v0 refactor plan.
- `SWEBENCH_EXPERIMENT_*.md`: early experiment logs.

---

## 2. Recommended reading paths by role

### 2.1 I am a newly assigned SWE/research agent

1. Start with the 3 documents in §0 of this index.
2. Then read [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) §3 (weaknesses) + §5 (GPU-free work list).
3. Pick a Milestone 1 sub-item and start. `docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md` and `docs/D_TO_P_SYNC_CONTRACT_ZH.md` are the two engineering tracks already prepared.

### 2.2 I am a paper reviewer / pre-reading for review

1. [KVC_ROUTER_ALGORITHM.md](KVC_ROUTER_ALGORITHM.md): algorithms + theorems.
2. [V2_DEEP_ANALYSIS_ZH.md](V2_DEEP_ANALYSIS_ZH.md): core measured comparison + the limitations we identified ourselves.
3. [E1_E2_RESULTS_ZH.md](E1_E2_RESULTS_ZH.md): ablation on real hardware + RDMA (including forensics of E2's 80% failure, showing that we can explain failures).
4. [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) §3: the weaknesses and future work we list ourselves (nothing hidden).

### 2.3 I am a student reproducing the experiments

1. [H200_DRIVER570_SETUP_ZH.md](H200_DRIVER570_SETUP_ZH.md).
2. [EVALUATION_PROTOCOL_ZH.md](EVALUATION_PROTOCOL_ZH.md): which sweeps to run and under what protocol to compare.
3. `scripts/sweep_ts1_migration_v2.sh`: main v2 sweep; `scripts/sweep_e1_naive_1p3d.sh` / `scripts/sweep_e2_kvc_v2_rdma.sh`: E1/E2 ablation.

### 2.4 I want to read about the control plane and admission

1. `src/agentic_pd_hybrid/policies.py`: `KvAwarePolicy.select` implements Algorithm 1.
2. `src/agentic_pd_hybrid/replay.py`: `_invoke_session_direct` / `_invoke_kvcache_seeded_router` orchestrate Algorithm 3.
3. `third_party/sglang/python/sglang/srt/managers/scheduler.py`: the D-side `_admit_direct_append` implements Algorithm 2.

---

## 3. Maintenance conventions for this index

- Every new design / experiment doc must get a row in a §1 table of this index.
- When a document is archived (moved to `docs/archive/`), remove its entry here or mark it "archived".
- This index carries no substantive content, only navigation; all in-depth explanation lives in the documents it points to.

165
docs/SGLANG_PATCH_INVENTORY_ZH.md
Normal file
@@ -0,0 +1,165 @@
# Vendored SGLang Patch: categorized inventory

**Date**: 2026-05-13
**Baseline**: clean SGLang v0.5.10 snapshot @ `bded083`
**Current HEAD**: `origin/h200-cu130` + this branch (785 lines added / 17 lines deleted / 10 files)
**Purpose**: let reviewers and the next collaborator see at a glance which patches are core mechanism, which are workarounds, and which can be retired after the refactor. Corresponds to the engineering-debt items in [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) §3.2 / §S6.

---

## 0. TL;DR

| Category | Files | Lines (est.) | Fate |
|---|---:|---:|---|
| MUST-HAVE: core mechanism (Algorithm 1/2/3, streaming session lifecycle, admit RPC) | 6 | ~450 | Kept long-term; core of the paper claim |
| WORKAROUND: patches for identified latent problems; should be retired after the refactor | 2 | ~150 | Mostly deleted once the block-level eviction refactor is complete |
| EXPERIMENTAL: features whose loop is not closed; the paper does not depend on them | 1 | ~60 | Can be retired or kept as a future-work hook |
| INSTRUMENTATION: diagnostics / logging | 1 | ~50 | Kept, but should be isolated to a debug build |
| MINOR: miscellaneous | 1 | ~3 | Does not affect decisions |

**Key guidance**: when the block-level eviction refactor ([BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md)) is complete, the ~150 lines in the WORKAROUND category should be deleted along with it. The `schedule_batch.py` invariant landmine triggered by E3 is a product of this path; the right fix is to change the evict granularity, not to patch the engine.

---

## 1. 文件粒度清单
|
||||
|
||||
### 1.1 `mem_cache/session_aware_cache.py` — MUST-HAVE *(待 refactor)*
|
||||
|
||||
| 项目 | 内容 | 引入 | 分类 |
|
||||
|---|---|---|---|
|
||||
| `SessionSlot` dataclass | streaming session 跨 turn 复用 KV 的 metadata | b8e6f13 | MUST-HAVE |
|
||||
| `last_access_time` 字段 | LRU 决策需要 | 6e5ed8d | MUST-HAVE |
|
||||
| `match_prefix` / `cache_finished_req` / `cache_unfinished_req` 的 streaming 分支 | session 复用快路径 | b8e6f13 | **MUST-HAVE → 待 refactor**(block-level evict 后语义大改) |
|
||||
| `release_session` 直接 `free(kv_indices)` | session 退出时一次性归还 KV | b8e6f13 | **WORKAROUND → 替换**(refactor 后改为只 `dec_lock_ref`) |
|
||||
| `slot_held_tokens` / `get_session_status` / `list_session_statuses` | 状态查询 | 6e5ed8d | MUST-HAVE |
|
||||
|
||||
**说明**:本文件是 KVC 设计的中枢。block-level eviction refactor([BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) §3.1–§3.6)改造的就是这里。`SessionSlot` 的 5 个 KV-ownership 字段(`req_pool_idx` / `kv_committed_len` / `kv_allocated_len` / `cache_protected_len` / `swa_evicted_seqlen`)应在 refactor 后删除;这部分**将由 commit message 单独标记**,方便回滚。
|
||||
|
||||
### 1.2 `managers/scheduler.py` — 混合类别
|
||||
|
||||
D worker 端的 Algorithm 2 实现,含多个独立 patch。按行级归类:
|
||||
|
||||
| 函数 / 行段 | 内容 | 分类 | 何时可下线 |
|
||||
|---|---|---|---|
|
||||
| `admit_direct_append(...)` | Algorithm 2 的 D 端 admission RPC handler | **MUST-HAVE** | 不下线(论文核心) |
|
||||
| `_should_allow_local_prefill_on_decode(req)` | 决定 decode worker 是否接受无 bootstrap 的本地 append-prefill | **MUST-HAVE** | 不下线 |
|
||||
| `_decode_session_cache_low_watermark_tokens()` | 水位线参数读取 | **WORKAROUND** | block-level evict 后由 radix LRU 取代 |
|
||||
| `_decode_session_cache_target_available_tokens()` | 目标可用 token 数计算 | **WORKAROUND** | 同上 |
|
||||
| `maybe_trim_decode_session_cache(...)` | 主动 trim session(触发 `release_session`) | **WORKAROUND** | 同上:refactor 后 radix LRU 自然蚕食,trim 不再必要 |
|
||||
| `_compute_backpressure_pause_hint(...)` | 给 router 的 pause 提示 | **EXPERIMENTAL** | 信号未闭环([REAL_ALI_KVC_EXPERIMENT_LOG_ZH.md](../docs/archive/) §4.3),路线图 §S10;可保留为 future work hook |
|
||||
| `_compute_pool_breakdown_for_diagnostics()` | 池状态快照供 `/server_info` | **INSTRUMENTATION** | 长期保留但建议门 flag 化 |
|
||||
|
||||
### 1.3 `managers/schedule_batch.py` — WORKAROUND(待删除)
|
||||
|
||||
| 项目 | 内容 | 引入 | 分类 |
|
||||
|---|---|---|---|
|
||||
| streaming-session `extend_input_len` correction (lines ~1572–1585) | 在 fill_ids < prefix_indices 时把 extend_input_len 改为 0 | b8e6f13 | **WORKAROUND** |
|
||||
| pre-filter pass dropping `fill_ids < prefix_indices` reqs | E3 触发 assertion 后的 hotfix(commit 986f351) | 986f351 | **WORKAROUND** |
|
||||
| invariant assert `seq_len - pre_len == req.extend_input_len` 的容忍逻辑 | 与 correction 配套 | b8e6f13 | **WORKAROUND** |
|
||||
|
||||
**全部** ~85 行在 block-level eviction refactor 完成后**应整体删除**——`BLOCK_LEVEL_EVICTION_DESIGN_ZH §3.7` 已说明 refactor 后该不变量结构上必然成立,correction 路径无需存在。E3 的 landmine ([E3_FINDINGS_ZH.md](E3_FINDINGS_ZH.md) §2) 是该 workaround 的产物。
|
||||
|
||||
### 1.4 `managers/session_controller.py` — MUST-HAVE
|
||||
|
||||
| 项目 | 内容 | 分类 |
|
||||
|---|---|---|
|
||||
| streaming session lifecycle hooks(open / close / admit signal) | 让 P/D worker 知道何时开始 / 结束一个 streaming session | MUST-HAVE |
|
||||
| session ID 路由 | 让 admission RPC 找到正确的 SessionSlot | MUST-HAVE |
|
||||
|
||||
不下线。
|
||||
|
||||
### 1.5 `managers/io_struct.py` — MUST-HAVE
|
||||
|
||||
| Item | Description | Category |
|---|---|---|
| `AdmitDirectAppendReqInput` / `AdmitDirectAppendReqOutput` | Request / response message types for the admit RPC | MUST-HAVE |
| backpressure pause hint field | Optional field on the same messages | EXPERIMENTAL |

The EXPERIMENTAL field can be folded into the MUST-HAVE messages to stay compatible; on its own it creates no pressure to retire anything.
|
||||
|
||||
### 1.6 `managers/tokenizer_communicator_mixin.py` — MUST-HAVE
|
||||
|
||||
Communicator-side glue for the admit RPC. 19 lines; keep.
|
||||
|
||||
### 1.7 `entrypoints/http_server.py` — MUST-HAVE
|
||||
|
||||
Registers the `/admit_direct_append` HTTP endpoint. 6 lines.
|
||||
|
||||
### 1.8 `disaggregation/decode.py` — mixed categories

| Item | Description | Category |
|---|---|---|
| `DecodeReqToTokenPool`: relax `assert len(reusing) <= 1` | Lets local append-prefill reuse multiple `req_pool_idx` values within one batch | **MUST-HAVE** |
| `DecodePreallocQueue` adds `refresh_allocatable_tokens` and triggers `maybe_trim_decode_session_cache` | Proactively trims sessions when the pool is full | **WORKAROUND** (after the refactor, radix LRU sheds load naturally instead) |
| `--disaggregation-decode-allow-local-prefill` flag | Server-side opt-in for local append-prefill | **MUST-HAVE** |

The ~30 lines of trim-trigger logic should be deleted after the refactor.
|
||||
|
||||
### 1.9 `server_args.py` — MUST-HAVE
|
||||
|
||||
| Item | Description | Category |
|---|---|---|
| `--radix-eviction-policy priority` option | Required by the E1/E2 experiments | MUST-HAVE |
| `--disaggregation-decode-allow-local-prefill` flag | See §1.8 | MUST-HAVE |

13 lines, all CLI interface extensions; keep.
|
||||
|
||||
### 1.10 `disaggregation/mooncake_transfer_engine.py` — MINOR
|
||||
|
||||
A 3-line minor adjustment. Not a decision point.
|
||||
|
||||
---
|
||||
|
||||
## 2. Summary by category

### 2.1 MUST-HAVE (keep)

About 6 files, 450 lines:

- `admit_direct_append` main path (Algorithm 2): scheduler + io_struct + tokenizer_communicator_mixin + http_server + session_controller
- `SessionSlot` main path (streaming session lifecycle): most fields of session_aware_cache, plus session_controller
- CLI / server interface: server_args, and `allow_local_prefill` in decode.py
|
||||
|
||||
### 2.2 WORKAROUND (delete after the block-level eviction refactor)

About 2.5 files, 150 lines:

- the token-free path of `session_aware_cache.release_session`
- `_decode_session_cache_*_watermark_tokens` + `maybe_trim_decode_session_cache` in `scheduler.py`
- the streaming-session correction + drop-pre-filter in `schedule_batch.py` (including the hotfix for the E3 landmine)
- the trim trigger in `DecodePreallocQueue` in `decode.py`

→ These patches exist as a product of the current architecture; after the refactor they should be deleted wholesale rather than fixed one small bug at a time.
|
||||
|
||||
### 2.3 EXPERIMENTAL (loop not closed)

About 60 lines:

- backpressure pause hint (`_compute_backpressure_pause_hint` + the io_struct field): can be kept as a hook for a future control-plane feedback mechanism; retire it if it is still not wired up after 1 month
|
||||
|
||||
### 2.4 INSTRUMENTATION (keep long-term, but gate behind a flag)

About 50 lines:

- `_compute_pool_breakdown_for_diagnostics` + the related `/server_info` fields: recommend adding an `--enable-diagnostic-pool-snapshot` flag so the prod path does not carry diagnostic overhead
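
One way the flag gating could look, as a minimal sketch rather than the actual server code. The flag name is the one recommended above; `build_parser`, `server_info`, and `fake_breakdown` are hypothetical stand-ins, and the payload shape is invented for illustration.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    # Off by default, so the production path never computes the snapshot.
    parser.add_argument("--enable-diagnostic-pool-snapshot", action="store_true")
    return parser


def server_info(args, compute_breakdown):
    """Build a /server_info-style payload; add the snapshot only when enabled."""
    info = {"status": "ok"}
    if args.enable_diagnostic_pool_snapshot:
        info["pool_breakdown"] = compute_breakdown()
    return info


calls = []


def fake_breakdown():
    # Stand-in for _compute_pool_breakdown_for_diagnostics(); records each call.
    calls.append(1)
    return {"free_tokens": 0}


off = server_info(build_parser().parse_args([]), fake_breakdown)
on = server_info(
    build_parser().parse_args(["--enable-diagnostic-pool-snapshot"]), fake_breakdown
)
```

Passing the breakdown function in as a callable keeps the expensive call lazy: with the flag off it is never invoked.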
|
||||
|
||||
### 2.5 MINOR
|
||||
|
||||
About 3 lines: ignore.
|
||||
|
||||
---
|
||||
|
||||
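## 3. Maintenance conventions

1. **Every new SGLang change must be recorded in this table**: use the `feat(sglang): ...` / `fix(sglang): ...` prefix in the commit message, and state in the PR description which §2 category the change falls under.
2. **Do not overwrite upstream files directly**: every patch must `git apply` cleanly on v0.5.10 (keep hunk headers tidy).
3. **Update this doc when deleting a WORKAROUND**: the same PR that completes the refactor should strike the corresponding rows from the tables in this document.
4. **Do not promote EXPERIMENTAL code into the main path**: patches whose loop is not closed must be disabled by default.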
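
The commit-prefix rule in item 1 is easy to check mechanically. The sketch below is one possible check, not an existing hook in this repository; the function name `has_sglang_prefix` is hypothetical, and it assumes the prefix is followed by a single space and a non-empty summary.

```python
import re

# Accepts "feat(sglang): ..." or "fix(sglang): ..." with a non-empty summary.
SGLANG_PREFIX = re.compile(r"^(feat|fix)\(sglang\): \S")


def has_sglang_prefix(subject: str) -> bool:
    """Return True when a commit subject follows the sglang prefix convention."""
    return SGLANG_PREFIX.match(subject) is not None
```

A check like this could run in CI against the subject line of every commit that touches the vendored tree.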
|
||||
|
||||
---
|
||||
|
||||
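## 4. Alignment with the roadmap

- When Milestone 1 ([AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) §4) carries out the block-level eviction refactor, **all of §2.2 should disappear**; this is the objective measure of how complete the refactor is.
- When Milestone 2 splits the control plane into layers (§4.8), the §2.3 backpressure pause hint should be either enabled or retired; it must not be left dangling.
- When Milestone 3 introduces learning-based admission (§4.15), the §2.1 `admit_direct_append` interface should stay stable, with the policy swapped on the router side rather than the D side.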
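
The Milestone 1 criterion can be approximated with a source scan. The sketch below is an assumption about how such a check might be written, not an existing script: the function `remaining_workarounds` is hypothetical, and the symbol list covers only the scheduler names spelled out in this document.

```python
from pathlib import Path

# Symbol names copied from the scheduler table and §2.2; extend as needed.
WORKAROUND_SYMBOLS = (
    "maybe_trim_decode_session_cache",
    "_decode_session_cache_low_watermark_tokens",
    "_decode_session_cache_target_available_tokens",
)


def remaining_workarounds(root: Path) -> dict[str, list[str]]:
    """Map each workaround symbol to the .py files under root that still mention it."""
    hits: dict[str, list[str]] = {}
    for path in sorted(root.rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for symbol in WORKAROUND_SYMBOLS:
            if symbol in text:
                hits.setdefault(symbol, []).append(str(path.relative_to(root)))
    return hits
```

An empty result is a necessary signal, not a sufficient one: the `schedule_batch.py` correction has no single symbol name and would still need a manual review.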
|
||||
|
||||
---
|
||||
|
||||
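**Key takeaway**: the 785 lines of vendored SGLang are not a monolithic black box: two thirds are core mechanism (required for the paper), and one third is workaround for the current architecture (deletable wholesale after the refactor). A reviewer who sees this table can tell at once which parts are the paper's real contribution and which are temporary scaffolding for the current prototype.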
|
||||
@@ -20,8 +20,21 @@ build-backend = "setuptools.build_meta"
[tool.setuptools.packages.find]
where = ["src"]

[dependency-groups]
# Pure-Python unit tests. Install via:
#   uv sync --group test
# These tests deliberately import only the algorithm-layer modules
# (policies, trace, topology) so they run without SGLang / GPU / CUDA.
test = [
    "pytest>=8.0",
]

[tool.uv]
prerelease = "allow"

[tool.uv.sources]
sglang = { path = "third_party/sglang/python", editable = true }

[tool.pytest.ini_options]
testpaths = ["tests"]
addopts = "-q"
|
||||
|
||||
225
scripts/analysis/paired_compare.py
Executable file
@@ -0,0 +1,225 @@
|
||||
#!/usr/bin/env python3
"""Paired latency comparison with bootstrap CI.

Implements docs/EVALUATION_PROTOCOL_ZH.md §2.2 (M2 fix): when comparing
mechanism A vs B on the same trace, the only honest comparison is paired
on same-trial-mask. This script joins two metrics.jsonl by request_id,
keeps the rows where BOTH sides succeeded, and reports paired deltas
with 95% bootstrap CIs.

Differences from the existing `compare_no_error.py`:
  - works on raw metrics.jsonl, not pre-aggregated summary.json
  - bootstrap CIs (not just point estimates)
  - reports paired-mask size + per-side failure counts so the reader
    sees how many rows were dropped from the comparison

Usage:
  scripts/analysis/paired_compare.py \
      --baseline outputs/run-dp/request-metrics.jsonl \
      --candidate outputs/run-kvc/request-metrics.jsonl
  scripts/analysis/paired_compare.py ... --bootstrap 5000 --seed 42
  scripts/analysis/paired_compare.py ... --json > paired.json

stdlib only — no scipy/numpy. Runs without GPU and without SGLang.
"""

from __future__ import annotations

import argparse
import json
import math
import random
import sys
from pathlib import Path
|
||||
|
||||
|
||||
def _load(path: Path) -> dict[str, dict]:
    out: dict[str, dict] = {}
    with path.open() as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            row = json.loads(line)
            rid = row.get("request_id")
            if rid is None:
                continue
            out[rid] = row
    return out


def _ok(row: dict) -> bool:
    return row.get("error") is None and row.get("latency_s") is not None
|
||||
|
||||
|
||||
def _quantile(values: list[float], q: float) -> float:
    if not values:
        return float("nan")
    s = sorted(values)
    if len(s) == 1:
        return s[0]
    pos = (len(s) - 1) * q
    lo = math.floor(pos)
    hi = math.ceil(pos)
    if lo == hi:
        return s[lo]
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def _stats(deltas: list[float]) -> dict[str, float]:
    if not deltas:
        return {"mean": float("nan"), "p50": float("nan"), "p90": float("nan"), "p99": float("nan")}
    return {
        "mean": sum(deltas) / len(deltas),
        "p50": _quantile(deltas, 0.50),
        "p90": _quantile(deltas, 0.90),
        "p99": _quantile(deltas, 0.99),
    }
|
||||
|
||||
|
||||
def _bootstrap_ci(
    deltas: list[float], statistic, n_boot: int, rng: random.Random
) -> tuple[float, float]:
    """Return (lo, hi) 95% CI for `statistic(deltas)`."""
    # Guard against too few points and against --bootstrap 0, which would
    # otherwise index into an empty sample list.
    if len(deltas) < 2 or n_boot < 1:
        return (float("nan"), float("nan"))
    n = len(deltas)
    samples = []
    for _ in range(n_boot):
        # resample with replacement
        resample = [deltas[rng.randrange(n)] for _ in range(n)]
        samples.append(statistic(resample))
    samples.sort()
    lo = samples[int(0.025 * (n_boot - 1))]
    hi = samples[int(0.975 * (n_boot - 1))]
    return (lo, hi)
|
||||
|
||||
|
||||
def compare(
    baseline: dict[str, dict],
    candidate: dict[str, dict],
    *,
    metric: str,
    n_boot: int,
    seed: int,
) -> dict:
    common_ids = set(baseline.keys()) & set(candidate.keys())
    paired_ids = [
        rid for rid in common_ids if _ok(baseline[rid]) and _ok(candidate[rid])
    ]
    paired_ids.sort()

    # Per-side failures among the shared request ids. A request that failed
    # on both sides is counted once in each total.
    base_fail = sum(1 for rid in common_ids if not _ok(baseline[rid]))
    cand_fail = sum(1 for rid in common_ids if not _ok(candidate[rid]))

    deltas = []
    wins = losses = ties = 0
    for rid in paired_ids:
        b = baseline[rid].get(metric)
        c = candidate[rid].get(metric)
        # Rows that lack the chosen metric on either side are skipped, so
        # len(deltas) can be smaller than paired_size for ttft_s / tpot_s.
        if b is None or c is None:
            continue
        d = float(c) - float(b)
        deltas.append(d)
        if d < 0:
            wins += 1
        elif d > 0:
            losses += 1
        else:
            ties += 1

    rng = random.Random(seed)
    stats = _stats(deltas)
    ci_mean = _bootstrap_ci(deltas, lambda x: sum(x) / len(x), n_boot, rng)
    ci_p50 = _bootstrap_ci(deltas, lambda x: _quantile(x, 0.50), n_boot, rng)
    ci_p90 = _bootstrap_ci(deltas, lambda x: _quantile(x, 0.90), n_boot, rng)

    return {
        "metric": metric,
        "baseline_size": len(baseline),
        "candidate_size": len(candidate),
        "intersection_size": len(common_ids),
        "paired_size": len(paired_ids),
        "baseline_fail_in_common": base_fail,
        "candidate_fail_in_common": cand_fail,
        "delta_stats": stats,
        "delta_mean_ci95": ci_mean,
        "delta_p50_ci95": ci_p50,
        "delta_p90_ci95": ci_p90,
        "wins_candidate": wins,
        "losses_candidate": losses,
        "ties": ties,
    }
|
||||
|
||||
|
||||
def _fmt(x: float, w: int = 6) -> str:
    if x is None or (isinstance(x, float) and math.isnan(x)):
        return f"{'nan':>{w}}"
    return f"{x:+{w}.3f}"


def render(result: dict) -> str:
    s = result["delta_stats"]
    mlo, mhi = result["delta_mean_ci95"]
    p5lo, p5hi = result["delta_p50_ci95"]
    p9lo, p9hi = result["delta_p90_ci95"]
    n = result["paired_size"]
    lines = [
        f"# paired comparison ({result['metric']})",
        "",
        f"baseline rows:      {result['baseline_size']}",
        f"candidate rows:     {result['candidate_size']}",
        f"intersection (rid): {result['intersection_size']}",
        f"paired (both ok):   {result['paired_size']}",
        f"  baseline fails in common:  {result['baseline_fail_in_common']}",
        f"  candidate fails in common: {result['candidate_fail_in_common']}",
        "",
        "## delta (candidate - baseline) — negative = candidate is faster",
        "",
        "| stat | value | 95% CI |",
        "|---|---:|---:|",
        f"| mean | {_fmt(s['mean'])} | [{_fmt(mlo)}, {_fmt(mhi)}] |",
        f"| p50 | {_fmt(s['p50'])} | [{_fmt(p5lo)}, {_fmt(p5hi)}] |",
        f"| p90 | {_fmt(s['p90'])} | [{_fmt(p9lo)}, {_fmt(p9hi)}] |",
        f"| p99 | {_fmt(s['p99'])} | — |",
        "",
        f"win/loss/tie: {result['wins_candidate']} / {result['losses_candidate']} / {result['ties']} (of {n})",
    ]
    return "\n".join(lines)
|
||||
|
||||
|
||||
def main() -> None:
    p = argparse.ArgumentParser(description=__doc__.split("\n\n")[0])
    p.add_argument("--baseline", required=True, type=Path)
    p.add_argument("--candidate", required=True, type=Path)
    p.add_argument(
        "--metric",
        default="latency_s",
        choices=["latency_s", "ttft_s", "tpot_s"],
        help="which per-request field to compare (default: latency_s)",
    )
    p.add_argument("--bootstrap", type=int, default=2000)
    p.add_argument("--seed", type=int, default=20260512)
    p.add_argument("--json", action="store_true")
    args = p.parse_args()

    baseline = _load(args.baseline)
    candidate = _load(args.candidate)
    if not baseline or not candidate:
        print("empty input on one side", file=sys.stderr)
        sys.exit(1)

    result = compare(
        baseline, candidate,
        metric=args.metric, n_boot=args.bootstrap, seed=args.seed,
    )

    if args.json:
        # json.dump calls `default` only for unserializable objects, so NaN
        # floats would be written as bare NaN (invalid JSON). Round-trip
        # through the parser to turn NaN / Infinity into null instead.
        clean = json.loads(json.dumps(result), parse_constant=lambda _name: None)
        json.dump(clean, sys.stdout, indent=2)
        sys.stdout.write("\n")
    else:
        print(render(result))


if __name__ == "__main__":
    main()
|
||||
227
scripts/analysis/stratified.py
Executable file
@@ -0,0 +1,227 @@
|
||||
#!/usr/bin/env python3
"""Stratified latency / TTFT reporter for paper-quality evaluation.

Implements docs/EVALUATION_PROTOCOL_ZH.md §1.3 (M3 fix): every headline
number must be accompanied by a stratified breakdown so reviewers can
see which slice the gains come from.

Buckets the request rows from one or more metrics.jsonl files along:
  - turn_id        : {1, 2-5, 6-20, 21+}
  - input_length   : {<=8K, 8K-64K, >64K}
  - overlap_ratio  : {<=0.3, 0.3-0.7, >0.7}
  - append_tokens  : input_length - observed_overlap_blocks * BLOCK_SIZE

For each bucket, reports:
  - n (total rows in bucket)
  - n_ok (rows with no error and latency_s set)
  - latency_s mean / p50 / p90 / p99
  - ttft_s mean / p50 / p90 / p99
  - err_pct (1 - n_ok/n)

Usage:
  scripts/analysis/stratified.py outputs/<run>/request-metrics.jsonl \
      [outputs/<other-run>/request-metrics.jsonl ...]
  scripts/analysis/stratified.py --dim turn_id outputs/<run>/request-metrics.jsonl
  scripts/analysis/stratified.py --json outputs/<run>/request-metrics.jsonl > strat.json

stdlib only — no pandas/numpy. Runs without GPU and without SGLang.
"""

from __future__ import annotations

import argparse
import json
import math
import sys
from collections import defaultdict
from pathlib import Path
from typing import Iterable
|
||||
|
||||
BLOCK_SIZE = 24  # SGLang radix block, matches docs/KVC_ROUTER_ALGORITHM.md §2

TURN_BUCKETS: list[tuple[str, tuple[int, int]]] = [
    ("turn=1", (1, 1)),
    ("turn=2-5", (2, 5)),
    ("turn=6-20", (6, 20)),
    ("turn=21+", (21, 10**9)),
]
INPUT_BUCKETS: list[tuple[str, tuple[int, int]]] = [
    ("input<=8K", (0, 8 * 1024)),
    ("input=8K-64K", (8 * 1024 + 1, 64 * 1024)),
    ("input>64K", (64 * 1024 + 1, 10**9)),
]
OVERLAP_BUCKETS: list[tuple[str, tuple[float, float]]] = [
    ("overlap<=0.3", (0.0, 0.3)),
    ("overlap=0.3-0.7", (0.3, 0.7)),
    ("overlap>0.7", (0.7, 1.0001)),
]
APPEND_BUCKETS: list[tuple[str, tuple[int, int]]] = [
    ("append<=128", (0, 128)),
    ("append=128-1K", (129, 1024)),
    ("append=1K-8K", (1025, 8 * 1024)),
    ("append>8K", (8 * 1024 + 1, 10**9)),
]

DIM_BUCKETS: dict[str, list[tuple[str, tuple]]] = {
    "turn_id": TURN_BUCKETS,
    "input_length": INPUT_BUCKETS,
    "overlap_ratio": OVERLAP_BUCKETS,
    "append_tokens": APPEND_BUCKETS,
}
|
||||
|
||||
|
||||
def _quantile(values: list[float], q: float) -> float:
    """Linear-interpolation quantile, stdlib only."""
    if not values:
        return float("nan")
    s = sorted(values)
    if len(s) == 1:
        return s[0]
    pos = (len(s) - 1) * q
    lo = math.floor(pos)
    hi = math.ceil(pos)
    if lo == hi:
        return s[lo]
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def _stats(values: list[float]) -> dict[str, float]:
    if not values:
        return {"mean": float("nan"), "p50": float("nan"), "p90": float("nan"), "p99": float("nan")}
    return {
        "mean": sum(values) / len(values),
        "p50": _quantile(values, 0.50),
        "p90": _quantile(values, 0.90),
        "p99": _quantile(values, 0.99),
    }
|
||||
|
||||
|
||||
def _bucket_for(value: float | int, buckets: list[tuple[str, tuple]]) -> str:
    # Bounds are inclusive on both ends and the first match wins, so a value
    # on a shared edge (for example overlap == 0.3) lands in the lower bucket.
    for label, (lo, hi) in buckets:
        if lo <= value <= hi:
            return label
    return "OOB"


def _classify(row: dict, dim: str) -> str:
    if dim == "turn_id":
        return _bucket_for(int(row.get("turn_id", 0)), TURN_BUCKETS)
    if dim == "input_length":
        return _bucket_for(int(row.get("input_length", 0)), INPUT_BUCKETS)
    if dim == "overlap_ratio":
        inp = max(1, int(row.get("input_length", 0)))
        cached = int(row.get("observed_overlap_blocks", 0)) * BLOCK_SIZE
        ratio = min(1.0, cached / inp)
        return _bucket_for(ratio, OVERLAP_BUCKETS)
    if dim == "append_tokens":
        inp = int(row.get("input_length", 0))
        cached = int(row.get("observed_overlap_blocks", 0)) * BLOCK_SIZE
        return _bucket_for(max(0, inp - cached), APPEND_BUCKETS)
    raise ValueError(f"Unknown dim: {dim}")
|
||||
|
||||
|
||||
def load_rows(paths: Iterable[Path]) -> list[dict]:
    rows: list[dict] = []
    for path in paths:
        with path.open() as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                rows.append(json.loads(line))
    return rows


def stratify(rows: list[dict], dim: str) -> dict[str, dict]:
    by_bucket: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        by_bucket[_classify(row, dim)].append(row)

    # Only the labels declared in DIM_BUCKETS are reported; rows classified
    # as "OOB" (for example turn_id == 0) do not appear in the output.
    output: dict[str, dict] = {}
    for label, _ in DIM_BUCKETS[dim]:
        bucket_rows = by_bucket.get(label, [])
        n = len(bucket_rows)
        ok = [r for r in bucket_rows if r.get("error") is None and r.get("latency_s") is not None]
        n_ok = len(ok)
        lat = [float(r["latency_s"]) for r in ok]
        ttft = [float(r["ttft_s"]) for r in ok if r.get("ttft_s") is not None]
        output[label] = {
            "n": n,
            "n_ok": n_ok,
            "err_pct": (n - n_ok) / n if n else 0.0,
            "latency_s": _stats(lat),
            "ttft_s": _stats(ttft),
        }
    return output
|
||||
|
||||
|
||||
def render_table(name: str, stats: dict[str, dict]) -> str:
    lines = [
        f"## stratified by {name}",
        "",
        "| bucket | n | n_ok | err% | lat mean | lat p50 | lat p90 | lat p99 | ttft mean | ttft p50 | ttft p90 | ttft p99 |",
        "|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|",
    ]
    for label, _ in DIM_BUCKETS[name]:
        s = stats[label]
        lat = s["latency_s"]
        ttft = s["ttft_s"]
        lines.append(
            "| {label} | {n} | {n_ok} | {err:.1%} | "
            "{lm:.3f} | {l50:.3f} | {l90:.3f} | {l99:.3f} | "
            "{tm:.3f} | {t50:.3f} | {t90:.3f} | {t99:.3f} |".format(
                label=label,
                n=s["n"],
                n_ok=s["n_ok"],
                err=s["err_pct"],
                lm=lat["mean"],
                l50=lat["p50"],
                l90=lat["p90"],
                l99=lat["p99"],
                tm=ttft["mean"],
                t50=ttft["p50"],
                t90=ttft["p90"],
                t99=ttft["p99"],
            )
        )
    return "\n".join(lines)
|
||||
|
||||
|
||||
def main() -> None:
    parser = argparse.ArgumentParser(description=__doc__.split("\n\n")[0])
    parser.add_argument("metrics_paths", nargs="+", type=Path)
    parser.add_argument(
        "--dim",
        choices=list(DIM_BUCKETS.keys()) + ["all"],
        default="all",
        help="stratification dimension (default: all four)",
    )
    parser.add_argument(
        "--json",
        action="store_true",
        help="emit JSON instead of markdown tables",
    )
    args = parser.parse_args()

    rows = load_rows(args.metrics_paths)
    if not rows:
        print("no rows loaded", file=sys.stderr)
        sys.exit(1)

    dims = list(DIM_BUCKETS.keys()) if args.dim == "all" else [args.dim]
    result = {dim: stratify(rows, dim) for dim in dims}

    if args.json:
        # json.dump calls `default` only for unserializable objects, so NaN
        # floats (empty buckets) would be written as bare NaN, which is not
        # valid JSON. Round-trip through the parser to emit null instead.
        clean = json.loads(json.dumps(result), parse_constant=lambda _name: None)
        json.dump(clean, sys.stdout, indent=2)
        sys.stdout.write("\n")
        return

    header_paths = ", ".join(str(p) for p in args.metrics_paths)
    print(f"# stratified report ({len(rows)} rows from {header_paths})\n")
    for dim in dims:
        print(render_table(dim, result[dim]))
        print()


if __name__ == "__main__":
    main()
|
||||
@@ -152,6 +152,49 @@ class StickyDecodePolicy:
         )
 
 
+CandidateScore = tuple[int, int, int, int]
+
+
+def score_candidate(
+    *,
+    overlap: int,
+    sticky: bool,
+    inflight: int,
+    assigned: int,
+    mean_assigned: float,
+    sticky_bonus: int,
+    load_floor_bonus: int,
+) -> CandidateScore:
+    """Pure scoring function for KvAwarePolicy (Algorithm 1 in KVC_ROUTER_ALGORITHM.md).
+
+    Returns the 4-tuple compared lexicographically by `select()` to pick the
+    best D. Extracted as a top-level function so unit tests can exercise it
+    without constructing topology/state objects.
+
+    Score tuple positions:
+      0: overlap + sticky_bonus*sticky + floor_bonus — primary, KV reuse aware
+      1: sticky                                      — tie-1, session locality
+      2: -inflight                                   — tie-2, prefer low load
+      3: -assigned                                   — tie-3, prefer rarely-picked
+
+    Load-floor bonus is gated on `not sticky` (turn-1+ sessions continue to
+    stick to their original D). The boost magnitude scales linearly with the
+    D's deficit relative to the running mean of decode_assignment_counts:
+        floor_bonus = int(load_floor_bonus * max(0, mean - assigned) / mean)
+    When mean == 0 (warmup) the bonus is 0 for all candidates (lex tiebreak
+    falls through to iteration order).
+
+    See docs/E1_E2_FIX_DESIGN_ZH.md §Q2 for the load-floor design and
+    docs/KVC_ROUTER_ALGORITHM.md §3.1 for the lex-score formalism.
+    """
+    floor_bonus = 0
+    if load_floor_bonus > 0 and not sticky and mean_assigned > 0:
+        deficit = max(0.0, mean_assigned - assigned)
+        floor_bonus = int(load_floor_bonus * deficit / mean_assigned)
+    primary = overlap + (sticky_bonus if sticky else 0) + floor_bonus
+    return (primary, int(sticky), -inflight, -assigned)
+
+
 @dataclass(frozen=True)
 class KvAwarePolicy:
     name: str = "kv-aware"
|
||||
@@ -161,27 +204,11 @@ class KvAwarePolicy:
     # 0 disables the mechanism. Default 3 picked empirically to allow brief
     # transient saturation without panicking, but to reroute persistent starvation.
     migration_reject_threshold: int = 3
-    # Load-floor bonus: graduated boost added to lex-score position 0 for
-    # under-loaded D workers, gated on `not sticky` so turn-1+ requests of an
-    # existing session continue to stick to their original D. The boost
-    # magnitude scales linearly with the D's deficit relative to the running
-    # mean of `decode_assignment_counts`:
-    #     floor_bonus = K * max(0, mean - assigned[D]) / max(1, mean)
-    # When mean=0 (warmup), bonus is 0 for all workers (lex tiebreak by
-    # iteration order). Once any D has been assigned, under-loaded D's get a
-    # bonus that approaches K as their deficit-to-mean ratio approaches 1.
-    # The bonus naturally decays as load equalises, leaving the original
-    # overlap+sticky scoring in charge of steady-state selection.
-    #
-    # Set this above the maximum cross-session boilerplate overlap you expect
-    # so that fresh sessions are routed to under-loaded D's even when those
-    # D's currently have 0 overlap, but below the magnitude of "real" prefix
-    # overlap (e.g., a session with 800-block per-session prefix on an
-    # already-warm D should still go there).
-    #
-    # 0 disables. See docs/E1_E2_FIX_DESIGN_ZH.md §Q2 for the full design and
-    # docs/E1_E2_RESULTS_ZH.md §5d for why this is needed on Inferact-shaped
-    # workloads where boilerplate overlap pins D2 cold forever.
+    # Load-floor bonus: see score_candidate() docstring for the exact formula.
+    # Set above the max cross-session boilerplate overlap you expect (so fresh
+    # sessions reach under-loaded D's even at 0 overlap), but below the
+    # magnitude of "real" prefix overlap (so a warm D still wins for its own
+    # session). 0 disables.
     load_floor_bonus: int = 0
 
     def select(
|
||||
@@ -194,15 +221,12 @@ class KvAwarePolicy:
         prefill_worker_id = state.next_prefill_worker_id(topology)
         session = state.session_state.get(request.session_id)
 
-        # Pre-compute the running mean of decode assignments. Used by the
-        # load-floor bonus inside the candidate loop.
         n_route_workers = max(1, len(topology.route_workers))
         total_assigned = sum(state.decode_assignment_counts.values())
         mean_assigned = total_assigned / n_route_workers
 
         best_decode_worker_id: str | None = None
-        best_score: tuple[int, int, int, int] | None = None
-        candidates_considered = 0
+        best_score: CandidateScore | None = None
         for worker in topology.route_workers:
             # Migration: skip workers that have rejected this session too many times.
             # If all candidates get filtered (degenerate case), fall through to
|
||||
@@ -213,25 +237,17 @@ class KvAwarePolicy:
             )
             if rejects >= self.migration_reject_threshold:
                 continue
-            candidates_considered += 1
-            overlap = _overlap_blocks(request, state, worker.worker_id)
-            sticky = int(session is not None and session.last_decode_worker == worker.worker_id)
-            inflight_penalty = -state.inflight_decode.get(worker.worker_id, 0)
-            worker_assigned = state.decode_assignment_counts.get(worker.worker_id, 0)
-            assignment_penalty = -worker_assigned
-
-            # Load-floor bonus: only for fresh placements (not sticky), and
-            # only when the knob is enabled. See docstring above.
-            floor_bonus = 0
-            if self.load_floor_bonus > 0 and not sticky and mean_assigned > 0:
-                deficit = max(0.0, mean_assigned - worker_assigned)
-                floor_bonus = int(self.load_floor_bonus * deficit / mean_assigned)
-
-            score = (
-                overlap + sticky * self.sticky_bonus + floor_bonus,
-                sticky,
-                inflight_penalty,
-                assignment_penalty,
+            score = score_candidate(
+                overlap=_overlap_blocks(request, state, worker.worker_id),
+                sticky=(
+                    session is not None
+                    and session.last_decode_worker == worker.worker_id
+                ),
+                inflight=state.inflight_decode.get(worker.worker_id, 0),
+                assigned=state.decode_assignment_counts.get(worker.worker_id, 0),
+                mean_assigned=mean_assigned,
+                sticky_bonus=self.sticky_bonus,
+                load_floor_bonus=self.load_floor_bonus,
             )
             if best_score is None or score > best_score:
                 best_score = score
|
||||
|
||||
39
tests/README.md
Normal file
@@ -0,0 +1,39 @@
|
||||
# Tests

Pure-Python unit + property tests for the algorithm layer. These tests do
**not** import SGLang and do **not** need a GPU — they validate the routing
algorithm (Algorithm 1/2/3 in `docs/KVC_ROUTER_ALGORITHM.md`) and its
theorems against the pure functions extracted from `policies.py`.

## Run

```bash
uv sync --group test
uv run pytest
```

Or, without uv:

```bash
pip install pytest
PYTHONPATH=src pytest tests
```

## Scope

- `test_policy_scoring.py` — Algorithm 1 lex-score properties (overlap
  dominates sticky, load-floor gating, tie-breakers).
- `test_no_starvation.py` — Theorem 1: bounded retries before some D either
  accepts or the least-rejected D is forced through the degenerate path.
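
The three lex-score properties named above can be illustrated with a self-contained sketch. The function body below restates `score_candidate` from `policies.py` (type hints dropped) so the snippet runs on its own; the concrete numbers are illustrative values, not taken from `test_policy_scoring.py`.

```python
def score_candidate(*, overlap, sticky, inflight, assigned, mean_assigned,
                    sticky_bonus, load_floor_bonus):
    # Restated from policies.py so this example has no imports.
    floor_bonus = 0
    if load_floor_bonus > 0 and not sticky and mean_assigned > 0:
        deficit = max(0.0, mean_assigned - assigned)
        floor_bonus = int(load_floor_bonus * deficit / mean_assigned)
    primary = overlap + (sticky_bonus if sticky else 0) + floor_bonus
    return (primary, int(sticky), -inflight, -assigned)


common = dict(inflight=0, assigned=0, mean_assigned=4.0, sticky_bonus=2)

# Overlap dominates sticky: 10 blocks of overlap beat a sticky bonus of 2.
warm = score_candidate(overlap=10, sticky=False, load_floor_bonus=0, **common)
sticky_cold = score_candidate(overlap=0, sticky=True, load_floor_bonus=0, **common)

# Load-floor gating: only the non-sticky candidate receives the bonus.
gated = score_candidate(overlap=0, sticky=True, load_floor_bonus=8, **common)
boosted = score_candidate(overlap=0, sticky=False, load_floor_bonus=8, **common)

# Tie-breaker: with equal primary and sticky, the lower inflight count wins.
tie = dict(overlap=5, sticky=False, assigned=4, mean_assigned=4.0,
           sticky_bonus=2, load_floor_bonus=0)
idle = score_candidate(inflight=1, **tie)
busy = score_candidate(inflight=3, **tie)
```

Because Python compares tuples position by position, `warm > sticky_cold` and `idle > busy` hold without any custom comparison logic, which is what lets `select()` use a plain `>` on scores.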
|
||||
|
||||
Future:
- block-level eviction `MockRadixCache` tests (see
  `docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md` §5).
- D→P sync `staleness_budget` property tests (see
  `docs/D_TO_P_SYNC_CONTRACT_ZH.md` §1).

## Why no integration tests here

Anything that needs SGLang, mooncake, or a real model is an integration
test and must run on hardware. Those tests live as `scripts/sweep_*.sh`
under the evaluation protocol in `docs/EVALUATION_PROTOCOL_ZH.md`.
|
||||
0
tests/__init__.py
Normal file
66
tests/_fixtures.py
Normal file
@@ -0,0 +1,66 @@
"""Lightweight fixtures for algorithm-layer tests.

Builds minimal TraceRequest / SingleNodeTopology / RoutingState instances
without invoking build_single_node_topology() (which validates GPU budgets
we don't care about in unit tests).
"""

from __future__ import annotations

from agentic_pd_hybrid.topology import SingleNodeTopology, WorkerSpec
from agentic_pd_hybrid.trace import TraceRequest


def make_topology(decode_count: int = 3, prefill_count: int = 1) -> SingleNodeTopology:
    prefill_workers = tuple(
        WorkerSpec(
            role="prefill",
            ordinal=i,
            gpu_ids=(i,),
            host="127.0.0.1",
            port=30000 + i,
        )
        for i in range(prefill_count)
    )
    decode_workers = tuple(
        WorkerSpec(
            role="decode",
            ordinal=i,
            gpu_ids=(prefill_count + i,),
            host="127.0.0.1",
            port=31000 + i,
        )
        for i in range(decode_count)
    )
    return SingleNodeTopology(
        model_path="/dev/null/test-model",
        prefill_workers=prefill_workers,
        decode_workers=decode_workers,
        direct_workers=(),
        router_host="127.0.0.1",
        router_port=8000,
        transfer_backend="mooncake",
        trust_remote_code=True,
    )
|
||||
|
||||
|
||||
def make_request(
    *,
    session_id: str = "sess-1",
    turn_id: int = 0,
    hash_ids: tuple[int, ...] = (),
    input_length: int = 1024,
    output_length: int = 64,
) -> TraceRequest:
    return TraceRequest(
        request_id=f"{session_id}-t{turn_id}",
        session_id=session_id,
        chat_id=int(turn_id),
        parent_chat_id=-1 if turn_id == 0 else int(turn_id - 1),
        timestamp_s=float(turn_id),
        input_length=input_length,
        output_length=output_length,
        request_type="user",
        turn_id=turn_id,
        hash_ids=hash_ids,
    )
|
||||
150
tests/test_no_starvation.py
Normal file
@@ -0,0 +1,150 @@
"""Theorem 1 — no permanent starvation under bounded retries.

Reference: docs/KVC_ROUTER_ALGORITHM.md §4.1.

For any session s with τ_reject ≥ 1, after at most |D| · τ_reject
consecutive admission rejects on s, the routing policy MUST still
return a valid decision (via the degenerate "least-rejected D"
fallback). The session cannot be permanently starved at the policy
layer.

We can't exercise the full Dispatch loop here (it lives in replay.py and
needs HTTP, mooncake, etc.). What we CAN test is the policy-layer
guarantee: after K = |D| · τ_reject reject bumps, select() never raises
and never returns a worker that's both blacklisted *and* has positive
overlap (the degenerate path chooses by least-rejected).

This is the property-layer companion to test_policy_scoring.py's
quantitative checks.
"""

from __future__ import annotations

from agentic_pd_hybrid.policies import KvAwarePolicy, RoutingState

from ._fixtures import make_request, make_topology
|
||||
|
||||
|
||||
def test_select_returns_valid_decision_under_full_blacklist():
    """Bump all (s, d) reject counters past τ_reject. select() must still
    pick a worker (degenerate fallback, no exception, no None)."""
    topology = make_topology(decode_count=3)
    state = RoutingState.create(topology)
    request = make_request(session_id="s-stuck", turn_id=0)
    policy = KvAwarePolicy(migration_reject_threshold=3)

    # Pre-fill the blacklist for every D.
    for worker in topology.route_workers:
        for _ in range(3):
            state.record_admission_reject(request.session_id, worker.worker_id)

    decision = policy.select(request=request, topology=topology, state=state)
    assert decision.decode_worker_id is not None
    assert decision.decode_worker_id in {w.worker_id for w in topology.route_workers}
|
||||
|
||||
|
||||
def test_bounded_retries_to_force_degenerate_path():
    """Theorem 1: at most |D| · τ_reject rejects suffice to either exhaust
    every D or to force the degenerate fallback. Simulate the worst case
    where each retry picks a fresh D and is immediately rejected."""
    topology = make_topology(decode_count=4)
    state = RoutingState.create(topology)
    request = make_request(session_id="s-worst", turn_id=0)
    threshold = 3
    policy = KvAwarePolicy(migration_reject_threshold=threshold)

    seen_decoders: set[str] = set()
    max_retries = len(topology.route_workers) * threshold

    for retry in range(max_retries):
        decision = policy.select(request=request, topology=topology, state=state)
        seen_decoders.add(decision.decode_worker_id)
        # Adversary: this D rejects this session.
        state.record_admission_reject(request.session_id, decision.decode_worker_id)

    # After |D|·τ_reject rejects every D must be blacklisted, so the next
    # select() takes the degenerate "least-rejected" branch and STILL
    # returns a valid worker.
    final = policy.select(request=request, topology=topology, state=state)
    assert final.decode_worker_id in {w.worker_id for w in topology.route_workers}
    # And we should have explored every D over the bounded retries; the
    # algorithm cannot trap a session on a single D when all are rejecting.
    assert seen_decoders == {w.worker_id for w in topology.route_workers}


def test_least_rejected_d_chosen_when_all_blacklisted():
    """When every D is past threshold, the degenerate fallback chooses the
    one with the *fewest* rejects (Algorithm 1, line 4)."""
    topology = make_topology(decode_count=3)
    state = RoutingState.create(topology)
    request = make_request(session_id="s-lr", turn_id=0)
    policy = KvAwarePolicy(migration_reject_threshold=3)

    # Skew rejections: decode-0 has 5, decode-1 has 10, decode-2 has 3.
    # All are >= threshold=3, so the filter wipes out every candidate.
    # The fallback should pick decode-2 (smallest rejection count).
    workers = list(topology.route_workers)
    bumps = {workers[0].worker_id: 5, workers[1].worker_id: 10, workers[2].worker_id: 3}
    for wid, n in bumps.items():
        for _ in range(n):
            state.record_admission_reject(request.session_id, wid)

    decision = policy.select(request=request, topology=topology, state=state)
    assert decision.decode_worker_id == workers[2].worker_id


def test_other_session_unaffected_by_blacklist():
    """Algorithm 1's filter is per-(session, D), not per-D. Session A's
    rejects must not influence session B's routing."""
    topology = make_topology(decode_count=2)
    state = RoutingState.create(topology)
    policy = KvAwarePolicy(migration_reject_threshold=3)

    # Blacklist decode-0 for session A.
    workers = list(topology.route_workers)
    for _ in range(3):
        state.record_admission_reject("session-A", workers[0].worker_id)

    # Session B sees a clean slate, so it should be able to pick decode-0
    # (which is the iteration-order winner under empty state).
    decision_b = policy.select(
        request=make_request(session_id="session-B"),
        topology=topology,
        state=state,
    )
    # decode-0 wins iteration-order tiebreak when all scores are (0,0,0,0).
    assert decision_b.decode_worker_id == workers[0].worker_id


def test_threshold_zero_disables_blacklist():
    """migration_reject_threshold=0 means the migration mechanism is off:
    every D stays a candidate regardless of its reject count."""
    topology = make_topology(decode_count=2)
    state = RoutingState.create(topology)
    request = make_request(session_id="s-no-mig")
    policy = KvAwarePolicy(migration_reject_threshold=0)

    workers = list(topology.route_workers)
    # Pile a huge number of rejects on decode-0.
    for _ in range(100):
        state.record_admission_reject(request.session_id, workers[0].worker_id)

    decision = policy.select(request=request, topology=topology, state=state)
    # decode-0 should still be eligible; with empty overlap/sticky/inflight,
    # iteration order picks decode-0 first.
    assert decision.decode_worker_id == workers[0].worker_id


def test_reject_counter_only_grows_on_record():
    """RoutingState.record_admission_reject is the ONLY mutator for the
    counter. select() must not silently bump it."""
    topology = make_topology(decode_count=2)
    state = RoutingState.create(topology)
    request = make_request(session_id="s-clean")
    policy = KvAwarePolicy()

    for _ in range(5):
        policy.select(request=request, topology=topology, state=state)

    # No explicit record_admission_reject -> all counters stay zero.
    assert sum(state.session_d_rejects.values()) == 0
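The tests in this file describe a per-(session, D) blacklist with a "least-rejected" fallback. A minimal sketch of that selection rule follows, assuming reject counts live in a plain `Counter` keyed by `(session_id, worker_id)`. The function `pick_decode_worker` and its arguments are hypothetical names for illustration only; this is not the project's `KvAwarePolicy.select`, and the real lexicographic scoring is replaced here by simple iteration order.

```python
from collections import Counter


def pick_decode_worker(session_id, worker_ids, rejects, threshold):
    """Return a worker id for the session; assumes worker_ids is non-empty."""
    if threshold <= 0:
        # A threshold of 0 turns the blacklist off: every D stays eligible.
        candidates = list(worker_ids)
    else:
        # Keep only D's that this session has not been rejected from too often.
        candidates = [w for w in worker_ids if rejects[(session_id, w)] < threshold]
    if candidates:
        # Stand-in for the real scoring: first eligible D in iteration order.
        return candidates[0]
    # Degenerate path: every D is blacklisted, so take the least-rejected one.
    return min(worker_ids, key=lambda w: rejects[(session_id, w)])


workers = ["d0", "d1", "d2"]
rejects = Counter({("s", "d0"): 5, ("s", "d1"): 10, ("s", "d2"): 3})
print(pick_decode_worker("s", workers, rejects, threshold=3))      # d2
print(pick_decode_worker("other", workers, rejects, threshold=3))  # d0
```

Because the filter is keyed by session, the second call is unaffected by the first session's rejects, and because the fallback always returns some worker, the function cannot fail for a non-empty worker list.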
189
tests/test_policy_scoring.py
Normal file
@@ -0,0 +1,189 @@
"""Unit tests for Algorithm 1 (KvAwarePolicy score_candidate).

Reference: docs/KVC_ROUTER_ALGORITHM.md §3.1. The lex-score is

    (overlap + sticky_bonus*sticky + floor_bonus,
     sticky,
     -inflight,
     -assigned)

These tests pin down the qualitative properties that the algorithm's
correctness arguments rely on. They run without SGLang/GPU.
"""

from __future__ import annotations

from agentic_pd_hybrid.policies import score_candidate


def _score(**overrides):
    """Helper: build a score with all defaults and per-test overrides."""
    args = dict(
        overlap=0,
        sticky=False,
        inflight=0,
        assigned=0,
        mean_assigned=0.0,
        sticky_bonus=1,
        load_floor_bonus=0,
    )
    args.update(overrides)
    return score_candidate(**args)


# -- Determinism ----------------------------------------------------------------


def test_score_is_pure():
    """Same kwargs must produce the same tuple (no hidden state)."""
    a = _score(overlap=3, sticky=True, inflight=1, assigned=7)
    b = _score(overlap=3, sticky=True, inflight=1, assigned=7)
    assert a == b


def test_score_returns_4_tuple():
    s = _score()
    assert isinstance(s, tuple)
    assert len(s) == 4
    assert all(isinstance(x, int) for x in s)


# -- Primary term: overlap dominates sticky --------------------------------------


def test_overlap_strictly_dominates_pure_sticky():
    """Theorem-2 building block: any positive overlap on a non-sticky D wins
    against a sticky-only D with zero overlap (sticky_bonus=1)."""
    overlap = _score(overlap=2, sticky=False)
    sticky_only = _score(overlap=0, sticky=True)
    assert overlap > sticky_only


def test_overlap_plus_sticky_beats_overlap_alone():
    """Two D's with equal overlap: sticky one wins (sticky_bonus contributes
    to primary AND wins tie-1)."""
    sticky_d = _score(overlap=5, sticky=True)
    fresh_d = _score(overlap=5, sticky=False)
    assert sticky_d > fresh_d


# -- Tie breakers ----------------------------------------------------------------


def test_tiebreaker_inflight_lower_wins():
    """Equal primary & sticky: prefer the D with fewer in-flight requests."""
    low = _score(overlap=3, sticky=False, inflight=0, assigned=10)
    high = _score(overlap=3, sticky=False, inflight=5, assigned=10)
    assert low > high


def test_tiebreaker_assigned_lower_wins():
    """Equal primary & sticky & inflight: prefer rarely-picked D."""
    rare = _score(overlap=3, sticky=False, inflight=2, assigned=1)
    frequent = _score(overlap=3, sticky=False, inflight=2, assigned=99)
    assert rare > frequent


def test_tiebreaker_strict_lex_order():
    """Sticky always beats non-sticky on tie-1 even if non-sticky has lower
    inflight (the lex order is strict, position 1 outranks positions 2/3)."""
    sticky_busy = _score(overlap=4, sticky=True, inflight=10, assigned=10)
    fresh_idle = _score(overlap=4, sticky=False, inflight=0, assigned=0)
    # With sticky_bonus=1 added to position 0, sticky_busy already wins on
    # position 0 (5 > 4), so this pair never reaches the tie-breakers.
    assert sticky_busy > fresh_idle
    # Force equal primary by lowering the sticky D's overlap.
    sticky_busy_eq_primary = _score(overlap=3, sticky=True, inflight=10, assigned=10)
    fresh_idle_eq_primary = _score(overlap=4, sticky=False, inflight=0, assigned=0)
    # Now equal primary (3+1=4 vs 4). Sticky wins position 1.
    assert sticky_busy_eq_primary > fresh_idle_eq_primary


# -- Load-floor bonus ------------------------------------------------------------


def test_load_floor_disabled_by_default():
    """load_floor_bonus=0 → no contribution to primary."""
    s = _score(overlap=0, sticky=False, mean_assigned=10, assigned=0)
    assert s[0] == 0


def test_load_floor_gated_off_when_sticky():
    """Even with load_floor_bonus>0, sticky D does NOT receive the boost.
    Otherwise a session would migrate away from its warm D under load."""
    sticky_under_loaded = _score(
        overlap=0, sticky=True, mean_assigned=10, assigned=0, load_floor_bonus=200
    )
    # primary = overlap(0) + sticky_bonus(1) + floor(0) = 1
    assert sticky_under_loaded[0] == 1


def test_load_floor_zero_when_mean_zero():
    """Warmup case: mean_assigned=0 -> no D gets boost -> degenerate to lex
    tiebreak by iteration order."""
    s = _score(
        overlap=0, sticky=False, mean_assigned=0, assigned=0, load_floor_bonus=200
    )
    assert s[0] == 0


def test_load_floor_proportional_to_deficit():
    """floor_bonus = K * deficit / mean. assigned=0, mean=10, K=200 -> 200."""
    s_zero = _score(
        overlap=0, sticky=False, mean_assigned=10, assigned=0, load_floor_bonus=200
    )
    s_half = _score(
        overlap=0, sticky=False, mean_assigned=10, assigned=5, load_floor_bonus=200
    )
    s_full = _score(
        overlap=0, sticky=False, mean_assigned=10, assigned=10, load_floor_bonus=200
    )
    # deficit = max(0, 10-0)=10 -> bonus = int(200*10/10) = 200
    # deficit = max(0, 10-5)=5 -> bonus = int(200*5/10) = 100
    # deficit = max(0, 10-10)=0 -> bonus = 0
    assert s_zero[0] == 200
    assert s_half[0] == 100
    assert s_full[0] == 0


def test_load_floor_does_not_underflow_when_overloaded():
    """assigned > mean -> deficit clamped to 0, no negative bonus."""
    s = _score(
        overlap=0, sticky=False, mean_assigned=10, assigned=50, load_floor_bonus=200
    )
    assert s[0] == 0


# -- Routing intent: real overlap beats load-floor bonus -------------------------


def test_real_prefix_overlap_beats_load_floor_on_warm_d():
    """E1_E2_FIX_DESIGN_ZH §Q2: load_floor should be set such that
    real per-session prefix overlap outweighs the cold-D bonus.
    With overlap=800 (a per-session prefix) and load_floor_bonus=200,
    a warm D (high overlap, possibly high load) should still win against
    a cold D with floor bonus."""
    warm = _score(
        overlap=800, sticky=True, mean_assigned=10, assigned=10, load_floor_bonus=200
    )
    cold = _score(
        overlap=0, sticky=False, mean_assigned=10, assigned=0, load_floor_bonus=200
    )
    # warm primary = 800 + 1 + 0 = 801. cold primary = 0 + 0 + 200 = 200.
    assert warm[0] == 801
    assert cold[0] == 200
    assert warm > cold


def test_boilerplate_overlap_loses_to_load_floor_for_cold_d():
    """Same §Q2: load_floor should beat cross-session boilerplate overlap.
    If load_floor_bonus=200 and the worst-case boilerplate overlap is ~50,
    a fresh cold D should still win against a slightly-warm-from-boilerplate D."""
    warm_boilerplate = _score(
        overlap=50, sticky=False, mean_assigned=10, assigned=10, load_floor_bonus=200
    )
    cold_under_loaded = _score(
        overlap=0, sticky=False, mean_assigned=10, assigned=0, load_floor_bonus=200
    )
    # warm_boilerplate primary = 50 + 0 + 0 = 50 (assigned=mean, no deficit).
    # cold_under_loaded primary = 0 + 0 + 200 = 200.
    assert cold_under_loaded > warm_boilerplate
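The scoring tests above fix the observable behaviour of the lex-score without showing a function body. One possible reading that satisfies every assertion in this file is sketched below. The name `sketch_score` is hypothetical, and the gating conditions (no floor bonus for a sticky D, none when `mean_assigned` is 0) are inferred from the test docstrings rather than taken from `agentic_pd_hybrid.policies.score_candidate`, which may differ in detail.

```python
def sketch_score(overlap, sticky, inflight, assigned, mean_assigned,
                 sticky_bonus=1, load_floor_bonus=0):
    """Return a 4-tuple of ints that compares lexicographically; larger wins."""
    floor = 0
    # Assumed gating: the floor bonus applies only to a non-sticky D and only
    # once some load has been observed (mean_assigned > 0).
    if load_floor_bonus > 0 and not sticky and mean_assigned > 0:
        # Clamp at zero so an overloaded D never receives a negative bonus.
        deficit = max(0.0, mean_assigned - assigned)
        floor = int(load_floor_bonus * deficit / mean_assigned)
    primary = overlap + (sticky_bonus if sticky else 0) + floor
    # Negate the load terms so that lower inflight/assigned compares as larger.
    return (primary, int(sticky), -inflight, -assigned)


warm = sketch_score(overlap=800, sticky=True, inflight=0, assigned=10,
                    mean_assigned=10, load_floor_bonus=200)
cold = sketch_score(overlap=0, sticky=False, inflight=0, assigned=0,
                    mean_assigned=10, load_floor_bonus=200)
print(warm[0], cold[0], warm > cold)  # 801 200 True
```

Python compares tuples position by position, which is what makes the ordering strict: a difference in the primary term decides the outcome before the sticky flag or either load term is consulted.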