13 Commits

Author SHA1 Message Date
110bd68000 docs(failures): consolidated 5-mode failure taxonomy
Consolidates failure modes scattered across V2_DEEP_ANALYSIS,
E1_E2_RESULTS, E3_FINDINGS, KVC_EVICTION_GRANULARITY,
REAL_ALI_KVC_EXPERIMENT into a single lookup table with
five fields per mode: symptom → root cause → trigger →
current mitigation → real fix.

Five modes covered:
  A. Mooncake "instance not alive" cascade
     — E2 80%-failure pathology; admission no-space →
       seed burst → heartbeat drop → batch abort
  B. Cold-D / overlap-pinning
     — shared boilerplate hash pins all sessions to a
       subset of D's; load_floor_bonus is a patch, the
       real fix is exclusive_overlap redefinition
  C. Evict storm (session-level eviction)
     — release_session frees 38–88K tokens in one shot;
       fix is BLOCK_LEVEL_EVICTION_DESIGN
  C'. Reseed storm (turn-1 concurrent seeds)
     — startup-phase mooncake burst; fix is per-D
       pending-seed budget, frequency drops after C
  D. Streaming-session correction invariant crash (E3)
     — schedule_batch.py:1646 landmine, hotfixed by
       986f351, root-fix is removing the correction
       path entirely (BLOCK_LEVEL_EVICTION §3.7)

Each mode has a forensic link back to the original
experiment doc that surfaced it.

§6 adds a diagnostic cheat sheet: "if you see X, look at Y."
§7 wires every mode to a roadmap item — Milestone 1 should
graduate §1–§4 to "mitigated" and eliminate §5.

INDEX_ZH gets a new §1.6 section linking this and the
SGLang patch inventory.

No code change. Reading dependency for anyone debugging
a sweep or writing paper §Limitations.
2026-05-13 00:43:58 +08:00
d93228e156 docs(sglang): patch surface inventory + retire-after-refactor list
Resolves AUDIT_AND_ROADMAP §S6: the 785 lines of vendored
SGLang patch are a known reviewer trust risk because the
prototype touches scheduler.py / schedule_batch.py /
session_aware_cache.py / disaggregation hot paths. Without
classification readers cannot tell core mechanism from
temporary scaffold.

Classifies each of the 10 patched files into:
  MUST-HAVE         — Algorithm 1/2/3, streaming session
                       lifecycle, admit RPC. ~450 lines.
                       Long-term retention.
  WORKAROUND        — release_session token-free,
                       maybe_trim_decode_session_cache,
                       streaming-session extend_input_len
                       correction (incl. the E3 landmine
                       hotfix from commit 986f351),
                       DecodePreallocQueue trim trigger.
                       ~150 lines. To DELETE entirely
                       after block-level eviction refactor
                       (BLOCK_LEVEL_EVICTION_DESIGN §3.7).
  EXPERIMENTAL      — backpressure pause hint
                       (_compute_backpressure_pause_hint).
                       ~60 lines. Signal not closed-loop
                       per REAL_ALI §4.3; retain as hook
                       or retire in 1 month.
  INSTRUMENTATION   — _compute_pool_breakdown_for_diagnostics.
                       ~50 lines. Keep behind a flag.
  MINOR             — ~3 lines. Ignore.

The §2 summary gives reviewers a one-glance picture of
what's core vs. scaffold. Maintenance convention in §3
mandates classifying every new (sglang) patch at commit
time.

§4 wires the classification into the roadmap: clearing
the WORKAROUND bucket is the objective completion marker
for block-level eviction refactor.

No code change.
2026-05-13 00:42:22 +08:00
9a81c993ab docs(onboarding): link new audit / design / eval docs from the root README + AGENTS.md

Without this, the five docs added on this branch
(AUDIT_AND_ROADMAP, INDEX, BLOCK_LEVEL_EVICTION_DESIGN,
D_TO_P_SYNC_CONTRACT, EVALUATION_PROTOCOL) are reachable
only by listing docs/. This wires them into the two entry
points an agent or collaborator hits first.

README.md changes:
  - top-of-page pointer to INDEX_ZH for new collaborators
  - pointer to AUDIT_AND_ROADMAP_ZH for project state
  - "单元测试 (无 GPU)" section: how to run pytest
  - "评测脚本" section: invocations for the two new
    analysis scripts

AGENTS.md changes:
  - top section "For new collaborators / agents" before
    the existing "Environment" block, pointing at INDEX_ZH,
    AUDIT_AND_ROADMAP_ZH, the two ready-to-pick-up design
    docs, and EVALUATION_PROTOCOL_ZH
  - pytest invocation under Environment
2026-05-12 23:58:56 +08:00
dbb9eee471 feat(analysis): paired comparison with bootstrap CI
Implements docs/EVALUATION_PROTOCOL_ZH.md §2.2 (M2 fix):
mechanism A vs B comparisons on the same trace must be
paired on same-trial-mask, with errors and aborts surfaced
rather than silently dropped.

How it differs from scripts/analysis/compare_no_error.py:
  - works on raw request-metrics.jsonl (not pre-aggregated
    summary.json) so it can recompute paired masks
  - reports 95% bootstrap CIs for mean / p50 / p90
  - exposes intersection size + per-side failure count in
    the intersection so the reader can see how many rows
    were dropped from the comparison and whether the
    candidate's win came from selection effects

stdlib only — random.Random for bootstrap, no scipy/numpy.
Default 2000 bootstrap iterations; seed is configurable
for reproducibility.

Verified locally on a synthetic 20-row pair (5s constant
delta + one candidate failure): correctly reports
paired_size=19, candidate_fail_in_common=1, mean delta
-5.000s, 19/0/0 win/loss/tie.

CLI:
  scripts/analysis/paired_compare.py \
      --baseline outputs/run-dp/request-metrics.jsonl \
      --candidate outputs/run-kvc/request-metrics.jsonl \
      [--metric latency_s|ttft_s|tpot_s] \
      [--bootstrap 5000] [--seed 42] [--json]
2026-05-12 23:57:57 +08:00
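The paired comparison described in commit dbb9eee471 can be sketched roughly as follows. This is an illustration only, not the contents of `scripts/analysis/paired_compare.py`: the function names `paired_deltas` and `bootstrap_ci` and the row fields `request_id` and `ok` are assumptions; only the `latency_s` metric, the stdlib-only constraint, and the use of `random.Random` come from the commit message.

```python
import random
from statistics import mean


def paired_deltas(baseline, candidate, metric="latency_s"):
    """Pair rows by request id and keep only ids that succeeded on both sides.

    The row fields `request_id` and `ok` are hypothetical names.
    """
    base = {row["request_id"]: row for row in baseline}
    cand = {row["request_id"]: row for row in candidate}
    common = sorted(base.keys() & cand.keys())
    paired = [k for k in common if base[k]["ok"] and cand[k]["ok"]]
    deltas = [cand[k][metric] - base[k][metric] for k in paired]
    info = {
        "common": len(common),
        "paired_size": len(paired),
        # Failures are surfaced, not silently dropped.
        "baseline_fail_in_common": sum(1 for k in common if not base[k]["ok"]),
        "candidate_fail_in_common": sum(1 for k in common if not cand[k]["ok"]),
    }
    return deltas, info


def bootstrap_ci(deltas, iterations=2000, seed=42, alpha=0.05):
    """Percentile bootstrap CI for the mean paired delta."""
    rng = random.Random(seed)  # seeded for reproducibility
    n = len(deltas)
    means = sorted(mean(rng.choices(deltas, k=n)) for _ in range(iterations))
    lo = means[int((alpha / 2) * iterations)]
    hi = means[min(iterations - 1, int((1 - alpha / 2) * iterations))]
    return mean(deltas), lo, hi
```

Reporting `paired_size` next to the per-side failure counts lets a reader judge whether a win could come from selection effects rather than from the mechanism.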
4021f27ee2 feat(analysis): stratified latency / TTFT reporter
Implements docs/EVALUATION_PROTOCOL_ZH.md §1.3 (M3 fix):
headline numbers must be accompanied by stratified
breakdowns so reviewers can see which slice the gains
come from.

The script reads one or more request-metrics.jsonl files
and buckets rows along four orthogonal dimensions:
  - turn_id        : {1, 2-5, 6-20, 21+}
  - input_length   : {<=8K, 8K-64K, >64K}
  - overlap_ratio  : {<=0.3, 0.3-0.7, >0.7}
  - append_tokens  : {<=128, 128-1K, 1K-8K, >8K}

Per bucket: n, n_ok, err_pct, latency/ttft mean+p50+p90+p99.
Output is markdown by default, --json for machine read.

stdlib only — no pandas/numpy. Verified on a synthetic
5-row jsonl (turn=1 with one error correctly reports
33.3% err% on the bucket).

Why this script and not pandas:
  - the existing scripts/analysis/* are stdlib-only;
    keeping consistency
  - reviewers can run it on the artifact without
    pip-installing anything beyond pytest
  - speed irrelevant; runs in <1s on the largest existing
    sweep (4449 rows)

Usage shown in EVALUATION_PROTOCOL_ZH §3.
2026-05-12 23:57:13 +08:00
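A minimal sketch of the bucketing idea from commit 4021f27ee2, restricted to the `turn_id` dimension. It is not the actual script: the helper names and the row field `ok` are assumptions, and only the bucket edges `{1, 2-5, 6-20, 21+}` are taken from the commit message.

```python
from collections import defaultdict

# (low, high, label); edges copied from the commit message, high=None means open-ended.
TURN_BUCKETS = [(1, 1, "1"), (2, 5, "2-5"), (6, 20, "6-20"), (21, None, "21+")]


def turn_bucket(turn_id):
    """Map a 1-based turn id to its bucket label."""
    for low, high, label in TURN_BUCKETS:
        if turn_id >= low and (high is None or turn_id <= high):
            return label
    raise ValueError(f"turn_id must be >= 1, got {turn_id}")


def stratify_by_turn(rows):
    """Return {label: {n, n_ok, err_pct}}; the row field `ok` is a hypothetical name."""
    stats = defaultdict(lambda: {"n": 0, "n_ok": 0})
    for row in rows:
        bucket = stats[turn_bucket(row["turn_id"])]
        bucket["n"] += 1
        bucket["n_ok"] += 1 if row["ok"] else 0
    return {
        label: {**s, "err_pct": round(100.0 * (s["n"] - s["n_ok"]) / s["n"], 1)}
        for label, s in stats.items()
    }
```

The other three dimensions would follow the same pattern with their own edge tables.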
c5f552e122 test(policy): Theorem 1 no-starvation property tests
Adds the algorithm-layer guarantee tests for
docs/KVC_ROUTER_ALGORITHM.md §4.1. The full Dispatch loop
lives in replay.py (HTTP + mooncake), but the policy-layer
guarantee is testable in isolation: under any reject
sequence, select() must keep returning a valid worker.

Cases:
  - select returns a valid decision even after every (s,d)
    is past τ_reject (degenerate fallback)
  - |D|·τ_reject rejects suffice to explore every D
    (cannot trap a session on one D under universal
    rejection)
  - degenerate fallback picks the least-rejected D
    (Algorithm 1 line 4)
  - per-(session, D) isolation: session A's blacklist
    does not affect session B
  - migration_reject_threshold=0 disables blacklist
  - select() does NOT silently bump the reject counter
    (the only mutator is record_admission_reject)

Adds tests/_fixtures.py with minimal make_topology() and
make_request() helpers that skip build_single_node_topology's
GPU-budget validation (irrelevant in unit tests).

Verified locally: 20/20 passing under pytest 9.0.3. The
six new tests cover only Algorithm 1's policy-layer
half of Theorem 1; the reset-on-success half lives in
Algorithm 3 (replay.py) and is a future test target.
2026-05-12 23:55:57 +08:00
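The policy-layer half of the no-starvation property in commit c5f552e122 can be illustrated with a toy policy. This is a sketch under stated assumptions, not the project's `KvAwarePolicy`: the class `ToyRejectPolicy` is hypothetical, it ignores scoring entirely, and only the names `select`, `record_admission_reject`, and the threshold semantics are borrowed from the commit message.

```python
class ToyRejectPolicy:
    """Hypothetical stand-in: per-(session, worker) reject counters plus a fallback."""

    def __init__(self, workers, reject_threshold):
        self.workers = list(workers)
        self.reject_threshold = reject_threshold  # 0 disables the blacklist
        self.rejects = {}  # (session, worker) -> reject count

    def record_admission_reject(self, session, worker):
        # The only mutator of the reject counters.
        key = (session, worker)
        self.rejects[key] = self.rejects.get(key, 0) + 1

    def select(self, session):
        # Read-only: never bumps a counter.
        def count(worker):
            return self.rejects.get((session, worker), 0)

        if self.reject_threshold <= 0:
            return self.workers[0]
        eligible = [w for w in self.workers if count(w) < self.reject_threshold]
        if eligible:
            return eligible[0]
        # Degenerate fallback: every worker is past the threshold,
        # so pick the least-rejected one instead of returning nothing.
        return min(self.workers, key=count)
```

Under universal rejection, this toy visits every worker within `len(workers) * reject_threshold` rejects and still returns a valid worker afterwards.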
a785b83023 test(policy): unit tests for Algorithm 1 lex scoring
Adds the project's first test suite. Covers the
score_candidate() pure function from the previous refactor
commit, validating the qualitative properties that
KVC_ROUTER_ALGORITHM.md §3.1 and §4.2 rely on.

Tests / properties:
  - determinism: same args -> same tuple
  - shape: 4-int tuple
  - primary term: overlap dominates pure sticky
  - primary term: sticky_bonus credited
  - tie-2 inflight: lower wins
  - tie-3 assigned: lower wins
  - strict lex order: sticky wins position-1 over fresh-idle
  - load_floor disabled by default
  - load_floor gated off when sticky=True
  - load_floor zero during warmup (mean=0)
  - load_floor proportional to deficit (200/100/0 at 0/50/100% load)
  - load_floor does not underflow when overloaded
  - real per-session overlap beats load_floor on warm D
  - boilerplate overlap loses to load_floor on cold D
    (the cold-D fix from E1_E2_FIX_DESIGN §Q2)

Test infrastructure:
  - tests/ package with README explaining the GPU-free
    scope and the run instruction
  - pyproject.toml [dependency-groups] test = [pytest>=8]
    (install via `uv sync --group test`)
  - pyproject.toml [tool.pytest.ini_options] sets testpaths

Verified locally: 14/14 passing under pytest 9.0.3 in an
isolated 3.13 venv. No SGLang / GPU touched.
2026-05-12 23:54:48 +08:00
76a79dfdda refactor(policy): extract pure score_candidate() from KvAwarePolicy
Pulls the per-D score computation out of KvAwarePolicy.select
into a top-level pure function that takes primitives. The
in-method behavior is unchanged — the loop now calls
score_candidate() instead of inlining the arithmetic.

Motivation:
  Algorithm 1 (KVC_ROUTER_ALGORITHM.md §3.1) is the routing
  core. Until now its only API was select(), which requires
  building TraceRequest + SingleNodeTopology + RoutingState
  to test even a single lex-score property. After this
  extraction, unit tests can drive the four-tuple score
  directly with integers.

What changed:
  - Added module-level CandidateScore type alias.
  - Added score_candidate(*, overlap, sticky, inflight,
    assigned, mean_assigned, sticky_bonus,
    load_floor_bonus) -> CandidateScore.
  - KvAwarePolicy.select() loop body collapsed to a
    score_candidate() call; sticky now bool (was int)
    inside the call site.
  - Moved the load-floor docstring from KvAwarePolicy
    onto score_candidate where the formula lives.

Verified pure:
  - same kwargs -> same tuple
  - overlap=5 beats sticky-only (no load_floor): (5,0,0,0) > (1,1,0,0)
  - load_floor gated off when sticky=True

No behavior change; follow-up commit adds the unit tests
this refactor enables.
2026-05-12 23:53:17 +08:00
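For orientation, a pure scoring function with the shape described in commit 76a79dfdda might look like the sketch below. The keyword-only signature and the `CandidateScore` alias come from the commit message; the formula itself, the default values, and the sign convention for the tie-break terms are assumptions reverse-engineered from the listed properties, not the project's actual arithmetic.

```python
from typing import Tuple

CandidateScore = Tuple[int, int, int, int]


def score_candidate(
    *,
    overlap: int,
    sticky: bool,
    inflight: int,
    assigned: int,
    mean_assigned: float,
    sticky_bonus: int = 1,       # assumed default
    load_floor_bonus: int = 0,   # assumed default: load floor disabled
) -> CandidateScore:
    """Lexicographic score; larger tuples win (assumed convention)."""
    primary = overlap + (sticky_bonus if sticky else 0)
    # Load floor: credit under-loaded workers, gated off for sticky
    # candidates and during warmup (mean_assigned == 0).
    if load_floor_bonus > 0 and not sticky and mean_assigned > 0:
        deficit = max(0.0, mean_assigned - assigned)  # never negative when overloaded
        primary += int(load_floor_bonus * deficit / mean_assigned)
    # Negate the load terms so that lower inflight / assigned compares higher.
    return (primary, int(sticky), -inflight, -assigned)
```

Because the function takes only primitives, each property in the follow-up test commit can be checked with plain integers and no topology or routing state.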
591cd6d382 docs(eval): paper-quality evaluation protocol (M1–M6)
Codifies the methodology fixes for every weakness called
out in AUDIT_AND_ROADMAP_ZH §3.1. Existing sweep reports
(KVCACHE_CENTRIC_PROGRESS_ZH, V2_RESULTS_ZH) violate at
least one of these; future runs must use this protocol.

Contents:
- §1.1 M1 — N≥3 + bootstrap CI; no N=1 in headline
- §1.2 M2 — paired-on-same-trial-mask; same trace /
       timeout / max_input_len / time_scale; errors
       and aborts each get their own column
- §1.3 M3 — required stratification dimensions
       (turn_id / append_len / overlap_ratio /
       inter_turn_gap / input_len)
- §1.4 M4 — minimum 2 baselines from a 6-item list,
       including at least one non-SGLang baseline
- §1.5 M5 — trace mix: Ali full + SWE-Bench +
       ShareGPT + synthetic adversarial
- §1.6 M6 — hardware tiers; single-node 4xH200 +
       dual-node NVLink/IB as minimum
- §2 report templates (main table, paired delta,
      stratified, negative-result section)
- §3 tool support: marks the two scripts that the
      follow-up commits on this branch add
- §4 SOSP/OSDI artifact requirements
- §5 pre-submission self-checklist
- §6 phased delivery plan for catching up to protocol

No code change; reading dependency for the analyzer
scripts that follow.
2026-05-12 23:51:46 +08:00
fd37eda367 docs(design): D->P sync interface contract + 4-phase rollout
Companion to BLOCK_LEVEL_EVICTION_DESIGN_ZH. Specifies the
three-layer contract (mooncake / SGLang / agentic-pd-hybrid)
that the empty feat/d-to-p-sync branch is meant to fill.

Contents:
- §1 staleness budget β as a first-class system parameter,
      with recommended default (page_size .. 4096 tokens)
- §2.1 mooncake double-role API: KVRole enum extension,
      DecodeKVSender / PrefillKVReceiver class shapes,
      independent bootstrap channel
- §2.2 SGLang RadixCache.insert_external signature with
      five concrete design decisions (re-mapping policy,
      failure handling, lock_ref discipline, evict
      interaction, multi-P backup view)
- §2.3 agentic-pd-hybrid CLI flags, DirectSessionState
      additions, hook points in _invoke_session_direct
      and _invoke_kvcache_seeded_router
- §3 candidate Theorem 4 (reseed_cost upper bound under
      staleness budget β)
- §4 P1..P4 rollout with validation criteria per phase
- §5 five enumerated risks + mitigation
- §6 explicit decoupling: block-level eviction first,
      then D->P sync; do NOT bundle in one PR

Makes the feat/d-to-p-sync branch actionable for the next
collaborator without GPU until P2 microbench phase.
2026-05-12 23:50:39 +08:00
683c44bd71 docs(design): block-level eviction refactor — concrete API plan
Turns the architectural manifesto
(KVC_EVICTION_GRANULARITY_DESIGN_ZH.md) into a
function-by-function design the next collaborator can
implement against.

Contents:
- §1 current SessionAwareCache state with exact field
      semantics (req_pool_idx / kv_committed_len /
      kv_allocated_len / cache_protected_len)
- §3.1–§3.6 post-refactor source sketches for
      SessionSlot, cache_finished_req,
      cache_unfinished_req, match_prefix,
      release_session, get_session_status
- §3.7 the schedule_batch.py:1572-1646 correction
      block we can remove (the E3 landmine)
- §4 five invariants the PR must defend
- §5 GPU-free unit + property test plan with a
      MockRadixCache shape
- §6 ~1 week engineering estimate and three risks
- §7 dependency relationship to the planned
      D->P sync work
- §8 minimal step list for the implementing agent

No code change yet. Future commits on a
feat/block-level-evict branch will execute against
this spec.
2026-05-12 23:49:18 +08:00
baa843a3f9 docs(index): collaborator-facing doc index
Single navigation entry point. Existing docs were scattered
across five branches with no clear reading order — this is
the fix. Includes:

- 3-doc fast path for anyone joining
- topic-grouped table (algorithm / experiments / design
  discussions / evaluation / environment / archive)
- role-based reading paths (new SWE, paper reviewer,
  reproducing student, control-plane reader)

Index also references the four docs added later on this
branch (AUDIT_AND_ROADMAP, BLOCK_LEVEL_EVICTION_DESIGN,
D_TO_P_SYNC_CONTRACT, EVALUATION_PROTOCOL) so reviewers
can see the planned layout up front.
2026-05-12 23:47:28 +08:00
6cdea52f28 docs(audit): cross-branch audit + 3-milestone roadmap
Consolidates the state of the five working branches
(main / kvc-debug-journey-v1-to-v4 / feat/d-to-p-sync /
h200-cu130 / kvc-real-ali-iter-v1) into a single
collaborator-facing document.

Sections:
- §1 per-branch state
- §2 contributions a reviewer cannot refute
- §3 weaknesses (M1–M6 methodology, S1–S10 system,
      infra) ranked by how badly they hurt at OSDI/SOSP
- §4 3-milestone roadmap (defensible submission →
      production substrate → OSDI'27 increments)
- §5 GPU-free work queue (what subsequent commits
      in this branch deliver)

No code change. Acts as the index target for the
follow-up commits on this branch.
2026-05-12 23:46:40 +08:00
18 changed files with 2408 additions and 44 deletions


@@ -1,9 +1,33 @@
# AGENTS.md
## For new collaborators / agents
Before doing anything else, read [docs/INDEX_ZH.md](docs/INDEX_ZH.md). It points to the
3 must-read docs and a role-based reading path (new SWE, paper reviewer,
reproducing student, control-plane reader).
Cross-branch progress, weaknesses, and roadmap live in
[docs/AUDIT_AND_ROADMAP_ZH.md](docs/AUDIT_AND_ROADMAP_ZH.md). It is the single source of truth
for "what's done, what's broken, what to do next."
Two engineering work items are pre-specced and ready to pick up:
- block-level eviction refactor — [docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md)
- D→P incremental KV sync — [docs/D_TO_P_SYNC_CONTRACT_ZH.md](docs/D_TO_P_SYNC_CONTRACT_ZH.md)
Evaluation protocol (paper-quality N, paired CI, stratification,
baselines) is in [docs/EVALUATION_PROTOCOL_ZH.md](docs/EVALUATION_PROTOCOL_ZH.md).
## Environment
Use `uv` to manage all Python environments. Use `uv add` to manage dependencies so that `uv sync` reproduces exactly the same runnable environment each time.
Algorithm-layer unit tests (no GPU, no SGLang):
```bash
uv sync --group test
uv run pytest
```
## Goal
Build a minimal prototype on top of **SGLang xPyD** to test whether **session-aware / KV-cache-aware P/D routing** can improve **end-to-end latency** for agentic coding workloads.


@@ -6,6 +6,9 @@
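For a fuller but still concise description, see [docs/PROJECT_OVERVIEW.md](docs/PROJECT_OVERVIEW.md).
New collaborators: start with [docs/INDEX_ZH.md](docs/INDEX_ZH.md) and pick the 3 must-read docs for your role.
For an overview of current progress, weaknesses, and the roadmap, see [docs/AUDIT_AND_ROADMAP_ZH.md](docs/AUDIT_AND_ROADMAP_ZH.md).
## What has been done so far
- Launches a single-node SGLang P/D stack.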
@@ -99,3 +102,28 @@ uv run agentic-pd-hybrid replay \
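- SGLang changes: `feat(sglang): ...` / `fix(sglang): ...`
- The baseline of `third_party/sglang` is a clean SGLang `v0.5.10` snapshot.
- Do not commit `outputs/`, logs, `__pycache__`, or virtual environments.
## Unit tests (no GPU)
The algorithm layer (policies, Algorithm 1 / Theorem 1) has pure-Python unit tests; running them requires neither a GPU nor SGLang: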
```bash
uv sync --group test
uv run pytest
```
See [tests/README.md](tests/README.md) for details.
## Evaluation scripts
After collecting data according to [docs/EVALUATION_PROTOCOL_ZH.md](docs/EVALUATION_PROTOCOL_ZH.md):
```bash
# M3: bucket by turn_id / input_length / overlap_ratio / append_tokens
scripts/analysis/stratified.py outputs/<run>/request-metrics.jsonl
# M2: paired-on-same-trial bootstrap 95% CI
scripts/analysis/paired_compare.py \
--baseline outputs/run-dp/request-metrics.jsonl \
--candidate outputs/run-kvc/request-metrics.jsonl
```


@@ -0,0 +1,140 @@
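# Whole-Project Review and Next-Stage Roadmap
**Date**: 2026-05-12
**Branch starting point**: `improve/audit-and-foundations` (based on `h200-cu130`)
**Nature**: cross-branch consolidation + roadmap, so that collaborators can judge whether each commit is worth merging
**Audience**: the project's next SWE / research agent, and pre-reading for paper reviewers
This document gathers into one place the progress of the five branches `main` / `kvc-debug-journey-v1-to-v4` / `feat/d-to-p-sync` / `h200-cu130` / `kvc-real-ali-iter-v1`, the contributions that already hold, the weaknesses, and the roadmap toward SOSP/OSDI plus industrial grade, for quick alignment.
---
## 0. TL;DR
1. **Already established**: the v1 → v2 algorithm (reset-on-success, lexicographic Route, worker-mode Admit RPC) has a formal definition + two theorems + a measured result that beats 4DP CA on 6/8 metrics on SWE-Bench (50 sessions, ts=1).
2. **Core weaknesses**: (a) session-level eviction conflicts with the KVC design intent; (b) D→P incremental KV sync does not exist, and the TTFT p99 long tail comes from this; (c) the mooncake "instance not alive" cascade is a fundamental control-layer availability problem; (d) the evaluation still lacks multiple baselines, multiple traces, and strong statistics.
3. **Work that can advance without a GPU**: algorithm-layer unit tests, formal design docs (block-level evict, D→P sync interface contract), the evaluation protocol, stratified analysis tools, and consolidation of the documentation. Most of Milestone 1 in this roadmap falls into this category.
4. **Required to get into OSDI/SOSP**: execute §S1 (block-level evict) + §S2 (D→P sync POC) + §M2/M3/M4 (multiple baselines / full Ali / paired protocol). Estimated at 3–4 months for one or two people.
---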
## 1. Status overview of the five branches
| Branch | Role | Current state | Key output |
|---|---|---|---|
| `main` | "Published" baseline | 18 commits behind origin; 2P4D + worker-admission + seed-min2 reports a 9% mean / 19% p90 improvement vs default PD | The two policy tiers in `KVCACHE_CENTRIC_PROGRESS_ZH.md` (latency-best vs stable) |
| `kvc-debug-journey-v1-to-v4` | Main working branch | Full v1→v5 algorithm evolution; `KVC_ROUTER_ALGORITHM.md` has three algorithms + two theorems | SWE-Bench, 50 sessions, ts=1: v2 beats 4DP CA on 6/8 metrics; **TTFT p99 still loses by 3×** (1.28s vs 0.43s), diagnosed as the 8.3% reseed slow path |
| `feat/d-to-p-sync` | Placeholder branch | No code; only `RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md` | Rules out the misconception that "capacity-backup is D→P sync"; lists 4 engineering subtasks |
| `h200-cu130` | Real hardware + RDMA validation | Runs E1/E2/E3 on 4×H200 + mlx5_60 NDR 400 Gb/s | **E2 80% failure** (mooncake dead-link cascade); **E3 triggers an SGLang patch invariant crash at 16 min**; the latest `KVC_EVICTION_GRANULARITY_DESIGN_ZH.md` elevates the root cause to "session-level is the wrong eviction granularity" |
| `kvc-real-ali-iter-v1` | Real Ali trace validation | 8×H20; 179-req KVC-fit slice + 600-req/15min cold-window | KVC vs DP: KVC-fit p50 46% ✅; real 15min p90 +19s ❌ (53 errors vs 1 for DP); KVC hits OOM at the default mem-fraction, which must be lowered to 0.82 |
---
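## 2. Contributions that already hold firmly
Using "can a reviewer refute it" as the yardstick:
1. **Reset-on-success fixes v1 thrashing**: the v1 failure mode (permanent blacklist → migration dead loop) is backed by measurements + the Algorithm 3 formalization + the no-starvation proof of Theorem 1 (`KVC_ROUTER_ALGORITHM.md` §3.4 / §4.1).
2. **Clear division of labor across three algorithms**: Algorithm 1 (lexicographic Route) + Algorithm 2 (D-autonomous Admit RPC) + Algorithm 3 (Dispatch + reset-on-success). v5 changes admission from a router-side estimate to a D RPC (Option D), which is the correct layering for decoupling the capacity ground truth from the routing score.
3. **Deterministic hits on the Direct-to-D fast path** (Theorem 2): whenever the three conditions residency ⊇ prefix ∧ append ≤ τ_append ∧ cap_ok hold simultaneously, the fast path is always taken; the 91.6% hit rate on SWE-Bench and TTFT p50 = 0.43s are structural results.
4. **Every negative result has a forensic-grade explanation**: mooncake death, cold-D, the reseed slow path, and session-level evict each come with a code location + timeline + counterexample. This is a genuine plus for the paper.
---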
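## 3. Weaknesses a reviewer could use for a one-shot rejection
### 3.1 Evaluation methodology
- **M1 Insufficient N**: on SWE-Bench v2, the baseline is confirmed as categorical at N=3, while v2 itself has insufficient N; bootstrap CIs are missing.
- **M2 Unequal comparison basis**: with E2 at 80% failure, latency was computed over "successful only" rows and compared against the full E1 set; the paper must pair on the same trials.
- **M3 Trace biased toward KVC-friendly**: the KVC-fit slice was filtered for small-append + high overlap; the diluted result on full Ali (turn-2+ ratio 26%, very many single-turn sessions) has never been run.
- **M4 Baselines not strong enough**: none of vLLM + prefix-cache, DistServe, SplitWise, or Mooncake-Master is included.
- **M5 Trace homogeneity**: no ShareGPT/Mooncake trace, no long-context tool-use agent benchmark, no synthetic adversarial trace.
- **M6 Hardware coverage**: single-node only, ≤ 8 GPUs; no cross-node runs and no measurements on a ≥ 32 GPU cluster.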
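### 3.2 System design
- **S1 Session-level eviction conflicts with the KVC design intent**: 90 evictions, averaging 67K tokens freed each; 25/50 sessions had to re-prefill 50–90K tokens. `KVC_EVICTION_GRANULARITY_DESIGN_ZH.md` has identified the problem but the fix is not implemented.
- **S2 D→P incremental sync does not exist**: 50% of the TTFT p99 long tail comes from re-prefill on P. `capacity-backup` is a static snapshot taken at seed time, not D→P sync. The fix requires changing the single-producer assumption of the SGLang radix.
- **S3 Mooncake cascading death**: admission no-space → continuous seed retries → heartbeat drop → SGLang aborts the whole batch (E2: 1054/1285 failed). A fundamental control-layer availability bug.
- **S4 Admission RPC blocks synchronously**: no backoff / hedging / staleness budget. Any GIL jitter in the D scheduler stalls the router.
- **S5 Cold-D / overlap-pinning**: the boilerplate 24-token block hash makes every session overlap with D0/D1 → D2/D3 get 0 bindings. The load-floor bonus is a patch, not a first-principles fix.
- **S6 The local SGLang patch is already 785 lines / 10 files**, including hot-path invariant changes such as `schedule_batch.py:1646`; the E3 crash is exactly a latent landmine introduced by the vendored patch.
- **S7 Failure recovery / idempotence**: idempotence of a streaming session under chunked-prefill retry relies on `SessionSlot.restore_to_req`; there is no recovery protocol for worker crash / mooncake reconnect / partial KV corruption.
- **S8 No multi-tenant / SLO-aware scheduling**: the algorithm's objective implicitly sets w_ttft=w_lat=1. In production, interactive / batch / background traffic must be tiered.
- **S9 Topology fixed at boot**: the P/D ratio is a startup parameter. Production load needs elasticity.
- **S10 Backpressure pause hint signal is not closed-loop**: it fired 20 times but nobody responded because of no-BP; the control plane is not wired up.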
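The missing backoff named in S4 could take the form of capped exponential backoff with full jitter. The sketch below is an assumption-laden illustration, not the project's admission path: `admit_with_backoff` and its parameters are hypothetical, and `try_admit` stands in for whatever callable issues one admission RPC.

```python
import random
import time


def admit_with_backoff(try_admit, *, attempts=5, base_s=0.05, cap_s=2.0,
                       sleep=time.sleep, rng=None):
    """Retry a hypothetical admission call with capped exponential backoff.

    `try_admit` is any zero-argument callable returning True on success.
    `sleep` and `rng` are injectable so the behavior can be tested offline.
    """
    rng = rng or random.Random()
    for attempt in range(attempts):
        if try_admit():
            return True
        if attempt + 1 < attempts:
            # Full jitter: wait a uniform time in [0, min(cap, base * 2^attempt)]
            # so that many routers retrying at once do not synchronize.
            ceiling = min(cap_s, base_s * (2 ** attempt))
            sleep(rng.uniform(0.0, ceiling))
    return False
```

Bounding the number of attempts matters here as much as the delay: unbounded seed retries are the step of the S3 cascade that precedes the heartbeat drop.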
### 3.3 Engineering infrastructure
- **Observability**: metrics are jsonl + an offline `recompute_summary.py`; production needs Prometheus + Grafana + OpenTelemetry traces.
- **Formal testing**: the algorithm and state layers lack unit tests; the idempotence of `SessionSlot.restore_to_req` is an invariant the author flagged personally.
- **Chaos injection**: control-plane failures such as mooncake death need a fault injection harness.
- **Code size**: `replay.py` is 2460 lines and combines orchestration / policy hooks / control plane / metrics in one file; fine for a prototype, weak as a paper-quality artifact.
---
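## 4. Roadmap
There are three milestones. Each can be delivered independently (as a paper chapter or an engineering release).
### Milestone 1: Defensible SOSP/OSDI submission (3–4 months, one or two people)
**Goal**: consolidate the existing algorithm + failure diagnosis into a draft that can survive the first PC round.
1. **Execute §S1 (block-level eviction refactor)**: see `docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md`.
- Streaming-session decode output is incrementally committed into the radix tree via `cache_finished_req` when each turn finishes.
- `SessionSlot` degenerates into pure metadata (holding only `last_node` + lock_ref).
- `release_session` becomes `dec_lock_ref` + slot deletion; eviction is left entirely to the SGLang radix LRU.
- Expected: eviction granularity drops from 67K tokens per eviction to 24 tokens per eviction; reseed frequency drops by an order of magnitude.
2. **Execute §S2 (D→P incremental sync POC)**: see `docs/D_TO_P_SYNC_CONTRACT_ZH.md`.
- A microbenchmark demonstrates: after D finishes an append, KV blocks are pushed asynchronously back to the P-side radix → the next reseed skips re-prefill.
3. **Fix §S3 (mooncake death cascade)**: admission RPC backoff + jitter; a per-D pending-seed budget; decouple the mooncake heartbeat from admission.
4. **First-principles fix for §S5**: redefine `overlap` as "the number of prefix hashes the session holds exclusively on D" (removing the contribution of shared boilerplate hashes), so that scores spread out naturally.
5. **Redo the evaluation**: see `docs/EVALUATION_PROTOCOL_ZH.md`. N≥3 + bootstrap CI + multiple baselines + full Ali + stratified reporting.
6. **Formal extensions**: add Theorem 3 (upper bound on re-prefill cost under block-level evict) + Theorem 4 (relationship between the D→P sync staleness budget β and reseed cost).
7. **Artifact**: one-click script + Dockerfile + reproduction of the core tables/figures in one hour on 4×A100.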
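The `overlap` redefinition in item 4 above can be sketched as follows. This is one possible reading, not the planned implementation: the helper names `shared_hashes` and `exclusive_overlap`, and the rule that a hash counts as boilerplate once it appears in at least two sessions, are assumptions made for illustration.

```python
from collections import Counter


def shared_hashes(prefix_hashes_by_session, min_sessions=2):
    """Hashes seen in at least `min_sessions` sessions are treated as boilerplate."""
    counts = Counter(
        h for hashes in prefix_hashes_by_session.values() for h in set(hashes)
    )
    return {h for h, c in counts.items() if c >= min_sessions}


def exclusive_overlap(request_hashes, resident_hashes, shared):
    """Count request prefix hashes resident on a worker, ignoring boilerplate."""
    return sum(1 for h in request_hashes
               if h in resident_hashes and h not in shared)
```

With this definition, a brand-new session whose only match is the shared system-prompt block scores 0 on every worker, so the tie-break terms decide placement instead of pinning it to the workers that happened to cache the boilerplate first.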
### Milestone 2: Production-quality serving substrate (a further 3–6 months, 2–3 people)
8. **Control-plane layering**: split `replay.py` into `router/` / `control/` / `obs/` / `orch/`.
9. **Elastic topology**: an autoscaling controller whose inputs are (P queue, D transfer queue, D KV usage).
10. **Multi-tenant + SLO classes**: three tiers (interactive / batch / background), each with an independent admission budget.
11. **Failure injection harness**: mooncake link flap / D OOM kill / router GC pause / partial KV corruption, each case with a recovery SLA.
12. **Persistent KV tier**: CPU DRAM + NVMe + RDMA-attached pool; eviction becomes demotion.
13. **Cross-node + heterogeneous**: a mix of H100 + H200 + L40S with topology-aware routing.
14. **Observability**: per-request OpenTelemetry + per-D Prometheus + a main Grafana dashboard.
### Milestone 3: Research increments that can genuinely make OSDI'27 (6–12 months)
15. **Learning-based admission / migration**: multi-armed bandit / RL control of τ_reject and K; train a session-aliveness predictor from traces.
16. **Cross-router residency consensus**: lightweight gossip to share `Σ.resident[d]`.
17. **Provable competitive ratio**: under an oracle KV-residency model, prove that the ratio of KVC expected TTFT to the offline optimum is bounded.
18. **Distributed prefix tree**: logical prefixes map to physical replicas on multiple Ds, supporting multi-tenant prefix sharing (system prompt / tool schema).
19. **Energy-aware variant**: add GPU SM utilization + PCIe/RDMA energy cost to the objective function.
20. **End-to-end agent serving framing**: move from request-level latency up to agent task completion time (one coding-agent task spans 30+ turns).
---
## 5. Work items that can advance without a GPU
Ordered by ROI:
- [x] This roadmap (`AUDIT_AND_ROADMAP_ZH.md`).
- [x] Collaborator entry point (`docs/INDEX_ZH.md`).
- [x] Concrete block-level eviction design (`docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md`).
- [x] D→P sync interface contract (`docs/D_TO_P_SYNC_CONTRACT_ZH.md`).
- [x] Evaluation protocol (`docs/EVALUATION_PROTOCOL_ZH.md`).
- [x] Pure-function score extraction from `KvAwarePolicy` + unit tests (Algorithm 1).
- [x] No-starvation property tests (Theorem 1).
- [x] Stratified analysis script (bucketing along three dimensions: turn-index / append-size / overlap).
- [x] Paired-comparison protocol helper.
- [ ] Reproducible mock harness for mooncake death (runnable without a GPU).
- [ ] Classification inventory of the SGLang patch surface (each patch labeled "must-have" / "experimental" / "retirable").
- [ ] Failure-mode taxonomy document (cold-D, overlap-pin, mooncake death, reseed storm, evict storm).
---
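## 6. One-sentence conclusion
> This project already has the material for a SOSP/OSDI workshop paper or poster; to reach the main track it needs to make §S1 (block-level evict) and §S2 (D→P sync) real, fill in §M3 (full Ali) and §M4 (two strong baselines), and write the control-plane fix for §S3 (mooncake cascading death) into a reproducible artifact. If only one thing can be done, do the block-level eviction refactor first: it addresses "reseed is too frequent" and is also the prerequisite for extending the P-side radix to multiple producers.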


@@ -0,0 +1,309 @@
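# Block-level Eviction Refactor: Design Document
**Date**: 2026-05-12
**Prerequisite**: [KVC_EVICTION_GRANULARITY_DESIGN_ZH.md](KVC_EVICTION_GRANULARITY_DESIGN_ZH.md) (the architecture-level manifesto)
**Nature**: implementation-level design + API draft + test plan, for the next collaborator to code against directly
**Status**: draft, not implemented. All code is quoted from `third_party/sglang/python/sglang/srt/mem_cache/session_aware_cache.py @ origin/h200-cu130`.
---
## 0. TL;DR
Change `SessionAwareCache`'s current semantics of **freeing a streaming session's entire KV in one shot** to the following:
1. Streaming-session decode output is **incrementally committed into the radix tree** when a turn finishes.
2. `SessionSlot` degenerates into **pure metadata** (holding only `last_node` + lock_ref state) and no longer owns a KV range exclusively.
3. `release_session` only calls dec_lock_ref and deletes the slot, **letting the standard SGLang radix LRU reclaim the KV at block granularity**.
Expected benefit: eviction granularity drops from ~67K tokens per eviction to ~24 tokens (page_size tokens), and reseed frequency drops by an order of magnitude; the change also makes the P-side radix tree able to accept externally supplied data (paving the way for [D_TO_P_SYNC_CONTRACT_ZH.md](D_TO_P_SYNC_CONTRACT_ZH.md)).
---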
## 1. Walkthrough of the current code
### 1.1 Key files and functions
`third_party/sglang/python/sglang/srt/mem_cache/session_aware_cache.py`
| Function / field | Current semantics |
|---|---|
| `SessionSlot.req_pool_idx` | The req_pool slot owned exclusively by the streaming session |
| `SessionSlot.kv_committed_len` | KV length committed when the previous turn finished (the part counted in cache_protected_len has entered the radix) |
| `SessionSlot.kv_allocated_len` | KV length currently allocated but **not in the radix** (the "session-exclusive tail") |
| `SessionSlot.cache_protected_len` | The protected boundary set when the first turn was committed to the radix |
| `match_prefix(streaming req)` | Slot hit → returns `req_to_token[req_pool_idx, :prefix_len]` (bypasses the radix) |
| `cache_unfinished_req(streaming req)` | Subsequent turns → **skips inner entirely** (nothing enters the radix) |
| `cache_finished_req(streaming req)` | Calls `slot.save_from_req`; **does not call inner.cache_finished_req** |
| `release_session(sid)` | `dec_lock_ref(slot.last_node)` + `free(req_to_token[req_pool_idx, cache_protected_len:kv_allocated_len])` + reclaims the req_pool slot |
### 1.2 Why the current behavior is wrong (restated)
`[cache_protected_len, kv_allocated_len)` holds all decode output accumulated after the first turn entered the radix, plus the extends of subsequent turns. Measured on Inferact / SWE-Bench:
- `cache_protected_len` ≈ the first-turn boilerplate, ~12K
- `kv_allocated_len` accumulates to 50–100K
- Each `release_session` frees 38–88K in one shot; this part **never entered the radix**, so it cannot benefit from gradual leaf-by-leaf eviction
→ After a session is evicted, it must re-prefill the full length from the client's original prompt and mooncake-transfer the full length, which is equivalent to naive PD-disagg (see manifesto §1 for details).
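The size of the one-shot release follows directly from the two field values. A small worked calculation, using the measured figures quoted above (the helper name `session_exclusive_tail` is hypothetical):

```python
def session_exclusive_tail(cache_protected_len, kv_allocated_len):
    """Tokens in [cache_protected_len, kv_allocated_len): freed at once by release_session today."""
    return max(0, kv_allocated_len - cache_protected_len)


# Figures from the measurements above: ~12K protected, 50-100K allocated.
low = session_exclusive_tail(12_000, 50_000)    # 38_000 tokens
high = session_exclusive_tail(12_000, 100_000)  # 88_000 tokens
```

Both ends of the 38–88K range are therefore just the accumulated tail that never reached the radix tree.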
---
## 2. Target behavior table
| Scenario | Current | Target |
|---|---|---|
| Session has accumulated 50K KV and D is full | `release_session` frees 38K in one shot | Radix LRU evicts starting from the oldest leaf, ~24 tokens at a time |
| An evicted session returns | Must reseed 50K | Re-prefill only the evicted leaf portion (typically ≤ 5K) |
| TTFT of an evicted session | 50–90K reseed ≈ 3–7s | 5K append-prefill ≈ 200ms |
| Sessions that are not evicted | Turns within the session are append-only | Still append-only (unchanged) |
| Direct-to-D fast path hit rate | 91.6% (SWE-Bench) / 38% (E3 Inferact) | Should be ≥ 85% even under saturation |
---
## 3. Design
### 3.1 Trimming the SessionSlot fields
**After refactor**:
```python
@dataclass
class SessionSlot:
    virtual_node: _VirtualNode = field(default_factory=_VirtualNode)
    # Pointer into the radix tree: the deepest node owned by this session's
    # committed prefix. Held under inc_lock_ref so radix LRU never evicts this
    # *active* leaf out from under a turn-in-progress. Released by
    # release_session.
    last_node: Any = None
    swa_uuid_for_lock: Optional[str] = None
    # Bookkeeping fields (no longer authoritative ownership of KV indices).
    last_access_time: float = field(default_factory=time.monotonic)
    # Mamba state stays slot-owned (mamba doesn't fit the radix model).
    mamba_pool_idx: Any = None
    mamba_ping_pong_track_buffer: Any = None
    mamba_next_track_idx: Any = None
    mamba_last_track_seqlen: Any = None
    mamba_branching_seqlen: Any = None
```
**Removed**: `req_pool_idx`, `kv_committed_len`, `kv_allocated_len`, `cache_protected_len`, `swa_evicted_seqlen`. The ground truth for these fields is now maintained jointly by the radix tree + req_to_token_pool.
### 3.2 Reworking `cache_finished_req`
**After refactor**:
```python
def cache_finished_req(self, req: Req, is_insert: bool = True, **kwargs):
    if not _is_streaming(req):
        return self.inner.cache_finished_req(req, is_insert=is_insert, **kwargs)
    session_id = req.session.session_id
    slot = self.slots.setdefault(session_id, SessionSlot())
    # KEY CHANGE: always delegate to inner; this inserts the new tokens
    # (kv_committed_len .. fill_ids end) as radix-tree blocks. Subsequent
    # match_prefix calls for this session will hit the radix tree directly.
    result = self.inner.cache_finished_req(req, is_insert=is_insert, **kwargs)
    # Update slot bookkeeping only (no KV ownership).
    slot.last_node = req.last_node
    slot.swa_uuid_for_lock = req.swa_uuid_for_lock
    slot.last_access_time = time.monotonic()
    # Mamba state still goes through slot.
    slot.mamba_pool_idx = req.mamba_pool_idx
    ...
    return result
```
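**Invariants**:
- `inner.cache_finished_req` inserts into the radix the KV in the range `[kv_committed_len_old, kv_committed_len_new)`, aligned to page_size. This semantics comes from the standard SGLang implementation, so inner needs no change.
- `slot.last_node` now points to the **tail node of the current session's committed prefix** and advances after every turn.
- `dec_lock_ref(old_last_node)` + `inc_lock_ref(new_last_node)` must be executed at each turn switch.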
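The lock handoff in the last invariant can be illustrated with a toy tree. This is a sketch under stated assumptions, not SGLang code: `ToyNode`, the free functions below, and `advance_last_node` are hypothetical, the slot is modeled as a plain dict, and the toy locks propagate along the whole path including the root. Taking the new lock before releasing the old one is a design choice of this sketch, so that nodes shared by both paths never drop to zero in between.

```python
class ToyNode:
    """Hypothetical radix node: only a parent pointer and a lock counter."""

    def __init__(self, parent=None):
        self.parent = parent
        self.lock_ref = 0


def inc_lock_ref(node):
    # Lock the node and every ancestor on the path to the root.
    while node is not None:
        node.lock_ref += 1
        node = node.parent


def dec_lock_ref(node):
    # Release one lock on the node and every ancestor.
    while node is not None:
        assert node.lock_ref > 0, "lock_ref underflow"
        node.lock_ref -= 1
        node = node.parent


def advance_last_node(slot, new_node):
    """Move the session's lock from the old tail node to the new one."""
    old_node = slot.get("last_node")
    inc_lock_ref(new_node)       # take the new lock first
    if old_node is not None:
        dec_lock_ref(old_node)   # then drop the old one
    slot["last_node"] = new_node
```

After any number of turns the session holds exactly one lock along its committed path, so a single `dec_lock_ref` on release leaves every node unlocked.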
### 3.3 Reworking `cache_unfinished_req`
Subsequent turns of a streaming session **no longer skip inner**. Reason: `match_prefix` now goes through the radix, and the intermediate chunked-prefill state also needs to be maintained by inner:
```python
def cache_unfinished_req(self, req: Req, **kwargs):
    if _is_streaming(req) and kwargs.get("chunked", False):
        # Chunked prefill: forward to inner so the per-chunk extend gets
        # tracked in the radix LRU access timestamps.
        ...
    self.inner.cache_unfinished_req(req, **kwargs)
```
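The detailed chunked handling must keep the logic that rebuilds `prefix_indices` (see lines 215–225 of the current implementation), but the call to `inner.cache_unfinished_req` must not be skipped.
### 3.4 Reworking `match_prefix`
It degenerates into a **pure forward to inner**, since SessionSlot no longer holds KV pointers: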
```python
def match_prefix(self, params: MatchPrefixParams) -> MatchResult:
    # No more slot-fast-path. Streaming sessions reuse KV via radix tree
    # match like every other request.
    return self.inner.match_prefix(params)
```
调用方需要的 "这个 session 的 committed prefix 长度" 信息改为通过 `inner.match_prefix(...).device_indices.shape[0]` 推导。
### 3.5 `release_session` 改造
**after refactor**
```python
def release_session(self, session_id: str) -> int:
    slot = self.slots.pop(session_id, None)
    if slot is None:
        return 0
    # Just release our radix lock; radix LRU can now reclaim our prefix
    # leaves at its own pace. NO direct token_to_kv_pool free.
    if slot.last_node is not None:
        if slot.swa_uuid_for_lock is not None:
            self.inner.dec_lock_ref(
                slot.last_node,
                DecLockRefParams(swa_uuid_for_lock=slot.swa_uuid_for_lock),
            )
        else:
            self.inner.dec_lock_ref(slot.last_node)
    # Mamba state still needs explicit cleanup if present.
    if slot.mamba_pool_idx is not None:
        ...
    return 0  # "freed_tokens" no longer meaningful; radix LRU sheds lazily
```
### 3.6 `get_session_status` / `list_session_statuses` 改造
The ground truth for `resident_tokens` now comes from the radix tree. inner needs to expose a helper:
```python
# In BasePrefixCache / RadixCache:
def tokens_under(self, node) -> int:
    """Count tokens in the path from root to `node` (inclusive)."""
    ...


# In SessionAwareCache:
def get_session_status(self, session_id: str) -> Optional[Dict[str, Any]]:
    slot = self.slots.get(session_id)
    if slot is None:
        return None
    resident_tokens = self.inner.tokens_under(slot.last_node) if slot.last_node else 0
    return {
        "session_id": session_id,
        "resident": resident_tokens > 0,
        "resident_tokens": int(resident_tokens),
        "last_access_time": float(slot.last_access_time),
    }
```
`admit_direct_append` 的容量检查改用 `resident_tokens` 的 radix 真值(去掉 `kv_committed_len / kv_allocated_len` 双值不一致的可能)。
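One possible shape for the `tokens_under` helper is a walk up the parent pointers. The sketch below assumes a hypothetical `Node` whose `key` holds the token ids on the edge leading into it; the real radix node layout may differ, so treat the field names as assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical radix node: `key` holds the token ids on the edge into it."""
    key: List[int]
    parent: Optional["Node"] = None


def tokens_under(node: Optional[Node]) -> int:
    """Count tokens on the path from the root to `node`, inclusive."""
    total = 0
    while node is not None:
        total += len(node.key)
        node = node.parent
    return total


root = Node(key=[])
system_prompt = Node(key=list(range(48)), parent=root)
turn_1 = Node(key=list(range(24)), parent=system_prompt)
```

Here `tokens_under(turn_1)` adds the 48-token shared prefix and the 24-token turn, which is the quantity `get_session_status` would report as `resident_tokens`.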
### 3.7 SGLang 调度路径配套改动
See `schedule_batch.py:1572-1646`. The current streaming-session correction (introduced by commits b8e6f13 / 986f351) is built on the assumption that SessionSlot owns an independent KV range. After the block-level refactor this correction path **no longer needs to exist at all**: a req's fill_ids / prefix_indices get consistent values directly from the inner radix `match_prefix`.
**Items to remove**:
- `schedule_batch.py:1572-1585`: the `actual_extend_len = max(0, len(fill_ids) - len(prefix_indices))` correction block.
- `schedule_batch.py:1646`: `assert seq_len - pre_len == req.extend_input_len` (after the refactor this invariant holds structurally).
- The latent landmine triggered by E3 ([E3_FINDINGS_ZH.md](E3_FINDINGS_ZH.md) §2) disappears with it.
---
## 4. 不变量(必须在 PR 自测中覆盖)
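| Inv | Statement |
|---|---|
| I1 | After `release_session(sid)`, the `match_prefix` behavior of the next request in the same session depends only on what is resident in the radix tree, not on the `slots` dict. |
| I2 | After any `cache_finished_req` call for (session_id, turn_id), the radix tree must contain a root→leaf path covering all committed tokens of that turn (`tokens_under(slot.last_node)` never decreases). |
| I3 | `restore_to_req` must be **idempotent**: under chunked-prefill retry it can be called several times on the same req and the final req state is equivalent. The current implementation achieves this by "not clearing slot fields"; after the refactor it is guaranteed by the pure-function nature of radix `match_prefix`. |
| I4 | Requests without a streaming session (`req.session is None`) behave **unchanged**: every path short-circuits to inner. |
| I5 | After any turn, every `inc_lock_ref` on `slot.last_node` must have a matching `dec_lock_ref`, and `release_session` is the final release point. |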
---
## 5. 测试计划(无 GPU 可跑)
### 5.1 Unit tests (mock inner cache)
Write a `MockRadixCache(BasePrefixCache)` that records the full call sequence of `cache_finished_req / cache_unfinished_req / match_prefix / evict / dec_lock_ref`. Then:
| Test | Assertion |
|---|---|
| `test_release_session_no_direct_free` | after calling `release_session`, the mock sees **no** direct `free(kv_indices)` call, only `dec_lock_ref` |
| `test_subsequent_turn_inserts_radix` | simulate three `cache_finished_req` calls for turns 0 → 1 → 2 and assert that each one triggers `inner.cache_finished_req` |
| `test_match_prefix_uses_inner` | streaming and non-streaming requests both go only through `inner.match_prefix` |
| `test_restore_idempotent` | simulate a chunked-prefill retry; two consecutive `match_prefix` calls return identical `device_indices` |
| `test_eviction_under_pressure_is_block_level` | inject a "pool full, must evict 24 tokens" state and assert that `release_session` is not triggered and the inner LRU takes a single step |
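The first row of the table could look roughly like this. The sketch is GPU-free and self-contained; `MockInner` and `MiniSessionCache` are simplified, hypothetical stand-ins for `MockRadixCache` and the refactored `SessionAwareCache`, reduced to the `release_session` path only.

```python
from typing import Dict, List, Tuple


class MockInner:
    """Records every call so a test can assert on the call sequence."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, object]] = []

    def dec_lock_ref(self, node: object) -> None:
        self.calls.append(("dec_lock_ref", node))

    def free(self, kv_indices: object) -> None:
        self.calls.append(("free", kv_indices))


class MiniSessionCache:
    """Stand-in for the refactored SessionAwareCache.release_session path."""

    def __init__(self, inner: MockInner) -> None:
        self.inner = inner
        self.slots: Dict[str, object] = {}  # session_id -> last_node

    def release_session(self, session_id: str) -> int:
        last_node = self.slots.pop(session_id, None)
        if last_node is not None:
            self.inner.dec_lock_ref(last_node)  # unlock only, never free KV
        return 0


def test_release_session_no_direct_free() -> None:
    inner = MockInner()
    cache = MiniSessionCache(inner)
    cache.slots["s1"] = "leaf-node"
    assert cache.release_session("s1") == 0
    names = [name for name, _ in inner.calls]
    assert names == ["dec_lock_ref"]
    assert "free" not in names
    assert "s1" not in cache.slots


test_release_session_no_direct_free()
```

The other rows follow the same pattern: drive the wrapper, then assert on the recorded call names rather than on GPU state.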
### 5.2 Property-based 测试
```python
from hypothesis import given
from hypothesis.strategies import integers, lists


@given(turns=lists(integers(min_value=24, max_value=2048), min_size=1, max_size=50))
def test_committed_tokens_monotone(turns):
    """tokens_under(slot.last_node) is monotonically non-decreasing across turns."""
    ...
```
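### 5.3 Integration smoke (needs a GPU, but lives in the sweep script)
Run `sweep_e2_kvc_v2_rdma.sh` with the same trace and the same configuration, and compare:
- total evict count (expected 90 → < 10)
- average tokens per evict (expected 67K → < 500)
- TTFT p99 (expected 1.28s → < 0.7s)
- direct-to-D hit rate (expected 85%)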
---
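## 6. Effort and risk
### 6.1 Effort
| Work | Estimate | Risk |
|---|---|---|
| §3.1–§3.6 SessionAwareCache rework | 2–3 days | requires familiarity with the radix-internal lock_ref / evict protocol |
| §3.7 schedule_batch cleanup | 0.5 day | code deletion only |
| §4 invariant unit tests | 2 days | |
| §5.3 GPU smoke + data comparison | 2 days | mooncake may still trigger the E2 cascading death; the §S3 fix has to run alongside |
| **Total** | **~1 week** | |
### 6.2 Key risks
1. **Compatibility of `inner.cache_finished_req` with streaming-session reqs**: the standard SGLang radix currently assumes that a req reaching cache_finished_req has "fully completed prefill+decode". A streaming-session req still leaves an "unfinished conversation" at the end of every turn, so inner must not treat decode-only tokens as a discardable tail on insert. This needs an audit of the `radix_cache.py:cache_finished_req` implementation.
2. **lock_ref ordering**: the `match_prefix` at the start of turn N+1 performs inc_lock_ref(new_node), and the end of turn N performs dec_lock_ref(old_node). If the order is reversed, under concurrency the LRU can evict the leaf that was just committed. Suggested assertion: `inc_lock_ref` must arrive before `dec_lock_ref`.
3. **chunked-prefill retry**: see I3. SGLang's current `restore_to_req` leaves slot fields uncleared precisely for this retry. After the refactor we must confirm that inner radix `match_prefix` is also idempotent under retry (a standard radix tree is, but a test should pin this property down explicitly).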
---
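## 7. Relationship to the D→P sync work
Block-level evict is a **prerequisite** for [D_TO_P_SYNC_CONTRACT_ZH.md](D_TO_P_SYNC_CONTRACT_ZH.md):
- D→P sync needs the P-side radix tree to **accept KV blocks fed in from outside**.
- The P-side radix currently assumes a single producer (this worker's model output).
- Once the block-level refactor is done, streaming-session KV already takes the standard radix path. Letting the radix tree accept an extra "externally fed" producer is then only an extension of the insert API, not a new storage path.
The two pieces can be done in sequence: block-level evict first, then D→P sync.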
---
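## 8. Minimal actions for the next agent
1. Fork a `feat/block-level-evict` branch (from `improve/audit-and-foundations` / `h200-cu130`).
2. Implement §3.1–§3.6.
3. Write the §5.1 + §5.2 unit tests.
4. Run the §5.3 smoke on 8×H100 / H200 and compare evict frequency and TTFT p99.
5. If risk 1 in §6.2 holds, check SGLang `radix_cache.py` to see whether streaming-session reqs need an `is_session_active=True` flag that prevents "dropping the decode tail".
---
**Key sentence**: keep the session as the lifecycle boundary, but **do not** let it act as the eviction boundary (hand that over to the radix LRU). This refactor removes two blockers at once: "reseed too frequent" and "P-side radix cannot be fed externally".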


@@ -0,0 +1,247 @@
# D→P 增量 KV 同步 — 接口契约与 rollout 计划
**日期**2026-05-12
**前置**[RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md](RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md)(缺口定位)+ [BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md)(前置条件)
**性质**:跨层接口契约 + staleness budget 形式化 + 分阶段 rollout
**Status**:草案。`feat/d-to-p-sync` 分支当前为空,本文是该分支应当首先 land 的设计文档
---
## 0. TL;DR
reseed 慢路径的 50% 时间在 P 重 prefill**修复 transfer 段(启 RDMA只能解一半**。彻底消除长尾的唯一办法是让 P 端 backup 增量跟上 D 端的 append
> D 在 direct-to-D 路径上完成一个 turn → 异步把新 commit 的 KV block 推回 P 端 radix → 下次 reseed 时 P 端 radix 命中完整 prefix无需 re-prefill仅一次 P→D transfer。
本文给出三层mooncake / SGLang / agentic-pd-hybrid的接口契约、一个 **staleness budget β** 的形式化定义,以及四阶段 rollout 计划,让该工作可以与 block-level eviction 解耦推进。
---
## 1. Staleness Budget β —— 形式化定义
Let `L_D(s, t)` be the committed prefix length of session `s` on D (the instantaneous value at time `t`), and `L_P(s, t)` the backup prefix length of the same session on P:
```
staleness(s, t) := L_D(s, t) - L_P(s, t) ≥ 0
```
The **staleness budget β** is the upper bound the system commits to maintaining:
```
∀ s, ∀ t : staleness(s, t) ≤ β
```
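Intuition: the smaller β is, the more likely a reseed hits the P-side backup, and the reseed degenerates into a single P→D transfer plus a re-prefill of ≤ β tokens.
- **β = 0**: fully synchronous; D blocks on a P ack for every committed block. High latency cost, not recommended.
- **β = ∞**: the current state (the P-side backup is forever a static seed-time snapshot).
- **β = one page (24 tokens)**: single-block sync. The theoretically optimal granularity, but every append on D triggers a D→P RPC.
- **β = O(append_len) (typically 1K–4K)**: batched sync. The recommended starting point: aggregate the decode output of one turn and push it as a single batch.
- **β = O(turn_size) (typically ~50K)**: coarse-grained sync. It defeats the reseed bypass and only reduces transfer. Not acceptable.
→ For rollout the recommended value is β = `max(page_size, min(committed_in_turn, β_max))`, with `β_max` defaulting to 4096.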
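The recommended β is a simple clamp, sketched here as a small function. The function name and signature are hypothetical; the defaults (`page_size=24`, `beta_max=4096`) are the values quoted in this section.

```python
def effective_staleness_budget(
    committed_in_turn: int, page_size: int = 24, beta_max: int = 4096
) -> int:
    """β = max(page_size, min(committed_in_turn, β_max))."""
    return max(page_size, min(committed_in_turn, beta_max))


tiny_turn = effective_staleness_budget(10)         # below one page
typical_turn = effective_staleness_budget(2_000)   # inside [page_size, β_max]
huge_turn = effective_staleness_budget(50_000)     # capped at β_max
```

A turn shorter than one page is rounded up to a page, a typical 1K–4K append is synced as one batch, and a very large turn is capped at `β_max`.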
---
## 2. 三层接口契约
### 2.1 Mooncake 层:双角色化
**当前状态**(详见 [RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md](RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md) §3
- `MooncakeKVManager` 在初始化时按 `disaggregation_mode ∈ {PREFILL, DECODE}` 强角色化。
- `MooncakeKVSender` 仅在 PREFILL 模式实例化,`MooncakeKVReceiver` 仅在 DECODE 模式实例化。
- `add_transfer_request` 含硬约束 `assert disaggregation_mode == PREFILL`
**目标接口**
```python
# third_party/sglang/python/sglang/srt/disaggregation/base/conn.py
from enum import Enum


class KVRole(Enum):
    PREFILL = "prefill"
    DECODE = "decode"
    PREFILL_BACKUP_RECEIVER = "prefill_backup_receiver"  # new: P side receives D→P sync
    DECODE_BACKUP_SENDER = "decode_backup_sender"  # new: D side sends D→P sync


class BaseKVManager:
    roles: set[KVRole]  # replaces the former single-valued field; allows {PREFILL, DECODE}
```
**新增类**(实现层 ~400 LOC
| 类 | 角色 | 关键方法 |
|---|---|---|
| `DecodeKVSender` | D 端把 append 后的新 KV block 推回 P | `enqueue_sync(session_id, kv_blocks, target_p)` 异步入队,返回 `sync_id` |
| `PrefillKVReceiver` | P 端接收 D→P sync 包 | `recv_loop()` 后台线程;每个包触发 callback 注入 radix tree |
**Bootstrap channel**:需要独立于现有 P→D 通道的第二个 bootstrap socket避免 buffer pointer 协商冲突)。配置:
- 默认 disable由 ServerArgs flag `--enable-d2p-sync` 开启
- 新增 port range `BOOTSTRAP_D2P_PORT_BASE = 22000`
### 2.2 SGLang 层Radix 多生产者扩展
**当前状态**P 端 radix 假设单生产者(本 worker 模型输出)。`RadixCache.cache_finished_req` 内部直接从 `req_to_token_pool[req_pool_idx, :]` 取 KV indices 插入树。
**目标接口**(在 [BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) 完成之后):
```python
class RadixCache(BasePrefixCache):
    def insert_external(
        self,
        token_ids: Sequence[int],
        kv_tensor: torch.Tensor,
        *,
        source_worker_id: str,
        session_id: str,
        pin: bool = False,
    ) -> InsertExternalResult:
        """
        Insert KV blocks supplied by an external worker (D→P sync).

        Allocates fresh slots in token_to_kv_pool, copies kv_tensor into them,
        and threads the resulting indices through the radix tree exactly like
        cache_finished_req would for a local prefill.

        Invariants:
          - Same model layout (verified at handshake time, not per-call).
          - On collision with existing radix path, no-op for the shared prefix
            and only insert the diverging suffix.
          - Inserted nodes get lock_ref += 1 if `pin=True`, default False.

        D→P sync is best-effort; LRU is allowed to evict the inserted leaves.
        """
```
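**Key design points**:
| Decision | Options | Recommendation |
|---|---|---|
| KV index remapping | A) D sends the original indices and P remaps; B) D sends a tightly packed tensor and P reallocates | **B**: avoids leaking indices across workers |
| Failure handling | A) on D→P failure, fall back to re-prefill; B) retry N times | **A**, plus: if P misses on a later reseed, take the old path |
| Reference counting | whether KV synced into P is pinned | **Not pinned**: the P-side LRU manages it naturally, so backup does not push out production KV |
| Coordination with evict | what if P is full when a sync arrives? | let the sync insert trigger inner.evict, competing with local production KV under fair LRU |
| Multiple P instances for one session | what if the router round-robins turns across different P's? | **Accept multi-source**: each P keeps its own backup, and reseed picks the one with the smallest staleness |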
### 2.3 agentic-pd-hybrid 层Hooks 与状态机
**新增 CLI flag**
```bash
--enable-d2p-sync                      # off by default
--d2p-staleness-budget-tokens 4096     # β_max
--d2p-sync-batch-min-tokens 24         # trigger only once at least 1 page has accumulated
--d2p-sync-target-policy {last_p, round_robin, broadcast}
    # last_p: push back to the P that last seeded this session
    # broadcast: push to every P (flexible at reseed time, but bandwidth-heavy)
```
**New state fields** (`DirectSessionState` in `replay.py`):
```python
from dataclasses import dataclass, field


@dataclass
class DirectSessionState:
    ...
    # NEW: per-P backup view, populated by D->P sync callbacks.
    prefill_resident_tokens_by_p: dict[str, int] = field(default_factory=dict)
    last_d2p_sync_at: float | None = None
```
**Hook after `_invoke_session_direct` completes**:
```python
async def _invoke_session_direct(...):
    ...
    response = await self._stream_direct_to_d(...)
    if response.ok and self.config.enable_d2p_sync:
        new_committed = response.kv_committed_len
        prev_p_resident = max(session.prefill_resident_tokens_by_p.values(), default=0)
        staleness = new_committed - prev_p_resident
        if staleness >= self.config.d2p_sync_batch_min_tokens:
            target_p = self._choose_d2p_target(session)
            asyncio.create_task(
                self._issue_d2p_sync(session, target_p, prev_p_resident, new_committed)
            )
```
**Hook on the reseed path** (`_invoke_kvcache_seeded_router`):
```python
async def _invoke_kvcache_seeded_router(..., request):
    ...
    if self.config.enable_d2p_sync:
        # Probe P-side residency before issuing full re-prefill.
        probe = await self._probe_prefill_residency(session_id)
        if probe.resident_tokens >= request.prefix_len - β_max:
            # Use the up-to-date backup: skip re-prefill, just trigger P→D transfer.
            return await self._invoke_p_to_d_transfer_only(...)
    # Fall back to existing path.
    return await self._invoke_kvcache_seeded_router_legacy(...)
```
---
## 3. 性质(待证明)
### 3.1 Theorem 4 candidate (paper form)
*Assume the staleness budget β holds. For a session `s` with accumulated length L on D, when a reseed is triggered after eviction:*
```
reseed_cost(s) ≤ T_p2d(L) + T_prefill(min(β, L))
```
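*where T_p2d is the P→D transfer time (~L · 4 ns/token under RDMA) and T_prefill is the prefill time (~50K tokens/s on H100 TP1 Qwen3-30B). When β ≪ L the cost degenerates to being dominated by the single P→D transfer.*
**Compared with the baseline** (no D→P sync): `reseed_cost = T_p2d(L) + T_prefill(L − seed_size)`, where re-prefill dominates.
### 3.2 Relationship to Theorem 2
Theorem 2 only guarantees the fast hit on the direct-to-D path. Theorem 4 also pushes the "fallback cost on a fast-path miss" down to the sub-second range, which makes it possible for KVC to beat DP at **every quantile**.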
---
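## 4. Four-phase rollout
| Phase | Scope | GPU requirement | Acceptance metric |
|---|---|---|---|
| **P1** | block-level eviction refactor ([BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md)) | 4×H100 smoke | average single evict ≤ 500 tokens |
| **P2** | mooncake dual-role + microbench (D→P single-packet RTT, bandwidth utilization) | single node + RDMA | P→D RTT < 50ms (local); 16K-token block bandwidth at 50% of the theoretical limit |
| **P3** | SGLang `insert_external` + agentic-pd-hybrid hook, best-effort, before the reseed probe is connected in P4 | 4×H100 + RDMA | sync trigger rate > 80% completing within the same turn; no new failure mode introduced |
| **P4** | reseed probe connected + end-to-end evaluation | 4×H100 + RDMA | single reseed < 0.5s (vs 3–7s today); TTFT p99 < 0.5s |
**Key decision point**: an audit is required between P1 and P2 to confirm that SGLang radix `insert_external` does not conflict with the streaming-session decode path. If a serious conflict is found, introduce a "P-only sync mode" as a placeholder and open it up once the architecture has stabilized.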
---
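## 5. Risks and countermeasures
| Risk | Impact | Countermeasure |
|---|---|---|
| Mooncake dual-role breaks the existing one-way P→D path | E2 already exposed the mooncake "instance not alive" cascade; adding another channel may amplify it | in P2, start with an independent bootstrap channel + feature flag, and keep the disable path |
| D→P sync consumes D egress bandwidth and hurts direct-to-D append-prefill latency | directly degrades the main path | run sync on a low-priority QP (RDMA SL=0); trigger in batches; at most 1 per turn |
| The P-side radix gets filled by backup and pushes out local production KV | P prefill slows down | sync inserts are not pinned (§2.2); fair LRU competition |
| Coordinating the P-side backup view is complex | the router must consider staleness when choosing target_p | start with the `last_p` policy (recency-biased), observe the measured distribution, then decide whether to move to `broadcast` |
| `insert_external` drifts from the upstream API when the SGLang patch is upgraded | maintenance burden | confine the API to our vendor patch boundary (do not pollute upstream radix) and write a contract test |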
---
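## 6. Decoupling from block-level eviction
| Work | Depends on the other? |
|---|---|
| block-level eviction | does not depend on D→P sync; can ship independently and reduces reseed frequency on its own |
| D→P sync | **depends on** block-level eviction (the P-side radix must be the source of truth for streaming-session KV) |
| both together | largest gain: reseed frequency drops by an order of magnitude + single-reseed time drops by an order of magnitude |
Rollout order: block-level eviction lands first; D→P sync follows on `feat/d-to-p-sync`. The two **should not** be merged in one PR.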
---
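## 7. Minimal actions for the next agent
1. Land this document on the `feat/d-to-p-sync` branch.
2. After block-level eviction reaches main, start phase P2 (mooncake dual-role + microbench, tested in isolation from the SGLang main path).
3. In P3, add `insert_external` + the hook, merged to main disabled by default.
4. After the P4 end-to-end evaluation, decide on the reseed probe policy (`last_p` vs `broadcast`).
---
**Key sentence**: D→P incremental sync is not as simple as "adding one more network channel"; the crux is extending the P-side radix from single-producer to allowing best-effort external feeding. Block-level eviction is the prerequisite for this, so the two pieces of work can go one after the other but cannot be reversed.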


@@ -0,0 +1,185 @@
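# Evaluation Protocol (Paper-quality)
**Date**: 2026-05-12
**Nature**: evaluation protocol specification covering every weakness M1–M6 in [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) §3.1
**Audience**: collaborators running experiments; paper authors; artifact reviewers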
---
## 0. 总原则
> 论文里每一个数字都必须能回答两个问题:
> 1. **抽样误差有多大?**bootstrap CI、N、std
> 2. **公平吗?**(同 trial、同 trace、同 token cap、同 timeout、paired
当前 sweep 报告(`KVCACHE_CENTRIC_PROGRESS_ZH.md` / `V2_RESULTS_ZH.md`)都不满足上述任一条。本文给出合规模板。
---
## 1. Evaluation dimensions (one-to-one with M1–M6)
### 1.1 M1 — 统计显著性
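| Decision | Rule |
|---|---|
| `N`, minimum number of runs per config | **3** (headline numbers) / **5** (final ablation values) |
| Reported statistics | `mean ± std`, **plus a 2.5/97.5 bootstrap CI** |
| Aggregating multiple runs | append the per-request latencies of every run and bootstrap over the pooled set; do not take per-run means first and then average the means |
| Significance of a difference | paired bootstrap p-value (≥ 5000 samples) |
| `N=1` allowed only for | smoke / sanity checks; **never in a headline table** |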
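The "pool first, then bootstrap" rule can be sketched with the standard library alone. This is a minimal illustration, not the planned `scripts/analysis` tooling: the function name, the nearest-rank percentile choice, and the fixed seed are assumptions made for the example.

```python
import random
from statistics import fmean
from typing import List, Sequence, Tuple


def pooled_bootstrap_ci(
    runs: Sequence[Sequence[float]],
    n_resamples: int = 5000,
    seed: int = 0,
) -> Tuple[float, float]:
    """2.5/97.5 percentile bootstrap CI of the mean over pooled latencies."""
    # Pool the per-request latencies of every run before resampling.
    pooled: List[float] = [x for run in runs for x in run]
    rng = random.Random(seed)
    means = sorted(
        fmean(rng.choices(pooled, k=len(pooled))) for _ in range(n_resamples)
    )
    lo = means[int(0.025 * (n_resamples - 1))]
    hi = means[int(0.975 * (n_resamples - 1))]
    return lo, hi


# Three runs of one config, per-request latency in seconds (made-up numbers).
runs = [[0.42, 0.51, 0.47], [0.39, 0.88, 0.45], [0.44, 0.50, 0.61]]
ci_low, ci_high = pooled_bootstrap_ci(runs, n_resamples=1000)
```

Averaging per-run means instead would weight a short run as heavily as a long one and hide the within-run tail, which is why the table asks for pooling.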
### 1.2 M2 — 公平 paired 比较
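| Decision | Rule |
|---|---|
| trace fixity | use the same `samples-*.jsonl` file; replay locks it with `--use-trace-as-sample` |
| timeout | every mechanism uses the same `--request-timeout-s`; one group at 600s and another at 300s is not allowed |
| token cap | same `--max-input-len` (take the minimum across all baselines and truncate explicitly) |
| errors / aborts | do **not** count successful requests only; abort and timeout each get their own `error_count` column, and metrics are reported over the full set (errors included) or paired-on-same-trial-mask |
| time window | `time_scale` stays consistent; it must not change within one sweep |
| worker count / GPU type | identical; topology differences must be annotated |
**Counter-example**: the current `E1 vs E2` table ([E1_E2_RESULTS_ZH.md](E1_E2_RESULTS_ZH.md) §4) explicitly declares itself "not a fair head-to-head": E2 failed 80% of the time, and its successful-only latency is compared against E1's full set. **A table like this cannot go into the paper as is.**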
### 1.3 M3 — Trace 分层
| 维度 | 分桶建议 |
|---|---|
| `turn_id` | `{1, 2-5, 6-20, 21+}` |
| `append_len` | `{≤128, 128-1K, 1K-8K, >8K}` |
| `overlap_ratio` | `{≤0.3, 0.3-0.7, >0.7}` |
| `inter_turn_gap_s` | `{≤5, 5-30, 30-300, >300}` |
| `input_len` | `{≤8K, 8K-64K, >64K}` |
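**Reporting requirement**: beyond the headline numbers, provide at least one "turn_id × append_len" heatmap so that reviewers can see which slice the gain comes from.
**Counter-example**: the current Real Ali experiment reports -46% p50 only on the KVC-fit slice (high overlap + small append + 100% direct-eligible). That is an upper bound, not an average. A paired table on the full Ali trace must be reported alongside it.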
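Assigning a request to its bucket is a sorted-edge lookup. The sketch below covers two of the dimensions from the table, assuming upper-inclusive edges (so `overlap_ratio = 0.3` falls into `≤0.3`); the helper name and that boundary convention are assumptions, since the table does not say which side a boundary value belongs to.

```python
from bisect import bisect_left
from typing import List, Sequence

# Upper-inclusive edges; the last label catches everything above the last edge.
TURN_EDGES = [1, 5, 20]
TURN_LABELS = ["1", "2-5", "6-20", "21+"]
OVERLAP_EDGES = [0.3, 0.7]
OVERLAP_LABELS = ["≤0.3", "0.3-0.7", ">0.7"]


def bucket(value: float, edges: Sequence[float], labels: List[str]) -> str:
    """Return the label of the first bucket whose upper edge is >= value."""
    return labels[bisect_left(edges, value)]


turn_bucket = bucket(7, TURN_EDGES, TURN_LABELS)
overlap_bucket = bucket(0.85, OVERLAP_EDGES, OVERLAP_LABELS)
```

A heatmap cell is then just the pair of labels for one request, for example `(turn_bucket, append_bucket)`.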
### 1.4 M4 — Baseline 矩阵
至少以下 baseline 中跑 **2 个**
| Baseline | 类别 | 库 |
|---|---|---|
| vLLM + automatic prefix caching | 同 model 单 worker prefix cache | vLLM main |
| SGLang DP cache-aware4×TP1 | 当前主要 baseline | 本仓 vendored SGLang |
| SGLang PD-disaggregationkv-aware | naive 但 cache-aware 拓扑 | 本仓 |
| DistServe | P/D 分离 baseline | DistServe upstream |
| SplitWise | P/D split + adaptive routing | open-source impl |
| Mooncake-Master scheduler | 同代设计 | mooncake-master |
**额外推荐**:跑一个 "oracle" baseline——assume `Σ.resident[d]` 完美已知 + admission 永不失败,作为 KVC 的上限对照。
### 1.5 M5 — Trace 组合
| Trace | 用途 |
|---|---|
| Ali coding agent (full) | 主结果;含 single-turn dilution |
| Ali KVC-fit slice | KVC 上限演示 |
| SWE-Bench 50 sess | 已有;多轮高 overlap workload |
| ShareGPT | 对比 chat workload短 turn低 overlap。**用来证明 KVC 不会在不合适 workload 上劣化** |
| Inferact | tool-use heavy 的 agent workload |
| Mooncake trace | 单 turn LLM serving 的 baseline trace |
| Synthetic adversarial | 自构burst 100 个新 session 同时 seed验证 mooncake death 与 reset-on-success 的 robustness |
**最低组合**Ali full + SWE-Bench + ShareGPT + Synthetic adversarial。
### 1.6 M6 — 硬件覆盖
| Tier | 用途 |
|---|---|
| 单节点 ≤ 8 GPU | 当前所有结果 |
| 双节点 NVLink + IB | 验证跨节点 D→P sync 与 mooncake 行为 |
| 4 节点 cluster≥ 16 GPU | scaling 数字、cluster scheduler 假设 |
| 异构H100 + L40S | topology-aware routing |
**最低组合**:单节点 4×H200 + 双节点 NVLink + IB。剩下两个 tier 可放 future work。
---
## 2. 报告模板
### 2.1 主结果表Table 1
```
| Config | N | mean ± std | p50 [CI] | p90 [CI] | p99 [CI] | err% | timeout% |
|--------|---|------------|----------|----------|----------|------|----------|
```
加注trace name、time_scale、`max_input_len``request_timeout_s`、所有共用参数。
### 2.2 Paired delta 表
```
| Pair | N pairs | mean delta [CI] | p50 delta [CI] | wins / losses | p-value |
```
`N pairs` = the number of trials that were successful on both sides. `wins` = the number of trials with `latency_kvc < latency_baseline`.
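The `N pairs` / `wins` / `losses` columns can be computed as below. This is a sketch under stated assumptions: latencies are keyed by a trial id, `None` marks an abort or timeout, and `paired_delta` is a hypothetical name rather than the interface of the planned `paired_compare.py`. The CI and p-value columns would come from bootstrapping the per-pair deltas.

```python
from statistics import fmean
from typing import Dict, Optional


def paired_delta(
    kvc: Dict[str, Optional[float]],
    baseline: Dict[str, Optional[float]],
) -> dict:
    """Summarize KVC vs baseline on trials that succeeded on both sides."""
    pairs = [
        (kvc[t], baseline[t])
        for t in sorted(kvc.keys() & baseline.keys())
        if kvc[t] is not None and baseline[t] is not None
    ]
    deltas = [k - b for k, b in pairs]  # negative delta means KVC was faster
    return {
        "n_pairs": len(pairs),
        "mean_delta": fmean(deltas) if deltas else float("nan"),
        "wins": sum(d < 0 for d in deltas),
        "losses": sum(d > 0 for d in deltas),
    }


summary = paired_delta(
    kvc={"a": 1.0, "b": 3.0, "c": None, "d": 2.0},
    baseline={"a": 2.0, "b": 2.0, "c": 1.0, "d": 2.0},
)
```

Trial `c` is dropped from the pairs because KVC aborted on it; it still has to appear in the `err%` column of the main table.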
### 2.3 分层表Table 2
每个分层维度§1.3)独立一张。
### 2.4 Negative-result 章节(强制)
paper 必须有专章列出:
- KVC 在 ShareGPT 上比 baseline 慢的具体数字。
- KVC 在 trace 哪些 percentile / slice 不胜。
- 失败的 sweepmooncake death、E3 crash的诊断链路。
→ 论文 reviewer 看见诚实的 negative result 会显著提高印象分。当前的 [V2_DEEP_ANALYSIS_ZH.md](V2_DEEP_ANALYSIS_ZH.md) §4 雏形可以扩成这一章。
---
## 3. 工具支持(本仓需要的脚本)
| 脚本 | 状态 | 说明 |
|---|---|---|
| `scripts/analysis/recompute_summary.py` | ✅ 已有 | 修复 abort 污染的 latency本协议主要数据入口 |
| `scripts/analysis/stratified.py` | ⏳ 本分支新增 | 按 §1.3 维度切桶 + 输出表 |
| `scripts/analysis/paired_compare.py` | ⏳ 本分支新增 | paired bootstrap输出 §2.2 表 |
| `scripts/analysis/plot_*` | ✅ 已有 | TTFT PDF、GPU 利用率、cache efficiency |
→ 本分支的 stratified + paired 脚本 land 后,跑实验的合作者可以一条命令出表。
---
## 4. Artifact 要求SOSP/OSDI AE
| 项目 | 标准 |
|---|---|
| Dockerfile | 单一 `Dockerfile.artifact`4×A100/H100 即可启 |
| 一键脚本 | `bash artifact/reproduce_main_table.sh`1 小时内出 Table 1 |
| 数据集 | 提供 `outputs/sample-*.jsonl` 子集(可 ~5GB 内full Ali 走 instruction |
| 复现度 | bootstrap CI 与原文重叠即算复现,不要求 bit-exact |
| 文档 | `artifact/README.md`,列出每张表 / 图对应的命令 |
→ 本路线图 §M1 修复后再准备 artifact。
---
## 5. 自检清单(提 paper draft 前用)
- [ ] 每张表 N ≥ 3含 mean±std 与 95% CI。
- [ ] 没有 "successful only" 字样;所有错误已列入 `err%`
- [ ] 所有 baseline 用同 `max_input_len` / 同 `request_timeout_s` / 同 `time_scale`
- [ ] 至少 3 个 trace + 1 个 synthetic adversarial。
- [ ] 至少 1 个 non-SGLang baseline。
- [ ] 有 negative-result 章节。
- [ ] 有 KVC 在 single-turn workload 上的 dilution 数据。
- [ ] 形式化部分Algorithm 1/2/3 + Theorem 1/2以及 D→P sync 完成后的 Theorem 4。
- [ ] 失败模式 forensicmooncake death、E3 crash、cold-D 都进 §Limitations 或 §Discussion。
---
## 6. 路线图衔接
- [ ] Phase A — 实现本分支 `scripts/analysis/stratified.py` + `scripts/analysis/paired_compare.py`(无 GPU 可做)。
- [ ] Phase B — 把现有 `kvc-real-ali-iter-v1` 的 600-req/15min 数据用新工具重出一份分层表 / paired 表,存入 `outputs/`GPU 不需重跑)。
- [ ] Phase C — 跑 ShareGPT + Synthetic adversarial baselineGPU 需 ~12h
- [ ] Phase D — 选 1 个非 SGLang baseline推荐 vLLM + prefix caching补齐 M4GPU 需 ~24h
---
**核心句**:当前结果"看起来已经赢",但按本协议重报后,赢的 magnitude 会缩小、赢的 slice 会窄化、负面 slice 会暴露。这是论文必须经历的过程;越早做越省事。

222
docs/FAILURE_MODES_ZH.md Normal file

@@ -0,0 +1,222 @@
# Failure-mode Taxonomy
**日期**2026-05-13
**性质**:集中清单 + 诊断手册
**对象**:跑实验时遇到失败要立刻 lookup 的合作者;写 paper §Limitations 时需引用的人reviewer 想问"你为什么觉得这次会更稳"时的答案
本文把当前系统已识别的失败模式按"症状 → 根因 → 触发条件 → 当前缓解 → 真正的修复"梳成一张表。所有条目都有 forensic 链接到原始实验 doc。
---
## 0. TL;DR
5 类已识别失败模式,按"是否阻碍 paper claim"分组:
| 类别 | 名称 | 阻碍 paper | 真正修复 |
|---|---|:---:|---|
| **A. 控制层级联** | Mooncake "instance not alive" cascade | ✅ | admission backoff + per-D pending-seed budget |
| **B. 路由偏置** | Cold-D / overlap-pinning | ✅ | first-principles overlap term redefinition |
| **C. KV 抖动** | Evict stormsession-level evict | ✅ | [BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) |
| **C'. KV 抖动** | Reseed stormturn 1 大 seed 并发) | ✅ | per-D pending-seed budget + (C 缓解后频率自降) |
| **D. Vendor 不变量** | streaming-session correction invariant crash (E3) | ❌hotfix 已 land | 删除 correction 路径block-level evict 完成后) |
A / B / C 三类是 Milestone 1 必须解决的C' 是 A 的次因D 已临时止血但根本修复绑在 C 上。
---
## 1. A — Mooncake "instance not alive" cascade
### 1.1 症状
- 客户端看:`RuntimeError: generate stream ended before producing any token`
- D scheduler 日志:`[mooncake] Decode instance could be dead, dropping ...`
- 整批请求被 abort单一 sweep 在数分钟内从健康降到 80% failure[E1_E2_RESULTS_ZH.md](E1_E2_RESULTS_ZH.md) E21054 / 1285 失败)
### 1.2 根因forensic 链路)
```
admission no-space (D KV pool 满)
→ router 立刻 fallback 走 seed/reseed 路径
→ 多个并发 seed 同时打 mooncake P→D
→ P→D 出口排队handshake 阶段超时
→ mooncake 把对端标记 dead
→ SGLang 把 dead 链路上的 in-flight req 全部 abort
→ 客户端看到批量 generate-stream 中断
```
### 1.3 触发条件
- D KV pool 接近满(≥ ρ·K_d默认 0.95
- router fallback chain 把多个 reseed 在毫秒级窗口内发起
- mooncake heartbeat 超时(默认窗口短)
### 1.4 当前缓解
- `--kvcache-seed-min-turn-id=2` 跳过 turn 1 大 seed减少首爆main 分支 stable 配置)
- `--mc-transfer-timeout=1800s` 默认值commit 905d671减少假性 dead
- `--request-timeout-s=180/300` 让客户端不至于看见整 hour 卡死,但不阻止 cascade 自身
→ 这些都是治标不是治本。E2 在 4×H200 NDR 真硬件下仍 80% 失败 ([E1_E2_RESULTS_ZH.md](E1_E2_RESULTS_ZH.md))。
### 1.5 真正的修复(路线图 §S3
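1. **admission RPC backoff + jitter**: on rejection, do not fall back immediately; give the D scheduler room to breathe.
2. **per-D pending-seed budget**: at most K seeds sit in the transfer queue at any moment; the excess queues up instead of bursting.
3. **Decouple mooncake heartbeat from admission**: the admission path no longer implies "peer alive".
4. **Close the loop on the backpressure pause hint** ([SGLANG_PATCH_INVENTORY_ZH.md](SGLANG_PATCH_INVENTORY_ZH.md) §2.3, currently EXPERIMENTAL).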
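Items 1 and 2 can be sketched with `asyncio` primitives. This is an illustration only: `backoff_delay`, `PendingSeedBudget`, and the constants (50 ms base, 2 s cap, budget of 2) are hypothetical choices for the example, not values from the router.

```python
import asyncio
import random


def backoff_delay(attempt: int, base_s: float = 0.05, cap_s: float = 2.0, rng=None) -> float:
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    rng = rng or random
    return rng.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))


class PendingSeedBudget:
    """Caps concurrent in-flight seeds per decode worker; extra seeds wait."""

    def __init__(self, workers, max_pending: int) -> None:
        self._sems = {w: asyncio.Semaphore(max_pending) for w in workers}

    async def run_seed(self, worker: str, seed_coro_fn):
        async with self._sems[worker]:  # queue here instead of bursting
            return await seed_coro_fn()


async def _demo() -> int:
    budget = PendingSeedBudget(["D0"], max_pending=2)
    in_flight = peak = 0

    async def fake_seed() -> None:
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # stands in for the P→D transfer
        in_flight -= 1

    await asyncio.gather(*(budget.run_seed("D0", fake_seed) for _ in range(6)))
    return peak


peak_in_flight = asyncio.run(_demo())
```

Six seeds are requested at once, but no more than two are ever in flight against `D0`; the rest wait on the semaphore rather than piling onto the transfer queue.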
---
## 2. B — Cold-D / overlap-pinning
### 2.1 症状
- N=k decode workers但只有 ~k-1 真正承载流量;某些 D 0 binding
- Per-D load 直方图严重偏斜E2D0:600 / D1:685 / **D2:0**
- 整体 throughput 受最忙 D 限制;裸 latency 不一定差,但容量利用率差 33%+
### 2.2 根因
Inferact / Ali coding agent trace 在每个 session 开头有 ~12K 的"system prompt + tool schema",这些 24-token 块在所有 session 之间共享 hash。kv-aware policy 的 `overlap` term 把它们当成"该 D 已经常驻这些 hash" → 任何新 session 都被 score 推向 D0/D1最先 warm 的两个)→ D2 永远 0 overlap → 永远不被选 → 永远 cold。
### 2.3 触发条件
- 多 session workload + 共享 boilerplate prefix
- `migration_reject_threshold > 0` 且 reject 从未触发(因为 D0/D1 还没满)
### 2.4 当前缓解
`KvAwarePolicy.load_floor_bonus` (commit 93fce42):
```
floor_bonus = K * max(0, mean - assigned) / max(1, mean)
```
In E3, D2 binding was measured to rise from 0 to 22.5% ([E3_FINDINGS_ZH.md](E3_FINDINGS_ZH.md) §1).
→ This is a patch, not a fix. `K` is a magic number; D2 still stays cold when the number of boilerplate hashes exceeds `K / sticky_bonus`.
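To make the formula concrete, here it is applied to the E2 per-D load from §2.1. The function signature is a hypothetical wrapper around the formula, and `k=10.0` is an arbitrary illustrative value, not the shipped default.

```python
from statistics import fmean
from typing import Dict


def load_floor_bonus(assigned: Dict[str, int], k: float) -> Dict[str, float]:
    """Per-D bonus: k * max(0, mean - assigned) / max(1, mean)."""
    mean = fmean(assigned.values())
    return {d: k * max(0.0, mean - n) / max(1.0, mean) for d, n in assigned.items()}


# E2 per-D load from §2.1: D0:600 / D1:685 / D2:0.
bonus = load_floor_bonus({"D0": 600, "D1": 685, "D2": 0}, k=10.0)
```

The two loaded workers get no bonus, and the cold D2 gets the full `k`. Whether that is enough to win the route depends on how large the overlap term is, which is the magic-number problem described above.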
### 2.5 真正的修复(路线图 §S5
`overlap` 重新定义为 **"该 session 在该 D 上独占 prefix 的 hash 数"**
```
exclusive_overlap(s, d) := |prefix_hashes(s) ∩ resident[d] ∩ session_owned[s]|
```
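where `session_owned[s]` excludes hashes that other sessions also hold. Shared boilerplate hashes do not enter `exclusive_overlap`, so scores spread out naturally. This requires D to return, in the `admit_direct_append` response, a sketch (Bloom filter / minhash) of the per-session resident hash set.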
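With exact sets instead of a Bloom filter or minhash, the definition reads as follows. This is a sketch under the assumption that `session_owned[s]` is derived as "hashes of `s` that no other session holds"; the function signature and the toy hash values are made up for illustration.

```python
from typing import Dict, Set


def exclusive_overlap(
    session: str,
    d: str,
    prefix_hashes: Dict[str, Set[int]],
    resident: Dict[str, Set[int]],
) -> int:
    """|prefix_hashes(s) ∩ resident[d] ∩ session_owned[s]|."""
    shared: Set[int] = set()
    for other, hashes in prefix_hashes.items():
        if other != session:
            shared |= hashes  # anything another session holds is not exclusive
    session_owned = prefix_hashes[session] - shared
    return len(prefix_hashes[session] & resident[d] & session_owned)


# Hashes 1-3 are shared boilerplate; 10 and 20 are session-specific blocks.
prefix_hashes = {"s1": {1, 2, 3, 10}, "s2": {1, 2, 3, 20}}
resident = {"D0": {1, 2, 3, 10}, "D1": {1, 2, 3}}

s1_on_d0 = exclusive_overlap("s1", "D0", prefix_hashes, resident)
s1_on_d1 = exclusive_overlap("s1", "D1", prefix_hashes, resident)
s2_on_d0 = exclusive_overlap("s2", "D0", prefix_hashes, resident)
```

Under the plain `overlap` term, `s2` would score 3 on both workers purely from boilerplate; under the exclusive definition it scores 0 on both, so the load term decides and traffic is free to spread.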
---
## 3. C — Evict stormsession-level eviction
### 3.1 症状
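- Under workloads with memory pressure on D, a KV-pool release spike of 30–90K tokens appears every 1–2 minutes
- The next request of the same session then triggers a `Reseed`: P re-prefills 50K + mooncake transfers 50K (3–7s)
- The TTFT long tail is dominated entirely by this kind of reseed ([V2_DEEP_ANALYSIS_ZH.md](V2_DEEP_ANALYSIS_ZH.md) §3.2)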
### 3.2 根因
`SessionAwareCache.release_session` 一次性 `free([cache_protected_len, kv_allocated_len))`——即整段 session-exclusive 尾部。E3 实测90 次 evict、平均一次 free 67,726 tokens、25/50 session 受影响([KVC_EVICTION_GRANULARITY_DESIGN_ZH.md](KVC_EVICTION_GRANULARITY_DESIGN_ZH.md) §0
→ 与 SGLang 标准 radix 的 leaf-by-leaf 渐进 evict 形成鲜明对比。这部分 KV 从未进 radix所以享受不到 LRU 的细粒度蚕食。
### 3.3 触发条件
- D KV pool 接近满
- `maybe_trim_decode_session_cache` 被 scheduler 触发(在 `DecodePreallocQueue` 检测到 `available_size() <= 0` 时)
### 3.4 当前缓解
- `--kvcache-session-soft-cap=N`main 分支):限制 D 上常驻 session 数 → 提前 trim避免顶到爆
- `--kvcache-direct-max-uncached-tokens=8192`v2降低 direct path 吃 KV 的速度
→ 都是放慢节奏,没有解决"单次 free 太大"的根本问题。
### 3.5 真正的修复(路线图 §S1
[BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md):让 streaming-session decode 输出每 turn finish 时 `inner.cache_finished_req` 进 radix → `release_session` 退化为 `dec_lock_ref` + 删 slot → radix LRU 按 24-token leaf 蚕食。
预期:单次 evict 从 67K 降到 ≤ 500 tokensreseed 频次降一个数量级。
---
## 4. C' — Reseed stormturn 1 大 seed 并发)
### 4.1 症状
- In the start-up phase of the workload (the first 30–60s), all sessions hit turn 1 at the same time
- Several concurrent `Seed`s (~50–90K tokens each) hit mooncake, which overlaps with the §1 cascade
### 4.2 根因
`KvAwarePolicy` 启动阶段 `resident[d]` 全空,所有 D score 相同,但 ε 重试 + per-trial admit 不阻止并发。
### 4.3 触发条件
- trace `time_scale=1` 重放下session 在原始到达密度内同时启动
- 没有 per-D pending-seed 限流
### 4.4 当前缓解
- `--kvcache-seed-min-turn-id=2`:跳过 turn 1 seed 完全main 分支 stable 配置)
- 副作用:失去 turn-1 的 KV 注入turn 2 必走 reseed但反而稳定因为 reseed 是分散在时间上的)
### 4.5 真正的修复
- per-D pending-seed budget同 §1.5 第 2 项)
- §3.5 完成后 evict 频次自降,间接降低 reseed 频次
---
## 5. D — Streaming-session correction invariant crash (E3 landmine)
### 5.1 症状
- D scheduler 抛 `AssertionError` at `schedule_batch.py:1646``seq_len - pre_len == req.extend_input_len`
- 整个 D worker 进程退出 → router 看见对端死 → §1 cascade
### 5.2 根因
[E3_FINDINGS_ZH.md](E3_FINDINGS_ZH.md) §2streaming-session correctioncommit b8e6f13`extend_input_len` 改写为 `max(0, fill_len - prefix_len)`,但下游 invariant 还从原始 fill_ids/prefix_indices 计算。当 `fill_len < prefix_len`(多 turn 累积 prefix > 当前 turn 增量)时数学上不可能满足。
### 5.3 触发条件
- streaming session 跨 turn 已 commit prefix 长于本 turn 的新增 fill_ids
- E2 因 pipeline 阻塞从未跑到这个状态E3 修了 cold-D bottleneck → pipeline 更快 → landmine 暴露
### 5.4 当前缓解
commit 986f351 的 pre-filter pass`prepare_for_extend` 入口 drop 这类 req让 client 看错误响应而不是 worker 崩)。是止血。
### 5.5 真正的修复
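The whole correction path at `schedule_batch.py:1572-1646` is **structurally no longer needed** once the block-level eviction refactor is done: [BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) §3.7 already explains that after the refactor the consistency of fill_ids / prefix_indices is guaranteed automatically by radix `match_prefix`.
→ Do not add more correction clauses; delete the whole section.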
---
## 6. 失败诊断 cheat sheet
When running a sweep, look up the table below:
| You see | Most likely | Check first |
|---|---|---|
| client `RuntimeError: generate stream ended before...` | §1 cascade | search the D scheduler log for `instance could be dead` |
| one D at `binding=0` while the other D's are busy | §2 cold-D | `per_decode_load` histogram |
| TTFT p99 suddenly rises to the 5–8s range | §3 evict storm | `release_session` call frequency + average tokens freed |
| high failure rate during sweep start-up, low in steady state | §4 reseed storm | peak of the mooncake transfer queue in the first 30s |
| D worker process exits abnormally | §5 invariant crash | search the scheduler log for `AssertionError` and `extend_input_len` |
---
## 7. 与路线图的衔接
- Items 1/3/4 of Milestone 1 in [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) correspond to the real fixes for C / A / B in this table. Once Milestone 1 is done, §1–§4 here should all move from "unfixed" to "mitigated", and §5 disappears outright.
- The paper's §Limitations must state the current situation honestly: "we identify five failure modes; A/C are addressed by this work, B/C' are partially addressed, D is a transient artifact of the in-progress refactor."
---
**核心句**:把失败模式当 first-class artifact 来管理——每个失败都有"症状 → 根因 → 触发 → 缓解 → 真正修复"五字段,是把 prototype 推到 production-grade 的关键工具。reviewer 看见你能枚举失败远比看见你赢得 baseline 更让人信服。

119
docs/INDEX_ZH.md Normal file
View File

@@ -0,0 +1,119 @@
# 文档索引
**目的**:让任何合作者在 10 分钟内找到他需要的文档;让 Reviewer 知道哪些先看。
---
## 0. 时间紧的 3 篇
按这个顺序读完即可参与讨论:
1. [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) — 项目当前进度、薄弱点、路线图。
2. [KVC_ROUTER_ALGORITHM.md](KVC_ROUTER_ALGORITHM.md) — 算法形式化Algorithm 1/2/3 + Theorem 1/2
3. [V2_DEEP_ANALYSIS_ZH.md](V2_DEEP_ANALYSIS_ZH.md) §0 + §6 — v2 当前 win/lose snapshot。
---
## 1. 按主题分类
### 1.1 进度 / 现状
| 文档 | 内容 |
|---|---|
| [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) | 跨分支整合 + 路线图(本分支的总入口) |
| [PROJECT_OVERVIEW.md](PROJECT_OVERVIEW.md) | 项目目标 + 三种 mechanismpd-disagg / pd-colo / kvcache-centric的术语区分 |
| [ONBOARDING_NEXT_AGENT_ZH.md](ONBOARDING_NEXT_AGENT_ZH.md) | 接班 agent 30 分钟上手手册(来自 `kvc-debug-journey-v1-to-v4` |
### 1.2 算法 / 形式化
| 文档 | 内容 |
|---|---|
| [KVC_ROUTER_ALGORITHM.md](KVC_ROUTER_ALGORITHM.md) | Algorithm 1Route/ 2Admit/ 3Dispatch+ Theorem 1无饿死+ Theorem 2fast-path 命中下限) |
| [MIGRATION_V1_FINDINGS_ZH.md](MIGRATION_V1_FINDINGS_ZH.md) | v1 thrashing pathology 的实测 + 为什么 reset-on-success 是关键修复 |
### 1.3 实验结果
| 文档 | 内容 |
|---|---|
| [V2_DEEP_ANALYSIS_ZH.md](V2_DEEP_ANALYSIS_ZH.md) | SWE-Bench 50 sess ts=1v2 vs 4DP CA 的 6/8 win + TTFT p99 落后原因 |
| [V2_RESULTS_ZH.md](V2_RESULTS_ZH.md) | v2 原始战报headline 数字略乐观,请同时看 deep analysis |
| [E1_E2_RESULTS_ZH.md](E1_E2_RESULTS_ZH.md) | H200 + RDMA 上 E1naive 1P3D + kv-awarevs E2KVC v2E2 80% failure 的 forensic |
| [E3_FINDINGS_ZH.md](E3_FINDINGS_ZH.md) | E3+load-floor bonus16 min 触发 SGLang patch invariant crash |
| [E1_E2_FIX_DESIGN_ZH.md](E1_E2_FIX_DESIGN_ZH.md) | Q1mooncake death+ Q2cold-D2的 fix 设计 |
### 1.4 当前关键 design discussion
| 文档 | 内容 |
|---|---|
| [KVC_EVICTION_GRANULARITY_DESIGN_ZH.md](KVC_EVICTION_GRANULARITY_DESIGN_ZH.md) | 架构层反思session-level evict 与 KVC continuity 设计冲突 |
| [BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) | block-level evict refactor 的具体 API / 步骤 / 测试计划(本分支新增) |
| [RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md](RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md) | reseed 慢路径时间线 + D→P 同步缺口的 forensic |
| [D_TO_P_SYNC_CONTRACT_ZH.md](D_TO_P_SYNC_CONTRACT_ZH.md) | D→P sync 的接口契约、staleness budget、rollout 阶段(本分支新增) |
### 1.5 评测 / 方法论
| 文档 | 内容 |
|---|---|
| [EVALUATION_PROTOCOL_ZH.md](EVALUATION_PROTOCOL_ZH.md) | paper-quality 评测协议N、CI、paired、stratify、baseline list、trace mix—— 本分支新增 |
| [REFACTOR_PLAN_V1_ZH.md](REFACTOR_PLAN_V1_ZH.md) | 为什么从 ts=10 切到 ts=1 |
| [TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md](TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md) | ts=10 时代的结构性问题清单(多数已 supersede |
### 1.6 工程债 / 失败模式
| 文档 | 内容 |
|---|---|
| [SGLANG_PATCH_INVENTORY_ZH.md](SGLANG_PATCH_INVENTORY_ZH.md) | 785 行 vendored SGLang patch 的归类清单MUST-HAVE / WORKAROUND / EXPERIMENTAL / INSTRUMENTATION—— 本分支新增 |
| [FAILURE_MODES_ZH.md](FAILURE_MODES_ZH.md) | 5 类失败模式的诊断 + 缓解 + 真正修复mooncake cascade / cold-D / evict storm / reseed storm / E3 invariant—— 本分支新增 |
### 1.7 环境
| 文档 | 内容 |
|---|---|
| [H200_DRIVER570_SETUP_ZH.md](H200_DRIVER570_SETUP_ZH.md) | H200 + driver 570 + cu12.8 环境搭建 + 11 条 lesson learned |
### 1.8 Archive (historical reference only)
Content under `docs/archive/` has been superseded by newer documents and need not be read:
- `AGENTIC_FIT_ANALYSIS_ZH.md`, `STRUCTURAL_VALIDATION_REPORT_ZH.md`: early ts=10 analysis.
- `KVCACHE_CENTRIC_PROGRESS_ZH.md`: early project snapshot.
- `KVC_DEBUG_JOURNEY_V1_TO_V5.md`, `V5_PROFILE_INVESTIGATION_ZH.md`: tuning notes from v1–v5.
- `REFACTOR_PLAN_ZH.md`: v0 refactor plan.
- `SWEBENCH_EXPERIMENT_*.md`: early experiment logs.
---
## 2. Recommended reading paths by role
### 2.1 I am a SWE/research agent newly taking over
1. First read the 3 documents listed in §0 of this index.
2. Then read [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) §3 (weak points) and §5 (GPU-free work list).
3. Pick a Milestone 1 sub-item and start on it. `docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md` and `docs/D_TO_P_SYNC_CONTRACT_ZH.md` are the two engineering tracks that are already prepared.
### 2.2 I am a paper reviewer / pre-reading for review
1. [KVC_ROUTER_ALGORITHM.md](KVC_ROUTER_ALGORITHM.md): algorithm + theorem.
2. [V2_DEEP_ANALYSIS_ZH.md](V2_DEEP_ANALYSIS_ZH.md): the core measured comparison, plus the limitations we identified ourselves.
3. [E1_E2_RESULTS_ZH.md](E1_E2_RESULTS_ZH.md): ablation on real hardware + RDMA, including the forensic analysis of E2's 80% failure rate, which shows that we can explain the failure.
4. [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) §3: the weak points and future work we list ourselves (nothing is hidden).
### 2.3 I am a student who wants to reproduce the experiments
1. [H200_DRIVER570_SETUP_ZH.md](H200_DRIVER570_SETUP_ZH.md).
2. [EVALUATION_PROTOCOL_ZH.md](EVALUATION_PROTOCOL_ZH.md): which sweeps to run and under which protocol to compare them.
3. `scripts/sweep_ts1_migration_v2.sh`: the main v2 sweep; `scripts/sweep_e1_naive_1p3d.sh` / `scripts/sweep_e2_kvc_v2_rdma.sh`: the E1/E2 ablations.
### 2.4 I want to look at the control plane and admission
1. `src/agentic_pd_hybrid/policies.py`: `KvAwarePolicy.select` is the implementation of Algorithm 1.
2. `src/agentic_pd_hybrid/replay.py`: `_invoke_session_direct` / `_invoke_kvcache_seeded_router` are the orchestration for Algorithm 3.
3. `third_party/sglang/python/sglang/srt/managers/scheduler.py`: `_admit_direct_append` on the D side is the implementation of Algorithm 2.
---
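## 3. Maintenance conventions for this index
- Every new design / experiment doc must add a row to a table in §1 of this index.
- When a document is archived (moved to `docs/archive/`), its entry here is removed or marked "archived" in the same change.
- This index carries no substantive content and only navigates; all in-depth explanation lives in the documents it points to.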
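
The first convention above (every new doc gets a row in §1) lends itself to an automated check. The following is a minimal sketch under stated assumptions, not part of the repository: the helper name `missing_from_index`, the sample index text, and the file `NEW_DESIGN_ZH.md` are all hypothetical. It treats a doc as indexed if the index either links to it or mentions it in inline code.

```python
import re


def missing_from_index(index_text: str, doc_names: list[str]) -> list[str]:
    """Return doc file names that the index neither links nor mentions.

    Hypothetical helper: recognizes markdown link targets such as
    `](NAME.md)` and inline-code mentions such as `NAME.md`.
    """
    linked = set(re.findall(r"\]\(([^)#]+\.md)\)", index_text))
    mentioned = set(re.findall(r"`(?:docs/)?([^`/]+\.md)`", index_text))
    known = linked | mentioned
    return sorted(name for name in doc_names if name not in known)


# Illustrative inputs only; a real check would read the index file and
# list the docs directory instead.
index_text = """
| [FAILURE_MODES_ZH.md](FAILURE_MODES_ZH.md) | 5 failure modes |
- `KVCACHE_CENTRIC_PROGRESS_ZH.md`: early project snapshot.
"""
docs = [
    "FAILURE_MODES_ZH.md",
    "KVCACHE_CENTRIC_PROGRESS_ZH.md",
    "NEW_DESIGN_ZH.md",  # hypothetical doc that was never indexed
]
print(missing_from_index(index_text, docs))
```

A check like this could run in CI without a GPU, in the same spirit as the pure-Python tests added by this change.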


@@ -0,0 +1,165 @@
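# Vendored SGLang Patch: Classification Inventory
**Date**: 2026-05-13
**Baseline**: clean SGLang v0.5.10 snapshot @ `bded083`
**Current HEAD**: `origin/h200-cu130` + this branch (785 lines added / 17 lines deleted / 10 files)
**Purpose**: let reviewers and the next collaborator see at a glance which patches are core mechanism, which are workarounds, and which can be retired after the refactor. This addresses the engineering-debt item in [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) §3.2 / §S6.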
---
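## 0. TL;DR
| Category | Files | Lines (est.) | Fate |
|---|---:|---:|---|
| MUST-HAVE: core mechanism (Algorithm 1/2/3, streaming session lifecycle, admit RPC) | 6 | ~450 | Kept long term; this is the core of the paper's claims |
| WORKAROUND: patches over identified latent problems, to be retired after the refactor | 2 | ~150 | Largely deleted once the block-level eviction refactor is complete |
| EXPERIMENTAL: features whose loop is not yet closed; the paper does not depend on them | 1 | ~60 | Can be retired or kept as a future-work hook |
| INSTRUMENTATION: diagnostics / logging | 1 | ~50 | Kept, but should be isolated to a debug build |
| MINOR: miscellaneous | 1 | ~3 | Does not affect decisions |

**Key guidance**: when the block-level eviction refactor ([BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md)) is complete, the ~150 lines in the WORKAROUND category should be deleted along with it. The `schedule_batch.py` invariant landmine triggered by E3 is a product of this path; the right fix is to change the eviction granularity, not to patch the engine.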
---
## 1. Per-file inventory
### 1.1 `mem_cache/session_aware_cache.py`: MUST-HAVE *(pending refactor)*
| Item | Description | Introduced in | Category |
|---|---|---|---|
| `SessionSlot` dataclass | Metadata that lets a streaming session reuse KV across turns | b8e6f13 | MUST-HAVE |
| `last_access_time` field | Needed for LRU decisions | 6e5ed8d | MUST-HAVE |
| Streaming branches of `match_prefix` / `cache_finished_req` / `cache_unfinished_req` | Fast path for session reuse | b8e6f13 | **MUST-HAVE → pending refactor** (semantics change substantially after block-level evict) |
| `release_session` calling `free(kv_indices)` directly | Returns all KV in one shot when a session exits | b8e6f13 | **WORKAROUND → replace** (after the refactor it only calls `dec_lock_ref`) |
| `slot_held_tokens` / `get_session_status` / `list_session_statuses` | Status queries | 6e5ed8d | MUST-HAVE |

**Note**: this file is the hub of the KVC design, and it is exactly what the block-level eviction refactor ([BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) §3.1 to §3.6) reworks. The 5 KV-ownership fields of `SessionSlot` (`req_pool_idx` / `kv_committed_len` / `kv_allocated_len` / `cache_protected_len` / `swa_evicted_seqlen`) should be deleted after the refactor; that change **will be flagged in its own commit message** to make rollback easy.
### 1.2 `managers/scheduler.py`: mixed categories
The Algorithm 2 implementation on the D worker side, containing several independent patches. Classified at function / line-range level:

| Function / line range | Description | Category | When it can be retired |
|---|---|---|---|
| `admit_direct_append(...)` | D-side admission RPC handler for Algorithm 2 | **MUST-HAVE** | Never (core of the paper) |
| `_should_allow_local_prefill_on_decode(req)` | Decides whether a decode worker accepts a local append-prefill without bootstrap | **MUST-HAVE** | Never |
| `_decode_session_cache_low_watermark_tokens()` | Reads the watermark parameter | **WORKAROUND** | Replaced by radix LRU after block-level evict |
| `_decode_session_cache_target_available_tokens()` | Computes the target number of available tokens | **WORKAROUND** | Same as above |
| `maybe_trim_decode_session_cache(...)` | Proactively trims sessions, triggering `release_session` | **WORKAROUND** | Same as above; after the refactor radix LRU reclaims blocks gradually, so trimming is no longer needed |
| `_compute_backpressure_pause_hint(...)` | Pause hint for the router | **EXPERIMENTAL** | The signal loop is not closed ([REAL_ALI_KVC_EXPERIMENT_LOG_ZH.md](../docs/archive/) §4.3; roadmap §S10); can be kept as a future-work hook |
| `_compute_pool_breakdown_for_diagnostics()` | Pool-state snapshot served via `/server_info` | **INSTRUMENTATION** | Kept long term, but should be gated behind a flag |
### 1.3 `managers/schedule_batch.py`: WORKAROUND (to be deleted)
| Item | Description | Introduced in | Category |
|---|---|---|---|
| streaming-session `extend_input_len` correction (lines ~1572 to 1585) | Sets extend_input_len to 0 when fill_ids < prefix_indices | b8e6f13 | **WORKAROUND** |
| pre-filter pass dropping `fill_ids < prefix_indices` reqs | Hotfix after E3 triggered the assertion (commit 986f351) | 986f351 | **WORKAROUND** |
| Tolerance logic for the invariant assert `seq_len - pre_len == req.extend_input_len` | Companion to the correction | b8e6f13 | **WORKAROUND** |

**All** ~85 lines **should be deleted wholesale** once the block-level eviction refactor is complete: `BLOCK_LEVEL_EVICTION_DESIGN_ZH §3.7` already explains that after the refactor this invariant holds by construction, so the correction path has no reason to exist. The E3 landmine ([E3_FINDINGS_ZH.md](E3_FINDINGS_ZH.md) §2) is a product of this workaround.
### 1.4 `managers/session_controller.py`: MUST-HAVE
| Item | Description | Category |
|---|---|---|
| Streaming session lifecycle hooks (open / close / admit signal) | Lets P/D workers know when a streaming session starts / ends | MUST-HAVE |
| Session ID routing | Lets the admission RPC find the correct SessionSlot | MUST-HAVE |

Not to be retired.
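### 1.5 `managers/io_struct.py`: MUST-HAVE
| Item | Description | Category |
|---|---|---|
| `AdmitDirectAppendReqInput` / `AdmitDirectAppendReqOutput` | Request / response message types for the admit RPC | MUST-HAVE |
| backpressure pause hint field | Optional field on the same messages | EXPERIMENTAL |

The EXPERIMENTAL field can be folded into the MUST-HAVE messages to preserve compatibility; on its own it creates no pressure to retire anything.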
### 1.6 `managers/tokenizer_communicator_mixin.py`: MUST-HAVE
Communicator-side glue for the admit RPC (19 lines). Not to be retired.
### 1.7 `entrypoints/http_server.py`: MUST-HAVE
Registration of the `/admit_direct_append` HTTP endpoint (6 lines).
### 1.8 `disaggregation/decode.py`: mixed categories
| Item | Description | Category |
|---|---|---|
| `DecodeReqToTokenPool`: relaxed `assert len(reusing) <= 1` | Local append-prefill reuses multiple req_pool_idx within one batch | **MUST-HAVE** |
| `DecodePreallocQueue`: adds `refresh_allocatable_tokens` and triggers `maybe_trim_decode_session_cache` | Proactively trims sessions when the pool is full | **WORKAROUND** (after the refactor, radix LRU sheds naturally) |
| `--disaggregation-decode-allow-local-prefill` flag | Server-side opt-in for local append-prefill | **MUST-HAVE** |

The ~30 lines of trim-trigger logic should be deleted after the refactor.
### 1.9 `server_args.py`: MUST-HAVE
| Item | Description | Category |
|---|---|---|
| `--radix-eviction-policy priority` option | Needed by the E1/E2 experiments | MUST-HAVE |
| `--disaggregation-decode-allow-local-prefill` flag | See §1.8 | MUST-HAVE |

13 lines, all CLI interface extensions. Not to be retired.
### 1.10 `disaggregation/mooncake_transfer_engine.py`: MINOR
A 3-line tweak; not a decision point.
---
## 2. Summary by category
### 2.1 MUST-HAVE (keep)
6 files, ~450 lines.
- `admit_direct_append` main path (Algorithm 2): scheduler + io_struct + tokenizer_communicator_mixin + http_server + session_controller
- `SessionSlot` main path (streaming session lifecycle): most fields of session_aware_cache, plus session_controller
- CLI / server interface: server_args, and the `allow_local_prefill` flag in decode.py
### 2.2 WORKAROUND (delete after the block-level evict refactor)
~2.5 files, ~150 lines.
- The token-free path of `session_aware_cache.release_session`
- `_decode_session_cache_*_watermark_tokens` + `maybe_trim_decode_session_cache` in `scheduler.py`
- The streaming-session correction + drop-pre-filter in `schedule_batch.py` (the E3 landmine hotfix)
- The trim trigger in `DecodePreallocQueue` in `decode.py`

These patches exist because of the current architecture; after the refactor they should be deleted wholesale rather than fixed one small bug at a time.
### 2.3 EXPERIMENTAL (loop not closed)
~60 lines.
- backpressure pause hint (`_compute_backpressure_pause_hint` + the io_struct field): can be kept as a hook for a future control-plane feedback mechanism; if it is still not wired up after 1 month, retire it.
### 2.4 INSTRUMENTATION (keep long term, but gate behind a flag)
~50 lines.
- `_compute_pool_breakdown_for_diagnostics` + the related `/server_info` fields: we suggest adding an `--enable-diagnostic-pool-snapshot` flag so that the prod path does not carry the diagnostic overhead.
### 2.5 MINOR
3 lines; ignore.
---
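## 3. Maintenance conventions
1. **Every new SGLang change must be recorded in this table**: use the `feat(sglang): ...` / `fix(sglang): ...` prefix in the commit message, and state in the PR description which §2 category the change falls into.
2. **Do not overwrite upstream files directly**: every patch must `git apply` cleanly on v0.5.10 (keep hunk headers tidy).
3. **Delete the doc rows together with the WORKAROUND**: the same PR that completes the refactor should strike out the corresponding rows in this document.
4. **Do not push EXPERIMENTAL onto the main path**: patches whose loop is not closed must be disabled by default.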
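
Convention 1 can be checked mechanically. The sketch below is an illustration under assumptions, not an existing repo tool: the function name `has_sglang_prefix` and the sample commit subjects are hypothetical, and it only checks the subject-line prefix, not whether the table was actually updated.

```python
import re

# Accepts the two prefixes named in convention 1, followed by ": " and a
# non-empty subject. Any other type (docs, chore, ...) is rejected here.
SGLANG_PREFIX = re.compile(r"^(feat|fix)\(sglang\): \S")


def has_sglang_prefix(subject: str) -> bool:
    """True if a commit subject carries the feat(sglang)/fix(sglang) prefix."""
    return SGLANG_PREFIX.match(subject) is not None


# Hypothetical subjects, for illustration only.
for subject in [
    "feat(sglang): gate pool snapshot behind a flag",
    "fix(sglang): drop stale trim trigger",
    "sglang: tweak scheduler",
]:
    print(f"{subject!r}: {has_sglang_prefix(subject)}")
```

A hook of this kind only enforces the naming half of the convention; stating the §2 category in the PR description still needs human review.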
---
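## 4. Link to the roadmap
- When Milestone 1 ([AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) §4) carries out the block-level eviction refactor, **all of §2.2 should disappear**; this is an objective measure of how complete the refactor is.
- When Milestone 2 splits the control plane into layers (§4.8), the backpressure pause hint in §2.3 should be either enabled or retired; it must not be left dangling.
- When Milestone 3 introduces learning-based admission (§4.15), the `admit_direct_append` interface in §2.1 should stay stable; policy replacement happens on the router side, not on D.
---
**Key takeaway**: the 785 vendored SGLang lines are not a monolithic black box. About two thirds are core mechanism (required by the paper) and about one third are workarounds for the current architecture (deletable wholesale after the refactor). With this table, a reviewer can immediately tell which parts are the paper's real contribution and which are temporary scaffolding for the current prototype.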


@@ -20,8 +20,21 @@ build-backend = "setuptools.build_meta"
[tool.setuptools.packages.find]
where = ["src"]
[dependency-groups]
# Pure-Python unit tests. Install via:
#   uv sync --group test
# These tests deliberately import only the algorithm-layer modules
# (policies, trace, topology) so they run without SGLang / GPU / CUDA.
test = [
    "pytest>=8.0",
]
[tool.uv]
prerelease = "allow"
[tool.uv.sources]
sglang = { path = "third_party/sglang/python", editable = true }
[tool.pytest.ini_options]
testpaths = ["tests"]
addopts = "-q"


@@ -0,0 +1,225 @@
#!/usr/bin/env python3
"""Paired latency comparison with bootstrap CI.

Implements docs/EVALUATION_PROTOCOL_ZH.md §2.2 (M2 fix): when comparing
mechanism A vs B on the same trace, the only honest comparison is paired
on same-trial-mask. This script joins two metrics.jsonl by request_id,
keeps the rows where BOTH sides succeeded, and reports paired deltas
with 95% bootstrap CIs.

Differences from the existing `compare_no_error.py`:
  - works on raw metrics.jsonl, not pre-aggregated summary.json
  - bootstrap CIs (not just point estimates)
  - reports paired-mask size + per-side failure counts so the reader
    sees how many rows were dropped from the comparison

Usage:
    scripts/analysis/paired_compare.py \
        --baseline outputs/run-dp/request-metrics.jsonl \
        --candidate outputs/run-kvc/request-metrics.jsonl
    scripts/analysis/paired_compare.py ... --bootstrap 5000 --seed 42
    scripts/analysis/paired_compare.py ... --json > paired.json

stdlib only: no scipy/numpy. Runs without GPU and without SGLang.
"""
from __future__ import annotations

import argparse
import json
import math
import random
import sys
from pathlib import Path


def _load(path: Path) -> dict[str, dict]:
    out: dict[str, dict] = {}
    with path.open() as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            row = json.loads(line)
            rid = row.get("request_id")
            if rid is None:
                continue
            out[rid] = row
    return out


def _ok(row: dict) -> bool:
    return row.get("error") is None and row.get("latency_s") is not None


def _quantile(values: list[float], q: float) -> float:
    if not values:
        return float("nan")
    s = sorted(values)
    if len(s) == 1:
        return s[0]
    pos = (len(s) - 1) * q
    lo = math.floor(pos)
    hi = math.ceil(pos)
    if lo == hi:
        return s[lo]
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def _stats(deltas: list[float]) -> dict[str, float]:
    if not deltas:
        return {"mean": float("nan"), "p50": float("nan"), "p90": float("nan"), "p99": float("nan")}
    return {
        "mean": sum(deltas) / len(deltas),
        "p50": _quantile(deltas, 0.50),
        "p90": _quantile(deltas, 0.90),
        "p99": _quantile(deltas, 0.99),
    }


def _bootstrap_ci(
    deltas: list[float], statistic, n_boot: int, rng: random.Random
) -> tuple[float, float]:
    """Return (lo, hi) 95% CI for `statistic(deltas)`."""
    if len(deltas) < 2:
        return (float("nan"), float("nan"))
    n = len(deltas)
    samples = []
    for _ in range(n_boot):
        # resample with replacement
        resample = [deltas[rng.randrange(n)] for _ in range(n)]
        samples.append(statistic(resample))
    samples.sort()
    lo = samples[int(0.025 * (n_boot - 1))]
    hi = samples[int(0.975 * (n_boot - 1))]
    return (lo, hi)


def compare(
    baseline: dict[str, dict],
    candidate: dict[str, dict],
    *,
    metric: str,
    n_boot: int,
    seed: int,
) -> dict:
    common_ids = set(baseline.keys()) & set(candidate.keys())
    paired_ids = [
        rid for rid in common_ids if _ok(baseline[rid]) and _ok(candidate[rid])
    ]
    paired_ids.sort()
    base_only_fail = sum(1 for rid in common_ids if not _ok(baseline[rid]))
    cand_only_fail = sum(1 for rid in common_ids if not _ok(candidate[rid]))

    deltas = []
    wins = losses = ties = 0
    for rid in paired_ids:
        b = baseline[rid].get(metric)
        c = candidate[rid].get(metric)
        if b is None or c is None:
            continue
        d = float(c) - float(b)
        deltas.append(d)
        if d < 0:
            wins += 1
        elif d > 0:
            losses += 1
        else:
            ties += 1

    rng = random.Random(seed)
    stats = _stats(deltas)
    ci_mean = _bootstrap_ci(deltas, lambda x: sum(x) / len(x), n_boot, rng)
    ci_p50 = _bootstrap_ci(deltas, lambda x: _quantile(x, 0.50), n_boot, rng)
    ci_p90 = _bootstrap_ci(deltas, lambda x: _quantile(x, 0.90), n_boot, rng)

    return {
        "metric": metric,
        "baseline_size": len(baseline),
        "candidate_size": len(candidate),
        "intersection_size": len(common_ids),
        "paired_size": len(paired_ids),
        "baseline_fail_in_common": base_only_fail,
        "candidate_fail_in_common": cand_only_fail,
        "delta_stats": stats,
        "delta_mean_ci95": ci_mean,
        "delta_p50_ci95": ci_p50,
        "delta_p90_ci95": ci_p90,
        "wins_candidate": wins,
        "losses_candidate": losses,
        "ties": ties,
    }


def _fmt(x: float, w: int = 6) -> str:
    if x is None or (isinstance(x, float) and math.isnan(x)):
        return " nan "
    return f"{x:+{w}.3f}"


def render(result: dict) -> str:
    s = result["delta_stats"]
    mlo, mhi = result["delta_mean_ci95"]
    p5lo, p5hi = result["delta_p50_ci95"]
    p9lo, p9hi = result["delta_p90_ci95"]
    n = result["paired_size"]
    lines = [
        f"# paired comparison ({result['metric']})",
        "",
        f"baseline rows: {result['baseline_size']}",
        f"candidate rows: {result['candidate_size']}",
        f"intersection (rid): {result['intersection_size']}",
        f"paired (both ok): {result['paired_size']}",
        f"  baseline fails in common: {result['baseline_fail_in_common']}",
        f"  candidate fails in common: {result['candidate_fail_in_common']}",
        "",
        "## delta (candidate - baseline); negative = candidate is faster",
        "",
        "| stat | value | 95% CI |",
        "|---|---:|---:|",
        f"| mean | {_fmt(s['mean'])} | [{_fmt(mlo)}, {_fmt(mhi)}] |",
        f"| p50 | {_fmt(s['p50'])} | [{_fmt(p5lo)}, {_fmt(p5hi)}] |",
        f"| p90 | {_fmt(s['p90'])} | [{_fmt(p9lo)}, {_fmt(p9hi)}] |",
        # No bootstrap CI is computed for p99 in compare().
        f"| p99 | {_fmt(s['p99'])} | n/a |",
        "",
        f"win/loss/tie: {result['wins_candidate']} / {result['losses_candidate']} / {result['ties']} (of {n})",
    ]
    return "\n".join(lines)


def _nan_to_none(obj):
    """Recursively replace NaN floats with None so --json emits strict JSON."""
    if isinstance(obj, float) and math.isnan(obj):
        return None
    if isinstance(obj, dict):
        return {k: _nan_to_none(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [_nan_to_none(v) for v in obj]
    return obj


def main() -> None:
    p = argparse.ArgumentParser(description=__doc__.split("\n\n")[0])
    p.add_argument("--baseline", required=True, type=Path)
    p.add_argument("--candidate", required=True, type=Path)
    p.add_argument(
        "--metric",
        default="latency_s",
        choices=["latency_s", "ttft_s", "tpot_s"],
        help="which per-request field to compare (default: latency_s)",
    )
    p.add_argument("--bootstrap", type=int, default=2000)
    p.add_argument("--seed", type=int, default=20260512)
    p.add_argument("--json", action="store_true")
    args = p.parse_args()

    baseline = _load(args.baseline)
    candidate = _load(args.candidate)
    if not baseline or not candidate:
        print("empty input on one side", file=sys.stderr)
        sys.exit(1)

    result = compare(
        baseline, candidate,
        metric=args.metric, n_boot=args.bootstrap, seed=args.seed,
    )
    if args.json:
        # json's `default` hook only fires for objects it cannot serialize;
        # NaN floats are written as bare `NaN`, so sanitize them up front.
        json.dump(_nan_to_none(result), sys.stdout, indent=2)
        sys.stdout.write("\n")
    else:
        print(render(result))


if __name__ == "__main__":
    main()
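
To make the pairing-then-bootstrap idea in this script concrete, here is a small self-contained sketch on made-up data. It is not the script itself: the function `bootstrap_mean_ci`, the request ids, and the latency values are illustrative assumptions, and only the mean is bootstrapped. It follows the same recipe: keep the ids where both sides have a value, take candidate minus baseline, then read the 2.5th and 97.5th percentiles of the resampled means.

```python
import random


def bootstrap_mean_ci(deltas, n_boot=2000, seed=0):
    """95% percentile-bootstrap CI for the mean of paired deltas."""
    rng = random.Random(seed)
    n = len(deltas)
    means = sorted(
        sum(deltas[rng.randrange(n)] for _ in range(n)) / n  # resample with replacement
        for _ in range(n_boot)
    )
    return means[int(0.025 * (n_boot - 1))], means[int(0.975 * (n_boot - 1))]


# Made-up per-request latencies in seconds; None marks a failed request.
baseline = {"r1": 1.0, "r2": 2.0, "r3": 1.5, "r4": 3.0, "r5": None}
candidate = {"r1": 0.8, "r2": 1.9, "r3": 1.6, "r4": 2.5, "r5": 2.0}

# Same-trial mask: r5 failed on the baseline side, so it is dropped from BOTH.
paired = sorted(
    rid for rid in baseline
    if baseline[rid] is not None and candidate.get(rid) is not None
)
deltas = [candidate[rid] - baseline[rid] for rid in paired]
mean = sum(deltas) / len(deltas)
lo, hi = bootstrap_mean_ci(deltas)
print(f"paired={paired} mean={mean:+.3f} ci95=[{lo:+.3f}, {hi:+.3f}]")
```

With only four pairs the interval is wide, which is the point of reporting it: a negative mean delta alone does not show that the candidate is faster.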

scripts/analysis/stratified.py Executable file

@@ -0,0 +1,227 @@
#!/usr/bin/env python3
"""Stratified latency / TTFT reporter for paper-quality evaluation.

Implements docs/EVALUATION_PROTOCOL_ZH.md §1.3 (M3 fix): every headline
number must be accompanied by a stratified breakdown so reviewers can
see which slice the gains come from.

Buckets the request rows from one or more metrics.jsonl files along:
  - turn_id        : {1, 2-5, 6-20, 21+}
  - input_length   : {<=8K, 8K-64K, >64K}
  - overlap_ratio  : {<=0.3, 0.3-0.7, >0.7}
  - append_tokens  : input_length - observed_overlap_blocks * BLOCK_SIZE

For each bucket, reports:
  - n (total rows in bucket)
  - n_ok (rows with no error and latency_s set)
  - latency_s mean / p50 / p90 / p99
  - ttft_s mean / p50 / p90 / p99
  - err_pct (1 - n_ok/n)

Usage:
    scripts/analysis/stratified.py outputs/<run>/request-metrics.jsonl \
        [outputs/<other-run>/request-metrics.jsonl ...]
    scripts/analysis/stratified.py --dim turn_id outputs/<run>/request-metrics.jsonl
    scripts/analysis/stratified.py --json outputs/<run>/request-metrics.jsonl > strat.json

stdlib only: no pandas/numpy. Runs without GPU and without SGLang.
"""
from __future__ import annotations

import argparse
import json
import math
import sys
from collections import defaultdict
from pathlib import Path
from typing import Iterable

BLOCK_SIZE = 24  # SGLang radix block, matches docs/KVC_ROUTER_ALGORITHM.md §2

TURN_BUCKETS: list[tuple[str, tuple[int, int]]] = [
    ("turn=1", (1, 1)),
    ("turn=2-5", (2, 5)),
    ("turn=6-20", (6, 20)),
    ("turn=21+", (21, 10**9)),
]
INPUT_BUCKETS: list[tuple[str, tuple[int, int]]] = [
    ("input<=8K", (0, 8 * 1024)),
    ("input=8K-64K", (8 * 1024 + 1, 64 * 1024)),
    ("input>64K", (64 * 1024 + 1, 10**9)),
]
OVERLAP_BUCKETS: list[tuple[str, tuple[float, float]]] = [
    ("overlap<=0.3", (0.0, 0.3)),
    ("overlap=0.3-0.7", (0.3, 0.7)),
    ("overlap>0.7", (0.7, 1.0001)),
]
APPEND_BUCKETS: list[tuple[str, tuple[int, int]]] = [
    ("append<=128", (0, 128)),
    ("append=128-1K", (129, 1024)),
    ("append=1K-8K", (1025, 8 * 1024)),
    ("append>8K", (8 * 1024 + 1, 10**9)),
]
DIM_BUCKETS: dict[str, list[tuple[str, tuple]]] = {
    "turn_id": TURN_BUCKETS,
    "input_length": INPUT_BUCKETS,
    "overlap_ratio": OVERLAP_BUCKETS,
    "append_tokens": APPEND_BUCKETS,
}


def _quantile(values: list[float], q: float) -> float:
    """Linear-interpolation quantile, stdlib only."""
    if not values:
        return float("nan")
    s = sorted(values)
    if len(s) == 1:
        return s[0]
    pos = (len(s) - 1) * q
    lo = math.floor(pos)
    hi = math.ceil(pos)
    if lo == hi:
        return s[lo]
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def _stats(values: list[float]) -> dict[str, float]:
    if not values:
        return {"mean": float("nan"), "p50": float("nan"), "p90": float("nan"), "p99": float("nan")}
    return {
        "mean": sum(values) / len(values),
        "p50": _quantile(values, 0.50),
        "p90": _quantile(values, 0.90),
        "p99": _quantile(values, 0.99),
    }


def _bucket_for(value: float | int, buckets: list[tuple[str, tuple]]) -> str:
    for label, (lo, hi) in buckets:
        if lo <= value <= hi:
            return label
    return "OOB"


def _classify(row: dict, dim: str) -> str:
    if dim == "turn_id":
        return _bucket_for(int(row.get("turn_id", 0)), TURN_BUCKETS)
    if dim == "input_length":
        return _bucket_for(int(row.get("input_length", 0)), INPUT_BUCKETS)
    if dim == "overlap_ratio":
        inp = max(1, int(row.get("input_length", 0)))
        cached = int(row.get("observed_overlap_blocks", 0)) * BLOCK_SIZE
        ratio = min(1.0, cached / inp)
        return _bucket_for(ratio, OVERLAP_BUCKETS)
    if dim == "append_tokens":
        inp = int(row.get("input_length", 0))
        cached = int(row.get("observed_overlap_blocks", 0)) * BLOCK_SIZE
        return _bucket_for(max(0, inp - cached), APPEND_BUCKETS)
    raise ValueError(f"Unknown dim: {dim}")


def load_rows(paths: Iterable[Path]) -> list[dict]:
    rows: list[dict] = []
    for path in paths:
        with path.open() as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                rows.append(json.loads(line))
    return rows


def stratify(rows: list[dict], dim: str) -> dict[str, dict]:
    by_bucket: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        by_bucket[_classify(row, dim)].append(row)

    output: dict[str, dict] = {}
    for label, _ in DIM_BUCKETS[dim]:
        bucket_rows = by_bucket.get(label, [])
        n = len(bucket_rows)
        ok = [r for r in bucket_rows if r.get("error") is None and r.get("latency_s") is not None]
        n_ok = len(ok)
        lat = [float(r["latency_s"]) for r in ok]
        ttft = [float(r["ttft_s"]) for r in ok if r.get("ttft_s") is not None]
        output[label] = {
            "n": n,
            "n_ok": n_ok,
            "err_pct": (n - n_ok) / n if n else 0.0,
            "latency_s": _stats(lat),
            "ttft_s": _stats(ttft),
        }
    return output


def render_table(name: str, stats: dict[str, dict]) -> str:
    lines = [
        f"## stratified by {name}",
        "",
        "| bucket | n | n_ok | err% | lat mean | lat p50 | lat p90 | lat p99 | ttft mean | ttft p50 | ttft p90 | ttft p99 |",
        "|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|",
    ]
    for label, _ in DIM_BUCKETS[name]:
        s = stats[label]
        lat = s["latency_s"]
        ttft = s["ttft_s"]
        lines.append(
            "| {label} | {n} | {n_ok} | {err:.1%} | "
            "{lm:.3f} | {l50:.3f} | {l90:.3f} | {l99:.3f} | "
            "{tm:.3f} | {t50:.3f} | {t90:.3f} | {t99:.3f} |".format(
                label=label,
                n=s["n"],
                n_ok=s["n_ok"],
                err=s["err_pct"],
                lm=lat["mean"],
                l50=lat["p50"],
                l90=lat["p90"],
                l99=lat["p99"],
                tm=ttft["mean"],
                t50=ttft["p50"],
                t90=ttft["p90"],
                t99=ttft["p99"],
            )
        )
    return "\n".join(lines)


def _nan_to_none(obj):
    """Recursively replace NaN floats with None so --json emits strict JSON."""
    if isinstance(obj, float) and math.isnan(obj):
        return None
    if isinstance(obj, dict):
        return {k: _nan_to_none(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [_nan_to_none(v) for v in obj]
    return obj


def main() -> None:
    parser = argparse.ArgumentParser(description=__doc__.split("\n\n")[0])
    parser.add_argument("metrics_paths", nargs="+", type=Path)
    parser.add_argument(
        "--dim",
        choices=list(DIM_BUCKETS.keys()) + ["all"],
        default="all",
        help="stratification dimension (default: all four)",
    )
    parser.add_argument(
        "--json",
        action="store_true",
        help="emit JSON instead of markdown tables",
    )
    args = parser.parse_args()

    rows = load_rows(args.metrics_paths)
    if not rows:
        print("no rows loaded", file=sys.stderr)
        sys.exit(1)

    dims = list(DIM_BUCKETS.keys()) if args.dim == "all" else [args.dim]
    result = {dim: stratify(rows, dim) for dim in dims}

    if args.json:
        # json's `default` hook only fires for objects it cannot serialize;
        # NaN floats are written as bare `NaN`, so sanitize them up front.
        json.dump(_nan_to_none(result), sys.stdout, indent=2)
        sys.stdout.write("\n")
        return

    header_paths = ", ".join(str(p) for p in args.metrics_paths)
    print(f"# stratified report ({len(rows)} rows from {header_paths})\n")
    for dim in dims:
        print(render_table(dim, result[dim]))
        print()


if __name__ == "__main__":
    main()
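
The two derived dimensions (`overlap_ratio` and `append_tokens`) are both computed from `input_length` and `observed_overlap_blocks`. The sketch below works one row through that arithmetic. It restates the constant and the first-match bucket lookup from the script so that it runs on its own; the helper name `overlap_and_append` and the sample numbers are illustrative assumptions.

```python
BLOCK_SIZE = 24  # same value as in stratified.py

OVERLAP_BUCKETS = [
    ("overlap<=0.3", (0.0, 0.3)),
    ("overlap=0.3-0.7", (0.3, 0.7)),
    ("overlap>0.7", (0.7, 1.0001)),
]


def bucket_for(value, buckets):
    # Bounds are inclusive on both ends and the first match wins, so a
    # value exactly on a shared boundary lands in the lower bucket.
    for label, (lo, hi) in buckets:
        if lo <= value <= hi:
            return label
    return "OOB"


def overlap_and_append(input_length: int, overlap_blocks: int) -> tuple[float, int]:
    """Return (overlap ratio clamped to 1.0, tokens still to be prefilled)."""
    cached = overlap_blocks * BLOCK_SIZE
    ratio = min(1.0, cached / max(1, input_length))
    append = max(0, input_length - cached)
    return ratio, append


# 40 cached blocks of 24 tokens = 960 cached tokens out of a 1200-token input.
ratio, append = overlap_and_append(1200, 40)
print(ratio, append, bucket_for(ratio, OVERLAP_BUCKETS))
print(bucket_for(0.3, OVERLAP_BUCKETS))  # boundary value goes to the lower bucket
```

One consequence of the inclusive bounds is worth keeping in mind when reading the tables: a ratio of exactly 0.3 or 0.7 is counted in the lower of the two adjacent buckets.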


@@ -152,6 +152,49 @@ class StickyDecodePolicy:
)


CandidateScore = tuple[int, int, int, int]


def score_candidate(
    *,
    overlap: int,
    sticky: bool,
    inflight: int,
    assigned: int,
    mean_assigned: float,
    sticky_bonus: int,
    load_floor_bonus: int,
) -> CandidateScore:
    """Pure scoring function for KvAwarePolicy (Algorithm 1 in KVC_ROUTER_ALGORITHM.md).

    Returns the 4-tuple compared lexicographically by `select()` to pick the
    best D. Extracted as a top-level function so unit tests can exercise it
    without constructing topology/state objects.

    Score tuple positions:
      0: overlap + sticky_bonus*sticky + floor_bonus  (primary, KV reuse aware)
      1: sticky                                       (tie-1, session locality)
      2: -inflight                                    (tie-2, prefer low load)
      3: -assigned                                    (tie-3, prefer rarely-picked)

    Load-floor bonus is gated on `not sticky` (turn-1+ sessions continue to
    stick to their original D). The boost magnitude scales linearly with the
    D's deficit relative to the running mean of decode_assignment_counts,
    truncated to an int:

        floor_bonus = int(load_floor_bonus * max(0, mean - assigned) / mean)

    When mean == 0 (warmup) the bonus is 0 for all candidates (lex tiebreak
    falls through to iteration order).

    See docs/E1_E2_FIX_DESIGN_ZH.md §Q2 for the load-floor design and
    docs/KVC_ROUTER_ALGORITHM.md §3.1 for the lex-score formalism.
    """
    floor_bonus = 0
    if load_floor_bonus > 0 and not sticky and mean_assigned > 0:
        deficit = max(0.0, mean_assigned - assigned)
        floor_bonus = int(load_floor_bonus * deficit / mean_assigned)
    primary = overlap + (sticky_bonus if sticky else 0) + floor_bonus
    return (primary, int(sticky), -inflight, -assigned)


@dataclass(frozen=True)
class KvAwarePolicy:
name: str = "kv-aware"
@@ -161,27 +204,11 @@ class KvAwarePolicy:
# 0 disables the mechanism. Default 3 picked empirically to allow brief
# transient saturation without panicking, but to reroute persistent starvation.
migration_reject_threshold: int = 3
# Load-floor bonus: graduated boost added to lex-score position 0 for
# under-loaded D workers, gated on `not sticky` so turn-1+ requests of an
# existing session continue to stick to their original D. The boost
# magnitude scales linearly with the D's deficit relative to the running
# mean of `decode_assignment_counts`:
# floor_bonus = K * max(0, mean - assigned[D]) / max(1, mean)
# When mean=0 (warmup), bonus is 0 for all workers (lex tiebreak by
# iteration order). Once any D has been assigned, under-loaded D's get a
# bonus that approaches K as their deficit-to-mean ratio approaches 1.
# The bonus naturally decays as load equalises, leaving the original
# overlap+sticky scoring in charge of steady-state selection.
#
# Set this above the maximum cross-session boilerplate overlap you expect
# so that fresh sessions are routed to under-loaded D's even when those
# D's currently have 0 overlap, but below the magnitude of "real" prefix
# overlap (e.g., a session with 800-block per-session prefix on an
# already-warm D should still go there).
#
# 0 disables. See docs/E1_E2_FIX_DESIGN_ZH.md §Q2 for the full design and
# docs/E1_E2_RESULTS_ZH.md §5d for why this is needed on Inferact-shaped
# workloads where boilerplate overlap pins D2 cold forever.
# Load-floor bonus: see score_candidate() docstring for the exact formula.
# Set above the max cross-session boilerplate overlap you expect (so fresh
# sessions reach under-loaded D's even at 0 overlap), but below the
# magnitude of "real" prefix overlap (so a warm D still wins for its own
# session). 0 disables.
load_floor_bonus: int = 0
def select(
@@ -194,15 +221,12 @@ class KvAwarePolicy:
prefill_worker_id = state.next_prefill_worker_id(topology)
session = state.session_state.get(request.session_id)
# Pre-compute the running mean of decode assignments. Used by the
# load-floor bonus inside the candidate loop.
n_route_workers = max(1, len(topology.route_workers))
total_assigned = sum(state.decode_assignment_counts.values())
mean_assigned = total_assigned / n_route_workers
best_decode_worker_id: str | None = None
best_score: tuple[int, int, int, int] | None = None
candidates_considered = 0
best_score: CandidateScore | None = None
for worker in topology.route_workers:
# Migration: skip workers that have rejected this session too many times.
# If all candidates get filtered (degenerate case), fall through to
@@ -213,25 +237,17 @@ class KvAwarePolicy:
)
if rejects >= self.migration_reject_threshold:
continue
candidates_considered += 1
overlap = _overlap_blocks(request, state, worker.worker_id)
sticky = int(session is not None and session.last_decode_worker == worker.worker_id)
inflight_penalty = -state.inflight_decode.get(worker.worker_id, 0)
worker_assigned = state.decode_assignment_counts.get(worker.worker_id, 0)
assignment_penalty = -worker_assigned
# Load-floor bonus: only for fresh placements (not sticky), and
# only when the knob is enabled. See docstring above.
floor_bonus = 0
if self.load_floor_bonus > 0 and not sticky and mean_assigned > 0:
deficit = max(0.0, mean_assigned - worker_assigned)
floor_bonus = int(self.load_floor_bonus * deficit / mean_assigned)
score = (
overlap + sticky * self.sticky_bonus + floor_bonus,
sticky,
inflight_penalty,
assignment_penalty,
score = score_candidate(
overlap=_overlap_blocks(request, state, worker.worker_id),
sticky=(
session is not None
and session.last_decode_worker == worker.worker_id
),
inflight=state.inflight_decode.get(worker.worker_id, 0),
assigned=state.decode_assignment_counts.get(worker.worker_id, 0),
mean_assigned=mean_assigned,
sticky_bonus=self.sticky_bonus,
load_floor_bonus=self.load_floor_bonus,
)
if best_score is None or score > best_score:
best_score = score
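
To see how the load-floor bonus changes the lexicographic winner, the sketch below restates the `score_candidate` body from the hunk above (so that it runs standalone) and scores two fresh-placement candidates with the bonus off and on. The overlap, assignment counts, `sticky_bonus=100`, and `load_floor_bonus=8` are illustrative values, not defaults taken from the repo.

```python
def score_candidate(*, overlap, sticky, inflight, assigned, mean_assigned,
                    sticky_bonus, load_floor_bonus):
    # Restated from the diff above so this example is self-contained.
    floor_bonus = 0
    if load_floor_bonus > 0 and not sticky and mean_assigned > 0:
        deficit = max(0.0, mean_assigned - assigned)
        floor_bonus = int(load_floor_bonus * deficit / mean_assigned)
    primary = overlap + (sticky_bonus if sticky else 0) + floor_bonus
    return (primary, int(sticky), -inflight, -assigned)


# Two D workers seen by a brand-new session (not sticky to either):
#   warm: 5 blocks of shared-boilerplate overlap, already assigned 10 sessions
#   cold: no overlap, never assigned; the running mean is (10 + 0) / 2 = 5.0
common = dict(sticky=False, inflight=0, mean_assigned=5.0, sticky_bonus=100)

warm_off = score_candidate(overlap=5, assigned=10, load_floor_bonus=0, **common)
cold_off = score_candidate(overlap=0, assigned=0, load_floor_bonus=0, **common)
warm_on = score_candidate(overlap=5, assigned=10, load_floor_bonus=8, **common)
cold_on = score_candidate(overlap=0, assigned=0, load_floor_bonus=8, **common)

print("bonus off:", "warm wins" if warm_off > cold_off else "cold wins")
print("bonus on: ", "warm wins" if warm_on > cold_on else "cold wins")
```

With the bonus off, the small boilerplate overlap is enough to pin the new session to the warm D. With a bonus of 8, larger than that boilerplate overlap, the cold D gets the full bonus (its deficit equals the mean) and wins position 0, which matches the tuning advice in the `load_floor_bonus` comment.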

tests/README.md Normal file

@@ -0,0 +1,39 @@
# Tests
Pure-Python unit + property tests for the algorithm layer. These tests do
**not** import SGLang and do **not** need a GPU — they validate the routing
algorithm (Algorithm 1/2/3 in `docs/KVC_ROUTER_ALGORITHM.md`) and its
theorems against the pure functions extracted from `policies.py`.
## Run
```bash
uv sync --group test
uv run pytest
```
Or, without uv:
```bash
pip install pytest
PYTHONPATH=src pytest tests
```
## Scope
- `test_policy_scoring.py` — Algorithm 1 lex-score properties (overlap
dominates sticky, load-floor gating, tie-breakers).
- `test_no_starvation.py` — Theorem 1: bounded retries before some D either
accepts or the least-rejected D is forced through the degenerate path.
Future:
- block-level eviction `MockRadixCache` tests (see
`docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md` §5).
- D→P sync `staleness_budget` property tests (see
`docs/D_TO_P_SYNC_CONTRACT_ZH.md` §1).
## Why no integration tests here
Anything that needs SGLang, mooncake, or a real model is an integration
test and must run on hardware. Those tests live as `scripts/sweep_*.sh`
under the evaluation protocol in `docs/EVALUATION_PROTOCOL_ZH.md`.

tests/__init__.py Normal file

tests/_fixtures.py Normal file

@@ -0,0 +1,66 @@
"""Lightweight fixtures for algorithm-layer tests.

Builds minimal TraceRequest / SingleNodeTopology / RoutingState instances
without invoking build_single_node_topology() (which validates GPU budgets
we don't care about in unit tests).
"""
from __future__ import annotations

from agentic_pd_hybrid.topology import SingleNodeTopology, WorkerSpec
from agentic_pd_hybrid.trace import TraceRequest


def make_topology(decode_count: int = 3, prefill_count: int = 1) -> SingleNodeTopology:
    prefill_workers = tuple(
        WorkerSpec(
            role="prefill",
            ordinal=i,
            gpu_ids=(i,),
            host="127.0.0.1",
            port=30000 + i,
        )
        for i in range(prefill_count)
    )
    decode_workers = tuple(
        WorkerSpec(
            role="decode",
            ordinal=i,
            gpu_ids=(prefill_count + i,),
            host="127.0.0.1",
            port=31000 + i,
        )
        for i in range(decode_count)
    )
    return SingleNodeTopology(
        model_path="/dev/null/test-model",
        prefill_workers=prefill_workers,
        decode_workers=decode_workers,
        direct_workers=(),
        router_host="127.0.0.1",
        router_port=8000,
        transfer_backend="mooncake",
        trust_remote_code=True,
    )


def make_request(
    *,
    session_id: str = "sess-1",
    turn_id: int = 0,
    hash_ids: tuple[int, ...] = (),
    input_length: int = 1024,
    output_length: int = 64,
) -> TraceRequest:
    return TraceRequest(
        request_id=f"{session_id}-t{turn_id}",
        session_id=session_id,
        chat_id=int(turn_id),
        parent_chat_id=-1 if turn_id == 0 else int(turn_id - 1),
        timestamp_s=float(turn_id),
        input_length=input_length,
        output_length=output_length,
        request_type="user",
        turn_id=turn_id,
        hash_ids=hash_ids,
    )

tests/test_no_starvation.py Normal file

@@ -0,0 +1,150 @@
"""Theorem 1: no permanent starvation under bounded retries.

Reference: docs/KVC_ROUTER_ALGORITHM.md §4.1.

For any session s with τ_reject ≥ 1, after at most |D| · τ_reject
consecutive admission rejects on s, the routing policy MUST still
return a valid decision (via the degenerate "least-rejected D"
fallback). The session cannot be permanently starved at the policy
layer.

We can't exercise the full Dispatch loop here (it lives in replay.py and
needs HTTP, mooncake, etc.). What we CAN test is the policy-layer
guarantee: after K = |D| · τ_reject reject bumps, select() never raises
and never returns a worker that's both blacklisted *and* has positive
overlap (the degenerate path chooses by least-rejected).

This is the property-layer companion to test_policy_scoring.py's
quantitative checks.
"""
from __future__ import annotations

from agentic_pd_hybrid.policies import KvAwarePolicy, RoutingState

from ._fixtures import make_request, make_topology


def test_select_returns_valid_decision_under_full_blacklist():
    """Bump all (s, d) reject counters past τ_reject. select() must still
    pick a worker (degenerate fallback, no exception, no None)."""
    topology = make_topology(decode_count=3)
    state = RoutingState.create(topology)
    request = make_request(session_id="s-stuck", turn_id=0)
    policy = KvAwarePolicy(migration_reject_threshold=3)

    # Pre-fill the blacklist for every D.
    for worker in topology.route_workers:
        for _ in range(3):
            state.record_admission_reject(request.session_id, worker.worker_id)

    decision = policy.select(request=request, topology=topology, state=state)
    assert decision.decode_worker_id is not None
    assert decision.decode_worker_id in {w.worker_id for w in topology.route_workers}


def test_bounded_retries_to_force_degenerate_path():
    """Theorem 1: at most |D| · τ_reject rejects suffice to either exhaust
    every D or to force the degenerate fallback. Simulate the worst case
    where each retry picks a fresh D and is immediately rejected."""
    topology = make_topology(decode_count=4)
    state = RoutingState.create(topology)
    request = make_request(session_id="s-worst", turn_id=0)
    threshold = 3
    policy = KvAwarePolicy(migration_reject_threshold=threshold)

    seen_decoders: set[str] = set()
    max_retries = len(topology.route_workers) * threshold
    for retry in range(max_retries):
        decision = policy.select(request=request, topology=topology, state=state)
        seen_decoders.add(decision.decode_worker_id)
        # Adversary: this D rejects this session.
        state.record_admission_reject(request.session_id, decision.decode_worker_id)

    # After |D|·τ_reject rejects every D must be blacklisted, so the next
    # select() takes the degenerate "least-rejected" branch and STILL
    # returns a valid worker.
    final = policy.select(request=request, topology=topology, state=state)
    assert final.decode_worker_id in {w.worker_id for w in topology.route_workers}

    # And we should have explored every D over the bounded retries; the
    # algorithm cannot trap a session on a single D when all are rejecting.
    assert seen_decoders == {w.worker_id for w in topology.route_workers}


def test_least_rejected_d_chosen_when_all_blacklisted():
    """When every D is past threshold, the degenerate fallback chooses the
    one with the *fewest* rejects (Algorithm 1, line 4)."""
    topology = make_topology(decode_count=3)
    state = RoutingState.create(topology)
    request = make_request(session_id="s-lr", turn_id=0)
    policy = KvAwarePolicy(migration_reject_threshold=3)
    # Skew rejections: decode-0 has 5, decode-1 has 10, decode-2 has 3.
    # All are >= threshold=3, so the filter wipes out every candidate.
    # The fallback should pick decode-2 (smallest rejection count).
    workers = list(topology.route_workers)
    bumps = {workers[0].worker_id: 5, workers[1].worker_id: 10, workers[2].worker_id: 3}
    for wid, n in bumps.items():
        for _ in range(n):
            state.record_admission_reject(request.session_id, wid)
    decision = policy.select(request=request, topology=topology, state=state)
    assert decision.decode_worker_id == workers[2].worker_id


def test_other_session_unaffected_by_blacklist():
    """Algorithm 1's filter is per-(session, D), not per-D. Session A's
    rejects must not influence session B's routing."""
    topology = make_topology(decode_count=2)
    state = RoutingState.create(topology)
    policy = KvAwarePolicy(migration_reject_threshold=3)
    # Blacklist decode-0 for session A.
    workers = list(topology.route_workers)
    for _ in range(3):
        state.record_admission_reject("session-A", workers[0].worker_id)
    # Session B sees a clean slate, so it should be able to pick decode-0
    # (which is the iteration-order winner under empty state).
    decision_b = policy.select(
        request=make_request(session_id="session-B"),
        topology=topology,
        state=state,
    )
    # decode-0 wins iteration-order tiebreak when all scores are (0,0,0,0).
    assert decision_b.decode_worker_id == workers[0].worker_id


def test_threshold_zero_disables_blacklist():
    """migration_reject_threshold=0 means the migration mechanism is off:
    every D stays a candidate regardless of its reject count."""
    topology = make_topology(decode_count=2)
    state = RoutingState.create(topology)
    request = make_request(session_id="s-no-mig")
    policy = KvAwarePolicy(migration_reject_threshold=0)
    workers = list(topology.route_workers)
    # Pile a huge number of rejects on decode-0.
    for _ in range(100):
        state.record_admission_reject(request.session_id, workers[0].worker_id)
    decision = policy.select(request=request, topology=topology, state=state)
    # decode-0 should still be eligible; with empty overlap/sticky/inflight,
    # iteration order picks decode-0 first.
    assert decision.decode_worker_id == workers[0].worker_id


def test_reject_counter_only_grows_on_record():
    """RoutingState.record_admission_reject is the ONLY mutator for the
    counter. select() must not silently bump it."""
    topology = make_topology(decode_count=2)
    state = RoutingState.create(topology)
    request = make_request(session_id="s-clean")
    policy = KvAwarePolicy()
    for _ in range(5):
        policy.select(request=request, topology=topology, state=state)
    # No explicit record_admission_reject -> all counters stay zero.
    assert sum(state.session_d_rejects.values()) == 0
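
The property tests above pin down a per-(session, D) blacklist filter with a "least-rejected" fallback. The following is a minimal sketch of that filter only, not the project's implementation: the function name `eligible_workers`, the plain-string worker IDs, and the `Counter` keyed by `(session_id, worker_id)` are all assumptions made for illustration. The `>= threshold` cutoff and the "threshold 0 disables the filter" rule are taken from the comments and docstrings in the tests.

```python
from __future__ import annotations

from collections import Counter


def eligible_workers(
    session_id: str,
    worker_ids: list[str],
    rejects: Counter,
    threshold: int,
) -> list[str]:
    """Hypothetical helper: candidates left after the per-(session, D) filter.

    `rejects` maps (session_id, worker_id) -> reject count (assumed layout).
    """
    # threshold <= 0 is treated as "migration off": nothing is filtered.
    if threshold <= 0:
        return list(worker_ids)
    # Keep only workers this session has been rejected by fewer than
    # `threshold` times. Other sessions' counters are never consulted.
    allowed = [w for w in worker_ids if rejects[(session_id, w)] < threshold]
    if allowed:
        return allowed
    # Degenerate fallback: every worker is blacklisted, so return the one
    # with the fewest rejects (ties go to the first in iteration order).
    fewest = min(worker_ids, key=lambda w: rejects[(session_id, w)])
    return [fewest]
```

Because the fallback always returns one worker, a caller that scores the returned candidates can never end up with an empty set, which is the property `test_select_returns_valid_decision_under_full_blacklist` checks at the policy level.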


@@ -0,0 +1,189 @@
"""Unit tests for Algorithm 1 (KvAwarePolicy score_candidate).
Reference: docs/KVC_ROUTER_ALGORITHM.md §3.1. The lex-score is

    (overlap + sticky_bonus*sticky + floor_bonus,
     sticky,
     -inflight,
     -assigned)

These tests pin down the qualitative properties that the algorithm's
correctness arguments rely on. They run without SGLang/GPU.
"""
from __future__ import annotations
from agentic_pd_hybrid.policies import score_candidate


def _score(**overrides):
    """Helper: build a score with all defaults and per-test overrides."""
    args = dict(
        overlap=0,
        sticky=False,
        inflight=0,
        assigned=0,
        mean_assigned=0.0,
        sticky_bonus=1,
        load_floor_bonus=0,
    )
    args.update(overrides)
    return score_candidate(**args)


# -- Determinism ----------------------------------------------------------------


def test_score_is_pure():
    """Same kwargs must produce the same tuple (no hidden state)."""
    a = _score(overlap=3, sticky=True, inflight=1, assigned=7)
    b = _score(overlap=3, sticky=True, inflight=1, assigned=7)
    assert a == b


def test_score_returns_4_tuple():
    s = _score()
    assert isinstance(s, tuple)
    assert len(s) == 4
    assert all(isinstance(x, int) for x in s)


# -- Primary term: overlap dominates sticky --------------------------------------


def test_overlap_strictly_dominates_pure_sticky():
    """Theorem-2 building block: any positive overlap on a non-sticky D wins
    against a sticky-only D with zero overlap (sticky_bonus=1)."""
    overlap = _score(overlap=2, sticky=False)
    sticky_only = _score(overlap=0, sticky=True)
    assert overlap > sticky_only


def test_overlap_plus_sticky_beats_overlap_alone():
    """Two D's with equal overlap: sticky one wins (sticky_bonus contributes
    to primary AND wins tie-1)."""
    sticky_d = _score(overlap=5, sticky=True)
    fresh_d = _score(overlap=5, sticky=False)
    assert sticky_d > fresh_d


# -- Tie breakers ----------------------------------------------------------------


def test_tiebreaker_inflight_lower_wins():
    """Equal primary & sticky: prefer the D with fewer in-flight requests."""
    low = _score(overlap=3, sticky=False, inflight=0, assigned=10)
    high = _score(overlap=3, sticky=False, inflight=5, assigned=10)
    assert low > high


def test_tiebreaker_assigned_lower_wins():
    """Equal primary & sticky & inflight: prefer rarely-picked D."""
    rare = _score(overlap=3, sticky=False, inflight=2, assigned=1)
    frequent = _score(overlap=3, sticky=False, inflight=2, assigned=99)
    assert rare > frequent


def test_tiebreaker_strict_lex_order():
    """Sticky always beats non-sticky on tie-1 even if non-sticky has lower
    inflight (the lex order is strict, position 1 outranks positions 2/3)."""
    # With sticky_bonus=1 added to position 0, two D's with the same overlap
    # would be decided on position 0 (e.g. 4+1=5 > 4) and never reach tie-1.
    # Force equal primaries by giving the sticky D one less overlap.
    sticky_busy_eq_primary = _score(overlap=3, sticky=True, inflight=10, assigned=10)
    fresh_idle_eq_primary = _score(overlap=4, sticky=False, inflight=0, assigned=0)
    # Now equal primary (3+1=4 vs 4). Sticky wins position 1.
    assert sticky_busy_eq_primary[0] == fresh_idle_eq_primary[0]
    assert sticky_busy_eq_primary > fresh_idle_eq_primary


# -- Load-floor bonus ------------------------------------------------------------


def test_load_floor_disabled_by_default():
    """load_floor_bonus=0 → no contribution to primary."""
    s = _score(overlap=0, sticky=False, mean_assigned=10, assigned=0)
    assert s[0] == 0


def test_load_floor_gated_off_when_sticky():
    """Even with load_floor_bonus>0, sticky D does NOT receive the boost.
    Otherwise a session would migrate away from its warm D under load."""
    sticky_under_loaded = _score(
        overlap=0, sticky=True, mean_assigned=10, assigned=0, load_floor_bonus=200
    )
    # primary = overlap(0) + sticky_bonus(1) + floor(0) = 1
    assert sticky_under_loaded[0] == 1


def test_load_floor_zero_when_mean_zero():
    """Warmup case: mean_assigned=0 -> no D gets boost -> degenerate to lex
    tiebreak by iteration order."""
    s = _score(
        overlap=0, sticky=False, mean_assigned=0, assigned=0, load_floor_bonus=200
    )
    assert s[0] == 0


def test_load_floor_proportional_to_deficit():
    """floor_bonus = K * deficit / mean. assigned=0, mean=10, K=200 -> 200."""
    s_zero = _score(
        overlap=0, sticky=False, mean_assigned=10, assigned=0, load_floor_bonus=200
    )
    s_half = _score(
        overlap=0, sticky=False, mean_assigned=10, assigned=5, load_floor_bonus=200
    )
    s_full = _score(
        overlap=0, sticky=False, mean_assigned=10, assigned=10, load_floor_bonus=200
    )
    # deficit = max(0, 10-0)=10 -> bonus = int(200*10/10) = 200
    # deficit = max(0, 10-5)=5 -> bonus = int(200*5/10) = 100
    # deficit = max(0, 10-10)=0 -> bonus = 0
    assert s_zero[0] == 200
    assert s_half[0] == 100
    assert s_full[0] == 0


def test_load_floor_does_not_underflow_when_overloaded():
    """assigned > mean -> deficit clamped to 0, no negative bonus."""
    s = _score(
        overlap=0, sticky=False, mean_assigned=10, assigned=50, load_floor_bonus=200
    )
    assert s[0] == 0


# -- Routing intent: real overlap beats load-floor bonus -------------------------


def test_real_prefix_overlap_beats_load_floor_on_warm_d():
    """E1_E2_FIX_DESIGN_ZH §Q2: load_floor should be set such that
    real per-session prefix overlap outweighs the cold-D bonus.
    With overlap=800 (a per-session prefix) and load_floor_bonus=200,
    a warm D (high overlap, possibly high load) should still win against
    a cold D with floor bonus."""
    warm = _score(
        overlap=800, sticky=True, mean_assigned=10, assigned=10, load_floor_bonus=200
    )
    cold = _score(
        overlap=0, sticky=False, mean_assigned=10, assigned=0, load_floor_bonus=200
    )
    # warm primary = 800 + 1 + 0 = 801. cold primary = 0 + 0 + 200 = 200.
    assert warm[0] == 801
    assert cold[0] == 200
    assert warm > cold


def test_boilerplate_overlap_loses_to_load_floor_for_cold_d():
    """Same §Q2: load_floor should beat cross-session boilerplate overlap.
    If load_floor_bonus=200 and the worst-case boilerplate overlap is ~50,
    a fresh cold D should still win against a slightly-warm-from-boilerplate D."""
    warm_boilerplate = _score(
        overlap=50, sticky=False, mean_assigned=10, assigned=10, load_floor_bonus=200
    )
    cold_under_loaded = _score(
        overlap=0, sticky=False, mean_assigned=10, assigned=0, load_floor_bonus=200
    )
    # warm_boilerplate primary = 50 + 0 + 0 = 50 (assigned=mean, no deficit).
    # cold_under_loaded primary = 0 + 0 + 200 = 200.
    assert cold_under_loaded > warm_boilerplate
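
For readers without the implementation at hand, here is one scoring function that satisfies every assertion in this file. It is a sketch reconstructed from the module docstring and the test comments, not the code in `agentic_pd_hybrid.policies`: the name `score_candidate_sketch` is hypothetical, and the exact gating conditions (bonus applies only when the D is not sticky and `mean_assigned > 0`) are inferred from the tests.

```python
from __future__ import annotations


def score_candidate_sketch(
    *,
    overlap: int,
    sticky: bool,
    inflight: int,
    assigned: int,
    mean_assigned: float,
    sticky_bonus: int = 1,
    load_floor_bonus: int = 0,
) -> tuple[int, int, int, int]:
    """Hypothetical lex-score: larger tuples are preferred."""
    floor_bonus = 0
    # The load-floor bonus is gated off for sticky D's and during warmup
    # (mean_assigned == 0), matching the gating the tests describe.
    if load_floor_bonus > 0 and not sticky and mean_assigned > 0:
        # Deficit is clamped at 0 so an overloaded D never gets a penalty.
        deficit = max(0.0, mean_assigned - assigned)
        floor_bonus = int(load_floor_bonus * deficit / mean_assigned)
    primary = overlap + (sticky_bonus if sticky else 0) + floor_bonus
    # Negating inflight/assigned makes "fewer is better" under tuple ordering.
    return (primary, int(sticky), -inflight, -assigned)
```

Python compares tuples lexicographically, so the caller can pick a winner with `max(...)` over candidate scores: position 0 decides unless it ties, then sticky, then in-flight load, then assignment count.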