12 Commits

Author SHA1 Message Date
7568e041ff docs(kvc): record real Ali KVC experiment results 2026-05-12 05:28:06 +00:00
4e8f943875 feat(kvc): add real Ali replay workflow 2026-05-12 05:28:00 +00:00
kzlin
1d51704dad docs(kvc): agentic-fit analysis, refactor plan, validation report
Three new docs covering the structural-fit investigation:

- AGENTIC_FIT_ANALYSIS_ZH.md: §1-§7 of structural design issues that
  surface KVC vs vanilla DP gap on real agentic workloads (SWE 50sess).
  Quantifies session pinning, LRU shortfall, P-side imbalance,
  time-scale distortion, etc., with code citations and N=3 rerun data.

- REFACTOR_PLAN_ZH.md: KISS-edition refactor plan. After verifying the
  original "estimate inflation" and "resident_blocks aging" claims were
  not real bugs, scope shrinks to one code change (backpressure) plus a
  4-run smoke sweep within an 8h budget.

- STRUCTURAL_VALIDATION_REPORT_ZH.md: validates §1-§7 claims using
  existing v5 baseline rerun data + 8DP CA baseline. Each claim labeled
  fully-supported / indirect / retracted with the data source. Notes
  that backpressure E2E validation is pending GPU smoke run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 21:30:11 +08:00
kzlin
7affb565b2 feat(kvc): add backpressure smoke sweep + analyzer (and v6 p1 profile script)
scripts/sweep_backpressure_smoke.sh: 4-run smoke matrix (KVC baseline /
KVC + backpressure / KVC + backpressure @ time-scale=1 / DP @
time-scale=1) designed to fit ~3-4h GPU budget. Validates §3 backpressure
implementation and partially probes §7 time-scale distortion.

scripts/analysis/analyze_backpressure_smoke.py: consumes the new
structural/* jsonl files plus request-metrics; emits headline metrics,
backpressure histograms, admission probe stats, and per-session pinning
distribution.

scripts/sweep_tp1_v6_p1_profile.sh: pre-existing v6 P1 profile sweep
script (was untracked; included for completeness).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 21:29:56 +08:00
kzlin
c47adaf8e3 feat(kvc): honor admission backpressure hints + structural event logging
Replay-side changes paired with the SGLang admission hint:

- DecodeResidencyState gains pause_until_s; admission probe parses
  recommended_pause_ms and updates the per-D pause window.
- _wait_for_decode_pause is invoked at request entry points
  (_invoke_router, _invoke_session_direct) so requests stall before
  hitting a saturated D instead of timing out via mooncake.
- New CLI flags: --enable-backpressure (default off, baseline preserved),
  --backpressure-max-pause-s (cap on per-request sleep, default 2s).

Structural instrumentation written under <run_dir>/structural/:
- admission-events.jsonl: every admission probe (RTT, queue_depth,
  pause_ms, available_tokens, evicted_count)
- backpressure-events.jsonl: every actual pause sleep
- session-d-binding.jsonl: per-request policy decision

Used to validate the structural claims documented separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 21:29:46 +08:00
kzlin
ca4b64c79a feat(sglang): expose backpressure pause hint in admit_direct_append
Add `recommended_pause_ms` field to DirectAppendAdmissionReqOutput so D
can advise callers when its transfer queue is heavy or KV pool is near
capacity. The hint is computed from transfer_queue_depth,
retracted_queue_depth, and post-trim token_usage; thresholds are simple
heuristics (>0.90 usage, >=8 queue depth, retracted>0).

Default behavior is unchanged for callers that ignore the field.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 21:29:30 +08:00
kzlin
4978c0d0cd profile(kvc): rewrite v5+profile report after critic audit + P0/P1 instrument
Hostile audit of the original report flagged three load-bearing errors:

1. held_tokens semantics were inverted. session_held_tokens() at
   session_aware_cache.py:278-282 sums (kv_allocated_len - cache_protected_len)
   per slot, i.e. slot-private (NOT in radix tree). So "other = cap - held -
   avail" actually CONTAINS the radix-tree protected prefix cache (likely the
   single biggest component for shared agentic prefixes), not just running
   batch + in-flight as the original report claimed.

2. Admission-race causal hypothesis for the 415 EXP2+profile errors is
   contradicted by the data: 414/415 errors have kv_transfer_blocks > 0 — they
   passed admission and died downstream ("generate stream ended before
   producing any token", raised by the client when a 200 response had an empty
   stream).

3. Polling deconfound was too quickly dismissed. Mode counts shift ~1:1
   (session-cap-fb -356 / kvcache-centric +406), and /server_info is not a
   passive read — it dispatches into the scheduler main loop and iterates
   every session slot.

Plus: per-D error% confounded by sticky session affinity (only 18 unique
sessions cause 415 errors, decode-3 had 0 errors only because no high-error
session landed there); decile 10 "recovery" was an equal-time binning
artifact (24.5% under equal-count); v5 vs v5+profile time gap was 21h not
6h; p50/p90 latency comparison is N=1.

Rewritten report (docs/V5_PROFILE_INVESTIGATION_ZH.md) marks each correction
with ⚠️ and demotes admission-race to one of four hypotheses (H1-H4).

Action items split into P0 (verify, must do first) and P1 (instrument):

P0 — scripts/sweep_tp1_v5_baseline_rerun_exp2.sh runs 3x v5 baseline EXP2
(no polling, identical config to the original v5 run) to test whether the
9-error baseline result is reproducible. If 3 runs give ~9 errors and
profile gives 415, polling is the leading suspect. Currently running
in background.

P1 — scheduler.py:_compute_pool_breakdown_for_diagnostics adds a read-only
"pool_breakdown" dict to /server_info covering: radix_evictable_tokens,
radix_protected_tokens, slot_private_held_tokens, session_slot_count,
running_batch_{reqs,kv_tokens}, transfer_queue_{reqs,tokens},
prealloc_queue_{reqs,tokens}, retracted_queue_{reqs,tokens}. With these,
"unaccounted = cap - sum(known)" exposes true leakage. replay.py captures
all fields into the per-tick row; analyzer prints the decomposition and
gracefully handles old timeseries (prints "P1 instrument absent").

Mock-tested end-to-end. SGLang patch is read-only and does not affect
admission/scheduling. Old v5+profile data still analyzes correctly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 22:29:21 +08:00
kzlin
51f5386691 profile(kvc): add D KV pool timeseries poller + analyzer for v6 root-cause
v5 dropped errors but pushed session-cap fallback to 46-51%. Before adding
v6 mitigations we need to attribute that capacity loss to one of:
  (a) active sessions — real footprint
  (b) idle-evictable sessions — LRU not aggressive enough
  (c) prefill backup blocks / in-flight / fragmentation — release timing

Without this it's all guessing. Plumb a 1Hz poller into replay that hits
each P/D worker's /server_info, captures session_cache + memory_usage, and
writes a per-worker time-series JSONL to <run_dir>/d-pool-timeseries.jsonl.
Off by default (--pool-poll-interval-s 0); v5+profile sweep enables it at
1.0s. Per-tick HTTP cost is ~8 parallel /server_info calls — negligible
relative to the 50min run.

Analyzer (scripts/analysis/analyze_pool_timeseries.py) decomposes each D's
capacity into active_held / idle_evictable / other (= cap-held-avail, the
backup-blocks bucket) / free, and reports session residency churn across
workers as a starvation/thrashing signal.

Mock-tested poller end-to-end (cancellation clean, file flushed, sessions
captured); analyzer validated against synthetic timeseries.

Next: run scripts/sweep_tp1_v5_optD_profile.sh on hardware (~90min), then
analyze results to pick a v6 direction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 20:04:21 +08:00
kzlin
6572d7f3f4 docs: add v5 chapter (Option D worker-mode admission) and rename to V1_TO_V5
v5 sweep (sweep_tp1_v5_optD.sh) lands the previously-deferred Option D:
worker admission_mode authoritative for direct_append + seed + reseed,
bypassing replay's local _decode_session_soft_cap.

Key findings now documented:
- errors collapse from 9-10% to 0.2% (mooncake timeouts gone)
- session-cap fallback rises 33-35% -> 46-51% — D's true KV pool is the
  binding constraint, not replay's estimator; v4's "low fallback" was
  hiding capacity overruns as transfer-timeout errors
- direct-to-D subset latency unchanged from v4 (admission overhead negligible)
- new bottleneck: D's physical KV pool — points v6 at prefill backup release
  timing, priority eviction tuning, chunked seed, cross-D session migration,
  and real RDMA

Also adds a 5th lesson on errors-vs-fallback reciprocity and updates the
code index with the v5 endpoint extension and new CLI knobs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 16:13:25 +08:00
kzlin
6e5ed8da80 feat(kvc): Option D - delegate seed/reseed admission to D worker
v4 (cap=16) saw 35% session-cap fallback because the local soft_cap
min(16, usable / target) evaluates to 1-2 for large agentic inputs.
The cap was hit not because D was full but because replay's heuristic
underestimated capacity.

This change makes worker admission_mode authoritative for ALL paths:

SGLang side:
- io_struct.py: DirectAppendAdmissionReqInput gains a `mode` field
  ("direct_append" | "seed", default "direct_append" preserves prior
  behavior).
- scheduler.py:admit_direct_append: when mode == "seed", skip the
  resident-on-D requirement and run the same capacity check + LRU
  eviction (maybe_trim_decode_session_cache) that direct_append uses.
  This lets D atomically decide if a new session can be admitted based
  on actual token_to_kv_pool_allocator state.

Replay side (replay.py):
- _query_decode_direct_admission gains a `mode` parameter.
- _reserve_decode_session_capacity: in worker admission_mode, the
  seed/reseed branch now queries D with mode="seed" and trusts the
  result, instead of estimating capacity from the residency snapshot.
- _should_admit_new_decode_session: in worker mode, skip the local
  soft_cap pre-check and let D decide. Same-D session fast-path is
  preserved.

Effects:
- Local hardcoded cap of 16 is bypassed under worker mode; D's real
  KV pool size is the only constraint.
- LRU eviction runs in D's process atomically with admission, so
  starvation (the v3 bimodal "lucky vs starved sessions" pattern)
  should resolve.

scripts/sweep_tp1_v5_optD.sh added to run the same 1P7D / 2P6D
configs as v4 with the new admission path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 23:40:03 +08:00
kzlin
74194e660a docs: v4 final results, error analysis, and updated journey
Add v4 sweep results and post-mortem analysis showing:

- direct-to-D path: 54.3% (1P7D) / 58.0% (2P6D) of requests now use
  KVC cleanly. P50=0.5s and TTFT P50=0.043s; this path beats baseline
  8DP across the board (P50 -24%, TTFT P50 -54%, TTFT P90 -79%).

- Overall vs baseline (errors+truncated excluded):
  v4 2P6D P50=0.85s vs baseline 0.66s (28% slower).
  Reason is not errors -- 35% of requests still hit
  fallback-large-append-session-cap, where capacity-based
  cap = usable_tokens / target_tokens evaluates to 1-2 (not 16)
  for large agentic inputs.

- 9-10% errors on KVC variants are mooncake TCP transfer timeouts,
  not SGLang logic bugs. Prefill log shows
  "Failed to send kv chunk ... 32s timeout ... session not alive".
  Errors concentrate in turn>=31 (large inputs) after run >44.8%.

Track:
- docs/KVC_DEBUG_JOURNEY_V1_TO_V4.md: append v4 results table,
  per-mode breakdown, and error root cause.
- scripts/analysis/{analyze_v3,analyze_v4,analyze_errors,compare_no_error}.py
- outputs/qwen3-30b-tp1-v{3,4}*/exp*_summary.json (force-added,
  small JSON; metrics.jsonl excluded due to size).
- outputs/qwen3-30b-tp1-v{3,4}*/sweep_results.txt

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 23:34:01 +08:00
kzlin
c9d350b372 docs: KVC v1-v4 debug journey + raise session soft_cap to 16
Document the iterative debugging from v1 (broken KVC) through v4
(routing fixed + session cap raised), with code-level analysis of
the two main bugs encountered:

1. v2 root cause (mis-diagnosed previously as `allow_local_prefill`):
   `--policy default` for KVC mechanism caused replay's round-robin
   policy and the PD router's round-robin to diverge, sending requests
   with `session_params` to a D worker that did not have the session
   open. Resulted in 56-61% truncation with finish_reason
   "session id X does not exist".
   Fix: use `--policy kv-aware` (sweep_tp1_v3_kvaware.sh) so replay
   emits `x-smg-target-worker` and PD router uses consistent_hashing.

2. v3 new bottleneck: `pd-router-fallback-large-append-session-cap`
   dominated 52-65% of requests. Root cause was hardcoded
   `min(4, ...)` in `_decode_session_soft_cap`. With 7 D workers x 4
   sessions = 28 slots for 52 trace sessions, ~24 sessions starved
   permanently (bimodal direct-to-D rate of 0% or 99%).
   Fix: raise the cap to 16 (replay.py).

Also includes the v3 finding that direct-to-d-session path P50=0.495s
and TTFT P50=0.043s already beats the 8-way DP baseline (0.65s/0.093s)
- the KVC core mechanism works when fallback paths are avoided.

Files:
- docs/KVC_DEBUG_JOURNEY_V1_TO_V4.md: full journey + code location index
- docs/SWEBENCH_EXPERIMENT_{PROGRESS,RESULTS}.md: prior session notes
- scripts/sweep_tp1_v{2,3,4}*.sh: experiment driver scripts
- src/agentic_pd_hybrid/replay.py: cap 4 -> 16, audit fields
- src/agentic_pd_hybrid/pd_router.py: strip session_params from prefill
- src/agentic_pd_hybrid/metrics.py: truncated_request_count

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 21:10:41 +08:00
50 changed files with 7228 additions and 114 deletions


@@ -0,0 +1,434 @@
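# Structural Design Flaws in Agentic Scenarios
**Date**: 2026-05-06
**Reference data**: `outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/exp2_2p6d_run1_*` (KVC, kv-aware, Option D, 2P6D, 4449 reqs / 52 sessions) + `outputs/qwen3-30b-tp1-exps/exp1_8way_dp_cache_aware_summary.json` (8-way DP cache-aware baseline on the same trace).
**Model**: Qwen3-30B-A3B, TP1, single node with 8×H100 80GB.
**Research question**: Treating the SWE trace as representative of "real agentic" workloads, where does the KVC mechanism systematically lose to vanilla DP, that is, what are the structural causes beyond "D capacity is 4.6× oversubscribed"?
> This document supplements `docs/KVC_DEBUG_JOURNEY_V1_TO_V5.md` and `docs/V5_PROFILE_INVESTIGATION_ZH.md`: beyond version history and bottleneck localization, it asks at the design level which assumptions do not match real agentic workloads.
---
## TL;DR
Structural flaws, ordered by importance: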
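| # | Flaw | Evidence | Fix direction | Effort |
|---|---|---|---|---|
| 1 | **KvAwarePolicy is blind to D capacity; a session is permanently pinned to the D it first landed on** | Average number of distinct Ds visited per session = **1.00**; the direct-to-D hit rate is sharply bimodal (15 sessions at 0-20%, 14 sessions at 80-100%) | Add a capacity-aware term to the score function; allow cross-D session migration | Medium |
| 2 | **D-side LRU can only evict idle sessions; hot sessions can never be evicted** | Only 9-43 trim events per D over the whole run vs 80-150 transfer errors; token_usage peaks at 1.00 | Add score-based eviction (multiple tiers by access frequency / recency) | Medium |
| 3 | **No D→Router→Replay backpressure channel** | Concurrency stays at 32 throughout; replay is unaware when D fails | Add `recommended_pause_ms` to the admission response; replay lowers concurrency accordingly | Small |
| 4 | **Admission HTTP round-trip is coupled to the scheduler main loop** | v5+profile added only 1 Hz polling and errors rose from 9 to 415 | Split into a lock-free `/probe` plus a `/commit_evict` that enters the scheduler queue | Medium |
| 5 | **P-side round-robin is blind to D health** | prefill-0 logged 367 KVTransferError and prefill-1 only 4, although their request counts are nearly equal | Have the router consider target-D health when choosing P | Medium |
| 6 | **Replay-side session footprint estimate is inflated ~30×** | `_estimate_session_resident_tokens = input + output` treats the 80K context at turn 50 as "needs 80K of fresh space" | Estimate incremental tokens instead | Small |
| 7 | **time-scale=10 pushes the test into a distorted regime** | inter-turn gap p50 is compressed from 2.5s to 0.25s, which removes the "natural idle windows" KVC is meant to exploit | Run a time-scale=1 baseline to verify | Small (config only) |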
**The most important comparison**: 8-way DP cache-aware on the same trace, hardware and model (no PD disaggregation, no KVC, no session abstraction):
| Metric | 8-way DP CA | v5 KVC 2P6D |
|---|---|---|
| Errors | **0** | 372 (8.4%) |
| Latency mean | **1.43s** | 3.50s |
| Latency P50 | **0.65s** | 1.11s |
| Latency P99 | **8.37s** | 20.37s |
| TTFT mean | **0.12s** | 2.13s |
| TTFT P90 | **0.26s** | 6.47s |
| Per-worker request count range | 508-619 (±10%) | 561-858 (±26%) |
**Naive DP wins on every metric; KVC's mean latency, for example, is 145% higher (3.50s vs 1.43s).** This defines the baseline that KVC "must beat" on this workload.
---
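## 1. Sessions permanently pinned to a D + capacity-blind selection (the core problem)
### 1.1 Symptoms
Over the whole run, each session visits only **1.00 distinct D workers** (see the data above). Combine that with the distribution of direct-to-D hit rates:
```
direct-to-D hit-rate buckets (n=52 sessions)
0-20%: 15 sessions ← nearly every turn fails and falls back to a full P→D transfer
20-40%: 7
40-60%: 11
60-80%: 5
80-100%: 14 sessions ← nearly every turn takes the direct-to-D fast path
```
**More than half of the sessions (29 of 52) sit in the two extreme buckets**, a typical signal of unfair resource allocation.
Starved and well-served sessions differ clearly in workload size:
- Starved sessions, average peak input: 56,011 tokens
- Well-served sessions, average peak input: 31,344 tokens (**a 1.8× gap**)
**Large sessions tend to be starved**, because on a D whose capacity is already tight they are more likely to trigger an admission rejection.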
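The bucket table above can be reproduced from per-request metrics with a few lines. This is a minimal sketch, not the project's analyzer: the row field names `session_id` and `execution_mode` are assumptions about the metrics schema, and only the mode string `kvcache-direct-to-d-session` comes from the reports themselves.

```python
from collections import Counter, defaultdict

DIRECT_MODE = "kvcache-direct-to-d-session"  # mode name as used in the reports


def direct_to_d_hit_rates(rows):
    """Return {session_id: fraction of that session's turns served via direct-to-D}."""
    total = defaultdict(int)
    direct = defaultdict(int)
    for row in rows:
        sid = row["session_id"]          # assumed field name
        total[sid] += 1
        if row["execution_mode"] == DIRECT_MODE:  # assumed field name
            direct[sid] += 1
    return {sid: direct[sid] / total[sid] for sid in total}


def bucket_counts(rates, n_buckets=5):
    """Count sessions per equal-width bucket; a rate of exactly 1.0 goes in the last bucket."""
    counts = Counter()
    for rate in rates.values():
        idx = min(int(rate * n_buckets), n_buckets - 1)
        counts[idx] += 1
    return [counts[i] for i in range(n_buckets)]
```

A strongly bimodal result (large first and last entries) is the starvation signature described above.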
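### 1.2 Root cause (code level)
`policies.py:166-172`, `KvAwarePolicy.select`:
```python
score = (
    overlap + sticky * self.sticky_bonus,  # primary: historical KV overlap
    sticky,                                # secondary: whether this is last_decode_worker
    inflight_penalty,                      # tertiary: current inflight count (small)
    assignment_penalty,                    # quaternary: cumulative assignment count (smaller)
)
```
The score contains **no term for D's current capacity**. When session X first lands on D-2, its hash_ids accumulate on D-2; from then on, however full D-2 gets, turn N+1 of X is still scored onto D-2 (because overlap dominates).
Worse, `RoutingState.decode_resident_blocks` (`policies.py:46`) never shrinks: even after D has evicted some blocks, replay still believes they are resident. By mid-run every D's overlap set approaches "all hash_ids in the trace", and the policy degenerates into pure sticky routing.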
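### 1.3 Consequences, as experienced by individual sessions
**Starved session (e.g. session 50400, 105 turns, 0 direct-to-D hits), per-turn flow**:
1. The policy picks a D (always the same one)
2. Admission rejects (D's capacity is already taken)
3. The request takes fallback-session-cap → P runs a full prefill of 50K-100K tokens
4. mooncake pushes the KV → D still has no room → 32s timeout or KVTransferError
5. The user sees 5-10s latency on every turn, with repeated errors
**Well-served session (e.g. session 3840, 118 turns, 97% direct-to-D), per-turn flow**:
1. The policy picks a D (always this session's initial D)
2. Admission passes (this session has held a slot on this D all along)
3. direct-to-D: D runs append-prefill on a few hundred tokens, with zero P involvement and zero mooncake transfer
4. TTFT 0.043s, E2E 0.495s
**This is not "a bit slower on average"; it is structural unfairness.** From an SLO point of view, P99 is pulled out by the tail of the 15 starved sessions.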
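### 1.4 Why naive DP wins anyway
8-way DP cache-aware uses pure hash-based routing, with no session abstraction and no PD disaggregation:
- Each request is routed to a worker by prefix hash → turns of the same session naturally get prefix hits on that worker
- Under capacity overload, SGLang's own radix cache + scheduler manage the KV pool in one place
- There is no admission/fallback/reseed path
- There is no mooncake transfer
- Per-worker load deviation is ±10% (vs ±26% for KVC), so load stays close to balanced without extra machinery
**The three pieces KVC introduces (session affinity, KV reuse, admission) make the imbalance worse when capacity is tight, and none of them closes the gap to DP.**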
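### 1.5 Fix direction
Add the following inside `KvAwarePolicy.select`:
```python
# Current capacity utilization of D (already queryable via worker-mode admission)
capacity_penalty = -worker_capacity_used_ratio[worker.worker_id]
# When several Ds have overlap, prefer the emptiest one by capacity;
# when a D's utilization exceeds the threshold, exclude it from the candidates
if worker_capacity_used_ratio[worker.worker_id] > HARD_CAP:
    continue
score = (
    overlap_capped,    # overlap, but clamped so a single D cannot win forever
    capacity_penalty,  # ← new
    sticky,
    inflight_penalty,
)
```
A more aggressive fix: after a session has been rejected N times in a row by some D, proactively release its session state on that D and **allow the next turn to go to a different D** (the cost is losing the accumulated KV, but the fallback path loses it today anyway).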
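The "release after N consecutive rejections" idea can be sketched as a small counter on the replay side. Everything here is hypothetical: the class name, the default threshold of 3, and the way the caller would actually release the session on D are illustrative assumptions, not existing project code.

```python
from collections import defaultdict


class RejectionTracker:
    """Counts consecutive admission rejections per (session, decode worker)."""

    def __init__(self, max_consecutive_rejects=3):
        self.max_consecutive_rejects = max_consecutive_rejects
        self._rejects = defaultdict(int)

    def record(self, session_id, worker_id, admitted):
        """Record one admission result; return True if the session should be unpinned."""
        key = (session_id, worker_id)
        if admitted:
            # Any successful admission resets the streak.
            self._rejects.pop(key, None)
            return False
        self._rejects[key] += 1
        if self._rejects[key] >= self.max_consecutive_rejects:
            # Threshold reached: forget the streak and tell the caller to release
            # the session on this D so the next turn may be scored onto another D.
            del self._rejects[key]
            return True
        return False
```

Resetting on success keeps well-served sessions pinned, so only sessions that are starved for several turns in a row pay the cost of losing their accumulated KV.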
---
## 2. D-side LRU eviction cannot keep up with the pressure
### 2.1 Data
Per D, over the whole run:
| Worker | Trim events (active LRU) | KVTransferError + OOM | Peak token_usage |
|---|---:|---:|---:|
| decode-0 | 9 | 0 | 0.99 |
| decode-1 | 43 | 12 (4 err + 8 oom) | 0.99 |
| decode-2 | 16 | 459 (153 err + 306 oom) | 0.97 |
| decode-3 | 37 | 87 (29 err + 58 oom) | 0.99 |
| decode-4 | 28 | 270 (90 err + 180 oom) | **1.00** |
| decode-5 | 30 | 279 (93 err + 186 oom) | **1.00** |
**On the error-heavy workers (decode-2, decode-4, decode-5), LRU trims are roughly 9-29× rarer than errors.** D-4 and D-5 go all the way to token_usage=1.00.
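### 2.2 Root cause
`scheduler.py:2040`, the idle check in `evict_idle_streaming_sessions_lru`:
```python
# Can only evict sessions where all reqs are finished and the session is in streaming mode
```
Under high SWE concurrency, however, nearly every session has an inflight req at all times, and time-scale=10 compresses the inter-turn gaps further. **Hot sessions never go idle, so LRU never finds anything to evict.** D runs up to 100%, and the next incoming transfer hits OOM/timeout directly.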
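### 2.3 Fix direction
Introduce tiered eviction:
1. **Idle sessions first** (current behavior)
2. **Cold sessions next** (no access in the last N seconds; even if such a session has an inflight req, that req can be retracted to make room)
3. **Forced retract of hot sessions** (when the hard cap is hit)
Vanilla SGLang already has a `disagg_decode_prealloc_queue.retracted_queue` mechanism (see the references in `admit_direct_append`), but **nothing triggers a retract proactively**: today requests enter retracted_queue only on internal exceptions. Retract needs to become part of the normal admission path.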
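The three tiers above can be sketched as a victim-selection function. This is an illustration under stated assumptions, not SGLang code: `SessionInfo`, `pick_victims`, and the `cold_after_s` threshold are hypothetical names, and the sketch only chooses victims; it does not perform the eviction or retract.

```python
from dataclasses import dataclass


@dataclass
class SessionInfo:  # hypothetical view of one session slot on a D worker
    session_id: str
    held_tokens: int
    inflight: int         # number of unfinished requests
    last_access_s: float  # time of the most recent access


def pick_victims(sessions, need_tokens, now_s, cold_after_s=5.0, hard_cap_hit=False):
    """Pick sessions to evict/retract, cheapest tier first, until need_tokens is covered.

    Tier 0 = idle, tier 1 = cold, tier 2 = hot (eligible only when hard_cap_hit).
    If the eligible sessions cannot cover need_tokens, all of them are returned.
    """
    def tier(s):
        if s.inflight == 0:
            return 0
        if now_s - s.last_access_s >= cold_after_s:
            return 1
        return 2

    max_tier = 2 if hard_cap_hit else 1
    candidates = sorted(
        (s for s in sessions if tier(s) <= max_tier),
        key=lambda s: (tier(s), s.last_access_s),  # within a tier, oldest access first
    )
    victims, freed = [], 0
    for s in candidates:
        if freed >= need_tokens:
            break
        victims.append(s.session_id)
        freed += s.held_tokens
    return victims
```

Keeping hot sessions out of the candidate set unless the hard cap is hit preserves the fast path for active sessions while still giving the allocator a way out at 100% usage.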
---
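## 3. No D→Replay backpressure channel
### 3.1 Terminology
**Backpressure** = when the downstream side of a streaming system is overloaded, it sends a signal back upstream so the sender slows down. Examples: the TCP sliding window, Kafka consumer lag, gRPC HTTP/2 flow control.
### 3.2 Current state
- D's transfer queue piles up → timeout after 32s → KVTransferError is raised
- The error propagates back to P → P passes it to the router → the router passes it to replay → replay takes the fallback path
- **Nowhere along this chain is there a "D is overloaded, slow down" signal**, so concurrency stays at its upper limit
Consequence: once D starts failing, it **keeps failing** (because replay does not slow down) until D works through its backlog on its own.
### 3.3 Fix direction
Add to the `admit_direct_append` response:
```python
{
    "can_admit": ...,
    "recommended_pause_ms": int,  # ← new: suggested wait before sending the next request of this kind
    "queue_depth": int,           # ← new: current depth of D's transfer queue
    # ... other existing fields unchanged
}
```
When admission is rejected, replay lowers concurrency or backs off according to `recommended_pause_ms`. **This is the cheapest of all the changes**: no protocol change and no change to SGLang internals, only code on the two ends.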
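On the replay side, honoring the hint amounts to keeping a per-D "pause until" timestamp and capping how long any one request may wait. The sketch below assumes a hypothetical `DecodePauseWindow` class; the field name `pause_until_s` and the 2s default cap mirror the commit notes for `--backpressure-max-pause-s`, but the code is not taken from `replay.py`.

```python
import time


class DecodePauseWindow:
    """Per-D pause window driven by a recommended_pause_ms hint (illustrative only)."""

    def __init__(self, max_pause_s=2.0, clock=time.monotonic):
        self.max_pause_s = max_pause_s
        self._clock = clock          # injectable so the behavior can be tested
        self.pause_until_s = 0.0

    def update(self, recommended_pause_ms):
        """Extend the pause window from a hint; the hint is capped at max_pause_s."""
        pause_s = min(max(recommended_pause_ms, 0) / 1000.0, self.max_pause_s)
        self.pause_until_s = max(self.pause_until_s, self._clock() + pause_s)

    def remaining_s(self):
        """Seconds a new request should still wait before being sent to this D."""
        return max(0.0, self.pause_until_s - self._clock())
```

A caller would invoke `update()` after each admission probe and sleep for `remaining_s()` before sending to that D; the cap bounds the damage if D sends an overly pessimistic hint.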
---
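## 4. Admission RPC coupled to the scheduler: the exact boundary between structural and engineering issues
### 4.1 Symptoms
`docs/V5_PROFILE_INVESTIGATION_ZH.md` reports that adding only 1 Hz `/server_info` polling raised EXP2 errors from 9 to 415. `/server_info` iterates over the session slots inside the scheduler main loop to compute `is_idle`; 1 Hz × 8 workers is enough to disturb scheduling.
Under real load, however, the admission RPC rate is far higher than 1 Hz: turn 1, reseed and direct-to-D each issue one call. With concurrency=32 and 4449 reqs over ~2700s, counting the multiple admission calls per request, the rate is estimated at **16+ admission RPCs per second**.
### 4.2 Structural problem or engineering problem: a precise breakdown
`admit_direct_append` (`scheduler.py:3581`) does two things:
```python
# (a) Read the pool state: cheap
available_tokens = self.token_to_kv_pool_allocator.available_size()
# (b) Trigger an LRU scan: expensive, and it must modify the pool state
trim_result = self.maybe_trim_decode_session_cache(...)
```
| Part | Nature | Solvable by engineering? |
|---|---|---|
| (a) Read pool state | A handful of atomic reads | **Fully solvable**: a lock-free shared-memory snapshot is enough |
| (b) LRU eviction | Modifies the GPU pool and needs exclusive access | **Structural**: with the Python GIL and a shared GPU pool, concurrent modification is not possible |
**Key observation**: under real load (b) is the minority path. Most admissions only need to "check whether there is enough room" and do not need an immediate eviction.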
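### 4.3 Engineering fix
Split the admission API into two endpoints:
```
POST /session_cache/probe ← 90% of traffic
- read-only, lock-free snapshot
- returns (can_admit_estimate, available_tokens, queue_depth)
- does not enter the scheduler queue
POST /session_cache/commit_evict ← 10% of traffic
- called only when probe reports insufficient room
- enters the scheduler queue and performs the actual LRU
- keeps the current admit_direct_append semantics
```
The scheduler writes the snapshot to an mmap shared-memory region at the end of every step (atomic publish); replay reads it through mmap, with zero syscalls and zero serialization. This can sustain thousands of probes per second.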
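### 4.4 On "coroutines / threads / processes / a different language"
| Tool | Actual effect on this problem |
|---|---|
| asyncio coroutines | SGLang already uses them; no help for the scheduler main loop itself |
| Python threads | Blocked by the GIL, and only the scheduler process may modify the GPU pool state |
| Multiple processes | The scheduler is already a separate process; the problem is that **its own step loop** serializes admission and decode |
| orjson / uvloop | Speeds up networking/JSON by 5-10×, but the LRU traversal is not on that hot path |
| Rewriting the scheduler in Rust/C++ | Speeds up the LRU traversal by 5-10×, but **the structural sharing problem remains** |
**The right engineering fix is to redesign the API (split probe / commit), not merely to swap in a faster library or language.**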
---
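## 5. P-side routing is blind to D health
### 5.1 Data
```
prefill-0: 367 KVTransferError, 361 "Decode instance could be dead"
prefill-1: 4 KVTransferError, 0 "Decode instance could be dead"
Request counts:
prefill-0: 2225 requests
prefill-1: 2224 requests ← nearly half each
```
**The two Ps carry almost identical request counts, yet their error counts differ by 92×.** In the logs, prefill-0's errors repeatedly point at one specific D (`10.45.80.47:XXXXX`): it has formed a "death link" with a hot D.
### 5.2 Root cause
P selection in `pd_router.py:43-49` is bare round-robin:
```python
prefill_url, bootstrap_port = self.config.prefill_urls[
    self.prefill_cursor % len(self.config.prefill_urls)
]
```
It has no knowledge of whether D is healthy and will not steer around a P that is stuck fighting with D-X.
### 5.3 Fix direction
When choosing P, the router should use a joint score over (P's current inflight transfer count, target D health). Health can be derived from the `queue_depth` field proposed in §3.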
---
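## 6. Replay-side session footprint estimate inflated ~30×
### 6.1 Code
`replay.py:898-899`:
```python
def _estimate_session_resident_tokens(request: TraceRequest) -> int:
    return request.input_length + request.output_length
```
It is used by `_decode_session_soft_cap` (`replay.py:1051`) and `_should_admit_new_decode_session`.
### 6.2 Problem
For a turn 50 that already has 80K of KV on D:
- Real incremental need: a few thousand new input tokens + a few hundred output tokens = ~3K
- Value returned by the estimate: 80K + 1K = 81K (**inflated ~27×**)
Consequence: router-mode admission misjudges systematically, and sessions that could have been admitted are rejected by replay itself. v5 worker-mode partly fixes this by letting D check its real capacity, **but KvAwarePolicy still uses the inflated estimate when choosing D**, so D selection is still wrong.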
### 6.3 Fix
```python
def _estimate_session_resident_tokens(request: TraceRequest) -> int:
    if request.turn_id == 1:
        return request.input_length + request.output_length
    # turn 2+: only the increment matters for additional reservation
    return max(0, request.input_length - request.cached_tokens) + request.output_length
```
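To make the size of the inflation concrete, the two estimators can be compared on one late turn. This is a worked example with made-up numbers: `Req` is a hypothetical stand-in for `TraceRequest`, and the specific values (80K input, 78K already cached, 1K output) are chosen only to match the "~3K real increment vs 81K estimate" case described in this section.

```python
from dataclasses import dataclass


@dataclass
class Req:  # hypothetical stand-in for TraceRequest
    turn_id: int
    input_length: int
    output_length: int
    cached_tokens: int  # assumed field: tokens of this input already resident on D


def estimate_full(req):
    """Original estimator: treats the whole context as fresh space."""
    return req.input_length + req.output_length


def estimate_incremental(req):
    """Incremental estimator: after turn 1, only uncached input plus output counts."""
    if req.turn_id == 1:
        return req.input_length + req.output_length
    return max(0, req.input_length - req.cached_tokens) + req.output_length


turn50 = Req(turn_id=50, input_length=80_000, output_length=1_000, cached_tokens=78_000)
full, incremental = estimate_full(turn50), estimate_incremental(turn50)
print(full, incremental, full / incremental)
```

With these illustrative values the full estimate is 27 times the incremental one, which is the order of magnitude quoted in the section heading.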
---
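## 7. Measurement distortion from time-scale=10
### 7.1 What it is
`replay.py` scales the `timestamp` field of every request in the original trace by `t / time_scale` and then sends requests on the scaled schedule.
- Original trace span: ~6000s (≈100 minutes)
- time-scale=10 → actual replay span: ~600s (≈10 minutes)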
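The effect of the scaling on inter-turn gaps follows directly from the division: every gap shrinks by the same factor. A minimal sketch, assuming per-session timestamps in seconds (the helper names and the sample timestamps are illustrative, not taken from `replay.py` or the trace):

```python
def scale_timestamps(timestamps, time_scale):
    """Apply the t / time_scale compression described above."""
    return [t / time_scale for t in timestamps]


def inter_turn_gaps(timestamps):
    """Gaps between consecutive turns of one session."""
    ordered = sorted(timestamps)
    return [b - a for a, b in zip(ordered, ordered[1:])]


# Illustrative timestamps for one session (seconds since trace start).
original = [0.0, 2.5, 10.0, 271.0]
scaled = scale_timestamps(original, time_scale=10)

print(inter_turn_gaps(original))
print(inter_turn_gaps(scaled))
```

Because the gaps scale linearly, any idle window shorter than the eviction and admission latency at 1× stays too short at 10×, while many windows that were long enough at 1× fall below that latency once compressed.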
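### 7.2 Why it was designed this way
**Purely to save test time**: a single 1× run takes 100 minutes, so a sweep of 5 versions × 3 repeats = 25h of GPU time; at 10× it takes only 2.5h.
### 7.3 What it distorts
| Dimension | Original trace | replay (time-scale=10) |
|---|---|---|
| inter-turn gap p10 | 1.6s | 0.16s |
| inter-turn gap p50 | 2.5s | 0.25s |
| inter-turn gap p90 | 7.8s | 0.78s |
| inter-turn gap max | 261s | 26s |
Real agentic users/agents pause 2-8 seconds between turns to think, type, or make tool calls. **These gaps are exactly the "natural idle windows" KVC is designed to exploit**: while a session is briefly idle, LRU can evict it and other sessions can be admitted.
time-scale=10 compresses these windows to 0.2-0.8s, **artificially removing the precondition that KVC's design relies on**.
### 7.4 A serious threat to experimental validity
All v3-v6 data is based on time-scale=10, so every earlier conclusion that "KVC loses to the baseline on SWE" carries this distortion. **If the inter-turn gap in a real deployment is 2.5s, KVC may never hit the capacity bottleneck seen here**, because D has time between turns to release and reorganize.
**A separate time-scale=1 baseline comparison should be run** to tell whether KVC loses to DP because the mechanism itself falls short or because the benchmark pushed it into a regime it was not meant for. This is currently the project's **most important validation that has not yet been done**.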
---
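## 8. Application-level abstractions do not need to be introduced at the engine layer (retracted)
An earlier draft claimed that "the framework does not support speculative multi-branch, nested sub-agents, or tool-call interruption"; that was over-abstraction. **Application-level patterns can all be expressed implicitly through timestamps plus independent session_ids**:
| Application pattern | How it appears in the trace | What the inference engine must do |
|---|---|---|
| Tool call returning asynchronously | Large timestamp gap between turn N and N+1 | Nothing; just send requests on schedule |
| Nested sub-agent | Parent session timestamps pause abruptly; the sub-agent has its own session_id | Treat them as two independent sessions (KV need not be shared) |
| Speculative N branches | N independent session_ids sent at the same time | The radix prefix cache hits the shared prefix naturally; no extra abstraction is needed |
**This does not constitute a structural flaw.** It has been removed from the conclusions.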
---
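## 9. Action items (ordered by ROI)
### Priority P0 (fixing these significantly reduces starvation/unfairness)
1. **[§1] Add a capacity-aware penalty to KvAwarePolicy and allow cross-D session migration**: medium effort, largest payoff
2. **[§2] Introduce tiered eviction on D (cold sessions, hot retract)**: medium effort, large payoff
3. **[§7] Run a time-scale=1 baseline**: small effort (config only), but **without it none of the conclusions can be trusted**
### Priority P1 (fixing these shores up engineering stability)
4. **[§3] D→Replay backpressure channel** (pause hint in the admission response): small effort
5. **[§4] Split admission into probe + commit_evict**: medium effort
6. **[§6] Make `_estimate_session_resident_tokens` incremental**: small effort
### Priority P2 (decide after the P0 data is in)
7. **[§5] Consider D health when choosing P**: medium effort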
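## 10. Limitations and unverified assumptions
1. **N=1**: all data comes from a single run (v6 P0 already showed EXP2 errors drifting between 9 and 912, so single-run variance is huge). Every number in this document should be read as a "representative observation", not a "statistically significant conclusion".
2. **time-scale=10 distortion** (§7): the margin by which "KVC loses to DP" may be amplified by the benchmark. This is the largest uncertainty.
3. **Hardware advantage in the 8DP comparison**: DP runs prefill+decode on all 8 workers, while KVC is 2P+6D and only 6 workers can decode. In theory 8 workers vs 6 workers carries a built-in 1.33× decode-concurrency advantage. This document does not normalize for it, but the 8DP advantage is much larger than 1.33× (KVC's mean latency is 145% higher), so the core conclusion (KVC loses systematically on this workload) is unaffected.
4. **mooncake TCP loopback**: all transfer errors are artifacts of single-node TCP emulation. Under RDMA in production the error distribution may be completely different.
5. **Stale `decode_resident_blocks` in KvAwarePolicy** (end of §1.2): the phenomenon is supported by observation (overlap loses its discriminating power mid-run), but **the effect of clearing the stale state has not been measured systematically**.
6. **P-side errors concentrated on prefill-0** (§5.1): the causal chain is speculative and could also be an accidental result of "prefill-0 started earlier + race". It has not been verified with N>1 data.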
## Appendix A: Index of data artifacts
```
outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/
├── exp2_2p6d_run1_metrics.jsonl ← main data source for this document
├── exp2_2p6d_run1_summary.json
├── exp2_2p6d_run2_* (errors=912, evidence of single-run variance)
├── exp2_2p6d_run3_* (errors=396)
└── kvcache-centric-*-20260429T142429Z/logs/
├── decode-{0..5}.log ← §2.1 LRU vs error counts
└── prefill-{0,1}.log ← §5.1 P error distribution
outputs/qwen3-30b-tp1-exps/
├── exp1_8way_dp_cache_aware_summary.json ← reference baseline
└── RESULTS_SUMMARY.md
```
## Appendix B: Related documents
- `docs/PROJECT_OVERVIEW.md`: project goals and implemented features
- `docs/KVC_DEBUG_JOURNEY_V1_TO_V5.md`: version history v1→v5
- `docs/V5_PROFILE_INVESTIGATION_ZH.md`: v5+profile investigation (revised after the critic audit)
- `docs/SWEBENCH_EXPERIMENT_RESULTS.md`: Qwen3.5-35B-A3B SWE experiments


@@ -0,0 +1,367 @@
# KVC Experiment Pitfalls and Code Bug Analysis (v1 → v5)
A record of the pitfalls, misdiagnoses, and the code bugs finally located across the KVC experiments from v1 to v5.
Model: Qwen3-30B-A3B (TP1). Hardware: single node with 8×H100 80GB.
Trace: `qwen35-swebench-50sess.jsonl` (4449 requests, 52 sessions).
## TL;DR
| Version | Key change | Truncation rate | direct-to-D share | P50 | Main bottleneck |
|------|----------|:---:|:---:|:---:|----------|
| v1 (smoke / early) | mechanism runs end to end | - | - | - | - |
| v2 | KVC + `--policy default` | **56.8% / 61.4%** | <0.1% | 0.08s* | Routing mismatch (default policy) |
| v3 | KVC + `--policy kv-aware` | **0.9%** | 30-42% | 1.5-1.8s | session-cap fallback (52-65%) |
| v4 | v3 + soft_cap 4→16 | 1.0% | 54-58% | 1.08 / 0.84s | session-cap fallback 35%, 9-10% mooncake errors |
| v5 | Option D (worker-mode drives seed/reseed) | 0.9% | 41-45% | 1.59 / 1.31s | D's true KV pool capacity is insufficient; fallback actually rises to 46-51% |
`*` The v2 P50 is a bogus number: more than half of the requests were aborted after generating only 1 token.
## v2 pitfall: the default policy is fundamentally incompatible with the KVC mechanism
### Symptoms
Results from `scripts/sweep_tp1_v2_fixed.sh`:
- Exp1 (8-way DP baseline): 4449/4449 succeeded, P50=0.65s, error=0
- Exp2 (1P7D KVC): **2524 truncated (56.8%)**, 18 errors, P50=0.08s* (bogus)
- Exp3 (2P6D KVC): **2733 truncated (61.4%)**, 17 errors, P50=0.08s* (bogus)
Every truncated request has `actual_output_tokens=1` and `finish_reason="abort: session id X does not exist"`.
### The wrong early diagnosis
`RESULTS_SUMMARY.md` previously blamed SGLang's `--disaggregation-decode-allow-local-prefill` flag, assuming the D worker still ran a local prefill even when `bootstrap_room` was set. This diagnosis was **completely wrong**; see `_should_allow_local_prefill_on_decode` at `scheduler.py:1975-1980`:
```python
def _should_allow_local_prefill_on_decode(self, req: Req) -> bool:
    return (
        self.disaggregation_mode == DisaggregationMode.DECODE
        and self.server_args.disaggregation_decode_allow_local_prefill
        and req.bootstrap_room is None  # ← with a bootstrap_room, local prefill is never taken
    )
```
Requests on the KVC reseed path all carry a `bootstrap_room`, so they never trigger local prefill.
### Actual root cause: round-robin mismatch between Replay and the PD Router
In the experiment scripts, KVC used `--policy default` while the baseline used `--policy kv-aware`.
In `benchmark.py:287-300` the difference between the two is large:
```python
def _decode_policy_for(policy_name: str) -> str:
    if policy_name == "sticky": return "manual"
    if policy_name == "kv-aware": return "consistent_hashing"
    return "round_robin"  # default
def _header_mode_for(policy_name: str) -> str:
    if policy_name == "sticky": return "routing-key"
    if policy_name == "kv-aware": return "target-worker"
    return "none"  # default
```
With the `default` policy + the KVC mechanism:
1. The Replay policy (`policies.py:DefaultPolicy`) picks a D by round-robin, say D-3
2. Replay calls `open_session(session_id=X)` on D-3 (`replay.py:1722-1731`)
3. Replay sends the request through the PD Router with `session_params`, but with `header_mode=none` it **sends no routing header**
4. The PD Router (`pd_router.py:_select_decode_index`) sees `decode_policy=round_robin` and round-robins with **its own independent counter**, sending the request to D-5
5. D-5's scheduler sees a session_id in `session_params`, but its own `session_controller` has no such session (the session is on D-3) → abort with `"Invalid request: session id X does not exist"` (`scheduler.py:1824-1836`)
The two independent round-robin counters need to slip only once (any concurrency, or any direct-to-D request that bypasses the router, will do it) and from then on they never line up.
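The desynchronization can be reproduced with a toy model of two independent counters. This is a simplified simulation, not the real replay or router code: `RoundRobin` and `simulate` are hypothetical names, requests are processed strictly in order, and a "bypass" stands for any request that advances replay's counter without advancing the router's.

```python
from itertools import count


class RoundRobin:
    """A bare round-robin cursor over n workers."""

    def __init__(self, n):
        self.n = n
        self._cursor = count()

    def pick(self):
        return next(self._cursor) % self.n


def simulate(n_workers, n_requests, bypass_every):
    """Return (matched, routed): router-path requests that reach the D replay chose."""
    replay_rr, router_rr = RoundRobin(n_workers), RoundRobin(n_workers)
    matched = routed = 0
    for i in range(n_requests):
        opened_on = replay_rr.pick()  # replay opens the session on this D
        if bypass_every and i % bypass_every == bypass_every - 1:
            continue                  # direct-to-D: the router's counter does not advance
        routed += 1
        if router_rr.pick() == opened_on:
            matched += 1
    return matched, routed


print(simulate(n_workers=7, n_requests=70, bypass_every=0))
print(simulate(n_workers=7, n_requests=70, bypass_every=5))
```

In this idealized model the counters realign only when the accumulated offset happens to be a multiple of the worker count; with real concurrency the offset changes unpredictably, which is consistent with the majority of turn 1+ requests being truncated.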
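### Why does turn 0 not fail?
Turn 0 goes through `_invoke_plain_router` (`replay.py:1894`) without `session_params` and is handled as an ordinary PD disagg request, so any D works. Only from turn 1 onward do requests take the KVC path with session_params and run into the routing mismatch.
### Validation from data (per-session pattern)
```
session 11360 (58 turns): pattern = .TTTTT.TTTTTTT.TTTTTT... ← turn 0 OK, turn 1+ all T
session 18720 (87 turns): pattern = .TTTTTTTTTTTTTTTTTT...
```
Every D worker received requests from all 52 sessions (ideally it should be ~7-8 sessions per D), because round-robin scattered the sessions completely.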
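### Fix
The only correct fix is to change the KVC policy from `default` to `kv-aware`:
```diff
- --policy default
+ --policy kv-aware
```
`KvAwarePolicy` (`policies.py:146-187`) leads to three things:
1. It scores each D with `_overlap_blocks` + `sticky_bonus`, so a session naturally sticks to the same D (**session affinity**)
2. With `header_mode=target-worker`, replay sends the `x-smg-target-worker` header
3. In `consistent_hashing` mode the PD Router uses the header directly when it is present and no longer round-robins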
## v3, after switching to the kv-aware policy: routing is fixed, but a new bottleneck appears
`scripts/sweep_tp1_v3_kvaware.sh` switches every KVC experiment to `--policy kv-aware`. Results:
| Metric | v2 1P7D (default) | **v3 1P7D (kv-aware)** | v3 2P6D | 8-way DP baseline |
|------|:---:|:---:|:---:|:---:|
| Truncated | 56.8% | **0.9%** | 0.9% | 1.5% |
| Errors | 18 | 363 (8.2%) | 9 | 0 |
| Mean | 4.74s | 4.88s | 3.58s | 1.43s |
| P50 | 0.08s* (bogus) | 1.75s | 1.52s | 0.65s |
| P90 | 12.14s | 12.67s | 9.23s | 3.61s |
| TTFT P50 | - | 0.36s | 0.33s | 0.09s |
**Truncation drops from 56.8% to 0.9%; the routing problem is fully resolved.**
But P50 is still 2-3× the baseline.
### The direct-to-D path performs well (what KVC should look like)
Broken down by execution_mode:
| Path | Exp1 1P7D share | Exp1 1P7D P50 | Exp1 1P7D TTFT P50 |
|------|:---:|:---:|:---:|
| `kvcache-direct-to-d-session` | 42.0% | **0.495s** | **0.043s** |
| `pd-router-fallback-large-append-session-cap` 🔥 | **52.6%** | 5.6s | 3.7s |
On the direct-to-D path:
- P50 = 0.495s (**about 25% faster than the baseline's 0.65s**)
- TTFT P50 = 0.043s (**about 2× faster than the baseline's 0.093s**)
- KV transfer = 0, no P involvement, D runs append-prefill
This is KVC's real value, but only 30-42% of requests reach this path.
### New bottleneck: session-cap fallback takes 52-65%
`pd-router-fallback-large-append-session-cap` accounts for 52.6% on 1P7D and 65.4% on 2P6D. This path means the router wanted to open a new session, D admission rejected it ("d-session-cap"), and the request fell back to the plain router (P runs a full prefill and transfers it to D, without session reuse).
### Bimodal session distribution (starvation)
| Session | Total turns | Direct-to-D | Session-cap fallback |
|---------|:---:|:---:|:---:|
| 22080 | 129 | **98%** | 0% |
| 3840 | 118 | **97%** | 0% |
| 70560 | 150 | **0%** | **99%** |
| 39360 | 148 | **0%** | **99%** |
| 61600 | 117 | **0%** | **99%** |
A session is either entirely lucky or entirely starved: a textbook bimodal distribution.
### 根因:硬编码 cap=4
`replay.py:_decode_session_soft_cap` 原始代码
```python
def _decode_session_soft_cap(...) -> int:
    target_tokens = max(1, _estimate_session_resident_tokens(request))
    usable_capacity_tokens = _usable_capacity_tokens(residency, server_url)
    ...
    if usable_capacity_tokens <= 0:
        return 4
    return max(1, min(4, usable_capacity_tokens // target_tokens))
    #             ^^^ hard-coded upper bound of 4
```
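7 D × at most 4 sessions per D = **28 session slots in total**. The trace has 52 sessions, so 24 sessions can never get a slot.
A startup race condition decides which sessions are the "lucky ones": every later turn of the 28 sessions that squeeze in goes direct-to-D, while the remaining 24 sessions always take the session-cap fallback.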
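The slot arithmetic above can be checked with a few lines. This is only a worked example; the helper names are hypothetical and the inputs (7 decode workers, 52 sessions, a per-worker cap) are the figures quoted in the text.

```python
def total_session_slots(num_decode_workers: int, per_worker_cap: int) -> int:
    """Upper bound on concurrently pinned sessions across all decode workers."""
    return num_decode_workers * per_worker_cap


def starved_sessions(num_sessions: int, num_decode_workers: int, per_worker_cap: int) -> int:
    """Sessions that cannot get a slot when every session stays pinned."""
    return max(0, num_sessions - total_session_slots(num_decode_workers, per_worker_cap))


slots_cap4 = total_session_slots(7, 4)
starved_cap4 = starved_sessions(52, 7, 4)
print(slots_cap4, starved_cap4)  # 28 24
```

The count only bounds the number of slots; which sessions end up starved depends on arrival order, as described above.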
## v4 improvement: raise the hard cap from 4 to 16
A minimal change in `replay.py:_decode_session_soft_cap`:
```diff
- if usable_capacity_tokens <= 0:
- return 4
- return max(1, min(4, usable_capacity_tokens // target_tokens))
+ if usable_capacity_tokens <= 0:
+ return 16
+ return max(1, min(16, usable_capacity_tokens // target_tokens))
```
7 D × 16 = 112 slots, far more than the 52 sessions need.
### v4 actual results (vs v3 1P7D / 2P6D)
| Metric | v3 1P7D | **v4 1P7D** | v3 2P6D | **v4 2P6D** | baseline 8DP |
|------|:---:|:---:|:---:|:---:|:---:|
| Errors | 363 (8%) | 435 (10%) | 9 (0%) | **403 (9%)** | 0 |
| Truncated | 42 | 43 | 42 | 36 | 68 |
| **direct-to-D** | 38.6% | **54.3%** | 30.5% | **58.0%** | - |
| **session-cap fallback** | 48.3% | 37.4% | 65.4% | **34.7%** | - |
| Session reused | 1716 | 2180 | 1358 | **2348** | - |
| KV transfer blocks | 62K | 53K | 79K | **51K** | - |
| Mean | 4.88s | 4.21s | 3.58s | **2.51s** | 1.43s |
| **P50** | 1.75s | 1.08s | 1.52s | **0.84s** | **0.65s** |
| P90 | 12.67s | 13.38s | 9.23s | **6.51s** | 3.61s |
| P99 | 28.72s | 24.45s | 18.70s | 18.34s | 8.38s |
| **TTFT P50** | 0.36s | 0.056s | 0.33s | **0.051s** | 0.094s |
| TTFT P90 | 10.97s | 11.90s | 6.95s | **2.64s** | 0.26s |
- direct-to-D share rises from 30-38% in v3 to 54-58% in v4
- session reuse +27% (1P7D) / +73% (2P6D)
- KV transfer -15% (1P7D) / -36% (2P6D)
- TTFT P50 beats the baseline by 46% (0.051s vs 0.094s)
### The direct-to-D path beats the baseline across the board (the real value of KVC)
| Config | n | Lat P50 | Lat P90 | TTFT P50 | TTFT P90 |
|--------|:---:|:---:|:---:|:---:|:---:|
| baseline 8DP | 4381 | 0.66s | 3.65s | 0.094s | 0.256s |
| v4 1P7D direct-to-D | 2179 | 0.495s | 3.03s | 0.044s | 0.055s |
| **v4 2P6D direct-to-D** | **2348** | **0.499s** | **2.86s** | **0.043s** | **0.054s** |
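Relative to the baseline, the direct-to-D subset is:
- 24-30% faster at P50
- 16-22% faster at P90
- 54% faster at TTFT P50
- 79% faster at TTFT P90
### Overall performance (errors and truncated requests removed) vs baseline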
| Config | clean | Mean | P50 | P90 | P99 |
|--------|:---:|:---:|:---:|:---:|:---:|
| baseline 8DP | 4381 | 1.45s | 0.66s | 3.65s | 8.38s |
| v4 2P6D | 4010 | 2.53s | 0.85s | 6.55s | 18.33s |
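vs baseline: P50 is 28% slower, P90 80% slower, P99 119% slower. Even with errors excluded, the overall result still loses to the baseline; the root cause is that 35% of requests are pushed onto the fallback path.
### New bottleneck 1: 35% of requests still take the session-cap fallback
After raising the cap to 16, the real bottleneck is the capacity-based computation `min(16, usable_capacity_tokens // target_tokens)`:
- `target_tokens = input + output`, commonly 50-100K in agentic workloads
- the D KV pool holds about 100-150K tokens (80GB H100, mem_fraction=0.835)
- `usable / target` = 1-2, nowhere near 16, so the real cap is the small number computed from capacity
Solving this requires changing the capacity-based estimation logic, or adopting option D (let D decide for itself).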
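To make the point concrete, here is a standalone version of the cap formula quoted above, evaluated at token counts inside the ranges the text gives. It is a sketch: the function takes plain integers instead of the `request` / `residency` objects of the real `_decode_session_soft_cap`, and the specific inputs are illustrative picks from the 100-150K and 50-100K ranges.

```python
def decode_session_soft_cap(
    usable_capacity_tokens: int,
    target_tokens: int,
    hard_cap: int = 16,
) -> int:
    """Sessions a decode worker may hold: capacity-based, clamped to [1, hard_cap]."""
    target_tokens = max(1, target_tokens)
    if usable_capacity_tokens <= 0:
        return hard_cap
    return max(1, min(hard_cap, usable_capacity_tokens // target_tokens))


# Illustrative inputs: the capacity term, not the hard cap, is what binds.
cap_small_session = decode_session_soft_cap(100_000, 50_000)
cap_large_session = decode_session_soft_cap(150_000, 100_000)
print(cap_small_session, cap_large_session)  # 2 1
```

Raising `hard_cap` beyond 16 would change neither result, which is why the v4 change stops helping once the capacity term dominates.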
### New bottleneck 2: 9-10% errors (mooncake transfer timeouts)
The P-side log shows:
```
KVTransferError: Failed to send kv chunk of <bootstrap_room> to 10.45.7.165:40319
Sync batch data transfer timeout after 32722558107ns (32-second timeout)
Decode instance could be dead, remote mooncake session ... is not alive
```
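Characteristics:
- all errors appear after 44.8% of the run (accumulated system pressure)
- 98% of errors are concentrated in turn 31 large-input requests
- v3 (cap=4, 1P7D) already had 363 errors, with one D taking the brunt; v4 (cap=16) spreads the pressure evenly but at a larger magnitude
mooncake TCP loopback hits timeouts once concurrency rises; this is **not a logic bug in SGLang**. Possible fixes:
1. lengthen the mooncake transfer timeout (currently 32s)
2. limit the number of concurrent in-flight transfers
3. switch to RDMA (loopback is a single-machine simulation; production would use real RDMA)
4. chunked KV transfer
## v5: landing option D (worker-mode-driven seed/reseed)
`scripts/sweep_tp1_v5_optD.sh` actually lands option D in code. The core change switches `--kvcache-admission-mode` from `local` (replay-side estimate) to `worker` (D decides), and extends it to **all of the direct_append + seed + reseed paths**.
### Key code changes
1. SGLang `scheduler.py`: the `admit_direct_append` endpoint gains a `mode` field supporting `direct_append | seed`; in seed mode D actually reserves KV pool blocks and proactively calls `maybe_trim_decode_session_cache` to run LRU eviction
2. Replay `replay.py`: reseed / turn-1 seed / large-append-reseed all go through the same admit endpoint; `_decode_session_soft_cap` is bypassed entirely in worker mode
3. New run flags: `--kvcache-admission-mode worker`, `--kvcache-seed-min-turn-id 1`, `--kvcache-seed-max-inflight-decode -1`, `--kvcache-prefill-backup-policy release-after-transfer`, `--kvcache-prefill-priority-eviction`
### Hypotheses
- v4's 35% session-cap fallback comes from a stale replay-side view plus a conservative capacity-based computation; letting D look at its own KV pool should recover that 35%
- D's proactive LRU eviction is more accurate than the reservation bookkeeping replay maintains itself, so it **should** let more sessions seed in
### v5 actual results (vs v4, same configuration)
| Metric | v4 1P7D | **v5 1P7D** | v4 2P6D | **v5 2P6D** | baseline 8DP |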
|------|:---:|:---:|:---:|:---:|:---:|
| Errors | 435 (10%) | **9 (0.2%)** | 403 (9%) | **9 (0.2%)** | 0 |
| Truncated | 43 | 42 | 36 | 42 | 68 |
| direct-to-D | 54.3% | 44.7% | 58.0% | 41.3% | - |
| **session-cap fallback** | 37.4% | **45.6%** | 34.7% | **50.6%** | - |
| no-d-capacity fallback | 0.3% | 1.2% | 0.2% | 0.8% | - |
| pd-router-turn1-seed (newly visible) | - | 1.2% | - | 1.1% | - |
| pd-router-d-session-reseed (newly visible) | - | 4.8% | - | 3.4% | - |
| pd-router-large-append-reseed (newly visible) | - | 1.0% | - | 1.0% | - |
| Session reused | 2180 | 1990 | 2348 | 1837 | - |
| KV transfer blocks | 53K | 66K | 51K | 69K | - |
| Mean | 4.21s | 5.18s | 2.51s | 3.49s | 1.45s |
| **P50** | 1.08s | 1.59s | 0.84s | 1.31s | 0.66s |
| P90 | 13.38s | 14.67s | 6.51s | 9.09s | 3.65s |
| P99 | 24.45s | 26.09s | 18.34s | 24.92s | 8.38s |
| TTFT P50 | 0.056s | 0.21s | 0.051s | 0.24s | 0.094s |
| TTFT P90 | 11.90s | 13.06s | 2.64s | 6.90s | 0.26s |
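- **Reliability improves sharply**: mooncake transfer-timeout errors fall from 9-10% to 0.2%. Letting D decide on its real capacity avoids v4's death chain of "optimistic admit, then a timeout 30s later".
- The reseed / turn1-seed paths appear explicitly for the first time, which shows the admission endpoint really takes effect in seed mode.
- **session-cap fallback goes up instead of down** (37% → 46% on 1P7D, 35% → 51% on 2P6D). This shows v4's local soft_cap was actually **more optimistic than D's real capacity**: requests were admitted and then immediately hit OOM, and were counted as errors rather than fallbacks.
- The direct result: **the direct-to-D share falls and overall latency gets worse across the board**. P50/P90/P99 and TTFT all regress.
### The direct-to-D subset is still solid (the real value of KVC remains)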
| Config | n | Lat P50 | Lat P90 | TTFT P50 | TTFT P90 |
|--------|:---:|:---:|:---:|:---:|:---:|
| baseline 8DP | 4381 | 0.66s | 3.65s | 0.094s | 0.256s |
| v4 2P6D direct-to-D | 2348 | 0.499s | 2.86s | 0.043s | 0.054s |
| **v5 1P7D direct-to-D** | 1990 | 0.475s | 3.04s | 0.043s | 0.055s |
| **v5 2P6D direct-to-D** | 1837 | 0.483s | 3.04s | 0.043s | 0.054s |
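Tail latency and TTFT of direct-to-D are almost identical to v4, so the endpoint decision overhead is negligible. **The v5 regression is not the path itself getting slower; more requests are being pushed to the fallback.**
### The fallback path is actually worse than in v4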
| Config | n | Lat P50 | Lat P90 | TTFT P50 |
|--------|:---:|:---:|:---:|:---:|
| v5 1P7D session-cap fallback | 2027 | 6.38s | 17.47s | 4.49s |
| v5 2P6D session-cap fallback | 2253 | 3.13s | 11.25s | 0.89s |
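Because the fallback share rises and this path is an order of magnitude slower than direct-to-D, the overall mean is dragged down even further.
### The bottleneck v5 really exposes: the physical capacity of D's KV pool
Once the admission decision is handed to D, the bottleneck changes from "replay estimates too rigidly" to "D genuinely cannot fit it":
- 80GB H100 × `mem_fraction_static=0.835`: a single-GPU D has a KV pool of about 100-150K tokens
- agentic context: a session's per-turn footprint is 50-100K
- a D can only hold 2-3 sessions concurrently to begin with, so 7 D serving 50 sessions is essentially impossible
Part of the reason v4 with cap=16 "looked good" is that the local soft_cap never actually checked D's free pool and opened a pile of sessions that **would eventually fail** (counted as errors, not fallbacks). v5 turns these into "honest rejections": the price of the reliability jump is seeing the real capacity ceiling.
### What v6 should target
Open up D's physical capacity management rather than tuning replay further:
1. **Release prefill backups earlier**: `release-after-transfer` is already enabled but may not be timely enough; backup blocks on P should not occupy the KV pool for long
2. **Tune the priority eviction strategy**: `--kvcache-prefill-priority-eviction` is already on, but the current LRU may evict hot sessions by mistake; weighting by session hit frequency / recency is needed
3. **Chunked / streamed seed**: do not reserve capacity for the whole prompt at once; spread it across chunks
4. **Cross-D session migration**: when one D is full but a neighboring D is idle, migrate proactively instead of falling straight back to P
5. **Real multi-machine RDMA**: single-machine mooncake loopback is one root cause of the errors; only multi-machine + RDMA makes KV transfer after prefill backup release truly stable
Engineering effort: 1-3 are SGLang-internal changes (`scheduler.py` + `session_controller.py`), 4 needs a router protocol extension, 5 is a deployment change.
## Index of key files and code locations
| Phenomenon | Code location |
|------|----------|
| Replay policy round-robin | `policies.py:63-67` `RoutingState.next_decode_worker_id` |
| KV-aware policy (session affinity) | `policies.py:146-187` `KvAwarePolicy.select` |
| PD router decode selection | `pd_router.py:51-74` `_select_decode_index` |
| Header construction | `replay.py:2407-2424` `_build_headers` |
| Policy → router config mapping | `benchmark.py:287-300` `_decode_policy_for/_header_mode_for` |
| Session admission cap | `replay.py:889-905` `_decode_session_soft_cap` |
| Existing D admission endpoint | `scheduler.py:3497-3580` `admit_direct_append` (extended in v5 to support `mode=seed`) |
| Worker-mode admission callers | `replay.py` reseed / turn1-seed / large-append-reseed paths |
| Prefill backup release policy (introduced in v5) | `--kvcache-prefill-backup-policy release-after-transfer` |
| Prefill priority eviction (introduced in v5) | `--kvcache-prefill-priority-eviction` |
| Error when a session is not found on D | `scheduler.py:1824-1836` |
| `_should_allow_local_prefill_on_decode` | `scheduler.py:1975-1980` |
| Reseed flow entry point | `replay.py:1665-1809` `_invoke_kvcache_seeded_router` |
| Direct-to-D flow | `replay.py:2351-2398` `_invoke_decode_session_direct` |
## Lessons learned
1. **Policy and mechanism are two orthogonal dimensions**: `--policy default` is not a harmless default; it really is round-robin with no session affinity. The KVC mechanism must be paired with a session-affine policy.
2. **Do not blindly trust the previous agent's RESULTS_SUMMARY**: v2's diagnosis ("local prefill bug") does not match the actual finish_reason ("session id does not exist") at all. Any error diagnosis must be cross-checked against raw fields such as finish_reason and execution_mode.
3. **A bimodal distribution is a strong starvation signal**: in the v3 data some sessions take the fast path 100% of the time and others the slow path 100% of the time, which almost certainly indicates first-come-first-served resource contention. On seeing this pattern, look immediately for a hard-coded cap or a globally shared resource.
4. **Measure per group, not only the overall figure**: v3's overall P50 of about 1.5s looks slower than the baseline, but the direct-to-D subset at P50=0.495s already beats it. The overall number is dragged down by the fallback path, while KVC's core value is real.
5. **Errors and fallbacks are two faces of the same resource pressure**: v4's "fewer fallbacks, more errors" is not a better outcome; it disguises over-capacity failures as timeouts instead of explicit rejections. Once v5 hands the decision to real capacity, fallbacks go up and errors go down, which is the more honest metric. Do not be misled by v4's low fallback numbers. When error rate and fallback rate are inversely correlated, check whether the admission decision is lying.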
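The per-group measurement from lesson 4 can be sketched as below. The rows are synthetic and the `latency_s` / `error` field names are assumptions; only `execution_mode` is named in this document as a raw field, and the real `request-metrics.jsonl` schema may differ.

```python
from collections import defaultdict
from statistics import median


def p50_by_mode(rows: list[dict]) -> dict[str, float]:
    """Median latency per execution mode, skipping rows flagged as errors."""
    groups: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        if row.get("error"):
            continue
        groups[row["execution_mode"]].append(row["latency_s"])
    return {mode: median(values) for mode, values in groups.items()}


# Synthetic rows, for illustration only.
rows = [
    {"execution_mode": "kvcache-direct-to-d-session", "latency_s": 0.4},
    {"execution_mode": "kvcache-direct-to-d-session", "latency_s": 0.6},
    {"execution_mode": "pd-router-fallback-large-append-session-cap", "latency_s": 5.0},
    {"execution_mode": "pd-router-fallback-large-append-session-cap", "latency_s": 7.0},
    {"execution_mode": "kvcache-centric", "latency_s": 600.0, "error": True},
]
per_mode = p50_by_mode(rows)
overall = median(r["latency_s"] for r in rows if not r.get("error"))
```

On these rows the overall median sits between the two groups and describes neither of them, which is the pattern lesson 4 warns about.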


@@ -0,0 +1,514 @@
# Real Ali KVC experiment log
**Branch**: `kvc-real-ali-iter-v1`, checked out from `kvc-debug-journey-v1-to-v4`.
**Date**: 2026-05-11/12.
**Environment**: single machine, 8x NVIDIA H20, SGLang xPyD, model `/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct`.
**Real trace**: `/home/admin/cpfs/wjh/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl`.
This log records the KVC pd-hybrid iteration on the real Ali workload. Conclusions hold only on the current evidence; the `time-scale=10` smoke runs and the KVC-friendly slice are not full-workload headlines.
## 1. Latest progress
A fixed-sample and sweep pipeline for the real Ali trace has been added:
- `scripts/prepare_real_ali_samples.py`: generates reproducible experiment samples from the real Ali trace, preserving the real input/output/hash_ids/timestamp, with optional timestamp rebasing.
- `scripts/sweep_real_ali_kvc.sh`: runs DP cache-aware, PD-disaggregation, KVC, and KVC+backpressure in turn on the same prebuilt sample.
- `benchmark-live --use-trace-as-sample`: replays the given trace directly, so strategies are not made incomparable by re-sampling.
- `replay-progress.jsonl` heartbeat: long runs now write client-side progress every 30s without polling `/server_info`, to avoid perturbing the scheduler.
- `prepare_real_ali_samples.py --max-sampled-duration-s`: generates capped samples for quick smoke runs; for iteration only, not for headlines.
Completed real Ali KVC-fit smoke:
- Sample: `outputs/real-ali-kvc-iter/samples-balanced/ali-kvc-fit-smallappend.jsonl`.
- 179 requests, 64 sessions, all multi-turn; 115 turn2+ requests; direct-eligible ratio 100%.
- `time-scale=10`, concurrency 32.
- DP cache-aware, PD-disaggregation, KVC no-backpressure, and KVC+backpressure have all completed.
## 2. Full Ali trace profile
`outputs/real-ali-kvc-iter/ali-full-profile.json` shows:
| Metric | Value |
|---|---:|
| requests | 763,727 |
| sessions | 555,905 |
| multi-turn sessions | 39,247 |
| turn2+ requests | 207,822 |
| turn2+ direct-eligible ratio | 82.95% |
| input p50 / p90 / p99 | 4,329 / 51,067 / 112,955 tokens |
| output p50 / p90 / p99 | 93 / 826 / 5,616 tokens |
| append p50 / p90 / p99 | 303 / 2,879 / 17,885 tokens |
| inter-turn gap p50 / p90 / p99 | 4.65s / 38.68s / 1,133s |
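This profile shows KVC has a real area of applicability: hash overlap and small appends are common at turn2+. But the full workload contains a very large number of single-turn sessions, which dilutes KVC's benefit significantly; results must therefore be reported per slice, not only for the KVC-fit subset.
## 3. Samples run so far
### Continuous 15min cold-window session sample
Path: `outputs/real-ali-kvc-iter/samples-window-900s-600req/ali-window.jsonl`.
- 600 requests, 439 sessions, 32 multi-turn sessions.
- rebased duration: 886.544s, covering about 15min.
- turn2+ requests: 161; direct-eligible: 143; ratio 88.8%.
- input p50 / p90 / p99: 3,871 / 68,234 / 98,131.
- output p50 / p90 / p99: 85 / 712 / 5,195.
- append p50 / p90 / p99: 274 / 2,202 / 16,120.
- inter-turn gap p50 / p90 / p99: 4.656s / 19.376s / 63.575s.
This is the validation sample that replaces the 179-request KVC-fit smoke. It splits the 900s window into 15 time buckets and selects, round-robin across buckets, whole sessions that start from their root inside the window, until 600 requests are reached. This avoids `load_trace()` fragmenting real sessions because of missing parents, and spreads requests over the whole 15min instead of taking only the first 600 requests of the window.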
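One way to read the bucketed selection described above is sketched here. It is an assumption about the procedure, not the code in `prepare_real_ali_samples.py`: the function name, the `(session_id, start_time_s, num_requests)` tuple layout, and the tie-breaking order are all hypothetical.

```python
def sample_window_sessions(
    sessions: list[tuple[str, float, int]],
    window_s: float = 900.0,
    buckets: int = 15,
    target_requests: int = 600,
) -> tuple[list[str], int]:
    """Pick whole sessions round-robin across time buckets until the target is met.

    Each session is (session_id, start_time_s, num_requests); only sessions whose
    root turn starts inside the window are eligible, so no parent turn is missing.
    """
    bucket_width = window_s / buckets
    queues: list[list[tuple[str, int]]] = [[] for _ in range(buckets)]
    for sid, start, n_requests in sorted(sessions, key=lambda s: s[1]):
        if 0 <= start < window_s:
            index = min(buckets - 1, int(start // bucket_width))
            queues[index].append((sid, n_requests))

    picked: list[str] = []
    total = 0
    while total < target_requests and any(queues):
        for queue in queues:
            if queue and total < target_requests:
                sid, n_requests = queue.pop(0)
                picked.append(sid)
                total += n_requests
    return picked, total


# Synthetic sessions: s3 starts after the window and is excluded.
demo = [("s0", 10.0, 3), ("s1", 20.0, 2), ("s2", 100.0, 4), ("s3", 950.0, 5)]
picked, total = sample_window_sessions(demo, target_requests=6)
```

Because whole sessions are taken, the total can overshoot the target slightly; on the synthetic input above the second pick crosses the target of 6 and selection stops at 7 requests.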
Important caveat: this is a **cold-window / new-session-only** sample, not a complete raw production window; it excludes ongoing sessions that were already active before the window started. It is therefore usable for stability validation of "600+ requests, 15min, real mixed load", but cannot by itself represent the full Ali production window.
### KVC-fit small append
Path: `outputs/real-ali-kvc-iter/samples-balanced/ali-kvc-fit-smallappend.jsonl`.
- 179 requests, 64 sessions.
- input p50 / p90: 6,446 / 15,491.
- output p50 / p90: 112 / 1,159.
- append p50 / p90: 215 / 855.
- overlap ratio p50 / p90: 0.875 / 0.938.
This is the KVC-friendly slice, used to verify the mechanism's upper bound and whether microbenchmark results carry over to real token/hash sequences.
### Representative-mt / early multi-turn balanced
Path: `outputs/real-ali-kvc-iter/samples-balanced/ali-representative-mt.jsonl`.
- 460 requests, 64 sessions.
- input p50 / p90: 41,175 / 98,621.
- append p50 / p90 / p99: 272 / 1,979 / 13,900.
This sample is closer to real multi-turn pressure and will be used to verify stability under large contexts and large resident KV. Its current implementation, however, is "take the earliest 64 multi-turn sessions after start_time", not a strictly random or stratified representative sample; a formal headline needs sampling stratified by input/append/output/gap.
### Capped smoke samples
To keep a few real long gaps from wasting a lot of smoke wall time, two samples were added:
- `outputs/real-ali-kvc-iter/samples-balanced-cap120s/ali-kvc-fit-smallappend.jsonl`: 177 requests, 64 sessions, duration 65.859s.
- `outputs/real-ali-kvc-iter/samples-balanced-cap120s/ali-representative-mt.jsonl`: 359 requests, 64 sessions, duration 117.366s.
These samples drop the two requests at the end of the original KVC-fit sample with timestamps 3613s and 5414s, so they are only for fast engineering iteration; formal comparisons should still use the full samples or a real continuous window.
## 4. Current results
### 4.1 DP cache-aware vs KVC+backpressure, KVC-fit, time-scale=10
| Strategy | Requests | Errors | Trunc | E2E mean | E2E p50 | E2E p90 | E2E p99 | TTFT mean | TTFT p50 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 8-way DP cache-aware | 179 | 0 | 0 | 6.603s | 3.126s | 17.639s | 34.582s | 1.112s | 1.052s |
| KVC 2P6D + worker admission + backpressure | 179 | 0 | 0 | 4.443s | 2.076s | 13.288s | 21.202s | 0.700s | 0.154s |
Paired comparison (KVC - DP):
- overall E2E: mean delta -2.161s, p50 delta -1.427s, 152/179 wins.
- turn2+ direct subset: mean delta -2.503s, p50 delta -1.508s, 103/115 wins.
- turn2+ TTFT: mean delta -0.930s, p50 delta -0.887s.
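A paired comparison of this kind can be computed as sketched below. The function and the request-id keyed dicts are hypothetical, and the data is synthetic. Taking "p50 delta" as the median of per-request deltas is an inference: the reported p50 delta (-1.427s) differs from the difference of the two table medians, which suggests a per-request statistic.

```python
from statistics import mean, median


def paired_comparison(candidate: dict[str, float], baseline: dict[str, float]) -> dict:
    """Compare latencies request by request; negative deltas mean the candidate wins."""
    shared = sorted(candidate.keys() & baseline.keys())
    if not shared:
        raise ValueError("no request has a latency on both sides")
    deltas = [candidate[rid] - baseline[rid] for rid in shared]
    return {
        "n": len(shared),
        "mean_delta": mean(deltas),
        "p50_delta": median(deltas),
        "wins": sum(d < 0 for d in deltas),
        "losses": sum(d > 0 for d in deltas),
    }


# Synthetic latencies in seconds; request "d" has no candidate latency and is dropped.
kvc = {"a": 1.0, "b": 2.0, "c": 5.0}
dp = {"a": 2.0, "b": 4.0, "c": 4.0, "d": 9.0}
result = paired_comparison(kvc, dp)
```

Restricting the comparison to requests present on both sides is what keeps failed requests from skewing the deltas; error counts then have to be reported separately.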
Execution paths:
- KVC turn1 seed: 64 requests.
- `kvcache-direct-to-d-session`: 115 requests.
- session reused: 115.
- actual KV transfer blocks: 623.
Structural logs:
- admission probes: 179, all `ok`.
- transfer queue depth: p50=0, p90=2, max=3.
- backpressure events: 0.
Interpretation: this run shows a positive signal for **KVC direct-to-D/session reuse** on the real Ali KVC-fit slice; it does not show that backpressure works, because backpressure was never triggered.
### 4.2 PD-disaggregation baseline, KVC-fit, time-scale=10
| Strategy | Requests | Errors | Trunc | E2E mean | E2E p50 | E2E p90 | E2E p99 | TTFT mean | TTFT p50 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| PD-disaggregation 2P6D | 179 | 0 | 0 | 7.850s | 6.306s | 15.192s | 22.405s | 4.994s | 5.336s |
Paired comparison (PD - DP):
- overall E2E mean delta: +1.247s.
- p50 delta: +2.231s.
- 46/179 faster, 133/179 slower.
Interpretation: on this KVC-fit slice, plain PD-disaggregation is clearly weaker than 8-way DP cache-aware. It pays the P->D transfer and split-scheduling cost without KVC's direct-to-D bypass benefit.
### 4.3 KVC no-backpressure ablation, KVC-fit, time-scale=10
| Strategy | Requests | Errors | Trunc | E2E mean | E2E p50 | E2E p90 | E2E p99 | TTFT mean | TTFT p50 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| KVC 2P6D worker admission, no backpressure | 179 | 0 | 0 | 4.404s | 1.936s | 13.200s | 21.326s | 0.604s | 0.139s |
Paired comparison:
- KVC no-BP vs DP: mean delta -2.200s, p50 delta -1.434s, 153/179 wins.
- KVC no-BP vs PD-disaggregation: mean delta -3.447s, p50 delta -3.514s, 163/179 wins.
- KVC no-BP vs KVC+BP: mean delta -0.039s, p50 delta -0.005s, 92/179 wins.
Structural analysis:
- direct-to-D rate: 64.25%.
- admission probes: 179, all `ok`.
- transfer queue depth: p50=0, p90=2, max=3.
- pause_ms all 0; backpressure events 0.
Interpretation: no-backpressure and +backpressure are nearly equivalent, which shows this slice puts no pressure on D; the gain in this round comes from direct-to-D, not from backpressure.
### 4.4 Continuous 15min / 600-request window, time-scale=1
Sample: `outputs/real-ali-kvc-iter/samples-window-900s-600req/ali-window.jsonl`.
Important caveat: this is a cold-window / new-session-only session sample, not a complete raw window. It covers about 15min with `missing_parent_count=0`, but excludes ongoing sessions already active before the window started.
Results:
| Strategy | Requests | Errors | Trunc | E2E mean | E2E p50 | E2E p90 | E2E p99 | TTFT mean | TTFT p50 | TTFT p90 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| DP cache-aware 8-way | 600 | 1 | 0 | 13.942s | 5.222s | 29.299s | 151.183s | 6.162s | 1.746s | 19.176s |
| PD-disaggregation 2P6D | 600 | 1 | 0 | 40.886s | 40.018s | 84.681s | 113.460s | 38.545s | 37.782s | 81.852s |
| KVC 2P6D mem_fraction_static=0.82 | 600 | 53 | 0 | 12.386s | 4.225s | 37.998s | 78.234s | 10.078s | 2.674s | 27.774s |
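KVC failed to start with default settings:
- Default KVC 2P6D hit OOM at startup twice on H20 and never reached replay.
- Logs show only about 526MB left when the decode/prefill workers started; OOM occurred during model loading.
- `--load-format layered` does not support Qwen3-Coder-30B-A3B.
- With `--mem-fraction-static 0.82` KVC starts and completes the replay, but this lowers KV pool capacity, so this KVC run is a memory-constrained rerun.
- When trying `KVC_SEED_MIN_TURN_ID=2` + `mem_fraction_static=0.82`, the scheduler was SIGKILLed during startup (suspected OS OOM killer) and never reached replay.
Paired comparison (computed only over the 547 paired requests that have a latency on both sides):
- KVC vs DP: mean delta -1.335s, p50 delta -0.055s, p90 delta +19.371s, 284 wins / 263 losses.
- KVC vs PD: mean delta -28.341s, p50 delta -25.687s, p90 delta +2.834s, 465 wins / 82 losses.
KVC structural data:
- execution modes: 388 `pd-router-turn1-seed`, 90 `kvcache-direct-to-d-session`, 67 `pd-router-fallback-large-append-session-cap`, 1 `pd-router-large-append-reseed`, 1 `pd-router-turn1-d-backpressure`, 53 `kvcache-centric` error rows.
- direct-to-D rate: 15.0%.
- direct-to-D session distribution: 413/439 sessions are in the 0-20% direct-rate band; only 6 sessions are in 80-100%.
- admission probes: 533; reason `ok` 531, `no-space` 2; queue depth p50=0, p90=2, max=5.
- pause hint was non-zero 20 times, but there were no backpressure events because this run is no-BP.
KVC error breakdown:
- 50 `ReadTimeout`.
- 2 `HTTPStatusError 400 Bad Request` on `open_session`.
- 1 context length error, the same `input_length=310521 > 262144` as in DP/PD.
- Errors are concentrated at turn1: 50 at turn1, 3 at turn2+.
Interpretation:
1. KVC is still clearly better than plain PD, which shows that plain P->D disaggregation is very costly on the real 600-request window.
2. Against DP, KVC shows only a small positive signal on the mean/p50 of clean requests, while p90 gets worse and error_count rises from DP's 1 to 53.
3. On this 600-request / 15min window, therefore, **KVC cannot be counted as a stable system-level improvement**. The main problem is not that the direct-to-D fast path is ineffective; its coverage is only 15%, and turn1 seed / session admission / the memory-constrained KV pool introduce a large number of timeouts.
4. This directly corrects the conclusion of the 179-request KVC-fit smoke: the small sample shows that a KVC-suitable slice exists, while the 600-request mixed window shows the current implementation cannot yet serve a real mixed workload stably.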
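## 5. Has KVC already improved over pd-colocation/pd-disaggregation?
Only this qualified conclusion can be drawn for now:
1. **Versus PD-disaggregation: a clear improvement.**
On the KVC-fit slice, PD-disaggregation p50 is 6.306s, KVC no-BP p50 is 1.936s, and KVC+BP p50 is 2.076s; TTFT p50 is 5.336s vs 0.139s/0.154s. The gain comes mainly from turn2+ requests going straight to an existing D session, avoiding a full prefill on P and a P->D KV transfer every turn.
2. **Versus the strong DP cache-aware baseline: an improvement on the KVC-fit slice.**
KVC no-BP and KVC+BP both beat DP on overall mean/p50/p90/p99, with paired wins of 153/179 and 152/179 respectively. But this is a KVC-friendly, all-multi-turn slice with 100% direct-eligible turn2+ requests; it does not represent the full Ali workload.
3. **On the full workload: not yet shown.**
In the full Ali trace single-turn sessions are the majority, and long contexts and long-tail outputs are common. KVC's benefit is diluted by single-turn sessions, and D resident KV capacity and tail stability become stronger constraints.
4. **On the 600-request / 15min mixed window: no stable improvement yet.**
KVC's clean E2E mean/p50 show a positive signal, but error_count=53/600 and the p90 paired delta against DP is worse. By the "E2E + error/truncation" standard this is not a system-level win.
## 6. Where the improvement comes from
The main benefit chain:
1. turn1 seed establishes a session on D.
2. At turn2+, if the append is small and hash overlap is high, the request goes straight to `kvcache-direct-to-d-session`.
3. direct-to-D avoids involving a P worker and skips the P->D KV transfer.
4. D runs only a small prefill over the append suffix and reuses the existing prefix KV directly.
This yields two observable gains:
- TTFT drops sharply: on the turn2+ direct subset, mean TTFT falls from about 1.04s for DP to about 0.112s.
- E2E drops: mean E2E on the direct subset falls by about 2.50s.
In addition, KVC's cached_tokens statistic is markedly higher: KVC mean cached tokens is 5,992 vs 228 for DP. This shows it really reuses large stretches of real prefix KV.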
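## 7. Problems encountered and fixes
### Problem 1: the generic sampler gets dominated by a single long session
Symptom: the real Ali session distribution has a pronounced long tail; duration-oriented sampling easily picks unbalanced samples, making strategy comparisons unrepeatable or unrepresentative of multi-session contention.
Fix: added `scripts/prepare_real_ali_samples.py`, which builds a balanced sample under a session cap and a per-session turn cap while preserving the real token/hash/timestamp.
### Problem 2: re-sampling per strategy makes runs incomparable
Symptom: `benchmark-live` originally re-sampled according to its parameters, so different strategies could replay different requests.
Fix: added `--use-trace-as-sample`; all strategies copy and replay the same prebuilt sample, which is what makes the later paired comparison meaningful.
### Problem 3: no progress visible in the middle of a long trace replay
Symptom: `request-metrics.jsonl` and the summary are written only after the replay ends, so with real pacing it is hard to tell normal waiting from a hang.
Fix: added the `replay-progress.jsonl` heartbeat, which writes submitted/completed/inflight/errors/execution_modes every 30s. It uses only client-local state and does not touch `/server_info`.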
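A heartbeat writer of this kind might look like the sketch below. The field names follow the list in the fix above, but the function, the `ts` key, and the way `inflight` is derived are assumptions rather than the code in `replay.py`; the demo writes to a temporary directory with synthetic counters.

```python
import json
import tempfile
import time
from collections import Counter
from pathlib import Path


def append_progress(path, submitted, completed, errors, execution_modes, now_s=None):
    """Append one heartbeat record built only from client-local counters."""
    record = {
        "ts": time.time() if now_s is None else now_s,
        "submitted": submitted,
        "completed": completed,
        "inflight": submitted - completed,  # assumes `completed` includes failures
        "errors": errors,
        "execution_modes": dict(execution_modes),
    }
    with Path(path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


# Demo with synthetic counters; a real replay loop would call this every 30 s.
with tempfile.TemporaryDirectory() as tmp:
    progress_path = Path(tmp) / "replay-progress.jsonl"
    modes = Counter({"kvcache-direct-to-d-session": 3, "pd-router-turn1-seed": 2})
    append_progress(progress_path, 8, 5, 0, modes, now_s=0.0)
    append_progress(progress_path, 12, 9, 1, modes, now_s=30.0)
    lines = progress_path.read_text(encoding="utf-8").splitlines()
    last = json.loads(lines[-1])
```

Appending one JSON object per line keeps the file readable while the run is still in progress, and nothing here issues a request to the serving stack.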
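### Problem 4: `/server_info` polling perturbs the scheduler
Symptom: in earlier profiling, 1Hz polling visibly changed the error count. If a real performance run keeps polling the pool, the measurement tool becomes a source of interference.
Fix: `scripts/sweep_real_ali_kvc.sh` disables pool polling by default. Capacity questions rely on the structural logs and, where needed, a separate profile run, and are not mixed into headline performance runs.
### Problem 5: the backpressure smoke never triggered backpressure
Symptom: in the KVC-fit smoke the transfer queue max was only 3, every admission reason was `ok`, and pause_ms was 0 throughout.
Conclusion: this run cannot show that backpressure works; it only shows that direct-to-D works. Validating backpressure needs a dedicated pressure sample with more sessions, larger resident KV, or higher concurrency.
### Problem 6: environment differs from the old reports
Symptom: the old documents say H100; the real environment this time is H20, and the model path is under `/home/admin/cpfs/wjh/models/...`.
Handling: this log records H20; cross-document comparisons should look only at mechanism trends and must not treat H100 and H20 absolute latencies as one experiment.
### Problem 7: a continuous window can truncate session ancestry
Symptom: cutting a window directly by timestamp can leave requests whose parent turn lies outside the window. For KVC this makes session reuse and the turn chain diverge from the real workload.
Handling: the current continuous window is only a candidate to be improved, not a formal headline. A formal window needs to keep warmup ancestors, or explicitly preserve the original session chain information.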
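## 8. Current hypotheses if the full workload later performs poorly
The cause may not be a small implementation bug but the combination of the approach's applicability and resource constraints:
1. **Single-turn dilutes the benefit**: single-turn sessions are the majority in the full Ali trace; for them a KVC seed is pure cost with no turn2+ reuse.
2. **Long contexts crowd the D KV pool**: input p90 is 51K and p99 is 113K; the long tail of resident KV limits how many sessions a D can keep at once.
3. **Direct is not a free lunch**: turn1 seed, admission probes, and session lifecycle all add cost, which is amortized only when later turns reuse the session enough.
4. **D-side capacity and eviction remain the core risk**: the earlier SWE experiments already showed that session pinning plus capacity-blind D selection causes starvation, and the early multi-turn balanced sample may reproduce it.
5. **Plain PD-disaggregation is weak**: if KVC falls back to the plain PD path often, the overall result is dragged down by P->D transfer and high TTFT.
6. **Tight H20 memory headroom changes the conditions for KVC**: default KVC 2P6D hits OOM at startup, and `mem_fraction_static` must be lowered to finish the 600-request run, which further shrinks the D KV pool and amplifies session-cap fallbacks and timeouts.
## 9. Next validation steps, in order
1. Add a sticky/session-affinity baseline to separate the contribution of "sticking to the same D" from that of "KVC direct bypass".
2. Add KVC `seed-min-turn-id=2` or no-turn1-seed to check whether the turn1 seed cost is worth it.
3. Run DP / PD / KVC no-BP / KVC+BP on the early multi-turn balanced sample to validate real multi-turn pressure with large contexts.
4. Pick a small fixed sample and run it at `time-scale=1`, so results do not hold only under compressed replay.
5. Build a continuous window that includes single-turn sessions, handle missing parent turns inside the window, then report weighted by the full Ali distribution.
6. Run N>=3 reruns of the final candidate configuration and report variance; N=1 is only a smoke.
7. For the 600-request window, run `seed-min-turn-id=2` first to reduce single-turn turn1 seeds; the goal is to bring the 53/600 errors down close to DP's 1/600 before discussing latency.
- The first attempt never reached replay; startup is suspected to have hit OS OOM. H20 startup GPU-memory / system-memory stability must be resolved first, or the worker count / model memory footprint reduced.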
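## 10. KVC error root cause and preparation for multi-turn-only validation
The user pointed out that the 179-request run is not enough and asked for at least 15min / 600+ requests; the current formal diagnosis is based on
`outputs/real-ali-kvc-iter/runs/window900s-600req-ts1-kvc-mem082/kvcache-centric-kv-aware-worker-admission-20260511T093601Z`.
### 10.1 Why KVC has so many errors
This run has 600 requests; KVC with mem 0.82 has 53 errors:
- 50 `ReadTimeout`.
- 2 `/open_session` HTTP 400.
- 1 genuine over-context error: input 310,521 > model context 262,144.
By turn, 50/53 errors are at turn1. By structural admission, the vast majority of failed requests were already judged `can_admit=true` by D-side admission in `structural/admission-events.jsonl`, so this is not simply `d-session-cap` or `no-space`. The main failure point is that after a turn1 seed enters the KVC seeded path, it times out during P/D streaming session bootstrap, P->D transfer, or router streaming. Since the mixed real window has many single-turn sessions, these turn1 seeds bring no later reuse benefit for most sessions.
Conclusion: the main cause of the current KVC errors is **too many turn1 seeds for sessions that are single-turn or not known to be multi-turn**. This pushes a large number of new sessions into the KVC control plane and seeded-router path, increasing timeouts and leftover session lifecycle state; the direct-to-D fast path itself is not what fails.
### 10.2 Fixes and ablation switches already in place
Code and script fixes:
- `scripts/sweep_real_ali_kvc.sh` adds `KVC_SEED_ONLY_MULTITURN=1`, which passes `--kvcache-seed-only-multiturn-sessions`. This is an oracle ablation to test whether "seeding only sessions that will have later turns" eliminates the turn1 seed errors.
- `src/agentic_pd_hybrid/replay.py` adds a single close+retry on `/open_session` 400 and writes `structural/session-lifecycle.jsonl`. This is a lifecycle robustness fix aimed at the "already exists" 400 caused by sessions left on the server after a timeout; it does not change the routing policy.
- `scripts/prepare_real_ali_samples.py` adds `--window-min-turns` and `--window-output-name` for generating reproducible multi-turn-only window samples.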
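The close+retry behavior on `/open_session` 400 can be illustrated with injected callables. Everything here is a hypothetical stand-in: the function name, the status-code-returning `open_fn` / `close_fn`, the log record shape, and the fake server are assumptions for illustration, not the interface used in `replay.py`.

```python
def open_session_with_retry(open_fn, close_fn, session_id, lifecycle_log):
    """Open a session; on HTTP 400, close the stale one and retry exactly once."""
    status = open_fn(session_id)
    if status != 400:
        return status
    close_fn(session_id)
    lifecycle_log.append({"session": session_id, "event": "close-and-retry"})
    return open_fn(session_id)


# Fake server state: "s1" was left behind by an earlier timed-out request.
server_sessions = {"s1"}


def fake_open(session_id):
    if session_id in server_sessions:
        return 400  # "already exists"
    server_sessions.add(session_id)
    return 200


def fake_close(session_id):
    server_sessions.discard(session_id)


log = []
status_stale = open_session_with_retry(fake_open, fake_close, "s1", log)
status_fresh = open_session_with_retry(fake_open, fake_close, "s2", log)
```

Retrying only once keeps a persistent 400 from looping, and the lifecycle log records how often the workaround was needed.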
Verification:
- `uv run python -m py_compile scripts/prepare_real_ali_samples.py src/agentic_pd_hybrid/replay.py src/agentic_pd_hybrid/benchmark.py src/agentic_pd_hybrid/cli.py`
- `bash -n scripts/sweep_real_ali_kvc.sh`
### 10.3 Generated multi-turn-only sample
Sample path:
`outputs/real-ali-kvc-iter/samples-window-900s-600req-multiturn/ali-window-multiturn.jsonl`
Generation command:
```bash
uv run python scripts/prepare_real_ali_samples.py \
--trace /home/admin/cpfs/wjh/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
--output-root outputs/real-ali-kvc-iter/samples-window-900s-600req-multiturn \
--window-duration-s 900 \
--window-target-requests 600 \
--window-buckets 15 \
--window-min-turns 2 \
--window-output-name ali-window-multiturn.jsonl \
--profiles representative-mt \
--max-sessions 64 \
--max-turns-per-session 12
```
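Sample profile:
- 626 requests, 107 sessions, all 107 multi-turn.
- sampled duration 889.341s.
- turn2+ = 519.
- direct-eligible turn2+ = 473 / 519 = 91.1%.
- missing parent = 0.
- input p50/p90/p99 = 26,846 / 91,596 / 123,898 tokens.
This case is a "multi-turn pressure slice with single-turn filtered out". It cannot replace the full mixed workload, but it can answer one question: if the workload really is dominated by multi-turn coding agent sessions, do KVC's direct-to-D coverage and stability approach the microbenchmark?
### 10.4 GPU resource blocker
As of this entry, all 8 GPUs are occupied by another group of `vllm serve` processes, each using about 82GB / 98GB, on ports 51000-51007. These are not this repo's SGLang/benchmark processes, so no new performance run was started, to avoid mistaking resource contention for a KVC policy failure.
Once the GPUs are free, run these two groups first: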
```bash
# Mixed real window: check whether seed-only-multiturn brings the 53/600 errors down
TRACE=outputs/real-ali-kvc-iter/samples-window-900s-600req/ali-window.jsonl \
OUT_ROOT=outputs/real-ali-kvc-iter/runs/window900s-600req-ts1-kvc-seedonly-mt-mem082 \
RUNS="kvc" \
TIME_SCALE=1 \
CONCURRENCY=32 \
REQUEST_TIMEOUT_S=600 \
STACK_TIMEOUT_S=1800 \
EXTRA_SERVER_ARGS="--mem-fraction-static 0.82" \
KVC_SEED_ONLY_MULTITURN=1 \
bash scripts/sweep_real_ali_kvc.sh
# Multi-turn-only workload: DP vs KVC, to check whether the filtered workload reproduces the microbenchmark gain
TRACE=outputs/real-ali-kvc-iter/samples-window-900s-600req-multiturn/ali-window-multiturn.jsonl \
OUT_ROOT=outputs/real-ali-kvc-iter/runs/window900s-600req-multiturn-ts1-mem082 \
RUNS="dp kvc" \
TIME_SCALE=1 \
CONCURRENCY=32 \
REQUEST_TIMEOUT_S=600 \
STACK_TIMEOUT_S=1800 \
EXTRA_SERVER_ARGS="--mem-fraction-static 0.82" \
KVC_SEED_ONLY_MULTITURN=1 \
bash scripts/sweep_real_ali_kvc.sh
```
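### 10.5 multi-turn-only launch attempt blocked by GPU occupancy
The user asked to launch the multi-turn-only `pd-disaggregation` vs `kvcache-centric` comparison. A pre-launch check found all 8 GPUs occupied by external `vllm serve` processes, each using about 84GB / 98GB, on ports 51000-51007. These processes do not belong to this repo's SGLang/benchmark runs.
SGLang was therefore not force-started this time. The remaining GPU memory is not enough to start 2P6D or the 8-worker control; forcing a run would only produce initialization OOM or unstable timeouts, which cannot be used to judge whether KVC pd-hybrid beats pd-disaggregation.
The multi-turn-only comparison command to run once resources are released: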
```bash
TRACE=outputs/real-ali-kvc-iter/samples-window-900s-600req-multiturn/ali-window-multiturn.jsonl \
OUT_ROOT=outputs/real-ali-kvc-iter/runs/window900s-600req-multiturn-ts1-pd-vs-kvc-mem082 \
RUNS="pd kvc" \
TIME_SCALE=1 \
CONCURRENCY=32 \
REQUEST_TIMEOUT_S=600 \
STACK_TIMEOUT_S=1800 \
EXTRA_SERVER_ARGS="--mem-fraction-static 0.82" \
KVC_SEED_ONLY_MULTITURN=1 \
bash scripts/sweep_real_ali_kvc.sh
```
### 10.6 multi-turn-only PD vs KVC: formal results
The multi-turn-only comparison was launched and completed after the resources were released. Command:
```bash
TRACE=outputs/real-ali-kvc-iter/samples-window-900s-600req-multiturn/ali-window-multiturn.jsonl \
OUT_ROOT=outputs/real-ali-kvc-iter/runs/window900s-600req-multiturn-ts1-pd-vs-kvc-mem082 \
RUNS="pd kvc" \
TIME_SCALE=1 \
CONCURRENCY=32 \
REQUEST_TIMEOUT_S=600 \
STACK_TIMEOUT_S=1800 \
EXTRA_SERVER_ARGS="--mem-fraction-static 0.82" \
KVC_SEED_ONLY_MULTITURN=1 \
bash scripts/sweep_real_ali_kvc.sh
```
Run directories:
- PD: `outputs/real-ali-kvc-iter/runs/window900s-600req-multiturn-ts1-pd-vs-kvc-mem082/pd-disaggregation-kv-aware-20260512T030433Z`
- KVC: `outputs/real-ali-kvc-iter/runs/window900s-600req-multiturn-ts1-pd-vs-kvc-mem082/kvcache-centric-kv-aware-worker-admission-20260512T040444Z`
The sample is still 626 requests, 107 sessions, 889.341s, all multi-turn sessions.
| Strategy | Requests | Errors | Trunc | E2E mean | E2E p50 | E2E p90 | E2E p99 | TTFT mean | TTFT p50 | TTFT p90 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| PD-disaggregation 2P6D | 626 | 0 | 0 | 97.013s | 70.243s | 214.309s | 308.406s | 94.506s | 69.048s | 212.528s |
| KVC 2P6D worker admission, no BP, seed-only-multiturn | 626 | 39 | 0 | 43.362s | 8.239s | 135.289s | 236.475s | 40.578s | 1.442s | 132.233s |
The paired comparison is computed only over the 587 requests where KVC succeeded and PD also has a latency:
- PD same-request E2E mean/p50/p90/p99: 97.457s / 70.514s / 214.095s / 309.362s.
- KVC same-request E2E mean/p50/p90/p99: 43.362s / 8.239s / 135.930s / 237.283s.
- mean E2E reduction: 55.5%.
- absolute mean improvement: 54.095s.
- wins/losses: 472 / 115.
Breakdown by KVC execution mode:
| KVC mode | Count | KVC mean | PD same mean | Reduction |
|---|---:|---:|---:|---:|
| `kvcache-direct-to-d-session` | 286 | 2.255s | 92.944s | 97.6% |
| `pd-router-fallback-large-append-session-cap` | 169 | 88.869s | 113.614s | 21.8% |
| `pd-router-d-session-reseed` | 25 | 143.456s | 106.501s | -34.7% |
| `pd-router-large-append-reseed` | 19 | 47.631s | 88.981s | 46.5% |
| `pd-router-turn1-seed` | 78 | 55.974s | 73.050s | 23.4% |
Breakdown by turn depth:
- turn2+: 504 successful paired requests; KVC mean 40.791s vs PD mean 101.055s; reduction 59.6%.
- turn>=5: 299 successful paired requests; KVC mean 34.121s vs PD mean 104.697s; reduction 67.4%.
- turn>=10: 161 successful paired requests; KVC mean 39.027s vs PD mean 86.548s; reduction 54.9%.
KVC execution modes:
- `kvcache-direct-to-d-session`286。
- `pd-router-fallback-large-append-session-cap`169。
- `pd-router-turn1-seed`78。
- `pd-router-d-session-reseed`25。
- `pd-router-large-append-reseed`19。
- `pd-router-fallback-no-d-capacity`4。
- `pd-router-turn1-d-backpressure`5。
- `pd-router-d-session-reseed-after-eviction`1。
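- error rows: 39, recorded as `kvcache-centric`.
The source of KVC's gain is very clear: for the 286 direct-to-D requests, the same-request mean falls from PD's 92.944s to 2.255s, essentially reproducing the core mechanism gain of the microbenchmark. The path skips the P worker and the P->D KV transfer and only processes the append suffix on an existing D session. Overall actual KV transfer blocks fall from 4436 for PD same-success to 3827 for KVC success; in the summary accounting, KVC's total actual KV transfer blocks are 3827, below PD's 5276.
This run still cannot serve as a "stable, production-grade win" conclusion:
1. KVC still has 39/626 errors, an error rate of 6.23%; PD has 0.
2. All 39 errors are client-side `ReadTimeout`, not server-side OOM/Traceback; the server logs contain no matching crash keywords.
3. Error distribution: 24 at turn1, 15 at turn2+; by decode node: decode-0 15, decode-1 9, decode-3 7, decode-4 5, decode-5 3.
4. 8 `/open_session` 400 responses were caught by close+retry and written to `structural/session-lifecycle.jsonl`; none became an HTTP 400 error row.
5. The long-tail drain is pronounced: PD finishes in about 60min and KVC in about 40min, both far beyond the 889s trace duration. At 900s KVC had completed 490/626 while PD had completed only 283/626, so KVC has better mid-run throughput, but the last few dozen large-append fallbacks still drag out the tail.
6. direct-to-D coverage is 286/626 = 45.7%, below the sample's static direct-eligible turn2+ ratio of 91.1%. The gap comes mainly from D session/residency capacity, the large-append session cap, and reseed/fallback.
Current assessment:
- Looking only at successful paired requests, KVC already shows a strong E2E improvement over PD-disaggregation on the multi-turn-only workload, and the improvement comes mainly from direct-to-D session reuse.
- Judged by system reliability, the current implementation does not yet pass, because a 6.23% timeout rate undercuts any "stable system" conclusion.
- The main reason for the gap between the real workload and the microbenchmark is not that the KVC fast path is ineffective; it is insufficient fast-path coverage, D-side resident KV / session admission pressure, large-append fallback, and the timeout stability of the seeded/reseed path.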

123
docs/REFACTOR_PLAN_ZH.md Normal file

@@ -0,0 +1,123 @@
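# Refactor Plan v0 (minimal edition)
**Date**: 2026-05-06
**Goal**: with minimal changes + lightweight experiments, verify whether the structural defects raised in `docs/AGENTIC_FIT_ANALYSIS_ZH.md` are real and how large their impact is.
**Budget**: 8h of GPU time (about 4-6 smoke runs of ~30-60 min each)
**KISS boundaries**: do not touch the main-loop structure of SGLang `scheduler.py`; do not introduce a new mooncake protocol; do not implement cross-D session migration; do not split admission into probe/commit; do not touch the LRU eviction policy.
## Plan conclusion (already confirmed with the user)
Re-reviewing plan-v0 showed that neither of the two original Phase 1 changes **addresses a real bug**:
- `_estimate_session_resident_tokens` returning the full prompt is by design: every call site that needs the "increment" already subtracts `target - current` (`replay.py:1247-1254`, `:1393-1394`, `:1490-1491`).
- `decode_resident_blocks` never shrinking only wastes a few MB of memory and **does not affect routing decisions** (hash_ids in the SWE trace are session-unique, so the policy still picks the correct D).
The final minimal edition makes just one code change (**add backpressure**) plus extensive instrumentation.
## The only code change: backpressure signal
### Change 1: add two fields to the SGLang `admit_direct_append` response
Files: `third_party/sglang/python/sglang/srt/managers/io_struct.py`, `scheduler.py`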
```python
@dataclass
class DirectAppendAdmissionReqOutput:
    ...  # existing fields unchanged
    recommended_pause_ms: int = 0  # new
    queue_depth: int = 0  # new
```
Compute the hint at the end of `scheduler.py:admit_direct_append`:
```python
def _compute_backpressure_pause_hint(self) -> int:
    # Returns milliseconds, as an int to match `recommended_pause_ms: int`.
    depth = len(self.disagg_decode_transfer_queue.queue)
    if depth < 8:
        return 0
    return min(2000, depth * 100)  # simple linear ramp, capped at 2000 ms
```
### Change 2: the replay side backs off according to the hint
File: `src/agentic_pd_hybrid/replay.py`
- Add `pause_until_s: dict[str, float]` to `DecodeResidencyState`
- `_query_decode_direct_admission` parses `recommended_pause_ms` from the response and sets `pause_until_s[server_url] = now + pause_ms / 1000`
- Before calling `_invoke_router` / `_invoke_decode_session_direct`, check `pause_until_s[decode_url]`; if `now < pause_until`, sleep until that time
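The replay-side backoff described in the list above could look roughly like the following. This is a minimal sketch under stated assumptions, not the project's code: `PauseTracker`, `record_hint`, `remaining_s`, and `wait_if_paused` are hypothetical names introduced for illustration; only `pause_until_s` and `recommended_pause_ms` come from the plan.

```python
import asyncio
import time


class PauseTracker:
    """Hypothetical helper mirroring the planned `pause_until_s` map."""

    def __init__(self, clock=time.monotonic):
        # decode server URL -> deadline (in clock seconds) before which we hold back
        self.pause_until_s: dict[str, float] = {}
        self._clock = clock

    def record_hint(self, server_url: str, recommended_pause_ms: int) -> None:
        # pause_until_s[server_url] = now + pause_ms / 1000
        if recommended_pause_ms > 0:
            self.pause_until_s[server_url] = self._clock() + recommended_pause_ms / 1000

    def remaining_s(self, server_url: str) -> float:
        # Seconds left until the deadline; 0.0 if none is set or it has passed.
        return max(0.0, self.pause_until_s.get(server_url, 0.0) - self._clock())

    async def wait_if_paused(self, server_url: str) -> float:
        # Sleep until the recorded deadline and return the time slept.
        delay = self.remaining_s(server_url)
        if delay > 0:
            await asyncio.sleep(delay)
        return delay
```

Injecting the clock keeps the deadline arithmetic testable without real sleeps; the actual implementation may well inline this logic into `DecodeResidencyState`.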
### Change 3: new CLI flag
`src/agentic_pd_hybrid/cli.py`, `benchmark.py`
```
--enable-backpressure # default false; keeps baseline behavior
```
### Change 4: observability logs
Each run dir gains three new jsonl files:
- `admission-events.jsonl`: one row per admission RPC (timestamp, session, D, can_admit, queue_depth, pause_ms, latency_s, available_tokens, evicted_session_count)
- `backpressure-events.jsonl`: one row per actual sleep (timestamp, D, sleep_ms, queue_depth_at_signal)
- `session-d-binding.jsonl`: one row the first time each session is opened on a given D (timestamp, session, D, turn_id)
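## Experiment matrix (within the 8h budget)
Ordered as "anchor first, then single-variable controls". The rightmost column is the estimated machine time.
| ID | Configuration | Purpose | Machine time |
|---|---|---|---|
| **E0 (existing)** | v5 baseline, time-scale=10, no backpressure | Anchor; already exists at `outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/run1` | 0 |
| **E1** | v5 + backpressure ON, time-scale=10, full trace | Validate Claim §3 (can backpressure eliminate the KVTransferError avalanche?) | ~50 min |
| **E2** | v5 baseline, time-scale=1, **short trace** (first 12 sessions ≈ 1000 reqs) | Validate Claim §7 (time-scale=10 distortion); backpressure off | ~60 min |
| **E3** | 8DP CA, time-scale=1, same trace as E2 | Control for E2: does KVC still lose to DP under real timing? | ~60 min |
| **E4** | v5 + backpressure, time-scale=1, same trace as E2 | Is backpressure still useful under real timing? | ~60 min |
| **E5** (optional) | v5 baseline, time-scale=10, **concurrency=4**, full trace | Validate Claim §1 (is high concurrency a necessary condition?) | ~50 min |
4-5 runs, ~3-5h in total. The remaining budget is reserved for rerunning failures and for analysis.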
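## Experiment goals: mapping back to §1-§7 one by one
| Doc § | Claim | Experiment that refutes/supports it | Metrics needed |
|---|---|---|---|
| §1 | Permanent session pin + capacity-blind selection causes bimodality | Existing E0 data is sufficient | per-session distribution of direct-to-D rate |
| §2 | LRU cannot keep up with pressure | Existing E0 logs are sufficient, plus E1 to see how the trim/error ratio changes with backpressure | trim event count vs OOM count |
| §3 | Missing backpressure is the source of the avalanche | E0 vs E1 | KVTransferError count, P99 latency |
| §4 | admission RPC interferes with the scheduler | Out of scope this round (needs the admission probe split to test; not done) | |
| §5 | P-side is unaware of D health | Existing E0 logs are sufficient (prefill-0 vs prefill-1 error counts) | per-P KVTransferError |
| §6 | (retracted) | | |
| §7 | time-scale=10 distortion | E0 vs E2 (same KVC, different time-scale); E2 vs E3 (same time-scale, KVC vs DP) | latency distribution, direct-to-D rate |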
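## Final experiment report deliverable
After the runs, write `docs/STRUCTURAL_VALIDATION_REPORT_ZH.md`, giving for each of §1-§7:
- **Claim as stated**
- **Data evidence** (which experiment, which metric)
- **Conclusion**: supported / partially supported / refuted
- **Quantified impact**: numeric differences
- **Uncertainty**: N=1 risk, other confounders
## What we will not do (KISS boundary)
| Tempting but skipped | Reason |
|---|---|
| N=3 repeats | Does not fit in 8h; a single run shows the general direction |
| Full parameter sweep | Only time-scale and the backpressure boolean are varied |
| Change LRU eviction | Out of scope this round |
| Cross-D migration | Out of scope this round |
| Admission probe/commit split | Out of scope this round |
| P-side D-health routing | Out of scope this round |
| Fix the two "non-bugs" (estimate / aging) | Verified not to be real bugs |
## Expected failure paths
- **GPU resources are tight**: shrink the smoke trace further (first 8 sessions / 600 reqs)
- **time-scale=1 runs past 1.5h**: truncate to what completes within 600s
- **backpressure misconfigured**: start with the simple linear sleep_ms = depth * 100; if it cannot be tuned, roll back to 0 (no backpressure)
- **SGLang patch fails to build**: all patches touch only a few lines in io_struct.py and scheduler.py and can be reverted individually with git restore
---
Next: implement → run smoke → write the report.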


@@ -0,0 +1,304 @@
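# Structural Defect Validation Report
**Date**: 2026-05-06
**Reference data sources**:
- `outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/` (v5 KVC kv-aware Option D, 2P6D, **3 reruns with identical configuration**)
- `outputs/qwen3-30b-tp1-exps/exp1_8way_dp_cache_aware_summary.json` (same trace, 8DP CA)
- `outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/.../logs/decode-{0..5}.log`, `prefill-{0,1}.log`
**Model**: Qwen3-30B-A3B, TP1, single node with 8×H100 80GB; trace `qwen35-swebench-50sess.jsonl` (4449 reqs / 52 sessions)
**Scope**: check whether the structural claims in §1-§7 of `docs/AGENTIC_FIT_ANALYSIS_ZH.md` are real, and quantify their impact.
> ⚠️ **Environment limitation**: there was no GPU access this round, so no new sweep was run. All data comes from the existing v5 reruns + the 8DP baseline. The backpressure code is implemented but **not validated end to end**; below it is marked "expected benefit, pending GPU smoke".
---
## 0. Validity anchor: N=1 is not trustworthy
Error counts drift across 3 v5 baseline EXP2 runs (**identical configuration**):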
| Run | Errors | Lat P50 | Lat P90 | TTFT P50 |
|---|---:|---:|---:|---:|
| run1 | **372** | 1.11s | 8.65s | 0.147s |
| run2 | **912** | 0.94s | 7.68s | 0.071s |
| run3 | **396** | 1.22s | 8.43s | 0.183s |
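Errors drift by **2.5×** (372 → 912) and P50 latency by **30%**. **No N=1 comparison showing a difference under 30% can be trusted.** Every later comparison of "same trace, different configuration / different code" needs N≥3 to be meaningful.
**On the KVC vs DP headline: the best of the 3 KVC runs (P50=0.94s) is still 1.45× DP (P50=0.65s).** The 8-way DP advantage is far outside single-run variance, so this headline conclusion is unaffected by the variance.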
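The drift figures quoted here can be recomputed from the table. The sketch below only restates that arithmetic; the `spread` helper and the dictionary layout are illustrative choices, and the DP P50 of 0.65s is the value quoted in this section.

```python
# Values copied from the 3-run table above.
runs = {
    "run1": {"errors": 372, "p50_s": 1.11},
    "run2": {"errors": 912, "p50_s": 0.94},
    "run3": {"errors": 396, "p50_s": 1.22},
}
DP_P50_S = 0.65  # 8-way DP cache-aware P50 quoted in this section


def spread(values):
    """Ratio between the largest and smallest value across runs."""
    return max(values) / min(values)


error_spread = spread([r["errors"] for r in runs.values()])
p50_spread = spread([r["p50_s"] for r in runs.values()])
best_kvc_over_dp = min(r["p50_s"] for r in runs.values()) / DP_P50_S

print(f"errors: {error_spread:.2f}x  P50: {p50_spread:.2f}x  best KVC / DP: {best_kvc_over_dp:.2f}x")
```

A max/min ratio is a crude spread measure for N=3, but it is enough to show that the KVC-vs-DP gap exceeds the run-to-run P50 drift.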
---
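## §1. Session permanently pinned to a D + capacity-blind selection → extreme bimodality ✅ fully supported
### Claim
KvAwarePolicy scores mainly on hash overlap and has no D-capacity term. Once a session first lands on a D it stays pinned there permanently, so large sessions are repeatedly refused admission on an already-full D while small sessions get 100% direct-to-D on their original D.
### Data
**(a) Sessions are permanently bound, consistently across all 3 reruns**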
```
run1: 52 sessions, avg distinct-D-per-session = 1.00
run2: 52 sessions, avg distinct-D-per-session = 1.00
run3: 52 sessions, avg distinct-D-per-session = 1.00
```
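Each session touches exactly **1** D worker over the whole run, identically in 3 independent runs. **This is structure, not coincidence.**
**(b) Direct-to-D hit rate is extremely bimodal**
| Direct-to-D rate | run1 | run2 | run3 |
|---|---:|---:|---:|
| 0-20% (starved) | 15 | 18 | 16 |
| 20-40% | 7 | 6 | 7 |
| 40-60% | 11 | 7 | 9 |
| 60-80% | 5 | 6 | 4 |
| 80-100% (healthy) | 14 | 15 | 16 |
The middle buckets are thin and the two ends are crowded.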
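A bucket table like the one above can be produced from per-session direct-to-D rates along the following lines. This is an illustrative sketch, not the project's analyzer: `bucket_of`, `histogram`, and the sample rates are made up for the example.

```python
from collections import Counter

BUCKETS = ["0-20%", "20-40%", "40-60%", "60-80%", "80-100%"]


def bucket_of(rate: float) -> str:
    """Map a direct-to-D rate in [0, 1] to one of five 20%-wide buckets."""
    index = min(int(rate * 5), 4)  # a rate of exactly 1.0 falls in the last bucket
    return BUCKETS[index]


def histogram(rates_by_session: dict[str, float]) -> dict[str, int]:
    """Count sessions per bucket, keeping empty buckets so columns line up."""
    counts = Counter(bucket_of(rate) for rate in rates_by_session.values())
    return {bucket: counts.get(bucket, 0) for bucket in BUCKETS}


# Made-up rates for illustration only; these are not data from the runs.
example = {"s1": 0.05, "s2": 0.19, "s3": 0.55, "s4": 0.95, "s5": 1.0}
print(histogram(example))
```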
**(c) Sessions starved in all 3 runs correlate strongly with session size**
```
13 sessions starved (<20% direct-to-D) in ALL 3 runs.
avg peak input of consistently-starved sessions: 62043 tokens
avg peak input of consistently-lucky sessions: 31344 tokens
ratio: 1.98× — starved sessions are exactly 2× larger.
```
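**13/52 = 25% of sessions are starved in all 3 independent runs, and their peak input is about 2× (1.98×) that of the healthy sessions.** This rules out the "luck" hypothesis and points to a structural failure of large sessions on capacity-overloaded Ds.
### Impact
- For 25% of sessions almost every turn takes the fallback path, which is **100× slower in TTFT and 6× slower in E2E** than direct-to-D (data point: fallback path mean latency ~3.5s vs direct ~0.5s)
- For these sessions the user experience is "systematically bad", not "occasionally slow"
- **From an SLO perspective, P99 is driven entirely by these 13 sessions**
### Conclusion
**Fully supported**. Fix direction (out of scope this round): add a capacity penalty to the policy score, allow sessions to migrate across Ds, and introduce hot-session retract on the D side.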
---
## §2. D-side LRU only evicts idle sessions → cannot keep up with pressure ✅ fully supported
### Claim
`evict_idle_streaming_sessions_lru` at `scheduler.py:2040` can only evict sessions where "all reqs finished + streaming mode". Under high concurrency hot sessions are never idle, so LRU finds nothing to evict; the D fills to 100% and then hits the mooncake transfer timeout.
### Data (v5 baseline rerun, run1)
| D worker | Trim events | KVTransferError | Peak token_usage |
|---|---:|---:|---:|
| decode-0 | 9 | 0 | 0.99 |
| decode-1 | 43 | 4 | 0.99 |
| decode-2 | 16 | 153 | 0.97 |
| decode-3 | 37 | 29 | 0.99 |
| decode-4 | 28 | 90 | **1.00** |
| decode-5 | 30 | 93 | **1.00** |
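**All 6 Ds peak at ≥ 0.97**, and 2 of them reach 1.00 (KV pool fully exhausted). **LRU fires 9-43 times per D, far fewer than the 90-153 transfer errors on the worst Ds.**
decode-2 is the extreme case: 16 trims vs 153 errors, so errors outnumber LRU trims by **9.5×**.
### Impact
- 369 KVTransferError in total for the run (sum over the 6 Ds)
- This corresponds to a request failure rate of ~8% (averaging the v5 error counts 9/372/912 over three runs gives ~430/4449 = 9.7%)
- **Each mooncake timeout is 32s**, which directly adds tens of seconds to the P99 latency tail
### Conclusion
**Fully supported**. Fix direction (out of scope this round): tiered eviction, that is, beyond idle sessions also retract cold sessions (weighted by access frequency/recency). Backpressure (this round's code) only turns the "D is full" avalanche from timeout errors into active waiting; **it does not actually solve the capacity problem**.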
---
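## §3. No D→Replay backpressure channel ✅ supported (fix implemented)
### Claim
When the D transfer queue is full, requests hit the 32s timeout and raise KVTransferError. No "D overloaded, slow down" signal flows back to replay, and concurrency stays at 32 without ever dropping.
### Data
- All 369 KVTransferError from §2 are 32s mooncake timeouts (logged as `Failed to send kv chunk` or `Decode instance could be dead`)
- Errors cluster in the second half of the run (per the existing `KVC_DEBUG_JOURNEY_V1_TO_V5.md` §v4, errors only start accumulating after 44.8% of the run)
- This indicates that **things are fine while D capacity is plentiful, and once the capacity ceiling is reached all subsequent requests fail together**, which is typical of a system without backpressure
### Fix (implemented this round, pending GPU smoke validation)
Code changes:
1. `third_party/sglang/python/sglang/srt/managers/io_struct.py`: add a `recommended_pause_ms` field to `DirectAppendAdmissionReqOutput`
2. `third_party/sglang/python/sglang/srt/managers/scheduler.py:admit_direct_append`: compute the hint from `transfer_queue_depth`, `retracted_queue_depth`, and `token_usage_after`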
```python
def _compute_backpressure_pause_hint(...):
    if retracted_queue_depth > 0: return 1500
    if token_usage_after >= 0.90: return max(200, min(2000, overshoot * 5))
    if transfer_queue_depth >= 8: return min(2000, transfer_queue_depth * 100)
    return 0
```
3. `src/agentic_pd_hybrid/replay.py`
- `DecodeResidencyState.pause_until_s: dict[str, float]`
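- `_query_decode_direct_admission` parses the hint and updates `pause_until_s`
- New `_wait_for_decode_pause`, checked on entry to `_invoke_router` / `_invoke_session_direct`
4. CLI flags: `--enable-backpressure`, `--backpressure-max-pause-s 2.0` (off by default)
5. Structural logs: `structural/admission-events.jsonl`, `backpressure-events.jsonl`, `session-d-binding.jsonl`
### Expected benefit (pending GPU smoke, E2 vs E1)
- KVTransferError should fall from ~370 / 4449 to < 50 / 4449
- P99 should improve (the 32s timeout tail disappears)
- Overall mean latency may **rise slightly** (forced pauses), but P99 should drop sharply
- backpressure-events.jsonl should show D-4 / D-5 accumulating many pause events (consistent with the §2 data)
### Conclusion
**Claim supported; fix implemented, pending smoke validation**. Note that backpressure is a **degradation** mechanism, not a performance optimization: it trades hard errors for active waiting, and overall throughput will not improve because of it.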
---
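## §4. Admission RPC coupled to the scheduler main loop ⚠️ indirect evidence, not directly validated this round
### Claim
`admit_direct_append` runs inside the scheduler main loop and iterates over session slots; at an admission RPC rate of 16+/s it competes with decode for scheduling.
### Existing indirect evidence
- `docs/V5_PROFILE_INVESTIGATION_ZH.md`: adding just 1Hz `/server_info` polling raised EXP2 errors from 9 to 415 (46×), yet the three v6 P0 baselines without polling also produced 372/912/396. **Polling is not the only cause; the main loop is itself load-sensitive.**
### Not done this round
- There is no controlled experiment for "split the admission probe into fast/slow". It needs fairly deep SGLang changes (a lock-free snapshot), which is outside the KISS boundary.
### Conclusion
**Claim indirectly supported, not directly validated this round**. The backpressure implementation does not change the admission RPC frequency (still once per turn); only the result can now trigger a sleep. If this claim holds, the number of admission RPCs stays roughly the same after adding backpressure but the `pause_ms` in each response becomes non-zero. **The new admission-events.jsonl can be used to check this directly after the GPU smoke run.**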
---
## §5. P-side round-robin is unaware of D health ✅ supported
### Claim
`pd_router.py:_select_decode_index` is plain round-robin. Whichever P hits a hot D fails repeatedly, while the other P is completely unaffected.
### Data (v5 baseline rerun, run1)
| Worker | KVTransferError | "Decode could be dead" |
|---|---:|---:|
| prefill-0 | **367** | 361 |
| prefill-1 | **2** | 0 |
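Per the summary, prefill-0 handled 2225 requests vs 2224 for prefill-1: **the request volume is split almost evenly, yet the error counts differ by 180×**.
### Impact
- Failed requests cluster on the link from P-0 to one hot D (`to 10.45.80.47:XXXXX` appears repeatedly in the logs)
- A single P's "dead link" accounts for **99%** of all KVTransferError
- If P selection could avoid a link that is stuck against a hot D, **the avalanche from a single-P failure could in theory be eliminated**
### Notes
- This phenomenon was **not cross-checked across the 3 v6 P0 reruns**; only the run1 logs were readable. It needs to be reconfirmed on prefill-{0,1}.log from a new sweep to clear the N=1 suspicion.
### Conclusion
**Supported by single-run data; multi-run consistency not verified**. Fix direction (out of scope this round): when the router picks a P, take into account (the P's current inflight transfer count, the target D's health).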
---
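## §6. (Retracted) Replay-side session footprint estimate inflation
Retracted after a close reading of the code while writing the plan: `_estimate_session_resident_tokens` returns the full prompt, but every call site that needs an increment (`replay.py:1247-1254`, `:1393-1394`, `:1490-1491`) already handles it by subtracting `target - current`. **Not a bug**.
---
## §7. time-scale=10 compresses inter-turn gaps to 1/10 ✅ fully supported
### Data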
```
inter-turn gap in the original trace (n=4397):
p10=1.6s p50=2.5s p90=7.8s p99=25.1s max=261s
actual replay gap at time-scale=10:
p10=0.16s p50=0.25s p90=0.78s p99=2.5s max=26s
```
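Real agentic users/agents pause 2-8 seconds between turns (thinking, typing, tool calls, agent reasoning). time-scale=10 compresses those windows to 0.16-0.78 seconds, which **artificially removes the D's natural idle time**. That idle time is exactly the opportunity KVC wants to exploit: while a session is briefly idle, LRU can evict it and other sessions can be admitted.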
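The compression itself is a plain division of each gap by the time-scale factor. The sketch below restates that using the percentiles quoted above; `replay_gaps` is a hypothetical helper name, not a function from `replay.py`.

```python
# Inter-turn gap percentiles of the original trace, in seconds (from the block above).
TRACE_GAP_S = {"p10": 1.6, "p50": 2.5, "p90": 7.8, "p99": 25.1, "max": 261.0}


def replay_gaps(trace_gaps: dict[str, float], time_scale: float) -> dict[str, float]:
    """Divide every inter-turn gap by the time-scale factor."""
    return {name: gap / time_scale for name, gap in trace_gaps.items()}


compressed = replay_gaps(TRACE_GAP_S, time_scale=10)
for name, gap in compressed.items():
    print(f"{name}={gap:.2f}s")
```

Calling `replay_gaps(TRACE_GAP_S, time_scale=1)` returns the original gaps, which is the condition the missing time-scale=1 baseline would exercise.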
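### Measurement impact
- All v3-v6 data is based on time-scale=10
- This means every "KVC loses to the baseline on SWE" conclusion **may have been amplified by the benchmark**
- The permanent starvation of 25% of sessions from §1 may ease significantly at time-scale=1, because the D has more time to drain
### Not done this round
- No time-scale=1 baseline was run. This is currently the project's **most important missing validation**.
- In the smoke sweep script (`scripts/sweep_backpressure_smoke.sh`), E3 and E4 cover the time-scale=1 KVC + DP short-trace comparison, to be run when GPUs are available.
### Conclusion
**Claim fully supported; time-scale=1 validation is a P0 to-do**.
---
## Headline comparison (same trace, same hardware)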
```
8-way DP cache-aware (TP1):
errors= 0 | latency mean=1.426s p50=0.654s p90=3.609s
| TTFT mean=0.123s p50=0.093s p90=0.256s
KVC v5 2P6D (3 reruns, no polling):
run1: errors=372 | mean=3.50s p50=1.11s p90=8.65s | TTFT mean=2.13s
run2: errors=912 | mean=3.00s p50=0.94s p90=7.68s | TTFT mean=1.64s
run3: errors=396 | mean=3.42s p50=1.22s p90=8.43s | TTFT mean=2.07s
```
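All three KVC runs lose to DP, by far more than single-run variance:
- Latency mean: DP is better by about **+131%** (KVC average 3.30s vs DP 1.43s, i.e. 2.3×)
- Latency P50: DP is better by about **+67%** (KVC average 1.09s vs DP 0.65s)
- TTFT mean: DP is better by about **+1500%** (KVC average 1.95s vs DP 0.12s, roughly 16× slower)
- Errors: DP 0 vs KVC average ~560
**This is currently the most sobering fact about the project**: all of KVC's added complexity yields a negative return.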
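The ratios behind this comparison can be recomputed from the headline block. The following sketch only restates that arithmetic; `kvc_over_dp` and the dictionary layout are illustrative, and the inputs are the rounded figures printed above.

```python
from statistics import mean

# Figures copied from the headline block above.
dp = {"lat_mean": 1.426, "lat_p50": 0.654, "ttft_mean": 0.123}
kvc_runs = [
    {"lat_mean": 3.50, "lat_p50": 1.11, "ttft_mean": 2.13, "errors": 372},
    {"lat_mean": 3.00, "lat_p50": 0.94, "ttft_mean": 1.64, "errors": 912},
    {"lat_mean": 3.42, "lat_p50": 1.22, "ttft_mean": 2.07, "errors": 396},
]


def kvc_over_dp(metric: str) -> float:
    """Ratio of the 3-run KVC average to the DP value for one metric."""
    return mean(run[metric] for run in kvc_runs) / dp[metric]


for metric in dp:
    ratio = kvc_over_dp(metric)
    print(f"{metric}: KVC/DP = {ratio:.2f}x ({(ratio - 1) * 100:+.0f}%)")
print("mean KVC errors:", mean(run["errors"] for run in kvc_runs))
```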
---
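## Overall conclusions
Classified along two dimensions, "is it structural" and "how large is the impact":
| Claim | Structural | Impact | Validated this round | Fix (within KISS) | Fix (outside KISS) |
|---|---|---|---|---|---|
| §1 Session pin + capacity-blind selection | Strong | Large (25% of sessions starved) | ✅ consistent across 3 runs | ❌ | capacity-aware policy + cross-D migration |
| §2 LRU cannot keep up | Strong | Large (~370 KVTransferError per run) | ✅ data from 6 Ds | ❌ | tiered eviction, hot retract |
| §3 No backpressure | Strong | Medium-large (removes the 32s timeout avalanche) | ⚠️ implemented, pending smoke | ✅ **delivered this round** | |
| §4 admission RPC interference | Weak-medium | Medium | ⚠️ indirect | ❌ | probe / commit_evict split |
| §5 P-side unaware of D health | Medium | Medium (180× error gap between the two Ps) | ✅ N=1, needs N≥3 recheck | ❌ | router P selection with D-health feedback |
| §6 estimate inflation | | | ❌ retracted | | |
| §7 time-scale=10 distortion | Strong (measurement) | Large (could overturn every KVC vs DP conclusion) | ✅ data is clear | ✅ change a flag | |
### The two most important takeaways
1. **§7: time-scale=1 is a prerequisite for every current project conclusion** and must be done first. If KVC comes close to DP at time-scale=1, all the earlier v3-v6 diagnoses of "KVC loses decisively" need to be reinterpreted.
2. **§1 + §2 are twin structural problems**: a session permanently pinned to one D, plus a D that cannot evict when full, means large sessions are stuck permanently. Any fix that leaves both the policy and LRU untouched (including this round's backpressure) can only make the symptoms look better; it cannot remove the root cause.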
---
## Summary of this round's code changes (git diff scope)
```
src/agentic_pd_hybrid/replay.py # + structural logs + backpressure pause check + admission enhancements
src/agentic_pd_hybrid/cli.py # + CLI flags
src/agentic_pd_hybrid/benchmark.py # + CLI flag passthrough
third_party/sglang/python/sglang/srt/managers/io_struct.py
third_party/sglang/python/sglang/srt/managers/scheduler.py
# + recommended_pause_ms field + hint computation
scripts/sweep_backpressure_smoke.sh # 4-run smoke sweep (to run once GPUs are available)
scripts/analysis/analyze_backpressure_smoke.py
# companion analyzer
docs/REFACTOR_PLAN_ZH.md # plan document
docs/STRUCTURAL_VALIDATION_REPORT_ZH.md
# this report
```
Default behavior is **unchanged** (`enable_backpressure=False`); existing scripts and configurations are unaffected.
---
## To run once GPUs are available
```bash
bash scripts/sweep_backpressure_smoke.sh
python3 scripts/analysis/analyze_backpressure_smoke.py outputs/sweep_backpressure_smoke
```
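Budget: 4 runs × 30-60 min ≈ 3-4h of GPU time.
Per the §3 expectations, E2 (KVC + backpressure) should show errors down 70%+ relative to E1 (KVC baseline), an improved P99, and TTFT P50 flat or slightly higher. E3 (KVC + backpressure @ time-scale=1) vs E4 (DP @ time-scale=1) is the key comparison for validating §7.
If errors do not drop significantly from E1 to E2, either the backpressure hint formula is mistuned (the thresholds in `_compute_backpressure_pause_hint` are adjustable) or §3 is not actually the main cause of the avalanche (§2, the D-side LRU, is the more likely one).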


@@ -0,0 +1,95 @@
# SWE-Bench PD Hybrid Experiment Progress
## Experiment goal
Reproduce the three agentic-pd-hybrid serving mechanisms on a single 8xH100 node and compare Qwen3.5-35B-A3B performance on agentic trajectories from 500 SWE-Bench instances.
## Hardware environment
- 8x H100 80GB (NVLink interconnect, 2 NUMA nodes: GPU 0-3 / GPU 4-7)
- No RDMA/IB devices
- Transfer backend: **mooncake TCP** (nixl UCX segfaults because the pip package lacks CUDA support; abandoned)
## Experiment matrix
| Experiment | Mechanism | Workers | GPU allocation | Router | Policy |
|------|-----------|---------|----------|--------|--------|
| A | pd-disaggregation | 1P + 1D (TP4 each) | P: 0-3, D: 4-7 | Yes | default |
| B | pd-colo | 2 direct (TP4 each) | D0: 0-3, D1: 4-7 | No | default |
| C | kvcache-centric | 1P + 1D (TP4 each) | P: 0-3, D: 4-7 | Yes | default |
## Test workload
- Source data: `simm-swe-bench/outputs/20260416-205833-hicache-qwen35-verified-0-500/audit.jsonl`
- 39,417 lines (turns), 497 unique instances (sessions)
- 8-150 turns per instance (mean 79.3)
- Converted to the agentic-pd-hybrid trace format: `outputs/qwen35-swebench-500.jsonl`
## Key findings
### Transfer backend choice
- **nixl (UCX)**: the UCX library bundled with the pip-installed nixl_cu12 package has no CUDA support, which causes a segfault during GPU memory registration. The system UCX (/opt/hpcx/ucx) has CUDA support, but NIXL cannot use it because of RPATH.
- **mooncake (TCP)**: works. Two changes are required:
  1. `third_party/sglang/.../mooncake_transfer_engine.py`: read the protocol from the `MOONCAKE_PROTOCOL` environment variable instead of hard-coding `"rdma"`
  2. `src/agentic_pd_hybrid/stack.py`: when `transfer_backend == "mooncake"` and `force_rdma` is not set, automatically set `MOONCAKE_PROTOCOL=tcp`
### Code change log
1. **`third_party/sglang/python/sglang/srt/distributed/device_communicators/mooncake_transfer_engine.py`**
   - Replace the hard-coded `"rdma"` with `os.environ.get("MOONCAKE_PROTOCOL", "rdma")`
2. **`src/agentic_pd_hybrid/stack.py`**
   - In `_build_process_env()`, add: for mooncake without force_rdma, set `MOONCAKE_PROTOCOL=tcp` by default
3. **`scripts/convert_audit_to_trace.py`** (new)
   - Convert the sibench audit.jsonl into the agentic-pd-hybrid trace format
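The two mooncake changes listed above amount to a small environment-variable handshake. The sketch below illustrates the idea under stated assumptions: `mooncake_protocol` and `build_process_env` are hypothetical stand-ins, and the real `_build_process_env()` in `stack.py` may take different arguments. Using `setdefault` is one reading of "set by default"; it leaves an explicitly exported `MOONCAKE_PROTOCOL` untouched.

```python
import os


def mooncake_protocol(default: str = "rdma") -> str:
    """Read the transfer protocol from MOONCAKE_PROTOCOL, falling back to RDMA."""
    return os.environ.get("MOONCAKE_PROTOCOL", default)


def build_process_env(
    base_env: dict[str, str], transfer_backend: str, force_rdma: bool
) -> dict[str, str]:
    """Hypothetical stand-in for `_build_process_env()`: default mooncake to TCP."""
    env = dict(base_env)  # copy so the caller's environment is not mutated
    if transfer_backend == "mooncake" and not force_rdma:
        env.setdefault("MOONCAKE_PROTOCOL", "tcp")
    return env
```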
## Experiment progress
- [x] Step 0: environment setup (uv sync, nixl/mooncake installation)
- [x] Step 1: trace format conversion (39,417 lines verified)
- [x] Step 2: smoke test (pd-disaggregation, mooncake TCP, 100 requests): **passed**
  - 100/100 requests, 0 errors
  - Mean latency: 1.53s, P50: 0.77s, P90: 2.82s
  - TTFT: mean 0.49s, P50 0.29s; TPOT: mean 4.7ms
  - 91/100 cache hits
- [x] Step 3a: Experiment A full-scale attempt (39K reqs, 497 sessions): **aborted**
  - Run dir: `outputs/swebench-exps/pd-disaggregation-default-20260426T171113Z` (no metrics; killed)
  - The first 90% completed in ~80min (~8-10 req/s), but in the tail the D-side KV cache was 98% saturated
  - 497 concurrent sessions competed for D-side token space; 80-93 mamba sessions could not drain
  - **Lesson**: 1P+1D (TP4) cannot sustain 497 concurrent sessions; reduce the number of sessions or lower concurrency
- [x] Step 3b: Experiment A, pd-disaggregation (52 sessions, 4449 reqs, concurrency=32): **done**
  - Run dir: `outputs/swebench-exps/pd-disaggregation-default-20260426T202540Z`
  - Trace: `outputs/qwen35-swebench-50sess.jsonl` (10% sample, 52 sessions)
  - **Result**: 4449/4449 succeeded, 0 errors
  - Latency: mean=1.66s, P50=0.97s, P90=3.64s, P99=7.68s
  - TTFT: mean=0.45s, P50=0.34s, P90=0.88s
  - TPOT: mean=5.2ms, P50=5.2ms
  - Cache hit: 4199/4449 (94.4%)
- [x] Step 4: Experiment B, pd-colo: **failed (SGLang bug)**
  - Run dir: `outputs/swebench-exps/pd-colo-default-20260426T210129Z`
  - **Bug**: under `--disaggregation-mode null` (colocation), the Qwen3.5-35B-A3B model triggers a token_to_kv_pool_allocator memory leak
  - Error: `ValueError: token_to_kv_pool_allocator memory leak detected!`
  - Both direct workers crashed after handling ~5 requests each (Scheduler exception)
  - **Conclusion**: the currently vendored SGLang v0.5.10 does not support colocation mode for Qwen3.5-35B-A3B
- [x] Step 5: Experiment C, kvcache-centric: **done (high error rate)**
  - Run dir: `outputs/swebench-exps/kvcache-centric-default-worker-admission-20260426T210800Z`
  - 4390/4449 errors (98.7%): admission control is too conservative
  - 59 successful requests: mean latency 1.24s (25% faster than pd-disagg), TTFT 0.18s (60% faster)
  - Detailed analysis in `docs/SWEBENCH_EXPERIMENT_RESULTS.md`
- [x] Step 6: comparative analysis of results: **done**
  - Full report: `docs/SWEBENCH_EXPERIMENT_RESULTS.md`
## Launch scripts
- `scripts/run_exp_a_pd_disagg.sh`: Experiment A
- `scripts/run_exp_b_pd_colo.sh`: Experiment B
- `scripts/run_exp_c_kvcache_centric.sh`: Experiment C
- `scripts/convert_audit_to_trace.py`: trace conversion
## Known risks
1. Qwen3.5-35B-A3B at TP4 leaves ~12GB/GPU of usable memory (after model + CUDA graph); long sessions (150 turns) may OOM
2. mooncake TCP loopback latency is far lower than real cross-machine latency, so results are optimistic
3. The original trace spans ~6000s; a full replay is very time-consuming


@@ -0,0 +1,121 @@
# SWE-Bench PD Hybrid Experiment Results
## Experiment configuration
- **Model**: Qwen3.5-35B-A3B (MoE, 35B total / 3B active), TP4
- **Hardware**: 8x H100 80GB, NVLink, single node
- **Transfer backend**: mooncake TCP (loopback)
- **Trace**: 52 sessions, 4,449 requests (10% sample of SWE-Bench 500 instances)
- **Time compression**: time-scale=10, concurrency-limit=32
## Results summary
### Experiment A: pd-disaggregation (baseline)
| Metric | Value |
|--------|-------|
| Run dir | `pd-disaggregation-default-20260426T202540Z` |
| Requests | 4,449 / 4,449 (100%) |
| Errors | 0 |
| **Mean Latency** | **1.662s** |
| P50 Latency | 0.973s |
| P90 Latency | 3.644s |
| P99 Latency | 7.676s |
| Mean TTFT | 0.445s |
| P50 TTFT | 0.340s |
| P90 TTFT | 0.880s |
| Mean TPOT | 5.20ms |
| Cache Hit Rate | 94.4% (4199/4449) |
| Mean Cached Tokens | 27,794 |
| KV Transfer Blocks | 105,235 |
### Experiment B: pd-colo (colocation) — FAILED
| Metric | Value |
|--------|-------|
| Run dir | `pd-colo-default-20260426T210129Z` |
| Status | **CRASHED** |
| Error | `token_to_kv_pool_allocator memory leak detected!` |
| Root Cause | SGLang v0.5.10 `--disaggregation-mode null` is incompatible with Qwen3.5-35B-A3B (Mamba/GDN hybrid) |
| Requests | ~10 / 4,449 (0.2%) |
**Conclusion**: the currently vendored SGLang does not support colocation mode for this model. The memory management for Mamba models in token_to_kv_pool_allocator needs to be fixed.
### Experiment C: kvcache-centric (session-aware PD)
| Metric | Value |
|--------|-------|
| Run dir | `kvcache-centric-default-worker-admission-20260426T210800Z` |
| Requests | 4,449 total |
| **Errors** | **4,390 (98.7%)** |
| Successful | 59 (1.3%) |
| Mean Latency (success) | 1.238s |
| P50 Latency (success) | 0.484s |
| P90 Latency (success) | 2.550s |
| Mean TTFT (success) | 0.179s |
| P50 TTFT (success) | 0.081s |
| Mean TPOT (success) | 4.70ms |
| Direct-to-D Sessions | 56 |
| KV Transfer (actual) | 196 blocks (vs 105,235 planned) |
**Execution mode distribution**:
- `kvcache-centric` (failed): 4,390
- `kvcache-direct-to-d-session` (success): 56
- `pd-router-*` variants: 3
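## Key analysis
### 1. pd-disaggregation (A): stable and reliable
- 100% success rate, 0 errors
- Mean latency of 1.66s is reasonable (it includes the P→D KV transfer overhead)
- The 94.4% cache hit rate shows the prefix cache works well on the P side
- KV transfer of 105K blocks is the main source of overhead
- **Suitable for production use**
### 2. pd-colo (B): unusable
- The Mamba/GDN hybrid architecture of Qwen3.5-35B-A3B triggers a memory leak under `disaggregation-mode null`
- This is an SGLang bug, not an agentic-pd-hybrid problem
- **Needs retesting after SGLang is fixed**
### 3. kvcache-centric (C): admission is too conservative
- The 98.7% error rate shows that admission control rejected almost every request
- `kvcache-seed-min-turn-id=2` filters out the turn 1 seed (correct behavior)
- But the vast majority of turn 2+ requests also went through `kvcache-centric` mode and then failed
- Possible causes:
  - The worker admission query finds no KV cache for the session on the D side (because turn 1 was not seeded)
  - A backlog in the D-side transfer queue causes admission to reject
- The 56 successful `direct-to-d-session` requests performed very well: TTFT 0.08s (P50), 4x faster than pd-disagg's 0.34s
- **Tune the admission parameters, or use `kvcache-seed-min-turn-id=1` to allow the turn 1 seed**
### 4. kvcache-centric successful requests vs pd-disaggregation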
| Metric | pd-disagg (A) | kvcache-centric (C, success only) | Delta |
|--------|:---:|:---:|:---:|
| Mean Latency | 1.662s | 1.238s | **-25.5%** |
| P50 Latency | 0.973s | 0.484s | **-50.3%** |
| Mean TTFT | 0.445s | 0.179s | **-59.8%** |
| P50 TTFT | 0.340s | 0.081s | **-76.2%** |
| Mean TPOT | 5.20ms | 4.70ms | -9.6% |
| Actual KV Transfer | 105,235 blk | 196 blk | **-99.8%** |
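**When kvcache-centric succeeds (success-only sample, 59 requests), the performance gain is substantial:**
- TTFT drops by 60-76% (direct append on the D side, no P→D transfer needed)
- End-to-end latency drops by 25-50%
- KV transfer drops by 99.8%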
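The Delta column of the comparison table is a plain relative change against the pd-disaggregation baseline. The sketch below recomputes it from the table values; `delta_pct` and the `rows` layout are illustrative names.

```python
def delta_pct(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100


# (pd-disagg A, kvcache-centric C success-only), copied from the table above.
rows = {
    "Mean Latency (s)": (1.662, 1.238),
    "P50 Latency (s)": (0.973, 0.484),
    "Mean TTFT (s)": (0.445, 0.179),
    "P50 TTFT (s)": (0.340, 0.081),
    "Mean TPOT (ms)": (5.20, 4.70),
    "Actual KV Transfer (blocks)": (105235, 196),
}

for name, (a, c) in rows.items():
    print(f"{name}: {delta_pct(a, c):+.1f}%")
```

Because the C column covers only the 59 successful requests, these deltas describe the fast path when it works, not the mechanism as a whole.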
## Recommended next steps
1. **Fix pd-colo**: file an SGLang issue about the memory leak for Mamba/GDN models under disaggregation-mode null
2. **Tune kvcache-centric admission**:
   - Try `--kvcache-seed-min-turn-id 1` to allow the turn 1 seed
   - Relax the `--kvcache-seed-max-decode-transfer-queue-reqs` threshold
   - Use `--kvcache-admission-mode router` (shadow state, off the critical path)
3. **Increase D-side memory**: adjust `--mem-fraction-static` to give the KV cache more room
4. **Multi-P/D configuration**: test a 2P2D (TP2) configuration to increase parallelism
## Experiment date
2026-04-27


@@ -0,0 +1,305 @@
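# v5+Profile Investigation Report (revised after critic audit)
**Date**: 2026-04-29 (original draft) / 2026-04-29 (revised after audit)
**Experiment configuration**: Qwen3-30B-A3B (TP1), single node with 8×H100 80GB, trace = qwen35-swebench-50sess.jsonl (4449 reqs / 52 sessions), time-scale=10, concurrency=32
**Dataset**: `outputs/qwen3-30b-tp1-v5-optD-profile/` (EXP1 1P7D + EXP2 2P6D, both with 1Hz `/server_info` time-series sampling added)
**v5 baseline for comparison**: `outputs/qwen3-30b-tp1-v5-optD/` (no polling)
**Research question**: v5 (Option D) cut errors from 9-10% to 0.2%, but session-cap fallback rose to 46-51%. Where do the fallbacks and errors actually come from?
> **This is the revised version after a hostile audit**. The original draft contained several wrong conclusions (in particular an inverted reading of `held_tokens` semantics, over-attribution to an admission race, and underweighting of polling side effects). The audit comments are kept in this session's record; key corrections are marked with ⚠️.
---
## TL;DR (revised)
1. **Real capacity**: each D has `token_to_kv_pool_allocator.size = 92086 tokens (~92K)`. ⚠️ The real per-turn footprint is **not 50-100K**; `cached_tokens` has p50=18K, p90=48K, p99=67K. The original draft overstated it.
2. **The reading of `other = capacity - held - available` has been revised**: ⚠️ `held_tokens = sum(slot.kv_allocated_len - slot.cache_protected_len)` (code: `session_aware_cache.py:278-282`), i.e. the part that a slot holds but that is **not protected by the radix tree**. So **the largest single component of `other` is very likely the radix-tree-protected shared prefix cache**, which is usually desirable and **not pathological waste**. The original draft was wrong to attribute all of `other` to the running batch + in-flight transfers.
3. **The bimodal distribution of `other` is real** (p50 ≈ 0, p90 ≈ 80K), but `cap - held - avail` alone cannot tell whether it is natural radix-cache accumulation or burst working memory. **The fine-grained P1 instrumentation must be done first**.
4. **Errors and `other` are correlated in time**, which is real, but **this cannot be read as causation**. Several variables change in the same period (request concurrency, in-flight transfers, free space); timing alignment alone cannot show that "`other` ate the space that was freed".
5. **EXP2 2P6D errors 9 → 415**: ⚠️ **polling is promoted to the leading hypothesis**, rather than "unrelated". Evidence: execution modes were swapped roughly 1:1 (`session-cap-fb` -356 / `kvcache-centric` +406), and `/server_info` is not a passive read: it iterates over every session slot inside the scheduler main loop to compute `is_idle`. The P0 step (three baseline reruns) is needed to confirm or rule this out.
6. **Errors are concentrated in 18 sessions** (out of 52), each pinned to a single D. The per-D error-rate differences **cannot be explained as structural differences between Ds**; they essentially reflect how the 18 "bad sessions" were routed.
7. **The latency advantage of v5+profile 1P7D over the baseline** is entirely within single-run variance. With N=1, it **cannot support any performance conclusion**.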
---
## 1. Methodology
### 1.1 Instrumentation changes
- `src/agentic_pd_hybrid/replay.py` gains `_query_pool_snapshot` + `_poll_pool_timeseries`; a background asyncio task polls `/server_info` on every P/D worker at the interval given by `--pool-poll-interval-s 1.0`
- Each tick appends one jsonl row to `<run_dir>/d-pool-timeseries.jsonl`, with fields: `{worker_id, worker_role, session_count, resident_session_count, held_tokens, available_tokens, capacity_tokens, idle_evictable_*, sessions[], kvcache_mem_gb, last_gen_throughput, ...}`
- Analysis script: `scripts/analysis/analyze_pool_timeseries.py`
### 1.2 Field definitions (revised ⚠️)
The `internal_states[0].session_cache` field of `/server_info` comes from `session_controller.py:get_streaming_session_cache_status` and `tree_cache` (`SessionAwareCache`).
| Field | Actual meaning | Notes |
|---|---|---|
| `held_tokens` | `sum_over_slots(ceil(kv_allocated_len, page_size) - cache_protected_len)` | **Not** "everything the session occupies in the cache"; it counts only the **slot-private part not protected by the radix tree** |
| `cache_protected_len` | the shared-prefix part protected by the radix tree | counted once when shared by several sessions |
| `available_tokens` | `token_to_kv_pool_allocator.available_size()` | free space in the global KV pool |
| `capacity_tokens` | `allocator.size` | total KV capacity of one D = 92086 |
| `idle_evictable_tokens` | the part of held that LRU can evict immediately (all reqs of the session finished + streaming mode) | |
Therefore:
- **`other = capacity - held - available`** includes, but is not limited to:
  - **shared-prefix tokens protected by the radix tree** (possibly the bulk) ⚠️ missed in the original draft
  - KV slots occupied by the current running batch
  - temporary buffers for in-flight P→D transfers
  - blocks registered with mooncake but not yet committed to tree_cache
  - internal fragmentation / allocator metadata
**Implication**: until the P1 instrumentation is added, we **cannot distinguish** how much of `other` is "radix-cache" (benign) versus "burst working set / fragmentation" (possibly pathological).
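Since `other` is a derived quantity, it helps to see the subtraction spelled out. The sketch below applies the definition to two decode-2 samples reported in §3.1 of this report (t=1351 s and t=1622 s); `other_tokens` is an illustrative helper, not a function from the analyzer.

```python
CAPACITY_TOKENS = 92086  # allocator.size of one D, as reported above


def other_tokens(capacity: int, held: int, available: int) -> int:
    """other = capacity - held - available: KV space not attributed to slot-private use."""
    return capacity - held - available


# (held, available) pairs for decode-2 in EXP1, taken from the §3.1 sample table.
samples = {1351: (40985, 30175), 1622: (0, 17703)}

for t, (held, avail) in samples.items():
    print(f"t={t}s other={other_tokens(CAPACITY_TOKENS, held, avail)}")
```

The value says nothing about what the space is used for, which is exactly why the radix-protected share has to be reported separately before `other` can be interpreted.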
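### 1.3 Configuration consistency and risks
- The only difference between v5+profile and the v5 baseline is the added `--pool-poll-interval-s 1.0` (all other CLI arguments are identical).
- **The two runs are ~21 hours apart** (2026-04-28 15:39/16:27 vs 2026-04-29 12:08/12:59) ⚠️ the original draft wrongly said ~6h. Same machine, but GPU temperature, PCIe, and NUMA placement were not controlled.
- **N=1 comparisons carry no statistical weight**; any latency difference < 30% is within plausible single-run variance
---
## 2. Overall performance comparison
| Metric | v5 1P7D | **v5+profile 1P7D** | v5 2P6D | **v5+profile 2P6D** |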
|---|---|---|---|---|
| requests | 4449 | 4449 | 4449 | 4449 |
| **errors** | 9 (0.2%) | 6 (0.1%) | 9 (0.2%) | **415 (9.3%)** |
| truncated | 42 | 43 | 42 | 42 |
| direct-to-D | 44.7% | 54.9% | 41.3% | 41.1% |
| session-cap fallback | 45.6% | 36.1% | 50.6% | 42.6% |
| no-d-capacity | 1.2% | 0.7% | 0.8% | 0.6% |
| pd-router-d-session-reseed | 4.8% | 4.3% | 3.4% | 2.9% |
| pd-router-turn1-seed | 1.2% | 1.2% | 1.1% | 1.1% |
| **kvcache-centric (failed mode)** | 0.2% (9) | 0.1% (6) | 0.2% (9) | **9.3% (415)** |
| latency mean / p50 / p90 / p99 (s) | 5.18/1.59/14.7/26.1 | 4.21/1.18/11.3/28.8 | 3.49/1.31/9.1/24.9 | 3.23/1.11/8.4/20.3 |
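**Do not conclude from this table that "v5+profile improved latency"**: it is an N=1 single run, and in EXP2 the 415 errors amount to a different fallback strategy, so the lower mean latency is most likely just a side effect of **dropping slow-path requests**.
### 2.1 Breaking down the 415 errors in EXP2+profile (revised)
**Error type distribution**:
| Error Type | Count |
|---|---|
| `RuntimeError: generate stream ended before producing any token` | 407 |
| `ReadTimeout: ` | 8 |
**Key constraints**:
- **414/415 errors have `kv_transfer_blocks > 0`** (verified from the metrics jsonl). These requests **had already passed admission and the P→D transfer had started**; they died downstream (server-side abort, stream closed, or failure in the generation phase).
- **`session_reused=False` for 415/415** (all are seeds, none is a direct append).
- **Failures are concentrated in 18 unique sessions** (top 5: 58080 on decode-5, 66 errs / 70560 on decode-2, 54 / 67200 on decode-4, 40 / 59200 on decode-4, 35 / 77280 on decode-2, 33), each pinned to one D.
**Per-D error rate (percentages corrected)**: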
| Decode Worker | Errors | Total Reqs | Error Rate |
|---|---|---|---|
| decode-0 | 56 | 758 | 7.4% |
| decode-1 | 5 | 561 | 0.9% |
| decode-2 | 141 | 858 | **16.4%** |
| decode-3 | 0 | 838 | 0.0% |
| decode-4 | 106 | 731 | 14.5% |
| decode-5 | 107 | 703 | 15.2% |
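**Do not read this as "decode-3 is healthy, decode-2 is pathological"**: each session is pinned to one D, and whether any of the 18 bad sessions lands on a given D is a random outcome of routing. **The current N=1 data cannot separate "structural differences between Ds" from "session placement luck"**.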
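The per-D percentages can be checked directly against the counts in the table. The sketch below recomputes them and confirms that the per-D errors and requests add up to the run totals; the `per_d` layout is illustrative.

```python
# (errors, total requests) per decode worker, copied from the table above.
per_d = {
    "decode-0": (56, 758),
    "decode-1": (5, 561),
    "decode-2": (141, 858),
    "decode-3": (0, 838),
    "decode-4": (106, 731),
    "decode-5": (107, 703),
}

rates = {worker: errors / total * 100 for worker, (errors, total) in per_d.items()}
total_errors = sum(errors for errors, _ in per_d.values())
total_reqs = sum(total for _, total in per_d.values())

for worker, rate in rates.items():
    print(f"{worker}: {rate:.1f}%")
print(f"overall: {total_errors}/{total_reqs} = {total_errors / total_reqs * 100:.1f}%")
```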
---
## 3. D KV pool time-series breakdown (key results for EXP1 1P7D)
Each D has capacity=92086 tokens; the run covers ~2696 s of 1Hz samples (excluding the first 10% as warm-up):
| Worker | mean_other | p50_other | p90_other | max_other | mean_held | mean_avail |
|---|---:|---:|---:|---:|---:|---:|
| decode-0 | 13599 | 63 | 77189 | 90959 | 47124 | 31363 |
| decode-1 | 21242 | 0 | 76854 | 91074 | 37024 | 33820 |
| decode-2 | 39333 | 46841 | 82782 | 91996 | 17381 | 35372 |
| decode-3 | 30543 | 15864 | 81512 | 91511 | 9584 | 51959 |
| decode-4 | 32659 | 32365 | 72995 | 92082 | 7643 | 51784 |
| decode-5 | 31745 | 20366 | 86341 | 91211 | 11305 | 49036 |
| decode-6 | 24602 | 701 | 82291 | 91000 | 20967 | 46517 |
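**Revised observations (the original draft's over-attribution removed)**:
- **`other` is bimodal** (p50 near 0, p90 near 80K, mean 14-39K). This shape is real.
- **mean_held / mean_other differ widely across Ds**, but Ds **cannot simply be classified as "session-heavy" or "transfer-heavy"**, because we do not know the radix-cache vs working-memory split inside `other`. **The P1 breakdown is required**.
- Because `held` excludes radix-protected tokens, a low `mean_held` **does not mean** the sessions on that D occupy little; it only reflects their slot-private part. The shared prefix may be large and sit entirely inside `other`.
### 3.1 `other` stays high for extended periods (EXP1 decode-2 samples)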
| t (s) | held | avail | other | sess_count | last_gen_throughput |
|---:|---:|---:|---:|---:|---:|
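| 3 | 0 | 92086 | 0 | 0/0 | (not sampled) |
| 273 | 65310 | 26776 | 0 | 1/1 | (not sampled) |
| 543 | 15296 | 76589 | 201 | 1/1 | (not sampled) |
| 812 | 0 | 92086 | 0 | 0/0 | (not sampled) |
| 1082 | 52507 | 39579 | 0 | 1/1 | (not sampled) |
| 1351 | 40985 | 30175 | 20926 | 2/2 | (not sampled) |
| **1622** | **0** | 17703 | **74383** | **0/0** | **unverified** |
| 1891 | 0 | 46376 | 45710 | 0/0 | (not sampled) |
| 2161 | 0 | 27667 | 64419 | 0/0 | (not sampled) |
| 2430 | 0 | 62224 | 29862 | 0/0 | (not sampled) |
**After t=1622 (about 30+ ticks) the state stays at held=0/sess=0/other≈45-74K**. A state this persistent **does not look like a burst working set** (bursts should be sub-second). More likely explanations include:
- KV blocks of a stuck request that were never released
- transfer buffers registered with mooncake but never committed
- a cleanup path that never fired
**The original draft did not check `last_gen_throughput`**; the field is recorded in the timeseries but was not aligned in the analysis. **To be added together with P1**.
---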
---
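## 4. Temporal correlation of errors and saturation (EXP2 2P6D)
### 4.1 Equal-count vs equal-time deciles (revised ⚠️)
The original draft showed only equal-time binning, which creates the visual illusion that "the system recovers in decile 10". Both binnings side by side:
| Decile | Equal time (reqs / errs / rate) | Equal count (reqs / errs / rate) |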
|:---:|:---:|:---:|
| 1 | 567 / 0 / 0.0% | 444 / 0 / 0.0% |
| 2 | 268 / 0 / 0.0% | 445 / 0 / 0.0% |
| 3 | 517 / 0 / 0.0% | 445 / 0 / 0.0% |
| 4 | 189 / 0 / 0.0% | 445 / 0 / 0.0% |
| 5 | 662 / 3 / 0.5% | 445 / 3 / 0.7% |
| 6 | 417 / 27 / 6.5% | 445 / 28 / 6.3% |
| 7 | 486 / 39 / 8.0% | 445 / 42 / 9.4% |
| 8 | 612 / 177 / 28.9% | 445 / 114 / 25.6% |
| 9 | 486 / 128 / 26.3% | 445 / 119 / 26.7% |
| **10** | **245 / 41 / 16.7%** | **445 / 109 / 24.5%** |
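**Decile 10 is not "the system recovering"**: equal-count binning shows a 24.5% error rate, on par with deciles 8/9. The "recovery" narrative in the original draft is a visual artifact of equal-time binning, where decile 10 holds only 245 requests versus 612 in decile 8.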
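The difference between the two binnings is easy to reproduce on toy data. The sketch below is illustrative only: the function names and the six synthetic events are made up, and the real analysis would read timestamps and error flags from the request-metrics jsonl.

```python
def equal_time_bins(events, n_bins):
    """Split (timestamp_s, is_error) events into bins of equal time width."""
    t0 = min(t for t, _ in events)
    t1 = max(t for t, _ in events)
    width = (t1 - t0) / n_bins or 1.0  # avoid a zero width when all timestamps match
    bins = [[] for _ in range(n_bins)]
    for t, is_error in events:
        index = min(int((t - t0) / width), n_bins - 1)
        bins[index].append(is_error)
    return bins


def equal_count_bins(events, n_bins):
    """Split time-ordered events into bins holding (nearly) equal request counts."""
    ordered = [is_error for _, is_error in sorted(events)]
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


def error_rates(bins):
    """Error fraction per bin; empty bins report 0.0."""
    return [sum(b) / len(b) if b else 0.0 for b in bins]


# Synthetic events: dense early traffic without errors, sparse late traffic with errors.
events = [(0, False), (1, False), (2, False), (3, False), (9, True), (10, True)]
print(error_rates(equal_time_bins(events, 2)))
print(error_rates(equal_count_bins(events, 2)))
```

With uneven arrival rates the two schemes assign requests to different bins, so the same errors produce different per-bin rates; showing both side by side, as the table above does, guards against reading a denominator change as a trend.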
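### 4.2 Competing hypotheses side by side (revised; the admission race is no longer singled out)
Possible mechanisms for the 415 errors in EXP2 2P6D (ordered by strength of current evidence):
**H1: polling perturbs scheduler timing (leading hypothesis ⚠️)**
- Evidence: execution modes were swapped 1:1 (session-cap-fb -356 / kvcache-centric +406).
- Evidence: `/server_info` iterates over session slots inside the scheduler main loop; 1 Hz × 8 workers is not zero overhead.
- Test: **if P0 (three baseline EXP2 reruns) all produce ~9 errors, this hypothesis is confirmed**.
**H2: v5 itself has an admission/transfer race**
- The v5 baseline also shows 9 errors (all ReadTimeout), so this race already exists in the baseline and profiling merely amplifies it.
- Weakened evidence: the "admission race" proposed in the original draft (stale admit_direct_append snapshot) conflicts with the data. **414/415 errors have `kv_transfer_blocks > 0`**: they all passed admission and died downstream. So even if there is a race, it occurs not at admission but during P→D transfer / before generation starts.
**H3: structural workload failure of 18 specific sessions**
- Failures are concentrated in 18/52 sessions, all at high turn_id (median=70).
- These sessions may have especially long inputs, or some trace structure that triggers a particular path.
- Test: after the three P0 baseline reruns, check whether the same 18 sessions still fail.
**H4: GPU/PCIe state perturbation in a single run**
- The runs are ~21 hours apart; GPU temperature/clocks differ.
- Test: if all three P0 baselines give ~9 errors, single-run perturbation is ruled out as the dominant factor.
**The original draft's sole emphasis on the admission race (H2) was wrong**. The current data cannot decide which of H1-H4 is the main cause.
---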
---
## 5. 1P7D vs 2P6D global comparison
| Config | total decode ticks | other p50 | other p90 | other>30K freq | other>50K freq | other>70K freq | held>60K freq |
|---|---:|---:|---:|---:|---:|---:|---:|
| 1P7D | 18865 | 663 | 79751 | 36.9% | 27.9% | 14.8% | 15.5% |
| 2P6D | 14016 | 14459 | 77199 | 43.2% | 30.4% | 13.9% | 4.8% |
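⚠️ **The original draft's claim that "2P6D's p50_other is 22× that of 1P7D → 2P pushes harder" is an over-reading**. Consider the denominator effect: the same total trace workload is shared by 6 Ds in 2P6D versus 7 Ds in 1P7D, so **the pressure on each D is inherently higher**, with no direct causal link to the number of Ps. The data only says "2P6D has a higher per-D load"; it does **not** show that "2P is more aggressive on transfer than 1P".
---
## 6. Key interpretations (heavily revised)
### 6.1 The real v5 bottleneck is still unclear
The original draft claimed that "the bottleneck is the D's KV pool being taken over by 'other' under pressure". ⚠️ **This conclusion is retracted**. Given that `held_tokens` is actually the slot-private (non-tree) part, the largest single component of `other` is **very likely the normal radix-tree shared prefix**. "Taken over by the running batch / in-flight transfers" is an **unverified guess**. The fine-grained P1 instrumentation is needed to identify the real bottleneck.
### 6.2 No reliable reading of LRU eviction behavior yet
The original draft inferred from mean_held "plunging" under pressure that LRU was evicting hard. But `held` is only the slot-private part and the session may still be retained by the radix tree; a drop in `held` does not mean the session was evicted, and may only reflect a change in the `cache_protected_len` share. **No conclusion before the P1 breakdown**.
### 6.3 v5+profile 1P7D being "faster than baseline" is a single-run coincidence
The two runs are ~21 hours apart (the original draft wrongly said ~6h), and GPU temperature/PCIe state was not controlled. With **N=1**, no performance difference < 30% can be claimed.
### 6.4 EXP2 2P6D 415 errors: polling is the leading suspect (promoted)
The original draft listed polling as a "secondary possibility". ⚠ **It is now promoted to the main suspect**:
- The 1:1 swap of execution modes (session-cap-fb -356 / kvcache-centric +406) shows that polling **changed which path admission takes**
- `/server_info` is not a read-only side channel: it runs inside the scheduler loop and iterates over session slots to compute `is_idle`
- **The three P0 baseline reruns are required to confirm or rule this out**; v6 must not be touched before then
### 6.5 "Other" at 90% on P is not backup blocks
SessionAwareCache is **not enabled** on `prefill-0` (`held=0` in the replay data), so "other" on P equals "all KV usage on P" (radix cache + running batch + backups). ⚠ The current data **cannot tell** whether prefill-backup-policy really released anything; a separate `prefill_backup_tokens` field needs to be added on P.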
---
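## 7. v6 action items (reordered, starting from P0)

### **P0: Verify that EXP2 errors=9 is reproducible** (highest priority, do first)
**Action**: rerun the v5 baseline EXP2 three times (same v5 configuration, **polling disabled**) and compare the error distributions.
- If all 3 runs yield ~9 errors, polling is confirmed as the main cause of the jump to 415. **Polling must then be changed to something lighter** (e.g. a lower frequency, streaming push, or sidecar metrics instead of HTTP polls) before any follow-up work.
- If all 3 runs yield ~400 errors, polling is not the main cause, and the 415 is a combination of a v5 admission/transfer race and a single-run GPU state perturbation.
- If the 3 results are widely spread (e.g. 9 / 50 / 400), run-to-run variance dominates and any single-run comparison is invalid.

**Expected effort**: 1 new sweep script (EXP2 only, 3 runs) + ~3 × 50 min = ~2.5h of GPU time.

**Risk**: 0 (purely rerunning an existing configuration).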
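
The three-way P0 decision rule can be written down before the reruns so the outcome is not argued after the fact. The sketch below is only illustrative: the document gives "~9" and "~400" as approximate targets, so the relative tolerance `rel_tol` and the outcome labels are assumptions introduced here.

```python
def near(value, target, rel_tol):
    """True if value lies within rel_tol * target of target."""
    return abs(value - target) <= rel_tol * target


def classify_p0(error_counts, low=9, high=400, rel_tol=0.5):
    """Classify the baseline reruns according to the P0 decision rule.

    low/high mirror the "~9" and "~400" outcomes in the text;
    rel_tol is an assumed tolerance, not a value from the experiments.
    """
    if all(near(c, low, rel_tol) for c in error_counts):
        return "polling-confirmed"
    if all(near(c, high, rel_tol) for c in error_counts):
        return "polling-not-main-cause"
    return "run-to-run-variance"


print(classify_p0([9, 11, 8]))
print(classify_p0([380, 415, 402]))
print(classify_p0([9, 50, 400]))
```

Any mix of low and high counts falls into the third branch, which matches the "widely spread" case where single-run comparisons are not trustworthy.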
### **P1: Break down D's `other` into a table** (write the code in parallel while P0 runs)
**Action**: in SGLang's `scheduler.py:get_streaming_session_cache_status` and `session_aware_cache.py`, add the following to the returned dict:
- `radix_protected_tokens` = `sum(slot.cache_protected_len for slot in slots)` ⚠️ This was the blind spot of the original draft, the key missing field exposed by the critic
- `running_batch_tokens` = `sum(len(req.fill_ids) for req in running_batch.reqs)`
- `inflight_transfer_tokens` = `sum(req.size for req in disagg_decode_transfer_queue.queue)`
- `prealloc_tokens` = `sum(req.size for req in disagg_decode_prealloc_queue.queue)`
- `retracted_tokens` = `sum(req.size for req in disagg_decode_prealloc_queue.retracted_queue)`
- something finer than the existing `last_gen_throughput`: `running_batch_size` (number of reqs)

**Expected benefit**: `other_unaccounted = capacity - held - available - radix_protected - running_batch - inflight - prealloc - retracted` should be close to 0; whatever remains is the truly "pathological" memory.

**Risk**: read-only stats only; admission logic is unchanged.

**Effort**: an ~80-line SGLang patch, plus matching updates to replay.py's `_query_pool_snapshot` and the analyzer.
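
Once the new fields exist, the residual can be computed per snapshot on the analyzer side. This is a minimal sketch, assuming a flat snapshot dict; the key names `capacity` and `available_tokens` and the example values (other than `capacity = 92086`) are hypothetical, and the components are assumed to be disjoint.

```python
# Fields subtracted from capacity; key names are assumptions for illustration.
BREAKDOWN_FIELDS = (
    "held_tokens",
    "available_tokens",
    "radix_protected_tokens",
    "running_batch_tokens",
    "inflight_transfer_tokens",
    "prealloc_tokens",
    "retracted_tokens",
)


def other_unaccounted(snapshot):
    """capacity minus every accounted component; missing fields count as 0."""
    accounted = sum(snapshot.get(field, 0) for field in BREAKDOWN_FIELDS)
    return snapshot["capacity"] - accounted


# Made-up snapshot; only the capacity value comes from the experiments.
snapshot = {
    "capacity": 92086,
    "held_tokens": 10_000,
    "available_tokens": 20_000,
    "radix_protected_tokens": 50_000,
    "running_batch_tokens": 8_000,
    "inflight_transfer_tokens": 3_000,
    "prealloc_tokens": 1_000,
    "retracted_tokens": 0,
}
print(other_unaccounted(snapshot))
```

A residual that stays large, or goes negative, would suggest either a missing category or overlapping components, both of which are worth checking before interpreting the breakdown.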
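### **P2: If P0 shows polling is the main cause, change the polling implementation**
- Option A: turn `/server_info` into an event-driven push (at the end of each scheduler step, write stats into a ring buffer; polling only reads from it and never enters the scheduler queue)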
- Option B: lower the polling rate from 1Hz to one poll every 5 s or 10 s, and verify on the P1 breakdown data that this is still sufficient
- Option C: separate the locks on the scheduler side, splitting the critical sections for stats reads and admission decisions
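
Option A can be sketched as a small bounded buffer that the scheduler writes to and the poller reads from. This is only an in-process illustration using the standard library; the class name `StatsRing` and its methods are hypothetical, and in a multi-process deployment the buffer would need to live in shared memory or be pushed over IPC, which this sketch ignores.

```python
import threading
from collections import deque


class StatsRing:
    """Bounded buffer: the producer publishes per-step stats, readers never block it for long."""

    def __init__(self, maxlen=64):
        self._buf = deque(maxlen=maxlen)  # oldest entries are dropped automatically
        self._lock = threading.Lock()

    def publish(self, stats):
        """Called by the producer at the end of a step; stores a copy."""
        with self._lock:
            self._buf.append(dict(stats))

    def latest(self):
        """Return a copy of the most recent stats, or None if nothing was published."""
        with self._lock:
            return dict(self._buf[-1]) if self._buf else None

    def drain(self):
        """Return all buffered stats in order and clear the buffer."""
        with self._lock:
            items = list(self._buf)
            self._buf.clear()
            return items


ring = StatsRing(maxlen=2)
for step in range(3):
    ring.publish({"step": step, "held_tokens": 1000 * step})
latest = ring.latest()
drained = ring.drain()
```

The reader only touches the buffer, so a poll costs a short lock acquisition and a dict copy rather than a pass through the scheduler's own loop.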
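### **P3 (conditional, pending P0+P1 data)**: decide the real optimization direction
The 5 priorities in §7 of the original draft **all need to be re-evaluated** now that the `other` model has been corrected. Re-rank them once the real breakdown data is available.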
---
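## 8. Limitations and confounders (expanded)
1. The semantics of `held_tokens` were interpreted backwards in the original draft, which led to an incorrect causal attribution for `other` (corrected; see §1.2).
2. The `other` field is derived and **not broken down**, so it cannot be attributed directly; the P1 instrumentation is needed to separate radix cache, running batch, and in-flight transfers.
3. The **order-of-magnitude gap** between the 415 errors of EXP2+profile and the 9 errors of the baseline **cannot be deconfounded**; polling is the leading suspect but unconfirmed. **P0 is a mandatory step.**
4. **N=1** per experiment configuration: any latency/failure difference between v5+profile and the v5 baseline falls within plausible single-run variance and **cannot be used as a directional conclusion**.
5. The trace is single-shot; the specific structure of its 52 sessions and 4449 reqs may amplify certain paths.
6. `capacity = 92086` is `token_to_kv_pool_allocator.size`, derived from `mem_fraction_static` (exact value not extracted); the gap to "the physical limit of an H100 80GB" is SGLang's safety margin.
7. The sustained high `other` for 30+ ticks at t=1622 in §3.1 was **not cross-checked against `last_gen_throughput`**; the original draft's "running batch + in-flight transfers" explanation is conjecture, not evidence.
8. The characteristics of the 18/52 failing sessions (turn_id, input length, prefix shape) were **not compared against the rest**; we cannot rule out that a certain session type inherently triggers some fixed bug.
9. The 1Hz polling frequency misses sub-second bursts, so the bimodality of `other` may be more pronounced than measured.
10. The critic pointed out that `pd-router-d-session-reseed` moves in opposite directions in EXP1 (193 vs 152) and EXP2 (127 vs 152), which the **original draft did not analyze**; this is a clear signal about admission/routing decisions and should be revisited after P1.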
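
Limitation 9 can be illustrated with a synthetic signal: a burst shorter than the polling period can fall entirely between two samples. The numbers below are made up for illustration and are not taken from any run.

```python
def sampled_peak(signal, step):
    """Peak seen by a poller that reads every `step`-th point of the signal."""
    return max(signal[::step])


# Synthetic `other` signal at 10 ms resolution over 5 s (hypothetical values):
# a 30K-token baseline with one 300 ms burst to 80K starting at t = 1.25 s.
signal = [30_000] * 500
for i in range(125, 155):
    signal[i] = 80_000

true_peak = max(signal)                 # what actually happened
polled_peak = sampled_peak(signal, 100)  # what a 1 Hz poller would record
print(true_peak, polled_peak)
```

Here the 1 Hz poller samples at 0 s, 1 s, 2 s, 3 s and 4 s and never lands inside the burst, so it reports only the baseline level.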
---
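## 9. Next steps (order updated)
1. **P0**: `scripts/sweep_tp1_v5_baseline_rerun_exp2.sh`: 3 EXP2 baseline runs, no polling
2. **P1**: in parallel, patch SGLang so that `other` is actually broken down
3. After P0+P1 are complete:
- Rerun EXP2 once with instrumentation (polling) to obtain the `other` breakdown
- Compare against the error distribution of the three baseline reruns
- Decide whether to roll back polling, work on admission, or target the workload characteristics of the specific 18 sessions
4. Any v6 code change (optimizing admission / eviction / transfer) **must come after P0+P1**.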
---
## 10. Data artifacts
```
outputs/qwen3-30b-tp1-v5-optD-profile/
├── exp{1,2}_*_metrics.jsonl # 4449 lines per experiment
├── exp{1,2}_*_summary.json
├── exp{1,2}_*_pool_timeseries.jsonl # 12 MB / 10 MB
└── kvcache-centric-...20260429T{120847,125911}Z/ # raw run dir
outputs/qwen3-30b-tp1-v5-optD/ # baseline control (N=1)
└── exp{1,2}_1p7d_kvc_optD_*
# to be produced by P0:
outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/
└── exp2_2p6d_run{1,2,3}_*
```
Analysis script: `scripts/analysis/analyze_pool_timeseries.py` (pass `--json` for machine-readable output).


@@ -0,0 +1,88 @@
{
"actual_output_tokens_stats": {
"count": 4086.0,
"mean": 213.95105237395987,
"p50": 83.0,
"p90": 562.0,
"p99": 1346.0
},
"cache_hit_request_count": 3929,
"cached_tokens_stats": {
"count": 4449.0,
"mean": 22635.924702180266,
"p50": 20010.0,
"p90": 48002.0,
"p99": 65424.0
},
"decode_request_priorities": {},
"error_count": 363,
"execution_modes": {
"kvcache-centric": 363,
"kvcache-direct-to-d-session": 1716,
"pd-router-d-session-reseed": 23,
"pd-router-fallback-d-backpressure": 12,
"pd-router-fallback-large-append": 5,
"pd-router-fallback-large-append-seed-filter-early-turn": 51,
"pd-router-fallback-large-append-session-cap": 2148,
"pd-router-fallback-no-d-capacity": 7,
"pd-router-fallback-session-cap": 32,
"pd-router-large-append-reseed": 39,
"pd-router-large-append-reseed-after-eviction": 2,
"pd-router-turn1-d-backpressure": 1,
"pd-router-turn1-no-d-capacity": 3,
"pd-router-turn1-seed": 34,
"pd-router-turn1-session-cap": 13
},
"latency_stats_s": {
"count": 4086.0,
"mean": 4.8753733304192455,
"p50": 1.754677688702941,
"p90": 12.66968655679375,
"p99": 28.717210091650486
},
"mechanisms": {
"kvcache-centric": 4449
},
"per_decode_load": {
"decode-0": 616,
"decode-1": 658,
"decode-2": 674,
"decode-3": 582,
"decode-4": 656,
"decode-5": 662,
"decode-6": 601
},
"per_prefill_load": {
"prefill-0": 4449
},
"prefill_request_priorities": {
"-100": 98,
"100": 2272
},
"re_prefill_count": 0,
"request_count": 4449,
"reuse_expected_count": 4397,
"reuse_observed_count": 4397,
"router_url": "http://127.0.0.1:8000",
"session_reset_count": 0,
"session_reused_count": 1716,
"total_actual_kv_transfer_blocks": 62123,
"total_cached_tokens": 100707229,
"total_kv_transfer_blocks": 105235,
"tpot_stats_s": {
"count": 4086.0,
"mean": 0.005829451223571163,
"p50": 0.005684156496173296,
"p90": 0.007143743503740225,
"p99": 0.008634991403068266
},
"trace_path": "outputs/qwen3-30b-tp1-v3-kvaware/kvcache-centric-kv-aware-worker-admission-20260428T095141Z/sampled-trace.jsonl",
"truncated_request_count": 42,
"ttft_stats_s": {
"count": 4086.0,
"mean": 3.5955862397812597,
"p50": 0.36274072993546724,
"p90": 10.972254231572151,
"p99": 27.433656523004174
}
}


@@ -0,0 +1,85 @@
{
"actual_output_tokens_stats": {
"count": 4440.0,
"mean": 225.87972972972972,
"p50": 86.0,
"p90": 576.0,
"p99": 1347.0
},
"cache_hit_request_count": 4201,
"cached_tokens_stats": {
"count": 4449.0,
"mean": 24345.55787817487,
"p50": 21504.0,
"p90": 48792.0,
"p99": 69120.0
},
"decode_request_priorities": {},
"error_count": 9,
"execution_modes": {
"kvcache-centric": 9,
"kvcache-direct-to-d-session": 1358,
"pd-router-d-session-reseed": 12,
"pd-router-fallback-d-backpressure": 2,
"pd-router-fallback-large-append-seed-filter-early-turn": 52,
"pd-router-fallback-large-append-session-cap": 2902,
"pd-router-fallback-session-cap": 25,
"pd-router-large-append-reseed": 34,
"pd-router-large-append-reseed-after-eviction": 4,
"pd-router-turn1-d-backpressure": 1,
"pd-router-turn1-seed": 30,
"pd-router-turn1-session-cap": 20
},
"latency_stats_s": {
"count": 4440.0,
"mean": 3.582334662846558,
"p50": 1.517257746309042,
"p90": 9.225348330102861,
"p99": 18.70269925892353
},
"mechanisms": {
"kvcache-centric": 4449
},
"per_decode_load": {
"decode-0": 710,
"decode-1": 630,
"decode-2": 763,
"decode-3": 737,
"decode-4": 879,
"decode-5": 730
},
"per_prefill_load": {
"prefill-0": 2225,
"prefill-1": 2224
},
"prefill_request_priorities": {
"-100": 80,
"100": 3002
},
"re_prefill_count": 0,
"request_count": 4449,
"reuse_expected_count": 4397,
"reuse_observed_count": 4397,
"router_url": "http://127.0.0.1:8000",
"session_reset_count": 0,
"session_reused_count": 1358,
"total_actual_kv_transfer_blocks": 78979,
"total_cached_tokens": 108313387,
"total_kv_transfer_blocks": 105235,
"tpot_stats_s": {
"count": 4440.0,
"mean": 0.005882534704321737,
"p50": 0.005807478777200416,
"p90": 0.00712956755887717,
"p99": 0.008372141476720572
},
"trace_path": "outputs/qwen3-30b-tp1-v3-kvaware/kvcache-centric-kv-aware-worker-admission-20260428T104343Z/sampled-trace.jsonl",
"truncated_request_count": 42,
"ttft_stats_s": {
"count": 4440.0,
"mean": 2.2045287611873334,
"p50": 0.32809355948120356,
"p90": 6.947275545448065,
"p99": 16.705802395939827
}
}


@@ -0,0 +1,189 @@
[2026-04-28 17:51:41] Starting TP1 v3 sweep (KVC with kv-aware policy)
[2026-04-28 17:51:41] Model: /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
[2026-04-28 17:51:41] Trace: outputs/qwen35-swebench-50sess.jsonl (4449 requests, 52 sessions)
[2026-04-28 17:51:41] Key change: --policy kv-aware for KVC (was --policy default in v2)
[2026-04-28 17:51:41]
[2026-04-28 17:51:41] === [EXP1] 1P7D KVC kv-aware ===
[2026-04-28 18:43:43] === exp1_1p7d_kvc_kvaware COMPLETED ===
[2026-04-28 18:43:43] Summary:
{
"actual_output_tokens_stats": {
"count": 4086.0,
"mean": 213.95105237395987,
"p50": 83.0,
"p90": 562.0,
"p99": 1346.0
},
"cache_hit_request_count": 3929,
"cached_tokens_stats": {
"count": 4449.0,
"mean": 22635.924702180266,
"p50": 20010.0,
"p90": 48002.0,
"p99": 65424.0
},
"decode_request_priorities": {},
"error_count": 363,
"execution_modes": {
"kvcache-centric": 363,
"kvcache-direct-to-d-session": 1716,
"pd-router-d-session-reseed": 23,
"pd-router-fallback-d-backpressure": 12,
"pd-router-fallback-large-append": 5,
"pd-router-fallback-large-append-seed-filter-early-turn": 51,
"pd-router-fallback-large-append-session-cap": 2148,
"pd-router-fallback-no-d-capacity": 7,
"pd-router-fallback-session-cap": 32,
"pd-router-large-append-reseed": 39,
"pd-router-large-append-reseed-after-eviction": 2,
"pd-router-turn1-d-backpressure": 1,
"pd-router-turn1-no-d-capacity": 3,
"pd-router-turn1-seed": 34,
"pd-router-turn1-session-cap": 13
},
"latency_stats_s": {
"count": 4086.0,
"mean": 4.8753733304192455,
"p50": 1.754677688702941,
"p90": 12.66968655679375,
"p99": 28.717210091650486
},
"mechanisms": {
"kvcache-centric": 4449
},
"per_decode_load": {
"decode-0": 616,
"decode-1": 658,
"decode-2": 674,
"decode-3": 582,
"decode-4": 656,
"decode-5": 662,
"decode-6": 601
},
"per_prefill_load": {
"prefill-0": 4449
},
"prefill_request_priorities": {
"-100": 98,
"100": 2272
},
"re_prefill_count": 0,
"request_count": 4449,
"reuse_expected_count": 4397,
"reuse_observed_count": 4397,
"router_url": "http://127.0.0.1:8000",
"session_reset_count": 0,
"session_reused_count": 1716,
"total_actual_kv_transfer_blocks": 62123,
"total_cached_tokens": 100707229,
"total_kv_transfer_blocks": 105235,
"tpot_stats_s": {
"count": 4086.0,
"mean": 0.005829451223571163,
"p50": 0.005684156496173296,
"p90": 0.007143743503740225,
"p99": 0.008634991403068266
},
"trace_path": "outputs/qwen3-30b-tp1-v3-kvaware/kvcache-centric-kv-aware-worker-admission-20260428T095141Z/sampled-trace.jsonl",
"truncated_request_count": 42,
"ttft_stats_s": {
"count": 4086.0,
"mean": 3.5955862397812597,
"p50": 0.36274072993546724,
"p90": 10.972254231572151,
"p99": 27.433656523004174
}
}
[2026-04-28 18:43:43] Saved to outputs/qwen3-30b-tp1-v3-kvaware/exp1_1p7d_kvc_kvaware_summary.json + exp1_1p7d_kvc_kvaware_metrics.jsonl
[2026-04-28 18:43:43]
[2026-04-28 18:43:43] === [EXP2] 2P6D KVC kv-aware ===
[2026-04-28 19:30:38] === exp2_2p6d_kvc_kvaware COMPLETED ===
[2026-04-28 19:30:38] Summary:
{
"actual_output_tokens_stats": {
"count": 4440.0,
"mean": 225.87972972972972,
"p50": 86.0,
"p90": 576.0,
"p99": 1347.0
},
"cache_hit_request_count": 4201,
"cached_tokens_stats": {
"count": 4449.0,
"mean": 24345.55787817487,
"p50": 21504.0,
"p90": 48792.0,
"p99": 69120.0
},
"decode_request_priorities": {},
"error_count": 9,
"execution_modes": {
"kvcache-centric": 9,
"kvcache-direct-to-d-session": 1358,
"pd-router-d-session-reseed": 12,
"pd-router-fallback-d-backpressure": 2,
"pd-router-fallback-large-append-seed-filter-early-turn": 52,
"pd-router-fallback-large-append-session-cap": 2902,
"pd-router-fallback-session-cap": 25,
"pd-router-large-append-reseed": 34,
"pd-router-large-append-reseed-after-eviction": 4,
"pd-router-turn1-d-backpressure": 1,
"pd-router-turn1-seed": 30,
"pd-router-turn1-session-cap": 20
},
"latency_stats_s": {
"count": 4440.0,
"mean": 3.582334662846558,
"p50": 1.517257746309042,
"p90": 9.225348330102861,
"p99": 18.70269925892353
},
"mechanisms": {
"kvcache-centric": 4449
},
"per_decode_load": {
"decode-0": 710,
"decode-1": 630,
"decode-2": 763,
"decode-3": 737,
"decode-4": 879,
"decode-5": 730
},
"per_prefill_load": {
"prefill-0": 2225,
"prefill-1": 2224
},
"prefill_request_priorities": {
"-100": 80,
"100": 3002
},
"re_prefill_count": 0,
"request_count": 4449,
"reuse_expected_count": 4397,
"reuse_observed_count": 4397,
"router_url": "http://127.0.0.1:8000",
"session_reset_count": 0,
"session_reused_count": 1358,
"total_actual_kv_transfer_blocks": 78979,
"total_cached_tokens": 108313387,
"total_kv_transfer_blocks": 105235,
"tpot_stats_s": {
"count": 4440.0,
"mean": 0.005882534704321737,
"p50": 0.005807478777200416,
"p90": 0.00712956755887717,
"p99": 0.008372141476720572
},
"trace_path": "outputs/qwen3-30b-tp1-v3-kvaware/kvcache-centric-kv-aware-worker-admission-20260428T104343Z/sampled-trace.jsonl",
"truncated_request_count": 42,
"ttft_stats_s": {
"count": 4440.0,
"mean": 2.2045287611873334,
"p50": 0.32809355948120356,
"p90": 6.947275545448065,
"p99": 16.705802395939827
}
}
[2026-04-28 19:30:38] Saved to outputs/qwen3-30b-tp1-v3-kvaware/exp2_2p6d_kvc_kvaware_summary.json + exp2_2p6d_kvc_kvaware_metrics.jsonl
[2026-04-28 19:30:38]
[2026-04-28 19:30:38] === ALL TP1 V3 SWEEP EXPERIMENTS DONE ===


@@ -0,0 +1,88 @@
{
"actual_output_tokens_stats": {
"count": 4014.0,
"mean": 215.048081714001,
"p50": 83.0,
"p90": 570.0,
"p99": 1343.0
},
"cache_hit_request_count": 3865,
"cached_tokens_stats": {
"count": 4449.0,
"mean": 21373.60867610699,
"p50": 18429.0,
"p90": 45643.0,
"p99": 65088.0
},
"decode_request_priorities": {},
"error_count": 435,
"execution_modes": {
"kvcache-centric": 435,
"kvcache-direct-to-d-session": 2180,
"pd-router-d-session-reseed": 44,
"pd-router-d-session-reseed-after-eviction": 1,
"pd-router-fallback-d-backpressure": 36,
"pd-router-fallback-large-append": 35,
"pd-router-fallback-large-append-seed-filter-early-turn": 52,
"pd-router-fallback-large-append-session-cap": 1500,
"pd-router-fallback-no-d-capacity": 13,
"pd-router-fallback-session-cap": 43,
"pd-router-large-append-reseed": 55,
"pd-router-large-append-reseed-after-eviction": 3,
"pd-router-turn1-d-backpressure": 1,
"pd-router-turn1-no-d-capacity": 5,
"pd-router-turn1-seed": 46
},
"latency_stats_s": {
"count": 4014.0,
"mean": 4.214657033050009,
"p50": 1.0827504023909569,
"p90": 13.380241627804935,
"p99": 24.453291333280504
},
"mechanisms": {
"kvcache-centric": 4449
},
"per_decode_load": {
"decode-0": 690,
"decode-1": 599,
"decode-2": 660,
"decode-3": 584,
"decode-4": 606,
"decode-5": 646,
"decode-6": 664
},
"per_prefill_load": {
"prefill-0": 4449
},
"prefill_request_priorities": {
"-100": 149,
"100": 1685
},
"re_prefill_count": 0,
"request_count": 4449,
"reuse_expected_count": 4397,
"reuse_observed_count": 4397,
"router_url": "http://127.0.0.1:8000",
"session_reset_count": 0,
"session_reused_count": 2180,
"total_actual_kv_transfer_blocks": 52857,
"total_cached_tokens": 95091185,
"total_kv_transfer_blocks": 105235,
"tpot_stats_s": {
"count": 4014.0,
"mean": 0.005804301410418847,
"p50": 0.005607025208882987,
"p90": 0.007293824862528552,
"p99": 0.008864479259402893
},
"trace_path": "outputs/qwen3-30b-tp1-v4-cap16/kvcache-centric-kv-aware-worker-admission-20260428T125022Z/sampled-trace.jsonl",
"truncated_request_count": 43,
"ttft_stats_s": {
"count": 4014.0,
"mean": 2.915135478307124,
"p50": 0.05643345229327679,
"p90": 11.900803190656006,
"p99": 22.758968392387033
}
}


@@ -0,0 +1,86 @@
{
"actual_output_tokens_stats": {
"count": 4046.0,
"mean": 224.65002471576867,
"p50": 84.0,
"p90": 576.0,
"p99": 1349.0
},
"cache_hit_request_count": 3925,
"cached_tokens_stats": {
"count": 4449.0,
"mean": 22852.7439874129,
"p50": 19584.0,
"p90": 49009.0,
"p99": 67320.0
},
"decode_request_priorities": {},
"error_count": 403,
"execution_modes": {
"kvcache-centric": 403,
"kvcache-direct-to-d-session": 2348,
"pd-router-d-session-reseed": 28,
"pd-router-fallback-d-backpressure": 7,
"pd-router-fallback-large-append": 68,
"pd-router-fallback-large-append-seed-filter-early-turn": 45,
"pd-router-fallback-large-append-session-cap": 1403,
"pd-router-fallback-no-d-capacity": 9,
"pd-router-fallback-session-cap": 25,
"pd-router-large-append-reseed": 57,
"pd-router-large-append-reseed-after-eviction": 6,
"pd-router-turn1-no-d-capacity": 1,
"pd-router-turn1-seed": 49
},
"latency_stats_s": {
"count": 4046.0,
"mean": 2.505981629502371,
"p50": 0.8372491216287017,
"p90": 6.5139341270551085,
"p99": 18.335972285829484
},
"mechanisms": {
"kvcache-centric": 4449
},
"per_decode_load": {
"decode-0": 767,
"decode-1": 680,
"decode-2": 906,
"decode-3": 818,
"decode-4": 800,
"decode-5": 478
},
"per_prefill_load": {
"prefill-0": 2225,
"prefill-1": 2224
},
"prefill_request_priorities": {
"-100": 140,
"100": 1558
},
"re_prefill_count": 0,
"request_count": 4449,
"reuse_expected_count": 4397,
"reuse_observed_count": 4397,
"router_url": "http://127.0.0.1:8000",
"session_reset_count": 0,
"session_reused_count": 2348,
"total_actual_kv_transfer_blocks": 50727,
"total_cached_tokens": 101671858,
"total_kv_transfer_blocks": 105235,
"tpot_stats_s": {
"count": 4046.0,
"mean": 0.005708743129332261,
"p50": 0.005565466725497757,
"p90": 0.006912594398356141,
"p99": 0.008102089307750717
},
"trace_path": "outputs/qwen3-30b-tp1-v4-cap16/kvcache-centric-kv-aware-worker-admission-20260428T134057Z/sampled-trace.jsonl",
"truncated_request_count": 36,
"ttft_stats_s": {
"count": 4046.0,
"mean": 1.1653790952959129,
"p50": 0.05140436999499798,
"p90": 2.6447059931233525,
"p99": 15.121314341202378
}
}


@@ -0,0 +1,190 @@
[2026-04-28 20:50:21] Starting TP1 v4 sweep (KVC kv-aware, session soft_cap raised 4->16)
[2026-04-28 20:50:21] Model: /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
[2026-04-28 20:50:21] Trace: outputs/qwen35-swebench-50sess.jsonl (4449 requests, 52 sessions)
[2026-04-28 20:50:21] Key change: _decode_session_soft_cap now min(16, ...) instead of min(4, ...)
[2026-04-28 20:50:21]
[2026-04-28 20:50:21] === [EXP1] 1P7D KVC kv-aware cap=16 ===
[2026-04-28 21:40:57] === exp1_1p7d_kvc_cap16 COMPLETED ===
[2026-04-28 21:40:57] Summary:
{
"actual_output_tokens_stats": {
"count": 4014.0,
"mean": 215.048081714001,
"p50": 83.0,
"p90": 570.0,
"p99": 1343.0
},
"cache_hit_request_count": 3865,
"cached_tokens_stats": {
"count": 4449.0,
"mean": 21373.60867610699,
"p50": 18429.0,
"p90": 45643.0,
"p99": 65088.0
},
"decode_request_priorities": {},
"error_count": 435,
"execution_modes": {
"kvcache-centric": 435,
"kvcache-direct-to-d-session": 2180,
"pd-router-d-session-reseed": 44,
"pd-router-d-session-reseed-after-eviction": 1,
"pd-router-fallback-d-backpressure": 36,
"pd-router-fallback-large-append": 35,
"pd-router-fallback-large-append-seed-filter-early-turn": 52,
"pd-router-fallback-large-append-session-cap": 1500,
"pd-router-fallback-no-d-capacity": 13,
"pd-router-fallback-session-cap": 43,
"pd-router-large-append-reseed": 55,
"pd-router-large-append-reseed-after-eviction": 3,
"pd-router-turn1-d-backpressure": 1,
"pd-router-turn1-no-d-capacity": 5,
"pd-router-turn1-seed": 46
},
"latency_stats_s": {
"count": 4014.0,
"mean": 4.214657033050009,
"p50": 1.0827504023909569,
"p90": 13.380241627804935,
"p99": 24.453291333280504
},
"mechanisms": {
"kvcache-centric": 4449
},
"per_decode_load": {
"decode-0": 690,
"decode-1": 599,
"decode-2": 660,
"decode-3": 584,
"decode-4": 606,
"decode-5": 646,
"decode-6": 664
},
"per_prefill_load": {
"prefill-0": 4449
},
"prefill_request_priorities": {
"-100": 149,
"100": 1685
},
"re_prefill_count": 0,
"request_count": 4449,
"reuse_expected_count": 4397,
"reuse_observed_count": 4397,
"router_url": "http://127.0.0.1:8000",
"session_reset_count": 0,
"session_reused_count": 2180,
"total_actual_kv_transfer_blocks": 52857,
"total_cached_tokens": 95091185,
"total_kv_transfer_blocks": 105235,
"tpot_stats_s": {
"count": 4014.0,
"mean": 0.005804301410418847,
"p50": 0.005607025208882987,
"p90": 0.007293824862528552,
"p99": 0.008864479259402893
},
"trace_path": "outputs/qwen3-30b-tp1-v4-cap16/kvcache-centric-kv-aware-worker-admission-20260428T125022Z/sampled-trace.jsonl",
"truncated_request_count": 43,
"ttft_stats_s": {
"count": 4014.0,
"mean": 2.915135478307124,
"p50": 0.05643345229327679,
"p90": 11.900803190656006,
"p99": 22.758968392387033
}
}
[2026-04-28 21:40:57] Saved to outputs/qwen3-30b-tp1-v4-cap16/exp1_1p7d_kvc_cap16_summary.json + exp1_1p7d_kvc_cap16_metrics.jsonl
[2026-04-28 21:40:57]
[2026-04-28 21:40:57] === [EXP2] 2P6D KVC kv-aware cap=16 ===
[2026-04-28 22:27:53] === exp2_2p6d_kvc_cap16 COMPLETED ===
[2026-04-28 22:27:53] Summary:
{
"actual_output_tokens_stats": {
"count": 4046.0,
"mean": 224.65002471576867,
"p50": 84.0,
"p90": 576.0,
"p99": 1349.0
},
"cache_hit_request_count": 3925,
"cached_tokens_stats": {
"count": 4449.0,
"mean": 22852.7439874129,
"p50": 19584.0,
"p90": 49009.0,
"p99": 67320.0
},
"decode_request_priorities": {},
"error_count": 403,
"execution_modes": {
"kvcache-centric": 403,
"kvcache-direct-to-d-session": 2348,
"pd-router-d-session-reseed": 28,
"pd-router-fallback-d-backpressure": 7,
"pd-router-fallback-large-append": 68,
"pd-router-fallback-large-append-seed-filter-early-turn": 45,
"pd-router-fallback-large-append-session-cap": 1403,
"pd-router-fallback-no-d-capacity": 9,
"pd-router-fallback-session-cap": 25,
"pd-router-large-append-reseed": 57,
"pd-router-large-append-reseed-after-eviction": 6,
"pd-router-turn1-no-d-capacity": 1,
"pd-router-turn1-seed": 49
},
"latency_stats_s": {
"count": 4046.0,
"mean": 2.505981629502371,
"p50": 0.8372491216287017,
"p90": 6.5139341270551085,
"p99": 18.335972285829484
},
"mechanisms": {
"kvcache-centric": 4449
},
"per_decode_load": {
"decode-0": 767,
"decode-1": 680,
"decode-2": 906,
"decode-3": 818,
"decode-4": 800,
"decode-5": 478
},
"per_prefill_load": {
"prefill-0": 2225,
"prefill-1": 2224
},
"prefill_request_priorities": {
"-100": 140,
"100": 1558
},
"re_prefill_count": 0,
"request_count": 4449,
"reuse_expected_count": 4397,
"reuse_observed_count": 4397,
"router_url": "http://127.0.0.1:8000",
"session_reset_count": 0,
"session_reused_count": 2348,
"total_actual_kv_transfer_blocks": 50727,
"total_cached_tokens": 101671858,
"total_kv_transfer_blocks": 105235,
"tpot_stats_s": {
"count": 4046.0,
"mean": 0.005708743129332261,
"p50": 0.005565466725497757,
"p90": 0.006912594398356141,
"p99": 0.008102089307750717
},
"trace_path": "outputs/qwen3-30b-tp1-v4-cap16/kvcache-centric-kv-aware-worker-admission-20260428T134057Z/sampled-trace.jsonl",
"truncated_request_count": 36,
"ttft_stats_s": {
"count": 4046.0,
"mean": 1.1653790952959129,
"p50": 0.05140436999499798,
"p90": 2.6447059931233525,
"p99": 15.121314341202378
}
}
[2026-04-28 22:27:53] Saved to outputs/qwen3-30b-tp1-v4-cap16/exp2_2p6d_kvc_cap16_summary.json + exp2_2p6d_kvc_cap16_metrics.jsonl
[2026-04-28 22:27:53]
[2026-04-28 22:27:53] === ALL TP1 V4 SWEEP EXPERIMENTS DONE ===


@@ -0,0 +1,191 @@
#!/usr/bin/env python3
"""Analyze backpressure smoke sweep outputs.

For each run dir with a `request-metrics.jsonl` and the new `structural/`
subdir (admission-events.jsonl, backpressure-events.jsonl,
session-d-binding.jsonl), report:
  - Headline (errors, latency, ttft, direct-to-D rate)
  - Backpressure pause histogram (count, p50/p90 sleep, total pause time per D)
  - Admission probe stats (RPC count, mean RTT, queue_depth distribution,
    pause_ms distribution)
  - Session pinning (distinct D per session, bimodal direct-to-D rate)
"""
from __future__ import annotations

import argparse
import json
import statistics
from collections import Counter, defaultdict
from pathlib import Path


def load_jsonl(path: Path) -> list[dict]:
    if not path.exists():
        return []
    return [json.loads(l) for l in path.open("r", encoding="utf-8") if l.strip()]


def summarize_run(run_dir: Path) -> dict:
    metrics_path = next(run_dir.rglob("request-metrics.jsonl"), None)
    if metrics_path is None:
        return {"run_dir": str(run_dir), "error": "no request-metrics.jsonl"}
    summary_path = metrics_path.with_suffix(metrics_path.suffix + ".summary.json")
    summary = (
        json.load(summary_path.open()) if summary_path.exists() else {}
    )

    structural_dir = run_dir / "structural"
    if not structural_dir.exists():
        # try metrics dir's parent / structural
        structural_dir = metrics_path.parent / "structural"

    admission_events = load_jsonl(structural_dir / "admission-events.jsonl")
    backpressure_events = load_jsonl(structural_dir / "backpressure-events.jsonl")
    binding_events = load_jsonl(structural_dir / "session-d-binding.jsonl")

    out: dict = {"run_dir": str(run_dir)}

    # Headline metrics from summary.json
    out["request_count"] = summary.get("request_count")
    out["error_count"] = summary.get("error_count")
    out["latency"] = summary.get("latency_stats_s")
    out["ttft"] = summary.get("ttft_stats_s")
    out["execution_modes"] = summary.get("execution_modes")
    out["per_decode_load"] = summary.get("per_decode_load")
    out["per_prefill_load"] = summary.get("per_prefill_load")

    # Direct-to-D rate from execution_modes
    em = summary.get("execution_modes", {}) or {}
    direct = em.get("kvcache-direct-to-d-session", 0)
    total = sum(em.values()) or 1
    out["direct_to_d_rate"] = direct / total

    # Session pinning
    bind_per_session: dict[str, set[int]] = defaultdict(set)
    for ev in binding_events:
        bind_per_session[ev["session_id"]].add(ev["decode_worker_index"])
    if bind_per_session:
        out["session_count"] = len(bind_per_session)
        out["avg_distinct_d_per_session"] = (
            sum(len(v) for v in bind_per_session.values()) / len(bind_per_session)
        )
    else:
        out["session_count"] = 0
        out["avg_distinct_d_per_session"] = None

    # Direct-to-D rate per session (bimodal check)
    records = load_jsonl(metrics_path)
    sess_records: dict[str, list[dict]] = defaultdict(list)
    for r in records:
        sess_records[r["session_id"]].append(r)
    rates = []
    for sid, turns in sess_records.items():
        ndir = sum(
            1 for t in turns if t.get("execution_mode") == "kvcache-direct-to-d-session"
        )
        rates.append(ndir / len(turns))
    if rates:
        buckets = [0, 0, 0, 0, 0]
        for r in rates:
            buckets[min(4, int(r * 5))] += 1
        out["direct_to_d_rate_buckets"] = {
            "0-20%": buckets[0],
            "20-40%": buckets[1],
            "40-60%": buckets[2],
            "60-80%": buckets[3],
            "80-100%": buckets[4],
        }

    # Backpressure events
    if backpressure_events:
        sleeps = [ev["sleep_s"] for ev in backpressure_events]
        out["backpressure"] = {
            "event_count": len(backpressure_events),
            "total_sleep_s": round(sum(sleeps), 2),
            "sleep_p50_s": round(statistics.median(sleeps), 4),
            "sleep_p90_s": round(
                sorted(sleeps)[int(len(sleeps) * 0.9)] if sleeps else 0, 4
            ),
            "events_per_d": dict(
                Counter(ev["server_url"] for ev in backpressure_events).most_common()
            ),
        }
    else:
        out["backpressure"] = {"event_count": 0, "note": "no backpressure events"}

    # Admission probe stats
    if admission_events:
        rtts = [ev["rtt_s"] for ev in admission_events]
        depths = [ev.get("queue_depth", 0) for ev in admission_events]
        pauses = [ev.get("recommended_pause_ms", 0) for ev in admission_events]
        out["admission_probes"] = {
            "count": len(admission_events),
            "mean_rtt_s": round(sum(rtts) / len(rtts), 4),
            "p99_rtt_s": round(sorted(rtts)[int(len(rtts) * 0.99)], 4),
            "queue_depth_p50": int(statistics.median(depths)),
            "queue_depth_p90": int(sorted(depths)[int(len(depths) * 0.9)]),
            "queue_depth_max": max(depths),
            "pause_ms_p50": int(statistics.median(pauses)),
            "pause_ms_p90": int(sorted(pauses)[int(len(pauses) * 0.9)]),
            "pause_ms_max": max(pauses),
            "nonzero_pause_count": sum(1 for p in pauses if p > 0),
            "by_reason": dict(
                Counter(ev.get("reason") or "ok" for ev in admission_events).most_common()
            ),
        }

    return out


def main() -> None:
    ap = argparse.ArgumentParser()
    ap.add_argument("sweep_root", type=Path)
    ap.add_argument("--json", action="store_true", help="emit JSON only")
    args = ap.parse_args()

    summaries = []
    for run_dir in sorted(args.sweep_root.iterdir()):
        if not run_dir.is_dir():
            continue
        summary = summarize_run(run_dir)
        summaries.append(summary)

    if args.json:
        print(json.dumps(summaries, indent=2))
        return

    for s in summaries:
        print(f"\n{'=' * 70}")
        print(f" {s['run_dir']}")
        print(f"{'=' * 70}")
        if "error" in s:
            print(f" ERROR: {s['error']}")
            continue
        print(f" reqs={s.get('request_count')} errors={s.get('error_count')}")
        if s.get("latency"):
            lt = s["latency"]
            print(
                f" latency: mean={lt.get('mean'):.3f} "
                f"p50={lt.get('p50'):.3f} p90={lt.get('p90'):.3f} p99={lt.get('p99'):.3f}"
            )
        if s.get("ttft"):
            tt = s["ttft"]
            print(
                f" ttft: mean={tt.get('mean'):.3f} "
                f"p50={tt.get('p50'):.3f} p90={tt.get('p90'):.3f}"
            )
        print(f" direct_to_d_rate: {s.get('direct_to_d_rate', 0) * 100:.1f}%")
        print(f" sessions: {s.get('session_count')} | "
              f"avg distinct-D-per-session: {s.get('avg_distinct_d_per_session')}")
        if s.get("direct_to_d_rate_buckets"):
            print(f" direct-to-D distribution by session: {s['direct_to_d_rate_buckets']}")
        if s.get("backpressure"):
            print(f" backpressure: {s['backpressure']}")
        if s.get("admission_probes"):
            print(f" admission probes: {s['admission_probes']}")


if __name__ == "__main__":
    main()


@@ -0,0 +1,83 @@
#!/usr/bin/env python3
"""Deep dive into v4 errors: which path, which D, which session, which turn."""
import json
import numpy as np
from pathlib import Path
from collections import Counter, defaultdict

BASE = Path(__file__).parent


def load_rows(jsonl_path):
    rows = []
    with open(jsonl_path) as f:
        for line in f:
            rows.append(json.loads(line))
    return rows


# Compare v3 and v4 errors
for label, path in [
    ("v3 1P7D", BASE.parent / "qwen3-30b-tp1-v3-kvaware/exp1_1p7d_kvc_kvaware_metrics.jsonl"),
    ("v4 1P7D", BASE / "exp1_1p7d_kvc_cap16_metrics.jsonl"),
    ("v3 2P6D", BASE.parent / "qwen3-30b-tp1-v3-kvaware/exp2_2p6d_kvc_kvaware_metrics.jsonl"),
    ("v4 2P6D", BASE / "exp2_2p6d_kvc_cap16_metrics.jsonl"),
]:
    if not path.exists():
        print(f"\nSKIP {label}: {path} not found")
        continue
    rows = load_rows(path)
    err = [r for r in rows if r.get("error") is not None]
    print(f"\n========== {label} ({len(err)} errors / {len(rows)} total = {len(err)/len(rows)*100:.1f}%) ==========")

    # Error finish_reason distribution
    fr_counter = Counter()
    for r in err:
        fr = str(r.get("finish_reason") or r.get("error") or "?")
        fr_counter[fr[:80]] += 1
    print(f"finish_reason distribution:")
    for fr, cnt in fr_counter.most_common():
        print(f" {cnt:>4}x {fr}")

    # Errors by execution mode (these are aborted before mode assignment usually)
    mode_counter = Counter(r.get("execution_mode", "?") for r in err)
    print(f"\nerror by execution_mode:")
    for mode, cnt in mode_counter.most_common():
        print(f" {cnt:>4}x {mode}")

    # Errors per D worker
    dw_counter = Counter(r.get("assigned_decode_node", "?") for r in err)
    print(f"\nerror per assigned_decode_node:")
    for dw, cnt in dw_counter.most_common():
        print(f" {cnt:>4}x {dw}")

    # Errors by turn distribution
    turn_counter = Counter(r.get("turn_id", -1) for r in err)
    early = sum(c for t, c in turn_counter.items() if t <= 5)
    mid = sum(c for t, c in turn_counter.items() if 5 < t <= 30)
    late = sum(c for t, c in turn_counter.items() if t > 30)
    print(f"\nerror by turn: early(0-5)={early} mid(6-30)={mid} late(31+)={late}")

    # Per-session error rate
    per_sess_err = defaultdict(int)
    per_sess_total = defaultdict(int)
    for r in rows:
        per_sess_total[r["session_id"]] += 1
        if r.get("error") is not None:
            per_sess_err[r["session_id"]] += 1
    sess_with_err = [(sid, per_sess_err[sid], per_sess_total[sid]) for sid in per_sess_err]
    sess_with_err.sort(key=lambda x: -x[1])
    print(f"\ntop 5 sessions by error count:")
    for sid, e, t in sess_with_err[:5]:
        print(f" session {sid}: {e}/{t} errors ({e/t*100:.0f}%)")

    # Errors timeline: are they bursty?
    err_ts = sorted([r.get("trace_timestamp_s", 0) for r in err])
    if err_ts:
        first_ts = err_ts[0]
        last_ts = err_ts[-1]
        all_ts = sorted([r.get("trace_timestamp_s", 0) for r in rows])
        first_all = all_ts[0]
        last_all = all_ts[-1]
        run_duration = last_all - first_all
        err_first_pct = (err_ts[0] - first_all) / run_duration * 100 if run_duration > 0 else 0
        err_last_pct = (err_ts[-1] - first_all) / run_duration * 100 if run_duration > 0 else 0
        print(f"\nerror time range (% of run): {err_first_pct:.1f}% - {err_last_pct:.1f}%")


@@ -0,0 +1,346 @@
#!/usr/bin/env python3
"""Analyze d-pool-timeseries.jsonl produced by --pool-poll-interval-s.

Answers v6's main question: where is D's KV pool actually spent?

For each decode worker, decomposes capacity over the run wall-clock into:
  - resident_held_active = held - idle_evictable (sessions in active use)
  - resident_held_idle = idle_evictable (sessions kept around but evictable)
  - prefill_backup_or_other = capacity - held - available (everything else: backup blocks,
    in-flight transfers, fragmentation)
  - free_available = available

Also reports session residency churn (how many distinct sessions ever resided per D, and
how often a session bounced between workers, a strong starvation signal).

Usage:
    python scripts/analysis/analyze_pool_timeseries.py <run_dir>
or
    python scripts/analysis/analyze_pool_timeseries.py <pool_timeseries.jsonl>

Output: human-readable text. Add --json to also print a machine-readable summary.
"""
from __future__ import annotations

import argparse
import json
import statistics
from collections import Counter, defaultdict
from pathlib import Path
from typing import Any


def _load_jsonl(path: Path) -> list[dict[str, Any]]:
    rows: list[dict[str, Any]] = []
    with path.open() as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            rows.append(json.loads(line))
    return rows


def _resolve_input(path: Path) -> Path:
    if path.is_file():
        return path
    if path.is_dir():
        candidate = path / "d-pool-timeseries.jsonl"
        if candidate.is_file():
            return candidate
        raise FileNotFoundError(
            f"{candidate} not found; pass the file directly or a run dir containing it."
        )
    raise FileNotFoundError(path)


def _percentile(values: list[float], p: float) -> float:
    if not values:
        return 0.0
    s = sorted(values)
    idx = min(len(s) - 1, max(0, int(round((len(s) - 1) * p))))
    return s[idx]


def _fmt_tokens(n: float) -> str:
    if n >= 1_000_000:
        return f"{n / 1_000_000:.2f}M"
    if n >= 1_000:
        return f"{n / 1_000:.1f}K"
    return f"{int(n)}"


def _fmt_pct(n: float, total: float) -> str:
    if total <= 0:
        return " - "
    return f"{100 * n / total:5.1f}%"


def analyze(timeseries_path: Path) -> dict[str, Any]:
    rows = _load_jsonl(timeseries_path)
    if not rows:
        raise ValueError(f"empty timeseries: {timeseries_path}")
by_worker: dict[str, list[dict[str, Any]]] = defaultdict(list)
for row in rows:
if row.get("error") and "session_cache_enabled" not in row:
# poller failed at this tick — skip
continue
wid = row.get("worker_id") or "?"
by_worker[wid].append(row)
summary: dict[str, Any] = {
"timeseries_path": str(timeseries_path),
"total_rows": len(rows),
"tick_count": len(by_worker[next(iter(by_worker))]) if by_worker else 0,
"wall_s_span": (
max(r.get("wall_s", 0.0) for r in rows)
- min(r.get("wall_s", 0.0) for r in rows)
),
"workers": {},
}
print(f"\n=== Pool timeseries: {timeseries_path}")
print(
f" rows={summary['total_rows']} workers={len(by_worker)} "
f"span={summary['wall_s_span']:.1f}s"
)
# Print per-worker decomposition table
header = (
f"{'worker':<12} {'role':<8} {'cap':>8} | "
f"{'avg_active':>10} {'avg_idle':>10} {'avg_other':>10} {'avg_free':>10} | "
f"{'p90_held':>10} {'max_held':>10} {'p90_avail':>10}"
)
print(header)
print("-" * len(header))
for wid in sorted(by_worker.keys()):
ws = by_worker[wid]
role = ws[0].get("worker_role", "?")
cap_vals = [int(r.get("capacity_tokens") or 0) for r in ws]
held_vals = [int(r.get("held_tokens") or 0) for r in ws]
avail_vals = [int(r.get("available_tokens") or 0) for r in ws]
idle_vals = [int(r.get("idle_evictable_tokens") or 0) for r in ws]
# active = held - idle (sessions in active use)
active_vals = [max(0, h - i) for h, i in zip(held_vals, idle_vals)]
# other = capacity - held - available (prefill backup blocks, in-flight, fragmentation)
other_vals = [
max(0, c - h - a) for c, h, a in zip(cap_vals, held_vals, avail_vals)
]
cap = max(cap_vals) if cap_vals else 0
avg_active = statistics.fmean(active_vals) if active_vals else 0.0
avg_idle = statistics.fmean(idle_vals) if idle_vals else 0.0
avg_other = statistics.fmean(other_vals) if other_vals else 0.0
avg_avail = statistics.fmean(avail_vals) if avail_vals else 0.0
p90_held = _percentile([float(v) for v in held_vals], 0.90)
max_held = max(held_vals) if held_vals else 0
p90_avail = _percentile([float(v) for v in avail_vals], 0.90)
sess_counts = [int(r.get("session_count") or 0) for r in ws]
resident_counts = [int(r.get("resident_session_count") or 0) for r in ws]
print(
f"{wid:<12} {role:<8} {_fmt_tokens(cap):>8} | "
f"{_fmt_tokens(avg_active):>4} {_fmt_pct(avg_active, cap):>5} "
f"{_fmt_tokens(avg_idle):>4} {_fmt_pct(avg_idle, cap):>5} "
f"{_fmt_tokens(avg_other):>4} {_fmt_pct(avg_other, cap):>5} "
f"{_fmt_tokens(avg_avail):>4} {_fmt_pct(avg_avail, cap):>5} | "
f"{_fmt_tokens(p90_held):>10} {_fmt_tokens(max_held):>10} "
f"{_fmt_tokens(p90_avail):>10}"
)
summary["workers"][wid] = {
"role": role,
"capacity_tokens": cap,
"avg_active_held_tokens": avg_active,
"avg_idle_evictable_tokens": avg_idle,
"avg_other_tokens": avg_other,
"avg_available_tokens": avg_avail,
"p90_held_tokens": p90_held,
"max_held_tokens": max_held,
"p90_available_tokens": p90_avail,
"max_session_count": max(sess_counts) if sess_counts else 0,
"max_resident_session_count": (
max(resident_counts) if resident_counts else 0
),
"ticks": len(ws),
}
print(
"\nLegend: active=held-idle idle=idle_evictable "
"other=cap-held-avail (radix-protected + running-batch + in-flight + frag)"
)
# P1: decomposition of "other" using pool_breakdown fields (zeros if instrument absent)
has_breakdown = any(
any(r.get(k) for k in (
"radix_evictable_tokens",
"radix_protected_tokens",
"running_batch_kv_tokens",
"transfer_queue_tokens",
"prealloc_queue_tokens",
"retracted_queue_tokens",
))
for r in rows
)
if has_breakdown:
print("\n=== P1 'other' decomposition (per worker, mean over run) ===")
print(
f"{'worker':<12} {'role':<8} | "
f"{'r_evictable':>11} {'r_protected':>11} {'slot_private':>12} | "
f"{'run_batch':>10} {'transfer':>9} {'prealloc':>9} {'retracted':>10} | "
f"{'unaccounted':>11}"
)
for wid in sorted(by_worker.keys()):
ws = by_worker[wid]
role = ws[0].get("worker_role", "?")
cap = max(int(r.get("capacity_tokens") or 0) for r in ws)
def m(field: str) -> float:
vals = [int(r.get(field) or 0) for r in ws]
return statistics.fmean(vals) if vals else 0.0
r_ev = m("radix_evictable_tokens")
r_pr = m("radix_protected_tokens")
slot = m("slot_private_held_tokens")
rb = m("running_batch_kv_tokens")
tq = m("transfer_queue_tokens")
pq = m("prealloc_queue_tokens")
rq = m("retracted_queue_tokens")
avail = m("available_tokens")
# `running_batch_kv_tokens` overlaps with radix_protected for tree-tracked
# reqs — do NOT subtract it again. Decomposition assumes:
# capacity ≈ avail + r_evictable + r_protected + slot_private
# + transfer_queue + prealloc_queue + retracted_queue + unaccounted
unacc = max(
0,
cap - avail - r_ev - r_pr - slot - tq - pq - rq,
)
print(
f"{wid:<12} {role:<8} | "
f"{_fmt_tokens(r_ev):>11} {_fmt_tokens(r_pr):>11} {_fmt_tokens(slot):>12} | "
f"{_fmt_tokens(rb):>10} {_fmt_tokens(tq):>9} {_fmt_tokens(pq):>9} {_fmt_tokens(rq):>10} | "
f"{_fmt_tokens(unacc):>11}"
)
summary["workers"][wid]["pool_breakdown_avg"] = {
"radix_evictable": r_ev,
"radix_protected": r_pr,
"slot_private_held": slot,
"running_batch_kv": rb,
"transfer_queue": tq,
"prealloc_queue": pq,
"retracted_queue": rq,
"available": avail,
"unaccounted": unacc,
}
print(
"\nNote: running_batch_kv_tokens overlaps with radix_protected_tokens "
"(tree-tracked decode reqs are also in protected); not summed."
)
else:
print("\n(P1 instrument absent: pool_breakdown fields are all zero)")
# Session residency churn: how many distinct sessions ever sat on each worker,
# and how many sessions hopped across workers (= starvation indicator).
print("\n=== Session residency churn ===")
sessions_per_worker: dict[str, set[str]] = defaultdict(set)
workers_per_session: dict[str, set[str]] = defaultdict(set)
resident_ticks_per_session: Counter[tuple[str, str]] = Counter()  # keyed by (worker_id, session_id)
resident_ticks_per_worker: Counter[str] = Counter()
for row in rows:
wid = row.get("worker_id")
if wid is None or row.get("worker_role") != "decode":
continue
sessions = row.get("sessions") or []
if not isinstance(sessions, list):
continue
for entry in sessions:
if not isinstance(entry, dict):
continue
sid = entry.get("session_id")
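if sid is None:
continue
# Register every observed session, resident or not. Without this the
# starvation list below (sessions seen on 0 workers) is always empty.
workers_per_session.setdefault(sid, set())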
if entry.get("resident"):
sessions_per_worker[wid].add(sid)
workers_per_session[sid].add(wid)
resident_ticks_per_session[(wid, sid)] += 1
resident_ticks_per_worker[wid] += 1
# Per-decode worker: distinct session count
print(f" {'worker':<12} {'distinct_sess':>14} {'resident_ticks':>16}")
for wid in sorted(sessions_per_worker.keys()):
print(
f" {wid:<12} {len(sessions_per_worker[wid]):>14} "
f"{resident_ticks_per_worker[wid]:>16}"
)
# Per session: how many workers it hopped across
hops = Counter(len(ws) for ws in workers_per_session.values())
print(f"\n Sessions seen on N workers (decode side):")
for n, count in sorted(hops.items()):
print(f" on {n} worker(s): {count} sessions")
starvation = [sid for sid, ws in workers_per_session.items() if len(ws) == 0]
multi_hopper = sorted(
((sid, ws) for sid, ws in workers_per_session.items() if len(ws) >= 2),
key=lambda x: -len(x[1]),
)[:10]
if multi_hopper:
print(
"\n Top sessions seen resident on multiple workers (potential thrashing):"
)
for sid, ws in multi_hopper:
print(f" {sid}: {len(ws)} workers ({sorted(ws)})")
summary["session_residency"] = {
"distinct_sessions_per_worker": {
wid: len(s) for wid, s in sessions_per_worker.items()
},
"session_hop_count_distribution": dict(hops),
"starvation_session_count": len(starvation),
}
# If a request-metrics file is co-located, also print the execution-mode mix.
# (Only mode counts are reported; modes are not bucketed against pool state.)
metrics_path = timeseries_path.with_name("request-metrics.jsonl")
if metrics_path.exists():
print(f"\n=== Request-metrics summary ({metrics_path.name}) ===")
mrows = _load_jsonl(metrics_path)
modes = Counter(r.get("execution_mode") or "?" for r in mrows)
total = sum(modes.values())
for mode, count in modes.most_common():
print(f" {count:>6} ({100 * count / total:5.1f}%) {mode}")
summary["execution_modes"] = dict(modes)
return summary
def main() -> None:
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"path",
type=Path,
help="Path to d-pool-timeseries.jsonl OR a run dir containing it",
)
parser.add_argument(
"--json",
action="store_true",
help="Also print a machine-readable JSON summary",
)
args = parser.parse_args()
resolved = _resolve_input(args.path)
summary = analyze(resolved)
if args.json:
print("\n=== JSON summary ===")
print(json.dumps(summary, indent=2, sort_keys=True, default=str))
if __name__ == "__main__":
main()
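
The capacity decomposition described in the analyzer's docstring can be checked by hand on a single poll sample. The following is a minimal sketch, not part of the committed script: `decompose_tick` and the sample values are hypothetical, and only the four field names are taken from the code above.

```python
def decompose_tick(tick: dict) -> dict:
    """Split one pool poll sample into the four buckets the analyzer reports."""
    cap = int(tick.get("capacity_tokens") or 0)
    held = int(tick.get("held_tokens") or 0)
    avail = int(tick.get("available_tokens") or 0)
    idle = int(tick.get("idle_evictable_tokens") or 0)
    return {
        "active": max(0, held - idle),        # held by sessions in active use
        "idle": idle,                         # held but evictable
        "other": max(0, cap - held - avail),  # everything not held and not free
        "free": avail,
    }


# Hypothetical sample: a 100K-token pool with 60K held, 20K of it idle.
sample = {
    "capacity_tokens": 100_000,
    "held_tokens": 60_000,
    "available_tokens": 25_000,
    "idle_evictable_tokens": 20_000,
}
parts = decompose_tick(sample)
print(parts)
```

Because `active` and `other` are clamped at zero, the four buckets sum to the capacity only when the reported counters are mutually consistent; a sum above capacity in real data would suggest the counters overlap.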

@@ -0,0 +1,89 @@
#!/usr/bin/env python3
"""Analyze v3 (kv-aware) results — find why fallback-large-append-session-cap dominates."""
import json
import numpy as np
from pathlib import Path
from collections import Counter, defaultdict
BASE = Path(__file__).parent
def load_rows(jsonl_path):
rows = []
with open(jsonl_path) as f:
for line in f:
if line.strip():  # skip blank lines (e.g. a trailing newline)
rows.append(json.loads(line))
return rows
exp1 = load_rows(BASE / "exp1_1p7d_kvc_kvaware_metrics.jsonl")
exp2 = load_rows(BASE / "exp2_2p6d_kvc_kvaware_metrics.jsonl")
for name, rows in [("Exp1 1P7D", exp1), ("Exp2 2P6D", exp2)]:
print(f"\n========== {name} ==========")
ok = [r for r in rows if r.get("error") is None]
# Execution mode breakdown by latency
modes = Counter(r["execution_mode"] for r in ok)
print(f"\nExecution modes (n={len(ok)}):")
for mode, count in modes.most_common():
mode_rows = [r for r in ok if r["execution_mode"] == mode]
lats = [r["latency_s"] for r in mode_rows]
ttfts = [r["ttft_s"] for r in mode_rows]
print(f" {mode}: n={count} ({count/len(ok)*100:.1f}%) "
f"lat P50={np.percentile(lats,50):.3f}s P90={np.percentile(lats,90):.3f}s | "
f"ttft P50={np.percentile(ttfts,50):.3f}s P90={np.percentile(ttfts,90):.3f}s")
# Per-D session distribution
per_d_sessions = defaultdict(set)
for r in ok:
d = r.get("assigned_decode_node", "?")
per_d_sessions[d].add(r["session_id"])
print(f"\nSessions per D worker:")
for d in sorted(per_d_sessions.keys()):
print(f" {d}: {len(per_d_sessions[d])} unique sessions")
# session-cap fallback analysis
sc_rows = [r for r in ok if r["execution_mode"] == "pd-router-fallback-large-append-session-cap"]
if sc_rows:
print(f"\nSession-cap fallback details (n={len(sc_rows)}):")
# Which sessions hit this most?
sc_per_sess = Counter(r["session_id"] for r in sc_rows)
print(f" Sessions hitting session-cap (top 5):")
for sid, cnt in sc_per_sess.most_common(5):
print(f" session {sid}: {cnt} times")
# Per-D distribution
sc_per_d = Counter(r.get("assigned_decode_node", "?") for r in sc_rows)
print(f" Per-D distribution: {dict(sc_per_d.most_common())}")
# Input length distribution
inp = [r.get("input_length", 0) for r in sc_rows]
print(f" Input length: P50={np.percentile(inp,50):.0f} P90={np.percentile(inp,90):.0f}")
# Turn distribution
turns = Counter(r.get("turn_id", -1) for r in sc_rows)
print(f" Turn distribution (top 5): {dict(turns.most_common(5))}")
# Direct-to-D analysis (ideal path)
dd_rows = [r for r in ok if r["execution_mode"] == "kvcache-direct-to-d-session"]
if dd_rows:
lats = [r["latency_s"] for r in dd_rows]
ttfts = [r["ttft_s"] for r in dd_rows]
kv_blocks = [r.get("actual_kv_transfer_blocks", 0) for r in dd_rows]
cached = [r.get("cached_tokens", 0) for r in dd_rows]
print(f"\nDirect-to-D details (n={len(dd_rows)}):")
print(f" lat P50={np.percentile(lats,50):.3f}s P90={np.percentile(lats,90):.3f}s P99={np.percentile(lats,99):.3f}s")
print(f" ttft P50={np.percentile(ttfts,50):.3f}s P90={np.percentile(ttfts,90):.3f}s")
print(f" KV transfer: P50={np.percentile(kv_blocks,50):.0f} (should be 0 — no P involved)")
print(f" cached_tokens P50={np.percentile(cached,50):.0f}")
# Sessions: how many turns each, how many used direct-to-d
print(f"\nPer-session direct-to-D rate (top 10 by total turns):")
per_sess = defaultdict(list)
for r in ok:
per_sess[r["session_id"]].append(r)
sess_stats = []
for sid, sreqs in per_sess.items():
total = len(sreqs)
dd = sum(1 for r in sreqs if r["execution_mode"] == "kvcache-direct-to-d-session")
sc = sum(1 for r in sreqs if "session-cap" in r["execution_mode"])
sess_stats.append((sid, total, dd, sc))
sess_stats.sort(key=lambda x: -x[1])
for sid, total, dd, sc in sess_stats[:10]:
print(f" session {sid}: {total} turns, {dd} direct-to-D ({dd/total*100:.0f}%), {sc} session-cap fallback ({sc/total*100:.0f}%)")

@@ -0,0 +1,52 @@
#!/usr/bin/env python3
"""V4 results analysis: errors, execution modes, latency by mode."""
import json
import numpy as np
from pathlib import Path
from collections import Counter
BASE = Path(__file__).parent
def load_rows(jsonl_path):
rows = []
with open(jsonl_path) as f:
for line in f:
if line.strip():  # skip blank lines (e.g. a trailing newline)
rows.append(json.loads(line))
return rows
for name, path in [
("Exp1 1P7D cap=16", BASE / "exp1_1p7d_kvc_cap16_metrics.jsonl"),
("Exp2 2P6D cap=16", BASE / "exp2_2p6d_kvc_cap16_metrics.jsonl"),
]:
rows = load_rows(path)
print(f"\n========== {name} ==========")
ok = [r for r in rows if r.get("error") is None]
err = [r for r in rows if r.get("error") is not None]
print(f"Total: {len(rows)}, OK: {len(ok)}, Errors: {len(err)}")
# Errors finish_reason
if err:
finish_reasons = Counter()
for r in err:
fr = str(r.get("finish_reason") or r.get("error") or "?")
# Truncate long messages
short = fr[:120]
finish_reasons[short] += 1
print(f"\nError finish_reasons (top 5):")
for fr, cnt in finish_reasons.most_common(5):
print(f" {cnt}x: {fr}")
# Execution mode latency breakdown
modes = Counter(r["execution_mode"] for r in ok)
print(f"\nTop execution modes by latency:")
print(f"{'mode':<55}{'n':<8}{'%':<8}{'P50 lat':<10}{'P90 lat':<10}{'TTFT P50':<10}")
for mode, count in modes.most_common(8):
mode_rows = [r for r in ok if r["execution_mode"] == mode]
lats = [r["latency_s"] for r in mode_rows]
ttfts = [r["ttft_s"] for r in mode_rows]
print(f" {mode:<53}{count:<8}{count/len(ok)*100:>5.1f}% {np.percentile(lats,50):>7.3f}s {np.percentile(lats,90):>7.3f}s {np.percentile(ttfts,50):>7.3f}s")
# Per-D load
per_d = Counter(r.get("assigned_decode_node", "?") for r in ok)
if per_d:  # max()/min() raise ValueError on an empty run (e.g. every request errored)
print(f"\nPer-D load: max/min ratio = {max(per_d.values())/max(min(per_d.values()),1):.2f}x")
print(f" {dict(per_d.most_common())}")
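
One caveat when comparing numbers across these scripts: the v3/v4 analyses use `np.percentile`, whose default method interpolates linearly, while the pool-timeseries analyzer uses a nearest-rank `_percentile` helper. A small sketch of the difference, using only the standard library; both function names and the sample values are hypothetical:

```python
import math


def nearest_rank(values, p):
    """Same indexing rule as the analyzer's _percentile helper."""
    if not values:
        return 0.0
    s = sorted(values)
    idx = min(len(s) - 1, max(0, int(round((len(s) - 1) * p))))
    return s[idx]


def linear_interp(values, p):
    """Linear interpolation between neighbours (np.percentile's default method)."""
    s = sorted(values)
    pos = (len(s) - 1) * p
    lo = math.floor(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


latencies = [1.0, 2.0, 3.0, 4.0, 10.0]  # hypothetical, one slow outlier
p90_rank = nearest_rank(latencies, 0.90)
p90_interp = linear_interp(latencies, 0.90)
print(p90_rank, p90_interp)
```

On small samples with a heavy tail the two can differ noticeably, so P90 values are best compared only between outputs produced by the same method.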

@@ -0,0 +1,136 @@
#!/usr/bin/env python3
"""Compare KVC variants vs baseline, EXCLUDING errors and truncated requests."""
import json
import numpy as np
from pathlib import Path
OUT = Path("/mnt/kzlin/workflow/pd-hybrid/agentic-pd-hybrid/outputs")
DATASETS = [
("baseline 8DP", OUT / "qwen3-30b-tp1-v2-fixed/exp1_8way_dp_cache_aware_metrics.jsonl"),
("v3 1P7D", OUT / "qwen3-30b-tp1-v3-kvaware/exp1_1p7d_kvc_kvaware_metrics.jsonl"),
("v3 2P6D", OUT / "qwen3-30b-tp1-v3-kvaware/exp2_2p6d_kvc_kvaware_metrics.jsonl"),
("v4 1P7D", OUT / "qwen3-30b-tp1-v4-cap16/exp1_1p7d_kvc_cap16_metrics.jsonl"),
("v4 2P6D", OUT / "qwen3-30b-tp1-v4-cap16/exp2_2p6d_kvc_cap16_metrics.jsonl"),
]
def load_rows(path):
rows = []
with open(path) as f:
for line in f:
if line.strip():  # skip blank lines (e.g. a trailing newline)
rows.append(json.loads(line))
return rows
def is_truncated(row):
a = row.get("actual_output_tokens")
r = row.get("requested_output_tokens")
if a is not None and r is not None and r > 1:
return a < r * 0.5
return False
def stats(values):
if not values:
return {"n": 0}
a = np.array(values)
return {
"n": len(a),
"mean": float(np.mean(a)),
"p50": float(np.percentile(a, 50)),
"p90": float(np.percentile(a, 90)),
"p99": float(np.percentile(a, 99)),
}
def fmt(s, key):
if s["n"] == 0:
return "N/A"
v = s[key]
return f"{v:.3f}s" if v < 100 else f"{v:.1f}s"
results = []
for label, path in DATASETS:
if not path.exists():
print(f"SKIP {label}")
continue
rows = load_rows(path)
total = len(rows)
err_n = sum(1 for r in rows if r.get("error") is not None)
trunc_n = sum(1 for r in rows if r.get("error") is None and is_truncated(r))
# Filter: error=None AND not truncated AND latency present
clean = [r for r in rows
if r.get("error") is None
and not is_truncated(r)
and r.get("latency_s") is not None]
lats = [r["latency_s"] for r in clean]
ttfts = [r["ttft_s"] for r in clean if r.get("ttft_s") is not None]
results.append({
"label": label,
"total": total,
"err": err_n,
"trunc": trunc_n,
"clean_n": len(clean),
"lat": stats(lats),
"ttft": stats(ttfts),
})
# Print comparison table
print(f"\n{'='*100}")
print("LATENCY (excluding errors AND truncated)")
print(f"{'='*100}")
print(f"{'config':<16}{'total':>7}{'err':>6}{'trunc':>7}{'clean':>7} {'mean':>9}{'P50':>9}{'P90':>9}{'P99':>9}")
for r in results:
print(f"{r['label']:<16}{r['total']:>7}{r['err']:>6}{r['trunc']:>7}{r['clean_n']:>7} "
f"{fmt(r['lat'],'mean'):>9}{fmt(r['lat'],'p50'):>9}{fmt(r['lat'],'p90'):>9}{fmt(r['lat'],'p99'):>9}")
print(f"\n{'='*100}")
print("TTFT (excluding errors AND truncated)")
print(f"{'='*100}")
print(f"{'config':<16}{'clean':>7} {'mean':>9}{'P50':>9}{'P90':>9}{'P99':>9}")
for r in results:
print(f"{r['label']:<16}{r['clean_n']:>7} "
f"{fmt(r['ttft'],'mean'):>9}{fmt(r['ttft'],'p50'):>9}{fmt(r['ttft'],'p90'):>9}{fmt(r['ttft'],'p99'):>9}")
# Also: per-execution-mode breakdown for v4 only (the most interesting)
print(f"\n{'='*100}")
print("V4 2P6D: per-execution-mode (excluding errors and truncated)")
print(f"{'='*100}")
v4_2p6d = next((p for l, p in DATASETS if l == "v4 2P6D"), None)
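if v4_2p6d is not None and v4_2p6d.exists():  # skip when the run output is missing
rows = load_rows(v4_2p6d)
# Same filter as the headline table: rows without a latency cannot be aggregated.
clean = [r for r in rows
if r.get("error") is None
and not is_truncated(r)
and r.get("latency_s") is not None]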
from collections import Counter
modes = Counter(r["execution_mode"] for r in clean)
print(f"{'mode':<55}{'n':>7}{'%':>7} {'mean':>9}{'P50':>9}{'P90':>9}{'P99':>9}")
for mode, count in modes.most_common(10):
m_rows = [r for r in clean if r["execution_mode"] == mode]
s = stats([r["latency_s"] for r in m_rows])
pct = count/len(clean)*100
print(f" {mode:<53}{count:>7}{pct:>6.1f}% {fmt(s,'mean'):>9}{fmt(s,'p50'):>9}{fmt(s,'p90'):>9}{fmt(s,'p99'):>9}")
# Also: WHAT IF we only count direct-to-D? (Pure KVC performance)
print(f"\n{'='*100}")
print("Pure KVC (kvcache-direct-to-d-session ONLY) vs Baseline")
print(f"{'='*100}")
for label, path in DATASETS:
if not path.exists() or ("1P7D" not in label and "2P6D" not in label):
continue
rows = load_rows(path)
direct = [r for r in rows
if r.get("error") is None and not is_truncated(r)
and r.get("execution_mode") == "kvcache-direct-to-d-session"]
if not direct:
continue
s_lat = stats([r["latency_s"] for r in direct])
s_ttft = stats([r["ttft_s"] for r in direct if r.get("ttft_s") is not None])
print(f"{label:<16}n={s_lat['n']:>5} lat: P50={fmt(s_lat,'p50')} P90={fmt(s_lat,'p90')} ttft: P50={fmt(s_ttft,'p50')} P90={fmt(s_ttft,'p90')}")
# Baseline for reference (already non-fallback by definition)
print()
baseline_path = OUT / "qwen3-30b-tp1-v2-fixed/exp1_8way_dp_cache_aware_metrics.jsonl"
if baseline_path.exists():  # mirror the SKIP handling of the main loop
baseline_rows = load_rows(baseline_path)
clean = [r for r in baseline_rows
if r.get("error") is None
and not is_truncated(r)
and r.get("latency_s") is not None]
s_lat = stats([r["latency_s"] for r in clean])
s_ttft = stats([r["ttft_s"] for r in clean if r.get("ttft_s") is not None])
print(f"{'baseline 8DP':<16}n={s_lat['n']:>5} lat: P50={fmt(s_lat,'p50')} P90={fmt(s_lat,'p90')} ttft: P50={fmt(s_ttft,'p50')} P90={fmt(s_ttft,'p90')}")
else:
print("SKIP baseline 8DP")
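
The "clean" filter used throughout this comparison can be exercised on a few hand-written rows. The sketch below copies `is_truncated` from the script; the four rows are hypothetical and chosen to hit each branch.

```python
def is_truncated(row):
    a = row.get("actual_output_tokens")
    r = row.get("requested_output_tokens")
    if a is not None and r is not None and r > 1:
        return a < r * 0.5  # fewer than half of the requested tokens came back
    return False


rows = [
    # kept: 100 of 128 requested tokens
    {"error": None, "actual_output_tokens": 100, "requested_output_tokens": 128, "latency_s": 1.2},
    # dropped: truncated (10 < 64)
    {"error": None, "actual_output_tokens": 10, "requested_output_tokens": 128, "latency_s": 0.3},
    # dropped: error row
    {"error": "timeout", "actual_output_tokens": 0, "requested_output_tokens": 128, "latency_s": None},
    # kept: single-token requests are never classified as truncated
    {"error": None, "actual_output_tokens": 1, "requested_output_tokens": 1, "latency_s": 0.1},
]

clean = [
    r for r in rows
    if r.get("error") is None
    and not is_truncated(r)
    and r.get("latency_s") is not None
]
print(len(clean), [r["latency_s"] for r in clean])
```

Truncated requests finish early and so look fast; excluding them avoids flattering a configuration that cuts generations short.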

@@ -0,0 +1,110 @@
#!/usr/bin/env python3
"""Convert sibench audit.jsonl to agentic-pd-hybrid trace format.
Source format (sibench audit.jsonl):
{"instance_id": "...", "ts": float, "messages": [...],
"audit": {"prompt_tokens": int, "completion_tokens": int, ...}}
Target format (agentic-pd-hybrid trace JSONL):
{"chat_id": int, "parent_chat_id": int, "timestamp": float,
"turn": int, "input_length": int, "output_length": int,
"type": str, "hash_ids": [int, ...]}
"""
import json
import sys
from collections import defaultdict
from pathlib import Path
BLOCK_TOKEN_BUDGET = 24 # tokens per block, matching trace.py default
def convert(src: Path, dst: Path) -> None:
# Group lines by instance_id, preserving order within each instance
instances: dict[str, list[dict]] = defaultdict(list)
with src.open() as f:
for line in f:
line = line.strip()
if not line:
continue
rec = json.loads(line)
instances[rec["instance_id"]].append(rec)
# Sort each instance's turns by timestamp
for iid in instances:
instances[iid].sort(key=lambda r: r["ts"])
# Assign stable chat_id bases: each instance gets a block of IDs
# Max turns across all instances determines the spacing
max_turns = max((len(turns) for turns in instances.values()), default=0)  # default avoids ValueError on an empty input file
spacing = max_turns + 10 # extra headroom
total_written = 0
with dst.open("w") as out:
for inst_idx, (iid, turns) in enumerate(instances.items()):
base_chat_id = (inst_idx + 1) * spacing # start from spacing to avoid 0
# Track cumulative hash_ids for prefix cache simulation
cumulative_hash_ids: list[int] = []
global_block_counter = inst_idx * 100_000 # unique block namespace per instance
for turn_idx, rec in enumerate(turns):
audit = rec.get("audit", {})
input_length = audit.get("prompt_tokens", 0)
output_length = audit.get("completion_tokens", 0)
if input_length <= 0:
# Fallback: estimate from message content
# "content" may be present but None (e.g. tool-call messages), so coalesce before len()
total_chars = sum(len(m.get("content") or "") for m in rec.get("messages", []))
input_length = max(1, total_chars // 4)
if output_length <= 0:
output_length = 128 # reasonable default
chat_id = base_chat_id + turn_idx
if turn_idx == 0:
parent_chat_id = -1
else:
parent_chat_id = base_chat_id + turn_idx - 1
# Build hash_ids: for turn 0, generate blocks for full input
# For turn N>0, keep previous blocks and add new ones for the delta
if turn_idx == 0:
num_blocks = input_length // BLOCK_TOKEN_BUDGET
cumulative_hash_ids = list(
range(global_block_counter, global_block_counter + num_blocks)
)
global_block_counter += num_blocks
else:
# Each prompt is cumulative, so the delta is the number of prompt
# tokens added since the previous turn's prompt.
cur_prompt_tokens = audit.get("prompt_tokens", 0)
prev_rec_audit = turns[turn_idx - 1].get("audit", {})
prev_prompt_tokens = prev_rec_audit.get("prompt_tokens", 0)
delta = max(0, cur_prompt_tokens - prev_prompt_tokens) if prev_prompt_tokens > 0 else 0
new_blocks = delta // BLOCK_TOKEN_BUDGET
new_ids = list(
range(global_block_counter, global_block_counter + new_blocks)
)
global_block_counter += new_blocks
cumulative_hash_ids = cumulative_hash_ids + new_ids
trace_line = {
"chat_id": chat_id,
"parent_chat_id": parent_chat_id,
"timestamp": rec["ts"],
"turn": turn_idx,
"input_length": input_length,
"output_length": output_length,
"type": "chat",
"hash_ids": cumulative_hash_ids,
}
out.write(json.dumps(trace_line, separators=(",", ":")) + "\n")
total_written += 1
print(f"Converted {total_written} lines from {len(instances)} instances -> {dst}")
if __name__ == "__main__":
if len(sys.argv) != 3:
print(f"Usage: {sys.argv[0]} <input_audit.jsonl> <output_trace.jsonl>")
sys.exit(1)
convert(Path(sys.argv[1]), Path(sys.argv[2]))
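
To see how the converter grows `hash_ids` turn by turn, the block-allocation rule can be isolated into a small function. This is an illustrative sketch only: `grow_hash_ids` is a hypothetical helper that mirrors the per-turn logic above, and the prompt lengths are made up.

```python
BLOCK_TOKEN_BUDGET = 24  # same constant as the converter


def grow_hash_ids(prompt_tokens_per_turn, first_id=0):
    """Return the cumulative hash_ids list emitted for each turn of one instance."""
    ids = []
    next_id = first_id
    prev = 0
    per_turn = []
    for turn_idx, cur in enumerate(prompt_tokens_per_turn):
        if turn_idx == 0:
            new_blocks = cur // BLOCK_TOKEN_BUDGET
        else:
            delta = max(0, cur - prev) if prev > 0 else 0
            new_blocks = delta // BLOCK_TOKEN_BUDGET
        # Earlier ids are kept as a prefix, so later turns share them.
        ids = ids + list(range(next_id, next_id + new_blocks))
        next_id += new_blocks
        prev = cur
        per_turn.append(ids)
    return per_turn


growing = grow_hash_ids([240, 300, 310])
small_appends = grow_hash_ids([240, 263, 286, 309])
print([len(t) for t in growing], [len(t) for t in small_appends])
```

In the second case every append is 23 tokens, just under one block, so no new ids are ever added: the final turn carries 10 ids although 309 tokens would fill 12 whole blocks. If that drift matters for a given experiment, one option is to size the list from the cumulative prompt length instead of the per-turn delta.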

@@ -0,0 +1,450 @@
#!/usr/bin/env python3
"""Prepare balanced real-Ali trace samples for KVC experiments.
The generic sampler is duration-oriented and can be dominated by one long
session. This script keeps real request lengths/timestamps but caps turns per
session so live sweeps can compare policies on a repeatable multi-session
workload.
"""
from __future__ import annotations
import argparse
import json
import statistics
from collections import defaultdict
from dataclasses import asdict, dataclass
from pathlib import Path
from agentic_pd_hybrid.trace import TraceRequest, load_trace
@dataclass(frozen=True)
class SampleSummary:
input_trace_path: str
output_trace_path: str
profile: str
request_count: int
session_count: int
multi_turn_session_count: int
turn2plus_count: int
direct_eligible_turn2plus_count: int
direct_eligible_turn2plus_ratio: float
missing_parent_count: int
max_sessions: int
max_turns_per_session: int
start_time_s: float
end_time_s: float
sampled_duration_s: float
rebased_timestamps: bool
input_tokens: dict[str, float] | None
output_tokens: dict[str, float] | None
append_tokens: dict[str, float] | None
inter_turn_gap_s: dict[str, float] | None
overlap_ratio: dict[str, float] | None
def main() -> None:
parser = argparse.ArgumentParser()
parser.add_argument("--trace", type=Path, required=True)
parser.add_argument("--output-root", type=Path, required=True)
parser.add_argument("--max-sessions", type=int, default=64)
parser.add_argument("--max-turns-per-session", type=int, default=12)
parser.add_argument("--start-time-s", type=float, default=0.0)
parser.add_argument(
"--window-duration-s",
type=float,
default=None,
help=(
"If set, also write continuous-window samples that keep only requests "
"inside [start-time, start-time + window-duration]."
),
)
parser.add_argument(
"--window-target-requests",
type=int,
default=None,
help=(
"For continuous-window samples, select whole sessions across time "
"buckets until at least this many requests are included. This keeps "
"the window span while making live runs tractable."
),
)
parser.add_argument(
"--window-buckets",
type=int,
default=15,
help="Number of time buckets used with --window-target-requests.",
)
parser.add_argument(
"--window-min-turns",
type=int,
default=1,
help=(
"Minimum number of in-window turns per selected session for "
"continuous-window samples."
),
)
parser.add_argument(
"--window-output-name",
default="ali-window.jsonl",
help="Output filename for the continuous-window sample.",
)
parser.add_argument(
"--max-sampled-duration-s",
type=float,
default=None,
help=(
"For balanced profile samples, drop requests after the first selected "
"timestamp plus this duration. Use only for quick smoke runs; headline "
"runs should preserve the full sampled span."
),
)
parser.add_argument(
"--profiles",
nargs="+",
default=["representative-mt", "kvc-fit-smallappend"],
choices=["representative-mt", "kvc-fit-smallappend"],
)
parser.add_argument(
"--no-rebase-timestamps",
action="store_true",
help="Keep original timestamps instead of shifting the sample to start at 0.",
)
args = parser.parse_args()
requests = load_trace(args.trace)
sessions: dict[str, list[TraceRequest]] = defaultdict(list)
for request in requests:
sessions[request.session_id].append(request)
args.output_root.mkdir(parents=True, exist_ok=True)
if args.window_duration_s is not None:
if args.window_target_requests is None:
selected = _select_window(
requests=requests,
start_time_s=args.start_time_s,
window_duration_s=args.window_duration_s,
)
profile = "window"
else:
selected = _select_window_session_sample(
sessions=sessions,
start_time_s=args.start_time_s,
window_duration_s=args.window_duration_s,
target_requests=args.window_target_requests,
bucket_count=args.window_buckets,
min_turns=args.window_min_turns,
)
profile = (
"window-session-sample"
if args.window_min_turns <= 1
else f"window-session-sample-min{args.window_min_turns}turns"
)
output_path = args.output_root / args.window_output_name
summary = _write_sample(
selected=selected,
input_trace_path=args.trace,
output_path=output_path,
profile=profile,
max_sessions=args.max_sessions,
max_turns_per_session=args.max_turns_per_session,
rebase_timestamps=not args.no_rebase_timestamps,
)
print(
f"window: wrote {summary.request_count} requests from "
f"{summary.session_count} sessions to {output_path}"
)
for profile in args.profiles:
selected = _select_profile(
sessions=sessions,
profile=profile,
start_time_s=args.start_time_s,
max_sessions=args.max_sessions,
max_turns_per_session=args.max_turns_per_session,
max_sampled_duration_s=args.max_sampled_duration_s,
)
output_path = args.output_root / f"ali-{profile}.jsonl"
summary = _write_sample(
selected=selected,
input_trace_path=args.trace,
output_path=output_path,
profile=profile,
max_sessions=args.max_sessions,
max_turns_per_session=args.max_turns_per_session,
rebase_timestamps=not args.no_rebase_timestamps,
)
print(
f"{profile}: wrote {summary.request_count} requests from "
f"{summary.session_count} sessions to {output_path}"
)
def _select_profile(
*,
sessions: dict[str, list[TraceRequest]],
profile: str,
start_time_s: float,
max_sessions: int,
max_turns_per_session: int,
max_sampled_duration_s: float | None,
) -> list[TraceRequest]:
eligible: list[list[TraceRequest]] = []
for session_requests in sessions.values():
ordered = _ordered(session_requests)
if len(ordered) < 2:
continue
if ordered[0].timestamp_s < start_time_s:
continue
if profile == "kvc-fit-smallappend" and not _is_kvc_fit_smallappend(ordered):
continue
eligible.append(ordered[:max_turns_per_session])
eligible.sort(key=lambda items: (items[0].timestamp_s, items[0].session_id))
selected_sessions = eligible[:max_sessions]
selected = [request for items in selected_sessions for request in items]
selected.sort(key=lambda request: (request.timestamp_s, request.chat_id))
if selected and max_sampled_duration_s is not None:
first_ts = selected[0].timestamp_s
end_ts = first_ts + max_sampled_duration_s
selected = [
request for request in selected if request.timestamp_s <= end_ts
]
return selected
def _select_window(
*,
requests: list[TraceRequest],
start_time_s: float,
window_duration_s: float,
) -> list[TraceRequest]:
end_time_s = start_time_s + window_duration_s
selected = [
request
for request in requests
if start_time_s <= request.timestamp_s <= end_time_s
]
selected.sort(key=lambda request: (request.timestamp_s, request.chat_id))
return selected
def _select_window_session_sample(
*,
sessions: dict[str, list[TraceRequest]],
start_time_s: float,
window_duration_s: float,
target_requests: int,
bucket_count: int,
min_turns: int,
) -> list[TraceRequest]:
if target_requests <= 0:
raise ValueError("--window-target-requests must be positive")
if bucket_count <= 0:
raise ValueError("--window-buckets must be positive")
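if min_turns <= 0:
raise ValueError("--window-min-turns must be positive")
if window_duration_s <= 0:
# A zero-width window would make bucket_width_s zero and divide by zero below.
raise ValueError("--window-duration-s must be positive")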
end_time_s = start_time_s + window_duration_s
bucket_width_s = window_duration_s / bucket_count
buckets: list[list[list[TraceRequest]]] = [[] for _ in range(bucket_count)]
for session_requests in sessions.values():
ordered = _ordered(session_requests)
if not ordered:
continue
first = ordered[0]
if first.timestamp_s < start_time_s or first.timestamp_s > end_time_s:
continue
in_window = [
request
for request in ordered
if start_time_s <= request.timestamp_s <= end_time_s
]
if len(in_window) < min_turns:
continue
bucket_index = min(
bucket_count - 1,
int((first.timestamp_s - start_time_s) / bucket_width_s),
)
buckets[bucket_index].append(in_window)
for bucket in buckets:
bucket.sort(key=lambda items: (items[0].timestamp_s, items[0].session_id))
selected_sessions: list[list[TraceRequest]] = []
selected_count = 0
positions = [0 for _ in range(bucket_count)]
while selected_count < target_requests:
progressed = False
for index, bucket in enumerate(buckets):
if positions[index] >= len(bucket):
continue
session_requests = bucket[positions[index]]
positions[index] += 1
selected_sessions.append(session_requests)
selected_count += len(session_requests)
progressed = True
if selected_count >= target_requests:
break
if not progressed:
break
selected = [request for items in selected_sessions for request in items]
selected.sort(key=lambda request: (request.timestamp_s, request.chat_id))
if len(selected) < target_requests:
raise ValueError(
f"window session sample selected only {len(selected)} requests; "
f"target was {target_requests}"
)
return selected


def _is_kvc_fit_smallappend(session_requests: list[TraceRequest]) -> bool:
    initial = session_requests[0]
    if initial.input_length < 2048 or initial.input_length > 16000:
        return False
    for request in session_requests:
        if request.output_length > 2048:
            return False
    for previous, current in zip(session_requests, session_requests[1:], strict=False):
        append_tokens = current.input_length - (
            previous.input_length + previous.output_length
        )
        if append_tokens <= 0 or append_tokens > 2048:
            return False
        if _overlap_ratio(previous, current) < 0.75:
            return False
    return True


def _write_sample(
    *,
    selected: list[TraceRequest],
    input_trace_path: Path,
    output_path: Path,
    profile: str,
    max_sessions: int,
    max_turns_per_session: int,
    rebase_timestamps: bool,
) -> SampleSummary:
    if not selected:
        raise ValueError(f"profile {profile!r} selected no requests")
    first_ts = selected[0].timestamp_s
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("w", encoding="utf-8") as handle:
        for request in selected:
            timestamp = request.timestamp_s - first_ts if rebase_timestamps else request.timestamp_s
            payload = {
                "chat_id": request.chat_id,
                "parent_chat_id": request.parent_chat_id,
                "timestamp": round(timestamp, 6),
                "input_length": request.input_length,
                "output_length": request.output_length,
                "type": request.request_type,
                "turn": request.turn_id,
                "hash_ids": list(request.hash_ids),
            }
            handle.write(json.dumps(payload, sort_keys=True) + "\n")
    sessions = defaultdict(list)
    for request in selected:
        sessions[request.session_id].append(request)
    selected_chat_ids = {request.chat_id for request in selected}
    missing_parent_count = sum(
        1
        for request in selected
        if request.parent_chat_id >= 0 and request.parent_chat_id not in selected_chat_ids
    )
    append_values: list[float] = []
    gap_values: list[float] = []
    overlap_values: list[float] = []
    direct_eligible_count = 0
    for session_requests in sessions.values():
        ordered = _ordered(session_requests)
        for previous, current in zip(ordered, ordered[1:], strict=False):
            append_tokens = current.input_length - (
                previous.input_length + previous.output_length
            )
            overlap_ratio = _overlap_ratio(previous, current)
            append_values.append(float(append_tokens))
            gap_values.append(float(current.timestamp_s - previous.timestamp_s))
            overlap_values.append(overlap_ratio)
            if append_tokens > 0 and append_tokens <= 2048 and overlap_ratio > 0:
                direct_eligible_count += 1
    turn2plus_count = sum(max(0, len(items) - 1) for items in sessions.values())
    start = min(request.timestamp_s for request in selected)
    end = max(request.timestamp_s for request in selected)
    summary = SampleSummary(
        input_trace_path=str(input_trace_path),
        output_trace_path=str(output_path),
        profile=profile,
        request_count=len(selected),
        session_count=len(sessions),
        multi_turn_session_count=sum(1 for items in sessions.values() if len(items) > 1),
        turn2plus_count=turn2plus_count,
        direct_eligible_turn2plus_count=direct_eligible_count,
        direct_eligible_turn2plus_ratio=(
            direct_eligible_count / turn2plus_count if turn2plus_count else 0.0
        ),
        missing_parent_count=missing_parent_count,
        max_sessions=max_sessions,
        max_turns_per_session=max_turns_per_session,
        start_time_s=0.0 if rebase_timestamps else start,
        end_time_s=end - start if rebase_timestamps else end,
        sampled_duration_s=end - start,
        rebased_timestamps=rebase_timestamps,
        input_tokens=_stats([float(request.input_length) for request in selected]),
        output_tokens=_stats([float(request.output_length) for request in selected]),
        append_tokens=_stats(append_values),
        inter_turn_gap_s=_stats(gap_values),
        overlap_ratio=_stats(overlap_values),
    )
    with output_path.with_suffix(output_path.suffix + ".summary.json").open(
        "w", encoding="utf-8"
    ) as handle:
        json.dump(asdict(summary), handle, indent=2, sort_keys=True)
    return summary


def _ordered(session_requests: list[TraceRequest]) -> list[TraceRequest]:
    return sorted(
        session_requests,
        key=lambda request: (request.timestamp_s, request.turn_id, request.chat_id),
    )


def _overlap_ratio(previous: TraceRequest, current: TraceRequest) -> float:
    if not current.hash_ids:
        return 0.0
    previous_blocks = set(previous.hash_ids)
    overlap = sum(1 for block in current.hash_ids if block in previous_blocks)
    return overlap / len(current.hash_ids)


def _stats(values: list[float]) -> dict[str, float] | None:
    if not values:
        return None
    ordered = sorted(values)
    return {
        "count": float(len(ordered)),
        "mean": statistics.fmean(ordered),
        "min": ordered[0],
        "p50": _percentile(ordered, 0.50),
        "p90": _percentile(ordered, 0.90),
        "p99": _percentile(ordered, 0.99),
        "max": ordered[-1],
    }


def _percentile(sorted_values: list[float], percentile: float) -> float:
    if len(sorted_values) == 1:
        return sorted_values[0]
    return sorted_values[round((len(sorted_values) - 1) * percentile)]


if __name__ == "__main__":
    main()
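
The `kvc-fit-smallappend` filter in the script above rests on two per-turn quantities: the number of appended tokens and the block overlap ratio. The sketch below restates that arithmetic on a hypothetical `Turn` dataclass (a stand-in, not the script's `TraceRequest`) so the numbers can be checked by hand. The 2048 and 0.75 thresholds are copied from `_is_kvc_fit_smallappend`; the sample values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    """Hypothetical stand-in for TraceRequest with only the fields used here."""

    input_length: int
    output_length: int
    hash_ids: tuple[int, ...]


def append_tokens(previous: Turn, current: Turn) -> int:
    # Tokens added in this turn beyond the previous prompt plus its generated output.
    return current.input_length - (previous.input_length + previous.output_length)


def overlap_ratio(previous: Turn, current: Turn) -> float:
    # Fraction of the current turn's blocks that also appear in the previous turn.
    if not current.hash_ids:
        return 0.0
    previous_blocks = set(previous.hash_ids)
    shared = sum(1 for block in current.hash_ids if block in previous_blocks)
    return shared / len(current.hash_ids)


def is_small_append(previous: Turn, current: Turn) -> bool:
    # Same per-pair conditions as the filter: 0 < append <= 2048 and overlap >= 0.75.
    appended = append_tokens(previous, current)
    return 0 < appended <= 2048 and overlap_ratio(previous, current) >= 0.75


# Illustrative values only: 4800 - (4000 + 300) = 500 appended tokens, 4 of 5 blocks shared.
turn1 = Turn(input_length=4000, output_length=300, hash_ids=(1, 2, 3, 4))
turn2 = Turn(input_length=4800, output_length=200, hash_ids=(1, 2, 3, 4, 5))
```

A pair such as `turn1`/`turn2` passes both conditions; a follow-up turn whose prompt grows by more than 2048 tokens fails the append bound regardless of overlap.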

scripts/run_all_experiments.sh Executable file

@@ -0,0 +1,73 @@
#!/bin/bash
# Run all 3 PD hybrid experiments sequentially
# Uses 52 sessions / 4,449 requests (10% sample of 497 sessions)
# Each experiment takes ~30-40 min
set -euo pipefail
cd "$(dirname "$0")/.."
TRACE="outputs/qwen35-swebench-50sess.jsonl"
MODEL="/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B"
OUTPUT="outputs/swebench-exps"
echo "=== Experiment A: pd-disaggregation ==="
uv run agentic-pd-hybrid benchmark-live \
--trace "$TRACE" \
--output-root "$OUTPUT" \
--mechanism pd-disaggregation \
--policy default \
--model-path "$MODEL" \
--prefill-workers 1 --decode-workers 1 \
--prefill-tp-size 4 --decode-tp-size 4 \
--prefill-gpu-ids 0,1,2,3 --decode-gpu-ids 4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300
echo "=== Experiment B: pd-colo ==="
uv run agentic-pd-hybrid benchmark-live \
--trace "$TRACE" \
--output-root "$OUTPUT" \
--mechanism pd-colo \
--policy default \
--model-path "$MODEL" \
--prefill-workers 0 --decode-workers 0 \
--direct-workers 2 --direct-tp-size 4 \
--direct-gpu-ids 0,1,2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300
echo "=== Experiment C: kvcache-centric ==="
uv run agentic-pd-hybrid benchmark-live \
--trace "$TRACE" \
--output-root "$OUTPUT" \
--mechanism kvcache-centric \
--policy default \
--model-path "$MODEL" \
--prefill-workers 1 --decode-workers 1 \
--prefill-tp-size 4 --decode-tp-size 4 \
--prefill-gpu-ids 0,1,2,3 --decode-gpu-ids 4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 2 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction
echo "=== All experiments complete ==="
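
The three experiments above divide the same 8-GPU budget in different ways: 1P + 1D at TP4 for A and C, and two TP4 direct workers for B. A small pre-flight check along the following lines could catch mismatched GPU-id lists before a long run. `check_topology` is a hypothetical helper, not part of the `agentic-pd-hybrid` CLI, and it assumes each worker occupies exactly `tp_size` GPUs.

```python
def check_topology(
    groups: dict[str, tuple[int, int, list[int]]], gpu_budget: int
) -> int:
    """Validate (workers, tp_size, gpu_ids) per role and return the GPUs used."""
    seen: set[int] = set()
    for role, (workers, tp_size, gpu_ids) in groups.items():
        needed = workers * tp_size
        if needed != len(gpu_ids):
            raise ValueError(
                f"{role}: {workers} workers x TP{tp_size} needs {needed} GPUs, "
                f"got {len(gpu_ids)}"
            )
        reused = seen.intersection(gpu_ids)
        if reused:
            raise ValueError(f"{role}: GPU ids reused across roles: {sorted(reused)}")
        seen.update(gpu_ids)
    if len(seen) > gpu_budget:
        raise ValueError(f"{len(seen)} GPUs requested, budget is {gpu_budget}")
    return len(seen)


# Topologies taken from the flags in the script above.
pd_disagg = {"prefill": (1, 4, [0, 1, 2, 3]), "decode": (1, 4, [4, 5, 6, 7])}
pd_colo = {"direct": (2, 4, [0, 1, 2, 3, 4, 5, 6, 7])}
```

Both layouts resolve to eight GPUs, which matches `--gpu-budget 8`.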

scripts/run_exp_a_pd_disagg.sh Executable file

@@ -0,0 +1,24 @@
#!/bin/bash
# Experiment A: pd-disaggregation baseline
# 1P(GPU 0-3) + 1D(GPU 4-7), TP4, mooncake TCP
# Full 39K trace from SWE-Bench 500 instances
set -euo pipefail
cd "$(dirname "$0")/.."
uv run agentic-pd-hybrid benchmark-live \
--trace outputs/qwen35-swebench-500.jsonl \
--output-root outputs/swebench-exps \
--mechanism pd-disaggregation \
--policy default \
--model-path /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B \
--prefill-workers 1 --decode-workers 1 \
--prefill-tp-size 4 --decode-tp-size 4 \
--prefill-gpu-ids 0,1,2,3 --decode-gpu-ids 4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 64 \
--timeout-s 900 \
--request-timeout-s 300


@@ -0,0 +1,23 @@
#!/bin/bash
# Experiment B1: Naive DP colocation — round-robin policy
# 2 direct workers (GPU 0-3, 4-7), TP4, DP router with round-robin
# No disaggregation — each worker does prefill+decode locally
set -euo pipefail
cd "$(dirname "$0")/.."
uv run agentic-pd-hybrid benchmark-live \
--trace outputs/qwen35-swebench-50sess.jsonl \
--output-root outputs/swebench-exps \
--mechanism pd-colo \
--policy default \
--model-path /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B \
--prefill-workers 0 --decode-workers 0 \
--direct-workers 2 --direct-tp-size 4 \
--direct-gpu-ids 0,1,2,3,4,5,6,7 \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300


@@ -0,0 +1,23 @@
#!/bin/bash
# Experiment B2: Naive DP colocation — cache-aware (kv-aware) policy
# 2 direct workers (GPU 0-3, 4-7), TP4, DP router with consistent-hashing
# Replay kv-aware policy picks the worker with most prefix overlap
set -euo pipefail
cd "$(dirname "$0")/.."
uv run agentic-pd-hybrid benchmark-live \
--trace outputs/qwen35-swebench-50sess.jsonl \
--output-root outputs/swebench-exps \
--mechanism pd-colo \
--policy kv-aware \
--model-path /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B \
--prefill-workers 0 --decode-workers 0 \
--direct-workers 2 --direct-tp-size 4 \
--direct-gpu-ids 0,1,2,3,4,5,6,7 \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300

scripts/run_exp_b_pd_colo.sh Executable file

@@ -0,0 +1,24 @@
#!/bin/bash
# Experiment B: pd-colo (direct/colocation)
# 2 direct workers (GPU 0-3, 4-7), TP4, no router
# Full 39K trace from SWE-Bench 500 instances
set -euo pipefail
cd "$(dirname "$0")/.."
uv run agentic-pd-hybrid benchmark-live \
--trace outputs/qwen35-swebench-500.jsonl \
--output-root outputs/swebench-exps \
--mechanism pd-colo \
--policy default \
--model-path /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B \
--prefill-workers 0 --decode-workers 0 \
--direct-workers 2 --direct-tp-size 4 \
--direct-gpu-ids 0,1,2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 64 \
--timeout-s 900 \
--request-timeout-s 300


@@ -0,0 +1,28 @@
#!/bin/bash
# Experiment C: kvcache-centric (session-aware PD)
# 1P(GPU 0-3) + 1D(GPU 4-7), TP4, mooncake TCP
# Full 39K trace from SWE-Bench 500 instances
set -euo pipefail
cd "$(dirname "$0")/.."
uv run agentic-pd-hybrid benchmark-live \
--trace outputs/qwen35-swebench-500.jsonl \
--output-root outputs/swebench-exps \
--mechanism kvcache-centric \
--policy default \
--model-path /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B \
--prefill-workers 1 --decode-workers 1 \
--prefill-tp-size 4 --decode-tp-size 4 \
--prefill-gpu-ids 0,1,2,3 --decode-gpu-ids 4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 64 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 2 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction

scripts/smoke_test.sh Executable file

@@ -0,0 +1,30 @@
#!/bin/bash
# Smoke test: pd-disaggregation with mooncake TCP, 100 requests
set -euo pipefail
cd "$(dirname "$0")/.."
# Sample a small trace for smoke testing
uv run agentic-pd-hybrid sample-sessions \
--trace outputs/qwen35-swebench-500.jsonl \
--output outputs/qwen35-smoke-3sess.jsonl \
--session-sample-rate 0.02 \
--min-turns 5 \
--target-duration-s 300 \
--max-requests 100
# Run smoke test
uv run agentic-pd-hybrid benchmark-live \
--trace outputs/qwen35-smoke-3sess.jsonl \
--output-root outputs/smoke \
--mechanism pd-disaggregation \
--policy default \
--model-path /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B \
--prefill-workers 1 --decode-workers 1 \
--prefill-tp-size 4 --decode-tp-size 4 \
--prefill-gpu-ids 0,1,2,3 --decode-gpu-ids 4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300


@@ -0,0 +1,114 @@
#!/usr/bin/env bash
# Smoke sweep: validate backpressure code change on top of v5 Option D config.
# Designed to fit in ~3-4h GPU budget (4 runs × ~30-60 min).
#
# Usage:
# bash scripts/sweep_backpressure_smoke.sh
#
# Prerequisites: GPUs available; trace at outputs/qwen35-swebench-50sess.jsonl;
# model at $MODEL (default Qwen3-30B-A3B-Instruct-2507).
set -euo pipefail
REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
cd "$REPO_ROOT"
OUT_ROOT=${OUT_ROOT:-outputs/sweep_backpressure_smoke}
TRACE=${TRACE:-outputs/qwen35-swebench-50sess.jsonl}
MODEL=${MODEL:-/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507}
mkdir -p "$OUT_ROOT"
LOG="$OUT_ROOT/sweep.log"
echo "[$(date '+%F %T')] Starting backpressure smoke sweep" | tee -a "$LOG"
echo " Trace: $TRACE" | tee -a "$LOG"
echo " Model: $MODEL" | tee -a "$LOG"
echo " Output root: $OUT_ROOT" | tee -a "$LOG"
KVC_COMMON_ARGS=(
--trace "$TRACE"
--model-path "$MODEL"
--mechanism kvcache-centric
--policy kv-aware
--kvcache-admission-mode worker
--kvcache-seed-min-turn-id 1
--kvcache-seed-max-inflight-decode -1
--kvcache-prefill-backup-policy release-after-transfer
--kvcache-prefill-priority-eviction
--prefill-workers 2
--decode-workers 6
--prefill-gpu-ids 0,1
--decode-gpu-ids 2,3,4,5,6,7
--transfer-backend mooncake
--target-duration-s 2000
--session-sample-rate 1.0
--min-turns 2
--concurrency-limit 32
)
DP_COMMON_ARGS=(
--trace "$TRACE"
--model "$MODEL"
--mechanism pd-colo
--policy kv-aware
--direct-workers 8
--direct-gpu-ids 0,1,2,3,4,5,6,7
--transfer-backend mooncake
--target-duration-s 2000
--session-sample-rate 1.0
--min-turns 2
--concurrency-limit 32
)
run_kvc_baseline_ts10() {
local out="$OUT_ROOT/E1_kvc_baseline_ts10"
echo "[$(date '+%F %T')] === E1: KVC baseline (no backpressure) time-scale=10 ===" | tee -a "$LOG"
python -m agentic_pd_hybrid.cli benchmark-live \
"${KVC_COMMON_ARGS[@]}" \
--output-root "$out" \
--time-scale 10 \
2>&1 | tee -a "$LOG"
}
run_kvc_backpressure_ts10() {
local out="$OUT_ROOT/E2_kvc_backpressure_ts10"
echo "[$(date '+%F %T')] === E2: KVC + backpressure ON, time-scale=10 ===" | tee -a "$LOG"
python -m agentic_pd_hybrid.cli benchmark-live \
"${KVC_COMMON_ARGS[@]}" \
--output-root "$out" \
--time-scale 10 \
--enable-backpressure \
--backpressure-max-pause-s 2.0 \
2>&1 | tee -a "$LOG"
}
run_kvc_backpressure_ts1() {
local out="$OUT_ROOT/E3_kvc_backpressure_ts1_short"
echo "[$(date '+%F %T')] === E3: KVC + backpressure ON, time-scale=1, FIRST 1000 reqs ===" | tee -a "$LOG"
python -m agentic_pd_hybrid.cli benchmark-live \
"${KVC_COMMON_ARGS[@]}" \
--output-root "$out" \
--time-scale 1 \
--enable-backpressure \
--backpressure-max-pause-s 2.0 \
--target-duration-s 1800 \
2>&1 | tee -a "$LOG"
}
run_dp_baseline_ts1() {
local out="$OUT_ROOT/E4_dp_ts1_short"
echo "[$(date '+%F %T')] === E4: 8-way DP cache-aware, time-scale=1, FIRST 1000 reqs ===" | tee -a "$LOG"
python -m agentic_pd_hybrid.cli benchmark-live \
"${DP_COMMON_ARGS[@]}" \
--output-root "$out" \
--time-scale 1 \
--target-duration-s 1800 \
2>&1 | tee -a "$LOG"
}
# Sequence — add/remove as fits the budget.
run_kvc_baseline_ts10
run_kvc_backpressure_ts10
run_kvc_backpressure_ts1
run_dp_baseline_ts1
echo "[$(date '+%F %T')] === sweep DONE ===" | tee -a "$LOG"
echo "Run analysis with: python scripts/analysis/analyze_backpressure_smoke.py $OUT_ROOT" | tee -a "$LOG"
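
E3 and E4 pass `--target-duration-s 1800` after the shared argument array has already supplied `--target-duration-s 2000`. If the CLI parses its flags with Python's `argparse` (an assumption, since the parser is not shown here), the later occurrence of a plain value option replaces the earlier one, so those two runs would use 1800. The snippet demonstrates that `argparse` behaviour on a stand-in parser; `build_parser` and its two options are illustrative only.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Stand-in parser; the real CLI's option definitions are not shown in this diff.
    parser = argparse.ArgumentParser()
    parser.add_argument("--target-duration-s", type=float, default=None)
    parser.add_argument("--time-scale", type=float, default=1.0)
    return parser


# Shared arguments first, per-run overrides appended afterwards, as in the sweep script.
common_args = ["--target-duration-s", "2000"]
e3_args = [*common_args, "--time-scale", "1", "--target-duration-s", "1800"]

# With the default "store" action, the last value given for an option is kept.
parsed = build_parser().parse_args(e3_args)
```

A parser built on a different library, or one that declares the option with `action="append"`, would behave differently, so this is worth confirming against the actual CLI.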

scripts/sweep_kvc_qwen3_30b.sh Executable file

@@ -0,0 +1,60 @@
#!/bin/bash
# KVC admission control parameter sweep on Qwen3-30B
# 5 experiments, ~35 min each, ~3 hours total
set -euo pipefail
cd "$(dirname "$0")/.."
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
TRACE=outputs/qwen35-swebench-50sess.jsonl
OUTPUT=outputs/qwen3-30b-exps
VENV_PYTHON=.venv/bin/python
run_kvc() {
local label=$1
local inflight=$2
local min_turn=$3
echo "=== [$label] inflight=$inflight min_turn=$min_turn === $(date)"
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy default \
--model-path $MODEL \
--prefill-workers 1 --decode-workers 1 \
--prefill-tp-size 4 --decode-tp-size 4 \
--prefill-gpu-ids 0,1,2,3 --decode-gpu-ids 4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id $min_turn \
--kvcache-seed-max-inflight-decode $inflight \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction
echo "=== [$label] DONE === $(date)"
echo ""
}
# C1: inflight=8, min-turn=2
run_kvc "C1" 8 2
# C2: inflight=16, min-turn=2
run_kvc "C2" 16 2
# C3: inflight=-1 (disabled), min-turn=2
run_kvc "C3" -1 2
# C4: inflight=8, min-turn=1
run_kvc "C4" 8 1
# C5: inflight=-1 (disabled), min-turn=1
run_kvc "C5" -1 1
echo "=== ALL SWEEP EXPERIMENTS DONE === $(date)"

scripts/sweep_real_ali_kvc.sh Executable file

@@ -0,0 +1,170 @@
#!/usr/bin/env bash
# Real Ali workload sweep for KVC pd-hybrid.
#
# This script expects a prebuilt sample trace and replays it exactly for every
# mechanism. It intentionally keeps pool polling disabled for performance runs.
set -euo pipefail
REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
cd "$REPO_ROOT"
MODEL=${MODEL:-/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct}
TRACE=${TRACE:-outputs/real-ali-kvc-iter/samples-balanced/ali-kvc-fit-smallappend.jsonl}
OUT_ROOT=${OUT_ROOT:-outputs/real-ali-kvc-iter/runs}
TIME_SCALE=${TIME_SCALE:-1}
CONCURRENCY=${CONCURRENCY:-32}
REQUEST_TIMEOUT_S=${REQUEST_TIMEOUT_S:-300}
STACK_TIMEOUT_S=${STACK_TIMEOUT_S:-1200}
RUNS=${RUNS:-"dp kvc_bp"}
EXTRA_SERVER_ARGS=${EXTRA_SERVER_ARGS:-}
PREFILL_EXTRA_SERVER_ARGS=${PREFILL_EXTRA_SERVER_ARGS:-}
DECODE_EXTRA_SERVER_ARGS=${DECODE_EXTRA_SERVER_ARGS:-}
KVC_SEED_MIN_TURN_ID=${KVC_SEED_MIN_TURN_ID:-1}
KVC_SEED_ONLY_MULTITURN=${KVC_SEED_ONLY_MULTITURN:-0}
mkdir -p "$OUT_ROOT"
LOG="$OUT_ROOT/sweep.log"
log() {
echo "[$(date '+%F %T')] $*" | tee -a "$LOG"
}
common_args=(
--trace "$TRACE"
--model-path "$MODEL"
--output-root "$OUT_ROOT"
--use-trace-as-sample
--time-scale "$TIME_SCALE"
--concurrency-limit "$CONCURRENCY"
--timeout-s "$STACK_TIMEOUT_S"
--request-timeout-s "$REQUEST_TIMEOUT_S"
)
if [[ -n "$EXTRA_SERVER_ARGS" ]]; then
common_args+=(--extra-server-args "$EXTRA_SERVER_ARGS")
fi
if [[ -n "$PREFILL_EXTRA_SERVER_ARGS" ]]; then
common_args+=(--prefill-extra-server-args "$PREFILL_EXTRA_SERVER_ARGS")
fi
if [[ -n "$DECODE_EXTRA_SERVER_ARGS" ]]; then
common_args+=(--decode-extra-server-args "$DECODE_EXTRA_SERVER_ARGS")
fi
kvc_args=(
"${common_args[@]}"
--mechanism kvcache-centric
--policy kv-aware
--prefill-workers 2
--decode-workers 6
--prefill-tp-size 1
--decode-tp-size 1
--prefill-gpu-ids 0,1
--decode-gpu-ids 2,3,4,5,6,7
--transfer-backend mooncake
--gpu-budget 8
--kvcache-admission-mode worker
--kvcache-seed-min-turn-id "$KVC_SEED_MIN_TURN_ID"
--kvcache-seed-max-inflight-decode -1
--kvcache-prefill-backup-policy release-after-transfer
--kvcache-prefill-priority-eviction
)
if [[ "$KVC_SEED_ONLY_MULTITURN" == "1" ]]; then
kvc_args+=(--kvcache-seed-only-multiturn-sessions)
fi
run_dp() {
log "=== DP cache-aware baseline: 8 direct workers ==="
uv run agentic-pd-hybrid benchmark-live \
"${common_args[@]}" \
--mechanism pd-colo \
--policy kv-aware \
--prefill-workers 0 \
--decode-workers 0 \
--direct-workers 8 \
--direct-tp-size 1 \
--direct-gpu-ids 0,1,2,3,4,5,6,7 \
--gpu-budget 8
}
run_pd_disagg() {
log "=== PD-disaggregation baseline: 2P6D ==="
uv run agentic-pd-hybrid benchmark-live \
"${common_args[@]}" \
--mechanism pd-disaggregation \
--policy kv-aware \
--prefill-workers 2 \
--decode-workers 6 \
--prefill-tp-size 1 \
--decode-tp-size 1 \
--prefill-gpu-ids 0,1 \
--decode-gpu-ids 2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8
}
run_pd_sticky() {
log "=== PD-disaggregation sticky baseline: 2P6D ==="
uv run agentic-pd-hybrid benchmark-live \
"${common_args[@]}" \
--mechanism pd-disaggregation \
--policy sticky \
--prefill-workers 2 \
--decode-workers 6 \
--prefill-tp-size 1 \
--decode-tp-size 1 \
--prefill-gpu-ids 0,1 \
--decode-gpu-ids 2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8
}
run_kvc() {
log "=== KVC baseline: 2P6D worker admission, no backpressure ==="
uv run agentic-pd-hybrid benchmark-live "${kvc_args[@]}"
}
run_kvc_bp() {
log "=== KVC candidate: 2P6D worker admission + backpressure ==="
uv run agentic-pd-hybrid benchmark-live \
"${kvc_args[@]}" \
--enable-backpressure \
--backpressure-max-pause-s 2.0
}
summarize_latest() {
log "=== Latest summaries ==="
find "$OUT_ROOT" -maxdepth 2 -name 'request-metrics.jsonl.summary.json' -print \
| sort \
| while read -r summary; do
python - "$summary" <<'PY'
import json, sys
p=sys.argv[1]
d=json.load(open(p))
lat=d.get("latency_stats_s") or {}
tt=d.get("ttft_stats_s") or {}
em=d.get("execution_modes") or {}
print(p)
print(" reqs", d.get("request_count"), "errors", d.get("error_count"), "trunc", d.get("truncated_request_count"))
print(" lat mean/p50/p90/p99", lat.get("mean"), lat.get("p50"), lat.get("p90"), lat.get("p99"))
print(" ttft mean/p50/p90", tt.get("mean"), tt.get("p50"), tt.get("p90"))
print(" modes", em)
PY
done | tee -a "$LOG"
}
log "Trace: $TRACE"
log "Model: $MODEL"
log "Runs: $RUNS | time-scale=$TIME_SCALE concurrency=$CONCURRENCY | kvc-seed-min-turn-id=$KVC_SEED_MIN_TURN_ID | kvc-seed-only-multiturn=$KVC_SEED_ONLY_MULTITURN"
for run in $RUNS; do
case "$run" in
dp) run_dp ;;
pd) run_pd_disagg ;;
pd_sticky) run_pd_sticky ;;
kvc) run_kvc ;;
kvc_bp) run_kvc_bp ;;
*) log "Unknown run name: $run"; exit 2 ;;
esac
done
summarize_latest
log "DONE"
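
The inline Python in `summarize_latest` can also be kept as a standalone module, which makes the report format testable without running a sweep. The following is a minimal sketch of that idea, not a drop-in replacement: the JSON keys are the ones the heredoc reads, while `format_summary`, `summarize`, and `demo` are hypothetical names and the demo payload is made-up data.

```python
import json
import tempfile
from pathlib import Path

SUMMARY_NAME = "request-metrics.jsonl.summary.json"


def format_summary(path: Path) -> list[str]:
    """Build the report lines for one summary file; missing keys render as None."""
    with path.open(encoding="utf-8") as handle:
        data = json.load(handle)
    lat = data.get("latency_stats_s") or {}
    ttft = data.get("ttft_stats_s") or {}
    modes = data.get("execution_modes") or {}
    return [
        str(path),
        f" reqs {data.get('request_count')} errors {data.get('error_count')}"
        f" trunc {data.get('truncated_request_count')}",
        f" lat mean/p50/p90/p99 {lat.get('mean')} {lat.get('p50')}"
        f" {lat.get('p90')} {lat.get('p99')}",
        f" ttft mean/p50/p90 {ttft.get('mean')} {ttft.get('p50')} {ttft.get('p90')}",
        f" modes {modes}",
    ]


def summarize(out_root: Path) -> list[str]:
    # Same reach as `find -maxdepth 2`: files in out_root itself or one directory below.
    paths = sorted({*out_root.glob(SUMMARY_NAME), *out_root.glob(f"*/{SUMMARY_NAME}")})
    lines: list[str] = []
    for path in paths:
        lines.extend(format_summary(path))
    return lines


def demo() -> list[str]:
    # Writes one made-up summary into a temporary run directory and formats it.
    with tempfile.TemporaryDirectory() as tmp:
        run_dir = Path(tmp) / "run-a"
        run_dir.mkdir()
        payload = {
            "request_count": 3,
            "error_count": 0,
            "latency_stats_s": {"mean": 1.5, "p50": 1.0, "p90": 2.0, "p99": 2.5},
        }
        (run_dir / SUMMARY_NAME).write_text(json.dumps(payload), encoding="utf-8")
        return summarize(Path(tmp))
```

Absent sections such as `ttft_stats_s` fall back to an empty dict, so the report prints `None` for them instead of raising.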

scripts/sweep_tp1_configs.sh Executable file

@@ -0,0 +1,133 @@
#!/bin/bash
# TP1 configuration sweep: 8-way DP, 1P7D KVC, 2P6D KVC
# Qwen3-30B-A3B TP=1, single GPU per worker
# Most aggressive KVC admission: inflight=-1 (off), seed-min-turn=1
set -euo pipefail
cd "$(dirname "$0")/.."
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
TRACE=outputs/qwen35-swebench-50sess.jsonl
OUTPUT=outputs/qwen3-30b-tp1-exps
VENV_PYTHON=.venv/bin/python
RESULTS_FILE=$OUTPUT/sweep_results.txt
mkdir -p $OUTPUT
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
}
save_result() {
local label=$1
local run_dir=$2
log "=== $label COMPLETED ==="
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
log "Summary:"
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
echo "" >> $RESULTS_FILE
# Also copy summary to a named file for easy access
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
log "Saved to $OUTPUT/${label}_summary.json"
else
log "WARNING: No summary file found in $run_dir"
fi
}
log "Starting TP1 configuration sweep"
log "Model: $MODEL"
log "Trace: $TRACE (4449 requests, 52 sessions)"
########################################
# Experiment 1: 8-way DP cache-aware
########################################
log ""
log "=== [EXP1] 8-way DP cache-aware (8 direct × TP1) ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism pd-colo \
--policy kv-aware \
--model-path $MODEL \
--prefill-workers 0 --decode-workers 0 \
--direct-workers 8 --direct-tp-size 1 \
--direct-gpu-ids 0,1,2,3,4,5,6,7 \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300
# Find latest run dir for this experiment
EXP1_DIR=$(ls -td $OUTPUT/pd-colo-kv-aware-*/ 2>/dev/null | head -1 || true)
save_result "exp1_8way_dp_cache_aware" "$EXP1_DIR"
########################################
# Experiment 2: 1P + 7D KVC (most aggressive)
########################################
log ""
log "=== [EXP2] 1P7D KVC (inflight=off, min-turn=1) ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy default \
--model-path $MODEL \
--prefill-workers 1 --decode-workers 7 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction
EXP2_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1 || true)
save_result "exp2_1p7d_kvc_aggressive" "$EXP2_DIR"
########################################
# Experiment 3: 2P + 6D KVC (most aggressive)
########################################
log ""
log "=== [EXP3] 2P6D KVC (inflight=off, min-turn=1) ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy default \
--model-path $MODEL \
--prefill-workers 2 --decode-workers 6 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction
EXP3_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1 || true)
save_result "exp3_2p6d_kvc_aggressive" "$EXP3_DIR"
########################################
log ""
log "=== ALL TP1 SWEEP EXPERIMENTS DONE ==="
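
Each experiment in this sweep locates its output with `ls -td <glob> | head -1`, that is, the most recently modified directory matching a prefix. The same lookup can be sketched in Python with an explicit "nothing matched" result, so the caller decides how to handle a missing run directory. `latest_run_dir` and `demo` are hypothetical names, and the directory names in the demo only mimic the prefixes used above.

```python
from __future__ import annotations

import os
import tempfile
from pathlib import Path


def latest_run_dir(output_root: Path, pattern: str) -> Path | None:
    """Return the most recently modified directory matching pattern, or None."""
    candidates = [path for path in output_root.glob(pattern) if path.is_dir()]
    if not candidates:
        return None
    return max(candidates, key=lambda path: path.stat().st_mtime)


def demo() -> tuple[str | None, str | None]:
    # Creates two fake run directories with fixed mtimes, then looks up two prefixes.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name, mtime in (("kvcache-centric-a", 1_000), ("kvcache-centric-b", 2_000)):
            run_dir = root / name
            run_dir.mkdir()
            os.utime(run_dir, (mtime, mtime))
        newest = latest_run_dir(root, "kvcache-centric-*")
        missing = latest_run_dir(root, "pd-colo-kv-aware-*")
        return (
            newest.name if newest else None,
            missing.name if missing else None,
        )
```

Because experiments 2 and 3 share the `kvcache-centric-*` prefix, picking by modification time only works when the runs execute strictly in sequence, as they do in this script.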

scripts/sweep_tp1_v2_fixed.sh Executable file

@@ -0,0 +1,131 @@
#!/bin/bash
# TP1 configuration sweep v2 — after session_params fix + audit fields
# Qwen3-30B-A3B TP=1, single GPU per worker
set -euo pipefail
cd "$(dirname "$0")/.."
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
TRACE=outputs/qwen35-swebench-50sess.jsonl
OUTPUT=outputs/qwen3-30b-tp1-v2-fixed
VENV_PYTHON=.venv/bin/python
RESULTS_FILE=$OUTPUT/sweep_results.txt
mkdir -p $OUTPUT
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
}
save_result() {
local label=$1
local run_dir=$2
log "=== $label COMPLETED ==="
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
log "Summary:"
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
echo "" >> $RESULTS_FILE
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
log "Saved to $OUTPUT/${label}_summary.json + ${label}_metrics.jsonl"
else
log "WARNING: No summary file found in $run_dir"
fi
}
log "Starting TP1 v2 sweep (session_params fix + audit fields)"
log "Model: $MODEL"
log "Trace: $TRACE (4449 requests, 52 sessions)"
########################################
# Experiment 1: 8-way DP cache-aware
########################################
log ""
log "=== [EXP1] 8-way DP cache-aware (8 direct × TP1) ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism pd-colo \
--policy kv-aware \
--model-path $MODEL \
--prefill-workers 0 --decode-workers 0 \
--direct-workers 8 --direct-tp-size 1 \
--direct-gpu-ids 0,1,2,3,4,5,6,7 \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300
EXP1_DIR=$(ls -td $OUTPUT/pd-colo-kv-aware-*/ 2>/dev/null | head -1 || true)
save_result "exp1_8way_dp_cache_aware" "$EXP1_DIR"
########################################
# Experiment 2: 1P + 7D KVC (aggressive)
########################################
log ""
log "=== [EXP2] 1P7D KVC (inflight=off, min-turn=1) ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy default \
--model-path $MODEL \
--prefill-workers 1 --decode-workers 7 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction
EXP2_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1 || true)
save_result "exp2_1p7d_kvc_aggressive" "$EXP2_DIR"
########################################
# Experiment 3: 2P + 6D KVC (aggressive)
########################################
log ""
log "=== [EXP3] 2P6D KVC (inflight=off, min-turn=1) ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy default \
--model-path $MODEL \
--prefill-workers 2 --decode-workers 6 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction
EXP3_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1 || true)
save_result "exp3_2p6d_kvc_aggressive" "$EXP3_DIR"
########################################
log ""
log "=== ALL TP1 V2 SWEEP EXPERIMENTS DONE ==="

scripts/sweep_tp1_v3_kvaware.sh Executable file

@@ -0,0 +1,108 @@
#!/bin/bash
# TP1 v3 sweep — KVC with kv-aware policy (fix routing mismatch)
# v2 used --policy default for KVC experiments, causing session routing
# mismatch: replay round-robin ≠ router round-robin → "session not found".
# v3 uses --policy kv-aware for KVC to ensure session affinity.
set -euo pipefail
cd "$(dirname "$0")/.."
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
TRACE=outputs/qwen35-swebench-50sess.jsonl
OUTPUT=outputs/qwen3-30b-tp1-v3-kvaware
VENV_PYTHON=.venv/bin/python
RESULTS_FILE=$OUTPUT/sweep_results.txt
mkdir -p $OUTPUT
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
}
save_result() {
local label=$1
local run_dir=$2
log "=== $label COMPLETED ==="
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
log "Summary:"
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
echo "" >> $RESULTS_FILE
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
log "Saved to $OUTPUT/${label}_summary.json + ${label}_metrics.jsonl"
else
log "WARNING: No summary file found in $run_dir"
fi
}
log "Starting TP1 v3 sweep (KVC with kv-aware policy)"
log "Model: $MODEL"
log "Trace: $TRACE (4449 requests, 52 sessions)"
log "Key change: --policy kv-aware for KVC (was --policy default in v2)"
########################################
# Experiment 1: 1P + 7D KVC kv-aware
########################################
log ""
log "=== [EXP1] 1P7D KVC kv-aware ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy kv-aware \
--model-path $MODEL \
--prefill-workers 1 --decode-workers 7 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction
EXP1_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
save_result "exp1_1p7d_kvc_kvaware" "$EXP1_DIR"
########################################
# Experiment 2: 2P + 6D KVC kv-aware
########################################
log ""
log "=== [EXP2] 2P6D KVC kv-aware ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy kv-aware \
--model-path $MODEL \
--prefill-workers 2 --decode-workers 6 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction
EXP2_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
save_result "exp2_2p6d_kvc_kvaware" "$EXP2_DIR"
########################################
log ""
log "=== ALL TP1 V3 SWEEP EXPERIMENTS DONE ==="

scripts/sweep_tp1_v4_cap16.sh Executable file

@@ -0,0 +1,108 @@
#!/bin/bash
# TP1 v4 sweep — KVC with kv-aware policy + soft_cap raised from 4 to 16
# v3 (kv-aware) fixed routing, but session-cap fallback still accounted for
# 52-65% of requests. The hardcoded min(4, ...) in _decode_session_soft_cap
# was the bottleneck: only 4*7=28 session slots for 52 trace sessions.
# v4 raises the cap to 16 (4*7=28 -> 16*7=112 slots).
set -euo pipefail
cd "$(dirname "$0")/.."
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
TRACE=outputs/qwen35-swebench-50sess.jsonl
OUTPUT=outputs/qwen3-30b-tp1-v4-cap16
VENV_PYTHON=.venv/bin/python
RESULTS_FILE=$OUTPUT/sweep_results.txt
mkdir -p $OUTPUT
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
}
save_result() {
local label=$1
local run_dir=$2
log "=== $label COMPLETED ==="
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
log "Summary:"
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
echo "" >> $RESULTS_FILE
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
log "Saved to $OUTPUT/${label}_summary.json + ${label}_metrics.jsonl"
else
log "WARNING: No summary file found in $run_dir"
fi
}
log "Starting TP1 v4 sweep (KVC kv-aware, session soft_cap raised 4->16)"
log "Model: $MODEL"
log "Trace: $TRACE (4449 requests, 52 sessions)"
log "Key change: _decode_session_soft_cap now min(16, ...) instead of min(4, ...)"
########################################
# Experiment 1: 1P + 7D KVC kv-aware (cap=16)
########################################
log ""
log "=== [EXP1] 1P7D KVC kv-aware cap=16 ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy kv-aware \
--model-path $MODEL \
--prefill-workers 1 --decode-workers 7 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction
EXP1_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
save_result "exp1_1p7d_kvc_cap16" "$EXP1_DIR"
########################################
# Experiment 2: 2P + 6D KVC kv-aware (cap=16)
########################################
log ""
log "=== [EXP2] 2P6D KVC kv-aware cap=16 ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy kv-aware \
--model-path $MODEL \
--prefill-workers 2 --decode-workers 6 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction
EXP2_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
save_result "exp2_2p6d_kvc_cap16" "$EXP2_DIR"
log ""
log "=== ALL TP1 V4 SWEEP EXPERIMENTS DONE ==="
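The slot arithmetic in the v4 header comment can be checked with a short sketch. It assumes, as the comment does, that the total number of session slots is the per-worker soft cap multiplied by the number of decode workers. `session_slots` is a hypothetical helper for illustration, not a function from the repository.

```python
def session_slots(soft_cap: int, decode_workers: int) -> int:
    """Total decode-side session slots, assuming the cap applies per decode worker."""
    return soft_cap * decode_workers


TRACE_SESSIONS = 52  # session count quoted in the sweep log line
DECODE_WORKERS = 7   # the 1P7D layout

for cap in (4, 16):
    slots = session_slots(cap, DECODE_WORKERS)
    fits = slots >= TRACE_SESSIONS
    print(f"cap={cap}: {slots} slots, enough for {TRACE_SESSIONS} sessions: {fits}")
```

Under the same assumption, the 2P6D layout (six decode workers) would give 24 slots at cap=4 and 96 at cap=16.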


@@ -0,0 +1,89 @@
#!/bin/bash
# P0: Re-run v5 baseline EXP2 (2P6D) three times to establish whether
# errors=9 is a stable property of the v5 config or single-run variance.
# The critic reviewing V5_PROFILE_INVESTIGATION_ZH.md flagged that the 415 errors in
# v5+profile EXP2 may have been polling-induced. We need 3 baseline runs
# (no polling, identical config to original v5) to test reproducibility.
#
# Output:
# outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/
# ├── exp2_2p6d_run{1,2,3}_summary.json
# ├── exp2_2p6d_run{1,2,3}_metrics.jsonl
# └── kvcache-centric-...<ts>/ (one per run)
set -euo pipefail
cd "$(dirname "$0")/.."
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
TRACE=outputs/qwen35-swebench-50sess.jsonl
OUTPUT=outputs/qwen3-30b-tp1-v5-optD-baseline-rerun
VENV_PYTHON=.venv/bin/python
RESULTS_FILE=$OUTPUT/sweep_results.txt
mkdir -p $OUTPUT
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
}
run_exp2() {
local run_idx=$1
local label="exp2_2p6d_run${run_idx}"
log ""
log "=== [RUN ${run_idx}/3] EXP2 2P6D KVC kv-aware Option D (no polling) ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy kv-aware \
--model-path $MODEL \
--prefill-workers 2 --decode-workers 6 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction
local run_dir=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
log "=== [RUN ${run_idx}/3] $label COMPLETED ==="
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
local errs=$($VENV_PYTHON -c "import json; d=json.load(open('$OUTPUT/${label}_summary.json')); print(d.get('error_count',0))")
log " errors = $errs (baseline reference = 9)"
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
echo "" >> $RESULTS_FILE
else
log "WARNING: no summary file in $run_dir"
fi
}
log "=== P0: v5 baseline EXP2 reproducibility test (3 runs, no polling) ==="
log "Model: $MODEL"
log "Trace: $TRACE (4449 requests, 52 sessions)"
log "Goal: confirm whether errors=9 in v5 baseline EXP2 is reproducible"
log " (v5+profile saw 415 errors; we need to know if polling was causal)"
for i in 1 2 3; do
run_exp2 $i
done
log ""
log "=== P0 SUMMARY: errors per run ==="
for i in 1 2 3; do
if [ -f "$OUTPUT/exp2_2p6d_run${i}_summary.json" ]; then
e=$($VENV_PYTHON -c "import json; d=json.load(open('$OUTPUT/exp2_2p6d_run${i}_summary.json')); print(d.get('error_count',0))")
log " run ${i}: errors = $e"
fi
done
log "=== P0 ALL DONE ==="
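The per-run error summary at the end of this script shells out to `python -c` once per file. A minimal standalone sketch of the same lookup is shown here. It assumes only what the script shows: summary files named `exp2_2p6d_run{N}_summary.json` and an optional `error_count` key. `error_counts` is a hypothetical helper, and the demo values are synthetic, not experiment results.

```python
import json
import tempfile
from pathlib import Path


def error_counts(output_dir: Path, runs: tuple[int, ...] = (1, 2, 3)) -> dict[int, int]:
    """Map run index -> error_count for every summary file that exists."""
    counts: dict[int, int] = {}
    for run_idx in runs:
        summary_path = output_dir / f"exp2_2p6d_run{run_idx}_summary.json"
        if not summary_path.is_file():
            continue  # same as the shell loop: silently skip missing runs
        with summary_path.open(encoding="utf-8") as fh:
            summary = json.load(fh)
        # A missing key counts as zero errors, like d.get('error_count', 0).
        counts[run_idx] = summary.get("error_count", 0)
    return counts


# Synthetic demo data: run 2 is missing, run 3 has no error_count key.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    (out / "exp2_2p6d_run1_summary.json").write_text(json.dumps({"error_count": 9}))
    (out / "exp2_2p6d_run3_summary.json").write_text(json.dumps({"request_count": 10}))
    demo_counts = error_counts(out)

for run_idx, errs in sorted(demo_counts.items()):
    print(f"run {run_idx}: errors = {errs}")
```

Reading the path through `Path` rather than interpolating it into a `python -c` string also avoids quoting problems if the output directory name ever contains a quote character.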

scripts/sweep_tp1_v5_optD.sh Executable file

@@ -0,0 +1,114 @@
#!/bin/bash
# TP1 v5 sweep — Option D: D-side admission for seed/reseed.
#
# v4 (cap=16) still saw 35% session-cap fallback because the local soft_cap
# evaluates min(16, usable_capacity_tokens / target_tokens) and target_tokens
# (= input + output) is 50-100K in agentic workloads, giving cap = 1-2.
#
# v5 makes worker admission_mode authoritative for ALL admission decisions
# (direct_append AND seed/reseed). Replay calls D's
# /session_cache/admit_direct_append with mode={direct_append|seed} and
# defers to D's KV pool availability + LRU eviction. Replay's local
# _decode_session_soft_cap is bypassed entirely under worker mode.
set -euo pipefail
cd "$(dirname "$0")/.."
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
TRACE=outputs/qwen35-swebench-50sess.jsonl
OUTPUT=outputs/qwen3-30b-tp1-v5-optD
VENV_PYTHON=.venv/bin/python
RESULTS_FILE=$OUTPUT/sweep_results.txt
mkdir -p $OUTPUT
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
}
save_result() {
local label=$1
local run_dir=$2
log "=== $label COMPLETED ==="
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
log "Summary:"
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
echo "" >> $RESULTS_FILE
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
log "Saved to $OUTPUT/${label}_summary.json + ${label}_metrics.jsonl"
else
log "WARNING: No summary file found in $run_dir"
fi
}
log "Starting TP1 v5 sweep (Option D: D-side seed admission)"
log "Model: $MODEL"
log "Trace: $TRACE (4449 requests, 52 sessions)"
log "Key change: worker admission_mode now drives seed/reseed via D's admit endpoint"
########################################
# Experiment 1: 1P + 7D KVC kv-aware Option D
########################################
log ""
log "=== [EXP1] 1P7D KVC kv-aware Option D ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy kv-aware \
--model-path $MODEL \
--prefill-workers 1 --decode-workers 7 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction
EXP1_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
save_result "exp1_1p7d_kvc_optD" "$EXP1_DIR"
########################################
# Experiment 2: 2P + 6D KVC kv-aware Option D
########################################
log ""
log "=== [EXP2] 2P6D KVC kv-aware Option D ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy kv-aware \
--model-path $MODEL \
--prefill-workers 2 --decode-workers 6 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction
EXP2_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
save_result "exp2_2p6d_kvc_optD" "$EXP2_DIR"
log ""
log "=== ALL TP1 V5 SWEEP EXPERIMENTS DONE ==="
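The v5 header comment explains why raising the hard cap to 16 did not help: the token-based term dominates. The sketch below works through that formula. It assumes integer (floor) division and a hypothetical usable capacity of 100,000 tokens per decode worker; neither value is confirmed by the script, and `local_soft_cap` is an illustrative stand-in for the `_decode_session_soft_cap` logic the comment describes.

```python
def local_soft_cap(usable_capacity_tokens: int, target_tokens: int, hard_cap: int = 16) -> int:
    """min(hard_cap, usable_capacity_tokens / target_tokens), with floor division assumed."""
    return min(hard_cap, usable_capacity_tokens // target_tokens)


USABLE_CAPACITY_TOKENS = 100_000  # hypothetical pool size, for illustration only

# target_tokens = input + output; the comment quotes 50-100K for agentic workloads.
for target_tokens in (50_000, 100_000):
    cap = local_soft_cap(USABLE_CAPACITY_TOKENS, target_tokens)
    print(f"target_tokens={target_tokens}: effective cap = {cap}")
```

With these assumed numbers the effective cap is 2 and 1, matching the "cap = 1-2" range in the comment; the hard cap of 16 only takes effect when `target_tokens` is small relative to pool capacity.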


@@ -0,0 +1,125 @@
#!/bin/bash
# TP1 v5 + profiling — re-run the v5 (Option D) config with the new
# d-pool-timeseries poller enabled, so we can attribute each session-cap
# fallback to actual D KV pool occupancy (held vs available vs idle-evictable
# vs prefill-backup) instead of guessing.
#
# Output:
# outputs/qwen3-30b-tp1-v5-optD-profile/
# ├── kvcache-centric-kv-aware-worker-admission-<ts>/
# │ ├── request-metrics.jsonl
# │ ├── request-metrics.jsonl.summary.json
# │ └── d-pool-timeseries.jsonl ← NEW (1Hz P/D /server_info snapshots)
# ├── exp1_1p7d_kvc_optD_profile_metrics.jsonl
# └── exp2_2p6d_kvc_optD_profile_metrics.jsonl
set -euo pipefail
cd "$(dirname "$0")/.."
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
TRACE=outputs/qwen35-swebench-50sess.jsonl
OUTPUT=outputs/qwen3-30b-tp1-v5-optD-profile
VENV_PYTHON=.venv/bin/python
RESULTS_FILE=$OUTPUT/sweep_results.txt
POLL_INTERVAL=1.0
mkdir -p $OUTPUT
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
}
save_result() {
local label=$1
local run_dir=$2
log "=== $label COMPLETED ==="
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
log "Summary:"
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
echo "" >> $RESULTS_FILE
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
if [ -f "$run_dir/d-pool-timeseries.jsonl" ]; then
cp "$run_dir/d-pool-timeseries.jsonl" "$OUTPUT/${label}_pool_timeseries.jsonl"
log "Pool timeseries: $(wc -l < $OUTPUT/${label}_pool_timeseries.jsonl) rows"
else
log "WARNING: no d-pool-timeseries.jsonl produced"
fi
log "Saved to $OUTPUT/${label}_summary.json + ${label}_metrics.jsonl + ${label}_pool_timeseries.jsonl"
else
log "WARNING: No summary file found in $run_dir"
fi
}
log "Starting TP1 v5 + profile sweep (Option D + ${POLL_INTERVAL}s pool polling)"
log "Model: $MODEL"
log "Trace: $TRACE (4449 requests, 52 sessions)"
log "Profiling: --pool-poll-interval-s $POLL_INTERVAL (writes d-pool-timeseries.jsonl)"
########################################
# Experiment 1: 1P + 7D KVC kv-aware Option D + profile
########################################
log ""
log "=== [EXP1] 1P7D KVC kv-aware Option D + profile ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy kv-aware \
--model-path $MODEL \
--prefill-workers 1 --decode-workers 7 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction \
--pool-poll-interval-s $POLL_INTERVAL
EXP1_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
save_result "exp1_1p7d_kvc_optD_profile" "$EXP1_DIR"
########################################
# Experiment 2: 2P + 6D KVC kv-aware Option D + profile
########################################
log ""
log "=== [EXP2] 2P6D KVC kv-aware Option D + profile ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy kv-aware \
--model-path $MODEL \
--prefill-workers 2 --decode-workers 6 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction \
--pool-poll-interval-s $POLL_INTERVAL
EXP2_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
save_result "exp2_2p6d_kvc_optD_profile" "$EXP2_DIR"
log ""
log "=== ALL TP1 V5+PROFILE EXPERIMENTS DONE ==="


@@ -0,0 +1,129 @@
#!/bin/bash
# v6 P1: re-run the v5 (Option D) config with the pool_breakdown instrument
# (commit 4978c0d) so d-pool-timeseries.jsonl carries radix_protected /
# slot_private / running_batch / {transfer,prealloc,retracted}_queue tokens.
#
# This is the same config as scripts/sweep_tp1_v5_optD_profile.sh but writes
# to a separate output dir, leaving the pre-instrument v5+profile run intact
# for before/after comparison.
#
# Output:
# outputs/qwen3-30b-tp1-v6-p1-profile/
# ├── kvcache-centric-kv-aware-worker-admission-<ts>/
# │ ├── request-metrics.jsonl
# │ ├── request-metrics.jsonl.summary.json
# │ └── d-pool-timeseries.jsonl ← now with pool_breakdown fields
# ├── exp{1,2}_*_metrics.jsonl
# └── exp{1,2}_*_pool_timeseries.jsonl
set -euo pipefail
cd "$(dirname "$0")/.."
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
TRACE=outputs/qwen35-swebench-50sess.jsonl
OUTPUT=outputs/qwen3-30b-tp1-v6-p1-profile
VENV_PYTHON=.venv/bin/python
RESULTS_FILE=$OUTPUT/sweep_results.txt
POLL_INTERVAL=1.0
mkdir -p $OUTPUT
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
}
save_result() {
local label=$1
local run_dir=$2
log "=== $label COMPLETED ==="
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
log "Summary:"
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
echo "" >> $RESULTS_FILE
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
if [ -f "$run_dir/d-pool-timeseries.jsonl" ]; then
cp "$run_dir/d-pool-timeseries.jsonl" "$OUTPUT/${label}_pool_timeseries.jsonl"
log "Pool timeseries: $(wc -l < $OUTPUT/${label}_pool_timeseries.jsonl) rows"
else
log "WARNING: no d-pool-timeseries.jsonl produced"
fi
log "Saved to $OUTPUT/${label}_summary.json + ${label}_metrics.jsonl + ${label}_pool_timeseries.jsonl"
else
log "WARNING: No summary file found in $run_dir"
fi
}
log "Starting v6 P1 sweep (v5 Option D config + ${POLL_INTERVAL}s pool polling + pool_breakdown)"
log "Model: $MODEL"
log "Trace: $TRACE (4449 requests, 52 sessions)"
log "Goal: capture pool_breakdown fields (radix_protected / slot_private / running_batch / queues)"
log " to decompose 'other' on the v5 baseline workload"
########################################
# Experiment 1: 1P + 7D KVC kv-aware Option D + profile
########################################
log ""
log "=== [EXP1] 1P7D KVC kv-aware Option D + profile ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy kv-aware \
--model-path $MODEL \
--prefill-workers 1 --decode-workers 7 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction \
--pool-poll-interval-s $POLL_INTERVAL
EXP1_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
save_result "exp1_1p7d_kvc_v6_p1" "$EXP1_DIR"
########################################
# Experiment 2: 2P + 6D KVC kv-aware Option D + profile
########################################
log ""
log "=== [EXP2] 2P6D KVC kv-aware Option D + profile ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy kv-aware \
--model-path $MODEL \
--prefill-workers 2 --decode-workers 6 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction \
--pool-poll-interval-s $POLL_INTERVAL
EXP2_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
save_result "exp2_2p6d_kvc_v6_p1" "$EXP2_DIR"
log ""
log "=== ALL v6 P1 EXPERIMENTS DONE ==="


@@ -3,13 +3,20 @@ from __future__ import annotations
import asyncio
import json
import signal
import shutil
from collections import Counter
from dataclasses import asdict, dataclass, replace
from datetime import UTC, datetime
from pathlib import Path
from agentic_pd_hybrid.replay import ReplayConfig, replay_trace
from agentic_pd_hybrid.sampling import SessionSampleConfig, sample_trace_sessions
from agentic_pd_hybrid.sampling import (
SessionSampleConfig,
SessionSampleSummary,
sample_trace_sessions,
)
from agentic_pd_hybrid.stack import ManagedPdStack, launch_pd_stack
from agentic_pd_hybrid.trace import load_trace
from agentic_pd_hybrid.topology import SingleNodeTopology
@@ -43,12 +50,18 @@ class BenchmarkConfig:
kvcache_prefill_priority_eviction: bool = False
kvcache_prefill_direct_priority: int = -100
kvcache_prefill_normal_priority: int = 100
pool_poll_interval_s: float = 0.0
pool_poll_include_sessions: bool = True
enable_backpressure: bool = False
backpressure_max_pause_s: float = 2.0
progress_interval_s: float = 30.0
sample_profile: str = "default"
min_initial_input_tokens: int | None = None
max_initial_input_tokens: int | None = None
max_append_input_tokens: int | None = None
max_output_tokens: int | None = None
min_overlap_ratio: float | None = None
use_trace_as_sample: bool = False
launch_stack: bool = True
@@ -90,22 +103,37 @@ def run_live_benchmark(config: BenchmarkConfig) -> BenchmarkArtifacts:
)
sampled_trace_path = run_dir / "sampled-trace.jsonl"
sample_summary = sample_trace_sessions(
SessionSampleConfig(
trace_path=config.trace_path,
output_path=sampled_trace_path,
target_duration_s=config.target_duration_s,
start_time_s=config.start_time_s,
if config.use_trace_as_sample:
shutil.copyfile(config.trace_path, sampled_trace_path)
sample_summary = _summarize_trace_sample(
input_trace_path=config.trace_path,
sampled_trace_path=sampled_trace_path,
profile=config.sample_profile,
session_sample_rate=config.session_sample_rate,
min_turns=config.min_turns,
profile=config.sample_profile, # type: ignore[arg-type]
min_initial_input_tokens=config.min_initial_input_tokens,
max_initial_input_tokens=config.max_initial_input_tokens,
max_append_input_tokens=config.max_append_input_tokens,
max_output_tokens=config.max_output_tokens,
min_overlap_ratio=config.min_overlap_ratio,
)
)
else:
sample_summary = sample_trace_sessions(
SessionSampleConfig(
trace_path=config.trace_path,
output_path=sampled_trace_path,
target_duration_s=config.target_duration_s,
start_time_s=config.start_time_s,
session_sample_rate=config.session_sample_rate,
min_turns=config.min_turns,
profile=config.sample_profile, # type: ignore[arg-type]
min_initial_input_tokens=config.min_initial_input_tokens,
max_initial_input_tokens=config.max_initial_input_tokens,
max_append_input_tokens=config.max_append_input_tokens,
max_output_tokens=config.max_output_tokens,
min_overlap_ratio=config.min_overlap_ratio,
)
)
stack: ManagedPdStack | None = None
previous_sigint = signal.getsignal(signal.SIGINT)
@@ -119,6 +147,8 @@ def run_live_benchmark(config: BenchmarkConfig) -> BenchmarkArtifacts:
try:
signal.signal(signal.SIGINT, _handle_termination)
signal.signal(signal.SIGTERM, _handle_termination)
_mechanisms_with_router = {"pd-disaggregation", "kvcache-centric", "pd-colo"}
_naive_dp = config.mechanism_name == "pd-colo"
if config.launch_stack:
stack = launch_pd_stack(
topology=topology,
@@ -132,18 +162,19 @@ def run_live_benchmark(config: BenchmarkConfig) -> BenchmarkArtifacts:
else config.timeout_s
),
include_router=(
config.mechanism_name in {"pd-disaggregation", "kvcache-centric"}
config.mechanism_name in _mechanisms_with_router
),
naive_dp=_naive_dp,
)
router_url = (
stack.router_url
if config.mechanism_name in {"pd-disaggregation", "kvcache-centric"}
if config.mechanism_name in _mechanisms_with_router
else None
)
else:
router_url = (
topology.router_url
if config.mechanism_name in {"pd-disaggregation", "kvcache-centric"}
if config.mechanism_name in _mechanisms_with_router
else None
)
@@ -187,6 +218,11 @@ def run_live_benchmark(config: BenchmarkConfig) -> BenchmarkArtifacts:
),
kvcache_prefill_direct_priority=config.kvcache_prefill_direct_priority,
kvcache_prefill_normal_priority=config.kvcache_prefill_normal_priority,
pool_poll_interval_s=config.pool_poll_interval_s,
pool_poll_include_sessions=config.pool_poll_include_sessions,
enable_backpressure=config.enable_backpressure,
backpressure_max_pause_s=config.backpressure_max_pause_s,
progress_interval_s=config.progress_interval_s,
)
if config.request_timeout_s is not None:
replay_config = replace(
@@ -243,12 +279,18 @@ def run_live_benchmark(config: BenchmarkConfig) -> BenchmarkArtifacts:
"kvcache_prefill_normal_priority": (
config.kvcache_prefill_normal_priority
),
"pool_poll_interval_s": config.pool_poll_interval_s,
"pool_poll_include_sessions": config.pool_poll_include_sessions,
"enable_backpressure": config.enable_backpressure,
"backpressure_max_pause_s": config.backpressure_max_pause_s,
"progress_interval_s": config.progress_interval_s,
"sample_profile": config.sample_profile,
"min_initial_input_tokens": config.min_initial_input_tokens,
"max_initial_input_tokens": config.max_initial_input_tokens,
"max_append_input_tokens": config.max_append_input_tokens,
"max_output_tokens": config.max_output_tokens,
"min_overlap_ratio": config.min_overlap_ratio,
"use_trace_as_sample": config.use_trace_as_sample,
"sample_summary": asdict(sample_summary),
"topology": {
"model_path": config.topology.model_path,
@@ -295,3 +337,44 @@ def _header_mode_for(policy_name: str) -> str:
    if policy_name == "kv-aware":
        return "target-worker"
    return "none"


def _summarize_trace_sample(
    *,
    input_trace_path: Path,
    sampled_trace_path: Path,
    profile: str,
    session_sample_rate: float,
    min_turns: int,
    min_initial_input_tokens: int | None,
    max_initial_input_tokens: int | None,
    max_append_input_tokens: int | None,
    max_output_tokens: int | None,
    min_overlap_ratio: float | None,
) -> SessionSampleSummary:
    requests = load_trace(sampled_trace_path)
    if not requests:
        raise ValueError(f"Trace sample is empty: {sampled_trace_path}")
    session_turns = Counter(request.session_id for request in requests)
    start_time_s = requests[0].timestamp_s
    end_time_s = requests[-1].timestamp_s
    return SessionSampleSummary(
        input_trace_path=str(input_trace_path),
        output_trace_path=str(sampled_trace_path),
        request_count=len(requests),
        session_count=len(session_turns),
        multi_turn_session_count=sum(1 for turns in session_turns.values() if turns > 1),
        start_time_s=start_time_s,
        end_time_s=end_time_s,
        sampled_duration_s=end_time_s - start_time_s,
        session_sample_rate=session_sample_rate,
        min_turns=min_turns,
        profile=profile,
        min_initial_input_tokens=min_initial_input_tokens,
        max_initial_input_tokens=max_initial_input_tokens,
        max_append_input_tokens=max_append_input_tokens,
        max_output_tokens=max_output_tokens,
        min_overlap_ratio=min_overlap_ratio,
        mean_append_input_tokens=None,
        mean_turn_overlap_ratio=None,
    )
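The counting logic in `_summarize_trace_sample` can be exercised in isolation with a small sketch. `TraceRequest` and `summarize_requests` are hypothetical stand-ins for whatever `load_trace` returns and are not part of the repository. One deliberate difference: the function above takes the first and last request timestamps, which is only correct if the trace is time-ordered (the diff does not show whether `load_trace` sorts), so this sketch uses `min`/`max` instead.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceRequest:
    """Hypothetical stand-in for one trace record."""
    session_id: str
    timestamp_s: float


def summarize_requests(requests: list[TraceRequest]) -> dict[str, float]:
    """Session/turn counts and time span for a list of trace requests."""
    if not requests:
        raise ValueError("Trace sample is empty")
    session_turns = Counter(r.session_id for r in requests)
    timestamps = [r.timestamp_s for r in requests]
    # min/max is order-independent; requests[0]/requests[-1] would need a sorted trace.
    start_time_s, end_time_s = min(timestamps), max(timestamps)
    return {
        "request_count": len(requests),
        "session_count": len(session_turns),
        "multi_turn_session_count": sum(1 for turns in session_turns.values() if turns > 1),
        "sampled_duration_s": end_time_s - start_time_s,
    }


# Deliberately unsorted synthetic input: session "a" has two turns, "b" has one.
demo_summary = summarize_requests(
    [TraceRequest("a", 4.0), TraceRequest("b", 1.5), TraceRequest("a", 0.0)]
)
print(demo_summary)
```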


@@ -2,6 +2,7 @@ from __future__ import annotations
import argparse
import asyncio
import shlex
from pathlib import Path
from agentic_pd_hybrid.benchmark import BenchmarkConfig, run_live_benchmark
@@ -228,6 +229,47 @@ def main() -> None:
)
replay.add_argument("--kvcache-prefill-direct-priority", type=int, default=-100)
replay.add_argument("--kvcache-prefill-normal-priority", type=int, default=100)
replay.add_argument(
"--pool-poll-interval-s",
type=float,
default=0.0,
help=(
"Poll each P/D worker's /server_info every N seconds and write a "
"time-series snapshot to <run_dir>/d-pool-timeseries.jsonl. "
"0 disables polling."
),
)
replay.add_argument(
"--pool-poll-no-sessions",
action="store_true",
help=(
"Disable per-session detail in the pool timeseries (smaller files)."
),
)
replay.add_argument(
"--enable-backpressure",
action="store_true",
help=(
"Honor recommended_pause_ms hints from D's admission endpoint. "
"When set, replay sleeps before issuing requests to a saturated D. "
"Default off — preserves baseline behavior."
),
)
replay.add_argument(
"--backpressure-max-pause-s",
type=float,
default=2.0,
help="Cap on per-request backpressure sleep, regardless of D hint.",
)
replay.add_argument(
"--progress-interval-s",
type=float,
default=30.0,
help=(
"Write client-side replay progress to <output_dir>/replay-progress.jsonl "
"every N seconds. 0 disables the heartbeat."
),
)
sample = subparsers.add_parser(
"sample-sessions",
@@ -439,6 +481,45 @@ def main() -> None:
)
benchmark.add_argument("--kvcache-prefill-direct-priority", type=int, default=-100)
benchmark.add_argument("--kvcache-prefill-normal-priority", type=int, default=100)
benchmark.add_argument(
"--pool-poll-interval-s",
type=float,
default=0.0,
help=(
"Poll each P/D worker's /server_info every N seconds and write a "
"time-series snapshot to <run_dir>/d-pool-timeseries.jsonl. "
"0 disables polling."
),
)
benchmark.add_argument(
"--pool-poll-no-sessions",
action="store_true",
help=(
"Disable per-session detail in the pool timeseries (smaller files)."
),
)
benchmark.add_argument(
"--enable-backpressure",
action="store_true",
help=(
"Honor recommended_pause_ms hints from D's admission endpoint."
),
)
benchmark.add_argument(
"--backpressure-max-pause-s",
type=float,
default=2.0,
help="Cap on per-request backpressure sleep, regardless of D hint.",
)
benchmark.add_argument(
"--progress-interval-s",
type=float,
default=30.0,
help=(
"Write client-side replay progress to <run_dir>/replay-progress.jsonl "
"every N seconds. 0 disables the heartbeat."
),
)
benchmark.add_argument(
"--sample-profile",
choices=["default", "small-append"],
@@ -450,16 +531,31 @@ def main() -> None:
benchmark.add_argument("--max-append-input-tokens", type=int, default=None)
benchmark.add_argument("--max-output-tokens", type=int, default=None)
benchmark.add_argument("--min-overlap-ratio", type=float, default=None)
benchmark.add_argument(
"--use-trace-as-sample",
action="store_true",
help=(
"Replay the provided --trace exactly instead of sampling sessions into "
"a new trace. Use this for prebuilt real-workload samples."
),
)
args = parser.parse_args()
if args.command == "print-launch":
topology = _topology_from_args(args)
has_pd = bool(topology.prefill_workers and topology.decode_workers)
has_direct_only = bool(
topology.direct_workers
and not topology.prefill_workers
and not topology.decode_workers
)
plan = build_launch_plan(
topology,
prefill_policy=args.prefill_policy,
decode_policy=args.decode_policy,
include_router=bool(topology.prefill_workers and topology.decode_workers),
include_router=has_pd or has_direct_only,
naive_dp=has_direct_only,
)
print(plan.render())
return
@@ -513,6 +609,11 @@ def main() -> None:
),
kvcache_prefill_direct_priority=args.kvcache_prefill_direct_priority,
kvcache_prefill_normal_priority=args.kvcache_prefill_normal_priority,
pool_poll_interval_s=args.pool_poll_interval_s,
pool_poll_include_sessions=not args.pool_poll_no_sessions,
enable_backpressure=args.enable_backpressure,
backpressure_max_pause_s=args.backpressure_max_pause_s,
progress_interval_s=args.progress_interval_s,
)
results = asyncio.run(replay_trace(config))
print(
@@ -655,12 +756,18 @@ def main() -> None:
kvcache_prefill_normal_priority=(
args.kvcache_prefill_normal_priority
),
pool_poll_interval_s=args.pool_poll_interval_s,
pool_poll_include_sessions=not args.pool_poll_no_sessions,
enable_backpressure=args.enable_backpressure,
backpressure_max_pause_s=args.backpressure_max_pause_s,
progress_interval_s=args.progress_interval_s,
sample_profile=args.sample_profile,
min_initial_input_tokens=args.min_initial_input_tokens,
max_initial_input_tokens=args.max_initial_input_tokens,
max_append_input_tokens=args.max_append_input_tokens,
max_output_tokens=args.max_output_tokens,
min_overlap_ratio=args.min_overlap_ratio,
use_trace_as_sample=args.use_trace_as_sample,
launch_stack=True,
)
)
@@ -720,6 +827,26 @@ def _add_topology_arguments(parser: argparse.ArgumentParser) -> None:
"--no-trust-remote-code",
action="store_true",
)
parser.add_argument(
"--extra-server-args",
default="",
help="Extra arguments appended to every sglang.launch_server command.",
)
parser.add_argument(
"--prefill-extra-server-args",
default="",
help="Extra arguments appended only to prefill launch_server commands.",
)
parser.add_argument(
"--decode-extra-server-args",
default="",
help="Extra arguments appended only to decode launch_server commands.",
)
parser.add_argument(
"--direct-extra-server-args",
default="",
help="Extra arguments appended only to direct launch_server commands.",
)
def _topology_from_args(args: argparse.Namespace):
@@ -749,7 +876,13 @@ def _topology_from_args(args: argparse.Namespace):
force_rdma=args.force_rdma,
trust_remote_code=not args.no_trust_remote_code,
ib_device=args.ib_device,
direct_extra_server_args=("--enable-streaming-session",),
extra_server_args=tuple(shlex.split(args.extra_server_args)),
prefill_extra_server_args=tuple(shlex.split(args.prefill_extra_server_args)),
decode_extra_server_args=tuple(shlex.split(args.decode_extra_server_args)),
direct_extra_server_args=(
"--enable-streaming-session",
*tuple(shlex.split(args.direct_extra_server_args)),
),
)
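The `--*-extra-server-args` options in this file are tokenized with `shlex.split`, so quoting follows POSIX shell rules, and the direct role always carries `--enable-streaming-session` ahead of any user-supplied tokens. A minimal sketch of that parsing follows; the helper names `split_extra_args` and `direct_extra_args` and the sample flag strings are illustrative assumptions, not names from the change.

```python
import shlex


def split_extra_args(raw: str) -> tuple[str, ...]:
    """Tokenize a CLI string the way the topology builder does."""
    return tuple(shlex.split(raw))


def direct_extra_args(raw: str) -> tuple[str, ...]:
    """Direct workers always get the streaming-session flag first."""
    return ("--enable-streaming-session", *split_extra_args(raw))


if __name__ == "__main__":
    # An empty default yields an empty tuple, so nothing is appended.
    print(split_extra_args(""))
    # Quoted values survive as a single token (sample flags are hypothetical).
    print(split_extra_args("--note 'a b' --level 2"))
    print(direct_extra_args("--level 2"))
```

Because the split happens once at argument-parsing time, a value containing spaces has to be quoted inside the single string passed to the option.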

@@ -34,7 +34,24 @@ def build_launch_plan(
decode_policy: str = "manual",
include_router: bool = True,
router_request_timeout_s: float | None = None,
naive_dp: bool = False,
) -> LaunchPlan:
router_command: tuple[str, ...] | None = None
if include_router:
if topology.prefill_workers and topology.decode_workers:
router_command = _build_router_command(
topology,
prefill_policy=prefill_policy,
decode_policy=decode_policy,
request_timeout_s=router_request_timeout_s,
)
elif naive_dp and topology.direct_workers:
router_command = _build_dp_router_command(
topology,
backend_policy=decode_policy,
request_timeout_s=router_request_timeout_s,
)
return LaunchPlan(
prefill_commands=tuple(
_build_server_command(topology, worker) for worker in topology.prefill_workers
@@ -43,24 +60,17 @@ def build_launch_plan(
_build_server_command(topology, worker) for worker in topology.decode_workers
),
direct_commands=tuple(
_build_server_command(topology, worker) for worker in topology.direct_workers
),
router_command=(
_build_router_command(
topology,
prefill_policy=prefill_policy,
decode_policy=decode_policy,
request_timeout_s=router_request_timeout_s,
)
if include_router and topology.prefill_workers and topology.decode_workers
else None
_build_server_command(topology, worker, naive_dp=naive_dp)
for worker in topology.direct_workers
),
router_command=router_command,
)
def _build_server_command(
topology: SingleNodeTopology,
worker: WorkerSpec,
naive_dp: bool = False,
) -> tuple[str, ...]:
command = [
sys.executable,
@@ -76,11 +86,15 @@ def _build_server_command(
str(worker.port),
"--base-gpu-id",
str(worker.gpu_id),
"--disaggregation-mode",
_disaggregation_mode_for(worker),
"--disaggregation-transfer-backend",
topology.transfer_backend,
]
# Naive DP direct workers: no disaggregation flags at all
if not (naive_dp and worker.role == "direct"):
command.extend([
"--disaggregation-mode",
_disaggregation_mode_for(worker),
"--disaggregation-transfer-backend",
topology.transfer_backend,
])
if worker.tp_size > 1:
command.extend(["--tp-size", str(worker.tp_size)])
if topology.trust_remote_code:
@@ -135,6 +149,32 @@ def _build_router_command(
return tuple(command)
def _build_dp_router_command(
topology: SingleNodeTopology,
*,
backend_policy: str,
request_timeout_s: float | None,
) -> tuple[str, ...]:
command: list[str] = [
sys.executable,
"-B",
"-u",
"-m",
"agentic_pd_hybrid.pd_router",
"--host",
topology.router_host,
"--port",
str(topology.router_port),
"--backend-policy",
backend_policy,
]
if request_timeout_s is not None:
command.extend(["--request-timeout-s", str(request_timeout_s)])
for worker in topology.direct_workers:
command.extend(["--backend", worker.url])
return tuple(command)
def _render_named_command(name: str, command: tuple[str, ...]) -> str:
return f"# {name}\n" + " ".join(shlex.quote(part) for part in command)

@@ -43,6 +43,9 @@ class RequestMetrics:
ttft_s: float | None
tpot_s: float | None
error: str | None = None
actual_output_tokens: int | None = None
requested_output_tokens: int | None = None
finish_reason: str | None = None
@classmethod
def from_decision(
@@ -63,6 +66,9 @@ class RequestMetrics:
prefill_request_priority: int | None = None,
decode_request_priority: int | None = None,
error: str | None = None,
actual_output_tokens: int | None = None,
requested_output_tokens: int | None = None,
finish_reason: str | None = None,
) -> "RequestMetrics":
return cls(
request_id=request.request_id,
@@ -95,6 +101,9 @@ class RequestMetrics:
ttft_s=ttft_s,
tpot_s=tpot_s,
error=error,
actual_output_tokens=actual_output_tokens,
requested_output_tokens=requested_output_tokens,
finish_reason=finish_reason,
)
@@ -158,6 +167,17 @@ def write_summary_json(
str(key): value for key, value in sorted(decode_priorities.items())
},
"error_count": sum(1 for row in rows if row.error is not None),
"truncated_request_count": sum(
1
for row in rows
if row.actual_output_tokens is not None
and row.requested_output_tokens is not None
and row.requested_output_tokens > 1
and row.actual_output_tokens < row.requested_output_tokens * 0.5
),
"actual_output_tokens_stats": _stats(
[float(row.actual_output_tokens) for row in rows if row.actual_output_tokens is not None]
),
}
path.parent.mkdir(parents=True, exist_ok=True)
with path.open("w", encoding="utf-8") as handle:
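The `truncated_request_count` rule in the hunk above counts a request as truncated only when both token counts are known, more than one output token was requested, and fewer than half of the requested tokens came back. A small sketch of that predicate, with the hypothetical helper name `is_truncated`:

```python
from typing import Optional


def is_truncated(
    actual_output_tokens: Optional[int],
    requested_output_tokens: Optional[int],
) -> bool:
    """Mirror of the summary rule: under 50% of the requested output."""
    if actual_output_tokens is None or requested_output_tokens is None:
        return False  # unknown counts are never flagged
    if requested_output_tokens <= 1:
        return False  # single-token requests are excluded
    return actual_output_tokens < requested_output_tokens * 0.5


if __name__ == "__main__":
    for actual, requested in [(10, 100), (50, 100), (0, 1), (None, 100)]:
        print(actual, requested, is_truncated(actual, requested))
```

Note that exactly half (50 of 100) is not counted, because the comparison is strict.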

@@ -74,8 +74,58 @@ class RouterState:
return idx
@dataclass
class DpRouterConfig:
host: str
port: int
backend_urls: list[str]
backend_policy: str = "round_robin"
request_timeout_s: float = 1800.0
class DpRouterState:
"""DP (data-parallel) router: forward each request to exactly one backend."""
def __init__(self, config: DpRouterConfig):
if not config.backend_urls:
raise ValueError("At least one backend worker is required")
self.config = config
self.cursor = 0
self.sticky_map: dict[str, int] = {}
def select_backend(self, headers: dict[str, str]) -> str:
idx = self._select_index(headers)
return self.config.backend_urls[idx]
def _select_index(self, headers: dict[str, str]) -> int:
target_worker = headers.get("x-smg-target-worker")
routing_key = headers.get("x-smg-routing-key")
if (
self.config.backend_policy == "consistent_hashing"
and target_worker is not None
):
idx = int(target_worker)
if 0 <= idx < len(self.config.backend_urls):
return idx
if self.config.backend_policy == "manual" and routing_key:
cached = self.sticky_map.get(routing_key)
if cached is not None:
return cached
idx = self.cursor % len(self.config.backend_urls)
self.cursor += 1
self.sticky_map[routing_key] = idx
return idx
idx = self.cursor % len(self.config.backend_urls)
self.cursor += 1
return idx
app = FastAPI()
router_state: RouterState | None = None
dp_state: DpRouterState | None = None
@app.get("/health")
@@ -85,6 +135,16 @@ async def health() -> Response:
@app.get("/health_generate")
async def health_generate() -> Response:
if dp_state is not None:
async with aiohttp.ClientSession() as session:
tasks = [
session.get(f"{url}/health_generate")
for url in dp_state.config.backend_urls
]
for response in asyncio.as_completed(tasks):
async with await response:
pass
return Response(status_code=200)
state = _require_state()
async with aiohttp.ClientSession() as session:
tasks = []
@@ -101,6 +161,11 @@ async def health_generate() -> Response:
@app.get("/v1/models")
async def models() -> ORJSONResponse:
if dp_state is not None:
async with aiohttp.ClientSession() as session:
async with session.get(f"{dp_state.config.backend_urls[0]}/v1/models") as resp:
payload = await resp.json()
return ORJSONResponse(payload, status_code=resp.status)
state = _require_state()
async with aiohttp.ClientSession() as session:
async with session.get(f"{state.config.prefill_urls[0][0]}/v1/models") as response:
@@ -147,6 +212,15 @@ async def _forward_to_backend(
headers: dict[str, str],
endpoint_name: str,
) -> Response:
# DP mode: forward to a single backend
if dp_state is not None:
return await _forward_to_dp_backend(
request_data=request_data,
headers=headers,
endpoint_name=endpoint_name,
)
# PD mode: coordinate prefill + decode
state = _require_state()
prefill_server, bootstrap_port, decode_server = state.select_pair(headers)
prefill_request, decode_request = _build_backend_requests(
@@ -186,6 +260,63 @@ async def _forward_to_backend(
)
async def _forward_to_dp_backend(
*,
request_data: dict,
headers: dict[str, str],
endpoint_name: str,
) -> Response:
assert dp_state is not None
backend_server = dp_state.select_backend(headers)
cleaned = _strip_internal_fields(request_data)
timeout_s = dp_state.config.request_timeout_s
if request_data.get("stream", False):
return StreamingResponse(
_stream_dp_generate(
request_data=cleaned,
backend_server=backend_server,
endpoint_name=endpoint_name,
timeout_s=timeout_s,
),
media_type="text/event-stream",
)
async with aiohttp.ClientSession(
timeout=aiohttp.ClientTimeout(total=timeout_s)
) as session:
async with session.post(
f"{backend_server}/{endpoint_name}", json=cleaned
) as response:
body = await response.read()
return Response(
content=body,
status_code=response.status,
media_type=response.content_type,
)
async def _stream_dp_generate(
*,
request_data: dict,
backend_server: str,
endpoint_name: str,
timeout_s: float,
) -> AsyncIterator[bytes]:
async with aiohttp.ClientSession(
timeout=aiohttp.ClientTimeout(total=timeout_s)
) as session:
async with session.post(
f"{backend_server}/{endpoint_name}", json=request_data
) as response:
if response.status != HTTPStatus.OK:
payload = await response.read()
yield payload
return
async for chunk in response.content.iter_chunked(_STREAM_CHUNK_SIZE):
yield chunk
async def _stream_generate(
*,
prefill_request: dict,
@@ -241,6 +372,12 @@ def _build_backend_requests(
prefill_request.update(bootstrap_payload)
decode_request.update(bootstrap_payload)
# session_params is only meaningful for the decode worker (streaming session
# KV reuse). Sending it to the prefill worker causes the D side to
# short-circuit with local-prefill on already-open sessions, returning
# truncated responses while P's KV transfer gets aborted.
prefill_request.pop("session_params", None)
if prefill_priority is not None:
prefill_request["priority"] = int(prefill_priority)
if decode_priority is not None:
@@ -262,7 +399,7 @@ def _require_state() -> RouterState:
def main() -> None:
parser = argparse.ArgumentParser(description="Minimal local PD router")
parser = argparse.ArgumentParser(description="Minimal local PD / DP router")
parser.add_argument("--host", default="127.0.0.1")
parser.add_argument("--port", type=int, default=8000)
parser.add_argument(
@@ -270,30 +407,58 @@ def main() -> None:
nargs=2,
metavar=("URL", "BOOTSTRAP_PORT"),
action="append",
required=True,
default=None,
)
parser.add_argument(
"--decode",
action="append",
required=True,
default=None,
)
parser.add_argument("--prefill-policy", default="round_robin")
parser.add_argument("--decode-policy", default="manual")
parser.add_argument(
"--backend",
action="append",
default=None,
help="Backend URL for DP (data-parallel) mode. Repeat for each worker.",
)
parser.add_argument(
"--backend-policy",
default="round_robin",
help="Routing policy for DP mode: round_robin, manual, consistent_hashing.",
)
parser.add_argument("--request-timeout-s", type=float, default=1800.0)
args = parser.parse_args()
global router_state
router_state = RouterState(
RouterConfig(
host=args.host,
port=args.port,
prefill_urls=[(url, int(port)) for url, port in args.prefill],
decode_urls=list(args.decode),
prefill_policy=args.prefill_policy,
decode_policy=args.decode_policy,
request_timeout_s=args.request_timeout_s,
global router_state, dp_state
if args.backend:
# DP mode: simple forward to one of N backends
dp_state = DpRouterState(
DpRouterConfig(
host=args.host,
port=args.port,
backend_urls=list(args.backend),
backend_policy=args.backend_policy,
request_timeout_s=args.request_timeout_s,
)
)
)
elif args.prefill and args.decode:
# PD mode: prefill/decode coordination
router_state = RouterState(
RouterConfig(
host=args.host,
port=args.port,
prefill_urls=[(url, int(port)) for url, port in args.prefill],
decode_urls=list(args.decode),
prefill_policy=args.prefill_policy,
decode_policy=args.decode_policy,
request_timeout_s=args.request_timeout_s,
)
)
else:
parser.error("Either --backend (DP mode) or both --prefill and --decode (PD mode) are required")
uvicorn.run(app, host=args.host, port=args.port, log_level="info")
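The `manual` policy in `DpRouterState._select_index` pins each routing key to the backend it first landed on and otherwise falls back to round robin. The sketch below isolates that behavior in a standalone class; `StickyRoundRobin` and `pick` are hypothetical names used only for illustration, and the header parsing and `consistent_hashing` branch are left out.

```python
from typing import Optional


class StickyRoundRobin:
    """Round robin with a per-key pin, as in the DP router's manual policy."""

    def __init__(self, n_backends: int) -> None:
        if n_backends <= 0:
            raise ValueError("At least one backend worker is required")
        self.n_backends = n_backends
        self.cursor = 0
        self.sticky_map: dict[str, int] = {}

    def pick(self, routing_key: Optional[str]) -> int:
        if routing_key:
            cached = self.sticky_map.get(routing_key)
            if cached is not None:
                return cached  # key already pinned: reuse its backend
        idx = self.cursor % self.n_backends
        self.cursor += 1
        if routing_key:
            self.sticky_map[routing_key] = idx  # pin on first sight
        return idx


if __name__ == "__main__":
    router = StickyRoundRobin(3)
    print([router.pick(key) for key in ["a", "b", "a", None, "c"]])
```

One consequence visible in the sketch: the pin map only grows, so a long-running router accumulates one entry per distinct routing key.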

File diff suppressed because it is too large

@@ -66,6 +66,7 @@ def launch_pd_stack(
timeout_s: float = 1200.0,
router_request_timeout_s: float | None = None,
include_router: bool = True,
naive_dp: bool = False,
) -> ManagedPdStack:
run_dir.mkdir(parents=True, exist_ok=True)
logs_dir = run_dir / "logs"
@@ -77,6 +78,7 @@ def launch_pd_stack(
decode_policy=decode_policy,
include_router=include_router,
router_request_timeout_s=router_request_timeout_s,
naive_dp=naive_dp,
)
prefill_processes = [
@@ -195,6 +197,9 @@ def _build_process_env(topology: SingleNodeTopology) -> dict[str, str]:
env["MC_MS_AUTO_DISC"] = "0"
if topology.ib_device:
env["MOONCAKE_DEVICE"] = topology.ib_device
elif topology.transfer_backend == "mooncake":
# Default to TCP when RDMA is not forced (e.g. loopback on same node)
env.setdefault("MOONCAKE_PROTOCOL", "tcp")
repo_root = Path(__file__).resolve().parents[2]
python_paths = [

@@ -189,10 +189,11 @@ class MooncakeTransferEngine:
device_name if device_name is not None else "",
)
else:
protocol = os.environ.get("MOONCAKE_PROTOCOL", "rdma")
ret_value = self.engine.initialize(
hostname,
"P2PHANDSHAKE",
"rdma",
protocol,
device_name if device_name is not None else "",
)
if ret_value != 0:
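Taken together, the launcher hunk and this engine hunk decide the Mooncake protocol in two steps: the launcher sets `MOONCAKE_PROTOCOL=tcp` only when no IB device is configured, the transfer backend is `mooncake`, and the variable is not already set; the engine then reads the variable and falls back to `rdma`. A sketch of that precedence, assuming the engine takes the code path shown above; the function names and the sample device and backend values are hypothetical.

```python
from typing import Optional


def launcher_env(
    base_env: dict[str, str],
    ib_device: Optional[str],
    transfer_backend: str,
) -> dict[str, str]:
    """Environment the launcher would hand to a worker process."""
    env = dict(base_env)
    if ib_device:
        env["MOONCAKE_DEVICE"] = ib_device
    elif transfer_backend == "mooncake":
        # setdefault keeps any value the user exported beforehand.
        env.setdefault("MOONCAKE_PROTOCOL", "tcp")
    return env


def engine_protocol(env: dict[str, str]) -> str:
    """Protocol the transfer engine would pass to initialize()."""
    return env.get("MOONCAKE_PROTOCOL", "rdma")


if __name__ == "__main__":
    print(engine_protocol(launcher_env({}, None, "mooncake")))
    print(engine_protocol(launcher_env({}, "mlx5_0", "mooncake")))
    print(engine_protocol(launcher_env({"MOONCAKE_PROTOCOL": "rdma"}, None, "mooncake")))
```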

@@ -1602,6 +1602,9 @@ class DirectAppendAdmissionReqInput(BaseReq):
session_id: str
uncached_input_tokens: int
output_tokens: int
# "direct_append": existing behavior — require session resident on this D
# "seed": new admission for session not yet resident; do capacity check + LRU eviction
mode: str = "direct_append"
@dataclass
@@ -1619,6 +1622,9 @@ class DirectAppendAdmissionReqOutput(BaseReq):
decode_prealloc_queue_reqs: int = 0
decode_transfer_queue_reqs: int = 0
decode_retracted_queue_reqs: int = 0
# Backpressure hint: if > 0, the caller should pause this many ms before
# sending more requests to this D. Computed from retracted-queue depth,
# KV pool usage, and transfer-queue depth.
recommended_pause_ms: int = 0
@dataclass
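On the client side, `--enable-backpressure` and `--backpressure-max-pause-s` imply that the replay driver converts `recommended_pause_ms` into a bounded sleep. The client code itself sits in the suppressed diff, so the following is only a sketch of how such a conversion could look; `backpressure_sleep_s` is a hypothetical helper, not the actual implementation.

```python
def backpressure_sleep_s(
    recommended_pause_ms: int,
    max_pause_s: float = 2.0,
    enabled: bool = True,
) -> float:
    """Seconds to sleep before the next request to a saturated D worker."""
    if not enabled or recommended_pause_ms <= 0:
        return 0.0  # feature off or no hint: keep baseline behavior
    # Convert ms to seconds and apply the per-request cap.
    return min(recommended_pause_ms / 1000.0, max_pause_s)


if __name__ == "__main__":
    print(backpressure_sleep_s(1500))                  # hint below the cap
    print(backpressure_sleep_s(1500, max_pause_s=1.0))  # cap wins
    print(backpressure_sleep_s(1500, enabled=False))    # flag not set
```

In an asyncio-based client the returned value would typically be passed to `asyncio.sleep`, so that other in-flight sessions keep making progress during the pause.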

@@ -3181,6 +3181,89 @@ class Scheduler(
success = False
return success
def _compute_pool_breakdown_for_diagnostics(self) -> dict:
"""Read-only KV pool decomposition for the agentic-pd-hybrid profiler.
Decomposes capacity into:
- radix_evictable_tokens / radix_protected_tokens: tree-managed
- slot_private_held_tokens: SessionAwareCache out-of-tree slot holds
- running_batch_kv_tokens: kv_allocated_len of currently-decoding reqs
(overlaps with radix_protected; not additive)
- {transfer,prealloc,retracted}_queue_{reqs,tokens}: disagg queues
- available_tokens: free pool
Caller computes "unaccounted = capacity - sum_of_known" to find leakage.
Best-effort: any component that cannot be read is omitted from the result.
"""
breakdown: dict = {
"capacity_tokens": int(self.max_total_num_tokens or 0),
"available_tokens": int(self.token_to_kv_pool_allocator.available_size()),
}
# Radix tree (works for SessionAwareCache and most inner caches)
try:
ev = self.tree_cache.evictable_size()
pr = self.tree_cache.protected_size()
if isinstance(ev, tuple):
ev = ev[0]
if isinstance(pr, tuple):
pr = pr[0]
breakdown["radix_evictable_tokens"] = int(ev or 0)
breakdown["radix_protected_tokens"] = int(pr or 0)
except Exception:
pass
# SessionAwareCache slot-private holds (already in session_cache.held_tokens
# but mirrored here for one-stop decomposition)
try:
from sglang.srt.mem_cache.session_aware_cache import SessionAwareCache
if isinstance(self.tree_cache, SessionAwareCache):
breakdown["slot_private_held_tokens"] = int(
self.tree_cache.session_held_tokens()
)
breakdown["session_slot_count"] = int(
self.tree_cache.session_held_req_count()
)
except Exception:
pass
# Running batch KV (overlaps with radix_protected for tree-tracked reqs)
try:
running_reqs = self.running_batch.reqs
breakdown["running_batch_reqs"] = len(running_reqs)
breakdown["running_batch_kv_tokens"] = sum(
int(getattr(req, "kv_allocated_len", 0) or 0)
for req in running_reqs
)
except Exception:
pass
# Disagg decode queues
if self.disaggregation_mode == DisaggregationMode.DECODE:
try:
tq = self.disagg_decode_transfer_queue.queue
pq = self.disagg_decode_prealloc_queue.queue
rq = self.disagg_decode_prealloc_queue.retracted_queue
breakdown["transfer_queue_reqs"] = len(tq)
breakdown["transfer_queue_tokens"] = sum(
int(getattr(getattr(dr, "req", None), "kv_allocated_len", 0) or 0)
for dr in tq
)
breakdown["prealloc_queue_reqs"] = len(pq)
breakdown["prealloc_queue_tokens"] = sum(
int(getattr(getattr(dr, "req", None), "kv_allocated_len", 0) or 0)
for dr in pq
)
breakdown["retracted_queue_reqs"] = len(rq)
breakdown["retracted_queue_tokens"] = sum(
int(getattr(req, "kv_allocated_len", 0) or 0)
for req in rq
)
except Exception:
pass
return breakdown
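The docstring leaves the "unaccounted" computation to the caller. A possible caller-side sketch is shown below. Which components are additive is an assumption here: `running_batch_kv_tokens` is excluded because the docstring says it overlaps with `radix_protected_tokens`, and the queue token counts are left out by default because the source does not say whether they overlap with other components. `unaccounted_tokens` and `ADDITIVE_KEYS` are hypothetical names.

```python
# Components assumed to be disjoint slices of the pool (an assumption).
ADDITIVE_KEYS = (
    "available_tokens",
    "radix_evictable_tokens",
    "radix_protected_tokens",
    "slot_private_held_tokens",
)


def unaccounted_tokens(breakdown: dict, additive_keys=ADDITIVE_KEYS) -> int:
    """capacity minus the sum of known components; omitted keys count as 0."""
    known = sum(int(breakdown.get(key, 0) or 0) for key in additive_keys)
    return int(breakdown.get("capacity_tokens", 0) or 0) - known


if __name__ == "__main__":
    sample = {
        "capacity_tokens": 1000,
        "available_tokens": 400,
        "radix_evictable_tokens": 300,
        "radix_protected_tokens": 100,
        "slot_private_held_tokens": 150,
        "running_batch_kv_tokens": 80,  # ignored: overlaps radix_protected
    }
    print(unaccounted_tokens(sample))
```

Since missing keys are treated as zero, a large unaccounted value may simply mean a component could not be read, not that tokens leaked.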
def get_internal_state(self, recv_req: GetInternalStateReq):
ret = vars(get_global_server_args())
ret["last_gen_throughput"] = self.last_gen_throughput
@@ -3196,6 +3279,7 @@ class Scheduler(
ret["session_cache"] = (
self.session_controller.get_streaming_session_cache_status()
)
ret["pool_breakdown"] = self._compute_pool_breakdown_for_diagnostics()
if not self.spec_algorithm.is_none() and self.spec_total_num_forward_ct > 0:
ret["avg_spec_accept_length"] = (
@@ -3508,6 +3592,9 @@ class Scheduler(
reason="unsupported",
)
mode = getattr(recv_req, "mode", "direct_append") or "direct_append"
is_seed = mode == "seed"
session_cache_status = self.session_controller.get_streaming_session_cache_status(
recv_req.session_id
)
@@ -3515,27 +3602,28 @@ class Scheduler(
resident = bool(
isinstance(target_session, dict) and target_session.get("resident")
)
if not resident:
if not resident and not is_seed:
# direct_append requires the session already resident on this D.
# For seed we skip this check and let capacity decide.
transfer_queue_depth = len(self.disagg_decode_transfer_queue.queue)
retracted_queue_depth = len(self.disagg_decode_prealloc_queue.retracted_queue)
available_size = int(self.token_to_kv_pool_allocator.available_size())
token_usage = 1.0 - available_size / max(1, self.max_total_num_tokens)
return DirectAppendAdmissionReqOutput(
can_admit=False,
resident=False,
reason="session-not-resident",
available_tokens_before=int(
self.token_to_kv_pool_allocator.available_size()
),
available_tokens_after=int(
self.token_to_kv_pool_allocator.available_size()
),
token_usage=(
1.0
- self.token_to_kv_pool_allocator.available_size()
/ max(1, self.max_total_num_tokens)
),
available_tokens_before=available_size,
available_tokens_after=available_size,
token_usage=token_usage,
num_running_reqs=len(self.running_batch.reqs),
decode_prealloc_queue_reqs=len(self.disagg_decode_prealloc_queue.queue),
decode_transfer_queue_reqs=len(self.disagg_decode_transfer_queue.queue),
decode_retracted_queue_reqs=len(
self.disagg_decode_prealloc_queue.retracted_queue
decode_transfer_queue_reqs=transfer_queue_depth,
decode_retracted_queue_reqs=retracted_queue_depth,
recommended_pause_ms=self._compute_backpressure_pause_hint(
transfer_queue_depth=transfer_queue_depth,
retracted_queue_depth=retracted_queue_depth,
token_usage_after=token_usage,
),
)
@@ -3543,10 +3631,13 @@ class Scheduler(
0, recv_req.output_tokens
)
available_tokens_before = int(self.token_to_kv_pool_allocator.available_size())
# Don't evict the session itself when it's already resident; for seed
# of a fresh session there is nothing to exclude.
exclude_ids = {recv_req.session_id} if resident else set()
trim_result = self.maybe_trim_decode_session_cache(
required_tokens=required_tokens,
force=available_tokens_before < required_tokens,
exclude_session_ids={recv_req.session_id},
exclude_session_ids=exclude_ids,
)
available_tokens_after = int(self.token_to_kv_pool_allocator.available_size())
decode_retracted_queue_reqs = len(self.disagg_decode_prealloc_queue.retracted_queue)
@@ -3556,6 +3647,7 @@ class Scheduler(
)
reason = None if can_admit else "no-space"
transfer_queue_depth = len(self.disagg_decode_transfer_queue.queue)
return DirectAppendAdmissionReqOutput(
can_admit=can_admit,
resident=True,
@@ -3570,10 +3662,36 @@ class Scheduler(
),
num_running_reqs=len(self.running_batch.reqs),
decode_prealloc_queue_reqs=len(self.disagg_decode_prealloc_queue.queue),
decode_transfer_queue_reqs=len(self.disagg_decode_transfer_queue.queue),
decode_transfer_queue_reqs=transfer_queue_depth,
decode_retracted_queue_reqs=decode_retracted_queue_reqs,
recommended_pause_ms=self._compute_backpressure_pause_hint(
transfer_queue_depth=transfer_queue_depth,
retracted_queue_depth=decode_retracted_queue_reqs,
token_usage_after=(
1.0 - available_tokens_after / max(1, self.max_total_num_tokens)
),
),
)
def _compute_backpressure_pause_hint(
self,
*,
transfer_queue_depth: int,
retracted_queue_depth: int,
token_usage_after: float,
) -> int:
# If D is already retracting requests, pause aggressively.
if retracted_queue_depth > 0:
return 1500
# KV pool at or above 90%: pause grows with overshoot, clamped to 200-2000 ms.
if token_usage_after >= 0.90:
overshoot = int((token_usage_after - 0.90) * 10000)
return max(200, min(2000, overshoot * 5))
# Transfer queue heavy: pause linearly with depth.
if transfer_queue_depth >= 8:
return min(2000, transfer_queue_depth * 100)
return 0
def maybe_sleep_on_idle(self):
if self.idle_sleeper is not None:
self.idle_sleeper.maybe_sleep()
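To make the thresholds in `_compute_backpressure_pause_hint` easier to reason about, here is the same decision ladder as a standalone function with a few sample inputs. The free-function name `backpressure_pause_ms` is an illustrative assumption; the constants are copied from the method above.

```python
def backpressure_pause_ms(
    *,
    transfer_queue_depth: int,
    retracted_queue_depth: int,
    token_usage_after: float,
) -> int:
    """Standalone copy of the scheduler's pause-hint ladder."""
    # Any retracted request: fixed aggressive pause.
    if retracted_queue_depth > 0:
        return 1500
    # KV pool at or above 90%: 5 ms per 0.01% of overshoot, clamped.
    if token_usage_after >= 0.90:
        overshoot = int((token_usage_after - 0.90) * 10000)
        return max(200, min(2000, overshoot * 5))
    # Deep transfer queue: 100 ms per queued request, capped at 2000 ms.
    if transfer_queue_depth >= 8:
        return min(2000, transfer_queue_depth * 100)
    return 0


if __name__ == "__main__":
    cases = [
        dict(transfer_queue_depth=0, retracted_queue_depth=0, token_usage_after=0.50),
        dict(transfer_queue_depth=8, retracted_queue_depth=0, token_usage_after=0.50),
        dict(transfer_queue_depth=0, retracted_queue_depth=0, token_usage_after=0.92),
        dict(transfer_queue_depth=0, retracted_queue_depth=2, token_usage_after=0.99),
    ]
    for case in cases:
        print(case, backpressure_pause_ms(**case))
```

With these constants the usage-based pause reaches its 2000 ms ceiling at roughly 94% pool usage, so the proportional range is narrow; the retracted-queue check takes precedence over both other signals.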