Compare commits
12 Commits
main
...
kvc-real-a
| Author | SHA1 | Date | |
|---|---|---|---|
| | 7568e041ff | | |
| | 4e8f943875 | | |
| | 1d51704dad | | |
| | 7affb565b2 | | |
| | c47adaf8e3 | | |
| | ca4b64c79a | | |
| | 4978c0d0cd | | |
| | 51f5386691 | | |
| | 6572d7f3f4 | | |
| | 6e5ed8da80 | | |
| | 74194e660a | | |
| | c9d350b372 | | |
434
docs/AGENTIC_FIT_ANALYSIS_ZH.md
Normal file
@@ -0,0 +1,434 @@
# Structural Design Flaws in Agentic Scenarios

**Date**: 2026-05-06
**Comparison data**: `outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/exp2_2p6d_run1_*` (KVC kv-aware Option D, 2P6D, 4449 reqs / 52 sessions) + `outputs/qwen3-30b-tp1-exps/exp1_8way_dp_cache_aware_summary.json` (same trace, 8-way DP cache-aware baseline).
**Model**: Qwen3-30B-A3B (TP1), single node with 8×H100 80GB.
**Research question**: Taking the SWE trace as representative of "real agentic" workloads, where does the KVC mechanism systematically lose to vanilla DP, and what are the structural causes beyond "D capacity is 4.6× overloaded"?

> This document supplements `docs/KVC_DEBUG_JOURNEY_V1_TO_V5.md` and `docs/V5_PROFILE_INVESTIGATION_ZH.md`: beyond version history and bottleneck localization, it asks which design-level assumptions do not match real agentic workloads.

---

## TL;DR

Structural flaws, in order of importance:

| # | Flaw | Evidence | Fix direction | Effort |
|---|---|---|---|---|
| 1 | **KvAwarePolicy ignores D capacity; a session is permanently pinned to the D where it first landed** | Average number of distinct Ds visited per session = **1.00**; the direct-to-D hit rate is polarized (15 sessions at 0-20%, 14 sessions at 80-100%) | Add a capacity-aware term to the score function; allow sessions to migrate across Ds | Medium |
| 2 | **D-side LRU can only evict idle sessions; hot sessions can never be evicted** | Each D sees only 9-43 trim events over the whole run vs 80-150 transfer errors; token_usage peaks at 1.00 | Add score-based eviction (tiered by access frequency/recency) | Medium |
| 3 | **No D→Router→Replay backpressure channel** | Concurrency stays at 32 and never drops; replay is unaware when D fails | Add `recommended_pause_ms` to the admission response; replay lowers concurrency accordingly | Small |
| 4 | **The admission HTTP round-trip is coupled to the scheduler main loop** | In v5+profile, merely adding 1 Hz polling raised errors from 9 to 415 | Split into a lock-free `/probe` and a `/commit_evict` that goes through the scheduler queue | Medium |
| 5 | **P-side round-robin ignores D health** | prefill-0 logged 367 KVTransferError, prefill-1 only 4, yet the request volume is split almost evenly | Have the router consider the target D's health when choosing P | Medium |
| 6 | **Replay-side session footprint estimate is inflated 30×** | `_estimate_session_resident_tokens = input + output` treats the 80K context of turn 50 as "needs 80K of brand-new space" | Switch to an incremental-token estimate | Small |
| 7 | **time-scale=10 artificially pushes the test into a distorted regime** | Inter-turn gap p50 shrinks from 2.5s to 0.25s, eliminating the "natural idle windows" KVC is designed to exploit | Run a time-scale=1 baseline to verify | Small (config only) |

**The most important comparison**: on the same trace, hardware, and model, 8-way DP cache-aware (no PD disaggregation, no KVC, no session abstraction) gives:

| Metric | 8-way DP CA | v5 KVC 2P6D |
|---|---|---|
| Errors | **0** | 372 (8.4%) |
| Latency mean | **1.43s** | 3.50s |
| Latency P50 | **0.65s** | 1.11s |
| Latency P99 | **8.37s** | 20.37s |
| TTFT mean | **0.12s** | 2.13s |
| TTFT P90 | **0.26s** | 6.47s |
| Per-worker request count | 508–619 (±10%) | 561–858 (±26%) |

**Naive DP wins on every metric; KVC's mean latency alone is 145% higher (3.50s vs 1.43s).** This defines the baseline that KVC "must beat" on this workload.

---

## 1. Sessions permanently pinned to a D + capacity-blind selection (the core problem)

### 1.1 Symptoms

Over the whole run each session visits **1.00 distinct D workers** on average (see the data above). Combine that with the distribution of direct-to-D hit rates:

```
Direct-to-D hit rate by bucket (n=52 sessions):
  0-20%:   15 sessions   ← almost every turn fails and falls back to a full P→D transfer
  20-40%:   7
  40-60%:  11
  60-80%:   5
  80-100%: 14 sessions   ← almost every turn takes the direct-to-D fast path
```

**More than half of the sessions (29 of 52) sit in the two extreme buckets**, a typical signal of unfair resource allocation.

The starved and the well-served sessions differ clearly in workload:
- Starved sessions, average peak input: 56,011 tokens
- Well-served sessions, average peak input: 31,344 tokens (**a 1.8× gap**)

**Large sessions tend to be starved** because they are more likely to trigger an admission rejection on a D whose capacity is already tight.

### 1.2 Root cause (code level)

`policies.py:166-172`, `KvAwarePolicy.select`:

```python
score = (
    overlap + sticky * self.sticky_bonus,  # primary: historical KV overlap
    sticky,                                # secondary: whether this is last_decode_worker
    inflight_penalty,                      # tertiary: current inflight count (very small)
    assignment_penalty,                    # quaternary: cumulative assignments (even smaller)
)
```

The score has **no term at all for the D's current capacity**. When session X first lands on D-2, its hash_ids accumulate on D-2; after that, no matter how full D-2 gets, turn N+1 of X is still scored onto D-2 (because overlap dominates).

Worse, `RoutingState.decode_resident_blocks` (`policies.py:46`) never shrinks: even if D evicted some blocks long ago, replay still believes they are there. By mid-run the overlap set of every D approaches "all hash_ids in the trace", and the policy degenerates into pure stickiness.

### 1.3 Consequences, as experienced by individual sessions

**Per-turn flow for a starved session (e.g. session 50400, 105 turns, 0 direct-to-D hits)**:

1. The policy picks a D (always the same one)
2. Admission rejects (the D's capacity is already taken)
3. The request goes through fallback-session-cap → P does a full prefill of 50K-100K tokens
4. mooncake pushes the KV → D still has no room → 32s timeout or KVTransferError
5. The user sees 5-10s latency on every turn, with repeated errors

**Per-turn flow for a well-served session (e.g. session 3840, 118 turns, 97% direct-to-D)**:

1. The policy picks a D (always this session's initial D)
2. Admission passes (this session has held a slot on this D all along)
3. direct-to-D: D append-prefills a few hundred tokens, with no P involvement and no mooncake transfer
4. TTFT 0.043s, E2E 0.495s

**This is not "slightly slower on average"; it is structural unfairness.** From an SLO perspective, P99 is set by the tail of those 15 starved sessions.

### 1.4 Why naive DP wins anyway

8-way DP cache-aware uses pure hash-based routing, with no session abstraction and no PD disaggregation:

- Each request is routed to a worker by prefix hash → turns of the same session naturally hit the prefix cache on that worker
- Under capacity overload, SGLang's own radix cache + scheduler manage the KV pool in one place
- There is no admission/fallback/reseed path
- There is no mooncake transfer
- Per-worker load deviation is ±10% (vs ±26% for KVC), close to balanced without any extra mechanism

**The three pieces KVC introduces (session affinity, KV reuse, admission) make the imbalance worse when capacity is tight, and none of them recovers the gap to DP.**

### 1.5 Fix direction

Add to `KvAwarePolicy.select`:

```python
# Current capacity utilization of each D (worker-mode admission can already query it)
capacity_penalty = -worker_capacity_used_ratio[worker.worker_id]

# When several Ds have overlap, prefer the emptiest one;
# when a D's utilization exceeds the threshold, exclude it from the candidates
if worker_capacity_used_ratio[worker.worker_id] > HARD_CAP:
    continue

score = (
    overlap_capped,      # overlap, but capped so that a single D cannot always win
    capacity_penalty,    # ← new
    sticky,
    inflight_penalty,
)
```

A more aggressive fix: after a session has been rejected N times by the same D, proactively release its session state on that D and **let the next turn go to a different D** (the cost is losing the accumulated KV, but the current fallback path loses it anyway).

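As a rough illustration of the capacity-aware scoring proposed in this section, here is a minimal, self-contained sketch. It is not the project's `KvAwarePolicy`: the `Candidate` type, the `select_decode` function, the `HARD_CAP` value of 0.95, and the `overlap_cap` default are all assumptions invented for the example.

```python
from dataclasses import dataclass

HARD_CAP = 0.95  # hypothetical threshold: a D above this utilization is skipped


@dataclass
class Candidate:
    worker_id: str
    overlap: int       # blocks of this request already resident on the D
    used_ratio: float  # fraction of the D's KV pool in use, 0.0-1.0
    sticky: bool       # True if this D served the session's previous turn


def select_decode(candidates, overlap_cap=64):
    """Pick a D by (capped overlap, free capacity, stickiness), in that order."""
    best, best_score = None, None
    for c in candidates:
        if c.used_ratio > HARD_CAP:
            continue  # a nearly full D is not eligible at all
        score = (min(c.overlap, overlap_cap), -c.used_ratio, c.sticky)
        if best_score is None or score > best_score:
            best, best_score = c, score
    return best.worker_id if best else None


candidates = [
    Candidate("decode-2", overlap=500, used_ratio=0.99, sticky=True),
    Candidate("decode-0", overlap=10, used_ratio=0.40, sticky=False),
    Candidate("decode-1", overlap=10, used_ratio=0.20, sticky=False),
]
print(select_decode(candidates))  # -> decode-1
```

In this toy input the sticky D with the largest overlap is excluded because it is over the cap, and the tie on overlap between the other two is broken in favor of the emptier one. Whether capacity should outrank stickiness, as it does here, is a design choice that would need to be validated against real traces.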

---

## 2. D-side LRU eviction cannot keep up with the pressure

### 2.1 Data

Per D, over the whole run:

| Worker | Trim events (proactive LRU) | KVTransferError + OOM | Peak token_usage |
|---|---:|---:|---:|
| decode-0 | 9 | 0 | 0.99 |
| decode-1 | 43 | 12 (4 err + 8 oom) | 0.99 |
| decode-2 | 16 | 459 (153 err + 306 oom) | 0.97 |
| decode-3 | 37 | 87 (29 err + 58 oom) | 0.99 |
| decode-4 | 28 | 270 (90 err + 180 oom) | **1.00** |
| decode-5 | 30 | 279 (93 err + 186 oom) | **1.00** |

**LRU fires 5-15 times less often than errors occur.** D-4 and D-5 hit token_usage=1.00 outright.

### 2.2 Root cause

The idle check in `evict_idle_streaming_sessions_lru` (`scheduler.py:2040`):

```python
# Only sessions where "all reqs are finished + streaming mode" can be evicted
```

But under high SWE concurrency almost every session has an inflight req nearly all the time (and time-scale=10 compresses the inter-turn gap further). **Hot sessions are never idle, so LRU never finds anything to evict.** As a result D runs all the way to 100%, and the next incoming transfer goes straight to OOM/timeout.

### 2.3 Fix direction

Introduce tiered eviction:

1. **Idle sessions first** (current behavior)
2. **Cold sessions next** (no access in the last N seconds; even if there is an inflight req, it can be retracted to make room)
3. **Force-retract hot sessions** (when the hard cap is hit)

Vanilla SGLang already has the `disagg_decode_prealloc_queue.retracted_queue` mechanism (see the references in `admit_direct_append`), but **nothing triggers a retract proactively**: at present requests only enter retracted_queue on internal exceptions. Retract needs to be promoted to a regular part of the admission path.

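The three tiers can be sketched as a victim-selection function. This is only an illustration of the ordering idea, not SGLang code: `SessionSlot`, `pick_victims`, and the `cold_after_s` threshold are hypothetical names, and the sketch only decides *which* sessions to evict or retract, not how.

```python
from dataclasses import dataclass


@dataclass
class SessionSlot:
    session_id: str
    inflight: int       # number of unfinished requests
    last_access: float  # seconds, same clock as `now`
    tokens: int         # KV tokens held on this D


def pick_victims(slots, need_tokens, now, cold_after_s=5.0, hard_cap_hit=False):
    """Return session ids to evict/retract, cheapest tier first."""

    def tier(s):
        if s.inflight == 0:
            return 0  # idle: the only tier evictable today
        if now - s.last_access >= cold_after_s:
            return 1  # cold, but still has inflight work to retract
        return 2      # hot

    max_tier = 2 if hard_cap_hit else 1  # hot sessions only under the hard cap
    order = sorted(
        (s for s in slots if tier(s) <= max_tier),
        key=lambda s: (tier(s), s.last_access),
    )
    victims, freed = [], 0
    for s in order:
        if freed >= need_tokens:
            break
        victims.append(s.session_id)
        freed += s.tokens
    return victims
```

Within a tier the sketch falls back to plain LRU order (`last_access`); a frequency-weighted score could replace that key without changing the tier logic.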

---

## 3. No D→Replay backpressure channel

### 3.1 Terminology

**Backpressure** = in a streaming system, when the downstream is overloaded it propagates a signal back upstream to slow it down. Examples: TCP sliding window, Kafka consumer lag, gRPC HTTP/2 flow control.

### 3.2 Current state

- D-side transfer queue piles up → timeout after 32s → KVTransferError is raised
- The error is passed back to P → P passes it to the router → the router passes it to replay → replay takes the fallback path
- **Nowhere along the chain is there a "D is overloaded, please slow down" signal**; concurrency stays at its upper limit

Consequence: once D starts failing, it **keeps failing** (because replay does not slow down) until D works through the backlog on its own.

### 3.3 Fix direction

Add to the `admit_direct_append` response:

```python
{
    "can_admit": ...,
    "recommended_pause_ms": int,   # ← new: suggested wait before sending the next request of this kind
    "queue_depth": int,            # ← new: current depth of D's transfer queue
    ...
}
```

When admission rejects, the replay side lowers concurrency or backs off according to `recommended_pause_ms`. **This is the cheapest change of all**: no protocol change, no change to SGLang internals, only code on the two endpoints.

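On the replay side, honoring the hint could look roughly like the following. This is a sketch under assumptions: `AdmissionThrottle` is a hypothetical helper, the response is treated as a plain dict, and `recommended_pause_ms` is the proposed field, which does not exist in the current response.

```python
class AdmissionThrottle:
    """Replay-side gate that delays new requests after a rejected admission."""

    def __init__(self):
        self._resume_at = 0.0  # time before which no new request should be sent

    def on_admission_response(self, resp: dict, now: float) -> None:
        # A missing hint means "no pause"; only rejections extend the pause.
        pause_s = resp.get("recommended_pause_ms", 0) / 1000.0
        if not resp.get("can_admit", True) and pause_s > 0:
            self._resume_at = max(self._resume_at, now + pause_s)

    def delay(self, now: float) -> float:
        """Seconds the caller should still wait before sending."""
        return max(0.0, self._resume_at - now)


throttle = AdmissionThrottle()
throttle.on_admission_response(
    {"can_admit": False, "recommended_pause_ms": 500}, now=10.0
)
print(throttle.delay(now=11.0))  # -> 0.0
```

The caller would sleep for `delay(now)` before each send. One throttle per target D keeps an overloaded D from slowing traffic to healthy ones; a single global throttle is simpler but coarser.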

---

## 4. Admission RPC coupled to the scheduler: the exact boundary between structure and engineering

### 4.1 Symptoms

`docs/V5_PROFILE_INVESTIGATION_ZH.md` reports that merely adding 1 Hz `/server_info` polling raised EXP2 errors from 9 to 415. `/server_info` walks the session slots inside the scheduler main loop to compute `is_idle`, and 1 Hz × 8 workers is enough to perturb scheduling.

Under real load, however, the admission RPC rate is far above 1 Hz: turn 1, reseed, and direct-to-D each make one call. concurrency=32 + 4449 reqs / ~2700s ≈ **16+ admission RPCs per second**.

### 4.2 Structural problem or engineering problem? A precise breakdown

`admit_direct_append` (`scheduler.py:3581`) does two things:

```python
# (a) Read pool state: light
available_tokens = self.token_to_kv_pool_allocator.available_size()

# (b) Trigger an LRU scan: heavy, and it must modify pool state
trim_result = self.maybe_trim_decode_session_cache(...)
```

| Part | Nature | Solvable by engineering alone? |
|---|---|---|
| (a) Read pool state | A few atomic reads | **Fully solvable**: a lock-free shared-memory snapshot is enough |
| (b) LRU eviction | Modifies the GPU pool, needs exclusive access | **Structural**: with the Python GIL + a shared GPU pool, concurrent modification is impossible |

**Key observation**: under real load (b) is the minority path. Most admissions only need to "check whether there is enough room" and do not need to evict immediately.

### 4.3 Engineering fix

Split the admission API into two endpoints:

```
POST /session_cache/probe            ← 90% of traffic
  - read-only, lock-free snapshot
  - returns (can_admit_estimate, available_tokens, queue_depth)
  - does not enter the scheduler queue

POST /session_cache/commit_evict     ← 10% of traffic
  - called only when probe says there is not enough room
  - enters the scheduler queue and performs the actual LRU
  - keeps the current admit_direct_append semantics
```

At the end of every step the scheduler writes the snapshot to an mmap shared-memory region (atomic publish); the replay side reads it through mmap, with zero syscalls and zero serialization. This can sustain thousands of probes per second.

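One common way to get a single-writer "atomic publish" is a sequence counter (seqlock): the writer makes the counter odd while writing and even when done, and readers retry if they see an odd or changed counter. The sketch below shows the idea on a plain `bytearray`; the three-field layout, `publish`, and `probe` are assumptions for illustration, and a real cross-process version would place the buffer in a shared `mmap` region and would need care with memory ordering that pure Python does not guarantee.

```python
import struct

# Layout (little-endian u64 each): sequence counter, available_tokens, queue_depth
_FMT = "<QQQ"
SNAPSHOT_SIZE = struct.calcsize(_FMT)


def publish(buf, available_tokens: int, queue_depth: int) -> None:
    """Writer (scheduler side): an odd sequence means a write is in progress."""
    (seq,) = struct.unpack_from("<Q", buf, 0)
    struct.pack_into("<Q", buf, 0, seq + 1)  # mark dirty
    struct.pack_into("<QQ", buf, 8, available_tokens, queue_depth)
    struct.pack_into("<Q", buf, 0, seq + 2)  # mark clean


def probe(buf, need_tokens: int, retries: int = 100):
    """Reader (replay side): retry until a consistent snapshot is observed."""
    for _ in range(retries):
        seq1, avail, depth = struct.unpack_from(_FMT, buf, 0)
        (seq2,) = struct.unpack_from("<Q", buf, 0)
        if seq1 == seq2 and seq1 % 2 == 0:
            return {
                "can_admit_estimate": avail >= need_tokens,
                "available_tokens": avail,
                "queue_depth": depth,
            }
    return None  # never saw a stable snapshot; caller falls back to commit_evict


buf = bytearray(SNAPSHOT_SIZE)
publish(buf, available_tokens=1000, queue_depth=3)
print(probe(buf, need_tokens=500))
```

The probe result is only an estimate: capacity can change between the probe and the actual request, which is why the slower `commit_evict` path still has to make the binding decision.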

### 4.4 On "coroutines / threads / processes / another language"

| Tool | Actual effect on this problem |
|---|---|
| asyncio coroutines | SGLang already uses them; no help for the scheduler main loop itself |
| Python threads | Blocked by the GIL, and GPU pool state can only be modified by the scheduler process |
| Multiple processes | The scheduler is already a separate process; the problem is that **its own step loop** serializes admission and decode |
| orjson / uvloop | Speeds up networking/JSON 5-10×, but the LRU walk is not on that hot path |
| Rewriting the scheduler in Rust/C++ | Speeds up the LRU walk 5-10×, but **the structural sharing problem remains** |

**The right engineering fix is to redesign the API (split probe / commit), not simply to swap in a faster library or language.**

---

## 5. P-side routing ignores D health

### 5.1 Data

```
prefill-0: 367 KVTransferError, 361 "Decode instance could be dead"
prefill-1:   4 KVTransferError,   0 "Decode instance could be dead"

Request counts:
  prefill-0: 2225 requests
  prefill-1: 2224 requests   ← almost an even split
```

**The two Ps carry almost exactly the same request volume, yet their error counts differ by 92×.** In the logs, prefill-0's errors repeatedly point at one specific D (`10.45.80.47:XXXXX`): it has formed a "death link" with a hot D.

### 5.2 Root cause

P selection in `pd_router.py:43-49` is bare round-robin:

```python
prefill_url, bootstrap_port = self.config.prefill_urls[
    self.prefill_cursor % len(self.config.prefill_urls)
]
```

It does not know whether D is healthy and will not steer around a P that is "locked in a fight with D-X".

### 5.3 Fix direction

When choosing P, the router should use a joint score over (P's current inflight transfer count, health of the target D). The `queue_depth` field proposed in §3 can serve as the health signal.

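A joint score of this kind could be as simple as the sketch below. Note the assumption it makes: D health is tracked per (P, D) link as a recent error count, because a health value that depends only on the target D would be identical for every P and could not influence the choice. `pick_prefill`, `link_errors`, and `error_weight` are hypothetical names, not part of `pd_router.py`.

```python
def pick_prefill(prefill_inflight, target_decode, link_errors, error_weight=10.0):
    """Choose the P with the lowest cost for a given target D.

    prefill_inflight: {p_id: number of inflight transfers on that P}
    link_errors:      {(p_id, d_id): recent transfer errors on that link}
    """

    def cost(p_id):
        errors = link_errors.get((p_id, target_decode), 0)
        return prefill_inflight[p_id] + error_weight * errors

    # Sorting first makes tie-breaking deterministic (lowest id wins).
    return min(sorted(prefill_inflight), key=cost)


choice = pick_prefill(
    {"prefill-0": 3, "prefill-1": 3},
    target_decode="decode-2",
    link_errors={("prefill-0", "decode-2"): 5},
)
print(choice)  # -> prefill-1
```

With no recorded errors the function reduces to least-inflight selection, so it degrades gracefully to something close to the current balanced behavior.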

---

## 6. Replay-side session footprint estimate inflated 30×

### 6.1 Code

`replay.py:898-899`:

```python
def _estimate_session_resident_tokens(request: TraceRequest) -> int:
    return request.input_length + request.output_length
```

It is used by `_decode_session_soft_cap` (`replay.py:1051`) and `_should_admit_new_decode_session`.

### 6.2 Problem

For turn 50 of a session that already has 80K of KV on D:
- Real incremental need: a few thousand new input tokens + a few hundred output tokens = ~3K
- Value returned by the estimate: 80K + 1K = 81K (**inflated ~27×**)

Consequence: router-mode admission misjudges systematically; sessions that could have been admitted are rejected by replay itself. v5 worker-mode partly fixed this by letting D look at its real capacity, **but KvAwarePolicy still uses this inflated estimate when choosing D**, so the choice of D is still wrong.

### 6.3 Fix

```python
def _estimate_session_resident_tokens(request: TraceRequest) -> int:
    if request.turn_id == 1:
        return request.input_length + request.output_length
    # turn 2+: only the increment matters for additional reservation
    return max(0, request.input_length - request.cached_tokens) + request.output_length
```
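
To make the size of the inflation concrete, the two estimates can be compared on the turn-50 example from §6.2. The `TraceRequest` below is a stand-in dataclass with only the fields used here, and the assumption that 78K of the 80K input is already cached on D is chosen to match the "~3K increment" figure; neither is taken from `replay.py`.

```python
from dataclasses import dataclass


@dataclass
class TraceRequest:  # hypothetical stand-in, not the real class
    turn_id: int
    input_length: int
    output_length: int
    cached_tokens: int = 0


def old_estimate(r: TraceRequest) -> int:
    return r.input_length + r.output_length


def incremental_estimate(r: TraceRequest) -> int:
    if r.turn_id == 1:
        return r.input_length + r.output_length
    return max(0, r.input_length - r.cached_tokens) + r.output_length


turn50 = TraceRequest(turn_id=50, input_length=80_000,
                      output_length=1_000, cached_tokens=78_000)
print(old_estimate(turn50), incremental_estimate(turn50))  # -> 81000 3000
```

The ratio 81000 / 3000 = 27 is the "~27×" inflation quoted in §6.2.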

---

## 7. time-scale=10 measurement distortion

### 7.1 What it is

`replay.py` scales the `timestamp` field of every request in the original trace as `t / time_scale` and sends each request at the scaled time.

- Original trace span: ~6000s (≈100 minutes)
- time-scale=10 → actual replay span: ~600s (≈10 minutes)

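The effect of the scaling on inter-turn gaps follows directly from the formula. A tiny sketch, with hypothetical helper names and made-up timestamps (this is not the code in `replay.py`):

```python
def scale_timestamps(timestamps, time_scale):
    """Compress a trace by dividing every timestamp by time_scale."""
    return [t / time_scale for t in timestamps]


def gaps(timestamps):
    """Gaps between consecutive requests."""
    return [b - a for a, b in zip(timestamps, timestamps[1:])]


original = [0.0, 2.5, 10.0]            # seconds; gaps of 2.5s and 7.5s
replayed = scale_timestamps(original, time_scale=10)
print(gaps(original), gaps(replayed))  # -> [2.5, 7.5] [0.25, 0.75]
```

Every gap shrinks by the same factor, so a 2.5s pause between turns becomes 0.25s in the replay, while the model's own compute time per turn is not scaled at all.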
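
### 7.2 Why it was designed this way

**Purely to save test time**: a single 1× run takes 100 minutes, so a sweep of 5 versions × 3 repeats = 25h of GPU time; at 10× it takes only 2.5h.

### 7.3 What it distorts

| Dimension | Original trace | Replay (time-scale=10) |
|---|---|---|
| inter-turn gap p10 | 1.6s | 0.16s |
| inter-turn gap p50 | 2.5s | 0.25s |
| inter-turn gap p90 | 7.8s | 0.78s |
| inter-turn gap max | 261s | 26s |

Real agentic users/agents pause 2-8 seconds between turns (thinking, typing, tool calls). **These gaps are exactly the "natural idle windows" KVC wants to exploit**: while a session is briefly idle, LRU can evict and other sessions can be admitted.

time-scale=10 squeezes these windows down to 0.2-0.8s, **artificially removing the precondition KVC was designed around**.

### 7.4 A serious threat to experimental validity

All v3-v6 data is based on time-scale=10. This means every earlier conclusion that "KVC loses to the baseline on SWE" carries this distortion. **If the inter-turn gap in a real deployment is 2.5s, KVC may never hit the capacity bottleneck seen here**, since D has time to release/reorganize between turns.

**A separate time-scale=1 baseline comparison should be run** to tell whether KVC loses to DP because the mechanism itself does not work, or because the benchmark pushed it into a regime where it was never meant to operate. This is currently the project's **most important validation that has not yet been done**.
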
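---

## 8. Application-layer abstractions do not need to be introduced at the engine layer (retracted)

An earlier draft claimed that "the framework does not support speculative multi-branch, nested sub-agents, or tool-call interruption". That was over-abstraction. **All of these application-layer patterns can be expressed implicitly with timestamps + independent session_ids**:

| Application-layer pattern | How it appears in the trace | What the inference engine needs to do |
|---|---|---|
| Tool call returning asynchronously | Large timestamp gap between turn N and N+1 | Nothing; just send requests on schedule |
| Nested sub-agent | The parent session's timestamps suddenly pause; the sub-agent has its own session_id | Treat them as two independent sessions (no KV sharing needed either) |
| Speculative N branches | N independent session_ids sent at the same time | The radix prefix cache hits the shared prefix naturally; no extra abstraction is needed |

**This item is not a structural flaw.** It has been removed from the conclusions.
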
---

## 9. Action items (ordered by ROI)

### Priority P0 (fixing these significantly reduces starvation/unfairness)

1. **[§1] Add a capacity-aware penalty to KvAwarePolicy + allow sessions to migrate across Ds**: medium effort, largest payoff
2. **[§2] Introduce tiered eviction on the D side (cold sessions, hot retract)**: medium effort, large payoff
3. **[§7] Run a time-scale=1 baseline**: small effort (config only), but **without it none of the conclusions can be trusted**

### Priority P1 (fixing these fills in engineering stability)

4. **[§3] D→Replay backpressure channel** (pause hint in the admission response): small effort
5. **[§4] Split admission into probe + commit_evict**: medium effort
6. **[§6] Make `_estimate_session_resident_tokens` incremental**: small effort

### Priority P2 (decide after the P0 data is in)

7. **[§5] Consider D health when choosing P**: medium effort

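---

## 10. Limitations and unverified assumptions

1. **N=1**: all data comes from a single run (v6 P0 already showed EXP2 errors drifting between 9 and 912, so single-run variance is huge). Every number in this document should be read as a "representative observation", not a "statistically significant conclusion".
2. **time-scale=10 distortion** (§7): the degree to which "KVC loses to DP" may be amplified by the benchmark. This is the largest uncertainty.
3. **Hardware advantage in the 8DP comparison**: in DP all 8 workers run prefill+decode; KVC is 2P+6D, so only 6 can decode. In theory 8 workers vs 6 carries a built-in 1.33× decode-concurrency advantage. This document does not normalize for it, but the 8DP advantage is far larger than 1.33× (KVC's mean latency is 145% higher), so the core conclusion (KVC systematically loses on this workload) is unaffected.
4. **mooncake TCP loopback**: all transfer errors are artifacts of single-node TCP emulation. Under RDMA in production the error distribution may be entirely different.
5. **Stale `decode_resident_blocks` in KvAwarePolicy** (end of §1.2): the phenomenon is supported by observation (overlap loses its discriminating power mid-run), but **the effect of clearing the stale state has not been tested systematically**.
6. **Errors concentrated on prefill-0** (§5.1): the causal chain is speculation; it may also be a chance result of "prefill-0 started earlier + a race". Not verified with N>1 data.
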
---

## Appendix A: Index of data artifacts

```
outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/
├── exp2_2p6d_run1_metrics.jsonl       ← main data source for this document
├── exp2_2p6d_run1_summary.json
├── exp2_2p6d_run2_*                   (errors=912, evidence of single-run variance)
├── exp2_2p6d_run3_*                   (errors=396)
└── kvcache-centric-*-20260429T142429Z/logs/
    ├── decode-{0..5}.log              ← §2.1 LRU vs error counts
    └── prefill-{0,1}.log              ← §5.1 P error distribution

outputs/qwen3-30b-tp1-exps/
├── exp1_8way_dp_cache_aware_summary.json   ← comparison baseline
└── RESULTS_SUMMARY.md
```

## Appendix B: Related documents

- `docs/PROJECT_OVERVIEW.md`: project goals and implemented features
- `docs/KVC_DEBUG_JOURNEY_V1_TO_V5.md`: version history v1→v5
- `docs/V5_PROFILE_INVESTIGATION_ZH.md`: v5+profile investigation (revised after critic review)
- `docs/SWEBENCH_EXPERIMENT_RESULTS.md`: Qwen3.5-35B-A3B SWE experiments
367
docs/KVC_DEBUG_JOURNEY_V1_TO_V5.md
Normal file
@@ -0,0 +1,367 @@
# KVC Experiment Pitfalls and Code Bug Analysis (v1 → v5)

A record of the pitfalls hit in the KVC experiments from v1 to v5, the error diagnoses, and the code bugs finally pinned down.
Model: Qwen3-30B-A3B (TP1); hardware: single node with 8×H100 80GB.
Trace: `qwen35-swebench-50sess.jsonl` (4449 requests, 52 sessions).

## TL;DR

| Version | Key change | Truncation rate | direct-to-D share | P50 | Main bottleneck |
|------|----------|:---:|:---:|:---:|----------|
| v1 (smoke / early) | Mechanism working end to end | - | - | - | - |
| v2 | KVC + `--policy default` | **56.8% / 61.4%** | <0.1% | 0.08s* | Routing mismatch (default policy) |
| v3 | KVC + `--policy kv-aware` | **0.9%** | 30-42% | 1.5-1.8s | session-cap fallback (52-65%) |
| v4 | v3 + soft_cap 4→16 | 1.0% | 54-58% | 1.08 / 0.84s | session-cap fallback still 35%; 9-10% mooncake errors |
| v5 | Option D: worker-mode drives seed/reseed | 0.9% | 41-45% | 1.59 / 1.31s | D KV pool genuinely short of capacity → fallback actually ↑ to 46-51% |

`*` The v2 P50 is a bogus number: more than half of the requests were aborted after generating a single token.

## v2 pitfall: the default policy is fundamentally incompatible with the KVC mechanism

### Symptoms

`scripts/sweep_tp1_v2_fixed.sh` produced:
- Exp1 (8-way DP, baseline): 4449/4449 succeeded, P50=0.65s, error=0
- Exp2 (1P7D KVC): **2524 truncated (56.8%)**, 18 errors, P50=0.08s* (bogus)
- Exp3 (2P6D KVC): **2733 truncated (61.4%)**, 17 errors, P50=0.08s* (bogus)

Every truncated request has `actual_output_tokens=1` and `finish_reason="abort: session id X does not exist"`.

### The wrong early diagnosis

`RESULTS_SUMMARY.md` previously blamed SGLang's `--disaggregation-decode-allow-local-prefill` flag, on the theory that the D worker still did a local prefill even when `bootstrap_room` was set. That diagnosis is **completely wrong**; see `_should_allow_local_prefill_on_decode` at `scheduler.py:1975-1980`:

```python
def _should_allow_local_prefill_on_decode(self, req: Req) -> bool:
    return (
        self.disaggregation_mode == DisaggregationMode.DECODE
        and self.server_args.disaggregation_decode_allow_local_prefill
        and req.bootstrap_room is None  # ← with a bootstrap_room, local prefill is not taken
    )
```

Requests on the KVC reseed path all carry `bootstrap_room`, so they never trigger local prefill.

### Actual root cause: round-robin mismatch between Replay and the PD Router

The experiment script used `--policy default` for KVC, whereas the baseline used `--policy kv-aware`.
`benchmark.py:287-300` shows how different the two are:

```python
def _decode_policy_for(policy_name: str) -> str:
    if policy_name == "sticky": return "manual"
    if policy_name == "kv-aware": return "consistent_hashing"
    return "round_robin"  # default

def _header_mode_for(policy_name: str) -> str:
    if policy_name == "sticky": return "routing-key"
    if policy_name == "kv-aware": return "target-worker"
    return "none"  # default
```

With the `default` policy + the KVC mechanism:
1. The replay policy (`policies.py:DefaultPolicy`) picks a D by round-robin, say D-3
2. Replay calls `open_session(session_id=X)` on D-3 (`replay.py:1722-1731`)
3. Replay sends the request through the PD Router (with `session_params`), but with `header_mode=none` it **sends no routing header at all**
4. The PD Router (`pd_router.py:_select_decode_index`) sees `decode_policy=round_robin`, does round-robin with **its own independent counter**, and sends the request to D-5
5. D-5's scheduler sees a session_id in `session_params`, but its own `session_controller` has no such session (the session lives on D-3) → abort with `"Invalid request: session id X does not exist"` (`scheduler.py:1824-1836`)

Once the two independent round-robin counters slip out of step even once (any concurrency, or any direct-to-D request that bypasses the router, will cause it), they never line up again.

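The drift is easy to reproduce in isolation. The toy model below is an assumption-laden simplification: two `itertools.count` objects stand in for the replay-side and router-side cursors, and a single direct-to-D request is modeled as one extra advance of the replay counter only.

```python
from itertools import count

NUM_DECODE = 7  # 1P7D

replay_rr = count()  # stands in for the counter inside the replay policy
router_rr = count()  # stands in for the independent counter inside the PD router


def replay_pick() -> int:
    return next(replay_rr) % NUM_DECODE


def router_pick() -> int:
    return next(router_rr) % NUM_DECODE


# While both counters advance together, the two sides agree.
assert replay_pick() == router_pick()

# One request bypasses the router: only the replay counter advances.
replay_pick()

# From here on the counters are offset by one, so every pick disagrees.
mismatches = sum(replay_pick() != router_pick() for _ in range(100))
print(mismatches)  # -> 100
```

Because the offset is constant and not a multiple of the worker count, the two sides never agree again, which matches the "turn 0 fine, later turns aborted" pattern in the data.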

### Why does turn 0 not fail?

Turn 0 goes through `_invoke_plain_router` (`replay.py:1894`) without `session_params` and is handled as an ordinary PD-disaggregated request, so any D will do. Only from turn 1 onward do requests take the KVC path with session_params and run into the routing mismatch.

### Verification from the data (per-session pattern)

```
session 11360 (58 turns): pattern = .TTTTT.TTTTTTT.TTTTTT...   ← turn 0 OK, turns 1+ mostly T
session 18720 (87 turns): pattern = .TTTTTTTTTTTTTTTTTT...
```

Every D worker received requests from all 52 sessions, because round-robin scatters the sessions completely (ideally each D should see only ~7-8 sessions).

### Fix

The only correct fix is to change the KVC policy from `default` to `kv-aware`:

```diff
- --policy default
+ --policy kv-aware
```

With `kv-aware`, three things happen:
1. `KvAwarePolicy` (`policies.py:146-187`) scores every D with `_overlap_blocks` + `sticky_bonus`, so a session naturally sticks to the same D (**session affinity**)
2. `header_mode=target-worker` sends the `x-smg-target-worker` header
3. The PD Router runs in `consistent_hashing` mode and uses the header directly when present, without round-robin

## v3, after switching to the kv-aware policy: routing is right, but a new bottleneck appears

`scripts/sweep_tp1_v3_kvaware.sh` switches all KVC experiments to `--policy kv-aware`. Results:

| Metric | v2 1P7D (default) | **v3 1P7D (kv-aware)** | v3 2P6D | 8-way DP baseline |
|------|:---:|:---:|:---:|:---:|
| Truncated | 56.8% | **0.9%** | 0.9% | 1.5% |
| Errors | 18 | 363 (8.2%) | 9 | 0 |
| Mean | 4.74s | 4.88s | 3.58s | 1.43s |
| P50 | 0.08s* (bogus) | 1.75s | 1.52s | 0.65s |
| P90 | 12.14s | 12.67s | 9.23s | 3.61s |
| TTFT P50 | - | 0.36s | 0.33s | 0.09s |

✅ **Truncation dropped from 56.8% to 0.9%; the routing problem is fully solved.**
❌ But P50 is still 2-3× the baseline.

### The direct-to-D path performs very well (what KVC is supposed to look like)

Broken down by execution_mode:

| Path | Exp1 1P7D share | Exp1 1P7D P50 | Exp1 1P7D TTFT P50 |
|------|:---:|:---:|:---:|
| `kvcache-direct-to-d-session` ✨ | 42.0% | **0.495s** | **0.043s** |
| `pd-router-fallback-large-append-session-cap` 🔥 | **52.6%** | 5.6s | 3.7s |

On the direct-to-D path:
- P50 = 0.495s (**25% faster than the baseline's 0.65s**)
- TTFT P50 = 0.043s (**2× faster than the baseline's 0.093s**)
- KV transfer = 0 (no P involvement, pure append-prefill on D)

This is the real value of KVC. But only 30-42% of requests reach this path.

### New bottleneck: session-cap fallback accounts for 52-65%

`pd-router-fallback-large-append-session-cap` accounts for 52.6% of 1P7D and 65.4% of 2P6D. This path means the router wanted to open a new session on D, but admission refused ("d-session-cap"), so it fell back to the plain router (full prefill on P + transfer to D, with no session reuse).

### Bimodal session distribution (starvation)

| Session | Total turns | Direct-to-D | Session-cap fallback |
|---------|:---:|:---:|:---:|
| 22080 | 129 | **98%** | 0% |
| 3840 | 118 | **97%** | 0% |
| 70560 | 150 | **0%** | **99%** |
| 39360 | 148 | **0%** | **99%** |
| 61600 | 117 | **0%** | **99%** |

Sessions are either entirely lucky or entirely starved: a textbook bimodal distribution.

### Root cause: hard-coded cap=4

The original code of `replay.py:_decode_session_soft_cap`:

```python
def _decode_session_soft_cap(...) -> int:
    target_tokens = max(1, _estimate_session_resident_tokens(request))
    usable_capacity_tokens = _usable_capacity_tokens(residency, server_url)
    ...
    if usable_capacity_tokens <= 0:
        return 4
    return max(1, min(4, usable_capacity_tokens // target_tokens))
    #                 ^^^ hard-coded upper bound of 4
```

7 Ds × at most 4 sessions per D = **28 session slots in total**. The trace has 52 sessions → 24 sessions can never get a slot.

A startup race decides which sessions are the "lucky ones": for the first 28 sessions to squeeze in, every later turn takes direct-to-D (fast); the remaining 24 sessions take session-cap fallback forever (slow).

## v4 improvement: raising the hard cap from 4 to 16

A two-line change in `replay.py:_decode_session_soft_cap`:

```diff
-    if usable_capacity_tokens <= 0:
-        return 4
-    return max(1, min(4, usable_capacity_tokens // target_tokens))
+    if usable_capacity_tokens <= 0:
+        return 16
+    return max(1, min(16, usable_capacity_tokens // target_tokens))
```

7 Ds × 16 = 112 slots, far more than the 52 sessions need.

### v4 actual results (vs v3 1P7D / 2P6D)

| Metric | v3 1P7D | **v4 1P7D** | v3 2P6D | **v4 2P6D** | baseline 8DP |
|------|:---:|:---:|:---:|:---:|:---:|
| Errors | 363 (8%) | 435 (10%) | 9 (0%) | **403 (9%)** | 0 |
| Truncated | 42 | 43 | 42 | 36 | 68 |
| **direct-to-D** | 38.6% | **54.3%** | 30.5% | **58.0%** ⭐ | - |
| **session-cap fallback** | 48.3% | 37.4% | 65.4% | **34.7%** | - |
| Session reused | 1716 | 2180 | 1358 | **2348** | - |
| KV transfer blocks | 62K | 53K | 79K | **51K** | - |
| Mean | 4.88s | 4.21s | 3.58s | **2.51s** | 1.43s |
| **P50** | 1.75s | 1.08s | 1.52s | **0.84s** | **0.65s** |
| P90 | 12.67s | 13.38s | 9.23s | **6.51s** | 3.61s |
| P99 | 28.72s | 24.45s | 18.70s | 18.34s | 8.38s |
| **TTFT P50** | 0.36s | 0.056s | 0.33s | **0.051s** ⭐ | 0.094s |
| TTFT P90 | 10.97s | 11.90s | 6.95s | **2.64s** | 0.26s |

✓ direct-to-D share rose from 30-38% in v3 to 54-58% in v4
✓ Session reuse +27% (1P7D) / +73% (2P6D)
✓ KV transfer volume -15% (1P7D) / -36% (2P6D)
✓ TTFT P50 beats the baseline by 46% (0.051s vs 0.094s)

### The direct-to-D path beats the baseline across the board (the real value of KVC)

| Config | n | Lat P50 | Lat P90 | TTFT P50 | TTFT P90 |
|--------|:---:|:---:|:---:|:---:|:---:|
| baseline 8DP | 4381 | 0.66s | 3.65s | 0.094s | 0.256s |
| v4 1P7D direct-to-D | 2179 | 0.495s | 3.03s | 0.044s | 0.055s |
| **v4 2P6D direct-to-D** | **2348** | **0.499s** | **2.86s** | **0.043s** | **0.054s** |

The direct-to-D subset relative to the baseline:
- P50 24-30% faster
- P90 16-22% faster
- TTFT P50 54% faster
- TTFT P90 79% faster

### Overall performance (errors and truncated requests removed) vs baseline

| Config | clean | Mean | P50 | P90 | P99 |
|--------|:---:|:---:|:---:|:---:|:---:|
| baseline 8DP | 4381 | 1.45s | 0.66s | 3.65s | 8.38s |
| v4 2P6D | 4010 | 2.53s | 0.85s | 6.55s | 18.33s |

vs baseline: P50 28% slower, P90 80% slower, P99 119% slower. Even with an error rate of 0, the overall result still loses to the baseline; the root cause is that 35% of requests are pushed onto the fallback path.

### New bottleneck 1: 35% of requests still take session-cap fallback

After raising the cap to 16, the real bottleneck is the capacity-based computation `min(16, usable_capacity_tokens // target_tokens)`:
- `target_tokens = input + output`, commonly 50-100K in agentic workloads
- D's KV pool ≈ 100-150K tokens (80GB H100, mem_fraction=0.835)
- `usable / target` = 1-2, nowhere near 16 → the effective cap is the small number produced by the capacity computation

Fixing this requires changing the capacity-based estimate (or adopting Option D and letting D decide for itself).

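The arithmetic can be checked with a simplified stand-in for the cap computation. The function below mirrors only the final `max(1, min(cap, usable // target))` expression; its name, signature, and the specific pool and footprint sizes are assumptions picked from the ranges quoted in this section.

```python
def soft_cap(usable_capacity_tokens: int, target_tokens: int, hard_cap: int) -> int:
    """Simplified stand-in for the capacity-based session cap."""
    target_tokens = max(1, target_tokens)
    if usable_capacity_tokens <= 0:
        return hard_cap
    return max(1, min(hard_cap, usable_capacity_tokens // target_tokens))


# A 125K-token pool with an 80K footprint: one session fits, whatever the hard cap.
print(soft_cap(125_000, 80_000, hard_cap=4))   # -> 1
print(soft_cap(125_000, 80_000, hard_cap=16))  # -> 1

# Even at a 60K footprint only two sessions fit.
print(soft_cap(125_000, 60_000, hard_cap=16))  # -> 2
```

Raising the hard cap from 4 to 16 changes nothing for these inputs, because the capacity term is already the binding constraint.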

### New bottleneck 2: 9-10% errors (mooncake transfer timeouts)

The P-side log shows:

```
KVTransferError: Failed to send kv chunk of <bootstrap_room> to 10.45.7.165:40319
Sync batch data transfer timeout after 32722558107ns   (a 32-second timeout)
Decode instance could be dead, remote mooncake session ... is not alive
```

Characteristics:
- All errors appear after the 44.8% mark of the run (system pressure accumulates)
- 98% of errors are concentrated in turns ≥ 31 (requests with large inputs)
- With v3 cap=4, 1P7D already had 363 errors (concentrated on a single D); v4 cap=16 spreads the pressure evenly but at a larger scale

This is mooncake TCP loopback hitting timeouts once concurrency rises, **not an SGLang logic bug**. Fix directions:
1. Lengthen the mooncake transfer timeout (currently 32s)
2. Limit the number of concurrent inflight transfers
3. Switch to RDMA (loopback is single-node emulation; production would use real RDMA)
4. Chunked KV transfer

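Of these directions, limiting concurrent inflight transfers is the easiest to prototype. The sketch below shows the general pattern with `asyncio.Semaphore`; `run_transfers` and `fake_transfer` are hypothetical names, and the sleep stands in for a real KV transfer, so this says nothing about how mooncake itself would be throttled.

```python
import asyncio


async def run_transfers(jobs, max_inflight: int):
    """Run transfer coroutines with at most max_inflight active at once."""
    sem = asyncio.Semaphore(max_inflight)
    inflight = peak = 0

    async def guarded(job):
        nonlocal inflight, peak
        async with sem:
            inflight += 1
            peak = max(peak, inflight)  # record the highest concurrency seen
            try:
                return await job()
            finally:
                inflight -= 1

    results = await asyncio.gather(*(guarded(j) for j in jobs))
    return results, peak


async def fake_transfer():
    await asyncio.sleep(0.01)  # placeholder for one KV transfer
    return "ok"


results, peak = asyncio.run(run_transfers([fake_transfer] * 8, max_inflight=2))
print(len(results), peak)  # -> 8 2
```

The limit trades throughput for predictability: transfers queue at the sender instead of timing out at the receiver, so the right value would have to be found experimentally.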

## v5 implements Option D: worker-mode drives seed/reseed

`scripts/sweep_tp1_v5_optD.sh` actually lands Option D in code. The core change: switch `--kvcache-admission-mode` from `local` (replay estimates) to `worker` (D decides), and extend it to **all of the direct_append + seed + reseed paths**.

### Key code changes

1. SGLang side: the `admit_direct_append` endpoint in `scheduler.py` gains a `mode` field supporting `direct_append | seed`; seed mode makes D actually reserve KV pool blocks and proactively call `maybe_trim_decode_session_cache` to run LRU.
2. Replay side: in `replay.py`, reseed / turn-1 seed / large-append-reseed all go through the same admit endpoint; `_decode_session_soft_cap` is bypassed entirely in worker mode.
3. New run-time flags: `--kvcache-admission-mode worker`, `--kvcache-seed-min-turn-id 1`, `--kvcache-seed-max-inflight-decode -1`, `--kvcache-prefill-backup-policy release-after-transfer`, `--kvcache-prefill-priority-eviction`.

### Hypotheses

- The 35% session-cap fallback in v4 comes from a stale replay view + a conservative capacity-based computation → letting D look at its own KV pool should win that 35% back.
- D's proactive LRU eviction is more accurate than the reservation logic written in replay, and **should** let more sessions seed in.

### v5 actual results (vs v4, same configurations)

| Metric | v4 1P7D | **v5 1P7D** | v4 2P6D | **v5 2P6D** | baseline 8DP |
|------|:---:|:---:|:---:|:---:|:---:|
| Errors | 435 (10%) | **9 (0.2%)** ⭐ | 403 (9%) | **9 (0.2%)** ⭐ | 0 |
| Truncated | 43 | 42 | 36 | 42 | 68 |
| direct-to-D | 54.3% | 44.7% ↓ | 58.0% | 41.3% ↓ | - |
| **session-cap fallback** | 37.4% | **45.6%** ↑ | 34.7% | **50.6%** ↑ | - |
| no-d-capacity fallback | 0.3% | 1.2% | 0.2% | 0.8% | - |
| pd-router-turn1-seed (newly visible) | - | 1.2% | - | 1.1% | - |
| pd-router-d-session-reseed (newly visible) | - | 4.8% | - | 3.4% | - |
| pd-router-large-append-reseed (newly visible) | - | 1.0% | - | 1.0% | - |
| Session reused | 2180 | 1990 | 2348 | 1837 | - |
| KV transfer blocks | 53K | 66K | 51K | 69K | - |
| Mean | 4.21s | 5.18s | 2.51s | 3.49s | 1.45s |
| **P50** | 1.08s | 1.59s | 0.84s | 1.31s | 0.66s |
| P90 | 13.38s | 14.67s | 6.51s | 9.09s | 3.65s |
| P99 | 24.45s | 26.09s | 18.34s | 24.92s | 8.38s |
| TTFT P50 | 0.056s | 0.21s | 0.051s | 0.24s | 0.094s |
| TTFT P90 | 11.90s | 13.06s | 2.64s | 6.90s | 0.26s |

✅ **Reliability improved sharply**: mooncake transfer-timeout errors fell from 9-10% to 0.2%. Deciding on D's real capacity avoids v4's death link of "optimistic admit → timeout 30s later".
✅ The reseed / turn1-seed paths show up explicitly for the first time, confirming that the admission endpoint really takes effect in seed mode.
❌ **session-cap fallback went up instead of down** (37→46% and 35→51%). This shows that v4's local soft_cap was actually **more optimistic than D's real capacity**: sessions were admitted and then immediately hit OOM, which was counted as an error rather than a fallback.
❌ Direct result: **the direct-to-D share dropped and overall latency got worse across the board**. P50/P90/P99 and TTFT all regressed.

### The direct-to-D subset is still solid (the real value of KVC remains)

| Config | n | Lat P50 | Lat P90 | TTFT P50 | TTFT P90 |
|--------|:---:|:---:|:---:|:---:|:---:|
| baseline 8DP | 4381 | 0.66s | 3.65s | 0.094s | 0.256s |
| v4 2P6D direct-to-D | 2348 | 0.499s | 2.86s | 0.043s | 0.054s |
| **v5 1P7D direct-to-D** | 1990 | 0.475s | 3.04s | 0.043s | 0.055s |
| **v5 2P6D direct-to-D** | 1837 | 0.483s | 3.04s | 0.043s | 0.054s |

The tail latency and TTFT of direct-to-D are almost identical to v4 (the endpoint decision overhead is negligible). **The v5 regression is not the path itself getting slower; more requests are being pushed to fallback.**

### The fallback path is even worse than in v4

| Config | n | Lat P50 | Lat P90 | TTFT P50 |
|--------|:---:|:---:|:---:|:---:|
| v5 1P7D session-cap fallback | 2027 | 6.38s | 17.47s | 4.49s |
| v5 2P6D session-cap fallback | 2253 | 3.13s | 11.25s | 0.89s |

Because the fallback share rose, and this path is already an order of magnitude slower than direct-to-D, the overall mean is dragged down even further.

### The bottleneck v5 really exposes: the physical capacity of D's KV pool

After handing the admission decision to D, the bottleneck changed from "replay's estimate is too rigid" to "D really cannot fit it":

- 80GB H100 × `mem_fraction_static=0.835` → single-GPU KV pool on D ≈ 100-150K tokens
- A long-context agentic session has a per-turn footprint of 50-100K
- A single D can only hold 2-3 sessions at once → fitting 50 sessions on 7 Ds is basically impossible

Part of why v4's cap=16 "looked good" is that the local soft_cap never really checked D's free pool and opened a pile of sessions that **would eventually fail** (counted as errors rather than fallback). v5 turned those into "honest rejections": the price of the jump in reliability is that the real capacity ceiling became visible.

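### What v6 should target

Open up physical capacity management on D instead of tuning replay further:

1. **Release the prefill backup earlier** (`release-after-transfer` has been added but may still not be prompt enough), so that backup blocks on P do not occupy the KV pool for long.
2. **Tune the priority eviction policy** (`--kvcache-prefill-priority-eviction` is already enabled): the current LRU may evict hot sessions by mistake; eviction should be weighted by per-session hit frequency and recency.
3. **Chunked / streamed seed**: do not reserve capacity for the whole prompt at once; spread it across chunks.
4. **Cross-D session migration**: when one D is full but a neighboring D has room, migrate proactively instead of falling back straight to P.
5. **Real multi-node RDMA**: single-node mooncake loopback is one of the root causes of the errors; only multi-node with RDMA makes KV transfer after prefill backup release truly stable.

Engineering effort: items 1-3 are changes inside SGLang (`scheduler.py` + `session_controller.py`), item 4 needs a router protocol extension, and item 5 is a deployment change.
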
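
Item 2 above asks for eviction weighted by hit frequency and recency. One possible shape for such a score is sketched below. Everything here is hypothetical: `SessionStats`, `eviction_score`, the half-life, and the log weighting are illustrative choices, not the behavior of `--kvcache-prefill-priority-eviction` or of any SGLang code.

```python
import math
from dataclasses import dataclass


@dataclass
class SessionStats:
    session_id: str
    hits: int             # number of reuses observed so far (assumed counter)
    last_access_s: float  # timestamp of the most recent access


def eviction_score(s: SessionStats, now_s: float, half_life_s: float = 60.0) -> float:
    """Higher score means more worth keeping; the lowest score is evicted first."""
    age_s = max(0.0, now_s - s.last_access_s)
    recency = 0.5 ** (age_s / half_life_s)      # decays from 1.0 toward 0.0
    frequency = 1.0 + math.log1p(s.hits)        # diminishing returns on hit count
    return frequency * recency


def pick_victim(sessions: list[SessionStats], now_s: float) -> SessionStats:
    """Return the session with the lowest keep-score."""
    return min(sessions, key=lambda s: eviction_score(s, now_s))
```

With a hot session last touched 5s ago and a never-reused session touched 1s ago, plain LRU would evict the hot one, while this score evicts the never-reused one. Whether that trade-off helps on the real trace would still need to be measured.
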
## Index of key files and code locations

| Phenomenon | Code location |
|------|----------|
| Replay policy round-robin | `policies.py:63-67` `RoutingState.next_decode_worker_id` |
| KV-aware policy (session affinity) | `policies.py:146-187` `KvAwarePolicy.select` |
| PD router decode selection | `pd_router.py:51-74` `_select_decode_index` |
| Header construction | `replay.py:2407-2424` `_build_headers` |
| Policy → router config mapping | `benchmark.py:287-300` `_decode_policy_for/_header_mode_for` |
| Session admission soft cap | `replay.py:889-905` `_decode_session_soft_cap` |
| Existing D-side admission endpoint | `scheduler.py:3497-3580` `admit_direct_append` (extended in v5 to support `mode=seed`) |
| Worker-mode admission callers | `replay.py` reseed / turn1-seed / large-append-reseed paths |
| Prefill backup release policy (introduced in v5) | `--kvcache-prefill-backup-policy release-after-transfer` |
| Prefill priority eviction (introduced in v5) | `--kvcache-prefill-priority-eviction` |
| Error when a session is not found on D | `scheduler.py:1824-1836` |
| `_should_allow_local_prefill_on_decode` | `scheduler.py:1975-1980` |
| Reseed flow entry point | `replay.py:1665-1809` `_invoke_kvcache_seeded_router` |
| Direct-to-D flow | `replay.py:2351-2398` `_invoke_decode_session_direct` |

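## Lessons learned

1. **Policy and mechanism are two orthogonal dimensions.** `--policy default` is not a harmless default; it really is round-robin with no session affinity. The KVC mechanism must be paired with a session-affinity policy.

2. **Do not blindly trust the previous agent's RESULTS_SUMMARY.** The v2 diagnosis ("local prefill bug") and the actual finish_reason ("session id does not exist") do not match at all. Any error diagnosis must be cross-checked against raw fields such as finish_reason and execution_mode.

3. **A bimodal distribution is a strong signal of starvation.** In the v3 data some sessions take the fast path 100% of the time and others take the slow path 100% of the time, which almost certainly indicates some first-come-first-served resource contention. When you see this pattern, look immediately for a hard-coded cap or a globally shared resource.

4. **Measure by group, not by overall mean.** v3's overall P50=1.5s looks slower than baseline, but the direct-to-D subset alone has P50=0.495s, already ahead of baseline. The overall mean is dragged down by the fallback path, but KVC's core value is real.

5. **Errors and fallbacks are two faces of the same resource pressure.** v4's "low fallback rate + high error rate" is not a better operating point; it disguises over-capacity failures as "timeout failures" instead of "explicit rejections". After v5 hands the decision to real capacity, fallbacks rise and errors fall. This is the more honest metric; do not be misled by v4's fallback number. When error rate and fallback rate are inversely correlated, suspect that the admission decision is lying.
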
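
Lesson 4 (measure by group) can be made routine with a small helper that splits per-request rows by execution mode before computing percentiles. This is a minimal sketch: `execution_mode` is the raw field named in lesson 2, while the `latency_s` key and the nearest-rank percentile are assumptions made for illustration.

```python
import math
from collections import defaultdict


def percentile(values: list[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_by_mode(rows: list[dict]) -> dict[str, dict]:
    """Group rows by execution_mode and report n / p50 / p90 per group."""
    groups: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        groups[row["execution_mode"]].append(row["latency_s"])
    return {
        mode: {"n": len(v), "p50": percentile(v, 50), "p90": percentile(v, 90)}
        for mode, v in groups.items()
    }


# Synthetic rows: a fast path and a slow path mixed in one run.
rows = [
    {"execution_mode": "direct", "latency_s": x} for x in (0.4, 0.5, 0.6)
] + [
    {"execution_mode": "fallback", "latency_s": x} for x in (3.0, 6.0, 9.0)
]
per_mode = latency_by_mode(rows)
overall_p50 = percentile([r["latency_s"] for r in rows], 50)
```

On these synthetic rows the overall figure hides the fact that the two paths differ by an order of magnitude, which is the pattern described in lessons 3 and 4.
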
514
docs/REAL_ALI_KVC_EXPERIMENT_LOG_ZH.md
Normal file
@@ -0,0 +1,514 @@
# Real Ali KVC experiment log

**Branch**: `kvc-real-ali-iter-v1`, checked out from `kvc-debug-journey-v1-to-v4`.
**Date**: 2026-05-11/12.
**Environment**: single node with 8x NVIDIA H20, SGLang xPyD, model `/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct`.
**Real trace**: `/home/admin/cpfs/wjh/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl`.

This log records the KVC pd-hybrid iterations on the real Ali workload. Conclusions hold only as far as the current evidence supports; the `time-scale=10` smoke runs and the KVC-friendly slice are not used as full-workload headline numbers.

## 1. Latest progress

A fixed-sample and sweep pipeline for the real Ali trace has been added:

- `scripts/prepare_real_ali_samples.py`: generates reproducible experiment samples from the real Ali trace, preserving the real input/output/hash_ids/timestamp, with optional timestamp rebasing.
- `scripts/sweep_real_ali_kvc.sh`: runs DP cache-aware, PD-disaggregation, KVC, and KVC+backpressure in sequence on the same prebuilt sample.
- `benchmark-live --use-trace-as-sample`: replays the given trace directly, so that different policies are not made incomparable by re-sampling.
- `replay-progress.jsonl` heartbeat: long runs now write client-side progress every 30s without polling `/server_info`, to avoid perturbing the scheduler.
- `prepare_real_ali_samples.py --max-sampled-duration-s`: generates a capped sample for quick smoke runs; for iteration only, not for headline numbers.

Completed real Ali KVC-fit smoke:

- Sample: `outputs/real-ali-kvc-iter/samples-balanced/ali-kvc-fit-smallappend.jsonl`
- 179 requests, 64 sessions, all multi-turn; 115 turn2+ requests, direct-eligible ratio 100%.
- `time-scale=10`, concurrency 32.
- DP cache-aware, PD-disaggregation, KVC no-backpressure, and KVC+backpressure have all completed.

## 2. Full Ali trace profile

`outputs/real-ali-kvc-iter/ali-full-profile.json` shows:

| Metric | Value |
|---|---:|
| requests | 763,727 |
| sessions | 555,905 |
| multi-turn sessions | 39,247 |
| turn2+ requests | 207,822 |
| turn2+ direct-eligible ratio | 82.95% |
| input p50 / p90 / p99 | 4,329 / 51,067 / 112,955 tokens |
| output p50 / p90 / p99 | 93 / 826 / 5,616 tokens |
| append p50 / p90 / p99 | 303 / 2,879 / 17,885 tokens |
| inter-turn gap p50 / p90 / p99 | 4.65s / 38.68s / 1,133s |

This profile shows that KVC has a real area of applicability: hash overlap and small appends are common on turn2+. However, single-turn sessions dominate the full workload, so KVC's benefit will be significantly diluted; results must therefore be reported per slice, not only on the KVC-fit subset.

## 3. Samples run so far

### Continuous 15min cold-window session sample

Path: `outputs/real-ali-kvc-iter/samples-window-900s-600req/ali-window.jsonl`

- 600 requests, 439 sessions, 32 multi-turn sessions.
- Rebased duration: 886.544s, covering about 15min.
- turn2+ requests: 161; direct-eligible: 143; ratio 88.8%.
- input p50 / p90 / p99: 3,871 / 68,234 / 98,131.
- output p50 / p90 / p99: 85 / 712 / 5,195.
- append p50 / p90 / p99: 274 / 2,202 / 16,120.
- inter-turn gap p50 / p90 / p99: 4.656s / 19.376s / 63.575s.

This sample replaces the 179-request KVC-fit smoke for validation. It divides the 900s window into 15 time buckets and rotates across them, picking whole sessions that start from their root inside the window, until 600 requests are reached. This prevents missing parents from making `load_trace()` fragment real sessions, and spreads requests over the whole 15min instead of taking only the first 600 requests of the window.

Important caveat: this is a **cold-window / new-session-only** sample, not a complete raw production window; it excludes ongoing sessions that were already active before the window started. It can therefore be used for stability validation under "600+ requests, 15min, real mixed load", but cannot by itself represent the full Ali production window.

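
The bucketed selection described for this sample can be sketched as follows. This is an illustration of the idea only, not the code in `scripts/prepare_real_ali_samples.py`: the session dict layout (`start_s`, `requests`) and the function name are assumptions.

```python
from collections import deque


def sample_sessions_round_robin(
    sessions: list[dict],
    window_s: float = 900.0,
    buckets: int = 15,
    target_requests: int = 600,
) -> list[dict]:
    """Pick whole sessions, rotating across time buckets, until the target is met.

    Each session is {"start_s": root-turn time relative to window start,
                     "requests": list of its requests}.
    Sessions whose root falls outside the window are skipped, so no picked
    session has a missing parent.
    """
    width = window_s / buckets
    queues = [deque() for _ in range(buckets)]
    for s in sorted(sessions, key=lambda s: s["start_s"]):
        if 0 <= s["start_s"] < window_s:
            queues[int(s["start_s"] // width)].append(s)

    picked, total = [], 0
    while total < target_requests and any(queues):
        for q in queues:
            if q and total < target_requests:
                s = q.popleft()
                picked.append(s)
                total += len(s["requests"])
    return picked


demo = [{"start_s": t, "requests": [0, 1]} for t in (0, 10, 20, 30, 850, 1000)]
chosen = sample_sessions_round_robin(demo, target_requests=4)
```

Because whole sessions are taken, the final request count can overshoot the target slightly; a real implementation would need a rule for that case.
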
### KVC-fit small append

Path: `outputs/real-ali-kvc-iter/samples-balanced/ali-kvc-fit-smallappend.jsonl`

- 179 requests, 64 sessions.
- input p50 / p90: 6,446 / 15,491.
- output p50 / p90: 112 / 1,159.
- append p50 / p90: 215 / 855.
- overlap ratio p50 / p90: 0.875 / 0.938.

This is the KVC-friendly slice, used to verify the mechanism's upper bound and whether the microbenchmark results carry over to real token/hash sequences.

### Representative-mt / early multi-turn balanced

Path: `outputs/real-ali-kvc-iter/samples-balanced/ali-representative-mt.jsonl`

- 460 requests, 64 sessions.
- input p50 / p90: 41,175 / 98,621.
- append p50 / p90 / p99: 272 / 1,979 / 13,900.

This sample is closer to real multi-turn pressure and will be used to verify stability under large contexts and large resident KV. Its current implementation, however, "takes the earliest 64 multi-turn sessions after start_time"; it is not strictly random or stratified-representative. Official headline numbers need sampling stratified by input/append/output/gap.

### Capped smoke samples

To keep a few real long gaps from wasting large amounts of wall time in smoke runs, the following were added:

- `outputs/real-ali-kvc-iter/samples-balanced-cap120s/ali-kvc-fit-smallappend.jsonl`: 177 requests, 64 sessions, duration 65.859s.
- `outputs/real-ali-kvc-iter/samples-balanced-cap120s/ali-representative-mt.jsonl`: 359 requests, 64 sessions, duration 117.366s.

These samples drop the two requests at the end of the original KVC-fit sample, at timestamps 3613s and 5414s, so they are suitable only for fast engineering iteration; formal comparisons should still use the full samples or a real continuous window.

## 4. Current results

### 4.1 DP cache-aware vs KVC+backpressure, KVC-fit, time-scale=10

| Policy | Requests | Errors | Trunc | E2E mean | E2E p50 | E2E p90 | E2E p99 | TTFT mean | TTFT p50 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 8-way DP cache-aware | 179 | 0 | 0 | 6.603s | 3.126s | 17.639s | 34.582s | 1.112s | 1.052s |
| KVC 2P6D + worker admission + backpressure | 179 | 0 | 0 | 4.443s | 2.076s | 13.288s | 21.202s | 0.700s | 0.154s |

Paired comparison (KVC - DP):

- overall E2E mean delta: -2.161s; p50 delta: -1.427s; 152/179 wins.
- turn2+ direct subset: mean delta -2.503s; p50 delta -1.508s; 103/115 wins.
- turn2+ TTFT mean delta: -0.930s; p50 delta -0.887s.

Execution paths:

- KVC turn1 seed: 64 requests.
- `kvcache-direct-to-d-session`: 115 requests.
- session reused: 115.
- actual KV transfer blocks: 623.

Structural logs:

- admission probes: 179, all `ok`.
- transfer queue depth: p50=0, p90=2, max=3.
- backpressure events: 0.

Interpretation: this run shows a positive signal for **KVC direct-to-D / session reuse** on the real Ali KVC-fit slice; it does not show that backpressure works, because backpressure was never triggered.

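
The paired comparison reported above can be reproduced from two per-request latency maps. The sketch below assumes that "p50 delta" means the median of per-request deltas (which is consistent with the table, since the difference of the two p50 columns is about -1.05s rather than -1.427s), and that a win is a strictly negative delta; the function and key names are illustrative.

```python
from statistics import mean, median


def paired_delta(kvc: dict, dp: dict) -> dict:
    """Compare latencies request by request.

    kvc / dp map request_id -> latency in seconds, or None for an errored request.
    Only requests with a latency on both sides are compared.
    """
    shared = [
        rid for rid in kvc
        if rid in dp and kvc[rid] is not None and dp[rid] is not None
    ]
    deltas = [kvc[rid] - dp[rid] for rid in shared]
    return {
        "n": len(deltas),
        "mean_delta": mean(deltas),
        "p50_delta": median(deltas),
        "wins": sum(d < 0 for d in deltas),  # KVC strictly faster
    }


kvc_lat = {"a": 1.0, "b": 2.0, "c": None, "d": 5.0}
dp_lat = {"a": 2.0, "b": 1.5, "c": 3.0, "d": 9.0}
result = paired_delta(kvc_lat, dp_lat)
```

Dropping requests that errored on either side keeps the deltas comparable, but it also means error counts have to be reported separately, as the tables in this section do.
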
### 4.2 PD-disaggregation baseline, KVC-fit, time-scale=10

| Policy | Requests | Errors | Trunc | E2E mean | E2E p50 | E2E p90 | E2E p99 | TTFT mean | TTFT p50 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| PD-disaggregation 2P6D | 179 | 0 | 0 | 7.850s | 6.306s | 15.192s | 22.405s | 4.994s | 5.336s |

Paired comparison (PD - DP):

- overall E2E mean delta: +1.247s.
- p50 delta: +2.231s.
- 46/179 faster, 133/179 slower.

Interpretation: on this KVC-fit slice, plain PD-disaggregation is clearly weaker than 8-way DP cache-aware. It pays for the P->D transfer and split scheduling without getting the bypass benefit of KVC direct-to-D.

### 4.3 KVC no-backpressure ablation, KVC-fit, time-scale=10

| Policy | Requests | Errors | Trunc | E2E mean | E2E p50 | E2E p90 | E2E p99 | TTFT mean | TTFT p50 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| KVC 2P6D worker admission, no backpressure | 179 | 0 | 0 | 4.404s | 1.936s | 13.200s | 21.326s | 0.604s | 0.139s |

Paired comparison:

- KVC no-BP vs DP: mean delta -2.200s, p50 delta -1.434s, 153/179 wins.
- KVC no-BP vs PD-disaggregation: mean delta -3.447s, p50 delta -3.514s, 163/179 wins.
- KVC no-BP vs KVC+BP: mean delta -0.039s, p50 delta -0.005s, 92/179 wins.

Structural analysis:

- direct-to-D rate: 64.25%.
- admission probes: 179, all `ok`.
- transfer queue depth: p50=0, p90=2, max=3.
- pause_ms always 0, backpressure events 0.

Interpretation: no-backpressure and +backpressure are nearly equivalent, which shows that this slice puts no pressure on D; the gain in this round comes from direct-to-D, not from backpressure.

### 4.4 Continuous 15min / 600-request window, time-scale=1

Sample: `outputs/real-ali-kvc-iter/samples-window-900s-600req/ali-window.jsonl`

Important caveat: this is a cold-window / new-session-only session sample, not a complete raw window. It covers about 15min and has `missing_parent_count=0`, but excludes ongoing sessions that were already active before the window started.

Results:

| Policy | Requests | Errors | Trunc | E2E mean | E2E p50 | E2E p90 | E2E p99 | TTFT mean | TTFT p50 | TTFT p90 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| DP cache-aware 8-way | 600 | 1 | 0 | 13.942s | 5.222s | 29.299s | 151.183s | 6.162s | 1.746s | 19.176s |
| PD-disaggregation 2P6D | 600 | 1 | 0 | 40.886s | 40.018s | 84.681s | 113.460s | 38.545s | 37.782s | 81.852s |
| KVC 2P6D mem_fraction_static=0.82 | 600 | 53 | 0 | 12.386s | 4.225s | 37.998s | 78.234s | 10.078s | 2.674s | 27.774s |

KVC failed to start with default settings:

- With defaults, KVC 2P6D hit OOM at startup twice on H20 and never reached replay.
- Logs show only about 526MB left when the decode/prefill workers started; the OOM occurred during model loading.
- `--load-format layered` does not support Qwen3-Coder-30B-A3B.
- With `--mem-fraction-static 0.82` KVC starts and completes the replay, but this reduces KV pool capacity, so this KVC run is a memory-constrained rerun.
- With `KVC_SEED_MIN_TURN_ID=2` + `mem_fraction_static=0.82`, the scheduler was SIGKILLed during startup, presumably by the OS OOM killer, and never reached replay.

Paired comparison (computed only over the 547 paired requests that have a latency on both sides):

- KVC vs DP: mean delta -1.335s, p50 delta -0.055s, p90 delta +19.371s, 284 wins / 263 losses.
- KVC vs PD: mean delta -28.341s, p50 delta -25.687s, p90 delta +2.834s, 465 wins / 82 losses.

KVC structural data:

- execution modes: 388 `pd-router-turn1-seed`, 90 `kvcache-direct-to-d-session`, 67 `pd-router-fallback-large-append-session-cap`, 1 `pd-router-large-append-reseed`, 1 `pd-router-turn1-d-backpressure`, 53 `kvcache-centric` error rows.
- direct-to-D rate: 15.0%.
- direct-to-D distribution by session: 413/439 sessions have a direct rate of 0-20%; only 6 sessions are at 80-100%.
- admission probes: 533; reason `ok` 531, `no-space` 2; queue depth p50=0, p90=2, max=5.
- the pause hint was non-zero 20 times, but there were no backpressure events because this run was no-BP.

KVC error breakdown:

- 50 `ReadTimeout`.
- 2 `HTTPStatusError 400 Bad Request` on `open_session`.
- 1 context length error: the same `input_length=310521 > 262144` seen under DP/PD.
- Errors are concentrated on turn1: 50 on turn1, 3 on turn2+.

Interpretation:

1. KVC is still clearly better than plain PD, which shows that plain P->D disaggregation is very costly on the real 600-request window.
2. Relative to DP, KVC shows only a small positive signal on the mean/p50 of clean requests, while p90 gets worse and error_count rises from DP's 1 to 53.
3. On this 600-request / 15min window, therefore, **KVC cannot be considered a stable system-level improvement**. The main problem is not that the direct-to-D fast path is ineffective, but that it covers only 15% of requests, and that turn1 seed / session admission / the memory-constrained KV pool introduce many timeouts.
4. This directly revises the conclusion of the 179-request KVC-fit smoke: the small sample shows that a slice where KVC applies exists; the 600-request mixed window shows that the current implementation cannot yet serve a real mixed workload stably.

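## 5. Has KVC already improved over pd-colocation/pd-disaggregation?

Only the following qualified conclusions can be drawn at this point:

1. **Versus PD-disaggregation: a clear improvement.**
   On the KVC-fit slice: PD-disaggregation p50 6.306s, KVC no-BP p50 1.936s, KVC+BP p50 2.076s; TTFT p50 5.336s vs 0.139s/0.154s. The gain comes mainly from turn2+ requests going straight to the existing D session, avoiding a full prefill on P and a P->D KV transfer on every turn.

2. **Versus the strong DP cache-aware baseline: an improvement on the KVC-fit slice.**
   KVC no-BP and KVC+BP both beat DP on overall mean/p50/p90/p99, with paired wins of 153/179 and 152/179 respectively. But this is a KVC-friendly, all-multi-turn slice where turn2+ is 100% direct-eligible; it does not represent the full Ali workload.

3. **Versus the full workload: not yet demonstrated.**
   In the full Ali trace single-turn sessions are the majority, and long contexts and long-tail outputs are common. KVC's benefit will be diluted by single-turn traffic, and D's resident KV capacity and tail stability become stronger constraints.

4. **On the 600-request / 15min mixed window: no stable improvement yet.**
   KVC's clean E2E mean/p50 show a positive signal, but error_count=53/600 and the paired p90 delta versus DP is worse. By the "E2E + error/truncation" standard this does not count as a system-level win.
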
## 6. Where the improvement comes from

Main benefit chain:

1. The turn1 seed establishes a session on D.
2. On turn2+, if the append is small and hash overlap is high, the request goes directly through `kvcache-direct-to-d-session`.
3. Direct-to-D avoids involving a P worker and skips the P->D KV transfer.
4. D runs only a small prefill on the append suffix and reuses the existing prefix KV directly.

This yields two observable gains:

- A large TTFT drop: TTFT mean on the turn2+ direct subset falls from about 1.04s under DP to about 0.112s.
- Lower E2E: mean E2E on the direct subset drops by about 2.50s.

In addition, KVC's cached_tokens statistic is markedly higher: mean cached tokens are 5,992 for KVC versus 228 for DP. This shows that it really does reuse large spans of real prefix KV.

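## 7. Problems encountered and fixes

### Problem 1: the generic sampler gets dominated by a single long session

Symptom: the real Ali session distribution is clearly long-tailed, and duration-oriented sampling tends to pick unbalanced samples, making policy comparisons non-repeatable or unrepresentative of multi-session contention.

Fix: added `scripts/prepare_real_ali_samples.py`, which generates a balanced sample with a session cap and a per-session turn cap while preserving the real token/hash/timestamp.

### Problem 2: re-sampling per policy makes runs incomparable

Symptom: `benchmark-live` used to re-sample according to its parameters, so different policies could replay different requests.

Fix: added `--use-trace-as-sample`; every policy copies and replays the same prebuilt sample, which is what makes the later paired comparison meaningful.
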
### Problem 3: no progress output during long trace replays

Symptom: `request-metrics.jsonl` and the summary are written only after the replay ends, so with real pacing it is hard to tell normal waiting from a hang.

Fix: added the `replay-progress.jsonl` heartbeat, which writes submitted/completed/inflight/errors/execution_modes every 30s. It uses only client-local state and never calls `/server_info`.

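
A heartbeat of this kind needs nothing beyond counters kept by the replay client. The following is a minimal sketch under that assumption; `ReplayProgress`, `append_heartbeat`, and the `timestamp_s` key are illustrative names, and the real `replay.py` may structure this differently.

```python
import json
import time
from collections import Counter


class ReplayProgress:
    """Client-local counters; nothing here talks to the server."""

    def __init__(self) -> None:
        self.submitted = 0
        self.completed = 0
        self.errors = 0
        self.execution_modes: Counter = Counter()

    def on_submit(self) -> None:
        self.submitted += 1

    def on_finish(self, execution_mode: str, ok: bool) -> None:
        self.completed += 1
        self.execution_modes[execution_mode] += 1
        if not ok:
            self.errors += 1

    def snapshot(self, now_s: float) -> dict:
        return {
            "timestamp_s": now_s,
            "submitted": self.submitted,
            "completed": self.completed,
            "inflight": self.submitted - self.completed,
            "errors": self.errors,
            "execution_modes": dict(self.execution_modes),
        }


def append_heartbeat(path: str, progress: ReplayProgress) -> None:
    """Append one JSON line; intended to be called every 30s by a timer or task."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(progress.snapshot(time.time())) + "\n")
```

Appending one line per tick keeps the file readable with `tail -f` while the run is still in progress.
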
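### Problem 4: `/server_info` polling perturbs the scheduler

Symptom: in earlier profiling, 1Hz polling noticeably changed the error count. If a real performance run keeps polling the pool, the measurement tool becomes a source of interference.

Fix: `scripts/sweep_real_ali_kvc.sh` disables pool polling by default. Capacity questions rely on the structural logs and, when needed, a separate profile run, and are not mixed into headline performance runs.

### Problem 5: the backpressure smoke never triggered backpressure

Symptom: in the KVC-fit smoke the transfer queue max was only 3, every admission reason was `ok`, and pause_ms was always 0.

Conclusion: this run cannot show that backpressure works; it only shows that direct-to-D works. A pressure sample with more sessions, larger resident KV, or higher concurrency is needed to validate backpressure specifically.

### Problem 6: environment differs from older reports

Symptom: older documents say H100, while the real environment for this round is H20; the model path is also under `/home/admin/cpfs/wjh/models/...`.

Handling: this log records results as H20; when comparing across documents, look only at mechanism-level trends and do not treat absolute H100/H20 latencies as the same experiment.

### Problem 7: a continuous window may truncate session ancestry

Symptom: cutting a window directly by timestamp can leave requests whose parent turn is outside the window. For KVC this makes session reuse / turn chains inconsistent with the real workload.

Handling: the current continuous window is only a candidate for improvement, not an official headline. A formal window needs to keep warmup ancestors, or explicitly preserve the original session chain information.
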
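## 8. Current hypotheses if the full workload later performs poorly

The cause may not be a small implementation bug but a combination of the scheme's applicability and resource constraints:

1. **Single-turn dilution**: single-turn sessions are the majority in the full Ali trace; for them the KVC seed only adds cost, with no turn2+ reuse.
2. **Long contexts crowd D's KV pool**: input p90 is 51K and p99 is 113K; the long tail of resident KV limits how many sessions D can keep at once.
3. **Direct is no free lunch**: turn1 seed, admission probes, and session lifecycle all have extra costs, which are amortized only when later turns reuse enough.
4. **D-side capacity and eviction remain the core risk**: the earlier SWE experiments already showed that session pinning plus capacity-blind D selection causes starvation; the early multi-turn balanced sample may reproduce it.
5. **Plain PD-disaggregation is weak**: if KVC frequently falls back to the plain PD path, the whole run will be dragged down by P->D transfer and high TTFT.
6. **Insufficient H20 memory headroom changes KVC's conditions**: default KVC 2P6D hits OOM at startup, and `mem_fraction_static` has to be lowered to complete the 600-request run; this further shrinks D's KV pool and amplifies session-cap fallbacks and timeouts.
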
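## 9. Next validation steps, in order

1. Add a sticky/session-affinity baseline to separate the contribution of "sticking to the same D" from that of "KVC direct bypass".
2. Add KVC with `seed-min-turn-id=2` or no-turn1-seed to check whether the turn1 seed cost is worth paying.
3. Run DP / PD / KVC no-BP / KVC+BP on the early multi-turn balanced sample to validate real multi-turn pressure with large contexts.
4. Run a small fixed sample at `time-scale=1` so that results do not hold only under compressed replay.
5. Build a continuous window that includes single-turn sessions, handle missing parent turns inside the window, and then report weighted by the full Ali distribution.
6. Rerun the final candidate configuration with N>=3 and report variance; N=1 counts only as smoke.
7. For the 600-request window, prioritize `seed-min-turn-id=2` to reduce turn1 seeds for single-turn sessions; the goal is first to bring the 53/600 errors close to DP's 1/600, and only then discuss latency.
   - The first attempt did not reach replay; it apparently hit an OS OOM during startup. H20 startup GPU-memory / system-memory stability must be resolved first, or the worker count / model memory footprint reduced.
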
## 10. Root cause of KVC errors and preparation for multi-turn-only validation

The user pointed out that the 179-request run is not enough and asked for at least 15min / 600+ requests; the current formal diagnosis is based on
`outputs/real-ali-kvc-iter/runs/window900s-600req-ts1-kvc-mem082/kvcache-centric-kv-aware-worker-admission-20260511T093601Z`.

### 10.1 Why KVC has so many errors

The run has 600 requests; KVC with mem 0.82 has 53 errors:

- 50 `ReadTimeout`.
- 2 HTTP 400 on `/open_session`.
- 1 genuine context-overflow error: input 310,521 > model context 262,144.

By turn, 50/53 errors are on turn1. By structural admission, the vast majority of failed requests had already been judged `can_admit=true` by D-side admission in `structural/admission-events.jsonl`, so this is not simply `d-session-cap` or `no-space`. The main failure point is that, after a turn1 seed enters the KVC seeded path, it times out during P/D streaming session bootstrap, P->D transfer, or router streaming; and since single-turn sessions are common in the mixed real window, these turn1 seeds bring no later reuse benefit for most sessions.

Conclusion: the main cause of the current KVC errors is **too many turn1 seeds for sessions that are single-turn or not known to be multi-turn**. This pushes a large number of new sessions into the KVC control plane and the seeded router path, increasing timeouts and leftover session lifecycle state; it is not a failure of the direct-to-D fast path itself.

### 10.2 Fixes and ablation switches already in place

Code and script changes:

- `scripts/sweep_real_ali_kvc.sh` adds `KVC_SEED_ONLY_MULTITURN=1`, which passes `--kvcache-seed-only-multiturn-sessions`. This is an oracle ablation to check whether "seeding only sessions that will have later turns" eliminates the turn1 seed errors.
- `src/agentic_pd_hybrid/replay.py` adds a single close+retry on `/open_session` 400 and writes `structural/session-lifecycle.jsonl`. This is a lifecycle robustness fix aimed at the "already exists" 400 caused by sessions left behind on the server after a timeout; it does not change the routing policy.
- `scripts/prepare_real_ali_samples.py` adds `--window-min-turns` and `--window-output-name` for generating a reproducible multi-turn-only window sample.

Verification:

- `uv run python -m py_compile scripts/prepare_real_ali_samples.py src/agentic_pd_hybrid/replay.py src/agentic_pd_hybrid/benchmark.py src/agentic_pd_hybrid/cli.py`
- `bash -n scripts/sweep_real_ali_kvc.sh`

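
The close+retry behavior described above can be pictured with a transport-agnostic sketch. It is not the code in `replay.py`: `SessionAlreadyExists` stands in for the HTTP 400 from `/open_session`, and `open_fn`, `close_fn`, and the log record layout are hypothetical.

```python
class SessionAlreadyExists(Exception):
    """Stand-in for the HTTP 400 returned when a stale session is still open."""


def open_session_with_retry(open_fn, close_fn, session_id: str, lifecycle_log: list):
    """Open a session; on an 'already exists' failure, close it and retry once."""
    try:
        return open_fn(session_id)
    except SessionAlreadyExists:
        # Record the event so it shows up in the lifecycle log rather than as an error row.
        lifecycle_log.append({"session": session_id, "event": "open-400-close-retry"})
        close_fn(session_id)
        return open_fn(session_id)  # a second failure propagates to the caller


# Fake server state for demonstration: "s1" was left behind by an earlier timeout.
stale = {"s1"}


def fake_open(sid: str) -> str:
    if sid in stale:
        raise SessionAlreadyExists(sid)
    stale.add(sid)
    return "opened"


def fake_close(sid: str) -> None:
    stale.discard(sid)


log: list = []
first = open_session_with_retry(fake_open, fake_close, "s1", log)
second = open_session_with_retry(fake_open, fake_close, "s2", log)
```

Retrying exactly once keeps a genuinely broken server from being hidden behind an unbounded retry loop.
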
### 10.3 Multi-turn-only sample generated

Sample path:

`outputs/real-ali-kvc-iter/samples-window-900s-600req-multiturn/ali-window-multiturn.jsonl`

Generation command:

```bash
uv run python scripts/prepare_real_ali_samples.py \
  --trace /home/admin/cpfs/wjh/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --output-root outputs/real-ali-kvc-iter/samples-window-900s-600req-multiturn \
  --window-duration-s 900 \
  --window-target-requests 600 \
  --window-buckets 15 \
  --window-min-turns 2 \
  --window-output-name ali-window-multiturn.jsonl \
  --profiles representative-mt \
  --max-sessions 64 \
  --max-turns-per-session 12
```

Sample profile:

- 626 requests, 107 sessions, all 107 multi-turn.
- sampled duration 889.341s.
- turn2+ = 519.
- direct-eligible turn2+ = 473 / 519 = 91.1%.
- missing parent = 0.
- input p50/p90/p99 = 26,846 / 91,596 / 123,898 tokens.

This case is a "multi-turn pressure slice with single-turn sessions filtered out". It cannot replace the full mixed workload, but it can answer this question: if the workload really is dominated by multi-turn coding-agent sessions, do KVC's direct-to-D coverage and stability approach the microbenchmark?

### 10.4 Blocked on GPU resources

As of this entry, all 8 GPUs are occupied by another group of `vllm serve` processes, about 82GB / 98GB each, on ports 51000-51007. They are not this repo's SGLang/benchmark processes, so no new performance run was started, to avoid mistaking a resource conflict for a KVC policy failure.

Once the GPUs are released, run these two groups first:

```bash
# Mixed real window: check whether seed-only-multiturn brings the 53/600 errors down
TRACE=outputs/real-ali-kvc-iter/samples-window-900s-600req/ali-window.jsonl \
OUT_ROOT=outputs/real-ali-kvc-iter/runs/window900s-600req-ts1-kvc-seedonly-mt-mem082 \
RUNS="kvc" \
TIME_SCALE=1 \
CONCURRENCY=32 \
REQUEST_TIMEOUT_S=600 \
STACK_TIMEOUT_S=1800 \
EXTRA_SERVER_ARGS="--mem-fraction-static 0.82" \
KVC_SEED_ONLY_MULTITURN=1 \
bash scripts/sweep_real_ali_kvc.sh

# Multi-turn-only workload: DP vs KVC, to check whether the filtered workload reproduces the microbenchmark gain
TRACE=outputs/real-ali-kvc-iter/samples-window-900s-600req-multiturn/ali-window-multiturn.jsonl \
OUT_ROOT=outputs/real-ali-kvc-iter/runs/window900s-600req-multiturn-ts1-mem082 \
RUNS="dp kvc" \
TIME_SCALE=1 \
CONCURRENCY=32 \
REQUEST_TIMEOUT_S=600 \
STACK_TIMEOUT_S=1800 \
EXTRA_SERVER_ARGS="--mem-fraction-static 0.82" \
KVC_SEED_ONLY_MULTITURN=1 \
bash scripts/sweep_real_ali_kvc.sh
```

### 10.5 The multi-turn-only launch attempt was blocked by GPU occupancy

The user asked to start the multi-turn-only `pd-disaggregation` vs `kvcache-centric` comparison. A pre-launch check found all 8 GPUs occupied by external `vllm serve` processes, about 84GB / 98GB each, on ports 51000-51007. These processes do not belong to this repo's SGLang/benchmark runs.

SGLang was therefore not force-started. The remaining GPU memory is not enough to start 2P6D or the 8-worker baseline, and forcing a run would only produce initialization OOMs or unstable timeouts, which cannot be used to judge whether KVC pd-hybrid beats pd-disaggregation.

Multi-turn-only comparison command to run once resources are released:

```bash
TRACE=outputs/real-ali-kvc-iter/samples-window-900s-600req-multiturn/ali-window-multiturn.jsonl \
OUT_ROOT=outputs/real-ali-kvc-iter/runs/window900s-600req-multiturn-ts1-pd-vs-kvc-mem082 \
RUNS="pd kvc" \
TIME_SCALE=1 \
CONCURRENCY=32 \
REQUEST_TIMEOUT_S=600 \
STACK_TIMEOUT_S=1800 \
EXTRA_SERVER_ARGS="--mem-fraction-static 0.82" \
KVC_SEED_ONLY_MULTITURN=1 \
bash scripts/sweep_real_ali_kvc.sh
```

### 10.6 Multi-turn-only PD vs KVC: formal results

After the resources were released, the multi-turn-only comparison was started and completed. Command:

```bash
TRACE=outputs/real-ali-kvc-iter/samples-window-900s-600req-multiturn/ali-window-multiturn.jsonl \
OUT_ROOT=outputs/real-ali-kvc-iter/runs/window900s-600req-multiturn-ts1-pd-vs-kvc-mem082 \
RUNS="pd kvc" \
TIME_SCALE=1 \
CONCURRENCY=32 \
REQUEST_TIMEOUT_S=600 \
STACK_TIMEOUT_S=1800 \
EXTRA_SERVER_ARGS="--mem-fraction-static 0.82" \
KVC_SEED_ONLY_MULTITURN=1 \
bash scripts/sweep_real_ali_kvc.sh
```

Run directories:

- PD: `outputs/real-ali-kvc-iter/runs/window900s-600req-multiturn-ts1-pd-vs-kvc-mem082/pd-disaggregation-kv-aware-20260512T030433Z`
- KVC: `outputs/real-ali-kvc-iter/runs/window900s-600req-multiturn-ts1-pd-vs-kvc-mem082/kvcache-centric-kv-aware-worker-admission-20260512T040444Z`

The sample is still 626 requests, 107 sessions, 889.341s, all multi-turn sessions.

| Policy | Requests | Errors | Trunc | E2E mean | E2E p50 | E2E p90 | E2E p99 | TTFT mean | TTFT p50 | TTFT p90 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| PD-disaggregation 2P6D | 626 | 0 | 0 | 97.013s | 70.243s | 214.309s | 308.406s | 94.506s | 69.048s | 212.528s |
| KVC 2P6D worker admission, no BP, seed-only-multiturn | 626 | 39 | 0 | 43.362s | 8.239s | 135.289s | 236.475s | 40.578s | 1.442s | 132.233s |

The paired comparison is computed only over the 587 requests where KVC succeeded and PD also has a latency:

- PD same-request E2E mean/p50/p90/p99: 97.457s / 70.514s / 214.095s / 309.362s.
- KVC same-request E2E mean/p50/p90/p99: 43.362s / 8.239s / 135.930s / 237.283s.
- mean E2E reduction: 55.5%.
- absolute mean improvement: 54.095s.
- wins/losses: 472 / 115.

Breakdown by KVC execution mode:

| KVC mode | Count | KVC mean | PD same mean | Reduction |
|---|---:|---:|---:|---:|
| `kvcache-direct-to-d-session` | 286 | 2.255s | 92.944s | 97.6% |
| `pd-router-fallback-large-append-session-cap` | 169 | 88.869s | 113.614s | 21.8% |
| `pd-router-d-session-reseed` | 25 | 143.456s | 106.501s | -34.7% |
| `pd-router-large-append-reseed` | 19 | 47.631s | 88.981s | 46.5% |
| `pd-router-turn1-seed` | 78 | 55.974s | 73.050s | 23.4% |

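
The Reduction column above follows from the two mean columns as `1 - KVC mean / PD mean`. A short check, using only numbers from the table and the paired summary (the helper name is illustrative):

```python
def reduction_pct(kvc_mean_s: float, pd_mean_s: float) -> float:
    """Relative E2E reduction of KVC versus PD, in percent, rounded to one decimal."""
    return round((1 - kvc_mean_s / pd_mean_s) * 100, 1)


# (KVC mean, PD same-request mean) taken from the table above.
by_mode = {
    "kvcache-direct-to-d-session": (2.255, 92.944),
    "pd-router-fallback-large-append-session-cap": (88.869, 113.614),
    "pd-router-d-session-reseed": (143.456, 106.501),
    "pd-router-large-append-reseed": (47.631, 88.981),
    "pd-router-turn1-seed": (55.974, 73.050),
}
reductions = {mode: reduction_pct(k, p) for mode, (k, p) in by_mode.items()}
overall = reduction_pct(43.362, 97.457)
```

A negative value, as for `pd-router-d-session-reseed`, means that mode was slower under KVC than the same requests under PD.
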
Breakdown by turn depth:

- turn2+: 504 successful paired requests, KVC mean 40.791s vs PD mean 101.055s, reduction 59.6%.
- turn>=5: 299 successful paired requests, KVC mean 34.121s vs PD mean 104.697s, reduction 67.4%.
- turn>=10: 161 successful paired requests, KVC mean 39.027s vs PD mean 86.548s, reduction 54.9%.

KVC execution modes:

- `kvcache-direct-to-d-session`: 286.
- `pd-router-fallback-large-append-session-cap`: 169.
- `pd-router-turn1-seed`: 78.
- `pd-router-d-session-reseed`: 25.
- `pd-router-large-append-reseed`: 19.
- `pd-router-fallback-no-d-capacity`: 4.
- `pd-router-turn1-d-backpressure`: 5.
- `pd-router-d-session-reseed-after-eviction`: 1.
- error rows: 39, recorded as `kvcache-centric`.

The source of KVC's gain is very clear: for the 286 direct-to-D requests, the same-request mean drops from 92.944s under PD to 2.255s, essentially reproducing the core mechanism gain seen in the microbenchmark. These requests skip the P worker and the P->D KV transfer and only process the append suffix on the existing D session. Overall, actual KV transfer blocks fall from 4436 for PD same-success to 3827 for KVC success; in the summary accounting, KVC's total actual KV transfer blocks are 3827, below PD's 5276.

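This run still cannot support a "stable production-grade win" conclusion:

1. KVC still has 39/626 errors, an error rate of 6.23%; PD has 0.
2. All 39 errors are client-side `ReadTimeout`, not server-side OOM/Traceback; no matching crash keywords were found in the server logs.
3. Error distribution: 24 on turn1, 15 on turn2+; by decode node: decode-0 15, decode-1 9, decode-3 7, decode-4 5, decode-5 3.
4. 8 `/open_session` 400s were caught by close+retry and written to `structural/session-lifecycle.jsonl`; they produced no HTTP 400 error rows.
5. The long-tail drain is pronounced: PD finished in about 60min and KVC in about 40min, both far beyond the 889s trace duration. At 900s KVC had completed 490/626 while PD had completed only 283/626, which shows better mid-run throughput for KVC, but the last few dozen large-append fallbacks still drag out the tail.
6. Direct-to-D coverage is 286/626 = 45.7% of all requests (286/519 = 55.1% of turn2+ requests), below the sample's static direct-eligible turn2+ ratio of 91.1%. The gap comes mainly from D session/residency capacity, the large-append session cap, and reseed/fallback.

Current assessment:

- Looking only at successful paired requests, KVC already shows a strong E2E improvement over PD-disaggregation on the multi-turn-only workload, and the improvement comes mainly from direct-to-D session reuse.
- From a system reliability standpoint the current implementation is not yet acceptable, because a 6.23% timeout rate undermines any "stable system" claim.
- The main reason for the gap between the real workload and the microbenchmark is not that the KVC fast path is ineffective, but insufficient fast-path coverage, resident KV / session admission pressure on D, large-append fallbacks, and the timeout stability of the seeded/reseed paths.
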
123
docs/REFACTOR_PLAN_ZH.md
Normal file
@@ -0,0 +1,123 @@
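# Refactor Plan v0: minimal version

**Date**: 2026-05-06
**Goal**: with minimal changes and lightweight experiments, verify whether the structural defects raised in `docs/AGENTIC_FIT_ANALYSIS_ZH.md` really exist and how large their impact is.
**Budget**: 8h of GPU time (about 4-6 smoke runs of ~30-60 min each).
**KISS boundary**: do not touch the main-loop structure of SGLang `scheduler.py`; do not introduce a new mooncake protocol; do not implement cross-D session migration; do not split admission into probe/commit; do not change the LRU eviction policy.
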
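## Plan conclusions (already confirmed with the user)

A re-review of plan-v0 found that **neither** of the two original Phase 1 changes **is a bug**:

- `_estimate_session_resident_tokens` returning the full prompt is by design: every call site that needs the "increment" already subtracts `target - current` (`replay.py:1247-1254`, `:1393-1394`, `:1490-1491`).
- `decode_resident_blocks` never shrinking only wastes a few MB of memory and **does not affect routing decisions** (hash_ids in the SWE trace are session-unique, so the policy still picks the right D).

The final minimal version makes exactly one code change (**adding backpressure**) plus extensive instrumentation.
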
## The only code change: a backpressure signal

### Change 1: add two fields to the SGLang `admit_direct_append` response

Files: `third_party/sglang/python/sglang/srt/managers/io_struct.py`, `scheduler.py`

```python
from dataclasses import dataclass


@dataclass
class DirectAppendAdmissionReqOutput:
    ...  # existing fields unchanged
    recommended_pause_ms: int = 0  # new
    queue_depth: int = 0  # new
```

The hint is computed at the end of `scheduler.py:admit_direct_append`:

```python
def _compute_backpressure_pause_hint(self) -> int:
    # Pause hint in milliseconds, derived from the decode transfer queue depth.
    # Returns an int so that it matches the `recommended_pause_ms: int` field.
    depth = len(self.disagg_decode_transfer_queue.queue)
    if depth < 8:
        return 0
    return min(2000, depth * 100)  # simple linear ramp, capped at 2000 ms
```

### Change 2: the replay side backs off according to the hint

File: `src/agentic_pd_hybrid/replay.py`

- `DecodeResidencyState` gains `pause_until_s: dict[str, float]`
- `_query_decode_direct_admission` parses `recommended_pause_ms` from the response and updates `pause_until_s[server_url] = now + pause_ms / 1000`
- Before calling `_invoke_router` / `_invoke_decode_session_direct`, check `pause_until_s[decode_url]`; if `now < pause_until`, sleep until that time

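
The three bullets above can be sketched as follows. This is a simplified stand-in, not the planned patch: the `DecodeResidencyState` here carries only the new field, the helper names `record_pause_hint` and `seconds_to_wait` are hypothetical, and the admission response is modeled as a plain dict.

```python
import time
from dataclasses import dataclass, field


@dataclass
class DecodeResidencyState:
    # Simplified: only the new field from the plan; the real class has more state.
    pause_until_s: dict[str, float] = field(default_factory=dict)


def record_pause_hint(
    state: DecodeResidencyState, server_url: str, response: dict, now_s: float
) -> None:
    """Store the deadline implied by recommended_pause_ms, if the hint is non-zero."""
    pause_ms = response.get("recommended_pause_ms", 0)
    if pause_ms > 0:
        state.pause_until_s[server_url] = now_s + pause_ms / 1000


def seconds_to_wait(state: DecodeResidencyState, decode_url: str, now_s: float) -> float:
    """How long the caller should sleep before sending to decode_url (0.0 if none)."""
    return max(0.0, state.pause_until_s.get(decode_url, 0.0) - now_s)


def wait_if_paused(state: DecodeResidencyState, decode_url: str) -> float:
    """Blocking variant for illustration; an async replay would await a sleep instead."""
    delay = seconds_to_wait(state, decode_url, time.monotonic())
    if delay > 0:
        time.sleep(delay)
    return delay
```

Keeping the deadline per decode URL means a busy D only slows the requests routed to it, not the whole replay. Note that `record_pause_hint` and `wait_if_paused` must use the same clock; the sketch assumes `time.monotonic()` on both sides.
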
### Change 3: new CLI flag

`src/agentic_pd_hybrid/cli.py`, `benchmark.py`:

```
--enable-backpressure # defaults to false, preserving baseline behavior
```

### Change 4: observability logs

Each run dir gains three jsonl files:

- `admission-events.jsonl`: one record per admission RPC (timestamp, session, D, can_admit, queue_depth, pause_ms, latency_s, available_tokens, evicted_session_count)
- `backpressure-events.jsonl`: one record per actual sleep (timestamp, D, sleep_ms, queue_depth_at_signal)
- `session-d-binding.jsonl`: one record the first time each session is opened on a given D (timestamp, session, D, turn_id)

## Experiment matrix (within the 8h budget)

Ordered as "anchor first, then single-variable comparisons". The last column is the estimated machine time.

| ID | Configuration | Purpose | Machine time |
|---|---|---|---|
| **E0 (existing)** | v5 baseline, time-scale=10, no backpressure | Anchor; already exists at `outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/run1` | 0 |
| **E1** | v5 + backpressure ON, time-scale=10, full trace | Validate Claim §3 (can backpressure eliminate the KVTransferError avalanche?) | ~50 min |
| **E2** | v5 baseline, time-scale=1, **short trace** (first 12 sessions ≈ 1000 reqs) | Validate Claim §7 (time-scale=10 distortion); backpressure off | ~60 min |
| **E3** | 8DP CA, time-scale=1, same trace as E2 | Control for E2: under real timing, does KVC still lose to DP? | ~60 min |
| **E4** | v5 + backpressure, time-scale=1, same trace as E2 | Is backpressure still useful under real timing? | ~60 min |
| **E5** (optional) | v5 baseline, time-scale=10, **concurrency=4**, full trace | Validate Claim §1 (is high concurrency a necessary condition?) | ~50 min |

Total: 4-5 runs, ~3-5h. The remaining budget goes to failed-run retries and analysis.

## Experiment goals: mapped back to §1-§7 one by one

| Doc § | Claim | Experiment that refutes/supports it | Metrics needed |
|---|---|---|---|
| §1 | Permanent session pinning + capacity-blind selection causes bimodality | Existing E0 data is sufficient | direct-to-D rate per session distribution |
| §2 | LRU cannot keep up with pressure | Existing E0 logs are sufficient + E1 to see how the trim/error ratio changes after backpressure | trim event count vs OOM count |
| §3 | Missing backpressure is the source of the avalanche | E0 vs E1 | KVTransferError count, P99 latency |
| §4 | Admission RPC interferes with the scheduler | Out of scope this round (needs the admission probe split to test; not doing it) | – |
| §5 | P side is unaware of D health | Existing E0 logs are sufficient (prefill-0 vs prefill-1 error counts) | per-P KVTransferError |
| §6 | (withdrawn) | – | – |
| §7 | time-scale=10 distortion | E0 vs E2 (same KVC, different time-scale); E2 vs E3 (same time-scale, KVC vs DP) | latency distribution, direct-to-D rate |

## Final experiment report deliverable

After the runs, produce `docs/STRUCTURAL_VALIDATION_REPORT_ZH.md`, giving for each of §1-§7:

- **The claim as stated**
- **Data evidence** (which experiment, which metric)
- **Verdict**: holds / partially holds / refuted
- **Quantified impact**: numeric differences
- **Uncertainty**: N=1 risk, other confounders

## 不做的事(KISS 边界)
|
||||
|
||||
| 想做但不做 | 理由 |
|
||||
|---|---|
|
||||
| 跑 N=3 重复 | 8h 装不下;single-run 可看大方向 |
|
||||
| 全 sweep 参数 | 只调 time-scale 和 backpressure 一个 boolean |
|
||||
| 改 LRU eviction | 不在本轮范围 |
|
||||
| Cross-D migration | 不在本轮范围 |
|
||||
| Admission probe/commit 拆分 | 不在本轮范围 |
|
||||
| P-side D-health routing | 不在本轮范围 |
|
||||
| 修两个"非 bug"(estimate / aging) | 验证后非真实 bug |
|
||||
|
||||
## 预期失败路径
|
||||
|
||||
- **GPU 资源紧张**:smoke trace 进一步压缩(前 8 sessions / 600 reqs)
|
||||
- **time-scale=1 跑超 1.5h**:截断到 600s 内能完成的部分
|
||||
- **backpressure 配错**:先用 sleep_ms = depth * 100 简单线性;调不通就回滚到 0(无 backpressure)
|
||||
- **SGLang patch 编译错**:所有 patch 在 io_struct.py 和 scheduler.py 的少量行内,可单独 git restore
|
||||
|
||||
---
|
||||
|
||||
接下来:实现 → 跑 smoke → 写报告。
|
||||
304
docs/STRUCTURAL_VALIDATION_REPORT_ZH.md
Normal file
@@ -0,0 +1,304 @@
|
||||
# 结构性缺陷验证报告
|
||||
|
||||
**日期**:2026-05-06
|
||||
**对照数据源**:
|
||||
- `outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/`(v5 KVC kv-aware Option D,2P6D,**3 次同配置 rerun**)
|
||||
- `outputs/qwen3-30b-tp1-exps/exp1_8way_dp_cache_aware_summary.json`(同 trace 8DP CA)
|
||||
- `outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/.../logs/decode-{0..5}.log`、`prefill-{0,1}.log`
|
||||
**模型**:Qwen3-30B-A3B(TP1),单机 8×H100 80GB,trace `qwen35-swebench-50sess.jsonl`(4449 reqs / 52 sessions)。
|
||||
**报告作用域**:验证 `docs/AGENTIC_FIT_ANALYSIS_ZH.md` §1-§7 提出的结构性 claim 是否真实存在;量化影响。
|
||||
|
||||
> ⚠️ **环境限制**:本轮缺 GPU 访问,未跑新 sweep。所有数据来自已存在的 v5 rerun + 8DP baseline。Backpressure 代码已实现但**未端到端验证**——下文标注为"预期收益(pending GPU smoke)"。
|
||||
|
||||
---
|
||||
|
||||
## 0. 实验有效性锚点:N=1 不可信
|
||||
|
||||
3 次 v5 baseline EXP2(**完全相同配置**)的 errors 漂移:
|
||||
|
||||
| Run | Errors | Lat P50 | Lat P90 | TTFT P50 |
|
||||
|---|---:|---:|---:|---:|
|
||||
| run1 | **372** | 1.11s | 8.65s | 0.147s |
|
||||
| run2 | **912** | 0.94s | 7.68s | 0.071s |
|
||||
| run3 | **396** | 1.22s | 8.43s | 0.183s |
|
||||
|
||||
errors 漂移 **2.5×**(372 → 912),P50 latency 漂移 **30%**。**任何 N=1 比较 < 30% 差异都不可信。** 后续所有"同 trace 不同配置 / 不同代码"的对比,都需要 N≥3 才有意义。
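The drift figures above can be recomputed from the table. The sketch below is illustrative only: the helper names `spread` and `exceeds_noise_floor` are hypothetical, and the 30% noise floor is the threshold this section states, not a derived statistical bound.

```python
# Error counts and P50 latency of the three identically configured v5 reruns.
runs = {
    "run1": {"errors": 372, "lat_p50_s": 1.11},
    "run2": {"errors": 912, "lat_p50_s": 0.94},
    "run3": {"errors": 396, "lat_p50_s": 1.22},
}


def spread(values):
    """Ratio of the largest to the smallest observation."""
    values = list(values)
    return max(values) / min(values)


def exceeds_noise_floor(candidate, reference, noise_floor=0.30):
    """True if the relative gap between two values is larger than the noise floor."""
    return abs(candidate - reference) / reference > noise_floor


error_spread = spread(r["errors"] for r in runs.values())    # about 2.45x
p50_spread = spread(r["lat_p50_s"] for r in runs.values())   # about 1.30x

# Best KVC P50 (0.94 s) against the DP baseline P50 (0.654 s).
kvc_vs_dp_is_signal = exceeds_noise_floor(0.94, 0.654)
```

A gap counts as signal only when it clears the run-to-run spread measured on an unchanged configuration, which is why the KVC vs. DP P50 gap survives while smaller differences do not.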
|
||||
|
||||
**对 KVC vs DP 的 headline 数据,3 次 KVC 的最佳值(P50=0.94s)仍然是 DP(P50=0.65s)的 1.45×**——8 way DP 的优势远超 single-run variance 范围,这一头条结论不受 variance 影响。
|
||||
|
||||
---
|
||||
|
||||
## §1. Session 永久 pin 到 D + 容量盲选 → 极端双峰 ✅ 完全成立
|
||||
|
||||
### Claim
|
||||
KvAwarePolicy 评分以 hash overlap 为主,没有 D 容量项。Session 第一次落到某 D 后被永久 pin。导致大 session 在已满 D 上反复 admission 拒绝,小 session 在原 D 上 100% 走 direct-to-D。
|
||||
|
||||
### 数据
|
||||
|
||||
**(a) Session 永久绑定,跨 3 次 rerun 一致**:
|
||||
|
||||
```
|
||||
run1: 52 sessions, avg distinct-D-per-session = 1.00
|
||||
run2: 52 sessions, avg distinct-D-per-session = 1.00
|
||||
run3: 52 sessions, avg distinct-D-per-session = 1.00
|
||||
```
|
||||
|
||||
每个 session 在整个运行中只访问 **1 个** D worker,3 次独立 run 完全一致。**不是巧合,是结构。**
|
||||
|
||||
**(b) Direct-to-D 命中率呈极端双峰**:
|
||||
|
||||
| Direct-to-D rate | run1 | run2 | run3 |
|
||||
|---|---:|---:|---:|
|
||||
| 0-20%(饿死) | 15 | 18 | 16 |
|
||||
| 20-40% | 7 | 6 | 7 |
|
||||
| 40-60% | 11 | 7 | 9 |
|
||||
| 60-80% | 5 | 6 | 4 |
|
||||
| 80-100%(顺利) | 14 | 15 | 16 |
|
||||
|
||||
中间态稀少,两端拥挤。
|
||||
|
||||
**(c) 跨 3 次 run 一致饿死的 session 与 session 大小强相关**:
|
||||
|
||||
```
|
||||
13 sessions starved (<20% direct-to-D) in ALL 3 runs.
|
||||
avg peak input of consistently-starved sessions: 62043 tokens
|
||||
avg peak input of consistently-lucky sessions: 31344 tokens
|
||||
ratio: 1.98× — starved sessions are exactly 2× larger.
|
||||
```
|
||||
|
||||
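**13 of 52 sessions (25%) are starved in all three independent runs, and their peak input is about twice (1.98×) that of the consistently lucky sessions.** This rules out the "luck" hypothesis and confirms that large sessions fail structurally on capacity-overloaded D workers.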
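The two analyses in this section, the five-bucket histogram of per-session direct-to-D rate and the set of sessions starved in every run, can be sketched as below. The function names and the input layout (a dict of session id to rate per run) are assumptions for illustration, not the project's actual analysis script.

```python
def bucket_rates(rates, n_buckets=5):
    """Histogram of rates in [0, 1] using equal-width buckets (0-20%, 20-40%, ...)."""
    counts = [0] * n_buckets
    for rate in rates:
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"rate out of range: {rate}")
        # A rate of exactly 1.0 falls into the last bucket.
        idx = min(int(rate * n_buckets), n_buckets - 1)
        counts[idx] += 1
    return counts


def consistently_starved(rates_per_run, threshold=0.20):
    """Session ids whose direct-to-D rate is below `threshold` in every run."""
    starved = None
    for run_rates in rates_per_run:
        below = {sid for sid, rate in run_rates.items() if rate < threshold}
        starved = below if starved is None else starved & below
    return starved or set()
```

Intersecting the starved sets across runs is what separates a structural effect from per-run luck: a session that is starved by chance is unlikely to appear in all three sets.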
|
||||
|
||||
### 影响量化
|
||||
- 25% session 几乎每个 turn 都走 fallback 路径,相对 direct-to-D **TTFT 慢 100×、E2E 慢 6×**(数据点:fallback path mean lat ~3.5s vs direct ~0.5s)
|
||||
- 对应这些 session 的用户体验是"系统性糟糕",而不是"偶尔慢"
|
||||
- **SLO 视角下 P99 完全由这 13 个 session 拉高**
|
||||
|
||||
### 结论
|
||||
**完全成立**。修复方向(不在本轮):policy score 加 capacity penalty + 允许 session 跨 D 迁移,或 D 端引入 hot session retract。
|
||||
|
||||
---
|
||||
|
||||
## §2. D 端 LRU 只 evict idle session → 跟不上压力 ✅ 完全成立
|
||||
|
||||
### Claim
|
||||
`scheduler.py:2040` 的 `evict_idle_streaming_sessions_lru` 只能 evict "所有 req 都 finished + streaming 模式"的 session。高并发下 hot session 永远不 idle,LRU 找不到东西可踢。结果 D 顶到 100% 然后撞 mooncake transfer timeout。
|
||||
|
||||
### 数据(v5 baseline rerun run1)
|
||||
|
||||
| D worker | Trim 事件 | KVTransferError | 峰值 token_usage |
|
||||
|---|---:|---:|---:|
|
||||
| decode-0 | 9 | 0 | 0.99 |
|
||||
| decode-1 | 43 | 4 | 0.99 |
|
||||
| decode-2 | 16 | 153 | 0.97 |
|
||||
| decode-3 | 37 | 29 | 0.99 |
|
||||
| decode-4 | 28 | 90 | **1.00** |
|
||||
| decode-5 | 30 | 93 | **1.00** |
|
||||
|
||||
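**All six D workers peak at token_usage ≥ 0.97**, and two of them reach 1.00 (KV pool fully exhausted). **On the three worst workers (decode-2, decode-4, decode-5) the LRU fires only 16-30 times against 90-153 transfer errors**; decode-0, decode-1 and decode-3 record more trims than errors.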
|
||||
|
||||
decode-2 极端:trim 16 次 vs error 153 次 = LRU 比错误慢 **9.5×**。
|
||||
|
||||
### 影响量化
|
||||
- 单 run 累计 369 KVTransferError(总 6 个 D 之和)
|
||||
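- This corresponds to a request failure rate of ~8.3% in run1 (369/4449); across the three reruns listed in §0 (372/912/396 errors) the mean is 560/4449 ≈ 12.6%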
|
||||
- **每次 mooncake timeout 是 32s**——对 P99 latency 直接贡献几十秒尾巴
|
||||
|
||||
### 结论
|
||||
**完全成立**。修复方向(不在本轮):分层 eviction——除 idle 外加冷 session retract、按访问频率/时序加权。Backpressure(本轮代码)只是把"D 满"的雪崩从"timeout 错误"转成"主动等待",**不是真正解决容量问题**。
|
||||
|
||||
---
|
||||
|
||||
## §3. 没有 D→Replay backpressure 通道 ✅ 成立(已实现修复)
|
||||
|
||||
### Claim
|
||||
D 端 transfer queue 堆 → 32s timeout → KVTransferError,没有"D 过载请慢点"信号反向到 replay;concurrency 一直 32 不降。
|
||||
|
||||
### 数据
|
||||
- §2 的 369 KVTransferError 全部为 32s mooncake timeout(日志中均为 `Failed to send kv chunk` 或 `Decode instance could be dead`)
|
||||
- 错误集中在运行后半段(按现有 `KVC_DEBUG_JOURNEY_V1_TO_V5.md` §v4:错误均在 run 的 44.8% 之后开始累积)
|
||||
- 表明:**前期 D 容量充裕时正常,达到容量上限后所有后续请求集中失败**——典型无 backpressure 系统行为
|
||||
|
||||
### 修复(本轮已实现,待 GPU smoke 验证)
|
||||
|
||||
代码改动:
|
||||
1. `third_party/sglang/python/sglang/srt/managers/io_struct.py`:`DirectAppendAdmissionReqOutput` 增加 `recommended_pause_ms` 字段
|
||||
2. `third_party/sglang/python/sglang/srt/managers/scheduler.py:admit_direct_append`:基于 `transfer_queue_depth`、`retracted_queue_depth`、`token_usage_after` 计算 hint
|
||||
```python
def _compute_backpressure_pause_hint(...):
    # Excerpt: the signature is elided and `overshoot` is not defined here.
    if retracted_queue_depth > 0:
        return 1500
    if token_usage_after >= 0.90:
        return max(200, min(2000, overshoot * 5))
    if transfer_queue_depth >= 8:
        return min(2000, transfer_queue_depth * 100)
    return 0
```
|
||||
3. `src/agentic_pd_hybrid/replay.py`:
|
||||
- `DecodeResidencyState.pause_until_s: dict[str, float]`
|
||||
- `_query_decode_direct_admission` 解析 hint 更新 `pause_until_s`
|
||||
- 新增 `_wait_for_decode_pause`,在 `_invoke_router` / `_invoke_session_direct` 入口检查
|
||||
4. CLI flag:`--enable-backpressure`、`--backpressure-max-pause-s 2.0`(默认关闭)
|
||||
5. 结构性日志:`structural/admission-events.jsonl`、`backpressure-events.jsonl`、`session-d-binding.jsonl`
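The hint computation and the replay-side pause described in the list above can be sketched end to end as follows. This is a self-contained illustration under stated assumptions, not the code in `scheduler.py` or `replay.py`: the definition of `overshoot` (thousandths of pool usage above the 0.90 threshold) is a guess made only so that the example runs, and `DecodePauseState` is a hypothetical stand-in for the `pause_until_s` bookkeeping.

```python
import time


def compute_backpressure_pause_hint_ms(
    transfer_queue_depth: int,
    retracted_queue_depth: int,
    token_usage_after: float,
    usage_threshold: float = 0.90,
) -> int:
    """Suggested client-side pause in milliseconds; 0 means no pause."""
    if retracted_queue_depth > 0:
        return 1500
    if token_usage_after >= usage_threshold:
        # ASSUMPTION: overshoot is measured in thousandths of pool usage.
        overshoot = (token_usage_after - usage_threshold) * 1000
        return int(round(max(200, min(2000, overshoot * 5))))
    if transfer_queue_depth >= 8:
        return min(2000, transfer_queue_depth * 100)
    return 0


class DecodePauseState:
    """Per decode worker: earliest time at which a new request may be sent."""

    def __init__(self, max_pause_s: float = 2.0, clock=time.monotonic):
        self.max_pause_s = max_pause_s
        self.clock = clock
        self.pause_until_s: dict[str, float] = {}

    def record_hint(self, decode_id: str, pause_ms: int) -> None:
        if pause_ms <= 0:
            return
        # Clamp to the configured maximum (mirrors --backpressure-max-pause-s).
        pause_s = min(pause_ms / 1000.0, self.max_pause_s)
        deadline = self.clock() + pause_s
        # Never shorten a pause that is already in effect.
        current = self.pause_until_s.get(decode_id, 0.0)
        self.pause_until_s[decode_id] = max(current, deadline)

    def remaining_pause_s(self, decode_id: str) -> float:
        return max(0.0, self.pause_until_s.get(decode_id, 0.0) - self.clock())
```

A caller would sleep for `remaining_pause_s(decode_id)` before dispatching to that worker. Injecting `clock` keeps the state testable without real sleeps.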
|
||||
|
||||
### 预期收益(pending GPU smoke E2 vs E1)
|
||||
- KVTransferError 应从 ~370 / 4449 跌到 < 50 / 4449
|
||||
- P99 应改善(消除 32s timeout 尾巴)
|
||||
- 整体 latency mean 可能**略升**(被强制 pause),但 P99 应大幅降
|
||||
- backpressure-events.jsonl 应显示 D-4 / D-5 累积大量 pause 事件(与 §2 数据吻合)
|
||||
|
||||
### 结论
|
||||
**Claim 成立;修复已实现,待 smoke 验证**。注意:backpressure 是**降级**机制,不是性能优化——它把"硬错误"换成"主动等待",整体 throughput 不会因此提升。
|
||||
|
||||
---
|
||||
|
||||
## §4. Admission RPC 与 scheduler 主循环耦合 ⚠️ 间接证据,本轮未直接验证
|
||||
|
||||
### Claim
|
||||
`admit_direct_append` 进 scheduler 主循环遍历 session slot,admission RPC 频率 16+/s 时与 decode 抢调度。
|
||||
|
||||
### 现有间接证据
|
||||
- `docs/V5_PROFILE_INVESTIGATION_ZH.md`:仅加 1Hz `/server_info` polling 就让 EXP2 errors 从 9 涨到 415(46×);但 v6 P0 三次 baseline 不开 polling 同样得到 372/912/396——**polling 不是唯一原因,主循环负载本身就敏感**。
|
||||
|
||||
### 本轮未做
|
||||
- 没有"admission probe 拆 fast/slow"的对照实验。需要 SGLang 较深的改动(提供 lock-free snapshot),不在 KISS 边界。
|
||||
|
||||
### 结论
|
||||
**Claim 间接成立,本轮未直接验证**。Backpressure 实现里 admission RPC 的频率没有变(仍每个 turn 一次),只是结果会触发 sleep。如果这条 claim 成立,加 backpressure 后 admission RPC 数量大致不变但每次响应里的 `pause_ms` 会非零——**新增的 admission-events.jsonl 可在 GPU smoke 后用来直接验证此现象**。
|
||||
|
||||
---
|
||||
|
||||
## §5. P-side round-robin 不感知 D 健康 ✅ 成立
|
||||
|
||||
### Claim
|
||||
`pd_router.py:_select_decode_index` 是裸 round-robin。任一 P 撞到 hot D 时反复失败,另一 P 完全不受影响。
|
||||
|
||||
### 数据(v5 baseline rerun run1)
|
||||
|
||||
| Worker | KVTransferError | "Decode could be dead" |
|
||||
|---|---:|---:|
|
||||
| prefill-0 | **367** | 361 |
|
||||
| prefill-1 | **2** | 0 |
|
||||
|
||||
prefill-0 的请求量从 summary 看是 2225 vs prefill-1 的 2224——**请求量近乎对半,错误率差 180×**。
|
||||
|
||||
### 影响量化
|
||||
- 失败请求集中在 P-0 → 某个 hot D 的链路上(日志中反复出现 `to 10.45.80.47:XXXXX`)
|
||||
- 单 P 的"死亡链路"贡献了 **99%** 的全部 KVTransferError
|
||||
- 如果 P 选择能避开"正在和 hot D 死磕"的链路,**理论上可消除单 P 故障的雪崩效应**
|
||||
|
||||
### 备注
|
||||
- 此现象**未在 v6 P0 的 3 次 rerun 中横向验证**——只有 run1 的日志可读。需要在新 sweep 的 prefill-{0,1}.log 上重复确认,避免 N=1 嫌疑。
|
||||
|
||||
### 结论
|
||||
**单 run 数据成立,多 run 一致性未验证**。修复方向(不在本轮):router 选 P 时考虑 (P 当前 inflight transfer 数, 目标 D 健康度)。
|
||||
|
||||
---
|
||||
|
||||
## §6. (已撤回)Replay 端 session footprint 估算膨胀
|
||||
|
||||
写计划时仔细看代码后撤回——`_estimate_session_resident_tokens` 返回 full prompt,但所有需要"增量"的 call site (`replay.py:1247-1254`、`:1393-1394`、`:1490-1491`) 都已用 `target - current` 减法处理。**不是 bug**。
|
||||
|
||||
---
|
||||
|
||||
## §7. time-scale=10 把 inter-turn gap 压到 1/10 ✅ 完全成立
|
||||
|
||||
### 数据
|
||||
|
||||
```
|
||||
原始 trace inter-turn gap (n=4397):
|
||||
p10=1.6s p50=2.5s p90=7.8s p99=25.1s max=261s
|
||||
|
||||
time-scale=10 实际 replay gap:
|
||||
p10=0.16s p50=0.25s p90=0.78s p99=2.5s max=26s
|
||||
```
|
||||
|
||||
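Real agentic users and agents pause roughly 1.6-7.8 s between turns (p10-p90: thinking, typing, tool calls, agent reasoning). time-scale=10 compresses these windows to 0.16-0.78 s, which **artificially removes D's natural idle time**. That idle time is exactly what KVC wants to exploit: while a session is briefly idle, the LRU can evict it and other sessions can be admitted.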
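The replay percentiles are the original percentiles divided by the time-scale factor. A small sketch that reproduces them from the trace figures above (the function name `compress_gap` is hypothetical):

```python
def compress_gap(gap_s, time_scale):
    """Inter-turn gap as replayed under a given time-scale factor."""
    if time_scale <= 0:
        raise ValueError("time_scale must be positive")
    return gap_s / time_scale


# Inter-turn gap percentiles of the original trace (seconds, n=4397).
original_gaps_s = {"p10": 1.6, "p50": 2.5, "p90": 7.8, "p99": 25.1, "max": 261.0}

# Gaps seen by the servers at time-scale=10, rounded for display.
replayed_gaps_s = {
    name: round(compress_gap(gap, 10), 3) for name, gap in original_gaps_s.items()
}
```

Because every gap shrinks by the same factor while request service time does not, the fraction of wall-clock time a session spends idle on its D worker shrinks as well.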
|
||||
|
||||
### 测量学影响
|
||||
- 所有 v3-v6 数据基于 time-scale=10
|
||||
- 意味着所有"KVC 在 SWE 上输给 baseline"的结论**可能被 benchmark 放大了**
|
||||
- §1 的 25% session 永久饿死现象,在 time-scale=1 下可能因为 D 有更多 drain 时间而显著缓解
|
||||
|
||||
### 本轮未做
|
||||
- 没跑 time-scale=1 baseline。这是项目当前**最重要但缺失的验证**。
|
||||
- Smoke sweep 脚本(`scripts/sweep_backpressure_smoke.sh`)E3、E4 包含了 time-scale=1 的 KVC + DP 短 trace 对比,等 GPU 时跑。
|
||||
|
||||
### 结论
|
||||
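**The 10× gap compression itself is confirmed; its effect on the KVC vs. DP conclusions stays unverified until the time-scale=1 run, which is the P0 to-do.**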
|
||||
|
||||
---
|
||||
|
||||
## 头条对比(同 trace、同硬件)
|
||||
|
||||
```
|
||||
8-way DP cache-aware (TP1):
|
||||
errors= 0 | latency mean=1.426s p50=0.654s p90=3.609s
|
||||
| TTFT mean=0.123s p50=0.093s p90=0.256s
|
||||
|
||||
KVC v5 2P6D (3 reruns, no polling):
|
||||
run1: errors=372 | mean=3.50s p50=1.11s p90=8.65s | TTFT mean=2.13s
|
||||
run2: errors=912 | mean=3.00s p50=0.94s p90=7.68s | TTFT mean=1.64s
|
||||
run3: errors=396 | mean=3.42s p50=1.22s p90=8.43s | TTFT mean=2.07s
|
||||
```
|
||||
|
||||
KVC 三次 run 全输 DP,且差距远超 single-run variance:
|
||||
- Latency mean: KVC is **+132%** higher (KVC average 3.31s vs. DP 1.43s)
- Latency P50: KVC is **+67%** higher (KVC average 1.09s vs. DP 0.65s)
- TTFT mean: KVC is **~16×** slower (KVC average 1.95s vs. DP 0.12s, about +1480%)
|
||||
- Errors:DP 0 vs KVC 平均 ~560
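The KVC averages and the KVC/DP ratios can be recomputed directly from the headline block. The sketch below uses only the numbers printed there; variable names are illustrative.

```python
# 8-way DP cache-aware baseline (seconds).
dp = {"lat_mean": 1.426, "lat_p50": 0.654, "ttft_mean": 0.123}

# Three KVC v5 2P6D reruns (seconds, plus error counts).
kvc_runs = [
    {"errors": 372, "lat_mean": 3.50, "lat_p50": 1.11, "ttft_mean": 2.13},
    {"errors": 912, "lat_mean": 3.00, "lat_p50": 0.94, "ttft_mean": 1.64},
    {"errors": 396, "lat_mean": 3.42, "lat_p50": 1.22, "ttft_mean": 2.07},
]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


kvc_avg = {metric: mean(run[metric] for run in kvc_runs) for metric in dp}
kvc_avg_errors = mean(run["errors"] for run in kvc_runs)

# ratio > 1 means KVC is slower than DP on that metric.
ratios = {metric: kvc_avg[metric] / dp[metric] for metric in dp}

for metric, ratio in ratios.items():
    print(f"{metric}: KVC avg {kvc_avg[metric]:.2f}s, "
          f"{ratio:.2f}x DP ({(ratio - 1) * 100:+.0f}%)")
```

Reporting the ratio next to the percentage avoids the ambiguity between "N× as slow" and "N× slower".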
|
||||
|
||||
**这是这个项目当前最严肃的事实**——所有 KVC 复杂度回报为负。
|
||||
|
||||
---
|
||||
|
||||
## 综合结论
|
||||
|
||||
按"是否结构性 + 影响大小"的二维分类:
|
||||
|
||||
| Claim | 结构性 | 影响 | 本轮验证 | 修复(KISS 内) | 修复(KISS 外) |
|
||||
|---|---|---|---|---|---|
|
||||
| §1 Session pin + 容量盲选 | 强 | 大(25% session 饿死) | ✅ 3 run 一致 | ❌ | capacity-aware policy + 跨 D 迁移 |
|
||||
| §2 LRU 跟不上 | 强 | 大(每次 ~370 KVTransferError) | ✅ 6 D 数据 | ❌ | 分层 eviction、hot retract |
|
||||
| §3 无 backpressure | 强 | 中-大(消除 32s timeout 雪崩) | ⚠️ 已实现,待 smoke | ✅ **本轮交付** | – |
|
||||
| §4 admission RPC 干扰 | 弱-中 | 中 | ⚠️ 间接 | ❌ | probe / commit_evict 拆分 |
|
||||
| §5 P-side 不感知 D 健康 | 中 | 中(单 P 错误率差 180×) | ✅ N=1,需 N≥3 复核 | ❌ | router P 选择带 D 健康反馈 |
|
||||
| §6 estimate 膨胀 | – | – | ❌ 已撤回 | – | – |
|
||||
| §7 time-scale=10 失真 | 强(测量学) | 大(可能颠覆所有 KVC vs DP 结论) | ✅ 数据明确 | ✅ 改 flag | – |
|
||||
|
||||
### 最关键的两个 takeaway
|
||||
|
||||
1. **§7 time-scale=1 是当前项目所有结论的前置依赖**——必须先做。如果 time-scale=1 下 KVC 与 DP 接近,前面所有 v3-v6 的"KVC 输得彻底"诊断都需要重新解读。
|
||||
2. **§1 + §2 是双胞胎结构性问题**——session 被永久 pin 在某个 D + D 不能 evict 已满 = 大 session 永久卡死。任何不动 policy + 不动 LRU 的修复(包括本轮的 backpressure)只能让症状好看,不能消除根因。
|
||||
|
||||
---
|
||||
|
||||
## 本轮代码改动汇总(git diff 范围)
|
||||
|
||||
```
|
||||
src/agentic_pd_hybrid/replay.py # +结构性日志 + backpressure pause 检查 + admission 增强
|
||||
src/agentic_pd_hybrid/cli.py # +CLI flags
|
||||
src/agentic_pd_hybrid/benchmark.py # +CLI flags 透传
|
||||
third_party/sglang/python/sglang/srt/managers/io_struct.py
|
||||
third_party/sglang/python/sglang/srt/managers/scheduler.py
|
||||
# +recommended_pause_ms 字段 + hint 计算
|
||||
scripts/sweep_backpressure_smoke.sh # 4-run smoke sweep(待 GPU 跑)
|
||||
scripts/analysis/analyze_backpressure_smoke.py
|
||||
# 配套分析器
|
||||
docs/REFACTOR_PLAN_ZH.md # 计划文档
|
||||
docs/STRUCTURAL_VALIDATION_REPORT_ZH.md
|
||||
# 本报告
|
||||
```
|
||||
|
||||
代码默认行为**不变**(`enable_backpressure=False`)——所有现有脚本/配置无影响。
|
||||
|
||||
---
|
||||
|
||||
## 待 GPU 时执行
|
||||
|
||||
```bash
|
||||
bash scripts/sweep_backpressure_smoke.sh
|
||||
python3 scripts/analysis/analyze_backpressure_smoke.py outputs/sweep_backpressure_smoke
|
||||
```
|
||||
|
||||
Budget: 4 runs × 30-60 min ≈ 2-4 h of GPU time.
|
||||
|
||||
按 §3 的预期:E2 (KVC + backpressure) 相对 E1 (KVC baseline) 应有 errors 降 70%+;P99 改善;TTFT P50 持平或略升。E3 (KVC + backpressure @ time-scale=1) vs E4 (DP @ time-scale=1) 是验证 §7 的关键对照。
|
||||
|
||||
如果 E2 vs E1 的 errors 没有显著下降,说明 backpressure hint 公式调得不对(`_compute_backpressure_pause_hint` 阈值可调),或 §3 实际不是雪崩主因(更可能是 §2 D-side LRU 才是)。
|
||||
95
docs/SWEBENCH_EXPERIMENT_PROGRESS.md
Normal file
@@ -0,0 +1,95 @@
|
||||
# SWE-Bench PD Hybrid Experiment Progress
|
||||
|
||||
## 实验目标
|
||||
|
||||
在单节点 8xH100 上复现 agentic-pd-hybrid 三种 serving mechanism,对比 Qwen3.5-35B-A3B 在 SWE-Bench 500 instance agentic trajectory 上的性能。
|
||||
|
||||
## 硬件环境
|
||||
|
||||
- 8x H100 80GB (NVLink 互联, 2 NUMA nodes: GPU 0-3 / GPU 4-7)
|
||||
- 无 RDMA/IB 设备
|
||||
- Transfer backend: **mooncake TCP** (nixl UCX 因 pip 包缺少 CUDA 支持导致 segfault,已放弃)
|
||||
|
||||
## 实验矩阵
|
||||
|
||||
| 实验 | Mechanism | Workers | GPU 分配 | Router | Policy |
|
||||
|------|-----------|---------|----------|--------|--------|
|
||||
| A | pd-disaggregation | 1P + 1D (TP4 each) | P: 0-3, D: 4-7 | Yes | default |
|
||||
| B | pd-colo | 2 direct (TP4 each) | D0: 0-3, D1: 4-7 | No | default |
|
||||
| C | kvcache-centric | 1P + 1D (TP4 each) | P: 0-3, D: 4-7 | Yes | default |
|
||||
|
||||
## 测试负载
|
||||
|
||||
- 源数据: `simm-swe-bench/outputs/20260416-205833-hicache-qwen35-verified-0-500/audit.jsonl`
|
||||
- 39,417 lines (turns), 497 unique instances (sessions)
|
||||
- 每个 instance 8-150 turns (均值 79.3)
|
||||
- 转换为 agentic-pd-hybrid trace 格式: `outputs/qwen35-swebench-500.jsonl`
|
||||
|
||||
## 关键发现
|
||||
|
||||
### Transfer Backend 选择
|
||||
|
||||
- **nixl (UCX)**: pip 安装的 nixl_cu12 包自带的 UCX 库没有 CUDA 支持,导致 GPU memory registration 时 segfault。系统 UCX (/opt/hpcx/ucx) 有 CUDA 支持但因 RPATH 无法被 NIXL 使用。
|
||||
- **mooncake (TCP)**: 可用。需要两处修改:
|
||||
1. `third_party/sglang/.../mooncake_transfer_engine.py`: 从环境变量 `MOONCAKE_PROTOCOL` 读取协议,而非硬编码 `"rdma"`
|
||||
2. `src/agentic_pd_hybrid/stack.py`: 当 `transfer_backend == "mooncake"` 且非 `force_rdma` 时,自动设置 `MOONCAKE_PROTOCOL=tcp`
|
||||
|
||||
### 代码修改记录
|
||||
|
||||
1. **`third_party/sglang/python/sglang/srt/distributed/device_communicators/mooncake_transfer_engine.py`**
|
||||
- 将 `"rdma"` 硬编码改为 `os.environ.get("MOONCAKE_PROTOCOL", "rdma")`
|
||||
|
||||
2. **`src/agentic_pd_hybrid/stack.py`**
|
||||
- 在 `_build_process_env()` 中添加: mooncake 非 force_rdma 时默认设置 `MOONCAKE_PROTOCOL=tcp`
|
||||
|
||||
3. **`scripts/convert_audit_to_trace.py`** (新建)
|
||||
- 将 sibench audit.jsonl 转换为 agentic-pd-hybrid trace 格式
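The two mooncake changes recorded above amount to an environment-variable override with a backend-dependent default. A minimal sketch follows, assuming a simplified signature; the real `_build_process_env()` in `stack.py` and the real transfer-engine code are not shown in this document, so the function names and parameters below are hypothetical.

```python
import os


def build_process_env(transfer_backend, force_rdma=False, base_env=None):
    """Environment for a worker process (hypothetical, simplified signature)."""
    env = dict(os.environ if base_env is None else base_env)
    if transfer_backend == "mooncake" and not force_rdma:
        # Default to TCP on hosts without RDMA/IB; an explicit value still wins.
        env.setdefault("MOONCAKE_PROTOCOL", "tcp")
    return env


def resolve_mooncake_protocol(env=None):
    """Protocol the transfer engine would use: env override, else 'rdma'."""
    env = os.environ if env is None else env
    return env.get("MOONCAKE_PROTOCOL", "rdma")
```

Using `setdefault` rather than plain assignment is a design choice of this sketch: it lets an operator force a protocol from the outer shell without editing the launcher.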
|
||||
|
||||
## 实验进度
|
||||
|
||||
- [x] Step 0: 环境准备 (uv sync, nixl/mooncake 安装)
|
||||
- [x] Step 1: Trace 格式转换 (39,417 lines 验证通过)
|
||||
- [x] Step 2: Smoke test (pd-disaggregation, mooncake TCP, 100 requests) — **通过**
|
||||
- 100/100 requests, 0 errors
|
||||
- Mean latency: 1.53s, P50: 0.77s, P90: 2.82s
|
||||
- TTFT: mean 0.49s, P50 0.29s; TPOT: mean 4.7ms
|
||||
- 91/100 cache hits
|
||||
- [x] Step 3a: 实验 A 全量尝试 (39K reqs, 497 sessions) — **中止**
|
||||
- Run dir: `outputs/swebench-exps/pd-disaggregation-default-20260426T171113Z` (无metrics,被kill)
|
||||
- 前 90% 完成 ~80min (~8-10 req/s), 但尾部 D 侧 KV cache 98% 饱和
|
||||
- 497 并发 session 争抢 D 侧 token 空间, mamba 80-93 sessions 无法 drain
|
||||
- **教训**: 1P+1D (TP4) 无法支撑 497 并发 session, 需减少 session 数量或降低 concurrency
|
||||
- [x] Step 3b: 实验 A — pd-disaggregation (52 sessions, 4449 reqs, concurrency=32) — **完成**
|
||||
- Run dir: `outputs/swebench-exps/pd-disaggregation-default-20260426T202540Z`
|
||||
- Trace: `outputs/qwen35-swebench-50sess.jsonl` (10% sample, 52 sessions)
|
||||
- **结果**: 4449/4449 成功, 0 errors
|
||||
- Latency: mean=1.66s, P50=0.97s, P90=3.64s, P99=7.68s
|
||||
- TTFT: mean=0.45s, P50=0.34s, P90=0.88s
|
||||
- TPOT: mean=5.2ms, P50=5.2ms
|
||||
- Cache hit: 4199/4449 (94.4%)
|
||||
- [x] Step 4: 实验 B — pd-colo — **失败: SGLang bug**
|
||||
- Run dir: `outputs/swebench-exps/pd-colo-default-20260426T210129Z`
|
||||
- **Bug**: `--disaggregation-mode null` (colocation) 下 Qwen3.5-35B-A3B 模型触发 token_to_kv_pool_allocator 内存泄漏
|
||||
- 错误: `ValueError: token_to_kv_pool_allocator memory leak detected!`
|
||||
- 两个 direct worker 在处理 ~5 个请求后均 crash (Scheduler exception)
|
||||
- **结论**: 当前 vendored SGLang v0.5.10 不支持 Qwen3.5-35B-A3B 的 colocation 模式
|
||||
- [x] Step 5: 实验 C — kvcache-centric — **完成 (高错误率)**
|
||||
- Run dir: `outputs/swebench-exps/kvcache-centric-default-worker-admission-20260426T210800Z`
|
||||
- 4390/4449 errors (98.7%) — admission control 过于保守
|
||||
- 59 成功请求: mean latency 1.24s (比 pd-disagg 快 25%), TTFT 0.18s (快 60%)
|
||||
- 详细分析见 `docs/SWEBENCH_EXPERIMENT_RESULTS.md`
|
||||
- [x] Step 6: 结果对比分析 — **完成**
|
||||
- 完整报告: `docs/SWEBENCH_EXPERIMENT_RESULTS.md`
|
||||
|
||||
## 启动脚本
|
||||
|
||||
- `scripts/run_exp_a_pd_disagg.sh` — 实验 A
|
||||
- `scripts/run_exp_b_pd_colo.sh` — 实验 B
|
||||
- `scripts/run_exp_c_kvcache_centric.sh` — 实验 C
|
||||
- `scripts/convert_audit_to_trace.py` — Trace 转换
|
||||
|
||||
## 已知风险
|
||||
|
||||
1. Qwen3.5-35B-A3B TP4 可用 mem ~12GB/GPU (after model + CUDA graph),长 session (150 turns) 可能 OOM
|
||||
2. mooncake TCP loopback 延迟远低于真实跨机,结果偏乐观
|
||||
3. 原始 trace 时间跨度 ~6000s,全量回放非常耗时
|
||||
121
docs/SWEBENCH_EXPERIMENT_RESULTS.md
Normal file
@@ -0,0 +1,121 @@
|
||||
# SWE-Bench PD Hybrid Experiment Results
|
||||
|
||||
## 实验配置
|
||||
|
||||
- **模型**: Qwen3.5-35B-A3B (MoE, 35B total / 3B active), TP4
|
||||
- **硬件**: 8x H100 80GB, NVLink, 单节点
|
||||
- **Transfer backend**: mooncake TCP (loopback)
|
||||
- **Trace**: 52 sessions, 4,449 requests (10% sample of SWE-Bench 500 instances)
|
||||
- **时间压缩**: time-scale=10, concurrency-limit=32
|
||||
|
||||
## 结果汇总
|
||||
|
||||
### Experiment A: pd-disaggregation (baseline)
|
||||
|
||||
| Metric | Value |
|
||||
|--------|-------|
|
||||
| Run dir | `pd-disaggregation-default-20260426T202540Z` |
|
||||
| Requests | 4,449 / 4,449 (100%) |
|
||||
| Errors | 0 |
|
||||
| **Mean Latency** | **1.662s** |
|
||||
| P50 Latency | 0.973s |
|
||||
| P90 Latency | 3.644s |
|
||||
| P99 Latency | 7.676s |
|
||||
| Mean TTFT | 0.445s |
|
||||
| P50 TTFT | 0.340s |
|
||||
| P90 TTFT | 0.880s |
|
||||
| Mean TPOT | 5.20ms |
|
||||
| Cache Hit Rate | 94.4% (4199/4449) |
|
||||
| Mean Cached Tokens | 27,794 |
|
||||
| KV Transfer Blocks | 105,235 |
|
||||
|
||||
### Experiment B: pd-colo (colocation) — FAILED
|
||||
|
||||
| Metric | Value |
|
||||
|--------|-------|
|
||||
| Run dir | `pd-colo-default-20260426T210129Z` |
|
||||
| Status | **CRASHED** |
|
||||
| Error | `token_to_kv_pool_allocator memory leak detected!` |
|
||||
| Root Cause | SGLang v0.5.10 `--disaggregation-mode null` 与 Qwen3.5-35B-A3B (Mamba/GDN hybrid) 不兼容 |
|
||||
| Requests | ~10 / 4,449 (0.2%) |
|
||||
|
||||
**结论**: 当前 vendored SGLang 不支持此模型的 colocation 模式。需要修复 token_to_kv_pool_allocator 中 Mamba 模型的内存管理。
|
||||
|
||||
### Experiment C: kvcache-centric (session-aware PD)
|
||||
|
||||
| Metric | Value |
|
||||
|--------|-------|
|
||||
| Run dir | `kvcache-centric-default-worker-admission-20260426T210800Z` |
|
||||
| Requests | 4,449 total |
|
||||
| **Errors** | **4,390 (98.7%)** |
|
||||
| Successful | 59 (1.3%) |
|
||||
| Mean Latency (success) | 1.238s |
|
||||
| P50 Latency (success) | 0.484s |
|
||||
| P90 Latency (success) | 2.550s |
|
||||
| Mean TTFT (success) | 0.179s |
|
||||
| P50 TTFT (success) | 0.081s |
|
||||
| Mean TPOT (success) | 4.70ms |
|
||||
| Direct-to-D session requests | 56 |
|
||||
| KV Transfer (actual) | 196 blocks (vs 105,235 planned) |
|
||||
|
||||
**Execution Mode 分布**:
|
||||
- `kvcache-centric` (failed): 4,390
|
||||
- `kvcache-direct-to-d-session` (success): 56
|
||||
- `pd-router-*` variants: 3
|
||||
|
||||
## 关键分析
|
||||
|
||||
### 1. pd-disaggregation (A) — 稳定可靠
|
||||
|
||||
- 100% 成功率,0 错误
|
||||
- Mean latency 1.66s 合理 (包含 P→D KV transfer 开销)
|
||||
- 94.4% cache hit 说明 prefix cache 在 P 侧工作良好
|
||||
- KV transfer 105K blocks = 主要开销来源
|
||||
- **适合生产使用**
|
||||
|
||||
### 2. pd-colo (B) — 不可用
|
||||
|
||||
- Qwen3.5-35B-A3B 的 Mamba/GDN hybrid 架构在 `disaggregation-mode null` 下触发内存泄漏
|
||||
- 这是 SGLang 的 bug,不是 agentic-pd-hybrid 的问题
|
||||
- **需要 SGLang 修复后重新测试**
|
||||
|
||||
### 3. kvcache-centric (C) — Admission 过于保守
|
||||
|
||||
- 98.7% 错误率说明 admission control 拒绝了几乎所有请求
|
||||
- `kvcache-seed-min-turn-id=2` 过滤了 turn 1 的 seed(正确行为)
|
||||
- 但绝大多数 turn 2+ 请求也走 `kvcache-centric` 模式后失败
|
||||
- 可能原因:
|
||||
- Worker admission 查询发现 D 侧没有对应 session 的 KV cache(因为 turn 1 没有 seed)
|
||||
- D 侧 transfer queue 积压导致 admission 拒绝
|
||||
- 成功的 56 个 `direct-to-d-session` 请求表现优异: TTFT 0.08s (P50), 比 pd-disagg 的 0.34s 快 4x
|
||||
- **需要调优 admission 参数,或使用 `kvcache-seed-min-turn-id=1` 允许 turn 1 seed**
|
||||
|
||||
### 4. kvcache-centric 成功请求 vs pd-disaggregation 对比
|
||||
|
||||
| Metric | pd-disagg (A) | kvcache-centric (C, success only) | Delta |
|
||||
|--------|:---:|:---:|:---:|
|
||||
| Mean Latency | 1.662s | 1.238s | **-25.5%** |
|
||||
| P50 Latency | 0.973s | 0.484s | **-50.3%** |
|
||||
| Mean TTFT | 0.445s | 0.179s | **-59.8%** |
|
||||
| P50 TTFT | 0.340s | 0.081s | **-76.2%** |
|
||||
| Mean TPOT | 5.20ms | 4.70ms | -9.6% |
|
||||
| Actual KV Transfer | 105,235 blk | 196 blk | **-99.8%** |
|
||||
|
||||
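**When kvcache-centric succeeds, the gains are large (measured on only the 59 successful requests, 1.3% of the trace, so the sample may not be representative):**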
|
||||
- TTFT 降低 60-76% (D 侧直接 append,无需 P→D transfer)
|
||||
- 端到端 latency 降低 25-50%
|
||||
- KV transfer 减少 99.8%
|
||||
|
||||
## 后续建议
|
||||
|
||||
1. **修复 pd-colo**: 提交 SGLang issue 关于 Mamba/GDN 模型在 disaggregation-mode null 下的内存泄漏
|
||||
2. **调优 kvcache-centric admission**:
|
||||
- 尝试 `--kvcache-seed-min-turn-id 1` 允许 turn 1 seed
|
||||
- 放宽 `--kvcache-seed-max-decode-transfer-queue-reqs` 阈值
|
||||
- 使用 `--kvcache-admission-mode router` (shadow state, 不在 critical path)
|
||||
3. **增加 D 侧内存**: 调整 `--mem-fraction-static` 给 KV cache 更多空间
|
||||
4. **多 P/D 配置**: 测试 2P2D (TP2) 配置以增加并行度
|
||||
|
||||
## 实验日期
|
||||
|
||||
2026-04-27
|
||||
305
docs/V5_PROFILE_INVESTIGATION_ZH.md
Normal file
@@ -0,0 +1,305 @@
|
||||
# v5+Profile 调查报告(经 critic 审计修订版)
|
||||
|
||||
**日期**: 2026-04-29(原稿)/ 2026-04-29(经审计修订)
|
||||
**实验配置**: Qwen3-30B-A3B (TP1)、单机 8×H100 80GB、trace = qwen35-swebench-50sess.jsonl (4449 reqs / 52 sessions)、time-scale=10、concurrency=32
|
||||
**数据集**: `outputs/qwen3-30b-tp1-v5-optD-profile/`(EXP1 1P7D + EXP2 2P6D,均加入 1Hz `/server_info` 时序采样)
|
||||
**v5 baseline 对照**: `outputs/qwen3-30b-tp1-v5-optD/`(无 polling)
|
||||
**研究问题**: v5 (Option D) 把 errors 从 9-10% 降到 0.2%,但 session-cap fallback 反而升到 46-51%。fallback / errors 究竟来自哪里。
|
||||
|
||||
> **本稿是经过 hostile audit 后的修订版**。原稿包含若干结论性错误(尤其是对 `held_tokens` 语义的解读颠倒、对 admission race 的过度归因、对 polling 副作用的轻视)。审计意见保存在本会话记录中,关键纠错以 ⚠️ 标注。
|
||||
|
||||
---
|
||||
|
||||
## TL;DR(已修订)
|
||||
|
||||
1. **真实容量**: 每张 D 的 `token_to_kv_pool_allocator.size = 92086 tokens (~92K)`。⚠️ 单 turn 真实 footprint **不是 50-100K**;`cached_tokens` p50=18K、p90=48K、p99=67K。原稿过度夸张。
|
||||
2. **`other = capacity − held − available` 的解读已修订**: ⚠️ `held_tokens = sum(slot.kv_allocated_len − slot.cache_protected_len)`(代码:`session_aware_cache.py:278-282`),即"slot 拿到但**不在 radix tree 保护范围内**的部分"。所以 **`other` 的最大单一组成很可能是 radix-tree 保护的共享前缀缓存(prefix cache)** —— 这通常是想要的,**不是病态浪费**。原稿把 `other` 全归因为 running batch + 在途传输是错的。
|
||||
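3. **The bimodal distribution of `other` is real** (p90 ≈ 73-86K on every D; p50 ≈ 0 on decode-0/1/6 and 16-47K on the others), but `cap−held−avail` alone cannot tell whether this is natural radix-cache accumulation or burst working memory. **The finer-grained P1 instrumentation must be done first.**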
|
||||
4. **errors 与 `other` 在时间上相关**属实,但**不能被解释为因果**。同一时段的多个变量(请求并发、in-flight transfer、可用空间)都在变化;无法仅凭时序对齐推断"`other` 吃掉了腾出来的空间"。
|
||||
5. **EXP2 2P6D errors 9 → 415**:⚠️ **polling 被升级为 leading hypothesis**,而非"无关"。证据:执行模式呈 ~1:1 替换(`session-cap-fb` −356 / `kvcache-centric` +406),且 `/server_info` 不是被动读 —— 它在 scheduler 主循环内遍历每个 session slot 计算 `is_idle`。需要 P0 三次 baseline 复跑去伪。
|
||||
6. **errors 集中在 18 个 session 上**(总共 52 个),每个 session 钉死在 1 个 D。per-D error rate 差异**无法解释为 D 的结构差别**,本质是 18 个"坏 session"如何被路由分配。
|
||||
7. **v5+profile 1P7D 的延迟优于 baseline** 完全在 single-run variance 范围内。N=1,**不能作为任何性能结论**。
|
||||
|
||||
---
|
||||
|
||||
## 1. 方法论
|
||||
|
||||
### 1.1 Instrument 改动
|
||||
- `src/agentic_pd_hybrid/replay.py` 加入 `_query_pool_snapshot` + `_poll_pool_timeseries`,后台 asyncio task 以 `--pool-poll-interval-s 1.0` 周期访问每个 P/D worker 的 `/server_info`。
|
||||
- 每 tick 写一行 jsonl 到 `<run_dir>/d-pool-timeseries.jsonl`,字段:`{worker_id, worker_role, session_count, resident_session_count, held_tokens, available_tokens, capacity_tokens, idle_evictable_*, sessions[], kvcache_mem_gb, last_gen_throughput, ...}`。
|
||||
- 分析脚本:`scripts/analysis/analyze_pool_timeseries.py`。
|
||||
|
||||
### 1.2 字段定义(已修订 ⚠️)
|
||||
`/server_info` → `internal_states[0].session_cache` 的来源是 `session_controller.py:get_streaming_session_cache_status` → `tree_cache`(`SessionAwareCache`)。
|
||||
|
||||
| 字段 | 真实含义 | 备注 |
|
||||
|---|---|---|
|
||||
| `held_tokens` | `sum_over_slots(ceil(kv_allocated_len, page_size) − cache_protected_len)` | **不是** "session 在 cache 中占用的全部";只统计**slot-private、未被 radix tree 保护**的部分 |
|
||||
| `cache_protected_len` | radix tree 保护的共享前缀部分 | 多个 session 共享时只计一次 |
|
||||
| `available_tokens` | `token_to_kv_pool_allocator.available_size()` | 全局 KV 池剩余空间 |
|
||||
| `capacity_tokens` | `allocator.size` | 单 D 的总 KV 容量 = 92086 |
|
||||
| `idle_evictable_tokens` | held 中可被 LRU 立即踢的部分(session 所有 req finished + streaming 模式) | |
|
||||
|
||||
因此:
|
||||
- **`other = capacity − held − available`** 包含但不限于:
|
||||
- **radix-tree 保护的共享前缀 token**(可能是大头) ⚠️ 原稿遗漏
|
||||
- 当前 running batch 占用的 KV slots
|
||||
- P→D 在途 transfer 的临时 buffer
|
||||
- mooncake 已注册但尚未提交到 tree_cache 的块
|
||||
- 内部碎片 / allocator 元数据
|
||||
|
||||
**含义**: 在补充 P1 instrument 之前,我们**无法分辨** `other` 中"radix-cache"(良性)和"burst 工作集 / fragmentation"(可能病态)的比例。
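The field definitions in this subsection can be written out as two small functions. This is a sketch of the arithmetic only: reading `ceil(kv_allocated_len, page_size)` as "round up to a multiple of the page size" is an assumption, and the function names are illustrative rather than the ones in `session_aware_cache.py`.

```python
import math


def held_tokens(slots, page_size):
    """Slot-private tokens that are not protected by the radix tree.

    `slots` is an iterable of (kv_allocated_len, cache_protected_len) pairs.
    ASSUMPTION: allocation is rounded up to a whole number of pages.
    """
    total = 0
    for kv_allocated_len, cache_protected_len in slots:
        paged_len = math.ceil(kv_allocated_len / page_size) * page_size
        total += paged_len - cache_protected_len
    return total


def other_tokens(capacity, held, available):
    """Everything in the pool that is neither slot-private nor free."""
    return capacity - held - available


# One sampled tick: capacity 92086, held 40985, available 30175.
sample_other = other_tokens(92086, 40985, 30175)
```

Because `held` subtracts `cache_protected_len`, radix-protected prefix tokens are counted neither in `held` nor in `available`, so they necessarily land in `other`.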
|
||||
|
||||
### 1.3 配置一致性与风险
|
||||
- v5+profile 与 v5 baseline 唯一差别:加了 `--pool-poll-interval-s 1.0`(其余 CLI 参数完全一致)。
|
||||
- **两次 run 时间间隔 ~21 小时**(2026-04-28 15:39/16:27 vs 2026-04-29 12:08/12:59)⚠️ 原稿误写 ~6h。同一台机,但 GPU 温度、PCIe、NUMA 分配未控制。
|
||||
- **N=1 比较没有统计意义**;任何延迟差异 < 30% 都属于 single-run variance 合理范围。
|
||||
|
||||
---
|
||||
|
||||
## 2. 整体性能对比
|
||||
|
||||
| 指标 | v5 1P7D | **v5+profile 1P7D** | v5 2P6D | **v5+profile 2P6D** |
|
||||
|---|---|---|---|---|
|
||||
| 总 requests | 4449 | 4449 | 4449 | 4449 |
|
||||
| **errors** | 9 (0.2%) | 6 (0.1%) | 9 (0.2%) | **415 (9.3%)** |
|
||||
| truncated | 42 | 43 | 42 | 42 |
|
||||
| direct-to-D | 44.7% | 54.9% | 41.3% | 41.1% |
|
||||
| session-cap fallback | 45.6% | 36.1% | 50.6% | 42.6% |
|
||||
| no-d-capacity | 1.2% | 0.7% | 0.8% | 0.6% |
|
||||
| pd-router-d-session-reseed | 4.8% | 4.3% | 3.4% | 2.9% |
|
||||
| pd-router-turn1-seed | 1.2% | 1.2% | 1.1% | 1.1% |
|
||||
| **kvcache-centric (failed mode)** | 0.2% (9) | 0.1% (6) | 0.2% (9) | **9.3% (415)** |
|
||||
| latency mean / p50 / p90 / p99 (s) | 5.18/1.59/14.7/26.1 | 4.21/1.18/11.3/28.8 | 3.49/1.31/9.1/24.9 | 3.23/1.11/8.4/20.3 |
|
||||
|
||||
⚠️ **不要从此表得出"v5+profile 改进了延迟"** —— N=1 single run,且 EXP2 引入了 415 个 errors 相当于换了一种回退策略,延迟均值的下降很可能只是**剔除了慢路径请求**的副作用。
|
||||
|
||||
### 2.1 EXP2+profile 415 errors 解构(已修订)
|
||||
|
||||
**Error type 分布**:
|
||||
| Error Type | 数量 |
|
||||
|---|---|
|
||||
| `RuntimeError: generate stream ended before producing any token` | 407 |
|
||||
| `ReadTimeout: ` | 8 |
|
||||
|
||||
⚠️ **关键约束**:
|
||||
- **414/415 个 error 的 `kv_transfer_blocks > 0`**(从 metrics jsonl 验证)。这些请求**已经过了 admission,P→D 传输已开始**,死于下游(server-side abort、流被关、生成阶段失败)。
|
||||
- **`session_reused=False` 占 415/415**(全部是 seed,无一是 direct append)。
|
||||
- **失败集中在 18 个 unique session**(top 5: 58080→decode-5 66 errs / 70560→decode-2 54 / 67200→decode-4 40 / 59200→decode-4 35 / 77280→decode-2 33),每个 session 钉死在一台 D。
|
||||
|
||||
**Per-D error rate(已修正百分比)**:
|
||||
| Decode Worker | Errors | Total Reqs | Error Rate |
|
||||
|---|---|---|---|
|
||||
| decode-0 | 56 | 758 | 7.4% |
|
||||
| decode-1 | 5 | 561 | 0.9% |
|
||||
| decode-2 | 141 | 858 | **16.4%** |
|
||||
| decode-3 | 0 | 838 | 0.0% |
|
||||
| decode-4 | 106 | 731 | 14.5% |
|
||||
| decode-5 | 107 | 703 | 15.2% |
|
||||
|
||||
⚠️ **不要解读为"decode-3 健康、decode-2 病态"**。每个 session 钉死在一台 D,18 个坏 session 是否落到某个 D 是路由分配的随机结果。**当前 N=1 数据无法分辨"D 结构差异"与"session 分配运气"**。
|
||||
|
||||
---
|
||||
|
||||
## 3. D KV pool 时序分解(EXP1 1P7D 关键结果)
|
||||
|
||||
每张 D capacity=92086 tokens,运行 ~2696 秒(去掉前 10% 暖机):
|
||||
|
||||
| Worker | mean_other | p50_other | p90_other | max_other | mean_held | mean_avail |
|
||||
|---|---:|---:|---:|---:|---:|---:|
|
||||
| decode-0 | 13599 | 63 | 77189 | 90959 | 47124 | 31363 |
|
||||
| decode-1 | 21242 | 0 | 76854 | 91074 | 37024 | 33820 |
|
||||
| decode-2 | 39333 | 46841 | 82782 | 91996 | 17381 | 35372 |
|
||||
| decode-3 | 30543 | 15864 | 81512 | 91511 | 9584 | 51959 |
|
||||
| decode-4 | 32659 | 32365 | 72995 | 92082 | 7643 | 51784 |
|
||||
| decode-5 | 31745 | 20366 | 86341 | 91211 | 11305 | 49036 |
|
||||
| decode-6 | 24602 | 701 | 82291 | 91000 | 20967 | 46517 |
|
||||
|
||||
**已修订观察(去掉了原稿的过度归因)**:
|
||||
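- **`other` is bimodal**: p90 is 73-86K on every D and the mean is 14-39K, while p50 is near 0 on decode-0/1/6 and 16-47K on decode-2/3/4/5. The shape itself is confirmed.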
|
||||
- **不同 D 的 mean_held / mean_other 差异巨大** —— 但⚠️ **不能直接归类为 "session-heavy" 或 "transfer-heavy"**,因为我们不知道 `other` 里 radix-cache vs 工作内存的比例。**P1 的拆分必做**。
|
||||
- 由于 `held` 不包含 radix-protected token,`mean_held` 低**不代表**该 D 上 sessions 占用少 —— 只代表它们的"slot 私有部分"少;共享前缀可能很大,完全藏在 `other` 里。
|
||||
|
||||
### 3.1 `other` 在某些时段持续高位(EXP1 decode-2 抽样)
|
||||
|
||||
| t (s) | held | avail | other | sess_count | last_gen_throughput |
|
||||
|---:|---:|---:|---:|---:|---:|
|
||||
| 3 | 0 | 92086 | 0 | 0/0 | (未抽) |
|
||||
| 273 | 65310 | 26776 | 0 | 1/1 | (未抽) |
|
||||
| 543 | 15296 | 76589 | 201 | 1/1 | (未抽) |
|
||||
| 812 | 0 | 92086 | 0 | 0/0 | (未抽) |
|
||||
| 1082 | 52507 | 39579 | 0 | 1/1 | (未抽) |
|
||||
| 1351 | 40985 | 30175 | 20926 | 2/2 | (未抽) |
|
||||
| **1622** | **0** | 17703 | **74383** | **0/0** | **未核** |
|
||||
| 1891 | 0 | 46376 | 45710 | 0/0 | (未抽) |
|
||||
| 2161 | 0 | 27667 | 64419 | 0/0 | (未抽) |
|
||||
| 2430 | 0 | 62224 | 29862 | 0/0 | (未抽) |
|
||||
|
||||
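⚠️ **From t=1622 onward (about 30+ ticks) the worker stays at held=0 / sess=0 with other ≈ 30-74K in the sampled rows.** A state this persistent **does not look like a burst working set** (a burst should last under a second). More likely explanations include: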
|
||||
- 一个 stuck request 的 KV 块未能正常释放
|
||||
- mooncake 注册但未 commit 的 transfer buffer 滞留
|
||||
- 某个 cleanup 路径未触发
|
||||
|
||||
**未在原稿中验证 `last_gen_throughput`**,该字段记录在 timeseries 但未对齐分析。**P1 时一并补**。
|
||||

---

## 4. Temporal correlation between errors and saturation (EXP2 2P6D)

### 4.1 Equal-count vs. equal-time deciles (revised ⚠️)

The original draft showed only equal-time bins, which created the visual illusion that "the system recovers in the 10th decile". Both binnings side by side:

| Decile | Equal-time (reqs / errs / rate) | Equal-count (reqs / errs / rate) |
|:---:|:---:|:---:|
| 1 | 567 / 0 / 0.0% | 444 / 0 / 0.0% |
| 2 | 268 / 0 / 0.0% | 445 / 0 / 0.0% |
| 3 | 517 / 0 / 0.0% | 445 / 0 / 0.0% |
| 4 | 189 / 0 / 0.0% | 445 / 0 / 0.0% |
| 5 | 662 / 3 / 0.5% | 445 / 3 / 0.7% |
| 6 | 417 / 27 / 6.5% | 445 / 28 / 6.3% |
| 7 | 486 / 39 / 8.0% | 445 / 42 / 9.4% |
| 8 | 612 / 177 / 28.9% | 445 / 114 / 25.6% |
| 9 | 486 / 128 / 26.3% | 445 / 119 / 26.7% |
| **10** | **245 / 41 / 16.7%** | **445 / 109 / 24.5%** |

⚠️ **The 10th decile is not a "system recovery".** Equal-count binning shows a 24.5% error rate, on par with deciles 8 and 9. The "recovery" narrative in the original draft was a visual artifact of the denominators (245 vs. 612 requests).
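The two binnings in the table can be reproduced with a few lines. This is a minimal sketch, assuming the per-request metrics have already been reduced to `(timestamp_s, is_error)` pairs; the function name and input shape are illustrative, and the exact bin boundaries may differ by one request from the analyzer that produced the table.

```python
def decile_error_rates(records, mode="count", n_bins=10):
    """records: iterable of (timestamp_s, is_error) pairs.

    Returns one (reqs, errs, rate) tuple per bin.
    mode="time"  splits the wall-clock span into equal-width bins;
    mode="count" splits the time-sorted requests into equal-size bins.
    """
    records = sorted(records)
    if not records:
        return []
    bins = [[] for _ in range(n_bins)]
    if mode == "time":
        t0, t1 = records[0][0], records[-1][0]
        # Fall back to width 1.0 when every timestamp is identical.
        width = (t1 - t0) / n_bins or 1.0
        for t, err in records:
            bins[min(int((t - t0) / width), n_bins - 1)].append(err)
    elif mode == "count":
        n = len(records)
        for i, (_, err) in enumerate(records):
            bins[i * n_bins // n].append(err)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return [(len(b), sum(b), sum(b) / len(b) if b else 0.0) for b in bins]
```

Comparing `decile_error_rates(rows, "time")` with `decile_error_rates(rows, "count")` on the same run makes the denominator effect visible directly: a time bin with few requests can show a lower rate even when the failure rate per request has not improved.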
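### 4.2 Multiple hypotheses side by side (revised; the admission race is no longer singled out)

Candidate mechanisms for the 415 errors in EXP2 2P6D (ordered by the strength of the current evidence):

**H1: Polling perturbs scheduler timing (leading hypothesis ⚠️)**
- Evidence: execution modes were swapped almost 1:1 (session-cap-fb −356 / kvcache-centric +406).
- Evidence: `/server_info` enters the scheduler main loop and iterates over session slots; 1 Hz × 8 workers is not zero overhead.
- Test: **if all three P0 runs (baseline EXP2 reruns) yield ~9 errors, this hypothesis is confirmed**; if they yield ~400 errors, it is falsified.

**H2: v5 itself has an admission/transfer race**
- The v5 baseline also produced 9 errors (all ReadTimeout), which suggests the race already exists in the baseline and was amplified under profiling.
- Weakened evidence: the "admission race" proposed in the original draft (a stale `admit_direct_append` snapshot) conflicts with the data. **414 of 415 errors have `kv_transfer_blocks > 0`**, so they all passed admission and failed downstream. Even if a race exists, it does not occur at admission but after the P→D transfer and before generation starts.

**H3: Structural workload failure in 18 specific sessions**
- Failures concentrate in 18 of 52 sessions, all at a high turn_id (median = 70).
- These sessions may have unusually long inputs, or some trace structure may trigger one specific code path.
- Test: after the three P0 baseline reruns, check whether the same 18 sessions fail again.

**H4: GPU/PCIe state perturbation in a single run**
- The runs were ~21 hours apart, with different GPU temperature and clocks.
- Falsification condition: if all three P0 baseline runs yield ~9 errors, single-run perturbation is ruled out as the dominant cause.

⚠️ **The original draft's exclusive focus on the admission race (H2) was wrong.** The current data cannot determine which of H1-H4 is the main cause.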
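The two data points that separate H2 from H3 (how many errors made it past admission, and how concentrated the errors are per session) can be recomputed from the per-request metrics. The sketch below assumes each row carries `session_id`, `turn_id`, `error`, and `kv_transfer_blocks`; only `turn_id` and `kv_transfer_blocks` are named in this analysis, so the other two keys are hypothetical and would need to be mapped to the real metrics schema.

```python
from collections import Counter
from statistics import median


def summarize_errors(rows):
    """Summarize failed requests from an iterable of per-request dicts."""
    rows = list(rows)
    errors = [r for r in rows if r.get("error")]
    # A non-zero transfer-block count means the request passed admission
    # and failed somewhere downstream of the P->D transfer decision.
    past_admission = sum(
        1 for r in errors if (r.get("kv_transfer_blocks") or 0) > 0
    )
    per_session = Counter(r["session_id"] for r in errors)
    turns = [r["turn_id"] for r in errors]
    return {
        "error_count": len(errors),
        "errors_past_admission": past_admission,
        "failing_sessions": len(per_session),
        "total_sessions": len({r["session_id"] for r in rows}),
        "median_error_turn_id": median(turns) if turns else None,
        "top_sessions": per_session.most_common(5),
    }
```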

---

## 5. 1P7D vs. 2P6D overall comparison

| Config | total decode ticks | other p50 | other p90 | other>30K freq | other>50K freq | other>70K freq | held>60K freq |
|---|---:|---:|---:|---:|---:|---:|---:|
| 1P7D | 18865 | 663 | 79751 | 36.9% | 27.9% | 14.8% | 15.5% |
| 2P6D | 14016 | 14459 | 77199 | 43.2% | 30.4% | 13.9% | 4.8% |

⚠️ **The original draft's reading ("p50_other in 2P6D is 22× that of 1P7D, so 2P pushes harder") was an over-interpretation.** Consider the denominator effect: the same trace's total work is shared by 6 D workers in 2P6D versus 7 in 1P7D, so **per-D pressure is inherently higher**, with no direct causal link to the number of P workers. The data only supports "per-D load is higher in 2P6D"; it does **not** support "2P is more aggressive on transfers than 1P".
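For a bimodal quantity such as `other`, threshold frequencies are easier to compare across configurations than a single percentile; in the table above they differ far less between 1P7D and 2P6D than p50 does. The helper below is a sketch of how the `other>NK freq` columns could be computed from pooled tick values, together with the even-split share that underlies the denominator argument. Function names are illustrative, and the even-split assumption ignores session pinning.

```python
def threshold_frequencies(values, thresholds=(30_000, 50_000, 70_000)):
    """Fraction of ticks whose value strictly exceeds each threshold."""
    values = list(values)
    n = len(values)
    if n == 0:
        return {t: 0.0 for t in thresholds}
    return {t: sum(v > t for v in values) / n for t in thresholds}


def per_decode_share(n_decode):
    """Share of the trace each D worker serves if load is spread evenly."""
    return 1.0 / n_decode


# Moving from 7 to 6 decode workers raises the even-split share by 7/6.
share_ratio_2p6d_vs_1p7d = per_decode_share(6) / per_decode_share(7)
```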
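
---

## 6. Key interpretations (substantially revised)

### 6.1 The real v5 bottleneck is not yet clear

The original draft claimed that "the bottleneck is D's KV pool being occupied by 'other' during pressure periods". ⚠️ **This conclusion is withdrawn.** Given that `held_tokens` is actually the slot-private (non-tree) portion, the largest single component of `other` is **most likely the normal radix-tree shared prefix**. "Occupied by the running batch / in-flight transfers" is an **unverified conjecture**. The fine-grained instrumentation in P1 is needed to identify the real bottleneck.

### 6.2 No reliable interpretation of LRU eviction behavior yet

The original draft inferred from the "collapse" of mean_held during pressure periods that LRU was evicting aggressively. However, `held` is only the slot-private portion, and sessions may still be retained by the radix tree. A drop in `held` does not mean a session was evicted; it may only reflect a change in the `cache_protected_len` ratio. **No conclusion until the P1 breakdown is available.**

### 6.3 v5+profile 1P7D being "faster than baseline" is a single-run coincidence

The two runs were ~21 hours apart (the original draft mistakenly wrote ~6h), with GPU temperature and PCIe state uncontrolled. With **N=1**, no performance difference below 30% can be claimed.

### 6.4 EXP2 2P6D 415 errors: polling is the leading suspect (upgraded)

The original draft listed polling as a "secondary possibility". ⚠️ **It is now upgraded to the leading suspect:**
- The near 1:1 swap of execution modes (session-cap-fb −356 / kvcache-centric +406) indicates that polling **changed which admission path requests take**.
- `/server_info` is not a read-only side channel: it runs inside the scheduler loop and iterates over session slots to compute `is_idle`.
- **The three P0 baseline reruns are required to confirm or rule this out**; v6 must not be touched before then.

### 6.5 "Other" at 90% on P does not mean backup blocks

SessionAwareCache is **not enabled** on `prefill-0` (the replay data shows `held=0`), so "other" on P equals P's total KV usage (radix cache + running batch + backups). ⚠️ The current data **cannot tell** whether prefill-backup-policy actually released anything. A dedicated `prefill_backup_tokens` field needs to be added on P.

---

## 7. v6 action items (reordered, starting with P0)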
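### **P0: Verify the reproducibility of EXP2 errors=9** (highest priority, do first)

**Procedure**: run the v5 baseline EXP2 three times (same v5 configuration, **polling disabled**) and compare the error distributions.

- If all three runs yield ~9 errors → polling is confirmed as the main cause of the jump to 415. **Polling must be made lighter-weight** (for example lower frequency, streaming push, or sidecar metrics instead of HTTP polling) before any follow-up work.
- If all three runs yield ~400 errors → polling is not the main cause; the 415 is a combination of the v5 admission/transfer race and single-run GPU state perturbation.
- If the results spread widely (for example 9 / 50 / 400) → run-to-run variance dominates, and any single-run comparison is invalid.

**Estimated effort**: one new sweep script (EXP2 only, 3 runs) + ~3 × 50 min ≈ 2.5 h of GPU time.
**Risk**: none (purely rerunning an existing configuration).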
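The three branches above can be written down as an explicit decision rule so the outcome is not argued after the fact. This is only a sketch: the function name is hypothetical, and the `low` / `high` cutoffs are illustrative stand-ins for "~9" and "~400" that should be fixed before the reruns start.

```python
def classify_p0(error_counts, low=30, high=300):
    """Map the error counts of the three baseline reruns to a P0 branch.

    low / high are assumed cutoffs for "about 9" and "about 400" errors.
    """
    counts = list(error_counts)
    if len(counts) != 3:
        raise ValueError("P0 expects exactly three baseline reruns")
    if all(c <= low for c in counts):
        return "polling-confirmed"        # polling explains the jump to 415
    if all(c >= high for c in counts):
        return "polling-not-main-cause"   # baseline alone reproduces ~400
    return "variance-or-inconclusive"     # single-run comparisons are invalid
```

For example, `classify_p0([9, 50, 400])` falls into the third branch, matching the "results spread widely" case in the list.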
### **P1: Break D's `other` down into components** (write the code in parallel while P0 runs)

**Procedure**: modify SGLang's `scheduler.py:get_streaming_session_cache_status` and `session_aware_cache.py` so that the returned dict also contains:

- `radix_protected_tokens` = `sum(slot.cache_protected_len for slot in slots)` ⚠️ this was the original draft's blind spot, a key missing field exposed by the critic
- `running_batch_tokens` = `sum(len(req.fill_ids) for req in running_batch.reqs)`
- `inflight_transfer_tokens` = `sum(req.size for req in disagg_decode_transfer_queue.queue)`
- `prealloc_tokens` = `sum(req.size for req in disagg_decode_prealloc_queue.queue)`
- `retracted_tokens` = `sum(req.size for req in disagg_decode_prealloc_queue.retracted_queue)`
- `last_gen_throughput` (already present), refined by adding `running_batch_size` (number of requests)

**Expected benefit**: `other_unaccounted = capacity − held − available − radix_protected − running_batch − inflight − prealloc − retracted` should be close to 0. Whatever remains is genuinely "pathological" memory.
**Risk**: low (read-only stats; admission logic is unchanged).
**Effort**: ~80-line SGLang patch + matching updates to `_query_pool_snapshot` in replay.py and to the analyzer.
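On the analyzer side, the `other_unaccounted` check reduces to a subtraction over one snapshot. The sketch below assumes the snapshot is a dict that uses the P1 field names listed above plus `held` and `available`; those two key names and the function name are assumptions, since the actual `_query_pool_snapshot` schema is not shown here.

```python
CAPACITY = 92086  # token_to_kv_pool_allocator.size as reported in this analysis

# Component fields proposed in P1; missing fields count as 0.
COMPONENTS = (
    "radix_protected_tokens",
    "running_batch_tokens",
    "inflight_transfer_tokens",
    "prealloc_tokens",
    "retracted_tokens",
)


def other_unaccounted(snapshot, capacity=CAPACITY):
    """Tokens in `other` that none of the P1 components explain."""
    other = capacity - snapshot["held"] - snapshot["available"]
    explained = sum(snapshot.get(name) or 0 for name in COMPONENTS)
    return other - explained
```

With today's data (no component fields), the t=1622 sample from §3.1 (`held=0`, `available=17703`) returns the full 74383, which is exactly why the breakdown is needed before drawing conclusions.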
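### **P2: If P0 shows polling is the main cause, change the polling implementation**

- Option A: turn `/server_info` into an event-driven push (the scheduler writes stats into a ring buffer at the end of each step; polling only reads the buffer and never enters the scheduler queue)
- Option B: reduce the polling rate from 1 Hz to once every 5 s or 10 s, and verify on the P1 breakdown data that this is still sufficient
- Option C: separate the locks on the scheduler side so that stats reads and admission decisions no longer share a critical section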
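Option A could look roughly like the following. This is a design sketch under stated assumptions, not SGLang code: `StatsRing` is a hypothetical class, the scheduler would call `publish()` at the end of each step, and the HTTP handler would call `latest()`. It relies on `collections.deque` appends being thread-safe in CPython, so readers never take a scheduler-side lock or enqueue work.

```python
import time
from collections import deque
from typing import Optional


class StatsRing:
    """Bounded buffer of scheduler stats snapshots (hypothetical helper)."""

    def __init__(self, maxlen: int = 64):
        # Old snapshots fall off the left end once maxlen is reached.
        self._ring = deque(maxlen=maxlen)

    def publish(self, stats: dict) -> None:
        """Called by the scheduler thread at the end of a step."""
        # Copy the dict so later scheduler mutations do not reach readers.
        self._ring.append({**stats, "published_at": time.monotonic()})

    def latest(self) -> Optional[dict]:
        """Called by the polling handler; never blocks the scheduler."""
        try:
            return self._ring[-1]
        except IndexError:
            return None

    def history(self) -> list:
        """Copy of the retained snapshots, oldest first."""
        return list(self._ring)
```

The trade-off is staleness: a reader sees the state as of the last completed step, which is acceptable for 1 Hz profiling but not for admission decisions that need a fresh view.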
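### **P3 (conditional, pending P0+P1 data)**: decide the real optimization direction

The five priorities in §7 of the original draft **all need to be re-evaluated** now that the `other` model has been corrected. Rank them again once the real breakdown data is available.

---

## 8. Limitations and confounders (expanded)

1. ⚠️ The semantics of `held_tokens` were read backwards in the original draft, which led to an incorrect causal attribution for `other` (corrected; see §1.2).
2. The `other` field is derived and **not broken down**, so it cannot be attributed directly. The P1 instrumentation is needed to separate radix cache, running batch, in-flight transfers, and so on.
3. ⚠️ The order-of-magnitude gap between the 415 errors in EXP2+profile and the 9 baseline errors **cannot be deconfounded**; polling is the leading suspect but remains unconfirmed. **P0 is a required step.**
4. **N=1** per experiment configuration: any latency or failure difference between v5+profile and the v5 baseline is within plausible single-run variance and **cannot serve as a directional conclusion**.
5. The trace is single-shot; the specific structure of its 52 sessions and 4449 requests may amplify certain code paths.
6. `capacity = 92086` is `token_to_kv_pool_allocator.size`, derived from `mem_fraction_static` (exact value not extracted); the gap to the physical limit of an H100 80GB is SGLang's safety margin.
7. ⚠️ The sustained high `other` for 30+ ticks after t=1622 in §3.1 was **not cross-checked against `last_gen_throughput`**; the original draft's "running batch + in-flight transfers" explanation is conjecture, not evidence.
8. ⚠️ The characteristics of the 18 of 52 failing sessions (turn_id, input length, prefix shape) **have not been compared** against the rest; it cannot be ruled out that some session type inherently triggers a fixed bug.
9. Polling at 1 Hz misses sub-second bursts, so the bimodality of `other` may be more extreme than measured.
10. The critic noted that `pd-router-d-session-reseed` moved in opposite directions, up in EXP1 (193 vs 152) and down in EXP2 (127 vs 152), which the **original draft did not analyze**. This is a clear signal about admission/routing decisions and should be revisited after P1.

---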
## 9. Next steps (order updated)

1. **P0**: run `scripts/sweep_tp1_v5_baseline_rerun_exp2.sh`: three EXP2 baseline runs, no polling.
2. **P1**: in parallel, modify SGLang so that `other` is actually broken down.
3. Once P0+P1 are done:
   - Rerun EXP2 once with the new instrumentation (same polling) to obtain the `other` breakdown.
   - Compare against the error distribution of the three baseline reruns.
   - Decide whether to roll back polling, tune admission, or target the workload characteristics of the 18 specific sessions.
4. Any v6 code change (optimizing admission / eviction / transfer) **must come after P0+P1**.

---
## 10. Data artifacts

```
outputs/qwen3-30b-tp1-v5-optD-profile/
├── exp{1,2}_*_metrics.jsonl                         # 4449 lines per experiment
├── exp{1,2}_*_summary.json
├── exp{1,2}_*_pool_timeseries.jsonl                 # 12 MB / 10 MB
└── kvcache-centric-...20260429T{120847,125911}Z/    # raw run dir

outputs/qwen3-30b-tp1-v5-optD/                       # baseline reference (N=1)
└── exp{1,2}_1p7d_kvc_optD_*

# To be produced by P0:
outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/
└── exp2_2p6d_run{1,2,3}_*
```

Analysis script: `scripts/analysis/analyze_pool_timeseries.py` (pass `--json` for machine-readable output).
@@ -0,0 +1,88 @@
|
||||
{
|
||||
"actual_output_tokens_stats": {
|
||||
"count": 4086.0,
|
||||
"mean": 213.95105237395987,
|
||||
"p50": 83.0,
|
||||
"p90": 562.0,
|
||||
"p99": 1346.0
|
||||
},
|
||||
"cache_hit_request_count": 3929,
|
||||
"cached_tokens_stats": {
|
||||
"count": 4449.0,
|
||||
"mean": 22635.924702180266,
|
||||
"p50": 20010.0,
|
||||
"p90": 48002.0,
|
||||
"p99": 65424.0
|
||||
},
|
||||
"decode_request_priorities": {},
|
||||
"error_count": 363,
|
||||
"execution_modes": {
|
||||
"kvcache-centric": 363,
|
||||
"kvcache-direct-to-d-session": 1716,
|
||||
"pd-router-d-session-reseed": 23,
|
||||
"pd-router-fallback-d-backpressure": 12,
|
||||
"pd-router-fallback-large-append": 5,
|
||||
"pd-router-fallback-large-append-seed-filter-early-turn": 51,
|
||||
"pd-router-fallback-large-append-session-cap": 2148,
|
||||
"pd-router-fallback-no-d-capacity": 7,
|
||||
"pd-router-fallback-session-cap": 32,
|
||||
"pd-router-large-append-reseed": 39,
|
||||
"pd-router-large-append-reseed-after-eviction": 2,
|
||||
"pd-router-turn1-d-backpressure": 1,
|
||||
"pd-router-turn1-no-d-capacity": 3,
|
||||
"pd-router-turn1-seed": 34,
|
||||
"pd-router-turn1-session-cap": 13
|
||||
},
|
||||
"latency_stats_s": {
|
||||
"count": 4086.0,
|
||||
"mean": 4.8753733304192455,
|
||||
"p50": 1.754677688702941,
|
||||
"p90": 12.66968655679375,
|
||||
"p99": 28.717210091650486
|
||||
},
|
||||
"mechanisms": {
|
||||
"kvcache-centric": 4449
|
||||
},
|
||||
"per_decode_load": {
|
||||
"decode-0": 616,
|
||||
"decode-1": 658,
|
||||
"decode-2": 674,
|
||||
"decode-3": 582,
|
||||
"decode-4": 656,
|
||||
"decode-5": 662,
|
||||
"decode-6": 601
|
||||
},
|
||||
"per_prefill_load": {
|
||||
"prefill-0": 4449
|
||||
},
|
||||
"prefill_request_priorities": {
|
||||
"-100": 98,
|
||||
"100": 2272
|
||||
},
|
||||
"re_prefill_count": 0,
|
||||
"request_count": 4449,
|
||||
"reuse_expected_count": 4397,
|
||||
"reuse_observed_count": 4397,
|
||||
"router_url": "http://127.0.0.1:8000",
|
||||
"session_reset_count": 0,
|
||||
"session_reused_count": 1716,
|
||||
"total_actual_kv_transfer_blocks": 62123,
|
||||
"total_cached_tokens": 100707229,
|
||||
"total_kv_transfer_blocks": 105235,
|
||||
"tpot_stats_s": {
|
||||
"count": 4086.0,
|
||||
"mean": 0.005829451223571163,
|
||||
"p50": 0.005684156496173296,
|
||||
"p90": 0.007143743503740225,
|
||||
"p99": 0.008634991403068266
|
||||
},
|
||||
"trace_path": "outputs/qwen3-30b-tp1-v3-kvaware/kvcache-centric-kv-aware-worker-admission-20260428T095141Z/sampled-trace.jsonl",
|
||||
"truncated_request_count": 42,
|
||||
"ttft_stats_s": {
|
||||
"count": 4086.0,
|
||||
"mean": 3.5955862397812597,
|
||||
"p50": 0.36274072993546724,
|
||||
"p90": 10.972254231572151,
|
||||
"p99": 27.433656523004174
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,85 @@
|
||||
{
|
||||
"actual_output_tokens_stats": {
|
||||
"count": 4440.0,
|
||||
"mean": 225.87972972972972,
|
||||
"p50": 86.0,
|
||||
"p90": 576.0,
|
||||
"p99": 1347.0
|
||||
},
|
||||
"cache_hit_request_count": 4201,
|
||||
"cached_tokens_stats": {
|
||||
"count": 4449.0,
|
||||
"mean": 24345.55787817487,
|
||||
"p50": 21504.0,
|
||||
"p90": 48792.0,
|
||||
"p99": 69120.0
|
||||
},
|
||||
"decode_request_priorities": {},
|
||||
"error_count": 9,
|
||||
"execution_modes": {
|
||||
"kvcache-centric": 9,
|
||||
"kvcache-direct-to-d-session": 1358,
|
||||
"pd-router-d-session-reseed": 12,
|
||||
"pd-router-fallback-d-backpressure": 2,
|
||||
"pd-router-fallback-large-append-seed-filter-early-turn": 52,
|
||||
"pd-router-fallback-large-append-session-cap": 2902,
|
||||
"pd-router-fallback-session-cap": 25,
|
||||
"pd-router-large-append-reseed": 34,
|
||||
"pd-router-large-append-reseed-after-eviction": 4,
|
||||
"pd-router-turn1-d-backpressure": 1,
|
||||
"pd-router-turn1-seed": 30,
|
||||
"pd-router-turn1-session-cap": 20
|
||||
},
|
||||
"latency_stats_s": {
|
||||
"count": 4440.0,
|
||||
"mean": 3.582334662846558,
|
||||
"p50": 1.517257746309042,
|
||||
"p90": 9.225348330102861,
|
||||
"p99": 18.70269925892353
|
||||
},
|
||||
"mechanisms": {
|
||||
"kvcache-centric": 4449
|
||||
},
|
||||
"per_decode_load": {
|
||||
"decode-0": 710,
|
||||
"decode-1": 630,
|
||||
"decode-2": 763,
|
||||
"decode-3": 737,
|
||||
"decode-4": 879,
|
||||
"decode-5": 730
|
||||
},
|
||||
"per_prefill_load": {
|
||||
"prefill-0": 2225,
|
||||
"prefill-1": 2224
|
||||
},
|
||||
"prefill_request_priorities": {
|
||||
"-100": 80,
|
||||
"100": 3002
|
||||
},
|
||||
"re_prefill_count": 0,
|
||||
"request_count": 4449,
|
||||
"reuse_expected_count": 4397,
|
||||
"reuse_observed_count": 4397,
|
||||
"router_url": "http://127.0.0.1:8000",
|
||||
"session_reset_count": 0,
|
||||
"session_reused_count": 1358,
|
||||
"total_actual_kv_transfer_blocks": 78979,
|
||||
"total_cached_tokens": 108313387,
|
||||
"total_kv_transfer_blocks": 105235,
|
||||
"tpot_stats_s": {
|
||||
"count": 4440.0,
|
||||
"mean": 0.005882534704321737,
|
||||
"p50": 0.005807478777200416,
|
||||
"p90": 0.00712956755887717,
|
||||
"p99": 0.008372141476720572
|
||||
},
|
||||
"trace_path": "outputs/qwen3-30b-tp1-v3-kvaware/kvcache-centric-kv-aware-worker-admission-20260428T104343Z/sampled-trace.jsonl",
|
||||
"truncated_request_count": 42,
|
||||
"ttft_stats_s": {
|
||||
"count": 4440.0,
|
||||
"mean": 2.2045287611873334,
|
||||
"p50": 0.32809355948120356,
|
||||
"p90": 6.947275545448065,
|
||||
"p99": 16.705802395939827
|
||||
}
|
||||
}
|
||||
189
outputs/qwen3-30b-tp1-v3-kvaware/sweep_results.txt
Normal file
@@ -0,0 +1,189 @@
|
||||
[2026-04-28 17:51:41] Starting TP1 v3 sweep (KVC with kv-aware policy)
|
||||
[2026-04-28 17:51:41] Model: /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
|
||||
[2026-04-28 17:51:41] Trace: outputs/qwen35-swebench-50sess.jsonl (4449 requests, 52 sessions)
|
||||
[2026-04-28 17:51:41] Key change: --policy kv-aware for KVC (was --policy default in v2)
|
||||
[2026-04-28 17:51:41]
|
||||
[2026-04-28 17:51:41] === [EXP1] 1P7D KVC kv-aware ===
|
||||
[2026-04-28 18:43:43] === exp1_1p7d_kvc_kvaware COMPLETED ===
|
||||
[2026-04-28 18:43:43] Summary:
|
||||
{
|
||||
"actual_output_tokens_stats": {
|
||||
"count": 4086.0,
|
||||
"mean": 213.95105237395987,
|
||||
"p50": 83.0,
|
||||
"p90": 562.0,
|
||||
"p99": 1346.0
|
||||
},
|
||||
"cache_hit_request_count": 3929,
|
||||
"cached_tokens_stats": {
|
||||
"count": 4449.0,
|
||||
"mean": 22635.924702180266,
|
||||
"p50": 20010.0,
|
||||
"p90": 48002.0,
|
||||
"p99": 65424.0
|
||||
},
|
||||
"decode_request_priorities": {},
|
||||
"error_count": 363,
|
||||
"execution_modes": {
|
||||
"kvcache-centric": 363,
|
||||
"kvcache-direct-to-d-session": 1716,
|
||||
"pd-router-d-session-reseed": 23,
|
||||
"pd-router-fallback-d-backpressure": 12,
|
||||
"pd-router-fallback-large-append": 5,
|
||||
"pd-router-fallback-large-append-seed-filter-early-turn": 51,
|
||||
"pd-router-fallback-large-append-session-cap": 2148,
|
||||
"pd-router-fallback-no-d-capacity": 7,
|
||||
"pd-router-fallback-session-cap": 32,
|
||||
"pd-router-large-append-reseed": 39,
|
||||
"pd-router-large-append-reseed-after-eviction": 2,
|
||||
"pd-router-turn1-d-backpressure": 1,
|
||||
"pd-router-turn1-no-d-capacity": 3,
|
||||
"pd-router-turn1-seed": 34,
|
||||
"pd-router-turn1-session-cap": 13
|
||||
},
|
||||
"latency_stats_s": {
|
||||
"count": 4086.0,
|
||||
"mean": 4.8753733304192455,
|
||||
"p50": 1.754677688702941,
|
||||
"p90": 12.66968655679375,
|
||||
"p99": 28.717210091650486
|
||||
},
|
||||
"mechanisms": {
|
||||
"kvcache-centric": 4449
|
||||
},
|
||||
"per_decode_load": {
|
||||
"decode-0": 616,
|
||||
"decode-1": 658,
|
||||
"decode-2": 674,
|
||||
"decode-3": 582,
|
||||
"decode-4": 656,
|
||||
"decode-5": 662,
|
||||
"decode-6": 601
|
||||
},
|
||||
"per_prefill_load": {
|
||||
"prefill-0": 4449
|
||||
},
|
||||
"prefill_request_priorities": {
|
||||
"-100": 98,
|
||||
"100": 2272
|
||||
},
|
||||
"re_prefill_count": 0,
|
||||
"request_count": 4449,
|
||||
"reuse_expected_count": 4397,
|
||||
"reuse_observed_count": 4397,
|
||||
"router_url": "http://127.0.0.1:8000",
|
||||
"session_reset_count": 0,
|
||||
"session_reused_count": 1716,
|
||||
"total_actual_kv_transfer_blocks": 62123,
|
||||
"total_cached_tokens": 100707229,
|
||||
"total_kv_transfer_blocks": 105235,
|
||||
"tpot_stats_s": {
|
||||
"count": 4086.0,
|
||||
"mean": 0.005829451223571163,
|
||||
"p50": 0.005684156496173296,
|
||||
"p90": 0.007143743503740225,
|
||||
"p99": 0.008634991403068266
|
||||
},
|
||||
"trace_path": "outputs/qwen3-30b-tp1-v3-kvaware/kvcache-centric-kv-aware-worker-admission-20260428T095141Z/sampled-trace.jsonl",
|
||||
"truncated_request_count": 42,
|
||||
"ttft_stats_s": {
|
||||
"count": 4086.0,
|
||||
"mean": 3.5955862397812597,
|
||||
"p50": 0.36274072993546724,
|
||||
"p90": 10.972254231572151,
|
||||
"p99": 27.433656523004174
|
||||
}
|
||||
}
|
||||
[2026-04-28 18:43:43] Saved to outputs/qwen3-30b-tp1-v3-kvaware/exp1_1p7d_kvc_kvaware_summary.json + exp1_1p7d_kvc_kvaware_metrics.jsonl
|
||||
[2026-04-28 18:43:43]
|
||||
[2026-04-28 18:43:43] === [EXP2] 2P6D KVC kv-aware ===
|
||||
[2026-04-28 19:30:38] === exp2_2p6d_kvc_kvaware COMPLETED ===
|
||||
[2026-04-28 19:30:38] Summary:
|
||||
{
|
||||
"actual_output_tokens_stats": {
|
||||
"count": 4440.0,
|
||||
"mean": 225.87972972972972,
|
||||
"p50": 86.0,
|
||||
"p90": 576.0,
|
||||
"p99": 1347.0
|
||||
},
|
||||
"cache_hit_request_count": 4201,
|
||||
"cached_tokens_stats": {
|
||||
"count": 4449.0,
|
||||
"mean": 24345.55787817487,
|
||||
"p50": 21504.0,
|
||||
"p90": 48792.0,
|
||||
"p99": 69120.0
|
||||
},
|
||||
"decode_request_priorities": {},
|
||||
"error_count": 9,
|
||||
"execution_modes": {
|
||||
"kvcache-centric": 9,
|
||||
"kvcache-direct-to-d-session": 1358,
|
||||
"pd-router-d-session-reseed": 12,
|
||||
"pd-router-fallback-d-backpressure": 2,
|
||||
"pd-router-fallback-large-append-seed-filter-early-turn": 52,
|
||||
"pd-router-fallback-large-append-session-cap": 2902,
|
||||
"pd-router-fallback-session-cap": 25,
|
||||
"pd-router-large-append-reseed": 34,
|
||||
"pd-router-large-append-reseed-after-eviction": 4,
|
||||
"pd-router-turn1-d-backpressure": 1,
|
||||
"pd-router-turn1-seed": 30,
|
||||
"pd-router-turn1-session-cap": 20
|
||||
},
|
||||
"latency_stats_s": {
|
||||
"count": 4440.0,
|
||||
"mean": 3.582334662846558,
|
||||
"p50": 1.517257746309042,
|
||||
"p90": 9.225348330102861,
|
||||
"p99": 18.70269925892353
|
||||
},
|
||||
"mechanisms": {
|
||||
"kvcache-centric": 4449
|
||||
},
|
||||
"per_decode_load": {
|
||||
"decode-0": 710,
|
||||
"decode-1": 630,
|
||||
"decode-2": 763,
|
||||
"decode-3": 737,
|
||||
"decode-4": 879,
|
||||
"decode-5": 730
|
||||
},
|
||||
"per_prefill_load": {
|
||||
"prefill-0": 2225,
|
||||
"prefill-1": 2224
|
||||
},
|
||||
"prefill_request_priorities": {
|
||||
"-100": 80,
|
||||
"100": 3002
|
||||
},
|
||||
"re_prefill_count": 0,
|
||||
"request_count": 4449,
|
||||
"reuse_expected_count": 4397,
|
||||
"reuse_observed_count": 4397,
|
||||
"router_url": "http://127.0.0.1:8000",
|
||||
"session_reset_count": 0,
|
||||
"session_reused_count": 1358,
|
||||
"total_actual_kv_transfer_blocks": 78979,
|
||||
"total_cached_tokens": 108313387,
|
||||
"total_kv_transfer_blocks": 105235,
|
||||
"tpot_stats_s": {
|
||||
"count": 4440.0,
|
||||
"mean": 0.005882534704321737,
|
||||
"p50": 0.005807478777200416,
|
||||
"p90": 0.00712956755887717,
|
||||
"p99": 0.008372141476720572
|
||||
},
|
||||
"trace_path": "outputs/qwen3-30b-tp1-v3-kvaware/kvcache-centric-kv-aware-worker-admission-20260428T104343Z/sampled-trace.jsonl",
|
||||
"truncated_request_count": 42,
|
||||
"ttft_stats_s": {
|
||||
"count": 4440.0,
|
||||
"mean": 2.2045287611873334,
|
||||
"p50": 0.32809355948120356,
|
||||
"p90": 6.947275545448065,
|
||||
"p99": 16.705802395939827
|
||||
}
|
||||
}
|
||||
[2026-04-28 19:30:38] Saved to outputs/qwen3-30b-tp1-v3-kvaware/exp2_2p6d_kvc_kvaware_summary.json + exp2_2p6d_kvc_kvaware_metrics.jsonl
|
||||
[2026-04-28 19:30:38]
|
||||
[2026-04-28 19:30:38] === ALL TP1 V3 SWEEP EXPERIMENTS DONE ===
|
||||
@@ -0,0 +1,88 @@
|
||||
{
|
||||
"actual_output_tokens_stats": {
|
||||
"count": 4014.0,
|
||||
"mean": 215.048081714001,
|
||||
"p50": 83.0,
|
||||
"p90": 570.0,
|
||||
"p99": 1343.0
|
||||
},
|
||||
"cache_hit_request_count": 3865,
|
||||
"cached_tokens_stats": {
|
||||
"count": 4449.0,
|
||||
"mean": 21373.60867610699,
|
||||
"p50": 18429.0,
|
||||
"p90": 45643.0,
|
||||
"p99": 65088.0
|
||||
},
|
||||
"decode_request_priorities": {},
|
||||
"error_count": 435,
|
||||
"execution_modes": {
|
||||
"kvcache-centric": 435,
|
||||
"kvcache-direct-to-d-session": 2180,
|
||||
"pd-router-d-session-reseed": 44,
|
||||
"pd-router-d-session-reseed-after-eviction": 1,
|
||||
"pd-router-fallback-d-backpressure": 36,
|
||||
"pd-router-fallback-large-append": 35,
|
||||
"pd-router-fallback-large-append-seed-filter-early-turn": 52,
|
||||
"pd-router-fallback-large-append-session-cap": 1500,
|
||||
"pd-router-fallback-no-d-capacity": 13,
|
||||
"pd-router-fallback-session-cap": 43,
|
||||
"pd-router-large-append-reseed": 55,
|
||||
"pd-router-large-append-reseed-after-eviction": 3,
|
||||
"pd-router-turn1-d-backpressure": 1,
|
||||
"pd-router-turn1-no-d-capacity": 5,
|
||||
"pd-router-turn1-seed": 46
|
||||
},
|
||||
"latency_stats_s": {
|
||||
"count": 4014.0,
|
||||
"mean": 4.214657033050009,
|
||||
"p50": 1.0827504023909569,
|
||||
"p90": 13.380241627804935,
|
||||
"p99": 24.453291333280504
|
||||
},
|
||||
"mechanisms": {
|
||||
"kvcache-centric": 4449
|
||||
},
|
||||
"per_decode_load": {
|
||||
"decode-0": 690,
|
||||
"decode-1": 599,
|
||||
"decode-2": 660,
|
||||
"decode-3": 584,
|
||||
"decode-4": 606,
|
||||
"decode-5": 646,
|
||||
"decode-6": 664
|
||||
},
|
||||
"per_prefill_load": {
|
||||
"prefill-0": 4449
|
||||
},
|
||||
"prefill_request_priorities": {
|
||||
"-100": 149,
|
||||
"100": 1685
|
||||
},
|
||||
"re_prefill_count": 0,
|
||||
"request_count": 4449,
|
||||
"reuse_expected_count": 4397,
|
||||
"reuse_observed_count": 4397,
|
||||
"router_url": "http://127.0.0.1:8000",
|
||||
"session_reset_count": 0,
|
||||
"session_reused_count": 2180,
|
||||
"total_actual_kv_transfer_blocks": 52857,
|
||||
"total_cached_tokens": 95091185,
|
||||
"total_kv_transfer_blocks": 105235,
|
||||
"tpot_stats_s": {
|
||||
"count": 4014.0,
|
||||
"mean": 0.005804301410418847,
|
||||
"p50": 0.005607025208882987,
|
||||
"p90": 0.007293824862528552,
|
||||
"p99": 0.008864479259402893
|
||||
},
|
||||
"trace_path": "outputs/qwen3-30b-tp1-v4-cap16/kvcache-centric-kv-aware-worker-admission-20260428T125022Z/sampled-trace.jsonl",
|
||||
"truncated_request_count": 43,
|
||||
"ttft_stats_s": {
|
||||
"count": 4014.0,
|
||||
"mean": 2.915135478307124,
|
||||
"p50": 0.05643345229327679,
|
||||
"p90": 11.900803190656006,
|
||||
"p99": 22.758968392387033
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,86 @@
|
||||
{
|
||||
"actual_output_tokens_stats": {
|
||||
"count": 4046.0,
|
||||
"mean": 224.65002471576867,
|
||||
"p50": 84.0,
|
||||
"p90": 576.0,
|
||||
"p99": 1349.0
|
||||
},
|
||||
"cache_hit_request_count": 3925,
|
||||
"cached_tokens_stats": {
|
||||
"count": 4449.0,
|
||||
"mean": 22852.7439874129,
|
||||
"p50": 19584.0,
|
||||
"p90": 49009.0,
|
||||
"p99": 67320.0
|
||||
},
|
||||
"decode_request_priorities": {},
|
||||
"error_count": 403,
|
||||
"execution_modes": {
|
||||
"kvcache-centric": 403,
|
||||
"kvcache-direct-to-d-session": 2348,
|
||||
"pd-router-d-session-reseed": 28,
|
||||
"pd-router-fallback-d-backpressure": 7,
|
||||
"pd-router-fallback-large-append": 68,
|
||||
"pd-router-fallback-large-append-seed-filter-early-turn": 45,
|
||||
"pd-router-fallback-large-append-session-cap": 1403,
|
||||
"pd-router-fallback-no-d-capacity": 9,
|
||||
"pd-router-fallback-session-cap": 25,
|
||||
"pd-router-large-append-reseed": 57,
|
||||
"pd-router-large-append-reseed-after-eviction": 6,
|
||||
"pd-router-turn1-no-d-capacity": 1,
|
||||
"pd-router-turn1-seed": 49
|
||||
},
|
||||
"latency_stats_s": {
|
||||
"count": 4046.0,
|
||||
"mean": 2.505981629502371,
|
||||
"p50": 0.8372491216287017,
|
||||
"p90": 6.5139341270551085,
|
||||
"p99": 18.335972285829484
|
||||
},
|
||||
"mechanisms": {
|
||||
"kvcache-centric": 4449
|
||||
},
|
||||
"per_decode_load": {
|
||||
"decode-0": 767,
|
||||
"decode-1": 680,
|
||||
"decode-2": 906,
|
||||
"decode-3": 818,
|
||||
"decode-4": 800,
|
||||
"decode-5": 478
|
||||
},
|
||||
"per_prefill_load": {
|
||||
"prefill-0": 2225,
|
||||
"prefill-1": 2224
|
||||
},
|
||||
"prefill_request_priorities": {
|
||||
"-100": 140,
|
||||
"100": 1558
|
||||
},
|
||||
"re_prefill_count": 0,
|
||||
"request_count": 4449,
|
||||
"reuse_expected_count": 4397,
|
||||
"reuse_observed_count": 4397,
|
||||
"router_url": "http://127.0.0.1:8000",
|
||||
"session_reset_count": 0,
|
||||
"session_reused_count": 2348,
|
||||
"total_actual_kv_transfer_blocks": 50727,
|
||||
"total_cached_tokens": 101671858,
|
||||
"total_kv_transfer_blocks": 105235,
|
||||
"tpot_stats_s": {
|
||||
"count": 4046.0,
|
||||
"mean": 0.005708743129332261,
|
||||
"p50": 0.005565466725497757,
|
||||
"p90": 0.006912594398356141,
|
||||
"p99": 0.008102089307750717
|
||||
},
|
||||
"trace_path": "outputs/qwen3-30b-tp1-v4-cap16/kvcache-centric-kv-aware-worker-admission-20260428T134057Z/sampled-trace.jsonl",
|
||||
"truncated_request_count": 36,
|
||||
"ttft_stats_s": {
|
||||
"count": 4046.0,
|
||||
"mean": 1.1653790952959129,
|
||||
"p50": 0.05140436999499798,
|
||||
"p90": 2.6447059931233525,
|
||||
"p99": 15.121314341202378
|
||||
}
|
||||
}
|
||||
190
outputs/qwen3-30b-tp1-v4-cap16/sweep_results.txt
Normal file
@@ -0,0 +1,190 @@
|
||||
[2026-04-28 20:50:21] Starting TP1 v4 sweep (KVC kv-aware, session soft_cap raised 4->16)
|
||||
[2026-04-28 20:50:21] Model: /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
|
||||
[2026-04-28 20:50:21] Trace: outputs/qwen35-swebench-50sess.jsonl (4449 requests, 52 sessions)
|
||||
[2026-04-28 20:50:21] Key change: _decode_session_soft_cap now min(16, ...) instead of min(4, ...)
|
||||
[2026-04-28 20:50:21]
|
||||
[2026-04-28 20:50:21] === [EXP1] 1P7D KVC kv-aware cap=16 ===
|
||||
[2026-04-28 21:40:57] === exp1_1p7d_kvc_cap16 COMPLETED ===
|
||||
[2026-04-28 21:40:57] Summary:
|
||||
{
|
||||
"actual_output_tokens_stats": {
|
||||
"count": 4014.0,
|
||||
"mean": 215.048081714001,
|
||||
"p50": 83.0,
|
||||
"p90": 570.0,
|
||||
"p99": 1343.0
|
||||
},
|
||||
"cache_hit_request_count": 3865,
|
||||
"cached_tokens_stats": {
|
||||
"count": 4449.0,
|
||||
"mean": 21373.60867610699,
|
||||
"p50": 18429.0,
|
||||
"p90": 45643.0,
|
||||
"p99": 65088.0
|
||||
},
|
||||
"decode_request_priorities": {},
|
||||
"error_count": 435,
|
||||
"execution_modes": {
|
||||
"kvcache-centric": 435,
|
||||
"kvcache-direct-to-d-session": 2180,
|
||||
"pd-router-d-session-reseed": 44,
|
||||
"pd-router-d-session-reseed-after-eviction": 1,
|
||||
"pd-router-fallback-d-backpressure": 36,
|
||||
"pd-router-fallback-large-append": 35,
|
||||
"pd-router-fallback-large-append-seed-filter-early-turn": 52,
|
||||
"pd-router-fallback-large-append-session-cap": 1500,
|
||||
"pd-router-fallback-no-d-capacity": 13,
|
||||
"pd-router-fallback-session-cap": 43,
|
||||
"pd-router-large-append-reseed": 55,
|
||||
"pd-router-large-append-reseed-after-eviction": 3,
|
||||
"pd-router-turn1-d-backpressure": 1,
|
||||
"pd-router-turn1-no-d-capacity": 5,
|
||||
"pd-router-turn1-seed": 46
|
||||
},
|
||||
"latency_stats_s": {
|
||||
"count": 4014.0,
|
||||
"mean": 4.214657033050009,
|
||||
"p50": 1.0827504023909569,
|
||||
"p90": 13.380241627804935,
|
||||
"p99": 24.453291333280504
|
||||
},
|
||||
"mechanisms": {
|
||||
"kvcache-centric": 4449
|
||||
},
|
||||
"per_decode_load": {
|
||||
"decode-0": 690,
|
||||
"decode-1": 599,
|
||||
"decode-2": 660,
|
||||
"decode-3": 584,
|
||||
"decode-4": 606,
|
||||
"decode-5": 646,
|
||||
"decode-6": 664
|
||||
},
|
||||
"per_prefill_load": {
|
||||
"prefill-0": 4449
|
||||
},
|
||||
"prefill_request_priorities": {
|
||||
"-100": 149,
|
||||
"100": 1685
|
||||
},
|
||||
"re_prefill_count": 0,
|
||||
"request_count": 4449,
|
||||
"reuse_expected_count": 4397,
|
||||
"reuse_observed_count": 4397,
|
||||
"router_url": "http://127.0.0.1:8000",
|
||||
"session_reset_count": 0,
|
||||
"session_reused_count": 2180,
|
||||
"total_actual_kv_transfer_blocks": 52857,
|
||||
"total_cached_tokens": 95091185,
|
||||
"total_kv_transfer_blocks": 105235,
|
||||
"tpot_stats_s": {
|
||||
"count": 4014.0,
|
||||
"mean": 0.005804301410418847,
|
||||
"p50": 0.005607025208882987,
|
||||
"p90": 0.007293824862528552,
|
||||
"p99": 0.008864479259402893
|
||||
},
|
||||
"trace_path": "outputs/qwen3-30b-tp1-v4-cap16/kvcache-centric-kv-aware-worker-admission-20260428T125022Z/sampled-trace.jsonl",
|
||||
"truncated_request_count": 43,
|
||||
"ttft_stats_s": {
|
||||
"count": 4014.0,
|
||||
"mean": 2.915135478307124,
|
||||
"p50": 0.05643345229327679,
|
||||
"p90": 11.900803190656006,
|
||||
"p99": 22.758968392387033
|
||||
}
|
||||
}
|
||||
[2026-04-28 21:40:57] Saved to outputs/qwen3-30b-tp1-v4-cap16/exp1_1p7d_kvc_cap16_summary.json + exp1_1p7d_kvc_cap16_metrics.jsonl
|
||||
[2026-04-28 21:40:57]
|
||||
[2026-04-28 21:40:57] === [EXP2] 2P6D KVC kv-aware cap=16 ===
|
||||
[2026-04-28 22:27:53] === exp2_2p6d_kvc_cap16 COMPLETED ===
|
||||
[2026-04-28 22:27:53] Summary:
|
||||
{
|
||||
"actual_output_tokens_stats": {
|
||||
"count": 4046.0,
|
||||
"mean": 224.65002471576867,
|
||||
"p50": 84.0,
|
||||
"p90": 576.0,
|
||||
"p99": 1349.0
|
||||
},
|
||||
"cache_hit_request_count": 3925,
|
||||
"cached_tokens_stats": {
|
||||
"count": 4449.0,
|
||||
"mean": 22852.7439874129,
|
||||
"p50": 19584.0,
|
||||
"p90": 49009.0,
|
||||
"p99": 67320.0
|
||||
},
|
||||
"decode_request_priorities": {},
|
||||
  "error_count": 403,
  "execution_modes": {
    "kvcache-centric": 403,
    "kvcache-direct-to-d-session": 2348,
    "pd-router-d-session-reseed": 28,
    "pd-router-fallback-d-backpressure": 7,
    "pd-router-fallback-large-append": 68,
    "pd-router-fallback-large-append-seed-filter-early-turn": 45,
    "pd-router-fallback-large-append-session-cap": 1403,
    "pd-router-fallback-no-d-capacity": 9,
    "pd-router-fallback-session-cap": 25,
    "pd-router-large-append-reseed": 57,
    "pd-router-large-append-reseed-after-eviction": 6,
    "pd-router-turn1-no-d-capacity": 1,
    "pd-router-turn1-seed": 49
  },
  "latency_stats_s": {
    "count": 4046.0,
    "mean": 2.505981629502371,
    "p50": 0.8372491216287017,
    "p90": 6.5139341270551085,
    "p99": 18.335972285829484
  },
  "mechanisms": {
    "kvcache-centric": 4449
  },
  "per_decode_load": {
    "decode-0": 767,
    "decode-1": 680,
    "decode-2": 906,
    "decode-3": 818,
    "decode-4": 800,
    "decode-5": 478
  },
  "per_prefill_load": {
    "prefill-0": 2225,
    "prefill-1": 2224
  },
  "prefill_request_priorities": {
    "-100": 140,
    "100": 1558
  },
  "re_prefill_count": 0,
  "request_count": 4449,
  "reuse_expected_count": 4397,
  "reuse_observed_count": 4397,
  "router_url": "http://127.0.0.1:8000",
  "session_reset_count": 0,
  "session_reused_count": 2348,
  "total_actual_kv_transfer_blocks": 50727,
  "total_cached_tokens": 101671858,
  "total_kv_transfer_blocks": 105235,
  "tpot_stats_s": {
    "count": 4046.0,
    "mean": 0.005708743129332261,
    "p50": 0.005565466725497757,
    "p90": 0.006912594398356141,
    "p99": 0.008102089307750717
  },
  "trace_path": "outputs/qwen3-30b-tp1-v4-cap16/kvcache-centric-kv-aware-worker-admission-20260428T134057Z/sampled-trace.jsonl",
  "truncated_request_count": 36,
  "ttft_stats_s": {
    "count": 4046.0,
    "mean": 1.1653790952959129,
    "p50": 0.05140436999499798,
    "p90": 2.6447059931233525,
    "p99": 15.121314341202378
  }
}
[2026-04-28 22:27:53] Saved to outputs/qwen3-30b-tp1-v4-cap16/exp2_2p6d_kvc_cap16_summary.json + exp2_2p6d_kvc_cap16_metrics.jsonl
[2026-04-28 22:27:53]
[2026-04-28 22:27:53] === ALL TP1 V4 SWEEP EXPERIMENTS DONE ===
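Reviewer note: the headline ratios for this run can be recomputed from the `execution_modes` block alone. The sketch below is illustrative only; the counts are copied by hand from the summary above, and the variable names are not part of the repo. Treating every mode whose name contains `session-cap` as one bucket is an assumption that mirrors the substring check in `analyze_v3.py`.

```python
# Counts copied by hand from the "execution_modes" block of the 2P6D cap16 summary.
execution_modes = {
    "kvcache-centric": 403,
    "kvcache-direct-to-d-session": 2348,
    "pd-router-d-session-reseed": 28,
    "pd-router-fallback-d-backpressure": 7,
    "pd-router-fallback-large-append": 68,
    "pd-router-fallback-large-append-seed-filter-early-turn": 45,
    "pd-router-fallback-large-append-session-cap": 1403,
    "pd-router-fallback-no-d-capacity": 9,
    "pd-router-fallback-session-cap": 25,
    "pd-router-large-append-reseed": 57,
    "pd-router-large-append-reseed-after-eviction": 6,
    "pd-router-turn1-no-d-capacity": 1,
    "pd-router-turn1-seed": 49,
}
error_count = 403  # "error_count" from the same summary

total = sum(execution_modes.values())
direct_rate = execution_modes["kvcache-direct-to-d-session"] / total
# Assumption: any mode name containing "session-cap" counts as a session-cap fallback.
session_cap = sum(n for mode, n in execution_modes.items() if "session-cap" in mode)

print(f"total={total}")
print(f"direct-to-D rate={direct_rate * 100:.1f}%")
print(f"session-cap fallback share={session_cap / total * 100:.1f}%")
print(f"error rate={error_count / total * 100:.1f}%")
```

The mode counts sum to `request_count` (4449), so the same denominator works for all three ratios.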
191
scripts/analysis/analyze_backpressure_smoke.py
Executable file
@@ -0,0 +1,191 @@
#!/usr/bin/env python3
"""Analyze backpressure smoke sweep outputs.

For each run dir with a `request-metrics.jsonl` and the new `structural/`
subdir (admission-events.jsonl, backpressure-events.jsonl,
session-d-binding.jsonl), report:

  - Headline (errors, latency, ttft, direct-to-D rate)
  - Backpressure pause histogram (count, p50/p90 sleep, total pause time per D)
  - Admission probe stats (RPC count, mean RTT, queue_depth distribution,
    pause_ms distribution)
  - Session pinning (distinct D per session, bimodal direct-to-D rate)
"""
from __future__ import annotations

import argparse
import json
import statistics
from collections import Counter, defaultdict
from pathlib import Path


def load_jsonl(path: Path) -> list[dict]:
    """Load a JSONL file, skipping blank lines; a missing file yields []."""
    if not path.exists():
        return []
    with path.open("r", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def summarize_run(run_dir: Path) -> dict:
    metrics_path = next(run_dir.rglob("request-metrics.jsonl"), None)
    if metrics_path is None:
        return {"run_dir": str(run_dir), "error": "no request-metrics.jsonl"}

    summary_path = metrics_path.with_suffix(metrics_path.suffix + ".summary.json")
    summary = (
        json.loads(summary_path.read_text(encoding="utf-8"))
        if summary_path.exists()
        else {}
    )

    structural_dir = run_dir / "structural"
    if not structural_dir.exists():
        # try metrics dir's parent / structural
        structural_dir = metrics_path.parent / "structural"

    admission_events = load_jsonl(structural_dir / "admission-events.jsonl")
    backpressure_events = load_jsonl(structural_dir / "backpressure-events.jsonl")
    binding_events = load_jsonl(structural_dir / "session-d-binding.jsonl")

    out: dict = {"run_dir": str(run_dir)}

    # Headline metrics from summary.json
    out["request_count"] = summary.get("request_count")
    out["error_count"] = summary.get("error_count")
    out["latency"] = summary.get("latency_stats_s")
    out["ttft"] = summary.get("ttft_stats_s")
    out["execution_modes"] = summary.get("execution_modes")
    out["per_decode_load"] = summary.get("per_decode_load")
    out["per_prefill_load"] = summary.get("per_prefill_load")

    # Direct-to-D rate from execution_modes
    em = summary.get("execution_modes", {}) or {}
    direct = em.get("kvcache-direct-to-d-session", 0)
    total = sum(em.values()) or 1
    out["direct_to_d_rate"] = direct / total

    # Session pinning
    bind_per_session: dict[str, set[int]] = defaultdict(set)
    for ev in binding_events:
        bind_per_session[ev["session_id"]].add(ev["decode_worker_index"])
    if bind_per_session:
        out["session_count"] = len(bind_per_session)
        out["avg_distinct_d_per_session"] = (
            sum(len(v) for v in bind_per_session.values()) / len(bind_per_session)
        )
    else:
        out["session_count"] = 0
        out["avg_distinct_d_per_session"] = None

    # Direct-to-D rate per session (bimodal check)
    records = load_jsonl(metrics_path)
    sess_records: dict[str, list[dict]] = defaultdict(list)
    for r in records:
        sess_records[r["session_id"]].append(r)
    rates = []
    for sid, turns in sess_records.items():
        ndir = sum(
            1 for t in turns if t.get("execution_mode") == "kvcache-direct-to-d-session"
        )
        rates.append(ndir / len(turns))
    if rates:
        # Five 20%-wide buckets; a rate of exactly 1.0 is clamped into the last one.
        buckets = [0, 0, 0, 0, 0]
        for r in rates:
            buckets[min(4, int(r * 5))] += 1
        out["direct_to_d_rate_buckets"] = {
            "0-20%": buckets[0],
            "20-40%": buckets[1],
            "40-60%": buckets[2],
            "60-80%": buckets[3],
            "80-100%": buckets[4],
        }

    # Backpressure events
    if backpressure_events:
        sleeps = [ev["sleep_s"] for ev in backpressure_events]
        out["backpressure"] = {
            "event_count": len(backpressure_events),
            "total_sleep_s": round(sum(sleeps), 2),
            "sleep_p50_s": round(statistics.median(sleeps), 4),
            "sleep_p90_s": round(
                sorted(sleeps)[int(len(sleeps) * 0.9)] if sleeps else 0, 4
            ),
            "events_per_d": dict(
                Counter(ev["server_url"] for ev in backpressure_events).most_common()
            ),
        }
    else:
        out["backpressure"] = {"event_count": 0, "note": "no backpressure events"}

    # Admission probe stats
    if admission_events:
        rtts = [ev["rtt_s"] for ev in admission_events]
        depths = [ev.get("queue_depth", 0) for ev in admission_events]
        pauses = [ev.get("recommended_pause_ms", 0) for ev in admission_events]
        out["admission_probes"] = {
            "count": len(admission_events),
            "mean_rtt_s": round(sum(rtts) / len(rtts), 4),
            "p99_rtt_s": round(sorted(rtts)[int(len(rtts) * 0.99)], 4),
            "queue_depth_p50": int(statistics.median(depths)),
            "queue_depth_p90": int(sorted(depths)[int(len(depths) * 0.9)]),
            "queue_depth_max": max(depths),
            "pause_ms_p50": int(statistics.median(pauses)),
            "pause_ms_p90": int(sorted(pauses)[int(len(pauses) * 0.9)]),
            "pause_ms_max": max(pauses),
            "nonzero_pause_count": sum(1 for p in pauses if p > 0),
            "by_reason": dict(
                Counter(ev.get("reason") or "ok" for ev in admission_events).most_common()
            ),
        }

    return out


def main() -> None:
    ap = argparse.ArgumentParser()
    ap.add_argument("sweep_root", type=Path)
    ap.add_argument("--json", action="store_true", help="emit JSON only")
    args = ap.parse_args()

    summaries = []
    for run_dir in sorted(args.sweep_root.iterdir()):
        if not run_dir.is_dir():
            continue
        summary = summarize_run(run_dir)
        summaries.append(summary)

    if args.json:
        print(json.dumps(summaries, indent=2))
        return

    for s in summaries:
        print(f"\n{'=' * 70}")
        print(f" {s['run_dir']}")
        print(f"{'=' * 70}")
        if "error" in s:
            print(f" ERROR: {s['error']}")
            continue
        print(f" reqs={s.get('request_count')} errors={s.get('error_count')}")
        if s.get("latency"):
            lt = s["latency"]
            print(
                f" latency: mean={lt.get('mean'):.3f} "
                f"p50={lt.get('p50'):.3f} p90={lt.get('p90'):.3f} p99={lt.get('p99'):.3f}"
            )
        if s.get("ttft"):
            tt = s["ttft"]
            print(
                f" ttft: mean={tt.get('mean'):.3f} "
                f"p50={tt.get('p50'):.3f} p90={tt.get('p90'):.3f}"
            )
        print(f" direct_to_d_rate: {s.get('direct_to_d_rate', 0) * 100:.1f}%")
        print(f" sessions: {s.get('session_count')} | "
              f"avg distinct-D-per-session: {s.get('avg_distinct_d_per_session')}")
        if s.get("direct_to_d_rate_buckets"):
            print(f" direct-to-D distribution by session: {s['direct_to_d_rate_buckets']}")
        if s.get("backpressure"):
            print(f" backpressure: {s['backpressure']}")
        if s.get("admission_probes"):
            print(f" admission probes: {s['admission_probes']}")


if __name__ == "__main__":
    main()
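Reviewer note: the bimodal check in `summarize_run` depends on how per-session rates are binned. Below is a minimal standalone sketch of that binning on made-up rates. The helper name `bucket_rates` is hypothetical; the script itself inlines this logic.

```python
def bucket_rates(rates: list[float], n_buckets: int = 5) -> list[int]:
    """Count rates in [0, 1] into n_buckets equal-width bins.

    int(r * n_buckets) would index one past the end for r == 1.0, so the
    index is clamped to the last bin, as the script does with min(4, ...).
    """
    buckets = [0] * n_buckets
    for r in rates:
        buckets[min(n_buckets - 1, int(r * n_buckets))] += 1
    return buckets


# Synthetic per-session direct-to-D rates (not taken from any run).
rates = [0.0, 0.1, 0.19, 0.5, 0.85, 1.0]
print(bucket_rates(rates))  # [3, 0, 1, 0, 2]
```

A session that always goes direct-to-D (rate 1.0) lands in the "80-100%" bin only because of the clamp.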
83
scripts/analysis/analyze_errors.py
Normal file
@@ -0,0 +1,83 @@
#!/usr/bin/env python3
"""Deep dive into v4 errors: which path, which D, which session, which turn."""
import json
from pathlib import Path
from collections import Counter, defaultdict

BASE = Path(__file__).parent

def load_rows(jsonl_path):
    rows = []
    with open(jsonl_path) as f:
        for line in f:
            rows.append(json.loads(line))
    return rows

# Compare v3 and v4 errors
for label, path in [
    ("v3 1P7D", BASE.parent / "qwen3-30b-tp1-v3-kvaware/exp1_1p7d_kvc_kvaware_metrics.jsonl"),
    ("v4 1P7D", BASE / "exp1_1p7d_kvc_cap16_metrics.jsonl"),
    ("v3 2P6D", BASE.parent / "qwen3-30b-tp1-v3-kvaware/exp2_2p6d_kvc_kvaware_metrics.jsonl"),
    ("v4 2P6D", BASE / "exp2_2p6d_kvc_cap16_metrics.jsonl"),
]:
    if not path.exists():
        print(f"\nSKIP {label}: {path} not found")
        continue
    rows = load_rows(path)
    if not rows:
        # An empty metrics file would otherwise divide by zero below.
        print(f"\nSKIP {label}: {path} is empty")
        continue
    err = [r for r in rows if r.get("error") is not None]
    print(f"\n========== {label} ({len(err)} errors / {len(rows)} total = {len(err)/len(rows)*100:.1f}%) ==========")

    # Error finish_reason distribution
    fr_counter = Counter()
    for r in err:
        fr = str(r.get("finish_reason") or r.get("error") or "?")
        fr_counter[fr[:80]] += 1
    print("finish_reason distribution:")
    for fr, cnt in fr_counter.most_common():
        print(f" {cnt:>4}x {fr}")

    # Errors by execution mode (these are aborted before mode assignment usually)
    mode_counter = Counter(r.get("execution_mode", "?") for r in err)
    print("\nerror by execution_mode:")
    for mode, cnt in mode_counter.most_common():
        print(f" {cnt:>4}x {mode}")

    # Errors per D worker
    dw_counter = Counter(r.get("assigned_decode_node", "?") for r in err)
    print("\nerror per assigned_decode_node:")
    for dw, cnt in dw_counter.most_common():
        print(f" {cnt:>4}x {dw}")

    # Errors by turn distribution
    turn_counter = Counter(r.get("turn_id", -1) for r in err)
    early = sum(c for t, c in turn_counter.items() if t <= 5)
    mid = sum(c for t, c in turn_counter.items() if 5 < t <= 30)
    late = sum(c for t, c in turn_counter.items() if t > 30)
    print(f"\nerror by turn: early(0-5)={early} mid(6-30)={mid} late(31+)={late}")

    # Per-session error rate
    per_sess_err = defaultdict(int)
    per_sess_total = defaultdict(int)
    for r in rows:
        per_sess_total[r["session_id"]] += 1
        if r.get("error") is not None:
            per_sess_err[r["session_id"]] += 1
    sess_with_err = [(sid, per_sess_err[sid], per_sess_total[sid]) for sid in per_sess_err]
    sess_with_err.sort(key=lambda x: -x[1])
    print("\ntop 5 sessions by error count:")
    for sid, e, t in sess_with_err[:5]:
        print(f" session {sid}: {e}/{t} errors ({e/t*100:.0f}%)")

    # Errors timeline: are they bursty?
    err_ts = sorted([r.get("trace_timestamp_s", 0) for r in err])
    if err_ts:
        first_ts = err_ts[0]
        last_ts = err_ts[-1]
        all_ts = sorted([r.get("trace_timestamp_s", 0) for r in rows])
        first_all = all_ts[0]
        last_all = all_ts[-1]
        run_duration = last_all - first_all
        err_first_pct = (err_ts[0] - first_all) / run_duration * 100 if run_duration > 0 else 0
        err_last_pct = (err_ts[-1] - first_all) / run_duration * 100 if run_duration > 0 else 0
        print(f"\nerror time range (% of run): {err_first_pct:.1f}% - {err_last_pct:.1f}%")
346
scripts/analysis/analyze_pool_timeseries.py
Executable file
@@ -0,0 +1,346 @@
#!/usr/bin/env python3
"""Analyze d-pool-timeseries.jsonl produced by --pool-poll-interval-s.

Answers v6's main question: where is D's KV pool actually spent?

For each decode worker, decomposes capacity over the run wall-clock into:
  - resident_held_active = held - idle_evictable (sessions in active use)
  - resident_held_idle = idle_evictable (sessions kept around but evictable)
  - prefill_backup_or_other = capacity - held - available (everything else: backup blocks,
    in-flight transfers, fragmentation)
  - free_available = available

Also reports session residency churn (how many distinct sessions ever resided per D, and
how often a session bounced between workers, a strong starvation signal).

Usage:
    python scripts/analysis/analyze_pool_timeseries.py <run_dir>
  or
    python scripts/analysis/analyze_pool_timeseries.py <pool_timeseries.jsonl>

Output: human-readable text. Add --json to also print a machine-readable summary.
"""
from __future__ import annotations

import argparse
import json
import statistics
from collections import Counter, defaultdict
from pathlib import Path
from typing import Any


def _load_jsonl(path: Path) -> list[dict[str, Any]]:
    rows: list[dict[str, Any]] = []
    with path.open() as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            rows.append(json.loads(line))
    return rows


def _resolve_input(path: Path) -> Path:
    if path.is_file():
        return path
    if path.is_dir():
        candidate = path / "d-pool-timeseries.jsonl"
        if candidate.is_file():
            return candidate
        raise FileNotFoundError(
            f"{candidate} not found; pass the file directly or a run dir containing it."
        )
    raise FileNotFoundError(path)


def _percentile(values: list[float], p: float) -> float:
    if not values:
        return 0.0
    s = sorted(values)
    idx = min(len(s) - 1, max(0, int(round((len(s) - 1) * p))))
    return s[idx]


def _fmt_tokens(n: float) -> str:
    if n >= 1_000_000:
        return f"{n / 1_000_000:.2f}M"
    if n >= 1_000:
        return f"{n / 1_000:.1f}K"
    return f"{int(n)}"


def _fmt_pct(n: float, total: float) -> str:
    if total <= 0:
        return " - "
    return f"{100 * n / total:5.1f}%"


def analyze(timeseries_path: Path) -> dict[str, Any]:
    rows = _load_jsonl(timeseries_path)
    if not rows:
        raise ValueError(f"empty timeseries: {timeseries_path}")

    by_worker: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for row in rows:
        if row.get("error") and "session_cache_enabled" not in row:
            # poller failed at this tick; skip
            continue
        wid = row.get("worker_id") or "?"
        by_worker[wid].append(row)

    summary: dict[str, Any] = {
        "timeseries_path": str(timeseries_path),
        "total_rows": len(rows),
        "tick_count": len(by_worker[next(iter(by_worker))]) if by_worker else 0,
        "wall_s_span": (
            max(r.get("wall_s", 0.0) for r in rows)
            - min(r.get("wall_s", 0.0) for r in rows)
        ),
        "workers": {},
    }

    print(f"\n=== Pool timeseries: {timeseries_path}")
    print(
        f" rows={summary['total_rows']} workers={len(by_worker)} "
        f"span={summary['wall_s_span']:.1f}s"
    )

    # Print per-worker decomposition table
    header = (
        f"{'worker':<12} {'role':<8} {'cap':>8} | "
        f"{'avg_active':>10} {'avg_idle':>10} {'avg_other':>10} {'avg_free':>10} | "
        f"{'p90_held':>10} {'max_held':>10} {'p90_avail':>10}"
    )
    print(header)
    print("-" * len(header))

    for wid in sorted(by_worker.keys()):
        ws = by_worker[wid]
        role = ws[0].get("worker_role", "?")
        cap_vals = [int(r.get("capacity_tokens") or 0) for r in ws]
        held_vals = [int(r.get("held_tokens") or 0) for r in ws]
        avail_vals = [int(r.get("available_tokens") or 0) for r in ws]
        idle_vals = [int(r.get("idle_evictable_tokens") or 0) for r in ws]
        # active = held - idle (sessions in active use)
        active_vals = [max(0, h - i) for h, i in zip(held_vals, idle_vals)]
        # other = capacity - held - available (prefill backup blocks, in-flight, fragmentation)
        other_vals = [
            max(0, c - h - a) for c, h, a in zip(cap_vals, held_vals, avail_vals)
        ]
        cap = max(cap_vals) if cap_vals else 0

        avg_active = statistics.fmean(active_vals) if active_vals else 0.0
        avg_idle = statistics.fmean(idle_vals) if idle_vals else 0.0
        avg_other = statistics.fmean(other_vals) if other_vals else 0.0
        avg_avail = statistics.fmean(avail_vals) if avail_vals else 0.0

        p90_held = _percentile([float(v) for v in held_vals], 0.90)
        max_held = max(held_vals) if held_vals else 0
        p90_avail = _percentile([float(v) for v in avail_vals], 0.90)

        sess_counts = [int(r.get("session_count") or 0) for r in ws]
        resident_counts = [int(r.get("resident_session_count") or 0) for r in ws]

        print(
            f"{wid:<12} {role:<8} {_fmt_tokens(cap):>8} | "
            f"{_fmt_tokens(avg_active):>4} {_fmt_pct(avg_active, cap):>5} "
            f"{_fmt_tokens(avg_idle):>4} {_fmt_pct(avg_idle, cap):>5} "
            f"{_fmt_tokens(avg_other):>4} {_fmt_pct(avg_other, cap):>5} "
            f"{_fmt_tokens(avg_avail):>4} {_fmt_pct(avg_avail, cap):>5} | "
            f"{_fmt_tokens(p90_held):>10} {_fmt_tokens(max_held):>10} "
            f"{_fmt_tokens(p90_avail):>10}"
        )

        summary["workers"][wid] = {
            "role": role,
            "capacity_tokens": cap,
            "avg_active_held_tokens": avg_active,
            "avg_idle_evictable_tokens": avg_idle,
            "avg_other_tokens": avg_other,
            "avg_available_tokens": avg_avail,
            "p90_held_tokens": p90_held,
            "max_held_tokens": max_held,
            "p90_available_tokens": p90_avail,
            "max_session_count": max(sess_counts) if sess_counts else 0,
            "max_resident_session_count": (
                max(resident_counts) if resident_counts else 0
            ),
            "ticks": len(ws),
        }

    print(
        "\nLegend: active=held-idle idle=idle_evictable "
        "other=cap-held-avail (radix-protected + running-batch + in-flight + frag)"
    )

    # P1: decomposition of "other" using pool_breakdown fields (zeros if instrument absent)
    has_breakdown = any(
        any(r.get(k) for k in (
            "radix_evictable_tokens",
            "radix_protected_tokens",
            "running_batch_kv_tokens",
            "transfer_queue_tokens",
            "prealloc_queue_tokens",
            "retracted_queue_tokens",
        ))
        for r in rows
    )

    if has_breakdown:
        print("\n=== P1 'other' decomposition (per worker, mean over run) ===")
        print(
            f"{'worker':<12} {'role':<8} | "
            f"{'r_evictable':>11} {'r_protected':>11} {'slot_private':>12} | "
            f"{'run_batch':>10} {'transfer':>9} {'prealloc':>9} {'retracted':>10} | "
            f"{'unaccounted':>11}"
        )
        for wid in sorted(by_worker.keys()):
            ws = by_worker[wid]
            role = ws[0].get("worker_role", "?")
            cap = max(int(r.get("capacity_tokens") or 0) for r in ws)

            def m(field: str) -> float:
                vals = [int(r.get(field) or 0) for r in ws]
                return statistics.fmean(vals) if vals else 0.0

            r_ev = m("radix_evictable_tokens")
            r_pr = m("radix_protected_tokens")
            slot = m("slot_private_held_tokens")
            rb = m("running_batch_kv_tokens")
            tq = m("transfer_queue_tokens")
            pq = m("prealloc_queue_tokens")
            rq = m("retracted_queue_tokens")
            avail = m("available_tokens")
            # `running_batch_kv_tokens` overlaps with radix_protected for tree-tracked
            # reqs; do NOT subtract it again. Decomposition assumes:
            #   capacity ≈ avail + r_evictable + r_protected + slot_private
            #              + transfer_queue + prealloc_queue + retracted_queue + unaccounted
            unacc = max(
                0,
                cap - avail - r_ev - r_pr - slot - tq - pq - rq,
            )
            print(
                f"{wid:<12} {role:<8} | "
                f"{_fmt_tokens(r_ev):>11} {_fmt_tokens(r_pr):>11} {_fmt_tokens(slot):>12} | "
                f"{_fmt_tokens(rb):>10} {_fmt_tokens(tq):>9} {_fmt_tokens(pq):>9} {_fmt_tokens(rq):>10} | "
                f"{_fmt_tokens(unacc):>11}"
            )

            summary["workers"][wid]["pool_breakdown_avg"] = {
                "radix_evictable": r_ev,
                "radix_protected": r_pr,
                "slot_private_held": slot,
                "running_batch_kv": rb,
                "transfer_queue": tq,
                "prealloc_queue": pq,
                "retracted_queue": rq,
                "available": avail,
                "unaccounted": unacc,
            }
        print(
            "\nNote: running_batch_kv_tokens overlaps with radix_protected_tokens "
            "(tree-tracked decode reqs are also in protected); not summed."
        )
    else:
        print("\n(P1 instrument absent: pool_breakdown fields are all zero)")

    # Session residency churn: how many distinct sessions ever sat on each worker,
    # and how many sessions hopped across workers (= starvation indicator).
    print("\n=== Session residency churn ===")
    sessions_per_worker: dict[str, set[str]] = defaultdict(set)
    workers_per_session: dict[str, set[str]] = defaultdict(set)
    # Keyed by (worker_id, session_id).
    resident_ticks_per_session: Counter[tuple[str, str]] = Counter()
    resident_ticks_per_worker: Counter[str] = Counter()

    for row in rows:
        wid = row.get("worker_id")
        if wid is None or row.get("worker_role") != "decode":
            continue
        sessions = row.get("sessions") or []
        if not isinstance(sessions, list):
            continue
        for entry in sessions:
            if not isinstance(entry, dict):
                continue
            sid = entry.get("session_id")
            if sid is None:
                continue
            # Register every session that appears in a snapshot, resident or not.
            # Without this, never-resident sessions have no entry and the
            # starvation count below is always zero.
            workers_per_session.setdefault(sid, set())
            if entry.get("resident"):
                sessions_per_worker[wid].add(sid)
                workers_per_session[sid].add(wid)
                resident_ticks_per_session[(wid, sid)] += 1
                resident_ticks_per_worker[wid] += 1

    # Per-decode worker: distinct session count
    print(f" {'worker':<12} {'distinct_sess':>14} {'resident_ticks':>16}")
    for wid in sorted(sessions_per_worker.keys()):
        print(
            f" {wid:<12} {len(sessions_per_worker[wid]):>14} "
            f"{resident_ticks_per_worker[wid]:>16}"
        )

    # Per session: how many workers it hopped across
    hops = Counter(len(ws) for ws in workers_per_session.values())
    print("\n Sessions seen on N workers (decode side):")
    for n, count in sorted(hops.items()):
        print(f" on {n} worker(s): {count} sessions")

    # Sessions that were listed by some D but never marked resident anywhere.
    starvation = [sid for sid, ws in workers_per_session.items() if len(ws) == 0]
    multi_hopper = sorted(
        ((sid, ws) for sid, ws in workers_per_session.items() if len(ws) >= 2),
        key=lambda x: -len(x[1]),
    )[:10]
    if multi_hopper:
        print(
            "\n Top sessions seen resident on multiple workers (potential thrashing):"
        )
        for sid, ws in multi_hopper:
            print(f" {sid}: {len(ws)} workers ({sorted(ws)})")

    summary["session_residency"] = {
        "distinct_sessions_per_worker": {
            wid: len(s) for wid, s in sessions_per_worker.items()
        },
        "session_hop_count_distribution": dict(hops),
        "starvation_session_count": len(starvation),
    }

    # If a request-metrics file is co-located, also report its execution-mode
    # breakdown (counts and shares only; not joined against pool state).
    metrics_path = timeseries_path.with_name("request-metrics.jsonl")
    if metrics_path.exists():
        print(f"\n=== Request-metrics summary ({metrics_path.name}) ===")
        mrows = _load_jsonl(metrics_path)
        modes = Counter(r.get("execution_mode") or "?" for r in mrows)
        total = sum(modes.values())
        for mode, count in modes.most_common():
            print(f" {count:>6} ({100 * count / total:5.1f}%) {mode}")
        summary["execution_modes"] = dict(modes)

    return summary


def main() -> None:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        "path",
        type=Path,
        help="Path to d-pool-timeseries.jsonl OR a run dir containing it",
    )
    parser.add_argument(
        "--json",
        action="store_true",
        help="Also print a machine-readable JSON summary",
    )
    args = parser.parse_args()

    resolved = _resolve_input(args.path)
    summary = analyze(resolved)
    if args.json:
        print("\n=== JSON summary ===")
        print(json.dumps(summary, indent=2, sort_keys=True, default=str))


if __name__ == "__main__":
    main()
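Reviewer note: the four-way capacity split described in the module docstring can be checked on a single poll sample. The sketch below is a rough illustration that reuses the field names read by `analyze()`; the `decompose` helper and the sample values are made up for this example and do not come from any run.

```python
def decompose(sample: dict) -> dict:
    """Split one poll sample into active / idle / other / free token counts."""
    cap = int(sample.get("capacity_tokens") or 0)
    held = int(sample.get("held_tokens") or 0)
    avail = int(sample.get("available_tokens") or 0)
    idle = int(sample.get("idle_evictable_tokens") or 0)
    return {
        "active": max(0, held - idle),        # held by sessions in active use
        "idle": idle,                         # held but evictable
        "other": max(0, cap - held - avail),  # neither held nor free
        "free": avail,
    }


# Synthetic sample: 1000-token pool, 700 held (250 of them idle), 200 free.
sample = {
    "capacity_tokens": 1000,
    "held_tokens": 700,
    "available_tokens": 200,
    "idle_evictable_tokens": 250,
}
parts = decompose(sample)
print(parts)
print("sum:", sum(parts.values()))
```

The four parts add up to capacity only while neither `max(0, ...)` clamp fires. If a poller reports `idle > held` or `held + available > capacity`, the clamped parts no longer sum to the pool size, which is worth checking before reading the averages.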
89
scripts/analysis/analyze_v3.py
Normal file
@@ -0,0 +1,89 @@
#!/usr/bin/env python3
"""Analyze v3 (kv-aware) results: find why fallback-large-append-session-cap dominates."""
import json
import numpy as np
from pathlib import Path
from collections import Counter, defaultdict

BASE = Path(__file__).parent

def load_rows(jsonl_path):
    rows = []
    with open(jsonl_path) as f:
        for line in f:
            rows.append(json.loads(line))
    return rows

exp1 = load_rows(BASE / "exp1_1p7d_kvc_kvaware_metrics.jsonl")
exp2 = load_rows(BASE / "exp2_2p6d_kvc_kvaware_metrics.jsonl")

for name, rows in [("Exp1 1P7D", exp1), ("Exp2 2P6D", exp2)]:
    print(f"\n========== {name} ==========")
    ok = [r for r in rows if r.get("error") is None]

    # Execution mode breakdown by latency
    modes = Counter(r["execution_mode"] for r in ok)
    print(f"\nExecution modes (n={len(ok)}):")
    for mode, count in modes.most_common():
        mode_rows = [r for r in ok if r["execution_mode"] == mode]
        lats = [r["latency_s"] for r in mode_rows]
        ttfts = [r["ttft_s"] for r in mode_rows]
        print(f" {mode}: n={count} ({count/len(ok)*100:.1f}%) "
              f"lat P50={np.percentile(lats,50):.3f}s P90={np.percentile(lats,90):.3f}s | "
              f"ttft P50={np.percentile(ttfts,50):.3f}s P90={np.percentile(ttfts,90):.3f}s")

    # Per-D session distribution
    per_d_sessions = defaultdict(set)
    for r in ok:
        d = r.get("assigned_decode_node", "?")
        per_d_sessions[d].add(r["session_id"])
    print(f"\nSessions per D worker:")
    for d in sorted(per_d_sessions.keys()):
        print(f" {d}: {len(per_d_sessions[d])} unique sessions")

    # session-cap fallback analysis
    sc_rows = [r for r in ok if r["execution_mode"] == "pd-router-fallback-large-append-session-cap"]
    if sc_rows:
        print(f"\nSession-cap fallback details (n={len(sc_rows)}):")
        # Which sessions hit this most?
        sc_per_sess = Counter(r["session_id"] for r in sc_rows)
        print(f" Sessions hitting session-cap (top 5):")
        for sid, cnt in sc_per_sess.most_common(5):
            print(f" session {sid}: {cnt} times")
        # Per-D distribution
        sc_per_d = Counter(r.get("assigned_decode_node", "?") for r in sc_rows)
        print(f" Per-D distribution: {dict(sc_per_d.most_common())}")
        # Input length distribution
        inp = [r.get("input_length", 0) for r in sc_rows]
        print(f" Input length: P50={np.percentile(inp,50):.0f} P90={np.percentile(inp,90):.0f}")
        # Turn distribution
        turns = Counter(r.get("turn_id", -1) for r in sc_rows)
        print(f" Turn distribution (top 5): {dict(turns.most_common(5))}")

    # Direct-to-D analysis (ideal path)
    dd_rows = [r for r in ok if r["execution_mode"] == "kvcache-direct-to-d-session"]
    if dd_rows:
        lats = [r["latency_s"] for r in dd_rows]
        ttfts = [r["ttft_s"] for r in dd_rows]
        kv_blocks = [r.get("actual_kv_transfer_blocks", 0) for r in dd_rows]
        cached = [r.get("cached_tokens", 0) for r in dd_rows]
        print(f"\nDirect-to-D details (n={len(dd_rows)}):")
        print(f" lat P50={np.percentile(lats,50):.3f}s P90={np.percentile(lats,90):.3f}s P99={np.percentile(lats,99):.3f}s")
        print(f" ttft P50={np.percentile(ttfts,50):.3f}s P90={np.percentile(ttfts,90):.3f}s")
        print(f" KV transfer: P50={np.percentile(kv_blocks,50):.0f} (should be 0: no P involved)")
        print(f" cached_tokens P50={np.percentile(cached,50):.0f}")

    # Sessions: how many turns each, how many used direct-to-d
    print(f"\nPer-session direct-to-D rate (top 10 by total turns):")
    per_sess = defaultdict(list)
    for r in ok:
        per_sess[r["session_id"]].append(r)
    sess_stats = []
    for sid, sreqs in per_sess.items():
        total = len(sreqs)
        dd = sum(1 for r in sreqs if r["execution_mode"] == "kvcache-direct-to-d-session")
        sc = sum(1 for r in sreqs if "session-cap" in r["execution_mode"])
        sess_stats.append((sid, total, dd, sc))
    sess_stats.sort(key=lambda x: -x[1])
    for sid, total, dd, sc in sess_stats[:10]:
        print(f" session {sid}: {total} turns, {dd} direct-to-D ({dd/total*100:.0f}%), {sc} session-cap fallback ({sc/total*100:.0f}%)")
52
scripts/analysis/analyze_v4.py
Normal file
@@ -0,0 +1,52 @@
#!/usr/bin/env python3
"""V4 results analysis: errors, execution modes, latency by mode."""
import json
import numpy as np
from pathlib import Path
from collections import Counter

BASE = Path(__file__).parent

def load_rows(jsonl_path):
    rows = []
    with open(jsonl_path) as f:
        for line in f:
            rows.append(json.loads(line))
    return rows

for name, path in [
    ("Exp1 1P7D cap=16", BASE / "exp1_1p7d_kvc_cap16_metrics.jsonl"),
    ("Exp2 2P6D cap=16", BASE / "exp2_2p6d_kvc_cap16_metrics.jsonl"),
]:
    rows = load_rows(path)
    print(f"\n========== {name} ==========")
    ok = [r for r in rows if r.get("error") is None]
    err = [r for r in rows if r.get("error") is not None]
    print(f"Total: {len(rows)}, OK: {len(ok)}, Errors: {len(err)}")

    # Errors finish_reason
    if err:
        finish_reasons = Counter()
        for r in err:
            fr = str(r.get("finish_reason") or r.get("error") or "?")
            # Truncate long messages
            short = fr[:120]
            finish_reasons[short] += 1
        print(f"\nError finish_reasons (top 5):")
        for fr, cnt in finish_reasons.most_common(5):
            print(f" {cnt}x: {fr}")

    # Execution mode latency breakdown
    modes = Counter(r["execution_mode"] for r in ok)
    print(f"\nTop execution modes by latency:")
    print(f"{'mode':<55}{'n':<8}{'%':<8}{'P50 lat':<10}{'P90 lat':<10}{'TTFT P50':<10}")
    for mode, count in modes.most_common(8):
        mode_rows = [r for r in ok if r["execution_mode"] == mode]
        lats = [r["latency_s"] for r in mode_rows]
        ttfts = [r["ttft_s"] for r in mode_rows]
        print(f" {mode:<53}{count:<8}{count/len(ok)*100:>5.1f}% {np.percentile(lats,50):>7.3f}s {np.percentile(lats,90):>7.3f}s {np.percentile(ttfts,50):>7.3f}s")

    # Per-D load
    per_d = Counter(r.get("assigned_decode_node", "?") for r in ok)
    print(f"\nPer-D load: max/min ratio = {max(per_d.values())/max(min(per_d.values()),1):.2f}x")
    print(f" {dict(per_d.most_common())}")
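Reviewer note: the max/min imbalance ratio printed by `analyze_v4.py` can be sanity-checked against the `per_decode_load` block of the 2P6D cap16 summary in this diff. This is only a sketch. The summary counts appear to include errored requests (they sum to `request_count`), while the script counts successful rows only, so the two ratios are not expected to match exactly.

```python
# Counts copied by hand from "per_decode_load" in the 2P6D cap16 summary.
per_decode_load = {
    "decode-0": 767,
    "decode-1": 680,
    "decode-2": 906,
    "decode-3": 818,
    "decode-4": 800,
    "decode-5": 478,
}

busiest = max(per_decode_load, key=per_decode_load.get)
idlest = min(per_decode_load, key=per_decode_load.get)
# Same guard as the script: avoid dividing by zero if a worker got no requests.
ratio = per_decode_load[busiest] / max(per_decode_load[idlest], 1)

print(f"busiest={busiest} idlest={idlest} max/min ratio = {ratio:.2f}x")
```

On these counts decode-2 carries roughly 1.9 times the requests of decode-5.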
136
scripts/analysis/compare_no_error.py
Normal file
@@ -0,0 +1,136 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Compare KVC variants vs baseline, EXCLUDING errors and truncated requests."""
|
||||
import json
|
||||
import numpy as np
|
||||
from pathlib import Path
|
||||
|
||||
OUT = Path("/mnt/kzlin/workflow/pd-hybrid/agentic-pd-hybrid/outputs")
|
||||
|
||||
DATASETS = [
|
||||
("baseline 8DP", OUT / "qwen3-30b-tp1-v2-fixed/exp1_8way_dp_cache_aware_metrics.jsonl"),
|
||||
("v3 1P7D", OUT / "qwen3-30b-tp1-v3-kvaware/exp1_1p7d_kvc_kvaware_metrics.jsonl"),
|
||||
("v3 2P6D", OUT / "qwen3-30b-tp1-v3-kvaware/exp2_2p6d_kvc_kvaware_metrics.jsonl"),
|
||||
("v4 1P7D", OUT / "qwen3-30b-tp1-v4-cap16/exp1_1p7d_kvc_cap16_metrics.jsonl"),
|
||||
("v4 2P6D", OUT / "qwen3-30b-tp1-v4-cap16/exp2_2p6d_kvc_cap16_metrics.jsonl"),
|
||||
]
|
||||
|
||||
def load_rows(path):
    rows = []
    with open(path) as f:
        for line in f:
            rows.append(json.loads(line))
    return rows


def is_truncated(row):
    a = row.get("actual_output_tokens")
    r = row.get("requested_output_tokens")
    if a is not None and r is not None and r > 1:
        return a < r * 0.5
    return False


def stats(values):
    if not values:
        return {"n": 0}
    a = np.array(values)
    return {
        "n": len(a),
        "mean": float(np.mean(a)),
        "p50": float(np.percentile(a, 50)),
        "p90": float(np.percentile(a, 90)),
        "p99": float(np.percentile(a, 99)),
    }


def fmt(s, key):
    if s["n"] == 0:
        return "N/A"
    v = s[key]
    return f"{v:.3f}s" if v < 100 else f"{v:.1f}s"
|
||||
|
||||
results = []
|
||||
for label, path in DATASETS:
|
||||
if not path.exists():
|
||||
print(f"SKIP {label}")
|
||||
continue
|
||||
rows = load_rows(path)
|
||||
total = len(rows)
|
||||
err_n = sum(1 for r in rows if r.get("error") is not None)
|
||||
trunc_n = sum(1 for r in rows if r.get("error") is None and is_truncated(r))
|
||||
|
||||
# Filter: error=None AND not truncated AND latency present
|
||||
clean = [r for r in rows
|
||||
if r.get("error") is None
|
||||
and not is_truncated(r)
|
||||
and r.get("latency_s") is not None]
|
||||
|
||||
lats = [r["latency_s"] for r in clean]
|
||||
ttfts = [r["ttft_s"] for r in clean if r.get("ttft_s") is not None]
|
||||
|
||||
results.append({
|
||||
"label": label,
|
||||
"total": total,
|
||||
"err": err_n,
|
||||
"trunc": trunc_n,
|
||||
"clean_n": len(clean),
|
||||
"lat": stats(lats),
|
||||
"ttft": stats(ttfts),
|
||||
})
|
||||
|
||||
# Print comparison table
|
||||
print(f"\n{'='*100}")
|
||||
print("LATENCY (excluding errors AND truncated)")
|
||||
print(f"{'='*100}")
|
||||
print(f"{'config':<16}{'total':>7}{'err':>6}{'trunc':>7}{'clean':>7} {'mean':>9}{'P50':>9}{'P90':>9}{'P99':>9}")
|
||||
for r in results:
|
||||
print(f"{r['label']:<16}{r['total']:>7}{r['err']:>6}{r['trunc']:>7}{r['clean_n']:>7} "
|
||||
f"{fmt(r['lat'],'mean'):>9}{fmt(r['lat'],'p50'):>9}{fmt(r['lat'],'p90'):>9}{fmt(r['lat'],'p99'):>9}")
|
||||
|
||||
print(f"\n{'='*100}")
|
||||
print("TTFT (excluding errors AND truncated)")
|
||||
print(f"{'='*100}")
|
||||
print(f"{'config':<16}{'clean':>7} {'mean':>9}{'P50':>9}{'P90':>9}{'P99':>9}")
|
||||
for r in results:
|
||||
print(f"{r['label']:<16}{r['clean_n']:>7} "
|
||||
f"{fmt(r['ttft'],'mean'):>9}{fmt(r['ttft'],'p50'):>9}{fmt(r['ttft'],'p90'):>9}{fmt(r['ttft'],'p99'):>9}")
|
||||
|
||||
# Also: per-execution-mode breakdown for v4 only (the most interesting)
|
||||
print(f"\n{'='*100}")
|
||||
print("V4 2P6D: per-execution-mode (excluding errors and truncated)")
|
||||
print(f"{'='*100}")
|
||||
v4_2p6d = next((p for l, p in DATASETS if l == "v4 2P6D"), None)
# Skip the breakdown when the metrics file is missing, as the main loop does.
if v4_2p6d and v4_2p6d.exists():
    rows = load_rows(v4_2p6d)
    # Same filter as the main loop: rows without a latency cannot be aggregated.
    clean = [r for r in rows
             if r.get("error") is None
             and not is_truncated(r)
             and r.get("latency_s") is not None]
    from collections import Counter
    modes = Counter(r["execution_mode"] for r in clean)
    print(f"{'mode':<55}{'n':>7}{'%':>7} {'mean':>9}{'P50':>9}{'P90':>9}{'P99':>9}")
    for mode, count in modes.most_common(10):
        m_rows = [r for r in clean if r["execution_mode"] == mode]
        s = stats([r["latency_s"] for r in m_rows])
        pct = count/len(clean)*100
        print(f" {mode:<53}{count:>7}{pct:>6.1f}% {fmt(s,'mean'):>9}{fmt(s,'p50'):>9}{fmt(s,'p90'):>9}{fmt(s,'p99'):>9}")
|
||||
|
||||
# Also: WHAT IF we only count direct-to-D? (Pure KVC performance)
|
||||
print(f"\n{'='*100}")
|
||||
print("Pure KVC (kvcache-direct-to-d-session ONLY) vs Baseline")
|
||||
print(f"{'='*100}")
|
||||
for label, path in DATASETS:
    # Only the KVC configs (1P7D / 2P6D) can report a direct-to-D mode.
    if not path.exists() or ("1P7D" not in label and "2P6D" not in label):
        continue
    rows = load_rows(path)
    direct = [r for r in rows
              if r.get("error") is None and not is_truncated(r)
              and r.get("latency_s") is not None
              and r.get("execution_mode") == "kvcache-direct-to-d-session"]
    if not direct:
        continue
    s_lat = stats([r["latency_s"] for r in direct])
    s_ttft = stats([r["ttft_s"] for r in direct if r.get("ttft_s") is not None])
    print(f"{label:<16}n={s_lat['n']:>5} lat: P50={fmt(s_lat,'p50')} P90={fmt(s_lat,'p90')} ttft: P50={fmt(s_ttft,'p50')} P90={fmt(s_ttft,'p90')}")
|
||||
|
||||
# Baseline for reference (already non-fallback by definition)
print()
baseline_path = OUT / "qwen3-30b-tp1-v2-fixed/exp1_8way_dp_cache_aware_metrics.jsonl"
if baseline_path.exists():
    baseline_rows = load_rows(baseline_path)
    clean = [r for r in baseline_rows
             if r.get("error") is None and not is_truncated(r)
             and r.get("latency_s") is not None]
    s_lat = stats([r["latency_s"] for r in clean])
    s_ttft = stats([r["ttft_s"] for r in clean if r.get("ttft_s") is not None])
    print(f"{'baseline 8DP':<16}n={s_lat['n']:>5} lat: P50={fmt(s_lat,'p50')} P90={fmt(s_lat,'p90')} ttft: P50={fmt(s_ttft,'p50')} P90={fmt(s_ttft,'p90')}")
else:
    print("SKIP baseline 8DP")
|
||||
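The filtering rule in `compare_no_error.py` can be checked on a handful of hand-written rows. The sketch below is a minimal, standalone restatement of that rule, not the script itself: the `threshold` parameter, the `clean_rows` helper, and the sample rows are illustrative assumptions, while the field names are taken from the script.

```python
def is_truncated(row, threshold=0.5):
    """Same rule as the script: fewer than half the requested tokens were produced."""
    actual = row.get("actual_output_tokens")
    requested = row.get("requested_output_tokens")
    if actual is not None and requested is not None and requested > 1:
        return actual < requested * threshold
    return False


def clean_rows(rows):
    """Keep rows with no error, no truncation, and a recorded latency (hypothetical helper)."""
    return [
        r for r in rows
        if r.get("error") is None
        and not is_truncated(r)
        and r.get("latency_s") is not None
    ]


# Made-up rows for illustration only.
sample = [
    {"error": None, "actual_output_tokens": 100, "requested_output_tokens": 100, "latency_s": 1.2},
    {"error": None, "actual_output_tokens": 10, "requested_output_tokens": 100, "latency_s": 0.3},
    {"error": "KVTransferError", "latency_s": None},
    {"error": None, "actual_output_tokens": 1, "requested_output_tokens": 1, "latency_s": 0.1},
]
kept = clean_rows(sample)
print(len(kept), [r["latency_s"] for r in kept])
```

Note that a row requesting a single output token is never classified as truncated, because the rule only applies when `requested_output_tokens > 1`.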
110
scripts/convert_audit_to_trace.py
Normal file
@@ -0,0 +1,110 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Convert sibench audit.jsonl to agentic-pd-hybrid trace format.
|
||||
|
||||
Source format (sibench audit.jsonl):
|
||||
{"instance_id": "...", "ts": float, "messages": [...],
|
||||
"audit": {"prompt_tokens": int, "completion_tokens": int, ...}}
|
||||
|
||||
Target format (agentic-pd-hybrid trace JSONL):
|
||||
{"chat_id": int, "parent_chat_id": int, "timestamp": float,
|
||||
"turn": int, "input_length": int, "output_length": int,
|
||||
"type": str, "hash_ids": [int, ...]}
|
||||
"""
|
||||
|
||||
import json
|
||||
import sys
|
||||
from collections import defaultdict
|
||||
from pathlib import Path
|
||||
|
||||
BLOCK_TOKEN_BUDGET = 24 # tokens per block, matching trace.py default
|
||||
|
||||
|
||||
def convert(src: Path, dst: Path) -> None:
|
||||
# Group lines by instance_id, preserving order within each instance
|
||||
instances: dict[str, list[dict]] = defaultdict(list)
|
||||
with src.open() as f:
|
||||
for line in f:
|
||||
line = line.strip()
|
||||
if not line:
|
||||
continue
|
||||
rec = json.loads(line)
|
||||
instances[rec["instance_id"]].append(rec)
|
||||
|
||||
# Sort each instance's turns by timestamp
|
||||
for iid in instances:
|
||||
instances[iid].sort(key=lambda r: r["ts"])
|
||||
|
||||
    # Assign stable chat_id bases: each instance gets a block of IDs.
    # The maximum turn count across all instances determines the spacing.
    # default=0 keeps an empty input file from raising in max().
    max_turns = max((len(turns) for turns in instances.values()), default=0)
    spacing = max_turns + 10  # extra headroom
|
||||
|
||||
total_written = 0
|
||||
with dst.open("w") as out:
|
||||
for inst_idx, (iid, turns) in enumerate(instances.items()):
|
||||
base_chat_id = (inst_idx + 1) * spacing # start from spacing to avoid 0
|
||||
# Track cumulative hash_ids for prefix cache simulation
|
||||
cumulative_hash_ids: list[int] = []
|
||||
global_block_counter = inst_idx * 100_000 # unique block namespace per instance
|
||||
|
||||
for turn_idx, rec in enumerate(turns):
|
||||
audit = rec.get("audit", {})
|
||||
input_length = audit.get("prompt_tokens", 0)
|
||||
output_length = audit.get("completion_tokens", 0)
|
||||
|
||||
if input_length <= 0:
|
||||
# Fallback: estimate from message content
|
||||
total_chars = sum(len(m.get("content", "")) for m in rec.get("messages", []))
|
||||
input_length = max(1, total_chars // 4)
|
||||
if output_length <= 0:
|
||||
output_length = 128 # reasonable default
|
||||
|
||||
chat_id = base_chat_id + turn_idx
|
||||
if turn_idx == 0:
|
||||
parent_chat_id = -1
|
||||
else:
|
||||
parent_chat_id = base_chat_id + turn_idx - 1
|
||||
|
||||
                # Build hash_ids: for turn 0, generate blocks for the full input.
                # For turn N>0, keep the previous blocks and add new ones for the delta.
                if turn_idx == 0:
                    num_blocks = input_length // BLOCK_TOKEN_BUDGET
                    cumulative_hash_ids = list(
                        range(global_block_counter, global_block_counter + num_blocks)
                    )
                    global_block_counter += num_blocks
                else:
                    # Each prompt is cumulative, so the delta is the number of
                    # tokens added since the previous turn's prompt. A missing
                    # prompt_tokens value on either turn yields a delta of 0.
                    cur_prompt_tokens = audit.get("prompt_tokens", 0)
                    prev_audit = turns[turn_idx - 1].get("audit", {})
                    prev_prompt_tokens = prev_audit.get("prompt_tokens", 0)
                    delta = (
                        max(0, cur_prompt_tokens - prev_prompt_tokens)
                        if prev_prompt_tokens > 0
                        else 0
                    )
                    new_blocks = delta // BLOCK_TOKEN_BUDGET
                    new_ids = list(
                        range(global_block_counter, global_block_counter + new_blocks)
                    )
                    global_block_counter += new_blocks
                    cumulative_hash_ids = cumulative_hash_ids + new_ids
|
||||
|
||||
trace_line = {
|
||||
"chat_id": chat_id,
|
||||
"parent_chat_id": parent_chat_id,
|
||||
"timestamp": rec["ts"],
|
||||
"turn": turn_idx,
|
||||
"input_length": input_length,
|
||||
"output_length": output_length,
|
||||
"type": "chat",
|
||||
"hash_ids": cumulative_hash_ids,
|
||||
}
|
||||
out.write(json.dumps(trace_line, separators=(",", ":")) + "\n")
|
||||
total_written += 1
|
||||
|
||||
print(f"Converted {total_written} lines from {len(instances)} instances -> {dst}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
if len(sys.argv) != 3:
|
||||
print(f"Usage: {sys.argv[0]} <input_audit.jsonl> <output_trace.jsonl>")
|
||||
sys.exit(1)
|
||||
convert(Path(sys.argv[1]), Path(sys.argv[2]))
|
||||
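The cumulative `hash_ids` construction in `convert_audit_to_trace.py` is easier to follow on a three-turn example. The sketch below is a simplified restatement under stated assumptions: `build_hash_ids` and `namespace_start` are hypothetical names, and the fallback paths for missing `prompt_tokens` are left out.

```python
BLOCK_TOKEN_BUDGET = 24  # tokens per block, same constant as the converter


def build_hash_ids(prompt_tokens_per_turn, namespace_start=0):
    """Return the hash_ids list for each turn of one session (hypothetical helper)."""
    counter = namespace_start
    cumulative = []
    per_turn = []
    prev_tokens = 0
    for turn_idx, tokens in enumerate(prompt_tokens_per_turn):
        # Turn 0 covers the whole prompt; later turns only cover the growth.
        delta = tokens if turn_idx == 0 else max(0, tokens - prev_tokens)
        new_blocks = delta // BLOCK_TOKEN_BUDGET
        # Build a new list each turn so earlier turns keep their own snapshot.
        cumulative = cumulative + list(range(counter, counter + new_blocks))
        counter += new_blocks
        per_turn.append(cumulative)
        prev_tokens = tokens
    return per_turn


ids = build_hash_ids([48, 100, 90])
print(ids)
```

A prompt that shrinks between turns (the third turn here) adds no blocks and keeps the previous prefix unchanged. Because each delta is floor-divided separately, remainders smaller than one block are dropped per turn rather than carried over.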
450
scripts/prepare_real_ali_samples.py
Executable file
@@ -0,0 +1,450 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Prepare balanced real-Ali trace samples for KVC experiments.
|
||||
|
||||
The generic sampler is duration-oriented and can be dominated by one long
|
||||
session. This script keeps real request lengths/timestamps but caps turns per
|
||||
session so live sweeps can compare policies on a repeatable multi-session
|
||||
workload.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import statistics
|
||||
from collections import defaultdict
|
||||
from dataclasses import asdict, dataclass
|
||||
from pathlib import Path
|
||||
|
||||
from agentic_pd_hybrid.trace import TraceRequest, load_trace
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class SampleSummary:
|
||||
input_trace_path: str
|
||||
output_trace_path: str
|
||||
profile: str
|
||||
request_count: int
|
||||
session_count: int
|
||||
multi_turn_session_count: int
|
||||
turn2plus_count: int
|
||||
direct_eligible_turn2plus_count: int
|
||||
direct_eligible_turn2plus_ratio: float
|
||||
missing_parent_count: int
|
||||
max_sessions: int
|
||||
max_turns_per_session: int
|
||||
start_time_s: float
|
||||
end_time_s: float
|
||||
sampled_duration_s: float
|
||||
rebased_timestamps: bool
|
||||
input_tokens: dict[str, float] | None
|
||||
output_tokens: dict[str, float] | None
|
||||
append_tokens: dict[str, float] | None
|
||||
inter_turn_gap_s: dict[str, float] | None
|
||||
overlap_ratio: dict[str, float] | None
|
||||
|
||||
|
||||
def main() -> None:
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--trace", type=Path, required=True)
|
||||
parser.add_argument("--output-root", type=Path, required=True)
|
||||
parser.add_argument("--max-sessions", type=int, default=64)
|
||||
parser.add_argument("--max-turns-per-session", type=int, default=12)
|
||||
parser.add_argument("--start-time-s", type=float, default=0.0)
|
||||
parser.add_argument(
|
||||
"--window-duration-s",
|
||||
type=float,
|
||||
default=None,
|
||||
help=(
|
||||
"If set, also write continuous-window samples that keep only requests "
|
||||
"inside [start-time, start-time + window-duration]."
|
||||
),
|
||||
)
|
||||
parser.add_argument(
|
||||
"--window-target-requests",
|
||||
type=int,
|
||||
default=None,
|
||||
help=(
|
||||
"For continuous-window samples, select whole sessions across time "
|
||||
"buckets until at least this many requests are included. This keeps "
|
||||
"the window span while making live runs tractable."
|
||||
),
|
||||
)
|
||||
parser.add_argument(
|
||||
"--window-buckets",
|
||||
type=int,
|
||||
default=15,
|
||||
help="Number of time buckets used with --window-target-requests.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--window-min-turns",
|
||||
type=int,
|
||||
default=1,
|
||||
help=(
|
||||
"Minimum number of in-window turns per selected session for "
|
||||
"continuous-window samples."
|
||||
),
|
||||
)
|
||||
parser.add_argument(
|
||||
"--window-output-name",
|
||||
default="ali-window.jsonl",
|
||||
help="Output filename for the continuous-window sample.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--max-sampled-duration-s",
|
||||
type=float,
|
||||
default=None,
|
||||
help=(
|
||||
"For balanced profile samples, drop requests after the first selected "
|
||||
"timestamp plus this duration. Use only for quick smoke runs; headline "
|
||||
"runs should preserve the full sampled span."
|
||||
),
|
||||
)
|
||||
parser.add_argument(
|
||||
"--profiles",
|
||||
nargs="+",
|
||||
default=["representative-mt", "kvc-fit-smallappend"],
|
||||
choices=["representative-mt", "kvc-fit-smallappend"],
|
||||
)
|
||||
parser.add_argument(
|
||||
"--no-rebase-timestamps",
|
||||
action="store_true",
|
||||
help="Keep original timestamps instead of shifting the sample to start at 0.",
|
||||
)
|
||||
args = parser.parse_args()
|
||||
|
||||
requests = load_trace(args.trace)
|
||||
sessions: dict[str, list[TraceRequest]] = defaultdict(list)
|
||||
for request in requests:
|
||||
sessions[request.session_id].append(request)
|
||||
|
||||
args.output_root.mkdir(parents=True, exist_ok=True)
|
||||
if args.window_duration_s is not None:
|
||||
if args.window_target_requests is None:
|
||||
selected = _select_window(
|
||||
requests=requests,
|
||||
start_time_s=args.start_time_s,
|
||||
window_duration_s=args.window_duration_s,
|
||||
)
|
||||
profile = "window"
|
||||
else:
|
||||
selected = _select_window_session_sample(
|
||||
sessions=sessions,
|
||||
start_time_s=args.start_time_s,
|
||||
window_duration_s=args.window_duration_s,
|
||||
target_requests=args.window_target_requests,
|
||||
bucket_count=args.window_buckets,
|
||||
min_turns=args.window_min_turns,
|
||||
)
|
||||
profile = (
|
||||
"window-session-sample"
|
||||
if args.window_min_turns <= 1
|
||||
else f"window-session-sample-min{args.window_min_turns}turns"
|
||||
)
|
||||
output_path = args.output_root / args.window_output_name
|
||||
summary = _write_sample(
|
||||
selected=selected,
|
||||
input_trace_path=args.trace,
|
||||
output_path=output_path,
|
||||
profile=profile,
|
||||
max_sessions=args.max_sessions,
|
||||
max_turns_per_session=args.max_turns_per_session,
|
||||
rebase_timestamps=not args.no_rebase_timestamps,
|
||||
)
|
||||
print(
|
||||
f"window: wrote {summary.request_count} requests from "
|
||||
f"{summary.session_count} sessions to {output_path}"
|
||||
)
|
||||
|
||||
for profile in args.profiles:
|
||||
selected = _select_profile(
|
||||
sessions=sessions,
|
||||
profile=profile,
|
||||
start_time_s=args.start_time_s,
|
||||
max_sessions=args.max_sessions,
|
||||
max_turns_per_session=args.max_turns_per_session,
|
||||
max_sampled_duration_s=args.max_sampled_duration_s,
|
||||
)
|
||||
output_path = args.output_root / f"ali-{profile}.jsonl"
|
||||
summary = _write_sample(
|
||||
selected=selected,
|
||||
input_trace_path=args.trace,
|
||||
output_path=output_path,
|
||||
profile=profile,
|
||||
max_sessions=args.max_sessions,
|
||||
max_turns_per_session=args.max_turns_per_session,
|
||||
rebase_timestamps=not args.no_rebase_timestamps,
|
||||
)
|
||||
print(
|
||||
f"{profile}: wrote {summary.request_count} requests from "
|
||||
f"{summary.session_count} sessions to {output_path}"
|
||||
)
|
||||
|
||||
|
||||
def _select_profile(
|
||||
*,
|
||||
sessions: dict[str, list[TraceRequest]],
|
||||
profile: str,
|
||||
start_time_s: float,
|
||||
max_sessions: int,
|
||||
max_turns_per_session: int,
|
||||
max_sampled_duration_s: float | None,
|
||||
) -> list[TraceRequest]:
|
||||
eligible: list[list[TraceRequest]] = []
|
||||
for session_requests in sessions.values():
|
||||
ordered = _ordered(session_requests)
|
||||
if len(ordered) < 2:
|
||||
continue
|
||||
if ordered[0].timestamp_s < start_time_s:
|
||||
continue
|
||||
if profile == "kvc-fit-smallappend" and not _is_kvc_fit_smallappend(ordered):
|
||||
continue
|
||||
eligible.append(ordered[:max_turns_per_session])
|
||||
|
||||
eligible.sort(key=lambda items: (items[0].timestamp_s, items[0].session_id))
|
||||
selected_sessions = eligible[:max_sessions]
|
||||
selected = [request for items in selected_sessions for request in items]
|
||||
selected.sort(key=lambda request: (request.timestamp_s, request.chat_id))
|
||||
if selected and max_sampled_duration_s is not None:
|
||||
first_ts = selected[0].timestamp_s
|
||||
end_ts = first_ts + max_sampled_duration_s
|
||||
selected = [
|
||||
request for request in selected if request.timestamp_s <= end_ts
|
||||
]
|
||||
return selected
|
||||
|
||||
|
||||
def _select_window(
|
||||
*,
|
||||
requests: list[TraceRequest],
|
||||
start_time_s: float,
|
||||
window_duration_s: float,
|
||||
) -> list[TraceRequest]:
|
||||
end_time_s = start_time_s + window_duration_s
|
||||
selected = [
|
||||
request
|
||||
for request in requests
|
||||
if start_time_s <= request.timestamp_s <= end_time_s
|
||||
]
|
||||
selected.sort(key=lambda request: (request.timestamp_s, request.chat_id))
|
||||
return selected
|
||||
|
||||
|
||||
def _select_window_session_sample(
|
||||
*,
|
||||
sessions: dict[str, list[TraceRequest]],
|
||||
start_time_s: float,
|
||||
window_duration_s: float,
|
||||
target_requests: int,
|
||||
bucket_count: int,
|
||||
min_turns: int,
|
||||
) -> list[TraceRequest]:
|
||||
if target_requests <= 0:
|
||||
raise ValueError("--window-target-requests must be positive")
|
||||
if bucket_count <= 0:
|
||||
raise ValueError("--window-buckets must be positive")
|
||||
if min_turns <= 0:
|
||||
raise ValueError("--window-min-turns must be positive")
|
||||
|
||||
end_time_s = start_time_s + window_duration_s
|
||||
bucket_width_s = window_duration_s / bucket_count
|
||||
buckets: list[list[list[TraceRequest]]] = [[] for _ in range(bucket_count)]
|
||||
for session_requests in sessions.values():
|
||||
ordered = _ordered(session_requests)
|
||||
if not ordered:
|
||||
continue
|
||||
first = ordered[0]
|
||||
if first.timestamp_s < start_time_s or first.timestamp_s > end_time_s:
|
||||
continue
|
||||
in_window = [
|
||||
request
|
||||
for request in ordered
|
||||
if start_time_s <= request.timestamp_s <= end_time_s
|
||||
]
|
||||
if len(in_window) < min_turns:
|
||||
continue
|
||||
bucket_index = min(
|
||||
bucket_count - 1,
|
||||
int((first.timestamp_s - start_time_s) / bucket_width_s),
|
||||
)
|
||||
buckets[bucket_index].append(in_window)
|
||||
|
||||
for bucket in buckets:
|
||||
bucket.sort(key=lambda items: (items[0].timestamp_s, items[0].session_id))
|
||||
|
||||
selected_sessions: list[list[TraceRequest]] = []
|
||||
selected_count = 0
|
||||
positions = [0 for _ in range(bucket_count)]
|
||||
while selected_count < target_requests:
|
||||
progressed = False
|
||||
for index, bucket in enumerate(buckets):
|
||||
if positions[index] >= len(bucket):
|
||||
continue
|
||||
session_requests = bucket[positions[index]]
|
||||
positions[index] += 1
|
||||
selected_sessions.append(session_requests)
|
||||
selected_count += len(session_requests)
|
||||
progressed = True
|
||||
if selected_count >= target_requests:
|
||||
break
|
||||
if not progressed:
|
||||
break
|
||||
|
||||
selected = [request for items in selected_sessions for request in items]
|
||||
selected.sort(key=lambda request: (request.timestamp_s, request.chat_id))
|
||||
if len(selected) < target_requests:
|
||||
raise ValueError(
|
||||
f"window session sample selected only {len(selected)} requests; "
|
||||
f"target was {target_requests}"
|
||||
)
|
||||
return selected
|
||||
|
||||
|
||||
def _is_kvc_fit_smallappend(session_requests: list[TraceRequest]) -> bool:
    initial = session_requests[0]
    if initial.input_length < 2048 or initial.input_length > 16000:
        return False
    for request in session_requests:
        if request.output_length > 2048:
            return False
    for previous, current in zip(session_requests, session_requests[1:], strict=False):
        append_tokens = current.input_length - (
            previous.input_length + previous.output_length
        )
        if append_tokens <= 0 or append_tokens > 2048:
            return False
        if _overlap_ratio(previous, current) < 0.75:
            return False
    return True
|
||||
|
||||
|
||||
def _write_sample(
|
||||
*,
|
||||
selected: list[TraceRequest],
|
||||
input_trace_path: Path,
|
||||
output_path: Path,
|
||||
profile: str,
|
||||
max_sessions: int,
|
||||
max_turns_per_session: int,
|
||||
rebase_timestamps: bool,
|
||||
) -> SampleSummary:
|
||||
if not selected:
|
||||
raise ValueError(f"profile {profile!r} selected no requests")
|
||||
|
||||
first_ts = selected[0].timestamp_s
|
||||
output_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
with output_path.open("w", encoding="utf-8") as handle:
|
||||
for request in selected:
|
||||
timestamp = request.timestamp_s - first_ts if rebase_timestamps else request.timestamp_s
|
||||
payload = {
|
||||
"chat_id": request.chat_id,
|
||||
"parent_chat_id": request.parent_chat_id,
|
||||
"timestamp": round(timestamp, 6),
|
||||
"input_length": request.input_length,
|
||||
"output_length": request.output_length,
|
||||
"type": request.request_type,
|
||||
"turn": request.turn_id,
|
||||
"hash_ids": list(request.hash_ids),
|
||||
}
|
||||
handle.write(json.dumps(payload, sort_keys=True) + "\n")
|
||||
|
||||
sessions = defaultdict(list)
|
||||
for request in selected:
|
||||
sessions[request.session_id].append(request)
|
||||
|
||||
selected_chat_ids = {request.chat_id for request in selected}
|
||||
missing_parent_count = sum(
|
||||
1
|
||||
for request in selected
|
||||
if request.parent_chat_id >= 0 and request.parent_chat_id not in selected_chat_ids
|
||||
)
|
||||
append_values: list[float] = []
|
||||
gap_values: list[float] = []
|
||||
overlap_values: list[float] = []
|
||||
direct_eligible_count = 0
|
||||
for session_requests in sessions.values():
|
||||
ordered = _ordered(session_requests)
|
||||
for previous, current in zip(ordered, ordered[1:], strict=False):
|
||||
append_tokens = current.input_length - (
|
||||
previous.input_length + previous.output_length
|
||||
)
|
||||
overlap_ratio = _overlap_ratio(previous, current)
|
||||
append_values.append(float(append_tokens))
|
||||
gap_values.append(float(current.timestamp_s - previous.timestamp_s))
|
||||
overlap_values.append(overlap_ratio)
|
||||
if append_tokens > 0 and append_tokens <= 2048 and overlap_ratio > 0:
|
||||
direct_eligible_count += 1
|
||||
|
||||
turn2plus_count = sum(max(0, len(items) - 1) for items in sessions.values())
|
||||
|
||||
start = min(request.timestamp_s for request in selected)
|
||||
end = max(request.timestamp_s for request in selected)
|
||||
summary = SampleSummary(
|
||||
input_trace_path=str(input_trace_path),
|
||||
output_trace_path=str(output_path),
|
||||
profile=profile,
|
||||
request_count=len(selected),
|
||||
session_count=len(sessions),
|
||||
multi_turn_session_count=sum(1 for items in sessions.values() if len(items) > 1),
|
||||
turn2plus_count=turn2plus_count,
|
||||
direct_eligible_turn2plus_count=direct_eligible_count,
|
||||
direct_eligible_turn2plus_ratio=(
|
||||
direct_eligible_count / turn2plus_count if turn2plus_count else 0.0
|
||||
),
|
||||
missing_parent_count=missing_parent_count,
|
||||
max_sessions=max_sessions,
|
||||
max_turns_per_session=max_turns_per_session,
|
||||
start_time_s=0.0 if rebase_timestamps else start,
|
||||
end_time_s=end - start if rebase_timestamps else end,
|
||||
sampled_duration_s=end - start,
|
||||
rebased_timestamps=rebase_timestamps,
|
||||
input_tokens=_stats([float(request.input_length) for request in selected]),
|
||||
output_tokens=_stats([float(request.output_length) for request in selected]),
|
||||
append_tokens=_stats(append_values),
|
||||
inter_turn_gap_s=_stats(gap_values),
|
||||
overlap_ratio=_stats(overlap_values),
|
||||
)
|
||||
with output_path.with_suffix(output_path.suffix + ".summary.json").open(
|
||||
"w", encoding="utf-8"
|
||||
) as handle:
|
||||
json.dump(asdict(summary), handle, indent=2, sort_keys=True)
|
||||
return summary
|
||||
|
||||
|
||||
def _ordered(session_requests: list[TraceRequest]) -> list[TraceRequest]:
    return sorted(
        session_requests,
        key=lambda request: (request.timestamp_s, request.turn_id, request.chat_id),
    )


def _overlap_ratio(previous: TraceRequest, current: TraceRequest) -> float:
    if not current.hash_ids:
        return 0.0
    previous_blocks = set(previous.hash_ids)
    overlap = sum(1 for block in current.hash_ids if block in previous_blocks)
    return overlap / len(current.hash_ids)


def _stats(values: list[float]) -> dict[str, float] | None:
    if not values:
        return None
    ordered = sorted(values)
    return {
        "count": float(len(ordered)),
        "mean": statistics.fmean(ordered),
        "min": ordered[0],
        "p50": _percentile(ordered, 0.50),
        "p90": _percentile(ordered, 0.90),
        "p99": _percentile(ordered, 0.99),
        "max": ordered[-1],
    }


def _percentile(sorted_values: list[float], percentile: float) -> float:
    if len(sorted_values) == 1:
        return sorted_values[0]
    return sorted_values[round((len(sorted_values) - 1) * percentile)]
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
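Two helpers in `prepare_real_ali_samples.py` decide what the summary file reports: the percentile picks an existing element by rounded index instead of interpolating, and the overlap ratio is measured against the current turn's blocks. The sketch below restates both on plain lists; the function names and inputs are illustrative assumptions, since the real helpers operate on `TraceRequest` objects.

```python
def percentile_by_rounded_index(sorted_values, percentile):
    """Pick the element at the rounded index, as _percentile does (no interpolation)."""
    if len(sorted_values) == 1:
        return sorted_values[0]
    return sorted_values[round((len(sorted_values) - 1) * percentile)]


def overlap_ratio(previous_ids, current_ids):
    """Fraction of the current turn's blocks that already appeared in the previous turn."""
    if not current_ids:
        return 0.0
    previous_blocks = set(previous_ids)
    return sum(1 for block in current_ids if block in previous_blocks) / len(current_ids)


values = [1.0, 2.0, 3.0, 4.0, 5.0]
p50 = percentile_by_rounded_index(values, 0.50)  # index round(2.0) = 2
p90 = percentile_by_rounded_index(values, 0.90)  # index round(3.6) = 4
ratio = overlap_ratio([0, 1, 2], [0, 1, 2, 3])   # 3 of 4 current blocks are reused
print(p50, p90, ratio)
```

Since the percentile always returns a value from the sample, it can differ from the interpolated `np.percentile` results used in the analysis scripts, so the two sets of numbers are not directly comparable on small samples.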
73
scripts/run_all_experiments.sh
Executable file
@@ -0,0 +1,73 @@
|
||||
#!/bin/bash
|
||||
# Run all 3 PD hybrid experiments sequentially
|
||||
# Uses 52 sessions / 4,449 requests (10% sample of 497 sessions)
|
||||
# Each experiment takes ~30-40 min
|
||||
set -euo pipefail
|
||||
cd "$(dirname "$0")/.."
|
||||
|
||||
TRACE="outputs/qwen35-swebench-50sess.jsonl"
|
||||
MODEL="/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B"
|
||||
OUTPUT="outputs/swebench-exps"
|
||||
|
||||
echo "=== Experiment A: pd-disaggregation ==="
|
||||
uv run agentic-pd-hybrid benchmark-live \
|
||||
--trace "$TRACE" \
|
||||
--output-root "$OUTPUT" \
|
||||
--mechanism pd-disaggregation \
|
||||
--policy default \
|
||||
--model-path "$MODEL" \
|
||||
--prefill-workers 1 --decode-workers 1 \
|
||||
--prefill-tp-size 4 --decode-tp-size 4 \
|
||||
--prefill-gpu-ids 0,1,2,3 --decode-gpu-ids 4,5,6,7 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300
|
||||
|
||||
echo "=== Experiment B: pd-colo ==="
|
||||
uv run agentic-pd-hybrid benchmark-live \
|
||||
--trace "$TRACE" \
|
||||
--output-root "$OUTPUT" \
|
||||
--mechanism pd-colo \
|
||||
--policy default \
|
||||
--model-path "$MODEL" \
|
||||
--prefill-workers 0 --decode-workers 0 \
|
||||
--direct-workers 2 --direct-tp-size 4 \
|
||||
--direct-gpu-ids 0,1,2,3,4,5,6,7 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300
|
||||
|
||||
echo "=== Experiment C: kvcache-centric ==="
|
||||
uv run agentic-pd-hybrid benchmark-live \
|
||||
--trace "$TRACE" \
|
||||
--output-root "$OUTPUT" \
|
||||
--mechanism kvcache-centric \
|
||||
--policy default \
|
||||
--model-path "$MODEL" \
|
||||
--prefill-workers 1 --decode-workers 1 \
|
||||
--prefill-tp-size 4 --decode-tp-size 4 \
|
||||
--prefill-gpu-ids 0,1,2,3 --decode-gpu-ids 4,5,6,7 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 2 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction
|
||||
|
||||
echo "=== All experiments complete ==="
|
||||
24
scripts/run_exp_a_pd_disagg.sh
Executable file
@@ -0,0 +1,24 @@
|
||||
#!/bin/bash
# Experiment A: pd-disaggregation baseline
# 1P(GPU 0-3) + 1D(GPU 4-7), TP4, mooncake TCP
# Full 39K trace from SWE-Bench 500 instances
set -euo pipefail
cd "$(dirname "$0")/.."

uv run agentic-pd-hybrid benchmark-live \
  --trace outputs/qwen35-swebench-500.jsonl \
  --output-root outputs/swebench-exps \
  --mechanism pd-disaggregation \
  --policy default \
  --model-path /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B \
  --prefill-workers 1 --decode-workers 1 \
  --prefill-tp-size 4 --decode-tp-size 4 \
  --prefill-gpu-ids 0,1,2,3 --decode-gpu-ids 4,5,6,7 \
  --transfer-backend mooncake \
  --gpu-budget 8 \
  --time-scale 10 \
  --session-sample-rate 1.0 \
  --target-duration-s 100000 \
  --concurrency-limit 64 \
  --timeout-s 900 \
  --request-timeout-s 300
|
||||
23
scripts/run_exp_b1_dp_colo_rr.sh
Executable file
@@ -0,0 +1,23 @@
|
||||
#!/bin/bash
# Experiment B1: Naive DP colocation, round-robin policy
# 2 direct workers (GPU 0-3, 4-7), TP4, DP router with round-robin
# No disaggregation: each worker does prefill+decode locally
set -euo pipefail
cd "$(dirname "$0")/.."

uv run agentic-pd-hybrid benchmark-live \
  --trace outputs/qwen35-swebench-50sess.jsonl \
  --output-root outputs/swebench-exps \
  --mechanism pd-colo \
  --policy default \
  --model-path /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B \
  --prefill-workers 0 --decode-workers 0 \
  --direct-workers 2 --direct-tp-size 4 \
  --direct-gpu-ids 0,1,2,3,4,5,6,7 \
  --gpu-budget 8 \
  --time-scale 10 \
  --session-sample-rate 1.0 \
  --target-duration-s 100000 \
  --concurrency-limit 32 \
  --timeout-s 900 \
  --request-timeout-s 300
|
||||
23
scripts/run_exp_b2_dp_colo_cache_aware.sh
Executable file
@@ -0,0 +1,23 @@
|
||||
#!/bin/bash
# Experiment B2: Naive DP colocation, cache-aware (kv-aware) policy
# 2 direct workers (GPU 0-3, 4-7), TP4, DP router with consistent-hashing
# Replay kv-aware policy picks the worker with most prefix overlap
set -euo pipefail
cd "$(dirname "$0")/.."

uv run agentic-pd-hybrid benchmark-live \
  --trace outputs/qwen35-swebench-50sess.jsonl \
  --output-root outputs/swebench-exps \
  --mechanism pd-colo \
  --policy kv-aware \
  --model-path /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B \
  --prefill-workers 0 --decode-workers 0 \
  --direct-workers 2 --direct-tp-size 4 \
  --direct-gpu-ids 0,1,2,3,4,5,6,7 \
  --gpu-budget 8 \
  --time-scale 10 \
  --session-sample-rate 1.0 \
  --target-duration-s 100000 \
  --concurrency-limit 32 \
  --timeout-s 900 \
  --request-timeout-s 300
|
||||
24
scripts/run_exp_b_pd_colo.sh
Executable file
@@ -0,0 +1,24 @@
#!/bin/bash
# Experiment B: pd-colo (direct/colocation)
# 2 direct workers (GPU 0-3, 4-7), TP4, no router
# Full 39K trace from SWE-Bench 500 instances
set -euo pipefail
cd "$(dirname "$0")/.."

uv run agentic-pd-hybrid benchmark-live \
    --trace outputs/qwen35-swebench-500.jsonl \
    --output-root outputs/swebench-exps \
    --mechanism pd-colo \
    --policy default \
    --model-path /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B \
    --prefill-workers 0 --decode-workers 0 \
    --direct-workers 2 --direct-tp-size 4 \
    --direct-gpu-ids 0,1,2,3,4,5,6,7 \
    --transfer-backend mooncake \
    --gpu-budget 8 \
    --time-scale 10 \
    --session-sample-rate 1.0 \
    --target-duration-s 100000 \
    --concurrency-limit 64 \
    --timeout-s 900 \
    --request-timeout-s 300
28
scripts/run_exp_c_kvcache_centric.sh
Executable file
@@ -0,0 +1,28 @@
#!/bin/bash
# Experiment C: kvcache-centric (session-aware PD)
# 1P(GPU 0-3) + 1D(GPU 4-7), TP4, mooncake TCP
# Full 39K trace from SWE-Bench 500 instances
set -euo pipefail
cd "$(dirname "$0")/.."

uv run agentic-pd-hybrid benchmark-live \
    --trace outputs/qwen35-swebench-500.jsonl \
    --output-root outputs/swebench-exps \
    --mechanism kvcache-centric \
    --policy default \
    --model-path /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B \
    --prefill-workers 1 --decode-workers 1 \
    --prefill-tp-size 4 --decode-tp-size 4 \
    --prefill-gpu-ids 0,1,2,3 --decode-gpu-ids 4,5,6,7 \
    --transfer-backend mooncake \
    --gpu-budget 8 \
    --time-scale 10 \
    --session-sample-rate 1.0 \
    --target-duration-s 100000 \
    --concurrency-limit 64 \
    --timeout-s 900 \
    --request-timeout-s 300 \
    --kvcache-admission-mode worker \
    --kvcache-seed-min-turn-id 2 \
    --kvcache-prefill-backup-policy release-after-transfer \
    --kvcache-prefill-priority-eviction
30
scripts/smoke_test.sh
Executable file
@@ -0,0 +1,30 @@
#!/bin/bash
# Smoke test: pd-disaggregation with mooncake TCP, 100 requests
set -euo pipefail
cd "$(dirname "$0")/.."

# Sample a small trace for smoke testing
uv run agentic-pd-hybrid sample-sessions \
    --trace outputs/qwen35-swebench-500.jsonl \
    --output outputs/qwen35-smoke-3sess.jsonl \
    --session-sample-rate 0.02 \
    --min-turns 5 \
    --target-duration-s 300 \
    --max-requests 100

# Run smoke test
uv run agentic-pd-hybrid benchmark-live \
    --trace outputs/qwen35-smoke-3sess.jsonl \
    --output-root outputs/smoke \
    --mechanism pd-disaggregation \
    --policy default \
    --model-path /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B \
    --prefill-workers 1 --decode-workers 1 \
    --prefill-tp-size 4 --decode-tp-size 4 \
    --prefill-gpu-ids 0,1,2,3 --decode-gpu-ids 4,5,6,7 \
    --transfer-backend mooncake \
    --gpu-budget 8 \
    --time-scale 10 \
    --concurrency-limit 32 \
    --timeout-s 900 \
    --request-timeout-s 300
114
scripts/sweep_backpressure_smoke.sh
Executable file
@@ -0,0 +1,114 @@
#!/usr/bin/env bash
# Smoke sweep: validate backpressure code change on top of v5 Option D config.
# Designed to fit in ~3-4h GPU budget (4 runs × ~30-60 min).
#
# Usage:
#   bash scripts/sweep_backpressure_smoke.sh
#
# Prerequisites: GPUs available; trace at outputs/qwen35-swebench-50sess.jsonl;
#   model at $MODEL (default Qwen3-30B-A3B-Instruct-2507).
set -euo pipefail

REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
cd "$REPO_ROOT"

OUT_ROOT=${OUT_ROOT:-outputs/sweep_backpressure_smoke}
TRACE=${TRACE:-outputs/qwen35-swebench-50sess.jsonl}
MODEL=${MODEL:-/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507}

mkdir -p "$OUT_ROOT"
LOG="$OUT_ROOT/sweep.log"
echo "[$(date '+%F %T')] Starting backpressure smoke sweep" | tee -a "$LOG"
echo " Trace: $TRACE" | tee -a "$LOG"
echo " Model: $MODEL" | tee -a "$LOG"
echo " Output root: $OUT_ROOT" | tee -a "$LOG"

KVC_COMMON_ARGS=(
    --trace "$TRACE"
    --model "$MODEL"
    --mechanism kvcache-centric
    --policy kv-aware
    --kvcache-admission-mode worker
    --kvcache-seed-min-turn-id 1
    --kvcache-seed-max-inflight-decode -1
    --kvcache-prefill-backup-policy release-after-transfer
    --kvcache-prefill-priority-eviction
    --prefill-workers 2
    --decode-workers 6
    --prefill-gpu-ids 0,1
    --decode-gpu-ids 2,3,4,5,6,7
    --transfer-backend mooncake
    --target-duration-s 2000
    --session-sample-rate 1.0
    --min-turns 2
    --concurrency-limit 32
)

DP_COMMON_ARGS=(
    --trace "$TRACE"
    --model "$MODEL"
    --mechanism pd-colo
    --policy kv-aware
    --direct-workers 8
    --direct-gpu-ids 0,1,2,3,4,5,6,7
    --transfer-backend mooncake
    --target-duration-s 2000
    --session-sample-rate 1.0
    --min-turns 2
    --concurrency-limit 32
)

run_kvc_baseline_ts10() {
    local out="$OUT_ROOT/E1_kvc_baseline_ts10"
    echo "[$(date '+%F %T')] === E1: KVC baseline (no backpressure) time-scale=10 ===" | tee -a "$LOG"
    python -m agentic_pd_hybrid.cli benchmark-live \
        "${KVC_COMMON_ARGS[@]}" \
        --output-root "$out" \
        --time-scale 10 \
        2>&1 | tee -a "$LOG"
}

run_kvc_backpressure_ts10() {
    local out="$OUT_ROOT/E2_kvc_backpressure_ts10"
    echo "[$(date '+%F %T')] === E2: KVC + backpressure ON, time-scale=10 ===" | tee -a "$LOG"
    python -m agentic_pd_hybrid.cli benchmark-live \
        "${KVC_COMMON_ARGS[@]}" \
        --output-root "$out" \
        --time-scale 10 \
        --enable-backpressure \
        --backpressure-max-pause-s 2.0 \
        2>&1 | tee -a "$LOG"
}

run_kvc_backpressure_ts1() {
    local out="$OUT_ROOT/E3_kvc_backpressure_ts1_short"
    echo "[$(date '+%F %T')] === E3: KVC + backpressure ON, time-scale=1, FIRST 1000 reqs ===" | tee -a "$LOG"
    python -m agentic_pd_hybrid.cli benchmark-live \
        "${KVC_COMMON_ARGS[@]}" \
        --output-root "$out" \
        --time-scale 1 \
        --enable-backpressure \
        --backpressure-max-pause-s 2.0 \
        --target-duration-s 1800 \
        2>&1 | tee -a "$LOG"
}

run_dp_baseline_ts1() {
    local out="$OUT_ROOT/E4_dp_ts1_short"
    echo "[$(date '+%F %T')] === E4: 8-way DP cache-aware, time-scale=1, FIRST 1000 reqs ===" | tee -a "$LOG"
    python -m agentic_pd_hybrid.cli benchmark-live \
        "${DP_COMMON_ARGS[@]}" \
        --output-root "$out" \
        --time-scale 1 \
        --target-duration-s 1800 \
        2>&1 | tee -a "$LOG"
}

# Sequence: add/remove as fits the budget.
run_kvc_baseline_ts10
run_kvc_backpressure_ts10
run_kvc_backpressure_ts1
run_dp_baseline_ts1

echo "[$(date '+%F %T')] === sweep DONE ===" | tee -a "$LOG"
echo "Run analysis with: python scripts/analysis/analyze_backpressure_smoke.py $OUT_ROOT" | tee -a "$LOG"
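Every run in `scripts/sweep_backpressure_smoke.sh` pipes the benchmark through `tee -a "$LOG"` while `set -euo pipefail` is active, so a failing benchmark still fails the whole pipeline and stops the remaining runs. The sketch below illustrates that exit-status behavior in isolation. It assumes bash; `fake_benchmark` and `run_step` are hypothetical stand-ins and are not part of the repository.

```shell
#!/usr/bin/env bash
set -uo pipefail

LOG="$(mktemp)"

# Hypothetical stand-in for a benchmark invocation that fails with status 3.
fake_benchmark() {
    echo "benchmark output"
    return 3
}

# Run a command through tee and print the exit status of the pipeline.
# With pipefail, the status is that of the rightmost failing command,
# so tee succeeding does not mask the failure.
run_step() {
    local status=0
    "$@" 2>&1 | tee -a "$LOG" >/dev/null || status=$?
    echo "$status"
}

failed_status="$(run_step fake_benchmark)"
ok_status="$(run_step true)"
echo "failed step -> $failed_status, ok step -> $ok_status"
```

If later runs should continue after an earlier one fails, one option is to capture the status this way and log it instead of letting `set -e` abort the sweep.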
60
scripts/sweep_kvc_qwen3_30b.sh
Executable file
@@ -0,0 +1,60 @@
#!/bin/bash
# KVC admission control parameter sweep on Qwen3-30B
# 5 experiments, ~35 min each, ~3 hours total
set -euo pipefail
cd "$(dirname "$0")/.."

MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
TRACE=outputs/qwen35-swebench-50sess.jsonl
OUTPUT=outputs/qwen3-30b-exps
VENV_PYTHON=.venv/bin/python

run_kvc() {
    local label=$1
    local inflight=$2
    local min_turn=$3

    echo "=== [$label] inflight=$inflight min_turn=$min_turn === $(date)"
    PYTHONPATH=src:third_party/sglang/python \
    "$VENV_PYTHON" -m agentic_pd_hybrid.cli benchmark-live \
        --trace "$TRACE" \
        --output-root "$OUTPUT" \
        --mechanism kvcache-centric \
        --policy default \
        --model-path "$MODEL" \
        --prefill-workers 1 --decode-workers 1 \
        --prefill-tp-size 4 --decode-tp-size 4 \
        --prefill-gpu-ids 0,1,2,3 --decode-gpu-ids 4,5,6,7 \
        --transfer-backend mooncake \
        --gpu-budget 8 \
        --time-scale 10 \
        --session-sample-rate 1.0 \
        --target-duration-s 100000 \
        --concurrency-limit 32 \
        --timeout-s 900 \
        --request-timeout-s 300 \
        --kvcache-admission-mode worker \
        --kvcache-seed-min-turn-id "$min_turn" \
        --kvcache-seed-max-inflight-decode "$inflight" \
        --kvcache-prefill-backup-policy release-after-transfer \
        --kvcache-prefill-priority-eviction
    echo "=== [$label] DONE === $(date)"
    echo ""
}

# C1: inflight=8, min-turn=2
run_kvc "C1" 8 2

# C2: inflight=16, min-turn=2
run_kvc "C2" 16 2

# C3: inflight=-1 (disabled), min-turn=2
run_kvc "C3" -1 2

# C4: inflight=8, min-turn=1
run_kvc "C4" 8 1

# C5: inflight=-1 (disabled), min-turn=1
run_kvc "C5" -1 1

echo "=== ALL SWEEP EXPERIMENTS DONE === $(date)"
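The C1 to C5 calls in `scripts/sweep_kvc_qwen3_30b.sh` form a small grid over `--kvcache-seed-max-inflight-decode` and `--kvcache-seed-min-turn-id`. As a sketch only, the same grid could be written as a table that a loop walks, which keeps labels and parameters on one line each. The `run_kvc` below is a hypothetical stub that echoes its arguments; the real function launches `benchmark-live`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stub standing in for the real run_kvc, which starts a benchmark.
run_kvc() {
    local label=$1 inflight=$2 min_turn=$3
    echo "$label inflight=$inflight min_turn=$min_turn"
}

# Columns: label, max-inflight-decode (-1 = disabled), seed-min-turn-id.
CONFIGS=(
    "C1 8 2"
    "C2 16 2"
    "C3 -1 2"
    "C4 8 1"
    "C5 -1 1"
)

results=()
for cfg in "${CONFIGS[@]}"; do
    # Split one table row into its three fields.
    read -r label inflight min_turn <<<"$cfg"
    results+=("$(run_kvc "$label" "$inflight" "$min_turn")")
done
printf '%s\n' "${results[@]}"
```

Adding a configuration then means adding one row rather than a new comment-plus-call pair.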
170
scripts/sweep_real_ali_kvc.sh
Executable file
@@ -0,0 +1,170 @@
#!/usr/bin/env bash
# Real Ali workload sweep for KVC pd-hybrid.
#
# This script expects a prebuilt sample trace and replays it exactly for every
# mechanism. It intentionally keeps pool polling disabled for performance runs.
set -euo pipefail

REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
cd "$REPO_ROOT"

MODEL=${MODEL:-/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct}
TRACE=${TRACE:-outputs/real-ali-kvc-iter/samples-balanced/ali-kvc-fit-smallappend.jsonl}
OUT_ROOT=${OUT_ROOT:-outputs/real-ali-kvc-iter/runs}
TIME_SCALE=${TIME_SCALE:-1}
CONCURRENCY=${CONCURRENCY:-32}
REQUEST_TIMEOUT_S=${REQUEST_TIMEOUT_S:-300}
STACK_TIMEOUT_S=${STACK_TIMEOUT_S:-1200}
RUNS=${RUNS:-"dp kvc_bp"}
EXTRA_SERVER_ARGS=${EXTRA_SERVER_ARGS:-}
PREFILL_EXTRA_SERVER_ARGS=${PREFILL_EXTRA_SERVER_ARGS:-}
DECODE_EXTRA_SERVER_ARGS=${DECODE_EXTRA_SERVER_ARGS:-}
KVC_SEED_MIN_TURN_ID=${KVC_SEED_MIN_TURN_ID:-1}
KVC_SEED_ONLY_MULTITURN=${KVC_SEED_ONLY_MULTITURN:-0}

mkdir -p "$OUT_ROOT"
LOG="$OUT_ROOT/sweep.log"

log() {
    echo "[$(date '+%F %T')] $*" | tee -a "$LOG"
}

common_args=(
    --trace "$TRACE"
    --model-path "$MODEL"
    --output-root "$OUT_ROOT"
    --use-trace-as-sample
    --time-scale "$TIME_SCALE"
    --concurrency-limit "$CONCURRENCY"
    --timeout-s "$STACK_TIMEOUT_S"
    --request-timeout-s "$REQUEST_TIMEOUT_S"
)
if [[ -n "$EXTRA_SERVER_ARGS" ]]; then
    common_args+=(--extra-server-args "$EXTRA_SERVER_ARGS")
fi
if [[ -n "$PREFILL_EXTRA_SERVER_ARGS" ]]; then
    common_args+=(--prefill-extra-server-args "$PREFILL_EXTRA_SERVER_ARGS")
fi
if [[ -n "$DECODE_EXTRA_SERVER_ARGS" ]]; then
    common_args+=(--decode-extra-server-args "$DECODE_EXTRA_SERVER_ARGS")
fi

kvc_args=(
    "${common_args[@]}"
    --mechanism kvcache-centric
    --policy kv-aware
    --prefill-workers 2
    --decode-workers 6
    --prefill-tp-size 1
    --decode-tp-size 1
    --prefill-gpu-ids 0,1
    --decode-gpu-ids 2,3,4,5,6,7
    --transfer-backend mooncake
    --gpu-budget 8
    --kvcache-admission-mode worker
    --kvcache-seed-min-turn-id "$KVC_SEED_MIN_TURN_ID"
    --kvcache-seed-max-inflight-decode -1
    --kvcache-prefill-backup-policy release-after-transfer
    --kvcache-prefill-priority-eviction
)
if [[ "$KVC_SEED_ONLY_MULTITURN" == "1" ]]; then
    kvc_args+=(--kvcache-seed-only-multiturn-sessions)
fi

run_dp() {
    log "=== DP cache-aware baseline: 8 direct workers ==="
    uv run agentic-pd-hybrid benchmark-live \
        "${common_args[@]}" \
        --mechanism pd-colo \
        --policy kv-aware \
        --prefill-workers 0 \
        --decode-workers 0 \
        --direct-workers 8 \
        --direct-tp-size 1 \
        --direct-gpu-ids 0,1,2,3,4,5,6,7 \
        --gpu-budget 8
}

run_pd_disagg() {
    log "=== PD-disaggregation baseline: 2P6D ==="
    uv run agentic-pd-hybrid benchmark-live \
        "${common_args[@]}" \
        --mechanism pd-disaggregation \
        --policy kv-aware \
        --prefill-workers 2 \
        --decode-workers 6 \
        --prefill-tp-size 1 \
        --decode-tp-size 1 \
        --prefill-gpu-ids 0,1 \
        --decode-gpu-ids 2,3,4,5,6,7 \
        --transfer-backend mooncake \
        --gpu-budget 8
}

run_pd_sticky() {
    log "=== PD-disaggregation sticky baseline: 2P6D ==="
    uv run agentic-pd-hybrid benchmark-live \
        "${common_args[@]}" \
        --mechanism pd-disaggregation \
        --policy sticky \
        --prefill-workers 2 \
        --decode-workers 6 \
        --prefill-tp-size 1 \
        --decode-tp-size 1 \
        --prefill-gpu-ids 0,1 \
        --decode-gpu-ids 2,3,4,5,6,7 \
        --transfer-backend mooncake \
        --gpu-budget 8
}

run_kvc() {
    log "=== KVC baseline: 2P6D worker admission, no backpressure ==="
    uv run agentic-pd-hybrid benchmark-live "${kvc_args[@]}"
}

run_kvc_bp() {
    log "=== KVC candidate: 2P6D worker admission + backpressure ==="
    uv run agentic-pd-hybrid benchmark-live \
        "${kvc_args[@]}" \
        --enable-backpressure \
        --backpressure-max-pause-s 2.0
}

summarize_latest() {
    log "=== Latest summaries ==="
    find "$OUT_ROOT" -maxdepth 2 -name 'request-metrics.jsonl.summary.json' -print \
        | sort \
        | while read -r summary; do
            python - "$summary" <<'PY'
import json, sys
p=sys.argv[1]
d=json.load(open(p))
lat=d.get("latency_stats_s") or {}
tt=d.get("ttft_stats_s") or {}
em=d.get("execution_modes") or {}
print(p)
print(" reqs", d.get("request_count"), "errors", d.get("error_count"), "trunc", d.get("truncated_request_count"))
print(" lat mean/p50/p90/p99", lat.get("mean"), lat.get("p50"), lat.get("p90"), lat.get("p99"))
print(" ttft mean/p50/p90", tt.get("mean"), tt.get("p50"), tt.get("p90"))
print(" modes", em)
PY
        done | tee -a "$LOG"
}

log "Trace: $TRACE"
log "Model: $MODEL"
log "Runs: $RUNS | time-scale=$TIME_SCALE concurrency=$CONCURRENCY | kvc-seed-min-turn-id=$KVC_SEED_MIN_TURN_ID | kvc-seed-only-multiturn=$KVC_SEED_ONLY_MULTITURN"

for run in $RUNS; do
    case "$run" in
        dp) run_dp ;;
        pd) run_pd_disagg ;;
        pd_sticky) run_pd_sticky ;;
        kvc) run_kvc ;;
        kvc_bp) run_kvc_bp ;;
        *) log "Unknown run name: $run"; exit 2 ;;
    esac
done

summarize_latest
log "DONE"
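The dispatch loop in `scripts/sweep_real_ali_kvc.sh` rejects an unknown name in `RUNS` only when it reaches that entry, so a typo in the last name is found after the earlier (long) runs have already finished. A possible refinement, sketched here under that assumption, is to validate the whole list before launching anything. `validate_runs` and `KNOWN_RUNS` are hypothetical names introduced for illustration; the accepted run names are the ones in the script's `case` statement.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Run names accepted by the dispatch loop in the sweep script.
KNOWN_RUNS="dp pd pd_sticky kvc kvc_bp"

# Hypothetical helper: print the first unknown name in "$1" and return 1,
# or print nothing and return 0 when every name is known.
validate_runs() {
    local run known ok
    for run in $1; do
        ok=0
        for known in $KNOWN_RUNS; do
            if [[ "$run" == "$known" ]]; then ok=1; fi
        done
        if [[ "$ok" -eq 0 ]]; then
            echo "$run"
            return 1
        fi
    done
    return 0
}

RUNS=${RUNS:-"dp kvc_bp"}
if bad="$(validate_runs "$RUNS")"; then
    echo "all run names valid: $RUNS"
else
    echo "Unknown run name: $bad" >&2
    exit 2
fi
```

The exit code 2 mirrors what the existing loop uses for an unknown run name.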
133
scripts/sweep_tp1_configs.sh
Executable file
@@ -0,0 +1,133 @@
#!/bin/bash
# TP1 configuration sweep: 8-way DP, 1P7D KVC, 2P6D KVC
# Qwen3-30B-A3B TP=1, single GPU per worker
# Most aggressive KVC admission: inflight=-1 (off), seed-min-turn=1
set -euo pipefail
cd "$(dirname "$0")/.."

MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
TRACE=outputs/qwen35-swebench-50sess.jsonl
OUTPUT=outputs/qwen3-30b-tp1-exps
VENV_PYTHON=.venv/bin/python
RESULTS_FILE=$OUTPUT/sweep_results.txt

mkdir -p "$OUTPUT"

log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$RESULTS_FILE"
}

save_result() {
    local label=$1
    local run_dir=$2
    log "=== $label COMPLETED ==="
    if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
        log "Summary:"
        cat "$run_dir/request-metrics.jsonl.summary.json" >> "$RESULTS_FILE"
        echo "" >> "$RESULTS_FILE"
        # Also copy summary to a named file for easy access
        cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
        log "Saved to $OUTPUT/${label}_summary.json"
    else
        log "WARNING: No summary file found in $run_dir"
    fi
}

log "Starting TP1 configuration sweep"
log "Model: $MODEL"
log "Trace: $TRACE (4449 requests, 52 sessions)"

########################################
# Experiment 1: 8-way DP cache-aware
########################################
log ""
log "=== [EXP1] 8-way DP cache-aware (8 direct × TP1) ==="
PYTHONPATH=src:third_party/sglang/python \
"$VENV_PYTHON" -m agentic_pd_hybrid.cli benchmark-live \
    --trace "$TRACE" \
    --output-root "$OUTPUT" \
    --mechanism pd-colo \
    --policy kv-aware \
    --model-path "$MODEL" \
    --prefill-workers 0 --decode-workers 0 \
    --direct-workers 8 --direct-tp-size 1 \
    --direct-gpu-ids 0,1,2,3,4,5,6,7 \
    --gpu-budget 8 \
    --time-scale 10 \
    --session-sample-rate 1.0 \
    --target-duration-s 100000 \
    --concurrency-limit 32 \
    --timeout-s 900 \
    --request-timeout-s 300

# Find latest run dir for this experiment (empty if none exists)
EXP1_DIR=$(ls -td "$OUTPUT"/pd-colo-kv-aware-*/ 2>/dev/null | head -1 || true)
save_result "exp1_8way_dp_cache_aware" "$EXP1_DIR"

########################################
# Experiment 2: 1P + 7D KVC (most aggressive)
########################################
log ""
log "=== [EXP2] 1P7D KVC (inflight=off, min-turn=1) ==="
PYTHONPATH=src:third_party/sglang/python \
"$VENV_PYTHON" -m agentic_pd_hybrid.cli benchmark-live \
    --trace "$TRACE" \
    --output-root "$OUTPUT" \
    --mechanism kvcache-centric \
    --policy default \
    --model-path "$MODEL" \
    --prefill-workers 1 --decode-workers 7 \
    --prefill-tp-size 1 --decode-tp-size 1 \
    --prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
    --transfer-backend mooncake \
    --gpu-budget 8 \
    --time-scale 10 \
    --session-sample-rate 1.0 \
    --target-duration-s 100000 \
    --concurrency-limit 32 \
    --timeout-s 900 \
    --request-timeout-s 300 \
    --kvcache-admission-mode worker \
    --kvcache-seed-min-turn-id 1 \
    --kvcache-seed-max-inflight-decode -1 \
    --kvcache-prefill-backup-policy release-after-transfer \
    --kvcache-prefill-priority-eviction

EXP2_DIR=$(ls -td "$OUTPUT"/kvcache-centric-*/ 2>/dev/null | head -1 || true)
save_result "exp2_1p7d_kvc_aggressive" "$EXP2_DIR"

########################################
# Experiment 3: 2P + 6D KVC (most aggressive)
########################################
log ""
log "=== [EXP3] 2P6D KVC (inflight=off, min-turn=1) ==="
PYTHONPATH=src:third_party/sglang/python \
"$VENV_PYTHON" -m agentic_pd_hybrid.cli benchmark-live \
    --trace "$TRACE" \
    --output-root "$OUTPUT" \
    --mechanism kvcache-centric \
    --policy default \
    --model-path "$MODEL" \
    --prefill-workers 2 --decode-workers 6 \
    --prefill-tp-size 1 --decode-tp-size 1 \
    --prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
    --transfer-backend mooncake \
    --gpu-budget 8 \
    --time-scale 10 \
    --session-sample-rate 1.0 \
    --target-duration-s 100000 \
    --concurrency-limit 32 \
    --timeout-s 900 \
    --request-timeout-s 300 \
    --kvcache-admission-mode worker \
    --kvcache-seed-min-turn-id 1 \
    --kvcache-seed-max-inflight-decode -1 \
    --kvcache-prefill-backup-policy release-after-transfer \
    --kvcache-prefill-priority-eviction

EXP3_DIR=$(ls -td "$OUTPUT"/kvcache-centric-*/ 2>/dev/null | head -1 || true)
save_result "exp3_2p6d_kvc_aggressive" "$EXP3_DIR"

########################################
log ""
log "=== ALL TP1 SWEEP EXPERIMENTS DONE ==="
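The TP1 sweep locates each run directory with `ls -td <prefix>*/ | head -1`, relying on modification time to pick the newest match. Under `set -euo pipefail`, an `ls` that matches nothing makes the pipeline fail, which would end the script before `save_result` can print its warning. The sketch below shows one way to make the lookup tolerant of that case. `latest_run_dir` is a hypothetical helper, and the temporary directories only simulate run outputs.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print the most recently modified directory whose
# path starts with "$1", or print nothing if there is no match.
latest_run_dir() {
    local prefix=$1
    # `|| true` keeps a no-match `ls` from failing the pipeline under pipefail.
    ls -td "$prefix"*/ 2>/dev/null | head -1 || true
}

# Simulate two run directories created one second apart.
root="$(mktemp -d)"
mkdir "$root/kvcache-centric-a"
sleep 1
mkdir "$root/kvcache-centric-b"

newest="$(latest_run_dir "$root/kvcache-centric-")"
missing="$(latest_run_dir "$root/pd-colo-")"
echo "newest=$newest missing=${missing:-<none>}"
```

Because EXP2 and EXP3 share the `kvcache-centric-` prefix, this lookup only distinguishes them by recency, so it has to run right after the benchmark it belongs to.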
131
scripts/sweep_tp1_v2_fixed.sh
Executable file
@@ -0,0 +1,131 @@
#!/bin/bash
# TP1 configuration sweep v2: after session_params fix + audit fields
# Qwen3-30B-A3B TP=1, single GPU per worker
set -euo pipefail
cd "$(dirname "$0")/.."

MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
TRACE=outputs/qwen35-swebench-50sess.jsonl
OUTPUT=outputs/qwen3-30b-tp1-v2-fixed
VENV_PYTHON=.venv/bin/python
RESULTS_FILE=$OUTPUT/sweep_results.txt

mkdir -p "$OUTPUT"

log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$RESULTS_FILE"
}

save_result() {
    local label=$1
    local run_dir=$2
    log "=== $label COMPLETED ==="
    if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
        log "Summary:"
        cat "$run_dir/request-metrics.jsonl.summary.json" >> "$RESULTS_FILE"
        echo "" >> "$RESULTS_FILE"
        cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
        cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
        log "Saved to $OUTPUT/${label}_summary.json + ${label}_metrics.jsonl"
    else
        log "WARNING: No summary file found in $run_dir"
    fi
}

log "Starting TP1 v2 sweep (session_params fix + audit fields)"
log "Model: $MODEL"
log "Trace: $TRACE (4449 requests, 52 sessions)"

########################################
# Experiment 1: 8-way DP cache-aware
########################################
log ""
log "=== [EXP1] 8-way DP cache-aware (8 direct × TP1) ==="
PYTHONPATH=src:third_party/sglang/python \
"$VENV_PYTHON" -m agentic_pd_hybrid.cli benchmark-live \
    --trace "$TRACE" \
    --output-root "$OUTPUT" \
    --mechanism pd-colo \
    --policy kv-aware \
    --model-path "$MODEL" \
    --prefill-workers 0 --decode-workers 0 \
    --direct-workers 8 --direct-tp-size 1 \
    --direct-gpu-ids 0,1,2,3,4,5,6,7 \
    --gpu-budget 8 \
    --time-scale 10 \
    --session-sample-rate 1.0 \
    --target-duration-s 100000 \
    --concurrency-limit 32 \
    --timeout-s 900 \
    --request-timeout-s 300

EXP1_DIR=$(ls -td "$OUTPUT"/pd-colo-kv-aware-*/ 2>/dev/null | head -1 || true)
save_result "exp1_8way_dp_cache_aware" "$EXP1_DIR"

########################################
# Experiment 2: 1P + 7D KVC (aggressive)
########################################
log ""
log "=== [EXP2] 1P7D KVC (inflight=off, min-turn=1) ==="
PYTHONPATH=src:third_party/sglang/python \
"$VENV_PYTHON" -m agentic_pd_hybrid.cli benchmark-live \
    --trace "$TRACE" \
    --output-root "$OUTPUT" \
    --mechanism kvcache-centric \
    --policy default \
    --model-path "$MODEL" \
    --prefill-workers 1 --decode-workers 7 \
    --prefill-tp-size 1 --decode-tp-size 1 \
    --prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
    --transfer-backend mooncake \
    --gpu-budget 8 \
    --time-scale 10 \
    --session-sample-rate 1.0 \
    --target-duration-s 100000 \
    --concurrency-limit 32 \
    --timeout-s 900 \
    --request-timeout-s 300 \
    --kvcache-admission-mode worker \
    --kvcache-seed-min-turn-id 1 \
    --kvcache-seed-max-inflight-decode -1 \
    --kvcache-prefill-backup-policy release-after-transfer \
    --kvcache-prefill-priority-eviction

EXP2_DIR=$(ls -td "$OUTPUT"/kvcache-centric-*/ 2>/dev/null | head -1 || true)
save_result "exp2_1p7d_kvc_aggressive" "$EXP2_DIR"

########################################
# Experiment 3: 2P + 6D KVC (aggressive)
########################################
log ""
log "=== [EXP3] 2P6D KVC (inflight=off, min-turn=1) ==="
PYTHONPATH=src:third_party/sglang/python \
"$VENV_PYTHON" -m agentic_pd_hybrid.cli benchmark-live \
    --trace "$TRACE" \
    --output-root "$OUTPUT" \
    --mechanism kvcache-centric \
    --policy default \
    --model-path "$MODEL" \
    --prefill-workers 2 --decode-workers 6 \
    --prefill-tp-size 1 --decode-tp-size 1 \
    --prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
    --transfer-backend mooncake \
    --gpu-budget 8 \
    --time-scale 10 \
    --session-sample-rate 1.0 \
    --target-duration-s 100000 \
    --concurrency-limit 32 \
    --timeout-s 900 \
    --request-timeout-s 300 \
    --kvcache-admission-mode worker \
    --kvcache-seed-min-turn-id 1 \
    --kvcache-seed-max-inflight-decode -1 \
    --kvcache-prefill-backup-policy release-after-transfer \
    --kvcache-prefill-priority-eviction

EXP3_DIR=$(ls -td "$OUTPUT"/kvcache-centric-*/ 2>/dev/null | head -1 || true)
save_result "exp3_2p6d_kvc_aggressive" "$EXP3_DIR"

########################################
log ""
log "=== ALL TP1 V2 SWEEP EXPERIMENTS DONE ==="
108
scripts/sweep_tp1_v3_kvaware.sh
Executable file
@@ -0,0 +1,108 @@
#!/bin/bash
# TP1 v3 sweep: KVC with kv-aware policy (fix routing mismatch)
# v2 used --policy default for KVC experiments, causing session routing
# mismatch: replay round-robin ≠ router round-robin → "session not found".
# v3 uses --policy kv-aware for KVC to ensure session affinity.
set -euo pipefail
cd "$(dirname "$0")/.."

MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
TRACE=outputs/qwen35-swebench-50sess.jsonl
OUTPUT=outputs/qwen3-30b-tp1-v3-kvaware
VENV_PYTHON=.venv/bin/python
RESULTS_FILE=$OUTPUT/sweep_results.txt

mkdir -p "$OUTPUT"

log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$RESULTS_FILE"
}

save_result() {
    local label=$1
    local run_dir=$2
    log "=== $label COMPLETED ==="
    if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
        log "Summary:"
        cat "$run_dir/request-metrics.jsonl.summary.json" >> "$RESULTS_FILE"
        echo "" >> "$RESULTS_FILE"
        cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
        cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
        log "Saved to $OUTPUT/${label}_summary.json + ${label}_metrics.jsonl"
    else
        log "WARNING: No summary file found in $run_dir"
    fi
}

log "Starting TP1 v3 sweep (KVC with kv-aware policy)"
log "Model: $MODEL"
log "Trace: $TRACE (4449 requests, 52 sessions)"
log "Key change: --policy kv-aware for KVC (was --policy default in v2)"

########################################
# Experiment 1: 1P + 7D KVC kv-aware
########################################
log ""
log "=== [EXP1] 1P7D KVC kv-aware ==="
PYTHONPATH=src:third_party/sglang/python \
"$VENV_PYTHON" -m agentic_pd_hybrid.cli benchmark-live \
    --trace "$TRACE" \
    --output-root "$OUTPUT" \
    --mechanism kvcache-centric \
    --policy kv-aware \
    --model-path "$MODEL" \
    --prefill-workers 1 --decode-workers 7 \
    --prefill-tp-size 1 --decode-tp-size 1 \
    --prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
    --transfer-backend mooncake \
    --gpu-budget 8 \
    --time-scale 10 \
    --session-sample-rate 1.0 \
    --target-duration-s 100000 \
    --concurrency-limit 32 \
    --timeout-s 900 \
    --request-timeout-s 300 \
    --kvcache-admission-mode worker \
    --kvcache-seed-min-turn-id 1 \
    --kvcache-seed-max-inflight-decode -1 \
    --kvcache-prefill-backup-policy release-after-transfer \
    --kvcache-prefill-priority-eviction

EXP1_DIR=$(ls -td "$OUTPUT"/kvcache-centric-*/ 2>/dev/null | head -1 || true)
save_result "exp1_1p7d_kvc_kvaware" "$EXP1_DIR"

########################################
# Experiment 2: 2P + 6D KVC kv-aware
########################################
log ""
log "=== [EXP2] 2P6D KVC kv-aware ==="
PYTHONPATH=src:third_party/sglang/python \
"$VENV_PYTHON" -m agentic_pd_hybrid.cli benchmark-live \
    --trace "$TRACE" \
    --output-root "$OUTPUT" \
    --mechanism kvcache-centric \
    --policy kv-aware \
    --model-path "$MODEL" \
    --prefill-workers 2 --decode-workers 6 \
    --prefill-tp-size 1 --decode-tp-size 1 \
    --prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
    --transfer-backend mooncake \
    --gpu-budget 8 \
    --time-scale 10 \
    --session-sample-rate 1.0 \
    --target-duration-s 100000 \
    --concurrency-limit 32 \
    --timeout-s 900 \
    --request-timeout-s 300 \
    --kvcache-admission-mode worker \
    --kvcache-seed-min-turn-id 1 \
    --kvcache-seed-max-inflight-decode -1 \
    --kvcache-prefill-backup-policy release-after-transfer \
    --kvcache-prefill-priority-eviction

EXP2_DIR=$(ls -td "$OUTPUT"/kvcache-centric-*/ 2>/dev/null | head -1 || true)
save_result "exp2_2p6d_kvc_kvaware" "$EXP2_DIR"

########################################
log ""
log "=== ALL TP1 V3 SWEEP EXPERIMENTS DONE ==="
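The header of `scripts/sweep_tp1_v4_cap16.sh`, which follows, explains the v3 fallback rate with slot arithmetic: a per-decode-worker session cap of 4 across 7 decode workers gives 28 slots for a 52-session trace, while a cap of 16 gives 112. The sketch below reproduces that arithmetic for both the 1P7D and 2P6D layouts. It assumes the cap applies uniformly per decode worker and ignores the second argument of the `min(...)` in `_decode_session_soft_cap`, which the script comments do not show; `session_slots` is a hypothetical helper.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: session slots = per-worker soft cap * decode workers.
session_slots() {
    local cap=$1 decode_workers=$2
    echo $(( cap * decode_workers ))
}

SESSIONS=52  # session count reported in the sweep logs for this trace

for cap in 4 16; do
    for decode_workers in 7 6; do
        slots=$(session_slots "$cap" "$decode_workers")
        if (( slots >= SESSIONS )); then
            verdict="fits"
        else
            verdict="overflows"
        fi
        echo "cap=$cap decode_workers=$decode_workers slots=$slots ($verdict)"
    done
done
```

Under these assumptions, a cap of 4 cannot hold all 52 sessions in either layout, and a cap of 16 can in both.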
108
scripts/sweep_tp1_v4_cap16.sh
Executable file
@@ -0,0 +1,108 @@
#!/bin/bash
# TP1 v4 sweep: KVC with kv-aware policy + soft_cap raised from 4 to 16
# v3 (kv-aware) fixed routing but session-cap fallback still dominated 52-65%
# of requests. Hardcoded min(4, ...) in _decode_session_soft_cap was the
# bottleneck: only 4*7=28 session slots for 52 trace sessions.
# v4 raises the cap to 16 (4*7=28 -> 16*7=112 slots).
set -euo pipefail
cd "$(dirname "$0")/.."

MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
TRACE=outputs/qwen35-swebench-50sess.jsonl
OUTPUT=outputs/qwen3-30b-tp1-v4-cap16
VENV_PYTHON=.venv/bin/python
RESULTS_FILE=$OUTPUT/sweep_results.txt

mkdir -p "$OUTPUT"

log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$RESULTS_FILE"
}

save_result() {
    local label=$1
    local run_dir=$2
    log "=== $label COMPLETED ==="
    if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
        log "Summary:"
        cat "$run_dir/request-metrics.jsonl.summary.json" >> "$RESULTS_FILE"
        echo "" >> "$RESULTS_FILE"
        cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
        cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
        log "Saved to $OUTPUT/${label}_summary.json + ${label}_metrics.jsonl"
    else
        log "WARNING: No summary file found in $run_dir"
    fi
}

log "Starting TP1 v4 sweep (KVC kv-aware, session soft_cap raised 4->16)"
log "Model: $MODEL"
log "Trace: $TRACE (4449 requests, 52 sessions)"
log "Key change: _decode_session_soft_cap now min(16, ...) instead of min(4, ...)"

########################################
# Experiment 1: 1P + 7D KVC kv-aware (cap=16)
########################################
log ""
log "=== [EXP1] 1P7D KVC kv-aware cap=16 ==="
PYTHONPATH=src:third_party/sglang/python \
"$VENV_PYTHON" -m agentic_pd_hybrid.cli benchmark-live \
    --trace "$TRACE" \
    --output-root "$OUTPUT" \
    --mechanism kvcache-centric \
    --policy kv-aware \
    --model-path "$MODEL" \
    --prefill-workers 1 --decode-workers 7 \
    --prefill-tp-size 1 --decode-tp-size 1 \
    --prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
    --transfer-backend mooncake \
    --gpu-budget 8 \
    --time-scale 10 \
    --session-sample-rate 1.0 \
    --target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 1 \
|
||||
--kvcache-seed-max-inflight-decode -1 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction
|
||||
|
||||
EXP1_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
|
||||
save_result "exp1_1p7d_kvc_cap16" "$EXP1_DIR"
|
||||
|
||||
########################################
|
||||
# Experiment 2: 2P + 6D KVC kv-aware (cap=16)
|
||||
########################################
|
||||
log ""
|
||||
log "=== [EXP2] 2P6D KVC kv-aware cap=16 ==="
|
||||
PYTHONPATH=src:third_party/sglang/python \
|
||||
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace $TRACE \
|
||||
--output-root $OUTPUT \
|
||||
--mechanism kvcache-centric \
|
||||
--policy kv-aware \
|
||||
--model-path $MODEL \
|
||||
--prefill-workers 2 --decode-workers 6 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 1 \
|
||||
--kvcache-seed-max-inflight-decode -1 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction
|
||||
|
||||
EXP2_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
|
||||
save_result "exp2_2p6d_kvc_cap16" "$EXP2_DIR"
|
||||
|
||||
log ""
|
||||
log "=== ALL TP1 V4 SWEEP EXPERIMENTS DONE ==="
|
||||
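The slot arithmetic in the header comment of `scripts/sweep_tp1_v4_cap16.sh` (4*7=28 versus 16*7=112 slots for 52 trace sessions) can be checked with a few lines of Python. This is only a sketch: `total_session_slots` is a hypothetical helper, not a function from the repository, and it assumes the per-worker session cap is the only limit in play.

```python
def total_session_slots(per_worker_cap: int, decode_workers: int) -> int:
    """Upper bound on sessions that can be resident across all decode workers."""
    return per_worker_cap * decode_workers


TRACE_SESSIONS = 52  # from the sweep log line: "4449 requests, 52 sessions"
DECODE_WORKERS = 7   # the 1P7D layout used in Experiment 1

for cap in (4, 16):  # v3 hardcoded cap vs. v4 raised cap
    slots = total_session_slots(cap, DECODE_WORKERS)
    verdict = "enough" if slots >= TRACE_SESSIONS else "too few"
    print(f"cap={cap}: {slots} slots for {TRACE_SESSIONS} sessions ({verdict})")
```

For the 2P6D layout the same helper would be called with six decode workers instead of seven.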
89
scripts/sweep_tp1_v5_baseline_rerun_exp2.sh
Executable file
@@ -0,0 +1,89 @@
|
||||
#!/bin/bash
# P0: Re-run v5 baseline EXP2 (2P6D) three times to establish whether
# errors=9 is a stable property of the v5 config or single-run variance.
# Critic of V5_PROFILE_INVESTIGATION_ZH.md flagged that the 415 errors in
# v5+profile EXP2 may have been polling-induced. We need 3 baseline runs
# (no polling, identical config to original v5) to test reproducibility.
#
# Output:
#   outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/
#   ├── exp2_2p6d_run{1,2,3}_summary.json
#   ├── exp2_2p6d_run{1,2,3}_metrics.jsonl
#   └── kvcache-centric-...<ts>/    (one per run)
set -euo pipefail
cd "$(dirname "$0")/.."
|
||||
|
||||
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
|
||||
TRACE=outputs/qwen35-swebench-50sess.jsonl
|
||||
OUTPUT=outputs/qwen3-30b-tp1-v5-optD-baseline-rerun
|
||||
VENV_PYTHON=.venv/bin/python
|
||||
RESULTS_FILE=$OUTPUT/sweep_results.txt
|
||||
|
||||
mkdir -p $OUTPUT
|
||||
|
||||
log() {
|
||||
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
|
||||
}
|
||||
|
||||
run_exp2() {
|
||||
local run_idx=$1
|
||||
local label="exp2_2p6d_run${run_idx}"
|
||||
log ""
|
||||
log "=== [RUN ${run_idx}/3] EXP2 2P6D KVC kv-aware Option D (no polling) ==="
|
||||
PYTHONPATH=src:third_party/sglang/python \
|
||||
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace $TRACE \
|
||||
--output-root $OUTPUT \
|
||||
--mechanism kvcache-centric \
|
||||
--policy kv-aware \
|
||||
--model-path $MODEL \
|
||||
--prefill-workers 2 --decode-workers 6 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 1 \
|
||||
--kvcache-seed-max-inflight-decode -1 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction
|
||||
|
||||
local run_dir=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
|
||||
log "=== [RUN ${run_idx}/3] $label COMPLETED ==="
|
||||
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
|
||||
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
|
||||
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
|
||||
local errs=$($VENV_PYTHON -c "import json; d=json.load(open('$OUTPUT/${label}_summary.json')); print(d.get('error_count',0))")
|
||||
log " errors = $errs (baseline reference = 9)"
|
||||
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
|
||||
echo "" >> $RESULTS_FILE
|
||||
else
|
||||
log "WARNING: no summary file in $run_dir"
|
||||
fi
|
||||
}
|
||||
|
||||
log "=== P0: v5 baseline EXP2 reproducibility test (3 runs, no polling) ==="
|
||||
log "Model: $MODEL"
|
||||
log "Trace: $TRACE (4449 requests, 52 sessions)"
|
||||
log "Goal: confirm whether errors=9 in v5 baseline EXP2 is reproducible"
|
||||
log " (v5+profile saw 415 errors; we need to know if polling was causal)"
|
||||
|
||||
for i in 1 2 3; do
|
||||
run_exp2 $i
|
||||
done
|
||||
|
||||
log ""
|
||||
log "=== P0 SUMMARY: errors per run ==="
|
||||
for i in 1 2 3; do
|
||||
if [ -f "$OUTPUT/exp2_2p6d_run${i}_summary.json" ]; then
|
||||
e=$($VENV_PYTHON -c "import json; d=json.load(open('$OUTPUT/exp2_2p6d_run${i}_summary.json')); print(d.get('error_count',0))")
|
||||
log " run ${i}: errors = $e"
|
||||
fi
|
||||
done
|
||||
log "=== P0 ALL DONE ==="
|
||||
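The per-run error summary at the end of `scripts/sweep_tp1_v5_baseline_rerun_exp2.sh` shells out to a Python one-liner for each run. A minimal standalone sketch of the same lookup follows. It assumes only what the script shows: summary files named `exp2_2p6d_run{i}_summary.json` with an optional `error_count` key. The helper name `collect_error_counts` is hypothetical.

```python
import json
from pathlib import Path


def collect_error_counts(output_dir: Path, runs=(1, 2, 3)) -> dict[int, int]:
    """Return {run_index: error_count} for every run summary that exists."""
    counts: dict[int, int] = {}
    for run_idx in runs:
        summary_path = output_dir / f"exp2_2p6d_run{run_idx}_summary.json"
        if not summary_path.is_file():
            continue  # the shell loop also skips runs with no summary file
        with summary_path.open() as fh:
            # Same default as the script: a missing key counts as 0 errors.
            counts[run_idx] = int(json.load(fh).get("error_count", 0))
    return counts


if __name__ == "__main__":
    out = Path("outputs/qwen3-30b-tp1-v5-optD-baseline-rerun")
    counts = collect_error_counts(out)
    for run_idx, errors in sorted(counts.items()):
        print(f"run {run_idx}: errors = {errors} (baseline reference = 9)")
    if counts:
        print(f"min = {min(counts.values())}, max = {max(counts.values())}")
```

Printing the min and max side by side makes it easier to see whether the three runs cluster near the reference value of 9 or spread toward the 415 errors seen with polling enabled.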
114
scripts/sweep_tp1_v5_optD.sh
Executable file
@@ -0,0 +1,114 @@
|
||||
#!/bin/bash
# TP1 v5 sweep: Option D, D-side admission for seed/reseed.
#
# v4 (cap=16) still saw 35% session-cap fallback because the local soft_cap
# evaluates min(16, usable_capacity_tokens / target_tokens) and target_tokens
# (= input + output) is 50-100K in agentic workloads, giving cap = 1-2.
#
# v5 makes worker admission_mode authoritative for ALL admission decisions
# (direct_append AND seed/reseed). Replay calls D's
# /session_cache/admit_direct_append with mode={direct_append|seed} and
# defers to D's KV pool availability + LRU eviction. Replay's local
# _decode_session_soft_cap is bypassed entirely under worker mode.
set -euo pipefail
cd "$(dirname "$0")/.."
|
||||
|
||||
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
|
||||
TRACE=outputs/qwen35-swebench-50sess.jsonl
|
||||
OUTPUT=outputs/qwen3-30b-tp1-v5-optD
|
||||
VENV_PYTHON=.venv/bin/python
|
||||
RESULTS_FILE=$OUTPUT/sweep_results.txt
|
||||
|
||||
mkdir -p $OUTPUT
|
||||
|
||||
log() {
|
||||
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
|
||||
}
|
||||
|
||||
save_result() {
|
||||
local label=$1
|
||||
local run_dir=$2
|
||||
log "=== $label COMPLETED ==="
|
||||
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
|
||||
log "Summary:"
|
||||
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
|
||||
echo "" >> $RESULTS_FILE
|
||||
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
|
||||
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
|
||||
log "Saved to $OUTPUT/${label}_summary.json + ${label}_metrics.jsonl"
|
||||
else
|
||||
log "WARNING: No summary file found in $run_dir"
|
||||
fi
|
||||
}
|
||||
|
||||
log "Starting TP1 v5 sweep (Option D: D-side seed admission)"
|
||||
log "Model: $MODEL"
|
||||
log "Trace: $TRACE (4449 requests, 52 sessions)"
|
||||
log "Key change: worker admission_mode now drives seed/reseed via D's admit endpoint"
|
||||
|
||||
########################################
|
||||
# Experiment 1: 1P + 7D KVC kv-aware Option D
|
||||
########################################
|
||||
log ""
|
||||
log "=== [EXP1] 1P7D KVC kv-aware Option D ==="
|
||||
PYTHONPATH=src:third_party/sglang/python \
|
||||
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace $TRACE \
|
||||
--output-root $OUTPUT \
|
||||
--mechanism kvcache-centric \
|
||||
--policy kv-aware \
|
||||
--model-path $MODEL \
|
||||
--prefill-workers 1 --decode-workers 7 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 1 \
|
||||
--kvcache-seed-max-inflight-decode -1 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction
|
||||
|
||||
EXP1_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
|
||||
save_result "exp1_1p7d_kvc_optD" "$EXP1_DIR"
|
||||
|
||||
########################################
|
||||
# Experiment 2: 2P + 6D KVC kv-aware Option D
|
||||
########################################
|
||||
log ""
|
||||
log "=== [EXP2] 2P6D KVC kv-aware Option D ==="
|
||||
PYTHONPATH=src:third_party/sglang/python \
|
||||
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace $TRACE \
|
||||
--output-root $OUTPUT \
|
||||
--mechanism kvcache-centric \
|
||||
--policy kv-aware \
|
||||
--model-path $MODEL \
|
||||
--prefill-workers 2 --decode-workers 6 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 1 \
|
||||
--kvcache-seed-max-inflight-decode -1 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction
|
||||
|
||||
EXP2_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
|
||||
save_result "exp2_2p6d_kvc_optD" "$EXP2_DIR"
|
||||
|
||||
log ""
|
||||
log "=== ALL TP1 V5 SWEEP EXPERIMENTS DONE ==="
|
||||
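The header comment of `scripts/sweep_tp1_v5_optD.sh` explains why raising the hard cap to 16 did not help: the cap is `min(16, usable_capacity_tokens / target_tokens)`, and the token term dominates. The sketch below works through that formula. The real `_decode_session_soft_cap` is not shown in this diff, so the function body, the floor division, and the `USABLE_CAPACITY_TOKENS` value are all assumptions chosen only to illustrate the arithmetic.

```python
def decode_session_soft_cap(
    usable_capacity_tokens: int, target_tokens: int, hard_cap: int = 16
) -> int:
    """min(hard_cap, usable_capacity_tokens / target_tokens), floored (assumed)."""
    if target_tokens <= 0:
        raise ValueError("target_tokens must be positive")
    return min(hard_cap, usable_capacity_tokens // target_tokens)


# HYPOTHETICAL pool size; the real per-worker capacity is not given in the diff.
USABLE_CAPACITY_TOKENS = 120_000

# target_tokens = input + output, which the comment puts at 50-100K per request.
for target_tokens in (50_000, 100_000):
    cap = decode_session_soft_cap(USABLE_CAPACITY_TOKENS, target_tokens)
    print(f"target_tokens={target_tokens}: soft cap = {cap}")
```

Under these assumed numbers the hard cap of 16 is never reached, which matches the comment's point that the effective cap stays at 1-2 regardless of the v4 change.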
125
scripts/sweep_tp1_v5_optD_profile.sh
Executable file
@@ -0,0 +1,125 @@
|
||||
#!/bin/bash
# TP1 v5 + profiling: re-run the v5 (Option D) config with the new
# d-pool-timeseries poller enabled, so we can attribute each session-cap
# fallback to actual D KV pool occupancy (held vs available vs idle-evictable
# vs prefill-backup) instead of guessing.
#
# Output:
#   outputs/qwen3-30b-tp1-v5-optD-profile/
#   ├── kvcache-centric-kv-aware-worker-admission-<ts>/
#   │   ├── request-metrics.jsonl
#   │   ├── request-metrics.jsonl.summary.json
#   │   └── d-pool-timeseries.jsonl   ← NEW (1Hz P/D /server_info snapshots)
#   ├── exp1_1p7d_kvc_optD_profile_metrics.jsonl
#   └── exp2_2p6d_kvc_optD_profile_metrics.jsonl
set -euo pipefail
cd "$(dirname "$0")/.."
|
||||
|
||||
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
|
||||
TRACE=outputs/qwen35-swebench-50sess.jsonl
|
||||
OUTPUT=outputs/qwen3-30b-tp1-v5-optD-profile
|
||||
VENV_PYTHON=.venv/bin/python
|
||||
RESULTS_FILE=$OUTPUT/sweep_results.txt
|
||||
POLL_INTERVAL=1.0
|
||||
|
||||
mkdir -p $OUTPUT
|
||||
|
||||
log() {
|
||||
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
|
||||
}
|
||||
|
||||
save_result() {
|
||||
local label=$1
|
||||
local run_dir=$2
|
||||
log "=== $label COMPLETED ==="
|
||||
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
|
||||
log "Summary:"
|
||||
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
|
||||
echo "" >> $RESULTS_FILE
|
||||
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
|
||||
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
|
||||
if [ -f "$run_dir/d-pool-timeseries.jsonl" ]; then
|
||||
cp "$run_dir/d-pool-timeseries.jsonl" "$OUTPUT/${label}_pool_timeseries.jsonl"
|
||||
log "Pool timeseries: $(wc -l < $OUTPUT/${label}_pool_timeseries.jsonl) rows"
|
||||
else
|
||||
log "WARNING: no d-pool-timeseries.jsonl produced"
|
||||
fi
|
||||
log "Saved to $OUTPUT/${label}_summary.json + ${label}_metrics.jsonl + ${label}_pool_timeseries.jsonl"
|
||||
else
|
||||
log "WARNING: No summary file found in $run_dir"
|
||||
fi
|
||||
}
|
||||
|
||||
log "Starting TP1 v5 + profile sweep (Option D + ${POLL_INTERVAL}s pool polling)"
|
||||
log "Model: $MODEL"
|
||||
log "Trace: $TRACE (4449 requests, 52 sessions)"
|
||||
log "Profiling: --pool-poll-interval-s $POLL_INTERVAL (writes d-pool-timeseries.jsonl)"
|
||||
|
||||
########################################
|
||||
# Experiment 1: 1P + 7D KVC kv-aware Option D + profile
|
||||
########################################
|
||||
log ""
|
||||
log "=== [EXP1] 1P7D KVC kv-aware Option D + profile ==="
|
||||
PYTHONPATH=src:third_party/sglang/python \
|
||||
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace $TRACE \
|
||||
--output-root $OUTPUT \
|
||||
--mechanism kvcache-centric \
|
||||
--policy kv-aware \
|
||||
--model-path $MODEL \
|
||||
--prefill-workers 1 --decode-workers 7 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 1 \
|
||||
--kvcache-seed-max-inflight-decode -1 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction \
|
||||
--pool-poll-interval-s $POLL_INTERVAL
|
||||
|
||||
EXP1_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
|
||||
save_result "exp1_1p7d_kvc_optD_profile" "$EXP1_DIR"
|
||||
|
||||
########################################
|
||||
# Experiment 2: 2P + 6D KVC kv-aware Option D + profile
|
||||
########################################
|
||||
log ""
|
||||
log "=== [EXP2] 2P6D KVC kv-aware Option D + profile ==="
|
||||
PYTHONPATH=src:third_party/sglang/python \
|
||||
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace $TRACE \
|
||||
--output-root $OUTPUT \
|
||||
--mechanism kvcache-centric \
|
||||
--policy kv-aware \
|
||||
--model-path $MODEL \
|
||||
--prefill-workers 2 --decode-workers 6 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 1 \
|
||||
--kvcache-seed-max-inflight-decode -1 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction \
|
||||
--pool-poll-interval-s $POLL_INTERVAL
|
||||
|
||||
EXP2_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
|
||||
save_result "exp2_2p6d_kvc_optD_profile" "$EXP2_DIR"
|
||||
|
||||
log ""
|
||||
log "=== ALL TP1 V5+PROFILE EXPERIMENTS DONE ==="
|
||||
129
scripts/sweep_tp1_v6_p1_profile.sh
Executable file
@@ -0,0 +1,129 @@
|
||||
#!/bin/bash
# v6 P1: re-run the v5 (Option D) config with the pool_breakdown instrument
# (commit 4978c0d) so d-pool-timeseries.jsonl carries radix_protected /
# slot_private / running_batch / {transfer,prealloc,retracted}_queue tokens.
#
# This is the same config as scripts/sweep_tp1_v5_optD_profile.sh but writes
# to a separate output dir, leaving the pre-instrument v5+profile run intact
# for before/after comparison.
#
# Output:
#   outputs/qwen3-30b-tp1-v6-p1-profile/
#   ├── kvcache-centric-kv-aware-worker-admission-<ts>/
#   │   ├── request-metrics.jsonl
#   │   ├── request-metrics.jsonl.summary.json
#   │   └── d-pool-timeseries.jsonl   ← now with pool_breakdown fields
#   ├── exp{1,2}_*_metrics.jsonl
#   └── exp{1,2}_*_pool_timeseries.jsonl
set -euo pipefail
cd "$(dirname "$0")/.."
|
||||
|
||||
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
|
||||
TRACE=outputs/qwen35-swebench-50sess.jsonl
|
||||
OUTPUT=outputs/qwen3-30b-tp1-v6-p1-profile
|
||||
VENV_PYTHON=.venv/bin/python
|
||||
RESULTS_FILE=$OUTPUT/sweep_results.txt
|
||||
POLL_INTERVAL=1.0
|
||||
|
||||
mkdir -p $OUTPUT
|
||||
|
||||
log() {
|
||||
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
|
||||
}
|
||||
|
||||
save_result() {
|
||||
local label=$1
|
||||
local run_dir=$2
|
||||
log "=== $label COMPLETED ==="
|
||||
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
|
||||
log "Summary:"
|
||||
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
|
||||
echo "" >> $RESULTS_FILE
|
||||
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
|
||||
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
|
||||
if [ -f "$run_dir/d-pool-timeseries.jsonl" ]; then
|
||||
cp "$run_dir/d-pool-timeseries.jsonl" "$OUTPUT/${label}_pool_timeseries.jsonl"
|
||||
log "Pool timeseries: $(wc -l < $OUTPUT/${label}_pool_timeseries.jsonl) rows"
|
||||
else
|
||||
log "WARNING: no d-pool-timeseries.jsonl produced"
|
||||
fi
|
||||
log "Saved to $OUTPUT/${label}_summary.json + ${label}_metrics.jsonl + ${label}_pool_timeseries.jsonl"
|
||||
else
|
||||
log "WARNING: No summary file found in $run_dir"
|
||||
fi
|
||||
}
|
||||
|
||||
log "Starting v6 P1 sweep (v5 Option D config + ${POLL_INTERVAL}s pool polling + pool_breakdown)"
|
||||
log "Model: $MODEL"
|
||||
log "Trace: $TRACE (4449 requests, 52 sessions)"
|
||||
log "Goal: capture pool_breakdown fields (radix_protected / slot_private / running_batch / queues)"
|
||||
log " to decompose 'other' on the v5 baseline workload"
|
||||
|
||||
########################################
|
||||
# Experiment 1: 1P + 7D KVC kv-aware Option D + profile
|
||||
########################################
|
||||
log ""
|
||||
log "=== [EXP1] 1P7D KVC kv-aware Option D + profile ==="
|
||||
PYTHONPATH=src:third_party/sglang/python \
|
||||
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace $TRACE \
|
||||
--output-root $OUTPUT \
|
||||
--mechanism kvcache-centric \
|
||||
--policy kv-aware \
|
||||
--model-path $MODEL \
|
||||
--prefill-workers 1 --decode-workers 7 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 1 \
|
||||
--kvcache-seed-max-inflight-decode -1 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction \
|
||||
--pool-poll-interval-s $POLL_INTERVAL
|
||||
|
||||
EXP1_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
|
||||
save_result "exp1_1p7d_kvc_v6_p1" "$EXP1_DIR"
|
||||
|
||||
########################################
|
||||
# Experiment 2: 2P + 6D KVC kv-aware Option D + profile
|
||||
########################################
|
||||
log ""
|
||||
log "=== [EXP2] 2P6D KVC kv-aware Option D + profile ==="
|
||||
PYTHONPATH=src:third_party/sglang/python \
|
||||
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace $TRACE \
|
||||
--output-root $OUTPUT \
|
||||
--mechanism kvcache-centric \
|
||||
--policy kv-aware \
|
||||
--model-path $MODEL \
|
||||
--prefill-workers 2 --decode-workers 6 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 1 \
|
||||
--kvcache-seed-max-inflight-decode -1 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction \
|
||||
--pool-poll-interval-s $POLL_INTERVAL
|
||||
|
||||
EXP2_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
|
||||
save_result "exp2_2p6d_kvc_v6_p1" "$EXP2_DIR"
|
||||
|
||||
log ""
|
||||
log "=== ALL v6 P1 EXPERIMENTS DONE ==="
|
||||
@@ -3,13 +3,20 @@ from __future__ import annotations
 import asyncio
 import json
 import signal
+import shutil
+from collections import Counter
 from dataclasses import asdict, dataclass, replace
 from datetime import UTC, datetime
 from pathlib import Path

 from agentic_pd_hybrid.replay import ReplayConfig, replay_trace
-from agentic_pd_hybrid.sampling import SessionSampleConfig, sample_trace_sessions
+from agentic_pd_hybrid.sampling import (
+    SessionSampleConfig,
+    SessionSampleSummary,
+    sample_trace_sessions,
+)
 from agentic_pd_hybrid.stack import ManagedPdStack, launch_pd_stack
+from agentic_pd_hybrid.trace import load_trace
 from agentic_pd_hybrid.topology import SingleNodeTopology
|
||||
|
||||
|
||||
@@ -43,12 +50,18 @@ class BenchmarkConfig:
|
||||
kvcache_prefill_priority_eviction: bool = False
|
||||
kvcache_prefill_direct_priority: int = -100
|
||||
kvcache_prefill_normal_priority: int = 100
|
||||
pool_poll_interval_s: float = 0.0
|
||||
pool_poll_include_sessions: bool = True
|
||||
enable_backpressure: bool = False
|
||||
backpressure_max_pause_s: float = 2.0
|
||||
progress_interval_s: float = 30.0
|
||||
sample_profile: str = "default"
|
||||
min_initial_input_tokens: int | None = None
|
||||
max_initial_input_tokens: int | None = None
|
||||
max_append_input_tokens: int | None = None
|
||||
max_output_tokens: int | None = None
|
||||
min_overlap_ratio: float | None = None
|
||||
use_trace_as_sample: bool = False
|
||||
launch_stack: bool = True
|
||||
|
||||
|
||||
@@ -90,6 +103,21 @@ def run_live_benchmark(config: BenchmarkConfig) -> BenchmarkArtifacts:
|
||||
)
|
||||
|
||||
sampled_trace_path = run_dir / "sampled-trace.jsonl"
|
||||
if config.use_trace_as_sample:
|
||||
shutil.copyfile(config.trace_path, sampled_trace_path)
|
||||
sample_summary = _summarize_trace_sample(
|
||||
input_trace_path=config.trace_path,
|
||||
sampled_trace_path=sampled_trace_path,
|
||||
profile=config.sample_profile,
|
||||
session_sample_rate=config.session_sample_rate,
|
||||
min_turns=config.min_turns,
|
||||
min_initial_input_tokens=config.min_initial_input_tokens,
|
||||
max_initial_input_tokens=config.max_initial_input_tokens,
|
||||
max_append_input_tokens=config.max_append_input_tokens,
|
||||
max_output_tokens=config.max_output_tokens,
|
||||
min_overlap_ratio=config.min_overlap_ratio,
|
||||
)
|
||||
else:
|
||||
sample_summary = sample_trace_sessions(
|
||||
SessionSampleConfig(
|
||||
trace_path=config.trace_path,
|
||||
@@ -119,6 +147,8 @@ def run_live_benchmark(config: BenchmarkConfig) -> BenchmarkArtifacts:
|
||||
try:
|
||||
signal.signal(signal.SIGINT, _handle_termination)
|
||||
signal.signal(signal.SIGTERM, _handle_termination)
|
||||
_mechanisms_with_router = {"pd-disaggregation", "kvcache-centric", "pd-colo"}
|
||||
_naive_dp = config.mechanism_name == "pd-colo"
|
||||
if config.launch_stack:
|
||||
stack = launch_pd_stack(
|
||||
topology=topology,
|
||||
@@ -132,18 +162,19 @@ def run_live_benchmark(config: BenchmarkConfig) -> BenchmarkArtifacts:
 else config.timeout_s
 ),
 include_router=(
-config.mechanism_name in {"pd-disaggregation", "kvcache-centric"}
+config.mechanism_name in _mechanisms_with_router
 ),
+naive_dp=_naive_dp,
 )
 router_url = (
 stack.router_url
-if config.mechanism_name in {"pd-disaggregation", "kvcache-centric"}
+if config.mechanism_name in _mechanisms_with_router
 else None
 )
 else:
 router_url = (
 topology.router_url
-if config.mechanism_name in {"pd-disaggregation", "kvcache-centric"}
+if config.mechanism_name in _mechanisms_with_router
 else None
 )
|
||||
|
||||
@@ -187,6 +218,11 @@ def run_live_benchmark(config: BenchmarkConfig) -> BenchmarkArtifacts:
|
||||
),
|
||||
kvcache_prefill_direct_priority=config.kvcache_prefill_direct_priority,
|
||||
kvcache_prefill_normal_priority=config.kvcache_prefill_normal_priority,
|
||||
pool_poll_interval_s=config.pool_poll_interval_s,
|
||||
pool_poll_include_sessions=config.pool_poll_include_sessions,
|
||||
enable_backpressure=config.enable_backpressure,
|
||||
backpressure_max_pause_s=config.backpressure_max_pause_s,
|
||||
progress_interval_s=config.progress_interval_s,
|
||||
)
|
||||
if config.request_timeout_s is not None:
|
||||
replay_config = replace(
|
||||
@@ -243,12 +279,18 @@ def run_live_benchmark(config: BenchmarkConfig) -> BenchmarkArtifacts:
|
||||
"kvcache_prefill_normal_priority": (
|
||||
config.kvcache_prefill_normal_priority
|
||||
),
|
||||
"pool_poll_interval_s": config.pool_poll_interval_s,
|
||||
"pool_poll_include_sessions": config.pool_poll_include_sessions,
|
||||
"enable_backpressure": config.enable_backpressure,
|
||||
"backpressure_max_pause_s": config.backpressure_max_pause_s,
|
||||
"progress_interval_s": config.progress_interval_s,
|
||||
"sample_profile": config.sample_profile,
|
||||
"min_initial_input_tokens": config.min_initial_input_tokens,
|
||||
"max_initial_input_tokens": config.max_initial_input_tokens,
|
||||
"max_append_input_tokens": config.max_append_input_tokens,
|
||||
"max_output_tokens": config.max_output_tokens,
|
||||
"min_overlap_ratio": config.min_overlap_ratio,
|
||||
"use_trace_as_sample": config.use_trace_as_sample,
|
||||
"sample_summary": asdict(sample_summary),
|
||||
"topology": {
|
||||
"model_path": config.topology.model_path,
|
||||
@@ -295,3 +337,44 @@ def _header_mode_for(policy_name: str) -> str:
|
||||
if policy_name == "kv-aware":
|
||||
return "target-worker"
|
||||
return "none"
|
||||
|
||||
|
||||
def _summarize_trace_sample(
|
||||
*,
|
||||
input_trace_path: Path,
|
||||
sampled_trace_path: Path,
|
||||
profile: str,
|
||||
session_sample_rate: float,
|
||||
min_turns: int,
|
||||
min_initial_input_tokens: int | None,
|
||||
max_initial_input_tokens: int | None,
|
||||
max_append_input_tokens: int | None,
|
||||
max_output_tokens: int | None,
|
||||
min_overlap_ratio: float | None,
|
||||
) -> SessionSampleSummary:
|
||||
requests = load_trace(sampled_trace_path)
|
||||
if not requests:
|
||||
raise ValueError(f"Trace sample is empty: {sampled_trace_path}")
|
||||
session_turns = Counter(request.session_id for request in requests)
|
||||
start_time_s = requests[0].timestamp_s
|
||||
end_time_s = requests[-1].timestamp_s
|
||||
return SessionSampleSummary(
|
||||
input_trace_path=str(input_trace_path),
|
||||
output_trace_path=str(sampled_trace_path),
|
||||
request_count=len(requests),
|
||||
session_count=len(session_turns),
|
||||
multi_turn_session_count=sum(1 for turns in session_turns.values() if turns > 1),
|
||||
start_time_s=start_time_s,
|
||||
end_time_s=end_time_s,
|
||||
sampled_duration_s=end_time_s - start_time_s,
|
||||
session_sample_rate=session_sample_rate,
|
||||
min_turns=min_turns,
|
||||
profile=profile,
|
||||
min_initial_input_tokens=min_initial_input_tokens,
|
||||
max_initial_input_tokens=max_initial_input_tokens,
|
||||
max_append_input_tokens=max_append_input_tokens,
|
||||
max_output_tokens=max_output_tokens,
|
||||
min_overlap_ratio=min_overlap_ratio,
|
||||
mean_append_input_tokens=None,
|
||||
mean_turn_overlap_ratio=None,
|
||||
)
|
||||
|
||||
@@ -2,6 +2,7 @@ from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import asyncio
|
||||
import shlex
|
||||
from pathlib import Path
|
||||
|
||||
from agentic_pd_hybrid.benchmark import BenchmarkConfig, run_live_benchmark
|
||||
@@ -228,6 +229,47 @@ def main() -> None:
|
||||
)
|
||||
replay.add_argument("--kvcache-prefill-direct-priority", type=int, default=-100)
|
||||
replay.add_argument("--kvcache-prefill-normal-priority", type=int, default=100)
|
||||
replay.add_argument(
|
||||
"--pool-poll-interval-s",
|
||||
type=float,
|
||||
default=0.0,
|
||||
help=(
|
||||
"Poll each P/D worker's /server_info every N seconds and write a "
|
||||
"time-series snapshot to <run_dir>/d-pool-timeseries.jsonl. "
|
||||
"0 disables polling."
|
||||
),
|
||||
)
|
||||
replay.add_argument(
|
||||
"--pool-poll-no-sessions",
|
||||
action="store_true",
|
||||
help=(
|
||||
"Disable per-session detail in the pool timeseries (smaller files)."
|
||||
),
|
||||
)
|
||||
replay.add_argument(
|
||||
"--enable-backpressure",
|
||||
action="store_true",
|
||||
help=(
|
||||
"Honor recommended_pause_ms hints from D's admission endpoint. "
|
||||
"When set, replay sleeps before issuing requests to a saturated D. "
|
||||
"Default off — preserves baseline behavior."
|
||||
),
|
||||
)
|
||||
replay.add_argument(
|
||||
"--backpressure-max-pause-s",
|
||||
type=float,
|
||||
default=2.0,
|
||||
help="Cap on per-request backpressure sleep, regardless of D hint.",
|
||||
)
|
||||
replay.add_argument(
|
||||
"--progress-interval-s",
|
||||
type=float,
|
||||
default=30.0,
|
||||
help=(
|
||||
"Write client-side replay progress to <output_dir>/replay-progress.jsonl "
|
||||
"every N seconds. 0 disables the heartbeat."
|
||||
),
|
||||
)
|
||||
|
||||
sample = subparsers.add_parser(
|
||||
"sample-sessions",
|
||||
@@ -439,6 +481,45 @@ def main() -> None:
|
||||
)
|
||||
benchmark.add_argument("--kvcache-prefill-direct-priority", type=int, default=-100)
|
||||
benchmark.add_argument("--kvcache-prefill-normal-priority", type=int, default=100)
|
||||
benchmark.add_argument(
|
||||
"--pool-poll-interval-s",
|
||||
type=float,
|
||||
default=0.0,
|
||||
help=(
|
||||
"Poll each P/D worker's /server_info every N seconds and write a "
|
||||
"time-series snapshot to <run_dir>/d-pool-timeseries.jsonl. "
|
||||
"0 disables polling."
|
||||
),
|
||||
)
|
||||
benchmark.add_argument(
|
||||
"--pool-poll-no-sessions",
|
||||
action="store_true",
|
||||
help=(
|
||||
"Disable per-session detail in the pool timeseries (smaller files)."
|
||||
),
|
||||
)
|
||||
benchmark.add_argument(
|
||||
"--enable-backpressure",
|
||||
action="store_true",
|
||||
help=(
|
||||
"Honor recommended_pause_ms hints from D's admission endpoint."
|
||||
),
|
||||
)
|
||||
benchmark.add_argument(
|
||||
"--backpressure-max-pause-s",
|
||||
type=float,
|
||||
default=2.0,
|
||||
help="Cap on per-request backpressure sleep, regardless of D hint.",
|
||||
)
|
||||
benchmark.add_argument(
|
||||
"--progress-interval-s",
|
||||
type=float,
|
||||
default=30.0,
|
||||
help=(
|
||||
"Write client-side replay progress to <run_dir>/replay-progress.jsonl "
|
||||
"every N seconds. 0 disables the heartbeat."
|
||||
),
|
||||
)
|
||||
benchmark.add_argument(
|
||||
"--sample-profile",
|
||||
choices=["default", "small-append"],
|
||||
@@ -450,16 +531,31 @@ def main() -> None:
|
||||
benchmark.add_argument("--max-append-input-tokens", type=int, default=None)
|
||||
benchmark.add_argument("--max-output-tokens", type=int, default=None)
|
||||
benchmark.add_argument("--min-overlap-ratio", type=float, default=None)
|
||||
benchmark.add_argument(
|
||||
"--use-trace-as-sample",
|
||||
action="store_true",
|
||||
help=(
|
||||
"Replay the provided --trace exactly instead of sampling sessions into "
|
||||
"a new trace. Use this for prebuilt real-workload samples."
|
||||
),
|
||||
)
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
if args.command == "print-launch":
|
||||
topology = _topology_from_args(args)
|
||||
has_pd = bool(topology.prefill_workers and topology.decode_workers)
|
||||
has_direct_only = bool(
|
||||
topology.direct_workers
|
||||
and not topology.prefill_workers
|
||||
and not topology.decode_workers
|
||||
)
|
||||
plan = build_launch_plan(
|
||||
topology,
|
||||
prefill_policy=args.prefill_policy,
|
||||
decode_policy=args.decode_policy,
|
||||
include_router=bool(topology.prefill_workers and topology.decode_workers),
|
||||
include_router=has_pd or has_direct_only,
|
||||
naive_dp=has_direct_only,
|
||||
)
|
||||
print(plan.render())
|
||||
return
|
||||
@@ -513,6 +609,11 @@ def main() -> None:
|
||||
),
|
||||
kvcache_prefill_direct_priority=args.kvcache_prefill_direct_priority,
|
||||
kvcache_prefill_normal_priority=args.kvcache_prefill_normal_priority,
|
||||
pool_poll_interval_s=args.pool_poll_interval_s,
|
||||
pool_poll_include_sessions=not args.pool_poll_no_sessions,
|
||||
enable_backpressure=args.enable_backpressure,
|
||||
backpressure_max_pause_s=args.backpressure_max_pause_s,
|
||||
progress_interval_s=args.progress_interval_s,
|
||||
)
|
||||
results = asyncio.run(replay_trace(config))
|
||||
print(
|
||||
@@ -655,12 +756,18 @@ def main() -> None:
|
||||
kvcache_prefill_normal_priority=(
|
||||
args.kvcache_prefill_normal_priority
|
||||
),
|
||||
pool_poll_interval_s=args.pool_poll_interval_s,
|
||||
pool_poll_include_sessions=not args.pool_poll_no_sessions,
|
||||
enable_backpressure=args.enable_backpressure,
|
||||
backpressure_max_pause_s=args.backpressure_max_pause_s,
|
||||
progress_interval_s=args.progress_interval_s,
|
||||
sample_profile=args.sample_profile,
|
||||
min_initial_input_tokens=args.min_initial_input_tokens,
|
||||
max_initial_input_tokens=args.max_initial_input_tokens,
|
||||
max_append_input_tokens=args.max_append_input_tokens,
|
||||
max_output_tokens=args.max_output_tokens,
|
||||
min_overlap_ratio=args.min_overlap_ratio,
|
||||
use_trace_as_sample=args.use_trace_as_sample,
|
||||
launch_stack=True,
|
||||
)
|
||||
)
|
||||
@@ -720,6 +827,26 @@ def _add_topology_arguments(parser: argparse.ArgumentParser) -> None:
|
||||
"--no-trust-remote-code",
|
||||
action="store_true",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--extra-server-args",
|
||||
default="",
|
||||
help="Extra arguments appended to every sglang.launch_server command.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--prefill-extra-server-args",
|
||||
default="",
|
||||
help="Extra arguments appended only to prefill launch_server commands.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--decode-extra-server-args",
|
||||
default="",
|
||||
help="Extra arguments appended only to decode launch_server commands.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--direct-extra-server-args",
|
||||
default="",
|
||||
help="Extra arguments appended only to direct launch_server commands.",
|
||||
)
|
||||
|
||||
|
||||
def _topology_from_args(args: argparse.Namespace):
|
||||
@@ -749,7 +876,13 @@ def _topology_from_args(args: argparse.Namespace):
|
||||
force_rdma=args.force_rdma,
|
||||
trust_remote_code=not args.no_trust_remote_code,
|
||||
ib_device=args.ib_device,
|
||||
direct_extra_server_args=("--enable-streaming-session",),
|
||||
extra_server_args=tuple(shlex.split(args.extra_server_args)),
|
||||
prefill_extra_server_args=tuple(shlex.split(args.prefill_extra_server_args)),
|
||||
decode_extra_server_args=tuple(shlex.split(args.decode_extra_server_args)),
|
||||
direct_extra_server_args=(
|
||||
"--enable-streaming-session",
|
||||
*tuple(shlex.split(args.direct_extra_server_args)),
|
||||
),
|
||||
)
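The `*-extra-server-args` options above arrive as a single shell-style string and are tokenized with `shlex.split` before being attached to the topology. The following is a minimal sketch of that assembly for direct workers. The helper name `build_direct_extra_args` and the flags in the example string are placeholders for illustration only and are not part of the diff.

```python
import shlex


def build_direct_extra_args(direct_extra_server_args: str) -> tuple[str, ...]:
    """Hypothetical helper mirroring the direct-worker tuple built in the diff.

    The fixed flag comes first, then the user-supplied tokens in order.
    """
    return ("--enable-streaming-session", *shlex.split(direct_extra_server_args))


# The default empty string adds nothing beyond the fixed flag.
print(build_direct_extra_args(""))

# Quoted values survive as a single token (placeholder flags, not real options).
print(build_direct_extra_args('--some-flag 0.8 --name "two words"'))
```

Because `shlex.split` follows POSIX shell quoting, a value containing spaces has to be quoted inside the option string, and the whole option string usually needs its own outer quotes on the command line.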
|
||||
|
||||
|
||||
|
||||
@@ -34,7 +34,24 @@ def build_launch_plan(
|
||||
decode_policy: str = "manual",
|
||||
include_router: bool = True,
|
||||
router_request_timeout_s: float | None = None,
|
||||
naive_dp: bool = False,
|
||||
) -> LaunchPlan:
|
||||
router_command: tuple[str, ...] | None = None
|
||||
if include_router:
|
||||
if topology.prefill_workers and topology.decode_workers:
|
||||
router_command = _build_router_command(
|
||||
topology,
|
||||
prefill_policy=prefill_policy,
|
||||
decode_policy=decode_policy,
|
||||
request_timeout_s=router_request_timeout_s,
|
||||
)
|
||||
elif naive_dp and topology.direct_workers:
|
||||
router_command = _build_dp_router_command(
|
||||
topology,
|
||||
backend_policy=decode_policy,
|
||||
request_timeout_s=router_request_timeout_s,
|
||||
)
|
||||
|
||||
return LaunchPlan(
|
||||
prefill_commands=tuple(
|
||||
_build_server_command(topology, worker) for worker in topology.prefill_workers
|
||||
@@ -43,24 +60,17 @@ def build_launch_plan(
|
||||
_build_server_command(topology, worker) for worker in topology.decode_workers
|
||||
),
|
||||
direct_commands=tuple(
|
||||
_build_server_command(topology, worker) for worker in topology.direct_workers
|
||||
),
|
||||
router_command=(
|
||||
_build_router_command(
|
||||
topology,
|
||||
prefill_policy=prefill_policy,
|
||||
decode_policy=decode_policy,
|
||||
request_timeout_s=router_request_timeout_s,
|
||||
)
|
||||
if include_router and topology.prefill_workers and topology.decode_workers
|
||||
else None
|
||||
_build_server_command(topology, worker, naive_dp=naive_dp)
|
||||
for worker in topology.direct_workers
|
||||
),
|
||||
router_command=router_command,
|
||||
)
|
||||
|
||||
|
||||
def _build_server_command(
|
||||
topology: SingleNodeTopology,
|
||||
worker: WorkerSpec,
|
||||
naive_dp: bool = False,
|
||||
) -> tuple[str, ...]:
|
||||
command = [
|
||||
sys.executable,
|
||||
@@ -76,11 +86,15 @@ def _build_server_command(
|
||||
str(worker.port),
|
||||
"--base-gpu-id",
|
||||
str(worker.gpu_id),
|
||||
]
|
||||
# Naive DP direct workers: no disaggregation flags at all
|
||||
if not (naive_dp and worker.role == "direct"):
|
||||
command.extend([
|
||||
"--disaggregation-mode",
|
||||
_disaggregation_mode_for(worker),
|
||||
"--disaggregation-transfer-backend",
|
||||
topology.transfer_backend,
|
||||
]
|
||||
])
|
||||
if worker.tp_size > 1:
|
||||
command.extend(["--tp-size", str(worker.tp_size)])
|
||||
if topology.trust_remote_code:
|
||||
@@ -135,6 +149,32 @@ def _build_router_command(
|
||||
return tuple(command)
|
||||
|
||||
|
||||
def _build_dp_router_command(
|
||||
topology: SingleNodeTopology,
|
||||
*,
|
||||
backend_policy: str,
|
||||
request_timeout_s: float | None,
|
||||
) -> tuple[str, ...]:
|
||||
command: list[str] = [
|
||||
sys.executable,
|
||||
"-B",
|
||||
"-u",
|
||||
"-m",
|
||||
"agentic_pd_hybrid.pd_router",
|
||||
"--host",
|
||||
topology.router_host,
|
||||
"--port",
|
||||
str(topology.router_port),
|
||||
"--backend-policy",
|
||||
backend_policy,
|
||||
]
|
||||
if request_timeout_s is not None:
|
||||
command.extend(["--request-timeout-s", str(request_timeout_s)])
|
||||
for worker in topology.direct_workers:
|
||||
command.extend(["--backend", worker.url])
|
||||
return tuple(command)
|
||||
|
||||
|
||||
def _render_named_command(name: str, command: tuple[str, ...]) -> str:
|
||||
return f"# {name}\n" + " ".join(shlex.quote(part) for part in command)
|
||||
|
||||
|
||||
@@ -43,6 +43,9 @@ class RequestMetrics:
|
||||
ttft_s: float | None
|
||||
tpot_s: float | None
|
||||
error: str | None = None
|
||||
actual_output_tokens: int | None = None
|
||||
requested_output_tokens: int | None = None
|
||||
finish_reason: str | None = None
|
||||
|
||||
@classmethod
|
||||
def from_decision(
|
||||
@@ -63,6 +66,9 @@ class RequestMetrics:
|
||||
prefill_request_priority: int | None = None,
|
||||
decode_request_priority: int | None = None,
|
||||
error: str | None = None,
|
||||
actual_output_tokens: int | None = None,
|
||||
requested_output_tokens: int | None = None,
|
||||
finish_reason: str | None = None,
|
||||
) -> "RequestMetrics":
|
||||
return cls(
|
||||
request_id=request.request_id,
|
||||
@@ -95,6 +101,9 @@ class RequestMetrics:
|
||||
ttft_s=ttft_s,
|
||||
tpot_s=tpot_s,
|
||||
error=error,
|
||||
actual_output_tokens=actual_output_tokens,
|
||||
requested_output_tokens=requested_output_tokens,
|
||||
finish_reason=finish_reason,
|
||||
)
|
||||
|
||||
|
||||
@@ -158,6 +167,17 @@ def write_summary_json(
|
||||
str(key): value for key, value in sorted(decode_priorities.items())
|
||||
},
|
||||
"error_count": sum(1 for row in rows if row.error is not None),
|
||||
"truncated_request_count": sum(
|
||||
1
|
||||
for row in rows
|
||||
if row.actual_output_tokens is not None
|
||||
and row.requested_output_tokens is not None
|
||||
and row.requested_output_tokens > 1
|
||||
and row.actual_output_tokens < row.requested_output_tokens * 0.5
|
||||
),
|
||||
"actual_output_tokens_stats": _stats(
|
||||
[float(row.actual_output_tokens) for row in rows if row.actual_output_tokens is not None]
|
||||
),
|
||||
}
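The `truncated_request_count` entry above counts a request as truncated when both token counts are known, more than one output token was requested, and fewer than half of the requested tokens came back. A small sketch of that predicate follows. The `Row` dataclass is a stand-in that carries only the two fields the check reads, and the sample values are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Row:
    """Stand-in for RequestMetrics with only the fields the predicate uses."""

    actual_output_tokens: Optional[int]
    requested_output_tokens: Optional[int]


def is_truncated(row: Row) -> bool:
    # Same four conditions as the generator expression in the summary dict.
    return (
        row.actual_output_tokens is not None
        and row.requested_output_tokens is not None
        and row.requested_output_tokens > 1
        and row.actual_output_tokens < row.requested_output_tokens * 0.5
    )


rows = [
    Row(100, 256),   # below half of 256: truncated
    Row(128, 256),   # exactly half: not truncated (strict comparison)
    Row(0, 1),       # single-token request: excluded by the "> 1" guard
    Row(None, 256),  # unknown actual count: excluded
]
truncated_request_count = sum(1 for row in rows if is_truncated(row))
print(truncated_request_count)
```

The comparison is strict, so a response with exactly half the requested tokens is not counted, and rows that failed before any token count was recorded do not show up here; those are covered by `error_count` instead.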
|
||||
path.parent.mkdir(parents=True, exist_ok=True)
|
||||
with path.open("w", encoding="utf-8") as handle:
|
||||
|
||||
@@ -74,8 +74,58 @@ class RouterState:
|
||||
return idx
|
||||
|
||||
|
||||
@dataclass
|
||||
class DpRouterConfig:
|
||||
host: str
|
||||
port: int
|
||||
backend_urls: list[str]
|
||||
backend_policy: str = "round_robin"
|
||||
request_timeout_s: float = 1800.0
|
||||
|
||||
|
||||
class DpRouterState:
|
||||
"""DP (data-parallel) router: forward each request to exactly one backend."""
|
||||
|
||||
def __init__(self, config: DpRouterConfig):
|
||||
if not config.backend_urls:
|
||||
raise ValueError("At least one backend worker is required")
|
||||
self.config = config
|
||||
self.cursor = 0
|
||||
self.sticky_map: dict[str, int] = {}
|
||||
|
||||
def select_backend(self, headers: dict[str, str]) -> str:
|
||||
idx = self._select_index(headers)
|
||||
return self.config.backend_urls[idx]
|
||||
|
||||
def _select_index(self, headers: dict[str, str]) -> int:
|
||||
target_worker = headers.get("x-smg-target-worker")
|
||||
routing_key = headers.get("x-smg-routing-key")
|
||||
|
||||
if (
|
||||
self.config.backend_policy == "consistent_hashing"
|
||||
and target_worker is not None
|
||||
):
|
||||
idx = int(target_worker)
|
||||
if 0 <= idx < len(self.config.backend_urls):
|
||||
return idx
|
||||
|
||||
if self.config.backend_policy == "manual" and routing_key:
|
||||
cached = self.sticky_map.get(routing_key)
|
||||
if cached is not None:
|
||||
return cached
|
||||
idx = self.cursor % len(self.config.backend_urls)
|
||||
self.cursor += 1
|
||||
self.sticky_map[routing_key] = idx
|
||||
return idx
|
||||
|
||||
idx = self.cursor % len(self.config.backend_urls)
|
||||
self.cursor += 1
|
||||
return idx
|
||||
|
||||
|
||||
app = FastAPI()
|
||||
router_state: RouterState | None = None
|
||||
dp_state: DpRouterState | None = None
|
||||
|
||||
|
||||
@app.get("/health")
|
||||
@@ -85,6 +135,16 @@ async def health() -> Response:
|
||||
|
||||
@app.get("/health_generate")
|
||||
async def health_generate() -> Response:
|
||||
if dp_state is not None:
|
||||
async with aiohttp.ClientSession() as session:
|
||||
tasks = [
|
||||
session.get(f"{url}/health_generate")
|
||||
for url in dp_state.config.backend_urls
|
||||
]
|
||||
for response in asyncio.as_completed(tasks):
|
||||
async with await response:
|
||||
pass
|
||||
return Response(status_code=200)
|
||||
state = _require_state()
|
||||
async with aiohttp.ClientSession() as session:
|
||||
tasks = []
|
||||
@@ -101,6 +161,11 @@ async def health_generate() -> Response:
|
||||
|
||||
@app.get("/v1/models")
|
||||
async def models() -> ORJSONResponse:
|
||||
if dp_state is not None:
|
||||
async with aiohttp.ClientSession() as session:
|
||||
async with session.get(f"{dp_state.config.backend_urls[0]}/v1/models") as resp:
|
||||
payload = await resp.json()
|
||||
return ORJSONResponse(payload, status_code=resp.status)
|
||||
state = _require_state()
|
||||
async with aiohttp.ClientSession() as session:
|
||||
async with session.get(f"{state.config.prefill_urls[0][0]}/v1/models") as response:
|
||||
@@ -147,6 +212,15 @@ async def _forward_to_backend(
|
||||
headers: dict[str, str],
|
||||
endpoint_name: str,
|
||||
) -> Response:
|
||||
# DP mode: forward to a single backend
|
||||
if dp_state is not None:
|
||||
return await _forward_to_dp_backend(
|
||||
request_data=request_data,
|
||||
headers=headers,
|
||||
endpoint_name=endpoint_name,
|
||||
)
|
||||
|
||||
# PD mode: coordinate prefill + decode
|
||||
state = _require_state()
|
||||
prefill_server, bootstrap_port, decode_server = state.select_pair(headers)
|
||||
prefill_request, decode_request = _build_backend_requests(
|
||||
@@ -186,6 +260,63 @@ async def _forward_to_backend(
|
||||
)
|
||||
|
||||
|
||||
async def _forward_to_dp_backend(
|
||||
*,
|
||||
request_data: dict,
|
||||
headers: dict[str, str],
|
||||
endpoint_name: str,
|
||||
) -> Response:
|
||||
assert dp_state is not None
|
||||
backend_server = dp_state.select_backend(headers)
|
||||
cleaned = _strip_internal_fields(request_data)
|
||||
timeout_s = dp_state.config.request_timeout_s
|
||||
|
||||
if request_data.get("stream", False):
|
||||
return StreamingResponse(
|
||||
_stream_dp_generate(
|
||||
request_data=cleaned,
|
||||
backend_server=backend_server,
|
||||
endpoint_name=endpoint_name,
|
||||
timeout_s=timeout_s,
|
||||
),
|
||||
media_type="text/event-stream",
|
||||
)
|
||||
|
||||
async with aiohttp.ClientSession(
|
||||
timeout=aiohttp.ClientTimeout(total=timeout_s)
|
||||
) as session:
|
||||
async with session.post(
|
||||
f"{backend_server}/{endpoint_name}", json=cleaned
|
||||
) as response:
|
||||
body = await response.read()
|
||||
return Response(
|
||||
content=body,
|
||||
status_code=response.status,
|
||||
media_type=response.content_type,
|
||||
)
|
||||
|
||||
|
||||
async def _stream_dp_generate(
|
||||
*,
|
||||
request_data: dict,
|
||||
backend_server: str,
|
||||
endpoint_name: str,
|
||||
timeout_s: float,
|
||||
) -> AsyncIterator[bytes]:
|
||||
async with aiohttp.ClientSession(
|
||||
timeout=aiohttp.ClientTimeout(total=timeout_s)
|
||||
) as session:
|
||||
async with session.post(
|
||||
f"{backend_server}/{endpoint_name}", json=request_data
|
||||
) as response:
|
||||
if response.status != HTTPStatus.OK:
|
||||
payload = await response.read()
|
||||
yield payload
|
||||
return
|
||||
async for chunk in response.content.iter_chunked(_STREAM_CHUNK_SIZE):
|
||||
yield chunk
|
||||
|
||||
|
||||
async def _stream_generate(
|
||||
*,
|
||||
prefill_request: dict,
|
||||
@@ -241,6 +372,12 @@ def _build_backend_requests(
|
||||
prefill_request.update(bootstrap_payload)
|
||||
decode_request.update(bootstrap_payload)
|
||||
|
||||
# session_params is only meaningful for the decode worker (streaming session
|
||||
# KV reuse). Sending it to the prefill worker causes the D side to
|
||||
# short-circuit with local-prefill on already-open sessions, returning
|
||||
# truncated responses while P's KV transfer gets aborted.
|
||||
prefill_request.pop("session_params", None)
|
||||
|
||||
if prefill_priority is not None:
|
||||
prefill_request["priority"] = int(prefill_priority)
|
||||
if decode_priority is not None:
|
||||
@@ -262,7 +399,7 @@ def _require_state() -> RouterState:
|
||||
|
||||
|
||||
def main() -> None:
|
||||
parser = argparse.ArgumentParser(description="Minimal local PD router")
|
||||
parser = argparse.ArgumentParser(description="Minimal local PD / DP router")
|
||||
parser.add_argument("--host", default="127.0.0.1")
|
||||
parser.add_argument("--port", type=int, default=8000)
|
||||
parser.add_argument(
|
||||
@@ -270,19 +407,44 @@ def main() -> None:
|
||||
nargs=2,
|
||||
metavar=("URL", "BOOTSTRAP_PORT"),
|
||||
action="append",
|
||||
required=True,
|
||||
default=None,
|
||||
)
|
||||
parser.add_argument(
|
||||
"--decode",
|
||||
action="append",
|
||||
required=True,
|
||||
default=None,
|
||||
)
|
||||
parser.add_argument("--prefill-policy", default="round_robin")
|
||||
parser.add_argument("--decode-policy", default="manual")
|
||||
parser.add_argument(
|
||||
"--backend",
|
||||
action="append",
|
||||
default=None,
|
||||
help="Backend URL for DP (data-parallel) mode. Repeat for each worker.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--backend-policy",
|
||||
default="round_robin",
|
||||
help="Routing policy for DP mode: round_robin, manual, consistent_hashing.",
|
||||
)
|
||||
parser.add_argument("--request-timeout-s", type=float, default=1800.0)
|
||||
args = parser.parse_args()
|
||||
|
||||
global router_state
|
||||
global router_state, dp_state
|
||||
|
||||
if args.backend:
|
||||
# DP mode: simple forward to one of N backends
|
||||
dp_state = DpRouterState(
|
||||
DpRouterConfig(
|
||||
host=args.host,
|
||||
port=args.port,
|
||||
backend_urls=list(args.backend),
|
||||
backend_policy=args.backend_policy,
|
||||
request_timeout_s=args.request_timeout_s,
|
||||
)
|
||||
)
|
||||
elif args.prefill and args.decode:
|
||||
# PD mode: prefill/decode coordination
|
||||
router_state = RouterState(
|
||||
RouterConfig(
|
||||
host=args.host,
|
||||
@@ -294,6 +456,9 @@ def main() -> None:
|
||||
request_timeout_s=args.request_timeout_s,
|
||||
)
|
||||
)
|
||||
else:
|
||||
parser.error("Either --backend (DP mode) or both --prefill and --decode (PD mode) are required")
|
||||
|
||||
uvicorn.run(app, host=args.host, port=args.port, log_level="info")
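For readers who want to see how the `manual` backend policy behaves, the sketch below isolates the sticky round-robin path of `DpRouterState._select_index` from this file. The class name `StickyRoundRobin` is hypothetical, and the `consistent_hashing` branch (which trusts the `x-smg-target-worker` header) is left out on purpose.

```python
from typing import Optional


class StickyRoundRobin:
    """Hypothetical, simplified model of the 'manual' DP routing path."""

    def __init__(self, backend_count: int) -> None:
        if backend_count <= 0:
            raise ValueError("At least one backend worker is required")
        self.backend_count = backend_count
        self.cursor = 0
        self.sticky_map: dict[str, int] = {}

    def select(self, routing_key: Optional[str]) -> int:
        # A known routing key always returns to the backend it was first given.
        if routing_key:
            cached = self.sticky_map.get(routing_key)
            if cached is not None:
                return cached
        # Otherwise take the next backend in round-robin order.
        idx = self.cursor % self.backend_count
        self.cursor += 1
        if routing_key:
            self.sticky_map[routing_key] = idx
        return idx


router = StickyRoundRobin(backend_count=3)
picks = [router.select(key) for key in ["s1", "s2", "s1", None, "s3"]]
print(picks)  # [0, 1, 0, 2, 0]
```

In this model a repeated key does not advance the cursor, so only first-seen keys and keyless requests consume round-robin slots. As written in the diff, the mapping is never pruned and does not consider backend load, which is worth keeping in mind for long traces.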
|
||||
|
||||
|
||||
|
||||
File diff suppressed because it is too large
@@ -66,6 +66,7 @@ def launch_pd_stack(
|
||||
timeout_s: float = 1200.0,
|
||||
router_request_timeout_s: float | None = None,
|
||||
include_router: bool = True,
|
||||
naive_dp: bool = False,
|
||||
) -> ManagedPdStack:
|
||||
run_dir.mkdir(parents=True, exist_ok=True)
|
||||
logs_dir = run_dir / "logs"
|
||||
@@ -77,6 +78,7 @@ def launch_pd_stack(
|
||||
decode_policy=decode_policy,
|
||||
include_router=include_router,
|
||||
router_request_timeout_s=router_request_timeout_s,
|
||||
naive_dp=naive_dp,
|
||||
)
|
||||
|
||||
prefill_processes = [
|
||||
@@ -195,6 +197,9 @@ def _build_process_env(topology: SingleNodeTopology) -> dict[str, str]:
|
||||
env["MC_MS_AUTO_DISC"] = "0"
|
||||
if topology.ib_device:
|
||||
env["MOONCAKE_DEVICE"] = topology.ib_device
|
||||
elif topology.transfer_backend == "mooncake":
|
||||
# Default to TCP when RDMA is not forced (e.g. loopback on same node)
|
||||
env.setdefault("MOONCAKE_PROTOCOL", "tcp")
|
||||
|
||||
repo_root = Path(__file__).resolve().parents[2]
|
||||
python_paths = [
|
||||
|
||||
@@ -189,10 +189,11 @@ class MooncakeTransferEngine:
|
||||
device_name if device_name is not None else "",
|
||||
)
|
||||
else:
|
||||
protocol = os.environ.get("MOONCAKE_PROTOCOL", "rdma")
|
||||
ret_value = self.engine.initialize(
|
||||
hostname,
|
||||
"P2PHANDSHAKE",
|
||||
"rdma",
|
||||
protocol,
|
||||
device_name if device_name is not None else "",
|
||||
)
|
||||
if ret_value != 0:
|
||||
|
||||
@@ -1602,6 +1602,9 @@ class DirectAppendAdmissionReqInput(BaseReq):
|
||||
session_id: str
|
||||
uncached_input_tokens: int
|
||||
output_tokens: int
|
||||
# "direct_append": existing behavior — require session resident on this D
|
||||
# "seed": new admission for session not yet resident; do capacity check + LRU eviction
|
||||
mode: str = "direct_append"
|
||||
|
||||
|
||||
@dataclass
|
||||
@@ -1619,6 +1622,9 @@ class DirectAppendAdmissionReqOutput(BaseReq):
|
||||
decode_prealloc_queue_reqs: int = 0
|
||||
decode_transfer_queue_reqs: int = 0
|
||||
decode_retracted_queue_reqs: int = 0
|
||||
# Backpressure hint: if > 0, the caller should pause this many ms before
|
||||
# sending more requests to this D. Computed from transfer-queue depth.
|
||||
recommended_pause_ms: int = 0
|
||||
|
||||
|
||||
@dataclass
|
||||
|
||||
@@ -3181,6 +3181,89 @@ class Scheduler(
|
||||
success = False
|
||||
return success
|
||||
|
||||
def _compute_pool_breakdown_for_diagnostics(self) -> dict:
|
||||
"""Read-only KV pool decomposition for the agentic-pd-hybrid profiler.
|
||||
|
||||
Decomposes capacity into:
|
||||
- radix_evictable_tokens / radix_protected_tokens: tree-managed
|
||||
- slot_private_held_tokens: SessionAwareCache out-of-tree slot holds
|
||||
- running_batch_kv_tokens: kv_allocated_len of currently-decoding reqs
|
||||
(overlaps with radix_protected; not additive)
|
||||
- {transfer,prealloc,retracted}_queue_{reqs,tokens}: disagg queues
|
||||
- available_tokens: free pool
|
||||
|
||||
Caller computes "unaccounted = capacity - sum_of_known" to find leakage.
|
||||
Implementation is best-effort; components that cannot be read are omitted from the result.
|
||||
"""
|
||||
breakdown: dict = {
|
||||
"capacity_tokens": int(self.max_total_num_tokens or 0),
|
||||
"available_tokens": int(self.token_to_kv_pool_allocator.available_size()),
|
||||
}
|
||||
|
||||
# Radix tree (works for SessionAwareCache and most inner caches)
|
||||
try:
|
||||
ev = self.tree_cache.evictable_size()
|
||||
pr = self.tree_cache.protected_size()
|
||||
if isinstance(ev, tuple):
|
||||
ev = ev[0]
|
||||
if isinstance(pr, tuple):
|
||||
pr = pr[0]
|
||||
breakdown["radix_evictable_tokens"] = int(ev or 0)
|
||||
breakdown["radix_protected_tokens"] = int(pr or 0)
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# SessionAwareCache slot-private holds (already in session_cache.held_tokens
|
||||
# but mirrored here for one-stop decomposition)
|
||||
try:
|
||||
from sglang.srt.mem_cache.session_aware_cache import SessionAwareCache
|
||||
if isinstance(self.tree_cache, SessionAwareCache):
|
||||
breakdown["slot_private_held_tokens"] = int(
|
||||
self.tree_cache.session_held_tokens()
|
||||
)
|
||||
breakdown["session_slot_count"] = int(
|
||||
self.tree_cache.session_held_req_count()
|
||||
)
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# Running batch KV (overlaps with radix_protected for tree-tracked reqs)
|
||||
try:
|
||||
running_reqs = self.running_batch.reqs
|
||||
breakdown["running_batch_reqs"] = len(running_reqs)
|
||||
breakdown["running_batch_kv_tokens"] = sum(
|
||||
int(getattr(req, "kv_allocated_len", 0) or 0)
|
||||
for req in running_reqs
|
||||
)
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# Disagg decode queues
|
||||
if self.disaggregation_mode == DisaggregationMode.DECODE:
|
||||
try:
|
||||
tq = self.disagg_decode_transfer_queue.queue
|
||||
pq = self.disagg_decode_prealloc_queue.queue
|
||||
rq = self.disagg_decode_prealloc_queue.retracted_queue
|
||||
breakdown["transfer_queue_reqs"] = len(tq)
|
||||
breakdown["transfer_queue_tokens"] = sum(
|
||||
int(getattr(getattr(dr, "req", None), "kv_allocated_len", 0) or 0)
|
||||
for dr in tq
|
||||
)
|
||||
breakdown["prealloc_queue_reqs"] = len(pq)
|
||||
breakdown["prealloc_queue_tokens"] = sum(
|
||||
int(getattr(getattr(dr, "req", None), "kv_allocated_len", 0) or 0)
|
||||
for dr in pq
|
||||
)
|
||||
breakdown["retracted_queue_reqs"] = len(rq)
|
||||
breakdown["retracted_queue_tokens"] = sum(
|
||||
int(getattr(req, "kv_allocated_len", 0) or 0)
|
||||
for req in rq
|
||||
)
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
return breakdown
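The docstring of `_compute_pool_breakdown_for_diagnostics` leaves the "unaccounted" calculation to the caller. The following is one possible sketch of that caller-side step. Which keys are treated as additive is an assumption here: `running_batch_kv_tokens` is excluded because the docstring says it overlaps with `radix_protected_tokens`, while the queue token counts are assumed not to overlap with the tree-managed totals. The function name and the numbers in the example are invented for illustration.

```python
# Keys assumed to be non-overlapping; adjust if the queues turn out to
# double-count tokens that the radix tree already tracks.
ADDITIVE_KEYS = (
    "available_tokens",
    "radix_evictable_tokens",
    "radix_protected_tokens",
    "slot_private_held_tokens",
    "transfer_queue_tokens",
    "prealloc_queue_tokens",
    "retracted_queue_tokens",
)


def unaccounted_tokens(breakdown: dict) -> int:
    """Capacity minus the sum of known components; omitted keys count as 0."""
    known = sum(int(breakdown.get(key, 0)) for key in ADDITIVE_KEYS)
    return int(breakdown.get("capacity_tokens", 0)) - known


example = {
    "capacity_tokens": 100_000,
    "available_tokens": 10_000,
    "radix_evictable_tokens": 20_000,
    "radix_protected_tokens": 30_000,
    "slot_private_held_tokens": 25_000,
    "transfer_queue_tokens": 5_000,
    "running_batch_kv_tokens": 28_000,  # ignored: overlaps with radix_protected
}
print(unaccounted_tokens(example))  # 10000
```

Since the breakdown is best-effort and drops any component that raised, a large unaccounted value may reflect a missing key rather than a leak, so it helps to log which keys were present alongside the result.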
|
||||
|
||||
def get_internal_state(self, recv_req: GetInternalStateReq):
|
||||
ret = vars(get_global_server_args())
|
||||
ret["last_gen_throughput"] = self.last_gen_throughput
|
||||
@@ -3196,6 +3279,7 @@ class Scheduler(
|
||||
ret["session_cache"] = (
|
||||
self.session_controller.get_streaming_session_cache_status()
|
||||
)
|
||||
ret["pool_breakdown"] = self._compute_pool_breakdown_for_diagnostics()
|
||||
|
||||
if not self.spec_algorithm.is_none() and self.spec_total_num_forward_ct > 0:
|
||||
ret["avg_spec_accept_length"] = (
|
||||
@@ -3508,6 +3592,9 @@ class Scheduler(
|
||||
reason="unsupported",
|
||||
)
|
||||
|
||||
mode = getattr(recv_req, "mode", "direct_append") or "direct_append"
|
||||
is_seed = mode == "seed"
|
||||
|
||||
session_cache_status = self.session_controller.get_streaming_session_cache_status(
|
||||
recv_req.session_id
|
||||
)
|
||||
@@ -3515,27 +3602,28 @@ class Scheduler(
|
||||
resident = bool(
|
||||
isinstance(target_session, dict) and target_session.get("resident")
|
||||
)
|
||||
if not resident:
|
||||
if not resident and not is_seed:
|
||||
# direct_append requires the session already resident on this D.
|
||||
# For seed we skip this check and let capacity decide.
|
||||
transfer_queue_depth = len(self.disagg_decode_transfer_queue.queue)
|
||||
retracted_queue_depth = len(self.disagg_decode_prealloc_queue.retracted_queue)
|
||||
available_size = int(self.token_to_kv_pool_allocator.available_size())
|
||||
token_usage = 1.0 - available_size / max(1, self.max_total_num_tokens)
|
||||
return DirectAppendAdmissionReqOutput(
|
||||
can_admit=False,
|
||||
resident=False,
|
||||
reason="session-not-resident",
|
||||
available_tokens_before=int(
|
||||
self.token_to_kv_pool_allocator.available_size()
|
||||
),
|
||||
available_tokens_after=int(
|
||||
self.token_to_kv_pool_allocator.available_size()
|
||||
),
|
||||
token_usage=(
|
||||
1.0
|
||||
- self.token_to_kv_pool_allocator.available_size()
|
||||
/ max(1, self.max_total_num_tokens)
|
||||
),
|
||||
available_tokens_before=available_size,
|
||||
available_tokens_after=available_size,
|
||||
token_usage=token_usage,
|
||||
num_running_reqs=len(self.running_batch.reqs),
|
||||
decode_prealloc_queue_reqs=len(self.disagg_decode_prealloc_queue.queue),
|
||||
decode_transfer_queue_reqs=len(self.disagg_decode_transfer_queue.queue),
|
||||
decode_retracted_queue_reqs=len(
|
||||
self.disagg_decode_prealloc_queue.retracted_queue
|
||||
decode_transfer_queue_reqs=transfer_queue_depth,
|
||||
decode_retracted_queue_reqs=retracted_queue_depth,
|
||||
recommended_pause_ms=self._compute_backpressure_pause_hint(
|
||||
transfer_queue_depth=transfer_queue_depth,
|
||||
retracted_queue_depth=retracted_queue_depth,
|
||||
token_usage_after=token_usage,
|
||||
),
|
||||
)
|
||||
|
||||
@@ -3543,10 +3631,13 @@ class Scheduler(
|
||||
0, recv_req.output_tokens
|
||||
)
|
||||
available_tokens_before = int(self.token_to_kv_pool_allocator.available_size())
|
||||
# Don't evict the session itself when it's already resident; for seed
|
||||
# of a fresh session there is nothing to exclude.
|
||||
exclude_ids = {recv_req.session_id} if resident else set()
|
||||
trim_result = self.maybe_trim_decode_session_cache(
|
||||
required_tokens=required_tokens,
|
||||
force=available_tokens_before < required_tokens,
|
||||
exclude_session_ids={recv_req.session_id},
|
||||
exclude_session_ids=exclude_ids,
|
||||
)
|
||||
available_tokens_after = int(self.token_to_kv_pool_allocator.available_size())
|
||||
decode_retracted_queue_reqs = len(self.disagg_decode_prealloc_queue.retracted_queue)
|
||||
@@ -3556,6 +3647,7 @@ class Scheduler(
|
||||
)
|
||||
reason = None if can_admit else "no-space"
|
||||
|
||||
transfer_queue_depth = len(self.disagg_decode_transfer_queue.queue)
|
||||
return DirectAppendAdmissionReqOutput(
|
||||
can_admit=can_admit,
|
||||
resident=True,
|
||||
@@ -3570,10 +3662,36 @@ class Scheduler(
|
||||
),
|
||||
num_running_reqs=len(self.running_batch.reqs),
|
||||
decode_prealloc_queue_reqs=len(self.disagg_decode_prealloc_queue.queue),
|
||||
decode_transfer_queue_reqs=len(self.disagg_decode_transfer_queue.queue),
|
||||
decode_transfer_queue_reqs=transfer_queue_depth,
|
||||
decode_retracted_queue_reqs=decode_retracted_queue_reqs,
|
||||
recommended_pause_ms=self._compute_backpressure_pause_hint(
|
||||
transfer_queue_depth=transfer_queue_depth,
|
||||
retracted_queue_depth=decode_retracted_queue_reqs,
|
||||
token_usage_after=(
|
||||
1.0 - available_tokens_after / max(1, self.max_total_num_tokens)
|
||||
),
|
||||
),
|
||||
)
|
||||
|
||||
def _compute_backpressure_pause_hint(
|
||||
self,
|
||||
*,
|
||||
transfer_queue_depth: int,
|
||||
retracted_queue_depth: int,
|
||||
token_usage_after: float,
|
||||
) -> int:
|
||||
# If D is already retracting requests, pause aggressively.
|
||||
if retracted_queue_depth > 0:
|
||||
return 1500
|
||||
# KV pool above 90%: pause proportional to overshoot.
|
||||
if token_usage_after >= 0.90:
|
||||
overshoot = int((token_usage_after - 0.90) * 10000)
|
||||
return max(200, min(2000, overshoot * 5))
|
||||
# Transfer queue heavy: pause linearly with depth.
|
||||
if transfer_queue_depth >= 8:
|
||||
return min(2000, transfer_queue_depth * 100)
|
||||
return 0
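To make the thresholds in `_compute_backpressure_pause_hint` easier to check, here is a standalone copy of the same arithmetic together with a sketch of how a client might clamp the hint. The function `client_pause_s` is hypothetical: the diff shows a `--backpressure-max-pause-s` cap with a default of 2.0, but the replay-side code that applies it is not visible here, so the clamp below is an assumption.

```python
def compute_pause_hint_ms(
    *,
    transfer_queue_depth: int,
    retracted_queue_depth: int,
    token_usage_after: float,
) -> int:
    """Standalone copy of the scheduler's hint logic, checked in priority order."""
    if retracted_queue_depth > 0:
        return 1500
    if token_usage_after >= 0.90:
        overshoot = int((token_usage_after - 0.90) * 10000)
        return max(200, min(2000, overshoot * 5))
    if transfer_queue_depth >= 8:
        return min(2000, transfer_queue_depth * 100)
    return 0


def client_pause_s(recommended_pause_ms: int, max_pause_s: float = 2.0) -> float:
    """Hypothetical client-side clamp: convert ms to seconds and apply the cap."""
    return min(max(0, recommended_pause_ms) / 1000.0, max_pause_s)


for usage in (0.50, 0.90, 1.00):
    hint = compute_pause_hint_ms(
        transfer_queue_depth=0, retracted_queue_depth=0, token_usage_after=usage
    )
    print(usage, hint, client_pause_s(hint))
```

Working through the constants: the usage-based branch stays at its 200 ms floor until usage reaches about 0.904 and saturates at 2000 ms from about 0.94, so the proportional range is narrow. The retraction check also runs first, which means a D that is retracting always reports 1500 ms even when its pool is completely full.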
|
||||
|
||||
def maybe_sleep_on_idle(self):
|
||||
if self.idle_sleeper is not None:
|
||||
self.idle_sleeper.maybe_sleep()
|
||||
|
||||