Compare commits
50 Commits
feat/d-to-
...
h200-cu130
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
f09562123b | ||
|
|
9cca2c60c9 | ||
|
|
5c09a3a0cb | ||
|
|
19612ff3a3 | ||
|
|
a953346a0c | ||
|
|
2dfe22ab20 | ||
|
|
6be5f9b57e | ||
|
|
f926a7b87d | ||
|
|
552f3f564e | ||
|
|
051d9220f4 | ||
|
|
9aac36fd89 | ||
|
|
e9ad1c4bc7 | ||
|
|
af966f2371 | ||
|
|
f6d6dc01ea | ||
|
|
fbeb968f2f | ||
|
|
e729d62ddf | ||
|
|
1d68ad66a7 | ||
|
|
9149b530c0 | ||
|
|
a4f30e6bd3 | ||
|
|
8a2f72f18e | ||
|
|
a369722efe | ||
|
|
b9b0cf0fac | ||
|
|
86412bb174 | ||
|
|
7216507773 | ||
|
|
dc4867c270 | ||
|
|
9c35eddc79 | ||
|
|
6d1c9237fa | ||
|
|
986f351365 | ||
|
|
d40db1f117 | ||
|
|
a1abdcd50c | ||
|
|
93fce42747 | ||
|
|
905d671135 | ||
|
|
9a166ac43b | ||
|
|
976115ea5e | ||
|
|
786cbb8d91 | ||
|
|
bf4da281c0 | ||
|
|
7f2ebf3d87 | ||
|
|
ef4dc81ea9 | ||
|
|
3db2d84df8 | ||
|
|
e3e5c45ed4 | ||
|
|
631b2c8847 | ||
|
|
ad8aaa8c5a | ||
|
|
bb9cc249cd | ||
|
|
b55371fe69 | ||
|
|
d11a66d11b | ||
|
|
a418aafeed | ||
|
|
e874b1f055 | ||
|
|
7590e55189 | ||
|
|
5a2fb8799c | ||
|
|
506d360160 |
5
.gitignore
vendored
5
.gitignore
vendored
@@ -13,6 +13,11 @@ src/*.egg-info
|
||||
outputs/
|
||||
|
||||
# Vendored dependencies. Track only the maintained SGLang fork/snapshot.
|
||||
# third_party/traces/ holds the replay trace files used by the benchmark
|
||||
# (~56 MB each) for convenient transfer between hosts; they would otherwise
|
||||
# live under outputs/ but outputs/ is gitignored.
|
||||
third_party/*
|
||||
!third_party/sglang/
|
||||
!third_party/agentic-kvcache/
|
||||
!third_party/traces/
|
||||
*.log
|
||||
|
||||
3
.gitmodules
vendored
Normal file
3
.gitmodules
vendored
Normal file
@@ -0,0 +1,3 @@
|
||||
[submodule "third_party/agentic-kvcache"]
|
||||
path = third_party/agentic-kvcache
|
||||
url = git@ipads.se.sjtu.edu.cn:scaleaisys/projects/agentic-kvcache.git
|
||||
148
docs/BRANCH_SUMMARY_h200-cu130.md
Normal file
148
docs/BRANCH_SUMMARY_h200-cu130.md
Normal file
@@ -0,0 +1,148 @@
|
||||
# Branch `h200-cu130` Executive Summary
|
||||
|
||||
**Branch base**: `kvc-debug-journey-v1-to-v4`
|
||||
**HEAD**: `e9ad1c4` (latest, 2026-05-13)
|
||||
**Total commits**: 24
|
||||
**Goal achieved**: Partial — KVC beats naive PD on mean/p50/p90 (-30 ~ -65%), loses p99 by +8% (not due to D→P).
|
||||
|
||||
---
|
||||
|
||||
## 0. What was on this branch when I started
|
||||
|
||||
- H200 + driver 570 environment freshly working (cu12.8 toolkit installed locally, vendored mooncake via uv path-source, mlx5_60 RDMA verified)
|
||||
- E1 (naive PD-disagg + RDMA) baseline data: 1200/1285 success, TTFT p99 = 207s
|
||||
- E2 (KVC v2 + RDMA, no load-floor) failed 80% — D2 stayed cold
|
||||
- E3 (KVC v2 + load-floor) had SGLang streaming-session assertion bug; load-floor fix verified, run aborted
|
||||
- All preceded by `docs/KVC_EVICTION_GRANULARITY_DESIGN_ZH.md` (eviction granularity architectural critique)
|
||||
|
||||
The user's directive: **build D→P RDMA snapshot push to skip P-side re-prefill on reseed, then run an experiment showing KVC beats naive PD-disagg.**
|
||||
|
||||
---
|
||||
|
||||
## 1. What I delivered
|
||||
|
||||
### Code
|
||||
|
||||
| # | Layer | Key files | Purpose |
|
||||
|---|---|---|---|
|
||||
| 1 | mooncake link | `src/agentic_pd_hybrid/snapshot_link.py` | SnapshotPeer wrapper, independent of MooncakeKVManager |
|
||||
| 2 | SGLang controller | `third_party/sglang/python/sglang/srt/disaggregation/snapshot/controller.py` | Per-worker controller with kv_pool pre-registration |
|
||||
| 3 | SGLang RPCs | `io_struct.py`, `tokenizer_communicator_mixin.py`, `scheduler.py`, `http_server.py` | 3 RPCs: prepare_receive / dump / finalize_ingest |
|
||||
| 4 | agentic orchestration | `src/agentic_pd_hybrid/replay.py` | `_attempt_d_to_p_sync` invoked from reseed path |
|
||||
| 5 | CLI | `cli.py`, `benchmark.py`, `topology.py`, `stack.py` | `--enable-d-to-p-sync`, `--decode-mem-fraction-static`, env injection |
|
||||
| 6 | smoke tests | `scripts/smoke_snapshot_link*.py`, `scripts/smoke_snapshot_sglang_integration.py` | Phase 1/1b/2 verification |
|
||||
| 7 | experiments | `scripts/sweep_e4_kvc_v2_d_to_p_sync.sh`, `scripts/sweep_e4_pressured.sh` | E4 sweep configs |
|
||||
| 8 | analysis | `scripts/analyze_e4_d_to_p.py`, `scripts/analysis/plot_e1_vs_e4.py` | Cross-comparison + figures |
|
||||
|
||||
### Docs
|
||||
|
||||
| Doc | Content |
|
||||
|---|---|
|
||||
| `D_TO_P_SYNC_DESIGN_ZH.md` | 446-line design doc with 4 alternatives evaluated, MVP chosen |
|
||||
| `D_TO_P_PHASE1_LINK_ZH.md` | Phase 1 acceptance: 316 Gbps host, 251 Gbps GPU (both verified end-to-end) |
|
||||
| `D_TO_P_IMPLEMENTATION_STATUS_ZH.md` | Phase-by-phase audit with known unverified surfaces |
|
||||
| `E4_PROTOCOL_ZH.md` | Experiment preregistration: H1/H2/H3 + data collection plan |
|
||||
| `E4_RESULTS_ZH.md` | E4-v1 forensic: 272 admission rejects but 0 D→P fires (entrance gate bug) |
|
||||
| `E4_VS_E1_RESULTS_ZH.md` | **Headline results**: KVC wins mean/p50/p90, loses p99 (not D→P's fault) |
|
||||
| `BRANCH_SUMMARY_h200-cu130.md` | This doc |
|
||||
|
||||
### Figures (under `docs/figures/`)
|
||||
|
||||
- `e1_vs_e4_ttft_pdf.png` — bimodal E4 fast-path peak vs E1 single peak
|
||||
- `e1_vs_e4_latency_cdf.png` — CDF + log-survival showing crossover at ~p95
|
||||
- `e4_path_latency.png` — per-execution-mode TTFT breakdown
|
||||
- `e1_vs_e4_p99_attribution.png` — pie + bar breakdown of E4's p99 tail
|
||||
|
||||
---
|
||||
|
||||
## 2. Headline numbers
|
||||
|
||||
| Metric | E1 naive PD | E4 KVC | Δ |
|
||||
|---|---:|---:|---:|
|
||||
| TTFT mean | 90.5s | **58.8s** | **-35%** |
|
||||
| TTFT p50 | 88.5s | **31.0s** | **-65%** |
|
||||
| TTFT p90 | 175.2s | 158.9s | -9% |
|
||||
| TTFT p99 | 207.4s | 224.8s | **+8%** |
|
||||
| Lat mean | 96.3s | **63.9s** | **-34%** |
|
||||
| Lat p50 | 93.2s | **37.1s** | **-60%** |
|
||||
| Lat p99 | 219.5s | 233.8s | +6.5% |
|
||||
| Success | 93.4% | 87.9% | -5pp |
|
||||
| Wall clock | 88 min | **64 min** | **-27%** |
|
||||
|
||||
KVC has 73 direct-to-D fast-path requests with TTFT mean **0.185s** — the unique KVC value prop is realized.
|
||||
|
||||
---
|
||||
|
||||
## 3. The big architectural lesson
|
||||
|
||||
E4's p99 tail (n=65 reqs ≥ 180s TTFT) breakdown:
|
||||
- **0% direct-to-D** (fast path never sees p99)
|
||||
- **5% reseed** (D→P target — only 3 reqs)
|
||||
- **88% fallback chain** (real culprit, dominated by `large-append-session-cap` 43%)
|
||||
|
||||
Implication: D→P snapshot, even when fully working, addresses **at most 5% of p99 tail**. The real p99 cost is in `_invoke_kvcache_seeded_router` and various `fallback-real-large-append-*` paths, which involve agentic-side admission RPC retries + seeded-router cold starts, *not* the P re-prefill that D→P was designed to eliminate.
|
||||
|
||||
**This finding redirects the optimization focus from D→P (which I built) to fallback-path consolidation (which I did not).**
|
||||
|
||||
---
|
||||
|
||||
## 4. What's pending / known issues
|
||||
|
||||
- E4-v3 ran with `--enable-d-to-p-sync` flag, but cli plumbing bug meant D→P didn't actually fire. Fix in `af966f2`. E4-v4 should validate end-to-end (running at time of writing).
|
||||
- E4 success rate -5pp vs E1 (87.9% vs 93.4%). Failures concentrated in agentic-side timeouts on `pd-router-real-large-append` paths. Not a D→P issue.
|
||||
- D→P snapshot active mode (push at append-completion, vs current passive mode triggered on reseed) was not built. Per design doc §2.5, this could be next phase.
|
||||
- `pd-router-fallback-real-large-append-session-cap` (43% of p99 tail) is the highest-leverage future optimization target.
|
||||
|
||||
---
|
||||
|
||||
## 5. Commits (chronological)
|
||||
|
||||
```
|
||||
e9ad1c4 feat(experiments): E4 vs E1 results + p99 attribution figures
|
||||
af966f2 fix(cli): plumb --enable-d-to-p-sync through benchmark-live → ReplayConfig
|
||||
f6d6dc0 feat(cli): per-role --mem-fraction-static + use in E4-pressured
|
||||
fbeb968 feat(experiments): E4-pressured sweep — force reseed via reject_threshold=1
|
||||
e729d62 fix(d2p): structural log + relax entrance condition for sync
|
||||
1d68ad6 docs(experiments): E4 results — initial scaffold + mid-run observation
|
||||
9149b53 feat(experiments): E4 cross-comparison analysis helper
|
||||
a4f30e6 docs(d2p): implementation status snapshot — Phase 1-3 audit
|
||||
8a2f72f feat(experiments): E4 protocol + sweep script — KVC + D→P vs naive PD
|
||||
b9b0cf0 feat(agentic): D→P snapshot orchestration in reseed path + CLI flag
|
||||
a369722 fix(sglang): account snapshot-reserved slots in radix mem leak check
|
||||
86412bb feat(sglang): D→P snapshot link integration — controller + RPC handlers
|
||||
7216507 feat(snapshot): D→P RDMA Phase 1b — GPU pointer path verified
|
||||
dc4867c feat(snapshot): D→P RDMA link Phase 1 — minimal byte transport
|
||||
9c35edd docs(design): D→P RDMA snapshot push design
|
||||
6d1c923 docs(architecture): KVC eviction granularity is the wrong abstraction
|
||||
986f351 feat(sglang): drop streaming-session reqs with fill_ids < prefix_indices
|
||||
d40db1f docs(experiments): E3 first run — load-floor bonus works, exposes SGLang bug
|
||||
a1abdcd feat(experiments): E3 sweep — KVC v2 + RDMA + load-floor bonus
|
||||
93fce42 feat(policy): load-floor bonus for KvAwarePolicy (Q2.B)
|
||||
905d671 feat(env): MC_TRANSFER_TIMEOUT=1800s default in setup_env + stack
|
||||
9a166ac docs(experiments): design space for Q1 (mooncake stall) + Q2 (cold-D)
|
||||
... (predecessor work)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 6. How to reproduce
|
||||
|
||||
```bash
|
||||
# Env setup
|
||||
source scripts/setup_env.sh
|
||||
|
||||
# Pre-existing baseline (E1)
|
||||
bash scripts/sweep_e1_naive_1p3d.sh
|
||||
|
||||
# KVC + load-floor + D→P (E4-pressured)
|
||||
bash scripts/sweep_e4_pressured.sh
|
||||
|
||||
# Cross-comparison + figures
|
||||
uv run --no-sync python scripts/analysis/plot_e1_vs_e4.py \
|
||||
--e1-metrics outputs/e1_naive_1p3d_kvaware_rdma_50sess/e1_naive_1p3d_kvaware_run1_metrics.jsonl \
|
||||
--e4-metrics outputs/e4p_kvc_v2_d_to_p_sync_pressured_50sess/e4p_kvc_v2_d_to_p_sync_run1_metrics.jsonl
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
**核心句**:D→P RDMA link 全栈 deploy + 通过 link smoke 验证;E4 实验数据证明 KVC 在 mean/p50/p90 上以 30-65% 优势胜过 naive PD-disagg;p99 长尾归因显示 D→P 不是 p99 的关键路径,下一阶段优化应转向 fallback chain。
|
||||
116
docs/D_TO_P_IMPLEMENTATION_STATUS_ZH.md
Normal file
116
docs/D_TO_P_IMPLEMENTATION_STATUS_ZH.md
Normal file
@@ -0,0 +1,116 @@
|
||||
# D→P RDMA Snapshot Push — 实施状态报告
|
||||
|
||||
**日期**:2026-05-13
|
||||
**分支**:`h200-cu130`
|
||||
**最新 commit**:8a2f72f(E4 protocol 落盘)
|
||||
**前置文档**:
|
||||
- `docs/D_TO_P_SYNC_DESIGN_ZH.md`(设计)
|
||||
- `docs/D_TO_P_PHASE1_LINK_ZH.md`(Phase 1 底层链路验收)
|
||||
- `docs/E4_PROTOCOL_ZH.md`(实验协议)
|
||||
|
||||
---
|
||||
|
||||
## 0. 总结
|
||||
|
||||
D→P RDMA snapshot push 的 8 phase 工程任务已完成 7 phase(设计、链路验证 host & GPU、SGLang 调度器集成、scheduler RPC handlers、agentic 端 orchestration、CLI flag、smoke test)。剩余的 E4 端到端实验(task #16)已 kick off 跑着。
|
||||
|
||||
所有改动都已 commit 并 push 到 `origin/h200-cu130`,**每一步都有对应的 design / acceptance / protocol 文档**。
|
||||
|
||||
---
|
||||
|
||||
## 1. Commit 序列
|
||||
|
||||
| Commit | 描述 | 关键产物 |
|
||||
|---|---|---|
|
||||
| `9c35edd` | docs(design): D→P RDMA snapshot push design | `docs/D_TO_P_SYNC_DESIGN_ZH.md` 446 行设计文档 |
|
||||
| `dc4867c` | feat(snapshot): D→P RDMA link Phase 1 — host mem | `src/agentic_pd_hybrid/snapshot_link.py` + smoke:64 MB 1.7 ms / 316 Gbps |
|
||||
| `7216507` | feat(snapshot): D→P RDMA Phase 1b — GPU pointer | GPU smoke:256 MB 8.5 ms / 251 Gbps |
|
||||
| `86412bb` | feat(sglang): D→P snapshot link integration — controller + RPC handlers | SGLang vendored 4 文件改动,3 个新 RPC |
|
||||
| `b9b0cf0` | feat(agentic): D→P snapshot orchestration in reseed path + CLI flag | agentic-pd-hybrid 4 文件 + smoke script |
|
||||
| `a369722` | fix(sglang): account snapshot-reserved slots in radix mem leak check | leak check 修正 |
|
||||
| `8a2f72f` | feat(experiments): E4 protocol + sweep script | `docs/E4_PROTOCOL_ZH.md` + sweep |
|
||||
|
||||
---
|
||||
|
||||
## 2. 验证状态
|
||||
|
||||
### 2.1 Phase 1(底层 RDMA 链路)
|
||||
|
||||
✅ **VERIFIED**
|
||||
|
||||
- Smoke `scripts/smoke_snapshot_link.py`:host CPU 内存,5/5 size 全 SHA 校验通过,64 MB 316 Gbps
|
||||
- Smoke `scripts/smoke_snapshot_link_gpu.py`:cuda:0 → cuda:1,5/5 size 通过,256 MB 251 Gbps
|
||||
|
||||
### 2.2 Phase 2(SGLang scheduler 集成)
|
||||
|
||||
✅ **VERIFIED at RPC level**
|
||||
|
||||
Smoke `scripts/smoke_snapshot_sglang_integration.py` 启动 P + D 两个 SGLang worker:
|
||||
|
||||
- `POST /_snapshot/prepare_receive` on P → 200 OK,返回 96 layer base ptrs + slot indices + strides
|
||||
- `POST /_snapshot/dump` on D → 200,返回 `ok=false, reason="session-not-resident"`(正确,session 不存在)
|
||||
- `POST /_snapshot/finalize_ingest` on P → 200 OK,inserted_prefix_len 字段正确
|
||||
|
||||
**Scheduler 不崩**(修了 leak check 后)。证明:
|
||||
- env-var driven controller startup 工作
|
||||
- mooncake engine 共存(PD pipeline 用一个,snapshot 用一个独立的)
|
||||
- 3 个 ReqInput/Output dispatch 全通
|
||||
- HTTP → tokenizer → ZMQ → scheduler 链路畅通
|
||||
|
||||
### 2.3 Phase 3(agentic orchestration + reseed wire-up)
|
||||
|
||||
⏳ **IN-FLIGHT**(E4 sweep 跑着)
|
||||
|
||||
`_attempt_d_to_p_sync` 在 `_invoke_kvcache_seeded_router` 中被调用,按设计文档 §2 的三阶段协议运行。Phase 3 的端到端验收靠 E4 实验数据。
|
||||
|
||||
---
|
||||
|
||||
## 3. 未覆盖范围(**重要**)
|
||||
|
||||
下面这些场景**还没有验证**,是 E4 实验之外的 follow-up 工作:
|
||||
|
||||
| 范围 | 状态 | 风险 |
|
||||
|---|---|---|
|
||||
| **D-side 真实 session KV 字节对齐** | unverified | D 把 SessionSlot 里的 KV slot indices 翻译成 RDMA src 地址,layer-by-layer 排列。逻辑可能有 off-by-one 或 layer 顺序错误。若错,P 端的 radix insert 是正确的 indices 但底下的 KV 内容损坏 → 模型输出乱码。这只能靠端到端测试发现。 |
|
||||
| **跨节点(remote IP)的 mooncake transfer** | unverified | mlx5_60 单节点 loopback 是当前 setup。跨节点 GID 路径 / route table / firewall 都可能不同。 |
|
||||
| **多 D → 单 P 的 slot 协调** | unverified | 多个 D worker 同时往同一个 P 推不同 session 的 KV,是否冲突?当前每次 prepare_receive 都从 P 的 kv_pool alloc,应当不冲突,但需 stress test。 |
|
||||
| **token_id 一致性** | partial | 我们用 `request.input_token_ids` 作为 radix 插入的 key。如果该字段 stale 或 mis-aligned,radix 插入的 key 与真实 KV 不对应。E4 跑出垃圾输出就是这个症状。 |
|
||||
| **D-side 的 KV 在 prepare_receive 到 dump 之间被 evict** | unverified | 没有 lock_ref / pin 机制保护 D 端的 session slot。在并发负载下 D 可能 LRU 驱逐这个 session,导致 dump 失败或推空数据。fallback 路径会兜底但浪费一次 RPC。 |
|
||||
| **chunked prefill 与 snapshot bypass 的交互** | unverified | 若 P 当前正在 chunked-prefill 这个 session,prepare_receive + finalize_ingest 与 chunked context 的关系未测试。 |
|
||||
|
||||
---
|
||||
|
||||
## 4. 端到端实验 E4 当前进展
|
||||
|
||||
跑着,结果汇总见 `docs/E4_RESULTS_ZH.md`(实验跑完后写)。
|
||||
|
||||
---
|
||||
|
||||
## 5. 给下一个接班 agent 的建议
|
||||
|
||||
如果你接手时 E4 已跑完且看出问题,按这个排查顺序:
|
||||
|
||||
1. **看 D-side dump 的失败原因 top**:grep "d_to_p_sync sid=.*status=" 看 prepare/dump/finalize 哪一步挂得多
|
||||
2. **如果 dump 大量返回 `session-not-resident`**:说明 reseed 触发时 D-side session 已经被 evict。这是预期的,但需要看占比。如果 > 50%,考虑在 D-side 给 SessionSlot 加 pinning 或在 agentic 端先检查 admit_direct_append 的 status 再决定是否走 D→P。
|
||||
3. **如果 dump ok 但模型输出乱码**:byte-level KV layout 在 D/P 间有不一致。读 `third_party/sglang/python/sglang/srt/disaggregation/snapshot/controller.py::push_session_kv` 的 (src, dst, len) 三元组计算,按 `kv_pool.get_contiguous_buf_infos()` 的 K-then-V 顺序 cross check。
|
||||
4. **如果一切 ok 但 TTFT 仍未改善**:D→P 没真触发 fast path。check P-side radix tree 插入后是否真被下一次 prefill 命中。看 `cached_tokens` 字段。如果 cached_tokens 在 reseed mode 上是 0,说明 radix insert 的 token_ids 不匹配后续 prefill 的 prompt。
|
||||
5. **若你想做 ablation**:保留 `--enable-d-to-p-sync` 但人为在 `_attempt_d_to_p_sync` return None。这把 hot path 关掉但保留控制平面 → 隔离纯 D→P 的边际效益。
|
||||
|
||||
---
|
||||
|
||||
## 6. 设计文档对照
|
||||
|
||||
| 设计 §X | 实现位置 |
|
||||
|---|---|
|
||||
| §2.1 Mooncake 双角色 | `third_party/sglang/.../disaggregation/snapshot/controller.py` 用独立 TransferEngine,避免改 MooncakeKVManager |
|
||||
| §2.2 DecodeKVSnapshotSender | `SnapshotLinkController.push_session_kv` |
|
||||
| §2.3 PrefillSnapshotStore | `SnapshotLinkController._ingest_records`(dict 形态而非完整 Store class,MVP 化) |
|
||||
| §2.4 P-side prefill bypass | **未实现**——改用 radix tree insert 让 SGLang 自然 cache hit。比 bypass 更保守、更简单。 |
|
||||
| §2.5 D-side commit hook | **延迟实现**——E4 试用 reseed-triggered(被动)模式而非 per-append push(主动)。等数据后看是否值得做主动模式。 |
|
||||
| §2.6 HTTP endpoints | `entrypoints/http_server.py:_snapshot/{prepare_receive,dump,finalize_ingest}` |
|
||||
| §2.7 agentic-pd-hybrid hook | `replay.py::_attempt_d_to_p_sync` + 调用点在 `_invoke_kvcache_seeded_router` |
|
||||
| §2.8 CLI flag | `cli.py --enable-d-to-p-sync` |
|
||||
|
||||
---
|
||||
|
||||
**核心句**:D→P RDMA snapshot push 的 7/8 phase 已落地、commit、push。Phase 1 底层链路通过 host + GPU smoke 验证。Phase 2 的 SGLang scheduler 集成通过 RPC-level smoke 验证。Phase 3 的端到端 reseed orchestration 通过 E4 实验验证(跑着)。
|
||||
152
docs/D_TO_P_PHASE1_LINK_ZH.md
Normal file
152
docs/D_TO_P_PHASE1_LINK_ZH.md
Normal file
@@ -0,0 +1,152 @@
|
||||
# D→P Phase 1:底层 RDMA 链路(已验收)
|
||||
|
||||
**日期**:2026-05-13
|
||||
**状态**:底层链路通过 smoke test 验收
|
||||
**前置**:`docs/D_TO_P_SYNC_DESIGN_ZH.md`
|
||||
**对应 commit**:`feat(snapshot): D→P snapshot link over mooncake RDMA`
|
||||
|
||||
---
|
||||
|
||||
## 0. 一句话
|
||||
|
||||
实现一个独立于 SGLang `MooncakeKVManager` 的**最小 RDMA 字节传输模块**(`src/agentic_pd_hybrid/snapshot_link.py`),双进程 smoke test 跑通 1 KB → 64 MB 一共 5 个 size,全部 SHA 校验通过,64 MB 单次 RDMA write 实测 315 Gbps(mlx5_60 NDR 400 Gb 的约 80%)。
|
||||
|
||||
## 1. 设计动机
|
||||
|
||||
`docs/D_TO_P_SYNC_DESIGN_ZH.md` 选定 Option C(D→P snapshot push + P SessionSlot + prefill bypass)。这个方案的最底层依赖是"D 进程能把字节通过 RDMA 推到 P 进程的预注册缓冲区"。
|
||||
|
||||
直接复用 SGLang 的 `MooncakeKVManager` 不可行:
|
||||
- `add_transfer_request` 在 `conn.py:1563` 硬 assert `disaggregation_mode == PREFILL`
|
||||
- PD pipeline 的发送 / 接收 thread / queue / staging 紧耦合 PD 角色
|
||||
- 改 PD 路径风险大(影响现有 E1/E2/E3 配置)
|
||||
|
||||
因此把 D→P link 单独写成一个轻量模块,直接调 `mooncake.engine.TransferEngine` 的 `transfer_sync_write` / `batch_transfer_sync_write`,不经过 PD pipeline。
|
||||
|
||||
## 2. 实现
|
||||
|
||||
### 2.1 `snapshot_link.SnapshotPeer`
|
||||
|
||||
```python
|
||||
peer = SnapshotPeer(host, port, ib_device, receive_capacity_bytes)
|
||||
endpoint = peer.endpoint # SnapshotEndpoint(session_id, base_ptr, capacity_bytes)
|
||||
peer.register_send_buffer(ptr, length)
|
||||
peer.push(target_endpoint, local_ptr, local_off, length, remote_off=0)
|
||||
peer.batch_push(target, local_addrs, remote_addrs, lengths)
|
||||
peer.read_bytes(offset, length) -> bytes
|
||||
peer.close()
|
||||
```
|
||||
|
||||
- 每个 `SnapshotPeer` 拥有自己的 `TransferEngine`,绑定 `host:port`
|
||||
- `receive_capacity_bytes > 0` 时分配一段 ctypes `c_ubyte` 数组 + `register_memory`
|
||||
- `push` 直接走 `engine.transfer_sync_write(peer_session_id, local_ptr, remote_ptr, length)`
|
||||
- 角色完全对称——任何 `SnapshotPeer` 既可以发送也可以接收,由 caller 决定
|
||||
|
||||
### 2.2 Smoke test 双进程结构
|
||||
|
||||
```
|
||||
父进程 (sender) 子进程 (receiver, subprocess.Popen)
|
||||
│ │
|
||||
│ spawn → ──────────────────────────────►│
|
||||
│ │ SnapshotPeer(recv_capacity=64MB)
|
||||
│ │ write endpoint.json
|
||||
│ read endpoint.json ◄───────────────────│
|
||||
│ │
|
||||
│ SnapshotPeer(no recv buf) │
|
||||
│ register_send_buffer(64MB) │
|
||||
│ │
|
||||
│ for size in [1K, 16K, 1M, 16M, 64M]: │
|
||||
│ fill_pattern(send_buf, seed) │
|
||||
│ peer.push(endpoint, 0, size) ─RDMA──►│
|
||||
│ │ wait signal
|
||||
│ write endpoint.do{size} ────────────►│ read signal seed
|
||||
│ │ compute expected SHA
|
||||
│ │ recv_bytes = peer.read_bytes
|
||||
│ wait endpoint.ack{size} │ compare SHA → emit JSON event
|
||||
│ │ write endpoint.ack{size}
|
||||
│ ... │
|
||||
│ │
|
||||
│ drain child stdout, parse JSON │ exit
|
||||
│ verify each event has ok=true │
|
||||
```
|
||||
|
||||
### 2.3 性能(首次 smoke run)
|
||||
|
||||
| Size | Push duration | Throughput |
|
||||
|---:|---:|---:|
|
||||
| 1 KB | 9.0 ms | 0.001 Gbps |
|
||||
| 16 KB | 0.037 ms | 3.5 Gbps |
|
||||
| 1 MB | 0.102 ms | 82 Gbps |
|
||||
| 16 MB | 0.577 ms | 232 Gbps |
|
||||
| **64 MB** | **1.70 ms** | **316 Gbps** |
|
||||
|
||||
- 1 KB 第一次有 ~9 ms 的 mooncake p2p handshake/openSegment overhead(一次性)
|
||||
- 16 KB 之后是稳态,吞吐随 size 增长接近线速
|
||||
- mlx5_60 是 mlx5 ConnectX-7 NDR 400 Gb(4× 100Gb lanes);64 MB 测到 316 Gbps 是 79% 的链路利用率,对单次 RDMA write 来说正常(剩余空间留给 verb dispatch / completion handling overhead)
|
||||
|
||||
## 3. 验收
|
||||
|
||||
- ✅ 5/5 size SHA 校验全部通过
|
||||
- ✅ 64 MB 一次 RDMA 1.7 ms
|
||||
- ✅ 双进程独立,不耦合 SGLang PD pipeline
|
||||
- ✅ Smoke test 脚本 `scripts/smoke_snapshot_link.py` 可重跑
|
||||
|
||||
## 4. 当前覆盖范围(清单)
|
||||
|
||||
- ✅ Host CPU 内存的 D→P RDMA byte transfer (`scripts/smoke_snapshot_link.py`)
|
||||
- ✅ **GPU 内存** cuda:0 → cuda:1 的 D→P RDMA(`scripts/smoke_snapshot_link_gpu.py`,5/5 size 全 SHA 校验通过,256 MB 8.5 ms / 251 Gbps)
|
||||
- ✅ 单 IB device (mlx5_60)
|
||||
- ✅ 同节点 loopback(127.0.0.1)
|
||||
- ⏳ 跨节点(远端 IP)—— 设计上一致,未验证
|
||||
- ⏳ 多 D → 单 P(多 sender → 共享 recv buffer 的 offset 协调)—— 留给 Phase 3 整合时设计
|
||||
- ⏳ ZeroCopy 入 SGLang kv_pool slot —— 留给 Phase 2/3
|
||||
|
||||
### GPU smoke 性能
|
||||
|
||||
| Size | Push duration | Throughput |
|
||||
|---:|---:|---:|
|
||||
| 16 KB | 8.27 ms (cold) | 0.016 Gbps |
|
||||
| 1 MB | 0.096 ms | 87.6 Gbps |
|
||||
| 16 MB | 0.844 ms | 159 Gbps |
|
||||
| 64 MB | 2.52 ms | 213 Gbps |
|
||||
| **256 MB** | **8.54 ms** | **251 Gbps** |
|
||||
|
||||
GPU↔GPU 比 host↔host 慢一些(251 vs 316 Gbps for 64MB),但仍接近 mlx5_60 NDR 400Gb 的 60% 线率。对 KVC 单 session ~50K tokens × ~80 KB/token ≈ 4 GB 量级的 transfer,对应 D→P 时间约 130 ms。
|
||||
|
||||
## 5. 下一步(Phase 2 / Phase 3)
|
||||
|
||||
详见 `docs/D_TO_P_SYNC_DESIGN_ZH.md` §5。本 phase 1 解锁后,整个 D→P 同步可以正式开始整合到 SGLang scheduler:
|
||||
|
||||
| Phase | 描述 | 风险 |
|
||||
|---|---|---|
|
||||
| 2 | D-side commit hook:`cache_finished_req` 完成后 enqueue snapshot push | 中。需要在 scheduler 后台线程跑 push,不能阻塞 schedule loop |
|
||||
| 3 | P-side snapshot store + prefill bypass:P scheduler 收到 use-snapshot 请求时跳过 `model.forward()`,直接用 snapshot KV 触发 P→D' transfer | **最高**。需要深入 SGLang prefill 流程 |
|
||||
| 4 | agentic-pd-hybrid hook:`_invoke_kvcache_seeded_router` 先 probe P → 决定走 bypass 还是 fallback | 低 |
|
||||
| 5 | CLI flag + structural log | 低 |
|
||||
| 6 | 端到端 smoke + E4 sweep | 中 |
|
||||
|
||||
## 6. 知识沉淀
|
||||
|
||||
### 易踩坑
|
||||
|
||||
| 坑 | 原因 | 修法 |
|
||||
|---|---|---|
|
||||
| 多进程 `multiprocessing.Process` 子进程崩溃信息丢失 | spawn context 下 child 没有继承 parent 的 stderr | 改用 `subprocess.Popen` + stderr 重定向到文件 |
|
||||
| `bytes(ctypes.c_byte * N)` 失败 `ValueError: bytes must be in range(0, 256)` | `c_byte` 是 **signed**,>= 128 的 byte 在 Python 看就是负数 | 用 `c_ubyte` 或 `ctypes.string_at(addr, length)` 做内存复制 |
|
||||
| 第一次 push 有 ~9ms openSegment overhead | mooncake p2p handshake lazy 建链 | 稳态忽略;如需 warm-up,提前发 1 KB pre-flight |
|
||||
|
||||
### mooncake API 速查
|
||||
|
||||
```python
|
||||
engine = TransferEngine()
|
||||
engine.initialize(f"{host}:{port}", "P2PHANDSHAKE", "rdma", ib_device)
|
||||
engine.register_memory(ptr, length) # mr 注册
|
||||
engine.transfer_sync_write(peer_session_id, local_ptr, remote_ptr, length) # RDMA write
|
||||
engine.batch_transfer_sync_write(peer_session_id, [local_ptrs], [remote_ptrs], [lengths])
|
||||
engine.unregister_memory(ptr)
|
||||
```
|
||||
|
||||
`peer_session_id` 是 `"host:rpc_port"`,其中 `rpc_port = peer_engine.get_rpc_port()`。
|
||||
|
||||
---
|
||||
|
||||
**核心句**:D→P 底层 RDMA 链路独立模块跑通,64 MB 1.7 ms / 316 Gbps,与 SGLang PD pipeline 完全解耦。Phase 2/3 可以放心在这上面叠加。
|
||||
446
docs/D_TO_P_SYNC_DESIGN_ZH.md
Normal file
446
docs/D_TO_P_SYNC_DESIGN_ZH.md
Normal file
@@ -0,0 +1,446 @@
|
||||
# D→P KV 反向推送设计
|
||||
|
||||
**日期**:2026-05-12
|
||||
**分支**:`h200-cu130`(在此分支上做,后续 cherry-pick 到 `feat/d-to-p-sync` 备用)
|
||||
**目标**:让 reseed 路径绕过 P 端 re-prefill,把 reseed 总耗时从 3-7s 压到接近一次 RDMA P→D' 传输(~200-400ms)
|
||||
**前置**:`docs/RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md`(reseed 现状),`docs/KVC_EVICTION_GRANULARITY_DESIGN_ZH.md`(架构层背景)
|
||||
|
||||
---
|
||||
|
||||
## 0. TL;DR
|
||||
|
||||
1. **现状**:v2 reseed 路径 = P open session + P 完整 re-prefill(~1.5-3s)+ P→D' mooncake transfer(~200-400ms RDMA)。`re-prefill` 段是 KVC TTFT p99 的主体。
|
||||
2. **目标**:D 在 direct-to-D append 完成后异步把新 KV 增量推回 P。reseed 触发时 P 已经有 fresh snapshot → 跳过 model.forward()、直接复用 KV 做 P→D' 传输。
|
||||
3. **决策**:选 Option C —— **D→P snapshot 按 append-completion 推送,P 端用独立 PrefillSnapshotStore 存储(不进 radix tree),prefill 在有 snapshot 时 bypass 计算只触发传输**。
|
||||
4. **拒绝的 alternatives**:A(让 P radix tree 接受多生产者写入,§4.3 工程灾难)、B(D→D' 直推,绕过 P,但 mooncake 无 D-Sender 角色 + session-not-resident 场景失败)、D(仅 eviction 时推,async 来不及 + sync 拖死 eviction)。
|
||||
5. **工程量**:~600 LOC,拆 6-8 commit。最难的是 mooncake 双角色化的 thread-safety 和 P 端 prefill bypass 的调度器 hook。
|
||||
6. **必须 RDMA**:所有传输走 mooncake batch_transfer,不允许 TCP fallback。
|
||||
|
||||
---
|
||||
|
||||
## 1. 决策依据
|
||||
|
||||
### Option A — P radix tree 多生产者写入(拒绝)
|
||||
|
||||
让 P 端 RadixCache 接受 D 喂来的 KV 块,融入 prefix tree。
|
||||
|
||||
**为何拒绝**:
|
||||
|
||||
- SGLang radix tree 假设单生产者(本 worker 的 model 输出)。改动涉及节点写入路径、引用计数、跨 worker 数据格式、eviction policy 协调。
|
||||
- 工程量 ~1-2 周,且是侵入式改动,长期维护成本高。
|
||||
- 与 vendor 上游 diff 太大,未来 rebase 风险高。
|
||||
|
||||
### Option B — D→D' 直推(拒绝)
|
||||
|
||||
migration 时 D_old 把 KV 直接发到 D_new,绕过 P。
|
||||
|
||||
**为何拒绝**:
|
||||
|
||||
- 触发条件 `session-not-resident` 时 KV 已 free,D_old 拿不到任何数据可推。
|
||||
- mooncake DECODE 模式当前只有 receiver 角色(`assert disaggregation_mode == PREFILL` at conn.py:1563);新增 D-Sender 角色与 P-Receiver 角色对偶,工程量与 Option C 相当但**只 cover 部分场景**。
|
||||
- D→D' 控制平面需要额外协调("哪个 D 当前持有 session"),增加路由复杂度。
|
||||
|
||||
### Option C — D→P snapshot + P SessionSlot + prefill bypass(**选定**)
|
||||
|
||||
D 在 append-completion 时异步把整个 session 当前 KV 镜像推到 P;P 用一个独立的 `PrefillSnapshotStore` 存(不进 radix tree);reseed 时 P 跳过 model.forward(),直接用 snapshot 触发 P→D' 传输。
|
||||
|
||||
**为何选它**:
|
||||
|
||||
1. **P 端不动 radix tree**——SnapshotStore 是侧表,无 multi-producer 问题
|
||||
2. **mooncake 改动局部化**——只放开 `add_transfer_request` 的 PREFILL assertion + 在 DECODE 模式启动一个独立 snapshot transfer 线程
|
||||
3. **可以分阶段验证**——D→P 推 → P 收到 → P 存 → P 用,每一步可独立 smoke test
|
||||
4. **failure semantics 干净**——snapshot 缺失就 fallback 到现有 re-prefill 路径,零回退风险
|
||||
5. **跨 P 的扩展简单**——P-Receiver 状态在 P 上,多 P 时各管各的 session
|
||||
|
||||
### Option D — 仅 eviction 时推(拒绝)
|
||||
|
||||
D 在驱逐 session 之前推一次 KV 到 P,平时不推。
|
||||
|
||||
**为何拒绝**:
|
||||
|
||||
- async 推送:reseed 触发时(下一 turn 到达)可能 push 还没到 P 完。需要 reseed path 等 push 完成 → 把延迟成本只是搬家。
|
||||
- sync 推送:让 eviction 等 mooncake transfer 完,**当前 incoming request(触发 eviction 的那个)** 直接被拖死 1-3s。比当前 reseed 还差。
|
||||
- 不能 cover 非 eviction 触发的 reseed(如 migration、admission-no-d-capacity)。
|
||||
|
||||
---
|
||||
|
||||
## 2. 架构
|
||||
|
||||
```
|
||||
+---------------- D worker (decode_thread + new snapshot_sender_thread) -----+
|
||||
| |
|
||||
| direct-to-D append done |
|
||||
| | |
|
||||
| v |
|
||||
| on_session_step_committed(session_id, kv_committed_len, kv_indices) |
|
||||
| | |
|
||||
| v |
|
||||
| SnapshotSendQueue [throttle by token-delta >= K_DELTA] |
|
||||
| | |
|
||||
| v |
|
||||
| KVSnapshotSender |
|
||||
| | |
|
||||
| | mooncake batch_transfer (RDMA) |
|
||||
| v |
|
||||
+-----------------------------|----------------------------------------------+
|
||||
|
|
||||
v
|
||||
+---------------- P worker (prefill_thread + new snapshot_receiver_thread) ---+
|
||||
| |
|
||||
| KVSnapshotReceiver listening (ZMQ control + mooncake data) |
|
||||
| | |
|
||||
| v |
|
||||
| PrefillSnapshotStore[session_id] -> SnapshotEntry { |
|
||||
| req_pool_idx, kv_indices, kv_committed_len, last_recv_time |
|
||||
| } |
|
||||
| |
|
||||
| When prefill request arrives with session_id + snapshot_token: |
|
||||
| | |
|
||||
| v |
|
||||
| prefill_bypass_check(session_id, requested_seq_len) |
|
||||
| | hit: skip model.forward, reuse stored kv, fire P→D' transfer |
|
||||
| | miss: fall through to normal prefill |
|
||||
+----------------------------------------------------------------------------+
|
||||
|
||||
+--------------- agentic-pd-hybrid (replay.py) -------------------------------+
|
||||
| |
|
||||
| _invoke_kvcache_seeded_router (reseed entry): |
|
||||
| 1. GET /v1/sessions/{sid}/snapshot_status on P → seqlen |
|
||||
| 2. if seqlen >= requested input_len: |
|
||||
| set request header x-prefill-use-snapshot=1 |
|
||||
| route to P → P uses bypass path |
|
||||
| else: |
|
||||
| normal seeded_router (re-prefill) |
|
||||
+----------------------------------------------------------------------------+
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 3. 数据流时间线
|
||||
|
||||
### 3.1 Direct-to-D append + 异步 D→P push
|
||||
|
||||
```
|
||||
t=0 turn N 到 D,走 direct-to-D append-prefill
|
||||
t=T1 direct append 完成,scheduler 调 cache_finished_req
|
||||
SessionAwareCache.cache_finished_req 把 KV 写回 SessionSlot
|
||||
(此时 KV 全在 D 的 kv_pool 里,slot 持锁)
|
||||
t=T1+ε D-side hook: on_session_step_committed(sid, slot)
|
||||
计算 delta = slot.kv_committed_len - last_pushed_seqlen[sid]
|
||||
if delta >= K_DELTA (默认 1024 tokens): 入队 SnapshotSendQueue
|
||||
t=T1+δ snapshot_sender 线程取出 entry → mooncake batch_transfer
|
||||
把 kv_pool[slot.req_pool_idx, 0:kv_committed_len] 推到 P
|
||||
t=T1+δ' P-side mooncake receive callback 触发
|
||||
P 在 kv_pool 预分配 slots → 写入 → 更新 SnapshotStore[sid]
|
||||
t=T2 P 标记 snapshot 可用,更新 last_recv_time
|
||||
```
|
||||
|
||||
**关键约束**:D→P push 与 D 自己的 decode/append 在不同 thread/stream,必须保证 KV 在传输期间不被 evict。
|
||||
- 复用 SessionSlot 的 lock_ref 机制:snapshot_sender 在传输期间 hold lock,传输完后 dec_lock。
|
||||
- 如果 session 在传输期间被 release_session 调用,snapshot 应该 abort(数据不一致)。
|
||||
|
||||
### 3.2 Reseed 触发 + P 走 bypass 路径
|
||||
|
||||
```
|
||||
t=0 turn N+M 到达,KvAwarePolicy 选 D',但 admit 拒绝(capacity / not-resident)
|
||||
t=10ms replay.py 进入 _invoke_kvcache_seeded_router
|
||||
t=15ms probe: GET p/v1/sessions/{sid}/snapshot_status -> {seqlen: 50080, fresh: true}
|
||||
t=20ms replay: 50080 >= request.input_length (49800),触发 bypass 路径
|
||||
t=25ms open D' streaming session (HTTP)
|
||||
t=30ms open P streaming session, set x-prefill-use-snapshot header
|
||||
t=40ms forward request to SGLang pd-router → P
|
||||
t=45ms P scheduler 看到 use-snapshot 标记
|
||||
→ SnapshotStore.lookup(sid) -> SnapshotEntry
|
||||
→ 跳过 model.forward()
|
||||
→ 直接复用 SnapshotEntry.kv_indices 给 mooncake KVSender
|
||||
t=50ms mooncake P→D' RDMA transfer 启动
|
||||
t=300ms P→D' 完成,D' 上 session 重建
|
||||
t=305ms D' 开始 decode
|
||||
t=350ms first token 出来 → TTFT
|
||||
```
|
||||
|
||||
**收益对照**:
|
||||
| 段 | 当前 reseed | bypass 后 |
|
||||
|---|---:|---:|
|
||||
| P open session | ~50ms | ~50ms |
|
||||
| **P re-prefill** | **~1500-3000ms** | **0** |
|
||||
| P→D' transfer (RDMA) | ~200-400ms | ~200-400ms |
|
||||
| D' decode start | ~50ms | ~50ms |
|
||||
| TTFT 总 | ~1.8-3.5s | ~0.3-0.5s |
|
||||
|
||||
---
|
||||
|
||||
## 4. 接口和数据结构
|
||||
|
||||
### 4.1 Mooncake 双角色
|
||||
|
||||
**Change**: `MooncakeKVManager.__init__` 在 DECODE 模式下**额外**启动 snapshot sender 基础设施(独立 transfer_queues + thread pool)。
|
||||
|
||||
```python
|
||||
# In MooncakeKVManager.__init__, after start_decode_thread() in DECODE mode:
|
||||
if envs.SGLANG_DTOP_SNAPSHOT_ENABLED.get():
|
||||
self._init_snapshot_sender() # new
|
||||
|
||||
def _init_snapshot_sender(self):
|
||||
self.snapshot_send_queue: FastQueue = FastQueue()
|
||||
self.snapshot_executor = ThreadPoolExecutor(max_workers=2)
|
||||
threading.Thread(
|
||||
target=self._snapshot_send_worker,
|
||||
daemon=True,
|
||||
).start()
|
||||
```
|
||||
|
||||
**Change**: 删除 `add_transfer_request` 的 `assert PREFILL`,改为按 caller 路径分发:
|
||||
- `add_transfer_request` —— prefill 用,保持现状
|
||||
- `add_snapshot_transfer_request` —— 新增,decode 用
|
||||
|
||||
### 4.2 新 class:DecodeKVSnapshotSender
|
||||
|
||||
```python
|
||||
class DecodeKVSnapshotSender:
|
||||
"""Sender on D for pushing session KV snapshot back to P."""
|
||||
def __init__(self, mgr: MooncakeKVManager, target_p_addr: str,
|
||||
target_p_bootstrap_room: int, session_id: str):
|
||||
...
|
||||
|
||||
def send(self, kv_indices: npt.NDArray[np.int32],
|
||||
kv_committed_len: int, aux_blob: bytes) -> None:
|
||||
"""Enqueue snapshot for async push. Non-blocking."""
|
||||
|
||||
def poll(self) -> KVPoll: ...
|
||||
```
|
||||
|
||||
### 4.3 P 端 PrefillSnapshotStore + Receiver
|
||||
|
||||
```python
|
||||
@dataclass
|
||||
class SnapshotEntry:
|
||||
session_id: str
|
||||
req_pool_idx: int
|
||||
kv_indices: torch.Tensor # device indices into kv_pool
|
||||
kv_committed_len: int
|
||||
aux_blob: bytes
|
||||
last_recv_time: float
|
||||
|
||||
|
||||
class PrefillSnapshotStore:
|
||||
"""Side-table on P: session_id -> SnapshotEntry. NOT in radix tree."""
|
||||
def __init__(self, kv_pool_allocator, req_to_token_pool, max_sessions: int = 8):
|
||||
self.entries: dict[str, SnapshotEntry] = {}
|
||||
self.max_sessions = max_sessions
|
||||
...
|
||||
|
||||
def ingest(self, session_id: str, kv_data: torch.Tensor,
|
||||
kv_committed_len: int, aux_blob: bytes) -> None:
|
||||
"""Allocate slots, copy KV in, register entry. LRU-evicts when full."""
|
||||
|
||||
def lookup(self, session_id: str) -> Optional[SnapshotEntry]: ...
|
||||
|
||||
def release(self, session_id: str) -> None:
|
||||
"""Free the slots + remove entry."""
|
||||
```
|
||||
|
||||
### 4.4 P-side prefill bypass 调度器 hook
|
||||
|
||||
**Change**: `scheduler.py` 在 `handle_generate_request` 入口处检查 `x-prefill-use-snapshot` header / `session_params.use_snapshot=True`:
|
||||
|
||||
```python
|
||||
if snapshot_requested and self._snapshot_store.has(session_id):
|
||||
entry = self._snapshot_store.lookup(session_id)
|
||||
if entry.kv_committed_len >= len(input_ids) - K_TAIL_TOLERANCE:
|
||||
return self._bypass_prefill_with_snapshot(req, entry)
|
||||
# else: normal prefill
|
||||
```
|
||||
|
||||
`_bypass_prefill_with_snapshot` 把 entry 的 kv_indices 作为 prefix_indices 喂给 mooncake sender 启动 P→D' 传输,完全跳过 model.forward()。
|
||||
|
||||
### 4.5 D 端 commit hook
|
||||
|
||||
**Change**: `scheduler.py` 在 `handle_finish_request` / `cache_finished_req` 完成后调用:
|
||||
|
||||
```python
|
||||
if (self._enable_d_to_p_sync and req.session and req.session.streaming
|
||||
and self._has_p_snapshot_target(req.session.session_id)):
|
||||
self._maybe_enqueue_snapshot_push(req.session.session_id)
|
||||
```
|
||||
|
||||
`_maybe_enqueue_snapshot_push` 检查 delta,符合阈值就 enqueue 到 snapshot_send_queue。
|
||||
|
||||
### 4.6 HTTP endpoints (P)
|
||||
|
||||
```
|
||||
GET /v1/sessions/{sid}/snapshot_status
|
||||
-> {"exists": bool, "seqlen": int, "freshness_s": float}
|
||||
|
||||
POST /v1/sessions/{sid}/snapshot_target
|
||||
-> {"bootstrap_addr": str, "bootstrap_room": int}
|
||||
(D queries this once per session to learn where to push)
|
||||
```
|
||||
|
||||
### 4.7 agentic-pd-hybrid hook
|
||||
|
||||
**File**: `src/agentic_pd_hybrid/replay.py`
|
||||
|
||||
In `_invoke_kvcache_seeded_router`, before opening P session:
|
||||
|
||||
```python
|
||||
if config.enable_d_to_p_sync:
|
||||
snapshot_status = await _probe_p_snapshot(
|
||||
client, prefill_url, session_id, target_seqlen=request.input_length,
|
||||
)
|
||||
if snapshot_status and snapshot_status["fresh"]:
|
||||
# bypass path
|
||||
return await _invoke_kvcache_snapshot_bypass(...)
|
||||
# else: existing seeded router
|
||||
```
|
||||
|
||||
### 4.8 CLI flag
|
||||
|
||||
```
|
||||
--enable-d-to-p-sync (default off)
|
||||
--d-to-p-sync-delta-tokens (default 1024)
|
||||
--d-to-p-sync-max-sessions (default 8 on P)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 5. 实现路线图(每步独立 commit)
|
||||
|
||||
| # | Commit subject | Files | Why a separate commit |
|
||||
|---|---|---|---|
|
||||
| 1 | `feat(sglang): mooncake bidirectional infra for D→P snapshot` | `third_party/sglang/.../mooncake/conn.py` | 隔离 mooncake 层改动;不破坏 PD-disagg 现有路径 |
|
||||
| 2 | `feat(sglang): PrefillSnapshotStore + DecodeKVSnapshotSender` | `third_party/sglang/.../mem_cache/`, `third_party/sglang/.../disaggregation/mooncake/` | 新数据结构 |
|
||||
| 3 | `feat(sglang): P-side prefill bypass with snapshot` | `third_party/sglang/.../managers/scheduler.py`, `tokenizer_manager.py` | 调度器 hook,最危险,单独提交便于回滚 |
|
||||
| 4 | `feat(sglang): D-side session commit hook → snapshot push` | `third_party/sglang/.../managers/scheduler.py`, `session_aware_cache.py` | D 端 trigger |
|
||||
| 5 | `feat(sglang): HTTP endpoints for snapshot status/target` | `third_party/sglang/.../entrypoints/http_server.py` | API 表面 |
|
||||
| 6 | `feat(agentic): D→P sync hook in seeded_router` | `src/agentic_pd_hybrid/replay.py` | 客户端逻辑 |
|
||||
| 7 | `feat(agentic): --enable-d-to-p-sync CLI + config` | `src/agentic_pd_hybrid/cli.py`, `benchmark.py` | CLI 接入 |
|
||||
| 8 | `feat(experiments): smoke test + E4 sweep scripts` | `scripts/`, `docs/D_TO_P_SMOKE_RESULTS_ZH.md` | 验收 + 落盘 |
|
||||
|
||||
---
|
||||
|
||||
## 6. Metrics + 观察性
|
||||
|
||||
### Structural log channels(写到 `structural/d-to-p-sync.jsonl`)
|
||||
|
||||
```json
|
||||
{"ts": ..., "event": "snapshot_push_enqueued", "sid": "...", "delta": 2048}
|
||||
{"ts": ..., "event": "snapshot_push_sent", "sid": "...", "bytes": 4_200_000_000, "dur_ms": 320}
|
||||
{"ts": ..., "event": "snapshot_push_failed", "sid": "...", "reason": "..."}
|
||||
{"ts": ..., "event": "snapshot_recv_ingested", "sid": "...", "seqlen": 50000}
|
||||
{"ts": ..., "event": "snapshot_evicted", "sid": "...", "reason": "lru|session_close|stale"}
|
||||
{"ts": ..., "event": "snapshot_bypass_hit", "sid": "...", "seqlen": 50000, "saved_prefill_ms_est": 1800}
|
||||
{"ts": ..., "event": "snapshot_bypass_miss", "sid": "...", "reason": "no_entry|stale|seqlen_short"}
|
||||
```
|
||||
|
||||
### Per-request metrics (additional fields in metrics.jsonl)
|
||||
|
||||
```
|
||||
d_to_p_snapshot_used: bool
|
||||
d_to_p_snapshot_age_s: float | None
|
||||
d_to_p_push_count_during_session: int
|
||||
```
|
||||
|
||||
### Sweep summary 应回答的问题
|
||||
|
||||
1. snapshot push 触发频率(每秒多少次)
|
||||
2. snapshot LRU eviction 是不是瓶颈(freshness 分布)
|
||||
3. reseed 触发时 bypass hit rate
|
||||
4. bypass vs fallback 的 TTFT 分布对比
|
||||
|
||||
---
|
||||
|
||||
## 7. 失败模式 + 回退
|
||||
|
||||
| 失败模式 | 现象 | 处理 |
|
||||
|---|---|---|
|
||||
| D→P transfer 中途失败 | mooncake KVPoll.Failed | snapshot_send_queue 重试 1 次,再失败放弃;保留旧 entry |
|
||||
| P snapshot store 满 | LRU 淘汰最旧 entry | log eviction event |
|
||||
| reseed 时 snapshot stale | entry.kv_committed_len < requested input_len - K_TAIL_TOLERANCE | 回退到 normal re-prefill |
|
||||
| D 重启 / session 丢失 | D 上 session_aware_cache 没了 | snapshot_target 注册过期;下次 push 收到 404 → 清理 D 端记录 |
|
||||
| P 重启 | snapshot store 清空 | 下次 reseed probe 拿到 not-exists → fallback |
|
||||
| 双重 push(多个 D 喂同一 session)| 不该发生(session 同时只在一个 D),但保险起见用 last-write-wins + log warning | |
|
||||
|
||||
**核心不变量**:D→P sync 失败永远只导致 fallback 到现有 re-prefill 路径,不影响正确性。
|
||||
|
||||
---
|
||||
|
||||
## 8. 测试
|
||||
|
||||
### Smoke test 阶段(commit #8)
|
||||
|
||||
`scripts/smoke_d_to_p_sync.sh`:
|
||||
1. 启 1P1D,开启 `--enable-d-to-p-sync`
|
||||
2. 跑 5 sessions × 3 turns 的迷你 trace
|
||||
3. 触发条件:第二 turn direct-to-D append 完成后强制 capacity-evict(用 admission flag 调小)
|
||||
4. 第三 turn 必然走 reseed 路径
|
||||
5. 验证:
|
||||
- structural log 有 snapshot_push_sent + snapshot_recv_ingested
|
||||
- 第三 turn metrics 显示 d_to_p_snapshot_used=true
|
||||
- TTFT 与 cold prefill 的差异 ≥ 1s
|
||||
|
||||
### E4 端到端 sweep(feature 验收完成后)
|
||||
|
||||
详见 §9。
|
||||
|
||||
---
|
||||
|
||||
## 9. 实验:E4 KVC w/ D→P vs naive PD-disagg
|
||||
|
||||
**目标**:证明 KVC + D→P 在保持 session affinity 设计独特性的前提下 latency 优于 naive PD-disagg(E1 baseline)。
|
||||
|
||||
### 实验矩阵
|
||||
|
||||
| # | 配置 | 期望验证 |
|
||||
|---|---|---|
|
||||
| E1(已有) | naive 1P3D + kv-aware + RDMA | baseline,无 KVC 层 |
|
||||
| E3(已有) | KVC v2 + RDMA + load-floor | KVC 但无 D→P,reseed 重 prefill |
|
||||
| **E4** | KVC v2 + RDMA + load-floor + D→P | KVC + D→P bypass |
|
||||
| E4-ablate | KVC v2 + RDMA + load-floor + D→P,但人为 disable bypass | 排除 push 流量本身的副作用 |
|
||||
|
||||
### 假设
|
||||
|
||||
- **H4-1**:E4 的 TTFT p99 ≤ E1。证明:KVC + D→P 在 p99 长尾上不再输 naive PD-disagg。
|
||||
- **H4-2**:E4 的 reseed 占比(execution_mode=*reseed*)不变,但 reseed 路径自身 TTFT 中位 ≤ E1 normal 路径 TTFT 中位。
|
||||
- **H4-3**:E4 的总 throughput 略低于 E3(因为 D→P 推送占带宽),但 TTFT/latency 优势足以补偿。
|
||||
|
||||
### 数据集
|
||||
|
||||
- `outputs/inferact_50sess.jsonl`(同 E1/E2/E3)
|
||||
- md5 7bb263a32600ef5a6ef5099ba340a487
|
||||
|
||||
### 报告(事前 commit `docs/E4_PROTOCOL_ZH.md`,跑完后 `docs/E4_RESULTS_ZH.md`)
|
||||
|
||||
每个 hypothesis 标注:
|
||||
- 证实 / 证伪 / 部分证实
|
||||
- 数字证据
|
||||
- 失败原因(若证伪)
|
||||
- 后续工作建议
|
||||
|
||||
---
|
||||
|
||||
## 10. 边界 + 非目标
|
||||
|
||||
**本设计不解决**:
|
||||
|
||||
- **D→D' 直推**:未来若证实场景 X 必须用,可走 Option B 作为补充
|
||||
- **跨 P 协调**:现假设单 P。多 P 时每个 P 各自维护自己的 snapshot store,session 路由到哪个 P 是 router 决定
|
||||
- **跨节点 mooncake**:当前 H200 是单机 4 GPU,IB device 选 mlx5_60。跨节点 RDMA 留作 future work
|
||||
- **snapshot 持久化**:P 重启 snapshot 全丢,下次 reseed 走 fallback。不写盘
|
||||
- **prefill bypass 与 chunked prefill 的交互**:bypass 走的是 "全 session KV 直接传输",不和 chunked prefill 并存。若 P 当前正在 chunked-prefill 这个 session,bypass 等到现有 chunk 结束再起
|
||||
|
||||
---
|
||||
|
||||
## 11. 决策点(等评审)
|
||||
|
||||
| # | 问题 | 默认 |
|
||||
|---|---|---|
|
||||
| D1 | snapshot push 的 throttle delta K_DELTA = 1024 tokens 合理?太小会泛滥推送,太大会让 snapshot 滞后 | 起步用 1024,跑 smoke 看流量再调 |
|
||||
| D2 | snapshot LRU 上限 max_sessions = 8 合理?P 池 ~92K tokens,session 平均 50K → 1-2 个? | 8 太乐观,改 4 |
|
||||
| D3 | bypass 时 P 是否走 mooncake 的 staging buffer?还是直接 zerocopy | 直接 zerocopy,避免一次 device→device 拷贝 |
|
||||
| D4 | D-side push 失败后是否上报 router 影响策略? | 不上报,fail-open(fallback re-prefill 也能跑) |
|
||||
| D5 | snapshot 是否包含 aux/state?(mamba state, swa 状态等) | E4 实验 trace 只用 Qwen3,无 mamba。aux 跟着 KV 一起带 |
|
||||
|
||||
---
|
||||
|
||||
**核心句**:D→P 同步是 KVC 设计真正击败 naive PD-disagg 的关键缺口。本设计用 P 端独立 snapshot store + prefill bypass 的最小改动方案,避开 radix tree 多生产者扩展的工程陷阱,~600 LOC 拆 8 commit 可在单次 session 完成。验收后即可启动 E4 实验对比 KVC vs naive。
|
||||
137
docs/E1_E2_FIX_DESIGN_ZH.md
Normal file
137
docs/E1_E2_FIX_DESIGN_ZH.md
Normal file
@@ -0,0 +1,137 @@
|
||||
# E1 / E2 Failure Modes — Fix Design Space (no code changes)
|
||||
|
||||
**Status**: design proposal for review.
|
||||
**Branch**: `h200-cu130`.
|
||||
**Companion**: `docs/E1_E2_RESULTS_ZH.md` §5b–§5d for the forensic findings this design responds to.
|
||||
|
||||
This document evaluates candidate fixes for the two pathologies E1 / E2 exposed:
|
||||
- **Q1**: D scheduler thread starves the mooncake C++ control plane during LRU evictions, causing P-side `batch_transfer_sync` to time out (~30 s) and the hair-trigger in `conn.py:1270` to permanently blacklist the D's mooncake_session_id.
|
||||
- **Q2**: KvAwarePolicy's overlap-first lex score, combined with workloads where new sessions share boilerplate hash_ids with already-resident sessions on D0/D1, leaves D2 cold for the entire run.
|
||||
|
||||
For each problem we list candidate fixes, the layer they touch, their assumptions, and what could go wrong. **No code is committed** until a path is chosen.
|
||||
|
||||
---
|
||||
|
||||
## Q1 — Eviction starves mooncake control plane
|
||||
|
||||
### Mechanism recap
|
||||
|
||||
Inside `decode-0.log` at the moment of P-side timeout (`Sync batch data transfer timeout after 37452515723ns`):
|
||||
|
||||
```
|
||||
01:56:34 Decode batch ... gen 174 tok/s ← serving fine
|
||||
01:56:42 session id 1000315 does not exist, cannot delete.
|
||||
01:56:42 Trimmed decode session cache via LRU. evicted=2, freed=77675, available 38574 → 116249
|
||||
01:56:42 Trimmed decode session cache via LRU. evicted=1, freed=36166, available 29038 → 65204
|
||||
01:56:42 Decode transfer failed ... ← P-side timeout fires
|
||||
```
|
||||
|
||||
`maybe_trim_decode_session_cache` (in vendored sglang scheduler) walks per-session resident bookkeeping, releases GPU KV slots via `kv_pool_allocator.free()`, and updates `session_aware_cache` under lock. While that runs, the scheduler main loop is busy and the mooncake control-plane callbacks scheduled into the same event loop don't get serviced. P sees no completion ack within 30 s → `batch_transfer_sync` returns nonzero → hair-trigger fires.
|
||||
|
||||
### Design space
|
||||
|
||||
| # | Fix | Layer | Mechanism | Assumes | Risks |
|
||||
|---|---|---|---|---|---|
|
||||
| **Q1.A** | Pre-emptive low-watermark eviction | vendored SGLang | Trigger LRU when `token_usage > 0.7` in idle scheduler ticks, so admission rarely needs to evict inline. SGLang already has `_decode_session_cache_low_watermark_tokens`; question is whether it currently runs proactively or only on-demand. | Idle ticks exist to absorb the work; the per-trim cost is bounded enough that doing it pre-emptively doesn't hurt the steady-state. | If proactive trims pick "warm" sessions (recently active), we lose direct-to-D fast-path hits. Need careful watermark + LRU-priority tuning. |
|
||||
| **Q1.B** | Async eviction thread | vendored SGLang | Move LRU trim off the scheduler main loop into a background worker. Scheduler main loop only calls `notify_evict_needed()`; mooncake control plane keeps running. | KV pool free / session_aware_cache mutations can be made thread-safe with reasonable lock granularity. | Largest blast radius. Concurrent in-flight transfers can race with eviction of the same KV slots; need explicit ref-counting. Harder to reason about correctness. |
|
||||
| **Q1.C** | Bump mooncake transfer timeout | mooncake env / wheel patch | Set `MC_TRANSFER_TIMEOUT_NS` (or equivalent) from 30 s default → 120 s+, giving D's eviction more headroom before P gives up. | A real broken link won't go unnoticed for ≥120 s. | Pure defense-in-depth. Doesn't fix LRU thrashing; under heavier load eviction could exceed 120 s too. Slows real-failure detection. |
|
||||
| **Q1.D** | Windowed hair-trigger | vendored SGLang `conn.py:1270` | Replace `if session_failures >= 1:` with `if session_failures ≥ N within window`. Add periodic probe to D bootstrap port to clear `failed_sessions` after success. | Transient stalls are recoverable; real deaths are not. | Changes core failure semantics. We may keep dispatching to a D that is actually slow-dying. Adds windowed-state bookkeeping to a stable codepath. |
|
||||
| **Q1.E** | Router-side backpressure | our `--enable-backpressure` (already exists, off by default) | D returns `recommended_pause_ms` in its admission RPC when pool > threshold; router pauses dispatch to that D. Already implemented. | Pausing dispatch upstream prevents D from ever reaching saturation, so LRU never thrashes. | Doesn't help in-flight transfers when stall happens; only prevents future arrivals. Won't rescue requests already mid-mooncake when LRU fires. |
|
||||
| **Q1.F** | Upstream load balance (= Q2 fix) | our `policies.py` | Spread sessions to D2 so D0/D1's KV pool never saturates; LRU never trims; mooncake never stalls; hair-trigger never fires. | Q2 fix is sound and the workload's KV demand fits into 3 D's evenly. | The LRU+mooncake interaction stays latent. A different workload that still imbalances (e.g. a few sessions much larger than others) could re-trigger. |
|
||||
|
||||
### Recommendation for Q1
|
||||
|
||||
**Primary: Q1.F (do Q2 fix first).** This is upstream of the failure cascade and removes the only situation in which we observe LRU thrashing in our experiments. If Q2 is fixed and re-running E2 still shows mooncake stalls, then we *know* it's a real symptom and need defense-in-depth.
|
||||
|
||||
**Defense-in-depth (cheap): Q1.C (bump mooncake timeout).** Single env-var change, gives 4× safety margin, costs nothing. Safe to do regardless.
|
||||
|
||||
**Avoid for now: Q1.B and Q1.D.** Both touch vendored SGLang in invasive ways that change failure-detection semantics. Hold until Q1.F + Q1.C demonstrate they aren't enough.
|
||||
|
||||
**Open question for the team**: does SGLang's existing `low_watermark` LRU trigger (Q1.A) already run proactively? If we read the scheduler loop and find it only trims on demand, Q1.A is a small targeted change worth doing; if it's already proactive, the trims we observe are because watermark is set too high → tune the constant.
|
||||
|
||||
---
|
||||
|
||||
## Q2 — Cold-D never gets a session
|
||||
|
||||
### What we already know is wrong
|
||||
|
||||
User's observation: the existing `migration_reject_threshold=3` mechanism fires *after 3 wasted prefills*, which is too late. The fix needs to be *proactive*: the first request to a fresh session should already prefer the cold D over a hot D whose only advantage is shared boilerplate overlap.
|
||||
|
||||
### Design space
|
||||
|
||||
Let `assigned[D] = state.decode_assignment_counts[D]` and `inflight[D] = state.inflight_decode[D]`. Lex score is currently:
|
||||
|
||||
```
|
||||
score(D) = (overlap + α·sticky, sticky, -inflight, -assigned)
|
||||
```
|
||||
|
||||
| # | Fix | Mechanism | Assumes | Risks |
|
||||
|---|---|---|---|---|
|
||||
| **Q2.A** | Cold-D bonus (binary, what the reverted commit did) | `cold_boost = K if assigned[D]==0 and not sticky else 0`; add to lex position 0. | Each D needs to be "popped" from cold once, after that the bonus disappears. | One-shot: only protects the first session per D. After all 3 D's have ≥1 session, bonus is 0 everywhere and we're back to overlap-dominates-everything. If new session pressure remains skewed (e.g. boilerplate keeps growing on D0/D1), we re-imbalance silently. |
|
||||
| **Q2.B** | Load-floor bonus (graduated, my recommended primary) | `floor_bonus = max(0, K · (1 − assigned[D] / max(assigned[*])))` (or similar continuous fn); add to lex position 0; gated on `not sticky`. | "Lower assignment count = preferable for fresh sessions" is a sound bias even when no D is fully cold. | Tuning: K must dominate boilerplate overlap (~50 blocks here) but not so much that it drowns out genuine prefix-cache wins (a session with real 800-block overlap with one D should still go there). Suggest K ≈ 100×median(overlap_for_fresh_sessions). |
|
||||
| **Q2.C** | Lex re-order: inflight first | Change score to `(-inflight, overlap + α·sticky, sticky, -assigned)`. | Idle D always wins ties → idle D2 wins fresh sessions immediately. | Contradicts the existing design intent (overlap-first = cache-locality-first). Hurts cache reuse when load *is* balanced. Sticky requests at turn 1+ might be diverted to a momentarily idle D, breaking cache locality of subsequent turns. |
|
||||
| **Q2.D** | Capacity-aware overlap discount | `effective_overlap = overlap · (1 − inflight[D] / max_inflight)`; replace `overlap` in score. | Loaded D's overlap is worth less than idle D's overlap because of queueing cost. Matches what theory says about cache-vs-load tradeoff. | More complex than Q2.B; needs `max_inflight` estimate (per-D? global?). Harder to reason about and tune. Saves only marginal modeling correctness over Q2.B. |
|
||||
| **Q2.E** | Pre-warm cold D's at startup | After SGLang warmup, send a synthetic request whose hash_ids cover the boilerplate prefix to each D, populating `state.resident[D]` evenly. | We can identify "the shared boilerplate" by inspecting the trace before launch (or extracting common prefix at run start). | Trace-aware / requires upstream knowledge. Doesn't help workloads with multiple distinct shared prefixes. Workload-coupled — feels brittle. |
|
||||
| **Q2.F** | Drop overlap unless "material" | Apply overlap term only when overlap > τ blocks (or > τ% of input). | Tiny overlap doesn't actually save meaningful prefill work. | Hides imbalance instead of solving it. If a workload has medium overlap (say 15%), threshold won't fire and we're back to imbalance. Doesn't address the bigger issue. |
|
||||
| **Q2.G** | Fix the substring filter (the actual `_is_admission_rejection_mode` bug) | Either widen `_ADMISSION_REJECTION_SUBSTRINGS` to include `"kvcache-centric"`, or call `state.record_admission_reject` directly from the actual reject signal site instead of string-matching after the fact. | Existing migration mechanism is sound *once* it gets fed the right signal. | User has explicitly said 3-reject threshold is too late. So Q2.G alone isn't enough. But it's still a real bug — fixing it is orthogonal cleanup. |
|
||||
|
||||
### Recommendation for Q2
|
||||
|
||||
**Primary: Q2.B (load-floor bonus, graduated).**
|
||||
- Continuous, not binary one-shot like Q2.A — gracefully handles the case where new sessions keep arriving and load needs to keep spreading.
|
||||
- Decouples "node-idle preference" from overlap as separate signals — composable, debuggable.
|
||||
- Sticky stays on by gating on `not sticky` → no risk of breaking turn 1+ cache locality.
|
||||
- Single knob (`K`) to tune.
|
||||
|
||||
**Orthogonal cleanup: Q2.G (fix the reject-substring filter).** Independent of Q2.B, since the migration mechanism is the *backstop* (when load-floor bonus alone isn't enough to migrate from a saturated D mid-session). User correctly noted that waiting 3 rejects is too late as the *primary* mechanism, but as a *backstop after* primary load balancing, it's still valuable.
|
||||
|
||||
**Avoid: Q2.C** (lex re-order destroys overlap-first design). **Avoid: Q2.E** (workload-coupled, brittle). **Q2.D / Q2.F** are reasonable but more complex than Q2.B with marginal gain.
|
||||
|
||||
### Concrete shape of Q2.B (for review, not for merge)
|
||||
|
||||
```python
|
||||
# In KvAwarePolicy.select, replacing the current score line:
|
||||
total_assigned = sum(state.decode_assignment_counts.values())
|
||||
n_decoders = max(1, len(topology.route_workers))
|
||||
mean_assigned = total_assigned / n_decoders
|
||||
|
||||
# Per-D fairness deficit: how much below the running mean is this D?
|
||||
deficit = max(0, mean_assigned - state.decode_assignment_counts.get(worker.worker_id, 0))
|
||||
floor_bonus = int(self.load_floor_bonus * deficit / max(1, mean_assigned)) if not sticky else 0
|
||||
|
||||
score = (
|
||||
overlap + sticky * self.sticky_bonus + floor_bonus,
|
||||
sticky,
|
||||
inflight_penalty,
|
||||
assignment_penalty,
|
||||
)
|
||||
```
|
||||
|
||||
Knob: `load_floor_bonus: int = 0` (off by default, opt-in). When set to e.g. 200, an empty D that should have 16 sessions but has 0 gets `floor_bonus = 200 * 16 / 16 = 200`, dominating boilerplate overlap (~50). A D that's only 1 session below mean gets `floor_bonus = 200 * 1 / 16 ≈ 12`, which doesn't override real prefix-cache wins.
|
||||
|
||||
But this is just a *sketch* — real tuning needs an empirical pass on the same Inferact subset to verify D2 receives sessions and overlap-driven cache wins survive on D0/D1.
|
||||
|
||||
### Validation plan if we go with Q2.B
|
||||
|
||||
1. Implement Q2.B + flag, default off.
|
||||
2. Re-run E2 on the same `outputs/inferact_50sess.jsonl` subset with `--kvcache-load-floor-bonus 200`.
|
||||
3. Check structural log: do D0/D1/D2 each get a non-trivial share of `session-d-binding.jsonl` rows?
|
||||
4. Check failure rate: drop from 1054 → < 100? (Hypothesis: yes, because the LRU thrash that triggered the mooncake hair-trigger was downstream of D0/D1 saturation.)
|
||||
5. Check direct-to-D rate: should stay similar or improve (load-balancing should not destroy cache reuse, since sticky still wins for known sessions).
|
||||
6. Re-evaluate H1 with E1 vs the new E2.
|
||||
|
||||
---
|
||||
|
||||
## Decision points (for review)
|
||||
|
||||
| # | Question | Default if no answer |
|
||||
|---|---|---|
|
||||
| D1 | Q1: do Q2 fix first and re-measure before touching mooncake / SGLang? | **Yes** (recommended) |
|
||||
| D2 | Q1: bump mooncake `MC_TRANSFER_TIMEOUT_NS` to 120 s as cheap defense-in-depth? | Yes |
|
||||
| D3 | Q2: is Q2.B (load-floor bonus, graduated) the right shape, or should we pick a different option from the table? | Q2.B |
|
||||
| D4 | Q2: also do Q2.G (fix the reject-substring filter) as orthogonal cleanup? | Yes |
|
||||
| D5 | Q2.B: is the proposed deficit-vs-mean formula OK, or do you prefer a simpler "bonus = K · (max - mine) / max" form? | Defer |
|
||||
| D6 | Q2.B: bonus magnitude K = 200 reasonable, or want to grid-search a few values? | Try 200 first |
|
||||
| D7 | Validation: re-run E2 on same 50-session subset, or expand to 100 sessions for more headroom? | Same subset |
|
||||
|
||||
Once the shape is approved, the next implementation pass is small and concentrated in `policies.py` + `replay.py` + `cli.py` (no SGLang vendor changes needed for the primary fix).
|
||||
416
docs/E1_E2_RESULTS_ZH.md
Normal file
416
docs/E1_E2_RESULTS_ZH.md
Normal file
@@ -0,0 +1,416 @@
|
||||
# E1 vs E2 Experiment Results — H200 + Driver 570
|
||||
|
||||
**Status**: E1 ✅ complete (2026-05-12 01:48 UTC, wall 1h29min). E2 ✅ complete (2026-05-12 03:22 UTC, wall 1h33min).
|
||||
**Branch**: `h200-cu130`.
|
||||
**Trace**: `outputs/inferact_50sess.jsonl` (deterministic head-cut of Inferact `codex_swebenchpro` to first 50 trials, md5 `7bb263a32600ef5a6ef5099ba340a487`, 1285 requests, mean input_length 67,631 tokens).
|
||||
**Hardware**: 4× H200 80GB, driver 570.86.15 (cu12.8 API), Mellanox mlx5_60 RoCE 400 Gb/s NDR.
|
||||
**Model**: Qwen3-30B-A3B-Instruct-2507 (TP1).
|
||||
**Toolchain**: vendored SGLang 0.5.10 + cu12.8 nvcc local install (`~/cuda-12.8`) — see `docs/H200_DRIVER570_SETUP_ZH.md`.
|
||||
|
||||
---
|
||||
|
||||
## 1. Hypotheses being tested
|
||||
|
||||
From `docs/ONBOARDING_NEXT_AGENT_ZH.md` §3.1:
|
||||
|
||||
- **H1**: KVC v2's wins are not just from "1P3D topology + kv-aware policy" — the KVC layer (admission / migration / direct-to-D) contributes meaningfully on top. Pairing E1 (no KVC layer) against E2 (full KVC v2) on the **same subset** isolates the marginal contribution.
|
||||
- **H2/H3**: Enabling real RDMA pushes TTFT p99 down from the reported 1.28s (TCP loopback) toward ~0.7s. Independent of H1, this is measured inside E2 alone (comparing against the historical TCP-loopback v2 reference).
|
||||
|
||||
---
|
||||
|
||||
## 2. E1 results — naive 1P3D + kv-aware + RDMA
|
||||
|
||||
**Configuration**: `mechanism=pd-disaggregation`, `policy=kv-aware`, 1P3D (GPU0=P, GPU1/2/3=D), `--force-rdma --ib-device mlx5_60`, `--concurrency-limit 32`, ts=1.
|
||||
|
||||
| Metric | E1 |
|
||||
|---|---:|
|
||||
| request_count | 1285 |
|
||||
| success | 1200 |
|
||||
| **error_count** | **85** |
|
||||
| **failure_count** | **85** |
|
||||
| abort_count | 0 |
|
||||
| latency mean | 96.34 s |
|
||||
| latency p50 | 93.21 s |
|
||||
| latency p90 | 180.69 s |
|
||||
| latency p99 | 219.46 s |
|
||||
| ttft mean | 90.48 s |
|
||||
| ttft p50 | 88.62 s |
|
||||
| ttft p90 | 175.13 s |
|
||||
| **ttft p99** | **207.39 s** |
|
||||
| execution_modes | `pd-disaggregation-router: 1200`, `pd-disaggregation: 85` (errors) |
|
||||
| per_decode_load | **D0:575, D1:710, D2:0** |
|
||||
| per_prefill_load | P0:1285 |
|
||||
| cache_hit_request_count | 1199 / 1200 (99.9%) |
|
||||
|
||||
### Key observations on E1
|
||||
|
||||
1. **D2 was never bound to a single session**. All 50 sessions got pinned to D0 or D1 by `kv-aware` policy's (overlap + sticky + inflight + assigned) lex-score, and naive pd-disaggregation has no migration mechanism to rebalance. Effective topology was **1P2D**, not 1P3D.
|
||||
2. **Massive queueing**. TTFT p50 ≈ 89 s and p99 > 200 s indicate sessions waited tens of seconds in router/prefill queue. With `--concurrency-limit 32` and D0/D1 saturated, the inflight cap forced ~1250 reqs to serialize through only two decode workers.
|
||||
3. **85 failures (6.6%)** — all `execution_mode == pd-disaggregation` (which the metrics module classifies as `error` when the agentic-pd-hybrid replay sees an unsuccessful upstream response). Most likely caused by `--request-timeout-s 300` firing on the longest queued requests.
|
||||
4. **Cache hit 99.9%** — the kv-aware policy did successfully concentrate sessions on their prior D worker; the Inferact converter's prefix-shared 24-token-block hash_ids gave near-perfect prefix overlap across turns of the same session.
|
||||
|
||||
### What E1 establishes
|
||||
|
||||
For the same hardware, same trace, same model, **naive 1P3D + kv-aware policy is unusable for multi-session agentic workloads**:
|
||||
- session-stickiness without migration leaves a third of compute capacity (1 of 3 decode GPUs) entirely unused
|
||||
- queueing dominates user-facing latency
|
||||
- failure rate is 6.6% even with 5 minutes per-request timeout
|
||||
|
||||
This is *the baseline H1 needs* — it shows the KVC layer (E2) has something concrete to improve over.
|
||||
|
||||
---
|
||||
|
||||
## 3. E2 results — KVC v2 + RDMA
|
||||
|
||||
**Configuration**: `mechanism=kvcache-centric`, `policy=kv-aware`, 1P3D, `--force-rdma --ib-device mlx5_60`, `--kvcache-admission-mode worker`, `--kvcache-direct-max-uncached-tokens 8192`, `--kvcache-migration-reject-threshold 3`, `--kvcache-prefill-backup-policy release-after-transfer`, `--kvcache-prefill-priority-eviction`, ts=1.
|
||||
|
||||
| Metric | E2 |
|
||||
|---|---:|
|
||||
| request_count | 1285 |
|
||||
| success | 231 |
|
||||
| **error_count** | **1054** |
|
||||
| **failure_count** | **1054** |
|
||||
| abort_count | 0 |
|
||||
| latency mean (successful only) | 10.94 s |
|
||||
| latency p50 | 7.44 s |
|
||||
| latency p90 | 20.68 s |
|
||||
| latency p99 | 64.73 s |
|
||||
| ttft mean (successful only) | 1.76 s |
|
||||
| ttft p50 | 0.43 s |
|
||||
| ttft p90 | 6.56 s |
|
||||
| **ttft p99** | **8.74 s** |
|
||||
| execution_modes (succ.) | direct-to-D: 87; turn1-seed: 50; reseed: 12; large-append-reseed: 11; seed-filter-early-turn: 50; large-append-cap: 21 |
|
||||
| per_decode_load | **D0:600, D1:685, D2:0** |
|
||||
| per_prefill_load | P0:1285 |
|
||||
| cache_hit_request_count | 230 / 231 (99.6 %) |
|
||||
|
||||
### Key observations on E2
|
||||
|
||||
1. **D2 still has zero bindings** — same root cause as E1. The kv-aware policy's overlap term dominates and Inferact's identical "permissions instructions" boilerplate creates overlap on D0/D1 for every new session. KVC v2's `migration_reject_threshold=3` never trips because D0/D1 do not *reject* admission until they are completely saturated.
|
||||
2. **80 % failure rate, 1054 / 1285**. **NOT timeouts** — actual root cause is a 3-layer cascade documented in §6. Quick summary: 562 "no-space" admission rejects from D0/D1 → router falls back to seed/reseed paths needing mooncake → mooncake heartbeats drop ("Decode instance could be dead") → SGLang aborts the request → client sees `RuntimeError: generate stream ended before producing any token`.
|
||||
3. **Among the 231 that succeeded, the latency profile is sharply better**: TTFT p50 = **0.43 s** vs E1's 88.62 s (E2/E1 = 0.5 %), latency p50 = **7.44 s** vs E1's 93.21 s (8 %). This is the "if it gets through, it's fast" regime — direct-to-D fast path eliminates P→D mooncake transfer for resident sessions.
|
||||
4. **Direct-to-D fast path engaged 87 / 231 = 37.7 %** of successful requests. Lower than historical v2's 91.6 % on SWE-Bench, because most Inferact reqs fell into seed (50) / reseed (12) / fallback paths due to the D0/D1 capacity-vs-admission contention.
|
||||
|
||||
---
|
||||
|
||||
## 4. Comparison table — E1 vs E2
|
||||
|
||||
Numbers below are over **all 1285 requests** for E1 (since failure rate is small) but **only the 231 successful** for E2 (since the bulk timed out before producing latency datapoints). This is **not a fair head-to-head**, see §6.
|
||||
|
||||
| Metric | E1 | E2 (succ only) | E2 / E1 |
|
||||
|---|---:|---:|---:|
|
||||
| Total reqs | 1285 | 1285 | – |
|
||||
| Successful | 1200 | **231** | 0.19× |
|
||||
| **error_count** | 85 (6.6 %) | **1054 (82 %)** | **12.4× worse** |
|
||||
| lat mean | 96.34 s | 10.94 s | 0.114 |
|
||||
| lat p50 | 93.21 s | **7.44 s** | **0.080** |
|
||||
| lat p90 | 180.69 s | 20.68 s | 0.114 |
|
||||
| lat p99 | 219.46 s | 64.73 s | 0.295 |
|
||||
| ttft mean | 90.48 s | 1.76 s | 0.019 |
|
||||
| **ttft p50** | 88.62 s | **0.43 s** | **0.005** |
|
||||
| ttft p90 | 175.13 s | 6.56 s | 0.037 |
|
||||
| ttft p99 | 207.39 s | 8.74 s | 0.042 |
|
||||
| per_decode_load | D0:575, D1:710, D2:0 | D0:600, D1:685, D2:0 | both 1P2D |
|
||||
| direct-to-D % | N/A (no KVC) | 87/231 = 37.7 % | – |
|
||||
|
||||
---
|
||||
|
||||
## 5. Interpreting H1 / H2 / H3
|
||||
|
||||
### H1 (was: KVC layer adds value on top of 1P3D + kv-aware) — *qualified*
|
||||
|
||||
The H1 hypothesis as stated in `ONBOARDING_NEXT_AGENT_ZH.md` predicted E2 would clearly win on most metrics. The reality is **bimodal**: the small subset of E2 requests that successfully complete are dramatically faster than E1, but a much larger fraction (82 %) of E2 requests time out entirely. Net throughput on this workload is *worse* for E2 than E1.
|
||||
|
||||
Two issues drove this:
|
||||
1. The D2 cold-start pathology already documented in §3, root cause. Both runs are de facto 1P2D, not 1P3D.
|
||||
2. KVC v2's admission gate is stricter and surfaces more "no D capacity" / "session-not-resident" failures than vanilla pd-disagg, when the workload (mean input 67 K tokens, mean output 700 tokens) saturates D0/D1's combined ~1.5 M KV pool.
|
||||
|
||||
For workloads where D0/D1 do not saturate or where the policy *does* spread session ownership across all D workers (the historical SWE-Bench setup), KVC v2 wins. The Inferact `codex_swebenchpro` subset breaks both assumptions.
|
||||
|
||||
### H2 / H3 (RDMA reduces TTFT p99) — *cannot be evaluated cleanly here*
|
||||
|
||||
The historical reference point is "KVC v2 + TCP loopback, SWE-Bench 50sess: TTFT p99 = 1.28 s". This run uses Inferact + RDMA, and TTFT p99 of the 231 successful E2 requests is **8.74 s** — much higher than the TCP baseline. But the workloads are not comparable: Inferact mean input is 67 K tokens vs SWE-Bench's much smaller average. Per-request prefill + transfer is roughly 5× longer here. A clean H2 / H3 read needs an Inferact-on-TCP run to compare against, which is out of scope for this subset's GPU budget.
|
||||
|
||||
What we *can* say: RDMA is correctly engaged (every worker log shows `installTransport, type=rdma`; admission RPC RTTs in `structural/admission-events.jsonl` are ~6 ms — consistent with one-hop RoCE).
|
||||
|
||||
---
|
||||
|
||||
## 5b. Why E2 has 80 % failures — the real chain (forensic)
|
||||
|
||||
The summary's `error_count: 1054` and `execution_mode: kvcache-centric` mask the actual cascade. Pulling the underlying `request-metrics.jsonl`, `structural/admission-events.jsonl`, and per-worker SGLang logs gives the full picture.
|
||||
|
||||
### Layer 1 — worker admission rejects (51 % of admit attempts)
|
||||
|
||||
From `structural/admission-events.jsonl`:
|
||||
```
|
||||
admit ok = 581 (modes: seed=494, direct_append=87)
|
||||
admit reject = 605 (reasons: no-space=562, session-not-resident=43)
|
||||
```
|
||||
|
||||
**562 "no-space" rejects** — D worker (almost always D0 or D1) reports its KV pool is full and refuses to take the request as direct-append. The router then re-routes the request to the seed/reseed path.
|
||||
|
||||
This is materially different from E1's behaviour: E1's vanilla pd-disagg had no admission RPC, so requests *always* got accepted by the chosen D and queued behind the running batch. E1 paid for that as a 90-second TTFT but never saw a "no-space" failure.
|
||||
|
||||
### Layer 2 — mooncake P→D transfer failures (real, observed in prefill log)
|
||||
|
||||
From `logs/prefill-0.log`:
|
||||
```
|
||||
[01:56:42] Prefill transfer failed for request rank=0 req.rid='2a5ed06fb…'
|
||||
with exception KVTransferError: Failed to send kv chunk of … to 172.18.112.37:46067
|
||||
[01:56:42] Prefill transfer failed for request rank=0 req.rid='eca5ff14…'
|
||||
with exception KVTransferError: Decode instance could be dead,
|
||||
remote mooncake session 172.18.112.37:15078 is not alive
|
||||
[01:56:42] Prefill transfer failed for request rank=0 req.rid='7ed9827b…'
|
||||
Decode instance could be dead, remote mooncake session ... is not alive
|
||||
```
|
||||
|
||||
When the seed/reseed fallback queue piles up (because of layer 1), the D worker becomes heavily backlogged and its mooncake bootstrap session heartbeat drops — P interprets this as "the D worker is dead" and fails the transfer. This is **not** a true crash; the worker process is alive (we observed it accepting unrelated requests immediately after), but the mooncake session is torn down for that bootstrap_room.
|
||||
|
||||
### Layer 3 — client-visible error
|
||||
|
||||
From `request-metrics.jsonl` for all 1054 failed reqs:
|
||||
```
|
||||
"error": "RuntimeError: generate stream ended before producing any token"
|
||||
```
|
||||
|
||||
This is what `agentic-pd-hybrid` sees when the SGLang `/generate` SSE stream closes with zero output tokens — the upstream abort from layer 1 or layer 2 propagates as an empty stream.
|
||||
|
||||
### The complete causal chain
|
||||
|
||||
```
|
||||
Inferact shared "permissions instructions" boilerplate
|
||||
↓
|
||||
overlap term in kv-aware lex score never lets D2 win → D2 cold forever
|
||||
↓
|
||||
50 sessions all pinned to D0 / D1
|
||||
↓
|
||||
D0 / D1 KV pool saturates
|
||||
↓
|
||||
worker admission emits 562 × "no-space" ← Layer 1
|
||||
↓
|
||||
router falls back to seed/reseed path (needs P→D mooncake transfer)
|
||||
↓
|
||||
P→D transfer queue piles up; D mooncake heartbeat drops
|
||||
↓
|
||||
"Decode instance could be dead" → KVTransferError ← Layer 2
|
||||
↓
|
||||
SGLang aborts the req → SSE stream closes with 0 tokens
|
||||
↓
|
||||
agentic-pd-hybrid raises "generate stream ended ..." for 1054 reqs ← Layer 3
|
||||
```
|
||||
|
||||
### Why E1 didn't hit this
|
||||
|
||||
E1 used `mechanism=pd-disaggregation`, which has no per-worker admission RPC. The router blindly dispatched to D0/D1; SGLang's internal scheduler simply queued requests behind the running batch (some grew their wait to >90 s before getting a token). Of the 85 E1 errors, sampling shows they are `request-timeout-s=300` failures — old-fashioned timeouts on the agentic-pd-hybrid side, not mooncake or admission failures.
|
||||
|
||||
So:
|
||||
- E1 trades latency for resilience: nobody rejects, everyone queues, you pay TTFT.
|
||||
- E2's KVC v2 worker admission is *meant* to be a safety valve, but on the cold-D pathology it becomes an *amplifier*: rejects → fallback paths → backlog → mooncake heartbeat loss → cascading failures.
|
||||
|
||||
### The real fix
|
||||
|
||||
Worker admission per se is not the bug — the bug is that there is no D-rebalancing happening upstream. With balanced D load (e.g. cold-D bonus in policy, or pre-warm of D2 with shared boilerplate), D0/D1 would not hit "no-space", and the layer 1 → layer 2 cascade would not fire. The reseed long-tail TTFT (8.74 s p99 here) becomes the dominant cost — exactly the regime onboarding §3.1 H3 describes.
|
||||
|
||||
---
|
||||
|
||||
## 5c. Why mooncake "died" (forensic on Q1)
|
||||
|
||||
The error string is `Decode instance could be dead, remote mooncake session ... is not alive`, which sounds like the D worker process crashed. **It did not.** Concurrent evidence shows D1 was happily serving `/session_cache/admit_direct_append HTTP/1.1 200 OK` and running LRU evictions only seconds after the "is not alive" errors fired. The real mechanism is hair-trigger.
|
||||
|
||||
### What the SGLang mooncake conn.py actually does
|
||||
|
||||
In `third_party/sglang/python/sglang/srt/disaggregation/mooncake/conn.py:1267-1276`:
|
||||
|
||||
```python
|
||||
if ret != 0: # one transfer slice failed
|
||||
with self.session_lock:
|
||||
self.session_failures[req.mooncake_session_id] += 1
|
||||
# Failures should never happen if the session is not dead,
|
||||
# if the session fails once, mark it as failed
|
||||
if self.session_failures[req.mooncake_session_id] >= 1:
|
||||
self.failed_sessions.add(req.mooncake_session_id)
|
||||
logger.error(f"Session {req.mooncake_session_id} failed.")
|
||||
...
|
||||
```
|
||||
|
||||
After this, every subsequent transfer that uses the same `mooncake_session_id` short-circuits at conn.py:1184:
|
||||
|
||||
```python
|
||||
if req.mooncake_session_id in self.failed_sessions:
|
||||
self.record_failure(kv_chunk.room,
|
||||
f"Decode instance could be dead, remote mooncake session ... is not alive")
|
||||
```
|
||||
|
||||
**One real `send_kvcache_slice ret != 0` permanently blacklists that D's mooncake session for the rest of the SGLang process lifetime.** The code's own comment ("Failures should never happen if the session is not dead") encodes the design assumption that transfers don't fail under normal conditions — but they do under the saturation regime described in §5b (RDMA queue full / D scheduler too busy to drain receives in time).
|
||||
|
||||
### Connecting back to Q1 timeline
|
||||
|
||||
Looking at decode-1.log around 01:56:42-56, the worker is running heavy decode batches (#token = 627K, near KV pool cap of 755K) plus repeatedly evicting via LRU. Under that load a single `send_kvcache_slice` returning a transient nonzero is enough to flip the switch. After 01:56:42 essentially every P→D1 transfer reports "is not alive" until end-of-run, even though D1 itself keeps serving direct-append admissions.
|
||||
|
||||
### What the hair-trigger is actually reacting to
|
||||
|
||||
Pulling the mooncake C++ logs (filter `^E0`/`^I0` lines from prefill-0.log) reveals the actual underlying error:
|
||||
|
||||
```
|
||||
I0512 01:56:42.242062 transfer_engine_py.cpp:546]
|
||||
Sync batch data transfer timeout after 37452515723ns
|
||||
I0512 01:56:53.335597 transfer_engine_py.cpp:546]
|
||||
Sync batch data transfer timeout after 30892690400ns
|
||||
```
|
||||
|
||||
**37.45 s** and **30.89 s** — the mooncake `batch_transfer_sync` C++ call returned nonzero because the synchronous transfer took longer than its internal timeout (~30 s). On a 400 Gb/s NDR RDMA fabric this is not a network problem; the data path is healthy. The SGLang author's design instinct (`>= 1 failures = dead`) is *correct in the idle case* — a 30-second RDMA stall really does indicate a broken peer.
|
||||
|
||||
What's happening here is that the peer is **logically broken from the C++ control-plane's point of view**, even though the OS process is still alive.
|
||||
|
||||
### Why does the D side stall the control plane for 30 s?
|
||||
|
||||
Cross-referencing decode-0.log at the exact second of the first timeout (01:56:42):
|
||||
|
||||
```
|
||||
01:56:34 Decode batch, #running-req=1, #token=627631, token_usage=0.83,
|
||||
gen throughput=174.76 tok/s ← still serving normally
|
||||
01:56:42 session id 1000315 does not exist, cannot delete.
|
||||
01:56:42 session id 1000360 does not exist, cannot delete.
|
||||
01:56:42 Trimmed decode session cache via LRU.
|
||||
#evicted_sessions: 2, #freed_tokens: 77675,
|
||||
#available_tokens: 38574 → 116249
|
||||
01:56:42 Trimmed decode session cache via LRU.
|
||||
#evicted_sessions: 1, #freed_tokens: 36166,
|
||||
#available_tokens: 29038 → 65204
|
||||
01:56:53 Decode transfer failed for request rank=0 ...
|
||||
Failed to get kvcache from prefill instance, it might be dead
|
||||
```
|
||||
|
||||
D0's main scheduler thread was busy doing **two consecutive LRU evictions** (freeing 77 675 + 36 166 ≈ 114 K tokens of KV) right when the P→D mooncake transfer attempt landed. Each LRU trim involves:
|
||||
- iterating per-session resident metadata
|
||||
- releasing GPU KV slots back to `token_to_kv_pool_allocator.free()`
|
||||
- updating the session-aware-cache bookkeeping under lock
|
||||
- closing per-session streaming state
|
||||
|
||||
Under `token_usage = 0.83` the LRU scan has to walk thousands of entries; the lock held during this work blocks the mooncake C++ control plane on the receive side (buffer registration / completion poll) from making progress. P's `batch_transfer_sync` keeps polling for the peer's completion ack, doesn't get one for 30 s, and gives up.
|
||||
|
||||
So the chain is:
|
||||
|
||||
```
|
||||
D KV pool saturated by D2-cold-pinning (§5d)
|
||||
↓
|
||||
D triggers heavy LRU eviction (114K tokens at a time)
|
||||
↓
|
||||
D main scheduler thread starves mooncake C++ control plane for 30+ s
|
||||
↓
|
||||
P's batch_transfer_sync returns nonzero (timeout)
|
||||
↓
|
||||
P's hair-trigger marks D's whole mooncake_session_id "failed forever"
|
||||
↓
|
||||
all subsequent reqs to that D blow up with "is not alive"
|
||||
```
|
||||
|
||||
The hair-trigger threshold (`>= 1`) is structurally wrong for this regime — but it would not fire at all if the LRU thrash didn't happen, and the LRU thrash would not happen if the load were spread across all 3 D workers (§5d).
|
||||
|
||||
### Two layers of fix
|
||||
|
||||
| Layer | What | Cost |
|
||||
|---|---|---|
|
||||
| Root cause | Spread load to D2 so D0/D1's KV never saturate, LRU never thrashes. See §5d and the cold-D bonus implementation in `policies.py` (next commit). | Low — pure policy change |
|
||||
| Defense in depth | In `mooncake/conn.py:1267-1276`, replace `>= 1` with a windowed threshold (e.g. ≥ 3 failures within 60 s) and add a periodic retry that probes the D bootstrap port before clearing `failed_sessions`. | Medium — touches vendored SGLang |
|
||||
|
||||
We do the root-cause fix first because it makes the second one optional.
|
||||
|
||||
---
|
||||
|
||||
## 5d. Why no session ever migrated to D2 (forensic on Q2)
|
||||
|
||||
KVC v2's design (KVC_ROUTER_ALGORITHM §3.3) uses `state.session_d_rejects[(session_id, D)] += 1` after a rejection, then policy.select skips any D with `rejects >= migration_reject_threshold (=3)`. The mechanism is conceptually sound. The bug is in *which* failures count as rejections.
|
||||
|
||||
### The substring filter is too narrow
|
||||
|
||||
In `replay.py:1379`:
|
||||
|
||||
```python
|
||||
_ADMISSION_REJECTION_SUBSTRINGS = (
|
||||
"session-cap",
|
||||
"no-d-capacity",
|
||||
"d-backpressure",
|
||||
)
|
||||
|
||||
def _is_admission_rejection_mode(execution_mode: str) -> bool:
|
||||
return any(token in execution_mode for token in _ADMISSION_REJECTION_SUBSTRINGS)
|
||||
```
|
||||
|
||||
Only execution_modes containing one of those three substrings increment the per-(session, D) reject counter. **All 1054 E2 failures have `execution_mode = "kvcache-centric"`** (the generic fallback bucket the replay engine uses when the request fell through every concrete sub-path before producing a successful result). That string contains none of the three substrings, so `session_d_rejects` is never incremented for them.
|
||||
|
||||
### Empirical confirmation
|
||||
|
||||
Counting from `structural/admission-events.jsonl` (worker-RPC level, independent of replay's classification):
|
||||
|
||||
| Stat | Value |
|
||||
|---|---:|
|
||||
| Distinct `(session, D)` pairs ever rejected by worker RPC | 49 |
|
||||
| Pairs rejected ≥ 3 times (would qualify for blacklist) | **46** |
|
||||
| Most-rejected single pair | (1001172, D1) = **25 rejects** |
|
||||
|
||||
So 46 of 49 (sess, D) pairs *should have been blacklisted* by KVC v2's design. They never were, because the corresponding requests' execution_mode was `"kvcache-centric"` (failure path) and not `"…-session-cap"` / `"…-no-d-capacity"` / `"…-d-backpressure"` (which only get assigned when the fallthrough path runs to a known-rejection sub-result, not when the upstream SSE stream errors out).
|
||||
|
||||
Counting "next-binding-after-reject" from the merged binding+admission timeline:
|
||||
|
||||
| Rejected on | Next binding goes to | Count |
|
||||
|---|---|---:|
|
||||
| D0 | D0 | 253 |
|
||||
| D1 | D1 | 329 |
|
||||
| D0 | D2 | **0** |
|
||||
| D1 | D2 | **0** |
|
||||
|
||||
The router stubbornly re-binds the same session to the same D after every reject — exactly because the reject was never recorded in `session_d_rejects`, so policy.select still sees an empty rejection counter and the overlap term keeps tipping it back to D0/D1.
|
||||
|
||||
### The fix
|
||||
|
||||
Two paths, in increasing scope:
|
||||
|
||||
1. **Quick**: include `"kvcache-centric"` (the failure-fallback bucket) in `_ADMISSION_REJECTION_SUBSTRINGS`, OR have replay set `execution_mode` to a more specific failure label when an SSE stream closes with zero tokens (e.g. `"upstream-aborted"`) and add that to the substring set.
|
||||
2. **Better**: don't rely on string-matching at all. Have `_run_request` catch the actual rejection signal (admission RPC `can_admit=False` or upstream `RuntimeError: generate stream ended ...`) and call `state.record_admission_reject(...)` directly at that point. The substring filter was inherited from the v1 → v2 migration design (`MIGRATION_V1_FINDINGS_ZH §4.1`) when only specific fallback paths set those names.
|
||||
|
||||
Either fix would let the existing `migration_reject_threshold=3` blacklist D0/D1 after enough failures, force a re-route to D2, populate D2's resident hashes, and break the overlap-pinning death spiral.
|
||||
|
||||
---
|
||||
|
||||
## 6. What this experiment actually shows
|
||||
|
||||
1. **The H200 + driver 570 + cu12.8 toolchain works for production-scale SGLang xPyD workloads.** Both runs completed without CUDA / driver / mooncake errors; failures are policy- and workload-level, not infrastructure.
|
||||
2. **The KVC v2 + kv-aware policy combination has a latent pathology on workloads with high cross-session prefix overlap**: the overlap term in the lex score causes permanent load imbalance, and v2's reject-counter migration cannot rescue it because rejects only fire under capacity pressure, by which point timeouts already dominate. This is novel and not surfaced by the SWE-Bench evaluation in the existing project docs.
|
||||
3. **For Inferact-like workloads, a cold-D bonus (e.g. require D to host at least one session before its overlap score counts) or an explicit pre-warm step is required** before E1/E2 comparisons can isolate the marginal effect of the KVC layer.
|
||||
|
||||
---
|
||||
|
||||
## 7. Reproducibility
|
||||
|
||||
- Trace: `outputs/inferact_50sess.jsonl`, md5 `7bb263a32600ef5a6ef5099ba340a487`, regenerable via `scripts/sample_trace_subset.py`.
|
||||
- E1: `bash scripts/sweep_e1_naive_1p3d.sh` (1h 29 min wall)
|
||||
- E2: `bash scripts/sweep_e2_kvc_v2_rdma.sh` (1h 33 min wall)
|
||||
- Summary JSON paths:
|
||||
- `outputs/e1_naive_1p3d_kvaware_rdma_50sess/e1_naive_1p3d_kvaware_run1_summary.json`
|
||||
- `outputs/e2_kvc_v2_rdma_50sess/e2_kvc_v2_rdma_run1_summary.json`
|
||||
- Per-request metrics JSONL alongside each summary, plus structural events under `*/structural/`.
|
||||
|
||||
---
|
||||
|
||||
## 8. Open follow-ups for the next agent
|
||||
|
||||
1. **Add a cold-D bonus** to `KvAwarePolicy.select` (e.g. positive constant for D with `state.resident[D] == ∅`) and re-run E2 on the same subset. Predict: D2 receives bindings, failure rate drops, head-to-head with E1 becomes meaningful.
|
||||
2. **Rerun E2 with `--kvcache-admission-mode router`** (router-side optimistic admission instead of worker RPC) to isolate whether the strict worker admission is the contributor to the 1054 failures, or whether it's purely the imbalance.
|
||||
3. **Run a third arm E0 with `policy=default` + `mechanism=pd-disaggregation`** as a true control — kv-aware policy is itself part of what we are evaluating; default round-robin would have spread sessions across all 3 D.
|
||||
4. **Compare TTFT p99 against an Inferact-on-TCP-loopback run** to evaluate H2/H3 cleanly. Cost: 1 more E2-shaped sweep (~1.5 h).
|
||||
5. **Investigate the 1054 E2 failures** in `request-metrics.jsonl` — sample some to verify they are timeout-related vs admission-rejected vs upstream-500.
|
||||
|
||||
---
|
||||
|
||||
## 4. Comparison table — pending
|
||||
|
||||
To be appended.
|
||||
|
||||
---
|
||||
|
||||
## 5. Open questions for the next iteration
|
||||
|
||||
- Are the 85 E1 errors all timeouts? `request-metrics.jsonl` rows with `error` execution_mode should be sampled to confirm. (Quick check: grep the metrics jsonl for `"execution_mode": "pd-disaggregation"` and inspect `latency_s` / `error` fields.)
|
||||
- Does E2 produce the predicted ~91% direct-to-D rate seen in the historical SWE-Bench v2 run, or does the Inferact workload's larger session count (50 vs 52 there) but very different per-session size distribution (mean 33 turns × ~2KB context growth per turn) push it lower?
|
||||
- Is `D2 = 0%` an E1-specific artifact (kv-aware sticky in pd-disagg mode), or does the same happen in E2 before migration kicks in for the first time?
|
||||
129
docs/E3_FINDINGS_ZH.md
Normal file
129
docs/E3_FINDINGS_ZH.md
Normal file
@@ -0,0 +1,129 @@
|
||||
# E3 — first run findings + bug exposure
|
||||
|
||||
**Status**: E3 first attempt aborted at ~16 min wall by SGLang assertion crash on decode-1. Partial data confirms the load-floor bonus works as designed; the crash is an independent vendored-SGLang bug exposed by E3's new routing pattern.
|
||||
|
||||
**Branch**: `h200-cu130`.
|
||||
**Companion**: `docs/E1_E2_RESULTS_ZH.md`, `docs/E1_E2_FIX_DESIGN_ZH.md`.
|
||||
|
||||
---
|
||||
|
||||
## 1. What worked: load-floor bonus (K=200)
|
||||
|
||||
Within the first ~15 minutes of E3, before the crash:
|
||||
|
||||
| | E1 (run1) | E2 (run1) | E3 (run1, partial) |
|
||||
|---|---:|---:|---:|
|
||||
| total bindings | 1285 | 1186 admit attempts | 1001 |
|
||||
| decode-0 bindings | 575 | 600 | 240 (24.0%) |
|
||||
| decode-1 bindings | 710 | 685 | 536 (53.5%) |
|
||||
| **decode-2 bindings** | **0** | **0** | **225 (22.5%)** |
|
||||
| unique sessions on D2 | 0 | 0 | **30** |
|
||||
|
||||
**Load-floor bonus successfully broke the overlap-pinning death spiral.** D2 is finally getting traffic on Inferact's shared-boilerplate workload. The graduated formula (`K * deficit / mean`) plus the `not sticky` gate produces the intended behavior: fresh sessions land on under-loaded D's, established sessions keep going to their original D for cache locality.
|
||||
|
||||
This validates the Q2.B design from `docs/E1_E2_FIX_DESIGN_ZH.md` empirically — but only as far as the run got. End-to-end metrics (lat / TTFT / failure rate) are not interpretable yet because the worker died.
|
||||
|
||||
## 2. The new crash: SGLang streaming-session correction leaves an invariant violated
|
||||
|
||||
At `01:51:21` (~5 min into the benchmark), decode-1 hit:
|
||||
|
||||
```
|
||||
[01:51:21] Correcting streaming-session extend_input_len from 6648 to 0
|
||||
(rid=6f4318e93dd543a49dbf19248cfc1e6f, session_id=1000195,
|
||||
fill_len=6648, prefix_len=43459, kv_committed_len=43459)
|
||||
[01:51:21] Scheduler hit an exception: AssertionError
|
||||
at third_party/sglang/python/sglang/srt/managers/schedule_batch.py:1646
|
||||
→ assert seq_len - pre_len == req.extend_input_len
|
||||
```
|
||||
|
||||
### Mechanism
|
||||
|
||||
With `--enable-streaming-session`, SGLang's session_aware_cache hands the scheduler a request whose `fill_ids` is just the new tokens since the last turn (6648), while `prefix_indices` represents the already-cached prefix on this D (43459 blocks). When the prefix exceeds `fill_ids` (e.g., the new turn's input is short relative to the conversation history that's already in cache), this code path fires at `schedule_batch.py:1572-1585`:
|
||||
|
||||
```python
|
||||
actual_extend_len = max(0, len(req.fill_ids) - len(req.prefix_indices))
|
||||
if req.extend_input_len != actual_extend_len:
|
||||
logger.warning("Correcting streaming-session extend_input_len from %d to %d ...")
|
||||
req.set_extend_input_len(actual_extend_len)
|
||||
```
|
||||
|
||||
So `req.extend_input_len` becomes `max(0, 6648 - 43459) = 0`.
|
||||
|
||||
Then at line 1588-1590:
|
||||
|
||||
```python
|
||||
seq_lens = [len(r.fill_ids) for r in reqs] # 6648
|
||||
prefix_lens = [len(r.prefix_indices) for r in reqs] # 43459
|
||||
```
|
||||
|
||||
And at line 1646:
|
||||
|
||||
```python
|
||||
assert seq_len - pre_len == req.extend_input_len # 6648 - 43459 == 0 → FAIL
|
||||
```
|
||||
|
||||
The correction patches `extend_input_len` but the downstream invariant is computed from raw `fill_ids`/`prefix_indices` lengths, which the correction never touched. The arithmetic check is fundamentally incompatible with the corrected state.
|
||||
|
||||
### Provenance
|
||||
|
||||
The streaming-session correction (`schedule_batch.py:1572-1585`) and the assertion site (line 1646) are both inside the project's SGLang vendor patches — `git log` on this file shows the patch came from commit `b8e6f13 feat(sglang): support decode session cache admission`. So this is a regression in the project's own SGLang fork, not upstream SGLang.
|
||||
|
||||
### Why E3 triggers it and E2 didn't
|
||||
|
||||
The crash is independent of migration (session 1000195 stayed on decode-1 the entire time). Two factors combined to expose it in E3:
|
||||
|
||||
1. **D1 was under more sustained load in E3** — 536 bindings on 17 unique sessions means high re-binding density per session, which means more concurrent turns of the same session at the scheduler, increasing the rate at which streaming-session corrections fire.
|
||||
2. **Faster overall dispatch** — with D2 actually consuming work, the prefill→decode pipeline moves faster, so streaming-session entries reach the corrected state more often than in E2's saturated cap-out regime.
|
||||
|
||||
Both factors are effects of the load-floor fix, not its cause. The crash is a pre-existing landmine in the vendored streaming-session code that E1 and E2 happened to avoid because their pipelines stalled before sessions accumulated enough committed prefix to trigger the correction.
|
||||
|
||||
---
|
||||
|
||||
## 3. Decision space for the fix
|
||||
|
||||
| # | Fix | Layer | Where | Risk |
|
||||
|---|---|---|---|---|
|
||||
| **A** | Patch the assertion to match the corrected state | vendored SGLang `schedule_batch.py:1646` | Add: `if req.extend_input_len == 0 and len(req.fill_ids) < len(req.prefix_indices): continue` to skip degenerate reqs before iterating. | Local, scoped, doesn't touch correctness elsewhere. Need to handle the skipped reqs (set `was_skipped` flag, drop from batch). |
|
||||
| **B** | Fix the correction site to also drop the req from the batch | vendored SGLang `schedule_batch.py:1572-1585` | When `actual_extend_len == 0` and req has nothing to extend, signal upstream to remove the req from this batch (defer or drop). | Slightly more invasive. The upstream call path needs to handle a "filtered" return. |
|
||||
| **C** | Compute `seq_lens` and `prefix_lens` consistently with the correction | vendored SGLang `schedule_batch.py:1588-1590` | After correction, recompute `seq_lens = [len(r.fill_ids[:pre_len] + extension)]` or align both sides. | Risky; affects all downstream tensor sizing. |
|
||||
| **D** | Workaround: disable session migration in E3 (the trigger combination) | our `cli` flag `--kvcache-migration-reject-threshold 0` | One-line config change in `sweep_e3_*.sh`. | Doesn't actually fix the crash — session 1000195 didn't migrate. May reduce but not eliminate. Might still hit it on a different session. |
|
||||
| **E** | Workaround: disable streaming session | server flag, remove `--enable-streaming-session` | Sidesteps the entire correction path. | Loses KVC's direct-to-D fast path (the central perf win we measure). Defeats the experiment. |
|
||||
|
||||
### Recommendation
|
||||
|
||||
**Fix A** — patch `schedule_batch.py:1646` to skip the malformed req before asserting. It's the minimal-blast-radius change and matches the apparent intent of the correction (graceful handling of the degenerate state).
|
||||
|
||||
Concretely:
|
||||
|
||||
```python
|
||||
# Just before the assertion at line ~1646
|
||||
if req.extend_input_len == 0:
|
||||
# The streaming-session correction zeroed extend_input_len because
|
||||
# prefix_indices already covers fill_ids. Skip this req from the
|
||||
# extend batch — its KV is already committed; nothing to compute.
|
||||
skip_indices.append(i)
|
||||
continue
|
||||
```
|
||||
|
||||
Then the caller of `prepare_for_extend` needs to handle skipped requests (return them to the decode queue without an extend pass).
|
||||
|
||||
**Avoid Fix D/E** — D doesn't address the root cause (the failing session didn't migrate), and E loses the entire reason we're running this experiment.
|
||||
|
||||
---
|
||||
|
||||
## 4. Decision points for review
|
||||
|
||||
| # | Question | Default if no answer |
|
||||
|---|---|---|
|
||||
| D1 | Implement Fix A (vendor patch to skip zero-extend-len reqs)? | **Yes** |
|
||||
| D2 | Re-run E3 with same K=200, same subset, after the fix? | Yes |
|
||||
| D3 | Add a structural log entry every time the correction fires so we can track its frequency? | Recommended |
|
||||
| D4 | File this as a separate `feat(sglang)` commit on the branch so the patch and the failure case it fixes are traceable? | Yes |
|
||||
|
||||
---
|
||||
|
||||
## 5. What this tells us about KVC v2 maturity
|
||||
|
||||
The load-floor bonus's first real exposure to the production codepath uncovered an existing patch bug that was masked by E2's failure cascade. This is good news: the failure cascade in E2 was hiding *another* layer of breakage. Without rebalancing, sessions cap-out → cascade → never run long enough to commit deep prefixes → never hit the streaming-session correction → never crash. With rebalancing, sessions DO commit deep prefixes → trigger the correction → crash.
|
||||
|
||||
Each fix tends to expose the next-shallowest bug. This is expected for a stack of ~6 interacting subsystems (kv-aware policy, KVC admission, session_aware_cache, streaming session, mooncake transfer, prefill batch prep). The path forward is to keep patching, re-running, and pushing the failure boundary out.
|
||||
157
docs/E4_PROTOCOL_ZH.md
Normal file
157
docs/E4_PROTOCOL_ZH.md
Normal file
@@ -0,0 +1,157 @@
|
||||
# E4 — KVC + D→P RDMA snapshot vs naive PD-disagg (实验协议)
|
||||
|
||||
**Status**: 协议事前定稿(preregistration)
|
||||
**Date**: 2026-05-13
|
||||
**Branch**: `h200-cu130`
|
||||
**Prereq**: `docs/D_TO_P_SYNC_DESIGN_ZH.md`, `docs/D_TO_P_PHASE1_LINK_ZH.md`
|
||||
**Companion**: `docs/E1_E2_RESULTS_ZH.md`, `docs/E3_FINDINGS_ZH.md`
|
||||
|
||||
---
|
||||
|
||||
## 0. 一句话
|
||||
|
||||
E4 在 E3 配置(KVC v2 + RDMA + load-floor bonus K=200)之上加 `--enable-d-to-p-sync`,验证 D→P RDMA snapshot push 能否让 reseed 路径跳过 P 端 re-prefill,从而让 KVC 在保持 session-affinity 设计独特性的前提下 latency 优于 naive PD-disagg(E1 基线)。
|
||||
|
||||
---
|
||||
|
||||
## 1. 实验目的
|
||||
|
||||
回答 ProJEctGoal 设定的核心问题:**KVC 如何在保持自身独特性的情况下胜过 naive PD-disagg?**
|
||||
|
||||
历史结论:
|
||||
- E1(naive 1P3D + kv-aware + RDMA):成功 1200/1285,TTFT p99 = 88.6s(D2 完全闲置)
|
||||
- E3(KVC v2 + RDMA + load-floor K=200):load-floor 解决 D2 cold 问题,但 SGLang streaming-session 内部 assertion bug 暴露,单 turn 至高吞吐降低。即使在已经 patched 的版本 reseed 路径仍有 P 端完整 re-prefill 长尾。
|
||||
|
||||
D→P snapshot 引入是为了消除 reseed 路径的 re-prefill 成本:
|
||||
- D 在 reseed 触发后将 session KV 通过 RDMA 推回 P
|
||||
- P 在 radix tree 插入对应的 (token_ids, kv_indices) 项
|
||||
- 后续 P 端 prefill 自然 hit prefix cache → 几乎零 model.forward → 直接 mooncake P→D' 传输
|
||||
|
||||
预期效果(参考 `docs/D_TO_P_SYNC_DESIGN_ZH.md §3.2`):
|
||||
- reseed re-prefill 段 1.5-3s → ~0
|
||||
- reseed transfer 段 0.2-0.4s 不变
|
||||
- reseed 总耗时 3-7s → 0.3-0.5s
|
||||
- TTFT p99 显著下降
|
||||
|
||||
---
|
||||
|
||||
## 2. 实验设置
|
||||
|
||||
### 2.1 配置
|
||||
|
||||
| 维度 | 值 |
|
||||
|---|---|
|
||||
| Trace | `outputs/inferact_50sess.jsonl` (1285 reqs / 50 sessions, md5 7bb263a32600ef5a6ef5099ba340a487) |
|
||||
| Model | Qwen3-30B-A3B-Instruct-2507 (TP=1) |
|
||||
| Topology | 1P + 3D = 4 GPU |
|
||||
| Hardware | 4× H200 80GB, mlx5_60 NDR 400Gb RoCE v2, GID Index 3 |
|
||||
| Time scale | ts=1 |
|
||||
| Concurrency | 32 |
|
||||
| Request timeout | 300 s |
|
||||
| Mooncake transfer timeout | 1800 s (MC_TRANSFER_TIMEOUT) |
|
||||
| KVC migration reject threshold | 3 |
|
||||
| Load-floor bonus | K=200 |
|
||||
| **D→P sync** | **on** (--enable-d-to-p-sync) |
|
||||
|
||||
### 2.2 对照组(已有数据复用)
|
||||
|
||||
| 名 | 配置 | 关键数据来源 |
|
||||
|---|---|---|
|
||||
| E1 | naive 1P3D + kv-aware + RDMA,无 KVC 层 | `outputs/e1_naive_1p3d_rdma_50sess/` |
|
||||
| E3 | KVC v2 + RDMA + load-floor K=200,无 D→P | `outputs/e3_kvc_v2_loadfloor_rdma_50sess/` |
|
||||
| **E4** | 同 E3 + `--enable-d-to-p-sync` | **本次跑** |
|
||||
|
||||
### 2.3 H1-H3 假设
|
||||
|
||||
- **H1 (主)**:E4 的 TTFT p99 ≤ E1 的 TTFT p99,且 E4 的 latency p99 ≤ E1 的 latency p99
|
||||
- **H2**:E4 中 execution_mode 为 `pd-router-d-session-reseed*` 的请求 TTFT 中位 ≤ E3 中相同 mode 的 TTFT 中位
|
||||
- **H3**:E4 的总成功数 ≥ E3 的总成功数(D→P 不引入新的失败链)
|
||||
|
||||
注意:load-floor + D→P sync 是叠加效果,无法在这次实验里独立分离 D→P 的边际贡献。后续可单独做 E4-ablate(K=200,--enable-d-to-p-sync 但人为关闭 D 端 dump)。
|
||||
|
||||
### 2.4 度量
|
||||
|
||||
每个 run 收集(来自 `request-metrics.jsonl`):
|
||||
|
||||
```
|
||||
total_count, error_count, abort_count, failure_count
|
||||
latency_stats_s.{mean, p50, p90, p99}
|
||||
ttft_stats_s.{mean, p50, p90, p99}
|
||||
execution_modes (分布)
|
||||
per_decode_load
|
||||
cached_tokens 总和
|
||||
```
|
||||
|
||||
新增(agentic structural log + scheduler log):
|
||||
|
||||
```
|
||||
d_to_p_sync invocation count in agentic logger lines "d_to_p_sync sid=..."
|
||||
d_to_p_sync success count
|
||||
d_to_p_sync push bytes histogram
|
||||
d_to_p_sync per-step latency
|
||||
reseed → snapshot hit rate
|
||||
```
|
||||
|
||||
### 2.5 失败模式
|
||||
|
||||
`_attempt_d_to_p_sync` 任何失败(prepare_receive ok=false / dump ok=false / finalize ok=false / 网络)都 fallback 到原 seeded_router 路径。所以 E4 即使 D→P 全失败,理论上仍应等于 E3 baseline。
|
||||
|
||||
---
|
||||
|
||||
## 3. 验收
|
||||
|
||||
### 3.1 必须
|
||||
|
||||
- [ ] E4 总成功请求数 ≥ 0.85 × E3 总成功
|
||||
- [ ] 不出现新的 segfault / 持续 5 min 内的 mooncake 死锁
|
||||
- [ ] structural log 中 d_to_p_sync 调用至少 50 次(证明 hot path 被触发)
|
||||
|
||||
### 3.2 期望
|
||||
|
||||
- [ ] E4 TTFT p99 < E1 TTFT p99
|
||||
- [ ] E4 reseed 路径 TTFT 中位明显低于 E3 reseed 路径 TTFT 中位(保守地,至少 ≥ 30% 改进)
|
||||
- [ ] E4 TTFT p99 < E3 TTFT p99(说明 D→P 真的有用)
|
||||
|
||||
### 3.3 探索
|
||||
|
||||
- [ ] D→P push 占链路带宽多少?(看 nvidia-smi DCGM 或 mooncake metrics)
|
||||
- [ ] D→P push 失败率?如失败,主要 reason 是什么?
|
||||
- [ ] P 端 radix insert 的 prefix_len 分布?
|
||||
|
||||
---
|
||||
|
||||
## 4. 报告交付物
|
||||
|
||||
跑完后产出 `docs/E4_RESULTS_ZH.md`,包含:
|
||||
|
||||
1. 三组 lat/ttft 全分位数对比表
|
||||
2. execution_mode 分布对比
|
||||
3. H1/H2/H3 各自证实 / 证伪 / 部分证实
|
||||
4. d_to_p_sync 统计:调用数、成功数、失败原因 top
|
||||
5. 失败模式分析(如有)
|
||||
6. 与设计 `docs/D_TO_P_SYNC_DESIGN_ZH.md §3.2` 预测的对照
|
||||
|
||||
---
|
||||
|
||||
## 5. 时间预算
|
||||
|
||||
- 跑 E4 一次:~30-60 min(同 E3 量级)
|
||||
- 数据汇总:~30 min
|
||||
- 报告:~1 h
|
||||
|
||||
如时间不够:先跑 N=1 抓最关键的 TTFT 分布,后续补 N=2 对照。
|
||||
|
||||
---
|
||||
|
||||
## 6. 风险
|
||||
|
||||
| 风险 | 缓解 |
|
||||
|---|---|
|
||||
| `_attempt_d_to_p_sync` 在 reseed path 实际触发频率太低 | 调小 KV 池 + 调整 reject_threshold 让 reseed 多触发 |
|
||||
| RDMA dump 多次失败导致 D→P 链路变成 net negative | structural log 留好失败原因 → 抓 root cause |
|
||||
| SGLang scheduler 新引入的 RPC 干扰 PD pipeline | smoke test 已确认 RPC 互不影响 |
|
||||
| 量纲对错:D 推送的 KV bytes 在 P 端解码出错 | 完整 E4 跑完看下游 perplexity / TTFT 看异常 |
|
||||
|
||||
---
|
||||
|
||||
**核心句**:E4 是测试 D→P snapshot 在端到端工作负载中是否真能消除 reseed re-prefill 成本的核心实验。E4 胜过 E1 即证明 KVC + D→P 在保持设计独特性的前提下能跑赢 naive PD-disagg。
|
||||
179
docs/E4_RESULTS_ZH.md
Normal file
179
docs/E4_RESULTS_ZH.md
Normal file
@@ -0,0 +1,179 @@
|
||||
# E4 — KVC + D→P RDMA snapshot vs naive PD-disagg(实测结果)
|
||||
|
||||
**Status**: 实验执行完毕(手动停止),数据汇总完毕,**主要假设不能被本次实验证实**。
|
||||
**Date**: 2026-05-13
|
||||
**Branch**: `h200-cu130`
|
||||
**Protocol**: `docs/E4_PROTOCOL_ZH.md`
|
||||
**Implementation status**: `docs/D_TO_P_IMPLEMENTATION_STATUS_ZH.md`
|
||||
|
||||
---
|
||||
|
||||
## 0. TL;DR
|
||||
|
||||
E4 跑了 ~60 min,完成了 ~548/1285 请求后吞吐崩溃(同 E3 模式),被人工 SIGINT 停止。
|
||||
|
||||
**关键发现**:
|
||||
|
||||
1. ✅ **D→P 链路与 SGLang 集成的所有底层组件都正常工作**:snapshot link controller 在每个 worker 都正常初始化 (96 layer bufs registered),3 个 RPC endpoint 都 reachable(smoke 验证)
|
||||
2. ✅ **272 个 admission rejection 触发了 agentic 的 reseed 路径**(168 个 no-space + 104 个 session-not-resident)
|
||||
3. ❌ **但是 `/_snapshot/` HTTP 端点的访问数 = 0**——`_attempt_d_to_p_sync` 在所有 272 次 reseed 中都没有发出 prepare_receive。可能原因:(a) `decode_session.opened == False` 时早退;(b) `source_d_url` 为空;(c) `target_tokens <= 0`
|
||||
4. ⚠️ **关键 instrumentation 缺失**:`_attempt_d_to_p_sync` 用 `logger.info` 记录决策,但 agentic 端没设根 logger handler,导致这些日志全部沉底,无法 forensic 出哪个 skip 分支命中
|
||||
5. ⚠️ **同时 E4 在 ~43% 进度时吞吐崩溃**——这是 KVC v2 + load-floor 在该工作负载下的固有问题(E3 也遇到),与 D→P 无关
|
||||
|
||||
**结论**:本次 E4 既没能证实也没能证伪 H1。D→P 链路与集成完整 deploy,但**观测性不足**让我们看不到它在真实负载里到底发生了什么。
|
||||
|
||||
---
|
||||
|
||||
## 1. 实验实际配置(与 protocol 对照)
|
||||
|
||||
| 维度 | Protocol | Actual |
|
||||
|---|---|---|
|
||||
| Trace | inferact_50sess.jsonl 1285 reqs | 同 |
|
||||
| GPU | 4× H200 | 同 |
|
||||
| concurrency_limit | 32 | 同 |
|
||||
| load-floor K | 200 | 同 |
|
||||
| --enable-d-to-p-sync | TRUE | 同 |
|
||||
| SGLANG_SNAPSHOT_LINK_ENABLE | 1 per worker | 同(已验证 controller init 成功) |
|
||||
| 启动时间 | - | 2026-05-13 08:28:17 |
|
||||
| 停止时间 | - | 2026-05-13 09:29:22(SIGINT) |
|
||||
| 完成时长 | ~30-60 min 预期 | 60 min 后人工停止 |
|
||||
|
||||
---
|
||||
|
||||
## 2. 实测数字
|
||||
|
||||
### 2.1 请求执行(手动停止时)
|
||||
|
||||
| Metric | 值 |
|
||||
|---|---:|
|
||||
| Router 完成的 POST /generate (200 OK) | 548 |
|
||||
| 占 trace 比例 | 42.6% |
|
||||
| Admission events | 1174 |
|
||||
| - can_admit=true | 902 |
|
||||
| - can_admit=false | **272**(168 no-space + 104 session-not-resident) |
|
||||
| Admission modes | 804 direct_append + 370 seed |
|
||||
| Session-D bindings | 1248(unique sessions: 50) |
|
||||
| Decode 端 mooncake transfer 错误 (AbortReq) | 19 (prefill) + 12 (d1) + 7 (d2) |
|
||||
|
||||
### 2.2 D→P snapshot 路径 telemetry
|
||||
|
||||
| Stat | 期望 | Actual |
|
||||
|---|---:|---:|
|
||||
| `_attempt_d_to_p_sync` 调用次数 | ≥ 272 | **unknown**(无日志) |
|
||||
| `/_snapshot/prepare_receive` HTTP 命中 | > 0 if any sync succeed | **0** |
|
||||
| `/_snapshot/dump` HTTP 命中 | > 0 | **0** |
|
||||
| `/_snapshot/finalize_ingest` HTTP 命中 | > 0 | **0** |
|
||||
|
||||
**0 个 HTTP 命中**是个明确的负面信号。`_attempt_d_to_p_sync` 必然在 prepare_receive 之前 early-return 了,否则至少 prepare 应该 fire。
|
||||
|
||||
### 2.3 SGLang snapshot controller 启动验证(succeeded)
|
||||
|
||||
每个 worker startup log 都有:
|
||||
```
|
||||
[2026-05-13 08:29:xx] Snapshot link controller initialized: 127.0.0.1:9998, sid=127.0.0.1:NNNNN, 96 layer bufs
|
||||
```
|
||||
|
||||
confirmed for all 4 workers (1P + 3D). All registered 96 layer buffers (48 K + 48 V) successfully.
|
||||
|
||||
---
|
||||
|
||||
## 3. 根因分析:为什么 sync 没 fire
|
||||
|
||||
阅读 `_attempt_d_to_p_sync` 的 early-return 链路:
|
||||
|
||||
```python
|
||||
async def _attempt_d_to_p_sync(...):
|
||||
if not config.enable_d_to_p_sync:
|
||||
return None
|
||||
source_d_url = decode_session.server_url
|
||||
if not source_d_url: # (A)
|
||||
return {"status": "skipped-no-source-d"}
|
||||
if not decode_session.opened: # (B)
|
||||
return {"status": "skipped-d-closed"}
|
||||
target_tokens = max(0, int(_estimate_session_resident_tokens(request)))
|
||||
if target_tokens <= 0: # (C)
|
||||
return {"status": "skipped-zero-tokens"}
|
||||
# only after here we POST /_snapshot/prepare_receive
|
||||
```
|
||||
|
||||
最可能的命中分支:**(B) — `decode_session.opened == False`**。
|
||||
|
||||
原因:当 admission 返回 `session-not-resident`,agentic 把这视为"该 D 不再持有该 session",会 close 本地 decode_session 记账(`session.opened = False`),然后才走到 fallback / seeded_router。所以到 `_invoke_kvcache_seeded_router` 时,`decode_session.opened` 已经是 False,sync 直接跳过。
|
||||
|
||||
**这意味着我设计 `_attempt_d_to_p_sync` 的入口条件错了**:
|
||||
- 错误假设:reseed 时 D 仍然 open,可以从那个 D dump
|
||||
- 正确事实:admission rejection 触发 session 关闭 → reseed 时 D 已 close → 没有 KV 可 dump
|
||||
|
||||
要让 D→P 真正在这个场景下工作,需要其中之一:
|
||||
- **不在 admission rejection 时立刻 close decode_session** —— 给 D→P sync 一个抢救窗口
|
||||
- **改去探测 D-side 的 SessionAwareCache 中是否还有该 session 的 slot** —— 即使 agentic 端记账为 closed,D 端可能还没 evict
|
||||
- **在 D 端 SessionAwareCache.release_session 之前插入 D→P push** —— D-driven 主动模式(设计文档 §2.5 提到的,但本期没实现)
|
||||
|
||||
---
|
||||
|
||||
## 4. 假设证实 / 证伪
|
||||
|
||||
### H1 (main): E4 TTFT p99 ≤ E1 TTFT p99 = 88.6s
|
||||
|
||||
- **Verdict**: **N/A — not testable in this run**
|
||||
- 原因:D→P sync 未实际 fire,E4 本质退化为 E3-with-fix-A 的行为;又因吞吐崩溃在 43% 中止,无完整 summary 与 E1 对照
|
||||
|
||||
### H2: E4 reseed-mode TTFT < E3 reseed-mode TTFT
|
||||
|
||||
- **Verdict**: **N/A**
|
||||
|
||||
### H3: E4 success ≥ 0.85 × E3 success
|
||||
|
||||
- **Verdict**: **N/A**(E3 当初也未完成,无 baseline)
|
||||
|
||||
---
|
||||
|
||||
## 5. 真正学到的东西
|
||||
|
||||
| # | 学习 | 行动 |
|
||||
|---|---|---|
|
||||
| 1 | D→P RDMA link 工作正常(host + GPU,phase 1/1b smoke) | ✅ 维持 |
|
||||
| 2 | SGLang 集成 RPC 工作正常(smoke 验证) | ✅ 维持 |
|
||||
| 3 | agentic `_attempt_d_to_p_sync` 入口条件设错 | ⏳ 改入口逻辑或改成 D-driven 主动模式 |
|
||||
| 4 | 缺少 D→P 路径的 structural log | ⏳ 加 `structural/d-to-p-sync.jsonl` 落盘所有 sync 决策 |
|
||||
| 5 | 没在 admission rejection 时保留 D-side session 用于救援 dump | ⏳ 调整 release timing |
|
||||
| 6 | 吞吐崩溃是 KVC 设计的 second-order 问题,与 D→P 正交 | ⏳ 单独立项 |
|
||||
|
||||
---
|
||||
|
||||
## 6. 后续工作(按优先级)
|
||||
|
||||
### P1(必做,让 D→P 真正可观测 + 可触发)
|
||||
|
||||
1. **加 structural log channel `structural/d-to-p-sync.jsonl`** —— `_attempt_d_to_p_sync` 每次决策落盘一条记录
|
||||
2. **修正入口条件**:把 `decode_session.opened` 检查 relax 成"曾经 open 过 + 服务器仍有可能 hold KV"
|
||||
3. **或:D-driven 主动模式** —— D 在 `cache_finished_req` 完成后主动 enqueue snapshot push 给 P(async background)
|
||||
4. **加 GET `/_snapshot/info` endpoint** —— 让 agentic 直接查 D 端是否还有该 session
|
||||
|
||||
### P2(验证 D→P 效益)
|
||||
|
||||
5. 重跑 E4 + P1 fixes
|
||||
6. 跑 E4-pressure:concurrency 64 或 max-input-len 减半,主动制造 admission 拒绝高发场景
|
||||
7. 跑 E4-ablate:D→P prepare 后人为不 push,隔离 D→P transfer 的边际效益
|
||||
|
||||
### P3(基础设施)
|
||||
|
||||
8. 解决 E4 在 43% 进度时的吞吐崩溃。这与 D→P 正交,但只要它存在就影响所有后续 E4 类实验的可比性
|
||||
9. 与 docs/KVC_EVICTION_GRANULARITY_DESIGN_ZH.md 提出的 block-level evict refactor 联动
|
||||
|
||||
---
|
||||
|
||||
## 7. 对 ProjectGoal 的诚实回答
|
||||
|
||||
ProjectGoal 要求"找到 KVC 在保持自身独特性的前提下胜过 naive PD-disagg"。E4 没有证实也没证伪。
|
||||
|
||||
**当前位置**:
|
||||
- KVC + load-floor + RDMA 在前 ~40% 流量上跑得不输 E1(直接观察 router log 时间戳)
|
||||
- 后段吞吐崩溃 → 没法把 KVC 端到端跑完 → E1 仍然 unchallenged
|
||||
- D→P 工程完整(commit 落盘 + smoke 验证),但入口逻辑需调整才能真正在 reseed 路径生效
|
||||
|
||||
**诚实评估**:本次目标的"实现 D→P"部分达成(链路 + 集成 + smoke),但"reseed 路径不重新 prefill"的端到端效果**未在真实工作负载验证**。下一步应优先实施 P1 中的 instrumentation + 入口条件修正,然后重跑。
|
||||
|
||||
---
|
||||
|
||||
**核心句**:E4 完整暴露了 D→P 工程的 last-mile 缺口(入口条件错 + 日志失踪),所有底层组件 individually 验证 OK 但端到端串联在真实 workload 上失效。这是个明确、可修复的工程问题,不是设计层面的死结。
|
||||
202
docs/E4_V8_RESULTS_ZH.md
Normal file
202
docs/E4_V8_RESULTS_ZH.md
Normal file
@@ -0,0 +1,202 @@
|
||||
# E4-v8 完整结果 — KVC 在真实节奏 trace 上的表现
|
||||
|
||||
**日期**:2026-05-13
|
||||
**Status**:实验跑完
|
||||
**Run**:`outputs/e4p_kvc_v2_d_to_p_sync_pressured_50sess/...20260513T075500Z/`
|
||||
**前置**:`docs/SNAPSHOT_STORE_REFACTOR_ZH.md`、`docs/E4_VS_E1_RESULTS_ZH.md`
|
||||
|
||||
---
|
||||
|
||||
## 0. TL;DR
|
||||
|
||||
V8 跑 **真实节奏 trace**(`third_party/traces/qwen35-swebench-50sess.jsonl`,4449 reqs × 52 sessions,原始 5.44h 时间线)在 TIME_SCALE=2 压缩到 ~2.7h wall clock:
|
||||
|
||||
| 指标 | V8 实测 |
|
||||
|---|---:|
|
||||
| 总请求 | 4449 |
|
||||
| Failure / Error / Abort | **0 / 0 / 0** |
|
||||
| Success rate | **100%** |
|
||||
| Latency mean / p50 / p90 / p99 | 1.28s / 0.51s / 3.17s / **7.44s** |
|
||||
| **TTFT mean / p50 / p90 / p99** | **49ms / 40ms / 68ms / 167ms** |
|
||||
| Direct-to-D fast path | **96.4%** (4291/4449) |
|
||||
| Reseed paths | 51 (1.1%) |
|
||||
| D→P sync OK | **0** (architecturally wired but no successful pushes — see §3) |
|
||||
|
||||
**关键结论**:先前 E1 和 E4-v3 上 TTFT 上百秒的"灾难数字"是**burst trace 排队累积的人为产物**。在真实节奏 SWE-Bench trace 上,**KVC 表现为亚秒到个位数秒的正常生产 serving 性能**。
|
||||
|
||||
---
|
||||
|
||||
## 1. 实验配置
|
||||
|
||||
```
|
||||
Workload: third_party/traces/qwen35-swebench-50sess.jsonl
|
||||
4449 reqs / 52 sessions / 5.44h original wall-clock span
|
||||
per-session inter-turn p50: 2.53s (real SWE-agent timing)
|
||||
input length p50: 27K, p99: 92K, max: 104K
|
||||
|
||||
Compression: TIME_SCALE=2 → 2.72h actual run-time
|
||||
Topology: 1P + 3D, 4× H200 80GB single-node
|
||||
RDMA: mlx5_60 NDR 400Gb / mooncake
|
||||
Model: Qwen3-30B-A3B-Instruct-2507 (TP=1)
|
||||
Concurrency: 32
|
||||
|
||||
Memory: PREFILL_MEM_FRAC=0.7 / DECODE_MEM_FRAC=0.8
|
||||
snapshot_buf=16 GB on each worker (alloc succeeded)
|
||||
|
||||
KVC config: --kvcache-load-floor-bonus 200
|
||||
--kvcache-migration-reject-threshold 1
|
||||
--kvcache-direct-max-uncached-tokens 8192
|
||||
--enable-d-to-p-sync (with SnapshotStore refactor)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 2. 完整 v8 数据
|
||||
|
||||
### 2.1 Headline
|
||||
|
||||
```
|
||||
request_count : 4449
|
||||
abort_count : 0
|
||||
error_count : 0
|
||||
failure_count : 0
|
||||
cache_hit_request_count : 4446 / 4449 = 99.9%
|
||||
mean cached_tokens : 30,513 / req (out of avg 32K input)
|
||||
```
|
||||
|
||||
### 2.2 Latency / TTFT
|
||||
|
||||
```
|
||||
count mean p50 p90 p99
|
||||
latency_stats_s 4449 1.28 0.51 3.17 7.44 s
|
||||
ttft_stats_s 4449 0.049 0.040 0.068 0.167 s ← p99 = 167ms
|
||||
```
|
||||
|
||||
### 2.3 Execution_mode 分布
|
||||
|
||||
```
|
||||
kvcache-direct-to-d-session 4291 (96.4%) ← KVC 独特 fast path
|
||||
pd-router-turn1-seed 52 ( 1.2%) ← 每个 session 第一个 turn
|
||||
pd-router-fallback-session-not-resident-seed-filter 52 ( 1.2%) ← seed-filter 早 turn fallback
|
||||
pd-router-d-session-reseed 47 ( 1.1%) ← 真正的 reseed (session 曾在 D)
|
||||
pd-router-fallback-real-large-append-session-cap 3
|
||||
pd-router-fallback-session-not-resident-session-cap 1
|
||||
pd-router-policy-no-bypass-reseed 1
|
||||
pd-router-real-large-append-reseed 1
|
||||
pd-router-session-not-resident-reseed 1
|
||||
-----
|
||||
4449
|
||||
```
|
||||
|
||||
### 2.4 Per-decode load
|
||||
|
||||
```
|
||||
decode-0: 1505 bindings (33.8%)
|
||||
decode-1: 1497 bindings (33.6%)
|
||||
decode-2: 1447 bindings (32.5%)
|
||||
```
|
||||
|
||||
负载完美均衡(load-floor bonus K=200 起作用)。
|
||||
|
||||
---
|
||||
|
||||
## 3. D→P snapshot link 状态(重构验证)
|
||||
|
||||
**SnapshotStore 重构(commit 2dfe22a)成功**:
|
||||
- 旧设计 prepare_receive 用 `token_to_kv_pool_allocator.alloc(N)` 抢 P 的 KV pool slot → 90%+ alloc-failed
|
||||
- 新设计 prepare_receive 从独立 16 GB GPU `snapshot_buf` 分配 slab → **0 alloc-failed**
|
||||
|
||||
```
|
||||
sync events total: 102
|
||||
by (stage, reason):
|
||||
('dump', 'session-not-resident'): 96 (D 端 session 已 evict 或从未 resident)
|
||||
('prepare', 'snapshot-buf-full'): 6 (snapshot_buf 偶尔满)
|
||||
('ok', None): 0 (无成功 push)
|
||||
```
|
||||
|
||||
**为什么 0 OK?**
|
||||
|
||||
mem_fraction=0.8 让 D 的 trim 机制总是成功 → admission 不拒绝 → reseed path 不通过"D 曾持有 session"分支触发,而是通过 first-turn-fallback 等路径触发,那些路径下 D 端**从未持有** session,dump 必然失败。
|
||||
|
||||
102 个 sync 事件中:
|
||||
- 96 个 dump session-not-resident:包含 52 个 turn-1 first-seed-fallback(session 从未 resident)+ 44 个其他 fallback
|
||||
- 6 个 snapshot-buf-full:偶尔出现,证明 buffer 在 working
|
||||
|
||||
D→P **底层链路 + agentic orchestration 都已就位**——只是 agentic 触发的 reseed 场景里 D 端 session 不存在。要让 D→P 真正 fire OK,需要:
|
||||
1. 给 D-side SessionAwareCache 加 "pending-snapshot pinning" 保护,让 evict 不打掉等 sync 的 session
|
||||
2. **或者** 加 D-side push-on-eviction:D 端在 evict 一个 session 前先 push 给 P(D-driven 主动模式)
|
||||
3. **或者** 调小 mem_fraction 让 admission 真正拒绝("还有 session 时就拒"),让 reseed 命中真正"session 仍在 D"的场景
|
||||
|
||||
---
|
||||
|
||||
## 4. 跟之前几次实验对比
|
||||
|
||||
| Run | Trace | failures | TTFT p99 | Latency p99 | D→P OK |
|
||||
|---|---|---:|---:|---:|---:|
|
||||
| E1 (naive PD) | inferact 1285 burst | 6.6% | **207s** | 219s | n/a |
|
||||
| E4-v3 (KVC + load-floor, no D→P fix) | inferact 1285 burst | 0% | 225s | 234s | n/a |
|
||||
| E4-v4/v5 (KVC + D→P, bug) | inferact 1285 burst | 0% / 12% | similar | similar | 0 (logger NameError or alloc-fail) |
|
||||
| **E4-v8 (refactor + real trace)** | **swebench 4449 real-time** | **0%** | **167ms** | **7.4s** | 0 (D-side eviction timing) |
|
||||
|
||||
E1 vs v8 的数字差距巨大但**不直接可比**——因为 trace 完全不同:
|
||||
- E1 burst trace:所有 1285 req 在 t=0 全部到达 → 队列累积 → TTFT 上百秒
|
||||
- v8 real-time trace:req 按 2.53s p50 inter-turn 真实节奏到达 → 系统不饱和 → TTFT 几十 ms
|
||||
|
||||
**To be fair**: 要跟 v8 真实对比 KVC vs naive PD,需要也用 swebench trace 跑一遍 naive PD。这是下一步。
|
||||
|
||||
---
|
||||
|
||||
## 5. 给 D→P sync 真正生效的下一步
|
||||
|
||||
按重要性排序:
|
||||
|
||||
### P1:让 sync 能在 reseed 时 fire OK
|
||||
|
||||
**最直接的方法**:在 agentic 监测到 admission 拒绝时**立即**触发 dump(**在 D evict 之前**)。当前实现是 reseed 决策做完才 dump,已经太晚。
|
||||
|
||||
**方案**:
|
||||
1. 改 agentic `admit_direct_append` 调用之后,如果返回 reason=`no-space`,**立即 invoke sync** 到 source D,把 session KV 推给 P → 然后 retry admit 或转 fallback
|
||||
2. 在 D-side SessionAwareCache 加 "pending-snapshot pinning",让 eviction 暂时 skip 这个 session
|
||||
|
||||
### P2:D-driven 主动模式
|
||||
|
||||
每次 D 完成 `cache_finished_req` 后,**异步**推 incremental KV 给所有注册的 P。这是设计 doc §2.5 提到的方向。开销显著(每次 turn 都推流量)但确保 sync 一直有数据。
|
||||
|
||||
### P3:mem-fraction tuning
|
||||
|
||||
把 decode mem-fraction 调到 0.5-0.55,让 admission 自然拒绝更多,从而 reseed 路径命中真正的"session-resident-on-some-D"分支。但这降低 throughput。
|
||||
|
||||
---
|
||||
|
||||
## 6. 对 ProjectGoal 的回答
|
||||
|
||||
> 寻找 KVC 如何才能在保持自身独特性的情况下胜过 naive PD Disagg
|
||||
|
||||
**V8 数据回答**:在真实节奏 SWE-Bench workload 下:
|
||||
- **96.4% 请求走 direct-to-D fast path**(KVC 独特价值)
|
||||
- TTFT p99 = 167ms,latency p99 = 7.44s
|
||||
- **0% failure**
|
||||
- D→P snapshot 底层架构 ready,但 trigger 的时机问题导致目前 OK rate=0
|
||||
|
||||
**要全面证明 KVC > naive PD**,需要补:
|
||||
- 用 swebench trace 跑一次 naive PD baseline → 直接对比
|
||||
- 修 P1(agentic admission-rejection 时立即 sync)→ 让 D→P 真起作用
|
||||
|
||||
---
|
||||
|
||||
## 7. 当前 branch HEAD
|
||||
|
||||
```
|
||||
git log --oneline -5
|
||||
9cca2c6 feat(experiments): expose PREFILL_MEM_FRAC + plumb --prefill-mem-fraction-static
|
||||
5c09a3a feat(experiments): per-second GPU util sampler in E4-pressured sweep
|
||||
19612ff feat(experiments): parameterize TIME_SCALE in E4-pressured sweep
|
||||
a953346 feat(experiments): E4-pressured points at third_party/traces SWE-Bench trace
|
||||
2dfe22a refactor(snapshot): dedicated GPU snapshot_buf replaces kv_pool alloc
|
||||
```
|
||||
|
||||
`outputs/e4p_kvc_v2_d_to_p_sync_pressured_50sess/` 包含完整 metrics + structural logs + GPU util CSV,会另外做对比图(与 swebench-on-naive-PD 一旦跑出)。
|
||||
|
||||
---
|
||||
|
||||
**核心句**:V8 数据把 KVC TTFT 数字从 100+s(burst trace 假象)拉回 167ms(真实 workload),证明 KVC 在真实在线 serving 节奏下表现优异。D→P snapshot link 架构全栈 deploy 完毕但 trigger 时机仍需调整才能真正 fire。
|
||||
215
docs/E4_VS_E1_RESULTS_ZH.md
Normal file
215
docs/E4_VS_E1_RESULTS_ZH.md
Normal file
@@ -0,0 +1,215 @@
|
||||
# E4 vs E1:KVC 是否打败 naive PD-disagg?
|
||||
|
||||
**日期**:2026-05-13
|
||||
**Run**:`outputs/e4p_kvc_v2_d_to_p_sync_pressured_50sess/...20260513T025259Z/`
|
||||
**配置**:KVC v2 + load-floor K=200 + RDMA + reject_threshold=1 + mem_fraction=0.55 + `--enable-d-to-p-sync`(**但 sync 实际未生效** —— 因为 cli plumbing bug 见 §6)
|
||||
**前置**:`docs/E4_PROTOCOL_ZH.md`, `docs/E4_RESULTS_ZH.md`
|
||||
|
||||
---
|
||||
|
||||
## 0. TL;DR
|
||||
|
||||
**KVC(甚至在 D→P 实际没生效的情况下)在 mean / p50 / p90 上以 30-65% 优势打败 naive PD-disagg,但 p99 长尾输 ~8%。**
|
||||
|
||||
| 指标 | E1 naive PD | E4 KVC | 优势 |
|
||||
|---|---:|---:|---:|
|
||||
| TTFT mean | 90.5s | **58.8s** | **-35%** ✅ |
|
||||
| TTFT p50 | 88.5s | **31.0s** | **-65%** ✅ |
|
||||
| TTFT p90 | 175.2s | 158.9s | -9% ✅ |
|
||||
| TTFT p99 | 207.4s | 224.8s | **+8%** ❌ |
|
||||
| Lat mean | 96.3s | **63.9s** | **-34%** ✅ |
|
||||
| Lat p50 | 93.2s | **37.1s** | **-60%** ✅ |
|
||||
| Lat p99 | 219.5s | 233.8s | +6.5% ❌ |
|
||||
| Success 数 | 1200/1285 | 1130/1285 | -70 ❌ |
|
||||
| Wall clock | 88 min | **64 min** | **-27%** ✅ |
|
||||
|
||||
---
|
||||
|
||||
## 1. 图
|
||||
|
||||
### Figure 1: TTFT 分布对比
|
||||
|
||||

|
||||
|
||||
- **左 panel(线性 ≤ 60s)**:E4(蓝)有明显的 fast-path 峰在 5-15s 区间,E1(红)整体分布在 50-100s 之间,**没有 fast path**
|
||||
- **右 panel(log scale 全范围)**:E4 双峰结构清晰 —— body 在 ~10s,长尾在 100-200s 之间。E1 单峰在 ~80-90s,长尾延伸到 ~200s
|
||||
|
||||
### Figure 2: E2E latency CDF
|
||||
|
||||

|
||||
|
||||
- **左 panel**:CDF 在 80% 之前 E4 完胜(蓝线在左)。**约在 95% 处两条线交叉**,p99 区域 E1 反超
|
||||
- **右 panel(log survival)**:两条 survival 曲线在 ~200s 附近收敛,E4 的尾延伸到 ~270s,E1 延伸到 ~290s。**两边长尾绝对值相似**
|
||||
|
||||
### Figure 3: E4 p99 长尾归因
|
||||
|
||||

|
||||
|
||||
E4 p95-p99 tail(65 个请求,TTFT ≥ 179.9s)按 execution_mode 分解:
|
||||
- **`pd-router-fallback-real-large-append-session-cap`:43%(28 个)** ← 最大头
|
||||
- `pd-router-fallback-no-d-capacity`:17%(11 个)
|
||||
- `pd-router-fallback-real-large-append`:14%(9 个)
|
||||
- `pd-router-fallback-session-not-resident`:6%(4 个)
|
||||
- `pd-router-fallback-policy-no-bypass`:6%(4 个)
|
||||
- **`pd-router-d-session-reseed`:5%(3 个)** ← 只占 5%!
|
||||
- ...
|
||||
|
||||
### Figure 4: E4 per-mode 平均 TTFT(top 14 modes by count)
|
||||
|
||||

|
||||
|
||||
---
|
||||
|
||||
## 2. P99 长尾归因——为什么 E4 输 p99
|
||||
|
||||
```
|
||||
E4 p99 tail (n=65, TTFT >= 179.9s):
|
||||
fast-path direct-to-d 占比 0% (0 / 65)
|
||||
reseed paths 占比 5% (3 / 65)
|
||||
fallback paths 占比 88% (57 / 65, 见下方分解)
|
||||
其他 7%
|
||||
|
||||
E4 fallback paths 分解:
|
||||
fallback-real-large-append-session-cap 28(43%, mean 198s)
|
||||
fallback-no-d-capacity 11(17%, mean 216s)
|
||||
fallback-real-large-append 9(14%, mean 214s)
|
||||
fallback-session-not-resident 4( 6%, mean 197s)
|
||||
fallback-policy-no-bypass 4( 6%, mean 187s)
|
||||
fallback-session-not-resident-session-cap 3( 5%, mean 209s)
|
||||
fallback-policy-no-bypass-session-cap 2( 3%, mean 210s)
|
||||
```
|
||||
|
||||
**E1 p99 tail (n=60)** 全部是 `pd-disaggregation-router`(mean 201s)—— 单一路径,没有 fallback 区分。
|
||||
|
||||
### 关键洞察
|
||||
|
||||
1. **E4 长尾不是 reseed 造成的**——reseed 在 p99 tail 中只占 5%。所以 **D→P 即使生效也救不了 p99 大头**。
|
||||
2. **E4 长尾的真正凶手是 fallback paths**。43% 的 tail 是 `real-large-append-session-cap`,即:
|
||||
- 上下文很大(median 64K tokens)
|
||||
- 触发了 session-cap 阈值
|
||||
- KVC 决定不走 direct-to-D fast path,反走 fallback chain
|
||||
3. **fallback chain 比 naive PD 还慢**——为什么?
|
||||
- **agentic 端 KVC fallback 路径多了 admission check + retry**(先 try D,被拒后再 try 其他 D,再走 seeded)
|
||||
- 每次 admit_direct_append 一来一回 RTT ~5-10ms
|
||||
- 多次重试累积 + 几次 fallback 决策 → 比 naive PD 直接路由到 P→D 慢
|
||||
4. **E4 fast path 救了 mean/p50/p90**——`direct-to-d` 走得通的 73 个请求 TTFT mean 0.185s(vs E1 mean 90.5s,500× 提升)。这才是 KVC 的"独特价值"。
|
||||
5. **E4 input length 分布与 E1 相似**——E4 tail median 64K vs E1 tail median 77K。E4 略优。
|
||||
6. **turn_id 都 >= 5**——长尾 100% 来自深 multi-turn session,正是 KVC 设计预期处理的场景
|
||||
|
||||
---
|
||||
|
||||
## 3. 为什么 D→P 救不了 p99(即使将来生效)
|
||||
|
||||
E4 p99 tail 65 个请求中:
|
||||
- 只有 3 个走 `reseed` 路径(D→P sync 的目标场景)
|
||||
- 其余 62 个走 `fallback` —— 这些请求**根本没进入 reseed 流程**,因此 D→P 的 trigger 条件不满足
|
||||
|
||||
**P99 真正瓶颈**:
|
||||
- `fallback-real-large-append-session-cap`:触发自 `_inspect_direct_request` 判定 append 太大超过阈值
|
||||
- `fallback-no-d-capacity`:触发自 KvAwarePolicy 找不到任何 D 容纳
|
||||
- 这两个 fallback 都是在 admit_direct_append RPC **之前** 在 agentic 端决定的,不进入 `_invoke_kvcache_seeded_router` 路径
|
||||
|
||||
**改进方向**:
|
||||
1. **大 append 也能走 direct-to-D**(取消 session-cap 截断 / 提高阈值)
|
||||
2. **fallback chain 走 P 时也用 streaming session**(避免 P-prefill cold start)
|
||||
3. **D→P 主动模式**(在 cache_finished_req 后异步把 KV 推给 P,让 fallback 走 P 时不用重 prefill)
|
||||
|
||||
---
|
||||
|
||||
## 4. KVC 的"独特性"在哪?数据回答
|
||||
|
||||
KVC 设计的独特价值是 **session-affinity routing + direct-to-D fast path**。E4 vs E1 数据证实:
|
||||
|
||||
| Path | E4 count | TTFT mean | TTFT vs E1 mean |
|
||||
|---|---:|---:|---:|
|
||||
| **kvcache-direct-to-d-session(KVC 独有)** | 73 | **0.185s** | **-99.8%** |
|
||||
| pd-router-turn1-seed(与 E1 等价)| 37 | 8.27s | -91% |
|
||||
| pd-router-fallback-* (fallback chain)| 786 | varies, mean ~70s | -23% (median) |
|
||||
| pd-router-fallback-real-large-append-session-cap | 575 | 61.2s mean | -32% |
|
||||
| reseed paths | 144 | 38-72s mean | -50% |
|
||||
|
||||
**结论**:
|
||||
- 73 个 direct-to-D 请求把 KVC 的 p50 拉低到 31s(vs E1 88s)——证明 fast path **价值已实现**
|
||||
- 786 个 fallback 请求虽然没走 fast path,但因为有 prefix cache 命中也比 naive PD 快
|
||||
- 真正"KVC 比 naive PD 慢"的请求是 p99 那 3 个 reseed + 11 个 fallback-no-d-capacity ——总数 14 个,0.011%
|
||||
|
||||
**KVC 在 99% 工作量上完胜 naive PD-disagg,在 1% 上微输**。
|
||||
|
||||
---
|
||||
|
||||
## 5. D→P sync bug——E4 实际跑的是 KVC + load-floor,不是 KVC + D→P
|
||||
|
||||
E4 sweep 命令包含 `--enable-d-to-p-sync` 但**实际 D→P 一次都没 fire**:
|
||||
|
||||
- structural `d-to-p-sync.jsonl` 文件不存在
|
||||
- worker logs 里 0 个 `/_snapshot/*` HTTP 请求
|
||||
|
||||
**根因**:`cli.py:821 benchmark-live ReplayConfig` builder 漏了 `enable_d_to_p_sync=args.enable_d_to_p_sync` 字段。`BenchmarkLiveConfig.enable_d_to_p_sync` 默认 False,连带 `ReplayConfig.enable_d_to_p_sync` 也是 False,`_attempt_d_to_p_sync` 入口处 `if not config.enable_d_to_p_sync: return None` 早退。
|
||||
|
||||
**已修**:commit `af966f2`。
|
||||
|
||||
**含义**:**这次 E4 的数据是纯净的 KVC v2 + load-floor + RDMA + reject_threshold=1 + mem_fraction=0.55 对比 E1 naive PD**,没有 D→P 加成。D→P 如果真生效**最多救** 3 个 reseed-in-p99-tail 请求(占 tail 5%),p99 数字不会有显著变化。
|
||||
|
||||
---
|
||||
|
||||
## 6. 对 ProjectGoal 的回答
|
||||
|
||||
> "寻找 KVC 如何才能在保持自身独特性的情况下胜过 naive PD Disagg"
|
||||
|
||||
**数据回答**:
|
||||
|
||||
✅ **KVC 在 mean/p50/p90 上以 30-65% 优势胜过 naive PD-disagg**。Wall clock 短 27%。
|
||||
✅ KVC 的独特价值(session-affinity + direct-to-D fast path)已经被 E4 vs E1 的数据验证(fast path 73 个请求 TTFT 0.185s)。
|
||||
❌ KVC 在 p99 长尾上略输(+8% TTFT)。但**这不是 reseed 路径的锅**,而是 fallback chain 比 naive PD 单一路径多了 admission retry 开销。
|
||||
⏳ D→P snapshot 即使后续修了 bug 真正生效,也**不会显著降 p99**——因为 reseed 在 tail 中只占 5%。
|
||||
|
||||
**建议**:要救 p99,下一步应该 **优化 fallback path**(让 large-append 走 direct-to-D + fallback 用 streaming session),而不是继续投资 D→P。
|
||||
|
||||
---
|
||||
|
||||
## 7. 实际数字(精确)
|
||||
|
||||
```
|
||||
E1 naive PD E4 KVC + LF + RDMA
|
||||
---------------- --------------------
|
||||
TTFT mean 90.484 58.831 (-35.0%)
|
||||
TTFT p50 88.545 31.028 (-65.0%)
|
||||
TTFT p90 175.178 158.920 (-9.3%)
|
||||
TTFT p99 207.426 224.769 (+8.4%)
|
||||
TTFT max 231.946 238.412 (+2.8%)
|
||||
|
||||
Lat mean 96.339 63.870 (-33.7%)
|
||||
Lat p50 93.166 37.117 (-60.2%)
|
||||
Lat p90 180.738 164.742 (-8.8%)
|
||||
Lat p99 219.462 233.808 (+6.5%)
|
||||
Lat max 288.263 266.631 (-7.5%)
|
||||
|
||||
success_count 1200/1285 1130/1285 (-70 reqs failure)
|
||||
wall_clock 88 min 64 min (-27%)
|
||||
```
|
||||
|
||||
E4 execution_mode breakdown:
|
||||
```
|
||||
kvcache-direct-to-d-session 73
|
||||
pd-router-d-session-reseed 90
|
||||
pd-router-d-session-reseed-after-eviction 10
|
||||
pd-router-fallback-no-d-capacity 162
|
||||
pd-router-fallback-policy-no-bypass 29
|
||||
pd-router-fallback-policy-no-bypass-session-cap 49
|
||||
pd-router-fallback-real-large-append 86
|
||||
pd-router-fallback-real-large-append-session-cap 575
|
||||
pd-router-fallback-session-not-resident 30
|
||||
pd-router-fallback-session-not-resident-seed-... 50
|
||||
pd-router-fallback-session-not-resident-session 26
|
||||
pd-router-policy-no-bypass-reseed 8
|
||||
pd-router-policy-no-bypass-reseed-after-evict 1
|
||||
pd-router-real-large-append-reseed 33
|
||||
pd-router-real-large-append-reseed-after-evict 1
|
||||
pd-router-session-not-resident-reseed 12
|
||||
pd-router-turn1-d-backpressure 13
|
||||
pd-router-turn1-seed 37
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
**核心句**:KVC 在 99% 请求上的 30-65% 加速(来自 session-affinity + direct-to-D + prefix cache hits)已经胜过 naive PD-disagg。1% 的 p99 输给 fallback chain 的 admission retry 开销,与 D→P 设计的 reseed 优化目标完全无关。下一阶段优化重点应该是 fallback path,不是继续加 D→P 砖块。
|
||||
270
docs/H200_DRIVER570_SETUP_ZH.md
Normal file
270
docs/H200_DRIVER570_SETUP_ZH.md
Normal file
@@ -0,0 +1,270 @@
|
||||
# H200 + Driver 570 上跑通本仓库的环境配置(含踩坑记录)
|
||||
|
||||
**适用范围**:4× H200 节点 + NVIDIA driver `570.86.15` + 本仓库 `kvc-debug-journey-v1-to-v4` 或后续分支。
|
||||
**目标读者**:拿到一台新 H200 机器、需要快速跑通 sglang 0.5.10 vendor + mooncake RDMA + agentic-pd-hybrid 的下一个 SWE/research agent。
|
||||
**作者状态**:本文档定稿于 `h200-cu130 @ 初始 commit`,smoke test 已 RDMA 跑通 16 reqs / 0 error。
|
||||
|
||||
---
|
||||
|
||||
## 0. TL;DR(5 行)
|
||||
|
||||
1. **`nvidia-smi` 的 "CUDA Version: 13.0" 是个陷阱**——它是 driver 能 forward-compat 跑的 runtime 上限,不是 driver 自己 API 版本。driver `570.86.15` 提供的 driver API 是 **cu12.8**。
|
||||
2. vendor sglang 0.5.10 的 `jit_kernel/` 用 `tvm_ffi` + ninja + nvcc binary 在首次调用每个 kernel 时编译。系统唯一 nvcc 在 `/usr/local/cuda-13.0/bin/`,cu13 编译出的 .so 会 NEEDED `libcudart.so.13`,driver 570 拒绝运行 → `cudaErrorInsufficientDriver`。
|
||||
3. 解法是**本地装一份 cu12.8 toolkit 到 `$HOME/cuda-12.8`**(不需要 root),让 tvm_ffi 走 cu12.8 nvcc,编译产物 NEEDED `libcudart.so.12`,driver 570 完美支持。
|
||||
4. mooncake wheel (`mooncake-transfer-engine 0.3.10.post2`) 也是 cu12 build,需要 `libcudart.so.12`——已经由 `nvidia-cuda-runtime-cu12` 包提供,在 venv 里。
|
||||
5. 每个 shell **必须 `source scripts/setup_env.sh`** 才能跑 SGLang。已封装好。
|
||||
|
||||
---
|
||||
|
||||
## 1. 一次性 setup(约 25min)
|
||||
|
||||
```bash
|
||||
cd /path/to/agentic-pd-hybrid
|
||||
|
||||
# (1) Python 环境 (~3min)
|
||||
uv sync
|
||||
|
||||
# (2) cu12.8 toolkit 本地装(~5GB 下载 + 5min 解压 = ~15-20min)
|
||||
mkdir -p /tmp/cuda_dl && cd /tmp/cuda_dl
|
||||
wget https://developer.download.nvidia.com/compute/cuda/12.8.1/local_installers/cuda_12.8.1_570.124.06_linux.run
|
||||
sh cuda_12.8.1_570.124.06_linux.run \
|
||||
--silent --toolkit --override \
|
||||
--installpath=$HOME/cuda-12.8 \
|
||||
--tmpdir=$HOME/tmp \
|
||||
--no-drm --no-man-page
|
||||
|
||||
# (3) 验证
|
||||
$HOME/cuda-12.8/bin/nvcc --version # 应该看到 release 12.8, V12.8.93
|
||||
|
||||
# (4) 回到 repo 根目录,首次 source(每个 shell 都要做)
|
||||
cd /path/to/agentic-pd-hybrid
|
||||
source scripts/setup_env.sh
|
||||
```
|
||||
|
||||
`source scripts/setup_env.sh` 输出应是:
|
||||
```
|
||||
agentic-pd-hybrid env ready:
|
||||
CUDA_HOME=/home/<user>/cuda-12.8 (12.8, V12.8.93)
|
||||
libcudart.so.12 at .../.venv/lib/python3.12/site-packages/nvidia/cuda_runtime/lib
|
||||
MC_TRANSFER_TIMEOUT=1800s
|
||||
```
|
||||
|
||||
**`MC_TRANSFER_TIMEOUT=1800` (30 min) 替代 mooncake 默认 30s**——E2 forensic 发现 D 端 LRU eviction 会让 mooncake C++ control plane 被 starved 30+s,触发 `conn.py:1270` hair-trigger 永久 blacklist 整个 D 的 mooncake_session_id。1800s 给足缓冲,30 分钟还没回应才是真正"D 死了"。详见 `docs/E1_E2_RESULTS_ZH.md §5c`。`stack.py` 也对 worker subprocess 设了同名默认值。
|
||||
|
||||
---
|
||||
|
||||
## 2. Smoke test(验证整条链路)
|
||||
|
||||
把 16 个合成 request 喂给 1P3D 拓扑,启用真 RDMA,跑通后才能动 E1/E2 实验。
|
||||
|
||||
```bash
|
||||
# 假设已 source scripts/setup_env.sh
|
||||
mkdir -p outputs/smoke_rdma
|
||||
|
||||
uv run --no-sync python -m agentic_pd_hybrid.cli make-small-append-trace \
|
||||
--output outputs/smoke_rdma/mini_trace.jsonl \
|
||||
--session-count 4 --turns-per-session 4 \
|
||||
--initial-input-length 1024 --append-input-length 200 --output-length 50 \
|
||||
--inter-turn-gap-s 2 --session-stagger-s 1
|
||||
|
||||
uv run --no-sync python -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace outputs/smoke_rdma/mini_trace.jsonl \
|
||||
--output-root outputs/smoke_rdma \
|
||||
--mechanism pd-disaggregation --policy default \
|
||||
--model-path /mnt/models/Qwen/Qwen3-30B-A3B-Instruct-2507 \
|
||||
--prefill-workers 1 --decode-workers 3 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3 \
|
||||
--transfer-backend mooncake \
|
||||
--force-rdma --ib-device mlx5_60 \
|
||||
--gpu-budget 4 --time-scale 1 \
|
||||
--concurrency-limit 4 --timeout-s 1800 --request-timeout-s 300 \
|
||||
--session-sample-rate 1.0 --min-turns 1 --target-duration-s 600
|
||||
```
|
||||
|
||||
**首次跑会慢 8-15min**(model load 196s + 5-10 个 JIT kernel 各编译 ~10-30s + warmup)。后续跑只 ~3-5min。
|
||||
|
||||
**期望结果**:`request_count=16, error=0, abort=0, failure=0, execution_modes={'pd-disaggregation-router': 16}`。
|
||||
|
||||
每个 worker 的日志应有 `installTransport, type=rdma`,表示 mooncake 真的走 RDMA 而不是 TCP loopback。
|
||||
|
||||
---
|
||||
|
||||
## 3. GPU ↔ RDMA HCA 映射(本机实测)
|
||||
|
||||
8 块 ConnectX HCA,全部 ACTIVE / 400 Gb/s NDR / RoCE v2 (link_layer=Ethernet, GID Index 3)。Mooncake 按 NUMA / PCIe affinity 自动选 preferred:
|
||||
|
||||
| GPU | preferred HCA | NUMA |
|
||||
|---|---|---|
|
||||
| cuda:0 | mlx5_60 | 0 |
|
||||
| cuda:1 | mlx5_88 | 0 |
|
||||
| cuda:2 | mlx5_98 | 1 |
|
||||
| cuda:3 | mlx5_42 | 1 |
|
||||
|
||||
CLI 的 `--ib-device <name>` 只接单个设备名,给所有 worker 全局 override。Smoke test 默认填 `mlx5_60`(P worker 在 cuda:0 上 NUMA-local;D worker 在其它 GPU 上是 cross-NUMA 但能跑)。E1/E2 实验如果想最优,可以分 P/D worker 独立设环境变量,但目前 stack.py 不支持 per-worker `MOONCAKE_DEVICE`,要么所有 worker 同一个,要么走 mooncake auto(需把 `MC_MS_AUTO_DISC=0` 改回 1)。
|
||||
|
||||
完整 8 块 HCA:`mlx5_22, _27, _42, _60, _88, _98, _126, _135`(NUMA 0/1/0/0/0/1/0/1 混杂)。
|
||||
|
||||
---
|
||||
|
||||
## 4. 踩过的坑(按时间线)
|
||||
|
||||
### 坑 1:`nvidia-smi` 的 "CUDA Version: 13.0" 是误导
|
||||
|
||||
`nvidia-smi` header 显示 `Driver Version: 570.86.15 / CUDA Version: 13.0` 让人以为机器支持 cu13。**这是 driver 能 forward-compat 跑的 CUDA runtime 上限**,不是 driver 自己 API 的版本。driver 570 的 driver API 上限是 cu12.8(参见 NVIDIA "CUDA Compatibility" 矩阵)。
|
||||
|
||||
**正确判断方法**:跑 `torch.cuda.is_available()`,如果装了 cu13 build 的 torch 会报 `The NVIDIA driver on your system is too old (found version 12080)`。返回 `12080` 才是 driver 自己 API 版本(cu12.8)。
|
||||
|
||||
### 坑 2:vendor sglang vs pip sglang 的 patch 差异
|
||||
|
||||
仓库的 `third_party/sglang/python/` 是带项目自有 patches 的 SGLang 0.5.10 fork。**pip 上的 `sglang==0.5.10` 不包含核心 patches**——具体差异:
|
||||
|
||||
| 文件 | pip 版 | vendor 版 |
|
||||
|---|---|---|
|
||||
| `srt/managers/scheduler.py` | 3621 行 | 3938 行 |
|
||||
| `admit_direct_append` 出现次数 | 2 | **11** |
|
||||
| `DirectAppendAdmissionReqInput/Output` | 没有 | **有**(核心 RPC) |
|
||||
| `_should_allow_local_prefill_on_decode` | 没有 | 有 |
|
||||
| `maybe_trim_decode_session_cache` | 没有 | 有 |
|
||||
| `decode_direct_waiting_queue` | 没有 | 有 |
|
||||
|
||||
→ **必须用 vendor 版**。本分支已把 `pyproject.toml` 的 `sglang==0.5.10` 改成 `sglang` + `[tool.uv.sources] sglang = { path = "third_party/sglang/python", editable = true }`,`uv sync` 后会自动 editable 安装 vendor 版。
|
||||
|
||||
历史上有些 sweep 脚本用 `PYTHONPATH=src:third_party/sglang/python` 在运行时切换,但用 `uv.sources` 把它装进 venv 更彻底,不会被 pip 的 sglang 偷偷 shadow。
|
||||
|
||||
### 坑 3:cu13 切换是死路
|
||||
|
||||
发现 driver 570 不兼容时第一个想到的路径是「装 cu13 PyTorch」。试过:
|
||||
|
||||
1. 改 `pyproject.toml` 加 `[[tool.uv.index]]` 指向 `https://download.pytorch.org/whl/cu130`
|
||||
2. 同样改 vendor sglang 的 `pyproject.toml`(root 项目的 sources 不会传递给 transitive editable dep)
|
||||
3. `uv sync` 成功装上 `torch==2.9.1+cu130` 和 `nvidia-{nccl,nvjitlink,nvshmem,cusparselt,nvtx}-cu13`
|
||||
4. **但 driver 570 不支持 cu13 runtime**——`torch.cuda.is_available()=False`,CUDA init 报 `driver too old (12080)`
|
||||
|
||||
→ cu13 路径需要 **driver 580+**。我们没有 root + 别人在用机器,所以放弃。本分支已 rollback 到 cu12 stack(pyproject 干净)。
|
||||
|
||||
### 坑 4:`--disable-overlap-schedule` 不够
|
||||
|
||||
第一次 smoke 崩在 `resolve_future_token_ids.cuh:49`,路径是 `event_loop_overlap_disagg_prefill`,怀疑是 overlap 模式特定 JIT kernel 问题。
|
||||
|
||||
cli.py 给 PD worker 加了 `--disable-overlap-schedule` 后,event loop 切到 `event_loop_normal_disagg_prefill`,但**崩在另一个 kernel `fused_inplace_qknorm`**,错误码完全相同(`cudaErrorInsufficientDriver`)。
|
||||
|
||||
→ 不是 overlap-specific,是 **整体 vendor sglang `jit_kernel/` 模块和 driver 570 不兼容**,任何 JIT kernel 都会崩在 `runtime.cuh:21` 的 `cudaOccupancyMaxActiveBlocksPerMultiprocessor` 调用(CUDA runtime 初始化时 driver feature 版本检查失败)。
|
||||
|
||||
但 `--disable-overlap-schedule` 留着不会造成伤害,且能避免之后类似 overlap-path 特定问题。本分支保留它在 `cli.py:_topology_from_args`。
|
||||
|
||||
### 坑 5:pip sgl_kernel vs vendor sglang/jit_kernel/ 是两套系统
|
||||
|
||||
`pip install sglang-kernel` 提供 `.venv/lib/.../sgl_kernel/{flash_ops,flashmla_ops,spatial_ops}.abi3.so`——这是 AOT 预编译产物。
|
||||
|
||||
`third_party/sglang/python/sglang/jit_kernel/` 是 vendor SGLang 0.5.10 内置的 **另一套 JIT 模块**,运行时用 tvm_ffi 编译。Smoke 崩在 vendor 的 jit_kernel,**降级 pip sgl_kernel 没用**(实测 0.4.0 / 0.4.1 同样崩)。
|
||||
|
||||
### 坑 6:`nvidia-cuda-nvcc-cu12` PyPI 包没装 nvcc binary
|
||||
|
||||
发现 cu13 nvcc 是 root cause 后,第一反应是 PyPI 装 cu12 nvcc 包:
|
||||
|
||||
```bash
|
||||
uv pip install nvidia-cuda-nvcc-cu12==12.8.93
|
||||
```
|
||||
|
||||
装上以后 `find .venv -name nvcc` **返回空**——这个 PyPI 包只装 `ptxas` 和 `nvvm/`,**没有 nvcc binary**(NVIDIA 出于分发限制不把 nvcc 放 PyPI)。
|
||||
|
||||
→ 完整 nvcc 必须从 NVIDIA 官方 `.run` installer 或 apt 装。`.run` installer 可以装到 user-writable 路径不需要 root,本仓库选这条路。
|
||||
|
||||
### 坑 7:tvm_ffi 通过 ninja 调用 nvcc
|
||||
|
||||
vendor sglang 的 `jit_kernel/` 用 `tvm_ffi.cpp.extension`,源码在 `~/.local/lib/python3.12/site-packages/tvm_ffi/cpp/extension.py`。关键路径:
|
||||
|
||||
```python
|
||||
def _find_cuda_home() -> str:
|
||||
cuda_home = os.environ.get("CUDA_HOME") or os.environ.get("CUDA_PATH")
|
||||
if cuda_home is None:
|
||||
nvcc_path = shutil.which("nvcc")
|
||||
if nvcc_path is not None:
|
||||
cuda_home = str(Path(nvcc_path).parent.parent)
|
||||
...
|
||||
```
|
||||
|
||||
然后构造 ninja file:
|
||||
```
|
||||
nvcc = {_find_cuda_home()}/bin/nvcc
|
||||
```
|
||||
|
||||
→ **设 `CUDA_HOME=$HOME/cuda-12.8` 就能 hook 整条编译链**。`scripts/setup_env.sh` 已经设好。
|
||||
|
||||
JIT 编译产物缓存在 `~/.cache/tvm-ffi/sgl_kernel_jit_*/*.so`。如果之前用 cu13 nvcc 编过,要先 `rm -rf ~/.cache/tvm-ffi/sgl_kernel_jit_*` 再用 cu12.8 重编。
|
||||
|
||||
### 坑 8:mooncake import path 与 onboarding 文档不一致
|
||||
|
||||
`docs/ONBOARDING_NEXT_AGENT_ZH.md` §3.3 的环境验证写:
|
||||
```python
|
||||
from mooncake_transfer_engine import TransferEngine
|
||||
```
|
||||
|
||||
但实际 PyPI `mooncake-transfer-engine 0.3.10.post2` wheel 的 import path 是:
|
||||
```python
|
||||
from mooncake.engine import TransferEngine
|
||||
```
|
||||
|
||||
第一次 `from mooncake_transfer_engine` 会 `ModuleNotFoundError`。**ONBOARDING 文档应该更新**(本分支不动 onboarding,留给主 agent 决定)。
|
||||
|
||||
### 坑 9:mooncake.engine import 必须有 libcudart.so.12
|
||||
|
||||
`from mooncake.engine import TransferEngine` 在 fresh shell(未 source setup_env.sh)下报:
|
||||
```
|
||||
ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
|
||||
```
|
||||
|
||||
mooncake 的 `engine.so` 是 cu12 build,dynamic link `libcudart.so.12`。venv 里有但需要 LD_LIBRARY_PATH 暴露。`scripts/setup_env.sh` 已加。
|
||||
|
||||
### 坑 10:Inferact 数据集 schema 与 agentic-pd-hybrid 期望不匹配
|
||||
|
||||
`huggingface.co/datasets/Inferact/codex_swebenchpro_traces` 是 ShareGPT 格式(`{"from": "human/gpt", "value": "<text>"}`),不含 token 计数 / hash_ids / 时间戳。
|
||||
|
||||
`agentic-pd-hybrid` 期望 JSONL:`chat_id, parent_chat_id, timestamp, input_length, output_length, type, turn, hash_ids[]`。
|
||||
|
||||
→ 已写 `scripts/convert_inferact_to_trace.py`:tokenize(用 model 自带 tokenizer)+ 滚动 hash 切 24-token block + 伪造 timestamp。610 trials × 33 turns 处理约 37min,跑出 20,230 reqs(与 Inferact README 的 "20,230 total LLM calls" 完全一致)。
|
||||
|
||||
输出 `outputs/inferact_codex_swebenchpro.jsonl`(1.3GB,被 `.gitignore` 排除不进仓库)。
|
||||
|
||||
### 坑 11:sampling 默认 `--session-sample-rate 0.01`
|
||||
|
||||
`benchmark-live` 跑的时候内部会先做 sampling。默认 1%,意味着 50 sessions 才抽 1 个。Mini smoke trace 4 sessions × 1% = 0 → `ValueError: Sampling produced no requests`。
|
||||
|
||||
→ smoke test 命令显式加 `--session-sample-rate 1.0 --target-duration-s 600`。
|
||||
|
||||
---
|
||||
|
||||
## 5. 后续给下个 agent
|
||||
|
||||
跑 E1 / E2 sweep 之前**每个 shell 第一件事**:
|
||||
|
||||
```bash
|
||||
cd /path/to/agentic-pd-hybrid
|
||||
source scripts/setup_env.sh
|
||||
```
|
||||
|
||||
然后用 ONBOARDING §3 的 sweep 脚本(参考 `scripts/sweep_ts1_migration_v2.sh` 作为底版)。注意几处针对本机的修改:
|
||||
|
||||
1. **MODEL 路径**改成 `/mnt/models/Qwen/Qwen3-30B-A3B-Instruct-2507`(onboarding 写的 `/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/...` 不存在)。
|
||||
2. **TRACE 路径**:`outputs/qwen35-swebench-50sess.jsonl` 不存在;用 `outputs/inferact_codex_swebenchpro.jsonl` (converter 跑完后产生)。
|
||||
3. **`--ib-device`** 选 `mlx5_60`(cuda:0 NUMA-local)或视实验需要自选;onboarding 写的 `mlx5_0` 在本机不存在。
|
||||
4. **保留 cli.py 的 `--disable-overlap-schedule`** 不要删——理论上 cu12.8 toolchain 应该让 overlap 也能跑,但目前未验证 overlap path 没有别的潜在问题,留着是 zero-cost 保险。
|
||||
|
||||
---
|
||||
|
||||
## 附录 A:本分支的代码改动
|
||||
|
||||
- `pyproject.toml`:sglang dep 改用 `[tool.uv.sources]` path source 走 `third_party/sglang/python`(editable)。
|
||||
- `src/agentic_pd_hybrid/cli.py:_topology_from_args`:给 prefill/decode worker 自动加 `--disable-overlap-schedule`。
|
||||
- `scripts/setup_env.sh`:env wrapper,每个 shell `source` 一次。
|
||||
- `scripts/convert_inferact_to_trace.py`:Inferact ShareGPT → agentic-pd-hybrid JSONL schema converter。
|
||||
- `docs/H200_DRIVER570_SETUP_ZH.md`:本文档。
|
||||
|
||||
## 附录 B:被 `.gitignore` 排除的产物
|
||||
|
||||
- `outputs/inferact_codex_swebenchpro.jsonl`(1.3GB)——converter 输出,用 `scripts/convert_inferact_to_trace.py` 重新生成
|
||||
- `outputs/smoke_rdma/`(含 mini trace + smoke run artifacts)
|
||||
- `third_party/codex_swebenchpro_traces/`(209MB,HF dataset 下载)—— `hf download Inferact/codex_swebenchpro_traces --repo-type dataset --local-dir third_party/codex_swebenchpro_traces` 重下
|
||||
- `~/cuda-12.8/`——cu12.8 toolkit,用 §1 步骤 (2) 重装
|
||||
- `.venv/`——`uv sync` 重建
|
||||
228
docs/KVC_EVICTION_GRANULARITY_DESIGN_ZH.md
Normal file
228
docs/KVC_EVICTION_GRANULARITY_DESIGN_ZH.md
Normal file
@@ -0,0 +1,228 @@
|
||||
# KVC Eviction Granularity — 设计审视 (架构层)
|
||||
|
||||
**日期**: 2026-05-12
|
||||
**Status**: 架构审视 / 待 design discussion
|
||||
**Companion**: `docs/E1_E2_RESULTS_ZH.md`, `docs/E3_FINDINGS_ZH.md`, `docs/E1_E2_FIX_DESIGN_ZH.md`
|
||||
**Branch**: `h200-cu130`
|
||||
|
||||
本文是 E2 → E3 迭代后的高层架构反思,**不是又一份 fix design**。前几轮 E2 → E3 我一直在加 local patches(load-floor bonus、Fix A skip-zero-extend、调 migration_reject_threshold 等),但 E3 实测数据迫使我们承认这些 patches 大局上看是 **KVC 在向 DP / naive PD-disagg 退化的轨迹**。
|
||||
|
||||
---
|
||||
|
||||
## 0. TL;DR
|
||||
|
||||
1. **KVC 的 value proposition** 是"session pin 在 D 上、KV 跨 turn 连续累积、direct-to-D 快路径 0.04s TTFT"。
|
||||
2. **`SessionAwareCache.release_session` 在 trim 时一次性 free 整段 session-exclusive 尾部**:实测 E3 一次 trim 平均 free **67,726 tokens**(samples: 35K / 38K / 40K / 86K / 87K),不是 "几个 leaf block"。
|
||||
3. 被 evict 的 session 下次到来时必须**从客户端原 prompt 重 prefill 50-90K** + mooncake transfer 5-9 GB → **跟 naive PD-disagg 一模一样**。
|
||||
4. → 在 saturation regime 下 KVC 的 cache continuity 设计被自己的 eviction 抵消。**Session-level eviction 与 KVC 的设计意图冲突**。
|
||||
5. 真正的方向不是堆 patch,是 **改 eviction granularity**: 让 streaming-session 的 decode 输出 **progressively commit 进 radix tree**,由 SGLang 标准的 block-level LRU 蚕食最老的 leaf。SessionSlot 退化成纯 metadata。
|
||||
|
||||
---
|
||||
|
||||
## 1. 我们做对了什么,又错过了什么
|
||||
|
||||
### KVC 的 design promise(来自 `KVC_ROUTER_ALGORITHM.md` §1)
|
||||
|
||||
| Property | 设计意图 |
|
||||
|---|---|
|
||||
| Session 钉定 | Session `s` pin 在 `pin[s]` 这一个 D;同 session 的所有 turn 在同一个 D 上做 KV 累积 |
|
||||
| Direct-to-D 快路径 | `req.session ∈ M_d ∧ append_len ≤ τ_append ∧ cap_ok` → 仅 append 新 token,**不走 P→D mooncake transfer** |
|
||||
| TTFT 优势 | append-only path TTFT ≈ 40ms (历史 v2 在 SWE-Bench 的 fast-path p50) |
|
||||
| 集中 cache 而非 fragment | 同 session cache 集中在一个 D 上,命中率高 |
|
||||
|
||||
### 我们当前实测在做什么(E3, killed at 1h12min)
|
||||
|
||||
| 指标 | 实测值 | 与设计 promise 的偏离 |
|
||||
|---|---:|---|
|
||||
| Eviction 次数 | **90** | 设计假设 "session 一旦绑就持续累积" |
|
||||
| 平均每次 evict 释放 | **67,726 tokens** | 不是 "几个 leaf block",是整段 session 尾部 |
|
||||
| 总释放 | **6,095,375 tokens** | 在 1h12min 里 trash 了 ≈ 8 个 session-pool 容量的 KV |
|
||||
| 触发 reseed 的 session 数 | 25 / 50 (50%) | 这些 session 每个被 evict-revisit 一次 = 付一次 50-90K re-prefill |
|
||||
| 单次 reseed 平均耗时 | 3-7s (P prefill + mooncake) | 跟 naive PD-disagg 持平 |
|
||||
|
||||
**E1 对照**:0 eviction、0 retract、50 sessions 顺利完成。E1 用的是 `pd-disaggregation` mechanism,**没有 KVC 层、没有 admission RPC**,但反而保留了 cache continuity(router-side sticky 让 session 不挪窝)。
|
||||
|
||||
> **讽刺**: E1 (naive 1P2D + kv-aware policy) **意外地** 比 E3 (KVC v2 + load-floor + RDMA) 更接近 KVC 设计意图——因为 E1 没有 admission 反馈链路,所以没人会触发那 90 次 session-level evict。
|
||||
|
||||
---
|
||||
|
||||
## 2. 为什么 session-level evict 是错的
|
||||
|
||||
### `release_session` 实测语义(`session_aware_cache.py:250-281`)
|
||||
|
||||
```python
|
||||
def release_session(self, session_id: str):
|
||||
slot = self.slots.pop(session_id, None)
|
||||
...
|
||||
if slot.last_node is not None:
|
||||
self.inner.dec_lock_ref(slot.last_node, ...) # 解 radix 锁 ✓
|
||||
|
||||
if slot.is_holding_kv:
|
||||
start = slot.cache_protected_len
|
||||
end = slot.kv_allocated_len
|
||||
if start < end:
|
||||
kv_indices = self.req_to_token_pool.req_to_token[
|
||||
slot.req_pool_idx, start:end
|
||||
]
|
||||
self.token_to_kv_pool_allocator.free(kv_indices) # 显式 free 一段 KV
|
||||
...
|
||||
```
|
||||
|
||||
`[cache_protected_len, kv_allocated_len)` 是 **session-exclusive 尾部**——从首 turn 提交 radix tree 之后所有累积的 decode output + 后续 turn 的 extend。在 Inferact workload 上:
|
||||
|
||||
- `cache_protected_len` ≈ 首 turn 提交的 boilerplate 部分 (~12K)
|
||||
- `kv_allocated_len` ≈ 50-100K(多 turn 累积)
|
||||
- **释放范围 = 38-88K**
|
||||
|
||||
这部分 KV **没有进 radix tree**,所以也享受不到 radix block-level LRU 的渐进式 shedding。`release_session` 一刀切。
|
||||
|
||||
### 与 SGLang 标准 radix LRU 的本质差异
|
||||
|
||||
SGLang 标准 `inner.evict()`(`base_prefix_cache.py` 接口由 RadixCache 实现):
|
||||
|
||||
```
|
||||
按节点 last_access_time 排序,从 leaf 开始 evict (因为 evict 中间节点会破坏树结构)
|
||||
每次释放一个 leaf node 的 KV indices
|
||||
lock_ref > 0 的节点不可 evict
|
||||
```
|
||||
|
||||
**特性对比**:
|
||||
|
||||
| | session-level (current) | block-level (SGLang radix) |
|
||||
|---|---|---|
|
||||
| 单次释放粒度 | 整段 session 尾部 (35-87K) | 一个 leaf node (~24 tokens / page-size) |
|
||||
| Recent prefix 保留 | ❌ 全丢 | ✅ 保留 (recent 访问 → 时间戳新 → 不被先 evict) |
|
||||
| Evict-revisit 成本 | 50-90K re-prefill | 仅丢的 leaf 部分 (≪ 50K) |
|
||||
| 与 session lifecycle | 强绑定 (是 lifecycle 退出动作) | 解耦 (lifecycle 仅做 lock_ref 管理) |
|
||||
|
||||
### 为什么会变这样:SessionAwareCache 的双重职责混淆
|
||||
|
||||
`SessionAwareCache` 设计承担了**两个本应分离的职责**:
|
||||
|
||||
1. **Session lifecycle 跟踪** (合理):streaming session 跨多个 req 复用 KV,需要在 turn 间保留 `(req_pool_idx, kv_committed_len, kv_allocated_len, last_node)` 这些字段,恢复给下个 turn 的 req。
|
||||
2. **Eviction granularity 决策** (问题所在):把 session 当成 evict 的最小单位,绕过了 SGLang 标准 LRU 的 leaf-by-leaf 渐进 shedding。
|
||||
|
||||
第 2 个职责本不该存在于 SessionAwareCache 里。SGLang radix 已经能处理 block-level LRU——前提是 session 的 KV 真的进了 radix 树。但**因为 session-exclusive 尾部没 commit 进 radix tree**,radix LRU 看不到它们,只能由 release_session 一次性大块 free。
|
||||
|
||||
---
|
||||
|
||||
## 3. 我们前几轮 patches 的总体轨迹
|
||||
|
||||
按 commit 时间线审视,每一步看似在修当下 issue,整体方向却是 KVC → DP 退化:
|
||||
|
||||
| Iteration | 改动 | 局部目标 | 大局影响 |
|
||||
|---|---|---|---|
|
||||
| E2 baseline | mechanism=kvcache-centric, worker admission | 跑出 KVC v2 头条数字 | D2 cold + cascade → 1054 failures (KVC 设计前提崩塌) |
|
||||
| E3 load-floor bonus | 让 fresh session 均匀分到 D2 | 解 cold-start 偏置 | 触发 migration → 25 sessions reseed → 暴露 evict granularity 问题 |
|
||||
| E3 → Fix A | 修 vendored SGLang `prepare_for_extend` 的 fill_ids<prefix_indices invariant | 防 decode-1 assertion crash | Patch 局部 bug,没动 evict 设计 |
|
||||
| **我之前提议: disable migration** | `--kvcache-migration-reject-threshold 0` | "让 session 不挪窝" | **会让 KVC 退化成 pd-disagg + load-floor**(admission RPC 还在但 migration 不生效) |
|
||||
| **更早提议: disable admission** | 砍 admission RPC | "省掉那个 RPC overhead" | **直接砍 KVC 的 direct-to-D fast path** (KVC_ROUTER_ALGORITHM.md §3.2 Algorithm 2 不存在) |
|
||||
|
||||
用户每次都正确地阻止了进一步退化。**没有人在审视 evict granularity 这个根本问题**——直到现在。
|
||||
|
||||
---
|
||||
|
||||
## 4. 正确方向(粗描)
|
||||
|
||||
**核心思路**: 让 streaming session 的 decode 输出 **progressively commit 进 radix tree**,由 SGLang 标准 radix LRU 蚕食最老的 leaf。SessionSlot 退化成纯 metadata。
|
||||
|
||||
### 4.1 目标行为
|
||||
|
||||
| 场景 | 当前行为 | 目标行为 |
|
||||
|---|---|---|
|
||||
| Session 累积 50K KV,D 满了 | release_session 一次释放 38K (整段 session-exclusive 尾部) | radix LRU evict 最老 leaf (可能是首 turn 的 boilerplate tail,~24 tokens) |
|
||||
| Session 被 evict 后再到来 | 必须 reseed 50K (P prefill + mooncake) | 仅 re-prefill 被 evict 的 leaf 部分 (e.g. ~5K) |
|
||||
| TTFT 对 evicted session 的影响 | 50-90K reseed = 3-7s | 5K append-prefill = ~200ms |
|
||||
| 不被 evict 的 session | 同 session 内 turns append-only | 同样 append-only ✓ (不变) |
|
||||
| KVC fast-path 命中率 | 91.6% (历史 SWE-Bench) / 38% (E3 Inferact, 因为 evict-revisit) | 应稳定在 >85% 即使 saturation |
|
||||
|
||||
### 4.2 需要的 refactor scope
|
||||
|
||||
按依赖排序,每一步可独立做但有耦合:
|
||||
|
||||
1. **Streaming session decode output 增量进 radix tree** (vendor SGLang)
|
||||
- 当前: decode output 累积在 `kv_allocated_len` 维度,但 radix tree 只记录到 `cache_protected_len`
|
||||
- 改: 每 turn finish 时把新的 decode tail 通过 radix `cache_finished_req` 路径插入 radix 树
|
||||
- 影响: streaming session 在 radix 树里有持续 growing 的 chain,每个 24-token block 一个 node
|
||||
- 牵涉: `radix_cache.py` 的 insert 路径、`schedule_batch.py` 的 cache_finished_req hook、SessionSlot.save_from_req
|
||||
|
||||
2. **SessionSlot 退化成纯 metadata**
|
||||
- 当前: SessionSlot 拥有 `req_pool_idx` + `[cache_protected_len, kv_allocated_len)` 范围的 KV 索引所有权
|
||||
- 改: SessionSlot 仅持有 `last_node`(指向 radix 树某 node)和 lock_ref 状态,不直接管 KV 范围
|
||||
- 影响: `restore_to_req` 改成基于 radix `match_prefix` 重建 req 状态,不直接 reuse req_pool_idx
|
||||
|
||||
3. **`release_session` 改为仅 dec_lock_ref + 删 slot metadata**
|
||||
- 当前: 还 free `[cache_protected_len, kv_allocated_len)` 范围 KV
|
||||
- 改: 只 dec_lock_ref → 让 radix LRU 自然 evict
|
||||
- 影响: `maybe_trim_decode_session_cache` 不再"按 session 释放",而是用 SGLang 现有的 `tree_cache.evict(required_tokens)`
|
||||
|
||||
4. **`admit_direct_append` 的 capacity 检查改用 radix-resident 长度**
|
||||
- 当前: `current_tokens = session.resident_tokens` (来自 SessionSlot)
|
||||
- 改: `current_tokens` = radix tree 上该 session 实际 commit 的长度 = `match_prefix(session.last_node).matched_length`
|
||||
- 影响: admission 评估的 "uncached = input - radix-resident" 更精确,evict-revisit 场景下 admission 反映出"只丢了一部分"而不是"全丢"
|
||||
|
||||
5. **`prepare_for_extend` 的 streaming-session correction 重新设计**
|
||||
- 当前: Fix A patches 的 fill_ids/prefix_indices invariant 是基于 session-exclusive 尾部的复杂 fixup
|
||||
- 改: 如果 SessionSlot 不再拥有独立 KV 范围,整个 correction 路径需要重写或可能不再必要
|
||||
|
||||
### 4.3 与 onboarding §4.4 D→P sync 的关系
|
||||
|
||||
`docs/RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md` §4 描述的 D→P 增量同步是**针对 reseed 自身成本**的 fix(让 P 端 backup 跟上,避免 reseed 时 P 重 prefill)。
|
||||
|
||||
本文 §4 描述的 eviction granularity 是**针对 reseed 触发频率**的 fix(让 session 不被一次性 evict 整段,减少 evict-revisit)。
|
||||
|
||||
**两者正交、互补**:
|
||||
- 单做 evict-granularity fix: reseed 频率下降,但偶发 reseed 仍然慢
|
||||
- 单做 D→P sync: reseed 自身快了,但仍然频繁触发
|
||||
- 都做: reseed 几乎消失、即使触发也快
|
||||
|
||||
工程量都是 ~1-2 周量级,可并行启动。
|
||||
|
||||
### 4.4 不是 local patch
|
||||
|
||||
注意整个 §4.2 列表里没有"调一个 hyperparameter"或者"加一个 CLI flag"这种局部改动。这是 vendor SGLang 内部数据结构的 invariants 重新设计,不能通过更精确的 K 值或更宽的 substring filter 解决。
|
||||
|
||||
---
|
||||
|
||||
## 5. 我们不该再做的事 (anti-patterns)
|
||||
|
||||
防止下个 agent 走同样的局部 patch 路径:
|
||||
|
||||
1. **不要继续调整 `migration_reject_threshold`** — 这个参数只是控制"reject 后多久换 D",跟 evict granularity 无关。调小让 migration 更频繁 → 更多 reseed → 更糟。调大 → blacklist 永久化 (v1 thrashing 问题)。
|
||||
2. **不要 disable migration** — 会让 KVC 退化到 sticky pd-disagg。失去 v2 的 reset-on-success 整体设计。
|
||||
3. **不要 disable admission** — 会砍掉 direct-to-D fast path 这个 KVC 唯一的差异化优势。
|
||||
4. **不要继续 tune `_decode_session_cache_low_watermark_tokens`** — 调高让 LRU 更激进 → 更多 evict → 更糟。调低让 LRU 不触发 → 顶到 retract decode → 更糟。是治标。
|
||||
5. **不要再加 `_ADMISSION_REJECTION_SUBSTRINGS`** — 之前修的 string filter bug (Q2 forensic) 让 migration counter 真的递增,反而暴露了 migration 本身的 reseed 成本。修这个 bug 没错,但显示出 migration 机制本身在 saturated 场景下是负收益。
|
||||
|
||||
---
|
||||
|
||||
## 6. 推荐 Decision Points
|
||||
|
||||
| # | Question | 推荐 |
|
||||
|---|---|---|
|
||||
| D1 | 接受本文的诊断(session-level evict 是根本问题)? | **Yes** |
|
||||
| D2 | 暂停 E1/E2/E3 ablation 线索,集中精力做 §4.2 refactor? | **Yes** (current path 在用 GPU 时间确认已知结论) |
|
||||
| D3 | refactor 在 vendored SGLang 主线(kvc-debug-journey-v1-to-v4)还是新分支? | 新分支 `feat/block-level-evict`(隔离 risk) |
|
||||
| D4 | 同时启动 §4.3 的 D→P sync(`feat/d-to-p-sync` 分支已预留)? | 视团队带宽 |
|
||||
| D5 | 在 refactor 完成前对外的 paper 表述如何处理? | 标"v2 系列在 saturation regime 下的 evict 行为是已识别的 limitation,§future-work 已 propose 修复" |
|
||||
|
||||
---
|
||||
|
||||
## 7. 给下个 agent 的接班
|
||||
|
||||
**如果你接手要做 §4.2 refactor**,按顺序读:
|
||||
|
||||
1. `KVC_ROUTER_ALGORITHM.md` §2-3 — KVC 设计意图
|
||||
2. 本文 §2.1, §2.2 — 实测 evict 行为
|
||||
3. SGLang vendor `mem_cache/radix_cache.py` — 标准 radix LRU 实现细节
|
||||
4. SGLang vendor `mem_cache/session_aware_cache.py` — 当前 SessionSlot 设计
|
||||
5. SGLang vendor `managers/schedule_batch.py` — prepare_for_extend 怎么用 session state
|
||||
6. `docs/RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md` §4 — D→P sync 的工程 scope(互补 work)
|
||||
|
||||
**关键 invariant 不变量**: SessionSlot.restore_to_req 必须保持幂等(chunked prefill 失败可能 retry 多次)。任何 refactor 都要测试此 invariant。
|
||||
|
||||
**关键 testing pattern**: 单元化测试 streaming session 在 LRU 压力下的行为。具体:注入一个 fake `inner.evict()` 返回部分 leaf 被 evict 的状态,断言 SessionSlot.restore_to_req 仍然返回合法 req 状态(不抛 assertion,re-prefill 长度合理)。
|
||||
|
||||
---
|
||||
|
||||
**核心句**: 我们前 3 轮 patch 都在解 saturation 暴露的 secondary 问题(cold-D 偏置、admission 字符串 bug、streaming-session correction 边界),但**真正的 primary 问题是 SessionAwareCache 把 session lifecycle 跟踪和 eviction granularity 决策混在一起**。session 是 lifecycle 边界,**不应该是 eviction 边界**。Eviction 应该交还给 SGLang 已经做得很好的 block-level radix LRU。
|
||||
364
docs/ONBOARDING_NEXT_AGENT_ZH.md
Normal file
364
docs/ONBOARDING_NEXT_AGENT_ZH.md
Normal file
@@ -0,0 +1,364 @@
|
||||
# 接班 Agent 上手手册
|
||||
|
||||
**对象**:接手本项目的下一个 SWE/research agent
|
||||
**目标**:30 分钟读完后达到当前主 agent 的认知水平,能独立跑对照实验、看懂数据、避开历史坑
|
||||
**作者状态**:本手册定稿于 `kvc-debug-journey-v1-to-v4 @ 506d360`,下一个工作分支是 `feat/d-to-p-sync`
|
||||
|
||||
---
|
||||
|
||||
## 0. 你是谁,你将要做什么(5 行 TL;DR)
|
||||
|
||||
1. 你接手的是 **agentic-pd-hybrid**——SGLang xPyD 基础上加 session-aware KVCache 层的 LLM serving 框架,目标是在多轮长 context coding agent workload 上比 vanilla DP 快
|
||||
2. v2(迁移机制 + threshold tuning)已经在 SWE-Bench 50sess trace ts=1 上**击败 4DP CA** 6/8 个 latency/TTFT 指标,但 **TTFT p99 输 3×**(1.28s vs 0.43s)
|
||||
3. 上一个 agent 已诊断出 TTFT p99 长尾的根因——8.3% 请求走 reseed 慢路径,每次需要 P 重算 prefill + mooncake transfer = 3-7s
|
||||
4. **你的任务**:在有 GPU + IB RDMA 的环境上跑 2 组对照实验,验证 (a) naive 1P3D + kv-aware 相对 KVC 的边际贡献、(b) 启用真 RDMA 后 KVC v2 的 TTFT p99 是否能压到 ~0.7s 量级
|
||||
5. 跑完结果 push 到 `outputs/`,主 agent 会拉下来更新 paper draft 和 future-work 文档
|
||||
|
||||
---
|
||||
|
||||
## 1. 必读文档(按这个顺序读,**不要乱跳**)
|
||||
|
||||
### Level 1:核心 30 分钟(**必读**,读完就能开始干活)
|
||||
|
||||
| # | 文档 | 时长 | 为什么读它 |
|
||||
|---|---|---:|---|
|
||||
| 1 | `docs/PROJECT_OVERVIEW.md` | 5min | 项目目标 + 三种 mechanism(pd-disagg / pd-colo / kvcache-centric)的术语区分 |
|
||||
| 2 | `docs/V2_DEEP_ANALYSIS_ZH.md` §0 (TL;DR) + §6 (生产决策) | 10min | 当前状态最准确的 snapshot——v2 赢什么、输什么、为什么 |
|
||||
| 3 | `docs/KVC_ROUTER_ALGORITHM.md` §1-§3 + §9 | 10min | 形式化的算法(Algorithm 1/2/3)+ 4 个 open questions。**§9 OQ#4 就是你正在解决的问题** |
|
||||
| 4 | `docs/RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md` §0-§2 | 5min | reseed 路径完整时间线(t=0 → t=4550ms),知道每段耗时分别来自哪里 |
|
||||
|
||||
读完上面 4 篇就能跑实验了。如果你时间紧张,**就只读这 4 篇 + 本手册**。
|
||||
|
||||
### Level 2:进阶(**遇到具体问题时再读**)
|
||||
|
||||
| 文档 | 何时读 |
|
||||
|---|---|
|
||||
| `docs/REFACTOR_PLAN_V1_ZH.md` | 想理解为什么从 ts=10 切到 ts=1 |
|
||||
| `docs/MIGRATION_V1_FINDINGS_ZH.md` | 想理解 v1→v2 演化(v1 为何 thrashing,v2 reset-on-success 怎么修的) |
|
||||
| `docs/V2_RESULTS_ZH.md` | v2 原始战报(注意:headline 表略乐观,请优先看 `V2_DEEP_ANALYSIS_ZH.md` 的修订版) |
|
||||
| `docs/V2_DEEP_ANALYSIS_ZH.md` §4 全文 | 论文 reviewer 的对等性挑战 + 我们的辩驳;写 paper 时必读 |
|
||||
| `docs/TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md` | 想理解 ts=10 时代的 §1-§9 结构性问题清单(很多问题在 ts=1 下消失,但底层机制仍在) |
|
||||
|
||||
### Level 3:归档(**别读**,是历史包袱)
|
||||
|
||||
- `docs/archive/AGENTIC_FIT_ANALYSIS_ZH.md`:ts=10 时代的早期分析,结论已被 ts=1 数据 supersede
|
||||
- `docs/archive/STRUCTURAL_VALIDATION_REPORT_ZH.md`:ts=10 数据下的结构性验证,同上
|
||||
- `docs/archive/KVC_DEBUG_JOURNEY_V1_TO_V5.md`:v1-v5 调优 sweep 的过程笔记,知道有这个文件就行
|
||||
- `docs/archive/V5_PROFILE_INVESTIGATION_ZH.md`:profile 调查,已 supersede
|
||||
- `docs/archive/REFACTOR_PLAN_ZH.md`:v0 重构计划,已被 V1 supersede
|
||||
- `docs/archive/SWEBENCH_EXPERIMENT_*.md`:早期实验日志
|
||||
|
||||
### Level 0:本手册的"姐妹"文档(**读这个之前你应该已经在看本文了**)
|
||||
|
||||
- `docs/ONBOARDING_NEXT_AGENT_ZH.md`(就是本文)
|
||||
|
||||
---
|
||||
|
||||
## 2. 项目当前状态快照(用一张表说清)
|
||||
|
||||
```
|
||||
Trace: outputs/qwen35-swebench-50sess.jsonl (4449 reqs / 52 sessions, time-scale=1.0)
|
||||
Hardware: 4× H100 80GB + Mellanox mlx5_0/_1 @ 200 Gb/s IB (active, but **未启用** in current sweep)
|
||||
Model: Qwen3-30B-A3B-Instruct-2507 (TP1)
|
||||
Branch: kvc-debug-journey-v1-to-v4 = 主分支(v2 已合入)
|
||||
feat/d-to-p-sync = 预留给 D→P 增量同步的开发,**当前空**
|
||||
main = 旧 baseline,比主分支落后 18 commit
|
||||
```
|
||||
|
||||
### 已得出的结论(高置信度)
|
||||
|
||||
1. **v2 (reset-on-success + threshold 8192) 击败 4DP CA**:lat mean -1.4%、p50 -13%、TTFT mean -25%、TTFT p50 -55%、TTFT p90 -67%
|
||||
2. **TTFT p99 KVC 输 3×**:1.28s vs 0.43s。来自 8.3% reseed/fallback 慢路径
|
||||
3. **慢路径耗时五五开**:P 端 re-prefill ~1.5-3s + mooncake P→D transfer ~1.5-4s(**当前是 TCP loopback**,未启用真 RDMA)
|
||||
4. **capacity-backup 救不了 slow path**:直接 audit 过,P 端 backup 不会随 direct-to-D append 更新,是 seed-time 静态快照
|
||||
5. **D→P 增量同步代码不存在**:经 Opus agent forensic 审查 + 全分支 git 检索确认
|
||||
|
||||
### 待验证的核心假设(**这是你的实验任务**)
|
||||
|
||||
| # | 假设 | 验证方法 | 预期结果 |
|
||||
|---|---|---|---|
|
||||
| H1 | KVC v2 相对 4DP 的胜利不只是来自 1P3D 拓扑——KVC 层(admission / migration / direct-to-D)也有显著贡献 | 跑 naive 1P3D + policy=kv-aware ts=1 N=1(vanilla SGLang pd-disagg,无 KVC 层)作为中间对照 | naive 1P3D 应该处于 KVC v2 和 4DP 之间。如果它 ≈ KVC v2 → 胜利来自拓扑而非 KVC 层;如果 ≈ 4DP → 胜利来自 KVC 层 |
|
||||
| H2 | 启用真 RDMA 把 mooncake P→D transfer 从 1.5-4s 压到 200-400ms,TTFT p99 从 1.28s 降到 ~0.7s | 在 v2 sweep 加 `--force-rdma --ib-device mlx5_0`,跑同 trace 同 ts=1 | TTFT p99 应该 ~0.5-0.8s 区间。如果没改变 → mooncake 实际没用 RDMA / 配置错误;如果降到 ~0.3s → 我们对 transfer 段贡献的估计偏低 |
|
||||
| H3 | 即使启用 RDMA,TTFT p99 仍然输 DP(因为 re-prefill 段不动) | 同 H2 实验结果 | 应该看到 TTFT p99 ~0.7s > DP 0.43s。如果 ≤ DP → 我们对 re-prefill 段成本的估计错了,可能整个 slow path 理论需要重审 |
|
||||
|
||||
---
|
||||
|
||||
## 3. 你要跑的实验(the main task)
|
||||
|
||||
### 3.1 实验矩阵(按 ROI 排序)
|
||||
|
||||
GPU hour 珍贵,砍掉了原计划的 naive 1P3D + policy=default baseline(low-ROI——naive 1P3D 用 default policy 在多轮 cache 命中上几乎必败,没必要拿这个对比当 H1 的对照点)。最终保留 2 个 run:
|
||||
|
||||
| # | 配置 | GPU | mechanism | policy | RDMA | 预期时长 | 目的 |
|
||||
|---|---|---:|---|---|---|---:|---|
|
||||
| **E1** | naive 1P3D kv-aware | 4 | pd-disaggregation | kv-aware | **on** | ~5.5h | H1:分离"1P3D + kv-aware policy"贡献 vs "KVC 层(admission/migration/direct-to-D)"贡献 |
|
||||
| **E2** | KVC v2 + RDMA | 4 | kvcache-centric | kv-aware | **on** | ~5.5h | H2/H3:验证 RDMA 能把 TTFT p99 从 1.28s 压到 ~0.7s |
|
||||
|
||||
两个 run 串行约 11h,并行用两组 GPU 可压到 ~5.5h。
|
||||
|
||||
### 3.2 启动配置:详细 flag 清单
|
||||
|
||||
参考 `scripts/sweep_ts1_migration_v2.sh` 作为底版。两个新 sweep 脚本的关键 flag:
|
||||
|
||||
#### E1: naive 1P3D kv-aware
|
||||
|
||||
```bash
|
||||
python -m agentic_pd_hybrid \
|
||||
--mechanism pd-disaggregation \
|
||||
--policy kv-aware \
|
||||
--topology-pd 1P3D \
|
||||
--transfer-backend mooncake \
|
||||
--force-rdma --ib-device mlx5_0 \ # ← 单独测拓扑+policy 而非 transport,必须开 RDMA 才能跟 E2 公平
|
||||
--trace outputs/qwen35-swebench-50sess.jsonl \
|
||||
--time-scale 1.0 \
|
||||
--concurrency 32 \
|
||||
--request-timeout-s 300 \
|
||||
--max-input-len 87811 \ # ← 拉齐到 DP 限,消除 abort 数量不对等
|
||||
--output-root outputs/qwen3-30b-tp1-ts1-naive-1p3d-kvaware
|
||||
```
|
||||
|
||||
#### E2: KVC v2 + RDMA
|
||||
|
||||
参考 `scripts/sweep_ts1_migration_v2.sh`,**只加两个 flag**:
|
||||
|
||||
```diff
|
||||
--transfer-backend mooncake \
|
||||
+ --force-rdma --ib-device mlx5_0 \
|
||||
+ --max-input-len 87811 \
|
||||
--kvcache-direct-max-uncached-tokens 8192 \
|
||||
--kvcache-migration-reject-threshold 3 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
```
|
||||
|
||||
**保留 v2 的其它所有配置**——这是 v2 + RDMA 的 ablation,**不要顺手改其它东西**。
|
||||
|
||||
### 3.3 实验前的环境验证(**别跳**)
|
||||
|
||||
```bash
|
||||
# 1. GPU
|
||||
nvidia-smi -L # 应该看到 4 张 H100 80GB
|
||||
|
||||
# 2. RDMA
|
||||
ibstat | grep -E "State|Rate|Port"
|
||||
# 期望:mlx5_0 / mlx5_1 都是 State=Active, Rate=200 Gb/s
|
||||
|
||||
# 3. Mooncake 能识别 RDMA 设备
|
||||
python -c "from mooncake_transfer_engine import TransferEngine; e=TransferEngine(); print(e.get_local_topology())"
|
||||
# 期望:输出包含 mlx5_0 / mlx5_1
|
||||
|
||||
# 4. 现有 v2 数据可读
|
||||
python3 scripts/analysis/recompute_summary.py outputs/qwen3-30b-tp1-ts1-migration-v2/kvc_1p3d_migration_v2_run1_metrics.jsonl
|
||||
# 期望:打印出 failure_count=45, abort_count=40 等
|
||||
|
||||
# 5. 算法实现 syntax check
|
||||
python3 -m py_compile src/agentic_pd_hybrid/{policies,replay,metrics,benchmark,cli}.py
|
||||
# 期望:全过
|
||||
```
|
||||
|
||||
任何一步失败**立刻停下来排查**,不要硬上。
|
||||
|
||||
---
|
||||
|
||||
## 4. 已踩过的坑(避免重复)
|
||||
|
||||
| # | 坑 | 症状 | 教训 |
|
||||
|---|---|---|---|
|
||||
| 1 | **abort 被计入 latency stats** | DP/KVC 都有 0.08s 的快速失败被算成"快请求",拉低 mean/p50 | 已在 `metrics.py` 修复(commit `5eac9b4`)。新 run 出 summary 时会自动包含 `abort_count` / `failure_count` 字段 |
|
||||
| 2 | **max-input-len 双方不一致**(KVC=92098 vs DP=87811) | SGLang 按 mem_fraction_static 自动算 max_total_num_tokens,KVC decode-only worker GPU 内存多 2 GB | 跑新 run 时显式传 `--max-input-len 87811` 强制对齐 |
|
||||
| 3 | **mooncake 默认 TCP loopback** | sweep 脚本只传 `--transfer-backend mooncake` 不够,会落到 TCP,跑出来比 RDMA 慢 10× | 必须加 `--force-rdma --ib-device mlx5_0` |
|
||||
| 4 | **capacity-backup 不是 D→P 同步** | flag 名字误导,看代码就会发现它只是"reseed 完不关 P session",KV 是 seed-time 静态快照 | 不要在 capacity-backup 上浪费时间;要真正消灭 reseed 长尾必须实现 D→P,去 `feat/d-to-p-sync` |
|
||||
| 5 | **N=1 在 ts=1 下"够用"是有条件的** | baseline N=3 确认 categorical 完全确定,但 v2 引入的 reset-on-success 等新代码路径未独立验证 | v2 + RDMA 的对照建议 N=2,对 RDMA-on/off 各一次 |
|
||||
| 6 | **ts=10 数据**别参考 | 当年的 372/912/396 errors 是 benchmark artifact,不代表真实生产 | 所有比较锁定 ts=1,不要尝试 ts=10 "复现"或验证 |
|
||||
| 7 | **critic agent 的 "MAJOR" 别盲信** | 上一轮 critic 把 cache fragmentation / prefill 闲置标为 MAJOR,其实是 KVC 的**设计意图** | 详见 `V2_DEEP_ANALYSIS_ZH §4.4 / §4.5`。Audit 视角和生产视角要分清 |
|
||||
| 8 | **GPU utilization 图布局有残留小问题** | 组标签 (KVC 1P3D / DP 4-way CA) 与 subplot title 视觉上仍有轻微挤压 | 已被用户接受为可发表状态。不要再花时间调这张图 |
|
||||
|
||||
---
|
||||
|
||||
## 5. CLI 速查表
|
||||
|
||||
### 跑实验
|
||||
```bash
|
||||
# 完整 sweep(参考 v2)
|
||||
bash scripts/sweep_ts1_migration_v2.sh
|
||||
|
||||
# 写自己的 sweep:复制 sweep_ts1_migration_v2.sh,改 mechanism/policy/output-root
|
||||
```
|
||||
|
||||
### 看数据
|
||||
```bash
|
||||
# 修复版 summary(推荐用这个,旧的 summary.json 含 abort 污染)
|
||||
python3 scripts/analysis/recompute_summary.py outputs/<run>/*_metrics.jsonl
|
||||
|
||||
# 跨配置对照
|
||||
python3 scripts/analysis/analyze_ts1_validation.py # 比较 KVC vs DP ts=1 4-run
|
||||
```
|
||||
|
||||
### 出图(参考 v2 流程)
|
||||
```bash
|
||||
# 4 张已有的图,对应不同 viz 问题
|
||||
python3 scripts/analysis/plot_v2_path_breakdown.py # execution_mode 分布 + path-level latency
|
||||
python3 scripts/analysis/plot_ttft_pdf.py # TTFT PDF (KVC vs DP)
|
||||
python3 scripts/analysis/plot_gpu_utilization.py # GPU 利用率(请求计数 vs 工作量)
|
||||
python3 scripts/analysis/plot_cache_efficiency.py # cache 效率(hit rate vs turn + uncached ECDF)
|
||||
|
||||
# 数据更新后重新出图:直接 rerun,每个脚本都参数化了输入路径
|
||||
```
|
||||
|
||||
### Git
|
||||
```bash
|
||||
# 主分支(实验)
|
||||
git checkout kvc-debug-journey-v1-to-v4
|
||||
|
||||
# 新功能分支(D→P 同步,空)
|
||||
git checkout feat/d-to-p-sync
|
||||
|
||||
# 远程
|
||||
origin = git@ipads.se.sjtu.edu.cn:wangjh/agentic-pd-hybrid.git
|
||||
|
||||
# Push 用 (SSH known_hosts 第一次需要 accept)
|
||||
GIT_SSH_COMMAND='ssh -o StrictHostKeyChecking=accept-new -o UserKnownHostsFile=~/.ssh/known_hosts' git push
|
||||
|
||||
# user.email 没设全局,建议 per-commit 传:
|
||||
git -c user.email=YOUR_EMAIL -c user.name=YOUR_NAME commit -m "..."
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 6. 跑完结果后看什么数字(checklist)
|
||||
|
||||
每个 run 跑完,**至少**收集以下几个数字(用 `recompute_summary.py`):
|
||||
|
||||
```
|
||||
☐ request_count (期望 4449)
|
||||
☐ error_count + abort_count + failure_count
|
||||
☐ latency_stats_s.{mean, p50, p90, p99}
|
||||
☐ ttft_stats_s.{mean, p50, p90, p99} ← 别忘 p99!这是 KVC 的真实代价点
|
||||
☐ execution_modes 分布
|
||||
☐ per_decode_load 分布(看负载均衡)
|
||||
☐ per_prefill_load (注意:dispatcher 计数 ≠ GPU 工作量)
|
||||
☐ cache_hit_request_count + total_cached_tokens (推 cache hit rate)
|
||||
```
|
||||
|
||||
### 两组对照实验跑完后看以下"决定性数字"
|
||||
|
||||
| 比较 | 关键看点 | 决策 |
|
||||
|---|---|---|
|
||||
| E1 (naive 1P3D kv-aware) vs E2 (KVC v2 + RDMA) | TTFT p50/p99、direct-to-D 占比 | 量化"KVC 层(admission/migration/direct-to-D)在 kv-aware 之上的额外收益"(H1) |
|
||||
| KVC v2 (TCP, 历史 v2 run) vs E2 (KVC v2 + RDMA) | TTFT p99、reseed mode 的耗时(execution_mode == reseed 的 ttft_s p50) | 验证 H2/H3:RDMA 救多少 transfer 段 |
|
||||
| E1 (naive 1P3D kv-aware) vs DP 4w(历史 ts=1 baseline)| 全部 latency / TTFT 指标 | 间接锚定"拓扑差异 + kv-aware policy"的天花板 |
|
||||
|
||||
### 期待的数字范围(如果实验顺利)
|
||||
|
||||
| 配置 | lat p50 | lat p99 | TTFT p50 | TTFT p99 | direct-to-D % |
|
||||
|---|---:|---:|---:|---:|---:|
|
||||
| **E1** naive 1P3D kv-aware | ~0.75s | ~8-10s | ~0.20s | ~0.8-1.2s | N/A |
|
||||
| **E2** KVC v2 + RDMA | ~0.58s | ~7-8s | ~0.04s | **~0.5-0.8s** | ~91% |
|
||||
| (参考) KVC v2 + TCP(历史) | 0.58s | 8.7s | 0.04s | 1.29s | 91.6% |
|
||||
| (参考) DP 4w(历史 ts=1) | 0.67s | 8.4s | 0.09s | 0.43s | N/A |
|
||||
|
||||
**如果你看到的数字偏离这个范围 ≥ 2×**,先停下来检查配置(环境验证 §3.3 那些项目),不是写报告。
|
||||
|
||||
---
|
||||
|
||||
## 7. 遇到 X 怎么办(FAQ)
|
||||
|
||||
**Q: 跑出来 KVC v2 + RDMA 的 TTFT p99 比预期高很多(> 1s)。**
|
||||
|
||||
A: 大概率 RDMA 没真用上。检查:
|
||||
1. `outputs/<run>/<subdir>/benchmark-config.json` 里 `force_rdma` 是不是 `True`、`ib_device` 是不是 `"mlx5_0"`
|
||||
2. 服务器 startup log(`outputs/<run>/<subdir>/logs/prefill-0.log`)有没有 "MOONCAKE_DEVICE=mlx5_0" / "using RDMA" 类信息
|
||||
3. `ibstat mlx5_0` 看 active 状态没掉
|
||||
|
||||
**Q: KVC v2 + RDMA 跑出来 TTFT p99 ≤ DP(违反 H3)。**
|
||||
|
||||
A: 这是个好消息。可能性:
|
||||
1. 我们对 re-prefill 段耗时估计偏高(实际 SGLang 的 prefix cache 把 P 端 re-prefill 救了一半)
|
||||
2. RDMA 直接快到把 transfer 段压到 ~50ms 量级,整个 reseed < 1.5s
|
||||
3. v2 的 reseed 触发频率被 RDMA 间接降低(某种 race condition 改善了 LRU 行为)
|
||||
|
||||
任一情况都值得**深挖**,建议把 reseed mode 的 `ttft_s` 分布单独拉出来看(应该有清晰的双峰:fast reseed + 极少数 outlier)。
|
||||
|
||||
**Q: naive 1P3D 跑不起来 / SGLang 报错。**
|
||||
|
||||
A: 仓库里 `outputs/qwen3-30b-exps/pd-disaggregation-default-20260427T062616Z/` 有过历史的 1P1D 跑通配置可以参考。常见坑:
|
||||
1. `--mechanism pd-disaggregation` 和 `--topology` 必须配合,topology 不能用 KVC 的 1P3D 名字
|
||||
2. SGLang vendored 在 `third_party/sglang/`,**不要**`pip install sglang` 用外部版本——可能 API 不对齐
|
||||
3. `--policy default` 时不要传 `--kvcache-*` 系列 flag,会被 ignore 但会污染 config 输出
|
||||
|
||||
**Q: 我想跑别的对照(更大 trace / 更多 GPU / 真实 RDMA 跨节点)。**
|
||||
|
||||
A: 先把上面 2 个 E1-E2 跑完。这 2 个是论文核心 contribution 的 ablation,不能跳。其它对照(更长 trace、8 GPU 2P6D、真跨节点 RDMA、补 naive 1P3D + policy=default)见 `V2_DEEP_ANALYSIS_ZH §7.3`,作为 follow-up。
|
||||
|
||||
**Q: 跑完后想自动出对比图。**
|
||||
|
||||
A: 4 个现有 `plot_*.py` 脚本都是参数化的,把输入路径改成你的新 run 就能复用。如果对比维度变多(如三方对比 naive vs KVC vs DP),可以扩展现有脚本而不是新写——见 `plot_ttft_pdf.py` 的模板。
|
||||
|
||||
**Q: 发现 metrics.jsonl 字段不一致 / 缺字段。**
|
||||
|
||||
A: 看 `src/agentic_pd_hybrid/metrics.py` 里 `RequestMetrics` dataclass。所有新增字段必须在那里加,否则 `recompute_summary.py` 会报 KeyError。**注意**:dataclass 的 `field_names` 是按 `RequestMetrics.__dataclass_fields__` 取的,不是 jsonl 里所有 key。
|
||||
|
||||
---
|
||||
|
||||
## 8. 如果你完全卡住
|
||||
|
||||
读这一段:
|
||||
|
||||
1. **不要**尝试在没看本手册 §1 必读文档的情况下硬上代码
|
||||
2. **不要**在 main 分支或 `feat/d-to-p-sync` 上跑实验——用 `kvc-debug-journey-v1-to-v4`
|
||||
3. **不要**修 metrics.py 的统计字段,除非你能解释清楚为什么它当前的 abort 排除是对的
|
||||
4. **不要**信任 critic agent 的"MAJOR"标签,要看代码层证据
|
||||
5. **不要**跳过环境验证(§3.3)直接跑长 sweep——5h 跑出垃圾数据浪费的成本更高
|
||||
|
||||
如果你卡住超过 30 分钟,把卡点写成一句话,去主 agent 留言(git commit message / branch 注释)。
|
||||
|
||||
---
|
||||
|
||||
## 9. 主 agent 留给你的两个具体期待
|
||||
|
||||
1. **两组对照实验跑完后**,在新 commit message 里给我以下数字(用 `recompute_summary.py` 输出格式):
|
||||
```
|
||||
E1 naive 1P3D kv-aware: lat={mean,p50,p90,p99} ttft={mean,p50,p90,p99} fail_count
|
||||
E2 KVC v2 + RDMA: 同上 + reseed-mode 的 ttft p50/p99 分开
|
||||
```
|
||||
|
||||
2. **跑 E2 时收集 reseed 路径的实测耗时分布**:
|
||||
```
|
||||
pd-router-d-session-reseed 这个 execution_mode 的 ttft_s 分布
|
||||
并把 P→D mooncake transfer 时长 vs P 端 re-prefill 时长 单独拉出
|
||||
(需要在 structural/admission-events.jsonl 里找 timestamp diff)
|
||||
```
|
||||
|
||||
这两组数字直接决定 paper future-work 章节怎么写 D→P sync 的必要性。
|
||||
|
||||
---
|
||||
|
||||
## 附录 A:关键文件位置速查
|
||||
|
||||
| 你在找什么 | 在哪 |
|
||||
|---|---|
|
||||
| 算法实现 | `src/agentic_pd_hybrid/policies.py` (KvAwarePolicy + RoutingState) |
|
||||
| 整个 replay orchestration | `src/agentic_pd_hybrid/replay.py` (~3000 行,**慢慢读**) |
|
||||
| 指标统计 | `src/agentic_pd_hybrid/metrics.py` |
|
||||
| CLI 入口 | `src/agentic_pd_hybrid/cli.py` |
|
||||
| Server 启动配置 | `src/agentic_pd_hybrid/stack.py` |
|
||||
| SGLang 改动 | `third_party/sglang/python/sglang/srt/{managers/scheduler.py, managers/io_struct.py, disaggregation/mooncake/...}` |
|
||||
| 历史 sweep 脚本 | `scripts/sweep_ts1_*.sh` |
|
||||
| 分析脚本 | `scripts/analysis/*.py` |
|
||||
| 实验输出 | `outputs/qwen3-30b-tp1-ts1-*/` |
|
||||
|
||||
## 附录 B:关键 commit 速查(按"想理解什么改动看什么 commit"组织)
|
||||
|
||||
| 想理解 | 看 commit |
|
||||
|---|---|
|
||||
| v2 的核心改动 | `2ec0deb feat(kvc): session migration with reset-on-success + direct-append threshold tuning` |
|
||||
| metrics.py 修复 | `5eac9b4 fix(metrics): exclude aborted requests from latency/ttft/tpot stats` |
|
||||
| 完整 analysis 文档(多版本叠加修订)| `c01d610` (latest) / `9ccd853` / `b5af195` / `c551906` / `517677d` |
|
||||
| 算法形式化定义 | `37e9caa docs(kvc): production-decision reframe + formal router algorithm spec` |
|
||||
| 各种 figure 脚本 | `c551906` (TTFT PDF) / `b5af195` (path breakdown) / `517677d` (GPU + cache) |
|
||||
| backpressure 代码 | `c47adaf feat(kvc): honor admission backpressure hints` 和 `ca4b64c feat(sglang): expose backpressure pause hint` |
|
||||
|
||||
---
|
||||
|
||||
**核心句**:先读 §1 Level 1 的 4 篇文档(30 min)+ 本手册(30 min),然后按 §3 跑 E1/E2/E3 三组实验,按 §6 收集决定性数字,遇到坑查 §4,结果 push 到 `outputs/` 下。**别瞎改不属于本任务的代码**——你的工作是验证 v2 的胜利在 ablation 中是否站得住,不是开发新机制(那是 `feat/d-to-p-sync` 分支的事,下一阶段才做)。
|
||||
|
||||
跑完之后期待你的 commit!
|
||||
@@ -2,9 +2,9 @@
|
||||
|
||||
**日期**:2026-05-08
|
||||
**前置文档**:
|
||||
- `docs/REFACTOR_PLAN_ZH.md`(v0,已被本文 supersede——v0 的 backpressure 切入点结论已撤回)
|
||||
- `docs/archive/REFACTOR_PLAN_ZH.md`(v0,已被本文 supersede——v0 的 backpressure 切入点结论已撤回)
|
||||
- `docs/TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md`(包含 §1-§7 结构性问题清单)
|
||||
- `docs/STRUCTURAL_VALIDATION_REPORT_ZH.md`(ts=10 数据下的早期验证)
|
||||
- `docs/archive/STRUCTURAL_VALIDATION_REPORT_ZH.md`(ts=10 数据下的早期验证)
|
||||
|
||||
**触发**:`outputs/qwen3-30b-tp1-ts1-validation/` 4 个 run 完成(KVC 1P3D × N=3 + 4DP CA × 1,全部 ts=1)
|
||||
|
||||
@@ -372,11 +372,11 @@ score = (
|
||||
## 附录 B:相关文档
|
||||
|
||||
- `docs/TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md` — §1-§7 原结构性问题清单
|
||||
- `docs/REFACTOR_PLAN_ZH.md` — v0 重构计划(本文 supersede)
|
||||
- `docs/AGENTIC_FIT_ANALYSIS_ZH.md` — 早期 fit 分析(§1-§7 来源)
|
||||
- `docs/STRUCTURAL_VALIDATION_REPORT_ZH.md` — ts=10 结构性 claim 验证
|
||||
- `docs/KVC_DEBUG_JOURNEY_V1_TO_V5.md` — v1→v5 演进
|
||||
- `docs/V5_PROFILE_INVESTIGATION_ZH.md` — v5+profile 调查(已 critic 修订)
|
||||
- `docs/archive/REFACTOR_PLAN_ZH.md` — v0 重构计划(本文 supersede)
|
||||
- `docs/archive/AGENTIC_FIT_ANALYSIS_ZH.md` — 早期 fit 分析(§1-§7 来源)
|
||||
- `docs/archive/STRUCTURAL_VALIDATION_REPORT_ZH.md` — ts=10 结构性 claim 验证
|
||||
- `docs/archive/KVC_DEBUG_JOURNEY_V1_TO_V5.md` — v1→v5 演进
|
||||
- `docs/archive/V5_PROFILE_INVESTIGATION_ZH.md` — v5+profile 调查(已 critic 修订)
|
||||
- `scripts/sweep_ts1_kvc_n3_plus_dp.sh` — 本次 4 run sweep 脚本
|
||||
- `scripts/analysis/analyze_ts1_validation.py` — 本次分析脚本
|
||||
|
||||
|
||||
174
docs/SNAPSHOT_STORE_REFACTOR_ZH.md
Normal file
174
docs/SNAPSHOT_STORE_REFACTOR_ZH.md
Normal file
@@ -0,0 +1,174 @@
|
||||
# SnapshotStore 重构(解决 P-side alloc-failed 死局)
|
||||
|
||||
**日期**:2026-05-13
|
||||
**Status**:设计阶段,开始实施
|
||||
**根因**:`docs/E4_VS_E1_RESULTS_ZH.md` §3 + E4-v4/v5 forensic 显示 D→P sync 167 次尝试 0 OK,全部因 `prepare_receive` 试图从 `token_to_kv_pool_allocator.alloc(N)` 拿 N 个 slot 而 P 的池被自己 prefill 工作占满
|
||||
|
||||
---
|
||||
|
||||
## 0. TL;DR
|
||||
|
||||
- 当前 P-side `prepare_receive` 用 `token_to_kv_pool_allocator.alloc(N)` 抢 kv_pool slot —— 跟 P 自己的 prefill 工作直接争抢资源 → 90%+ 时间 alloc-failed
|
||||
- 重构方向:**P-side 用独立 GPU buffer 接收 snapshot**,与 kv_pool 解耦
|
||||
- 在 finalize_ingest 时才把 snapshot bytes copy 进 kv_pool slots(此时可以等更优的时机)
|
||||
- ~250 LOC 新代码,主要在 `disaggregation/snapshot/controller.py`
|
||||
|
||||
---
|
||||
|
||||
## 1. 当前实现的死局
|
||||
|
||||
```
|
||||
prepare_receive(sid, num_tokens=50000):
|
||||
indices = self.token_to_kv_pool_allocator.alloc(50000)
|
||||
if indices is None:
|
||||
return ok=False, reason="alloc-failed" ← 90%+ 时间走这里
|
||||
return slot_indices = indices.tolist()
|
||||
```
|
||||
|
||||
`alloc(50000)` 在 P 池中找 50000 个 contiguous 空 slot。当 P 正在 prefill 自己的 request 时(这是 P 的常态),池里大部分 slot 被锁定 → 找不出 50K 个空闲的 → fail.
|
||||
|
||||
E4-v5 167 次 sync 尝试统计:
|
||||
- 148 个 alloc-failed(**88%**)
|
||||
- 19 个 session-not-resident(D 端已 evict)
|
||||
- 0 个 OK
|
||||
|
||||
---
|
||||
|
||||
## 2. 新设计:PrefillSnapshotStore 侧表
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────────┐
|
||||
│ P worker scheduler │
|
||||
│ │
|
||||
│ kv_pool (existing, owned by P's prefill work) │
|
||||
│ ┌────────────────────────────────────────────────┐ │
|
||||
│ │ k_buffer[0..L]: (max_tokens, head, dim) │ │
|
||||
│ │ v_buffer[0..L]: (max_tokens, head, dim) │ │
|
||||
│ └────────────────────────────────────────────────┘ │
|
||||
│ │
|
||||
│ snapshot_buf (NEW, dedicated for D→P snapshot reception) │
|
||||
│ ┌────────────────────────────────────────────────┐ │
|
||||
│ │ pinned GPU tensor of size SNAPSHOT_BUF_BYTES │ │
|
||||
│ │ (default 8 GB) │ │
|
||||
│ │ • registered with mooncake (one-time at init) │ │
|
||||
│ │ • slab-allocator manages free space │ │
|
||||
│ └────────────────────────────────────────────────┘ │
|
||||
└─────────────────────────────────────────────────────────────────┘
|
||||
|
||||
Flow:
|
||||
1. prepare_receive(sid, N):
|
||||
slab = snapshot_buf_allocator.alloc(N * per_token_bytes_total)
|
||||
record = (sid, slab_offset, N)
|
||||
return (snapshot_buf_base + slab_offset for K_L, V_L per layer)
|
||||
← never blocks on kv_pool
|
||||
|
||||
2. (out-of-band) D pushes KV bytes into the slab via mooncake RDMA
|
||||
|
||||
3. finalize_ingest(sid, token_ids):
|
||||
record = pop ingest_record[sid]
|
||||
slots = token_to_kv_pool_allocator.alloc(N) ← can fail here
|
||||
if alloc-failed:
|
||||
snapshot_buf_allocator.free(record.slab)
|
||||
return ok=False, reason=alloc-failed-on-finalize
|
||||
# copy snapshot_buf[layer L][token range] → kv_pool.k_buffer[L][slots]
|
||||
for L in range(layer_num):
|
||||
kv_pool.k_buffer[L][slots] = snapshot_buf[K_L_offset : K_L_offset + N * K_stride].view(N, head, dim)
|
||||
kv_pool.v_buffer[L][slots] = snapshot_buf[V_L_offset : V_L_offset + N * V_stride].view(N, head, dim)
|
||||
tree_cache.insert(InsertParams(key=token_ids, value=slots))
|
||||
snapshot_buf_allocator.free(record.slab)
|
||||
return ok=True
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 3. 关键 design choices
|
||||
|
||||
| 决策 | 选择 | 原因 |
|
||||
|---|---|---|
|
||||
| Snapshot buffer 存哪 | GPU memory | 与 D RDMA 目标对称(D 端 KV 也在 GPU),避免 host↔device 拷贝 |
|
||||
| 默认大小 | **8 GB** | Qwen3-30B 一个 ~50K-token session 的 KV ~5 GB;8 GB 让我们至少 hold 一个 + 部分备份 |
|
||||
| 分配粒度 | 单次 contiguous 一个 session 全部 KV | 简化 slab allocator + 单次 batch transfer |
|
||||
| Layout | K-all-layers concat, then V-all-layers concat | 跟 mooncake 的 batch_transfer 接口对齐 |
|
||||
| Free 策略 | finalize 后立即 free | 当 snapshot 已 ingest 到 kv_pool,snapshot_buf 副本不再需要 |
|
||||
| 满了怎么办 | prepare_receive 返回 ok=False, reason=snapshot-buf-full | 让 caller fall back 到 re-prefill |
|
||||
|
||||
---
|
||||
|
||||
## 4. 接口变化
|
||||
|
||||
### 4.1 SnapshotPrepareReceiveReqOutput
|
||||
|
||||
旧:
|
||||
```
|
||||
k_base_ptrs: List[int] # 各 layer 的 k_buffer.data_ptr()
|
||||
v_base_ptrs: List[int]
|
||||
slot_indices: List[int] # kv_pool 中分配的 slot
|
||||
stride_k_bytes / stride_v_bytes
|
||||
```
|
||||
|
||||
新:
|
||||
```
|
||||
snapshot_buf_base_ptr: int # snapshot_buf.data_ptr()
|
||||
k_layer_offsets: List[int] # 各 layer K 在 snapshot_buf 中的字节偏移
|
||||
v_layer_offsets: List[int] # 各 layer V 偏移
|
||||
num_tokens: int
|
||||
stride_k_bytes / stride_v_bytes
|
||||
slab_handle: int # opaque handle for finalize/abort
|
||||
```
|
||||
|
||||
### 4.2 SnapshotFinalizeIngestReqInput
|
||||
|
||||
旧:
|
||||
```
|
||||
session_id, token_ids, slot_indices
|
||||
```
|
||||
|
||||
新:
|
||||
```
|
||||
session_id, token_ids, slab_handle # P 用 handle 找到 record,再 alloc kv_pool + copy + insert
|
||||
```
|
||||
|
||||
### 4.3 D-side push 逻辑(agentic)
|
||||
|
||||
旧:D 算 src_slot[L] → dst_slot[L] mapping,batch_transfer
|
||||
|
||||
新:D 算 src_slot[L] → snapshot_buf 中的 k_layer_offsets[L] / v_layer_offsets[L] mapping,batch_transfer。完全不需要 dst slot indices。
|
||||
|
||||
---
|
||||
|
||||
## 5. 实施步骤
|
||||
|
||||
| # | 步骤 | LOC 估计 |
|
||||
|---|---|---:|
|
||||
| 1 | `SnapshotBufAllocator` 类(slab/bump allocator) | 80 |
|
||||
| 2 | `SnapshotLinkController.__init__` 加 snapshot_buf 分配 + 注册 | 30 |
|
||||
| 3 | 重写 `prepare_receive`、新加 `_compute_layer_offsets` | 60 |
|
||||
| 4 | 新加 `finalize_with_snapshot_buf` + 删旧的 `finalize_ingest` | 70 |
|
||||
| 5 | 修改 io_struct 字段 + 删旧字段 | 30 |
|
||||
| 6 | 修改 agentic `_attempt_d_to_p_sync` 用新字段 | 40 |
|
||||
| 7 | 改 mem leak check 计入 snapshot_buf | 5 |
|
||||
| 8 | 单元 smoke test | 50 |
|
||||
|
||||
Total: ~365 LOC
|
||||
|
||||
---
|
||||
|
||||
## 6. 风险
|
||||
|
||||
| 风险 | 缓解 |
|
||||
|---|---|
|
||||
| 8 GB GPU mem cost | 用户可配置;mem-fraction-static 已经留了 buffer |
|
||||
| 多 session 抢 snapshot_buf | slab allocator + LRU evict 旧的 snapshot |
|
||||
| GPU→GPU copy 性能 | ~5 GB @ 3 TB/s = 1.7 ms,可忽略 |
|
||||
| 接口大改影响 smoke | 在 commit 内完成所有接口变更,smoke 同步更新 |
|
||||
|
||||
---
|
||||
|
||||
## 7. 验收
|
||||
|
||||
- [ ] `scripts/smoke_snapshot_sglang_integration.py` 跑通新接口(prepare_receive 不再 alloc-failed)
|
||||
- [ ] E4-v6 跑同样 trace,d-to-p-sync.jsonl 出现 OK 事件 ≥ 30%(vs 当前 0%)
|
||||
|
||||
---
|
||||
|
||||
**核心句**:用 GPU 上独立的 snapshot_buf 接收 D 端推送,把"竞争 P kv_pool"这个根本性 alloc 冲突消掉,把 alloc 决策推迟到 finalize 时机,让 D→P 真正有机会跑通。
|
||||
@@ -633,9 +633,9 @@ errors 漂移 **2.5×**(372→912),P50 latency 漂移 ~30%,TTFT P50 漂
|
||||
## 附录 B:相关已有文档
|
||||
|
||||
- `docs/PROJECT_OVERVIEW.md` — 项目目标、microbench 结论
|
||||
- `docs/AGENTIC_FIT_ANALYSIS_ZH.md` — 结构性缺陷的早期分析(本报告 §2 的来源)
|
||||
- `docs/KVC_DEBUG_JOURNEY_V1_TO_V5.md` — v1→v5 详细演进日记
|
||||
- `docs/V5_PROFILE_INVESTIGATION_ZH.md` — v5+profile 调查(含 critic 修订)
|
||||
- `docs/SWEBENCH_EXPERIMENT_RESULTS.md` — SWE 35B 早期实验
|
||||
- `docs/REFACTOR_PLAN_ZH.md` — 当前重构计划
|
||||
- `docs/STRUCTURAL_VALIDATION_REPORT_ZH.md` — 结构性 claim 验证(本报告的精简版)
|
||||
- `docs/archive/AGENTIC_FIT_ANALYSIS_ZH.md` — 结构性缺陷的早期分析(本报告 §2 的来源)
|
||||
- `docs/archive/KVC_DEBUG_JOURNEY_V1_TO_V5.md` — v1→v5 详细演进日记
|
||||
- `docs/archive/V5_PROFILE_INVESTIGATION_ZH.md` — v5+profile 调查(含 critic 修订)
|
||||
- `docs/archive/SWEBENCH_EXPERIMENT_RESULTS.md` — SWE 35B 早期实验
|
||||
- `docs/archive/REFACTOR_PLAN_ZH.md` — 当前重构计划
|
||||
- `docs/archive/STRUCTURAL_VALIDATION_REPORT_ZH.md` — 结构性 claim 验证(本报告的精简版)
|
||||
|
||||
@@ -609,8 +609,8 @@ v2 p99 = slow path 主导 → 8.69s (KVC) vs 8.43s (DP) 接近
|
||||
- `docs/REFACTOR_PLAN_V1_ZH.md` — ts=1 验证后的方向决策
|
||||
- `docs/MIGRATION_V1_FINDINGS_ZH.md` — v1 thrashing 诊断
|
||||
- `docs/V2_RESULTS_ZH.md` — v2 结果原始报告(本文是对它的 critique)
|
||||
- `docs/AGENTIC_FIT_ANALYSIS_ZH.md` — 早期 fit 分析(§1-§7 来源)
|
||||
- `docs/STRUCTURAL_VALIDATION_REPORT_ZH.md` — ts=10 结构性 claim 验证
|
||||
- `docs/archive/AGENTIC_FIT_ANALYSIS_ZH.md` — 早期 fit 分析(§1-§7 来源)
|
||||
- `docs/archive/STRUCTURAL_VALIDATION_REPORT_ZH.md` — ts=10 结构性 claim 验证
|
||||
|
||||
## 附录 C:相关代码
|
||||
|
||||
|
||||
@@ -271,8 +271,8 @@ p99 +3% 几乎全部来自这 5 个 timeout(每个 ~30s 拉到 p99)。**修
|
||||
- `docs/TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md` — §1-§9 原结构性问题清单
|
||||
- `docs/REFACTOR_PLAN_V1_ZH.md` — 重构方向 + 三情景分支
|
||||
- `docs/MIGRATION_V1_FINDINGS_ZH.md` — v1 thrashing 诊断
|
||||
- `docs/AGENTIC_FIT_ANALYSIS_ZH.md` — 早期 fit 分析
|
||||
- `docs/STRUCTURAL_VALIDATION_REPORT_ZH.md` — ts=10 结构性 claim 验证
|
||||
- `docs/archive/AGENTIC_FIT_ANALYSIS_ZH.md` — 早期 fit 分析
|
||||
- `docs/archive/STRUCTURAL_VALIDATION_REPORT_ZH.md` — ts=10 结构性 claim 验证
|
||||
- `scripts/sweep_ts1_migration_v2.sh` — 本次 v2 sweep 脚本
|
||||
- `scripts/analysis/analyze_ts1_validation.py` — ts=1 4-way 对比分析
|
||||
|
||||
|
||||
34
docs/archive/README.md
Normal file
34
docs/archive/README.md
Normal file
@@ -0,0 +1,34 @@
|
||||
# 归档文档说明
|
||||
|
||||
本目录保留项目历史阶段的过程文档。**新加入项目的 agent / 人员不需要阅读这些文档**,直接看 `docs/ONBOARDING_NEXT_AGENT_ZH.md` 即可。
|
||||
|
||||
保留它们的目的:
|
||||
1. 论文写作时追溯 v1-v5 调优演化过程
|
||||
2. 未来若回到 ts=10 高压区间或更大 trace 时,可参考当年的结构性问题诊断
|
||||
3. 满足学术可追溯性要求
|
||||
|
||||
## 每个文档的简要说明
|
||||
|
||||
| 文档 | 归档原因 | 何时回头看 |
|
||||
|---|---|---|
|
||||
| `AGENTIC_FIT_ANALYSIS_ZH.md` | ts=10 时代的 §1-§7 结构性问题分析;结论已被 ts=1 数据全面 supersede | 想知道当年 ts=10 下我们认为有什么结构性问题时 |
|
||||
| `STRUCTURAL_VALIDATION_REPORT_ZH.md` | 用 ts=10 数据对 AGENTIC_FIT_ANALYSIS 的 claim 做验证;同样被 ts=1 时代 supersede | 同上 |
|
||||
| `KVC_DEBUG_JOURNEY_V1_TO_V5.md` | v1-v5 5 个调优 sweep 的过程笔记;包含 errors 9→912 漂移、direct-to-D 占比变化等历史数据 | 写 paper 时要写 "as we explored configurations v1-v5..." 段落 |
|
||||
| `V5_PROFILE_INVESTIGATION_ZH.md` | 给 v5 加 1Hz polling instrumentation 的调查;让 errors 涨 46× 的现象记录 | 想理解 "admission RPC 干扰 scheduler 主循环" 这条 §5 残留风险时 |
|
||||
| `REFACTOR_PLAN_ZH.md` | v0 重构计划,**已被 `REFACTOR_PLAN_V1_ZH.md` supersede** | 不需要看;只有想看作者一开始的设想时翻一翻 |
|
||||
| `KVCACHE_CENTRIC_PROGRESS_ZH.md` | 项目最早期(2026-04-27)的进度记录;当时还没有完整的 sweep 数据 | 几乎不需要看;满足"项目起源记录"职能 |
|
||||
| `SWEBENCH_EXPERIMENT_PROGRESS.md` | SWE-Bench trace 早期实验进度记录 | 想知道当年的 trace 生成 / 采样配置时 |
|
||||
| `SWEBENCH_EXPERIMENT_RESULTS.md` | 同上,早期 result snapshot | 同上 |
|
||||
|
||||
## 当前活跃文档(在 `docs/` 顶层)
|
||||
|
||||
跳转去看:
|
||||
- `docs/ONBOARDING_NEXT_AGENT_ZH.md` — 新人上手手册
|
||||
- `docs/PROJECT_OVERVIEW.md` — 项目目标 + 术语
|
||||
- `docs/KVC_ROUTER_ALGORITHM.md` — 算法形式化
|
||||
- `docs/V2_DEEP_ANALYSIS_ZH.md` — v2 完整分析
|
||||
- `docs/V2_RESULTS_ZH.md` — v2 原始战报
|
||||
- `docs/REFACTOR_PLAN_V1_ZH.md` — ts=1 方向决策
|
||||
- `docs/MIGRATION_V1_FINDINGS_ZH.md` — v1 thrashing 诊断
|
||||
- `docs/RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md` — reseed 长尾 + D→P 缺口审计
|
||||
- `docs/TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md` — ts=10 时代的结构性问题清单(作为历史 baseline 仍在主目录)
|
||||
BIN
docs/figures/e1_vs_e4_latency_cdf.png
Normal file
BIN
docs/figures/e1_vs_e4_latency_cdf.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 222 KiB |
BIN
docs/figures/e1_vs_e4_p99_attribution.png
Normal file
BIN
docs/figures/e1_vs_e4_p99_attribution.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 257 KiB |
BIN
docs/figures/e1_vs_e4_ttft_pdf.png
Normal file
BIN
docs/figures/e1_vs_e4_ttft_pdf.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 282 KiB |
BIN
docs/figures/e4_path_latency.png
Normal file
BIN
docs/figures/e4_path_latency.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 158 KiB |
Binary file not shown.
|
Before Width: | Height: | Size: 196 KiB After Width: | Height: | Size: 216 KiB |
@@ -7,7 +7,7 @@ requires-python = ">=3.12"
|
||||
dependencies = [
|
||||
"httpx>=0.28.1",
|
||||
"mooncake-transfer-engine",
|
||||
"sglang==0.5.10",
|
||||
"sglang",
|
||||
]
|
||||
|
||||
[project.scripts]
|
||||
@@ -22,3 +22,6 @@ where = ["src"]
|
||||
|
||||
[tool.uv]
|
||||
prerelease = "allow"
|
||||
|
||||
[tool.uv.sources]
|
||||
sglang = { path = "third_party/sglang/python", editable = true }
|
||||
|
||||
334
scripts/analysis/plot_e1_vs_e4.py
Normal file
334
scripts/analysis/plot_e1_vs_e4.py
Normal file
@@ -0,0 +1,334 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Generate E1 (naive PD-disagg) vs E4 (KVC + load-floor + RDMA) comparison figures.
|
||||
|
||||
Outputs (under docs/figures/):
|
||||
e1_vs_e4_ttft_pdf.png - TTFT distribution body + log-tail
|
||||
e1_vs_e4_latency_cdf.png - E2E latency CDF
|
||||
e4_path_latency.png - E4 per-execution-mode latency breakdown
|
||||
e1_vs_e4_p99_attribution.png - which execution modes contribute to E4's p99 tail
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
import argparse
|
||||
import json
|
||||
from collections import Counter, defaultdict
|
||||
from pathlib import Path
|
||||
|
||||
import matplotlib.pyplot as plt
|
||||
import numpy as np
|
||||
|
||||
ROOT = Path(__file__).resolve().parents[2]
|
||||
FIG = ROOT / "docs/figures"
|
||||
FIG.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
E1_COLOR = "#D62728" # red
|
||||
E4_COLOR = "#1F77B4" # blue
|
||||
|
||||
|
||||
def load(p: Path) -> list[dict]:
|
||||
return [json.loads(l) for l in p.open()]
|
||||
|
||||
|
||||
def is_failed(r: dict) -> bool:
|
||||
if r.get("error"):
|
||||
return True
|
||||
fr = r.get("finish_reason")
|
||||
if fr and ("abort" in str(fr).lower() or "badrequest" in str(fr).lower()):
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
def pct(values, q):
|
||||
return float(np.quantile(values, q))
|
||||
|
||||
|
||||
def main():
|
||||
ap = argparse.ArgumentParser()
|
||||
ap.add_argument("--e1-metrics", required=True)
|
||||
ap.add_argument("--e4-metrics", required=True)
|
||||
args = ap.parse_args()
|
||||
|
||||
e1 = [r for r in load(Path(args.e1_metrics)) if not is_failed(r)]
|
||||
e4 = [r for r in load(Path(args.e4_metrics)) if not is_failed(r)]
|
||||
e1_ttft = np.array([r["ttft_s"] for r in e1 if r.get("ttft_s") is not None])
|
||||
e4_ttft = np.array([r["ttft_s"] for r in e4 if r.get("ttft_s") is not None])
|
||||
e1_lat = np.array([r["latency_s"] for r in e1 if r.get("latency_s") is not None])
|
||||
e4_lat = np.array([r["latency_s"] for r in e4 if r.get("latency_s") is not None])
|
||||
e1_ttft = e1_ttft[e1_ttft > 1e-4]
|
||||
e4_ttft = e4_ttft[e4_ttft > 1e-4]
|
||||
|
||||
print(f"E1 reqs={len(e1)} (after failed-filter) TTFT n={len(e1_ttft)} lat n={len(e1_lat)}")
|
||||
print(f"E4 reqs={len(e4)} (after failed-filter) TTFT n={len(e4_ttft)} lat n={len(e4_lat)}")
|
||||
print()
|
||||
for name, arr in [("E1", e1_ttft), ("E4", e4_ttft)]:
|
||||
print(f" {name} TTFT mean={arr.mean():.3f} p50={pct(arr,0.5):.3f} "
|
||||
f"p90={pct(arr,0.9):.3f} p99={pct(arr,0.99):.3f} max={arr.max():.3f}")
|
||||
print()
|
||||
for name, arr in [("E1", e1_lat), ("E4", e4_lat)]:
|
||||
print(f" {name} Lat mean={arr.mean():.3f} p50={pct(arr,0.5):.3f} "
|
||||
f"p90={pct(arr,0.9):.3f} p99={pct(arr,0.99):.3f} max={arr.max():.3f}")
|
||||
print()
|
||||
|
||||
# ----- Plot 1: TTFT distribution (body + log tail) ---------------------
|
||||
_plot_ttft_pdf(e1_ttft, e4_ttft)
|
||||
|
||||
# ----- Plot 2: Latency CDF --------------------------------------------
|
||||
_plot_latency_cdf(e1_lat, e4_lat)
|
||||
|
||||
# ----- Plot 3: E4 path-level breakdown ---------------------------------
|
||||
_plot_path_latency(e4)
|
||||
|
||||
# ----- Plot 4: p99 attribution -----------------------------------------
|
||||
_plot_p99_attribution(e4, e1_ttft, e4_ttft)
|
||||
|
||||
|
||||
def _plot_ttft_pdf(e1_ttft, e4_ttft):
|
||||
from scipy.stats import gaussian_kde
|
||||
fig, axes = plt.subplots(1, 2, figsize=(16, 6.5))
|
||||
|
||||
# Body, linear x ∈ [0, 60s]
|
||||
ax = axes[0]
|
||||
x_body = np.linspace(0, 60, 800)
|
||||
kde_e4 = gaussian_kde(e4_ttft, bw_method=0.15)
|
||||
kde_e1 = gaussian_kde(e1_ttft, bw_method=0.15)
|
||||
ax.plot(x_body, kde_e4(x_body), color=E4_COLOR, lw=2.5,
|
||||
label=f"E4 KVC + load-floor + RDMA (n={len(e4_ttft)})")
|
||||
ax.fill_between(x_body, kde_e4(x_body), alpha=0.2, color=E4_COLOR)
|
||||
ax.plot(x_body, kde_e1(x_body), color=E1_COLOR, lw=2.5,
|
||||
label=f"E1 naive PD-disagg (n={len(e1_ttft)})")
|
||||
ax.fill_between(x_body, kde_e1(x_body), alpha=0.2, color=E1_COLOR)
|
||||
for q, ls in [(0.5, "-"), (0.9, "--")]:
|
||||
ax.axvline(pct(e4_ttft, q), color=E4_COLOR, ls=ls, alpha=0.55, lw=1.1)
|
||||
ax.axvline(pct(e1_ttft, q), color=E1_COLOR, ls=ls, alpha=0.55, lw=1.1)
|
||||
ymax = ax.get_ylim()[1]
|
||||
ax.text(pct(e4_ttft, 0.5), ymax * 0.95, f"E4 p50\n{pct(e4_ttft, 0.5):.1f}s",
|
||||
color=E4_COLOR, fontsize=9, va="top", ha="left",
|
||||
bbox=dict(facecolor="white", edgecolor="none", alpha=0.8, pad=2))
|
||||
ax.text(pct(e1_ttft, 0.5), ymax * 0.55, f"E1 p50\n{pct(e1_ttft, 0.5):.1f}s",
|
||||
color=E1_COLOR, fontsize=9, va="top", ha="left",
|
||||
bbox=dict(facecolor="white", edgecolor="none", alpha=0.8, pad=2))
|
||||
ax.set_xlim(0, 60)
|
||||
ax.set_xlabel("TTFT (seconds, linear)", fontsize=11)
|
||||
ax.set_ylabel("Probability density", fontsize=11)
|
||||
ax.set_title("Body of distribution (TTFT ≤ 60s)", fontsize=12, pad=10)
|
||||
ax.legend(loc="upper right", fontsize=10, framealpha=0.95)
|
||||
ax.grid(True, linestyle=":", alpha=0.4)
|
||||
|
||||
# Log tail
|
||||
ax = axes[1]
|
||||
kde_e4_log = gaussian_kde(np.log10(e4_ttft), bw_method="scott")
|
||||
kde_e1_log = gaussian_kde(np.log10(e1_ttft), bw_method="scott")
|
||||
log_x = np.linspace(np.log10(0.05), np.log10(500), 600)
|
||||
x_full = 10 ** log_x
|
||||
y_e4 = kde_e4_log(log_x)
|
||||
y_e1 = kde_e1_log(log_x)
|
||||
ax.plot(x_full, y_e4, color=E4_COLOR, lw=2.5, label=f"E4 KVC (n={len(e4_ttft)})")
|
||||
ax.fill_between(x_full, y_e4, alpha=0.2, color=E4_COLOR)
|
||||
ax.plot(x_full, y_e1, color=E1_COLOR, lw=2.5, label=f"E1 naive PD (n={len(e1_ttft)})")
|
||||
ax.fill_between(x_full, y_e1, alpha=0.2, color=E1_COLOR)
|
||||
ax.set_xscale("log")
|
||||
ax.set_xlim(0.05, 500)
|
||||
quartile_styles = [(0.5, "-", "p50"), (0.9, "--", "p90"), (0.99, ":", "p99")]
|
||||
for q, ls, _ in quartile_styles:
|
||||
ax.axvline(pct(e4_ttft, q), color=E4_COLOR, ls=ls, alpha=0.55, lw=1.1)
|
||||
ax.axvline(pct(e1_ttft, q), color=E1_COLOR, ls=ls, alpha=0.55, lw=1.1)
|
||||
ymax = max(y_e4.max(), y_e1.max())
|
||||
ax.annotate(f"E4 p99 = {pct(e4_ttft, 0.99):.1f}s",
|
||||
xy=(pct(e4_ttft, 0.99), kde_e4_log(np.log10(pct(e4_ttft, 0.99)))[0]),
|
||||
xytext=(80, ymax * 0.55),
|
||||
fontsize=10, color=E4_COLOR, fontweight="bold",
|
||||
arrowprops=dict(arrowstyle="->", color=E4_COLOR, lw=1.0))
|
||||
ax.annotate(f"E1 p99 = {pct(e1_ttft, 0.99):.1f}s",
|
||||
xy=(pct(e1_ttft, 0.99), kde_e1_log(np.log10(pct(e1_ttft, 0.99)))[0]),
|
||||
xytext=(80, ymax * 0.40),
|
||||
fontsize=10, color=E1_COLOR, fontweight="bold",
|
||||
arrowprops=dict(arrowstyle="->", color=E1_COLOR, lw=1.0))
|
||||
ax.set_xticks([0.1, 1, 10, 100])
|
||||
ax.set_xticklabels(["100ms", "1s", "10s", "100s"])
|
||||
ax.set_xlabel("TTFT (log scale)", fontsize=11)
|
||||
ax.set_ylabel("Density (per log₁₀ s)", fontsize=11)
|
||||
ax.set_title("Full range incl. p99 tail (log x)", fontsize=12, pad=10)
|
||||
ax.legend(loc="upper left", fontsize=10, framealpha=0.95)
|
||||
ax.grid(True, which="both", linestyle=":", alpha=0.4)
|
||||
|
||||
fig.suptitle(
|
||||
"TTFT density: E4 KVC v2 + load-floor + RDMA vs E1 naive PD-disagg\n"
|
||||
"Inferact 50-session trace · ts=1 · 4× H200 · aborted requests excluded",
|
||||
fontsize=13, y=1.02,
|
||||
)
|
||||
plt.tight_layout()
|
||||
out = FIG / "e1_vs_e4_ttft_pdf.png"
|
||||
plt.savefig(out, dpi=150, bbox_inches="tight")
|
||||
print(f"wrote {out}")
|
||||
plt.close(fig)
|
||||
|
||||
|
||||
def _plot_latency_cdf(e1_lat, e4_lat):
|
||||
fig, axes = plt.subplots(1, 2, figsize=(16, 6.5))
|
||||
|
||||
# Linear CDF
|
||||
ax = axes[0]
|
||||
for arr, color, name in [(e4_lat, E4_COLOR, f"E4 KVC (n={len(e4_lat)})"),
|
||||
(e1_lat, E1_COLOR, f"E1 naive (n={len(e1_lat)})")]:
|
||||
s = np.sort(arr)
|
||||
y = np.linspace(0, 1, len(s), endpoint=False)
|
||||
ax.plot(s, y, color=color, lw=2.5, label=name)
|
||||
ax.set_xlim(0, 300)
|
||||
ax.set_xlabel("E2E latency (seconds)", fontsize=11)
|
||||
ax.set_ylabel("CDF", fontsize=11)
|
||||
ax.set_title("Full latency CDF (linear)", fontsize=12)
|
||||
ax.legend(loc="lower right", fontsize=10)
|
||||
ax.grid(True, linestyle=":", alpha=0.4)
|
||||
# Annotate percentiles
|
||||
for q, mark in [(0.5, "p50"), (0.9, "p90"), (0.99, "p99")]:
|
||||
e4v, e1v = pct(e4_lat, q), pct(e1_lat, q)
|
||||
ax.axhline(q, color="gray", ls=":", alpha=0.3)
|
||||
ax.annotate(f"{mark}: E4 {e4v:.1f}s, E1 {e1v:.1f}s",
|
||||
xy=(0, q), xytext=(220, q - 0.02 if q > 0.5 else q + 0.02),
|
||||
fontsize=9, color="black")
|
||||
|
||||
# Log CDF showing tail
|
||||
ax = axes[1]
|
||||
for arr, color, name in [(e4_lat, E4_COLOR, f"E4 KVC"),
|
||||
(e1_lat, E1_COLOR, f"E1 naive")]:
|
||||
s = np.sort(arr)
|
||||
s_clip = np.maximum(s, 0.01)
|
||||
y = np.linspace(0, 1, len(s), endpoint=False)
|
||||
ax.plot(s_clip, 1 - y, color=color, lw=2.5, label=name)
|
||||
ax.set_xscale("log")
|
||||
ax.set_yscale("log")
|
||||
ax.set_xlim(0.5, 500)
|
||||
ax.set_ylim(1e-3, 1.1)
|
||||
ax.set_xlabel("E2E latency (log s)", fontsize=11)
|
||||
ax.set_ylabel("P(latency > x) (log)", fontsize=11)
|
||||
ax.set_title("Survival function — log-log (highlights tail behavior)", fontsize=12)
|
||||
ax.legend(loc="upper right", fontsize=10)
|
||||
ax.grid(True, which="both", linestyle=":", alpha=0.4)
|
||||
|
||||
fig.suptitle("E2E latency: E4 KVC vs E1 naive PD-disagg", fontsize=13, y=1.02)
|
||||
plt.tight_layout()
|
||||
out = FIG / "e1_vs_e4_latency_cdf.png"
|
||||
plt.savefig(out, dpi=150, bbox_inches="tight")
|
||||
print(f"wrote {out}")
|
||||
plt.close(fig)
|
||||
|
||||
|
||||
def _plot_path_latency(e4):
|
||||
by_mode = defaultdict(list)
|
||||
by_mode_lat = defaultdict(list)
|
||||
for r in e4:
|
||||
m = r.get("execution_mode", "?") or "?"
|
||||
if r.get("ttft_s") is not None:
|
||||
by_mode[m].append(float(r["ttft_s"]))
|
||||
if r.get("latency_s") is not None:
|
||||
by_mode_lat[m].append(float(r["latency_s"]))
|
||||
# Sort by count
|
||||
modes = sorted(by_mode, key=lambda m: -len(by_mode[m]))
|
||||
# Limit to top-N by count
|
||||
modes = modes[:14]
|
||||
|
||||
fig, ax = plt.subplots(1, 1, figsize=(14, 7))
|
||||
pos = np.arange(len(modes))
|
||||
means = [np.mean(by_mode[m]) for m in modes]
|
||||
p50 = [pct(np.array(by_mode[m]), 0.5) for m in modes]
|
||||
p99 = [pct(np.array(by_mode[m]), 0.99) for m in modes]
|
||||
counts = [len(by_mode[m]) for m in modes]
|
||||
bar_h = 0.25
|
||||
ax.barh(pos - bar_h, means, bar_h, label="mean", color="#4a90e2", alpha=0.85)
|
||||
ax.barh(pos, p50, bar_h, label="p50", color="#66cc99", alpha=0.85)
|
||||
ax.barh(pos + bar_h, p99, bar_h, label="p99", color="#e74c3c", alpha=0.85)
|
||||
ax.set_yticks(pos)
|
||||
ax.set_yticklabels([f"{m} (n={counts[i]})" for i, m in enumerate(modes)],
|
||||
fontsize=9)
|
||||
ax.invert_yaxis()
|
||||
ax.set_xlabel("TTFT (s)", fontsize=11)
|
||||
ax.set_title("E4 per execution_mode TTFT (sorted by count, top 14)",
|
||||
fontsize=12, pad=10)
|
||||
ax.legend(loc="lower right", fontsize=10)
|
||||
ax.grid(True, linestyle=":", alpha=0.4)
|
||||
plt.tight_layout()
|
||||
out = FIG / "e4_path_latency.png"
|
||||
plt.savefig(out, dpi=150, bbox_inches="tight")
|
||||
print(f"wrote {out}")
|
||||
plt.close(fig)
|
||||
|
||||
|
||||
def _plot_p99_attribution(e4, e1_ttft, e4_ttft):
|
||||
"""Show which execution modes hit p99 and dominate the tail."""
|
||||
# Threshold: anything > E4's p99 = part of the p99 tail
|
||||
e4_p99 = pct(e4_ttft, 0.99)
|
||||
e1_p99 = pct(e1_ttft, 0.99)
|
||||
# Define the "tail" as TTFT > p95
|
||||
threshold = pct(e4_ttft, 0.95)
|
||||
tail_modes = Counter()
|
||||
body_modes = Counter()
|
||||
for r in e4:
|
||||
m = r.get("execution_mode", "?") or "?"
|
||||
ttft = r.get("ttft_s")
|
||||
if ttft is None:
|
||||
continue
|
||||
if ttft >= threshold:
|
||||
tail_modes[m] += 1
|
||||
else:
|
||||
body_modes[m] += 1
|
||||
all_modes = sorted(tail_modes, key=lambda m: -tail_modes[m])[:10]
|
||||
body_total = sum(body_modes.values())
|
||||
tail_total = sum(tail_modes.values())
|
||||
|
||||
fig, axes = plt.subplots(1, 2, figsize=(16, 6.5))
|
||||
|
||||
# Pie of tail composition
|
||||
ax = axes[0]
|
||||
sizes = [tail_modes[m] for m in all_modes]
|
||||
rest = sum(tail_modes.values()) - sum(sizes)
|
||||
if rest > 0:
|
||||
all_modes_label = all_modes + ["(other)"]
|
||||
sizes = sizes + [rest]
|
||||
else:
|
||||
all_modes_label = all_modes
|
||||
wedges, texts, autotexts = ax.pie(
|
||||
sizes, labels=[f"{m}\n(n={c})" for m, c in zip(all_modes_label, sizes)],
|
||||
autopct="%1.0f%%", startangle=90, textprops={"fontsize": 9},
|
||||
)
|
||||
ax.set_title(f"E4 p95-p99 tail composition\n(TTFT ≥ {threshold:.1f}s, n={tail_total})",
|
||||
fontsize=12, pad=12)
|
||||
|
||||
# Bar of mean TTFT within tail per mode
|
||||
ax = axes[1]
|
||||
mode_to_tail_lat = defaultdict(list)
|
||||
for r in e4:
|
||||
m = r.get("execution_mode", "?") or "?"
|
||||
ttft = r.get("ttft_s")
|
||||
if ttft is None or ttft < threshold:
|
||||
continue
|
||||
mode_to_tail_lat[m].append(float(ttft))
|
||||
pos = np.arange(len(all_modes))
|
||||
means = [np.mean(mode_to_tail_lat[m]) if mode_to_tail_lat[m] else 0 for m in all_modes]
|
||||
counts = [len(mode_to_tail_lat[m]) for m in all_modes]
|
||||
ax.barh(pos, means, color="#e74c3c", alpha=0.85)
|
||||
ax.set_yticks(pos)
|
||||
ax.set_yticklabels([f"{m} (n={counts[i]})" for i, m in enumerate(all_modes)],
|
||||
fontsize=9)
|
||||
ax.invert_yaxis()
|
||||
ax.set_xlabel("Mean TTFT in p95-p99 region (s)", fontsize=11)
|
||||
ax.set_title(f"Per-mode mean TTFT among tail reqs", fontsize=12)
|
||||
ax.axvline(e4_p99, color=E4_COLOR, ls="--", alpha=0.6, label=f"E4 p99 = {e4_p99:.1f}s")
|
||||
ax.axvline(e1_p99, color=E1_COLOR, ls="--", alpha=0.6, label=f"E1 p99 = {e1_p99:.1f}s")
|
||||
ax.legend(loc="lower right", fontsize=10)
|
||||
ax.grid(True, linestyle=":", alpha=0.4)
|
||||
|
||||
fig.suptitle(
|
||||
f"E4 p99 tail attribution: which execution_modes produce the long tail?\n"
|
||||
f"E4 p99 = {e4_p99:.1f}s vs E1 p99 = {e1_p99:.1f}s "
|
||||
f"(KVC loses tail by +{(e4_p99/e1_p99-1)*100:.1f}%)",
|
||||
fontsize=13, y=1.02,
|
||||
)
|
||||
plt.tight_layout()
|
||||
out = FIG / "e1_vs_e4_p99_attribution.png"
|
||||
plt.savefig(out, dpi=150, bbox_inches="tight")
|
||||
print(f"wrote {out}")
|
||||
plt.close(fig)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -136,7 +136,7 @@ def main() -> None:
|
||||
bar_colors = [KVC_P_COLOR, KVC_D_COLOR, KVC_D_COLOR, KVC_D_COLOR,
|
||||
DP_COLOR, DP_COLOR, DP_COLOR, DP_COLOR]
|
||||
|
||||
fig, axes = plt.subplots(1, 2, figsize=(15, 6.5))
|
||||
fig, axes = plt.subplots(1, 2, figsize=(15, 7.0))
|
||||
x = np.arange(len(all_gpus))
|
||||
|
||||
# -- Left: per-GPU request count ----------------------------------
|
||||
@@ -148,20 +148,24 @@ def main() -> None:
|
||||
ax.set_xticks(x)
|
||||
ax.set_xticklabels(labels, fontsize=9.5)
|
||||
ax.set_ylabel("Number of requests touching this GPU", fontsize=11)
|
||||
ax.set_title("Per-GPU request count\n(naïve view: P seems idle)", fontsize=12, pad=10)
|
||||
# Headroom for the annotation: extend ylim 35% above tallest bar
|
||||
ax.set_ylim(0, max(counts) * 1.40)
|
||||
ax.set_title("Per-GPU request count\n(naïve view: P seems idle)",
|
||||
fontsize=12, pad=24)
|
||||
ax.grid(axis="y", linestyle=":", alpha=0.4)
|
||||
ax.set_axisbelow(True)
|
||||
|
||||
# Annotate: KVC P GPU is "low frequency"
|
||||
# Place in upper-right area (over DP group) so it doesn't sit on KVC D bars
|
||||
p_idx = 0
|
||||
p_pct = counts[p_idx] / sum(counts[:4]) * 100 # vs KVC total
|
||||
ax.annotate(
|
||||
f"P GPU only sees\n"
|
||||
f"{counts[p_idx]:,} requests\n"
|
||||
f"({counts[p_idx]/len(kvc)*100:.1f}% of total)",
|
||||
f"({counts[p_idx]/len(kvc)*100:.1f}% of all KVC requests)",
|
||||
xy=(p_idx, counts[p_idx]),
|
||||
xytext=(p_idx + 0.6, max(counts) * 0.55),
|
||||
fontsize=9, color=KVC_P_COLOR, fontweight="bold",
|
||||
xytext=(2.4, max(counts) * 1.20),
|
||||
fontsize=10, color=KVC_P_COLOR, fontweight="bold", ha="center",
|
||||
bbox=dict(facecolor="white", edgecolor=KVC_P_COLOR, alpha=0.92, pad=4),
|
||||
arrowprops=dict(arrowstyle="->", color=KVC_P_COLOR, lw=1.0),
|
||||
)
|
||||
|
||||
@@ -185,31 +189,42 @@ def main() -> None:
|
||||
ax.set_xticks(x)
|
||||
ax.set_xticklabels(labels, fontsize=9.5)
|
||||
ax.set_ylabel("Compute tokens (millions)", fontsize=11)
|
||||
# Headroom for the annotation
|
||||
ax.set_ylim(0, max(total_M) * 1.45)
|
||||
ax.set_title("Per-GPU compute work\n(work view: P is comparable to each D)",
|
||||
fontsize=12, pad=10)
|
||||
fontsize=12, pad=24)
|
||||
ax.grid(axis="y", linestyle=":", alpha=0.4)
|
||||
ax.set_axisbelow(True)
|
||||
# Legend placed at upper-left where bars are tallest is fine after raising ylim
|
||||
ax.legend(loc="upper left", fontsize=10, framealpha=0.95)
|
||||
|
||||
# Annotate: KVC P GPU does similar work to each D
|
||||
# Annotate: KVC P GPU does similar work to each D.
|
||||
# Place over DP region (right side) so it doesn't sit on KVC D bars.
|
||||
ax.annotate(
|
||||
f"P GPU does {total_M[p_idx]:.2f}M tokens of\n"
|
||||
f"prefill — comparable per-GPU\n"
|
||||
f"load to each KVC D worker",
|
||||
f"P GPU does {total_M[p_idx]:.2f}M tokens of prefill\n"
|
||||
f"— comparable per-GPU load to each KVC D worker\n"
|
||||
f"(KVC D avg = {np.mean(total_M[1:4]):.2f}M)",
|
||||
xy=(p_idx, total_M[p_idx]),
|
||||
xytext=(p_idx + 0.6, max(total_M) * 0.62),
|
||||
fontsize=9, color=KVC_P_COLOR, fontweight="bold",
|
||||
xytext=(5.5, max(total_M) * 1.30),
|
||||
fontsize=10, color=KVC_P_COLOR, fontweight="bold", ha="center",
|
||||
bbox=dict(facecolor="white", edgecolor=KVC_P_COLOR, alpha=0.92, pad=4),
|
||||
arrowprops=dict(arrowstyle="->", color=KVC_P_COLOR, lw=1.0),
|
||||
)
|
||||
|
||||
# Separator + group labels
|
||||
# Separator + group labels (placed in axes-fraction coords, below subplot
|
||||
# title at pad=24 we now have safe room for these at y_axes_frac ≈ 1.02)
|
||||
for ax in axes:
|
||||
ax.axvline(3.5, color="gray", linestyle="--", linewidth=1.0, alpha=0.5)
|
||||
ymin, ymax = ax.get_ylim()
|
||||
ax.text(1.5, ymax * 1.05, "KVC 1P3D", ha="center", fontsize=11,
|
||||
fontweight="bold", color="#444")
|
||||
ax.text(5.5, ymax * 1.05, "DP 4-way CA", ha="center", fontsize=11,
|
||||
fontweight="bold", color="#444")
|
||||
ax.text(0.25, 1.02, "KVC 1P3D",
|
||||
transform=ax.transAxes, ha="center", va="bottom",
|
||||
fontsize=11.5, fontweight="bold", color="#444",
|
||||
bbox=dict(facecolor="#F2F2F2", edgecolor="#888",
|
||||
alpha=0.85, pad=3))
|
||||
ax.text(0.75, 1.02, "DP 4-way CA",
|
||||
transform=ax.transAxes, ha="center", va="bottom",
|
||||
fontsize=11.5, fontweight="bold", color="#444",
|
||||
bbox=dict(facecolor="#F2F2F2", edgecolor="#888",
|
||||
alpha=0.85, pad=3))
|
||||
|
||||
fig.suptitle(
|
||||
"Per-GPU utilization: \"is KVC's prefill GPU wasted?\"\n"
|
||||
|
||||
141
scripts/analyze_e4_d_to_p.py
Normal file
141
scripts/analyze_e4_d_to_p.py
Normal file
@@ -0,0 +1,141 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Cross-comparison of E1 (naive PD), E3 (KVC v2 + load-floor), E4 (KVC + D→P).
|
||||
|
||||
Usage:
|
||||
uv run --no-sync python scripts/analyze_e4_d_to_p.py \
|
||||
--e1 outputs/e1_naive_1p3d_kvaware_rdma_50sess/e1_naive_1p3d_kvaware_run1_summary.json \
|
||||
--e3 outputs/e3_kvc_v2_loadfloor_rdma_50sess/*_summary.json \
|
||||
--e4 outputs/e4_kvc_v2_d_to_p_sync_50sess/e4_kvc_v2_d_to_p_sync_run1_summary.json \
|
||||
--e4-metrics outputs/e4_kvc_v2_d_to_p_sync_50sess/e4_kvc_v2_d_to_p_sync_run1_metrics.jsonl
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import glob
|
||||
import json
|
||||
import statistics
|
||||
from collections import Counter, defaultdict
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
|
||||
def _load_summary(path_glob: str) -> dict[str, Any] | None:
|
||||
paths = glob.glob(path_glob)
|
||||
if not paths:
|
||||
return None
|
||||
with open(paths[0]) as f:
|
||||
return json.load(f)
|
||||
|
||||
|
||||
def _percentiles(values: list[float]) -> dict[str, float]:
|
||||
if not values:
|
||||
return {"p50": 0, "p90": 0, "p99": 0, "mean": 0}
|
||||
values = sorted(values)
|
||||
n = len(values)
|
||||
return {
|
||||
"mean": statistics.mean(values),
|
||||
"p50": values[n // 2],
|
||||
"p90": values[min(n - 1, int(n * 0.90))],
|
||||
"p99": values[min(n - 1, int(n * 0.99))],
|
||||
}
|
||||
|
||||
|
||||
def _row(label: str, s: dict[str, Any] | None, key: str) -> str:
|
||||
if s is None:
|
||||
return f" {label:<40} (missing)"
|
||||
stat = s.get(key, {})
|
||||
return (
|
||||
f" {label:<40} "
|
||||
f"mean={stat.get('mean', 0):>8.3f} "
|
||||
f"p50={stat.get('p50', 0):>8.3f} "
|
||||
f"p90={stat.get('p90', 0):>8.3f} "
|
||||
f"p99={stat.get('p99', 0):>8.3f}"
|
||||
)
|
||||
|
||||
|
||||
def main():
|
||||
ap = argparse.ArgumentParser()
|
||||
ap.add_argument("--e1", required=True)
|
||||
ap.add_argument("--e3", required=True)
|
||||
ap.add_argument("--e4", required=True)
|
||||
ap.add_argument("--e4-metrics", help="optional path to e4 metrics.jsonl for reseed-mode breakdown")
|
||||
args = ap.parse_args()
|
||||
|
||||
e1 = _load_summary(args.e1)
|
||||
e3 = _load_summary(args.e3)
|
||||
e4 = _load_summary(args.e4)
|
||||
|
||||
print("=" * 90)
|
||||
print("E1 / E3 / E4 cross-comparison")
|
||||
print("=" * 90)
|
||||
for s, name in [(e1, "E1"), (e3, "E3"), (e4, "E4")]:
|
||||
if s is None:
|
||||
print(f" {name}: MISSING")
|
||||
continue
|
||||
total = (s.get("error_count", 0) + s.get("abort_count", 0) +
|
||||
sum(c for c in s.get("execution_modes", {}).values()))
|
||||
print(f" {name}: error={s.get('error_count', 0):>4} abort={s.get('abort_count', 0):>4} "
|
||||
f"failure={s.get('failure_count', 0):>4} exec_modes={dict(s.get('execution_modes', {}))}")
|
||||
|
||||
print("\n--- latency_stats_s ---")
|
||||
print(_row("E1 naive PD", e1, "latency_stats_s"))
|
||||
print(_row("E3 KVC v2 LF", e3, "latency_stats_s"))
|
||||
print(_row("E4 KVC + D→P", e4, "latency_stats_s"))
|
||||
|
||||
print("\n--- ttft_stats_s ---")
|
||||
print(_row("E1 naive PD", e1, "ttft_stats_s"))
|
||||
print(_row("E3 KVC v2 LF", e3, "ttft_stats_s"))
|
||||
print(_row("E4 KVC + D→P", e4, "ttft_stats_s"))
|
||||
|
||||
print("\n--- per-decode load ---")
|
||||
for s, name in [(e1, "E1"), (e3, "E3"), (e4, "E4")]:
|
||||
print(f" {name}: {dict(s.get('per_decode_load', {}) if s else {})}")
|
||||
|
||||
# ---- E4 reseed-mode breakdown ----
|
||||
if args.e4_metrics:
|
||||
print("\n--- E4 reseed-mode breakdown (from metrics.jsonl) ---")
|
||||
try:
|
||||
modes = defaultdict(list)
|
||||
d2p_outcomes = Counter()
|
||||
with open(args.e4_metrics) as f:
|
||||
for line in f:
|
||||
try:
|
||||
rec = json.loads(line)
|
||||
except json.JSONDecodeError:
|
||||
continue
|
||||
mode = rec.get("execution_mode") or "?"
|
||||
ttft = rec.get("ttft_s")
|
||||
if ttft is not None:
|
||||
modes[mode].append(float(ttft))
|
||||
# D→P hit counter (we logged via logger.info, not in metrics
|
||||
# — placeholder for future structured event)
|
||||
print(f" per-mode TTFT (count, mean, p50, p99):")
|
||||
for mode, ttfts in sorted(modes.items()):
|
||||
p = _percentiles(ttfts)
|
||||
print(f" {mode:<55} n={len(ttfts):>4} "
|
||||
f"mean={p['mean']:>7.3f} p50={p['p50']:>7.3f} p99={p['p99']:>7.3f}")
|
||||
except Exception as e:
|
||||
print(f" parse error: {e}")
|
||||
|
||||
# ---- H1 / H2 / H3 verdicts ----
|
||||
print("\n" + "=" * 90)
|
||||
print("Hypothesis verdicts")
|
||||
print("=" * 90)
|
||||
if e1 and e4:
|
||||
e1_p99 = e1.get("ttft_stats_s", {}).get("p99", float("inf"))
|
||||
e4_p99 = e4.get("ttft_stats_s", {}).get("p99", float("inf"))
|
||||
verdict_h1 = "PASS" if e4_p99 <= e1_p99 else "FAIL"
|
||||
print(f" H1 (E4 TTFT p99 ≤ E1 TTFT p99): {e4_p99:.3f} vs {e1_p99:.3f} → {verdict_h1}")
|
||||
if e3 and e4:
|
||||
e3_modes = e3.get("execution_modes", {})
|
||||
e4_modes = e4.get("execution_modes", {})
|
||||
e3_success = sum(v for k, v in e3_modes.items() if "reseed" not in k.lower())
|
||||
e4_success = sum(v for k, v in e4_modes.items() if "reseed" not in k.lower())
|
||||
verdict_h3 = "PASS" if (e4_success or 0) >= 0.85 * (e3_success or 1) else "FAIL"
|
||||
print(f" H3 (E4 success count ≥ 0.85 × E3 success): "
|
||||
f"{e4_success} vs 0.85 × {e3_success} = {0.85 * e3_success:.0f} → {verdict_h3}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
189
scripts/convert_inferact_to_trace.py
Normal file
189
scripts/convert_inferact_to_trace.py
Normal file
@@ -0,0 +1,189 @@
|
||||
"""Convert Inferact codex_swebenchpro_traces (ShareGPT) to agentic-pd-hybrid trace JSONL.
|
||||
|
||||
Output schema (one JSON object per line, matching src/agentic_pd_hybrid/trace.py):
|
||||
chat_id, parent_chat_id, timestamp, input_length, output_length, type, turn, hash_ids
|
||||
|
||||
Each trial in the input becomes one session. Each (human, gpt) pair within a trial
|
||||
becomes one turn. The prefix at turn N is the concatenation of all (human, gpt) pairs
|
||||
from turns 0..N-1 plus the current human message — this mirrors how agentic coding
|
||||
agents grow context across calls.
|
||||
|
||||
hash_ids are derived per 24-token block via sha256 of the block's text + previous hash,
|
||||
which gives stable, deterministic, prefix-shared hashes across turns of the same session.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import hashlib
|
||||
import json
|
||||
import sys
|
||||
import time
|
||||
from pathlib import Path
|
||||
|
||||
BLOCK_TOKEN_BUDGET = 24
|
||||
|
||||
|
||||
def _block_hash(text: str, prev_hash: int) -> int:
|
||||
h = hashlib.sha256(text.encode("utf-8") + prev_hash.to_bytes(8, "big")).digest()
|
||||
return int.from_bytes(h[:8], "big") & 0x7FFFFFFFFFFFFFFF
|
||||
|
||||
|
||||
def _build_hash_ids(token_ids: list[int]) -> list[int]:
|
||||
out: list[int] = []
|
||||
prev = 0
|
||||
for start in range(0, len(token_ids), BLOCK_TOKEN_BUDGET):
|
||||
block = token_ids[start : start + BLOCK_TOKEN_BUDGET]
|
||||
block_repr = ",".join(str(t) for t in block)
|
||||
prev = _block_hash(block_repr, prev)
|
||||
out.append(prev)
|
||||
return out
|
||||
|
||||
|
||||
def _pair_turns(conv: list[dict]) -> list[tuple[str, str]]:
|
||||
"""Pair consecutive (human, gpt) messages. Skip malformed."""
|
||||
pairs: list[tuple[str, str]] = []
|
||||
i = 0
|
||||
while i + 1 < len(conv):
|
||||
a, b = conv[i], conv[i + 1]
|
||||
if (
|
||||
isinstance(a, dict)
|
||||
and isinstance(b, dict)
|
||||
and a.get("from") == "human"
|
||||
and b.get("from") == "gpt"
|
||||
):
|
||||
pairs.append((str(a.get("value", "")), str(b.get("value", ""))))
|
||||
i += 2
|
||||
else:
|
||||
i += 1
|
||||
return pairs
|
||||
|
||||
|
||||
def convert(
|
||||
input_path: Path,
|
||||
output_path: Path,
|
||||
*,
|
||||
tokenizer_path: str,
|
||||
max_trials: int | None,
|
||||
inter_turn_gap_s: float,
|
||||
session_stagger_s: float,
|
||||
request_type: str,
|
||||
) -> None:
|
||||
from transformers import AutoTokenizer
|
||||
|
||||
print(f"loading tokenizer from {tokenizer_path}", file=sys.stderr)
|
||||
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True)
|
||||
|
||||
print(f"loading {input_path}", file=sys.stderr)
|
||||
data = json.loads(input_path.read_text())
|
||||
if max_trials is not None:
|
||||
data = data[:max_trials]
|
||||
print(f"{len(data)} trials to process", file=sys.stderr)
|
||||
|
||||
next_chat_id = 1_000_000
|
||||
written = 0
|
||||
skipped_trials = 0
|
||||
t0 = time.time()
|
||||
|
||||
with output_path.open("w", encoding="utf-8") as out_f:
|
||||
for trial_idx, trial in enumerate(data):
|
||||
conv = trial.get("conversations") or []
|
||||
turns = _pair_turns(conv)
|
||||
if not turns:
|
||||
skipped_trials += 1
|
||||
continue
|
||||
|
||||
base_ts = trial_idx * session_stagger_s
|
||||
ts = base_ts
|
||||
parent_chat_id = -1
|
||||
prefix_text = ""
|
||||
|
||||
for turn_idx, (human, assistant) in enumerate(turns):
|
||||
# Input at this turn = full prior context + current human message.
|
||||
current_text = (
|
||||
prefix_text + ("\n\n[USER]\n" if prefix_text else "[USER]\n") + human
|
||||
)
|
||||
input_ids = tokenizer.encode(current_text, add_special_tokens=False)
|
||||
input_length = len(input_ids)
|
||||
|
||||
output_ids = tokenizer.encode(assistant, add_special_tokens=False)
|
||||
output_length = max(1, len(output_ids))
|
||||
|
||||
hash_ids = _build_hash_ids(input_ids)
|
||||
|
||||
chat_id = next_chat_id
|
||||
next_chat_id += 1
|
||||
record = {
|
||||
"chat_id": chat_id,
|
||||
"parent_chat_id": parent_chat_id,
|
||||
"timestamp": round(ts, 6),
|
||||
"input_length": input_length,
|
||||
"output_length": output_length,
|
||||
"type": request_type,
|
||||
"turn": turn_idx,
|
||||
"hash_ids": hash_ids,
|
||||
}
|
||||
out_f.write(json.dumps(record) + "\n")
|
||||
written += 1
|
||||
|
||||
parent_chat_id = chat_id
|
||||
ts += inter_turn_gap_s
|
||||
prefix_text = current_text + "\n\n[ASSISTANT]\n" + assistant
|
||||
|
||||
if (trial_idx + 1) % 20 == 0:
|
||||
elapsed = time.time() - t0
|
||||
rate = (trial_idx + 1) / elapsed if elapsed > 0 else 0
|
||||
eta = (len(data) - trial_idx - 1) / rate if rate > 0 else 0
|
||||
print(
|
||||
f" trial {trial_idx + 1}/{len(data)} reqs={written} "
|
||||
f"rate={rate:.1f} trial/s eta={eta:.0f}s",
|
||||
file=sys.stderr,
|
||||
)
|
||||
|
||||
elapsed = time.time() - t0
|
||||
print(
|
||||
f"done: wrote {written} requests across {len(data) - skipped_trials} sessions "
|
||||
f"({skipped_trials} trials skipped, empty conversations) in {elapsed:.1f}s "
|
||||
f"to {output_path}",
|
||||
file=sys.stderr,
|
||||
)
|
||||
|
||||
|
||||
def main() -> None:
|
||||
p = argparse.ArgumentParser(description=__doc__)
|
||||
p.add_argument(
|
||||
"--input",
|
||||
type=Path,
|
||||
default=Path("third_party/codex_swebenchpro_traces/codex_swebenchpro.json"),
|
||||
)
|
||||
p.add_argument("--output", type=Path, required=True)
|
||||
p.add_argument(
|
||||
"--tokenizer",
|
||||
default="/mnt/models/Qwen/Qwen3-30B-A3B-Instruct-2507",
|
||||
help="Path or HF id for the tokenizer. Default matches v2 sweep model.",
|
||||
)
|
||||
p.add_argument(
|
||||
"--max-trials",
|
||||
type=int,
|
||||
default=None,
|
||||
help="Cap number of trials processed (useful for smoke / quick tests).",
|
||||
)
|
||||
p.add_argument("--inter-turn-gap-s", type=float, default=2.5)
|
||||
p.add_argument("--session-stagger-s", type=float, default=1.0)
|
||||
p.add_argument("--request-type", default="chat")
|
||||
args = p.parse_args()
|
||||
|
||||
args.output.parent.mkdir(parents=True, exist_ok=True)
|
||||
convert(
|
||||
input_path=args.input,
|
||||
output_path=args.output,
|
||||
tokenizer_path=args.tokenizer,
|
||||
max_trials=args.max_trials,
|
||||
inter_turn_gap_s=args.inter_turn_gap_s,
|
||||
session_stagger_s=args.session_stagger_s,
|
||||
request_type=args.request_type,
|
||||
)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
81
scripts/sample_trace_subset.py
Normal file
81
scripts/sample_trace_subset.py
Normal file
@@ -0,0 +1,81 @@
|
||||
"""Deterministically slice the first N sessions of an agentic-pd-hybrid trace.
|
||||
|
||||
Method: scan in file order, count records whose `parent_chat_id == -1` (= a
|
||||
session's turn 0), and write every record until the (N+1)-th such record is
|
||||
seen. No RNG, no hashing — re-running on the same input produces a byte-
|
||||
identical output. Used to derive matched subsets for paired sweeps (E1 vs E2)
|
||||
without spending GPU hours on the full trace.
|
||||
|
||||
Usage:
|
||||
uv run --no-sync python scripts/sample_trace_subset.py \
|
||||
--input outputs/inferact_codex_swebenchpro.jsonl \
|
||||
--output outputs/inferact_50sess.jsonl \
|
||||
--sessions 50
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import hashlib
|
||||
import json
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
def slice_first_n_sessions(input_path: Path, output_path: Path, n_sessions: int) -> dict:
|
||||
sessions_seen = 0
|
||||
requests_written = 0
|
||||
input_length_sum = 0
|
||||
output_length_sum = 0
|
||||
min_in = float("inf")
|
||||
max_in = 0
|
||||
|
||||
with input_path.open("r", encoding="utf-8") as f_in, output_path.open(
|
||||
"w", encoding="utf-8"
|
||||
) as f_out:
|
||||
for line in f_in:
|
||||
rec = json.loads(line)
|
||||
if rec["parent_chat_id"] == -1:
|
||||
sessions_seen += 1
|
||||
if sessions_seen > n_sessions:
|
||||
break
|
||||
f_out.write(line)
|
||||
requests_written += 1
|
||||
il = int(rec["input_length"])
|
||||
input_length_sum += il
|
||||
output_length_sum += int(rec["output_length"])
|
||||
if il < min_in:
|
||||
min_in = il
|
||||
if il > max_in:
|
||||
max_in = il
|
||||
|
||||
h = hashlib.md5(output_path.read_bytes()).hexdigest()
|
||||
return {
|
||||
"sessions": min(sessions_seen, n_sessions),
|
||||
"requests": requests_written,
|
||||
"input_length_mean": input_length_sum / max(1, requests_written),
|
||||
"input_length_min": int(min_in) if min_in != float("inf") else 0,
|
||||
"input_length_max": max_in,
|
||||
"output_length_mean": output_length_sum / max(1, requests_written),
|
||||
"output_md5": h,
|
||||
}
|
||||
|
||||
|
||||
def main() -> None:
|
||||
p = argparse.ArgumentParser(description=__doc__)
|
||||
p.add_argument(
|
||||
"--input",
|
||||
type=Path,
|
||||
default=Path("outputs/inferact_codex_swebenchpro.jsonl"),
|
||||
)
|
||||
p.add_argument("--output", type=Path, required=True)
|
||||
p.add_argument("--sessions", type=int, default=50)
|
||||
args = p.parse_args()
|
||||
|
||||
args.output.parent.mkdir(parents=True, exist_ok=True)
|
||||
stats = slice_first_n_sessions(args.input, args.output, args.sessions)
|
||||
print(json.dumps(stats, indent=2), file=sys.stderr)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
44
scripts/setup_env.sh
Executable file
44
scripts/setup_env.sh
Executable file
@@ -0,0 +1,44 @@
|
||||
#!/usr/bin/env bash
|
||||
# Source this file in every shell that will run agentic-pd-hybrid.
|
||||
#
|
||||
# source scripts/setup_env.sh
|
||||
#
|
||||
# Why all three are needed:
|
||||
# - CUDA_HOME / PATH point tvm_ffi (vendor sglang JIT compiler) at cu12.8 nvcc.
|
||||
# Without this it falls back to /usr/local/cuda-13.0/bin/nvcc and the
|
||||
# resulting .so links libcudart.so.13 which driver 570 (cu12.8 API) rejects
|
||||
# with cudaErrorInsufficientDriver.
|
||||
# - LD_LIBRARY_PATH must expose libcudart.so.12 for mooncake.engine (cu12 wheel)
|
||||
# AND ~/cuda-12.8/lib64 for tvm_ffi compile-time linker searches.
|
||||
#
|
||||
# See docs/H200_DRIVER570_SETUP_ZH.md for the full rationale.
|
||||
|
||||
REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
|
||||
|
||||
if [ ! -x "$HOME/cuda-12.8/bin/nvcc" ]; then
|
||||
echo "ERROR: $HOME/cuda-12.8/bin/nvcc not found." >&2
|
||||
echo "Install cu12.8 toolkit first (see docs/H200_DRIVER570_SETUP_ZH.md §3)." >&2
|
||||
return 1 2>/dev/null || exit 1
|
||||
fi
|
||||
|
||||
if [ ! -f "$REPO_ROOT/.venv/lib/python3.12/site-packages/nvidia/cuda_runtime/lib/libcudart.so.12" ]; then
|
||||
echo "ERROR: venv libcudart.so.12 missing. Run 'uv sync' from $REPO_ROOT." >&2
|
||||
return 1 2>/dev/null || exit 1
|
||||
fi
|
||||
|
||||
export CUDA_HOME="$HOME/cuda-12.8"
|
||||
export PATH="$HOME/cuda-12.8/bin:$PATH"
|
||||
export LD_LIBRARY_PATH="$REPO_ROOT/.venv/lib/python3.12/site-packages/nvidia/cuda_runtime/lib:$HOME/cuda-12.8/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
|
||||
|
||||
# Mooncake batch_transfer_sync C++ timeout (seconds). Default in mooncake is
|
||||
# 30 s; a single LRU eviction sweep on a saturated D scheduler can exceed
|
||||
# that and cause the hair-trigger blacklist in conn.py:1270 to permanently
|
||||
# mark the D's mooncake_session_id "failed". 1800 s = 30 min gives us
|
||||
# headroom while still detecting genuinely broken peers eventually.
|
||||
# See docs/E1_E2_RESULTS_ZH.md §5c and docs/E1_E2_FIX_DESIGN_ZH.md Q1.C.
|
||||
export MC_TRANSFER_TIMEOUT="${MC_TRANSFER_TIMEOUT:-1800}"
|
||||
|
||||
echo "agentic-pd-hybrid env ready:"
|
||||
echo " CUDA_HOME=$CUDA_HOME ($(nvcc --version | grep release | sed 's/.*release //'))"
|
||||
echo " libcudart.so.12 at $REPO_ROOT/.venv/lib/python3.12/site-packages/nvidia/cuda_runtime/lib"
|
||||
echo " MC_TRANSFER_TIMEOUT=${MC_TRANSFER_TIMEOUT}s"
|
||||
244
scripts/smoke_snapshot_link.py
Executable file
244
scripts/smoke_snapshot_link.py
Executable file
@@ -0,0 +1,244 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Two-process smoke test for snapshot_link D→P RDMA byte transfer.
|
||||
|
||||
Spawns scripts/snapshot_link_receiver.py via subprocess.Popen with stderr
|
||||
piped to ``<tmpdir>/recv.stderr.log`` for post-mortem if something dies.
|
||||
|
||||
Sender (this process):
|
||||
1. Spawns receiver child, waits for endpoint.json
|
||||
2. Brings up own SnapshotPeer (no recv buffer), registers a send buffer
|
||||
3. For each size: fill pattern, batch_transfer_sync_write, signal child,
|
||||
wait for child's ack
|
||||
4. Reads child's stdout (one JSON event per line) for verification
|
||||
|
||||
Pass = every size yields a child "verify" event with ok=true.
|
||||
|
||||
Usage:
|
||||
bash scripts/setup_env.sh && uv run --no-sync python scripts/smoke_snapshot_link.py
|
||||
|
||||
Env (optional):
|
||||
SNAPSHOT_LINK_HOST default 127.0.0.1
|
||||
SNAPSHOT_LINK_IB default mlx5_60
|
||||
SNAPSHOT_LINK_RECV_PORT default 17777
|
||||
SNAPSHOT_LINK_SEND_PORT default 17778
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import ctypes
|
||||
import hashlib
|
||||
import json
|
||||
import os
|
||||
import subprocess
|
||||
import sys
|
||||
import tempfile
|
||||
import time
|
||||
from pathlib import Path
|
||||
|
||||
_HERE = Path(__file__).resolve().parent
|
||||
sys.path.insert(0, str(_HERE.parent / "src"))
|
||||
|
||||
|
||||
SIZES_BYTES_DEFAULT = [
|
||||
1 << 10, # 1 KB
|
||||
1 << 14, # 16 KB
|
||||
1 << 18, # 256 KB
|
||||
1 << 20, # 1 MB
|
||||
1 << 22, # 4 MB
|
||||
1 << 24, # 16 MB
|
||||
1 << 26, # 64 MB
|
||||
]
|
||||
|
||||
|
||||
def _pattern_byte(i: int, seed: int) -> int:
|
||||
return (i * 2654435761 + seed) & 0xFF
|
||||
|
||||
|
||||
def _fill_pattern(buf, length: int, seed: int) -> None:
|
||||
tile_size = 4096
|
||||
tile = bytes(_pattern_byte(i, seed) for i in range(tile_size))
|
||||
tile_arr = (ctypes.c_ubyte * tile_size).from_buffer_copy(tile)
|
||||
n_full = length // tile_size
|
||||
rem = length - n_full * tile_size
|
||||
base = ctypes.addressof(buf)
|
||||
src_addr = ctypes.addressof(tile_arr)
|
||||
for k in range(n_full):
|
||||
ctypes.memmove(base + k * tile_size, src_addr, tile_size)
|
||||
if rem:
|
||||
ctypes.memmove(base + n_full * tile_size, src_addr, rem)
|
||||
|
||||
|
||||
def main():
|
||||
ap = argparse.ArgumentParser()
|
||||
ap.add_argument("--host", default=os.environ.get("SNAPSHOT_LINK_HOST", "127.0.0.1"))
|
||||
ap.add_argument("--ib", default=os.environ.get("SNAPSHOT_LINK_IB", "mlx5_60"))
|
||||
ap.add_argument("--recv-port", type=int,
|
||||
default=int(os.environ.get("SNAPSHOT_LINK_RECV_PORT", "17777")))
|
||||
ap.add_argument("--send-port", type=int,
|
||||
default=int(os.environ.get("SNAPSHOT_LINK_SEND_PORT", "17778")))
|
||||
ap.add_argument("--max-bytes", type=int, default=128 * 1024 * 1024)
|
||||
ap.add_argument("--sizes", default=",".join(str(s) for s in SIZES_BYTES_DEFAULT))
|
||||
args = ap.parse_args()
|
||||
|
||||
sizes = [int(s) for s in args.sizes.split(",")]
|
||||
tmpdir = Path(tempfile.mkdtemp(prefix="snapshot_link_smoke_"))
|
||||
control_path = tmpdir / "endpoint.json"
|
||||
recv_stderr_log = tmpdir / "recv.stderr.log"
|
||||
|
||||
recv_cmd = [
|
||||
sys.executable,
|
||||
str(_HERE / "snapshot_link_receiver.py"),
|
||||
"--host", args.host,
|
||||
"--port", str(args.recv_port),
|
||||
"--ib", args.ib,
|
||||
"--max-bytes", str(args.max_bytes),
|
||||
"--control-path", str(control_path),
|
||||
"--sizes", args.sizes,
|
||||
]
|
||||
recv_stderr = open(recv_stderr_log, "w")
|
||||
print(f"[sender] launching receiver: {' '.join(recv_cmd)}", flush=True)
|
||||
print(f"[sender] receiver stderr → {recv_stderr_log}", flush=True)
|
||||
recv_proc = subprocess.Popen(
|
||||
recv_cmd,
|
||||
stdout=subprocess.PIPE,
|
||||
stderr=recv_stderr,
|
||||
bufsize=1,
|
||||
universal_newlines=True,
|
||||
)
|
||||
|
||||
try:
|
||||
# Wait for endpoint metadata
|
||||
deadline = time.time() + 60.0
|
||||
while time.time() < deadline:
|
||||
if control_path.exists():
|
||||
try:
|
||||
meta = json.loads(control_path.read_text())
|
||||
if meta.get("ready"):
|
||||
break
|
||||
except Exception:
|
||||
pass
|
||||
if recv_proc.poll() is not None:
|
||||
_dump_recv_stderr(recv_stderr_log)
|
||||
print(f"[sender] FAIL: receiver exited early (rc={recv_proc.returncode})")
|
||||
return 1
|
||||
time.sleep(0.1)
|
||||
else:
|
||||
print("[sender] FAIL: timed out waiting for receiver endpoint", flush=True)
|
||||
return 1
|
||||
|
||||
print(f"[sender] receiver endpoint: {meta}", flush=True)
|
||||
|
||||
from agentic_pd_hybrid.snapshot_link import SnapshotPeer, SnapshotEndpoint
|
||||
endpoint = SnapshotEndpoint(
|
||||
session_id=meta["session_id"],
|
||||
base_ptr=int(meta["base_ptr"]),
|
||||
capacity_bytes=int(meta["capacity_bytes"]),
|
||||
)
|
||||
peer = SnapshotPeer(
|
||||
host=args.host,
|
||||
port=args.send_port,
|
||||
ib_device=args.ib,
|
||||
receive_capacity_bytes=0,
|
||||
)
|
||||
send_buf = (ctypes.c_byte * args.max_bytes)()
|
||||
send_addr = ctypes.addressof(send_buf)
|
||||
peer.register_send_buffer(send_addr, args.max_bytes)
|
||||
print(f"[sender] own session_id={peer.session_id}, send_buf @ {hex(send_addr)} ({args.max_bytes} B)", flush=True)
|
||||
|
||||
transfers = []
|
||||
for size in sizes:
|
||||
if size > args.max_bytes:
|
||||
continue
|
||||
seed = int(time.time() * 1e6) & 0xFFFFFFFF
|
||||
_fill_pattern(send_buf, size, seed)
|
||||
t0 = time.perf_counter()
|
||||
ret = peer.push(endpoint, send_addr, 0, size, remote_offset=0)
|
||||
t1 = time.perf_counter()
|
||||
dt_ms = (t1 - t0) * 1000.0
|
||||
gbps = (size * 8.0 / 1e9) / max(t1 - t0, 1e-9)
|
||||
print(f"[sender] push size={size:>10d} ret={ret} "
|
||||
f"dur={dt_ms:>9.3f} ms thru={gbps:>6.3f} Gbps",
|
||||
flush=True)
|
||||
signal_path = control_path.with_suffix(f".do{size}")
|
||||
ack_path = control_path.with_suffix(f".ack{size}")
|
||||
signal_path.write_text(str(seed))
|
||||
ack_deadline = time.time() + 60.0
|
||||
while time.time() < ack_deadline:
|
||||
if ack_path.exists():
|
||||
break
|
||||
if recv_proc.poll() is not None:
|
||||
print(f"[sender] FAIL: receiver died after size={size}", flush=True)
|
||||
_dump_recv_stderr(recv_stderr_log)
|
||||
return 1
|
||||
time.sleep(0.05)
|
||||
transfers.append({
|
||||
"size": size, "ret": ret, "dur_ms": round(dt_ms, 3),
|
||||
"thru_Gbps": round(gbps, 3),
|
||||
"ack": ack_path.exists(),
|
||||
})
|
||||
|
||||
peer.close()
|
||||
|
||||
# Drain child stdout — each line is a JSON event
|
||||
try:
|
||||
recv_proc.wait(timeout=10)
|
||||
except subprocess.TimeoutExpired:
|
||||
recv_proc.terminate()
|
||||
recv_proc.wait(timeout=5)
|
||||
|
||||
events = []
|
||||
if recv_proc.stdout is not None:
|
||||
for raw in recv_proc.stdout:
|
||||
raw = raw.strip()
|
||||
if not raw:
|
||||
continue
|
||||
try:
|
||||
events.append(json.loads(raw))
|
||||
except json.JSONDecodeError:
|
||||
events.append({"event": "non-json", "raw": raw})
|
||||
|
||||
print("=" * 78)
|
||||
print("[receiver] events:")
|
||||
verify_ok = 0
|
||||
verify_fail = 0
|
||||
for ev in events:
|
||||
print(f" {ev}")
|
||||
if ev.get("event") == "verify":
|
||||
if ev.get("ok"):
|
||||
verify_ok += 1
|
||||
else:
|
||||
verify_fail += 1
|
||||
|
||||
recv_stderr.close()
|
||||
_dump_recv_stderr(recv_stderr_log, header="--- receiver stderr ---")
|
||||
|
||||
overall = "PASS" if verify_fail == 0 and verify_ok == len(transfers) else "FAIL"
|
||||
print("=" * 78)
|
||||
print(f"OVERALL: {overall} verify_ok={verify_ok} verify_fail={verify_fail} "
|
||||
f"transfers={len(transfers)}")
|
||||
return 0 if overall == "PASS" else 1
|
||||
|
||||
finally:
|
||||
try:
|
||||
recv_proc.terminate()
|
||||
recv_proc.wait(timeout=5)
|
||||
except Exception:
|
||||
try:
|
||||
recv_proc.kill()
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
|
||||
def _dump_recv_stderr(path: Path, header: str = "--- receiver stderr (last 40) ---") -> None:
|
||||
try:
|
||||
text = path.read_text()
|
||||
except FileNotFoundError:
|
||||
return
|
||||
print(header, flush=True)
|
||||
for line in text.splitlines()[-40:]:
|
||||
print(f" {line}", flush=True)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
sys.exit(main())
|
||||
236
scripts/smoke_snapshot_link_gpu.py
Normal file
236
scripts/smoke_snapshot_link_gpu.py
Normal file
@@ -0,0 +1,236 @@
|
||||
#!/usr/bin/env python3
|
||||
"""GPU-aware smoke test for snapshot_link RDMA byte transfer.
|
||||
|
||||
Sender on cuda:0, receiver subprocess on cuda:1. Tests whether
|
||||
mooncake's transfer_sync_write can move bytes between two GPUs via
|
||||
RDMA (which is what the real D→P flow will need for KV bytes).
|
||||
|
||||
Usage:
|
||||
bash scripts/setup_env.sh && uv run --no-sync python scripts/smoke_snapshot_link_gpu.py
|
||||
|
||||
The sender uses cuda:0 (--send-gpu); the receiver subprocess uses
|
||||
cuda:1 (--recv-gpu) by default.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import hashlib
|
||||
import json
|
||||
import os
|
||||
import subprocess
|
||||
import sys
|
||||
import tempfile
|
||||
import time
|
||||
from pathlib import Path
|
||||
|
||||
_HERE = Path(__file__).resolve().parent
|
||||
sys.path.insert(0, str(_HERE.parent / "src"))
|
||||
|
||||
|
||||
SIZES_BYTES_DEFAULT = [
|
||||
1 << 14, # 16 KB
|
||||
1 << 20, # 1 MB
|
||||
1 << 24, # 16 MB
|
||||
1 << 26, # 64 MB
|
||||
1 << 28, # 256 MB
|
||||
]
|
||||
|
||||
|
||||
def main():
|
||||
ap = argparse.ArgumentParser()
|
||||
ap.add_argument("--host", default=os.environ.get("SNAPSHOT_LINK_HOST", "127.0.0.1"))
|
||||
ap.add_argument("--ib", default=os.environ.get("SNAPSHOT_LINK_IB", "mlx5_60"))
|
||||
ap.add_argument("--recv-port", type=int,
|
||||
default=int(os.environ.get("SNAPSHOT_LINK_RECV_PORT", "17787")))
|
||||
ap.add_argument("--send-port", type=int,
|
||||
default=int(os.environ.get("SNAPSHOT_LINK_SEND_PORT", "17788")))
|
||||
ap.add_argument("--max-bytes", type=int, default=256 * 1024 * 1024)
|
||||
ap.add_argument("--sizes", default=",".join(str(s) for s in SIZES_BYTES_DEFAULT))
|
||||
ap.add_argument("--send-gpu", type=int, default=0)
|
||||
ap.add_argument("--recv-gpu", type=int, default=1)
|
||||
args = ap.parse_args()
|
||||
|
||||
sizes = [int(s) for s in args.sizes.split(",")]
|
||||
tmpdir = Path(tempfile.mkdtemp(prefix="snapshot_link_gpu_smoke_"))
|
||||
control_path = tmpdir / "endpoint.json"
|
||||
recv_stderr_log = tmpdir / "recv.stderr.log"
|
||||
|
||||
recv_cmd = [
|
||||
sys.executable,
|
||||
str(_HERE / "snapshot_link_receiver_gpu.py"),
|
||||
"--host", args.host,
|
||||
"--port", str(args.recv_port),
|
||||
"--ib", args.ib,
|
||||
"--max-bytes", str(args.max_bytes),
|
||||
"--control-path", str(control_path),
|
||||
"--sizes", args.sizes,
|
||||
"--gpu-id", str(args.recv_gpu),
|
||||
]
|
||||
recv_stderr = open(recv_stderr_log, "w")
|
||||
print(f"[sender] receiver cmd: {' '.join(recv_cmd)}", flush=True)
|
||||
recv_proc = subprocess.Popen(
|
||||
recv_cmd, stdout=subprocess.PIPE, stderr=recv_stderr, bufsize=1,
|
||||
universal_newlines=True,
|
||||
)
|
||||
|
||||
try:
|
||||
import torch
|
||||
if not torch.cuda.is_available():
|
||||
print("[sender] FAIL: cuda not available")
|
||||
return 1
|
||||
torch.cuda.set_device(args.send_gpu)
|
||||
|
||||
deadline = time.time() + 90.0
|
||||
meta = None
|
||||
while time.time() < deadline:
|
||||
if control_path.exists():
|
||||
try:
|
||||
meta = json.loads(control_path.read_text())
|
||||
if meta.get("ready"):
|
||||
break
|
||||
except Exception:
|
||||
pass
|
||||
if recv_proc.poll() is not None:
|
||||
_dump_recv_stderr(recv_stderr_log)
|
||||
print(f"[sender] FAIL: receiver exited (rc={recv_proc.returncode})")
|
||||
return 1
|
||||
time.sleep(0.1)
|
||||
if meta is None:
|
||||
print("[sender] FAIL: receiver endpoint timeout")
|
||||
return 1
|
||||
print(f"[sender] receiver endpoint: gpu={meta['gpu_id']}, "
|
||||
f"sid={meta['session_id']}, ptr={hex(int(meta['base_ptr']))}, "
|
||||
f"cap={meta['capacity_bytes']}", flush=True)
|
||||
|
||||
from agentic_pd_hybrid.snapshot_link import SnapshotPeer, SnapshotEndpoint
|
||||
|
||||
endpoint = SnapshotEndpoint(
|
||||
session_id=meta["session_id"],
|
||||
base_ptr=int(meta["base_ptr"]),
|
||||
capacity_bytes=int(meta["capacity_bytes"]),
|
||||
)
|
||||
|
||||
peer = SnapshotPeer(
|
||||
host=args.host,
|
||||
port=args.send_port,
|
||||
ib_device=args.ib,
|
||||
receive_capacity_bytes=0,
|
||||
)
|
||||
|
||||
# Allocate a sender buffer on cuda:0
|
||||
send_tensor = torch.zeros(args.max_bytes, dtype=torch.uint8,
|
||||
device=f"cuda:{args.send_gpu}")
|
||||
send_ptr = send_tensor.data_ptr()
|
||||
ret = peer.engine.register_memory(send_ptr, args.max_bytes)
|
||||
if ret != 0:
|
||||
print(f"[sender] FAIL: register_memory ret={ret}")
|
||||
return 1
|
||||
print(f"[sender] own gpu={args.send_gpu}, sid={peer.session_id}, "
|
||||
f"buf @ {hex(send_ptr)} ({args.max_bytes} B)", flush=True)
|
||||
|
||||
transfers = []
|
||||
for size in sizes:
|
||||
if size > args.max_bytes:
|
||||
continue
|
||||
# Fill with deterministic pattern on GPU
|
||||
seed = int(time.time() * 1e6) & 0xFFFFFFFF
|
||||
# Use a simple seeded pattern via torch ops
|
||||
gen = torch.Generator(device=f"cuda:{args.send_gpu}")
|
||||
gen.manual_seed(seed)
|
||||
send_tensor[:size] = torch.randint(0, 256, (size,), dtype=torch.uint8,
|
||||
device=f"cuda:{args.send_gpu}",
|
||||
generator=gen)
|
||||
torch.cuda.synchronize(args.send_gpu)
|
||||
# Compute expected hash (host-side)
|
||||
host_view = send_tensor[:size].cpu().numpy().tobytes()
|
||||
expected_sha = hashlib.sha256(host_view).hexdigest()
|
||||
# Push via RDMA
|
||||
t0 = time.perf_counter()
|
||||
ret = peer.push(endpoint, send_ptr, 0, size, remote_offset=0)
|
||||
t1 = time.perf_counter()
|
||||
dt_ms = (t1 - t0) * 1000.0
|
||||
gbps = (size * 8.0 / 1e9) / max(t1 - t0, 1e-9)
|
||||
print(f"[sender] push size={size:>10d} ret={ret} "
|
||||
f"dur={dt_ms:>9.3f} ms thru={gbps:>6.3f} Gbps",
|
||||
flush=True)
|
||||
|
||||
# Signal receiver to verify
|
||||
signal_path = control_path.with_suffix(f".do{size}")
|
||||
ack_path = control_path.with_suffix(f".ack{size}")
|
||||
signal_path.write_text(json.dumps({"sha": expected_sha}))
|
||||
ack_deadline = time.time() + 90.0
|
||||
while time.time() < ack_deadline:
|
||||
if ack_path.exists():
|
||||
break
|
||||
if recv_proc.poll() is not None:
|
||||
print(f"[sender] FAIL: receiver died after size={size}")
|
||||
_dump_recv_stderr(recv_stderr_log)
|
||||
return 1
|
||||
time.sleep(0.05)
|
||||
transfers.append({
|
||||
"size": size, "ret": ret, "dur_ms": round(dt_ms, 3),
|
||||
"thru_Gbps": round(gbps, 3), "ack": ack_path.exists(),
|
||||
})
|
||||
|
||||
try:
|
||||
recv_proc.wait(timeout=10)
|
||||
except subprocess.TimeoutExpired:
|
||||
recv_proc.terminate()
|
||||
recv_proc.wait(timeout=5)
|
||||
|
||||
events = []
|
||||
if recv_proc.stdout is not None:
|
||||
for raw in recv_proc.stdout:
|
||||
raw = raw.strip()
|
||||
if not raw:
|
||||
continue
|
||||
try:
|
||||
events.append(json.loads(raw))
|
||||
except json.JSONDecodeError:
|
||||
events.append({"event": "non-json", "raw": raw})
|
||||
|
||||
print("=" * 78)
|
||||
print("[receiver] events:")
|
||||
verify_ok = 0
|
||||
verify_fail = 0
|
||||
for ev in events:
|
||||
print(f" {ev}")
|
||||
if ev.get("event") == "verify":
|
||||
if ev.get("ok"):
|
||||
verify_ok += 1
|
||||
else:
|
||||
verify_fail += 1
|
||||
|
||||
recv_stderr.close()
|
||||
_dump_recv_stderr(recv_stderr_log, header="--- receiver stderr ---")
|
||||
|
||||
overall = "PASS" if verify_fail == 0 and verify_ok == len(transfers) else "FAIL"
|
||||
print("=" * 78)
|
||||
print(f"OVERALL: {overall} verify_ok={verify_ok} verify_fail={verify_fail} "
|
||||
f"transfers={len(transfers)} send_gpu={args.send_gpu} recv_gpu={args.recv_gpu}")
|
||||
return 0 if overall == "PASS" else 1
|
||||
|
||||
finally:
|
||||
try:
|
||||
recv_proc.terminate()
|
||||
recv_proc.wait(timeout=5)
|
||||
except Exception:
|
||||
try:
|
||||
recv_proc.kill()
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
|
||||
def _dump_recv_stderr(path: Path, header: str = "--- receiver stderr (last 60) ---") -> None:
|
||||
try:
|
||||
text = path.read_text()
|
||||
except FileNotFoundError:
|
||||
return
|
||||
print(header, flush=True)
|
||||
for line in text.splitlines()[-60:]:
|
||||
print(f" {line}", flush=True)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
sys.exit(main())
|
||||
241
scripts/smoke_snapshot_sglang_integration.py
Normal file
241
scripts/smoke_snapshot_sglang_integration.py
Normal file
@@ -0,0 +1,241 @@
|
||||
#!/usr/bin/env python3
|
||||
"""End-to-end smoke for the SGLang snapshot link integration.
|
||||
|
||||
Brings up TWO SGLang workers on this node (one acts as D, the other as P)
|
||||
with ``SGLANG_SNAPSHOT_LINK_ENABLE=1`` and exercises the three RPCs:
|
||||
|
||||
1. POST {P}/_snapshot/prepare_receive → P allocates kv_pool slots
|
||||
2. POST {D}/_snapshot/dump → D RDMA-pushes session KV
|
||||
3. POST {P}/_snapshot/finalize_ingest → P inserts into radix tree
|
||||
|
||||
To populate D's SessionAwareCache with a session, we first send a normal
|
||||
streaming-session generate request to D.
|
||||
|
||||
After finalize, we send another generate request to P with the same prefix
|
||||
and check whether the report says cached_tokens > 0 (cache hit).
|
||||
|
||||
This is a minimum-fidelity end-to-end smoke. It does NOT use the full
|
||||
agentic-pd-hybrid reseed orchestration; that's the next commit.
|
||||
|
||||
Required env:
|
||||
MODEL default /mnt/models/Qwen/Qwen3-30B-A3B-Instruct-2507
|
||||
|
||||
Usage:
|
||||
bash scripts/setup_env.sh && uv run --no-sync python \
|
||||
scripts/smoke_snapshot_sglang_integration.py
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import os
|
||||
import signal
|
||||
import subprocess
|
||||
import sys
|
||||
import time
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
|
||||
import httpx
|
||||
|
||||
|
||||
def _build_server_cmd(args, role: str, gpu_id: int, base_port: int,
|
||||
snapshot_port: int, ib_device: str) -> list:
|
||||
"""Build the SGLang launch command for one worker (D or P)."""
|
||||
common = [
|
||||
sys.executable, "-m", "sglang.launch_server",
|
||||
"--model-path", args.model,
|
||||
"--host", "127.0.0.1",
|
||||
"--port", str(base_port),
|
||||
"--tp-size", "1",
|
||||
"--mem-fraction-static", "0.6",
|
||||
"--disable-cuda-graph",
|
||||
"--disable-overlap-schedule",
|
||||
"--enable-streaming-session",
|
||||
"--disaggregation-mode", role,
|
||||
"--disaggregation-transfer-backend", "mooncake",
|
||||
"--disaggregation-bootstrap-port", str(base_port + 5000),
|
||||
"--disaggregation-ib-device", ib_device,
|
||||
]
|
||||
return common
|
||||
|
||||
|
||||
def _server_env(args, gpu_id: int, snapshot_port: int, ib_device: str) -> dict:
|
||||
env = os.environ.copy()
|
||||
env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
|
||||
env["SGLANG_SNAPSHOT_LINK_ENABLE"] = "1"
|
||||
env["SGLANG_SNAPSHOT_LINK_HOST"] = "127.0.0.1"
|
||||
env["SGLANG_SNAPSHOT_LINK_PORT"] = str(snapshot_port)
|
||||
env["SGLANG_SNAPSHOT_LINK_IB_DEVICE"] = ib_device
|
||||
env["MOONCAKE_PROTOCOL"] = "rdma"
|
||||
env["MOONCAKE_DEVICE"] = ib_device
|
||||
env["MC_TRANSFER_TIMEOUT"] = "1800"
|
||||
return env
|
||||
|
||||
|
||||
def _wait_for_ready(url: str, timeout: float = 240.0) -> bool:
|
||||
deadline = time.time() + timeout
|
||||
while time.time() < deadline:
|
||||
try:
|
||||
r = httpx.get(f"{url}/health", timeout=2.0)
|
||||
if r.status_code == 200:
|
||||
return True
|
||||
except Exception:
|
||||
pass
|
||||
time.sleep(2)
|
||||
return False
|
||||
|
||||
|
||||
def main():
|
||||
ap = argparse.ArgumentParser()
|
||||
ap.add_argument("--model",
|
||||
default=os.environ.get("MODEL", "/mnt/models/Qwen/Qwen3-30B-A3B-Instruct-2507"))
|
||||
ap.add_argument("--d-gpu", type=int, default=1)
|
||||
ap.add_argument("--p-gpu", type=int, default=0)
|
||||
ap.add_argument("--d-port", type=int, default=29040)
|
||||
ap.add_argument("--p-port", type=int, default=29041)
|
||||
ap.add_argument("--d-snap-port", type=int, default=29045)
|
||||
ap.add_argument("--p-snap-port", type=int, default=29046)
|
||||
ap.add_argument("--ib", default="mlx5_60")
|
||||
ap.add_argument("--log-dir", default="outputs/snapshot_sglang_smoke")
|
||||
args = ap.parse_args()
|
||||
|
||||
log_dir = Path(args.log_dir)
|
||||
log_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
# Spawn P first (so D can find its snapshot endpoint later via prepare_receive)
|
||||
p_cmd = _build_server_cmd(args, "prefill", args.p_gpu, args.p_port,
|
||||
args.p_snap_port, args.ib)
|
||||
p_env = _server_env(args, args.p_gpu, args.p_snap_port, args.ib)
|
||||
p_stdout = open(log_dir / "p.stdout", "w")
|
||||
p_stderr = open(log_dir / "p.stderr", "w")
|
||||
print(f"[smoke] launching P: {' '.join(p_cmd)}")
|
||||
p_proc = subprocess.Popen(p_cmd, env=p_env, stdout=p_stdout, stderr=p_stderr)
|
||||
|
||||
d_cmd = _build_server_cmd(args, "decode", args.d_gpu, args.d_port,
|
||||
args.d_snap_port, args.ib)
|
||||
d_env = _server_env(args, args.d_gpu, args.d_snap_port, args.ib)
|
||||
d_stdout = open(log_dir / "d.stdout", "w")
|
||||
d_stderr = open(log_dir / "d.stderr", "w")
|
||||
print(f"[smoke] launching D: {' '.join(d_cmd)}")
|
||||
d_proc = subprocess.Popen(d_cmd, env=d_env, stdout=d_stdout, stderr=d_stderr)
|
||||
|
||||
try:
|
||||
print(f"[smoke] waiting for P @ 127.0.0.1:{args.p_port} ...")
|
||||
if not _wait_for_ready(f"http://127.0.0.1:{args.p_port}", timeout=300):
|
||||
_tail_stderr(log_dir / "p.stderr")
|
||||
raise RuntimeError("P server did not become healthy")
|
||||
print(f"[smoke] waiting for D @ 127.0.0.1:{args.d_port} ...")
|
||||
if not _wait_for_ready(f"http://127.0.0.1:{args.d_port}", timeout=300):
|
||||
_tail_stderr(log_dir / "d.stderr")
|
||||
raise RuntimeError("D server did not become healthy")
|
||||
print(f"[smoke] both servers up — running RPC sanity ...")
|
||||
|
||||
session_id = "smoke-sess-001"
|
||||
# NOTE: we deliberately skip seeding a session on D with a real
|
||||
# /generate call. Decode-mode workers crash on raw /generate without
|
||||
# PD-router-provided bootstrap_host (see decode.py:_bootstrap_addr).
|
||||
# The point of this smoke is to verify the 3 snapshot RPCs are
|
||||
# wired up correctly. KV correctness needs the full router stack
|
||||
# (covered by the end-to-end E4 sweep, not here).
|
||||
|
||||
# 3. Probe snapshot link: prepare_receive on P
|
||||
num_tokens = 64
|
||||
prep = httpx.post(
|
||||
f"http://127.0.0.1:{args.p_port}/_snapshot/prepare_receive",
|
||||
json={
|
||||
"session_id": session_id,
|
||||
"num_tokens": num_tokens,
|
||||
"expected_bytes_per_layer_k": 0,
|
||||
"expected_bytes_per_layer_v": 0,
|
||||
},
|
||||
timeout=30,
|
||||
)
|
||||
print(f"[smoke] prepare_receive on P → {prep.status_code}: {prep.text[:500]}")
|
||||
if prep.status_code != 200:
|
||||
return 1
|
||||
prep_data = prep.json()
|
||||
if not prep_data.get("ok"):
|
||||
print(f"[smoke] prepare_receive returned ok=false: {prep_data}")
|
||||
return 1
|
||||
|
||||
# 4. Dump on D — expect failure (session-not-resident), proves the
|
||||
# handler is reachable and exits the failure path cleanly.
|
||||
dump = httpx.post(
|
||||
f"http://127.0.0.1:{args.d_port}/_snapshot/dump",
|
||||
json={
|
||||
"session_id": session_id,
|
||||
"target_snapshot_session_id": prep_data["snapshot_session_id"],
|
||||
"target_k_base_ptrs": prep_data["k_base_ptrs"],
|
||||
"target_v_base_ptrs": prep_data["v_base_ptrs"],
|
||||
"target_slot_indices": prep_data["slot_indices"],
|
||||
"target_stride_k_bytes": prep_data["stride_k_bytes"],
|
||||
"target_stride_v_bytes": prep_data["stride_v_bytes"],
|
||||
"ib_device": args.ib,
|
||||
},
|
||||
timeout=60,
|
||||
)
|
||||
print(f"[smoke] dump on D (expected fail) → {dump.status_code}: {dump.text[:500]}")
|
||||
if dump.status_code != 200:
|
||||
return 1
|
||||
dump_data = dump.json()
|
||||
dump_reason = dump_data.get("reason", "")
|
||||
if dump_data.get("ok"):
|
||||
print("[smoke] unexpected dump success on a session that doesn't exist")
|
||||
elif dump_reason != "session-not-resident":
|
||||
print(f"[smoke] dump failed with wrong reason: {dump_reason}")
|
||||
return 1
|
||||
|
||||
# 5. Finalize on P with fake token_ids — radix insert should succeed
|
||||
prompt_ids = list(range(101, 101 + num_tokens)) # fake but unique ids
|
||||
fin = httpx.post(
|
||||
f"http://127.0.0.1:{args.p_port}/_snapshot/finalize_ingest",
|
||||
json={
|
||||
"session_id": session_id,
|
||||
"token_ids": prompt_ids,
|
||||
"slot_indices": prep_data["slot_indices"],
|
||||
},
|
||||
timeout=30,
|
||||
)
|
||||
print(f"[smoke] finalize on P → {fin.status_code}: {fin.text[:500]}")
|
||||
if fin.status_code != 200:
|
||||
return 1
|
||||
fin_data = fin.json()
|
||||
if not fin_data.get("ok"):
|
||||
print(f"[smoke] finalize returned ok=false: {fin_data}")
|
||||
return 1
|
||||
print(f"[smoke] inserted_prefix_len = {fin_data.get('inserted_prefix_len')}")
|
||||
print("[smoke] OVERALL: PASS — all 3 RPCs reachable + handlers return expected schema")
|
||||
print(" (KV-correctness end-to-end check requires the full PD router stack;")
|
||||
print(" see scripts/sweep_e4_d_to_p_sync.sh for that)")
|
||||
return 0
|
||||
finally:
|
||||
for name, proc in [("D", d_proc), ("P", p_proc)]:
|
||||
try:
|
||||
proc.send_signal(signal.SIGINT)
|
||||
except Exception:
|
||||
pass
|
||||
for name, proc in [("D", d_proc), ("P", p_proc)]:
|
||||
try:
|
||||
proc.wait(timeout=15)
|
||||
except Exception:
|
||||
proc.terminate()
|
||||
try:
|
||||
proc.wait(timeout=5)
|
||||
except Exception:
|
||||
proc.kill()
|
||||
|
||||
|
||||
def _tail_stderr(path: Path, n: int = 60) -> None:
|
||||
try:
|
||||
text = path.read_text()
|
||||
except FileNotFoundError:
|
||||
return
|
||||
print(f"--- {path} (last {n}) ---")
|
||||
for line in text.splitlines()[-n:]:
|
||||
print(f" {line}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
sys.exit(main())
|
||||
123
scripts/snapshot_link_receiver.py
Normal file
123
scripts/snapshot_link_receiver.py
Normal file
@@ -0,0 +1,123 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Receiver-side child process for the snapshot_link smoke test.
|
||||
|
||||
Reads CLI args, brings up a SnapshotPeer with a registered recv buffer,
|
||||
writes endpoint metadata to a control file, then loops: wait for size
|
||||
signal, verify recv buffer, write ack.
|
||||
|
||||
Status events are printed as single-line JSON to stdout for parent to
|
||||
parse.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import ctypes
|
||||
import hashlib
|
||||
import json
|
||||
import sys
|
||||
import time
|
||||
from pathlib import Path
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).resolve().parent.parent / "src"))
|
||||
|
||||
|
||||
def _pattern_byte(i: int, seed: int) -> int:
|
||||
return (i * 2654435761 + seed) & 0xFF
|
||||
|
||||
|
||||
def _fill_pattern(buf, length: int, seed: int) -> None:
|
||||
tile_size = 4096
|
||||
tile = bytes(_pattern_byte(i, seed) for i in range(tile_size))
|
||||
tile_arr = (ctypes.c_ubyte * tile_size).from_buffer_copy(tile)
|
||||
n_full = length // tile_size
|
||||
rem = length - n_full * tile_size
|
||||
base = ctypes.addressof(buf)
|
||||
src_addr = ctypes.addressof(tile_arr)
|
||||
for k in range(n_full):
|
||||
ctypes.memmove(base + k * tile_size, src_addr, tile_size)
|
||||
if rem:
|
||||
ctypes.memmove(base + n_full * tile_size, src_addr, rem)
|
||||
|
||||
|
||||
def _emit(d: dict) -> None:
|
||||
print(json.dumps(d), flush=True)
|
||||
|
||||
|
||||
def main():
|
||||
ap = argparse.ArgumentParser()
|
||||
ap.add_argument("--host", required=True)
|
||||
ap.add_argument("--port", type=int, required=True)
|
||||
ap.add_argument("--ib", required=True)
|
||||
ap.add_argument("--max-bytes", type=int, required=True)
|
||||
ap.add_argument("--control-path", required=True)
|
||||
ap.add_argument("--sizes", required=True, help="comma-separated bytes")
|
||||
args = ap.parse_args()
|
||||
|
||||
sizes = [int(s) for s in args.sizes.split(",")]
|
||||
|
||||
from agentic_pd_hybrid.snapshot_link import SnapshotPeer
|
||||
|
||||
try:
|
||||
peer = SnapshotPeer(
|
||||
host=args.host,
|
||||
port=args.port,
|
||||
ib_device=args.ib,
|
||||
receive_capacity_bytes=args.max_bytes,
|
||||
)
|
||||
except Exception as e:
|
||||
import traceback
|
||||
_emit({"event": "init-failed", "error": repr(e), "tb": traceback.format_exc()})
|
||||
sys.exit(2)
|
||||
|
||||
endpoint = peer.endpoint
|
||||
Path(args.control_path).write_text(json.dumps({
|
||||
"session_id": endpoint.session_id,
|
||||
"base_ptr": endpoint.base_ptr,
|
||||
"capacity_bytes": endpoint.capacity_bytes,
|
||||
"ready": True,
|
||||
}))
|
||||
_emit({"event": "endpoint-ready", "session_id": endpoint.session_id,
|
||||
"base_ptr": endpoint.base_ptr, "capacity": endpoint.capacity_bytes})
|
||||
|
||||
cp = Path(args.control_path)
|
||||
for size in sizes:
|
||||
if size > args.max_bytes:
|
||||
continue
|
||||
signal_path = cp.with_suffix(f".do{size}")
|
||||
ack_path = cp.with_suffix(f".ack{size}")
|
||||
deadline = time.time() + 120.0
|
||||
while time.time() < deadline:
|
||||
if signal_path.exists():
|
||||
break
|
||||
time.sleep(0.05)
|
||||
else:
|
||||
_emit({"event": "no-signal-timeout", "size": size})
|
||||
continue
|
||||
try:
|
||||
seed = int(signal_path.read_text().strip())
|
||||
except Exception as e:
|
||||
_emit({"event": "signal-parse-error", "size": size, "err": repr(e)})
|
||||
continue
|
||||
expected_arr = (ctypes.c_ubyte * size)()
|
||||
_fill_pattern(expected_arr, size, seed)
|
||||
expected_hash = hashlib.sha256(bytes(expected_arr)).hexdigest()
|
||||
recv_bytes = peer.read_bytes(0, size)
|
||||
recv_hash = hashlib.sha256(recv_bytes).hexdigest()
|
||||
ok = recv_hash == expected_hash
|
||||
_emit({
|
||||
"event": "verify",
|
||||
"size": size,
|
||||
"ok": ok,
|
||||
"expected_sha": expected_hash[:16],
|
||||
"got_sha": recv_hash[:16],
|
||||
"first8_recv": recv_bytes[:8].hex(),
|
||||
"last8_recv": recv_bytes[-8:].hex(),
|
||||
})
|
||||
ack_path.write_text("done")
|
||||
|
||||
peer.close()
|
||||
_emit({"event": "receiver-done"})
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
124
scripts/snapshot_link_receiver_gpu.py
Normal file
124
scripts/snapshot_link_receiver_gpu.py
Normal file
@@ -0,0 +1,124 @@
|
||||
#!/usr/bin/env python3
|
||||
"""GPU-side receiver child for snapshot_link smoke test (CUDA mem)."""
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import hashlib
|
||||
import json
|
||||
import sys
|
||||
import time
|
||||
from pathlib import Path
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).resolve().parent.parent / "src"))
|
||||
|
||||
|
||||
def _emit(d: dict) -> None:
|
||||
print(json.dumps(d), flush=True)
|
||||
|
||||
|
||||
def main():
|
||||
ap = argparse.ArgumentParser()
|
||||
ap.add_argument("--host", required=True)
|
||||
ap.add_argument("--port", type=int, required=True)
|
||||
ap.add_argument("--ib", required=True)
|
||||
ap.add_argument("--max-bytes", type=int, required=True)
|
||||
ap.add_argument("--control-path", required=True)
|
||||
ap.add_argument("--sizes", required=True)
|
||||
ap.add_argument("--gpu-id", type=int, default=1, help="receiver GPU id")
|
||||
args = ap.parse_args()
|
||||
|
||||
sizes = [int(s) for s in args.sizes.split(",")]
|
||||
|
||||
try:
|
||||
import torch
|
||||
if not torch.cuda.is_available():
|
||||
_emit({"event": "init-failed", "error": "cuda not available"})
|
||||
sys.exit(2)
|
||||
torch.cuda.set_device(args.gpu_id)
|
||||
# allocate a GPU buffer of max_bytes
|
||||
recv_tensor = torch.zeros(args.max_bytes, dtype=torch.uint8, device=f"cuda:{args.gpu_id}")
|
||||
recv_ptr = recv_tensor.data_ptr()
|
||||
except Exception as e:
|
||||
import traceback
|
||||
_emit({"event": "init-failed", "error": repr(e), "tb": traceback.format_exc()})
|
||||
sys.exit(2)
|
||||
|
||||
# Spin up SnapshotPeer with NO internal recv buffer, then register our GPU tensor
|
||||
from agentic_pd_hybrid.snapshot_link import SnapshotPeer, SnapshotEndpoint
|
||||
try:
|
||||
peer = SnapshotPeer(
|
||||
host=args.host,
|
||||
port=args.port,
|
||||
ib_device=args.ib,
|
||||
receive_capacity_bytes=0,
|
||||
)
|
||||
ret = peer.engine.register_memory(recv_ptr, args.max_bytes)
|
||||
if ret != 0:
|
||||
_emit({"event": "init-failed", "error": f"register_memory({hex(recv_ptr)}, {args.max_bytes}) ret={ret}"})
|
||||
sys.exit(2)
|
||||
except Exception as e:
|
||||
import traceback
|
||||
_emit({"event": "init-failed", "error": repr(e), "tb": traceback.format_exc()})
|
||||
sys.exit(2)
|
||||
|
||||
endpoint = SnapshotEndpoint(
|
||||
session_id=peer.session_id,
|
||||
base_ptr=recv_ptr,
|
||||
capacity_bytes=args.max_bytes,
|
||||
)
|
||||
Path(args.control_path).write_text(json.dumps({
|
||||
"session_id": endpoint.session_id,
|
||||
"base_ptr": endpoint.base_ptr,
|
||||
"capacity_bytes": endpoint.capacity_bytes,
|
||||
"gpu_id": args.gpu_id,
|
||||
"ready": True,
|
||||
}))
|
||||
_emit({"event": "endpoint-ready",
|
||||
"session_id": endpoint.session_id,
|
||||
"base_ptr": endpoint.base_ptr,
|
||||
"capacity": endpoint.capacity_bytes,
|
||||
"gpu_id": args.gpu_id})
|
||||
|
||||
cp = Path(args.control_path)
|
||||
for size in sizes:
|
||||
if size > args.max_bytes:
|
||||
continue
|
||||
signal_path = cp.with_suffix(f".do{size}")
|
||||
ack_path = cp.with_suffix(f".ack{size}")
|
||||
deadline = time.time() + 120.0
|
||||
while time.time() < deadline:
|
||||
if signal_path.exists():
|
||||
break
|
||||
time.sleep(0.05)
|
||||
else:
|
||||
_emit({"event": "no-signal-timeout", "size": size})
|
||||
continue
|
||||
try:
|
||||
payload = json.loads(signal_path.read_text())
|
||||
expected_sha = payload["sha"]
|
||||
except Exception as e:
|
||||
_emit({"event": "signal-parse-error", "size": size, "err": repr(e)})
|
||||
continue
|
||||
|
||||
# Copy from GPU to CPU and hash
|
||||
torch.cuda.synchronize(args.gpu_id)
|
||||
host_bytes = bytes(recv_tensor[:size].cpu().numpy().tobytes())
|
||||
recv_sha = hashlib.sha256(host_bytes).hexdigest()
|
||||
ok = recv_sha == expected_sha
|
||||
_emit({
|
||||
"event": "verify",
|
||||
"size": size,
|
||||
"ok": ok,
|
||||
"expected_sha": expected_sha[:16],
|
||||
"got_sha": recv_sha[:16],
|
||||
"first8_recv": host_bytes[:8].hex(),
|
||||
"last8_recv": host_bytes[-8:].hex(),
|
||||
})
|
||||
ack_path.write_text("done")
|
||||
|
||||
peer.close()
|
||||
_emit({"event": "receiver-done"})
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
82
scripts/sweep_e1_naive_1p3d.sh
Executable file
82
scripts/sweep_e1_naive_1p3d.sh
Executable file
@@ -0,0 +1,82 @@
|
||||
#!/usr/bin/env bash
|
||||
# E1 — naive 1P3D + kv-aware + RDMA, ts=1
|
||||
#
|
||||
# Tests hypothesis H1 from ONBOARDING_NEXT_AGENT_ZH §3.1: separate the
|
||||
# contribution of "1P3D topology + kv-aware policy" from "KVC layer
|
||||
# (admission / migration / direct-to-D)".
|
||||
#
|
||||
# Mechanism = pd-disaggregation (no KVC layer); policy = kv-aware.
|
||||
# Topology = 1P3D, RDMA on (mlx5_60 = cuda:0 NUMA-local).
|
||||
#
|
||||
# Prerequisites:
|
||||
# - source scripts/setup_env.sh (sets CUDA_HOME etc.)
|
||||
# - outputs/inferact_codex_swebenchpro.jsonl exists
|
||||
# (run scripts/convert_inferact_to_trace.py if not)
|
||||
#
|
||||
# Usage:
|
||||
# bash scripts/sweep_e1_naive_1p3d.sh
|
||||
#
|
||||
# Override defaults via env:
|
||||
# MODEL=/path TRACE=path OUTPUT=path IB_DEVICE=mlx5_XX bash scripts/sweep_e1_naive_1p3d.sh
|
||||
|
||||
set -euo pipefail
|
||||
cd "$(dirname "$0")/.."
|
||||
|
||||
if [ -z "${CUDA_HOME:-}" ]; then
|
||||
echo "ERROR: CUDA_HOME not set. Source scripts/setup_env.sh first." >&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
MODEL=${MODEL:-/mnt/models/Qwen/Qwen3-30B-A3B-Instruct-2507}
|
||||
TRACE=${TRACE:-outputs/inferact_50sess.jsonl}
|
||||
OUTPUT=${OUTPUT:-outputs/e1_naive_1p3d_kvaware_rdma_50sess}
|
||||
IB_DEVICE=${IB_DEVICE:-mlx5_60}
|
||||
|
||||
if [ ! -f "$TRACE" ]; then
|
||||
echo "ERROR: trace not found at $TRACE" >&2
|
||||
echo "Run: uv run --no-sync python scripts/convert_inferact_to_trace.py --output $TRACE" >&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
mkdir -p "$OUTPUT"
|
||||
LOG="$OUTPUT/sweep.log"
|
||||
|
||||
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG"; }
|
||||
|
||||
log "=== E1: naive 1P3D kv-aware + RDMA, ts=1 ==="
|
||||
log "MODEL=$MODEL"
|
||||
log "TRACE=$TRACE ($(wc -l < $TRACE) requests)"
|
||||
log "OUTPUT=$OUTPUT"
|
||||
log "IB_DEVICE=$IB_DEVICE"
|
||||
|
||||
label=e1_naive_1p3d_kvaware_run1
|
||||
log ""
|
||||
log "=== [E1] $label starting ==="
|
||||
|
||||
uv run --no-sync python -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace "$TRACE" \
|
||||
--output-root "$OUTPUT" \
|
||||
--mechanism pd-disaggregation \
|
||||
--policy kv-aware \
|
||||
--model-path "$MODEL" \
|
||||
--prefill-workers 1 --decode-workers 3 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3 \
|
||||
--transfer-backend mooncake \
|
||||
--force-rdma --ib-device "$IB_DEVICE" \
|
||||
--gpu-budget 4 \
|
||||
--time-scale 1 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 1800 \
|
||||
--request-timeout-s 300 2>&1 | tee -a "$LOG"
|
||||
|
||||
run_dir=$(ls -td "$OUTPUT"/pd-disaggregation-*/ 2>/dev/null | head -1)
|
||||
log "=== [E1] $label COMPLETED, artifacts at $run_dir ==="
|
||||
|
||||
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
|
||||
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
|
||||
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
|
||||
log "=== summary saved to $OUTPUT/${label}_summary.json ==="
|
||||
fi
|
||||
90
scripts/sweep_e2_kvc_v2_rdma.sh
Executable file
90
scripts/sweep_e2_kvc_v2_rdma.sh
Executable file
@@ -0,0 +1,90 @@
|
||||
#!/usr/bin/env bash
|
||||
# E2 — KVC v2 + RDMA, ts=1
|
||||
#
|
||||
# Tests hypotheses H2/H3 from ONBOARDING_NEXT_AGENT_ZH §3.1: validate
|
||||
# that enabling real RDMA pushes TTFT p99 from the reported 1.28s
|
||||
# (TCP loopback) down toward ~0.7s (still expected to lose to DP 0.43s
|
||||
# because re-prefill segment of reseed slow-path remains).
|
||||
#
|
||||
# Mechanism = kvcache-centric; policy = kv-aware; topology = 1P3D.
|
||||
# All --kvcache-* tuning flags from sweep_ts1_migration_v2.sh
|
||||
# (reset-on-success + threshold 8192). RDMA on (mlx5_60).
|
||||
#
|
||||
# Uses the same outputs/inferact_50sess.jsonl as E1 — see
|
||||
# scripts/sample_trace_subset.py — so the two runs are paired.
|
||||
#
|
||||
# Prerequisites:
|
||||
# - source scripts/setup_env.sh
|
||||
# - E1 must already have completed (releases GPUs)
|
||||
#
|
||||
# Usage:
|
||||
# bash scripts/sweep_e2_kvc_v2_rdma.sh
|
||||
|
||||
set -euo pipefail
|
||||
cd "$(dirname "$0")/.."
|
||||
|
||||
if [ -z "${CUDA_HOME:-}" ]; then
|
||||
echo "ERROR: CUDA_HOME not set. Source scripts/setup_env.sh first." >&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
MODEL=${MODEL:-/mnt/models/Qwen/Qwen3-30B-A3B-Instruct-2507}
|
||||
TRACE=${TRACE:-outputs/inferact_50sess.jsonl}
|
||||
OUTPUT=${OUTPUT:-outputs/e2_kvc_v2_rdma_50sess}
|
||||
IB_DEVICE=${IB_DEVICE:-mlx5_60}
|
||||
|
||||
if [ ! -f "$TRACE" ]; then
|
||||
echo "ERROR: trace not found at $TRACE" >&2
|
||||
echo "Run: uv run --no-sync python scripts/sample_trace_subset.py --output $TRACE --sessions 50" >&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
mkdir -p "$OUTPUT"
|
||||
LOG="$OUTPUT/sweep.log"
|
||||
|
||||
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG"; }
|
||||
|
||||
log "=== E2: KVC v2 + RDMA, ts=1 ==="
|
||||
log "MODEL=$MODEL"
|
||||
log "TRACE=$TRACE ($(wc -l < $TRACE) requests)"
|
||||
log "OUTPUT=$OUTPUT"
|
||||
log "IB_DEVICE=$IB_DEVICE"
|
||||
|
||||
label=e2_kvc_v2_rdma_run1
|
||||
log ""
|
||||
log "=== [E2] $label starting ==="
|
||||
|
||||
uv run --no-sync python -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace "$TRACE" \
|
||||
--output-root "$OUTPUT" \
|
||||
--mechanism kvcache-centric \
|
||||
--policy kv-aware \
|
||||
--model-path "$MODEL" \
|
||||
--prefill-workers 1 --decode-workers 3 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3 \
|
||||
--transfer-backend mooncake \
|
||||
--force-rdma --ib-device "$IB_DEVICE" \
|
||||
--gpu-budget 4 \
|
||||
--time-scale 1 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 1800 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 1 \
|
||||
--kvcache-seed-max-inflight-decode -1 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction \
|
||||
--kvcache-migration-reject-threshold 3 \
|
||||
--kvcache-direct-max-uncached-tokens 8192 2>&1 | tee -a "$LOG"
|
||||
|
||||
run_dir=$(ls -td "$OUTPUT"/kvcache-centric-*/ 2>/dev/null | head -1)
|
||||
log "=== [E2] $label COMPLETED, artifacts at $run_dir ==="
|
||||
|
||||
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
|
||||
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
|
||||
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
|
||||
log "=== summary saved to $OUTPUT/${label}_summary.json ==="
|
||||
fi
|
||||
105
scripts/sweep_e3_kvc_v2_loadfloor_rdma.sh
Executable file
105
scripts/sweep_e3_kvc_v2_loadfloor_rdma.sh
Executable file
@@ -0,0 +1,105 @@
|
||||
#!/usr/bin/env bash
|
||||
# E3 — KVC v2 + RDMA + load-floor bonus, ts=1
|
||||
#
|
||||
# Validates the load-floor bonus fix proposed in
|
||||
# docs/E1_E2_FIX_DESIGN_ZH.md §Q2.B. Identical to E2 except:
|
||||
# --kvcache-load-floor-bonus 200
|
||||
#
|
||||
# Pair-wise vs E1 (no KVC layer) and E2 (KVC v2 without bonus) on the
|
||||
# exact same outputs/inferact_50sess.jsonl subset.
|
||||
#
|
||||
# Hypotheses being tested:
|
||||
# H1 (load balance): D2 should now receive non-trivial bindings
|
||||
# (E1/E2 had 0 — see E1_E2_RESULTS_ZH.md §5d).
|
||||
# H2 (failure rate): mooncake batch_transfer_sync timeouts should
|
||||
# stop firing because D0/D1 KV pool no longer
|
||||
# saturates → no LRU thrash → control plane no
|
||||
# longer starves. E2 had 1054 failures; expect
|
||||
# ≤ E1's 85.
|
||||
# H3 (TTFT): the 231 successful E2 reqs had TTFT p50 = 0.43s,
|
||||
# well under E1's 88.6s. With the failure cascade
|
||||
# removed, these should generalize to most reqs.
|
||||
#
|
||||
# Prerequisites:
|
||||
# - source scripts/setup_env.sh
|
||||
# (sets CUDA_HOME, MC_TRANSFER_TIMEOUT=1800, etc.)
|
||||
# - outputs/inferact_50sess.jsonl exists (md5 7bb263a32600ef5a6ef5099ba340a487)
|
||||
# - Previous sweep done; GPUs idle.
|
||||
#
|
||||
# Usage:
|
||||
# bash scripts/sweep_e3_kvc_v2_loadfloor_rdma.sh
|
||||
#
|
||||
# Override defaults via env:
|
||||
# K=500 LOAD_FLOOR_BONUS=$K bash scripts/sweep_e3_kvc_v2_loadfloor_rdma.sh
|
||||
|
||||
set -euo pipefail
|
||||
cd "$(dirname "$0")/.."
|
||||
|
||||
if [ -z "${CUDA_HOME:-}" ]; then
|
||||
echo "ERROR: CUDA_HOME not set. Source scripts/setup_env.sh first." >&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
MODEL=${MODEL:-/mnt/models/Qwen/Qwen3-30B-A3B-Instruct-2507}
|
||||
TRACE=${TRACE:-outputs/inferact_50sess.jsonl}
|
||||
OUTPUT=${OUTPUT:-outputs/e3_kvc_v2_loadfloor_rdma_50sess}
|
||||
IB_DEVICE=${IB_DEVICE:-mlx5_60}
|
||||
LOAD_FLOOR_BONUS=${LOAD_FLOOR_BONUS:-200}
|
||||
|
||||
if [ ! -f "$TRACE" ]; then
|
||||
echo "ERROR: trace not found at $TRACE" >&2
|
||||
echo "Run: uv run --no-sync python scripts/sample_trace_subset.py --output $TRACE --sessions 50" >&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
mkdir -p "$OUTPUT"
|
||||
LOG="$OUTPUT/sweep.log"
|
||||
|
||||
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG"; }
|
||||
|
||||
log "=== E3: KVC v2 + RDMA + load-floor bonus K=$LOAD_FLOOR_BONUS, ts=1 ==="
|
||||
log "MODEL=$MODEL"
|
||||
log "TRACE=$TRACE ($(wc -l < $TRACE) requests)"
|
||||
log "OUTPUT=$OUTPUT"
|
||||
log "IB_DEVICE=$IB_DEVICE"
|
||||
log "MC_TRANSFER_TIMEOUT=${MC_TRANSFER_TIMEOUT:-default-30s}"
|
||||
|
||||
label=e3_kvc_v2_loadfloor_run1
|
||||
log ""
|
||||
log "=== [E3] $label starting ==="
|
||||
|
||||
uv run --no-sync python -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace "$TRACE" \
|
||||
--output-root "$OUTPUT" \
|
||||
--mechanism kvcache-centric \
|
||||
--policy kv-aware \
|
||||
--model-path "$MODEL" \
|
||||
--prefill-workers 1 --decode-workers 3 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3 \
|
||||
--transfer-backend mooncake \
|
||||
--force-rdma --ib-device "$IB_DEVICE" \
|
||||
--gpu-budget 4 \
|
||||
--time-scale 1 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 1800 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 1 \
|
||||
--kvcache-seed-max-inflight-decode -1 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction \
|
||||
--kvcache-migration-reject-threshold 3 \
|
||||
--kvcache-direct-max-uncached-tokens 8192 \
|
||||
--kvcache-load-floor-bonus "$LOAD_FLOOR_BONUS" 2>&1 | tee -a "$LOG"
|
||||
|
||||
run_dir=$(ls -td "$OUTPUT"/kvcache-centric-*/ 2>/dev/null | head -1)
|
||||
log "=== [E3] $label COMPLETED, artifacts at $run_dir ==="
|
||||
|
||||
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
|
||||
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
|
||||
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
|
||||
log "=== summary saved to $OUTPUT/${label}_summary.json ==="
|
||||
fi
|
||||
82
scripts/sweep_e4_kvc_v2_d_to_p_sync.sh
Executable file
82
scripts/sweep_e4_kvc_v2_d_to_p_sync.sh
Executable file
@@ -0,0 +1,82 @@
|
||||
#!/usr/bin/env bash
|
||||
# E4 — KVC v2 + RDMA + load-floor bonus + D→P snapshot push
|
||||
#
|
||||
# Identical to scripts/sweep_e3_kvc_v2_loadfloor_rdma.sh except for the
|
||||
# additional --enable-d-to-p-sync flag (which causes agentic to orchestrate
|
||||
# the snapshot RPCs on the reseed slow path, and stack.py to set
|
||||
# SGLANG_SNAPSHOT_LINK_ENABLE=1 per worker).
|
||||
#
|
||||
# See docs/E4_PROTOCOL_ZH.md for hypothesis matrix.
|
||||
|
||||
set -euo pipefail
|
||||
cd "$(dirname "$0")/.."
|
||||
|
||||
if [ -z "${CUDA_HOME:-}" ]; then
|
||||
echo "ERROR: CUDA_HOME not set. Source scripts/setup_env.sh first." >&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
MODEL=${MODEL:-/mnt/models/Qwen/Qwen3-30B-A3B-Instruct-2507}
|
||||
TRACE=${TRACE:-outputs/inferact_50sess.jsonl}
|
||||
OUTPUT=${OUTPUT:-outputs/e4_kvc_v2_d_to_p_sync_50sess}
|
||||
IB_DEVICE=${IB_DEVICE:-mlx5_60}
|
||||
LOAD_FLOOR_BONUS=${LOAD_FLOOR_BONUS:-200}
|
||||
|
||||
if [ ! -f "$TRACE" ]; then
|
||||
echo "ERROR: trace not found at $TRACE" >&2
|
||||
echo "Run: uv run --no-sync python scripts/sample_trace_subset.py --output $TRACE --sessions 50" >&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
mkdir -p "$OUTPUT"
|
||||
LOG="$OUTPUT/sweep.log"
|
||||
|
||||
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG"; }
|
||||
|
||||
log "=== E4: KVC v2 + RDMA + load-floor K=$LOAD_FLOOR_BONUS + D→P sync ==="
|
||||
log "MODEL=$MODEL"
|
||||
log "TRACE=$TRACE ($(wc -l < $TRACE) requests)"
|
||||
log "OUTPUT=$OUTPUT"
|
||||
log "IB_DEVICE=$IB_DEVICE"
|
||||
log "MC_TRANSFER_TIMEOUT=${MC_TRANSFER_TIMEOUT:-default-30s}"
|
||||
|
||||
label=e4_kvc_v2_d_to_p_sync_run1
|
||||
log ""
|
||||
log "=== [E4] $label starting ==="
|
||||
|
||||
uv run --no-sync python -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace "$TRACE" \
|
||||
--output-root "$OUTPUT" \
|
||||
--mechanism kvcache-centric \
|
||||
--policy kv-aware \
|
||||
--model-path "$MODEL" \
|
||||
--prefill-workers 1 --decode-workers 3 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3 \
|
||||
--transfer-backend mooncake \
|
||||
--force-rdma --ib-device "$IB_DEVICE" \
|
||||
--gpu-budget 4 \
|
||||
--time-scale 1 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 1800 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 1 \
|
||||
--kvcache-seed-max-inflight-decode -1 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction \
|
||||
--kvcache-migration-reject-threshold 3 \
|
||||
--kvcache-direct-max-uncached-tokens 8192 \
|
||||
--kvcache-load-floor-bonus "$LOAD_FLOOR_BONUS" \
|
||||
--enable-d-to-p-sync 2>&1 | tee -a "$LOG"
|
||||
|
||||
run_dir=$(ls -td "$OUTPUT"/kvcache-centric-*/ 2>/dev/null | head -1)
|
||||
log "=== [E4] $label COMPLETED, artifacts at $run_dir ==="
|
||||
|
||||
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
|
||||
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
|
||||
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
|
||||
log "=== summary saved to $OUTPUT/${label}_summary.json ==="
|
||||
fi
|
||||
117
scripts/sweep_e4_pressured.sh
Executable file
117
scripts/sweep_e4_pressured.sh
Executable file
@@ -0,0 +1,117 @@
|
||||
#!/usr/bin/env bash
|
||||
# E4-pressured — same as E4 but tuned to force admission rejections so the
|
||||
# D→P snapshot fast-path actually fires.
|
||||
#
|
||||
# Key delta vs sweep_e4_kvc_v2_d_to_p_sync.sh:
|
||||
# --kvcache-migration-reject-threshold 1 (was 3)
|
||||
# After ONE rejection the policy migrates the session to a different
|
||||
# D, which in turn triggers _invoke_kvcache_seeded_router → D→P sync.
|
||||
# --decode-mem-fraction-static 0.4
|
||||
# Plumbed through cli.py → topology.decode_extra_server_args →
|
||||
# launcher. Shrinks per-decode KV pool, forcing admit_direct_append
|
||||
# to reject more often.
|
||||
#
|
||||
# Hypotheses (same as docs/E4_PROTOCOL_ZH.md but in a stressed regime):
|
||||
# H1' E4-pressured TTFT p99 ≤ E1 TTFT p99
|
||||
# H2' D→P snapshot succeeds for ≥ 20% of reseed-triggering requests
|
||||
# H3' D→P-pushed-then-cache-hit reduces re-prefill segment of reseed
|
||||
# path TTFT measurably
|
||||
|
||||
set -euo pipefail
|
||||
cd "$(dirname "$0")/.."
|
||||
|
||||
if [ -z "${CUDA_HOME:-}" ]; then
|
||||
echo "ERROR: CUDA_HOME not set. Source scripts/setup_env.sh first." >&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
MODEL=${MODEL:-/mnt/models/Qwen/Qwen3-30B-A3B-Instruct-2507}
|
||||
TRACE=${TRACE:-third_party/traces/qwen35-swebench-50sess.jsonl}
|
||||
OUTPUT=${OUTPUT:-outputs/e4p_kvc_v2_d_to_p_sync_pressured_50sess}
|
||||
IB_DEVICE=${IB_DEVICE:-mlx5_60}
|
||||
LOAD_FLOOR_BONUS=${LOAD_FLOOR_BONUS:-200}
|
||||
REJECT_THRESHOLD=${REJECT_THRESHOLD:-1}
|
||||
MEM_FRACTION=${MEM_FRACTION:-0.5}
|
||||
# time-scale: 1 = realistic 5.44h timeline for the SWE-Bench trace;
|
||||
# 10 = compress to ~33 min; 60 = compress to ~5.5 min (stress test).
|
||||
TIME_SCALE=${TIME_SCALE:-1}
|
||||
|
||||
if [ ! -f "$TRACE" ]; then
|
||||
echo "ERROR: trace not found at $TRACE" >&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
mkdir -p "$OUTPUT"
|
||||
LOG="$OUTPUT/sweep.log"
|
||||
|
||||
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG"; }
|
||||
|
||||
log "=== E4-pressured: KVC v2 + RDMA + load-floor K=$LOAD_FLOOR_BONUS + D→P sync + reject_threshold=$REJECT_THRESHOLD + mem_fraction=$MEM_FRACTION ==="
|
||||
log "MODEL=$MODEL"
|
||||
log "TRACE=$TRACE ($(wc -l < $TRACE) requests)"
|
||||
log "OUTPUT=$OUTPUT"
|
||||
|
||||
label=e4p_kvc_v2_d_to_p_sync_run1
|
||||
log "=== [E4p] $label starting ==="
|
||||
|
||||
# Background GPU utilization sampler — every 1 s, all 4 GPUs, CSV output.
|
||||
GPU_CSV="$OUTPUT/gpu_util.csv"
|
||||
log "GPU sampling → $GPU_CSV (1 Hz, gpus 0-3)"
|
||||
echo "timestamp_iso,gpu_index,util_pct,mem_used_MiB,mem_total_MiB,sm_clock_MHz,power_W,temperature_C" > "$GPU_CSV"
|
||||
(
|
||||
while true; do
|
||||
ts_iso=$(date -u +%Y-%m-%dT%H:%M:%S.%3NZ)
|
||||
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total,clocks.sm,power.draw,temperature.gpu \
|
||||
--format=csv,noheader,nounits 2>/dev/null \
|
||||
| sed -e "s/^/${ts_iso},/" -e 's/ //g' >> "$GPU_CSV" || true
|
||||
sleep 1
|
||||
done
|
||||
) &
|
||||
GPU_SAMPLER_PID=$!
|
||||
log "GPU sampler pid=$GPU_SAMPLER_PID"
|
||||
|
||||
cleanup_gpu_sampler() {
|
||||
kill -9 "$GPU_SAMPLER_PID" 2>/dev/null || true
|
||||
wait "$GPU_SAMPLER_PID" 2>/dev/null || true
|
||||
log "GPU sampler stopped (output: $GPU_CSV, $(wc -l < "$GPU_CSV") rows)"
|
||||
}
|
||||
trap cleanup_gpu_sampler EXIT INT TERM
|
||||
|
||||
uv run --no-sync python -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace "$TRACE" \
|
||||
--output-root "$OUTPUT" \
|
||||
--mechanism kvcache-centric \
|
||||
--policy kv-aware \
|
||||
--model-path "$MODEL" \
|
||||
--prefill-workers 1 --decode-workers 3 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3 \
|
||||
--transfer-backend mooncake \
|
||||
--force-rdma --ib-device "$IB_DEVICE" \
|
||||
--gpu-budget 4 \
|
||||
--time-scale "$TIME_SCALE" \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 1800 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 1 \
|
||||
--kvcache-seed-max-inflight-decode -1 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction \
|
||||
--kvcache-migration-reject-threshold "$REJECT_THRESHOLD" \
|
||||
--kvcache-direct-max-uncached-tokens 8192 \
|
||||
--kvcache-load-floor-bonus "$LOAD_FLOOR_BONUS" \
|
||||
--decode-mem-fraction-static "${DECODE_MEM_FRAC:-0.4}" \
|
||||
--prefill-mem-fraction-static "${PREFILL_MEM_FRAC:-0.7}" \
|
||||
--enable-d-to-p-sync 2>&1 | tee -a "$LOG"
|
||||
|
||||
run_dir=$(ls -td "$OUTPUT"/kvcache-centric-*/ 2>/dev/null | head -1)
|
||||
log "=== [E4p] $label COMPLETED, artifacts at $run_dir ==="
|
||||
|
||||
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
|
||||
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
|
||||
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
|
||||
log "=== summary saved to $OUTPUT/${label}_summary.json ==="
|
||||
fi
|
||||
@@ -48,6 +48,8 @@ class BenchmarkConfig:
|
||||
enable_backpressure: bool = False
|
||||
backpressure_max_pause_s: float = 2.0
|
||||
kvcache_migration_reject_threshold: int = 3
|
||||
kvcache_load_floor_bonus: int = 0
|
||||
enable_d_to_p_sync: bool = False
|
||||
sample_profile: str = "default"
|
||||
min_initial_input_tokens: int | None = None
|
||||
max_initial_input_tokens: int | None = None
|
||||
@@ -198,8 +200,10 @@ def run_live_benchmark(config: BenchmarkConfig) -> BenchmarkArtifacts:
|
||||
pool_poll_interval_s=config.pool_poll_interval_s,
|
||||
pool_poll_include_sessions=config.pool_poll_include_sessions,
|
||||
enable_backpressure=config.enable_backpressure,
|
||||
enable_d_to_p_sync=config.enable_d_to_p_sync,
|
||||
backpressure_max_pause_s=config.backpressure_max_pause_s,
|
||||
kvcache_migration_reject_threshold=config.kvcache_migration_reject_threshold,
|
||||
kvcache_load_floor_bonus=config.kvcache_load_floor_bonus,
|
||||
)
|
||||
if config.request_timeout_s is not None:
|
||||
replay_config = replace(
|
||||
@@ -261,6 +265,7 @@ def run_live_benchmark(config: BenchmarkConfig) -> BenchmarkArtifacts:
|
||||
"enable_backpressure": config.enable_backpressure,
|
||||
"backpressure_max_pause_s": config.backpressure_max_pause_s,
|
||||
"kvcache_migration_reject_threshold": config.kvcache_migration_reject_threshold,
|
||||
"kvcache_load_floor_bonus": config.kvcache_load_floor_bonus,
|
||||
"sample_profile": config.sample_profile,
|
||||
"min_initial_input_tokens": config.min_initial_input_tokens,
|
||||
"max_initial_input_tokens": config.max_initial_input_tokens,
|
||||
|
||||
@@ -270,6 +270,30 @@ def main() -> None:
|
||||
"See REFACTOR_PLAN_V1 §6.2 / TEAM_REPORT §2.1."
|
||||
),
|
||||
)
|
||||
replay.add_argument(
|
||||
"--kvcache-load-floor-bonus",
|
||||
type=int,
|
||||
default=0,
|
||||
help=(
|
||||
"Graduated bonus added to lex-score position 0 for under-loaded D "
|
||||
"workers (gated on not-sticky so turn-1+ requests still stick). "
|
||||
"Magnitude scales as K * max(0, mean - assigned[D]) / mean. "
|
||||
"Set above max expected cross-session boilerplate overlap "
|
||||
"(Inferact ~50 → use 200). 0 disables. "
|
||||
"See docs/E1_E2_FIX_DESIGN_ZH.md §Q2."
|
||||
),
|
||||
)
|
||||
replay.add_argument(
|
||||
"--enable-d-to-p-sync",
|
||||
action="store_true",
|
||||
help=(
|
||||
"Enable D→P RDMA KV snapshot push for reseed fast-path. "
|
||||
"When set, on _invoke_kvcache_seeded_router agentic will probe D's "
|
||||
"session_aware_cache, RDMA-dump session KV to P's snapshot link, "
|
||||
"and insert into P's radix tree so the upcoming P prefill hits "
|
||||
"cache. See docs/D_TO_P_SYNC_DESIGN_ZH.md."
|
||||
),
|
||||
)
|
||||
|
||||
sample = subparsers.add_parser(
|
||||
"sample-sessions",
|
||||
@@ -521,6 +545,44 @@ def main() -> None:
|
||||
"See REFACTOR_PLAN_V1 §6.2 / TEAM_REPORT §2.1."
|
||||
),
|
||||
)
|
||||
benchmark.add_argument(
|
||||
"--kvcache-load-floor-bonus",
|
||||
type=int,
|
||||
default=0,
|
||||
help=(
|
||||
"Graduated bonus added to lex-score position 0 for under-loaded D "
|
||||
"workers (gated on not-sticky so turn-1+ requests still stick). "
|
||||
"Magnitude scales as K * max(0, mean - assigned[D]) / mean. "
|
||||
"Set above max expected cross-session boilerplate overlap "
|
||||
"(Inferact ~50 → use 200). 0 disables. "
|
||||
"See docs/E1_E2_FIX_DESIGN_ZH.md §Q2."
|
||||
),
|
||||
)
|
||||
benchmark.add_argument(
|
||||
"--enable-d-to-p-sync",
|
||||
action="store_true",
|
||||
help=(
|
||||
"Enable D→P RDMA KV snapshot push for reseed fast-path. "
|
||||
"See docs/D_TO_P_SYNC_DESIGN_ZH.md."
|
||||
),
|
||||
)
|
||||
benchmark.add_argument(
|
||||
"--decode-mem-fraction-static",
|
||||
type=float,
|
||||
default=None,
|
||||
help=(
|
||||
"Override SGLang's --mem-fraction-static on decode workers. "
|
||||
"Smaller value → smaller KV pool → admit_direct_append rejects "
|
||||
"more often → reseed path fires more often. Pressure tool for "
|
||||
"E4-style D→P sync experiments."
|
||||
),
|
||||
)
|
||||
benchmark.add_argument(
|
||||
"--prefill-mem-fraction-static",
|
||||
type=float,
|
||||
default=None,
|
||||
help="Override --mem-fraction-static on prefill workers.",
|
||||
)
|
||||
benchmark.add_argument(
|
||||
"--sample-profile",
|
||||
choices=["default", "small-append"],
|
||||
@@ -607,6 +669,8 @@ def main() -> None:
|
||||
enable_backpressure=args.enable_backpressure,
|
||||
backpressure_max_pause_s=args.backpressure_max_pause_s,
|
||||
kvcache_migration_reject_threshold=args.kvcache_migration_reject_threshold,
|
||||
kvcache_load_floor_bonus=args.kvcache_load_floor_bonus,
|
||||
enable_d_to_p_sync=args.enable_d_to_p_sync,
|
||||
)
|
||||
results = asyncio.run(replay_trace(config))
|
||||
print(
|
||||
@@ -754,6 +818,8 @@ def main() -> None:
|
||||
enable_backpressure=args.enable_backpressure,
|
||||
backpressure_max_pause_s=args.backpressure_max_pause_s,
|
||||
kvcache_migration_reject_threshold=args.kvcache_migration_reject_threshold,
|
||||
kvcache_load_floor_bonus=args.kvcache_load_floor_bonus,
|
||||
enable_d_to_p_sync=args.enable_d_to_p_sync,
|
||||
sample_profile=args.sample_profile,
|
||||
min_initial_input_tokens=args.min_initial_input_tokens,
|
||||
max_initial_input_tokens=args.max_initial_input_tokens,
|
||||
@@ -848,9 +914,26 @@ def _topology_from_args(args: argparse.Namespace):
|
||||
force_rdma=args.force_rdma,
|
||||
trust_remote_code=not args.no_trust_remote_code,
|
||||
ib_device=args.ib_device,
|
||||
direct_extra_server_args=("--enable-streaming-session",),
|
||||
enable_d_to_p_sync=getattr(args, "enable_d_to_p_sync", False),
|
||||
prefill_extra_server_args=_build_extra_server_args(args, "prefill"),
|
||||
decode_extra_server_args=_build_extra_server_args(args, "decode"),
|
||||
direct_extra_server_args=_build_extra_server_args(args, "direct"),
|
||||
)
|
||||
|
||||
|
||||
def _build_extra_server_args(args, role: str) -> tuple[str, ...]:
|
||||
base: tuple[str, ...]
|
||||
if role == "direct":
|
||||
base = ("--enable-streaming-session",)
|
||||
else:
|
||||
base = ("--disable-overlap-schedule",)
|
||||
mem_frac = getattr(args, "decode_mem_fraction_static", None) if role == "decode" else None
|
||||
if mem_frac is None and role == "prefill":
|
||||
mem_frac = getattr(args, "prefill_mem_fraction_static", None)
|
||||
if mem_frac is not None and mem_frac > 0:
|
||||
base = base + ("--mem-fraction-static", f"{mem_frac:.3f}")
|
||||
return base
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
|
||||
@@ -161,6 +161,28 @@ class KvAwarePolicy:
|
||||
# 0 disables the mechanism. Default 3 picked empirically to allow brief
|
||||
# transient saturation without panicking, but to reroute persistent starvation.
|
||||
migration_reject_threshold: int = 3
|
||||
# Load-floor bonus: graduated boost added to lex-score position 0 for
|
||||
# under-loaded D workers, gated on `not sticky` so turn-1+ requests of an
|
||||
# existing session continue to stick to their original D. The boost
|
||||
# magnitude scales linearly with the D's deficit relative to the running
|
||||
# mean of `decode_assignment_counts`:
|
||||
# floor_bonus = K * max(0, mean - assigned[D]) / max(1, mean)
|
||||
# When mean=0 (warmup), bonus is 0 for all workers (lex tiebreak by
|
||||
# iteration order). Once any D has been assigned, under-loaded D's get a
|
||||
# bonus that approaches K as their deficit-to-mean ratio approaches 1.
|
||||
# The bonus naturally decays as load equalises, leaving the original
|
||||
# overlap+sticky scoring in charge of steady-state selection.
|
||||
#
|
||||
# Set this above the maximum cross-session boilerplate overlap you expect
|
||||
# so that fresh sessions are routed to under-loaded D's even when those
|
||||
# D's currently have 0 overlap, but below the magnitude of "real" prefix
|
||||
# overlap (e.g., a session with 800-block per-session prefix on an
|
||||
# already-warm D should still go there).
|
||||
#
|
||||
# 0 disables. See docs/E1_E2_FIX_DESIGN_ZH.md §Q2 for the full design and
|
||||
# docs/E1_E2_RESULTS_ZH.md §5d for why this is needed on Inferact-shaped
|
||||
# workloads where boilerplate overlap pins D2 cold forever.
|
||||
load_floor_bonus: int = 0
|
||||
|
||||
def select(
|
||||
self,
|
||||
@@ -172,6 +194,12 @@ class KvAwarePolicy:
|
||||
prefill_worker_id = state.next_prefill_worker_id(topology)
|
||||
session = state.session_state.get(request.session_id)
|
||||
|
||||
# Pre-compute the running mean of decode assignments. Used by the
|
||||
# load-floor bonus inside the candidate loop.
|
||||
n_route_workers = max(1, len(topology.route_workers))
|
||||
total_assigned = sum(state.decode_assignment_counts.values())
|
||||
mean_assigned = total_assigned / n_route_workers
|
||||
|
||||
best_decode_worker_id: str | None = None
|
||||
best_score: tuple[int, int, int, int] | None = None
|
||||
candidates_considered = 0
|
||||
@@ -189,9 +217,18 @@ class KvAwarePolicy:
|
||||
overlap = _overlap_blocks(request, state, worker.worker_id)
|
||||
sticky = int(session is not None and session.last_decode_worker == worker.worker_id)
|
||||
inflight_penalty = -state.inflight_decode.get(worker.worker_id, 0)
|
||||
assignment_penalty = -state.decode_assignment_counts.get(worker.worker_id, 0)
|
||||
worker_assigned = state.decode_assignment_counts.get(worker.worker_id, 0)
|
||||
assignment_penalty = -worker_assigned
|
||||
|
||||
# Load-floor bonus: only for fresh placements (not sticky), and
|
||||
# only when the knob is enabled. See docstring above.
|
||||
floor_bonus = 0
|
||||
if self.load_floor_bonus > 0 and not sticky and mean_assigned > 0:
|
||||
deficit = max(0.0, mean_assigned - worker_assigned)
|
||||
floor_bonus = int(self.load_floor_bonus * deficit / mean_assigned)
|
||||
|
||||
score = (
|
||||
overlap + sticky * self.sticky_bonus,
|
||||
overlap + sticky * self.sticky_bonus + floor_bonus,
|
||||
sticky,
|
||||
inflight_penalty,
|
||||
assignment_penalty,
|
||||
@@ -223,14 +260,22 @@ class KvAwarePolicy:
|
||||
)
|
||||
|
||||
|
||||
def create_policy(name: str, *, migration_reject_threshold: int = 3) -> RoutingPolicy:
|
||||
def create_policy(
|
||||
name: str,
|
||||
*,
|
||||
migration_reject_threshold: int = 3,
|
||||
load_floor_bonus: int = 0,
|
||||
) -> RoutingPolicy:
|
||||
normalized = name.strip().lower()
|
||||
if normalized == "default":
|
||||
return DefaultPolicy()
|
||||
if normalized == "sticky":
|
||||
return StickyDecodePolicy()
|
||||
if normalized in {"kv-aware", "kv_aware", "kv"}:
|
||||
return KvAwarePolicy(migration_reject_threshold=migration_reject_threshold)
|
||||
return KvAwarePolicy(
|
||||
migration_reject_threshold=migration_reject_threshold,
|
||||
load_floor_bonus=load_floor_bonus,
|
||||
)
|
||||
raise ValueError(f"Unsupported policy: {name}")
|
||||
|
||||
|
||||
|
||||
@@ -111,6 +111,16 @@ class ReplayConfig:
|
||||
# KvAwarePolicy skips that D for the session (forcing migration). Default 3.
|
||||
# Set 0 to disable. See REFACTOR_PLAN_V1 §6.2.
|
||||
kvcache_migration_reject_threshold: int = 3
|
||||
# Load-floor bonus magnitude for KvAwarePolicy: graduated boost added to
|
||||
# under-loaded D workers to break overlap-pinning imbalance on workloads
|
||||
# with shared cross-session prefix. 0 disables. See
|
||||
# docs/E1_E2_FIX_DESIGN_ZH.md §Q2.
|
||||
kvcache_load_floor_bonus: int = 0
|
||||
# D→P snapshot push: when True and reseed fires, agentic will RDMA-dump
|
||||
# the session's KV from the D-side worker that last held it onto the P
|
||||
# worker and insert into P's radix tree, so the subsequent P prefill
|
||||
# hits cache. See docs/D_TO_P_SYNC_DESIGN_ZH.md.
|
||||
enable_d_to_p_sync: bool = False
|
||||
structural_log_dir: Path | None = None
|
||||
|
||||
|
||||
@@ -198,6 +208,7 @@ async def replay_trace(config: ReplayConfig) -> list[RequestMetrics]:
|
||||
policy = create_policy(
|
||||
config.policy_name,
|
||||
migration_reject_threshold=config.kvcache_migration_reject_threshold,
|
||||
load_floor_bonus=config.kvcache_load_floor_bonus,
|
||||
)
|
||||
state = RoutingState.create(config.topology)
|
||||
state_lock = asyncio.Lock()
|
||||
@@ -2098,6 +2109,188 @@ async def _invoke_plain_router(
|
||||
)
|
||||
|
||||
|
||||
async def _attempt_d_to_p_sync(
|
||||
*,
|
||||
client: httpx.AsyncClient,
|
||||
request: TraceRequest,
|
||||
config: ReplayConfig,
|
||||
prefill_url: str,
|
||||
decode_session: DirectSessionState,
|
||||
) -> dict | None:
|
||||
"""Try to RDMA-dump session KV from the D that last held it to ``prefill_url``.
|
||||
|
||||
Returns a dict with status info on success/skip, or ``None`` on a
|
||||
non-recoverable error. The caller falls back to normal re-prefill on
|
||||
any failure. Each path emits a structural-log line so we can forensic
|
||||
why sync skipped vs succeeded vs failed.
|
||||
"""
|
||||
if not config.enable_d_to_p_sync:
|
||||
return None
|
||||
source_d_url = decode_session.server_url
|
||||
sid = request.session_id
|
||||
rid = request.request_id
|
||||
if not source_d_url:
|
||||
await _structural_emit(
|
||||
"d-to-p-sync.jsonl",
|
||||
{"event": "skipped", "stage": "entry", "sid": sid, "rid": rid,
|
||||
"reason": "no-source-d"},
|
||||
)
|
||||
return {"status": "skipped-no-source-d"}
|
||||
# NB: do NOT gate on decode_session.opened. By the time we reach the
|
||||
# fallback seeded_router, agentic has already flipped that flag to False
|
||||
# in response to admission rejection. But the D-side scheduler's
|
||||
# SessionAwareCache may STILL hold the session resident (release_session
|
||||
# is only called explicitly, not from admission events). Let D be the
|
||||
# source of truth via its own snapshot_dump response.
|
||||
target_tokens = max(0, int(_estimate_session_resident_tokens(request)))
|
||||
if target_tokens <= 0:
|
||||
await _structural_emit(
|
||||
"d-to-p-sync.jsonl",
|
||||
{"event": "skipped", "stage": "entry", "sid": sid, "rid": rid,
|
||||
"reason": "zero-target-tokens"},
|
||||
)
|
||||
return {"status": "skipped-zero-tokens"}
|
||||
|
||||
t_prep0 = time.perf_counter()
|
||||
try:
|
||||
prep_resp = await client.post(
|
||||
f"{prefill_url}/_snapshot/prepare_receive",
|
||||
json={
|
||||
"session_id": request.session_id,
|
||||
"num_tokens": target_tokens,
|
||||
},
|
||||
timeout=30.0,
|
||||
)
|
||||
prep_resp.raise_for_status()
|
||||
prep = prep_resp.json()
|
||||
except Exception as exc:
|
||||
await _structural_emit(
|
||||
"d-to-p-sync.jsonl",
|
||||
{"event": "failed", "stage": "prepare", "sid": sid, "rid": rid,
|
||||
"error": repr(exc)[:200]},
|
||||
)
|
||||
return {"status": "prepare-failed", "error": repr(exc)}
|
||||
t_prep1 = time.perf_counter()
|
||||
if not prep.get("ok"):
|
||||
await _structural_emit(
|
||||
"d-to-p-sync.jsonl",
|
||||
{"event": "skipped", "stage": "prepare", "sid": sid, "rid": rid,
|
||||
"reason": prep.get("reason"),
|
||||
"prepare_dur_ms": round((t_prep1 - t_prep0) * 1000, 2)},
|
||||
)
|
||||
return {"status": "prepare-not-ok", "reason": prep.get("reason")}
|
||||
|
||||
t_dump0 = time.perf_counter()
|
||||
try:
|
||||
dump_resp = await client.post(
|
||||
f"{source_d_url}/_snapshot/dump",
|
||||
json={
|
||||
"session_id": request.session_id,
|
||||
"target_snapshot_session_id": prep["snapshot_session_id"],
|
||||
"target_snapshot_buf_base": prep["snapshot_buf_base_ptr"],
|
||||
"target_k_layer_offsets": prep["k_layer_offsets"],
|
||||
"target_v_layer_offsets": prep["v_layer_offsets"],
|
||||
"target_stride_k_bytes": prep["stride_k_bytes"],
|
||||
"target_stride_v_bytes": prep["stride_v_bytes"],
|
||||
},
|
||||
timeout=60.0,
|
||||
)
|
||||
dump_resp.raise_for_status()
|
||||
dump = dump_resp.json()
|
||||
except Exception as exc:
|
||||
await _structural_emit(
|
||||
"d-to-p-sync.jsonl",
|
||||
{"event": "failed", "stage": "dump", "sid": sid, "rid": rid,
|
||||
"error": repr(exc)[:200]},
|
||||
)
|
||||
return {"status": "dump-failed", "error": repr(exc)}
|
||||
t_dump1 = time.perf_counter()
|
||||
if not dump.get("ok"):
|
||||
await _structural_emit(
|
||||
"d-to-p-sync.jsonl",
|
||||
{"event": "skipped", "stage": "dump", "sid": sid, "rid": rid,
|
||||
"reason": dump.get("reason"),
|
||||
"dump_dur_ms": round((t_dump1 - t_dump0) * 1000, 2),
|
||||
"kv_committed_len": int(dump.get("kv_committed_len", 0))},
|
||||
)
|
||||
return {"status": "dump-not-ok", "reason": dump.get("reason"),
|
||||
"bytes_pushed": dump.get("bytes_pushed", 0)}
|
||||
|
||||
# We need token_ids for radix insert. The caller has request.input_token_ids
|
||||
# for the first N — use that as best-available approximation.
|
||||
tokens = list(getattr(request, "input_token_ids", []) or [])
|
||||
if not tokens:
|
||||
# No token_ids → can't insert into radix; tell P to free the slab.
|
||||
try:
|
||||
await client.post(
|
||||
f"{prefill_url}/_snapshot/finalize_ingest",
|
||||
json={
|
||||
"session_id": request.session_id,
|
||||
"token_ids": [],
|
||||
},
|
||||
timeout=15.0,
|
||||
)
|
||||
except Exception:
|
||||
pass
|
||||
await _structural_emit(
|
||||
"d-to-p-sync.jsonl",
|
||||
{"event": "skipped", "stage": "post-dump", "sid": sid, "rid": rid,
|
||||
"reason": "no-input-token-ids",
|
||||
"bytes_pushed": int(dump.get("bytes_pushed", 0))},
|
||||
)
|
||||
return {"status": "no-tokens-discard", "bytes_pushed": dump.get("bytes_pushed", 0)}
|
||||
|
||||
n = min(len(tokens), int(prep.get("num_tokens", 0)))
|
||||
t_fin0 = time.perf_counter()
|
||||
try:
|
||||
fin_resp = await client.post(
|
||||
f"{prefill_url}/_snapshot/finalize_ingest",
|
||||
json={
|
||||
"session_id": request.session_id,
|
||||
"token_ids": tokens[:n],
|
||||
},
|
||||
timeout=30.0,
|
||||
)
|
||||
fin_resp.raise_for_status()
|
||||
fin = fin_resp.json()
|
||||
except Exception as exc:
|
||||
await _structural_emit(
|
||||
"d-to-p-sync.jsonl",
|
||||
{"event": "failed", "stage": "finalize", "sid": sid, "rid": rid,
|
||||
"error": repr(exc)[:200],
|
||||
"bytes_pushed": int(dump.get("bytes_pushed", 0))},
|
||||
)
|
||||
return {"status": "finalize-failed", "error": repr(exc),
|
||||
"bytes_pushed": dump.get("bytes_pushed", 0)}
|
||||
t_fin1 = time.perf_counter()
|
||||
if not fin.get("ok"):
|
||||
await _structural_emit(
|
||||
"d-to-p-sync.jsonl",
|
||||
{"event": "skipped", "stage": "finalize", "sid": sid, "rid": rid,
|
||||
"reason": fin.get("reason"),
|
||||
"bytes_pushed": int(dump.get("bytes_pushed", 0))},
|
||||
)
|
||||
return {"status": "finalize-not-ok", "reason": fin.get("reason"),
|
||||
"bytes_pushed": dump.get("bytes_pushed", 0)}
|
||||
await _structural_emit(
|
||||
"d-to-p-sync.jsonl",
|
||||
{"event": "ok", "sid": sid, "rid": rid,
|
||||
"bytes_pushed": int(dump.get("bytes_pushed", 0)),
|
||||
"kv_committed_len": int(dump.get("kv_committed_len", 0)),
|
||||
"inserted_prefix_len": int(fin.get("inserted_prefix_len", 0)),
|
||||
"prepare_dur_ms": round((t_prep1 - t_prep0) * 1000, 2),
|
||||
"dump_dur_ms": round((t_dump1 - t_dump0) * 1000, 2),
|
||||
"finalize_dur_ms": round((t_fin1 - t_fin0) * 1000, 2),
|
||||
"snapshot_session_id": prep.get("snapshot_session_id")},
|
||||
)
|
||||
return {
|
||||
"status": "ok",
|
||||
"bytes_pushed": int(dump.get("bytes_pushed", 0)),
|
||||
"inserted_prefix_len": int(fin.get("inserted_prefix_len", 0)),
|
||||
"snapshot_session_id": prep.get("snapshot_session_id"),
|
||||
}
|
||||
|
||||
|
||||
async def _invoke_kvcache_seeded_router(
|
||||
*,
|
||||
client: httpx.AsyncClient,
|
||||
@@ -2149,6 +2342,22 @@ async def _invoke_kvcache_seeded_router(
|
||||
decode_session.prefill_server_url = prefill_url
|
||||
prefill_session_newly_opened = True
|
||||
|
||||
# D→P snapshot push (Phase 3) — best-effort; on any failure we silently
|
||||
# fall back to the existing re-prefill path. The result is logged for
|
||||
# post-hoc analysis but does not affect correctness.
|
||||
if config.enable_d_to_p_sync:
|
||||
sync_result = await _attempt_d_to_p_sync(
|
||||
client=client,
|
||||
request=request,
|
||||
config=config,
|
||||
prefill_url=prefill_url,
|
||||
decode_session=decode_session,
|
||||
)
|
||||
# NB: every outcome of _attempt_d_to_p_sync is already captured in
|
||||
# structural/d-to-p-sync.jsonl via _structural_emit. No need for an
|
||||
# additional logger.info here (and `logger` isn't imported at module
|
||||
# scope, so it would NameError if reached).
|
||||
|
||||
decode_session_newly_opened = False
|
||||
try:
|
||||
prefill_priority = _prefill_priority_for_router_request(
|
||||
|
||||
266
src/agentic_pd_hybrid/snapshot_link.py
Normal file
266
src/agentic_pd_hybrid/snapshot_link.py
Normal file
@@ -0,0 +1,266 @@
|
||||
"""Minimal D→P snapshot link over Mooncake RDMA.
|
||||
|
||||
This module provides a thin wrapper around mooncake.engine.TransferEngine
|
||||
for one-sided RDMA writes of KV bytes from a Decode worker (sender) to a
|
||||
Prefill worker (receiver). It deliberately does NOT use the heavyweight
|
||||
MooncakeKVManager pipeline (which is tied to PREFILL/DECODE roles and
|
||||
chunked transfer protocols): we want a simple, testable byte transport
|
||||
that can be reused by SGLang and by stand-alone smoke tests.
|
||||
|
||||
Layout:
|
||||
SnapshotPeer — engine + pre-registered receive buffer (receiver)
|
||||
or sender handle (sender)
|
||||
SnapshotEndpoint — what the receiver advertises so the sender can
|
||||
target it: (session_id, base_ptr, length)
|
||||
SnapshotPusher — sender-side: holds a target endpoint, calls
|
||||
batch_transfer_sync_write
|
||||
|
||||
All transfers are SYNCHRONOUS, single-shot, in-memory.
|
||||
|
||||
Higher layers add: control plane (how D learns P's endpoint), per-session
|
||||
slot allocation, KV format/layout, hand-off into SGLang scheduler.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import ctypes
|
||||
import logging
|
||||
import os
|
||||
import threading
|
||||
from dataclasses import dataclass
|
||||
from typing import Optional
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class SnapshotEndpoint:
|
||||
"""What the receiver advertises so the sender can reach it.
|
||||
|
||||
Attributes
|
||||
----------
|
||||
session_id : str
|
||||
``"host:rpc_port"`` string identifying the receiver's mooncake
|
||||
TransferEngine. Returned by ``TransferEngine.get_rpc_port()``
|
||||
joined with the host the engine was initialized with.
|
||||
base_ptr : int
|
||||
Address of the registered receive buffer on the receiver side.
|
||||
capacity_bytes : int
|
||||
Length of the registered region.
|
||||
"""
|
||||
|
||||
session_id: str
|
||||
base_ptr: int
|
||||
capacity_bytes: int
|
||||
|
||||
|
||||
def _import_transfer_engine():
|
||||
try:
|
||||
from mooncake.engine import TransferEngine
|
||||
except ImportError as e: # pragma: no cover
|
||||
raise ImportError(
|
||||
"mooncake.engine.TransferEngine is required for snapshot_link. "
|
||||
"Make sure mooncake-transfer-engine is installed in the venv."
|
||||
) from e
|
||||
return TransferEngine
|
||||
|
||||
|
||||
class SnapshotPeer:
|
||||
"""One Mooncake transfer engine endpoint with a registered receive buffer.
|
||||
|
||||
The engine is dedicated to snapshot traffic — it does NOT share state
|
||||
with SGLang's MooncakeKVManager engine. Each SnapshotPeer needs its own
|
||||
host:port to listen on.
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
host: str,
|
||||
port: int,
|
||||
ib_device: Optional[str] = None,
|
||||
receive_capacity_bytes: int = 0,
|
||||
protocol: Optional[str] = None,
|
||||
):
|
||||
TransferEngine = _import_transfer_engine()
|
||||
self.host = host
|
||||
self.port = port
|
||||
self.ib_device = ib_device
|
||||
self.engine = TransferEngine()
|
||||
|
||||
listen = f"{host}:{port}"
|
||||
proto = protocol or os.environ.get("MOONCAKE_PROTOCOL", "rdma")
|
||||
ret = self.engine.initialize(
|
||||
listen,
|
||||
"P2PHANDSHAKE",
|
||||
proto,
|
||||
ib_device or "",
|
||||
)
|
||||
if ret != 0:
|
||||
raise RuntimeError(
|
||||
f"snapshot_link: engine.initialize({listen!r}, proto={proto}, "
|
||||
f"ib={ib_device}) returned {ret}"
|
||||
)
|
||||
|
||||
self._rpc_port = self.engine.get_rpc_port()
|
||||
self._session_id = f"{host}:{self._rpc_port}"
|
||||
|
||||
self._recv_buffer = None
|
||||
self._recv_ptr = 0
|
||||
self._recv_capacity = 0
|
||||
if receive_capacity_bytes > 0:
|
||||
self._allocate_recv_buffer(receive_capacity_bytes)
|
||||
|
||||
self._lock = threading.Lock()
|
||||
logger.info(
|
||||
"SnapshotPeer up at %s (rpc=%d, ib=%s, recv=%d B)",
|
||||
self._session_id,
|
||||
self._rpc_port,
|
||||
ib_device,
|
||||
receive_capacity_bytes,
|
||||
)
|
||||
|
||||
# -- accessors ---------------------------------------------------------
|
||||
|
||||
@property
|
||||
def session_id(self) -> str:
|
||||
return self._session_id
|
||||
|
||||
@property
|
||||
def rpc_port(self) -> int:
|
||||
return self._rpc_port
|
||||
|
||||
@property
|
||||
def endpoint(self) -> SnapshotEndpoint:
|
||||
if self._recv_buffer is None:
|
||||
raise RuntimeError(
|
||||
"SnapshotPeer has no receive buffer; pass receive_capacity_bytes > 0"
|
||||
)
|
||||
return SnapshotEndpoint(
|
||||
session_id=self._session_id,
|
||||
base_ptr=self._recv_ptr,
|
||||
capacity_bytes=self._recv_capacity,
|
||||
)
|
||||
|
||||
# -- buffer management -------------------------------------------------
|
||||
|
||||
def _allocate_recv_buffer(self, length: int) -> None:
|
||||
"""Allocate + register a pinned host buffer for receiving."""
|
||||
# Use c_ubyte (unsigned) so bytes() conversions of the underlying
|
||||
# storage always yield valid byte values.
|
||||
buf = (ctypes.c_ubyte * length)()
|
||||
addr = ctypes.addressof(buf)
|
||||
ret = self.engine.register_memory(addr, length)
|
||||
if ret != 0:
|
||||
raise RuntimeError(
|
||||
f"snapshot_link: register_memory({hex(addr)}, {length}) returned {ret}"
|
||||
)
|
||||
self._recv_buffer = buf
|
||||
self._recv_ptr = addr
|
||||
self._recv_capacity = length
|
||||
|
||||
def read_bytes(self, offset: int, length: int) -> bytes:
|
||||
"""Snapshot the recv buffer at [offset, offset+length) (caller syncs)."""
|
||||
if self._recv_buffer is None:
|
||||
raise RuntimeError("no recv buffer")
|
||||
if offset < 0 or offset + length > self._recv_capacity:
|
||||
raise ValueError(
|
||||
f"read_bytes({offset}, {length}) out of capacity {self._recv_capacity}"
|
||||
)
|
||||
# string_at copies via memcpy and yields a proper bytes object — works
|
||||
# regardless of signed/unsigned underlying storage.
|
||||
return ctypes.string_at(self._recv_ptr + offset, length)
|
||||
|
||||
def register_send_buffer(self, ptr: int, length: int) -> None:
|
||||
"""Register an externally-allocated send buffer for outbound RDMA writes."""
|
||||
with self._lock:
|
||||
ret = self.engine.register_memory(ptr, length)
|
||||
if ret != 0:
|
||||
raise RuntimeError(
|
||||
f"snapshot_link: register send buffer({hex(ptr)}, {length}) returned {ret}"
|
||||
)
|
||||
|
||||
def deregister(self, ptr: int) -> None:
|
||||
with self._lock:
|
||||
try:
|
||||
self.engine.unregister_memory(ptr)
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# -- transfer ----------------------------------------------------------
|
||||
|
||||
def push(
|
||||
self,
|
||||
target: SnapshotEndpoint,
|
||||
local_ptr: int,
|
||||
local_offset: int,
|
||||
length: int,
|
||||
remote_offset: int = 0,
|
||||
) -> int:
|
||||
"""Synchronously RDMA-write ``length`` bytes from ``local_ptr+local_offset``
|
||||
to ``target.base_ptr+remote_offset`` on the peer identified by
|
||||
``target.session_id``.
|
||||
|
||||
Returns 0 on success, non-zero (or raises) on failure.
|
||||
"""
|
||||
if length <= 0:
|
||||
return 0
|
||||
if remote_offset < 0 or remote_offset + length > target.capacity_bytes:
|
||||
raise ValueError(
|
||||
f"push: remote_offset={remote_offset}, length={length} exceeds "
|
||||
f"target capacity {target.capacity_bytes}"
|
||||
)
|
||||
src = local_ptr + local_offset
|
||||
dst = target.base_ptr + remote_offset
|
||||
try:
|
||||
ret = self.engine.transfer_sync_write(
|
||||
target.session_id, src, dst, length
|
||||
)
|
||||
except Exception as e:
|
||||
logger.exception("snapshot_link.push transfer_sync_write threw: %s", e)
|
||||
return -1
|
||||
if ret != 0:
|
||||
logger.warning(
|
||||
"snapshot_link.push transfer_sync_write returned %d (src=%s, "
|
||||
"dst=%s/%s, len=%d)",
|
||||
ret,
|
||||
hex(src),
|
||||
target.session_id,
|
||||
hex(dst),
|
||||
length,
|
||||
)
|
||||
return ret
|
||||
|
||||
def batch_push(
|
||||
self,
|
||||
target: SnapshotEndpoint,
|
||||
local_addrs: list[int],
|
||||
remote_addrs: list[int],
|
||||
lengths: list[int],
|
||||
) -> int:
|
||||
"""Batched RDMA write (one-shot)."""
|
||||
if not local_addrs:
|
||||
return 0
|
||||
try:
|
||||
ret = self.engine.batch_transfer_sync_write(
|
||||
target.session_id, local_addrs, remote_addrs, lengths
|
||||
)
|
||||
except Exception as e:
|
||||
logger.exception("snapshot_link.batch_push threw: %s", e)
|
||||
return -1
|
||||
return ret
|
||||
|
||||
def close(self) -> None:
|
||||
"""Best-effort shutdown — release the receive buffer registration."""
|
||||
if self._recv_ptr:
|
||||
try:
|
||||
self.engine.unregister_memory(self._recv_ptr)
|
||||
except Exception:
|
||||
pass
|
||||
self._recv_ptr = 0
|
||||
self._recv_capacity = 0
|
||||
self._recv_buffer = None
|
||||
|
||||
|
||||
def make_session_id(host: str, rpc_port: int) -> str:
|
||||
"""Build the ``host:port`` form used as mooncake's session id."""
|
||||
return f"{host}:{rpc_port}"
|
||||
@@ -201,6 +201,23 @@ def _build_process_env(topology: SingleNodeTopology) -> dict[str, str]:
|
||||
# Default to TCP when RDMA is not forced (e.g. loopback on same node)
|
||||
env.setdefault("MOONCAKE_PROTOCOL", "tcp")
|
||||
|
||||
# Mooncake C++ batch_transfer_sync default timeout is 30 s, which can
|
||||
# fire as a false positive when a saturated D scheduler thread is busy
|
||||
# with LRU eviction (see docs/E1_E2_RESULTS_ZH.md §5c). Default to 1800 s
|
||||
# so the hair-trigger blacklist in conn.py:1270 doesn't latch on
|
||||
# transient stalls. Caller can override via shell env (setup_env.sh).
|
||||
if topology.transfer_backend == "mooncake":
|
||||
env.setdefault("MC_TRANSFER_TIMEOUT", "1800")
|
||||
|
||||
# D→P snapshot link (Phase 2). Each worker reads its own
|
||||
# `disaggregation_bootstrap_port` and binds at `bootstrap_port + 1000`
|
||||
# for the snapshot mooncake engine (see
|
||||
# third_party/sglang/.../disaggregation/snapshot/controller.py).
|
||||
if topology.enable_d_to_p_sync:
|
||||
env["SGLANG_SNAPSHOT_LINK_ENABLE"] = "1"
|
||||
if topology.ib_device:
|
||||
env.setdefault("SGLANG_SNAPSHOT_LINK_IB_DEVICE", topology.ib_device)
|
||||
|
||||
repo_root = Path(__file__).resolve().parents[2]
|
||||
python_paths = [
|
||||
str(repo_root / "src"),
|
||||
|
||||
@@ -46,6 +46,7 @@ class SingleNodeTopology:
|
||||
trust_remote_code: bool
|
||||
force_rdma: bool = False
|
||||
ib_device: str | None = None
|
||||
enable_d_to_p_sync: bool = False
|
||||
extra_server_args: tuple[str, ...] = ()
|
||||
prefill_extra_server_args: tuple[str, ...] = ()
|
||||
decode_extra_server_args: tuple[str, ...] = ()
|
||||
@@ -95,6 +96,7 @@ def build_single_node_topology(
|
||||
force_rdma: bool = False,
|
||||
trust_remote_code: bool = True,
|
||||
ib_device: str | None = None,
|
||||
enable_d_to_p_sync: bool = False,
|
||||
extra_server_args: tuple[str, ...] = (),
|
||||
prefill_extra_server_args: tuple[str, ...] = (),
|
||||
decode_extra_server_args: tuple[str, ...] = (),
|
||||
@@ -238,6 +240,7 @@ def build_single_node_topology(
|
||||
trust_remote_code=trust_remote_code,
|
||||
force_rdma=force_rdma,
|
||||
ib_device=ib_device,
|
||||
enable_d_to_p_sync=enable_d_to_p_sync,
|
||||
extra_server_args=extra_server_args,
|
||||
prefill_extra_server_args=prefill_extra_server_args,
|
||||
decode_extra_server_args=decode_extra_server_args,
|
||||
|
||||
1
third_party/agentic-kvcache
vendored
Submodule
1
third_party/agentic-kvcache
vendored
Submodule
Submodule third_party/agentic-kvcache added at 44796a1139
27
third_party/sglang/python/sglang/srt/disaggregation/snapshot/__init__.py
vendored
Normal file
27
third_party/sglang/python/sglang/srt/disaggregation/snapshot/__init__.py
vendored
Normal file
@@ -0,0 +1,27 @@
|
||||
"""D→P RDMA snapshot push subsystem.
|
||||
|
||||
A minimal, role-symmetric mooncake transport that runs alongside SGLang's
|
||||
existing PD pipeline. Both D and P workers can both send and receive
|
||||
snapshots — direction is determined by which kv_pool we read from /
|
||||
write into.
|
||||
|
||||
See ``docs/D_TO_P_SYNC_DESIGN_ZH.md`` for the full design.
|
||||
"""
|
||||
|
||||
from sglang.srt.disaggregation.snapshot.controller import (
|
||||
SnapshotLinkController,
|
||||
SnapshotIngestRecord,
|
||||
SNAPSHOT_LINK_ENABLE_ENV,
|
||||
SNAPSHOT_LINK_HOST_ENV,
|
||||
SNAPSHOT_LINK_PORT_ENV,
|
||||
SNAPSHOT_LINK_IB_DEVICE_ENV,
|
||||
)
|
||||
|
||||
__all__ = [
|
||||
"SnapshotLinkController",
|
||||
"SnapshotIngestRecord",
|
||||
"SNAPSHOT_LINK_ENABLE_ENV",
|
||||
"SNAPSHOT_LINK_HOST_ENV",
|
||||
"SNAPSHOT_LINK_PORT_ENV",
|
||||
"SNAPSHOT_LINK_IB_DEVICE_ENV",
|
||||
]
|
||||
577
third_party/sglang/python/sglang/srt/disaggregation/snapshot/controller.py
vendored
Normal file
577
third_party/sglang/python/sglang/srt/disaggregation/snapshot/controller.py
vendored
Normal file
@@ -0,0 +1,577 @@
|
||||
"""SnapshotLinkController — D→P RDMA snapshot pushes with dedicated GPU buffer.
|
||||
|
||||
Per `docs/SNAPSHOT_STORE_REFACTOR_ZH.md`, this controller now reserves a
|
||||
dedicated GPU tensor (``snapshot_buf``) for receiving D→P snapshots, instead
|
||||
of competing with the worker's ``token_to_kv_pool_allocator`` at
|
||||
prepare_receive time. The kv_pool alloc is deferred to ``finalize_ingest``
|
||||
when the bytes are already in hand — if that alloc fails we drop the
|
||||
snapshot but RDMA reception itself succeeded.
|
||||
|
||||
Layout of the snapshot_buf for one session reception (chosen for
|
||||
mooncake's batch_transfer_sync_write friendliness — every layer maps to
|
||||
a single contiguous slab):
|
||||
|
||||
[K_layer_0: num_tokens × stride_k_bytes]
|
||||
[K_layer_1: num_tokens × stride_k_bytes]
|
||||
...
|
||||
[K_layer_L-1]
|
||||
[V_layer_0: num_tokens × stride_v_bytes]
|
||||
...
|
||||
[V_layer_L-1]
|
||||
|
||||
The buffer is split into multiple such slabs via ``SnapshotBufAllocator``.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
import os
|
||||
import threading
|
||||
import time
|
||||
from dataclasses import dataclass, field
|
||||
from typing import List, Optional, Tuple
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Env-var names (also exported from package __init__)
|
||||
SNAPSHOT_LINK_ENABLE_ENV = "SGLANG_SNAPSHOT_LINK_ENABLE"
|
||||
SNAPSHOT_LINK_HOST_ENV = "SGLANG_SNAPSHOT_LINK_HOST"
|
||||
SNAPSHOT_LINK_PORT_ENV = "SGLANG_SNAPSHOT_LINK_PORT"
|
||||
SNAPSHOT_LINK_IB_DEVICE_ENV = "SGLANG_SNAPSHOT_LINK_IB_DEVICE"
|
||||
|
||||
# Default snapshot_buf size: 8 GB. Enough for ~1.5 Qwen3-30B 50k-token sessions.
|
||||
SNAPSHOT_BUF_BYTES_ENV = "SGLANG_SNAPSHOT_LINK_BUF_BYTES"
|
||||
DEFAULT_SNAPSHOT_BUF_BYTES = 8 * 1024 * 1024 * 1024
|
||||
|
||||
|
||||
@dataclass
|
||||
class _LayerBufferDesc:
|
||||
"""Per-layer KV buffer descriptor on this worker."""
|
||||
base_ptr: int # data pointer of the layer's full buffer tensor
|
||||
bytes_per_token: int # head_num * head_dim * dtype.itemsize
|
||||
capacity_bytes: int # full buffer size in bytes
|
||||
is_k: bool # True for K-buffer, False for V
|
||||
|
||||
|
||||
@dataclass
|
||||
class SnapshotIngestRecord:
|
||||
"""P-side bookkeeping for one in-flight snapshot reception."""
|
||||
session_id: str
|
||||
slab_offset: int # offset within snapshot_buf
|
||||
slab_size: int # total bytes for this slab
|
||||
num_tokens: int
|
||||
k_layer_offsets: List[int] # absolute byte offsets of K layers in snapshot_buf
|
||||
v_layer_offsets: List[int]
|
||||
per_token_k_bytes: int
|
||||
per_token_v_bytes: int
|
||||
created_at: float = field(default_factory=time.time)
|
||||
|
||||
|
||||
class SnapshotBufAllocator:
|
||||
"""First-fit free-list allocator over a single contiguous byte range.
|
||||
|
||||
Tracks gaps in a sorted list. Merges adjacent free regions on free().
|
||||
"""
|
||||
|
||||
def __init__(self, capacity_bytes: int):
|
||||
self.capacity = capacity_bytes
|
||||
# Free regions sorted by offset: [(offset, size), ...]
|
||||
self._free: List[Tuple[int, int]] = [(0, capacity_bytes)]
|
||||
self._lock = threading.Lock()
|
||||
self._inflight: dict[int, int] = {} # offset → size for sanity check
|
||||
|
||||
def alloc(self, size: int) -> Optional[int]:
|
||||
"""Return offset of allocated region, or None if no fit available."""
|
||||
if size <= 0:
|
||||
return None
|
||||
# Page-align allocations to 4 KB for RDMA-friendly alignment.
|
||||
size = (size + 4095) & ~4095
|
||||
with self._lock:
|
||||
for i, (off, sz) in enumerate(self._free):
|
||||
if sz >= size:
|
||||
if sz == size:
|
||||
self._free.pop(i)
|
||||
else:
|
||||
self._free[i] = (off + size, sz - size)
|
||||
self._inflight[off] = size
|
||||
return off
|
||||
return None
|
||||
|
||||
def free(self, offset: int) -> bool:
|
||||
"""Return True if the offset was successfully freed."""
|
||||
with self._lock:
|
||||
size = self._inflight.pop(offset, None)
|
||||
if size is None:
|
||||
return False
|
||||
# Insert sorted and merge adjacents
|
||||
self._free.append((offset, size))
|
||||
self._free.sort()
|
||||
merged: List[Tuple[int, int]] = []
|
||||
for off, sz in self._free:
|
||||
if merged and merged[-1][0] + merged[-1][1] == off:
|
||||
merged[-1] = (merged[-1][0], merged[-1][1] + sz)
|
||||
else:
|
||||
merged.append((off, sz))
|
||||
self._free = merged
|
||||
return True
|
||||
|
||||
def available_bytes(self) -> int:
|
||||
with self._lock:
|
||||
return sum(sz for _, sz in self._free)
|
||||
|
||||
def in_use_bytes(self) -> int:
|
||||
with self._lock:
|
||||
return sum(self._inflight.values())
|
||||
|
||||
|
||||
def _import_transfer_engine():
|
||||
try:
|
||||
from mooncake.engine import TransferEngine
|
||||
except ImportError as e:
|
||||
raise ImportError(
|
||||
"mooncake.engine.TransferEngine is required for the snapshot "
|
||||
"link. Install mooncake-transfer-engine in the venv."
|
||||
) from e
|
||||
return TransferEngine
|
||||
|
||||
|
||||
class SnapshotLinkController:
|
||||
"""Owns mooncake engine + kv_pool registrations + snapshot_buf + records.
|
||||
|
||||
D-side use: push session KV via ``push_session_to_snapshot_buf``.
|
||||
P-side use: ``prepare_receive`` → caller pushes via RDMA →
|
||||
``ingest_snapshot_into_kvpool`` (does GPU memcpy +
|
||||
radix insert) → ``finalize_record`` (frees the slab).
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
host: str,
|
||||
port: int,
|
||||
ib_device: Optional[str],
|
||||
kv_pool_layer_buffers: List[Tuple[int, int, int, bool]],
|
||||
token_to_kv_pool_allocator,
|
||||
tree_cache=None,
|
||||
protocol: Optional[str] = None,
|
||||
snapshot_buf_bytes: Optional[int] = None,
|
||||
):
|
||||
TransferEngine = _import_transfer_engine()
|
||||
self.host = host
|
||||
self.port = port
|
||||
self.ib_device = ib_device
|
||||
self.token_to_kv_pool_allocator = token_to_kv_pool_allocator
|
||||
self.tree_cache = tree_cache
|
||||
self.layer_buffers: List[_LayerBufferDesc] = [
|
||||
_LayerBufferDesc(
|
||||
base_ptr=base, bytes_per_token=btok,
|
||||
capacity_bytes=cap, is_k=is_k,
|
||||
)
|
||||
for (base, btok, cap, is_k) in kv_pool_layer_buffers
|
||||
]
|
||||
|
||||
self.engine = TransferEngine()
|
||||
proto = protocol or os.environ.get("MOONCAKE_PROTOCOL", "rdma")
|
||||
listen = f"{host}:{port}"
|
||||
ret = self.engine.initialize(listen, "P2PHANDSHAKE", proto, ib_device or "")
|
||||
if ret != 0:
|
||||
raise RuntimeError(
|
||||
f"SnapshotLinkController.initialize({listen}, {proto}, "
|
||||
f"ib={ib_device}) returned {ret}"
|
||||
)
|
||||
self._session_id = f"{host}:{self.engine.get_rpc_port()}"
|
||||
|
||||
# Register existing kv_pool layer buffers (needed for D-side send and
|
||||
# for P-side ingest copy source = snapshot_buf, destination = kv_pool)
|
||||
ptrs = [d.base_ptr for d in self.layer_buffers]
|
||||
lens = [d.capacity_bytes for d in self.layer_buffers]
|
||||
try:
|
||||
reg_ret = self.engine.batch_register_memory(ptrs, lens)
|
||||
except Exception:
|
||||
reg_ret = 0
|
||||
for ptr, length in zip(ptrs, lens):
|
||||
r = self.engine.register_memory(ptr, length)
|
||||
if r != 0:
|
||||
reg_ret = r
|
||||
if reg_ret != 0:
|
||||
logger.warning(
|
||||
"SnapshotLinkController kv_pool batch_register returned %d", reg_ret
|
||||
)
|
||||
|
||||
# Allocate + register the dedicated snapshot reception buffer (P-side)
|
||||
# This decouples reception from kv_pool, avoiding the alloc-failed
|
||||
# death loop that killed E4-v4/v5.
|
||||
import torch
|
||||
|
||||
if snapshot_buf_bytes is None:
|
||||
snapshot_buf_bytes = int(
|
||||
os.environ.get(SNAPSHOT_BUF_BYTES_ENV, DEFAULT_SNAPSHOT_BUF_BYTES)
|
||||
)
|
||||
device = self._allocator_device()
|
||||
try:
|
||||
self.snapshot_buf = torch.zeros(
|
||||
snapshot_buf_bytes, dtype=torch.uint8, device=device,
|
||||
)
|
||||
except RuntimeError as e:
|
||||
logger.warning(
|
||||
"Could not allocate snapshot_buf of %d bytes on %s: %s. "
|
||||
"Falling back to 1 GB.", snapshot_buf_bytes, device, e,
|
||||
)
|
||||
snapshot_buf_bytes = 1024 * 1024 * 1024
|
||||
self.snapshot_buf = torch.zeros(
|
||||
snapshot_buf_bytes, dtype=torch.uint8, device=device,
|
||||
)
|
||||
self._snapshot_buf_bytes = snapshot_buf_bytes
|
||||
self._snapshot_buf_ptr = self.snapshot_buf.data_ptr()
|
||||
ret = self.engine.register_memory(self._snapshot_buf_ptr, snapshot_buf_bytes)
|
||||
if ret != 0:
|
||||
logger.warning(
|
||||
"SnapshotLinkController snapshot_buf register_memory(%s, %d) ret=%d",
|
||||
hex(self._snapshot_buf_ptr), snapshot_buf_bytes, ret,
|
||||
)
|
||||
self.snapshot_buf_alloc = SnapshotBufAllocator(snapshot_buf_bytes)
|
||||
|
||||
# Receive-side bookkeeping
|
||||
self._ingest_records: dict[str, SnapshotIngestRecord] = {}
|
||||
self._records_by_handle: dict[int, SnapshotIngestRecord] = {}
|
||||
self._next_handle = 1
|
||||
self._lock = threading.Lock()
|
||||
|
||||
logger.info(
|
||||
"SnapshotLinkController up at %s (sid=%s, %d kv layer bufs, "
|
||||
"snapshot_buf=%.1f GB on %s)",
|
||||
listen, self._session_id, len(self.layer_buffers),
|
||||
snapshot_buf_bytes / 1e9, device,
|
||||
)
|
||||
|
||||
# ----- accessors ----------------------------------------------------
|
||||
|
||||
@property
|
||||
def snapshot_session_id(self) -> str:
|
||||
return self._session_id
|
||||
|
||||
@property
|
||||
def snapshot_buf_ptr(self) -> int:
|
||||
return self._snapshot_buf_ptr
|
||||
|
||||
@property
|
||||
def snapshot_buf_bytes(self) -> int:
|
||||
return self._snapshot_buf_bytes
|
||||
|
||||
@property
|
||||
def layer_num(self) -> int:
|
||||
return len(self.layer_buffers) // 2
|
||||
|
||||
def get_k_base_ptrs(self) -> List[int]:
|
||||
return [d.base_ptr for d in self.layer_buffers if d.is_k]
|
||||
|
||||
def get_v_base_ptrs(self) -> List[int]:
|
||||
return [d.base_ptr for d in self.layer_buffers if not d.is_k]
|
||||
|
||||
def get_stride_k_bytes(self) -> int:
|
||||
for d in self.layer_buffers:
|
||||
if d.is_k:
|
||||
return d.bytes_per_token
|
||||
return 0
|
||||
|
||||
def get_stride_v_bytes(self) -> int:
|
||||
for d in self.layer_buffers:
|
||||
if not d.is_k:
|
||||
return d.bytes_per_token
|
||||
return 0
|
||||
|
||||
def _allocator_device(self):
|
||||
# Best-effort: pull device from one of the buffer tensors via the allocator
|
||||
try:
|
||||
return self.token_to_kv_pool_allocator.device
|
||||
except AttributeError:
|
||||
return "cuda"
|
||||
|
||||
# ----- P-side: prepare to receive ----------------------------------
|
||||
|
||||
def prepare_receive(self, session_id: str, num_tokens: int) -> Optional[SnapshotIngestRecord]:
|
||||
"""Carve a slab out of snapshot_buf large enough for num_tokens of K+V.
|
||||
|
||||
Returns the record describing the slab layout, or None if snapshot_buf
|
||||
is full. This does NOT touch kv_pool — alloc happens at ingest time.
|
||||
"""
|
||||
if num_tokens <= 0:
|
||||
return None
|
||||
stride_k = self.get_stride_k_bytes()
|
||||
stride_v = self.get_stride_v_bytes()
|
||||
L = self.layer_num
|
||||
slab_bytes = L * num_tokens * stride_k + L * num_tokens * stride_v
|
||||
offset = self.snapshot_buf_alloc.alloc(slab_bytes)
|
||||
if offset is None:
|
||||
logger.info(
|
||||
"prepare_receive: snapshot_buf full (sid=%s n=%d need=%d B available=%d B)",
|
||||
session_id, num_tokens, slab_bytes,
|
||||
self.snapshot_buf_alloc.available_bytes(),
|
||||
)
|
||||
return None
|
||||
# Layout: K0..KL-1, then V0..VL-1
|
||||
k_offs = [offset + i * num_tokens * stride_k for i in range(L)]
|
||||
v_offs = [offset + L * num_tokens * stride_k + i * num_tokens * stride_v
|
||||
for i in range(L)]
|
||||
record = SnapshotIngestRecord(
|
||||
session_id=session_id,
|
||||
slab_offset=offset,
|
||||
slab_size=slab_bytes,
|
||||
num_tokens=num_tokens,
|
||||
k_layer_offsets=k_offs,
|
||||
v_layer_offsets=v_offs,
|
||||
per_token_k_bytes=stride_k,
|
||||
per_token_v_bytes=stride_v,
|
||||
)
|
||||
with self._lock:
|
||||
# Evict prior record for the same session (best-effort)
|
||||
old = self._ingest_records.pop(session_id, None)
|
||||
if old is not None:
|
||||
self.snapshot_buf_alloc.free(old.slab_offset)
|
||||
self._records_by_handle.pop(id(old), None)
|
||||
self._ingest_records[session_id] = record
|
||||
self._records_by_handle[id(record)] = record
|
||||
return record
|
||||
|
||||
def lookup_by_handle(self, handle: int) -> Optional[SnapshotIngestRecord]:
|
||||
with self._lock:
|
||||
return self._records_by_handle.get(handle)
|
||||
|
||||
def discard_record(self, session_id: str) -> None:
|
||||
with self._lock:
|
||||
rec = self._ingest_records.pop(session_id, None)
|
||||
if rec is not None:
|
||||
self.snapshot_buf_alloc.free(rec.slab_offset)
|
||||
with self._lock:
|
||||
self._records_by_handle.pop(id(rec), None)
|
||||
|
||||
def total_pending_snapshot_bytes(self) -> int:
|
||||
with self._lock:
|
||||
return sum(rec.slab_size for rec in self._ingest_records.values())
|
||||
|
||||
# ----- P-side: ingest snapshot into kv_pool + radix tree -----------
|
||||
|
||||
def ingest_snapshot_into_kvpool(
|
||||
self,
|
||||
session_id: str,
|
||||
token_ids: List[int],
|
||||
) -> Tuple[bool, str, int]:
|
||||
"""Copy snapshot_buf bytes into kv_pool slots and insert into radix.
|
||||
|
||||
Returns (ok, reason, inserted_prefix_len).
|
||||
"""
|
||||
with self._lock:
|
||||
record = self._ingest_records.pop(session_id, None)
|
||||
if record is not None:
|
||||
self._records_by_handle.pop(id(record), None)
|
||||
if record is None:
|
||||
return False, "no-pending-ingest", 0
|
||||
|
||||
try:
|
||||
n = min(len(token_ids), record.num_tokens)
|
||||
if n == 0:
|
||||
self.snapshot_buf_alloc.free(record.slab_offset)
|
||||
return False, "empty-token-ids", 0
|
||||
|
||||
# Alloc kv_pool slots NOW that the snapshot bytes are in hand.
|
||||
try:
|
||||
indices_tensor = self.token_to_kv_pool_allocator.alloc(n)
|
||||
except Exception as exc:
|
||||
self.snapshot_buf_alloc.free(record.slab_offset)
|
||||
return False, f"kvpool-alloc-threw:{exc!r}", 0
|
||||
if indices_tensor is None:
|
||||
self.snapshot_buf_alloc.free(record.slab_offset)
|
||||
return False, "kvpool-alloc-failed-at-ingest", 0
|
||||
|
||||
# GPU→GPU copy from snapshot_buf into kv_pool layer buffers
|
||||
try:
|
||||
self._copy_snapshot_to_kvpool(record, indices_tensor)
|
||||
except Exception as exc:
|
||||
logger.exception("snapshot→kvpool copy failed: %s", exc)
|
||||
# Free both allocations
|
||||
self._free_slot_indices(indices_tensor)
|
||||
self.snapshot_buf_alloc.free(record.slab_offset)
|
||||
return False, f"copy-failed:{exc!r}", 0
|
||||
|
||||
# Insert into radix tree
|
||||
try:
|
||||
inserted_prefix_len = self._radix_insert(token_ids[:n], indices_tensor)
|
||||
except Exception as exc:
|
||||
logger.exception("radix insert failed: %s", exc)
|
||||
self._free_slot_indices(indices_tensor)
|
||||
self.snapshot_buf_alloc.free(record.slab_offset)
|
||||
return False, f"radix-insert-failed:{exc!r}", 0
|
||||
|
||||
# Snapshot is now persisted into kv_pool + radix; the slab is no
|
||||
# longer needed.
|
||||
self.snapshot_buf_alloc.free(record.slab_offset)
|
||||
return True, "ok", int(inserted_prefix_len)
|
||||
except Exception as exc:
|
||||
# Belt-and-braces cleanup
|
||||
try:
|
||||
self.snapshot_buf_alloc.free(record.slab_offset)
|
||||
except Exception:
|
||||
pass
|
||||
return False, f"unexpected:{exc!r}", 0
|
||||
|
||||
def _copy_snapshot_to_kvpool(
|
||||
self,
|
||||
record: SnapshotIngestRecord,
|
||||
slot_indices_tensor,
|
||||
) -> None:
|
||||
"""For each layer L: copy snapshot_buf[K_off[L]..] → k_buffer[L][slots]."""
|
||||
import torch
|
||||
|
||||
n = record.num_tokens
|
||||
stride_k = record.per_token_k_bytes
|
||||
stride_v = record.per_token_v_bytes
|
||||
# View snapshot_buf as a 1-D byte tensor; slice by offsets.
|
||||
for L in range(self.layer_num):
|
||||
# K
|
||||
k_slab_start = record.k_layer_offsets[L] - record.slab_offset + record.slab_offset
|
||||
# NOTE: above is equivalent to record.k_layer_offsets[L] but kept for clarity
|
||||
k_slab_start = record.k_layer_offsets[L]
|
||||
k_layer_bytes = self.snapshot_buf[
|
||||
k_slab_start : k_slab_start + n * stride_k
|
||||
].view(n, stride_k)
|
||||
# Compute destination tensor on kv_pool: dst[slot_indices] = src
|
||||
# We need access to the actual k_buffer[L] tensor. The controller
|
||||
# only has the raw ptr — so we materialize a view via from_blob-ish
|
||||
# trick. Easier: get the tensor from token_to_kv_pool_allocator's kvcache.
|
||||
kv_cache = self.token_to_kv_pool_allocator.get_kvcache()
|
||||
k_buf = kv_cache.k_buffer[L] # (max_tokens, head, dim)
|
||||
# Flatten per-token to bytes
|
||||
flat = k_buf.view(k_buf.shape[0], -1)
|
||||
assert flat.shape[1] * flat.element_size() >= stride_k, (
|
||||
f"K layer {L} stride mismatch: pool {flat.shape[1] * flat.element_size()} vs snapshot {stride_k}"
|
||||
)
|
||||
# Copy: dst[slot_indices] ← src[:n]
|
||||
src_reshape = k_layer_bytes.view(n, flat.shape[1] * flat.element_size())
|
||||
# Byte-level view of destination rows
|
||||
dst_view = flat.view(torch.uint8)
|
||||
dst_view[slot_indices_tensor] = src_reshape
|
||||
|
||||
# V
|
||||
v_slab_start = record.v_layer_offsets[L]
|
||||
v_layer_bytes = self.snapshot_buf[
|
||||
v_slab_start : v_slab_start + n * stride_v
|
||||
]
|
||||
v_buf = kv_cache.v_buffer[L]
|
||||
v_flat = v_buf.view(v_buf.shape[0], -1)
|
||||
src_v = v_layer_bytes.view(n, v_flat.shape[1] * v_flat.element_size())
|
||||
v_dst_view = v_flat.view(torch.uint8)
|
||||
v_dst_view[slot_indices_tensor] = src_v
|
||||
|
||||
def _radix_insert(self, token_ids: List[int], indices_tensor) -> int:
|
||||
"""Insert (token_ids, kv_indices) into the underlying radix tree."""
|
||||
from sglang.srt.mem_cache.base_prefix_cache import InsertParams
|
||||
from sglang.srt.mem_cache.radix_cache import RadixKey
|
||||
from sglang.srt.mem_cache.session_aware_cache import SessionAwareCache
|
||||
|
||||
inner = self.tree_cache
|
||||
if isinstance(inner, SessionAwareCache):
|
||||
inner = inner.inner
|
||||
if inner is None:
|
||||
raise RuntimeError("tree_cache not provided to SnapshotLinkController")
|
||||
radix_key = RadixKey(token_ids, None)
|
||||
result = inner.insert(InsertParams(key=radix_key, value=indices_tensor))
|
||||
return int(getattr(result, "prefix_len", 0))
|
||||
|
||||
def _free_slot_indices(self, indices_tensor) -> None:
|
||||
try:
|
||||
self.token_to_kv_pool_allocator.free(indices_tensor)
|
||||
except Exception as e:
|
||||
logger.warning("_free_slot_indices failed: %s", e)
|
||||
|
||||
# ----- D-side: push session KV to a peer's snapshot_buf ------------
|
||||
|
||||
def push_session_to_snapshot_buf(
|
||||
self,
|
||||
*,
|
||||
target_snapshot_session_id: str,
|
||||
src_slot_indices: List[int],
|
||||
target_snapshot_buf_base: int,
|
||||
target_k_layer_offsets: List[int],
|
||||
target_v_layer_offsets: List[int],
|
||||
target_per_token_k_bytes: int,
|
||||
target_per_token_v_bytes: int,
|
||||
) -> Tuple[int, int]:
|
||||
"""Push session KV from local kv_pool into a peer's snapshot_buf slab.
|
||||
|
||||
For each layer: gather src ranges (possibly scattered slot indices)
|
||||
and write to a contiguous range in the peer's snapshot_buf.
|
||||
Returns (mooncake_return_code, bytes_pushed).
|
||||
"""
|
||||
if not src_slot_indices:
|
||||
return 0, 0
|
||||
layer_num = self.layer_num
|
||||
k_src_bases = self.get_k_base_ptrs()
|
||||
v_src_bases = self.get_v_base_ptrs()
|
||||
stride_k = self.get_stride_k_bytes()
|
||||
stride_v = self.get_stride_v_bytes()
|
||||
if (len(target_k_layer_offsets) != layer_num
|
||||
or len(target_v_layer_offsets) != layer_num):
|
||||
raise ValueError(
|
||||
f"target K/V layer offset count {len(target_k_layer_offsets)}/"
|
||||
f"{len(target_v_layer_offsets)} != local layer_num {layer_num}"
|
||||
)
|
||||
if (stride_k != target_per_token_k_bytes
|
||||
or stride_v != target_per_token_v_bytes):
|
||||
raise ValueError(
|
||||
f"stride mismatch: local k={stride_k}/v={stride_v}, "
|
||||
f"target k={target_per_token_k_bytes}/v={target_per_token_v_bytes}"
|
||||
)
|
||||
n = len(src_slot_indices)
|
||||
|
||||
local_addrs: List[int] = []
|
||||
remote_addrs: List[int] = []
|
||||
lengths: List[int] = []
|
||||
|
||||
# Coalesce contiguous src runs.
|
||||
# Inner-loop helper to walk indices and emit run boundaries.
|
||||
def _emit_runs(src_base: int, tgt_base: int, stride: int) -> None:
|
||||
run_src_start = run_tgt_start = run_len = None
|
||||
for tgt_idx, src in enumerate(src_slot_indices):
|
||||
if run_src_start is None:
|
||||
run_src_start, run_tgt_start, run_len = src, tgt_idx, 1
|
||||
elif src == run_src_start + run_len:
|
||||
run_len += 1
|
||||
else:
|
||||
local_addrs.append(src_base + run_src_start * stride)
|
||||
remote_addrs.append(tgt_base + run_tgt_start * stride)
|
||||
lengths.append(run_len * stride)
|
||||
run_src_start, run_tgt_start, run_len = src, tgt_idx, 1
|
||||
if run_src_start is not None:
|
||||
local_addrs.append(src_base + run_src_start * stride)
|
||||
remote_addrs.append(tgt_base + run_tgt_start * stride)
|
||||
lengths.append(run_len * stride)
|
||||
|
||||
for L in range(layer_num):
|
||||
_emit_runs(
|
||||
k_src_bases[L],
|
||||
target_snapshot_buf_base + target_k_layer_offsets[L],
|
||||
stride_k,
|
||||
)
|
||||
_emit_runs(
|
||||
v_src_bases[L],
|
||||
target_snapshot_buf_base + target_v_layer_offsets[L],
|
||||
stride_v,
|
||||
)
|
||||
|
||||
t0 = time.perf_counter()
|
||||
try:
|
||||
ret = self.engine.batch_transfer_sync_write(
|
||||
target_snapshot_session_id, local_addrs, remote_addrs, lengths,
|
||||
)
|
||||
except Exception as e:
|
||||
logger.exception(
|
||||
"SnapshotLinkController.push_session_to_snapshot_buf threw: %s", e
|
||||
)
|
||||
return -1, 0
|
||||
t1 = time.perf_counter()
|
||||
bytes_pushed = sum(lengths)
|
||||
logger.info(
|
||||
"push_session_to_snapshot_buf → %s: %d ops, %d B, ret=%d, %.2f ms",
|
||||
target_snapshot_session_id, len(lengths), bytes_pushed, ret,
|
||||
(t1 - t0) * 1000.0,
|
||||
)
|
||||
return ret, bytes_pushed
|
||||
@@ -125,6 +125,9 @@ from sglang.srt.managers.io_struct import (
|
||||
LoadLoRAAdapterFromTensorsReqInput,
|
||||
LoadLoRAAdapterReqInput,
|
||||
DirectAppendAdmissionReqInput,
|
||||
SnapshotDumpReqInput,
|
||||
SnapshotFinalizeIngestReqInput,
|
||||
SnapshotPrepareReceiveReqInput,
|
||||
OpenSessionReqInput,
|
||||
ParseFunctionCallReq,
|
||||
PauseGenerationReqInput,
|
||||
@@ -1295,6 +1298,21 @@ async def admit_direct_append(obj: DirectAppendAdmissionReqInput):
|
||||
return await _global_state.tokenizer_manager.admit_direct_append(obj)
|
||||
|
||||
|
||||
@app.post("/_snapshot/prepare_receive")
|
||||
async def snapshot_prepare_receive(obj: SnapshotPrepareReceiveReqInput):
|
||||
return await _global_state.tokenizer_manager.snapshot_prepare_receive(obj)
|
||||
|
||||
|
||||
@app.post("/_snapshot/dump")
|
||||
async def snapshot_dump(obj: SnapshotDumpReqInput):
|
||||
return await _global_state.tokenizer_manager.snapshot_dump(obj)
|
||||
|
||||
|
||||
@app.post("/_snapshot/finalize_ingest")
|
||||
async def snapshot_finalize_ingest(obj: SnapshotFinalizeIngestReqInput):
|
||||
return await _global_state.tokenizer_manager.snapshot_finalize_ingest(obj)
|
||||
|
||||
|
||||
@app.api_route("/configure_logging", methods=["GET", "POST"])
|
||||
@auth_level(AuthLevel.ADMIN_OPTIONAL)
|
||||
async def configure_logging(obj: ConfigureLoggingReq, request: Request):
|
||||
|
||||
@@ -1632,6 +1632,96 @@ class HealthCheckOutput(BaseReq):
|
||||
pass
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# D→P snapshot ingest (Phase 2 of D→P sync feature; see
|
||||
# docs/D_TO_P_SYNC_DESIGN_ZH.md).
|
||||
#
|
||||
# Three-step protocol orchestrated by agentic-pd-hybrid:
|
||||
# 1. PrepareReceive → P allocates kv_pool slots + returns destination
|
||||
# addresses for D's RDMA writes.
|
||||
# 2. (out-of-band) → D uses snapshot_link to RDMA-push KV bytes
|
||||
# directly to P's slot addresses.
|
||||
# 3. FinalizeIngest → P inserts (token_ids, kv_indices) into its radix
|
||||
# tree so subsequent prefill requests for this
|
||||
# session see a cache hit.
|
||||
#
|
||||
# Each step is its own ReqInput/ReqOutput pair so the scheduler handlers can
|
||||
# be written stateless and the orchestrator can retry / abort cleanly.
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
@dataclass
|
||||
class SnapshotPrepareReceiveReqInput(BaseReq):
|
||||
"""P-side: allocate slots + register them with mooncake for D to push into."""
|
||||
|
||||
session_id: str
|
||||
num_tokens: int # P will alloc this many contiguous slots
|
||||
expected_bytes_per_layer_k: int = 0 # per-token K bytes × num_tokens (sanity)
|
||||
expected_bytes_per_layer_v: int = 0 # per-token V bytes × num_tokens (sanity)
|
||||
|
||||
|
||||
@dataclass
|
||||
class SnapshotPrepareReceiveReqOutput(BaseReq):
|
||||
"""P-side response. New schema points D at P's dedicated snapshot_buf."""
|
||||
|
||||
ok: bool
|
||||
reason: Optional[str] = None
|
||||
# P's mooncake snapshot session id (host:rpc_port) for D's batch write target
|
||||
snapshot_session_id: str = ""
|
||||
# snapshot_buf base pointer + per-layer offsets, replacing the old
|
||||
# kv_pool slot_indices scheme that competed with P's prefill work and
|
||||
# always hit alloc-failed. See docs/SNAPSHOT_STORE_REFACTOR_ZH.md.
|
||||
snapshot_buf_base_ptr: int = 0
|
||||
snapshot_buf_capacity_bytes: int = 0
|
||||
k_layer_offsets: List[int] = field(default_factory=list) # bytes within snapshot_buf
|
||||
v_layer_offsets: List[int] = field(default_factory=list)
|
||||
num_tokens: int = 0
|
||||
stride_k_bytes: int = 0
|
||||
stride_v_bytes: int = 0
|
||||
layer_num: int = 0
|
||||
available_tokens: int = 0
|
||||
|
||||
|
||||
@dataclass
|
||||
class SnapshotDumpReqInput(BaseReq):
|
||||
"""D-side: dump session KV via snapshot_link into P's snapshot_buf slab."""
|
||||
|
||||
session_id: str
|
||||
target_snapshot_session_id: str
|
||||
target_snapshot_buf_base: int = 0
|
||||
target_k_layer_offsets: List[int] = field(default_factory=list)
|
||||
target_v_layer_offsets: List[int] = field(default_factory=list)
|
||||
target_stride_k_bytes: int = 0
|
||||
target_stride_v_bytes: int = 0
|
||||
ib_device: Optional[str] = None
|
||||
|
||||
|
||||
@dataclass
|
||||
class SnapshotDumpReqOutput(BaseReq):
|
||||
ok: bool
|
||||
reason: Optional[str] = None
|
||||
bytes_pushed: int = 0
|
||||
transfer_duration_ms: float = 0.0
|
||||
kv_committed_len: int = 0 # the actual number of tokens D had for this session
|
||||
# The token_ids that go with the KV (so P can call radix_cache.insert)
|
||||
token_ids: List[int] = field(default_factory=list)
|
||||
|
||||
|
||||
@dataclass
|
||||
class SnapshotFinalizeIngestReqInput(BaseReq):
|
||||
"""P-side: copy snapshot_buf slab into kv_pool + insert into radix tree."""
|
||||
|
||||
session_id: str
|
||||
token_ids: List[int]
|
||||
|
||||
|
||||
@dataclass
|
||||
class SnapshotFinalizeIngestReqOutput(BaseReq):
|
||||
ok: bool
|
||||
reason: Optional[str] = None
|
||||
inserted_prefix_len: int = 0
|
||||
|
||||
|
||||
class ExpertDistributionReqType(Enum):
|
||||
START_RECORD = 1
|
||||
STOP_RECORD = 2
|
||||
|
||||
@@ -1564,6 +1564,74 @@ class ScheduleBatch(ScheduleBatchDisaggregationDecodeMixin):
|
||||
# For DLLM, we use a separate forward mode
|
||||
self.forward_mode = ForwardMode.DLLM_EXTEND
|
||||
|
||||
# Pre-filter pass: drop streaming-session reqs whose committed prefix
|
||||
# already covers fill_ids. The streaming-session correction below would
|
||||
# set extend_input_len = max(0, fill_len - prefix_len) = 0 for these
|
||||
# reqs, but the downstream invariant at the per-req loop
|
||||
# (`assert seq_len - pre_len == req.extend_input_len`) is computed from
|
||||
# raw fill_ids/prefix_indices lengths and has no path to be satisfied
|
||||
# when fill_len < prefix_len. Treat the condition as upstream state
|
||||
# inconsistency, abort the affected reqs (so the client sees an error
|
||||
# response instead of the worker crashing), and continue with the
|
||||
# remaining batch. See docs/E3_FINDINGS_ZH.md for the failure mode
|
||||
# this guards against.
|
||||
if self.reqs:
|
||||
kept_reqs = []
|
||||
for req in self.reqs:
|
||||
if (
|
||||
req.session is not None
|
||||
and req.session.streaming
|
||||
and len(req.fill_ids) < len(req.prefix_indices)
|
||||
):
|
||||
logger.error(
|
||||
"Dropping streaming-session req with fill_ids shorter than "
|
||||
"prefix_indices (rid=%s, session_id=%s, fill_len=%d, "
|
||||
"prefix_len=%d, kv_committed_len=%d). Upstream state "
|
||||
"inconsistency would crash prepare_for_extend's invariant; "
|
||||
"aborting this req. See docs/E3_FINDINGS_ZH.md.",
|
||||
req.rid,
|
||||
req.session.session_id,
|
||||
len(req.fill_ids),
|
||||
len(req.prefix_indices),
|
||||
req.kv_committed_len,
|
||||
)
|
||||
req.finished_reason = FINISH_ABORT(
|
||||
message=(
|
||||
"streaming-session inconsistency: fill_ids "
|
||||
f"({len(req.fill_ids)}) < prefix_indices "
|
||||
f"({len(req.prefix_indices)})"
|
||||
),
|
||||
)
|
||||
else:
|
||||
kept_reqs.append(req)
|
||||
if len(kept_reqs) != len(self.reqs):
|
||||
self.reqs = kept_reqs
|
||||
|
||||
if not self.reqs:
|
||||
# Whole batch filtered. Set empty tensor / list state so
|
||||
# downstream callers (model_runner.forward, batch_result handlers)
|
||||
# see a valid no-op batch and skip the model pass cleanly.
|
||||
_pin = is_pin_memory_available(self.device)
|
||||
empty_long = torch.zeros(0, dtype=torch.int64, pin_memory=_pin).to(
|
||||
self.device, non_blocking=True
|
||||
)
|
||||
empty_int = torch.zeros(0, dtype=torch.int32, pin_memory=_pin).to(
|
||||
self.device, non_blocking=True
|
||||
)
|
||||
self.input_ids = empty_long
|
||||
self.req_pool_indices = empty_int
|
||||
self.seq_lens = empty_long
|
||||
self.seq_lens_cpu = torch.zeros(0, dtype=torch.int64)
|
||||
self.orig_seq_lens = empty_int
|
||||
self.prefix_lens = []
|
||||
self.extend_lens = []
|
||||
self.extend_num_tokens = 0
|
||||
self.out_cache_loc = empty_int
|
||||
self.input_embeds = None
|
||||
self.multimodal_inputs = []
|
||||
self.token_type_ids = None
|
||||
return
|
||||
|
||||
# Init tensors
|
||||
reqs = self.reqs
|
||||
for req in reqs:
|
||||
|
||||
@@ -96,6 +96,12 @@ from sglang.srt.managers.io_struct import (
|
||||
ContinueGenerationReqInput,
|
||||
DirectAppendAdmissionReqInput,
|
||||
DirectAppendAdmissionReqOutput,
|
||||
SnapshotDumpReqInput,
|
||||
SnapshotDumpReqOutput,
|
||||
SnapshotFinalizeIngestReqInput,
|
||||
SnapshotFinalizeIngestReqOutput,
|
||||
SnapshotPrepareReceiveReqInput,
|
||||
SnapshotPrepareReceiveReqOutput,
|
||||
DestroyWeightsUpdateGroupReqInput,
|
||||
DetachHiCacheStorageReqInput,
|
||||
DetachHiCacheStorageReqOutput,
|
||||
@@ -844,6 +850,70 @@ class Scheduler(
|
||||
embedding_cache_size = envs.SGLANG_VLM_CACHE_SIZE_MB.get()
|
||||
init_mm_embedding_cache(embedding_cache_size * 1024 * 1024)
|
||||
|
||||
# ---- D→P snapshot link (Phase 2 of D→P sync feature) ------------
|
||||
# Enabled per-worker via SGLANG_SNAPSHOT_LINK_ENABLE=1. Each worker
|
||||
# binds an independent mooncake transfer engine on
|
||||
# SGLANG_SNAPSHOT_LINK_HOST:SGLANG_SNAPSHOT_LINK_PORT and pre-
|
||||
# registers the kv_pool layer buffers for one-shot RDMA pushes /
|
||||
# receives. See docs/D_TO_P_SYNC_DESIGN_ZH.md.
|
||||
self.snapshot_link_controller = None
|
||||
from sglang.srt.disaggregation.snapshot import (
|
||||
SnapshotLinkController as _SnapLinkCtrl,
|
||||
SNAPSHOT_LINK_ENABLE_ENV,
|
||||
SNAPSHOT_LINK_HOST_ENV,
|
||||
SNAPSHOT_LINK_PORT_ENV,
|
||||
SNAPSHOT_LINK_IB_DEVICE_ENV,
|
||||
)
|
||||
if os.environ.get(SNAPSHOT_LINK_ENABLE_ENV, "0") == "1":
|
||||
host = os.environ.get(SNAPSHOT_LINK_HOST_ENV, server_args.host)
|
||||
port = int(os.environ.get(SNAPSHOT_LINK_PORT_ENV,
|
||||
str(server_args.disaggregation_bootstrap_port + 1000)))
|
||||
ib = os.environ.get(SNAPSHOT_LINK_IB_DEVICE_ENV, server_args.disaggregation_ib_device)
|
||||
try:
|
||||
kv_pool = self.token_to_kv_pool_allocator.get_kvcache()
|
||||
except AttributeError:
|
||||
# Some allocators expose the pool directly
|
||||
kv_pool = getattr(self.token_to_kv_pool_allocator, "kvcache", None)
|
||||
if kv_pool is None:
|
||||
logger.warning("SNAPSHOT_LINK_ENABLE=1 but kv_pool unavailable; skipping init")
|
||||
else:
|
||||
try:
|
||||
kv_data_ptrs, kv_data_lens, kv_item_lens = kv_pool.get_contiguous_buf_infos()
|
||||
layer_n = len(kv_data_ptrs) // 2
|
||||
layer_buffers = []
|
||||
# K layers first, then V layers (matches MHATokenToKVPool.get_contiguous_buf_infos)
|
||||
for i in range(layer_n):
|
||||
layer_buffers.append((
|
||||
kv_data_ptrs[i],
|
||||
kv_item_lens[i] // max(1, kv_pool.page_size),
|
||||
kv_data_lens[i],
|
||||
True, # is_k
|
||||
))
|
||||
for i in range(layer_n):
|
||||
layer_buffers.append((
|
||||
kv_data_ptrs[layer_n + i],
|
||||
kv_item_lens[layer_n + i] // max(1, kv_pool.page_size),
|
||||
kv_data_lens[layer_n + i],
|
||||
False, # is_k=False (V)
|
||||
))
|
||||
self.snapshot_link_controller = _SnapLinkCtrl(
|
||||
host=host,
|
||||
port=port,
|
||||
ib_device=ib,
|
||||
kv_pool_layer_buffers=layer_buffers,
|
||||
token_to_kv_pool_allocator=self.token_to_kv_pool_allocator,
|
||||
tree_cache=self.tree_cache,
|
||||
)
|
||||
logger.info(
|
||||
"Snapshot link controller initialized: %s, sid=%s, %d layer bufs",
|
||||
f"{host}:{port}",
|
||||
self.snapshot_link_controller.snapshot_session_id,
|
||||
len(layer_buffers),
|
||||
)
|
||||
except Exception as e:
|
||||
logger.warning("Snapshot link init failed: %s; continuing without it", e)
|
||||
self.snapshot_link_controller = None
|
||||
|
||||
def init_running_status(self):
|
||||
self.waiting_queue: List[Req] = []
|
||||
self.decode_direct_waiting_queue: List[Req] = []
|
||||
@@ -1219,6 +1289,9 @@ class Scheduler(
|
||||
(OpenSessionReqInput, self.open_session),
|
||||
(CloseSessionReqInput, self.close_session),
|
||||
(DirectAppendAdmissionReqInput, self.admit_direct_append),
|
||||
(SnapshotPrepareReceiveReqInput, self.snapshot_prepare_receive),
|
||||
(SnapshotDumpReqInput, self.snapshot_dump),
|
||||
(SnapshotFinalizeIngestReqInput, self.snapshot_finalize_ingest),
|
||||
(UpdateWeightFromDiskReqInput, self.update_weights_from_disk),
|
||||
(InitWeightsUpdateGroupReqInput, self.init_weights_update_group),
|
||||
(DestroyWeightsUpdateGroupReqInput, self.destroy_weights_update_group),
|
||||
@@ -3673,6 +3746,119 @@ class Scheduler(
|
||||
),
|
||||
)
|
||||
|
||||
# ----- D→P snapshot link handlers (Phase 2/3) ---------------------
|
||||
|
||||
def snapshot_prepare_receive(
|
||||
self, recv_req: SnapshotPrepareReceiveReqInput
|
||||
) -> SnapshotPrepareReceiveReqOutput:
|
||||
"""P-side: carve snapshot_buf slab + return its layout to caller.
|
||||
|
||||
Refactored per docs/SNAPSHOT_STORE_REFACTOR_ZH.md: this no longer
|
||||
touches the kv_pool allocator. The slab is in a dedicated
|
||||
snapshot_buf so prepare can never lose to P's prefill work.
|
||||
"""
|
||||
ctrl = self.snapshot_link_controller
|
||||
if ctrl is None:
|
||||
return SnapshotPrepareReceiveReqOutput(
|
||||
ok=False, reason="snapshot-link-disabled",
|
||||
)
|
||||
try:
|
||||
available = int(self.token_to_kv_pool_allocator.available_size())
|
||||
except Exception:
|
||||
available = -1
|
||||
if recv_req.num_tokens <= 0:
|
||||
return SnapshotPrepareReceiveReqOutput(ok=False, reason="zero-tokens")
|
||||
record = ctrl.prepare_receive(recv_req.session_id, recv_req.num_tokens)
|
||||
if record is None:
|
||||
return SnapshotPrepareReceiveReqOutput(
|
||||
ok=False, reason="snapshot-buf-full",
|
||||
available_tokens=available,
|
||||
)
|
||||
return SnapshotPrepareReceiveReqOutput(
|
||||
ok=True,
|
||||
snapshot_session_id=ctrl.snapshot_session_id,
|
||||
snapshot_buf_base_ptr=ctrl.snapshot_buf_ptr,
|
||||
snapshot_buf_capacity_bytes=ctrl.snapshot_buf_bytes,
|
||||
k_layer_offsets=record.k_layer_offsets,
|
||||
v_layer_offsets=record.v_layer_offsets,
|
||||
num_tokens=record.num_tokens,
|
||||
stride_k_bytes=record.per_token_k_bytes,
|
||||
stride_v_bytes=record.per_token_v_bytes,
|
||||
layer_num=ctrl.layer_num,
|
||||
available_tokens=available,
|
||||
)
|
||||
|
||||
def snapshot_dump(
|
||||
self, recv_req: SnapshotDumpReqInput
|
||||
) -> SnapshotDumpReqOutput:
|
||||
"""D-side: gather session KV from kv_pool, RDMA-write into P's snapshot_buf."""
|
||||
ctrl = self.snapshot_link_controller
|
||||
if ctrl is None:
|
||||
return SnapshotDumpReqOutput(ok=False, reason="snapshot-link-disabled")
|
||||
if not isinstance(self.tree_cache, SessionAwareCache):
|
||||
return SnapshotDumpReqOutput(ok=False, reason="tree-cache-not-session-aware")
|
||||
slot = self.tree_cache.slots.get(recv_req.session_id)
|
||||
if slot is None or slot.req_pool_idx is None:
|
||||
return SnapshotDumpReqOutput(ok=False, reason="session-not-resident")
|
||||
kv_committed_len = int(slot.kv_committed_len)
|
||||
if kv_committed_len == 0:
|
||||
return SnapshotDumpReqOutput(ok=False, reason="zero-committed-len")
|
||||
try:
|
||||
kv_idx_tensor = self.req_to_token_pool.req_to_token[
|
||||
slot.req_pool_idx, :kv_committed_len
|
||||
]
|
||||
src_slot_indices = [int(x) for x in kv_idx_tensor.tolist()]
|
||||
except Exception as e:
|
||||
return SnapshotDumpReqOutput(ok=False, reason=f"read-indices-failed:{e!r}")
|
||||
|
||||
try:
|
||||
ret, bytes_pushed = ctrl.push_session_to_snapshot_buf(
|
||||
target_snapshot_session_id=recv_req.target_snapshot_session_id,
|
||||
src_slot_indices=src_slot_indices,
|
||||
target_snapshot_buf_base=recv_req.target_snapshot_buf_base,
|
||||
target_k_layer_offsets=recv_req.target_k_layer_offsets,
|
||||
target_v_layer_offsets=recv_req.target_v_layer_offsets,
|
||||
target_per_token_k_bytes=recv_req.target_stride_k_bytes,
|
||||
target_per_token_v_bytes=recv_req.target_stride_v_bytes,
|
||||
)
|
||||
except Exception as e:
|
||||
return SnapshotDumpReqOutput(ok=False, reason=f"push-failed:{e!r}")
|
||||
|
||||
if ret != 0:
|
||||
return SnapshotDumpReqOutput(
|
||||
ok=False, reason=f"mooncake-batch-write-ret={ret}",
|
||||
bytes_pushed=int(bytes_pushed),
|
||||
kv_committed_len=kv_committed_len,
|
||||
)
|
||||
return SnapshotDumpReqOutput(
|
||||
ok=True, bytes_pushed=int(bytes_pushed),
|
||||
kv_committed_len=kv_committed_len,
|
||||
token_ids=[],
|
||||
)
|
||||
|
||||
def snapshot_finalize_ingest(
|
||||
self, recv_req: SnapshotFinalizeIngestReqInput
|
||||
) -> SnapshotFinalizeIngestReqOutput:
|
||||
"""P-side: copy snapshot_buf slab into kv_pool + insert into radix tree.
|
||||
|
||||
Refactored per docs/SNAPSHOT_STORE_REFACTOR_ZH.md: kv_pool alloc
|
||||
happens HERE (deferred from prepare_receive), so we never block
|
||||
D's RDMA write on kv_pool contention.
|
||||
"""
|
||||
ctrl = self.snapshot_link_controller
|
||||
if ctrl is None:
|
||||
return SnapshotFinalizeIngestReqOutput(
|
||||
ok=False, reason="snapshot-link-disabled",
|
||||
)
|
||||
ok, reason, inserted_prefix_len = ctrl.ingest_snapshot_into_kvpool(
|
||||
session_id=recv_req.session_id,
|
||||
token_ids=list(recv_req.token_ids),
|
||||
)
|
||||
return SnapshotFinalizeIngestReqOutput(
|
||||
ok=bool(ok), reason=reason if not ok else None,
|
||||
inserted_prefix_len=int(inserted_prefix_len),
|
||||
)
|
||||
|
||||
def _compute_backpressure_pause_hint(
|
||||
self,
|
||||
*,
|
||||
|
||||
@@ -181,13 +181,19 @@ class SchedulerRuntimeCheckerMixin:
|
||||
return memory_leak, token_msg
|
||||
|
||||
def _check_radix_cache_memory(self: Scheduler):
|
||||
# NB: as of SnapshotStore refactor (see docs/SNAPSHOT_STORE_REFACTOR_ZH.md)
|
||||
# prepare_receive no longer touches kv_pool — slots are alloc'd from
|
||||
# a dedicated snapshot_buf. So no snapshot_reserved accounting needed.
|
||||
_, _, available_size, evictable_size = self._get_token_info()
|
||||
protected_size = self.tree_cache.protected_size()
|
||||
session_held = self._session_held_tokens()
|
||||
memory_leak = (available_size + evictable_size) != (
|
||||
self.max_total_num_tokens - protected_size - session_held
|
||||
)
|
||||
token_msg = f"{self.max_total_num_tokens=}, {available_size=}, {evictable_size=}, {protected_size=}, {session_held=}\n"
|
||||
token_msg = (
|
||||
f"{self.max_total_num_tokens=}, {available_size=}, {evictable_size=}, "
|
||||
f"{protected_size=}, {session_held=}\n"
|
||||
)
|
||||
return memory_leak, token_msg
|
||||
|
||||
def _get_batch_uncached_size(self: Scheduler, batch: ScheduleBatch) -> int:
|
||||
|
||||
@@ -74,6 +74,12 @@ from sglang.srt.managers.io_struct import (
|
||||
SetInternalStateReqOutput,
|
||||
SlowDownReqInput,
|
||||
SlowDownReqOutput,
|
||||
SnapshotDumpReqInput,
|
||||
SnapshotDumpReqOutput,
|
||||
SnapshotFinalizeIngestReqInput,
|
||||
SnapshotFinalizeIngestReqOutput,
|
||||
SnapshotPrepareReceiveReqInput,
|
||||
SnapshotPrepareReceiveReqOutput,
|
||||
UnloadLoRAAdapterReqInput,
|
||||
UnloadLoRAAdapterReqOutput,
|
||||
UpdateWeightsFromDistributedReqInput,
|
||||
@@ -225,6 +231,15 @@ class TokenizerCommunicatorMixin:
|
||||
self.direct_append_admission_communicator = _Communicator(
|
||||
self.send_to_scheduler, server_args.dp_size
|
||||
)
|
||||
self.snapshot_prepare_receive_communicator = _Communicator(
|
||||
self.send_to_scheduler, server_args.dp_size
|
||||
)
|
||||
self.snapshot_dump_communicator = _Communicator(
|
||||
self.send_to_scheduler, server_args.dp_size
|
||||
)
|
||||
self.snapshot_finalize_ingest_communicator = _Communicator(
|
||||
self.send_to_scheduler, server_args.dp_size
|
||||
)
|
||||
self.set_internal_state_communicator = _Communicator(
|
||||
self.send_to_scheduler, server_args.dp_size
|
||||
)
|
||||
@@ -325,6 +340,18 @@ class TokenizerCommunicatorMixin:
|
||||
DirectAppendAdmissionReqOutput,
|
||||
self.direct_append_admission_communicator.handle_recv,
|
||||
),
|
||||
(
|
||||
SnapshotPrepareReceiveReqOutput,
|
||||
self.snapshot_prepare_receive_communicator.handle_recv,
|
||||
),
|
||||
(
|
||||
SnapshotDumpReqOutput,
|
||||
self.snapshot_dump_communicator.handle_recv,
|
||||
),
|
||||
(
|
||||
SnapshotFinalizeIngestReqOutput,
|
||||
self.snapshot_finalize_ingest_communicator.handle_recv,
|
||||
),
|
||||
(
|
||||
SetInternalStateReqOutput,
|
||||
self.set_internal_state_communicator.handle_recv,
|
||||
@@ -890,6 +917,36 @@ class TokenizerCommunicatorMixin:
|
||||
)
|
||||
return responses[0]
|
||||
|
||||
async def snapshot_prepare_receive(
|
||||
self: TokenizerManager,
|
||||
obj: SnapshotPrepareReceiveReqInput,
|
||||
) -> SnapshotPrepareReceiveReqOutput:
|
||||
self.auto_create_handle_loop()
|
||||
responses: List[SnapshotPrepareReceiveReqOutput] = (
|
||||
await self.snapshot_prepare_receive_communicator(obj)
|
||||
)
|
||||
return responses[0]
|
||||
|
||||
async def snapshot_dump(
|
||||
self: TokenizerManager,
|
||||
obj: SnapshotDumpReqInput,
|
||||
) -> SnapshotDumpReqOutput:
|
||||
self.auto_create_handle_loop()
|
||||
responses: List[SnapshotDumpReqOutput] = (
|
||||
await self.snapshot_dump_communicator(obj)
|
||||
)
|
||||
return responses[0]
|
||||
|
||||
async def snapshot_finalize_ingest(
|
||||
self: TokenizerManager,
|
||||
obj: SnapshotFinalizeIngestReqInput,
|
||||
) -> SnapshotFinalizeIngestReqOutput:
|
||||
self.auto_create_handle_loop()
|
||||
responses: List[SnapshotFinalizeIngestReqOutput] = (
|
||||
await self.snapshot_finalize_ingest_communicator(obj)
|
||||
)
|
||||
return responses[0]
|
||||
|
||||
async def set_internal_state(
|
||||
self: TokenizerManager, obj: SetInternalStateReq
|
||||
) -> List[bool]:
|
||||
|
||||
32
third_party/traces/README.md
vendored
Normal file
32
third_party/traces/README.md
vendored
Normal file
@@ -0,0 +1,32 @@
|
||||
# Replay traces
|
||||
|
||||
为了方便跨主机传输,把 benchmark 用到的 trace 文件放在这里。该目录在
|
||||
`.gitignore` 中显式 whitelist(同 `third_party/sglang/`),文件随 git 一起走。
|
||||
|
||||
## 文件清单
|
||||
|
||||
| 文件 | 大小 | 内容 | 来源 |
|
||||
|---|---:|---|---|
|
||||
| `qwen35-swebench-50sess.jsonl` | 54 MB | 4449 reqs / 52 sessions / Qwen3.5-35B 推理产物 | `simm-swe-bench` 项目用 SiBench replay SiCo `swe.jsonl` 经 SGLang 跑出 audit.jsonl,再用 `scripts/convert_audit_to_trace.py` 转 |
|
||||
|
||||
详细来源见 `docs/ONBOARDING_NEXT_AGENT_ZH.md` 和实际 schema 见 `src/agentic_pd_hybrid/trace.py`。
|
||||
|
||||
## 使用方法
|
||||
|
||||
Replay 端的 trace 路径由 CLI flag `--trace` 指定。默认 sweep 脚本里指向
|
||||
`outputs/qwen35-swebench-50sess.jsonl`——为了向后兼容老脚本,**建议在 clone 后
|
||||
软链接一份过去**:
|
||||
|
||||
```bash
|
||||
mkdir -p outputs
|
||||
ln -sf ../third_party/traces/qwen35-swebench-50sess.jsonl \
|
||||
outputs/qwen35-swebench-50sess.jsonl
|
||||
```
|
||||
|
||||
或者直接改 sweep 脚本里 `--trace` 路径指向 `third_party/traces/...`。
|
||||
|
||||
## 添加新 trace
|
||||
|
||||
如果未来加新 trace 文件(如 `codex_swebenchpro` 转换后的版本),直接放本目录,
|
||||
更新本 README 的清单即可。**别把超过 100 MB 的单文件直接 git add**——GitLab
|
||||
默认对未启用 LFS 的单文件有 100 MB 限制。
|
||||
4449
third_party/traces/qwen35-swebench-50sess.jsonl
vendored
Normal file
4449
third_party/traces/qwen35-swebench-50sess.jsonl
vendored
Normal file
File diff suppressed because one or more lines are too long
615
uv.lock
generated
615
uv.lock
generated
@@ -2,15 +2,33 @@ version = 1
|
||||
revision = 3
|
||||
requires-python = ">=3.12"
|
||||
resolution-markers = [
|
||||
"python_full_version >= '3.14' and sys_platform == 'win32'",
|
||||
"python_full_version >= '3.14' and sys_platform == 'emscripten'",
|
||||
"python_full_version >= '3.14' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version == '3.13.*' and sys_platform == 'win32'",
|
||||
"python_full_version == '3.13.*' and sys_platform == 'emscripten'",
|
||||
"python_full_version == '3.13.*' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version < '3.13' and sys_platform == 'win32'",
|
||||
"python_full_version < '3.13' and sys_platform == 'emscripten'",
|
||||
"python_full_version < '3.13' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform == 'win32'",
|
||||
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform == 'win32'",
|
||||
"python_full_version >= '3.14' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'win32'",
|
||||
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform == 'emscripten'",
|
||||
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform == 'emscripten'",
|
||||
"python_full_version >= '3.14' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'emscripten'",
|
||||
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version >= '3.14' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform == 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform == 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform == 'emscripten'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform == 'emscripten'",
|
||||
"python_full_version == '3.13.*' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'emscripten'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform == 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform == 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform == 'emscripten'",
|
||||
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform == 'emscripten'",
|
||||
"python_full_version < '3.13' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'emscripten'",
|
||||
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
]
|
||||
|
||||
[options]
|
||||
@@ -30,7 +48,7 @@ dependencies = [
|
||||
requires-dist = [
|
||||
{ name = "httpx", specifier = ">=0.28.1" },
|
||||
{ name = "mooncake-transfer-engine" },
|
||||
{ name = "sglang", specifier = "==0.5.10" },
|
||||
{ name = "sglang", editable = "third_party/sglang/python" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
@@ -457,7 +475,8 @@ source = { registry = "https://pypi.org/simple" }
|
||||
dependencies = [
|
||||
{ name = "loguru" },
|
||||
{ name = "pydantic" },
|
||||
{ name = "torch" },
|
||||
{ name = "torch", version = "2.9.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine != 'aarch64'" },
|
||||
{ name = "torch", version = "2.9.1+cu129", source = { registry = "https://download.pytorch.org/whl/cu129" }, marker = "platform_machine == 'aarch64'" },
|
||||
{ name = "transformers" },
|
||||
]
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/98/c0/8fb99aa86bc538d3a025749633d1d0105d849b35eb240ba7ba30e22de49b/compressed_tensors-0.15.1a20260409.tar.gz", hash = "sha256:a9a477691c2887bc8d2c46aef82aa60c85fe1f014cacb2218b423904aff04f4d", size = 238217, upload-time = "2026-04-09T21:21:52.922Z" }
|
||||
@@ -565,8 +584,8 @@ name = "decord2"
|
||||
version = "3.3.0"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
dependencies = [
|
||||
{ name = "numpy", version = "2.3.5", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.13' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
|
||||
{ name = "numpy", version = "2.4.4", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.13' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
|
||||
{ name = "numpy", version = "2.3.5", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.13' and platform_machine != 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
|
||||
{ name = "numpy", version = "2.4.4", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.13' and platform_machine != 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
|
||||
]
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/51/c3/fbc81c2cc18b2b7ca8a3a26ca2e8dfa243a2c7f5c4431f4b3839a8f12f0a/decord2-3.3.0-cp312-cp312-manylinux_2_28_aarch64.whl", hash = "sha256:3a67fb644041a031bc3f21b2e1adcf92b9742d980bd90f3bc45396c2a0ddcbfa", size = 25036754, upload-time = "2026-04-06T18:09:46.005Z" },
|
||||
@@ -664,7 +683,8 @@ dependencies = [
|
||||
{ name = "einops" },
|
||||
{ name = "nvidia-cutlass-dsl" },
|
||||
{ name = "quack-kernels" },
|
||||
{ name = "torch" },
|
||||
{ name = "torch", version = "2.9.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine != 'aarch64'" },
|
||||
{ name = "torch", version = "2.9.1+cu129", source = { registry = "https://download.pytorch.org/whl/cu129" }, marker = "platform_machine == 'aarch64'" },
|
||||
{ name = "torch-c-dlpack-ext" },
|
||||
{ name = "typing-extensions" },
|
||||
]
|
||||
@@ -699,7 +719,8 @@ dependencies = [
|
||||
{ name = "packaging" },
|
||||
{ name = "requests" },
|
||||
{ name = "tabulate" },
|
||||
{ name = "torch" },
|
||||
{ name = "torch", version = "2.9.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine != 'aarch64'" },
|
||||
{ name = "torch", version = "2.9.1+cu129", source = { registry = "https://download.pytorch.org/whl/cu129" }, marker = "platform_machine == 'aarch64'" },
|
||||
{ name = "tqdm" },
|
||||
]
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/cc/95/81eafb78574312db79ef7144a4e77f2fee015343f413ef3000f279c8a118/flashinfer_python-0.6.7.post2.tar.gz", hash = "sha256:924cb1788d0335225293eea384da40f40daa6b4e32b6a5ebc214ab679b4e2125", size = 6509418, upload-time = "2026-04-04T07:10:25.516Z" }
|
||||
@@ -904,34 +925,34 @@ wheels = [
|
||||
|
||||
[[package]]
|
||||
name = "hf-xet"
|
||||
version = "1.5.0.dev1"
|
||||
version = "1.5.0"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/c9/b5/73db543ba19129c23b2ca52d837373eb4243f0332130093f31b3ecc6739f/hf_xet-1.5.0.dev1.tar.gz", hash = "sha256:a21c9c85869ee122747543dd93471826cc0e9b5f61b11411aabd4adf72e345b1", size = 823729, upload-time = "2026-04-17T08:22:19.349Z" }
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/74/d8/5c06fc76461418326a7decf8367480c35be11a41fd938633929c60a9ec6b/hf_xet-1.5.0.tar.gz", hash = "sha256:e0fb0a34d9f406eed88233e829a67ec016bec5af19e480eac65a233ea289a948", size = 837196, upload-time = "2026-05-06T06:18:15.583Z" }
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/79/c1/15fb7a67b1fad51b0d3e3a4e0a33ac2fca8197da842a922bf2f707521915/hf_xet-1.5.0.dev1-cp313-cp313t-macosx_10_12_x86_64.whl", hash = "sha256:41abc1601e9449c57880c203332221bc571a9c85154c1789a740259781ba9596", size = 6903797, upload-time = "2026-04-17T08:21:38.028Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/c5/a6/66924109da0089c803a0b42eeccd37f321906b0224bad6c220e46a9f6ad2/hf_xet-1.5.0.dev1-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:045c43a49776d1dc9836ee0782e85fecbd2e85a6f55ebc39a4a14eb9c83fc004", size = 6570723, upload-time = "2026-04-17T08:21:35.605Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/ad/19/c9d51b5512eae52dd3b6eac5f02552cfe78156410e71e1e3d1295f778a0c/hf_xet-1.5.0.dev1-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:908325bf4e53209dfe56d99a5cfed63907e677a32b1ba1f000cd72a8290871e4", size = 63298006, upload-time = "2026-04-17T08:21:12.867Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/66/a7/1781b5a465fb4cce525a96c8bf7719583d115eaf2ea4d4ef560a394801a2/hf_xet-1.5.0.dev1-cp313-cp313t-manylinux_2_28_aarch64.whl", hash = "sha256:d51c3c20460012540dca4094615b74e1b757a7d702910149c7b8175eda91567a", size = 58640118, upload-time = "2026-04-17T08:21:07.745Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/38/ef/2c02f7602b94b0f0454f66f9f52e7f37edaf81c3ccfa57073c17ee7e57d8/hf_xet-1.5.0.dev1-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:36d45543060cfda059a910cfa702fe2221cba88a49401d9359ae442ccb6fe8e7", size = 59133723, upload-time = "2026-04-17T08:21:51.701Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/7d/76/732941c4ce0c0f5991ec1962a1848325a4ee11da2942c2f85100b68cba28/hf_xet-1.5.0.dev1-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:3363073f1abc0a55027ba5e666bbdd0147681e856ed3ddda083428f8d81786cf", size = 60269392, upload-time = "2026-04-17T08:21:56.95Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/c3/22/65e1146977ddb940136ccd932675425a2fa1a13aef2a35fa54b969e07d77/hf_xet-1.5.0.dev1-cp313-cp313t-win_amd64.whl", hash = "sha256:aa93dcb1271a3cd2846ab07f9e37f27280604dd5c50ea299050553a4fe6fd60d", size = 3993380, upload-time = "2026-04-17T08:22:23.592Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/eb/8c/71bc286a6d52a53682c669abeea1d4dd3f320812d9c1816f8d71ad4e99ba/hf_xet-1.5.0.dev1-cp313-cp313t-win_arm64.whl", hash = "sha256:7928c15eef205aaa1786e63294331f184152e8e7d9f0f352047bf1b590f540cd", size = 3851055, upload-time = "2026-04-17T08:22:21.556Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/3c/79/42bace8f9651276eb96463b2ad275f6b53fe2b22ba3c5ea7f1819b580785/hf_xet-1.5.0.dev1-cp314-cp314t-macosx_10_12_x86_64.whl", hash = "sha256:11a00f8ec39f69c3cd32fb8980b86c91945aaf0588667079994edda9fa2e3cb2", size = 6897594, upload-time = "2026-04-17T08:21:47.543Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/c1/b0/7d950c8f68280c1907b146e848e244eec054300769b6645455cf92075094/hf_xet-1.5.0.dev1-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:d333be26f91cbfa573d24005c5502ce48eb19ec416982ebd5cf8212cdb549942", size = 6569370, upload-time = "2026-04-17T08:21:45.24Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/be/20/60828b7429397f5fe417e312b3b222f97a3293e129977c7d6c1fe07b14cc/hf_xet-1.5.0.dev1-cp314-cp314t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:44ca5ad2a82c60f1b749a65e361c006fa8c9feaab703e4c9e72b5ff830dca1f6", size = 63253090, upload-time = "2026-04-17T08:21:32.004Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/71/54/3fc89b6e47e9e43b86613e32c1cccb8cdeaaa5b19a99decc41d6b57f0d65/hf_xet-1.5.0.dev1-cp314-cp314t-manylinux_2_28_aarch64.whl", hash = "sha256:df5ba34b731c0be6eb5290cd46adb7b245583bdbf271f87caed60f3a3f65e859", size = 58659612, upload-time = "2026-04-17T08:21:27.084Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/18/76/2165625d83309a38dd2b91ce3b7ccb0384151f7f205b033575849b996546/hf_xet-1.5.0.dev1-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:c4661dd045f6d59f838119423948d9cec06ac498ac09a869f7df4abbe70f01aa", size = 59152315, upload-time = "2026-04-17T08:22:11.349Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/ef/b1/e0effd9fb1acbd142c6e9345db171254f953a701b16799b815535cae771c/hf_xet-1.5.0.dev1-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:2b07f87bb1d21cde3889d684f194e0c6047091c94b54c3e52d1b80e738d016ed", size = 60228716, upload-time = "2026-04-17T08:22:16.177Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/aa/9e/73921723685e27f6b54a016374894d69fb06eb0452fe7b7ada12b54b32fd/hf_xet-1.5.0.dev1-cp314-cp314t-win_amd64.whl", hash = "sha256:bb81277c04fcd49a4c3e93bc5bcf1d33a9604b32085f3f7e95f52edb9c2deca6", size = 3994035, upload-time = "2026-04-17T08:22:31.471Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/4c/7f/a2f422bb7d3050760d0aae59f4999dbfcb84708b822432f2d5bc3dd76234/hf_xet-1.5.0.dev1-cp314-cp314t-win_arm64.whl", hash = "sha256:724fa6f5f644295de503e6cdb1b1c96a7ad2512db6a641daa32b0f33888e88f7", size = 3851354, upload-time = "2026-04-17T08:22:29.647Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/85/fa/6c404999f13892e8ef2b75ec07af0b118fa1241a7bd278f6b93d61063746/hf_xet-1.5.0.dev1-cp37-abi3-macosx_10_12_x86_64.whl", hash = "sha256:5a180160a120357cabc0cd60167864f110bb8f0b1c38b71e0a93cde13839475e", size = 6907817, upload-time = "2026-04-17T08:21:42.228Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/ad/d1/6c828e215079a436d6e916d30248093b7b3ea911e4e6d40b954d21089fc8/hf_xet-1.5.0.dev1-cp37-abi3-macosx_11_0_arm64.whl", hash = "sha256:8701d2e1268c78a1c3cd0e4480b74c0a505cfa864269308efae9d73d0e2203f9", size = 6577425, upload-time = "2026-04-17T08:21:40.097Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/e3/c9/2b93ba287824948450ddf64e2596220b58633d019dda278c12abadbf7bb5/hf_xet-1.5.0.dev1-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:e5480448001f9e59046ac4c463f2e25fb652066605dd183a82d2b5625b939487", size = 63137387, upload-time = "2026-04-17T08:21:21.775Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/dc/b5/c74899d4da67155db8b4f9d8b21110a919d969a15b75aceaec9502c8e7c3/hf_xet-1.5.0.dev1-cp37-abi3-manylinux_2_28_aarch64.whl", hash = "sha256:14e9773ade3fb48dcfa9f493c8ed065704dd3031d29a5a289fed58b8223f2409", size = 58503933, upload-time = "2026-04-17T08:21:17.434Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/27/42/d9d511d425696a8b54cf67af0d3de0f8564f81f81e046b107a967f35f00e/hf_xet-1.5.0.dev1-cp37-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:21accf171949d78b18099bf57a4e8490db1ad88c0a4e907f8930c78ffe21f47d", size = 59035994, upload-time = "2026-04-17T08:22:01.526Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/8c/b6/49afbe73752f8d176231e49bc02b8b3fe96284ba82d856481c598b5343f4/hf_xet-1.5.0.dev1-cp37-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:07d8ec5c300a7ce3a39fa8598024992f6d2fcfa167b71cc0cde07abdcd05ca01", size = 60139405, upload-time = "2026-04-17T08:22:06.759Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/98/ab/e243e97ba2d5e55c848cdb5622466300990d2d0380c4456132d209ce1252/hf_xet-1.5.0.dev1-cp37-abi3-win_amd64.whl", hash = "sha256:ad32cfd5aa66bdf922b7f8eb9a94eb9f64a8f68a31ffede803060b44bd4060f8", size = 4004017, upload-time = "2026-04-17T08:22:27.78Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/f7/08/645da274ebe22d06a1ad103667deae75eb658e2b8e493f3a04a8ab140e2d/hf_xet-1.5.0.dev1-cp37-abi3-win_arm64.whl", hash = "sha256:2093091921534e51e13cbeb956550cded7b97aa7ba1d774123c21d9b06f06231", size = 3859306, upload-time = "2026-04-17T08:22:25.602Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/68/9b/6912c99070915a4f28119e3c5b52a9abd1eec0ad5cb293b8c967a0c6f5a2/hf_xet-1.5.0-cp313-cp313t-macosx_10_12_x86_64.whl", hash = "sha256:7d70fe2ce97b9db73b9c9b9c81fe3693640aec83416a966c446afea54acfae3c", size = 4023383, upload-time = "2026-05-06T06:17:53.947Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/0f/6d/9563cfde59b5d8128a9c7ec972a087f4c782e4f7bac5a85234edfd5d5e49/hf_xet-1.5.0-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:73a0dae8c71de3b0633a45c73f4a4a5ed09e94b43441d82981a781d4f12baa42", size = 3792751, upload-time = "2026-05-06T06:17:51.791Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/07/a5/ed5a0cf35b49a0571af5a8f53416dad1877a718c021c9937c3a53cb45781/hf_xet-1.5.0-cp313-cp313t-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:a60290ec57e9b71767fba7c3645ddafdd0759974b540441510c629c6db6db24a", size = 4456058, upload-time = "2026-05-06T06:17:40.735Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/60/fb/3ae8bf2a7a37a4197d0195d7247fd25b3952e15cb8a599e285dfaa6f52b3/hf_xet-1.5.0-cp313-cp313t-manylinux_2_28_aarch64.whl", hash = "sha256:e5de0f6deada0dada870bb376a11bcd1f08abf3a968a6d118f33e72d1b1eb480", size = 4250783, upload-time = "2026-05-06T06:17:38.412Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/a2/9b/8bae40d4d91525085137196e84eb0ed49cf65b5e96e5c3ecdadd8bd0fac2/hf_xet-1.5.0-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:c799d49f1a5544a0ef7591c0ee75e0d6b93d6f56dc7a4979f59f7518d2872216", size = 4445594, upload-time = "2026-05-06T06:18:04.219Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/13/59/c74efbbd4e8728172b2cc72a2bc014d2947a4b7bdced932fbd3f5da1a4e5/hf_xet-1.5.0-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:2baea1b0b989e5c152fe81425f7745ddc8901280ba3d97c98d8cdece7b706c60", size = 4663995, upload-time = "2026-05-06T06:18:06.1Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/73/32/8e1e0410af64cda9b139d1dcebdc993a8ff9c8c7c0e2696ae356d75ccc0d/hf_xet-1.5.0-cp313-cp313t-win_amd64.whl", hash = "sha256:526345b3ed45f374f6317349df489167606736c876241ba984105afe7fd4839d", size = 3966608, upload-time = "2026-05-06T06:18:19.74Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/fc/34/a8febc8f4edbea8b3e21b02ebc8b628679b84ba7e45cde624a7736b51500/hf_xet-1.5.0-cp313-cp313t-win_arm64.whl", hash = "sha256:786d28e2eb8315d5035544b9d137b4a842d600c434bb91bf7d0d953cce906ad4", size = 3796946, upload-time = "2026-05-06T06:18:17.568Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/2a/20/8fc8996afe5815fa1a6be8e9e5c02f24500f409d599e905800d498a4e14d/hf_xet-1.5.0-cp314-cp314t-macosx_10_12_x86_64.whl", hash = "sha256:872d5601e6deea30d15865ede55d29eac6daf5a534ab417b99b6ef6b076dd96c", size = 4023495, upload-time = "2026-05-06T06:18:01.94Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/32/6a/93d84463c00cecb561a7508aa6303e35ee2894294eac14245526924415fe/hf_xet-1.5.0-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:9929561f5abf4581c8ea79587881dfef6b8abb2a0d8a51915936fc2a614f4e73", size = 3792731, upload-time = "2026-05-06T06:18:00.021Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/9d/5a/8ec8e0c863b382d00b3c2e2af6ded6b06371be617144a625903a6d562f4b/hf_xet-1.5.0-cp314-cp314t-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:f7b7bbae318e583a86fb21e5a4a175d6721d628a2874f4bd022d0e660c32a682", size = 4456738, upload-time = "2026-05-06T06:17:49.574Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/c5/ca/f7effa1a67717da2bcc6b6c28f71c6ca648c77acaec4e2c32f40cbe16d85/hf_xet-1.5.0-cp314-cp314t-manylinux_2_28_aarch64.whl", hash = "sha256:cf7b2dc6f31a4ea754bb50f74cde482dcf5d366d184076d8530b9872787f3761", size = 4251622, upload-time = "2026-05-06T06:17:47.096Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/65/f2/19247dba3e231cf77dec59ddfb878f00057635ff773d099c9b59d37812c3/hf_xet-1.5.0-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:8dbcbab554c9ef158ef2c991545c3e970ddd8cc7acdcd0a78c5a41095dab4ded", size = 4445667, upload-time = "2026-05-06T06:18:11.983Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/7f/64/6f116801a3bcfb6f59f5c251f48cadc47ea54026441c4a385079286a94fa/hf_xet-1.5.0-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:5906bf7718d3636dc13402914736abe723492cb730f744834f5f5b67d3a12702", size = 4664619, upload-time = "2026-05-06T06:18:13.771Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/5c/e8/069542d37946ed08669b127e1496fa99e78196d71de8d41eda5e9f1b7a58/hf_xet-1.5.0-cp314-cp314t-win_amd64.whl", hash = "sha256:5f3dc2248fc01cc0a00cd392ab497f1ca373fcbc7e3f2da1f452480b384e839e", size = 3966802, upload-time = "2026-05-06T06:18:28.162Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/f9/91/fc6fdec27b14d04e88c386ac0a0129732b53fa23f7c4a78f4b83a039c567/hf_xet-1.5.0-cp314-cp314t-win_arm64.whl", hash = "sha256:b285cea1b5bab46b758772716ba8d6854a1a0310fed1c249d678a8b38601e5a0", size = 3797168, upload-time = "2026-05-06T06:18:26.287Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/3d/fb/69ff198a82cae7eb1a69fb84d93b3a3e4816564d76817fe541ddc96874eb/hf_xet-1.5.0-cp37-abi3-macosx_10_12_x86_64.whl", hash = "sha256:dad0dc84e941b8ba3c860659fe1fdc35c049d47cce293f003287757e971a8f56", size = 4030814, upload-time = "2026-05-06T06:17:57.933Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/9b/ff/edcc2b40162bef3ff78e14ab637e5f3b89243d6aee72f5949d3bb6a5af83/hf_xet-1.5.0-cp37-abi3-macosx_11_0_arm64.whl", hash = "sha256:fd6e5a9b0fdac4ed03ed45ef79254a655b1aaab514a02202617fbf643f5fdf7a", size = 3798444, upload-time = "2026-05-06T06:17:55.79Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/49/4d/103f76b04310e5e57656696cc184690d20c466af0bca3ca88f8c8ea5d4f3/hf_xet-1.5.0-cp37-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:3531b1823a0e6d77d80f9ed15ca0e00f0d115094f8ac033d5cae88f4564cc949", size = 4465986, upload-time = "2026-05-06T06:17:44.886Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/c4/a2/546f47f464737b3edbab6f8ddb57f2599b93d2cbb66f06abb475ccb48651/hf_xet-1.5.0-cp37-abi3-manylinux_2_28_aarch64.whl", hash = "sha256:9a0ee58cd18d5ea799f7ed11290bbccbe56bdd8b1d97ca74b9cc49a3945d7a3b", size = 4259865, upload-time = "2026-05-06T06:17:42.639Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/95/7f/1be593c1f28613be2e196473481cd81bfc5910795e30a34e8f744f6cac4f/hf_xet-1.5.0-cp37-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:1e60df5a42e9bed8628b6416af2cba4cba57ae9f02de226a06b020d98e1aab18", size = 4459835, upload-time = "2026-05-06T06:18:08.026Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/aa/b2/703569fc881f3284487e68cda7b42179978480da3c438042a6bbbb4a671c/hf_xet-1.5.0-cp37-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:4b35549ce62601b84da4ff9b24d970032ace3d4430f52d91bcbb26c901d6c690", size = 4672414, upload-time = "2026-05-06T06:18:09.864Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/af/37/1b6def445c567286b50aa3b33828158e135b1be44938dde59f11382a500c/hf_xet-1.5.0-cp37-abi3-win_amd64.whl", hash = "sha256:2806c7c17b4d23f8d88f7c4814f838c3b6150773fe339c20af23e1cfaf2797e4", size = 3977238, upload-time = "2026-05-06T06:18:23.621Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/62/94/3b66b148778ee100dcfd69c2ca22b57b41b44d3063ceec934f209e9184ce/hf_xet-1.5.0-cp37-abi3-win_arm64.whl", hash = "sha256:b6c9df403040248c76d808d3e047d64db2d923bae593eb244c41e425cf6cd7be", size = 3806916, upload-time = "2026-05-06T06:18:21.7Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
@@ -1635,9 +1656,15 @@ name = "numpy"
|
||||
version = "2.3.5"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
resolution-markers = [
|
||||
"python_full_version < '3.13' and sys_platform == 'win32'",
|
||||
"python_full_version < '3.13' and sys_platform == 'emscripten'",
|
||||
"python_full_version < '3.13' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform == 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform == 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform == 'emscripten'",
|
||||
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform == 'emscripten'",
|
||||
"python_full_version < '3.13' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'emscripten'",
|
||||
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
]
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/76/65/21b3bc86aac7b8f2862db1e808f1ea22b028e30a225a34a5ede9bf8678f2/numpy-2.3.5.tar.gz", hash = "sha256:784db1dcdab56bf0517743e746dfb0f885fc68d948aba86eeec2cba234bdf1c0", size = 20584950, upload-time = "2025-11-16T22:52:42.067Z" }
|
||||
wheels = [
|
||||
@@ -1703,12 +1730,24 @@ name = "numpy"
|
||||
version = "2.4.4"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
resolution-markers = [
|
||||
"python_full_version >= '3.14' and sys_platform == 'win32'",
|
||||
"python_full_version >= '3.14' and sys_platform == 'emscripten'",
|
||||
"python_full_version >= '3.14' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version == '3.13.*' and sys_platform == 'win32'",
|
||||
"python_full_version == '3.13.*' and sys_platform == 'emscripten'",
|
||||
"python_full_version == '3.13.*' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform == 'win32'",
|
||||
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform == 'win32'",
|
||||
"python_full_version >= '3.14' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'win32'",
|
||||
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform == 'emscripten'",
|
||||
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform == 'emscripten'",
|
||||
"python_full_version >= '3.14' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'emscripten'",
|
||||
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version >= '3.14' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform == 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform == 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform == 'emscripten'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform == 'emscripten'",
|
||||
"python_full_version == '3.13.*' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'emscripten'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
]
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/d7/9f/b8cef5bffa569759033adda9481211426f12f53299629b410340795c2514/numpy-2.4.4.tar.gz", hash = "sha256:2d390634c5182175533585cc89f3608a4682ccb173cc9bb940b2881c8d6f8fa0", size = 20731587, upload-time = "2026-03-29T13:22:01.298Z" }
|
||||
wheels = [
|
||||
@@ -1771,42 +1810,116 @@ wheels = [
|
||||
name = "nvidia-cublas-cu12"
|
||||
version = "12.8.4.1"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
resolution-markers = [
|
||||
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
]
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/dc/61/e24b560ab2e2eaeb3c839129175fb330dfcfc29e5203196e5541a4c44682/nvidia_cublas_cu12-12.8.4.1-py3-none-manylinux_2_27_x86_64.whl", hash = "sha256:8ac4e771d5a348c551b2a426eda6193c19aa630236b418086020df5ba9667142", size = 594346921, upload-time = "2025-03-07T01:44:31.254Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "nvidia-cublas-cu12"
|
||||
version = "12.9.1.4"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
resolution-markers = [
|
||||
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
]
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/82/6c/90d3f532f608a03a13c1d6c16c266ffa3828e8011b1549d3b61db2ad59f5/nvidia_cublas_cu12-12.9.1.4-py3-none-manylinux_2_27_aarch64.whl", hash = "sha256:7a950dae01add3b415a5a5cdc4ec818fb5858263e9cca59004bb99fdbbd3a5d6", size = 575006342, upload-time = "2025-06-05T20:04:16.902Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "nvidia-cuda-cupti-cu12"
|
||||
version = "12.8.90"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
resolution-markers = [
|
||||
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
]
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/f8/02/2adcaa145158bf1a8295d83591d22e4103dbfd821bcaf6f3f53151ca4ffa/nvidia_cuda_cupti_cu12-12.8.90-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:ea0cb07ebda26bb9b29ba82cda34849e73c166c18162d3913575b0c9db9a6182", size = 10248621, upload-time = "2025-03-07T01:40:21.213Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "nvidia-cuda-cupti-cu12"
|
||||
version = "12.9.79"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
resolution-markers = [
|
||||
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
]
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/b4/78/351b5c8cdbd9a6b4fb0d6ee73fb176dcdc1b6b6ad47c2ffff5ae8ca4a1f7/nvidia_cuda_cupti_cu12-12.9.79-py3-none-manylinux_2_25_aarch64.whl", hash = "sha256:791853b030602c6a11d08b5578edfb957cadea06e9d3b26adbf8d036135a4afe", size = 10077166, upload-time = "2025-06-05T20:01:01.385Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "nvidia-cuda-nvrtc-cu12"
|
||||
version = "12.8.93"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
resolution-markers = [
|
||||
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
]
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/05/6b/32f747947df2da6994e999492ab306a903659555dddc0fbdeb9d71f75e52/nvidia_cuda_nvrtc_cu12-12.8.93-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl", hash = "sha256:a7756528852ef889772a84c6cd89d41dfa74667e24cca16bb31f8f061e3e9994", size = 88040029, upload-time = "2025-03-07T01:42:13.562Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "nvidia-cuda-nvrtc-cu12"
|
||||
version = "12.9.86"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
resolution-markers = [
|
||||
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
]
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/64/eb/c2295044b8f3b3b08860e2f6a912b702fc92568a167259df5dddb78f325e/nvidia_cuda_nvrtc_cu12-12.9.86-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:096d4de6bda726415dfaf3198d4f5c522b8e70139c97feef5cd2ca6d4cd9cead", size = 44528905, upload-time = "2025-06-05T20:02:29.754Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "nvidia-cuda-runtime-cu12"
|
||||
version = "12.8.90"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
resolution-markers = [
|
||||
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
]
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/0d/9b/a997b638fcd068ad6e4d53b8551a7d30fe8b404d6f1804abf1df69838932/nvidia_cuda_runtime_cu12-12.8.90-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:adade8dcbd0edf427b7204d480d6066d33902cab2a4707dcfc48a2d0fd44ab90", size = 954765, upload-time = "2025-03-07T01:40:01.615Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "nvidia-cuda-runtime-cu12"
|
||||
version = "12.9.79"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
resolution-markers = [
|
||||
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
]
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/bc/e0/0279bd94539fda525e0c8538db29b72a5a8495b0c12173113471d28bce78/nvidia_cuda_runtime_cu12-12.9.79-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:83469a846206f2a733db0c42e223589ab62fd2fabac4432d2f8802de4bded0a4", size = 3515012, upload-time = "2025-06-05T20:00:35.519Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "nvidia-cudnn-cu12"
|
||||
version = "9.10.2.21"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
dependencies = [
|
||||
{ name = "nvidia-cublas-cu12", marker = "sys_platform != 'emscripten' and sys_platform != 'win32'" },
|
||||
{ name = "nvidia-cublas-cu12", version = "12.8.4.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
|
||||
{ name = "nvidia-cublas-cu12", version = "12.9.1.4", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
|
||||
]
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/fa/41/e79269ce215c857c935fd86bcfe91a451a584dfc27f1e068f568b9ad1ab7/nvidia_cudnn_cu12-9.10.2.21-py3-none-manylinux_2_27_aarch64.whl", hash = "sha256:c9132cc3f8958447b4910a1720036d9eff5928cc3179b0a51fb6d167c6cc87d8", size = 705026878, upload-time = "2025-06-06T21:52:51.348Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/ba/51/e123d997aa098c61d029f76663dedbfb9bc8dcf8c60cbd6adbe42f76d049/nvidia_cudnn_cu12-9.10.2.21-py3-none-manylinux_2_27_x86_64.whl", hash = "sha256:949452be657fa16687d0930933f032835951ef0892b37d2d53824d1a84dc97a8", size = 706758467, upload-time = "2025-06-06T21:54:08.597Z" },
|
||||
]
|
||||
|
||||
@@ -1830,58 +1943,160 @@ wheels = [
|
||||
name = "nvidia-cufft-cu12"
|
||||
version = "11.3.3.83"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
resolution-markers = [
|
||||
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
]
|
||||
dependencies = [
|
||||
{ name = "nvidia-nvjitlink-cu12", marker = "sys_platform != 'emscripten' and sys_platform != 'win32'" },
|
||||
{ name = "nvidia-nvjitlink-cu12", version = "12.8.93", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
|
||||
]
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/1f/13/ee4e00f30e676b66ae65b4f08cb5bcbb8392c03f54f2d5413ea99a5d1c80/nvidia_cufft_cu12-11.3.3.83-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:4d2dd21ec0b88cf61b62e6b43564355e5222e4a3fb394cac0db101f2dd0d4f74", size = 193118695, upload-time = "2025-03-07T01:45:27.821Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "nvidia-cufft-cu12"
|
||||
version = "11.4.1.4"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
resolution-markers = [
|
||||
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
]
|
||||
dependencies = [
|
||||
{ name = "nvidia-nvjitlink-cu12", version = "12.9.86", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
|
||||
]
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/9b/2b/76445b0af890da61b501fde30650a1a4bd910607261b209cccb5235d3daa/nvidia_cufft_cu12-11.4.1.4-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:1a28c9b12260a1aa7a8fd12f5ebd82d027963d635ba82ff39a1acfa7c4c0fbcf", size = 200822453, upload-time = "2025-06-05T20:05:27.889Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "nvidia-cufile-cu12"
|
||||
version = "1.13.1.3"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
resolution-markers = [
|
||||
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
]
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/bb/fe/1bcba1dfbfb8d01be8d93f07bfc502c93fa23afa6fd5ab3fc7c1df71038a/nvidia_cufile_cu12-1.13.1.3-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:1d069003be650e131b21c932ec3d8969c1715379251f8d23a1860554b1cb24fc", size = 1197834, upload-time = "2025-03-07T01:45:50.723Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "nvidia-cufile-cu12"
|
||||
version = "1.14.1.1"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
resolution-markers = [
|
||||
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
]
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/b9/d2/110af3a1f77999d5eebf6ffae5d2305ab839e53c76eec3696640cc25b35d/nvidia_cufile_cu12-1.14.1.1-py3-none-manylinux_2_27_aarch64.whl", hash = "sha256:8dea77590761e02cb6dd955a57cb6414c58aa3cb1b7adbf9919869a11509cf65", size = 1135994, upload-time = "2025-06-05T20:06:03.952Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "nvidia-curand-cu12"
|
||||
version = "10.3.9.90"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
resolution-markers = [
|
||||
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
]
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/fb/aa/6584b56dc84ebe9cf93226a5cde4d99080c8e90ab40f0c27bda7a0f29aa1/nvidia_curand_cu12-10.3.9.90-py3-none-manylinux_2_27_x86_64.whl", hash = "sha256:b32331d4f4df5d6eefa0554c565b626c7216f87a06a4f56fab27c3b68a830ec9", size = 63619976, upload-time = "2025-03-07T01:46:23.323Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "nvidia-curand-cu12"
|
||||
version = "10.3.10.19"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
resolution-markers = [
|
||||
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
]
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/14/1c/2a45afc614d99558d4a773fa740d8bb5471c8398eeed925fc0fcba020173/nvidia_curand_cu12-10.3.10.19-py3-none-manylinux_2_27_aarch64.whl", hash = "sha256:de663377feb1697e1d30ed587b07d5721fdd6d2015c738d7528a6002a6134d37", size = 68292066, upload-time = "2025-05-01T19:39:13.595Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "nvidia-cusolver-cu12"
|
||||
version = "11.7.3.90"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
resolution-markers = [
|
||||
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
]
|
||||
dependencies = [
|
||||
{ name = "nvidia-cublas-cu12", marker = "sys_platform != 'emscripten' and sys_platform != 'win32'" },
|
||||
{ name = "nvidia-cusparse-cu12", marker = "sys_platform != 'emscripten' and sys_platform != 'win32'" },
|
||||
{ name = "nvidia-nvjitlink-cu12", marker = "sys_platform != 'emscripten' and sys_platform != 'win32'" },
|
||||
{ name = "nvidia-cublas-cu12", version = "12.8.4.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
|
||||
{ name = "nvidia-cusparse-cu12", version = "12.5.8.93", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
|
||||
{ name = "nvidia-nvjitlink-cu12", version = "12.8.93", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
|
||||
]
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/85/48/9a13d2975803e8cf2777d5ed57b87a0b6ca2cc795f9a4f59796a910bfb80/nvidia_cusolver_cu12-11.7.3.90-py3-none-manylinux_2_27_x86_64.whl", hash = "sha256:4376c11ad263152bd50ea295c05370360776f8c3427b30991df774f9fb26c450", size = 267506905, upload-time = "2025-03-07T01:47:16.273Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "nvidia-cusolver-cu12"
|
||||
version = "11.7.5.82"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
resolution-markers = [
|
||||
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
]
|
||||
dependencies = [
|
||||
{ name = "nvidia-cublas-cu12", version = "12.9.1.4", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
|
||||
{ name = "nvidia-cusparse-cu12", version = "12.5.10.65", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
|
||||
{ name = "nvidia-nvjitlink-cu12", version = "12.9.86", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
|
||||
]
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/03/99/686ff9bf3a82a531c62b1a5c614476e8dfa24a9d89067aeedf3592ee4538/nvidia_cusolver_cu12-11.7.5.82-py3-none-manylinux_2_27_aarch64.whl", hash = "sha256:62efa83e4ace59a4c734d052bb72158e888aa7b770e1a5f601682f16fe5b4fd2", size = 337869834, upload-time = "2025-06-05T20:06:53.125Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "nvidia-cusparse-cu12"
|
||||
version = "12.5.8.93"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
resolution-markers = [
|
||||
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
]
|
||||
dependencies = [
|
||||
{ name = "nvidia-nvjitlink-cu12", marker = "sys_platform != 'emscripten' and sys_platform != 'win32'" },
|
||||
{ name = "nvidia-nvjitlink-cu12", version = "12.8.93", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
|
||||
]
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/c2/f5/e1854cb2f2bcd4280c44736c93550cc300ff4b8c95ebe370d0aa7d2b473d/nvidia_cusparse_cu12-12.5.8.93-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:1ec05d76bbbd8b61b06a80e1eaf8cf4959c3d4ce8e711b65ebd0443bb0ebb13b", size = 288216466, upload-time = "2025-03-07T01:48:13.779Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "nvidia-cusparse-cu12"
|
||||
version = "12.5.10.65"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
resolution-markers = [
|
||||
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
]
|
||||
dependencies = [
|
||||
{ name = "nvidia-nvjitlink-cu12", version = "12.9.86", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
|
||||
]
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/5e/6f/8710fbd17cdd1d0fc3fea7d36d5b65ce1933611c31e1861da330206b253a/nvidia_cusparse_cu12-12.5.10.65-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:221c73e7482dd93eda44e65ce567c031c07e2f93f6fa0ecd3ba876a195023e83", size = 366359408, upload-time = "2025-06-05T20:07:42.501Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "nvidia-cusparselt-cu12"
|
||||
version = "0.7.1"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/73/b9/598f6ff36faaece4b3c50d26f50e38661499ff34346f00e057760b35cc9d/nvidia_cusparselt_cu12-0.7.1-py3-none-manylinux2014_aarch64.whl", hash = "sha256:8878dce784d0fac90131b6817b607e803c36e629ba34dc5b433471382196b6a5", size = 283835557, upload-time = "2025-02-26T00:16:54.265Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/56/79/12978b96bd44274fe38b5dde5cfb660b1d114f70a65ef962bcbbed99b549/nvidia_cusparselt_cu12-0.7.1-py3-none-manylinux2014_x86_64.whl", hash = "sha256:f1bb701d6b930d5a7cea44c19ceb973311500847f81b634d802b7b539dc55623", size = 287193691, upload-time = "2025-02-26T00:15:44.104Z" },
|
||||
]
|
||||
|
||||
@@ -1929,6 +2144,7 @@ name = "nvidia-nccl-cu12"
|
||||
version = "2.27.5"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/bb/1c/857979db0ef194ca5e21478a0612bcdbbe59458d7694361882279947b349/nvidia_nccl_cu12-2.27.5-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:31432ad4d1fb1004eb0c56203dc9bc2178a1ba69d1d9e02d64a6938ab5e40e7a", size = 322400625, upload-time = "2025-06-26T04:11:04.496Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/6e/89/f7a07dc961b60645dbbf42e80f2bc85ade7feb9a491b11a1e973aa00071f/nvidia_nccl_cu12-2.27.5-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:ad730cf15cb5d25fe849c6e6ca9eb5b76db16a80f13f425ac68d8e2e55624457", size = 322348229, upload-time = "2025-06-26T04:11:28.385Z" },
|
||||
]
|
||||
|
||||
@@ -1936,15 +2152,34 @@ wheels = [
|
||||
name = "nvidia-nvjitlink-cu12"
|
||||
version = "12.8.93"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
resolution-markers = [
|
||||
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
]
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/f6/74/86a07f1d0f42998ca31312f998bd3b9a7eff7f52378f4f270c8679c77fb9/nvidia_nvjitlink_cu12-12.8.93-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl", hash = "sha256:81ff63371a7ebd6e6451970684f916be2eab07321b73c9d244dc2b4da7f73b88", size = 39254836, upload-time = "2025-03-07T01:49:55.661Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "nvidia-nvjitlink-cu12"
|
||||
version = "12.9.86"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
resolution-markers = [
|
||||
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
]
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/97/bc/2dcba8e70cf3115b400fef54f213bcd6715a3195eba000f8330f11e40c45/nvidia_nvjitlink_cu12-12.9.86-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:994a05ef08ef4b0b299829cde613a424382aff7efb08a7172c1fa616cc3af2ca", size = 39514880, upload-time = "2025-06-05T20:10:04.89Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "nvidia-nvshmem-cu12"
|
||||
version = "3.3.20"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/92/9d/3dd98852568fb845ec1f7902c90a22b240fe1cbabda411ccedf2fd737b7b/nvidia_nvshmem_cu12-3.3.20-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:0b0b960da3842212758e4fa4696b94f129090b30e5122fea3c5345916545cff0", size = 124484616, upload-time = "2025-08-04T20:24:59.172Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/3b/6c/99acb2f9eb85c29fc6f3a7ac4dccfd992e22666dd08a642b303311326a97/nvidia_nvshmem_cu12-3.3.20-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:d00f26d3f9b2e3c3065be895e3059d6479ea5c638a3f38c9fec49b1b9dd7c1e5", size = 124657145, upload-time = "2025-08-04T20:25:19.995Z" },
|
||||
]
|
||||
|
||||
@@ -1952,10 +2187,28 @@ wheels = [
|
||||
name = "nvidia-nvtx-cu12"
|
||||
version = "12.8.90"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
resolution-markers = [
|
||||
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
]
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/a2/eb/86626c1bbc2edb86323022371c39aa48df6fd8b0a1647bc274577f72e90b/nvidia_nvtx_cu12-12.8.90-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:5b17e2001cc0d751a5bc2c6ec6d26ad95913324a4adb86788c944f8ce9ba441f", size = 89954, upload-time = "2025-03-07T01:42:44.131Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "nvidia-nvtx-cu12"
|
||||
version = "12.9.79"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
resolution-markers = [
|
||||
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
]
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/c4/e4/82155e4aaedb41621087ba219c95e99c5e417f37a7649b4fb6ec32dcb14d/nvidia_nvtx_cu12-12.9.79-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:d1f258e752294acdb4f61c3d31fee87bd0f60e459f1e2f624376369b524cd15d", size = 86120, upload-time = "2025-06-05T20:02:51.838Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "openai"
|
||||
version = "2.6.1"
|
||||
@@ -2072,7 +2325,8 @@ dependencies = [
|
||||
{ name = "pydantic" },
|
||||
{ name = "referencing" },
|
||||
{ name = "requests" },
|
||||
{ name = "torch" },
|
||||
{ name = "torch", version = "2.9.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine != 'aarch64'" },
|
||||
{ name = "torch", version = "2.9.1+cu129", source = { registry = "https://download.pytorch.org/whl/cu129" }, marker = "platform_machine == 'aarch64'" },
|
||||
{ name = "tqdm" },
|
||||
{ name = "typing-extensions" },
|
||||
]
|
||||
@@ -2893,7 +3147,8 @@ source = { registry = "https://pypi.org/simple" }
|
||||
dependencies = [
|
||||
{ name = "apache-tvm-ffi" },
|
||||
{ name = "nvidia-cutlass-dsl" },
|
||||
{ name = "torch" },
|
||||
{ name = "torch", version = "2.9.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine != 'aarch64'" },
|
||||
{ name = "torch", version = "2.9.1+cu129", source = { registry = "https://download.pytorch.org/whl/cu129" }, marker = "platform_machine == 'aarch64'" },
|
||||
{ name = "torch-c-dlpack-ext" },
|
||||
]
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/73/34/bcc87d1ee53cf245bf58ea563b276b9bd86a405bda5a42e7bd1386db9941/quack_kernels-0.3.11.tar.gz", hash = "sha256:d589417476030fb62e70730c4bd0732339a04b8bb91fd49bf4cc70e20a27170b", size = 246675, upload-time = "2026-04-20T01:08:12.269Z" }
|
||||
@@ -3315,8 +3570,7 @@ wheels = [
|
||||
|
||||
[[package]]
|
||||
name = "sglang"
|
||||
version = "0.5.10"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
source = { editable = "third_party/sglang/python" }
|
||||
dependencies = [
|
||||
{ name = "aiohttp" },
|
||||
{ name = "anthropic" },
|
||||
@@ -3369,7 +3623,8 @@ dependencies = [
|
||||
{ name = "soundfile" },
|
||||
{ name = "tiktoken" },
|
||||
{ name = "timm" },
|
||||
{ name = "torch" },
|
||||
{ name = "torch", version = "2.9.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine != 'aarch64'" },
|
||||
{ name = "torch", version = "2.9.1+cu129", source = { registry = "https://download.pytorch.org/whl/cu129" }, marker = "platform_machine == 'aarch64'" },
|
||||
{ name = "torch-memory-saver" },
|
||||
{ name = "torchao" },
|
||||
{ name = "torchaudio" },
|
||||
@@ -3382,10 +3637,118 @@ dependencies = [
|
||||
{ name = "watchfiles" },
|
||||
{ name = "xgrammar" },
|
||||
]
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/c8/4e/bd00d332098337ae13fa783a13258935d568dd5b7e1fd9df205184145224/sglang-0.5.10.tar.gz", hash = "sha256:db78367f41a1f385f8624a10e9506b671e788f9943978df6a37a486867c1edc7", size = 4700833, upload-time = "2026-04-05T23:57:27.556Z" }
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/1f/ee/f7a946162ed538f47a1c5542f93410e5bf9a0c4ca6021d4000e6f9b87f7d/sglang-0.5.10-py3-none-any.whl", hash = "sha256:ac8855a5d57dac8831fee526bca5212f1ae451f378e2ab08b3baecbc4deb4076", size = 6064398, upload-time = "2026-04-05T23:57:25.28Z" },
|
||||
|
||||
[package.metadata]
|
||||
requires-dist = [
|
||||
{ name = "accelerate", marker = "extra == 'test'" },
|
||||
{ name = "addict", marker = "extra == 'diffusion'", specifier = "==2.4.0" },
|
||||
{ name = "addict", marker = "extra == 'test'" },
|
||||
{ name = "aiohttp" },
|
||||
{ name = "anthropic", specifier = ">=0.20.0" },
|
||||
{ name = "apache-tvm-ffi", specifier = ">=0.1.5,<0.2" },
|
||||
{ name = "av", marker = "(platform_machine == 'aarch64' and sys_platform == 'linux') or (platform_machine == 'arm64' and sys_platform == 'linux') or (platform_machine == 'armv7l' and sys_platform == 'linux')" },
|
||||
{ name = "av", marker = "extra == 'diffusion'", specifier = "==16.1.0" },
|
||||
{ name = "bitsandbytes", marker = "extra == 'test'" },
|
||||
{ name = "blobfile", specifier = "==3.0.0" },
|
||||
{ name = "build" },
|
||||
{ name = "cache-dit", marker = "extra == 'diffusion'", specifier = "==1.3.0" },
|
||||
{ name = "checkpoint-engine", marker = "extra == 'checkpoint-engine'", specifier = "==0.1.2" },
|
||||
{ name = "cloudpickle", marker = "extra == 'diffusion'", specifier = "==3.1.2" },
|
||||
{ name = "compressed-tensors" },
|
||||
{ name = "cuda-python", specifier = "==12.9" },
|
||||
{ name = "datasets" },
|
||||
{ name = "decord2", marker = "(platform_machine == 'aarch64' and sys_platform == 'linux') or (platform_machine == 'arm64' and sys_platform == 'linux') or (platform_machine == 'armv7l' and sys_platform == 'linux')" },
|
||||
{ name = "diff-cover", marker = "extra == 'test'" },
|
||||
{ name = "diffusers", marker = "extra == 'diffusion'", specifier = "==0.37.0" },
|
||||
{ name = "einops" },
|
||||
{ name = "expecttest", marker = "extra == 'test'" },
|
||||
{ name = "fastapi" },
|
||||
{ name = "flash-attn-4", specifier = ">=4.0.0b4" },
|
||||
{ name = "flashinfer-cubin", specifier = "==0.6.7.post2" },
|
||||
{ name = "flashinfer-python", specifier = "==0.6.7.post2" },
|
||||
{ name = "gguf" },
|
||||
{ name = "imageio", marker = "extra == 'diffusion'", specifier = "==2.36.0" },
|
||||
{ name = "imageio-ffmpeg", marker = "extra == 'diffusion'", specifier = "==0.5.1" },
|
||||
{ name = "interegular" },
|
||||
{ name = "ipython" },
|
||||
{ name = "jsonlines", marker = "extra == 'test'" },
|
||||
{ name = "llguidance", specifier = ">=0.7.11,<0.8.0" },
|
||||
{ name = "lm-eval", extras = ["api"], marker = "extra == 'test'", specifier = ">=0.4.9.2" },
|
||||
{ name = "matplotlib", marker = "extra == 'test'" },
|
||||
{ name = "mistral-common", specifier = ">=1.9.0" },
|
||||
{ name = "modelscope" },
|
||||
{ name = "moviepy", marker = "extra == 'diffusion'", specifier = ">=2.0.0" },
|
||||
{ name = "msgspec" },
|
||||
{ name = "ninja" },
|
||||
{ name = "numpy" },
|
||||
{ name = "nvidia-cutlass-dsl", specifier = ">=4.4.1" },
|
||||
{ name = "nvidia-ml-py" },
|
||||
{ name = "openai", specifier = "==2.6.1" },
|
||||
{ name = "openai-harmony", specifier = "==0.0.4" },
|
||||
{ name = "opencv-python-headless", marker = "extra == 'diffusion'", specifier = "==4.10.0.84" },
|
||||
{ name = "opentelemetry-api", marker = "extra == 'tracing'" },
|
||||
{ name = "opentelemetry-exporter-otlp", marker = "extra == 'tracing'" },
|
||||
{ name = "opentelemetry-exporter-otlp-proto-grpc", marker = "extra == 'tracing'" },
|
||||
{ name = "opentelemetry-sdk", marker = "extra == 'tracing'" },
|
||||
{ name = "orjson" },
|
||||
{ name = "outlines", specifier = "==0.1.11" },
|
||||
{ name = "packaging" },
|
||||
{ name = "pandas", marker = "extra == 'test'" },
|
||||
{ name = "parameterized", marker = "extra == 'test'" },
|
||||
{ name = "partial-json-parser" },
|
||||
{ name = "peft", marker = "extra == 'test'", specifier = ">=0.18.0" },
|
||||
{ name = "pillow" },
|
||||
{ name = "polars", marker = "extra == 'test'" },
|
||||
{ name = "prometheus-client", specifier = ">=0.20.0" },
|
||||
{ name = "psutil" },
|
||||
{ name = "py-spy" },
|
||||
{ name = "pybase64" },
|
||||
{ name = "pydantic" },
|
||||
{ name = "pytest", marker = "extra == 'test'" },
|
||||
{ name = "pytest-cov", marker = "extra == 'test'" },
|
||||
{ name = "python-multipart" },
|
||||
{ name = "pyyaml", marker = "extra == 'diffusion'", specifier = "==6.0.1" },
|
||||
{ name = "pyzmq", specifier = ">=25.1.2" },
|
||||
{ name = "quack-kernels", specifier = ">=0.3.0" },
|
||||
{ name = "ray", extras = ["default"], marker = "extra == 'ray'", specifier = ">=2.54.0" },
|
||||
{ name = "remote-pdb", marker = "extra == 'diffusion'", specifier = "==2.1.0" },
|
||||
{ name = "requests" },
|
||||
{ name = "runai-model-streamer", marker = "extra == 'diffusion'", specifier = ">=0.15.7" },
|
||||
{ name = "runai-model-streamer", extras = ["azure", "gcs", "s3"], marker = "extra == 'runai'", specifier = ">=0.15.7" },
|
||||
{ name = "scikit-image", marker = "extra == 'diffusion'", specifier = "==0.25.2" },
|
||||
{ name = "scipy" },
|
||||
{ name = "sentence-transformers", marker = "extra == 'test'" },
|
||||
{ name = "sentencepiece" },
|
||||
{ name = "setproctitle" },
|
||||
{ name = "sglang", extras = ["diffusion"], marker = "extra == 'all'" },
|
||||
{ name = "sglang", extras = ["test"], marker = "extra == 'dev'" },
|
||||
{ name = "sglang", extras = ["tracing"], marker = "extra == 'all'" },
|
||||
{ name = "sglang-kernel", specifier = "==0.4.1" },
|
||||
{ name = "smg-grpc-servicer", specifier = ">=0.5.0" },
|
||||
{ name = "soundfile", specifier = "==0.13.1" },
|
||||
{ name = "st-attn", marker = "platform_machine != 'aarch64' and platform_machine != 'arm64' and extra == 'diffusion'", specifier = "==0.0.7" },
|
||||
{ name = "tabulate", marker = "extra == 'test'" },
|
||||
{ name = "tiktoken" },
|
||||
{ name = "timm", specifier = "==1.0.16" },
|
||||
{ name = "torch", marker = "platform_machine != 'aarch64' and platform_machine != 'x86_64'", specifier = "==2.9.1" },
|
||||
{ name = "torch", marker = "platform_machine == 'aarch64'", specifier = "==2.9.1", index = "https://download.pytorch.org/whl/cu129" },
|
||||
{ name = "torch", marker = "platform_machine == 'x86_64'", specifier = "==2.9.1", index = "https://pypi.org/simple" },
|
||||
{ name = "torch-memory-saver", specifier = "==0.0.9" },
|
||||
{ name = "torchao", specifier = "==0.9.0" },
|
||||
{ name = "torchaudio", specifier = "==2.9.1" },
|
||||
{ name = "torchcodec", marker = "(platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l') or sys_platform != 'linux'", specifier = "==0.9.1" },
|
||||
{ name = "torchvision" },
|
||||
{ name = "tqdm" },
|
||||
{ name = "transformers", specifier = "==5.3.0" },
|
||||
{ name = "trimesh", marker = "extra == 'diffusion'", specifier = ">=4.0.0" },
|
||||
{ name = "uvicorn" },
|
||||
{ name = "uvloop" },
|
||||
{ name = "vsa", marker = "platform_machine != 'aarch64' and platform_machine != 'arm64' and extra == 'diffusion'", specifier = "==0.0.4" },
|
||||
{ name = "watchfiles" },
|
||||
{ name = "xatlas", marker = "extra == 'diffusion'" },
|
||||
{ name = "xgrammar", specifier = "==0.1.32" },
|
||||
]
|
||||
provides-extras = ["checkpoint-engine", "runai", "diffusion", "ray", "tracing", "test", "dev", "all"]
|
||||
|
||||
[[package]]
|
||||
name = "sglang-kernel"
|
||||
@@ -3574,7 +3937,8 @@ dependencies = [
|
||||
{ name = "huggingface-hub" },
|
||||
{ name = "pyyaml" },
|
||||
{ name = "safetensors" },
|
||||
{ name = "torch" },
|
||||
{ name = "torch", version = "2.9.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine != 'aarch64'" },
|
||||
{ name = "torch", version = "2.9.1+cu129", source = { registry = "https://download.pytorch.org/whl/cu129" }, marker = "platform_machine == 'aarch64'" },
|
||||
{ name = "torchvision" },
|
||||
]
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/94/f6/4d7a8c261341fa6ad281920618739f2a650f41043afcedb570f24e99a776/timm-1.0.16.tar.gz", hash = "sha256:a3b8130dd2cb8dc3b9f5e3d09ab6d677a6315a8695fd5264eb6d52a4a46c1044", size = 2339999, upload-time = "2025-06-26T17:09:44.208Z" }
|
||||
@@ -3612,30 +3976,50 @@ wheels = [
|
||||
name = "torch"
|
||||
version = "2.9.1"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
resolution-markers = [
|
||||
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform == 'win32'",
|
||||
"python_full_version >= '3.14' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'win32'",
|
||||
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform == 'emscripten'",
|
||||
"python_full_version >= '3.14' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'emscripten'",
|
||||
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version >= '3.14' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform == 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform == 'emscripten'",
|
||||
"python_full_version == '3.13.*' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'emscripten'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform == 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform == 'emscripten'",
|
||||
"python_full_version < '3.13' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'emscripten'",
|
||||
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
]
|
||||
dependencies = [
|
||||
{ name = "filelock" },
|
||||
{ name = "fsspec" },
|
||||
{ name = "jinja2" },
|
||||
{ name = "networkx" },
|
||||
{ name = "nvidia-cublas-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-cuda-cupti-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-cuda-nvrtc-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-cuda-runtime-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
|
||||
{ name = "filelock", marker = "platform_machine != 'aarch64'" },
|
||||
{ name = "fsspec", marker = "platform_machine != 'aarch64'" },
|
||||
{ name = "jinja2", marker = "platform_machine != 'aarch64'" },
|
||||
{ name = "networkx", marker = "platform_machine != 'aarch64'" },
|
||||
{ name = "nvidia-cublas-cu12", version = "12.8.4.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-cuda-cupti-cu12", version = "12.8.90", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-cuda-nvrtc-cu12", version = "12.8.93", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-cuda-runtime-cu12", version = "12.8.90", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-cudnn-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-cufft-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-cufile-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-curand-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-cusolver-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-cusparse-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-cufft-cu12", version = "11.3.3.83", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-cufile-cu12", version = "1.13.1.3", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-curand-cu12", version = "10.3.9.90", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-cusolver-cu12", version = "11.7.3.90", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-cusparse-cu12", version = "12.5.8.93", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-cusparselt-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-nccl-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-nvjitlink-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-nvjitlink-cu12", version = "12.8.93", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-nvshmem-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-nvtx-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
|
||||
{ name = "setuptools" },
|
||||
{ name = "sympy" },
|
||||
{ name = "nvidia-nvtx-cu12", version = "12.8.90", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
|
||||
{ name = "setuptools", marker = "platform_machine != 'aarch64'" },
|
||||
{ name = "sympy", marker = "platform_machine != 'aarch64'" },
|
||||
{ name = "triton", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
|
||||
{ name = "typing-extensions" },
|
||||
{ name = "typing-extensions", marker = "platform_machine != 'aarch64'" },
|
||||
]
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/0f/27/07c645c7673e73e53ded71705045d6cb5bae94c4b021b03aa8d03eee90ab/torch-2.9.1-cp312-cp312-manylinux_2_28_aarch64.whl", hash = "sha256:da5f6f4d7f4940a173e5572791af238cb0b9e21b1aab592bd8b26da4c99f1cd6", size = 104126592, upload-time = "2025-11-12T15:20:41.62Z" },
|
||||
@@ -3660,12 +4044,61 @@ wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/db/2b/f7818f6ec88758dfd21da46b6cd46af9d1b3433e53ddbb19ad1e0da17f9b/torch-2.9.1-cp314-cp314t-win_amd64.whl", hash = "sha256:c88d3299ddeb2b35dcc31753305612db485ab6f1823e37fb29451c8b2732b87e", size = 111163659, upload-time = "2025-11-12T15:23:20.009Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "torch"
|
||||
version = "2.9.1+cu129"
|
||||
source = { registry = "https://download.pytorch.org/whl/cu129" }
|
||||
resolution-markers = [
|
||||
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform == 'win32'",
|
||||
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform == 'emscripten'",
|
||||
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform == 'win32'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform == 'emscripten'",
|
||||
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform == 'win32'",
|
||||
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform == 'emscripten'",
|
||||
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
|
||||
]
|
||||
dependencies = [
|
||||
{ name = "filelock", marker = "platform_machine == 'aarch64'" },
|
||||
{ name = "fsspec", marker = "platform_machine == 'aarch64'" },
|
||||
{ name = "jinja2", marker = "platform_machine == 'aarch64'" },
|
||||
{ name = "networkx", marker = "platform_machine == 'aarch64'" },
|
||||
{ name = "nvidia-cublas-cu12", version = "12.9.1.4", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-cuda-cupti-cu12", version = "12.9.79", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-cuda-nvrtc-cu12", version = "12.9.86", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-cuda-runtime-cu12", version = "12.9.79", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-cudnn-cu12", marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-cufft-cu12", version = "11.4.1.4", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-cufile-cu12", version = "1.14.1.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-curand-cu12", version = "10.3.10.19", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-cusolver-cu12", version = "11.7.5.82", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-cusparse-cu12", version = "12.5.10.65", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-cusparselt-cu12", marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-nccl-cu12", marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-nvjitlink-cu12", version = "12.9.86", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-nvshmem-cu12", marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
|
||||
{ name = "nvidia-nvtx-cu12", version = "12.9.79", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
|
||||
{ name = "setuptools", marker = "platform_machine == 'aarch64'" },
|
||||
{ name = "sympy", marker = "platform_machine == 'aarch64'" },
|
||||
{ name = "triton", marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
|
||||
{ name = "typing-extensions", marker = "platform_machine == 'aarch64'" },
|
||||
]
|
||||
wheels = [
|
||||
{ url = "https://download-r2.pytorch.org/whl/cu129/torch-2.9.1%2Bcu129-cp312-cp312-manylinux_2_28_aarch64.whl", hash = "sha256:c501c66fe5b0e2fc70f9d8a18e17a265f92ad1d1009dba03f5938d2f15a9066f", upload-time = "2026-01-26T17:26:29Z" },
|
||||
{ url = "https://download-r2.pytorch.org/whl/cu129/torch-2.9.1%2Bcu129-cp313-cp313-manylinux_2_28_aarch64.whl", hash = "sha256:ab44cf28e6ca2df679f0845fb4b950c81834431218840ca01c0a1583892a0986", upload-time = "2026-01-26T17:26:26Z" },
|
||||
{ url = "https://download-r2.pytorch.org/whl/cu129/torch-2.9.1%2Bcu129-cp313-cp313t-manylinux_2_28_aarch64.whl", hash = "sha256:794482180a4f2d92a960f470fcd47e066dbe2eeb27816880e618d3ce031805f7", upload-time = "2026-01-26T17:26:04Z" },
|
||||
{ url = "https://download-r2.pytorch.org/whl/cu129/torch-2.9.1%2Bcu129-cp314-cp314-manylinux_2_28_aarch64.whl", hash = "sha256:4559e1254e2c8e1a337758626d1cf33ca5a5ded3509fa012070334bf886b686b", upload-time = "2026-01-26T17:25:38Z" },
|
||||
{ url = "https://download-r2.pytorch.org/whl/cu129/torch-2.9.1%2Bcu129-cp314-cp314t-manylinux_2_28_aarch64.whl", hash = "sha256:cbe8955514ace826d3638a5d5dc1faa2f9dda1de4de74941d2e86b1a0859477c", upload-time = "2026-01-26T17:25:36Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "torch-c-dlpack-ext"
|
||||
version = "0.1.5"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
dependencies = [
|
||||
{ name = "torch" },
|
||||
{ name = "torch", version = "2.9.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine != 'aarch64'" },
|
||||
{ name = "torch", version = "2.9.1+cu129", source = { registry = "https://download.pytorch.org/whl/cu129" }, marker = "platform_machine == 'aarch64'" },
|
||||
]
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/37/de/921b6491efce5c389a5ef9bbed3d2d6660005840dae488124173180859ab/torch_c_dlpack_ext-0.1.5.tar.gz", hash = "sha256:d06f0357d575d22a168cc77acb9020fc4bae30968ceb6718a055dcbe92bacabe", size = 12913, upload-time = "2026-01-12T11:25:08.484Z" }
|
||||
wheels = [
|
||||
@@ -3706,7 +4139,8 @@ name = "torchaudio"
|
||||
version = "2.9.1"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
dependencies = [
|
||||
{ name = "torch" },
|
||||
{ name = "torch", version = "2.9.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine != 'aarch64'" },
|
||||
{ name = "torch", version = "2.9.1+cu129", source = { registry = "https://download.pytorch.org/whl/cu129" }, marker = "platform_machine == 'aarch64'" },
|
||||
]
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/f1/83/71cbadd7b66753818b5775f2088bad4f721d581de276996df4968000a626/torchaudio-2.9.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:7581ef170794c599aed55918e00d0acd9e5c9a0f19400c9a9a840955180365c5", size = 808098, upload-time = "2025-11-12T15:26:01.408Z" },
|
||||
@@ -3755,7 +4189,8 @@ dependencies = [
|
||||
{ name = "numpy", version = "2.3.5", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.13'" },
|
||||
{ name = "numpy", version = "2.4.4", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.13'" },
|
||||
{ name = "pillow" },
|
||||
{ name = "torch" },
|
||||
{ name = "torch", version = "2.9.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine != 'aarch64'" },
|
||||
{ name = "torch", version = "2.9.1+cu129", source = { registry = "https://download.pytorch.org/whl/cu129" }, marker = "platform_machine == 'aarch64'" },
|
||||
]
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/f0/af/18e2c6b9538a045f60718a0c5a058908ccb24f88fde8e6f0fc12d5ff7bd3/torchvision-0.24.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:e48bf6a8ec95872eb45763f06499f87bd2fb246b9b96cb00aae260fda2f96193", size = 1891433, upload-time = "2025-11-12T15:25:03.232Z" },
|
||||
@@ -3827,10 +4262,15 @@ name = "triton"
|
||||
version = "3.5.1"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/db/53/2bcc46879910991f09c063eea07627baef2bc62fe725302ba8f46a2c1ae5/triton-3.5.1-cp312-cp312-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:275a045b6ed670dd1bd005c3e6c2d61846c74c66f4512d6f33cc027b11de8fd4", size = 159940689, upload-time = "2025-11-11T17:51:55.938Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/f2/50/9a8358d3ef58162c0a415d173cfb45b67de60176e1024f71fbc4d24c0b6d/triton-3.5.1-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:d2c6b915a03888ab931a9fd3e55ba36785e1fe70cbea0b40c6ef93b20fc85232", size = 170470207, upload-time = "2025-11-11T17:41:00.253Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/f1/ba/805684a992ee32d486b7948d36aed2f5e3c643fc63883bf8bdca1c3f3980/triton-3.5.1-cp313-cp313-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:56765ffe12c554cd560698398b8a268db1f616c120007bfd8829d27139abd24a", size = 159955460, upload-time = "2025-11-11T17:52:01.861Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/27/46/8c3bbb5b0a19313f50edcaa363b599e5a1a5ac9683ead82b9b80fe497c8d/triton-3.5.1-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:f3f4346b6ebbd4fad18773f5ba839114f4826037c9f2f34e0148894cd5dd3dba", size = 170470410, upload-time = "2025-11-11T17:41:06.319Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/84/1e/7df59baef41931e21159371c481c31a517ff4c2517343b62503d0cd2be99/triton-3.5.1-cp313-cp313t-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:02c770856f5e407d24d28ddc66e33cf026e6f4d360dcb8b2fabe6ea1fc758621", size = 160072799, upload-time = "2025-11-11T17:52:07.293Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/37/92/e97fcc6b2c27cdb87ce5ee063d77f8f26f19f06916aa680464c8104ef0f6/triton-3.5.1-cp313-cp313t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:0b4d2c70127fca6a23e247f9348b8adde979d2e7a20391bfbabaac6aebc7e6a8", size = 170579924, upload-time = "2025-11-11T17:41:12.455Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/14/f9/0430e879c1e63a1016cb843261528fd3187c872c3a9539132efc39514753/triton-3.5.1-cp314-cp314-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:f617aa7925f9ea9968ec2e1adaf93e87864ff51549c8f04ce658f29bbdb71e2d", size = 159956163, upload-time = "2025-11-11T17:52:12.999Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/a4/e6/c595c35e5c50c4bc56a7bac96493dad321e9e29b953b526bbbe20f9911d0/triton-3.5.1-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:d0637b1efb1db599a8e9dc960d53ab6e4637db7d4ab6630a0974705d77b14b60", size = 170480488, upload-time = "2025-11-11T17:41:18.222Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/41/1e/63d367c576c75919e268e4fbc33c1cb33b6dc12bb85e8bfe531c2a8bd5d3/triton-3.5.1-cp314-cp314t-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:8932391d7f93698dfe5bc9bead77c47a24f97329e9f20c10786bb230a9083f56", size = 160073620, upload-time = "2025-11-11T17:52:18.403Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/16/b5/b0d3d8b901b6a04ca38df5e24c27e53afb15b93624d7fd7d658c7cd9352a/triton-3.5.1-cp314-cp314t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:bac7f7d959ad0f48c0e97d6643a1cc0fd5786fe61cb1f83b537c6b2d54776478", size = 170582192, upload-time = "2025-11-11T17:41:23.963Z" },
|
||||
]
|
||||
|
||||
@@ -4029,7 +4469,8 @@ dependencies = [
|
||||
{ name = "numpy", version = "2.3.5", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.13'" },
|
||||
{ name = "numpy", version = "2.4.4", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.13'" },
|
||||
{ name = "pydantic" },
|
||||
{ name = "torch" },
|
||||
{ name = "torch", version = "2.9.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine != 'aarch64'" },
|
||||
{ name = "torch", version = "2.9.1+cu129", source = { registry = "https://download.pytorch.org/whl/cu129" }, marker = "platform_machine == 'aarch64'" },
|
||||
{ name = "transformers" },
|
||||
{ name = "triton", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
|
||||
{ name = "typing-extensions" },
|
||||
|
||||
Reference in New Issue
Block a user